You're in all Blogs Section

# Much more efficient bubble sort in R using the Rcpp and inline packages

Recently I wrote a blogpost showing the implementation of a simple bubble sort algorithm in pure R code. The downside of that implementation was that is was awfully slow. And by slow, I mean really slow, as in “a 100 element vector takes 7 seconds to sort”-slow. One of the major opportunities for a speed is to start using a compiled language. I chose to use C++ as this is really easy to integrate into R using the `Rcpp` package. In addition to the `Rcpp` package I use the `inline` package which allows one to use C++ code and R code in a seamless fashion. The following code creates an R function `bubble_sort_cpp`:

Quite amazing how easy it is to integrate R code and C++ code. `inline` compiles and links the C++ code on-the-fly, creating an R function that delivers the functionality. Of course the most important question is now how fast this is. I use the `microbenchmark` package to run the bubble sort I implemented in pure R (here), the bubble sort implemented in C++ (see above), and the standard R sorting algorithm:

These results speak for itself, the C++ version is more than 1300 times faster when looking at the median speed, even faster than the built-in `sort` function. These differences will only get more pronounced when the size of the vector grows.

Tagged with: , , ,
Posted in R stuff

# Bubble sort implemented in pure R

Please note that this is programming I purely did for the learning experience. The pure R bubble sort implemented in this post is veeeeery slow for two reasons:

1. Interpreted code with lots of iteration is very slow.
2. Bubble sort is one of the slowest sorting algorithms (`O(N^2)`)

The bubble sort sorting algorithm works by iterating over the unsorted vector and comparing pairs of numbers. Let’s say the first point pair is `c(61, 3)`, here the numbers need to be swapped as the 3 should be earlier in the sorted vector. The following function returns `TRUE` if the numbers should be swapped, and returns `FALSE` otherwise:

This function is used by the following function:

which returns the swapped version of the pair if appropriate, or the original pair if the order is ok. For each point pair (element1-element2, element2-element3, etc) `swap_if_larger` is called:

One pass of this function performs a comparison on all pairs, swapping if necessary. To fully sort the vector, we need to perform multiple passes until no swaps are needed anymore. I chose to implement this using recursion:

The function starts by perform a swapping pass over the vector. If the new vector is equal to the old vector, no swaps where needed, i.e. the vector is already sorted. The function than returns the vector. Alternatively, if the vectors are different, the vector is not yet fully sorted, and we need to perform more passes. This is accomplished by recursively calling `bubble_sort` again on the vector. An example of the function in action:

The full sorting process is nicely illustrated by the following animated gif (linked from wikipedia):

This implementation is horribly slow:

Probably implementing the relatively slow bubble sort in a compiled language pose a dramatic increase in speed. Maybe a nice first testcase for `Rcpp`

Tagged with: , , ,
Posted in R stuff

# Ctags support for IDL: regular expression definitons

One of the reasons I switched to using Vim as a text editor is the excellent supports for Ctags. In a nutshell, ctags allows you to put your cursor on a function name, press C-p, and jump to the file where that function is defined. Ctags supports a great number of programming languages, but unfortunately, IDL is not one of them. Luckily it is straightforward to add support for new languages. Simply add this:

to your `~/.ctags` file to be able to use Ctags with IDL.

Tagged with: ,
Posted in IDL

# Indexing IDL matrices with vectors: some unexpected behavior for out of range values

IDL is not the most used language around, many of you might not even have heard of it. Mind you that by IDL I mean the Interactive Data Language, and not the Interface Description Language, which many more people know. IDL is stil used a lot in some scientific applications, for example in astronomy and remote sensing. IDL has some quirks which might catch you off guard. I just encountered one of those, which I would like to share. Note that this is valid for version 7 of IDL, which is not the latest one.

Assume we have the following matrix:

We can index this matrix in the following manner, observe that IDL indexing starts at zero:

Providing an invalid index nicely leads to an exception:

IDL is also vectorized, so we can pass vectors of indices to extract multiple values in one go:

But now comes the problem. Let’s pass some out of range indices as vectors:

In stead of throwing an exception, IDL happily returns the last valid value in that dimension, or in this case a pair of dimensions, i.e. 23. So, beware, when passing vectors of indices to an IDL array there is no out of range checking: you are on your own .

Tagged with: , ,
Posted in IDL

# Parsing complex text files using regular expressions and vectorization

When text data is in a nice CSV format, `read.csv` is enough to parse it into a useable format. But if this is not the case, getting the data into a useable format is not so straightforward. In this post I particularly illustrate the use of regular expressions for complex and flexible text processing, and the power of vectorization in R. Vectorization means that we operate on vectors as a whole, not operate on individual elements of a vector.

Take for example a snippet of this data which I downloaded from StackOverflow:

In this post I’ll be stepping through the R code needed to get this text data into a useable format. First, we want to read the data into a character vector:

where each element of the vector is a line in the the text file. Already we see that the first line is some header information which we want to skip:

note the use of negative indexing to remove an element. Next we want to find all the elements in the vector that relate to the date for which the data is representative, we do that by using a regular expression which looks for lines that start with `-`:

and find the amount of actions, upvotes or downvotes etc, that have taken place on each day, i.e. the index of a certain day minus the index of the day before that:

note that we add `rep_date_entries[1]` because `diff` cuts off the first element. Now that we know which elements relate to the date, we can read all other lines into a nice `data.frame`:

The reputation column has a somewhat strange format (`(10)`), we need to get rid of the brackets. A nice way of doing that is using a regular expression, and the `str_extract` function from the `stringr` package:

The regular expression `[0-9]+` matches one or more numbers, and `str_extract` gets the number out of the string. Now we have the data, we need to add a column which says for each row to which date it belongs. We know which lines in the data belong to a date (`rep_date_entries`) and we know how much data entries there are per day (`actions_per_day`). We can now simply repeat each element in `rep_date_entries` as many times as there are actions:

You can see that the date is not yet in a nice format, we need to get rid of all the text, except the date itself. Again, we can use a regular expression, combined with `str_extract` for this:

The regular expression `"[0-9]{4}-[0-9]{2}-[0-9]{2}"` matches any occurence of 4 numbers-2 numbers-2 numbers. Finally, we transform the date from a string to a real date object using `strptime`:

The end result is the following `data.frame`:

All this code together leads to the following function:

I think this nicely illustrates the power of both vectorization, very short and to-the-point for-loop-less syntax, and regular expressions in editing strings.

Tagged with: , ,
Posted in R stuff

# Slate: an XMonad like windowmanager for Mac OS

For work I recently switched to using Mac OS (MacBook Pro 15” retina), until then I had been using Linux. The switch was rather painless, as a lot of the unix goodness (terminal!) is also present on the Mac. One thing that I was missing from my Linux days is XMonad. XMonad is a tiled window manager, which is very lightweight and heavily depends on shortcuts. But then a colleague suggested I take a look at Slate. Slate has a lot of the functionality and configurability of XMonad, and is a very nice addition to my already shortcut centered way of working on my MacBook.

Slate allows you to:

• Attach keystrokes to how you want to manage your windows: e.g. resizing, moving around the screen, shifting focus between apps.
• Create app layouts, e.g. Google Chrome full screen, Mail Client on my 2nd screen maximized to the left half of the screen, Terminal maximized to the right half of the screen, and being able to quickly switch between them. It also allows you to create layouts for 1 and 2 monitor setups, and switches between them automatically.
• Windows Hints, press a button and all the apps are marked by a letter. Pushing that letter shifts focus to that app.
• An alternative app switcher.

Configuring Slate is done using a `.slate` file in your homedirectory, and can be a bit daunting. But if you are used to working with `.bash_profile`‘s and such, you’ll feel right at home. I’ve only just starting to work with Slate, but you can have a look at my config file at the bottom of this post. A nice introductory blog post for Slate has been written by Tristan Hume.

One issue for me right now is that Slate does not work well with Mission Control (multiple workspaces). The features I use in the config file below work fine, but for example the layout’s aren’t able to use multiple workspaces. This is a known issue, and progress in this direction is hampered by the lack of API support from apple.

Tagged with:
Posted in MacBook

# Automatic spatial interpolation with R: the automap package

In case of continuously collected data, e.g. observations from a monitoring network, spatial interpolation of this data cannot be done manually. Instead, the interpolation should be done automatically. To achieve this goal, I developed the `automap` package. `automap` builds on top of the excellent `gstat` package, and provides automatic spatial interpolation, more specifically, automatic kriging. Kriging in its more simple form (Ordinary Kriging, Universal Kriging, aka Kriging with External Drift) is actually nothing more than linear regression with spatially correlated residuals.

`automap` provides the following set of functions (for details I refer to the online manual):

• `autofitVariogram`, automatically fits the variogram model to the data.
• `autoKrige`, automatically fits the variogram model using `autofitVariogram`, and creates an interpolated map.
• `autoKrige.cv`, automatically fits the variogram model using `autofitVariogram`, and performs cross-validation. Uses `krige.cv` under the hood.
• `compare.cv`, allows comparison of the output of `autoKrige.cv` and `krige.cv`. This can be used to evaluate the performance of different interpolation algorithms. `compare.cv` allows comparison using both summary statistics and spatial plots.

In general, the interface of `automap` mimics that of `gstat`. The following code snippets show some examples of creating interpolated maps using `automap`:

You can get `automap` from either CRAN:

PS: `automap` was the first package I wrote, at the beginning of my PhD, so it is not the most beautiful code I ever wrote .

Tagged with: ,
Posted in R stuff

# Implementing a hash table in Fortran 90: Part 2

In my last post I proposed a simple implementation for a hash table in Fortran 90 using a module. I extended the hashtable to make it more usable in a realistic setting. Do note that in some aspects, this implementation of a hash table is not very efficient. This is mainly in the ability to quickly add and retrieve elements form the hash table. My implementation uses a linear search to find the key, this could be done much more efficiently using e.g. a binary search.

• There is now just one subroutine to put stuff into the hash table: `hash_set`. When `hash_set` is called, the hash table tries to find the key, and if it is not found, the key-value pair is pushed onto the hash table. `hash_push` is no longer publicly available.
• Standard behavior now is that the hash table has a size of 50 items, and the table no longer has to be initialized at a certain size. In addition, when the 50 numbers are full, `hash_reallocate` is called to extend the hash table by another 50 items. This makes the hash table much more flexible.
• `hash_print` now only prints the key-value pairs that are actually used.

The following program shows the new hash table module in action:

A nice addition would be add an index to the hash table, allowing one to have several hash tables inside the one hash table module. Any call to a subroutine would then also require specifying which hash table one needs to change.

The source code for the module is given here:

Tagged with: ,
Posted in Fortran

# Implementing a simple hash table in Fortran 90

Implementing a hash table, or dictionary as it is called in Python, in Fortran 90 turned out to be non-trivial. For starters, no standard data type was available (afaik) in Fortran 90. I decided to implement one myself, custom for my situation. What I needed was a mapping from a model name to a `REAL` value, storing key-value pairs. The hash table would enable me to take the model name, and retrieve the associated `REAL` value. The following module implements the hash table. It has a number of subroutines:

• `hash_init(no_items)`, initializes the hash table with the correct number of key-value pairs.
• `hash_push(key, value)`, pushes the key-value pair to the next available place in the hash table, i.e. first index 1, next index 2, etc.
• `hash_get(key, value)`, retrieves the value for a given key. Errors are raised when the hash table is not yet fully filled, or when the key has not been found.
• `hash_set(key, value)`, allows the user to change the value of a given key-value pair. Errors are raised when the hash table is not yet fully filled, or when the key has not been found.
• `hash_print()`, prints the current contents of the hash table to the screen.

The following program illustrates the use of the hash module (tested using `ifort` and `gfortran`):

The code of the module `hash` is given here:

Tagged with: , ,
Posted in Fortran

# Data Mining with R course taught by Luis Torgo

From the 25th of march onwards, Dr. Luis Torgo will teach a Data Mining with R course together with the DIKW Academy in Nieuwegein, The Netherlands. Dr. Torgo is an Associate Professor at the department of Computer Science at the university of Porto. He is also the author of the book Data Mining with R. His interest are in Machine Learning in general, but particularly focused on inductive learning problems.

The course is aimed at professionals interested in business analytics and data mining. Previous programming experience is not mandatory, but does help in getting the most effect from the course. More information can be found on the course page of the DIKW Academy.

Tagged with: , ,
Posted in Announcements, R stuff