
Data mining with R course in the Netherlands taught by Luis Torgo

In the course of this year, Dr. Luis Torgo will teach a Data Mining with R course together with the DIKW Academy in Nieuwegein, The Netherlands. Dr. Torgo is an Associate Professor at the Department of Computer Science of the University of Porto, and the author of the book Data Mining with R. His interests are in Machine Learning in general, with a particular focus on inductive learning problems.

The course is aimed at professionals interested in business analytics and data mining. Previous programming experience is not mandatory, but does help in getting the most out of the course. More information can be found on the course page of the DIKW Academy.

Posted in R stuff

Vectorisation is your best friend: replacing many elements in a character vector

As with any programming language, R allows you to tackle the same problem in many different ways or styles. These styles differ in the amount of code, readability, and speed. In this post I want to illustrate this by tackling the following problem. We have a data.frame that contains an ID character column:
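The example data is no longer shown in the archived post; it presumably looked something like this (the exact IDs and size are assumptions):

```r
# Hypothetical reconstruction: a data.frame with a character ID column
df <- data.frame(ID = sample(c("A", "B", "C"), 1000, replace = TRUE),
                 stringsAsFactors = FALSE)
head(df$ID)
```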

We want to replace all occurrences of 'A' with 'Text for A', and likewise for 'B' and 'C'. One approach is to use a combination of a for-loop and some if statements, in a style that looks more like C:
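The original snippet is missing from the archive; a sketch of such a C-style solution:

```r
# Imperative style: loop over each element and test it explicitly
df <- data.frame(ID = c("A", "B", "C", "A"), stringsAsFactors = FALSE)
result <- character(nrow(df))
for (i in seq_len(nrow(df))) {
  if (df$ID[i] == "A") {
    result[i] <- "Text for A"
  } else if (df$ID[i] == "B") {
    result[i] <- "Text for B"
  } else if (df$ID[i] == "C") {
    result[i] <- "Text for C"
  }
}
result
```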

This kind of imperative programming style is not typically R-like. The first response of an R-aficionado is to suggest using an apply loop. First we construct a helper function:
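The helper function itself is missing from the archived post; it presumably looked like this (the name translate_id is an assumption):

```r
# Map a single ID to its replacement text using switch
translate_id <- function(id) {
  switch(id,
         A = "Text for A",
         B = "Text for B",
         C = "Text for C")
}

translate_id("B")
```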

which uses switch instead of the set of nested if statements. Next we use sapply to call the helper function on each of the elements in df$ID:
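The sapply call is also missing; a self-contained sketch, again assuming the translate_id name:

```r
df <- data.frame(ID = c("A", "B", "C", "A"), stringsAsFactors = FALSE)
translate_id <- function(id) {
  switch(id, A = "Text for A", B = "Text for B", C = "Text for C")
}

# Apply the helper to each element of the ID column
result <- sapply(df$ID, translate_id, USE.NAMES = FALSE)
result
```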

The advantage here is that we use roughly half the amount of code to express the same functionality, and I find the code more readable (its purpose is clear at a glance). Readability, however, is in the eye of the beholder, and some people used to non-functional programming languages might prefer the more explicit for-loop and if statements.

Of course, R also supports vectorisation, which is of particular interest when performance matters. For a vectorised solution, we first create a lookup vector:
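The lookup vector is no longer shown in the archive; it was presumably a named character vector along these lines:

```r
# Named vector mapping each ID to its replacement text
lookup <- c(A = "Text for A", B = "Text for B", C = "Text for C")
lookup
```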

and subset this vector using df$ID:
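The subsetting snippet is missing from the archive; a self-contained sketch of the trick:

```r
df <- data.frame(ID = c("A", "B", "C", "A"), stringsAsFactors = FALSE)
lookup <- c(A = "Text for A", B = "Text for B", C = "Text for C")

# Subsetting a named vector by a character vector performs the lookup
# for every element at once
result <- unname(lookup[df$ID])
result
```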

I encourage you to spend a little time figuring out what this subsetting trick does, as I think it is quite elegant. The code of this final solution is even shorter, although it does take some careful consideration on the part of the reader to understand what is happening. Careful naming of variables, or encapsulation in a function, can solve this issue.

All three solutions yield the same result:

but how long does each take? To find out, we benchmark the three solutions:
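The benchmark code is missing from the archive; the post probably used a dedicated benchmarking package, but a self-contained sketch with system.time shows the idea (the solution names are mine):

```r
df <- data.frame(ID = sample(c("A", "B", "C"), 1e4, replace = TRUE),
                 stringsAsFactors = FALSE)

for_loop_solution <- function(df) {
  result <- character(nrow(df))
  for (i in seq_len(nrow(df))) {
    if (df$ID[i] == "A") result[i] <- "Text for A"
    else if (df$ID[i] == "B") result[i] <- "Text for B"
    else if (df$ID[i] == "C") result[i] <- "Text for C"
  }
  result
}

sapply_solution <- function(df) {
  translate_id <- function(id) switch(id, A = "Text for A",
                                      B = "Text for B", C = "Text for C")
  sapply(df$ID, translate_id, USE.NAMES = FALSE)
}

vectorised_solution <- function(df) {
  lookup <- c(A = "Text for A", B = "Text for B", C = "Text for C")
  unname(lookup[df$ID])
}

# Time each solution; all three should return the same vector
system.time(r1 <- for_loop_solution(df))
system.time(r2 <- sapply_solution(df))
system.time(r3 <- vectorised_solution(df))
identical(r1, r2) && identical(r2, r3)
```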

The benchmark clearly shows that the performance of the vectorised solution is vastly superior to the other two, in the order of 70-80 times faster. In addition, the apply-based solution is only a factor 1.10 faster than the for-loop based solution. The take-home message: apply-loops are not inherently faster, and vectorisation is your friend!

PS: In this case, making the character vector a factor and simply replacing the levels is probably much faster still than a vectorised substitution. However, the point of the post was to compare different coding styles, and this problem was just a convenient example.

Posted in R stuff

The performance of dplyr blows plyr out of the water

Together with many other packages written by Hadley Wickham, plyr is a package that I use a lot for data processing. The syntax is clean, and it works great for breaking down larger data.frames into smaller summaries. The greatest disadvantage of plyr is its performance. On StackOverflow, the answer is often that you want plyr for the syntax, but that for real performance you need data.table.

Recently, Hadley released the successor to plyr: dplyr. dplyr provides the kind of performance you would expect from data.table, but with a syntax that stays close to that of plyr. The following example illustrates this performance difference:
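The example code is missing from the archived post; a sketch of the kind of comparison described, summarising a grouped data.frame with both packages (the data and column names are my own, and both packages must be installed):

```r
library(plyr)
library(dplyr)

n <- 1e5
df <- data.frame(grp = sample(letters, n, replace = TRUE),
                 value = runif(n))

# plyr: split-apply-combine with ddply
system.time(
  res_plyr <- ddply(df, "grp", plyr::summarise, mean_value = mean(value))
)

# dplyr: the same summary, typically much faster
system.time(
  res_dplyr <- summarise(group_by(df, grp), mean_value = mean(value))
)
```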

In this case, dplyr is about 8x faster. However, some log file processing I did recently was sped up by a factor of 800. dplyr is an exciting new development that promises to be the single most influential new package since ggplot2.

Posted in R stuff

Bubble sorting in R, C++ and Julia: code improvements and the R compiler

In the past few months I have written posts about implementing the bubble sort algorithm in different languages. In the meantime I have received feedback and suggestions regarding improvements to the implementations I made; see the end of the post for the new source code of the algorithms. These suggestions often had a quite profound effect on performance.

One of the best tips was to use the R compiler, i.e. the compiler package, which is now part of the standard R distribution. This works by simply calling cmpfun:
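The snippet is missing here; the pattern is simply to pass a function to compiler::cmpfun, which returns a byte-compiled version. A generic sketch (the toy function is mine, not the original bubble sort):

```r
library(compiler)

# A loop-heavy toy function that benefits from byte-compilation
sum_loop <- function(x) {
  total <- 0
  for (v in x) total <- total + v
  total
}

# cmpfun returns a byte-compiled version with identical behaviour
sum_loop_cmp <- cmpfun(sum_loop)

sum_loop_cmp(1:10)  # 55, same result as sum_loop(1:10)
```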

The following table presents the timings in microseconds of the different algorithms:

I find the following things striking:

  1. The for/while loop based R solution benefits massively from the compiler package, with compilation increasing its speed almost five-fold.
  2. Compiling the recursive R based solution does not yield any improvements.
  3. The C++ solution is obviously much faster than any R based solution, increasing the speed by a factor of 400 to 1900.
  4. The Julia based solution is almost as fast as the C++ solution, which is very impressive for a high-level programming language.
  5. The native R sort is almost 8 times faster than the fastest bubble sorts in C++ and Julia, but sort probably uses a faster algorithm that scales as O(n log n) instead of the O(n^2) of the bubble sort algorithm.

R (recursive implementation). No improvements.

R (for/while loop implementation). This implementation was previously not present.
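The source code itself did not survive in this archive; a sketch of a for/while-loop based bubble sort in R (the function name is an assumption):

```r
bubble_sort_loop <- function(vec) {
  swapped <- TRUE
  # Keep passing over the vector until a full pass makes no swaps
  while (swapped) {
    swapped <- FALSE
    for (i in seq_len(length(vec) - 1)) {
      if (vec[i] > vec[i + 1]) {
        tmp <- vec[i]
        vec[i] <- vec[i + 1]
        vec[i + 1] <- tmp
        swapped <- TRUE
      }
    }
  }
  vec
}

bubble_sort_loop(c(61, 3, 29, 10))  # 3 10 29 61
```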

C++ (linked into R using Rcpp/inline). Precomputing vec.size() outside the loop improved performance by a factor of 2.

Julia. Subtly changing the definition of the for loop (1:(length(vec_in) - 1 - passes) vs [1:(length(vec_in) - 1 - passes)]) improved performance two fold.

Posted in R stuff

Parallel processing with short jobs only increases the run time

Parallel processing has become much more important over the years as multi-core processors have become commonplace. From version 2.14 onwards, parallel processing is part of the standard R installation in the form of the parallel package. This package makes running parallel jobs as easy as creating a function that runs a job, and calling parSapply on a list of inputs to this function.

Of course, parallelisation incurs some overhead: information needs to be distributed over the nodes, and the result from each node needs to be collected and aggregated into the resulting object. This overhead is one of the main reasons why in certain cases parallel processing takes longer than sequential processing, see for example this StackOverflow question.

In this post I explore the influence of the time a single job takes on the total performance of parallel processing compared to sequential processing. To simulate a job, I simply use the R function Sys.sleep. The problem that I solve is simply waiting for a second. By cutting this second up into increasingly small pieces, the size of each job becomes shorter and shorter. By comparing the run-time of calling Sys.sleep sequentially and in parallel, I can investigate the relation between the temporal size of a job and the performance of parallel processing.

The following figure shows the results of my experiment (the R code is listed at the end of the blogpost):

The x-axis shows the run-time of an individual job in msecs, the y-axis shows the speed ratio between parallel and sequential processing (> 1 means parallel is faster), and the color shows the result for 4 and 8 cores. The dots are individual runs comparing parallel and sequential processing (20 repetitions), and the lines show the median value of the 20 repetitions.

The most striking feature is that shorter jobs decrease the effectiveness of parallel processing: below around 0.011 msecs per job, parallel processing becomes slower than sequential processing. At that point, the overhead of parallelisation outweighs the gain. In addition, above 0.011 msecs, parallel processing might be faster, but it is a far cry from the 4-8 fold increase in performance one would naively expect. Finally, for the job sizes in the figure, increasing the number of cores only marginally improves performance.

In conclusion, when individual jobs are short, parallelisation is going to have a small impact on performance, or even decrease performance. Keep this in the back of your mind when trying to run your code in parallel.

Source code needed to perform the experiment:
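The original script is missing from the archive; a minimal self-contained sketch of the experiment (the names and defaults are assumptions, and the real script presumably also repeated runs and plotted the results):

```r
library(parallel)

# Compare sequential vs parallel execution of n_jobs sleep jobs that
# together always wait one second in total
compare_parallel <- function(n_jobs, n_cores = 2) {
  job_time <- 1 / n_jobs

  seq_elapsed <- system.time(
    sapply(rep(job_time, n_jobs), Sys.sleep)
  )["elapsed"]

  cl <- makeCluster(n_cores)
  on.exit(stopCluster(cl))
  par_elapsed <- system.time(
    parSapply(cl, rep(job_time, n_jobs), Sys.sleep)
  )["elapsed"]

  # > 1 means parallel was faster than sequential
  unname(seq_elapsed / par_elapsed)
}

compare_parallel(8)
```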

Posted in R stuff

Julia is lightning fast: bubble sort revisited

I had heard the name of the new technical computing language Julia buzzing around for some time already. During Christmas I had some time on my hands, and implemented the bubble sort algorithm that I have already posted about several times (R, C++). The main argument for Julia is its speed. It claims speeds comparable to those of a compiled language such as C++, but in the form of a high-level programming language such as R or Matlab.

The following Julia code implements the bubble sort algorithm, in the same style as the C++ algorithm I implemented using Rcpp.

Running this from the command line leads to the following timings in seconds:

This is in the same order of magnitude as the C++ implementation, which leads to timings of around 0.15 secs. So, this first example shows that Julia is comparable in speed to C++ for this problem, which is much faster than the same implementation would be in R.

Julia certainly lacks the maturity of a tool such as R or Python, but the speed is impressive and I like the clean-looking syntax. I’m curious to see where Julia will go from here, and I’ll certainly try to find an excuse to use Julia for a real project.

Posted in R stuff

Cloning packages from one Ubuntu install to another

Right now I’m busy transferring an Ubuntu install from a physical to a virtual machine. One of the challenges is getting all the relevant packages installed. The following commands do just that.

  1. Make a list of the packages on the donor system:
  2. Copy the package list from the donor to the virtual system.
  3. Login to the virtual system, and tell dpkg which packages should be installed:
  4. And install the packages:
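The commands themselves are missing from the archived post; the standard dpkg/apt incantations for these steps are presumably along these lines (the file name is an assumption):

```shell
# 1. On the donor system: list all installed packages
dpkg --get-selections > package_list.txt

# 2. Copy package_list.txt to the virtual system (e.g. with scp)

# 3. On the virtual system: tell dpkg which packages should be installed
sudo dpkg --set-selections < package_list.txt

# 4. Install everything that is selected but not yet installed
sudo apt-get dselect-upgrade
```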

Just press Y, and all the missing packages will be installed on the virtual system. This is why I love a package manager! Note that this will probably work on any Debian based Linux distribution.

Posted in General, interesting

Upgrade OS X Leopard to Mountain Lion using a clean install

My wife’s laptop (2009 Alum. MacBook) was still running OS X Leopard, and the lack of performance and support was getting pretty annoying. Finally, I took the effort to install the latest Apple OS on her laptop. When running Leopard, there is one complication: Mountain Lion can only be bought in the App Store, and Leopard has no App Store.

The official Apple way of solving this is to first upgrade to Snow Leopard, and then use the App Store to buy and install Mountain Lion (ML). I did not want to do this for two reasons: I wanted a clean install, and I did not want to first upgrade to Snow Leopard. The following steps describe what I did to do a single step, clean install of Mountain Lion. Do note that this howto requires another Apple computer with Lion/Mountain Lion already installed, I will refer to this as the other machine.

  1. Back up your data before following these steps, as you will wipe the Leopard device’s hard drive. I would recommend not using a full backup such as Time Machine, but copying your home directory to an external drive. A full backup would also back up and restore all kinds of old Leopard stuff. With a backup of your home directory, you can selectively copy back data such as photos.
  2. Acquire Mountain Lion. If you already have ML, and the device you want to install it on is going to use your Apple ID, you do not need to buy ML again. If, as in my case, the target machine is going to run under a different Apple ID, take the following steps:
    1. Open the App store on the other machine,
    2. log in with the Apple ID for which you want to buy ML
    3. buy ML. Note that ML will start downloading immediately; you can cancel this download right away.
  3. Create a recovery USB drive using the Recovery Disk Assistant that Apple provides. This needs at least 1GB of space.
  4. Insert the USB drive into your Leopard device, and restart. During the boot sequence, press the Option key (alt) to gain access to the boot menu. Select the USB drive as boot device.
  5. Once you’ve booted into the Recovery Assistant, go to the Disk Utility and wipe the boot partition (not the hard drive itself) on your hard drive.
  6. Connect to your WiFi network or insert your network cable.
  7. Go back to the main menu, and select the option to install ML. You will be asked to enter an Apple ID; enter the one relevant for this old Leopard device (my wife’s ID in my case). The Recovery Assistant will now download ML (this can take some time, as it is 4.4 GB) and run through the installation procedure. This took around 3 hours for me. The machine will automatically reboot, and you will enter Mountain Lion heaven…


- Do note that iLife will not be available after installing ML from scratch. You can, however, install iLife from the original installation DVDs that came with your Apple device. This gives you the old iLife version; the newer, updated versions need to be purchased in the App Store if you follow this guide.

Posted in MacBook

Much more efficient bubble sort in R using the Rcpp and inline packages

Recently I wrote a blogpost showing the implementation of a simple bubble sort algorithm in pure R code. The downside of that implementation was that it was awfully slow. And by slow, I mean really slow, as in “a 100 element vector takes 7 seconds to sort”-slow. One of the major opportunities for a speed-up is to use a compiled language. I chose C++, as this is really easy to integrate into R using the Rcpp package. In addition to the Rcpp package, I use the inline package, which allows one to mix C++ code and R code in a seamless fashion. The following code creates an R function bubble_sort_cpp:

Quite amazing how easy it is to integrate R code and C++ code. inline compiles and links the C++ code on-the-fly, creating an R function that delivers the functionality. Of course, the most important question now is how fast this is. I use the microbenchmark package to compare the bubble sort implemented in pure R (here), the bubble sort implemented in C++ (see above), and the standard R sorting algorithm:

These results speak for themselves: the C++ version is more than 1300 times faster when looking at the median speed, and even faster than the built-in sort function. These differences will only get more pronounced as the size of the vector grows.

Posted in R stuff

Bubble sort implemented in pure R

Please note that this is programming I did purely for the learning experience. The pure R bubble sort implemented in this post is veeeeery slow, for two reasons:

  1. Interpreted code with lots of iteration is very slow.
  2. Bubble sort is one of the slowest sorting algorithms (O(n^2)).

The bubble sort algorithm works by iterating over the unsorted vector and comparing pairs of numbers. Let’s say the first pair is c(61, 3); here the numbers need to be swapped, as 3 should come earlier in the sorted vector. The following function returns TRUE if the numbers should be swapped, and FALSE otherwise:
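The function is no longer shown in the archive; a sketch (the name larger_first is an assumption):

```r
# TRUE when the pair is out of order and should be swapped
larger_first <- function(pair) {
  pair[1] > pair[2]
}

larger_first(c(61, 3))  # TRUE
```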

This function is used by the following function:
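The swap_if_larger function is missing from the archive; presumably something like (larger_first is my name for the helper above-described swap test):

```r
larger_first <- function(pair) pair[1] > pair[2]

# Return the pair in sorted order, swapping only when needed
swap_if_larger <- function(pair) {
  if (larger_first(pair)) {
    rev(pair)
  } else {
    pair
  }
}

swap_if_larger(c(61, 3))  # 3 61
```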

which returns the swapped version of the pair if appropriate, or the original pair if the order is OK. For each consecutive pair (element1-element2, element2-element3, etc.) swap_if_larger is called:
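The pass function itself is missing; a self-contained sketch (the name swap_pass is an assumption):

```r
swap_if_larger <- function(pair) if (pair[1] > pair[2]) rev(pair) else pair

# One pass: apply swap_if_larger to each consecutive pair in turn
swap_pass <- function(vec) {
  for (i in seq_len(length(vec) - 1)) {
    vec[i:(i + 1)] <- swap_if_larger(vec[i:(i + 1)])
  }
  vec
}

swap_pass(c(61, 3, 29, 10))  # 3 29 10 61: not yet sorted, but 61 bubbled up
```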

One pass of this function performs a comparison on all pairs, swapping if necessary. To fully sort the vector, we need to perform multiple passes until no swaps are needed anymore. I chose to implement this using recursion:
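The recursive implementation is missing from the archive; a self-contained sketch matching the description in the post (swap_pass is my name for the single-pass function):

```r
swap_if_larger <- function(pair) if (pair[1] > pair[2]) rev(pair) else pair

swap_pass <- function(vec) {
  for (i in seq_len(length(vec) - 1)) {
    vec[i:(i + 1)] <- swap_if_larger(vec[i:(i + 1)])
  }
  vec
}

# Recurse until a full pass changes nothing, i.e. the vector is sorted
bubble_sort <- function(vec) {
  new_vec <- swap_pass(vec)
  if (identical(vec, new_vec)) {
    new_vec
  } else {
    bubble_sort(new_vec)
  }
}

bubble_sort(c(61, 3, 29, 10))  # 3 10 29 61
```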

The function starts by performing a swapping pass over the vector. If the new vector is equal to the old vector, no swaps were needed, i.e. the vector is already sorted, and the function returns the vector. Alternatively, if the vectors differ, the vector is not yet fully sorted, and we need to perform more passes. This is accomplished by recursively calling bubble_sort on the new vector. An example of the function in action:

The full sorting process is nicely illustrated by the following animated gif (linked from Wikipedia):


This implementation is horribly slow:

Implementing the relatively slow bubble sort in a compiled language would probably yield a dramatic increase in speed. Maybe a nice first test case for Rcpp.

Posted in R stuff