
Using mutate from dplyr inside a function: getting around non-standard evaluation

To edit or add columns to a data.frame, you can use mutate from the dplyr package:
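For example, adding a column that holds the sum of mpg and wt (the new column name total is just an example):

```r
library(dplyr)

# mutate finds mpg and wt inside mtcars through non-standard evaluation
mtcars_new <- mutate(mtcars, total = mpg + wt)
head(mtcars_new$total)
```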

Here, dplyr uses non-standard evaluation in finding the contents for mpg and wt, knowing that it needs to look in the context of mtcars. This is nice for interactive use, but not so nice for using mutate inside a function where mpg and wt are inputs to the function.

The goal is to write a function f that takes the columns in mtcars you want to add up as strings, and executes mutate. Note that we also want to be able to set the new column name. A first naive approach might be:
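A sketch of such a naive attempt (a reconstruction of the mistake described below):

```r
library(dplyr)

# Naive attempt: col1 and col2 are column names passed as strings
f <- function(col1, col2, new_col_name) {
  mutate(mtcars, new_col_name = col1 + col2)
}
# f("wt", "mpg", "hahaaa") does not do what we want
```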

The problem is that col1 and col2 are not interpreted; instead, dplyr tries to find columns named col1 and col2 in mtcars. In addition, the name of the new column will be new_col_name, and not the content of new_col_name. To get around non-standard evaluation, you can use the lazyeval package. The following function does what we expect:
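A working version using lazyeval (a reconstruction built from the pieces discussed below; note that mutate_ belongs to the older, standard-evaluation interface of dplyr):

```r
library(dplyr)
library(lazyeval)

f <- function(col1, col2, new_col_name) {
  # Build the expression col1 + col2, e.g. wt + mpg
  mutate_call <- lazyeval::interp(~ a + b, a = as.name(col1), b = as.name(col2))
  # mutate_ is the standard-evaluation version of mutate
  mtcars %>% mutate_(.dots = setNames(list(mutate_call), new_col_name))
}

head(f("wt", "mpg", "hahaaa"))
```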

The important parts here are, given the call to f above:

  • lazyeval::interp(~ a + b, a = as.name(col1), b = as.name(col2)): this creates the expression wt + mpg.
  • mutate_(mutate_call) where mutate_ is the version of mutate that uses standard evaluation (SE).
  • setNames(list(mutate_call), new_col_name)) sets the output name to the content of new_col_name, i.e. hahaaa.
Posted in R stuff

Parsing a large number of datetime strings into POSIXct objects

When trying to parse a large number of datetime strings into POSIXct objects, it struck me that strftime and as.POSIXct were actually quite slow. The parsing functions from lubridate were a lot faster. The following benchmark shows this quite nicely.

We have a character vector xi, which contains about 2.3 million elements. Parsing it with as.POSIXct takes about 105 seconds on my MacBook Pro:
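A sketch of the base R side of the benchmark (the datetime format and the example data are assumptions, as the original data is not shown):

```r
# Roughly 2.3 million datetime strings
xi <- rep("2013-01-15 13:45:01", 2.3e6)

# Parse with base R
system.time(as.POSIXct(xi, format = "%Y-%m-%d %H:%M:%S", tz = "UTC"))
```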

If we switch to ymd_hms from lubridate, we get a very large performance increase:
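The lubridate equivalent might look like this:

```r
library(lubridate)

# xi is the character vector of datetime strings from above
system.time(ymd_hms(xi))
```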

ymd_hms is about 120 times faster.

Posted in R stuff

Tutorials from a course I taught now freely available: including ggplot2, dplyr and shiny

I was asked to write an R course for a group of innovative companies in the North of the Netherlands. The group of 12 people was a mix of engineers and programmers, and the course aimed at giving them a running start at working with R. The tutorials I developed are now freely available for personal use on the website of the company I work for, S&T in Delft.

The content of the course covers many topics such as data types in R, ggplot2 and dplyr. If you find the material interesting, or have remarks, please drop me a message.

Posted in R stuff

Data mining with R course in the Netherlands taught by Luis Torgo

This year, Dr. Luis Torgo will teach a Data Mining with R course together with the DIKW Academy in Nieuwegein, The Netherlands. Dr. Torgo is an Associate Professor at the Department of Computer Science at the University of Porto. He is also the author of the book Data Mining with R. His interests are in machine learning in general, with a particular focus on inductive learning problems.

The course is aimed at professionals interested in business analytics and data mining. Previous programming experience is not mandatory, but it does help in getting the most out of the course. More information can be found on the course page of the DIKW Academy.

Posted in R stuff

Vectorisation is your best friend: replacing many elements in a character vector

As with any programming language, R allows you to tackle the same problem in many different ways or styles. These styles differ both in the amount of code, readability, and speed. In this post I want to illustrate this by tackling the following problem. We have a data.frame that contains an ID character column:
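A small example data.frame (the real data is not shown; the size and values are assumptions):

```r
# ID column with repeated single-letter codes
df <- data.frame(ID = sample(c("A", "B", "C"), 1e5, replace = TRUE),
                 stringsAsFactors = FALSE)
head(df)
```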

We want to replace all occurrences of A by 'Text for A', and the same for B and C. One approach is to use a combination of a for-loop and some if statements, in a style that looks more like C:
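A sketch of that C-style approach (df is the data.frame with the ID column introduced above):

```r
# Build the replacement vector element by element
new_id <- character(nrow(df))
for (i in seq_len(nrow(df))) {
  if (df$ID[i] == "A") {
    new_id[i] <- "Text for A"
  } else if (df$ID[i] == "B") {
    new_id[i] <- "Text for B"
  } else if (df$ID[i] == "C") {
    new_id[i] <- "Text for C"
  }
}
```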

This kind of imperative programming style is not typically R-like. The first response of an R-aficionado is to suggest using an apply loop. First we construct a helper function:
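A sketch of such a helper (the function name translate_id is an assumption):

```r
# Map a single ID code to its replacement text
translate_id <- function(id) {
  switch(id,
         A = "Text for A",
         B = "Text for B",
         C = "Text for C")
}
```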

which uses switch instead of the set of nested if statements. Next we use sapply to call the helper function on each of the elements in df$ID:
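Assuming the switch-based helper is called translate_id, the call looks like:

```r
# Apply the helper to every element of the ID column
new_id_sapply <- sapply(df$ID, translate_id, USE.NAMES = FALSE)
```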

The advantage here is that we need roughly half the code to express the same functionality, and I find the code more readable (its purpose is apparent at a glance). Readability, however, is in the eye of the beholder, and some people used to non-functional programming languages might prefer the more explicit for-loop and if statements.

Of course, R also supports vectorisation, which is of particular interest if you care about performance. For a vectorised solution, we first create a lookup vector:
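The lookup vector maps each ID code to its replacement text via the element names:

```r
# Named character vector: names are the codes, values the replacements
lookup <- c(A = "Text for A", B = "Text for B", C = "Text for C")
```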

and subset this vector using df$ID:
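Assuming the named lookup vector from just above, the subsetting step is a one-liner:

```r
# Subsetting a named vector by a character vector performs the replacement;
# unname drops the names the subsetting leaves behind
new_id_vec <- unname(lookup[df$ID])
```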

I encourage you to spend a little time figuring out what this subsetting trick does, as I think it is quite a nice trick. The code of this final solution is even shorter, although it does take some careful consideration on the part of the reader to understand what is happening. Careful naming of variables, or encapsulation in a function can solve this issue.

All three solutions yield the same result:

but how long do they take? To find out, we benchmark the three solutions:
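A sketch of the benchmark (the original post's exact timing code is not reproduced here; microbenchmark and the function names are assumptions):

```r
library(microbenchmark)

for_loop_version <- function(df) {
  out <- character(nrow(df))
  for (i in seq_len(nrow(df))) {
    if (df$ID[i] == "A") {
      out[i] <- "Text for A"
    } else if (df$ID[i] == "B") {
      out[i] <- "Text for B"
    } else if (df$ID[i] == "C") {
      out[i] <- "Text for C"
    }
  }
  out
}

sapply_version <- function(df) {
  sapply(df$ID,
         function(id) switch(id, A = "Text for A", B = "Text for B", C = "Text for C"),
         USE.NAMES = FALSE)
}

vectorised_version <- function(df) {
  lookup <- c(A = "Text for A", B = "Text for B", C = "Text for C")
  unname(lookup[df$ID])
}

# All three solutions yield the same result
stopifnot(identical(for_loop_version(df), sapply_version(df)),
          identical(for_loop_version(df), vectorised_version(df)))

microbenchmark(for_loop_version(df), sapply_version(df), vectorised_version(df),
               times = 10)
```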

The benchmark clearly shows that the performance of the vectorised solution is vastly superior to the other two: in the order of 70-80 times faster. In addition, the apply-based solution is only a factor of 1.10 faster than the for-loop based solution. The take-home message: apply-loops are not inherently faster, and vectorisation is your friend!

PS: in this case, making the character vector a factor and simply replacing the levels is probably much faster still than the vectorised substitution. However, the point of this post was to compare different coding styles, and this problem was just a convenient example.

Posted in R stuff

The performance of dplyr blows plyr out of the water

Together with many other packages written by Hadley Wickham, plyr is a package that I use a lot for data processing. The syntax is clean, and it works great for breaking down larger data.frames into smaller summaries. The greatest disadvantage of plyr is its performance. On StackOverflow, the answer is often that you want plyr for the syntax, but that for real performance you need data.table.

Recently, Hadley has released the successor to plyr: dplyr. dplyr provides the kind of performance you would expect from data.table, but with a syntax that leans closer to plyr. The following example illustrates this performance difference:
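A sketch of such a comparison (the original example is not shown; the data and group counts here are assumptions):

```r
library(plyr)
library(dplyr)  # dplyr masks some plyr functions, so we call plyr:: explicitly

# One million rows spread over ten thousand groups
n <- 1e6
d <- data.frame(grp = sample(1e4, n, replace = TRUE), x = runif(n))

# plyr: split-apply-combine with ddply
system.time(plyr::ddply(d, "grp", plyr::summarise, mean_x = mean(x)))

# dplyr: the same summary with group_by + summarise
system.time(d %>% group_by(grp) %>% dplyr::summarise(mean_x = mean(x)))
```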

In this case, dplyr is about 8 times faster. However, some log file processing I did recently was sped up by a factor of 800. dplyr is an exciting new development that promises to be the single most influential new package since ggplot2.

Posted in R stuff

Bubble sorting in R, C++ and Julia: code improvements and the R compiler

In the past few months I have written posts about implementing the bubble sort algorithm in different languages. In the meantime I have received feedback and suggestions for improving the implementations; see the end of this post for the new source code of the algorithms. These suggestions often had a quite profound effect on performance.

One of the best tips was to use the R compiler, i.e. the compiler package, which is now part of the standard R distribution. It works by simply calling cmpfun:
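Assuming the plain R implementation is called bubble_sort (as in the listing at the end of this post), compiling it looks like:

```r
library(compiler)

# Byte-compile the plain R function; the compiled version is a drop-in replacement
bubble_sort_compiled <- cmpfun(bubble_sort)
```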

The following table presents the timings in microseconds of the different algorithms:

I find the following things striking:

  1. Compiling the for/while loop based R solution benefits massively from the compiler package, increasing its speed almost 5-fold.
  2. Compiling the recursive R based solution does not yield any improvements.
  3. The C++ solution is obviously much faster than any R based solution, between roughly 400 and 1900 times faster.
  4. The Julia based solution is almost as fast as the C++ solution, which is very impressive for a high-level programming language.
  5. The native R sort is almost 8 times faster than the fastest bubble sort in C++ and Julia, but sort probably uses a faster algorithm that scales as O(n log n) instead of the O(n^2) of the bubble sort algorithm.

R (recursive implementation). No improvements.
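A sketch of a recursive bubble sort in R (my reconstruction, not necessarily the exact code that was benchmarked):

```r
# One pass: bubble the largest element to the end of the vector
bubble_pass <- function(vec) {
  for (i in seq_len(length(vec) - 1)) {
    if (vec[i] > vec[i + 1]) {
      vec[c(i, i + 1)] <- vec[c(i + 1, i)]  # swap neighbours
    }
  }
  vec
}

bubble_sort_recursive <- function(vec) {
  if (length(vec) <= 1) return(vec)
  vec <- bubble_pass(vec)
  n <- length(vec)
  # The largest element is now last; recurse on the rest
  c(bubble_sort_recursive(vec[-n]), vec[n])
}
```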

R (for/while loop implementation). This implementation was previously not present.
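A sketch of the for/while loop version (again a reconstruction; variable names are assumptions):

```r
bubble_sort <- function(vec) {
  n <- length(vec)
  swapped <- TRUE
  while (swapped) {
    swapped <- FALSE
    for (i in seq_len(n - 1)) {
      if (vec[i] > vec[i + 1]) {
        vec[c(i, i + 1)] <- vec[c(i + 1, i)]  # swap neighbours
        swapped <- TRUE
      }
    }
    n <- n - 1  # after each pass, the tail of the vector is sorted
  }
  vec
}
```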

C++ (linked into R using Rcpp/inline). Precomputing vec.size() outside the loop improved performance by a factor of 2.

Julia. Subtly changing the definition of the for loop (1:(length(vec_in) - 1 - passes) vs [1:(length(vec_in) - 1 - passes)]) improved performance two fold.

Posted in R stuff

Parallel processing with short jobs only increases the run time

Parallel processing has become much more important over the years as multi-core processors have become commonplace. From version 2.14 onwards, parallel processing is part of the standard R installation in the form of the parallel package. This package makes running parallel jobs as easy as creating a function that runs a job, and calling parSapply on a list of inputs to this function.
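For example, a minimal parSapply run might look like this (the number of cores and the toy job are assumptions):

```r
library(parallel)

cl <- makeCluster(4)                             # start 4 worker processes
squares <- parSapply(cl, 1:10, function(x) x^2)  # run the job on each input
stopCluster(cl)                                  # always shut the workers down
```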

Of course, parallelisation incurs some overhead: information needs to be distributed over the nodes, and the result from each node needs to be collected and aggregated into the resulting object. This overhead is one of the main reasons why in certain cases parallel processing takes longer than sequential processing, see for example this StackOverflow question.

In this post I explore the influence of the time a single job takes on the total performance of parallel processing compared to sequential processing. To simulate a job, I simply use the R function Sys.sleep. The problem that I solve is simply waiting for a second. By cutting this second up into increasingly small pieces, the size of each job becomes shorter and shorter. By comparing the run-time of calling Sys.sleep sequentially and in parallel, I can investigate the relation between the temporal size of a job and the performance of parallel processing.

The following figure shows the results of my experiment (the R code is listed at the end of the blogpost):

The x-axis shows the run-time of an individual job in msecs, the y-axis shows the speed ratio between parallel and sequential processing (> 1 means parallel is faster), and the colour shows the results for 4 and 8 cores. The dots are individual runs comparing parallel and sequential processing (20 repetitions), and the lines show the median value over the 20 repetitions.

The most striking feature is that shorter jobs decrease the effectiveness of parallel processing: below roughly 0.011 msecs per job, parallel processing becomes slower than sequential processing, because the overhead of parallelisation outweighs the gain. In addition, above 0.011 msecs parallel processing might be faster, but it is a far cry from the 4-8 fold increase in performance one would naively expect. Finally, for the job sizes in the figure, increasing the number of cores only marginally improves performance.

In conclusion, when individual jobs are short, parallelisation is going to have a small impact on performance, or even decrease performance. Keep this in the back of your mind when trying to run your code in parallel.

Source code needed to perform the experiment:
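The original listing is not reproduced here; a sketch of what the core of the experiment could look like (job counts and core counts are assumptions, and the full experiment repeats this 20 times per setting):

```r
library(parallel)

run_experiment <- function(n_jobs, n_cores) {
  job <- function(i, sleep_time) Sys.sleep(sleep_time)
  sleep_time <- 1 / n_jobs  # cut one second of waiting into n_jobs pieces

  # Sequential run
  seq_time <- system.time(sapply(1:n_jobs, job, sleep_time))["elapsed"]

  # Parallel run
  cl <- makeCluster(n_cores)
  par_time <- system.time(parSapply(cl, 1:n_jobs, job, sleep_time))["elapsed"]
  stopCluster(cl)

  c(job_msec = 1000 * sleep_time,
    speedup  = unname(seq_time / par_time))  # > 1 means parallel is faster
}

run_experiment(n_jobs = 1000, n_cores = 4)
```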

Posted in R stuff

Julia is lightning fast: bubble sort revisited

I had heard the name of the new technical computing language Julia buzzing around for some time already. During Christmas I had some time on my hands, and implemented the bubble sort algorithm that I have already posted about several times (R, C++). The main selling point of Julia is its speed: it claims speeds comparable to a compiled language such as C++, but in the form of a high-level programming language such as R or Matlab.

The following Julia code implements the bubble sort algorithm, in the same style as the C++ algorithm I implemented using Rcpp.

Running this from the command line leads to the following timings in seconds:

This is in the same order of magnitude as the C++ implementation, which has timings of around 0.15 secs. So, this first example shows that Julia is comparable in speed to C++ here, which is much, much faster than the same implementation would be in R.

Julia certainly lacks the maturity of a tool such as R or Python, but the speed is impressive and I like the clean looking syntax. I’m curious to see where Julia will go from here, but I’ll certainly try and find an excuse to try Julia out for a real project.

Posted in R stuff

Cloning packages from one Ubuntu install to another

Right now I’m busy transferring an Ubuntu install from a physical to a virtual machine. One of the issues is getting all the relevant packages installed. The following commands do just that.

  1. Make a list of the packages on the donor system:
  2. Copy the package list from the donor to the virtual system.
  3. Login to the virtual system, and tell dpkg which packages should be installed:
  4. And install the packages:
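The steps above might look like this (the file name and hostname are examples):

```shell
# 1. On the donor system, list all installed packages
dpkg --get-selections > package_list.txt

# 2. Copy the list to the virtual system
scp package_list.txt user@virtual-machine:~

# 3. On the virtual system, mark those packages for installation
sudo dpkg --set-selections < package_list.txt

# 4. Install all packages that are marked but missing
sudo apt-get dselect-upgrade
```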

Just press Y, and all the missing packages will be installed on the virtual system. This is why I love a package manager! Note that this will probably work on any Debian-based Linux distribution.

Posted in General, interesting