The performance of dplyr blows plyr out of the water

Together with many other packages written by Hadley Wickham, plyr is a package that I use a lot for data processing. The syntax is clean, and it works great for breaking down larger data.frames into smaller summaries. The greatest disadvantage of plyr is its performance. On StackOverflow, the usual answer is that you want plyr for the syntax, but that for real performance you need data.table.

Hadley recently released the successor to plyr: dplyr. dplyr provides the kind of performance you would expect from data.table, but with a syntax that stays closer to plyr. The following example illustrates this performance difference:
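A minimal sketch of such a comparison is given below. The data set is made up: its size is an assumption, and the column names (`factor1`, `factor2`, `num`) are taken from the `ddply` call quoted in the comments. Functions are called with their namespace prefix so that plyr and dplyr do not mask each other, and `.groups = "drop"` assumes dplyr 1.0.0 or later. Timings will differ per machine, so none are shown.

```r
# Hypothetical test data; size and contents are assumptions for illustration.
set.seed(1)
n <- 1e5
dat <- data.frame(
  factor1 = sample(letters, n, replace = TRUE),
  factor2 = sample(LETTERS, n, replace = TRUE),
  num     = runif(n)
)

# plyr: split by both factors and compute the mean of num per group.
time_plyr <- system.time({
  summary_ddply <- plyr::ddply(dat, c("factor1", "factor2"),
                               plyr::summarise, mn = mean(num))
})

# dplyr: group_by() is timed together with summarise(), because the
# grouping step is part of the work being compared.
time_dplyr <- system.time({
  summary_dplyr <- dplyr::summarise(
    dplyr::group_by(dat, factor1, factor2),
    mn = mean(num),
    .groups = "drop"
  )
})

# Elapsed wall-clock time in seconds for each approach.
print(c(plyr = time_plyr[["elapsed"]], dplyr = time_dplyr[["elapsed"]]))
```

Both calls should produce one row per combination of `factor1` and `factor2` with identical group means, so any difference in elapsed time comes from the implementation rather than from the result.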

In this case, dplyr is about 8x faster. However, some log file processing I did recently was sped up by a factor of 800. dplyr is an exciting new development that promises to be the single most influential new package since ggplot2.

Posted in R stuff
8 Comments » for The performance of dplyr blows plyr out of the water
  1. Dieter Menne says:

    Looks like there was an HTML-formatting problem: the < in your code should be <- or simply =

    • Dieter Menne says:

      Oops, a case where = does not work, <- required

      • Paul Hiemstra says:

        Yes, system.time is one of the few places where <- is needed. I edited the code, thanks for the heads up!

        • Michael Sumner says:

          Just use braces:

          system.time({summary_ddply = ddply(dat, .(factor1, factor2), summarise, mn = mean(num))})

          Then you can put arbitrary blocks of code in there. (But also check out rbenchmark.)

  2. Kent Johnson says:

    You really should time the group_by() as well; on my computer it takes longer than the summarise().

    Or use the new %.% operator included in dplyr:
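
    A minimal sketch of the chained form, under the assumption of a current dplyr: `%.%` was dplyr's original chaining operator and has since been replaced by `%>%`, so the pipe below uses `%>%`. The small data set is made up for illustration, and `.groups = "drop"` assumes dplyr 1.0.0 or later.

    ```r
    library(dplyr)

    # Hypothetical data; column names follow the ddply call quoted above.
    set.seed(1)
    dat <- data.frame(
      factor1 = sample(letters[1:3], 1000, replace = TRUE),
      factor2 = sample(LETTERS[1:2], 1000, replace = TRUE),
      num     = runif(1000)
    )

    # Grouping and summarising are chained, so both are timed together.
    system.time({
      summary_dplyr <- dat %>%
        group_by(factor1, factor2) %>%
        summarise(mn = mean(num), .groups = "drop")
    })
    ```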

    • Paul Hiemstra says:

      Thanks! I updated the code, and the improvement is indeed smaller (8 times instead of 38). However, in a real-life example the improvement is 800 times (previously 1000 times).

  3. Maciej says:

    To be precise, I think you should combine the first and second lines of the dplyr solution and then show what the difference is.

    On my computer it is roughly 4 times faster.

    • Paul Hiemstra says:

      Thanks for your comment. I updated the code; the improvement in speed is indeed much smaller, but still impressive. However, a real-life example of processing log information shows a speedup of 800x.
