Parallelization using plyr: loading objects and packages into worker nodes

I really love the plyr package. Apart from providing a progress bar and handling a lot of the overhead, a very interesting feature is the ability to run plyr functions in parallel. Essentially, setting .parallel = TRUE runs any plyr function in parallel, under the assumption that a parallel backend has been registered. In my case, I use the doSNOW package to register a backend based on the Simple Network of Workstations (snow) package for parallel computing.

However, the focus of this post is not on how exactly snow works, but on a particular problem I ran into. The problem is most easily explained using the following example (note that I assume a parallel backend has been registered):
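A minimal sketch of the kind of code that triggers the problem (the object name y and the toy computation are illustrative, not the original example). It assumes plyr is installed and a doSNOW backend has already been registered:

```r
library(plyr)

y <- 2  # lives only in the master's global environment

# Each worker evaluates the anonymous function, but 'y' was never
# shipped to the workers, so the parallel run fails with an error
# along the lines of: object 'y' not found
result <- ldply(1:4, function(x) x * y, .parallel = TRUE)
```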

The problem in this case is that the object y is not present in the environment of the worker nodes. Luckily, snow provides a few functions that allow us to load objects, functions and packages into the worker nodes: clusterExport can be used to load objects and functions, and clusterEvalQ can be used to load packages. To streamline setting up and configuring a cluster that can be used by plyr, I wrote a small function, createCluster. It assumes that you have snow and doSNOW installed; installing doSNOW gets you all the required packages. Note that the function also requires an adapted version of clusterExport that allows exports from environments other than .GlobalEnv.
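The original listing is not reproduced here, but a sketch of what createCluster might look like, with the argument names described below, could be (assuming snow and doSNOW are installed):

```r
library(snow)
library(doSNOW)

createCluster <- function(noCores, logfile = "/dev/null",
                          export = NULL, lib = NULL) {
  # Start a SOCK cluster; worker output is redirected to logfile
  cl <- makeCluster(noCores, type = "SOCK", outfile = logfile)
  # Ship objects/functions (given by name) to the workers
  if (!is.null(export)) clusterExport(cl, unlist(export))
  # Load the requested packages on each worker
  for (package in lib) {
    clusterCall(cl, function(pkg) library(pkg, character.only = TRUE),
                package)
  }
  # Make the cluster available to plyr's .parallel = TRUE
  registerDoSNOW(cl)
  return(cl)
}
```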

where noCores is a numeric specifying the number of cores to register, logfile is the location where the output of the workers goes (set to “” to print to screen), export is a list of objects/functions that need to be loaded into the workers, and lib is a list of packages that need to be loaded into the workers. The following example shows the function in action:
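Hypothetical usage, continuing the earlier toy example (the object y, the core count, and the package list are illustrative):

```r
library(plyr)

y <- 2
cl <- createCluster(noCores = 4, export = list("y"), lib = list("plyr"))

# 'y' is now present on the workers, so the parallel run succeeds
result <- ldply(1:4, function(x) x * y, .parallel = TRUE)

stopCluster(cl)  # shut the workers down when done
```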

Update 16 Nov: this function was created and tested under Linux (Fedora 13). It should also work under OS X, Windows and other operating systems that support SOCK clusters, but I have not tested this.

Posted in R stuff
5 Comments » for Parallelization using plyr: loading objects and packages into worker nodes
  1. Scott Chamberlain says:

    Does this work on OSX, or just Windows?

    • Paul Hiemstra says:

      I tested the function on Linux (Fedora 13), so I think it should work for OS X. I have not tried this under Windows, but SOCK clusters should work also under Windows.

  2. Paul Hiemstra says:

    As of version 0.3-8 of snow, the tweaked version of the clusterExport is no longer needed as the maintainer accepted my patch.

  3. Arsenio says:

    Impressive! I tested this with the gdata package and the read.xls function to read and manipulate 400+ xls files with llply, using perl, in R 2.13 x64 on Win 7 64 QuadCore. Workers are set to 7.

        > foo.time.b
           user  system elapsed
           2.34    0.81 2070.59

        > foo.time.a
           user  system elapsed
           0.55    0.04  455.74

    • Paul Hiemstra says:

      Good to see you like it :). For operations that are easy to parallelize (e.g. your problem, where the files to be read are independent, i.e. the outcome of one iteration in llply does not influence another), these kinds of speed gains are quite normal.
