Vectorisation is your best friend: replacing many elements in a character vector

As with any programming language, R allows you to tackle the same problem in many different ways or styles. These styles differ in the amount of code, readability, and speed. In this post I want to illustrate this by tackling the following problem. We have a data.frame that contains an ID character column:
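For illustration, a data.frame of this shape could be built as follows. The number of rows and the random seed are assumptions made for this sketch, not values from the post:

```r
# Hypothetical example data: a character ID column holding "A", "B" or "C".
set.seed(1)
df <- data.frame(ID = sample(c("A", "B", "C"), 10, replace = TRUE),
                 stringsAsFactors = FALSE)  # keep ID as character, not factor
str(df)
```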

We want to replace all occurrences of A by 'Text for A', and the same for B and C. One approach is to use a combination of a for-loop and some if statements, in a style that looks more like C:
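A minimal sketch of what such a C-style solution might look like. The small df and the result column name are assumptions for this example:

```r
# Hypothetical example data.
df <- data.frame(ID = c("A", "B", "C", "A", "B"), stringsAsFactors = FALSE)

# Visit every row and pick the replacement with a chain of if statements.
df$result <- NA_character_
for (i in seq_len(nrow(df))) {
  if (df$ID[i] == "A") {
    df$result[i] <- "Text for A"
  } else if (df$ID[i] == "B") {
    df$result[i] <- "Text for B"
  } else if (df$ID[i] == "C") {
    df$result[i] <- "Text for C"
  }
}
print(df)
```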

This kind of imperative programming style is not typically R-like. The first response of an R-aficionado is to suggest using an apply loop. First we construct a helper function:
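One possible helper, sketched here under the assumption that unknown IDs should map to NA. The name translate_id is made up for this example:

```r
# Hypothetical helper: map a single ID to its replacement text.
translate_id <- function(id) {
  switch(id,
         A = "Text for A",
         B = "Text for B",
         C = "Text for C",
         NA_character_)  # unnamed last argument is the fallback (assumption)
}

translate_id("B")
```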

The helper uses switch instead of the set of nested if statements. Next we use sapply to call the helper function on each of the elements in df$ID:
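A sketch of the sapply call, repeating a hypothetical helper so that the example runs on its own. USE.NAMES = FALSE is added here because sapply would otherwise name each result after its input string:

```r
# Hypothetical example data and helper (names are assumptions).
df <- data.frame(ID = c("A", "B", "C", "A", "B"), stringsAsFactors = FALSE)
translate_id <- function(id) {
  switch(id, A = "Text for A", B = "Text for B", C = "Text for C",
         NA_character_)
}

# Call the helper once per element and collect the results in a vector.
df$result <- sapply(df$ID, translate_id, USE.NAMES = FALSE)
print(df)
```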

The advantage here is that we use roughly half the amount of code to express the same functionality, and I find the code more readable (seeing its purpose at a glance). Readability, however, is in the eye of the beholder, and some people used to non-functional programming languages might prefer the more explicit for-loop and if statements.

Of course, R also supports vectorisation, which can be of particular interest if you care about performance. For a vectorised solution, we first create a lookup vector:
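A lookup vector of this kind is simply a named character vector: the names are the IDs and the values are the replacement texts. A minimal sketch:

```r
# Names are the keys (IDs), values are the replacement texts.
translator_vector <- c(A = "Text for A",
                       B = "Text for B",
                       C = "Text for C")
print(translator_vector)
```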

We then subset this vector using df$ID:
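Indexing a named vector with a character vector returns, for each element of the index, the entry with that name, so repeated IDs simply repeat the lookup. A sketch with hypothetical example data, where unname is an addition that drops the ID names from the result:

```r
# Hypothetical example data and the lookup vector.
df <- data.frame(ID = c("A", "B", "C", "A", "B"), stringsAsFactors = FALSE)
translator_vector <- c(A = "Text for A", B = "Text for B", C = "Text for C")

# One vectorised lookup replaces the whole loop.
df$result <- unname(translator_vector[df$ID])
print(df)
```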

I encourage you to spend a little time figuring out what this subsetting trick does, as I think it is quite a nice trick. The code of this final solution is even shorter, although it does take some careful consideration on the part of the reader to understand what is happening. Careful naming of variables, or encapsulation in a function can solve this issue.
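As a sketch of the encapsulation idea, the lookup can be hidden behind a small function with a descriptive name. The function name replace_ids and its arguments are made up for this example:

```r
# Hypothetical wrapper: replace each ID by its text from a named lookup vector.
replace_ids <- function(ids, lookup) {
  unname(lookup[ids])  # IDs missing from the lookup come back as NA
}

translator_vector <- c(A = "Text for A", B = "Text for B", C = "Text for C")
replace_ids(c("C", "A", "A"), translator_vector)
```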

All three solutions yield the same result:

But how long do they take? To find out, we benchmark the three solutions:
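A rough way to time the three styles with base R only. This sketch uses system.time rather than a dedicated benchmarking package, and the function names, vector length and seed are assumptions. Timings vary by machine and R version, so no numbers are shown here:

```r
set.seed(42)
ids <- sample(c("A", "B", "C"), 1e5, replace = TRUE)
lookup <- c(A = "Text for A", B = "Text for B", C = "Text for C")

for_loop_solution <- function(ids) {
  out <- character(length(ids))
  for (i in seq_along(ids)) {
    if (ids[i] == "A") {
      out[i] <- "Text for A"
    } else if (ids[i] == "B") {
      out[i] <- "Text for B"
    } else if (ids[i] == "C") {
      out[i] <- "Text for C"
    }
  }
  out
}

sapply_solution <- function(ids) {
  sapply(ids,
         function(id) switch(id, A = "Text for A", B = "Text for B",
                             C = "Text for C"),
         USE.NAMES = FALSE)
}

vectorised_solution <- function(ids) unname(lookup[ids])

# Check that all three agree before timing them.
stopifnot(identical(for_loop_solution(ids), sapply_solution(ids)),
          identical(sapply_solution(ids), vectorised_solution(ids)))

# Elapsed wall-clock time in seconds for a single run of each solution.
solutions <- list(for_loop = for_loop_solution,
                  sapply = sapply_solution,
                  vectorised = vectorised_solution)
elapsed <- sapply(solutions, function(f) system.time(f(ids))[["elapsed"]])
print(elapsed)
```

A single run per solution is noisy; repeating each call several times and comparing medians gives a steadier picture.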

The benchmark clearly shows that the performance of the vectorised solution is vastly superior to the other two, on the order of 70-80 times faster. In addition, the apply-based solution is only a factor of 1.10 faster than the for-loop-based solution. The take-home message: apply loops are not inherently faster, and vectorisation is your friend!

PS: In this case, making the character vector a factor and simply replacing the levels is probably much faster still than using a vectorised substitution. However, the point of the post was to compare different coding styles, and this problem was just a convenient example.
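A sketch of the factor idea, assuming the same lookup vector as before. Only the three levels are translated, rather than every element:

```r
# Hypothetical example data and lookup vector.
df <- data.frame(ID = c("A", "B", "C", "A", "B"), stringsAsFactors = FALSE)
translator_vector <- c(A = "Text for A", B = "Text for B", C = "Text for C")

id_factor <- factor(df$ID)
# Replace the levels: one lookup per distinct ID, not per row.
levels(id_factor) <- unname(translator_vector[levels(id_factor)])
df$result <- as.character(id_factor)
print(df)
```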

13 Comments » for Vectorisation is your best friend: replacing many elements in a character vector
  1. kaz_yos says:

    How about this?

    It may be more intuitive.

    df$ID2 <- paste0("Text for ", df$ID)

    • Paul Hiemstra says:

      This is a valid option indeed. However, the point of the post is not to find the optimal way of solving the problem, but to compare three styles of programming that are common in R. The example here is admittedly rather contrived. It does make the point though.

  2. dirk says:

    What about applying the ifelse command to df$ID?

    • Paul Hiemstra says:

      ifelse only supports two options, and we already have three: c(A, B, C).

      • Ananda says:

        Directly, yes, ifelse only supports two options, but it can still be used:
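        A minimal sketch of the nested ifelse idea, assuming the same three IDs and a small example df (any ID other than A or B falls through to the last branch):

        ```r
        # Hypothetical example data.
        df <- data.frame(ID = c("A", "B", "C", "A", "B"), stringsAsFactors = FALSE)

        # Each ifelse handles one ID and hands the rest to the next ifelse.
        df$result <- ifelse(df$ID == "A", "Text for A",
                            ifelse(df$ID == "B", "Text for B", "Text for C"))
        print(df)
        ```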

        Don’t know how it compares with respect to speed, though, and I would hate to type it out for a longer set of conditions 🙂

  3. Ananda says:

    Interesting results, but I think you’ve left one obvious candidate out: factor.
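    One way the factor approach might look, sketched with a small example df. The levels and labels arguments are the assumption here:

    ```r
    # Hypothetical example data.
    df <- data.frame(ID = c("A", "B", "C", "A", "B"), stringsAsFactors = FALSE)

    # levels lists the IDs, labels gives the text shown for each level.
    df$result <- factor(df$ID,
                        levels = c("A", "B", "C"),
                        labels = c("Text for A", "Text for B", "Text for C"))
    print(df)
    ```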

    You can even wrap that in as.character if you want a character vector as the result.

    Additionally, sapply is not likely to be the fastest of the *apply family. vapply or lapply + unlist might be faster than sapply. I would be interested in seeing some benchmarks with those updates considered.

  4. Alex Zolot says:

    You may win in performance with

  5. Alex Zolot says:

    sorry, I missed dtt= data.table

  6. Bill says:

    I don’t get it. translator_vector[df$ID] any class?
