Cleaning sentences by recursively merging words using R

A question on Stack Overflow caught my attention. The aim was to clean up a dataset in which words had been split apart by misplaced spaces. For example:

My approach was to create what I call a wordpair object. The wordpair object for the example sentence looks like:

Then we iterate over the word pairs and check whether each merged pair is a correctly spelled word, using the aspell function in R. Words are merged recursively until no new correct words can be found. The code that builds the wordpair object and transforms a wordpair back into a list of words, along with some additional functions, can be found at the end of this post.
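The merging idea can be sketched in a few lines of R. This is a minimal illustration and not the original code from this post: the function names (`make_wordpairs`, `merge_words`, `clean_sentence`), the data frame layout of the word pairs, and the example sentence are all assumptions made for the sketch. To keep it runnable without an installed spell checker, a small toy vocabulary stands in for the aspell-based check; any vectorised predicate that returns TRUE for correct words could be passed in instead.

```r
# Toy vocabulary standing in for a real spell checker (assumption for this sketch).
vocabulary <- c("the", "news", "paper", "newspaper", "is", "here")
is_word <- function(w) tolower(w) %in% vocabulary

# Build a data frame of adjacent word pairs (hypothetical layout).
make_wordpairs <- function(words) {
  n <- length(words)
  if (n < 2) {
    return(data.frame(first = character(0), second = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(first = words[-n], second = words[-1], stringsAsFactors = FALSE)
}

# Merge the first adjacent pair that forms a correct word, then recurse
# on the shortened word list until no pair can be merged any more.
merge_words <- function(words, is_word) {
  pairs <- make_wordpairs(words)
  if (nrow(pairs) == 0) return(words)
  merged <- paste0(pairs$first, pairs$second)
  hit <- which(is_word(merged))
  if (length(hit) == 0) return(words)  # nothing left to merge
  i <- hit[1]
  # Words before the pair, the merged word, words after the pair.
  new_words <- c(head(words, i - 1), merged[i], tail(words, -(i + 1)))
  merge_words(new_words, is_word)
}

# Split a sentence on spaces, merge, and paste the result back together.
clean_sentence <- function(sentence, is_word) {
  words <- strsplit(sentence, " +")[[1]]
  words <- words[nzchar(words)]
  paste(merge_words(words, is_word), collapse = " ")
}

clean_sentence("the news pa per is he re", is_word)
# [1] "the newspaper is here"
```

Note how "pa" and "per" are first merged into "paper", which only then allows "news" and "paper" to merge into "newspaper"; this is why the merging has to be repeated. A limitation of this pairwise sketch is that a word split into three or more fragments is only recovered when some intermediate merge is itself a correct word.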

Applied to the example dataset this would result in:

