Parsing complex text files using regular expressions and vectorization

When text data comes in a clean CSV format, read.csv is enough to parse it into a usable form. When it does not, getting the data into a usable form is less straightforward. In this post

Cleaning sentences by recursively merging words using R

A question on Stack Overflow caught my attention. The aim was to clean up a dataset of incorrectly spaced words. For example:

My approach was to create what I call a wordpair object. The wordpair object for the

