A question on StackOverflow really sparked my attention. The aim was to clean up a dataset of inappropriately spaced words. For example:
|
1 |
> word5 <- "hotter the doghou se would be bec ause the co lor was diffe rent" |
My approach was to create what I call a wordpair object. The word pair object for the example sentence looks like:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
> abc1_pairs word1 word2 1 hotter the 2 the doghou 3 doghou se 4 se would 5 would be 6 be bec 7 bec ause 8 ause the 9 the col 10 col or 11 or was 12 was diffe 13 diffe rent |
Then we iterate over the word pairs, and check if they are correct words using the aspell function in R, and recursively keep merging words until no new correct words can be found. The code I created to create the wordpair object, transform a wordpair back to a list of words, and some additional functions can be found at the end of this post.
Applied to the example dataset this would result in:
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
> word5 <- "hotter the doghou se would be bec ause the col or was diffe rent" > abc1 = strsplit(word5, split = " ")[[1]] > abc1_pairs = wordlist2wordpairs(abc1) > abc1_pairs word1 word2 1 hotter the 2 the doghou 3 doghou se 4 se would 5 would be 6 be bec 7 bec ause 8 ause the 9 the col 10 col or 11 or was 12 was diffe 13 diffe rent > abc1_merged_pairs = merge_wordpairs(abc1_pairs) Number of words about to be merged in this pass: 4 > merged_sentence = paste(wordpairs2wordlist(abc1_merged_pairs), collapse = " ") > c(word5, merged_sentence) [1] "hotter the doghou se would be bec ause the col or was diffe rent" [2] "hotter the doghouse would be because the color was different" |
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 |
# A bunch of functions # Data transformation wordlist2wordpairs = function(word_list) { require(plyr) wordpairs = ldply(seq_len(length(word_list) - 1), function(idx) return(c(word_list[idx], word_list[idx+1]))) names(wordpairs) = c("word1", "word2") return(wordpairs) } wordpairs2wordlist = function(wordpairs) { return(c(wordpairs[[1]], wordpairs[[2]][nrow(wordpairs)])) } # Some checking functions # Is the word correct? word_correct = function(word) return(nrow(aspell(factor(word))) == 0) # Merge two words merge_word = function(word1, word2) return(paste(word1, word2, sep = "")) # Merge a specific pair, option to postpone deletion of pair merge_specific_pair = function(wordpairs, idx, delete_pair = TRUE) { # merge pair into word merged_word = do.call("merge_word", wordpairs[idx,]) # assign the pair to the idx above if(!(idx == 1)) wordpairs[idx - 1, "word2"] = merged_word if(!(idx == nrow(wordpairs))) wordpairs[idx + 1, "word1"] = merged_word # assign the pair to the index below (if not last one) if(delete_pair) wordpairs = wordpairs[-idx,] return(wordpairs) } # Recursively delete wordpairs which lead to a correct word merge_wordpairs = function(wordpairs) { require(plyr) merged_pairs = as.character(mlply(wordpairs, merge_word)) correct_words_idxs = which(sapply(merged_pairs, word_correct)) if(length(correct_words_idxs) == 0) { return(wordpairs) } else { message(sprintf("Number of words about to be merged in this pass: %s", length(correct_words_idxs))) for(idx in correct_words_idxs) { wordpairs = merge_specific_pair(wordpairs, idx, delete_pair = FALSE) } return(merge_wordpairs(wordpairs[-correct_words_idxs,])) # recursive call } } |

stunning!..