I had a bunch of filenames that I wanted to sort into categories. I hoped that finding the two-word phrases that appeared most frequently in those filenames would help me automate the file-sorting process. This post describes the first approach I took to identifying those phrases.
Parsing the Filenames
I hoped that two-word phrases would be a good size. I figured the filenames would sort into many essentially one-word phrases (e.g., "a dog" would be trimmed to "dog," which would not have much significance) and a relatively small number of more significant two-word phrases (e.g., "big dog"). I hoped, that is, that I could easily filter the list for common words (a, an, the, if, of ...) and then count what was left. It seemed, though, that I could still have taken approximately the approach described here if I had been looking for three-word or longer phrases.
The approach would have been similar if I had been examining sentences instead of filenames. To get the sentences into individual rows, I would have used Microsoft Word to replace sentence-ending punctuation (e.g., ? ! .) with the same punctuation followed by a ^p character, which Word interprets as a paragraph break, and then pasted the text into Excel. Resolving sentence ends might have taken a while, given problems with abbreviations like "Mr." I did not have more than 65,000 filenames, so Excel 2003 could handle them. Otherwise, I would have needed a newer spreadsheet (e.g., the Calc programs in LibreOffice or OpenOffice).
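For readers who would rather script that step than do it in Word, the same idea can be sketched in Python (my illustration, not part of the original spreadsheet method); note that it is deliberately naive and would mishandle "Mr." in just the way described above:

```python
import re

# Put each sentence on its own row by appending a line break after
# sentence-ending punctuation (. ? !), mimicking the ^p replacement.
text = "I saw a dog. Was it big? Yes!"
rows = re.sub(r"([.?!])\s+", r"\1\n", text).split("\n")
# rows: ["I saw a dog.", "Was it big?", "Yes!"]
```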
Consider this sentence or filename: "I saw a dog - a big dog." Using spaces as delimiters (as I intended to do), its two-word phrases would be "I saw," "saw a," "a dog," "dog -," "- a," "a big," and "big dog." This approach would produce some junk, but I hoped it would be easily filtered. Not worrying about the punctuation seemed faster than trying to anticipate endless punctuation combinations (e.g., "dog -- a"). It would fail in some instances (e.g., "dog--a"), but those too seemed likely to involve insignificant phrases.
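The pairing described above can be sketched in a few lines of Python (an illustration of the logic, not the spreadsheet method itself); punctuation sticks to whichever word it touches, which is where the junk like "dog -" comes from:

```python
# Split on single spaces and pair each word with its right-hand
# neighbor. Punctuation is not stripped, so "dog -" and "- a" appear.
def two_word_phrases(text):
    words = text.split(" ")
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

phrases = two_word_phrases("I saw a dog - a big dog.")
# phrases: ["I saw", "saw a", "a dog", "dog -", "- a", "a big", "big dog."]
```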
That sentence contains seven spaces. I could count the number of spaces manually. With many sentences, it would make more sense to use a formula. One approach would be to use Excel's SUBSTITUTE and LEN functions to calculate the difference in length between the sentence with and without spaces, and then use MAX to see which row has the largest number of spaces. TRIM would remove excess spaces.
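The SUBSTITUTE/LEN trick translates directly into any language; here is a hedged Python sketch of the same arithmetic (again, an illustration rather than the Excel formulas themselves):

```python
# Count spaces as the difference in length between the text with and
# without them -- the same arithmetic as LEN(A3)-LEN(SUBSTITUTE(A3," ","")).
def count_spaces(text):
    return len(text) - len(text.replace(" ", ""))

# Like Excel's TRIM: strip leading/trailing spaces and collapse runs
# of internal spaces to single spaces.
def trim(text):
    return " ".join(text.split())

n = count_spaces("I saw a dog - a big dog.")
# n: 7
```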
Since each two-word pair would contain one space, the sample sentence would contain seven pairs of words. Therefore, my spreadsheet needed enough columns to accommodate seven two-word phrases. The approach I took was to set up the spreadsheet like this:
|H3||=IF(G3="","",IF(ISERROR(FIND(" ",$A3,G3+1)),LEN($A3),FIND(" ",$A3,G3+1)))|
|L3||=IF(K3="","",IF(ISERROR(FIND(" ",$A3,K3+1)),LEN($A3),FIND(" ",$A3,K3+1)))|
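To make the chained FIND logic in those columns easier to follow, here is a rough Python translation (mine, for illustration only): each step looks for the next space after the previous position, and when no space remains it falls back to the length of the string, just as the formula does via ISERROR and LEN.

```python
# Mimic =IF(ISERROR(FIND(" ",$A3,prev+1)),LEN($A3),FIND(" ",$A3,prev+1)).
# Excel positions are 1-based, so we convert from Python's 0-based find.
def next_space(text, prev):
    pos = text.find(" ", prev)  # 0-based index prev == 1-based prev+1
    return len(text) if pos == -1 else pos + 1

s = "I saw a dog - a big dog."
positions = []
p = 0
while True:
    p = next_space(s, p)
    positions.append(p)
    if p == len(s):  # the LEN fallback marks the end of the chain
        break
# positions: the 1-based location of each space, then LEN(s) as the
# final fallback value, matching what the formula columns would hold.
```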