Thursday, March 1, 2012

Excel 2003: An Attempt to Sort Sentences or Filenames by Identifying Multiword Phrases

I had a bunch of filenames.  I wanted to know which two-word phrases appeared most frequently.  The purpose of this effort was to put those filenames into categories.  I hoped that the approach of finding significant two-word phrases in those filenames would help me automate the file-sorting process.  This post describes the first approach I took to identify those two-word phrases.

Parsing the Filenames

I hoped that two-word phrases would be a good size.  I figured the results would sort into many essentially one-word phrases (e.g., "a dog" would be trimmed to "dog," which would not have much significance) and a relatively small number of more significant two-word phrases (e.g., "big dog").  I hoped, that is, that I could easily filter the list for common words (a, an, the, if, of ...) and then count what was left.  It seemed, though, that I could have taken approximately the approach described here even if I had been looking for three-word or longer phrases.

The approach would have been similar if I had been examining sentences instead of filenames.  I would have gotten the sentences into individual rows by using Microsoft Word and replacing sentence ends (e.g., ? ! .) with the same punctuation followed by ^p characters, which Word would interpret as the start of a new line, and then pasting the text into Excel.  It might have taken a while to resolve sentence ends, given problems with e.g., "Mr.".  I did not have more than 65,000 filenames, so Excel 2003 would handle it.  Otherwise, I might have needed a newer spreadsheet (e.g., the Calc programs in LibreOffice or OpenOffice).
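
To illustrate, a rough first pass at those replacements in Word -- before worrying about abbreviations like "Mr." -- could be three ordinary search-and-replace operations, one per sentence-ending mark:

Find: ". "   Replace with: ".^p"
Find: "? "   Replace with: "?^p"
Find: "! "   Replace with: "!^p"

(each Find string being the punctuation mark followed by a space).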

Consider this sentence or filename:  "I saw a dog - a big dog."  Using spaces as delimiters (as I intended to do), its two-word phrases would be "I saw," "saw a," "a dog," "dog -," "- a," "a big," and "big dog."  This approach would produce some junk, but I hoped it would be easily filtered.  Not worrying about the punctuation seemed faster than trying to anticipate endless punctuation combinations (e.g., "dog -- a").  It would fail in some instances (e.g., "dog--a"), but those too seemed likely to involve insignificant phrases.

That sentence contains seven spaces.  I could count the number of spaces manually.  With many sentences, it would make more sense to use a formula.  One approach would be to use Excel's SUBSTITUTE and LEN functions to calculate the difference in length between the sentence with and without spaces, and then use MAX to see which row has the largest number of spaces.  TRIM would remove excess spaces.
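
A minimal sketch of that space-counting formula, assuming the first sentence or filename is in cell A3 and the counts run down column B:

B3  =LEN(TRIM(A3))-LEN(SUBSTITUTE(TRIM(A3)," ",""))

The TRIM calls strip leading, trailing, and doubled spaces first, so the count reflects only single separating spaces; a cell containing =MAX(B3:B65536) then reports the largest space count in the list.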

Since each two-word pair would contain one space, the sample sentence would contain seven pairs of words.  Therefore, my spreadsheet needed enough columns to accommodate seven two-word phrases.  The approach I took was to set up the spreadsheet like this:

[Image: the first three column groups, with phrase numbers in the top row and space positions in the second row]

The top row indicated the number of the two-word phrase.  The first one was "I saw," in group 1.  The second row indicated the number of the space found.  For instance, the first space was at character position 2 (i.e., right after "I"), and the second space appeared at position 6 (after "saw").  This image shows only the first three groups; I would need four more.  As it shows, the first group required calculation of only two numbers because its starting point (character position 1) was already known.  I preferred this approach of breaking out the calculations visibly, rather than jamming them all together in one long, complicated formula.  It was usually faster this way, and it made it easier to spot errors.

I used formulas in the header rows to facilitate copying.  For instance, the formula in cell F2 was =F1, and in F1 it was =C1.  That way, once I worked out my formulas in row 3, I could select and copy cells F1 through I3 repeatedly across the spreadsheet, until I had enough column groups to accommodate all of the spaces in my longest filename or sentence.

I used a macro to prepare the following list of the formulas in the cells on row 3:

C3  =FIND(" ",$A3)
D3  =FIND(" ",$A3,C3+1)
E3  =TRIM(LEFT(A3,D3))
F3  =IF(D3=LEN($A3),"",C3)
G3  =IF(F3="","",FIND(" ",$A3,F3+1))
H3  =IF(G3="","",IF(ISERROR(FIND(" ",$A3,G3+1)),LEN($A3),FIND(" ",$A3,G3+1)))
I3  =IF(H3="","",TRIM(MID($A3,F3,H3-F3+1)))
J3  =IF(H3=LEN($A3),"",G3)
K3  =IF(J3="","",FIND(" ",$A3,J3+1))
L3  =IF(K3="","",IF(ISERROR(FIND(" ",$A3,K3+1)),LEN($A3),FIND(" ",$A3,K3+1)))
M3  =IF(L3="","",TRIM(MID($A3,J3,L3-J3+1)))

So anyway, now I had a spreadsheet containing the two-word phrases that I wanted to sort and examine.  I made a backup and proceeded to convert the spreadsheet's formulas to values.  With the formulas frozen as values, I could sort and delete rows and columns without changing the results shown.  There appeared to be functions for this purpose in ASAP Utilities and possibly elsewhere.  An easy way to do this was to go to the top left-hand corner of the spreadsheet, hit Ctrl-A to select it all, and then hit Edit > Copy, Edit > Paste Special > Values > Enter > Enter.  I had usually found that the same keystrokes worked in later versions of Excel.  In this case, the keystrokes were Alt-E, C, Alt-E, S, V, Enter, Enter.
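
Alternatively, a short macro could do the same freeze in one step.  This is just an illustrative sketch, not something I used in the original process:

Sub FreezeValues()
    ' Replace every formula on each worksheet with its current value.
    Dim ws As Worksheet
    For Each ws In ActiveWorkbook.Worksheets
        ws.UsedRange.Value = ws.UsedRange.Value
    Next ws
End Sub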

However achieved, the point is that cell C3 no longer contained the formula =FIND(" ",$A3).  Now it just contained the number 2, and it would continue to contain that number even if I deleted cell A3.  With that done, I could now delete the columns containing the numbers that I had used to detect my two-word phrases (e.g., columns C, D, F, G, H ...).  I sped up that process by doing a column sort (select all (Ctrl-A) and then Data > Sort > Options > Sort Left to Right > Row 2).  I put numbers at the tops of the Phrase columns, so as to keep their order straight.

Lining Up the Phrases

At this point, the cells toward the right side of my spreadsheet looked like this:

[Image: the last few columns of the spreadsheet, with rows ending at different points]

That is, the rows did not all end at the same place.  In this illustration, the filename on row 2 had 17 two-word phrases, while the one on row 3 had only 16.  I wanted them all in a single column.  I wasn't sure if I needed them in proper order (in this example, P2, Q2, R2, followed by P3, Q3, R3), but I preferred that if possible.

I had tackled a similar problem previously.  That post still provided more details than I have written here.  But in some regards, I now saw a way to simplify that process a bit.  I deleted row 1, containing the column headings, as well as column 1, containing the original filenames.  They seemed unnecessary and potentially confusing at this point.  Then I created another worksheet (Insert > Worksheet).  Its default name was Sheet2.  In cell A1 in that worksheet, I entered a formula that referred to the corresponding cell in Sheet1, where my data were located.  The formula that I put into cell A1 in Sheet2 was =IF(LEN(Sheet1!A1)>0,CELL("address",Sheet1!A1),"").

That formula produced this output in Sheet2, cell A1:  [xA8.xls]Sheet1!$A$1.  (xA8.xls was the name of my spreadsheet file.)  I was going to use the preliminary information, the part that said [xA8.xls]Sheet1!, to tell the computer where a new entry began.  In other words, if the spreadsheet had given me just $A$1, I would have changed my formula so that Sheet2, cell A1 would contain something like this:  #$A$1.  Any odd character would have been sufficient in place of # as long as it gave the computer something unique to look for.

Next, I copied that formula from Sheet2, cell A1 to each cell that corresponded to my data in Sheet1.  In other words, in Sheet1 I hit Ctrl-End to discover that the bottom right corner of my spreadsheet was at cell Q14998.  So now, in Sheet2, I copied that formula from cell A1 to all cells down to Q14998.  A fast way to do that was to hit F5 and type Q14998 and hit Enter.  That took me to Q14998.  I put an X there.  Then I went back to A1 and hit Ctrl-C > Ctrl-Shift-End > Enter.  Boom!  Done.  I went back to Q14998 to double-check.  Then Alt-O, C, A to autofit the column widths.

Now I had a bunch of cells that had ugly stuff like cell A1 (above):  [xA8.xls]Sheet1!$A$1.  I hit Ctrl-A > Ctrl-C to copy it all.  I opened Notepad and hit Ctrl-V to paste it.  (If there had been too much data for Notepad, I think something like Notepad++ would have worked.)  The purpose of putting it into Notepad was to remove the column formatting.  (There were delays with some of these operations, as the Windows clipboard and the various programs wrapped their heads around the size of the task.)

Now I copied it all from Notepad to Microsoft Word.  Any editor would have worked, as long as it had a way of doing global search-and-replace operations that would produce line breaks.  That is, in Word, I replaced all instances of the ugly stuff (i.e., [xA8.xls]Sheet1!) with a newline character (^p).  Word reported that it had made about 170,000 replacements.  This gave me the desired list, starting like this:

$A$1
$B$1  
$C$1  
$D$1  

There were invisible tabs after the entries (e.g., it was actually $A$1 tab), and there were also some blank lines.  I cleaned these up with various ^p^p and ^t^p replace operations, repeated until there were no more.  The file was long -- about 3,700 pages.

I went back to Excel and deleted Sheet2:  I didn't need it anymore, and all its calculations would slow things down.  I created a new Sheet2 and pasted the results from Word into it.  I wasn't sure if Word had inserted any other funky invisible characters, so I ran the results through Notepad on the way back into Excel, just to be sure.

But now, oops, new problem.  My old Excel 2003 couldn't accommodate 170,000 rows.  It gave me an error:  "File not loaded completely."  I took a look.  Row 65536 contained a reference to $F$5911.  I could have transferred the whole project over to LibreOffice, which could handle 170,000 rows, but that would have been premature.  I had kept Notepad open, so now I searched for that item $F$5911 in the data that were still sitting there in Notepad.  I deleted everything down to that point, copied the remainder, and pasted it into an adjacent column in Excel; I had to repeat this a couple of times.  So now I had a spreadsheet that was three columns wide and 65,536 rows long.  Warning:  do not print your results.

I inserted blank columns between these columns of output.  Those blank columns would contain the text corresponding to those cell references.  For instance, in Sheet2 cell A1, I now had $A$1.  So in the blank cell B1, I put this formula:  =INDIRECT("Sheet1!"&A1).  I copied that formula to all the other cells in those adjacent columns.  That gave me three sets of two columns each:  the first provided the reference, and the second contained my actual two-word phrase.

At this point, I froze all those values, as above, and deleted the reference columns.  So now I just had three columns containing two-word phrases.  In short, I had managed to isolate the phrases.  Now it was time to start slicing and dicing them.

Working with Unique Phrases

I wanted to filter those two-word phrases for unique ones.  For example, if there were a thousand two-word phrases that said "Email from," I wanted just one entry of that nature, with a numerical companion cell containing the number 1,000.

To do that filtering, I tried to use LibreOffice Calc, where I could get all that material into column A.  But Calc hung for about 15 hours before producing output.  (Excel also hung at a number of points in this whole process.)  I did not test Calc's output rigorously, but it appeared to have given me some repetitive items in that supposedly nonrepetitive list.  I went back to Excel, filtering each of the three columns and then combining the short output into one column and filtering that.  This gave me 7,539 unique two-word phrases.  That list passed the test for duplicates, but counts indicated that it was not accounting for all 169,891 two-word phrases.  The formulas to do those counts really slowed things down, so I deleted them for the moment.
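
For reference, a sketch of that filter-and-count step in Excel 2003, assuming the phrases have been combined into column A (with a header row):  Data > Filter > Advanced Filter > Copy to another location, with "Unique records only" checked, copying the unique phrases to (say) column C.  Then a count beside each unique phrase:

D1  =COUNTIF(A:A,C1)

copied down.  A cell containing =SUM(D:D) should then match the total number of phrases -- the consistency check that, as just described, did not quite come out even for me.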

I thought that it might help to clean up unnecessary characters.  I sorted each of the three columns according to starting and ending characters, so as to detect extraneous ones (e.g., various kinds of brackets, quotation marks, and other symbols).  I also did some global (Ctrl-H) searches in Excel to the same effect.  In addition, I deleted a few meaningless two-word pairs.  These efforts revealed that many two-word phrases would consist, at least in part, of single letters (e.g., "A dog," "Roger P").  It seemed that I probably should have done global replace operations, even before breaking out the two-word phrases, so as to promote meaningful two-word combinations (e.g., "Roger P" and "P Smith" might be less recognizable than "Roger Smith").  Possibly a preliminary search for three- and even four-word proper nouns would have been a good approach there.

I also decided, along the way, that it would make sense to tolerate meaningful one-word terms (e.g., the "Pentium" word that would remain after removing the unhelpful "a" from "a Pentium").  That predictable conclusion raised questions about my sequence of steps generally.

After those cleanup efforts, I filtered again for unique two-word phrases and counted their occurrences.  Working with three source columns required some extra steps, mostly to eliminate duplicates, but ultimately the count produced by my unique-phrase search came within 1% of the actual number of phrases.  In other words, my filtering for unique items substantially accounted for all items.  Specifically, I had 163,391 two-word phrases, and, with some apparent overcounting, my list of unique phrases accounted for 163,772 such phrases.

Because those counts did not jibe perfectly, I decided to take an elimination approach -- that is, to remove the two-word phrases that lacked significance, rather than selecting those that had it.  To help me in this process, I divided the two-word phrases into individual words, and then prepared to test each of them against a list of relatively insignificant words that I had cooked up for the occasion.  Before doing that, I had to screen again for stray punctuation:  a comma embedded in a phrase like "soon, however" meant that "soon," would escape an elimination rule aimed at "soon."  That made it increasingly appealing to find some way of handling punctuation earlier in the process.  Another step that would have been beneficially taken earlier, to the extent possible, would have been to convert date references (e.g., "2-13-09 project" vs. "Feb. 13 project") into a standard format, or else to decide that none were significant.  It seemed, generally, that renaming the files to standardize various names and terms (or at least revising aliases for them) should have occurred earlier in the process.  Such changes seemed likely to shape this sentence-parsing process.  For instance, if full names were to be used (e.g., "Roger P. Smith"), then a two- rather than three-word focus could be wrongheaded.
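
A sketch of that elimination test, assuming the individual words sit in column B and the insignificant words are in a range named StopWords (both locations being merely illustrative):

C2  =IF(COUNTIF(StopWords,B2)>0,"drop","keep")

A two-word phrase would then survive only if both of its component words came up "keep".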

Project Abortion

At this point, I had an unpleasant realization.  The need to pare down my spreadsheet, so as not to have to wait forever for calculations, and the loss of concentration due to long waits and the passage of time, had produced a situation where I had not realized that, in the past several hours, my work had ceased to be directly linkable back to the original list of files.  There was no index.  Worse, there had been global changes to remove troublesome punctuation, further reducing direct linkability.  I could continue with this effort, using backups to reconstruct the links.  But there seemed to be good reasons to reconceive the whole approach.

One emerging conclusion was that there just might not be a shortcut to a good rubric or other structured approach to the renaming of files, the coding of sentences, or the constructing of aliases for bits of text being examined in this kind of effort.  One question in that regard was whether there would be a good way to identify and change troublesome characters or combinations of characters up front.  That question led me to a second try.

Using Excel to Clean Unwanted Characters from Filenames or Other Text

I was trying to organize some files.  This post contains some notes I made on ways of using Microsoft Excel to change a list of filenames.  The renaming project drew on file renaming techniques described elsewhere.

In various attempts in this process, I came up against situations where inconsistent punctuation and other filename characteristics made the process unnecessarily complicated.  For instance, file names might refer to "Barack Obama," "Barack H. Obama," "President Obama," and so forth.  These would all be alphabetized differently.  Duplicates might be concealed, and files that belonged together topically might be overlooked.

In part, that was an issue of choosing and using consistent file naming conventions.  But at this point, given that the files were already named, there was an issue of how to deal with punctuation and other non-alphanumeric characters.  I wondered if there was a relatively straightforward way to get my arms around the varieties of character usage within a particular corpus.

The first task seemed to be identifying what unwanted characters were in use.  Taking Microsoft Excel as my weapon du jour, I made a list of the filenames that I might want to change, using DIR on the Windows command line (see the comments following another recent post; a command like dir /b > filelist.txt produces such a bare list of names).  I pasted that list of filenames into Excel.

Next, I ran a search and found several ingenious approaches to related problems. One would use a wildcard search-and-replace in Microsoft Word:   search for ? (making sure to indicate that this would be a "Use wildcards" search) and replace with ^&^p to put the same character (that is, any character) on a line by itself; then paste the whole text into Excel and do a unique filter to get all of the unique characters found in the text. This did not seem likely to work for my purposes, unfortunately. For one thing, Excel 2003 was limited to 65,536 rows, and there could easily be more characters than that in a text being analyzed.  I had also found that spreadsheets capable of handling more rows could be slow at it.

There were other approaches that used an array formula or a script. I was not as comfortable with those that I did not understand, because it would be difficult for me to work through their logic and make sure they were really doing what I wanted them to do. An unnoticed mistake could have unfortunate effects when altering a large or important text, or when renaming thousands of files.

Another possibility came to mind as I looked at the suggestion to use MoreFunc's REGEX.COUNT function (alternately, Excel's SUBSTITUTE or possibly MATCH) to produce character-by-character counts.  I ran a search and found some interesting variations in SUBSTITUTE, including nested and case-insensitive substitutions, along with the CLEAN function.  One solution, it seemed, would be to use 26 columns containing formulas like =SUBSTITUTE(UPPER(A1),"A","") -- one column per letter -- to replace all upper- and lower-case letters with nothing, another 10 columns to do the same for numbers, and then see what's left.
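
A minimal sketch of that chain, assuming the text is in A1:  in B1, =SUBSTITUTE(UPPER(A1),"A",""); in C1, =SUBSTITUTE(B1,"B",""); and so on, each column working on the output of the one before it.  One way to avoid typing 36 such formulas by hand would be to lean on CHAR and COLUMN:

B1  =SUBSTITUTE(UPPER(A1),CHAR(63+COLUMN()),"")
C1  =SUBSTITUTE(B1,CHAR(63+COLUMN()),"")

The C1 version can simply be copied rightward through column AA, since CHAR(65) through CHAR(90) are the letters A through Z; the ten digits would take a second, similar run using CHAR(48) through CHAR(57).  Whatever survives in the last column is the non-alphanumeric residue.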

With SUBSTITUTE, one possibility was to set up a map. Put the rows of text being analyzed in one worksheet ("Sheet1"). Create a second worksheet ("Sheet2") within the same Excel file.  In Sheet2, make each row relate to the row of the same number in Sheet1. So, for example, if there was a text string in cell A5 in Sheet1, there would be a row of characters across row 5 in Sheet2. Each cell in that row would count the occurrences of a specific character. For example, the 65th column would detect occurrences of the character whose Excel code happened to be 65 (that is, the letter "A," producible by =CHAR(65)).  The chief problem with this approach was that its calculations would take a year.  It would also max out Excel 2003's limit of 256 columns.  It would have the advantage of transparency:  I would be able to see much of what was happening in the data.
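
A sketch of one cell in that map, assuming the text strings sit in column A of Sheet1:  any Sheet2 cell on, say, row 5 could contain

=LEN(Sheet1!$A5)-LEN(SUBSTITUTE(Sheet1!$A5,CHAR(COLUMN()),""))

so that the cell in column 65 (spreadsheet column BM) counts occurrences of CHAR(65), i.e., the letter "A".  Copied across 255 columns and down one row per text string, that builds the whole map.  Since SUBSTITUTE is case-sensitive, "A" (code 65) and "a" (code 97) land in different columns.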

I found a user-defined function (UDF) -- what I had been referring to as a script -- called SUBSTITUTELIST.  This UDF would let me indicate a list of characters to simply remove.  In naming files, I would definitely not want to include many of the odd characters that were readily visible if, say, I opened a non-text file in Notepad.

The first step, with SUBSTITUTELIST, was to define it as a function that I could use in spreadsheet formulas.  To do that, using the steps described atop another post, I copied and pasted this text into an Excel module:

Function SubstituteList(rString As String, rList As Range) As String
    Dim rCell As Range
    ' Remove from rString every character (or string) listed in rList.
    For Each rCell In rList
        rString = Replace(rString, rCell.Text, "")
    Next rCell
    SubstituteList = rString
End Function

I didn't write that code, and was only partly clear on what it said.  It appeared, though, that its Replace line was going to replace each character listed in rList with nothing ("").  In other words, those characters were being removed from rString.  Now I had to tell it what characters to put in rList.  In the spreadsheet listing the filenames I wanted to work on, I opened a new worksheet, called Sheet2, and put the numbers 1 through 255 in cells A2 to A256.  (Two easy ways to insert a series of numbers:  use =A2+1 in cell A3, and so forth; or enter the first two numbers (i.e., 1 and 2) manually, highlight the cells to be filled, and then use Edit > Fill > Series.  The Alt command option would probably still work in post-2003 versions of Excel:  Alt-E, I, S, Enter.)
Then, in Sheet2 cell B2, I typed =CHAR(A2), and copied it down to B256.  So now column B contained the actual characters I wanted to look at, whose numeric (CHAR) codes were shown in column A.  Finally, column C contained the characters to be removed from consideration.  For example, in cell C128, I typed =B128.  That put a copy of CHAR(127) into column C.  SUBSTITUTELIST was going to remove anything in column C.  So that particular character would be removed from the cell in Sheet1 where I was going to be using SUBSTITUTELIST.  Depending on the situation, I might or might not want to remove all non-alphanumeric characters.  For instance, I thought that what I would really want to do with something like the euro character (€), in a filename, would not be to eliminate it, but rather to replace it (e.g., €35 becomes EUR 35).

To use SUBSTITUTELIST, I put my text to be analyzed in column A of Sheet1.  In B1, I entered this:  =SUBSTITUTELIST(A1,Sheet2!$C$2:$C$256).  I copied that formula all the way down, next to each text string to be examined.  That gave me a collection of funky characters in Sheet1 column B.  Originally, I added TRIM to the SUBSTITUTELIST formula just cited, but then I realized that I actually wanted to see the pattern of spaces and other non-alphanumeric characters.

SUBSTITUTELIST seemed to have potential for single-character eliminations, but it didn't seem capable of substitutions involving multiple-character strings (e.g., EUR in place of €).  It looked like a relatively simple change to SUBSTITUTELIST might do the job, but I had little facility for that sort of work at this point, so I posted a question on it.  One reply to that question suggested a routine called CLEANUPTEXT -- strictly speaking a macro (a Sub) rather than a UDF, since it rewrites the worksheet in place instead of returning a value.  Its contents (created in the same manner as SUBSTITUTELIST, above) were as follows:

Option Explicit

Sub cleanupText()

    Dim allTxt() As Variant, sublist() As Variant
    Dim i As Long, j As Long, k As Long, tdots As Integer

    ' Store data from the sheets in arrays:  Sheet1 holds the text being
    ' cleaned; Sheet2 holds "before" terms in column A, "after" terms in column B.
    allTxt = Sheets(1).UsedRange.Value
    sublist = Sheets(2).UsedRange.Offset(1, 0).Resize(Sheets(2).UsedRange.Rows.Count - 1, Sheets(2).UsedRange.Columns.Count).Value

    For i = 1 To UBound(allTxt, 1)
        For j = 1 To UBound(allTxt, 2)
            ' Loop through replacement terms and make replacements to data in array.
            For k = 1 To UBound(sublist, 1)
                allTxt(i, j) = Replace(allTxt(i, j), sublist(k, 1), sublist(k, 2))
            Next k
            allTxt(i, j) = Trim(allTxt(i, j))

            ' Remove series of trailing periods.
            If Right(allTxt(i, j), 1) = "." Then
                tdots = 1
            Else
                tdots = 0
            End If
            Do While tdots = 1
                allTxt(i, j) = Left(allTxt(i, j), Len(allTxt(i, j)) - 1)
                If Right(allTxt(i, j), 1) = "." Then
                    tdots = 1
                Else
                    tdots = 0
                End If
            Loop

            allTxt(i, j) = Trim(allTxt(i, j))
        Next j
    Next i

    ' Print cleaned-up results in array onto sheet.
    ActiveSheet.UsedRange.Value = allTxt
End Sub


The instructions accompanying that routine seemed to indicate that it required a separate worksheet, called Sheet2, containing the "before" terms in column A of Sheet2, and the "after" terms in column B of that sheet; and Sheet1 had to be the name of the tab containing the text being operated on.  In response to my question, the trailing-period section (the tdots lines above) was designed to remove trailing dots from the text being modified.  I did not actually try this routine, but I hoped that I would not have to write similar lines of code for other kinds of unwanted text strings -- that, in other words, the column A / column B replacement approach would suffice.  If so, I might try to simplify the routine by removing those trailing-period lines.

On a separate note, I had noticed that some of the files in my list looked like their names were identical except for punctuation, so I added a column, in my spreadsheet that listed filenames, for this formula:  =LOWER(substitutelist(M2,Sheet2!$C$2:$C$256)).  M2 contained the existing filename, or part thereof, and Sheet2 column C was set to eliminate all non-alphanumeric characters, as shown above.  This step did succeed in highlighting some duplicate filenames and other issues.

Another problem was that, in some cases, I would have to rename files before proceeding with something like SUBSTITUTELIST.  The specific problem was that some characters in filenames would defeat batch commands.  That is, I could have a command to change a filename, perhaps in response to an insight emerging from SUBSTITUTELIST; but the command might not run.  So I would have to change or delete those troublesome characters first.  I often used Excel formulas to generate commands that I could run in a batch file to rename numerous files.  The basic idea here was that I would put the existing filename in cell A1 of an empty worksheet, and the desired filename in cell B1, and then I would enter this formula in cell C1:  ="ren "&CHAR(34)&A1&CHAR(34)&" "&B1.  (If the new filename in B1 had spaces in it, then it too would have to be surrounded by CHAR(34) instances, as done for A1 in that formula.)  The particular characters at issue here were percent symbols and possibly exclamation marks.  The commands were right.  They would run if I typed them into the command line, or copied and pasted them there.  But they wouldn't work from within a batch file -- presumably because the batch interpreter treats % as a special character, so that a literal percent sign in a batch file would have to be doubled (%%).  To address this concern, I just deleted the exclamation marks, and I replaced the percent symbols with "pct" (e.g., 45% became 45pct).

In some cases, I would want a tool that could replace one or more undesirable characters, in a single bulk process, with one or more preferred characters.  In part, this was a simple interest in transliteration or Romanization (i.e., converting foreign-language characters to the standard English alphabet).  There would be at least two reasons to do that.  One was that I had run into some file handling issues with some such characters.  I recalled one recent case where something that looked like a C (maybe ḉ or ć or ĉ or ċ or č) was not working correctly with some backup program or something.  The other reason was that I might not remember which character to use when doing filename searches.  So, for instance, if I was looking for all of the files containing a copy of someone's resume, I might or might not want to have to search for both "resume" and "résumé."  There were many Romanization tables for various languages, possibly best approached by finding conversion tables for the thousands of possible Unicode characters.  This could take further attention, as there were multiple Unicode encodings, not all of which might be consistent with Excel's character coding.  In filenames, especially, examples involving several characters could include "pct" (above) and the replacement of $ with USD and of © with (C).  One readability example, arising where HTML has become involved, would be to replace the HTML code for the apostrophe (e.g., &#39;) with the apostrophe itself.

As I thought further about the problem, I realized that characters might be unwanted in context, not in isolation. For instance, the hyphen ("-") could be fine in some settings, not fine in others. At the risk of opening the door to complexity, it seemed the ideal approach (perhaps the less messy approach, in the long run) would give me the possibility of listing multi-character combinations and indicating their desired replacement. I found a UDF called SUPERSUB that seemed to be just what the doctor ordered:

Function SuperSub(OriginalText As String, rngOldText As Range)
    Dim cel As Range
    Dim strOldText As String, strNewText As String
     'loop through list of old_text used in substitute
    For Each cel In rngOldText.Cells
        strOldText = cel.Value
        strNewText = cel.Offset(0, 1).Value
        OriginalText = Application.WorksheetFunction.Substitute(OriginalText, strOldText, strNewText)
    Next cel
    SuperSub = OriginalText
End Function

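To use it, assuming the "old" strings were in Sheet2 cells A1:A10 (or whatever range the list actually occupied) with their replacements beside them in column B, the worksheet formula would be:

=SuperSub(A1,Sheet2!$A$1:$A$10)

The function reads each old string from the range supplied and takes its replacement from the cell one column to the right (the Offset(0, 1) in the code).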

Again, I did not try this script at the time; I was merely preserving its information here for possible future application.

Another post offered a macro built around a UDF called REMOVEPUNCTUATION, in which I could create a list of items to replace, and their replacements, within the macro itself.  The text of that one (with a number of illustrative examples) was as follows:

Sub test()
Dim c As Range, t As String
For Each c In Range("A1:A" & Range("A" & Rows.Count).End(xlUp).Row)
    t = UCase(" " & RemovePunctuation(c.Text) & " ")
    t = Replace(t, " AVE ", " AVENUE ")
    t = Replace(t, " BLVD ", " BOULEVARD ")
    t = Replace(t, " BVD ", " BOULEVARD ")
    t = Replace(t, " CYN ", " CANYON ")
    t = Replace(t, " CTR ", " CENTER ")
    t = Replace(t, " CIR ", " CIRCLE ")
    t = Replace(t, " CT ", " COURT ")
    t = Replace(t, " DR ", " DRIVE ")
    t = Replace(t, " FWY ", " FREEWAY ")
    t = Replace(t, " HBR ", " HARBOR ")
    t = Replace(t, " HTS ", " HEIGHTS ")
    t = Replace(t, " HWY ", " HIGHWAY ")
    t = Replace(t, " JCT ", " JUNCTION ")
    t = Replace(t, " LN ", " LANE ")
    t = Replace(t, " MTN ", " MOUNTAIN ")
    t = Replace(t, " PKWY ", " PARKWAY ")
    t = Replace(t, " PL ", " PLACE ")
    t = Replace(t, " PLZ ", " PLAZA ")
    t = Replace(t, " RDG ", " RIDGE ")
    t = Replace(t, " RD ", " ROAD ")
    t = Replace(t, " RTE ", " ROUTE ")
    t = Replace(t, " ST ", " STREET ")
    t = Replace(t, " TRWY ", " THROUGHWAY ")
    t = Replace(t, " TL ", " TRAIL ")
    t = Replace(t, " TPKE ", " TURNPIKE ")
    t = Replace(t, " VLY ", " VALLEY ")
    t = Replace(t, " VLG ", " VILLAGE ")
    t = Replace(t, " APT ", " APARTMENT ")
    t = Replace(t, " APTS ", " APARTMENTS ")
    t = Replace(t, " BLDG ", " BUILDING ")
    t = Replace(t, " FLR ", " FLOOR ")
    t = Replace(t, " OFC ", " OFFICE ")
    t = Replace(t, " OF ", " OFFICE ")
    t = Replace(t, " APT ", " APARTMENT ")
    t = Replace(t, " STE ", " SUITE ")
    t = Replace(t, " N ", " NORTH ")
    t = Replace(t, " E ", " EAST ")
    t = Replace(t, " S ", " SOUTH ")
    t = Replace(t, " W ", " WEST ")
    t = Replace(t, " NE ", " NORTHEAST ")
    t = Replace(t, " SE ", " SOUTHEAST ")
    t = Replace(t, " SW ", " SOUTHWEST ")
    t = Replace(t, " NW ", " NORTHWEST ")
    c = Trim(t)
Next
End Sub


Function RemovePunctuation(r As String) As String
With CreateObject("vbscript.regexp")
    .Pattern = "[^A-Z0-9 ]"
    .IgnoreCase = True
    .Global = True
    RemovePunctuation = .Replace(r, "")
End With
End Function


It seemed that, if I used REMOVEPUNCTUATION to make a large number of changes (more than just the few dozen shown here), I might use Excel to generate the text of the additional lines to be inserted into the REMOVEPUNCTUATION script.
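
For instance, assuming the abbreviations were listed in column A of a scratch sheet and their expansions in column B, a formula like this would generate each needed line of code, ready to paste into the macro:

C1  ="    t = Replace(t, "" "&A1&" "", "" "&B1&" "")"

With AVE in A1 and AVENUE in B1, that formula would produce t = Replace(t, " AVE ", " AVENUE ").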

That was the extent of these notes.  I would have to await another situation requiring mass file renaming in order to test these techniques.