Thursday, March 1, 2012

Using Excel to Clean Unwanted Characters from Filenames or Other Text

I was trying to organize some files.  This post contains some notes I made on ways of using Microsoft Excel to change a list of filenames.  The renaming project drew on file renaming techniques described elsewhere.

In various attempts in this process, I came up against situations where inconsistent punctuation and other filename characteristics made the process unnecessarily complicated.  For instance, file names might refer to "Barack Obama," "Barack H. Obama," "President Obama," and so forth.  These would all be alphabetized differently.  Duplicates might be concealed, and files that belonged together topically might be overlooked.

In part, that was an issue of choosing and using consistent file naming conventions.  But at this point, given that the files were already named, there was an issue of how to deal with punctuation and other non-alphanumeric characters.  I wondered if there was a relatively straightforward way to get my arms around the varieties of character usage within a particular corpus.

The first task seemed to be identifying what unwanted characters were in use. Taking Microsoft Excel as my weapon du jour, I made a list of the filenames that I might want to change, using DIR on the Windows command line (see the comments following another recent post). I pasted that list of filenames into Excel.

Next, I ran a search and found several ingenious approaches to related problems. One would use a wildcard search-and-replace in Microsoft Word:   search for ? (making sure to indicate that this would be a "Use wildcards" search) and replace with ^&^p to put the same character (that is, any character) on a line by itself; then paste the whole text into Excel and do a unique filter to get all of the unique characters found in the text. This did not seem likely to work for my purposes, unfortunately. For one thing, Excel 2003 was limited to 65,536 rows, and there could easily be more characters than that in a text being analyzed.  I had also found that spreadsheets capable of handling more rows could be slow at it.

There were other approaches that used an array formula or a script. I was not as comfortable with those that I did not understand, because it would be difficult for me to work through their logic and make sure they were really doing what I wanted them to do. An unnoticed mistake could have unfortunate effects when altering a large or important text, or when renaming thousands of files.

Another possibility came to mind as I looked at the suggestion to use MoreFunc's REGEX.COUNT function (alternately, Excel's SUBSTITUTE or possibly MATCH) to produce character-by-character counts.  I ran a search and found some interesting variations in SUBSTITUTE, including nested and case-insensitive substitutions, along with the CLEAN function. One solution, it seemed, would be to use 26 columns containing something like =SUBSTITUTE(UPPER(A1),"","") to replace all upper- and lower-case letters with nothing, another 10 columns to do the same for numbers, and then see what's left.

With SUBSTITUTE, one possibility was to set up a map. Put the rows of text being analyzed in one worksheet ("Sheet1"). Create a second worksheet ("Sheet2") within the same Excel file.  In Sheet2, make each row relate to the row of the same number in Sheet1. So, for example, if there was a text string in cell A5 in Sheet1, there would be a row of characters across row 5 in Sheet2. Each cell in that row would count the occurrences of a specific character. For example, the 65th column would detect occurrences of the character whose Excel code happened to be 65 (that is, the letter "A," producible by =CHAR(65)).  The chief problem with this approach was that its calculations would take a year.  It would also max out Excel 2003's limit of 256 columns.  It would have the advantage of transparency:  I would be able to see much of what was happening in the data.

I found a user-defined function (UDF) -- what I had been referring to as a script -- called SUBSTITUTELIST.  This UDF would let me indicate a list of characters to simply remove.  In naming files, I would definitely not want to include many of the odd characters that were readily visible if, say, I opened a non-text file in Notepad.

The first step, with SUBSTITUTELIST, was to define it as a function that I could use in spreadsheet formulas.  To do that, using the steps described atop another post, I copied and pasted this text into an Excel module:

Function SubstituteList(rString As String, rList As Range) As String
Dim rCell As Range
For Each rCell In rList
rString = Replace(rString, rCell.Text, "")
Next rCell
SubstituteList = rString
End Function
I didn't write that code, and was only partly clear on what it said.  It appeared, though, that its middle line was going to replace rString with nothing (""). In other words, rString was being removed. Now I had to tell it what characters to treat as rString. In the spreadsheet listing the filenames I wanted to work on, I opened a new worksheet, called Sheet2, and put the numbers 1 through 255 in cells A2 to A256. (Two easy ways to insert a series of numbers: use =A1+1 in cell A2, and so forth; or enter the first two numbers (i.e., 1 and 2) manually, highlight the cells to be filled, and then use Edit > Insert > Series. The Alt command option would probably still work in post-2003 versions of Excel: Alt-E, I, S, Enter.)
Then, in Sheet2 cell B1, I typed =CHAR(A1), and copied it down to B255. So now column B contained the actual characters I wanted to look at, whose CHAR equivalents were shown in column A. Finally, column C contained the characters to be removed from consideration.  For example, in cell C128, I typed =B128.  That put a copy of CHAR(127) into column C.  SUBSTITUTELIST was going to remove anything in column C.  So that particular character would be removed from the cell in Sheet1 where I was going to be using SUBSTITUTELIST.  Depending on the situation, I might or might not want to remove all non-alphanumeric characters.  For instance, I thought that what I would really want to do with something like the euro character (€), in a filename, would not be to eliminate it, but rather to replace it (e.g., €35 becomes EUR 35).

To use SUBSTITUTELIST, I put my text to be analyzed in column A of Sheet1. In B1, I entered this: =SUBSTITUTELIST(A1,Sheet2!$C$1:$C$255). I copied that formula all the way down, next to each text string to be examined. That gave me a collection of funky characters in Sheet1 column B. Originally, I added TRIM to the SUBSTITUTELIST formula just cited, but then I realized that I actually wanted to see the pattern of spaces and other non-alphanumeric characters.

SUBSTITUTELIST seemed to have potential for single-character eliminations, but it didn't seem capable of substitutions involving multiple-character strings (e.g., EUR in place of €).  It looked like a relatively simple change to SUBSTITUTELIST might do the job, but I had little ability for that sort of work at this point, so I posted a question on it.  One reply to that question suggested a UDF called CLEANUPTEXT.  The contents of that UDF (created in the same manner as SUBSTITUTELIST, above) were as follows:

Option Explicit
Sub cleanupText()

Dim allTxt() As Variant, sublist() As Variant
Dim i As Long, j As Long, k As Long, tdots As Integer

'Store data from sheets in arrays.
allTxt = Sheets(1).UsedRange.Value
sublist = Sheets(2).UsedRange.Offset(1, 0).Resize(Sheets(2).UsedRange.Rows.Count - 1, Sheets(2).UsedRange.Columns.Count).Value

For i = 1 To UBound(allTxt, 1)
    For j = 1 To UBound(allTxt, 2)
        'Loop through replacement terms and make replacements to data in array.
        For k = 1 To UBound(sublist, 1)
            allTxt(i, j) = Replace(allTxt(i, j), sublist(k, 1), sublist(k, 2))
        Next k
        allTxt(i, j) = Trim(allTxt(i, j))
'Remove series of trailing periods.
        If Right(allTxt(i, j), 1) = "." Then
            tdots = 1
            tdots = 0
        End If
        Do While tdots = 1
            allTxt(i, j) = Left(allTxt(i, j), Len(allTxt(i, j)) - 1)
            If Right(allTxt(i, j), 1) = "." Then
                tdots = 1
                tdots = 0
            End If
allTxt(i, j) = Trim(allTxt(i, j))
    Next j
Next i
'Print cleaned up results in array onto sheet.
ActiveSheet.UsedRange.Value = allTxt
End Sub

The instructions accompanying that UDF seemed to indicate that it required a separate worksheet, called Sheet2, containing the "before" terms in column A of Sheet2, and the "after" terms in column B of that sheet; and Sheet1 had to be the name of the tab containg the text being operated on.  In response to my question, the text shown in blue (above) was designed to remove trailing dots from the text being modified.  I did not actually try this UDF, but I hoped that I would not have to write similar lines of text for other kinds of unwanted text strings -- that, in other words, the column A / column B replacement approach would suffice.  If so, I might try to simplify this UDF by removing the lines in blue.

On a separate note, I had noticed that some of the files in my list looked like their names were identical except for punctuation, so I added a column, in my spreadsheet that listed filenames, for this formula:  =LOWER(substitutelist(M2,Sheet2!$C$1:$C$255)).  M2 contained the existing filename, or part thereof, and Sheet2 column C was set to eliminate all non-alphanumeric characters, as shown above.  This step did succeed in highlighting some duplicate filenames and other issues.

Another problem was that, in some cases, I would have to rename files before proceeding with something like SUBSTITUTELIST.  The specific problem was that some characters in filenames would defeat batch commands.  That is, I could have a command to change a filename, perhaps in response to an insight emerging from SUBSTITUTELIST; but the command might not run.  So I would have to change or delete those troublesome characters first.  (I often used Excel formulas to generate commands that I could run in a batch file to rename numerous files.  The basic idea here was that I would put the existing filename in cell A1 of an empty worksheet, and the desired filename in cell B1, and then I would enter this formula in cell C1:  ="ren "&CHAR(34)&A2&CHAR(34)&" "&B2. (If the new filename in B2 had spaces in it, then it too would have to be surrounded by CHAR(34) instances, as done in that formula for cell A1.)  The particular chacters at issue here were percent symbols and possibly exclamation marks.  The commands were right.  They would run if I typed them into the command line, or copied and pasted them there.  But they wouldn't work from within a batch file.  To address this concern, I just deleted the exclamation marks, and I replaced the percent symbols with "pct" (e.g., 45% became 45pct).

In some cases, I would want a tool that could replace one or more undesirable characters, in a single bulk process, with one or more preferred characters.  In part, this was a simple interest in transliteration or Romanization (i.e., converting foreign-language characters to the standard English alphabet). There would be at least two reasons to do that. One was that I had run into some file handling issues with some such characters. I recalled one recent case where something that looked like a C (maybe ḉ or ć or ĉ or ċ or č) was not working correctly with some backup program or something. The other reason was that I might not remember which character to use when doing filename searches.  So, for instance, if I was looking for all of the files containing a copy of someone's resume, I might or might not want to have to search for both "resume" and "résumé."  There were many Romanization tables for various languages, possibly best approached by finding conversion tables for the thousands of possible Unicode characters.  This could take further attention, as there were multiple kinds of Unicode, not all of which might be consistent with Excel's coding.  In filenames, especially, examples involving several characters could include "pct" (above) and the replacement of $ with USD and of © with (C).  One readability example, arising where HTML has become involved, would be to replace the HTML code of ' with the apostrophe.

As I thought further about the problem, I realized that characters might be unwanted in context, not in isolation. For instance, the hyphen ("-") could be fine in some settings, not fine in others. At the risk of opening the door to complexity, it seemed the ideal approach (perhaps the less messy approach, in the long run) would give me the possibility of listing multi-character combinations and indicating their desired replacement. I found a UDF called SUPERSUB that seemed to be just what the doctor ordered:

Function SuperSub(OriginalText As String, rngOldText As Range)
    Dim cel As Range
    Dim strOldText As String, strNewText As String
     'loop through list of old_text used in substitute
    For Each cel In rngOldText.Cells
        strOldText = cel.Value
        strNewText = cel.Offset(0, 1).Value
        OriginalText = Application.WorksheetFunction.Substitute(OriginalText, strOldText, strNewText)
    Next cel
    SuperSub = OriginalText
End Function

But, again, I did not try this script at the time, but was merely preserving its information here for possible future application.

Another post offered a UDF called REMOVEPUNCTUATION where I could create a list of items to replace, and their replacements, within the macro.  The text of that one (with a few illustrative examples) was as follows:

Sub test()
Dim c As Range, t As String
For Each c In Range("A1:A" & Range("A" & Rows.Count).End(xlUp).Row)
    t = UCase(" " & RemovePunctuation(c.Text) & " ")
    t = Replace(t, " AVE ", " AVENUE ")
    t = Replace(t, " BLVD ", " BOULEVARD ")
    t = Replace(t, " BVD ", " BOULEVARD ")
    t = Replace(t, " CYN ", " CANYON ")
    t = Replace(t, " CTR ", " CENTER ")
    t = Replace(t, " CIR ", " CIRCLE ")
    t = Replace(t, " CT ", " COURT ")
    t = Replace(t, " DR ", " DRIVE ")
    t = Replace(t, " FWY ", " FREEWAY ")
    t = Replace(t, " HBR ", " HARBOR ")
    t = Replace(t, " HTS ", " HEIGHTS ")
    t = Replace(t, " HWY ", " HIGHWAY ")
    t = Replace(t, " JCT ", " JUNCTION ")
    t = Replace(t, " LN ", " LANE ")
    t = Replace(t, " MTN ", " MOUNTAIN ")
    t = Replace(t, " PKWY ", " PARKWAY ")
    t = Replace(t, " PL ", " PLACE ")
    t = Replace(t, " PLZ ", " PLAZA ")
    t = Replace(t, " RDG ", " RIDGE ")
    t = Replace(t, " RD ", " ROAD ")
    t = Replace(t, " RTE ", " ROUTE ")
    t = Replace(t, " ST ", " STREET ")
    t = Replace(t, " TRWY ", " THROUGHWAY ")
    t = Replace(t, " TL ", " TRAIL ")
    t = Replace(t, " TPKE ", " TURNPIKE ")
    t = Replace(t, " VLY ", " VALLEY ")
    t = Replace(t, " VLG ", " VILLAGE ")
    t = Replace(t, " APT ", " APARTMENT ")
    t = Replace(t, " APTS ", " APARTMENTS ")
    t = Replace(t, " BLDG ", " BUILDING ")
    t = Replace(t, " FLR ", " FLOOR ")
    t = Replace(t, " OFC ", " OFFICE ")
    t = Replace(t, " OF ", " OFFICE ")
    t = Replace(t, " APT ", " APARTMENT ")
    t = Replace(t, " STE ", " SUITE ")
    t = Replace(t, " N ", " NORTH ")
    t = Replace(t, " E ", " EAST ")
    t = Replace(t, " S ", " SOUTH ")
    t = Replace(t, " W ", " WEST ")
    t = Replace(t, " NE ", " NORTHEAST ")
    t = Replace(t, " SE ", " SOUTHEAST ")
    t = Replace(t, " SW ", " SOUTHWEST ")
    t = Replace(t, " NW ", " NORTHWEST ")
    c = Trim(t)
End Sub

Function RemovePunctuation(r As String) As String
With CreateObject("vbscript.regexp")
    .Pattern = "[^A-Z0-9 ]"
    .IgnoreCase = True
    .Global = True
    RemovePunctuation = .Replace(r, "")
End With
End Function

It seemed that, if I used REMOVEPUNCTUATION to make a large number of changes (more than just the few dozen shown here), I might use Excel to generate the text of the additional lines to be inserted into the REMOVEPUNCTUATION script.

That was the extent of these notes.  I would have to await another situation requiring mass file renaming in order to test these techniques.



I used some aspects of this post in another post for renaming exported emails.


Another post identifies some freeware that may be useful for similar character-cleaning purposes.