
Sunday, June 17, 2012

Excel: Storing the Results of Repeated Calculations

I had a spreadsheet containing a random-number formula. Each time it would calculate, it would generate a different random number, and this would result in a different output value. I wanted to run that same calculation numerous times, so as to see what values tended to result; and I wanted to save those values for later analysis. This post describes how I addressed that problem.

With guidance from Stephen Bullen, I set up my spreadsheet to look like this:



Before describing how it worked, I will explain each of those boldfaced headings:
  • Random Guide. This was a random number that I chose. Basically, I had 3,141 items to select from, representing the rates of poverty in each of 3,141 counties in the United States. I wanted to select only 30 of those 3,141 items each time; but I wanted to do many such selections (of 30 items each time), to see what kind of results I would get, on average -- to see, in other words, how well those 30 items would represent the total of 3,141. (I was interested in this because of a separate statistics problem.) So this "Random Guide" value was the percentage of data points that I would not select each time. In my main table, I set up a Selected column, using a RAND (random number) calculation on each of the 3,141 rows in the main table, to decide whether that particular row would be selected in that particular go-round. (A sketch of that formula appears after this list.)
  • N. This was the number of items that actually got selected in this particular instance. Over the long haul, if I had my numbers right, the random number generator would average out at 30 items per iteration; but there could be quite a bit of variation in the individual case.
  • Mean. Similar to N. This was the mean (average) of the data values that got selected this time around. This was what I was really after, for statistical purposes: I wanted a list of the sample means that would result from many selections of approximately 30 random items from the data table.
  • Zero to Stop. If I put a zero in this cell, calculations would stop. See below.
  • Current Iteration and Next Iteration. If the spreadsheet was now calculating the 12th random selection from the table, Current Iteration would read 12. At the point when I took the snapshot shown above, the calculation of my desired set of 1,000 sample means was nearly complete.
  • Mean Results. This was the mean of the values shown in the Results table. It would show error values the first time through, because there would be some zero values in the Results table. When I took the snapshot shown here, I was on my second pass through the Results table, so the means appeared properly at this point.
  • Results:  Iteration. I set up the Results table to show the results of 1,000 calculations of Mean and N, based on 1,000 random selections of about 30 items from my table of 3,141 items.
  • Results:   Mean. On the first iteration -- that is, the first time I ran the calculation to select a sample of about 30 items -- I got, from those 30 items, a mean poverty rate of about 13.6. So that particular sample was a bit on the high side, when compared to the overall 13.3% rate that I would get from either (a) direct calculation of the mean of all poverty rates shown in the main table or (b) the means of 1,000 samples, as shown on the Mean Results line (above).
  • Results:   N. On the first iteration, the random number generator resulted in the selection of 36 items from the main table. There didn't appear to be a way to restrict this to produce exactly 30 items each time without screwing up the desired randomness of the selection process.
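
As a concrete example of the Selected formula mentioned under Random Guide, here is a minimal sketch, with hypothetical cell addresses:  if the Random Guide value (about 0.9905, for skipping all but 30 of 3,141 rows) sat in cell B1, then each row of the main table could carry a formula like this in its Selected column:

E2  =IF(RAND()>$B$1,"Y","")

Each recalculation redraws every RAND, so on average about 30 of the 3,141 rows would come up "Y" on a given pass.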
So now that I have explained what my spreadsheet was doing, I will explain how it worked. Here is an image of the formulas driving those cells, as displayed here using Tools > Formula Auditing > Formula Auditing Mode (shortcut: Ctrl-Backtick, i.e., Ctrl-`):


The explanations of those formulas are as follows:
  • Random Guide.  See above.
  • N.  This one counted the number of cells that were marked for selection (in the Selected column in the main table -- see Random Guide, above) in a given iteration's random selection.
  • Mean.  The average of the values displayed in the Selected column in the main table.
  • Zero to Stop.  If I put a zero in this cell, the spreadsheet would not enter values into the Results table.  After putting zero here, any other value would restart the count at Iteration 1 upon pressing F9. I would have had to refine the spreadsheet to let me restart at some other location.
  • Current Iteration and Next Iteration.  These cells circularly referred to each other.  That was permissible only after selecting Tools > Options > Calculation tab > Iteration.  I had to set the maximum number of iterations to just 1:  any larger value would cause the spreadsheet to generate lots of duplicative values that would distort my results.  Holding down the F9 key would sometimes insert zero values, so I had to press F9 once for each recalculation.  The differences between these two cells were that (a) Current Iteration checked Zero to Stop before proceeding, and (b) Next Iteration incremented the Current Iteration by one.  (A sketch of these circular formulas appears after this list.)
  • Mean Results.  Explained above.
  • Results:  Mean and  Results:  N.  These formulas looked at the Current Iteration value.  If it was not playing their tune, they would (circularly) keep whatever value they already had:  zero, at the beginning, or whatever had already been calculated for them otherwise.  But if their number was up, they would copy new values from cells I2 and I3.
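
To make those circular references concrete, here is a minimal sketch of how such a setup could be wired.  Apart from I2 and I3 (the Mean and N source cells mentioned above), the addresses are hypothetical:  say Zero to Stop was in G1, Current Iteration in G2, Next Iteration in G3, and a given Results row had its Iteration number in A7, with its Mean and N cells in B7 and C7:

G2  =IF($G$1=0,0,G3)
G3  =G2+1
B7  =IF($G$2=$A7,$I$2,B7)
C7  =IF($G$2=$A7,$I$3,C7)

With Iteration enabled and the maximum set to 1, each press of F9 advances the counter by one and freezes that iteration's Mean and N into the matching Results row.  Setting G1 to zero resets the counter, and entering any other value restarts it at Iteration 1, just as described above.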
The approach described here worked.  It was very tedious to have to hit F9 for each recalculation, given that I was doing it over 3,000 times.  It was not clear why I was not getting the kind of automatic calculation I saw in Bullen's gs_demo.xls spreadsheet.  I was still working on figuring that out.

Saturday, June 16, 2012

Excel Bubble Sort - Arranging Many Cells in Many Columns


In a previous post, and to some extent in a later one, I worked on ways to sort Excel cells in which data started out in an irregular arrangement, like this:



My goal in that case was to get the data items arranged in the original (vertical) order in a single column, like this:



This time around, I had a somewhat different objective. It seemed likely to require me to continue past the outcome just shown, to return the data to their original columns, but without the empty spaces between them, like this:


The question for this post was, how would I do that?


As I reviewed the previous posts, it seemed that the key step was to use Excel's CELL function. In the example just shown, an adjacent table containing formulas like =CELL("address",A1) and =CELL("address",A2) would produce values like $A$1 and $A$2. I could also make those formulas conditional: =IF(A3="","",CELL("address",A3)) would return a blank when A3 was empty. One additional step: I could combine the cell's address and its contents, like this:


The formula for that was =IF(A1="","",CELL("address",A1)&"--"&A1).  Now I could set up a shadow table, matching the original one cell for cell, with the same number of rows and columns, and with each cell containing this kind of CELL formula.  

With that done, I could copy the parallel table to a word processor that would have the ability to remove line breaks.  (The find-and-replace symbol for a line break in Microsoft Word was usually ^p.)  The table I was working on had 100 columns and about 3,000 rows.  Each column had just 30 actual data items, so there were a total of exactly 3,000 (not 300,000) actual data items in that table; all the other cells were blank.  Word 2003 was not able to accommodate this table initially, so I pasted it into Notepad++ and saved it as a text file, and then opened that in Word.  After a series of find-and-replace operations (replacing double tabs (i.e., ^t^t) and so forth), I wound up with a single column of entries, 53 pages long.

I pasted that list back into a different Excel table, temporarily, for further operations.  I parsed the entries into separate columns, using FIND to divide each item into its address and its contents.  In the first cell shown in the example above, the address was $A$1 and the contents were 3.  Now I had the 30 items that had originally been in column A, the 30 items that had been in column B, and so forth, without any of those blank cells.  This resulting set of 3,000 items was  all still just in one long column, but I could sort them in column order; I could take another few steps and sort them in row order; or I could sort them by increasing data item value.
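
Formulas along these lines would do that division (a sketch, assuming a combined entry like $A$1--3 sat in cell A1 of the temporary table):

B1  =LEFT(A1,FIND("--",A1)-1)
C1  =MID(A1,FIND("--",A1)+2,LEN(A1))

B1 extracts the address ($A$1) and C1 the contents (3).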

So at that point, I had achieved approximately what I had tried to do in at least one of the preceding posts.  Now there was the additional step of getting the 30 items pertaining to column A back into column A, leaving me with a table 100 columns wide but only 30 (not 3,000) rows deep.  To do this, I went ahead and isolated the column and row information into separate columns.  It was awkward to put the Excel column letters into proper order -- getting AA after Y, and so forth -- so I set up another column with numbers as substitute column indicators.  To do that, I did a unique data filter to get just a single representative from each column (e.g., AA, AB, ...), and then did a VLOOKUP to the appropriate number in a table built from that unique filter (e.g., Y = 1, Z = 2, AA = 3 ...).  This gave me a table which, after sorting, began like this:


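To illustrate the kind of formulas involved in that breakout (the cell layout here is hypothetical), suppose an address like $AA$13 sat in cell B2:

C2  =MID(B2,2,FIND("$",B2,2)-2)
D2  =VALUE(MID(B2,FIND("$",B2,2)+1,9))

C2 returns the column letters (AA) and D2 the row number (13); the VLOOKUP just described then converts the letters into a sortable number.
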
I further modified this table by adding a Row # column after the existing Row column.  For this project, I didn't care what row the item came from in the original spreadsheet; I just cared that it was sorted in proper row order and that it would appear somewhere within rows 1-30 of my new table.  So the numbers in this new Row # column just ran from 1 to 30, matching the 30 items from a given column in the original spreadsheet (e.g., Y), and then they started over again at 1 and ran up to 30 for the next original column (e.g., Z), and so forth.  This was easy enough to arrange:  after setting up the first set of 1 to 30, the cells in all following rows could just refer to the value appearing 30 rows up.  So now I had a set of 1, 2, 3 ... 30, appearing over and over again, all the way down my new Row # column.  I added a new NAddress column to express the combination of row and column: 


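A sketch of those two columns, with hypothetical positions -- say the substitute column number was in column B, and the Row # values were in column E, with E2:E31 holding 1 through 30:

E32  =E2
F2   =B2&"-"&E2

E32, copied all the way down, repeats the 1-to-30 cycle by referring 30 rows up; F2 builds the NAddress by combining the column number and the new row number (the exact separator is illustrative).
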
With that in place, I could do a lookup to complete the job.  This called for two tables.  The top left corners of those two tables looked like this:



Note the formulas shown for the top left data cell in each of these two tables.  The first table produced the thing that I would be looking for in my VLOOKUP; the second was the final data table.
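
The gist of those two formulas can be sketched as follows (with hypothetical names, and assuming the final table started at cell A1 and NAddress took the column-number-dash-row-number form suggested earlier):  the first table keyed each data value by its NAddress, and each cell of the final 100 x 30 table then looked itself up, along these lines:

=VLOOKUP(COLUMN()&"-"&ROW(),LookupTable,2,FALSE)

where LookupTable would be a two-column range holding the NAddress keys in its first column and the data values in its second.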

I soon discovered that I should have included a VALUE formula in that final table's calculations; the resulting data were behaving like strings, not numbers, so I did have to do that additional transformation.  Otherwise, though, the data checked out OK, so this task was done.

Thursday, March 1, 2012

Excel 2003: An Attempt to Sort Sentences or Filenames by Identifying Multiword Phrases

I had a bunch of filenames.  I wanted to know which two-word phrases appeared most frequently.  The purpose of this effort was to put those filenames into categories.  I hoped that the approach of finding significant two-word phrases in those filenames would help me automate the file-sorting process.  This post describes the first approach I took to identify those two-word phrases.

Parsing the Filenames

I hoped that two-word phrases would be a good size.  I figured they would sort into many essentially one-word phrases (e.g., "a dog" would be trimmed to "dog," which would not have much significance) and a relatively small number of more significant two-word phrases (e.g., "big dog").   I hoped, that is, that I could easily filter the list for common words (a, an, the, if, of ...) and then count what was left.  It seemed, though, that I could still have taken approximately the approach described here if I had been looking for three-word or longer phrases.

The approach would have been similar if I had been examining sentences instead of filenames.  I would have gotten the sentences into individual rows by using Microsoft Word and replacing sentence ends (e.g., ? ! .) with the same punctuation followed by ^p characters, which Word would interpret as the start of a new line, and then pasting the text into Excel.  It might have taken a while to resolve sentence ends, given problems with e.g., "Mr.".  I did not have more than 65,000 filenames, so Excel 2003 would handle it.  Otherwise, I might have needed a newer spreadsheet (e.g., the Calc programs in LibreOffice or OpenOffice).

Consider this sentence or filename:  "I saw a dog - a big dog."  Using spaces as delimiters (as I intended to do), its two-word phrases would be "I saw," "saw a," "a dog," "dog -," "- a," "a big," and "big dog."  This approach would produce some junk, but I hoped it would be easily filtered.  Not worrying about the punctuation seemed faster than trying to anticipate endless punctuation combinations (e.g., "dog -- a").  It would fail in some instances (e.g., "dog--a"), but those too seemed likely to involve insignificant phrases.

That sentence contains seven spaces.  I could count the number of spaces manually.  With many sentences, it would make more sense to use a formula.  One approach would be to use Excel's SUBSTITUTE and LEN functions to calculate the difference in length between the sentence with and without spaces, and then use MAX to see which row has the largest number of spaces.  TRIM would remove excess spaces.
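
A sketch of that counting approach, assuming the sentence sat in cell A3:

B3  =LEN(TRIM(A3))-LEN(SUBSTITUTE(TRIM(A3)," ",""))

Copied down the list, with a cell elsewhere holding =MAX(B3:B65000), this would reveal the largest space count -- which, as discussed next, dictates how many column groups the spreadsheet needs.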

Since each two-word pair would contain one space, the sample sentence would contain seven pairs of words.  Therefore, my spreadsheet needed enough columns to accommodate seven two-word phrases.  The approach I took was to set up the spreadsheet like this:



The top row indicated the number of the two-word phrase.  The first one was "I saw," in group 1.  The second row indicated the number of the space found.  For instance, the first space was at character position 2 (i.e., right after "I"), and the second space appeared at position 6 (after "saw").  This image shows only the first three groups; I would need four more.  As it shows, the first group required calculation of only two numbers because its starting point (character position 1) was already known.  I preferred this approach of breaking out the calculations visibly, rather than jamming them all together in one long, complicated formula.  It was usually faster this way, and it made it easier to spot errors.

I used formulas in the header rows to facilitate copying.  For instance, the formula in cell F2 was =F1, and in F1 it was =C1.  That way, once I worked out my formulas in row 3, I could select and copy cells F1 through I3 repeatedly across the spreadsheet, until I had enough column groups to accommodate all of the spaces in my longest filename or sentence.

To prepare the following list of formulas for the cells on row 3, I used a macro.  Those formulas were as follows:

C3  =FIND(" ",$A3)
D3  =FIND(" ",$A3,C3+1)
E3  =TRIM(LEFT(A3,D3))
F3  =IF(D3=LEN($A3),"",C3)
G3  =IF(F3="","",FIND(" ",$A3,F3+1))
H3  =IF(G3="","",IF(ISERROR(FIND(" ",$A3,G3+1)),LEN($A3),FIND(" ",$A3,G3+1)))
I3  =IF(H3="","",TRIM(MID($A3,F3,H3-F3+1)))
J3  =IF(H3=LEN($A3),"",G3)
K3  =IF(J3="","",FIND(" ",$A3,J3+1))
L3  =IF(K3="","",IF(ISERROR(FIND(" ",$A3,K3+1)),LEN($A3),FIND(" ",$A3,K3+1)))
M3  =IF(L3="","",TRIM(MID($A3,J3,L3-J3+1)))

So anyway, now I had a spreadsheet containing the two-word phrases that I wanted to sort and examine.  I made a backup and proceeded to convert the spreadsheet's formulas to values.  With the formulas frozen as values, I could sort and delete rows and columns without changing the results shown.  There appeared to be functions for this purpose in ASAP Utilities and possibly elsewhere.  An easy way to do this was to go to the top left-hand corner of the spreadsheet, hit Ctrl-A to select it all, and then hit Edit > Copy, Edit > Paste Special > Values > Enter > Enter.  I had usually found that the same keystrokes worked in later versions of Excel.  In this case, the keystrokes were Alt-E, C, Alt-E, S, V, Enter, Enter.

However achieved, the point is that cell C3 no longer contained the formula =FIND(" ",$A3).  Now it just contained the number 2, and it would continue to contain that number even if I deleted cell A3.  With that done, I could now delete the columns containing the numbers that I had used to detect my two-word phrases (e.g., columns C, D, F, G, H ...).  I sped up that process by doing a column sort (select all (Ctrl-A) and then Data > Sort > Options > Sort Left to Right > Row 2).  I put numbers at the tops of the Phrase columns, so as to keep their order straight.

Lining Up the Phrases

At this point, the cells toward the right side of my spreadsheet looked like this:

That is, the rows did not end at the same place.  In this illustration, the filename on row 2 had 17 two-word phrases, while the one on row 3 had only 16.  I wanted them all in a single column.  I wasn't sure if I needed them in proper order (in this example, P2, Q2, R2, followed by P3, Q3, R3), but I preferred that if possible.

I had tackled a similar problem previously.  That post still provided more details than I have written here.  But in some regards, I now saw a way to simplify that process a bit.  I deleted row 1, containing the column headings, as well as column 1, containing the original filenames.  They seemed unnecessary and potentially confusing at this point.  Then I created another worksheet (Insert > Worksheet).  Its default name was Sheet2.  In cell A1 in that worksheet, I entered a formula that referred to the corresponding cell in Sheet1, where my data were located.  The formula that I put into cell A1 in Sheet2 was =IF(LEN(Sheet1!A1)>0,CELL("address",Sheet1!A1),"").

That formula produced this output in Sheet2, cell A1:  [xA8.xls]Sheet1!$A$1.  (xA8.xls was the name of my spreadsheet file.)  I was going to use the preliminary information, the part that said [xA8.xls]Sheet1!, to tell the computer where a new entry began.  In other words, if the spreadsheet had given me just $A$1, I would have changed my formula so that Sheet2, cell A1 would contain something like this:  #$A$1.  Any odd character would have been sufficient in place of # as long as it gave the computer something unique to look for.

Next, I copied that formula from Sheet2, cell A1 to each cell that corresponded to my data in Sheet1.  In other words, in Sheet1 I hit Ctrl-End to discover that the bottom right corner of my spreadsheet was at cell Q14998.  So now, in Sheet2, I copied that formula from cell A1 to all cells down to Q14998.  A fast way to do that was to hit F5 and type Q14998 and hit Enter.  That took me to Q14998.  I put an X there.  Then I went back to A1 and hit Ctrl-C > Ctrl-Shift-End > Enter.  Boom!  Done.  I went back to Q14998 to double-check.  Then Alt-O, C, A to autofit the column widths.

Now I had a bunch of cells that had ugly stuff like cell A1 (above):  [xA8.xls]Sheet1!$A$1.  I hit Ctrl-A > Ctrl-C to copy it all.  I opened Notepad and hit Ctrl-V to paste it.  (If there had been too much data for Notepad, I think something like Notepad++ would have worked.)  The purpose of putting it into Notepad was to remove the column formatting.  (There were delays with some of these operations, as the Windows clipboard and the various programs wrapped their heads around the size of the task.)

Now I copied it all from Notepad to Microsoft Word.  Any editor would have worked, as long as it had a way of doing global search-and-replace operations that would produce line breaks.  That is, in Word, I replaced all instances of the ugly stuff (i.e., [xA8.xls]Sheet1!) with a newline character (^p).  Word reported that it had made about 170,000 replacements.  This gave me the desired list, starting like this:

$A$1
$B$1  
$C$1  
$D$1  

There were invisible tabs after the entries (e.g., it was actually $A$1 tab), and there were also some blank lines.  I cleaned these up with various ^p^p and ^t^p replace operations, repeated until there were no more.  The file was long -- about 3,700 pages.

I went back to Excel and deleted Sheet2:  I didn't need it anymore, and all its calculations would slow things down.  I created a new Sheet2 and pasted the results from Word into it.  I wasn't sure if Word had inserted any other funky invisible characters, so I ran the results through Notepad on the way back into Excel, just to be sure.

But now, oops, new problem.  My old Excel 2003 couldn't accommodate 170,000 rows.  It gave me an error:  "File not loaded completely."  I took a look.  Row 65536 contained a reference to $F$5911.  I could have transferred the whole project over to LibreOffice, which could handle 170,000 rows, but that would have been premature.  I had kept Notepad open, so now I searched for that item $F$5911 in the data that were still sitting there in Notepad.  I deleted everything down to that point, copied, and repeated in an adjacent column.  I had to do this a couple of times.  So now I had a spreadsheet that was three columns wide and 65,536 rows long.  Warning: do not print your results.

I inserted blank columns between these columns of output.  Those blank columns would contain the text corresponding to those cell references.  For instance, in Sheet2 cell A1, I now had $A$1.  So in the blank cell B1, I put this formula:  =INDIRECT("Sheet1!"&A1).  I copied that formula to all the other cells in those adjacent columns.  That gave me three sets of two columns each:  the first provided the reference, and the second contained my actual two-word phrase.

At this point, I froze all those values, as above, and deleted the reference columns.  So now I just had three columns containing two-word phrases.  In short, I had managed to isolate the phrases.  Now it was time to start slicing and dicing them.

Working with Unique Phrases

I wanted to filter those two-word phrases for unique ones.  For example, if there were a thousand two-word phrases that said "Email from," I wanted just one entry of that nature, with a numerical companion cell containing the number 1,000.
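
The counting side of that was straightforward in principle.  A sketch, with hypothetical ranges:  given a unique phrase in cell A2 and the full list of phrases in column A of Sheet1, a companion cell could hold this:

B2  =COUNTIF(Sheet1!$A$1:$A$65000,A2)

The filtering for unique phrases, described next, was the harder part at this scale.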

To do that filtering, I tried to use LibreOffice Calc, where I could get all that material into column A.  But Calc hung for about 15 hours before producing output.  (Excel also hung at a number of points in this whole process.)  I did not test Calc's output, but it appeared to have given me some repetitive items in that supposedly nonrepetitive list.  I went back to Excel, filtering each of the three columns and then combining the short output into one column and filtering that.  This gave me 7,539 unique two-word phrases.  That list passed the test for duplicates, but counts indicated that it was not accounting for all 169,891 two-word phrases.  The formulas to do those counts really slowed things down, so I deleted them for the moment.

I thought that it might help to clean up unnecessary characters.  I sorted each of the three columns according to starting and ending characters, so as to detect extraneous ones (e.g., various kinds of brackets, quotation marks, and other symbols).  I also did some global (Ctrl-H) searches in Excel to the same effect.  In addition, I deleted a few meaningless two-word pairs.  These efforts exposed me to the fact that many two-word phrases would consist, at least in part, of single letters (e.g., "A dog," "Roger P").  It seemed that I probably should have done global replace operations, even before breaking out the two-word phrases, so as to promote meaningful two-word combinations (e.g., "Roger P" and "P Smith" might be less recognizable than "Roger Smith").  Possibly a preliminary search for three- and even four-word proper nouns would have been a good approach there.

I also decided, along the way, that it would make sense to tolerate meaningful one-word terms (e.g., the "Pentium" word that would remain after removing the unhelpful "a" from "a Pentium").  That predictable conclusion raised questions about my sequence of steps generally.

After those cleanup efforts, I filtered again for unique two-word phrases and counted their occurrences.  Working with three source columns required some extra steps, mostly to eliminate duplicates, but ultimately my count of items found from searches for unique items was within 1% of the actual number of such items.  In other words, my filtering for unique items substantially accounted for all items.  Specifically, I had 163,391 two-word phrases, and in some apparent overcounting, my list of unique phrases accounted for 163,772 such phrases.

Because those counts did not jibe perfectly, I decided to take an elimination approach -- that is, to remove the two-word phrases that lacked significance, rather than selecting those that had it.  To help me in this process, I divided the two-word phrases into individual words, and then prepared to test each of them against a list of relatively insignificant words that I had cooked up for the occasion.  Before doing that, I had to screen again for stray punctuation (now that e.g., a comma had emerged in "soon, however" such that "soon," would not be found in an elimination of "soon") -- making it increasingly appealing to find some way of doing that earlier in the process.  Another step that would have been beneficially taken earlier, to the extent possible, would have been to convert date references (e.g., "2-13-09 project" vs. "Feb. 13 project") into a standard format, or else decide that none were significant.  It seemed, generally, that renaming the files to standardize various names and terms (or at least revising aliases for them) should have occurred earlier in the process.  Such changes seemed likely to shape this sentence-parsing process.  For instance, if full names were to be used (e.g., "Roger P. Smith"), then a two- rather than three-word focus could be wrongheaded.

Project Abortion

At this point, I had an unpleasant realization.  The need to pare down my spreadsheet, so as not to have to wait forever for calculations, and the loss of concentration due to long waits and the passage of time, had produced a situation where I had not realized that, in the past several hours, my work had ceased to be directly linkable back to the original list of files.  There was no index.  Worse, there had been global changes to remove troublesome punctuation, further reducing direct linkability.  I could continue with this effort, using backups to reconstruct the links.  But there seemed to be good reasons to reconceive the whole approach.

One emerging conclusion was that there just might not be a shortcut to a good rubric or other structured approach to the renaming of files, the coding of sentences, or the constructing of aliases for bits of text being examined in this kind of effort.  One question in that regard was whether there would be a good way to identify and change troublesome characters or combinations of characters up front.  That question led me to a second try.

Using Excel to Clean Unwanted Characters from Filenames or Other Text

I was trying to organize some files.  This post contains some notes I made on ways of using Microsoft Excel to change a list of filenames.  The renaming project drew on file renaming techniques described elsewhere.

In various attempts in this process, I came up against situations where inconsistent punctuation and other filename characteristics made the process unnecessarily complicated.  For instance, file names might refer to "Barack Obama," "Barack H. Obama," "President Obama," and so forth.  These would all be alphabetized differently.  Duplicates might be concealed, and files that belonged together topically might be overlooked.

In part, that was an issue of choosing and using consistent file naming conventions.  But at this point, given that the files were already named, there was an issue of how to deal with punctuation and other non-alphanumeric characters.  I wondered if there was a relatively straightforward way to get my arms around the varieties of character usage within a particular corpus.

The first task seemed to be identifying what unwanted characters were in use. Taking Microsoft Excel as my weapon du jour, I made a list of the filenames that I might want to change, using DIR on the Windows command line (see the comments following another recent post). I pasted that list of filenames into Excel.

Next, I ran a search and found several ingenious approaches to related problems. One would use a wildcard search-and-replace in Microsoft Word:   search for ? (making sure to indicate that this would be a "Use wildcards" search) and replace with ^&^p to put the same character (that is, any character) on a line by itself; then paste the whole text into Excel and do a unique filter to get all of the unique characters found in the text. This did not seem likely to work for my purposes, unfortunately. For one thing, Excel 2003 was limited to 65,536 rows, and there could easily be more characters than that in a text being analyzed.  I had also found that spreadsheets capable of handling more rows could be slow at it.

There were other approaches that used an array formula or a script. I was not as comfortable with those that I did not understand, because it would be difficult for me to work through their logic and make sure they were really doing what I wanted them to do. An unnoticed mistake could have unfortunate effects when altering a large or important text, or when renaming thousands of files.

Another possibility came to mind as I looked at the suggestion to use MoreFunc's REGEX.COUNT function (alternately, Excel's SUBSTITUTE or possibly MATCH) to produce character-by-character counts.  I ran a search and found some interesting variations in SUBSTITUTE, including nested and case-insensitive substitutions, along with the CLEAN function. One solution, it seemed, would be to use 26 columns containing formulas like =SUBSTITUTE(UPPER(A1),"A","") -- one column per letter -- to replace all upper- and lower-case letters with nothing, another 10 columns to do the same for numbers, and then see what's left.

With SUBSTITUTE, one possibility was to set up a map. Put the rows of text being analyzed in one worksheet ("Sheet1"). Create a second worksheet ("Sheet2") within the same Excel file.  In Sheet2, make each row relate to the row of the same number in Sheet1. So, for example, if there was a text string in cell A5 in Sheet1, there would be a row of characters across row 5 in Sheet2. Each cell in that row would count the occurrences of a specific character. For example, the 65th column would detect occurrences of the character whose Excel code happened to be 65 (that is, the letter "A," producible by =CHAR(65)).  The chief problem with this approach was that its calculations would take a year.  It would also max out Excel 2003's limit of 256 columns.  It would have the advantage of transparency:  I would be able to see much of what was happening in the data.
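
A sketch of what one cell of such a map might contain -- here for row 5, in whatever Sheet2 column matched the character code being counted (the layout is hypothetical):

=LEN(Sheet1!$A5)-LEN(SUBSTITUTE(Sheet1!$A5,CHAR(COLUMN()),""))

Since SUBSTITUTE is case-sensitive, code 65 ("A") and code 97 ("a") would be counted separately, which is what a character census wants.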

I found a user-defined function (UDF) -- what I had been referring to as a script -- called SUBSTITUTELIST.  This UDF would let me indicate a list of characters to simply remove.  In naming files, I would definitely not want to include many of the odd characters that were readily visible if, say, I opened a non-text file in Notepad.



The first step, with SUBSTITUTELIST, was to define it as a function that I could use in spreadsheet formulas.  To do that, using the steps described atop another post, I copied and pasted this text into an Excel module:

Function SubstituteList(rString As String, rList As Range) As String
    'Remove every character (or string) listed in rList from rString.
    Dim rCell As Range
    For Each rCell In rList
        rString = Replace(rString, rCell.Text, "")
    Next rCell
    SubstituteList = rString
End Function
I didn't write that code, and was only partly clear on what it said.  It appeared, though, that its middle line was going to replace each occurrence of rCell.Text -- that is, each character listed in rList -- with nothing (""). In other words, the listed characters were being removed from rString. Now I had to tell it what characters to put in that list. In the spreadsheet listing the filenames I wanted to work on, I opened a new worksheet, called Sheet2, and put the numbers 1 through 255 in cells A1 to A255. (Two easy ways to insert a series of numbers: use =A1+1 in cell A2, and so forth; or enter the first two numbers (i.e., 1 and 2) manually, highlight the cells to be filled, and then use Edit > Fill > Series. The Alt command option would probably still work in post-2003 versions of Excel: Alt-E, I, S, Enter.)
Then, in Sheet2 cell B1, I typed =CHAR(A1), and copied it down to B255. So now column B contained the actual characters I wanted to look at, whose numeric codes were shown in column A. Finally, column C contained the characters to be removed from consideration.  For example, in cell C127, I typed =B127.  That put a copy of CHAR(127) into column C.  SUBSTITUTELIST was going to remove anything in column C.  So that particular character would be removed from the cell in Sheet1 where I was going to be using SUBSTITUTELIST.  Depending on the situation, I might or might not want to remove all non-alphanumeric characters.  For instance, I thought that what I would really want to do with something like the euro character (€), in a filename, would not be to eliminate it, but rather to replace it (e.g., €35 becomes EUR 35).

To use SUBSTITUTELIST, I put my text to be analyzed in column A of Sheet1. In B1, I entered this: =SUBSTITUTELIST(A1,Sheet2!$C$1:$C$255). I copied that formula all the way down, next to each text string to be examined. That gave me a collection of funky characters in Sheet1 column B. Originally, I added TRIM to the SUBSTITUTELIST formula just cited, but then I realized that I actually wanted to see the pattern of spaces and other non-alphanumeric characters.

SUBSTITUTELIST seemed to have potential for single-character eliminations, but it didn't seem capable of substitutions involving multiple-character strings (e.g., EUR in place of €).  It looked like a relatively simple change to SUBSTITUTELIST might do the job, but I had little ability for that sort of work at this point, so I posted a question on it.  One reply to that question suggested a UDF called CLEANUPTEXT.  The contents of that UDF (created in the same manner as SUBSTITUTELIST, above) were as follows:

Option Explicit
Sub cleanupText()

    Dim allTxt() As Variant, sublist() As Variant
    Dim i As Long, j As Long, k As Long, tdots As Integer

    'Store data from sheets in arrays.
    allTxt = Sheets(1).UsedRange.Value
    sublist = Sheets(2).UsedRange.Offset(1, 0).Resize(Sheets(2).UsedRange.Rows.Count - 1, Sheets(2).UsedRange.Columns.Count).Value

    For i = 1 To UBound(allTxt, 1)
        For j = 1 To UBound(allTxt, 2)
            'Loop through replacement terms and make replacements to data in array.
            For k = 1 To UBound(sublist, 1)
                allTxt(i, j) = Replace(allTxt(i, j), sublist(k, 1), sublist(k, 2))
            Next k
            allTxt(i, j) = Trim(allTxt(i, j))

            'Remove series of trailing periods.
            If Right(allTxt(i, j), 1) = "." Then
                tdots = 1
            Else
                tdots = 0
            End If
            Do While tdots = 1
                allTxt(i, j) = Left(allTxt(i, j), Len(allTxt(i, j)) - 1)
                If Right(allTxt(i, j), 1) = "." Then
                    tdots = 1
                Else
                    tdots = 0
                End If
            Loop

            allTxt(i, j) = Trim(allTxt(i, j))
        Next j
    Next i

    'Print cleaned up results in array onto sheet.
    ActiveSheet.UsedRange.Value = allTxt
End Sub


The instructions accompanying that UDF seemed to indicate that it required a separate worksheet, called Sheet2, containing the "before" terms in column A of Sheet2, and the "after" terms in column B of that sheet; and Sheet1 had to be the name of the tab containing the text being operated on.  In response to my question, the lines under the "Remove series of trailing periods" comment (above) were designed to remove trailing dots from the text being modified.  I did not actually try this UDF, but I hoped that I would not have to write similar lines of code for other kinds of unwanted text strings -- that, in other words, the column A / column B replacement approach would suffice.  If so, I might try to simplify this UDF by removing those trailing-dot lines.

On a separate note, I had noticed that some of the files in my list looked like their names were identical except for punctuation, so I added a column, in my spreadsheet that listed filenames, for this formula:  =LOWER(substitutelist(M2,Sheet2!$C$1:$C$255)).  M2 contained the existing filename, or part thereof, and Sheet2 column C was set to eliminate all non-alphanumeric characters, as shown above.  This step did succeed in highlighting some duplicate filenames and other issues.

Another problem was that, in some cases, I would have to rename files before proceeding with something like SUBSTITUTELIST.  The specific problem was that some characters in filenames would defeat batch commands.  That is, I could have a command to change a filename, perhaps in response to an insight emerging from SUBSTITUTELIST; but the command might not run.  So I would have to change or delete those troublesome characters first.  (I often used Excel formulas to generate commands that I could run in a batch file to rename numerous files.  The basic idea here was that I would put the existing filename in cell A2 of an empty worksheet, and the desired filename in cell B2, and then I would enter this formula in cell C2:  ="ren "&CHAR(34)&A2&CHAR(34)&" "&B2. (If the new filename in B2 had spaces in it, then it too would have to be surrounded by CHAR(34) instances, as done for A2 in that formula.)  The particular characters at issue here were percent symbols and possibly exclamation marks.  The commands were right.  They would run if I typed them into the command line, or copied and pasted them there.  But they wouldn't work from within a batch file:  the batch interpreter treats a percent sign as introducing a variable, so a literal percent must be doubled (%%), and exclamation marks can similarly be consumed when delayed expansion is on.  To address this concern, I just deleted the exclamation marks, and I replaced the percent symbols with "pct" (e.g., 45% became 45pct).
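
For illustration (the filename here is hypothetical), a command like

ren "Report 45% final.doc" "Report 45pct final.doc"

would run fine when typed at the prompt, but inside a batch file the literal percent sign would have to be doubled to survive the batch parser:

ren "Report 45%% final.doc" "Report 45pct final.doc"

Since my Excel-generated commands did no such doubling, changing the filenames themselves was the simpler route.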

In some cases, I would want a tool that could replace one or more undesirable characters, in a single bulk process, with one or more preferred characters.  In part, this was a simple interest in transliteration or Romanization (i.e., converting foreign-language characters to the standard English alphabet). There would be at least two reasons to do that. One was that I had run into some file handling issues with some such characters. I recalled one recent case where something that looked like a C (maybe ḉ or ć or ĉ or ċ or č) was not working correctly with some backup program or something. The other reason was that I might not remember which character to use when doing filename searches.  So, for instance, if I was looking for all of the files containing a copy of someone's resume, I might or might not want to have to search for both "resume" and "résumé."  There were many Romanization tables for various languages, possibly best approached by finding conversion tables for the thousands of possible Unicode characters.  This could take further attention, as there were multiple Unicode encodings, not all of which might be consistent with Excel's character coding.  In filenames, especially, examples involving several characters could include "pct" (above) and the replacement of $ with USD and of © with (C).  One readability example, arising where HTML has become involved, would be to replace the HTML code &#39; with the apostrophe it represents.

As I thought further about the problem, I realized that characters might be unwanted in context, not in isolation. For instance, the hyphen ("-") could be fine in some settings, not fine in others. At the risk of opening the door to complexity, it seemed the ideal approach (perhaps the less messy approach, in the long run) would give me the possibility of listing multi-character combinations and indicating their desired replacement. I found a UDF called SUPERSUB that seemed to be just what the doctor ordered:

Function SuperSub(OriginalText As String, rngOldText As Range)
    Dim cel As Range
    Dim strOldText As String, strNewText As String
     'loop through list of old_text used in substitute
    For Each cel In rngOldText.Cells
        strOldText = cel.Value
        strNewText = cel.Offset(0, 1).Value
        OriginalText = Application.WorksheetFunction.Substitute(OriginalText, strOldText, strNewText)
    Next cel
    SuperSub = OriginalText
End Function
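
Judging from the code, usage would be along these lines (untested; the ranges are hypothetical):  with the old strings in Sheet2!A1:A10 and their replacements beside them in column B, a formula such as

=SuperSub(A1,Sheet2!$A$1:$A$10)

would apply each substitution in turn to the text in A1.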


Again, I did not try this script at the time; I was merely preserving its information here for possible future application.

Another post offered a UDF called REMOVEPUNCTUATION, with which I could create a list of items to replace, and their replacements, within the macro itself.  The text of that one (with a few illustrative examples) was as follows:

Sub test()
Dim c As Range, t As String
For Each c In Range("A1:A" & Range("A" & Rows.Count).End(xlUp).Row)
    t = UCase(" " & RemovePunctuation(c.Text) & " ")
    t = Replace(t, " AVE ", " AVENUE ")
    t = Replace(t, " BLVD ", " BOULEVARD ")
    t = Replace(t, " BVD ", " BOULEVARD ")
    t = Replace(t, " CYN ", " CANYON ")
    t = Replace(t, " CTR ", " CENTER ")
    t = Replace(t, " CIR ", " CIRCLE ")
    t = Replace(t, " CT ", " COURT ")
    t = Replace(t, " DR ", " DRIVE ")
    t = Replace(t, " FWY ", " FREEWAY ")
    t = Replace(t, " HBR ", " HARBOR ")
    t = Replace(t, " HTS ", " HEIGHTS ")
    t = Replace(t, " HWY ", " HIGHWAY ")
    t = Replace(t, " JCT ", " JUNCTION ")
    t = Replace(t, " LN ", " LANE ")
    t = Replace(t, " MTN ", " MOUNTAIN ")
    t = Replace(t, " PKWY ", " PARKWAY ")
    t = Replace(t, " PL ", " PLACE ")
    t = Replace(t, " PLZ ", " PLAZA ")
    t = Replace(t, " RDG ", " RIDGE ")
    t = Replace(t, " RD ", " ROAD ")
    t = Replace(t, " RTE ", " ROUTE ")
    t = Replace(t, " ST ", " STREET ")
    t = Replace(t, " TRWY ", " THROUGHWAY ")
    t = Replace(t, " TL ", " TRAIL ")
    t = Replace(t, " TPKE ", " TURNPIKE ")
    t = Replace(t, " VLY ", " VALLEY ")
    t = Replace(t, " VLG ", " VILLAGE ")
    t = Replace(t, " APT ", " APARTMENT ")
    t = Replace(t, " APTS ", " APARTMENTS ")
    t = Replace(t, " BLDG ", " BUILDING ")
    t = Replace(t, " FLR ", " FLOOR ")
    t = Replace(t, " OFC ", " OFFICE ")
    t = Replace(t, " OF ", " OFFICE ")
    t = Replace(t, " STE ", " SUITE ")
    t = Replace(t, " N ", " NORTH ")
    t = Replace(t, " E ", " EAST ")
    t = Replace(t, " S ", " SOUTH ")
    t = Replace(t, " W ", " WEST ")
    t = Replace(t, " NE ", " NORTHEAST ")
    t = Replace(t, " SE ", " SOUTHEAST ")
    t = Replace(t, " SW ", " SOUTHWEST ")
    t = Replace(t, " NW ", " NORTHWEST ")
    c = Trim(t)
Next
End Sub


Function RemovePunctuation(r As String) As String
With CreateObject("vbscript.regexp")
    .Pattern = "[^A-Z0-9 ]"
    .IgnoreCase = True
    .Global = True
    RemovePunctuation = .Replace(r, "")
End With
End Function


It seemed that, if I used REMOVEPUNCTUATION to make a large number of changes (more than just the few dozen shown here), I might use Excel to generate the text of the additional lines to be inserted into the REMOVEPUNCTUATION script.
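
A sketch of such a generator formula, assuming the abbreviations were listed in column A of a scratch sheet and their expansions in column B:

C1  ="    t = Replace(t, "" "&A1&" "", "" "&B1&" "")"

With AVE in A1 and AVENUE in B1, that formula yields the line t = Replace(t, " AVE ", " AVENUE "), ready to paste into the macro.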

That was the extent of these notes.  I would have to await another situation requiring mass file renaming in order to test these techniques.

Wednesday, February 29, 2012

Excel 2003: Print or Export the Formulas Used in Each Cell

I had a spreadsheet in Excel 2003.  (I suspect the approach used here would also work in other versions of Excel, but I have not tried it.)  I wanted to print out the formulas used in each cell.  I did a couple of searches and wound up in a thread where they were advising me to use a macro for this purpose.  The steps I used to set up the macro were similar to those that I had used for another Excel macro:

  1. Close all Excel files other than the one you're working on.
  2. Go into Tools > Macro > Visual Basic Editor > Insert > Module.
  3. Copy and paste macro text (see below) into the window.
  4. Go to File > Close and return to Microsoft Excel.
  5. In this case, I used the macro by going into Tools > Macro > Macros and running the ListFormulas macro.
The text of the macro -- what I copied and pasted into the module window -- was as follows:
Sub ListFormulas()
    Dim FormulaCells As Range, Cell As Range
    Dim FormulaSheet As Worksheet
    Dim Row As Integer
    
'   Create a Range object for all formula cells
    On Error Resume Next
    Set FormulaCells = Range("A1").SpecialCells(xlFormulas, 23)
    
'   Exit if no formulas are found
    If FormulaCells Is Nothing Then
        MsgBox "No Formulas."
        Exit Sub
    End If
    
'   Add a new worksheet
    Application.ScreenUpdating = False
    Set FormulaSheet = ActiveWorkbook.Worksheets.Add
    FormulaSheet.Name = "Formulas in " & FormulaCells.Parent.Name
    

'   Set up the column headings
    With FormulaSheet
        .Range("A1") = "Address"
        .Range("B1") = "Formula"
        .Range("C1") = "Value"
        .Range("A1:C1").Font.Bold = True
    End With
    
'   Process each formula
    Row = 2
    For Each Cell In FormulaCells
        Application.StatusBar = Format((Row - 1) / FormulaCells.Count, "0%")
        With FormulaSheet
            .Cells(Row, 1) = Cell.Address _
                (RowAbsolute:=False, ColumnAbsolute:=False)
            .Cells(Row, 2) = " " & Cell.Formula
            .Cells(Row, 3) = Cell.Value
            Row = Row + 1
        End With
    Next Cell
    
'   Adjust column widths
    FormulaSheet.Columns("A:C").AutoFit
    Application.StatusBar = False
End Sub 
(Note that the format of this blog may wrap some lines.  Copying and pasting may yield better results than retyping.)  The author of that code, John Walkenbach, also offered a Power Utility Pak ($40) that contained this and other tools.  I had installed a couple of freeware utility collections -- ASAP Utilities and Morefunc -- and I hardly ever used them.  I checked the list of tools in ASAP Utilities, just in case, but didn't find anything quite along these lines.  A quick glance revealed no list of Morefunc utilities.

When I ran the macro (step 5, above), it seemed to hang my machine.  I was using a fairly large spreadsheet -- I probably should have tried it on something smaller -- but instead I went to bed.  I didn't know how long it took, but it worked.  When I awoke, it had created a new worksheet (i.e., a new tab at the bottom of the spreadsheet), with three columns:  Address (e.g., F2), Formula (e.g., =C2), and Value (e.g., 17).

Tuesday, January 3, 2012

Converting Scattered WMA Files to MP3

I had .WMA files scattered around my hard drive.  I wanted to convert them to .MP3.  I could search for *.WMA, using a file finder or search tool like Everything, thereby seeing that those files were in many different folders.  Having already sorted them, I didn't want to move them anywhere for the format conversion.  I wanted to convert them right where they were.  A command-line tool would do this.  The general form of the command would be like this:  PROGRAM SOURCE TARGET OPTIONS.  For PROGRAM, I would enter the name of the command-line conversion program that I was going to be using.  For SOURCE and TARGET, I would enter the full pathname (i.e., the name of the folder plus the name of the file, like "D:\Folder\File to Convert.wma," where the target would end in mp3 rather than wma).  OPTIONS would be specified by the conversion program.  For instance, there might be an option allowing me to indicate that I wanted the resulting MP3 file to be 64 kbps.

The problem was, I didn't have a command-line WMA to MP3 conversion tool.  I ran a search and wound up trying the free Boxoft WMA to MP3 Converter.  (They also had lots of other free and paid conversion and file manipulation programs.)  When I ran their converter, it steered me to an instruction file that inspired me to compose the following command (all on one line):

AlltoMp3Cmd "D:\Folder\Filename.wma" "D:\Folder\Filename.mp3" -B56
I had to use quotation marks around the source and target names in cases where the path or file names contained spaces (not strictly necessary in this particular example).  The -B56 option was supposed to tell it to produce a 56 kbps MP3.  (I also tried it with a space:  "-B 56".)  I was able to produce similar commands en masse, for all of the WMAs that I wanted to convert, by exporting the results of the *.WMA search from Everything to a text file called wmalist.txt, making sure to remove entries for files that I did not want to convert.  (At the root of each drive containing files of interest, I could also have used this command, assuming wmalist.txt did not already exist:  dir *.wma /b /s >> D:\wmalist.txt.)  I then massaged the contents of wmalist.txt using Microsoft Excel.  So now I had all of these AlltoMp3Cmd commands ready to run.  I copied them all into a Notepad file named Renamer.bat.  All I had to do was double-click on it in Windows Explorer and it would run.
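
The massaging amounted to formulas of roughly this kind (a sketch, assuming a full pathname like D:\Folder\Filename.wma sat in cell A1):

B1  ="AlltoMp3Cmd "&CHAR(34)&A1&CHAR(34)&" "&CHAR(34)&LEFT(A1,LEN(A1)-3)&"mp3"&CHAR(34)&" -B56"

For the example above, that produces exactly the AlltoMp3Cmd line shown.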

I decided to try Renamer.bat with just one WMA file.  So I created another file, X.bat, with just one line in it, like the line shown above.  To run X.bat from the command line, so that I could see what it was doing, I would need a command window that was ready to execute commands in the folder where X.bat was located.  Problem:  X.bat was not in the same folder as Boxoft's AlltoMp3Cmd.exe executable program, so X.bat would fail.  If I didn't want to get into changing the computer's PATH, I could either put X.bat in the Boxoft program folder or I could copy AlltoMp3Cmd.exe to the folder where X.bat was located.

Either way, I needed to open a command window in one of those two folders, so as to run X.bat.  I could start from scratch (Start > Run > cmd) and use commands (e.g., "D:" would take me to drive D and "cd \Folder" would take me to the folder where Filename.wma was located), or I could use Ultimate Windows Tweaker to install a right-click option to open a command window in any folder.  I had already done the latter, so this step was easy.

Once I had sorted out all that, I was ready to try running X.bat.  But when I did, it crashed the AlltoMp3Cmd.exe program.  If I clicked on Cancel when I got the crash dialog, the command window said this:
Exception Exception in module AlltoMp3Cmd.exe at 0005B4E1.
Installation file incorrect. Please re-install it!.
But reinstalling the Boxoft program didn't help.  I sent them a note to let them know of this problem and decided to try other approaches.  One possibility was that their program was suitable for Windows XP but not Windows 7, which I was using.  It didn't seem to be a question of how the main program was installed, since the error message was referring specifically to the AlltoMp3Cmd.exe command-line executable (which presumably would be the same on any Windows system).

I decided to try running it in a Windows XP virtual machine (VM).  I had already installed Microsoft's Windows Virtual PC, which came with a WinXP VM, so I fired it up to try the same command line in the same folder.  To move quickly to the proper folder in the WinXP command window, I ran my trusty old RegTweak2.reg file, created in Notepad, to install a right-click option to open a command window in any folder in Windows Explorer.  But when I tried to use it, I got an error:
'\\tsclient\D\Folder Name\Subfolder Name'
CMD.EXE was started with the above path as the current directory.
UNC paths are not supported.  Defaulting to Windows directory.
'\\tsclient\D\Folder Name\Subfolder Name'
CMD does not support UNC paths as current directories.
A bit more playing around persuaded me that what this message meant was that command-line work in the VM would have to be done on what the VM considered a "real" (actually a virtual) drive -- in other words, drive C.  So I put copies of X.bat and AlltoMp3Cmd.exe into the VM's drive C, in a new folder I called Workspace, and I tried running X.bat from the command line there.  But again I got an error:  "AlltoMp3Cmd.exe has encountered a problem and needs to close."  Maybe the program wasn't built to handle paths.  For whatever reason, it looked like the Boxoft AlltoMp3Cmd command-line utility was not going to work for me.

A search in CNET brought up some other possibilities.  One was IrfanView, reminding me that I had used that program to work partway through a somewhat similar problem months earlier.  Using IrfanView version 4.28 and various insights described more fully in that other writeup (and in a recent thread), I went back to my original list of files in wmalist.txt and prepared this command:
i_view32.exe /filelist=D:\wmalist.txt /convert=$D$N.mp3
This command was supposed to use the file names ($N) and directories (i.e., folders, $D) specified in wmalist.txt to produce MP3 files with those same names, in those same directories.  Before trying it out, I made a copy of wmalist.txt and changed the original so that it contained only two lines, referring to WMA files on two different drives.  I ran the command shown above in a CMD window.  I got an error:
'i_view32.exe' is not recognized as an internal or external command, operable program or batch file.
In other words, Windows 7 did not know where to look to find IrfanView.  I could have taken the steps mentioned above, moving the .txt file to wherever i_view32.exe was located; but since I used IrfanView often, I wanted to add it to the PATH variable so that Windows would permanently recognize it.  The solution was to go to Start > Run > SystemPropertiesAdvanced.exe (also available through Control Panel > System > Advanced System Settings) and then click on Environment Variables > System Variables > highlight Path > Edit.  To see clearly what I was doing, I cut the existing Variable Value out of the dialog and doctored it in Notepad.  The basic idea was to add, to the end of the existing value, a semicolon and then (without adding a space after the semicolon) paste the location of i_view32.exe (found easily enough via an Everything search > right-click > Copy path to clipboard).  I made sure to add a final backslash ("\") after the path to i_view32.exe.  I pasted that back into the dialog, OKed my way out of System Properties, went back into the command window, pressed the Up arrow key to repeat the command ... and it still didn't work.  I thought that possibly I would have to reboot to have the new PATH definition take effect.  That turned out to be the answer.

After rebooting, in a command window, I ran the command shown above, and there were no errors.  IrfanView was open, but nothing was in it.  I ran searches in Everything for the two files in my test WMAlist.txt file, with wildcard extensions (i.e., I searched for Filename.*).  No joy:  there were no MP3 versions of those files.  I tried a modified version of the command:
i_view32.exe /filelist=D:\wmalist.txt /convert=D:\*.mp3
but that produced no output in D.  The IrfanView command was not working.  I tried yet another variation, as above but without the "D:\", but that wasn't it either.  I tried the original command without using the filelist option:
i_view32.exe "D:\File Path\File Name.wma" /convert=$D$N.mp3
This produced an error:
Error!  Can't load 'D:\File Path\File Name.wma'
Did that mean that the /convert option was not being recognized?  Everything indicated that no MP3 file had been created.  And why would IrfanView be unable to load the existing WMA file?  It could load it easily enough from Windows Explorer or Everything.  I tried again:
i_view32.exe "D:\File Path\File Name.wma"
That worked:  IrfanView played the file.  So the convert option was the problem.  Another variation:
i_view32.exe "D:\File Path\File Name.wma" /convert="File Name.mp3"
If that did work, I wasn't sure where the output file would turn up.  No worries there:  it didn't work.  I got the "Can't Load" error again.  IrfanView's help file said that it did support wildcards for /convert, so that was presumably not the problem.  I had seen an indication that IrfanView would not batch-convert certain kinds of files, but WMA was not on the list I saw.  I was going to post a question in the relevant IrfanView forum, but at this point they weren't letting me in, for some reason.  Eventually it occurred to me to look in IrfanView's File > Batch Conversion/Rename area, where it appeared that the program would support only image conversions, not audio.

It seemed I would need to continue searching for a command-line option.  Back at that CNET search, I looked at the Koyota Free Mp3 WMA Converter -- from another company that offered multiple free conversion products -- but saw no indications that it had command-line options.  Likewise for Power MP3 WMA Converter and others.

I finally opted for a kludge solution.  Using an Excel spreadsheet, I created a batch file (again, using techniques described in the other post referenced above and elsewhere) to rename each file in WMAlist.txt to a unique name (example:  ZZZ_00001.wma) -- after making sure I did not already have any files with that kind of name.  The unique names would help to ensure that every WMA file would get the treatment, even if two of them had the same original name.  This produced 386 files.  Then, using Everything, I selected and moved all ZZZ_*.wma files to D:\Workspace.  Somehow, only 375 files made it to that folder.  It turned out that I had inadvertently included WMA program files from drive C after all, which I had not wanted to do, and for some reason a few of those were not moved to D:\Workspace -- probably because of insufficient rights.  So now I would have to undo that damage.
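The rename formula was along these lines (a minimal sketch, assuming the full path of each file from WMAlist.txt was in column A, starting in row 1):

="ren "&CHAR(34)&A1&CHAR(34)&" ZZZ_"&TEXT(ROW(),"00000")&".wma"

Copied down the column, that would generate one command per row -- e.g., ren "D:\Some Folder\Old Name.wma" ZZZ_00001.wma -- ready to be pasted into Notepad and saved as a batch file.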

After taking care of that, in D:\Workspace, I tried the Boxoft program again, this time using its Batch Convert mode.  It took a while.  Spot checks suggested that the conversion quality was good.  I wasn't sure what bitrate to use to convert the files; it seems that 56 kbps erred on the high side for what appeared to be a bunch of voice (not music) files.  I started with 353 WMA files occupying a total of 237MB, and I ended up with 353 MP3 files occupying 405MB.  The files were converted at what appeared, at a quick glance, to be a rate of about 6MB per minute.  I then revised the spreadsheet to produce batch file command lines that would move those MP3s back to the folders where the similarly named WMA files had been, and rename them back to their original names (but with MP3 extensions).
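That move-and-rename formula looked something like this (again a sketch, with a hypothetical layout:  column A holding each file's original full path, with the extension changed from .wma to .mp3):

="move "&CHAR(34)&"D:\Workspace\ZZZ_"&TEXT(ROW(),"00000")&".mp3"&CHAR(34)&" "&CHAR(34)&A1&CHAR(34)

The MOVE command will rename a file while moving it, so a single generated line -- move "D:\Workspace\ZZZ_00001.mp3" "D:\Original Folder\Original Name.mp3" -- would put each converted file back where its WMA predecessor had been.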

Saturday, April 23, 2011

Using a Spreadsheet to Rename Thousands of Files - First Try

I had a list of a couple thousand files.  I got the list as an export from a program, but I could also have gotten it from a directory listing at a command prompt (e.g., DIR /a-d /b /s).  This post describes how I used that list to rename those files.  Needless to say, since I was working with the possibility of screwing up years' worth of information, I did make a backup of these files before proceeding.

This particular list was a list of emails that I had just exported from Thunderbird, as described in another message that I expect to post to this blog on the same day as this one.  I had exported, not only the list, but also the actual emails, in EML format.  I planned to use the list to rename those EMLs so that they would be in the format I wanted, which was like this:

2011-03-20 14.23 Email to John Doe re Tomorrow.pdf
It would require some massaging to get there. 

The Date and Time Segment

Starting with the date and time, here's what Thunderbird had given me in the list of files:
3/20/11 14.23
These were all in one field, in the .csv (i.e., Excel-readable) output from T-bird.  In a text file produced by a directory listing (e.g., DIR /a-d /b /s > dirlist.txt), they might have arrived in two or more fields, but to some extent the same techniques would still have applied.  These numbers were not in a format that would work as a Windows filename, and in fact the exported EMLs did not contain the date and time data in this format.

I was going to use the list to rename the emails the way I wanted, with data that existed in the list but not in the current EML filenames.  To do that, I would first need to get the list to reproduce the actual EML filenames accurately.  Otherwise, if the list produced a command that said, Rename File 234A.eml to be "Message from Ray 234A," that command would not work if the file being renamed was actually called File_234A.eml (with an underscore).  My command would just get a "File not found" error.  So my first step was to find a way, in my Excel spreadsheet, to reproduce what the files were actually called, as I viewed them in Windows Explorer.
First, I would have to extract each element into a separate column, there in the spreadsheet, so that I could format and rearrange them as desired.  For this, I started with FIND commands.  (Excel's internal help had good information on using these and other related commands.  I was doing this in Excel 2003; there might have been other functions that would automate this process in newer versions.)  For instance, =FIND("/",A1) would locate the first occurrence of the forward slash in cell A1, if that's where I put the "3/20/11 14.23" item shown above.  Then I could search for the next slash with =FIND("/",A1,B1+1), starting one place after the result of the first FIND statement (which, in this example, I had put in cell B1).  I could do the same for spaces, colons, or whatever else might delimit various components of the date and time entry.  I'd then use RIGHT, LEFT, or MID statements to extract those components.  For instance, a MID statement like =MID(A1,B1+1,C1-B1-1) would start one place after the first slash, continue to one place before the second slash, and thus give me the day of the month.
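Put together, the extraction might look like this (a sketch, assuming "3/20/11 14.23" sits in A1, with the month deliberately placed in G1, where the padding formula in the next paragraph expects it):

B1:  =FIND("/",A1)                 position of the first slash
C1:  =FIND("/",A1,B1+1)            position of the second slash
D1:  =FIND(" ",A1)                 position of the space before the time
E1:  =MID(A1,B1+1,C1-B1-1)         day ("20")
F1:  =MID(A1,C1+1,D1-C1-1)         year ("11")
G1:  =LEFT(A1,B1-1)                month ("3")
H1:  =MID(A1,D1+1,LEN(A1)-D1)      time ("14.23")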

In the 3/20/11 14.23 example, the value of "3" for the month wouldn't give me the desired "03", so I would use IF and LEN statements to pad it out:
=IF(LEN(G1)=1,"0","")&G1
thereby inserting a zero in front of single-digit month values found in cell G1, but adding nothing to those month values if they were not single-digit.
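An equivalent shortcut, since Excel will coerce the text "3" to a number:  =TEXT(G1+0,"00") produces the same zero-padded result in one function.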

I had gotten two different things from Thunderbird.  On one hand, as noted above, I had gotten a list of data about emails that I was going to export, with dates in the 3/20/11 format.  On the other hand, I had also gotten the actual emails, exported as individual EML files.  These were not named quite the same way.  The particular Thunderbird extension that I had used to produce this list of files, ImportExportTools, had produced files whose names were in this format:  Date-Time-Sender-Subject.  But in those filenames, the dates could not be rendered as 3/20/11, since slashes were not allowed in Windows 7 filenames.  Instead, the date and time came in the format of 20110320-1423.  So, as hinted above, I would ultimately be producing a batch file that would automate the renaming of thousands of files, and in the new names, 20110320-1423 would become the more readable 2011-03-20 14.23.
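That last conversion was simple slicing, since 20110320-1423 was fixed-width.  With that string in A1 (hypothetically), something like this would do it:

=LEFT(A1,4)&"-"&MID(A1,5,2)&"-"&MID(A1,7,2)&" "&MID(A1,10,2)&"."&MID(A1,12,2)

yielding 2011-03-20 14.23.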

Next:  Massaging the Sender Data

Before I could get to that point, I had to change some of the Sender data that T-bird had produced.  Again, the list of actual email data from T-bird contained characters (e.g., the > symbol) that could appear in email Sender and Subject fields but that were not allowed in Windows 7 filenames.  (The full set of forbidden characters:  \ / : * ? " < > | ).  For example, the spreadsheet might show a sender to be this:
John Doe <jdoe@xcom.com>
but the actual exported EML file's name would just have John Doe.  So I could get rid of some of these problems of verboten characters by just doing a FIND for < and then a MID or a LEFT statement for everything before that.  That wouldn't necessarily get rid of all the bad-character problems, but I wanted to clean those up just once, so I deferred that problem for the moment.  For right now, in a bid to match the format of the exported EMLs, I now had this much of the filename:
20110320-1423-John Doe
To get there, the actual command I used, for the Sender portion, was this:
=LEFT(B2,V2-1)
Cell B2 had the Sender's name as exported from Thunderbird, and cell V2 had the FIND location for the < symbol, which marked the end of the part of the Sender data that I planned to use.
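One caveat:  since there was a space between the name and the <, =LEFT(B2,V2-1) would leave a trailing space on "John Doe ".  Wrapping it as =TRIM(LEFT(B2,V2-1)) would drop that space, which mattered if the exported filenames had no space before the delimiting hyphen.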


At this point, there was a problem.  For some reason, ImportExportTools put underscores before and after some Senders' names, but not all.  So I might instead have this:
20110320-1423-_John Doe_
The filenames already used a hyphen ( - ) to delimit fields, so the combination hyphen-underscore seemed superfluous.  It needed to be fixed before I could continue with the main project here, so that I could be confident that Sender names were delimited consistently by a hyphen, and not by unpredictable choices involving underscores.

Detour:  Renaming Thousands of Files, So That I Could Proceed to Rename Thousands of Files

To fix this problem, I went to the Windows 7 command prompt (Start > Run > cmd) and did a directory listing to save the filenames to a file:  DIR /b > dirlist.txt.  I copied the contents of that file into column A of a new Excel spreadsheet.  I also copied those contents into Notepad.  In Notepad, I did two global replaces (changing both -_ and _- to just plain - ).  I copied those changed contents into column B of that Excel spreadsheet.  Next, I needed to combine these two columns, A and B, into a single DOS command that would rename the files.  To do that, I put this formula in column C:
="ren "&char(34)&A1&char(34)&" "&char(34)&B1&char(34)
The CHAR(34) entries would introduce quotation marks:  34 was the ASCII code for double quotes.  I could have introduced any character that way -- Z, a semicolon, whatever -- but it was essential to use the CHAR(34) approach for quotation marks specifically, because otherwise Excel would have misunderstood them.  Quotation marks would be necessary for any DOS command involving filenames that contained spaces; DOS would otherwise think that the space marked the end of a part of the command.  Anyway, I copied the contents of column C back into Notepad and saved the file as a batch file (renamer.bat).  I saved renamer.bat in the folder where I had all those EML files I was going to rename.  At a DOS prompt in that folder, I typed "renamer.bat," without the quotes; I could also have just double-clicked on renamer.bat in Windows Explorer.  Either way, the files were renamed.
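A generated line, with hypothetical filenames, would look like this:

ren "20110320-1423-_John Doe_-Subject.eml" "20110320-1423-John Doe-Subject.eml"

Because both names contained spaces, both had to be quoted -- hence the CHAR(34)s.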

Later, I discovered that ImportExportTools would truncate the Subject field in the spreadsheet to 50 characters even if I had not asked it to do so.  Many emails in my set had Subject fields longer than that.  I initially tried to fix this by setting up a separate spreadsheet, using the process just outlined, but with a search for the last hyphen in the filename, which would hopefully identify the start of the Subject field in most cases.  Unfortunately, the calculation of 50 characters proved difficult, after taking into account the character substitutions described here.  Ultimately, I started over with a fresh export of EML files from Thunderbird, after setting the option at Tools > ImportExportTools > Options > Filenames tab > Cut subject at 50 characters (and also Cut complete file path length at 256 characters).

Returning to the Sender and Subject Segments

So now I could go back to work on the main spreadsheet.  I could assume, that is, that the first part of each EML filename was going to look like this:
20110320-1423-John Doe
without underscores.  I could see that there were still a few underscores around names in the EMLs, and that was problematic.  Possibly ImportExportTools had, in some cases, introduced two underscores in a row rather than one.  There were few enough that I figured I could fix those exceptions manually.

Now it was time to add the Subject part of the filename.  This was simple enough:  just use another ampersand (&) to combine it with the rest.  Sketching it out, the EML filenames just used hyphens to delimit the date, time, sender, and subject portions, so I needed an Excel formula of this format:
=[Date]&"-"&[Time]&"-"&[Sender]&"-"&[Subject]&".eml"
and that would give me a complete representation of how ImportExportTools seemed to have constructed the filenames for most of the exported EMLs.

Cleaning Up Unwanted Characters

As noted above, some characters (e.g., the colon, ":") were not acceptable as Windows 7 filenames, but were quite common in email subject lines.  It looked like ImportExportTools had replaced those characters with an underscore, and had removed spaces before and after the underscore.  So, for instance, a subject line of "Re: Tomorrow" was represented, in the EML filenames, as "Re_Tomorrow," without a space.

At this point, preparing to remove those unwanted characters, my spreadsheet had something like this:
20110320-1423-John Doe-Re: Tomorrow
Some of these fledgling combinations contained other unwanted characters (e.g., question marks), listed above.  I was almost ready to remove them.  But first, it was time to proofread my spreadsheet.  Having copied all of my various formulas down from the first row, where I had developed them, so that they were present in all rows, I now prepared to sort the spreadsheet.  First, I inserted an Index column as column A, and in that column I inserted numbers in ascending order, starting with 1.  The purpose of this column was to preserve my original sort order, in case the date field or anything else did not accurately reproduce the order in which emails appeared in Thunderbird.  There were times when a person would want to check back and see if things were matching the source.

To insert these numbers, using Excel 2003, there were two options.  One was to enter a formula in each cell, adding 1 to the number above it, and then replace the formulas with fixed values; the sequence for this was Edit > Copy, Edit > Paste Special > Values.  An alternative was Edit > Fill > Series.  Either way, I now had index numbers that wouldn't change if I sorted rows.
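The formula approach, sketched (hypothetical layout, with headers in row 1):

A2:  1
A3:  =A2+1     (copied down the column)

followed by selecting the column and doing Edit > Copy, then Edit > Paste Special > Values, to freeze the numbers in place before any sorting.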

So now I did sort rows, sorting first on the column that concatenated (i.e., combined) the date, sender, and subject.  I sorted in order to highlight those rows in which the process had not worked as intended.  One problem, I saw, was that some Sender entries did not include the <jdoe@xcom.com> part.  The sender of these emails was just listed as John Doe (or whoever), without the actual email address.  For these, I just copied the Sender straight over, skipping the unnecessary part of the spreadsheet calculation.  That cleared up the error messages in the spreadsheet.  So now I saved a version of the spreadsheet, copied the output column -- the one combining all of my work with these various fields -- and pasted it into Notepad.  I could have edited it right there in Excel, but there was a risk that I would accidentally change other columns in ways I did not intend, or fail to achieve what I needed.  For example, the ? symbol was a wildcard in Excel's Find and Replace, so attempting to replace it with an underscore would either not work or create a mess (though escaping it as ~? would have been one workaround).  I noticed that ImportExportTools had also converted commas into underscores, even though they were not forbidden characters.  Note:  now that I had parts of the spreadsheet in Notepad, and expected to paste those parts back into Excel in the same order, this would have been an especially bad time to sort the spreadsheet.

Over in Notepad, I did global find-and-replace operations for each of the special characters listed above, guided by the objective of matching as closely as possible the actual filenames of the EMLs I had exported from Thunderbird.  It was premature to do line-by-line editing of individual entries, though; those would be more obvious and perhaps more easily fixed later.  At this point, if I saw something that needed to be changed, I executed it as a global command, so that it would be changed wherever it occurred.  Obviously, a person can make serious mistakes with global changes, so, again, this was not the time for fine-detail fixes of individual entries.  On the other hand, each of these global changes was a potential time-saver:  a single change here could save the need to make manual changes to 20 or 300 individual files later.

After making these changes, I again had to replace the -_ and _- combinations with a simple hyphen ( - ), since some forbidden characters that I had just replaced with underscores had been adjacent to the delimiting hyphens that ImportExportTools seemed to convert into simple hyphens.  With these changes made, I copied the contents of that Notepad file back into the appropriate place in the spreadsheet.

Testing the Filename Match

How well did my spreadsheet now reflect the actual names of those EML files?  I decided to test it.  To do this, I first made a backup copy of the folder containing the EMLs.  I then created a subfolder in the EMLs folder, called Test.  In the spreadsheet, I worked up a MOVE command for each file, commanding it to move to the Test folder.  The command was like this:
="move "&CHAR(34)&Z2&".eml"&CHAR(34)&" Test"
I copied all those MOVE commands from the spreadsheet to Notepad and saved the Notepad file as MOVER.BAT in the EML folder.  Then I ran it.  It succeeded in moving fewer than half of the files, which meant that the spreadsheet had not yet captured hundreds of EML filenames correctly.  I did a DIR /b > dirlist.txt in the EMLs folder to capture the names of the ones that had failed.  I brought that list into the Excel spreadsheet and matched the names up with my attempts.  This matching process was sufficiently time-consuming that I found myself grateful for the ones that I had managed to identify in a more automated fashion.  Still, the time investment was acceptable.  As had been the case for me with spreadsheets going back almost 30 years, I figured that, once I identified the rules and became more familiar with this particular process, I would be able to use it again in similar tasks.

To match up the real filenames with the versions in the spreadsheet, I pasted the contents of dirlist.txt into a separate worksheet in Excel.  From both, I compared the date and time data, using VLOOKUP on shortened versions of the relevant columns (=LEFT(Z1,13)).  This revealed several things.  First, there was an error in how I had converted the date and time data, so I would have to fix that and re-run it to move more of the items from the EML folder into the Test folder.  Since I planned on doing this task again, I also decided to automate some of the find-and-replace operations described above.  This required using FIND and MID to divide the text string into parts appearing before and after the character I wanted to replace (e.g., colons) and then joining them together without the middle part.  I didn't do this for all possible combinations; I was basically interested in eliminating the first and/or most frequent occurrences of the forbidden characters that did seem to appear.
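A sketch of both pieces, with hypothetical cell references.  For the matching:  a helper column on each worksheet isolating the 13-character date-time stamp, and then a lookup across worksheets:

=LEFT(A1,13)
=VLOOKUP(B1,dirlist!$B$1:$C$3000,2,FALSE)

For the automated character removal, with F1 holding =FIND(":",A1) (assuming the character is actually present):

=LEFT(A1,F1-1)&MID(A1,F1+1,LEN(A1))

Excel's SUBSTITUTE function would also have worked -- e.g., =SUBSTITUTE(A1,": ","_") turns "Re: Tomorrow" into "Re_Tomorrow" -- and it accepts an instance number for replacing only the nth occurrence.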

Starting Over:  Revise the Filenames, Not the List

After hours of effort and a couple of retries, the approach described above was still identifying (i.e., successfully moving) only about 80% of the EMLs I had exported from Thunderbird.  Depending on the total number of emails involved, that could mean hundreds or even thousands of emails that would have to be renamed manually in order to ensure that their filenames contained all of the desired data.

I decided that the better approach might be to try to rename the files, and keep renaming them if necessary, until they conformed to certain basic rules in the file list exported into Index.csv.  There had already been some of that in the steps described above; the change now was to make that the primary effort.

The first step was to take a listing of the EMLs.  This was easy enough:  DIR /b > dirlist.txt.  Then I copied the contents of that text file into Excel and changed components of the filename to be the way I wanted.  The names of EMLs exported from Thunderbird did not have "To" fields, but I was able to match up most of the names automatically with what I had already prepared, thus borrowing To fields from the work described above.  (This is a cursory description.  At this writing, I had run out of time for this project, and was focused on getting it done.  But the basic techniques were as described above.)

Some filenames did not match up easily enough to support an automated renaming of the EMLs.  The rest -- the ones that did match -- I renamed, using a MOVE command to put them into a separate folder.  The ones that would not rename, I copied into a separate folder too.  These I renamed using the Bulk Renamer (above), so as to give them .txt extensions.  I did that so that I could view them in IrfanView, which enabled me to flip back and forth among them quickly, so as to identify the proper "To" information.

At this point, I started a new post to summarize more clearly the steps taken here.