Thursday, April 21, 2011

Batch Renaming Text File Memos or Emails Using Their Contents to Create the Filenames

I was using Windows 7.  I had text files that contained old email messages, one message per file.  Each began with four or five lines:  FROM, TO, DATE, and SUBJECT, sometimes preceded by a blank line.  In that sense, they were very similar to memoranda.  At present, these emails had numerical names (e.g., 0045.txt) that said nothing about their contents.  I wanted to rename them to a Date - Time - From - To - Subject format, like this:

2010-04-13 13.45 Email from Joe to Nancy re Dinner.txt
In other words, their top several lines told me everything I needed to name them; I just didn't want to have to do it manually.  This post describes the steps I took to automate that naming process.

My knowledge of programming was more or less limited to DOS-style batch file scripting.  If I had been interested in entering single commands, I would have done that in a DOS box (a/k/a command window), which I could open in Windows 7 (customized) by going to Start > Run > cmd.  I wanted to run the relevant commands automatically, across a number of text files, so I would be saving those commands in a batch file, created in Notepad as a text file (ANSI format), saved as CommandFile.bat, and executed by double-clicking on it in Windows Explorer.  The question here was, what commands should go into that file?

Instead of creating a batch file, I could have used some other approach.  For instance, it seemed possible that IntelliGet or TextCrawler would do the job better, provided I were interested in spending the time to learn how to use them.  I did actually try with TextCrawler, but at this writing I had a question sitting in the TextCrawler forum for almost two days without an answer.  Same thing with AutoIT.  I wasn't sure how to proceed with those approaches, so at this point I was deciding to return to the batch file approach.  I had been using batch files on a very simple level for a long time, so I started there.

A search for batch command syntax information led me to the A-Z Index site, which I had used before.  I used that site to interpret what was happening in a two-line batch script that I found in another post:
FOR /F "skip=2 delims=" %%i in (%1) DO >> lines.txt ECHO %%i & GOTO :eof
I've presented the DOS commands in capital letters to make them more visible.  ECHO would tell the computer what to make visible onscreen or in the output file (in this case, lines.txt).  The second line used a "FOR /F ... DO" loop to repeat the command.  (There were other FOR options.  For instance, FOR /D would operate on directories, not files.)  In plain English, the concept there was, "For each X, do Y."  The big idea was, apparently, that the "skip" part would tell it to skip two lines and then echo the third one to the lines.txt file, and then go to :eof (short for "end of file," i.e., end of this script).  In other words, this FOR-DO combination just said, skip a line, skip a line, output a line, then break out of the loop (i.e., quit).  So there would have to be another command that would call this script for each file in the directory.  But before getting to that, I wanted to understand how the "skip" part worked, because that's what I would need to modify for my purposes.  The syntax reference seemed to be saying that "skip=2" was a command to skip two lines, as I had just surmised.  Another source seemed to say that the "delims" part would specify what would count as the unit of analysis.  For instance, if there were a blank space between the equals sign (=) and the quotation mark, we would be looking for a blank space, not the end of a line, as a delimiter.  The loop would restart each time it found a blank space.  Apparently leaving no space between = and " meant that it would default to end-of-line as its delimiter.  I wasn't sure about that.  But what we seemed to be saying, so far, was that we were going to loop each time we encountered a line end, for a total of two loops.  Then, having arrived at the third line, we would proceed with the main attraction:  we would do something with "%%i in (%1)."  Specifically, we would export it, whatever it was, to lines.txt.  What we would be exporting would be %%i.  The percent signs, I knew, indicated a variable.  So we were going to export variable i, in %1, to lines.txt.  I guessed that (%1), being in parentheses, referred to a varible already supplied.  That is, the command that would call this script would indicate the name of a file to examine, and this %1 would represent that filename.  So, to recap:  having arrived at the third line of the file in question, let's export something (specifically, %%i) to lines.txt.  Since there wasn't any more information on what the %i variable (presented as %%i because it was occurring inside a batch file) was supposed to do, it seemed that it must just refer to the whole unit of analysis -- in this case, the whole line -- where we had ended up, having skipped two lines to get here.  In other words, this whole third line of the text file would be exported to lines.txt.

Now, how would all that work for me, where I wanted to output multiple lines?  I thought I might try to identify each of the lines in question with a separate variable, and then concatenate (i.e., combine) those variables in the output.  Basically, if variable L1 stands for line 1, and L2 for line 2, and so forth, then what I wanted was to output L1+L2+L3+L4 to lines.txt (though I preferred to call it output.txt).  (The double output signs, >>, meant "add this onto whatever already exists in the output file, as distinct from >, which would mean "replace whatever is in the output file with this new data.")  I wouldn't need a skip parameter -- I wanted to output all of the first four lines or, actually, the first five, since in some files the first line was a blank.

At this point, it occurred to me that I might have another option.  I made a backup of the folder containing the emails.  Then, at the DOS command prompt, I typed COPY *.* COMBINED.TXT.  This put all of their contents into one large text file.  I opened it and copied and pasted its entire contents into an Excel spreadsheet.  Then I parsed its lines to produce the desired filenames.  This was much more time-consuming than a scripted solution would have been, and it was not what I set out to find, but it did get the job done.



Note that if the combined text file exceeded the 65,000-row limit of Excel 2003, OpenOffice Calc was now able to handle up to a million rows, though I was finding it very unstable at this point.