Ray Woodcock's Latest: EMLs


Tuesday, June 19, 2012

Exporting Thunderbird Emails to PDF - Another Cut

I was using Thunderbird 11.0.1 in Windows 7. I had accumulated some emails that I wanted to export as individual EML files. An EML would still be readable in Thunderbird, and it would carry any attachments along with it. I had attacked this problem on several previous occasions. As before, I was not sure I would get all the way through from Thunderbird to EML to PDF. This post provides another contribution in the slog toward that outcome.

First Step: From Thunderbird to EML Format

Some of my previous efforts to export to EML and then convert to PDF had produced something of a mess. Exporting, itself, was easy enough. I was using ImportExportTools. It would give me EMLs with names containing some, but not all, of the information that I wanted in file names. Specifically, it would provide the date and time, the sender, and the subject; but it did not include the recipient. I could get it to produce a separate Index.csv file that would contain the full information, but that would just be a spreadsheet file. I could use that spreadsheet file to give me nice names for files; but which file was supposed to get which name? Matching them up had required a surprising amount of manual effort, last time around. I was hoping to make the process smoother, if I could.

It wouldn't help to print a PDF directly from Thunderbird. As far as I knew, that would require me to enter PDF filenames manually. I was looking for a mass-production kind of solution. About.com recommended mbx2eml, but it seemed to have some disadvantages, notably a very limited set of options for the resulting EML filenames -- which was the main problem. Generally, it did not seem that any solution had broken through into prominence, in either the T-bird to EML or T-bird to PDF category.

In my first try at this problem, I had tried Total Thunderbird Converter and Birdie EML to PDF Converter, but for various reasons had not been impressed with either. I did like Attachment Extractor, for when I got to that part of the project. My notes seemed to favor Universal Document Converter (UDC) ($69), if I wanted a direct T-bird-to-PDF solution. As I reviewed the struggles I'd had in that first try at this problem, and also in the second and third tries, I wondered if I should have focused more seriously on UDC. But it did not seem to have command-line capability or other automation features. It was basically a glorified PDF printer. Moreover, its default filenames did not include all the information I wanted.

My previous notes did not seem to mention that Thunderbird messages were apparently already in EML format, stored in Thunderbird subfolders. For instance, I had moved the messages that I was now seeking to export to a Local Folders subfolder called Export, and I could see that folder in Windows Explorer as Mail\Local Folders\Export.mozmsgs. But this was confusing: the number of EML files in that folder was not very close to the number of messages in the Export subfolder in Thunderbird. Anyway, the EMLs in Export.mozmsgs had seemingly random names that would be useless for my purposes.

So I went ahead with ImportExportTools. My first step was to eliminate duplicates. For this, I used Remove Duplicate Messages (Alternate). Then, in Thunderbird, I went to Tools > ImportExportTools > Export all messages in the folder > EML format. The first time around, this produced undesirable results (see below). But I didn't know that until I was partway through the second step.

Second Step: Adding Recipient to the EML File Name

I had my EMLs. But as noted above, I wanted to add the name of the Recipient to the filename, in the format Date-From-To-Subject. As a first step, I thought I would just try to append the Recipient's name to the end of the filename. Then I would figure out how to shuffle the words around to the desired order.

Given my limited knowledge of programming and such, I decided to try to achieve this with a Windows batch file. I struggled to figure out how to write a suitable one, and finally posted a question on it. One of the early answers to that question led to a separate pursuit -- a one-line batch file that would convert Word and WordPerfect documents to PDF.

The answers that I had received, at the point when I was writing up these notes, fell into two categories. One, which I found easier to understand (and, predictably, seemed less popular among the knowledgeable respondents), involved a simple loop that would call an external process. Basically, in plain English, it went like this:

FOR each EML file, run Process.
Repeat loop.
When list of files is exhausted, quit.

Process starts here.
Do various things.
End of process

By contrast, the approach preferred by most of the answering individuals would put all the steps inside the loop, instead of having a separate process afterwards. It seemed to be a matter of style. A second difference was that, in discussing the specific steps, they seemed divided between two general possibilities: with, or without, delayed expansion. Delayed expansion was apparently a response to a complication in how the FOR command worked. As I understood it, the computer would read the entire contents of a FOR command as soon as it hit the word FOR. So assigning a value to a variable inside a FOR loop would be too late; the computer would already have decided what value that variable had. The variable would have been immediately expanded to its value. Delayed expansion would postpone definition of the variable's value until later in the game. A variable would be marked for delayed expansion by surrounding it with exclamation marks (e.g., !VAR!). I wasn't familiar with delayed expansion, so I was in accord with some advisors' feeling that it would be better to proceed without it if possible. What they (especially Aacini) suggested was:

@ECHO OFF

IF EXIST fullnames.txt DEL fullnames.txt

FOR %%f IN (*.eml) DO (

SET firstfind=

FOR /F "delims=" %%l IN ('findstr /B /C:"To: " "%%f"') DO (

IF NOT DEFINED firstfind SET firstfind=now & ECHO %%f%%l >> fullnames.txt

)

)

I have double-spaced the lines for clarity, anticipating that Blogger will wrap some long lines. I haven't indented the way a programmer would, because of apparent limitations in the formatting options here in Blogger. Basically, this batch file said, give me a fresh output file called Fullnames.txt; and on each line in Fullnames.txt, type the contents of two variables. The first variable, %%f, was the name of the EML file under consideration, in all its Date-Sender-Subject glory. There would be one such filename assignment for each EML file in the folder; hence a FOR loop. The batch file would loop through all EML files in the folder.

Inside that FOR loop, there would be an examination of the contents of each individual EML. This examination would use FINDSTR to locate the first line beginning with "To: " (including the trailing space). The contents of that line would be assigned to the %%l variable. (That's an L, not a one.) I wasn't sure why this had to be done inside a second, inner loop, and I also didn't know how the "now" part worked. But I was an open-minded individual. I was interested in new ideas. The point is, I was willing to plow ahead and give it a try.
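For readers who find the batch syntax opaque, here is a rough Python sketch of what the script above appears to do. It is not the batch file itself, and the function names (first_to_line, build_fullnames) are made up for illustration. Returning at the first match plays the role that the "firstfind" flag seems to play in the batch version: later "To: " lines in the same message are ignored. As an extra, the sketch reports which EMLs had no such line at all.

```python
from pathlib import Path


def first_to_line(eml_path):
    """Return the first line starting with 'To: ', or None if there is none."""
    # errors="replace" keeps odd encodings inside an EML from stopping the scan.
    with open(eml_path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.startswith("To: "):
                return line.rstrip("\r\n")  # stop at the first match
    return None


def build_fullnames(folder, output_name="fullnames.txt"):
    """Write '<filename><To line>' for each EML; return names lacking a To line."""
    folder = Path(folder)
    missing = []
    with open(folder / output_name, "w", encoding="utf-8") as out:
        for eml in sorted(folder.glob("*.eml")):
            to_line = first_to_line(eml)
            if to_line is None:
                missing.append(eml.name)
            else:
                out.write(f"{eml.name}{to_line}\n")
    return missing
```

Calling build_fullnames on the export folder would produce lines of the form "name.emlTo: recipient", the same shape the batch file writes, and the returned list gives a starting point for any messages that have to be handled by hand.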

So I copied the foregoing lines of script, beginning with @ECHO OFF and continuing to the last closed parenthesis (")"), and pasted them into a file in Notepad. I saved that file as EMLNamer.bat, and put it into the folder containing the EMLs that I had exported from Thunderbird (above). There, I ran it (either double-click it or highlight and hit Enter). The command window displayed nothing, which was a bit disconcerting; but, viewing the folder in Windows Explorer, I could see Fullnames.txt spring into existence and grow larger.

When it was done, the command window disappeared, and Fullnames.txt stopped getting bigger. I put EMLNamer.bat into a folder where I could find it later. I opened the Fullnames.txt file and pasted its contents into Excel. Some lines seemed to be missing: not many, but the line count was less than the total number of files shown in the Windows Explorer folder minus two (for EMLNamer.bat and Fullnames.txt). I guessed that the names of a few EMLs had presented complications for the script. I would have to process the rest and see what remained.

Third Step: Improving the EML File Name

I looked at the new Excel spreadsheet. Spot checks, supplemented by previous experience with ImportExportTools, yielded the following observations:

The first 13 characters in each filename seemed to match the date and time (in 24-hour format) shown in Thunderbird for the email in question -- the time, that is, when the email was sent or received.
The next characters indicated the sender. This string ended, in some cases, with three characters (namely, "_-_") and in other cases with just one (namely, "-"). It seemed that ImportExportTools would surround some senders' names with underscores ("_") but would not do so for others. The reason seemed to be that those senders' names appeared within brackets. For instance, I had emails from "[Wordpress.com]" that now appeared as "_WordPress_com_." So at least in these situations, the underscore seemed to be something that I could replace with a space, which would then be removed by an Excel TRIM command if it appeared at the start or end of a string.
Some senders' names ended with "_com." Ordinarily, the preceding note would suggest replacing that with ".com," and likewise for ".org," ".edu," and so forth. But I decided that step would come later, if at all: instead, I would start by identifying full names (e.g., "Yahoo_com") that I might want to replace with simpler names (e.g., "Yahoo").
Hyphens were not always a reliable indicator of the end of a sender's name. For example, an email from some "Pan-European" organization came through the ImportExportTools process unchanged.
ImportExportTools seemed to replace apostrophes with underscores. So instead of "Miller's" I would get "Miller_s_." Likewise for other uses of the apostrophe (e.g., "Don't" became "Don_t_"). It seemed that, before doing any sweeping replacement of underscores, I might want to look for those sorts of special cases.
Due to the EMLNamer process, the end of the Subject field and the beginning of the Recipient field were marked by ".emlTo:" -- which was certainly recognizable.
Subject fields often began with things like "Fwd_" and "Re_" -- which, I had decided in a previous use of ImportExportTools, would best be deleted.

In short, the default results from ImportExportTools (possibly altered during my previous tinkering) were creating some confusion. I deleted the existing EMLs from the output folder, so as to start over. Then I went into Thunderbird > Tools > ImportExportTools > Options and made several changes. In the Misc. tab, I set each item to a maximum of 100 (rather than 50) characters. (This wasn't exactly a mistake, but I would later realize that, as a result of this change, I needed to be more aggressive in keeping the total filename length relatively short; otherwise, it would cause problems in some other Windows operations.) In the Filenames tab, I unchecked the option to "Use just alphanumeric characters in export"; I left the format to be Date - Sender - Subject; I left "Add time to date" checked; and I unchecked the "Cut subject" and "Cut complete file path" options. In the "Export directories" tab, I chose the "create a new directory and the index of messages" option.

When I ran that, I got an index.html file listing relevant information about each file: its subject, from, to, date, and an indication of whether it had attachments. This did not appear likely to be helpful, given its HTML format. In the output folder, there was the right number of files. I ran EMLNamer.bat again. This time, the command window gave me some error messages. Preliminarily, it seemed they were produced by the length of the filenames. I could not save them before the command window closed. There was probably a way to modify EMLNamer.bat to save those messages to a file, but I did not tinker with that at this point. These messages appeared to be in addition to the unknown problems that had prevented Fullnames.txt from containing a complete list of all EMLs: there were now about 20 filenames missing from the output that I pasted into Excel. So, again, those would have to be dealt with manually.

This time around, when I pasted the results from Fullnames.txt into Excel, I saw that the output filenames had characteristics largely similar to, but in some regards different from, those noted above. There were fewer underscores, which meant that it would probably be simpler to develop rules to translate them into more useful characters. Hyphens were still not reliable field-end indicators.

Manipulating the File Information in a Spreadsheet

In Excel, after a couple of false starts not detailed here, I took the following steps:

Insert row 1 for column headings. Label column A as "Combined." These entries contained the combined original filename plus the "To:" information added by EMLNamer.bat.
In column B (heading: "Original"), use =LEFT(A2,FIND(".emlTo: ",A2)-1) to obtain the original filename as exported from Thunderbird. I would need this to remain unchanged: my ultimate goal, a batch command indicating how the original filename should be changed, would need this information to tell the command processor what file was being renamed. As with all other columns discussed below, I copied the formula down the column to all rows in use.
In column C (heading: "Find & Replace"), use =A2. Fix the values in this column -- that is, make them permanent by highlighting them all and using the Edit - Copy, Edit - Paste Special - Values sequence. The shortcut key sequence for Excel 2003 -- which I believed would work in ribbon versions like Excel 2007 and 2010 -- was Alt-E C, Alt-E S V Enter Enter. Now column C contained values rather than formulas.
Move the values from column C to a new worksheet. Don't rearrange them. I needed a new worksheet because I was going to be using global find-and-replace (Ctrl-H) commands, and I didn't want to have to try to protect columns A and B from being affected by these commands.
In that new worksheet, I made changes to the list that I had just brought over from column C in the first worksheet. The first thing I did was to search, in Excel, for an unusual character that did not already appear anywhere in the list. The caret ("^") was one such character. I would use this as my field delimiter. I didn't want any of my Subject field entries to begin with "Re" or "Fwd," so I started by replacing "-Re_" and "-Fw_" and "-Fwd_" with carets, gambling (on the basis of previous experience) that there would be few instances where this would prove inadvisable.
I also replaced the "-_" and "_-" and "-[" combinations with carets. To reduce the number of underscores potentially requiring manual attention, I did one or two additional find-and-replace operations in obvious cases; for example, "Woodcock_s " (ending with a space) became "Woodcock's ." It could have been counterproductive to go too far with this, though. For example, I did not try to remove underscores from every version of my name and email address, because that could have created additional variations on my name, somewhere down in the list, potentially complicating the number of things I would have to look for later. It was better to leave the underscore as a flag for some purposes. Then I cut and pasted that modified list back into column C in the main worksheet.
Back in the main worksheet, in column D, I set up a Date and Time column, using =LEFT(C2,13). I didn't parse that column for the various year, month, day, hour, and minute components at this point; that could wait until I needed that information.
In column E, I created my first Remainder column. The purpose of the Remainder columns was to show what was left from the modified values appearing in column C, after removing whatever I had just separated out (in this case, the date and time). The formula was =TRIM(MID(C2,15,LEN(C2))).
I used column F for the Recipient (i.e., "To") value, from the end of the string appearing in the Remainder column (E). The reason was that this was a fairly obvious entry, and its removal would simplify the next steps. The formula in column F was =TRIM(MID(E2,FIND(".emlTo: ",E2)+7,LEN(E2))).
Column G could be another Remainder column: =TRIM(LEFT(E2,FIND(".emlTo: ",E2)-1)).
In column H (heading: "Left 1"), I entered =LEFT(G2,1). The reason was that ImportExportTools had failed to export the names of some senders, notably those appearing in angle brackets ("< >"), and I couldn't identify them by just sorting on the Remainder column because Excel would irritatingly overlook those characters when doing a sort. But now I could sort on column H and make manual entries of those senders' names in the appropriate column. I had not yet created that column, nor made those manual entries, because there was something else I needed to do first:
In column I, under a "Hyphen" heading, I entered =IF(ISERROR(FIND("-",G2)),"",FIND("-",G2)). In column J (heading: "Caret"), I entered =IF(ISERROR(FIND("^",G2)),"",FIND("^",G2)). Finally, in column K (heading: "Best"), I entered =IF(J2="",I2,J2). Column I would look for the first occurrence of a hyphen in the Remainder (column G). Column J would do likewise for a caret. It was necessary to use both because, at this point, either one might have been the delimiter indicating the end of the Sender field. Column K would favor carets over hyphens, so as to reduce the number of problems with senders with hyphenated names.
In column L ("Sender"), I used =TRIM(LEFT(G2,K2-1)). This produced good Sender names in most cases. It was not yet time to deal with the exceptions.
In column M ("Subject"), I used =TRIM(MID(G2,LEN(L2)+1,LEN(G2))). This produced good Subject names in most cases. Now it was time to deal with the exceptions.
I went back and sorted on column H to identify those rows where I would have to make manual entries of Sender names because none was provided by ImportExportTools. I put those entries in column L as needed, replacing whatever the automatic calculation had put there. To assist in my process of looking up those that I didn't recognize, I sorted the From column in Thunderbird, for the Export folder, to gather all those senders at the top of the list for easier reference; I moved these items into a separate subfolder, sorted by Subject; I maximized the viewable space for that list; and once I had dealt with them, I moved them to another subfolder, so as to reduce the size of the list that I would have to page through. The objective here was just to make sure I had a coherent division of information between the Sender and Subject columns -- to prevent some Sender data from appearing in the Subject column, or vice-versa. Cleaning them up or otherwise improving them at this point would have been premature. Changing Sender names worked best if I made the changes back in column G, or if I altered or removed numbers in columns I and J. Just making a change in the Sender column would leave a problem in column M. It helped, for this purpose, to fix the values in column G (that is, to replace formulas with values; see the procedure described in connection with column C, above).
I sorted on column M ("Subject") and cleaned up the entries there. I found that I wanted to do find-and-replace operations on multiple entries. I decided at this point that I could safely fix the values of the entire spreadsheet. It seemed that I would want to sort and re-sort these Subject values to get similar ones together. To preserve the original order, I added an Index column, indicating the original numerical order of entries. (Enter 1 and 2 in the first two rows; highlight all rows to be numbered; then hit Alt-E I S Enter.) Then I moved the Subject and Index columns to a separate temporary worksheet, where I could do these sweeping changes without affecting other columns. There, I reversed these two columns, putting Index on the left, to keep it out of harm's way. My changes here included LEFT and RIGHT commands to sort by first and last characters of Subjects (supplemented, on the left, with CODE comparisons, to identify unwanted lowercasing), as well as FIND and Ctrl-H searches and replacements for underscores (doing many replaces to eliminate most instances) and other text that I wanted to change across multiple Subjects. To identify undesirable characters (e.g., exclamation marks and others whose presence in filenames might mess up batch commands and other applications), I used SUBSTITUTELIST. SUBSTITUTELIST would remove the characters listed in a separate worksheet (generated with a series of numbers 1-255 in column A and a corresponding =CHAR(A1) in column B). I could have had it remove characters that looked unwanted, but to be cautious I decided instead to have it remove everything that I knew was normal (i.e., 0-9 and a-z and A-Z, plus a few others) and show me what was left.
I deleted columns that were unnecessary, now that I had fixed the values. I also moved some columns, and inserted a few new ones. My arrangement was now as follows: Index (column A), Original (B), Date & Time (C), Sender (D), NewSender (E), Recipient (F), NewRecipient (G), and Subject (H).
I copied values from columns D (Sender) and F (Recipient) to a separate worksheet. There, I did a unique filter. This gave me a list of names that I might want to change or simplify. I put the original (unique) names in column A in that separate worksheet, sorted on that column, and entered the desired replacement names in column B. I named this worksheet Names; I planned to keep it for future Thunderbird EML exports. I named the main worksheet Data. I sped up the process of developing replacement names by using various functions (e.g., FIND, MID) to distinguish first and last names of individuals. When I had my completed list of preferred names for Senders and Recipients, I went back to the main (Data) worksheet. In column E (NewSender), I entered =VLOOKUP(D2,Names!$A$2:$B$869,2,FALSE). (There were 869 rows in the Names worksheet.) I copied that formula over to column G (NewRecipient); it provided a similar replacement for the Recipient values.
I inserted columns to figure out the date and time. In column D ("Y"), I used =LEFT(C2,4). In column E ("M"), I used =MID(C2,5,2). In column F ("D"), I used =MID(C2,7,2). In column G ("H"), I used =MID(C2,10,2). In column H ("M"), I used =MID(C2,12,2). Finally, in column I ("NewDate"), I used =D2&"-"&E2&"-"&F2&" "&G2&"."&H2.
I added column O ("New Name"). There, I used =CHAR(34)&I2&" Email from "&K2&" to "&M2&" re "&N2&".eml"&CHAR(34). This produced a new name for the EML file. I sorted on this column to identify instances where my formulas had failed, and made corrections as needed.
I added column P ("Batch"). There, I used ="ren "&CHAR(34)&B2&".eml"&CHAR(34)&" "&O2. This produced a batch command to rename the EML file to my preferred new name. I copied the command down the column and then copied all those commands, one from each row, to Notepad. I saved the Notepad file as Renamer.bat and put it into the folder where the EMLs were located. I ran Renamer.bat. The renamed files sorted conspicuously in Windows Explorer, so I didn't need to work up a modification of these REN commands in a new column Q, using MOVE instead of REN, to move the newly renamed files to another folder. Instead, I could just cut them from the folder in Windows Explorer and put them aside.
Now I had a couple dozen EMLs remaining. They had not renamed properly. I probably should have added something like " > errorlist.txt" at the end of each batch command, to show me whether I was trying to give the same name to two different files. I did a DIR of the files remaining, saved its output to dirlist.txt, copied the contents of dirlist.txt into Excel, and compared them against my main spreadsheet. To my surprise, none of these files appeared in the original list of files shown there. I'd had some problems not described in this post; had I somehow dropped some EMLs somewhere along the line? Was I not doing this comparison properly? I did not have a clear answer. I worked up another set of new file names for these EMLs, substantially following the steps presented above, and renamed them. It looked like, somehow, at least some of them were duplicates after all. So I was not understanding something there. Others were apparently not renaming properly because the original filenames contained characters like ®.
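The spreadsheet steps above can also be sketched in a few lines of Python, for anyone who would rather script the parsing than build columns. This is only an illustration under assumptions: the 13-character stamp is taken to look like "YYYYMMDD-HHMM" (inferred from the LEFT and MID offsets above), the sample filename in the usage note is invented, and the function names are hypothetical. The caret-before-hyphen choice mirrors the "Best" column, and the optional names dictionary stands in for the VLOOKUP against the Names worksheet.

```python
def split_combined(modified, marker=".emlTo: "):
    """Split 'stamp-sender^subject.emlTo: recipient' into named parts."""
    name_part, _, recipient = modified.partition(marker)
    stamp = name_part[:13]                 # like =LEFT(C2,13)
    remainder = name_part[14:].strip()     # like =TRIM(MID(C2,15,LEN(C2)))
    cut = remainder.find("^")              # prefer the caret ...
    if cut == -1:
        cut = remainder.find("-")          # ... and fall back to a hyphen
    if cut == -1:
        sender, subject = remainder, ""    # no delimiter: needs manual review
    else:
        sender = remainder[:cut].strip()
        subject = remainder[cut + 1:].strip()
    return {"stamp": stamp, "sender": sender,
            "recipient": recipient.strip(), "subject": subject}


def new_date(stamp):
    """Turn an assumed 'YYYYMMDD-HHMM' stamp into 'YYYY-MM-DD HH.MM'."""
    return f"{stamp[0:4]}-{stamp[4:6]}-{stamp[6:8]} {stamp[9:11]}.{stamp[11:13]}"


def rename_command(original, parts, names=None):
    """Build a REN line like the one in the Batch column.

    'original' is the untouched exported filename without its .eml extension;
    'names' optionally maps raw sender/recipient strings to preferred ones.
    """
    names = names or {}
    sender = names.get(parts["sender"], parts["sender"])
    recipient = names.get(parts["recipient"], parts["recipient"])
    new_name = (f"{new_date(parts['stamp'])} Email from {sender} "
                f"to {recipient} re {parts['subject']}.eml")
    return f'ren "{original}.eml" "{new_name}"'
```

For example, passing the modified string "20120619-1432-Yahoo_com^Account notice.emlTo: Ray Woodcock" to split_combined, and then the result to rename_command along with the original name and a mapping of "Yahoo_com" to "Yahoo", yields a REN command whose target begins "2012-06-19 14.32 Email from Yahoo to Ray Woodcock". Rows where the delimiter guess is wrong would still need the manual review described above.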

In the end, I wound up with all but 10 of the exported emails. But which 10 did I lose? I probably didn't lose any. There was a point when I deleted a handful of what I thought were duplicates. Now it seemed they probably weren't.
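The two questions raised here (which files are on disk but not in the spreadsheet, and whether two files were being given the same new name) are both simple set comparisons. A minimal Python sketch follows, with hypothetical function names; it assumes the two lists of names have already been read from the DIR output and from the spreadsheet. Names are compared in lowercase because Windows treats filenames as case-insensitive.

```python
from collections import Counter


def compare_listings(on_disk, in_spreadsheet):
    """Return (names only on disk, names only in the spreadsheet)."""
    disk = {name.lower() for name in on_disk}
    sheet = {name.lower() for name in in_spreadsheet}
    return sorted(disk - sheet), sorted(sheet - disk)


def duplicate_targets(new_names):
    """Return new names that more than one file would be renamed to."""
    counts = Counter(name.lower() for name in new_names)
    return sorted(name for name, count in counts.items() if count > 1)
```

Running duplicate_targets on the New Name column before running the rename commands would flag collisions in advance, rather than leaving them to be discovered among the files that failed to rename.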

As these final notes suggest, while this process went much more smoothly than on my previous exports of emails from Thunderbird, there were still some parts of the process where I was making mistakes or where things were not going smoothly.

Fourth Step: Converting the Appropriately Renamed EML to PDF

Some of the previous posts cited at the top of this post grappled with the problem of converting EML to PDF. It would seem that it should have been a straightforward matter of selecting EMLs in Windows Explorer -- indeed, within Thunderbird itself -- and clicking a Print command. Alas, it was not, not if the goal was to have PDFs whose filenames would be recognizable. I hoped there would be a Thunderbird add-in that would solve all the many steps shown above. I hadn't found one yet. In my most recent effort, I had proceeded only as far as a truly cumbersome solution that divided EMLs with and without attachments, used Emacs to edit EMLs with attachments so that they would print, extracted attachments to separate files that could also be PDFed, and then manually matched the PDFed attachments up with their PDFed parent emails. Truly a mess, very time-consuming, and for that reason I hadn't gone very far with it. Most of my emails were still in EML rather than PDF format.

I had decided, generally, that PDF was the superior long-term archival format. I didn't want lots of formats rattling around, lest the day come (as had happened for previous formats) when it was a struggle to find software that would read it. That said, EMLs were presently displaying nicely in Thunderbird, with easy access to attachments. Having devoted the time to the foregoing effort, I was presently out of time for further development of this project. So the stack of EMLs grew higher, and the day for conversion to PDF still lay somewhere in the future.

Saturday, April 23, 2011

Windows 7: Archiving Emails (with Attachments) from Thunderbird to Individual PDFs - First Try

I had been collecting email messages in Thunderbird for a long time. I wanted to export them to PDF-format files, one message per file, with filenames in this format:

2011-03-20 14.23 Email to John Doe re Tomorrow.pdf

reflecting an email I sent to John Doe at 2:23 PM on March 20, 2011 with a subject line of "Tomorrow." This type of filename would sort correctly in Windows Explorer, chronologically speaking; hence, I could quickly see the order of messages. There were a lot of messages, so I didn't want this to be a manual process. This post describes the steps I took to make it semi-automated.
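As a small illustration of why this format sorts chronologically, here is a Python sketch that builds such a name from a date and time. The function name archive_name is made up for this example. Because the year comes first and every component is zero-padded, a plain alphabetical sort of the names is also a sort by date.

```python
from datetime import datetime


def archive_name(sent, recipient, subject, extension="pdf"):
    """Build a name like '2011-03-20 14.23 Email to John Doe re Tomorrow.pdf'."""
    stamp = sent.strftime("%Y-%m-%d %H.%M")  # 24-hour clock, dot instead of colon
    return f"{stamp} Email to {recipient} re {subject}.{extension}"


example = archive_name(datetime(2011, 3, 20, 14, 23), "John Doe", "Tomorrow")
```

The dot between hours and minutes matters: a colon is not allowed in a Windows filename.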

The Screen Capture Approach

The first thing I did was to combine all of the email messages that I wanted to export into one folder in Thunderbird. Then I deleted duplicates from that folder. Next, I decided that, actually, I was just interested in exporting messages prior to the current year, since recent messages might have information I would want to search for in Thunderbird. So I moved the older messages into a separate folder. I maximized the view of that folder in T-bird and shut down unnecessary layout features (i.e., message pane, status bar), so that I could see as many emails as possible on the screen, and as much as possible of the relevant data (i.e., date, time, sender, recipient, subject) for each email. I did that because I wanted to capture the text information about the individual email messages. The concept here was that I would do a screenshot for each screenful of emails, and would save the data from that screenshot into a text file that I could then massage to produce the desired filenames. For this purpose, I tried a couple of searches; I downloaded and ran JOCR; but after a bit of screwing around I decided to revert to the Aqua Deskperience screen text capture shareware that I had purchased years earlier.

Index.csv

Then it occurred to me that perhaps I could just export all that information from T-bird at once. I ran a search and downloaded and installed the ImportExportTools add-on for T-bird. (Alternatives to ImportExportTools included IMAPSize and mbx2eml, the latter explained by About.com.) It took Thunderbird a while to shut down and restart after the installation. I assumed it was getting acquainted with the fact that I had relocated so many messages to a new folder. When it did restart, I ran the add-on (Tools > ImportExportTools > Export all messages in the folder > Just index (CSV)). I opened the CSV file (named index.csv) in Excel and saw that this was perfect: I had a list of exactly the fields mentioned above (date, time, etc.). I verified that Excel was showing a number of rows equal to the number of messages reported on the status bar back in Thunderbird.

I noticed that some of the data in the Excel file included characters (i.e., \ / : * ? " < > | ) that were not permitted in Windows filenames. The Mbx2eml option (above) would have removed these characters automatically, but for this first time I wanted to do everything manually, so as to see how it was all working. I thought this might also be better for purposes of making things the way I wanted them. I was also not sure that Mbx2eml would produce a CSV file, or that it would output the emails in the same order. There seemed to be some other potential limitations. It looked like a worthy alternative, but I didn't explore it.
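For reference, stripping those characters is easy to script. The following Python sketch is one possible approach, not what Mbx2eml does; the function name sanitize and the choice of an underscore as the replacement are assumptions for illustration. It also trims trailing periods and spaces, which Windows does not keep at the end of a filename.

```python
import re

# The nine characters listed above: \ / : * ? " < > |
ILLEGAL_CHARS = re.compile(r'[\\/:*?"<>|]')


def sanitize(text, replacement="_"):
    """Replace characters Windows forbids in filenames; trim the ends."""
    cleaned = ILLEGAL_CHARS.sub(replacement, text)
    return cleaned.strip().rstrip(" .")
```

As the next paragraph explains, cleaning the spreadsheet entries too early makes it harder to match them against the exported files, so a function like this is best applied only when building the final names.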

Somewhere around this point, I went ahead prematurely with a time-consuming effort to revise the entries in the spreadsheet, so as to remove the unacceptable characters and otherwise make them look the way I wanted. Eventually, I realized that this was a mistake, because now I would have a hard time matching spreadsheet entries automatically with the actual emails that I would be exporting from Thunderbird. So I dropped that attempt and went back to the point of trying to plan in advance for how this was all going to work.

Attachments

I had assumed that I wanted to export emails to individual .eml files because EML format would bring along any attachments that happened to be included with a particular email message. But I didn't plan to just leave the individual emails in EML format; I wanted to save them all as PDFs. In other words, I wanted to have the email and its attachment within a single PDF.

A quick test notified me that printing EMLs would be no more successful at including the attachments than if I just printed directly from Thunderbird, without all this time-consuming exporting and renaming. There were other solutions that would have worked for that purpose as well. A search led me to InboxRULES, which for $40 would do something or other with attachments in Outlook. (Alternate: Automatic Print Email for $69.) There didn't seem to be a solution for Thunderbird, and I wasn't keen on spending $40 and having to install Outlook and move all these emails there in order to print their attachments. I thought about handling the attachments manually -- print the email first, then print the attachment, and append it to the email -- but a quick sort in Thunderbird told me there were hundreds of messages with attachments. Funny thing about that, though: as I arrow-keyed down through them in Thunderbird, allowing them to become visible one at a time, I saw that Thunderbird would change its mind with many of them: it thought they had attachments, but then it realized they didn't. That trimmed out maybe 5% of the ones that had originally been marked as having attachments. But there were still a lot of them.

Another search led to some T-bird options, but it still looked like there was going to be a lot of manual effort before I'd have a single PDF containing both the email and its attachment. Total Thunderbird Converter looked like it might come close, at a hefty price ($50). It wasn't reviewed on CNET.com or anywhere else, as far as I could tell, so there was a risk that (as I'd experienced in other cases) the program simply wouldn't work properly. But then I saw that they were offering a 30-day free trial, so I downloaded and installed it. It turned out to be useless for my purposes: it had almost no options, and therefore could not find my Thunderbird folders, which I was saving on drive D rather than C so as to avoid losing them in the event of a Windows update or reinstallation. I looked at Email Open View Pro (or its alternate, emlOpenView Free), which also offered a free trial. It didn't look like it (or Universal Converter, or MSG Viewer Pro, or E-mail Examiner, or Convert EML to PDF) would bring the attachments into the same PDF as the email, so I moved on. I tried Birdie EML to PDF Converter. Their free demo version allowed me to convert one EML file at a time. I liked its interface: it gave me eight different naming options for the resulting file (e.g., "date + subject + from," in several different date formats). I didn't like the output, though: they formatted the PDF for the EML file oddly, with added colors that I didn't want, and all they did with the attachment was to put it into a subfolder bearing the same name as the resulting PDF. I'd still have to PDF it -- the example I used was an EML with a .DOC file attachment -- and merge it with the PDF of the EML. But now they had led me to see that perhaps I could at least automate the extraction of attachments, getting me partway to the goal.

At about this point, Thunderbird inconveniently lost a folder containing several thousand email messages. It just vanished. The program was struggling there for a few minutes before that, and part of me was instinctively thinking that I should shut down the program and do a backup, but this would have been a very deeply subconscious part of me that was basically unresponsive under real-life conditions. In other words, I didn't. So now I had to go rooting around in backups to see what I could rescue from the wreckage. I found that Backup Maker had been happily making backups, as God intended. Amazing what can happen, when you have about five different backup systems running; in this case I had just wiped a drive, moved a drive, etc., and so of course Backup Maker was the *only* backup system that was actually in a position to restore real data when I seriously needed it. What Backup Maker had saved was some files with an .MSF extension. These were supposedly backups of Thunderbird. But then, no, on closer inspection I realized these were much too small, so I had to do some more digging. Eventually I did patch together something resembling the way things had been before the crash, so I could go back and pick up where I had left off. A couple of days passed for other interruptions here, so the following information just reports where I went from this point forward.

I had the option of just saving the Thunderbird file, or the exported emails, for some future date when there would perhaps be improved software for printing attachments to PDF in a single operation with the emails to which they were attached. There had been times when software developments had saved (or would have saved) a great amount of time in what would have been (or actually was) a tedious manual effort. On the other hand, I had also seen situations where letting something sit meant letting it become confused or corrupted, or where previous learning (especially on my part) had been lost. I decided to go ahead with converting the emails to PDF to the extent possible without a tremendous time investment.

My searching led to Attachment Extractor, a Thunderbird extension. I installed it, highlighted two emails with attachments, right-clicked on them, and selected "Extract to Suggested File-Folder." It worked -- it did extract the attachments without removing them from the emails. I assumed it would do this with hundreds of emails if I so ordered. Then, to get them matched up with PDFs of the emails to which they were attached, I would apparently have to page down through those emails one by one, looking at what attachments they had, and setting them up for more or less manual combination. Attachment Extractor did have one potentially useful feature for this purpose: a right-click option to "Extract with a Custom Filename Pattern." I found that I could configure the names given to the extracted attachments, so that they would correspond at least roughly with the names of emails printed to PDF. To configure the naming in Attachment Extractor, I went into Thunderbird > Tools > Add-ons > Extensions Tab > AttachmentExtractor > Options > Filenames tab. There, I used this pattern:

#date# - Email from #fromemail# re #subject# - #namepart# #count##extpart#

and, per instructions, in the Edit Date Pattern option I used this pattern:

Y-m-d H.i

That gave me extracted attachments with names that were at least roughly similar to the format I wanted (see above).

Batch Exporting Emails with Helpful Filenames

Now if I could print the corresponding email to PDF with a fairly similar name, the manual matching might not be so difficult. A search led to inquiries about renaming PDF print output. For $60, I could get Zan Image Printer, which sounded like it would have some capability for automating PDF printout filenames. Print Helper, for $125 to $175, was another option. A Gizmo's Freeware article did not say much about this kind of feature, though several people asked about it. A list of free PDF printers led me to think that UltraPDF Printer was free and would do this; its actual price was $30.

The pdf995 Experiment

At this point, I was reminded of how much time I could waste on uncooperative software. No doubt many people have used pdf995 successfully. I was not among them.

I tried Pdf995's Free Converter. The instructions on bypassing the Save As dialog were in the Pdf995 Developer's FAQs page. They seemed to require me to open C:\Program Files\PDF995\res\pdf995.ini in Notepad. But that .ini file seemed to be configured for printing one specific file that I had just printed. They didn't say how to adjust it. Eventually I figured out that I needed to download and install pdfEdit995, and make the changes there. So I tried that. But I got an error message:

PdfEdit995 requires that Pdf995 v9.1 or later and the free converter v1.2 or later are already installed on your system.

But they were! I had just installed them. Was I supposed to reboot first? No, a reboot didn't fix it. I tried again to install basic pdf995 and the Free Converter, which I had downloaded together. Once again, I got the webpage saying I had installed it successfully. Was I supposed to install the printer driver too? I understood the webpage to be saying that was included in the 9MB download. But I tried that. Got the congratulatory webpage, so apparently it installed correctly. Now I noticed I had another error, which had not come up on top, so I was not sure how long it had been there:

Setup has detected you have an old version of pdfEdit995 that is incompatible with the latest version of pdf995.

But I had just downloaded it from their website! Not an altogether auspicious beginning here. But I downloaded and installed the very latest and tried again, and now it seemed to work, or at least I got a different congratulatory webpage than before. A cursory read-through still did not give me a clear approach to automated naming of PDFs. Instead, they said that maybe I wanted their OmniFormat program for batch PDF creation. Who knew? I downloaded and installed OmniFormat. Got a new congratulatory webpage, but still no straightforward explanation of batch naming. Instead, it said that pdfEdit995 was what I wanted to create batch print jobs. So, OK, a bridge too far. Though at this point they specified "batch print jobs from Microsoft Office applications," raising the prospect that this wasn't going to work from Thunderbird. Went back to their incredibly tiny-print pdfEdit instructions page. It said I would have to set pdf995 as the default printer to do the batch thing. That was OK. But it still sounded like it was intended primarily for batch printing from Microsoft Word. I decided to just try making pdf995 the default printer. That required me to go to the Windows Start button > Settings > Printers > right-click on PDF995 > set as default printer. While I was there, I right-clicked on PDF995 and looked at its Properties, but there didn't seem to be anything to set for purposes of automating printing. Now I went to Thunderbird, selected several messages, and did a right-click > Print. Funny, it defaulted to Bullzip, which was my usual default printer. Checked again: yeah, pdf995 was definitely set as my default printer. Tried again, and this time manually set it to pdf995 when it was printing. It asked for the filename, so that wasn't working. Back in Printers, I looked at the Properties for Bullzip, but it didn't seem to have any options for automatic naming either. It seemed pdf995 was not the solution for me.

I came away from this little exploration with less time and ambition for the larger project. Certainly I wasn't in the mood to buy software and then discover that I couldn't make it work.

Further Exploration

I ran across an Addictive Tips article that said Print Conductor was a good batch printing option, though I might need to have Adobe Acrobat installed first. I did, so I took a look. There was an option to download Universal Document Converter (UDC) as well. I wasn't sure, but I thought I might need that for Print Conductor, so I installed both. Print Conductor didn't seem to have a way of printing EML files. Meanwhile, UDC's installer gave me the option of making it the default printer, so I tried that. But as before, Thunderbird defaulted to Bullzip, so I had to select UDC as my printer manually. (Print Conductor did not appear in the list of available printers.) When I selected UDC as the printer, before printing, I clicked on the print dialog's Properties button and went into the Output Location tab. There, I chose the "predefined location and filename" option. I left the default filename alone and printed. And it worked. Sure enough, I had a couple of PDFs whose names were the same as the Subject fields shown in Thunderbird for those printed emails. So I would be able to match them with the attachments produced by Attachment Extractor (above). All I had to do now was to pay $69 for a copy of UDC, so that each PDF would not have a big black "Demo Version" sticker on it.

Recap

So to review the situation at this point, I had a way of extracting email attachments with highly specific date, time, and subject filenames. I also had a way of extracting emails themselves whose filenames would show date and subject, using ImportExportTools (above): Tools > ImportExportTools > Export all messages in the folder > EML format. Unfortunately, there could be a number of messages in a single day on the same subject. Without the time data in the filename, I would have duplicates. More to the point, it would be difficult to match emails and attachments automatically, and I didn't want to go through that matching process for large numbers of emails. I would also prefer a result in which emails converted to PDFs would appear in the right order in Windows Explorer, and that would require the time data. As I was writing this recap, several days after writing the foregoing material, I was not entirely sure that I had verified that the output filename in UDC would include the time data. But if that proved to be the case on revisit, at this point one option would be to buy UDC (or perhaps one of the other programs just mentioned) for purposes of producing properly named emails. Another would be to export the list of emails to Index.csv (above) and to hope that this list would match the order in which ImportExportTools would export individual emails. There would still be the possibility that such a program would sometimes fail to do what it was supposedly doing, perhaps without me noticing until long after the data from which I had exported and renamed various files would be long gone.

The Interim Solution

I decided that, at this point, I could not justify the enormous time investment that would be required to complete this project -- in particular, to manually print to PDF each attachment to each email, to combine those PDFs, and to match and merge them with a PDF of the email message to which they had been attached. This seemed like the kind of project that really had to await some further development in application software. For all I knew, the kind of solution I was seeking already existed, and I was just waiting until the day when I would become aware of it. It was not at all an urgent project -- I rarely consulted attachments for old emails, and almost never consulted them for prior years, where I was focusing my present attention.

I wanted to get those old emails out of Thunderbird. I didn't like the idea of having all that data at the mercy of a relatively inaccessible program (i.e., I couldn't see those emails in Windows Explorer), and anyway I didn't want T-bird to be so cluttered. It seemed that a good solution would be to focus on the emails themselves for now. I would export them to EML format. EMLs would continue to contain the attachments. I would then zip the EMLs into a small number of files, each no more than a few gigabytes in size, perhaps on a year-by-year basis, and I would store them until further notice. Before zipping, I would make sure the EMLs were named the way I wanted, and would print each of them to separate PDFs. So then I would have at least the contents of the emails themselves in readily accessible format, and could go digging into the zip file if I needed an attachment. If I did someday find a way to automate the task of combining the emails and their attachments into a single PDF, I would give those PDFs the same name as the current email-only PDFs, so that the more complete versions would simply overwrite the email-only versions in the folders where I would store them.

Export and PDF via Index.csv

I decided to try and see if the Index.csv approach would work for purposes of producing EMLs whose names contained all of the elements identified above (i.e., date, from, to, subject). I had sorted the old emails in Thunderbird into separate folders by year. I went to one of those folders in T-bird and sorted it in ascending date order. Then I went into Tools > ImportExportTools > Export all messages in the folder > Just index (CSV). This gave me what appeared to be a matching list of those messages, in that same order. The number of lines in the CSV spreadsheet (viewed in Excel) matched the number of messages in that folder as stated in T-bird's status bar.

I wondered what would happen if I exported another Index.csv after sorting the emails in that T-bird folder in descending chronological order. Good news: the resulting index.csv produced in that experiment seemed to be reversed from the one I had produced in ascending order. At least the first and last emails were in reversed positions. So it did appear that index.csv matched the order that I saw in T-bird.

On that basis, I added an Index number column at the left end of the index.csv file I was working with, the one with emails sorted in ascending date order. This index column just contained ascending numbers (1, 2, 3 ...), so that I could revert to the original sort order if needed. I assumed that the list would continue to sort in proper date order, but I planned to revise the date field (presently in "7/4/1997 18.34" format) so that it could function for file sorting purposes (e.g., 1997-07-04 18.34). I wasn't sure that the present and future date fields would always sort exactly the same. I could have retained the existing date field, but I wasn't sure that it, itself, was reliable for sorting purposes: would two messages arriving in the same minute always sort in the same order?
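For anyone who would rather script that date revision than build it from spreadsheet formulas, a minimal Python sketch follows. It assumes the index dates really are in month/day/year order with a period between hours and minutes, as in the example above; the function name is made up for illustration.

```python
from datetime import datetime

def sortable_stamp(index_date):
    """Turn '7/4/1997 18.34' into '1997-07-04 18.34'."""
    # Assumed input layout: month/day/four-digit year, then hours.minutes.
    parsed = datetime.strptime(index_date, "%m/%d/%Y %H.%M")
    # Year-first, zero-padded output sorts chronologically as plain text.
    return parsed.strftime("%Y-%m-%d %H.%M")

print(sortable_stamp("7/4/1997 18.34"))
```

A stamp in this form still only goes down to the minute, so it would not settle the order of two messages arriving in the same minute.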

Now I exported the emails themselves: Tools > ImportExportTools > Export all messages in the folder > EML format. As partially noted above, these were named in Date - Subject - Number format. I now did a search to try to figure out what that number signified. It wasn't clear, but it seemed to be just randomly generated. Too bad. It would have been better if they had included the time at the start of that random number, and had put it immediately after the date, so that the EMLs would sort in nearly true time order. (There could still be multiple emails on the same subject within the same minute, and T-bird didn't seem to save time data down to the second or fraction of a second.) It seemed I would have to manually sort files bearing the same subject line and arriving or being sent on the same day. There would surely be large numbers of files like that. I now realized they would not at all be sorted correctly in Windows Explorer: with only date (not time) data in the filename, a file arriving in the morning with a subject of Zebras would be sorted, in Windows Explorer, after a file arriving in the afternoon on the subject of Aardvarks, and if there were three on the subject of Aardvarks they would all be sorted together even if they had arrived at several different times of day.
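The sorting problem is easy to see in a tiny Python sketch. The filenames here are hypothetical, and the sketch uses plain alphabetical sorting, which is only an approximation of how Windows Explorer orders names.

```python
# Hypothetical names: Zebras arrived in the morning, Aardvarks in the afternoon.
date_only = ["19970730-Zebras", "19970730-Aardvarks"]
with_time = ["19970730-0836-Zebras", "19970730-1502-Aardvarks"]

# With only the date in the name, the subject decides the order.
print(sorted(date_only))

# With the time right after the date, the order is chronological.
print(sorted(with_time))
```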

Ah, but now I discovered that ImportExportTools had file naming options. Silly me. I had just overlooked that. But there they were: Tools > ImportExportTools > Options > Filenames tab. I selected "Add time to date" and I chose Date - Name (Sender) - Subject format. Now I tried another export of EMLs. The messages now had names like this:

19970730-0836-Microsoft-Welcome!

I checked and, sure enough, that was a message from Microsoft on that date at 8:36 AM. Suddenly the remainder of my job got a lot easier. I went back to the Index.csv spreadsheet (now renamed as an .xls) and worked toward perfecting its match with the filenames produced by ImportExportTools. There were two parts to this mission. First, I had to rework the Index.csv data exported from T-bird so that it would match the filenames given to the EMLs by ImportExportTools. Second, I would then use the spreadsheet to produce a batch file that would rename those files to the format I wanted. This called for some spreadsheet manipulation described in another post.
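As a rough illustration of what those new names contain, here is a Python sketch that splits one of them back into its parts. It assumes the Date-Time-Sender-Subject layout shown in the example and a file extension already removed; the function name is hypothetical.

```python
from datetime import datetime

def parse_export_name(name):
    """Split '19970730-0836-Microsoft-Welcome!' into (datetime, sender, subject)."""
    # maxsplit=3 keeps later hyphens inside the subject. A hyphen inside the
    # sender's name would still confuse this simple split.
    date_part, time_part, sender, subject = name.split("-", 3)
    when = datetime.strptime(date_part + time_part, "%Y%m%d%H%M")
    return when, sender, subject

print(parse_export_name("19970730-0836-Microsoft-Welcome!"))
```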

Converting EMLs to PDF

Now I faced the problem of converting the exported EMLs to PDF, as distinct from the problem (above) of exporting PDFs from Thunderbird.

I found that EMLs could be converted into TXT files just by changing their extensions to .txt, which was easy enough to do en masse with a program like Bulk Rename Utility. That would permit them to be converted to PDFs without the rich text, if necessary, since it was a lot easier to find a freeware program that would do that (or, in my case, to use Acrobat) than to find one that would PDF an EML. This appeared to be a viable solution for some of the older emails, which had apparently been through the wringer and were not showing much sign of having glorified colors or other rich text or HTML features.
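The extension change could also be scripted. The Python sketch below is one cautious way to do it: instead of renaming in place, it copies each EML into a second folder under a .txt name, so the originals stay untouched. That copy step is my substitution, not what Bulk Rename Utility does, and the function and folder names are placeholders.

```python
import shutil
from pathlib import Path

def copy_emls_as_txt(eml_dir, txt_dir):
    """Copy every .eml file in eml_dir into txt_dir with a .txt extension."""
    out = Path(txt_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for eml in Path(eml_dir).glob("*.eml"):
        # stem is the filename without its final extension
        shutil.copy2(eml, out / (eml.stem + ".txt"))
        count += 1
    return count
```

Calling copy_emls_as_txt("EMLs", "TXTs") would return the number of files copied.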

Before proceeding with this, I decided to export all of the remaining EMLs from Thunderbird. I knew I could read the EMLs and the TXTs (if I renamed them as that); I also knew I could reimport them into T-bird. This seemed like a separate step. I also decided that going back through the exporting process would give me an opportunity to write a cleaner post that would summarize some of the foregoing information.

Using a Spreadsheet to Rename Thousands of Files - First Try

I had a list of a couple thousand files. I got the list as an export from a program, but I could also have gotten it from a directory listing at a command prompt (e.g., DIR /a-d /b /s). This post describes how I used that list to rename those files. Needless to say, since I was working with the possibility of screwing up years' worth of information, I did make a backup of these files before proceeding.
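A list like the one from DIR /a-d /b /s can also be produced in Python. This is only a sketch of a rough equivalent (files only, full paths, subfolders included); the function name is invented here.

```python
from pathlib import Path

def list_files(root):
    """Return the full paths of all files (not folders) under root."""
    # rglob("*") walks subfolders; is_file() leaves out the folders themselves.
    return sorted(str(p) for p in Path(root).rglob("*") if p.is_file())
```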

This particular list was a list of emails that I had just exported from Thunderbird, as described in another message that I expect to post to this blog on the same day as this one. I had exported, not only the list, but also the actual emails, in EML format. I planned to use the list to rename those EMLs so that they would be in the format I wanted, which was like this:

2011-03-20 14.23 Email to John Doe re Tomorrow.pdf

It would require some massaging to get there.

The Date and Time Segment

Starting with the date and time, here's what Thunderbird had given me in the list of files:

3/20/11 14.23

These were all in one field, in the .csv (i.e., Excel-readable) output from T-bird. They might have been in two or more fields, in a text file produced by a directory listing (e.g., DIR /a-d /b /s > dirlist.txt), but to some extent the same techniques might prove useful. These numbers were not in a format that would work as a Windows filename, and in fact the exported EMLs did not contain the date and time data in this format.

I was going to use the list to rename the emails the way I wanted, with data that existed in the list but not in the current EML filenames. To do that, I would need to try to get the list so that it accurately represented the actual EML filenames. Otherwise, if the list produced a command that said, Rename File 234A.eml to be "Message from Ray 234A," that command would not work if the file being renamed was actually called File_234A.eml (with an underscore). My command would just get a "File not found" error. So my first step was to find a way, in my Excel spreadsheet, to reproduce what the files were actually called, as I viewed them in Windows Explorer.

First, I would have to extract each element into a separate column, there in the spreadsheet, so that I could format and rearrange them as desired. For this, I started with FIND commands. (Excel's internal help had good information on using these and other related commands. I was doing this in Excel 2003. There might have been other functions that would automate this process in newer versions.) For instance, =FIND("/",A1) would locate the first occurrence of the forward slash in cell A1, if that's where I put the "3/20/11 14.23" item shown above. So then I could search for the next slash with =FIND("/",A1,B1+1), starting one place after the results of the first FIND statement (which, in this example, I had put in cell B1). I could do the same for spaces, colons, or whatever else might delimit various components of the date and time entry. I'd then use RIGHT, LEFT, or MID statements to find those components. For instance, a MID statement like =MID(A1,B1+1,C1-B1-1) might start one space after the first slash, continue to one space before the second slash, and thus give me the day of the month.

In the 3/20/11 14.23 example, the value of "3" for the month wouldn't give me the desired "03" so I would use IF and LEN statements to pad that out:

=IF(LEN(G1)=1,"0","")&G1

thereby inserting a zero in front of single-digit month values found in cell G1, but adding nothing to those month values if they were not single-digit.
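The same find-and-slice logic can be sketched in Python, which may make the spreadsheet steps easier to follow. This assumes the "3/20/11 14.23" layout shown above; the function name is hypothetical. One difference to keep in mind is that Python counts positions from 0, while Excel's FIND counts from 1.

```python
def split_index_date(stamp):
    """Split '3/20/11 14.23' into zero-padded month, day, year, hour, minute."""
    first_slash = stamp.find("/")                    # like =FIND("/",A1)
    second_slash = stamp.find("/", first_slash + 1)  # like =FIND("/",A1,B1+1)
    space = stamp.find(" ")
    dot = stamp.find(".", space)
    month = stamp[:first_slash]
    day = stamp[first_slash + 1:second_slash]        # like =MID(A1,B1+1,C1-B1-1)
    year = stamp[second_slash + 1:space]
    hour = stamp[space + 1:dot]
    minute = stamp[dot + 1:]
    # zfill(2) plays the role of the IF/LEN padding formula.
    return tuple(part.zfill(2) for part in (month, day, year, hour, minute))

print(split_index_date("3/20/11 14.23"))
```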

I had gotten two different things from Thunderbird. On one hand, as noted above, I had gotten a list of data about emails that I was going to export, with dates in the 3/20/11 format. On the other hand, I had also gotten the actual emails, exported as individual EML files. These were not named quite the same way. The particular Thunderbird extension that I had used to produce this list of files, ImportExportTools, had produced files whose names were in this format: Date-Time-Sender-Subject. But in those filenames, the dates could not be rendered as 3/20/11, since slashes were not allowed in Windows 7 filenames. Instead, the date and time came in the format of 20110320-1423. So as hinted above, I would ultimately be producing a batch file that would automate the renaming of thousands of files. That example, 20110320-1423, would instead begin with the more readable 2011-03-20 14.23.
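Going from the exported form to the readable form is a one-step conversion if scripted. A minimal Python sketch, assuming the stamp is always eight date digits, a hyphen, and four time digits (the function name is mine):

```python
from datetime import datetime

def readable_stamp(export_stamp):
    """Turn '20110320-1423' into '2011-03-20 14.23'."""
    parsed = datetime.strptime(export_stamp, "%Y%m%d-%H%M")
    return parsed.strftime("%Y-%m-%d %H.%M")

print(readable_stamp("20110320-1423"))
```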

Next: Massaging the Sender Data

Before I could get to that point, I had to change some of the Sender data that T-bird had produced. Again, the list of actual email data from T-bird contained characters (e.g., the > symbol) that could appear in email Subject lines but that were not allowed in Windows 7 filenames. (The full set of forbidden characters: \ / : * ? " < > | ). For example, the spreadsheet might show a sender to be this:

John Doe <jdoe@xcom.com>

but the actual exported EML file's name would just have John Doe. So I could get rid of some of these problems of verboten characters by just doing a FIND for < and then a MID or a LEFT statement for everything before that. That wouldn't necessarily get rid of all the bad-character problems, but I wanted to clean those up just once, so I deferred that problem for the moment. For right now, in a bid to match the format of the exported EMLs, I now had this much of the filename:

20110320-1423-John Doe

To get there, the actual command I used, for the Sender portion, was this:

=LEFT(B2,V2-1)

Cell B2 had the Sender's name as exported from Thunderbird, and cell V2 had the FIND location for the < symbol, which marked the end of the part of the Sender data that I planned to use.
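In Python, the same trimming might be sketched as follows. The function name is hypothetical. Unlike the bare LEFT formula, this version also copes with a sender field that has no < at all, and it strips the space that sits just before the bracket.

```python
def sender_name(sender_field):
    """Return the display-name part of 'John Doe <jdoe@xcom.com>'."""
    # partition() returns the whole string as the first item if "<" is absent.
    name, _, _ = sender_field.partition("<")
    return name.strip()

print(sender_name("John Doe <jdoe@xcom.com>"))
```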

At this point, there was a problem. For some reason, ImportExportTools put underscores before and after some Senders' names, but not all. So I might instead have this:

20110320-1423-_John Doe_

The filenames already used a hyphen ( - ) to delimit fields, so the combination hyphen-underscore seemed superfluous. It needed to be fixed before I could continue with the main project here, so that I could be confident that Sender names were delimited consistently by a hyphen, and not by unpredictable choices involving underscores.

Detour: Renaming Thousands of Files, So That I Could Proceed to Rename Thousands of Files

To fix this problem, I went to the Windows 7 command prompt (Start > Run > cmd) and did a directory listing to save the filenames to a file: DIR /b > dirlist.txt. I copied the contents of that file into column A of a new Excel spreadsheet. I also copied those contents into Notepad. In Notepad, I did two global replaces (changing both -_ and _- to just plain - ). I copied those changed contents into column B of that Excel spreadsheet. Next, I needed to combine these two columns, A and B, into a single DOS command that would rename the files. To do that, I put this formula in column C:

="ren "&char(34)&A1&char(34)&" "&char(34)&B1&char(34)

The char(34) entries would introduce quotation marks: 34 was the ASCII code for double quotes. I could have introduced any character that way -- Z, a semicolon, whatever -- but it was essential to use the char(34) approach for quotation marks specifically because otherwise Excel would have misunderstood them. Quotation marks would be necessary for any DOS command involving filenames that contained spaces; DOS would otherwise think that the space marked the end of a part of the command. Anyway, I copied the contents of column C back into Notepad and saved the file as a batch file (renamer.bat). I saved renamer.bat in the folder where I had all those EML files I was going to rename. At a DOS prompt in that folder, I typed "renamer.bat," without the quotes. I could have just double-clicked on renamer.bat in Windows Explorer. Either way, the files were renamed.
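The Notepad-plus-Excel steps above can be condensed into a short Python sketch that builds the same kind of REN lines. This is an illustration only: the function name is invented, and unlike the spreadsheet version it skips files whose names would not change.

```python
def build_rename_commands(filenames):
    """Build 'ren "old" "new"' lines that drop underscores next to hyphens."""
    commands = []
    for old in filenames:
        # Same two global replaces as in Notepad: -_ and _- become plain -
        new = old.replace("-_", "-").replace("_-", "-")
        if new != old:
            # Quotes are needed because the names contain spaces.
            commands.append(f'ren "{old}" "{new}"')
    return commands

for line in build_rename_commands(["20110320-1423-_John Doe_-Hello.eml"]):
    print(line)
```

The resulting lines could be written to renamer.bat and run as described above.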

Later, I discovered that ImportExportTools would truncate the Subject field in the spreadsheet to 50 characters even if I had not asked it to do so. Many emails in my set had Subject fields longer than that. I initially tried to fix this by setting up a separate spreadsheet, using the process just outlined, but with a search for the last hyphen in the filename, which would hopefully identify the start of the Subject field in most cases. Unfortunately, the calculation of 50 characters proved difficult, after taking into account the character substitutions described here. Ultimately, I started over with a fresh export of EML files from Thunderbird, after setting the option at Tools > ImportExportTools > Options > Filenames tab > Cut subject at 50 characters (and also Cut complete file path length at 256 characters).

Returning to the Sender and Subject Segments

So now I could go back to work on the main spreadsheet. I could assume, that is, that the first part of the EML filenames were going to look like this:

20110320-1423-John Doe

without underscores. I could see that there were still a few underscores around names in the EMLs, and that was problematic. Possibly ImportExportTools had introduced two underscores rather than one in a row, in some cases. There were few enough that I figured I could fix those exceptions manually.

Now it was time to add the Subject part of the filename. This was simple enough: just use another ampersand (&) to combine it with the rest. Sketching it out, the EMLs now just used hyphens to delimit the date, time, sender, and subject portions, so I just needed to use an Excel command of this format:

=[Date]&"-"&[Time]&"-"&[Sender]&"-"&[Subject]&".eml"

and that would give me a complete representation of how ImportExportTools seemed to have constructed the filenames for most of the exported EMLs.

Cleaning Up Unwanted Characters

As noted above, some characters (e.g., the colon, ":") were not acceptable in Windows 7 filenames, but were quite common in email subject lines. It looked like ImportExportTools had replaced those characters with an underscore, and had removed spaces before and after the underscore. So, for instance, a subject line of "Re: Tomorrow" was represented, in the EML filenames, as "Re_Tomorrow," without a space.

At this point, preparing to remove those unwanted characters, my spreadsheet had something like this:

20110320-1423-John Doe-Re: Tomorrow

Some of these fledgling combinations contained other unwanted characters (e.g., question marks), listed above. I was almost ready to remove them. But first, it was time to proofread my spreadsheet. Having copied all of my various formulas down from the first line of the spreadsheet, where I was developing them, so that they were present in all rows, I now prepared to sort the spreadsheet. First, I inserted an Index column as column A, and in that column I inserted numbers in ascending order, starting with 1. The purpose of this column was to remember my original sort order, in case the date field or anything else did not accurately reproduce the order in which emails appeared in Thunderbird. There were times when a person would want to check back and see if things were matching the source. To insert these numbers, using Excel 2003, there were two options. One was to enter a formula in each cell, adding 1 to the number above it, and then replace the formulas with fixed values. The sequence for this was Edit > Copy, Edit > Paste Special > Values. An alternative was Edit > Fill > Series. Either way, I now had index numbers that wouldn't change if I sorted rows.

So now I did sort rows, sorting first on the column that concatenated (i.e., combined) the date, sender, and subject. I sorted to highlight those rows in which the process had not worked as intended. One problem, I saw, was that some Sender names did not include the <jdoe@xcom.com> part. The sender of these emails was just listed as John Doe (or whoever), without the actual email address. For these, I just copied the Sender straight over, skipping the unnecessary part of the spreadsheet calculation. That cleared up the error messages in the spreadsheet. So now I saved a version of the spreadsheet and moved the output column -- the one combining all of my work with these various fields -- and pasted it into Notepad. I could have edited it right there in Excel, but there was a risk that I would accidentally have changed other columns in ways that I did not intend, or that I could not achieve what I needed to achieve. For example, the ? symbol was a wildcard in Excel, so attempting to replace it with an underscore would either not work or create a mess. I noticed that ImportExportTools had also converted commas into underscores, even though they were not forbidden characters. Note: now that I had parts of the spreadsheet in Notepad, and expected to paste those parts back into Excel in the same order, this would have been an especially bad time to sort the spreadsheet.

Over in Notepad, I did global find-and-replace operations for each of the special characters listed above, guided by the objective of matching as closely as possible the actual filenames of the EMLs I had exported from Thunderbird. It was premature to do line-by-line editing of individual entries, though; those would be more obvious and perhaps more easily fixed later. At this point, if I saw something that needed to be changed, I executed it as a global command, so that it would be changed wherever it occurred. Obviously, a person can make serious mistakes with global changes so, again, this was not the time for fine-detail fixes of individual entries. On the other hand, each of these global changes was a potential time-saver: a single change here could save the need to make manual changes to 20 or 300 individual files later.

After making these changes, I again had to replace the -_ and _- combinations with a simple hyphen ( - ), since some forbidden characters that I had just replaced with underscores had been adjacent to the delimiting hyphens that ImportExportTools seemed to convert into simple hyphens. With these changes made, I copied the contents of that Notepad file back into the appropriate place in the spreadsheet.
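For readers who would rather script these global replacements than do them in Notepad, here is a minimal Python sketch. It is an illustration only, not the procedure used in this post. The character set in FORBIDDEN, the function name sanitize, and the sample filename are all assumptions; the set would need adjusting to whatever ImportExportTools actually substitutes on a given system.

```python
# Hypothetical helper approximating the global find-and-replace steps.
# FORBIDDEN is an assumed set: the characters Windows disallows in
# filenames, plus the comma, which ImportExportTools also seemed to convert.
FORBIDDEN = '\\/:*?"<>|,'


def sanitize(name: str) -> str:
    """Replace each forbidden character with an underscore, then collapse
    the -_ and _- combinations into a plain hyphen."""
    cleaned = "".join("_" if ch in FORBIDDEN else ch for ch in name)
    # Repeat until stable, since one replacement can expose another pair.
    while "-_" in cleaned or "_-" in cleaned:
        cleaned = cleaned.replace("-_", "-").replace("_-", "-")
    return cleaned


if __name__ == "__main__":
    # Invented sample name, for illustration only.
    print(sanitize("20120619-1030-John Doe-Re: budget?-final"))
```

As with the Notepad approach, this applies every change globally, so it is worth running on a copy of the list and spot-checking the results before trusting them.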

Testing the Filename Match

How well did my spreadsheet now reflect the actual names of those EML files? I decided to test it. To do this, I first made a backup copy of the folder containing the EMLs. I then created a subfolder in the EMLs folder, called Test. In the spreadsheet, I worked up a MOVE command for each file, commanding it to move to the Test folder. The command was like this:

="move "&CHAR(34)&Z2&".eml"&CHAR(34)&" Test"

I copied all those MOVE commands from the spreadsheet to Notepad and saved the Notepad file as MOVER.BAT in the EML folder. Then I ran it. It succeeded in moving fewer than half of the files, which meant that the spreadsheet had not yet captured hundreds of EML file names correctly. I ran DIR /b > dirlist.txt in the EMLs folder to capture the names of the ones that had failed. I brought that list into the Excel spreadsheet and matched them up with my attempts. This matching process was sufficiently time-consuming that I found myself grateful for the ones that I had managed to identify in a more automated fashion. The time investment was acceptable. As had been the case for me with spreadsheets going back almost 30 years, I figured that, once I identified the rules and became more familiar with this particular process, I would be able to use it again in similar tasks.
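The same move-and-see-what-is-left test could be sketched in Python instead of a batch file. This is a hedged illustration under stated assumptions, not the MOVER.BAT approach itself: the function names move_matches and unmatched_files are hypothetical, and the expected names are assumed to be the spreadsheet's output column without the .eml extension.

```python
import shutil
from pathlib import Path


def move_matches(eml_dir, expected_names, test_subdir="Test"):
    """Move each expected <name>.eml into the Test subfolder.

    Returns (moved, missing): names that were found and moved, and names
    for which no file existed."""
    eml_dir = Path(eml_dir)
    target = eml_dir / test_subdir
    target.mkdir(exist_ok=True)
    moved, missing = [], []
    for name in expected_names:
        source = eml_dir / f"{name}.eml"
        if source.is_file():
            shutil.move(str(source), str(target / source.name))
            moved.append(name)
        else:
            missing.append(name)
    return moved, missing


def unmatched_files(eml_dir):
    """List the EMLs still in the folder, roughly what DIR /b would show."""
    return sorted(p.name for p in Path(eml_dir).glob("*.eml"))
```

As in the batch version, it would be wise to run this only against a backup copy of the EML folder. The missing list shows which spreadsheet names found no file, and unmatched_files shows which files found no name.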

To match up the real filenames with the versions in the spreadsheet, I pasted the contents of dirlist.txt into a separate worksheet in Excel. From both, I compared the date and time data, using VLOOKUP on shortened versions of the relevant columns (=LEFT(Z1,13)). This revealed several things. First, there was an error in how I had converted the date and time data, so I would have to fix that and re-run it to move more of the items from the EML folder into the Test folder. Since I planned on doing this task again, I also decided to automate some of the find-and-replace operations described above. This required using FIND and MID to divide the text string into parts appearing before and after the character I wanted to replace (e.g., colons) and then joining them together without the middle part. I didn't do this for all possible combinations; I was basically interested in eliminating the first and/or most frequent occurrences of the forbidden characters that did seem to appear.
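The prefix comparison and the FIND/MID splitting can be approximated in Python as follows. This is a rough sketch, not a translation of the actual spreadsheet: match_by_prefix and drop_first are hypothetical names, and the width of 13 simply mirrors the =LEFT(Z1,13) formula. One deliberate difference is that VLOOKUP returns the first hit, whereas this sketch leaves a name unmatched when several files share the same prefix.

```python
def match_by_prefix(spreadsheet_names, actual_names, width=13):
    """Pair names whose first `width` characters agree, similar in spirit
    to VLOOKUP on =LEFT(Z1,13). Ambiguous prefixes are left unmatched."""
    by_prefix = {}
    for actual in actual_names:
        by_prefix.setdefault(actual[:width], []).append(actual)
    pairs, unmatched = {}, []
    for guess in spreadsheet_names:
        candidates = by_prefix.get(guess[:width], [])
        if len(candidates) == 1:
            pairs[guess] = candidates[0]
        else:
            unmatched.append(guess)
    return pairs, unmatched


def drop_first(text, ch):
    """Remove the first occurrence of ch by joining the parts before and
    after it, as the FIND and MID formulas do."""
    i = text.find(ch)
    return text if i < 0 else text[:i] + text[i + 1:]
```

Names that land in the unmatched list are the ones that would still need a manual look.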

Starting Over: Revise the Filenames, Not the List

After hours of effort and a couple of retries, the approach described above was still identifying (i.e., successfully moving) only about 80% of the EMLs I had exported from Thunderbird. Depending on the total number of emails involved, that could mean hundreds or even thousands of emails that would have to be manually renamed in order to ensure that their filenames contained all of the desired data.

I decided that the better approach might be to try to rename the files, and keep renaming them if necessary, until they conformed to certain basic rules in the file list exported into Index.csv. There had already been some of that in the steps described above; the change now was to make that the primary effort.

The first step was to take a listing of the EMLs. This was easy enough: DIR /b > dirlist.txt. Then I copied the contents of that text file into Excel and changed components of the filename to be the way I wanted. The names of EMLs exported from Thunderbird did not have "To" fields, but I was able to match up most of the names automatically with what I had already prepared, thus borrowing To fields from the work described above. (This is a cursory description. At this writing, I had run out of time for this project, and was focused on getting it done. But the basic techniques were as described above.)
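Once a list of current and desired names exists, the renaming itself could be scripted. The sketch below is a minimal illustration, assuming the mapping has already been worked out (for example, in the spreadsheet); the function name rename_from_map and its dry_run parameter are hypothetical, and this is not the exact procedure followed here.

```python
from pathlib import Path


def rename_from_map(folder, name_map, dry_run=True):
    """Rename files according to name_map {current_name: desired_name}.

    Skips entries whose source is missing or whose target already exists.
    With dry_run=True nothing is renamed; the planned pairs are returned."""
    folder = Path(folder)
    done = []
    for old, new in name_map.items():
        src, dst = folder / old, folder / new
        if not src.is_file() or dst.exists():
            continue
        if not dry_run:
            src.rename(dst)
        done.append((old, new))
    return done
```

Running it first with dry_run=True gives a chance to review the planned renames before any file is touched.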

Some filenames did not match up easily to support an automated renaming of the EMLs. I renamed the rest, using a MOVE command to put them into a separate folder. I copied those that did not rename into a separate folder too. These I renamed using the Bulk Renamer (above), so as to have .txt extensions. I did that so that I could view them in IrfanView, which enabled me to flip back and forth among them quickly, so as to identify the proper "To" information.

At this point, I started a new post to summarize more clearly the steps taken here.

