Tuesday, June 19, 2012

Exporting Thunderbird Emails to PDF - Another Cut

I was using Thunderbird 11.0.1 in Windows 7.  I had accumulated some emails that I wanted to export as individual EML files.  An EML would still be readable in Thunderbird, and it would carry any attachments along with it.  I had attacked this problem on several previous occasions.  As before, I was not sure I would get all the way through from Thunderbird to EML to PDF.  This post provides another contribution in the slog toward that outcome.

First Step: From Thunderbird to EML Format

Some of my previous efforts to export to EML and then convert to PDF had produced something of a mess.  Exporting, itself, was easy enough.  I was using ImportExportTools.  It would give me EMLs with names containing some, but not all, of the information that I wanted in file names.  Specifically, it would provide the date and time, the sender, and the subject; but it did not include the recipient.  I could get it to produce a separate Index.csv file that would contain the full information, but that would just be a spreadsheet file.  I could use that spreadsheet file to give me nice names for files; but which file was supposed to get which name?  Matching them up had required a surprising amount of manual effort, last time around.  I was hoping to make the process smoother, if I could.

It wouldn't help to print a PDF directly from Thunderbird.  As far as I knew, that would require me to enter PDF filenames manually.  I was looking for a mass-production kind of solution.  About.com recommended mbx2eml, but it seemed to have some disadvantages, notably a very limited set of options for the resulting EML filenames -- which was the main problem.  Generally, it did not seem that any solution had broken through into prominence, in either the T-bird to EML or T-bird to PDF category.

In my first try at this problem, I had tried Total Thunderbird Converter and Birdie EML to PDF Converter, but for various reasons had not been impressed with either.  I did like Attachment Extractor, for when I got to that part of the project.  My notes seemed to favor Universal Document Converter (UDC) ($69), if I wanted a direct T-bird-to PDF-solution.  As I reviewed the struggles I'd had in that first try at this problem, and also in the second and third tries, I wondered if I should have focused more seriously on UDC.  But it did not seem to have command-line capability or other automation features.  It was basically a glorified PDF printer.  Moreover, its default filenames did not include all the information I wanted.

My previous notes did not seem to mention that Thunderbird messages were apparently already in EML format, stored in Thunderbird subfolders.  For instance, I had moved the messages that I was now seeking to export to a Local Folders subfolder called Export, and I could see that folder in Windows Explorer as Mail\Local Folders\Export.mozmsgs.  But this was confusing:  the number of EML files in that folder was not very close to the number of messages in the Export subfolder in Thunderbird.  Anyway, the EMLs in Export.mozmsgs had seemingly random names that would be useless for my purposes.

So I went ahead with ImportExportTools.  My first step was to eliminate duplicates.  For this, I used Remove Duplicate Messages (Alternate).  Then, in Thunderbird, I went to Tools > ImportExportTools > Export all messages in the folder > EML format.  The first time around, this produced undesirable results (see below).  But I didn't know that until I was partway through the second step.

Second Step:  Adding Recipient to the EML File Name

I had my EMLs.  But as noted above, I wanted to add the name of the Recipient to the filename, in the format Date-From-To-Subject.  As a first step, I thought I would just try to append the Recipient's name to the end of the filename.  Then I would figure out how to shuffle the words around to the desired order.

Given my limited knowledge of programming and such, I decided to try to achieve this with a Windows batch file.  I struggled to figure out how to write a suitable one, and finally posted a question on it.  One of the early answers to that question led to a separate pursuit -- a one-line batch file that would convert Word and WordPerfect documents to PDF.

The answers that I had received, at the point when I was writing up these notes, fell into two categories.  One, which I found easier to understand (and, predictably, seemed less popular among the knowledgeable respondents), involved a simple loop that would call an external process.  Basically, in plain English, it went like this:

FOR each EML file, run Process.
Repeat loop.
When list of files is exhausted, quit.

Process starts here.
Do various things.
End of process
By contrast, the approach preferred by most of the answering individuals would put all the steps inside the loop, instead of having a separate process afterwards.  It seemed to be a matter of style.  A second difference was that, in discussing the specific steps, they seemed divided between two general possibilities:  with, or without, delayed expansion.  Delayed expansion was apparently a response to a complication in how the FOR command worked.  As I understood it, the computer would read the entire contents of a FOR command as soon as it hit the word FOR.  So assigning a value to a variable inside a FOR loop would be too late; the computer would already have decided what value that variable had.  The variable would have been immediately expanded to its value.  Delayed expansion would postpone definition of the variable's value until later in the game.  A variable would be marked for delayed expansion by surrounding it with exclamation marks (e.g., !VAR!).  I wasn't familiar with delayed expansion, so I was in accord with some advisors' feeling that it would be better to proceed without it if possible.  What they (especially Aacini) suggested was:
@ECHO OFF

IF EXIST fullnames.txt DEL fullnames.txt

FOR %%f IN (*.eml) DO (

SET firstfind=

FOR /F "delims=" %%l IN ('findstr /B /C:"To: " "%%f"') DO (

IF NOT DEFINED firstfind SET firstfind=now & ECHO %%f%%l >> fullnames.txt

)

)
I have double-spaced the lines for clarity, anticipating that Blogger will wrap some long lines.  I haven't indented the way a programmer would, because of apparent limitations in the formatting options here in Blogger.  Basically, this batch file said, give me a fresh output file called Fullnames.txt; and on each line in Fullnames.txt, type the contents of two variables.  The first variable, %%f, was the name of the EML file under consideration, in all its Date-Sender-Subject glory.  There would be one such filename assignment for each EML file in the folder; hence a FOR loop.  The batch file would loop through all EML files in the folder.

Inside that FOR loop, there would be an examination of the contents of each individual EML.  This examination would use FINDSTR to locate the first line beginning with "To:  ."  The contents of that line would be assigned to the %%l variable.  (That's an L, not a one.)  I wasn't sure why this had to be done inside a second, inner loop, and I also didn't know how the "now" part worked.  But I was an openminded individual.  I was interested in new ideas.  The point is, I was willing to plow ahead and give it a try. 

So I copied the foregoing lines of script, beginning with @ECHO OFF and continuing to the last closed parenthesis (")"), and pasted them into a file in Notepad.  I saved that file as EMLNamer.bat, and put it into the folder containing the EMLs that I had exported from Thunderbird (above).  There, I ran it (either double-click it or highlight and hit Enter).  The command window displayed nothing, which was a bit disconcerting; but, viewing the folder in Windows Explorer, I could see Fullnames.txt spring into existence and grow larger.

When it was done, the command window disappeared, and Fullnames.txt stopped getting bigger. I put EMLNamer.bat into a folder where I could find it later.  I opened Fullnames.txt file and pasted its contents into Excel.  Some lines seemed to be missing.  Not many, but less than the total number of files shown in the Windows Explorer folder minus two (for EMLNamer.bat and Fullnames.txt).  I guessed that the names of a few EMLs had presented complications for the script.  I would have to process the rest and see what remained.

Third Step: Improving the EML File Name

I looked at the new Excel spreadsheet.  Spot checks, supplemented by previous experience with ImportExportTools, yielded the following observations:
  • The first 13 characters in each filename seemed match the date and time (in 24-hour format) shown in Thunderbird for the email in question -- the time, that is, when the email was sent or received.
  • The next characters indicated the sender.  This string ended, in some cases, with three characters (namely, "_-_") and in other cases with just one (namely, "-").  It seemed that ImportExportTools would surround some senders' names with underscores ("_") but would not do so for others.  The reason seemed to be that those senders' names appeared within brackets.  For instance, I had emails from "[Wordpress.com]" that now appeared as "_WordPress_com_."  So at least in these situations, the underscore seemed to be something that I could replace with a space, which would then be removed by an Excel TRIM command if it appeared at the start or end of a string.
  • Some senders' names ended with "_com."  Ordinarily, the preceding note would suggest replacing that with ".com," and likewise for ".org," ".edu," and so forth.  But I decided that step would come later, if at all:  instead, I would start by identifying full names (e.g., "Yahoo_com") that I might want to replace with simpler names (e.g., "Yahoo").
  • Hyphens were not always a reliable indicator of the end of a sender's name.  For example, an email from some "Pan-European" organization came through the ImportExportTools process unchanged.
  • ImportExportTools seemed to replace apostrophes with underscores.  So instead of "Miller's" I would get "Miller_s_."  Likewise for other uses of the apostrophe (e.g., "Don't" became "Don_t_").  It seemed that, before doing any sweeping replacement of underscores, I might want to look for those sorts of special cases.
  • Sometimes a hyphen would not be a reliable indicator of the end of a sender's name.  An example appeared in an email from a "Pan-European" organization:  it came through the ImportExportTools process unchanged.
  • Due to the EMLNamer process, the end of the Subject field and the beginning of the Recipient field were marked by ".emlTo:" -- which was certainly recognizable.
  • Subject fields often began with things like "Fwd_" and "Re_" -- which, I had decided in a previous use of ImportExportTools, would best be deleted.
In short, the default results from ImportExportTools (possibly altered during my previous tinkering) were creating some confusion.  I deleted the existing EMLs from the output folder, so as to start over.  Then I went into Thunderbird > Tools > ImportExportTools > Options and made several changes.  In the Misc. tab, I set each item to a maximum of 100 (rather than 50) characters.  (This wasn't exactly a mistake, but I would later realize that, as a result of this change, I needed to be more aggressive in keeping the total filename length relatively short; otherwise, it would cause problems in some other Windows operations.)  In the Filenames tab, I unchecked the option to "Use just alphanumeric characters in export"; I left the format to be Date - Sender - Subject; I left "Add time to date" checked; and I unchecked the "Cut subject" and "Cut complete file path" options.  In the Export directories" tab, I chose the "create a new directory and the index of messages" option.

When I ran that, I got an index.html file listing relevant information about each file:  its subject, from, to, date, and an indication of whether it had attachments.  This did not appear likely to be helpful, given its HTML format.  In the output folder, there was the right number of files.  I ran EMLNamer.bat again.  This time, the command window gave me some error messages.  Preliminarily, it seemed they were produced by the length of the filenames.  I could not save them before the command window closed.  There was probably a way to modify EMLNamer.bat to save those messages to a file, but I did not tinker with that at this point.  These messages appeared to be in addition to the unknown problems that had prevented Fullnames.txt from containing a complete list of all EMLs:  there were now about 20 filenames missing from the output that I pasted into Excel.  So, again, those would have to be dealt with manually.

This time around, when I pasted the results from Fullnames.txt into Excel, I saw that the output filenames had characteristics largely similar to, but in some regards different from, those noted above.  There were fewer underscores, which meant that it would probably be simpler to develop rules to translate them into more useful characters.  Hyphens were still not reliable field-end indicators.

Manipulating the File Information in a Spreadsheet

In Excel, after a couple of false starts not detailed here, I took the following steps:
  • Insert row 1 for column headings.  Label column A as "Combined."  These entries contained the combined original filename plus the "To:" information added by EMLNamer.bat.
  • In column B (heading:  "Original"), use =LEFT(A2,FIND(".emlTo: ",A2)-1) to obtain the original filename as exported from Thunderbird.  I would need this to remain unchanged:  my ultimate goal, a batch command indicating how the original filename should be changed, would need this information to tell the command processor what file was being renamed.  As with all other columns discussed below, I copied the formula down the column to all rows in use.
  • In column C (heading:  "Find & Replace"), use =A2.  Fix the values in this column -- that is, make them permanent by highlighting them all and using the Edit - Copy, Edit - Paste Special -Values sequence.  The shortcut key sequence for Excel 2003 -- which I believed would work in ribbon versions like Excel 2007 and 2010 --was Alt-E C, Alt-E S V Enter Enter.  Now column C contained values rather than formulas.
  • Move the values from column C to a new worksheet.  Don't rearrange them.  I needed a new worksheet because I was going to be using global find-and-replace (Ctrl-H) commands, and I didn't want to have to try to protect columns A and B from being affected by these commands.
  • In that new worksheet, I made changes to the list that I had just brought over from column C in the first worksheet.  The first thing I did was to search for an unusual character, one I searched, in Excel, to find a character that did not already appear in the list.  The caret ("^") was one such character.  I would use this as my field delimiter.  I didn't want any of my Subject field entries to begin with "Re" or "Fwd," so I started by replacing "-Re_" and "-Fw_" and "-Fwd_" with carets, gambling (on the basis of previous experience) that there would be few instances where this would prove inadvisable.
  • I also replaced the "-_" and "_-" and "-[" combinations with carets.  To reduce the number of underscores potentially requiring manual attention, I did one or two additional find-and-replace operations in obvious cases; for example, "Woodcock_s " (ending with a space) became "Woodcock's ."  It could have been counterproductive to go too far with this, though.  For example, I did not try to remove underscores from every version of my name and email address, because that could have created additional variations on my name, somewhere down in the list, potentially complicating the number of things I would have to look for later.  It was better to leave the underscore as a flag for some purposes.  Then I cut and pasted that modified list back into column C in the main worksheet.
  • Back in the main worksheet, in column D, I set up a Date and Time column B, using =LEFT(C2,13).  I didn't parse that column for the various year, month, day, hour, and minute components at this point; that could wait until I needed that information.
  • In column E, I created my first Remainder column.  The purpose of the Remainder columns was to show what was left from the modified values appearing in column C, after removing whatever I had just separated out (in this case, the date and time).  The formula was =TRIM(MID(C2,15,LEN(C2))).
  • I used column F for the Recipient (i.e., "To") value, from the end of the string appearing in the Remainder column (E).  The reason was that this was a fairly obvious entry, and its removal would simplify the next steps.  The formula in column F was =TRIM(MID(E2,FIND(".emlTo: ",E2)+7,LEN(E2))).
  • Column G could be another Remainder column:  =TRIM(LEFT(E2,FIND(".emlTo: ",E2)-1)).
  • In column H (heading:  "Left 1"), I entered =LEFT(G2,1).  The reason was that ImportExportTools had failed to export the names of some senders, notably those appearing in angle brackets ("< >"), and I couldn't identify them by just sorting on the Remainder column because Excel would irritatingly overlook those characters when doing a sort.  But now I could sort on column H and make manual entries of those senders' names in the appropriate column.  I had not yet created that column, nor made those manual entries, because there was something else I needed to do first:
  • In column I, under a "Hyphen" heading, I entered =IF(ISERROR(FIND("-",G2)),"",FIND("-",G2)).  In column J (heading:  "Caret"), I entered =IF(ISERROR(FIND("^",G2)),"",FIND("^",G2)).  Finally, in column K (heading:  "Best"), I entered =IF(J2="",I2,J2).  Column I would look for the first occurrence of a hyphen in the Remainder (column G).  Column J would do likewise for a caret.  It was necessary to use both because, at this point, either one might have been the delimiter indicating the end of the Sender field.  Column K would favor carets over hyphens, so as to reduce the number of problems with senders with hyphenated names.
  • In column L ("Sender), I used =TRIM(LEFT(G2,K2-1)).  This produced good Sender names in most cases.  It was not yet time to deal with the exceptions.
  • In column M ("Subject"), I used =TRIM(MID(G2,LEN(L2)+1,LEN(G2))). This produced good Subject names in most cases. Now it was time to deal with the exceptions.
  • I went back and sorted on column H to identify those rows where I would have to make manual entries of Sender names because none was provided by ImportExportTools.  I put those entries in column L as needed, replacing whatever the automatic calculation had put there.  To assist in my process of looking up those that I didn't recognize, I sorted the From column in Thunderbird, for the Export folder, to gather all those senders at the top of the list for easier reference; I moved these items into a separate subfolder, sorted by Subject; I maximized the viewable space for that list; and once I had dealt with them, I moved them to another subfolder, so as to reduce the size of the list that I would have to page through.  The objective here was just to make sure I had a coherent division of information between the Sender and Subject columns -- to prevent some Sender data from appearing in the Subject column, or vice-versa. Cleaning them up or otherwise improving them at this point would have been premature.  Changing Sender names worked best if I made the changes back in column G, or if I altered or removed numbers in columns I and J.  Just making a change in the Sender column would leave a problem in column M.  It helped, for this purpose, to fix the values in column G (that is, to replace formulas with values; see the procedure described in connection with column C, above).
  • I sorted on column M ("Subject") and cleaned up the entries there.  I found that I wanted to do find-and-replace operations on multiple entries.  I decided at this point that I could safely fix the values of the entire spreadsheet.  It seemed that I would want to sort and re-sort these Subject values to get similar ones together.  To preserve the original order, I added an Index column, indicating the original numerical order of entries.  (Enter 1 and 2 in the first two rows; highlight all rows to be numbered; then hit Alt-E I S Enter.)  Then I moved the Subject and Index columns to a separate temporary worksheet, where I could do these sweeping changes without affecting other columns.  There, I reversed these two columns, putting Index on the left, to keep it out of harm's way.  My changes here included LEFT and RIGHT commands to sort by first and last characters of Subjects (supplemented, on the left, with CODE comparisons, to identify unwanted lowercasing), as well as FIND and Ctrl-H searches and replacements for underscores (doing many replaces to eliminate most instances) and other text that I wanted to change across multiple Subjects.  To identify undesirable characters (e.g., exclamation marks and others whose presence in filenames might mess up batch commands and other applications), I used SUBSTITUTELIST.  SUBSTITUTELIST would remove the characters listed in a separate worksheet (generated with a series of numbers 1-255 in column A and a corresponding =CHAR(A1) in column B).  I could have had it remove characters that looked unwanted, but to be cautious I decided instead to have it remove everything that I knew was normal (i.e., 0-9 and a-z and A-Z, plus a few others) and show me what was left.
  • I deleted columns that were unnecessary, now that I had fixed the values.  I also moved some columns, and inserted a few ones.  My arrangement was now as follows:  Index (column A), Original (B), Date & Time (C), Sender (D), NewSender (E), Recipient (F), NewRecipient (G), and Subject (H).
  • I copied values from columns D (Sender) and Recpient (F) to a separate worksheet.  There, I did a unique filter.  This gave me a list of names that I might want to change or simplify.  I put the original (unique) name in column A in that separate worksheet, sorted it, and entered the desired replacement names in column B.  I sorted this Names worksheet on column A (Original).  I named this worksheet Names; I planned to keep it for future Thunderbird EML exports.  I named the main worksheet Data.  I sped up the process of developing replacement names by using various functions (e.g., FIND, MID) to distinguish first and last names of individuals.  When I had my completed list of preferred names for Senders and Recipients, I went back to the main (Data) worksheet.  In column E (NewSender), I entered =VLOOKUP(D2,Names!$A$2:$B$869,2,FALSE).  (There were 869 rows in the Names spreadsheet.)  I copied that formula over to column G (NewRecipient); it provided a similar replacement for the Recipient values.
  • I inserted columns to figure out the date and time.  In column D ("Y"), I used =LEFT(C2,4).  In colunn E ("M"), I used =MID(C2,5,2). In column F ("D"), I used =MID(C2,7,2).  In column G ("H"), I used =MID(C2,10,2).  In column H ("M"), I used =MID(C2,12,2).  Finally, in column I ("NewDate"), I used =D2&"-"&E2&"-"&F2&" "&G2&"."&H2.
  • I added column O ("New Name").  There, I used =CHAR(34)&I2&" Email from "&K2&" to "&M2&" re "&N2&".eml"&CHAR(34).  This produced a new name for the EML file.  I sorted on this column to identify instances where my formulas had failed, and made corrections as needed.
  • I added column P ("Batch").  There, I used ="ren "&CHAR(34)&B2&".eml"&CHAR(34)&" "&O2.  This produced a batch command to rename the EML file to my preferred new name.  I copied the command down the column and then copied all those commands, one from each row, to Notepad.  I saved the Notepad file as Renamer.bat and put it into the folder where the EMLs were located.  I ran Renamer.bat.  The renamed files sorted conspicuously in Windows Explorer, so I didn't need to work up a modification of these REN commands in a new column Q, using MOVE instead of REN, to move the newly renamed files to another folder.  Instead, I could just cut them from the folder in Windows Explorer and put them aside.
  • Now I had a couple dozen EMLs remaining.  They had not renamed properly.  I probably should have added something like " > errorlist.txt" at the end of each batch command, to show me whether I was trying to give the same name to two different files.  I did a DIR of the files remaining, saved its output to dirlist.txt, copied the contents of dirlist.txt into Excel, and compared them against my main spreadsheet.  To my surprise, none of these files appeared in the original list of files shown there.  I'd had some problems not described in this post; had I somehow dropped some EMLs somewhere along the line?  Was I not doing this comparison properly?  I did not have a clear answer.  I worked up another set of new file names for these EMLs, substantially following the steps presented above, and renamed them.  It looked like, somehow, at least some of them were duplicates after all.  So I was not understanding something there.  Others were apparently not renaming properly because the original filenames contained characters like ®.
In the end, I wound up with all but 10 of the exported emails.  But which 10 did I lose?  I probably didn't lose any.  There was a point when I deleted a handful of what I thought were duplicates.  Now it seemed they probably weren't.

As these final notes suggest, while this process went much more smoothly than on my previous exports of emails from Thunderbird, there were still some parts of the process where I was making mistakes or where things were not going smoothly.

Fourth Step:  Converting the Appropriately Renamed EML to PDF

Some of the previous posts cited at the top of this post grappled with the problem of converting EML to PDF.  It would seem that it should have been a straightforward matter of selecting EMLs in Windows Explorer -- indeed, within Thunderbird itself -- and clicking a Print command.  Alas, it was not, not if the goal was to have PDFs whose filenames would be recognizable.  I hoped there would be a Thunderbird add-in that would solve all the many steps shown above.  I hadn't found one yet.  In my most recent effort, I had proceeded only as far as a truly cumbersome solution that divided EMLs with and without attachments, used Emacs to edit EMLs with attachments so that they would print, extracted attachments to separate files that could also be PDFed, and then manually matched the PDFed attachments up with their PDFed parent emails.  Truly a mess, very time-consuming, and for that reason I hadn't gone very far with it.  Most of my emails were still in EML rather than PDF format.

I had decided, generally, that PDF was the superior long-term archival format.  I didn't want lots of formats rattling around, lest the day come (as had happened for previous formats) when it was a struggle to find software that would read it.  That said, EMLs were presently displaying nicely in Thunderbird, with easy access to attachments.  Having devoted the time to the foregoing effort, I was presently out of time for further development of this project.  So the stack of EMLs grew higher, and the day for conversion to PDF still lay somewhere in the future.


5 comments:

raywood

In another post, I did work all the way through the process of converting an EML plus its attachments into a single PDF, for many EMLs and many attachments.

raywood

The next time I exported EMLs from Thunderbird, I benefited from what appeared to be improved index and EML output through an apparent upgrade of the ImportExportTools add-on. Key discovery: the index and the file list were in nearly perfectly opposite order from one another. I was able to use this to match the To and From fields of the various emails with only one detected error during the several hours it took to export and rename thousands of emails. To massage the index, I opened it as HTML, copied into Microsoft Word, and copied from there to Excel.

raywood

Another subsequent post comments further on the elimination of duplicate emails.

Hailey stone

Hi, Raywood

Don’t take all those lengthy process to convert Thunderbird emails to PDF, just simply download EML to PDF Converter software which provides you easy and simple EML emails conversion process so that you can simply Convert Thunder EML emails into PDF format within 2 to 3 steps. http://www.emltopdf.com/

raywood

Working through this process again at a later time, I found an exception to my previous comment: when emails had the same date and time, I had to do some additional massaging. With this approach, the current version of ImportExport Tools was exporting the To field in the index.csv file, so part of the foregoing discussion was no longer necessary (though it was still helpful to have the "To" field visible in Fulltext).