Saturday, April 23, 2011

Windows 7: Archiving Emails (with Attachments) from Thunderbird to Individual PDFs - First Try

I had been collecting email messages in Thunderbird for a long time.  I wanted to export them to PDF-format files, one message per file, with filenames in this format:

2011-03-20 14.23 Email to John Doe re Tomorrow.pdf
reflecting an email I sent to John Doe at 2:23 PM on March 20, 2011 with a subject line of "Tomorrow."  This type of filename would sort correctly in Windows Explorer, chronologically speaking; hence, I could quickly see the order of messages.  There were a lot of messages, so I didn't want this to be a manual process.  This post describes the steps I took to make it semi-automated.

The Screen Capture Approach

The first thing I did was to combine all of the email messages that I wanted to export into one folder in Thunderbird.  Then I deleted duplicates from that folder.  Next, I decided that, actually, I was just interested in exporting messages prior to the current year, since recent messages might have information I would want to search for in Thunderbird.  So I moved the older messages into a separate folder.  I maximized the view of that folder in T-bird and shut down unnecessary layout features (i.e., message pane, status bar), so that I could see as many emails as possible on the screen, and as much as possible of the relevant data (i.e., date, time, sender, recipient, subject) for each email.  I did that because I wanted to capture the text information about the individual email messages.  The concept here was that I would do a screenshot for each screenful of emails, and would save the data from that screenshot into a text file that I could then massage to produce the desired filenames.  For this purpose, I tried a couple of searches; I downloaded and ran JOCR; but after a bit of screwing around I decided to revert to the Aqua Deskperience screen text capture shareware that I had purchased years earlier.

Index.csv

Then it occurred to me that perhaps I could just export all that information from T-bird at once.  I ran a search and downloaded and installed the ImportExportTools add-on for T-bird.  (Alternatives to ImportExportTools included IMAPSize and mbx2eml, the latter explained by About.com.)  It took Thunderbird a while to shut down and restart after the installation.  I assumed it was getting acquainted with the fact that I had relocated so many messages to a new folder.  When it did restart, I ran the add-on (Tools > ImportExportTools > Export all messages in the folder > Just index (CSV)).  I opened the CSV file (named index.csv) in Excel and saw that this was perfect:  I had a list of exactly the fields mentioned above (date, time, etc.).  I verified that Excel was showing a number of rows equal to the number of messages reported on the status bar back in Thunderbird.

I noticed that some of the data in the Excel file included characters (i.e., \ / : * ? " < > | ) that were not permitted in Windows filenames.  The Mbx2eml option (above) would have removed these characters automatically, but for this first time I wanted to do everything manually, so as to see how it was all working.  I thought this might also be better for purposes of making things the way I wanted them.  I was also not sure that Mbx2eml would produce a CSV file, or that it would output the emails in the same order.  There seemed to be some other potential limitations.  It looked like a worthy alternative, but I didn't explore it.
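
As an illustration only (this is not what was done at the time), a minimal Python sketch of stripping those forbidden characters from a subject line might look like this. The function name and the underscore replacement are assumptions chosen for the example.

```python
import re

# The characters listed above that Windows does not permit in filenames.
_FORBIDDEN = re.compile(r'[\\/:*?"<>|]')


def sanitize_filename(text, replacement="_"):
    """Return text with each forbidden character swapped for `replacement`."""
    cleaned = _FORBIDDEN.sub(replacement, text)
    # Windows also has trouble with names that end in a space or a period,
    # so trim those from the end as well.
    return cleaned.strip().rstrip(". ")
```

Applied to a subject such as "Re: What?", this would yield "Re_ What_".  Note that any such cleanup changes the text, which is exactly why it can later make it harder to match spreadsheet entries against the original emails.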

Somewhere around this point, I went ahead prematurely with a time-consuming effort to revise the entries in the spreadsheet, so as to remove the unacceptable characters and otherwise make them look the way I wanted.  Eventually, I realized that this was a mistake, because now I would have a hard time matching spreadsheet entries automatically with the actual emails that I would be exporting from Thunderbird.  So I dropped that attempt and went back to the point of trying to plan in advance for how this was all going to work.

Attachments

I had assumed that I wanted to export emails to individual .eml files because EML format would bring along any attachments that happened to be included with a particular email message.  But I didn't plan to just leave the individual emails in EML format; I wanted to save them all as PDFs.  In other words, I wanted to have the email and its attachment within a single PDF.
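
For readers who want to confirm that an exported EML really does carry its attachments, here is a hedged Python sketch using the standard library's email package.  It is not part of the process described in this post; the function name is hypothetical, and it assumes the EML is an ordinary MIME message.

```python
from email import policy
from email.parser import BytesParser


def list_attachments(eml_bytes):
    """Return the filenames of the attachments found in one EML message."""
    # policy.default yields EmailMessage objects, which offer iter_attachments().
    msg = BytesParser(policy=policy.default).parsebytes(eml_bytes)
    return [part.get_filename() for part in msg.iter_attachments()]
```

The bytes would typically come from reading the .eml file in binary mode.  This only lists the attachments; turning them into PDF pages is a separate problem, and that is the hard part discussed below.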

A quick test showed me that printing EMLs would be no more successful at including the attachments than if I just printed directly from Thunderbird, without all this time-consuming exporting and renaming.  There were other solutions that would have worked for that purpose as well.  A search led me to InboxRULES, which for $40 would do something or other with attachments in Outlook.  (Alternate:  Automatic Print Email for $69.)  There didn't seem to be a solution for Thunderbird, and I wasn't keen on spending $40 and having to install Outlook and move all these emails there in order to print their attachments.  I thought about handling the attachments manually -- print the email first, then print the attachment, and append it to the email -- but a quick sort in Thunderbird told me there were hundreds of messages with attachments.  Funny thing about that, though:  as I arrow-keyed down through them in Thunderbird, allowing them to become visible one at a time, I saw that Thunderbird would change its mind with many of them:  it thought they had attachments, but then it realized they didn't.  That trimmed out maybe 5% of the ones that had originally been marked as having attachments.  But there were still a lot of them.

Another search led to some T-bird options, but it still looked like there was going to be a lot of manual effort before I'd have a single PDF containing both the email and its attachment.  Total Thunderbird Converter looked like it might come close, at a hefty price ($50).  It wasn't reviewed on CNET.com or anywhere else, as far as I could tell, so there was a risk that (as I'd experienced in other cases) the program simply wouldn't work properly.  But then I saw that they were offering a 30-day free trial, so I downloaded and installed it.  It turned out to be useless for my purposes:  it had almost no options, and therefore could not find my Thunderbird folders, which I was saving on drive D rather than C so as to avoid losing them in the event of a Windows update or reinstallation.  I looked at Email Open View Pro (or its alternate, emlOpenView Free), which also offered a free trial.  It didn't look like it (or Universal Converter, or MSG Viewer Pro, or E-mail Examiner, or Convert EML to PDF) would bring the attachments into the same PDF as the email, so I moved on.  I tried Birdie EML to PDF Converter.  Their free demo version allowed me to convert one EML file at a time.  I liked its interface:  it gave me eight different naming options for the resulting file (e.g., "date + subject + from," in several different date formats).  I didn't like the output, though:  they formatted the PDF for the EML file oddly, with added colors that I didn't want, and all they did with the attachment was to put it into a subfolder bearing the same name as the resulting PDF.  I'd still have to PDF it -- the example I used was an EML with a .DOC file attachment -- and merge it with the PDF of the EML.  But now they had led me to see that perhaps I could at least automate the extraction of attachments, getting me partway to the goal.

At about this point, Thunderbird inconveniently lost a folder containing several thousand email messages.  It just vanished.  The program was struggling there for a few minutes before that, and part of me was instinctively thinking that I should shut down the program and do a backup, but this would have been a very deeply subconscious part of me that was basically unresponsive under real-life conditions.  In other words, I didn't.  So now I had to go rooting around in backups to see what I could rescue from the wreckage.  I found that Backup Maker had been happily making backups, as God intended.  Amazing what can happen, when you have about five different backup systems running; in this case I had just wiped a drive, moved a drive, etc., and so of course Backup Maker was the *only* backup system that was actually in a position to restore real data when I seriously needed it.  What Backup Maker had saved was some files with an .MSF extension.  These were supposedly backups of Thunderbird.  But then, no, on closer inspection I realized these were much too small, so I had to do some more digging.  Eventually I did patch together something resembling the way things had been before the crash, so I could go back and pick up where I had left off.  A couple of days passed for other interruptions here, so the following information just reports where I went from this point forward.

I had the option of just saving the Thunderbird file, or the exported emails, for some future date when there would perhaps be improved software for printing attachments to PDF in a single operation with the emails to which they were attached.  There had been times when software developments had saved (or would have saved) a great amount of time in what would have been (or actually was) a tedious manual effort.  On the other hand, I had also seen situations where letting something sit meant letting it become confused or corrupted, or where previous learning (especially on my part) had been lost.  I decided to go ahead with converting the emails to PDF to the extent possible without a tremendous time investment.

My searching led to Attachment Extractor, a Thunderbird extension.  I installed it, highlighted two emails with attachments, right-clicked on them, and selected "Extract to Suggested File-Folder."  It worked -- it did extract the attachments without removing them from the emails.  I assumed it would do this with hundreds of emails if I so ordered.  Then, to get them matched up with PDFs of the emails to which they were attached, I would apparently have to page down through those emails one by one, looking at what attachments they had, and setting them up for more or less manual combination.  Attachment Extractor did have one potentially useful feature for this purpose:  a right-click option to "Extract with a Custom Filename Pattern."  I found that I could configure the names given to the extracted attachments, so that they would correspond at least roughly with the names of emails printed to PDF.  To configure the naming in Attachment Extractor, I went into Thunderbird > Tools > Add-ons > Extensions Tab > AttachmentExtractor > Options > Filenames tab.  There, I used this pattern:
#date# - Email from #fromemail# re #subject# - #namepart# #count##extpart#
and, per instructions, in the Edit Date Pattern option I used this pattern:
Y-m-d H.i
That gave me extracted attachments with names that were at least roughly similar to the format I wanted (see above).

Batch Exporting Emails with Helpful Filenames

Now if I could print the corresponding email to PDF with a fairly similar name, the manual matching might not be so difficult.  A search led to inquiries about renaming PDF print output.  For $60, I could get Zan Image Printer, which sounded like it would have some capability for automating PDF printout filenames.  Print Helper, for $125 to $175, was another option.  A Gizmo's Freeware article did not say much about this kind of feature, though several people asked about it.  A list of free PDF printers led me to think that UltraPDF Printer was free and would do this; its actual price was $30. 

The pdf995 Experiment

At this point, I was reminded of how much time I could waste on uncooperative software.  No doubt many people have used pdf995 successfully.  I was not among them.

I tried Pdf995's Free Converter.  The instructions on bypassing the Save As dialog were in the Pdf995 Developer's FAQs page.  They seemed to require me to open C:\Program Files\PDF995\res\pdf995.ini in Notepad.  But that .ini file seemed to be configured for printing one specific file that I had just printed.  They didn't say how to adjust it.  Eventually I figured out that I needed to download and install pdfEdit995, and make the changes there.  So I tried that.  But I got an error message:
PdfEdit995 requires that Pdf995 v9.1 or later and the free converter v1.2 or later are already installed on your system.
But they were!  I had just installed them.  Was I supposed to reboot first?  No, a reboot didn't fix it.  I tried again to install basic pdf995 and the Free Converter, which I had downloaded together.  Once again, I got the webpage saying I had installed it successfully.  Was I supposed to install the printer driver too?  I understood the webpage to be saying that was included in the 9MB download.  But I tried that.  Got the congratulatory webpage, so apparently it installed correctly.  Now I noticed I had another error, which had not come up on top, so I was not sure how long it had been there:
Setup has detected you have an old version of pdfEdit995 that is incompatible with the latest version of pdf995.
But I had just downloaded it from their website!  Not an altogether auspicious beginning here.  But I downloaded and installed the very latest and tried again, and now it seemed to work, or at least I got a different congratulatory webpage than before.  A cursory read-through still did not give me a clear approach to automated naming of PDFs.  Instead, they said that maybe I wanted their OmniFormat program for batch PDF creation.  Who knew?  I downloaded and installed OmniFormat.  Got a new congratulatory webpage, but still no straightforward explanation of batch naming.  Instead, it said that pdfEdit995 was what I wanted to create batch print jobs.  So, OK, a bridge too far.  Though at this point they specified "batch print jobs from Microsoft Office applications," raising the prospect that this wasn't going to work from Thunderbird.  Went back to their incredibly tiny-print pdfEdit instructions page.  It said I would have to set pdf995 as the default printer to do the batch thing.  That was OK.  But it still sounded like it was intended primarily for batch printing from Microsoft Word.  I decided to just try making pdf995 the default printer.  That required me to go to the Windows Start button > Settings > Printers > right-click on PDF995 > set as default printer.  While I was there, I right-clicked on PDF995 and looked at its Properties, but there didn't seem to be anything to set for purposes of automating printing.  Now I went to Thunderbird, selected several messages, and did a right-click > Print.  Funny, it defaulted to Bullzip, which was my usual default printer.  Checked again:  yeah, pdf995 was definitely set as my default printer.  Tried again, and this time manually set it to pdf995 when it was printing.  It asked for the filename, so that wasn't working.  Back in Printers, I looked at the Properties for Bullzip, but it didn't seem to have any options for automatic naming either.  It seemed pdf995 was not the solution for me.

I came away from this little exploration with less time and ambition for the larger project.  Certainly I wasn't in the mood to buy software and then discover that I couldn't make it work.

Further Exploration

I ran across an Addictive Tips article that said PrintConductor was a good batch printing option, though I might need to have Adobe Acrobat installed first.  I did, so I took a look.  There was an option to download Universal Document Converter (UDC) as well.  I wasn't sure, but I thought I might need that for Print Conductor, so I installed both.  PrintConductor didn't seem to have a way of printing EML files.  Meanwhile, UDC's installer gave me the option of making it the default printer, so I tried that.  But as before, Thunderbird defaulted to Bullzip, so I had to select UDC as my printer manually.  (Print Conductor did not appear in the list of available printers.)  When I selected UDC as the printer, before printing, I clicked on the print dialog's Properties button and went into the Output Location tab.  There, I chose the "predefined location and filename option."  I left the default filename alone and printed.  And it worked.  Sure enough, I had a couple of PDFs whose names were the same as the Subject fields shown in Thunderbird for those printed emails.  So I would be able to match them with the attachments produced by Attachment Extractor (above).  All I had to do now was to pay $69 for a copy of UDC, so that each PDF would not have a big black "Demo Version" sticker on it.

Recap

So to review the situation at this point, I had a way of extracting email attachments with highly specific date, time, and subject filenames.  I also had a way of extracting emails themselves whose filenames would show date and subject, using ImportExportTools (above):  Tools > ImportExportTools > Export all messages in the folder > EML format.  Unfortunately, there could be a number of messages in a single day on the same subject.  Without the time data in the filename, I would have duplicates.  More to the point, it would be difficult to match emails and attachments automatically, and I didn't want to go through that matching process for large numbers of emails.  I would also prefer a result in which emails converted to PDFs would appear in the right order in Windows Explorer, and that would require the time data.  As I was writing this recap, several days after writing the foregoing material, I was not entirely sure that I had verified that the output filename in UDC would include the time data.  But if that proved to be the case on revisit, at this point one option would be to buy UDC (or perhaps one of the other programs just mentioned) for purposes of producing properly named emails.  Another would be to export the list of emails to Index.csv (above) and to hope that this list would match the order in which ImportExportTools would export individual emails.  There would still be the possibility that such a program would sometimes fail to do what it was supposedly doing, perhaps without me noticing until long after the data from which I had exported and renamed various files would be long gone.

The Interim Solution

I decided that, at this point, I could not justify the enormous time investment that would be required to complete this project -- in particular, to manually print to PDF each attachment to each email, to combine those PDFs, and to match and merge them with a PDF of the email message to which they had been attached.  This seemed like the kind of project that really had to await some further development in application software.  For all I knew, the kind of solution I was seeking already existed, and I was just waiting until the day when I would become aware of it.  It was not at all an urgent project -- I rarely consulted attachments for old emails, and almost never consulted them for prior years, where I was focusing my present attention.

I wanted to get those old emails out of Thunderbird.  I didn't like the idea of having all that data at the mercy of a relatively inaccessible program (i.e., I couldn't see those emails in Windows Explorer), and anyway I didn't want T-bird to be so cluttered.  It seemed that a good solution would be to focus on the emails themselves for now.  I would export them to EML format.  EMLs would continue to contain the attachments.  I would then zip the EMLs into a small number of files, each no more than a few gigabytes in size, perhaps on a year-by-year basis, and I would store them until further notice.  Before zipping, I would make sure the EMLs were named the way I wanted, and would print each of them to separate PDFs.  So then I would have at least the contents of the emails themselves in readily accessible format, and could go digging into the zip file if I needed an attachment.  If I did someday find a way to automate the task of combining the emails and their attachments into a single PDF, I would give those PDFs the same name as the current email-only PDFs, so that the more complete versions would simply overwrite the email-only versions in the folders where I would store them.
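
The year-by-year zipping step could be scripted.  The following Python sketch is only an illustration under stated assumptions:  it assumes each EML filename begins with a four-digit year (as in date-first names), and the function name and the "emails-YYYY.zip" archive naming are hypothetical.

```python
import zipfile
from collections import defaultdict
from pathlib import Path


def zip_by_year(eml_dir, out_dir):
    """Pack the .eml files in eml_dir into one zip archive per year."""
    by_year = defaultdict(list)
    for eml in sorted(Path(eml_dir).glob("*.eml")):
        year = eml.name[:4]  # assumption: the filename starts with YYYY
        if year.isdigit():
            by_year[year].append(eml)

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    archives = {}
    for year, files in by_year.items():
        archive = out / f"emails-{year}.zip"
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for f in files:
                zf.write(f, arcname=f.name)  # store without the folder path
        archives[year] = archive
    return archives
```

Files whose names do not start with four digits are simply skipped, so it would be worth comparing the file counts before deleting anything.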

Export and PDF via Index.csv

I decided to try and see if the Index.csv approach would work for purposes of producing EMLs whose names contained all of the elements identified above (i.e., date, from, to, subject).  I had sorted the old emails in Thunderbird into separate folders by year.  I went to one of those folders in T-bird and sorted it in ascending date order.  Then I went into Tools > ImportExportTools > Export all messages in the folder > Just index (CSV).  This gave me what appeared to be a matching list of those messages, in that same order.  The number of lines in the CSV spreadsheet (viewed in Excel) matched the number of messages in that folder as stated in T-bird's status bar.

I wondered what would happen if I exported another Index.csv after sorting the emails in that T-bird folder in descending chronological order.  Good news:  the resulting index.csv produced in that experiment seemed to be reversed from the one I had produced in ascending order.  At least the first and last emails were in reversed positions.  So it did appear that index.csv matched the order that I saw in T-bird.
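
Checking only the first and last rows leaves some room for doubt.  A fuller comparison could be sketched in Python as follows.  This is an assumption-laden illustration:  it supposes the two exports are plain comma-separated text with no header row, and the function names are made up for the example.

```python
import csv
import io


def read_rows(csv_text):
    """Parse CSV text into a list of rows, ignoring completely empty lines."""
    return [row for row in csv.reader(io.StringIO(csv_text)) if row]


def is_exact_reverse(ascending_text, descending_text):
    """True if the second export lists the same rows in the opposite order."""
    return read_rows(ascending_text) == read_rows(descending_text)[::-1]
```

If the real index.csv does have a header row, that row would need to be dropped from both lists before comparing.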

On that basis, I added an Index number column at the left end of the index.csv file I was working with, the one with emails sorted in ascending date order.  This index column just contained ascending numbers (1, 2, 3 ...), so that I could revert to the original sort order if needed.  I assumed that the list would continue to sort in proper date order, but I planned to revise the date field (presently in "7/4/1997 18.34" format) so that it could function for file sorting purposes (e.g., 1997-07-04 18.34).  I wasn't sure that the present and future date fields would always sort exactly the same.  I could have retained the existing date field, but I wasn't sure that it, itself, was reliable for sorting purposes:  would two messages arriving in the same minute always sort in the same order?
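
The date revision described here can be illustrated with a short Python sketch.  It assumes the exported dates are always month/day/year followed by hour.minute, as in the example above; the function name is hypothetical.

```python
from datetime import datetime


def to_sortable(stamp):
    """Convert '7/4/1997 18.34' style text to '1997-07-04 18.34'."""
    # Assumption: month comes first, and the time is 24-hour with a period.
    parsed = datetime.strptime(stamp, "%m/%d/%Y %H.%M")
    return parsed.strftime("%Y-%m-%d %H.%M")
```

Because the result is zero-padded and runs from year down to minute, sorting it as plain text gives chronological order.  It does not settle the question of two messages arriving within the same minute, which is what the separate index column is for.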

Now I exported the emails themselves:  Tools > ImportExportTools > Export all messages in the folder > EML format.  As partially noted above, these were named in Date - Subject - Number format.  I now did a search to try to figure out what that number signified.  It wasn't clear, but it seemed to be just randomly generated.  Too bad.  It would have been better if they had included the time at the start of that random number, and had put it immediately after the date, so that the EMLs would sort in nearly true time order.  (There could still be multiple emails on the same subject within the same minute, and T-bird didn't seem to save time data down to the second or fraction of a second.)  It seemed I would have to manually sort files bearing the same subject line and arriving or being sent on the same day.  There would surely be large numbers of files like that.  I now realized they would not at all be sorted correctly in Windows Explorer:  with only date (not time) data in the filename, a file arriving in the morning with a subject of Zebras would be sorted, in Windows Explorer, after a file arriving in the afternoon on the subject of Aardvarks, and if there were three on the subject of Aardvarks they would all be sorted together even if they had arrived at several different times of day.
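
The sorting problem can be seen in a tiny Python sketch.  The three messages and their times are invented for illustration, and a plain text sort stands in for Windows Explorer's name ordering.

```python
# Hypothetical (arrival time, subject) pairs for three messages on one day.
messages = [("0836", "Zebras"), ("1410", "Aardvarks"), ("1755", "Aardvarks")]

# Names with the date only:  sorted by subject, not by arrival time.
date_only = sorted(f"19970730-{subject}" for _, subject in messages)

# Names with date and time:  sorted in the order the messages arrived.
with_time = sorted(f"19970730-{time}-{subject}" for time, subject in messages)

print(date_only)
print(with_time)
```

In the date-only list, both Aardvarks messages land ahead of the morning's Zebras message; with the time included, the morning message comes first.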

Ah, but now I discovered that ImportExportTools had file naming options.  Silly me.  I had just overlooked that.  But there they were:   Tools > ImportExportTools > Options > Filenames tab.  I selected "Add time to date" and I chose Date - Name (Sender) - Subject format.  Now I tried another export of EMLs.  The messages now had names like this:
19970730-0836-Microsoft-Welcome!
I checked and, sure enough, that was a message from Microsoft on that date at 8:36 AM.  Suddenly the remainder of my job got a lot easier.  I went back to the Index.csv spreadsheet (now renamed as an .xls) and worked toward perfecting its match with the filenames produced by ImportExportTools.  There were two parts to this mission.  First, I had to rework the Index.csv data exported from T-bird so that it would match the filenames given to the EMLs by ImportExportTools.  Second, I would then use the spreadsheet to produce a batch file that would rename those files to the format I wanted.  This called for some spreadsheet manipulation described in another post.
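
The renaming itself was done with a spreadsheet and a batch file, but the idea can be sketched in Python.  This is an illustration only:  it assumes names of exactly the form shown above (date, time, sender, subject, separated by hyphens), and the function name and the "Email from ... re ..." wording are assumptions modeled on the target format given at the top of this post.

```python
from datetime import datetime


def target_name(export_stem, ext=".pdf"):
    """Turn '19970730-0836-Microsoft-Welcome!' into the desired filename."""
    # Split only on the first three hyphens, so hyphens in the subject survive.
    # Assumption: the sender name itself contains no hyphen.
    date_part, time_part, sender, subject = export_stem.split("-", 3)
    when = datetime.strptime(date_part + time_part, "%Y%m%d%H%M")
    return f"{when:%Y-%m-%d %H.%M} Email from {sender} re {subject}{ext}"
```

For the example above, this would give "1997-07-30 08.36 Email from Microsoft re Welcome!.pdf".  A sender name containing a hyphen would be split in the wrong place, which is one reason to check the results against the index spreadsheet.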

Converting EMLs to PDF

Now I faced the problem of converting the exported EMLs to PDF, as distinct from the problem (above) of exporting PDFs from Thunderbird. 

I found that EMLs could be converted into TXT files just by changing their extensions to .txt, which was easy enough to do en masse with a program like Bulk Rename Utility.  That would permit them to be converted to PDFs without the rich text, if necessary, since it was a lot easier to find a freeware program that would do that (or, in my case, to use Acrobat) than to find one that would PDF an EML.  This appeared to be a viable solution for some of the older emails, which had apparently been through the wringer and were not showing much sign of having glorified colors or other rich text or HTML features.
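
Bulk Rename Utility handles this extension change without any scripting.  For those who prefer a script, here is a hedged Python sketch that makes .txt copies and leaves the original EMLs untouched; the function name and folder arguments are hypothetical.

```python
import shutil
from pathlib import Path


def copy_eml_as_txt(src_dir, dst_dir):
    """Copy every .eml file in src_dir to dst_dir with a .txt extension."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for eml in sorted(src.glob("*.eml")):
        target = dst / (eml.stem + ".txt")  # same name, new extension
        shutil.copyfile(eml, target)
        copied.append(target)
    return copied
```

Copying rather than renaming keeps the EMLs, with their attachments, available for zipping or for reimport into Thunderbird.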

Before proceeding with this, I decided to export all of the remaining EMLs from Thunderbird.  I knew I could read the EMLs and the TXTs (if I renamed them as that); I also knew I could reimport them into T-bird.  This seemed like a separate step.  I also decided that going back through the exporting process would give me an opportunity to write a cleaner post that would summarize some of the foregoing information.

5 comments:

IndoMK

I'd just like to say THANK YOU for writing this post! I've been trying for almost a year now to find some way to save my gmail chat history into pdf format, so I don't lose years' worth of conversations (mostly of sentimental value). I think with the information you so kindly shared, I can actually maybe successfully do it without manually printing each chat message to pdf. :) So thank you, thank you...

raywood

Glad to help. But if I recall correctly, you should follow the last link in this post to get to the next article. I believe it contains a rewrite/update that you may find more useful.

IndoMK

Oh, ok, thanks, will do. :)

raywood

A more recent post updates some aspects of this one.

raywood

A later post comments further on the elimination of duplicate emails.