Ray Woodcock's Latest: April 2011

Saturday, April 23, 2011

Windows 7: Archiving Emails (with Attachments) from Thunderbird to Individual PDFs - First Try

I had been collecting email messages in Thunderbird for a long time. I wanted to export them to PDF-format files, one message per file, with filenames in this format:

2011-03-20 14.23 Email to John Doe re Tomorrow.pdf

reflecting an email I sent to John Doe at 2:23 PM on March 20, 2011 with a subject line of "Tomorrow." This type of filename would sort correctly in Windows Explorer, chronologically speaking; hence, I could quickly see the order of messages. There were a lot of messages, so I didn't want this to be a manual process. This post describes the steps I took to make it semi-automated.
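
As a rough illustration of that naming scheme (not something used in the process described in this post), a minimal Python sketch might look like the following. The function name archive_filename and its parameters are hypothetical, invented for this example only.

```python
from datetime import datetime

def archive_filename(sent, direction, person, subject):
    """Build a name like '2011-03-20 14.23 Email to John Doe re Tomorrow.pdf'.

    'direction' is assumed to be either "to" or "from".
    """
    # Year-month-day first, then 24-hour time, so plain name sorting is chronological.
    stamp = sent.strftime("%Y-%m-%d %H.%M")
    return f"{stamp} Email {direction} {person} re {subject}.pdf"

name = archive_filename(datetime(2011, 3, 20, 14, 23), "to", "John Doe", "Tomorrow")
print(name)  # 2011-03-20 14.23 Email to John Doe re Tomorrow.pdf
```

Putting the largest time unit first, with zero-padded fields, is what lets an alphabetical sort double as a chronological one.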

The Screen Capture Approach

The first thing I did was to combine all of the email messages that I wanted to export into one folder in Thunderbird. Then I deleted duplicates from that folder. Next, I decided that, actually, I was just interested in exporting messages prior to the current year, since recent messages might have information I would want to search for in Thunderbird. So I moved the older messages into a separate folder. I maximized the view of that folder in T-bird and shut down unnecessary layout features (i.e., message pane, status bar), so that I could see as many emails as possible on the screen, and as much as possible of the relevant data (i.e., date, time, sender, recipient, subject) for each email. I did that because I wanted to capture the text information about the individual email messages. The concept here was that I would do a screenshot for each screenful of emails, and would save the data from that screenshot into a text file that I could then massage to produce the desired filenames. For this purpose, I tried a couple of searches; I downloaded and ran JOCR; but after a bit of screwing around I decided to revert to the Aqua Deskperience screen text capture shareware that I had purchased years earlier.

Index.csv

Then it occurred to me that perhaps I could just export all that information from T-bird at once. I ran a search and downloaded and installed the ImportExportTools add-on for T-bird. (Alternatives to ImportExportTools included IMAPSize and mbx2eml, the latter explained by About.com.) It took Thunderbird a while to shut down and restart after the installation. I assumed it was getting acquainted with the fact that I had relocated so many messages to a new folder. When it did restart, I ran the add-on (Tools > ImportExportTools > Export all messages in the folder > Just index (CSV)). I opened the CSV file (named index.csv) in Excel and saw that this was perfect: I had a list of exactly the fields mentioned above (date, time, etc.). I verified that Excel was showing a number of rows equal to the number of messages reported on the status bar back in Thunderbird.

I noticed that some of the data in the Excel file included characters (i.e., \ / : * ? " < > | ) that were not permitted in Windows filenames. The mbx2eml option (above) would have removed these characters automatically, but for this first time I wanted to do everything manually, so as to see how it was all working. I thought this might also be better for purposes of making things the way I wanted them. I was also not sure that mbx2eml would produce a CSV file, or that it would output the emails in the same order. There seemed to be some other potential limitations. It looked like a worthy alternative, but I didn't explore it.
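
For illustration only, here is one way the forbidden characters could be stripped from a string in Python. This is a sketch, not the method used in this post; the names FORBIDDEN and strip_forbidden are made up for the example, and it handles only the nine characters listed above (not other Windows restrictions such as reserved device names).

```python
# The nine characters listed above that Windows rejects in filenames.
FORBIDDEN = '\\/:*?"<>|'

def strip_forbidden(text, replacement=""):
    """Return text with each forbidden character replaced (removed by default)."""
    table = str.maketrans({ch: replacement for ch in FORBIDDEN})
    return text.translate(table)

print(strip_forbidden("Re: budget <draft> 3/20?"))  # Re budget draft 320
```

A cleanup like this is best applied only when building the final filename, so the original values stay available for matching against the exported files.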

Somewhere around this point, I went ahead prematurely with a time-consuming effort to revise the entries in the spreadsheet, so as to remove the unacceptable characters and otherwise make them look the way I wanted. Eventually, I realized that this was a mistake, because now I would have a hard time matching spreadsheet entries automatically with the actual emails that I would be exporting from Thunderbird. So I dropped that attempt and went back to the point of trying to plan in advance for how this was all going to work.

Attachments

I had assumed that I wanted to export emails to individual .eml files because EML format would bring along any attachments that happened to be included with a particular email message. But I didn't plan to just leave the individual emails in EML format; I wanted to save them all as PDFs. In other words, I wanted to have the email and its attachment within a single PDF.

A quick test showed me that printing EMLs would be no more successful at including the attachments than if I just printed directly from Thunderbird, without all this time-consuming exporting and renaming. There were other solutions that would have worked for that purpose as well. A search led me to InboxRULES, which for $40 would do something or other with attachments in Outlook. (Alternate: Automatic Print Email for $69.) There didn't seem to be a solution for Thunderbird, and I wasn't keen on spending $40 and having to install Outlook and move all these emails there in order to print their attachments. I thought about handling the attachments manually -- print the email first, then print the attachment, and append it to the email -- but a quick sort in Thunderbird told me there were hundreds of messages with attachments. Funny thing about that, though: as I arrow-keyed down through them in Thunderbird, allowing them to become visible one at a time, I saw that Thunderbird would change its mind with many of them: it thought they had attachments, but then it realized they didn't. That trimmed out maybe 5% of the ones that had originally been marked as having attachments. But there were still a lot of them.

Another search led to some T-bird options, but it still looked like there was going to be a lot of manual effort before I'd have a single PDF containing both the email and its attachment. Total Thunderbird Converter looked like it might come close, at a hefty price ($50). It wasn't reviewed on CNET.com or anywhere else, as far as I could tell, so there was a risk that (as I'd experienced in other cases) the program simply wouldn't work properly. But then I saw that they were offering a 30-day free trial, so I downloaded and installed it. It turned out to be useless for my purposes: it had almost no options, and therefore could not find my Thunderbird folders, which I was saving on drive D rather than C so as to avoid losing them in the event of a Windows update or reinstallation. I looked at Email Open View Pro (or its alternate, emlOpenView Free), which also offered a free trial. It didn't look like it (or Universal Converter, or MSG Viewer Pro, or E-mail Examiner, or Convert EML to PDF) would bring the attachments into the same PDF as the email, so I moved on. I tried Birdie EML to PDF Converter. Their free demo version allowed me to convert one EML file at a time. I liked its interface: it gave me eight different naming options for the resulting file (e.g., "date + subject + from," in several different date formats). I didn't like the output, though: they formatted the PDF for the EML file oddly, with added colors that I didn't want, and all they did with the attachment was to put it into a subfolder bearing the same name as the resulting PDF. I'd still have to PDF it -- the example I used was an EML with a .DOC file attachment -- and merge it with the PDF of the EML. But now they had led me to see that perhaps I could at least automate the extraction of attachments, getting me partway to the goal.

At about this point, Thunderbird inconveniently lost a folder containing several thousand email messages. It just vanished. The program was struggling there for a few minutes before that, and part of me was instinctively thinking that I should shut down the program and do a backup, but this would have been a very deeply subconscious part of me that was basically unresponsive under real-life conditions. In other words, I didn't. So now I had to go rooting around in backups to see what I could rescue from the wreckage. I found that Backup Maker had been happily making backups, as God intended. Amazing what can happen, when you have about five different backup systems running; in this case I had just wiped a drive, moved a drive, etc., and so of course Backup Maker was the *only* backup system that was actually in a position to restore real data when I seriously needed it. What Backup Maker had saved was some files with an .MSF extension. These were supposedly backups of Thunderbird. But then, no, on closer inspection I realized these were much too small, so I had to do some more digging. Eventually I did patch together something resembling the way things had been before the crash, so I could go back and pick up where I had left off. A couple of days passed with other interruptions here, so the following information just reports where I went from this point forward.

I had the option of just saving the Thunderbird file, or the exported emails, for some future date when there would perhaps be improved software for printing attachments to PDF in a single operation with the emails to which they were attached. There had been times when software developments had saved (or would have saved) a great amount of time in what would have been (or actually was) a tedious manual effort. On the other hand, I had also seen situations where letting something sit meant letting it become confused or corrupted, or where previous learning (especially on my part) had been lost. I decided to go ahead with converting the emails to PDF to the extent possible without a tremendous time investment.

My searching led to Attachment Extractor, a Thunderbird extension. I installed it, highlighted two emails with attachments, right-clicked on them, and selected "Extract to Suggested File-Folder." It worked -- it did extract the attachments without removing them from the emails. I assumed it would do this with hundreds of emails if I so ordered. Then, to get them matched up with PDFs of the emails to which they were attached, I would apparently have to page down through those emails one by one, looking at what attachments they had, and setting them up for more or less manual combination. Attachment Extractor did have one potentially useful feature for this purpose: a right-click option to "Extract with a Custom Filename Pattern." I found that I could configure the names given to the extracted attachments, so that they would correspond at least roughly with the names of emails printed to PDF. To configure the naming in Attachment Extractor, I went into Thunderbird > Tools > Add-ons > Extensions Tab > AttachmentExtractor > Options > Filenames tab. There, I used this pattern:

#date# - Email from #fromemail# re #subject# - #namepart# #count##extpart#

and, per instructions, in the Edit Date Pattern option I used this pattern:

Y-m-d H.i

That gave me extracted attachments with names that were at least roughly similar to the format I wanted (see above).
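
To make the effect of those two patterns concrete, here is a small Python imitation. It is only a sketch under assumptions: it assumes the date pattern uses PHP-style codes (Y = four-digit year, m = month, d = day, H = 24-hour hour, i = minutes), and it assumes #namepart# is the attachment name without its extension, #extpart# is the extension, and #count# is a counter. The expand function is hypothetical and is not how Attachment Extractor works internally.

```python
from datetime import datetime

PATTERN = "#date# - Email from #fromemail# re #subject# - #namepart# #count##extpart#"

def expand(pattern, values):
    """Replace each #key# placeholder with its value (illustration only)."""
    out = pattern
    for key, value in values.items():
        out = out.replace(f"#{key}#", value)
    return out

example = expand(PATTERN, {
    # "%Y-%m-%d %H.%M" is the assumed Python equivalent of "Y-m-d H.i".
    "date": datetime(2011, 3, 20, 14, 23).strftime("%Y-%m-%d %H.%M"),
    "fromemail": "jdoe@example.com",   # made-up address
    "subject": "Tomorrow",
    "namepart": "agenda",              # made-up attachment name
    "count": "1",
    "extpart": ".doc",
})
print(example)  # 2011-03-20 14.23 - Email from jdoe@example.com re Tomorrow - agenda 1.doc
```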

Batch Exporting Emails with Helpful Filenames

Now if I could print the corresponding email to PDF with a fairly similar name, the manual matching might not be so difficult. A search led to inquiries about renaming PDF print output. For $60, I could get Zan Image Printer, which sounded like it would have some capability for automating PDF printout filenames. Print Helper, for $125 to $175, was another option. A Gizmo's Freeware article did not say much about this kind of feature, though several people asked about it. A list of free PDF printers led me to think that UltraPDF Printer was free and would do this; its actual price was $30.

The pdf995 Experiment

At this point, I was reminded of how much time I could waste on uncooperative software. No doubt many people have used pdf995 successfully. I was not among them.

I tried Pdf995's Free Converter. The instructions on bypassing the Save As dialog were in the Pdf995 Developer's FAQs page. They seemed to require me to open C:\Program Files\PDF995\res\pdf995.ini in Notepad. But that .ini file seemed to be configured for printing one specific file that I had just printed. They didn't say how to adjust it. Eventually I figured out that I needed to download and install pdfEdit995, and make the changes there. So I tried that. But I got an error message:

PdfEdit995 requires that Pdf995 v9.1 or later and the free converter v1.2 or later are already installed on your system.

But they were! I had just installed them. Was I supposed to reboot first? No, a reboot didn't fix it. I tried again to install basic pdf995 and the Free Converter, which I had downloaded together. Once again, I got the webpage saying I had installed it successfully. Was I supposed to install the printer driver too? I understood the webpage to be saying that was included in the 9MB download. But I tried that. Got the congratulatory webpage, so apparently it installed correctly. Now I noticed I had another error, which had not come up on top, so I was not sure how long it had been there:

Setup has detected you have an old version of pdfEdit995 that is incompatible with the latest version of pdf995.

But I had just downloaded it from their website! Not an altogether auspicious beginning here. But I downloaded and installed the very latest and tried again, and now it seemed to work, or at least I got a different congratulatory webpage than before. A cursory read-through still did not give me a clear approach to automated naming of PDFs. Instead, they said that maybe I wanted their OmniFormat program for batch PDF creation. Who knew? I downloaded and installed OmniFormat. Got a new congratulatory webpage, but still no straightforward explanation of batch naming. Instead, it said that pdfEdit995 was what I wanted to create batch print jobs. So, OK, a bridge too far. Though at this point they specified "batch print jobs from Microsoft Office applications," raising the prospect that this wasn't going to work from Thunderbird. Went back to their incredibly tiny-print pdfEdit instructions page. It said I would have to set pdf995 as the default printer to do the batch thing. That was OK. But it still sounded like it was intended primarily for batch printing from Microsoft Word.

I decided to just try making pdf995 the default printer. That required me to go to the Windows Start button > Settings > Printers > right-click on PDF995 > set as default printer. While I was there, I right-clicked on PDF995 and looked at its Properties, but there didn't seem to be anything to set for purposes of automating printing. Now I went to Thunderbird, selected several messages, and did a right-click > Print. Funny, it defaulted to Bullzip, which was my usual default printer. Checked again: yeah, pdf995 was definitely set as my default printer. Tried again, and this time manually set it to pdf995 when it was printing. It asked for the filename, so that wasn't working. Back in Printers, I looked at the Properties for Bullzip, but it didn't seem to have any options for automatic naming either. It seemed pdf995 was not the solution for me.

I came away from this little exploration with less time and ambition for the larger project. Certainly I wasn't in the mood to buy software and then discover that I couldn't make it work.

Further Exploration

I ran across an Addictive Tips article that said Print Conductor was a good batch printing option, though I might need to have Adobe Acrobat installed first. I did, so I took a look. There was an option to download Universal Document Converter (UDC) as well. I wasn't sure, but I thought I might need that for Print Conductor, so I installed both. Print Conductor didn't seem to have a way of printing EML files. Meanwhile, UDC's installer gave me the option of making it the default printer, so I tried that. But as before, Thunderbird defaulted to Bullzip, so I had to select UDC as my printer manually. (Print Conductor did not appear in the list of available printers.) When I selected UDC as the printer, before printing, I clicked on the print dialog's Properties button and went into the Output Location tab. There, I chose the "predefined location and filename" option. I left the default filename alone and printed. And it worked. Sure enough, I had a couple of PDFs whose names were the same as the Subject fields shown in Thunderbird for those printed emails. So I would be able to match them with the attachments produced by Attachment Extractor (above). All I had to do now was to pay $69 for a copy of UDC, so that each PDF would not have a big black "Demo Version" sticker on it.

Recap

So to review the situation at this point, I had a way of extracting email attachments with highly specific date, time, and subject filenames. I also had a way of extracting emails themselves whose filenames would show date and subject, using ImportExportTools (above): Tools > ImportExportTools > Export all messages in the folder > EML format. Unfortunately, there could be a number of messages in a single day on the same subject. Without the time data in the filename, I would have duplicates. More to the point, it would be difficult to match emails and attachments automatically, and I didn't want to go through that matching process for large numbers of emails. I would also prefer a result in which emails converted to PDFs would appear in the right order in Windows Explorer, and that would require the time data. As I was writing this recap, several days after writing the foregoing material, I was not entirely sure that I had verified that the output filename in UDC would include the time data. But if that proved to be the case on revisit, at this point one option would be to buy UDC (or perhaps one of the other programs just mentioned) for purposes of producing properly named emails. Another would be to export the list of emails to Index.csv (above) and to hope that this list would match the order in which ImportExportTools would export individual emails. There would still be the possibility that such a program would sometimes fail to do what it was supposedly doing, perhaps without me noticing until long after the data from which I had exported and renamed various files would be long gone.

The Interim Solution

I decided that, at this point, I could not justify the enormous time investment that would be required to complete this project -- in particular, to manually print to PDF each attachment to each email, to combine those PDFs, and to match and merge them with a PDF of the email message to which they had been attached. This seemed like the kind of project that really had to await some further development in application software. For all I knew, the kind of solution I was seeking already existed, and I was just waiting until the day when I would become aware of it. It was not at all an urgent project -- I rarely consulted attachments for old emails, and almost never consulted them for prior years, where I was focusing my present attention.

I wanted to get those old emails out of Thunderbird. I didn't like the idea of having all that data at the mercy of a relatively inaccessible program (i.e., I couldn't see those emails in Windows Explorer), and anyway I didn't want T-bird to be so cluttered. It seemed that a good solution would be to focus on the emails themselves for now. I would export them to EML format. EMLs would continue to contain the attachments. I would then zip the EMLs into a small number of files, each no more than a few gigabytes in size, perhaps on a year-by-year basis, and I would store them until further notice. Before zipping, I would make sure the EMLs were named the way I wanted, and would print each of them to separate PDFs. So then I would have at least the contents of the emails themselves in readily accessible format, and could go digging into the zip file if I needed an attachment. If I did someday find a way to automate the task of combining the emails and their attachments into a single PDF, I would give those PDFs the same name as the current email-only PDFs, so that the more complete versions would simply overwrite the email-only versions in the folders where I would store them.

Export and PDF via Index.csv

I decided to try and see if the Index.csv approach would work for purposes of producing EMLs whose names contained all of the elements identified above (i.e., date, from, to, subject). I had sorted the old emails in Thunderbird into separate folders by year. I went to one of those folders in T-bird and sorted it in ascending date order. Then I went into Tools > ImportExportTools > Export all messages in the folder > Just index (CSV). This gave me what appeared to be a matching list of those messages, in that same order. The number of lines in the CSV spreadsheet (viewed in Excel) matched the number of messages in that folder as stated in T-bird's status bar.

I wondered what would happen if I exported another Index.csv after sorting the emails in that T-bird folder in descending chronological order. Good news: the resulting index.csv produced in that experiment seemed to be reversed from the one I had produced in ascending order. At least the first and last emails were in reversed positions. So it did appear that index.csv matched the order that I saw in T-bird.

On that basis, I added an Index number column at the left end of the index.csv file I was working with, the one with emails sorted in ascending date order. This index column just contained ascending numbers (1, 2, 3 ...), so that I could revert to the original sort order if needed. I assumed that the list would continue to sort in proper date order, but I planned to revise the date field (presently in "7/4/1997 18.34" format) so that it could function for file sorting purposes (e.g., 1997-07-04 18.34). I wasn't sure that the present and future date fields would always sort exactly the same. I could have retained the existing date field, but I wasn't sure that it, itself, was reliable for sorting purposes: would two messages arriving in the same minute always sort in the same order?
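
The date revision described here was done in the spreadsheet, but the same conversion can be sketched in Python for comparison. This assumes the list uses month/day/year order with a four-digit year, as in the "7/4/1997 18.34" example; the function name sortable_stamp is hypothetical.

```python
from datetime import datetime

def sortable_stamp(raw):
    """Convert e.g. '7/4/1997 18.34' to '1997-07-04 18.34'."""
    # Assumes month/day/year order; strptime accepts unpadded month and day.
    parsed = datetime.strptime(raw, "%m/%d/%Y %H.%M")
    return parsed.strftime("%Y-%m-%d %H.%M")

rows = ["12/25/1996 9.05", "7/4/1997 18.34"]
# Keeping an index number alongside each row preserves the original order.
indexed = [(i, sortable_stamp(raw)) for i, raw in enumerate(rows, start=1)]
print(indexed)  # [(1, '1996-12-25 09.05'), (2, '1997-07-04 18.34')]
```

Note that two messages within the same minute would still get identical stamps, which is why the separate index column remains useful.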

Now I exported the emails themselves: Tools > ImportExportTools > Export all messages in the folder > EML format. As partially noted above, these were named in Date - Subject - Number format. I now did a search to try to figure out what that number signified. It wasn't clear, but it seemed to be just randomly generated. Too bad. It would have been better if they had included the time at the start of that random number, and had put it immediately after the date, so that the EMLs would sort in nearly true time order. (There could still be multiple emails on the same subject within the same minute, and T-bird didn't seem to save time data down to the second or fraction of a second.) It seemed I would have to manually sort files bearing the same subject line and arriving or being sent on the same day. There would surely be large numbers of files like that. I now realized they would not at all be sorted correctly in Windows Explorer: with only date (not time) data in the filename, a file arriving in the morning with a subject of Zebras would be sorted, in Windows Explorer, after a file arriving in the afternoon on the subject of Aardvarks, and if there were three on the subject of Aardvarks they would all be sorted together even if they had arrived at several different times of day.
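
The Zebras and Aardvarks problem can be shown with a few made-up names. In this sketch the filenames are simplified and hypothetical, and Python's plain string sort stands in for Windows Explorer's name sort (Explorer's ordering rules are not identical, but they should agree for names like these).

```python
# Two messages on the same day: Zebras arrived in the morning,
# Aardvarks in the afternoon. Names are simplified for illustration.
date_only = ["19970730-Zebras.eml", "19970730-Aardvarks.eml"]
with_time = ["19970730-0836-Zebras.eml", "19970730-1510-Aardvarks.eml"]

print(sorted(date_only))  # afternoon Aardvarks sorts ahead of morning Zebras
print(sorted(with_time))  # with the time in the name, morning Zebras comes first
```

With only the date in the name, the subject decides the order within a day; once the time follows the date, the subject matters only for messages in the same minute.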

Ah, but now I discovered that ImportExportTools had file naming options. Silly me. I had just overlooked that. But there they were: Tools > ImportExportTools > Options > Filenames tab. I selected "Add time to date" and I chose Date - Name (Sender) - Subject format. Now I tried another export of EMLs. The messages now had names like this:

19970730-0836-Microsoft-Welcome!

I checked and, sure enough, that was a message from Microsoft on that date at 8:36 AM. Suddenly the remainder of my job got a lot easier. I went back to the Index.csv spreadsheet (now renamed as an .xls) and worked toward perfecting its match with the filenames produced by ImportExportTools. There were two parts to this mission. First, I had to rework the Index.csv data exported from T-bird so that it would match the filenames given to the EMLs by ImportExportTools. Second, I would then use the spreadsheet to produce a batch file that would rename those files to the format I wanted. This called for some spreadsheet manipulation described in another post.

Converting EMLs to PDF

Now I faced the problem of converting the exported EMLs to PDF, as distinct from the problem (above) of exporting PDFs from Thunderbird.

I found that EMLs could be converted into TXT files just by changing their extensions to .txt, which was easy enough to do en masse with a program like Bulk Rename Utility. That would permit them to be converted to PDFs without the rich text, if necessary, since it was a lot easier to find a freeware program that would do that (or, in my case, to use Acrobat) than to find one that would PDF an EML. This appeared to be a viable solution for some of the older emails, which had apparently been through the wringer and were not showing much sign of having glorified colors or other rich text or HTML features.
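
Bulk Rename Utility did this job here, but for comparison, a minimal Python sketch of the same extension change follows. It runs against a throwaway temporary folder with one made-up file; the function name eml_to_txt is hypothetical, and anyone trying this on real files would want a backup first.

```python
import tempfile
from pathlib import Path

def eml_to_txt(folder):
    """Rename every .eml file in folder to .txt; return how many were renamed."""
    count = 0
    for eml in sorted(folder.glob("*.eml")):
        target = eml.with_suffix(".txt")
        if target.exists():
            continue  # never overwrite an existing .txt file
        eml.rename(target)
        count += 1
    return count

# Demonstration in a temporary folder, so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "19970730-0836-Microsoft-Welcome!.eml").write_text("Subject: Welcome!\n")
    renamed = eml_to_txt(folder)
    names = sorted(p.name for p in folder.iterdir())

print(renamed, names)
```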

Before proceeding with this, I decided to export all of the remaining EMLs from Thunderbird. I knew I could read the EMLs and the TXTs (if I renamed them as that); I also knew I could reimport them into T-bird. This seemed like a separate step. I also decided that going back through the exporting process would give me an opportunity to write a cleaner post that would summarize some of the foregoing information.

Using a Spreadsheet to Rename Thousands of Files - First Try

I had a list of a couple thousand files. I got the list as an export from a program, but I could also have gotten it from a directory listing at a command prompt (e.g., DIR /a-d /b /s). This post describes how I used that list to rename those files. Needless to say, since I was working with the possibility of screwing up years' worth of information, I did make a backup of these files before proceeding.
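
For readers without a command prompt handy, roughly the same listing as DIR /a-d /b /s (files only, bare format, including subfolders) can be sketched in Python. The function name list_files is hypothetical, and the demonstration uses a temporary folder with two made-up files.

```python
import os
import tempfile

def list_files(root):
    """Return full paths of all files under root, including subfolders."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)

# Demonstration: one file at the top level, one in a subfolder.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "1997"))
    for rel in ("a.eml", os.path.join("1997", "b.eml")):
        with open(os.path.join(tmp, rel), "w") as handle:
            handle.write("test\n")
    listing = [os.path.relpath(path, tmp) for path in list_files(tmp)]

print(listing)
```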

This particular list was a list of emails that I had just exported from Thunderbird, as described in another message that I expect to post to this blog on the same day as this one. I had exported, not only the list, but also the actual emails, in EML format. I planned to use the list to rename those EMLs so that they would be in the format I wanted, which was like this:

2011-03-20 14.23 Email to John Doe re Tomorrow.pdf

It would require some massaging to get there.

The Date and Time Segment

Starting with the date and time, here's what Thunderbird had given me in the list of files:

3/20/11 14.23

These were all in one field, in the .csv (i.e., Excel-readable) output from T-bird. They might have been in two or more fields, in a text file produced by a directory listing (e.g., DIR /a-d /b /s > dirlist.txt), but to some extent the same techniques might prove useful. These numbers were not in a format that would work as a Windows filename, and in fact the exported EMLs did not contain the date and time data in this format.

I was going to use the list to rename the emails the way I wanted, with data that existed in the list but not in the current EML filenames. To do that, I would need to try to get the list so that it accurately represented the actual EML filenames. Otherwise, if the list produced a command that said, Rename File 234A.eml to be "Message from Ray 234A," that command would not work if the file being renamed was actually called File_234A.eml (with an underscore). My command would just get a "File not found" error. So my first step was to find a way, in my Excel spreadsheet, to reproduce what the files were actually called, as I viewed them in Windows Explorer.

First, I would have to extract each element into a separate column, there in the spreadsheet, so that I could format and rearrange them as desired. For this, I started with FIND commands. (Excel's internal help had good information on using these and other related commands. I was doing this in Excel 2003. There might have been other functions that would automate this process in newer versions.) For instance, =FIND("/",A1) would locate the first occurrence of the forward slash in cell A1, if that's where I put the "3/20/11 14.23" item shown above. So then I could search for the next slash (=FIND("/",A1,B1+1)), starting one character after the result of the first FIND statement (which, in this example, I had put in cell B1). I could do the same for spaces, periods, or whatever else might delimit various components of the date and time entry. I'd then use RIGHT, LEFT, or MID statements to find those components. For instance, a MID statement like =MID(A1,B1+1,C1-B1-1) would start one character after the first slash, continue to one character before the second slash, and thus give me the day of the month.
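
The same find-the-delimiter, then slice-between-delimiters idea can be sketched in Python, purely as a parallel to the Excel formulas (the function name split_stamp is hypothetical). One difference to keep in mind: Excel's FIND counts positions from 1, while Python's find counts from 0.

```python
def split_stamp(raw):
    """Split '3/20/11 14.23' into (month, day, year, hour, minute) strings."""
    first_slash = raw.find("/")                    # like =FIND("/",A1), but 0-based
    second_slash = raw.find("/", first_slash + 1)  # like =FIND("/",A1,B1+1)
    space = raw.find(" ", second_slash + 1)
    dot = raw.find(".", space + 1)
    return (
        raw[:first_slash],                  # month: everything before the first slash
        raw[first_slash + 1:second_slash],  # day: like =MID(A1,B1+1,C1-B1-1)
        raw[second_slash + 1:space],        # year
        raw[space + 1:dot],                 # hour
        raw[dot + 1:],                      # minute
    )

print(split_stamp("3/20/11 14.23"))  # ('3', '20', '11', '14', '23')
```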

In the 3/20/11 14.23 example, the value of "3" for the month wouldn't give me the desired "03" so I would use IF and LEN statements to pad that out:

=IF(LEN(G1)=1,"0","")&G1

thereby inserting a zero in front of single-digit month values found in cell G1, but adding nothing to those month values if they were not single-digit.
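
For comparison, the same padding logic in Python, shown both as a literal translation of the IF/LEN formula and with the built-in zfill method (the function names here are hypothetical):

```python
def pad_like_excel(value):
    """Literal translation of =IF(LEN(G1)=1,"0","")&G1."""
    return ("0" if len(value) == 1 else "") + value

def pad2(value):
    """Same result for one- and two-digit strings, using str.zfill."""
    return value.zfill(2)

print(pad_like_excel("3"), pad_like_excel("12"))  # 03 12
print(pad2("3"), pad2("12"))                      # 03 12
```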

I had gotten two different things from Thunderbird. On one hand, as noted above, I had gotten a list of data about emails that I was going to export, with dates in the 3/20/11 format. On the other hand, I had also gotten the actual emails, exported as individual EML files. These were not named quite the same way. The particular Thunderbird extension that I had used to produce this list of files, ImportExportTools, had produced files whose names were in this format: Date-Time-Sender-Subject. But in those filenames, the dates could not be rendered as 3/20/11, since slashes were not allowed in Windows 7 filenames. Instead, the date and time came in the format of 20110320-1423. So as hinted above, I would ultimately be producing a batch file that would automate the renaming of thousands of files. That example, 20110320-1423, would instead begin with the more readable 2011-03-20 14.23.
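
The two conversions involved (list format to EML-filename format, and EML-filename format to the readable target) can be sketched in Python as follows. This is an illustration, not the spreadsheet method used here; it assumes month/day/two-digit-year order in the list, and both function names are hypothetical.

```python
from datetime import datetime

def list_to_eml_prefix(raw):
    """'3/20/11 14.23' (list format) -> '20110320-1423' (EML filename format)."""
    return datetime.strptime(raw, "%m/%d/%y %H.%M").strftime("%Y%m%d-%H%M")

def readable_stamp(compact):
    """'20110320-1423' (EML filename format) -> '2011-03-20 14.23' (target)."""
    return datetime.strptime(compact, "%Y%m%d-%H%M").strftime("%Y-%m-%d %H.%M")

prefix = list_to_eml_prefix("3/20/11 14.23")
print(prefix)                  # 20110320-1423
print(readable_stamp(prefix))  # 2011-03-20 14.23
```

The first function rebuilds what the existing file is called; the second produces the start of the new name.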

Next: Massaging the Sender Data

Before I could get to that point, I had to change some of the Sender data that T-bird had produced. Again, the list of actual email data from T-bird contained characters (e.g., the > symbol) that could appear in email Subject lines but that were not allowed in Windows 7 filenames. (The full set of forbidden characters: \ / : * ? " < > | ). For example, the spreadsheet might show a sender to be this:

John Doe <jdoe@xcom.com>

but the actual exported EML file's name would just have John Doe. So I could get rid of some of these problems of verboten characters by just doing a FIND for < and then a MID or a LEFT statement for everything before that. That wouldn't necessarily get rid of all the bad-character problems, but I wanted to clean those up just once, so I deferred that problem for the moment. For right now, in a bid to match the format of the exported EMLs, I now had this much of the filename:

20110320-1423-John Doe

To get there, the actual command I used, for the Sender portion, was this:

=LEFT(B2,V2-1)

Cell B2 had the Sender's name as exported from Thunderbird, and cell V2 had the FIND location for the < symbol, which marked the end of the part of the Sender data that I planned to use.
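
A Python parallel to that LEFT/FIND combination is sketched below (the function name display_name is hypothetical). It differs from the formula in two small ways: it trims the space left in front of the < symbol, and it copes with senders that have no < at all, a case where Excel's FIND returns a #VALUE! error.

```python
def display_name(sender):
    """Return the part of a sender string before '<', trimmed of spaces."""
    cut = sender.find("<")
    if cut == -1:
        return sender.strip()   # no angle bracket: keep the whole string
    return sender[:cut].strip()

print(display_name("John Doe <jdoe@xcom.com>"))  # John Doe
print(display_name("jdoe@xcom.com"))             # jdoe@xcom.com
```

Python's standard library also has email.utils.parseaddr, which splits a sender string into a name and an address.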

At this point, there was a problem. For some reason, ImportExportTools put underscores before and after some Senders' names, but not all. So I might instead have this:

20110320-1423-_John Doe_

The filenames already used a hyphen ( - ) to delimit fields, so the combination hyphen-underscore seemed superfluous. It needed to be fixed before I could continue with the main project here, so that I could be confident that Sender names were delimited consistently by a hyphen, and not by unpredictable choices involving underscores.

Detour: Renaming Thousands of Files, So That I Could Proceed to Rename Thousands of Files

To fix this problem, I went to the Windows 7 command prompt (Start > Run > cmd) and did a directory listing to save the filenames to a file: DIR /b > dirlist.txt. I copied the contents of that file into column A of a new Excel spreadsheet. I also copied those contents into Notepad. In Notepad, I did two global replaces (changing both -_ and _- to just plain - ). I copied those changed contents into column B of that Excel spreadsheet. Next, I needed to combine these two columns, A and B, into a single DOS command that would rename the files. To do that, I put this formula in column C:

="ren "&char(34)&A1&char(34)&" "&char(34)&B1&char(34)

The char(34) entries would introduce quotation marks: 34 was the ASCII code for double quotes. I could have introduced any character that way -- Z, a semicolon, whatever -- but it was essential to use the char(34) approach for quotation marks specifically because otherwise Excel would have misunderstood them. Quotation marks would be necessary for any DOS command involving filenames that contained spaces; DOS would otherwise think that the space marked the end of a part of the command. Anyway, I copied the contents of column C back into Notepad and saved the file as a batch file (renamer.bat). I saved renamer.bat in the folder where I had all those EML files I was going to rename. At a DOS prompt in that folder, I typed "renamer.bat," without the quotes. I could have just double-clicked on renamer.bat in Windows Explorer. Either way, the files were renamed.
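The Notepad-plus-Excel round trip above could also be collapsed into a few lines of Python. The sketch below is a hypothetical alternative, not the method I used: build_ren_commands is an invented name, and the input list stands in for the contents of dirlist.txt. It produces the same kind of quoted REN lines that the column C formula produced, but skips files whose names would not change.

```python
def build_ren_commands(old_names):
    """Build one quoted REN command per file whose cleaned name differs.

    The cleaning step is the same pair of global replaces done in
    Notepad: "-_" and "_-" both become a plain "-".
    """
    commands = []
    for old in old_names:
        new = old.replace("-_", "-").replace("_-", "-")
        if new != old:
            # Double quotes keep names with spaces together on the command line.
            commands.append(f'ren "{old}" "{new}"')
    return commands


for line in build_ren_commands(["20110320-1423-_John Doe_-Tomorrow.eml"]):
    print(line)
```

The resulting lines could be written to a file such as renamer.bat and run from the folder containing the EMLs, as described above.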

Later, I discovered that ImportExportTools would truncate the Subject field in the spreadsheet to 50 characters even if I had not asked it to do so. Many emails in my set had Subject fields longer than that. I initially tried to fix this by setting up a separate spreadsheet, using the process just outlined, but with a search for the last hyphen in the filename, which would hopefully identify the start of the Subject field in most cases. Unfortunately, the calculation of 50 characters proved difficult, after taking into account the character substitutions described here. Ultimately, I started over with a fresh export of EML files from Thunderbird, after setting the option at Tools > ImportExportTools > Options > Filenames tab > Cut subject at 50 characters (and also Cut complete file path length at 256 characters).

Returning to the Sender and Subject Segments

So now I could go back to work on the main spreadsheet. I could assume, that is, that the first part of each EML filename was going to look like this:

20110320-1423-John Doe

without underscores. I could see that there were still a few underscores around names in the EMLs, and that was problematic. Possibly ImportExportTools had introduced two underscores in a row, rather than one, in some cases. There were few enough that I figured I could fix those exceptions manually.

Now it was time to add the Subject part of the filename. This was simple enough: just use another ampersand (&) to combine it with the rest. Sketching it out, the EMLs now just used hyphens to delimit the date, time, sender, and subject portions, so I just needed to use an Excel command of this format:

=[Date]&"-"&[Time]&"-"&[Sender]&"-"&[Subject]&".eml"

and that would give me a complete representation of how ImportExportTools seemed to have constructed the filenames for most of the exported EMLs.

Cleaning Up Unwanted Characters

As noted above, some characters (e.g., the colon, ":") were not acceptable as Windows 7 filenames, but were quite common in email subject lines. It looked like ImportExportTools had replaced those characters with an underscore, and had removed spaces before and after the underscore. So, for instance, a subject line of "Re: Tomorrow" was represented, in the EML filenames, as "Re_Tomorrow," without a space.

At this point, preparing to remove those unwanted characters, my spreadsheet had something like this:

20110320-1423-John Doe-Re: Tomorrow

Some of these fledgling combinations contained other unwanted characters (e.g., question marks), listed above. I was almost ready to remove them. But first, it was time to proofread my spreadsheet. Having copied all of my various formulas down from the first line of the spreadsheet, where I was developing them, so that they were present in all rows, I now prepared to sort the spreadsheet. First, I inserted an Index column as column A, and in that column I inserted numbers in ascending order, starting with 1. The purpose of this column was to remember my original sort order, in case the date field or anything else did not accurately reproduce the order in which emails appeared in Thunderbird. There were times when a person would want to check back and see if things were matching the source. To insert these numbers, using Excel 2003, there were two options. One was to enter a formula in each cell, adding 1 to the number above it, and then replace the formulas with fixed values. The sequence for this was Edit > Copy, Edit > Paste Special > Values. An alternative was Edit > Fill > Series. Either way, I now had index numbers that wouldn't change if I sorted rows.

So now I did sort rows, sorting first on the column that concatenated (i.e., combined) the date, sender, and subject. I sorted to highlight those rows in which the process had not worked as intended. One problem, I saw, was that some Sender names did not include the <jdoe@xcom.com> part. The sender of these emails was just listed as John Doe (or whoever), without the actual email address. For these, I just copied the Sender straight over, skipping the unnecessary part of the spreadsheet calculation. That cleared up the error messages in the spreadsheet. So now I saved a version of the spreadsheet and moved the output column -- the one combining all of my work with these various fields -- and pasted it into Notepad. I could have edited it right there in Excel, but there was a risk that I would accidentally have changed other columns in ways that I did not intend, or that I could not achieve what I needed to achieve. For example, the ? symbol was a wildcard in Excel, so attempting to replace it with an underscore would either not work or create a mess. I noticed that ImportExportTools had also converted commas into underscores, even though they were not forbidden characters. Note: now that I had parts of the spreadsheet in Notepad, and expected to paste those parts back into Excel in the same order, this would have been an especially bad time to sort the spreadsheet.

Over in Notepad, I did global find-and-replace operations for each of the special characters listed above, guided by the objective of matching as closely as possible the actual filenames of the EMLs I had exported from Thunderbird. It was premature to do line-by-line editing of individual entries, though; those would be more obvious and perhaps more easily fixed later. At this point, if I saw something that needed to be changed, I executed it as a global command, so that it would be changed wherever it occurred. Obviously, a person can make serious mistakes with global changes so, again, this was not the time for fine-detail fixes of individual entries. On the other hand, each of these global changes was a potential time-saver: a single change here could save the need to make manual changes to 20 or 300 individual files later.

After making these changes, I again had to replace the -_ and _- combinations with a simple hyphen ( - ), since some forbidden characters that I had just replaced with underscores had been adjacent to the delimiting hyphens that ImportExportTools seemed to convert into simple hyphens. With these changes made, I copied the contents of that Notepad file back into the appropriate place in the spreadsheet.
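The global replaces in this section can be approximated in one pass with a regular expression. The following Python sketch rests on assumptions drawn from what I observed rather than from any ImportExportTools documentation: that each forbidden character, along with any spaces around it, becomes a single underscore, and that commas are treated the same way. The name approximate_exported_name is hypothetical.

```python
import re

# Characters Windows does not allow in filenames, plus the comma, which
# ImportExportTools also appeared to turn into an underscore (assumption
# based on observed output).
FORBIDDEN = r'[\\/:*?"<>|,]'


def approximate_exported_name(text: str) -> str:
    """Guess how a Date-Time-Sender-Subject string appears as an EML name."""
    # Replace each forbidden character, and any spaces around it, with "_".
    text = re.sub(r"\s*" + FORBIDDEN + r"\s*", "_", text)
    # Underscores that land next to a delimiting hyphen are dropped.
    return text.replace("-_", "-").replace("_-", "-")


print(approximate_exported_name("20110320-1423-John Doe-Re: Tomorrow"))
```

Because the pattern is applied everywhere at once, it carries the same risk as any global change: a rule that is slightly wrong will be wrong in every filename it touches.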

Testing the Filename Match

How well did my spreadsheet now reflect the actual names of those EML files? I decided to test it. To do this, I first made a backup copy of the folder containing the EMLs. I then created a subfolder in the EMLs folder, called Test. In the spreadsheet, I worked up a MOVE command for each file, commanding it to move to the Test folder. The command was like this:

="move "&CHAR(34)&Z2&".eml"&CHAR(34)&" Test"

I copied all those MOVE commands from the spreadsheet to Notepad and saved the Notepad file as MOVER.BAT in the EML folder. Then I ran it. It succeeded in moving fewer than half of the files, which meant that the spreadsheet had not yet captured hundreds of EML file names correctly. I did a DIR /b > dirlist.txt in the EMLs folder to capture the names of the ones that had failed. I brought that list into the Excel spreadsheet and matched them up with my attempts. This matching process was sufficiently time-consuming that I found myself grateful for the ones that I had managed to identify in a more automated fashion. The time investment was acceptable. As had been the case for me with spreadsheets going back almost 30 years, I figured that, once I identified the rules and became more familiar with this particular process, I would be able to use it again in similar tasks.
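The same test can be run without moving anything, by comparing the predicted names against a directory listing. Here is a minimal Python sketch of that comparison; split_by_match is a hypothetical name, and the two lists stand in for the spreadsheet's output column and the contents of dirlist.txt.

```python
def split_by_match(predicted_names, actual_names):
    """Compare predicted filenames with the names actually in the folder.

    Returns three lists: predictions that exist, predictions that do
    not, and real files that no prediction matched. The comparison
    ignores case, since Windows filename matching does too.
    """
    actual_lookup = {name.lower(): name for name in actual_names}
    predicted_keys = {name.lower() for name in predicted_names}

    found = [n for n in predicted_names if n.lower() in actual_lookup]
    missed = [n for n in predicted_names if n.lower() not in actual_lookup]
    leftover = sorted(name for key, name in actual_lookup.items()
                      if key not in predicted_keys)
    return found, missed, leftover
```

The leftover list corresponds to the files that MOVER.BAT would have left behind in the EML folder.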

To match up the real filenames with the versions in the spreadsheet, I pasted the contents of dirlist.txt into a separate worksheet in Excel. In both, I compared the date and time data, using VLOOKUP on shortened versions of the relevant columns (=LEFT(Z1,13)). This revealed several things. First, there was an error in how I had converted the date and time data, so I would have to fix that and re-run it to move more of the items from the EML folder into the Test folder. Since I planned on doing this task again, I also decided to automate some of the find-and-replace operations described above. This required using FIND and MID to divide the text string into parts appearing before and after the character I wanted to replace (e.g., colons) and then joining them together without the middle part. I didn't do this for all possible combinations; I was basically interested in eliminating the first and/or most frequent occurrences of the forbidden characters that did seem to appear.

Starting Over: Revise the Filenames, Not the List

After hours of effort and a couple of retries, the approach described above was still identifying (i.e., successfully moving) only about 80% of the EMLs I had exported from Thunderbird. Depending on the total number of emails involved, that could mean hundreds or even thousands of emails that would have to be manually renamed in order to ensure that their filenames contained all of the desired data.

I decided that the better approach might be to try to rename the files, and keep renaming them if necessary, until they conformed to certain basic rules in the file list exported into Index.csv. There had already been some of that in the steps described above; the change now was to make that the primary effort.

The first step was to take a listing of the EMLs. This was easy enough: DIR /b > dirlist.txt. Then I copied the contents of that text file into Excel and changed components of the filename to be the way I wanted. The names of EMLs exported from Thunderbird did not have "To" fields, but I was able to match up most of the names automatically with what I had already prepared, thus borrowing To fields from the work described above. (This is a cursory description. At this writing, I had run out of time for this project, and was focused on getting it done. But the basic techniques were as described above.)

Some filenames did not match up easily to support an automated renaming of the EMLs. I renamed the rest, using a MOVE command to put them into a separate folder. I copied those that did not rename into a separate folder too. These I renamed using the Bulk Renamer (above), so as to have .txt extensions. I did that so that I could view them in IrfanView, which enabled me to flip back and forth among them quickly, so as to identify the proper "To" information.

At this point, I started a new post to summarize more clearly the steps taken here.

Windows 7: Archiving Thunderbird Emails to Individual PDFs - Retry

I had a large number of emails in Thunderbird (an email program like Outlook, but open source freeware). I wanted to export each of those emails to its own distinct PDF file with a filename containing Date, Time, Sender, Recipient, and Subject information in this format:

2011-03-20 14.23 Email from Me to John Doe re Tomorrow.pdf

In that example, I might ultimately eliminate the "from Me" part as understood, but of course other emails would be from John back to me, so for starting purposes I wanted all five of the fields just listed. The steps I went through are described below. There is a summary at the end of this post.

Recap and Development: Converting Emails into EML Format Files with Preferred Filenames

So far, I had already worked through the process of exporting those emails to distinct EML files. I had also used a spreadsheet to rename those EML files so that they would provide clearer and more complete information about the file's contents. (I was using Excel 2003 for spreadsheeting. OpenOffice Calc was now able to handle a million rows (i.e., to rename a million files), but it had not been stable for me. One option, for those who had more than 65,000 EMLs and therefore couldn't work within Excel's 65,000-row limit, was to do part of the list at a time.) This post picks up from there, summarizing a more streamlined approach to the steps described at greater length in the two previous posts linked to in this paragraph.

I had previously tried to begin with the Index.csv file exported from Thunderbird via ImportExportTools, but that had been a very convoluted and unsatisfactory process. I did continue to use Index.csv, but my main effort was to work up a spreadsheet that would use and alter the filenames created when I exported EMLs from T-bird, also using ImportExportTools. As described previously, I had developed some rules for automated cleaning of various debris from filenames, such as the underscores that ImportExportTools inserted in place of quotation marks and other characters.

To summarize the approach described in more detail in the previous post, I got the filenames from the folder where ImportExportTools had put them by using this command at the CMD prompt (Start > Run > cmd): "DIR /b > dirlist.txt" and then I copied and pasted the contents of dirlist.txt into an Excel spreadsheet. There I extracted the Date, Sender, and Subject fields from those filenames using Excel functions, including FIND, MID, TRIM, and LEN, all described in Excel's Help feature and in the previous post. I also used Excel in a separate worksheet to massage the data on the individual emails as provided in Index.csv.

The two worksheets did not produce the same information, and I needed them both. The one contained actual filenames, which I wanted to revise en masse to be more readable and to include the "To" field, which was contained in Index.csv. Many of the things that ImportExportTools screwed up about the subject fields of emails, for purposes of CMD-compatible filenames (and going well beyond that) involved the underscore character. Hence, the chief sections in the main worksheet (where I revised the data from dirlist.txt), going across the columns, were as follows:

Dirlist (raw EML filenames)
Date & time conversion (from 19980102-0132 to 1998-01-02 01.32)
Subject: Clean up starting & ending underscores of From names
Subject: Replace "Re_ " with "Reply re" in Subject field
Subject: Replace "n_t" and "_s" (as in "don_t" and "Mike_s") with apostrophes
Subject: Replace serial underscores: "_ _" becomes " - "
Subject: Replace "I_m" with "I'm" and "you_re" with "you're"
Subject: Replace underscore and space ("_ ") with hyphen (" - ")
Subject: Remove starting and ending hyphens

That accounted for the bulk of the needed changes in the Subject field, in the files I was working with. I set these rules up to eliminate the first one, or in some instances two, occurrences of the underscore string in question. Few emails contained more than that; for those few, leaving the additional underscores in place was acceptable. There would be some predictable misfires of these rules, but they would generally improve the situation, and when dealing with a large number of EMLs that I didn't intend to rename manually, this was the best that I could hope for under the circumstances.
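The column-by-column rules listed above translate fairly directly into an ordered list of replacements. The Python sketch below is an approximation, not a copy of my spreadsheet: the names convert_timestamp, SUBJECT_RULES, and clean_subject are hypothetical, it replaces every occurrence rather than only the first one or two, and it assumes that "Reply re" should be followed by a space.

```python
from datetime import datetime


def convert_timestamp(stamp: str) -> str:
    """Turn 19980102-0132 into 1998-01-02 01.32."""
    parsed = datetime.strptime(stamp, "%Y%m%d-%H%M")
    return parsed.strftime("%Y-%m-%d %H.%M")


# Applied in order, following the sequence of columns in the worksheet.
SUBJECT_RULES = [
    ("Re_ ", "Reply re "),   # trailing space is an assumption
    ("n_t", "n't"),          # don_t -> don't
    ("_s", "'s"),            # Mike_s -> Mike's
    ("_ _", " - "),          # serial underscores
    ("I_m", "I'm"),
    ("you_re", "you're"),
    ("_ ", " - "),           # underscore followed by a space
]


def clean_subject(subject: str) -> str:
    """Apply each rule in turn, then trim stray hyphens and spaces."""
    for old, new in SUBJECT_RULES:
        subject = subject.replace(old, new)
    return subject.strip(" -")


print(convert_timestamp("19980102-0132"))
print(clean_subject("Re_ Mike_s plan"))
```

As with the spreadsheet version, these rules will misfire now and then (for example, "_s" at the start of an unrelated word), so the output is a starting point rather than a finished list.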

Then I used VLOOKUP to search for a match with the Index.csv-style Date and Time (e.g., 19980102-0132) data in the Index.csv worksheet, and also for a match with the Index.csv Date+Time+From combination. (Sometimes the From field was necessary to distinguish two or more emails sent at the same time. Because of the underscores and other oddities about the EML filenames, subjects were too different to compare in most cases.) This identified precise matches between the two worksheets for about 80% of EMLs.
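The two-stage lookup described here (Date and Time first, then Date+Time+From to break ties) can be sketched with dictionaries in place of VLOOKUP. This is a hypothetical illustration: match_recipients and the tuple layouts are assumptions, not the structure of Index.csv itself, and the sketch accepts a match only when the key is unique in the index.

```python
from collections import Counter


def match_recipients(files, index_rows):
    """Look up the To field for each exported file.

    files:      list of (stamp, sender) pairs parsed from EML filenames
    index_rows: list of (stamp, sender, recipient) rows from the index
    Returns {(stamp, sender): recipient} for unambiguous matches only.
    """
    stamp_counts = Counter(stamp for stamp, _, _ in index_rows)
    pair_counts = Counter((stamp, frm) for stamp, frm, _ in index_rows)

    # Keep only keys that identify exactly one index row.
    by_stamp = {s: to for s, _, to in index_rows if stamp_counts[s] == 1}
    by_pair = {(s, frm): to for s, frm, to in index_rows
               if pair_counts[(s, frm)] == 1}

    matches = {}
    for stamp, sender in files:
        if stamp in by_stamp:                 # first pass: date and time
            matches[(stamp, sender)] = by_stamp[stamp]
        elif (stamp, sender) in by_pair:      # second pass: add the sender
            matches[(stamp, sender)] = by_pair[(stamp, sender)]
    return matches
```

Files that fall through both passes are the ones that would need the fallback naming or manual attention.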

So now I was going to try using that same spreadsheet with another batch of emails exported from Thunderbird. I exported the Index.csv and the EMLs, and set to work on the spreadsheeting process of reconciling their names and producing MOVE commands for a CMD batch file that would automatically rename large numbers of EMLs to be readable and to include data from the To field.

This time around, I did a first pass to bulk-recognize and batch-rename that first 80% of the EMLs. The CMD command format was this:

MOVE /-y "Old Filename.eml" "Renamed\New Filename.eml" 2> errlog.txt

This renamed the old EMLs to the desired new EML filenames, put them into the Renamed subfolder, and gave me an error log to say what went wrong with any of the renames. The error log wasn't very useful, so I stopped creating it in these commands. What I had to do instead, to find out which EMLs had been successfully renamed, was to do a dirlist.txt for the Renamed folder, feed that back into the spreadsheet, and delete those lines that had executed successfully. For about 15% of the emails, I could not automatically detect matches between data from Index.csv and actual files, so I wound up naming those files according to date, time, and sender only. Finally, I got down to less than 1% of emails that I had to rename in a more manual fashion, mostly due to non-ASCII characters in their filenames. For that, I used Bulk Rename Utility.
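A script can do the renaming and keep its own record of failures, which avoids the round trip through dirlist.txt. The Python sketch below is a hypothetical alternative to the batch file, not what I ran: rename_with_log is an invented name, and like MOVE /-y it declines to overwrite an existing target.

```python
import os


def rename_with_log(folder, renames):
    """Move files into a Renamed subfolder under new names.

    renames: list of (old_name, new_name) pairs.
    Returns a list of (old_name, reason) for every rename that was skipped.
    """
    target = os.path.join(folder, "Renamed")
    os.makedirs(target, exist_ok=True)

    failures = []
    for old, new in renames:
        src = os.path.join(folder, old)
        dst = os.path.join(target, new)
        if not os.path.exists(src):
            failures.append((old, "source not found"))
        elif os.path.exists(dst):
            failures.append((old, "target already exists"))
        else:
            os.rename(src, dst)
    return failures
```

The returned list plays the role that errlog.txt was meant to play, with one entry per file that still needs attention.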

I was not sure whether this route wound up being better than the approach of using one of the shareware programs discussed in the previous post. I was not aware of the potential difficulties when I was looking at those programs, so for example I didn't try them out on emails with Chinese characters in their Subject fields. The other way always looks easier after a project like this. The approach I had taken had surely been more time-consuming than if I had known of a killer app that would do exactly what I wanted without unanticipated complications or failures. Absent a reliable, obvious solution at an affordable price, the main thing I could say at this point was that at least the conversion to EML was done.

Final Step: Converting EMLs into PDF

With EMLs thus exported from Thunderbird and mostly renamed to indicate date, time, sender, recipient, and subject, the remaining task was to convert the EMLs to PDF. This, it developed, might not be as simple as I had hoped. There was, first, the problem of finding a program that would do that. Some of the emails were simple text and could have been easily converted to TXT format just by changing their extensions from .eml to .txt. Acrobat and other PDF programs would readily print large numbers of text files, unlike EMLs. Other EMLs, however, contained HTML (e.g., different fonts, different colors of print, images). I wasn't sure what would happen if I changed their extensions and then printed. I noticed that the change to .txt caused the HTML codes to become visible in one message that I experimented with. When I converted that file to PDF using Acrobat, its header appeared in a relatively ugly form, but the colors and fonts seemed to be at least somewhat preserved. In another case, though, the PDF was largely a printout of code -- a truly undesirable replacement for what had been a pretty email with photos included. My version of Acrobat (ver. 8.2) did not provide any editable settings for conversion from text or HTML to PDF.

Thunderbird was my default program for displaying EMLs. I wondered if a different program could view them and would have better PDF printing capabilities, or if I should try converting them into another interim format in order to then convert them to PDF. A search led to the claim that Microsoft Word (or other programs) could display EMLs. I tried and found that this was essentially untrue: in Word, there was almost nothing left of that pretty email I had just tested. Converting EML to MSG seemed to be one option, but this looked like a dead end; that is, it didn't look like it would be any easier to PDF an MSG file than to PDF an EML. Getting the EMLs into Outlook wasn't likely to be a solution; as I recalled, my version of Outlook (2003) had been unable to batch print emails as individual PDFs. Marina Martin said that MBOX was the standard interoperable email file type. I could have exported from Thunderbird directly to MBOX using ImportExportTools, but I had not investigated that; I had assumed that MBOX meant one large file containing many emails, like PST, and I had wanted to rename my emails individually. Martin gave advice on using eml2mbox to convert EML to MBOX; hopefully I would not have lost anything by taking the route through EML format. But if MBOX was such a common format, there was surprisingly little interest in converting it to PDF. My search led to essentially nothing along those lines. Well, but couldn't Firefox or any other web browser read HTML emails? I tried; neither Firefox nor Internet Explorer was willing to open an EML. I renamed it to be an .html file. Both opened that, but here again the problem was that the header was so ugly and hard to read: it was just a paragraph-length jumble of text mixing up the generally important stuff (e.g., from, to) with technical information about the transmission. Even assuming I could work out a batch-PDF process for HTMLs, this was not the solution. There were other possibilities, but in the end it did appear that I simply needed to buy an EML-to-PDF converter.

It tentatively appeared that MSGViewer Pro ($70) might be the most frequently downloaded program in this area, ahead of its own sister program PSTViewer Pro as well as Total Mail Converter. A search for reviews led to very little. It didn't appear that MSGViewer Pro had the ability to include image attachments within the PDF of an EML, as Total Mail Converter Pro ($100) supposedly did. On the other hand, MSGViewer Pro supposedly provided a free five-day trial. I decided that I did not have time to mess with endless numbers of attachments right now, and was therefore willing to just zip the EMLs into a single file for possible future processing, if I decided that there was sufficient need and time for that. Given my unlikelihood of using these programs very often, I also hoped that their prices would drop. I figured that if the MSGViewer Pro trial was fully functional, I might be able to take care of my need for it now, converting EMLs into PDFs without attachments, and otherwise let the matter sit for another year or more.

On that basis, I downloaded and installed MSGViewer Pro. It was apparently designed for an older version of Windows. When I installed it, I got one of those Win7 messages indicating that it might not have installed properly, and inviting me to reinstall using "recommended settings," whatever that meant. I accepted the offer. Once properly installed, I ran the program. A dialog came up saying, "Trial is not licensed for commercial use." I clicked "Run Trial." Right away, I found that its Refresh feature did not work: I copied some EMLs into a separate folder to experiment with, and could not get the program to find that folder. I killed the program and started over. Now it found the folder. I selected those messages, clicked the Export button, and told it to give the resulting PDF (one of the available output options; the others were TXT, JPG, BMP, PNG, TIFF, and GIF) the same names as the input files. It had a nice option, which I accepted, to copy failed messages to a separate folder. A dialog came up saying, "You can only export 50 emails in trial version of MsgViewer Pro." So that popped that fantasy. It ran pretty quickly and reported that all of the files had been successfully exported. Sadly, the results were no better-looking than I had been able to achieve on my own, with other measures described above. HTML codes were visible in some PDFs -- or perhaps I should say, not visible, but overwhelming: it looked like a piece of ordinary HTML coding. The typeface was tiny. Some lines were actually split down the middle horizontally, with the top half of a line of text appearing at the bottom of one page and the bottom half appearing at the top of the next page. In a word, the results were junk. I uninstalled MSGViewer Pro.

I decided to try Total Mail Converter Pro. No installation problems. When installation ended, the program started right up without giving me a choice. Then it decided I needed to log onto Gmail. This was not my plan, so I canceled that. I liked its interface better than MSGViewer Pro: smaller but still readable font, seemingly more options. I selected my test files and clicked the PDF button. It gave me options to combine the files into one PDF or produce separate files. It also provided a file name template, with choices of subject, sender, recipient, date, and source filename. I tried these. There were other options: which fields to export, whether to include attachments in the doc or put them in separate folders, header, footer, document properties. It did the conversion almost instantly. The date format was month-day-year. The subject data weren't cleaned up, so I would still have had to go through something like my spreadsheet process to get the filenames the way I wanted them. Moment of truth: the file contents included a colored top part, as I had encountered with Birdie (see previous post). HTML codes were still visible in some messages, but in others the HTML seemed to have been better converted into rich text. Typefaces were still tiny. Definitely a better program. But worth $100 for my needs?

Ideally, I would have been converting my emails to PDF as I went along, without converting them around and around, from Outlook to Thunderbird to EML and wherever else they might have gone over the past several years. This might have better preserved what I recalled as the colorful, more engaging look of some of them, and perhaps I would have come up with better ways of capturing those characteristics as I continued to become more experienced with the process. In the present circumstances, where I really just wanted to get the job done and move on, it seemed that playing with that sort of thing was not a short-term option.

Since I was planning to keep the EMLs anyway, and since I did not plan to view these emails frequently, I decided that I really didn't lose much in informational terms by going with the free option identified above. I took a larger sample of EMLs and, using Bulk Rename Utility, renamed them to be .txt files (though later I realized I could have just said "ren D:\Documents\*.eml *.txt"). Since I had installed Adobe Acrobat, I had a right-click option to convert to Adobe PDF. No doubt some freeware PDF programs provided similar functionality. The Acrobat conversion of these files into PDF was not nearly as fast as that performed by Total Mail Converter Pro. Acrobat put each of those newly created PDFs onscreen and obliged me to manually confirm that I wished to save them. I had converted 40 files, and wasn't interested in manually closing all 40; ultimately I had to use Task Manager to shut them down. That problem turned out to be just a result of the settings I was using for my default Bullzip PDF printer; changing those defaults and using Acrobat's Advanced > Document Processing > Batch Processing option made the process completely automatic. In terms of appearance, it seemed the fonts, HTML handling, and other features were more or less the same as I had gotten from those other programs (above). I probably could have made the average resulting email more readable (except where HTML formatting made clear who was responding to whom) by looking for a program that would strip the HTML codes out of those TXT files, but I didn't feel like investing the time at this point and wasn't sure the effort would yield a net improvement.

Briefly, then, the PDFing part of this process involved using a bulk renamer to replace the .eml extension with a .txt extension, and then using a bulk PDF printer or converter to convert those TXT files into PDF. This approach still preserved the look of some emails, while allowing others to be overrun with HTML codes.

I ran that batch process on a full year's set of EMLs. I converted 1,422 EMLs into TXT files by changing their extensions with Bulk Rename Utility. Somehow, though, Acrobat produced only 689 PDFs from that set. Which ones, and what had happened to the rest? Acrobat didn't seem to be offering a log file. My guess was that Acrobat went too fast for Bullzip. There was no real reason why I shouldn't have been using Acrobat's own PDF printer for this particular project -- in fact, I did not remember precisely what Acrobat snafu had prompted me to switch to Bullzip as my default PDF printer in the first place -- so I went into Start > Settings > Printers and made that change now. I also right-clicked and changed some of the Printing Preferences, for that printer, so that it would run automatically. I deleted the first set of PDFs and tried again. I noticed, this time, that Acrobat was not even trying to convert more than 689 files -- it was saying, "1 of 689," "2 of 689," etc. What was causing it to overlook these other files, I was not sure. It seemed I would have to do a "DIR /b > Printed.txt" command in CMD, and then convert Printed.txt into a Deleter.bat file that would delete the text files that were successfully printed, so as to highlight the ones that remained. (See previous post for details on these sorts of commands.)
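The Printed.txt and Deleter.bat step amounts to asking which TXT files have no PDF of the same name. A minimal Python sketch of that comparison follows; unconverted is a hypothetical name, and it assumes the converter names each PDF after its source file (Something.txt becomes Something.pdf).

```python
import os


def unconverted(txt_names, pdf_names):
    """Return the TXT files that have no PDF with the same base name.

    Base names are compared without regard to case, as Windows does.
    """
    done = {os.path.splitext(name)[0].lower() for name in pdf_names}
    return [name for name in txt_names
            if os.path.splitext(name)[0].lower() not in done]
```

Feeding it the two directory listings gives the list of files to retry, without deleting anything along the way.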

(Incidentally, I had also noticed, now that I was working with the Acrobat batch options, that it had a "Remove File Attachments" option. While it did not seem to work with EMLs, possibly it would have been useful if these emails had been in MSG or PST format.)

The automated process got as far as file no. 2 in the list before it stalled. Why it stalled, I had no idea. I clicked on the X at the top right-hand corner of the dialog to kill it -- I even said "Close the Program" when Windows gave me that option -- and then Acrobat took off and printed a couple hundred more PDFs before stalling in that same way again. Possibly I had the Acrobat PDF printer's properties set to stop on encountering an error. I ran through most of that first set before spacing out and killing Acrobat (the whole program) at a stall, instead of just killing the stalled task. I deleted those that had printed successfully, creating a Deleter.bat file for the purpose as just mentioned, and ran another batch. This time, Acrobat was printing a total of 667 files. So I figured the situation was as follows: Acrobat would print PDFs through a glorified command-line kind of process, and that command line would accommodate only so many characters. If I'd had shorter file names, maybe it would have been willing to print thousands of TXT files at one go. If I had wanted to add complexity to the process, I could have renamed my files with names like 0001.txt, reserving a spreadsheet to change their names back to original form after conversion to PDF. But with my filenames as they were, it was only going to process 600 or 700 at a time. That was my theory.
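The short-name workaround mentioned above would need a reliable way to restore the original names afterward. The Python sketch below shows one way to build that mapping; short_name_plan is a hypothetical name, and the sketch only tests my theory about command-line length, which I never confirmed.

```python
def short_name_plan(names, ext=".txt"):
    """Map short sequential names (0001.txt, 0002.txt, ...) to originals.

    The number width grows beyond four digits only if the list is long
    enough to need it, so the short names always sort in list order.
    """
    width = max(4, len(str(len(names))))
    return {f"{i:0{width}d}{ext}": name
            for i, name in enumerate(names, start=1)}
```

The mapping could be saved before renaming, then read back after conversion to turn 0001.pdf and its siblings back into the long descriptive names.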

When Acrobat was done with the second set -- the first one that had run through to completion -- it showed me a list of warnings and errors. These were errors pertaining to maybe a dozen files. The errors included "File Not Found" (typically referring to GIFs that were apparently in the original email), "General Error" (hard to decipher, but in some cases apparently referring to ads that didn't get properly captured in the email), and several "Bad Image" errors (seemingly related to the absence of an image that was supposed to appear in the email). A spot check suggested that the messages with these errors tended to be commercial (e.g., advertising) messages, as distinct from personal or professional messages that I might actually care about. In a couple of cases a single commercial email would have several errors. But anyway, it looked like they were being converted, with or without errors.

I decided to try printing the next batch with Bullzip instead of the Acrobat printer. I had to set it as the default printer in Settings > Printers. I also had to adjust its settings (by going to its Options shortcut in the Start Menu > General and Dialogs tabs) so that it would run without opening dialogs or completed PDFs. Would it now process significantly more than 600 input files? The short answer: no. So for the next round, I tried selecting all the TXT files in a folder and right-clicking > Convert to Adobe PDF. This was a bad idea. Now Acrobat wanted to open a couple thousand documents onscreen. I had to force-reboot the system to stop this one.

So now I thought maybe I'd look for some other text-to-PDF converter. It sounded like ActivePDF was a leading solution for IT professionals, but I didn't care to spend $700+. Shivaranjan recommended Zilla TXT To PDF Converter ($30). Softpedia listed a dozen freeware converters, of which by far the most popular was Free EasyPDF. But I couldn't quite figure out what was going on there. There was no help file, and the program wasn't even listed on its supposed creator's webpage. CNET called it fatally crippled. I didn't know why 30,000 people would have downloaded it. Back to Softpedia's list: Free Text to PDF Converter was another possibility with a Good rating. Its webpage said it could batch-convert text to PDF files. I went into its Open option, selected a boatload of TXT files, and saw no sign that it had any intention of doing anything with them. Looking more closely at its starting screen, I saw it said this:

Command Line usage:
TXT2PDF <inputfile> <output.pdf> [parameter table]

The documentation webpage said I was supposed to drag the TXT files into the window on the main screen to convert them. It also said this program would convert only plain text, not HTML. I wasn't sure what that meant for the EMLs that contained HTML code as plain text. The optional parameters had to do with font, paper size, etc. In the folder where I had my TXT files to be converted, I tried this command:

"C:\Program Files\Text2PDF v1.5\txt2pdf.exe" "Text File to Be Converted.txt"

with quotation marks as shown, on the command line. It worked. It produced a PDF. There was no word wrap, so words would just break in the middle at the end of the line, like this:

We can't pledge that we've entirely emerged from th
at episode, but this
past summer I sat down and rewrote the entire man
ual in a way that makes
more sense. The guy just didn't know how to phrase

The print size was very large, though there were parameters to change that, but nothing, apparently, to persuade lines to break at the ends of words rather than in the middle. This could defeat Copernic text searching, rendering some PDF file contents unfindable, so it wasn't going to be a good solution for me. But it really seemed like the command line approach, which would let me name each file to be converted, was the answer to the problem of being able to process only ~600 text files at a time. Another possibility: AcroPad. The following command worked:

Acropad "File to Convert.txt" "File Converted.pdf" Courier 11

I could have named other typefaces and font sizes. Output was double-spaced. Lines were broken at the ends of words, not in the middle. HTML code in the file was just treated as text and printed out as-is. I kept searching. A post by Adam Brand said I could use a command to automate printing if I had Acrobat Reader installed. That prompted another search that led to several insights. First, it turned out I could print a file from the command line using a Notepad command in the form of "notepad.exe /p filename." Since my default printer was a PDF printer, it printed a PDF -- a nice one, too, for basic purposes, nicer than some of the output I was getting from the programs tested above. It put the output on the desktop. I changed the location for the output by going into the Desktop folder for my username. Since I was running as Administrator, the location was C:\Users\Administrator\Desktop. There, I right-clicked on the Desktop folder, went to Properties > Location tab and changed it. (Another Notepad option, which I didn't need, was to specify which printer I wanted to use: /pt.)

The Notepad approach did nothing with HTML codes in these plain text files. An alternative that would work with rich text, which might or might not help in my case, was supposedly to try the same switch with Wordpad: "wordpad.exe /p filename." But when I did that, I got an error message:

'wordpad.exe' is not recognized as an internal or external command, operable program or batch file.

This was odd. To fix it, I ran regedit (Start > Run) and went to
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths\. There, following instructions, I right-clicked on App Paths and selected New > Key. I named the new key "Wordpad." I right-clicked on that Wordpad key and selected New > Expandable String Value. It apparently didn't matter what I called it. I called it ProgramPath. I right-clicked on ProgramPath and pasted in the path where Wordpad was, which I had obtained by going into the Properties of the Wordpad shortcut on my Start Menu. In other words, what I entered here included quotation marks and the name of the executable wordpad.exe, with extension. The instructions said that, to run Wordpad from the command line (as distinct from in Start > Run), the command would have to begin with the Start command. For present purposes, what I would type at the C prompt would be "start wordpad /p filename." This worked (and I exported the new registry key and added it to my Win7RegEdit.reg file for future installations), but it did not produce a superior PDF compared to that which Notepad had produced, and for some reason it truncated the filename of the resulting PDF.

Revised Final Step: Converting TXT to HTML to PDF

Searching onward, there was a possibility of treating them as HTML rather than TXT files. I had flirted with this earlier but had not grasped that, of course, these actually were HTML files in the first place; they had become EMLs and TXTs only later. I typed "ren *.txt *.htm" to rename them all as HTML files. To print them, there were some complicated approaches, but I hoped that PrintHTML.exe would do the trick. The syntax, for my purposes, was this:

printhtml.exe file="filename.htm"

with optional leftmargin=1, rightmargin=1, topmargin=1, and bottommargin=1 parameters, among others that I didn't need. The printhtml.exe file would of course have to be in the folder with the files being printed unless I wanted to add it to the registry as just described for Wordpad. PrintHTML wouldn't work until I installed the DHTML Editing Control. I did all that, and got no error messages, but also did not seem to get any output. I decided to put that on hold to look at another possibility: automated PDF printing using Foxit Reader on the command line. Pretty much the same command syntax as above:

"FoxitReader.exe" /p filename

Here, again, there was a need for a registry edit, unless I wanted to park a copy of Foxit in every folder where I would use it from the command line. But the instructions were only for using Foxit to print PDFs, so I got an error: "Could not parse [filename]." There was also an option of using Acrobat Reader to print a PDF silently or with a dialog box, but there again it wasn't what I needed: I was printing HTMLs. I returned to that printhtml.exe program mentioned above. The command ran, with no indication of any errors, but there did not seem to be any output. Another possibility was:

RUNDLL32.EXE MSHTML.DLL,PrintHTML "Filename.htm"

But for me, unfortunately, that produced an empty PDF. Turning again to freeware possibilities, I found an Xmarks list of top-ranked HTML to PDF programs. Most of the top-ranked items were online, one-file-at-a-time tools. Others required PHP knowledge that I didn't have (e.g., HTML_ToPDF, PDF-o-Matic). HTMLDOC looked promising for command-line usage; I found its manual; but when I downloaded and unzipped it, I couldn't find anything that looked like a setup or installation file. Apparently the version that's free is the source code, and I didn't know how to compile it. DomPDF and html2pdf (and, I suspect, some of these others) were apparently for Linux, not for Windows. I tried wkhtmltopdf. When I ran it, I got an error:

wkhtmltopdf.exe - System Error
The program can't start because libgcc_s_dw2-1.dll is missing from your computer. Try reinstalling the program to fix this problem.

Possibly the reason I got that error is that I was trying the same trick of running the program in a folder where my PDF files were. I had copied the executable (wkhtmltopdf.exe) to that folder, but had not brought along its libraries or whatever else it might need. I tried running it again -- I was just trying to use the help command, "wkhtmltopdf --help" -- but this time pointing to the place where the program files were installed:

"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" --help

and that worked. I got a long list of command options. What I understood from it was that I wanted, in part, a command like this:

"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -s Letter "File to be converted.htm"

I tried that. It gave me an error:

Error: Failed loading page http: (sometimes it will work just to ignore this error with --load-error-handling ignore)

So I tried adding that long parameter to the command. It seemed like it worked: it gave the error message but then proceeded through the rest of its steps and announced, "Done." But I didn't see any output anywhere. Then I realized there was an error in what I had actually typed. I tried again. This time, it gave me a different error message: "You need to specify at least one input file, and exactly one output file." So the format I was supposed to use, aside from that additional "--load-error-handling ignore" parameter, was this:

"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -s Letter "HTML file to be converted.htm" "New PDF File.pdf"

And that worked. At last, I had a mass-production way of converting EMLs (by changing their extension to .htm, not .txt) to PDFs. It was too early to break out the champagne, but at least the computer and I were back on speaking terms. Now I just needed to run "DIR /s /b > dirlist.txt" in the top-level folder under which I had sorted my emails, convert that dirlist.txt file into a .bat file that would convert the file listings into batch commands, and run it. I was afraid the whole command, with the introductory reference to C:\Program Files, would be too long for Windows in some cases, so I edited the registry as described above, so that I would only have to type wkhtmltopdf.exe at the start of each command line. But now that registry edit wasn't working -- it certainly seemed to previously -- so I copied all of the wkhtmltopdf program files to the folder where I would be running this batch file. I didn't want the computer to crash itself by opening hundreds of simultaneous wkhtmltopdf processes, and I wanted to move the PDFs, so the format I used for these commands was:

start /wait wkhtmltopdf -s Letter "D:\Former Directory\HTML file to be converted.htm" "D:\New Folder\New PDF File.pdf"

That worked. Now I investigated the longer list of wkhtmltopdf command-line options, by typing "wkhtmltopdf -H" (with a capital H). Whew! The list was so long, I couldn't view it in the cmd window -- it scrolled past the point of recall. I tried again: "wkhtmltopdf -H > wkhtmltopdf_manual.txt." I couldn't add too much to the command line -- I was already afraid the long filenames would make some commands too long for CMD to process. But having viewed some output of these various PDFing programs, a few sets of commands seemed essential, including these:

-T 25 -B 25 -L 25 -R 25
--minimum-font-size 10

The first set would give me roughly one-inch (25 mm) margins all around. Putting these on the already long command line increased my interest in another option: --read-args-from-stdin. This one, according to the manual, would also have the advantage of speeding up the process, since I would be starting wkhtmltopdf just once, and then re-running it with different arguments. The concept seemed to be that my conversion batch file (or, really, just a typed command) would contain this:

start wkhtmltopdf --read-args-from-stdin < do-this.txt

and then do-this.txt would contain line after line of instructions like this one:

-T 25 -B 25 -L 25 -R 25 --minimum-font-size 10 -s Letter "D:\Former Directory\HTML file to be converted.htm" "D:\New Folder\New PDF File.pdf"

Or perhaps they could be rearranged so that some of the contents of the second could be in the first, and therefore would not have to be repeated on every line in do-this.txt. In which case the main conversion command would look like this:

start wkhtmltopdf --read-args-from-stdin -T 25 -B 25 -L 25 -R 25 --minimum-font-size 10 -s Letter < do-this.txt

and do-this.txt would contain only the "before" and "after" filenames. I decided to try this approach. Unfortunately, it didn't work. It froze. So then I tried just the minimal one shown a moment ago, putting all options except --read-args-from-stdin in the do-this.txt file. Sadly, that froze too. I tried the minimal command plus just filenames, leaving out the several additional commands about margins and font size. Still no joy. So, plainly, I did not understand the manual. I decided to go back to the approach of just putting it all on one line and repeating all commands, in a batch file, for each HTM file that I was converting to PDF. Each line would begin with "start /wait," not just "start," for reasons stated above. This worked, but now I noticed a new problem that I really hadn't wanted to notice before, because I just wanted this project to be done already.
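For the one-command-per-file batch file described above, the lines can also be generated by a short script instead of a spreadsheet. What follows is only a minimal sketch in Python, which was not part of the process in this post. The function names build_command and build_batch are hypothetical, the options are the ones quoted earlier, and the 8191-character figure is the documented cmd.exe command-line limit on Windows XP and later.

```python
from pathlib import PureWindowsPath

# Options copied from the commands shown in the post; adjust as needed.
OPTIONS = "-T 25 -B 25 -L 25 -R 25 --minimum-font-size 10 -s Letter"

# cmd.exe does not accept command lines longer than 8191 characters.
CMD_LIMIT = 8191


def build_command(source, out_dir):
    """Return one 'start /wait wkhtmltopdf ...' line for a single HTM file."""
    src = PureWindowsPath(source)
    dst = PureWindowsPath(out_dir) / (src.stem + ".pdf")
    line = f'start /wait wkhtmltopdf {OPTIONS} "{src}" "{dst}"'
    if len(line) > CMD_LIMIT:
        raise ValueError(f"command too long for cmd.exe: {src.name}")
    return line


def build_batch(listing, out_dir):
    """Turn the text of a 'DIR /s /b' listing into the text of a batch file."""
    commands = [
        build_command(path.strip(), out_dir)
        for path in listing.splitlines()
        if path.strip().lower().endswith(".htm")
    ]
    return "\r\n".join(commands) + "\r\n"
```

Feeding build_batch the contents of dirlist.txt and saving the result with a .bat extension would give the same kind of file as the spreadsheet approach. Entries in the listing that do not end in .htm (such as dirlist.txt itself) are skipped.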

Separating EMLs With and Without HTML Code

The new problem was that emails that were originally in HTML format turned out best when they were now renamed with an .htm extension, and processed that way, but the ones that didn't have HTML codes in them were now reduced to a mess. Specifically, line and paragraph breaks were gone; everything was just jumbled together in one continuous stream of text. Every non-HTML email was now being represented by a single long paragraph. To get decent output, it seemed that I needed to separate the emails that contained HTML code from those that did not. I would then use wkhtmltopdf with the former, but not with the latter. But how could I tell whether a file contained HTML code? I decided that an occurrence of "</" would be good enough in most cases. But then it occurred to me that there might be programs that would sort this out for me. A search led to the FileID utility. Their read-me file led me to think that this command, entered in the top-level folder where the files to be checked were located, might do the job:

"D:\FileID Folder\fileid" /s /e /k /n

This would run FileID from the folder where its program files were stored, and would instruct FileID to check all files in all subdirectories, to automatically change file extensions to match contents, to delete null files, and not to prompt me for input. But it did not seem to be working. Regardless of whether I entered these options as upper- or lower-case (e.g., /S or /s), FileID paused after every screenful of information, and did not seem to be renaming anything. I decided to try again with another command-line program of similar purpose, TrID. TrID had an online version and a GUI. On second thought, I decided to give the GUI version a whirl. I downloaded the program and its XML definitions. (I already had the necessary .NET Framework installed.) As advised by Billy, I moved everything from the XML definition folder (after unzipping them with WinRAR) into the folder containing the TrIDNet.exe file. I double-clicked on that executable and saw that it would process only one file at a time.

I moved on to the command-line version. This called for a download of a different set of program files and definitions. I wasn't sure whether TrID would actually change incorrect extensions, or just detect them. Again, rather than plow into the support forums, I just tried it out. But in this case, that strategy didn't work: there was no manual or other usage instructions in the download. The forum contained a tip on using PowerShell to fix extensions, but I didn't know enough about PowerShell to be able to interpret and adapt that tip to my situation. But, silly me, I forgot about just getting online help. In the folder where I had unzipped TrID.exe, I opened a cmd window and typed "trid -?" and got the idea that I could type "trid -ce" or perhaps "trid *.* -ce" to have the program change file extensions as needed, for all files in the current directory. It didn't appear to have a subdirectory option, so I would have to do some file moving.

A different approach was to use a CHK recovery program to detect the proper extension for anything with a CHK extension. While FileCHK looked like the better program for recovering real CHK files, it looked like UnCHK would have more flexibility for my situation, provided I first ran "ren *.htm *.chk" to change the file extensions to .chk. When I tried to run unchk.exe, I got an error message:

The program can't start because MSVBVM50.DLL is missing from your computer. Try reinstalling the program to fix this problem.

Eric had already warned me, in the read-me file, that this meant I needed to download and install the Visual Basic 5 runtime. I did, and tried again. Now it ran. I couldn't find documentation or a /help option to explain its settings. It took me a while to realize it wasn't a command-line program, though it could run from the command line. It was very bare-bones. I started it, navigated to the first of the folders I wanted to repair, and (having renamed files to have .chk extensions), gave it a try. It gave me a dialog asking about Scan Depth. I knew from the read-me that I wanted the Whole Files option. It ran for a while and then disappeared. It didn't seem to have done anything. After some more searching around, I concluded that this CHK approach wasn't what I wanted.

So I looked elsewhere. If I wanted to spend a day or so refreshing my aging knowledge of BASIC programming, or invest some time in learning more about batch scripting or Microsoft Access or some other program, I was pretty sure I could work up a way to examine file contents. But I wanted a solution faster than that, if possible. The CMD batch FIND command looked like it might do the job. But the command that I thought should work,

FOR %G IN (*.txt) do (find /i "</" "%G")

didn't. It wasn't because "</" were weird characters; it wasn't finding files containing ordinary text either. I tried again with the FINDSTR command:

findstr /m /s "</" *.* > dirlist.txt

This looked promising. But when I examined dirlist.txt, I saw that many of the files listed in it were better presented as TXT than as HTM. Apparently I should have been looking for files with more substantial HTML content. A spot check of several emails suggested that the existence of an upper- or lower-case "<html" might be a good guide. So apparently I would have to run FINDSTR twice:

findstr /m /s "<HTML" *.* > dirlist.txt
findstr /m /s "<html" *.* >> dirlist.txt

with two ">" symbols in the second one, so as to avoid overwriting the results of the first search with the results of the second. I tried that. There were some error messages, "Cannot open [filename]," apparently attributable to weird characters in the file's name; somehow it seemed I had still not entirely succeeded in cleaning those up. I assumed FINDSTR's failure in this regard would leave those files being treated as TXT by default, which would probably be OK since the majority of files overall appeared to be non-html. Ultimately, dirlist.txt contained a list of maybe 40% of all of the emails I was working on. That seemed like it might be about right. In other words, it seemed that about 60% of the emails were best treated as plain text, and I would be getting to those shortly. I put dirlist.txt into a spreadsheet to produce commands that would run wkhtmltopdf on the files that those two commands listed in dirlist.txt. The key formula from that spreadsheet:

="start /wait /min wkhtmltopdf -T 25 -B 25 -L 25 -R 25 --minimum-font-size 12 -s Letter "&CHAR(34)&B1&"\"&C1&".htm"&CHAR(34)&" "&CHAR(34)&"..\Converted\"&C1&".pdf"&CHAR(34)
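The two FINDSTR passes described above amount to one case-insensitive search for "<html" (FINDSTR's /i switch would also do that in a single pass). As a rough illustration only, and not something used in this post, the same sorting can be sketched in Python. The names looks_like_html and split_by_markup are hypothetical. Reading the files as raw bytes sidesteps questions about text encoding.

```python
from pathlib import Path


def looks_like_html(path, marker=b"<html"):
    """Return True if the file contains the marker in any letter case."""
    data = Path(path).read_bytes()
    return marker in data.lower()


def split_by_markup(folder):
    """Walk a folder tree and sort its files into HTML and plain-text lists."""
    html_files, text_files = [], []
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            (html_files if looks_like_html(path) else text_files).append(path)
    return html_files, text_files
```

The first list would go to wkhtmltopdf and the second would be handled as plain text. A file that merely contains a stray "</" lands in the plain-text list, which matches the stricter "<html" test settled on above.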

That formula, applied to each file identified as containing "<html," produced PDFs that looked relatively good. I found that I needed a way of testing them, though, because in a number of cases wkhtmltopdf had produced PDFs that would not open. I also noticed that the batch file running these commands kept acting like it had died. Windows would say, "wkhtmltopdf.exe has stopped working," and I would click the option to "Close the program." And then, after a while, it would come roaring back to life. This may have happened especially when wkhtmltopdf was converting simple email messages into PDFs of a thousand pages or more. A thousand pages of gibberish. In a number of cases, too, the resulting PDF was a failure. When I tried to open those PDFs, Acrobat said this:

There was an error opening this document. The file is damaged and could not be repaired.

I was not sure what triggered these problems. I wondered if possibly the simpleminded conversion from EML to HTM by merely changing the extension caused problems in the case of EMLs that contained attachments. If that was the case, then what I should have done might have been to export from Thunderbird in HTML format in the first place -- to do two exports, in other words: one for EMLs, which would include attachments, to be zipped up into an archive and shelved until the future day when there would be a simple, cheap solution for the PDFing of emails plus their attachments; and another export in HTML, for purposes of PDFing here and now, without attachments. I tested this with one of the gibberished emails and found that, when exported from T-bird as HTML using ImportExportTools, it did print to a good-looking PDF. In that approach, the naming procedures used to rename the exported emails in the desired way -- containing date, time, sender, recipient, and subject information -- would apparently have to be preserved and reapplied, so that both exported sets -- the EMLs and the HTMLs -- would be named as desired.

To investigate these questions, I traced back one PDF that did not open -- that produced the error message quoted above -- and one that opened but that was filled with gibberish. The one that was damaged did not come from an email that originally contained attachments. I was able to print that email directly from Thunderbird without problems. So I wasn't sure what the problem was there. For a sample of one filled with gibberish, I chose the largest of them all. This was a 3,229-page PDF that was produced from a little two-page email that did originally have an attachment. I sampled three other PDFs containing gibberish. All three had come from emails that originally had attachments. So it did appear that attachments were foiling my simplistic approach of just changing file extensions from EML to HTM. I wondered if it was too late to just change the extensions back to .eml, for the ones that had not produced good PDFs, and maybe PDF them manually. I tried with one, and it worked. So that would have been a possibility, assuming I had time for printing emails one by one.

It seemed the gibberish might not be gibberish after all. It might be a digital representation of the photograph or whatever else was attached to the email. I didn't know of a way to test text for gibberish, so this didn't seem to be a problem that I could deal with very effectively at this point. I could name some files as HTM, as I had done, and just accept a certain amount of gibberish -- perhaps after screening out the really large PDFs (or, earlier in the process, the large EMLs, TXTs, or HTMs), which seemed most likely to have had attachments -- or I could rename them all as TXTs and print them that way, looking solely for the text content without regard to their appearance (and still probably getting gibberish). If I needed to know how they looked originally, I would have to go back to the archived EML version of the PDFd text. A third option was to go back to T-bird and re-export everything as HTML, thereby skimming off the attachments, and then use my saved renaming spreadsheets to rename the newly produced, roughly named HTMs, and then do my PDFing from those new HTMs. Presumably, that is, the new HTMs would print correctly, since they would not have attachments.
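An EML file is a MIME message, and attachments are normally stored inside it as long runs of base64-encoded text, which would be consistent with the gibberish described above. The following is a hedged sketch, not part of the process used here, of how Python's standard email package could pull out just the message body and leave the attachments behind. The function name html_body_of is hypothetical, and real-world messages may have quirks that this does not handle.

```python
from email import message_from_bytes, policy
from html import escape


def html_body_of(eml_bytes):
    """Return the message body as HTML text, ignoring any attachments."""
    message = message_from_bytes(eml_bytes, policy=policy.default)
    # Prefer the HTML version of the body; fall back to plain text.
    body = message.get_body(preferencelist=("html", "plain"))
    if body is None:
        return ""
    text = body.get_content()
    if body.get_content_type() == "text/plain":
        # Wrap plain text in <pre> so line breaks survive HTML rendering.
        text = "<pre>" + escape(text) + "</pre>"
    return text
```

Saving the returned text under an .htm name would give a file without the attachment data. Wrapping plain-text bodies in pre tags is one way to keep their line and paragraph breaks when they are rendered as HTML.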

Back to the Drawing Board: T-Bird to HTML to PDF

I decided to try that third option. I went back to Thunderbird and used ImportExportTools to export the emails as HTML rather than as EML. It would have been more logical to start by PDFing these HTMLs, to make sure that would work; but at this point I had such a clutter of emails in various formats that I decided to proceed, as before, with the renaming process first, so as to be able to delete those that I wasn't going to need. Having already worked through the process of renaming to the point of achieving final names, I used directory listings and spreadsheets to try to match up the "before" names (i.e., the names of the raw HTML exports) and the "after" names (i.e., the final names I had developed previously).

Once I had the emails in individual HTML files with workable filenames, I ran wkhtmltopdf again. I started by taking a directory listing of the files to be converted; I put those into a spreadsheet, as before; and in the spreadsheet I used more or less the same wkhtmltopdf formula shown above in order to produce working commands. These pretty much succeeded. I was now getting good PDFs from the emails. It seemed that wkhtmltopdf had a habit of wrapping lines severely or perhaps indenting them too much. That is, if I wrote an email in reply to someone else, the text of my email would look fine,

but the text of the message

to which I was replying,
typically shown below the

reply text, would be
indented and then broken

like this.

Wkhtmltopdf converted HTML files to PDF at a rate of somewhat more than one email per second. Of course, these were small files, as email messages tend to be. There was a problem with them taking up a lot of disk space; it seemed I might have been well-advised to format the drive to have smaller than the default cluster size. The program slowed down considerably at times. I assume it was running into complexities with some HTML files.

The batch file ran and finished, but it had converted only about half of the HTMLs into PDFs. I decided to test the PDFs before deleting the corresponding HTMLs. I opened a half-dozen of them without a problem. Then, for a more thorough test, as described in a separate post, I ran an IrfanView batch conversion from PDF into RAW format. I chose RAW because it would result in just one file. TIF might have been another possibility. It did appear that this process was all working well. Ultimately, these steps converted all of the HTMLs into PDFs.

Summary

The first part of what I was able to achieve, at this point, was to export my emails from Thunderbird to EML format, using the ImportExportTools add-on for Thunderbird. Once I had exported all those EMLs, I used a zipping program (either WinRAR or 7zip) to bundle them together into a single file containing all of a year's emails. I took these steps because EML files, unlike HTML, PDF, JPG, TXT, or other formats, were able to contain email attachments along with the text of the email messages. I planned to keep these year-by-year ZIPs of EMLs until some point when I could find a cheap and broadly accepted program for printing both the email message and its attachment into a single PDF.

The other main achievement was to work out a process for converting HTMLs (also exported from Thunderbird via ImportExportTools) into PDFs. I used wkhtmltopdf for this purpose. I ran it in a batch file, produced by a spreadsheet, so that there was one command per file. I used DIR folder comparisons and other means to test that all files were being converted and that they were being converted into valid PDFs.
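The two checks mentioned here (was every file converted, and is each PDF plausibly valid) can be roughed out in a few lines. This is only a sketch in Python under stated assumptions, not the DIR comparison or IrfanView test actually used. The function names are hypothetical, and it assumes each PDF has the same base name as its HTML source. The validity test looks only for the "%PDF-" header and the "%%EOF" end marker, so a file that passes could still fail to open.

```python
from pathlib import Path


def unconverted(html_dir, pdf_dir):
    """Return HTML files that have no same-named PDF in pdf_dir."""
    done = {p.stem.lower() for p in Path(pdf_dir).glob("*.pdf")}
    return sorted(
        p for p in Path(html_dir).iterdir()
        if p.suffix.lower() in (".htm", ".html") and p.stem.lower() not in done
    )


def suspicious_pdfs(pdf_dir):
    """Return PDFs that lack the usual header or end-of-file marker."""
    bad = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        data = pdf.read_bytes()
        if not data.startswith(b"%PDF-") or b"%%EOF" not in data[-1024:]:
            bad.append(pdf)
    return bad
```

Anything returned by unconverted could be fed back into another conversion run, and anything returned by suspicious_pdfs would be a candidate for opening by hand.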
