Sunday, December 18, 2011

Converting Email (EML) Files to PDF - Another (Partial) Try

Once before, I had converted individual email messages in EML format (from Thunderbird) to PDF format.  That had been a long and complicated process that I'd had to revisit a few weeks later.  It was now time to export some more emails from T-bird.  I decided to look for a simpler approach.

In the first stage of this inquiry, I was working with some EML files that had already been exported.  So this post starts halfway through the process.  The ordinary starting point, exporting emails from T-bird, appears later in this post.

Conversion Approaches

A thread gave me some ideas to play with.  It seemed that an EML, renamed to an HTM, could be opened in Internet browsers (e.g., Firefox) and also in editors (e.g.,  Microsoft Word, Wordpad).  Unfortunately, an email's header, containing the most recent sender, recipient, date, and subject information, was not inclined to print properly.  Some of the programs reviewed in the previous post would print the header with funky colors and other formatting stuff I didn't want.  Some would also produce tiny print.

I was able to produce better-looking output by changing the EML to an HTM extension and opening it in Notepad.  At the end of each line in the header, and twice more after the header, I inserted <br /> and then saved it.  Now it would open in Firefox with a halfway decent appearance.  There were also some lines in the header that I wanted to remove, technical lines other than the customary To, From, Date, and Subject lines.

The question was how to automate these changes.  I saw that Notepad++ would do changes for all opened files, but I didn't want to have to open large numbers of emails.  I found indications that a one-line Perl command and also an editor called TexFinderX would change all occurrences of certain text within multiple documents, but I didn't want to change all occurrences.  I wanted to change the first occurrence of "From:" to be "<br />From:" and likewise with Date and Subject, and I wanted to make other changes as well.
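A first-occurrence-only change of this kind could be sketched in Python roughly as follows. This is only an illustration, not one of the Perl commands or editors mentioned above. The function names are made up for the example, and matching the label only at the start of a line is an assumption about how the header is laid out.

```python
import re

def prefix_first(text, label, prefix="<br />"):
    """Put `prefix` before the first line starting with `label`; later matches are untouched."""
    # Assumption: the header field begins at the start of a line.
    pattern = re.compile(r"^" + re.escape(label), flags=re.MULTILINE)
    # A function replacement avoids any backslash handling in the replacement text.
    return pattern.sub(lambda m: prefix + m.group(0), text, count=1)

def mark_header_fields(text, labels=("From:", "Date:", "Subject:")):
    """Apply prefix_first once per label, in order."""
    for label in labels:
        text = prefix_first(text, label)
    return text

if __name__ == "__main__":
    sample = (
        "From: a@example.com\n"
        "Date: Sun, 18 Dec 2011\n"
        "Subject: Hi\n"
        "\n"
        "From: a quoted line further down\n"
    )
    print(mark_header_fields(sample))
```

Only the first From, Date, and Subject lines pick up the tag; a later line that happens to begin with "From:" (in a quoted reply, say) is left alone.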

A search led to another Perl command that would apparently perform at least some of these tasks.  It was tempting to try to pick up enough Perl fluency to make that command work.  I decided, though, that it would be better to start by learning Perl more thoroughly (someday), so as to have a clearer understanding of what this command might or might not be doing inside some large number of files.  The search led, similarly, to various SED and AWK commands, with the same reservations on my end.

For some reason, it felt safer to use a utility programmed to achieve the same thing as those command-line approaches, even though I would still not be able to see what was being changed.  Maybe it was just reassuring to think that someone with programming knowledge was trying to solve the problem.  It helped to find a Gizmo recommendation for ReplaceText (formerly BK ReplaceEm).  My faith was shaken, however, when I saw that the Gizmo recommendation, dated October 17, 2011, was pointing to a webpage that said ReplaceText was no longer supported and "has known problems with some Windows 7 installations."  Not the end of the world, but also not ideal.  Gizmo's second recommendation was A.F.9, which appeared to have been last updated in 2002.  It seemed I might have to come up with some other approach.

Dealing with Headers

Another problem I was dealing with was that not all emails had the same kind of header.  Some had at least a dozen lines, with references to things like "X-Message-Delivery," where others had a smaller set of header lines.  It appeared that a fully automated approach could easily make a mess.  I noticed, for instance, that the Birdie conversion program would just run multiple lines together, in files with some kinds of headers.

Having spent a lot of time undoing messes caused by the previous EML conversion process, I decided to take a slower and more cautious approach, at least for starters.  I began with FIND commands, on the Windows command line, designed to distinguish EML files with different kinds of headers.  These FIND commands ran into Access Denied errors, resolved in a separate post.

It turned out that emails could contain a variety of header fields.  Some (e.g., Date, From, Subject) were essential and self-explanatory.  Others (e.g., Content-Type) were evidently common but were not ordinarily displayed in email readers (e.g., Outlook, Hotmail, Thunderbird).  Another category was the X-field.  According to one source, "X-fields are experimental [though evidently X was intended to stand for "extension," not "experimental"] fields added by email clients or servers and may be useful valid information or may be forged."  Things seemed to have changed since 1993, when someone seems to have felt that X-fields were to be "strongly discouraged."  At this point, they were widely used.  I had seen a number of them.  They were also called X-headers.  There was apparently no authoritative list of them; people were seemingly free to invent them as they saw fit, perhaps by using ordinary email programs (e.g., Eudora).  I found lists of X-headers for Usenet and listserv posts.  After some hunting, I did finally find a list of X-headers that might appear in email messages, as well as a discussion of standard HTTP header fields.  By this point, though, it was clear that any such list had to be open-ended.  Even that lengthy list did not contain some of the X-headers that appeared in one of the first emails I looked at (i.e., headers of the X-Message variety).  I found no clear indication of what headers might appear in a legitimate email message.

So apparently I was not going to be able to rely on someone's preconceived list of headers, as a guide to removing the unwanted ones from a large number of emails en masse (perhaps using some tool like TexFinderX, above).  The most reliable approach would seemingly require me to identify the header fields actually used in the emails I wanted to convert.  There did not seem to be an automated way to do that.  Some emails had HTML codes or empty lines dividing the header from the body, but others did not.  I worked up an approach using screenshots, one per file (viewed in Notepad), to give me an impression of those codes.  But attempts to use optical character recognition (OCR) software on those screenshots did not give me an ideally reliable indication, in text, of the header contents.
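For readers comfortable with a scripting language, an inventory of the header fields actually present in a set of EMLs might be sketched with Python's standard email parser, along the following lines. This is a hedged illustration rather than something tried in this post. The function names and the *.eml folder layout are assumptions, and the result depends on where the parser decides the header ends, which may not match every oddly formed message.

```python
from collections import Counter
from email import message_from_string
from pathlib import Path

def header_names(eml_text):
    """Return the header field names of one message, in the order they appear."""
    return list(message_from_string(eml_text).keys())

def tally_headers(eml_texts):
    """Count, per (lower-cased) field name, how many messages use that field."""
    counts = Counter()
    for text in eml_texts:
        # A set, so a field repeated within one message is counted once.
        counts.update({name.lower() for name in header_names(text)})
    return counts

def tally_folder(folder):
    """Hypothetical helper: tally every *.eml file found directly in `folder`."""
    paths = sorted(Path(folder).glob("*.eml"))
    return tally_headers(p.read_text(encoding="utf-8", errors="replace") for p in paths)

if __name__ == "__main__":
    a = "To: x@example.com\nFrom: y@example.com\nX-Message-Delivery: abc\n\nBody\n"
    b = "From: y@example.com\nSubject: Hi\nContent-Type: text/html\n\n<html></html>\n"
    for name, n in sorted(tally_headers([a, b]).items()):
        print(name, n)
```

A tally like this would show which fields are common to every message and which X-headers turn up only here and there, without relying on anyone's published list.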

It seemed that I probably had the ability to use macros in Word to eliminate unwanted headers and to reposition the wanted headers within the body of the email message, so as to produce a good appearance when the file was then printed to PDF.  (I was using Bullzip as my PDF printer.)  I decided to approach that project by whittling down the size of the messages -- that is, by removing attachments first.

PDFing the Attachments

I had a sample set of 46 EML files.  I was not sure how many had attachments.  A look at some of them in Notepad suggested that not all EMLs announced the presence of attachments in the same way.  It did seem, though, that a certain text string would be found in most cases where an attachment existed.  That string was:

Content-Disposition: attachment;

On that starting hint, I ran a FIND command to see which of these 46 EMLs appeared to have attachments:

find /i /c "Content-Disposition: attachment;" "D:\Emails\*" > Attachlist.txt

Attachlist.txt gave me output like this:

---------- D:\EMAILS\FILENAME.EML: 1

apparently indicating that FILENAME.EML contained one occurrence of the Content-Disposition text string.  It said that maybe a third of the files had one such occurrence, with the rest having none.  An exception:  one file contained two occurrences.  I looked at it.  It seemed to be a case of a forwarded message, with the ATT00001.htm filename that I had often observed but never did understand.  The message was displayed in the EML, but was also apparently attached in that ATT format.  My guess was that the best approach would be to keep everything up to the last occurrence of the Content-Disposition string.

I used Excel to convert Attachlist.txt to a batch file that would move the files containing Content-Disposition to a separate folder.  Then I opened all of the EMLs to see how accurate this FIND command was.  It appeared that I had found the key to distinguishing those files that did contain attachments:  the ones with "Content-Disposition: attachment;" did contain attachments, and the others did not.
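The count-then-move step could also be done in a single script. The following Python sketch is one way it might look, offered as an assumption-laden illustration rather than the Excel-and-batch-file route actually used here. Like FIND with /I and /C, it ignores case and counts the lines that contain the string. The function names, the *.eml pattern, and the folder arguments are hypothetical.

```python
import shutil
from pathlib import Path

MARKER = "content-disposition: attachment;"

def count_marker_lines(text):
    """Count lines containing the marker, ignoring case (similar to FIND /I /C)."""
    return sum(1 for line in text.splitlines() if MARKER in line.lower())

def move_flagged(src_dir, dest_dir):
    """Move every *.eml in src_dir that contains the marker into dest_dir.

    Returns the names of the files that were moved.
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    moved = []
    # sorted() builds the full list first, so moving files does not disturb the loop.
    for path in sorted(src.glob("*.eml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if count_marker_lines(text) > 0:
            shutil.move(str(path), str(dest / path.name))
            moved.append(path.name)
    return moved

if __name__ == "__main__":
    demo = 'Content-Type: application/pdf\nContent-Disposition: attachment; filename="a.pdf"\n'
    print(count_marker_lines(demo))
```

It would be prudent to run something like this on a copy of the folder first, since it moves files rather than copying them.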

I opened the EMLs containing attachments and manually printed their attachments as PDFs, with names that would help me link them to the emails later.  Some of them were extraneous winmail.dat attachments that I ignored.  Some were already PDFs and thus didn't need to be printed to PDF.  So now, in that separate folder, I had about a dozen EMLs that contained attachments, and about a half-dozen attachments that I had just PDFed from those EMLs.  I wanted to convert those EMLs to PDF first, and separately from the larger group of non-attachment EMLs, so as not to get confused as to which emails the attachments belonged to.  (More precise naming of the PDFs would have alleviated that concern, but would have taken more time.)  So now it was time to figure out a solution for the EMLs themselves.

Removing Attachment-Related Material from EMLs

My original plan was to get the EML into Microsoft Word, where I could write macros to manipulate the text in useful ways.  If I just opened an EML in Word, it would appear without its header.  That is, it would not look like an ordinary email, as viewed in Thunderbird or Outlook.  I could get around that problem by opening Word and setting it to Options > General tab > Confirm conversion at Open, and then open the EML as Plain Text, but that would be an extra delay.  With a lot of emails, it could be cumbersome.

I wanted to start by removing the contents of the emails that contained attachments, starting at the point of the last occurrence of "Content-Disposition: attachment."  Experimentation revealed that EMLs so truncated would still open and look fine in Thunderbird, and now they wouldn't have all that attachment material to prevent them from functioning like HTML files.  If they had more than one attachment, an attachment notice would still appear at the bottom of the Thunderbird EML screen, but I was OK with that.
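The truncation itself is simple enough to sketch in Python. The example below is an illustration only, not the Emacs macro described later. The function name is invented, and cutting at the start of the line that holds the last marker (rather than at the marker itself) is an assumption about where the cut should fall.

```python
import re

# Case-insensitive match for the attachment marker.
MARKER_RE = re.compile(re.escape("Content-Disposition: attachment"), re.IGNORECASE)

def truncate_at_last_marker(text):
    """Drop everything from the line holding the LAST attachment marker onward.

    Text without any marker is returned unchanged.
    """
    matches = list(MARKER_RE.finditer(text))
    if not matches:
        return text
    start = matches[-1].start()
    # Back up to the beginning of the marker's line (0 if it is on the first line).
    line_start = text.rfind("\n", 0, start) + 1
    return text[:line_start]

if __name__ == "__main__":
    sample = (
        "Subject: Hi\n\n<html>body</html>\n--b\n"
        'Content-Disposition: attachment; filename="a.pdf"\n\nJVBERi0=\n'
    )
    print(truncate_at_last_marker(sample))
```

Note that, with more than one marker in a file, this keeps the earlier attachment material and removes only what follows the last marker, in line with the guess described above.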

It occurred to me that it might be possible to find a text editor with macro capabilities, so as to handle multiple tasks without having to cycle through multiple programs.  This led to a separate search for a suitable text editor.  In a development that would doubtless provoke some to cheer and others to weep, I wound up with Emacs.  In that process, I did manage to develop an Emacs macro that would eliminate, from an EML file, the post-HTML attachment material.

I was only working with a dozen or so EMLs in that pilot test.  I wound up manually combining the extracted attachments with PDF printouts of the truncated EMLs (also produced by hand, from Thunderbird), so as to remove this potentially complicating part of the larger EML-to-PDF project.

Regular EMLs:  A First Pass in Emacs

Now I was able to focus on the headers, without worrying about attachments.  I had a group of 34 EMLs, with various kinds of headers, that I wanted to manipulate into printable HTMs.  I opened 15 of them in Notepad and took a look.  I decided to take them in batches.  The first kind, the simplest, had "To: " as its very first four characters.  These seemed to follow a pattern of separate To, Date, Subject, and From lines, which I would keep, followed by Content-type, MIME-Version, and Content-transfer-encoding lines, which I would discard.  The four keeper lines would be moved down into the text, immediately after the BODY tag, with <br /> tags added for line breaks as needed.  I manually edited one of these in that way, opened it as an HTM, and it looked good.
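The manual edit just described, applied to this simplest kind of EML, might be automated along these lines in Python. This is a rough sketch under several assumptions: that a blank line separates the header from the rest, that each header field sits on a single line, and that the text was read with ordinary newline handling. The function name is made up, and escaping the angle brackets in addresses is an extra step added here so that a browser does not swallow them as tags.

```python
import html
import re

KEEP = {"to", "date", "subject", "from"}
BODY_TAG = re.compile(r"<body[^>]*>", re.IGNORECASE)

def move_headers_into_body(eml_text):
    """Keep the To/Date/Subject/From lines and re-insert them right after <BODY>.

    All other header lines are dropped. If no BODY tag is found, the text is
    returned unchanged rather than guessed at.
    """
    # Assumption: the first blank line ends the header block.
    header, _, rest = eml_text.partition("\n\n")
    kept = []
    for line in header.splitlines():
        name = line.split(":", 1)[0].strip().lower()
        if name in KEEP:
            # Escape < and > so "Name <addr>" survives display in a browser.
            kept.append(html.escape(line.strip()) + "<br />")
    match = BODY_TAG.search(rest)
    if match is None:
        return eml_text
    block = "\n" + "\n".join(kept) + "\n<br />\n"
    return rest[:match.end()] + block + rest[match.end():]

if __name__ == "__main__":
    sample = (
        "To: x@example.com\nDate: Sun, 18 Dec 2011 10:00:00 -0500\n"
        "Subject: Hi\nFrom: Ray <y@example.com>\n"
        "Content-type: text/html\nMIME-Version: 1.0\n\n"
        "<html><BODY>Hello</BODY></html>\n"
    )
    print(move_headers_into_body(sample))
```

Messages with other kinds of headers would need their own handling, so a script like this would only be suitable for one batch at a time.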

As I thought about it some more, I thought that possibly a better strategy would be to write a macro that would go down to the <BODY> tag and then search upwards for the desired tags (From, To, etc.) and delete everything else.  And then, as I thought about it still more, I realized that all of the above were beyond my present Emacs ability.  I would get there ... but not today.

1 comment:

raywood

A more recent post updates some aspects of this one.