
Friday, June 22, 2012

Finding and Cleaning Up EMLs That Display HTML Codes as Text

I had a bunch of email (EML) files scattered around my hard drive.  Some of them, I noticed, were displaying a lot of HTML codes.  For example, when I opened one (using Thunderbird as the default EML opener), it began with this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0">
<TITLE>RE: Scholar Program</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/rtf format -->

I was not sure how that happened.  Apparently I had run these EMLs through some kind of conversion process, perhaps after renaming them to be .txt files.  Whatever the origin, I wanted to eliminate all those HTML codes and wind up with a plain text file, probably saved as a PDF.  This post describes the steps I took to achieve that outcome.

Finding the Offending Files

As I say, the files containing this text were scattered.  Initially, I did a search for some of the text shown above (specifically, for "<!DOCTYPE HTML PUBLIC") in Copernic.  (I assume any tool capable of searching for text within files would work for this purpose.)  I thought maybe I would just copy and paste the lot of them from Copernic to a separate folder in Windows Explorer, where I could work on them in more detail.  That approach failed because Copernic did not allow me to select and move multiple files to other folders.  Moreover, Copernic did not display them with their actual filenames; rather, it showed the title indicated in the HTML "<TITLE>" line (see example above).

It was probably just as well.  Moving them in bulk from Copernic would have lost the indications of the folders where they were originally located.  The better approach, I decided, would be to use the command line and batch files to identify their source folder, move them to a single folder where I could work on them, and then move the resulting, cleaned-up files back to the folders where the originals had come from.

So the first thing I needed was a way to locate the files to be cleaned up.  I decided to use a batch command for this purpose.  I could have searched for every file (or just every EML file) that contained any HTML codes.  For that purpose, a search for "</" might have done the trick.  But then I decided that there could be a lot of HTML codes floating around out there, in various files, for a lot of different reasons; and for present purposes I didn't need to be trying to figure out what was happening in all those different situations.  So instead, I searched for the same thing as before:  "<!DOCTYPE HTML PUBLIC."  To do that, after several false starts, I tried this command:
findstr /r /m /s "<!DOCTYPE HTML PUBLIC" D:\*.eml > D:\findlist.txt
It produced a dozen "Cannot open" error messages.  The reason seemed to be that the filenames for those files had funky characters (e.g., #, §).  Also, Findlist.txt contained the names of files that did not seem to have the DOCTYPE text specified in the command.  DOCTYPE may have appeared in attachments to those files, but I didn't want to be flagging that sort of EML file.  So despite a number of variations with FINDSTR and several Google searches, I gave up.  I returned to Copernic, searched for the DOCTYPE text (in quotation marks, as shown above), and moved them manually.  Copernic had a convenient right-click Move to Folder option, so that helped a little.  So now, anyway, despite the imperfections of the process, I apparently had the desired EMLs in a single folder.  I would just have to re-sort them back to where they belonged manually.

But I still wasn't sure that everything in that folder was problematic.  Basically, I needed to see what the EMLs looked like when they were opened up.  Ideally, I would have just clicked a button at this point to convert them to PDF and merge them into a single document, so I could just flip through and identify the problem emails.  But I was having problems in my efforts to print EMLs as PDFs.  As a poor second-best, I manually opened them all (again, using Thunderbird as my default EML opener), selected the ones needing repair in Windows Explorer, and moved them to a separate folder.  To open them, I just did a "DIR /b /a-d > Opener.bat" and modified its contents, using Excel, so that each one started and ended with a quotation mark (actually, CHAR(34)) -- no other command needed -- and then ran Opener.bat.  Somehow, this failed to crash my system.
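For clarity, Opener.bat ended up as nothing more than a list of quoted filenames, one per line (the names here are invented); running it told Windows to open each file with its default EML handler:
"Message One.eml"
"Message Two.eml"
"Message Three.eml"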

Cleaning Up the Files

After verifying that most of them looked bad (and removing the others), I made copies in another folder, and renamed the copies to .TXT extensions using Bulk Rename Utility.  Now I could edit them as text files.  My plan was to store up a set of standard search-and-replace items, mostly replacing HTML codes with nothing at all, so as to clean up these files.

I had previously decided on Emacs as my default hard-core text editor, and had taken some first steps in re-learning how to use it.  The task at hand was to find advice on how to set up before-and-after lists of text strings to be replaced.  It was probably something I could have done in Excel, but I might have had to cook up a separate spreadsheet for each file, and here I was wanting to modify multiple files -- dozens, possibly hundreds -- in one operation.  Now, unfortunately, it was looking like Emacs was not going to be as naturally adapted to this task as I had assumed.  After a couple of tries, I found a search that did bring up a couple of solutions to related problems.  But those solutions still looked pretty manual.  Was there some more tried-and-true tool or method for replacing multiple text strings in multiple files?

A different search led to HotHotSoftware, which offered a tool for this purpose for $30.  A video seemed to demonstrate that it would work.  But, you know, $30 was more than the files were worth.  Besides, I wouldn't learn anything useful that way.  ReplacePioneer ($39, 21-day trial) looked like it might also do the job.  A thread offered a way to do something like it in an unspecified language, presumably Visual Basic.  Another thread offered an approach in sed.  Another way to not learn anything, but also not to spend $30, was to try the free TexFinderX.  Other free options included Nodesoft Search and Replace and Replace Text.

I tried TexFinderX.  In its File > Add Folder menu pick, I added the list of files to be changed.  I clicked the Replacement Table button, but did not see the Open Table Folder button shown on the webpage.  The ReadMe file seemed to say that a new replacement table would appear in the list only after being manually created in the TFXTables subfolder.  They advised using an existing table to create a new one.  As I viewed their "Accented to None - UTF8.txt" replacement table, I recalled looking into character replacement using Excel formulas.  The specific point of comparison was that I had discovered, in that process, that people had invented various character conversion tables that might be suitably implemented with TexFinderX.

But for my own immediate purposes, I needed to see if a TexFinderX replacement table would accept a whole string of characters, to be replaced by nothing or, say, a single space.  I was hoping that what I was seeing, there in that "Accented to None" replacement table, was that the "before" and "after" columns were tab-delimited -- that, in other words, I could enter a whole long string, hit the tab key, and then hit the spacebar.  I tried that, first saving the "Accented to None" table under the name of "Remove HTML Codes," and then entering "<!DOCTYPE HTML PUBLIC "-//W3C//DTD W3 HTML//EN">" (without the outside quotation marks, of course) and hitting Tab and then Space.  I did this on what appeared to be the first replacement line in that "Accented to None" file, right after the line that said /////true/////, as guided by the ReadMe.  I hit Enter at the end of that line, and deleted everything after it, removing all the commands they had provided.  I also changed the top lines, the ones that explained what the file was about.  I saved the file, went into the program's Replacement Table button, and there it was.  I selected it and clicked Apply.  On second thought, I decided to try it on just one or two files, so I emptied out the list and added back just a couple of files.  Then I ran it.  It looked like it worked.

I proceeded to add all kinds of other HTML codes to my new Remove HTML Codes replacement table, testing and running and removing more unwanted stuff.  I found that it was not necessary to hit Tab and then Space at the end of each line that I wanted to remove; it would remove anything that was on a line by itself, where no other tab-delimited text followed it on the same line.  So, basically, I could copy and paste whole chunks of unwanted text into the replacement table, and it would be removed from any files on the list that happened to contain it.  It seemed best not to add too many chunks at once, lest I be repeating the same lines:  run a few, after eyeballing them for duplication, and then see what was left.  It appeared that I could add comments, on these lines in the replacement table, by again hitting Tab after the "replace" value on the line.

I added back some of their original items (modified) to the replacement table.  These included the replacement of three spaces with two (which I might run several times to be thorough); the replacement of a Space-CR (Carriage Return) combination with a simple CR (using space-<13> tab <13> to achieve that, and apparently doing the same thing also with <10> in place of <13>).  I tried replacing three CRs with two, using <13><13><13> on the same line, but it didn't work.  The answer to that seemed to be to replace three pairs of <13><10> with two.  I discovered that the conversion process that had mangled these files originally had placed different parts of HTML code sequences on different lines, so I had to break them up into smaller pieces -- but not too small, because I didn't want to be accidentally deleting real text from my emails that happened to look similar to these HTML codes.
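By this point, my Remove HTML Codes table contained lines roughly like the following.  This is a reconstruction, not TexFinderX documentation:  <tab> stands for a real tab character separating the search text, the replacement text, and an optional comment, and a line with no tab after it is simply deleted wherever it occurs:
/////true/////
<!DOCTYPE HTML PUBLIC "-//W3C//DTD W3 HTML//EN">
<!-- Converted from text/rtf format -->
 <13><tab><13><tab>strip the space before a carriage return
<13><10><13><10><13><10><tab><13><10><13><10><tab>collapse three line endings into two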

I basically worked through all the codes that appeared in one email, and then started in on those that remained in the next after applying my accumulated rules to it, and so forth.  After working through the first half-dozen files in the list, I skipped down and ran the accumulated corrections against some others.  Running it repeatedly seemed to clear up some issues; possibly it was able to process only one change per line per run.  I realized that it would probably not produce perfect results across all cases.  It was succeeding, however, in giving me readable text that had previously been concealed beneath a mountain of HTML codes.

I had noticed that the program took a little longer to run as I added more rules to its replacement table.  But this did not seem to be due to file processing time:  the time did not grow far longer when I added far more files to the list.  It was still done within a minute or so in any case.  Apparently it was just reading the instructions into memory.

The excess (now blank) lines in the files were the slowest to remove.  I ran TexFinderX against the whole list of files at least a half-dozen times, adding a few more codes with the aid of additional spot checks.  Unless I was going to check every individual file for additional lingering codes, that appeared to be about as far as TexFinderX was going to take me in this project.

Cleaning Up the Starts and Ends of Files

href="http://raywoodcockslatest.blogspot.com/2012/03/choosing-emacs-as-text-editor-with.html" target="_blank">previouslyused Emacs to eliminate unwanted ending material from files.  Now I wanted to use a similar process on these files.  I also wanted to see if I could adapt that process to remove unwanted material elsewhere in the files.

I had not previously noticed that most if not all of these emails had originally included attachments.  As such, they included certain lines after their text, apparently announcing the beginning of the attachment portion.  These lines included indications of Content-Type, Content-Transfer-Encoding, and Content-Disposition.  These seemed like good places to identify the start of ending material to delete, for purposes of printing a cleaned-up message portion by itself.  I now saw that I had made things more difficult for myself by including references to some Content-Type and Content-Transfer-Encoding lines in my list of items to remove in TexFinderX.  I had not removed Content-Disposition lines, however, so -- as in the previous use of Emacs -- those would be my focus.

Having already done the initial setup of GNU Emacs as described in the previous post, I set forth to modify the process that I had used previously.  After making a backup, the summary version of those steps, as modified, went like this:
  • Start Emacs.  Open one of the post-TexFinderX emails.  Hit F3 to start macro recording.  C-End (that is, Ctrl-End, in Emacs-speak) to go to the file's end.  Hit C-r and type "Content-Disposition" to back up to its last occurrence of Content-Disposition.
  • At this point, modify the previous approach to back up a bit further, in search of the boundary line just preceding the Content-Disposition line.  I could have done this by hitting C-r and typing "----------" to find that boundary line, but now I saw that my TexFinderX replacements had deleted that, too, from some of these emails.  So instead, I just hit the Up arrow three times, hoping that that would take me to a point before most of the ending material.
  • Hit C-space to set the mark.  C-End.  Del.
The macro was still recording; I wasn't done.  The preceding steps did take care of the ending material in that particular file.  (As before, it was essential to avoid typographical errors, which would terminate macro recording or worse.)  But now, how about the unwanted starting material? I hadn't done this particular operation before, but it seemed straightforward enough.  I had to use C-Home to get to the start of the file.  Then -- since I had, again, deleted the objectionable boundary lines in some of these emails -- I had to search for the last surviving message header field.  In the case of the first email I was looking at, which I believed was probably the most thoroughly scrubbed, that last surviving field was Message-ID.  So I went through several additional but similar steps to clean up the start of the email and finish the task:
  • C-s to search for Message-ID.  Then C-e to go to the end of that line, and right-arrow to go to the start of the next line.  C-Space to set the mark, C-Home, and then Del.  That was as much as I could do with this particular email; it was clean, though not ideally formatted.
  • C-x C-s to save the file.  F4 to end the macro recording.  C-x C-k n Macro1 Enter (to name the macro to be Macro1).  C-x C-k b 1 (to bind the macro to key 1).
  • C-x C-f ~/ Enter (to find my Emacs Home directory).  In my case, Home was  C:\Users\Ray\AppData\Roaming\.emacs.d.  I went there in Windows Explorer and created a new text file named _emacs, with no extension.  This was my init file.
  • From the Emacs menu:  File > Open File > navigate to the new _emacs init file > select and open _emacs.  Using the Meta (i.e., Alt) key, I used M-x insert-kbd-macro Enter Macro1 Enter.  This hopefully saved my macro to my init file.  C-x C-c to save and quit Emacs.  A quick look with Notepad confirmed that there was something in _emacs.
  • Restart Emacs.  Open another of these text emails.  Test my macro by typing C-x C-k 1.  I got "C-x C-k 1 is undefined." I killed Emacs and, following advice, in Windows Explorer I renamed _emacs to be init.el and tried again.  Still undefined.  Since _emacs had worked in my previous session, I decided that the advice about init.el might be oriented toward Unix rather than Windows systems, so I changed it back to _emacs.  In the Emacs menu, I went to File > Open File > navigate to _emacs > open _emacs.  I used C-x 2 to split the window.  _emacs appeared in both panes.  In the top pane, I went to Buffers > select the text file to be changed.  (Apparently it was listed as one of the available buffers because I had already opened it.)  So now I was viewing the macro in the bottom pane and the email file in the top pane.  I selected the top pane and tried C-x C-k 1 again; still undefined.  I found other advice to just use M-x Macro1.  That worked.  The macro ran in the top pane.
The macro didn't do such a great job of cleaning this second file.  I would have to return to that later.  For now, the next step was to figure out how to run the macro automatically on all the emails.  Meager results from a search presented the possibility that people did not commonly do this sort of thing.  A refined search led to further discussion suggesting that I should be searching for information on multiple buffers rather than multiple files.  That innovation provoked the side question of whether perhaps jEdit was better than Emacs for such purposes but, once again, Emacs seemed better.  Still another search led to Dired, which would apparently allow the user to conduct certain operations on the files listed in a directory.  We were getting closer.  I found someone who was feeling my pain, but without a solution.

A StackOverflow discussion suggested that I might want to begin a Dired approach by loading kmacro.  I had no idea of how to do this.  An Emacs manual page seemed to think that kmacro was already part of Emacs.  I decided to try to follow the StackOverflow concepts without special attention to kmacro preliminaries.  The first recommended step was to go to the top of my Dired buffer.  This, too, was a mystery.  Another Emacs manual page told me to use C-x d to start Dired.  In the bottom line of the screen, that displayed the name of the directory containing the emails.  I didn't know what else to do, so I hit Enter.  Apparently that was just the right thing to do:  it showed me a directory listing for that folder.  It would develop, eventually, that the fast way to get it to show that directory was to use the menu option File > Open File to navigate to that directory and open a file there.

Now the StackOverflow advice was apparently to move the cursor to the first file in that list (which is where it already looked like it might be) and hit F3 to begin recording a keyboard macro.  Then hit Enter to visit the file.  Then M-x kmacro-call-ring-2nd.  But at this point it said, "No keyboard macro defined."  So kmacro was working, but on this command Dired was looking for a previous keyboard macro, not for an already saved one.  I used C-x k Enter to close the email that I had opened.  Now I was back at the Dired file list.  I hit C-x 2 to split the window, so maybe I could see more clearly what was going on.  With the cursor on the first target email in the top pane, I hit Enter to visit it again, then M-x Macro1 Enter.  That seemed to be the answer, sort of:  the bottom row said, "After 0 kbd macro iterations: Keyboard macro terminated by a command ringing the bell."  So the macro did try to run.  Adventures in the previous post suggested that this error message meant the macro failed to function properly, and I believed I knew why:  this was the email that I had already edited.  I had already removed, that is, the stuff that the macro was searching for, starting with the Content-Disposition line.

Time to try again.  With the top pane (displaying the email message) selected, I hit C-x k Enter to close it.  Then I moved the cursor to (i.e., mouse-clicked on) an email on which I had not yet run Macro1.  There, going back to the (modified) StackOverflow advice, I hit F3 to start recording a keyboard macro; I hit Enter to visit the file; then M-x Macro1 Enter.  It ran without an error message.  The email was showing in both top and bottom panes, so evidently I had not yet mastered the art of pane.  StackOverflow said C-x o to switch to the other buffer.  This just switched me to the other pane; I was supposed to be seeing the Dired file list.  With the keyboard macro still recording, I tried C-x k Enter to close the email.  Now the bottom pane, where I was, had the cursor flashing on the wrong line.  So I hit C-x o again, followed by a tap on the down arrow key, to take me to the next file to be processed.  That was the end of the steps that I wanted my new keyboard macro to save, so I hit F4.  StackOverflow said that now I had to hit C-u 0 C-x e to run the keyboard macro on every file in the list.  But that command sequence only opened the next file and ran Macro1 on it.  I hit C-x k Enter to close.  I was back at the Dired list.  The cursor did not advance to the next line; Macro1 did not run automatically.

I thought maybe my errors in that last try screwed up the keyboard macro, so I tried recording it again:  F3; cursor on the target email; Enter to visit that file; M-x Macro1 Enter to run the macro; Ctrl-x k Enter to close the email; down-arrow to select the next email in the list; F4 to close the keyboard macro; C-u 0 C-x e to run it.  No joy:  I still had to close the file and start the next one manually.

By this point, a different approach had occurred to me.  If I could open all the target emails at once, I would only have to hit keys to run Macro1 and then close the changed file:  the next one would then be there, ready and waiting for Macro1.  I decided to try this.  As advised, with an email already opened in my target directory (via menu pick -- see above), so as to tell Emacs where to look, I used C-x C-f *.txt to open all of those emails. (As noted above, I was working on EMLs that I had mass-renamed to be TXT files.)  That seemed to work.  The first ones visible to me were those at the top of the list, on which I had already run Macro1.  I closed those.  I couldn't tell, from the Buffers menu pick, how many files remained opened.  I could see that their timestamp would change in Windows Explorer after Emacs was done with them, so presumably I would be able to check which ones I had run Macro1 on.  I made a mental note to make at least some kind of change in each file before closing it, so as to assure myself that there was no further need to work it over with Macro1.

So now I was looking at the first file that had not yet been caressed by the loving hand of Macro1.  I wondered:  can I define a keyboard macro to save the steps of running Macro1 and then closing the file?  I tried:  F3, M-x Macro1 Enter, C-x k Enter, F4.  To execute that last defined keyboard macro, I used C-x e.  It changed the file as desired -- that is, apparently it ran Macro1 -- and it also seemed to be saving the changed file, but it did not close the file.  In other words, I had reduced the required number of keystrokes down to C-x e, C-x k Enter.  That was what it took to run Macro1 and then close a file.  Not bad, but could I do better?

The problem -- for both this approach and the Dired approach (above) -- seemed to be that the macros were not saving the C-x k Enter sequence.  A search seemed to indicate this could be another difficult problem to solve.  I was running low on time for this project, so I had to shelve that, along with the ensuing question of whether I could bind this C-x e C-x k Enter sequence to a function key.  

Instead, I just went plodding through that sequence for these many files.  In some cases, the scrollbar at the right showed me that there was a lot of extra material that I had to delete manually, usually from the ends of the emails.  Saving after these additional edits required a C-x C-s Enter before the C-x k Enter.  It was also handy to know that C-/ was the undo key.

Further Cleanup

When I was done running Macro1 on all those files, I saw that Emacs had created backup copies, with a .txt~ extension.  I sorted by file type in Windows Explorer and deleted those.  Also, while going through the process, I had noticed a number of files that were short and unimportant, and whose attachments did not interest me.  So I was able to go through the list and remove those to a "Ready to PDF" folder.  These steps reduced the number of files on which I might want to perform further operations.
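A single batch command could have cleared out those backups just as well (the folder name here is invented):
del "D:\EmailCleanup\*.txt~"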

While looking at those files in Windows Explorer, I noticed that some were much larger than others.  These, I suspected, included some whose attachment sections had not been completely eliminated by the macro, perhaps because they had more than one attachment.  I opened these in Notepad and eliminated material that did not contribute to the intelligible text of the email.

In some of the remaining files, there were still a lot of HTML codes and other material that would interfere significantly with an attempt to read the contents.  It seemed that the spot checks I had conducted in TexFinderX had not brought out all of the things that TexFinderX could have cleaned up.  I restarted TexFinderX, added more codes to the list of things to remove, and ran it some additional times on the files remaining in that folder.  That didn't continue too long before I realized that there could be an endless number of such codes and variations.

The next step was to return to Emacs.  This time, I was looking particularly for individual lines that could safely be deleted.  This was not so much a concern with HTML codes, though there might be some of that too; it was more a concern with email headers, boundary lines, and other items that would vary from one email to the next, could therefore not be readily added to a TexFinderX replacement list, and yet could appear repeatedly within a single email.  For example, each of the following lines appeared within a single email:

--===============3962046403588273==
boundary="----=_NextPart_000_002A_01C69314.AD087740"
------=_NextPart_000_002A_01C69314.AD087740

Moreover, variations on those themes recurred throughout that email, with quite a few of each.  So I could write an Emacs macro to search for a line beginning with the relevant characters, select that entire line, and delete it.  I wouldn't have to know which numbers appeared on different variations of these lines, as I would if I were using TexFinderX.

The problem here was that there were quite a few different kinds of lines to remove.  In addition to the types just shown, there were also email header lines that would normally not be visible, but that had become visible in the original mangling of these files, and there were also various Content-Description and Content-Disposition and Content-ID and Content-Location lines.  I would have to write an Emacs macro for each.  I could write one macro to run them all, but it would terminate as soon as it failed to find the next requested line; and since these sorts of lines varied widely from one email to another, it was quite likely that such a general macro would be terminating prematurely more often than not.  If I knew how to bind macros to individual keys, it might not be horrible to go down the list and punch the assigned function (or Ctrl-Function, Alt-Function, etc.) keys, one at a time, reiteratively for each of these many email files.  But that seemed like a lot of work for a fairly unimportant project.  A better approach would have been to write a script to handle such things, but my chosen scripting language for this purpose, Perl, had one significant drawback:  I had not learned it yet.  I had been meaning to, for about 20 years, and I knew that eventually the opportunity would arrive.  But that day was not today.
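Looking back, one script-free alternative might have been to lean on FINDSTR again, this time using /v (print only the lines that do not match) and /b (match at the beginning of the line).  This is just a sketch, with an invented filename, and not something I tried at the time:
findstr /v /b /c:"Content-Description" /c:"Content-Disposition" /c:"Content-ID" /c:"Content-Location" /c:"------=_NextPart" /c:"boundary=" "0001.txt" > "0001-clean.txt"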

I concluded that my cleanup phase for these emails was finished.  If I really needed to go further with it, I could convert them from PDF back to text and have at it again, some fine day.  If I had really intended to do that, I would have saved a list of the relevant files at this point.  But for the time being, I needed to get on with the next part of the project.

Converting Emails to PDF

I had previously used "Notepad /p" to convert a set of TXT files, like these emails, to a set of PDFs.  The basic idea was to make a list of files and then use Excel to convert those file paths and names (as needed) to batch commands.  I used that same approach here, making sure to set the PDF printer operate with minimal dialog interruptions.  This produced PDFs with "Notepad" at the end of their names.  For some reason, Bulk Rename Utility was not able to remove that; I had to use Advanced Renamer instead.

Converting Attachments to PDF

As noted above, most of these troublesome emails had attachments.  I now had, in a folder, only those emails (in .txt format) whose attachments I wanted to see.  Using a DIR command as above, I did a listing of those .txt files.  I put that list into Excel and modified it to produce batch commands that would move the EMLs of the same name to a separate folder.  Then, in Thunderbird, I created a new local folder.  With that folder selected, I went into Tools > ImportExportTools > Import eml file.  I navigated to the folder containing the EMLs whose attachments I wanted to see, selected them all, and clicked Open.  The icons indicated that all did have attachments.
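The move commands generated by that spreadsheet were plain MOVE lines, one per email (the names and destination folder here are invented):
move "Message One.eml" "D:\EMLs with attachments"
move "Message Two.eml" "D:\EMLs with attachments"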

Now, having configured Thunderbird's AttachmentExtractor add-on to generate filenames that I could recognize and connect with specific emails, I selected all those newly imported EMLs, right-clicked on them, and chose Extract from Selected Messages to (0) Browse.  I set up a folder that was not too many levels deep, for fear that some of these attachments might already have long names that could cause problems.  AttachmentExtractor went to work.  When it was done, I deleted that folder in Thunderbird, so that I would not have a problem of confusing duplicates of EMLs that had already caused me enough grief.

Then, in Windows Explorer, I sorted the extracted attachments by Type.  I began the process of converting to PDF those that were not already in PDF format.  Many of these were Microsoft Word documents.  I had already worked out a process that would automate the conversion of Word docs to PDF.  I moved these files to another workspace folder for clarity, and after making the advisable adjustments to my PDF printer, I applied that process to these files.

Word had problems printing a number of these Word docs.  It crashed repeatedly, during this process, whereas it had sailed right through other stacks of docs that I had converted to PDFs by using the same techniques.  It did produce some PDFs.  I looked at those, to make sure they turned out OK, and then I had to do a DIR /a-d /b *.pdf > successlist.txt in the output folder to see which docs had been successfully PDFed, and then convert successlist.txt into a batch file full of commands to delete the corresponding DOCs, so that I could try again with the DOCs that didn't convert properly the first time.  Before re-running the doc-to-pdf conversion batch file, I opened one of the failed DOCs and printed it to PDF.  That went fine, as a manual process.  So apparently it was not, in every case, a problem with the file.  Ultimately, I used OpenOffice Writer 3.2 and was able to print the remainder manually, using just a few keystrokes per file, with no problems.
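For the record, that success-list round trip looked roughly like this, with an invented filename:  the DIR command ran in the output folder, and Excel then turned each resulting "Name.pdf" line into a command deleting the corresponding DOC:
dir /a-d /b *.pdf > successlist.txt
del "Message One.doc"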

Other extracted attachments were text files.  At this point, I had two ways of dealing with these.  On one hand, I could have used the same process as I had just used with the Word docs, after changing the command used for .doc files to refer instead to .txt files.  I did start to use this approach, but ran into dialogs and potential problems.  On the other hand, I could have used the approach of printing to Notepad, as I had used with the emails themselves (above).  Before I got too far into this task, though, I noticed that every one of these text files had names like ATT3245657.txt.  They also all originated from the same source.  I examined a handful of these attachments and decided I could delete them all.

Some extracted attachments were image files -- JPG, GIF, PNG, BMP.  I also had a dozen attachments without extensions.  I opened the latter in IrfanView.  I believe there was an IrfanView setting that allowed it to recognize, as it did, that some of these were actually image files, and to offer to rename them (as PNGs or whatever) accordingly.  On the other hand, as I looked through these files, I saw that some of the GIFs were animations.  Excluding those, I now had a list of what appeared to be all the attachments that should be treated as image files.  I used IrfanView's File > Batch Conversion/Rename option to convert these to PDF.

There were a few miscellaneous file types.  For videos, I just took a screenshot in the middle and used that as an indication of what the original attachment had been.  One alternative would have been to use something like Shotshooter.bat to produce multiple images conveying a sense of the video's progression, and then combine those images in a single PDF.

Combining Email and Attachment PDFs

Now I had everything in PDF format.  I used Bulk Rename Utility to rename emails and attachments so that, when combined into one folder, each email would come before its associated attachments (if any), and the difference between the two would be readily visible.  I combined the files and attachments into one folder and made a list of the files using DIR (above).

Now the goal was to combine the emails that did have attachments with their accompanying attachments.  There were probably too many of these to combine them manually, one set at a time, using Acrobat or something like it.  I had previously worked out a convoluted approach for automating the merger of multiple PDFs (produced from multiple JPGs), using pdfSAM.  Discussion on a SuperUser webpage and elsewhere suggested that pdftk and Ghostscript were alternatives.  The instructions for Ghostscript looked more complex than those for pdftk, so I decided to start with pdftk.

I downloaded and unzipped pdftk.  As advised, I copied the two files from its bin folder (pdftk.exe and libiconv2.dll) into C:\Windows\System32.  I opened a command prompt in some other folder, at random, and typed "pdftk --help."  This was supposed to give me the documentation.  Instead, it gave me an error:
pdftk.exe - System Error  The program can't start because libiconv2.dll is missing from your computer.  Try reinstalling the program to fix this problem.
I moved the two files to C:\Windows and tried again.  That worked:  I got documentation.  It scrolled on past the point of recovery.  Typing "pdftk --help > documentation.txt" solved the problem, but ultimately it didn't seem to give me anything more than already existed in pdftk's docs subfolder.  The next step was to put pdftk to work.  It would apparently allow me to specify the files to combine, using a command of this form:
pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
My problem was that, at least in some cases, the filenames I was working with were too long to fit on a single line like that, one after the other.  I decided a solution would be to take a directory listing, put it into Excel, and use it to create commands for a batch file that would rename the emails and their accompanying attachments, with names like 0001.pdf.  I would need to keep the spreadsheet for a while, so as to know what the original filenames were.  The original filenames were my guide as to what files needed to be combined together.  For this purpose, with one of the original filenames in spreadsheet cell A1, I put the ascending file numbers in cells B1, B2 ... (i.e., 1, 2, ...) and then, in cell C1, I put =REPT("0",4-LEN(B1))&B1&".pdf".  Finally, in cell D1, I put ="ren "&CHAR(34)&A1&CHAR(34)&" "&C1.  Then I copied the formulas from column D into Notepad, saved them as Renamer.bat, and ran it.
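Renamer.bat thus consisted of commands along these lines (the original filenames are invented):
ren "2006-05-12 1430 RE Scholar Program.pdf" 0001.pdf
ren "2006-05-12 1430 RE Scholar Program - attachment.pdf" 0002.pdf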

After doing that renaming, I went back to the spreadsheet for guidance on which of these numbers needed to be combined.  Each original filename began with date and time.  With few exceptions, this was sufficient to distinguish one email and its attachments from another.  So I used =LEFT to extract that identifying information from column A.  Then, in the next columns, I used IF statements to compare the extract from one line to the next, concatenate the appropriate filenames with a space between them, and choose which concatenations I would be using.  Finally, I added a column to create the appropriate command for the batch file.  Instead of the 123.pdf output shown in the example above, I used the original email filename.  Where there were no attachments, pdftk would thus just convert the numbered PDF (e.g., 0001.pdf) back to its original name.
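So the finished batch file contained commands like these (again with invented names) -- the first merging an email with its two attachments, the second simply restoring the original name of an email that had none:
pdftk 0001.pdf 0002.pdf 0003.pdf cat output "2006-05-12 1430 RE Scholar Program.pdf"
pdftk 0004.pdf cat output "2006-05-14 0915 Budget question.pdf"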

I finished with spot checks of various files, with and without attachments, to verify that they had come through the process OK.  I was not happy with the remaining junk in the emails themselves, but at least I could tell what they were about now, and they had their attachments with them.  Pdftk had proved to be a much easier tool for this project than pdfSAM.  This had been an awful lot of work for not terribly much achievement on some not very important files, but at least I had finally worked through all of the steps in the PDF conversion process for Thunderbird emails with attachments.

Sunday, March 11, 2012

Printing Webpages as PDFs from the Command Line

I was looking for a way to print a bunch of webpages to PDF files from the command line. This page describes the search that, as before, brought me to wkhtmltopdf.

One approach, it seemed, was to use Pdf995 and Omniformat. I had been frustrated, last time I tried pdf995, but nearly a year had passed, and this was a different project. Maybe this time it would work. They seemed to want me to install pdf995 and then install Omniformat. Not being entirely sure which ones I would need, I installed a half-dozen programs from their webpages. They said Omniformat would include HTML2PDF995, which would permit command-line conversions among formats including HTML and PDF. So that sounded promising. Installation of Omniformat brought up an HTML page included in the program (evidently not available online) that said the command line syntax was like this: omniformat.exe [input file] [output format]. So in my example, it would look like this:

omniformat.exe http://www.cnn.com/Chinastory.html "png"
In that case, the Omniformat command wouldn't give the file the desired name, so I would have to add a command to do that. I tried it, just doing the part as shown for now. I got the error, "omniformat.exe is not recognized as an internal or external command, operable program or batch file." In other words, it wasn't part of the computer's path. I had to run this command from within the folder where omniformat.exe was installed. A search of my computer said that the folder in question would be "C:\Program Files (x86)\omniformat." So I ran that omniformat.exe command there. But it opened up the GUI and made me wait for maybe 20 seconds until it would open a session of Internet Explorer, so that it could display its adware; but then that failed with "An error has occurred in the script on this page." Same thing if I tried using a file on my computer rather than a webpage's URL in the command. It seemed that pdf995 was still not going to work for me.

A search led to Total HTML Converter which, for $50, promised to do exactly what I needed: convert webpages to JPG and possibly to PDF from the command line. There didn't seem to be a listing on CNET for Total HTML Converter. It got three stars from 21 users (3,582 downloads) on Softpedia. Fifty bucks for a three-star program ... hmm.

A Softpedia search for similar programs turned up Spire PDF Converter (rated 3.0 by four users; 1,057 downloads); HTML to PDF Converter (rated 3.6 by eight users; 6,353 downloads); 7-PDF Website Converter (rated 3.7 by 10 users; 1,746 downloads); HTML_ to PDF (rated 3.2 by 19 users; 2,225 downloads); and Gerolf Markup Shredder (rated 2.8 by 23 users; 1,816 downloads). Gerolf was the only one whose description said it could run from the command line. I checked the homepages of the others to see about them. Spire said nothing about it. Likewise HTML to PDF Converter, and 7-PDF. I wasn't sure about HTML_ to PDF, so I downloaded that and Gerolf. HTML_ to PDF gave me an unzipped folder with no executables; it looked like I would have to learn something about PHP programming to use it. Meanwhile, Gerolf's installation asked me if I wanted to install GMS, to which I said sure, go ahead. Then it gave me a dialog partly in German, to which I replied Ja. Next came an almost entirely German dialog that seemed to be asking where I wanted to unpack the installation files. Its Durchsuchen (Browse) button took me to a Temp folder, so I just clicked on that and said OK. Next, a dialog telling me to run gmsunzip.bat to install. Apparently I should have written down where I unpacked the files. Fortunately, Everything found gmsunzip.bat, so I did run it. I pressed the Whatever key to move past its first screen of information. It was starting to look like I should have chosen a more permanent location, so I went back and started over with the installation. Now I understood that its first dialog, referring to GMS, was of course referring to Gerolf Markup Shredder, and not to some other program; I just hadn't understood that it was asking me if I wanted to install the thing that I had just double-clicked on. So now I Durchsuched to a newly created folder called C:\GerolfHTMLtoPDF, and after the installation I went there and ran gmsunzip.bat. Unfortunately, at the end, I got a message indicating that this was an unsupported 16-bit installation that was incompatible with 64-bit versions of Windows. So I would have to run it in a Windows XP Virtual Machine. While thinking about that, I went back to HTML_ to PDF Converter. I took a closer look. The second script on the webpage seemed to be something that I might be able to just copy into Notepad, save as an HTM file, and double-click on. I tried that. No, it was going to require some PDF knowledge, though maybe not much. Now I noticed that Gerolf would not go away. It kept insisting on telling me, again and again, about the Unsupported 16-bit Application problem. I had to use Start > Run > taskmgr.exe. But, whoa, what's this? "Windows cannot find 'C:\Windows\System32\taskmgr.exe'." Had one of these foolish programs, or something else, screwed up my system? I could see that taskmgr.exe was indeed in the System32 folder. Hmm. Not clear what was happening. Eventually I found that a CMD window was running; I had to kill that to shut off the recurrent dialogs. But that didn't fix the problem with taskmgr.exe. Maybe a reboot would ... later.

I went back to my previous post on a somewhat similar problem. The most promising solutions there seemed to be to use PrintHTML, to print all linked files from an HTML page in Internet Explorer, or to use wkHTMLtoPDF. I shied away from wkHTMLtoPDF because it was so complicated. I installed PrintHTML and the DHTML Editing Control (required on some systems, evidently including mine, judging from error messages when I tried running PrintHTML without it), and then looked at its instructions. It seemed to be designed just to permit some tinkering (e.g., margin adjustments) while printing local HTML files, with no clear indication of how it would work with a webpage. I tried this command:
printhtml.exe file="https://www.nytimes.com"
(I had to run that command from within the folder where PrintHTML was installed.) It gave me a nearly blank page. It seemed that, basically, it was not designed to do what I needed. How about the approach of printing linked files from within Internet Explorer (IE)? The concept was that I could create an HTML page containing links to the webpages I wanted to print, and IE could be persuaded to print them all. I wasn't sure if they would print as one big PDF that I would have to split apart, but that seemed likely. In that case, the files wouldn't have the desired individual names. This tentatively seemed to be another case where the approach was designed for local files, not for webpages.

On this basis, I went back to wkHTMLtoPDF, as described in another post in this blog, posted at about the same time as this one, on the subject of Converting URL-Linked Webpages to PDF.

Thursday, January 26, 2012

Windows 7: HTML (MHT) Files: Batch Printing/Converting to PDF

I had a bunch of MHT files in a folder.  (MHT was apparently short for mhtml, which was short for MIME html.)  I produced these files in Internet Explorer (IE).  To do this in a recent version of IE, the approach would be to look at a webpage and hit Ctrl-S > Save as type > Web archive, single file (*.mht).  The MHT format would try to build everything on the screen into a single file, unlike the HTML formats (which would either save only the HTML text or create a subfolder to contain the images and other stuff appearing on the webpage).

Attempts to Print MHTs Directly

My goal now was to print those MHT files.  I had Bullzip PDF Printer set as my default printer, and its settings (the default, I think) would have it pop up a dialog for each file being printed, asking me what I wanted to call the PDF output.  This wasn't as slick as having a command-line PDF printer that would automatically print a file with a name specified on the command line, but I believed I had two options there.  One would be to change Bullzip so that it just printed without a dialog; the other was to hit Enter for each file and let Bullzip print the PDF with the default filename.  Either way, I could then come back in a second pass, using a batch file and/or Bulk Rename Utility to alter filenames as desired.

I actually would have had a one-pass command-line option, if I had been able to get PrintHTML to work with MHTs.  I was briefly hoping that maybe I could use PRN from the command line, but Francois Degrelle said PRN would only work with text files.  A PowerShell function would have been another possibility, if I had known how to proceed with something like that.  There also appeared to be some older approaches that could provide a good way to spend a huge amount of time on something that wouldn't work, for reasons I couldn't understand.

I ran a search and found a webpage that made me think that PDFCreator might be a more useful PDF printer than Bullzip, for present purposes and also for the future.  PDFCreator was favorably reviewed on CNET and Softpedia, so I downloaded and installed it.  But it didn't seem to be printing PDFs automatically from these MHTs.  It would just open the MHT in Microsoft Word, my default word processor, and then it would sit there.  So I didn't continue to try using PDFCreator for this project.

Then again, Bullzip did the same thing:  it opened the MHT in Word, and then stopped.  This happened even after I went into Bullzip's options and changed them to what seemed to be the most streamlined approach possible.  Word was resource-intensive; I couldn't very well open a hundred documents in it at once.  Not that that was an option anyway.  If I highlighted more than 15 MHTs in Windows Explorer, the right-click context menu wouldn't even give me a Print option.

Wordpad was less resource-intensive than Word, but it would open the MHT files as source code, same as Notepad:  not pretty.  I would also get the MHT opened in Word when I right-clicked on a couple of MHTs and selected "Convert to Adobe PDF."  (I got that option because I had Acrobat installed.)

The easiest way to just open the MHTs and print them manually, if I wanted to do that, seemed to be to select a bunch of them and hit Enter, and they would open in tabs in my web browser.  For some reason, they were opening in Opera, whereas I would have thought that Firefox would be the default, as it was for other kinds of web-type files.  I couldn't even open them in Firefox by doing File > Open inside Firefox:  somehow they would still open in Opera.  I could have uninstalled Opera and then tried again, if I really cared; but in any event I still wasn't getting an automated solution.

PDF via Internet Explorer > Print All Linked Documents

Diamond Architects suggested creating an HTML file that would have links to all of the HTML files in a folder, and then using Internet Explorer to print that one HTML file, using Alt-F > Print > Options tab > Print all linked documents.  The .mht files were obviously not .html files, but they contained HTML code.  So it seemed like the same approach would work either way; or, at worst, I thought I could probably just type REN *.MHT *.HTML in a command window opened in that folder, and mass-rename them that way.  I tried that.  It made a mess.  The files didn't look right anymore.  So I renamed them back to MHT.  (The easy way to open a command window in any folder was to go into Ultimate Windows Tweaker > Additional Tweaks > Show "Open Command Window Here."  With that installed, a right-click in Windows Explorer would open up that option.)

But anyway, to test the "print all linked documents" concept, I needed to create the HTML file containing links to all those individual files.  For that, I tried Arclab's Dir2HTML. But it didn't create links.  It just gave me a list of files.  If that was going to be the output, I preferred the kind of list I would get from this command:
DIR *.mht /a-d /b > dirlist.txt
That gave me a file, dirlist.txt, containing entries that looked like this:
File Name 1.mht
File Name 2.mht
To get them to function like links in an HTML file, I would have to change those lines so they looked like this:
<a href="One File Name.mht"</a>
<a href="Another File Name.mht"</a>
I could achieve that with a search-and-replace in Word, using ^p as the end-of-line character.  That is, I could search for ^p and replace it with this, including the quotation marks:
"></a>^p<a href="
That would put "</a> at the end of each line, and <a href=" at the start of the next.  Then I could paste the results back into dirlist.txt.  Note:  if smart quotes were turned on in Word, I would then have to do two additional search-and-replace operations, copying and pasting a sample of an opening and a closing smart quotation mark into Notepad's replace box, because smart quotes wouldn't work right.  Then I might have to manually clean up the first and last lines in dirlist.txt.  Another way to do this would be to paste the contents of dirlist.txt into Excel and massage them there.  (For Excel instructions, go to this post and search for CHAR(34).)  If I was going to do much of this, Excel would definitely be the way to go, because then I could just drop the new filenames into a column and let preexisting formulas parse them and output the HTML lines automatically.

That basically gave me an HTML file.  Now I would just have to add its opening and closing lines.  I wasn't sure what those should look like, so I right-clicked on some random webpage, selected "View Source" (an option that may not be available in all browsers, at least not without some add-ons; I wasn't sure), and decided that what I needed for an opening line would be "<!DOCTYPE html>" and the closing line should be "</html>" (without quotation marks), though I later realized that the latter was probably either unnecessary or incomplete.  I also needed a second line that read, "This is my file," because otherwise everything that I had done would create a completely blank-looking page, leaving me uncertain and confused.  So I added those lines to dirlist.txt, saved it as dirlist.htm, opened it in Internet Explorer (Ctrl-O or Alt-File > Open), and tried the Alt-F > Print > Options tab > "Print all linked documents" option mentioned above.  (Note that dirlist.htm still had to be in the same folder as the .mht files that I wanted to print.)
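Assembled, dirlist.htm looked roughly like this (the filenames are invented, and only two link lines are shown):
<!DOCTYPE html>
This is my file
<a href="File Name 1.mht"></a>
<a href="File Name 2.mht"></a>
</html>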

That worked, sort of.  It automatically gave me a boatload of .pdf files, and may I say it did so in a hell of a hurry.  Problem was, they were all blank.  It tentatively appeared that Bullzip and Internet Explorer were going to go through the motions of printing those linked files; but because I was dealing with MHTs instead of HTMs, they would passive-aggressively give me output with nothing inside.  So, like Columbus finding Haiti instead of Malaysia, I had figured out how to bulk-print HTML files, but that wasn't what I had told everyone I was trying to do.

Bulk Converting MHTs to HTML with mht2htm

Well.  Could I bulk-convert MHTs to HTMs and call it a day?  A search led to mht2htm.  I downloaded the Win32 versions (both GUI and command line), along with the Readme and the Help file.  Basically, it looked like I just needed to (1) copy mht2htmcl.exe into the folder containing my MHT files, (2) create a subfolder, there, called OutputDir, (3) edit dirlist.htm to comment out the non-file (i.e., starting and ending) lines, and then (4) do another couple of searches and replaces in dirlist.htm, so that my lines looked like this:
mht2htmcl "First File Name.mht" OutputDir
mht2htmcl "Another File Name.mht" OutputDir
According to the very brief documentation accompanying mht2htm, these commands would do the trick.  I made these changes, and then renamed dirlist.htm to be dirlist.bat, made sure it was in the folder containing the MHTs and mht2htmcl.exe, and ran it.  It didn't work.  I wasn't sure why not.  So I tried the GUI version instead.  Much easier, and it did produce something in the Output directory.  What it produced was a bunch of folders, one for each MHT file, with names like "First File Name_Files."  Each folder held a couple dozen files, mostly GIFs for the graphic elements of the HTM file.  The key file in each folder was called _0_start_me.htm.  If I double-clicked on that, it would open in Firefox (my default web browser), with a line near the top that said, "Click here to open page"; and if I clicked on that, I got a nice-looking webpage in Firefox.

So that was not fantastic.  Now, instead of opening MHT files one at a time in Word or a web browser, and printing from there, I would have to convert them to HTM so that I could dig into their separate folders and do the same thing with a _0_start_me.htm file.  It would probably be easier to print HTMs than it had been to print MHTs, but there was the problem that those _0_start_me.htm files did not have the original filename.  Fortunately, the file name had been preserved in the name of the folder created by mht2htm.  So I would have to use an Excel spreadsheet to produce printing or renaming commands that would rename the PDF version of the first _0_start_me.htm file to be "First File Name.pdf," and likewise for all the others.  But I wasn't ready to do that yet.

Reviewing How to Use wkHTMLtoPDF

So far, as discussed in a previous post, the best tool I had found for batch converting HTMs to PDFs was wkHTMLtoPDF.  Somewhat past the halfway point in that long post, in a section titled "Revised Final Step:  Converting TXT to HTML to PDF," I had worked out an approach for using wkHTMLtoPDF.  The first step, as I reconstructed my efforts from that previous post, was to install wkHTMLtoPDF.  That created a folder:  C:\Program Files\wktohtml.  wkHTMLtoPDF was a command-line program.  Windows would have to know where to look to find it.  To address that need, I copied everything from the C:\Program Files\wktohtml folder to a new, empty folder called D:\Workspace.  Now I could type a command referring to wkHTMLtoPDF, in a batch file or command window running in D:\Workspace, and the computer would be able to execute the command. I also created a subfolder, under D:\Workspace, called OutputDir.

Next, I went into a command window, running in D:\Workspace, and typed "wkhtmltopdf /?" to get a list of command options.  My previous post, interpreted in light of that command and a glance at wkHTMLtoPDF's manual, seemed to say that the command options that had worked best for me included "-s" to set the output paper size; options to set top (-T), bottom (-B), left (-L), and right (-R) margins (in millimeters); and --dpi (to specify dots per inch).  It seemed, then, that the command line that I would need to use, for each of the _0_start_me.htm files, would use this basic syntax: 
start /wait wkhtmltopdf [options] [input folder and HTM file name] [output folder and PDF file name]
I would run that command in the Workspace folder, where I had now placed the wkHTMLtoPDF program files.  With a command of that type, wkHTMLtoPDF would find the _0_start_me.htm file created by mht2htm (above), and would convert it to a PDF file saved in D:\Workspace\OutputDir.  The source folder and file names were pretty long in some cases, but this D:\Workspace\OutputDir part of the command was brief, so hopefully my full wkHTMLtoPDF command would not exceed any command line limits.  So now I was ready to try an actual command.  I made a copy of one of the folders created by mht2htm, renamed it to be simply "Test," and ran this command in D:\Workspace:
start /wait wkhtmltopdf -s Letter -T 25 -B 25 -L 25 -R 25 --minimum-font-size 10 "D:\Test\_0_start_me.htm" "D:\Workspace\OutputDir\Testfile.pdf"
That worked.  But, of course, the resulting Testfile.pdf was just a PDF of the HTML page that said, "Click here to open page."  I wouldn't get my actual MHT page in HTML format until I clicked on that link, in each of those _0_start_me.htm files, and the resulting HTML page would be open in Firefox, where I would still have to come up with a batch printing option to handle all of the tabs that I would be opening.  It still wasn't an automated solution.  I assumed that the approach of using Internet Explorer > Print All Linked Documents as above (but this time with HTMs instead of MHTs) would likewise give me webpages with that "Click here to open page" option. 

Trying VeryPDF HTML Converter

My immediate problem seemed to be that I didn't have a good way to automate the conversion of MHTs to HTMs -- a way that wouldn't give me that funky "Click here to open page" stuff from mht2htm.  My larger problem was that, of course, I didn't have a way to automate getting PDFs from those MHTs, which was the original issue.

The possibilities that I had developed so far seemed to be as follows:  (1) Forget automation; just print the MHTs manually, selecting 15 at a time and choosing the Print option, which would start 15 sessions of Word.  (2) Select and open them in Firefox or some other browser, which would open up 15 (or whatever number) of individual tabs, each likewise calling for manual printing as PDFs unless I could find a way to automate the printing of multiple browser tabs.  (3) Try to figure out why the Internet Explorer approach was giving me blank PDFs.  (4) Look again for something other than mht2htm, to convert MHTs to HTML.  (5) Play some more with the wkHTMLtoPDF approach, in case some automated solution emerged from that.

As I wrote those words of review, I wondered whether Windows XP might handle one or more of those alternatives differently. I had already installed Windows Virtual PC, with its pre-installed virtual Windows XP session; all I needed was to go in there and, if necessary, install programs in it.  But I hadn't encountered any specific indications that some program or approach had worked better in Windows XP, so I decided not to pursue this.

I thought I could at least search for some other MHT converter.  It suddenly appeared that, in my focus on PDF printers, I might not have done a simple search for an MHT to PDF converter.  That search, done at this point, led to novaPDF, a piece of commercial software that would apparently do the job.  But on closer examination, novaPDF did not seem to have a batch printing capability.  Another program, VeryPDF HTML Converter, came in a command line version whose basic syntax was apparently like this:
htmltools [options] [source file] [output file]
This syntax assumed, as with wkHTMLtoPDF (above), that htmltools.exe was being run in a folder, like my D:\Workspace, where the program's files would be present -- unless, again, the user wanted to fiddle with path or environment variable adjustments.  Typing just "htmltools" on the command line, or opening the accompanying Readme file, demonstrated that the program had lots of options.  I thought I might try just using it, to see if it worked at all, before fiddling with options.  So I copied the full contents of the VeryPDF program folder (i.e., several folders and 15-20 files, including htmltools.exe) to D:\Workspace, made sure Test.mht was there as well, opened a command window there, and typed this:
htmltools Test.mht TestOut.pdf
The command window gave me a message, "You have 299 time to evaluate this product, you may purchase a full version from http://www.verypdf.com."  I didn't find a reference to htmltools on their products webpage or on their list of PDF Products By Functions, and this particular message didn't give me another name to look for, so I wasn't sure whether I would be buying the right program.  A review of a couple of webpages eventually revealed that this was VeryPDF HTML Converter.  The GUI version, which I didn't want, would cost $59.  Sixty bucks to convert MHTs?  But it got better, or worse.  The command-line version was $399.  I guess while I was at it, I could ask them to throw in Gold Support for only $1,200 a year.  Beyond a certain level of ridiculousness, a casual user might be forgiven for considering the option of just running this puppy in a disposable virtual machine, if uninstalling and reinstalling didn't do the trick.  In all fairness, they seemed to be thinking of server administrators, not private home users.  And they did give us 300 free conversions.  Still, at prices like these, it would have been nice if that meant 300 conversions a year, not 300 for a lifetime.  They were basically persuading me to use the program once and then forget about it.

Anyway, the program ran for a few seconds and then claimed it had succeeded.  I looked.  TestOut.pdf definitely did exist, and it looked good.  No apparent need for any additional options.  I wondered if it would default to the same filename with a PDF extension if I just typed "htmltools Test.mht," without specifying TestOut.pdf, so I ran the command again with that alteration.  That worked.  I tried it once more, this time specifying a source folder and an output folder without a filename ("htmltools D:\Workspace\Source\Test.mht D:\Workspace\Output").  This time, it said, "Save to file failed!"  Its messages seemed to say that it found Test.mht without a problem.  Why wouldn't it write to Output?  Maybe it was trying to write a file called Output, when I already had a folder by that name.  I repeated the command, this time with a trailing backslash (i.e., "htmltools D:\Workspace\Source\Test.mht D:\Workspace\Output\").  Still failed.  And the bastards docked me anyway.  I was down to 296 free tries.  So what were we saying:  it could output a file without a need to specify a filename, but it couldn't output to another folder?  If all else fails, RTFM.  But the Readme.txt didn't contain any references to folders or directories.  Well, would it at least work if I specified everything (i.e., "htmltools D:\Workspace\Source\Test.mht D:\Workspace\Output\Test.pdf")?  Yes, it would.  So that was the answer:  I would have to work up my command lines in Excel (above) to include the full file and path names for both the source and the target.  With those commands in a batch file, I decided to give it a run with a couple dozen files, just to make sure, before blowing my remaining 295 budgeted conversions on a futile gesture.  It ran well.  I was set.  My fear that some commands might be too long was unfounded: the htmltools commands ran successfully with a command as long as 451 characters.  I converted the rest of these MHTs and then deleted them, and hoped never to see them again.
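For reference, the Excel side of that was trivial.  Assuming the full MHT paths from a DIR /s /b listing sat in column A, that the extensions were lowercase, and that each PDF was to land back beside its source MHT, a formula like this would produce each command:
="htmltools "&CHAR(34)&A1&CHAR(34)&" "&CHAR(34)&SUBSTITUTE(A1,".mht",".pdf")&CHAR(34)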

Technically speaking, the project was done.  If I needed more MHT conversions than I could accommodate within the limited private usage of VeryPDF's htmltools.exe, I would go back to the five options enumerated at the start of this last section of this post.  Since I already had all this stuff in mind, and my Excel spreadsheet was set to go, I ran a couple more lines:
DIR D:\*.mht /s /a-d /b > D:\MHTlist.txt
DIR E:\*.mht /s /a-d /b >> D:\MHTlist.txt
to see if I had any other MHTs on D or E.  (Note the double >> marks in the second line -- that says add to MHTlist.txt instead of overwriting it, if it already exists.  Of course, once I had the command set, I could just hit the Up arrow in the command window to bring the previous command back, after running it, and then use Home and left & right arrow keys to revise it.)  This gave me a file called MHTlist.txt, containing a list of additional MHTs that I thought I might as well convert to PDFs while I was at it.  For these, the command lines would produce a PDF back in the source folder.  Once those PDFs were created in the source folders, I used Excel (and could probably also have used Ctrl-H in Notepad) to do a DIR [filename].* >> listing (which would show both \Source Folder\File.mht and \Source Folder\File.pdf in the resulting dirlist.txt file) for each specific file that I had converted.  This produced a nice pair of entries for each filename (i.e., x.mht and x.pdf).  The process seemed to work.  Now I just needed one more go with Excel, to produce DEL lines that would get rid of the MHTs in the source folders.  One more check:  no MHTs left.  Project completed.

Saturday, April 23, 2011

Windows 7: Archiving Thunderbird Emails to Individual PDFs - Retry

I had a large number of emails in Thunderbird (an email program like Outlook, but open source freeware).  I wanted to export each of those emails to its own distinct PDF file with a filename containing Date, Time, Sender, Recipient, and Subject information in this format:

2011-03-20 14.23 Email from Me to John Doe re Tomorrow.pdf
In that example, I might ultimately eliminate the "from Me" part as understood, but of course other emails would be from John back to me, so for starting purposes I wanted all five of the fields just listed.  The steps I went through are described below.  There is a summary at the end of this post.

Recap and Development:  Converting Emails into EML Format Files with Preferred Filenames

So far, I had already worked through the process of exporting those emails to distinct EML files.  I had also used a spreadsheet to rename those EML files so that they would provide clearer and more complete information about the file's contents.  (I was using Excel 2003 for spreadsheeting.  OpenOffice Calc was now able to handle a million rows (i.e., to rename a million files), but it had not been stable for me.  One option, for those who had more than 65,000 EMLs and therefore couldn't work within Excel's 65,000-row limit, was to do part of the list at a time.)  This post picks up from there, summarizing a more streamlined approach to the steps described at greater length in the two previous posts linked to in this paragraph.

I had previously tried to begin with the Index.csv file exported from Thunderbird via ImportExportTools, but that had been a very convoluted and unsatisfactory process.  I did continue to use Index.csv, but my main effort was to work up a spreadsheet that would use and alter the filenames created when I exported EMLs from T-bird, also using ImportExportTools.  As described previously, I had developed some rules for automated cleaning of various debris from filenames, such as the underscores that ImportExportTools inserted in place of quotation marks and other characters.

To summarize the approach described in more detail in the previous post, I got the filenames from the folder where ImportExportTools had put them by using this command at the CMD prompt (Start > Run > cmd):  "DIR /b > dirlist.txt" and then I copied and pasted the contents of dirlist.txt into an Excel spreadsheet.  There I extracted the Date, Sender, and Subject fields from those filenames using Excel functions, including FIND, MID, TRIM, and LEN, all described in Excel's Help feature and in the previous post.  I also used Excel in a separate worksheet to massage the data on the individual emails as provided in Index.csv.
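For example, the date-and-time conversion (from the 19980102-0132 style of stamp to 1998-01-02 01.32) could be handled with a formula along these lines, assuming the raw filename, beginning with that thirteen-character stamp, was sitting in A2:
=MID(A2,1,4)&"-"&MID(A2,5,2)&"-"&MID(A2,7,2)&" "&MID(A2,10,2)&"."&MID(A2,12,2)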

The two worksheets did not produce the same information, and I needed them both.  One contained actual filenames, which I wanted to revise en masse to be more readable and to include the "To" field, which was contained in Index.csv.  Many of the things that ImportExportTools screwed up about the subject fields of emails, for purposes of CMD-compatible filenames (and going well beyond that), involved the underscore character.  Hence, the chief sections in the main worksheet (where I revised the data from dirlist.txt), going across the columns, were as follows:
Dirlist (raw EML filenames)
Date & time conversion (from 19980102-0132 to 1998-01-02 01.32)
Subject: Clean up starting & ending underscores of From names
Subject: Replace "Re_ " with "Reply re" in Subject field
Subject: Replace "n_t" and "_s" (as in "don_t" and "Mike_s") with apostrophes
Subject: Replace serial underscores: "_ _" becomes " - "
Subject: Replace "I_m" with "I'm" and "you_re" with "you're"
Subject: Replace underscore and space ("_ ") with hyphen (" - ")
Subject: Remove starting and ending hyphens
That accounted for the bulk of the needed changes in the Subject field, in the files I was working with.  I set these rules up to eliminate the first one, or in some instances two, occurrences of the underscore string in question.  Few emails contained more than that; for those few, leaving the additional underscores in place was acceptable.  There would be some predictable misfires of these rules, but they would generally improve the situation, and when dealing with a large number of EMLs that I didn't intend to rename manually, this was the best that I could hope for under the circumstances.
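Each of those Subject rules could be expressed as a SUBSTITUTE call, using the optional last argument to limit the replacement to the first occurrence.  A sketch, combining two of the rules and assuming the working text was in A2:
=SUBSTITUTE(SUBSTITUTE(A2,"Re_ ","Reply re ",1),"_ _"," - ",1)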

Then I used VLOOKUP to search for a match with the Index.csv-style Date and Time (e.g., 19980102-0132) data in the Index.csv worksheet, and also for a match with the Index.csv Date+Time+From combination.  (Sometimes the From field was necessary to distinguish two or more emails sent at the same time.  Because of the underscores and other oddities about the EML filenames, subjects were too different to compare in most cases.)  This identified precise matches between the two worksheets for about 80% of EMLs.
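The lookup itself would be the standard sort of thing -- roughly like this, assuming the Date+Time key was in A2 and the Index.csv worksheet held its own key in column A with the To field in column E (the column positions here are illustrative):
=IF(ISNA(VLOOKUP(A2,Index!A:E,5,FALSE)),"",VLOOKUP(A2,Index!A:E,5,FALSE))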

So now I was going to try using that same spreadsheet with another batch of emails exported from Thunderbird.  I exported the Index.csv and the EMLs, and set to work on the spreadsheeting process of reconciling their names and producing MOVE commands for a CMD batch file that would automatically rename large numbers of EMLs to be readable and to include data from the To field.

This time around, I did a first pass to bulk-recognize and batch-rename that first 80% of the EMLs.  The CMD command format was this:
MOVE /-y "Old Filename.eml" "Renamed\New Filename.eml" 2> errlog.txt
This renamed the old EMLs to the desired new EML filenames, put them into the Renamed subfolder, and gave me an error log to say what went wrong with any of the renames.  The error log wasn't very useful, so I stopped creating it in these commands.  What I had to do instead, to find out which EMLs had been successfully renamed, was to do a dirlist.txt for the Renamed folder, feed that back into the spreadsheet, and delete those lines that had executed successfully.  For about 15% of the emails, I could not automatically detect matches between data from Index.csv and actual files, so I wound up naming those files according to date, time, and sender only.  Finally, I got down to less than 1% of emails that I had to rename in a more manual fashion, mostly due to non-ASCII characters in their filenames.  For that, I used Bulk Rename Utility.
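The check for which renames had succeeded could itself live in the spreadsheet.  Assuming the Renamed folder's dirlist was pasted into column A of a worksheet called Renamed, and the intended new filename was in D2, something like this would flag the rows that were done:
=IF(COUNTIF(Renamed!A:A,D2)>0,"done","")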

I was not sure whether this route wound up being better than the approach of using one of the shareware programs discussed in the previous post.  I was not aware of the potential difficulties when I was looking at those programs, so for example I didn't try them out on emails with Chinese characters in their Subject fields.  The other way always looks easier after a project like this.  The approach I had taken had surely been more time-consuming than if I had known of a killer app that would do exactly what I wanted without unanticipated complications or failures.  Absent a reliable, obvious solution at an affordable price, the main thing I could say at this point was that at least the conversion to EML was done.

Final Step:  Converting EMLs into PDF

With EMLs thus exported from Thunderbird and mostly renamed to indicate date, time, sender, recipient, and subject, the remaining task was to convert the EMLs to PDF.  This, it developed, might not be as simple as I had hoped.  There was, first, the problem of finding a program that would do that.  Some of the emails were simple text and could have been easily converted to TXT format just by changing their extensions from .eml to .txt.  Acrobat and other PDF programs would readily print large numbers of text files, unlike EMLs.  Other EMLs, however, contained HTML (e.g., different fonts, different colors of print, images).  I wasn't sure what would happen if I changed their extensions and then printed.  I noticed that the change to .txt caused the HTML codes to become visible in one message that I experimented with.  When I converted that file to PDF using Acrobat, its header appeared in a relatively ugly form, but the colors and fonts seemed to be at least somewhat preserved.  In another case, though, the PDF was largely a printout of code -- a truly undesirable replacement for what had been a pretty email with photos included.  My version of Acrobat (ver. 8.2) did not provide any editable settings for conversion from text or HTML to PDF. 

Thunderbird was my default program for displaying EMLs.  I wondered if a different program could view them and would have better PDF printing capabilities, or if I should try converting them into another interim format in order to then convert them to PDF.  A search led to the claim that Microsoft Word (or other programs) could display EMLs.  I tried and found that this was essentially untrue:  in Word, there was almost nothing left of that pretty email I had just tested.  Converting EML to MSG seemed to be one option, but this looked like a dead end; that is, it didn't look like it would be any easier to PDF an MSG file than to PDF an EML.  Getting the EMLs into Outlook wasn't likely to be a solution; as I recalled, my version of Outlook (2003) had been unable to batch print emails as individual PDFs.  Marina Martin said that MBOX was the standard interoperable email file type.  I could have exported from Thunderbird directly to MBOX using ImportExportTools, but I had not investigated that; I had assumed that MBOX meant one large file containing many emails, like PST, and I had wanted to rename my emails individually.  Martin gave advice on using eml2mbox to convert EML to MBOX; hopefully I would not have lost anything by taking the route through EML format.  But if MBOX was such a common format, there was surprisingly little interest in converting it to PDF.  My search led to essentially nothing along those lines.  Well, but couldn't Firefox or any other web browser read HTML emails?  I tried; neither Firefox nor Internet Explorer was willing to open an EML.  I renamed it to be an .html file.  Both opened that, but here again the problem was that the header was so ugly and hard to read:  it was just a paragraph-length jumble of text mixing up the generally important stuff (e.g., from, to) with technical information about the transmission.  Even assuming I could work out a batch-PDF process for HTMLs, this was not the solution.  There were other possibilities, but in the end it did appear that I simply needed to buy an EML-to-PDF converter.

It tentatively appeared that MSGViewer Pro ($70) might be the most frequently downloaded program in this area, ahead of its own sister program PSTViewer Pro as well as Total Mail Converter.  A search for reviews led to very little.  It didn't appear that MSGViewer Pro had the ability to include image attachments within the PDF of an EML, as Total Mail Converter Pro ($100) supposedly did.  On the other hand, MSGViewer Pro supposedly provided a free five-day trial.  I decided that I did not have time to mess with endless numbers of attachments right now, and was therefore willing to just zip the EMLs into a single file for possible future processing, if I decided that there was sufficient need and time for that.  Since I was unlikely to use these programs very often, I also hoped that their prices would drop.  I figured that if the MSGViewer Pro trial was fully functional, I might be able to take care of my need for it now, converting EMLs into PDFs without attachments, and otherwise let the matter sit for another year or more.

On that basis, I downloaded and installed MSGViewer Pro.  It was apparently designed for an older version of Windows.  When I installed it, I got one of those Win7 messages indicating that it might not have installed properly, and inviting me to reinstall using "recommended settings," whatever that meant.  I accepted the offer.  Once properly installed, I ran the program.  A dialog came up saying, "Trial is not licensed for commercial use."  I clicked "Run Trial."  Right away, I found that its Refresh feature did not work:  I copied some EMLs into a separate folder to experiment with, and could not get the program to find that folder.  I killed the program and started over.  Now it found the folder.  I selected those messages, clicked the Export button, and told it to give the resulting PDF (one of the available output options; the others were TXT, JPG, BMP, PNG, TIFF, and GIF) the same names as the input files.  It had a nice option, which I accepted, to copy failed messages to a separate folder.  A dialog came up saying, "You can only export 50 emails in trial version of MsgViewer Pro."  So that popped that fantasy.  It ran pretty quickly and reported that all of the files had been successfully exported.  Sadly, the results were no better-looking than I had been able to achieve on my own, with other measures described above.  HTML codes were visible in some PDFs -- or perhaps I should say, not visible, but overwhelming:  it looked like a piece of ordinary HTML coding.  The typeface was tiny.  Some lines were actually split down the middle horizontally, with the top half of a line of text appearing at the bottom of one page and the bottom half appearing at the top of the next page.  In a word, the results were junk.  I uninstalled MSGViewer Pro.

I decided to try Total Mail Converter Pro.  No installation problems.  When installation ended, the program started right up without giving me a choice.  Then it decided I needed to log onto Gmail.  This was not my plan, so I canceled that.  I liked its interface better than MSGViewer Pro:  smaller but still readable font, seemingly more options.  I selected my test files and clicked the PDF button.  It gave me options to combine the files into one PDF or produce separate files.  It also provided a file name template, with choices of subject, sender, recipient, date, and source filename.  I tried these.  There were other options:  which fields to export, whether to include attachments in the doc or put them in separate folders, header, footer, document properties.  It did the conversion almost instantly.  The date format was month-day-year.  The subject data weren't cleaned up, so I would still have had to go through something like my spreadsheet process to get the filenames the way I wanted them.  Moment of truth:  the file contents included a colored top part, as I had encountered with Birdie (see previous post). HTML codes were still visible in some messages, but in others the HTML seemed to have been better converted into rich text.  Typefaces were still tiny.  Definitely a better program.  But worth $100 for my needs?

Ideally, I would have been converting my emails to PDF as I went along, without converting them around and around, from Outlook to Thunderbird to EML and wherever else they might have gone over the past several years.  This might have better preserved what I recalled as the colorful, more engaging look of some of them, and perhaps I would have come up with better ways of capturing those characteristics as I continued to become more experienced with the process.  In the present circumstances, where I really just wanted to get the job done and move on, it seemed that playing with that sort of thing was not a short-term option.

Since I was planning to keep the EMLs anyway, and since I did not plan to view these emails frequently, I decided that I really didn't lose much in informational terms by going with the free option identified above.  I took a larger sample of EMLs and, using Bulk Rename Utility, renamed them to be .txt files (though later I realized I could have just said "ren D:\Documents\*.eml *.txt").  Since I had installed Adobe Acrobat, I had a right-click option to convert to Adobe PDF.  No doubt some freeware PDF programs provided similar functionality.  The Acrobat conversion of these files into PDF was not nearly as fast as that performed by Total Mail Converter Pro.  Acrobat put each of those newly created PDFs onscreen and obliged me to manually confirm that I wished to save them.  I had converted 40 files, and wasn't interested in manually closing all 40; ultimately I had to use Task Manager to shut them down.  That problem turned out to be just a result of the settings I was using for my default Bullzip PDF printer; changing those defaults and using Acrobat's Advanced > Document Processing > Batch Processing option made the process completely automatic.  In terms of appearance, it seemed the fonts, HTML handling, and other features were more or less the same as I had gotten from those other programs (above).  I probably could have made the average resulting email more readable (except where HTML formatting made clear who was responding to whom) by looking for a program that would strip the HTML codes out of those TXT files, but I didn't feel like investing the time at this point and wasn't sure the effort would yield a net improvement.

Briefly, then, the PDFing part of this process involved using a bulk renamer to replace the .eml extension with a .txt extension, and then using a bulk PDF printer or converter to convert those TXT files into PDF.  This approach still preserved the look of some emails, while allowing others to be overrun with HTML codes.

I ran that batch process on a full year's set of EMLs.  I converted 1,422 EMLs into TXT files by changing their extensions with Bulk Rename Utility.  Somehow, though, Acrobat produced only 689 PDFs from that set.  Which ones, and what had happened to the rest?  Acrobat didn't seem to be offering a log file.  My guess was that Acrobat went too fast for Bullzip.  There was no real reason why I shouldn't have been using Acrobat's own PDF printer for this particular project -- in fact, I did not remember precisely what Acrobat snafu had prompted me to switch to Bullzip as my default PDF printer in the first place -- so I went into Start > Settings > Printers and made that change now.  I also right-clicked and changed some of the Printing Preferences, for that printer, so that it would run automatically.  I deleted the first set of PDFs and tried again.  I noticed, this time, that Acrobat was not even trying to convert more than 689 files -- it was saying, "1 of 689," "2 of 689," etc.  What was causing it to overlook these other files, I was not sure.  It seemed I would have to do a "DIR /b > Printed.txt" command in CMD, and then convert Printed.txt into a Deleter.bat file that would delete the text files that were successfully printed, so as to highlight the ones that remained.  (See previous post for details on these sorts of commands.)
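The Deleter.bat lines could come from the same sort of spreadsheet formula as before.  Assuming Printed.txt (a bare listing of the PDFs that had been produced) was pasted into column A, and that the TXT files sat in the same folder with lowercase extensions, something like this would generate each row:
="DEL "&CHAR(34)&SUBSTITUTE(A1,".pdf",".txt")&CHAR(34)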

(Incidentally, I had also noticed, now that I was working with the Acrobat batch options, that it had a "Remove File Attachments" option.  While it did not seem to work with EMLs, possibly it would have been useful if these emails had been in MSG or PST format.)

The automated process got as far as file no. 2 in the list before it stalled.  Why it stalled, I had no idea.  I clicked on the X at the top right-hand corner of the dialog to kill it -- I even said "Close the Program" when Windows gave me that option -- and then Acrobat took off and printed a couple hundred more PDFs before stalling in that same way again.  Possibly I had the Acrobat PDF printer's properties set to stop on encountering an error.  I ran through most of that first set before spacing out and killing Acrobat (the whole program) at a stall, instead of just killing the stalled task.  I deleted those that had printed successfully, creating a Deleter.bat file for the purpose as just mentioned, and ran another batch.  This time, Acrobat was printing a total of 667 files.  So I figured the situation was as follows:  Acrobat would print PDFs through a glorified command-line kind of process, and that command line would accommodate only so many characters.  If I'd had shorter file names, maybe it would have been willing to print thousands of TXT files at one go.  If I had wanted to add complexity to the process, I could have renamed my files with names like 0001.txt, reserving a spreadsheet to change their names back to original form after conversion to PDF.  But with my filenames as they were, it was only going to process 600 or 700 at a time.  That was my theory.

When Acrobat was done with the second set -- the first one that had run through to completion -- it showed me a list of warnings and errors.  These were errors pertaining to maybe a dozen files.  The errors included "File Not Found" (typically referring to GIFs that were apparently in the original email), "General Error" (hard to decipher, but in some cases apparently referring to ads that didn't get properly captured in the email), and several "Bad Image" errors (seemingly related to the absence of an image that was supposed to appear in the email).  A spot check suggested that the messages with these errors tended to be commercial (e.g., advertising) messages, as distinct from personal or professional messages that I might actually care about.  In a couple of cases a single commercial email would have several errors.  But anyway, it looked like they were being converted, with or without errors.

I decided to try printing the next batch with Bullzip instead of Acrobat printer.  I had to set it as the default printer in Settings > Printers.  I also had to adjust its settings (by going to its Options shortcut in the Start Menu > General and Dialogs tabs) so that it would run without opening dialogs or completed PDFs.  Would it now process significantly more than 600 input files?  The short answer:  no.  So for the next round, I tried selecting all the TXT files in a folder and right-clicking > Convert to Adobe PDF.  This was a bad idea.  Now Acrobat wanted to open a couple thousand documents onscreen.  I had to force-reboot the system to stop this one.

So now I thought maybe I'd look for some other text-to-PDF converter.  It sounded like ActivePDF was a leading solution for IT professionals, but I didn't care to spend $700+.  Shivaranjan recommended Zilla TXT To PDF Converter ($30).  Softpedia listed a dozen freeware converters, of which by far the most popular was Free EasyPDF.  But I couldn't quite figure out what was going on there.  There was no help file, and the program wasn't even listed on its supposed creator's webpage.  CNET called it fatally crippled.  I didn't know why 30,000 people would have downloaded it.  Back to Softpedia's list:  Free Text to PDF Converter was another possibility with a Good rating.  Its webpage said it could batch-convert text to PDF files.  I went into its Open option, selected a boatload of TXT files, and saw no sign that it had any intention of doing anything with them.  Looking more closely at its starting screen, I saw it said this:
Command Line usage:
TXT2PDF <inputfile> <output.pdf> [parameter table]
The documentation webpage said I was supposed to drag the TXT files into the window on the main screen to convert them.  It also said this program would convert only plain text, not HTML.  I wasn't sure what that meant for the EMLs that contained HTML code as plain text.  The optional parameters had to do with font, paper size, etc.  In the folder where I had my TXT files to be converted, I tried this command:

"C:\Program Files\Text2PDF v1.5\txt2pdf.exe" "Text File to Be Converted.txt"

with quotation marks as shown, on the command line.  It worked.  It produced a PDF.  There was no word wrap, so words would just break in the middle at the end of the line, like this:
We can't pledge that we've entirely emerged from th
at episode, but this
past summer I sat down and rewrote the entire man
ual in a way that makes
more sense. The guy just didn't know how to phrase
The print size was very large.  There were parameters to change that, but nothing, apparently, to persuade lines to break at the ends of words rather than in the middle.  This could defeat Copernic text searching, rendering some PDF file contents unfindable, so it wasn't going to be a good solution for me.  But it really seemed like the command line approach, which would let me name each file to be converted, was the answer to the problem of being able to process only ~600 text files at a time.  Another possibility:  AcroPad.  The following command worked:
Acropad "File to Convert.txt" "File Converted.pdf" Courier 11
I could have named other typefaces and font sizes.  Output was double-spaced.  Lines were broken at the ends of words, not in the middle.  HTML code in the file was just treated as text and printed out as-is.  I kept searching.  A post by Adam Brand said I could use a command to automate printing if I had Acrobat Reader installed.  That prompted another search that led to several insights.  First, it turned out I could print a file from the command line using a Notepad command in the form of "notepad.exe /p filename."  Since my default printer was a PDF printer, it printed a PDF -- a nice one, too, for basic purposes, nicer than some of the output I was getting from the programs tested above.  It put the output on the desktop.  I changed the location for the output by going into the Desktop folder for my username.  Since I was running as Administrator, the location was C:\Users\Administrator\Desktop.  There, I right-clicked on the Desktop folder, went to Properties > Location tab and changed it.  (Another Notepad option, which I didn't need, was to specify which printer I wanted to use:  /pt.)
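Batching that up looked straightforward.  At the prompt (single rather than doubled percent signs, since this was not inside a batch file), something like this would presumably send every TXT in the folder to the default PDF printer, one at a time, provided that printer had been set not to pop up any dialogs:
for %G in (*.txt) do start /wait notepad /p "%G"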

The Notepad approach did nothing with HTML codes in these plain text files.  An alternative that would work with rich text, which might or might not help in my case, was supposedly to try the same switch with Wordpad:  "wordpad.exe /p filename."  But when I did that, I got an error message:
'wordpad.exe' is not recognized as an internal or external command, operable program or batch file.
This was odd.  To fix it, I ran regedit (Start > Run) and went to
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths\.  There, following instructions, I right-clicked on App Paths and selected New > Key.  I named the new key "Wordpad."  I right-clicked on that Wordpad key and selected New > Expandable String Value.  It apparently didn't matter what I called it.  I called it ProgramPath.  I right-clicked on ProgramPath and pasted in the path where Wordpad was, which I had obtained by going into the Properties of the Wordpad shortcut on my Start Menu.  In other words, what I entered here included quotation marks and the name of the executable wordpad.exe, with extension.  The instructions said that, to run Wordpad from the command line (as distinct from in Start > Run), the command would have to begin with the Start command.  For present purposes, what I would type at the C prompt would be "start wordpad /p filename."  This worked (and I exported the new registry key and added it to my Win7RegEdit.reg file for future installations), but it did not produce a superior PDF compared to that which Notepad had produced, and for some reason it truncated the filename of the resulting PDF.
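An alternative to the registry edit, for anyone who preferred not to touch the registry, would be to spell out Wordpad's full path in the command, assuming it lived in its usual Windows 7 location:
start /wait "" "C:\Program Files\Windows NT\Accessories\wordpad.exe" /p "filename.txt"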

Revised Final Step:  Converting TXT to HTML to PDF

Searching onward, there was a possibility of treating them as HTML rather than TXT files.  I had flirted with this earlier but had not grasped that, of course, these actually were HTML files in the first place; they had become EMLs and TXTs only later.  I typed "ren *.txt *.htm" to rename them all as HTML files.  To print them, there were some complicated approaches, but I hoped that PrintHTML.exe would do the trick.  The syntax, for my purposes, was this:
printhtml.exe file="filename.htm"
with optional leftmargin=1, rightmargin=1, topmargin=1, and bottommargin=1 parameters, among others that I didn't need.  The printhtml.exe file would of course have to be in the folder with the files being printed unless I wanted to add it to the registry as just described for Wordpad.  PrintHTML wouldn't work until I installed the DHTML Editing Control.  I did all that, and got no error messages, but also did not seem to get any output.  I decided to put that on hold to look at another possibility:  automated PDF printing using Foxit Reader on the command line.  Pretty much the same command syntax as above:
"FoxitReader.exe" /p filename
Here, again, there was a need for a registry edit, unless I wanted to park a copy of Foxit in every folder where I would use it from the command line.  But the instructions were only for using Foxit to print PDFs, so I got an error:  "Could not parse [filename]."  There was also an option of using Acrobat Reader to print a PDF silently or with a dialog box, but there again it wasn't what I needed:  I was printing HTMLs.  I returned to that printhtml.exe program mentioned above.  The command ran, with no indication of any errors, but there did not seem to be any output.  Another possibility was:
RUNDLL32.EXE MSHTML.DLL,PrintHTML "Filename.htm"
But for me, unfortunately, that produced an empty PDF.  Turning again to freeware possibilities, I found an Xmarks list of top-ranked HTML to PDF programs.  Most of the top-ranked items were online, one-file-at-a-time tools.  Others required PHP knowledge that I didn't have (e.g., HTML_ToPDF, PDF-o-Matic).  HTMLDOC looked promising for command-line usage; I found its manual; but when I downloaded and unzipped it, I couldn't find anything that looked like a setup or installation file.  Apparently the free version was just the source code, and I didn't know how to compile it.  DomPDF and html2pdf (and, I suspect, some of these others) were apparently for Linux, not for Windows.  I tried wkhtmltopdf.  When I ran it, I got an error:
wkhtmltopdf.exe - System Error
The program can't start because libgcc_s_dw2-1.dll is missing from your computer.  Try reinstalling the program to fix this problem.
Possibly the reason I got that error was that I was trying the same trick of running the program in the folder where the files I wanted to convert were located.  I had copied the executable (wkhtmltopdf.exe) to that folder, but had not brought along its libraries or whatever else it might need.  I tried running it again -- I was just trying to use the help command, "wkhtmltopdf -- help" -- but this time pointing to the place where the program files were installed:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -- help
and that worked.  I got a long list of command options.  What I understood from it was that I wanted, in part, a command like this:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -s Letter "File to be converted.htm"
I tried that.  It gave me an error:
Error: Failed loading page http: (sometimes it will work just to ignore this error with --load-error-handling ignore)
So I tried adding that long parameter to the command.  It seemed like it worked:  it gave the error message but then proceeded through the rest of its steps and announced, "Done."  But I didn't see any output anywhere.  Then I realized there was an error in what I had actually typed.  I tried again.  This time, it gave me a different error message:  "You need to specify at least one input file, and exactly one output file."  So the format I was supposed to use, aside from that additional "--load-error-handling ignore" parameter, was this:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -s Letter "HTML file to be converted.htm" "New PDF File.pdf"
And that worked.  At last, I had a mass-production way of converting EMLs (by changing their extension to .htm, not .txt) to PDFs.  It was too early to break out the champagne, but at least the computer and I were back on speaking terms.  Now I just needed to run "DIR /s /b > dirlist.txt" in the top-level folder under which I had sorted my emails, turn that dirlist.txt file into a .bat file that would convert the file listings into batch commands, and run it.  I was afraid the whole command, with the introductory reference to C:\Program Files, would be too long for Windows in some cases, so I edited the registry as described above, so that I would only have to type wkhtmltopdf.exe at the start of each command line.  But now that registry edit wasn't working -- it had certainly seemed to work previously -- so I copied all of the wkhtmltopdf program files to the folder where I would be running this batch file.  I didn't want the computer to crash itself by opening hundreds of simultaneous wkhtmltopdf processes, and I wanted to move the PDFs, so the format I used for these commands was:
start /wait wkhtmltopdf -s Letter "D:\Former Directory\HTML file to be converted.htm" "D:\New Folder\New PDF File.pdf"
That worked.  Now I investigated the longer list of wkhtmltopdf command-line options, by typing "wkhtmltopdf -H" (with a capital H).  Whew!  The list was so long, I couldn't view it in the cmd window -- it scrolled past the point of recall.  I tried again:  "wkhtmltopdf -H > wkhtmltopdf_manual.txt."  I couldn't add too much to the command line -- I was already afraid the long filenames would make some commands too long for CMD to process.  But having viewed some output of these various PDFing programs, a few sets of commands seemed essential, including these:
-T 25 -B 25 -L 25 -R 25
--minimum-font-size 10
The first set would give me one-inch margins all around.  Putting these on the already long command line increased my interest in another option:  --read-args-from-stdin.  This one, according to the manual, would also have the advantage of speeding up the process, since I would be starting wkhtmltopdf just once, and then re-running it with different arguments.  The concept seemed to be that my conversion batch file (or, really, just a typed command) would contain this:
start wkhtmltopdf --read-args-from-stdin < do-this.txt
and then do-this.txt would contain line after line of instructions like this one:
-T 25 -B 25 -L 25 -R 25 --minimum-font-size 10 -s Letter "D:\Former Directory\HTML file to be converted.htm" "D:\New Folder\New PDF File.pdf"
Or perhaps they could be rearranged so that some of the contents of the second could be in the first, and therefore would not have to be repeated on every line in do-this.txt.  In which case the main conversion command would look like this:
start wkhtmltopdf --read-args-from-stdin -T 25 -B 25 -L 25 -R 25 --minimum-font-size 10 -s Letter < do-this.txt
and do-this.txt would contain only the "before" and "after" filenames.  I decided to try this approach.  Unfortunately, it didn't work.  It froze.  So then I tried just the minimal one shown a moment ago, putting all options except --read-args-from-stdin in the do-this.txt file.  Sadly, that froze too.  I tried the minimal command plus just filenames, leaving out the several additional options about margins and font size.  Still no joy.  So, plainly, I did not understand the manual.  I decided to go back to the approach of just putting it all on one line and repeating all options, in a batch file, for each HTM file that I was converting to PDF.  Each line would begin with "start /wait," not just "start," for reasons stated above.  This worked, but now I noticed a new problem that I really hadn't wanted to notice before, because I just wanted this project to be done already.


Separating EMLs With and Without HTML Code

The new problem was that emails that were originally in HTML format turned out best when they were now renamed with an .htm extension, and processed that way, but the ones that didn't have HTML codes in them were now reduced to a mess.  Specifically, line and paragraph breaks were gone; everything was just jumbled together in one continuous stream of text.  Every non-HTML email was now being represented by a single long paragraph.  To get decent output, it seemed that I needed to separate the emails that contained HTML code from those that did not.  I would then use wkhtmltopdf with the former, but not with the latter.  But how could I tell whether a file contained HTML code?  I decided that an occurrence of "</" would be good enough in most cases.  But then it occurred to me that there might be programs that would sort this out for me.  A search led to the FileID utility.  Their read-me file led me to think that this command, entered in the top-level folder containing the files to be checked, might do the job:
"D:\FileID Folder\fileid" /s /e /k /n
This would run FileID from the folder where its program files were stored, and would instruct FileID to check all files in all subdirectories, to automatically change file extensions to match contents, to delete null files, and not to prompt me for input.  But it did not seem to be working.  Regardless of whether I entered these options as upper- or lower-case (e.g., /S or /s), FileID paused after every screenful of information, and did not seem to be renaming anything.  I decided to try again with another command-line program of similar purpose, TrID.  TrID had an online version and a GUI.  On second thought, I decided to give the GUI version a whirl.  I downloaded the program and its XML definitions.  (I already had the necessary .NET Framework installed.)  As advised by Billy, I moved everything from the XML definition folder (after unzipping them with WinRAR) into the folder containing the TrIDNet.exe file.  I double-clicked on that executable and saw that it would process only one file at a time.

I moved on to the command-line version.  This called for a download of a different set of program files and definitions.  I wasn't sure whether TrID would actually change incorrect extensions, or just detect them.  Again, rather than plow into the support forums, I just tried it out.  But in this case, that strategy didn't work:  there was no manual or other use instructions in the download.  The forum contained a tip on using PowerShell to fix extensions, but I didn't know enough about PowerShell to be able to interpret and adapt that tip to my situation.  But, silly me, I forgot about just getting online help.  In the folder where I had unzipped TrID.exe, I opened a cmd window and typed "trid -?" and got the idea that I could type "trid -ce" or perhaps "trid *.* -ce" to have the program change file extensions as needed, for all files in the current directory.  It didn't appear to have a subdirectory option, so I would have to do some file moving.

A different approach was to use a CHK recovery program to detect the proper extension for anything with a CHK extension.  While FileCHK looked like the better program for recovering real CHK files, it looked like UnCHK would have more flexibility for my situation, provided I first ran "ren *.htm *.chk" to change the file extensions to .chk.  When I tried to run unchk.exe, I got an error message:
The program can't start because MSVBVM50.DLL is missing from  your computer.  Try reinstalling the program to fix this problem.
Eric had already warned me, in the read-me file, that this meant I needed to download and install the Visual Basic 5 runtime.  I did, and tried again.  Now it ran.  I couldn't find documentation or a /help option to explain its settings.  It took me a while to realize it wasn't a command-line program, though it could run from the command line.  It was very bare-bones.  I started it, navigated to the first of the folders I wanted to repair, and (having renamed files to have .chk extensions), gave it a try.  It gave me a dialog asking about Scan Depth.  I knew from the read-me that I wanted the Whole Files option.  It ran for a while and then disappeared.  It didn't seem to have done anything.  After some more searching around, I concluded that this CHK approach wasn't what I wanted.

So I looked elsewhere.  If I wanted to spend a day or so refreshing my aging knowledge of BASIC programming, or invest some time in learning more about batch scripting or Microsoft Access or some other program, I was pretty sure I could work up a way to examine file contents.  But I wanted a solution faster than that, if possible.  The CMD batch FIND command looked like it might do the job.  But the command that I thought should work,
FOR %G IN (*.txt) do (find /i "</" "%G")
didn't.  It wasn't because "</" were weird characters; it wasn't finding files containing ordinary text either.  I tried again with the FINDSTR command:
findstr /m /s "</" *.* > dirlist.txt
This looked promising.  But when I examined dirlist.txt, I saw that many of the files listed in it were better presented as TXT than as HTM.  Apparently I should have been looking for files with more substantial HTML content.  A spot check of several emails suggested that the existence of an upper- or lower-case "<html" might be a good guide.  So apparently I would have to run FINDSTR twice:
findstr /m /s "<HTML" *.* > dirlist.txt
findstr /m /s "<html" *.* >> dirlist.txt 
with two ">" symbols in the second one, so as to avoid overwriting the results of the first search with the results of the second.  I tried that.  There were some error messages, "Cannot open [filename]," apparently attributable to weird characters in the file's name; somehow it seemed I had still not entirely succeeded in cleaning those up.  I assumed FINDSTR's failure in this regard would leave those files being treated as TXT by default, which would probably be OK since the majority of files overall appeared to be non-html.  Ultimately, dirlist.txt contained a list of maybe 40% of all of the emails I was working on.  That seemed like it might be about right.  In other words, it seemed that about 60% of the emails were best treated as plain text, and I would be getting to those shortly.  I put dirlist.txt into a spreadsheet to produce commands that would run wkhtmltopdf on the files that those two commands listed in dirlist.txt.  The key formula from that spreadsheet:
="start /wait /min wkhtmltopdf -T 25 -B 25 -L 25 -R 25 --minimum-font-size 12 -s Letter "&CHAR(34)&B1&"\"&C1&".htm"&CHAR(34)&" "&CHAR(34)&"..\Converted\"&C1&".pdf"&CHAR(34)
That formula, applied to each file identified as containing "<html," produced PDFs that looked relatively good.  I found that I needed a way of testing them, though, because in a number of cases wkhtmltopdf had produced PDFs that would not open.  I also noticed that the batch file running these commands kept acting like it had died. Windows would say, "wkhtmltopdf.exe has stopped working," and I would click the option to "Close the program." And then, after a while, it would come roaring back to life.  This may have happened especially when wkhtmltopdf was converting simple email messages into PDFs of a thousand pages or more.  A thousand pages of gibberish.  In a number of cases, too, the resulting PDF was a failure.  When I tried to open those PDFs, Acrobat said this:
There was an error opening this document.  The file is damaged and could not be repaired.
I was not sure what triggered these problems.  I wondered if possibly the simpleminded conversion from EML to HTM by merely changing the extension caused problems in the case of EMLs that contained attachments.  If that was the case, then what I should have done might have been to export from Thunderbird in HTML format in the first place -- to do two exports, in other words:  one for EMLs, which would include attachments, to be zipped up into an archive and shelved until the future day when there would be a simple, cheap solution for the PDFing of emails plus their attachments; and another export in HTML, for purposes of PDFing here and now, without attachments.  I tested this with one of the gibberished emails and found that, when exported from T-bird as HTML using ImportExportTools, it did print to a good-looking PDF.  In that approach, the naming procedures used to rename the exported emails in the desired way -- containing date, time, sender, recipient, and subject information -- would apparently have to be preserved and reapplied, so that both exported sets -- the EMLs and the HTMLs -- would be named as desired.

To investigate these questions, I traced back one PDF that did not open -- that produced the error message quoted above -- and one that opened but that was filled with gibberish.  The one that was damaged did not come from an email that originally contained attachments.  I was able to print that email directly from Thunderbird without problems.  So I wasn't sure what the problem was there.  For a sample of one filled with gibberish, I chose the largest of them all.  This was a 3,229-page PDF that was produced from a little two-page email that did originally have an attachment.  I sampled three other PDFs containing gibberish.  All three had come from emails that originally had attachments.  So it did appear that attachments were foiling my simplistic approach of just changing file extensions from EML to HTM.  I wondered if it was too late to just change the extensions back to .eml, for the ones that had not produced good PDFs, and maybe PDF them manually.  I tried with one, and it worked.  So that would have been a possibility, assuming I had time for printing emails one by one.

It seemed the gibberish might not be gibberish after all.  It might be the encoded text representation (e.g., base64) of the photograph or whatever else was attached to the email.  I didn't know of a way to test text for gibberish, so this didn't seem to be a problem that I could deal with very effectively at this point.  I could name some files as HTM, as I had done, and just accept a certain amount of gibberish -- perhaps after screening out the really large PDFs (or, earlier in the process, the large EMLs, TXTs, or HTMs), which seemed most likely to have had attachments -- or I could rename them all as TXTs and print them that way, looking solely for the text content without regard to their appearance (and still probably getting gibberish).  If I needed to know how they looked originally, I would have to go back to the archived EML version of the PDFd text.  A third option was to go back to T-bird and re-export everything as HTML, thereby skimming off the attachments, and then use my saved renaming spreadsheets to rename the newly produced, roughly named HTMs, and then do my PDFing from those new HTMs.  Presumably, that is, the new HTMs would print correctly, since they would not have attachments.

Back to the Drawing Board:  T-Bird to HTML to PDF

I decided to try that third option.  I went back to Thunderbird and used ImportExportTools to export the emails as HTML rather than as EML.  It would have been more logical to start by PDFing these HTMLs, to make sure that would work; but at this point I had such a clutter of emails in various formats that I decided to proceed, as before, with the renaming process first, so as to be able to delete those that I wasn't going to need.  Having already worked through the process of renaming to the point of achieving final names, I used directory listings and spreadsheets to try to match up the "before" names (i.e., the names of the raw HTML exports) and the "after" names (i.e., the final names I had developed previously). 

Once I had the emails in individual HTML files with workable filenames, I ran wkhtmltopdf again.  I started by taking a directory listing of the files to be converted; I put those into a spreadsheet, as before; and in the spreadsheet I used more or less the same wkhtmltopdf formula shown above in order to produce working commands.  These pretty much succeeded.  I was now getting good PDFs from the emails.  It seemed that wkhtmltopdf had a habit of wrapping lines severely or perhaps indenting them too much.  That is, if I wrote an email in reply to someone else, the text of my email would look fine,
but the text of the message
to which I was replying,
typically shown below the
reply text, would be
indented and then broken
like this.
Wkhtmltopdf converted HTML files to PDF at a rate of somewhat more than one email per second.  Of course, these were small files, as email messages tend to be.  There was a problem with the resulting PDFs taking up a lot of disk space; it seemed I might have been well-advised to format the drive with a smaller-than-default cluster size.  The program slowed down considerably at times.  I assumed it was running into complexities with some HTML files.

The batch file ran and finished, but it had converted only about half of the HTMLs into PDFs.  I decided to test the PDFs before deleting the corresponding HTMLs.  I opened a half-dozen of them without a problem.  Then, for a more thorough test, as described in a separate post, I ran an IrfanView batch conversion from PDF into RAW format.  I chose RAW because it would result in just one file.  TIF might have been another possibility.  It did appear that this process was all working well.  Ultimately, these steps converted all of the HTMLs into PDFs. 

Summary

The first part of what I was able to achieve, at this point, was to export my emails from Thunderbird to EML format, using the ImportExportTools add-on for Thunderbird.  Once I had exported all those EMLs, I used a zipping program (either WinRAR or 7zip) to bundle them together into a single file containing all of a year's emails.  I took these steps because EML files, unlike HTML, PDF, JPG, TXT, or other formats, were able to contain email attachments along with the text of the email messages.  I planned to keep these year-by-year ZIPs of EMLs until some point when I could find a cheap and broadly accepted program for printing both the email message and its attachment into a single PDF.

The other main achievement was to work out a process for converting HTMLs (also exported from Thunderbird via ImportExportTools) into PDFs.  I used wkHTMLtoPDF for this purpose.  I ran it in a batch file, produced by a spreadsheet, so that there was one command per file.  I used DIR folder comparisons and other means to test that all files were being converted and that they were being converted into valid PDFs.
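One way to run such a check -- sketched here with assumed folder and worksheet names, and assuming lowercase extensions -- would be to list both folders and then flag any HTML file whose PDF never appeared:
DIR /b "D:\HTMLs\*.htm" > D:\before.txt
DIR /b "D:\Converted\*.pdf" > D:\after.txt
With those two listings pasted into worksheets named before and after, a formula like this in the before sheet would mark the gaps:
=IF(ISNA(MATCH(SUBSTITUTE(A1,".htm",".pdf"),after!A:A,0)),"missing","ok")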