
Friday, June 22, 2012

Finding and Cleaning Up EMLs That Display HTML Codes as Text

I had a bunch of email (EML) files scattered around my hard drive.  Some of them, I noticed, were displaying a lot of HTML codes.  For example, when I opened one (using Thunderbird as the default EML opener), it began with this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0">
<TITLE>RE: Scholar Program</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/rtf format -->

I was not sure how that happened.  Apparently I had run these EMLs through some kind of conversion process, perhaps after renaming them to be .txt files.  Whatever the origin, I wanted to eliminate all those HTML codes and wind up with a plain text file, probably saved as a PDF.  This post describes the steps I took to achieve that outcome.

Finding the Offending Files

As I say, the files containing this text were scattered.  Initially, I did a search for some of the text shown above (specifically, for "<!DOCTYPE HTML PUBLIC") in Copernic.  (I assume any tool capable of searching for text within files would work for this purpose.)  I thought maybe I would just copy and paste the lot of them from Copernic to a separate folder in Windows Explorer, where I could work on them in more detail.  This approach failed because Copernic did not allow me to select and move multiple files to other folders.  Moreover, Copernic did not display them with their actual filenames; rather, it showed the title indicated in the HTML "<TITLE>" line (see example above).

It was probably just as well.  Moving them in bulk from Copernic would have lost the indications of the folders where they were originally located.  The better approach, I decided, would be to use the command line and batch files to identify their source folder, move them to a single folder where I could work on them, and then move the resulting, cleaned-up files back to the folders where the originals had come from.

So the first thing I needed was a way to locate the files to be cleaned up.  I decided to use a batch command for this purpose.  I could have searched for every file (or just every EML file) that contained any HTML codes.  For that purpose, a search for "</" might have done the trick.  But then I decided that there could be a lot of HTML codes floating around out there, in various files, for a lot of different reasons; and for present purposes I didn't need to be trying to figure out what was happening in all those different situations.  So instead, I searched for the same thing as before:  "<!DOCTYPE HTML PUBLIC."  To do that, after several false starts, I tried this command:
findstr /r /m /s "<!DOCTYPE HTML PUBLIC" D:\*.eml > D:\findlist.txt
It produced a dozen "Cannot open" error messages.  The reason seemed to be that the filenames for those files had funky characters (e.g., #, §).  Also, Findlist.txt contained the names of files that did not seem to have the DOCTYPE text specified in the command.  DOCTYPE may have appeared in attachments to those files, but I didn't want to be flagging that sort of EML file.  So despite a number of variations with FINDSTR and several Google searches, I gave up.  I returned to Copernic, searched for the DOCTYPE text (in quotation marks, as shown above), and moved them manually.  Copernic had a convenient right-click Move to Folder option, so that helped a little.  So now, anyway, despite the imperfections of the process, I apparently had the desired EMLs in a single folder.  I would just have to re-sort them back to where they belonged manually.
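
In hindsight, part of the problem was probably that FINDSTR treats a space-separated search string as several alternative search terms unless the /c: switch marks it as a single literal phrase -- which would explain the matches that lacked the full DOCTYPE text.  Something along these lines (untested at the time) might have come closer:
findstr /s /m /c:"<!DOCTYPE HTML PUBLIC" D:\*.eml > D:\findlist.txt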

But I still wasn't sure that everything in that folder was problematic.  Basically, I needed to see what the EMLs looked like when they were opened up.  Ideally, I would have just clicked a button at this point to convert them to PDF and merge them into a single document, so I could just flip through and identify the problem emails.  But I was having problems in my efforts to print EMLs as PDFs.  As a poor second-best, I manually opened them all (again, using Thunderbird as my default EML opener), selected the ones needing repair in Windows Explorer, and moved them to a separate folder.  To open them, I just did a "DIR /b /a-d > Opener.bat" and modified its contents, using Excel, so that each one started and ended with a quotation mark (actually, CHAR(34)) -- no other command needed -- and then ran Opener.bat.  Somehow, this failed to crash my system.
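
For the record, the Excel formula was presumably just something like =CHAR(34)&A1&CHAR(34), so that Opener.bat contained nothing but quoted filenames (the names here are made up), each of which cmd would hand off to the default EML program when the batch file ran in that folder:
"2006-06-30 RE Scholar Program.eml"
"2006-07-02 FW Travel Plans.eml"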

Cleaning Up the Files

After verifying that most of them looked bad (and removing the others), I made copies in another folder, and renamed the copies to .TXT extensions using Bulk Rename Utility.  Now I could edit them as text files.  My plan was to store up a set of standard search-and-replace items, mostly replacing HTML codes with nothing at all, so as to clean up these files.

I had previously decided on Emacs as my default hard-core text editor, and had taken some first steps in re-learning how to use it.  The task at hand was to find advice on how to set up before-and-after lists of text strings to be replaced.  It was probably something I could have done in Excel, but I might have had to cook up a separate spreadsheet for each file, and here I was wanting to modify multiple files -- dozens, possibly hundreds -- in one operation.  Now, unfortunately, it was looking like Emacs was not going to be as naturally adapted to this task as I had assumed.  After a couple of tries, I found a search that did bring up a couple of solutions to related problems.  But those solutions still looked pretty manual.  Was there some more tried-and-true tool or method for replacing multiple text strings in multiple files?

A different search led to HotHotSoftware, which offered a tool for this purpose for $30.  A video seemed to demonstrate that it would work.  But, you know, $30 was more than the files were worth.  Besides, I wouldn't learn anything useful that way.  ReplacePioneer ($39, 21-day trial) looked like it might also do the job.  A thread offered a way to do something like it in an unspecified language, presumably Visual Basic.  Another thread offered an approach in sed.  Another way to not learn anything, but also not to spend $30, was to try the free TexFinderX.  Other free options included Nodesoft Search and Replace and Replace Text.

I tried TexFinderX.  In its File > Add Folder menu pick, I added the list of files to be changed.  I clicked the Replacement Table button, but did not see the Open Table Folder button shown on the webpage.  The ReadMe file seemed to say that a new replacement table would appear in the list only after being manually created in the TFXTables subfolder.  They advised using an existing table to create a new one.  As I viewed their "Accented to None - UTF8.txt" replacement table, I recalled looking into character replacement using Excel formulas.  The specific point of comparison was that I had discovered, in that process, that people had invented various character conversion tables that might be suitably implemented with TexFinderX.

But for my own immediate purposes, I needed to see if a TexFinderX replacement table would accept a whole string of characters, to be replaced by nothing or, say, a single space.  I was hoping that what I was seeing, there in that "Accented to None" replacement table, was that the "before" and "after" columns were tab-delimited -- that, in other words, I could enter a whole long string, hit the tab key, and then hit the spacebar.  I tried that, first saving the "Accented to None" table under the name of "Remove HTML Codes," and then entering "<!DOCTYPE HTML PUBLIC "-//W3C//DTD W3 HTML//EN">" (without the outside quotation marks, of course) and hitting Tab and then Space.  I did this on what appeared to be the first replacement line in that "Accented to None" file, right after the line that said /////true/////, as guided by the ReadMe.  I hit Enter at the end of that line, and deleted everything after it, removing all the commands they had provided.  I also changed the top lines, the ones that explained what the file was about.  I saved the file, went into the program's Replacement Table button, and there it was.  I selected it and clicked Apply.  On second thought, I decided to try it on just one or two files, so I emptied out the list and added back just a couple of files.  Then I ran it.  It looked like it worked.

I proceeded to add all kinds of other HTML codes to my new Remove HTML Codes replacement table, testing and running and removing more unwanted stuff.  I found that it was not necessary to hit Tab and then Space at the end of each line that I wanted to remove; it would remove anything that was on a line by itself, where no other tab-delimited text followed it on the same line.  So, basically, I could copy and paste whole chunks of unwanted text into the replacement table, and it would be removed from any files on the list that happened to contain it.  It seemed best not to add too many chunks at once, lest I be repeating the same lines:  run a few, after eyeballing them for duplication, and then see what was left.  It appeared that I could add comments, on these lines in the replacement table, by again hitting Tab after the "replace" value on the line.

I added back some of their original items (modified) to the replacement table.  These included the replacement of three spaces with two (which I might run several times to be thorough) and the replacement of a Space-CR (Carriage Return) combination with a simple CR (using space-<13> tab <13> to achieve that, and apparently doing the same thing also with <10> in place of <13>).  I tried replacing three CRs with two, using <13><13><13> on the same line, but it didn't work.  The answer to that seemed to be to replace three pairs of <13><10> with two.  I discovered that the conversion process that had originally mangled these files had placed different parts of HTML code sequences on different lines, so I had to break them up into smaller pieces -- but not too small, because I didn't want to be accidentally deleting real text from my emails that happened to look similar to these HTML codes.
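
To make that concrete, the working part of my Remove HTML Codes table ended up looking roughly like this (reconstructed from memory, not an exact copy), below its /////true///// line.  A chunk of text on a line by itself meant "delete this"; where I wanted a replacement, a tab (shown here as <tab>) separated the search text from the replacement:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<BODY>
</BODY>
<13><10><13><10><13><10><tab><13><10><13><10>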

I basically worked through all the codes that appeared in one email, and then started in on those that remained in the next after applying my accumulated rules to it, and so forth.  After working through the first half-dozen files in the list, I skipped down and ran the accumulated corrections against some others.  Running it repeatedly seemed to clear up some issues; possibly it was able to process only one change per line per run.  I realized that it would probably not produce perfect results across all cases.  It was succeeding, however, in giving me readable text that had previously been concealed beneath a mountain of HTML codes.

I had noticed that the program took a little longer to run as I added more rules to its replacement table.  But this did not seem to be due to file processing time:  the time did not grow far longer when I added far more files to the list.  It was still done within a minute or so in any case.  Apparently it was just reading the instructions into memory.

The excess (now blank) lines in the files were the slowest to remove.  I ran TexFinderX against the whole list of files at least a half-dozen times, adding a few more codes with the aid of additional spot checks.  Unless I was going to check every individual file for additional lingering codes, that appeared to be about as far as TexFinderX was going to take me in this project.

Cleaning Up the Starts and Ends of Files

I had previously used Emacs to eliminate unwanted ending material from files (as described in an earlier post: http://raywoodcockslatest.blogspot.com/2012/03/choosing-emacs-as-text-editor-with.html).  Now I wanted to use a similar process on these files.  I also wanted to see if I could adapt that process to remove unwanted material elsewhere in the files.

I had not previously noticed that most if not all of these emails had originally included attachments.  As such, they included certain lines after their text, apparently announcing the beginning of the attachment portion.  These lines included indications of Content-Type, Content-Transfer-Encoding, and Content-Disposition.  These seemed like good places to identify the start of ending material to delete, for purposes of printing a cleaned-up message portion by itself.  I now saw that I had made things more difficult for myself by including references to some Content-Type and Content-Transfer-Encoding lines in my list of items to remove in TexFinderX.  I had not removed Content-Disposition lines, however, so -- as in the previous use of Emacs -- those would be my focus.

Having already done the initial setup of GNU Emacs as described in the previous post, I set forth to modify the process that I had used previously.  After making a backup, the summary version of those steps, as modified, went like this:
  • Start Emacs.  Open one of the post-TexFinderX emails.  Hit F3 to start macro recording.  C-End (that is, Ctrl-End, in Emacs-speak) to go to the file's end.  Hit C-r and type "Content-Disposition" to back up to its last occurrence of Content-Disposition.
  • At this point, modify the previous approach to back up a bit further, in search of the boundary line just preceding the Content-Disposition line.  I could have done this by hitting C-r and typing "----------" to find that boundary line, but now I saw that my TexFinderX replacements had deleted that, too, from some of these emails.  So instead, I just hit the Up arrow three times, hoping that that would take me to a point before most of the ending material.
  • Hit C-space to set the mark.  C-End.  Del.
The macro was still recording; I wasn't done.  The preceding steps did take care of the ending material in that particular file.  (As before, it was essential to avoid typographical errors, which would terminate macro recording or worse.)  But now, how about the unwanted starting material? I hadn't done this particular operation before, but it seemed straightforward enough.  I had to use C-Home to get to the start of the file.  Then -- since I had, again, deleted the objectionable boundary lines in some of these emails -- I had to search for the last surviving message header field.  In the case of the first email I was looking at, which I believed was probably the most thoroughly scrubbed, that last surviving field was Message-ID.  So I went through several additional but similar steps to clean up the start of the email and finish the task:
  • C-s to search for Message-ID.  Then C-e to go to the end of that line, and right-arrow to go to the start of the next line.  C-Space to set the mark, C-Home, and then Del.  That was as much as I could do with this particular email; it was clean, though not ideally formatted.
  • C-x C-s to save the file.  F4 to end the macro recording.  C-x C-k n Macro1 Enter (to name the macro to be Macro1).  C-x C-k b 1 (to bind the macro to key 1).
  • C-x C-f ~/ Enter (to find my Emacs Home directory).  In my case, Home was  C:\Users\Ray\AppData\Roaming\.emacs.d.  I went there in Windows Explorer and created a new text file named _emacs, with no extension.  This was my init file.
  • From the Emacs menu:  File > Open File > navigate to the new _emacs init file > select and open _emacs.  Using the Meta (i.e., Alt) key, I used M-x insert-kbd-macro Enter Macro1 Enter.  This hopefully saved my macro to my init file.  C-x C-c to save and quit Emacs.  A quick look with Notepad confirmed that there was something in _emacs.
  • Restart Emacs.  Open another of these text emails.  Test my macro by typing C-x C-k 1.  I got "C-x C-k 1 is undefined." I killed Emacs and, following advice, in Windows Explorer I renamed _emacs to be init.el and tried again.  Still undefined.  Since _emacs had worked in my previous session, I decided that the advice about init.el might be oriented toward Unix rather than Windows systems, so I changed it back to _emacs.  In the Emacs menu, I went to File > Open File > navigate to _emacs > open _emacs.  I used C-x 2 to split the window.  _emacs appeared in both panes.  In the top pane, I went to Buffers > select the text file to be changed.  (Apparently it was listed as one of the available buffers because I had already opened it.)  So now I was viewing the macro in the bottom pane and the email file in the top pane.  I selected the top pane and tried C-x C-k 1 again; still undefined.  I found other advice to just use M-x Macro1.  That worked.  The macro ran in the top pane.
The macro didn't do such a great job of cleaning this second file.  I would have to return to that later.  For now, the next step was to figure out how to run the macro automatically on all the emails.  Meager results from a search presented the possibility that people did not commonly do this sort of thing.  A refined search led to further discussion suggesting that I should be searching for information on multiple buffers rather than multiple files.  That innovation provoked the side question of whether perhaps jEdit was better than Emacs for such purposes but, once again, Emacs seemed better.  Still another search led to Dired, which would apparently allow the user to conduct certain operations on the files listed in a directory.  We were getting closer.  I found someone who was feeling my pain, but without a solution.

A StackOverflow discussion suggested that I might want to begin a Dired approach by loading kmacro.  I had no idea of how to do this.  An Emacs manual page seemed to think that kmacro was already part of Emacs.  I decided to try to follow the StackOverflow concepts without special attention to kmacro preliminaries.  The first recommended step was to go to the top of my Dired buffer.  This, too, was a mystery.  Another Emacs manual page told me to use C-x d to start Dired.  In the bottom line of the screen, that displayed the name of the directory containing the emails.  I didn't know what else to do, so I hit Enter.  Apparently that was just the right thing to do:  it showed me a directory listing for that folder.  It would develop, eventually, that the fast way to get it to show that directory was to use the menu option File > Open File to navigate to that directory and open a file there.

Now the StackOverflow advice was apparently to move the cursor to the first file in that list (which is where it already looked like it might be) and hit F3 to begin recording a keyboard macro.  Then hit Enter to visit the file.  Then M-x kmacro-call-ring-2nd.  But at this point it said, "No keyboard macro defined."  So kmacro was working, but on this command Dired was looking for a previous keyboard macro, not for an already saved one.  I used C-x k Enter to close the email that I had opened.  Now I was back at the Dired file list.  I hit C-x 2 to split the window, so maybe I could see more clearly what was going on.  With the cursor on the first target email in the top pane, I hit Enter to visit it again, then M-x Macro1 Enter.  That seemed to be the answer, sort of:  the bottom row said, "After 0 kbd macro iterations: Keyboard macro terminated by a command ringing the bell."  So the macro did try to run.  Adventures in the previous post suggested that this error message meant the macro failed to function properly, and I believed I knew why:  this was the email that I had already edited.  I had already removed, that is, the stuff that the macro was searching for, starting with the Content-Disposition line.

Time to try again.  With the top pane (displaying the email message) selected, I hit C-x k Enter to close it.  Then I moved the cursor to (i.e., mouse-clicked on) an email on which I had not yet run Macro1.  There, going back to the (modified) StackOverflow advice, I hit F3 to start recording a keyboard macro; I hit Enter to visit the file; then M-x Macro1 Enter.  It ran without an error message.  The email was showing in both top and bottom panes, so evidently I had not yet mastered the art of pane.  StackOverflow said C-x o to switch to the other buffer.  This just switched me to the other pane; I was supposed to be seeing the Dired file list.  With the keyboard macro still recording, I tried C-x k Enter to close the email.  Now the bottom pane, where I was, had the cursor flashing on the wrong line.  So: C-x o, followed by a tap on the down arrow key to take me to the next file to be processed.  That was the end of the steps that I wanted my new keyboard macro to save, so I hit F4.  StackOverflow said that now I had to hit C-u 0 C-x e to run the keyboard macro on every file in the list.  But that command sequence only opened the next file and ran Macro1 on it.  I hit C-x k Enter to close.  I was back at the Dired list.  The cursor did not advance to the next line; Macro1 did not run automatically.

I thought maybe my errors in that last try screwed up the keyboard macro, so I tried recording it again:  F3; cursor on the target email; Enter to visit that file; M-x Macro1 Enter to run the macro; Ctrl-x k Enter to close the email; down-arrow to select the next email in the list; F4 to close the keyboard macro; C-u 0 C-x e to run it.  No joy:  I still had to close the file and start the next one manually.

By this point, a different approach had occurred to me.  If I could open all the target emails at once, I would only have to hit keys to run Macro1 and then close the changed file:  the next one would then be there, ready and waiting for Macro1.  I decided to try this.  As advised, with an email already opened in my target directory (via menu pick -- see above), so as to tell Emacs where to look, I used C-x C-f *.txt to open all of those emails. (As noted above, I was working on EMLs that I had mass-renamed to be TXT files.)  That seemed to work.  The first ones visible to me were those at the top of the list, on which I had already run Macro1.  I closed those.  I couldn't tell, from the Buffers menu pick, how many files remained opened.  I could see that their timestamp would change in Windows Explorer after Emacs was done with them, so presumably I would be able to check which ones I had run Macro1 on.  I made a mental note to make at least some kind of change in each file before closing it, so as to assure myself that there was no further need to work it over with Macro1.

So now I was looking at the first file that had not yet been caressed by the loving hand of Macro1.  I wondered:  can I define a keyboard macro to save the steps of running Macro1 and then closing the file?  I tried:  F3, M-x Macro1 Enter, C-x k Enter, F4.  To execute that last defined keyboard macro, I used C-x e.  It changed the file as desired -- that is, apparently it ran Macro1 -- and it also seemed to be saving the changed file, but it did not close the file.  In other words, I had reduced the required number of keystrokes down to C-x e, C-x k Enter.  That was what it took to run Macro1 and then close a file.  Not bad, but could I do better?

The problem -- for both this approach and the Dired approach (above) -- seemed to be that the macros were not saving the C-x k Enter sequence.  A search seemed to indicate this could be another difficult problem to solve.  I was running low on time for this project, so I had to shelve that, along with the ensuing question of whether I could bind this C-x e C-x k Enter sequence to a function key.  

Instead, I just went plodding through that sequence for these many files.  In some cases, the scrollbar at the right showed me that there was a lot of extra material that I had to delete manually, usually from the ends of the emails.  Saving after these additional edits required a C-x C-s Enter before the C-x k Enter.  It was also handy to know that C-/ was the undo key.

Further Cleanup

When I was done running Macro1 on all those files, I saw that Emacs had created backup copies, with a .txt~ extension.  I sorted by file type in Windows Explorer and deleted those.  Also, while going through the process, I had noticed a number of files that were short and unimportant, and whose attachments did not interest me.  So I was able to go through the list and remove those to a "Ready to PDF" folder.  These steps reduced the number of files on which I might want to perform further operations.
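
For what it's worth, deleting those Emacs backups could also have been done with a single command in that folder (the folder name here is made up):
del "D:\EmailCleanup\*.txt~"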

While looking at those files in Windows Explorer, I noticed that some were much larger than others.  These, I suspected, included some whose attachment sections had not been completely eliminated by the macro, perhaps because they had more than one attachment.  I opened these in Notepad and eliminated material that did not contribute to the intelligible text of the email.

In some of the remaining files, there were still a lot of HTML codes and other material that would interfere significantly with an attempt to read the contents.  It seemed that the spot checks I had conducted in TexFinderX had not brought out all of the things that TexFinderX could have cleaned up.  I restarted TexFinderX, added more codes to the list of things to remove, and ran it some additional times on the files remaining in that folder.  That didn't continue too long before I realized that there could be an endless number of such codes and variations.

The next step was to return to Emacs.  This time, I was looking particularly for individual lines that could safely be deleted.  This was not so much a concern with HTML codes, though there might be some of that too; it was more a concern with email headers, boundary lines, and other items that would vary from one email to the next, could therefore not be readily added to a TexFinderX replacement list, and yet could appear repeatedly within a single email.  For example, each of the following lines appeared within a single email:

--===============3962046403588273==
boundary="----=_NextPart_000_002A_01C69314.AD087740"
------=_NextPart_000_002A_01C69314.AD087740

Moreover, variations on those themes recurred throughout that email, with quite a few of each.  So I could write an Emacs macro to search for a line beginning with the relevant characters, select that entire line, and delete it.  I wouldn't have to know which numbers appeared on different variations of these lines, as I would if I were using TexFinderX.

The problem here was that there were quite a few different kinds of lines to remove.  In addition to the types just shown, there were also email header lines that would normally not be visible, but that had become visible in the original mangling of these files, and there were also various Content-Description and Content-Disposition and Content-ID and Content-Location lines.  I would have to write an Emacs macro for each.  I could write one macro to run them all, but it would terminate as soon as it failed to find the next requested line; and since these sorts of lines varied widely from one email to another, it was quite likely that such a general macro would be terminating prematurely more often than not.  If I knew how to bind macros to individual keys, it might not be horrible to go down the list and punch the assigned function (or Ctrl-Function, Alt-Function, etc.) keys, one at a time, reiteratively for each of these many email files.  But that seemed like a lot of work for a fairly unimportant project.  A better approach would have been to write a script to handle such things, but my chosen scripting language for this purpose, Perl, had one significant drawback:  I had not learned it yet.  I had been meaning to, for about 20 years, and I knew that eventually the opportunity would arrive.  But that day was not today.

I concluded that my cleanup phase for these emails was finished.  If I really needed to go further with it, I could convert them from PDF back to text and have at it again, some fine day.  If I had really intended to do that, I would have saved a list of the relevant files at this point.  But for the time being, I needed to get on with the next part of the project.

Converting Emails to PDF

I had previously used "Notepad /p" to convert a set of TXT files, like these emails, to a set of PDFs.  The basic idea was to make a list of files and then use Excel to convert those file paths and names (as needed) to batch commands.  I used that same approach here, making sure to set the PDF printer to operate with minimal dialog interruptions.  This produced PDFs with "Notepad" at the end of their names.  For some reason, Bulk Rename Utility was not able to remove that; I had to use Advanced Renamer instead.
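
As in that earlier effort, each generated line was just a Notepad print command; with a made-up filename, a line of the batch file looked something like this:
notepad /p "D:\EmailCleanup\2006-06-30 RE Scholar Program.txt"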

Converting Attachments to PDF

As noted above, most of these troublesome emails had attachments.  I now had, in a folder, only those emails (in .txt format) whose attachments I wanted to see.  Using a DIR command as above, I did a listing of those .txt files.  I put that list into Excel and modified it to produce batch commands that would move the EMLs of the same name to a separate folder.  Then, in Thunderbird, I created a new local folder.  With that folder selected, I went into Tools > ImportExportTools > Import eml file.  I navigated to the folder containing the EMLs whose attachments I wanted to see, selected them all, and clicked Open.  The icons indicated that all did have attachments.
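
The generated commands were simple MOVE lines, something like this (the names and folders here are made up):
move "D:\EmailCleanup\2006-06-30 RE Scholar Program.eml" "D:\EMLs with attachments"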

Now, having configured Thunderbird's AttachmentExtractor add-on to generate filenames that I could recognize and connect with specific emails, I selected all those newly imported EMLs, right-clicked on them, and chose Extract from Selected Messages to (0) Browse.  I set up a folder that was not too many levels deep, for fear that some of these attachments might already have long names that could cause problems.  AttachmentExtractor went to work.  When it was done, I deleted that folder in Thunderbird, so that I would not have a problem of confusing duplicates of EMLs that had already caused me enough grief.

Then, in Windows Explorer, I sorted the extracted attachments by Type.  I began the process of converting to PDF those that were not already in PDF format.  Many of these were Microsoft Word documents.  I had already worked out a process that would automate the conversion of Word docs to PDF.  I moved these files to another workspace folder for clarity, and after making the advisable adjustments to my PDF printer, I applied that process to these files.

Word had problems printing a number of these Word docs.  It crashed repeatedly, during this process, whereas it had sailed right through other stacks of docs that I had converted to PDFs by using the same techniques.  It did produce some PDFs.  I looked at those, to make sure they turned out OK, and then I had to do a DIR /a-d /b *.pdf > successlist.txt in the output folder to see which docs had been successfully PDFed, and then convert successlist.txt into a batch file full of commands to delete the corresponding DOCs, so that I could try again with the DOCs that didn't convert properly the first time.  Before re-running the doc-to-pdf conversion batch file, I opened one of the failed DOCs and printed it to PDF.  That went fine, as a manual process.  So apparently it was not, in every case, a problem with the file.  Ultimately, I used OpenOffice Writer 3.2 and was able to print the remainder manually, using just a few keystrokes per file, with no problems.
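
Assuming the output PDFs kept exactly the same base names as the source DOCs, an Excel formula along the lines of ="del "&CHAR(34)&SUBSTITUTE(A1,".pdf",".doc")&CHAR(34) would turn each line of successlist.txt into a deletion command like this (filename made up):
del "2006-06-30 RE Scholar Program.doc"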

Other extracted attachments were text files.  At this point, I had two ways of dealing with these.  On one hand, I could have used the same process as I had just used with the Word docs, after changing the command used for .doc files to refer instead to .txt files.  I did start to use this approach, but ran into dialogs and potential problems.  On the other hand, I could have used the approach of printing to Notepad, as I had used with the emails themselves (above).  Before I got too far into this task, though, I noticed that every one of these text files had names like ATT3245657.txt.  They also all originated from the same source.  I examined a handful of these attachments and decided I could delete them all.

Some extracted attachments were image files -- JPG, GIF, PNG, BMP.  I also had a dozen attachments without extensions.  I opened the latter in IrfanView.  I believe there was an IrfanView setting that allowed it to recognize, as it did, that some of these were actually image files, and to offer to rename them (as PNGs or whatever) accordingly.  On the other hand, as I looked through these files, I saw that some of the GIFs were animations.  Excluding those, I now had a list of what appeared to be all the attachments that should be treated as image files.  I used IrfanView's File > Batch Conversion/Rename option to convert these to PDF.

There were a few miscellaneous file types.  For videos, I just took a screenshot in the middle and used that as an indication of what the original attachment had been.  One alternative would have been to use something like Shotshooter.bat to produce multiple images conveying a sense of the direction of the images in the video, and then combine those images in a single PDF.

Combining Email and Attachment PDFs

Now I had everything in PDF format.  I used Bulk Rename Utility to rename emails and attachments so that, when combined into one folder, each email would come before its associated attachments (if any), and the difference between the two would be readily visible.  I combined the files and attachments into one folder and made a list of the files using DIR (above).

Now the goal was to combine the emails that did have attachments with their accompanying attachments.  There were probably too many of these to combine them manually, one set at a time, using Acrobat or something like it.  I had previously worked out a convoluted approach for automating the merger of multiple PDFs (produced from multiple JPGs), using pdfSAM.  Discussion on a SuperUser webpage and elsewhere suggested that pdftk and Ghostscript were alternatives.  The instructions for Ghostscript looked more complex than those for pdftk, so I decided to start with pdftk.

I downloaded and unzipped pdftk.  As advised, I copied the two files from its bin folder (pdftk.exe and libiconv2.dll) into C:\Windows\System32.  I opened a command prompt in some other folder, at random, and typed "pdftk --help."  This was supposed to give me the documentation.  Instead, it gave me an error:
pdftk.exe - System Error  The program can't start because libiconv2.dll is missing from your computer.  Try reinstalling the program to fix this problem.
I moved the two files to C:\Windows and tried again.  That worked:  I got documentation.  It scrolled on past the point of recovery.  Typing "pdftk --help > documentation.txt" solved the problem, but ultimately it didn't seem to give me anything more than already existed in pdftk's docs subfolder.  The next step was to put pdftk to work.  It would apparently allow me to specify the files to combine, using a command of this form:
pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
My problem was that, at least in some cases, the filenames I was working with were too long to fit on a single line like that, one after the other.  I decided a solution would be to take a directory listing, put it into Excel, and use it to create commands for a batch file that would rename the emails and their accompanying attachments, with names like 0001.pdf.  I would need to keep the spreadsheet for a while, so as to know what the original filenames were.  The original filenames were my guide as to what files needed to be combined together.  For this purpose, with one of the original filenames in spreadsheet cell A1, I put the ascending file numbers in cells B1, B2 ... (i.e., 1, 2, ...) and then, in cell C1, I put =REPT("0",4-LEN(B1))&B1&".pdf".  Finally, in cell D1, I put ="ren "&CHAR(34)&A1&CHAR(34)&" "&C1.  Then I copied the formulas from column D into Notepad, saved them as Renamer.bat, and ran it.
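
With a made-up original filename in cell A1, the resulting line of Renamer.bat would read:
ren "2006-06-30 1415 RE Scholar Program.pdf" 0001.pdf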

After doing that renaming, I went back to the spreadsheet for guidance on which of these numbers needed to be combined.  Each original filename began with date and time.  With few exceptions, this was sufficient to distinguish one email and its attachments from another.  So I used =LEFT to extract that identifying information from column A.  Then, in the next columns, I used IF statements to compare the extract from one line to the next, concatenate the appropriate filenames with a space between them, and choose which concatenations I would be using.  Finally, I added a column to create the appropriate command for the batch file.  Instead of the 123.pdf output shown in the example above, I used the original email filename.  Where there were no attachments, pdftk would thus just convert the numbered PDF (e.g., 0001.pdf) back to its original name.
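
So, for example, if 0001.pdf was an email with two attachments that had been renamed 0002.pdf and 0003.pdf, the generated command (with a made-up original name) would look like this:
pdftk 0001.pdf 0002.pdf 0003.pdf cat output "2006-06-30 1415 RE Scholar Program.pdf"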

I finished with spot checks of various files, with and without attachments, to verify that they had come through the process OK.  I was not happy with the remaining junk in the emails themselves, but at least I could tell what they were about now, and they had their attachments with them.  Pdftk had proved to be a much easier tool for this project than pdfSAM.  This had been an awful lot of work for not terribly much achievement on some not very important files, but at least I had finally worked through all of the steps in the PDF conversion process for Thunderbird emails with attachments.

Thursday, March 15, 2012

Batch Converting Many Text Files to PDF

I had a bunch of .TXT files that I wanted to convert to PDF.  I had solved this problem previously, but it looked like I hadn't written it out clearly, so that's the purpose of this post.  This explanation includes solutions to several other sub-problems.  Taken together, the things presented here were useful for solving a variety of problems.

First, I made a list of the files to convert.  My preferred way of doing this was to use DIR.  First, I would open a command window.  My preferred way of doing *that* was to use the "Open command window here" context menu (i.e., right-click in Windows Explorer) option.  An alternative was to use Start > Run > cmd, but then I would have to navigate to the desired folder using commands like CD.

The DIR command I usually used, to make a list of files, was DIR /s /a-d /b > filelist.txt.  (Information on DIR and other DOS-style commands was available in the command window by typing the command followed by /?.  For example, DIR /? told me that the /s option would tell DIR to search subdirectories.)  A variation on the DIR command:  DIR *.txt /s /a-d /b.  The addition of *.txt, in that example, would tell DIR that I wanted a list of only the *.txt files in the folder in question (and its subfolders).  If I wanted to search a whole drive, I'd make it DIR D:\*.txt /s /a-d /b > filelist.txt.  If I wanted to search multiple drives, I'd use >> rather than > in the command for the second drive, so that the results would add to rather than overwrite the filelist.txt created by the preceding command.

Using DIR that way could gather files from all over the drive.  Sometimes it was better to gather the files into one folder first, and then run my DIR command just on that folder.  An easy way of finding certain kinds of files was to use the Everything file finding utility, and then just cut and paste all those files from Everything to the desired folder.  For instance, a search in Everything for this:

"see you tomorrow" *.txt
would find all text files whose names contained that phrase.  Cutting and pasting that specialized list into a separate folder would quickly give me a manageable set of files on which I could focus my DIR command.  (There were other directory listing or printing programs that would also do this work; I just found them more convoluted than the simple DIR command.)

Once I had filelist.txt, I copied its contents into Excel (or I could have used Excel to open filelist.txt) and used various formulas to create the commands that would convert my text files into PDF.  The form of the command was like this:
notepad /p textfile.txt
I wasn't sure in the case of Notepad specifically, but I was able to run some programs (e.g., Word) from the command line by just typing one word (instead of e.g., "notepad.exe," or a longer statement of the path to the folder where e.g., winword.exe was located) because I had put the necessary shortcuts in C:\Windows.

Those Notepad commands would send the text files to my default printer.  My default printer was Bullzip.  When I installed it, it gave me a separate shortcut leading to its options.  For this purpose, I set its options so that it did not open the document after creation (General tab), specified an output folder (General tab), and indicated that no dialogs or questions should be asked (Dialogs tab).

I copied the desired commands from Excel to a Notepad text file and saved it with a .bat extension.  The rest of the file name didn't matter, but the .bat extension was important to make it an executable program.  In other words, if I double-clicked on PrintThoseFiles.bat (or if I selected PrintThoseFiles.bat and hit Enter) in Windows Explorer, the batch file would run and those commands would execute.  (I could also run the batch file from the command line, just by typing its name and hitting Enter -- which meant that I could have a batch file running other batch files.)
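
To illustrate, the Excel formula was presumably something along the lines of ="notepad /p "&CHAR(34)&A1&CHAR(34), so that PrintThoseFiles.bat ended up containing lines like these (filenames made up):
notepad /p "D:\Notes\meeting minutes.txt"
notepad /p "D:\Notes\see you tomorrow.txt"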

So that pretty much did it for me.  I ran the batch file, running lots of Notepad commands, and it produced lots of good-looking PDFs.

Please feel free to post questions or comments.

Sunday, March 11, 2012

Printing Webpages as PDFs from the Command Line

I was looking for a way to print a bunch of webpages to PDF files from the command line. This page describes the search that, as before, brought me to wkhtmltopdf.

One approach, it seemed, was to use Pdf995 and Omniformat. I had been frustrated, last time I tried pdf995, but nearly a year had passed, and this was a different project. Maybe this time it would work. They seemed to want me to install pdf995 and then install Omniformat. Not being entirely sure which ones I would need, I installed a half-dozen programs from their webpages. They said Omniformat would include HTML2PDF995, which would permit command-line conversions among formats including HTML and PDF. So that sounded promising. Installation of Omniformat brought up an HTML page included in the program (evidently not available online) that said the command line syntax was like this: omniformat.exe [input file] [output format]. So in my example, it would look like this:

omniformat.exe http://www.cnn.com/Chinastory.html "png"
In that case, the Omniformat command wouldn't give the file the desired name, so I would have to add a command to do that. I tried it, just doing the part as shown for now. I got the error, "omniformat.exe is not recognized as an internal or external command, operable program or batch file." In other words, it wasn't part of the computer's path. I had to run this command from within the folder where omniformat.exe was installed. A search of my computer said that the folder in question would be "C:\Program Files (x86)\omniformat." So I ran that omniformat.exe command there. But it opened up the GUI and made me wait for maybe 20 seconds until it would open a session of Internet Explorer, so that it could display its adware; but then that failed with "An error has occurred in the script on this page." Same thing if I tried using a file on my computer rather than a webpage's URL in the command. It seemed that pdf995 was still not going to work for me.

A search led to Total HTML Converter which, for $50, promised to do exactly what I needed: convert webpages to JPG and possibly to PDF from the command line. There didn't seem to be a listing on CNET for Total HTML Converter. It got three stars from 21 users (3,582 downloads) on Softpedia. Fifty bucks for a three-star program ... hmm.

A Softpedia search for similar programs turned up Spire PDF Converter (rated 3.0 by four users; 1,057 downloads), HTML to PDF Converter (rated 3.6 by eight users; 6,353 downloads), 7-PDF Website Converter (rated 3.7 by 10 users; 1,746 downloads), HTML_ to PDF (rated 3.2 by 19 users; 2,225 downloads), and Gerolf Markup Shredder (rated 2.8 by 23 users; 1,816 downloads). Gerolf was the only one whose description said it could run from the command line. I checked the homepages of the others to see about them. Spire said nothing about it. Likewise HTML to PDF Converter, and 7-PDF. I wasn't sure about HTML_ to PDF, so I downloaded that and Gerolf. HTML_ to PDF gave me an unzipped folder with no executables; it looked like I would have to learn something about PHP programming to use it.

Meanwhile, Gerolf's installation asked me if I wanted to install GMS, to which I said sure, go ahead. Then it gave me a dialog partly in German, to which I replied Ja. Next came an almost entirely German dialog that seemed to be asking where I wanted to unpack the installation files. Its Durchsuchen (Browse) button took me to a Temp folder, so I just clicked on that and said OK. Next, a dialog telling me to run gmsunzip.bat to install. Apparently I should have written down where I unpacked the files. Fortunately, Everything found gmsunzip.bat, so I did run it. I pressed the Whatever key to move past its first screen of information. It was starting to look like I should have chosen a more permanent location, so I went back and started over with the installation. Now I understood that its first dialog, referring to GMS, was of course referring to Gerolf Markup Shredder, and not to some other program; I just hadn't understood that it was asking me if I wanted to install the thing that I had just double-clicked on. So now I Durchsuched to a newly created folder called C:\GerolfHTMLtoPDF, and after the installation I went there and ran gmsunzip.bat. Unfortunately, at the end, I got a message indicating that this was an unsupported 16-bit installation that was incompatible with 64-bit versions of Windows. So I would have to run it in a Windows XP Virtual Machine.

While thinking about that, I went back to HTML_ to PDF Converter and took a closer look. The second script on the webpage seemed to be something that I might be able to just copy into Notepad, save as an HTM file, and double-click on. I tried that. No, it was going to require some PDF knowledge, though maybe not much. Now I noticed that Gerolf would not go away. It kept insisting on telling me, again and again, about the Unsupported 16-bit Application problem. I had to use Start > Run > taskmgr.exe. But, whoa, what's this? "Windows cannot find 'C:\Windows\System32\taskmgr.exe'." Had one of these foolish programs, or something else, screwed up my system? I could see that taskmgr.exe was indeed in the System32 folder. Hmm. Not clear what was happening. Eventually I found that a CMD window was running; I had to kill that to shut off the recurrent dialogs. But that didn't fix the problem with taskmgr.exe. Maybe a reboot would ... later.

I went back to my previous post on a somewhat similar problem. The most promising solutions there seemed to be PrintHTML, print all linked files from an HTML page in Internet Explorer, or use wkHTMLtoPDF. I shied away from wkHTMLtoPDF because it was so complicated. I installed PrintHTML and the DHTML Editing Control (required on some systems, evidently including mine, judging from error messages when I tried running PrintHTML without it), and then looked at its instructions. It seemed to be just designed to permit some tinkering (e.g., margin adjustments) while printing local HTML files; no clear indication of how it would work with a webpage. I tried this command:
printhtml.exe file="https://www.nytimes.com"
(I had to run that command from within the folder where PrintHTML was installed.) It gave me a nearly blank page. It seemed that, basically, it was not designed to do what I needed. How about the approach of printing linked files from within Internet Explorer (IE)? The concept was that I could create an HTML page containing links to the webpages I wanted to print, and IE could be persuaded to print them all. I wasn't sure if they would print as one big PDF that I would have to split apart, but that seemed likely. In that case, the files wouldn't have the desired individual names. This tentatively seemed to be another case where the approach was designed for local files, not for webpages.

On this basis, I went back to wkHTMLtoPDF, as described in another post in this blog, posted at about the same time as this one, on the subject of Converting URL-Linked Webpages to PDF.

Wednesday, February 29, 2012

Excel 2003: Print or Export the Formulas Used in Each Cell

I had a spreadsheet in Excel 2003.  (I suspect the approach used here would also work in other versions of Excel, but I have not tried it.)  I wanted to print out the formulas used in each cell.  I did a couple of searches and wound up in a thread where they were advising me to use a macro for this purpose.  The steps I used to set up the macro were similar to those that I had used for another Excel macro:

  1. Close all Excel files other than the one you're working on.
  2. Go into Tools > Macro > Visual Basic Editor > Insert > Module.
  3. Copy and paste macro text (see below) into the window.
  4. Go to File > Close and return to Microsoft Excel.
  5. In this case, I used the macro by going into Tools > Macro > Macros and running the ListFormulas macro.
The text of the macro -- what I copied and pasted into the module window -- was as follows:
Sub ListFormulas()
    Dim FormulaCells As Range, Cell As Range
    Dim FormulaSheet As Worksheet
    Dim Row As Integer
    
'   Create a Range object for all formula cells
    On Error Resume Next
    Set FormulaCells = Range("A1").SpecialCells(xlFormulas, 23)
    
'   Exit if no formulas are found
    If FormulaCells Is Nothing Then
        MsgBox "No Formulas."
        Exit Sub
    End If
    
'   Add a new worksheet
    Application.ScreenUpdating = False
    Set FormulaSheet = ActiveWorkbook.Worksheets.Add
    FormulaSheet.Name = "Formulas in " & FormulaCells.Parent.Name
    

'   Set up the column headings
    With FormulaSheet
        Range("A1") = "Address"
        Range("B1") = "Formula"
        Range("C1") = "Value"

        Range("A1:C1").Font.Bold = True
    End With
    
'   Process each formula
    Row = 2
    For Each Cell In FormulaCells
        Application.StatusBar = Format((Row - 1) / FormulaCells.Count, "0%")
        With FormulaSheet
            Cells(Row, 1) = Cell.Address _
                (RowAbsolute:=False, ColumnAbsolute:=False)
            Cells(Row, 2) = " " & Cell.Formula
            Cells(Row, 3) = Cell.Value
            Row = Row + 1
        End With
    Next Cell
    
'   Adjust column widths
    FormulaSheet.Columns("A:C").AutoFit
    Application.StatusBar = False
End Sub 
(Note that the format of this blog may wrap some lines.  Copying and pasting may yield better results than retyping.)  The author of that code, John Walkenbach, also offered a Power Utility Pak ($40) that contained this and other tools.  I had installed a couple of freeware utility collections -- ASAP Utilities and Morefunc -- and I hardly ever used them.  I checked the list of tools in ASAP Utilities, just in case, but didn't find anything quite along these lines.  A quick glance revealed no list of Morefunc utilities.

When I ran the macro (step 5, above), it seemed to hang my machine.  I was using a fairly large spreadsheet -- I probably should have tried it on something smaller -- but instead I went to bed.  I didn't know how long it took, but it worked.  When I awoke, it had created a new worksheet (i.e., a new tab at the bottom of the spreadsheet), with three columns:  Address (e.g., F2), Formula (e.g., =C2), and Value (e.g., 17).

Monday, February 13, 2012

Windows 7: Testing/Verifying/Validating PDFs

I had previously gotten the impression that I could test PDFs by using IrfanView to convert them to JPGs.  (This was different from the approach I had recently taken to merge scattered JPGs into multipage PDFs.)

In a dry run, IrfanView had balked at a bad PDF, but had converted the good ones.  So the scenario was that I would run the conversion; check the output folder; verify that it had the correct number of files; and if the numbers of files didn't match up, I would go hunting for whatever was missing.

Now I had a bunch of PDFs that I wanted to test, so it was time to try out that theory.  I had found it helpful to use an Excel spreadsheet to work up the list of files to test.  (The post on multipage PDFs, above, contained some discussion of how I used Excel to create and massage lists of files.  More information appeared in an earlier post on renaming thousands of files and in another recent post on using a batch file to sort files.)

The PDFs that I wanted to test were scattered across different folders.  The conversion scenario would have me create JPGs from these PDFs, where the JPGs would all be in one folder.  That way, I could easily count them, keep them from cluttering up other folders, and delete them after I was done counting them.  A problem with all those JPGs converging into one folder was that I might have two identically named source PDFs in different folders.  For instance, there might be something called File001.pdf in D:\FolderB, and another completely different File001.pdf in E:\FolderQ.  The JPGs resulting from these two different PDFs would either overwrite each other or fail to come into existence, depending on how I set IrfanView's conversion process.  This would screw up my count and would potentially fail to test some PDFs.  I could surmount this problem by batch renaming those files into unique names, as long as I kept the list of what I had renamed so that I could rename them back when I was done screwing around.  That approach would involve time-consuming extra steps, though, so I was hesitant.  (For more information, go to this webpage and search for "ZZZ_00001.jpg.")

There was another problem, as I thought about it.  A three-page PDF would presumably convert into three one-page JPGs.  So my file count would get messed up that way too.  I could probably opt to convert instead from multipage PDFs into multipage TIFs, but I wasn't sure what would happen if one page of the PDF was junk.  I would have to experiment to see if the TIF would swallow it or barf.

These reveries were interrupted by an actual test.  I tried using IrfanView to batch-convert three PDFs to JPG.  It gave me errors:  "Can't load D:\Current\Text\x1.pdf" (and likewise for x2.pdf and x3.pdf).  One was a single-page PDF, so multipage issues weren't the problem.  I couldn't understand it.  I had previously used IrfanView for this purpose.  The PDFs opened OK in Adobe Acrobat (and would presumably do so in a free or less expensive alternative to Acrobat).

It occurred to me that maybe I could use Acrobat > Advanced > Document Processing > Batch Processing.  I tried that on my test files, saving them as RTFs rather than JPGs to circumvent multipage issues.  It was surprisingly slow, and the resulting RTF files were empty.  Not a promising start.  Going back to somewhere near the original plan, I tried using Acrobat to convert to JPG instead of RTF, and that worked.  As expected, each PDF page became a distinct JPG.  For instance, x2.pdf became x2_Page_01.jpg and x2_Page_02.jpg and so forth.  I would have to do further filename massaging in Excel, or maybe run a series of DEL commands (e.g., DEL *02.jpg, DEL *03.jpg, etc.) to see if the number of output JPGs (or groups thereof) matched the number of PDFs tested.

I wondered, at this point, why I couldn't just batch print the PDFs being tested -- print them to PDFs in another folder, that is, and do the file count and then delete the prints.  Would a junk PDF print?  I created a junk PDF by taking a copy of one of my test files, opening it in Hexedit, looking a little ways down in the ASCII column for Root # 0 R (in this case, it was Root 124 0 R), changing it to Root 00 0 R (inserting 30 as the hex value for zero), and saving it.  Then I made Bullzip my default PDF printer, changed its settings so that it would print without stopping to ask questions (via the Options shortcut in the Bullzip program folder), selected all four of my test files, and went to right-click > Print.  It printed three of the four.  I didn't have the settings right -- it still asked for filenames -- but the test worked:  for the file I had just made into a junk PDF, Bullzip (or, actually, Acrobat, my default PDF reader) gave me an error message ("There was an error opening this document.  The root object is missing or invalid" -- which was, of course, exactly what I had changed), and no output PDF was created.  So this approach of trying to print to PDF would work to identify at least some kinds of defective PDFs.

Unfortunately, that error message didn't specify which file was defective.  So that approach would require me to subtract the files that had successfully printed from the larger set of files that I had asked to print.  That might be a pretty fast process:  use the same output filenames, pause for five or ten minutes, and then use Windows Explorer to copy the output PDFs over the original PDFs and sort by timestamp.  The PDFs with the visibly older timestamps, after that maneuver, would presumably be the ones that had failed to produce anything to overwrite them.  This approach would wipe out my originals, which I did not want, so it would probably be best done on copies of the originals.  If there was some need to reverse the timestamp comparison, I could probably fiddle with the system clock before the copying step, so as to produce artificially ancient output PDFs.
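
A plain directory comparison might have been simpler and would not have touched the originals at all.  Here is a minimal sketch of that alternative (untried; the folder names are hypothetical, and it assumes Bullzip is set to name each output PDF after its source document), listing every source PDF that has no same-named file in the output folder:

@ECHO OFF
REM Report source PDFs that produced no same-named output PDF.
SET "OUT=D:\Bullzip Output"
FOR /R "D:\PDFs To Test" %%F IN (*.pdf) DO (
  IF NOT EXIST "%OUT%\%%~nxF" ECHO Missing output for: %%F
)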

Another approach might have been to use an Acrobat-type program to concatenate many if not all of the PDFs being tested into one PDF.  I wasn't sure if junk PDFs would concatenate.  I selected my test files > right-click > Combine supported files in Acrobat.  Acrobat said, "There was an error encountered while combining files.  Do you want to open the combined file or return to the file list and try again?"  I told it to open the combined file.  Acrobat's Bookmarks pane showed bookmarks for each of the good files, but no bookmark for the bad one.  So that would be one way of getting a list of good PDFs.  Of course, the concatenation process could be slow, especially because the resulting document could be huge.  The size of that document might also cause the Acrobat-type program to crash.

But this still wasn't giving me a testing approach that would test PDFs in place, without requiring me to relocate them to a single folder where I could manipulate them.  I could try to work up a batch command that would print the PDFs on my list to a common output folder from where they were, but in that case I wouldn't have two simple lists to compare.  Unless I could persuade the batch command to report its errors to a log, I would apparently have to go through the printing process manually, making sure to write down or attend to each PDF that didn't print.

I ran a search, to see if Bullzip could escort me out of this situation.  This stratagem led, strangely enough, to the Bullzip manual, to which I probably already had at least a link in my Bullzip program folder.  But the manual -- besides being no fun -- seemed to be oriented toward installation rather than usage.  A search in the Bullzip forum led to the suggestion that I look at a bioPDF webpage.  bioPDF seemed to be telling me that I could use a program called PrintTo.exe to do the job.  But where could I find this PrintTo.exe program?  I wasn't seeing a link to it there on the bioPDF site.  A search of files on my computer turned up nothing.  It didn't register when I typed "printto /?" on the command line.  Softpedia didn't have it.  And yet a search produced indications that Bullzip users were using PrintTo too.  Baffling.  Another search led, directly or indirectly, to a FineThatFile webpage where I was able to download printto.exe as part of a zip file containing other stuff.  I unzipped it and ran "printto /?" in the folder where I had unzipped it.  It turned out to be a biopdf.com product after all.  The syntax was simply "printto filetoprint printername" -- using the default printer if no printername was specified.

So, alright.  Bullzip was already my default printer, so I would be test-printing those PDFs to some temporary folder with a simple command:  printto filetoprint.  There didn't seem to be an option to specify an output folder on the command line.  Apparently I would have to do that in Bullzip.  It took some tinkering, but eventually it came together.  It didn't look like printto.exe was eager to print JPGs, but that was alright; I didn't need that now.  Right now, I was just doing PDFs.  I did get it to print designated PDFs to a designated output folder from the command line without pausing, except in case of overwrite; I did want to be notified about that.  Printto.exe had to be present in the folder where the command or batch file was running, I assumed, but that was manageable.  When it got to my bad PDF, printto.exe gave a command-line error:  "ERROR:  Invalid file name specified."  I had forgotten to put that file's name into quotes.  (Unlike the others, that name had a space in it.)  I added quotation marks and tried again.  This time, when it got to the bad PDF, it gave me the error (above), "There was an error opening this document."  When I clicked OK, my little test batch file continued to print the next file in line.  So it looked like this was going to work.

Regrouping, then, the situation was as follows:  I had set up Bullzip to print PDFs to a specified folder called Bullzip Output, without pausing for any dialogs except error messages and overwrites.  I had downloaded printto.exe, and it was now sitting in the folder where I had also saved a file, created in Notepad, called Printer.bat.  Printer.bat contained commands of this nature:

printto "D:\Folder Name\File to Test Number 1.pdf"
Printer.bat contained one line like that for each of the PDF files I wanted to test.  The idea was that I could double-click on Printer.bat, or type "Printer.bat" on the command line, and it would try to print the PDFs I was testing.  It would put the resulting PDFs (that is, the Bullzip printouts of the PDF files being tested) into the Bullzip Output folder.  Unless it encountered corrupt files or potential overwrites, it would work -- slowly -- through the list of PDFs I wanted it to test.  I hadn't seen an option to steer the error messages to a log instead of showing them onscreen.  A log would have been better:  the batch file wouldn't sit idle, waiting half the night for me to wake up and fix a problem.  Maybe Bullzip or some other PDF printer offered that; I just hadn't seen it.  It would be something to look into next time.  Hopefully Printer.bat would not encounter many corrupt PDFs.
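
Instead of hand-building one printto line per file, a FOR loop could read the same filenames from a plain text list.  A minimal sketch, assuming (as above) that printto.exe sits in the folder where the batch file runs, and using a hypothetical list file with one full path per line:

@ECHO OFF
REM Print every PDF named in PDFList.txt (one full path per line)
REM to the default printer -- Bullzip, in this setup.
FOR /F "usebackq delims=" %%F IN ("PDFList.txt") DO (
  printto "%%F"
)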

I decided to run Printer.bat from the command line, so that I could watch what was going on.  One problem emerged almost immediately:  after Bullzip created a PDF, Acrobat would open up, even though I had checked the Bullzip option that said, basically, do not open the document after creation.  So, fine, the document would not open, but Acrobat would.  It would just sit there with a blank page, and that was fine, except printto.exe would not proceed with the next file to be processed until I killed Acrobat.  I wondered if things would work differently if I designated a different PDF default reader.  To test that, I right-clicked on a PDF file at random and went into Open With > Choose Default Program > Always use the selected program to open this kind of file > Browse.  I browsed to FoxitReader.exe (which may not have been its original name) and selected it.  I double-clicked on a random PDF to make sure that it would now open in Foxit rather than in Acrobat.  When I tried running Printer.bat again, I got a rapid series of error messages.  The gist of these messages was, "No application is associated with the specified file for this operation."  There were no files in the Bullzip Output folder.  Operation failed.  Now what?

I guessed that I was getting those error messages, not because Foxit was not registered as the default PDF reader at this point, but because something about its status as a portable rather than an installed program was confusing printto.exe.  I was surmising, in other words, that printto.exe needed a PDF reader to be installed.  This suggested that Acrobat was opening after each printing of a PDF, not because of some failure in Bullzip, but because printto.exe needed that.  So then could I perhaps insert a batch file command that would kill Acrobat after printto.exe gratuitously started it up?  Or could I install some other (non-portable) PDF reader that would respond differently than Acrobat had done?  Or would it perhaps help, somehow, if I left Acrobat open to another PDF file before running Printer.bat?

Trying that last possibility, I restored Acrobat as the default PDF reader, opened a PDF in it, and tried Printer.bat again.  This time, Printer.bat took off like a shot.  It processed a couple dozen PDFs almost instantly.  Then it slowed way down, but it seemed this was only because Bullzip was printing the PDFs much more slowly than printto.exe was printing them.  Evidently having a PDF already open in Acrobat was the solution.  Don't ask me why.

Well, now that we had worked out the terms of engagement, printto.exe and Bullzip seemed poised to execute the balance of their little pas de deux with grace.  Every few seconds, another PDF would be printed into the output folder.  Ah, but then the overwrite warnings began popping up.  I didn't have time to rename one existing PDF, so as to make room for its brother, before another duplicate warning would interfere with the manual renaming process.  I could have let them go until the end, but I was afraid there might be a lot of overwrite warnings, and the computer would crash or I would get corrupted results.  This was a pretty clumsy operation in the end.  It would have been advisable to detect duplicate filenames before starting, and to assign duplicates a temporary or visibly distinct filename for this process.
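
That duplicate check could have been done up front with a short batch file.  A minimal sketch (not something I ran), reporting any PDF names that occur more than once under a hypothetical test root:

@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
REM List all PDF names under the test root, sort them, and report
REM any name that appears more than once.  (Names containing "!"
REM would need extra care because of delayed expansion.)
DEL names.txt sorted.txt 2>NUL
FOR /R "D:\PDFs To Test" %%F IN (*.pdf) DO ECHO %%~nxF>>names.txt
SORT names.txt > sorted.txt
SET "PREV="
FOR /F "usebackq delims=" %%N IN ("sorted.txt") DO (
  IF /I "%%N"=="!PREV!" ECHO Duplicate name: %%N
  SET "PREV=%%N"
)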

But anyway, when this was done, I had 147 items in the output folder.  There had been no error messages.  Sadly, Printer.bat contained 148 lines.  Somewhere, we had a laggard.  I took a stab at finding it.  Failed.  I guessed I had allowed something to get overwritten, but my approach was way too sloppy to figure out which test output PDF got wiped.  I decided that I probably already knew it was a good PDF, since I'd gotten no error message while printing.  So that was the end of this test.

Saturday, April 23, 2011

Windows 7: Archiving Emails (with Attachments) from Thunderbird to Individual PDFs - First Try

I had been collecting email messages in Thunderbird for a long time.  I wanted to export them to PDF-format files, one message per file, with filenames in this format:

2011-03-20 14.23 Email to John Doe re Tomorrow.pdf
reflecting an email I sent to John Doe at 2:23 PM on March 20, 2011 with a subject line of "Tomorrow."  This type of filename would sort correctly in Windows Explorer, chronologically speaking; hence, I could quickly see the order of messages.  There were a lot of messages, so I didn't want this to be a manual process.  This post describes the steps I took to make it semi-automated.

The Screen Capture Approach

The first thing I did was to combine all of the email messages that I wanted to export into one folder in Thunderbird.  Then I deleted duplicates from that folder.  Next, I decided that, actually, I was just interested in exporting messages prior to the current year, since recent messages might have information I would want to search for in Thunderbird.  So I moved the older messages into a separate folder.  I maximized the view of that folder in T-bird and shut down unnecessary layout features (i.e., message pane, status bar), so that I could see as many emails as possible on the screen, and as much as possible of the relevant data (i.e., date, time, sender, recipient, subject) for each email.  I did that because I wanted to capture the text information about the individual email messages.  The concept here was that I would do a screenshot for each screenful of emails, and would save the data from that screenshot into a text file that I could then massage to produce the desired filenames.  For this purpose, I tried a couple of searches; I downloaded and ran JOCR; but after a bit of screwing around I decided to revert to the Aqua Deskperience screen text capture shareware that I had purchased years earlier.

Index.csv

Then it occurred to me that perhaps I could just export all that information from T-bird at once.  I ran a search and downloaded and installed the ImportExportTools add-on for T-bird.  (Alternatives to ImportExportTools included IMAPSize and mbx2eml, the latter explained by About.com.)  It took Thunderbird a while to shut down and restart after the installation.  I assumed it was getting acquainted with the fact that I had relocated so many messages to a new folder.  When it did restart, I ran the add-on (Tools > ImportExportTools > Export all messages in the folder > Just index (CSV)).  I opened the CSV file (named index.csv) in Excel and saw that this was perfect:  I had a list of exactly the fields mentioned above (date, time, etc.).  I verified that Excel was showing a number of rows equal to the number of messages reported on the status bar back in Thunderbird.

I noticed that some of the data in the Excel file included characters (i.e., \ / : * ? " < > | ) that were not permitted in Windows filenames.  The Mbx2eml option (above) would have removed these characters automatically, but for this first time I wanted to do everything manually, so as to see how it was all working.  I thought this might also be better for purposes of making things the way I wanted them.  I was also not sure that Mbx2eml would produce a CSV file, or that it would output the emails in the same order.  There seemed to be some other potential limitations.  It looked like a worthy alternative, but I didn't explore it.

Somewhere around this point, I went ahead prematurely with a time-consuming effort to revise the entries in the spreadsheet, so as to remove the unacceptable characters and otherwise make them look the way I wanted.  Eventually, I realized that this was a mistake, because now I would have a hard time matching spreadsheet entries automatically with the actual emails that I would be exporting from Thunderbird.  So I dropped that attempt and went back to the point of trying to plan in advance for how this was all going to work.

Attachments

I had assumed that I wanted to export emails to individual .eml files because EML format would bring along any attachments that happened to be included with a particular email message.  But I didn't plan to just leave the individual emails in EML format; I wanted to save them all as PDFs.  In other words, I wanted to have the email and its attachment within a single PDF.

A quick test showed me that printing EMLs would be no more successful at including the attachments than if I just printed directly from Thunderbird, without all this time-consuming exporting and renaming.  There were other solutions that would have worked for that purpose as well.  A search led me to InboxRULES, which for $40 would do something or other with attachments in Outlook.  (Alternate:  Automatic Print Email for $69.)  There didn't seem to be a solution for Thunderbird, and I wasn't keen on spending $40 and having to install Outlook and move all these emails there in order to print their attachments.  I thought about handling the attachments manually -- print the email first, then print the attachment, and append it to the email -- but a quick sort in Thunderbird told me there were hundreds of messages with attachments.  Funny thing about that, though:  as I arrow-keyed down through them in Thunderbird, allowing them to become visible one at a time, I saw that Thunderbird would change its mind about many of them:  it thought they had attachments, but then it realized they didn't.  That trimmed out maybe 5% of the ones that had originally been marked as having attachments.  But there were still a lot of them.

Another search led to some T-bird options, but it still looked like there was going to be a lot of manual effort before I'd have a single PDF containing both the email and its attachment.  Total Thunderbird Converter looked like it might come close, at a hefty price ($50).  It wasn't reviewed on CNET.com or anywhere else, as far as I could tell, so there was a risk that (as I'd experienced in other cases) the program simply wouldn't work properly.  But then I saw that they were offering a 30-day free trial, so I downloaded and installed it.  It turned out to be useless for my purposes:  it had almost no options, and therefore could not find my Thunderbird folders, which I was saving on drive D rather than C so as to avoid losing them in the event of a Windows update or reinstallation.  I looked at Email Open View Pro (or its alternate, emlOpenView Free), which also offered a free trial.  It didn't look like it (or Universal Converter, or MSG Viewer Pro, or E-mail Examiner, or Convert EML to PDF) would bring the attachments into the same PDF as the email, so I moved on.  I tried Birdie EML to PDF Converter.  Their free demo version allowed me to convert one EML file at a time.  I liked its interface:  it gave me eight different naming options for the resulting file (e.g., "date + subject + from," in several different date formats).  I didn't like the output, though:  they formatted the PDF for the EML file oddly, with added colors that I didn't want, and all they did with the attachment was to put it into a subfolder bearing the same name as the resulting PDF.  I'd still have to PDF it -- the example I used was an EML with a .DOC file attachment -- and merge it with the PDF of the EML.  But now they had led me to see that perhaps I could at least automate the extraction of attachments, getting me partway to the goal.

At about this point, Thunderbird inconveniently lost a folder containing several thousand email messages.  It just vanished.  The program had been struggling for a few minutes before that, and part of me was instinctively thinking that I should shut down the program and do a backup, but this would have been a very deeply subconscious part of me that was basically unresponsive under real-life conditions.  In other words, I didn't.  So now I had to go rooting around in backups to see what I could rescue from the wreckage.  I found that Backup Maker had been happily making backups, as God intended.  Amazing what can happen when you have about five different backup systems running; in this case I had just wiped a drive, moved a drive, etc., and so of course Backup Maker was the *only* backup system that was actually in a position to restore real data when I seriously needed it.  What Backup Maker had saved was some files with an .MSF extension.  These were supposedly backups of Thunderbird.  But then, no, on closer inspection I realized these were much too small, so I had to do some more digging.  Eventually I did patch together something resembling the way things had been before the crash, so I could go back and pick up where I had left off.  A couple of days passed here, with other interruptions, so the following information just reports where I went from this point forward.

I had the option of just saving the Thunderbird file, or the exported emails, for some future date when there would perhaps be improved software for printing attachments to PDF in a single operation with the emails to which they were attached.  There had been times when software developments had saved (or would have saved) a great amount of time in what would have been (or actually was) a tedious manual effort.  On the other hand, I had also seen situations where letting something sit meant letting it become confused or corrupted, or where previous learning (especially on my part) had been lost.  I decided to go ahead with converting the emails to PDF to the extent possible without a tremendous time investment.

My searching led to Attachment Extractor, a Thunderbird extension.  I installed it, highlighted two emails with attachments, right-clicked on them, and selected "Extract to Suggested File-Folder."  It worked -- it did extract the attachments without removing them from the emails.  I assumed it would do this with hundreds of emails if I so ordered.  Then, to get them matched up with PDFs of the emails to which they were attached, I would apparently have to page down through those emails one by one, looking at what attachments they had, and setting them up for more or less manual combination.  Attachment Extractor did have one potentially useful feature for this purpose:  a right-click option to "Extract with a Custom Filename Pattern."  I found that I could configure the names given to the extracted attachments, so that they would correspond at least roughly with the names of emails printed to PDF.  To configure the naming in Attachment Extractor, I went into Thunderbird > Tools > Add-ons > Extensions Tab > AttachmentExtractor > Options > Filenames tab.  There, I used this pattern:
#date# - Email from #fromemail# re #subject# - #namepart# #count##extpart#
and, per instructions, in the Edit Date Pattern option I used this pattern:
Y-m-d H.i
That gave me extracted attachments with names that were at least roughly similar to the format I wanted (see above).

Batch Exporting Emails with Helpful Filenames

Now if I could print the corresponding email to PDF with a fairly similar name, the manual matching might not be so difficult.  A search led to inquiries about renaming PDF print output.  For $60, I could get Zan Image Printer, which sounded like it would have some capability for automating PDF printout filenames.  Print Helper, for $125 to $175, was another option.  A Gizmo's Freeware article did not say much about this kind of feature, though several people asked about it.  A list of free PDF printers led me to think that UltraPDF Printer was free and would do this; its actual price was $30. 

The pdf995 Experiment

At this point, I was reminded of how much time I could waste on uncooperative software.  No doubt many people have used pdf995 successfully.  I was not among them.

I tried Pdf995's Free Converter.  The instructions on bypassing the Save As dialog were in the Pdf995 Developer's FAQs page.  They seemed to require me to open C:\Program Files\PDF995\res\pdf995.ini in Notepad.  But that .ini file seemed to be configured for printing one specific file that I had just printed.  They didn't say how to adjust it.  Eventually I figured out that I needed to download and install pdfEdit995, and make the changes there.  So I tried that.  But I got an error message:
PdfEdit995 requires that Pdf995 v9.1 or later and the free converter v1.2 or later are already installed on your system.
But they were!  I had just installed them.  Was I supposed to reboot first?  No, a reboot didn't fix it.  I tried again to install basic pdf995 and the Free Converter, which I had downloaded together.  Once again, I got the webpage saying I had installed it successfully.  Was I supposed to install the printer driver too?  I understood the webpage to be saying that was included in the 9MB download.  But I tried that.  Got the congratulatory webpage, so apparently it installed correctly.  Now I noticed I had another error, which had not come up on top, so I was not sure how long it had been there:
Setup has detected you have an old version of pdfEdit995 that is incompatible with the latest version of pdf995.
But I had just downloaded it from their website!  Not an altogether auspicious beginning here.  But I downloaded and installed the very latest and tried again, and now it seemed to work, or at least I got a different congratulatory webpage than before.  A cursory read-through still did not give me a clear approach to automated naming of PDFs.  Instead, they said that maybe I wanted their OmniFormat program for batch PDF creation.  Who knew?  I downloaded and installed OmniFormat.  Got a new congratulatory webpage, but still no straightforward explanation of batch naming.  Instead, it said that pdfEdit995 was what I wanted for creating batch print jobs.  So, OK:  a bridge too far.  Though at this point they specified "batch print jobs from Microsoft Office applications," raising the prospect that this wasn't going to work from Thunderbird.

Went back to their incredibly tiny-print pdfEdit instructions page.  It said I would have to set pdf995 as the default printer to do the batch thing.  That was OK.  But it still sounded like it was intended primarily for batch printing from Microsoft Word.  I decided to just try making pdf995 the default printer.  That required me to go to the Windows Start button > Settings > Printers > right-click on PDF995 > set as default printer.  While I was there, I right-clicked on PDF995 and looked at its Properties, but there didn't seem to be anything to set for purposes of automating printing.

Now I went to Thunderbird, selected several messages, and did a right-click > Print.  Funny, it defaulted to Bullzip, which was my usual default printer.  Checked again:  yeah, pdf995 was definitely set as my default printer.  Tried again, and this time manually set it to pdf995 when it was printing.  It asked for the filename, so that wasn't working.  Back in Printers, I looked at the Properties for Bullzip, but it didn't seem to have any options for automatic naming either.  It seemed pdf995 was not the solution for me.  I came away from this little exploration with less time and ambition for the larger project.  Certainly I wasn't in the mood to buy software and then discover that I couldn't make it work.

Further Exploration

I ran across an Addictive Tips article that said PrintConductor was a good batch printing option, though I might need to have Adobe Acrobat installed first.  I did, so I took a look.  There was an option to download Universal Document Converter (UDC) as well.  I wasn't sure, but I thought I might need that for Print Conductor, so I installed both.  PrintConductor didn't seem to have a way of printing EML files.  Meanwhile, UDC's installer gave me the option of making it the default printer, so I tried that.  But as before, Thunderbird defaulted to Bullzip, so I had to select UDC as my printer manually.  (Print Conductor did not appear in the list of available printers.)  When I selected UDC as the printer, before printing, I clicked on the print dialog's Properties button and went into the Output Location tab.  There, I chose the "predefined location and filename option."  I left the default filename alone and printed.  And it worked.  Sure enough, I had a couple of PDFs whose names were the same as the Subject fields shown in Thunderbird for those printed emails.  So I would be able to match them with the attachments produced by Attachment Extractor (above).  All I had to do now was to pay $69 for a copy of UDC, so that each PDF would not have a big black "Demo Version" sticker on it.

Recap

So to review the situation at this point, I had a way of extracting email attachments with highly specific date, time, and subject filenames.  I also had a way of extracting the emails themselves with filenames showing date and subject, using ImportExportTools (above):  Tools > ImportExportTools > Export all messages in the folder > EML format.  Unfortunately, there could be a number of messages in a single day on the same subject.  Without the time data in the filename, I would have duplicates.  More to the point, it would be difficult to match emails and attachments automatically, and I didn't want to go through that matching process for large numbers of emails.  I would also prefer a result in which emails converted to PDFs would appear in the right order in Windows Explorer, and that would require the time data.  As I was writing this recap, several days after writing the foregoing material, I was not entirely sure I had verified that the output filename in UDC would include the time data.  But if that proved to be the case on revisiting it, one option at this point would be to buy UDC (or perhaps one of the other programs just mentioned) for purposes of producing properly named emails.  Another would be to export the list of emails to Index.csv (above) and hope that this list would match the order in which ImportExportTools would export individual emails.  There would still be the possibility that such a program would sometimes fail to do what it was supposedly doing, perhaps without my noticing until long after the data from which I had exported and renamed various files was gone.

The Interim Solution

I decided that, at this point, I could not justify the enormous time investment that would be required to complete this project -- in particular, to manually print to PDF each attachment to each email, to combine those PDFs, and to match and merge them with a PDF of the email message to which they had been attached.  This seemed like the kind of project that really had to await some further development in application software.  For all I knew, the kind of solution I was seeking already existed, and I was just waiting until the day when I would become aware of it.  It was not at all an urgent project -- I rarely consulted attachments for old emails, and almost never consulted them for prior years, where I was focusing my present attention.

I wanted to get those old emails out of Thunderbird.  I didn't like the idea of having all that data at the mercy of a relatively inaccessible program (i.e., I couldn't see those emails in Windows Explorer), and anyway I didn't want T-bird to be so cluttered.  It seemed that a good solution would be to focus on the emails themselves for now.  I would export them to EML format.  EMLs would continue to contain the attachments.  I would then zip the EMLs into a small number of files, each no more than a few gigabytes in size, perhaps on a year-by-year basis, and I would store them until further notice.  Before zipping, I would make sure the EMLs were named the way I wanted, and would print each of them to separate PDFs.  So then I would have at least the contents of the emails themselves in readily accessible format, and could go digging into the zip file if I needed an attachment.  If I did someday find a way to automate the task of combining the emails and their attachments into a single PDF, I would give those PDFs the same name as the current email-only PDFs, so that the more complete versions would simply overwrite the email-only versions in the folders where I would store them.

Export and PDF via Index.csv

I decided to try and see if the Index.csv approach would work for purposes of producing EMLs whose names contained all of the elements identified above (i.e., date, from, to, subject).  I had sorted the old emails in Thunderbird into separate folders by year.  I went to one of those folders in T-bird and sorted it in ascending date order.  Then I went into Tools > ImportExportTools > Export all messages in the folder > Just index (CSV).  This gave me what appeared to be a matching list of those messages, in that same order.  The number of lines in the CSV spreadsheet (viewed in Excel) matched the number of messages in that folder as stated in T-bird's status bar.

I wondered what would happen if I exported another Index.csv after sorting the emails in that T-bird folder in descending chronological order.  Good news:  the resulting index.csv produced in that experiment seemed to be reversed from the one I had produced in ascending order.  At least the first and last emails were in reversed positions.  So it did appear that index.csv matched the order that I saw in T-bird.

On that basis, I added an Index number column at the left end of the index.csv file I was working with, the one with emails sorted in ascending date order.  This index column just contained ascending numbers (1, 2, 3 ...), so that I could revert to the original sort order if needed.  I assumed that the list would continue to sort in proper date order, but I planned to revise the date field (presently in "7/4/1997 18.34" format) so that it could function for file sorting purposes (e.g., 1997-07-04 18.34).  I wasn't sure that the present and future date fields would always sort exactly the same.  I could have retained the existing date field, but I wasn't sure that it, itself, was reliable for sorting purposes:  would two messages arriving in the same minute always sort in the same order?

Now I exported the emails themselves:  Tools > ImportExportTools > Export all messages in the folder > EML format.  As partially noted above, these were named in Date - Subject - Number format.  I now did a search to try to figure out what that number signified.  It wasn't clear, but it seemed to be just randomly generated.  Too bad.  It would have been better if they had included the time at the start of that random number, and had put it immediately after the date, so that the EMLs would sort in nearly true time order.  (There could still be multiple emails on the same subject within the same minute, and T-bird didn't seem to save time data down to the second or fraction of a second.)  It seemed I would have to manually sort files bearing the same subject line and arriving or being sent on the same day.  There would surely be large numbers of files like that.  I now realized they would not at all be sorted correctly in Windows Explorer:  with only date (not time) data in the filename, a file arriving in the morning with a subject of Zebras would be sorted, in Windows Explorer, after a file arriving in the afternoon on the subject of Aardvarks, and if there were three on the subject of Aardvarks they would all be sorted together even if they had arrived at several different times of day.

Ah, but now I discovered that ImportExportTools had file naming options.  Silly me.  I had just overlooked that.  But there they were:   Tools > ImportExportTools > Options > Filenames tab.  I selected "Add time to date" and I chose Date - Name (Sender) - Subject format.  Now I tried another export of EMLs.  The messages now had names like this:
19970730-0836-Microsoft-Welcome!
I checked and, sure enough, that was a message from Microsoft on that date at 8:36 AM.  Suddenly the remainder of my job got a lot easier.  I went back to the Index.csv spreadsheet (now renamed as an .xls) and worked toward perfecting its match with the filenames produced by ImportExportTools.  There were two parts to this mission.  First, I had to rework the Index.csv data exported from T-bird so that it would match the filenames given to the EMLs by ImportExportTools.  Second, I would then use the spreadsheet to produce a batch file that would rename those files to the format I wanted.  This called for some spreadsheet manipulation described in another post.
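
To illustrate that second part, the generated rename batch would contain lines of the following sort, one per email.  The filenames here are hypothetical, built from the Microsoft example above and the target format described earlier; the spreadsheet formulas that produce such lines are covered in that other post:

REN "19970730-0836-Microsoft-Welcome!.eml" "1997-07-30 08.36 Email from Microsoft re Welcome!.eml"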

Converting EMLs to PDF

Now I faced the problem of converting the exported EMLs to PDF, as distinct from the problem (above) of exporting PDFs from Thunderbird. 

I found that EMLs could be converted into TXT files just by changing their extensions to .txt, which was easy enough to do en masse with a program like Bulk Rename Utility.  That would permit them to be converted to PDFs without the rich text, if necessary, since it was a lot easier to find a freeware program that would do that (or, in my case, to use Acrobat) than to find one that would PDF an EML.  This appeared to be a viable solution for some of the older emails, which had apparently been through the wringer and were not showing much sign of having glorified colors or other rich text or HTML features.
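
That extension change didn't strictly require a separate utility; a single command, run in the folder holding the exported EMLs, would do the same thing:

REN *.eml *.txt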

Before proceeding with this, I decided to export all of the remaining EMLs from Thunderbird.  I knew I could read the EMLs and the TXTs (if I renamed them as that); I also knew I could reimport them into T-bird.  This seemed like a separate step.  I also decided that going back through the exporting process would give me an opportunity to write a cleaner post that would summarize some of the foregoing information.