Friday, June 29, 2012

End of This Blog. Transition to WordPress.

For reasons sketched out in a previous post, I do not presently plan to post any more messages in this blog, and will instead be posting my technical stuff in a new WordPress blog.

Friday, June 22, 2012

Finding and Cleaning Up EMLs That Display HTML Codes as Text

I had a bunch of email (EML) files scattered around my hard drive.  Some of them, I noticed, were displaying a lot of HTML codes.  For example, when I opened one (using Thunderbird as the default EML opener), it began with this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0">
<TITLE>RE: Scholar Program</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/rtf format -->

I was not sure how that happened.  Apparently I had run these EMLs through some kind of conversion process, perhaps after renaming them to be .txt files.  Whatever the origin, I wanted to eliminate all those HTML codes and wind up with a plain text file, probably saved as a PDF.  This post describes the steps I took to achieve that outcome.

Finding the Offending Files

As I say, the files containing this text were scattered.  Initially, I did a search for some of the text shown above (specifically, for "<!DOCTYPE HTML PUBLIC") in Copernic.  (I assume any tool capable of searching for text within files would work for this purpose.)  I thought maybe I would just copy and paste the lot of them from Copernic to a separate folder in Windows Explorer, where I could work on them in more detail.  This approach failed because Copernic did not allow me to select and move multiple files to other folders.  Moreover, Copernic did not display them with their actual filenames; rather, it showed the title indicated in the HTML "<TITLE>" line (see example above).

It was probably just as well.  Moving them in bulk from Copernic would have lost the indications of the folders where they were originally located.  The better approach, I decided, would be to use the command line and batch files to identify their source folder, move them to a single folder where I could work on them, and then move the resulting, cleaned-up files back to the folders where the originals had come from.

So the first thing I needed was a way to locate the files to be cleaned up.  I decided to use a batch command for this purpose.  I could have searched for every file (or just every EML file) that contained any HTML codes.  For that purpose, a search for "</" might have done the trick.  But then I decided that there could be a lot of HTML codes floating around out there, in various files, for a lot of different reasons; and for present purposes I didn't need to be trying to figure out what was happening in all those different situations.  So instead, I searched for the same thing as before:  "<!DOCTYPE HTML PUBLIC."  To do that, after several false starts, I tried this command:
findstr /r /m /s "<!DOCTYPE HTML PUBLIC" D:\*.eml > D:\findlist.txt
It produced a dozen "Cannot open" error messages.  The reason seemed to be that the filenames for those files had funky characters (e.g., #, §).  Also, Findlist.txt contained the names of files that did not seem to have the DOCTYPE text specified in the command.  (In hindsight, part of the problem may have been that, without the /c: switch, FINDSTR treats a quoted multi-word search string as several separate patterns, matching any one of the words.)  DOCTYPE may have appeared in attachments to those files, but I didn't want to be flagging that sort of EML file.  So despite a number of variations with FINDSTR and several Google searches, I gave up.  I returned to Copernic, searched for the DOCTYPE text (in quotation marks, as shown above), and moved the files manually.  Copernic had a convenient right-click Move to Folder option, so that helped a little.  So now, anyway, despite the imperfections of the process, I apparently had the desired EMLs in a single folder.  I would just have to re-sort them back to their original folders manually.

But I still wasn't sure that everything in that folder was problematic.  Basically, I needed to see what the EMLs looked like when they were opened up.  Ideally, I would have just clicked a button at this point to convert them to PDF and merge them into a single document, so I could just flip through and identify the problem emails.  But I was having problems in my efforts to print EMLs as PDFs.  As a poor second-best, I manually opened them all (again, using Thunderbird as my default EML opener), selected the ones needing repair in Windows Explorer, and moved them to a separate folder.  To open them, I just did a "DIR /b /a-d > Opener.bat" and modified its contents, using Excel, so that each one started and ended with a quotation mark (actually, CHAR(34)) -- no other command needed -- and then ran Opener.bat.  Somehow, this failed to crash my system.
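In hindsight, the find-and-collect step could have been scripted.  Here is a hypothetical Python sketch (the folder names and the 2KB header window are my own assumptions, not anything I actually used at the time) that finds EMLs containing the DOCTYPE marker near the top of the message, records their original locations, and copies them to a working folder:

```python
import shutil
from pathlib import Path

MARKER = b'<!DOCTYPE HTML PUBLIC'

def collect_bad_emls(source_root, work_dir, log_file):
    """Copy EMLs containing MARKER to work_dir, logging original paths
    so the cleaned-up files can later be re-sorted to their home folders."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    found = []
    for eml in Path(source_root).rglob('*.eml'):
        try:
            data = eml.read_bytes()
        except OSError:
            continue  # the "Cannot open" cases
        # check only near the top of the file, so that DOCTYPE text
        # buried in attachments does not trigger a false match
        if MARKER in data[:2048]:
            shutil.copy2(eml, work / eml.name)
            found.append(str(eml))
    Path(log_file).write_text('\n'.join(found))
    return found
```

Keeping the log of original paths would have solved the re-sorting problem that Copernic created.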

Cleaning Up the Files

After verifying that most of them looked bad (and removing the others), I made copies in another folder, and renamed the copies to .TXT extensions using Bulk Rename Utility.  Now I could edit them as text files.  My plan was to store up a set of standard search-and-replace items, mostly replacing HTML codes with nothing at all, so as to clean up these files.

I had previously decided on Emacs as my default hard-core text editor, and had taken some first steps in re-learning how to use it.  The task at hand was to find advice on how to set up before-and-after lists of text strings to be replaced.  It was probably something I could have done in Excel, but I might have had to cook up a separate spreadsheet for each file, and here I was wanting to modify multiple files -- dozens, possibly hundreds -- in one operation.  Now, unfortunately, it was looking like Emacs was not going to be as naturally adapted to this task as I had assumed.  After a couple of tries, I found a search that did bring up a couple of solutions to related problems.  But those solutions still looked pretty manual.  Was there some more tried-and-true tool or method for replacing multiple text strings in multiple files?

A different search led to HotHotSoftware, which offered a tool for this purpose for $30.  A video seemed to demonstrate that it would work.  But, you know, $30 was more than the files were worth.  Besides, I wouldn't learn anything useful that way.  ReplacePioneer ($39, 21-day trial) looked like it might also do the job.  A thread offered a way to do something like it in an unspecified language, presumably Visual Basic.  Another thread offered an approach in sed.  Another way to not learn anything, but also not to spend $30, was to try the free TexFinderX.  Other free options included Nodesoft Search and Replace and Replace Text.

I tried TexFinderX.  In its File > Add Folder menu pick, I added the list of files to be changed.  I clicked the Replacement Table button, but did not see the Open Table Folder button shown on the webpage.  The ReadMe file seemed to say that a new replacement table would appear in the list only after being manually created in the TFXTables subfolder.  They advised using an existing table to create a new one.  As I viewed their "Accented to None - UTF8.txt" replacement table, I recalled looking into character replacement using Excel formulas.  The specific point of comparison was that I had discovered, in that process, that people had invented various character conversion tables that might be suitably implemented with TexFinderX.

But for my own immediate purposes, I needed to see if a TexFinderX replacement table would accept a whole string of characters, to be replaced by nothing or, say, a single space.  I was hoping that what I was seeing, there in that "Accented to None" replacement table, was that the "before" and "after" columns were tab-delimited -- that, in other words, I could enter a whole long string, hit the tab key, and then hit the spacebar.  I tried that, first saving the "Accented to None" table under the name of "Remove HTML Codes," and then entering "<!DOCTYPE HTML PUBLIC "-//W3C//DTD W3 HTML//EN">" (without the outside quotation marks, of course) and hitting Tab and then Space.  I did this on what appeared to be the first replacement line in that "Accented to None" file, right after the line that said /////true/////, as guided by the ReadMe.  I hit Enter at the end of that line, and deleted everything after it, removing all the commands they had provided.  I also changed the top lines, the ones that explained what the file was about.  I saved the file, went into the program's Replacement Table button, and there it was.  I selected it and clicked Apply.  On second thought, I decided to try it on just one or two files, so I emptied out the list and added back just a couple of files.  Then I ran it.  It looked like it worked.

I proceeded to add all kinds of other HTML codes to my new Remove HTML Codes replacement table, testing and running and removing more unwanted stuff.  I found that it was not necessary to hit Tab and then Space at the end of each line that I wanted to remove; it would remove anything that was on a line by itself, where no other tab-delimited text followed it on the same line.  So, basically, I could copy and paste whole chunks of unwanted text into the replacement table, and it would be removed from any files on the list that happened to contain it.  It seemed best not to add too many chunks at once, lest I be repeating the same lines:  run a few, after eyeballing them for duplication, and then see what was left.  It appeared that I could add comments, on these lines in the replacement table, by again hitting Tab after the "replace" value on the line.
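The replacement-table idea is easy to replicate in a script.  A Python sketch (the table format mirrors TexFinderX's tab-delimited layout as I understood it; the function names are my own) that reads find/replace pairs and applies them to every file on a list:

```python
from pathlib import Path

def load_table(table_path):
    """Parse a tab-delimited replacement table: find<TAB>replace<TAB>comment.
    A line with no tab after the text is treated as 'delete this string'."""
    rules = []
    for line in Path(table_path).read_text(encoding='utf-8').splitlines():
        if not line or line.startswith('//'):
            continue  # skip blanks and comment/header lines
        parts = line.split('\t')
        find = parts[0]
        replace = parts[1] if len(parts) > 1 else ''
        if find:
            rules.append((find, replace))
    return rules

def apply_rules(text, rules):
    for find, replace in rules:
        text = text.replace(find, replace)
    return text

def clean_files(file_list, rules):
    for f in file_list:
        p = Path(f)
        cleaned = apply_rules(p.read_text(encoding='utf-8', errors='replace'), rules)
        p.write_text(cleaned, encoding='utf-8')
```

Unlike a GUI tool, this applies every rule in a single pass per file, so repeated runs are not needed for independent rules.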

I added back some of their original items (modified) to the replacement table.  These included the replacement of three spaces with two (which I might run several times to be thorough); the replacement of a Space-CR (Carriage Return) combination with a simple CR (using space-<13> tab <13> to achieve that, and apparently doing the same thing also with <10> in place of <13>).  I tried replacing three CRs with two, using <13><13><13> on the same line, but it didn't work.  The answer to that seemed to be to replace three pairs of <13><10> with two.  I discovered that the conversion process that had mangled these files originally had placed different parts of HTML code sequences on different lines, so I had to break them up into smaller pieces -- but not too small, because I didn't want to be accidentally deleting real text from my emails that happened to look similar to these HTML codes.
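Collapsing runs of blank lines took repeated passes in TexFinderX; in a script, the replacement can simply loop until the text stops changing.  A small Python sketch of that idea, working on CRLF-terminated text like these converted emails:

```python
def collapse_blank_lines(text):
    """Strip spaces before line endings, then repeatedly collapse three
    consecutive CRLF pairs down to two until the text is stable."""
    while ' \r\n' in text:
        text = text.replace(' \r\n', '\r\n')
    while '\r\n\r\n\r\n' in text:
        text = text.replace('\r\n\r\n\r\n', '\r\n\r\n')
    return text
```

This is the same <13><10> logic described above, minus the need to guess how many passes are enough.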

I basically worked through all the codes that appeared in one email, then started on those that remained in the next email after applying my accumulated rules to it, and so forth.  After working through the first half-dozen files in the list, I skipped down and ran the accumulated corrections against some others.  Running it repeatedly seemed to clear up some issues; possibly it was able to process only one change per line per run.  I realized that it would probably not produce perfect results across all cases.  It was succeeding, however, in giving me readable text that had previously been concealed beneath a mountain of HTML codes.

I had noticed that the program took a little longer to run as I added more rules to its replacement table.  But this did not seem to be due to file processing time:  the time did not grow far longer when I added far more files to the list.  It was still done within a minute or so in any case.  Apparently it was just reading the instructions into memory.

The excess (now blank) lines in the files were the slowest to remove.  I ran TexFinderX against the whole list of files at least a half-dozen times, adding a few more codes with the aid of additional spot checks.  Unless I was going to check every individual file for additional lingering codes, that appeared to be about as far as TexFinderX was going to take me in this project.

Cleaning Up the Starts and Ends of Files

I had previously used Emacs (see http://raywoodcockslatest.blogspot.com/2012/03/choosing-emacs-as-text-editor-with.html) to eliminate unwanted ending material from files.  Now I wanted to use a similar process on these files.  I also wanted to see if I could adapt that process to remove unwanted material elsewhere in the files.

I had not previously noticed that most if not all of these emails had originally included attachments.  As such, they included certain lines after their text, apparently announcing the beginning of the attachment portion.  These lines included indications of Content-Type, Content-Transfer-Encoding, and Content-Disposition.  These seemed like good places to identify the start of ending material to delete, for purposes of printing a cleaned-up message portion by itself.  I now saw that I had made things more difficult for myself by including references to some Content-Type and Content-Transfer-Encoding lines in my list of items to remove in TexFinderX.  I had not removed Content-Disposition lines, however, so -- as in the previous use of Emacs -- those would be my focus.

Having already done the initial setup of GNU Emacs as described in the previous post, I set forth to modify the process that I had used previously.  After making a backup, the summary version of those steps, as modified, went like this:
  • Start Emacs.  Open one of the post-TexFinderX emails.  Hit F3 to start macro recording.  C-End (that is, Ctrl-End, in Emacs-speak) to go to the file's end.  Hit C-r and type "Content-Disposition" to back up to its last occurrence of Content-Disposition.
  • At this point, modify the previous approach to back up a bit further, in search of the boundary line just preceding the Content-Disposition line.  I could have done this by hitting C-r and typing "----------" to find that boundary line, but now I saw that my TexFinderX replacements had deleted that, too, from some of these emails.  So instead, I just hit the Up arrow three times, hoping that that would take me to a point before most of the ending material.
  • Hit C-space to set the mark.  C-End.  Del.
The macro was still recording; I wasn't done.  The preceding steps did take care of the ending material in that particular file.  (As before, it was essential to avoid typographical errors, which would terminate macro recording or worse.)  But now, how about the unwanted starting material? I hadn't done this particular operation before, but it seemed straightforward enough.  I had to use C-Home to get to the start of the file.  Then -- since I had, again, deleted the objectionable boundary lines in some of these emails -- I had to search for the last surviving message header field.  In the case of the first email I was looking at, which I believed was probably the most thoroughly scrubbed, that last surviving field was Message-ID.  So I went through several additional but similar steps to clean up the start of the email and finish the task:
  • C-s to search for Message-ID.  Then C-e to go to the end of that line, and right-arrow to go to the start of the next line.  C-Space to set the mark, C-Home, and then Del.  That was as much as I could do with this particular email; it was clean, though not ideally formatted.
  • C-x C-s to save the file.  F4 to end the macro recording.  C-x C-k n Macro1 Enter (to name the macro to be Macro1).  C-x C-k b 1 (to bind the macro to key 1).
  • C-x C-f ~/ Enter (to find my Emacs Home directory).  In my case, Home was  C:\Users\Ray\AppData\Roaming\.emacs.d.  I went there in Windows Explorer and created a new text file named _emacs, with no extension.  This was my init file.
  • From the Emacs menu:  File > Open File > navigate to the new _emacs init file > select and open _emacs.  Using the Meta (i.e., Alt) key, I used M-x insert-kbd-macro Enter Macro1 Enter.  This hopefully saved my macro to my init file.  C-x C-c to save and quit Emacs.  A quick look with Notepad confirmed that there was something in _emacs.
  • Restart Emacs.  Open another of these text emails.  Test my macro by typing C-x C-k 1.  I got "C-x C-k 1 is undefined." I killed Emacs and, following advice, in Windows Explorer I renamed _emacs to be init.el and tried again.  Still undefined.  Since _emacs had worked in my previous session, I decided that the advice about init.el might be oriented toward Unix rather than Windows systems, so I changed it back to _emacs.  In the Emacs menu, I went to File > Open File > navigate to _emacs > open _emacs.  I used C-x 2 to split the window.  _emacs appeared in both panes.  In the top pane, I went to Buffers > select the text file to be changed.  (Apparently it was listed as one of the available buffers because I had already opened it.)  So now I was viewing the macro in the bottom pane and the email file in the top pane.  I selected the top pane and tried C-x C-k 1 again; still undefined.  I found other advice to just use M-x Macro1.  That worked.  The macro ran in the top pane.
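The trimming that Macro1 performs can also be expressed as a few lines of Python, which sidesteps the macro-binding trouble entirely.  A sketch under the same assumptions the macro makes (the last Content-Disposition line marks the ending material, with the boundary assumed to sit about three lines above it, and everything through the Message-ID line is header noise; as noted above, those landmarks varied from email to email):

```python
def trim_email(text):
    """Drop everything from three lines above the last Content-Disposition
    line to the end, then everything up through the Message-ID line."""
    lines = text.splitlines()
    # ending material: back up three lines from the last Content-Disposition
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].startswith('Content-Disposition'):
            lines = lines[:max(i - 3, 0)]
            break
    # starting material: cut through the last surviving header field
    for i, line in enumerate(lines):
        if line.startswith('Message-ID'):
            lines = lines[i + 1:]
            break
    return '\n'.join(lines)
```

A loop over the folder's files calling this function would have replaced the whole Dired adventure, at the cost of the manual eyeballing that Emacs made easy.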
The macro didn't do such a great job of cleaning this second file.  I would have to return to that later.  For now, the next step was to figure out how to run the macro automatically on all the emails.  Meager results from a search presented the possibility that people did not commonly do this sort of thing.  A refined search led to further discussion suggesting that I should be searching for information on multiple buffers rather than multiple files.  That innovation provoked the side question of whether perhaps jEdit was better than Emacs for such purposes but, once again, Emacs seemed better.  Still another search led to Dired, which would apparently allow the user to conduct certain operations on the files listed in a directory.  We were getting closer.  I found someone who was feeling my pain, but without a solution.

A StackOverflow discussion suggested that I might want to begin a Dired approach by loading kmacro.  I had no idea of how to do this.  An Emacs manual page seemed to think that kmacro was already part of Emacs.  I decided to try to follow the StackOverflow concepts without special attention to kmacro preliminaries.  The first recommended step was to go to the top of my Dired buffer.  This, too, was a mystery.  Another Emacs manual page told me to use C-x d to start Dired.  In the bottom line of the screen, that displayed the name of the directory containing the emails.  I didn't know what else to do, so I hit Enter.  Apparently that was just the right thing to do:  it showed me a directory listing for that folder.  It would develop, eventually, that the fast way to get it to show that directory was to use the menu option File > Open File to navigate to that directory and open a file there.

Now the StackOverflow advice was apparently to move the cursor to the first file in that list (which is where it already looked like it might be) and hit F3 to begin recording a keyboard macro.  Then hit Enter to visit the file.  Then M-x kmacro-call-ring-2nd.  But at this point it said, "No keyboard macro defined."  So kmacro was working, but on this command Dired was looking for a previous keyboard macro, not for an already saved one.  I used C-x k Enter to close the email that I had opened.  Now I was back at the Dired file list.  I hit C-x 2 to split the window, so maybe I could see more clearly what was going on.  With the cursor on the first target email in the top pane, I hit Enter to visit it again, then M-x Macro1 Enter.  That seemed to be the answer, sort of:  the bottom row said, "After 0 kbd macro iterations: Keyboard macro terminated by a command ringing the bell."  So the macro did try to run.  Adventures in the previous post suggested that this error message meant the macro failed to function properly, and I believed I knew why:  this was the email that I had already edited.  I had already removed, that is, the stuff that the macro was searching for, starting with the Content-Disposition line.

Time to try again.  With the top pane (displaying the email message) selected, I hit C-x k Enter to close it.  Then I moved the cursor to (i.e., mouse-clicked on) an email on which I had not yet run Macro1.  There, going back to the (modified) StackOverflow advice, I hit F3 to start recording a keyboard macro; I hit Enter to visit the file; then M-x Macro1 Enter.  It ran without an error message.  The email was showing in both top and bottom panes, so evidently I had not yet mastered the art of pane.  StackOverflow said C-x o to switch to the other buffer.  This just switched me to the other pane; I was supposed to be seeing the Dired file list.  With the keyboard macro still recording, I tried C-x k Enter to close the email.  Now the bottom pane, where I was, had the cursor flashing on the wrong line.  C-x o again, followed by a tap on the down arrow key to take me to the next file to be processed.  That was the end of the steps that I wanted my new keyboard macro to save, so I hit F4.  StackOverflow said that now I had to hit C-u 0 C-x e to run the keyboard macro on every file in the list.  But that command sequence only opened the next file and ran Macro1 on it.  I hit C-x k Enter to close.  I was back at the Dired list.  The cursor did not advance to the next line; Macro1 did not run automatically.

I thought maybe my errors in that last try screwed up the keyboard macro, so I tried recording it again:  F3; cursor on the target email; Enter to visit that file; M-x Macro1 Enter to run the macro; Ctrl-x k Enter to close the email; down-arrow to select the next email in the list; F4 to close the keyboard macro; C-u 0 C-x e to run it.  No joy:  I still had to close the file and start the next one manually.

By this point, a different approach had occurred to me.  If I could open all the target emails at once, I would only have to hit keys to run Macro1 and then close the changed file:  the next one would then be there, ready and waiting for Macro1.  I decided to try this.  As advised, with an email already opened in my target directory (via menu pick -- see above), so as to tell Emacs where to look, I used C-x C-f *.txt to open all of those emails. (As noted above, I was working on EMLs that I had mass-renamed to be TXT files.)  That seemed to work.  The first ones visible to me were those at the top of the list, on which I had already run Macro1.  I closed those.  I couldn't tell, from the Buffers menu pick, how many files remained opened.  I could see that their timestamp would change in Windows Explorer after Emacs was done with them, so presumably I would be able to check which ones I had run Macro1 on.  I made a mental note to make at least some kind of change in each file before closing it, so as to assure myself that there was no further need to work it over with Macro1.

So now I was looking at the first file that had not yet been caressed by the loving hand of Macro1.  I wondered:  can I define a keyboard macro to save the steps of running Macro1 and then closing the file?  I tried:  F3, M-x Macro1 Enter, C-x k Enter, F4.  To execute that last defined keyboard macro, I used C-x e.  It changed the file as desired -- that is, apparently it ran Macro1 -- and it also seemed to be saving the changed file, but it did not close the file.  In other words, I had reduced the required number of keystrokes down to C-x e, C-x k Enter.  That was what it took to run Macro1 and then close a file.  Not bad, but could I do better?

The problem -- for both this approach and the Dired approach (above) -- seemed to be that the macros were not saving the C-x k Enter sequence.  A search seemed to indicate this could be another difficult problem to solve.  I was running low on time for this project, so I had to shelve that, along with the ensuing question of whether I could bind this C-x e C-x k Enter sequence to a function key.  

Instead, I just went plodding through that sequence for these many files.  In some cases, the scrollbar at the right showed me that there was a lot of extra material that I had to delete manually, usually from the ends of the emails.  Saving after these additional edits required a C-x C-s Enter before the C-x k Enter.  It was also handy to know that C-/ was the undo key.

Further Cleanup

When I was done running Macro1 on all those files, I saw that Emacs had created backup copies, with a .txt~ extension.  I sorted by file type in Windows Explorer and deleted those.  Also, while going through the process, I had noticed a number of files that were short and unimportant, and whose attachments did not interest me.  So I was able to go through the list and remove those to a "Ready to PDF" folder.  These steps reduced the number of files on which I might want to perform further operations.

While looking at those files in Windows Explorer, I noticed that some were much larger than others.  These, I suspected, included some whose attachment sections had not been completely eliminated by the macro, perhaps because they had more than one attachment.  I opened these in Notepad and eliminated material that did not contribute to the intelligible text of the email.

In some of the remaining files, there were still a lot of HTML codes and other material that would interfere significantly with an attempt to read the contents.  It seemed that the spot checks I had conducted in TexFinderX had not brought out all of the things that TexFinderX could have cleaned up.  I restarted TexFinderX, added more codes to the list of things to remove, and ran it some additional times on the files remaining in that folder.  That didn't continue too long before I realized that there could be an endless number of such codes and variations.

The next step was to return to Emacs.  This time, I was looking particularly for individual lines that could safely be deleted.  This was not so much a concern with HTML codes, though there might be some of that too; it was more a concern with email headers, boundary lines, and other items that would vary from one email to the next, could therefore not be readily added to a TexFinderX replacement list, and yet could appear repeatedly within a single email.  For example, each of the following lines appeared within a single email:

--===============3962046403588273==
boundary="----=_NextPart_000_002A_01C69314.AD087740"
------=_NextPart_000_002A_01C69314.AD087740

Moreover, variations on those themes recurred throughout that email, with quite a few of each.  So I could write an Emacs macro to search for a line beginning with the relevant characters, select that entire line, and delete it.  I wouldn't have to know which numbers appeared on different variations of these lines, as I would if I were using TexFinderX.
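Dropping whole lines by prefix is exactly where a script beats a literal replacement table.  A Python sketch (the prefixes match the boundary-line examples above plus the Content-* fields discussed below; the list is illustrative, not exhaustive):

```python
# lines to drop wholesale if they begin with any of these prefixes;
# the numeric parts of boundary lines vary, so only the stems matter
PREFIXES = ('--===============', 'boundary="', '------=_NextPart_',
            'Content-Description', 'Content-Disposition',
            'Content-ID', 'Content-Location')

def drop_boundary_lines(text):
    """Remove entire lines that start with any known boundary/header stem."""
    kept = []
    for line in text.splitlines():
        if line.lstrip().startswith(PREFIXES):
            continue
        kept.append(line)
    return '\n'.join(kept)
```

Because only the line's opening characters are tested, the varying numbers on different boundary lines are irrelevant.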

The problem here was that there were quite a few different kinds of lines to remove.  In addition to the types just shown, there were also email header lines that would normally not be visible, but that had become visible in the original mangling of these files, and there were also various Content-Description and Content-Disposition and Content-ID and Content-Location lines.  I would have to write an Emacs macro for each.  I could write one macro to run them all, but it would terminate as soon as it failed to find the next requested line; and since these sorts of lines varied widely from one email to another, it was quite likely that such a general macro would be terminating prematurely more often than not.  If I knew how to bind macros to individual keys, it might not be horrible to go down the list and punch the assigned function (or Ctrl-Function, Alt-Function, etc.) keys, one at a time, reiteratively for each of these many email files.  But that seemed like a lot of work for a fairly unimportant project.  A better approach would have been to write a script to handle such things, but my chosen scripting language for this purpose, Perl, had one significant drawback:  I had not learned it yet.  I had been meaning to, for about 20 years, and I knew that eventually the opportunity would arrive.  But that day was not today.

I concluded that my cleanup phase for these emails was finished.  If I really needed to go further with it, I could convert them from PDF back to text and have at it again, some fine day.  If I had really intended to do that, I would have saved a list of the relevant files at this point.  But for the time being, I needed to get on with the next part of the project.

Converting Emails to PDF

I had previously used "Notepad /p" to convert a set of TXT files, like these emails, to a set of PDFs.  The basic idea was to make a list of files and then use Excel to convert those file paths and names (as needed) to batch commands.  I used that same approach here, making sure to set the PDF printer to operate with minimal dialog interruptions.  This produced PDFs with "Notepad" at the end of their names.  For some reason, Bulk Rename Utility was not able to remove that; I had to use Advanced Renamer instead.
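The Excel step that turned a DIR listing into batch commands can also be scripted.  A hypothetical Python sketch (file names are examples) that reads a DIR /b listing and writes the corresponding "notepad /p" batch file:

```python
from pathlib import Path

def make_print_batch(list_file, batch_file):
    """Turn a DIR /b listing into a batch file of notepad /p commands,
    quoting each filename in case it contains spaces."""
    names = Path(list_file).read_text().splitlines()
    commands = ['notepad /p "%s"' % n for n in names if n.strip()]
    Path(batch_file).write_text('\n'.join(commands) + '\n')
    return commands
```

This is the same CHAR(34) quoting trick from the Opener.bat step, without the spreadsheet detour.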

Converting Attachments to PDF

As noted above, most of these troublesome emails had attachments.  I now had, in a folder, only those emails (in .txt format) whose attachments I wanted to see.  Using a DIR command as above, I did a listing of those .txt files.  I put that list into Excel and modified it to produce batch commands that would move the EMLs of the same name to a separate folder.  Then, in Thunderbird, I created a new local folder.  With that folder selected, I went into Tools > ImportExportTools > Import eml file.  I navigated to the folder containing the EMLs whose attachments I wanted to see, selected them all, and clicked Open.  The icons indicated that all did have attachments.

Now, having configured Thunderbird's AttachmentExtractor add-on to generate filenames that I could recognize and connect with specific emails, I selected all those newly imported EMLs, right-clicked on them, and chose Extract from Selected Messages to (0) Browse.  I set up a folder that was not too many levels deep, for fear that some of these attachments might already have long names that could cause problems.  AttachmentExtractor went to work.  When it was done, I deleted that folder in Thunderbird, so that I would not have a problem of confusing duplicates of EMLs that had already caused me enough grief.

Then, in Windows Explorer, I sorted the extracted attachments by Type.  I began the process of converting to PDF those that were not already in PDF format.  Many of these were Microsoft Word documents.  I had already worked out a process that would automate the conversion of Word docs to PDF.  I moved these files to another workspace folder for clarity, and after making the advisable adjustments to my PDF printer, I applied that process to these files.

Word had problems printing a number of these Word docs.  It crashed repeatedly, during this process, whereas it had sailed right through other stacks of docs that I had converted to PDFs by using the same techniques.  It did produce some PDFs.  I looked at those, to make sure they turned out OK, and then I had to do a DIR /a-d /b *.pdf > successlist.txt in the output folder to see which docs had been successfully PDFed, and then convert successlist.txt into a batch file full of commands to delete the corresponding DOCs, so that I could try again with the DOCs that didn't convert properly the first time.  Before re-running the doc-to-pdf conversion batch file, I opened one of the failed DOCs and printed it to PDF.  That went fine, as a manual process.  So apparently it was not, in every case, a problem with the file.  Ultimately, I used OpenOffice Writer 3.2 and was able to print the remainder manually, using just a few keystrokes per file, with no problems.
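Finding the failures could have been scripted, too, instead of round-tripping successlist.txt through Excel.  A Python sketch (folder names hypothetical) that reports which .doc files still lack a same-named PDF in the output folder:

```python
from pathlib import Path

def unconverted_docs(doc_dir, pdf_dir):
    """Return the names of .doc files in doc_dir with no matching .pdf
    (same base name) in pdf_dir -- i.e., the conversions that failed."""
    pdf_stems = {p.stem for p in Path(pdf_dir).glob('*.pdf')}
    return sorted(d.name for d in Path(doc_dir).glob('*.doc')
                  if d.stem not in pdf_stems)
```

The returned list is exactly the set of files to feed into a second conversion pass.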

Other extracted attachments were text files.  At this point, I had two ways of dealing with these.  On one hand, I could have used the same process as I had just used with the Word docs, after changing the command used for .doc files to refer instead to .txt files.  I did start to use this approach, but ran into dialogs and potential problems.  On the other hand, I could have used the approach of printing to Notepad, as I had used with the emails themselves (above).  Before I got too far into this task, though, I noticed that every one of these text files had names like ATT3245657.txt.  They also all originated from the same source.  I examined a handful of these attachments and decided I could delete them all.

Some extracted attachments were image files -- JPG, GIF, PNG, BMP.  I also had a dozen attachments without extensions.  I opened the latter in IrfanView.  I believe an IrfanView setting allowed it to recognize, as it did, that some of these extensionless files were actually image files, and to offer to rename them (as PNGs or whatever) accordingly.  On the other hand, as I looked through these files, I saw that some of the GIFs were animations.  Excluding those, I now had a list of what appeared to be all the attachments that should be treated as image files.  I used IrfanView's File > Batch Conversion/Rename option to convert these to PDF.

There were a few miscellaneous file types.  For videos, I just took a screenshot in the middle and used that as an indication of what the original attachment had been.  One alternative would have been to use something like Shotshooter.bat to produce multiple images conveying a sense of the video's progression, and then combine those images in a single PDF.

Combining Email and Attachment PDFs

Now I had everything in PDF format.  I used Bulk Rename Utility to rename emails and attachments so that, when combined into one folder, each email would come before its associated attachments (if any), and the difference between the two would be readily visible.  I combined the files and attachments into one folder and made a list of the files using DIR (above).

Now the goal was to combine the emails that did have attachments with their accompanying attachments.  There were probably too many of these to combine them manually, one set at a time, using Acrobat or something like it.  I had previously worked out a convoluted approach for automating the merger of multiple PDFs (produced from multiple JPGs), using pdfSAM.  Discussion on a SuperUser webpage and elsewhere suggested that pdftk and Ghostscript were alternatives.  The instructions for Ghostscript looked more complex than those for pdftk, so I decided to start with pdftk.

I downloaded and unzipped pdftk.  As advised, I copied the two files from its bin folder (pdftk.exe and libiconv2.dll) into C:\Windows\System32.  I opened a command prompt in some other folder, at random, and typed "pdftk --help."  This was supposed to give me the documentation.  Instead, it gave me an error:
pdftk.exe - System Error  The program can't start because libiconv2.dll is missing from your computer.  Try reinstalling the program to fix this problem.
I moved the two files to C:\Windows and tried again.  That worked:  I got documentation.  It scrolled on past the point of recovery.  Typing "pdftk --help > documentation.txt" solved the problem, but ultimately it didn't seem to give me anything more than already existed in pdftk's docs subfolder.  The next step was to put pdftk to work.  It would apparently allow me to specify the files to combine, using a command of this form:
pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
My problem was that, at least in some cases, the filenames I was working with were too long to fit on a single command line like that, one after another.  The solution, I decided, was to take a directory listing, put it into Excel, and use it to create commands for a batch file that would rename the emails and their accompanying attachments with short names like 0001.pdf.  I would need to keep the spreadsheet for a while, so as to know what the original filenames were; the original filenames were my guide as to which files needed to be combined together.  For this purpose, with one of the original filenames in spreadsheet cell A1, I put ascending file numbers (1, 2, ...) in cells B1, B2, and so on.  In cell C1, I put =REPT("0",4-LEN(B1))&B1&".pdf" to produce the zero-padded name, and in cell D1, I put ="ren "&CHAR(34)&A1&CHAR(34)&" "&C1 to produce the rename command.  Then I copied the formulas from column D into Notepad, saved the result as Renamer.bat, and ran it.
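The spreadsheet formulas above can be expressed as a short script.  This hypothetical Python sketch does the same job: assign each file a short sequential name, emit the ren commands, and keep a mapping back to the original long filenames (the role the spreadsheet played).

```python
# Sketch of the Excel step: number each file 0001.pdf, 0002.pdf, ...
# and build both the rename commands and a short-name -> original-name
# mapping, which is needed later to know which files to combine.
def build_renamer(filenames):
    mapping = {}
    commands = []
    for i, name in enumerate(filenames, start=1):
        short = "%04d.pdf" % i   # same result as =REPT("0",4-LEN(B1))&B1&".pdf"
        mapping[short] = name
        commands.append('ren "%s" %s' % (name, short))
    return mapping, commands
```

Writing the commands list out to a .bat file reproduces Renamer.bat.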

After doing that renaming, I went back to the spreadsheet for guidance on which of these numbers needed to be combined.  Each original filename began with date and time.  With few exceptions, this was sufficient to distinguish one email and its attachments from another.  So I used =LEFT to extract that identifying information from column A.  Then, in the next columns, I used IF statements to compare the extract from one line to the next, concatenate the appropriate filenames with a space between them, and choose which concatenations I would be using.  Finally, I added a column to create the appropriate command for the batch file.  Instead of the 123.pdf output shown in the example above, I used the original email filename.  Where there were no attachments, pdftk would thus just convert the numbered PDF (e.g., 0001.pdf) back to its original name.
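The IF-statement logic just described can also be sketched in code.  In this hypothetical Python version, files are grouped by the date-and-time prefix of their original names, and each group produces one pdftk command whose output is the original email filename.  The 16-character prefix length is an assumption for illustration; the real cutoff depends on the filename format in use.

```python
# Sketch: filenames begin with a date-time stamp, so files sharing that
# prefix belong to one email (the email PDF plus its attachment PDFs).
# For each group, emit a pdftk "cat" command merging the numbered files
# back under the email's original name.
def build_merge_commands(originals, short_names, prefix_len=16):
    groups = {}           # prefix -> list of short numbered names, in order
    first_original = {}   # prefix -> original filename of the email itself
    for orig, short in zip(originals, short_names):
        key = orig[:prefix_len]
        groups.setdefault(key, []).append(short)
        first_original.setdefault(key, orig)
    commands = []
    for key, shorts in groups.items():
        commands.append('pdftk %s cat output "%s"' %
                        (" ".join(shorts), first_original[key]))
    return commands
```

As in the spreadsheet version, an email with no attachments yields a one-file "merge" that simply restores its original name.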

I finished with spot checks of various files, with and without attachments, to verify that they had come through the process OK.  I was not happy with the remaining junk in the emails themselves, but at least I could tell what they were about now, and they had their attachments with them.  Pdftk had proved to be a much easier tool for this project than pdfSAM.  This had been an awful lot of work for not terribly much achievement on some not very important files, but at least I had finally worked through all of the steps in the PDF conversion process for Thunderbird emails with attachments.

Windows Explorer Replacements: FreeCommander and Explorer++

In a previous post, I looked at replacements for Windows Explorer ("WinEx"), including especially FreeCommander.  The runner-up, at that point, was Explorer++.  Further experience with FreeCommander prompted me to take a closer look at Explorer++ after all.  This post provides further information on these two utilities.

As I used FreeCommander, I was surprised to find that a few right-click (context menu) options were missing.  For example, I often used LockHunter to find out why Windows was not letting me move or delete a certain file or folder.  But in FreeCommander, I was no longer seeing the context menu question, "What is locking this file?"  That option did continue to appear in Explorer++, as it had appeared in WinEx.  One possible explanation was that FreeCommander did not offer a 64-bit version, whereas Explorer++ did, and I was using the 64-bit version of LockHunter.

Another problem in both FreeCommander and Explorer++ was that I no longer had the option to create a new text file in a specified folder.  That option had been available in WinEx, as I recalled, via File > New > Text File.  I was pretty sure there was a way to create a new text file in FreeCommander.  It seemed to me that I had done so by accident, once or twice, while trying to do something else with a familiar command from WinEx.  But I was not seeing that option in the menus or the shortcut lists of either program.  Workarounds in either program were to open a command window in the selected folder and type one of these options:

  • copy con filename.txt, then Enter.  Type the text.  End with F6 or Ctrl-Z, then Enter.
  • echo [a line of text to put into new text file] > filename.txt
  • notepad filename.txt Enter
Both FreeCommander and Explorer++ made the command window available in multiple ways:  via Ctrl-D or menu pick in FreeCommander, via menu pick in Explorer++, and via toolbar icon in both programs.  Both also allowed the customized context menu option to "Open command window here," available through Ultimate Windows Tweaker.  But the toolbar itself, the most readily accessible of these options, was smaller and less obtrusive in Explorer++ because it could be made to fit on the same line (at the top of the screen) as the address bar and the list of drives, whereas FreeCommander insisted on putting the toolbar (if I opted to display it) on its own separate row, and with somewhat larger icons.

Unlike in FreeCommander, it was not necessary to display a toolbar listing all drives in Explorer++, because its navigation pane already showed all drives, as in WinEx.  Also like WinEx, Explorer++ allowed me to customize the toolbar area by right-clicking on it.  By contrast, FreeCommander required me to go to Extras > Settings > View > Toolbar; and once there, I had to save changes to each segment of the toolbar separately.  Explorer++ offered more toolbar icons that I was likely to find useful, including Back, Forward, and Up buttons.

Explorer++ did not offer the dual panel option.  But in recent weeks, I had not found myself using that option very often in FreeCommander.  I tended to prefer to keep my windows to half-screen width (using the half-screen snap available in Windows 7 via WinKey - left- (or right-) arrow), and a half-screen was too narrow for many filenames.  Moving from one tab to another was an easier way to work among multiple folders.  Explorer++ (unlike FreeCommander) further aided that by offering the option of bookmarking folders.  A bookmark would not create a new tab; it would change the focused folder in the already focused tab.

Unlike FreeCommander, Explorer++ offered the option of being treated as a replacement for WinEx.  This meant that my Start Menu icon (and other menu picks in various programs) that previously would have opened a Windows Explorer session were now opening an Explorer++ session instead.  That option was available via Tools > Options > General tab > Default File Manager.  I still had the option of opening Windows Explorer by typing "explorer" in a command box; hence, batch commands designed to open WinEx to a particular folder would still do so.

FreeCommander appeared to offer more command-line options.  The options in Explorer++ appeared to be limited to (a) the possibility of listing multiple directories to open when Explorer++ started up, each opening in its own tab and (b) the possibility of opening virtual folders by using their names (e.g., explorer++.exe "control panel").  I did not think I would need the latter.  The former would be useful only when dealing with relatively short pathnames; Windows might balk at a command listing several long paths.  I obtained information about these options by typing "explorer++.exe /?" at the command prompt.  That seemed to work only in the folder where explorer++.exe was located.

Other points of comparison:  Both Explorer++ and FreeCommander seemed to remember their window positions better than WinEx had done.  Even more so than FreeCommander, Explorer++ displayed much more information onscreen than WinEx:  51 rows, in my configuration.  Regrettably, Explorer++'s status bar, unlike FreeCommander's, did not state both the number of items selected and the total number of items in the folder.  Like FreeCommander, Explorer++ did not offer an Undo option, in case I had accidentally moved or deleted the wrong file or folder.  Using Explorer++ or FreeCommander did not stop the annoying "This folder is shared with other people" messages.

As these remarks probably suggest, I found myself gravitating toward Explorer++ shortly after I began using FreeCommander in earnest as my WinEx replacement.  There would surely be many more contrasts between the two.  But I wasn't sure how many of them I would detect, since by this point it seemed that I would mostly just be using Explorer++.

Thursday, June 21, 2012

Windows Explorer Has Stopped Working: FreeCommander and Other Alternatives to Windows Explorer

I was using Windows 7. I installed a new EVGA video card. Immediately, I began getting "Windows Explorer has stopped working" error messages, especially when I would right-click on a file or folder and try to move it somewhere else. A search led to a Microsoft webpage that identified the video driver as the first culprit. I had also had problems with the previous video card (also an EVGA), but had partially resolved those by rolling back to an earlier driver. That solution did not work this time. While awaiting EVGA's reply to a service request, I decided to look into Windows Explorer replacements.

This was not the first time I'd had problems with Windows Explorer. As far back as the late 1990s, I had found PowerDesk 98 to be a superior alternative to Windows Explorer in Windows 95 and 98. I had grown so attached to it that I was still occasionally looking, years later, to see whether they had come out with an update. It just offered a lot of features and advantages that I hadn't found in Windows Explorer.

To me and to at least some other users, Windows Explorer in Windows 7 had seemed like a step backwards from Windows Explorer under Windows XP. Some functionality was lost; some new problems appeared. For example, not long before this latest "stopped working" issue, I started getting the unwanted reminder that "This folder is shared with other people," which would come up when I tried to move or delete a folder. Despite some effort, I hadn't been able to get rid of that.

FreeCommander

In the past year or so, after repeated albeit superficial looks at various Windows Explorer alternatives, I had started using FreeCommander for some tasks. One thing pushing me in that direction was that, in Win7, Windows Explorer seemed eager to forget file selections. That is, I could go through a list of files and select some; but then Windows Explorer would de-select them if I (or anything else) changed the file list. In FreeCommander, unlike Windows Explorer, I could select files, and they would tend to stay selected -- even if I deleted one entry from the list (from within another program, for example, or another Windows Explorer session) or if I re-sorted the list by file size rather than file name. FreeCommander wasn't perfect about this. At this writing, a quick test revealed that it, too, would lose selections if I sorted by file type. Nonetheless, FreeCommander was much better than the current version of Windows Explorer in this regard, in my ordinary usage. FreeCommander was also not crashing, at present, during these instances when I was getting these "Windows Explorer has stopped working" messages.

FreeCommander also did a better job of putting me where I wanted to be. It would remember my last location, and would never put me in a pseudo-folder under Ray (my user name in the navigation pane), as Windows Explorer insisted on doing, when I actually wanted to be in Computer > Drive D > Folder X. I also liked that FreeCommander did not lard down my navigation pane with all sorts of Libraries and other top-level folders (which, to some degree, I had been able to eliminate from Windows Explorer by using registry tweaks).

On the downside, unlike Windows Explorer, FreeCommander did not preserve a memory of multiple sessions after a reboot. That is, if I had three Windows Explorer sessions and three FreeCommander sessions open before a crash, all focused on different folders, I would wind up with those same three Windows Explorer sessions after the crash, but only one FreeCommander session (or pair of sessions, if I was using FreeCommander's split-window feature). (I think I had to enable an option somewhere in Windows, or in a tweaking program, to get this functionality from Windows Explorer.) Then again, FreeCommander's option to open or close tabs (via View > New Folder Tab or else Ctrl-T and Ctrl-W) had the potential to make it unnecessary to have so many Windows Explorer sessions open at once.

At first, I did not realize that I could turn FreeCommander's split-window feature off. There were times when it was convenient to be able to view two separate lists of files -- from, say, two different drives -- within one FreeCommander session (and to split the screen either horizontally or vertically). The options to split the window or to view just the left or right sides were available through the menu (View > Split Window) or by shortcuts (Ctrl-Shift F1, F2, and F3). If I had multiple tabs open on one side, it would remember them, even if it was currently displaying only the other side.

I didn't like that FreeCommander wouldn't show me a full navigation tree (in the left-hand pane) for all drives: instead, it was focused on just one drive. To switch drives, I had to go to the top of the screen and select another drive from a list or combo box (configurable via Extras > Settings > View > Drives). That seemed inconvenient, and that top bar took up screen space. One workaround might be to have a different tab open for each drive, and just navigate within that tab whenever I wanted to work in a different drive.

I had some batch files that would open various Windows Explorer sessions automatically, on certain dates or at certain times of day. There was a simple command format to specify which folder a Windows Explorer session should display: start explorer.exe /e,"D:\Folder Name\File Name.doc". I wondered if I could do something similar with FreeCommander. It would let me specify a pair of starting folders (via Extras > Settings > Start Program), but could I go further than that? It appeared that, to match what I was doing with Windows Explorer in this regard, I would have to set up a relatively complicated arrangement with alternative .ini files in FreeCommander. (I wasn't entirely sure whether this was what the program's author meant when he said that "several layouts can be saved.") I probably wouldn't go to that trouble for the most part, though I could imagine setting up a couple of really complicated sets of tabs and linking to them with shortcuts on my semi-portable customized Start Menu.

One feature of FreeCommander that I liked immediately was its more condensed display. Windows Explorer had seemed to add more space between lines, going from Windows XP to Windows 7. I wasn't sure why. The result was that Windows Explorer would display 38 items at a time (or a few more, if the top menu and bottom status bar were turned off) while FreeCommander would display 45 -- and those 45 would be presented in a larger, more readable font. FreeCommander's status bar was also more informative, telling me that I had selected 45 out of 110 items within a folder, and such information would remain visible in the status bar, whereas for some reason Windows Explorer would sometimes replace it with the date when a folder was created.

One thing I didn't like about the FreeCommander interface was that it didn't have an address bar into which I could paste an address for a quick switch to a different directory. In FreeCommander, I had to use the Edit menu to get a folder's address for pasting elsewhere, and the Folder > Go to Folder menu pick to paste an address from elsewhere into FreeCommander. These steps were inconvenient because (a) they required more steps and (b) they were in two different locations, which made things a little more confusing. On the other hand, FreeCommander offered the option (via Alt-Ins or Edit > Copy Full Name as Text) of copying both the path and the file name in one step, which Windows Explorer did not do.

I felt that FreeCommander needed some reorganization and clarification. There was apparently a new XE version in the works; it appeared that the present version dated from 2009. On the other hand, I wasn't sure how much I could complain about the program's offerings. Quite aside from the fact that it was free, FreeCommander had all kinds of features not available in Windows Explorer, and there didn't seem to be much that WinEx could do that FreeCommander couldn't. I figured that, with or without reorganization, I would become familiar with FreeCommander's menus and possibilities if I began using it more extensively.

One particular area that I did not explore, within FreeCommander, was its offering of various built-in utilities. These included file viewing (e.g., hex, image, binary formats), zipping, searching, wiping, checksums, listing, renaming, and splitting; multiple file renaming; directory comparison and synchronization; and the option to add more tools to the list. I didn't look into this because I was not sure how much I would use these tools. As indicated in other posts in this blog, a lot of problems and questions could arise when you really got into the details of these sorts of functions. For example, I had paid some attention, in recent years, to GoodSync and alternatives for synchronization between computers, and to Beyond Compare and rsync for comparing a computer's files against backups. I was hesitant to replace those sorts of dedicated tools with what might be a simplistic form of file comparison that could make a mess.

In terms of other conveniences, there was quite a list of hotkeys. FreeCommander was available in both installed and portable versions. There were user forums with a total of maybe 10,000 posts. It was definitely a stable and worthy program. Given my previous searches and my experiences to date with FreeCommander, the question for me at this point was whether there was some other Windows Explorer alternative that was even better.

Other Windows Explorer Alternatives

Along with various ways to customize Windows Explorer (which did not presently appeal to me, given the fact that it was crashing every time I looked at it sideways), there were various lists of Windows Explorer alternatives. For example, a PCTIPS 3000 webpage briefly listed FreeCommander as one of the best (free) Windows Explorer replacements, along with Q-Dir, Explorer XP, NexusFile, and CubicExplorer.

A potentially manipulable poll suggested that users of one forum were using Total Commander, Directory Opus, XYplorer, Explorer++, Q-Dir, and FreeCommander, in approximately that (descending) order. A SuperUser thread ranked the leading Windows Explorer alternatives as being (in this order) Total Commander, QTTabBar, FreeCommander, FarManager, XYplorer, Directory Opus, the command line, xyplorer2 (?), Q-Dir, TeraCopy (!), CubicExplorer, SpeedCommander, Altap Salamander, Xplorer2, muCommander, and others recommended by at least one person. Another webpage listed at least 30 free and paid alternatives to Windows Explorer. TechRepublic listed these as its preferred free Windows Explorer replacements: CubicExplorer, Explorer++, Xplorer2, NexusFile, and Q-Dir.

Wikipedia provided a comparison of many such programs. It appeared that they could first be divided into free and nonfree categories. Since I was not likely to pay for something that FreeCommander (not to mention Windows Explorer) could already do for free, I focused on the free ones. This removed Total Commander ($40) from consideration -- a perennial favorite that I had tried briefly but hadn't found that terribly impressive. The nonfree group also included some other popular entries from the lists cited above: Directory Opus ($85!), XYplorer ($43), SpeedCommander (~$55), Xplorer2 ($30), and Salamander ($30) -- assuming the lite versions available for some of these programs would fail to match a competitive freeware alternative like FreeCommander. It was not clear how recently the Wikipedia page had been updated -- it didn't include any reference to some of the programs listed above, and therefore might not reflect pricing changes (including decisions to offer a given program for free, or to stop doing so). Then again, I guessed that any developer who was on top of the game would know of this Wikipedia page and would be updating the pricing information pretty quickly.

ExplorerXP did not appear to have been updated for Windows 7. I had encountered other programs, great in Windows XP, that had stumbled at one point or another in Win7. Given the number of competing programs, I tentatively ruled out those that were not clearly Windows 7 compatible. I was also inclined to exclude those whose screenshots provided a clunky interface and/or limited information -- notably, Far Manager.

The leading free Windows Explorer alternatives, other than FreeCommander, thus appeared to include Q-Dir, NexusFile, CubicExplorer, Explorer++, QTTabBar, and muCommander. The Wikipedia comparison page provided a table indicating the kinds of views that each such program offered. I considered a Details view option essential. Evidently no such view was available in CubicExplorer. I was not sure whether twin-panel or tab options were essential, but CubicExplorer seemed to lack those features too. As noted above, I was not especially interested in the utilities (e.g., file compression) that such programs might offer, so I did not examine the Wikipedia tables on those matters.

I looked at the webpages for the ones that remained on my list: Q-Dir, NexusFile, Explorer++, QTTabBar, and muCommander. I had previously glanced at QTTabBar but, for reasons I did not recall, had not gone far with it. On this review, I was not too impressed. Its forums seemed to be lonely places; there was an indication that it was still in "public beta"; and there were reports of instability and Windows 7 incompatibility.

Q-Dir's screenshots seemed to offer the potential to divide up the screen in a number of interesting ways. They did not seem to display a tabbed browsing option, but I gathered that tabbing was an option somehow. There did not appear to be a forum, but the author did provide an array of FAQs, with perhaps some English-language limitations. My principal reaction at this point was that, having glanced ahead at Explorer++, I thought it might be a little more of what I was looking for.

NexusFile had a snazzy webpage, especially for people who like black, but I wasn't so sure about its content. Forum posts didn't seem to be categorized, and there didn't seem to be a way of searching them. Generally, as with most of the other programs noted here, the information provided on the webpages did not seem responsive to the detailed concerns noted in my review of FreeCommander (above). That might have been alleviated in some cases if I had plunged into their FAQs in detail. It would not have been alleviated in the case of NexusFile, whose FAQs page had a total of six questions.

Between muCommander and Explorer++, a comparison of forum posts suggested that muCommander was more of a cross-platform option, with many of its posts coming from Linux users. There also seemed to be somewhat more recent activity in the Explorer++ forums. MuCommander, not listed on the Wikipedia comparison page (above) at this point, apparently did not yet have tab support. Between the two, then, I was inclined toward Explorer++.

The thing is, I wasn't seeing anything in any of these alternatives that FreeCommander lacked. At present, the best course of action seemed to be to focus on FreeCommander, to the extent that it performed more stably and functionally than Windows Explorer. I decided that I would return to these alternatives if, at some point, neither Windows Explorer nor FreeCommander seemed to be working well for me.

Suggestion

It seemed, in particular, that while these various programs were trying to compete to be the best Swiss Army knife, doing everything fairly well, there might actually be some merit in taking more of a niche approach. For example, I tended to use a few folders very frequently. I had not really worked out the whole thing of using Favorites, links, Send To, Move To, Copy To, and other ways of moving and sorting my attention, or my files, among these key folders. It would have been very helpful if, for instance, I could have just punched a function key or other hotkey to move a file to any of a number of previously designated folders. I often found myself sorting files; this would have been very useful.

In that particular example -- and that was obviously not the only specialized way in which people might use Windows Explorer -- there seemed to be a lot of room for people to develop tools that would help me navigate across my computer adequately, while providing a really special experience among key locations. In that case, I could easily imagine running this other program along with (not "instead of") Windows Explorer or FreeCommander.

How Google Started to Become a Problem

I guess I have assumed that almost everybody loves Google, and those who don't are the bad guys.  Microsoft, for example.  Maybe it takes a huge corporation to stand up to another huge corporation.  If so, Google is a champion for those who have disliked various things about how Microsoft got its start, what it did to increase its power, and what it has done with that power.

There comes a point, however, when the good guy turns bad.  Maybe it doesn't have to happen.  But power tends to corrupt.  And even when it doesn't actually corrupt, it tends to create an impression of corruption.  That impression may be able, by itself, to make people more or less as miserable as they would be in case of actual corruption and abuse.

Case in point.  I have been blogging for years, here in Blogger.  I wasn't necessarily eager to see Google acquire Blogger.  But they were welcome to do so, for my purposes, as long as they left me alone.  The deal was that I got to use their free blogging platform to put out various things that I wanted to write, and they got to use my work, my viewers, etc. to make money from advertising and whatnot.

Gifts can make people resentful when they stop.  I would be unhappy with Google if they pulled the plug on my blogging enterprise, even though they're not charging me for it.  I have spent years putting stuff here, linking one post to another and so forth.  It would take a lot of work -- work that I might never do -- if they were suddenly to just shut it down or screw it up.  I would feel that, after all, Google does have competitors, notably WordPress.  If nothing else, I'd sooner be paying for a hosted website than to do all this work and then watch it get messed up.

What's sad is that I have been warned that they are quite capable of doing exactly that.  It has already happened.  Circa 2000, many people were using DejaNews as a convenient gateway to Usenet.  Usenet newsgroups contained tons of free, helpful information on a vast array of subjects -- especially but not only computer-related, like this blog.  Google acquired DejaNews.  Evidently they felt that all that information would interfere with their desire to sell advertising related to webpages.  For whatever reason, they basically destroyed Deja.  That was a shame, for all those people who could have continued to use it to obtain useful information.  And it was irritating to me, because all the things I had put out there, thinking I would always be able to access them, were removed from access as a practical matter, by me and most everyone else.

I was pretty unhappy with Google about that.  That was the first big chink in their widely reported corporate motto, "Don't be evil."  They had obviously ruined something useful, for purposes of increasing profits.

That stuff would not be coming back to mind now if I weren't having an off day with Google today.  Here I am, working away on my blog, and suddenly it is no longer very functional in Internet Explorer.  I have a nice little desktop arrangement, with various browsers, but now Blogger has suddenly ceased to work properly when I try to post or edit.  Google lets me know that, instead, I should be using its own browser, Chrome, for this purpose.

That part happened several days ago.  So, OK, I have been trying to post in Chrome instead.  But I am finding that Chrome is not yet up to speed for this purpose.  Google was eager enough to move me over to its browser -- the statements and signals have been out there for some time -- but, lo, it develops that Chrome is inserting white backgrounds.  Whole chunks of my post are whited out.  Why?  I don't know.  Probably they don't know either.  I am having to go in and manually remove white backgrounds that I didn't put there.  Why not just leave me alone, free to work on my blog in Internet Explorer, until Chrome gets its act together?

That seemed like a fair question, so I tried to present it to Google.  Problem is, their "Contact Us" webpage is a lie.  You cannot contact them through their webpage.  Or at least I cannot.  I tried today.  I tried once before, with a problem so obvious and banal that it pained me to have to bring it to their attention.  In that case, I gave up and wrote them a letter.  It seemed ironic, and yet telling, that I had to use the U.S. Post Office to communicate a simple thought to one of the world's largest software corporations.

Like most people, I don't like being lied to.  If you're not going to let me contact you, don't give me a "Contact Us" webpage.  Call it "FAQs" or whatever.  It's great that you can hire the best and the brightest, but that can backfire:  you can create the impression that you think you're too good for the rest of us.  It wouldn't be terribly smart to generate unnecessary resentment, would it?

It had never occurred to me, until today, to search for something that I have now searched for and found.  Yes, as it turns out, there does exist something called IHateGoogle.org.  I'm not really sure what it's about.  I'm not resentful enough to dig into it.  But, Google, keep it up:  maybe someday I will be.  You seem to be making a good start at it:  today you tell me that as many as 1.4 million webpages convey that sort of feeling toward you and your actions.

Obviously, I am not the only person who has attempted to communicate with Google along these lines.  People rarely get resentful when they feel they are being respected.  If Google cannot make its own programs work together -- Chrome and Blogger, in this case -- it is welcome to keep them in beta.  But forcing me to use them when I don't want to:  at this point, that is a problem.  Not just a software problem.  As presented in this post, it is an indication of larger and more worrisome things.

Tuesday, June 19, 2012

Exporting Thunderbird Emails to PDF - Another Cut

I was using Thunderbird 11.0.1 in Windows 7.  I had accumulated some emails that I wanted to export as individual EML files.  An EML would still be readable in Thunderbird, and it would carry any attachments along with it.  I had attacked this problem on several previous occasions.  As before, I was not sure I would get all the way through from Thunderbird to EML to PDF.  This post provides another contribution in the slog toward that outcome.

First Step: From Thunderbird to EML Format

Some of my previous efforts to export to EML and then convert to PDF had produced something of a mess.  Exporting, itself, was easy enough.  I was using ImportExportTools.  It would give me EMLs with names containing some, but not all, of the information that I wanted in file names.  Specifically, it would provide the date and time, the sender, and the subject; but it did not include the recipient.  I could get it to produce a separate Index.csv file that would contain the full information, but that would just be a spreadsheet file.  I could use that spreadsheet file to give me nice names for files; but which file was supposed to get which name?  Matching them up had required a surprising amount of manual effort, last time around.  I was hoping to make the process smoother, if I could.

It wouldn't help to print a PDF directly from Thunderbird.  As far as I knew, that would require me to enter PDF filenames manually.  I was looking for a mass-production kind of solution.  About.com recommended mbx2eml, but it seemed to have some disadvantages, notably a very limited set of options for the resulting EML filenames -- which was the main problem.  Generally, it did not seem that any solution had broken through into prominence, in either the T-bird to EML or T-bird to PDF category.

In my first try at this problem, I had tried Total Thunderbird Converter and Birdie EML to PDF Converter, but for various reasons had not been impressed with either.  I did like Attachment Extractor, for when I got to that part of the project.  My notes seemed to favor Universal Document Converter (UDC) ($69), if I wanted a direct T-bird-to-PDF solution.  As I reviewed the struggles I'd had in that first try at this problem, and also in the second and third tries, I wondered if I should have focused more seriously on UDC.  But it did not seem to have command-line capability or other automation features.  It was basically a glorified PDF printer.  Moreover, its default filenames did not include all the information I wanted.

My previous notes did not seem to mention that Thunderbird messages were apparently already in EML format, stored in Thunderbird subfolders.  For instance, I had moved the messages that I was now seeking to export to a Local Folders subfolder called Export, and I could see that folder in Windows Explorer as Mail\Local Folders\Export.mozmsgs.  But this was confusing:  the number of EML files in that folder was not very close to the number of messages in the Export subfolder in Thunderbird.  Anyway, the EMLs in Export.mozmsgs had seemingly random names that would be useless for my purposes.

So I went ahead with ImportExportTools.  My first step was to eliminate duplicates.  For this, I used Remove Duplicate Messages (Alternate).  Then, in Thunderbird, I went to Tools > ImportExportTools > Export all messages in the folder > EML format.  The first time around, this produced undesirable results (see below).  But I didn't know that until I was partway through the second step.

Second Step:  Adding Recipient to the EML File Name

I had my EMLs.  But as noted above, I wanted to add the name of the Recipient to the filename, in the format Date-From-To-Subject.  As a first step, I thought I would just try to append the Recipient's name to the end of the filename.  Then I would figure out how to shuffle the words around to the desired order.

Given my limited knowledge of programming and such, I decided to try to achieve this with a Windows batch file.  I struggled to figure out how to write a suitable one, and finally posted a question on it.  One of the early answers to that question led to a separate pursuit -- a one-line batch file that would convert Word and WordPerfect documents to PDF.

The answers that I had received, at the point when I was writing up these notes, fell into two categories.  One, which I found easier to understand (and, predictably, seemed less popular among the knowledgeable respondents), involved a simple loop that would call an external process.  Basically, in plain English, it went like this:

FOR each EML file, run Process.
Repeat loop.
When list of files is exhausted, quit.

Process starts here.
Do various things.
End of process.
By contrast, the approach preferred by most of the answering individuals would put all the steps inside the loop, instead of having a separate process afterwards.  It seemed to be a matter of style.  A second difference was that, in discussing the specific steps, they seemed divided between two general possibilities:  with, or without, delayed expansion.  Delayed expansion was apparently a response to a complication in how the FOR command worked.  As I understood it, the computer would read the entire contents of a FOR command as soon as it hit the word FOR.  So assigning a value to a variable inside a FOR loop would be too late; the computer would already have decided what value that variable had.  The variable would have been immediately expanded to its value.  Delayed expansion would postpone definition of the variable's value until later in the game.  A variable would be marked for delayed expansion by surrounding it with exclamation marks (e.g., !VAR!).  I wasn't familiar with delayed expansion, so I was in accord with some advisors' feeling that it would be better to proceed without it if possible.  What they (especially Aacini) suggested was:
@ECHO OFF

IF EXIST fullnames.txt DEL fullnames.txt

FOR %%f IN (*.eml) DO (

SET firstfind=

FOR /F "delims=" %%l IN ('findstr /B /C:"To: " "%%f"') DO (

IF NOT DEFINED firstfind SET firstfind=now & ECHO %%f%%l >> fullnames.txt

)

)
I have double-spaced the lines for clarity, anticipating that Blogger will wrap some long lines.  I haven't indented the way a programmer would, because of apparent limitations in the formatting options here in Blogger.  Basically, this batch file said, give me a fresh output file called Fullnames.txt; and on each line in Fullnames.txt, type the contents of two variables.  The first variable, %%f, was the name of the EML file under consideration, in all its Date-Sender-Subject glory.  There would be one such filename assignment for each EML file in the folder; hence a FOR loop.  The batch file would loop through all EML files in the folder.

Inside that FOR loop, there would be an examination of the contents of each individual EML.  This examination would use FINDSTR to locate the first line beginning with "To: ".  The contents of that line would be assigned to the %%l variable.  (That's an L, not a one.)  I wasn't sure why this had to be done inside a second, inner loop, and I also didn't know how the "now" part worked.  But I was an open-minded individual.  I was interested in new ideas.  The point is, I was willing to plow ahead and give it a try.
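For what it's worth, the batch file's logic can be restated in a language I find easier to follow.  This is a rough Python sketch of what EMLNamer.bat appears to do -- not anything I actually ran -- with one small difference: the batch file, run from inside the folder, records bare filenames, whereas this sketch records whatever path the folder argument produces.

```python
# Rough Python restatement of what EMLNamer.bat does (a sketch, not the
# script actually used): for each .eml file, record the filename plus the
# first line that begins with "To: " in fullnames.txt.
import glob

def build_fullnames(folder="."):
    lines_out = []
    for eml in sorted(glob.glob(folder + "/*.eml")):
        with open(eml, encoding="latin-1", errors="replace") as f:
            for line in f:
                if line.startswith("To: "):   # like the IF NOT DEFINED guard:
                    lines_out.append(eml + line.rstrip())
                    break                     # keep only the first match
    with open("fullnames.txt", "w") as out:
        out.write("\n".join(lines_out))
    return lines_out
```

Seen this way, the "firstfind" variable in the batch version is just a flag ensuring that only the first matching line in each file gets recorded -- the role the break statement plays here.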

So I copied the foregoing lines of script, beginning with @ECHO OFF and continuing to the last closed parenthesis (")"), and pasted them into a file in Notepad.  I saved that file as EMLNamer.bat, and put it into the folder containing the EMLs that I had exported from Thunderbird (above).  There, I ran it (by double-clicking it, or by highlighting it and pressing Enter).  The command window displayed nothing, which was a bit disconcerting; but, viewing the folder in Windows Explorer, I could see Fullnames.txt spring into existence and grow larger.

When it was done, the command window disappeared, and Fullnames.txt stopped getting bigger.  I put EMLNamer.bat into a folder where I could find it later.  I opened the Fullnames.txt file and pasted its contents into Excel.  Some lines seemed to be missing:  the count came to somewhat fewer than the number of files shown in the Windows Explorer folder, minus two (for EMLNamer.bat and Fullnames.txt themselves).  I guessed that the names of a few EMLs had presented complications for the script.  I would have to process the rest and see what remained.
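One way to see exactly which files the batch file skipped -- a check I did not actually run at the time, sketched here in Python -- would be to compare the folder's .eml filenames against the names recorded in Fullnames.txt:

```python
# Sketch (not part of the original workflow): list the .eml files that never
# made it into fullnames.txt, i.e., the ones the batch script skipped.
# Each recorded line begins with the bare filename, so a simple substring
# test is enough for this rough check.
import glob, os

def find_skipped(folder=".", listing="fullnames.txt"):
    with open(listing, encoding="latin-1", errors="replace") as f:
        recorded = f.read()
    return [os.path.basename(e)
            for e in sorted(glob.glob(os.path.join(folder, "*.eml")))
            if os.path.basename(e) not in recorded]
```

The substring test is crude -- a filename that happens to be a prefix of another would be missed -- but for a one-off sanity check it would have told me which few EMLs needed manual attention.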

Third Step: Improving the EML File Name

I looked at the new Excel spreadsheet.  Spot checks, supplemented by previous experience with ImportExportTools, yielded the following observations:
  • The first 13 characters in each filename seemed to match the date and time (in 24-hour format) shown in Thunderbird for the email in question -- the time, that is, when the email was sent or received.
  • The next characters indicated the sender.  This string ended, in some cases, with three characters (namely, "_-_") and in other cases with just one (namely, "-").  It seemed that ImportExportTools would surround some senders' names with underscores ("_") but would not do so for others.  The reason seemed to be that those senders' names appeared within brackets.  For instance, I had emails from "[Wordpress.com]" that now appeared as "_WordPress_com_."  So at least in these situations, the underscore seemed to be something that I could replace with a space, which would then be removed by an Excel TRIM command if it appeared at the start or end of a string.
  • Some senders' names ended with "_com."  Ordinarily, the preceding note would suggest replacing that with ".com," and likewise for ".org," ".edu," and so forth.  But I decided that step would come later, if at all:  instead, I would start by identifying full names (e.g., "Yahoo_com") that I might want to replace with simpler names (e.g., "Yahoo").
  • Hyphens were not always a reliable indicator of the end of a sender's name.  For example, an email from some "Pan-European" organization came through the ImportExportTools process unchanged.
  • ImportExportTools seemed to replace apostrophes with underscores.  So instead of "Miller's" I would get "Miller_s_."  Likewise for other uses of the apostrophe (e.g., "Don't" became "Don_t_").  It seemed that, before doing any sweeping replacement of underscores, I might want to look for those sorts of special cases.
  • Due to the EMLNamer process, the end of the Subject field and the beginning of the Recipient field were marked by ".emlTo:" -- which was certainly recognizable.
  • Subject fields often began with things like "Fwd_" and "Re_" -- which, I had decided in a previous use of ImportExportTools, would best be deleted.
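The pattern replacements implied by these observations could be sketched roughly as follows.  The specific substitutions here are illustrative guesses based on the quirks just noted, not an exact record of the find-and-replace operations I performed:

```python
# Illustrative cleanup of an ImportExportTools-style filename, based on the
# quirks noted above; the exact substitutions needed will vary by mailbox.
def clean_name(name):
    for old, new in [
        ("_s_", "'s "),    # apostrophes became underscores ("Miller_s_")
        ("_-_", " - "),    # underscore-wrapped sender names
        ("-Re_", "-"),     # drop "Re_" at the start of the subject
        ("-Fwd_", "-"),    # drop "Fwd_" likewise
    ]:
        name = name.replace(old, new)
    # Collapse any leftover underscores and runs of spaces.
    return " ".join(name.replace("_", " ").split())
```

As the notes above suggest, this kind of blanket replacement has to be applied cautiously: a hyphen in "Pan-European" is content, not a delimiter, and no fixed rule list will catch every case.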
In short, the default results from ImportExportTools (possibly altered during my previous tinkering) were creating some confusion.  I deleted the existing EMLs from the output folder, so as to start over.  Then I went into Thunderbird > Tools > ImportExportTools > Options and made several changes.  In the Misc. tab, I set each item to a maximum of 100 (rather than 50) characters.  (This wasn't exactly a mistake, but I would later realize that, as a result of this change, I needed to be more aggressive in keeping the total filename length relatively short; otherwise, it would cause problems in some other Windows operations.)  In the Filenames tab, I unchecked the option to "Use just alphanumeric characters in export"; I left the format to be Date - Sender - Subject; I left "Add time to date" checked; and I unchecked the "Cut subject" and "Cut complete file path" options.  In the "Export directories" tab, I chose the "create a new directory and the index of messages" option.

When I ran that, I got an index.html file listing relevant information about each file:  its subject, from, to, date, and an indication of whether it had attachments.  This did not appear likely to be helpful, given its HTML format.  In the output folder, there was the right number of files.  I ran EMLNamer.bat again.  This time, the command window gave me some error messages.  Preliminarily, it seemed they were produced by the length of the filenames.  I could not save them before the command window closed.  There was probably a way to modify EMLNamer.bat to save those messages to a file, but I did not tinker with that at this point.  These messages appeared to be in addition to the unknown problems that had prevented Fullnames.txt from containing a complete list of all EMLs:  there were now about 20 filenames missing from the output that I pasted into Excel.  So, again, those would have to be dealt with manually.

This time around, when I pasted the results from Fullnames.txt into Excel, I saw that the output filenames had characteristics largely similar to, but in some regards different from, those noted above.  There were fewer underscores, which meant that it would probably be simpler to develop rules to translate them into more useful characters.  Hyphens were still not reliable field-end indicators.

Manipulating the File Information in a Spreadsheet

In Excel, after a couple of false starts not detailed here, I took the following steps:
  • Insert row 1 for column headings.  Label column A as "Combined."  These entries contained the combined original filename plus the "To:" information added by EMLNamer.bat.
  • In column B (heading:  "Original"), use =LEFT(A2,FIND(".emlTo: ",A2)-1) to obtain the original filename as exported from Thunderbird.  I would need this to remain unchanged:  my ultimate goal, a batch command indicating how the original filename should be changed, would need this information to tell the command processor what file was being renamed.  As with all other columns discussed below, I copied the formula down the column to all rows in use.
  • In column C (heading:  "Find & Replace"), use =A2.  Fix the values in this column -- that is, make them permanent by highlighting them all and using the Edit > Copy, Edit > Paste Special > Values sequence.  The shortcut key sequence for Excel 2003 -- which I believed would also work in ribbon versions like Excel 2007 and 2010 -- was Alt-E C, Alt-E S V Enter Enter.  Now column C contained values rather than formulas.
  • Move the values from column C to a new worksheet.  Don't rearrange them.  I needed a new worksheet because I was going to be using global find-and-replace (Ctrl-H) commands, and I didn't want to have to try to protect columns A and B from being affected by these commands.
  • In that new worksheet, I made changes to the list that I had just brought over from column C in the first worksheet.  The first thing I did was to search, in Excel, for an unusual character -- one that did not already appear anywhere in the list.  The caret ("^") was one such character.  I would use this as my field delimiter.  I didn't want any of my Subject field entries to begin with "Re" or "Fwd," so I started by replacing "-Re_" and "-Fw_" and "-Fwd_" with carets, gambling (on the basis of previous experience) that there would be few instances where this would prove inadvisable.
  • I also replaced the "-_" and "_-" and "-[" combinations with carets.  To reduce the number of underscores potentially requiring manual attention, I did one or two additional find-and-replace operations in obvious cases; for example, "Woodcock_s " (ending with a space) became "Woodcock's " (also ending with a space).  It could have been counterproductive to go too far with this, though.  For example, I did not try to remove underscores from every version of my name and email address, because that could have created additional variations on my name, somewhere down in the list, potentially complicating the number of things I would have to look for later.  It was better to leave the underscore as a flag for some purposes.  Then I cut and pasted that modified list back into column C in the main worksheet.
  • Back in the main worksheet, in column D, I set up a Date and Time column, using =LEFT(C2,13).  I didn't parse that column for the various year, month, day, hour, and minute components at this point; that could wait until I needed that information.
  • In column E, I created my first Remainder column.  The purpose of the Remainder columns was to show what was left from the modified values appearing in column C, after removing whatever I had just separated out (in this case, the date and time).  The formula was =TRIM(MID(C2,15,LEN(C2))).
  • I used column F for the Recipient (i.e., "To") value, from the end of the string appearing in the Remainder column (E).  The reason was that this was a fairly obvious entry, and its removal would simplify the next steps.  The formula in column F was =TRIM(MID(E2,FIND(".emlTo: ",E2)+7,LEN(E2))).
  • Column G could be another Remainder column:  =TRIM(LEFT(E2,FIND(".emlTo: ",E2)-1)).
  • In column H (heading:  "Left 1"), I entered =LEFT(G2,1).  The reason was that ImportExportTools had failed to export the names of some senders, notably those appearing in angle brackets ("< >"), and I couldn't identify them by just sorting on the Remainder column because Excel would irritatingly overlook those characters when doing a sort.  But now I could sort on column H and make manual entries of those senders' names in the appropriate column.  I had not yet created that column, nor made those manual entries, because there was something else I needed to do first:
  • In column I, under a "Hyphen" heading, I entered =IF(ISERROR(FIND("-",G2)),"",FIND("-",G2)).  In column J (heading:  "Caret"), I entered =IF(ISERROR(FIND("^",G2)),"",FIND("^",G2)).  Finally, in column K (heading:  "Best"), I entered =IF(J2="",I2,J2).  Column I would look for the first occurrence of a hyphen in the Remainder (column G).  Column J would do likewise for a caret.  It was necessary to use both because, at this point, either one might have been the delimiter indicating the end of the Sender field.  Column K would favor carets over hyphens, so as to reduce the number of problems with senders with hyphenated names.
  • In column L ("Sender"), I used =TRIM(LEFT(G2,K2-1)).  This produced good Sender names in most cases.  It was not yet time to deal with the exceptions.
  • In column M ("Subject"), I used =TRIM(MID(G2,LEN(L2)+1,LEN(G2))). This produced good Subject names in most cases. Now it was time to deal with the exceptions.
  • I went back and sorted on column H to identify those rows where I would have to make manual entries of Sender names because none was provided by ImportExportTools.  I put those entries in column L as needed, replacing whatever the automatic calculation had put there.  To assist in my process of looking up those that I didn't recognize, I sorted the From column in Thunderbird, for the Export folder, to gather all those senders at the top of the list for easier reference; I moved these items into a separate subfolder, sorted by Subject; I maximized the viewable space for that list; and once I had dealt with them, I moved them to another subfolder, so as to reduce the size of the list that I would have to page through.  The objective here was just to make sure I had a coherent division of information between the Sender and Subject columns -- to prevent some Sender data from appearing in the Subject column, or vice-versa. Cleaning them up or otherwise improving them at this point would have been premature.  Changing Sender names worked best if I made the changes back in column G, or if I altered or removed numbers in columns I and J.  Just making a change in the Sender column would leave a problem in column M.  It helped, for this purpose, to fix the values in column G (that is, to replace formulas with values; see the procedure described in connection with column C, above).
  • I sorted on column M ("Subject") and cleaned up the entries there.  I found that I wanted to do find-and-replace operations on multiple entries.  I decided at this point that I could safely fix the values of the entire spreadsheet.  It seemed that I would want to sort and re-sort these Subject values to get similar ones together.  To preserve the original order, I added an Index column, indicating the original numerical order of entries.  (Enter 1 and 2 in the first two rows; highlight all rows to be numbered; then hit Alt-E I S Enter.)  Then I moved the Subject and Index columns to a separate temporary worksheet, where I could do these sweeping changes without affecting other columns.  There, I reversed these two columns, putting Index on the left, to keep it out of harm's way.  My changes here included LEFT and RIGHT commands to sort by first and last characters of Subjects (supplemented, on the left, with CODE comparisons, to identify unwanted lowercasing), as well as FIND and Ctrl-H searches and replacements for underscores (doing many replaces to eliminate most instances) and other text that I wanted to change across multiple Subjects.  To identify undesirable characters (e.g., exclamation marks and others whose presence in filenames might mess up batch commands and other applications), I used SUBSTITUTELIST.  SUBSTITUTELIST would remove the characters listed in a separate worksheet (generated with a series of numbers 1-255 in column A and a corresponding =CHAR(A1) in column B).  I could have had it remove characters that looked unwanted, but to be cautious I decided instead to have it remove everything that I knew was normal (i.e., 0-9 and a-z and A-Z, plus a few others) and show me what was left.
  • I deleted columns that were unnecessary, now that I had fixed the values.  I also moved some columns, and inserted a few new ones.  My arrangement was now as follows:  Index (column A), Original (B), Date & Time (C), Sender (D), NewSender (E), Recipient (F), NewRecipient (G), and Subject (H).
  • I copied values from columns D (Sender) and F (Recipient) to a separate worksheet.  There, I did a unique filter.  This gave me a list of names that I might want to change or simplify.  I put the original (unique) names in column A of that separate worksheet, sorted them, and entered the desired replacement names in column B.  I named this worksheet Names; I planned to keep it for future Thunderbird EML exports.  I named the main worksheet Data.  I sped up the process of developing replacement names by using various functions (e.g., FIND, MID) to distinguish first and last names of individuals.  When I had my completed list of preferred names for Senders and Recipients, I went back to the main (Data) worksheet.  In column E (NewSender), I entered =VLOOKUP(D2,Names!$A$2:$B$869,2,FALSE).  (There were 869 rows in the Names worksheet.)  I copied that formula over to column G (NewRecipient); it provided a similar replacement for the Recipient values.
  • I inserted columns to figure out the date and time.  In column D ("Y"), I used =LEFT(C2,4).  In column E ("M"), I used =MID(C2,5,2).  In column F ("D"), I used =MID(C2,7,2).  In column G ("H"), I used =MID(C2,10,2).  In column H ("M"), I used =MID(C2,12,2).  Finally, in column I ("NewDate"), I used =D2&"-"&E2&"-"&F2&" "&G2&"."&H2.
  • I added column O ("New Name").  There, I used =CHAR(34)&I2&" Email from "&K2&" to "&M2&" re "&N2&".eml"&CHAR(34).  This produced a new name for the EML file.  I sorted on this column to identify instances where my formulas had failed, and made corrections as needed.
  • I added column P ("Batch").  There, I used ="ren "&CHAR(34)&B2&".eml"&CHAR(34)&" "&O2.  This produced a batch command to rename the EML file to my preferred new name.  I copied the command down the column and then copied all those commands, one from each row, to Notepad.  I saved the Notepad file as Renamer.bat and put it into the folder where the EMLs were located.  I ran Renamer.bat.  The renamed files sorted conspicuously in Windows Explorer, so I didn't need to work up a modification of these REN commands in a new column Q, using MOVE instead of REN, to move the newly renamed files to another folder.  Instead, I could just cut them from the folder in Windows Explorer and put them aside.
  • Now I had a couple dozen EMLs remaining.  They had not renamed properly.  I probably should have added something like " > errorlist.txt" at the end of each batch command, to show me whether I was trying to give the same name to two different files.  I did a DIR of the files remaining, saved its output to dirlist.txt, copied the contents of dirlist.txt into Excel, and compared them against my main spreadsheet.  To my surprise, none of these files appeared in the original list of files shown there.  I'd had some problems not described in this post; had I somehow dropped some EMLs somewhere along the line?  Was I not doing this comparison properly?  I did not have a clear answer.  I worked up another set of new file names for these EMLs, substantially following the steps presented above, and renamed them.  It looked like, somehow, at least some of them were duplicates after all.  So I was not understanding something there.  Others were apparently not renaming properly because the original filenames contained characters like ®.
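Looking back over these steps, the column-by-column spreadsheet logic amounts to a single parsing routine.  Here is a rough Python restatement of the formulas above -- a sketch, not code I actually ran; it skips the name-cleanup and NewDate reformatting steps, using the raw 13-character date instead:

```python
# Sketch of the spreadsheet logic: split one Fullnames.txt line into
# date/time, sender, subject, and recipient, then build a rename command.
def parse_line(combined):
    original, recipient = combined.split(".emlTo: ", 1)   # columns B and F
    recipient = recipient.strip()
    date_time = original[:13]           # column D: =LEFT(C2,13)
    remainder = original[14:].strip()   # column E: =TRIM(MID(C2,15,LEN(C2)))
    cut = remainder.find("^")           # column K: prefer caret over hyphen
    if cut < 0:
        cut = remainder.find("-")
    if cut < 0:                         # no delimiter: treat it all as sender
        return date_time, remainder, "", recipient
    return (date_time,
            remainder[:cut].strip(),            # column L: Sender
            remainder[cut + 1:].strip(),        # column M: Subject
            recipient)

def rename_command(combined):
    # Columns O and P: the new filename and the REN command.
    d, sender, subject, recipient = parse_line(combined)
    original = combined.split(".emlTo: ", 1)[0]
    new_name = "%s Email from %s to %s re %s.eml" % (d, sender, recipient, subject)
    return 'ren "%s.eml" "%s"' % (original, new_name)
```

As with the spreadsheet, the caret-before-hyphen preference exists to keep hyphenated sender names ("Pan-European") from being split at the wrong place, and the no-delimiter case corresponds to the rows where I had to enter a Sender manually.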
In the end, I wound up with all but 10 of the exported emails.  But which 10 did I lose?  I probably didn't lose any.  There was a point when I deleted a handful of what I thought were duplicates.  Now it seemed they probably weren't.

As these final notes suggest, this process went much more smoothly than my previous exports of emails from Thunderbird, but there were still some parts where I was making mistakes or where things did not go as planned.

Fourth Step:  Converting the Appropriately Renamed EML to PDF

Some of the previous posts cited at the top of this post grappled with the problem of converting EML to PDF.  It would seem that it should have been a straightforward matter of selecting EMLs in Windows Explorer -- indeed, within Thunderbird itself -- and clicking a Print command.  Alas, it was not, not if the goal was to have PDFs whose filenames would be recognizable.  I hoped there would be a Thunderbird add-in that would take care of all the many steps shown above.  I hadn't found one yet.  In my most recent effort, I had proceeded only as far as a truly cumbersome solution that divided EMLs with and without attachments, used Emacs to edit EMLs with attachments so that they would print, extracted attachments to separate files that could also be PDFed, and then manually matched the PDFed attachments up with their PDFed parent emails.  Truly a mess, very time-consuming, and for that reason I hadn't gone very far with it.  Most of my emails were still in EML rather than PDF format.

I had decided, generally, that PDF was the superior long-term archival format.  I didn't want lots of formats rattling around, lest the day come (as had happened with previous formats) when it was a struggle to find software that would read them.  That said, EMLs were presently displaying nicely in Thunderbird, with easy access to attachments.  Having devoted the time to the foregoing effort, I was presently out of time for further development of this project.  So the stack of EMLs grew higher, and the day for conversion to PDF still lay somewhere in the future.