Finding and Cleaning Up EMLs That Display HTML Codes as Text
I had a bunch of email (EML) files scattered around my hard drive. Some of them, I noticed, were displaying a lot of HTML codes. For example, when I opened one (using Thunderbird as the default EML opener), it began with this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <HTML> <HEAD> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> <META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0"> <TITLE>RE: Scholar Program</TITLE> </HEAD> <BODY> <!-- Converted from text/rtf format -->
Finding the Offending Files
I started with a FINDSTR search:

findstr /r /m /s "<!DOCTYPE HTML PUBLIC" D:\*.eml > D:\findlist.txt

It produced a dozen "Cannot open" error messages. The reason seemed to be that the filenames for those files contained funky characters (e.g., #, §). Also, Findlist.txt contained the names of files that did not seem to have the DOCTYPE text specified in the command. DOCTYPE may have appeared in attachments to those files, but I didn't want to be flagging that sort of EML file. So, after a number of variations on FINDSTR and several Google searches, I gave up. I returned to Copernic, searched for the DOCTYPE text (in quotation marks, as shown above), and manually moved the files it found. Copernic had a convenient right-click Move to Folder option, so that helped a little. Now, despite the imperfections of the process, I apparently had the desired EMLs in a single folder. I would just have to re-sort them back to where they belonged manually.
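In hindsight, part of the problem may have been the command itself: FINDSTR treats a space-separated pattern as several alternative search strings unless /c: is used, and /r makes the pattern a regular expression, which could explain why files lacking the full DOCTYPE text showed up in the list. A literal-string version (which I can't say I tested on these particular files) would look more like this:

findstr /m /s /c:"<!DOCTYPE HTML PUBLIC" D:\*.eml > D:\findlist.txt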
But I still wasn't sure that everything in that folder was problematic. Basically, I needed to see what the EMLs looked like when they were opened up. Ideally, I would have just clicked a button at this point to convert them to PDF and merge them into a single document, so I could just flip through and identify the problem emails. But I was having problems in my efforts to print EMLs as PDFs. As a poor second-best, I manually opened them all (again, using Thunderbird as my default EML opener), selected the ones needing repair in Windows Explorer, and moved them to a separate folder. To open them, I just did a "DIR /b /a-d > Opener.bat" and modified its contents, using Excel, so that each one started and ended with a quotation mark (actually, CHAR(34)) -- no other command needed -- and then ran Opener.bat. Somehow, this failed to crash my system.
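For what it's worth, a batch-only way to build that same sort of Opener.bat, skipping the Excel step, might be a one-line helper batch file run from the folder containing the EMLs (a sketch, not what I actually ran):

(for /f "delims=" %%f in ('dir /b /a-d *.eml') do @echo "%%~ff") > Opener.bat

Each output line is just a quoted full path, so running Opener.bat hands each file to the default EML opener, same as before.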
Cleaning Up the Files
After verifying that most of them looked bad (and removing the others), I made copies in another folder, and renamed the copies to .TXT extensions using Bulk Rename Utility. Now I could edit them as text files. My plan was to store up a set of standard search-and-replace items, mostly replacing HTML codes with nothing at all, so as to clean up these files.
I had previously decided on Emacs as my default hard-core text editor, and had taken some first steps in re-learning how to use it. The task at hand was to find advice on how to set up before-and-after lists of text strings to be replaced. It was probably something I could have done in Excel, but I might have had to cook up a separate spreadsheet for each file, and here I was wanting to modify multiple files -- dozens, possibly hundreds -- in one operation. Now, unfortunately, it was looking like Emacs was not going to be as naturally adapted to this task as I had assumed. After a couple of tries, I found a search that did bring up a couple of solutions to related problems. But those solutions still looked pretty manual. Was there some more tried-and-true tool or method for replacing multiple text strings in multiple files?
A different search led to HotHotSoftware, which offered a tool for this purpose for $30. A video seemed to demonstrate that it would work. But, you know, $30 was more than the files were worth. Besides, I wouldn't learn anything useful that way. ReplacePioneer ($39, 21-day trial) looked like it might also do the job. A thread offered a way to do something like it in an unspecified language, presumably Visual Basic. Another thread offered an approach in sed. Another way to not learn anything, but also not to spend $30, was to try the free TexFinderX. Other free options included Nodesoft Search and Replace and Replace Text.
I tried TexFinderX. In its File > Add Folder menu pick, I added the list of files to be changed. I clicked the Replacement Table button, but did not see the Open Table Folder button shown on the webpage. The ReadMe file seemed to say that a new replacement table would appear in the list only after being manually created in the TFXTables subfolder. They advised using an existing table to create a new one. As I viewed their "Accented to None - UTF8.txt" replacement table, I recalled looking into character replacement using Excel formulas. The specific point of comparison was that I had discovered, in that process, that people had invented various character conversion tables that might be suitably implemented with TexFinderX.
For my own immediate purposes, I needed to see whether a TexFinderX replacement table would accept a whole string of characters, to be replaced by nothing or, say, a single space. I was hoping that the "before" and "after" columns in that "Accented to None" table were tab-delimited -- that, in other words, I could enter a whole long string, hit the Tab key, and then hit the spacebar. I tried that: I first saved the "Accented to None" table under the name "Remove HTML Codes," and then entered "<!DOCTYPE HTML PUBLIC "-//W3C//DTD W3 HTML//EN">" (without the outside quotation marks, of course) and hit Tab and then Space. I did this on what appeared to be the first replacement line in the file, right after the line that said /////true/////, as guided by the ReadMe. I hit Enter at the end of that line and deleted everything after it, removing all the entries they had provided. I also changed the top lines, the ones that explained what the file was about. I saved the file, clicked the program's Replacement Table button, and there it was. I selected it and clicked Apply. On second thought, I decided to try it on just one or two files first, so I emptied out the file list and added back just a couple of files. Then I ran it. It looked like it worked.
I proceeded to add all kinds of other HTML codes to my new Remove HTML Codes replacement table, testing and running and removing more unwanted stuff. I found that it was not necessary to hit Tab and then Space at the end of each line that I wanted to remove; it would remove anything that was on a line by itself, where no other tab-delimited text followed it on the same line. So, basically, I could copy and paste whole chunks of unwanted text into the replacement table, and it would be removed from any files on the list that happened to contain it. It seemed best not to add too many chunks at once, lest I be repeating the same lines: run a few, after eyeballing them for duplication, and then see what was left. It appeared that I could add comments, on these lines in the replacement table, by again hitting Tab after the "replace" value on the line.
I added back some of the original table's items, in modified form. These included replacing three spaces with two (a rule I might run several times to be thorough), and replacing a Space-CR (carriage return) combination with a plain CR -- entered as space-<13> Tab <13> -- along with an apparently equivalent rule using <10> (line feed) in place of <13>. I tried replacing three CRs with two, using <13><13><13> on one line, but that didn't work; the answer seemed to be to replace three <13><10> pairs with two. I also discovered that the conversion process that had originally mangled these files had split HTML code sequences across different lines, so I had to break my search strings into smaller pieces -- but not too small, because I didn't want to accidentally delete real email text that happened to resemble these HTML codes.
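As a rough illustration of the format (this is a mock-up rather than my actual table, with [TAB] standing in for a real Tab character), the working part of the Remove HTML Codes table looked something like this:

/////true/////
<!DOCTYPE HTML PUBLIC "-//W3C//DTD W3 HTML//EN">
<!-- Converted from text/rtf format -->
<13><10><13><10><13><10>[TAB]<13><10><13><10>[TAB]three CR-LF pairs become two

The first two lines have nothing after them, so any matching text is simply deleted; the last has a tab-delimited replacement followed by a comment.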
I basically worked through all the codes that appeared in one email, then applied the accumulated rules to the next email and worked on whatever codes remained there, and so forth. After working through the first half-dozen files in the list, I skipped down and ran the accumulated corrections against some others. Running the table repeatedly seemed to clear up some issues; possibly it was able to process only one change per line per run. I realized that it would probably not produce perfect results in every case. It was succeeding, however, in giving me readable text that had previously been concealed beneath a mountain of HTML codes.
I had noticed that the program took a little longer to run as I added more rules to its replacement table. But this did not seem to be due to file processing time: the time did not grow far longer when I added far more files to the list. It was still done within a minute or so in any case. Apparently it was just reading the instructions into memory.
The excess (now blank) lines in the files were the slowest to remove. I ran TexFinderX against the whole list of files at least a half-dozen times, adding a few more codes with the aid of additional spot checks. Unless I was going to check every individual file for additional lingering codes, that appeared to be about as far as TexFinderX was going to take me in this project.
Cleaning Up the Starts and Ends of Files
href="http://raywoodcockslatest.blogspot.com/2012/03/choosing-emacs-as-text-editor-with.html" target="_blank">previouslyused Emacs to eliminate unwanted ending material from files. Now I wanted to use a similar process on these files. I also wanted to see if I could adapt that process to remove unwanted material elsewhere in the files.
I had not previously noticed that most if not all of these emails had originally included attachments. As such, they included certain lines after their text, apparently announcing the beginning of the attachment portion. These lines included indications of Content-Type, Content-Transfer-Encoding, and Content-Disposition. These seemed like good places to identify the start of ending material to delete, for purposes of printing a cleaned-up message portion by itself. I now saw that I had made things more difficult for myself by including references to some Content-Type and Content-Transfer-Encoding lines in my list of items to remove in TexFinderX. I had not removed Content-Disposition lines, however, so -- as in the previous use of Emacs -- those would be my focus.
Having already done the initial setup of GNU Emacs as described in the previous post, I set forth to modify the process that I had used previously. After making a backup, the summary version of those steps, as modified, went like this:
- Start Emacs. Open one of the post-TexFinderX emails. Hit F3 to start macro recording. C-End (that is, Ctrl-End, in Emacs-speak) to go to the file's end. Hit C-r and type "Content-Disposition" to back up to its last occurrence of Content-Disposition.
- At this point, modify the previous approach to back up a bit further, in search of the boundary line just preceding the Content-Disposition line. I could have done this by hitting C-r and typing "----------" to find that boundary line, but now I saw that my TexFinderX replacements had deleted that, too, from some of these emails. So instead, I just hit the Up arrow three times, hoping that that would take me to a point before most of the ending material.
- Hit C-space to set the mark. C-End. Del.
- C-s to search for Message-ID. Then C-e to go to the end of that line, and right-arrow to go to the start of the next line. C-Space to set the mark, C-Home, and then Del. That was as much as I could do with this particular email; it was clean, though not ideally formatted.
- C-x C-s to save the file. F4 to end the macro recording. C-x C-k n Macro1 Enter (to name the macro to be Macro1). C-x C-k b 1 (to bind the macro to key 1).
- C-x C-f ~/ Enter (to find my Emacs Home directory). In my case, Home was C:\Users\Ray\AppData\Roaming\.emacs.d. I went there in Windows Explorer and created a new text file named _emacs, with no extension. This was my init file.
- From the Emacs menu: File > Open File > navigate to the new _emacs init file > select and open _emacs. Using the Meta (i.e., Alt) key, I used M-x insert-kbd-macro Enter Macro1 Enter. This hopefully saved my macro to my init file. C-x C-c to save and quit Emacs. A quick look with Notepad confirmed that there was something in _emacs.
- Restart Emacs. Open another of these text emails. Test my macro by typing C-x C-k 1. I got "C-x C-k 1 is undefined." I killed Emacs and, following advice, renamed _emacs to init.el in Windows Explorer and tried again. Still undefined. Since _emacs had worked in my previous session, I decided that the advice about init.el might be oriented toward Unix rather than Windows systems, so I changed it back to _emacs. In the Emacs menu, I went to File > Open File > navigate to _emacs > open _emacs. I used C-x 2 to split the window; _emacs appeared in both panes. In the top pane, I went to Buffers > select the text file to be changed. (Apparently it was listed as one of the available buffers because I had already opened it.) So now I was viewing the macro in the bottom pane and the email file in the top pane. I selected the top pane and tried C-x C-k 1 again; still undefined. I found other advice to just use M-x Macro1. That worked: the macro ran in the top pane. (In retrospect, the binding was probably the missing piece: insert-kbd-macro apparently saves the macro definition but not the C-x C-k 1 key binding, unless it is given a prefix argument.)
Converting Emails to PDF
I had previously used "Notepad /p" to convert a set of TXT files, like these emails, to a set of PDFs. The basic idea was to make a list of files and then use Excel to convert those file paths and names (as needed) to batch commands. I used that same approach here, making sure to set the PDF printer operate with minimal dialog interruptions. This produced PDFs with "Notepad" at the end of their names. For some reason, Bulk Rename Utility was not able to remove that; I had to use Advanced Renamer instead.
Some of the extracted attachments were Word docs, and Word had problems printing a number of them. It crashed repeatedly during this process, whereas it had sailed right through other stacks of docs that I had converted to PDFs using the same techniques. It did produce some PDFs. I looked at those to make sure they had turned out OK, and then had to do a DIR /a-d /b *.pdf > successlist.txt in the output folder to see which docs had been successfully PDFed, and then convert successlist.txt into a batch file full of commands to delete the corresponding DOCs, so that I could try again with the DOCs that hadn't converted properly the first time. Before re-running the doc-to-PDF conversion batch file, I opened one of the failed DOCs and printed it to PDF. That went fine, as a manual process, so apparently it was not, in every case, a problem with the file. Ultimately, I used OpenOffice Writer 3.2 and was able to print the remainder manually, using just a few keystrokes per file, with no problems.
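That successlist.txt step could likewise have been turned into a delete script with a short batch loop rather than Excel. A rough sketch, assuming the PDFs kept the same base names as their source DOCs, with successlist.txt copied into the DOC folder and the loop run there as a .bat:

for /f "delims=" %%f in (successlist.txt) do @if exist "%%~nf.doc" del "%%~nf.doc"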
Other extracted attachments were text files. At this point, I had two ways of dealing with these. On one hand, I could have used the same process as I had just used with the Word docs, after changing the command used for .doc files to refer instead to .txt files. I did start to use this approach, but ran into dialogs and potential problems. On the other hand, I could have used the approach of printing to Notepad, as I had used with the emails themselves (above). Before I got too far into this task, though, I noticed that every one of these text files had names like ATT3245657.txt. They also all originated from the same source. I examined a handful of these attachments and decided I could delete them all.
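Since they all followed that ATT#######.txt naming pattern, deleting them came down to a one-liner (the folder path here is hypothetical):

del "D:\EMLfix\attachments\ATT*.txt"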
Some extracted attachments were image files -- JPG, GIF, PNG, BMP. I also had a dozen attachments without extensions. I opened the latter in IrfanView. I believe there was an IrfanView setting that allowed it to recognize, as it did, that some of these were actually image files, and to offer to rename them (as PNGs or whatever) accordingly. On the other hand, as I looked through these files, I saw that some of the GIFs were animations. Excluding those, I now had a list of what appeared to be all the attachments that should be treated as image files. I used IrfanView's File > Batch Conversion/Rename option to convert these to PDF.
There were a few miscellaneous file types. For videos, I just took a screenshot in the middle and used that as an indication of what the original attachment had been. One alternative would have been to use something like Shotshooter.bat to produce multiple images conveying a sense of the direction of the images in the video, and then combine those images in a single PDF.
Combining Email and Attachment PDFs
Now I had everything in PDF format. I used Bulk Rename Utility to rename emails and attachments so that, when combined into one folder, each email would come before its associated attachments (if any), and the difference between the two would be readily visible. I combined the files and attachments into one folder and made a list of the files using DIR (above).
Now the goal was to combine the emails that did have attachments with their accompanying attachments. There were probably too many of these to combine them manually, one set at a time, using Acrobat or something like it. I had previously worked out a convoluted approach for automating the merger of multiple PDFs (produced from multiple JPGs), using pdfSAM. Discussion on a SuperUser webpage and elsewhere suggested that pdftk and Ghostscript were alternatives. The instructions for Ghostscript looked more complex than those for pdftk, so I decided to start with pdftk.
I downloaded and unzipped pdftk. As advised, I copied the two files from its bin folder (pdftk.exe and libiconv2.dll) into C:\Windows\System32. I opened a command prompt in some other folder, at random, and typed "pdftk --help." This was supposed to give me the documentation. Instead, it gave me an error:
pdftk.exe - System Error
The program can't start because libiconv2.dll is missing from your computer. Try reinstalling the program to fix this problem.

I moved the two files to C:\Windows and tried again. That worked: I got documentation, though it scrolled on past the point of recovery. Typing "pdftk --help > documentation.txt" solved that problem, but ultimately it didn't seem to give me anything more than already existed in pdftk's docs subfolder. The next step was to put pdftk to work. It would apparently allow me to specify the files to combine, using a command of this form:
pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf

My problem was that, at least in some cases, the filenames I was working with were too long to string together on a single command line like that. I decided the solution was to take a directory listing, put it into Excel, and use it to create commands for a batch file that would rename the emails and their accompanying attachments with names like 0001.pdf. I would need to keep the spreadsheet for a while, so as to know what the original filenames were; the original filenames were my guide as to which files needed to be combined together. For this purpose, with the original filenames in column A, I put ascending file numbers in cells B1, B2, and so on (i.e., 1, 2, ...); in cell C1, I put =REPT("0",4-LEN(B1))&B1&".pdf"; and in cell D1, I put ="ren "&CHAR(34)&A1&CHAR(34)&" "&C1. Then I copied the commands from column D into Notepad, saved them as Renamer.bat, and ran it.
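A line of the resulting Renamer.bat would therefore look something like this (the filename is invented for illustration):

ren "2009-05-12 14.30 RE Scholar Program.eml.pdf" 0001.pdf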
After doing that renaming, I went back to the spreadsheet for guidance on which of these numbers needed to be combined. Each original filename began with date and time. With few exceptions, this was sufficient to distinguish one email and its attachments from another. So I used =LEFT to extract that identifying information from column A. Then, in the next columns, I used IF statements to compare the extract from one line to the next, concatenate the appropriate filenames with a space between them, and choose which concatenations I would be using. Finally, I added a column to create the appropriate command for the batch file. Instead of the 123.pdf output shown in the example above, I used the original email filename. Where there were no attachments, pdftk would thus just convert the numbered PDF (e.g., 0001.pdf) back to its original name.
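The generated commands looked something like the following (filenames again invented); here 0001.pdf is an email with two attachments (0002.pdf and 0003.pdf), and 0004.pdf is an email with none:

pdftk 0001.pdf 0002.pdf 0003.pdf cat output "2009-05-12 14.30 RE Scholar Program.eml.pdf"
pdftk 0004.pdf cat output "2009-05-13 09.15 FW Travel Plans.eml.pdf"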
I finished with spot checks of various files, with and without attachments, to verify that they had come through the process OK. I was not happy with the remaining junk in the emails themselves, but at least I could tell what they were about now, and they had their attachments with them. Pdftk had proved to be a much easier tool for this project than pdfSAM. This had been an awful lot of work for not terribly much achievement on some not very important files, but at least I had finally worked through all of the steps in the PDF conversion process for Thunderbird emails with attachments.