Finding and Cleaning Up EMLs That Display HTML Codes as Text
I had a bunch of email (EML) files scattered around my hard drive. Some of them, I noticed, were displaying a lot of HTML codes. For example, when I opened one (using Thunderbird as the default EML opener), it began with this:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <HTML> <HEAD> <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1"> <META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7036.0"> <TITLE>RE: Scholar Program</TITLE> </HEAD> <BODY> <!-- Converted from text/rtf format -->
I was not sure how that happened. Apparently I had run these EMLs through some kind of conversion process, perhaps after renaming them to be .txt files. Whatever the origin, I wanted to eliminate all those HTML codes and wind up with a plain text file, probably saved as a PDF. This post describes the steps I took to achieve that outcome.
Finding the Offending Files
As I say, the files containing this text were scattered. Initially, I did a search for some of the text shown above (specifically, for "<!DOCTYPE HTML PUBLIC") in Copernic. (I assume any tool capable of searching for text within files would work for this purpose.) I thought maybe I would just copy and paste the lot of them from Copernic to a separate folder in Windows Explorer, where I could work on them in more detail. This approach failed because Copernic did not allow me to select and move multiple files to other folders. Moreover, Copernic did not display them with their actual filenames; rather, it showed the title indicated in the HTML "<TITLE>" line (see example above).
It was probably just as well. Moving them in bulk from Copernic would have lost the indications of the folders where they were originally located. The better approach, I decided, would be to use the command line and batch files to identify their source folder, move them to a single folder where I could work on them, and then move the resulting, cleaned-up files back to the folders where the originals had come from.
So the first thing I needed was a way to locate the files to be cleaned up. I decided to use a batch command for this purpose. I could have searched for every file (or just every EML file) that contained any HTML codes. For that purpose, a search for "</" might have done the trick. But then I decided that there could be a lot of HTML codes floating around out there, in various files, for a lot of different reasons; and for present purposes I didn't need to be trying to figure out what was happening in all those different situations. So instead, I searched for the same thing as before: "<!DOCTYPE HTML PUBLIC." To do that, after several false starts, I tried this command:
findstr /r /m /s "<!DOCTYPE HTML PUBLIC" D:\*.eml > D:\findlist.txt
It produced a dozen "Cannot open" error messages. The reason seemed to be that the filenames for those files had funky characters (e.g., #, §). Also, Findlist.txt contained the names of files that did not seem to have the DOCTYPE text specified in the command. DOCTYPE may have appeared in attachments to those files, but I didn't want to be flagging that sort of EML file. So despite a number of variations with FINDSTR and several Google searches, I gave up. I returned to Copernic, searched for the DOCTYPE text (in quotation marks, as shown above), and moved them manually. Copernic had a convenient right-click Move to Folder option, so that helped a little. So now, anyway, despite the imperfections of the process, I apparently had the desired EMLs in a single folder. I would just have to re-sort them back to where they belonged manually.
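For readers who would rather script this search, here is a minimal sketch in Python. It is not what I actually ran. The function name find_offending_emls and the scan_bytes cutoff are hypothetical. The cutoff is only a rough heuristic for ignoring DOCTYPE text that appears later in the file, for example inside an attachment. One possible cause of the false matches above is that FINDSTR treats a quoted string containing spaces as several alternative search terms unless it is given with /C:. The sketch sidesteps that by searching for the literal byte string.

```python
import os

MARKER = b"<!DOCTYPE HTML PUBLIC"

def find_offending_emls(root, marker=MARKER, scan_bytes=65536):
    """Return paths of .eml files under root whose first scan_bytes contain marker.

    scan_bytes is an assumed cutoff: it keeps a marker buried deep in the file
    (e.g., in an attachment) from flagging the message.
    """
    hits = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if not name.lower().endswith(".eml"):
                continue
            path = os.path.join(folder, name)
            try:
                with open(path, "rb") as handle:
                    head = handle.read(scan_bytes)
            except OSError:
                continue  # unreadable file: skip it rather than abort the scan
            if marker in head:
                hits.append(path)
    return hits
```

Calling find_offending_emls("D:\\") would return full paths, so the original folder of each file is preserved for the later step of moving cleaned files back.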
But I still wasn't sure that everything in that folder was problematic. Basically, I needed to see what the EMLs looked like when they were opened up. Ideally, I would have just clicked a button at this point to convert them to PDF and merge them into a single document, so I could just flip through and identify the problem emails. But I was having problems in my efforts to print EMLs as PDFs. As a poor second-best, I manually opened them all (again, using Thunderbird as my default EML opener), selected the ones needing repair in Windows Explorer, and moved them to a separate folder. To open them, I just did a "DIR /b /a-d > Opener.bat" and modified its contents, using Excel, so that each one started and ended with a quotation mark (actually, CHAR(34)) -- no other command needed -- and then ran Opener.bat. Somehow, this failed to crash my system.
Cleaning Up the Files
After verifying that most of them looked bad (and removing the others), I made copies in another folder, and renamed the copies to .TXT extensions using Bulk Rename Utility. Now I could edit them as text files. My plan was to store up a set of standard search-and-replace items, mostly replacing HTML codes with nothing at all, so as to clean up these files.
I had previously decided on Emacs as my default hard-core text editor, and had taken some first steps in re-learning how to use it. The task at hand was to find advice on how to set up before-and-after lists of text strings to be replaced. It was probably something I could have done in Excel, but I might have had to cook up a separate spreadsheet for each file, and here I was wanting to modify multiple files -- dozens, possibly hundreds -- in one operation. Now, unfortunately, it was looking like Emacs was not going to be as naturally adapted to this task as I had assumed. After a couple of tries, I found a search that did bring up a couple of solutions to related problems. But those solutions still looked pretty manual. Was there some more tried-and-true tool or method for replacing multiple text strings in multiple files?
A different search led to HotHotSoftware, which offered a tool for this purpose for $30. A video seemed to demonstrate that it would work. But, you know, $30 was more than the files were worth. Besides, I wouldn't learn anything useful that way. ReplacePioneer ($39, 21-day trial) looked like it might also do the job. A thread offered a way to do something like it in an unspecified language, presumably Visual Basic. Another thread offered an approach in sed. Another way to not learn anything, but also not to spend $30, was to try the free TexFinderX. Other free options included Nodesoft Search and Replace and Replace Text.
I tried TexFinderX. In its File > Add Folder menu pick, I added the list of files to be changed. I clicked the Replacement Table button, but did not see the Open Table Folder button shown on the webpage. The ReadMe file seemed to say that a new replacement table would appear in the list only after being manually created in the TFXTables subfolder. They advised using an existing table to create a new one. As I viewed their "Accented to None - UTF8.txt" replacement table, I recalled looking into character replacement using Excel formulas. The specific point of comparison was that I had discovered, in that process, that people had invented various character conversion tables that might be suitably implemented with TexFinderX.
But for my own immediate purposes, I needed to see if a TexFinderX replacement table would accept a whole string of characters, to be replaced by nothing or, say, a single space. I was hoping that what I was seeing, there in that "Accented to None" replacement table, was that the "before" and "after" columns were tab-delimited -- that, in other words, I could enter a whole long string, hit the tab key, and then hit the spacebar. I tried that, first saving the "Accented to None" table under the name of "Remove HTML Codes," and then entering "<!DOCTYPE HTML PUBLIC "-//W3C//DTD W3 HTML//EN">" (without the outside quotation marks, of course) and hitting Tab and then Space. I did this on what appeared to be the first replacement line in that "Accented to None" file, right after the line that said /////true/////, as guided by the ReadMe. I hit Enter at the end of that line, and deleted everything after it, removing all the commands they had provided. I also changed the top lines, the ones that explained what the file was about. I saved the file, went into the program's Replacement Table button, and there it was. I selected it and clicked Apply. On second thought, I decided to try it on just one or two files, so I emptied out the list and added back just a couple of files. Then I ran it. It looked like it worked.
I proceeded to add all kinds of other HTML codes to my new Remove HTML Codes replacement table, testing and running and removing more unwanted stuff. I found that it was not necessary to hit Tab and then Space at the end of each line that I wanted to remove; it would remove anything that was on a line by itself, where no other tab-delimited text followed it on the same line. So, basically, I could copy and paste whole chunks of unwanted text into the replacement table, and it would be removed from any files on the list that happened to contain it. It seemed best not to add too many chunks at once, lest I be repeating the same lines: run a few, after eyeballing them for duplication, and then see what was left. It appeared that I could add comments, on these lines in the replacement table, by again hitting Tab after the "replace" value on the line.
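The replacement-table idea is easy to sketch in Python. This is not TexFinderX's own file format or code. It is a hypothetical, simplified table in which each line holds the text to find, optionally followed by a Tab and the replacement, optionally followed by another Tab and a comment. As in my experience with TexFinderX, a line with nothing after the search text means "delete this text."

```python
def parse_table(table_text):
    """Turn 'find<TAB>replace<TAB>comment' lines into (find, replace) pairs.

    A line with no Tab means the found text is replaced by nothing.
    """
    rules = []
    for line in table_text.splitlines():
        if not line:
            continue  # ignore empty lines in the table
        parts = line.split("\t")
        find = parts[0]
        replace = parts[1] if len(parts) > 1 else ""
        rules.append((find, replace))
    return rules

def apply_rules(text, rules):
    """Apply each literal find/replace pair to text, in table order."""
    for find, replace in rules:
        text = text.replace(find, replace)
    return text
```

For instance, a two-line table containing "<HTML>" on one line and "<BR>", Tab, Space on the next would strip the first tag and turn the second into a space.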
I added back some of their original items (modified) to the replacement table. These included the replacement of three spaces with two (which I might run several times to be thorough); the replacement of a Space-CR (Carriage Return) combination with a simple CR (using space-<13> tab <13> to achieve that, and apparently doing the same thing also with <10> in place of <13>). I tried replacing three CRs with two, using <13><13><13> on the same line, but it didn't work. The answer to that seemed to be to replace three pairs of <13><10> with two. I discovered that the conversion process that had mangled these files originally had placed different parts of HTML code sequences on different lines, so I had to break them up into smaller pieces -- but not too small, because I didn't want to be accidentally deleting real text from my emails that happened to look similar to these HTML codes.
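The whitespace rules just described can also be expressed as regular expressions, which collapse any run in a single pass instead of requiring repeated runs. This is a sketch under my own assumptions (the function name tidy_whitespace is hypothetical, and it handles both CR-LF and bare LF line endings), not a description of how TexFinderX works internally.

```python
import re

def tidy_whitespace(text):
    """Collapse excess spaces and blank lines left behind after removing codes."""
    # Drop spaces or tabs sitting just before a line ending (Space-CR -> CR).
    text = re.sub(r"[ \t]+(\r?\n)", r"\1", text)
    # Reduce any run of three or more spaces to two.
    text = re.sub(r" {3,}", "  ", text)
    # Reduce three or more consecutive line endings to two.
    text = re.sub(r"(\r?\n){3,}", r"\1\1", text)
    return text
```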
I basically worked through all the codes that appeared in one email, and then started in on those that remained in the next after applying my accumulated rules to it, and so forth. After working through the first half-dozen files in the list, I skipped down and ran the accumulated corrections against some others. Running it repeatedly seemed to clear up some issues; possibly it was able to process only one change per line per run. I realized that it would probably not produce perfect results across all cases. It was succeeding, however, in giving me readable text that had previously been concealed beneath a mountain of HTML codes.
I had noticed that the program took a little longer to run as I added more rules to its replacement table. But this did not seem to be due to file processing time: the time did not grow far longer when I added far more files to the list. It was still done within a minute or so in any case. Apparently it was just reading the instructions into memory.
The excess (now blank) lines in the files were the slowest to remove. I ran TexFinderX against the whole list of files at least a half-dozen times, adding a few more codes with the aid of additional spot checks. Unless I was going to check every individual file for additional lingering codes, that appeared to be about as far as TexFinderX was going to take me in this project.
Cleaning Up the Starts and Ends of Files
I had previously used Emacs (see http://raywoodcockslatest.blogspot.com/2012/03/choosing-emacs-as-text-editor-with.html) to eliminate unwanted ending material from files. Now I wanted to use a similar process on these files. I also wanted to see if I could adapt that process to remove unwanted material elsewhere in the files.
I had not previously noticed that most if not all of these emails had originally included attachments. As such, they included certain lines after their text, apparently announcing the beginning of the attachment portion. These lines included indications of Content-Type, Content-Transfer-Encoding, and Content-Disposition. These seemed like good places to identify the start of ending material to delete, for purposes of printing a cleaned-up message portion by itself. I now saw that I had made things more difficult for myself by including references to some Content-Type and Content-Transfer-Encoding lines in my list of items to remove in TexFinderX. I had not removed Content-Disposition lines, however, so -- as in the previous use of Emacs -- those would be my focus.
Having already done the initial setup of GNU Emacs as described in the previous post, I set forth to modify the process that I had used previously. After making a backup, the summary version of those steps, as modified, went like this:
- Start Emacs. Open one of the post-TexFinderX emails. Hit F3 to start macro recording. C-End (that is, Ctrl-End, in Emacs-speak) to go to the file's end. Hit C-r and type "Content-Disposition" to back up to its last occurrence of Content-Disposition.
- At this point, modify the previous approach to back up a bit further, in search of the boundary line just preceding the Content-Disposition line. I could have done this by hitting C-r and typing "----------" to find that boundary line, but now I saw that my TexFinderX replacements had deleted that, too, from some of these emails. So instead, I just hit the Up arrow three times, hoping that that would take me to a point before most of the ending material.
- Hit C-space to set the mark. C-End. Del.
The macro was still recording; I wasn't done. The preceding steps did take care of the ending material in that particular file. (As before, it was essential to avoid typographical errors, which would terminate macro recording or worse.) But now, how about the unwanted starting material? I hadn't done this particular operation before, but it seemed straightforward enough. I had to use C-Home to get to the start of the file. Then -- since I had, again, deleted the objectionable boundary lines in some of these emails -- I had to search for the last surviving message header field. In the case of the first email I was looking at, which I believed was probably the most thoroughly scrubbed, that last surviving field was Message-ID. So I went through several additional but similar steps to clean up the start of the email and finish the task:
- C-s to search for Message-ID. Then C-e to go to the end of that line, and right-arrow to go to the start of the next line. C-Space to set the mark, C-Home, and then Del. That was as much as I could do with this particular email; it was clean, though not ideally formatted.
- C-x C-s to save the file. F4 to end the macro recording. C-x C-k n Macro1 Enter (to name the macro to be Macro1). C-x C-k b 1 (to bind the macro to key 1).
- C-x C-f ~/ Enter (to find my Emacs Home directory). In my case, Home was C:\Users\Ray\AppData\Roaming\.emacs.d. I went there in Windows Explorer and created a new text file named _emacs, with no extension. This was my init file.
- From the Emacs menu: File > Open File > navigate to the new _emacs init file > select and open _emacs. Using the Meta (i.e., Alt) key, I used M-x insert-kbd-macro Enter Macro1 Enter. This hopefully saved my macro to my init file. C-x C-c to save and quit Emacs. A quick look with Notepad confirmed that there was something in _emacs.
- Restart Emacs. Open another of these text emails. Test my macro by typing C-x C-k 1. I got "C-x C-k 1 is undefined." I killed Emacs and, following advice, in Windows Explorer I renamed _emacs to be init.el and tried again. Still undefined. Since _emacs had worked in my previous session, I decided that the advice about init.el might be oriented toward Unix rather than Windows systems, so I changed it back to _emacs. In the Emacs menu, I went to File > Open File > navigate to _emacs > open _emacs. I used C-x 2 to split the window. _emacs appeared in both panes. In the top pane, I went to Buffers > select the text file to be changed. (Apparently it was listed as one of the available buffers because I had already opened it.) So now I was viewing the macro in the bottom pane and the email file in the top pane. I selected the top pane and tried C-x C-k 1 again; still undefined. I found other advice to just use M-x Macro1. That worked. The macro ran in the top pane.
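The trimming that Macro1 performs can be approximated outside Emacs. The following Python sketch is an assumption-laden stand-in, not a translation of the macro: the function name trim_email is hypothetical, it cuts at whole-line boundaries rather than at the cursor column, and its matching is case-sensitive, whereas Emacs searches ignore case by default. Unlike the macro, it does not stop when a search string is missing; it simply skips that step.

```python
def trim_email(text, lines_above=3):
    """Cut attachment material from the end and header material from the start."""
    lines = text.splitlines(keepends=True)

    # Ending material: find the last Content-Disposition line, back up a few
    # lines (the three Up-arrow presses), and drop everything from there on.
    last = max((i for i, ln in enumerate(lines) if "Content-Disposition" in ln),
               default=None)
    if last is not None:
        lines = lines[:max(last - lines_above, 0)]

    # Starting material: drop everything through the first Message-ID line.
    first = next((i for i, ln in enumerate(lines) if "Message-ID" in ln), None)
    if first is not None:
        lines = lines[first + 1:]

    return "".join(lines)
```

Because it is an ordinary function, it could be applied to every .txt file in a folder with a short loop, which avoids the problem of running a keyboard macro across multiple buffers.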
The macro didn't do such a great job of cleaning this second file. I would have to return to that later. For now, the next step was to figure out how to run the macro automatically on all the emails. Meager results from a search presented the possibility that people did not commonly do this sort of thing. A refined search led to further discussion suggesting that I should be searching for information on multiple buffers rather than multiple files. That innovation provoked the side question of whether perhaps jEdit was better than Emacs for such purposes but, once again, Emacs seemed better. Still another search led to Dired, which would apparently allow the user to conduct certain operations on the files listed in a directory. We were getting closer. I found someone who was feeling my pain, but without a solution.
A StackOverflow discussion suggested that I might want to begin a Dired approach by loading kmacro. I had no idea of how to do this. An Emacs manual page seemed to think that kmacro was already part of Emacs. I decided to try to follow the StackOverflow concepts without special attention to kmacro preliminaries. The first recommended step was to go to the top of my Dired buffer. This, too, was a mystery. Another Emacs manual page told me to use C-x d to start Dired. In the bottom line of the screen, that displayed the name of the directory containing the emails. I didn't know what else to do, so I hit Enter. Apparently that was just the right thing to do: it showed me a directory listing for that folder. It would develop, eventually, that the fast way to get it to show that directory was to use the menu option File > Open File to navigate to that directory and open a file there.
Now the StackOverflow advice was apparently to move the cursor to the first file in that list (which is where it already looked like it might be) and hit F3 to begin recording a keyboard macro. Then hit Enter to visit the file. Then M-x kmacro-call-ring-2nd. But at this point it said, "No keyboard macro defined." So kmacro was working, but on this command Dired was looking for a previous keyboard macro, not for an already saved one. I used C-x k Enter to close the email that I had opened. Now I was back at the Dired file list. I hit C-x 2 to split the window, so maybe I could see more clearly what was going on. With the cursor on the first target email in the top pane, I hit Enter to visit it again, then M-x Macro1 Enter. That seemed to be the answer, sort of: the bottom row said, "After 0 kbd macro iterations: Keyboard macro terminated by a command ringing the bell." So the macro did try to run. Adventures in the previous post suggested that this error message meant the macro failed to function properly, and I believed I knew why: this was the email that I had already edited. I had already removed, that is, the stuff that the macro was searching for, starting with the Content-Disposition line.
Time to try again. With the top pane (displaying the email message) selected, I hit C-x k Enter to close it. Then I moved the cursor to (i.e., mouse-clicked on) an email on which I had not yet run Macro1. There, going back to the (modified) StackOverflow advice, I hit F3 to start recording a keyboard macro; I hit Enter to visit the file; then M-x Macro1 Enter. It ran without an error message. The email was showing in both top and bottom panes, so evidently I had not yet mastered the art of pane. StackOverflow said C-x o to switch to the other buffer. This just switched me to the other pane; I was supposed to be seeing the Dired file list. With the keyboard macro still recording, I tried C-x k Enter to close the email. Now the bottom pane, where I was, had the cursor flashing on the wrong line. C-x o, then, followed by a tap on the down arrow key to take me to the next file to be processed. That was the end of the steps that I wanted my new keyboard macro to save, so I hit F4. StackOverflow said that now I had to hit C-u 0 C-x e to run the keyboard macro on every file in the list. But that command sequence only opened the next file and ran Macro1 on it. I hit C-x k Enter to close. I was back at the Dired list. The cursor did not advance to the next line; Macro1 did not run automatically.
I thought maybe my errors in that last try screwed up the keyboard macro, so I tried recording it again: F3; cursor on the target email; Enter to visit that file; M-x Macro1 Enter to run the macro; Ctrl-x k Enter to close the email; down-arrow to select the next email in the list; F4 to close the keyboard macro; C-u 0 C-x e to run it. No joy: I still had to close the file and start the next one manually.
By this point, a different approach had occurred to me. If I could open all the target emails at once, I would only have to hit keys to run Macro1 and then close the changed file: the next one would then be there, ready and waiting for Macro1. I decided to try this. As advised, with an email already opened in my target directory (via menu pick -- see above), so as to tell Emacs where to look, I used C-x C-f *.txt to open all of those emails. (As noted above, I was working on EMLs that I had mass-renamed to be TXT files.) That seemed to work. The first ones visible to me were those at the top of the list, on which I had already run Macro1. I closed those. I couldn't tell, from the Buffers menu pick, how many files remained opened. I could see that their timestamp would change in Windows Explorer after Emacs was done with them, so presumably I would be able to check which ones I had run Macro1 on. I made a mental note to make at least some kind of change in each file before closing it, so as to assure myself that there was no further need to work it over with Macro1.
So now I was looking at the first file that had not yet been caressed by the loving hand of Macro1. I wondered: can I define a keyboard macro to save the steps of running Macro1 and then closing the file? I tried: F3, M-x Macro1 Enter, C-x k Enter, F4. To execute that last defined keyboard macro, I used C-x e. It changed the file as desired -- that is, apparently it ran Macro1 -- and it also seemed to be saving the changed file, but it did not close the file. In other words, I had reduced the required number of keystrokes down to C-x e, C-x k Enter. That was what it took to run Macro1 and then close a file. Not bad, but could I do better?
The problem -- for both this approach and the Dired approach (above) -- seemed to be that the macros were not saving the C-x k Enter sequence. A search seemed to indicate this could be another difficult problem to solve. I was running low on time for this project, so I had to shelve that, along with the ensuing question of whether I could bind this C-x e C-x k Enter sequence to a function key.
Instead, I just went plodding through that sequence for these many files. In some cases, the scrollbar at the right showed me that there was a lot of extra material that I had to delete manually, usually from the ends of the emails. Saving after these additional edits required a C-x C-s Enter before the C-x k Enter. It was also handy to know that C-/ was the undo key.
Further Cleanup
When I was done running Macro1 on all those files, I saw that Emacs had created backup copies, with a .txt~ extension. I sorted by file type in Windows Explorer and deleted those. Also, while going through the process, I had noticed a number of files that were short and unimportant, and whose attachments did not interest me. So I was able to go through the list and remove those to a "Ready to PDF" folder. These steps reduced the number of files on which I might want to perform further operations.
While looking at those files in Windows Explorer, I noticed that some were much larger than others. These, I suspected, included some whose attachment sections had not been completely eliminated by the macro, perhaps because they had more than one attachment. I opened these in Notepad and eliminated material that did not contribute to the intelligible text of the email.
In some of the remaining files, there were still a lot of HTML codes and other material that would interfere significantly with an attempt to read the contents. It seemed that the spot checks I had conducted in TexFinderX had not brought out all of the things that TexFinderX could have cleaned up. I restarted TexFinderX, added more codes to the list of things to remove, and ran it some additional times on the files remaining in that folder. That didn't continue too long before I realized that there could be an endless number of such codes and variations.
The next step was to return to Emacs. This time, I was looking particularly for individual lines that could safely be deleted. This was not so much a concern with HTML codes, though there might be some of that too; it was more a concern with email headers, boundary lines, and other items that would vary from one email to the next, could therefore not be readily added to a TexFinderX replacement list, and yet could appear repeatedly within a single email. For example, each of the following lines appeared within a single email:
--===============3962046403588273==
boundary="----=_NextPart_000_002A_01C69314.AD087740"
------=_NextPart_000_002A_01C69314.AD087740
Moreover, variations on those themes recurred throughout that email, with quite a few of each. So I could write an Emacs macro to search for a line beginning with the relevant characters, select that entire line, and delete it. I wouldn't have to know which numbers appeared on different variations of these lines, as I would if I were using TexFinderX.
The problem here was that there were quite a few different kinds of lines to remove. In addition to the types just shown, there were also email header lines that would normally not be visible, but that had become visible in the original mangling of these files, and there were also various Content-Description and Content-Disposition and Content-ID and Content-Location lines. I would have to write an Emacs macro for each. I could write one macro to run them all, but it would terminate as soon as it failed to find the next requested line; and since these sorts of lines varied widely from one email to another, it was quite likely that such a general macro would be terminating prematurely more often than not. If I knew how to bind macros to individual keys, it might not be horrible to go down the list and punch the assigned function (or Ctrl-Function, Alt-Function, etc.) keys, one at a time, reiteratively for each of these many email files. But that seemed like a lot of work for a fairly unimportant project. A better approach would have been to write a script to handle such things, but my chosen scripting language for this purpose, Perl, had one significant drawback: I had not learned it yet. I had been meaning to, for about 20 years, and I knew that eventually the opportunity would arrive. But that day was not today.
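For what it is worth, the line-deletion idea does not need one macro per pattern if it is scripted. Here is a minimal Python sketch (Python rather than Perl, purely for illustration). The patterns are my own guesses based on the three sample lines above and the Content-* lines just mentioned. They are assumptions, and would need checking against real files before being trusted not to delete genuine text.

```python
import re

# Lines to drop; each alternative is anchored at the start of the line.
DROP_LINE = re.compile(
    r'^\s*(?:'
    r'--=+\d+==\s*$'                 # e.g. --===============3962046403588273==
    r'|-*=_NextPart_\S*\s*$'         # e.g. ------=_NextPart_000_002A_...
    r'|boundary="[^"]*"\s*$'         # e.g. boundary="----=_NextPart_..."
    r'|Content-(?:Description|Disposition|ID|Location):'
    r')',
    re.IGNORECASE,
)

def drop_noise_lines(text):
    """Remove boundary and Content-* lines, keeping every other line intact."""
    kept = [ln for ln in text.splitlines(keepends=True)
            if not DROP_LINE.match(ln)]
    return "".join(kept)
```

A pattern that matches nothing simply has no effect, so one pass can carry the whole list without stopping early the way a combined Emacs macro would.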
I concluded that my cleanup phase for these emails was finished. If I really needed to go further with it, I could convert them from PDF back to text and have at it again, some fine day. If I had really intended to do that, I would have saved a list of the relevant files at this point. But for the time being, I needed to get on with the next part of the project.
Converting Emails to PDF
I had previously used "Notepad /p" to convert a set of TXT files, like these emails, to a set of PDFs. The basic idea was to make a list of files and then use Excel to convert those file paths and names (as needed) to batch commands. I used that same approach here, making sure to set the PDF printer to operate with minimal dialog interruptions. This produced PDFs with "Notepad" at the end of their names. For some reason, Bulk Rename Utility was not able to remove that; I had to use Advanced Renamer instead.
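The Excel step of turning a file list into batch commands could also be done with a few lines of Python. This is only a sketch: the function name build_print_batch and the batch filename PrintAll.bat are hypothetical, and it assumes a PDF printer is already set as the default printer, since "notepad /p" prints to whatever the default is. Filenames containing a percent sign would need extra escaping in a batch file, which the sketch does not attempt.

```python
import os

def build_print_batch(folder, batch_name="PrintAll.bat"):
    """Write a batch file with one 'notepad /p' command per .txt file in folder."""
    names = sorted(n for n in os.listdir(folder) if n.lower().endswith(".txt"))
    commands = ['notepad /p "{}"'.format(os.path.join(folder, n)) for n in names]
    # newline="\r\n" gives the batch file Windows-style line endings.
    with open(os.path.join(folder, batch_name), "w", newline="\r\n") as handle:
        handle.write("\n".join(commands) + "\n")
    return commands
```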
Converting Attachments to PDF
As noted above, most of these troublesome emails had attachments. I now had, in a folder, only those emails (in .txt format) whose attachments I wanted to see. Using a DIR command as above, I did a listing of those .txt files. I put that list into Excel and modified it to produce batch commands that would move the EMLs of the same name to a separate folder. Then, in Thunderbird, I created a new local folder. With that folder selected, I went into Tools > ImportExportTools > Import eml file. I navigated to the folder containing the EMLs whose attachments I wanted to see, selected them all, and clicked Open. The icons indicated that all did have attachments.
Now, having configured Thunderbird's AttachmentExtractor add-on to generate filenames that I could recognize and connect with specific emails, I selected all those newly imported EMLs, right-clicked on them, and chose Extract from Selected Messages to (0) Browse. I set up a folder that was not too many levels deep, for fear that some of these attachments might already have long names that could cause problems. AttachmentExtractor went to work. When it was done, I deleted that folder in Thunderbird, so that I would not have a problem of confusing duplicates of EMLs that had already caused me enough grief.
Then, in Windows Explorer, I sorted the extracted attachments by Type. I began the process of converting to PDF those that were not already in PDF format. Many of these were Microsoft Word documents. I had already worked out a process that would automate the conversion of Word docs to PDF. I moved these files to another workspace folder for clarity, and after making the advisable adjustments to my PDF printer, I applied that process to these files.
Word had problems printing a number of these Word docs. It crashed repeatedly, during this process, whereas it had sailed right through other stacks of docs that I had converted to PDFs by using the same techniques. It did produce some PDFs. I looked at those, to make sure they turned out OK, and then I had to do a DIR /a-d /b *.pdf > successlist.txt in the output folder to see which docs had been successfully PDFed, and then convert successlist.txt into a batch file full of commands to delete the corresponding DOCs, so that I could try again with the DOCs that didn't convert properly the first time. Before re-running the doc-to-pdf conversion batch file, I opened one of the failed DOCs and printed it to PDF. That went fine, as a manual process. So apparently it was not, in every case, a problem with the file. Ultimately, I used OpenOffice Writer 3.2 and was able to print the remainder manually, using just a few keystrokes per file, with no problems.
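The bookkeeping in that retry loop (which DOCs already have a PDF, and which still need converting) is a simple set comparison. Here is a hedged Python sketch. The function name docs_still_to_convert is hypothetical, and it assumes each PDF has the same base name as its source document, which depends on how the PDF printer names its output. Rather than deleting the converted DOCs, it just lists the ones that remain.

```python
import os

def docs_still_to_convert(doc_folder, pdf_folder):
    """List .doc/.docx files in doc_folder with no same-named PDF in pdf_folder."""
    done = {
        os.path.splitext(name)[0].lower()
        for name in os.listdir(pdf_folder)
        if name.lower().endswith(".pdf")
    }
    return sorted(
        name
        for name in os.listdir(doc_folder)
        if name.lower().endswith((".doc", ".docx"))
        and os.path.splitext(name)[0].lower() not in done
    )
```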
Other extracted attachments were text files. At this point, I had two ways of dealing with these. On one hand, I could have used the same process as I had just used with the Word docs, after changing the command used for .doc files to refer instead to .txt files. I did start to use this approach, but ran into dialogs and potential problems. On the other hand, I could have used the approach of printing to Notepad, as I had used with the emails themselves (above). Before I got too far into this task, though, I noticed that every one of these text files had names like ATT3245657.txt. They also all originated from the same source. I examined a handful of these attachments and decided I could delete them all.
Some extracted attachments were image files -- JPG, GIF, PNG, BMP. I also had a dozen attachments without extensions. I opened the latter in IrfanView. I believe there was an IrfanView setting that allowed it to recognize, as it did, that some of these were actually image files, and to offer to rename them (as PNGs or whatever) accordingly. On the other hand, as I looked through these files, I saw that some of the GIFs were animations. Excluding those, I now had a list of what appeared to be all the attachments that should be treated as image files. I used IrfanView's File > Batch Conversion/Rename option to convert these to PDF.
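I do not know how IrfanView recognizes extensionless image files, but a common technique is to look at the first few bytes of the file, which differ by image format. The following Python sketch illustrates that technique for the four formats mentioned above. It is an illustration under that assumption, not a description of what IrfanView does, and the function name guess_image_extension is hypothetical.

```python
# Well-known leading byte signatures for common image formats.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"\xff\xd8\xff", ".jpg"),
    (b"GIF87a", ".gif"),
    (b"GIF89a", ".gif"),
    (b"BM", ".bmp"),
]


def guess_image_extension(header):
    """Return a likely extension for the given leading bytes, or None.

    `header` is the first handful of bytes read from a file. The BMP check
    is only two bytes long, so it can produce false positives.
    """
    for signature, extension in SIGNATURES:
        if header.startswith(signature):
            return extension
    return None


# Example with in-memory bytes; for a real file, one would read the first
# 16 bytes or so with open(path, "rb").read(16).
print(guess_image_extension(b"GIF89a" + b"\x00" * 10))
```

A check like this only suggests a file type. It does not reveal whether a GIF is animated, so the animations would still need to be excluded by inspection.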
There were a few miscellaneous file types. For videos, I just took a screenshot in the middle and used that as an indication of what the original attachment had been. One alternative would have been to use something like Shotshooter.bat to produce multiple images conveying a sense of the direction of the images in the video, and then combine those images in a single PDF.
Combining Email and Attachment PDFs
Now I had everything in PDF format. I used Bulk Rename Utility to rename emails and attachments so that, when combined into one folder, each email would come before its associated attachments (if any), and the difference between the two would be readily visible. I combined the emails and attachments into one folder and made a list of the files using DIR (above).
Now the goal was to combine the emails that did have attachments with their accompanying attachments. There were probably too many of these to combine them manually, one set at a time, using Acrobat or something like it. I had previously worked out a convoluted approach for automating the merger of multiple PDFs (produced from multiple JPGs), using pdfSAM. Discussion on a SuperUser webpage and elsewhere suggested that pdftk and Ghostscript were alternatives. The instructions for Ghostscript looked more complex than those for pdftk, so I decided to start with pdftk.
I downloaded and unzipped pdftk. As advised, I copied the two files from its bin folder (pdftk.exe and libiconv2.dll) into C:\Windows\System32. I opened a command prompt in some other folder, at random, and typed "pdftk --help". This was supposed to give me the documentation. Instead, it gave me an error:
pdftk.exe - System Error
The program can't start because libiconv2.dll is missing from your computer. Try reinstalling the program to fix this problem.
I moved the two files to C:\Windows and tried again. That worked: I got documentation. It scrolled on past the point of recovery. Typing "pdftk --help > documentation.txt" solved the problem, but ultimately it didn't seem to give me anything more than already existed in pdftk's docs subfolder. The next step was to put pdftk to work. It would apparently allow me to specify the files to combine, using a command of this form:
pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
My problem was that, at least in some cases, the filenames I was working with were too long to fit on a single line like that, one after the other. I decided a solution would be to take a directory listing, put it into Excel, and use it to create commands for a batch file that would rename the emails and their accompanying attachments, with names like 0001.pdf. I would need to keep the spreadsheet for a while, so as to know what the original filenames were. The original filenames were my guide as to what files needed to be combined together. For this purpose, with one of the original filenames in spreadsheet cell A1, I put the ascending file numbers in cells B1, B2 ... (i.e., 1, 2, ...) and then, in cell C1, I put =REPT("0",4-LEN(B1))&B1&".pdf". Finally, in cell D1, I put ="ren "&CHAR(34)&A1&CHAR(34)&" "&C1. Then I copied the formulas from column D into Notepad, saved them as Renamer.bat, and ran it.
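The spreadsheet formulas above could also be expressed in a few lines of Python. This is only a sketch of the same idea, not what I actually ran. The function name build_rename_commands is hypothetical, and it assumes the filenames are supplied in the order they should be numbered.

```python
def build_rename_commands(original_names, width=4):
    """Build REN commands that give each file a zero-padded numeric name.

    Mirrors the spreadsheet columns: B = running number, C = padded name,
    D = the REN command. Also returns a numbered-name -> original-name
    mapping, which plays the role of the saved spreadsheet.
    """
    commands = []
    mapping = {}
    for number, name in enumerate(original_names, start=1):
        numbered = f"{number:0{width}d}.pdf"  # e.g. 1 -> 0001.pdf
        mapping[numbered] = name
        commands.append(f'ren "{name}" {numbered}')
    return commands, mapping


# Example with made-up filenames.
commands, mapping = build_rename_commands(["First email.pdf", "First email - att.pdf"])
for line in commands:
    print(line)
```

The quotation marks around the original name do the same job as CHAR(34) in the formula: they let REN cope with spaces in the filename.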
After doing that renaming, I went back to the spreadsheet for guidance on which of these numbers needed to be combined. Each original filename began with date and time. With few exceptions, this was sufficient to distinguish one email and its attachments from another. So I used =LEFT to extract that identifying information from column A. Then, in the next columns, I used IF statements to compare the extract from one line to the next, concatenate the appropriate filenames with a space between them, and choose which concatenations I would be using. Finally, I added a column to create the appropriate command for the batch file. Instead of the 123.pdf output shown in the example above, I used the original email filename. Where there were no attachments, pdftk would thus just convert the numbered PDF (e.g., 0001.pdf) back to its original name.
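The comparison logic in those IF columns amounts to grouping consecutive files that share the same leading date-and-time text. Here is a rough Python sketch of that grouping. The prefix length, the sample filenames, and the function name build_merge_commands are assumptions for illustration. It also assumes the listing is sorted so that each email comes directly before its attachments.

```python
from itertools import groupby


def build_merge_commands(pairs, prefix_len):
    """Build one pdftk command per email-plus-attachments group.

    `pairs` is a list of (numbered_name, original_name) tuples in listing
    order. Consecutive entries whose original names share the same first
    `prefix_len` characters are treated as one group.
    """
    commands = []
    for _, group in groupby(pairs, key=lambda pair: pair[1][:prefix_len]):
        group = list(group)
        inputs = " ".join(numbered for numbered, _ in group)
        # Assumption: the first file in each group is the email itself,
        # so its original name becomes the output name.
        output_name = group[0][1]
        commands.append(f'pdftk {inputs} cat output "{output_name}"')
    return commands


# Example with made-up names whose first 16 characters are date and time.
sample = [
    ("0001.pdf", "2010-01-05 09.15 Hello.pdf"),
    ("0002.pdf", "2010-01-05 09.15 Hello - att.pdf"),
    ("0003.pdf", "2010-02-01 10.00 Other.pdf"),
]
for line in build_merge_commands(sample, prefix_len=16):
    print(line)
```

As in the spreadsheet version, a file with no attachments forms a group of one, so the resulting command simply gives the numbered PDF its original name back. Two different emails with the same date and time would be merged by mistake, which is the "few exceptions" case that still needs a manual check.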
I finished with spot checks of various files, with and without attachments, to verify that they had come through the process OK. I was not happy with the remaining junk in the emails themselves, but at least I could tell what they were about now, and they had their attachments with them. Pdftk had proved to be a much easier tool for this project than pdfSAM. This had been an awful lot of work for not terribly much achievement on some not very important files, but at least I had finally worked through all of the steps in the PDF conversion process for Thunderbird emails with attachments.
2 comments:
A later post presents a use of TexFinderX used to update multiple blog posts in a bulk process.
I find PSPAD editor good for bulk changes. You could always use SED for multiple changes.
tOM