Showing posts with label export. Show all posts
Showing posts with label export. Show all posts

Tuesday, June 19, 2012

Exporting Thunderbird Emails to PDF - Another Cut

I was using Thunderbird 11.0.1 in Windows 7.  I had accumulated some emails that I wanted to export as individual EML files.  An EML would still be readable in Thunderbird, and it would carry any attachments along with it.  I had attacked this problem on several previous occasions.  As before, I was not sure I would get all the way through from Thunderbird to EML to PDF.  This post provides another contribution in the slog toward that outcome.

First Step: From Thunderbird to EML Format

Some of my previous efforts to export to EML and then convert to PDF had produced something of a mess.  Exporting, itself, was easy enough.  I was using ImportExportTools.  It would give me EMLs with names containing some, but not all, of the information that I wanted in file names.  Specifically, it would provide the date and time, the sender, and the subject; but it did not include the recipient.  I could get it to produce a separate Index.csv file that would contain the full information, but that would just be a spreadsheet file.  I could use that spreadsheet file to give me nice names for files; but which file was supposed to get which name?  Matching them up had required a surprising amount of manual effort, last time around.  I was hoping to make the process smoother, if I could.

It wouldn't help to print a PDF directly from Thunderbird.  As far as I knew, that would require me to enter PDF filenames manually.  I was looking for a mass-production kind of solution.  About.com recommended mbx2eml, but it seemed to have some disadvantages, notably a very limited set of options for the resulting EML filenames -- which was the main problem.  Generally, it did not seem that any solution had broken through into prominence, in either the T-bird to EML or T-bird to PDF category.

In my first try at this problem, I had tried Total Thunderbird Converter and Birdie EML to PDF Converter, but for various reasons had not been impressed with either.  I did like Attachment Extractor, for when I got to that part of the project.  My notes seemed to favor Universal Document Converter (UDC) ($69), if I wanted a direct T-bird-to PDF-solution.  As I reviewed the struggles I'd had in that first try at this problem, and also in the second and third tries, I wondered if I should have focused more seriously on UDC.  But it did not seem to have command-line capability or other automation features.  It was basically a glorified PDF printer.  Moreover, its default filenames did not include all the information I wanted.

My previous notes did not seem to mention that Thunderbird messages were apparently already in EML format, stored in Thunderbird subfolders.  For instance, I had moved the messages that I was now seeking to export to a Local Folders subfolder called Export, and I could see that folder in Windows Explorer as Mail\Local Folders\Export.mozmsgs.  But this was confusing:  the number of EML files in that folder was not very close to the number of messages in the Export subfolder in Thunderbird.  Anyway, the EMLs in Export.mozmsgs had seemingly random names that would be useless for my purposes.

So I went ahead with ImportExportTools.  My first step was to eliminate duplicates.  For this, I used Remove Duplicate Messages (Alternate).  Then, in Thunderbird, I went to Tools > ImportExportTools > Export all messages in the folder > EML format.  The first time around, this produced undesirable results (see below).  But I didn't know that until I was partway through the second step.

Second Step:  Adding Recipient to the EML File Name

I had my EMLs.  But as noted above, I wanted to add the name of the Recipient to the filename, in the format Date-From-To-Subject.  As a first step, I thought I would just try to append the Recipient's name to the end of the filename.  Then I would figure out how to shuffle the words around to the desired order.

Given my limited knowledge of programming and such, I decided to try to achieve this with a Windows batch file.  I struggled to figure out how to write a suitable one, and finally posted a question on it.  One of the early answers to that question led to a separate pursuit -- a one-line batch file that would convert Word and WordPerfect documents to PDF.

The answers that I had received, at the point when I was writing up these notes, fell into two categories.  One, which I found easier to understand (and, predictably, seemed less popular among the knowledgeable respondents), involved a simple loop that would call an external process.  Basically, in plain English, it went like this:

FOR each EML file, run Process.
Repeat loop.
When list of files is exhausted, quit.

Process starts here.
Do various things.
End of process
By contrast, the approach preferred by most of the answering individuals would put all the steps inside the loop, instead of having a separate process afterwards.  It seemed to be a matter of style.  A second difference was that, in discussing the specific steps, they seemed divided between two general possibilities:  with, or without, delayed expansion.  Delayed expansion was apparently a response to a complication in how the FOR command worked.  As I understood it, the computer would read the entire contents of a FOR command as soon as it hit the word FOR.  So assigning a value to a variable inside a FOR loop would be too late; the computer would already have decided what value that variable had.  The variable would have been immediately expanded to its value.  Delayed expansion would postpone definition of the variable's value until later in the game.  A variable would be marked for delayed expansion by surrounding it with exclamation marks (e.g., !VAR!).  I wasn't familiar with delayed expansion, so I was in accord with some advisors' feeling that it would be better to proceed without it if possible.  What they (especially Aacini) suggested was:
@ECHO OFF

IF EXIST fullnames.txt DEL fullnames.txt

FOR %%f IN (*.eml) DO (

SET firstfind=

FOR /F "delims=" %%l IN ('findstr /B /C:"To: " "%%f"') DO (

IF NOT DEFINED firstfind SET firstfind=now & ECHO %%f%%l >> fullnames.txt

)

)
I have double-spaced the lines for clarity, anticipating that Blogger will wrap some long lines.  I haven't indented the way a programmer would, because of apparent limitations in the formatting options here in Blogger.  Basically, this batch file said, give me a fresh output file called Fullnames.txt; and on each line in Fullnames.txt, type the contents of two variables.  The first variable, %%f, was the name of the EML file under consideration, in all its Date-Sender-Subject glory.  There would be one such filename assignment for each EML file in the folder; hence a FOR loop.  The batch file would loop through all EML files in the folder.

Inside that FOR loop, there would be an examination of the contents of each individual EML.  This examination would use FINDSTR to locate the first line beginning with "To:  ."  The contents of that line would be assigned to the %%l variable.  (That's an L, not a one.)  I wasn't sure why this had to be done inside a second, inner loop, and I also didn't know how the "now" part worked.  But I was an openminded individual.  I was interested in new ideas.  The point is, I was willing to plow ahead and give it a try. 

So I copied the foregoing lines of script, beginning with @ECHO OFF and continuing to the last closed parenthesis (")"), and pasted them into a file in Notepad.  I saved that file as EMLNamer.bat, and put it into the folder containing the EMLs that I had exported from Thunderbird (above).  There, I ran it (either double-click it or highlight and hit Enter).  The command window displayed nothing, which was a bit disconcerting; but, viewing the folder in Windows Explorer, I could see Fullnames.txt spring into existence and grow larger.

When it was done, the command window disappeared, and Fullnames.txt stopped getting bigger. I put EMLNamer.bat into a folder where I could find it later.  I opened Fullnames.txt file and pasted its contents into Excel.  Some lines seemed to be missing.  Not many, but less than the total number of files shown in the Windows Explorer folder minus two (for EMLNamer.bat and Fullnames.txt).  I guessed that the names of a few EMLs had presented complications for the script.  I would have to process the rest and see what remained.

Third Step: Improving the EML File Name

I looked at the new Excel spreadsheet.  Spot checks, supplemented by previous experience with ImportExportTools, yielded the following observations:
  • The first 13 characters in each filename seemed match the date and time (in 24-hour format) shown in Thunderbird for the email in question -- the time, that is, when the email was sent or received.
  • The next characters indicated the sender.  This string ended, in some cases, with three characters (namely, "_-_") and in other cases with just one (namely, "-").  It seemed that ImportExportTools would surround some senders' names with underscores ("_") but would not do so for others.  The reason seemed to be that those senders' names appeared within brackets.  For instance, I had emails from "[Wordpress.com]" that now appeared as "_WordPress_com_."  So at least in these situations, the underscore seemed to be something that I could replace with a space, which would then be removed by an Excel TRIM command if it appeared at the start or end of a string.
  • Some senders' names ended with "_com."  Ordinarily, the preceding note would suggest replacing that with ".com," and likewise for ".org," ".edu," and so forth.  But I decided that step would come later, if at all:  instead, I would start by identifying full names (e.g., "Yahoo_com") that I might want to replace with simpler names (e.g., "Yahoo").
  • Hyphens were not always a reliable indicator of the end of a sender's name.  For example, an email from some "Pan-European" organization came through the ImportExportTools process unchanged.
  • ImportExportTools seemed to replace apostrophes with underscores.  So instead of "Miller's" I would get "Miller_s_."  Likewise for other uses of the apostrophe (e.g., "Don't" became "Don_t_").  It seemed that, before doing any sweeping replacement of underscores, I might want to look for those sorts of special cases.
  • Sometimes a hyphen would not be a reliable indicator of the end of a sender's name.  An example appeared in an email from a "Pan-European" organization:  it came through the ImportExportTools process unchanged.
  • Due to the EMLNamer process, the end of the Subject field and the beginning of the Recipient field were marked by ".emlTo:" -- which was certainly recognizable.
  • Subject fields often began with things like "Fwd_" and "Re_" -- which, I had decided in a previous use of ImportExportTools, would best be deleted.
In short, the default results from ImportExportTools (possibly altered during my previous tinkering) were creating some confusion.  I deleted the existing EMLs from the output folder, so as to start over.  Then I went into Thunderbird > Tools > ImportExportTools > Options and made several changes.  In the Misc. tab, I set each item to a maximum of 100 (rather than 50) characters.  (This wasn't exactly a mistake, but I would later realize that, as a result of this change, I needed to be more aggressive in keeping the total filename length relatively short; otherwise, it would cause problems in some other Windows operations.)  In the Filenames tab, I unchecked the option to "Use just alphanumeric characters in export"; I left the format to be Date - Sender - Subject; I left "Add time to date" checked; and I unchecked the "Cut subject" and "Cut complete file path" options.  In the Export directories" tab, I chose the "create a new directory and the index of messages" option.

When I ran that, I got an index.html file listing relevant information about each file:  its subject, from, to, date, and an indication of whether it had attachments.  This did not appear likely to be helpful, given its HTML format.  In the output folder, there was the right number of files.  I ran EMLNamer.bat again.  This time, the command window gave me some error messages.  Preliminarily, it seemed they were produced by the length of the filenames.  I could not save them before the command window closed.  There was probably a way to modify EMLNamer.bat to save those messages to a file, but I did not tinker with that at this point.  These messages appeared to be in addition to the unknown problems that had prevented Fullnames.txt from containing a complete list of all EMLs:  there were now about 20 filenames missing from the output that I pasted into Excel.  So, again, those would have to be dealt with manually.

This time around, when I pasted the results from Fullnames.txt into Excel, I saw that the output filenames had characteristics largely similar to, but in some regards different from, those noted above.  There were fewer underscores, which meant that it would probably be simpler to develop rules to translate them into more useful characters.  Hyphens were still not reliable field-end indicators.

Manipulating the File Information in a Spreadsheet

In Excel, after a couple of false starts not detailed here, I took the following steps:
  • Insert row 1 for column headings.  Label column A as "Combined."  These entries contained the combined original filename plus the "To:" information added by EMLNamer.bat.
  • In column B (heading:  "Original"), use =LEFT(A2,FIND(".emlTo: ",A2)-1) to obtain the original filename as exported from Thunderbird.  I would need this to remain unchanged:  my ultimate goal, a batch command indicating how the original filename should be changed, would need this information to tell the command processor what file was being renamed.  As with all other columns discussed below, I copied the formula down the column to all rows in use.
  • In column C (heading:  "Find & Replace"), use =A2.  Fix the values in this column -- that is, make them permanent by highlighting them all and using the Edit - Copy, Edit - Paste Special -Values sequence.  The shortcut key sequence for Excel 2003 -- which I believed would work in ribbon versions like Excel 2007 and 2010 --was Alt-E C, Alt-E S V Enter Enter.  Now column C contained values rather than formulas.
  • Move the values from column C to a new worksheet.  Don't rearrange them.  I needed a new worksheet because I was going to be using global find-and-replace (Ctrl-H) commands, and I didn't want to have to try to protect columns A and B from being affected by these commands.
  • In that new worksheet, I made changes to the list that I had just brought over from column C in the first worksheet.  The first thing I did was to search for an unusual character, one I searched, in Excel, to find a character that did not already appear in the list.  The caret ("^") was one such character.  I would use this as my field delimiter.  I didn't want any of my Subject field entries to begin with "Re" or "Fwd," so I started by replacing "-Re_" and "-Fw_" and "-Fwd_" with carets, gambling (on the basis of previous experience) that there would be few instances where this would prove inadvisable.
  • I also replaced the "-_" and "_-" and "-[" combinations with carets.  To reduce the number of underscores potentially requiring manual attention, I did one or two additional find-and-replace operations in obvious cases; for example, "Woodcock_s " (ending with a space) became "Woodcock's ."  It could have been counterproductive to go too far with this, though.  For example, I did not try to remove underscores from every version of my name and email address, because that could have created additional variations on my name, somewhere down in the list, potentially complicating the number of things I would have to look for later.  It was better to leave the underscore as a flag for some purposes.  Then I cut and pasted that modified list back into column C in the main worksheet.
  • Back in the main worksheet, in column D, I set up a Date and Time column B, using =LEFT(C2,13).  I didn't parse that column for the various year, month, day, hour, and minute components at this point; that could wait until I needed that information.
  • In column E, I created my first Remainder column.  The purpose of the Remainder columns was to show what was left from the modified values appearing in column C, after removing whatever I had just separated out (in this case, the date and time).  The formula was =TRIM(MID(C2,15,LEN(C2))).
  • I used column F for the Recipient (i.e., "To") value, from the end of the string appearing in the Remainder column (E).  The reason was that this was a fairly obvious entry, and its removal would simplify the next steps.  The formula in column F was =TRIM(MID(E2,FIND(".emlTo: ",E2)+7,LEN(E2))).
  • Column G could be another Remainder column:  =TRIM(LEFT(E2,FIND(".emlTo: ",E2)-1)).
  • In column H (heading:  "Left 1"), I entered =LEFT(G2,1).  The reason was that ImportExportTools had failed to export the names of some senders, notably those appearing in angle brackets ("< >"), and I couldn't identify them by just sorting on the Remainder column because Excel would irritatingly overlook those characters when doing a sort.  But now I could sort on column H and make manual entries of those senders' names in the appropriate column.  I had not yet created that column, nor made those manual entries, because there was something else I needed to do first:
  • In column I, under a "Hyphen" heading, I entered =IF(ISERROR(FIND("-",G2)),"",FIND("-",G2)).  In column J (heading:  "Caret"), I entered =IF(ISERROR(FIND("^",G2)),"",FIND("^",G2)).  Finally, in column K (heading:  "Best"), I entered =IF(J2="",I2,J2).  Column I would look for the first occurrence of a hyphen in the Remainder (column G).  Column J would do likewise for a caret.  It was necessary to use both because, at this point, either one might have been the delimiter indicating the end of the Sender field.  Column K would favor carets over hyphens, so as to reduce the number of problems with senders with hyphenated names.
  • In column L ("Sender), I used =TRIM(LEFT(G2,K2-1)).  This produced good Sender names in most cases.  It was not yet time to deal with the exceptions.
  • In column M ("Subject"), I used =TRIM(MID(G2,LEN(L2)+1,LEN(G2))). This produced good Subject names in most cases. Now it was time to deal with the exceptions.
  • I went back and sorted on column H to identify those rows where I would have to make manual entries of Sender names because none was provided by ImportExportTools.  I put those entries in column L as needed, replacing whatever the automatic calculation had put there.  To assist in my process of looking up those that I didn't recognize, I sorted the From column in Thunderbird, for the Export folder, to gather all those senders at the top of the list for easier reference; I moved these items into a separate subfolder, sorted by Subject; I maximized the viewable space for that list; and once I had dealt with them, I moved them to another subfolder, so as to reduce the size of the list that I would have to page through.  The objective here was just to make sure I had a coherent division of information between the Sender and Subject columns -- to prevent some Sender data from appearing in the Subject column, or vice-versa. Cleaning them up or otherwise improving them at this point would have been premature.  Changing Sender names worked best if I made the changes back in column G, or if I altered or removed numbers in columns I and J.  Just making a change in the Sender column would leave a problem in column M.  It helped, for this purpose, to fix the values in column G (that is, to replace formulas with values; see the procedure described in connection with column C, above).
  • I sorted on column M ("Subject") and cleaned up the entries there.  I found that I wanted to do find-and-replace operations on multiple entries.  I decided at this point that I could safely fix the values of the entire spreadsheet.  It seemed that I would want to sort and re-sort these Subject values to get similar ones together.  To preserve the original order, I added an Index column, indicating the original numerical order of entries.  (Enter 1 and 2 in the first two rows; highlight all rows to be numbered; then hit Alt-E I S Enter.)  Then I moved the Subject and Index columns to a separate temporary worksheet, where I could do these sweeping changes without affecting other columns.  There, I reversed these two columns, putting Index on the left, to keep it out of harm's way.  My changes here included LEFT and RIGHT commands to sort by first and last characters of Subjects (supplemented, on the left, with CODE comparisons, to identify unwanted lowercasing), as well as FIND and Ctrl-H searches and replacements for underscores (doing many replaces to eliminate most instances) and other text that I wanted to change across multiple Subjects.  To identify undesirable characters (e.g., exclamation marks and others whose presence in filenames might mess up batch commands and other applications), I used SUBSTITUTELIST.  SUBSTITUTELIST would remove the characters listed in a separate worksheet (generated with a series of numbers 1-255 in column A and a corresponding =CHAR(A1) in column B).  I could have had it remove characters that looked unwanted, but to be cautious I decided instead to have it remove everything that I knew was normal (i.e., 0-9 and a-z and A-Z, plus a few others) and show me what was left.
  • I deleted columns that were unnecessary, now that I had fixed the values.  I also moved some columns, and inserted a few ones.  My arrangement was now as follows:  Index (column A), Original (B), Date & Time (C), Sender (D), NewSender (E), Recipient (F), NewRecipient (G), and Subject (H).
  • I copied values from columns D (Sender) and Recpient (F) to a separate worksheet.  There, I did a unique filter.  This gave me a list of names that I might want to change or simplify.  I put the original (unique) name in column A in that separate worksheet, sorted it, and entered the desired replacement names in column B.  I sorted this Names worksheet on column A (Original).  I named this worksheet Names; I planned to keep it for future Thunderbird EML exports.  I named the main worksheet Data.  I sped up the process of developing replacement names by using various functions (e.g., FIND, MID) to distinguish first and last names of individuals.  When I had my completed list of preferred names for Senders and Recipients, I went back to the main (Data) worksheet.  In column E (NewSender), I entered =VLOOKUP(D2,Names!$A$2:$B$869,2,FALSE).  (There were 869 rows in the Names spreadsheet.)  I copied that formula over to column G (NewRecipient); it provided a similar replacement for the Recipient values.
  • I inserted columns to figure out the date and time.  In column D ("Y"), I used =LEFT(C2,4).  In colunn E ("M"), I used =MID(C2,5,2). In column F ("D"), I used =MID(C2,7,2).  In column G ("H"), I used =MID(C2,10,2).  In column H ("M"), I used =MID(C2,12,2).  Finally, in column I ("NewDate"), I used =D2&"-"&E2&"-"&F2&" "&G2&"."&H2.
  • I added column O ("New Name").  There, I used =CHAR(34)&I2&" Email from "&K2&" to "&M2&" re "&N2&".eml"&CHAR(34).  This produced a new name for the EML file.  I sorted on this column to identify instances where my formulas had failed, and made corrections as needed.
  • I added column P ("Batch").  There, I used ="ren "&CHAR(34)&B2&".eml"&CHAR(34)&" "&O2.  This produced a batch command to rename the EML file to my preferred new name.  I copied the command down the column and then copied all those commands, one from each row, to Notepad.  I saved the Notepad file as Renamer.bat and put it into the folder where the EMLs were located.  I ran Renamer.bat.  The renamed files sorted conspicuously in Windows Explorer, so I didn't need to work up a modification of these REN commands in a new column Q, using MOVE instead of REN, to move the newly renamed files to another folder.  Instead, I could just cut them from the folder in Windows Explorer and put them aside.
  • Now I had a couple dozen EMLs remaining.  They had not renamed properly.  I probably should have added something like " > errorlist.txt" at the end of each batch command, to show me whether I was trying to give the same name to two different files.  I did a DIR of the files remaining, saved its output to dirlist.txt, copied the contents of dirlist.txt into Excel, and compared them against my main spreadsheet.  To my surprise, none of these files appeared in the original list of files shown there.  I'd had some problems not described in this post; had I somehow dropped some EMLs somewhere along the line?  Was I not doing this comparison properly?  I did not have a clear answer.  I worked up another set of new file names for these EMLs, substantially following the steps presented above, and renamed them.  It looked like, somehow, at least some of them were duplicates after all.  So I was not understanding something there.  Others were apparently not renaming properly because the original filenames contained characters like ®.
In the end, I wound up with all but 10 of the exported emails.  But which 10 did I lose?  I probably didn't lose any.  There was a point when I deleted a handful of what I thought were duplicates.  Now it seemed they probably weren't.

As these final notes suggest, while this process went much more smoothly than on my previous exports of emails from Thunderbird, there were still some parts of the process where I was making mistakes or where things were not going smoothly.

Fourth Step:  Converting the Appropriately Renamed EML to PDF

Some of the previous posts cited at the top of this post grappled with the problem of converting EML to PDF.  It would seem that it should have been a straightforward matter of selecting EMLs in Windows Explorer -- indeed, within Thunderbird itself -- and clicking a Print command.  Alas, it was not, not if the goal was to have PDFs whose filenames would be recognizable.  I hoped there would be a Thunderbird add-in that would solve all the many steps shown above.  I hadn't found one yet.  In my most recent effort, I had proceeded only as far as a truly cumbersome solution that divided EMLs with and without attachments, used Emacs to edit EMLs with attachments so that they would print, extracted attachments to separate files that could also be PDFed, and then manually matched the PDFed attachments up with their PDFed parent emails.  Truly a mess, very time-consuming, and for that reason I hadn't gone very far with it.  Most of my emails were still in EML rather than PDF format.

I had decided, generally, that PDF was the superior long-term archival format.  I didn't want lots of formats rattling around, lest the day come (as had happened for previous formats) when it was a struggle to find software that would read it.  That said, EMLs were presently displaying nicely in Thunderbird, with easy access to attachments.  Having devoted the time to the foregoing effort, I was presently out of time for further development of this project.  So the stack of EMLs grew higher, and the day for conversion to PDF still lay somewhere in the future.


Wednesday, February 29, 2012

Excel 2003: Print or Export the Formulas Used in Each Cell

I had a spreadsheet in Excel 2003.  (I suspect the approach used here would also work in other versions of Excel, but I have not tried it.)  I wanted to print out the formulas used in each cell.  I did a couple of searches and wound up in a thread where they were advising me to use a macro for this purpose.  The steps I used to set up the macro were similar to those that I had used for another Excel macro:

  1. Close all Excel files other than the one you're working on.
  2. Go into Tools > Macro > Visual Basic Editor > Insert > Module.
  3. Copy and paste macro text (see below) into the window.
  4. Go to File > Close and return to Microsoft Excel.
  5. In this case, I used the macro by going into Tools > Macro > Macros and running the ListFormulas macro.
The text of the macro -- what I copied and pasted into the module window -- was as follows:
Sub ListFormulas()
    Dim FormulaCells As Range, Cell As Range
    Dim FormulaSheet As Worksheet
    Dim Row As Integer
    
'   Create a Range object for all formula cells
    On Error Resume Next
    Set FormulaCells = Range("A1").SpecialCells(xlFormulas, 23)
    
'   Exit if no formulas are found
    If FormulaCells Is Nothing Then
        MsgBox "No Formulas."
        Exit Sub
    End If
    
'   Add a new worksheet
    Application.ScreenUpdating = False
    Set FormulaSheet = ActiveWorkbook.Worksheets.Add
    FormulaSheet.Name = "Formulas in " & FormulaCells.Parent.Name
    

'   Set up the column headings
    With FormulaSheet
        Range("A1") = "Address"
        Range("B1") = "Formula"
        Range("C1") = "Value"

        Range("A1:C1").Font.Bold = True
    End With
    
'   Process each formula
    Row = 2
    For Each Cell In FormulaCells
        Application.StatusBar = Format((Row - 1) / FormulaCells.Count, "0%")
        With FormulaSheet
            Cells(Row, 1) = Cell.Address _
                (RowAbsolute:=False, ColumnAbsolute:=False)
            Cells(Row, 2) = " " & Cell.Formula
            Cells(Row, 3) = Cell.Value
            Row = Row + 1
        End With
    Next Cell
    
'   Adjust column widths
    FormulaSheet.Columns("A:C").AutoFit
    Application.StatusBar = False
End Sub 
(Note that the format of this blog may wrap some lines.  Copying and pasting may yield better results than retyping.)  The author named in that code, John Walkenbach, provided this code and also offered a Power Utility Pak ($40) that contained this and other tools.  I had installed a couple of freeware utility collections -- ASAP Utilities and Morefunc -- and I hardly ever used them.  I checked the list of tools in ASAP Utilities, just in case, but didn't find anything quite along these lines.  A quick glance revealed no list of Morefunc utilities.

When I ran the macro (step 5, above), it seemed to hang my machine.  I was using a fairly large spreadsheet -- I probably should have tried it on something smaller -- but instead I went to bed.  I didn't know how long it took, but it worked.  When I awoke, it had created a new worksheet (i.e., a new tab at the bottom of the spreadsheet), with three columns:  Address (e.g., F2), Formula (e.g., =C2), and Value (e.g., 17).

Saturday, April 23, 2011

Windows 7: Archiving Emails (with Attachments) from Thunderbird to Individual PDFs - First Try

I had been collecting email messages in Thunderbird for a long time.  I wanted to export them to PDF-format files, one message per file, with filenames in this format:

2011-03-20 14.23 Email to John Doe re Tomorrow.pdf
reflecting an email I sent to John Doe at 2:23 PM on March 20, 2011 with a subject line of "Tomorrow."  This type of filename would sort correctly in Windows Explorer, chronologically speaking; hence, I could quickly see the order of messages.  There were a lot of messages, so I didn't want this to be a manual process.  This post describes the steps I took to make it semi-automated.

The Screen Capture Approach

The first thing I did was to combine all of the email messages that I wanted to export into one folder in Thunderbird.  Then I deleted duplicates from that folder.  Next, I decided that, actually, I was just interested in exporting messages prior to the current year, since recent messages might have information I would want to search for in Thunderbird.  So I moved the older messages into a separate folder.  I maximized the view of that folder in T-bird and shut down unnecessary layout features (i.e., message pane, status bar), so that I could see as many emails as possible on the screen, and as much as possible of the relevant data (i.e., date, time, sender, recipient, subject) for each email.  I did that because I wanted to capture the text information about the individual email messages.  The concept here was that I would do a screenshot for each screenful of emails, and would save the data from that screenshot into a text file that I could then massage to produce the desired filenames.  For this purpose, I tried a couple of searches; I downloaded and ran JOCR; but after a bit of screwing around I decided to revert to the Aqua Deskperience screen text capture shareware that I had purchased years earlier.

Index.csv

Then it occurred to me that perhaps I could just export all that information from T-bird at once.  I ran a search and downloaded and installed the ImportExportTools add-on for T-bird.  (Alternatives to ImportExportTools included IMAPSize and mbx2eml, the latter explained by About.com.)  It took Thunderbird a while to shut down and restart after the installation.  I assumed it was getting acquainted with the fact that I had relocated so many messages to a new folder.  When it did restart, I ran the add-on (Tools > ImportExportTools > Export all messages in the folder > Just index (CSV)).  I opened the CSV file (named index.csv) in Excel and saw that this was perfect:  I had a list of exactly the fields mentioned above (date, time, etc.).  I verified that Excel was showing a number of rows equal to the number of messages reported on the status bar back in Thunderbird.

I noticed that some of the data in the Excel file included characters (i.e., \ / : * ? " < > | ) that were not permitted in Windows filenames.  The Mbx2eml option (above) would have removed these characters automatically, but for this first time I wanted to do everything manually, so as to see how it was all working.  I thought this might also be better for purposes of making things the way I wanted them.  I was also not sure that Mbx2eml would produce a CSV file, or that it would output the emails in the same order.  There seemed to be some other potential limitations.  It looked like a worthy alternative, but I didn't explore it.

Somewhere around this point, I went ahead prematurely with a time-consuming effort to revise the entries in the spreadsheet, so as to remove the unacceptable characters and otherwise make them look the way I wanted.  Eventually, I realized that this was a mistake, because now I would have a hard time matching spreadsheet entries automatically with the actual emails that I would be exporting from Thunderbird.  So I dropped that attempt and went back to the point of trying to plan in advance for how this was all going to work.

Attachments

I had assumed that I wanted to export emails to individual .eml files because EML format would bring along any attachments that happened to be included with a particular email message.  But I didn't plan to just leave the individual emails in in EML format; I wanted to save them all as PDFs.  In other words, I wanted to have the email and its attachment within a single PDF.

A quick test notified me that printing EMLs would be no more successful at including the attachments than if I just printed directly from Thunderbird, without all this time-consuming exporting and renaming.  There were other solutions that would have worked for that purpose as well.  A search led me to InboxRULES, which for $40 would do something or other with attachments in Outlook.  (Alternate:  Automatic Print Email for $69.)  There didn't seem to be a solution for Thunderbird, and I wasn't keen on spending $40 and having to install Outlook and move all these emails there in order to print their attachments.  I thought about handling the attachments manually -- print the email first, then print the attchment, and append it to the email -- but a quick sort in Thunderbird told me there were hundreds of messages with attachments.  Funny thing about that, though:  as I arrow-keyed down through them in Thunderbird, allowing them to become visible one at a time, I saw that Thunderbird would change its mind with many of them:  it thought they had attachments, but then it realized they didn't.  That trimmed out maybe 5% of the ones that had originally been marked as having attachments.  But there were still a lot of them.

Another search led to some T-bird options, but it still looked like there was going to be a lot of manual effort before I'd have a single PDF containing both the email and its attachment.  Total Thunderbird Converter looked like it might come close, at a hefty price ($50).  It wasn't reviewed on CNET.com or anywhere else, as far as I could tell, so there was a risk that (as I'd experienced in other cases) the program simply wouldn't work properly.  But then I saw that they were offering a 30-day free trial, so I downloaded and installed it.  It turned out to be useless for my purposes:  it had almost no options, and therefore could not find my Thunderbird folders, which I was saving on drive D rather than C so as to avoid losing them in the event of a Windows update or reinstallation.  I looked at Email Open View Pro (or its alternate, emlOpenView Free), which also offered a free trial.  It didn't look like it (or Universal Converter, or MSG Viewer Pro, or E-mail Examiner, or Convert EML to PDF) would bring the attachments into the same PDF as the email, so I moved on.  I tried Birdie EML to PDF Converter.  Their free demo version allowed me to convert one EML file at a time.  I liked its interface:  it gave me eight different naming options for the resulting file (e.g., "date + subject + from," in several different date formats).  I didn't like the output, though:  they formatted the PDF for the EML file oddly, with added colors that I didn't want, and all they did with the attachment was to put it into a subfolder bearing the same name as the resulting PDF.  I'd still have to PDF it -- the example I used was an EML with a .DOC file attachment -- and merge it with the PDF of the EML.  But now they had led me to see that perhaps I could at least automate the extraction of attachments, getting me partway to the goal.

At about this point, Thunderbird inconveniently lost a folder containing several thousand email messages.  It just vanished.  The program was struggling there for a few minutes before that, and part of me was instinctively thinking that I should shut down the program and do a backup, but this would have been a very deeply subconscious part of me that was basically unresponsive under real-life conditions.  In other words, I didn't.  So now I had to go rooting around in backups to see what I could rescue from the wreckage.  I found that Backup Maker had been happily making backups, as God intended.  Amazing what can happen, when you have about five different backup systems running; in this case I had just wiped a drive, moved a drive, etc., and so of course Backup Maker was the *only* backup system that was actually in a position to restore real data when I seriously needed it.  What Backup Maker had saved was some files with an .MSF extension.  These were supposedly backups of Thunderbird.  But then, no, on closer inspection I realized these were much too small, so I had to do some more digging.  Eventually I did patch together something resembling the way things had been before the crash, so I could go back and pick up where I had left off.  A couple of days passed for other interruptions here, so the following information just reports where I went from this point foward.

I had the option of just saving the Thunderbird file, or the exported emails, for some future date when there would perhaps be improved software for printing attachments to PDF in a single operation with the emails to which they were attached.  There had been times when software developments had saved (or would have saved) a great amount of time in what would have been (or actually was) a tedious manual effort.  On the other hand, I had also seen situations where letting something sit meant letting it become confused or corrupted, or where previous learning (especially on my part) had been lost.  I decided to go ahead with converting the emails to PDF to the extent possible without a tremendous time investment.

My searching led to Attachment Extractor, a Thunderbird extension.  I installed it, highlighted two emails with attachments, right-clicked on them, and selected "Extract to Suggested File-Folder."  It worked -- it did extract the attachments without removing them from the emails.  I assumed it would do this with hundreds of emails if I so ordered.  Then, to get them matched up with PDFs of the emails to which they were attached, I would apparently have to page down through those emails one by one, looking at what attachments they had, and setting them up for more or less manual combination.  Attachment Extractor did have one potentially useful feature for this purpose:  a right-click option to "Extract with a Custom Filename Pattern."  I found that I could configure the names given to the extracted attachments, so that they would correspond at least roughly with the names of emails printed to PDF.  To configure the naming in Attachment Extractor, I went into Thunderbird > Tools > Add-ons > Extensions Tab > AttachmentExtractor > Options > Filenames tab.  There, I used this pattern:
#date# - Email from #fromemail# re #subject# - #namepart# #count##extpart#
and, per instructions, in the Edit Date Pattern option I used this pattern:
Y-m-d H.i
That gave me extracted attachments with names that were at least roughly similar to the format I wanted (see above).

Batch Exporting Emails with Helpful Filenames

Now if I could print the corresponding email to PDF with a fairly similar name, the manual matching might not be so difficult.  A search led to inquiries about renaming PDF print output.  For $60, I could get Zan Image Printer, which sounded like it would have some capability for automating PDF printout filenames.  Print Helper, for $125 to $175, was another option.  A Gizmo's Freeware article did not say much about this kind of feature, though several people asked about it.  A list of free PDF printers led me to think that UltraPDF Printer was free and would do this; its actual price was $30. 

The pdf995 Experiment

At this point, I was reminded of how much time I could waste on uncooperative software.  No doubt many people have used pdf995 successfully.  I was not among them.

I tried Pdf995's Free Converter.  The instructions on bypassing the Save As dialog were in the Pdf995 Developer's FAQs page.  They seemed to require me to open C:\Program Files\PDF995\res\pdf995.ini in Notepad.  But that .ini file seemed to be configured for printing one specific file that I had just printed.  They didn't say how to adjust it.  Eventually I figured out that I needed to download and install pdfEdit995, and make the changes there.  So I tried that.  But I got an error message:
PdfEdit995 requires that Pdf995 v9.1 or later and the free converter v1.2 or later are already installed on your system.
But they were!  I had just installed them.  Was I supposed to reboot first?  No, a reboot didn't fix it.  I tried again to install basic pdf995 and the Free Converter, which I had downloaded together.  Once again, I got the webpage saying I had installed it successfully.  Was I supposed to install the printer driver too?  I understood the webpage to be saying that was included in the 9MB download.  But I tried that.  Got the congratulatory webpage, so apparently it installed correctly.  Now I noticed I had another error, which had not come up on top, so I was not sure how long it had been there:
Setup has detected you have an old version of pdfEdit995 that is incompatible with the latest version of pdf995.
But I had just downloaded it from their website!  Not an altogether auspicious beginning here.  But I downloaded and installed the very latest and tried again, and now it seemed to work, or at least I got a different congratulatory webpage than before.  A cursory read-through still did not give me a clear approach to automated naming of PDFs.  Instead, they said that maybe I wanted their OmniFormat program for batch PDF creation.  Who knew?  I downloaded and installed OmniFormat.  Got a new congratulatory webpage, but still no straightforward explanation of batch naming.  Instead, it said that pdfEdit995 was what I wanted to create batch print jobs.  So, OK. a bridge too far.  Though at this point they specified "batch print jobs from Microsoft Office applications," raising the prospect that this wasn't going to work from Thunderbird.  Went back to their incredibly tiny-print pdfEdit instructions page.  It said I would have to set pdf995 as the default printer to do the batch thing.  That was OK.  But it still sounded like it was intended primarily for batch printing from Microsoft Word.  I decided to just try making pdf995 the default printer.  That required me to go to the Windows Start button > Settings > Printers > right-click on PDF995 > set as default printer.  While I was there, I right-clicked on PDF995 and looked at its Properties, but there didn't seem to be anything to set for purposes of automating printing.  Now I went to Thunderbird, selected several messages, and did a right-click > Print.  Funny, it defaulted to Bullzip, which was my usual default printer.  Checked again:  yeah, pdf995 was definitely set as my default printer.  Tried again, and this time manually set it to pdf995 when it was printing.  It asked for the filename, so that wasn't working.  Back in Printers, I looked at the Properties for Bullzip, but it didn't seem to have any options for automatic naming either.  It seemed pdf995 was not the solution for me.  I came away from this little exploration with less time and ambition for the larger project.  Certainly I wasn't in the mood to buy software and then discover that I couldn't make it work.

Further Exploration

I ran across an Addictive Tips article that said PrintConductor was a good batch printing option, though I might need to have Adobe Acrobat installed first.  I did, so I took a look.  There was an option to download Universal Document Converter (UDC) as well.  I wasn't sure, but I thought I might need that for Print Conductor, so I installed both.  PrintConductor didn't seem to have a way of printing EML files.  Meanwhile, UDC's installer gave me the option of making it the default printer, so I tried that.  But as before, Thunderbird defaulted to Bullzip, so I had to select UDC as my printer manually.  (Print Conductor did not appear in the list of available printers.)  When I selected UDC as the printer, before printing, I clicked on the print dialog's Properties button and went into the Output Location tab.  There, I chose the "predefined location and filename option."  I left the default filename alone and printed.  And it worked.  Sure enough, I had a couple of PDFs whose names were the same as the Subject fields shown in Thunderbird for those printed emails.  So I would be able to match them with the attachments produced by Attachment Extractor (above).  All I had to do now was to pay $69 for a copy of UDC, so that each PDF would not have a big black "Demo Version" sticker on it.

Recap

So to review the situation at this point, I had a way of extracting email attachments with highly specific date, time, and subject filenames.  I also had a way of extracting emails themselves whose filenames would show date and subject, using ImportExportTools (above):  Tools > ImportExportTools > Export all messages in the folder > EML format.  Unfortunately, there could be a number of messages in a single day on the same subject.  Without the time data in the filename, I would have duplicates.  More to the point, it would be difficult to match emails and attachments automatically, and I didn't want to go through that matching process for large numbers of emails.  I would also prefer a result in which emails converted to PDFs would appear in the right order in Windows Explorer, and that would require the time data.  As I was writing this recap, several days after writing the foregoing material, I was not entirely sure that I had verified that the output filename in UDC would include the time data.  But if that proved to be the case on revisit, at this point one option would be to buy UDC (or perhaps one of the other programs just mentioned) for purposes of producing properly named emails.  Another would be to export the list of emails to Index.csv (above) and to hope that this list would match the order in which ImportExportTools would export individual emails.  There would still be the possibility that such a program would sometimes fail to do what it was supposedly doing, perhaps without me noticing until long after the data from which I had exported and renamed various files would be long gone.

The Interim Solution

I decided that, at this point, I could not justify the enormous time investment that would be required to complete this project -- in particular, to manually print to PDF each attachment to each email, to combine those PDFs, and to match and merge them them with a PDF of the email message to which they had been attached.  This seemed like the kind of project that really had to await some further development in application software.  For all I knew, the kind of solution I was seeking already existed, and I was just waiting until the day when I would become aware of it.  It was not at all an urgent project -- I rarely consulted attachments for old emails, and almost never consulted them for prior years, where I was focusing my present attention.

I wanted to get those old emails out of Thunderbird.  I didn't like the idea of having all that data at the mercy of a relatively inaccessible program (i.e., I couldn't see those emails in Windows Explorer), and anyway I didn't want T-bird to be so cluttered.  It seemed that a good solution would be to focus on the emails themselves for now.  I would export them to EML format.  EMLs would continue to contain the attachments.  I would then zip the EMLs into a small number of files, each no more than a few gigabytes in size, perhaps on a year-by-year basis, and I would store them until further notice.  Before zipping, I would make sure the EMLs were named the way I wanted, and would print each of them to separate PDFs.  So then I would have at least the contents of the emails themselves in readily accessible format, and could go digging into the zip file if I needed an attachment.  If I did someday find a way to automate the task of combining the emails and their attachments into a single PDF, I would give those PDFs the same name as the current email-only PDFs, so that the more complete versions would simply overwrite the email-only versions in the folders where I would store them.

Export and PDF via Index.csv

I decided to try and see if the Index.csv approach would work for purposes of producing EMLs whose names contained all of the elements identified above (i.e., date, from, to, subject).  I had sorted the old emails in Thunderbird into separate folders by year.  I went to one of those folders in T-bird and sorted it in ascending date order.  Then I went into Tools > ImportExportTools > Export all messages in the folder > Just index (CSV).  This gave me what appeared to be a matching list of those messages, in that same order.  The number of lines in the CSV spreadsheet (viewed in Excel) matched the number of messages in that folder as stated in T-bird's status bar.

I wondered what would happen if I exported another Index.csv after sorting the emails in that T-bird folder in descending chronological order.  Good news:  the resulting index.csv produced in that experiment seemed to be reversed from the one I had produced in ascending order.  At least the first and last emails were in reversed positions.  So it did appear that index.csv matched the order that I saw in T-bird.

On that basis, I added an Index number column at the left end of the index.csv file I was working with, the one with emails sorted in ascending date order.  This index column just contained ascending numbers (1, 2, 3 ...), so that I could revert to the original sort order if needed.  I assumed that the list would continue to sort in proper date order, but I planned to revise the date field (presently in "7/4/1997 18.34" format) so that it could function for file sorting purposes (e.g., 1997-07-04 18.34).  I wasn't sure that the present and future date fields would always sort exactly the same.  I could have retained the existing date field, but I wasn't sure that it, itself, was reliable for sorting purposes:  would two messages arriving in the same minute always sort in the same order?

Now I exported the emails themselves:  Tools > ImportExportTools > Export all messages in the folder > EML format.  As partially noted above, these were named in Date - Subject - Number format.  I now did a search to try to figure out what that number signified.  It wasn't clear, but it seemed to be just randomly generated.  Too bad.  It would have been better if they had included the time at the start of that random number, and had put it immediately after the date, so that the EMLs would sort in nearly true time order.  (There could still be multiple emails on the same subject within the same minute, and T-bird didn't seem to save time data down to the second or fraction of a second.)  It seemed I would have to manually sort files bearing the same subject line and arriving or being sent on the same day.  There would surely be large numbers of files like that.  I now realized they would not at all be sorted correctly in Windows Explorer:  with only date (not time) data in the filename, a file arriving in the morning with a subject of Zebras would be sorted, in Windows Explorer, after a file arriving in the afternoon on the subject of Aardvarks, and if there were three on the subject of Aardvarks they would all be sorted together even if they had arrived at several different times of day.

Ah, but now I discovered that ImportExportTools had file naming options.  Silly me.  I had just overlooked that.  But there they were:   Tools > ImportExportTools > Options > Filenames tab.  I selected "Add time to date" and I chose Date - Name (Sender) - Subject format.  Now I tried another export of EMLs.  The messages now had names like this:
19970730-0836-Microsoft-Welcome!
I checked and, sure enough, that was a message from Microsoft on that date at 8:36 AM.  Suddenly the remainder of my job got a lot easier.  I went back to the Index.csv spreadsheet (now renamed as an .xls) and worked toward perfecting its match with the filenames produced by ImportExportTools.  There were two parts to this mission.  First, I had to rework the Index.csv data exported from T-bird so that it would match the filenames given to the EMLs by ImportExportTools.  Second, I would then use the spreadsheet to produce a batch file that would rename those files to the format I wanted.  This called for some spreadsheet manipulation described in another post.

Converting EMLs to PDF

Now I faced the problem of converting the exported EMLs to PDF, as distinct from the problem (above) of exporting PDFs from Thunderbird. 

I found that EMLs could be converted into TXT files just by changing their extensions to .txt, which was easy enough to do en masse with a program like Bulk Rename Utility.  That would permit them to be converted to PDFs without the rich text, if necessary, since it was a lot easier to find a freeware program that would do that (or, in my case, to use Acrobat) than to find one that would PDF an EML.  This appeared to be a viable solution for some of the older emails, which had apparently been through the wringer and were not showing much sign of having glorified colors or other rich text or HTML features.

Before proceeding with this, I decided to export all of the remaining EMLs from Thunderbird.  I knew I could read the EMLs and the TXTs (if I renamed them as that); I also knew I could reimport them into T-bird.  This seemed like a separate step.  I also decided that going back through the exporting process would give me an opportunity to write a cleaner post that would summarize some of the foregoing information.

Friday, March 18, 2011

Thunderbird for Windows: Transition from Portable to Desktop; Duplicate Email Remover

I was using Thunderbird Portable 3.1.4 in Windows 7.  I wanted to use an add-on (Remove Duplicate Messages (Alternate) 0.3.6) to delete duplicate email messages.  I got the impression that it wouldn't run on the portable version.  I had been thinking about switching to the desktop version of Thunderbird anyway, and now seemed like the time.  To figure out how to transition from portable to installed versions of Thunderbird, I ran a search and found advice that seemed on point.  I did not precisely track all of the steps I took in this process, but the following is a pretty close approximation.

I started by installing regular (i.e., not portable) Thunderbird.  I think I created an email account at that point.  This generated C:\Users\Administrator\AppData\Roaming\Thunderbird\Profiles\f0xqaflh.default.  (The f0xqaflh part was randomly generated -- other installations would have a different ????????.default file.)  I closed Thunderbird and moved C:\Users\Administrator\AppData\Roaming\Thunderbird\Profiles\f0xqaflh.default to D:\Thunderbird\Profiles\f0xqaflh.default.  I put it on D so that it would be saved in case of Windows reinstallation.

Then I went to Start > Run > "thunderbird.exe -ProfileManager."  In Profile Manager, I clicked on Create Profile > Next > Choose Folder and pointed to D:\Thunderbird\Profiles.  I exited Profile Manager and moved the contents of ThunderbirdPortable\Data\profile (i.e., just the profile subfolder) to D:\Thunderbird\Profiles.  I clicked on my Start Menu shortcut for Thunderbird (not portable).  It ran, and it seemed that all of my emails were there.  I deleted the folder containing the portable version.

I hoped this was all I needed.  Now it was time to try to delete duplicate emails.  I installed the duplicate email remover add-on (Tools > Add-ons > Extensions tab > Install) and ran it (Tools > Remove Duplicates).  It wouldn't check my archive folder until I turned off the Skip Special Folders option (Tools > Add-ons > Extensions tab > Options > Message Comparison tab).  At first, I used the default comparison criteria in that same tab:  Author, Recipients, CC List, Message ID, Send Time, Size, Body, and Subject.  This did not identify too many duplicates, but it appeared they were exact duplicates, so I could delete them all without much manual comparison.  I ran another search, without the Message ID criterion, and yet another, without the Size comparison.  The former likewise seemed not to require much manual comparison; the latter did.  In other words, the final comparison criteria (Author, Recipients, CC List, Send Time, Subject) produced many alleged duplicates, some of which were of very different size.

The add-on did not allow me to open individual emails (via double-click or right-click), to see why two emails bearing the same subject, date, time, etc. would be so radically different in size, so I had to do a lot of manual toggling back and forth between the duplicate remover and Thunderbird, and then searching for individual items in T-bird, to check emails one by one.  In this regard, it was not like DoubleKiller, which I had found to be an excellent duplicate file finder.  But the manual selection process was similar:  check or uncheck the desired item under the "Keep?" column.  Both of these programs would probably have been easier to use if it had been possible to select or deselect items by clicking anywhere on the line, rather than having to mouse over to precisely the checkbox spot each time.

The add-on did allow arrow-key and spacebar navigation and selection.  Playing with this, I eventually discovered that the Enter key would open T-bird to one of the identified duplicate messages, but in that case the comparison window disappeared and I was back in Thunderbird, leaving me to wonder why I was now seeing only one of the duplicates.  Then I realized, oops, hitting the spacebar had not actually opened the selected duplicate; it had gone ahead and run the deletion.  Well, I hoped those 700 messages really were duplicates.  I had been verging toward just saying to hell with the time-consuming and awkward manual comparison process anyway; I just wasn't quite ready for this to happen.  I looked in Thunderbird's Trash folder and realized that I had not emptied the trash before running the duplicate checker (another ideal feature for the duplicate checker), so now I would have to restore not just the 700 messages that I had apparently just deleted, without an "Are you sure?" message, but would also have to restore about 700 other messages that were apparently in the Trash previously, since I was now seeing a total of 1400 messages there.  As I looked at the Trash, I found myself wondering, actually, what was wrong with those 700 other messages.  They didn't seem to be messages that I would have wanted to delete, unless they too were duplicates.  I decided to move the whole lot of them to the archive folder that I had been dup-checking.  At this point, needless to say, I was beginning to fear that I might just be turning my whole email archive into a giant hash.  I started back through a sequence of dup-checks, beginning with the most conservative (i.e., with the most comparison criteria checked), but of course this time I had no patience for checking individual items.  Instead, I just dreamt of an update that would actually display large thumbnails of alleged duplicates, right there in the add-on.

The column headings in the dup-check results window permitted sorting in ascending or descending order.  At first, I thought that feature was not working for some criteria.  Then I figured out that it was meant to sort only within a comparison.  For example, if Size was not a comparison criterion, it would not be in boldface in the top row, and then clicking on it would sort alleged duplicates according to size; but if Size was a comparison criterion, it would be bolded, and then clicking on that heading in the top row would do nothing, since in that case all duplicates within a set would be identical by definition.  It would have been helpful if selected comparison criteria headings had enabled a sorting of all pairs.  That is, if I was comparing by Send Time, I wanted to be able to show the earliest ones (i.e., the pairs of allegedly time-identical messages) first, so that I wouldn't have to do so much jumping-around when I toggled to Thunderbird for a manual comparison.

After running the several comparisons mentioned above, I tried running one with only the Send Time and Subject criteria checked.  This revealed some apparent duplicates whose only difference was that for some reason one item in a pair would be enclosed in quotation marks (e.g., a message from "Joe") while the other would not (e.g., a message from Joe).

That was the end of my use of the add-on at this point.  I returned to finish this post several hours after completing these processes.  It appeared, at that point, that the transition to desktop Thunderbird and the use of the add-on to delete duplicate emails were both successful.

Tuesday, August 31, 2010

Exporting from Thunderbird, Importing into Thunderbird

I was using Thunderbird as my e-mail program in Ubuntu.  I decided to switch to using Thunderbird 3.1 for Windows as my e-mail program.  It seemed, at this writing, that most people who were transitioning to Thunderbird were going toward Ubuntu, not away from it.  So in this post I am writing up some things that I had to figure out along the way.

I decided to switch to Thunderbird for Windows because I was planning to keep Ubuntu as my underlying operating system, but to focus my applications on Windows XP, which I would be running in a virtual machine in VMware.  This arrangement, I found, gave me dual-boot advantages without having to reboot.

I was particularly interested in using the portable version of Thunderbird as my Windows XP e-mail application.  This would enable me to take my e-mail and my address book with me on a USB flash drive.  The discussion of Thunderbird for Windows in this post relates specifically to the portable version.

After setting up Thunderbird Portable on a Windows computer and making a backup copy, I went into Ubuntu and simply copied over my data.  I found the relevant data in Nautilus (i.e., Ubuntu's File Browser, the equivalent of Windows Explorer), in this location:  Home Folder / .thunderbird / 6abstqrst.default.  (The 6abstqrst part of that name was apparently generated at random, and as such would have a different name in other installations.  Point is, it's the "default" folder.)

I copied that entire default folder to a USB jump drive and compared its subfolders, item by item, to those on the computer where I had installed Thunderbird Portable.  (Of course, I did these and other folder manipulations (below) while Thunderbird was *not* running.)  The comparable e-mail account data seemed especially to be located under the Data\profile\Mail folder.  Thunderbird's Address Book seemed to be in Data\profile\abook.mab.  The Address Book copied and worked without any problem.  The following discussion focuses on problems in getting the e-mail accounts to work correctly.

The simple process of copying e-mail accounts over seemed to work well enough.  I copied all of the folders from Ubuntu via my jump drive to the corresponding Thunderbird folders on the Windows machine.  I kept backups and did this rather painstakingly.  After replacing the contents of one subfolder in Thunderbird Portable with the contents brought over from Thunderbird for Linux, I would start up Thunderbird Portable and make sure that it still seemed to be functioning OK.  Through this process, I ended up with a Thunderbird Portable setup where the desired e-mail accounts did exist.  This may have been helped by the decision to run Thunderbird's Tools > Import option, which I did somewhere along the way.

When I was done, unfortunately, the e-mail accounts that showed up when I ran Thunderbird Portable were still not showing the contents that I wanted them to show.  For example, my Hotmail account was there, but its Inbox was empty, whereas the Hotmail Inbox on Thunderbird for Linux had contained a dozen e-mail messages.  I could see, moreover, that the Data\profile\Mail\pop3.live.com folder contained an Inbox that was 96MB in size.  That was larger than I would have expected, and in any case much larger than an Inbox containing nothing, which is what Thunderbird Portable was showing me.

It seemed that Thunderbird Portable was recognizing the e-mail accounts themselves, but was drawing the contents of those accounts from the wrong place.  I verified this by removing the entire Mail subfolder from Thunderbird Portable.  When I started it up, it was still seeing the same few old items in the same accounts.  I thought it might have observed or figured out where I had moved the Mail folder, so I removed it from that computer entirely; yet Portable was still seeing those same ghostly remnants of some previous state of my Thunderbird for Linux installation.

It took a bit of effort to figure out where those ghostly remains were hanging out.  Portable wasn't drawing them from Data\profile\Cache; they persisted even after I emptied that.  To find the answer, I started Portable, changed the system date to a year in the future, copied one of those old e-mail messages from one folder to another, changed the system date back to the correct year, and exited Portable.  Then I copied the entire Portable folder to a workspace folder elsewhere on the computer, and searched for files bearing that future year's date.  Aside from cache files and scripts, there turned out to be only a handful of files dated in that future year.

That effort led to the discovery that the Data\profile\prefs.js file that I had brought over from Ubuntu was not suited for Windows.  I went to the original backup of my Thunderbird for Windows Portable and copied its prefs.js file to the Portable installation that I was tinkering with, thus overwriting the Ubuntu prefs.js file.  Both of them began with a warning:  "Do not edit this file."  Instead, the warning said, I could follow the instructions provided on a webpage that, as it turned out, was no longer in existence.  A different webpage did advise me to edit prefs.js directly.  Again, of course, I would want to do this while Thunderbird was not running; and if there was any doubt about that, Windows Task Manager (Ctrl-Alt-Del > Processes tab) would confirm whether there was an instance of thunderbird.exe or ThunderbirdPortable.exe currently running.

I opened prefs.js in Notepad, widened the Notepad window to prevent lines from wrapping, and took a look.  I decided I didn't know exactly how to edit prefs.js, so I tried the alternative that the instructions at the top of prefs.js seemed to prefer:  I started Firefox, typed about:config in the address line, and looked to see what was there.  (Note that I did not have any other copies or versions of Thunderbird installed on that computer, else things could have become very confusing.)  I searched for instructions, and eventually realized that it might not make sense to use about:config to change system preferences for a portable program.

So I tried another search.  This led to a webpage that led, eventually, to a mozillaZine webpage that advised me to start over and try using the Kaosmos ImportExportTools utility.  So I made a fresh start, replacing my munged-up Thunderbird Portable with a copy of the backup, and then I installed the ImportExportTools utility as instructed.  Then, in Thunderbird Portable, I went to Tools > ImportExportTools > Import mbox file.  At this point, I had to ask myself:  What, exactly, is an mbox file?  A search led to the discovery that mbox is an e-mail storage format that didn't seem very relevant to Thunderbird's own storage format

To test this, I went ahead with where I was in the ImportExportTools process:  I selected the "Select a directory where searching the mbox files to import (also in subdirectories)" option, and pointed it toward the top level of the folder I had copied over from Ubuntu Thunderbird.  To my surprise, the tool asked me if I wanted to import various programs.  I said no to parentlock and yes to all the other folders it asked me about.  After asking me about those folders, it didn't seem to be doing anything, except that I could see movement in the green progress bar at the bottom of the screen.  When it seemed to be finished, I didn't see any change in Thunderbird's list of folders.  I killed and restarted Thunderbird.  Still no change.  I poked around and then, whoa, I discovered that it had imported everything, including my archives, into the Hotmail Inbox folder (not the actual online one -- just the copy of it that Thunderbird keeps).  I killed T-bird again, made a backup copy of this remarkable state of Thunderbird Portable, restarted the program, and began moving and rearranging folders.

This was looking good, but there were still some things to fix.  First, in T-bird Portable, I tried sending a message that I had kept in the Hotmail drafts folder in Thunderbird for Ubuntu.  I got this message:

Send Message Error
Sending of message failed.
An error occurred sending mail.  Unable to establish a secure link with SMTP server smtp.live.com using STARTTLS since it doesn't advertise that feature.  Switch off STARTTLS for that server or contact your service provider.
A search and then a refined search led to the quick answer that I just had to stop my avast! antivirus software from scanning outgoing messages.

Next, I wanted to get rid of some Local Folders, especially the Inbox and Outbox.  I found a thread that made me think these folders were a product of Smart Folders, which would supposedly combine all of my e-mail inboxes into one Inbox, etc.  I did not want this.  Actually, I wasn't sure this was even the correct explanation, because I was seeing new incoming messages in my Hotmail Inbox, and they were not being mirrored in my Local Folders Inbox.  The advice I got from Yahoo! Answers, usually a font of goofy bewilderment, was as follows:
You can't remove the Smart Folders account using Tools -> Account Settings. You need to either edit prefs.js with a text editor or use the Config editor to delete the account from mail.accountmanager.accounts.
I was inclined to believe this because I had just run across another webpage with more or less the same conclusion.  But the advice on that webpage was oriented toward deleting all local folders, whereas I was using the Local Folders heading as the place to park my e-mail archive.  I right-clicked and saw, from Properties, that Outbox folder was located at Data\profile\Mail\Local Folders\Unsent Messages.  I quit T-bird, made a backup copy of the whole T-bird Portable folder, went into that Local Folders folder in Windows Explorer, and deleted the Unsent Messages entries.  I then restarted T-bird.  No joy.  As expected, the Outbox was still there and the Unsent Messages entries were back.  A new search led to a blanket statement that you could not delete the Outbox because it served an essential function, different from a Drafts folder:  it held messages that the user had tried to send but (because of e.g., no Internet connection) had not yet actually been sent.

So I turned to the next problem arising from the import into Thunderbird Portable for Windows.  I now had two top-level folders appearing at the left side of the T-bird window.  One was for my Hotmail account; the other was for Local Folders.  There should have been a third one, for another e-mail account that had appeared as a top-level folder in T-bird in Ubuntu.  This seemed to be a simple matter of going into T-bird Portable > File > New > Mail Account and entering the information about the account as it was recorded in T-bird for Ubuntu.  But the Mail Account Setup process stayed stuck for a long time on "Looking up configuration:  Trying common server names."  I finally went into Manual Setup and got it working that way.  And with that, the project was done.  I had transitioned from Thunderbird (Ubuntu) to Thunderbird Portable for Windows.

Sunday, May 16, 2010

Importing Microsoft Word Autocorrect Entries into OpenOffice.org Writer

I had been looking, for some years, for a way to import my list of AutoCorrect entries from Microsoft Word 2003 into the OpenOffice.org (OOo) word processing program.

In Word, I had found AutoCorrect invaluable for converting shorthand expressions into longer terms, saving me a lot of typing. For example, I could type “fttt” and watch it expand to “from time to time,” having previously defined it as such. My list of Word AutoCorrect terms had grown long, into the thousands of entries, so I could not just retype them into OOo Writer manually.

I did know how to export the AutoCorrect entries from Word to a text file. There were apparently several macros available for this purpose. The challenge had been in getting the items from there to Writer. A Linuxtopia webpage now suggested a possible approach, however, and I decided to explore it.

My first step was to get into Writer’s DocumentList.xml file. To do this, in Ubuntu’s Nautilus (i.e., File Browser) I went to /usr/lib/openoffice/basis-link/share/autocorr. I double-clicked on acor_en-US.dat (there were files for other languages and for other flavors of English). There was DocumentList.xml. Now, what to do with it? I right-clicked on it and chose Extract > Extract. This gave me an error message: “Extraction not performed. You don’t have the right permissions.” So I went into Applications > Accessories > Terminal and typed “sudo nautilus,” and then, using that superuser File Browser session, went back to that same autocorr folder and tried again. This time, I didn’t try extracting; I just right-clicked on acor_en-US.dat and chose “Open with Archive Manager” and then right-clicked on DocumentList.xml and chose “Open with” and chose gedit. I went to the end of the file, right before the “</block-list:block-list>” entry, and copied the whole previous entry. In my case, it was the one that would change “yuor” to “your.” In full, it read like this:

<block-list:block block-list:abbreviated-name="yuor" block-list:name="your"/>
They all seemed to follow that same format.  So apparently it was just a matter of getting my Word abbreviations into that form.  To test this, I added an entry right after that “your” entry.  Mine read like this:
<block-list:block block-list:abbreviated-name="yr" block-list:name="your"/>
After making that change, I saved the file.  This provoked a File Roller message:  “Update the file ‘DocumentList.xml’ in the archive ‘acor_en-US.dat’?”  I said yes, i.e., Update.  Then I started Writer and tried typing “yr.”  It didn’t work.  It would correct “yuor” to “your,” but it wouldn’t correct “yr” to “your.”  I rebooted the system, in case that would make a difference, and tried again.  It didn’t.  Yr was still not listed in Writer’s autocorrect replacement list.  I went back and looked at the end of DocumentList.xml.  “Yr” was still there.  Had I not entered it correctly?  It looked like I might have entered it twice, possibly from a previous try at the same thing.  I made sure there was just one entry for “yr.”  Then it occurred to me to delete the one for “yuor” and see what would happen.  Or, even better, I deleted the one for “yr,” the one that I had added, and I changed the one for “yuor” to be for “yr” instead.  I went back into DocumentList.xml but, what’s this, there were two entries for “yr” again.  Then I realized that the file edit time had not changed:  it seemed I was editing and saving the changes, no error messages, but I hadn’t come in as root, so there was not anything actually happening.  Editing as root, I saw another problem:  I had apparently inserted a copy of the list-ending “/block-list:block-list” command before my “yr” entry.  So perhaps Writer wasn’t going beyond that, and this was why it wasn’t seeing the “yr” item.  I made those changes, started Writer, and it worked!  “Yr” became “your.”  I went into Writer’s AutoCorrect options, looked at the end of the list, and sure enough, there was “yr.”

So now the mission was to incorporate a bazillion Word AutoCorrect entries into this DocumentList.xml file.  Or, no, as I thought of it, I decided the first step was to make a backup copy of this xml file and then delete its contents.  I had been working with my Word AutoCorrect list for years.  I didn’t need any surprises from whatever might be in DocumentList.xml.  Actually, to make it easier, I just made a quick copy of the whole acor_en-US.dat file.  Then, in DocumentList.xml, I deleted everything except the file starting and file ending lines:

<?xml version="1.0" encoding="UTF-8"?>
<block-list:block-list xmlns:block-list="http://openoffice.org/2001/block-list">

</block-list:block-list>

Since I would probably be doing this again – adding to the OOo AutoCorrect list from the Word AutoCorrect list, or possibly vice versa – I decided to manage it all through an Excel 2003 spreadsheet.  This, I thought, would also be a good way to compare the AutoCorr lists that I had developed on different computers.  That is, I was using AutoCorrect on more than one computer, and it seemed likely that there would be some cases where those lists were not compatible.  So I began with that part of the project.  I ran the AutoCorrect macro in Word on each computer and brought all of the resulting wordlists together into one folder.  I opened one of those wordlists, copied the whole thing, waited a few minutes to make sure it was all there, and pasted it all into an Excel spreadsheet.  Here, too, I wished the AutoCorrect feature included a column indicating the date last used, because a lot of these entries were totally unfamiliar to me and others were for things I was no longer writing about.  Probably I should have done this spreadsheet thing when I first installed Word.  Then it occurred to me that I could set up a virtual machine, install Word on it, and do something like that now.  But without manual examination, I still wouldn’t be able to tell which of those original Word AutoCorr entries I had ever used.

I did manage to come up with some sorting rules that helped somewhat.  After deleting exact duplicates from the several combined AutoCorr files, I sorted alphabetically according to Value (i.e., the term that resulted from the auto-correction) and then according to value length.  For example, I had given “acl” a value of “actual,” and Word came with “actualyl” as also having a value of “actual.”  I could have left both, but it seemed pretty unlikely that I would let a paper go out with “actualyl” in it (not to mention “additinal” and “adequit”).  Actually, I reasoned, I would rather risk letting a paper go out with “actualyl” in it than to endure the insult of having such a spelling correction in my AutoCorrect file.  So I deleted a bunch of those.  I also searched for items containing a space, since those tended to be from Word, not me (e.g., “witht he” becomes “with the”).  I searched for items of the same length before and after, since these tended to be Word’s typo corrections.  When I was done, I copied and pasted it from the Excel file back into the Word AutoCorr list.  Doing that involved creating a new table with enough rows to accommodate all of the Excel entries, highlighting all those empty cells, and pasting the Excel cells into the highlighted space.  There were some extra rows, which Word redundantly filled by starting over at the start of the table and continuing until all rows were filled; I had to delete those.

The next step was to get rid of the existing AutoCorrect entries in Word, so that the unwanted ones that I had deleted would really be gone.  I did this by creating a Word macro to remove them all.  I had no idea how to do this, but it was easy:  in Word 2003, I went into Tools > Macro > Macros > Create.  It had a space for my new macro, starting with Sub AddTBMenuItem() and continuing on to End Sub.  I pretty much replaced that with the following macro, posted in 2001:

Sub RemoveAllDefaultAutoCorrects()
Dim aCor As AutoCorrectEntry
If MsgBox("This is a very destructive macro. Be sure that you " & vbCr & _
"want to delete all the AutoCorrect entries. There is no " & vbCr & _
"for this action. Click OK to continue", vbCritical + vbOKCancel, "CAUTION") _
= vbOK Then
For Each aCor In Application.AutoCorrect.Entries
aCor.Delete
Next aCor
End If
End Sub

I closed that, went back into Tools > Macro > Macros, selected that new macro entry, and ran it.  I gathered from somewhere that Word would restore the old list if you didn’t replace it with at least one new AutoCorrect entry, so I created a dummy one, exited Word, and then came back in to see what it looked like.  Sure enough, there was only that one dummy entry.  So now I ran the macro to restore my new list, and that took care of getting Word’s AutoCorr list updated.

Now, how to do the same thing in OOo Writer?  Using the format shown in that Linuxtopia webpage, I went back to the Excel spreadsheet, added another column on the right side, and used text concatenation to add all the missing stuff – basically, everything other than “yuor” and “your” in that example.  The formula I used was this:
=”<block-list:block block-list:abbreviated-name=”&CHAR(34)&A4&CHAR(34)&”block-list:name=&CHAR(34)&B4&CHAR(34)&”/>”
CHAR(34) was the Excel command for a regular (double) quotation mark.  I had to use CHAR(34) because the quotation mark means something different.  This formula said, take the name in cell A4 (e.g., “yuor”) and replace it with the value in cell B4 (e.g., “your”).  So I copied that formula all the way down the spreadsheet, in my column E (using column D to show the date when I did this, for future reference).  Then I copied all of those cells from column E into Notepad, made sure that Format > Word Wrap was turned off, and saved that as AC.TXT.  Back in Ubuntu, I opened AC.TXT in gedit.  Testing confirmed that OOo Writer was going to have a hard time with items that gedit displayed in funky format, like these:




Most of the items that caused problems that way were due to the use of smart apostrophes (i.e., single quotes) in Word.  So back in Windows, in Notepad, I opened AC.TXT, found an example of a smart apostrophe, highlighted and copied it into the Find & Replace box, and replaced it with a simple apostrophe.  I made a note in the spreadsheet, next to these items, to indicate that they were not compatible with Writer.  Word's em dash () was also problematic as an import into Writer, so I had to replace it, in this imported list, with two hyphens (--).

With these changes made, back in Ubuntu, I was ready to paste the revised AC.TXT into DocumentList.xml.  I closed Writer, did the paste, started Writer, and tried it out.  It didn’t work.  I looked at the AutoCorr list.  It had imported only a few items.  It looked like it had stopped at an item containing an ampersand (&).  I deleted that item from DocumentList.xml and tried again.  Now its AutoCorr list was longer, but still only a fraction.  Sure enough, it had stopped at another ampersand.  I went back to the spreadsheet and deleted or changed all items containing an ampersand, and marked them on the spreadsheet for incompatibility as well.  Trying again:  still no cigar.  This time, it seems there was an item in my list that was already in quotation marks.  So I was trying to import something like ““This”” and Writer wasn’t buying it.  I fixed that and tried again.  This time for sure.  It worked.  I had the entire list, and I played with it.  It looked like they were all going to work.  I doctored up the list by putting a copy of Writer’s special character for the em dash into a document (Insert > Special Character > Box Drawing) and then copying it into the places in the Tools > AutoCorrect list where I had had to import double hyphens (--) instead.  At some point, I would probably do the same with the smart apostrophes, ampersands, and other items if I decided to use Writer frequently.

So it worked.  I could now use my list of abbreviations in OOo Writer instead of having to use Word.