
Thursday, March 8, 2012

File Naming Conventions

I had a bunch of files. I was looking at ways to sort them. It seemed that it might help if they were all named according to file naming conventions, so that files of a certain type would be named in certain standard ways.

It was not immediately clear if there was any scholarly consensus as to what approach would be best. Along with general advice, there seemed to be at least two fundamentally different philosophies. On one hand, sources like the National Technology Assistance Project (NTAP) recommended using a folder hierarchy and relatively plain-English filenames. For instance, a file named "2009-05-15 Letter to Smith, R P" might be found in the 2009 > Correspondence subfolder; or in a different scheme, it might be in the Completed Projects > Waterway subfolder. On the other hand, Vincent Santaguida recommended putting the information into the filename itself and avoiding folder hierarchies. (I found that document and others on an Exadox webpage.) Santaguida's first example said this:

Do: Z:\Prod\QA\AssL7_WO_Suzuki_L3688_20090725.xls

Don't: Z:\Production \Quality Control\Assembly Line7\Work Orders\Clients\Suzuki Motors\ LOT3688_July‐25‐2009.xls
Depending on who was using the files and how much they knew about the variety of filenames in the archive, it seemed that Santaguida's approach might benefit from a formal, elaborate naming scheme -- with, for instance, a reference work where users could look up "Quality Control" or "Assembly Line 7" (and other variations) and find the proper rules for naming documents related to those topics, and a list or guidance system leading to relevant documents already filed. I could see where such a system might be valuable in some settings. I had a couple of concerns about it, though. One was, what happens if you lose the reference list, or if the specialized database management system creating such filenames goes on the fritz?

It seemed that, for most purposes, PC Magazine had the better idea: make your filenames indicative of what the file contained, in terms that potential users could understand -- and, I would add, within a scheme that would not require more maintenance than users or database managers would devote to it. For instance, aside from special projects like this one, I was not generally going to invest the time to create a highly precise file naming arrangement. It did seem that Santaguida's approach could help reduce file duplication, but I felt that DoubleKiller gave me an adequate solution for that. The other thing was that I didn't actually know how life was going to turn out yet. File arrangements grew up on the fly, as new situations emerged. I wasn't positioned to put it all into a rigid structure.

In other words, while adopting some principles recommended in the Best Practice for File-Naming produced by the Government Records Branch of North Carolina, I was concerned that "Records will be moved from their original location" -- that is, I might have to re-sort things that I had already sorted once -- but I didn't see an easy way around that. Building their location into the filename would have been a bad idea because, in many cases, I *wanted* to be able to re-sort things at will.

Within the individual filename, Santaguida's second principle seemed right: "Put sufficient elements in the structure [particularly in the filename] for easy retrieval and identification but do not overdo it." I had been working toward a couple of basic formats:
2009-05-04 13.59 Message from X to Y re Z.pdf
Shakespeare--Romeo and Juliet.pdf
Shakespeare--Romeo and Juliet - with notes.pdf
Garfunkel--Bridge Over Troubled Water.mp3
The present project, I decided, was one in which I could mostly tend toward the first example: Year-Mo-Da Hr.Mn DocType from Sender to Recipient re Subject.ext. I would use periods and hyphens (-) for some limited purposes, but would tend not to use other punctuation. This tended to agree with Santaguida's third rule: Do not use characters such as ! # $ % & ' @ ^ ` ~ + , . ; =)]([. Santaguida said don't use spaces, but I had rejected that in opting for plain-English filenames. He also said to use the underscore (_) to separate filename elements, but that was unnecessary in my approach. It also had the potential to confuse things. I noticed that some naming and conversion programs used the underscore in place of the space, giving me "File_name.exe" instead of "File name.exe." In Santaguida's approach, that would falsely suggest that "File" and "name" were two different elements. I planned to scout out and remove underscores. Minimizing punctuation also seemed wise given that Windows reserves various punctuation characters for special uses.

I also had to think about Santaguida's seventh principle. He recommended putting surname first, first name second: "Smith, Roger" rather than "Roger Smith." Actually, in his approach, it was "Smith-Roger." It seemed to me that there were some reasons not to do it that way, at least in my system. One was that I would have to sweep filenames (at least new filenames) for consistency occasionally at any rate, to catch and rename those newly created documents where some variation appeared (e.g., "Smith, R.P.," "Smith, R P," "R Smith"). There didn't seem to be any difference between one approach and the other in that sense. What seemed more practical was to use whatever name I actually tended to use for someone, so that I would be most likely to name it correctly the first time, when I was thinking about something other than my file naming scheme. Typically, this would be along the lines of "Roger Smith" -- which would also have the advantage of eliminating extra hyphens and commas.

Once I had such ideas in mind, I went through the list of files, using Excel to generate batch commands to rename many files at once. Where the files had one of the structures shown above, I was able to use FIND and other text functions to segregate certain elements (e.g., Author, Date) into separate columns, and then use unique filters and other tools to eliminate variations (e.g., in personal names).
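
To give the flavor of that Excel-to-batch step, here is a minimal sketch, assuming the original filename sits in column A and the rebuilt name in column B (the cell layout is hypothetical). A formula along these lines, copied down a helper column and then pasted into a .bat file, generates one rename command per row:

="ren """&A2&""" """&B2&""""

For A2 containing 0045.pdf and B2 containing 2009-05-15 Letter to Smith, R P.pdf, that yields:

ren "0045.pdf" "2009-05-15 Letter to Smith, R P.pdf"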

Thursday, March 1, 2012

Excel 2003: An Attempt to Sort Sentences or Filenames by Identifying Multiword Phrases

I had a bunch of filenames.  I wanted to know which two-word phrases appeared most frequently.  The purpose of this effort was to put those filenames into categories.  I hoped that the approach of finding significant two-word phrases in those filenames would help me automate the file-sorting process.  This post describes the first approach I took to identify those two-word phrases.

Parsing the Filenames

I hoped that two-word phrases would be a good size.  I figured they would sort into many essentially one-word phrases (e.g., "a dog" would be trimmed to "dog," which would not have much significance) and a relatively small number of more significant two-word phrases (e.g., "big dog").  I hoped, that is, that I could easily filter the list for common words (a, an, the, if, of ...) and then count what was left.  It seemed, though, that I could still have taken approximately the approach described here if I had been looking for three-word or longer phrases.

The approach would have been similar if I had been examining sentences instead of filenames.  I would have gotten the sentences into individual rows by using Microsoft Word and replacing sentence ends (e.g., ? ! .) with the same punctuation followed by ^p characters, which Word would interpret as the start of a new line, and then pasting the text into Excel.  It might have taken a while to resolve sentence ends, given problems with e.g., "Mr.".  I did not have more than 65,000 filenames, so Excel 2003 would handle it.  Otherwise, I might have needed a newer spreadsheet (e.g., the Calc programs in LibreOffice or OpenOffice).

Consider this sentence or filename:  "I saw a dog - a big dog."  Using spaces as delimiters (as I intended to do), its two-word phrases would be "I saw," "saw a," "a dog," "dog -," "- a," "a big," and "big dog."  This approach would produce some junk, but I hoped it would be easily filtered.  Not worrying about the punctuation seemed faster than trying to anticipate endless punctuation combinations (e.g., "dog -- a").  It would fail in some instances (e.g., "dog--a"), but those too seemed likely to involve insignificant phrases.

That sentence contains seven spaces.  I could count the number of spaces manually.  With many sentences, it would make more sense to use a formula.  One approach would be to use Excel's SUBSTITUTE and LEN functions to calculate the difference in length between the sentence with and without spaces, and then use MAX to see which row has the largest number of spaces.  TRIM would remove excess spaces.
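
As a sketch of those formulas (assuming the sentence or filename sits in A3 and the count goes in B3; the layout is hypothetical):

B3  =LEN(TRIM(A3))-LEN(SUBSTITUTE(TRIM(A3)," ",""))

TRIM collapses any runs of spaces first, so the difference in length equals the number of single spaces -- that is, the number of two-word phrases.  A spare cell containing =MAX(B:B) then reports the largest space count in the column, which says how many column groups to build.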

Since each two-word pair would contain one space, the sample sentence would contain seven pairs of words.  Therefore, my spreadsheet needed enough columns to accommodate seven two-word phrases.  The approach I took was to set up the spreadsheet like this:

[spreadsheet screenshot not reproduced here; it showed two header rows and a formula row for the first three phrase groups]

The top row indicated the number of the two-word phrase.  The first one was "I saw," in group 1.  The second row indicated the number of the space found.  For instance, the first space was at character position 2 (i.e., right after "I"), and the second space appeared at position 6 (after "saw").  This image shows only the first three groups; I would need four more.  As it shows, the first group required calculation of only two numbers because its starting point (character position 1) was already known.  I preferred this approach of breaking out the calculations visibly, rather than jamming them all together in one long, complicated formula.  It was usually faster this way, and it made it easier to spot errors.

I used formulas in the header rows to facilitate copying.  For instance, the formula in cell F2 was =F1, and in F1 it was =C1.  That way, once I worked out my formulas in row 3, I could select and copy cells F1 through I3 repeatedly across the spreadsheet, until I had enough column groups to accommodate all of the spaces in my longest filename or sentence.

To prepare the following list of formulas for the cells on row 3, I used a macro.  Those formulas were as follows:

C3  =FIND(" ",$A3)                                                       [position of the 1st space]
D3  =FIND(" ",$A3,C3+1)                                                  [position of the 2nd space]
E3  =TRIM(LEFT(A3,D3))                                                   [phrase 1: words 1 and 2]
F3  =IF(D3=LEN($A3),"",C3)                                               [start anchor for group 2 (the 1st space), or blank if the text is exhausted]
G3  =IF(F3="","",FIND(" ",$A3,F3+1))                                     [next space after that anchor]
H3  =IF(G3="","",IF(ISERROR(FIND(" ",$A3,G3+1)),LEN($A3),FIND(" ",$A3,G3+1)))   [the space after that, or end of text if there is none]
I3  =IF(H3="","",TRIM(MID($A3,F3,H3-F3+1)))                              [phrase 2: words 2 and 3]
J3  =IF(H3=LEN($A3),"",G3)                                               [start anchor for group 3]
K3  =IF(J3="","",FIND(" ",$A3,J3+1))                                     [next space after that anchor]
L3  =IF(K3="","",IF(ISERROR(FIND(" ",$A3,K3+1)),LEN($A3),FIND(" ",$A3,K3+1)))   [the space after that, or end of text if there is none]
M3  =IF(L3="","",TRIM(MID($A3,J3,L3-J3+1)))                              [phrase 3: words 3 and 4]

So anyway, now I had a spreadsheet containing the two-word phrases that I wanted to sort and examine.  I made a backup and proceeded to convert the spreadsheet's formulas to values.  With the formulas frozen as values, I could sort and delete rows and columns without changing the results shown.  There appeared to be functions for this purpose in ASAP Utilities and possibly elsewhere.  An easy way to do this was to go to the top left-hand corner of the spreadsheet, hit Ctrl-A to select it all, and then hit Edit > Copy, Edit > Paste Special > Values > Enter > Enter.  I had usually found that the same keystrokes worked in later versions of Excel.  In this case, the keystrokes were Alt-E, C, Alt-E, S, V, Enter, Enter.

However achieved, the point was that cell C3 no longer contained the formula =FIND(" ",$A3).  Now it just contained the number 2, and it would continue to contain that number even if I deleted cell A3.  With that done, I could now delete the columns containing the numbers that I had used to detect my two-word phrases (e.g., columns C, D, F, G, H ...).  I sped up that process by doing a column sort (select all (Ctrl-A), then Data > Sort > Options > Sort Left to Right, sorting on Row 2).  I put numbers at the tops of the Phrase columns, so as to keep their order straight.

Lining Up the Phrases

At this point, the cells toward the right side of my spreadsheet looked like this:

[screenshot not reproduced here; it showed the ragged right edge of the data]

That is, the rows did not all end at the same place.  In this illustration, the filename on row 2 had 17 two-word phrases, while the one on row 3 had only 16.  I wanted them all in a single column.  I wasn't sure if I needed them in proper order (in this example, P2, Q2, R2, followed by P3, Q3, R3), but I preferred that if possible.

I had tackled a similar problem previously.  That post still provided more details than I have written here.  But in some regards, I now saw a way to simplify that process a bit.  I deleted row 1, containing the column headings, as well as column 1, containing the original filenames.  They seemed unnecessary and potentially confusing at this point.  Then I created another worksheet (Insert > Worksheet).  Its default name was Sheet2.  In cell A1 in that worksheet, I entered a formula that referred to the corresponding cell in Sheet1, where my data were located.  The formula that I put into cell A1 in Sheet2 was =IF(LEN(Sheet1!A1)>0,CELL("address",Sheet1!A1),"").

That formula produced this output in Sheet2, cell A1:  [xA8.xls]Sheet1!$A$1.  (xA8.xls was the name of my spreadsheet file.)  I was going to use the preliminary information, the part that said [xA8.xls]Sheet1!, to tell the computer where a new entry began.  In other words, if the spreadsheet had given me just $A$1, I would have changed my formula so that Sheet2, cell A1 would contain something like this:  #$A$1.  Any odd character would have been sufficient in place of # as long as it gave the computer something unique to look for.

Next, I copied that formula from Sheet2, cell A1 to each cell that corresponded to my data in Sheet1.  In other words, in Sheet1 I hit Ctrl-End to discover that the bottom right corner of my spreadsheet was at cell Q14998.  So now, in Sheet2, I copied that formula from cell A1 to all cells down to Q14998.  A fast way to do that was to hit F5 and type Q14998 and hit Enter.  That took me to Q14998.  I put an X there.  Then I went back to A1 and hit Ctrl-C > Ctrl-Shift-End > Enter.  Boom!  Done.  I went back to Q14998 to double-check.  Then Alt-O, C, A to autofit the column widths.

Now I had a bunch of cells that had ugly stuff like cell A1 (above):  [xA8.xls]Sheet1!$A$1.  I hit Ctrl-A > Ctrl-C to copy it all.  I opened Notepad and hit Ctrl-V to paste it.  (If there had been too much data for Notepad, I think something like Notepad++ would have worked.)  The purpose of putting it into Notepad was to remove the column formatting.  (There were delays with some of these operations, as the Windows clipboard and the various programs wrapped their heads around the size of the task.)

Now I copied it all from Notepad to Microsoft Word.  Any editor would have worked, as long as it had a way of doing global search-and-replace operations that would produce line breaks.  That is, in Word, I replaced all instances of the ugly stuff (i.e., [xA8.xls]Sheet1!) with a newline character (^p).  Word reported that it had made about 170,000 replacements.  This gave me the desired list, starting like this:

$A$1
$B$1  
$C$1  
$D$1  

There were invisible tabs after the entries (e.g., it was actually $A$1 tab), and there were also some blank lines.  I cleaned these up with various ^p^p and ^t^p replace operations, repeated until there were no more.  The file was long -- about 3,700 pages.

I went back to Excel and deleted Sheet2:  I didn't need it anymore, and all its calculations would slow things down.  I created a new Sheet2 and pasted the results from Word into it.  I wasn't sure if Word had inserted any other funky invisible characters, so I ran the results through Notepad on the way back in, just to be sure.

But now, oops, new problem.  My old Excel 2003 couldn't accommodate 170,000 rows.  It gave me an error:  "File not loaded completely."  I took a look.  Row 65536 contained a reference to $F$5911.  I could have transferred the whole project over to LibreOffice, which could handle 170,000 rows, but that seemed premature.  I had kept Notepad open, so now I searched for that item, $F$5911, in the data still sitting there in Notepad.  I deleted everything down to that point, copied the next batch, and pasted it into an adjacent column in Excel.  I had to do this a couple of times.  So now I had a spreadsheet that was three columns wide and 65,536 rows long.  Warning: do not print your results.

I inserted blank columns between these columns of output.  Those blank columns would contain the text corresponding to those cell references.  For instance, in Sheet2 cell A1, I now had $A$1.  So in the blank cell B1, I put this formula:  =INDIRECT("Sheet1!"&A1).  I copied that formula to all the other cells in those adjacent columns.  That gave me three sets of two columns each:  the first provided the reference, and the second contained my actual two-word phrase.

At this point, I froze all those values, as above, and deleted the reference columns.  So now I just had three columns containing two-word phrases.  In short, I had managed to isolate the phrases.  Now it was time to start slicing and dicing them.

Working with Unique Phrases

I wanted to filter those two-word phrases for unique ones.  For example, if there were a thousand two-word phrases that said "Email from," I wanted just one entry of that nature, with a numerical companion cell containing the number 1,000.
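
The counting formula itself was simple.  As a sketch (assuming the full set of phrases sits in column A and the filtered unique list in column D; the layout is hypothetical), this formula, copied down alongside the unique list, gives each phrase its count:

=COUNTIF(A:A,D2)

In Excel 2003, one way to produce the unique list in the first place was Data > Filter > Advanced Filter, with "Unique records only" checked.  (Phrases containing wildcard characters like * or ? would need extra care with COUNTIF.)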

To do that filtering, I tried to use LibreOffice Calc, where I could get all that material into column A.  But Calc hung for about 15 hours before producing output.  (Excel also hung at a number of points in this whole process.)  I did not test Calc's output rigorously, but it appeared to have given me some repetitive items in that supposedly nonrepetitive list.  I went back to Excel, filtering each of the three columns separately, then combining the shorter outputs into one column and filtering that.  This gave me 7,539 unique two-word phrases.  That list passed the test for duplicates, but counts indicated that it was not accounting for all 169,891 two-word phrases.  The formulas doing those counts really slowed things down, so I deleted them for the moment.

I thought that it might help to clean up unnecessary characters.  I sorted each of the three columns according to starting and ending characters, so as to detect extraneous ones (e.g., various kinds of brackets, quotation marks, and other symbols).  I also did some global (Ctrl-H) searches in Excel to the same effect.  In addition, I deleted a few meaningless two-word pairs.  These efforts made it clear that many two-word phrases would consist, at least in part, of single letters (e.g., "A dog," "Roger P").  It seemed that I probably should have done global replace operations, even before breaking out the two-word phrases, so as to promote meaningful two-word combinations (e.g., "Roger P" and "P Smith" might be less recognizable than "Roger Smith").  Possibly a preliminary search for three- and even four-word proper nouns would have been a good approach there.
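
The sorting by starting and ending characters amounted to two helper columns (a sketch; the cell references are hypothetical):

=LEFT(TRIM(A2),1)
=RIGHT(TRIM(A2),1)

Sorting on each helper column in turn clustered the entries that began or ended with brackets, quotation marks, and other stray symbols, making them easy to spot.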

I also decided, along the way, that it would make sense to tolerate meaningful one-word terms (e.g., the "Pentium" word that would remain after removing the unhelpful "a" from "a Pentium").  That predictable conclusion raised questions about my sequence of steps generally.

After those cleanup efforts, I filtered again for unique two-word phrases and counted their occurrences.  Working with three source columns required some extra steps, mostly to eliminate duplicates, but ultimately my count of items found from searches for unique items was within 1% of the actual number of such items.  In other words, my filtering for unique items substantially accounted for all items.  Specifically, I had 163,391 two-word phrases, and in some apparent overcounting, my list of unique phrases accounted for 163,772 such phrases.

Because those counts did not jibe perfectly, I decided to take an elimination approach -- that is, to remove the two-word phrases that lacked significance, rather than selecting those that had it.  To help me in this process, I divided the two-word phrases into individual words, and then prepared to test each of them against a list of relatively insignificant words that I had cooked up for the occasion.  Before doing that, I had to screen again for stray punctuation (now that, e.g., a comma had emerged in "soon, however," such that "soon," would not be caught by an elimination of "soon") -- making it increasingly appealing to find some way of handling punctuation earlier in the process.  Another step better taken earlier, to the extent possible, would have been to convert date references (e.g., "2-13-09 project" vs. "Feb. 13 project") into a standard format, or else to decide that none were significant.  It seemed, generally, that renaming the files to standardize various names and terms (or at least devising aliases for them) should have occurred earlier in the process.  Such changes seemed likely to shape this sentence-parsing process.  For instance, if full names were to be used (e.g., "Roger P. Smith"), then a two- rather than three-word focus could be wrongheaded.

Project Abortion

At this point, I had an unpleasant realization.  The need to pare down my spreadsheet, so as not to have to wait forever for calculations, and the loss of concentration due to long waits and the passage of time, had produced a situation where I had not realized that, in the past several hours, my work had ceased to be directly linkable back to the original list of files.  There was no index.  Worse, there had been global changes to remove troublesome punctuation, further reducing direct linkability.  I could continue with this effort, using backups to reconstruct the links.  But there seemed to be good reasons to reconceive the whole approach.

One emerging conclusion was that there just might not be a shortcut to a good rubric or other structured approach to the renaming of files, the coding of sentences, or the constructing of aliases for bits of text being examined in this kind of effort.  One question in that regard was whether there would be a good way to identify and change troublesome characters or combinations of characters up front.  That question led me to a second try.

Saturday, April 23, 2011

Windows 7: Archiving Emails (with Attachments) from Thunderbird to Individual PDFs - First Try

I had been collecting email messages in Thunderbird for a long time.  I wanted to export them to PDF-format files, one message per file, with filenames in this format:

2011-03-20 14.23 Email to John Doe re Tomorrow.pdf
reflecting an email I sent to John Doe at 2:23 PM on March 20, 2011 with a subject line of "Tomorrow."  This type of filename would sort correctly in Windows Explorer, chronologically speaking; hence, I could quickly see the order of messages.  There were a lot of messages, so I didn't want this to be a manual process.  This post describes the steps I took to make it semi-automated.

The Screen Capture Approach

The first thing I did was to combine all of the email messages that I wanted to export into one folder in Thunderbird.  Then I deleted duplicates from that folder.  Next, I decided that, actually, I was just interested in exporting messages prior to the current year, since recent messages might have information I would want to search for in Thunderbird.  So I moved the older messages into a separate folder.  I maximized the view of that folder in T-bird and shut down unnecessary layout features (i.e., message pane, status bar), so that I could see as many emails as possible on the screen, and as much as possible of the relevant data (i.e., date, time, sender, recipient, subject) for each email.  I did that because I wanted to capture the text information about the individual email messages.  The concept here was that I would do a screenshot for each screenful of emails, and would save the data from that screenshot into a text file that I could then massage to produce the desired filenames.  For this purpose, I tried a couple of searches; I downloaded and ran JOCR; but after a bit of screwing around I decided to revert to the Aqua Deskperience screen text capture shareware that I had purchased years earlier.

Index.csv

Then it occurred to me that perhaps I could just export all that information from T-bird at once.  I ran a search and downloaded and installed the ImportExportTools add-on for T-bird.  (Alternatives to ImportExportTools included IMAPSize and mbx2eml, the latter explained by About.com.)  It took Thunderbird a while to shut down and restart after the installation.  I assumed it was getting acquainted with the fact that I had relocated so many messages to a new folder.  When it did restart, I ran the add-on (Tools > ImportExportTools > Export all messages in the folder > Just index (CSV)).  I opened the CSV file (named index.csv) in Excel and saw that this was perfect:  I had a list of exactly the fields mentioned above (date, time, etc.).  I verified that Excel was showing a number of rows equal to the number of messages reported on the status bar back in Thunderbird.

I noticed that some of the data in the Excel file included characters (i.e., \ / : * ? " < > | ) that were not permitted in Windows filenames.  The Mbx2eml option (above) would have removed these characters automatically, but for this first time I wanted to do everything manually, so as to see how it was all working.  I thought this might also be better for purposes of making things the way I wanted them.  I was also not sure that Mbx2eml would produce a CSV file, or that it would output the emails in the same order.  There seemed to be some other potential limitations.  It looked like a worthy alternative, but I didn't explore it.

Somewhere around this point, I went ahead prematurely with a time-consuming effort to revise the entries in the spreadsheet, so as to remove the unacceptable characters and otherwise make them look the way I wanted.  Eventually, I realized that this was a mistake, because now I would have a hard time matching spreadsheet entries automatically with the actual emails that I would be exporting from Thunderbird.  So I dropped that attempt and went back to the point of trying to plan in advance for how this was all going to work.

Attachments

I had assumed that I wanted to export emails to individual .eml files because EML format would bring along any attachments that happened to be included with a particular email message.  But I didn't plan to just leave the individual emails in EML format; I wanted to save them all as PDFs.  In other words, I wanted to have the email and its attachment within a single PDF.

A quick test showed me that printing EMLs would be no more successful at including the attachments than if I just printed directly from Thunderbird, without all this time-consuming exporting and renaming.  There were other solutions that would have worked for that purpose as well.  A search led me to InboxRULES, which for $40 would do something or other with attachments in Outlook.  (Alternate:  Automatic Print Email for $69.)  There didn't seem to be a solution for Thunderbird, and I wasn't keen on spending $40 and having to install Outlook and move all these emails there in order to print their attachments.  I thought about handling the attachments manually -- print the email first, then print the attachment, and append it to the email -- but a quick sort in Thunderbird told me there were hundreds of messages with attachments.  Funny thing about that, though:  as I arrow-keyed down through them in Thunderbird, allowing them to become visible one at a time, I saw that Thunderbird would change its mind with many of them:  it thought they had attachments, but then it realized they didn't.  That trimmed out maybe 5% of the ones that had originally been marked as having attachments.  But there were still a lot of them.

Another search led to some T-bird options, but it still looked like there was going to be a lot of manual effort before I'd have a single PDF containing both the email and its attachment.  Total Thunderbird Converter looked like it might come close, at a hefty price ($50).  It wasn't reviewed on CNET.com or anywhere else, as far as I could tell, so there was a risk that (as I'd experienced in other cases) the program simply wouldn't work properly.  But then I saw that they were offering a 30-day free trial, so I downloaded and installed it.  It turned out to be useless for my purposes:  it had almost no options, and therefore could not find my Thunderbird folders, which I was saving on drive D rather than C so as to avoid losing them in the event of a Windows update or reinstallation.  I looked at Email Open View Pro (or its alternate, emlOpenView Free), which also offered a free trial.  It didn't look like it (or Universal Converter, or MSG Viewer Pro, or E-mail Examiner, or Convert EML to PDF) would bring the attachments into the same PDF as the email, so I moved on.  I tried Birdie EML to PDF Converter.  Their free demo version allowed me to convert one EML file at a time.  I liked its interface:  it gave me eight different naming options for the resulting file (e.g., "date + subject + from," in several different date formats).  I didn't like the output, though:  they formatted the PDF for the EML file oddly, with added colors that I didn't want, and all they did with the attachment was to put it into a subfolder bearing the same name as the resulting PDF.  I'd still have to PDF it -- the example I used was an EML with a .DOC file attachment -- and merge it with the PDF of the EML.  But now they had led me to see that perhaps I could at least automate the extraction of attachments, getting me partway to the goal.

At about this point, Thunderbird inconveniently lost a folder containing several thousand email messages.  It just vanished.  The program had been struggling for a few minutes before that, and part of me was instinctively thinking that I should shut down the program and do a backup, but this was a very deeply subconscious part of me, basically unresponsive under real-life conditions.  In other words, I didn't.  So now I had to go rooting around in backups to see what I could rescue from the wreckage.  I found that Backup Maker had been happily making backups, as God intended.  Amazing what can happen when you have about five different backup systems running; in this case I had just wiped a drive, moved a drive, etc., and so of course Backup Maker was the *only* backup system that was actually in a position to restore real data when I seriously needed it.  What Backup Maker had saved was some files with an .MSF extension.  These were supposedly backups of Thunderbird.  But then, no, on closer inspection I realized these were much too small, so I had to do some more digging.  Eventually I did patch together something resembling the way things had been before the crash, so I could go back and pick up where I had left off.  A couple of days passed here, what with other interruptions, so the following information just reports where I went from this point forward.

I had the option of just saving the Thunderbird file, or the exported emails, for some future date when there would perhaps be improved software for printing attachments to PDF in a single operation with the emails to which they were attached.  There had been times when software developments had saved (or would have saved) a great amount of time in what would have been (or actually was) a tedious manual effort.  On the other hand, I had also seen situations where letting something sit meant letting it become confused or corrupted, or where previous learning (especially on my part) had been lost.  I decided to go ahead with converting the emails to PDF to the extent possible without a tremendous time investment.

My searching led to Attachment Extractor, a Thunderbird extension.  I installed it, highlighted two emails with attachments, right-clicked on them, and selected "Extract to Suggested File-Folder."  It worked -- it did extract the attachments without removing them from the emails.  I assumed it would do this with hundreds of emails if I so ordered.  Then, to get them matched up with PDFs of the emails to which they were attached, I would apparently have to page down through those emails one by one, looking at what attachments they had, and setting them up for more or less manual combination.  Attachment Extractor did have one potentially useful feature for this purpose:  a right-click option to "Extract with a Custom Filename Pattern."  I found that I could configure the names given to the extracted attachments, so that they would correspond at least roughly with the names of emails printed to PDF.  To configure the naming in Attachment Extractor, I went into Thunderbird > Tools > Add-ons > Extensions Tab > AttachmentExtractor > Options > Filenames tab.  There, I used this pattern:
#date# - Email from #fromemail# re #subject# - #namepart# #count##extpart#
and, per instructions, in the Edit Date Pattern option I used this pattern:
Y-m-d H.i
That gave me extracted attachments with names that were at least roughly similar to the format I wanted (see above).

Batch Exporting Emails with Helpful Filenames

Now if I could print the corresponding email to PDF with a fairly similar name, the manual matching might not be so difficult.  A search led to inquiries about renaming PDF print output.  For $60, I could get Zan Image Printer, which sounded like it would have some capability for automating PDF printout filenames.  Print Helper, for $125 to $175, was another option.  A Gizmo's Freeware article did not say much about this kind of feature, though several people asked about it.  A list of free PDF printers led me to think that UltraPDF Printer was free and would do this; its actual price was $30. 

The pdf995 Experiment

At this point, I was reminded of how much time I could waste on uncooperative software.  No doubt many people have used pdf995 successfully.  I was not among them.

I tried Pdf995's Free Converter.  The instructions on bypassing the Save As dialog were in the Pdf995 Developer's FAQs page.  They seemed to require me to open C:\Program Files\PDF995\res\pdf995.ini in Notepad.  But that .ini file seemed to be configured for printing one specific file that I had just printed.  They didn't say how to adjust it.  Eventually I figured out that I needed to download and install pdfEdit995, and make the changes there.  So I tried that.  But I got an error message:
PdfEdit995 requires that Pdf995 v9.1 or later and the free converter v1.2 or later are already installed on your system.
But they were!  I had just installed them.  Was I supposed to reboot first?  No, a reboot didn't fix it.  I tried again to install basic pdf995 and the Free Converter, which I had downloaded together.  Once again, I got the webpage saying I had installed it successfully.  Was I supposed to install the printer driver too?  I understood the webpage to be saying that was included in the 9MB download.  But I tried that.  Got the congratulatory webpage, so apparently it installed correctly.  Now I noticed I had another error, which had not come up on top, so I was not sure how long it had been there:
Setup has detected you have an old version of pdfEdit995 that is incompatible with the latest version of pdf995.
But I had just downloaded it from their website!  Not an altogether auspicious beginning here.  But I downloaded and installed the very latest and tried again, and now it seemed to work, or at least I got a different congratulatory webpage than before.  A cursory read-through still did not give me a clear approach to automated naming of PDFs.  Instead, they said that maybe I wanted their OmniFormat program for batch PDF creation.  Who knew?  I downloaded and installed OmniFormat.  Got a new congratulatory webpage, but still no straightforward explanation of batch naming.  Instead, it said that pdfEdit995 was what I wanted to create batch print jobs.  So, OK. a bridge too far.  Though at this point they specified "batch print jobs from Microsoft Office applications," raising the prospect that this wasn't going to work from Thunderbird.  Went back to their incredibly tiny-print pdfEdit instructions page.  It said I would have to set pdf995 as the default printer to do the batch thing.  That was OK.  But it still sounded like it was intended primarily for batch printing from Microsoft Word.  I decided to just try making pdf995 the default printer.  That required me to go to the Windows Start button > Settings > Printers > right-click on PDF995 > set as default printer.  While I was there, I right-clicked on PDF995 and looked at its Properties, but there didn't seem to be anything to set for purposes of automating printing.  Now I went to Thunderbird, selected several messages, and did a right-click > Print.  Funny, it defaulted to Bullzip, which was my usual default printer.  Checked again:  yeah, pdf995 was definitely set as my default printer.  Tried again, and this time manually set it to pdf995 when it was printing.  It asked for the filename, so that wasn't working.  Back in Printers, I looked at the Properties for Bullzip, but it didn't seem to have any options for automatic naming either.  It seemed pdf995 was not the solution for me.  I came away from this little exploration with less time and ambition for the larger project.  Certainly I wasn't in the mood to buy software and then discover that I couldn't make it work.

Further Exploration

I ran across an Addictive Tips article that said PrintConductor was a good batch printing option, though I might need to have Adobe Acrobat installed first.  I did, so I took a look.  There was an option to download Universal Document Converter (UDC) as well.  I wasn't sure, but I thought I might need that for Print Conductor, so I installed both.  PrintConductor didn't seem to have a way of printing EML files.  Meanwhile, UDC's installer gave me the option of making it the default printer, so I tried that.  But as before, Thunderbird defaulted to Bullzip, so I had to select UDC as my printer manually.  (Print Conductor did not appear in the list of available printers.)  When I selected UDC as the printer, before printing, I clicked on the print dialog's Properties button and went into the Output Location tab.  There, I chose the "predefined location and filename option."  I left the default filename alone and printed.  And it worked.  Sure enough, I had a couple of PDFs whose names were the same as the Subject fields shown in Thunderbird for those printed emails.  So I would be able to match them with the attachments produced by Attachment Extractor (above).  All I had to do now was to pay $69 for a copy of UDC, so that each PDF would not have a big black "Demo Version" sticker on it.

Recap

So to review the situation at this point, I had a way of extracting email attachments with highly specific date, time, and subject filenames.  I also had a way of extracting emails themselves whose filenames would show date and subject, using ImportExportTools (above):  Tools > ImportExportTools > Export all messages in the folder > EML format.  Unfortunately, there could be a number of messages in a single day on the same subject.  Without the time data in the filename, I would have duplicates.  More to the point, it would be difficult to match emails and attachments automatically, and I didn't want to go through that matching process for large numbers of emails.  I would also prefer a result in which emails converted to PDFs would appear in the right order in Windows Explorer, and that would require the time data.  As I was writing this recap, several days after writing the foregoing material, I was not entirely sure that I had verified that the output filename in UDC would include the time data.  But if that proved to be the case on revisit, at this point one option would be to buy UDC (or perhaps one of the other programs just mentioned) for purposes of producing properly named emails.  Another would be to export the list of emails to Index.csv (above) and to hope that this list would match the order in which ImportExportTools would export individual emails.  There would still be the possibility that such a program would sometimes fail to do what it was supposedly doing, perhaps without me noticing until long after the data from which I had exported and renamed various files would be long gone.

The Interim Solution

I decided that, at this point, I could not justify the enormous time investment that would be required to complete this project -- in particular, to manually print to PDF each attachment to each email, to combine those PDFs, and to match and merge them with a PDF of the email message to which they had been attached.  This seemed like the kind of project that really had to await some further development in application software.  For all I knew, the kind of solution I was seeking already existed, and I was just waiting until the day when I would become aware of it.  It was not at all an urgent project -- I rarely consulted attachments for old emails, and almost never consulted them for prior years, where I was focusing my present attention.

I wanted to get those old emails out of Thunderbird.  I didn't like the idea of having all that data at the mercy of a relatively inaccessible program (i.e., I couldn't see those emails in Windows Explorer), and anyway I didn't want T-bird to be so cluttered.  It seemed that a good solution would be to focus on the emails themselves for now.  I would export them to EML format.  EMLs would continue to contain the attachments.  I would then zip the EMLs into a small number of files, each no more than a few gigabytes in size, perhaps on a year-by-year basis, and I would store them until further notice.  Before zipping, I would make sure the EMLs were named the way I wanted, and would print each of them to separate PDFs.  So then I would have at least the contents of the emails themselves in readily accessible format, and could go digging into the zip file if I needed an attachment.  If I did someday find a way to automate the task of combining the emails and their attachments into a single PDF, I would give those PDFs the same name as the current email-only PDFs, so that the more complete versions would simply overwrite the email-only versions in the folders where I would store them.

Export and PDF via Index.csv

I decided to try and see if the Index.csv approach would work for purposes of producing EMLs whose names contained all of the elements identified above (i.e., date, from, to, subject).  I had sorted the old emails in Thunderbird into separate folders by year.  I went to one of those folders in T-bird and sorted it in ascending date order.  Then I went into Tools > ImportExportTools > Export all messages in the folder > Just index (CSV).  This gave me what appeared to be a matching list of those messages, in that same order.  The number of lines in the CSV spreadsheet (viewed in Excel) matched the number of messages in that folder as stated in T-bird's status bar.

I wondered what would happen if I exported another Index.csv after sorting the emails in that T-bird folder in descending chronological order.  Good news:  the resulting index.csv produced in that experiment seemed to be reversed from the one I had produced in ascending order.  At least the first and last emails were in reversed positions.  So it did appear that index.csv matched the order that I saw in T-bird.

On that basis, I added an Index number column at the left end of the index.csv file I was working with, the one with emails sorted in ascending date order.  This index column just contained ascending numbers (1, 2, 3 ...), so that I could revert to the original sort order if needed.  I assumed that the list would continue to sort in proper date order, but I planned to revise the date field (presently in "7/4/1997 18.34" format) so that it could function for file sorting purposes (e.g., 1997-07-04 18.34).  I wasn't sure that the present and future date fields would always sort exactly the same.  I could have retained the existing date field, but I wasn't sure that it, itself, was reliable for sorting purposes:  would two messages arriving in the same minute always sort in the same order?
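
As a sketch of that conversion (assuming the exported date string, e.g., 7/4/1997 18.34, sits in B2, that the time is always five characters in hh.mm form, and that Excel is using US-style month/day settings; the cell reference is hypothetical):

=TEXT(DATEVALUE(LEFT(B2,FIND(" ",B2)-1)),"yyyy-mm-dd")&" "&MID(B2,FIND(" ",B2)+1,5)

That is:  take everything before the space, have Excel read it as a date, rewrite it as yyyy-mm-dd, and then append the exported time unchanged.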

Now I exported the emails themselves:  Tools > ImportExportTools > Export all messages in the folder > EML format.  As partially noted above, these were named in Date - Subject - Number format.  I now did a search to try to figure out what that number signified.  It wasn't clear, but it seemed to be just randomly generated.  Too bad.  It would have been better if they had included the time at the start of that random number, and had put it immediately after the date, so that the EMLs would sort in nearly true time order.  (There could still be multiple emails on the same subject within the same minute, and T-bird didn't seem to save time data down to the second or fraction of a second.)  It seemed I would have to manually sort files bearing the same subject line and arriving or being sent on the same day.  There would surely be large numbers of files like that.  I now realized they would not at all be sorted correctly in Windows Explorer:  with only date (not time) data in the filename, a file arriving in the morning with a subject of Zebras would be sorted, in Windows Explorer, after a file arriving in the afternoon on the subject of Aardvarks, and if there were three on the subject of Aardvarks they would all be sorted together even if they had arrived at several different times of day.

Ah, but now I discovered that ImportExportTools had file naming options.  Silly me.  I had just overlooked that.  But there they were:   Tools > ImportExportTools > Options > Filenames tab.  I selected "Add time to date" and I chose Date - Name (Sender) - Subject format.  Now I tried another export of EMLs.  The messages now had names like this:
19970730-0836-Microsoft-Welcome!
I checked and, sure enough, that was a message from Microsoft on that date at 8:36 AM.  Suddenly the remainder of my job got a lot easier.  I went back to the Index.csv spreadsheet (now renamed as an .xls) and worked toward perfecting its match with the filenames produced by ImportExportTools.  There were two parts to this mission.  First, I had to rework the Index.csv data exported from T-bird so that it would match the filenames given to the EMLs by ImportExportTools.  Second, I would then use the spreadsheet to produce a batch file that would rename those files to the format I wanted.  This called for some spreadsheet manipulation described in another post.

Converting EMLs to PDF

Now I faced the problem of converting the exported EMLs to PDF, as distinct from the problem (above) of exporting PDFs from Thunderbird. 

I found that EMLs could be converted into TXT files just by changing their extensions to .txt, which was easy enough to do en masse with a program like Bulk Rename Utility.  That would permit them to be converted to PDFs without the rich text, if necessary, since it was a lot easier to find a freeware program that would do that (or, in my case, to use Acrobat) than to find one that would PDF an EML.  This appeared to be a viable solution for some of the older emails, which had apparently been through the wringer and were not showing much sign of having glorified colors or other rich text or HTML features.
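
For a simple extension swap like that, a one-line command in a DOS box, run from the folder holding the exported files, would also have worked:

ren *.eml *.txt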

Before proceeding with this, I decided to export all of the remaining EMLs from Thunderbird.  I knew I could read the EMLs and the TXTs (if I renamed them as that); I also knew I could reimport them into T-bird.  This seemed like a separate step.  I also decided that going back through the exporting process would give me an opportunity to write a cleaner post that would summarize some of the foregoing information.

Thursday, April 21, 2011

Batch Renaming Text File Memos or Emails Using Their Contents to Create the Filenames

I was using Windows 7.  I had text files that contained old email messages, one message per file.  Each began with four or five lines:  FROM, TO, DATE, and SUBJECT, sometimes preceded by a blank line.  In that sense, they were very similar to memoranda.  At present, these emails had numerical names (e.g., 0045.txt) that said nothing about their contents.  I wanted to rename them to a Date - Time - From - To - Subject format, like this:

2010-04-13 13.45 Email from Joe to Nancy re Dinner.txt
In other words, their top several lines told me everything I needed to name them; I just didn't want to have to do it manually.  This post describes the steps I took to automate that naming process.

My knowledge of programming was more or less limited to DOS-style batch file scripting.  If I had been interested in entering single commands, I would have done that in a DOS box (a/k/a command window), which I could open in Windows 7 (customized) by going to Start > Run > cmd.  I wanted to run the relevant commands automatically, across a number of text files, so I would be saving those commands in a batch file, created in Notepad as a text file (ANSI format), saved as CommandFile.bat, and executed by double-clicking on it in Windows Explorer.  The question here was, what commands should go into that file?

Instead of creating a batch file, I could have used some other approach.  For instance, it seemed possible that IntelliGet or TextCrawler would do the job better, provided I were interested in spending the time to learn how to use them.  I did actually try with TextCrawler, but at this writing I had a question sitting in the TextCrawler forum for almost two days without an answer.  Same thing with AutoIT.  I wasn't sure how to proceed with those approaches, so at this point I was deciding to return to the batch file approach.  I had been using batch files on a very simple level for a long time, so I started there.

A search for batch command syntax information led me to the A-Z Index site, which I had used before.  I used that site to interpret what was happening in a two-line batch script that I found in another post:
@ECHO OFF
FOR /F "skip=2 delims=" %%i in (%1) DO >> lines.txt ECHO %%i & GOTO :eof
I've presented the DOS commands in capital letters to make them more visible.  ECHO would tell the computer what to make visible onscreen or in the output file (in this case, lines.txt).  The second line used a "FOR /F ... DO" loop to repeat the command.  (There were other FOR options.  For instance, FOR /D would operate on directories, not files.)  In plain English, the concept there was, "For each X, do Y."  The big idea was, apparently, that the "skip" part would tell it to skip two lines and then echo the third one to the lines.txt file, and then go to :eof (short for "end of file," i.e., end of this script).  In other words, this FOR-DO combination just said, skip a line, skip a line, output a line, then break out of the loop (i.e., quit).  So there would have to be another command that would call this script for each file in the directory.

But before getting to that, I wanted to understand how the "skip" part worked, because that's what I would need to modify for my purposes.  The syntax reference seemed to be saying that "skip=2" was a command to skip two lines, as I had just surmised.  Another source seemed to say that the "delims" part would specify what would count as the unit of analysis.  For instance, if there were a blank space between the equals sign (=) and the quotation mark, we would be looking for a blank space, not the end of a line, as a delimiter.  The loop would restart each time it found a blank space.  Apparently leaving no space between = and " meant that it would default to end-of-line as its delimiter.  I wasn't sure about that.  But what we seemed to be saying, so far, was that we were going to loop each time we encountered a line end, for a total of two loops.

Then, having arrived at the third line, we would proceed with the main attraction:  we would do something with "%%i in (%1)."  Specifically, we would export it, whatever it was, to lines.txt.  What we would be exporting would be %%i.  The percent signs, I knew, indicated a variable.  So we were going to export variable i, in %1, to lines.txt.  I guessed that (%1), being in parentheses, referred to a variable already supplied.  That is, the command that would call this script would indicate the name of a file to examine, and this %1 would represent that filename.  So, to recap:  having arrived at the third line of the file in question, let's export something (specifically, %%i) to lines.txt.  Since there wasn't any more information on what the %i variable (presented as %%i because it was occurring inside a batch file) was supposed to do, it seemed that it must just refer to the whole unit of analysis -- in this case, the whole line -- where we had ended up, having skipped two lines to get here.  In other words, this whole third line of the text file would be exported to lines.txt.

Now, how would all that work for me, where I wanted to output multiple lines?  I thought I might try to identify each of the lines in question with a separate variable, and then concatenate (i.e., combine) those variables in the output.  Basically, if variable L1 stands for line 1, and L2 for line 2, and so forth, then what I wanted was to output L1+L2+L3+L4 to lines.txt (though I preferred to call it output.txt).  (The double output signs, >>, meant "add this onto whatever already exists in the output file," as distinct from >, which would mean "replace whatever is in the output file with this new data.")  I wouldn't need a skip parameter -- I wanted to output all of the first four lines or, actually, the first five, since in some files the first line was a blank.
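
To sketch the direction this was heading -- a hypothetical script, not one I actually ran; it assumes the emails are .txt files in the current folder and writes to an output file in the parent folder so the loop does not process its own output:

@ECHO OFF
SETLOCAL ENABLEDELAYEDEXPANSION
REM For each text file, capture its first four non-blank lines
REM and append them, with the filename, to ..\output.txt.
FOR %%f IN (*.txt) DO (
  SET n=0
  SET "L1="
  SET "L2="
  SET "L3="
  SET "L4="
  REM FOR /F skips blank lines, which disposes of the optional leading blank.
  REM This reads the whole file but keeps only the first four non-blank lines.
  FOR /F "usebackq delims=" %%i IN ("%%f") DO (
    SET /A n+=1
    IF !n! EQU 1 SET "L1=%%i"
    IF !n! EQU 2 SET "L2=%%i"
    IF !n! EQU 3 SET "L3=%%i"
    IF !n! EQU 4 SET "L4=%%i"
  )
  >> ..\output.txt ECHO %%f ^| !L1! ^| !L2! ^| !L3! ^| !L4!
)

Header lines containing characters special to the shell (e.g., & or >) would still need testing, which is part of why the fallback described next appealed to me.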

At this point, it occurred to me that I might have another option.  I made a backup of the folder containing the emails.  Then, at the DOS command prompt, I typed COPY *.* COMBINED.TXT.  This put all of their contents into one large text file.  I opened it and copied and pasted its entire contents into an Excel spreadsheet.  Then I parsed its lines to produce the desired filenames.  This was much more time-consuming than a scripted solution would have been, and it was not what I set out to find, but it did get the job done.
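
The parsing itself was ordinary text-function work.  As a sketch (assuming the combined text landed in column A, and that the header lines looked like "FROM: Joe" -- the layout and labels here are hypothetical):

=IF(LEFT(A2,5)="FROM:",TRIM(MID(A2,6,255)),"")

with similar formulas testing for TO:, DATE:, and SUBJECT:, and a final concatenation assembling the pieces into the desired filename.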