
Sunday, May 8, 2011

Data Nightmare: Reconciling Two Hard Drives

I was using Windows 7.  I had an old hard drive with files on it.  I wanted to see whether its files also existed on a newer drive.  My concern was that some of the old files might not have made it in good condition to the new drive.  This post describes the somewhat haphazard steps I took in this project. 

This is an incomplete account.  Ultimately, I was able to proceed only partway, with steps like those described here.  These steps got rid of half of the old files; I was able to see, fairly quickly, that at least half of them were duplicative or superseded.  When I closed this post, I was still not sure how I would be able to decide what to make of the other half of those files, other than through a long, manual file comparison process.
One first step, I thought, might have been to keep a complete directory listing from the old drive.  There were some scenarios where that might have been useful, and it was easy to make.  Freeware utilities would do it, but the DIR command was enough.  The commands would have been something like this:

F:
cd \
DIR *.* /s > D:\F-LIST.TXT
where F was the old drive, "cd \" started me in the root of F, and DIR put the names of all files, including those in subdirectories, in a file called D:\F-LIST.TXT.

Anyway, I decided to reduce the size of the problem by eliminating files on F that seemed very likely to have made a successful transition to the new drive (drive E).  Ideally, I would have made a backup copy of F before doing this, just in case some of my steps erred.

To delete those files that had successfully transitioned from F to E, I used DoubleKiller, though there were other duplicate remover programs that would have done the job too.  The criteria I used at this point were to check both the file sizes and the CRC32 checksums.  (Not all duplicate remover programs had these options, but some others did have something similar.)  The concept here was that a file's size could be the same and yet its contents could have changed, so it would pass a size-only test; and it was conceivable that two very different files could have the same CRC32 checksums; but it was virtually certain that if both the size and the CRC32 checksum were identical, they were in fact identical files, in which case one could safely be deleted.  I searched drive E for duplicates within itself, and resolved all of those.  Then I did likewise on F.  With that done, I could now check on both E and F.  DoubleKiller would display the duplicates in pairs, with E coming first and F coming second, so I could then do a bulk check-and-delete process (after some additional eyeballing) to remove the duplicates from F that had almost certainly made it to E successfully.
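
For anyone inclined to script it, the size-plus-CRC32 test takes only a few lines.  A minimal Python sketch of the idea (the drive letters match this project; the function names are my own invention, and it only flags candidates rather than deleting anything):
import os, zlib

def crc32_of(path, chunk=1 << 20):
    # Stream the file through zlib.crc32 so large files never load into memory.
    crc = 0
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                return crc
            crc = zlib.crc32(block, crc)

def index_by_size_and_crc(root):
    # Map (size, CRC32) -> list of paths for every file under root.
    index = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            key = (os.path.getsize(path), crc32_of(path))
            index.setdefault(key, []).append(path)
    return index

on_new_drive = index_by_size_and_crc("E:\\")
for key, paths in index_by_size_and_crc("F:\\").items():
    if key in on_new_drive:
        for path in paths:
            print("candidate for deletion:", path)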

While doing all that, I noticed that a lot of files had changed format.  They had been .doc or .txt files, but I had converted them to .pdf.  This process would result in a file that had a different size, checksum, and filename.  But experience with the newer files, supplemented now by some random checking of these .pdfs, convinced me that most if not all with the new (*.pdf) filenames did reflect the same content.  So I could also delete the old .doc and .txt (and other extension) files on the old F drive whose content appeared to have made the transition to .pdf on the new E drive.  Space and time permitting, I could have played with that question some more and probably improved the accuracy of this comparison.

I could delete the old files (with .txt, .doc, etc. extensions) in favor of the new PDFs, that is, if I could figure out which ones had been so converted.  Consider these two filenames:
Letter to John Doe Containing Information.doc
Letter to John Doe Containing Information.pdf
If they were in the same folder, I could just sort them by file type in Windows Explorer, select the ones of the old type (.doc, in this example), then sort by name, and eyeball to make sure I had alternating selected and unselected files.  The problem in my case was that some of these, after being converted to .pdf, had also been moved to other folders.
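
A script can do the cross-folder matching that Windows Explorer could not.  A minimal Python sketch (the function names are invented; it only flags candidates, since an identical base name does not guarantee identical content):
import os

def stems(root, extensions):
    # Map lowercased filename-minus-extension -> paths, for the given extensions.
    found = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            stem, ext = os.path.splitext(name)
            if ext.lower() in extensions:
                found.setdefault(stem.lower(), []).append(os.path.join(folder, name))
    return found

pdfs = stems("E:\\", {".pdf"})
for stem, paths in stems("F:\\", {".doc", ".txt"}).items():
    if stem in pdfs:
        # Same base name exists somewhere on E, regardless of folder.
        print(stem, "->", paths)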

To address that, I knew of two solutions.  One was the easy solution.  This apparently involved using Beyond Compare, or something like it, in its Professional version, and figuring out how to ignore filetypes so that I would have a straight-across comparison on two columns.  I had the Standard version, so I didn't have that option and, at this point, didn't even know that option existed.  What I was seeing in my Standard version of Beyond Compare was an alternating layout that, when other files were added to the mix, could be hard to decipher, especially when dealing with large numbers of files that I did not care to scrutinize one by one.

As I say, that was the easy solution, so of course the solution I took was, instead, to produce a list of files on drive F, using the DIR command (above), and compare it against a list of files on drive E.  Since there were more than 65,000 files altogether, I could not simply copy these resulting E-LIST.TXT and F-LIST.TXT files into Microsoft Excel; its maximum was about 65,000 rows.  I discovered that OpenOffice.org (OOo) Calc had a similar limit, but a variant version of OOo Calc had a maximum of one million rows.  This version, called Go-OO, was delighted to accept my long E-LIST.TXT and F-LIST.TXT file listings, and it went right to work on them, crashing every couple of minutes.  After spending hours paring down the files to a somewhat smaller size, and converting formulas to simple text and numeric data (using Edit-Copy, Edit-Paste Special) as soon as I didn't need the formulas anymore, I was able to reduce the crashing.  By using text analysis functions (especially FIND and MID) within Go-OO, I was able to extract the filenames without extensions.  Now the problem was just getting Go-OO to copy and paste the lookup formula that I was using to search for the F filenames without extensions against the derived list of E filenames without extensions.  Go-OO was unwilling to copy and paste those formulas en masse; the very idea would cause it to freeze in fear.  Since I did not want to paste the formula manually down tens of thousands of rows, a few at a time, I wound up using AutoHotKey to automate the copying of those formulas.  AutoHotKey was useful if somewhat childlike.  I found, for instance, that if I absentmindedly switched the focus on that computer away from the copying process underway in Go-OO, AutoHotKey would blithely continue copying and pasting things into whatever random windows happened to be opened.  Fortunately, there were no delete commands in my AutoHotKey script, else it would have been an unpleasant morning.  Sometimes AutoHotKey would also decide to just start working away in some other part of the spreadsheet.  Should anyone care, the AutoHotKey macro that I prepared for this project went like this:
!x::
; Copies a cell in OOo Calc to the next one down
; (Ctrl+C copies; Enter pastes into the cell below)
Send ^c{down}{Enter}{down}{up}
return
At this point, I wrote up these additional notes, which did not actually turn out to describe what happened next, but which may be useful for someone else who shares my indulgence in fantasy.  First, I said, "When the lookups were done and I was finally working with fewer than 65,000 files, I saved in an Excel-compatible format and resumed work there."  I also wrote, "Eventually, I noticed that some files that had previously been .jpgs were probably combined into .pdfs as well.  Since .jpg format did not accommodate multipage documents, what we had there was a bunch of files named file01.jpg, file02.jpg, file03.jpg, and so forth, and the resulting file.pdf.  So here I had to modify the spreadsheet formula somewhat to seek out those filenames ending in two or three digits, and search for equivalents along those lines."

Those additional notes did not turn into anything real, because what happened instead was that Go-OO -- whether of its own volition, or as persuaded by AutoHotKey, I cannot say -- decided not to reopen the file anymore, nor any of its last several versions, comprising hours of work on the spreadsheet.  After trying various angles on two different machines, I gave up and waited for news from the Beyond Compare people on what it would cost to upgrade from Standard to Pro, hoping the Pro version would provide a workable solution to this project.  It turned out that a previous version of Beyond Compare had a comparison mode that might be useful, so I looked into that; unfortunately, I would have needed some of the features of the latest version to make it work.  Another Beyond Compare approach involved their Alignment Overrides option, which could be tried experimentally in the Pro version.  This did not really help me much.

I wound up using Beyond Compare 3, using folder-by-folder comparisons and also comparisons regardless of folder.  I simply deleted, from the old drive, the files that were matched on the new one.  This left a bunch of empty folders.  I had always gotten rid of those using a batch file in a sequence of several steps, but I had never memorized exactly what those steps were, and what I was finding now wasn't working right.  I heard that Glary Utilities had a tool for this purpose, so I installed that freeware and went into its Modules tab > Files & Folders > Empty Folders Finder.  It found a boatload of them.  (A scripted equivalent of this empty-folders step appears at the end of this post.)  Once that was out of the way, I ran the DOS command "DIR *.* /s > D:\DIRLIST.TXT."  This created DIRLIST.TXT, a list of the files left on the drive.  I imported that list into Excel -- it was now short enough to fit within the 65,000-record limit -- and sliced and diced it.  I started by using functions like MID and FIND to extract the file extensions.  Example:  =MID(F1,1,LEN(F1)-LEN(K1)), with F1 containing the imported filename and K1 containing the extension (e.g., .html), would give me the filename without extension.  (To get the extension, I did several columns of =FIND(".",F1,G1+1), where G1 contained the rightmost dot yet discovered in the filename, sorting and identifying and then excluding from further sorts.)  Then I used ="DIR /S "&CHAR(34)&I2&"*.*"&CHAR(34)&" >> D:\FOUND_IT.TXT" to create a cell containing a DOS command that would look like this:  DIR /S "Letter to Joe*.*" >> D:\FOUND_IT.TXT.  This command would search for that filename and export its results to FOUND_IT.TXT.  Excel generated similar commands for each of the files I was looking for.  I copied and pasted the lot of them into a text file called Finder.bat.  At the start of Finder.bat, I added two lines:
D:
CD \
to make sure that the search would occur on the proper drive and would begin from its root.  I ran this overnight, and awoke to find that FOUND_IT.TXT had vanished.  No idea why.  So I ran it again, and this time simultaneously ran another batch file:
:: WAITER.BAT
:: Copies FOUND_IT.TXT to another folder
@echo off

:REPEAT
copy FOUND_IT.TXT D:\Current
TIMEOUT /T 30 /NOBREAK
goto REPEAT
Ultimately, this gave me a complete FOUND_IT.TXT, with lots of hits and misses.  I opened it in Word.  It was a 9,193-page document.  I did a bunch of search-and-replaces to get rid of useless lines, using terms like ^p (for line breaks) and ^t (for tabs).  At this point, I stopped examining drive contents and logging my steps, deciding instead to defer the rest of the process until I next wanted to devote some time to this enterprise.
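
As for that empty-folders step:  the batch-file sequence I could never remember can be approximated in a few lines of script, walking the tree bottom-up and removing whatever rmdir will accept (rmdir refuses non-empty folders, so nothing with contents is at risk).  A sketch, with an invented root path:
import os

root = r"F:\old files"   # hypothetical starting folder
for folder, _dirs, _files in os.walk(root, topdown=False):
    if folder != root:
        try:
            os.rmdir(folder)   # succeeds only if the folder is empty
        except OSError:
            pass               # folder still has contents; leave it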

Saturday, April 23, 2011

Windows 7: Archiving Emails (with Attachments) from Thunderbird to Individual PDFs - First Try

I had been collecting email messages in Thunderbird for a long time.  I wanted to export them to PDF-format files, one message per file, with filenames in this format:

2011-03-20 14.23 Email to John Doe re Tomorrow.pdf
reflecting an email I sent to John Doe at 2:23 PM on March 20, 2011 with a subject line of "Tomorrow."  This type of filename would sort correctly in Windows Explorer, chronologically speaking; hence, I could quickly see the order of messages.  There were a lot of messages, so I didn't want this to be a manual process.  This post describes the steps I took to make it semi-automated.

The Screen Capture Approach

The first thing I did was to combine all of the email messages that I wanted to export into one folder in Thunderbird.  Then I deleted duplicates from that folder.  Next, I decided that, actually, I was just interested in exporting messages prior to the current year, since recent messages might have information I would want to search for in Thunderbird.  So I moved the older messages into a separate folder.  I maximized the view of that folder in T-bird and shut down unnecessary layout features (i.e., message pane, status bar), so that I could see as many emails as possible on the screen, and as much as possible of the relevant data (i.e., date, time, sender, recipient, subject) for each email.  I did that because I wanted to capture the text information about the individual email messages.  The concept here was that I would do a screenshot for each screenful of emails, and would save the data from that screenshot into a text file that I could then massage to produce the desired filenames.  For this purpose, I tried a couple of searches; I downloaded and ran JOCR; but after a bit of screwing around I decided to revert to the Aqua Deskperience screen text capture shareware that I had purchased years earlier.

Index.csv

Then it occurred to me that perhaps I could just export all that information from T-bird at once.  I ran a search and downloaded and installed the ImportExportTools add-on for T-bird.  (Alternatives to ImportExportTools included IMAPSize and mbx2eml, the latter explained by About.com.)  It took Thunderbird a while to shut down and restart after the installation.  I assumed it was getting acquainted with the fact that I had relocated so many messages to a new folder.  When it did restart, I ran the add-on (Tools > ImportExportTools > Export all messages in the folder > Just index (CSV)).  I opened the CSV file (named index.csv) in Excel and saw that this was perfect:  I had a list of exactly the fields mentioned above (date, time, etc.).  I verified that Excel was showing a number of rows equal to the number of messages reported on the status bar back in Thunderbird.

I noticed that some of the data in the Excel file included characters (i.e., \ / : * ? " < > | ) that were not permitted in Windows filenames.  The Mbx2eml option (above) would have removed these characters automatically, but for this first time I wanted to do everything manually, so as to see how it was all working.  I thought doing it manually might also give me more control over the results.  I was also not sure that Mbx2eml would produce a CSV file, or that it would output the emails in the same order.  There seemed to be some other potential limitations.  It looked like a worthy alternative, but I didn't explore it.
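
Stripping those characters is mechanical enough to script.  A minimal Python sketch that builds a filename in my target format and swaps each forbidden character for a hyphen (the function and field names are inventions; Mbx2eml presumably makes its own substitution choices):
import re

FORBIDDEN = r'[\\/:*?"<>|]'   # the characters Windows filenames cannot contain

def email_filename(date, time, direction, person, subject):
    # e.g., "2011-03-20 14.23 Email to John Doe re Tomorrow.pdf"
    raw = f"{date} {time} Email {direction} {person} re {subject}.pdf"
    return re.sub(FORBIDDEN, "-", raw)

print(email_filename("2011-03-20", "14.23", "to", "John Doe", "Re: Tomorrow?"))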

Somewhere around this point, I went ahead prematurely with a time-consuming effort to revise the entries in the spreadsheet, so as to remove the unacceptable characters and otherwise make them look the way I wanted.  Eventually, I realized that this was a mistake, because now I would have a hard time matching spreadsheet entries automatically with the actual emails that I would be exporting from Thunderbird.  So I dropped that attempt and went back to the point of trying to plan in advance for how this was all going to work.

Attachments

I had assumed that I wanted to export emails to individual .eml files because EML format would bring along any attachments that happened to be included with a particular email message.  But I didn't plan to just leave the individual emails in EML format; I wanted to save them all as PDFs.  In other words, I wanted to have the email and its attachment within a single PDF.

A quick test showed me that printing EMLs would be no more successful at including the attachments than if I just printed directly from Thunderbird, without all this time-consuming exporting and renaming.  There were other solutions that would have worked as well, if I had merely wanted to print the emails without their attachments.  A search led to InboxRULES, which for $40 would do something or other with attachments in Outlook.  (Alternate:  Automatic Print Email for $69.)  There didn't seem to be a solution for Thunderbird, and I wasn't keen on spending $40 and having to install Outlook and move all these emails there in order to print their attachments.  I thought about handling the attachments manually -- print the email first, then print the attachment, and append it to the email -- but a quick sort in Thunderbird told me there were hundreds of messages with attachments.  Funny thing about that, though:  as I arrow-keyed down through them in Thunderbird, allowing them to become visible one at a time, I saw that Thunderbird would change its mind with many of them:  it thought they had attachments, but then it realized they didn't.  That trimmed out maybe 5% of the ones that had originally been marked as having attachments.  But there were still a lot of them.

Another search led to some T-bird options, but it still looked like there was going to be a lot of manual effort before I'd have a single PDF containing both the email and its attachment.  Total Thunderbird Converter looked like it might come close, at a hefty price ($50).  It wasn't reviewed on CNET.com or anywhere else, as far as I could tell, so there was a risk that (as I'd experienced in other cases) the program simply wouldn't work properly.  But then I saw that they were offering a 30-day free trial, so I downloaded and installed it.  It turned out to be useless for my purposes:  it had almost no options, and therefore could not find my Thunderbird folders, which I was saving on drive D rather than C so as to avoid losing them in the event of a Windows update or reinstallation.  I looked at Email Open View Pro (or its alternate, emlOpenView Free), which also offered a free trial.  It didn't look like it (or Universal Converter, or MSG Viewer Pro, or E-mail Examiner, or Convert EML to PDF) would bring the attachments into the same PDF as the email, so I moved on.  I tried Birdie EML to PDF Converter.  Their free demo version allowed me to convert one EML file at a time.  I liked its interface:  it gave me eight different naming options for the resulting file (e.g., "date + subject + from," in several different date formats).  I didn't like the output, though:  they formatted the PDF for the EML file oddly, with added colors that I didn't want, and all they did with the attachment was to put it into a subfolder bearing the same name as the resulting PDF.  I'd still have to PDF it -- the example I used was an EML with a .DOC file attachment -- and merge it with the PDF of the EML.  But now they had led me to see that perhaps I could at least automate the extraction of attachments, getting me partway to the goal.

At about this point, Thunderbird inconveniently lost a folder containing several thousand email messages.  It just vanished.  The program was struggling there for a few minutes before that, and part of me was instinctively thinking that I should shut down the program and do a backup, but this would have been a very deeply subconscious part of me that was basically unresponsive under real-life conditions.  In other words, I didn't.  So now I had to go rooting around in backups to see what I could rescue from the wreckage.  I found that Backup Maker had been happily making backups, as God intended.  Amazing what can happen, when you have about five different backup systems running; in this case I had just wiped a drive, moved a drive, etc., and so of course Backup Maker was the *only* backup system that was actually in a position to restore real data when I seriously needed it.  What Backup Maker had saved was some files with an .MSF extension.  These were supposedly backups of Thunderbird.  But then, no, on closer inspection I realized these were much too small, so I had to do some more digging.  Eventually I did patch together something resembling the way things had been before the crash, so I could go back and pick up where I had left off.  A couple of days of other interruptions passed here, so the following just reports where I went from this point forward.

I had the option of just saving the Thunderbird file, or the exported emails, for some future date when there would perhaps be improved software for printing attachments to PDF in a single operation with the emails to which they were attached.  There had been times when software developments had saved (or would have saved) a great amount of time in what would have been (or actually was) a tedious manual effort.  On the other hand, I had also seen situations where letting something sit meant letting it become confused or corrupted, or where previous learning (especially on my part) had been lost.  I decided to go ahead with converting the emails to PDF to the extent possible without a tremendous time investment.

My searching led to Attachment Extractor, a Thunderbird extension.  I installed it, highlighted two emails with attachments, right-clicked on them, and selected "Extract to Suggested File-Folder."  It worked -- it did extract the attachments without removing them from the emails.  I assumed it would do this with hundreds of emails if I so ordered.  Then, to get them matched up with PDFs of the emails to which they were attached, I would apparently have to page down through those emails one by one, looking at what attachments they had, and setting them up for more or less manual combination.  Attachment Extractor did have one potentially useful feature for this purpose:  a right-click option to "Extract with a Custom Filename Pattern."  I found that I could configure the names given to the extracted attachments, so that they would correspond at least roughly with the names of emails printed to PDF.  To configure the naming in Attachment Extractor, I went into Thunderbird > Tools > Add-ons > Extensions Tab > AttachmentExtractor > Options > Filenames tab.  There, I used this pattern:
#date# - Email from #fromemail# re #subject# - #namepart# #count##extpart#
and, per instructions, in the Edit Date Pattern option I used this pattern:
Y-m-d H.i
That gave me extracted attachments with names that were at least roughly similar to the format I wanted (see above).

Batch Exporting Emails with Helpful Filenames

Now if I could print the corresponding email to PDF with a fairly similar name, the manual matching might not be so difficult.  A search led to inquiries about renaming PDF print output.  For $60, I could get Zan Image Printer, which sounded like it would have some capability for automating PDF printout filenames.  Print Helper, for $125 to $175, was another option.  A Gizmo's Freeware article did not say much about this kind of feature, though several people asked about it.  A list of free PDF printers led me to think that UltraPDF Printer was free and would do this; its actual price was $30. 

The pdf995 Experiment

At this point, I was reminded of how much time I could waste on uncooperative software.  No doubt many people have used pdf995 successfully.  I was not among them.

I tried Pdf995's Free Converter.  The instructions on bypassing the Save As dialog were in the Pdf995 Developer's FAQs page.  They seemed to require me to open C:\Program Files\PDF995\res\pdf995.ini in Notepad.  But that .ini file seemed to be configured for printing one specific file that I had just printed.  They didn't say how to adjust it.  Eventually I figured out that I needed to download and install pdfEdit995, and make the changes there.  So I tried that.  But I got an error message:
PdfEdit995 requires that Pdf995 v9.1 or later and the free converter v1.2 or later are already installed on your system.
But they were!  I had just installed them.  Was I supposed to reboot first?  No, a reboot didn't fix it.  I tried again to install basic pdf995 and the Free Converter, which I had downloaded together.  Once again, I got the webpage saying I had installed it successfully.  Was I supposed to install the printer driver too?  I understood the webpage to be saying that was included in the 9MB download.  But I tried that.  Got the congratulatory webpage, so apparently it installed correctly.  Now I noticed I had another error, which had not come up on top, so I was not sure how long it had been there:
Setup has detected you have an old version of pdfEdit995 that is incompatible with the latest version of pdf995.
But I had just downloaded it from their website!  Not an altogether auspicious beginning here.  But I downloaded and installed the very latest and tried again, and now it seemed to work, or at least I got a different congratulatory webpage than before.  A cursory read-through still did not give me a clear approach to automated naming of PDFs.  Instead, they said that maybe I wanted their OmniFormat program for batch PDF creation.  Who knew?  I downloaded and installed OmniFormat.  Got a new congratulatory webpage, but still no straightforward explanation of batch naming.  Instead, it said that pdfEdit995 was what I wanted for creating batch print jobs.  So, OK.  A bridge too far.  Though at this point they specified "batch print jobs from Microsoft Office applications," raising the prospect that this wasn't going to work from Thunderbird.  Went back to their incredibly tiny-print pdfEdit instructions page.  It said I would have to set pdf995 as the default printer to do the batch thing.  That was OK.  But it still sounded like it was intended primarily for batch printing from Microsoft Word.  I decided to just try making pdf995 the default printer.  That required me to go to the Windows Start button > Settings > Printers > right-click on PDF995 > set as default printer.  While I was there, I right-clicked on PDF995 and looked at its Properties, but there didn't seem to be anything to set for purposes of automating printing.  Now I went to Thunderbird, selected several messages, and did a right-click > Print.  Funny, it defaulted to Bullzip, which was my usual default printer.  Checked again:  yeah, pdf995 was definitely set as my default printer.  Tried again, and this time manually set it to pdf995 when it was printing.  It asked for the filename, so that wasn't working.  Back in Printers, I looked at the Properties for Bullzip, but it didn't seem to have any options for automatic naming either.  It seemed pdf995 was not the solution for me.  I came away from this little exploration with less time and ambition for the larger project.  Certainly I wasn't in the mood to buy software and then discover that I couldn't make it work.

Further Exploration

I ran across an Addictive Tips article that said PrintConductor was a good batch printing option, though I might need to have Adobe Acrobat installed first.  I did, so I took a look.  There was an option to download Universal Document Converter (UDC) as well.  I wasn't sure, but I thought I might need that for Print Conductor, so I installed both.  PrintConductor didn't seem to have a way of printing EML files.  Meanwhile, UDC's installer gave me the option of making it the default printer, so I tried that.  But as before, Thunderbird defaulted to Bullzip, so I had to select UDC as my printer manually.  (Print Conductor did not appear in the list of available printers.)  When I selected UDC as the printer, before printing, I clicked on the print dialog's Properties button and went into the Output Location tab.  There, I chose the "predefined location and filename option."  I left the default filename alone and printed.  And it worked.  Sure enough, I had a couple of PDFs whose names were the same as the Subject fields shown in Thunderbird for those printed emails.  So I would be able to match them with the attachments produced by Attachment Extractor (above).  All I had to do now was to pay $69 for a copy of UDC, so that each PDF would not have a big black "Demo Version" sticker on it.

Recap

So to review the situation at this point, I had a way of extracting email attachments with highly specific date, time, and subject filenames.  I also had a way of extracting the emails themselves with filenames showing date and subject, using ImportExportTools (above):  Tools > ImportExportTools > Export all messages in the folder > EML format.  Unfortunately, there could be a number of messages in a single day on the same subject.  Without the time data in the filename, I would have duplicates.  More to the point, it would be difficult to match emails and attachments automatically, and I didn't want to go through that matching process manually for large numbers of emails.  I would also prefer a result in which emails converted to PDFs would appear in the right order in Windows Explorer, and that would require the time data.  As I was writing this recap, several days after writing the foregoing material, I was not entirely sure that I had verified that the output filename in UDC would include the time data.  If it did, one option would be to buy UDC (or perhaps one of the other programs just mentioned) for purposes of producing properly named emails.  Another would be to export the list of emails to Index.csv (above) and hope that this list would match the order in which ImportExportTools would export individual emails.  There would still be the possibility that such a program would sometimes fail to do what it was supposedly doing, perhaps without my noticing until long after the source data was gone.

The Interim Solution

I decided that, at this point, I could not justify the enormous time investment that would be required to complete this project -- in particular, to manually print to PDF each attachment to each email, to combine those PDFs, and to match and merge them with a PDF of the email message to which they had been attached.  This seemed like the kind of project that really had to await some further development in application software.  For all I knew, the kind of solution I was seeking already existed, and I was just waiting until the day when I would become aware of it.  It was not at all an urgent project -- I rarely consulted attachments for old emails, and almost never consulted them for prior years, where I was focusing my present attention.

I wanted to get those old emails out of Thunderbird.  I didn't like the idea of having all that data at the mercy of a relatively inaccessible program (i.e., I couldn't see those emails in Windows Explorer), and anyway I didn't want T-bird to be so cluttered.  It seemed that a good solution would be to focus on the emails themselves for now.  I would export them to EML format.  EMLs would continue to contain the attachments.  I would then zip the EMLs into a small number of files, each no more than a few gigabytes in size, perhaps on a year-by-year basis, and I would store them until further notice.  Before zipping, I would make sure the EMLs were named the way I wanted, and would print each of them to separate PDFs.  So then I would have at least the contents of the emails themselves in readily accessible format, and could go digging into the zip file if I needed an attachment.  If I did someday find a way to automate the task of combining the emails and their attachments into a single PDF, I would give those PDFs the same name as the current email-only PDFs, so that the more complete versions would simply overwrite the email-only versions in the folders where I would store them.
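
The year-by-year zipping also lends itself to a short script.  A Python sketch, assuming the exported EMLs sit in one folder and their filenames begin with the year (the folder name is invented):
import os, zipfile
from collections import defaultdict

folder = r"D:\Exported EMLs"   # hypothetical location of the .eml files
by_year = defaultdict(list)
for name in os.listdir(folder):
    if name.lower().endswith(".eml"):
        by_year[name[:4]].append(name)   # filename begins with the year

for year, names in sorted(by_year.items()):
    archive = os.path.join(folder, "emails-" + year + ".zip")
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in names:
            zf.write(os.path.join(folder, name), arcname=name)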

Export and PDF via Index.csv

I decided to try and see if the Index.csv approach would work for purposes of producing EMLs whose names contained all of the elements identified above (i.e., date, from, to, subject).  I had sorted the old emails in Thunderbird into separate folders by year.  I went to one of those folders in T-bird and sorted it in ascending date order.  Then I went into Tools > ImportExportTools > Export all messages in the folder > Just index (CSV).  This gave me what appeared to be a matching list of those messages, in that same order.  The number of lines in the CSV spreadsheet (viewed in Excel) matched the number of messages in that folder as stated in T-bird's status bar.

I wondered what would happen if I exported another Index.csv after sorting the emails in that T-bird folder in descending chronological order.  Good news:  the resulting index.csv produced in that experiment seemed to be reversed from the one I had produced in ascending order.  At least the first and last emails were in reversed positions.  So it did appear that index.csv matched the order that I saw in T-bird.

On that basis, I added an Index number column at the left end of the index.csv file I was working with, the one with emails sorted in ascending date order.  This index column just contained ascending numbers (1, 2, 3 ...), so that I could revert to the original sort order if needed.  I assumed that the list would continue to sort in proper date order, but I planned to revise the date field (presently in "7/4/1997 18.34" format) so that it could function for file sorting purposes (e.g., 1997-07-04 18.34).  I wasn't sure that the present and future date fields would always sort exactly the same.  I could have retained the existing date field, but I wasn't sure that it, itself, was reliable for sorting purposes:  would two messages arriving in the same minute always sort in the same order?
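
That date revision is a one-liner in most scripting languages.  A Python sketch (assuming the exported dates were month-first, which the 7/4/1997 example suggests but does not prove):
from datetime import datetime

def sortable(stamp):
    # "7/4/1997 18.34" -> "1997-07-04 18.34"
    return datetime.strptime(stamp, "%m/%d/%Y %H.%M").strftime("%Y-%m-%d %H.%M")

print(sortable("7/4/1997 18.34"))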

Now I exported the emails themselves:  Tools > ImportExportTools > Export all messages in the folder > EML format.  As partially noted above, these were named in Date - Subject - Number format.  I now did a search to try to figure out what that number signified.  It wasn't clear, but it seemed to be just randomly generated.  Too bad.  It would have been better if they had included the time at the start of that random number, and had put it immediately after the date, so that the EMLs would sort in nearly true time order.  (There could still be multiple emails on the same subject within the same minute, and T-bird didn't seem to save time data down to the second or fraction of a second.)  It seemed I would have to manually sort files bearing the same subject line and arriving or being sent on the same day.  There would surely be large numbers of files like that.  I now realized they would not at all be sorted correctly in Windows Explorer:  with only date (not time) data in the filename, a file arriving in the morning with a subject of Zebras would be sorted, in Windows Explorer, after a file arriving in the afternoon on the subject of Aardvarks, and if there were three on the subject of Aardvarks they would all be sorted together even if they had arrived at several different times of day.

Ah, but now I discovered that ImportExportTools had file naming options.  Silly me.  I had just overlooked that.  But there they were:   Tools > ImportExportTools > Options > Filenames tab.  I selected "Add time to date" and I chose Date - Name (Sender) - Subject format.  Now I tried another export of EMLs.  The messages now had names like this:
19970730-0836-Microsoft-Welcome!
I checked and, sure enough, that was a message from Microsoft on that date at 8:36 AM.  Suddenly the remainder of my job got a lot easier.  I went back to the Index.csv spreadsheet (now renamed as an .xls) and worked toward perfecting its match with the filenames produced by ImportExportTools.  There were two parts to this mission.  First, I had to rework the Index.csv data exported from T-bird so that it would match the filenames given to the EMLs by ImportExportTools.  Second, I would then use the spreadsheet to produce a batch file that would rename those files to the format I wanted.  This called for some spreadsheet manipulation described in another post.
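
The naming pattern is regular enough that a script could have generated the new names directly, at least when neither sender nor subject contains a hyphen; when they do, the split below is ambiguous, which is part of what the spreadsheet matching had to sort out.  A hedged Python sketch:
import re

PATTERN = re.compile(r"^(\d{4})(\d{2})(\d{2})-(\d{2})(\d{2})-([^-]+)-(.*)\.eml$")

def nicer(name):
    # "19970730-0836-Microsoft-Welcome!.eml"
    #   -> "1997-07-30 08.36 Email from Microsoft re Welcome!.eml"
    m = PATTERN.match(name)
    if m is None:
        return None   # doesn't fit the date-sender-subject pattern
    y, mo, d, h, mi, sender, subject = m.groups()
    return f"{y}-{mo}-{d} {h}.{mi} Email from {sender} re {subject}.eml"

print(nicer("19970730-0836-Microsoft-Welcome!.eml"))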

Converting EMLs to PDF

Now I faced the problem of converting the exported EMLs to PDF, as distinct from the problem (above) of exporting PDFs from Thunderbird. 

I found that EMLs could be converted into TXT files just by changing their extensions to .txt, which was easy enough to do en masse with a program like Bulk Rename Utility.  That would permit them to be converted to PDFs without the rich text, if necessary, since it was a lot easier to find a freeware program that would do that (or, in my case, to use Acrobat) than to find one that would PDF an EML.  This appeared to be a viable solution for some of the older emails, which had apparently been through the wringer and were not showing much sign of having glorified colors or other rich text or HTML features.
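
The mass rename could be done with Bulk Rename Utility, with REN *.eml *.txt at a command prompt, or with a few lines of script.  A Python sketch, folder name invented:
import os

folder = r"D:\Exported EMLs"   # hypothetical
for name in os.listdir(folder):
    if name.lower().endswith(".eml"):
        # An EML is plain text underneath, so a rename is all the "conversion" takes.
        os.rename(os.path.join(folder, name),
                  os.path.join(folder, name[:-4] + ".txt"))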

Before proceeding with this, I decided to export all of the remaining EMLs from Thunderbird.  I knew I could read the EMLs and the TXTs (if I renamed them as that); I also knew I could reimport them into T-bird.  This seemed like a separate step.  I also decided that going back through the exporting process would give me an opportunity to write a cleaner post that would summarize some of the foregoing information.

Wednesday, December 30, 2009

Sorting and Manipulating a Long Text List to Eliminate Some Files

In Windows XP, I made a listing of all of the files on a hard drive.  For that, I could have typed DIR *.* /s > OUTPUTFILE.TXT, but instead I used PrintFolders.  I selected the option for full pathnames, so each line in the file list was like this:  D:\FOLDER\SUBFOLDER\FILENAME.EXT, along with date and file size information.

I wanted to sort the lines in this file list alphabetically.  They already were sorted that way, but DIR and PrintFolders tended to insert blank lines and other lines (e.g., "=======" divider lines) that I didn't want in my final list.  The question was, how could I do that sort?  I tried the SORT command built into WinXP, but it seemed my list was too long.  I tried importing OUTPUTFILE.TXT into Excel, but it had more than 65,536 lines, so Excel couldn't handle it.  It gave me a "File not loaded completely" message.  I tried importing it into Microsoft Access, but it ended with this:

Import Text Wizard

Finished importing file 'D:\FOLDER\SUBFOLDER\OUTPUTFILE.TXT' to table 'OUTPUTTXT'.  Not all of your data was successfully imported.  Error descriptions with associated row numbers of bad records can be found in the Microsoft Office Access table 'OUTPUTFILE.TXT'.

And then it turned out that it hadn't actually imported anything.  At this point, I didn't check the error log.  I looked for freeware file sorting utilities, but everything was shareware.  I was only planning to do this once, and didn't want to spend $30 for the privilege.  I did download and try one shareware program called Sort Text Lists Alphabetically Software (price $29.99), but it hung, probably because my text file had too many lines.  After 45 minutes or so, I killed it.

Eventually, I found I was able to do the sort very quickly using the SORT command in Ubuntu.  (I was running WinXP inside a VMware virtual machine on Ubuntu 9.04, so switching back and forth between the operating systems was just a matter of a click.)  The sort command I used was like this:
sort -b -d -f -i -o SORTEDFILE.TXT INPUTFILE.TXT
That worked.  I edited SORTEDFILE.TXT using Ubuntu's GEDIT program (like WinXP's Notepad).  For some reason, PrintFolders (or something) had inserted a lot of lines that did not match the expected pattern of D:\FOLDER\SUBFOLDER\FILENAME.EXT.  These may have been shortcuts or something.  Anyway, I removed them, so everything in SORTEDFILE.TXT matched the pattern.

Now I wanted to parse the lines.  My purpose in doing the file list and analysis was to see if I had any files that had the same names but different extensions.  I suspected, in particular, that I had converted some .doc and .jpg files to .pdf and had forgotten to zip or delete the original .doc and .jpg files.  So I wanted to get just the file names, without extensions, and line them up.  But how?  Access and Excel still couldn't handle the list.

This time around, I took a look at the Access error log mentioned in its error message (above).  The error, in every case, was "Field Truncation."  According to a Microsoft troubleshooting page, truncation was occurring because some of the lines in my text file contained more than 255 characters, which was the maximum Access could handle.  I tried importing into Access again, but this time I chose the Fixed Width option rather than Delimited.  It only went as far as 111 characters, so I just removed all delimiting lines in the Import Text Wizard and clicked Finish.  That didn't give me any errors, but it still truncated the lines.  Instead of File > Get External Data > Import, I tried Access's File > Open command.  Same result.

I probably could have worked through that problem in Access, but I had not planned to invest so much time in this project, and anyway I still wasn't sure how I was going to use Access to remove file extensions and folder paths so that I would just have filenames to compare.  I generally used Excel rather than Access for that kind of manipulation.  So I considered dividing up my text list into several smaller text files, each of which would be small enough for Excel to handle.  I'd probably have done that manually, by cutting and pasting, since I assumed that a file splitter program would give me files that Excel wouldn't recognize.  Also, to compare the file names in one subfile against the file names in another subfile would probably require some kind of lookup function.

That sounded like a mess, so instead I tried going at the problem from the other end.  I did another directory listing, this time looking only for PDFs.  I set the file filter to *.pdf in PrintFolders.  I still couldn't fit the result into Excel, so I did the Ubuntu SORT again, this time using a slightly more economical format:
sort -bdfio OUTPUTFILE.TXT INPUTFILE.TXT
This time, I belatedly noticed that PrintFolders and/or I had somehow introduced lots of duplicate lines, which would do much to explain why I had so many more files than I would have expected.  As advised, I used another Ubuntu command:
sort OUTPUTFILE.TXT | uniq -u
to remove duplicate lines.  But this did not seem to make any difference.  Regardless, after I had cleaned out the junk lines from OUTPUTFILE.TXT, it did all fit into Excel, with room to spare.  My import was giving me lots of #NAME? errors, because Excel was splitting lines in such a way that characters like "-" (which is supposed to be a mathematical operator) were the first characters in some cells, but were followed by letters rather than numbers, which did not compute.  (This would happen if, e.g., the split came at the wrong place in a file named "TUESDAY--10AM.PDF.")  So when running the Text Import Wizard, I had to designate each column as a Text column, not General.

I then used Excel text functions (e.g., MID and FIND) on each line, to isolate the filenames without pathnames or extensions.  I used Excel's text concatenation functions to work up a separate DIR command for each file I wanted to find.  In other words, I began with something like this:
D:\FOLDER\SUBFOLDER\FILENAME.EXT

and I ended with something like this:
DIR "FILE NAME."* /b/s/w >> OUTPUT.TXT

The quotes were necessary because some file names have spaces in them, which confuses the DIR command.  I forget what the /b and other options were about, but basically they made the output look the way I wanted.  The >> told the command to put the results in a file called OUTPUT.TXT.  If I had used just one > sign then that would have meant I wanted OUTPUT.TXT to be recreated every time a match was found.  Using two >> signs was an indication that OUTPUT.TXT should be created if it does not yet exist, but otherwise the results of the command should just be appended to whatever is already in OUTPUT.TXT.
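
The command generation itself did not have to happen in a spreadsheet; any scripting language can emit the same batch file.  A Python sketch along those lines (filenames follow this post's examples; cmd's quirkier filenames are untested here):
import ntpath, os

# Each line of SORTEDFILE.TXT is a full path like D:\FOLDER\SUBFOLDER\FILENAME.EXT.
paths = open("SORTEDFILE.TXT").read().splitlines()

with open("Finder.bat", "w") as bat:
    bat.write("D:\nCD \\\n")
    for p in paths:
        # ntpath parses backslashed Windows paths even when run elsewhere.
        stem = os.path.splitext(ntpath.basename(p))[0]
        # Quotes guard against spaces; >> appends every hit to one output file.
        bat.write('DIR "' + stem + '.*" /b /s >> OUTPUT.TXT\n')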

In cooking up the final batch commands, I would have been helped by the MCONCAT function in the Morefunc add-in, but I didn't know about it yet.  I did use Morefunc's TEXTREVERSE function in this process, but I found that it would crash Excel when the string it was reversing was longer than 128 characters.  Following other advice, I used Excel's SUBSTITUTE command instead.

I put the thousands of resulting commands (such as the DIR FILENAME.* >> OUTPUT.TXT shown above), one for each file name (e.g., FILE NAME.*) that I was looking for, into a DOS batch file (i.e., a text file created in Notepad, with a .bat extension, saved in ANSI format) and ran it.  It began finding files (e.g., FILENAME.DOC, FILENAME.JPG) and listing them in OUTPUT.TXT.  Unfortunately, this thing was running very slowly.  Part of the slowness, I thought, was due to the generally slower performance of programs running inside a virtual machine.  So I thought I'd try my hand at creating an equivalent shell script in Ubuntu.  After several false starts, I settled on the FIND command.  I got some help from the Find page in O'Reilly's Linux Command Directory, but also found some useful tips in Pollock's Find tutorial.  It looked like I could recreate the DOS batch commands, like the example shown above, in this format:

find -name "FILE NAME.*" 2>/dev/null | tee -a found.txt

The "-name" part told FIND to match files by name.  There were three options for the > character, called a redirect:  1> would have sent the desired output to the null device (i.e., to nowhere, so that it would be neither visible nor saved), which was not what I wanted; 2> sent the error messages there instead (I would get those because the lost+found folder produced one every time the FIND command tried to search it); and &> would have sent both the standard output and the error messages to the same place, whatever I designated.  Then the pipe ("|") said to send everything else (i.e., the standard output) to TEE.  TEE would tee the standard output; that is, it would send it to two places:  to the screen (so that I could see what was happening) and also to a file called found.txt.  The -a option served the same function as the DOS >> redirection (also available in Ubuntu's BASH, which is what I was using here):  it appended the output to an existing found.txt, or created the file if it did not yet exist.  I generated all of these commands in Excel - one for each FILE NAME.* - and saved them to a Notepad file, as before.  Then, in Ubuntu, I made the script executable by typing "chmod +x " plus the script's name at the BASH prompt, ran it by typing "./" plus the name, and it ran.  And the trouble proved to be worthwhile:  instead of needing a month to complete the job (which was what I had calculated for the snail's-pace search underway in the WinXP virtual machine), it looked like it could be done in a few hours.

And so it was.  I actually ran it on two different partitions, to be sure I had caught all duplicates.  Being cautious, I had the two partitions' results output to two separate .txt files, and then I merged them with the concatenation command:  "cat *.txt > full.lst."  (I used a different extension because I wasn't sure whether cat would try to combine the output file back into itself.  I think I've had problems with that in DOS.)  Then I renamed full.lst to be Found.txt, and made a backup copy of it.

I wanted to save the commands and text files I had accumulated so far, until I knew I wouldn't need them anymore, so I zipped them using the right-click context menu in Nautilus.  It didn't give me an option to simultaneously combine and delete the originals.


Next, I needed to remove duplicate lines from Found.txt.  I now understood that the command I had used earlier (above) had failed to specify where the output should go.  So I tried again:

sort Found.txt | uniq -u >> Sorted.txt

This produced a dramatically shorter file - 1/6 the size of Found.txt.  Had there really been that many duplicates?  I wanted to try sorting again, this time by filename.  But, of course, the entries in Sorted.txt included both folder and file names, like this:

./FOLDER1/File1.pdf
./SomeotherFolder/AAAA.doc
Sorting normally would put them in the order shown, but sorting by their ending letters would put the .doc files before the .pdfs, and would also alphabetize the .doc files by filename before foldername.  Sorting them in this way would show me how many copies of a given file there had been, so that I could eyeball the possibility that the ultimate list of unique files would really be so much shorter than reported in Found.txt.  I didn't know how to do that in bash, so I posted a question on it.
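
What I was really asking for was a sort keyed on extension, then bare filename, then folder.  In bash that takes some gymnastics; in Python it is a short key function.  A sketch:
def sort_key(path):
    # "./FOLDER1/File1.pdf" -> sort by extension, then filename, then folder,
    # so all copies of a given file land next to one another.
    folder, _, name = path.rpartition("/")
    stem, _, ext = name.rpartition(".")
    return (ext.lower(), stem.lower(), folder.lower())

lines = open("Sorted.txt").read().splitlines()
for line in sorted(lines, key=sort_key):
    print(line)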


Meanwhile, I found that the output of the whole Found.txt file would fit into Excel.  When I sorted it, I found that each line was duplicated - but only once.  So plainly I had done something wrong in arriving at Sorted.txt.  (The likely culprit was the -u flag:  uniq -u prints only lines that appear exactly once, discarding every copy of any duplicated line, whereas plain uniq keeps one copy of each.)  From this point, I basically did the rest of the project in Excel, though it belatedly appeared that there were some workable answers in response to my post.