Sunday, May 8, 2011

Data Nightmare: Reconciling Two Hard Drives

I was using Windows 7.  I had an old hard drive with files on it.  I wanted to see whether its files also existed on a newer drive.  My concern was that some of the old files might not have made it in good condition to the new drive.  This post describes the somewhat haphazard steps I took in this project. 

This is an incomplete account.  Ultimately, I was able to proceed only partway, with steps like those described here.  These steps got rid of half of the old files; I was able to see, fairly quickly, that at least half of them were duplicative or superseded.  When I closed this post, I was still not sure how I would be able to decide what to make of the other half of those files, other than through a long, manual file comparison process.
One first step, I thought, might have been to keep a complete directory listing from the old drive.  There were some scenarios where that might have been useful, and it was easy to make.  There were freeware utilities that would do it.  I would have just used the DIR command.  The commands I would have used would have been something like this:

F:
cd \
DIR *.* /s > D:\F-LIST.TXT
where F was the old drive, "cd \" started me in the root of F, and DIR put the names of all files, including those in subdirectories, in a file called D:\F-LIST.TXT.

Anyway, I decided to reduce the size of the problem by eliminating files on F that seemed very likely to have made a successful transition to the new drive (drive E).  Ideally, I would have made a backup copy of F before doing this, just in case some of my steps erred.

To delete those files that had successfully transitioned from F to E, I used DoubleKiller, though there were other duplicate remover programs that would have done the job too.  The criteria I used at this point were to check both the file sizes and the CRC32 checksums.  (Not all duplicate remover programs had these options, but some others did have something similar.)  The concept here was that a file's size could be the same and yet its contents could have changed, so it would pass a size-only test; and it was conceivable that two very different files could have the same CRC32 checksums; but it was virtually certain that if both the size and the CRC32 checksum were identical, they were in fact identical files, in which case one could safely be deleted.  I searched drive E for duplicates within itself, and resolved all of those.  Then I did likewise on F.  With that done, I could now check on both E and F.  DoubleKiller would display the duplicates in pairs, with E coming first and F coming second, so I could then do a bulk check-and-delete process (after some additional eyeballing) to remove the duplicates from F that had almost certainly made it to E successfully.

While doing all that, I noticed that a lot of files had changed format.  They had been .doc or .txt files, but I had converted them to .pdf.  This process would result in a file that had a different size, checksum, and filename.  But experience with the newer files, supplemented now by some random checking of these .pdfs, convinced me that most if not all with the new (*.pdf) filenames did reflect the same content.  So I could also delete the old .doc and .txt (and other extension) files on the old F drive whose content appeared to have made the transition to .pdf on the new E drive.  Space and time permitting, I could have played with that question some more and probably improved the accuracy of this comparison.

I could delete the old files (with .txt, .doc, etc. extensions) in favor of the new PDFs, that is, if I could figure out which ones had been so converted.  Consider these two filenames:
Letter to John Doe Containing Information.doc
Letter to John Doe Containing Information.pdf
If they were in the same folder, I could just sort them by file type in Windows Explorer, select the ones of the old type (.doc, in this example), then sort by name, and eyeball to make sure I had alternating selected and unselected files.  The problem in my case was that some of these, after being converted to .pdf, had also been moved to other folders.

To address that, I knew of two solutions.  One was the easy solution.  This apparently involved using Beyond Compare, or something like it, in its Professional version, and figuring out how to ignore filetypes so that I would have a straight-across comparison on two columns.  I had the Standard version, so I didn't have that option and, at this point, didn't even know that option existed.  What I was seeing in my Standard version of Beyond Compare was a more alternating layout that, when other files were added to the mix, could be hard to decipher, especially when dealing with large numbers of files that I did not care to scrutinize one-by-one.

As I say, that was the easy solution, so of course the solution I took was, instead, to produce a list of files on drive F, using the DIR command (above), and compare it against a list of files on drive E.  Since there were more than 65,000 files altogether, I could not simply copy these resulting E-LIST.TXT and F-LIST.TXT files into Microsoft Excel; its maximum was about 65,000 rows.  I discovered that OpenOffice.org (OOo) Calc had a similar limit, but a variant version of OOo Calc had a maximum of one million rows.  This version, called Go-OO, was delighted to accept my long E-LIST.TXT and F-LIST.TXT file listings, and it went right to work on them, crashing every couple of minutes.  After spending hours paring down the files to a somewhat smaller size, and converting formulas to simple text and numeric data (using Edit-Copy, Edit-Paste Special) as soon as I didn't need the formulas anymore, I was able to reduce the crashing.  By using text analysis functions (especially FIND and MID) within Go-OO, I was able to extract the filenames without extensions.  Now the problem was just getting Go-OO to copy and paste the lookup formula that I was using to search for the F filenames without extensions against the derived list of E filenames without extensions.  Go-OO was unwilling to copy and paste those formulas en masse; the very idea would cause it to freeze in fear.  Since I did not want to paste the formula manually down tens of thousands of rows, a few at a time, I wound up using AutoHotKey to automate the copying of those formulas.  AutoHotKey was useful if somewhat childlike.  I found, for instance, that if I absentmindedly switched the focus on that computer away from the copying process underway in Go-OO, AutoHotKey would blithely continue copying and pasting things into whatever random windows happened to be opened.  Fortunately, there were no delete commands in my AutoHotKey script, else it would have been an unpleasant morning.  Sometimes AutoHotKey would also decide to just start working away in some other part of the spreadsheet.  Should anyone care, the AutoHotKey macro that I prepared for this project went like this:
!x::

; Copies a cell in OOo Calc to the next one down
Send ^c{down}{Enter}{down}{up}
At this point, I wrote up these additional notes, which did not actually turn out to describe what happened next, but which may be useful for someone else who shares my indulgence in fantasy.  First, I said, "When the lookups were done and I was finally working with fewer than 65,000 files, I saved in an Excel-compatible format and resumed work there."  I also wrote, "Eventually, I noticed that some files that had previously been .jpgs were probably combined into .pdfs as well.  Since .jpg format did not accommodate multipage documents, what we had there was a bunch of files named file01.jpg, file02.jpg, file03.jpg, and so forth, and the resulting file.pdf.  So here I had to modify the spreadsheet formula somewhat to seek out those filenames ending in two or three digits, and search for equivalents along those lines."

Those additional notes did not turn into anything real because what happened instead was that Go-OO -- whether of its own volition, or as persuaded by AutoHotKey, I cannot say -- decided not to reopen the file anymore, nor any of its last several versions comprising hours of work on the spreadsheet.  After trying various angles on two different machines, I gave up and waited for news from the Beyond Compare people on what it would cost to upgrade from Standard to Pro version, so maybe they would provide a workable solution to this project.  It turned out that a previous version of Beyond Compare had a way of comparing that might be useful, so I looked into that but, unfortunately, I would have needed some of the features of the latest version to make it work.  Another Beyond Compare approach involved their Alignment Overrides option.  They had a way to experiment with this feature, which was in the Pro version.  This did not really help me much.

I wound up using Beyond Compare 3, using folder-by-folder comparisons and also comparisons regardless of folder.  I simply deleted, from the old drive, the files that were matched on the new one.  This left a bunch of empty folders.  I had always gotten rid of those using a batch file in a sequence of several steps, but I had never memorized exactly what those steps were, and what I was finding now wasn't working right.  I heard that Glary Utilities had a tool for this purpose, so I installed that freeware and went into its Modules tab > Files & Folders > Empty Folders Finder.  It found a boatload of them. Once that was out of the way, I ran the DOS command "DIR *.* /s > D:\DIRLIST.TXT."  This created DIRLIST.TXT, a list of the files left on the drive.  I imported that list into Excel -- it was now short enough to fit within the 65,000-record limit -- and sliced and diced it.  I started by using commands like MID and FIND to extract the file extensions.  Example:  =MID(F1,1,LEN(F1)-LEN(K1)), with F1 containing the imported filename and K1 containing the extension (e.g., .html), would give me the filename without extension.  (To get the extension, I did several columns of =FIND(".",F1,G1+1), where G1 contained the rightmost dot yet discovered in the filename, sorting and identifying and then excluding from further sorts.  Then I used ="DIR /S "&CHAR(34)&I2&"*.*"&CHAR(34)&" >> D:\FOUND_IT.TXT" to create a cell containing a DOS command that would look like this:  DIR /S "Letter to Joe*.*" >> D:\FOUND_IT.TXT.  This command would search for that filename and export its results to FOUND_IT.TXT.  Excel generated similar commands for each of the files I was looking for.  I copied and pasted the lot of them into a text file called Finder.bat.  At the start of Finder.bat, I added two lines:
D:
CD \
to make sure that the search would occur on the proper drive and would begin from its root.  I ran this overnight, and awoke to find that FOUND_IT.TXT had vanished.  No idea why.  So I ran it again, and this time simulataneously ran another batch file:
:: WAITER.BAT
:: Copies FOUND_IT.TXT to another folder
@echo off

:REPEET
copy FOUND_IT.TXT D:\Current
TIMEOUT /T 30 /NOBREAK
goto REPEET
Ultimately, this gave me a complete FOUND_IT.TXT, with lots of hits and misses.  I opened it in Word.  It was a 9,193-page document.  I did a bunch of search and replaces to get rid of useless lines, using search and replace terms like ^p (for line breaks) and ^t (for tabs).  At this point, I did not continue with the process of examining drive contents and logging my steps, deciding instead to defer this process until the next point when I would want to devote some time to this enterprise.

1 comments:

raywood

A later post provides a partial update to this one.