Saturday, March 3, 2012

Mitigating a Data Verification Nightmare: Thoughts on Removing Duplicate Files

I had gotten myself into a data nightmare, with a bunch of files that appeared partly duplicative of one another.  I wanted to get rid of the duplicates.  This appeared likely to be a long struggle.  This post is one battle in that war.

For starters, I used DoubleKiller (I had the pro version, but the freeware one would have helped too) to pluck the low-hanging fruit -- to delete, that is, the verifiably exact duplicates.  But now there were files with almost identical names (e.g., Longfilename and LongfilenameA), files with identical names but different extensions (e.g., was Filename.pdf simply a PDF version of Filename.doc?), filenames with slight differences (e.g., was 2010-09-10 Résumé an essentially identical duplicate of 2010-09-10 Resume if their times were identical but their sizes were different?), and so forth.
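For readers inclined toward scripting, the exact-duplicate part of this job (grouping files by a checksum of their contents), plus a rough pass at the accent-and-case near-matches, could be sketched in Python along the following lines.  This is only an illustration of the general idea, not a description of DoubleKiller's internals; the function names and the choice of SHA-256 are my own.

```python
import hashlib
import unicodedata
from collections import defaultdict
from pathlib import Path

def file_hash(path, chunk_size=65536):
    """Hash a file's contents so byte-identical files can be grouped."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def normalized_stem(path):
    """Strip accents, case, and the extension so 'Résumé.pdf' and 'resume.doc' compare equal."""
    stem = Path(path).stem
    return unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode().lower()

def group_candidates(folder):
    """Return (exact duplicates by content, suspects that merely share a normalized name)."""
    by_content = defaultdict(list)
    by_name = defaultdict(list)
    for p in Path(folder).rglob("*"):
        if p.is_file():
            by_content[file_hash(p)].append(p)
            by_name[normalized_stem(p)].append(p)
    exact = {h: ps for h, ps in by_content.items() if len(ps) > 1}
    suspects = {n: ps for n, ps in by_name.items() if len(ps) > 1}
    return exact, suspects
```

Anything turned up by the second grouping would still need eyeballing, of course: a matching name is merely suspicious, not conclusive.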

It was easy enough to just guess at it and delete the ones that looked like they might be duplicates.  In many cases, that would have been fine; it wouldn't have made any real-world difference.  Obviously, though, this would not be a good data management solution.  More like data abdication.  Second-best, I could identify likely duplicates and put them in a ZIP file, out of the way.  One problem with ZIP files, I had found, was that the reason for zipping them could fade from recollection over a period of years; and then, one bright day, someone might decide to see what was in there, and the monster would live again.
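As a sketch of that second-best option, something like the following would move suspected duplicates into a dated ZIP rather than deleting them.  The quarantine folder name is hypothetical, and the embedded README note is my own attempt to keep the reason for zipping from fading away.

```python
import zipfile
from datetime import date
from pathlib import Path

def quarantine(paths, archive_dir="quarantine"):
    """Move suspected duplicates into a dated ZIP instead of deleting them outright."""
    Path(archive_dir).mkdir(exist_ok=True)
    archive = Path(archive_dir) / f"{date.today()}-suspected-duplicates.zip"
    is_new = not archive.exists()
    with zipfile.ZipFile(archive, "a", zipfile.ZIP_DEFLATED) as zf:
        if is_new:
            # A note inside the archive, so the reason for zipping doesn't fade from memory.
            zf.writestr("README.txt",
                        "Suspected duplicates set aside pending verification; "
                        "see the dedup notes for details.")
        for p in paths:
            zf.write(p)          # the relative path is preserved inside the archive
            Path(p).unlink()     # remove the original only after it is safely stored
    return archive
```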

One general underlying purpose was to have files that would actually be useful.  I thought that the processes of gradually absorbing useful files and eliminating unnecessary ones might be aided if I could sort them by topic.  Here, again, there were some obvious ways of quickly taking care of large numbers of files, such as those that were already sorted into folders with meaningful names.  But that left quite a few that were not usefully categorized.

In some cases, I could categorize files just from the information in their names.  But I hated to spend the time to do it manually.  I tried to sort them into categories by identifying key multiword phrases in their names.  That effort became complicated by variations in punctuation and other textual vagaries.  Therefore, I started over, this time beginning with an effort to clean up punctuation and other aspects of the text.
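To give a concrete sense of the phrase-sorting idea: something like the following would clean up filenames and then count two-word phrases, so that the most frequent ones could suggest topic bins.  The specific cleanup rules (drop accents, collapse punctuation into spaces) and the two-word window are assumptions for this sketch, not a record of the exact steps I took.

```python
import re
import unicodedata
from collections import Counter
from pathlib import Path

def clean_name(name):
    """Normalize a filename fragment: drop accents, turn punctuation into spaces, lowercase."""
    name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode()
    name = re.sub(r"[^A-Za-z0-9 ]+", " ", name)
    return re.sub(r"\s+", " ", name).strip().lower()

def common_phrases(folder, n=2, top=20):
    """Count n-word phrases across cleaned filenames to suggest possible topic bins."""
    counts = Counter()
    for p in Path(folder).iterdir():
        if p.is_file():
            words = clean_name(p.stem).split()
            counts.update(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return counts.most_common(top)
```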

While that cleanup attempt was underway, I also looked for ways to reduce the number of filenames being sorted.  This brought me back to a focus on identifying duplicates.  It seemed, belatedly, that it would have been useful to have named all files according to a consistent rubric.  I took a look at that in a separate post.  That got me to a point where I was able to name many of the files in a certain standard way.  So, for example, an email would be named using a Date-From-To-Subject format.
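The details of that rubric are in the separate post.  Purely as an illustration, a standardized email filename might be assembled like this; the particular wording of the pattern and the character-stripping rule are my own guesses rather than the format I finally settled on.

```python
import re

def email_filename(date, sender, recipient, subject, ext="pdf"):
    """Build a Date-From-To-Subject filename; 'date' is assumed to arrive as YYYY-MM-DD."""
    def safe(text):
        # Drop characters Windows disallows in filenames, then trim whitespace.
        return re.sub(r'[\\/:*?"<>|]+', "", text).strip()
    return f"{date} Email from {safe(sender)} to {safe(recipient)} re {safe(subject)}.{ext}"

# email_filename("2010-09-10", "Jones", "Smith", "Re: Resume")
#   -> '2010-09-10 Email from Jones to Smith re Re Resume.pdf'
```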

Along the way, certain realizations forced themselves into my consciousness.  One was that, as a general rule, if it looks like a mess, it is probably a mess in ways that you haven't even imagined.  A corollary is that whatever you do, you will have to do over again, once you have discovered additional unforeseen ways in which the data are intertwined.  I did find that doing it wrong, several times over, was a good way to become familiar with possible starting points.  Having a good backup was essential.  Being able to reconstruct steps, by saving relevant documents in generational steps, was a real plus.

There was also the problem of deciding whether to nibble around the edges or make a decisive stroke to divide major problem areas from one another.  The quandary here was that the decisive stroke would likely go astray if it was not informed by prior familiarity with the actual fault lines shooting through the data.  In the worst case, you would not only add to the confusion, but would divide things in exactly the wrong direction, so that the procedures taken with Group A would have to be repeated with Group B -- except that, inevitably, they could not be repeated *exactly* with Group B, because the two groups would differ in some subtle but significant way.  But nibbling around the edges could turn relatively simple aspects of the project into enormously tedious exercises in repetition, as exceedingly minor pieces of the puzzle were interminably quasi-resolved.

I made these notes during a particularly dark moment in the process.  I am pleased to report that, later on, when I returned to these notes to wrap them up and post them, I had turned to other projects, and was thus making good progress toward resolving the data verification nightmare by simply ignoring it until it (or I) went away.