Monday, August 27, 2007

Long-Term File Backup and Integrity

At some point, I began saving a lot of data on my computer. These data included copies of letters, photographs, and all sorts of other things. Over time, I transferred these data from one hard drive to another, or from hard drive to CD/DVD or vice versa. On rare occasions, I would find that a file that I thought I had saved properly had somehow become corrupt. I also wondered whether I would even notice if some old files somehow disappeared from my system. These thoughts prompted me to think about adopting or developing a reliable way of making sure that my files remained in good condition over time.

The Contours of the Problem

It seemed to me that there were two basic concerns. One had to do with having a reliable backup system. Somehow, my files needed to move from my immediate workspace (where I had written a letter, scanned a document, imported a photograph, etc.) to longer-term storage, and I needed some reliable way of making sure that they survived each of those transitions. The other basic concern was that, in that process, the computer file that purported to be my saved document or photograph was, in fact, readable or otherwise usable. This second concern called for a way to verify the integrity of each of the saved files. (There is also an authenticity concern, not examined here, that raises the question of whether third parties have interfered with file contents in ways that a verification check might not reveal.)

The backup component, in turn, seemed to have two subparts. First, I needed to have devices capable of capturing and preserving my files. Second, I needed a logging system of some sort, so that I could be confident that all of the files that had been on the original device were now present on the receiving device. Of course, a hard drive could capture my data, but would not necessarily preserve it.
Hard disks were known to crash, to get stolen, and to suffer data loss when the computer system shut down abruptly because of failure in some other component. I had long been in the habit of making backups. For some years, I had used tape. More recently, I had switched to a combination of DVDs and offline hard drives. But it seemed that it could be complicated to know when a file or directory from an offline disk should be reintroduced -- to know, that is, that some file or directory on the computer had ceased to be useful and should be replaced by its backup. To avoid the need to undertake constant revalidations of large numbers of files, it made sense to work toward keeping as much data as possible offline, so that it would be unlikely to be affected by day-to-day issues in the functioning of the computer.

For the files that did remain on the system, it seemed that there were two basic approaches, which one could characterize as "trust" or "verify." In the trust approach, a person would use backup software that would indicate that it had made a full, incremental, or other kind of backup; and collectively, those various backups would be assumed to provide an up-to-date counterpart of the data maintained on the system. In the verify approach, one would actually compare (by manual and/or automated means) the dates, sizes, names, and perhaps other details of the original files and the backups; and through this verification process, one could directly certify that the backup contained exactly one copy, and an accurate copy at that, of each online file. A simple example of the verify approach would be to note that, when copying one hard drive to another, there were exactly 2,324 files on both drives at the end of the copying process. Either approach would ideally produce a log of some sort, so that a person could reliably track down where the backup copy of a particular file was located.
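To make the verify approach concrete, here is a minimal Python sketch of the kind of comparison described above -- not any program I used at the time, and with function names of my own invention. It compares two directory trees by relative path and file size, reporting files missing from the backup, files present only on the backup, and size mismatches:

```python
import os

def snapshot(root):
    """Map each file's path (relative to root) to its size in bytes."""
    files = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            files[rel] = os.path.getsize(full)
    return files

def compare(source_root, backup_root):
    """Compare two trees; return (missing, extra, size_mismatch) lists."""
    src = snapshot(source_root)
    dst = snapshot(backup_root)
    missing = sorted(set(src) - set(dst))    # on source but not on backup
    extra = sorted(set(dst) - set(src))      # on backup but not on source
    size_mismatch = sorted(p for p in src if p in dst and src[p] != dst[p])
    return missing, extra, size_mismatch
```

A same-size file could still be corrupt, of course, which is exactly the gap that the checksum technologies discussed below are meant to close; this sketch only automates the "2,324 files on both drives" style of counting and naming check.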
While either approach could fail, it seemed that the trust approach would be more vulnerable to, say, the unpleasant discovery that the backup did not contain what one assumed it contained. Verification of the quality of the copied files was more than a matter of noting that a certain file had been copied, or even that the copy bore the same name, date, time, and size. In some instances, Windows XP had been known to indicate that it had copied a file when, because the copying process was interrupted before completion, the copy did not actually contain the file's full data. At the start of this present inquiry, I had the general impression that there existed programs capable of verifying that the copy was indeed a true and accurate one, but I had had very little actual exposure to the workings of any such programs. As a practical matter, the verification component of this inquiry also raised the question of whether a file's format continued to be readable by currently available software and, if not, how one could become aware of that and respond to it on a timely basis.

Working toward Solutions: Backup

While corporate users might have more time and money for sophisticated approaches, my own needs and experiences suggested that I divide my data storage needs into three categories: online, offline, and offsite. Offsite was easiest to describe: it was, ideally, a copy of everything that I kept onsite, preserved solely for purposes of emergency backup. Of course, unless I planned to make trips to the offsite location every hour, there would be some slippage between the two. I could remedy that, in part, by using quasi-offsite storage. For example, if I kept my offsite storage at my office, I could put my latest offsite backups into my car and could then drop them at the office the next time I went there.
That way, if my house burned down, my car would contain a fairly recent backup of my onsite data -- subject, of course, to the risk that the car might get too hot, too cold, or too humid, or that some thief or passenger would accidentally or intentionally damage or remove it. So offsite storage seemed derivative: it depended on what I was doing onsite.

The onsite options, as I say, seemed to be divided between those things that I wanted or needed to keep online, i.e., immediately available on the computer, and those that I could keep offline, i.e., on a hard drive or DVD that I would connect to the computer as needed. For purposes of minimizing the backup and verification chore, the idea of keeping as much data as possible offline had been greatly advanced by the development of large-capacity hard drives and the proliferation of connection technologies. I could now get a large-capacity hard drive at a relatively low cost; could keep it on a shelf; and could connect it to my computer, within a minute or less, using USB, FireWire, or SATA cables.

The concept, then, was that I could have my active-duty data on the computer, and my relatively inactive data on an offline (on-the-shelf) drive or DVD, and I could back them both up to another hard drive, normally kept offsite, every now and then -- weekly, biweekly, or perhaps monthly. It could have made sense, in this scenario, to preserve historical backups in case the absence or corruption of a file was not immediately noticed (or to protect against vulnerability at those times when the offsite drive came back onsite to be updated). Protection against these risks would likely call for having more than one offsite drive. Aside from offsite backups, there would presumably also be onsite backups, made more frequently.
Ideally, as I had experienced when I used a tape drive, the backup software would kick in at scheduled times and would do its work in the background; and when it was full or finished, a reminder would pop up and I would replace the full tape with a new, empty one. It was now possible to do something similar with current backup software, using either online or mostly offline hard drives. (Quick searches at Newegg and Pricegrabber indicated that, at this writing, tape drives having 100GB or more of compressed capacity cost $400 or more, not counting media. Much smaller tape drives were available, but were still not competitive with hard drives or, increasingly, with flash drives.)

Thus, a person would seemingly have three distinct sets of data at home. First, there would be online data, primarily stored on hard drives in the computer. Second, there would be offline data, stored on disks on the shelf. Third, there would be backups of the first two categories, made more frequently than the weekly, biweekly, or monthly offsite backups. And as a possible fourth category, there could be stacks of DVDs or other media containing archival copies of old backups, in case one wanted to refer back to, say, the status of a certain file as of January 1 of the previous year.

Working toward Solutions: Verification

For purposes of maximum protection against data loss, if it were feasible, a person would ideally undertake some kind of verification effort in each of the backup processes described above, and would also do some verification when files were being copied from one drive to another. Indeed, verification would ideally occur whenever files were duplicated, deleted, transferred, or otherwise altered, as compared to where they had been when someone last checked up on them. At the other extreme, a verification effort might catch most failures even if it was postponed until the last possible date.
The last possible date would presumably be the date on which the backup source was no longer available or reliable. Thus, for example, if a person had a stack of CDs that contained a backup of his/her computer as of January 1, 1997, it might still be possible to verify the current data on one's computer against those old CDs, and that possibility might continue to exist for as long as the CDs did not get thrown out or become too old. The old CDs would not provide any insights regarding files that had been modified or created after January 1, 1997; but at least they could indicate whether the files that had remained unchanged since then were still in good condition.

Since those old CDs could fade when nobody was looking, a more reasonable intermediate approach might be to go through the closet for old backups; do the comparison of then vs. now; restore old originals to replace corrupted current copies on the computer as appropriate; and then burn a new current backup of the system, to replace the old set in the closet. Before burning that new set, one would probably want to compare as many original sources as possible. That is, since the January 1, 1997 CDs would not contain more recent files, one might want to check the computer's current data not only against that old set, but also against more recent sets.

Current Technologies

The foregoing discussion has already provided a partial answer to the question of what technologies, specifically, one should use for purposes of backup and verification of stored data. At present, for many, it may be fastest, most affordable, and most convenient to make backups on hard drives, supplemented for some purposes by DVDs or other increasingly affordable media (e.g., flash drives, dual-layer DVDs, DVD-RWs).
Verification, for these purposes, commonly consists of a trust approach based upon the presumed functioning of the file copying setup (as when one transfers files from one computer to another using Windows Explorer) or of the backup software (which may offer the valid or invalid claim that it verifies the files it copies in terms that match the user's expectations).

The foregoing discussion also leaves some things unanswered, however. On the backup side, how would one notice if some files accidentally vanished? As long as the data being backed up have not been used since the last backup, the number of files should be the same. Unfortunately, active-duty files do change in number. For instance, a document may be divided into several subparts, or subparts may be combined into one file. It may not be feasible to examine files individually to ensure that such changes are always authorized. Long-term archival backups on DVDs may be, ultimately, the only insurance against that sort of thing. It may be somewhat more possible, however, to proceed on a directory-by-directory basis, in an effort to produce a list of folders whose contents have changed since the previous backup. That sort of listing, with some filtering (e.g., to focus on user data files, as distinct from program files whose variations are beyond the average user's scrutiny), could provide at least an occasional heads-up as to what is changing in one's file collection.

There were several different file verification technologies, at this writing, including Simple File Verification (SFV), Message-Digest algorithm 5 (MD5), Secure Hash Algorithm 1 (SHA-1), and simple checksums. A number of computer programs incorporated these technologies in various ways. The common goal, in general terms, was to use the data within a file to calculate a number. A very simple example, simpler than the one at the foregoing Wikipedia link for Checksum, would arise where one combined the numerical values of letters.
In the string "adb," for instance, the letter "a" would have a value of 1, because it comes first in the alphabet; "d" would have a value of 4; and "b" would have a value of 2. If the algorithm in question called for adding the first two letters and subtracting the third, the result would be 1 + 4 - 2 = 3. If one of the letters got changed, the result would generally not be 3 anymore. A person could store the value of 3 with that "adb" string, and could re-run the calculation at any time to make sure the string's contents had not changed. More complex algorithms, based upon large quantities of data within a file, significantly reduced the possibility that part of a file could change without also changing the resulting value. So, for these technologies, the basic idea was that one would calculate the sum when the file was created; would keep that sum with the file; and would refer back to it whenever one wanted to verify that the file had not changed.

From the several technologies just mentioned, a number of Windows-compatible computer programs had been generated. These included FastSum (free command-line version; $14.95 graphical user interface (GUI) version); md5sum (free); sha1sum (free); FSUM (free); MD5Summer (postcardware); Advanced CheckSum Verifier ($14.95); eXpress CheckSum Verifier (free); HyperHasher ($10.00); EF CheckSum Manager (free); AccuHash ($19.95); MD5 CheckSum Verifier ($14.95); MD5 Checker (free); FileCheckMD5 (free); HashCalc (free); Turbo WinMD5 (free); File Ace ($29.95); Chaos MD5 (free); GizmoHasher (free).

Among those labeled as freeware, the programs bearing a rating of four or five stars from one download site (which did not specify the number of votes in any case) included EF CheckSum Manager (four stars), MD5 Checker (five stars), and HashCalc (five stars). The only one bearing a rating of at least four stars from another site was FSUM (four votes).
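The toy calculation just described takes only a few lines of code. Here is a sketch in Python (the function names are my own, purely for illustration):

```python
def letter_value(ch):
    """Position of a letter in the alphabet: a=1, b=2, ..., z=26."""
    return ord(ch.lower()) - ord("a") + 1

def toy_checksum(s):
    """Toy checksum for a three-letter string: first + second - third."""
    assert len(s) == 3
    return letter_value(s[0]) + letter_value(s[1]) - letter_value(s[2])

# "adb": 1 + 4 - 2 = 3; change any letter and the result changes.
```

This toy version illustrates the idea, and also its weakness: swapping "adb" for, say, "bdc" yields 3 as well, which is why real algorithms such as MD5 and SHA-1 digest all of a file's bytes into a much larger value.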
The only one rated above four stars at CNET's download site was GizmoHasher (4.5 stars from six voters; 1,172 downloads); next closest was MD5 Checker (3.5 stars from five voters; 16,539 downloads). Note: these several ratings sites did not necessarily carry all of these programs. Finally, free programs rated Excellent at Softpedia included Windows Md5 Checksum (4.7, seven votes, 9,093 downloads), MD5 Calculator (5.0, six votes, 4,713 downloads), hkSFV (4.7, 17 votes, 10,283 downloads), and FileAlyzer (4.7, four votes, 2,283 downloads). [These last two paragraphs posted on Wikipedia.]

Based upon this review, it appeared that the best free verification programs for Windows might include MD5 Checker, GizmoHasher, and the four from Softpedia. More detailed review of those programs' writeups suggested that GizmoHasher, Windows Md5 Checksum, and hkSFV would be especially useful for dealing with directories and with large numbers of files (as distinct from checking one file at a time). I wasn't able to find an English-language homepage for Windows Md5 Checksum, so as to learn more about its capabilities with multiple files and folders. The creator of GizmoHasher no longer appeared to be offering it, so there was no information there either. I also could not find a homepage for hkSFV.

Lacking other clear guiding criteria, I started with hkSFV, the most popular by the measures cited in the previous paragraph. I ran hkSFV on a sample folder. It created a little SFV or MD5 file -- your choice -- in that same folder. My understanding was that I could then check that folder, a month or a year later, and the program would hopefully report if any file had gone bad. To test this, I changed one of the files: I inserted a plain text file in place of an Excel spreadsheet and gave it the same name as the spreadsheet. I also deleted several files. Then I ran hkSFV again, with cached mode off. Sure enough, it reported the deleted files as not found.
Oddly, unless I had made a mistake, it appeared to recognize that the text file with an .xls extension was actually a text file; I think it renamed it as filename.txt. But when I changed it back to .xls and tried again, the hkSFV program crashed, which was not entirely encouraging. I ran the same procedure again, once again opening my little MD5 file in that folder, and yes, it definitely did rename the so-called .xls file to be, as it actually was, a .txt file; and it reported that that file was OK. So while I was not too encouraged by the crash, I thought maybe I had caused it by fiddling with the folder while the program was open. I definitely did like its ability to supply correct file extensions.

These steps were only the start of my process. Before I would be prepared to generate MD5 checksums for each data folder, I wanted to make sure that the folders contained actual working files, not just duds, and that they were in the proper archival format; and also that I had eliminated duplicates and had otherwise generally prepared each folder not to be changed very often. That would be a project for the future.
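For the curious, the kind of per-folder MD5 file that hkSFV creates can be sketched in a few lines of Python. This is not hkSFV's actual implementation -- the manifest name and function names below are my own inventions -- but it follows the common "digest *filename" line format used by md5sum-style tools:

```python
import hashlib
import os

def md5_of_file(path):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(folder, manifest="checksums.md5"):
    """Record one 'digest *name' line per file in the folder."""
    with open(os.path.join(folder, manifest), "w") as out:
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if name != manifest and os.path.isfile(path):
                out.write("%s *%s\n" % (md5_of_file(path), name))

def verify_manifest(folder, manifest="checksums.md5"):
    """Recompute digests; return ([changed files], [missing files])."""
    bad, missing = [], []
    with open(os.path.join(folder, manifest)) as f:
        for line in f:
            digest, name = line.rstrip("\n").split(" *", 1)
            path = os.path.join(folder, name)
            if not os.path.exists(path):
                missing.append(name)
            elif md5_of_file(path) != digest:
                bad.append(name)
    return bad, missing
```

Running verify_manifest a month or a year later would, like hkSFV, flag deleted files as missing and silently corrupted files as changed -- which is exactly the long-term integrity check this inquiry was after.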