Sunday, May 8, 2011

Windows 7: Verify That Data Files Are in Working Condition: JPG, MP3, PDF

I wondered if there was a way to test my data files, to make sure they would actually open without errors.  I posted a question on it, but that didn't get too far.

Eventually, I did find a couple of ways to test JPGs and other image files.  IrfanView was my favorite tool for this purpose.  I decided I wasn't really too concerned about corrupt spreadsheet and document files, since I rarely encountered anything like that.  I was in the habit of converting my documents into PDFs for storage.  JPGs, MP3s, and PDFs were probably the most numerous file types on my system, so I decided to focus on those for now.

For MP3s, a search led to a thread that identified a number of possible testing utilities.  I ran a search for several (namely, MP3 Checker, Mpck, MP3 Diags, MP3Utility, and MP3 Validator) and came away with a preliminary impression that MP3 Diags and MP3val were relatively popular.  None seemed to be listed on CNET.com, but I found MP3 Checker (2,111 downloads), MP3 Diags (3,387 downloads), and MP3val (1,925 downloads) on Softpedia.  It looked like MP3 Diags was being actively developed and had relatively good file correction possibilities, so I downloaded that.  The developer warned of potential data loss, so I decided I wouldn't necessarily use it to edit any MP3s until I was in a position to test them after the changes and make sure things had gone OK.  I ran it on a folder containing about 120 MP3s.  It ran for just a minute or so and identified problems with various songs (e.g., certain tags not found, low quality, two ID3v1 tags found when there should be no more than one).  It made these errors graphically visible, so that I could quickly see which files had the more worrisome kinds of errors (e.g., "Unknown stream found.  Since other streams follow, it is possible that players and tools will have problems using the file.").  In short, I liked MP3 Diags.  Granted, I had not used it to fix anything.  But it made a good impression.

Now, how about testing PDFs?  A search did not yield much immediate help.  (A different search, later, was a whole different story, but I didn't get that in time for this.)  One suggestion was to automate printing them and see which ones printed.  It would have been possible to copy them all from various subfolders to a single subfolder, assuming duplicate names had first been resolved using something like DoubleKiller.  A Windows search would achieve this; so would a batch command using XCOPY.  From there, I could batch print to other PDFs, and perhaps the printing process would identify bad files.  I wondered whether an IrfanView conversion (as used in JPG testing) would do the same thing.  I tried opening a PDF in IrfanView and got an error:

Decode error !
Can't load Ghostscript or Ghostscript error.
Install Ghostscript from:
http://sourceforge.net/projects/ghostscript/
or
http://sourceforge.net

The latter appeared to be the current master site for Ghostscript.  I thought I did already have it installed, but perhaps not the latest version.  I tried using IrfanView (File > Batch Conversion/Rename) to convert several PDFs to JPGs, but got an error that way too:  "Error!  Can't load [filename].pdf."  To update Ghostscript, I downloaded and installed what looked like the Windows 32-bit version.  I tried the IrfanView conversion again, and that worked.  So it wasn't going to be necessary for me to explore the alternative of using PDFtoHTML.  That route would apparently have required me to install Windows versions not only of Ghostscript and PDFtoHTML but also of PDF2HTMLgui, which looked like it might be hard to find.  Nor would I need to consider installing Xpdf, apparently an alternative to Ghostscript, or installing some relevant program (PDFtoHTML, I think) via GnuWin, which was going to be simplified by installing GetGnuWin.

Fortunately, I didn't even have to think about all that.  I just ran an IrfanView batch process on a bunch of PDFs.  To test whether this was going to work properly, I inserted a bad PDF among those being processed.  To create a bad PDF, I searched for a hex editor, downloaded HexEdit, opened a copy of a small PDF file, looked in the ASCII pane (the right-hand one, in HexEdit; the one with occasional text rather than all numbers) for a reference to Root ## 0 R. (in my case, it was Root 9 0 R.), changed ## to 00 (so in my case, it came out being Root 00 0 R.), and then saved and tried it out.  Sure enough, when I tried to open the bad PDF, I got "There was an error opening this document.  The root object is missing or invalid."  So now I ran an IrfanView batch conversion of PDF to JPG, including that bad file (putting the output in a folder called X, which would cue me that I could delete the whole thing without looking at it).  Of the four files I tried to convert, three converted OK.  The bad PDF gave me an error in IrfanView and nothing in the output folder.  So I would be able to just save the IrfanView report and examine the error messages; or if that failed, I could hunt around for a suitable folder comparison tool, or use a spreadsheet to compare folders, so as to work up a list of the PDFs that had failed the conversion process.
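
The folder comparison mentioned above could also be done with a few lines of script rather than a spreadsheet.  What follows is only a rough sketch in Python, not something I used at the time.  The function name find_failed_conversions is made up for illustration, and it assumes that each PDF that converts successfully leaves exactly one JPG with the same base name in the output folder (multi-page PDFs might not behave that way).

```python
from pathlib import Path


def find_failed_conversions(source_dir, output_dir, src_ext=".pdf", out_ext=".jpg"):
    """Return names of source files that have no same-named file in output_dir.

    Assumption: a successful conversion of Report.pdf leaves Report.jpg behind.
    """
    source_dir, output_dir = Path(source_dir), Path(output_dir)

    # Base names (lowercased, extension removed) of everything that converted.
    converted = {
        p.stem.lower()
        for p in output_dir.iterdir()
        if p.is_file() and p.suffix.lower() == out_ext
    }

    # Any source file whose base name is missing from that set is suspect.
    return sorted(
        p.name
        for p in source_dir.iterdir()
        if p.is_file()
        and p.suffix.lower() == src_ext
        and p.stem.lower() not in converted
    )
```

Pointing this at the source folder and at the X output folder should list the PDFs that produced no JPG, which are the ones worth opening by hand.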

The spreadsheet approach, used with a command line process, might be best for those situations where the files I wanted to test were scattered across multiple folders.  What I would want, in that case, would be the ability to execute a batch command containing commands of this general form:
convert D:\Folder1\File35.pdf to D:\TestFolder\File35.jpg
convert D:\Folder2\File18.pdf to D:\TestFolder\File18.jpg
A problem there was that File35.jpg might already exist in TestFolder.  This could happen because there could be files named File35 in two different source folders, and that would become evident only when I tried to put them both (in JPG format) into one target folder.  One way to avoid that would be to begin with a DoubleKiller search for JPGs with duplicative names (having previously done a DoubleKiller search for files of any sort that had identical sizes and CRCs).  Alternatively, I could test my spreadsheet-generated commands to see if they were going to produce duplicate filenames, and add a formula to change them as needed.  As for the "convert" part of that ideal command (above), I posted a question in the IrfanView forum, and then found a suggestion that the command I wanted would be like this:
i_view32 D:\Folder1\File35.pdf /c=d:\TestFolder\File35.jpg 
They said this (specifically, the "c=" option) would work to convert among all formats that IrfanView could handle except AVI, MOV, MPG, WAV, and MID.  (There would presumably have been PATH problems if I'd been running the portable version of IrfanView; the command line presumably wouldn't have known where to look for a non-installed program executable.)  Later, in response to the question I posted, someone said that, of course, I should have just gone into IrfanView's F1 (Help) > Contents tab > Overview > Command Line Options.  Which, when I finally did that, wow, there were a lot of them.  That help piece began with the advice to "See the 'i_options.txt' (IrfanView folder) for the most recent version of all command line options."  That file said I could use "/convert=" rather than just "/c=" and also that I should "See pattern help file page for more options."  At first, I thought that referred back to the F1 help page I had just come from; it had some conversion examples.  Those examples seemed mostly to show how I could include other command-line options at the same time.  One interesting option:  /filelist=txtfile would apparently use filenames contained in a file called "txtfile," so that apparently I would not have to repeat this command in full (on the command line or in a batch file) for each file being processed.  It said the conversion command would support wildcards.  Then I noticed that the pattern page was actually in a different place in IrfanView help:  it was under Options Menu > Text/Pattern Options.  There were variables or "placeholders" for a variety of components (e.g., $D was shorthand for the full path of the file being converted).  This seemed to mean that a command like "i_view32.exe d:\Folder1\*.jpg /c=d:\$D$N.pdf" would convert all the JPGs in Folder1 into PDFs.  I was going to have to play around a bit to understand clearly how that worked.
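
For files scattered across many folders, the spreadsheet step of generating one command per file could be sketched in Python instead.  This is only an illustration under some assumptions:  the function name build_commands and the "_2" renaming scheme are my own inventions, the paths are assumed to contain no spaces (I have not checked how IrfanView wants such paths quoted), and the only IrfanView option used is the /convert= form mentioned in i_options.txt.

```python
from pathlib import PureWindowsPath


def build_commands(sources, target_dir, out_ext=".jpg", exe="i_view32"):
    """Build one conversion command line per source file.

    If two sources share a base name, later ones get _2, _3, ... appended
    so they do not overwrite each other in the single target folder.
    """
    target = PureWindowsPath(target_dir)
    used = set()  # output base names already taken (lowercased)
    commands = []
    for src in sources:
        src_path = PureWindowsPath(src)
        candidate = src_path.stem
        n = 1
        while candidate.lower() in used:
            n += 1
            candidate = f"{src_path.stem}_{n}"
        used.add(candidate.lower())
        dst = target / (candidate + out_ext)
        commands.append(f"{exe} {src_path} /convert={dst}")
    return commands


if __name__ == "__main__":
    # Hypothetical paths, following the File35 example above.
    for line in build_commands(
        [r"D:\Folder1\File35.pdf", r"D:\Folder2\File35.pdf"], r"D:\TestFolder"
    ):
        print(line)
```

The printed lines could be saved into a batch file and run from a CMD window, after which any source file without a matching output would be a candidate for closer inspection.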

At the time when I closed this post, this process was still underway.  Additional steps I took were to run DoubleKiller for duplicative JPG filenames and then to do a directory listing of all JPGs on my drive.  To do that, in a CMD window, I went to the root folder (i.e., D:\ ) and typed this:  DIR *.jpg /a-d /s /b > JPG-List.txt.  That gave me a text file (JPG-List.txt) showing where all the JPGs were.  I put that into a spreadsheet and tested for duplicate output filenames.  Having already worked through duplicate filenames in DoubleKiller (by exporting the list of duplicates and generating batch-renaming commands in a spreadsheet), I did not find any duplicates now.  But I could tinker with filename extensions (e.g., .bmp, .jpg, .jpeg, .tif) and perhaps find that the same file existed under multiple names.  This was not exactly the same question as whether I had duplicates of the same photo; this was more a question of whether I had duplicate files under similar names.
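
The check for the same base name turning up under different extensions could be run directly against a DIR /s /b listing.  Here is a minimal sketch in Python, assuming the listing has one full path per line.  The name same_name_groups and the set of extensions are just illustrative choices.

```python
from collections import defaultdict
from pathlib import PureWindowsPath

# Extensions treated as "the same kind of file" for this check (an assumption).
IMAGE_EXTS = {".bmp", ".jpg", ".jpeg", ".tif"}


def same_name_groups(listing_lines, extensions=IMAGE_EXTS):
    """Group full paths by base name, ignoring case and extension.

    Returns only the base names that occur more than once.
    """
    groups = defaultdict(list)
    for line in listing_lines:
        line = line.strip()
        if not line:
            continue
        path = PureWindowsPath(line)
        if path.suffix.lower() in extensions:
            groups[path.stem.lower()].append(line)
    return {stem: paths for stem, paths in groups.items() if len(paths) > 1}
```

Feeding it the lines of a listing made with a broader DIR command (one that includes all of those extensions, not just *.jpg) would show, for example, a Photo1.jpg and a Photo1.tif sitting in different folders.  Whether those are really the same picture would still need a look.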