Monday, February 13, 2012

Windows 7: Testing/Verifying/Validating PDFs

I had previously gotten the impression that I could test PDFs by using IrfanView to convert them to JPGs.  (This was different from the approach I had recently taken to merge scattered JPGs into multipage PDFs.)

In a dry run, IrfanView had balked at a bad PDF, but had converted the good ones.  So the scenario was that I would run the conversion; check the output folder; verify that it had the correct number of files; and if the numbers of files didn't match up, I would go hunting for whatever was missing.

Now I had a bunch of PDFs that I wanted to test, so it was time to try out that theory.  I had found it helpful to use an Excel spreadsheet to work up the list of files to test.  (The post on multipage PDFs, above, contained some discussion of how I used Excel to create and massage lists of files.  More information appeared in an earlier post on renaming thousands of files and in another recent post on using a batch file to sort files.)

The PDFs that I wanted to test were scattered across different folders.  The conversion scenario would have me create JPGs from these PDFs, where the JPGs would all be in one folder.  That way, I could easily count them, keep them from cluttering up other folders, and delete them after I was done counting them.  A problem with all those JPGs converging into one folder was that I might have two identically named source PDFs in different folders.  For instance, there might be something called File001.pdf in D:\FolderB, and another completely different File001.pdf in E:\FolderQ.  The JPGs resulting from these two different PDFs would either overwrite one another or fail to come into existence, depending on how I set IrfanView's conversion process.  This would screw up my count and would potentially fail to test some PDFs.  I could surmount this problem by batch renaming those files into unique names, as long as I kept the list of what I had renamed so that I could rename them back when I was done screwing around.  That approach would involve time-consuming extra steps, though, so I was hesitant.  (For more information, go to this webpage and search for "ZZZ_00001.jpg.")
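
The name-collision check could also be scripted rather than worked up in a spreadsheet.  What follows is only a sketch in Python, not part of the process described in this post.  The function names and the sample paths are made up for illustration, and the ZZZ_ prefix simply echoes the renaming pattern mentioned above.

```python
from collections import defaultdict
from pathlib import PureWindowsPath

def find_name_collisions(paths):
    """Group full paths by file name; return only the names used more than once."""
    by_name = defaultdict(list)
    for p in paths:
        # Windows treats file names as case-insensitive, so compare in lowercase.
        by_name[PureWindowsPath(p).name.lower()].append(p)
    return {name: group for name, group in by_name.items() if len(group) > 1}

def unique_names(paths, prefix="ZZZ_"):
    """Map each path to a temporary unique name; keep the mapping to rename back later."""
    return {
        p: f"{prefix}{i:05d}{PureWindowsPath(p).suffix}"
        for i, p in enumerate(paths, start=1)
    }

if __name__ == "__main__":
    sample = [
        r"D:\FolderB\File001.pdf",
        r"E:\FolderQ\File001.pdf",
        r"D:\FolderB\Other.pdf",
    ]
    print(find_name_collisions(sample))
    print(unique_names(sample))
```

Running the collision check first would show whether the renaming step is needed at all.  If no names collide, the extra work can be skipped.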

There was another problem, as I thought about it.  A three-page PDF would presumably convert into three one-page JPGs.  So my file count would get messed up that way too.  I could probably opt to convert instead from multipage PDFs into multipage TIFs, but I wasn't sure what would happen if one page of a PDF was junk.  I would have to experiment to see if the TIF would swallow it or barf.

These reveries were interrupted by an actual test.  I tried using IrfanView to batch-convert three PDFs to JPG.  It gave me errors:  "Can't load D:\Current\Text\x1.pdf" (and likewise for x2.pdf and x3.pdf).  One was a single-page PDF, so multipage issues weren't the problem.  I couldn't understand it.  I had previously used IrfanView for this purpose.  The PDFs opened OK in Adobe Acrobat (and would presumably do so in a free or less expensive alternative to Acrobat).

It occurred to me that maybe I could use Acrobat > Advanced > Document Processing > Batch Processing.  I tried that on my test files, saving them as RTFs rather than JPGs to circumvent multipage issues.  It was surprisingly slow, and the resulting RTF files were empty.  Not a promising start.  Going back to somewhere near the original plan, I tried using Acrobat to convert to JPG instead of RTF, and that worked.  As expected, each PDF page became a distinct JPG.  For instance, x2.pdf became x2_Page_01.jpg and x2_Page_02.jpg and so forth.  I would have to do further filename massaging in Excel, or maybe run a series of DEL commands (e.g., DEL *02.jpg, DEL *03.jpg, etc.) to see if the number of output JPGs (or groups thereof) matched the number of PDFs tested.
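
Instead of deleting page JPGs until the counts line up, the page suffix could be stripped and the groups compared against the list of PDFs.  This is a hedged sketch in Python, assuming the x2_Page_01.jpg naming seen in the test above.  The function names are hypothetical, and a PDF whose own name happens to end in _Page_ plus digits would be grouped wrongly.

```python
import re
from pathlib import PureWindowsPath

# Assumed naming, based on the test above: x2_Page_01.jpg, x2_Page_02.jpg, ...
PAGE_SUFFIX = re.compile(r"_Page_\d+$", re.IGNORECASE)

def source_stems(jpg_names):
    """Return the set of PDF base names implied by a list of page-JPG names."""
    stems = set()
    for name in jpg_names:
        stem = PureWindowsPath(name).stem               # drop ".jpg"
        stems.add(PAGE_SUFFIX.sub("", stem).lower())    # drop "_Page_NN" if present
    return stems

def pdfs_without_jpgs(pdf_names, jpg_names):
    """List the PDFs for which no page JPG turned up in the output."""
    converted = source_stems(jpg_names)
    return sorted(
        n for n in pdf_names
        if PureWindowsPath(n).stem.lower() not in converted
    )

if __name__ == "__main__":
    pdfs = ["x1.pdf", "x2.pdf", "x3.pdf"]
    jpgs = ["x1_Page_01.jpg", "x2_Page_01.jpg", "x2_Page_02.jpg"]
    print(pdfs_without_jpgs(pdfs, jpgs))
```

This sidesteps the page-count problem, because any number of pages collapses back to one base name per PDF.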

I wondered, at this point, why I couldn't just batch print the PDFs being tested -- print them to PDFs in another folder, that is, and do the file count and then delete the prints.  Would a junk PDF print?  I created a junk PDF by taking a copy of one of my test files, opening it in Hexedit, looking a little ways down in the ASCII column for Root # 0 R (in this case, it was Root 124 0 R), changing it to Root 00 0 R (inserting 30 as the hex value for zero), and saving it.  Then I made Bullzip my default PDF printer, changed its settings so that it would print without stopping to ask questions (via the Options shortcut in the Bullzip program folder), selected all four of my test files, and went to right-click > Print.  It printed three of the four.  I didn't have the settings right -- it still asked for filenames -- but the test worked:  for the file I had just made into a junk PDF, Bullzip (or, actually, Acrobat, my default PDF reader) gave me an error message ("There was an error opening this document.  The root object is missing or invalid" -- which was, of course, exactly what I had changed), and no output PDF was created.  So this approach of trying to print to PDF would work to identify at least some kinds of defective PDFs.
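
The hex-editor step for making a junk PDF could be reproduced in a few lines of Python.  This is a sketch under assumptions, not the procedure I used.  It zeroes every digit of the object number (Root 124 0 R becomes Root 000 0 R) so that the file length stays the same, which differs slightly from the Root 00 0 R edit described above.  The function names are hypothetical, and I have not checked that every reader reports the same error for a file damaged this way.

```python
import re

# Matches a catalog reference such as b"/Root 124 0 R".
ROOT_REF = re.compile(rb"/Root\s+(\d+)\s+(\d+)\s+R")

def break_root_reference(pdf_bytes):
    """Return a copy of pdf_bytes with the object number in each /Root reference zeroed.

    Digits are replaced one for one with "0", so the byte length is unchanged.
    """
    def zero_out(match):
        text = match.group(0)
        start = match.start(1) - match.start()
        end = match.end(1) - match.start()
        return text[:start] + b"0" * (end - start) + text[end:]

    new_bytes, count = ROOT_REF.subn(zero_out, pdf_bytes)
    if count == 0:
        raise ValueError("no /Root reference found")
    return new_bytes

def make_junk_copy(src_path, dst_path):
    """Write a deliberately damaged copy of src_path to dst_path; the original is untouched."""
    with open(src_path, "rb") as f:
        data = f.read()
    with open(dst_path, "wb") as f:
        f.write(break_root_reference(data))
```

Working on a copy matters here, since the whole point is to produce a file that will not open.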

Unfortunately, that error message didn't specify which file was defective.  So that approach would require me to subtract the files that had successfully printed from the larger set of files that I had requested to be printed.  That might be a pretty fast process, if I used the same output filenames, paused for five or ten minutes, and then used Windows Explorer to copy the output PDFs over the original PDFs and then sorted by timestamp.  The PDFs with the visibly older timestamp, after that maneuver, would presumably be the ones that had failed to produce anything that would overwrite them.  This approach would wipe out my originals, which I would not want, so it would probably be best done using copies of the originals.  If there was some need to reverse the timestamp, I could probably fiddle with the system clock before printing, so as to produce artificially ancient output PDFs.
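
The sort-by-timestamp step could be done by a script instead of by eye.  Here is a minimal sketch in Python, assuming the folder holds copies of the originals that have already been overwritten by whatever printed successfully.  The function name and the idea of passing in the time the print run started are my own illustration, not something tested in this post.

```python
import os

def not_overwritten(folder, started_at):
    """List PDFs in folder whose modification time is earlier than started_at.

    started_at is in seconds since the epoch (e.g., time.time() captured just
    before the print run). Files older than that were presumably never replaced
    by a fresh printout.
    """
    stale = []
    for entry in os.scandir(folder):
        if entry.is_file() and entry.name.lower().endswith(".pdf"):
            if entry.stat().st_mtime < started_at:
                stale.append(entry.name)
    return sorted(stale)
```

Comparing against a recorded start time avoids having to pause for several minutes just to make the older timestamps visibly older.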

Another approach might have been to use an Acrobat-type program to concatenate many if not all of the PDFs being tested into one PDF.  I wasn't sure if junk PDFs would concatenate.  I selected my test files > right-click > Combine supported files in Acrobat.  Acrobat said, "There was an error encountered while combining files.  Do you want to open the combined file or return to the file list and try again?"  I told it to open the combined file.  Acrobat's Bookmarks pane showed bookmarks for each of the good files, but no bookmark for the bad one.  So that would be one way of getting a list of good PDFs.  Of course, the concatenation process could be slow, especially because the resulting document could be huge.  The size of that document might also cause the Acrobat-type program to crash.

But this still wasn't giving me a testing approach that would test PDFs in place, without requiring me to relocate them to a single folder where I could manipulate them.  I could try to work up a batch command that would print the PDFs on my list to a common output folder from where they were, but in that case I wouldn't have two simple lists to compare.  Unless I could persuade the batch command to report its errors to a log, I would apparently have to go through the printing process manually, making sure to write down or attend to each PDF that didn't print.

I ran a search, to see if Bullzip could escort me out of this situation.  This stratagem led, strangely enough, to the Bullzip manual, to which I probably already had at least a link in my Bullzip program folder.  But the manual -- besides being no fun -- seemed to be oriented toward installation rather than usage.  A search in the Bullzip forum led to the suggestion that I look at a bioPDF webpage.  bioPDF seemed to be telling me that I could use a program called PrintTo.exe to do the job.  But where could I find this PrintTo.exe program?  I wasn't seeing a link to it there on the bioPDF site.  A search of files on my computer turned up nothing.  It didn't register when I typed "printto /?" on the command line.  Softpedia didn't have it.  And yet a search produced indications that Bullzip users were using PrintTo too.  Baffling.  Another search led, directly or indirectly, to a FindThatFile webpage where I was able to download printto.exe as part of a zip file containing other stuff.  I unzipped it and ran "printto /?" in the folder where printto.exe was unzipped to.  Turns out it was a product after all.  The syntax was simply "printto filetoprint printername" -- using the default printer if printername was not specified.

So, alright.  Bullzip was already my default printer, so I would be test-printing those PDFs to some temporary folder with a simple command:  printto filetoprint.  There didn't seem to be an option to specify an output folder on the command line.  Apparently I would have to do that in Bullzip.  It took some tinkering, but eventually it came together.  It didn't look like printto.exe was eager to print JPGs, but that was alright; I didn't need that now.  Right now, I was just doing PDFs.  I did get it to print designated PDFs to a designated output folder from the command line without pausing, except in case of overwrite; I did want to be notified about that.  Printto.exe had to be present in the folder where the command or batch file was running, I assumed, but that was manageable.  When it got to my bad PDF, printto.exe gave a command-line error:  "ERROR:  Invalid file name specified."  I had forgotten to put that file's name into quotes.  (Unlike the others, that name had a space in it.)  I added quotation marks and tried again.  This time, when it got to the bad PDF, it gave me the error (above), "There was an error opening this document."  When I clicked OK, my little test batch file continued to print the next file in line.  So it looked like this was going to work.

Regrouping, then, the situation was as follows:  I had set up Bullzip to print PDFs to a specified folder called Bullzip Output, without pausing for any dialogs except error messages and overwrites.  I had downloaded printto.exe, and it was now sitting in the folder where I had also saved a file, created in Notepad, called Printer.bat.  Printer.bat contained commands of this nature:

printto "D:\Folder Name\File to Test Number 1.pdf"

Printer.bat contained one line like that for each of the PDF files I wanted to test.  The idea was that I could double-click on Printer.bat, or type "Printer.bat" on the command line, and it would try to print the PDFs I was testing.  It would put the resulting PDFs (that is, the Bullzip printouts of the PDF files being tested) into the Bullzip Output folder.  Unless it encountered corrupt files or potential overwrites, it would work -- slowly -- through the list of PDFs that I wanted it to test.  I hadn't seen an option to steer the error messages to a log instead of showing them onscreen.  A log would have been better:  the batch file wouldn't sit idle, waiting half the night for me to wake up and fix a problem.  And maybe Bullzip or some other PDF printer offered that.  I just hadn't seen it.  It would be something to look into next time.  Hopefully Printer.bat would not encounter many corrupt PDFs.
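
The lines of Printer.bat could be generated by a script as well as by Excel.  The following is a rough Python sketch, assuming only the "printto filetoprint" syntax reported by printto /? above.  The function names are hypothetical.  Every path is quoted, since an unquoted name with a space was what tripped up my first attempt.

```python
def printer_bat_lines(pdf_paths):
    """Build one quoted printto command per PDF."""
    lines = []
    for p in pdf_paths:
        # Inside a batch file a literal percent sign has to be doubled.
        safe = p.replace("%", "%%")
        # Quotes protect file names that contain spaces.
        lines.append(f'printto "{safe}"')
    return lines

def write_printer_bat(pdf_paths, bat_path="Printer.bat"):
    """Write the commands to a batch file with Windows (CRLF) line endings."""
    # newline="" keeps Python from translating the explicit \r\n a second time.
    with open(bat_path, "w", newline="") as f:
        f.write("\r\n".join(printer_bat_lines(pdf_paths)) + "\r\n")

if __name__ == "__main__":
    for line in printer_bat_lines([r"D:\Folder Name\File to Test Number 1.pdf"]):
        print(line)
```

File names with characters outside plain ASCII might need an explicit encoding when the batch file is written.  I have not tested that case.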

I decided to run Printer.bat from the command line, so that I could watch what was going on.  One problem emerged almost immediately:  after Bullzip created a PDF, Acrobat would open up, even though I had checked the Bullzip option that said, basically, do not open the document after creation.  So, fine, the document would not open, but Acrobat would.  It would just sit there with a blank page, and that was fine, except printto.exe would not proceed with the next file to be processed until I killed Acrobat.  I wondered if things would work differently if I designated a different PDF default reader.  To test that, I right-clicked on a PDF file at random and went into Open With > Choose Default Program > Always use the selected program to open this kind of file > Browse.  I browsed to FoxitReader.exe (which may not have been its original name) and selected it.  I double-clicked on a random PDF to make sure that it would now open in Foxit rather than in Acrobat.  When I tried running Printer.bat again, I got a rapid series of error messages.  The gist of these messages was, "No application is associated with the specified file for this operation."  There were no files in the Bullzip Output folder.  Operation failed.  Now what?

I guessed that I was getting those error messages, not because Foxit was not registered as the default PDF reader at this point, but because something about its status as a portable rather than an installed program was confusing printto.exe.  I was surmising, in other words, that printto.exe needed a PDF reader to be installed.  This suggested that Acrobat was opening after each printing of a PDF, not because of some failure in Bullzip, but because printto.exe needed that.  So then could I perhaps insert a batch file command that would kill Acrobat after printto.exe gratuitously started it up?  Or could I install some other (non-portable) PDF reader that would respond differently than Acrobat had done?  Or would it perhaps help, somehow, if I left Acrobat open to another PDF file before running Printer.bat?

Trying that last possibility, I restored Acrobat as the default PDF reader, opened a PDF in it, and tried Printer.bat again.  This time, Printer.bat took off like a shot.  It processed a couple dozen PDFs almost instantly.  Then it slowed way down, but it seemed this was only because Bullzip was printing the PDFs much more slowly than printto.exe was submitting them.  Evidently having a PDF already open in Acrobat was the solution.  Don't ask me why.

Well, now that we had worked out the terms of engagement, printto.exe and Bullzip seemed to be poised to execute the balance of their little pas de deux with grace.  Every few seconds, another PDF would be printed into the output folder.  Ah, but then the overwrite warnings began popping up.  I didn't have time to rename one existing PDF, so as to make room for its brother, before another duplicate warning would interfere with the manual renaming process.  I could have let them go until the end, but I was afraid there might be a lot of overwrite warnings, and the computer would crash or I would get corrupted results.  This was a pretty clumsy operation in the end.  It did seem that it would have been advisable to detect duplicate filenames before starting, and to assign duplicates a temporary or visibly alternate filename for this process.

But anyway, when this was done, I had 147 items in the output folder.  There had been no error messages.  Sadly, Printer.bat contained 148 lines.  Somewhere, we had a laggard.  I took a stab at finding it.  Failed.  I guessed I had allowed something to get overwritten, but my approach was way too sloppy to figure out which test output PDF got wiped.  I decided that I probably already knew it was a good PDF, since I'd gotten no error message while printing.  So that was the end of this test.



Another post updates some aspects of this one.