Sunday, May 27, 2012

Batch Converting Multiple Word DOC Files to PDF in Scattered Folders

I had a large number of .doc files produced by Microsoft Word.  These files were in assorted folders.  I wanted to convert some or all of these files to PDF format.  This post describes the steps I took.

I had already tackled similar problems in several other posts.

This post does not detail all of the steps described in those other posts.  If a step described here is not clear, perhaps one of those posts expresses it more lucidly.

I started by getting a list of the DOC files to be converted.  For this, I opened a command window and typed "DIR *.doc /s /b /a-d > doclist.txt".  It was OK if this DOC list included files that I did not want to convert:  I could go through the list manually at this point, deleting those that I did not want to convert, or I could do that in the next step.  The next step was to copy and paste the list of files from doclist.txt into Microsoft Excel or some other spreadsheet.  This gave me a list of file and path names that looked like this:
D:\Folder3\Subfolder 8\Filename Z.doc
Since some paths and/or filenames contained spaces, I would tend to use quotation marks in commands relating to them, in both Excel and the command window.  In Excel, I used the REVERSE function and other spreadsheet commands to separate the path (e.g., "D:\Folder3\Subfolder 8\") from the filename (e.g., "Filename Z.doc").  So now I had separate columns showing the paths and the filenames for each entry in doclist.txt.  This would be a good point for using formulas to identify groups of DOC files that I did not presently wish to convert to PDF.
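For anyone who would rather script this step than build it in a spreadsheet, here is a minimal Python sketch of the same idea: list the .doc files under a folder, then split each full path into its folder and filename.  Python was not part of my original workflow, and the function names (list_doc_files, split_windows_path) are hypothetical, chosen only for illustration.

```python
import ntpath
import os


def list_doc_files(root):
    """Return the full path of every .doc file under root, subfolders included."""
    found = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            # Compare the extension exactly (ignoring case) so .docx is left out.
            if os.path.splitext(name)[1].lower() == ".doc":
                found.append(os.path.join(folder, name))
    return sorted(found)


def split_windows_path(full_path):
    """Split a Windows path into (folder with trailing backslash, filename)."""
    folder, filename = ntpath.split(full_path)
    if not folder.endswith("\\"):
        folder += "\\"
    return folder, filename
```

Calling split_windows_path on the sample entry above should give the folder "D:\Folder3\Subfolder 8\" and the filename "Filename Z.doc", which are the two columns I built in Excel.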

The next step in the spreadsheet was to identify the filename without the extension, and to add PDF instead of DOC to that rump filename.  In other words, in this step I went from having Filename Z.doc to having Filename Z.pdf.  This gave me the essential ingredients for the batch commands that I would assemble on each line of the spreadsheet and would then paste into Notepad and save as a .bat file, so as to automate the conversion.
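The extension swap can also be sketched in a few lines of Python, as an alternative to the spreadsheet formulas.  The function name pdf_name_for is hypothetical, and the check that rejects anything other than .doc is my own assumption about what would be useful here.

```python
import ntpath


def pdf_name_for(doc_filename):
    """Swap the extension of a .doc filename for .pdf, keeping the rest as-is."""
    stem, extension = ntpath.splitext(doc_filename)
    if extension.lower() != ".doc":
        raise ValueError(f"not a .doc file: {doc_filename!r}")
    return stem + ".pdf"


print(pdf_name_for("Filename Z.doc"))  # Filename Z.pdf
```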

There were two ways to proceed at this point.  One was to leave the DOCs in place, in their home folders, and do the conversion and replacement right there.  I didn't like that approach.  It was too hard to be sure of what had happened in all those scattered folders.  The approach I preferred was to bring all those .DOC files together in one central folder, do the conversion, and then use the spreadsheet to construct batch files that would put those PDFs back where they belonged and, optionally, delete the DOCs from which they had come.

Bringing the DOC files to a central folder could be done very easily with a search program like Everything, searching for *.doc.  It could also be done with batch commands constructed in the spreadsheet.  An Excel formula producing a command of the latter nature would be something like ="move /-y "&char(34)&[cell containing the full path and filename, including the .doc extension]&char(34)&" D:\CentralFolder".  It would be important not to take this step -- that is, not to move the files away from their home folders to the central folder -- until I already had a list of where the files came from originally.  Without that, I'd have a big collection of DOC files and no idea of where they belonged.  Note that files bearing identical names, coming from different folders into one, could require some advance manual renaming to avoid overwriting.  In that case, after renaming but before moving, it would probably be advisable to re-run DIR, so as to get the current filenames.
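The duplicate-name problem is the kind of thing a short script can catch before anything is moved.  The following Python sketch is only an illustration, not what I actually ran: it groups the listed paths by filename (ignoring case, since Windows does), and it refuses to produce MOVE commands while any clash remains.  The function names are hypothetical.

```python
import ntpath
from collections import defaultdict


def find_name_collisions(doc_paths):
    """Group paths whose filenames would clash if moved into a single folder."""
    by_name = defaultdict(list)
    for path in doc_paths:
        # Windows filenames are case-insensitive, so compare in lowercase.
        by_name[ntpath.basename(path).lower()].append(path)
    return {name: paths for name, paths in by_name.items() if len(paths) > 1}


def build_move_commands(doc_paths, central_folder):
    """Return one quoted MOVE command per path, or raise if names would clash."""
    collisions = find_name_collisions(doc_paths)
    if collisions:
        raise ValueError(f"rename these before moving: {collisions}")
    return [f'move /-y "{path}" "{central_folder}"' for path in doc_paths]
```

The resulting lines could be pasted into Notepad and saved as a .bat file, just like the commands assembled in the spreadsheet.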

Once the files were all in a central location (in this case, D:\Conversion), it was time to work up the batch conversion process.  For this, first, I set the General and Options tabs in Bullzip (my free PDF printer) so that it would operate without asking questions or opening PDFs, and would save the PDFs to a designated folder (D:\Conversion\PDFs).  Then I saved this command into a batch file that I called Converter.bat:
FOR /F "usebackq delims=" %%g IN (`dir /b "*.doc"`) DO "C:\Program Files (x86)\Microsoft Office\Office11\winword.exe" "%%g" /q /n /mFilePrintDefault /mFileExit && TASKKILL /f /im winword.exe
I saved Converter.bat in the folder containing the DOC files (in this case, D:\Conversion) and ran it.  It worked away for a while, at the speed of one document every few seconds, until it had produced one PDF for each of my DOC files.  Several times during the process, Word or Bullzip stalled with error messages (e.g., "Word cannot start the converter Rftdca32.cnv").  This seemed to result primarily from corrupted Word docs.  There seemed to be little alternative but to delete those files except where I could find a backup.

Now I had a set of DOCs and a set of PDFs.  One easy way to make sure that I had a PDF for each DOC was to view the folders using a Windows Explorer alternative like FreeCommander.  In FreeCommander, I could combine the DOCs and PDFs together, sort by file type, select all DOCs, re-sort by file name, and look for instances in which alternating lines were not regularly highlighted.  (In Windows 7, Windows Explorer had lost the ability to retain highlighting after files were re-sorted.)  At this point or later, one could then just delete all DOCs that had a corresponding PDF.  DoubleKiller Pro would provide a similar approach.  Another method, more suitable for large numbers of files, was to use the DIR and spreadsheet approach outlined above, writing formulas to check for identical filenames (not counting extensions).  Of course, there was no need to actually delete the DOCs if I wanted to keep both the PDF and the DOC.
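The filename comparison (not counting extensions) can be sketched in Python as well, as a stand-in for the spreadsheet formulas.  This is an illustration under my own assumptions: the function name docs_missing_pdf is hypothetical, and names are compared without regard to case.

```python
import ntpath


def _stem(name):
    """Return the filename without folder or extension, in lowercase."""
    return ntpath.splitext(ntpath.basename(name))[0].lower()


def docs_missing_pdf(doc_names, pdf_names):
    """Return the DOC names that have no PDF with the same base filename."""
    stems_with_pdf = {_stem(name) for name in pdf_names}
    return [name for name in doc_names if _stem(name) not in stems_with_pdf]
```

An empty result would mean that every DOC had a matching PDF; anything returned would be a file to investigate before deleting DOCs.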

I postponed that step to verify, first, that I would not be needing any of the DOCs anymore.  I had previously worked on ways to check PDFs by converting them to JPGs and seeing which ones converted successfully.  In that previous effort, IrfanView (my preferred tool) had not behaved as expected, so I had grappled with other approaches.  This time, however, the quick IrfanView batch conversion went smoothly.  This gave me a JPG displaying the first page of each PDF.  My decision there was that, in the interests of speed (and to avoid having to go through every page of every PDF), I was content to look just at the first page.  There could still be errors on later pages of a PDF, but that would be rare.  If the first page came through OK, I could be fairly confident that the rest of that document had converted successfully as well.  So now, using IrfanView, I flipped through those JPGs quickly.

With these steps out of the way -- PDFs checked, superannuated DOCs deleted -- I went back to my Excel spreadsheet and worked up batch commands to move the new PDFs back to where the DOCs had been.  I had changed a couple of names along the way, so I had to move those manually, but the rest went automatically.  Project done!
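For completeness, here is a rough Python sketch of how the move-back commands could be generated from the original list of DOC paths, instead of with spreadsheet formulas.  The function name build_move_back_commands is hypothetical, and the sketch assumes that each PDF kept the base filename of its DOC, which would not hold for the files I renamed along the way.

```python
import ntpath


def build_move_back_commands(original_doc_paths, pdf_folder):
    """Return MOVE commands sending each PDF to its DOC's original folder."""
    commands = []
    for doc_path in original_doc_paths:
        home_folder, doc_name = ntpath.split(doc_path)
        # Assumes the PDF has the same base filename as the DOC it came from.
        pdf_name = ntpath.splitext(doc_name)[0] + ".pdf"
        source = ntpath.join(pdf_folder, pdf_name)
        commands.append(f'move /-y "{source}" "{home_folder}"')
    return commands


for line in build_move_back_commands(
    ["D:\\Folder3\\Subfolder 8\\Filename Z.doc"], "D:\\Conversion\\PDFs"
):
    print(line)
```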