Monday, February 6, 2012

Batch Merging Many Scattered JPGs into Many Multipage PDFs - Streamlined

I was facing a task that I had undertaken before.  This presented an opportunity to revise and streamline the writeup that I had produced during that previous effort.

First, I had previously converted multiple JPGs, scattered among multiple folders, to PDF, in some cases combining multiple JPGs into a single PDF.  The general approach taken there was to rename the JPGs to unique names, move them to a single folder, conduct my operations on them there, delete the JPGs, and move the new PDFs back to the folders where the JPGs had come from.

In a related effort, I had also previously converted MHTs to PDF.  Those were not multipage.  In that effort, I had taken the strategy of converting them in-place (i.e., without moving them to a single folder, transforming them there, and moving them back to the place of origin) and then deleting the MHTs.

Between these two strategies, I liked the former better.  Having all files in one folder made it easier to verify that my processes yielded the same number of output files as source files.  It was also easier to spot-check the output and make sure that the files looked like they should.

Now I wanted to clean up any remaining JPGs that should be converted and, in some cases, combined into PDFs.  Using techniques described in more detail in those previous posts, I ran a DIR to get a full list of JPGs.  I put that list into a spreadsheet, used a REVERSE function to identify paths, sorted the spreadsheet by path, and deleted those rows that contained images that I did not want to convert (e.g., photos).  This made it possible to reduce a starting set of thousands of JPGs into a list of a few hundred that actually needed to be converted.
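
The DIR-and-spreadsheet step above could also be scripted.  The following Python sketch is only an illustration, not what was done in this post:  it walks a folder tree, collects every JPG, and groups the results by folder, leaving out folders that should not be converted (e.g., photos).  The function name list_jpgs and the skip_folders parameter are hypothetical.

```python
import os

def list_jpgs(root, skip_folders=()):
    """Return {folder: [jpg filenames]} for every folder under root.

    skip_folders is a hypothetical filter: a folder is left out when any
    component of its path below root matches one of these names.
    """
    skip = {name.lower() for name in skip_folders}
    found = {}
    for folder, _subdirs, files in os.walk(root):
        # Split the path relative to root into its parts and check each one.
        parts = os.path.relpath(folder, root).lower().split(os.sep)
        if skip.intersection(parts):
            continue
        jpgs = sorted(f for f in files if f.lower().endswith((".jpg", ".jpeg")))
        if jpgs:
            found[folder] = jpgs
    return found
```

Grouping by folder plays the same role as sorting the spreadsheet by path:  whole folders can be reviewed and dropped at once, rather than file by file.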

Using my spreadsheet, I cooked up REN commands to run in a batch file.  This produced unique filenames, as described previously, so that I could use MOVE commands (or cut and paste them from a file finder) to pool them all into a single folder without fear of overwriting.  Of course, I needed to keep the spreadsheet, so that I would know what the original names and locations were, so that I could write MOVE commands to put the resulting PDFs back in the source folders.
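
As a rough alternative to building the REN and MOVE lines in a spreadsheet, the sketch below generates them in Python and returns the mapping that has to be kept for the trip back.  It is an assumption-laden illustration, not the method used here:  plan_renames is a hypothetical name, the ZZZ_ prefix and D:\Workspace come from the examples in this post, and the source paths in any real run would be the ones on the filtered list.

```python
import ntpath  # Windows path rules, so the sketch behaves the same on any OS

def plan_renames(jpg_paths, prefix="ZZZ_", workspace="D:\\Workspace"):
    """Return (rows, commands) for pooling JPGs under unique working names.

    rows     -> (working_name, original_folder, original_name) tuples,
                i.e. the mapping the spreadsheet has to preserve
    commands -> REN and MOVE lines for a batch file
    """
    rows, commands = [], []
    for number, path in enumerate(sorted(jpg_paths), start=1):
        folder, original = ntpath.split(path)
        extension = ntpath.splitext(original)[1].lower()
        working = f"{prefix}{number:04d}{extension}"
        rows.append((working, folder, original))
        # REN takes a full path for the source but a bare name for the target.
        commands.append(f'ren "{path}" "{working}"')
        moved_from = ntpath.join(folder, working)
        commands.append(f'move "{moved_from}" "{workspace}"')
    return rows, commands
```

The rows would be saved (e.g., as a CSV file) before any of the commands are run, since they are the only record of where each file came from.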

Once the uniquely named JPGs were combined in one folder, I used IrfanView's batch capability -- again, as detailed in the previous posts -- to convert them to PDFs.  These were all one-page PDFs; I had one PDF per JPG.  I made sure the spreadsheet had a column associating these one-page PDFs (with names like ZZZ_0001.pdf) with their original filenames (e.g., Letter to Ed page 01.jpg).  That is, the current PDF name and the original JPG name would be on the same row.  Now I could use the original JPG name to calculate the name of the new multipage PDF (e.g., Letter to Ed.pdf).  So there would be spreadsheet rows like these:

ZZZ_0001.pdf   Letter to Ed.pdf
ZZZ_0002.pdf   Letter to Ed.pdf
ZZZ_0003.pdf   Letter to Jane.pdf
ZZZ_0004.pdf   Memo from ABCD.pdf
ZZZ_0005.pdf   Memo from ABCD.pdf
(Letter to Jane.pdf is there because I did a sweep for all JPGs, some of which would wind up having just one page.)  I sorted the spreadsheet alphabetically by the output filename and input PDF name (as shown, with the pages that would be going into Letter to Ed arranged in proper order, and with Letter to Ed coming before Memo from ABCD).  For convenience, I assigned a simple working name to each output filename (e.g., Letter to Ed.pdf would be represented by YYY_0001.pdf, and Memo from ABCD.pdf by YYY_0003.pdf).  I expressed the relationship between the input (single-page) PDF filenames and the output (potentially multipage) PDF filenames with commands that would create the lines of the necessary XML files, in this format:
echo ^<file value="D:\Workspace\ZZZ_0001.pdf"/^> >> YYY_0001
That produced 460 batch file lines, each starting with "echo," that would generate most of the contents of 159 different XML files needed to produce 159 different single- or multi-page PDFs.  The XML files would each need to begin with these two lines:
<?xml version="1.0" encoding="UTF-8"?>
<filelist>
and end with "</filelist>" (without quotes).  I put those lines into Header.txt and Tailer.txt, respectively, and then combined them with a batch file containing 159 lines like this:
copy /b Header.txt+YYY_0001+Tailer.txt YYY_0001.xml
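
The echo and copy /b steps above could be collapsed into one scripted step.  The Python sketch below is offered only as an illustration under stated assumptions:  build_filelists is a hypothetical name, the opening filelist element is inferred from the closing tag mentioned above, and the rows are the sample spreadsheet rows from this post.  It groups the single-page PDFs by output name, assigns the YYY_ working names in sorted order, and builds the complete XML text for each merge job.

```python
from xml.sax.saxutils import quoteattr

def build_filelists(rows, workspace="D:\\Workspace", prefix="YYY_"):
    """Group single-page PDFs by output name and build one XML list per group.

    rows: (single_page_pdf, output_pdf) pairs, one per spreadsheet row.
    Returns (names, xml_texts):
      names     -> {working name: real output name}
      xml_texts -> {working name: full XML text for that merge job}
    """
    groups = {}
    # Sort by output name, then input name, so pages stay in order.
    for page, output in sorted(rows, key=lambda row: (row[1], row[0])):
        groups.setdefault(output, []).append(page)

    names, xml_texts = {}, {}
    for number, (output, pages) in enumerate(groups.items(), start=1):
        working = f"{prefix}{number:04d}"
        lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<filelist>"]
        for page in pages:
            path = workspace + "\\" + page
            # quoteattr adds the quotes and escapes characters such as &.
            lines.append(f"<file value={quoteattr(path)}/>")
        lines.append("</filelist>")
        names[working] = output
        xml_texts[working] = "\n".join(lines) + "\n"
    return names, xml_texts
```

Counting the file entries across all of the generated texts and comparing that total with the number of spreadsheet rows gives the same kind of check as comparing the number of source files with the number of output files.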
Running those copy commands gave me 159 XML files, starting with YYY_0001.xml.  Now I needed 159 new batch files, one for each XML.  These batch files would tell PDFsam to do the actual work -- to merge the single-page PDFs listed in the XMLs into an appropriate output PDF.  So at this point, D:\Workspace contained those 159 XMLs and the 460 single-page PDFs that would soon be merged.  As above, each of these 159 batch files would begin with several lines.  Working in a separate folder, I saved those several lines in a new Header.txt file:
@echo off


set JAVA_OPTS=-Xmx256m -Dlog4j.configuration=console-log4j.xml

set CONSOLE_JAR="C:\Program Files (x86)\pdfsam\lib\pdfsam-console-2.3.1e.jar"

@echo on
Those lines made some assumptions about environment variables, which I had already set.  Only the final line of each of those 159 batch files would vary.  That final line would have two variables:  it would name the XML file that listed the PDFs to be combined, and it would name the output file that would contain those PDFs, as in this example:
%JAVA% %JAVA_OPTS% -jar %CONSOLE_JAR% -l D:\Workspace\YYY_0001.xml -o D:\Workspace\Merged\YYY_0001.pdf concat
I had to create the D:\Workspace\Merged folder to hold the output, and I had to write Excel spreadsheet formulas to mass-produce one final line, like the one just shown, for each of the 159 batch files.  The Excel formula I used was like this:
="echo %%JAVA%% %%JAVA_OPTS%% -jar %%CONSOLE_JAR%% -l D:\Workspace\"&A2&".xml -o D:\Workspace\Merged\"&A2&".pdf concat >> "&A2&".txt"
where cell A2 contained YYY_0001.  That gave me 159 batch commands that I put into Notepad, and saved and ran as Texter.bat.  That produced 159 files with names like YYY_0001.txt, each containing a single JAVA line.  Then, in the spreadsheet, I created another set of commands like this:
copy /b Header.txt+YYY_0001.txt YYY_0001.bat
to combine the header and the JAVA lines into 159 batch files, each of which would ask PDFsam to merge the single-page PDFs listed in the corresponding XML file into the appropriate single- or multi-page output PDF.  And finally, to run those 159 batch files and produce the desired PDFs, I used the spreadsheet to work up a batch file called Runner.bat that began like this:
@echo off
call YYY_0001.bat
call YYY_0002.bat
I had been doing some of this work in other folders, but at this point my simplistic approach would work only if I moved all of the batch files back to D:\Workspace before trying Runner.bat.
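
The per-job batch files and Runner.bat described above could also be generated without the intermediate .txt files.  The Python sketch below is a hedged illustration, not the procedure used here:  the function names are hypothetical, and the command line simply reproduces the PDFsam console line shown above, with the same environment variables assumed to be set.

```python
def merge_command(name, workspace="D:\\Workspace"):
    """Return the one line that varies between the per-job batch files."""
    return (
        "%JAVA% %JAVA_OPTS% -jar %CONSOLE_JAR% "
        f"-l {workspace}\\{name}.xml "
        f"-o {workspace}\\Merged\\{name}.pdf concat"
    )

def job_batch_text(name, header):
    """Header lines followed by the merge command for one job.

    This mirrors copy /b Header.txt+YYY_0001.txt YYY_0001.bat.
    """
    if not header.endswith("\n"):
        header += "\n"
    return header + merge_command(name) + "\n"

def runner_text(names):
    """Text of a Runner.bat that calls each job's batch file in turn."""
    lines = ["@echo off"] + [f"call {name}.bat" for name in names]
    return "\n".join(lines) + "\n"
```

Each returned text would be written to a file with the corresponding name (e.g., YYY_0001.bat or Runner.bat) in the folder where the batch files are to be run.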

Altogether, the process worked for me, as it had before, and this time it took only a few hours to go through the foregoing steps, make the inevitable mistakes, and get the desired output.

The remaining step was to get these multipage PDFs back where they belonged.  I went back to the spreadsheet, prepared batch lines for that purpose, and finished the job.
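
For that last step, a sketch along the following lines could produce the MOVE commands and flag any merged PDF that failed to appear.  It is an illustration only:  the function names are hypothetical, and the original folders would come from the mapping saved when the JPGs were first renamed.

```python
import ntpath  # Windows path rules, so the sketch behaves the same on any OS

def plan_return_moves(jobs, merged_dir="D:\\Workspace\\Merged"):
    """Return MOVE lines that rename each merged PDF and put it back.

    jobs: (working_name, real_name, original_folder) tuples,
          e.g. ("YYY_0001", "Letter to Ed.pdf", <folder the JPGs came from>).
    """
    commands = []
    for working, real_name, folder in jobs:
        source = ntpath.join(merged_dir, working + ".pdf")
        target = ntpath.join(folder, real_name)
        commands.append(f'move "{source}" "{target}"')
    return commands

def missing_outputs(jobs, present_files):
    """Return working names whose merged PDF is not among present_files."""
    present = {name.lower() for name in present_files}
    return [w for w, _real, _folder in jobs if (w + ".pdf").lower() not in present]
```

Running missing_outputs against a listing of the Merged folder before moving anything is one way to confirm that every job produced a PDF.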



A newer post has a more step-by-step explanation of this process.