Saturday, January 14, 2012

Batch Merging Many Scattered JPGs into Many Multipage PDFs - Second Try

I had previously looked for a way to combine multiple JPG files into a single PDF.  As explained in more detail in that previous post, the specific problem was that I might have sets of several JPGs in a single folder that should be merged into several different PDFs, and there might be multiple folders at various places that would have such PDF sets.  Hence, if I wanted to automate this project across dozens of folders containing hundreds of JPGs, it seemed that I would need a command-line solution rather than a GUI.  This post updates that previous attempt.

There were commercial programs that seemed to offer the necessary command-line capabilities, such as PDF Merger Deluxe ($30) and ParmisPDF Enterprise Edition ($300).  A search suggested, however, that PDFsam might offer a freeware alternative.

Assembling the List of JPGs to Be Converted

Before investigating PDFsam, I decided to get a more specific sense of what I needed to accomplish.  In a command window (Start > Run > Cmd), I navigated to the root of the drive I wanted to search (using commands like D: and "cd \").  (The root was the folder whose command prompt looked like C:\ or D:\, as distinct from e.g., C:\FolderZ.)  Being at the root folder meant that my command would apply to all subfolders on that drive.  Once I was there, I ran this command:

DIR *.jpg /s /b /a-d > jpgslist.txt
That gave me a list of files (but not directories, /a-d) in all subdirectories (/s), listed in bare (i.e., simple, /b) format, saved in a new file called jpgslist.txt.  (For more information on DIR or any other DOS-like command, type DIR /? at the command prompt, or search for informational webpages.)  If I'd had files with a .jpeg (as distinct from .jpg) extension, I would have added a second line, referring to *.jpeg and using a double-arrow (>>) to tell the program to add these to the existing jpgslist.txt, rather than creating a new jpgslist.txt (which was what the single arrow (>) would do).
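For anyone who would rather script this step, the same listing can be sketched in Python.  This is only an illustration, not what I ran; the function name list_images and its parameters are made up for the example.

```python
import os

def list_images(root, extensions=(".jpg", ".jpeg")):
    """Return full paths of files under root whose extension matches.

    Roughly comparable to DIR *.jpg /s /b /a-d, except that both
    extensions are handled in one pass and matching ignores case.
    """
    found = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            if name.lower().endswith(extensions):
                found.append(os.path.join(folder, name))
    return sorted(found)
```

Writing the returned list to a text file, one path per line, would give something comparable to jpgslist.txt.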

Now I wanted to see which folders had more than one JPG.  I would use Microsoft Excel to help me with this.  I could either open jpgslist.txt and copy its contents into an Excel spreadsheet, or import it into Excel.  In Excel, I did a reverse text search to find the last backslash in each text line, so as to distinguish file names from the directories holding them.  I sorted the spreadsheet by folder name and file name.  I set up a column to detect folders containing more than one JPG, and deleted the rows of folders not containing more than one JPG.  I might still want to do another search and conversion for isolated JPGs at some point, but that would be a different project.
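The spreadsheet work in that paragraph (splitting each line at its last backslash, then keeping only folders with more than one JPG) could be sketched in Python as follows.  The function names are hypothetical, and ntpath is used so that Windows-style paths are split the same way on any operating system.

```python
import ntpath
from collections import Counter

def split_path(full_path):
    """Split a Windows path at its last backslash into (folder, filename)."""
    return ntpath.split(full_path)

def folders_with_multiple(paths):
    """Keep only the paths whose folder holds more than one listed file."""
    counts = Counter(split_path(p)[0] for p in paths)
    return [p for p in paths if counts[split_path(p)[0]] > 1]
```

Isolated JPGs drop out of the result, which matches the decision to leave them for a different project.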

Next, I wanted to see if I could eliminate some folders.  For instance, I might not want to PDF and combine JPGs that were awaiting image editing, or important photos whose quality might get degraded in the PDF conversion.  In other words, I decided that this particular project was just for those JPGs that I was going to combine into a single PDF and then delete.  To get a concise list of folders containing multiple JPGs, I went into Data > Filter > Advanced Filter.  (That's Excel 2003.)  I moved the output into another worksheet.  I could then do a VLOOKUP to automatically mark rows to be deleted.  So that gave me the folders to work on.

Now it was time to decide which files to combine, and in what order.  In some cases, I had named files very similarly -- usually with just an ending digit change (e.g., Photo 01, Photo 02 ...).  So I set up a couple of columns to find the filename's length, subtract a half-dozen characters, and decide whether those first X characters were the same as in the preceding row.  If so, and if both were in the same folder, we had a match.  I discovered, at this point, that one or two folders contained large numbers of files.  I decided to combine those manually.  With those out of the way, it seemed that the next step was to decide the names of the resulting multi-image PDFs (e.g., Medieval Churches.pdf), and to put those names on the spreadsheet rows, next to the individual JPGs that would go into them.
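The matching rule described above (same folder, and the same filename once the last half-dozen characters are trimmed) might look like this in Python.  It is a sketch under assumptions: the name group_by_prefix and the default trim of six characters (two digits plus ".jpg") are illustrative, and real filenames may need a different trim.

```python
import ntpath

def group_by_prefix(paths, trim=6):
    """Group paths that share a folder and a trimmed filename prefix.

    Paths are sorted first, so each file only has to be compared with
    the row before it, as in the spreadsheet.
    """
    groups = []
    previous_key = None
    for path in sorted(paths, key=str.lower):
        folder, name = ntpath.split(path)
        # Names too short to trim are compared whole.
        prefix = name[:-trim] if len(name) > trim else name
        key = (folder.lower(), prefix.lower())
        if key == previous_key:
            groups[-1].append(path)
        else:
            groups.append([path])
        previous_key = key
    return groups
```

Each inner list is a candidate set for one multipage PDF; groups of one would still need a manual decision.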

At this point, as described in another post, I learned how to use PDFsam to combine several PDFs into one PDF.  So I had a rough idea of the start of my project (i.e., identify the JPGs that I would want to merge into a single output PDF), and I also had a basic sense of the end of my project (i.e., use PDFsam to merge multiple PDFs into that single output PDF).  I was missing the middle part, where I would convert the original JPGs into single-page PDFs and would get them into a form where PDFsam could work on them.

Converting Individual JPGs to Individual PDFs

I had originally assumed that I would start by converting the JPGs to PDFs within the various folders where they were originally located.  So if I had File1.jpg in E:\Folder1, and if I had File2.jpg in E:\Folder2, my conversion would result in two files in each of those folders:  File1.jpg and File1.pdf in Folder1, and File2.jpg and File2.pdf in Folder2.  Then I would use PDFsam to merge the PDFs (i.e., File1.pdf and File2.pdf) from those locations; delete the original JPGs and PDFs; and move Output.pdf to an appropriate location.

I didn't entirely like that scenario.  It seemed like it could make a mess.  As I reviewed another post in which I had worked through similar issues, I decided that a better approach might be to (1) make a list of original JPG file locations, (2) move those JPGs to a single folder where I could convert them to individual PDFs, (3) merge the individual PDFs into concatenated PDFs, (4) delete the individual JPGs and PDFs, and (5) move the concatenated PDFs to the desired locations.  I decided to try this approach.

One problem with moving files from many folders to one folder was that there might be two files with the same name.  They would coexist peacefully as long as they were in separate folders; but when they converged into one target folder, something would get overwritten or left behind.  It seemed that a first step, then, was to rename the source JPGs, so that each one would have a unique name -- preferably a short name without spaces, potentially making it easier to write commands for them as needed.  In this step, as in others, it would be important to keep a list indicating how various files were changed.  To rename the files where they were, I returned to my spreadsheet and used various formulas to produce a bunch of rename commands of this type:
ren "D:\Folder Name\File Name.jpg" "ZZZ_00001.jpg"
after doing a search to make sure that my computer did not already have any files with names resembling ZZZ_?????.jpg.  The spreadsheet gave me one REN command for each JPG.  I copied those commands into a Notepad file, named it Renamer.bat, and double-clicked to run that batch file.  (A slower and more cautious approach would have been to run it in a command window, perhaps with Pause commands among its renaming lines, so that I could monitor what it was doing.)  A search in a file-finding program like Everything now confirmed that the number of files on my computer with ZZZ_?????.jpg names was equal to the number of files listed in my spreadsheet.  I cut and pasted all those ZZZ_?????.jpg files from Everything to a single folder, D:\Workspace.  (I could also have used the spreadsheet to generate Move commands to run in a batch file for the same purpose.)
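Generating the rename commands, along with the list of what was renamed to what, could also be sketched in Python.  The function name and its parameters are invented for this example.  Note that REN takes only a bare filename, not a path, as its second argument, so the new name is written without a folder.

```python
import ntpath

def build_rename_commands(paths, prefix="ZZZ_", start=1):
    """Return (commands, mapping) for renaming files to unique short names.

    mapping records new name -> original full path, so the change can be
    traced later.  Assumes no path contains a percent sign, which would
    need doubling inside a batch file.
    """
    commands, mapping = [], {}
    for number, path in enumerate(paths, start=start):
        extension = ntpath.splitext(path)[1]
        new_name = f"{prefix}{number:05d}{extension}"
        commands.append(f'ren "{path}" "{new_name}"')
        mapping[new_name] = path
    return commands, mapping
```

The commands list could be saved as Renamer.bat, and the mapping kept alongside the spreadsheet as the record of original names.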

Now I had a spreadsheet telling me what the original names of these ZZZ_?????.jpg files were, and I had all those ZZZ files together in D:\Workspace.  My spreadsheet also told me which of them were supposed to be put together into which merged output PDFs.  But they weren't ready to be merged by PDFsam, because they were still JPGs, not PDFs.

To convert the JPGs to PDFs, I could have prepared another batch file, using IrfanView commands to do the conversion, like those that I had previously played with in another project.  But I figured it would be easier to use IrfanView's File > Batch Conversion/Rename.  There, I told IrfanView to Add All of the ZZZ files to its list.  I specified PDF as the Batch Conversion Settings output format, and set its Options > General tab to indicate that Preview was not needed (and adjusted other settings as desired).  I told it to Use Current ("Look In") Directory as the Output Directory for Result Files (adding "Output" as the name of the output subfolder to be created).  Then I clicked Start Batch.

That produced one PDF, in the Output subfolder, for each original JPG.  I hadn't done anything to change their filenames, so ZZZ_00001.jpg had been converted to ZZZ_00001.pdf.  Spot checks indicated that the resulting single-page PDFs were good.  I deleted the original ZZZ*.jpg files, moved the output PDFs up into D:\Workspace, made a backup, and turned to the project of merging those single-page PDFs into multipage PDFs.

Preparing XML Files to Concatenate PDFs

In my spreadsheet, I had already decided which ZZZ files would be merged together, and what the resulting multipage PDFs would be called.  Now -- referring, again, to the other post in which I worked through the process of using PDFsam -- I needed that information to create File Value lines for a set of ConcatList.xml files that PDFsam would then merge into a set of output PDFs.

In other words, I would have a batch file that would run PDFsam, and I would have a data file, in XML format, to specify the single-page PDFs that PDFsam would combine into the multipage output PDF.  I would have a pair of such files (i.e., a batch file and an XML data file) for each resulting multipage PDF.  In my particular project, there were 65 single-page PDFs, and they would be combined into a total of eight multipage PDFs.  So I would have eight pairs of .bat + .xml files, and the eight XML files would contain a total of 65 File Value lines.

To the extent possible, I would want to automate the creation of these batch and data files.  Sorting 65 data lines into eight different XMLs would be tedious and easily confused.  Things would get much worse if I wanted, in some later project, to use these procedures for hundreds or thousands of JPGs or other files.

I began by adding a column to my spreadsheet that contained the exact text of the appropriate File Value line.  Example:  for ZZZ_00001.pdf, the line would read like this:
<file value="D:\Workspace\ZZZ_00001.pdf"/>
To produce that result, if the Excel spreadsheet's cell D2 contained ZZZ_00001.pdf, its cell E2 would contain this formula:
="<file value="&CHAR(34)&"D:\Workspace\"&D2&CHAR(34)&"/>"
(Note the use of CHAR(34) to add quotation marks where they would otherwise be misunderstood.)  Next, I wanted to assign those File Value lines to the appropriate batch files.  A search confirmed that I didn't have any files on my data drives with YYY_????? names, so I decided that my first multipage output PDF would be called YYY_00001.pdf, and that the pair of files used to produce it would be YYY_00001.bat and YYY_00001.xml.  In other words, the File Value line for ZZZ_00001.pdf (above) would have to be one of the File Value lines appearing in YYY_00001.xml.  But the next File Value line in YYY_00001.xml could be a ZZZ file out of sequence (e.g., ZZZ_00027.pdf), if that happened to be the next original file that I wanted to put into YYY_00001.pdf.

Since YYY_00001.pdf was going to be the temporary working name of the multipage PDF that I would ultimately be calling "Short Letters to Mother," I sorted the spreadsheet (making sure to first use Edit > Copy, Edit > Paste Special to convert formulas to values) by the column containing those ultimate desired filenames, and worked up a column indicating the corresponding YYY filename.  In other words, each cell in that column contained one of eight different labels, from YYY_00001.pdf to YYY_00008.pdf.

With that in place, I was ready to generate some batch commands.  Each batch command would use the ECHO command to send the contents of spreadsheet cells to YYY*.xml files.  My first attempt looked like this:
echo <file value="D:\Workspace\ZZZ_00001.pdf"/> >> YYY_00001.xml
The double greater-than signs (">>") indicated that YYY_00001.xml would be created, if it didn't already exist, and that the File Value line (above) would be added to it.  This first try produced an error, as I feared it might:  ">> was unexpected at this time."  The less-than and greater-than symbols were confusing Windows, which treats them as redirection operators unless they are escaped.  I had to modify the formula in my spreadsheet (or use Ctrl-H) to add carets (^) before them, like this:
^<file value="D:\Workspace\ZZZ_00001.pdf"/^>
That worked.  Now YYY_00001.xml contained that line.  With commands like that for each of the 65 single-page PDFs, my spreadsheet now had cells like these:
echo ^<file value="D:\Workspace\ZZZ_00051.pdf"/^> >> YYY_00004.xml
echo ^<file value="D:\Workspace\ZZZ_00025.pdf"/^> >> YYY_00006.xml
I sorted the rows in my spreadsheet by the appropriate column to make sure the single-page PDFs would get added to their multipage PDFs in the proper order.  (If necessary, I would have added another column containing numbers that I could manipulate to ensure the desired order.)  Then I copied all those cells over to Notepad and saved it as a new batch file that I called Sorter.bat.  I ran Sorter.bat and got eight XMLs, as hoped.  Spot checks seemed to indicate that the process had worked correctly.
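The ECHO lines in Sorter.bat could be produced by a short Python sketch like this one instead of spreadsheet formulas.  The function names, and the idea of passing (target, page order, PDF name) tuples, are assumptions made for the example.

```python
def echo_line(pdf_name, target, workspace="D:\\Workspace"):
    """Build one ECHO command that appends a File Value line to target."""
    # The carets keep cmd.exe from reading < and > as redirection operators.
    return f'echo ^<file value="{workspace}\\{pdf_name}"/^> >> {target}'

def sorter_lines(assignments):
    """Turn (target, page_order, pdf_name) tuples into ordered ECHO commands.

    Sorting the tuples puts each target's pages in page_order sequence.
    """
    return [echo_line(pdf, target) for target, _order, pdf in sorted(assignments)]
```

The page_order number plays the role of the extra spreadsheet column mentioned above: it lets a ZZZ file appear out of numeric sequence within its target.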

My eight XML files were not complete for purposes of PDFsam.  Each of them would need lines of code preceding and following the File Value lines.  As described in the other post, those files would begin with
<?xml version="1.0" encoding="UTF-8"?>
<filelist>
and would end with
</filelist>
I saved those two beginning lines into a text file called Header.txt, and I saved that ending line into another text file called Tailer.txt.  Now I needed to combine them with the XML files that Sorter.bat had just created.  For that purpose, it seemed that my spreadsheet could benefit from a separate page dedicated to XML file manipulation.  I added that page, filtered my existing page for unique values in the YYY*.pdf column, and placed the results on that new page. 
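As an alternative to assembling the pieces from separate text files, the whole file list (the two header lines, the File Value lines, and the closing line) could be built in one step.  This Python sketch assumes only the structure shown above; the function name is hypothetical, and I have not checked whether PDFsam cares about line-ending style.

```python
from xml.sax.saxutils import quoteattr

def build_filelist_xml(pdf_paths):
    """Return the text of a file list in the format shown above."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<filelist>"]
    # quoteattr adds the surrounding quotes and escapes characters such
    # as & that would otherwise make the XML invalid.
    lines += [f"<file value={quoteattr(path)}/>" for path in pdf_paths]
    lines.append("</filelist>")
    return "\n".join(lines) + "\n"
```

The returned text could be written straight to YYY_00001.xml, with one call per multipage PDF.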

I could now see that I was too early in adding .xml extensions to the eight files.  I went back into the spreadsheet and changed it to produce files without extensions (e.g., YYY_00001 instead of YYY_00001.xml), and then I deleted the XML files and re-ran Sorter.bat (as modified) to verify that it was all still working.

With that change, I returned to the spreadsheet's XML Files page.  Next to each of the eight XML filenames, I added columns to produce commands of this type:
copy Header.txt+YYY_00001+Tailer.txt YYY_00001.xml
I put those eight commands into a batch file and ran it.  It worked:  I had eight XML files with everything that PDFsam needed.  There was just one small glitch:  at the end of each resulting XML file, there was a little arrow, pointing to the right.  Searches didn't yield any obvious explanations.  I wasn't sure if it would make a difference; I thought it might just be a symbol representing end-of-file or line feed.  I decided to forge ahead and see what happened.  Except for that little character, I had exactly what I needed for my XML files.

Preparing Batch Files to Use the XMLs

The next step was to create a matching YYY_?????.bat file for each YYY_?????.xml file.  This batch file would run the commands necessary to merge the single-page PDFs listed in the XML file.  I would use the same techniques as in the XML files.  There would be no Tailer.txt file this time; the line that would need to change, in each BAT file, was the very last line.  So my COPY command (above) would just have Header.txt (now holding the unchanging opening lines of the batch file, not the XML header) plus the variable line to produce the YYY_?????.bat file.  The variable (last) line of the batch file had to look like this:
%JAVA% %JAVA_OPTS% -jar %CONSOLE_JAR% -l D:\Workspace\YYY_00001.xml -o D:\Workspace\Merged\YYY_00001.pdf concat
In other words, it would have two variables:  the name of the XML file providing the input, and the name of the PDF file containing the output, saved in a Merged subfolder.  It was pretty straightforward, by now, to use the spreadsheet to generate the necessary commands and to run them in another Sorter.bat file (see above).  I just had to remember to delete my previous YYY_????? files (without extension), so that their contents would not get thrown into the mix.  I did wish that PDFsam's -l option were expressed as -L, so that nobody would think it was the number one, but I wasn't yet ready to experiment and find out whether -L would work just as well.  Anyway, to produce a line like the one shown immediately above, my Excel formula looked like this:
="echo %%JAVA%% %%JAVA_OPTS%% -jar %%CONSOLE_JAR%% -l D:\Workspace\"&A2&".xml -o D:\Workspace\Merged\"&A2&".pdf concat >> "&A2 
where cell A2 contained the filename without extension (e.g., YYY_00001).  I had to use double percentage symbols to get a single one to come through, because a batch file treats a single % as the start of a variable name and writes a literal % only when it sees %%.  I put the resulting eight lines into Sorter.bat, and it produced eight YYY_????? files, as before.  I ran another batch file to combine Header.txt plus the YYY_????? files to produce YYY_?????.bat -- again, same as above, but without Tailer.txt and making sure to produce .bat files, not .xml files.
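The same variable line, and the ECHO command that writes it from inside a batch file, could be generated in Python.  This is a sketch with made-up function names; the command text itself is copied from the line shown above.

```python
def concat_command(stem, workspace="D:\\Workspace"):
    """Return the final line of one YYY batch file for the given name stem."""
    return (
        "%JAVA% %JAVA_OPTS% -jar %CONSOLE_JAR% "
        f"-l {workspace}\\{stem}.xml -o {workspace}\\Merged\\{stem}.pdf concat"
    )

def echo_into_batch(stem):
    """Return an ECHO command that appends that line to the file named stem.

    Inside a batch file each literal % must be written as %%.
    """
    return "echo " + concat_command(stem).replace("%", "%%") + " >> " + stem
```

If the batch files were written directly from Python rather than through ECHO, concat_command alone would be enough and no percent doubling would be needed.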

These steps gave me eight pairs of .bat and .xml files.  The batch files looked good, except for the little arrow at the end (above).  Now, if all went well, the batch files would run, would consult the XML files for the lists of single-page PDFs to merge, and would produce eight YYY_?????.pdf output files in the Merged subfolder.  I would not want to run the batch files manually, if I were producing a large number of merged PDFs, so I wrote a batch file to run the batch files.  The commands in this file looked like this:
@echo off
call YYY_00001.bat
call YYY_00002.bat
and so forth.  I ran this batch file.  It gave me error messages.  I opened a command window and typed just its first action line:  call YYY_00001.bat.  The error was:
FATAL  Error executing ConsoleClient
java.lang.Exception: org.pdfsam.console.exceptions.console.ParseException: PRS001 - Parse error. Invalid name (D:\Workspace\YYY_00001.xml) specified for <l>, must be an existing, readable, file.
Oh.  Dumb mistake.  The XML files weren't in D:\Workspace.  I moved them and tried again on the command line.  Another error:
Error on line 5 of document file:///D:/Workspace/YYY_00001.xml : Content is not allowed in trailing section.
That little arrow was on line 5.  I moved it to line 6 and tried again.  Same error, except now it said the problem was on line 6.  I deleted the little arrow and tried again.  That solved that problem.  Now a different error:  "The system cannot find the path specified."  That was probably because I had not yet created the Merged subfolder.  Apparently PDFsam was not going to create a folder that did not already exist.  I created it and tried again.  Success!  YYY_00001.pdf was created in the Merged folder with the desired single-page PDFs in it.

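Now I just had to figure out how to prevent that little arrow from appearing in the XML files. It came at the end of both the XML and the BAT files, and it got there when I used the COPY command to combine the header, the command, and the tailer text files.  When COPY combines files, it treats them as text by default and appends an end-of-file character (Ctrl+Z) to the result; that was apparently the little arrow.  The solution was to add the /b (binary) switch, which copies the bytes exactly as they are: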
copy /b Header.txt+YYY_00001+Tailer.txt YYY_00001.xml
With that change, I went back through the process of creating the XML files.  Then I tried running YYY_00001.bat again.  I got an error:  "Cannot overwrite output file (overwrite is false)."  Oops.  I had forgotten to get rid of the previously produced YYY_00001.pdf in the Merged folder.  I deleted it and ran YYY_00001.bat once more.  This time I was successful without having to manually remove the little arrow -- it wasn't there anymore in the new YYY_00001.xml.  I ran the batch file that called the eight YYY_?????.bat files.  It ran and produced eight multipage PDFs.  Those eight contained a total of 65 pages.  I combined them all in Acrobat, just to take a quick look.  They were all good.
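For files that had already been combined without /b, the stray character could be checked for and removed with a few lines of Python.  This sketch assumes the little arrow is the Ctrl+Z byte (hex 1A) that a text-mode COPY appends; the function names are hypothetical.

```python
EOF_MARKER = b"\x1a"  # Ctrl+Z, assumed to be the "little arrow"

def concatenate(parts):
    """Join byte strings exactly, the way COPY /b joins files."""
    return b"".join(parts)

def strip_eof_marker(data):
    """Remove one trailing Ctrl+Z byte if present; otherwise return data unchanged."""
    return data[:-1] if data.endswith(EOF_MARKER) else data
```

Reading a file in binary mode ("rb"), passing its contents through strip_eof_marker, and writing the result back in binary mode ("wb") would clean it without touching anything else.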

Putting the Multipage PDFs Back Where They Belong

Now I needed to decide where to put the multipage PDFs.  In this case, the individual PDFs that went into each of the multipage PDFs had all come from the same folder.  That is, I did not have Folder1\PDF1 plus Folder73\PDF2 going into BigPDF-A.  So I could use the spreadsheet to determine semi-automatically the path and filename for a rename command.  I wound up with Rename commands like this:
ren YYY_00004.pdf "Medieval Craft Workers.pdf"
followed by Move commands like this:
move /y "Medieval Craft Workers.pdf" "E:\IMAGES\Medieval Craftsmanship\"
I verified that the multipage PDFs had returned to the folders whence the single-page PDFs had originated.  This project was finished.
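The Rename and Move commands for that last step could likewise be generated rather than typed.  This Python sketch uses an invented function name and takes its example values from the commands shown above.

```python
def restore_commands(temp_name, final_name, destination):
    """Return the REN and MOVE commands for one merged PDF.

    Quotes are added around names that may contain spaces.  /y lets MOVE
    overwrite an existing file of the same name without prompting.
    """
    return [
        f'ren {temp_name} "{final_name}"',
        f'move /y "{final_name}" "{destination}"',
    ]
```

Calling it once per row of the spreadsheet's XML Files page would yield the full batch file for this step.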

2 comments:

raywood

A subsequent post contains a streamlined account of the process described above.

raywood

Another recent post has a more step-by-step explanation of this process.