Saturday, January 14, 2012

Combining PDFs with PDFsam: Introductory Syntax

I was using Windows 7.  I had a project that would benefit from automated merging of multiple PDFs into a single PDF.  It looked like PDFsam would be useful for this purpose.

PDFsam had GUI and Console options.  In other words, it could be accessed through a user-friendly interface, like most Windows programs, and it could also be used on the command line.  My project had certain complexities, such that the GUI approach would not be ideal.  This post describes the steps I took to learn how to use the Console.

I began with the Console section of the PDFsam wiki.  It led to a page providing information on console parameters and commands.  The explanation was too thin, so I did a search for more guidance. This led to a 33-page Tutorial (installed with the program files). It also led to a thread that reminded me not to forget the PDFsam Forums.

The Tutorial (p. 18) seemed to say that, in PDFsam-speak, what I wanted to do was to Merge files, and for this I would use the Concat option. Other options, not of interest here, included Split and Encrypt. It appeared that PDFsam syntax would call for very long commands. Looking for examples, I went to a forum thread, but that pointed me back toward the wiki page (above).

The Tutorial said that, to make PDFsam run from the command line (i.e., Console), I could either type a certain command or just use one of the scripts in the bin folder where the program was installed (e.g., C:\Program Files\pdfsam\bin).  In that bin folder, it appeared I had my pick from two scripts, provided in Linux (.sh) and Windows batch (.bat) versions.  Since I wanted the Console, not the GUI, I focused on run-console.bat.  Its contents seemed to address various details that I didn't clearly understand, and didn't necessarily want to study; it just looked like the thing I would need to use.  So I created a shortcut to it and put that in my Start Menu. I also edited the Tutorial, adding bookmarks to the various sections, and moved it to the Start Menu too.  (My customized Start Menu would survive any subsequent Windows reinstallation, so I probably wouldn't need to do this housekeeping again, if I had to install PDFsam in a new Windows installation sometime down the line.)

Unfortunately, the run-console.bat batch file didn't work for me.  It gave me an endlessly scrolling set of messages. They were ripping past too quickly to read.  I hit PrintScreen, opened IrfanView (any image editor would do, as would Microsoft Word or Wordpad), and pasted the screenshot (Ctrl-V).  (I could have just hit Ctrl-C, or possibly the Pause key.)  Now I could see that it was just the same error message, repeating over and over:

'java' is not recognized as an internal or external command, operable program or batch file.

Why wasn't my system recognizing java?  I right-clicked on run-console.bat, chose Edit, and looked for the line that referred to java. I couldn't quite figure out where the problem was, so I stuck in a "pause" command somewhere, saved the batch file, and, this time, ran it from the command line instead of from the shortcut. That way, the error statements would stay onscreen instead of scrolling past too quickly or disappearing when the batch file finished running. (This was another instance where it was handy to have the right-click option, "Open command window here," provided by Ultimate Windows Tweaker.)

Running the batch command meant just typing its name and hitting Enter. It paused where I had put the pause command, without any obvious errors, so I moved the pause command further down, saved, and repeated the cycle. (Running run-console.bat again required just hitting the Up key to repeat the command.) This time the endless scrolling returned before the pause was reached, which meant the problem lay somewhere between the two positions where I had put the pause command. I hit Ctrl-C a couple of times to abort the batch file.

I played around with the batch file for a while, and eventually realized that maybe the problem was that the JAVA_HOME variable had not yet been assigned a value on my system. It appeared that the batch file was supposed to tell me this; if so, it wasn't working right. I went into Start > Run > SystemPropertiesAdvanced.exe > Environment Variables and, sure enough, there was no JAVA_HOME variable. I had already installed the Java Runtime Environment, and I almost always used the default installation paths when installing programs, so the advice seemed to be that the JAVA_HOME variable should point to C:\Program Files\Java\jre6. Since this folder name ("Program Files") had a space in it, apparently I would need to use the shortened, DOS-style name for it -- known as an "8.3" filename because it would have up to eight characters before the dot and up to three afterwards (e.g., yourfile.txt).
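A check like this can also be scripted rather than done by clicking through the Environment Variables dialog. The following is a minimal Python sketch, not anything that ships with PDFsam; the function name diagnose_java is made up for illustration. It reports whether JAVA_HOME is set and whether a java executable can be found on the PATH.

```python
import os
import shutil


def diagnose_java(env=None):
    """Return a list of plain-text findings about the Java setup.

    diagnose_java is a hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    findings = []

    java_home = env.get("JAVA_HOME")
    if not java_home:
        findings.append("JAVA_HOME is not set")
    elif not os.path.isdir(java_home):
        findings.append("JAVA_HOME points to a missing folder: " + java_home)
    else:
        findings.append("JAVA_HOME = " + java_home)

    # shutil.which does roughly the lookup the shell does when "java" is typed
    java_exe = shutil.which("java", path=env.get("PATH", ""))
    if java_exe is None:
        findings.append("java was not found on PATH")
    else:
        findings.append("java found at " + java_exe)
    return findings


if __name__ == "__main__":
    for finding in diagnose_java():
        print(finding)
```

If both findings come back negative, the "not recognized" error above is the expected result of any script that simply calls java.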

I knew the shortened name of that folder would probably contain Progra~1 (instead of "Program Files"), and I could have just experimented with that, but I had seen instances where it would be Progra~2 or something else, and anyway I wanted to know how to get the 8.3 name. Microsoft advised using the GetShortPathName function to figure it out, but that seemed to involve programming, and programming is a lot of work. Instead, I ran a search that took me to ShortPath by Marcello Zaniboni. To get ShortPath to work from the command line, I tried the C:\Windows shortcut trick, but it didn't work. I didn't want to add ShortPath to my PATH yet, so I just opened a command window in the folder where ShortPath.exe was located, typed "ShortPath " (with an ending space) but didn't hit Enter, and then dragged the C:\Program Files\Java\jre6 folder into that command window from Windows Explorer. (I think this worked because I had installed DropCommand. Otherwise I might have had to type it out, with quotation marks.)

ShortPath told me that, actually, the short path to that folder was C:\PROGRA~2\JAVA\JRE6. So I went back into SystemPropertiesAdvanced.exe > Environment Variables > System Variables > New > Variable Name = JAVA_HOME, Variable Value = C:\PROGRA~2\JAVA\JRE6. I OKed out of there and rebooted.
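The lookup that ShortPath performed can also be done with a few lines of script, since the GetShortPathName function mentioned above is reachable from Python through ctypes. This is a sketch under assumptions, not a replacement for ShortPath: the helper name short_path is invented here, and on anything other than Windows it simply hands the path back unchanged.

```python
import ctypes
import sys


def short_path(long_path):
    """Return the 8.3 form of long_path on Windows; elsewhere return it unchanged.

    short_path is a hypothetical helper name used for illustration.
    """
    if sys.platform != "win32":
        return long_path  # 8.3 names are a Windows concept

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    get_short = kernel32.GetShortPathNameW
    get_short.argtypes = [ctypes.c_wchar_p, ctypes.c_wchar_p, ctypes.c_uint]
    get_short.restype = ctypes.c_uint

    size = 260
    while True:
        buffer = ctypes.create_unicode_buffer(size)
        needed = get_short(long_path, buffer, size)
        if needed == 0:
            # The call failed, e.g. because the folder does not exist
            raise ctypes.WinError(ctypes.get_last_error())
        if needed < size:
            return buffer.value
        size = needed + 1  # buffer was too small; retry with a larger one


if __name__ == "__main__":
    print(short_path(r"C:\Program Files\Java\jre6"))
```

The result depends on the machine: the ~1 or ~2 suffix is assigned by Windows, and on a volume where short names have been turned off, the long path may come back unchanged.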

After doing that, I still had to play with the batch file for a long time, in a quest to learn, remember, get lucky, or otherwise do what I needed to make it work. By the time I was done, I almost thought that I would have been further ahead just using the command given in the wiki:
java -Dlog4j.configuration=console-log4j.xml -jar pdfsam-console-2.1.1e.jar
except that that didn't work either because, as I soon realized, it was a Linux command. I also did not fare too well with the advice to type "run-console.bat -h concat" for information on the syntax for the Concat option, because the run-console.bat file itself was not yet working.

The Tutorial (pp. 19-20) said that I had three ways to indicate which files I wanted to merge. Instead of entering one parameter to indicate a directory and then entering another parameter to indicate one or more files in that directory, it seemed I would want the option that would allow me to specify a file (including its path) on a single line. Evidently I could list a number of PDF files in a separate XML file, and invoke that file (with its list of PDFs) by using the -l (that's an L, not a one) option. But it wasn't working right. Ultimately, I posted a question on it. Andrea (a guy from the Netherlands), creator of PDFsam, posted a reply within 36 hours. And that got me where I needed to go. I was able to get a test run to work, with a run-console.bat file whose contents (viewed in something like Notepad, of course, not in a word processor like Word that would add all kinds of invisible junk) were as follows:
@echo off


set JAVA_OPTS=-Xmx256m -Dlog4j.configuration=console-log4j.xml

set CONSOLE_JAR="C:\Program Files (x86)\pdfsam\lib\pdfsam-console-

@echo on

%JAVA% %JAVA_OPTS% -jar %CONSOLE_JAR% -l D:\Current\ConcatList.xml -o D:\Current\PDFsamOut\Merged.pdf concat
While I wasn't entirely clear on what all those lines did, the basic idea seemed to be that the first lines would define JAVA, JAVA_OPTS, and CONSOLE_JAR, and then the last line would combine them all into one big command. That command seemed to say, "Run Java with these options, using this jar file for specific instructions; take your input from the PDF files listed in ConcatList.xml; and output a single PDF file, Merged.pdf, containing all of those PDF files." To make that work, I needed to know the format of the ConcatList.xml file. Here's the one that worked for me in this test run:
<?xml version="1.0" encoding="UTF-8"?>
<filelist>
<file value="D:\Current\TestDir1\x1.pdf"/>
<file value="D:\Current\TestDir2\x2.pdf"/>
</filelist>
I just needed a File Value line for each PDF to be merged, using the syntax shown.  To summarize, then, I used Notepad to create two files.  One, called run-console.bat, contained the first half-dozen lines of code shown above, beginning with @echo off.  The other, ConcatList.xml, contained these last five lines of code, beginning with the "xml version" line.  ConcatList.xml would contain File Value lines, each designating a PDF to be merged into the larger output PDF (and there were other options for ConcatList.xml; I just didn't need them for my project), and run-console.bat would read those lines and do the actual concatenation into a single output PDF.
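For a project with many PDFs, typing those File Value lines by hand gets tedious, so the two files described above could be produced by a script instead. What follows is a minimal Python sketch under several assumptions. The function names are invented for illustration. The root element name filelist is an assumption about what the console expects. The jar location is a placeholder that depends on the installation. The -Xmx256m, -Dlog4j.configuration, -jar, -l, -o and concat pieces are taken from the batch file above.

```python
import xml.etree.ElementTree as ET


def write_concat_list(pdf_paths, list_path):
    """Write an XML list with one <file value="..."/> entry per PDF."""
    root = ET.Element("filelist")  # assumed root element name
    for pdf in pdf_paths:
        ET.SubElement(root, "file", {"value": str(pdf)})
    ET.ElementTree(root).write(list_path, encoding="UTF-8", xml_declaration=True)


def build_concat_command(java, console_jar, list_path, output_pdf):
    """Assemble the same command that the last line of the batch file runs."""
    return [
        java,
        "-Xmx256m",
        "-Dlog4j.configuration=console-log4j.xml",
        "-jar", console_jar,
        "-l", list_path,
        "-o", output_pdf,
        "concat",
    ]


if __name__ == "__main__":
    # Placeholder jar name; the real file sits in the lib folder of the
    # PDFsam installation and its name varies by version.
    command = build_concat_command(
        "java",
        "pdfsam-console-2.1.1e.jar",
        r"D:\Current\ConcatList.xml",
        r"D:\Current\PDFsamOut\Merged.pdf",
    )
    print(" ".join(command))
```

Calling write_concat_list first and then passing the command list to subprocess.run would do roughly what the batch file does. Keeping the arguments in a list rather than one long string avoids most quoting problems with paths that contain spaces.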