Showing posts with label command line. Show all posts

Sunday, March 18, 2012

Batch Converting Many Microsoft Word (.doc) Files to PDF - First Try

I had recently figured out how to batch convert many text files to PDF.  Now I was on a roll.  I wanted to know how to do the same thing with word processing documents produced in Microsoft Word 2003.

The approach used for the text files didn't seem likely to produce good results for Word files.  The text approach used Notepad on the command line.  Notepad would lose all the formatting.  It might actually create a mess.  It seemed there would probably be a better way.

In that previous approach, I had configured my default PDF printer, Bullzip, to shut up and stop asking questions.  So it sailed right through the printing task.  When producing complete garbage, I prefer not to be interrupted -- although, in that case, the output actually seemed OK.  In similar spirit, I wished for a Word conversion process that would just follow orders.

I thought of setting Bullzip in minimal-interruption mode, as in the text file approach, and just selecting a gaggle of Word docs in Windows Explorer, right-clicking, and choosing Print.  Sad to say, Windows 7 was not interested in giving me a print option when I selected more than 15 items.  So I would have to repeat the process with groups of 15 files at a time.  This did not fall within my definition of hassle-free.

Seeking some alternate approach, I did a search.  I was thinking, first, that maybe Word had command line options like Notepad.  Microsoft did not seem to offer any such option.  Others concurred that I would probably need some kind of macro, script, or other third-party solution.

Another search led to some relatively less desirable solutions:  buying Fineprint, A-PDF, or easyPDF SDK (seemingly complicated); using a combination of VBScript and Automation; uploading Word docs to OCR Convert; or using AnyToPDF, which admirably used OpenOffice but would require the system to restart OO for each document being printed.  I found a thread that yielded other possibilities, including an apparent Word command-line possibility after all.  It seemed to require something called Quiet PDF Printer, which I could not locate.

As I was browsing Wikipedia's list of PDF software, not seeing much of relevance, I realized I would much prefer a solution that would use Word, as distinct from some other program, so as to have the greatest likelihood of preserving formatting.  After all, I was not planning to inspect the resulting PDFs closely.  I didn't want to find out, a year down the line, long after I had discarded the original Word docs, that the PDFs were missing the bottom two lines of text, or that important characters were being misprinted or something.  No doubt this approach of opening Word was going to be slow, though, as in the OpenOffice alternative disparaged above.

I saw that Quiet PDF Printer suggestion repeated in another thread, but without any mention of Quiet PDF Printer.  Maybe the first person who mentioned it meant that I should just have a no-hassle PDF printer, like Bullzip with the desired settings.  Anyway, the suggestion was to run this command:

"C:\Program Files\Microsoft Office\Office\winword.exe" "C:\My Documents\doc1.doc" /mFilePrintDefault
Of course, the path to winword.exe would have to be adjusted on some systems, and doc1.doc was just an example.  But the point is, it worked.  One problem:  it left Word running, and another iteration of the command opened another instance of Word.  So unless I wished to have a couple hundred unused Word sessions lounging around, consuming system resources, I would need to kill Word after printing the PDF.  Further reading in that same thread led to a refinement:
"C:\Program Files\Microsoft Office\Office\winword.exe" test.rtf /q /n /mFilePrintDefault /mFileExit
The description seemed to say that (1) those last two items were actually Word's way of calling a macro on the command line; (2) the selection of commands available for such use was visible in this menu pick in Word 2003:  Tools > Macro > Macros > Macros in: Word Commands (in ribbon versions of Word, try this key sequence:  Alt-T, M, M); (3) FilePrintDefault and FileExit were two such commands; and (4) if I went into Tools > Options > Print tab > uncheck Background Printing, I would not have Word exiting before the PDF was done printing.

I decided to try that last command line approach.  I made the stated settings changes in Word, and set Bullzip to stun.  Now it was a question of working up the list of commands, for all these Word documents that I wanted to PDF.  Ordinarily, I would have used a combination of DIR and Excel for that purpose, with one command per file, producing a batch file containing many commands.  But spring had arrived and, you know, in spring a man begins to feel powerful urges.  My social life being what it was, this translated into some recent experimentation with looping batch files.  That is, I believed I might be able to devise a batch program that would provide a simpler (or at least more direct) way to run this printing process.  So, from a command prompt in the folder containing my Word docs, I ran a batch file that I called Printit.bat.  That batch file contained just one line, though it wraps over several lines here.  The line was:
FOR /F %%g IN ('dir /b *.doc') DO "C:\Program Files (x86)\Microsoft Office\OFFICE11\WINWORD.EXE" test.rtf /q /n /mFilePrintDefault /mFileExit
Word immediately gave me a message indicating that it had encountered an error.  I wondered if that was because I had a session of Word open before running the batch file.  But that didn't seem to be the answer.  Well, maybe it was because I already had a PDF printout of the first file in the folder.  I had created that PDF during the process of testing this stuff.  Apparently my batch file and/or Word were not going to dilly-dally to ask me about overwriting.  So now I deleted that preexisting PDF and tried again.  No, that wasn't it; I still got the error.  This time, instead of guessing, I clicked its Show Help button and got an explanation:
The file you tried to open was not found. . . . [If the file exists but] does not open, it is either corrupt, locked by another application, or is protected by file permissions.
So, silly me, I looked again at my batch command.  Test.rtf?  WTF was Test.rtf?  I had copied the foolish thing verbatim, without pausing to reflect.  When your professors try to tell you how important it is to master critical thinking, believe them.  They're right.  As it turned out, there were multiple problems with that first try at a batch command.  One of those problems was that, contrary to initial hopes, Word was actually not postponing the next doc until it had closed the previous doc; therefore, it was stumbling over itself.  The solution was a batch file containing this one long line:
FOR /F "usebackq delims=" %%g IN (`dir /b "*.doc"`) DO "C:\Program Files (x86)\Microsoft Office\OFFICE11\WINWORD.EXE" "%%g" /q /n /mFilePrintDefault /mFileExit && TASKKILL /f /im winword.exe
The changes were mainly to add USEBACKQ and to change quotation marks (and use backquotes) accordingly, and also to add the "&& TASKKILL" part.  The && said that the next part (the taskkill) should proceed only after the previous command on the same line (i.e., printing) ran successfully.  From this point, the process ran pretty smoothly.  I found that it did not seem to matter if I already had a Word session active when this ran.  (If there was such a session, I would get a dialog; maybe I should have added another instance of TASKKILL before starting the FOR loop.)  Also, I found that Word would prompt me before overwriting.  I also had an interruption for a problem encountered when the batch file tried to convert a file created in an earlier version of Word.
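
For comparison, here is a minimal Python sketch of the same loop. It is not what I ran; it only illustrates the idea. The WINWORD path and the four switches are copied from the batch line above, while the function names and the dry_run flag are made up for this example. The sketch assumes that /mFileExit really does close Word, in which case subprocess.run would wait for each Word process to finish before starting the next one, and the TASKKILL step might not be needed.

```python
import subprocess
from pathlib import Path

# Path copied from the batch line above; adjust for other systems.
WINWORD = r"C:\Program Files (x86)\Microsoft Office\OFFICE11\WINWORD.EXE"


def build_print_command(doc, winword=WINWORD):
    """Return the argument list that prints one document, then exits Word."""
    return [winword, str(doc), "/q", "/n", "/mFilePrintDefault", "/mFileExit"]


def list_docs(folder):
    """Return the files in folder whose extension is .doc (any letter case)."""
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == ".doc")


def print_all(folder, dry_run=True):
    """Build one print command per .doc file and optionally run them in turn.

    With dry_run=True (the default) nothing is executed; the commands are
    only returned so they can be inspected first.
    """
    commands = [build_print_command(doc) for doc in list_docs(folder)]
    if not dry_run:
        for cmd in commands:
            # subprocess.run blocks until this Word process has exited.
            subprocess.run(cmd, check=False)
    return commands
```

Calling print_all(".") in the folder of Word docs would just list the commands; print_all(".", dry_run=False) would actually launch Word once per file.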

There was another problem.  I got a dialog saying, "There is insufficient memory."  A search led to a Microsoft webpage that said this could result from a cramped paging file, or from some antivirus software or from using floppy disks.  None of these seemed to apply in my case.  Another discussion said that maybe this problem came from a corrupted Normal.dot.  That was a possibility in my case; I had occasional error messages involving Normal.dot.  Another potential cause:  abnormal termination of Word (such as I was doing myself, in this batch file, with TASKKILL), leaving junk in the %Temp% folder (located via Start > Run > %Temp% -- in my case, C:\Users\Ray\AppData\Local\Temp).  Cleaning out the %Temp% folder seemed to help:  there were hardly any memory error messages during the rest of the process.  The process seemed suitably restrained, whether by the "&&" device or otherwise, to the point that (judging from system tray icons) there were usually no more than one or two Bullzip processes underway at once.

When the process was done, most but not all of the DOCs had been converted.  I looked at the ones that had not.  (For that, I used an Excel comparison, with VLOOKUP, of filelists obtained by DIR from the input and output folders.)  All gave me an "insufficient memory" error when I tried to open them in Word.  Some seemed to be corrupted to various degrees.  I used Notepad and wReplace to slightly clean up the ones whose corruption prevented them from printing to PDF in a more or less normal fashion.  (In wReplace, the option I used was Replace Many > Open (arrow) > Diacritic to ASCII.)  Several others were printable, but I hadn't printed them.  That is, when the batch file was running, Word kept asking me if I wanted to save changes to (or to print; can't remember for sure) a document with a weird name.  The same name, over and over again.  I thought it was some kind of error, since that name wasn't in my file list.  Possibly this problem had something to do with the fact that these documents were originally created on a Mac and then converted.  So I had to PDF those manually.
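
The DIR-plus-VLOOKUP comparison could also be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and it assumes each PDF carries the same base name as its source document (report.doc becoming report.pdf), which depends on how the PDF printer is set to name its output.

```python
from pathlib import Path


def missing_pdfs(doc_dir, pdf_dir):
    """Return names of .doc files in doc_dir with no same-named .pdf in pdf_dir.

    Base names are compared without regard to letter case, as Windows would.
    """
    pdf_stems = {p.stem.lower() for p in Path(pdf_dir).iterdir()
                 if p.suffix.lower() == ".pdf"}
    return sorted(p.name for p in Path(doc_dir).iterdir()
                  if p.suffix.lower() == ".doc"
                  and p.stem.lower() not in pdf_stems)
```

The input and output folders can be the same folder, if the PDFs were printed alongside the DOCs.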

Next, I wanted to take a quick look, to see whether any of the resulting PDFs were actually junk -- whether, for any reason, some of them made it through the process in garbled form.  For that, I took the approach of converting just the first page of each PDF to JPG, and then flipping through them in a photo viewer (e.g., IrfanView).  This process did turn up a few corrupted documents.  I was able to verify that they had been corrupted before I started this process; it did not appear that the steps described here had any effect.

I wasn't extremely concerned about these documents.  If I had been, I think a modified strategy would have been advisable:  take a quick look through all of the documents, as just mentioned, and then take a closer look at any that seemed important.  It would have been handy, for that purpose (and others), to have image- (and audio-) viewing (or listening) software that would not only display the item in question, but would also let me shove it into various categories with the touch of a key.  In this case, the categories would have been OK and Not OK and Examine More Closely.

Sunday, March 11, 2012

Printing Webpages as PDFs from the Command Line

I was looking for a way to print a bunch of webpages to PDF files from the command line. This page describes the search that, as before, brought me to wkhtmltopdf.

One approach, it seemed, was to use Pdf995 and Omniformat. I had been frustrated, last time I tried pdf995, but nearly a year had passed, and this was a different project. Maybe this time it would work. They seemed to want me to install pdf995 and then install Omniformat. Not being entirely sure which ones I would need, I installed a half-dozen programs from their webpages. They said Omniformat would include HTML2PDF995, which would permit command-line conversions among formats including HTML and PDF. So that sounded promising. Installation of Omniformat brought up an HTML page included in the program (evidently not available online) that said the command line syntax was like this: omniformat.exe [input file] [output format]. So in my example, it would look like this:

omniformat.exe http://www.cnn.com/Chinastory.html "png"
In that case, the Omniformat command wouldn't give the file the desired name, so I would have to add a command to do that. I tried it, just doing the part as shown for now. I got the error, "omniformat.exe is not recognized as an internal or external command, operable program or batch file." In other words, it wasn't part of the computer's path. I had to run this command from within the folder where omniformat.exe was installed. A search of my computer said that the folder in question would be "C:\Program Files (x86)\omniformat." So I ran that omniformat.exe command there. But it opened up the GUI and made me wait for maybe 20 seconds until it would open a session of Internet Explorer, so that it could display its adware; but then that failed with "An error has occurred in the script on this page." Same thing if I tried using a file on my computer rather than a webpage's URL in the command. It seemed that pdf995 was still not going to work for me.

A search led to Total HTML Converter which, for $50, promised to do exactly what I needed: convert webpages to JPG and possibly to PDF from the command line. There didn't seem to be a listing on CNET for Total HTML Converter. It got three stars from 21 users (3,582 downloads) on Softpedia. Fifty bucks for a three-star program ... hmm.

A Softpedia search for similar programs turned up Spire PDF Converter (rated 3.0 by four users; 1,057 downloads), HTML to PDF Converter (rated 3.6 by eight users; 6,353 downloads), 7-PDF Website Converter (rated 3.7 by 10 users; 1,746 downloads), HTML_ to PDF (rated 3.2 by 19 users; 2,225 downloads), and Gerolf Markup Shredder (rated 2.8 by 23 users; 1,816 downloads). Gerolf was the only one whose description said it could run from the command line. I checked the homepages of the others to see about them. Spire said nothing about it. Likewise HTML to PDF Converter, and 7-PDF. I wasn't sure about HTML_ to PDF, so I downloaded that and Gerolf. HTML_ to PDF gave me an unzipped folder with no executables; it looked like I would have to learn something about PHP programming to use it. Meanwhile, Gerolf's installation asked me if I wanted to install GMS, to which I said sure, go ahead. Then it gave me a dialog partly in German, to which I replied Ja. Next came an almost entirely German dialog that seemed to be asking where I wanted to unpack the installation files. Its Durchsuchen (Search) button took me to a Temp folder, so I just clicked on that and said OK. Next, a dialog telling me to run gmsunzip.bat to install. Apparently I should have written down where I unpacked the files. Fortunately, Everything found gmsunzip.bat, so I did run it. I pressed the Whatever key to move past its first screen of information. It was starting to look like I should have chosen a more permanent location, so I went back and started over with the installation. Now I understood that its first dialog, referring to GMS, was of course referring to Gerolf Markup Shredder, and not to some other program; I just hadn't understood that it was asking me if I wanted to install the thing that I had just double-clicked on. So now I Durchsuched to a newly created folder called C:\GerolfHTMLtoPDF, and after the installation I went there and ran gmsunzip.bat.
Unfortunately, at the end, I got a message indicating that this was an unsupported 16-bit installation that was incompatible with 64-bit versions of Windows. So I would have to run it in a Windows XP Virtual Machine. While thinking about that, I went back to HTML_ to PDF. I took a closer look. The second script on the webpage seemed to be something that I might be able to just copy into Notepad, save as an HTM file, and double-click on. I tried that. No, it was going to require some PHP knowledge, though maybe not much. Now I noticed that Gerolf would not go away. It kept insisting on telling me, again and again, about the Unsupported 16-bit Application problem. I had to use Start > Run > taskmgr.exe. But, whoa, what's this? "Windows cannot find 'C:\Windows\System32\taskmgr.exe'." Had one of these foolish programs, or something else, screwed up my system? I could see that taskmgr.exe was indeed in the System32 folder. Hmm. Not clear what was happening. Eventually I found that a CMD window was running; I had to kill that to shut off the recurrent dialogs. But that didn't fix the problem with taskmgr.exe. Maybe a reboot would ... later.

I went back to my previous post on a somewhat similar problem. The most promising solutions there seemed to be to use PrintHTML, to print all linked files from an HTML page in Internet Explorer, or to use wkHTMLtoPDF. I shied away from wkHTMLtoPDF because it was so complicated. I installed PrintHTML and the DHTML Editing Control (required on some systems, evidently including mine, judging from error messages when I tried running PrintHTML without it), and then looked at its instructions. It seemed to be just designed to permit some tinkering (e.g., margin adjustments) while printing local HTML files; no clear indication of how it would work with a webpage. I tried this command:
printhtml.exe file="https://www.nytimes.com"
(I had to run that command from within the folder where PrintHTML was installed.) It gave me a nearly blank page. It seemed that, basically, it was not designed to do what I needed. How about the approach of printing linked files from within Internet Explorer (IE)? The concept was that I could create an HTML page containing links to the webpages I wanted to print, and IE could be persuaded to print them all. I wasn't sure if they would print as one big PDF that I would have to split apart, but that seemed likely. In that case, the files wouldn't have the desired individual names. This tentatively seemed to be another case where the approach was designed for local files, not for webpages.

On this basis, I went back to wkHTMLtoPDF, as described in another post in this blog, posted at about the same time as this one, on the subject of Converting URL-Linked Webpages to PDF.
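
Since the sticking point throughout was getting one individually named PDF per webpage, here is a rough Python sketch of how a list of URLs could be turned into wkhtmltopdf commands. It relies only on wkhtmltopdf's basic form (input URL, then output file). The naming rule and the function names are assumptions made up for this example, and the sketch builds the commands without running them.

```python
import re
from urllib.parse import urlparse


def pdf_name_for(url):
    """Derive a filesystem-friendly PDF name from a URL's host and path."""
    parsed = urlparse(url)
    raw = (parsed.netloc + parsed.path).strip("/")
    # Replace every run of characters other than letters, digits, dots,
    # and hyphens with a single underscore.
    safe = re.sub(r"[^A-Za-z0-9.-]+", "_", raw).strip("_")
    return (safe or "page") + ".pdf"


def build_commands(urls, exe="wkhtmltopdf"):
    """Return one [exe, url, output.pdf] argument list per URL."""
    return [[exe, url, pdf_name_for(url)] for url in urls]
```

Each resulting list could be handed to subprocess.run, or joined into a line of a batch file, once wkhtmltopdf is installed and reachable from the command prompt.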

Saturday, February 25, 2012

Windows 7: Setting and Maintaining Accurate System Time

I wanted to keep two computers' clocks set the same, for purposes of synchronization, so that they would have an accurate sense of whether the version of File X on computer A was newer than the version of File X on computer B.  I had previously installed (or, more accurately, just added a copy of) Judah Levine's portable NISTIME 32 in something of a rush, when installing Windows 7, and, later, had vaguely recognized that it was not working right and/or I had not set it right.  Now I decided to work out the kinks in this function.

NISTIME-32BIT.EXE

I started with the National Institute of Standards and Technology (NIST), from which programs like NISTIME 32 would draw the current time.  It developed that NIST had a program called nistime-32bit.exe.  It turned out to be the same as NISTIME 32, just slightly updated.  The webpage's instructions were to start by going into File > Select Server and then Query Server > Now.  Somewhere I saw advice to choose a server near me.  I was tempted to choose two different ones, one for each computer, so as to have accurate time in case there was some terrible disruption of the national timekeeping system.  Then I realized that this could have the effect of making rivers run upstream, where my files were concerned, to wit:  new could be replaced by old.  Being up-to-date on the latest developments in American chronology suddenly seemed less important than making sure I didn't accidentally overwrite today's crossword puzzle.
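
The rivers-upstream worry can be made concrete with a small Python sketch. The times and clock errors below are hypothetical, chosen only to show how a sync rule that trusts timestamps can pick the older copy when two clocks disagree by a few seconds.

```python
from datetime import datetime, timedelta


def recorded_stamp(true_time, clock_error):
    """Timestamp a machine writes when its clock is off by clock_error."""
    return true_time + clock_error


def naive_winner(stamp_a, stamp_b):
    """Naive sync rule: whichever copy carries the later timestamp wins."""
    return "A" if stamp_a > stamp_b else "B"


# Hypothetical scenario: computer A saves File X first, and computer B
# saves it three seconds later. A's clock runs five seconds fast, while
# B's clock is exact.
saved_on_a = datetime(2012, 2, 25, 12, 0, 0)
saved_on_b = saved_on_a + timedelta(seconds=3)
stamp_a = recorded_stamp(saved_on_a, timedelta(seconds=5))
stamp_b = recorded_stamp(saved_on_b, timedelta(seconds=0))

# The rule picks A's copy, although B's copy is really the newer one.
winner = naive_winner(stamp_a, stamp_b)
```

With both machines on the same time source, the two clock errors shrink toward zero and the rule goes back to picking the copy that was really saved last.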

When I went to the Query Server > Now menu pick, I got a dialog indicating that NISTIME 32 was prepared to adjust my computer by 0.953 seconds.  I told it to go ahead.  I also went into Query Server > Periodically and told it to update the computer every 12 hours.  Query Server > Server Status confirmed these settings.  File > Help in Choosing Dirs told me to hit File > Save Config to save my settings.  This gave me "File Error:  Cannot open file to save configuration."  That problem may have been caused by nesting the program too deeply in a subfolder.  I moved it elsewhere and tried again. Now it seemed to confirm that it had saved my settings, and it created NISTIMEW.CFG in the same folder as the program's portable executable (nistime-32bit.exe).  I exited and restarted, and it remembered what I had told it.  But I had to remember to hit File > Save Config; otherwise, it would not remember anything.

But then, when I did go into Query Server > Periodically, specified 12 hours, and hit File > Save Config and then File > Exit, I could not get it back.  The program refused to become visible.  I tried a couple of times, and then looked at Windows Task Manager (Start > Run > taskmgr.exe) > Processes tab.  Taskmgr showed four separate instances of "nistime-32bit.exe *32."  I selected them and clicked End Process, one by one, and then ran nistime-32bit.exe again.  It returned to taskmgr.exe, but not to the screen.  I minimized all windows, one by one, but, no, it was not lurking anywhere.  There didn't seem to be a taskbar or system tray icon for it.  It was here, and yet not here.  I killed the processes again, now that I had started one or two new ones.  I renamed NISTIMEW.CFG to be something else, and now it would start, and it saved new settings in a new NISTIMEW.CFG.  Apparently the config file had gotten corrupted.  I had originally created that file manually in lowercase (nistimew.cfg); possibly something about the program needed the uppercase filename.

But now, same thing again.  Exiting and restarting gave me a hidden program:  visible in Task Manager's Processes tab, but not visible onscreen.  When I right-clicked on nistime-32bit.exe *32 in Task Manager and selected Properties, I got an error:  "Windows cannot find [pathname] nistime-32bit.exe."  I ended the process again.  I created a shortcut to the .exe and tried starting it that way.  I had no reason to think that would make any difference, and in fact it didn't.  I tried moving all of the files from the folder where I had put nistime-32bit.exe, and placed them all instead in C:\Windows, with a shortcut to the executable in my Start Menu.  That wasn't the answer; I still got lurking program sessions that appeared in Task Manager but nowhere else.  I deleted the CFG again and tried again.  Now it ran.  I went directly to File > Save Config without making any changes.  It indicated that it had saved the config file.  I exited and restarted the program.  It ran.

Now I saw something that may have explained the config file problem.  The server list had changed.  The Colorado server that I had selected previously was no longer listed in File > Select Server.  I had previously gone into File > Update Server List, and that had generated a message:  "New server file is C:\Windows\NIST-SRV.LST."  It did that again now, when I designated a new server.  I hit File > Save Config and then File > Exit, and then restarted the program.  Now it was running normally.  I moved the three files (the exe, cfg, and nist-srv.lst files) from C:\Windows back to the folder where I really preferred to have them.  It seemed that the server list had not properly updated when the files were in that folder originally.  I restarted and went through the same steps -- update server list, choose a new server, save config -- and now I was exiting and restarting without a problem.

But no, I spoke too soon.  When I restarted, saved a 12-hour periodic refresh, and exited, it would not restart.  Deleting the config and moving the other files back to C:\Windows did not fix it.  The problem seemed to relate specifically to the attempt to set up recurrent time checks.  I was doing something wrong, or perhaps the program had a bug, or maybe it was not suited for 64-bit Win7.  I went to the NIST webpage cited in the program's Help > More Help and sent an email to the Webmaster link at the bottom of that page, pointing them here.

The Built-In Windows Time Sync Option

I decided to look for an alternative time-updating program.  I ran a search and discovered that there was apparently some kind of automatic time-updating arrangement built into Windows.  The advice there was, however, that "The W32Time service is not a full-featured NTP solution that meets time-sensitive application needs."  That was consistent with the fact that my two computers' timeclocks tended to be somewhat inconsistent with one another.  I had not tried to see how inconsistent they could be, or how long they could remain that way.  I did see an indication somewhere that Windows defaulted to a weekly time update, so maybe it would verify that it was accurate to within a minute, or something, every week or so.

That appeared to be steered by Control Panel > Date and Time.  That dialog could also be opened by right-clicking the clock in the system tray and choosing Adjust Date/Time.  Or, as I now learned from Eric Phelps, it could also be run from the command line via "rundll32.exe shell32.dll,Control_RunDLL timedate.cpl" (without the quotation marks).  The latter option would make it possible to open the Date and Time dialog for manual adjustment via, say, a batch file that would open it automatically (to the correct tab) every day, week, or whatever.  (Later, I found a How-To Geek webpage that said I could just run "w32tm /resync" as administrator to resynchronize the clock without even going into the Date and Time dialog.  That, too, could be incorporated into a scheduled batch file.)
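
As a rough illustration of how those two command lines might be wrapped for scheduled use, here is a small Python sketch. The two commands are the ones quoted above; the function names and the manual flag are invented for this example, and the w32tm call would still need to be run from an administrator account.

```python
import platform
import subprocess

# The two command lines quoted above, split into argument lists.
RESYNC = ["w32tm", "/resync"]
OPEN_DIALOG = ["rundll32.exe", "shell32.dll,Control_RunDLL", "timedate.cpl"]


def time_command(manual=False):
    """Return the dialog-opening command if manual, else the resync command."""
    return list(OPEN_DIALOG if manual else RESYNC)


def run_time_command(manual=False):
    """Run the chosen command on Windows; do nothing (return None) elsewhere."""
    if platform.system() != "Windows":
        return None
    return subprocess.run(time_command(manual), check=False).returncode
```

A daily or weekly scheduled task could then call run_time_command() for an automatic resync, or run_time_command(manual=True) to bring up the dialog for a look.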

The Date and Time dialog > Internet Time tab > Change Settings option gave me a choice of synchronizing with time.nist.gov, which I understood to be the most accurate (though others in that list, not counting time.windows.com, appeared to be cousins of NIST).  I noticed that the dialog told me, here, that "This computer is set to automatically synchronize on a scheduled basis."  The previous sync site, as I saw on the other computer, was time.windows.com.  I wasn't sure how synchronizing with that site could have left my two computers with different times -- differing by seconds, that is, not by minutes -- unless maybe time.windows.com was just not that worried about the seconds.  Or maybe it was trying to synchronize when my router was doing its daily self-restart, and was therefore not getting access to the online clock?  I wasn't sure.  (Note:  Fouzan said that this whole process wouldn't work if the computer was on a domain.)

Curious about the timing, I went into Start > Run > taskschd.msc > Task Scheduler Library.  There were maybe 15 items in the list, and none of them were obvious time sync tasks.  So another possibility was that some bug or tweak, brought into my system somewhere along the line, was preventing the creation or execution of the scheduling function.  Another emerging possibility was that, as stated in a How-To Geek webpage, time.windows.com (which my systems had been using by default) had "a ton of problems with uptime."  So possibly I had already fixed my problem, just by switching the machines to use time.nist.gov in the Date and Time dialog.  (I did notice, as soon as I made that switch and clicked the update button, that both computers' clocks showed exactly the same time.)

Other Possibilities

I ran another search and found a Gizmo recommendation for Dimension 4 as a time correction utility.  It occurred to me, at this point, that possibly I had fixed my problem, just by switching away from time.windows.com (above), and that maybe I should just let things slide for a week or two.  I decided mostly just to record some notes, here, for possible future reference.  So instead of installing Dimension 4, I just dragged the icon for its webpage from my browser's Address bar over to the Time subfolder in my customized Start Menu.  If I ever needed it, I could follow the link at that time.

There also appeared to be more to know than I had realized, regarding Task Scheduler (taskschd.msc).  In Task Scheduler's left-hand pane, I went down the tree into Task Scheduler Library > Microsoft > Windows > Time Synchronization.  Now I saw that my machine was indeed set to synchronize time at 1 AM every Sunday.  I saw advice from Tina Sieber on a way to adjust and improve the scheduling via Task Scheduler.  Tina seemed to believe, however, that using a separate program might be the simpler and more accurate approach.  Tina pointed toward two other programs, Atomic Clock Sync and AtomTime.  The webpage for the latter seemed very old.  I was not sure how it would fare in a 64-bit Windows 7 world.

For now, the solution seemed to be simply to go into the system's clock and change its time source to NIST.  My monthly batch file brought up the NIST/USNO timepage on the first of every month, so I could observe, later, whether my two computers were again diverging from one another and/or from the time on that webpage.  If they didn't stay in line, I would have two options.  One would be to add one of the foregoing command lines to my daily or weekly batch files, to permit manual and/or automatic checking and/or resynchronization.  Another would be to try one of the several freeware utilities just mentioned, particularly Dimension 4 or Atomic Clock Sync.

Saturday, January 14, 2012

Combining PDFs with PDFsam: Introductory Syntax

I was using Windows 7.  I had a project that would benefit from automated merging of multiple PDFs into a single PDF.  It looked like PDFsam would be useful for this purpose.

PDFsam had GUI and Console options.  In other words, it could be accessed through a user-friendly interface, like most Windows programs, and it could also be used on the command line.  My project had certain complexities, such that the GUI approach would not be ideal.  This post describes the steps I took to learn how to use the Console.

I began with the Console section of the PDFsam wiki.  It led to a page providing information on console parameters and commands.  The explanation was too thin, so I did a search for more guidance. This led to a 33-page Tutorial (installed with the program files). It also led to a thread that reminded me not to forget the PDFsam Forums.

The Tutorial (p. 18) seemed to say that, in PDFsam-speak, what I wanted to do was to Merge files, and for this I would use the Concat option. Other options, not of interest here, included Split and Encrypt. It appeared that PDFsam syntax would call for very long commands. Looking for examples, I went to a forum thread, but that pointed me back toward the wiki page (above).

The Tutorial said that, to make PDFsam run from the command line (i.e., Console), I could either type a certain command or just use one of the scripts in the bin folder where the program was installed (e.g., C:\Program Files\pdfsam\bin).  In that bin folder, it appeared I had my pick from two scripts, provided in Linux (.sh) and Windows batch (.bat) versions.  Since I wanted the Console, not the GUI, I focused on run-console.bat.  Its contents seemed to address various details that I didn't clearly understand, and didn't necessarily want to study; it just looked like the thing I would need to use.  So I created a shortcut to it and put that in my Start Menu. I also edited the Tutorial, adding bookmarks to the various sections, and moved it to the Start Menu too.  (My customized Start Menu would survive any subsequent Windows reinstallation, so I probably wouldn't need to do this housekeeping again, if I had to install PDFsam in a new Windows installation sometime down the line.)

Unfortunately, the run-console.bat batch file didn't work for me.  It gave me an endlessly scrolling set of messages. They were ripping past too quickly to read.  I hit PrintScreen, opened IrfanView (any image editor would do, as would Microsoft Word or Wordpad), and pasted the screenshot (Ctrl-V).  (I could have just hit Ctrl-C, or possibly the Pause key.)  Now I could see that it was just the same error message, repeating over and over:

'java' is not recognized as an internal or external command, operable program or batch file.
Why wasn't my system recognizing java?  I right-clicked on run-console.bat, chose Edit, and looked for the line that referred to java. I couldn't quite figure out where the problem was, so I stuck in a "pause" command somewhere, saved the batch file, and, this time, ran it from the command line instead of from the shortcut. That way, the error statements would stay onscreen instead of scrolling past too quickly or disappearing when the batch file finished running. (This was another instance where it was handy to have the right-click option, "Open command window here," provided by Ultimate Windows Tweaker.)

Running the batch command meant just typing its name and hitting Enter. It paused where I had put the pause command, without any obvious errors, so I moved the pause command further down, saved, and repeated the cycle. (Running run-console.bat again required just hitting the Up key to repeat the command.) That's where the problem was: now I had the endless scrolling again. I hit Ctrl-C a couple of times to abort the batch file.

I played around with the batch file for a while, and eventually realized that maybe the problem was that the JAVA_HOME variable had not yet been assigned a value on my system. It appeared that the batch file was supposed to tell me this; if so, it wasn't working right. I went into Start > Run > SystemPropertiesAdvanced.exe > Environment Variables and, sure enough, no JAVA_HOME variable. I had already installed the Java Runtime Environment, and I almost always used the default installation paths when installing programs, so the advice seemed to be that the JAVA_HOME variable should point to C:\Program Files\Java\jre6. Since this folder name ("Program Files") had a space in it, apparently I would need to use the shortened, DOS-style name for it -- known as an "8.3" filename because it allows at most eight characters before the dot and at most three after it (e.g., yourfile.txt).

I knew the shortened name of that folder would probably contain Progra~1 (instead of "Program Files"), and I could have just experimented with that, but I had seen instances where it would be Progra~2 or something else, and anyway I wanted to know how to get the 8.3 name. Microsoft advised using the GetShortPathName function to figure it out, but that seemed to involve programming, and programming is a lot of work. Instead, I ran a search that took me to ShortPath by Marcello Zaniboni. To get ShortPath to work from the command line, I tried the C:\Windows shortcut trick, but it didn't work. I didn't want to add ShortPath to my PATH yet, so I just opened a command window in the folder where ShortPath.exe was located, typed "ShortPath " (with an ending space) but didn't hit Enter, and then dragged the C:\Program Files\Java\jre6 folder into that command window from Windows Explorer. (I think this worked because I had installed DropCommand. Otherwise I might have had to type it out, with quotation marks.)

ShortPath told me that, actually, the short path to that folder was C:\PROGRA~2\JAVA\JRE6. So I went back into SystemPropertiesAdvanced.exe > Environment Variables > System Variables > New > Variable Name = JAVA_HOME, Variable Value = C:\PROGRA~2\JAVA\JRE6. I OKed out of there and rebooted.
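For anyone who would rather check this from a script than hunt through the batch file, here is a minimal Python sketch. It only tests whether JAVA_HOME is set and whether a java executable sits in its bin subfolder. The function name find_java is made up for illustration, and the sketch assumes the usual layout in which java.exe (or java) lives directly under JAVA_HOME\bin.

```python
import os
from pathlib import Path


def find_java(env=None):
    """Return the java executable under JAVA_HOME, or None if it cannot be found.

    Assumes the usual layout: <JAVA_HOME>/bin/java.exe (Windows) or
    <JAVA_HOME>/bin/java (elsewhere). find_java is a hypothetical helper,
    not part of PDFsam or Java.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return None  # variable missing or empty
    bin_dir = Path(java_home) / "bin"
    for name in ("java.exe", "java"):
        candidate = bin_dir / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    java = find_java()
    if java is None:
        print("JAVA_HOME is not set, or there is no java under its bin folder")
    else:
        print("Found Java at", java)
```

If this reports that nothing was found, the batch file's reference to %JAVA_HOME%\bin\java cannot work either, which would be consistent with the repeating error described above.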

After doing that, I still had to play with the batch file for a long time, in a quest to learn, remember, get lucky, or otherwise do what I needed to make it work. By the time I was done, I almost thought that I would have been further ahead just using the command given in the wiki:
java -Dlog4j.configuration=console-log4j.xml -jar pdfsam-console-2.1.1e.jar
except that that didn't work either because, as I soon realized, it was a Linux command. I also did not fare too well with the advice to type "run-console.bat -h concat" for information on the syntax for the Concat option, because the run-console.bat file itself was not yet working.

The Tutorial (pp. 19-20) said that I had three ways to indicate which files I wanted to merge. Instead of entering one parameter to indicate a directory and then entering another parameter to indicate one or more files in that directory, it seemed I would want the option that would allow me to specify a file (including its path) on a single line. Evidently I could list a number of PDF files in a separate XML file, and invoke that file (with its list of PDFs) by using the -l (that's an L, not a one) option. But it wasn't working right. Ultimately, I posted a question on it. Andrea (a guy from the Netherlands), creator of PDFsam, posted a reply within 36 hours. And that got me where I needed to go. I was able to get a test run to work, with a run-console.bat file whose contents (viewed in something like Notepad, of course, not in a word processor like Word that would add all kinds of invisible junk) were as follows:
@echo off

set JAVA=%JAVA_HOME%\BIN\JAVA

set JAVA_OPTS=-Xmx256m -Dlog4j.configuration=console-log4j.xml

set CONSOLE_JAR="C:\Program Files (x86)\pdfsam\lib\pdfsam-console-2.3.1e.jar"

@echo on

%JAVA% %JAVA_OPTS% -jar %CONSOLE_JAR% -l D:\Current\ConcatList.xml -o D:\Current\PDFsamOut\Merged.pdf concat
While I wasn't entirely clear on what all those lines did, the basic idea seemed to be that the first lines would define JAVA, JAVA_OPTS, and CONSOLE_JAR, and then the last line would combine them all into one big command. That command seemed to say, "Run Java with these options, using this jar file for specific instructions; take your input from the PDF files listed in ConcatList.xml; and output a single PDF file, Merged.pdf, containing all of those PDF files." To make that work, I needed to know the format of the ConcatList.xml file. Here's the one that worked for me in this test run:
<?xml version="1.0" encoding="UTF-8"?>
<filelist>
<file value="D:\Current\TestDir1\x1.pdf"/>
<file value="D:\Current\TestDir2\x2.pdf"/>
</filelist>
I just needed one "file value" line for each PDF to be merged, using the syntax shown.  To summarize, then, I used Notepad to create two files.  One, called run-console.bat, contained the batch commands shown above, beginning with @echo off.  The other, ConcatList.xml, contained the five lines of XML just shown, beginning with the "xml version" line.  ConcatList.xml would list the PDFs to be merged into the larger output PDF (there were other options for ConcatList.xml; I just didn't need them for my project), and run-console.bat would read that list and do the actual concatenation into a single output PDF.
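With more than a handful of PDFs, typing those XML lines by hand gets tedious. The following is a rough Python sketch, not anything supplied by PDFsam, that writes the same list format from a list of paths and assembles the same command shown in the batch file. The function names build_concat_list and concat_command are invented for this example; the options (-l, -o, concat, and the two Java options) are copied from the batch file above rather than from any fuller reading of the PDFsam documentation.

```python
from xml.sax.saxutils import quoteattr


def build_concat_list(pdf_paths):
    """Return ConcatList.xml text with one <file value="..."/> line per PDF.

    The layout mirrors the working example in this post. quoteattr() escapes
    characters such as & that would otherwise break the XML.
    """
    lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<filelist>"]
    for path in pdf_paths:
        lines.append("<file value=%s/>" % quoteattr(str(path)))
    lines.append("</filelist>")
    return "\n".join(lines) + "\n"


def concat_command(java, console_jar, list_file, output_pdf):
    """Return the argument list for the merge command used in run-console.bat."""
    return [
        java,
        "-Xmx256m",
        "-Dlog4j.configuration=console-log4j.xml",
        "-jar", console_jar,
        "-l", list_file,
        "-o", output_pdf,
        "concat",
    ]


if __name__ == "__main__":
    # Example paths taken from the test run described in this post.
    pdfs = [r"D:\Current\TestDir1\x1.pdf", r"D:\Current\TestDir2\x2.pdf"]
    print(build_concat_list(pdfs))
```

The text returned by build_concat_list could be saved as ConcatList.xml, and the list returned by concat_command could be handed to subprocess.run, which takes care of quoting paths that contain spaces.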

Monday, April 18, 2011

Batch Merging (Combining, Concatenating) PDFs from the Command Line

I was using Windows 7.  I had a bunch of JPGs that were images of successive pages in a document.  In other words, when the document was scanned, each page was saved to its own separate file.  They were named Page01.jpg, Page02.jpg, Page03.jpg, and so forth.  I had converted these JPGs to PDF, thinking that would help me toward my goal.  The goal was to combine them all -- whether as JPGs or PDFs -- into one PDF file containing the entire document.  I had a large number of documents like this, each consisting of several pages, all together in one directory.  It was too big a job to do manually.  But could I automate it?  This post describes my efforts to that end.
What I wanted was a program or script that could tell these files apart within a directory and combine only the ones that belonged together:

Doc1Page1
Doc1Page2
Doc2Page1
Doc3Page1
Doc3Page2
so that I would wind up with this:
Doc1 (pages 1 & 2)
Doc2 (page 1)
Doc3 (pages 1 & 2)
A search led to iText, which looked sleek and got some good recommendations but unfortunately (a) did not appear to be available in a Windows/DOS version and (b) was not for end users.  In other words, I had no idea what to do with it.  A Gizmo's Freeware article did not seem to identify programs that could do this.  The article led me to PDFill PDF Editor as its first choice for an all-around freeware PDF solution.  There, I went to the Merge PDF Files tool.  Its batch command option, available only in its $20 paid version, looked like it would come close to doing what I wanted.  The example they gave looked like this:
"C:\Program Files\PlotSoft\PDFill\PDFill.exe" MERGE Input1.pdf Input2.pdf Input3.pdf Output.pdf
With many files or long filenames, that approach would run into limits on how long a command could be.  I suspected I could vary their command with standard DOS input options, which I vaguely recalled would look like this:
"C:\Program Files\PlotSoft\PDFill\PDFill.exe" MERGE < inputfilelist.txt
So then the challenge would be to automate the process of identifying filenames that would belong together in the same inputfilelist.txt file:  Doc1Page1.pdf and Doc1Page2.pdf would be in Doc1inputfilelist.txt, whereas Doc3Page1.pdf and Doc3Page2.pdf would be in Doc3inputfilelist.txt.  Then all I'd have to do would be to construct a batch file with lines like this:
"C:\Program Files\PlotSoft\PDFill\PDFill.exe" MERGE < Doc1inputfilelist.txt
"C:\Program Files\PlotSoft\PDFill\PDFill.exe" MERGE < Doc3inputfilelist.txt
I wasn't sure if PDFill would allow me to select a name for each resulting output file, or how that would work.  With a large number of files, that manual process could be very time-consuming.  I could also look into other possibilities, like going back to the JPGs from which I had created these PDFs and merging them into multipage TIF files that I could then convert into multipage PDFs.
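The grouping step itself, deciding which page files belong to which document, is the part that lends itself most readily to a script. Here is a minimal Python sketch of that one step. It assumes filenames follow the DocNPageN.pdf pattern used in the example above (some document name, then the word "Page", then a page number); real filenames may well need a different pattern. The name group_pages is invented for illustration, and the sketch says nothing about how PDFill or any other merger would consume the result.

```python
import re
from collections import defaultdict

# Assumed naming pattern: <document name>Page<number>.pdf, e.g. Doc1Page2.pdf
PAGE_NAME = re.compile(r"^(?P<doc>.+?)Page(?P<page>\d+)\.pdf$", re.IGNORECASE)


def group_pages(filenames):
    """Group page files by document name, with pages in numeric order.

    Files that do not match the assumed pattern are ignored.
    """
    groups = defaultdict(list)
    for name in filenames:
        match = PAGE_NAME.match(name)
        if match:
            groups[match.group("doc")].append((int(match.group("page")), name))
    # Sort numerically so that Page10 comes after Page2, not before it.
    return {
        doc: [name for _, name in sorted(pages)]
        for doc, pages in sorted(groups.items())
    }


if __name__ == "__main__":
    files = ["Doc1Page1.pdf", "Doc1Page2.pdf", "Doc2Page1.pdf",
             "Doc3Page1.pdf", "Doc3Page2.pdf"]
    for doc, pages in group_pages(files).items():
        print(doc, "<-", ", ".join(pages))
```

Each entry in the result corresponds to one per-document input list (Doc1inputfilelist.txt and so on), with the document name available for naming the output file.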

These were the steps I would have to pursue as this project continued. But I had to shelve it for now, to deal with other things.