Sunday, March 11, 2012

Printing Webpages as PDFs from the Command Line

I was looking for a way to print a bunch of webpages to PDF files from the command line. This page describes the search that, as before, brought me to wkhtmltopdf.

One approach, it seemed, was to use Pdf995 and Omniformat. I had been frustrated, last time I tried pdf995, but nearly a year had passed, and this was a different project. Maybe this time it would work. They seemed to want me to install pdf995 and then install Omniformat. Not being entirely sure which ones I would need, I installed a half-dozen programs from their webpages. They said Omniformat would include HTML2PDF995, which would permit command-line conversions among formats including HTML and PDF. So that sounded promising. Installation of Omniformat brought up an HTML page included in the program (evidently not available online) that said the command line syntax was like this: omniformat.exe [input file] [output format]. So in my example, it would look like this:

omniformat.exe http://www.cnn.com/Chinastory.html "png"
In that case, the Omniformat command wouldn't give the file the desired name, so I would have to add a command to do that. I tried it, just doing the part as shown for now. I got the error, "omniformat.exe is not recognized as an internal or external command, operable program or batch file." In other words, it wasn't part of the computer's path. I had to run this command from within the folder where omniformat.exe was installed. A search of my computer said that the folder in question would be "C:\Program Files (x86)\omniformat." So I ran that omniformat.exe command there. But it opened up the GUI and made me wait for maybe 20 seconds until it would open a session of Internet Explorer, so that it could display its adware; but then that failed with "An error has occurred in the script on this page." Same thing if I tried using a file on my computer rather than a webpage's URL in the command. It seemed that pdf995 was still not going to work for me.

A search led to Total HTML Converter which, for $50, promised to do exactly what I needed: convert webpages to JPG and possibly to PDF from the command line. There didn't seem to be a listing on CNET for Total HTML Converter. It got three stars from 21 users (3,582 downloads) on Softpedia. Fifty bucks for a three-star program ... hmm.

A Softpedia search for similar programs turned up Spire PDF Converter (rated 3.0 by four users; 1,057 downloads), HTML to PDF Converter (rated 3.6 by eight users; 6,353 downloads), 7-PDF Website Converter (rated 3.7 by 10 users; 1,746 downloads); HTML_ to PDF (rated 3.2 by 19 users; 2,225 downloads); and Gerolf Markup Shredder (rated 2.8 by 23 users; 1,816 downloads). Gerolf was the only one whose description said it could run from the command line. I checked the homepages of the others to see about them. Spire said nothing about it. Likewise HTML to PDF Converter, and 7-PDF. I wasn't sure about HTML_ to PDF, so I downloaded that and Gerolf. HTML_ to PDF gave me an unzipped folder with no executables; it looked like I would have to learn something about PHP programming to use it. Meanwhile, Gerolf's installation asked me if I wanted to install GMS, to which I said sure, go ahead. Then it gave me a dialog partly in German, to which I replied Ja. Next an almost entirely dialog that seemed to be asking where I wanted to open the installation files. Its Durchsuchen (Search) button took me to a Temp folder, so I just clicked on that and said OK. Next, a dialog telling me to run gmsunzip.bat to install. Apparently I should have written down where I unpacked the files. Fortunately, Everything found gmsunzip.bat, so I did run it. I pressed the Whatever key to move past its first screen of information. It was starting to look like I should have chosen a more permanent location, so I went back and started over with the installation. Now I understood that its first dialog, referring to GMS, was of course referring to Gerolf Markup Shredder, and not to some other program; I just hadn't understood that it was asking me if I wanted to install the thing that I had just double-clicked on. So now I Durchsuched to a newly created folder called C:\GerolfHTMLtoPDF, and after the installation I went there and ran gmsunzip.bat. Unfortunately, at the end, I got a message indicating that this was an unsupported 16-bit installation that was incompatible with 64-bit versions of Windows. So I would have to run it in a Windows XP Virtual Machine. While thinking about that, I went back to HTML_ to PDF Converter. I took a closer look. The second script on the webpage seemed to be something that I might be able to just copy into Notepad, save as an HTM file, and double-click on. I tried that. No, it was going to require some PDF knowledge, though maybe not much. Now I noticed that Gerolf would not go away. It kept insisting on telling me, again and again, about the Unsupported 16-bit Application problem. I had to use Start > Run > taskmgr.exe. But, whoa, what's this? "Windows cannot find 'C:\Windows\System32\taskmgr.exe." Had one of these foolish programs, or something else, screwed up my system? I could see that taskmgr.exe was indeed in the System32 folder. Hmm. Not clear what was happening. Eventually I found that a CMD window was running; I had to kill that to shut off the recurrent dialogs. But that didn't fix the problem with taskmgr.exe. Maybe a reboot would ... later.

I went back to my previous post on a somewhat similar problem. The most promising solutions there seemed to be PrintHTML, print all linked files from an HTML page in Internet Explorer, or use wkHTMLtoPDF. I shied away from wkHTMLtoPDF because it was so complicated. I installed PrintHTML and the DHTML Editing Control (required on some systems, evidently including mine, judging from error messages when I tried running PrintHTML without it), and then looked at its instructions. It seemed to be just designed to permit some tinkering (e.g., margin adjustments) while printing local HTML files; no clear indication of how it would work with a webpage. I tried this command:
printhtml.exe file="https://www.nytimes.com"
(I had to run that command from within the folder where PrintHTML was installed.) It gave me a nearly blank page. It seemed that, basically, it was not designed to do what I needed. How about the approach of printing linked files from within Internet Explorer (IE)? The concept was that I could create an HTML page containing links to the webpages I wanted to print, and IE could be persuaded to print them all. I wasn't sure if they would print as one big PDF that I would have to split apart, but that seemed likely. In that case, the files wouldn't have the desired individual names. This tentatively seemed to be another case where the approach was designed for local files, not for webpages.

On this basis, I went back to wkHTMLtoPDF, as described in another post in this blog, posted at about the same time as this one, on the subject of Converting URL-Linked Webpages to PDF.

1 comments:

raywood

This post doesn't mention the many non-command-line solutions. For instance, a search led to a webpage that highlighted the Save as PDF extension that would enable Chrome and Firefox to save a webpage as a PDF, possibly with just one click.