Tuesday, March 13, 2012

Converting URL-Linked Webpages (Bookmarks, Favorites) to PDF

At some point, I discovered that it was possible to save a link to a webpage by just dragging its icon from the Address bar in an Internet browser (e.g., Internet Explorer, Chrome, Firefox) to a folder in Windows Explorer.  Doing that would create a URL file with two bits of information:  the location of the webpage, and the name by which I referred to it.  For example, I might have renamed one of those URL files, pointing at some random CNN.com webpage, to be "July 30 Article about China."  I had also gotten some URL files by dragging Firefox bookmarks to a Windows Explorer folder.
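If I opened one of those URL files in Notepad, its contents would look something like this (the address here is just an illustration; the file's own name carried the other bit of information, namely the title I had given it):
[InternetShortcut]
URL=http://www.cnn.com/Chinastory.html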

I now had a bunch of those URL files.  I wanted to print the linked webpages in PDF format, using those given names as the names of the resulting PDFs.  And I wanted to automate it, so that I could just provide the set of URL files, or something derived from it, and some program would do the rest.  Input a dozen URL link files; output a dozen PDFs, each containing a printout of what was on the webpage cited in one of the URL files.

Part of the solution was easy enough.  On the command line, I could type two commands:

echo China Story.url > filelist.txt
type "China Story.url" >> filelist.txt
Note the seeming inconsistency in quotation marks:  the quotes were needed for TYPE because the filename contained a space, whereas ECHO would have copied any quotation marks literally into the output.  (I would have to continue using >> rather than > with any subsequent additions, so as not to overwrite what was already in filelist.txt.)  Filelist.txt would then contain three lines, like this:
China Story.url
[InternetShortcut]
URL=http://www.cnn.com/Chinastory.html
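Typing those two commands for each of hundreds of files would have been tedious.  A short batch file could perform the same two steps for every URL file in a folder.  The following is only a sketch, meant to be run as a batch file from the folder containing the .url files (at the command prompt itself, %%f would be typed as %f):
@echo off
rem Append each .url file's name, and then its contents, to filelist.txt
rem Filenames containing special characters (e.g., &) may need cleanup first
if exist filelist.txt del filelist.txt
for %%f in (*.url) do (
  echo %%f >> filelist.txt
  type "%%f" >> filelist.txt
)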
I could do a simple find-and-replace to get rid of the [InternetShortcut] lines, and manipulate the rest in a spreadsheet like Microsoft Excel, using DIR to produce the list of filenames.  So the spreadsheet built from filelist.txt could generate a series of commands along these conceptual lines:
print webpage located at URL as PDF and call it China Story.pdf
So then I needed a PDF printer that would print webpages from the command line.  I had previously searched for something similar and had wound up using wkHTMLtoPDF (which they actually wrote as just all lower-case:  wkhtmltopdf, where the "wk" apparently stood for "webkit").  It had been somewhat complicated.  I decided to look for something else.  There followed a time-consuming search that brought me back to wkHTMLtoPDF.

As before, wkHTMLtoPDF seemed vastly more capable than the alternatives I had been looking at.  Once I got past the initial reluctance that had driven me toward those other possibilities, I realized that, given the different nature of this project, the wkHTMLtoPDF part might not be complicated at all.  Among the many options listed in the program's help ("wkhtmltopdf --help" on the command line), it looked like I might get acceptable results from a command like this:
start /wait wkhtmltopdf -s Letter -T 25 -B 25 -L 25 -R 25 "http://www.nytimes.com/" "D:\NY Times Front Page.pdf"
That command, which I would have to run from within the folder where wkhtmltopdf was installed, would specify a Size of letter-sized paper, a Top margin of 25 (wkhtmltopdf reads a bare number as millimeters, so roughly an inch), and so forth.  I just needed to come up with my list of URLs and their names (above).  I noticed that some of those names were really long, so in Excel I added a column to calculate their lengths (using LEN) and shortened some of them.  I also had to run tests (i.e., Excel formulas looking at the contents of preceding rows) to verify that I had exact alternating pairs:  one file title followed by one URL.  In some instances the file name had somehow come over in corrupted form, so I had to add some rows to account for all the URLs.  Among other things, ampersands (&) in URL file names seemed to cause confusion; I'd have been better off to replace them in advance with "and."  This cleanup took longer than expected.  If I were doing a truly huge number of URLs, I would probably have wanted to begin with a dry run of a hundred or so, to test things out and see whether any other preliminary steps might ease the process.
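With Excel doing the concatenation, the assembled lines would have looked something like the following, reusing the example names and URLs from above.  The start /wait part meant that each page would finish printing before the batch file moved on to the next line:
start /wait wkhtmltopdf -s Letter -T 25 -B 25 -L 25 -R 25 "http://www.cnn.com/Chinastory.html" "D:\China Story.pdf"
start /wait wkhtmltopdf -s Letter -T 25 -B 25 -L 25 -R 25 "http://www.nytimes.com/" "D:\NY Times Front Page.pdf"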

So I assembled my batch lines into a batch file and ran it.  The lines worked for the first 42 PDFs.  Then I ran into a problem.  Wkhtmltopdf gave me this error and said it was going to close:
Warning: Javascript confirm: You need Adobe Flash Player 8 (or above) to view the charts.  It is a free and lightweight installation from Adobe.com.  Please click on Ok to install the same. (answered yes)
Error: Failed loading page.  The error message gave the URL and then said, "Sometimes it will work just to ignore this error with --load-error-handling ignore."
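In other words, the suggested fix would just mean adding that switch to each command; here it is shown among the other options, again using the example from above:
start /wait wkhtmltopdf -s Letter -T 25 -B 25 -L 25 -R 25 --load-error-handling ignore "http://www.nytimes.com/" "D:\NY Times Front Page.pdf"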
I was going to rewrite my batch commands, putting "--load-error-handling ignore" at the end of each.  But when I clicked Close, the program just kept on running.  I was closing wkhtmltopdf only for that particular session (i.e., for just the one URL being PDFed); I wasn't closing the batch file.  So I let it run, helping it over another one or two similar speed bumps along the way, figuring that I would catch any failed PDFs in the post-game review.  "Speed bumps" is perhaps the wrong metaphor.  It was not fast.  Then again, neither was my Internet connection; that may have been the slowing factor.  In one case, there was no error message; it just got up to 57% completion in PDFing a URL and then stopped.  I hit Ctrl-C.  This made no difference in the active command window, so I killed that window.  That may have been dumb, or maybe it was just the ticket.  Back in the other command window, where I had started the batch file, I now had an option to terminate the batch job.  I said no.  This happened a couple more times.  It looked like it might be a problem with the pages on a particular website.

After a while, it was finished.  Now I wanted to see what it had done.  The first step was to do a count and, if necessary, a comparison of the input URL files against the output PDFs.  I had 507 URL files but only 489 resulting PDFs.  Some of the missing items were pages that could no longer be found; apparently webmasters had made changes between the time when I created the URL files and now.  Some were evidently due to a malfunction somewhere; those pages were available when I revisited them manually.  Others were links to PDFs (as distinct from HTML webpages), which apparently did not work in this process.  I had to PDF a few webpages by hand.  Then I used Adobe Acrobat to combine the resulting individual PDFs into a single PDF.  Its merge process would treat the names of the individual files as bookmarks, saving me that labor.  In the final wrapup, there were some imperfections, but on this first run it appeared that the basic process succeeded in converting a large number of webpages into a single PDF.
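For the count-and-comparison step, by the way, a short loop could have flagged the gaps automatically.  This is just a sketch, assuming the batch file runs in the folder holding the .url files and that the PDFs went to the root of D: as in the example command above; since I had shortened or edited some names in Excel, a few of its "missing" reports would be false alarms:
@echo off
rem Report any .url file that has no correspondingly named PDF in D:\
rem (%%~nf is the filename without its .url extension)
for %%f in (*.url) do (
  if not exist "D:\%%~nf.pdf" echo Missing: %%~nf
)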

1 comment:

raywood

Before combining preexisting downloaded PDFs with these newly created PDFs, maybe I should have examined the former to see if any were forms requiring data entry. Combining those into the resulting PDF was presumably why I got Acrobat's purple banner across the top of each page: "Please fill out the following form." This could be "fixed" by just clicking on the button at the left, to turn it off.