At some point, I discovered that it was possible to save a link to a webpage by just dragging its icon from the Address bar in an Internet browser (e.g., Internet Explorer, Chrome, Firefox) to a folder in Windows Explorer. Doing that would create a URL file with two bits of information: the location of the webpage, and the name by which I referred to it. For example, I might have renamed one of those URL files, pointing at some random CNN.com webpage, to be "July 30 Article about China." I had also gotten some URL files by dragging Firefox bookmarks to a Windows Explorer folder.
I now had a bunch of those URL files. I wanted to print the linked webpages in PDF format, using those given names as the names of the resulting PDFs. And I wanted to automate it, so that I could just provide the set of URL files, or something derived from it, and some program would do the rest. Input a dozen URL link files; output a dozen PDFs, each containing a printout of what was on the webpage cited in one of the URL files.
Part of the solution was easy enough. On the command line, I could type two commands:
echo China Story.url > filelist.txt
type "China Story.url" >> filelist.txt

Note the seeming inconsistency in quotation marks: echo would pass the quotes through as literal text, while type needed them to cope with the space in the filename. (I would have to continue using >> rather than > with any subsequent additions, so as not to overwrite what was already in filelist.txt.) Filelist.txt would then contain three lines, like this:

China Story.url
[InternetShortcut]
URL=http://...

I could do a simple find-and-replace to get rid of the [InternetShortcut] part, and manipulate the rest of it in a spreadsheet like Microsoft Excel, using DIR to produce the filenames. So the spreadsheet created from filelist.txt could have a series of commands whose concept would be like this:

print webpage located at URL as PDF and call it China Story.pdf

So then I needed a PDF printer that would print webpages from the command line. I had previously searched for something similar and had wound up using wkHTMLtoPDF (which they actually wrote as all lower-case: wkhtmltopdf, where the "wk" apparently stood for "webkit"). It had been somewhat complicated. I decided to look for something else. There followed a time-consuming search that brought me back to wkHTMLtoPDF.
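Instead of the echo/type/Excel shuffle, the same extraction could be scripted. Here is a minimal sketch in Python, assuming the .url files follow the standard Windows layout (an INI-style file whose section name, [InternetShortcut], has no space, even though it is often written with one):

```python
import configparser
from pathlib import Path

def read_url_file(path: Path) -> tuple[str, str]:
    """Return (title, url) parsed from a Windows .url shortcut file."""
    parser = configparser.ConfigParser()
    parser.read(path, encoding="utf-8")
    # .url files are INI-style; the section name has no space: [InternetShortcut]
    url = parser["InternetShortcut"]["URL"]
    # The title is just the filename the user gave the shortcut.
    return path.stem, url
```

The title comes from the filename itself, which is exactly the name the shortcut was saved under, so no find-and-replace cleanup is needed.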
As before, wkHTMLtoPDF seemed vastly more capable than the alternatives I had been looking at. After the initial reluctance that drove me to look at those other possibilities, I realized that, this time, due to the different nature of the project, the wkHTMLtoPDF part might not be complicated at all. Among the many options available in the help file ("wkhtmltopdf --help" on the command line), it looked like I might get acceptable results from a command like this:
start /wait wkhtmltopdf -s Letter -T 25 -B 25 -L 25 -R 25 "http://www.nytimes.com/" "D:\NY Times Front Page.pdf"

That command, which I would have to run from within the folder where wkhtmltopdf was installed, would specify a page Size of Letter, a Top margin of 25 (wkhtmltopdf reads bare margin values as millimeters), and so forth. I just needed to come up with my list of URLs and their names (above). I noticed that some of those items were really long, so in Excel I added a column to calculate their lengths (using LEN) and edited some of them down. I also had to run tests (i.e., Excel formulas looking at the contents of preceding rows) to verify that I had exact alternating pairs: one file title followed by one URL. In some instances the file name had somehow come over in corrupted form, so I had to add some rows to account for all the URLs. Among other things, ampersands (&) in URL file names seemed to cause confusion; I'd have been better off replacing them in advance with "and." This cleanup took longer than expected. If I were doing a truly huge number of URLs, I would probably have wanted to begin with a dry run of a hundred or so, to test things out and see whether any other preliminary steps might ease the way.
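The spreadsheet cleanup steps (LEN checks, ampersand replacement) could equally be scripted. A sketch of building one batch line per title/URL pair, where the output folder and the 120-character cutoff are my own arbitrary choices, not anything the process requires:

```python
def make_command(title: str, url: str,
                 out_dir: str = "D:\\Output", max_len: int = 120) -> str:
    """Build one wkhtmltopdf batch line from a title/URL pair."""
    # Replace ampersands up front; they confuse the command interpreter.
    safe = title.replace("&", "and").strip()
    # Trim overly long names (120 is an arbitrary cutoff, not a hard limit).
    safe = safe[:max_len]
    return (f'start /wait wkhtmltopdf -s Letter -T 25 -B 25 -L 25 -R 25 '
            f'"{url}" "{out_dir}\\{safe}.pdf"')
```

Writing the resulting lines to a .bat file would then replace the Excel formula column.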
So I assembled those lines into a batch file and ran it. It worked for the first 42 PDFs. Then I ran into a problem. Wkhtmltopdf gave me this error and said it was going to close:

Error: Failed loading page.

The error message gave the URL and then said, "Sometimes it will work just to ignore this error with --load-error-handling ignore." So I added that switch to my commands and ran the batch file again.
After a while, it was finished. Now I wanted to see what it had done. The first step was to count and, if necessary, compare the URL files against the PDFs produced. I had 507 URL files but only 489 resulting PDFs. Some of the missing ones were pages that could no longer be found; apparently webmasters had made changes between the time when I created the URL files and now. Some were evidently due to a malfunction somewhere, since those pages were available when I revisited them manually. Others were PDFs (as distinct from HTML webpages) that apparently did not work in this process. I had to PDF a few webpages by hand. Then I used Adobe Acrobat to combine the resulting individual PDFs into a single PDF. Its merge process would treat the names of the individual files as bookmarks, saving me that labor. There were some imperfections in the final wrapup, but on this first run the basic process appeared successful in converting a large number of webpages into a single PDF.
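That count-and-compare step amounts to a set difference on base names. A small sketch, assuming the URL files sit in one folder and the output PDFs in another (the folder names here are placeholders):

```python
from pathlib import Path

def missing_pdfs(url_dir: str, pdf_dir: str) -> list[str]:
    """Return titles that have a .url file but no matching .pdf."""
    url_names = {p.stem for p in Path(url_dir).glob("*.url")}
    pdf_names = {p.stem for p in Path(pdf_dir).glob("*.pdf")}
    # Whatever is in the first set but not the second never got converted.
    return sorted(url_names - pdf_names)
```

The returned titles are exactly the webpages to retry or PDF by hand.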