Thursday, January 26, 2012

Windows 7: HTML (MHT) Files: Batch Printing/Converting to PDF

I had a bunch of MHT files in a folder.  (MHT was apparently short for mhtml, which was short for MIME html.)  I produced these files in Internet Explorer (IE).  To do this in a recent version of IE, the approach would be to look at a webpage and hit Ctrl-S > Save as type > Web archive, single file (*.mht).  The MHT format would try to build everything on the screen into a single file, unlike the HTML formats (which would either save only the HTML text or create a subfolder to contain the images and other stuff appearing on the webpage).

Attempts to Print MHTs Directly

My goal now was to print those MHT files.  I had Bullzip PDF Printer set as my default printer, and its settings (the default, I think) would have it pop up a dialog for each file being printed, asking me what I wanted to call the PDF output.  This wasn't as slick as having a command-line PDF printer that would automatically print a file with a name specified on the command line, but I believed I had two options there.  One would be to change Bullzip so that it just printed without a dialog; the other was to hit Enter for each file and let Bullzip print the PDF with the default filename.  Either way, I could then come back in a second pass, using a batch file and/or Bulk Rename Utility to alter filenames as desired.

I actually would have had a one-pass command-line option, if I had been able to get PrintHTML to work with MHTs.  I was briefly hoping that maybe I could use PRN from the command line, but Francois Degrelle said PRN would only work with text files.  A PowerShell function would have been another possibility, if I had known how to proceed with something like that.  There also appeared to be some older approaches that could provide a good way to spend a huge amount of time on something that wouldn't work, for reasons I couldn't understand.

I ran a search and found a webpage that made me think that PDFCreator might be a more useful PDF printer than Bullzip, for present purposes and also for the future.  PDFCreator was favorably reviewed on CNET and Softpedia, so I downloaded and installed it.  But it didn't seem to be printing PDFs automatically from these MHTs.  It would just open the MHT in Microsoft Word, my default word processor, and then it would sit there.  So I didn't continue to try using PDFCreator for this project.

Then again, Bullzip did the same thing:  it opened the MHT in Word, and then stopped.  This happened even after I went into Bullzip's options and changed them to what seemed to be the most streamlined approach possible.  Word was resource-intensive; I couldn't very well open a hundred documents in it at once.  Not that that was an option anyway.  If I highlighted more than 15 MHTs in Windows Explorer, the right-click context menu wouldn't even give me a Print option.

Wordpad was less resource-intensive than Word, but it would open the MHT files as source code, same as Notepad:  not pretty.  I would also get the MHT opened in Word when I right-clicked on a couple of MHTs and selected "Convert to Adobe PDF."  (I got that option because I had Acrobat installed.)

The easiest way to just open the MHTs and print them manually, if I wanted to do that, seemed to be to select a bunch of them and hit Enter, and they would open in tabs in my web browser.  For some reason, they were opening in Opera, whereas I would have thought that Firefox would be the default, as it was for other kinds of web-type files.  I couldn't even open them in Firefox by doing File > Open inside Firefox:  somehow they would still open in Opera.  I could have uninstalled Opera and then tried again, if I really cared; but in any event I still wasn't getting an automated solution.

PDF via Internet Explorer > Print All Linked Documents

Diamond Architects suggested creating an HTML file that would have links to all of the HTML files in a folder, and then using Internet Explorer to print that one HTML file, using Alt-F > Print > Options tab > Print all linked documents.  The .mht files were obviously not .html files, but they contained HTML code.  So it seemed like the same approach would work either way; or, at worst, I thought I could probably just type REN *.MHT *.HTML in a command window opened in that folder, and mass-rename them that way.  I tried that.  It made a mess.  The files didn't look right anymore.  So I renamed them back to MHT.  (The easy way to open a command window in any folder was to go into Ultimate Windows Tweaker > Additional Tweaks > Show "Open Command Window Here."  With that installed, a right-click in Windows Explorer would open up that option.)

But anyway, to test the "print all linked documents" concept, I needed to create the HTML file containing links to all those individual files.  For that, I tried Arclab's Dir2HTML. But it didn't create links.  It just gave me a list of files.  If that was going to be the output, I preferred the kind of list I would get from this command:
DIR *.mht /a-d /b > dirlist.txt
That gave me a file, dirlist.txt, containing entries that looked like this:
File Name 1.mht
File Name 2.mht
To get them to function like links in an HTML file, I would have to change those lines so they looked like this:
<a href="File Name 1.mht"></a>
<a href="File Name 2.mht"></a>
I could achieve that with a search-and-replace in Word, using ^p as the end-of-line character.  That is, I could search for ^p and replace it with this, including the quotation marks:
"></a>^p<a href="
That would put "></a> at the end of each line, and <a href=" at the start of the next.  Then I could paste the results back into dirlist.txt.  Note:  if smart quotes were turned on in Word, I would then have to do two additional search-and-replace operations in Notepad, copying and pasting a sample of an opening and a closing smart quotation mark into Notepad's replace box and replacing each with a plain straight quotation mark, because smart quotes wouldn't work right in HTML.  Then I would have to manually clean up the first and last lines in dirlist.txt:  the first line would be missing its opening <a href=" and a stray <a href=" would be left dangling after the last one.  Another way to do this would be to paste the contents of dirlist.txt into Excel and massage them there.  (For Excel instructions, go to this post and search for CHAR(34).)  If I was going to do much of this, Excel would definitely be the way to go, because then I could just drop the new filenames into a column and let preexisting formulas parse them and output the HTML lines automatically.

That basically gave me an HTML file.  Now I would just have to add its opening and closing lines.  I wasn't sure what those should look like, so I right-clicked on some random webpage, selected "View Source" (an option that may not be available in all browsers, at least not without some add-ons; I wasn't sure), and decided that what I needed for an opening line would be "<!DOCTYPE html>" and the closing line should be "</html>" (without quotation marks), though I later realized that the latter was probably either unnecessary or incomplete.  I also needed a second line that read, "This is my file," because otherwise everything that I had done would create a completely blank-looking page, leaving me uncertain and confused.  So I added those lines to dirlist.txt, saved it as dirlist.htm, opened it in Internet Explorer (Ctrl-O or Alt-File > Open), and tried the Alt-F > Print > Options tab > "Print all linked documents" option mentioned above.  (Note that dirlist.htm still had to be in the same folder as the .mht files that I wanted to print.)
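For anyone who would rather script this step than do it in Word or Excel, here is a minimal Python sketch of the same idea.  Python is my substitution here, not something used in the original project, and the function names are made up for illustration.  It lists the .mht files in a folder and writes a dirlist.htm with one link per file.  Unlike the hand-built version, it gives each link some visible text, so the page does not look blank.

```python
from html import escape
from pathlib import Path
from urllib.parse import quote


def build_link_page(filenames, note="This is my file"):
    """Return the text of an HTML page with one link per file name."""
    lines = ["<!DOCTYPE html>", "<html>", "<body>", f"<p>{escape(note)}</p>"]
    for name in filenames:
        # quote() percent-encodes spaces and similar characters in the href;
        # escape() protects characters like & in the visible link text.
        lines.append(f'<a href="{quote(name)}">{escape(name)}</a><br>')
    lines += ["</body>", "</html>"]
    return "\n".join(lines) + "\n"


def write_link_page(folder, pattern="*.mht", output_name="dirlist.htm"):
    """Write the link page into the same folder as the files it links to."""
    folder = Path(folder)
    names = sorted(p.name for p in folder.glob(pattern) if p.is_file())
    (folder / output_name).write_text(build_link_page(names), encoding="utf-8")
    return names
```

Calling write_link_page(r"D:\Workspace") would put dirlist.htm next to the MHTs, which is where the "Print all linked documents" approach needed it to be.  Whether Internet Explorer then prints anything useful is a separate question, as the next paragraph shows.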

That worked, sort of.  It automatically gave me a boatload of .pdf files, and may I say it did so in a hell of a hurry.  Problem was, they were all blank.  It tentatively appeared that Bullzip and Internet Explorer were going to go through the motions of printing those linked files; but because I was dealing with MHTs instead of HTMs, they would passive-aggressively give me output with nothing inside.  So, like Columbus finding Haiti instead of Malaysia, I had figured out how to bulk-print HTML files, but that wasn't what I had told everyone I was trying to do.

Bulk Converting MHTs to HTML with mht2htm

Well.  Could I bulk-convert MHTs to HTMs and call it a day?  A search led to mht2htm.  I downloaded the Win32 versions (both GUI and command line), along with the Readme and the Help file.  Basically, it looked like I just needed to (1) copy mht2htmcl.exe into the folder containing my MHT files, (2) create a subfolder, there, called OutputDir, (3) edit dirlist.htm to comment out the non-file (i.e., starting and ending) lines, and then (4) do another couple of searches and replaces in dirlist.htm, so that my lines looked like this:
mht2htmcl "First File Name.mht" OutputDir
mht2htmcl "Another File Name.mht" OutputDir
According to the very brief documentation accompanying mht2htm, these commands would do the trick.  I made these changes, and then renamed dirlist.htm to be dirlist.bat, made sure it was in the folder containing the MHTs and mht2htmcl.exe, and ran it.  It didn't work.  I wasn't sure why not.  So I tried the GUI version instead.  Much easier, and it did produce something in the Output directory.  What it produced was a bunch of folders, one for each MHT file, with names like "First File Name_Files."  Each folder held a couple dozen files, mostly GIFs for the graphic elements of the HTM file.  The key file in each folder was called _0_start_me.htm.  If I double-clicked on that, it would open in Firefox (my default web browser), with a line near the top that said, "Click here to open page"; and if I clicked on that, I got a nice-looking webpage in Firefox.

So that was not fantastic.  Now, instead of opening MHT files one at a time in Word or a web browser, and printing from there, I would have to convert them to HTM so that I could dig into their separate folders and do the same thing with a _0_start_me.htm file.  It would probably be easier to print HTMs than it had been to print MHTs, but there was the problem that those _0_start_me.htm files did not have the original filename.  Fortunately, the file name had been preserved in the name of the folder created by mht2htm.  So I would have to use an Excel spreadsheet to produce printing or renaming commands that would rename the PDF version of the first _0_start_me.htm file to be "First File Name.pdf," and likewise for all the others.  But I wasn't ready to do that yet.
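The folder-name-to-PDF-name bookkeeping described above could also be done in a few lines of Python instead of Excel.  This is only a sketch under the assumptions stated in this post:  that every converted folder is named "[original name]_Files" and contains a file called _0_start_me.htm.  The function names are hypothetical.

```python
from pathlib import Path

START_FILE = "_0_start_me.htm"  # start file name as reported in the post
FOLDER_SUFFIX = "_Files"        # folder suffix as reported in the post


def pdf_name_for_folder(folder_name):
    """Turn 'First File Name_Files' into 'First File Name.pdf'."""
    if folder_name.endswith(FOLDER_SUFFIX):
        folder_name = folder_name[: -len(FOLDER_SUFFIX)]
    return folder_name + ".pdf"


def start_files(output_dir):
    """Pair each start file found under output_dir with its intended PDF name."""
    pairs = []
    for sub in sorted(Path(output_dir).iterdir()):
        start = sub / START_FILE
        if sub.is_dir() and start.is_file():
            pairs.append((start, pdf_name_for_folder(sub.name)))
    return pairs
```

Each pair returned by start_files() holds the path of one _0_start_me.htm file and the filename its PDF should end up with, which is the information the printing or renaming commands would need.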

Reviewing How to Use wkHTMLtoPDF

So far, as discussed in a previous post, the best tool I had found for batch converting HTMs to PDFs was wkHTMLtoPDF.  Somewhat past the halfway point in that long post, in a section titled "Revised Final Step:  Converting TXT to HTML to PDF," I had worked out an approach for using wkHTMLtoPDF.  The first step, as I reconstructed my efforts from that previous post, was to install wkHTMLtoPDF.  That created a folder:  C:\Program Files\wktohtml.  wkHTMLtoPDF was a command-line program.  Windows would have to know where to look to find it.  To address that need, I copied everything from the C:\Program Files\wktohtml folder to a new, empty folder called D:\Workspace.  Now I could type a command referring to wkHTMLtoPDF, in a batch file or command window running in D:\Workspace, and the computer would be able to execute the command. I also created a subfolder, under D:\Workspace, called OutputDir.

Next, I went into a command window, running in D:\Workspace, and typed "wkhtmltopdf /?" to get a list of command options.  My previous post, interpreted in light of that command and a glance at wkHTMLtoPDF's manual, seemed to say that the command options that had worked best for me included "-s" to set the output paper size; options to set top (-T), bottom (-B), left (-L), and right (-R) margins (in millimeters); and --dpi (to specify dots per inch).  It seemed, then, that the command line that I would need to use, for each of the _0_start_me.htm files, would use this basic syntax: 
start /wait wkhtmltopdf [options] [input folder and HTM file name] [output folder and PDF file name]
I would run that command in the Workspace folder, where I had now placed the wkHTMLtoPDF program files.  With a command of that type, wkHTMLtoPDF would find the _0_start_me.htm file created by mht2htm (above), and would convert it to a PDF file saved in D:\Workspace\OutputDir.  The source folder and file names were pretty long in some cases, but this D:\Workspace\OutputDir part of the command was brief, so hopefully my full wkHTMLtoPDF command would not exceed any command line limits.  So now I was ready to try an actual command.  I made a copy of one of the folders created by mht2htm, renamed it to be simply "Test," and ran this command in D:\Workspace:
start /wait wkhtmltopdf -s Letter -T 25 -B 25 -L 25 -R 25 --minimum-font-size 10 "D:\Test\_0_start_me.htm" "D:\Workspace\OutputDir\Testfile.pdf"
That worked.  But, of course, the resulting Testfile.pdf was just a PDF of the HTML page that said, "Click here to open page."  I wouldn't get my actual MHT page in HTML format until I clicked on that link, in each of those _0_start_me.htm files, and the resulting HTML page would be open in Firefox, where I would still have to come up with a batch printing option to handle all of the tabs that I would be opening.  It still wasn't an automated solution.  I assumed that the approach of using Internet Explorer > Print All Linked Documents as above (but this time with HTMs instead of MHTs) would likewise give me webpages with that "Click here to open page" option. 
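If I had wanted one such command per converted folder, the lines for a batch file could be generated rather than typed.  The following Python sketch simply reproduces the command shown above with the source and target swapped in.  The option values are the ones from that test command, and the function name is invented for this example.

```python
def wkhtmltopdf_line(source_htm, target_pdf, page_size="Letter",
                     margin_mm=25, min_font_size=10):
    """Build one batch-file line in the same shape as the test command above."""
    options = (
        f"-s {page_size} "
        f"-T {margin_mm} -B {margin_mm} -L {margin_mm} -R {margin_mm} "
        f"--minimum-font-size {min_font_size}"
    )
    # Both paths are wrapped in double quotes because real folder and
    # file names in this project often contained spaces.
    return f'start /wait wkhtmltopdf {options} "{source_htm}" "{target_pdf}"'


def write_batch_file(pairs, batch_path):
    """Write one command line per (source, target) pair to a .bat file."""
    with open(batch_path, "w", encoding="utf-8") as handle:
        for source_htm, target_pdf in pairs:
            handle.write(wkhtmltopdf_line(source_htm, target_pdf) + "\n")
```

This only automates the typing.  It does nothing about the underlying problem that each _0_start_me.htm file is just the "Click here to open page" stub.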

Trying VeryPDF HTML Converter

My immediate problem seemed to be that I didn't have a good way to automate the conversion of MHTs to HTMs -- a way that wouldn't give me that funky "Click here to open page" stuff from mht2htm.  My larger problem was that, of course, I didn't have a way to automate getting PDFs from those MHTs, which was the original issue.

The possibilities that I had developed so far seemed to be as follows:  (1) Forget automation; just print the MHTs manually, selecting 15 at a time and choosing the Print option, which would start 15 sessions of Word.  (2) Select and open them in Firefox or some other browser, which would open up 15 (or whatever number) of individual tabs, each likewise calling for manual printing as PDFs unless I could find a way to automate the printing of multiple browser tabs.  (3) Try to figure out why the Internet Explorer approach was giving me blank PDFs.  (4) Look again for something other than mht2htm, to convert MHTs to HTML.  (5) Play some more with the wkHTMLtoPDF approach, in case some automated solution emerged from that.

As I wrote those words of review, I wondered whether Windows XP might handle one or more of those alternatives differently. I had already installed Windows Virtual PC, with its pre-installed virtual Windows XP session; all I needed was to go in there and, if necessary, install programs in it.  But I hadn't encountered any specific indications that some program or approach had worked better in Windows XP, so I decided not to pursue this.

I thought I could at least search for some other MHT converter.  It suddenly appeared that, in my focus on PDF printers, I might not have done a simple search for an MHT to PDF converter.  That search, done at this point, led to novaPDF, a piece of commercial software that would apparently do the job.  But on closer examination, novaPDF did not seem to have a batch printing capability.  Another program, VeryPDF HTML Converter, came in a command line version whose basic syntax was apparently like this:
htmltools [options] [source file] [output file]
This syntax assumed, as with wkHTMLtoPDF (above), that htmltools.exe was being run in a folder, like my D:\Workspace, where the command files would be present -- unless, again, the user wanted to fiddle with path or environment variable adjustments.  Typing just "htmltools" on the command line, or opening the accompanying Readme file, demonstrated that this had lots of options.  I thought I might try just using it, to see if it worked at all, before fiddling with options.  So I copied the full contents of the VeryPDF program folder (i.e., several folders and 15-20 files, including htmltools.exe) to D:\Workspace, made sure Test.mht was there as well, opened a command window there, and typed this:
htmltools Test.mht TestOut.pdf
The command window gave me a message, "You have 299 time to evaluate this product, you may purchase a full version from http://www.verypdf.com."  I didn't find a reference to htmltools on their products webpage or on their list of PDF Products By Functions, and this particular message didn't give me another name to look for, so I wasn't sure whether I would be buying the right program.  A review of a couple of webpages eventually revealed that this was VeryPDF HTML Converter.  The GUI version, which I didn't want, would cost $59.  Sixty bucks to convert MHTs?  But it got better, or worse.  The command-line version was $399.  I guess while I was at it, I could ask them to throw in Gold Support for only $1,200 a year.  Beyond a certain level of ridiculousness, a casual user might be forgiven for considering the option of just running this puppy in a disposable virtual machine, if uninstalling and reinstalling didn't do the trick.  In all fairness, they seemed to be thinking of server administrators, not private home users.  And they did give us 300 free conversions.  Still, at prices like these, it would have been nice if that would be 300 copies a year, not 300 lifetime.  They were basically persuading me to use the program once and then forget about it.

Anyway, the program ran for a few seconds and then claimed it had succeeded.  I looked.  TestOut.pdf definitely did exist, and it looked good.  No apparent need for any additional options.  I wondered if it would default to the same filename with a PDF extension if I just typed "htmltools Test.mht," without specifying TestOut.pdf, so I ran the command again with that alteration.  That worked.  I tried it once more, this time specifying a source folder and an output folder without a filename ("htmltools D:\Workspace\Source\Test.mht D:\Workspace\Output").  This time, it said, "Save to file failed!"  Its messages seemed to say that it found Test.mht without a problem.  Why wouldn't it write to Output?  Maybe it was trying to write a file called Output, when I already had a folder by that name.  I repeated the command, this time with a trailing backslash (i.e., "htmltools D:\Workspace\Source\Test.mht D:\Workspace\Output\").  Still failed.  And the bastards docked me anyway.  I was down to 296 free tries.  So what were we saying:  it could output a file without a need to specify a filename, but it couldn't output to another folder?  If all else fails, RTFM.  But the Readme.txt didn't contain any references to folders or directories.  Well, would it at least work if I specified everything (i.e., "htmltools D:\Workspace\Source\Test.mht D:\Workspace\Output\Test.pdf")?  Yes, it would.  So that was the answer:  I would have to work up my command lines in Excel (above) to include the full file and path names for both the source and the target.  With those commands in a batch file, I decided to give it a run with a couple dozen files, just to make sure, before blowing my remaining 295 budgeted conversions on a futile gesture.  It ran well.  I was set.  My fear that some commands might be too long was unfounded: the htmltools commands ran successfully with a command as long as 451 characters.  
I converted the rest of these MHTs and then deleted them, and hoped never to see them again.
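For the record, the command lines I worked up in Excel could just as well come from a short script.  Here is a Python sketch of that step.  It assumes, as my tests above suggested, that htmltools needs the full path and filename for both source and target, and it assumes that double-quoted paths are acceptable to htmltools (I only showed unquoted examples above).  The function names are made up.

```python
from pathlib import PureWindowsPath


def htmltools_line(source_mht, output_dir):
    """Build one htmltools command with full source and target file names."""
    source = PureWindowsPath(source_mht)
    # Same base name as the source, .pdf extension, in the output folder.
    target = PureWindowsPath(output_dir) / (source.stem + ".pdf")
    return f'htmltools "{source}" "{target}"'


def longest_line(lines):
    """Length of the longest command, to compare against known-good lengths."""
    return max((len(line) for line in lines), default=0)
```

Since every run counted against the free conversions, it would make sense to inspect the generated lines, and check longest_line() against the 451 characters that had worked, before running the batch file.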

Technically speaking, the project was done.  If I needed more MHT conversions than I could accommodate within the limited private usage of VeryPDF's htmltools.exe, I would go back to the five options enumerated at the start of this last section of this post.  Since I already had all this stuff in mind, and my Excel spreadsheet was set to go, I ran a couple more lines:
DIR D:\*.mht /s /a-d /b > D:\MHTlist.txt
DIR E:\*.mht /s /a-d /b >> D:\MHTlist.txt
to see if I had any other MHTs on D or E.  (Note the double >> marks in the second line -- that says add to MHTlist.txt instead of overwriting it, if it already exists.  Of course, once I had the command set, I could just hit the Up arrow in the command window to bring the previous command back, after running it, and then use Home and left & right arrow keys to revise it.)  This gave me a file called MHTlist.txt, containing a list of additional MHTs that I thought I might as well convert to PDFs while I was at it.  For these, the command lines would produce a PDF back in the source folder.  Once those PDFs were created in the source folders, I used Excel (and could probably also have used Ctrl-H in Notepad) to do a DIR [filename].* >> listing (which would show both \Source Folder\File.mht and \Source Folder\File.pdf in the resulting dirlist.txt file) for each specific file that I had converted.  This produced a nice pair for each filename (i.e., x.mht and x.pdf).  The process seemed to work.  Now I just needed one more go with Excel, to produce DEL lines that would get rid of the MHTs in the source folders.  One more check:  no MHTs left.  Project completed.
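That last pairing check is the sort of thing that is easy to get wrong by eye, so here is a Python sketch of it, as an alternative to the Excel approach I actually used.  It takes a list of full paths (for example, the lines of a DIR /s /b listing) and produces DEL lines only for those MHTs that have a PDF of the same name in the same folder.  The function name is hypothetical.

```python
from pathlib import PureWindowsPath


def del_lines(listing):
    """Return DEL commands for each .mht that has a matching .pdf beside it."""
    paths = [PureWindowsPath(line.strip()) for line in listing if line.strip()]
    # Folder plus base name of every PDF in the listing; PureWindowsPath
    # compares case-insensitively, like Windows itself.
    converted = {p.with_suffix("") for p in paths if p.suffix.lower() == ".pdf"}
    return [
        f'DEL "{p}"'
        for p in paths
        if p.suffix.lower() == ".mht" and p.with_suffix("") in converted
    ]
```

Any MHT without a PDF twin is simply left out of the output, so a failed conversion would not lead to a deleted original.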

8 comments:

raywood

If this sort of project came along again, I would want to take a look at the PrintHTML webpage on printing HTML from the command line, to see if its approach would work for MHTs.

Anonymous

Funny!
I wanted to convert about 80 mht files into pdf today and was going same way thru applications as you! :)

raywood

Brilliant minds work alike.

raywood

A later post provides another angle on this.

Gallows

Brilliant minds lol. Here are some tips for you ppl.
1. "Linkify" is a Firefox addon that converts text to links. So you just save a directory list with paths as a text file, then change the extension to .html and open it in Firefox to see it as links.
2. "Unmht" is a firefox addon to read and save as mht. (Just in case)
3. "Print pages to pdf" is a firefox addon that helps print all tabs to pdf. So just open all your mht files and use the addon to print it in one go!

Unknown

Thanks for describing this tool in such detail. I didn't know about it. I had been using a tool at http://mhtviewer.com to convert mht files into pdf files using their command-line tool.

sunk818

Unless you have hundreds or thousands to batch process, it might be faster to just open in IE and print to PDF.

Anyway, you're welcome to try my solution:

Try this MHTML converter:

http://www.softpedia.com/get/Internet/Other-Internet-Related/MHTML-Converter.shtml

Use 7-zip Portable to Open the Inside of the EXE. You will find MHTML Converter.exe inside. This can run standalone.

You might have to manually rename some folders with non friendly characters such as # (pound) in the folder name and file name.

Delete your MHT files now as they are all HTML files within their respective folders.

You can finally use forfiles (available from Vista and on, not forfiles 1.1 from 1998) with a command line like this:

forfiles /m *.htm* /s /c "cmd /c echo wkhtmltopdf.exe @path @fname.pdf"

The echo just creates the batch file that you can run to process all your HTML files into PDFs. All your PDF will be saved to the path you are running the batch file from.

Alternatively, you can run dir /b /s *.htm* to get a list of HTML files in all your subdirectories. Then you have to create your batch file manually from there.

Unknown

I would also add descriptions of two more programs: Print Conductor and 2Printer. The first one is simple and convenient. It is able to automate the conversion of a large number of files to the popular formats. It also supports MHT (HTML): http://www.print-conductor.com/news/batch-print-mht.html. The second program is the command line that can be utilized within the corporate network and can also convert files with fine tuning. It supports MHT files as well: http://www.print-conductor.com/news/batch-print-mht.html. We have made this software part of our corporate document flow system.