Sunday, January 29, 2012

JPG: Can't Read File Header - Unknown File Format or File Not Found

I was looking at various JPGs.  I noticed that a number of them produced this error message when I tried to view them in IrfanView.  I had checked the box to activate IrfanView's Unicode plug-in as suggested, and anyway these were not exotic file names.  So I didn't know what this error would mean for these JPGs.

A search indicated that numerous people had encountered this error.  I decided to start by verifying that this was not just a quirk of IrfanView.  It didn't seem to be; I also wasn't able to view these JPGs as icons in Windows Explorer, and when I tried to view them in Firefox, I got an error:  "The image [filename] cannot be displayed because it contains errors."  When I tried Internet Explorer, I got "Your web browser has blocked this site from using an ActiveX control in an unsafe manner."  Windows Photo Viewer said, "Windows Photo Viewer can't open this picture because the file appears to be damaged, corrupted, or is too large."  Chrome didn't show an error message; it just gave me a blank page.  Photoshop said, "Could not complete your request because an unknown or invalid JPEG marker type is found."  Microsoft Paint said, "Paint cannot read this file.  This is not a valid bitmap file, or its format is not currently supported."

Eliminating the Easy Solutions

A search led to indications that some problems of this type could be due to the program, as I had feared in the case of IrfanView.  For instance, one webpage indicated that a faulty Skype extension could produce the foregoing Firefox error.  Presumably an attempt to open the JPG in some other program, as above, would help to clarify whether it was a program issue rather than a JPG issue.  A discussion thread raised the prospect that this kind of thing could result from various kinds of file system or drive problems.  Other webpages said that USB flash drives (especially ones that had been improperly removed) or Picasa could be an issue.

In some cases, a backup could be a solution, possibly beginning with a DoubleKiller search for other files having the same filename (just in case there might be another copy of the same file somewhere on the computer).  That discussion thread also suggested that it could be a machine-specific problem, but it wasn't in my case:  same problem when trying to open on another computer.

As noted in a previous post, building on an earlier effort and leading to some additional refinements, it was possible to use IrfanView to detect corrupted JPGs scattered around the computer.  The basic idea was to do a search for *.jpg (in, e.g., a command window, or using a file-finding program like Everything) and then run IrfanView (using either File > Batch or command-line methods) to see which JPGs would fail to convert to another format (e.g., PDF).
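
In hindsight, that check could also be scripted directly.  Here is a rough Python sketch of the idea -- the IrfanView path and the folder to scan are assumptions to adjust, and a JPG gets flagged simply because IrfanView's command-line converter produced no output for it:
import subprocess
from pathlib import Path
IRFANVIEW = r"C:\Program Files\IrfanView\i_view32.exe"   # assumed install location
bad = []
for jpg in Path(r"D:\Photos").rglob("*.jpg"):            # assumed folder to scan
    out = jpg.parent / (jpg.stem + "_test.pdf")
    # Ask IrfanView to convert the JPG; a corrupted file typically yields no output.
    subprocess.run([IRFANVIEW, str(jpg), "/convert=" + str(out)])
    if out.exists():
        out.unlink()          # conversion worked; discard the test PDF
    else:
        bad.append(jpg)
print("\n".join(str(p) for p in bad))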

I had previously reviewed commercial software for fixing JPGs. The prices were generally high and my confidence in them was not great. I only had a few dozen corrupted JPGs and wasn't eager to spend much money on fixing them.  One commenter, responding to my post, said that she'd had generally good results with JPEG Recovery Pro ($50), but she still seemed to be looking for a better solution.  Voters on CNET had given it less than two stars.

One reviewer on CNET (like others) said the demo version of Corel's Paint Shop Pro had been useful.  On CNET, it got 3.5 stars from 466 users, though it didn't look like the dozen users who had rated the current version were quite as pleased with it.  I downloaded the latest version from CNET.  It was large (366MB) and it took a while.  CNET said it would be a 30-day trial version, $60 purchase price after that.  (I could also have downloaded from Corel's website.  Oddly, when I started to do that, the download dialog said their version was only 282MB.)  Unfortunately, when I tried to open a few of my corrupted files, Paint Shop said, "An error occurred while trying to read from the file."  Same outcome with a half-dozen different corrupted files.

Digging into the Files

My efforts (above) suggested that I might or might not be able to find a program that would help to automate the repair of corrupted JPGs.  Assuming I did find such a program, the next question would be whether it would work on all corrupted JPGs.  I was not finding a clear, obvious solution.

In other words, it seemed that, sooner or later, I was going to find myself among those who were talking of manually editing JPGs to fix them.  I hadn't done that before.  I had no idea whether that sort of process could be even partially automated.  But the next step seemed to be that I should see if I could fix at least some simple problems in JPGs.

In a thread cited above, someone said that I could open a JPG in Notepad and could tell, from its first few characters, what kind of file it was.  A GIF would tend to begin with "GIF89," a JPG would begin with "ÿØÿà," and a PNG would start with "‰PNG."  I looked at a couple of my JPGs.  Sure enough, the uncorrupted ones did begin with "ÿØÿà."  But the corrupted ones I examined didn't have anything like any of these three options.  I knew, anyway, that it wasn't a case of the file being saved with the wrong extension.  If that had been the problem, IrfanView would have caught it (given the program options I had selected) and would have offered to change it to whatever the correct extension should be.
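
The same signature check can be run by a script that reads just the first few bytes of each file.  A minimal Python sketch of the idea (the folder name is a placeholder):
from pathlib import Path
SIGNATURES = {
    b"\xff\xd8\xff": "JPEG",   # the bytes Notepad shows as ÿØÿ...
    b"GIF8": "GIF",            # GIF87a or GIF89a
    b"\x89PNG": "PNG",
}
for f in Path(r"D:\Photos").glob("*.jpg"):                # placeholder folder
    head = f.read_bytes()[:4]
    kind = next((name for sig, name in SIGNATURES.items() if head.startswith(sig)), "unknown")
    print(f.name, kind)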

Another site offered a step-by-step guide to the process of editing the JPG.  The editing required a hex editor.  That site recommended Frhed, whose interface was far friendlier than that of HexEdit.  I opened one of my corrupted JPGs in Frhed.  It looked a lot less crazy than it had looked in Notepad.  According to a user-friendly version of the step-by-step guide, the JPG consisted of two sections:  header and image.  Corruption resulted from having a bad header.  Or so we hoped.  The solution was to replace the bad header with a good one.  So I made a backup of the files I would be working on, and set to work.

The user-friendly guide recommended xvi32 instead of Frhed.  It was rated 4.2 out of 5 stars (Very Good) by 47 users (76,213 downloads) at Softpedia -- vastly more than Frhed -- so I downloaded and ran that instead.  I had to bump up its font a bit -- 8-point type seemed unduly ascetic.

Now that I was getting organized, I looked at my first corrupt JPG.  Unlike the other one I had just glanced at, it did not have anything except zeroes.  This file was completely toast.  The next one had data.  The guide said I should look for "ff da" in the hexadecimal data, so I did a Ctrl-F and searched for FF DA.  (For some reason, xvi32 insisted on entering capital letters.)  My search seemed to think that it had succeeded:  it stopped at something that read 9F FA.  Call me crazy, but that did not look exactly like FF DA to me.  I tried searching the same file in Frhed.  A search for ff da found nothing, and a search for just ff didn't find much.  Going for a trifecta, I tried in HexEdit -- and there, I did find ff da. 

These hex editors, sounding an uncertain trumpet, inspired me to search for that funky string noted above -- ÿØÿà -- in Notepad.  It wasn't there.  I tried five other JPGs in Notepad.  No ÿØÿà in any of them.  I felt lost.

According to the user-friendly guide, a hex search for ff da in xvi32 should have had some luck.  I searched a good JPG in Notepad for ÿØÿà and, sure enough, there it was, right at the start of the file.  I took a look at that same good JPG in xvi32 and it, too, found ff da in a hex search -- and this time it really was ff da.  So were we saying that my bad JPGs did not have the ÿØÿà or the ff da that would be necessary for a manual repair job?  Was that why neither Paint Shop Pro nor IrfanView nor anything else had been able to do anything with these files -- were they all completely fubar?  Were people using these hex editors, searching for ff da, when instead they could have achieved the same thing with IrfanView or PSP?

Diagnosis

I had a folder full of JPGs.  Some of them worked; some didn't.  Having made a backup, I started at the beginning and viewed each of these files in IrfanView.  I had set my IrfanView properties so that it would go immediately to the next JPG upon hitting an arrow key or upon deleting a file.  In other words, I could just hold down the Del key until it came to an error, and then hit Enter and right-arrow to get to the next one.  This would delete each good JPG, which was fine, since I didn't need to be editing this copy of it.

That left me with 159 apparently bad JPGs.  Wow.  More than I expected.  This could take a lot of manual editing.  IrfanView couldn't open them, which probably meant nothing else could either.  Just to be sure, I tried opening several dozen of them in Paint Shop Pro.  No dice.

I wondered if there was a fast way of searching all 159 of these files for ÿØÿà.  Copernic Desktop Search didn't find anything like that.  Well, what if I glued a bunch of them together with the COPY command and then viewed the one huge file in Notepad?  To test this approach, I went into a command window and typed "COPY File1.jpg + File2.jpg Newfile.txt," where File1 and File2 were good JPGs.  Notepad told me that File1.jpg and File2.jpg each had exactly one occurrence of ÿØÿà.  How about Newfile.txt?  Yes, indeed, it had two occurrences of ÿØÿà.  So this approach seemed to work, at least for purposes of preserving occurrences of ÿØÿà in a concatenated file.

To run this little test on all 159 files, I needed a more powerful concatenator, unless I was willing to type out 159 filenames:  plain COPY, in its default text mode, wouldn't combine these binary files reliably (e.g., COPY *.jpg Combined.txt) -- but I had forgotten that the /b (binary) switch would fix that:  COPY /b *.jpg Combined.txt.  (I chose a .txt extension so that (a) Notepad would open it automatically and (b) Combined.jpg would not get mixed up in copying itself into itself with the *.jpg wildcard.)  I ran that and then opened Combined.txt.  It was a large file, of course, so Notepad took a while to open it.  Once it was opened, I did a search for ÿØÿà.  I found hundreds of occurrences -- far more than 159.  Did this mean that my concatenation messed things up, or did it mean that some files had multiple occurrences of ÿØÿà?  I tried concatenating just 10 bad JPGs:  I had named them using sequential numbers, starting with ZZZ_0001.jpg, so I was able to use a wildcard to select ten:  COPY /b ZZZ_003*.jpg Thirties.txt.  The results were confusing.  Eventually I wrote a batch file to open each file individually in Notepad.  The batch file contained lines like these:

start notepad.exe ZZZ_0013.jpg
start notepad.exe ZZZ_0027.jpg
I thought I might crash the system if I opened 159 sessions of Notepad at once, so I broke it into four parts of about 40 lines each.  I ran the first one and did a Ctrl-F and then Ctrl-V in each one to paste ÿØÿà on the search line.  Now I had my answer.  Some of these files (e.g., ZZZ_0074.jpg) contained many iterations of ÿØÿà, while others contained none.  I went through them all.  After a while, it wasn't hard to guess that the ones that seemed to be filled with Chinese characters (and there were quite a few of them) would have no occurrences of ÿØÿà, while some but not all of the files containing more familiar if gibberishy characters (e.g., 1 Òhhh) would have at least one such occurrence.  There might also have been a way of speeding up the process by doing spot checks, since it seemed that files near to one another (probably originating from the same folder) (e.g., ZZZ_0058.jpg and ZZZ_0059.jpg) tended to follow the same pattern of having or not having occurrences of ÿØÿà.  In the end, 27 of the 159 had at least one occurrence of ÿØÿà, and 132 did not.
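
A script could also have done that counting directly on the bytes, sparing the 159 Notepad windows.  A rough Python sketch (the folder and naming pattern reflect my setup and would need adjusting):
from pathlib import Path
MARKER = b"\xff\xd8\xff\xe0"                              # the bytes Notepad displays as ÿØÿà
for f in sorted(Path(r"D:\BadJPGs").glob("ZZZ_*.jpg")):   # assumed location and names
    print(f.name, f.read_bytes().count(MARKER))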

Fun with Hex Editors

These facts seemed to call for two separate approaches.  For the 132 JPGs that had no occurrences of ÿØÿà, maybe the situation was that the front ends had gotten lopped off, and that's why Paint Shop Pro et al. couldn't make anything of them.  What would happen if I just arbitrarily rammed a header onto each of these files, down to the ÿØÿà point?  The answer to that might give me some clues for the minority of files that did have multiple occurrences of ÿØÿà.

In a survey of good JPGs, I noticed that most began with this:
ÿØÿà JFIF         ÿ
There seemed to be at least one invisible character in there, so the best approach (outside of a hex editor) seemed to be to copy it from the start of a working JPG in Notepad (i.e., not from this webpage), and save it as header.txt.  Then, starting with ZZZ_0013.jpg, the first of my bad JPGs, I typed this:
COPY /b header.txt + ZZZ_0013.jpg new0013.jpg
and then I tried to open new0013.jpg.  IrfanView gave me an error:  "Decode error!  JPEG datastream contains no image."  Tried it with a couple other files; same result.  A brief search suggested that this "decode error" problem could be just as bad as the original one.  So it appeared that this COPY approach was not the answer.

Back at the user-friendly guide, I confirmed that, in their view, the new header approach required me to locate the hex string FF DA, which I had not been able to do in many files.  Given the uncertainties I had encountered in the several hex editors (above), I wondered if there was a way to output the hex contents of a JPG in text form, so that I could do ordinary searches in Notepad (or whatever) to confirm that there was no FF DA in these files.  The solution was easy enough:  Frhed (but not HexEdit or xvi32) had an option, File > Export as hexdump, that gave me a text file displaying the hex data.  So if I wanted to see a file's hex in a text file, I could use that; and if I wanted to see a file's ASCII in a text file, I could use Notepad.  Xvi32 did have a File > Print option that gave me pretty printed pages (in e.g., PDF or hard copy), displaying both hex and ASCII, so I could have searched for either text or hex values in its PDF output.
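
The same hex-as-text trick can be reproduced with a few lines of script, for anyone who would rather search a dump in Notepad than wrestle with a hex editor's Find dialog.  A small Python sketch (the file name is a placeholder; the separator argument needs Python 3.8 or later):
import binascii
from pathlib import Path
data = Path("ZZZ_0030.jpg").read_bytes()
hex_text = binascii.hexlify(data, " ").decode()   # "ff d8 ff e0 ..." as plain text
Path("ZZZ_0030_hexdump.txt").write_text(hex_text)
print("ff da found:", "ff da" in hex_text)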

But now, this was odd.  When I searched ZZZ_0030.jpg for "ff da" in Frhed, it found nothing; but when I searched Frhed's hex dump text file for "ff da" in Notepad, it found multiple occurrences.  Ah, but the problem seemed to be that Frhed was searching the ASCII side, not the hex.  Frhed's search box told me to consult the online help for guidance, but the program contained no link to any webpage as far as I could tell.  Xvi32's Find option let me search for either text or hex, but as noted above, it often led to 9F instead of FF.  Back in Frhed's hex dump file, I searched for the first occurrence of ff da.  It was on row 0050b8 (which the hex dump displayed as 0050b835, mistakenly running together the row number, it seemed, with the first column of data).  I tried to locate row 50b8 in xvi32, but there was no such row.  I guessed that 0050b8 must refer to a specific location (such as the "ff" in "ff da"), so what I was calling "rows" would actually have different numbers within the same file, according to how many data points were being displayed on a single row onscreen in the hex editor.  Armed with that theory, I did now see an occurrence of FF DA in xvi32, near where location 0050b8 should be; and when I clicked on that occurrence, the status bars in both Frhed and xvi32 displayed indications that I was at "hex address" or "offset" 50CD.

OK, so what could I do with this information?  There seemed to be some confusion here.  Those webpages had said that I was looking for ÿØÿà.  An ASCII code list told me that the ASCII code for ÿ was 152.  (This meant that I could type ÿ by holding down Alt and hitting 152 on the keyboard's numeric keypad.  It turned out that not all ASCII code lists agreed:  some versions of the extended ASCII table (most, it seemed) would say that 152 would give me a tilde (~), but that wasn't the tale told by my keyboard.)  According to my preferred ASCII list, the ÿ had a decimal value of 152 and a hex value of 0x98.  A converter told me that 152 in decimal = 98 in hexadecimal, so that added up.

So, ahh, now I thought maybe I was figuring this out.  Looking at ZZZ_0030.jpg in xvi32, I noticed that selecting FF in the hex area would highlight ÿ in the ASCII area.  As far as xvi32 was concerned, ÿ = FF, not 152.  I tried FF in my hex converter.  It said FF in hex = 255.  Well, and of course it did:  FF was as high as hex would go.  In hex, you don't count from 0 to 9; you count 0 to 9, and after 9 you continue on with a, b, c, d, e, and then f.  What we call 15 in decimal is called F in hex.  After F, you start over back at 0 in hex, just as you start back over at 0 after 9 in base-10 (decimal) counting.  So FF in hex actually meant 0FF:  it was similar to 099 in decimal.  After 099 comes 100 in decimal; after FF comes 100 in hex.  So never mind my keyboard:  these hex programs were interpreting FF as 255, and that was the last possible number in the 256-character extended ASCII set (beginning with zero).  In these ASCII code lists, 255 was not represented by ÿ or any other character.  Apparently it had some special meaning.  To clean up a loose end, I found an occurrence of hex 98 in xvi32, clicked on it, and saw that, sure enough, it was linked with the ASCII tilde.

I looked at a good JPG file in Frhed.  It began, as shown above, with "ÿØÿà JFIF ÿ" -- that is, with this set of hex values:
ff d8 ff e0 00 10 4a 46 49 46 00 01 01 01 01 2c 01 2c 00 00 ff
After that ending ff, the contents of the good JPGs seemed to diverge.  That final ff -- the 21st byte in the list -- seemed to be where the image-specific content began.  Would I get a working file if, instead of pasting in that header.txt file (which, being represented in ASCII, was apparently not able to capture all of the hex nuances), I pasted these 21 (or maybe 20) codes at the start of a bad JPG?  Or, wait.  After the hex dump search experience (above), was I sure that this sequence, or part of it, was not already in those files?  The Notepad search for ÿØÿà had produced mixed results, but maybe that wasn't the right way to go about it.

(Note that the second character, Ø, was represented by d8.  There were apparently several different Ø-like characters in use in different languages.  The hex calculator indicated that d8 in hex meant 216 in decimal.  But when I typed Alt-216, I got ╪, not Ø.  That seemed to be an error, according to indications of what I should have gotten in the Latin-1 (ISO-8859-1) character set.  The answer was that I should have been typing Alt-0216, not Alt-216.)
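
The underlying issue, as best I could tell, was that Windows consults two different character tables:  plain Alt codes use the old OEM table (code page 437), while Alt codes typed with a leading zero use the ANSI table (code page 1252, essentially Latin-1).  A tiny Python sketch, purely as an illustration, shows how differently the same byte values come out under those two tables:
# Decode a few byte values under the two tables Windows uses for Alt codes.
for value in (0xFF, 0xD8, 0x98):
    oem = bytes([value]).decode("cp437")     # plain Alt codes (Alt-216, Alt-152, ...)
    ansi = bytes([value]).decode("cp1252")   # zero-prefixed Alt codes (Alt-0216, ...)
    print(hex(value), repr(oem), repr(ansi))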

In Frhed, I opened one of those JPGs in which, using Notepad, I had found no occurrences of ÿØÿà.  I did a search (Ctrl-F) for ÿØ, which I could either have pasted into the search box or entered via Alt-152, Alt-0216.  I did find an occurrence of ÿØ.  I noticed that it came shortly after a big section full of zeroes, which made me think that much of that particular file might be gone forever.  I tried again, this time searching for the full ÿØÿà (Alt-152, Alt-0216, Alt-152, Alt-133).  Nothing found.  So Frhed seemed consistent with Notepad in that particular search, at least in this file.

By this point, I was a bit lost.  It did occur to me that I might be able to automate the triage of potentially salvageable JPGs by doing a hex dump, counting the occurrences of 00 (nothing, empty space) as a percentage of the total number of hex values in the file, doing some sampling of partially zeroed but still readable JPGs, and identifying a threshold (10%?  20%) beyond which a JPG would not be worth saving.  But I wasn't there yet, because my JPGs weren't readable at all, and they did have non-zero data.  Most of them, that is; I had found a total of two that were completely empty.
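
For the record, the measurement itself would be easy to script if I ever pursued that triage idea.  A Python sketch (the folder and the cutoff are placeholders; the cutoff would need calibrating against real files):
from pathlib import Path
THRESHOLD = 0.20                                  # placeholder cutoff
for f in Path(r"D:\BadJPGs").glob("*.jpg"):       # assumed location
    data = f.read_bytes()
    if not data:
        print(f.name, "completely empty")
        continue
    zero_fraction = data.count(0) / len(data)
    verdict = "probably hopeless" if zero_fraction > THRESHOLD else "worth a closer look"
    print(f"{f.name}: {zero_fraction:.1%} zero bytes -- {verdict}")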

A Bit of Clarity

I went back to the first of my bad JPGs that had no occurrences of ÿØÿà.  My hex editors all seemed to have some way to insert characters or a whole file.  The latter, offered by xvi32 and Frhed, seemed easier, so in xvi32 I went into File > Insert.  I had not yet created the file that I wanted to insert, so now I went into Notepad, pasted my string of 21 hex values (above), saved it as Header.txt, and proceeded to insert that into my bad JPG in xvi32.  Oops:  that pasted the hex codes (ff d8 ff e0 ...) as text, not as hex.  Xvi32 also gave me an Edit > Insert > Hex string option, so I tried that.  That worked.  I saved the bad JPG and tried opening it in IrfanView.  This gave me a new error:  "Decode error!  Bogus marker length."  Unfortunately, a search led nowhere from that.  Another search produced more, including a FileFormat.info webpage that said this:
The first two bytes of every JPEG stream are the Start Of Image (SOI) marker values FFh D8h. In a JFIF-compliant file there is a JFIF APP0 (Application) marker, immediately following the SOI, which consists of the marker code values FFh E0h and the characters JFIF in the marker data, as described in the next section. In addition to the JFIF marker segment, there may be one or more optional JFIF extension marker segments, followed by the actual image data.
This was helpful.  It seemed that all I really needed, after all, was the FF D8 bytes.  The rest of "ÿØÿà JFIF" was perhaps related to JFIF compliance, but I didn't know if I needed that to produce a file that a program like IrfanView could read.  So I went back into xvi32, went to the 21st byte (i.e., the last one I had just entered) and used the Edit > Delete to Cursor option to delete what I had just added, and then used Edit > Insert String to add back ff d8.  I saved and tried again.  Now we were back to the "Can't read file header" error in IrfanView.  Another FileFormat webpage seemed to say that, to have a JPG file, all you needed was the first four bytes (ff d8 ff e0).  In xvi32, I went into File > New, inserted those four bytes, and saved that file as Test.jpg.  IrfanView gave me an error:  "JPEG datastream contains no image."  I inserted four empty bytes (i.e., eight zeroes) after those four header bytes but still got that error.  I replaced those with bytes 5 through 8 from a good JPG and tried again.  Still the same error.

Another search led to a webpage that seemed to explain something I had noticed in another FileInfo webpage:  it seemed that a JPG file was (or at least could be) defined as one beginning with FF D8 and ending with FF D9.  This page also explained that JFIF was an alternative to EXIF, but I wasn't sure whether I needed either of them.  It seemed that I hadn't really added four data bytes, when I added bytes 5 through 8:  I was just adding the JFIF part of the header.  Assuming I had to have either EXIF or JFIF, in xvi32 I now modified Test.jpg so that it contained what appeared to be the standard JFIF header (ff d8 ff e0 00 10 4a 46), then added four bytes (90 60 1B 88), then added FF D9 at the end, saved, and tried again.  Still "contains no image."  I looked at Test.jpg in Notepad.  Interestingly, it looked like I had made a start on one of those Chinese-looking files.
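
Pulling those pieces together, one quick diagnostic is to check a file for the three markers mentioned so far -- FF D8 at the start, FF DA somewhere in the middle, FF D9 at the end.  A short Python sketch of that check (the file name is a placeholder; this only reports, it doesn't repair):
from pathlib import Path
data = Path("Bad.jpg").read_bytes()                     # placeholder file name
print("starts with FF D8 (SOI):", data.startswith(b"\xff\xd8"))
print("FF DA (start-of-scan) count:", data.count(b"\xff\xda"))
print("ends with FF D9 (EOI):", data.endswith(b"\xff\xd9"))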

A more focused search tended to confirm the thought that I was getting in over my head and/or that there just might not be a solution.  People were talking about serious programming, and they were also giving me the impression that, of course, the image data for a JPG would state, or be influenced by, its size, color, compression, and other factors.  This was probably why the user-friendly webpage advised finding a header from a JPG of similar size, if possible, taken by the same camera and edited with the same software.  I didn't have that kind of knowledge about these particular JPGs, so my chances of adding a good header were limited.

The user-friendly webpage hadn't actually been very clear, to me, so I returned to the original advice page that it was trying to present in more user-friendly terms.  That page confirmed that there would typically be "several" occurrences of ff da in a JPG.  This suggested that a file without any such occurrences, searched in a hex editor or dump rather than in Notepad, could be beyond saving.  I had decided I wasn't going to make that determination today, though.  If I couldn't save something now, I was going to zip it up and save it until maybe some better tool came along.

The advice here was to look for the *second* occurrence of ff da, occurring somewhere around 2000 to 4000 bytes into the file.  I hadn't understood that from the other page.  Everything up to that second occurrence was supposedly part of the header; everything after it was image data.  So I would be replacing all of that header section with a similar section from a good JPG.  If this was right, then my attempt to create Test.jpg (above) apparently needed a second occurrence of ff d8, followed by something resembling image data, in order to work.

At about this point, I discovered that I might have been confusing ff d8 and ff da.  Both appear in the foregoing paragraphs, and I was no longer sure which one I was supposed to be interested in.  A look at a working JPG called Good.jpg indicated that the first two ASCII characters (ÿØ -- which, pronounced yo!, could be a great way to start a JPG) were represented by hex FF D8.  But the original advice page was saying that the boundary between header and image data was marked by FF DA, not FF D8.  So apparently I confused that.  In Good.jpg, FF DA -- that is, ASCII characters ÿÚ, produced (as I now realized) by Alt-0255, Alt-0218 -- first appeared at location (would it be called "offset" or perhaps "byte number"?) 261. This was not nearly as far into the file as the advisor had suggested.  Perhaps it varied with the contents of the file.  But, no, again, this was the first occurrence of FF DA, not the second.  In xvi32, I hit F3 to repeat the search.  But xvi32 again took me to an instance of 9F DA, not FF DA.  I tried the same thing in Frhed, finding once again that it searched for ASCII, not hex; so I searched Good.jpg, in Frhed, for ÿÚ.  It said the offset or address of the first hit was at 609 or, in hex terms, 0x261.  But it could not find any more occurrences.  Was that why Frhed had taken me to the irrelevant 9F DA -- because there was only one FF DA?  (Probably not, I decided later; probably it did that because it was accepting Ÿ as equivalent to ÿ, since I had not specified a case sensitive search.)  HexEdit, too, appeared to be finding only one occurrence of FF DA in Good.jpg.

A Glimpse of Light

It seemed that I would have to try to make this work using the first rather than the second occurrence of FF DA in Good.jpg.  In Frhed, with the cursor blinking on DA at offset 609, I went into Edit > Select Block.  There, I typed x0 as the Start of Selection, and left the End of Selection at x262.  I clicked OK.  This selected everything from the start of the file to DA at offset 609.  Then I realized I was doing this in the wrong file.  But it was OK.  There seemed to be another way to proceed.  In Frhed, I went into File > Export as Hexdump.  I left the same range (x0 to x262) and selected Export to File and Just Hex Digits on a Line.  I saved it as Header.txt.  Then I closed Good.jpg.

Hopefully that gave me a working header.  In Frhed, I went into one of my bad JPGs and searched for FF DA.  It found nothing.  I double-checked, doing the same search in xvi32.  It found an occurrence of FF DA.  Plainly, I had still not quite gotten the hang of using these hex editors, or else maybe Frhed really was buggy, as the writer of the user-friendly webpage believed.

Viewing Bad.jpg in xvi32, the first occurrence of FF DA that xvi32 found was near the end of the file.  That couldn't be an end-of-header marker, could it?  The hex address was 4E6D4, indicating that there was a lot of data before this point.  My guess was that this was one of the later occurrences of FF DA, not the early occurrence that would mark the header.  Could I fix this bad JPG by just attaching Header.txt at the start of the file?  In Frhed, I used Ctrl-Home to go to the start of Bad.jpg.  There, I saw nothing that looked like ÿØ, which presumably would have appeared at the start of any good JPG.  I went into File > Import from Hexdump.  I named the newly created Header.txt as my source and clicked OK.  I got a question:  "Does this data have the same format as the Frhed display?  This data contains only whitespace and hexdigits. (unlike Frhed display)."  I assumed that a header would naturally get a question like this, so I clicked Yes to proceed.  It said, "Unexpected end of data found.  Cannot continue!  Do you want to keep what has been found so far?"  I clicked Yes.  It gave me a blank screen.  As you might have predicted, this didn't fix Bad.jpg.

I tried again, this time following the instructions more closely, in case that made a difference.  Specifically, in xvi32, I saved Good.jpg as Header.txt, searched it for FF DA (case sensitive), went to the byte immediately after DA, and then went into Edit > Delete from Cursor.  Finally:  Header.txt really did contain only the information up through FF DA.  I saved it and opened Bad.jpg.  Still in xvi32, and with the cursor located at the start of Bad.jpg, I went into Edit > Insert > Header.txt.  This, I hoped, would prepend a good header to the image body of Bad.jpg and heal it.  I saved Bad.jpg and tried opening it in IrfanView.  Sadly, I got "Bogus marker length."  Paint Shop Pro couldn't open it either.

I tried again with a different Bad.jpg.  This was one of the files that I had identified (above) as having many instances of ÿØÿà, whereas the previous Bad.jpg (i.e., the one I had just been experimenting with) had none.  I searched for the first instance of FF DA, used Ctrl-Shift-PgUp to mark everything up to and including FF DA, pressed Del to delete it, and then inserted Header.txt at the start of the file.  I saved it and tried opening it in IrfanView.  Again, "Bogus marker length."  I tried again, this time going to the second instance of FF DA.  "Bogus marker length" once again.
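
For whatever it might be worth later, the procedure I had just been attempting by hand -- graft everything up through the second FF DA from a good donor JPG onto everything after that same landmark in the bad one -- boils down to a few lines of script.  A hedged Python sketch (the file names are placeholders; as the advice page warned, the donor should ideally come from the same camera and settings, and files lacking a second FF DA are beyond the reach of this trick):
from pathlib import Path
def second_occurrence(data: bytes, marker: bytes) -> int:
    first = data.find(marker)
    return data.find(marker, first + 1) if first != -1 else -1
good = Path("Good.jpg").read_bytes()                    # donor file (placeholder)
bad = Path("Bad.jpg").read_bytes()                      # damaged file (placeholder)
g = second_occurrence(good, b"\xff\xda")
b = second_occurrence(bad, b"\xff\xda")
if g == -1 or b == -1:
    print("No second FF DA in one of the files -- this approach won't apply.")
else:
    repaired = good[:g + 2] + bad[b + 2:]               # donor header + surviving image data
    Path("Repaired.jpg").write_bytes(repaired)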

Wrap-Up

I was out of time for this project.  Perhaps some ideas would come to me later, or I would become aware of some new program or technique.  As always, comments and suggestions were welcome.  In the meantime, all I could do at this point was to archive these bad JPGs in a zip file and put them aside.

Thursday, January 26, 2012

Windows 7: HTML (MHT) Files: Batch Printing/Converting to PDF

I had a bunch of MHT files in a folder.  (MHT was apparently short for mhtml, which was short for MIME html.)  I produced these files in Internet Explorer (IE).  To do this in a recent version of IE, the approach would be to look at a webpage and hit Ctrl-S > Save as type > Web archive, single file (*.mht).  The MHT format would try to build everything on the screen into a single file, unlike the HTML formats (which would either save only the HTML text or create a subfolder to contain the images and other stuff appearing on the webpage).

Attempts to Print MHTs Directly

My goal now was to print those MHT files.  I had Bullzip PDF Printer set as my default printer, and its settings (the default, I think) would have it pop up a dialog for each file being printed, asking me what I wanted to call the PDF output.  This wasn't as slick as having a command-line PDF printer that would automatically print a file with a name specified on the command line, but I believed I had two options there.  One would be to change Bullzip so that it just printed without a dialog; the other was to hit Enter for each file and let Bullzip print the PDF with the default filename.  Either way, I could then come back in a second pass, using a batch file and/or Bulk Rename Utility to alter filenames as desired.

I actually would have had a one-pass command-line option, if I had been able to get PrintHTML to work with MHTs.  I was briefly hoping that maybe I could use PRN from the command line, but Francois Degrelle said PRN would only work with text files.  A PowerShell function would have been another possibility, if I had known how to proceed with something like that.  There also appeared to be some older approaches that could provide a good way to spend a huge amount of time on something that wouldn't work, for reasons I couldn't understand.

I ran a search and found a webpage that made me think that PDFCreator might be a more useful PDF printer than Bullzip, for present purposes and also for the future.  PDFCreator was favorably reviewed on CNET and Softpedia, so I downloaded and installed it.  But it didn't seem to be printing PDFs automatically from these MHTs.  It would just open the MHT in Microsoft Word, my default word processor, and then it would sit there.  So I didn't continue to try using PDFCreator for this project.

Then again, Bullzip did the same thing:  it opened the MHT in Word, and then stopped.  This happened even after I went into Bullzip's options and changed them to what seemed to be the most streamlined approach possible.  Word was resource-intensive; I couldn't very well open a hundred documents in it at once.  Not that that was an option anyway.  If I highlighted more than 15 MHTs in Windows Explorer, the right-click context menu wouldn't even give me a Print option.

Wordpad was less resource-intensive than Word, but it would open the MHT files as source code, same as Notepad:  not pretty.  I would also get the MHT opened in Word when I right-clicked on a couple of MHTs and selected "Convert to Adobe PDF."  (I got that option because I had Acrobat installed.)

The easiest way to just open the MHTs and print them manually, if I wanted to do that, seemed to be to select a bunch of them and hit Enter, and they would open in tabs in my web browser.  For some reason, they were opening in Opera, whereas I would have thought that Firefox would be the default, as it was for other kinds of web-type files.  I couldn't even open them in Firefox by doing File > Open inside Firefox:  somehow they would still open in Opera.  I could have uninstalled Opera and then tried again, if I really cared; but in any event I still wasn't getting an automated solution.

PDF via Internet Explorer > Print All Linked Documents

Diamond Architects suggested creating an HTML file that would have links to all of the HTML files in a folder, and then using Internet Explorer to print that one HTML file, using Alt-F > Print > Options tab > Print all linked documents.  The .mht files were obviously not .html files, but they contained HTML code.  So it seemed like the same approach would work either way; or, at worst, I thought I could probably just type REN *.MHT *.HTML in a command window opened in that folder, and mass-rename them that way.  I tried that.  It made a mess.  The files didn't look right anymore.  So I renamed them back to MHT.  (The easy way to open a command window in any folder was to go into Ultimate Windows Tweaker > Additional Tweaks > Show "Open Command Window Here."  With that installed, a right-click in Windows Explorer would open up that option.)

But anyway, to test the "print all linked documents" concept, I needed to create the HTML file containing links to all those individual files.  For that, I tried Arclab's Dir2HTML. But it didn't create links.  It just gave me a list of files.  If that was going to be the output, I preferred the kind of list I would get from this command:
DIR *.mht /a-d /b > dirlist.txt
That gave me a file, dirlist.txt, containing entries that looked like this:
File Name 1.mht
File Name 2.mht
To get them to function like links in an HTML file, I would have to change those lines so they looked like this:
<a href="One File Name.mht"</a>
<a href="Another File Name.mht"</a>
I could achieve that with a search-and-replace in Word, using ^p as the end-of-line character.  That is, I could search for ^p and replace it with this, including the quotation marks:
"></a>^p<a href="
That would put "</a> at the end of each line, and <a href=" at the start of the next.  Then I could paste the results back into dirlist.txt.  Note:  if smart quotes were turned on in Word, I would then have to do two additional search-and-replace operations, copying and pasting a sample of an opening and a closing smart quotation mark into Notepad's replace box, because smart quotes wouldn't work right.  Then I might have to manually clean up the first and last lines in dirlist.txt.  Another way to do this would be to paste the contents of dirlist.txt into Excel and massage them there.  (For Excel instructions, go to this post and search for CHAR(34).)  If I was going to do much of this, Excel would definitely be the way to go, because then I could just drop the new filenames into a column and let preexisting formulas parse them and output the HTML lines automatically.

That basically gave me an HTML file.  Now I would just have to add its opening and closing lines.  I wasn't sure what those should look like, so I right-clicked on some random webpage, selected "View Source" (an option that may not be available in all browsers, at least not without some add-ons; I wasn't sure), and decided that what I needed for an opening line would be "<!DOCTYPE html>" and the closing line should be "</html>" (without quotation marks), though I later realized that the latter was probably either unnecessary or incomplete.  I also needed a second line that read, "This is my file," because otherwise everything that I had done would create a completely blank-looking page, leaving me uncertain and confused.  So I added those lines to dirlist.txt, saved it as dirlist.htm, opened it in Internet Explorer (Ctrl-O or Alt-File > Open), and tried the Alt-F > Print > Options tab > "Print all linked documents" option mentioned above.  (Note that dirlist.htm still had to be in the same folder as the .mht files that I wanted to print.)

That worked, sort of.  It automatically gave me a boatload of .pdf files, and may I say it did so in a hell of a hurry.  Problem was, they were all blank.  It tentatively appeared that Bullzip and Internet Explorer were going to go through the motions of printing those linked files; but because I was dealing with MHTs instead of HTMs, they would passive-aggressively give me output with nothing inside.  So, like Columbus finding Haiti instead of Malaysia, I had figured out how to bulk-print HTML files, but that wasn't what I had told everyone I was trying to do.
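
Incidentally, the index-building part of this exercise -- the DIR listing, the link-wrapping, and the opening and closing lines -- could be generated in one pass by a short script, for anyone repeating the experiment.  A Python sketch (the folder is a placeholder; giving each link visible text also avoids the blank-page effect noted above):
from pathlib import Path
folder = Path(r"D:\MHTs")                                # placeholder: folder holding the .mht files
lines = ["<!DOCTYPE html>", "<html><body>"]
for f in sorted(folder.glob("*.mht")):
    lines.append(f'<a href="{f.name}">{f.name}</a><br>')
lines.append("</body></html>")
(folder / "dirlist.htm").write_text("\n".join(lines), encoding="utf-8")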

Bulk Converting MHTs to HTML with mht2htm

Well.  Could I bulk-convert MHTs to HTMs and call it a day?  A search led to mht2htm.  I downloaded the Win32 versions (both GUI and command line), along with the Readme and the Help file.  Basically, it looked like I just needed to (1) copy mht2htmcl.exe into the folder containing my MHT files, (2) create a subfolder, there, called OutputDir, (3) edit dirlist.htm to comment out the non-file (i.e., starting and ending) lines, and then (4) do another couple of searches and replaces in dirlist.htm, so that my lines looked like this:
mht2htmcl "First File Name.mht" OutputDir
mht2htmcl "Another File Name.mht" OutputDir
According to the very brief documentation accompanying mht2htm, these commands would do the trick.  I made these changes, and then renamed dirlist.htm to be dirlist.bat, made sure it was in the folder containing the MHTs and mht2htmcl.exe, and ran it.  It didn't work.  I wasn't sure why not.  So I tried the GUI version instead.  Much easier, and it did produce something in the Output directory.  What it produced was a bunch of folders, one for each MHT file, with names like "First File Name_Files."  Each folder held a couple dozen files, mostly GIFs for the graphic elements of the HTM file.  The key file in each folder was called _0_start_me.htm.  If I double-clicked on that, it would open in Firefox (my default web browser), with a line near the top that said, "Click here to open page"; and if I clicked on that, I got a nice-looking webpage in Firefox.

So that was not fantastic.  Now, instead of opening MHT files one at a time in Word or a web browser, and printing from there, I would have to convert them to HTM so that I could dig into their separate folders and do the same thing with a _0_start_me.htm file.  It would probably be easier to print HTMs than it had been to print MHTs, but there was the problem that those _0_start_me.htm files did not have the original filename.  Fortunately, the file name had been preserved in the name of the folder created by mht2htm.  So I would have to use an Excel spreadsheet to produce printing or renaming commands that would rename the PDF version of the first _0_start_me.htm file to be "First File Name.pdf," and likewise for all the others.  But I wasn't ready to do that yet.

Reviewing How to Use wkHTMLtoPDF

So far, as discussed in a previous post, the best tool I had found for batch converting HTMs to PDFs was wkHTMLtoPDF.  Somewhat past the halfway point in that long post, in a section titled "Revised Final Step:  Converting TXT to HTML to PDF," I had worked out an approach for using wkHTMLtoPDF.  The first step, as I reconstructed my efforts from that previous post, was to install wkHTMLtoPDF.  That created a folder:  C:\Program Files\wktohtml.  wkHTMLtoPDF was a command-line program.  Windows would have to know where to look to find it.  To address that need, I copied everything from the C:\Program Files\wktohtml folder to a new, empty folder called D:\Workspace.  Now I could type a command referring to wkHTMLtoPDF, in a batch file or command window running in D:\Workspace, and the computer would be able to execute the command. I also created a subfolder, under D:\Workspace, called OutputDir.

Next, I went into a command window, running in D:\Workspace, and typed "wkhtmltopdf /?" to get a list of command options.  My previous post, interpreted in light of that command and a glance at wkHTMLtoPDF's manual, seemed to say that the command options that had worked best for me included "-s" to set the output paper size; options to set top (-T), bottom (-B), left (-L), and right (-R) margins (in millimeters); and --dpi (to specify dots per inch).  It seemed, then, that the command line that I would need to use, for each of the _0_start_me.htm files, would use this basic syntax: 
start /wait wkhtmltopdf [options] [input folder and HTM file name] [output folder and PDF file name]
I would run that command in the Workspace folder, where I had now placed the wkHTMLtoPDF program files.  With a command of that type, wkHTMLtoPDF would find the _0_start_me.htm file created by mht2htm (above), and would convert it to a PDF file saved in D:\Workspace\OutputDir.  The source folder and file names were pretty long in some cases, but this D:\Workspace\OutputDir part of the command was brief, so hopefully my full wkHTMLtoPDF command would not exceed any command line limits.  So now I was ready to try an actual command.  I made a copy of one of the folders created by mht2htm, renamed it to be simply "Test," and ran this command in D:\Workspace:
start /wait wkhtmltopdf -s Letter -T 25 -B 25 -L 25 -R 25 --minimum-font-size 10 "D:\Test\_0_start_me.htm" "D:\Workspace\OutputDir\Testfile.pdf"
That worked.  But, of course, the resulting Testfile.pdf was just a PDF of the HTML page that said, "Click here to open page."  I wouldn't get my actual MHT page in HTML format until I clicked on that link, in each of those _0_start_me.htm files, and the resulting HTML page would be open in Firefox, where I would still have to come up with a batch printing option to handle all of the tabs that I would be opening.  It still wasn't an automated solution.  I assumed that the approach of using Internet Explorer > Print All Linked Documents as above (but this time with HTMs instead of MHTs) would likewise give me webpages with that "Click here to open page" option. 
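
Had those _0_start_me.htm pages contained the real content, the loop itself would have been easy to automate, with each PDF taking its name from the folder (and thus from the original MHT).  A Python sketch of that loop, assuming wkhtmltopdf and the mht2htm output folders sit where shown:
import subprocess
from pathlib import Path
WKHTMLTOPDF = r"D:\Workspace\wkhtmltopdf.exe"            # assumed location of the converter
OUTPUT_DIR = Path(r"D:\Workspace\OutputDir")
for folder in Path(r"D:\Workspace").glob("*_Files"):     # folders created by mht2htm
    src = folder / "_0_start_me.htm"
    pdf = OUTPUT_DIR / (folder.name.replace("_Files", "") + ".pdf")
    subprocess.run([WKHTMLTOPDF, "-s", "Letter", "-T", "25", "-B", "25",
                    "-L", "25", "-R", "25", str(src), str(pdf)])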

Trying VeryPDF HTML Converter

My immediate problem seemed to be that I didn't have a good way to automate the conversion of MHTs to HTMs -- a way that wouldn't give me that funky "Click here to open page" stuff from mht2htm.  My larger problem was that, of course, I didn't have a way to automate getting PDFs from those MHTs, which was the original issue.

The possibilities that I had developed so far seemed to be as follows:  (1) Forget automation; just print the MHTs manually, selecting 15 at a time and choosing the Print option, which would start 15 sessions of Word.  (2) Select and open them in Firefox or some other browser, which would open up 15 (or whatever number) of individual tabs, each likewise calling for manual printing as PDFs unless I could find a way to automate the printing of multiple browser tabs.  (3) Try to figure out why the Internet Explorer approach was giving me blank PDFs.  (4) Look again for something other than mht2htm, to convert MHTs to HTML.  (5) Play some more with the wkHTMLtoPDF approach, in case some automated solution emerged from that.

As I wrote those words of review, I wondered whether Windows XP might handle one or more of those alternatives differently. I had already installed Windows Virtual PC, with its pre-installed virtual Windows XP session; all I needed was to go in there and, if necessary, install programs in it.  But I hadn't encountered any specific indications that some program or approach had worked better in Windows XP, so I decided not to pursue this.

I thought I could at least search for some other MHT converter.  It suddenly appeared that, in my focus on PDF printers, I might not have done a simple search for an MHT to PDF converter.  That search, done at this point, led to novaPDF, a piece of commercial software that would apparently do the job.  But on closer examination, novaPDF did not seem to have a batch printing capability.  Another program, VeryPDF HTML Converter, came in a command line version whose basic syntax was apparently like this:
htmltools [options] [source file] [output file]
This syntax assumed, as with wkHTMLtoPDF (above), that htmltools.exe was being run in a folder, like my D:\Workspace, where the command files would be present -- unless, again, the user wanted to fiddle with path or environment variable adjustments.  Typing just "htmltools" on the command line, or opening the accompanying Readme file, demonstrated that this had lots of options.  I thought I might try just using it, to see if it worked at all, before fiddling with options.  So I copied the full contents of the VeryPDF program folder (i.e., several folders and 15-20 files, including htmltools.exe) to D:\Workspace, made sure Test.mht was there as well, opened a command window there, and typed this:
htmltools Test.mht TestOut.pdf
The command window gave me a message, "You have 299 time to evaluate this product, you may purchase a full version from http://www.verypdf.com."  I didn't find a reference to htmltools on their products webpage or on their list of PDF Products By Functions, and this particular message didn't give me another name to look for, so I wasn't sure whether I would be buying the right program.  A review of a couple of webpages eventually revealed that this was VeryPDF HTML Converter.  The GUI version, which I didn't want, would cost $59.  Sixty bucks to convert MHTs?  But it got better, or worse.  The command-line version was $399.  I guess while I was at it, I could ask them to throw in Gold Support for only $1,200 a year.  Beyond a certain level of ridiculousness, a casual user might be forgiven for considering the option of just running this puppy in a disposable virtual machine, if uninstalling and reinstalling didn't do the trick.  In all fairness, they seemed to be thinking of server administrators, not private home users.  And they did give us 300 free conversions.  Still, at prices like these, it would have been nice if that had been 300 conversions a year, not 300 lifetime.  They were basically persuading me to use the program once and then forget about it.

Anyway, the program ran for a few seconds and then claimed it had succeeded.  I looked.  TestOut.pdf definitely did exist, and it looked good.  No apparent need for any additional options.  I wondered if it would default to the same filename with a PDF extension if I just typed "htmltools Test.mht," without specifying TestOut.pdf, so I ran the command again with that alteration.  That worked.  I tried it once more, this time specifying a source folder and an output folder without a filename ("htmltools D:\Workspace\Source\Test.mht D:\Workspace\Output").  This time, it said, "Save to file failed!"  Its messages seemed to say that it found Test.mht without a problem.  Why wouldn't it write to Output?  Maybe it was trying to write a file called Output, when I already had a folder by that name.  I repeated the command, this time with a trailing backslash (i.e., "htmltools D:\Workspace\Source\Test.mht D:\Workspace\Output\").  Still failed.  And the bastards docked me anyway.  I was down to 296 free tries.  So what were we saying:  it could output a file without a need to specify a filename, but it couldn't output to another folder?  If all else fails, RTFM.  But the Readme.txt didn't contain any references to folders or directories.  Well, would it at least work if I specified everything (i.e., "htmltools D:\Workspace\Source\Test.mht D:\Workspace\Output\Test.pdf")?  Yes, it would.  So that was the answer:  I would have to work up my command lines in Excel (above) to include the full file and path names for both the source and the target.  With those commands in a batch file, I decided to give it a run with a couple dozen files, just to make sure, before blowing my remaining 295 budgeted conversions on a futile gesture.  It ran well.  I was set.  My fear that some commands might be too long was unfounded: the htmltools commands ran successfully with a command as long as 451 characters.  I converted the rest of these MHTs and then deleted them, and hoped never to see them again.
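
As an alternative to the Excel step, the command lines themselves could be generated by a small script that writes an htmltools line with full source and target paths for every MHT under a given folder.  A Python sketch (the folders and the batch file name are placeholders):
from pathlib import Path
source_root = Path(r"D:\Workspace\Source")               # placeholder folders
target_root = Path(r"D:\Workspace\Output")
with open(r"D:\Workspace\convert.bat", "w") as bat:
    for mht in source_root.rglob("*.mht"):
        pdf = target_root / (mht.stem + ".pdf")
        bat.write(f'htmltools "{mht}" "{pdf}"\n')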

Technically speaking, the project was done.  If I needed more MHT conversions than I could accommodate within the limited private usage of VeryPDF's htmltools.exe, I would go back to the five options enumerated at the start of this last section of this post.  Since I already had all this stuff in mind, and my Excel spreadsheet was set to go, I ran a couple more lines:
DIR D:\*.mht /s /a-d /b > D:\MHTlist.txt
DIR E:\*.mht /s /a-d /b >> D:\MHTlist.txt
to see if I had any other MHTs on D or E.  (Note the double >> marks in the second line -- they append to MHTlist.txt instead of overwriting it, if it already exists.  Of course, once I had the command set, I could just hit the Up arrow in the command window to bring the previous command back, after running it, and then use Home and left & right arrow keys to revise it.)  This gave me a file called MHTlist.txt, containing a list of additional MHTs that I thought I might as well convert to PDFs while I was at it.  For these, the command lines would produce a PDF back in the source folder.  Once those PDFs were created in the source folders, I used Excel (and could probably also have used Ctrl-H in Notepad) to do a DIR [filename].* >> listing (which would show both \Source Folder\File.mht and \Source Folder\File.pdf in the resulting dirlist.txt file) for each specific file that I had converted.  This produced a nice pair for each filename (i.e., x.mht and x.pdf).  The process seemed to work.  Now I just needed one more go with Excel, to produce DEL lines that would get rid of the MHTs in the source folders.  One more check:  no MHTs left.  Project completed.

Sunday, January 22, 2012

Windows Seven Forums: Banned!

I was trying to get an answer to a question about Windows 7.  I went to Windows SevenForums.com.  When I tried to log in, I got this message:

You have been banned for the following reason:
spam
Date the ban will be lifted: Never
I had gotten that message previously, and had twice asked them why I was getting this.  They had never replied.

As one comedian put it, "It's funny, I haven't even done anything yet. How did they know that I'm going to spam?"  The answer seemed to involve backlinking.  I was not entirely sure what that was, and therefore could not say whether I had ever done it -- though it seemed unlikely, since I had only posted one message on their forum -- a message that I could not now examine, to see how it might have erred.

This time, when I tried to use their contact form to ask them what was going on, they just kept giving me one Captcha after another.  Some of them were really hard to figure out, so I had to click on the recycle button to get a different one.  I didn't realize they might just be playing games until I got this one:


It did seem unlikely that SevenForums.com seriously expected me to enter or translate that bit of Hebrew.

I concluded, at this point, that my only remedy was to put this item out there for the world to see, so that perhaps someone with some way of getting through to SevenForums.com might be able to persuade them to smell the coffee.

Unlocking PDFs

I was working with PDF files.  For purposes of reading them onscreen, I found it helpful to go into Adobe Acrobat and crop off most of the white space around the text.  This would allow Acrobat's page view to show the text in a larger size, making it easier to read.  I also sometimes liked to add bookmarks.  And for many PDFs, I would add a comment balloon on the first page, citing the URL, bibliographic information, or other indications of where the PDF came from, for future reference.

None of these things were possible when the creator of the PDF would password-protect it.  I was not sure why people would do that.  Most of them didn't, but there was the occasional exception.  Maybe they feared that someone would go in and change their wording.  I had never seen or heard of that, in my years of working with PDFs.  It did not seem like a realistic concern, for purposes of the kinds of PDFs that I was working with.  The drawback of that protection was that it made it irritating, difficult, and for some purposes impossible to work with what they had written.

To get rid of the password, so that I could make adjustments like those described above, I did a search for appropriate freeware.  The first site that came up was an About.com review of a handful of tools that were supposedly capable of unlocking PDF files.  Some of these programs could actually figure out what the original password was, though apparently that could take hours, days, or longer, depending on how complex the password was and what level of security was used.  It was evidently faster for a program to just break the password without trying to figure out what it was.

Among the tools listed in that About.com review, I had long used Freeware PDF Unlocker.  Unfortunately, I was not able to install the version available at this point (1.0.4) in 64-bit Windows 7, even though the About.com webpage said that I should have been able to do that.  The solution to that problem was to run it from within a Windows XP virtual machine (VM), created automatically by Windows Virtual PC.  Once I had installed Virtual PC, I had to go into it, open a Windows Explorer session (Start > Run > explorer), copy the Freeware PDF Unlocker installation program to Local Disk C, and install it from there.  I had to do that copying inside the VM because, of course, the virtual drive C used by Virtual PC did not correspond to any real drive on the host:  it would not appear in Windows Explorer in Windows 7, outside of the VM. I had to install it from Local Disk C, inside the VM, because the VM would not install programs located on real drives.  Once it was installed, Freeware PDF Unlocker put an icon on the desktop in the VM.  I could then drag a passworded PDF from Windows Explorer, drop it on that icon, and get a PDF of the same name, with "_noPW" added at the end of the file's name.

Needless to say, it was a bit of a hassle to start a Windows XP VM, find the PDF in a Windows Explorer session inside the VM, and drag it over to the icon.  Freeware PDF Unlocker also was not able to unlock some PDFs with more advanced security, imposed by people who were desperate to insure that nobody would ever be able to change their words -- adding, perhaps, a faux testimonial about the author's positive experiences in having sex with animals.  I mean, that was certainly something to be worried about.  If I was going to continue with my work of inserting such comments into random PDFs, I was going to have to find a better PDF cracker.

The first relevant tool on the About.com list was Guaranteed PDF Decrypter (GuaPDF for short).  I ran it on a PDF that Freeware PDF Unlocker hadn't been able to unlock.  GuaPDF gave me a message stating that it definitely could unlock the file, but unfortunately the file was too large for the free version, so I was going to have to buy the paid version.  That was a possibility, but I decided first to continue down the About.com list.

The next option on that list was FreeMyPDF.com.  This was an online service.  I had to upload my PDF and let them do their magic on it.  I then spent a minute or two watching the charming RevolverMaps image, on their webpage, that showed how people around the world were using FreeMyPDF at this very ... well, recently.  I didn't snap out of it until FreeMyPDF had downloaded the decrypted PDF back to me, a minute or two later.  Its print was a bit faded, compared to the original, but it looked like everything was there.  Acrobat wasn't able to OCR its pages until I first saved them as JPGs and then stitched them back together in Acrobat, producing a much larger file.  Then again, I didn't really need to OCR its pages.  They were already OCRed.  I did it just because the OCR process (or, in this case, reducing the gamma on the individual JPGs in IrfanView > File > Batch Conversion/Rename > Advanced, after first testing on a sample JPG in IrfanView with Shift-G) could sometimes improve the appearance of the text.

Anyway, FreeMyPDF had done the job, so that killed the incentive to look further down the About.com list.  But if the incentive had been there, the next candidate would have been PDF Password Remover Online.  This appeared to be another upload-download option, though apparently (according to About.com) accommodating a smaller maximum PDF size.  Another option:  PdfCrypt, a command-line tool.  There were a couple of other tools on the About.com webpage, but they seemed to be password retrievers and, as I say, would thus presumably be slower.

Saturday, January 21, 2012

Windows 7: A Batch File to Sort Files

Since the mid-1980s, I had been using batch files to automate various tasks.  While a person could do some fancy things with batch files, mine tended to be very basic.  The simple commands that I had learned in DOS had mostly remained useful in Windows 3.1, 95, 98, XP, and 7.

Batch files could be run in multiple ways.  If the file was already set up and if there was no desire to watch it perform, it would run by just double-clicking on it or selecting it and hitting Enter in Windows Explorer.  A batch file could also be run by being called from within another batch file or other program.  It was also possible to run a batch file by opening a command window, moving to the folder where the batch file was saved, and typing its name.
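
For example (using a hypothetical folder and batch file name), the command-window method would look like this:
D:
cd \Batch
mysort.bat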

Those and other aspects of batch files are described in previous posts.  It occurred to me, at this point, that those posts had not yet mentioned a certain kind of batch file that might save some people a lot of time.  This batch file would sort files into subfolders according to their names.

Consider, for instance, the situation in which I would have a thousand files with names like these:

Letter to Jones cbgb.doc
Letter to Jones cbgb.pdf
Letter to Jones cbgb A.pdf
Letter Dec.23 to Jones.txt
1988-12-23 Letter to Jones.doc
1993-06-23 Letter to Jones.doc
A real mess.  How could I possibly find which of these, if any, were duplicates?  And then, once I got past that, how could I sort them into appropriate folders?

I had been struggling with the first of those two questions for some time.  I had some partial solutions.  For instance, note that the first two items in this list have the same names but different extensions.  I could convert one to the other format -- printing the .doc file to PDF, or using Acrobat or some other program to save the PDF in .doc format -- and then I could open both of them and eyeball them for differences.  Or, in a different scenario, if I knew they were different, but they were scattered all over my hard drive, I could do a search that would ignore extensions or that would focus on file dates, using a program like DoubleKiller Pro, and decide which if any I could safely delete.  Similarly, between the second and third items in the list, I could do a DoubleKiller search for near or exact duplicates, or a search that would focus on most but not all of the filename.

Once I had reduced the number of possible duplicates, I would want to rename files so that they would all follow a certain pattern.  The last two in the list show the pattern that I would want to use for renaming the others.  In a more complicated form that might be useful for some purposes, the pattern could be year-month-day-hour-minute-item-title-version.extension, like this:
1988-12-23 11.23 01 First Item Produced at 11-23 AM A.txt
That format would have two advantages:  it would automatically sort and display files in chronological order, in Windows Explorer and other programs, and it would make it easier for my batch file to sort files into subfolders by year, month, and so forth.  To get files into that format, as more fully explained elsewhere, I would use tools like Excel and commands like DIR -- perhaps with the aid of something like Bulk Rename Utility, whose initially discouraging interface had proved easier to use than I would have expected.
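
As a minimal sketch of that renaming step (the folder name and extensions here are hypothetical), a DIR command could dump the existing filenames into a text file; that list could then be pasted into Excel, where formulas would build the corresponding rename commands to be run as a batch file:
D:
cd "\Old Letters"
DIR *.doc *.pdf *.txt /b /a-d > filelist.txt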

Once I got the files into the proper format, it would be time to work up my sorting batch file.  I would do that in a program like Notepad -- not Microsoft Word or some other word processor, since they would tend to use characters that would not be understood by the command processor.  Smart quotation marks are an example of that.

The concept in this example is that I would have top-level folders for decades, subfolders for years, and sub-subfolders for months, like this:
1990s
--1991
--1992
----1992-01
----1992-02
and so forth.  It would be fairly easy to write a batch file, perhaps with the aid of Excel, that could automate the process of creating those folders (or any sequential set of empty folders), if necessary; a short sketch of such a folder-creating batch file appears below, after the following command summary.  The commands that I would use in that batch file would also tend to be the commands that I would use in the sorting batch file.  They would include examples like these:
D:     move to drive D

cd \   change directories to the root folder in the current drive

cd ..  change to the parent folder

cd Subfolder    change to Subfolder (this command works only if Subfolder is directly below the current folder, which can be verified by typing DIR)

md Subfolder     make directory called Subfolder immediately below the current folder

move document.txt D:    move document.txt to the currently active folder on drive D

*.txt     refers to all files whose names end with a .txt extension

199?.doc     refers to all files whose names start with 199, followed by one other character, with a .doc extension (e.g., 1991.doc, 199A.doc)
More information was available for commands by typing the command name followed by /? (example:  move /?) at the command prompt.  They could be used in combination.  For instance, "move *.txt D:\Folder\Subfolder" would move all files with a .txt extension from the current directory to the Subfolder child of Folder on drive D.
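
To illustrate, here is a minimal sketch of a folder-creating batch file for part of the structure above.  (The "ARCHIVES BY YEAR" location is simply the one used in the sorting batch file below; Excel could generate the long run of repetitive MD lines.  In Windows 7's command processor, MD creates any intermediate folders that do not yet exist.)
:: BATCH FILE TO CREATE DECADE, YEAR, AND MONTH FOLDERS

D:

cd "\ARCHIVES BY YEAR"
md "1990s\1990\1990-00"
md "1990s\1990\1990-01"
md "1990s\1990\1990-02"
md "1990s\1991\1991-01"
(etc.)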

The sorting batch file could then be structured like this:
:: BATCH FILE TO SORT FILES

D:

:: ********** MOVE TO DECADE GROUPS **********

cd "\WORKSPACE"
move /-y 199?-*.* "\ARCHIVES BY YEAR\1990s"
move /-y 200?-*.* "\ARCHIVES BY YEAR\2000s"
move /-y 201?-*.* "\ARCHIVES BY YEAR\2010s"
(etc.)

:: ********** MOVE TO INDIVIDUAL YEARS **********

cd "\ARCHIVES BY YEAR\1990s"
move /-y 1990-*.* 1990
move /-y 1991-*.* 1991
move /-y 1992-*.* 1992
(etc.)

:: ********** MOVE TO MONTH FOLDERS **********

cd "\ARCHIVES BY YEAR\1990s\1990"
move /-y 1990-00-*.* 1990-00
move /-y 1990-01-*.* 1990-01
move /-y 1990-02-*.* 1990-02
(etc.)
Its first line is a comment.  That is, it begins with a double colon (::) and, as such, would be ignored by the command processor.  It is just there for the reader's information.  The next line moves us to drive D.  Then another comment, indicating that now we are going to move all of the files to decade folders (with names like "1990s" and "2000s"), according to the first three characters of their filename.  The following commands do that, after a "cd" command that moves us to the Workspace folder where the files to be sorted are located.  (The quotation marks around "Workspace" are unnecessary, but would be necessary if that folder's name had a space in it.) 

Then the batch file contains another section, structured the same, that moves files to subfolders based on their year.  For instance, "move /-y 1990-*.* 1990" means "move (but ask me to confirm before overwriting) all files with names that begin with 1990-, followed by any other characters, and with any extensions, to the folder named 1990."  So the files that would have been sorted into the 1990s folder are now going to be subsorted into the 1990, 1991, 1992 ... folders.  Finally, the months folders:  all of the files that wound up in the 1991 folder are now going to be sorted into the folders for months 0, 1, 2 ...  (I would allow a 0 folder for those files that could be traced to sometime in the year, but not to a specific month.)

And that's basically it.  A lot of information in a brief presentation, but hopefully it provides some ideas for how a batch file could be used to automate the sorting of any number of files, on any number of occasions.

Friday, January 20, 2012

Adding Bad Clusters to the Bad Clusters File - Second NTFS Boot Sector Is Unwriteable

I was using Windows 7 and, as such, was using hard drives formatted in NTFS and divided into several partitions.  I booted the system with an Acronis True Image Home 11 boot CD, made an image of one partition, and saved it on another.  When I rebooted Windows, the folder that I had told Acronis to create (named "2012-01-20 Backup") did not exist.  Windows Explorer showed it as being just a file, without an extension.

In Windows Explorer, I right-clicked on the drive and selected Properties > Tools > Check now.  I clicked both boxes (i.e., "Automatically fix file system errors" and "Scan for and attempt recovery of bad sectors") and then Start.  But the tool did not run.  It just disappeared.  That was probably because I had a paging file on that drive (Start > Run > SystemPropertiesAdvanced.exe > Advanced tab > Settings > Advanced tab > Virtual memory > Change).  I could have changed that, and then rebooted to make the change effective, and then gone back into Windows and tried again.

Instead, I rebooted with a Windows 7 installation DVD and went into Repair your computer > Use recovery tools > Next > Command Prompt.  At the prompt, I typed C: and then DIR, and then D: and then DIR, and so forth until I found the troubled partition.  I typed CHKDSK /R and let it run.  In stage 2 of 5, it deleted the index entry for the 2012-01-20 Backup folder (which it was calling "file 1045").  So it looked like I was going to have to redo my Acronis backup.  After finishing stage 5 of 5, it said this:

Free space verification is complete.
Adding 602 bad clusters to the Bad Clusters File.
CHKDSK discovered free space marked as allocated in the volume bitmap.
The second NTFS boot sector is unwriteable.
Failed to transfer logged messages to the event log with status 50.
That last message didn't bother me -- I was always getting that when I ran CHKDSK this way.  But I wanted to know more about the others.  This was the second time I'd had this kind of problem after running Acronis.  I wondered whether the problem was with the drive -- that perhaps I should get rid of it -- or whether it was instead something that Acronis was doing to the drive.

What I had done previously was to try to resize the partition using a GParted Live (i.e., bootable) CD -- or, actually, I had booted an Ubuntu Live CD (version 10.10 -- apparently not all versions had GParted built in) and had gone into its System > Administration > GParted option.  If I recalled correctly, GParted had been unable to resize the partition, apparently because of the partition's problems, so my next approach was to remove everything from the partition and then use GParted (or possibly Windows Start > Run > diskmgmt.msc) to delete and recreate it.  That had worked last time, so I decided to do it again.  This was the last partition on the drive, and the consensus seemed to be that the second NTFS boot sector was at the end of the partition.  I had used this Acronis CD for a long time, and had not otherwise been getting this problem.  So I thought this time I would create another small partition, after this one, and just let it sit.  If Acronis was getting confused when it wrote to the last partition on the drive, that little 1.5GB partition might provide a buffer and solve the problem.

So now I was going to try to find out whether it was indeed a hard drive issue.  (One other note:  as I continued in the process, it appeared that the Acronis backup may have come very close to filling the troubled partition.  I wondered whether that could somehow have corrupted it.)

I went into GParted again.  It showed an exclamation mark next to this partition.  When I right-clicked on it and looked at Information, it said several things, including these:
ERROR:  This software has detected that the disk has at least 602 bad sectors. . . . This means physical damage on the disk surface caused by deterioration, manufacturing faults or other reason.  The reliability of the disk may stay stable or degrade fast. . . . Unable to read the contents of this file system!  Because of this some operations may be unavailable.
This number of bad "sectors" matched the number of bad "clusters" reported by CHKDSK (above).  That is, it appeared that there might be no bad sectors other than those that were reported after the Acronis process.  I rebooted into Windows and continued to use the system as usual while moving my data off the troubled partition via Windows Explorer.  Then I rebooted into Ubuntu > GParted and, as just indicated, I created two partitions where there was formerly one.  As before, the first of the two was an NTFS partition.  The new addition was that 1.5GB partition, which I formatted as ext3 (invisible to Windows, thus causing no confusion).  Then I did another Acronis image of my programs drive (C:), as before, saving the image to the new NTFS partition.  When Acronis was done, I went back into Windows and saw that Acronis had created what appeared to be a valid image.

I ran System Information for Windows (SIW) to find out which drive this new partition was on, and who made that drive.  After consulting with Disk Management (Start > Run > diskmgmt.msc) to figure out whether I should be looking at disk 0 or disk 1, I saw that it was a Seagate ST31000520AS.  I went to Seagate's webpage and downloaded SeaTools for Windows.  (I already had a copy, but hadn't used it for a long time, and wasn't sure it was the latest version.)  I installed it and ran its Short Drive Self-Test (Short DST).  It passed.  I ran a S.M.A.R.T. test.  It passed that too.

It did not appear that this drive was failing.  I hibernated and then booted up with Ubuntu > GParted.  Its Information option saw no problems.  Neither did its Check option.  I rebooted with the Windows 7 DVD and ran CHKDSK /r on both of the new partitions (i.e., including the little 1.5GB trailer).  There were no statements of the kind I had gotten on the previous try -- nothing about deleting index entries, bad clusters, or an unwriteable NTFS boot sector.  What I got was, "Windows has checked the file system and found no problems" and "0 KB in bad sectors."  Of course, all of my data (except for the newly re-created Acronis backup) was still on the other drive, so I didn't expect file errors (and CHKDSK ran somewhat more quickly on the nearly empty drive).

It appeared that the previous warnings had been false alarms, triggered by the use of Acronis True Image Home 11 to save an image that nearly filled the last partition on the drive. 

To round out my investigation, I went back through the random webpages that I had opened up while searching for insight.  One post raised the possibility that I might have been able to use ntfsresize to resize the partition instead of removing its data, deleting it, and creating a new one in its place.  Ntfsresize (and ntfstruncate) were apparently Linux utilities, and it looked like using them could be risky.  There was a report that seemed to confirm that GParted would have reported an error again if it had found a bad sector, so getting no error messages in GParted's Information option apparently implied that the bad sectors were gone.  Evidently the repartitioning had reset them.  One webpage raised the question of whether the partition had been marked as a boot partition.  As I recalled from my glance in GParted, it hadn't.

Another thread alerted me to the thought that I should have run CHKDSK /B instead of /R.  To verify that, I went into a command window and typed CHKDSK /?.  It said, "NTFS only:  Re-evaluates bad clusters on the volume (implies /R)."  I wondered whether CHKDSK /B might have been a shorter response to the whole problem.  It also said, "The /I or /C switch reduces the amount of time required to run Chkdsk by skipping certain checks of the volume."  As Microsoft advised, I wouldn't have relied on those switches by themselves; but since I had already run /R, at this point I could have tried CHKDSK /I /C /B.  (Too bad there wasn't an /M option -- I'd have had a missile.)  Apparently some of these options were not available in earlier (pre-Vista) versions of Windows.
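
In other words (assuming the troubled partition was D:), the commands at the repair DVD's prompt could have been along these lines:
chkdsk D: /B          re-evaluate the bad clusters on the volume (implies /R)
chkdsk D: /I /C /B    the same, but skipping certain checks to save time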

I had run a couple of searches and probably could have gone on indefinitely, but by this point it began to seem that I had already encountered most of the main recent lines of tinkering on the question.  One exception:  I had searched for, but never did find, a tool that would give me a visual representation or map of bad clusters on the drive.  I vaguely recalled that Norton had provided something along those lines, back in the DOS ages.  Defragmenters like Smart Defrag would show a nice, colorful map of disk clusters, but at this point I wasn't aware of a Windows 7 defragger (or other tool) that provided a depiction of bad clusters specifically, so I sent IOBit a suggestion to that effect.

Saturday, January 14, 2012

Batch Merging Many Scattered JPGs into Many Multipage PDFs - Second Try

I had previously looked for a way to combine multiple JPG files into a single PDF.  As explained in more detail in that previous post, the specific problem was that I might have sets of several JPGs in a single folder that should be merged into several different PDFs, and there might be multiple folders at various places that would have such JPG sets.  Hence, if I wanted to automate this project across dozens of folders containing hundreds of JPGs, it seemed that I would need a command-line solution rather than a GUI.  This post updates that previous attempt.

There were commercial programs that seemed to offer the necessary command-line capabilities, such as PDF Merger Deluxe ($30) and ParmisPDF Enterprise Edition ($300).  A search suggested, however, that PDFsam might offer a freeware alternative.

Assembling the List of JPGs to Be Converted

Before investigating PDFsam, I decided to get a more specific sense of what I needed to accomplish.  In a command window (Start > Run > Cmd), I navigated to the root of the drive I wanted to search (using commands like D: and "cd \").  (The root was the folder whose command prompt looked like C:\ or D:\, as distinct from e.g., C:\FolderZ.)  Being at the root folder meant that my command would apply to all subfolders on that drive.  Once I was there, I ran this command:

DIR *.jpg /s /b /a-d > jpgslist.txt
That gave me a list of files (but not directories, /a-d) in all subdirectories (/s), listed in bare (i.e., simple, /b) format, saved in a new file called jpgslist.txt.  (For more information on DIR or any other DOS-like command, type DIR /? at the command prompt, or search for informational webpages.)  If I'd had files with a .jpeg (as distinct from .jpg) extension, I would have added a second line, referring to *.jpeg and using a double-arrow (>>) to tell the program to add these to the existing jpgslist.txt, rather than creating a new jpgslist.txt (which was what the single arrow (>) would do).
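
In other words, if both extensions had been present, the pair of commands would have been:
DIR *.jpg /s /b /a-d > jpgslist.txt
DIR *.jpeg /s /b /a-d >> jpgslist.txt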

Now I wanted to see which folders had more than one JPG.  I would use Microsoft Excel to help me with this.  I could either open jpgslist.txt and copy its contents into an Excel spreadsheet, or import it into Excel.  In Excel, I did a reverse text search to find the last backslash in each text line, so as to distinguish file names from the directories holding them.  I sorted the spreadsheet by folder name and file name.  I set up a column to detect folders containing more than one JPG, and deleted the rows of folders not containing more than one JPG.  I might still want to do another search and conversion for isolated JPGs at some point, but that would be a different project.

Next, I wanted to see if I could eliminate some folders.  For instance, I might not want to PDF and combine JPGs that were awaiting image editing, or important photos whose quality might get degraded in the PDF conversion.  In other words, I decided that this particular project was just for those JPGs that I was going to combine into a single PDF and then delete.  To get a concise list of folders containing multiple JPGs, I went into Data > Filter > Advanced Filter.  (That's Excel 2003.)  I moved the output into another worksheet.  I could then do a VLOOKUP to automatically mark rows to be deleted.  So that gave me the folders to work on.

Now it was time to decide which files to combine, and in what order.  In some cases, I had named files very similarly -- usually with just an ending digit change (e.g., Photo 01, Photo 02 ...).  So I set up a couple of columns to find the filename's length, subtract a half-dozen characters, and decide whether those first X characters were the same as in the preceding row.  If so, and if both were in the same folder, we had a match.  I discovered, at this point, that one or two folders contained large numbers of files.  I decided to combine those manually.  With those out of the way, it seemed that the next step was to decide the names of the resulting multi-image PDFs (e.g., Medieval Churches.pdf), and to put those names on the spreadsheet rows, next to the individual JPGs that would go into them.

At this point, as described in another post, I learned how to use PDFsam to combine several PDFs into one PDF.  So I had a rough idea of the start of my project (i.e., identify the JPGs that I would want to merge into a single output PDF), and I also had a basic sense of the end of my project (i.e., use PDFsam to merge multiple PDFs into that single output PDF).  I was missing the middle part, where I would convert the original JPGs into single-page PDFs and would get them into a form where PDFsam could work on them.

Converting Individual JPGs to Individual PDFs

I had originally assumed that I would start by converting the JPGs to PDFs within the various folders where they were originally located.  So if I had File1.jpg in E:\Folder1, and if I had File2.jpg in E:\Folder2, my conversion would result in two files in each of those folders:  File1.jpg and File1.pdf in Folder1, and File2.jpg and File2.pdf in Folder2.  Then I would use PDFsam to merge the PDFs (i.e., File1.pdf and File2.pdf) from those locations; delete the original JPGs and PDFs; and move Output.pdf to an appropriate location.

I didn't entirely like that scenario.  It seemed like it could make a mess.  As I reviewed another post in which I had worked through similar issues, I decided that a better approach might (1) make a list of original JPG file locations, (2) move those JPGs to a single folder where I could convert them to individual PDFs, (3) merge the individual PDFs into concatenated PDFs, (4) delete the individual JPGs and PDFs, and (5) move the concatenated PDFs to the desired locations.  I decided to try this approach.

One problem with moving files from many folders to one folder was that there might be two files with the same name.  They would coexist peacefully as long as they were in separate folders; but when they converged into one target folder, something would get overwritten or left behind.  It seemed that a first step, then, was to rename the source JPGs, so that each one would have a unique name -- preferably a short name without spaces, potentially making it easier to write commands for them as needed.  In this step, as in others, it would be important to keep a list indicating how various files were changed.  To rename the files where they were, I returned to my spreadsheet and used various formulas to produce a bunch of rename commands of this type:
ren "D:\Folder Name\File Name.jpg" "D:\Folder Name\ZZZ_00001.jpg"
after doing a search to make sure that my computer did not already have any files with names resembling ZZZ_?????.jpg.  The spreadsheet gave me one REN command for each JPG.  I copied those commands into a Notepad file, named it Renamer.bat, and double-clicked to run that batch file.  (A slower and more cautious approach would have been to run it in a command window, perhaps with Pause commands among its renaming lines, so that I could monitor what it was doing.)  A search in a file-finding program like Everything now confirmed that the number of files on my computer with ZZZ_?????.jpg names was equal to the number of files listed in my spreadsheet.  I cut and pasted all those ZZZ_?????.jpg files from Everything to a single folder, D:\Workspace.  (I could also have used the spreadsheet to generate Move commands to run in a batch file for the same purpose.)
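
Because REN's second argument cannot include a drive or path, each file simply received its new ZZZ name in place, in its original folder.  A more cautious Renamer.bat (the folder and file names here are hypothetical) might have looked like this:
ren "D:\Scans\Church Front.jpg" "ZZZ_00001.jpg"
pause
ren "D:\Scans\Church Side.jpg" "ZZZ_00002.jpg"
pause
(etc.)
The alternative Move commands mentioned above, in place of cutting and pasting from Everything, would have taken the form of move "D:\Scans\ZZZ_00001.jpg" "D:\Workspace".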

Now I had a spreadsheet telling me what the original names of these ZZZ_?????.jpg files were, and I had all those ZZZ files together in D:\Workspace.  My spreadsheet also told me which of them were supposed to be put together into which merged output PDFs.  But they weren't ready to be merged by PDFsam, because they were still JPGs, not PDFs.

To convert the JPGs to PDFs, I could have prepared another batch file, using IrfanView commands to do the conversion, like those that I had previously played with in another project.  But I figured it would be easier to use IrfanView's File > Batch Conversion/Rename.  There, I told IrfanView to Add All of the ZZZ files to its list.  I specified PDF as the Batch Conversion Settings output format, and set its Options > General tab to indicate that Preview was not needed (and adjusted other settings as desired).  I told it to Use Current ("Look In") Directory as the Output Directory for Result Files (adding "Output" as the name of the output subfolder to be created).  Then I clicked Start Batch.

That produced one PDF, in the Output subfolder, for each original JPG.  I hadn't done anything to change their filenames, so ZZZ_00001.jpg had been converted to ZZZ_00001.pdf.  Spot checks indicated that the resulting single-page PDFs were good.  I deleted the original ZZZ*.jpg files, moved the output PDFs up into D:\Workspace, made a backup, and turned to the project of merging those single-page PDFs into multipage PDFs.
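
(For the record, the batch-file alternative mentioned above would presumably have leaned on IrfanView's /convert option, roughly as sketched below.  The IrfanView location is an assumption, PDF output from the command line depends on having the PDF plugin installed, and the Output subfolder would have to exist already.)
for %%F in (D:\Workspace\ZZZ_*.jpg) do "C:\Program Files (x86)\IrfanView\i_view32.exe" "%%F" /convert=D:\Workspace\Output\%%~nF.pdf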

Preparing XML Files to Concatenate PDFs

In my spreadsheet, I had already decided which ZZZ files would be merged together, and what the resulting multipage PDFs would be called.  Now -- referring, again, to the other post in which I worked through the process of using PDFsam -- I needed that information to create File Value lines for a set of ConcatList.xml files that PDFsam would then merge into a set of output PDFs.

In other words, I would have a batch file that would run PDFsam, and I would have a data file, in XML format, to specify the single-page PDFs that PDFsam would combine into the multipage output PDF.  I would have a pair of such files (i.e., a batch file and an XML data file) for each resulting multipage PDF.  In my particular project, there were 65 single-page PDFs, and they would be combined into a total of eight multipage PDFs.  So I would have eight pairs of .bat + .xml files, and the eight XML files would contain a total of 65 File Value lines.

To the extent possible, I would want to automate the creation of these batch and data files.  Sorting 65 data lines into eight different XMLs would be tedious and easily confused.  Things would get much worse if I wanted, in some later project, to use these procedures for hundreds or thousands of JPGs or other files.

I began by adding a column to my spreadsheet that contained the exact text of the appropriate File Value line.  Example:  for ZZZ_00001.pdf, the line would read like this:
<file value="D:\Workspace\ZZZ_00001.pdf"/>
To produce that result, if the Excel spreadsheet's cell D2 contained ZZZ_00001.pdf, its cell E2 would contain this formula:
="<file value="&CHAR(34)&"D:\Workspace\"&D2&CHAR(34)&"/>"
(Note the use of CHAR(34) to add quotation marks where they would otherwise be misunderstood.)  Next, I wanted to assign those File Value lines to the appropriate batch files.  A search confirmed that I didn't have any files on my data drives with YYY_????? names, so I decided that my first multipage output PDF would be called YYY_00001.pdf, and that the pair of files used to produce it would be YYY_00001.bat and YYY_00001.xml.  In other words, the File Value line for ZZZ_00001.pdf (above) would have to be one of the File Value lines appearing in YYY_00001.xml.  But the next File Value line in YYY_00001.xml could be a ZZZ file out of sequence (e.g., ZZZ_00027.pdf), if that happened to be the next original file that I wanted to put into YYY_00001.pdf.

Since YYY_00001.pdf was going to be the temporary working name of the multipage PDF that I would ultimately be calling "Short Letters to Mother," I sorted the spreadsheet (making sure to first use Edit > Copy, Edit > Paste Special to convert formulas to values) by the column containing those ultimate desired filenames, and worked up a column indicating the corresponding YYY filename.  In other words, each cell in that column contained one of eight different labels, from YYY_00001.pdf to YYY_00008.pdf.

With that in place, I was ready to generate some batch commands.  Each batch command would use the ECHO command to send the contents of spreadsheet cells to YYY*.xml files.  My first attempt looked like this:
echo <file value="D:\Workspace\ZZZ_00001.pdf"/> >> YYY_00001.xml
The double greater-than signs (">>") indicated that YYY_00001.xml would be created, if it didn't already exist, and that the File Value line (above) would be added to it.  This first try produced an error, as I feared it might:  ">> was unexpected at this time."  The command processor was reading the less-than and greater-than symbols inside the File Value line as redirection operators.  I had to modify the formula in my spreadsheet (or use Ctrl-H) to escape them with carets (^), like this:
^<file value="D:\Workspace\ZZZ_00001.pdf"/^>
That worked.  Now YYY_00001.xml contained that line.  With commands like that for each of the 65 single-page PDFs, my spreadsheet now had cells like these:
echo ^<file value="D:\Workspace\ZZZ_00051.pdf"/^> >> YYY_00004.xml
echo ^<file value="D:\Workspace\ZZZ_00025.pdf"/^> >> YYY_00006.xml
I sorted the rows in my spreadsheet by the appropriate column to make sure the single-page PDFs would get added to their multipage PDFs in the proper order.  (If necessary, I would have added another column containing numbers that I could manipulate to insure the desired order.)  Then I copied all those cells over to Notepad and saved it as a new batch file that I called Sorter.bat.  I ran Sorter.bat and got eight XMLs, as hoped.  Spot checks seemed to indicate that the process had worked correctly.

My eight XML files were not complete for purposes of PDFsam.  Each of them would need lines of code preceding and following the File Value lines.  As described in the other post, those files would begin with
<?xml version="1.0" encoding="UTF-8"?>
<filelist>
and would end with
</filelist>
I saved those two beginning lines into a text file called Header.txt, and I saved that ending line into another text file called Tailer.txt.  Now I needed to combine them with the XML files that Sorter.bat had just created.  For that purpose, it seemed that my spreadsheet could benefit from a separate page dedicated to XML file manipulation.  I added that page, filtered my existing page for unique values in the YYY*.pdf column, and placed the results on that new page. 

I could now see that I was too early in adding .xml extensions to the eight files.  I went back into the spreadsheet and changed it to produce files without extensions (e.g., YYY_00001 instead of YYY_00001.xml), and then I deleted the XML files and re-ran Sorter.bat (as modified) to verify that it was all still working.

With that change, I returned to the spreadsheet's XML Files page.  Next to each of the eight XML filenames, I added columns to produce commands of this type:
copy Header.txt+YYY_00001+Tailer.txt YYY_00001.xml
I put those eight commands into a batch file and ran it.  It worked:  I had eight XML files with everything that PDFsam needed.  There was just one small glitch:  at the end of each resulting XML file, there was a little arrow, pointing to the right.  Searches didn't yield any obvious explanations.  I wasn't sure if it would make a difference; I thought it might just be a symbol representing end-of-file or line feed.  I decided to forge ahead and see what happened.  Except for that little character, I had exactly what I needed for my XML files.
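
To make that concrete, a finished YYY_00001.xml (using the example File Value lines from above; each real file listed its own set of single-page PDFs) would look essentially like this, aside from that stray character:
<?xml version="1.0" encoding="UTF-8"?>
<filelist>
<file value="D:\Workspace\ZZZ_00001.pdf"/>
<file value="D:\Workspace\ZZZ_00027.pdf"/>
</filelist>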

Preparing Batch Files to Use the XMLs

The next step was to create a matching YYY_?????.bat file for each YYY_?????.xml file.  This batch file would run the commands necessary to merge the single-page PDFs listed in the XML file.  I would use the same techniques as in the XML files.  There would be no Tailer.txt file this time; the line that would need to change, in each BAT file, was the very last line.  So my COPY command (above) would just have Header.txt plus the variable line to produce the YYY_?????.bat file.  The variable (last) line of the batch file had to look like this:
%JAVA% %JAVA_OPTS% -jar %CONSOLE_JAR% -l D:\Workspace\YYY_00001.xml -o D:\Workspace\Merged\YYY_00001.pdf concat
In other words, it would have two variables:  the name of the XML file providing the input, and the name of the PDF file containing the output, saved in a Merged subfolder.  It was pretty straightforward, by now, to use the spreadsheet to generate the necessary commands and to run them in another Sorter.bat file (see above).  I just had to remember to delete my previous YYY_????? files (without extension), so that their contents would not get thrown into the mix.  I did wish that PDFsam's -l option were expressed as -L, so that nobody would think it was the number one, but I wasn't yet ready to experiment and find out whether -L would work just as well.  Anyway, to produce a line like the one shown immediately above, my Excel formula looked like this:
="echo %%JAVA%% %%JAVA_OPTS%% -jar %%CONSOLE_JAR%% -l D:\Workspace\"&A2&".xml -o D:\Workspace\Merged\"&A2&".pdf concat >> "&A2 
where cell A2 contained the filename without extension (e.g., YYY_00001).  I had to use double percentage symbols to get a single one to come through.  I put the resulting eight lines into Sorter.bat, and it produced eight YYY_????? files, as before.  I ran another batch file to combine Header.txt plus the YYY_????? files to produce YYY_?????.bat -- again, same as above, but without Tailer.txt and making sure to produce .bat files, not .xml files. 

These steps gave me eight pairs of .bat and .xml files.  The batch files looked good, except for the little arrow at the end (above).  Now, if all went well, the batch files would run, would consult the XML files for the lists of single-page PDFs to merge, and would produce eight YYY_?????.pdf output files in the Merged subfolder.  I would not want to run the batch files manually, if I were producing a large number of merged PDFs, so I wrote a batch file to run the batch files.  The commands in this file looked like this:
@echo off
call YYY_00001.bat
call YYY_00002.bat
and so forth.  I ran this batch file.  It gave me error messages.  I opened a command window and typed just its first action line:  call YYY_00001.bat.  The error was:
FATAL  Error executing ConsoleClient
java.lang.Exception: org.pdfsam.console.exceptions.console.ParseException: PRS001 - Parse error. Invalid name (D:\Workspace\YYY_00001.xml) specified for <l>, must be an existing, readable, file.
Oh.  Dumb mistake.  The XML files weren't in D:\Workspace.  I moved them and tried again on the command line.  Another error:
Error on line 5 of document file:///D:/Workspace/YYY_00001.xml : Content is not allowed in trailing section.
That little arrow was on line 5.  I moved it to line 6 and tried again.  Same error, except now it said the problem was on line 6.  I deleted the little arrow and tried again.  That solved that problem.  Now a different error:  "The system cannot find the path specified."  That was probably because I had not yet created the Merged subfolder.  Apparently PDFsam was not going to create a folder that did not already exist.  I created it and tried again.  Success!  YYY_00001.pdf was created in the Merged folder with the desired single-page PDFs in it.

Now I just had to figure out how to prevent that little arrow from appearing in the XML files. It came at the end of both the XML and the BAT files, and it got there when I used the COPY command to combine the header, the command, and the tailer text files:  in its default (ASCII) mode, COPY appends an end-of-file character (Ctrl-Z) to the combined file, and that character was showing up as the little right-pointing arrow.  The solution was to add the /b switch, telling COPY to treat the files as binary and skip that character:
copy /b Header.txt+YYY_00001+Tailer.txt YYY_00001.xml
With that change, I went back through the process of creating the XML files.  Then I tried running YYY_00001.bat again.  I got an error:  "Cannot overwrite output file (overwrite is false)."  Oops.  I had forgotten to get rid of the previously produced YYY_00001.pdf in the Merged folder.  This time I was successful without having to manually remove the little arrow -- it wasn't there anymore in the new YYY_00001.xml.  I ran the batch file that called the eight YYY_?????.bat files.  It ran and produced eight multipage PDFs.  Those eight contained a total of 65 pages.  I combined them all in Acrobat, just to take a quick look.  They were all good.

Putting the Multipage PDFs Back Where They Belong

Now I needed to decide where to put the multipage PDFs.  In this case, the individual PDFs that went into each of the multipage PDFs had all come from the same folder.  That is, I did not have Folder1\PDF1 plus Folder73\PDF2 going into BigPDF-A.  So I could use the spreadsheet to determine semi-automatically the path and filename for a rename command.  I wound up with Rename commands like this:
ren YYY_00004.pdf "Medieval Craft Workers.pdf"
followed by Move commands like this:
move /y "Medieval Craft Workers.pdf" "E:\IMAGES\Medieval Craftsmanship\"
I verified that the multipage PDFs had returned to the folders whence the single-page PDFs had originated.  This project was finished.