Sunday, June 10, 2012

Batch Converting Many WordStar (.ws) Files to PDF

I had previously worked out a command that would convert all of the Microsoft Word (.doc) or WordPerfect (.wpd) files in a folder to PDF.  Now I wanted to try that on a batch of old WordStar (.ws) files.  This post discusses that task.

As described in the previous post, I set my PDF printer (Bullzip) to print without opening the resulting files and without interruptions, except that I think I did let it notify me of error messages.  I didn't want to have to approve each conversion manually.  Also, I had given the WordStar documents a .ws extension, even though that extension was not necessary back in WordStar's heyday.

I had also configured my copy of Word 2003 to recognize and open .ws documents.  I was not entirely sure how I had managed this.  My records suggested two possibilities.  One was to run a .reg file containing the following lines, so as to modify the registry in some hopefully appropriate way:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Text Converters\Export\WordStar]
"Extensions"="ws"
"Name"="WordStar 3.3 - 7"
"Path"="C:\\Program Files\\Common Files\\Microsoft Shared\\TextConv\\Wrdstr32.cnv"

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Shared Tools\Text Converters\Import\WordStar]
"Extensions"="ws"
"Name"="WordStar 3.3 - 7"
"Path"="C:\\Program Files\\Common Files\\Microsoft Shared\\TextConv\\Wrdstr32.cnv"
The other possibility was that I had apparently found a program that required me to add certain files to the Windows 7 Program Files folder, including particularly one called Wrdstr32.cnv.  A search suggested that anyone hoping to download such files from a virus-free source had best be using something like WOT.  It had been a while since I had set up my system, and in any case I had not tested these options individually to determine whether they were useful or necessary.  For all I knew, Word was capable of reading .ws files without any of this.  The point is that, at least on my system, Word was now capable of doing so.

With all that in place, I was set to run a command that would hopefully process a lot of WordStar files without much intervention from me.  I started Notepad, created a blank file called Converter.bat, and put this line in it:
FOR /F "usebackq delims=" %%g IN (`dir /b "*.ws"`) DO "C:\Program Files (x86)\Microsoft Office\Office11\winword.exe" "%%g" /q /n /mFilePrintDefault /mFileExit && TASKKILL /f /im winword.exe
I saved Converter.bat and put it in the folder containing the .ws files.  I probably could have used Excel to mass-produce commands that would have done the conversion in-place, for .ws files scattered among multiple folders, but my approach to that sort of situation tended to involve bringing the files to be converted together into one folder anyway, and then putting their converted replacements back where the original files had come from.
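For anyone curious about that Excel approach, a formula along the following lines would be one way to generate a separate command for each file, assuming (hypothetically) that cell A2 held a full pathname such as D:\Folder\File.ws:
=CHAR(34)&"C:\Program Files (x86)\Microsoft Office\Office11\winword.exe"&CHAR(34)&" "&CHAR(34)&A2&CHAR(34)&" /q /n /mFilePrintDefault /mFileExit && TASKKILL /f /im winword.exe"
Filling that formula down the list and pasting the results into a .bat file would produce one print-and-kill command per file, without the FOR loop.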

I ran Converter.bat in the .ws folder.  It ran successfully; I had PDFs in my Bullzip output folder for each WS document in the input folder.  Mission accomplished.

Sunday, May 27, 2012

Batch Converting Multiple Word DOC Files to PDF in Scattered Folders

I had a large number of .doc files produced by Microsoft Word.  These files were in assorted folders.  I wanted to convert some or all of these files to PDF format.  This post describes the steps I took.

I had already tackled similar problems in several other posts.

This post does not detail all of the steps described in those other posts.  If a step described here is not clear, perhaps one of those posts expresses it more lucidly.

I started by getting a list of the DOC files to be converted.  For this, I opened a command window and typed "DIR *.doc /s /b /a-d > doclist.txt".  It was OK if this DOC list included files that I did not want to convert:  I could go through the list manually at this point, deleting those that I did not want to convert, or I could do that in the next step.  The next step was to copy and paste the list of files from doclist.txt into Microsoft Excel or some other spreadsheet.  This gave me a list of file and path names that looked like this:
D:\Folder3\Subfolder 8\Filename Z.doc
Since some paths and/or filenames contained spaces, I would tend to use quotation marks in commands relating to them, in both Excel and the command window.  In Excel, I used the REVERSE function and other spreadsheet functions to separate the path (e.g., "D:\Folder3\Subfolder 8\") from the filename (e.g., "Filename Z.doc").  So now I had separate columns showing the paths and the filenames for each entry in doclist.txt.  This would be a good point for using formulas to identify groups of DOC files that I did not presently wish to convert to PDF.
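For what it's worth, the path-and-filename split can also be done without REVERSE.  As a sketch (assuming the full pathname from doclist.txt is in A2), a formula like this finds the position of the last backslash:
=FIND(CHAR(1),SUBSTITUTE(A2,"\",CHAR(1),LEN(A2)-LEN(SUBSTITUTE(A2,"\",""))))
With that position in, say, B2, =LEFT(A2,B2) returns the path and =MID(A2,B2+1,LEN(A2)) returns the filename.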

The next step in the spreadsheet was to identify the filename without the extension, and to add PDF instead of DOC to that rump filename.  In other words, in this step I went from having Filename Z.doc to having Filename Z.pdf.  This gave me the essential ingredients for the batch commands that I would assemble on each line of the spreadsheet and would then paste into Notepad and save as a .bat file, so as to automate the conversion.
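As a sketch of that step, if the bare filename (e.g., Filename Z.doc) was in D2 -- the cell reference being hypothetical -- a formula like this would produce the PDF name:
=LEFT(D2,LEN(D2)-3)&"pdf"
That assumes a three-character extension like .doc; a longer extension would call for a slightly different formula.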

There were two ways to proceed at this point.  One was to leave the DOCs in place, in their home folders, and do the conversion and replacement right there.  I didn't like that approach.  It was too hard to be sure of what had happened in all those scattered folders.  The approach I preferred was to bring all those .DOC files together in one central folder, do the conversion, and then use the spreadsheet to construct batch files that would put those PDFs back where they belonged and, optionally, delete the DOCs from which they had come.

Bringing the DOC files to a central folder could be done very easily with a search program like Everything, searching for *.doc.  It could also be done with batch commands constructed in the spreadsheet.  An Excel formula producing a command of the latter nature would be something like ="move /-y "&CHAR(34)&[cell containing the full pathname, including the .doc extension]&CHAR(34)&" D:\CentralFolder".  It would be important not to take this step -- that is, not to move the files away from their home folders to the central folder -- until I already had a list of where the files came from originally.  Without that, I'd have a big collection of DOC files and no idea of where they belonged.  Note that files bearing identical names, coming from different folders into one, could require some advance manual renaming to avoid overwriting.  In that case, after renaming but before moving, it would probably be advisable to re-run DIR, so as to get the current filenames.

Once the files were all in a central location (in this case, D:\Conversion), it was time to work up the batch conversion process.  For this, first, I set the General and Options tabs in Bullzip (my free PDF printer) so that it would operate without asking questions or opening PDFs, and would save the PDFs to a designated folder (D:\Conversion\PDFs).  Then I saved this command into a batch file that I called Converter.bat:
FOR /F "usebackq delims=" %%g IN (`dir /b "*.doc"`) DO "C:\Program Files (x86)\Microsoft Office\Office11\winword.exe" "%%g" /q /n /mFilePrintDefault /mFileExit && TASKKILL /f /im winword.exe
I saved Converter.bat in the folder containing the DOC files (in this case, D:\Conversion) and ran it.  It worked away for a while, at the speed of one document every few seconds, until it had produced one PDF for each of my DOC files.  Several times during the process, Word or Bullzip stalled with error messages (e.g., "Word cannot start the converter Rftdca32.cnv").  This seemed to result primarily from corrupted Word docs.  There seemed to be little alternative but to delete those files except where I could find a backup.

Now I had a set of DOCs and a set of PDFs.  One easy way to make sure that I had a PDF for each DOC was to view the folders using a Windows Explorer alternative like FreeCommander.  In FreeCommander, I could combine the DOCs and PDFs together, sort by file type, select all DOCs, re-sort by file name, and look for instances in which alternating lines were not regularly highlighted.  (In Windows 7, Windows Explorer had lost the ability to retain highlighting after files were re-sorted.)  At this point or later, one could then just delete all DOCs that did have a corresponding PDF.  DoubleKiller Pro would provide a similar approach.  Another method, more suitable for large numbers of files, was to use the DIR and spreadsheet approach outlined above, writing formulas to check for identical filenames (not counting extensions).  Of course, there was no need to actually delete the DOCs if I wanted to keep both the PDF and the DOC.
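For that spreadsheet method, a formula along these lines (assuming, hypothetically, that the DOC filenames were in column A and the PDF filenames in column C) would flag any DOC lacking a matching PDF:
=COUNTIF(C:C,LEFT(A2,LEN(A2)-3)&"pdf")
A result of zero would mean that no PDF with the corresponding name had been produced.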

I postponed that deletion step until I could verify that I would not be needing any of the DOCs anymore.  I had previously worked on ways to check PDFs by converting them to JPGs and seeing which ones converted successfully.  In that previous effort, IrfanView (my preferred tool) had not behaved as expected, so I had grappled with other approaches.  This time, however, the quick IrfanView batch conversion went smoothly.  This gave me a JPG displaying the first page of each PDF.  My decision there was that, in the interests of speed (and to avoid having to go through every page of every PDF), I was content to look just at the first page.  There could still be errors on later pages of a PDF, but that would be rare.  If the first page came through OK, I could be fairly confident that most docs converted successfully.  So now, using IrfanView, I flipped through those JPGs quickly.

With these steps out of the way -- PDFs checked, superannuated DOCs deleted -- I went back to my Excel spreadsheet and worked up batch commands to move the new PDFs back to where the DOCs had been.  I had changed a couple of names along the way, so I had to move those manually, but the rest went automatically.  Project done!

Wednesday, May 23, 2012

A Million-Day Calendar with Explicit Julian-Gregorian Comparison

I wanted to look up a historical date.  Specifically, I wanted to know which day of the week it occurred on.  As I was looking for an answer to that question, I gradually came to the impression that there did not exist a standard calendar.  I decided to build one.  This post describes that process.

Someone may already have created what I was looking for.  But I wasn't finding it.  What I was looking for was, simply, the Official Calendar.  Of the United States, of the Catholic Church, it didn't matter -- just an official calendar that some reputable body had actually committed to print (preferably with explanations, and without errors).

What I was finding, instead, was lots of rules about how to calculate an official calendar, as well as various tools that would assist in those calculations.  This was fine, as far as it went.  But we don't generally tell people who prefer the Celsius temperature scale to just use the Fahrenheit and convert it.  Instead, people living in places that use Celsius have thermometers that show them the literal answer, without the need for a manual conversion process.  I wanted something like that for calendar dates.

Ultimately, I created a calendar covering a million days, starting on January 1, 500 BC. I produced that calendar as a spreadsheet, printed it as a PDF, and made both available for download. I don't often revisit this post. As of an update in early 2023, these materials are available for download through my SourceForge project page or at Box.com or MediaFire. See also my download blog post.

The PDF is a 38MB, 10,000-page document.  I would not recommend printing more than necessary.

Assumptions and Calculations Built into the Million-Day Calendar

I chose Excel 2010 to develop the calendar because that version of Excel could accommodate somewhat more than a million rows.  I did not use Excel's built-in date arithmetic, though, because of its known errors and limitations (notably, its date system does not extend to dates before 1900).  That is, I did not ask Excel to calculate the necessary dates automatically.  Instead, I calculated them in a semi-manual process.  The process was not entirely manual, because I did not calculate row by row, day by day, for each of the million days shown.  Instead, I developed formulas that would count forwards or backwards from a certain date, and I applied those formulas to the million rows, usually broken into several segments due to historical changes in calendar calculation.  There were some manual adjustments as well.

I found that a million days would cover approximately the period from January 1, 500 BC to the year 2238 AD.  This seemed like a good range for most purposes. For dates outside this range, there would still be the option of using a formula or calculator, or of adding another tab to extend the spreadsheet.
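A rough arithmetic check:  1,000,000 days ÷ 365.25 days per year ≈ 2,738 years, and counting forward roughly 2,738 years from January 1, 500 BC (with no year zero, as discussed below) does land in 2238 AD.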

As shown in the preceding paragraph, I was inclined to use AD and BC to refer to calendar eras.  AD was short for Anno Domini (Latin for "in the year of the [or "our"] Lord"). AD and BC (short for "Before Christ") were thus based on an early medieval calculation of the number of years before or after the birth of Jesus. This religious origin was an addition to other religious origins (e.g., "Thursday" deriving from "Thor's Day"). Instead of AD and BC, an apparent minority of non-Christians preferred to use CE (short for "Christian Era" or "Common Era" or "Current Era") and BCE.

Traditional chronology did not incorporate a year zero (i.e., 0 AD or 0 BC).  That is, the calendar went directly from 1 BC to 1 AD.  The original concept may have been that there was no need for a year zero, since Jesus was not born until the start of the first year of his life (incorrectly calculated as 1 AD).  This variation would make no practical difference in the AD era:  for example, the number 2012 represented the year in which this post was written.  It would lead to difficulties in the BC era, however.  For instance, the rule on leap years (involving division by 4) would produce a leap year in the year 4 AD and, before that, in the year 0; but since there was no year 0, the prior leap year was in 1 BC.  Hence, traditional BC dates did not fit exactly with the rule that leap years are evenly divisible by 4.

The calendar in effect at the time of Jesus was the Julian calendar, introduced by Julius Caesar in 46 BC -- a year which, by decree, was 445 days long.  The Julian calendar was revised several times, finally stabilizing in 4 AD.  For present purposes, the key innovation of the Julian calendar was the decision to define the year as equal to 365.25 days, adjusted via leap years in every year evenly divisible by 4 (e.g., 2008, 2012, 2016).  The Julian calendar eliminated the leap month Mercedonius but did not otherwise significantly change the names or lengths of months.  For purposes of year numbering, epochs (i.e., reference years) in the early centuries of the Julian calendar were commonly regnal, based on the current ruler or other officials (e.g., "January 1 in the second year of the reign of the Emperor Justinian"), but there was a semi-chaos of other epochs as well.  For instance, the Anno Mundi era started from calculations of the date on which the world was created, and the Ab urbe condita era started from the hypothesized date when Rome was founded.

The big change after the institution of the Julian calendar came in 1582 AD, when Pope Gregory XIII introduced the Gregorian calendar.  The Gregorian reform assumed the use of AD rather than regnal or other epoch systems; the AD epoch concept had been gradually spreading during the Middle Ages.  Gregory's principal contribution was to revise leap year calculations.  Over the centuries, the Julian calendar had become increasingly inaccurate with respect to the actual equinox.  That is, the calendar might say that it was March 21 -- the time for Easter -- and therefore daytime and nighttime should each be about 12 hours long; but in fact, the day of the actual equinox would already have arrived more than a week earlier.

In other words, the Julian calendar was falling behind the real world because the calendar was inserting too many leap years.  The extra leap days were making the Julian calendar late:  it would say the date was only March 11, when it really should have been March 21.  Gregory thus removed ten days from the calendar for October 1582, to catch up, and also changed the leap year calculation slightly.  The Gregorian rule for leap years was that every year evenly divisible by 4 would still be a leap year, except that years evenly divisible by 100 would not be leap years unless they were also evenly divisible by 400.  So 1700, 1800, and 1900 would not be leap years, but 1600 and 2000 would be.

This adjustment was still not perfect, but because of gradual slowing in the Earth's rotation, it was apparently pretty close.  The slowing issue, which I did not explore, may have been related to the difference between the tropical year and the sidereal year.  The Julian and Gregorian calendars were apparently based on the tropical year, which was the amount of time that it took the Sun (as seen from Earth) to come back to the same place as it was on the previous vernal (spring) equinox.  The sidereal year was an alternative to the tropical year:  it was the amount of time that it took Earth to return to the same relative position as it had occupied a year earlier, as measured with reference to certain stars.

These findings about the Julian and Gregorian calendars called for some decisions, for purposes of constructing a million-day calendar.  One such problem had to do with the present day.  My computer might tell me that it was May 6, 2012.  This would be a date in the Gregorian calendar.  Its appearance on my computer, my wristwatch, and everywhere else would testify to Gregory's widespread success.  I knew, however, that there was also a Chinese New Year and a Jewish calendar and all sorts of other calendars that still had meaning for various cultural and religious purposes, as well as the similarly named but essentially unrelated Julian Year system used in astronomy.  Even the Julian calendar continued to be used in Eastern Orthodox churches.  I decided that the intended spreadsheet approach to the million-day calendar might enable others to add these alternative calendars as they wished.  Because of the size of the spreadsheet and the relative rarity and potential complexity of these other calendars, however, I decided that I would not try to build any of these alternatives into the calendar myself, but would instead focus on the Julian and Gregorian calendars that predominated in the West during the timeframe addressed in the million-day calendar.

Another problem had to do with adoption dates.  The Gregorian adjustment of October 1582 specified that the Julian calendar would end on October 4, 1582; the days of October 5 through October 14 (inclusive) would not exist; and the Gregorian calendar would begin on October 15, 1582.  This rule was adopted at very divergent rates:  immediately, in several Roman Catholic countries, but elsewhere with considerable delays and confusion continuing into the 20th century.  The problem here, then, was that October 5, 1582 did not exist in Spain, and yet someone in England could be staring at a letter dated October 5, 1582, and that would make perfect sense according to the Julian calendar, which would continue to be used in England until 1752 (at which point England would need to delete eleven days, not ten, to get in sync with the Gregorian reform).  During the transition period in England, people commonly used the terms "Old Style" (abbreviated as "O.S." in English, and as "st.v." in Latin) to refer to the Julian date, and "New Style" ("N.S." or "st.n.") to refer to the Gregorian date.

As just described, the Gregorian calendar officially began (and was officially implemented in some places) on October 15, 1582; the Julian calendar officially ended on the preceding day, which (according to the Julian) was October 4, 1582.  But one could also say that October 4, 1582 (Julian) was the same as October 14, 1582 (Gregorian).  This way of looking at the matter would require proleptic (i.e., anachronistic) calculations.  Specifically, there would be a proleptic Gregorian calendar for all dates before October 15, 1582 on the Gregorian calendar, and there would also be a proleptic Julian calendar for all dates before January 1, 4 AD on the Julian calendar.  October 13, 1582 (Gregorian) would be the same as October 3, 1582 (Julian); October 12 (G) would be the same as October 2 (J); and so forth, back in time.

Since the Gregorian calendar did not exist before 1582, the statement that the Battle of Hastings occurred on October 14, 1066 would imply that it was October 14 according to the Julian calendar, not the Gregorian.  While it could be confusing to cite proleptic Gregorian dates for events that were made part of history according to the Julian calendar, there seemed to be some applications for which a proleptic Gregorian calendar could be useful.  For example, someone might be interested in determining whether a certain event happened on the actual equinox, as distinct from the date represented as the equinox in the Julian calendar.  In developing the million-day calendar, I thought it would thus be useful to display Julian and Gregorian dates side-by-side, so as to confirm the accuracy of the calendar and/or of others' conversions between the two, as described more fully below.

To a much greater degree than the proleptic Gregorian calendar, it seemed that the proleptic Julian calendar could be useful for a variety of historical situations.  The concept here was, in essence, that one could work backwards to construct a Julian calendar for dates long before Julius Caesar, and could use that calendar to construct a list of standard dates when various historical events occurred.  Although sources rarely seemed to specify what calendar they were using, it appeared that the proleptic Julian calendar was in fact being used widely for this purpose.  There would certainly be scholarly disputes as to the conversion of ancient chronologies to Julian calendar terms (so as to interpret, for instance, a statement that a certain event occurred in the 245th year since the founding of Rome), but at least the calendar system itself would be consistent over centuries.

Developing and Testing the Million-Day Calendar

I added proleptic Julian calendar calculations to the million-day calendar. I started these calculations by adding a separate Julian Days table to the spreadsheet. The concept of the Julian Day was proposed by Joseph Scaliger in 1583. Julian Days were simply a count of days, beginning (for astronomical and historical reasons) with Day Zero at 12:00 noon on January 1, 4713 BC. (Julian Days could include decimal values for fractions of a day, such as 0.083 = 2 PM.) So, for instance, Julian Day 7 arrived at noon on January 8, 4713 BC.

There were no years in the Julian Day system, but Julian Days could be used to calculate the proleptic Julian calendar, in which every fourth year would be treated as a leap year.  Because there was no Year Zero in the Julian calendar, Scaliger's first year of 4713 BC was a leap year.  (That is, in a system that had a Year Zero between 1 BC and 1 AD, 4713 BC would have been called 4712 BC.)  The resulting calculations produced Julian dates, in the spreadsheet, that were consistent with those reached by John Herschel in his Outlines of Astronomy (1849, p. 595).  Specifically, January 1, 4004 BC was Julian Day 258,963; the destruction of Solomon's Temple (which Herschel put on May 1, 1015 BC) was on Julian Day 1,350,815; and Rome's founding (which Herschel put at April 22, 753 BC) was on Julian Day 1,446,502.  Moving into the million-day period beginning on January 1, 500 BC (Julian Day 1,538,799), the spreadsheet matched Herschel's calculation that the Julian calendar reformation of January 1, 45 BC occurred on Julian Day 1,704,987; the Islamic Hijra calendar began on Julian Day 1,948,439 (July 15, 622 AD); and the official last day of the Julian calendar (October 4, 1582) was Julian Day 2,299,160.  It tentatively seemed that the spreadsheet's Julian calendar portion was accurate.

I also added Day of Week calculations to the spreadsheet, beginning with the common assertion that January 1, 4713 BC was a Monday (in, implicitly, the proleptic Julian calendar).  For the dates cited in the preceding paragraph, these calculations indicated that January 1, 4004 BC was a Saturday; May 1, 1015 BC was a Friday; April 22, 753 BC was a Tuesday; January 1, 500 BC was a Thursday; January 1, 45 BC was a Friday; July 15, 622 AD was a Thursday; and October 4, 1582 was a Thursday.  Further, I extended the Julian calendar beyond its official end to Thursday, November 7, 2238 AD (Julian Day 2,538,798).  According to the spreadsheet (and also the Julian Day arithmetic, i.e., Julian Day 2,538,798 minus Julian Day 1,538,799), that was the millionth day (inclusive) from Thursday, January 1, 500 BC.  These particular Julian Day numbers and day-of-the-week calculations matched the values produced by an online date calculator appearing on a NASA webpage.  It tentatively seemed that the spreadsheet's Julian Day calculations were corresponding accurately with Julian calendar dates.

Next, I produced a proleptic Gregorian calendar in the million-day calendar, adjacent to the Julian calculations.  The starting point for this calendar's calculations was its commonly recognized starting date of Friday, October 15, 1582.  As noted above, the preceding day of October 14 on the Gregorian calendar (G) (if such a date had officially existed on that calendar) would have been Thursday, October 4 on the Julian calendar (J).  So the spreadsheet's presentation of Julian and proleptic Gregorian dates had to match up on the row containing the values of October 14 (G) and October 4 (J).  That is, both had to have the same Julian Day value of 2,299,160.  From October 14, 1582, I extended the Gregorian calendar back to January 1, 500 BC.  I decided not to extend this proleptic Gregorian calendar back into the period before 500 BC, though there were situations in which such an extension might have been useful.

There were some interesting things in the relationship between the proleptic Gregorian calendar and the Julian calendar.  At the starting point in the 16th century AD, the Gregorian dates were later than the Julian.  As just noted, October 14, 1582 (G) was equivalent to October 4, 1582 (J).  The Gregorian allowed fewer leap years, so the difference between it and the Julian began to narrow with each additional century (except for those evenly divisible by 400), going back in time.  The ten-day difference of 1582 thus became a nine-day difference on the first previous day when the formulas for the two calendars differed:  there was no February 29, 1500 (G), but there was a February 29, 1500 (J).  By the time one arrived back at the third century AD, the difference between the two calendars vanished.  That is, as noted by Peter Meyer, the two calendars had exactly the same dates from March 1, 200 AD to February 28, 300 AD.  This was no coincidence.  Gregory had designed his reform so that Easter would occur at about the same time as it had occurred in 325 AD, when the Council of Nicea (also spelled Nicaea) discussed such matters.  So during the century ending on February 28, 300 AD (J), both calendars showed the same dates (e.g., February 1, 300 (J) = February 1, 300 (G), and both are Julian Day 1,830,664).  Before the third century, the Gregorian calendar predated the Julian by progressively larger amounts, until January 1, 500 BC (J) would be represented as December 27, 501 BC (G).  Going back still farther, dates on the Julian calendar would continue to fall three days later every 400 years, so that January 1, 4713 BC (J) would arrive a month earlier on the Gregorian, in late November 4714 BC.  On the other extreme, in the centuries following 1582 AD, the Gregorian dates became progressively later than those of the extended Julian, until November 7, 2238 AD (J) was equivalent to November 22, 2238 (G).

I checked the foregoing dates and days of the week using another online calculator as well, produced by Fourmilab Switzerland.  I began by entering Julian Day numbers and then seeing what results this calculator would produce for Julian and Gregorian calendar dates.  This calculator took the approach of inserting a Year Zero in the proleptic Gregorian calendar, so its statement of BC dates differed from the values shown in the spreadsheet by one year.  For example, the Fourmilab calculator indicated that January 1, 45 BC (J) was equal to December 30, 45 BC (G), whereas the spreadsheet would put the latter as December 30, 46 BC (G).  Fourmilab's approach seemed incorrect in this regard.  For mathematical purposes (as in e.g., the ISO 8601 approach, below), there would need to be a Year Zero; but the historical reality seemed to be that proleptic calculations in both Julian and Gregorian calendars did not have a year zero.  Fourmilab was not alone here; the conflation of mathematical consistency with historical fact had evidently produced some confusion in other computing situations as well.  At any rate, after adjusting for that divergence in BC years, the results of the Fourmilab calculator did match up with those yielded by the spreadsheet and the NASA calculator.  This calculator and the spreadsheet also agreed that February 1, 200 AD (G) was Julian Day 1,794,140 and was also February 2, 200 AD (J).  (The NASA calculator did not do proleptic Gregorian calculations.)

I looked at one other online calculator, produced by CSGNetwork.  I did not attempt a redundant comparison against all of the dates listed above.  Instead, I focused on the especially problematic period of the first two centuries AD.  In that timeframe, the CSGNetwork calculator seemed to be in error.  Specifically, a "Calendar Date Entry" of January 1, 1 AD yielded Julian Day 1,721,425.5.  The NASA and Fourmilab calculators and the spreadsheet agreed that January 1, 1 AD (J) should rather be Julian Day 1,721,423.5 or 1,721,424.  So if "Calendar Date Entry" in the CSGNetwork calendar was intended to refer to a Julian calendar date, its Julian Day output was incorrect.  It did not appear that the calendar intended to refer, rather, to a Gregorian calendar date of January 1, 1 AD, because it then stated that its Julian Day value of 1,721,425.5 was equivalent to January 3, 1 AD (G).  In that latter regard, it was correct.

To some unknown extent, online calculators presumably used formulas that had been devised to facilitate date calculations.  For example, Bill Jefferys presented a formula for converting Julian Days (and, perhaps, dates on the Julian calendar) to the proleptic Gregorian calendar, but indicated that it would be inaccurate before 1582, and especially for years before 400 AD.  Paul Dohrman offered a procedure for converting Julian to Gregorian, and J.R. Kambak offered one for conversions from Gregorian calendar dates to Julian Days.  Dohrman's approach, as I understood it, required these steps:
  1. Truncate to centuries (e.g., 622 AD becomes 6).  In the case of BC dates, treat them as negatives and start by subtracting a year first (e.g., 499 BC becomes -500, which becomes -5).  This calculation produces X.
  2. Calculate 0.75X minus 1.25.  So 622 AD » 6 » 3.25 (using » as shorthand for "becomes"), and 499 BC » -5 » -5.
  3. Truncate decimal points.  So 622 AD » 6 » 3.25 » 3.  This is the number of days to add to the Julian date to find the Gregorian.
This procedure produced some results consistent with the spreadsheet and the Fourmilab calculator, converting July 15, 622 AD (J) to July 18, 622 AD (G), and January 1, 500 BC (J) to December 27, 501 BC (G) (after Year Zero adjustment).  This procedure did not seem to work in the first two centuries AD, however.  For example, in the case of July 1, 1 AD (J), Dohrman's approach seemed to yield the incorrect value of June 30 (i.e., century 0 * 0.75 – 1.25) rather than June 29 (G).
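For illustration, the core of Dohrman's procedure reduces to a single spreadsheet formula.  As a sketch (assuming that A2 holds the year as a positive number for AD years or, for BC years, as a negative number after the one-year subtraction in step 1):
=TRUNC(0.75*TRUNC(A2/100)-1.25)
This returns 3 for 622 and -5 for -500, matching the examples above, though whether TRUNC (which rounds toward zero) is the right reading of "truncate" for every negative value is an open question.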

There also seemed to be a problem with Kambak's long formula for converting Gregorian dates to Julian Days.  It is possible that I did not copy or interpret that formula correctly.  The version that I tested was as follows, where Y = Gregorian year, M = Gregorian month, D = Gregorian day, and JD = Julian Day:
JD = 367Y – 7(Y+(M+9)/12)/4 – 3((Y+(M–9)/7)/100+1)/4 + 275M/9 + D + 1721029
As I translated this into an Excel formula (placed into cell D2), it read as follows (assuming the values of Y, M, and D were entered into cells A2 through C2, respectively):
=367*A2-7*(A2+(B2+9)/12)/4-3*((A2+(B2-9)/7)/100+1)/4+275*(B2/9)+C2+1721029
That formula's results varied from those produced by the Fourmilab calculator for certain dates checked above, such as July 1, 1 AD (G) and October 14, 1582 (G).  The variance in these instances was very small, however.  Specifically, the values for those two dates produced by the formula and the Fourmilab calculator were 1,721,606 vs. 1,721,606.5, respectively (for July 1, 1 AD (G)) and 2,299,159 vs. 2,299,159.5, respectively (for October 14, 1582 AD (G)).  That is, the Fourmilab calculator exceeded the formula's output by only 0.5 day in each case.  Unfortunately, this variation was not consistent.  For July 15, 622 AD (G), the Fourmilab calculator produced a value of 1,948,435.5, which was 0.5 day smaller than the Julian Day value of 1,948,436 produced by the formula.  Moreover, for November 22, 2238 (G), the Fourmilab calculator's output of 2,538,797.5 was 1.5 days larger than the figure of 2,538,796 produced by the formula.  In each of these several instances, the spreadsheet agreed, again, with the results produced by the Fourmilab calculator, after rounding the latter's 0.5-day output upward.  It appeared, in short, that this formula was very close but not entirely accurate.

By this point, checking of the spreadsheet had begun to transition into critiques of the ways in which various calculators and other tools had interpreted and applied various sources (e.g., Tantzen, 1960). I took this as a preliminary indication of the potential usefulness of the million-day spreadsheet, at least where an explicit presentation of dates might facilitate visualization of calendar developments.  While further usage and testing would be helpful in identifying points at which errors might have crept into the spreadsheet, it did preliminarily appear that the spreadsheet could provide a useful tool for date calculations and conversions.

The ISO 8601 Refinement

I developed the Gregorian section of the spreadsheet in one additional way. The International Organization for Standardization (ISO) had produced a standard prescription (known as ISO 8601) for calculating dates.  This prescription appeared likely to be useful for a variety of purposes, so the spreadsheet contains a column devoted to it.

The ISO 8601 standard adopted the Gregorian calendar as the basis for its date numbering. One effect of the standard, for present purposes, was to prescribe standard ways of representing dates. There was a YYYY-DDD ordinal date option, which used the day of the year, where day 366 would have a value only in leap years (e.g., 2012-366 = December 31, 2012).  In the spreadsheet, I used the year-month-day format (e.g., 2012-05-06 = May 6, 2012). ISO year values were ordinarily displayed with four characters (e.g., padded with leading zeros in 0023 rather than 23) for consistency.

A second effect of ISO 8601 stemmed from its adoption of a Year Zero, with apparently the same effect as what was sometimes called astronomical year numbering.  In this approach, before the epoch of 1 AD, the absolute value of the ISO year was one less than the traditional year (e.g., ISO year 0000 = 1 BC; ISO year –0001 = 2 BC). So the million-day calendar started on ISO date -0500-12-27 (i.e., December 27, 501 BC (G)). The numerical approach of ISO 8601, using minus signs instead of "BC" and likewise dispensing with "AD," had the advantage of avoiding controversy regarding the use of those two traditional modifiers.  The Fourmilab calculator (above) appeared to be implementing an ISO 8601 approach in its calculation of BC dates.
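As a sketch of that conversion (the cell references being hypothetical), if B2 held the traditional year number and C2 held "BC" or "AD", a formula like
=IF(C2="BC",1-B2,B2)
would yield the ISO/astronomical year (e.g., 501 BC becomes -500, and 1 BC becomes 0), and something like
=IF(D2<0,"-","")&TEXT(ABS(D2),"0000")
would display that value (assumed here to be in D2) with the four-character padding just described.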

With the Gregorian calendar presented in ISO format, it would have been possible to apply another kind of check to the spreadsheet's day-of-the-week column.  This check would have used what was known as the Doomsday technique.  That technique, useful for quickly calculating the day of the week for a given date, seemed unnecessarily complicated within the million-day calendar spreadsheet, where one could simply use the Julian Day.  That is, since Julian Day 0 occurred at noon on Monday, January 1, 4713 BC, every Julian Day evenly divisible by 7 would be a Monday.  This way of calculating the day of the week, for a given date on the Gregorian calendar, seemed to produce the same results as I had calculated by using a formula that copied, into each day-of-week cell, the name of the day that appeared in the 7th preceding row.
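In spreadsheet terms, that Julian Day check can be written as a single formula.  As a sketch, assuming the Julian Day number for a given row is in A2:
=CHOOSE(MOD(A2,7)+1,"Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
For example, Julian Day 2,299,160 gives a remainder of 3 and thus Thursday, matching the day of the week shown above for October 4, 1582.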

Official and Local Calendars

As previously noted, Gregory intended that the last day of the Julian calendar (October 4, 1582) would be followed by the first day of the Gregorian calendar (October 15, 1582).  That intention was followed in a number of countries and, at this writing, was implemented in various online calculators (e.g., those appearing on U.S. Naval Observatory and NASA webpages).  It appeared that 1582 was the most plausible candidate for the year in which the world converted from the Julian to Gregorian calendars.  In short, this combination of proleptic Julian (to 4 AD), Julian (from 4 AD to 1582 AD), and Gregorian (since 1582 AD) appeared to form the most credible version of the world's official calendar.  The spreadsheet thus expresses what appears to be the Official Calendar that I had sought at the outset.

Some remarks appearing in preceding paragraphs have already acknowledged certain aspects of that de facto official calendar.  For one thing, the concept of the Julian Day was built from a starting date calculated according to Julian reckoning, but came to serve as a means of cross-reference between the Julian and the later Gregorian calendars.  So the spreadsheet column that presents the Julian Day number corresponding to a particular day on the Julian or Gregorian calendar does not belong solely within either the Julian or Gregorian sections of the calendar.  Rather, it seemed to be best presented in the spreadsheet's Official Calendar section.

Likewise, a given date would be a Monday, or a Tuesday, or some other day of the week, regardless of the date number given to it on the Julian or Gregorian calendars.  So it would have been redundant to present separate day-of-week columns in each of those calendars' parts of the spreadsheet.  Instead, the day of the week appears just once, in the Official Calendar section.

That section also presents the official date, in two different formats.  First is the traditional format, using BC or AD indicators of era.  These traditional dates are provided in the somewhat condensed but still recognizable YYYY-MM-DD form.  As such, their components (e.g., the number of the month) are accessible for further date calculations, as users may desire, with the aid of Excel text functions (e.g., MID, FIND).  The column presenting the Official Date in Traditional Format is thus the specific statement of the Official Calendar in approximately the form that now appears to be used by most people.

Second, the spreadsheet also presents the official date in ISO format -- specifically, with minus signs and a Year Zero, modifying the traditional presentation.  To emphasize, this is the official date.  It uses the Julian calendar for dates before October 15, 1582, and therefore is not the ISO 8601 date.  It is simply an indication of how the traditional, official date looks when stated in ISO style for purposes of numeric calculations.

As noted above, substantial portions of the world did not adopt the Gregorian reforms in 1582.  The spreadsheet is adaptable for purposes of developing localized versions that may accommodate reforms implemented in later years.  In the process of preparing this post, I also found a useful calendar with local customizations at TimeAndDate.com, though a brief look suggested the presence of inaccuracies like those identified in other calculators (above).

Uses of the Million-Day Calendar

This post has explained the creation of a million-day calendar covering the period from 500 BC to 2238 AD.  That calendar is provided in spreadsheet format, one row per day.

This spreadsheet format seems to have facilitated identification of potential errors in certain tools designed to assist in use of, and interactions between, the Julian and Gregorian calendars as well as the Julian Day and ISO 8601 date systems.  It may prove useful in other contexts calling for calculations, demonstrations, or cross-comparisons among calendars and systems, including some that users may add.

The spreadsheet presentation may also be useful in less technical, more data-oriented applications.  Within the limits of computing power and spreadsheet capacity, there may be tasks that call for an ability to add columns of information, to be filled at a rate of one item per day (or week, or other time period).  For instance, at this writing, I would like to find a database (if one exists) that would show something like the leading headline of the day -- the sort of thing that one might expect to find on the front page of the New York Times, for instance, if that newspaper had existed on the day of the Battle of Hastings.  If no such database exists, perhaps this spreadsheet, shared among a number of potential contributors, could help to bring about its existence.

Sunday, March 18, 2012

Batch Converting DOC to PDF with 7-PDF Maker

I had some Microsoft Word .doc files.  I wanted to convert them to PDF.  I wanted to be able to do this from the command line, so as to reach into different folders and process large numbers of them at once.

I went into Softpedia and did a search.  It came up with numerous free programs for this purpose.  I chose 7-PDF Maker.  It had a pretty good rating, as Softpedia programs go (4.0 stars; 16,001 downloads), and it did offer a command-line option.  I also downloaded its manual.  (It had a real manual!)

Once 7-PDF Maker was installed, I searched for its command-line executable, 7p.exe.  I put a copy of it into D:\Workspace (i.e., the folder where I was working).  That way, my commands that referred to 7p.exe would know where to find it.  There were other ways, but this was simplest, and 7p.exe was not a filename that would get confused with the ones I wanted to convert.

I opened a command window in D:\Workspace and typed "7p /?" to see what the command line options were.  Basically, it seemed, I could save the DOC as a PDF with a command as simple as "7p D:\Workspace\File.doc."  The /? instructions seemed to be saying that I had to specify an absolute path for the source file (i.e., not just "File.doc" without the drive and folder information).  I was not sure whether that was necessary with a copy of 7p.exe in the working folder.  There was also an option to save the resulting PDF to a different folder (e.g., "7p File.doc D:\Workspace\Output").  In addition, I could use wildcards.  7p.exe D:\Folder\*.doc would convert all doc files in Folder to PDF.  The same command with *.* would convert all supported files to PDF.  There were many supported filetypes (manual p. 18), including Word, WordPerfect, OpenOffice, Excel, PowerPoint, and various image formats (e.g., BMP, TIF, JPG, PNG).

There were also options for overwriting and recursion (i.e., working down through subdirectories).  In both cases, the default was false (i.e., don't recurse, don't overwrite).  The default was all I needed, so I did not investigate the exact syntax.  But it appeared that one instance of the word "true" on the command line would be construed as an instruction to recurse.

I gave it a test run with x.doc.  The command I used was simply "7p x.doc."  That gave me an error, so I tried "7p D:\Workspace\x.doc."  That gave me a different error:  "Variante referenziert kein Automatisierungsobjekt."  One translation was, "Variant does not reference an automation object."  Did this mean that x.doc was not a convertible DOC file?  Or that I should have been running this in the 7-PDF installation folder on drive C?  I tried the latter with an absolute path (i.e., not just "7p x.doc").  Same "Variante referenziert" error.

I tried opening x.doc in Word.  Oh.  Now I understood.  It was called a DOC file, but it was actually just a text file with a DOC extension.  But the manual said that text files were supported.  Maybe the .doc extension was confusing 7p?  I changed it to x.txt and tried the original approach of running the command in D:\Workspace rather than in the installation folder on drive C.  Specifically, I tried just "7p x.txt."  It said, "URL seems to be an unsupported one."  Maybe it was the wrong kind of text file.  Whatever; I used a text to PDF converter for them instead.

I did not proceed further with 7-PDF because, at this point, I found an alternative I liked better.  Not to say that 7-PDF was a bad program; it just was not working really well for me at this point.

Tuesday, January 3, 2012

Converting Scattered WMA Files to MP3

I had .WMA files scattered around my hard drive.  I wanted to convert them to .MP3.  I could search for *.WMA, using a file finder or search tool like Everything, thereby seeing that those files were in many different folders.  Having already sorted them, I didn't want to move them anywhere for the format conversion.  I wanted to convert them right where they were.  A command-line tool would do this.  The general form of the command would be like this:  PROGRAM SOURCE TARGET OPTIONS.  For PROGRAM, I would enter the name of the command-line conversion program that I was going to be using.  For SOURCE and TARGET, I would enter the full pathname (i.e., the name of the folder plus the name of the file, like "D:\Folder\File to Convert.wma," where the target would end in mp3 rather than wma).  OPTIONS would be specified by the conversion program.  For instance, there might be an option allowing me to indicate that I wanted the resulting MP3 file to be 64 kbps.

The problem was, I didn't have a command-line WMA to MP3 conversion tool.  I ran a search and wound up trying the free Boxoft WMA to MP3 Converter.  (They also had lots of other free and paid conversion and file manipulation programs.)  When I ran their converter, it steered me to an instruction file that inspired me to compose the following command (all on one line):

AlltoMp3Cmd "D:\Folder\Filename.wma" "D:\Folder\Filename.mp3" -B56
I had to use quotation marks around the source and target names in some cases (though they were not strictly necessary in this particular example) because some of the path or file names contained spaces.  The -B56 option was supposed to tell it to produce a 56 kbps MP3.  (I also tried it with a space:  "-B 56".)  I was able to produce similar commands en masse, for all of the WMAs that I wanted to convert, by exporting the results of the *.WMA search from Everything to a text file called wmalist.txt, making sure to remove entries for files that I did not want to convert.  (At the root of each drive containing files of interest, I could also have used this command, assuming wmalist.txt did not already exist:  dir *.wma /b /s >> D:\wmalist.txt.)  I then massaged the contents of wmalist.txt using Microsoft Excel.  So now I had all of these AlltoMp3Cmd commands ready to run.  I copied them all into a Notepad file named Renamer.bat.  All I had to do was double-click on it in Windows Explorer and it would run.
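As a sketch of that massaging step, a formula along these lines (assuming, hypothetically, that A2 held a full pathname ending in .wma) would generate one such command per line of wmalist.txt:
="AlltoMp3Cmd "&CHAR(34)&A2&CHAR(34)&" "&CHAR(34)&LEFT(A2,LEN(A2)-3)&"mp3"&CHAR(34)&" -B56"
Filling that down the list and pasting the results into Notepad would yield one command per file, in the form shown above.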

I decided to try Renamer.bat with just one WMA file.  So I created another file, X.bat, with just one line in it, like the line shown above.  To run X.bat from the command line, so that I could see what it was doing, I would need a command window that was ready to execute commands in the folder where X.bat was located.  Problem:  X.bat was not in the same folder as Boxoft's AlltoMp3Cmd.exe executable program, so X.bat would fail.  If I didn't want to get into changing the computer's PATH, I could either put X.bat in the Boxoft program folder or I could copy AlltoMp3Cmd.exe to the folder where X.bat was located.

Either way, I needed to open a command window in one of those two folders, so as to run X.bat.  I could start from scratch (Start > Run > cmd) and use commands (e.g., "D:" would take me to drive D and "cd \Folder" would take me to the folder where Filename.wma was located), or I could use Ultimate Windows Tweaker to install a right-click option to open a command window in any folder.  I had already done the latter, so this step was easy.

Once I had sorted out all that, I was ready to try running X.bat.  But when I did, it crashed the AlltoMp3Cmd.exe program.  If I clicked on Cancel when I got the crash dialog, the command window said this:
Exception Exception in module AlltoMp3Cmd.exe at 0005B4E1.
Installation file incorrect. Please re-install it!.
But reinstalling the Boxoft program didn't help.  I sent them a note to let them know of this problem and decided to try other approaches.  One possibility was that their program was suitable for Windows XP but not Windows 7, which I was using.  It didn't seem to be a question of how the main program was installed, since the error message was referring specifically to the AlltoMp3Cmd.exe command-line executable (which presumably would be the same on any Windows system).

I decided to try running it in a Windows XP virtual machine (VM).  I had already installed Microsoft's Windows Virtual PC, which came with a WinXP VM, so I fired it up to try the same command line in the same folder.  To move quickly to the proper folder in the WinXP command window, I ran my trusty old RegTweak2.reg file, created in Notepad, to install a right-click option to open a command window in any folder in Windows Explorer.  But when I tried to use it, I got an error:
'\\tsclient\D\Folder Name\Subfolder Name'
CMD.EXE was started with the above path as the current directory.
UNC paths are not supported.  Defaulting to Windows directory.
'\\tsclient\D\Folder Name\Subfolder Name'
CMD does not support UNC paths as current directories.
A bit more playing around persuaded me that what this message meant was that command-line work in the VM would have to be done on what the VM considered a "real" (actually a virtual) drive -- in other words, drive C.  So I put copies of X.bat and AlltoMp3Cmd.exe into the VM's drive C, in a new folder I called Workspace, and I tried running X.bat from the command line there.  But again I got an error:  "AlltoMp3Cmd.exe has encountered a problem and needs to close."  Maybe the program wasn't built to handle paths.  For whatever reason, it looked like the Boxoft AlltoMp3Cmd command-line utility was not going to work for me.

A search in CNET brought up some other possibilities.  One was IrfanView, reminding me that I had used that program to work partway through a somewhat similar problem months earlier.  Using IrfanView version 4.28 and various insights described more fully in that other writeup (and in a recent thread), I went back to my original list of files in wmalist.txt and prepared this command:
i_view32.exe /filelist=D:\wmalist.txt /convert=$D$N.mp3
This command was supposed to use the file names ($N) and directories (i.e., folders, $D) specified in wmalist.txt to produce MP3 files with those same names, in those same directories.  Before trying it out, I made a copy of wmalist.txt and changed the original so that it contained only two lines, referring to WMA files on two different drives.  I ran the command shown above in a CMD window.  I got an error:
'i_view32.exe' is not recognized as an internal or external command, operable program or batch file.
In other words, Windows 7 did not know where to look to find IrfanView.  I could have taken the steps mentioned above, moving the .txt file to wherever i_view32.exe was located; but since I used IrfanView often, I wanted to add it to the PATH variable so that Windows would permanently recognize it.  The solution was to go to Start > Run > SystemPropertiesAdvanced.exe (also available through Control Panel > System > Advanced System Settings) and then click on Environment Variables > System Variables > highlight Path > Edit.  To see clearly what I was doing, I cut the existing Variable Value out of the dialog and doctored it in Notepad.  The basic idea was to add, to the end of the existing value, a semicolon and then (without adding a space after the semicolon) paste the location of i_view32.exe (found easily enough via an Everything search > right-click > Copy path to clipboard).  I made sure to add a final backslash ("\") after the path to i_view32.exe.  I pasted that back into the dialog, OKed my way out of System Properties, went back into the command window, pressed the Up arrow key to repeat the command ... and it still didn't work.  I thought that possibly I would have to reboot to have the new PATH definition take effect.  That was the answer to that particular problem.

After rebooting, in a command window, I ran the command shown above, and there were no errors.  IrfanView was open, but nothing was in it.  I ran searches in Everything for the two files in my test WMAlist.txt file, with wildcard extensions (i.e., I searched for Filename.*).  No joy:  there were no MP3 versions of those files.  I tried a modified version of the command:
i_view32.exe /filelist=D:\wmalist.txt /convert=D:\*.mp3
but that produced no output in D.  The IrfanView command was not working.  I tried yet another variation, as above but without the "D:\", but that wasn't it either.  I tried the original command without using the filelist option:
i_view32.exe "D:\File Path\File Name.wma" /convert=$D$N.mp3
This produced an error:
Error!  Can't load 'D:\File Path\File Name.wma'
Did that mean that the /convert option was not being recognized?  Everything indicated that no MP3 file had been created.  And why would IrfanView be unable to load the existing WMA file?  It could load it easily enough from Windows Explorer or Everything.  I tried again:
i_view32.exe "D:\File Path\File Name.wma"
That worked:  IrfanView played the file.  So the convert option was the problem.  Another variation:
i_view32.exe "D:\File Path\File Name.wma" /convert="File Name.mp3"
If that did work, I wasn't sure where the output file would turn up.  No worries there:  it didn't work.  I got the "Can't Load" error again.  IrfanView's help file said that it did support wildcards for /convert, so that was presumably not the problem.  I had seen an indication that IrfanView would not batch-convert certain kinds of files, but WMA was not on the list I saw.  I was going to post a question in the relevant IrfanView forum, but at this point they weren't letting me in, for some reason.  Eventually it occurred to me to look in IrfanView's File > Batch Conversion/Rename area, where it appeared that the program would support only image conversions, not audio.

It seemed I would need to continue searching for a command-line option.  Back at that CNET search, I looked at the Koyota Free Mp3 WMA Converter -- from another company that offered multiple free conversion products -- but saw no indications that it had command-line options.  Likewise for Power MP3 WMA Converter and others.

I finally opted for a kludge solution.  Using an Excel spreadsheet, I created a batch file (again, using techniques described in the other post referenced above and elsewhere) to rename each file in WMAlist.txt to a unique name (example:  ZZZ_00001.wma) -- after making sure I did not already have any files with that kind of name.  The unique names would help to ensure that all WMA files would get the treatment, even if two of them had the same original name.  This produced 386 files.  Then, using Everything, I selected and moved all ZZZ_*.wma files to D:\Workspace.  Somehow, only 375 files made it to that folder.  It turned out that I had inadvertently included WMA program files from drive C after all, which I had not wanted to do, and for some reason a few of those were not moved to D:\Workspace -- probably due to insufficient rights.  So now I would have to undo that damage.
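As a sketch of that renaming step (assuming, hypothetically, that A2 held a full pathname from WMAlist.txt and that the list started in row 2), a formula like this would generate one REN command per file:
="ren "&CHAR(34)&A2&CHAR(34)&" ZZZ_"&TEXT(ROW()-1,"00000")&".wma"
REN accepts a quoted full pathname as its first argument and a bare new filename as its second, so each renamed file stays in its original folder until moved.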

After taking care of that, in D:\Workspace, I tried the Boxoft program again, this time using its Batch Convert mode.  It took a while.  Spot checks suggested that the conversion quality was good.  I wasn't sure what bitrate to use to convert the files.  It seemed that, at 56 kbps for what appeared to be a bunch of voice (not music) files, I erred on the high side.  I started with 353 WMA files occupying a total of 237MB, and I ended up with 353 MP3 files occupying 405MB.  Those files were converted at what appeared, at a quick glance, to be a rate of about 6MB per minute.  I then revised the spreadsheet to produce batch file command lines that would move those MP3s back to the folders where the similarly named WMA files had been and rename them back to their original names (but with MP3 extensions).

Sunday, December 18, 2011

Converting Email (EML) Files to PDF - Another (Partial) Try

Once before, I had converted individual email messages in EML format (from Thunderbird) to PDF format.  That had been a long and complicated process that I'd had to revisit a few weeks later.  It was now time to export some more emails from T-bird.  I decided to look for a simpler approach.

In the first stage of this inquiry, I was working with some EML files that had already been exported.  So this post starts halfway through the process.  The ordinary starting point, exporting emails from T-bird, appears later in this post.

Conversion Approaches

A thread gave me some ideas to play with.  It seemed that an EML, renamed to an HTM, could be opened in Internet browsers (e.g., Firefox) and also in editors (e.g.,  Microsoft Word, Wordpad).  Unfortunately, an email's header, containing the most recent sender, recipient, date, and subject information, was not inclined to print properly.  Some of the programs reviewed in the previous post would print the header with funky colors and other formatting stuff I didn't want.  Some would also produce tiny print.

I was able to produce better-looking output by changing the EML to an HTM extension and opening it in Notepad.  At the end of each line in the header, and twice more after the header, I inserted <br /> and then saved it.  Now it would open in Firefox with a halfway decent appearance.  There were also some lines in the header that I wanted to remove, technical lines other than the customary To, From, Date, and Subject lines.
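
With hypothetical addresses and dates, the doctored header looked roughly like this:
From: John Doe <jdoe@example.com><br />
To: Me <me@example.com><br />
Date: Sun, 18 Dec 2011 10:15:00 -0500<br />
Subject: Tomorrow<br />
<br />
<br />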

The question was how to automate these changes.  I saw that Notepad++ would do changes for all opened files, but I didn't want to have to open large numbers of emails.  I found indications that a one-line Perl command and also an editor called TexFinderX would change all occurrences of certain text within multiple documents, but I didn't want to change all occurrences.  I wanted to change the first occurrence of "From:" to be "<br />From:" and likewise with Date and Subject, and I wanted to make other changes as well.

A search led to another Perl command that would apparently perform at least some of these tasks.  It was tempting to try to pick up enough Perl fluency to make that command work.  I decided, though, that it would be better to start by learning Perl more thoroughly (someday), so as to have a clearer understanding of what this command might or might not be doing inside some large number of files.  The search led, similarly, to various SED and AWK commands, with the same reservations on my end.

For some reason, it felt safer to use a utility programmed to achieve the same thing as those command-line approaches, even though I would still not be able to see what was being changed.  Maybe it was just reassuring to think that someone with programming knowledge was trying to solve the problem.  It helped to find a Gizmo recommendation for ReplaceText (formerly BK ReplaceEm).  My faith was shaken, however, when I saw that the Gizmo recommendation, dated October 17, 2011, was pointing to a webpage that said ReplaceText was no longer supported and "has known problems with some Windows 7 installations."  Not the end of the world, but also not ideal.  Gizmo's second recommendation was A.F.9, which appeared to have been last updated in 2002.  It seemed I might have to come up with some other approach.

Dealing with Headers

Another problem I was dealing with was that not all emails had the same kind of header.  Some had at least a dozen lines, with references to things like "X-Message-Delivery," where others had a smaller set of header lines.  It appeared that a fully automated approach could easily make a mess.  I noticed, for instance, that the Birdie conversion program would just run multiple lines together, in files with some kinds of headers.

Having spent a lot of time undoing messes caused by the previous EML conversion process, I decided to take a slower and more cautious approach, at least for starters.  I began with FIND commands, on the Windows command line, designed to distinguish EML files with different kinds of headers.  These FIND commands ran into Access Denied errors, resolved in a separate post.
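
A hypothetical example of the sort of FIND command involved (the header field and the folder shown here are just placeholders):
find /i /c "X-Message-Delivery" "D:\Emails\*.eml" > headercheck.txt
Files reporting a count of zero for a given field presumably had one kind of header; files reporting one or more had another.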

It turned out that emails could contain a variety of header fields.  Some (e.g., Date, From, Subject) were essential and self-explanatory.  Others (e.g., Content-Type) were evidently common but were not ordinarily displayed in email readers (e.g., Outlook, Hotmail, Thunderbird).  Another category was the X-field.  According to one source, "X-fields are experimental [though evidently X was intended to stand for "extension," not "experimental"] fields added by email clients or servers and may be useful valid information or may be forged."  Things seemed to have changed since 1993, when someone seemed to have felt that X-fields were to be "strongly discouraged."  At this point, they were widely used.  I had seen a number of them.  They were also called X-headers.  There was apparently no authoritative list of them; people were seemingly free to invent them as they saw fit, perhaps by using ordinary email programs (e.g., Eudora).  I found lists of X-headers for Usenet and listserv posts.  After some hunting, I did finally find a list of X-headers that might appear in email messages, as well as a discussion of standard HTTP header fields.  By this point, though, it was clear that any such list had to be open-ended.  Even that lengthy list did not contain some of the X-headers that appeared in one of the first emails I looked at (i.e., headers of the X-Message variety).  I found no clear indication of what headers might appear in a legitimate email message.

So apparently I was not going to be able to rely on someone's preconceived list of headers, as a guide to removing the unwanted ones from a large number of emails en masse (perhaps using some tool like TexFinderX, above).  The most reliable approach would seemingly require me to identify the header fields actually used in the emails I wanted to convert.  There did not seem to be an automated way to do that.  Some emails had HTML codes or empty lines dividing the header from the body, but others did not.  I worked up an approach using screenshots, one per file (viewed in Notepad), to give me an impression of those codes.  But attempts to use optical character recognition (OCR) software on those screenshots did not give me an ideally reliable indication, in text, of the header contents.

It seemed that I probably had the ability to use macros in Word to eliminate unwanted headers and to reposition the wanted headers within the body of the email message, so as to produce a good appearance when the file was then printed to PDF.  (I was using Bullzip as my PDF printer.)  I decided to approach that project by whittling down the size of the messages -- that is, by removing attachments first.

PDFing the Attachments

I had a sample set of 46 EML files.  I was not sure how many had attachments.  A look at some of them in Notepad suggested that not all EMLs announced the presence of attachments in the same way.  It did seem, though, that a certain text string would be found in most cases where an attachment existed.  That string was:

Content-Disposition: attachment;
On that starting hint, I ran a FIND command to see which of these 46 EMLs appeared to have attachments:
find /i /c "Content-Disposition: attachment;" "D:\Emails\*" > Attachlist.txt
Attachlist.txt gave me output like this:
---------- D:\EMAILS\FILENAME.EML: 1 
apparently indicating that FILENAME.EML contained one occurrence of the Content-Disposition text string.  It said that maybe a third of the files had one such occurrence, with the rest having none.  An exception:  one file contained two occurrences.  I looked at it.  It seemed to be a case of a forwarded message, with the ATT00001.htm filename that I had often observed but never did understand.  The message was displayed in the EML, but was also apparently attached in that ATT format.  My guess was that the best approach would be to keep everything up to the last occurrence of the Content-Disposition string.

I used Excel to convert Attachlist.txt to a batch file that would move the files containing Content-Disposition to a separate folder.  Then I opened all of the EMLs to see how accurate this FIND command was.  It appeared that I had found the key to distinguishing those files that did contain attachments:  the ones with "Content-Disposition: attachment;" did contain attachments, and the others did not.
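
As a sketch of that step (assuming each line of Attachlist.txt was pasted into column A, that FIND's usual output format of ten hyphens and a space preceded each filename, and that D:\HasAttach was a hypothetical destination folder), a formula along these lines would turn the lines reporting one or more occurrences into MOVE commands:
=IF(VALUE(MID(A1,FIND(": ",A1)+2,10))>0,"move "&CHAR(34)&MID(A1,12,FIND(": ",A1)-12)&CHAR(34)&" D:\HasAttach\","")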

I opened the EMLs containing attachments and manually printed their attachments as PDFs, with names that would help me link them to the emails later.  Some of them were extraneous winmail.dat attachments that I ignored.  Some were already PDFs and thus didn't need to be printed to PDF.  So now, in that separate folder, I had about a dozen EMLs that contained attachments, and about a half-dozen attachments that I had just PDFed from those EMLs.  I wanted to convert those EMLs to PDF first, and separately from the larger group of non-attachment EMLs, so as not to get confused as to which emails the attachments belonged to.  (More precise naming of the PDFs would have alleviated that concern, but would have taken more time.)  So now it was time to figure out a solution for the EMLs themselves.

Removing Attachment-Related Material from EMLs

My original plan was to get the EML into Microsoft Word, where I could write macros to manipulate the text in useful ways.  If I just opened an EML in Word, it would appear without its header.  That is, it would not look like an ordinary email, as viewed in Thunderbird or Outlook.  I could get around that problem by opening Word and setting it to Options > General tab > Confirm conversion at Open, and then open the EML as Plain Text, but that would be an extra delay.  With a lot of emails, it could be cumbersome.

I wanted to start by removing the contents of the emails that contained attachments, starting at the point of the last occurrence of "Content-Disposition: attachment."  Experimentation revealed that EMLs so truncated would still open and look fine in Thunderbird, and now they wouldn't have all that attachment material to prevent them from functioning like HTML files.  If they had more than one attachment, an attachment notice would still appear at the bottom of the Thunderbird EML screen, but I was OK with that.

It occurred to me that it might be possible to find a text editor with macro capabilities, so as to handle multiple tasks without having to cycle through multiple programs.  This led to a separate search for a suitable text editor.  In a development that would doubtless provoke some to cheer and others to weep, I wound up with Emacs.  In that process, I did manage to develop an Emacs macro that would eliminate, from an EML file, the post-HTML attachment material.

I was only working with a dozen or so EMLs in that pilot test.  I wound up manually combining the extracted attachments with manually produced PDF printouts of the truncated EMLs (viewed in Thunderbird), so as to remove this potentially complicating part of the larger EML-to-PDF project.

Regular EMLs:  A First Pass in Emacs

Now I was able to focus on the headers, without worrying about attachments.  I had a group of 34 EMLs, with various kinds of headers, that I wanted to manipulate into printable HTMs.  I opened 15 of them in Notepad and took a look.  I decided to take them in batches.  The first kind, the simplest, had "To: " as its very first four characters.  These seemed to follow a pattern of separate lines for To, Date, Subject, and From lines, which I would keep, followed by Content-type, MIME-Version, and Content-transfer-encoding lines, which I would discard.  The four keeper lines would be moved down into the text, immediately after the BODY tag, with <br /> tags added for line breaks as needed.  I manually edited one of these in that way, opened it as an HTM, and it looked good.

As I thought about it some more, I thought that possibly a better strategy would be to write a macro that would go down to the <BODY> tag and then search upwards for the desired tags (From, To, etc.) and delete everything else.  And then, as I thought about it still more, I realized that all of the above were beyond my present Emacs ability.  I would get there ... but not today.

Saturday, April 23, 2011

Windows 7: Archiving Thunderbird Emails to Individual PDFs - Retry

I had a large number of emails in Thunderbird (an email program like Outlook, but open source freeware).  I wanted to export each of those emails to its own distinct PDF file with a filename containing Date, Time, Sender, Recipient, and Subject information in this format:

2011-03-20 14.23 Email from Me to John Doe re Tomorrow.pdf
In that example, I might ultimately eliminate the "from Me" part as understood, but of course other emails would be from John back to me, so for starting purposes I wanted all five of the fields just listed.  The steps I went through are described below.  There is a summary at the end of this post.

Recap and Development:  Converting Emails into EML Format Files with Preferred Filenames

So far, I had already worked through the process of exporting those emails to distinct EML files.  I had also used a spreadsheet to rename those EML files so that they would provide clearer and more complete information about the file's contents.  (I was using Excel 2003 for spreadsheeting.  OpenOffice Calc was now able to handle a million rows (i.e., to rename a million files), but it had not been stable for me.  One option, for those who had more than 65,000 EMLs and therefore couldn't work within Excel's 65,000-row limit, was to do part of the list at a time.)  This post picks up from there, summarizing a more streamlined approach to the steps described at greater length in the two previous posts linked to in this paragraph.

I had previously tried to begin with the Index.csv file exported from Thunderbird via ImportExportTools, but that had been a very convoluted and unsatisfactory process.  I did continue to use Index.csv, but my main effort was to work up a spreadsheet that would use and alter the filenames created when I exported EMLs from T-bird, also using ImportExportTools.  As described previously, I had developed some rules for automated cleaning of various debris from filenames, such as the underscores that ImportExportTools inserted in place of quotation marks and other characters.

To summarize the approach described in more detail in the previous post, I got the filenames from the folder where ImportExportTools had put them by using this command at the CMD prompt (Start > Run > cmd):  "DIR /b > dirlist.txt" and then I copied and pasted the contents of dirlist.txt into an Excel spreadsheet.  There I extracted the Date, Sender, and Subject fields from those filenames using Excel functions, including FIND, MID, TRIM, and LEN, all described in Excel's Help feature and in the previous post.  I also used Excel in a separate worksheet to massage the data on the individual emails as provided in Index.csv.

The two worksheets did not produce the same information, and I needed them both.  One contained the actual filenames, which I wanted to revise en masse to be more readable and to include the "To" field, which was contained in Index.csv.  Many of the things that ImportExportTools screwed up about the subject fields of emails, in producing CMD-compatible filenames (and going well beyond that), involved the underscore character.  Hence, the chief sections in the main worksheet (where I revised the data from dirlist.txt), going across the columns, were as follows:
Dirlist (raw EML filenames)
Date & time conversion (from 19980102-0132 to 1998-01-02 01.32; see the formula sketch below)
Subject: Clean up starting & ending underscores of From names
Subject: Replace "Re_ " with "Reply re" in Subject field
Subject: Replace "n_t" and "_s" (as in "don_t" and "Mike_s") with apostrophes
Subject: Replace serial underscores: "_ _" becomes " - "
Subject: Replace "I_m" with "I'm" and "you_re" with "you're"
Subject: Replace underscore and space ("_ ") with hyphen (" - ")
Subject: Remove starting and ending hyphens
That accounted for the bulk of the needed changes in the Subject field, in the files I was working with.  I set these rules up to eliminate the first occurrence (or, in some instances, the first two occurrences) of the underscore string in question.  Few emails contained more than that; for those few, leaving the additional underscores in place was acceptable.  There would be some predictable misfires of these rules, but they would generally improve the situation, and when dealing with a large number of EMLs that I didn't intend to rename manually, this was the best that I could hope for under the circumstances.
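
For example, the date and time conversion mentioned in the list above could be handled by a formula along these lines (assuming the 13-character stamp, e.g., 19980102-0132, had already been pulled out of the raw filename into column A, a hypothetical layout):
=LEFT(A1,4)&"-"&MID(A1,5,2)&"-"&MID(A1,7,2)&" "&MID(A1,10,2)&"."&MID(A1,12,2)
That would turn 19980102-0132 into 1998-01-02 01.32.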

Then I used VLOOKUP to search for a match with the Index.csv-style Date and Time (e.g., 19980102-0132) data in the Index.csv worksheet, and also for a match with the Index.csv Date+Time+From combination.  (Sometimes the From field was necessary to distinguish two or more emails sent at the same time.  Because of the underscores and other oddities about the EML filenames, subjects were too different to compare in most cases.)  This identified precise matches between the two worksheets for about 80% of EMLs.
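
A minimal sketch of that lookup (assuming, hypothetically, that the Index.csv data sat in a worksheet named Index with its Date+Time key in column A and the To field in column D, and that the main worksheet computed the matching key in cell B2):
=VLOOKUP(B2,Index!A:D,4,FALSE)
An #N/A result meant no exact match, flagging that email for the fallback handling described below.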

So now I was going to try using that same spreadsheet with another batch of emails exported from Thunderbird.  I exported the Index.csv and the EMLs, and set to work on the spreadsheeting process of reconciling their names and producing MOVE commands for a CMD batch file that would automatically rename large numbers of EMLs to be readable and to include data from the To field.

This time around, I did a first pass to bulk-recognize and batch-rename that first 80% of the EMLs.  The CMD command format was this:
MOVE /-y "Old Filename.eml" "Renamed\New Filename.eml" 2> errlog.txt
This renamed the old EMLs to the desired new EML filenames, put them into the Renamed subfolder, and gave me an error log to say what went wrong with any of the renames.  The error log wasn't very useful, so I stopped creating it in these commands.  What I had to do instead, to find out which EMLs had been successfully renamed, was to do a dirlist.txt for the Renamed folder, feed that back into the spreadsheet, and delete those lines that had executed successfully.  For about 15% of the emails, I could not automatically detect matches between data from Index.csv and actual files, so I wound up naming those files according to date, time, and sender only.  Finally, I got down to less than 1% of emails that I had to rename in a more manual fashion, mostly due to non-ASCII characters in their filenames.  For that, I used Bulk Rename Utility.
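
Incidentally, one likely reason the error log was unhelpful is that each line's "2>" redirection overwrote errlog.txt, so at most the last command's error would have survived.  Appending with "2>>" would presumably have preserved them all:
MOVE /-y "Old Filename.eml" "Renamed\New Filename.eml" 2>> errlog.txt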

I was not sure whether this route wound up being better than the approach of using one of the shareware programs discussed in the previous post.  I was not aware of the potential difficulties when I was looking at those programs, so for example I didn't try them out on emails with Chinese characters in their Subject fields.  The other way always looks easier after a project like this.  The approach I had taken had surely been more time-consuming than if I had known of a killer app that would do exactly what I wanted without unanticipated complications or failures.  Absent a reliable, obvious solution at an affordable price, the main thing I could say at this point was that at least the conversion to EML was done.

Final Step:  Converting EMLs into PDF

With EMLs thus exported from Thunderbird and mostly renamed to indicate date, time, sender, recipient, and subject, the remaining task was to convert the EMLs to PDF.  This, it developed, might not be as simple as I had hoped.  There was, first, the problem of finding a program that would do that.  Some of the emails were simple text and could have been easily converted to TXT format just by changing their extensions from .eml to .txt.  Acrobat and other PDF programs would readily print large numbers of text files, unlike EMLs.  Other EMLs, however, contained HTML (e.g., different fonts, different colors of print, images).  I wasn't sure what would happen if I changed their extensions and then printed.  I noticed that the change to .txt caused the HTML codes to become visible in one message that I experimented with.  When I converted that file to PDF using Acrobat, its header appeared in a relatively ugly form, but the colors and fonts seemed to be at least somewhat preserved.  In another case, though, the PDF was largely a printout of code -- a truly undesirable replacement for what had been a pretty email with photos included.  My version of Acrobat (ver. 8.2) did not provide any editable settings for conversion from text or HTML to PDF. 

Thunderbird was my default program for displaying EMLs.  I wondered if a different program could view them and would have better PDF printing capabilities, or if I should try converting them into another interim format in order to then convert them to PDF.  A search led to the claim that Microsoft Word (or other programs) could display EMLs.  I tried and found that this was essentially untrue:  in Word, there was almost nothing left of that pretty email I had just tested.  Converting EML to MSG seemed to be one option, but this looked like a dead end; that is, it didn't look like it would be any easier to PDF an MSG file than to PDF an EML.  Getting the EMLs into Outlook wasn't likely to be a solution; as I recalled, my version of Outlook (2003) had been unable to batch print emails as individual PDFs.  Marina Martin said that MBOX was the standard interoperable email file type.  I could have exported from Thunderbird directly to MBOX using ImportExportTools, but I had not investigated that; I had assumed that MBOX meant one large file containing many emails, like PST, and I had wanted to rename my emails individually.  Martin gave advice on using eml2mbox to convert EML to MBOX; hopefully I would not have lost anything by taking the route through EML format.  But if MBOX was such a common format, there was surprisingly little interest in converting it to PDF.  My search led to essentially nothing along those lines.  Well, but couldn't Firefox or any other web browser read HTML emails?  I tried; neither Firefox nor Internet Explorer was willing to open an EML.  I renamed it to be an .html file.  Both opened that, but here again the problem was that the header was so ugly and hard to read:  it was just a paragraph-length jumble of text mixing up the generally important stuff (e.g., from, to) with technical information about the transmission.  Even assuming I could work out a batch-PDF process for HTMLs, this was not the solution.  There were other possibilities, but in the end it did appear that I simply needed to buy an EML-to-PDF converter.

It tentatively appeared that MSGViewer Pro ($70) might be the most frequently downloaded program in this area, ahead of its own sister program PSTViewer Pro as well as Total Mail Converter.  A search for reviews led to very little.  It didn't appear that MSGViewer Pro had the ability to include image attachments within the PDF of an EML, as Total Mail Converter Pro ($100) supposedly did.  On the other hand, MSGViewer Pro supposedly provided a free five-day trial.  I decided that I did not have time to mess with endless numbers of attachments right now, and was therefore willing to just zip the EMLs into a single file for possible future processing, if I decided that there was sufficient need and time for that.  Given that I was unlikely to use these programs very often, I also hoped that their prices would drop.  I figured that if the MSGViewer Pro trial was fully functional, I might be able to take care of my need for it now, converting EMLs into PDFs without attachments, and otherwise let the matter sit for another year or more.

On that basis, I downloaded and installed MSGViewer Pro.  It was apparently designed for an older version of Windows.  When I installed it, I got one of those Win7 messages indicating that it might not have installed properly, and inviting me to reinstall using "recommended settings," whatever that meant.  I accepted the offer.  Once properly installed, I ran the program.  A dialog came up saying, "Trial is not licensed for commercial use."  I clicked "Run Trial."  Right away, I found that its Refresh feature did not work:  I copied some EMLs into a separate folder to experiment with, and could not get the program to find that folder.  I killed the program and started over.  Now it found the folder.  I selected those messages, clicked the Export button, and told it to give the resulting PDF (one of the available output options; the others were TXT, JPG, BMP, PNG, TIFF, and GIF) the same names as the input files.  It had a nice option, which I accepted, to copy failed messages to a separate folder.  A dialog came up saying, "You can only export 50 emails in trial version of MsgViewer Pro."  So that popped that fantasy.  It ran pretty quickly and reported that all of the files had been successfully exported.  Sadly, the results were no better-looking than I had been able to achieve on my own, with other measures described above.  HTML codes were visible in some PDFs -- or perhaps I should say, not visible, but overwhelming:  it looked like a piece of ordinary HTML coding.  The typeface was tiny.  Some lines were actually split down the middle horizontally, with the top half of a line of text appearing at the bottom of one page and the bottom half appearing at the top of the next page.  In a word, the results were junk.  I uninstalled MSGViewer Pro.

I decided to try Total Mail Converter Pro.  No installation problems.  When installation ended, the program started right up without giving me a choice.  Then it decided I needed to log onto Gmail.  This was not my plan, so I canceled that.  I liked its interface better than MSGViewer Pro:  smaller but still readable font, seemingly more options.  I selected my test files and clicked the PDF button.  It gave me options to combine the files into one PDF or produce separate files.  It also provided a file name template, with choices of subject, sender, recipient, date, and source filename.  I tried these.  There were other options:  which fields to export, whether to include attachments in the doc or put them in separate folders, header, footer, document properties.  It did the conversion almost instantly.  The date format was month-day-year.  The subject data weren't cleaned up, so I would still have had to go through something like my spreadsheet process to get the filenames the way I wanted them.  Moment of truth:  the file contents included a colored top part, as I had encountered with Birdie (see previous post). HTML codes were still visible in some messages, but in others the HTML seemed to have been better converted into rich text.  Typefaces were still tiny.  Definitely a better program.  But worth $100 for my needs?

Ideally, I would have been converting my emails to PDF as I went along, without converting them around and around, from Outlook to Thunderbird to EML and wherever else they might have gone over the past several years.  This might have better preserved what I recalled as the colorful, more engaging look of some of them, and perhaps I would have come up with better ways of capturing those characteristics as I continued to become more experienced with the process.  In the present circumstances, where I really just wanted to get the job done and move on, it seemed that playing with that sort of thing was not a short-term option.

Since I was planning to keep the EMLs anyway, and since I did not plan to view these emails frequently, I decided that I really didn't lose much in informational terms by going with the free option identified above.  I took a larger sample of EMLs and, using Bulk Rename Utility, renamed them to be .txt files (though later I realized I could have just said "ren D:\Documents\*.eml *.txt").  Since I had installed Adobe Acrobat, I had a right-click option to convert to Adobe PDF.  No doubt some freeware PDF programs provided similar functionality.  The Acrobat conversion of these files into PDF was not nearly as fast as that performed by Total Mail Converter Pro.  Acrobat put each of those newly created PDFs onscreen and obliged me to manually confirm that I wished to save them.  I had converted 40 files, and wasn't interested in manually closing all 40; ultimately I had to use Task Manager to shut them down.  That problem turned out to be just a result of the settings I was using for my default Bullzip PDF printer; changing those defaults and using Acrobat's Advanced > Document Processing > Batch Processing option made the process completely automatic.  In terms of appearance, it seemed the fonts, HTML handling, and other features were more or less the same as I had gotten from those other programs (above).  I probably could have made the average resulting email more readable (except where HTML formatting made clear who was responding to whom) by looking for a program that would strip the HTML codes out of those TXT files, but I didn't feel like investing the time at this point and wasn't sure the effort would yield a net improvement.

Briefly, then, the PDFing part of this process involved using a bulk renamer to replace the .eml extension with a .txt extension, and then using a bulk PDF printer or converter to convert those TXT files into PDF.  This approach still preserved the look of some emails, while allowing others to be overrun with HTML codes.

I ran that batch process on a full year's set of EMLs.  I converted 1,422 EMLs into TXT files by changing their extensions with Bulk Rename Utility.  Somehow, though, Acrobat produced only 689 PDFs from that set.  Which ones, and what had happened to the rest?  Acrobat didn't seem to be offering a log file.  My guess was that Acrobat went too fast for Bullzip.  There was no real reason why I shouldn't have been using Acrobat's own PDF printer for this particular project -- in fact, I did not remember precisely what Acrobat snafu had prompted me to switch to Bullzip as my default PDF printer in the first place -- so I went into Start > Settings > Printers and made that change now.  I also right-clicked and changed some of the Printing Preferences, for that printer, so that it would run automatically.  I deleted the first set of PDFs and tried again.  I noticed, this time, that Acrobat was not even trying to convert more than 689 files -- it was saying, "1 of 689," "2 of 689," etc.  What was causing it to overlook these other files, I was not sure.  It seemed I would have to do a "DIR /b > Printed.txt" command in CMD, and then convert Printed.txt into a Deleter.bat file that would delete the text files that were successfully printed, so as to highlight the ones that remained.  (See previous post for details on these sorts of commands.)
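
As a sketch of that Deleter.bat idea (assuming the listing of successfully printed PDFs was pasted into column A, and that each PDF kept the same base name as its source TXT file), each row could produce a command like this, to be run from the folder holding the TXT files:
="del "&CHAR(34)&SUBSTITUTE(A1,".pdf",".txt")&CHAR(34)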

(Incidentally, I had also noticed, now that I was working with the Acrobat batch options, that it had a "Remove File Attachments" option.  While it did not seem to work with EMLs, possibly it would have been useful if these emails had been in MSG or PST format.)

The automated process got as far as file no. 2 in the list before it stalled.  Why it stalled, I had no idea.  I clicked on the X at the top right-hand corner of the dialog to kill it -- I even said "Close the Program" when Windows gave me that option -- and then Acrobat took off and printed a couple hundred more PDFs before stalling in that same way again.  Possibly I had the Acrobat PDF printer's properties set to stop on encountering an error.  I ran through most of that first set before spacing out and killing Acrobat (the whole program) at a stall, instead of just killing the stalled task.  I deleted those that had printed successfully, creating a Deleter.bat file for the purpose as just mentioned, and ran another batch.  This time, Acrobat was printing a total of 667 files.  So I figured the situation was as follows:  Acrobat would print PDFs through a glorified command-line kind of process, and that command line would accommodate only so many characters.  If I'd had shorter file names, maybe it would have been willing to print thousands of TXT files at one go.  If I had wanted to add complexity to the process, I could have renamed my files with names like 0001.txt, reserving a spreadsheet to change their names back to original form after conversion to PDF.  But with my filenames as they were, it was only going to process 600 or 700 at a time.  That was my theory.

When Acrobat was done with the second set -- the first one that had run through to completion -- it showed me a list of warnings and errors.  These were errors pertaining to maybe a dozen files.  The errors included "File Not Found" (typically referring to GIFs that were apparently in the original email), "General Error" (hard to decipher, but in some cases apparently referring to ads that didn't get properly captured in the email), and several "Bad Image" errors (seemingly related to the absence of an image that was supposed to appear in the email).  A spot check suggested that the messages with these errors tended to be commercial (e.g., advertising) messages, as distinct from personal or professional messages that I might actually care about.  In a couple of cases a single commercial email would have several errors.  But anyway, it looked like they were being converted, with or without errors.

I decided to try printing the next batch with Bullzip instead of Acrobat printer.  I had to set it as the default printer in Settings > Printers.  I also had to adjust its settings (by going to its Options shortcut in the Start Menu > General and Dialogs tabs) so that it would run without opening dialogs or completed PDFs.  Would it now process significantly more than 600 input files?  The short answer:  no.  So for the next round, I tried selecting all the TXT files in a folder and right-clicking > Convert to Adobe PDF.  This was a bad idea.  Now Acrobat wanted to open a couple thousand documents onscreen.  I had to force-reboot the system to stop this one.

So now I thought maybe I'd look for some other text-to-PDF converter.  It sounded like ActivePDF was a leading solution for IT professionals, but I didn't care to spend $700+.  Shivaranjan recommended Zilla TXT To PDF Converter ($30).  Softpedia listed a dozen freeware converters, of which by far the most popular was Free EasyPDF.  But I couldn't quite figure out what was going on there.  There was no help file, and the program wasn't even listed on its supposed creator's webpage.  CNET called it fatally crippled.  I didn't know why 30,000 people would have downloaded it.  Back to Softpedia's list:  Free Text to PDF Converter was another possibility with a Good rating.  Its webpage said it could batch-convert text to PDF files.  I went into its Open option, selected a boatload of TXT files, and saw no sign that it had any intention of doing anything with them.  Looking more closely at its starting screen, I saw it said this:
Command Line usage:
TXT2PDF <inputfile> <output.pdf> [parameter table]
The documentation webpage said I was supposed to drag the TXT files into the window on the main screen to convert them.  It also said this program would convert only plain text, not HTML.  I wasn't sure what that meant for the EMLs that contained HTML code as plain text.  The optional parameters had to do with font, paper size, etc.  In the folder where I had my TXT files to be converted, I tried this command:

"C:\Program Files\Text2PDF v1.5\txt2pdf.exe" "Text File to Be Converted.txt"

with quotation marks as shown, on the command line.  It worked.  It produced a PDF.  There was no word wrap, so words would just break in the middle at the end of the line, like this:
We can't pledge that we've entirely emerged from th
at episode, but this
past summer I sat down and rewrote the entire man
ual in a way that makes
more sense. The guy just didn't know how to phrase
The print size was very large; there were parameters to change that, but nothing, apparently, to persuade lines to break at the ends of words rather than in the middle.  This could defeat Copernic text searching, rendering some PDF file contents unfindable, so it wasn't going to be a good solution for me.  But it really seemed like the command line approach, which would let me name each file to be converted, was the answer to the problem of being able to process only ~600 text files at a time.  Another possibility:  AcroPad.  The following command worked:
Acropad "File to Convert.txt" "File Converted.pdf" Courier 11
I could have named other typefaces and font sizes.  Output was double-spaced.  Lines were broken at the ends of words, not in the middle.  HTML code in the file was just treated as text and printed out as-is.  I kept searching.  A post by Adam Brand said I could use a command to automate printing if I had Acrobat Reader installed.  That prompted another search that led to several insights.  First, it turned out I could print a file from the command line using a Notepad command in the form of "notepad.exe /p filename."  Since my default printer was a PDF printer, it printed a PDF -- a nice one, too, for basic purposes, nicer than some of the output I was getting from the programs tested above.  It put the output on the desktop.  I changed the location for the output by going into the Desktop folder for my username.  Since I was running as Administrator, the location was C:\Users\Administrator\Desktop.  There, I right-clicked on the Desktop folder, went to Properties > Location tab and changed it.  (Another Notepad option, which I didn't need, was to specify which printer I wanted to use:  /pt.)
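
To print a whole folder of files that way, a one-line batch file along these lines might have worked (an untested sketch; each notepad.exe instance would print to the default printer and then close):
FOR %%g IN (*.txt) DO notepad.exe /p "%%g"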

The Notepad approach did nothing with HTML codes in these plain text files.  An alternative that would work with rich text, which might or might not help in my case, was supposedly to try the same switch with Wordpad:  "wordpad.exe /p filename."  But when I did that, I got an error message:
'wordpad.exe' is not recognized as an internal or external command, operable program or batch file.
This was odd.  To fix it, I ran regedit (Start > Run) and went to
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\App Paths\.  There, following instructions, I right-clicked on App Paths and selected New > Key.  I named the new key "Wordpad."  I right-clicked on that Wordpad key and selected New > Expandable String Value.  It apparently didn't matter what I called it.  I called it ProgramPath.  I right-clicked on ProgramPath and pasted in the path where Wordpad was, which I had obtained by going into the Properties of the Wordpad shortcut on my Start Menu.  In other words, what I entered here included quotation marks and the name of the executable wordpad.exe, with extension.  The instructions said that, to run Wordpad from the command line (as distinct from in Start > Run), the command would have to begin with the Start command.  For present purposes, what I would type at the C prompt would be "start wordpad /p filename."  This worked (and I exported the new registry key and added it to my Win7RegEdit.reg file for future installations), but it did not produce a superior PDF compared to that which Notepad had produced, and for some reason it truncated the filename of the resulting PDF.

Revised Final Step:  Converting TXT to HTML to PDF

Searching onward, there was a possibility of treating them as HTML rather than TXT files.  I had flirted with this earlier but had not grasped that, of course, these actually were HTML files in the first place; they had become EMLs and TXTs only later.  I typed "ren *.txt *.htm" to rename them all as HTML files.  To print them, there were some complicated approaches, but I hoped that PrintHTML.exe would do the trick.  The syntax, for my purposes, was this:
printhtml.exe file="filename.htm"
with optional leftmargin=1, rightmargin=1, topmargin=1, and bottommargin=1 parameters, among others that I didn't need.  The printhtml.exe file would of course have to be in the folder with the files being printed unless I wanted to add it to the registry as just described for Wordpad.  PrintHTML wouldn't work until I installed the DHTML Editing Control.  I did all that, and got no error messages, but also did not seem to get any output.  I decided to put that on hold to look at another possibility:  automated PDF printing using Foxit Reader on the command line.  Pretty much the same command syntax as above:
"FoxitReader.exe" /p filename
Here, again, there was a need for a registry edit, unless I wanted to park a copy of Foxit in every folder where I would use it from the command line.  But the instructions were only for using Foxit to print PDFs, so I got an error:  "Could not parse [filename]."  There was also an option of using Acrobat Reader to print a PDF silently or with a dialog box, but there again it wasn't what I needed:  I was printing HTMLs.  I returned to that printhtml.exe program mentioned above.  The command ran, with no indication of any errors, but there did not seem to be any output.  Another possibility was:
RUNDLL32.EXE MSHTML.DLL,PrintHTML "Filename.htm"
But for me, unfortunately, that produced an empty PDF.  Turning again to freeware possibilities, I found an Xmarks list of top-ranked HTML to PDF programs.  Most of the top-ranked items were online, one-file-at-a-time tools.  Others required PHP knowledge that I didn't have (e.g., HTML_ToPDF, PDF-o-Matic).  HTMLDOC looked promising for command-line usage; I found its manual; but when I downloaded and unzipped it, I couldn't find anything that looked like a setup or installation file.  Apparently the free version was just the source code, and I didn't know how to compile it.  DomPDF and html2pdf (and, I suspected, some of these others) were apparently for Linux, not for Windows.  I tried wkhtmltopdf.  When I ran it, I got an error:
wkhtmltopdf.exe - System Error
The program can't start because libgcc_s_dw2-1.dll is missing from your computer.  Try reinstalling the program to fix this problem.
Possibly the reason I got that error was that I was trying the same trick of running the program in a folder where my PDF files were.  I had copied the executable (wkhtmltopdf.exe) to that folder, but had not brought along its libraries or whatever else it might need.  I tried running it again -- I was just trying to use the help command, "wkhtmltopdf --help" -- but this time pointing to the place where the program files were installed:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -- help
and that worked.  I got a long list of command options.  What I understood from it was that I wanted, in part, a command like this:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -s Letter "File to be converted.htm"
I tried that.  It gave me an error:
Error: Failed loading page http: (sometimes it will work just to ignore this error with --load-error-handling ignore)
So I tried adding that long parameter to the command.  It seemed like it worked:  it gave the error message but then proceeded through the rest of its steps and announced, "Done."  But I didn't see any output anywhere.  Then I realized there was an error in what I had actually typed.  I tried again.  This time, it gave me a different error message:  "You need to specify at least one input file, and exactly one output file."  So the format I was supposed to use, aside from that additional "--load-error-handling ignore" parameter, was this:
"C:\Program Files\wkhtmltopdf\wkhtmltopdf.exe" -s Letter "HTML file to be converted.htm" "New PDF File.pdf"
And that worked.  At last, I had a mass-production way of converting EMLs (by changing their extension to .htm, not .txt) to PDFs.  It was too early to break out the champagne, but at least the computer and I were back on speaking terms.  Now I just needed to run "DIR /s /b > dirlist.txt" in the top-level folder under which I had sorted my emails, convert that dirlist.txt file into a .bat file that would convert the file listings into batch commands, and run it.  I was afraid the whole command, with the introductory reference to C:\Program Files, would be too long for Windows in some cases, so I edited the registry as described above, so that I would only have to type wkhtmltopdf.exe at the start of each command line.  But now that registry edit wasn't working -- it certainly seemed to previously -- so I copied all of the wkhtmltopdf program files to the folder where I would be running this batch file.  I didn't want the computer to crash itself by opening hundreds of simultaneous wkhtmltopdf processes, and I wanted to move the PDFs, so the format I used for these commands was:
start /wait wkhtmltopdf -s Letter "D:\Former Directory\HTML file to be converted.htm" "D:\New Folder\New PDF File.pdf"
That worked.  Now I investigated the longer list of wkhtmltopdf command-line options, by typing "wkhtmltopdf -H" (with a capital H).  Whew!  The list was so long, I couldn't view it in the cmd window -- it scrolled past the point of recall.  I tried again:  "wkhtmltopdf -H > wkhtmltopdf_manual.txt."  I couldn't add too much to the command line -- I was already afraid the long filenames would make some commands too long for CMD to process.  But having viewed some output of these various PDFing programs, a few sets of commands seemed essential, including these:
-T 25 -B 25 -L 25 -R 25
--minimum-font-size 10
The first set would give me one-inch margins all around.  Putting these on the already long command line increased my interest in another option:  --read-args-from-stdin.  This one, according to the manual, would also have the advantage of speeding up the process, since I would be starting wkhtmltopdf just once, and then re-running it with different arguments.  The concept seemed to be that my conversion batch file (or, really, just a typed command) would contain this:
start wkhtmltopdf --read-args-from-stdin < do-this.txt
and then do-this.txt would contain line after line of instructions like this one:
-T 25 -B 25 -L 25 -R 25 --minimum-font-size 10 -s Letter "D:\Former Directory\HTML file to be converted.htm" "D:\New Folder\New PDF File.pdf"
Or perhaps they could be rearranged so that some of the contents of the second could be in the first, and therefore would not have to be repeated on every line in do-this.txt.  In which case the main conversion command would look like this:
start wkhtmltopdf --read-args-from-stdin -T 25 -B 25 -L 25 -R 25 --minimum-font-size 10 -s Letter < do-this.txt
and do-this.txt would contain only the "before" and "after" filenames.  I decided to try this approach.  Unfortunately, it didn't work.  It froze.   So then I tried just the minimal one shown a moment ago, putting all options except --read-args-from-stdin in the do-this.txt file.  Sadly, that froze too.  I tried the minimal command plus just filenames, leaving out the several additional commands about margins and font size.  Still no joy.  So, plainly, I did not understand the manual.  I decided to go back to the approach of just putting it all on one line and repeating all commands, in a batch file, for each HTM file that I was converting to PDF.  Each line would begin with "start /wait," not just "start," for reasons stated above.  This worked, but now I noticed a new problem that I really hadn't wanted to notice before, because I just wanted this project to be done already.


Separating EMLs With and Without HTML Code

The new problem was that emails that were originally in HTML format turned out best when they were now renamed with an .htm extension, and processed that way, but the ones that didn't have HTML codes in them were now reduced to a mess.  Specifically, line and paragraph breaks were gone; everything was just jumbled together in one continuous stream of text.  Every non-HTML email was now being represented by a single long paragraph.  To get decent output, it seemed that I needed to separate the emails that contained HTML code from those that did not.  I would then use wkhtmltopdf with the former, but not with the latter.  But how could I tell whether a file contained HTML code?  I decided that an occurrence of "</" would be good enough in most cases.  But then it occurred to me that there might be programs that would sort this out for me.  A search led to the FileID utility.  Its read-me file led me to think that this command, entered in the top-level folder containing the files to be checked, might do the job:
"D:\FileID Folder\fileid" /s /e /k /n
This would run FileID from the folder where its program files were stored, and would instruct FileID to check all files in all subdirectories, to automatically change file extensions to match contents, to delete null files, and not to prompt me for input.  But it did not seem to be working.  Regardless of whether I entered these options as upper- or lower-case (e.g., /S or /s), FileID paused after every screenful of information, and did not seem to be renaming anything.  I decided to try again with another command-line program of similar purpose, TrID.  TrID had an online version and a GUI.  On second thought, I decided to give the GUI version a whirl.  I downloaded the program and its XML definitions.  (I already had the necessary .NET Framework installed.)  As advised by Billy, I moved everything from the XML definition folder (after unzipping them with WinRAR) into the folder containing the TrIDNet.exe file.  I doubleclicked on that executable and saw that it would process only one file at a time.

I moved on to the command-line version.  This called for a download of a different set of program files and definitions.  I wasn't sure whether TrID would actually change incorrect extensions, or just detect them.  Again, rather than plow into the support forums, I just tried it out.  But in this case, that strategy didn't work:  there were no manuals or other usage instructions in the download.  The forum contained a tip on using PowerShell to fix extensions, but I didn't know enough about PowerShell to be able to interpret and adapt that tip to my situation.  But, silly me, I forgot about just getting online help.  In the folder where I had unzipped TrID.exe, I opened a cmd window and typed "trid -?" and got the idea that I could type "trid -ce" or perhaps "trid *.* -ce" to have the program change file extensions as needed, for all files in the current directory.  It didn't appear to have a subdirectory option, so I would have to do some file moving.

A different approach was to use a CHK recovery program to detect the proper extension for anything with a CHK extension.  While FileCHK looked like the better program for recovering real CHK files, it looked like UnCHK would have more flexibility for my situation, provided I first ran "ren *.htm *.chk" to change the file extensions to .chk.  When I tried to run unchk.exe, I got an error message:
The program can't start because MSVBVM50.DLL is missing from  your computer.  Try reinstalling the program to fix this problem.
Eric had already warned me, in the read-me file, that this meant I needed to download and install the Visual Basic 5 runtime.  I did, and tried again.  Now it ran.  I couldn't find documentation or a /help option to explain its settings.  It took me a while to realize it wasn't a command-line program, though it could run from the command line.  It was very bare-bones.  I started it, navigated to the first of the folders I wanted to repair, and (having renamed files to have .chk extensions), gave it a try.  It gave me a dialog asking about Scan Depth.  I knew from the read-me that I wanted the Whole Files option.  It ran for a while and then disappeared.  It didn't seem to have done anything.  After some more searching around, I concluded that this CHK approach wasn't what I wanted.

So I looked elsewhere.  If I wanted to spend a day or so refreshing my aging knowledge of BASIC programming, or invest some time in learning more about batch scripting or Microsoft Access or some other program, I was pretty sure I could work up a way to examine file contents.  But I wanted a solution faster than that, if possible.  The CMD batch FIND command looked like it might do the job.  But the command that I thought should work,
FOR %G IN (*.txt) do (find /i "</" "%G")
didn't.  It wasn't because "</" were weird characters; it wasn't finding files containing ordinary text either.  I tried again with the FINDSTR command:
findstr /m /s "</" *.* > dirlist.txt
This looked promising.  But when I examined dirlist.txt, I saw that many of the files listed in it were better presented as TXT than as HTM.  Apparently I should have been looking for files with more substantial HTML content.  A spot check of several emails suggested that the existence of an upper- or lower-case "<html" might be a good guide.  So apparently I would have to run FINDSTR twice:
findstr /m /s "<HTML" *.* > dirlist.txt
findstr /m /s "<html" *.* >> dirlist.txt 
" symbols in the second">
with two ">" symbols in the second one, so as to avoid overwriting the results of the first search with the results of the second.  I tried that.  There were some error messages, "Cannot open [filename]," apparently attributable to weird characters in the file's name; somehow it seemed I had still not entirely succeeded in cleaning those up.  I assumed FINDSTR's failure in this regard would leave those files being treated as TXT by default, which would probably be OK since the majority of files overall appeared to be non-HTML.  Ultimately, dirlist.txt contained a list of maybe 40% of all of the emails I was working on.  That seemed like it might be about right.  In other words, it seemed that about 60% of the emails were best treated as plain text, and I would be getting to those shortly.  I put dirlist.txt into a spreadsheet to produce commands that would run wkhtmltopdf on the files those two searches had listed.  The key formula from that spreadsheet:
="start /wait /min wkhtmltopdf -T 25 -B 25 -L 25 -R 25 --minimum-font-size 12 -s Letter "&CHAR(34)&B1&"\"&C1&".htm"&CHAR(34)&" "&CHAR(34)&"..\Converted\"&C1&".pdf"&CHAR(34)
That formula, applied to each file identified as containing "<html," produced PDFs that looked relatively good.  I found that I needed a way of testing them, though, because in a number of cases wkhtmltopdf had produced PDFs that would not open.  I also noticed that the batch file running these commands kept acting like it had died. Windows would say, "wkhtmltopdf.exe has stopped working," and I would click the option to "Close the program." And then, after a while, it would come roaring back to life.  This may have happened especially when wkhtmltopdf was converting simple email messages into PDFs of a thousand pages or more.  A thousand pages of gibberish.  In a number of cases, too, the resulting PDF was a failure.  When I tried to open those PDFs, Acrobat said this:
There was an error opening this document.  The file is damaged and could not be repaired.
I was not sure what triggered these problems.  I wondered if possibly the simpleminded conversion from EML to HTM by merely changing the extension caused problems in the case of EMLs that contained attachments.  If that was the case, then what I should have done might have been to export from Thunderbird in HTML format in the first place -- to do two exports, in other words:  one for EMLs, which would include attachments, to be zipped up into an archive and shelved until the future day when there would be a simple, cheap solution for the PDFing of emails plus their attachments; and another export in HTML, for purposes of PDFing here and now, without attachments.  I tested this with one of the gibberished emails and found that, when exported from T-bird as HTML using ImportExportTools, it did print to a good-looking PDF.  In that approach, the naming procedures used to rename the exported emails in the desired way -- containing date, time, sender, recipient, and subject information -- would apparently have to be preserved and reapplied, so that both exported sets -- the EMLs and the HTMLs -- would be named as desired.

To investigate these questions, I traced back one PDF that did not open -- that produced the error message quoted above -- and one that opened but was filled with gibberish.  The damaged one did not come from an email that originally contained attachments, and I was able to print that email directly from Thunderbird without problems, so I wasn't sure what the problem was there.  As a sample of the gibberish-filled PDFs, I chose the largest of them all:  a 3,229-page PDF produced from a little two-page email that did originally have an attachment.  I sampled three other PDFs containing gibberish.  All three had come from emails that originally had attachments.  So it did appear that attachments were foiling my simplistic approach of just changing file extensions from EML to HTM.  I wondered whether it was too late to just change the extensions back to .eml, for the ones that had not produced good PDFs, and maybe PDF them manually.  I tried with one, and it worked.  So that would have been a possibility, assuming I had time for printing emails one by one.
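Changing the extensions back would at least have been easy to script.  A minimal batch-file sketch, assuming a hypothetical badones.txt listing the base name (without extension) of each file whose PDF had failed, one per line:
rem Hypothetical:  badones.txt lists the base names of the files whose PDFs failed.
FOR /F "usebackq delims=" %%g IN ("badones.txt") DO REN "%%g.htm" "%%g.eml"
That would have restored the .eml extensions; the one-by-one printing would still have been the slow part.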

It seemed the gibberish might not be gibberish after all.  It might be a digital representation of the photograph or whatever else had been attached to the email.  I didn't know of a way to test text for gibberish, so this didn't seem to be a problem that I could deal with very effectively at this point.  One option was to name some files as HTM, as I had done, and just accept a certain amount of gibberish -- perhaps after screening out the really large PDFs (or, earlier in the process, the large EMLs, TXTs, or HTMs), which seemed most likely to have had attachments; a rough size-screening sketch appears below.  Another option was to rename them all as TXTs and print them that way, looking solely for the text content without regard to their appearance (and still probably getting gibberish); if I needed to know how an email originally looked, I would have to go back to its archived EML version.  A third option was to go back to T-bird and re-export everything as HTML, thereby skimming off the attachments, then use my saved renaming spreadsheets to rename the newly produced, roughly named HTMs, and then do my PDFing from those new HTMs.  Presumably the new HTMs would print correctly, since they would not have attachments.
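Had I pursued that screening idea, one rough way to flag the likely attachment-bearers would have been a simple size test on the source files.  A sketch, with an arbitrary 500 KB cutoff and an arbitrary output filename:
forfiles /s /m *.htm /c "cmd /c if @fsize GTR 512000 echo @path" > large-htms.txt
Anything listed in large-htms.txt would then have been a candidate for separate handling rather than being fed to wkhtmltopdf.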

Back to the Drawing Board:  T-Bird to HTML to PDF

I decided to try that third option.  I went back to Thunderbird and used ImportExportTools to export the emails as HTML rather than as EML.  It would have been more logical to start by PDFing these HTMLs, to make sure that would work; but at this point I had such a clutter of emails in various formats that I decided to proceed, as before, with the renaming process first, so as to be able to delete those that I wasn't going to need.  Having already worked through the process of renaming to the point of achieving final names, I used directory listings and spreadsheets to try to match up the "before" names (i.e., the names of the raw HTML exports) and the "after" names (i.e., the final names I had developed previously). 
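The mechanics were the same as in my earlier renaming passes, so I will just sketch the sort of thing involved (the list filename and spreadsheet layout here are illustrations, not my actual files).  A bare listing of the export folder,
dir /b > before-names.txt
went into column A of a spreadsheet; the final names, once matched up, went into column B; and a formula along these lines then produced one REN command per row:
="ren "&CHAR(34)&A1&CHAR(34)&" "&CHAR(34)&B1&CHAR(34)
Pasting the resulting commands into a batch file and running it in the export folder would then do the renaming.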

Once I had the emails in individual HTML files with workable filenames, I ran wkhtmltopdf again.  I started by taking a directory listing of the files to be converted; I put those filenames into a spreadsheet, as before; and in the spreadsheet I used more or less the same wkhtmltopdf formula shown above to produce working commands.  These pretty much succeeded:  I was now getting good PDFs from the emails.  The main quirk was that wkhtmltopdf seemed to have a habit of wrapping lines severely, or perhaps indenting them too much.  That is, if I wrote an email in reply to someone else, the text of my email would look fine,
but the text of the message
to which I was replying,
typically shown below the
reply text, would be
indented and then broken
like this.
Wkhtmltopdf converted HTML files to PDF at a rate of somewhat more than one email per second.  Of course, these were small files, as email messages tend to be.  One problem was that so many small PDFs collectively took up a lot of disk space; it seemed I might have been well-advised to format the drive with a smaller-than-default cluster size.  The program also slowed down considerably at times; I assume it was running into complexities in some of the HTML files.

The batch file ran and finished, but it had converted only about half of the HTMLs into PDFs.  I decided to test the PDFs before deleting the corresponding HTMLs.  I opened a half-dozen of them without a problem.  Then, for a more thorough test, as described in a separate post, I ran an IrfanView batch conversion from PDF into RAW format.  I chose RAW because it would result in just one file.  TIF might have been another possibility.  It did appear that this process was all working well.  Ultimately, these steps converted all of the HTMLs into PDFs. 
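The folder comparison itself was not complicated.  Here is a minimal sketch of the kind of check involved, assuming (as in the formula above) that the source HTMs sat in the current folder and the PDFs landed in ..\Converted; the list filenames are arbitrary:
@echo off
rem Build sorted lists of base filenames (no extensions) on each side.
if exist htm-names.txt del htm-names.txt
if exist pdf-names.txt del pdf-names.txt
for %%f in (*.htm) do >> htm-names.txt echo %%~nf
for %%f in (..\Converted\*.pdf) do >> pdf-names.txt echo %%~nf
sort htm-names.txt /o htm-names.txt
sort pdf-names.txt /o pdf-names.txt
rem FC reports "no differences encountered" if every HTM has a matching PDF.
fc htm-names.txt pdf-names.txt
Names appearing in the first list but not the second would be conversions that still needed attention.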

Summary

The first part of what I was able to achieve, at this point, was to export my emails from Thunderbird to EML format, using the ImportExportTools add-on.  Once I had exported all those EMLs, I used a zipping program (either WinRAR or 7zip) to bundle each year's emails into a single archive.  I took these steps because EML files, unlike HTML, PDF, JPG, TXT, or other formats, were able to contain email attachments along with the text of the email messages.  I planned to keep these year-by-year ZIPs of EMLs until some point when I could find a cheap and broadly accepted program for printing both the email message and its attachments into a single PDF.
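The bundling itself was just an ordinary archive operation.  As a sketch of the sort of command involved -- using 7-Zip's command-line program, with a typical installation path and hypothetical folder names:
"C:\Program Files\7-Zip\7z.exe" a "D:\Archives\2004 Emails.zip" "D:\Mail\2004\*.eml"
One such archive per year was the idea; WinRAR could do the equivalent from its own interface or command line.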

The other main achievement was to work out a process for converting HTMLs (also exported from Thunderbird via ImportExportTools) into PDFs.  I used wkhtmltopdf for this purpose.  I ran it in a batch file, produced by a spreadsheet, so that there was one command per file.  I used DIR folder comparisons and other means to test that all files were being converted and that they were being converted into valid PDFs.