
Wednesday, May 23, 2012

A Million-Day Calendar with Explicit Julian-Gregorian Comparison

I wanted to look up a historical date.  Specifically, I wanted to know which day of the week it occurred on.  As I was looking for an answer to that question, I gradually came to the impression that there did not exist a standard calendar.  I decided to build one.  This post describes that process.

Someone may already have created what I was looking for.  But I wasn't finding it.  What I was looking for was, simply, the Official Calendar.  Of the United States, of the Catholic Church, it didn't matter -- just an official calendar that some reputable body had actually committed to print (preferably with explanations, and without errors).

What I was finding, instead, was lots of rules about how to calculate an official calendar, as well as various tools that would assist in those calculations.  This was fine, as far as it went.  But we don't generally tell people who prefer the Celsius temperature scale to just use the Fahrenheit and convert it.  Instead, people living in places that use Celsius have thermometers that show them the literal answer, without the need for a manual conversion process.  I wanted something like that for calendar dates.

Ultimately, I created a calendar covering a million days, starting on January 1, 500 BC. I produced that calendar as a spreadsheet, printed it as a PDF, and made both available for download. I don't often revisit this post. As of an update in early 2023, these materials are available for download through my SourceForge project page or at Box.com or MediaFire. See also my download blog post.

The PDF is a 38MB, 10,000-page document.  I would not recommend printing more than necessary.

Assumptions and Calculations
Built into the Million-Day Calendar

I chose Excel 2010 to develop the calendar because that version of Excel could accommodate somewhat more than a million rows.  I did not use Excel's built-in date arithmetic, though, because of its known errors.  That is, I did not ask Excel to calculate the necessary dates automatically.  Instead, I calculated them in a semi-manual process.  The process was not entirely manual, because I did not calculate row by row, day by day, for each of the million days shown.  Instead, I developed formulas that would count forwards or backwards from a certain date, and I applied those formulas to the million rows, usually broken into several segments due to historical changes in calendar calculation.  There were some manual adjustments as well.

I found that a million days would cover approximately the period from January 1, 500 BC to the year 2238 AD.  This seemed like a good range for most purposes. For dates outside this range, there would still be the option of using a formula or calculator, or of adding another tab to extend the spreadsheet.

As shown in the preceding paragraph, I was inclined to use AD and BC to refer to calendar eras.  AD is short for Anno Domini (Latin for "in the year of the [or "our"] Lord"), and BC is short for "Before Christ."  AD and BC were thus based on an early medieval calculation of the number of years before or after the birth of Jesus.  This religious origin paralleled other religious origins in the calendar (e.g., "Thursday" deriving from "Thor's Day").  Instead of AD and BC, some writers (apparently a minority, even among non-Christians) preferred to use CE (short for "Christian Era," "Common Era," or "Current Era") and BCE.

Traditional chronology did not incorporate a year zero (i.e., 0 AD or 0 BC).  That is, the calendar went directly from 1 BC to 1 AD.  The original concept may have been that there was no need for a year zero, since Jesus was not born until the start of the first year of his life (incorrectly calculated as 1 AD).  This variation would make no practical difference in the AD era:  for example, the number 2012 represented the year in which this post was written.  It would lead to difficulties in the BC era, however.  For instance, the rule on leap years (involving division by 4) would produce a leap year in the year 4 AD and, before that, in the year 0; but since there was no year 0, the prior leap year was in 1 BC.  Hence, traditional BC dates did not fit exactly with the rule that leap years are evenly divisible by 4.
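
The missing year zero can be handled in calculations by converting traditional BC years to "astronomical" numbering (1 BC = 0, 2 BC = -1), after which the divisible-by-4 rule works again.  A small Python sketch (the spreadsheet itself used Excel formulas, so this is only an illustration of the proleptic Julian rule):

```python
def astronomical_year(year, bc=False):
    """Convert a traditional year to astronomical numbering (1 BC -> 0, 2 BC -> -1)."""
    return -(year - 1) if bc else year

def is_julian_leap(year, bc=False):
    """Proleptic Julian rule: leap year when the astronomical year is divisible by 4."""
    return astronomical_year(year, bc) % 4 == 0

# 1 BC is astronomical year 0, hence a Julian leap year; 4 AD is the next one.
print(is_julian_leap(1, bc=True))   # True
print(is_julian_leap(4))            # True
print(is_julian_leap(2, bc=True))   # False (astronomical year -1)
```

This shows why traditional BC dates do not divide evenly by 4: the leap years run 1 BC, 5 BC, 9 BC, and so on.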

The calendar in effect at the time of Jesus was the Julian calendar, introduced by Julius Caesar in 46 BC -- a year which, by decree, was 445 days long.  The Julian calendar was revised several times, finally stabilizing in 4 AD.  For present purposes, the key innovation of the Julian calendar was the decision to define the year as equal to 365.25 days, adjusted via leap years in every year evenly divisible by 4 (e.g., 2008, 2012, 2016).  The Julian calendar eliminated the leap month Mercedonius but did not otherwise significantly change the names or lengths of months.  For purposes of year numbering, epochs (i.e., reference years) in the early centuries of the Julian calendar commonly followed regnal systems based on the current ruler or other officials (e.g., "January 1 in the second year of the reign of the Emperor Justinian"), but there was a semi-chaos of other epochs as well.  For instance, the Anno Mundi era started from calculations of the date on which the world was created, and the Ab urbe condita era started from the hypothesized date when Rome was founded.

The big change after the institution of the Julian calendar came in 1582 AD, when Pope Gregory XIII introduced the Gregorian calendar.  The Gregorian reform assumed the use of AD rather than regnal or other epoch systems; the AD epoch concept had been gradually spreading during the Middle Ages.  Gregory's principal contribution was to revise leap year calculations.  Over the centuries, the Julian calendar had become increasingly inaccurate with respect to the actual equinox.  That is, the calendar might say that it was March 21 -- the time for Easter -- and therefore daytime and nighttime should each be about 12 hours long; but in fact the actual equinox would already have occurred more than a week earlier.

In other words, the Julian calendar was falling behind the real world because the calendar was inserting too many leap years.  The extra leap days were making the Julian calendar late:  it would say the date was only March 11, when it really should have been March 21.  Gregory thus removed ten days from the calendar for October 1582, to catch up, and also changed the leap year calculation slightly.  The Gregorian rule for leap years was that every year evenly divisible by 4 would still be a leap year, except that years evenly divisible by 100 would not be leap years unless they were also evenly divisible by 400.  So 1700, 1800, and 1900 would not be leap years, but 1600 and 2000 would be.
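
The Gregorian rule reduces to a one-line test.  A Python sketch (again, merely illustrative; the spreadsheet used Excel formulas):

```python
def is_gregorian_leap(year):
    """Gregorian rule: divisible by 4, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# The century years mentioned above:
for y in (1600, 1700, 1800, 1900, 2000):
    print(y, is_gregorian_leap(y))   # only 1600 and 2000 are leap years
```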

This adjustment was still not perfect, but because of gradual slowing in the Earth's rotation, it was apparently pretty close.  The slowing issue, which I did not explore, may have been related to the difference between the tropical year and the sidereal year.  The Julian and Gregorian calendars were apparently based on the tropical year, which was the amount of time that it took the Sun (as seen from Earth) to come back to the same place as it was on the previous vernal (spring) equinox.  The sidereal year was an alternative to the tropical year:  it was the amount of time that it took Earth to return to the same relative position as it had occupied a year earlier, as measured with reference to certain stars.

These findings about the Julian and Gregorian calendars called for some decisions, for purposes of constructing a million-day calendar.  One such problem had to do with the present day.  My computer might tell me that it was May 6, 2012.  This would be a date in the Gregorian calendar.  Its appearance on my computer, my wristwatch, and everywhere else would testify to Gregory's widespread success.  I knew, however, that there was also a Chinese New Year and a Jewish calendar and all sorts of other calendars that still had meaning for various cultural and religious purposes, as well as the similarly named but essentially unrelated Julian Year system used in astronomy.  Even the Julian calendar continued to be used in Eastern Orthodox churches.  I decided that the intended spreadsheet approach to the million-day calendar might enable others to add these alternative calendars as they wished.  Because of the size of the spreadsheet and the relative rarity and potential complexity of these other calendars, however, I decided that I would not try to build any of these alternatives into the calendar myself, but would instead focus on the Julian and Gregorian calendars that predominated in the West during the timeframe addressed in the million-day calendar.

Another problem had to do with adoption dates.  The Gregorian adjustment of October 1582 specified that the Julian calendar would end on October 4, 1582; the days of October 5 through October 14 (inclusive) would not exist; and the Gregorian calendar would begin on October 15, 1582.  This rule was adopted at very divergent rates:  immediately, in several Roman Catholic countries, but elsewhere with considerable delays and confusion continuing into the 20th century.  The problem here, then, was that October 5, 1582 did not exist in Spain, and yet someone in England could be staring at a letter dated October 5, 1582, and that would make perfect sense according to the Julian calendar, which would continue to be used in England until 1752 (at which point England would need to delete eleven days, not ten, to get in sync with the Gregorian reform).  During the transition period in England, people commonly used the terms "Old Style" (abbreviated as "O.S." in English, and as "st.v." in Latin) to refer to the Julian date, and "New Style" ("N.S." or "st.n.") to refer to the Gregorian date.

As just described, the Gregorian calendar officially began (and was officially implemented in some places) on October 15, 1582; the Julian calendar officially ended on the preceding day, which (according to the Julian) was October 4, 1582.  But one could also say that October 4, 1582 (Julian) was the same as October 14, 1582 (Gregorian).  This way of looking at the matter would require proleptic (i.e., anachronistic) calculations.  Specifically, there would be a proleptic Gregorian calendar for all dates before October 15, 1582 on the Gregorian calendar, and there would also be a proleptic Julian calendar for all dates before January 1, 4 AD on the Julian calendar.  October 13, 1582 (Gregorian) would be the same as October 3, 1582 (Julian); October 12 (G) would be the same as October 2 (J); and so forth, back in time.

Since the Gregorian calendar did not exist before 1582, the statement that the Battle of Hastings occurred on October 14, 1066 would imply that it was October 14 according to the Julian calendar, not the Gregorian.  While it could be confusing to cite proleptic Gregorian dates for events that were made part of history according to the Julian calendar, there seemed to be some applications for which a proleptic Gregorian calendar could be useful.  For example, someone might be interested in determining whether a certain event happened on the actual equinox, as distinct from the date represented as the equinox in the Julian calendar.  In developing the million-day calendar, I thought it would thus be useful to display Julian and Gregorian dates side-by-side, so as to confirm the accuracy of the calendar and/or of others' conversions between the two, as described more fully below.

To a much greater degree than the proleptic Gregorian calendar, it seemed that the proleptic Julian calendar could be useful for a variety of historical situations.  The concept here was, in essence, that one could work backwards to construct a Julian calendar for dates long before Julius Caesar, and could use that calendar to construct a list of standard dates when various historical events occurred.  Although sources rarely seemed to specify what calendar they were using, it appeared that the proleptic Julian calendar was in fact being used widely for this purpose.  There would certainly be scholarly disputes as to the conversion of ancient chronologies to Julian calendar terms (so as to interpret, for instance, a statement that a certain event occurred in the 245th year since the founding of Rome), but at least the calendar system itself would be consistent over centuries.

Developing and Testing the Million-Day Calendar

I added proleptic Julian calendar calculations to the million-day calendar. I started these calculations by adding a separate Julian Days table to the spreadsheet. The concept of the Julian Day was proposed by Joseph Scaliger in 1583. Julian Days were simply a count of days, beginning (for astronomical and historical reasons) with Day Zero at 12:00 noon on January 1, 4713 BC. (Julian Days could include decimal values for fractions of a day -- e.g., 0.083 of a day is about two hours, so Julian Day 0.083 corresponded to roughly 2 PM, since each Julian Day began at noon.) So, for instance, Julian Day 7 arrived at noon on January 8, 4713 BC.

There were no years in the Julian Day system, but Julian Days could be used to calculate the proleptic Julian calendar, in which every fourth year would be treated as a leap year.  Because there was no Year Zero in the Julian calendar, Scaliger's first year of 4713 BC was a leap year.  (That is, in a system that had a Year Zero between 1 BC and 1 AD, 4713 BC would have been called 4712 BC.)  The resulting calculations produced Julian dates, in the spreadsheet, that were consistent with those reached by John Herschel in his Outlines of Astronomy (1849, p. 595).  Specifically, January 1, 4004 BC was Julian Day 258,963; the destruction of Solomon's Temple (which Herschel put on May 1, 1015 BC) was on Julian Day 1,350,815; and Rome's founding (which Herschel put at April 22, 753 BC) was on Julian Day 1,446,502.  Moving into the million-day period beginning on January 1, 500 BC (Julian Day 1,538,799), the spreadsheet matched Herschel's calculation that the Julian calendar reformation of January 1, 45 BC occurred on Julian Day 1,704,987; the Islamic Hijra calendar began on Julian Day 1,948,439 (July 15, 622 AD); and the official last day of the Julian calendar (October 4, 1582) was Julian Day 2,299,160.  It tentatively seemed that the spreadsheet's Julian calendar portion was accurate.
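
Those Herschel benchmarks can be reproduced with a standard integer algorithm for converting proleptic Julian dates to Julian Day numbers.  This is a Python sketch of that standard algorithm, not the spreadsheet's actual formulas:

```python
def julian_to_jdn(year, month, day, bc=False):
    """Julian Day number (at noon) for a (proleptic) Julian calendar date.

    BC years are first converted to astronomical numbering (1 BC -> 0),
    since the Julian calendar has no year zero.
    """
    y = -(year - 1) if bc else year      # astronomical year
    a = (14 - month) // 12               # 1 for Jan/Feb, else 0
    m = month + 12 * a - 3               # March = 0 ... February = 11
    y = y + 4800 - a
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

print(julian_to_jdn(4713, 1, 1, bc=True))   # 0 (Scaliger's Day Zero)
print(julian_to_jdn(4004, 1, 1, bc=True))   # 258963
print(julian_to_jdn(1015, 5, 1, bc=True))   # 1350815 (Herschel's Temple date)
print(julian_to_jdn(753, 4, 22, bc=True))   # 1446502 (founding of Rome)
print(julian_to_jdn(500, 1, 1, bc=True))    # 1538799 (start of the calendar)
print(julian_to_jdn(1582, 10, 4))           # 2299160 (last official Julian day)
```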

I also added Day of Week calculations to the spreadsheet, beginning with the common assertion that January 1, 4713 BC was a Monday (in, implicitly, the proleptic Julian calendar).  For the dates cited in the preceding paragraph, these calculations indicated that January 1, 4004 BC was a Saturday; May 1, 1015 BC was a Friday; April 22, 753 BC was a Tuesday; January 1, 500 BC was a Thursday; January 1, 45 BC was a Friday; July 15, 622 AD was a Thursday; and October 4, 1582 was a Thursday.  Further, I extended the Julian calendar beyond its official end to Thursday, November 7, 2238 AD (Julian Day 2,538,798).  According to the spreadsheet (and also the Julian Day arithmetic, i.e., Julian Day 2,538,798 minus Julian Day 1,538,799), that was the millionth day (inclusive) from Thursday, January 1, 500 BC.  These particular Julian Day numbers and day-of-the-week calculations matched the values produced by an online date calculator appearing on a NASA webpage.  It tentatively seemed that the spreadsheet's Julian Day calculations were corresponding accurately with Julian calendar dates.
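
Since Julian Day 0 fell (by the common assertion) on a Monday, the day of the week follows directly from the Julian Day number modulo 7.  A Python sketch:

```python
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday")

def weekday_from_jdn(jdn):
    """Day of week for a Julian Day number; Day 0 was a Monday."""
    return WEEKDAYS[jdn % 7]

print(weekday_from_jdn(1538799))   # January 1, 500 BC -> Thursday
print(weekday_from_jdn(2299160))   # October 4, 1582  -> Thursday
print(weekday_from_jdn(2538798))   # November 7, 2238 -> Thursday
```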

Next, I produced a proleptic Gregorian calendar in the million-day calendar, adjacent to the Julian calculations.  The starting point for this calendar's calculations was its commonly recognized starting date of Friday, October 15, 1582.  As noted above, the preceding day of October 14 on the Gregorian calendar (G) (if such a date had officially existed on that calendar) would have been Thursday, October 4 on the Julian calendar (J).  So the spreadsheet's presentation of Julian and proleptic Gregorian dates had to match up on the row containing the values of October 14 (G) and October 4 (J).  That is, both had to have the same Julian Day value of 2,299,160.  From October 14, 1582, I extended the Gregorian calendar back to January 1, 500 BC.  I decided not to extend this proleptic Gregorian calendar back into the period before 500 BC, though there were situations in which such an extension might have been useful.

There were some interesting things in the relationship between the proleptic Gregorian calendar and the Julian calendar.  At the starting point in the 16th century AD, the Gregorian dates were later than the Julian.  As just noted, October 14, 1582 (G) was equivalent to October 4, 1582 (J).  The Gregorian allowed fewer leap years, so the difference between it and the Julian began to narrow with each additional century (except for those evenly divisible by 400), going back in time.  The ten-day difference of 1582 thus became a nine-day difference on the first previous day when the formulas for the two calendars differed:  there was no February 29, 1500 (G), but there was a February 29, 1500 (J).

By the time one arrived back at the third century AD, the difference between the two calendars vanished.  That is, as noted by Peter Meyer, the two calendars had exactly the same dates from March 1, 200 AD to February 28, 300 AD.  This was no coincidence.  Gregory had designed his reform so that Easter would occur at about the same time as it had occurred in 325 AD, when the Council of Nicea (also spelled Nicaea) discussed such matters.  So during the century ending on February 28, 300 AD (J), both calendars showed the same dates (e.g., February 1, 300 (J) = February 1, 300 (G), and both are Julian Day 1,830,664).

Before the third century, the Gregorian calendar predated the Julian by progressively larger amounts, until January 1, 500 BC (J) would be represented as December 27, 501 BC (G).  Going back still farther, dates on the Julian calendar would continue to fall three days later every 400 years, so that January 1, 4713 BC (J) would arrive a month earlier on the Gregorian, in late November 4714 BC.  On the other extreme, in the centuries following 1582 AD, the Gregorian dates became progressively later than those of the extended Julian, until November 7, 2238 AD (J) was equivalent to November 22, 2238 (G).
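
These alignment points can be spot-checked with the standard integer algorithm for converting (proleptic) Gregorian dates to Julian Day numbers.  Again, this is a Python sketch of a well-known algorithm, not the spreadsheet's own formulas:

```python
def gregorian_to_jdn(year, month, day, bc=False):
    """Julian Day number (at noon) for a (proleptic) Gregorian date, no year zero on input."""
    y = -(year - 1) if bc else year      # astronomical year
    a = (14 - month) // 12
    m = month + 12 * a - 3
    y = y + 4800 - a
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)

# Correspondences discussed above:
print(gregorian_to_jdn(1582, 10, 14))      # 2299160 = October 4, 1582 (J)
print(gregorian_to_jdn(300, 2, 1))         # 1830664 = February 1, 300 (J)
print(gregorian_to_jdn(501, 12, 27, bc=True))  # 1538799 = January 1, 500 BC (J)
print(gregorian_to_jdn(2238, 11, 22))      # 2538798 = November 7, 2238 (J)
```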

I checked the foregoing dates and days of the week using another online calculator as well, produced by Fourmilab Switzerland.  I began by entering Julian Day numbers and then seeing what results this calculator would produce for Julian and Gregorian calendar dates.  This calculator took the approach of inserting a Year Zero in the proleptic Gregorian calendar, so its statement of BC dates differed from the values shown in the spreadsheet by one year.  For example, the Fourmilab calculator indicated that January 1, 45 BC (J) was equal to December 30, 45 BC (G), whereas the spreadsheet would put the latter as December 30, 46 BC (G).  Fourmilab's approach seemed incorrect in this regard.  For mathematical purposes (as, e.g., in the ISO 8601 approach, below), there would need to be a Year Zero; but the historical reality seemed to be that proleptic calculations in both Julian and Gregorian calendars did not have a year zero.  Fourmilab was not alone here; the conflation of mathematical consistency with historical fact had evidently produced some confusion in other computing situations as well.  At any rate, after adjusting for that divergence in BC years, the results of the Fourmilab calculator did match up with those yielded by the spreadsheet and the NASA calculator.  This calculator and the spreadsheet also agreed that February 1, 200 AD (G) was Julian Day 1,794,140 and was also February 2, 200 AD (J).  (The NASA calculator did not do proleptic Gregorian calculations.)

I looked at one other online calculator, produced by CSGNetwork.  I did not attempt a redundant comparison against all of the dates listed above.  Instead, I focused on the especially problematic period of the first two centuries AD.  In that timeframe, the CSGNetwork calculator seemed to be in error.  Specifically, a "Calendar Date Entry" of January 1, 1 AD yielded Julian Day 1,721,425.5.  The NASA and Fourmilab calculators and the spreadsheet agreed that January 1, 1 AD (J) should rather be Julian Day 1,721,423.5 or 1,721,424.  So if "Calendar Date Entry" in the CSGNetwork calendar was intended to refer to a Julian calendar date, its Julian Day output was incorrect.  It did not appear that the calendar intended to refer, rather, to a Gregorian calendar date of January 1, 1 AD, because it then stated that its Julian Day value of 1,721,425.5 was equivalent to January 3, 1 AD (G).  In that latter regard, it was correct.

To some unknown extent, online calculators presumably used formulas that had been devised to facilitate date calculations.  For example, Bill Jefferys presented a formula for converting Julian Days (and, perhaps, dates on the Julian calendar) to the proleptic Gregorian calendar, but indicated that it would be inaccurate before 1582, and especially for years before 400 AD.  Paul Dohrman offered a procedure for converting Julian to Gregorian, and J.R. Kambak offered one for conversions from Gregorian calendar dates to Julian Days.  Dohrman's approach, as I understood it, required these steps:
  1. Truncate to centuries (e.g., 622 AD becomes 6).  In the case of BC dates, treat them as negatives and start by subtracting a year first (e.g., 499 BC becomes -500, which becomes -5).  This calculation produces X.
  2. Calculate 0.75X minus 1.25.  So 622 AD » 6 » 3.25 (using » as shorthand for "becomes"), and 499 BC » -5 » -5.
  3. Truncate decimal points.  So 622 AD » 6 » 3.25 » 3.  This is the number of days to add to the Julian date to find the Gregorian.
This procedure produced some results consistent with the spreadsheet and the Fourmilab calculator, converting July 15, 622 AD (J) to July 18, 622 AD (G), and January 1, 500 BC (J) to December 27, 501 BC (G) (after Year Zero adjustment).  This procedure did not seem to work in the first two centuries AD, however.  For example, in the case of July 1, 1 AD (J), Dohrman's approach seemed to yield the incorrect value of June 30 (i.e., century 0 * 0.75 – 1.25) rather than June 29 (G).
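
A sketch of Dohrman's procedure as I understood it, in Python.  Note that `math.trunc` truncates toward zero, which matters for the 1 AD case just mentioned:

```python
import math

def dohrman_offset(year, bc=False):
    """Days to add to a Julian date to get the Gregorian, per Dohrman's steps.

    BC years are made negative and decremented first (499 BC -> -500).
    """
    y = -year - 1 if bc else year
    x = math.trunc(y / 100)              # step 1: truncate to centuries
    return math.trunc(0.75 * x - 1.25)   # steps 2-3: compute and truncate

print(dohrman_offset(622))           # 3: July 15, 622 (J) + 3 = July 18 (G)
print(dohrman_offset(500, bc=True))  # -5: January 1, 500 BC (J) -> December 27, 501 BC (G)
print(dohrman_offset(1))             # -1: reproduces the June 30 result noted
                                     # above, where the correct offset is -2
```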

There also seemed to be a problem with Kambak's long formula for converting Gregorian dates to Julian Days.  It is possible that I did not copy or interpret that formula correctly.  The version that I tested was as follows, where Y = Gregorian year, M = Gregorian month, D = Gregorian day, and JD = Julian Day:
JD = 367Y – 7(Y+(M+9)/12)/4 – 3((Y+(M–9)/7)/100+1)/4 + 275M/9 + D + 1721029
As I translated this into an Excel formula (placed into cell D2), it read as follows (assuming the values of Y, M, and D were entered into cells A2 through C2, respectively):
=367*A2-7*(A2+(B2+9)/12)/4-3*((A2+(B2-9)/7)/100+1)/4+275*(B2/9)+C2+1721029
That formula's results varied from those produced by the Fourmilab calculator for certain dates checked above, such as July 1, 1 AD (G) and October 14, 1582 (G).  The variance in these instances was very small, however.  Specifically, the values for those two dates produced by the formula and the Fourmilab calculator were 1,721,606 vs. 1,721,606.5, respectively (for July 1, 1 AD (G)) and 2,299,159 vs. 2,299,159.5, respectively (for October 14, 1582 AD (G)).  That is, the Fourmilab calculator exceeded the formula's output by only 0.5 day in each case.  Unfortunately, this variation was not consistent.  For July 15, 622 AD (G), the Fourmilab calculator produced a value of 1,948,435.5, which was 0.5 day smaller than the Julian Day value of 1,948,436 produced by the formula.  Moreover, for November 22, 2238 (G), the Fourmilab calculator's output of 2,538,797.5 was 1.5 days larger than the figure of 2,538,796 produced by the formula.  In each of these several instances, the spreadsheet agreed, again, with the results produced by the Fourmilab calculator, after rounding the latter's 0.5-day output upward.  It appeared, in short, that this formula was very close but not entirely accurate.
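
One possible source of the discrepancy is that each division in Kambak's formula may be intended as an integer (truncated) division, a common convention in published date formulas.  Under that reading -- which is a guess on my part, not a confirmed interpretation of Kambak -- a Python rendering of the formula yields whole-day values matching the spreadsheet and the Fourmilab calculator (after rounding the latter's half-day outputs upward) for the dates just tested:

```python
import math

def tdiv(a, b):
    """Division truncated toward zero, as in many date-formula conventions."""
    return math.trunc(a / b)

def kambak_jd(y, m, d):
    """Kambak's Gregorian-to-Julian-Day formula, read with truncated division (an assumption)."""
    return (367 * y
            - tdiv(7 * (y + tdiv(m + 9, 12)), 4)
            - tdiv(3 * (tdiv(y + tdiv(m - 9, 7), 100) + 1), 4)
            + tdiv(275 * m, 9)
            + d + 1721029)

print(kambak_jd(1582, 10, 14))   # 2299160
print(kambak_jd(622, 7, 15))     # 1948436
print(kambak_jd(2238, 11, 22))   # 2538798
print(kambak_jd(1, 7, 1))        # 1721607
```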

By this point, checking of the spreadsheet had begun to transition into critiques of the ways in which various calculators and other tools had interpreted and applied various sources (e.g., Tantzen, 1960). I took this as a preliminary indication of the potential usefulness of the million-day spreadsheet, at least where an explicit presentation of dates might facilitate visualization of calendar developments.  While further usage and testing would be helpful in identifying points at which errors might have crept into the spreadsheet, it did preliminarily appear that the spreadsheet could provide a useful tool for date calculations and conversions.

The ISO 8601 Refinement

I developed the Gregorian section of the spreadsheet in one additional way. The International Organization for Standardization (ISO) had produced a standard prescription (known as ISO 8601) for calculating dates.  This prescription appeared likely to be useful for a variety of purposes, so the spreadsheet contains a column devoted to it.

The ISO 8601 standard adopted Gregorian date numbers. One effect of the standard, for present purposes, was to prescribe standard ways of representing dates. There was a YYYY-DDD ordinal date option, using the day of the year, in which a day 366 would exist only in leap years (e.g., 2012-366 = December 31, 2012).  In the spreadsheet, I used the year-month-day format (e.g., 2012-05-06 = May 6, 2012). ISO year values were ordinarily displayed with four digits, padded with leading zeros for consistency (e.g., 0023 rather than 23).

A second effect of ISO 8601 stemmed from its adoption of a Year Zero, with apparently the same effect as what was sometimes called astronomical year numbering.  In this approach, before the epoch of 1 AD, the absolute value of the ISO year was one less than the traditional year (e.g., ISO year 0000 = 1 BC; ISO year –0001 = 2 BC). So the million-day calendar started on ISO date -0500-12-27 (i.e., December 27, 501 BC (G)). The numerical approach of ISO 8601, using minus signs instead of "BC" and likewise dispensing with "AD," had the advantage of avoiding controversy regarding the use of those two traditional modifiers.  The Fourmilab calculator (above) appeared to be implementing an ISO 8601 approach in its calculation of BC dates.
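
The Year Zero adjustment and zero-padding can be sketched as follows (Python, for illustration; the spreadsheet used Excel formulas):

```python
def iso_year_string(year, bc=False):
    """Format a traditional year in ISO 8601 astronomical numbering.

    1 BC -> '0000', 2 BC -> '-0001', 501 BC -> '-0500', 23 AD -> '0023'.
    """
    y = -(year - 1) if bc else year      # astronomical year
    return f"{y:05d}" if y < 0 else f"{y:04d}"   # zero-pad to four digits

print(iso_year_string(1, bc=True))    # 0000
print(iso_year_string(501, bc=True))  # -0500
print(iso_year_string(23))            # 0023
```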

With the Gregorian calendar presented in ISO format, it would have been possible to apply another kind of check to the spreadsheet's day-of-the-week column.  This check would have used what was known as the Doomsday technique.  That technique, useful for quickly calculating the day of the week for a given date, seemed unnecessarily complicated within the million-day calendar spreadsheet, where one could simply use the Julian Day.  That is, since Julian Day 0 occurred at noon on Monday, January 1, 4713 BC, every Julian Day evenly divisible by 7 would be a Monday.  This way of calculating the day of the week, for a given date on the Gregorian calendar, seemed to produce the same results as I had calculated by using a formula that copied, into each day-of-week cell, the name of the day that appeared in the 7th preceding row.

Official and Local Calendars

As previously noted, Gregory intended that the last day of the Julian calendar (October 4, 1582) would be followed by the first day of the Gregorian calendar (October 15, 1582).  That intention was followed in a number of countries and, at this writing, was implemented in various online calculators (e.g., those appearing on U.S. Naval Observatory and NASA webpages).  It appeared that 1582 was the most plausible candidate for the year in which the world converted from the Julian to Gregorian calendars.  In short, this combination of proleptic Julian (to 4 AD), Julian (from 4 AD to 1582 AD), and Gregorian (since 1582 AD) appeared to form the most credible version of the world's official calendar.  The spreadsheet thus expresses what appears to be the Official Calendar that I had sought at the outset.

Some remarks appearing in preceding paragraphs have already acknowledged certain aspects of that de facto official calendar.  For one thing, the concept of the Julian Day was built from a starting date calculated according to Julian reckoning, but came to serve as a means of cross-reference between the Julian and the later Gregorian calendars.  So the spreadsheet column that presents the Julian Day number corresponding to a particular day on the Julian or Gregorian calendar does not belong solely within either the Julian or Gregorian sections of the calendar.  Rather, it seemed to be best presented in the spreadsheet's Official Calendar section.

Likewise, a given date would be a Monday, or a Tuesday, or some other day of the week, regardless of the date number given to it on the Julian or Gregorian calendars.  So it would have been redundant to present separate day-of-week columns in each of those calendars' parts of the spreadsheet.  Instead, the day of the week appears just once, in the Official Calendar section.

That section also presents the official date, in two different formats.  First is the traditional format, using BC or AD indicators of era.  These traditional dates are provided in the somewhat condensed but still recognizable YYYY-MM-DD form.  As such, their components (e.g., the number of the month) are accessible for further date calculations, as users may desire, with the aid of Excel text functions (e.g., MID, FIND).  The column presenting the Official Date in Traditional Format is thus the specific statement of the Official Calendar in approximately the form that now appears to be used by most people.

Second, the spreadsheet also presents the official date in ISO format -- specifically, with minus signs and a Year Zero, modifying the traditional presentation.  To emphasize, this is the official date.  It uses the Julian calendar for dates before October 15, 1582, and therefore is not the ISO 8601 date.  It is simply an indication of how the traditional, official date looks when stated in ISO style for purposes of numeric calculations.

As noted above, substantial portions of the world did not adopt the Gregorian reforms in 1582.  The spreadsheet is adaptable for purposes of developing localized versions that may accommodate reforms implemented in later years.  In the process of preparing this post, I also found a useful calendar with local customizations at TimeAndDate.com, though a brief look suggested the presence of inaccuracies like those identified in other calculators (above).

Uses of the Million-Day Calendar

This post has explained the creation of a million-day calendar covering the period from 500 BC to 2238 AD.  That calendar is provided in spreadsheet format, one row per day.

This spreadsheet format seems to have facilitated identification of potential errors in certain tools designed to assist in use of, and interactions between, the Julian and Gregorian calendars as well as the Julian Day and ISO 8601 date systems.  It may prove useful in other contexts calling for calculations, demonstrations, or cross-comparisons among calendars and systems, including some that users may add.

The spreadsheet presentation may also be useful in less technical, more data-oriented applications.  Within the limits of computing power and spreadsheet capacity, there may be tasks that call for an ability to add columns of information, to be filled at a rate of one item per day (or week, or other time period).  For instance, at this writing, I would like to find a database (if one exists) that would show something like the leading headline of the day -- the sort of thing that one might expect to find on the front page of the New York Times, for instance, if that newspaper had existed on the day of the Battle of Hastings.  If no such database exists, perhaps this spreadsheet, shared among a number of potential contributors, could help to bring about its existence.

Thursday, March 1, 2012

Excel 2003: An Attempt to Sort Sentences or Filenames by Identifying Multiword Phrases

I had a bunch of filenames.  I wanted to know which two-word phrases appeared most frequently.  The purpose of this effort was to put those filenames into categories.  I hoped that the approach of finding significant two-word phrases in those filenames would help me automate the file-sorting process.  This post describes the first approach I took to identify those two-word phrases.

Parsing the Filenames

I hoped that two-word phrases would be a good size.  I figured the list would sort into many essentially one-word phrases (e.g., "a dog" would be trimmed to "dog," which would not have much significance) and a relatively small number of more significant two-word phrases (e.g., "big dog").   I hoped, that is, that I could easily filter the list for common words (a, an, the, if, of ...) and then count what was left.  It seemed, though, that I could still have taken approximately the approach described here if I had been looking for three-word or longer phrases.

The approach would have been similar if I had been examining sentences instead of filenames.  I would have gotten the sentences into individual rows by using Microsoft Word and replacing sentence ends (e.g., ? ! .) with the same punctuation followed by ^p characters, which Word would interpret as the start of a new line, and then pasting the text into Excel.  It might have taken a while to resolve sentence ends, given problems with e.g., "Mr.".  I did not have more than 65,000 filenames, so Excel 2003 would handle it.  Otherwise, I might have needed a newer spreadsheet (e.g., the Calc programs in LibreOffice or OpenOffice).

Consider this sentence or filename:  "I saw a dog - a big dog."  Using spaces as delimiters (as I intended to do), its two-word phrases would be "I saw," "saw a," "a dog," "dog -," "- a," "a big," and "big dog."  This approach would produce some junk, but I hoped it would be easily filtered.  Not worrying about the punctuation seemed faster than trying to anticipate endless punctuation combinations (e.g., "dog -- a").  It would fail in some instances (e.g., "dog--a"), but those too seemed likely to involve insignificant phrases.

That sentence contains seven spaces.  I could count the number of spaces manually.  With many sentences, it would make more sense to use a formula.  One approach would be to use Excel's SUBSTITUTE and LEN functions to calculate the difference in length between the sentence with and without spaces, and then use MAX to see which row has the largest number of spaces.  TRIM would remove excess spaces.
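The space-counting trick and the resulting phrase list can be sketched in Python (a stand-in for the spreadsheet formulas, not the spreadsheet itself):

```python
def count_spaces(text: str) -> int:
    # Mirrors the Excel trick: LEN(text) - LEN(SUBSTITUTE(text, " ", ""))
    return len(text) - len(text.replace(" ", ""))

def two_word_phrases(text: str) -> list:
    # Split on single spaces, as the spreadsheet does, so punctuation
    # stays attached to its word (the junk to be filtered out later)
    words = text.split(" ")
    return [a + " " + b for a, b in zip(words, words[1:])]

sentence = "I saw a dog - a big dog."
print(count_spaces(sentence))      # 7
print(two_word_phrases(sentence))  # seven phrases, "I saw" onward
```

Note that the final phrase comes out as "big dog." with its period attached, which is the kind of stray punctuation discussed below.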

Since each two-word pair would contain one space, the sample sentence would contain seven pairs of words.  Therefore, my spreadsheet needed enough columns to accommodate seven two-word phrases.  The approach I took was to set up the spreadsheet like this:



The top row indicated the number of the two-word phrase.  The first one was "I saw," in group 1.  The second row indicated the number of the space found.  For instance, the first space was at character position 2 (i.e., right after "I"), and the second space appeared at position 6 (after "saw").  This image shows only the first three groups; I would need four more.  As it shows, the first group required calculation of only two numbers because its starting point (character position 1) was already known.  I preferred this approach of breaking out the calculations visibly, rather than jamming them all together in one long, complicated formula.  It was usually faster this way, and it made it easier to spot errors.

I used formulas in the header rows to facilitate copying.  For instance, the formula in cell F2 was =F1, and in F1 it was =C1.  That way, once I worked out my formulas in row 3, I could select and copy cells F1 through I3 repeatedly across the spreadsheet, until I had enough column groups to accommodate all of the spaces in my longest filename or sentence.

To prepare the following list of formulas for the cells on row 3, I used a macro.  Those formulas were as follows:

C3  =FIND(" ",$A3)
D3  =FIND(" ",$A3,C3+1)
E3  =TRIM(LEFT(A3,D3))
F3  =IF(D3=LEN($A3),"",C3)
G3  =IF(F3="","",FIND(" ",$A3,F3+1))
H3  =IF(G3="","",IF(ISERROR(FIND(" ",$A3,G3+1)),LEN($A3),FIND(" ",$A3,G3+1)))
I3  =IF(H3="","",TRIM(MID($A3,F3,H3-F3+1)))
J3  =IF(H3=LEN($A3),"",G3)
K3  =IF(J3="","",FIND(" ",$A3,J3+1))
L3  =IF(K3="","",IF(ISERROR(FIND(" ",$A3,K3+1)),LEN($A3),FIND(" ",$A3,K3+1)))
M3  =IF(L3="","",TRIM(MID($A3,J3,L3-J3+1)))
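As a sanity check on those row-3 formulas, here is a Python rendering of the first three groups, using 1-indexed helpers that imitate Excel's FIND and MID.  The end-of-string tests (the IF guards) are omitted because the sample sentence does not trigger them:

```python
def FIND(sub: str, text: str, start: int = 1) -> int:
    """Imitates Excel FIND: 1-indexed position of sub at or after start."""
    return text.index(sub, start - 1) + 1

def MID(text: str, start: int, length: int) -> str:
    """Imitates Excel MID with 1-indexed start."""
    return text[start - 1:start - 1 + length]

A3 = "I saw a dog - a big dog."

C3 = FIND(" ", A3)          # 2
D3 = FIND(" ", A3, C3 + 1)  # 6
E3 = A3[:D3].strip()        # TRIM(LEFT(A3, D3)) -> "I saw"

F3 = C3                     # D3 <> LEN(A3), so F3 = C3
G3 = FIND(" ", A3, F3 + 1)
H3 = FIND(" ", A3, G3 + 1)
I3 = MID(A3, F3, H3 - F3 + 1).strip()  # "saw a"

J3 = G3                     # H3 <> LEN(A3), so J3 = G3
K3 = FIND(" ", A3, J3 + 1)
L3 = FIND(" ", A3, K3 + 1)
M3 = MID(A3, J3, L3 - J3 + 1).strip()  # "a dog"

print(E3, "|", I3, "|", M3)  # I saw | saw a | a dog
```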

So anyway, now I had a spreadsheet containing the two-word phrases that I wanted to sort and examine.  I made a backup and proceeded to convert the spreadsheet's formulas to values.  With the formulas frozen as values, I could sort and delete rows and columns without changing the results shown.  There appeared to be functions for this purpose in ASAP Utilities and possibly elsewhere.  An easy way to do this was to go to the top left-hand corner of the spreadsheet, hit Ctrl-A to select it all, and then hit Edit > Copy, Edit > Paste Special > Values > Enter > Enter.  I had usually found that the same keystrokes worked in later versions of Excel.  In this case, the keystrokes were Alt-E, C, Alt-E, S, V, Enter, Enter.

However achieved, the point is that cell C3 no longer contained the formula =FIND(" ",$A3).  Now it just contained the number 2, and it would continue to contain that number even if I deleted cell A3.  With that done, I could now delete the columns containing the numbers that I had used to detect my two-word phrases (e.g., columns C, D, F, G, H ...).  I sped up that process by doing a column sort (select all (Ctrl-A) and then Data > Sort > Options > Sort Left to Right > Row 2).  I put numbers at the tops of the Phrase columns, so as to keep their order straight.

Lining Up the Phrases

At this point, the cells toward the right side of my spreadsheet looked like this:
That is, the ends of the rows did not end at the same place.  In this illustration, the filename on row 2 had 17 two-word phrases, while the one on row 3 had only 16.  I wanted them all in a single column.  I wasn't sure if I needed them in proper order (in this example, P2, Q2, R2, followed by P3, Q3, R3), but I preferred that if possible.

I had tackled a similar problem previously.  That post still provided more details than I have written here.  But in some regards, I now saw a way to simplify that process a bit.  I deleted row 1, containing the column headings, as well as column 1, containing the original filenames.  They seemed unnecessary and potentially confusing at this point.  Then I created another worksheet (Insert > Worksheet).  Its default name was Sheet2.  In cell A1 in that worksheet, I entered a formula that referred to the corresponding cell in Sheet1, where my data were located.  The formula that I put into cell A1 in Sheet2 was =IF(LEN(Sheet1!A1)>0,CELL("address",Sheet1!A1),"").

That formula produced this output in Sheet2, cell A1:  [xA8.xls]Sheet1!$A$1.  (xA8.xls was the name of my spreadsheet file.)  I was going to use the preliminary information, the part that said [xA8.xls]Sheet1!, to tell the computer where a new entry began.  In other words, if the spreadsheet had given me just $A$1, I would have changed my formula so that Sheet2, cell A1 would contain something like this:  #$A$1.  Any odd character would have been sufficient in place of # as long as it gave the computer something unique to look for.

Next, I copied that formula from Sheet2, cell A1 to each cell that corresponded to my data in Sheet1.  In other words, in Sheet1 I hit Ctrl-End to discover that the bottom right corner of my spreadsheet was at cell Q14998.  So now, in Sheet2, I copied that formula from cell A1 to all cells down to Q14998.  A fast way to do that was to hit F5 and type Q14998 and hit Enter.  That took me to Q14998.  I put an X there.  Then I went back to A1 and hit Ctrl-C > Ctrl-Shift-End > Enter.  Boom!  Done.  I went back to Q14998 to double-check.  Then Alt-O, C, A to autofit the column widths.

Now I had a bunch of cells that had ugly stuff like cell A1 (above):  [xA8.xls]Sheet1!$A$1.  I hit Ctrl-A > Ctrl-C to copy it all.  I opened Notepad and hit Ctrl-V to paste it.  (If there had been too much data for Notepad, I think something like Notepad++ would have worked.)  The purpose of putting it into Notepad was to remove the column formatting.  (There were delays with some of these operations, as the Windows clipboard and the various programs wrapped their heads around the size of the task.)

Now I copied it all from Notepad to Microsoft Word.  Any editor would have worked, as long as it had a way of doing global search-and-replace operations that would produce line breaks.  That is, in Word, I replaced all instances of the ugly stuff (i.e., [xA8.xls]Sheet1!) with a newline character (^p).  Word reported that it had made about 170,000 replacements.  This gave me the desired list, starting like this:

$A$1
$B$1  
$C$1  
$D$1  

There were invisible tabs after the entries (e.g., it was actually $A$1 tab), and there were also some blank lines.  I cleaned these up with various ^p^p and ^t^p replace operations, repeated until there were no more.  The file was long -- about 3,700 pages.

I went back to Excel and deleted Sheet2:  I didn't need it anymore, and all its calculations would slow things down.  I created a new Sheet2 and pasted the results from Word into it.  I wasn't sure if Word had inserted any other funky invisible characters, so I ran the results through Notepad on the way home, just to be sure.

But now, oops, new problem.  My old Excel 2003 couldn't accommodate 170,000 rows.  It gave me an error:  "File not loaded completely."  I took a look.  Row 65536 contained a reference to $F$5911.  I could have transferred the whole project over to LibreOffice, which could handle 170,000 rows, but that would have been premature.  I had kept Notepad open, so now I searched for that item $F$5911 in the data that were still sitting there in Notepad.  I deleted everything down to that point, copied, and repeated in an adjacent column.  I had to do this a couple of times.  So now I had a spreadsheet that was three columns wide and 65,536 rows long.  Warning: do not print your results.

I inserted blank columns between these columns of output.  Those blank columns would contain the text corresponding to those cell references.  For instance, in Sheet2 cell A1, I now had $A$1.  So in the blank cell B1, I put this formula:  =INDIRECT("Sheet1!"&A1).  I copied that formula to all the other cells in those adjacent columns.  That gave me three sets of two columns each:  the first provided the reference, and the second contained my actual two-word phrase.

At this point, I froze all those values, as above, and deleted the reference columns.  So now I just had three columns containing two-word phrases.  In short, I had managed to isolate the phrases.  Now it was time to start slicing and dicing them.
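Looking back, the whole Sheet2/Notepad/Word detour exists to flatten a ragged grid of cells into a single column, preserving row order.  For comparison, a script can do the same flattening directly; the sample data here are hypothetical:

```python
# Each inner list is one spreadsheet row of phrase cells; rows are
# ragged, like the trailing blanks in columns P, Q, R of the post.
rows = [
    ["I saw", "saw a", "a dog"],
    ["big dog", "dog ran"],
]

# Flatten row by row, keeping the P2, Q2, R2, P3, Q3 ... order
# and skipping empty cells
one_column = [cell for row in rows for cell in row if cell]
print(one_column)
```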

Working with Unique Phrases

I wanted to filter those two-word phrases for unique ones.  For example, if there were a thousand two-word phrases that said "Email from," I wanted just one entry of that nature, with a numerical companion cell containing the number 1,000.
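That unique-plus-count filtering is exactly what a frequency counter does.  A short Python sketch with made-up phrases:

```python
from collections import Counter

# Hypothetical flattened list of two-word phrases
phrases = ["Email from", "Email from", "big dog", "Email from"]

counts = Counter(phrases)  # one entry per unique phrase, with its count
for phrase, n in counts.most_common():
    print(phrase, n)
```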

To do that filtering, I tried to use LibreOffice Calc, where I could get all that material into column A.  But Calc hung for about 15 hours before producing output.  (Excel also hung at a number of points in this whole process.)  I did not test Calc's output, but it appeared to have given me some repetitive items in that supposedly nonrepetitive list.  I went back to Excel, filtering each of the three columns and then combining the short output into one column and filtering that.  This gave me 7,539 unique two-word phrases.  That list passed the test for duplicates, but counts indicated that it was not accounting for all 169,891 two-word phrases.  The formulas to do those counts really slowed things down, so I deleted them for the moment.

I thought that it might help to clean up unnecessary characters.  I sorted each of the three columns according to starting and ending characters, so as to detect extraneous ones (e.g., various kinds of brackets, quotation marks, and other symbols).  I also did some global (Ctrl-H) searches in Excel to the same effect.  In addition, I deleted a few meaningless two-word pairs.  These efforts exposed me to the fact that many two-word phrases would consist, at least in part, of single letters (e.g., "A dog," "Roger P").  It seemed that I probably should have done global replace operations, even before breaking out the two-word phrases, so as to promote meaningful two-word combinations (e.g., "Roger P" and "P Smith" might be less recognizable than "Roger Smith").  Possibly a preliminary search for three- and even four-word proper nouns would have been a good approach there.
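The character cleanup described here can be expressed as a single rule: strip non-word characters from the ends of each word while keeping interior ones.  A hedged Python sketch (the exact character classes to strip would depend on the filenames at hand):

```python
import re

def clean_phrase(phrase: str) -> str:
    # Drop leading/trailing brackets, quotation marks, and other
    # symbols from each word; interior hyphens and apostrophes survive
    words = phrase.split(" ")
    cleaned = [re.sub(r'^[^\w]+|[^\w]+$', "", w) for w in words]
    return " ".join(w for w in cleaned if w)

print(clean_phrase('"big dog,"'))  # big dog
print(clean_phrase("- a"))         # a
```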

I also decided, along the way, that it would make sense to tolerate meaningful one-word terms (e.g., the "Pentium" word that would remain after removing the unhelpful "a" from "a Pentium").  That predictable conclusion raised questions about my sequence of steps generally.

After those cleanup efforts, I filtered again for unique two-word phrases and counted their occurrences.  Working with three source columns required some extra steps, mostly to eliminate duplicates, but ultimately my count of items found from searches for unique items was within 1% of the actual number of such items.  In other words, my filtering for unique items substantially accounted for all items.  Specifically, I had 163,391 two-word phrases, and in some apparent overcounting, my list of unique phrases accounted for 163,772 such phrases.

Because those counts did not jibe perfectly, I decided to take an elimination approach -- that is, to remove the two-word phrases that lacked significance, rather than selecting those that had it.  To help me in this process, I divided the two-word phrases into individual words, and then prepared to test each of them against a list of relatively insignificant words that I had cooked up for the occasion.  Before doing that, I had to screen again for stray punctuation (now that e.g., a comma had emerged in "soon, however" such that "soon," would not be found in an elimination of "soon") -- making it increasingly appealing to find some way of doing that earlier in the process.  Another step that would have been beneficially taken earlier, to the extent possible, would have been to convert date references (e.g., "2-13-09 project" vs. "Feb. 13 project") into a standard format, or else decide that none were significant.  It seemed, generally, that renaming the files to standardize various names and terms (or at least revising aliases for them) should have occurred earlier in the process.  Such changes seemed likely to shape this sentence-parsing process.  For instance, if full names were to be used (e.g., "Roger P. Smith"), then a two- rather than three-word focus could be wrongheaded.

Project Abortion

At this point, I had an unpleasant realization.  The need to pare down my spreadsheet, so as not to have to wait forever for calculations, and the loss of concentration due to long waits and the passage of time, had produced a situation where I had not realized that, in the past several hours, my work had ceased to be directly linkable back to the original list of files.  There was no index.  Worse, there had been global changes to remove troublesome punctuation, further reducing direct linkability.  I could continue with this effort, using backups to reconstruct the links.  But there seemed to be good reasons to reconceive the whole approach.

One emerging conclusion was that there just might not be a shortcut to a good rubric or other structured approach to the renaming of files, the coding of sentences, or the constructing of aliases for bits of text being examined in this kind of effort.  One question in that regard was whether there would be a good way to identify and change troublesome characters or combinations of characters up front.  That question led me to a second try.

Wednesday, February 29, 2012

Excel 2003: Print or Export the Formulas Used in Each Cell

I had a spreadsheet in Excel 2003.  (I suspect the approach used here would also work in other versions of Excel, but I have not tried it.)  I wanted to print out the formulas used in each cell.  I did a couple of searches and wound up in a thread where they were advising me to use a macro for this purpose.  The steps I used to set up the macro were similar to those that I had used for another Excel macro:

  1. Close all Excel files other than the one you're working on.
  2. Go into Tools > Macro > Visual Basic Editor > Insert > Module.
  3. Copy and paste macro text (see below) into the window.
  4. Go to File > Close and return to Microsoft Excel.
  5. In this case, I used the macro by going into Tools > Macro > Macros and running the ListFormulas macro.
The text of the macro -- what I copied and pasted into the module window -- was as follows:
Sub ListFormulas()
    Dim FormulaCells As Range, Cell As Range
    Dim FormulaSheet As Worksheet
    Dim Row As Long
    
'   Create a Range object for all formula cells
    On Error Resume Next
    Set FormulaCells = Range("A1").SpecialCells(xlFormulas, 23)
    
'   Exit if no formulas are found
    If FormulaCells Is Nothing Then
        MsgBox "No Formulas."
        Exit Sub
    End If
    
'   Add a new worksheet
    Application.ScreenUpdating = False
    Set FormulaSheet = ActiveWorkbook.Worksheets.Add
    FormulaSheet.Name = "Formulas in " & FormulaCells.Parent.Name
    

'   Set up the column headings
    With FormulaSheet
        .Range("A1") = "Address"
        .Range("B1") = "Formula"
        .Range("C1") = "Value"

        .Range("A1:C1").Font.Bold = True
    End With
    
'   Process each formula
    Row = 2
    For Each Cell In FormulaCells
        Application.StatusBar = Format((Row - 1) / FormulaCells.Count, "0%")
        With FormulaSheet
            .Cells(Row, 1) = Cell.Address _
                (RowAbsolute:=False, ColumnAbsolute:=False)
            .Cells(Row, 2) = " " & Cell.Formula
            .Cells(Row, 3) = Cell.Value
            Row = Row + 1
        End With
    Next Cell
    
'   Adjust column widths
    FormulaSheet.Columns("A:C").AutoFit
    Application.StatusBar = False
End Sub 
(Note that the format of this blog may wrap some lines.  Copying and pasting may yield better results than retyping.)  The author named in that code, John Walkenbach, provided this code and also offered a Power Utility Pak ($40) that contained this and other tools.  I had installed a couple of freeware utility collections -- ASAP Utilities and Morefunc -- and I hardly ever used them.  I checked the list of tools in ASAP Utilities, just in case, but didn't find anything quite along these lines.  A quick glance revealed no list of Morefunc utilities.

When I ran the macro (step 5, above), it seemed to hang my machine.  I was using a fairly large spreadsheet -- I probably should have tried it on something smaller -- but instead I went to bed.  I didn't know how long it took, but it worked.  When I awoke, it had created a new worksheet (i.e., a new tab at the bottom of the spreadsheet), with three columns:  Address (e.g., F2), Formula (e.g., =C2), and Value (e.g., 17).

Tuesday, February 14, 2012

Batch Merging Many Scattered JPGs into Many Multipage PDFs - Clarified

I had recently streamlined a spreadsheet-oriented technique for working up file names and batch commands to merge multiple JPGs into single- or multi-page PDFs.  I had yet another task calling for that process.  This post presents the same steps as in that previous writeup, but in clearer and more organized terms.  The various steps are described in different terms in the previous post and in its much more convoluted predecessor.  Note that the simplification of steps here yields a longer explanation.

Generate the Input PDF Filenames

In this discussion, the project had already been brought to the point where there was a spreadsheet containing the names of JPGs to be converted to PDF.  (A list of such files could come from a command like DIR D:\*.jpg /s /a-d /b > JPGLIST.TXT.)  In that spreadsheet, row 1 was reserved for column names (e.g., "Original Filename," "Path").  The full path of the first file to be converted appeared in cell A2 (e.g., "D:\Folder 1\Test File.jpg"), and the others appeared below it in column A.

Column B was used to calculate the position of the last backslash in the full path shown in column A (e.g., the backslash before the filename in "D:\Folder 1\Test File.jpg").  This calculation required, or at least benefited from, the use of a REVERSE function.  The REVERSE function was created by inserting the following text into a module:

Function Reverse(Text As String) As String
Dim i As Integer
Dim StrNew As String
Dim StrOld As String
StrOld = Trim(Text)
For i = 1 To Len(StrOld)
StrNew = Mid(StrOld, i, 1) & StrNew
Next i
Reverse = StrNew
End Function
To create the module, I closed all Excel files other than the one being worked on.  I went into Tools > Macro > Visual Basic Editor > Insert > Module.  I pasted the text there.  I went to File > Close and Return to Microsoft Excel.  The formulas in the next several cells were then as follows:
B2:  =LEN(A2)-FIND("\",REVERSE(A2))+1
C2:  =LEFT(A2,B2)
D2:  =MID(A2,B2+1,LEN(A2))
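These three formulas just split a full path at its last backslash.  A Python equivalent of the same reversal logic (the function name is mine):

```python
def split_path(full: str) -> tuple:
    """Mirror of the B2/C2/D2 formulas: find the last backslash by
    reversing the string, then split into folder and filename."""
    # B2: =LEN(A2)-FIND("\",REVERSE(A2))+1
    pos = len(full) - full[::-1].index("\\")
    # C2: =LEFT(A2,B2)   D2: =MID(A2,B2+1,LEN(A2))
    return full[:pos], full[pos:]

folder, name = split_path(r"D:\Folder 1\Test File.jpg")
print(folder)  # D:\Folder 1\
print(name)    # Test File.jpg
```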
I was reconstructing some of these steps after the fact; hopefully I was not misstating anything or leaving anything out of this description.  I focused on the formulas for row 2 because, when I was done filling out row 2, it would be easy to copy all of its formulas at once to fill the spreadsheet, so that the same formulas would be applied to the filenames in rows 3, 4, 5 et seq.

The next step was to identify which files I wanted to convert.  I was able to use the spreadsheet to help in this process.  For example, I had files with names like these:
Test File page 01.jpg
Test File page 02.jpg
I could determine that these belonged together by using a column that showed the ending X characters of these filenames.  That column would use a formula like =RIGHT(D2,E2), where I would put a number in E2.  The number I used in this example was 12, because I would have to eliminate 12 characters from the right end of "Test File page 01.jpg" to get the name that I wanted to give to the new PDF, which would be simply "Test File."  I manually entered 12 in cell E2.  I filtered the rows so that the ones with numbers in column E would not show, so as to reduce clutter.  I repeated this process until each row had a number in column E, indicating how many characters I had to trim from the right end of the filename to get a common stem.  Then cell F2 contained =LEFT(D2,LEN(D2)-E2), to produce the stem filename.  So, for example, if there were two pages that had to go into Test File.pdf, then (assuming I had sorted my spreadsheet correctly), the contents of cells F2 et seq. would look like this:
F2:  Test File
F3:  Test File
F4:  Another File
and so forth, continuing down the list, so that each JPG would be assigned to a stem filename.  I discovered at this point that I had brought over a bunch of JPGs that did not belong in a multipage PDF, or at least I could not be sure of it.  I put these back where they came from, and thus had a smaller set of files to work on.
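The stem computation in cell F2 amounts to dropping a per-row count of trailing characters.  In Python terms (a sketch, not the spreadsheet):

```python
def stem(filename: str, chars_to_trim: int) -> str:
    # F2: =LEFT(D2,LEN(D2)-E2) -- drop the trailing page-number and
    # extension characters counted by hand in column E
    return filename[:len(filename) - chars_to_trim]

print(stem("Test File page 01.jpg", 12))  # Test File
print(stem("Test File page 02.jpg", 12))  # Test File
```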

Now I could sort the spreadsheet to insure that the JPGs I wanted to merge into a specific PDF would be next to each other.  Once I had the files sorted into the desired order, I used column G as an Index column, numbering each row in that order.  This column would be useful to restore the desired order if I resorted the spreadsheet for some reason, and it would also provide the key ingredient in the unique filename that I was about to calculate.  (To generate an index number, I used either a +1 formula, incrementing the number in the preceding row by 1, followed by an Edit > Copy, Edit > Paste Special > Values sequence to convert those formulas to fixed numbers that would not be affected by resorting the spreadsheet; or, easier, I entered 1 in the first row, 2 in the second row, and then highlighted the cells to be filled and used Edit > Insert > Series to add the sequential numbers.)  Then I used columns G through I to calculate a simple, unique filename and produce a renaming command:
G2:  unique Index number
H2:  ="Input_"&REPT("0",4-LEN(G2))&G2&".pdf"
I2:  ="ren "&CHAR(34)&A2&CHAR(34)&" "&H2
Cell H2 would produce the file name of "Input_0001.pdf."  (More or fewer digits could be appropriate, depending on the number of files involved.)  This name worked for me because a search of my files verified that my computer contained no other files whose names began with Input_.  Otherwise, a different name might have been advisable.  Cell I2 would produce a command that would rename Test File page 01.jpg to be Input_0001.pdf.  (Of course, this wasn't a conversion.  IrfanView would be doing that.  This was just a translation of file names, to make subsequent steps easier.)
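The H2 and I2 formulas can be sketched as a small Python function that builds the zero-padded name and the ren command (the function name is mine; the four-digit padding matches REPT("0",4-LEN(G2))):

```python
def rename_command(index: int, original: str) -> str:
    # H2: ="Input_"&REPT("0",4-LEN(G2))&G2&".pdf" -- zero-padded name
    new_name = "Input_{:04d}.pdf".format(index)
    # I2: ="ren "&CHAR(34)&A2&CHAR(34)&" "&H2
    return 'ren "{}" {}'.format(original, new_name)

print(rename_command(1, r"D:\Folder 1\Test File page 01.jpg"))
# ren "D:\Folder 1\Test File page 01.jpg" Input_0001.pdf
```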

Then, as usual, I copied these commands from row 2 down to fill the remaining rows of the spreadsheet.  Next, I copied all of the commands from column I into Notepad, saved that file with a .bat extension (e.g., "Renamer.bat"), and ran it in Windows Explorer.  That renamed all of the JPGs that I hoped to merge into multipage PDFs.  (Obviously, I had a backup.)

Now I would use column J for commands to move all those JPGs to a single folder.  (I did the renaming before the moving in case I had two separate sets of incoming JPGs with the same name, such as D:\Folder A\Test File page 01.jpg and E:\Folder R\Test File page 01.jpg.)  In my actual example, the files were all moved to V:\Workspace, so my command in cell J2 looked like this:
J2:  ="move /-y "&CHAR(34)&C2&H2&CHAR(34)&" V:\Workspace"
I combined the commands in column J into another batch file, Mover.bat, and used it to move all those renamed JPGs to V:\Workspace.  Now that they were together, I could use IrfanView's File > Batch Conversion/Rename option to convert all those JPGs to single-page PDFs.  I output those PDFs to V:\Workspace and then removed the JPGs from V:\Workspace to V:\Workspace\Original JPGs, where they would probably not be needed further.

Generate the File Value Lines

So now I had files starting with Input_0001.pdf in V:\Workspace. The project now was to merge those single-page input PDFs into multipage output PDFs. In column K, I entered a number indicating where each input PDF should go. As I could see from column F (above), the files that used to be called Test File page 01.jpg and Test File page 02.jpg, which had now been converted to uniquely named single-page PDFs, were both going to end up in Output_0001.pdf, so both of those had the number 1 in column K.   I needed to pad out that number so that the resulting filename would sort correctly. So the next entries were as follows:
K2: the manually entered output number (e.g., 1)
L2: =REPT("0",4-LEN(K2))&K2
But I wasn't ready, yet, to generate the ultimate Output PDF files.  I needed to make sure that PDFsam (below) would have all the instructions needed to insert multiple single-page PDFs into one Output PDF.  Those instructions would come from lines of XML code, one line per input PDF.  The lines of XML code would be combined into a smaller number of FileValueLines files.  In my example, the instruction lines for the erstwhile Test File page 01.jpg and Test File page 02.jpg files (now called Input_0001.pdf and Input_0002.pdf) would both go into a single FileValueLines file.  A FileValueLines file could contain any number of lines of XML code; it just depended on how many single-page PDFs were going to be combined into it.  What I needed, then, was the command that would shovel those XML code lines into the appropriate FileValueLines files.  The next formulas in my spreadsheet looked like this:
M2:  ="FileValueLines_"&L2
N2:  ="echo ^<file value="&CHAR(34)&"V:\Workspace\PDF\"&H2&CHAR(34)&" /^> >> "&M2
and cell N2 produced this command:
echo ^<file value="V:\Workspace\PDF\Input_0001.pdf" /^> >> FileValueLines_0001
Note that this command assumed that the PDFs created by IrfanView were in a folder called V:\Workspace\PDF.  I filled the lower rows of the spreadsheet with these formulas and copied the commands in column N into a new Notepad file called _ValueMaker.bat.  I ran _ValueMaker.bat in V:\Workspace. It produced 545 FileValueLines files in V:\Workspace, as expected.  (I added an underscore before the file's name so that it would appear near the top of the potentially cluttered list of files in V:\Workspace.)
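The effect of _ValueMaker.bat is to append one XML file-value line per input PDF to the FileValueLines file for its output group.  A Python sketch of that grouping, with hypothetical assignments:

```python
# Hypothetical mapping: input PDF -> number of the output PDF it joins
assignments = [
    ("Input_0001.pdf", 1),
    ("Input_0002.pdf", 1),
    ("Input_0003.pdf", 2),
]

files = {}
for input_pdf, group in assignments:
    # One <file value=.../> line per input PDF, as in cell N2
    line = '<file value="V:\\Workspace\\PDF\\{}" />'.format(input_pdf)
    files.setdefault("FileValueLines_{:04d}".format(group), []).append(line)

for name, lines in files.items():
    print(name, "->", len(lines), "line(s)")
```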

This was a point of transition.  Until now, I had been working with a total of 3,894 spreadsheet rows, one for each of the 3,894 JPGs that I wanted to combine into multipage PDFs.  Each of those 3,894 JPGs (converted into single-page PDFs called Input_0001.pdf et seq.) was represented in exactly one File Value line.  Those 3,894 File Value lines were now contained within a total of 545 FileValueLine_ files.  In other words, we were transitioning from a focus on 3,894 individual JPGs to a focus on 545 multipage PDFs.

Building the XML Files

The 545 FileValueLines files (FileValueLines_0001 and so forth) needed more lines before they would function.  To add those lines, I needed a new spreadsheet.  I started this one with a list of the FileValueLines files I had just created.  (A command like "dir FileValueLines* /b /a-d > _FVLlist.txt" would produce that list.)  In other words, column A contained the list of the new FileValueLines files.  Next, I calculated the name of the resulting XML file, and then wrote the formula that would create those XML files:
B2:  =RIGHT(A2,4)
C2:  ="XMLfile_"&B2&".xml"
D2:  ="copy /b _Header.txt+"&A2&"+_Tailer.txt "&C2
That formula in D2 produced the command that I would need to create the complete XML files:  "copy /b _Header.txt+FileValueLines_0001+_Tailer.txt XMLfile_0001.xml."  This command called for me to create just one _Header.txt and one _Tailer.txt file that would be added to the start and end of each of my XML files.  The _Header.txt file read as follows:
<?xml version="1.0" encoding="UTF-8"?>
<filelist>
I saved those two lines in a new _Header.txt file.  It was important to make sure the file ended on a blank line.  In other words, if I moved my cursor down to the end of _Header.txt, it needed to be on the line below <filelist>, not at the end of the line containing <filelist>.  The same was true for _Tailer.txt, which contained this line:
</filelist>
Once _Header.txt and _Tailer.txt existed, I could run the commands contained in column D.  I copied those 545 commands into a new batch file called _Builder.bat and ran it.  That gave me 545 XML files, starting with XMLfile_0001.xml.  Again, V:\Workspace was filling up, so I moved the no longer needed FileValueLines files to their own archival subfolder.
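The copy /b concatenation produces XML files of a fixed shape: header, file-value lines, tailer.  A Python sketch of the assembled result (the paths are hypothetical):

```python
header = '<?xml version="1.0" encoding="UTF-8"?>\n<filelist>\n'
tailer = "</filelist>\n"
file_value_lines = (
    '<file value="V:\\Workspace\\PDF\\Input_0001.pdf" />\n'
    '<file value="V:\\Workspace\\PDF\\Input_0002.pdf" />\n'
)

# Equivalent of: copy /b _Header.txt+FileValueLines_0001+_Tailer.txt XMLfile_0001.xml
xml = header + file_value_lines + tailer
print(xml)
```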

Generating the PDFsam Commands

At this point, I would use a batch file to generate batch files.  This would call for the same kind of process as above:  a header text file (but no tailer) containing lines that would be standard in all of these batch files, combined with another text file containing unique commands.  In this new _HeaderB.txt file, I placed these lines:
@echo off

set JAVA=%JAVA_HOME%\BIN\JAVA

set JAVA_OPTS=-Xmx256m -Dlog4j.configuration=console-log4j.xml

set CONSOLE_JAR="C:\Program Files (x86)\pdfsam\lib\pdfsam-console-2.3.1e.jar"

@echo on
Apparently these first lines set up a few essential Java variables.  (In case of difficulty here, as elsewhere, see the preceding post.)  The second part of the batch file -- the unique command -- would tell PDFsam (below) what to do.  The command looked like this:
E2:  ="echo %%JAVA%% %%JAVA_OPTS%% -jar %%CONSOLE_JAR%% -l V:\Workspace\"&C2&" -o V:\Workspace\Merged\Output_"&B2&".pdf concat >> Command_"&B2&".txt"
Basically, this would produce a command that said, "Write a command that will create Output_*.pdf, and put that command in a file called Command_*.txt."  In other words, that spreadsheet formula in cell E2 produced a command that looked like this:
echo %%JAVA%% %%JAVA_OPTS%% -jar %%CONSOLE_JAR%% -l V:\Workspace\XMLfile_0001.xml -o V:\Workspace\Merged\Output_0001.pdf concat >> Command_0001.txt
Spreadsheet column E gave me 545 variations on that command.  I copied them all and ran them in a new batch file called _Texter.bat.  The resulting 545 Command_*.txt files would have to share V:\Workspace with my preexisting XMLfile_*.xml files; both would be needed shortly.
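The E2 formula can be expressed as a small function; a Python sketch that reproduces the command string quoted above (filenames and numbering follow the pattern in the post):

```python
def make_texter_command(xml_name: str, number: str) -> str:
    # Reproduce the E2 formula: echo a PDFsam concat command
    # into Command_<number>.txt via redirection
    return (
        "echo %%JAVA%% %%JAVA_OPTS%% -jar %%CONSOLE_JAR%% "
        f"-l V:\\Workspace\\{xml_name} "
        f"-o V:\\Workspace\\Merged\\Output_{number}.pdf concat "
        f">> Command_{number}.txt"
    )
```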

Now it was time to combine _HeaderB.txt and Command_*.txt into a batch file that would run PDFsam and produce a multipage Output PDF.  This was easy enough:
F2:  ="copy /b _HeaderB.txt+"&RIGHT(E2,16)&" Ready_"&B2&".bat"
The 545 versions of that command, run via _Commander.bat, created 545 Ready_*.bat files, each containing the header information from _HeaderB.txt plus the single command line from Command_*.txt.  Now I was done with the Command_*.txt files, and could archive them.
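The RIGHT(E2,16) trick relies on the Command_*.txt filename always being exactly the last 16 characters of the E2 string (and likewise RIGHT(F2,14) for the 14-character Ready_*.bat name). A quick Python check of that slicing:

```python
# The command produced by the E2 formula, as quoted earlier
e2 = (r"echo %%JAVA%% %%JAVA_OPTS%% -jar %%CONSOLE_JAR%% "
      r"-l V:\Workspace\XMLfile_0001.xml "
      r"-o V:\Workspace\Merged\Output_0001.pdf concat >> Command_0001.txt")

# RIGHT(E2,16) peels off the trailing 16 characters -- the Command_*.txt name
command_file = e2[-16:]

# The F2 formula then splices that name into a copy /b command
f2 = f"copy /b _HeaderB.txt+{command_file} Ready_0001.bat"
```

This is why fixed-width, zero-padded numbering (0001 rather than 1) mattered: it keeps every filename the same length, so the same RIGHT() count works on all 545 rows.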

Final Run:  Creating the Merged Output PDFs

If I double-clicked on one of those Ready batch files, it would run and produce a final merged Output.pdf.  But I wanted to run them all at once.  So, back to the spreadsheet:
G2:  ="call "&RIGHT(F2,14)&" >> _Errorlog.txt"
That produced "call Ready_0001.bat >> _Errorlog.txt."  I copied the 545 CALL commands from column G into a file called _Runner.bat and -- you guessed it -- I ran it.  It took a few minutes, but it produced 542 merged PDFs.  Not 545.  I searched _Errorlog.txt for occurrences of "Error."  Sure enough, three occurrences.  In each case, the error was the same:  "Input_0838.pdf (and Input_0947.pdf, and Input_1090.pdf) not found as file or resource."  Apparently PDFsam had not gone ahead to create the Output PDFs anyway, without the offending Input file.  I checked and, indeed, those three files were not included among my input.  I was not sure what had happened to them; they appeared to have disappeared at some point during this process.  I made a note to retrieve them from backup.
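Scanning _Errorlog.txt by hand can be automated; a sketch that pulls out the names of missing Input files (the exact log wording is an assumption, modeled on the error message quoted above):

```python
import re
from pathlib import Path

def find_missing_inputs(log_path: str) -> list[str]:
    # Collect the Input_*.pdf names that PDFsam reported as not found
    text = Path(log_path).read_text()
    return re.findall(r"(Input_\d{4}\.pdf) not found", text)
```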

Otherwise, though, the process seemed to have succeeded.  Spot checks indicated that I had working multipage PDFs.  Now it was time to rename them to be something other than Output_0001.pdf etc.  I went back to my first spreadsheet, saved it, and went to work on a temporary copy of it.  Specifically, I converted all of its formulas to numerical values.  (In Excel 2003, this involved going to the top left corner, selecting the whole spreadsheet, and using the Edit > Copy, Edit > Paste Special Values combination.)  With that done, I was free to delete and rearrange columns as needed, without having cells recalculate in undesirable ways.  I kept only columns C (showing the original source folder), F (showing the name I had chosen for the combined output PDF), and L (showing the output file number).  I rearranged those columns in reverse order, so that the output file number column could function as a rather redundant index.  I went to the second spreadsheet and, working likewise on a copy of that, deleted all columns except column B, containing the output file number.  In an adjacent column, I used VLOOKUP to look up the filenames, from the other spreadsheet, to which these Output_0001.pdf etc. files should be renamed.
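The VLOOKUP step is essentially a dictionary lookup keyed on the output file number; a sketch with hypothetical target names (the "Letter to Joe" name is borrowed from the renaming example in this post):

```python
# Column data from the first spreadsheet: output number -> chosen final name
# (these example names are made up)
chosen_names = {
    "0001": "Letter to Joe.pdf",
    "0002": "Tax Return 2003.pdf",
}

def rename_command(output_number: str) -> str:
    # Build a batch "ren" line equivalent to the VLOOKUP column
    target = chosen_names[output_number]
    return f'ren "Output_{output_number}.pdf" "{target}"'
```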

Through such steps, I renamed these files.  Yet something went wrong.  Due to distraction and time delay during the last stages of this process, somehow the renaming did not bring me back to a coherent set of files.  Spot checks indicated that the ones that did rename were renamed correctly.  That is, if a file wound up being renamed as Letter to Joe, it did seem to be exactly that.  But I was left with 60 files -- more than 10% of the total -- that did not rename at all.  There may have been a way to fix them through the spreadsheet, but I had run out of time for the project.  Under the circumstances, I defaulted to putting those files aside into a separate folder, pending a future manual renaming effort.

The process described in this post did seem to work.  But its numerous steps continued to hold the threat of something going wrong, somewhere along the line, and being difficult to retrace.  This did not seem to be nearly the last word on batch merging scattered JPGs into multipage PDFs.

Tuesday, January 3, 2012

Converting Scattered WMA Files to MP3

I had .WMA files scattered around my hard drive.  I wanted to convert them to .MP3.  I could search for *.WMA, using a file finder or search tool like Everything, thereby seeing that those files were in many different folders.  Having already sorted them, I didn't want to move them anywhere for the format conversion.  I wanted to convert them right where they were.  A command-line tool would do this.  The general form of the command would be like this:  PROGRAM SOURCE TARGET OPTIONS.  For PROGRAM, I would enter the name of the command-line conversion program that I was going to be using.  For SOURCE and TARGET, I would enter the full pathname (i.e., the name of the folder plus the name of the file, like "D:\Folder\File to Convert.wma," where the target would end in mp3 rather than wma).  OPTIONS would be specified by the conversion program.  For instance, there might be an option allowing me to indicate that I wanted the resulting MP3 file to be 64 kbps.

The problem was, I didn't have a command-line WMA to MP3 conversion tool.  I ran a search and wound up trying the free Boxoft WMA to MP3 Converter.  (They also had lots of other free and paid conversion and file manipulation programs.)  When I ran their converter, it steered me to an instruction file that inspired me to compose the following command (all on one line):

AlltoMp3Cmd "D:\Folder\Filename.wma" "D:\Folder\Filename.mp3" -B56
I had to use quotation marks around the source and target names in some cases (though not in this particular example) because some of the path or file names contained spaces.  The -B56 option was supposed to tell it to produce a 56 kbps MP3.  (I also tried it with a space:  "-B 56".)  I was able to produce similar commands en masse, for all of the WMAs that I wanted to convert, by exporting the results of the *.WMA search from Everything to a text file called wmalist.txt, making sure to remove entries for files that I did not want to convert.  (At the root of each drive containing files of interest, I could also have used this command, assuming wmalist.txt did not already exist:  dir *.wma /b /s >> D:\wmalist.txt.)  I then massaged the contents of wmalist.txt using Microsoft Excel.  So now I had all of these AlltoMp3Cmd commands ready to run.  I copied them all into a Notepad file named Renamer.bat.  All I had to do was double-click on it in Windows Explorer and it would run.
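The Excel massaging boils down to wrapping each line of wmalist.txt in one command; a Python equivalent (the AlltoMp3Cmd syntax follows the example above):

```python
from pathlib import PureWindowsPath

def wma_command(wma_path: str, bitrate: int = 56) -> str:
    # Turn one line of wmalist.txt into an AlltoMp3Cmd invocation,
    # quoting both paths since many contain spaces
    src = PureWindowsPath(wma_path)
    dst = src.with_suffix(".mp3")
    return f'AlltoMp3Cmd "{src}" "{dst}" -B{bitrate}'
```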

I decided to try Renamer.bat with just one WMA file.  So I created another file, X.bat, with just one line in it, like the line shown above.  To run X.bat from the command line, so that I could see what it was doing, I would need a command window that was ready to execute commands in the folder where X.bat was located.  Problem:  X.bat was not in the same folder as Boxoft's AlltoMp3Cmd.exe executable program, so X.bat would fail.  If I didn't want to get into changing the computer's PATH, I could either put X.bat in the Boxoft program folder or I could copy AlltoMp3Cmd.exe to the folder where X.bat was located.

Either way, I needed to open a command window in one of those two folders, so as to run X.bat.  I could start from scratch (Start > Run > cmd) and use commands (e.g., "D:" would take me to drive D and "cd \Folder" would take me to the folder where Filename.wma was located), or I could use Ultimate Windows Tweaker to install a right-click option to open a command window in any folder.  I had already done the latter, so this step was easy.

Once I had sorted out all that, I was ready to try running X.bat.  But when I did, it crashed the AlltoMp3Cmd.exe program.  If I clicked on Cancel when I got the crash dialog, the command window said this:
Exception Exception in module AlltoMp3Cmd.exe at 0005B4E1.
Installation file incorrect. Please re-install it!.
But reinstalling the Boxoft program didn't help.  I sent them a note to let them know of this problem and decided to try other approaches.  One possibility was that their program was suitable for Windows XP but not Windows 7, which I was using.  It didn't seem to be a question of how the main program was installed, since the error message was referring specifically to the AlltoMp3Cmd.exe command-line executable (which presumably would be the same on any Windows system).

I decided to try running it in a Windows XP virtual machine (VM).  I had already installed Microsoft's Windows Virtual PC, which came with a WinXP VM, so I fired it up to try the same command line in the same folder.  To move quickly to the proper folder in the WinXP command window, I ran my trusty old RegTweak2.reg file, created in Notepad, to install a right-click option to open a command window in any folder in Windows Explorer.  But when I tried to use it, I got an error:
'\\tsclient\D\Folder Name\Subfolder Name'
CMD.EXE was started with the above path as the current directory.
UNC paths are not supported.  Defaulting to Windows directory.
'\\tsclient\D\Folder Name\Subfolder Name'
CMD does not support UNC paths as current directories.
A bit more playing around persuaded me that what this message meant was that command-line work in the VM would have to be done on what the VM considered a "real" (actually a virtual) drive -- in other words, drive C.  So I put copies of X.bat and AlltoMp3Cmd.exe into the VM's drive C, in a new folder I called Workspace, and I tried running X.bat from the command line there.  But again I got an error:  "AlltoMp3Cmd.exe has encountered a problem and needs to close."  Maybe the program wasn't built to handle paths.  For whatever reason, it looked like the Boxoft AlltoMp3Cmd command-line utility was not going to work for me.

A search in CNET brought up some other possibilities.  One was IrfanView, reminding me that I had used that program to work partway through a somewhat similar problem months earlier.  Using IrfanView version 4.28 and various insights described more fully in that other writeup (and in a recent thread), I went back to my original list of files in wmalist.txt and prepared this command:
i_view32.exe /filelist=D:\wmalist.txt /convert=$D$N.mp3
This command was supposed to use the file names ($N) and directories (i.e., folders, $D) specified in wmalist.txt to produce MP3 files with those same names, in those same directories.  Before trying it out, I made a copy of wmalist.txt and changed the original so that it contained only two lines, referring to WMA files on two different drives.  I ran the command shown above in a CMD window.  I got an error:
'i_view32.exe' is not recognized as an internal or external command, operable program or batch file.
In other words, Windows 7 did not know where to look to find IrfanView.  I could have taken the steps mentioned above, moving the .txt file to wherever i_view32.exe was located; but since I used IrfanView often, I wanted to add it to the PATH variable so that Windows would permanently recognize it.  The solution was to go to Start > Run > SystemPropertiesAdvanced.exe (also available through Control Panel > System > Advanced System Settings) and then click on Environment Variables > System Variables > highlight Path > Edit.  To see clearly what I was doing, I cut the existing Variable Value out of the dialog and doctored it in Notepad.  The basic idea was to add, to the end of the existing value, a semicolon and then (without adding a space after the semicolon) paste the location of i_view32.exe (found easily enough via an Everything search > right-click > Copy path to clipboard).  I made sure to add a final backslash ("\") after the path to i_view32.exe.  I pasted that back into the dialog, OKed my way out of System Properties, went back into the command window, pressed the Up arrow key to repeat the command ... and it still didn't work.  It occurred to me that the new PATH definition might not take effect until I rebooted, and that proved to be the case:  after rebooting, I ran the command shown above in a command window, and there were no errors.  IrfanView was open, but nothing was in it.  I ran searches in Everything for the two files in my test WMAlist.txt file, with wildcard extensions (i.e., I searched for Filename.*).  No joy:  there were no MP3 versions of those files.  I tried a modified version of the command:
i_view32.exe /filelist=D:\wmalist.txt /convert=D:\*.mp3
but that produced no output in D.  The IrfanView command was not working.  I tried yet another variation, as above but without "D:\" but that wasn't it either.  I tried the original command without using the filelist option:
i_view32.exe "D:\File Path\File Name.wma" /convert=$D$N.mp3
This produced an error:
Error!  Can't load 'D:\File Path\File Name.wma'
Did that mean that the /convert option was not being recognized?  Everything indicated that no MP3 file had been created.  And why would IrfanView be unable to load the existing WMA file?  It could load it easily enough from Windows Explorer or Everything.  I tried again:
i_view32.exe "D:\File Path\File Name.wma"
That worked:  IrfanView played the file.  So the convert option was the problem.  Another variation:
i_view32.exe "D:\File Path\File Name.wma" /convert="File Name.mp3"
If that did work, I wasn't sure where the output file would turn up.  No worries there:  it didn't work.  I got the "Can't Load" error again.  IrfanView's help file said that it did support wildcards for /convert, so that was presumably not the problem.  I had seen an indication that IrfanView would not batch-convert certain kinds of files, but WMA was not on the list I saw.  I was going to post a question in the relevant IrfanView forum, but at this point they weren't letting me in, for some reason.  Eventually it occurred to me to look in IrfanView's File > Batch Conversion/Rename area, where it appeared that the program would support only image conversions, not audio.

It seemed I would need to continue searching for a command-line option.  Back at that CNET search, I looked at the Koyote Free Mp3 WMA Converter -- from another company that offered multiple free conversion products -- but saw no indications that it had command-line options.  Likewise for Power MP3 WMA Converter and others.

I finally opted for a kludge solution.  Using an Excel spreadsheet, I created a batch file (again, using techniques described in the other post referenced above and elsewhere) to rename each file in WMAlist.txt to a unique name (example:  ZZZ_00001.wma) -- after making sure I did not already have any files with that kind of name.  The unique names would help to ensure that all WMA files would get the treatment, even if two of them had the same original name.  This produced 386 files.  Then, using Everything, I selected and moved all ZZZ_*.wma files to D:\Workspace.  Somehow, only 375 files made it to that folder.  It turned out that I had inadvertently included WMA program files from drive C after all, which I had not wanted to do, and for some reason a few of those were not moved to D:\Workspace -- probably for insufficient rights.  So now I would have to undo that damage.
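The unique-rename step can be sketched as follows, using the ZZZ_ naming scheme described above and the standard Windows "ren" command (the spreadsheet produced equivalent lines):

```python
def unique_rename_commands(paths: list[str]) -> list[str]:
    # Generate batch "ren" lines giving each WMA a unique, zero-padded name,
    # so files with duplicate original names cannot collide in one folder
    return [
        f'ren "{path}" "ZZZ_{i:05d}.wma"'
        for i, path in enumerate(paths, start=1)
    ]
```

Keeping the original path alongside the assigned number in the spreadsheet is what later made it possible to move each converted MP3 back to its source folder under its original name.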

After taking care of that, in D:\Workspace, I tried the Boxoft program again, this time using its Batch Convert mode.  It took a while.  Spot checks suggested that the conversion quality was good.  I wasn't sure what bitrate to use to convert the files.  It seemed that, at 56 kbps for what appeared to be a bunch of voice (not music) files, I erred on the high side.  I started with 353 WMA files occupying a total of 237MB, and I ended up with 353 MP3 files occupying 405MB.  Those files were converted at what appeared, at quick glance, to be a rate of about 6MB per minute.  I then revised the spreadsheet to produce batch file command lines that would move those MP3s back to the folders where the similarly named WMA files had been and rename them back to their original names (but with MP3 extensions).