Hard Times ― Andreas Zwinkau

Source: Curtis Poe via Quora

A few years ago, while building an ETL system in Perl to reduce the cost of phase III clinical trials, I found myself processing about 40,000 dates. Two of those 40,000 dates failed to validate. I wasn't terribly concerned, as these dates came from client-supplied data, and this data is often, um, surprising. However, when I inspected the raw data, it turned out that those dates were January 1st, 2011 and January 1st, 2007. I realized I had a bug in software I had just written, but it turned out this bug was actually 30 years old.

For anyone who doesn't really understand the software ecosystem, this may sound mystifying, but it makes sense. Because of a decision taken a long time ago to make another company money, my $client lost money in paying me to fix a bug that one company accidentally introduced and another company deliberately introduced. But to explain it I need to talk about a third company that introduced a feature that eventually became a bug, and a few other historical tidbits that nonetheless contributed to the obscure bug I fixed.

In the good ol' days, Apple computers would sometimes spontaneously reset their date to January 1st, 1904. The reason for this is fairly simple. Back then, Apple computers used battery-powered "system clocks" to keep track of the date and time. What happened when the battery ran out? Apple computers tracked their dates as the number of seconds since the epoch. In this sense, an epoch is merely a reference date from which we start counting. For Macintosh computers, that epoch was January 1st, 1904, and when the system clock battery died, that was your new date. But why did that really happen?

Back then, Apple used 32 bits (ones and zeros) to store the number of seconds from their start date. One bit can hold one of two values, 0 or 1. Two bits can hold one of four values: 00, 01, 10, 11. Three bits can hold one of eight values: 000, 001, 010, 011, 100, 101, 110, 111, and so on. How much can 32 bits hold? 32 bits can hold one of 2^32, or 4,294,967,296, values, so the clock could count that many seconds. For Apple dates that was approximately 136 years, which is why older Macs couldn't handle dates after 2040. And if your system clock battery died, your date would reset to 0 seconds after the epoch, and you'd have to keep manually resetting the date every time you turned on your computer (or until you bought a new battery for your system clock).
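The arithmetic above is easy to check. The following is a minimal Python sketch, not anything Apple shipped: the names `MAC_EPOCH`, `MAX_SECONDS` and `mac_seconds_to_datetime` are made up for illustration, and the counter is assumed to be an unsigned 32-bit value as described.

```python
from datetime import datetime, timedelta

# The Macintosh epoch described above.
MAC_EPOCH = datetime(1904, 1, 1)

# An unsigned 32-bit counter takes 2**32 distinct values: 0 .. 2**32 - 1.
MAX_SECONDS = 2**32 - 1


def mac_seconds_to_datetime(seconds: int) -> datetime:
    """Convert a count of seconds since the 1904 epoch to a datetime."""
    if not 0 <= seconds <= MAX_SECONDS:
        raise ValueError("value does not fit in 32 unsigned bits")
    return MAC_EPOCH + timedelta(seconds=seconds)


# Length of the representable range in (average Julian) years.
years = 2**32 / (365.25 * 24 * 60 * 60)

print(2**32)                                 # 4294967296
print(round(years, 1))                       # 136.1
print(mac_seconds_to_datetime(0))            # 1904-01-01 00:00:00
print(mac_seconds_to_datetime(MAX_SECONDS))  # 2040-02-06 06:28:15
```

A counter value of 0 is exactly the epoch, which is what a dead battery left you with, and the largest value lands in early February 2040.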

However, the Apple solution of storing dates as the number of seconds after the epoch meant Macs couldn't handle dates before the epoch, and that had far-reaching implications, as we'll see. This was a feature, not a bug, that Apple introduced. It meant, amongst other things, that the Macintosh operating system was generally immune to the Y2K bug (though many Mac apps weren't because they would introduce their own date system to work around the Mac limitations).

Moving along, we have Lotus 1-2-3, the IBM PC's "killer app" that helped to launch the PC revolution, though it was VisiCalc on the Apple that really launched the personal computer. It's fair to say that if 1-2-3 hadn't come along, PCs would likely not have taken off the way they did, and personal computer history would have turned out considerably differently. However, Lotus 1-2-3 incorrectly reported 1900 as a leap year. When Microsoft released Multiplan, their first spreadsheet program, it didn't have much market penetration. So when they conceived of Excel, they decided not only to copy Lotus 1-2-3's row/column naming scheme, but also to make it bug-for-bug compatible, including deliberately treating 1900 as a leap year, a problem that remains to this day. So for 1-2-3, this was a bug, but for Excel, it was a feature to guarantee that everyone who used 1-2-3 could import their spreadsheets into Excel with no differences in the data, even if the data were wrong.
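For reference, here is why 1900 is not a leap year. This is a small Python sketch with made-up function names. The "every fourth year" rule in `naive_is_leap` is only an assumption about one way to end up with the wrong answer for 1900; the source does not say how 1-2-3 actually computed it.

```python
import calendar


def gregorian_is_leap(year: int) -> bool:
    """Divisible by 4, except century years, unless divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)


def naive_is_leap(year: int) -> bool:
    """Simplified rule (an assumption, see above): every fourth year."""
    return year % 4 == 0


print(gregorian_is_leap(1900))  # False
print(naive_is_leap(1900))      # True
print(gregorian_is_leap(2000))  # True
print(calendar.isleap(1900))    # False (standard library agrees)
```

The two rules agree for every year from 1901 through 2099, which is part of why the mistake is so easy to live with.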

But there was another problem here. Microsoft first released Excel for the Macintosh and as mentioned, the Macintosh didn't recognize dates prior to January 1st, 1904. However, Excel used January 1st, 1900 as its epoch. So Excel was modified to recognize what the epoch was and internally stored dates relative to these respective epochs. This Microsoft support article explains the problem fairly clearly. And that leads to my bug.

The ETL system I was building received Excel spreadsheets from many customers. Those spreadsheets may have been produced on Windows, but they may have been produced on a Mac. As a result, the "epoch" date for the spreadsheets might be January 1st, 1900 or January 1st, 1904. How do you know which one? Well, the Excel file format exposes this information, but the parser I was using didn't (it now does), and it expected you to know whether you had a 1900 or 1904-based spreadsheet. I suppose I could have spent a lot of time trying to figure out how to read the binary format of Excel and sent a patch to the maintainer of the parser, but I had many other things to do for $client, so I quickly wrote a heuristic to determine whether a given spreadsheet's epoch was in 1900 or 1904. It was pretty simple.

In Excel, you may have a date of July 5, 1998, but it might be formatted as "07-05-98" (the useless US system), "Jul 5, 98", "July 5, 1998", "5-Jul-98" or any of a number of other useless formats (ironically, the one format my version of Excel didn't offer is the standard ISO 8601 format). Internally, however, the unformatted value is either "35981", for the 1900 date system, or "34519", for the 1904 system (these numbers represent the number of days after the epoch). So what I did is use a fairly robust date parser to extract the year from the formatted date, and then an Excel date parser to extract the year from the unformatted value. If they're four years apart, I know I'm using the 1904 date system.
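To make the two serial numbers concrete, here is a minimal Python sketch of the conversion. The function and constant names are illustrative and this is not the Perl parser mentioned above. It relies on two facts about Excel's date systems: in the 1904 system serial 0 is January 1st, 1904, and in the 1900 system the phantom February 29th, 1900 occupies serial 60, so later serials can be counted from December 30th, 1899.

```python
from datetime import date, timedelta

# 1900 system: counting from 1899-12-30 absorbs the phantom leap day,
# which is only valid for serials above 60.
EPOCH_1900 = date(1899, 12, 30)
# 1904 system: serial 0 is January 1st, 1904.
EPOCH_1904 = date(1904, 1, 1)


def serial_to_date(serial: int, uses_1904: bool = False) -> date:
    """Convert an Excel day serial to a calendar date."""
    if uses_1904:
        return EPOCH_1904 + timedelta(days=serial)
    if serial <= 60:
        raise ValueError("serials up to 60 are affected by the phantom leap day")
    return EPOCH_1900 + timedelta(days=serial)


print(serial_to_date(35981))                  # 1998-07-05
print(serial_to_date(34519, uses_1904=True))  # 1998-07-05
# Reading a 1904-system serial with the wrong (1900) system:
print(serial_to_date(34519))                  # 1994-07-04
```

The last line shows the four-year gap the heuristic looks for: the formatted text says 1998, but the default 1900 interpretation of the raw value says 1994.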

So why don't I simply use the formatted date? Because July 5, 1998 might be formatted as "July, 98", losing me the day of the month. We were getting our spreadsheets from so many companies, and they created them in so many different ways, that they expected us (meaning me, in this case) to figure it out. After all, Excel gets it right, so we should, too!

That's when 39082 kicked me in the tail. Remember how Lotus 1-2-3 considered 1900 a leap year and how that was faithfully copied to Excel? Because it adds an extra day to 1900, many date calculation functions relying on this can easily be off by a day. That means that 39082 might be January 1st, 2011 (on Macs), or it might be December 31st, 2006 (on Windows). If my "year parser" extracts 2011 from the formatted value, well, that's great. But since the Excel parser doesn't know whether it's a 1900 or 1904 date system, it defaults to the common 1900 date system and returns 2006 as the year. My software then sees that the years are five years apart, assumes an error, logs it, and returns the unformatted value.
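The edge case can be reproduced in a few lines. This Python sketch uses made-up names and assumes, as above, that a 1904-system serial is being read with the default 1900 system, which shifts the result back by a fixed number of days. It then scans a whole year for dates where that shift crosses a year boundary and produces a five-year gap instead of four.

```python
from datetime import date, timedelta

# The same serial read under the 1904 system lands this much later
# than under the 1900 system (1,462 days: four years plus one day).
OFFSET = date(1904, 1, 1) - date(1899, 12, 30)


def misread_as_1900(true_date: date) -> date:
    """Date the default 1900 interpretation reports for a 1904-system cell."""
    return true_date - OFFSET


def five_year_gaps(year: int) -> list[date]:
    """Days in `year` whose misread year is five years earlier, not four."""
    day, hits = date(year, 1, 1), []
    while day.year == year:
        if day.year - misread_as_1900(day).year == 5:
            hits.append(day)
        day += timedelta(days=1)
    return hits


print(OFFSET.days)                        # 1462
print(misread_as_1900(date(2011, 1, 1)))  # 2006-12-31
print(five_year_gaps(2011))               # [datetime.date(2011, 1, 1)]
print(five_year_gaps(2007))               # [datetime.date(2007, 1, 1)]
```

Only January 1st is affected in these years, which matches the two failures out of 40,000: January 1st, 2011 and January 1st, 2007.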

To work around this, I wrote the following (pseudo-code):

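diff = formatted_year - parsed_year    (parsed_year assumes the 1900 system)
if 0 == diff
    assume 1900 date system
else if 4 == diff
    assume 1904 date system
else if 5 == diff and parsed month is December and parsed day is 31
    assume 1904 date system
else
    log an error and return the unformatted value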
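For readers who prefer something runnable, the pseudo-code above could be written in Python roughly as follows. This is a sketch, not the original Perl: `guess_date_system` and its parameters are hypothetical names, and it assumes `parsed` is the date obtained by decoding the raw value with the default 1900 system.

```python
from datetime import date
from typing import Optional


def guess_date_system(formatted_year: int, parsed: date) -> Optional[int]:
    """Return 1900 or 1904, or None when the years fit neither system.

    formatted_year: year extracted from the cell's formatted text.
    parsed: date decoded from the raw serial under the 1900 system.
    """
    diff = formatted_year - parsed.year
    if diff == 0:
        return 1900
    if diff == 4:
        return 1904
    # January 1st in the 1904 system decodes as December 31st,
    # five calendar years earlier, under the 1900 system.
    if diff == 5 and (parsed.month, parsed.day) == (12, 31):
        return 1904
    return None  # caller should log an error and keep the raw value


print(guess_date_system(1998, date(1998, 7, 5)))    # 1900
print(guess_date_system(1998, date(1994, 7, 4)))    # 1904
print(guess_date_system(2011, date(2006, 12, 31)))  # 1904
print(guess_date_system(2011, date(2006, 6, 1)))    # None
```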

And all 40,000 dates parsed correctly.