Wednesday, 29 July 2015

The Mysterious Case of the Missing Milliseconds

Strings are the lingua franca of many distributed systems, and Spotlight is no different. Earlier today, we hit a weird head-scratching bug in one of our services, and - surprise, surprise - turns out it's all do with strings. To work around limitations of an old line-of-business application, we have a database trigger (no, really) that captures changes made to a particular table, serializes them into an XML message, and pushes this into a SQL Service Broker queue; we then have a Windows service that pulls messages off the queue, parses the XML, breaks it into nicely manageable chunks and publishes them all to RabbitMQ using EasyNetQ. SImple. Except, once in a while, it blows up and starts complaining about FormatExceptions.

Now... within the database trigger, we're doing this:


which returns 2015-07-29T20:55:21.130 as you'd expect.

There's then a line of code in the Windows service that says:

var format = "yyyy-MM-ddTHH:mm:ss.fff";
DateTime.ParseExact(d, format, CultureInfo.InvariantCulture, DateTimeStyles.AdjustToUniversal | DateTimeStyles.AssumeUniversal);

Now, this is the code of somebody who knows that turning datetimes into strings and back again can get a bit tricky, and so has left absolutely nothing to chance - they've supplied an exact date format, they've specified a culture, they've even gone so far as to specify the DateTimeStyles. There's unit tests and integration tests, and everything looks absolutely lovely. And then it blows up. Very occasionally,

Except... SQL Server does something weird.

SELECT @DateTime = '2015-07-29 21:59:15:123'
SELECT CONVERT(VARCHAR(128), @DateTime, 126) -- returns 2015-07-29T21:59:15.123 (fine!)

SELECT @DateTime = '2015-07-29 21:59:15:000'
SELECT CONVERT(VARCHAR(128), @DateTime, 126) -- returns 2015-07-29T21:59:15

SELECT @DateTime = '2015-07-29 21:59:15:999'
SELECT CONVERT(VARCHAR(128), @DateTime, 126) -- returns 2015-07-29T21:59:16

SELECT @DateTime = '2015-07-29 21:59:15:001'
SELECT CONVERT(VARCHAR(128), @DateTime, 126) -- returns 2015-07-29T21:59:15

First, SQL Server doesn't have true millisecond precision - the milliseconds part will often get rounded by +- 0.001 seconds. Second - if the milliseconds part is zero, it'll be omitted from the string representation. Which means our incredibly specific and detailed date parsing routine will choke, because suddenly it has a date that doesn't match the format we've specified, and DateTime.ParseExact will throw a FormatException. Unit tests don't pick it up, because why would you mock such completely bizarre (and undocumented) behaviour, when you don't even know it exists?

What this means is that, since any changes done between .999 and .001 milliseconds will blow up, roughly 0.3% of all our transactions will be failing with a FormatException rather than getting synced to the rest of our systems. Which means fishing them out of the error queue and sorting them out manually - ah, the joy of distributed systems. This formatting weirdness happens on every version of SQL back as far as 2003, but there's no reference to it in the documentation until SQL Server 2012. It's been raised as a bug and closed as 'by design' because "the ISO 8601 spec leaves the conversion semantics for fractional seconds up to the implementation" - which I'm pretty sure didn't mean "go ahead and be internally inconsistent!" but as with so many other issues like this, fixing the bug would change behaviour that's been in place for years and could break things. I've no idea how - or why - anyone would build a system that genuinely relies on this bizarre idiosyncrasy, but I'll bet good money somebody out there has done it.

The beautiful irony, of course, is that if we'd used DateTime.Parse instead of ParseExact, we'd never have had a problem. :)

1 comment:

Ed Blackburn said...

I only learned this when I was looking at combguid in NHibernate:

I've spent a lot of time in the past few years interrupting with C++, python, java and javascript. Because parsing dates is such a pain in the arse, I've resorted to using unix time (or another epoch) for date arguments and persistence. I usually use milliseconds so only have to contend with a 64bit integer (domain depending) and add that number to the epoch.Obviously I'm at the mercy of libraries accounting for leaps but I've avoided the crazy parsing issues that used to bite me (and evidently yourself!)