Monday, 29 February 2016

The Mysterious Case of the Latin General: Collations in SQL Server

Computers deal in absolutes. True, false, yes, no. So when you ask a computer if two things are the same, it can’t get away with “well, sort of… it depends, really; why are you asking?”.

But we use computers to model reality, and reality is a horrible mess of edge cases, quirks and idiosyncrasies. Today is Leap Year Day – February 29th. Which means you can do this:

var today = DateTime.Now.Date.AddYears(1).AddYears(-1); // 29 Feb 2016 -> 28 Feb 2017 -> 28 Feb 2016
var yesterday = DateTime.Now.Date.AddDays(-1);          // 28 Feb 2016
if (today == yesterday) { Console.WriteLine("In your face, Einstein!"); }

And if you think that's messy, wait until you start looking at string comparisons. Which, by happy coincidence, was the thing on the top of my stack when I got into work this morning.

OK, pop quiz time! Are these statements true or false?

  1. "foo" == "foo"
  2. "FOO" == "foo"
  3. "café" == "cafe"
  4. "œnophile" == "oenophile"
  5. "strasse" = "straße"
  6. "SIGURÐAR" == "SIGURDAR"

Well? How did you do? See, the answer in almost every case is that lovely old chestnut "it depends". And when you're dealing with text columns in Microsoft SQL Server, the thing it depends on is called a collation. A collation is a set of rules that defines whether or not two strings should be considered equal, and what 'alphabetical order' means when sorting a given set of strings. For years, my team has run most of our database systems with a collation called SQL_Latin1_General_CP1_CI_AI – and the important part there is the CI_AI at the end, which indicates that this is a case-insensitive (CI), accent-insensitive (AI) collation. That means that if you search our database for 'CAFE', you'll find records containing 'café': the case-insensitive rule says that "C" and "c" are the same, and the accent-insensitive rule says "ignore the accent – pretend that é is the same as e".
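
If you want to see that rule in action without touching a real table, here's a minimal sketch using nothing but string literals (and assuming a database whose default code page can hold the é):

  SELECT CASE
           WHEN 'CAFE' = 'café' COLLATE SQL_Latin1_General_CP1_CI_AI
           THEN 'equal' ELSE 'different'
         END AS Comparison;
  -- should return 'equal': CI ignores the capitals, AI ignores the accent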

Now, in recent versions of SQL Server, Microsoft has introduced a new set of collations to choose from – which means that even though we've already decided we want a Latin alphabet, with case-insensitive and accent-insensitive string comparisons, we still have a choice of six:

SELECT * FROM fn_helpcollations() WHERE name LIKE '%Latin1_General%CI_AI'

Latin1_General_CI_AI
    Latin1-General, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive

Latin1_General_100_CI_AI
    Latin1-General-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive

SQL_Latin1_General_CP1_CI_AI
    Latin1-General, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 54 on Code Page 1252 for non-Unicode Data

SQL_Latin1_General_CP1253_CI_AI
    Latin1-General, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 124 on Code Page 1253 for non-Unicode Data

SQL_Latin1_General_CP437_CI_AI
    Latin1-General, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 34 on Code Page 437 for non-Unicode Data

SQL_Latin1_General_CP850_CI_AI
    Latin1-General, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive for Unicode Data, SQL Server Sort Order 44 on Code Page 850 for non-Unicode Data
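
Incidentally, if you're not sure which collations you're currently running, SQL Server will tell you. Here's a quick sketch – dbo.Venues is just a made-up table name for illustration:

  SELECT SERVERPROPERTY('Collation') AS ServerCollation;
  SELECT DATABASEPROPERTYEX(DB_NAME(), 'Collation') AS DatabaseCollation;
  -- column-level collations can differ from both of the above:
  SELECT name, collation_name FROM sys.columns WHERE object_id = OBJECT_ID('dbo.Venues');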

What I'm specifically interested in here – because it's my job as architect to decide this kind of thing - is why we might choose to use the new Latin1_General_CI_AI collation instead of the tried'n'tested SQL_Latin1_General_CP1_CI_AI collation. So I did a little Googling, and dug up this excellent article, which explains a bit about how these collations differ when it comes to Unicode character expansions. See, there are quite a lot of 'letters' which can be written as either one or two characters, depending on your cultural conventions (and what kind of keyboard you've got). Typesetting purists will refer to archæologists and œnophiles, whilst the rest of us have to settle for archaeologists and oenophiles because we don't know how to type "æ" and "œ". And then there's the German "sharp S", which is written as either 'ß' or 'ss' depending on who's writing it.
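
Here's a minimal sketch of what that expansion means for the sharp S, using nothing but Unicode literals – going by the results further down, both collations should report the two spellings as equal when the data is Unicode:

  SELECT CASE WHEN N'straße' = N'strasse' COLLATE Latin1_General_CI_AI
              THEN 'equal' ELSE 'different' END AS WindowsCollation,
         CASE WHEN N'straße' = N'strasse' COLLATE SQL_Latin1_General_CP1_CI_AI
              THEN 'equal' ELSE 'different' END AS SqlCollation;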

Now here's the real-world scenario where this matters. Let's say we've got a database of rock music clubs, and someone's added Roadrunner's Paradise in Berlin, and – because they're German, typing in German on a German keyboard – they've put the address in as "Saarbrücker Straße 24, 10405 Berlin". Now one of our British users is on holiday, and they're in Berlin, and they're trying to find out if there are any rock clubs on Saarbrücker Straße. Except because they've got an English guidebook and an English keyboard, they search for "saarbrucker strasse", because they don't know how to type the umlaut or the sharp-S 'ß' character.

Whether their search works or not can potentially depend on several things:

  • The character type (varchar or nvarchar) of our StreetAddress column
  • The collation of our StreetAddress column
  • The character type specified in the query – SQL will assume string literals are ANSI strings unless you prefix them with N, so LIKE 'saarbrucker strasse' is different to LIKE N'saarbrucker strasse'
  • Whether or not we explicitly specify a collation in our WHERE clause.

OK, so we have two different column types (varchar vs nvarchar), we have two different collations (Latin1 vs SQL_Latin1), we have two different ways of specifying our search text (N'foo' vs 'foo'), and we have three different ways to specify collation in the query (explicit Latin1, explicit SQL_Latin1, or implicit) - meaning there are 2*2*2*3 = 24 different ways to store and query the data supporting this particular scenario.
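
Each of those combinations boils down to queries shaped something like this rough sketch – not the actual test script, just the shape of it, with a made-up temp table and column names:

  -- One varchar and one nvarchar column per collation (names invented for illustration)
  CREATE TABLE #Venues (
      Address_Latin1_varchar     varchar(100)  COLLATE Latin1_General_CI_AI,
      Address_SqlLatin1_varchar  varchar(100)  COLLATE SQL_Latin1_General_CP1_CI_AI,
      Address_Latin1_nvarchar    nvarchar(100) COLLATE Latin1_General_CI_AI,
      Address_SqlLatin1_nvarchar nvarchar(100) COLLATE SQL_Latin1_General_CP1_CI_AI
  );
  INSERT INTO #Venues VALUES (N'Saarbrücker Straße 24, 10405 Berlin',
                              N'Saarbrücker Straße 24, 10405 Berlin',
                              N'Saarbrücker Straße 24, 10405 Berlin',
                              N'Saarbrücker Straße 24, 10405 Berlin');

  -- Then each combination is a query like one of these:
  SELECT * FROM #Venues WHERE Address_SqlLatin1_varchar LIKE '%saarbrucker strasse%';
  SELECT * FROM #Venues
  WHERE Address_SqlLatin1_varchar LIKE N'%saarbrucker strasse%' COLLATE Latin1_General_CI_AI;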

So let's try them all. For SCIENCE. Code to create and run these examples is at https://gist.github.com/dylanbeattie/c9705dc5c4efd82f60b8, and when you run it, it works like this:

Column Type | Column collation | Search text               | Query collation | Did it work?
------------+------------------+---------------------------+-----------------+-------------
nvarchar    | Latin1           | '%saarbrucker strasse%'   | (implicit)      | Yes
nvarchar    | Latin1           | '%saarbrucker strasse%'   | Latin1          | Yes
nvarchar    | Latin1           | '%saarbrucker strasse%'   | SQL_Latin1      | Yes
nvarchar    | Latin1           | N'%saarbrucker strasse%'  | (implicit)      | Yes
nvarchar    | Latin1           | N'%saarbrucker strasse%'  | Latin1          | Yes
nvarchar    | Latin1           | N'%saarbrucker strasse%'  | SQL_Latin1      | Yes
nvarchar    | SQL_Latin1       | '%saarbrucker strasse%'   | (implicit)      | Yes
nvarchar    | SQL_Latin1       | '%saarbrucker strasse%'   | Latin1          | Yes
nvarchar    | SQL_Latin1       | '%saarbrucker strasse%'   | SQL_Latin1      | Yes
nvarchar    | SQL_Latin1       | N'%saarbrucker strasse%'  | (implicit)      | Yes
nvarchar    | SQL_Latin1       | N'%saarbrucker strasse%'  | Latin1          | Yes
nvarchar    | SQL_Latin1       | N'%saarbrucker strasse%'  | SQL_Latin1      | Yes
varchar     | Latin1           | '%saarbrucker strasse%'   | (implicit)      | Yes
varchar     | Latin1           | '%saarbrucker strasse%'   | Latin1          | Yes
varchar     | Latin1           | '%saarbrucker strasse%'   | SQL_Latin1      | No
varchar     | Latin1           | N'%saarbrucker strasse%'  | (implicit)      | Yes
varchar     | Latin1           | N'%saarbrucker strasse%'  | Latin1          | Yes
varchar     | Latin1           | N'%saarbrucker strasse%'  | SQL_Latin1      | Yes
varchar     | SQL_Latin1       | '%saarbrucker strasse%'   | (implicit)      | No
varchar     | SQL_Latin1       | '%saarbrucker strasse%'   | Latin1          | Yes
varchar     | SQL_Latin1       | '%saarbrucker strasse%'   | SQL_Latin1      | No
varchar     | SQL_Latin1       | N'%saarbrucker strasse%'  | (implicit)      | Yes
varchar     | SQL_Latin1       | N'%saarbrucker strasse%'  | Latin1          | Yes
varchar     | SQL_Latin1       | N'%saarbrucker strasse%'  | SQL_Latin1      | Yes

So there's the difference. If you're dealing with varchar columns (as opposed to nvarchars), and you specify your query text as an ANSI string (as opposed to a Unicode string), then you need to use the Latin1_General_CI_AI collation if you want "strasse" to match "straße"; in every other scenario, 'ss' is equal to 'ß' (and apparently this applies to all Latin collations, regardless of accent sensitivity).
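
To make that concrete, here's a hedged sketch of the one losing combination and two ways around it – dbo.Venues and StreetAddress are made-up names, and the column is assumed to be a varchar using SQL_Latin1_General_CP1_CI_AI:

  -- Finds nothing: varchar column, SQL collation, ANSI literal, implicit collation
  SELECT * FROM dbo.Venues WHERE StreetAddress LIKE '%saarbrucker strasse%';

  -- Finds the row: force the Windows collation for this one comparison...
  SELECT * FROM dbo.Venues
  WHERE StreetAddress LIKE '%saarbrucker strasse%' COLLATE Latin1_General_CI_AI;

  -- ...or finds the row: pass the search text as a Unicode literal instead
  SELECT * FROM dbo.Venues WHERE StreetAddress LIKE N'%saarbrucker strasse%';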

Of course, this works both ways – by using Latin1_General_CI_AI, you create a scenario where somebody searching for '%aße%' will find English words like 'assessment' and 'molasses', because as far as SQL Server is concerned, 'aße' and 'asse' are the same string. If that's a problem for your particular scenario, you'll need to do something like restrict your search inputs to the basic ASCII character set.
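
As a one-line sketch of that reversal (literals only, and assuming the behaviour in the table above carries over to plain literals):

  SELECT CASE WHEN N'assessment' LIKE N'%aße%' COLLATE Latin1_General_CI_AI
              THEN 'matches' ELSE 'no match' END AS Result;
  -- 'aße' and 'asse' compare as equal, so this should report a match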

As with almost everything in software development, modelling the real world involves making compromises. It's our job to make sure we're aware of those compromises, and that we understand how they translate back into real-world behaviour, and it's up to us to explain them to users who think our software sucks because it doesn't know what day it is and it says there are no rock bars on Saarbrucker Strasse.

Happy Leap Year Day!

Tuesday, 23 February 2016

Better Hypermedia Through Obfuscation

Here’s a fun and completely twisted idea that came to me on the train this morning. One of the constraints of ReSTful architectures is that they use hypermedia as the engine of application state – in other words, clients shouldn’t maintain their own information about where they “are” and what resources and actions are available to them right now; they should rely on hypermedia embedded in the current resource representation they’ve retrieved from the server. A common example of this is pagination – navigating big lists of resources, using a representation something like this example:


  GET /people

  {
    "_embedded": {
      "people": [ /* array of 10 people */ ]
    },
    "_links": {
      "first": { "href": "/people?page=0" },
      "last": { "href": "/people?page=12" },
      "next": { "href": "/people?page=1" },
      "self": { "href": "/people?page=0" }
    },
    "count": 10,
    "total": 115
  }
Now, with an API like this, it’s all too easy for the client – or rather, the person building the client – to go “ok, page number is a zero-based integer; let’s cut a few corners here” and just program something like

for (var i = 0; i <= 12; i++) {
    http.get("https://api.foo.com/people?page=" + i);
}
Now, I’m a big fan of something we call the "pit of success" – the idea being that we "build platforms such that […] developers just fall into doing the 'right thing'", and more generally, the idea that the easiest way to achieve something is also the ‘correct’ way to achieve that thing. So what if we intentionally obfuscate our APIs so that hypermedia navigation becomes easier than building a for() loop? By, for example, requiring that our page number is written out longform, instead of numerically? And, just for fun, we’ll require that it’s in a language that isn’t in the regular ASCII character set. Like Ukrainian:
{
    "first": { "href": "/people?page=один" },
    "last": { "href": "/people?page=дванадцять" },
    "prev": { "href": "/people?page=два" },
    "next": { "href": "/people?page=чотири" },
    "self": { "href": "/people?page=три" }
}
Suddenly, your ‘shortcut’ isn’t a short cut any more. For starters, you’ll probably need to install a special keyboard in order to type those characters – not to mention that suddenly your source code files will need to be in UTF-8, and I’ll wager that somewhere in your build pipeline there’s a tool that can’t handle UTF-8 source code. And you’ll need a routine which can translate integers into Ukrainian string representations… No, the easiest thing to do now is to retrieve the first resource, parse the next.href property, retrieve that resource, and so on until you hit a resource with no next link. Which, of course, is exactly how hypermedia is supposed to work in the first place.

Wednesday, 10 February 2016

“How Can Software Be So Hard?”

Last night, I went to a Gresham College lecture at the Museum of London, “How Can Software Be So Hard?” presented by Professor Martyn Thomas CBE. The lecture itself was great – good content supported by solid examples. I must say there wasn’t a great deal there that I hadn’t heard before – software engineering is an immature discipline; we rely too heavily on “testing” to validate and verify the systems we create; validation normally happens far too late to do anything about the problems it uncovers; we’re overly reliant on modules and components that we can’t actually trust to work properly… all interesting and valid observations, but nothing particularly revolutionary.

Personally, I think the question posed in the lecture title is, to some extent, built on a false premise. Towards the end, he makes the observation – whilst talking about teenagers getting rich by writing iPhone games – that we only focus on the success stories, and the countless failures don’t get any attention. Which is true – but I think with software, we routinely take so much success for granted that it’s the failures which stand out. Sure, they happen – but if you think software is hard, try building a mechanical or analogue system that can send high-fidelity music from Boston to Singapore, or show the same colour photograph to a million people five minutes after it was taken. So many software projects are instigated to pursue efficiencies – to take some business process or system that used to require tens of thousands of people, and hire a few dozen programmers to replace it with a system that basically runs itself; is it any wonder it doesn’t always go smoothly? There are a lot of things which are easy in software and close to impossible in any other domain, and I think to lose sight of that would be disadvantageous in an age where we’re trying to inspire more kids to study STEM subjects and pursue careers in science and engineering.

The thing that really stuck in my head, though, was a comment Professor Thomas made in response to a question after the lecture. Somebody asked him what he thought about self-driving cars, and amongst the points he raised in response, he said something like:

What about pedestrians? Why would you find a crossing, press the button and wait for the green man if you know all the cars on the road are programmed not to run you over?

Over the last few years, I’ve spent a lot of time studying the way smart devices are affecting the way we interact with the world around us – something I’ve covered at length in my talk “Are Smart Systems Making Us Stupid?”, which I presented at BuildStuff last year. I’ve looked into all sorts of models of human/machine interaction, but I’d never considered that particular angle before – and it’s fascinating.

[Photograph of David Prowse as the Green Cross Code Man]

Our basic instinct for self-preservation is reinforced from an early age – is there anybody here who DOESN’T remember being taught how to cross the street? So what happens to those behaviour patterns in a world where kids work out pretty early on that they can jump in front of Google Cars and get away with it? Do we program the cars to hit a kid once in a while – not to kill them, just give ‘em a nasty bump to teach them a lesson? How much use is a self-driving car when your journey takes longer than walking because your car slams the brakes on every time a pedestrian walks in front of it? Maybe you’re better off dumping your luggage, or your shopping, in the car, and taking the train instead?

It’s also an interesting perspective on a discussion that, until now, has been framed very much from the perspective of the driver – “will my self-driving car kill me to save a dozen schoolkids?” – and raises even more questions around the social implications of technology like driverless cars.

The next talk in the series is “Computers, People and the Real World” on April 5th, and if you’re interested in how our increasing dependency on machines and software is affecting the world we live in and the lives we live, I’d heartily recommend it.

(PS: if anyone from Gresham College is reading this – get somebody to introduce your speakers. They’ve worked hard to prepare the material they’re presenting; help them make the best possible first impression by taking the stage to a round of applause from a crowd who already know who they are. It’s not that hard, and it makes a huge difference.)