Tuesday, 17 December 2013

The Joy of GUIDs

Like many developers, my first experience of primary keys was the Autonumber field in MS Access. At the time, this looked like a pretty neat idea. When I outgrew Access and moved on to using SQL Server, I adopted the convention of an identity(1,1) field as a primary key - a pattern that's used extensively both within my own code and throughout the industry.

Recently, though, I've been working on big distributed systems where the identity field really doesn't work too well. The biggest drawback is that if you create a new entity in any of your data stores, you need to wait for the response from the database before you can find it again. For RESTful systems where your Create method might just return an HTTP 202 "Accepted for processing" and leave you to work out whether your request eventually succeeded or not, this approach just plain doesn't work.

Enter the GUID - globally unique identifier. The idea is that every GUID ever created is completely unique - it uses various system/hardware IDs, the system clock and a degree of randomness to create unique 128-bit values. The lovely thing about working with GUIDs is that when you create your new customer, or order, or invoice, YOU can assign the ID that will identify the object for the rest of its life.
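In C#, that's a one-liner - here's a minimal sketch (the Customer class and console harness are made up for illustration, not from any particular project):

using System;

public class Customer {
    // The ID is assigned by the client at construction time - no database round-trip needed.
    public Guid Id { get; private set; }
    public string Name { get; set; }

    public Customer(string name) {
        Id = Guid.NewGuid();
        Name = name;
    }
}

public class Demo {
    public static void Main() {
        var customer = new Customer("Alice");
        // We already know the ID, so we can log it, return it to the caller, or build
        // a URL out of it before the INSERT (or the HTTP 202) has even happened.
        Console.WriteLine("Created customer {0}", customer.Id);
    }
}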

Thing is, working with GUIDs is very, very different to working with ints. For starters, a lot of security practices around integers just don't apply.

A great example is URL-hacking. If you've ever seen an address in your browser's address bar that looks like:

http://www.myshop.com/basket.asp?basketid=2789

Try editing it. Try changing the query string to be basketid=2788 - on many (poorly written!) systems, that'll show you what's in someone else's basket. It doesn't take a lot of work to write a little script that'll GET all the basket IDs from 0-999999 and voila! you've got a folder full of other people's shopping baskets. Even if the developers do something terribly clever - like using identity(27897,17) instead of identity(1,1) - if you can do ten HTTP requests per second, you can still sweep the entire ID range from 0-1,000,000 in a little over a day. Sure, only 1 in 17 of your requests will succeed - but who cares? That's still 58,823 compromised shopping baskets.

Now, imagine instead of sequential integers, you use a GUID:

http://www.myshop.com/basket.asp?basketid=fa8f6423-e7d6-ff40-7afc-ae874c755c00

Try hacking it. It won't work. There are 3.4×10^38 possible GUIDs, and roughly seven billion people on earth - call it ten billion to be generous. If every single person on the planet used your website every day, at the end of a year you'd have 10,000,000,000 × 365 shopping baskets in your system. Let's see how long it would take an attacker to steal those baskets by URL-hacking:

Number of possible basket IDs: Bp = 3.4×10^38

Number of actual basket IDs: Ba = 3.65×10^10

Likelihood of a randomly chosen GUID finding a shopping basket = Ba/Bp = 1.0×10^-28

Let's assume the website we're attacking is running on some serious distributed hardware so it'll remain responsive throughout our attack, and you have a massive botnet so you can check baskets until the server starts to struggle under the load. Facebook manages 6 million page-views per minute, or 100,000 requests per second. At 100,000 requests per second, it will take you (on average) 2,953,788,842,323,276 years to find a single shopping basket by URL-hacking.  Given those odds, a lot of security concerns around vulnerabilities like URL-hacking just disappear.
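If you want to sanity-check that figure, the back-of-the-envelope sum is easy enough to reproduce - here's a quick sketch using the numbers quoted above:

using System;

public class BasketMaths {
    public static void Main() {
        double possibleIds = 3.4e38;       // number of possible GUIDs
        double actualBaskets = 3.65e10;    // ten billion users, 365 days
        double requestsPerSecond = 100000; // our Facebook-scale attack rate

        // Expected number of random guesses before hitting a real basket
        double expectedGuesses = possibleIds / actualBaskets;

        double seconds = expectedGuesses / requestsPerSecond;
        double years = seconds / (365.0 * 24 * 3600);

        // Prints a number in the region of 2,950,000,000,000,000 years.
        Console.WriteLine("Expected time to find one basket: {0:N0} years", years);
    }
}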

I've sat through numerous conversations that get onto "what'll happen if we get a GUID collision?"

There's really only one sensible answer to this: You'll find the place in your code where you're accidentally re-using the same GUIDs, and fix the bug. GUIDs don't collide - you're doing it wrong. If you're using Guid.NewGuid() or NEWID(), you won't get collisions. It just won't happen. There isn't even any meaningful analogy with daily human experience that can convey just how unlikely a GUID collision is. Things with a probability of 1.0×10^-28 just don't happen.
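And if you'd rather see it than take my word for it, here's a toy experiment - it generates a million GUIDs and counts the collisions (spoiler: zero, every time):

using System;
using System.Collections.Generic;

public class CollisionHunt {
    public static void Main() {
        var seen = new HashSet<Guid>();
        var collisions = 0;
        for (var i = 0; i < 1000000; i++) {
            // HashSet<T>.Add returns false if the value was already in the set.
            if (!seen.Add(Guid.NewGuid())) collisions++;
        }
        Console.WriteLine("Collisions found: {0}", collisions);
    }
}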

Friday, 11 October 2013

Automatic Semantic Versioning with GitHub and TeamCity

You’ve quite possibly come across the idea of semantic versioning. It's a set of rules for versioning releases of software, designed to minimise disruption to people who are relying on your APIs not to change. In a nutshell, every release of your software has a three-part version number.

Given a version number MAJOR.MINOR.PATCH, increment the:

  1. MAJOR version when you make incompatible API changes,
  2. MINOR version when you add functionality in a backwards-compatible manner, and
  3. PATCH version when you make backwards-compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

Pretty simple, huh? In other words, 2.3.4 should offer the same feature set as 2.3.1 but with extra bug fixes. 2.4.0 is backwards-compatible with 2.3.2, and with 2.2.*, and with 2.1.*, and when you go for the ‘big rewrite’ and break everything, that’s version 3.0.0 which isn’t compatible with any previous versions.
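If it helps to see those rules written down as code, here's a minimal sketch - the type and member names are made up for illustration, this isn't any particular versioning library:

public enum ChangeKind { Breaking, Feature, BugFix }

public class SemVer {
    public int Major { get; private set; }
    public int Minor { get; private set; }
    public int Patch { get; private set; }

    public SemVer(int major, int minor, int patch) {
        Major = major; Minor = minor; Patch = patch;
    }

    // Breaking changes bump MAJOR and reset the rest; new features bump MINOR and
    // reset PATCH; bug fixes just bump PATCH.
    public SemVer Bump(ChangeKind change) {
        switch (change) {
            case ChangeKind.Breaking: return new SemVer(Major + 1, 0, 0);
            case ChangeKind.Feature:  return new SemVer(Major, Minor + 1, 0);
            default:                  return new SemVer(Major, Minor, Patch + 1);
        }
    }

    public override string ToString() {
        return string.Format("{0}.{1}.{2}", Major, Minor, Patch);
    }
}

So new SemVer(2, 3, 1).Bump(ChangeKind.BugFix) gives you 2.3.2, and new SemVer(2, 4, 7).Bump(ChangeKind.Breaking) gives you 3.0.0.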

Our main development pipeline is now based on TeamCity 8 and GitHub – both of which are absolutely wonderful once you get the hang of using them properly – and I wanted to automate the versioning of our builds according to semantic versioning principles. Here’s the workflow I settled on:

  • Every build has a MAJOR, MINOR, PATCH, and BUILD version, and an optional SUFFIX
  • Each part of the version number is managed as a TeamCity build parameter.
  • MAJOR and MINOR are managed manually. TeamCity can’t tell whether your commit is a feature or a bugfix, so for now we’ve just settled for manually changing the relevant parameters when you release a major or minor release.
  • All development is done on feature branches and submitted via pull requests.
  • Any pull request is assumed to be a PATCH release unless a developer changes the major or minor numbers when they accept it.
  • The SUFFIX is used to denote a pre-release package, and should indicate which branch’s features are included in that prerelease
    • e.g. Dylan.Web-1.2.3.0-branch7 should be the code you’d get if you accepted branch7 into master 1.2.3
  • Anything built off master should be releasable, and tagged as such
  • Finally, whenever a production release is built, we’ll tag the GitHub repository with the version of that release – for tagging, we ignore the build number, so the tag will be something like v0.0.0

This lot actually distils to quite a simple guideline: The patch number is the number of pull requests accepted since the last tagged minor release.

Now that sounds like something you could automate… so with Powershell on one screen and the GitHub API documentation on the other, I hacked together the following script, which is now happily running as a build step on our TeamCity server.

$MajorVersion = %MajorVersion% # The major version of your current build
$MinorVersion = %MinorVersion% # The minor version of your current build
$RepoUrl = "%vcsroot.url%" # The HTTPS path to your repo – https://github.com/MyCompany/MyRepository.git
$Token = "%GitHubAccessToken%" # Your GitHub access token, from https://github.com/settings/applications

try {
    # Parse the supplied Git repo URL to determine the repo name and owner
    $RepoUrl -match "https://github.com/([^/]+)/(.+)\.git$"
    $repoOwner = $matches[1]
    $repoName = $matches[2]
    Write-Host "Reading repo $repoName owned by $repoOwner"
       
    # Find the tag resource matching the baseline of the current major/minor version
    # Remember, we’re not looking for X.Y.Z, we’re looking for X.Y.0 since patch numbers
    # should increment until the next minor release.

    $uri = "
https://api.github.com/repos/$repoOwner/$repoName/git/refs/tags/v$MajorVersion.$MinorVersion.0?access_token=$Token"
    Write-Output "Looking for tag v$MajorVersion.$MinorVersion.0 at $uri"
    $tag = Invoke-RestMethod -Uri "$uri"

    # $tag.object.url will now give us a more detailed tag resource, including the commit that was tagged
    $uri = $tag.object.url
    Write-Output "Getting tag info from $uri`?access_token=$Token"
    $tag = Invoke-RestMethod -Uri "$uri`?access_token=$Token"
       
    # $tag.object.url is now a link to the tagged commit
    $uri = $tag.object.url
    $tag = Invoke-RestMethod -Uri "$uri`?access_token=$Token"

    # now we can dig into the commit itself and find out WHEN the baseline release was tagged...
    $since = $tag.committer.date

    # Now we can retrieve all the commits in this repo SINCE that date
    $commitsUri = "
https://api.github.com/repos/$repoOwner/$repoName/commits?since=$since&access_token=$Token"
    Write-Host "Retrieving commit log from $commitsUri"
    $commits = Invoke-RestMethod -Uri "$commitsUri"
    $merges = @($commits | Where-Object { $_.commit.message -match "^Merge pull request" })

    # Reversing the merges just means they show up in TeamCity’s build log in chronological order.
    # Which is nice.
    [Array]::Reverse($merges)

    $mergeCount = $merges.Count
    Write-Host "Found $mergeCount merges since last release tag"
    for($i = 0; $i -lt $merges.Count; $i++) {
        $merge_number = $i+1
        $merge_message = $merges[$i].commit.message.split("`r`n",[StringSplitOptions]"RemoveEmptyEntries") -join "`r`n    "
        $merge_sha = $merges[$i].sha
        $merge_web_url = $RepoUrl.Replace(".git", "/commit/$merge_sha")
        "`r`n  Merge #$merge_number`: $merge_web_url"
        "    $merge_message"
    }
    Write-Host "`r`n"
    Write-Output "##teamcity[setParameter name='PatchVersion' value='$mergeCount']"
} catch [System.Exception] {
    Write-Host "Exception trying to determine patch number from Github API"
    Write-Host $_
    Write-Host "Using default patch number 0"
    Write-Output "##teamcity[setParameter name='PatchVersion' value='0']"
}

A couple of fun details to look out for:

  • If we’re doing the first build of a new minor version, there won’t be any ‘baseline’ tagged version yet – which is fine; we just set the patch number to zero and off we go.
  • You can control TeamCity build parameters by outputting specially-formatted messages in your build log – that’s what those lines

    Write-Output "##teamcity[setParameters name='SomeParameter' value='NewValue']"

    are doing. See Build Script Interaction with TeamCity for more info.

Finally, if you'd like the commit/merge hyperlinks in your TeamCity build log to be clickable, check out this awesome tip from Christian Rodemeyer about how to embed clickable URL links in TeamCity build logs.

Happy versioning!

Monday, 19 August 2013

Building a ServiceStack-based OAuth2 resource server using DotNetOpenAuth

Last week, I asked on StackOverflow if anyone had used DotNetOpenAuth to secure a ServiceStack API using OAuth2. Well, it would appear not… so alongside a week of band rehearsals and a big old party last weekend, I’ve put a proof of concept up on GitHub that demonstrates how to use ServiceStack and DotNetOpenAuth together to create a REST API that’s secured using OAuth2. The app I’m working on is aimed specifically at existing customers, who will already have a login account on our site, so I don’t need to use OpenID at all. In fact, I spent a good few days puzzling over this before I realised my mistake. After digging through the various specs and innumerable blog posts, I decided to use the OAuth2 authorization framework – specifically, a feature of that framework called the resource owner password credentials grant.
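In case you haven't come across it, the password credentials grant is conceptually very simple: the client POSTs the resource owner's username and password to the authorization server's token endpoint, and gets back an access token to send on subsequent API calls. Here's a rough sketch of that exchange in C# - the endpoint URL and credentials are placeholders, and this is just the grant as defined in RFC 6749, not code from the OAuthStack project:

using System;
using System.Collections.Generic;
using System.Net.Http;

class TokenRequestExample {
    static void Main() {
        using (var http = new HttpClient()) {
            // The resource owner password credentials grant is a form-encoded POST
            // to the authorization server's token endpoint (RFC 6749, section 4.3).
            var form = new FormUrlEncodedContent(new Dictionary<string, string> {
                { "grant_type", "password" },
                { "username", "customer@example.com" },
                { "password", "their-password" }
            });
            var response = http.PostAsync("https://auth.example.com/oauth/token", form).Result;

            // A successful response is a JSON document containing (at least) an access_token,
            // which the client then sends with every API request as "Authorization: Bearer <token>".
            Console.WriteLine(response.Content.ReadAsStringAsync().Result);
        }
    }
}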

The DotNetOpenAuth project includes a couple of great OAuth2 samples, including an authorization server (the bit that does the customer login), but they’re built primarily around ASP.NET MVC or WCF. I’ve worked with both of these in the past, but recently I’ve switched to using the ServiceStack library to build REST APIs, and I absolutely love it. It’s easy to use, easy to extend, and wonderfully lightweight, so I wanted to use ServiceStack to build the data API that our app is going to use. I assumed it wouldn’t be too hard to build our own authorisation server, and then use a standard protocol like OAuth to control access to the resources exposed by this server.

There’s a working prototype / proof of concept up on GitHub now, at https://github.com/dylanbeattie/OAuthStack – complete with commentary on how the various integration points worked. OAuth2 is actually conceptually quite simple, but getting the two libraries to work together proved a little fiddly; both ServiceStack and DotNetOpenAuth use their own abstractions over the ASP.NET request/response objects, making me really wish Microsoft hadn’t made the .NET HTTP framework quite so inflexible back in the day… but after a week or so of tinkering, testing, poring over RFCs and bouncing around in the Visual Studio debugger, I’ve got something that seems pretty lightweight, doesn’t reinvent the wheel, and seems to satisfy the requirements.

Suffice to say that if DotNetOpenAuth and ServiceStack weren’t open source this kind of integration would have been practically impossible, so I want to say a huge thank you to the authors and maintainers of both of those projects. Thanks, guys. You rock :)

Monday, 24 June 2013

Linking Subversion Commits to Pivotal Tracker on Windows using Powershell

My team at work now keep our entire lives in PivotalTracker, and rather like it. We recently linked it to our GitHub repository using the built-in PivotalTracker integration that’s provided by GitHub, which worked really nicely. Thing is, we’re still in the process of migrating our codebase to GitHub – there’s lots of older projects with TeamCity jobs still linked to our in-house Subversion repo – and I wanted to add a similar facility for our projects that are still in Subversion.

This is all based on the Source Control Management Post-Commit Hook Integration feature in the Pivotal Tracker API, and getting it working required two scripts.

post-commit.bat – this is the standard batch file used by Subversion on Windows to run code after a successful commit. Subversion runs this automatically, and passes in two command-line arguments – the (local) path of the repository that accepted the commit, and the revision number created by that commit. Ours now looks like this:

@echo off
set ps=%SystemRoot%\System32\WindowsPowerShell\v1.0\powershell.exe
%ps% Set-ExecutionPolicy unrestricted
%ps% -command "D:\svn\data\repositories\projects\hooks\post-commit.ps1" -repopath %1 -revision %2

A couple of things to notice. First, we’re running the powershell Set-ExecutionPolicy as part of the script. This is brute force and there’s probably a nicer way of doing it, but since the Powershell script runs as the user who owns the Subversion server process, and that user doesn’t have interactive logon rights, I couldn’t see a quicker way of getting that script to run reliably. Ah, Powershell security, how we love you.

Second – we’re passing the two command-line arguments into the powershell script as named parameters. Subversion will call the batch file as:

post-commit.bat D:\svn\data\repositories\projects 12345

and we’re translating that into

powershell.exe -command post-commit.ps1 -repopath D:\svn\data\repositories\projects -revision 12345

The bit that actually does the heavy lifting is post-commit.ps1

param([string]$repopath, [string]$revision)   

$svnlook = "C:\Program Files\Subversion\bin\svnlook.exe"

$message = & "$svnlook" log $repopath -r $revision | Out-String

if ($message -match "#\d\d\d\d\d\d\d\d") {
    $reponame = Split-Path $repopath -leaf
    $author = & "$svnlook" author $repopath -r $revision | Out-String
   
    # Grab a reference to System.Web so we can HtmlEncode things.
    Add-Type -AssemblyName System.Web
    $author = [System.Web.HttpUtility]::HtmlEncode($author)
    $message = [System.Web.HttpUtility]::HtmlEncode($message)

    # We run ViewVC on our Subversion server to browse code via the web. Tweak this to suit,
    # or remove the <url></url> element from the XML completely.

    $url = "
http://subversion.mycompany.com/viewvc/$reponame`?view=revision&revision=$revision"

    $headers = @{"X-TrackerToken" = "put your Pivotal Tracker API Token Here"}
    # This is Powershell's "here-string" syntax for multiline string literals.
    $body = @"

<source_commit>
    <message>$message</message>
    <author>$author</author>
    <commit_id>$revision</commit_id>
    <url>$url</url>
</source_commit>
"@

    $r = Invoke-WebRequest -Uri http://www.pivotaltracker.com/services/v3/source_commits -ContentType "application/xml" -Method POST -Headers $headers -Body $body

    Write-Host $r
} else {
    # if nothing in the commit message matched the #12345678 syntax, then don’t bother calling the API.
}

To test it, first run the Powershell script directly from your command line. Say you’ve just committed revision 12345 to the “projects” repository, remembering to include [#12345678] in your commit comment where 12345678 is the PivotalTracker story ID.

PS D:\svn\data\repositories\projects\hooks> ./post-commit.ps1 -repopath D:\svn\data\repositories\projects -revision 12345

Your commit comment should appear in the story history within a few seconds.

Now check the batch file harness can run your Powershell properly – see the note below about revisions:

D:\svn\data\repositories\projects\hooks> post-commit.bat D:\svn\data\repositories\projects 12346

If that works, the whole kaboodle should work. A couple of caveats to watch out for:

  1. If your post-commit hook fails, your Subversion client will give you something cryptic about “Commit failed: MERGE … 200 OK” – which is not particularly transparent. Best to get things working on the command line first before testing via Subversion itself.
  2. I couldn’t get the same commit to produce multiple Pivotal tickets – when I was testing it, I had to create a fresh commit for each test run of the integration hook. This is possibly just Pivotal automatically de-duplicating identical stories – but worth knowing for test purposes.

Saturday, 23 March 2013

Adventures in Querying with RavenDB

For the last couple of months, we've been working on a project that uses RavenDB as the main data store. We went with Raven for several reasons - it looked like a pretty good fit for what we were doing, we were using NServiceBus which now includes Raven as its default persistence store, and we were keen to see what it could do. Generally, it's been really, really positive. However, RavenDB is the first document database I've ever used, and so once in a while the learning curve just leaves me scratching my head in frustration...

So, last week, a request comes in for a straightforward one-off report. We're using Raven to store details about media clips - video and audio files - that are linked to our customers' online CVs, and the product owner wanted a list of customers along with the total duration of the media clips linked to those customers' CVs.

In SQL, this would be

SELECT Customer, SUM(Duration) as TotalDuration FROM MediaClips GROUP BY Customer

The two developers who actually built the system weren't in the office last week, so this request ended up on my desk. "Ah, no problem", thinks me. After all - it's a database; how hard can it be? I'd spent a bit of time building custom indexes to drive an SLA report we built last week, and with quite a lot of Googling, managed to get some really impressive results out of Raven.

Thing is - I'm not building a feature here, I just need to copy & paste a bunch of numbers into a spreadsheet and get on with my life. And I have no idea how you do that in Raven. I spend an hour or so trying to get Ronnie Overby's RavenDB provider for LINQPad to install. This is a third-party library that isn't part of RavenDB, but I like LINQPad, I'm pretty comfortable with Linq query syntax, and this looked like a good place to start. Half an hour later, after downloading the source, compiling it, creating a certificate, signing it and getting LINQPad to recognise and install the driver, I gave up because I just couldn't work out how to get any data out of the damn thing.

Instead, I throw together a .NET console app and start playing around with indexes, working from "Using Linq to query RavenDB indexes". I get as far as creating a named map/reduce index that's pulling out the data I need - this takes a good hour or two of trial and error. The eureka moment is when I work out that you can define Raven indexes as raw LINQ code - they're not even embedded strings, they're actually strongly-typed, compiled Linq expressions that are passed in to the query definition. I get as far as this:

class Total {
  public string Customer { get; set; }
  public int Duration { get; set; }
}

public void BuildIndex() {
  var store = new DocumentStore() { ConnectionStringName = "raven" };
  store.Initialize();
  store.DatabaseCommands.PutIndex("ClipInfos/TotalDurations", new IndexDefinitionBuilder<ClipInfo, Total> {

    // Map each clip to a (Customer, Duration) pair...
    Map = clips => from clip in clips select new {
      Customer = clip.CustomerId,
      Duration = clip.DurationInSeconds
    },

    // ...then group by customer and sum the durations.
    Reduce = results => from result in results group result by result.Customer into g select new {
      Customer = g.Key,
      Duration = g.Sum(x => x.Duration)
    }
  });
}

Now, Raven encourages "safe" query patterns. Which means it'll only give you 128 records per query by default, and if you run more than 30 queries in a single session it'll fail with an exception telling you you're doing something dangerous. The only way I can find to actually retrieve all the data I need is to circumvent these defaults to increase the batch size from 128 to 1024, and then repeatedly use .Skip(1024 * batches).Take(1024) to pull records back 1,024 at a time and add them to my result set.

// "raven" is the IDocumentStore initialised earlier; "batches" counts the pages we've already read.
var batches = 0;
using (var session = raven.OpenSession()) {
  session.Advanced.MaxNumberOfRequestsPerSession = 512;
  while (true) {
    var totals = session.Query<Total>("ClipInfos/TotalDurations").Skip(1024 * batches).Take(1024).ToList();
    foreach (var total in totals) {
      // Do something useful.
    }
    if (totals.Count() < 1024) break;
    batches++;
  }
}

At this point, I'm thinking one of two things:

  1. This is fine. It's an ad-hoc requirement, and if we end up doing this a *lot* I should read up on the index replication bundle - a plug-in that, I gather, will export indexed data into a relational DB for easier querying.
  2. This is completely wrong. It should not take three hours and a dedicated console application to run a simple ad-hoc query against a set of Raven data. I am missing something obvious...

As somebody who's been writing SQL for nearly 20 years, it's really frustrating to hit a brick wall like this... and it's been a long day, and I'm getting fed up, and I wander over to Twitter to have a bit of a rant and @ayende pops up:

[Screenshot: a tweet from @ayende suggesting I try dynamic queries]

Ah. Dynamic Queries. That looks interesting. So I head over to Raven Studio and sure enough, there's a "Dynamic Queries" facility. So I paste in a simple query and hit "Execute":

[Screenshot: Raven Studio dynamic query returning no results]

No results? Huh? But there's tens of thousands of documents in there! At this point I give up and go to the pub.

This morning, I go back to it and notice the little "i" icon next to the Query header, and see this...

[Screenshot: the Raven Studio query help, explaining that queries use Lucene syntax]

Ah. I have no idea what Lucene syntax is... but that probably explains why none of my dynamic queries worked.
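(For anyone else who hasn't met it: Lucene queries are field:value expressions, so a dynamic query needs to look something like the sketch below. The field names are made up, and I'm going from memory of the 2.x client API, so treat the LuceneQuery call as an assumption rather than gospel.)

// Lucene syntax is field:value pairs, combined with AND/OR and [x TO y] ranges, e.g.
//   CustomerId:customers/1234
//   CustomerId:customers/1234 AND DurationInSeconds:[60 TO 300]
// The same syntax should work from code via the advanced session API:
using (var session = store.OpenSession()) {
  var clips = session.Advanced.LuceneQuery<ClipInfo>()
                     .Where("CustomerId:customers/1234")
                     .ToList();
}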

Raven is really impressive. We've managed to get it to do some very cool things; it's fast, and most of the time it Just Works. And there is an inevitable learning curve associated with adopting a new technology - particularly one that represents a paradigm shift in approach. It's kinda like learning Powershell and treating it as another CLR/.NET language... which is really, really frustrating, until you realize it's not "scriptable .NET", it's more like command.com on steroids. The shift from relational to document databases is much the same. That said - it would be really, really useful if the dynamic query tool in Raven Studio would distinguish between "I understand that query but could find no matching records" and "I have no idea what you are trying to do" - I have no idea how strict the Lucene query language or parser is, but knowing whether your query is syntactically valid or not would be a big help when you're trying to work out why it didn't return anything.
