Pragmatism in URL design

I was at the 2nd Linked Data Meetup London last week. It was great, and there was a lot of consensus and momentum built, but one of the smallest remarks by Tom Scott generated the most debate (partly egged on by me).

Tom was discussing how the BBC uses "the web as its CMS" for some topics, such as using MusicBrainz as a source for discography information, and Wikipedia as a source for introductions on various topic pages, such as those on the BBC Wildlife Finder about animals. As well as simply using the data, though, the BBC also uses the identifiers from those sites in its URLs. For MusicBrainz, these look like long strings of random letters, numbers and hyphens (the Beatles, for instance, are b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d), which then form the end part of a URL (eg see the page for the Beatles). Wikipedia, though, uses 'human readable' identifiers, which look like words, with capital letters where appropriate, separated by underscores. So, for instance, the identifier for Ursus arctos, commonly known as the 'brown bear', is simply Brown_Bear, and this too is used in the URLs for both the Wikipedia and BBC pages for that animal.
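
To make the two identifier styles concrete, here is a rough Python sketch. The function names are made up for illustration, and this is not how either site actually mints its identifiers; it only assumes that the opaque style is a random UUID-shaped string and that the readable style is a page title with its spaces swapped for underscores.

```python
import uuid


def opaque_identifier() -> str:
    """Hypothetical helper: a random, meaningless identifier.

    Produces a 36-character string of hex digits and hyphens,
    the same general shape as the Beatles identifier quoted above.
    """
    return str(uuid.uuid4())


def readable_identifier(title: str) -> str:
    """Hypothetical helper: turn a page title into a word-based identifier.

    Collapses runs of whitespace and joins the words with underscores,
    keeping whatever capitalisation the title already has.
    """
    return "_".join(title.split())


if __name__ == "__main__":
    print(opaque_identifier())
    print(readable_identifier("Brown Bear"))
```

The opaque version needs no knowledge of the thing being named, which is part of why it is cheap to generate automatically; the readable version is only as stable as the title it was built from.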

There are some good, pragmatic reasons why MusicBrainz chose random strings, but Wikipedia chose words. MusicBrainz has data on far more artists, albums and tracks than there are Wikipedia pages, and a lot of their data is generated far more automatically and quickly than Wikipedia's (eg by ripping CDs and uploading ID3 tags). Wikipedia pages, on the other hand, are all created by hand, and there's a strong requirement to only have one page per topic, so matching page names to the URLs is both doable and helps to enforce unique page titles.

The biggest problem with URLs that use human-readable paths and identifiers, though, is that names can and do change. And if the name is tied to the URL, then when a name changes, you either leave the URL as it is (and live with the mismatch), or update the URL, with a redirect from the old one (and live with the fact that this stops all the URLs being 'permanent'). Wikipedia currently follows the second option. WordPress, by default, follows the first option (although you can manually rename URL slugs to override this if you like).
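
The second option (rename the URL and redirect from the old one) can be sketched in a few lines of Python. This is a toy, in-memory illustration under my own assumptions: the SlugRegistry class and its method names are hypothetical, and it is not a description of how Wikipedia or WordPress implement renames.

```python
class SlugRegistry:
    """Toy registry of human-readable slugs, with redirects for renamed pages."""

    def __init__(self):
        self._pages = {}      # current slug -> page title
        self._redirects = {}  # old slug -> the slug that replaced it

    def add(self, slug, title):
        self._pages[slug] = title

    def rename(self, old_slug, new_slug, new_title):
        """Move a page to a new slug and leave a redirect behind."""
        self._pages.pop(old_slug)
        self._pages[new_slug] = new_title
        # If the new slug used to redirect somewhere, it is a live page again.
        self._redirects.pop(new_slug, None)
        self._redirects[old_slug] = new_slug

    def resolve(self, slug):
        """Return (status, slug): 200 for a live page, 301 if we had to
        follow at least one redirect, 404 if nothing is known."""
        status = 200
        seen = set()
        while slug not in self._pages:
            if slug not in self._redirects or slug in seen:
                return 404, None
            seen.add(slug)
            slug = self._redirects[slug]
            status = 301
        return status, slug


if __name__ == "__main__":
    registry = SlugRegistry()
    registry.add("Brown_Bear", "Brown Bear")
    registry.rename("Brown_Bear", "Ursus_arctos", "Ursus arctos")
    print(registry.resolve("Brown_Bear"))
    print(registry.resolve("Ursus_arctos"))
```

The old URL keeps working, but only for as long as someone maintains the redirect table, which is the cost being traded against the mismatch of the first option.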

So, you can see that there are pros and cons to having human-readable vs non-human-readable URIs. Within the realm of Linked Data, URIs are really important (as they identify concepts, as well as web pages), and so the design of URLs has to be even more carefully considered. To return to the meetup last week, Tom's comment, which has prompted a fair bit of debate on this topic, was that "persistance beats human-readability", and he seemed to regret having human-readable URIs on the BBC Wildlife Finder*. I disagreed, and a debate then ensued on the relative merits of human-readability vs persistence in URI design.

There was a range of opinions on this. Chris Sizemore commented that "I'm honest, I'm not as worried abt persistance as I prob shld b. Web got by, why can't SemWeb?" Tom replied that "of course web of doc does get broken coz urls change. It's quite expensive to keep link checking (incl what doc it points to)" and then suggested that "the whole opaque vs human readable URL thing is largely religious. Experience dictates which you think is more important".

There followed a bunch of comments suggesting that it was more a pragmatic, rather than a "religious" decision, weighing up the relative costs and benefits of each approach, and Michael Smethurst pointed out that creating human-readable identifiers can be expensive.

I then suggested that "not even opaque URLs can be 100% persistent. Concepts change, merge and split", and Chris Sizemore added "we should design [for] less than 100% persistance in URIs [and] promote 'healing' mechanisms". I'm not sure that there are any real answers in this area yet - it's still something to be investigated, and something the Linked Data community will have to deal with. I suggested that "HTTP already supplies some of the healing mechanisms: redirects, 300 Multiple Choice, 410 Gone, etc" and Michael suggested that "changed concepts, merged concepts and split concepts are new concepts so need new uris...", which are two contrasting but not un-complementary methods. (Take the example where a company 'splits', by spinning out a part of its business into a new one, and keeping the rest running under the same name and brand - whether the larger part of the 'split' should retain the same URI is a decision that can either be philosophical or pragmatic).
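
As a rough illustration of how those HTTP 'healing' mechanisms might line up with concepts that change, merge and split, here is a small Python sketch. The Concept class, its state names and the example URIs are all my own hypothetical choices, and the mapping is one possible reading of the discussion rather than an agreed Linked Data practice.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """Hypothetical record for a concept identified by a URI."""
    uri: str
    state: str = "active"  # one of: "active", "merged", "split", "gone"
    successors: list = field(default_factory=list)  # URIs that replace this one


def healing_response(concept):
    """Pick an HTTP status code and target URIs for a request to concept.uri.

    Assumed mapping:
      active -> 200, served as normal
      merged -> 301, redirect to the single surviving URI
      split  -> 300, offer every successor and let the client choose
      gone   -> 410, retired with no replacement
    """
    if concept.state == "active":
        return 200, []
    if concept.state == "merged":
        return 301, concept.successors[:1]
    if concept.state == "split":
        return 300, list(concept.successors)
    if concept.state == "gone":
        return 410, []
    raise ValueError(f"unknown state: {concept.state!r}")


if __name__ == "__main__":
    # Made-up example: a company that has split into two.
    old_company = Concept(
        uri="/company/acme",
        state="split",
        successors=["/company/acme-retail", "/company/acme-labs"],
    )
    print(healing_response(old_company))
```

In the company example from the paragraph above, the pragmatic choice would be to leave the larger part "active" under its old URI and only mint one new URI, while the stricter choice is the "split" case shown here.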

I've come to the conclusion that URL design has to be done pragmatically, balancing lots of different factors, including readability of the identifiers, the 'hackability' of the path structure (ie the number of slashes to include), overall length, ease/costs of producing them, and overall the effect that all these things have on the ability for the URL to be 'permanent'. It may be one of the axioms of the web that URIs are opaque, and that machines "should not look at the contents of the URI string to gain other information", but there are lots of ways in which humans don't follow this principle:

There are probably other ways in which URLs are treated as non-opaque by humans too.

I'm interested in how various sites and services negotiate this minefield of options and trade-offs. Here are some examples which I think are interesting:

I might try and add to these lists as I discover other interesting examples, but for now, my message is that you have to think hard about URL design, weigh up the different options, and then be pragmatic about it...

P.S I've just remembered that I wrote a post about 'how the BBC iPlayer broke its URLs' a couple of years ago, which might also be relevant to this discussion...

* Tom later clarified that the bit he regretted was "including the /species/ etc. in the URL", rather than re-using Wikipedia identifiers, though others have suggested there have been problems with this too.

Other blog posts on this topic:

Updates: (9:31PM) Added a couple of extra links and examples. (9:54pm) Added link to Matt's blog post. (Mar 1) Added P.S with link to an old blog post of mine.