I was at the 2nd Linked Data Meetup London last week. It was great, and there was a lot of consensus and momentum built, but one of the smallest remarks by Tom Scott generated the most debate (partly egged on by me).
Tom was discussing how the BBC uses “the web as its CMS” for some topics, such as using Music Brainz as a source for discography information, and Wikipedia as a source for introductions on various topic pages, such as those on the BBC Wildlife Finder about animals. As well as simply using the data though, the BBC also uses the identifiers from those sites in its URLs. For Music Brainz, these look like long strings of random letters, numbers and hyphens (the Beatles, for instance, are b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d), which then form the end part of a URL at http://www.bbc.co.uk/music/artists (eg see the page for the Beatles). Wikipedia, though, uses ‘human readable’ identifiers, which look like words, with capital letters where appropriate, separated by underscores. So, for instance, the identifier for the Ursus arctos, commonly known as the ‘brown bear’, is simply Brown_Bear, and this too is used in the URLs for both the Wikipedia and BBC pages for that animal.
There are some good, pragmatic reasons why Music Brainz chose random strings, but Wikipedia chose words. Music Brainz has data on far more artists, albums and tracks than there are Wikipedia pages, and a lot of their data is generated far more automatically and quickly than Wikipedia’s (eg by ripping CDs and uploading ID3 tags). Wikipedia pages, by contrast, are all created by hand, and there’s a strong requirement to have only one page per topic, so matching page names to URLs is both doable and helps to enforce unique page titles.
The biggest problem with URLs that use human-readable paths and identifiers, though, is that names can and do change. And if the name is tied to the URL, then when a name changes, you either leave the URL as it is (and live with the mis-match), or update the URL, with a redirect from the old one (and live with the fact that this stops all the URLs being ‘permanent’). Wikipedia currently follows the second option. WordPress, by default, follows the first option (although you can manually re-name URL slugs to override this if you like).
So, you can see that there are pros and cons to having human-readable vs non-human-readable URIs. Within the realm of Linked Data, URIs are really important (as they identify concepts, as well as web pages), and so the design of URLs has to be even more carefully considered. To return to the meetup last week, Tom’s comment, which has prompted a fair bit of debate on this topic, was that “persistance beats human-readability”, and he seemed to regret having human-readable URIs on the BBC Wildlife Finder*. I disagreed, and a debate then ensued on the relative merits of human-readability vs persistence in URI design.
There were a range of opinions on this. Chris Sizemore commented that “I’m honest, I’m not as worried abt persistance as I prob shld b. Web got by, why can’t SemWeb?” Tom replied that “of course web of doc does get broken coz urls change. It’s quite expensive to keep link checking (incl what doc it points to)” and then suggested that “the whole opaque vs human readable URL thing is largely religious. Experience dictates which you think is more important”.
There followed a bunch of comments suggesting that it was more a pragmatic, rather than a “religious” decision, weighing up the relative costs and benefits of each approach, and Michael Smethurst pointed out that creating human-readable identifiers can be expensive.
I then suggested that “not even opaque URLs can be 100% persistent. Concepts change, merge and split”, and Chris Sizemore added “we should design [for] less than 100% persistance in URIs [and] promote ‘healing’ mechanisms”. I’m not sure that there are any real answers in this area yet – it’s still something to be investigated, and something the Linked Data community will have to deal with. I suggested that “HTTP already supplies some of the healing mechanisms: redirects, 300 Multiple Choice, 410 Gone, etc” and Michael suggested that “changed concepts, merged concepts and split concepts are new concepts so need new uris…”, which are two contrasting but not un-complementary methods. (Take the example where a company ‘splits’, by spinning out a part of its business into a new one, and keeping the rest running under the same name and brand – whether the larger part of the ‘split’ should retain the same URI is a decision that can either be philosophical or pragmatic).
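As a rough illustration of those HTTP ‘healing’ mechanisms, here is a minimal Python sketch. The paths, the four lookup tables and the resolve function are all hypothetical, invented for this example; it only shows which status code might answer each situation (a rename, a split, a retirement), not how any real site does it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class Response:
    status: int
    # Target URIs: one for a 301 redirect, several for 300 Multiple Choices.
    locations: List[str] = field(default_factory=list)

# Hypothetical registry of what has happened to each URI over time.
LIVE: Set[str] = {"/company/acme", "/company/megacorp-retail", "/company/megacorp-bank"}
MOVED: Dict[str, str] = {"/company/acme-ltd": "/company/acme"}  # concept renamed
SPLIT: Dict[str, List[str]] = {                                  # concept split in two
    "/company/megacorp": ["/company/megacorp-retail", "/company/megacorp-bank"],
}
GONE: Set[str] = {"/company/defunct"}                            # deliberately retired

def resolve(path: str) -> Response:
    """Pick an HTTP status that tells the client what became of a URI."""
    if path in LIVE:
        return Response(200)
    if path in MOVED:
        return Response(301, [MOVED[path]])   # Moved Permanently
    if path in SPLIT:
        return Response(300, list(SPLIT[path]))  # Multiple Choices
    if path in GONE:
        return Response(410)                  # Gone, and we know it
    return Response(404)                      # never heard of it
```

In this sketch the split company gets a 300 pointing at both successors, which is one way of siding with Michael’s view that split concepts need new URIs while still giving old links somewhere to go.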
I’ve come to the conclusion that URL design has to be done pragmatically, balancing lots of different factors, including readability of the identifiers, the ‘hackability’ of the path structure (ie the number of slashes to include), overall length, ease/costs of producing them, and overall the effect that all these things have on the ability for the URL to be ‘permanent’. It may be one of the axioms of the web that URIs are opaque, and that machines “should not look at the contents of the URI string to gain other information”, but there are lots of ways in which humans don’t follow this principle:
There are probably other ways in which URLs are treated as non-opaque by humans too.
I’m interested in how various sites and services negotiate this minefield of options and trade-offs. Here are some examples which I think are interesting:
http://www.flickr.com/photos/, even for videos – which just goes to show that sometimes your website might change in ways that your initial URL design didn’t anticipate, which you’ll then need to make a decision on.
I might try and add to these lists as I discover other interesting examples, but for now, my message is that you have to think hard about URL design, weigh up the different options, and then be pragmatic about it…
P.S I’ve just remembered that I wrote a post about ‘how the BBC iPlayer broke its URLs’ a couple of years ago, which might also be relevant to this discussion…
* Tom later clarified that the bit he regretted was “including the /species/ etc. in the URL”, rather than re-using Wikipedia identifiers, though others have suggested there have been problems with this too.
Other blog posts on this topic:
Updates: (9:31PM) Added a couple of extra links and examples. (9:54pm) Added link to Matt’s blog post. (Mar 1) Added P.S with link to an old blog post of mine.
karl said:
In your paragraph
“one of the axioms of the web that URIs are opaque, and that machines “should not look at the contents of the URI string to gain other information”, but there are lots of ways in which humans don’t follow this principle”
Not only humans, in fact. The first item in your list talks about Google, which has changed the way the Web is made a great deal. In commercial environments (aka Web Agencies), SEO (the capacity to have better findability) touches the content organization but also the words in URIs. So often the SEO person will not only recommend the way to architect content in the page, but also the words that must be in the URI. It is basically an additional constraint to the list you created.
* Persistence
* Readability
* Findability
kL said:
Every programming problem can be solved by adding one more layer of abstraction ;)
I have an archive of names used in URLs and each human-readable name is mapped to a permanent identifier *or a redirect*.
This lets me use human-readable URLs, rename things as I like, and I never break any URL.
Of course collisions happen, and have to be resolved by choosing a slightly different URL name.
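One possible reading of the scheme described in this comment, sketched in Python. The SlugArchive class, its method names and the example names are hypothetical and are not taken from the commenter’s actual system; the sketch assumes that every name ever used stays reserved for the item it first pointed at.

```python
from typing import Dict, Optional, Tuple

class SlugArchive:
    """Keeps every human-readable name ever used, so old URLs keep working."""

    def __init__(self) -> None:
        self._slug_to_id: Dict[str, int] = {}  # every name ever used -> permanent id
        self._current: Dict[int, str] = {}     # permanent id -> name in use today

    def assign(self, item_id: int, slug: str) -> None:
        """Give an item a (new) name; old names stay in the archive."""
        owner = self._slug_to_id.get(slug)
        if owner is not None and owner != item_id:
            # Collision: the name is, or once was, used by another item.
            raise ValueError(f"{slug!r} is already taken, pick a different name")
        self._slug_to_id[slug] = item_id
        self._current[item_id] = slug

    def lookup(self, slug: str) -> Tuple[int, Optional[int], Optional[str]]:
        """Return (HTTP status, permanent id, current name) for a requested name."""
        item_id = self._slug_to_id.get(slug)
        if item_id is None:
            return (404, None, None)
        current = self._current[item_id]
        status = 200 if current == slug else 301  # old names redirect
        return (status, item_id, current)
```

For example, after `assign(1, "Brown_Bear")` followed by `assign(1, "Ursus_arctos")`, a lookup of the old name reports a 301 to the new one, and trying to give `"Brown_Bear"` to a different item raises the collision error.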
Ben Truyman said:
Just to add to this, a couple of weeks ago we were reviewing our analytics reporting of pages that resulted in 404s and found a large number of users were in fact modifying the URL to, for example, go one level up.
Imagine the page they’re currently on has a URL of:
http://client.com/section/subsection/_A0f4d
We actually found them to be doing things like removing chunks to get to what they thought would be the section’s page:
http://client.com/section/
In our case, this resulted in a 404 as the URLs require an identifier at the end. It was interesting to say the least.
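To make that behaviour concrete, here is a small Python sketch that lists the ‘one level up’ URLs a visitor might try by trimming a path. The ancestor_urls function is a hypothetical helper written for this illustration; it uses only the standard urllib.parse module and says nothing about how the site in question is actually built.

```python
from typing import List
from urllib.parse import urlsplit, urlunsplit

def ancestor_urls(url: str) -> List[str]:
    """List the URLs reached by chopping path segments off the end, nearest first."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    ancestors = []
    # Drop one trailing segment at a time, down to the site root.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth:
            path += "/"  # keep the trailing slash users tend to leave behind
        ancestors.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return ancestors
```

A site that wants its URLs to be ‘hackable’ could check that each of these answers with a page or a redirect rather than a 404.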
John S. Erickson, Ph.D. said:
Thank you Frankie for what I think is a helpful post. Note that a key facet — minefield? — of this conversation that you haven’t really explored is the role played by persistent identifier systems such as the DOI, based on the Handle System.
One point that might be lost in your comments is that persistence is orthogonal to opacity: there are both persistent identifier schemes (like the DOI) that can accommodate human-readable syntax, as well as obviously many, many opaque identifier schemes that do nothing for persistence.
In the DOI ecosystem it has long been said that persistence is about policy, not about technology. The DOI was created as a way for (primarily) the publishing industry to have a management layer that would mirror their business practices, which included moving publications around their houses and transitioning imprints to other houses. Those practices also include the care and feeding of metadata supply chains, esp. with discovery systems and retailers, which the DOI facilitates.
The proper and timely management of DOI records, managed in infrastructure run by registration agencies such as CrossRef, is a key part of what makes persistence happen. Equally important are the stakeholders ensuring that the resources “pointed to” by the DOI record are valid. The structure of the identifier has nothing to do with it, whether “human readable” or opaque.
onpause said:
re: “Of course collisions happen, and have to resolved by choosing slightly different URL name” imagine how much that costs when you need to maintain literally millions of URLs… i.e. it’s too expensive to bother with… but i’m glad for you that your site is of the size and scope that you CAN do it…
re: “DOI” — um, that’s failed, it’s way too brittle. use HTTP URIs as identifiers, and try to stay permanent and persistent, it’s the best option. URIs are IDs for Things, which can include documents, but also non-documents…
re: “So often the SEO person will not only recommend the way to architect content in the page, but also the words that must be in the URI.” those SEO persons are rubbish and wrong, words in URLs don’t influence a page’s rank in Google. they do influence the likelihood of a click-thru if a Google user does notice your link in a search result, so it’s not worth nothing. just not as important as persistence and pointability/linkability which are FAR MORE important for SEO.
night night.
Martin said:
Another example to add to your list is the DOI system, now pervasive throughout academia.
http://dx.doi.org/
is the ‘resolver’ for these.