Culture Hack Data

I’ve been working with Culture Hack recently, on a project exploring open data for arts, culture and heritage organisations.

That’s a topic quite familiar to me, as it’s something I discussed quite a bit whilst working at the Science Museum, many moons ago (I don’t think we used quite that term, but the idea was the same). So it’s been interesting to see how things have changed since.

Culture Hack is a programme run by Caper (and Sync in Scotland) which brings hackers together with arts organisations to build prototypes and hacks, usually over the course of an event. I attended the ‘North’ event in Leeds a couple of years back.

One issue that developers have faced at the events is knowing what data and content is available. This has sometimes been addressed by distributing USB sticks with links, spreadsheets and data dumps on them. That’s okay in the short term, but isn’t ideal.

Thanks to some funding from the TSB, Culture Hack commissioned me and Kim Plowright to create a new web resource listing and describing some of the best cultural open data out there.

The site launched last week, and is visible at data.culturehack.org.uk.

Culture Hack Data website - showing a search box, filters, and some example results
The Culture Hack Data site

When planning the project, we didn’t want it to be a complex database. Instead, we’ve kept things simple. Our aim is to get people to the actual data as quickly as possible, and to describe the sources well.

The data sources are curated into eight broad categories: art, literature, music, performance, fashion, design, media and history, so if you’ve got a particular interest in one area, it’s easy to see what’s available.

A second key factor we identified is the size of the dataset. This is crucial if you’re in a hurry (at a hack event, say), as dealing with huge datasets requires a different set of tools (and often, more time) than smaller datasets, which can often simply be manipulated within spreadsheets. So we classify datasets into Small (less than 10 thousand records), Medium (10 thousand to 1 million records) and Huge (more than a million records).
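As a rough illustration, the size bands above could be expressed as a small function. This is only a sketch: the function name is made up for this example, and treating exactly one million records as Medium is my reading of the boundaries, not something the site's code is known to do.

```python
def size_category(record_count: int) -> str:
    """Return the size band for a dataset with the given number of records.

    Hypothetical helper, not taken from the Culture Hack Data codebase.
    """
    if record_count < 0:
        raise ValueError("record_count must be non-negative")
    if record_count < 10_000:
        # Small: less than 10 thousand records
        return "Small"
    if record_count <= 1_000_000:
        # Medium: 10 thousand to 1 million records (upper bound assumed inclusive)
        return "Medium"
    # Huge: more than a million records
    return "Huge"
```

A dataset of 2,500 records would come out as Small, which is the kind you could reasonably open in a spreadsheet at a hack event.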

We’ve also labelled datasets with the formats they’re in (e.g. XML or JSON), and the rights they’re released under, such as the various Creative Commons licences.

Where possible, we’ve included a sample of the data you can download too. This is especially useful for datasets you have to register to access.

We’ve included all the relevant sources that we know of, but the list will build over time. Thinking about ways for people to contribute was one of the trickiest aspects of the project, and I was keen to avoid building lots of authoring, editing and moderation features without knowing how the site would be used.

Instead, we’ve opted for an approach of accepting quick suggestions via e-mail (or Twitter), and for more involved collaboration, pointing people at the GitHub repository where all the code and content is hosted.

Using Git for managing content is a bit of a trend – it’s a way of getting full version control and change tracking, without needing complex CMS interfaces. There’s also a growing ecosystem of apps, GUIs and tools around it (and GitHub). In particular, we’ve experimented with using prose.io – a lovely stripped-back editing interface for GitHub files.

Of course, Git is pretty developer-centric at the moment. Whilst that’s the main audience for the Culture Hack site, we also wanted to encourage participation from the institutions themselves. To this end, there are some instructions on using GitHub, which hopefully take people through the process step by step.

One final quirky note about the project: we’ve given each dataset on the website its own numerical ID, which you might notice in the URL. Rather than simply start from 1 though, we’re using ‘Artisanal Integers’. This is a fancy name for a simple idea: a collection of web services which generate unique numbers on request. This is useful because it avoids us accidentally assigning the same number to two datasets (which might happen if two people are adding pages at the same time).
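To show why a single number-issuing service avoids clashes, here is a toy stand-in written in Python. It does not call any real Artisanal Integers service, and the class name, starting value and thread count are all invented for the example. The only point is that when every request goes through one issuer, two people asking at the same moment still get different numbers.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class IntegerMint:
    """Toy in-process stand-in for a number-issuing service (hypothetical)."""

    def __init__(self, start: int = 1) -> None:
        self._next = start
        self._lock = threading.Lock()

    def mint(self) -> int:
        # The lock ensures only one caller at a time reads and advances
        # the counter, so no number is ever handed out twice.
        with self._lock:
            value = self._next
            self._next += 1
            return value


# Simulate many contributors requesting IDs at the same time.
mint = IntegerMint(start=1000)
with ThreadPoolExecutor(max_workers=8) as pool:
    ids = list(pool.map(lambda _: mint.mint(), range(200)))

# Every issued ID is distinct, regardless of the order requests arrived in.
assert len(set(ids)) == len(ids)
```

The real services are remote and shared between many sites, so the numbers we receive are not consecutive, but the guarantee we rely on is the same: no two requests get the same integer.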

Finally, in the spirit of open data, all of the metadata on the site, as well as our descriptions, is freely available without copyright restrictions.