A problem for tagging: morphology

Tags are hot. If you don't know what they are, check out this and this and this.

Other websites that have tags include 43 things and its sister site 43 places.

Anyway, onto the point of this post, the effect that morphology can have on tagging. If you don't know what morphology is, perhaps you'll pick it up from this screenshot:

[Image no longer available]

This tag zeitgeist is from the 43 places 'ideas' site. The highlighted pairs of tags differ only by a suffix, mostly one marking plurality or tense (affixes, such as prefixes and suffixes, are a central part of linguistic morphology).

Because tags are based around unique letter-strings, they are both ambiguous (bank has at least two meanings) and non-unique (compare eggplant vs aubergine). However, these are facts about language generally, and so there's not too much you can do about them (trying to map words to meanings would be a near-impossible task).

However, calculating plurals is a relatively simple task for computers to do (in English). You start by specifying the general rule (+s), then some alternative rules (-y -> +ies), and finally a list of irregular plurals (which you can probably buy in). Then you simply test the rules in reverse order: first check whether the word is irregular, then see whether any of the alternative rules apply, and otherwise apply the general rule. I managed to code the whole thing in Prolog in a couple of hours, for a course I took a year ago. Of course, you'd also have to be able to spot which tags aren't nouns and don't have a plural...
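
To make the order of the checks concrete, here is a minimal sketch in Python (not the Prolog version mentioned above, which I'm not reproducing here). The irregular list is a tiny hypothetical sample, and the extra conditions (only turning -y into -ies after a consonant, and adding -es after endings like -ch or -x) are my own assumptions about which alternative rules you'd want. It also assumes the input is a lowercase noun.

```python
# Hypothetical, deliberately tiny sample; a real list of irregulars is much longer.
IRREGULAR = {
    "child": "children",
    "person": "people",
    "mouse": "mice",
    "foot": "feet",
}

VOWELS = "aeiou"


def pluralize(noun: str) -> str:
    """Guess the English plural of a lowercase noun."""
    # 1. Irregular plurals win over every rule.
    if noun in IRREGULAR:
        return IRREGULAR[noun]

    # 2. Alternative rules.
    # Consonant + y -> ies (assumption: "day" keeps its y, "city" does not).
    if len(noun) > 1 and noun.endswith("y") and noun[-2] not in VOWELS:
        return noun[:-1] + "ies"
    # Sibilant endings take -es (assumption: a common extra rule).
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"

    # 3. General rule: just add s.
    return noun + "s"


if __name__ == "__main__":
    for word in ["tag", "city", "day", "beach", "child"]:
        print(word, "->", pluralize(word))
```

Note that this does nothing about the last problem mentioned above: it will happily "pluralize" a tag that isn't a noun at all.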

Whether you'd want to use this for websites which feature tags is another question. In some cases, e.g. with photos, it might make sense to keep a difference between singular and plural. In other cases, it might be useful to merge the tags. At the very least, it'd be good to be able to generate links between the tags.
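
As a rough illustration of the "generate links" option, the sketch below groups tags that reduce to the same guessed singular form, so a site could show each one as related to the others. The function names `singular_guess` and `related_tags` are hypothetical, the suffix stripping is deliberately crude (it ignores irregular plurals and will get words like "buses" wrong), and it doesn't touch tense at all.

```python
from collections import defaultdict


def singular_guess(tag: str) -> str:
    """Crudely strip a plural suffix; irregular plurals are not handled."""
    tag = tag.lower()
    if tag.endswith("ies") and len(tag) > 3:
        return tag[:-3] + "y"          # cities -> city
    if tag.endswith(("xes", "ches", "shes")):
        return tag[:-2]                # beaches -> beach
    if tag.endswith("s") and not tag.endswith("ss"):
        return tag[:-1]                # places -> place
    return tag


def related_tags(tags):
    """Map each guessed singular to the tags sharing it (groups of 2+ only)."""
    groups = defaultdict(set)
    for tag in tags:
        groups[singular_guess(tag)].add(tag)
    return {stem: sorted(group) for stem, group in groups.items() if len(group) > 1}


if __name__ == "__main__":
    print(related_tags(["beach", "beaches", "city", "cities", "museum", "food"]))
```

Because this only links tags and never rewrites them, it leaves the singular/plural distinction intact for sites where that difference matters.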

Interestingly, in the image above, there doesn't seem to be any pattern as to whether the singular or the plural version is used more.

On this weblog, I've mostly used plural tags.