OpenCalais followup

As I wrote a while ago, I’ve been experimenting with the Calais Auto Tagger plugin which interfaces WordPress with OpenCalais. After writing a post, you push a button and it looks at the text and comes up with some tags.

After running it over my entire site, I ended up with over 5,000 tags. Eat that, LiveJournal. However, there’s quite a lot of crap in there, and I’m going to need to do some cleanup.

For instance, the tagger seems to get confused by punctuation. I ended up with double quotes in some tags, because words it found interesting were at the start of a quote. It also picked out “56AM” as a tag, apparently from the time 3:56AM. Not too smart. Some tags also contain HTML entities, such as “&hellip”, but that could be a bug in the plugin rather than in OpenCalais.
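
Problems like these are the sort of thing a small post-filter could catch. Here is a rough sketch in Python of what I mean. The function names, the list of entity names, and the rule for spotting time fragments are all my own assumptions for illustration; none of this comes from the plugin or from OpenCalais.

```python
import re

# Hypothetical cleanup rules, chosen to match the glitches described above.
# Entity names are matched with or without the trailing semicolon, since the
# stray ones showed up as "&hellip" with no semicolon.
ENTITY_RESIDUE = re.compile(
    r"&(?:hellip|nbsp|quot|ldquo|rdquo|lsquo|rsquo|mdash|ndash);?"
)
# Matches leftovers of a clock time, such as "56AM" from "3:56AM".
TIME_FRAGMENT = re.compile(r"\d{1,2}\s?[AP]M", re.IGNORECASE)
# Straight and curly quotes, plus whitespace, to trim from both ends.
TRIM_CHARS = "\"'\u201c\u201d\u2018\u2019 \t"


def clean_tag(tag):
    """Return a cleaned-up tag, or None if the tag should be dropped."""
    tag = ENTITY_RESIDUE.sub("", tag)
    tag = tag.strip(TRIM_CHARS)
    if not tag or TIME_FRAGMENT.fullmatch(tag):
        return None
    return tag


def clean_tags(tags):
    """Clean a list of tags, dropping rejects and duplicates, keeping order."""
    cleaned = (clean_tag(t) for t in tags)
    return list(dict.fromkeys(t for t in cleaned if t is not None))
```

Calling `clean_tags` on a list of raw tags would trim the stray quotes, throw away anything that is only a time fragment or an entity, and collapse any duplicates that the trimming creates. The entity list is deliberately short and explicit so that a legitimate tag containing an ampersand is left alone.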

On the plus side, it mostly does a good job of picking out noteworthy people’s names. However, it didn’t detect that “Obama” refers to Barack Obama, which I think would be a pretty good optimization to add for this upcoming year.

It loves advertising. I suspect that’s what it’s really being used for. I ended up with tags for advertising, advertising agencies, advertising agency, advertising campaign, advertising campaigns, advertising inserts, advertising money, advertising pages, advertising revenue, advertising space, and even advertising wrap, even though I don’t talk about most of those things. (Well, until now.)
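
Some of those are just singular and plural forms of the same tag, which could be found automatically. The following is a naive sketch in Python, not anything the plugin does: the helper names are hypothetical, and the plural rule only handles a trailing “s” or “ies” on the last word, so it will miss or mangle irregular plurals.

```python
def singular_key(tag):
    """Build a crude grouping key by de-pluralizing the last word of a tag."""
    words = tag.lower().split()
    if not words:
        return ""
    last = words[-1]
    if last.endswith("ies") and len(last) > 4:
        # agencies -> agency
        last = last[:-3] + "y"
    elif last.endswith("s") and not last.endswith("ss") and len(last) > 3:
        # campaigns -> campaign
        last = last[:-1]
    words[-1] = last
    return " ".join(words)


def group_variants(tags):
    """Map each grouping key to the list of original tags that share it."""
    groups = {}
    for tag in tags:
        groups.setdefault(singular_key(tag), []).append(tag)
    return groups


def merge_candidates(tags):
    """Return only the groups that contain more than one tag."""
    return {k: v for k, v in group_variants(tags).items() if len(v) > 1}
```

Running `merge_candidates` over the full tag list would give a list of pairs to review by hand, which seems safer than merging them blindly.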

Another weirdness is that it loves tagging things with “USD”, which I assume refers to the currency. It tagged over 217 posts with that. Anything which mentions an amount of money with a ‘$’ symbol seems to get tagged with it, which makes some sense, but isn’t terribly useful. As a human, I’d expect the ‘USD’ tag to be reserved for posts that actually concern international currency exchange.

There are also topics it seems totally hopeless at finding keywords for. I’ve written several entire posts about the budgerigar, for example, and it hasn’t picked out that “budgerigar” or “parakeet” would be a good tag for them.

So overall, it’s useful, but don’t expect to be able to throw a lot of text at it and get good tags out without doing some post-filtering. It does, however, seem to pick up things that Bayesian analysis isn’t good at spotting.