I recently did some work on the back end of my web sites. I consolidated all the individual WordPress installs into a single multi-user one, cleaned up the database to free up disk space, and slimmed down the number of plugins. I’m taking advantage of Automattic’s Jetpack plugin to provide functionality that previously required a bunch of third-party plugins, including:

  • Markdown support (including in comments)
  • “Like” buttons for social network sharing
  • Mobile device support
  • Push notifications when someone comments
  • Comment login via social networks
  • E-mail subscriptions

It wasn’t long before I got some mild negative feedback: changing the login system meant that some legitimate comments got flagged as spam, so I had to go in and unflag them.

It is, of course, a pain when you write some carefully thought-out comment, only to have the system apparently drop it into a black hole. I understand that, as I have that experience myself on other sites. However, if you don’t run your own web site, you might not be aware of why people like me have such zealous spam filtering. So, let me pull back the curtain a little.

The main anti-spam tool I use is Akismet, which detects spam by aggregating comments across tens of thousands of web sites and looking for patterns. It gives me statistics on how many spam comments it has blocked. In January, for example, it caught 189 attempts at posting spam comments. Half a dozen every day. That doesn’t sound too bad, right?

Urban planners and other designers have a concept known as affordance. It refers to the way a thing can have a design which encourages or discourages particular kinds of use. In the specific case of urban planning, affordance is used to refer to the graffiti-attracting or graffiti-discouraging properties of particular materials and structures. For example, the classic UK bus shelter of the 1970s—a wooden shed—had a very high affordance for graffiti. Modern bus shelters have glass walls in order to make them less attractive targets for defacement.

One of the interesting things about graffiti affordance that I learned from a book on urban planning is that once a single piece of graffiti appears on a surface, the surface has a much bigger affordance for more graffiti. The electronic vandalism of spam works in exactly the same way: If your web site has comment spam all over it, it will become an attractive target for more comment spam. Not only do spammers use search engines to find spam-infested web sites and post more spam to them, they also make money by selling each other lists of potential victims.

Spam levels also rise and fall based on the time of year (there’s a rise before Christmas), and they can change dramatically when security teams manage to take down a botnet. As it happens, my January total of 189 blocked spam comments was on the low end of the range. In August, I had 1,574 spam attempts. Fifty a day. That is more like what I’d be facing if I turned off the filtering.

99% of spam comments use invalid e-mail addresses, so for a while I figured I’d just use a confirmation e-mail plugin, and require users to click a link in the e-mail to confirm that they were actual humans and post their first comment. After that, they could post freely. No actual humans would be inconvenienced that way, right?

Unfortunately, doing that and turning off other user filters resulted in thousands of garbage user accounts every week. I was paying for the database space used to store the garbage, I was dealing with the bounce messages from the confirmation messages—and after all that, I was still getting actual spam, because thanks to globalization you can pay someone in India or China a pittance to spend the day posting carefully targeted manual spam to web sites that show up in your favorite Google searches.

So, I use a plugin called Stop Spammer Registrations to stop the user registration spam. It checks IP addresses and e-mail addresses against blocklists maintained by various web communities. It also looks for suspicious behavior, like invalid or missing HTTP headers, extremely long or short usernames, and so on. It’s not perfect, and has misflagged a couple of people, but I hope that now that I’ve explained the scale of the problem, you understand why it’s necessary.
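To make those checks a bit more concrete, here is a rough Python sketch of that style of screening. It is not the plugin's actual code: the `Registration` type, the `suspicion_reasons` function, the blocklist contents, the header list, and the length thresholds are all made up for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical blocklists; a real tool would query community-maintained services.
BLOCKED_IPS = {"203.0.113.7"}
BLOCKED_EMAIL_DOMAINS = {"spam.example"}

# Illustrative thresholds and header names, assumed for this sketch only.
MIN_USERNAME_LEN = 3
MAX_USERNAME_LEN = 30
REQUIRED_HEADERS = ("User-Agent", "Accept")


@dataclass
class Registration:
    username: str
    email: str
    ip: str
    headers: dict = field(default_factory=dict)


def suspicion_reasons(reg: Registration) -> list[str]:
    """Return the reasons a registration looks suspicious (empty list = looks fine)."""
    reasons = []
    if reg.ip in BLOCKED_IPS:
        reasons.append("ip on blocklist")
    # Everything after the last "@" is treated as the e-mail domain.
    domain = reg.email.rpartition("@")[2].lower()
    if "@" not in reg.email or domain in BLOCKED_EMAIL_DOMAINS:
        reasons.append("e-mail invalid or on blocklist")
    if not MIN_USERNAME_LEN <= len(reg.username) <= MAX_USERNAME_LEN:
        reasons.append("username length out of range")
    missing = [h for h in REQUIRED_HEADERS if not reg.headers.get(h)]
    if missing:
        reasons.append("missing headers: " + ", ".join(missing))
    return reasons
```

Returning a list of reasons rather than a bare yes/no makes it easier to see why a real person got misflagged, which matters when the checks are this crude.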

I’m also trying my best to make things painless by allowing login using social network accounts you already have, rather than requiring that you set up yet another username and password. If you don’t mind using a social network login, that’s actually the best bet for commenting, as it avoids the need to scrutinize a registration request and lets you proceed straight to the actual commenting stage. Also, once you have an approved comment, you shouldn’t get filtered again so long as you use the same login method. If a comment does go into the moderation queue, rest assured I will get to it, and I’m now set up so that I can moderate comments on my phone as soon as I get the push notification.

(At this point, it’s worth mentioning that just because you see something get posted, doesn’t mean I’m actually using the Internet. I use WordPress’s post scheduling system and maintain a queue of items, so that I can keep a reasonably regular update schedule, currently around 3 items per day.)

So in summary, I’m genuinely sorry if you’re inconvenienced, but spam filtering on both user registration and comment posting is a necessary part of running a web site that’s been around for 20 years and is on every spammer’s radar. I wish it wasn’t, and I’m open to suggestions for better technologies, but for the time being I have something that seems to be working after some initial teething troubles.

Of course, lots of people are just turning off comments entirely, but in the absence of adoption of an open social network to host discussion (hint hint), I’m not comfortable doing that.

As I wrote a while ago, I’ve been experimenting with the Calais Auto Tagger plugin which interfaces WordPress with OpenCalais. After writing a post, you push a button and it looks at the text and comes up with some tags.

After running it over my entire site, I ended up with over 5,000 tags. Eat that, LiveJournal. However, there’s quite a lot of crap in there, and I’m going to need to do some cleanup.

For instance, the tagger seems to get confused by punctuation. I ended up with double quotes in some tags, because words it found interesting were at the start of a quote. It also picked out "56AM" as a tag, based on it occurring in the time 3:56AM. Not too smart. It also ended up with some HTML entities in tags, such as "&hellip", but that could be a bug in the plugin.

On the plus side, it mostly does a good job of picking out noteworthy people’s names. However, it didn’t detect that "Obama" refers to Barack Obama, which I think would be a pretty good optimization to add for this upcoming year.

It loves advertising. I suspect that’s what it’s really being used for. I ended up with tags for advertising, picking out advertising agencies, advertising agency, advertising campaign, advertising campaigns, advertising inserts, advertising money, advertising pages, advertising revenue, advertising space, and even advertising wrap, even though I don’t talk about most of those things. (Well, until now.)

Another weirdness is that it loves tagging things with "USD", which I assume refers to the currency. It tagged over 217 posts with that. Anything which mentions an amount of money and a ‘$’ symbol seems to get tagged with it, which makes some sense, but isn’t terribly useful. As a human, I’d expect ‘USD’ to be reserved for posts that actually concern international currency exchange.

There are also topics it seems totally hopeless at finding keywords for. I’ve written several entire posts about the budgerigar, for example, and it hasn’t picked out that "budgerigar" or "parakeet" would be a good tag for them.

So overall, it’s useful, but don’t expect to be able to throw a lot of text at it and get good tags out without doing some post-filtering. It seems to pick up things that Bayesian analysis isn’t good at spotting, however.
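As a rough idea of what that post-filtering might look like, here is a small Python sketch that handles the quirks described above: stray quotes, HTML entity fragments like "&hellip", time fragments like "56AM", and noisy tags. The `clean_tags` function and the `STOP_TAGS` list are hypothetical, and dropping "USD" outright is just one possible choice; this is not something the plugin or OpenCalais provides.

```python
import re
from html.entities import name2codepoint

# Tags considered too noisy to keep; this stop list is an assumption.
STOP_TAGS = {"usd"}

# Entity-like fragments, with or without the trailing semicolon.
ENTITY = re.compile(r"&([A-Za-z0-9]+);?")
# Leftovers from times such as 3:56AM, e.g. "56AM".
TIME_FRAGMENT = re.compile(r"\d{1,2}\s*[AP]M", re.IGNORECASE)
# Characters to trim from both ends of a tag.
JUNK = " \t\"'“”‘’…"


def decode_entities(text):
    """Replace known HTML entity names with their characters; leave the rest alone."""
    def repl(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)
    return ENTITY.sub(repl, text)


def clean_tags(raw_tags):
    """Normalize auto-generated tags and drop obvious junk, preserving order."""
    cleaned = []
    seen = set()
    for tag in raw_tags:
        tag = decode_entities(tag).strip(JUNK)
        key = tag.lower()
        if not tag or key in STOP_TAGS or key in seen:
            continue
        if TIME_FRAGMENT.fullmatch(tag):
            continue
        seen.add(key)
        cleaned.append(tag)
    return cleaned
```

Something like `clean_tags(['"Obama', '56AM', '&hellip', 'USD'])` would keep only `Obama`. It does nothing for the missing budgerigar tags, of course; those still need a human.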

I upgraded to the latest WordPress. I’ve checked LiveJournal OpenID login, and it works. However, since some people have complained that they can’t get it to work, I’ve also enabled manual signup.

I also took a look through the available plugins, and found the OpenCalais Archive Tagger. It uses the Reuters OpenCalais project, a database of semantic information, to automatically go back and add relevant tags to all your previous posts. I’m giving it a go; I figure anything that can make it easier to find interesting stuff amongst the 20 years of writing here has to be worthwhile.

WordPress 2.3 is out, with official tag support. I’ve just finished upgrading, and tags now work properly. I had to hack together some SQL + Ruby to convert everything, but it should all be done now.

Atom feeds and OpenID support should hopefully work as before; let me know if you notice anything strange. I’m going to test by replying to this…

Update: It works. And excitingly, you no longer have to hack code to get OpenID support working to and from LiveJournal.