Jan 22

I decided it was way, way past time I got my web site in better order. It was getting embarrassing that my ‘content management system’ (CMS) was a bunch of text files in folders, and that whenever I wanted to find something I’d written I had to resort to find . -exec grep.

For a while I entertained the idea of building my own CMS. However, I finally decided that there were much more creative and enjoyable things I could be doing than bringing work home, so I went with something someone else wrote. I may do a new site design; right now I’m using a standard template.

Getting all my existing data into the new system was a Simple Matter Of Programming. The CMS is built on Ruby on Rails, so I was able to write some code to run through the filesystem, find all the existing data, patch it up appropriately, and store it in the relational database. I implemented some heuristics to guess the missing time zone information based on the date and where I typically was at that time, so timestamps should mostly be accurate to within an hour or so. (I wasn’t totally precise about DST.)
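To give a flavour of the time zone guessing, here is a rough sketch in Python rather than the Ruby the importer was actually written in. The date ranges and offsets are invented for illustration, and the function name is hypothetical. Like the real heuristic, it ignores DST, so a guess can be off by an hour.

```python
from datetime import date, datetime, timedelta, timezone

# Hypothetical history of where the author was living: (first day, last day, UTC offset).
# These ranges and offsets are made up for the example.
RESIDENCE_OFFSETS = [
    (date(1995, 1, 1), date(1999, 12, 31), timedelta(hours=0)),
    (date(2000, 1, 1), date(2006, 12, 31), timedelta(hours=-6)),
]


def guess_timezone(naive: datetime) -> datetime:
    """Attach a guessed fixed UTC offset to a timestamp that has none."""
    for start, end, offset in RESIDENCE_OFFSETS:
        if start <= naive.date() <= end:
            # DST is deliberately ignored, so this may be an hour out.
            return naive.replace(tzinfo=timezone(offset))
    # No matching period: fall back to UTC instead of guessing.
    return naive.replace(tzinfo=timezone.utc)


stamped = guess_timezone(datetime(2003, 3, 31, 12, 0))
print(stamped.isoformat())
```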

Aside from search, pings, flickr integration, easy editing, an API for handy software, and other stuff that’s fun for me, the new system also brings the much-requested return of comments. Be good.

Jan 22

[For more cases of LiveJournal Abuse Team behaving abusively, check out http://ljabuse.blogspot.com/.]

For several years I was a paying user of LiveJournal. Now I pay for web hosting and run my own content management system. It’s not by choice; this is the story.

In a nutshell, following an altercation with a racist troll, LiveJournal suspended my account without warning, even though I had not breached their Terms Of Service. They didn’t suspend the troll’s account–instead, they announced that (contrary to their written terms of service) racist comments were in fact perfectly acceptable on LiveJournal.

Attempts at compromise to resolve the issue were ignored and rejected, even when I offered to delete offending comments. The money I had paid for the service they were refusing to provide was not refunded.


Mar 23

As part of my mission to bring clarity to the world, let me explain the so-called “302 exploit” you may have heard scare stories about.

Background

HTTP, the protocol used to serve web pages, has two commonly used numeric status codes that the web server can return to direct the client (browser) to a new URL: 301 and 302.

A 301 redirection means “The page you requested has moved permanently. Please go to the new address I am providing you with, and update your bookmarks.”

A 302 redirection means “The page you requested is temporarily being hosted somewhere else; please fetch the content from the URL I am providing you with. This is still the correct URL for future access, however.”
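The difference between the two codes comes down to which URL the client should keep for next time. Here is a minimal sketch of that rule in Python. The function name and the example URLs are made up for illustration, and real clients and crawlers handle more cases than this.

```python
from http import HTTPStatus


def url_to_remember(requested_url: str, status: int, location: str) -> str:
    """Return the URL a client should keep for future visits after a redirect."""
    if status == HTTPStatus.MOVED_PERMANENTLY:  # 301: the page has moved for good
        return location
    if status == HTTPStatus.FOUND:  # 302: temporary, the original URL stays correct
        return requested_url
    raise ValueError(f"not a 301 or 302 redirect: {status}")


# In both cases the content is fetched from `location` right now;
# only the URL worth bookmarking differs.
print(url_to_remember("http://old.example/", 301, "http://new.example/"))
print(url_to_remember("http://old.example/", 302, "http://new.example/"))
```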

The “exploit”

Suppose a malicious person has a spam site he wishes to promote; let’s call it SPAMSITE. He looks at Google’s search results for his choice of keywords, and copies the URL of a site listed quite high up; call it GOODSITE.

Next, he sets up his web server to detect when the GoogleBot crawls his web pages. When that happens, he has his web server issue a 302 redirection. That is, SPAMSITE says to GoogleBot: “The page you requested at SPAMSITE is temporarily hosted at GOODSITE. However, SPAMSITE is still the place you should visit in future to get the content.”

The idea is that GoogleBot then indexes SPAMSITE as if it were the real GOODSITE, and GOODSITE gets dumped from the rankings. Users who search for GOODSITE via Google click on the link for SPAMSITE, which looks like it contains the real content from GOODSITE, but instead they get ads for penis enlargers, Texas Hold’em Poker, and Asian amputee lesbians shaving each other.

The reality

That’s the nightmare scenario being screamed to the media. Reality is not quite that simple, however.

Google rates sites according to their “pagerank”–a magic number proportional to how many other sites link to them. The more sites with high pagerank link to a particular page, the higher that page’s pagerank.
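For the curious, the textbook version of the idea can be sketched in a few lines of Python. This is the simplified published algorithm, not whatever Google actually runs. The damping factor of 0.85, the iteration count, and the little link graph are all assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical web: three sites link to GOODSITE, nobody links to SPAMSITE.
web = {
    "A": ["GOODSITE"],
    "B": ["GOODSITE"],
    "C": ["GOODSITE"],
    "GOODSITE": ["A"],
    "SPAMSITE": ["GOODSITE"],
}
ranks = pagerank(web)
print(ranks["GOODSITE"] > ranks["SPAMSITE"])
```

A page that nothing links to never rises above the baseline share, which is why a freshly set up site starts at the bottom.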

So, let’s go back to our earlier scenario. Google has been told that SPAMSITE is the proper URL for GOODSITE. However, chances are there are a lot of sites linking to GOODSITE; and if SPAMSITE has just been set up for hijacking, there won’t be anything pointing at it. So, Google will ignore what SPAMSITE told it, and report GOODSITE as the URL, because it has a much higher pagerank. (Or, so the Google guys say.)

The problem, such as it is

This still leaves one problem scenario: if SPAMSITE can somehow get its pagerank to be higher than GOODSITE’s, it can push GOODSITE off the listings and take its place. Of course, it could do that anyway, by serving up a copy of GOODSITE’s content, but 302 allows it to do so without actually infringing copyright.
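The two cases, the brand-new hijacker that loses and the high-pagerank one that wins, boil down to a single comparison. Here it is as a Python sketch. It follows the behaviour as described here, not Google's actual code, and the function name and rank numbers are invented.

```python
def url_to_list(redirecting_url: str, target_url: str, rank: dict) -> str:
    """Pick which URL to list when redirecting_url 302-redirects to target_url."""
    # Pages the index has never seen linked from anywhere count as rank zero.
    if rank.get(redirecting_url, 0.0) > rank.get(target_url, 0.0):
        return redirecting_url
    return target_url


# Made-up pagerank values for the two scenarios.
fresh_hijack = {"SPAMSITE": 0.01, "GOODSITE": 0.40}
boosted_hijack = {"SPAMSITE": 0.50, "GOODSITE": 0.40}

print(url_to_list("SPAMSITE", "GOODSITE", fresh_hijack))
print(url_to_list("SPAMSITE", "GOODSITE", boosted_hijack))
```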

So that’s the sum total of the “exploit”: there’s a way to spam Google exactly as if you were using other people’s content, without actually copying that content. Whoop-de-doo.

Fixes which aren’t

One thing a lot of people ask is, why not just ignore 302s and always index the destination URL?

Answer: Because it would break a lot of links. For example, the canonical URL of my web site is http://www.pobox.com/~meta/. That URL issues a 302 redirect to wherever I happen to be hosting my web site at the time.

Similarly, lots of commercial sites have canonical URLs which they publish, and then redirect to some dynamically generated page in a content management system. For example, IBM.

Mar 31

Wasted a few hours yesterday trying to install Zope and CMF, as lots of people have raved about CMF as a content management system for web site construction. Unfortunately, Zope relies on Python… and it seems that the current version of Python is incompatible with Zope, and the previous version doesn’t build on Mac OS X. (Or rather, it eventually built after some hacking, but bad things happened when it tried to find any of its standard libraries.) Oh well, it was worth a try, and I didn’t especially want to learn Python anyway.

Today I tried Jakarta Tomcat. Since OS X already has Apache and Java fully installed, this took about five minutes; Apple even have some step-by-step instructions. So it looks like any dynamic web site construction I do with open source tools is going to be in Java. Suits me fine.

Nov 08

Almost finished another dynamic web site / content management system today. It produces the same look and feel as the main IBM web site; I can’t show it off to everyone, unfortunately, because it’ll be password-protected when it goes live.

Anyway, I say “almost” because there’s a bizarre bug, and I was quite unable to track down the cause. I came home instead. I’m counting on one of two things happening: either I’ll go in tomorrow and the bug will have mysteriously vanished when the server does its general housekeeping stuff overnight, or my subconscious will quietly work on the problem, and I’ll walk in with my latte, look at the code, and say “Aha!”

The latter might sound pretty far-fetched, but it happens all the time.