23 March 2005

Bringing clarity

As part of my mission to bring clarity to the world, let me explain the so-called “302 exploit” you may have heard scare stories about.

Background

HTTP, the protocol used to serve web pages, has several numeric status codes that a web server can return to direct the client (browser) to a new URL. The two that matter here are 301 and 302.

A 301 redirection means “The page you requested has moved permanently. Please go to the new address I am providing you with, and update your bookmarks.”

A 302 redirection means “The page you requested is temporarily being hosted somewhere else; please fetch the content from the URL I am providing you with. This is still the correct URL for future access, however.”
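
The difference between the two codes comes down to which URL the client should remember afterwards. Here is a minimal Python sketch of that rule. The function name `follow_redirect` and the example.com addresses are made up for illustration, and real browsers and crawlers do considerably more than this.

```python
from http import HTTPStatus


def follow_redirect(requested_url, status, location):
    """Return (url_to_fetch_now, url_to_remember) for a single redirect."""
    if status == HTTPStatus.MOVED_PERMANENTLY:  # 301: the page has moved for good
        return location, location
    if status == HTTPStatus.FOUND:  # 302: hosted somewhere else for now
        return location, requested_url
    raise ValueError(f"not a 301 or 302 redirect: {status}")


# 301: fetch from the new address and update the bookmark.
print(follow_redirect("http://old.example.com/", 301, "http://new.example.com/"))
# 302: fetch from the new address, but keep the original bookmark.
print(follow_redirect("http://old.example.com/", 302, "http://tmp.example.com/"))
```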

The “exploit”

Suppose a malicious person has a spam site he wishes to promote; let’s call it SPAMSITE. He looks at Google’s search results for his chosen keywords, and copies the URL of a site listed quite high up; call it GOODSITE.

Next, he sets up his web server to detect when the GoogleBot crawls his web pages. When that happens, he has his web server issue a 302 redirection. That is, SPAMSITE says to GoogleBot: “The page you requested at SPAMSITE is temporarily hosted at GOODSITE. However, SPAMSITE is still the place you should visit in future to get the content.”

The idea is that GoogleBot then indexes SPAMSITE as if it were the real GOODSITE, and GOODSITE gets dumped from the rankings. Users who search for GOODSITE via Google click on the link for SPAMSITE, which looks like it contains the real content from GOODSITE, but instead they get ads for penis enlargers, Texas Hold’em Poker, and Asian amputee lesbians shaving each other.
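
To see why this would work against a crawler that takes every 302 at face value, here is a toy indexer in Python. Everything in it is hypothetical: the `web` dictionary stands in for real HTTP fetches, and `naive_index` is an invented name, not anything Google has published.

```python
def naive_index(start_url, web, max_hops=10):
    """Follow redirects from start_url; return (url_to_list, content).

    `web` maps a URL to (301, target), (302, target) or ("page", text).
    """
    url = start_url
    listed_as = start_url
    for _ in range(max_hops):
        kind, value = web[url]
        if kind == 301:
            # Permanent move: the new address replaces the old one.
            url = listed_as = value
        elif kind == 302:
            # Temporary move: fetch from the target, keep listing the old URL.
            url = value
        else:
            return listed_as, value
    raise RuntimeError("too many redirects")


web = {
    "SPAMSITE": (302, "GOODSITE"),
    "GOODSITE": ("page", "the real content"),
}
# The naive crawler files GOODSITE's content under SPAMSITE's URL.
print(naive_index("SPAMSITE", web))
```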

The reality

That’s the nightmare scenario being screamed to the media. Reality is not quite that simple, however.

Google rates sites according to their “pagerank”, a magic number that grows with the number of other sites linking to them. The more sites with high pagerank link to a particular page, the higher that page’s pagerank.
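
The key idea is that a link counts for more when it comes from a page that itself ranks well. The Python sketch below uses the simplified textbook version of the calculation, with an assumed damping factor of 0.85 and a made-up link graph. It is not a description of what Google actually runs.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by repeated redistribution of rank along links.

    `links` maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its rank over everyone.
                for t in pages:
                    new[t] += damping * rank[page] / n
        rank = new
    return rank


# Hypothetical graph: three sites link to GOODSITE, nobody links to SPAMSITE.
links = {"A": ["GOODSITE"], "B": ["GOODSITE"], "C": ["GOODSITE"],
         "GOODSITE": [], "SPAMSITE": []}
ranks = pagerank(links)
print(ranks["GOODSITE"] > ranks["SPAMSITE"])
```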

So, let’s go back to our earlier scenario. Google has been told that SPAMSITE is the proper URL for GOODSITE. However, chances are there are a lot of sites linking to GOODSITE; and if SPAMSITE has just been set up for hijacking, there won’t be anything pointing at it. So, Google will ignore what SPAMSITE told it, and report GOODSITE as the URL, because it has a much higher pagerank. (Or, so the Google guys say.)
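
Taken at face value, that description amounts to a one-line rule. The Python sketch below is only one reading of it: `url_to_report` and the rank numbers are invented, and Google has not published the actual logic.

```python
def url_to_report(source, destination, rank):
    """Pick which URL to list when `source` 302-redirects to `destination`.

    Assumed rule: keep the source URL only if it outranks the destination.
    """
    if rank.get(source, 0.0) > rank.get(destination, 0.0):
        return source
    return destination


# Hypothetical ranks: a freshly set up SPAMSITE has almost nothing pointing at it.
rank = {"GOODSITE": 0.4, "SPAMSITE": 0.01}
print(url_to_report("SPAMSITE", "GOODSITE", rank))
```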

The problem, such as it is

This still leaves one problem scenario: if SPAMSITE can somehow get its pagerank to be higher than GOODSITE’s, it can push GOODSITE off the listings and take its place. Of course, it could do that anyway, by serving up a copy of GOODSITE’s content, but a 302 lets it do so without actually committing a copyright violation.

So that’s the sum total of the “exploit”: there’s a way to spam Google exactly as if you were using other people’s content, without actually copying that content. Whoop-de-doo.

Fixes which aren’t

One thing a lot of people ask is, why not just ignore 302s and always index the destination URL?

Answer: Because it would break a lot of links. For example, the canonical URL of my web site is http://www.pobox.com/~meta/. That URL issues a 302 redirect to wherever I happen to be hosting my web site at the time.
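
Here is a small Python sketch of what goes wrong, assuming a search engine that always lists the redirect's destination. The two hosting URLs are hypothetical, as are the function names; only the pobox.com address is real.

```python
CANONICAL = "http://www.pobox.com/~meta/"


def list_destination(canonical_url, redirects):
    """'Ignore 302s' policy: list whatever the URL currently redirects to."""
    return redirects.get(canonical_url, canonical_url)


def is_stale(listed_url, canonical_url, redirects):
    """True if the listing is neither the canonical URL nor its current target."""
    current = redirects.get(canonical_url, canonical_url)
    return listed_url not in (canonical_url, current)


# At crawl time the site lives on one (hypothetical) host...
listed = list_destination(CANONICAL, {CANONICAL: "http://host-a.example/meta/"})
# ...and later it moves. The canonical URL still works; the listing does not.
after_move = {CANONICAL: "http://host-b.example/meta/"}
print(listed, is_stale(listed, CANONICAL, after_move))
```

Had the index kept the canonical URL instead, the move would have gone unnoticed.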

Similarly, lots of commercial sites have canonical URLs which they publish, and then redirect to some dynamically generated page in a content management system. For example, IBM.

© mathew 2017