Mar 24

Java SE 6 has some interesting new XML functionality called JAXB. Using JAXB, you can take an XSD file and compile it into Java classes. You can then add those classes to your project, create an Unmarshaller object, feed it some XML which conforms to the XSD, and it will pass you back a tree of appropriate POJOs you can mess with.

The only problem is that the XML file my source application generates refers to a DTD which JAXB tries to load via Xerces, causing epic fail.

Clearly I could rewrite the XML on the fly, perhaps even using XSLT to make the code even more enterprisey. However, I can’t help thinking that there should be a simpler way to either make Xerces/JAXB ignore the DTD (which, after all, it doesn’t need), or tell it how to find it.

Anyone happen to know?

[And yes, this is horribly enterprisey, but the benefit of being able to unmarshal a large population of XML-represented objects in only 20 lines of code is too good to pass up. Plus, I already know that the bottleneck in the intended application will be database speed.]
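
For what it's worth, one approach that may work is to hand JAXB a SAX source whose reader has its own EntityResolver, so the DTD lookup is answered with an empty stream instead of a network fetch. The sketch below is a guess at that idea, not something tested against my actual files: the class name, method names, and the example.invalid URL are all made up for illustration, and it only shows the SAX half using the parser that ships with the JDK. The assumption is that the resulting SAXSource can then be passed to Unmarshaller.unmarshal(Source) in place of a file or stream.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.sax.SAXSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: the names here are illustrative, not part of JAXB.
final class DtdSkippingSource {

    // Builds a SAXSource whose reader resolves every external entity,
    // including the DOCTYPE's DTD, to empty content.
    static SAXSource sourceIgnoringDtd(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setValidating(false);
        XMLReader reader = factory.newSAXParser().getXMLReader();
        // Returning an empty InputSource stops the parser from fetching the DTD.
        reader.setEntityResolver(
                (publicId, systemId) -> new InputSource(new StringReader("")));
        return new SAXSource(reader, new InputSource(new StringReader(xml)));
    }

    // Demo only: parses through the source above and returns the root element
    // name. Checked exceptions are wrapped to keep the call site short.
    static String rootElementOf(String xml) {
        try {
            SAXSource source = sourceIgnoringDtd(xml);
            final String[] root = new String[1];
            XMLReader reader = source.getXMLReader();
            reader.setContentHandler(new DefaultHandler() {
                @Override
                public void startElement(String uri, String localName,
                                         String qName, Attributes attributes) {
                    if (root[0] == null) {
                        root[0] = qName; // first element seen is the root
                    }
                }
            });
            reader.parse(source.getInputSource());
            return root[0];
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }

    public static void main(String[] args) {
        // The DTD URL below does not exist; the resolver means it is never fetched.
        String xml = "<?xml version=\"1.0\"?>"
                + "<!DOCTYPE catalog SYSTEM \"http://example.invalid/missing.dtd\">"
                + "<catalog><item/></catalog>";
        System.out.println(rootElementOf(xml));
        // With JAXB on the classpath, the assumed next step would be:
        //   unmarshaller.unmarshal(sourceIgnoringDtd(xml));
    }
}
```

One caveat: if the XML relies on entities that are declared in the DTD, swapping in an empty DTD will make those references undefined and the parse will fail. In that case the resolver would have to return a local copy of the DTD instead of an empty stream.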

Sep 02

According to Google Watch, our favorite search engine is dying. Supposedly Google is not indexing anywhere from ten percent to seventy percent of the pages it knows about.

Well, those are some pretty huge error bars, which right away scream out “wild speculation”. But if we read on, the guy offers as evidence the fact that his web site, namebase.org, appears as a bare URL in the Google index, rather than having the conventional snippet of content and careful indexing.

It sounds pretty convincing, until you go look at the source of his web page:

<HTML><HEAD>
<META HTTP-EQUIV="EXPIRES" CONTENT="0">
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
<META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"><TITLE>NameBase Book Index</TITLE>

Yes, the idiot has specifically set meta tags on his web site to try to tell web crawlers not to archive any of the text, and not to cache anything. And then he complains that the Google index doesn’t have a cached index of his content in its archive.

Of course, it goes without saying that his site’s home page also isn’t valid HTML, failing to validate against any DTD whatsoever.

So in conclusion: (a) Google doesn’t always do a good job of keeping an indexed cache of things you’ve told it not to cache, and (b) Google doesn’t always do a good job of indexing things that aren’t actually web pages.

I mean, I’m not delighted that I’m no longer the #1 spot in a search for my name, but I don’t take it as proof that Google is broken.

I’ve been wondering for a while why it is that people are so keen to predict that Google is ruined, hopeless, shooting itself in the foot, lost, going to crash, it’s doomed, doomed, doomed I tell you!

I think it’s more than schadenfreude (http://en.wikipedia.org/wiki/Schadenfreude). I think Google irritates a lot of people because of the way they’ve put together a successful business by ignoring business rules and behaving ethically. They’ve refused to pollute search results with ads; they’ve refused to let people buy their way to the top; they’ve refused flashy graphical banners. They’ve thrived anyway. And a lot of people hate them for it.