Tag Archives: SQL

WordPress

WordPress 2.3 is out, with official tag support. I’ve just finished upgrading, and tags now work properly. I had to hack together some SQL + Ruby to convert everything, but it should all be done now.

Atom feeds and OpenID support should hopefully work as before; let me know if you notice anything strange. I’m going to test by replying to this…

Update: It works. And excitingly, you no longer have to hack code to get OpenID support working to and from LiveJournal.

Content upkeep

I got tags working via a plugin.

Since I was messing with the site anyway, I hacked together some Ruby code to pull all the content out of the database and perform automatic keyword extraction via naïve bayesian analysis.

It spat out a file of SQL commands, consisting of the subject of each posting and the first line of text (in comments), followed by the commands to add the tags. I ran through the file in vim deleting here and adding there, then executed the result. So now pretty much everything should be tagged, right back to the start. How cool is that?

The unbearable crapness of SQL

SQL is a dinosaur of a language, designed for the bad old days when computers enforced a fixed size and format for every kind of data, everything was upper case, and if you didn’t like it you went back to using paper. After all, disk space and CPU time were expensive, so you didn’t want people wasting them with pesky unaltered real-world information.

As such, SQL doesn’t have variable-length strings. Oh, sure, it has VARCHAR as well as CHAR, but VARCHAR only gives you a string that can be any size up to some fixed length. That’s as opposed to CHAR, which is even dumber, padding out all your strings to a fixed length. So all you gain from VARCHAR is the ability to get "John", "Smith" rather than "John            ", "Smith           " as output.

So, what happens if Mr Apu Nahasapeemapetilon suddenly joins the company, and your SURNAME VARCHAR(16) overflows? If you’re running MySQL, the name is silently truncated, and your database outputs garbage. If you’re running PostgreSQL, the database chokes and your system falls over with a run-time error, which is what the SQL standard apparently says should happen. So, would you like your leg amputated with a bread knife, or a rusty hacksaw blade?

Let’s suppose that you want to use PostgreSQL, so that you can have advanced features like, oh, the database actually remaining consistent if there’s a crash, and constraints that actually do something. Let’s also suppose that you don’t care too much about data being truncated, so long as you don’t personally have to deal with a support call every time some idiot decides to get creative with a product name.

Now you have a problem, because the program that’s feeding information into the database needs to know how much data it can put in each field without the database choking. The database knows this, of course; it’s part of the schema you used to set up the tables in the first place. But you don’t want the information in two places, because then if you change it in one place, you need to remember to change it in the other.

So, what’s the solution? Surely there has to be one?

J2EE in a nutshell

J2EE specifies Enterprise Java Beans for handling data, where the data and the client accessing it may or may not be on the same system. Entity beans are used to encapsulate the data in the database—instead of accessing the database directly, you create an entity bean to do it for you. That way the client can use the entity bean and not need to know about SQL.

However, the person writing the entity bean doesn’t really want to have to know about SQL either. That database stuff makes writing an entity bean a pain, so container-managed persistence was introduced. This allows you to let the EJB container (part of your J2EE server) handle the messy database stuff for you, and you don’t have to know any SQL at all.

Of course, SQL allows you to do a lot of useful things, such as querying databases in complex ways. It turned out that doing those things by fiddling around with container-managed entity beans was a real pain in the ass. So in EJB 2.0, Sun added EJB QL, the Enterprise JavaBean Query Language. It’s just like SQL, only slightly different, and not as powerful.

Of course, performance was getting pretty crappy by this point, so to improve things, EJB QL can be compiled into SQL.

So to summarize, here’s the old ugly way of doing things:

  • Client program connects to database server.
  • Client program sends SQL query to database server.
  • Client program gets back data from database server.

Now here’s the new, improved way of doing things:

  • Client program connects to J2EE server.
    • J2EE server instantiates entity bean.
    • Entity bean opens connection to database.
  • Client program sends EJB QL query to J2EE server.
    • Entity bean compiles EJB QL into SQL.
    • Entity bean sends SQL query to database.
    • Entity bean gets back data from database.
    • Entity bean passes back data to J2EE server.
  • Client program gets back data from J2EE server.

Obviously this is a big improvement. What’s more, because of the use of EJB, all of the J2EE entity bean stuff can be placed on a separate machine from the client and the database server. In fact, for performance reasons you’d be well advised to do so.

So, before: client system, database server. After: same client system, huge $20,000 server running $50,000 of J2EE application server software, same database server. Now your client software can send queries that look just like SQL, and get back results exactly as if they came from a database server—but instead, they’re coming from a box with that wonderful Java technology.