May 14

As I wrote a while ago, I’ve been experimenting with the Calais Auto Tagger plugin which interfaces Wordpress with OpenCalais. After writing a post, you push a button and it looks at the text and comes up with some tags.

After running it over my entire site, I ended up with over 5,000 tags. Eat that, LiveJournal. However, there’s quite a lot of crap in there, and I’m going to need to do some cleanup.

For instance, the tagger seems to get confused by punctuation. I ended up with double quotes in some tags, because words it found interesting were at the start of a quote. It also picked out "56AM" as a tag, based on it occurring in the time 3:56AM. Not too smart. It also ended up with some HTML entities in tags, such as "&hellip", but that could be a bug in the plugin.

On the plus side, it mostly does a good job of picking out noteworthy people’s names. However, it didn’t detect that "Obama" refers to Barack Obama, which I think would be a pretty good optimization to add for this upcoming year.

It loves advertising. I suspect that’s what it’s really being used for. I ended up with tags for advertising, advertising agencies, advertising agency, advertising campaign, advertising campaigns, advertising inserts, advertising money, advertising pages, advertising revenue, advertising space, and even advertising wrap, even though I don’t talk about most of those things. (Well, until now.)

Another weirdness is that it loves tagging things with "USD", which I assume refers to the currency. It tagged over 217 posts with that. Anything which mentions an amount of money and a ‘$’ symbol seems to get tagged with it, which makes some sense, but isn’t terribly useful. As a human, I’d expect the ‘USD’ tag only on posts that actually concern international currency exchange.

There are also topics it seems totally hopeless at finding keywords for. I’ve written several entire posts about the budgerigar, for example, and it hasn’t picked out that "budgerigar" or "parakeet" would be a good tag for them.

So overall, it’s useful, but don’t expect to be able to throw a lot of text at it and get good tags out without doing some post-filtering. It seems to pick up things that Bayesian analysis isn’t good at spotting, however.
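To give an idea of the kind of post-filtering I mean, here is a minimal Python sketch. It is not part of the plugin or of OpenCalais; the function names, the blocklist and the rules are my own assumptions, based only on the junk described above (stray quotes, HTML entity leftovers, fragments of clock times like "56AM", and tags such as "USD" that are correct but useless).

```python
import re

# Hypothetical blocklist: tags that are accurate but not useful to a reader.
BLOCKLIST = {"usd"}

# Leftovers of clock times, such as "56AM" from "3:56AM".
TIME_FRAGMENT = re.compile(r"^\d{1,2}\s?[ap]m$", re.IGNORECASE)

# HTML entity debris such as "&hellip" or "&#8230;". Known limitation: a
# lowercase tag like "rock&roll" would also lose its "&roll" part.
ENTITY_DEBRIS = re.compile(r"&(?:#\d+|[a-z]{2,8});?")

# Straight and curly quote characters to trim from either end of a tag.
QUOTE_CHARS = "\"'\u201c\u201d\u2018\u2019"


def clean_tag(tag):
    """Return a tidied tag, or None if the tag should be dropped."""
    tag = ENTITY_DEBRIS.sub("", tag)
    tag = tag.strip().strip(QUOTE_CHARS).strip()
    if not tag or TIME_FRAGMENT.match(tag) or tag.lower() in BLOCKLIST:
        return None
    return tag


def clean_tags(tags):
    """Clean a list of tags, dropping junk and case-insensitive duplicates."""
    seen = set()
    result = []
    for raw in tags:
        tag = clean_tag(raw)
        if tag is None or tag.lower() in seen:
            continue
        seen.add(tag.lower())
        result.append(tag)
    return result
```

Something like clean_tags(['"Obama', '56AM', '&hellip', 'USD', 'budgerigar']) should keep only the two tags worth having. It does nothing for the posts where the tagger missed the obvious tag in the first place, of course.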

Apr 15

As you have probably noticed, I’ve just gone through a major software migration for my web site.

I was using typo. It was OK, but had a few problems. While its web site describes it as “lean”, that isn’t the reality. It also relied on a combination of Apache, lighttpd and FastCGI that tended to break down without explanation.

The biggest reason for change, though, was that typo’s authors’ idea of what was important functionality was diverging from mine. The wakeup call was when someone spent a bunch of time replacing the regular page templates with templates written in HAML.

For those lucky enough not to know, HAML is a stupid and inexplicably trendy idea in the Rails community, comparable to LiveJournal’s S2 style system. Basically, instead of creating your page templates in HTML and CSS, which everyone can understand and for which there are a zillion useful tools, you instead write program code in a whole new language which has minimal documentation. The program then generates the HTML and CSS.

Of course, this destroys the entire point of template systems, which is to separate code from presentation and make the presentation layer editable by non-programmers using common tools.

I wouldn’t have minded the HAML idiocy so much if it weren’t for the fact that typo still lacked support for things as basic as user authentication for commenting. So I looked at other web content management software… and looked… and looked.

I tried Blojsom. Supposedly it’s what Apple uses. If so, I hope they’ve done a lot of work on their version, as it’s a major PITA to set up, and very complicated even when you get it working.

In the end, though, I knew the main feature I wanted: OpenID support. Hence, I found myself reluctantly herded towards Wordpress, which has a working OpenID plugin. (Or at least, it works for my OpenID account when I test it. I don’t think it has XRI support, though.)

I did entertain the idea of writing my own CMS. I even sketched out some design notes. But it really is a solved problem; I just didn’t like the technologies used to solve it.

Let’s be blunt about this: I hate PHP, and I hate MySQL. PHP is the Visual Basic of web programming languages, a mess which grew with no planning out of a quick hack, a kitchen sink language known for its amenability to security holes. MySQL is a toy database, popular because it’s fast, fast because by default it doesn’t actually provide the basic ACID functions a database is supposed to provide. (Sure, you can turn those on, but once you do, today’s PostgreSQL is faster under non-trivial load.)

But I don’t believe in religion, especially not when it comes to software. I’m a strict pragmatist: whatever it takes to get the job done, even if it may offend a few aesthetic sensibilities and fall far short of perfection.

I spend most of my time at work developing using IBM Lotus Notes and Domino. Every time Notes is mentioned on Slashdot, a bunch of people will rant about how bad its UI is. They miss the point utterly. Believe me, the poor UI of Notes is only the most glaringly obvious defect it has; there are far worse problems underneath that the average end user is blissfully unaware of. But you know what? It works. It is sufficient. It lets you build groupware applications and dynamic web sites with fine-grained security in days, not weeks. That is why people use it. The only other tool I’ve found which comes close is Ruby on Rails, and that’s still too immature for me to want to use it on production systems. (That, and it’s surrounded by a community of people who think things like HAML are a good idea.)

So, here we are. I’m editing this in a nice AJAX WYSIWYG editor with spelling checker (an idea shot down by the typo developers), and you should be able to log in with OpenID to comment (an idea the typo developers seem utterly uninterested in).

It took most of Saturday hacking with Ruby, PostgreSQL and MySQL, but I believe I’ve managed to transfer not just all my data, but all your comments too. I think I’ve even managed to keep all the permalinks the same, and preserve all the timestamps. I’ve temporarily lost the tags functionality, but should be able to get it back with another plugin. Hopefully Wordpress will prove more reliable than Typo, and hopefully the OpenID stuff will interoperate correctly with LiveJournal. If not, pray that I inexplicably become independently wealthy and have the time to write something that does the job properly.

Nov 01

(It was obvious.)


Oct 07

Another user gets suspended from TrollJournal for posting public information: information that the person concerned had made public themselves.

The backstory is that LiveJournal has introduced advertising in the form of “sponsored communities” with third party identity tracking.

To quote the LiveJournal “contract” back in 2004:

It may be because it’s one of our biggest pet peeves, or it may be because they don’t garner a lot of money, but nonetheless, we promise to never offer advertising space in our service or on our pages.

The most recent terms of service are slightly different:

You understand and agree that some or all of the Service may include advertisements and that these advertisements are necessary for LiveJournal to provide the Service. You also understand and agree that you will not obscure any advertisements from general view via HTML/CSS or any other means. By using the Service, you agree that LiveJournal has the right to run such advertisements with or without prior notice, and without recompense to you or any other user.

Well, we’ve established how much a written promise from SixApart/LiveJournal is worth. The new advertising pages weren’t announced in the news to users, of course; they were quietly trailed in the community devoted to LJ’s business decisions.

Something else wasn’t mentioned: it appears that the “communities” are being seeded with positive “buzz” from user accounts set up specifically for the purpose. And when insomnia (aka Mark Kraft) did a little trivial investigation, he discovered that one of the people running the new “communities” was a SixApart employee with a brand new account, rather than a regular user. In other words, it appears that unlike regular communities, the new sponsored “communities” are to be carefully moderated strictly by Six Apart employees, doubtless to ensure that no pesky free speech will upset the advertisers.

In fact, perhaps in reaction to the latest round of criticism, at least one of the new viral marketing pseudo-communities is locked down tight so you can’t even join it without asking a Six Apart staffer for permission.

May 24

One feature the Unix shell offers is customizing prompts. Most ‘power users’ make use of the feature, and it is indeed very handy. However, it’s easy to go completely overboard and end up with a prompt like this:

[21:52:15] [fred@webhost:/var/log/apache] $

The problem with a long prompt is that you quickly hit the right-hand edge of the screen and your command starts wrapping. If you use KDE, however, there’s a better way.

Konsole

The xterm program introduced an escape sequence to set the window title. That can help a bit, because now you can put some of the boring info up in the window title, and reserve the prompt for path information.
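To make the escape sequence concrete, here is a minimal sketch, in Python rather than bash. It assumes an xterm-compatible terminal; the function names are my own invention. The sequence is the escape character, then "]0;", then the title text, then a bell character.

```python
import sys

ESC = "\033"  # escape character, starts the control sequence
BEL = "\007"  # bell character, terminates the title text


def title_sequence(title):
    """Build the xterm sequence that sets the icon name and window title."""
    return f"{ESC}]0;{title}{BEL}"


def set_window_title(title, stream=sys.stdout):
    """Send the title sequence, but only if the stream is a terminal."""
    if stream.isatty():
        stream.write(title_sequence(title))
        stream.flush()
```

Calling set_window_title("fred@webhost") moves the user and host name out of the prompt and into the title bar. The isatty check is there so the sequence doesn’t end up as garbage in a file or pipe.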

I use KDE, and the Konsole terminal program in KDE goes further than xterm. As well as letting you change the window title, you can also have multiple terminal sessions in tabs (like Firefox web page tabs), and change their titles too. The purpose of this posting is to explain the extra functionality in Konsole, and how to make use of it with the bash shell (default in Linux).


Mar 31

Hi, I’m mathew, and this is my web site. Jakob Nielsen believes that omitting a photo of the author is one of the top 10 mistakes in weblog usability, so who am I to argue?

I started using the Internet around 1987; I remember Jakob’s Hypercard stack, as it was one of the first cool things I downloaded. I was introduced to Unix the following year, accidentally typed rn instead of rm one day (true story), and the rest is history.

I’ve been doing my best to gather together the worthwhile content I’ve written since then. It’s an ongoing process, but the archives genuinely go back to 1988. Of course, what I consider worthwhile may look like crap to you, but it’s all categorized and searchable so hopefully you can find something of interest.

Over the years I’ve done all kinds of work, most of it involving computers in some way—telephone technical support, data recovery, system administration, a bit of sales and marketing, application development, hardware maintenance, networking, web design, and so on. I’ve written code in well over a dozen different programming languages. I’m something of a generalist, a term I borrow from Ted Nelson, inventor of hypertext. His ideas inspired my choice of career—I built a primitive network hypertext system around 1985, wrote a browser in 1989, and wrote my first web page back in the days of HTML 1.0. I was rather startled when the rest of the world suddenly took an interest.

I currently live in Austin, Texas. Since being opinionated on the web hasn’t led to fame and riches, I work for IBM as a web architect.

Outside of computers, I’m interested in electronic music, photography, politics, design, video, and small fluffy animals. I also find physics and mathematics very interesting, but because my knowledge is broad rather than deep I tend to get lost soon after integrals get involved.

Dec 12

Quote from actual conference session description:

The mellow flavor of Swiss cheese enhances sandwiches, soups, and sauces. Experts know how to control the size of the holes in Swiss cheese by changing the acidity, temperature, and curing time during the complicated holeforming fermentation process. Similarly, site visitors savor the experience of interacting with enterprise content and traditional HTML and graphics on web pages.

The title also uses “surfacing” as a verb.

Sep 02

According to Google Watch, our favorite search engine is dying. Supposedly Google is not indexing anywhere from ten percent to seventy percent of the pages it knows about.

Well, those are some pretty huge error bars, which right away scream out “wild speculation”. But if we read on, the guy offers as evidence the fact that his web site, namebase.org, appears as a bare URL in the Google index, rather than having the conventional snippet of content and careful indexing.

It sounds pretty convincing, until you go look at the source of his web page:

<HTML><HEAD>
<META HTTP-EQUIV="EXPIRES" CONTENT="0">
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
<META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"><TITLE>NameBase Book Index</TITLE>

Yes, the idiot has specifically set headers on his web site to try to tell web crawlers not to archive any of the text, and not to cache anything—and then he complains that the Google index doesn’t have a cached index of his content in its archive.
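If you want to check a page for this sort of thing yourself, here is a minimal Python sketch using the standard library’s html.parser. The class and function names are my own; the sample is the fragment of page source quoted above, and the sketch only lists what the page declares, not how any particular crawler chooses to interpret it.

```python
from html.parser import HTMLParser


class MetaDirectiveParser(HTMLParser):
    """Collect the name/http-equiv directives declared in <meta> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("name") or attributes.get("http-equiv")
        if key:
            value = attributes.get("content") or ""
            self.directives[key.lower()] = value.lower()


def meta_directives(page_source):
    """Return a dict mapping each meta directive to its content value."""
    parser = MetaDirectiveParser()
    parser.feed(page_source)
    return parser.directives


PAGE = """<HTML><HEAD>
<META HTTP-EQUIV="EXPIRES" CONTENT="0">
<META NAME="ROBOTS" CONTENT="NOARCHIVE">
<META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE"><TITLE>NameBase Book Index</TITLE>
"""

for name, content in meta_directives(PAGE).items():
    print(name, "=", content)
```

Run against that fragment, it reports the expires, robots and pragma settings, which is the whole story in three lines.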

Of course, it goes without saying that his site’s home page also isn’t valid HTML, failing to validate against any DTD whatsoever.

So in conclusion: (a) Google doesn’t always do a good job of keeping an indexed cache of things you’ve told it not to cache, and (b) Google doesn’t always do a good job of indexing things that aren’t actually web pages.

I mean, I’m not delighted that I’m no longer in the #1 spot in a search for my name, but I don’t take it as proof that Google is broken.

I’ve been wondering for a while why it is that people are so keen to predict that Google is ruined, hopeless, shooting itself in the foot, lost, going to crash, it’s doomed, doomed, doomed I tell you!

I think it’s more than schadenfreude. I think Google irritates a lot of people because of the way they’ve put together a successful business by ignoring business rules and behaving ethically. They’ve refused to pollute search results with ads; they’ve refused to let people buy their way to the top; they’ve refused flashy graphical banners. They’ve thrived anyway. And a lot of people hate them for it.

Jan 29

New Windows / Internet Explorer security hole:

  1. Upload any Windows executable you like to a web server.

  2. Set up the web server to send .exe files as text/html.

  3. Put a CLSID in the filename.

  4. Post links to the file, cloaking them as http://www.innocenturl.com%01%00@www.yoursite.com/virus/whatever via the previously announced URL cloaking bug.

  5. Wait for anyone using Internet Explorer to click on the innocent-looking link and get asked if they want to open the HTML web page.

  6. Cackle as their computer downloads the executable and runs it, without prompting them further.

Solution: Switch to Mozilla, or don’t click on “Open” to open files.
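The cloaking in step 4 works because everything before the "@" in a URL is login information, not the host name. Here is a minimal Python sketch of that, using only the standard urllib.parse module; the function name is my own, and it illustrates ordinary URL parsing rather than the Internet Explorer display bug itself.

```python
from urllib.parse import urlsplit

# The example link from step 4.
CLOAKED = "http://www.innocenturl.com%01%00@www.yoursite.com/virus/whatever"


def real_destination(url):
    """Return (userinfo, host); the host is where the request really goes."""
    parts = urlsplit(url)
    return parts.username, parts.hostname


userinfo, host = real_destination(CLOAKED)
print("what the victim is meant to see:", userinfo)
print("where the browser actually goes:", host)
```

So the "innocent" part is just a user name that the real server is free to ignore.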

Mar 15

Spent Friday morning clearing up a random disaster… some legacy application that suddenly needed to go out on CD, that had never been designed to run on read-only media. Then I went to the MIT lunch trucks.

I spent the afternoon continuing to learn J2EE and SQL. I now have a simple user registration / login application written, which uses request dispatch and HTML files for look and feel. The book I’m learning from is OK on the Java stuff, but does a really poor job of teaching good systems design; they have all their HTML shoved into the servlets. Ugh.

After that I was hungry and light-headed. Concentrated mental effort seems to make me burn up energy like crazy; I think that’s how I manage to stay a reasonable weight while being an idle layabout. We had curry, but even so I had a mood crash around 23:00. Managed to go to sleep, fortunately.

Now I have to author a DVD for someone at work, process a couple of CDs of photos, and then maybe I’ll do something completely pointless for the sheer pleasure of it.