2 November 2006

Problem for XML gurus

(It was obvious.)

I’ve come across an interesting problem involving web URLs and XML documents, specifically Atom feeds.

Suppose you have a “recipe of the day” web site. You build it so that a URL can be used to select recipes from various categories, much like a Google search URL:

http://www.example.com/recipes?page=list&category=desserts

You use the standard servlet API to parse the list of keyword-value pairs, and all is good. Then you add another command to return the latest N recipes:

http://www.example.com/recipes?page=recent&category=desserts

So far, nothing unusual. Then one day the boss comes to you and says that you need to add a new category, “vegetarian & vegan”. Being a competent web developer, you know that ‘&’ is a reserved character in URLs as per RFC 2396, so to include it in a category name you need to URL-escape it:

http://www.example.com/recipes?page=recent&category=vegetarian%20%26%20vegan

Your web application parses the arguments as before, your servlet is handed page = ‘recent’ and category = ‘vegetarian & vegan’, and everything works. Similarly, you implement a command to display an individual recipe:

http://www.example.com/recipes?page=recipe&category=vegetarian%20%26%20vegan&date=20061030
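
To make the escaping step concrete, here is a minimal sketch in Python rather than the servlet API. It assumes that `urllib.parse.parse_qs` is a fair stand-in for the servlet container's parameter parsing; the `recipes_url` helper is hypothetical, made up for illustration. It shows the percent-encoded form of ‘&’ (which is `%26`) surviving a round trip, and what goes wrong if the ampersand is left bare.

```python
from urllib.parse import urlencode, quote, parse_qs

BASE = "http://www.example.com/recipes"


def recipes_url(**params):
    """Hypothetical helper: build a recipes URL with percent-encoded values.

    quote_via=quote encodes spaces as %20 (the default would use '+').
    """
    return BASE + "?" + urlencode(params, quote_via=quote)


# Escaped: the '&' inside the category becomes %26, so it is not a separator.
escaped = recipes_url(page="recent", category="vegetarian & vegan")
parsed = parse_qs(escaped.split("?", 1)[1])

# Unescaped: the bare '&' is treated as a separator and the category is cut short.
broken = parse_qs("page=recent&category=vegetarian & vegan")

print(escaped)
print(parsed)
print(broken)
```

With the escaped URL the category comes back whole; with the bare ampersand the parser sees a truncated category and a stray, valueless ‘ vegan’ key that it discards.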

Now you read about web syndication, and you decide to add an Atom feed of the 10 most recent recipes posted in each category. The obvious way to do it is to implement another page style that returns an XML Atom feed instead of an HTML web page:

http://www.example.com/recipes?page=atom&category=vegetarian%20%26%20vegan

Now for the problem. The Atom feed must include links to each recipe. As already established, the syntax of the links will be like this:

http://www.example.com/recipes?page=recipe&category=vegetarian%20%26%20vegan&date=20061030

The puzzle is: how do you include those URLs in an Atom feed, in such a way that the feed reader and/or browser will actually request the correct URL from the web server when the reader clicks through to a particular recipe?

Hints (or “Why this is difficult”):

  1. Bare ampersands must be escaped in XML.
  2. You can’t supply href as an element nested in a link element, only as an attribute of a link element.
  3. You can’t use a CDATA section inside a quoted attribute value.
  4. Feed readers do not un-URL-escape URLs they get in feeds.
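
As a rough way to experiment with how hint 1 interacts with the URL escaping, here is a small Python sketch. It assumes that the feed reader's XML parser decodes `&amp;` in attribute values back to ‘&’, as XML parsers are required to, and it uses a bare `link` element without the Atom namespace for brevity. It serializes a link with the recipe URL as its `href`, parses it back, and checks what a server would then see in the query string.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qs

# The recipe URL, already percent-encoded (%26 is the '&' inside the category).
recipe_url = (
    "http://www.example.com/recipes"
    "?page=recipe&category=vegetarian%20%26%20vegan&date=20061030"
)

# Writing side: ElementTree XML-escapes the separator ampersands as &amp;
# when serializing the attribute, and leaves the percent-escapes alone.
link = ET.Element("link", {"rel": "alternate", "href": recipe_url})
serialized = ET.tostring(link, encoding="unicode")

# Reading side: the XML parser undoes only the XML layer of escaping.
parsed_href = ET.fromstring(serialized).get("href")

# Server side: the query parser undoes the percent-encoding layer.
params = parse_qs(urlsplit(parsed_href).query)

print(serialized)
print(parsed_href)
print(params)
```

Under these assumptions the two layers stay independent: the XML layer only ever touches the bare separator ampersands, and the `%26` passes through untouched until the web server decodes it.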

Assume that changing the entire syntax of the web site’s URLs is not a viable option. (In the real world case, it’s someone else’s web site.)

I’m hoping my brain will work out a cunning solution to this problem while I’m asleep.

© mathew 2017