
How the Search facility works

This page is about how the searching is implemented; it might interest programmers, Web site developers, system integrators, or standards geeks.

The technology is all open source, and it's all available in exchange for pictures of your ankles. (Just kidding. Actually it's freely available, no pictures required.)

The Metadata

There is some metadata associated with each image:

  1. about the image itself
  2. about the place or places depicted in the image

An example is probably the best way to explain the metadata. Consider a book with text written by Sir Charles Knight, such as Old England. Perhaps there is a colour plate in the book, such as that of Methley Hall opposite page 383 in Volume I, which I have scanned.

Now, Methley Hall is (or rather, was) a building in the West Riding of Yorkshire, in England. I have marked its location as Mickletown, West Riding, Yorkshire, England.

Of course, the printed book that I scanned is in Canada, and the image is on my Web server (and maybe also on your computer now, too), but the place (Methley Hall) is located in England.

The image as I scanned it was saved in PNG format, and after I cleaned up the scan in either Adobe Photoshop or The GIMP, I saved it as JPEG in several resolutions: the largest is 1475 pixels wide and 1023 pixels high. So we have an image format (JPEG) and size (1475x1023). We must be careful not to suggest that Methley Hall is 1475 pixels wide! Although this is absurd, it's a surprisingly common mistake when people prepare data about images.

I associated some keywords with the image: interiors, windows, ceilings, arches, staircases, colour, furniture and manors.

The location and keywords are stored in an RDF/XML file (actually it isn't really proper RDF, but it's close enough for my purposes). The information about the physical image, the format and the pixel size, is stored in a relational database, separately from the RDF. In this way there is no possibility of confusing the metadata about the physical image and about the picture.
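As a rough illustration of that separation (not the site's actual files or schema), the sketch below keeps the picture metadata in a small XML fragment and the file metadata in a relational table. The element names, the table name image_file, its columns and the identifier methley-hall are all assumptions made up for this example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified metadata about the picture: where the place is
# and what can be seen in the printed image. Not the site's real format.
PICTURE_XML = """
<image id="methley-hall">
  <location>Mickletown, West Riding, Yorkshire, England</location>
  <keyword>interiors</keyword>
  <keyword>staircases</keyword>
  <keyword>colour</keyword>
</image>
"""


def load_picture_metadata(xml_text):
    """Read location and keywords; nothing here describes the file itself."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "location": root.findtext("location"),
        "keywords": [k.text for k in root.findall("keyword")],
    }


def make_file_database():
    """Create an in-memory table holding only facts about the image file."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE image_file "
        "(image_id TEXT, format TEXT, width INTEGER, height INTEGER)"
    )
    db.execute(
        "INSERT INTO image_file VALUES (?, ?, ?, ?)",
        ("methley-hall", "JPEG", 1475, 1023),
    )
    return db


picture = load_picture_metadata(PICTURE_XML)
db = make_file_database()
file_row = db.execute(
    "SELECT format, width, height FROM image_file WHERE image_id = ?",
    (picture["id"],),
).fetchone()
print(picture["location"], file_row)
```

Because the pixel size lives only in the table and the location only in the XML, nothing in either store can claim that Methley Hall itself is 1475 pixels wide.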

An astute reader might have observed that the keyword colour is more about the image than the place. Obviously Methley Hall is not black-and-white. The keywords, then, are about the printed picture in the book and what can be seen in it. Methley Hall might have a swimming pool and a wheelchair-accessible billiards table, but the From Old Books Web site is not about finding a country house! It is about finding cool pictures, though, so I have mentioned the arches and the staircase.

Searching

I am currently using BaseX, a Java-based implementation of the XML Query language (XQuery). This is a query language that lets me run queries against any mixture of XML files, XML document stores, RDF and relational databases, without needing to know which is which in the body of my query.

Under http://fromoldbooks.org/Search/ is a file named index.cgi; the Apache Web server runs this program to satisfy incoming HTTP requests for that directory or anything beneath it.

I should mention at this point that the Common Gateway Interface (CGI) is not a programming language. It's just the way in which the Web server communicates with an external program, and you can use almost any programming language you like. Perl and PHP are two common choices, and in this case I used Perl.

The CGI script in this case does several things:

  1. Parses the query options;
  2. Builds an XML Query expression on the fly;
  3. Runs the XML Query with a template (usually HTML or SVG);
  4. Sends the results back to the requesting agent (usually your Web browser, but it might also be a search engine crawler).
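The first two steps can be sketched roughly as follows. This is written in Python rather than the Perl the site uses, and it is only an assumption about how such a script might work: the parameter name kw, the document name images.xml and the element names are hypothetical, and the real queries are certainly more involved.

```python
from urllib.parse import parse_qs


def xquery_string(value):
    """Quote a value as an XQuery string literal.

    Inside a double-quoted XQuery literal, a quote is written twice and
    an ampersand must be written as &amp;.
    """
    escaped = value.replace("&", "&amp;").replace('"', '""')
    return '"' + escaped + '"'


def build_query(query_string):
    """Step 1: parse the options. Step 2: build the query text."""
    options = parse_qs(query_string)
    keywords = options.get("kw", [])  # "kw" is a hypothetical parameter name
    conditions = [f"$img/keyword = {xquery_string(k)}" for k in keywords]
    where = " and ".join(conditions) if conditions else "true()"
    return (
        'for $img in doc("images.xml")//image\n'  # hypothetical document
        f"where {where}\n"
        "return $img"
    )


print(build_query("kw=arches&kw=staircases"))
```

Quoting every user-supplied value before it goes into the query text matters here, since the expression is assembled from whatever arrives in the request.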

The CGI script keeps a cache of recent search results. It also monitors system load (using /proc/loadavg); if the system gets too loaded, it deliberately sleeps for several seconds and prints the error message "system too busy". I found this was necessary because Internet Explorer had a button that tried to make a copy of a Web site for reading offline, and when people pressed it, their Web browser tried to download every page at full speed, including in this case every possible combination of search options! Although that has been fixed, there are still misbehaving robots out there.
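A minimal sketch of that behaviour, again in Python rather than Perl and not taken from the real script: the threshold MAX_LOAD, the delay BUSY_DELAY and the in-memory dictionary used as a cache are all assumptions, and os.getloadavg() stands in for reading the load figures from /proc directly.

```python
import os
import time

MAX_LOAD = 4.0   # hypothetical one-minute load average threshold
BUSY_DELAY = 5   # hypothetical number of seconds to sleep when overloaded
_cache = {}      # recent results, keyed by the query text


def system_too_busy(load=None, max_load=MAX_LOAD):
    """Compare the one-minute load average with the threshold."""
    if load is None:
        load = os.getloadavg()[0]  # Unix only
    return load > max_load


def answer(query, run_query, load=None, sleep=time.sleep):
    """Serve from the cache, refuse under heavy load, else run the query."""
    if query in _cache:
        return _cache[query]
    if system_too_busy(load):
        sleep(BUSY_DELAY)  # slows down clients that hammer the server
        return "system too busy"
    result = run_query(query)
    _cache[query] = result
    return result
```

Checking the cache before the load is a design choice in this sketch: a cached answer costs almost nothing to serve, so there is little reason to refuse it even on a busy machine.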

I also use memcached for the thumbnails on the front page.

If you would like to see what the queries look like, you can append &showquery=text/plain to any query, and you will see the text of the query that would have been fed to the XML Query implementation.

Contacting Liam

I am Liam at holoweb dot net. To get through my spam filters, tell me what colour socks you are wearing. If you want to bribe me, send pictures of barefooted men and their ankles that I can add to a public Web gallery! (I can add the images anonymously if you wish.)
