Archive for September, 2009

Searching in the dark

September 25, 2009

As part of our obligation to preserve our online cultural heritage, Statsbiblioteket and Det Kongelige Bibliotek in Denmark continuously harvest the danish web (the *.dk-domains), digitize public danish television, rip all danish-produced music and generally just collect whatever we can get our hands on. The terabytes add up (120TB for the web pages so far, more for television, radio and so on) and the machines are happily harvesting, ripping and wolfing down the bytes into semi-safe storage (2 geographically and architecturally different setups, checksummed, re-checksummed etc.). All fine and dandy.

Except that access to most of the material is rather limited and that search is … well, pretty much non-existing.

Such things tend to change over time, preluded by meetings, committees, deals and whatnot. As technicians, we are normally not directly involved in all the politics surrounding this, but in order to get some concrete arguments, we were asked to try and index some of the harvested web material and do a search demo, where web material was presented together with our normal material (books, cds, articles et al).

The harvested web material is stored in ARC-files, so the obvious choice for a quick test was NutchWAX. Setup was easy, some 100 million documents was indexed (about 2% of the harvested web material) and searches were sub-second on a modest machine. A great success in terms of answering the “is it even feasible to do this?“-question.

The “but does it makes sense to do integrated search for such different data sources as web and library books?“-question could not be answered by this, so naturally we had to hack something together with Summa, our precious hammer. Due to other highly-prioritized assignments, we only had about a week to get it to work, so corners were cut where possible. Using the ARC-reader from Heritrix and the Tika-toolkit for analyzing the wealth of different data, the aptly named Arctika was born. Arctika handled the web stuff and an aggregator handled the integration with our standard library index.

It could use a lot more work, but it worked surprisingly well for a quick hack. We were able to demonstrate everything we wanted: The integrated search made sense, the ranking generally pulled the good stuff to the top (admittedly, tweaking the ranking for different sources would surely be needed for a real application) and the faceting system clearly helped give an overview of material types & sources and provided an easy means to do temporal navigation in the search-result: Limiting searches to a specific period of time is quite usable for investigating the media handling of major events.

So what’s the dark part? Well, legislation. As always. That and money. Harvested web material is sensible, only legally accessible for the selected few professors. On top of that, showing snippets from harvested web pages seems – at the moment – to require compensating the content owners, according to EU-law. Opening up for all the material at once will probably not happen in the foreseeable future.

Happily we don’t need to do everything at once. If we limit the public accessible index to websites from the government and companies, it should be legal to show the search-results and the stored versions (hello continuity). Add the recorded television and radio to the mix, pour in scanned newspapers, integrate with old-school books and presto, we have something great. Danish culture at our fingertips, past and present.

Dreaming, I know. But on the technical level, we just need the green light from the bigwigs to make this happen.

A screenshot, you say? Why, yes, of course. We present this super-cool bling bling interface with a stupendously large amount of interesting information to you. Slightly marred by the need to sensor out some sensible information and the fact that indexing time was capped at half a day to make the deadline.

Sample search in Arctika

Sample search in Arctika

Brilliant, guys!

September 20, 2009

Cancel Undo

Our fine usability guys hard at work.

An Excursion in Java Recursion

September 4, 2009

A quick Googling defines Excursion as: “a journey taken for pleasure”. Considering what I am about to blog about the title of this blog post might be a bit misleading, but you gotta give me one for the rhyme ;-)

As you might or might not know, doing recursion in Java is simply a bad thing. This is mainly because Java can’t do tail recursion. You can use recursion in Java if you are absolutely positive that you are only going to do a very limited number of recursive calls. If you could possibly go over 100 calls you should consider making it a for or while loop instead, if the Java runtime performs somewhere around 1000 recursive calls you will get a StackOverflowError. This is really bad – you see if you read the StackOverflowError docs you will see that it is a subclass of VirtualMachineError. The docs for VirtualMachineError says:

Thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating

This means that you have pretty much no choice but to log a fatal error and abort the JVM.

There are ways for making the recursion limit of the JVM bigger by setting some system properties, but that is really just a band aid and I would advise against using them.

The Real Life Case: XML Parsing

Java 6 ships with a new XML parsing library, the core class of which being XMLStreamReader (also known as the “push parser”). I must say that it is quite a nice library and a huge improvement over SAX parsing, while still keeping a blazing performance. We use it in Summa and has been very happy with it.

The problem came when we started indexing documents like this one: java-recursion-lection-1.xml. You can definitely expect to find similar structures out in the wild (as we have seen here at work). The basic document structure is as follows:

<mydocument>
  <mytag>
     SOME TEXT BLOCK
  </mytag>
</mydocument>

If we just want to extract the text block it would be annoying with a standard SAX parser because a SAX parser splits up characters segments into arbitrary chunks and you have to collect them into one string yourself. The push parser API makes this a lot easier because it defines the property XMLInputFactory.IS_COALESCING which, when set, requires the parser to collect all the text chunks into one string. So extracting the raw text contents is easy peasy lemon squeezy:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;
import java.io.FileReader;

/**
 * A small excursion in Java recursion.
 */
public class JavaRecursionLecture1 {

  public static void main(String[] args) throws Exception {
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
    inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

    XMLStreamReader reader = inputFactory.createXMLStreamReader(
               new FileReader("/home/mke/Documents/java-recursion-lection-1.xml"));
    parse(reader);
  }

  public static void parse(XMLStreamReader reader) throws Exception {
    while (reader.hasNext()) {
      int event = reader.getEventType();
      switch (event) {
        case XMLEvent.START_DOCUMENT :
          System.out.println("Document start");
          break;
        case XMLEvent.START_ELEMENT :
          System.out.println("Element: " + reader.getLocalName() );
          break;
        case XMLStreamReader.CHARACTERS :
          // Warning: Here be StackOverflowErrors
          System.out.println("Char data:\n" + reader.getText());
          break;
      }
    reader.next();
    }
  }
}

Except that this will throw a StackOverflowError if you run it on the file I linked you to. “What is up with that, there is no recursion here!” – you ask?

The problem here is that XMLStreamReader is highly recursive underneath the hood. My file contains lots of XML entities and the parser will make a recursive call each time a new entity is found in the stream. Looking at the heart of the implementation you will see that the author(s) actually where very minute about making sure that all recursive calls where tail calls. This would have been very robust had the Java runtime supported tail recursion – alas.

There are two ways to work around this misfeature. The first one is to don’t set the IS_COALESCING property, and then change the switch statement to something like this, using reader.getElementText() instead:

switch (event) {
  case XMLEvent.START_DOCUMENT :
    System.out.println("Document start");
    break;
  case XMLEvent.START_ELEMENT :
    System.out.println("Element: " + reader.getLocalName() );

    if ("mytag".equals(reader.getLocalName())) {
      System.out.println(reader.getElementText());
    }
    break;
  case XMLStreamReader.CHARACTERS :
    // Warning: Here be StackOverflowErrors
    System.out.println("Char data:\n" + reader.getText());
    break;
 }

This is not particularly elegant since it hard codes our <mytag> element. A more generic way is to provide your own coalescing implementation of getText():

/**
 * Use this method in response to XMLEvent.CHARACTERS event instead of
 * XMLStreamReader.getElementText() on a XMLEvent.START_ELEMENT. The former
 * approach will
 * @param reader the XMLStreamReader to pull character data out of,
 *               the reader is expected to be in a XMLEvent.CHARACTERS state
 * @return A string containing the full character data as one string
 * @throws Exception if the Jupiter aligns with Mars
 */
 public static String getCoalescedText(XMLStreamReader reader)
 throws XMLStreamException {
   StringBuilder builder = new StringBuilder();
   char[] buf = new char[1024];

   while (reader.getEventType() == XMLEvent.CHARACTERS) {
     int offset = 0;
     int len;
     while (true) {
       len = reader.getTextCharacters(offset, buf, 0, buf.length);
       if (len != 0) builder.append(buf, 0, len);
       if (len < buf.length) break;
     }
     reader.next();
   }
   return builder.toString();

And then in the switch branch checking on character events do:

     case XMLStreamReader.CHARACTERS :
       // Warning: If you expect a StackOverflowError here, you are
       //          going to wait a long while!
       System.out.println("Character data:\n"
                          + getCoalescedText(reader));
       break;

Anyway – this became a long an code-full post. All I really wanted to say was Avoid recursion in Java unless you know exactly what you are doing.