Archive for the ‘kamstrup’ Category

IntelliJ Idea Open Sourced

October 16, 2009

Wow, I must admit that the latest news from JetBrains takes me quite by surprise! But what a sweet surprise it is!

Scala and Git support out of the box you say? This is more than welcome – now the next generation development experience is enabled out of the box.

I can’t help but wonder why they did it though? Growing pressure from Netbeans and Eclipse? I’ve always thought that Idea was the better of the three – thus expecting it to generate a fine revenue? Perhaps not -  or perhaps JetBrains had a sudden fit of philanthropy? Or perhaps open source is just a superior development model. No matter the true motivation I am pretty hyped about this :-)

An Excursion in Java Recursion

September 4, 2009

A quick Googling defines Excursion as: “a journey taken for pleasure”. Considering what I am about to blog about the title of this blog post might be a bit misleading, but you gotta give me one for the rhyme ;-)

As you might or might not know, doing recursion in Java is simply a bad thing. This is mainly because Java can’t do tail recursion. You can use recursion in Java if you are absolutely positive that you are only going to do a very limited number of recursive calls. If you could possibly go over 100 calls you should consider making it a for or while loop instead, if the Java runtime performs somewhere around 1000 recursive calls you will get a StackOverflowError. This is really bad – you see if you read the StackOverflowError docs you will see that it is a subclass of VirtualMachineError. The docs for VirtualMachineError says:

Thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating

This means that you have pretty much no choice but to log a fatal error and abort the JVM.

There are ways for making the recursion limit of the JVM bigger by setting some system properties, but that is really just a band aid and I would advise against using them.

The Real Life Case: XML Parsing

Java 6 ships with a new XML parsing library, the core class of which being XMLStreamReader (also known as the “push parser”). I must say that it is quite a nice library and a huge improvement over SAX parsing, while still keeping a blazing performance. We use it in Summa and has been very happy with it.

The problem came when we started indexing documents like this one: java-recursion-lection-1.xml. You can definitely expect to find similar structures out in the wild (as we have seen here at work). The basic document structure is as follows:

<mydocument>
  <mytag>
     SOME TEXT BLOCK
  </mytag>
</mydocument>

If we just want to extract the text block it would be annoying with a standard SAX parser because a SAX parser splits up characters segments into arbitrary chunks and you have to collect them into one string yourself. The push parser API makes this a lot easier because it defines the property XMLInputFactory.IS_COALESCING which, when set, requires the parser to collect all the text chunks into one string. So extracting the raw text contents is easy peasy lemon squeezy:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;
import java.io.FileReader;

/**
 * A small excursion in Java recursion.
 */
public class JavaRecursionLecture1 {

  public static void main(String[] args) throws Exception {
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
    inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

    XMLStreamReader reader = inputFactory.createXMLStreamReader(
               new FileReader("/home/mke/Documents/java-recursion-lection-1.xml"));
    parse(reader);
  }

  public static void parse(XMLStreamReader reader) throws Exception {
    while (reader.hasNext()) {
      int event = reader.getEventType();
      switch (event) {
        case XMLEvent.START_DOCUMENT :
          System.out.println("Document start");
          break;
        case XMLEvent.START_ELEMENT :
          System.out.println("Element: " + reader.getLocalName() );
          break;
        case XMLStreamReader.CHARACTERS :
          // Warning: Here be StackOverflowErrors
          System.out.println("Char data:\n" + reader.getText());
          break;
      }
    reader.next();
    }
  }
}

Except that this will throw a StackOverflowError if you run it on the file I linked you to. “What is up with that, there is no recursion here!” – you ask?

The problem here is that XMLStreamReader is highly recursive underneath the hood. My file contains lots of XML entities and the parser will make a recursive call each time a new entity is found in the stream. Looking at the heart of the implementation you will see that the author(s) actually where very minute about making sure that all recursive calls where tail calls. This would have been very robust had the Java runtime supported tail recursion – alas.

There are two ways to work around this misfeature. The first one is to don’t set the IS_COALESCING property, and then change the switch statement to something like this, using reader.getElementText() instead:

switch (event) {
  case XMLEvent.START_DOCUMENT :
    System.out.println("Document start");
    break;
  case XMLEvent.START_ELEMENT :
    System.out.println("Element: " + reader.getLocalName() );

    if ("mytag".equals(reader.getLocalName())) {
      System.out.println(reader.getElementText());
    }
    break;
  case XMLStreamReader.CHARACTERS :
    // Warning: Here be StackOverflowErrors
    System.out.println("Char data:\n" + reader.getText());
    break;
 }

This is not particularly elegant since it hard codes our <mytag> element. A more generic way is to provide your own coalescing implementation of getText():

/**
 * Use this method in response to XMLEvent.CHARACTERS event instead of
 * XMLStreamReader.getElementText() on a XMLEvent.START_ELEMENT. The former
 * approach will
 * @param reader the XMLStreamReader to pull character data out of,
 *               the reader is expected to be in a XMLEvent.CHARACTERS state
 * @return A string containing the full character data as one string
 * @throws Exception if the Jupiter aligns with Mars
 */
 public static String getCoalescedText(XMLStreamReader reader)
 throws XMLStreamException {
   StringBuilder builder = new StringBuilder();
   char[] buf = new char[1024];

   while (reader.getEventType() == XMLEvent.CHARACTERS) {
     int offset = 0;
     int len;
     while (true) {
       len = reader.getTextCharacters(offset, buf, 0, buf.length);
       if (len != 0) builder.append(buf, 0, len);
       if (len < buf.length) break;
     }
     reader.next();
   }
   return builder.toString();

And then in the switch branch checking on character events do:

     case XMLStreamReader.CHARACTERS :
       // Warning: If you expect a StackOverflowError here, you are
       //          going to wait a long while!
       System.out.println("Character data:\n"
                          + getCoalescedText(reader));
       break;

Anyway – this became a long an code-full post. All I really wanted to say was Avoid recursion in Java unless you know exactly what you are doing.

Summa Moving to SourceForge

August 4, 2009

Yesterday I had the pleasure to announce on the mailing lists that Summa has reached the first milestone in migrating to SourceForge, and here follows the blog post :-)

From now on all Summa code is hosted and developed in the “summa” project on SourceForge now, in addition all bugs have been migrated from our old GForge solution to a Trac instance hosted via the cool new “hosted apps” functionality on SourceForge.

We will also move the mailing lists over in the near future. The fate of the Summa wiki is still left unclear.

I must be frank and admit that I have long felt that SourceForge was in a bit of a standstill applying only visual refreshes every now and then, and never fixing the real issues with the site. However the new Hosted Apps approach is simply sweet! There is a huge list of popular open source products you can choose to run on your site as a hosted apps (see an incomplete list here). For instance; some may surprised to know that popular version control systems such as Git, Mercurial, and Bazaar is supported as well as Subversion. Right now we run only a Trac issue tracker and a Subversion repository.

On a personal note I must still admit that my heart lies with the recently open sourced Launchpad, despite the recent kick-assiness from the SF team.

Efficient sorting and iteration on large databases

June 15, 2009

Before you read on, heed my words that this post might be a wee bit technical… If not extremely technical – caveat emptor

The Problem

In our continuous quest for a blazingly fast Summa, we ran into a performance problem extracting and sorting huge result sets from our caching database. Concretely we store ~9M rows in a H2 database, all records are annotated with a modification time (henceforth mtime) and we use this timestamp to determine if we need to update the index. When updating the index we read records from the database, sorted by this mtime column.

This means that for the initial indexing we create a sorted result set of 9M records. The first observation is that we should definitely have an index on the mtime column. Even with that, many databases will take some time for such queries and it might lead to big memory allocations or temporary tables being set up. We don’t want any of that. We want lightweight transactions and speed!

Take One, LIMIT and OFFSET

The naive approach (atleast, the first thing that I tried!) is to use the LIMIT and OFFSET Sql statements to create small result sets, of size 1000, and then do client side paging, something alá:

  SELECT * FROM records ORDER BY mtime OFFSET $last_offset LIMIT 1000

Here we increment last_offset by 1000 each time we request a new page. However this solution will perform extremely bad. The database server will need to discard the first last_offset records before it can return the next 1000 records to you, when we are talking millions of records this can be quite an overhead. The database can not apply any smart tricks to make this fast because it has no a priori way to find out where the record with offset last_offset into the result set begins.

Take Two, Salted Timestamps

So what can we do? The thing that databases are fast at is looking stuff up in indexes. We need to make it use some indexes to calculate the pages… 

The idea is to use the index on the mtime column to calculate the offset, then when we request a new page we use the mtime of the last record in the last result set. This may just work out because we sort everything by mtime. Maybe like:

  SELECT * FROM records WHERE mtime>$last_mtime ORDER BY mtime LIMIT 1000

Alas, this contains a subtle bug. Since we might insert more than one record per millisecond the mtime of a record might now be unique. This means that we might skip some records in between pages or include some records in multiple pages.

If we somehow force the mtimes to be unique the above query would actually work. One solution is to always ensure that there is at minimum 1ms between each insertion – this is way too slow for us, so we deviced what we have dubbed Salted Timestamps.

Instead of using 32 bit INTEGERs for mtime we use 64 bit integers (a BIGINT on most Sql servers). We move the actual timestamp to the most significant 44 bits and then store a salt in the least significant 20 bits. The salt is basically just a counter that is reset each millisecond, meaning that we can add 1048576 records per millisecond before we run out of salts. With this construct way we get a “timestamp” that still sorts correctly and we can even create a UNIQUE index on the mtime column.

Conclusion

We have adopted the approach with salted timestamps as described above for Summa and so far it has proven to perform quite well (avg. ~2000 records/s). An added bonus is that we only put very light load on the db, because the transactions are small and fast. You can find an implementation of this scheme in the DatabaseStorage* class and the timestamp handling in the UniqueTimestampGenerator* class in the Storage module in the Summa sources.

*) Most sorry that these links require a login (which is freely available, but anyway) – we are working on a solution with anonymous access. More on that later.

Asking for Trouble? You’ve Come to the Right Place!

April 2, 2009

Toke was asking for trouble yesterday. I would assume that he knew me better by now… With my last commit it is now actually possible to inline Javascript inside your configuration when using a ScriptFilter.

The following now actually works:

<xproperties>
  <entry>
    <key>filter.name</key>
    <value class="string">InlineJavascriptTest</value>
  </entry>
  <entry>
    <key>summa.filter.sequence.filterclass</key>
    <value class="string">dk.statsbiblioteket.summa.common.filter.object.ScriptFilter</value>
  </entry>
  <entry>
    <key>filter.script.inline</key>
    <value class="string"><![CDATA[
                    payload.getRecord().setId('inlineJavascript');
    ]]></value>
  </entry>
</xproperties>

I can not begin to enumerate all the dangers in doing this, but somehow the thrill of the possibilities got the better of me. If Javascript isn’t your thing you can specify the script language to anything supported by your Java runtime by defining the property filter.script.lang.

So – use this at your own peril!

UPDATE: You can find a list of available ScriptEngines for Java at scripting.dev.java.net.

Javascript Filters in Summa

April 1, 2009

I just completed the draft implementation of Javascript filters for Summa and I am posting here to hear some comments. If nobody complains the existing implementation will be likely to stay unchanged. Really the implementation supports any old scripting language supported by the ScriptEngineManager of the JVM, but in practice Javascript will probably be the most important one.

The scripting environment will include two “magic” variables: payload and allowPayload. Unsurprisingly the payload variable contains a reference to the Payload object being processed. The allowPayload variable is a boolean value that defaults to true. If allowPayload is set to false the payload will be dropped from the processing pipeline.

Update: The script filters now have a third magic variable called log sporting the methods log.trace|debug|info|warn|error|fatal(string).

The best way to explain this is probably with an example. To write a Javascript filter for Summa create a file called myFilter.js with the following content:

var record = payload.getRecord();

if (!record.getId().startsWith(record.getBase())) {
    record.setId(record.getBase() + "_" + record.getId())
}

if (record.getId().endsWith('taboo')) {
    allowPayload = false;
}

This script will make sure that all records have their ids prefixed with their base name, and will filter out any records which id ends with “taboo”.

To plug the script into your filter pipeline you need to stick something like the following in your filter chain configuration:

<xproperties>
    <entry>
        <key>filter.name</key>
        <value class="string">FixRecordIdsandDropTaboos</value>
    </entry>
    <entry>
        <key>summa.filter.sequence.filterclass</key>
        <value class="string">dk.statsbiblioteket.summa.common.filter.object.ScriptFilter</value>
    </entry>
    <entry>
        <key>filter.script.url</key>
        <value class="string">http://example.com/filters/myFilter.js</value>
    </entry>
</xproperties>

I am using the ScriptEngine framework which appeared in Java 6 for all of this, and all in all the development experience has been quite nice. Writing this blog post took almost as long as it took me to write that filter :-)

UPDATE: You can find a list of all available scripting engines at scripting.dev.java.net.

Summa 1.3.0 Live and Unleashed

March 28, 2009

As Toke hinted earlier we released Summa 1.3.0 yesterday. It is no secret that the 1.3.0 release cycle was longer than we planned for, but in the end I think the gains have been worth it. It took a lot of hard work (and lots of patience from our families), but we got there.

A lot of effort in the 1.3.0 cycle revolved about 3 main things:

  • Performance We really wanted to be able to provide quick turn around times from top to bottom of the Summa stack.  There’s a lot of activity around Summa these days and we never know where Summa might end up being deployed. Time has tought us that being able to quickly rebuild the indexes and record caches is a huge boon when venturing into unknown territory.
  • Details Make the small details work and do as expected. This means sanitizing the log output to be as useful and concise as possible, but also making sure that it is easy to track any records that are dropped from the indexing chain for what ever reason. Also stuff like being able to update the modification time of one single record and having that trigger a corresponding update of the index
  • Bug fixing Hunt down those kritters and write tonnes of unit tests to to keep that wooden stake through their hearts so that they don’t rise from the dead

The reasons for these points of focus are many, but two things are worth mentioning. The first thing is that we are aiming to use the Summa 1.3 series for production here at the State and University Library of Denmark, which means that we can finally replace the aging “pilot Summa” we are running in production these days (Summa started out as a closed source foray into the realm of relavancy ranked integrated search – and an old version of this project is the basis for the current search engine behind statsbiblioteket.dk). Secondly we felt the need to have a solid base to work on to try out all the cool stuff we have been talking about since we started our journey into the land of searching. This is also why Summa 1.3.0 doesn’t really have any new big features, mostly polish and optimizations of the existing code base.

All in all I am pretty proud of this 1.3.0 release and I am really looking forward to deploying it in a real world scenario. Also; with this milestone set I am confident that we are going to have a blast hacking on the mile long list of cook ideas we have floating around. I should post about some of these ideas soonish, but not today; today I am going to slack off and play with the kids.

Adventures in the land of databases

March 4, 2009

The tale of three databases

In the olden days Summa used a Postgresql database as a backend for its caching/presentation engine, also called the Storage. Postgresql served us well – good performance and good reliability. The one drawback was that it required additional setup if one wanted to deploy Summa on a new machine.

In our fierce hunt for easy deployments of the Summa search engine we decided that we wanted to try out using an embedded Java database. This way we don’t depend on external processes or setup. Since Apache Derby was recently included in the Sun’s Java 6 we decided to go down that route. And for a long while it seemed that Derby was the hammer of all nails.

However, when we started doing scalability tests Derby started to play tricks with us. You see, Summa can do incremental updates of its internal indexes and for this to work we need to be able to determine efficiently if any of the indexed data has been changed and needs an update in the index. With a couple million records in the database Derby seemed to get extremely slow. I did some honest attempts at tweaking our data models and talking to the Derby community about enhancing the performance, but at the end of the day we couldn’t make it meet out needs.

After a few days with out sufficient progress (I was able to make small performance imrpovements, but not enough) it was decided to try and re-enable the olde Postgresql backend. Behold; everything was snappy and we had acceptable performance. The only problem was that now we had not solved our initial goal which was easy setup…

Enter H2. After Googling around a bit I stumbled upon the H2 embedded Java database. After some initial testing its performance seemed very promising – faster than Postgresql even though Postgresql is written in C and H2 in Java. This once again proves that Java is not slow (a prejudice that some people will probably retain to the day they die).

The Devilish Details

Despite H2 rocking our socks off, all was not well in paradise. Like Derby (particularly Derby I might add) H2 did not perform well on SQL JOINs. In our case we are talking LEFT JOINs on huge result sets. To fight this I added a mode to our abstract database layer that would avoid the JOINs and instead perform direct lookups of the stuff we JOINed in. This gave us a good performance boost, but still not good enough.

Next thing we had to learn the hard way was that when Thomas Mueller (the über cool H2 main hacker) says that H2 does not perform well on large result sets you better believe him :-) So again I added a new mode to the abstract database layer to fetch the result set as a sequence of SQL queries using the LIMIT and OFFSET keywords. In essence it is really just client side paging of the result set. With this in place I am proud to say that we have H2 delivering stellar performance.

It is likely that the same client side paging would make the Derby backend acceptably fast as well, but unfortunately Derby does not support manual paging very well (ok, you can doit, but it is not really nice).

Conclusions

Instead of the single Derby backend for our storage component that we had initially anticipated we now have three backends. The H2 and Postgresql ones perform very well while the Derby one could use some love (and is workable for smaller colelctions).

I also learned a lot more about JDBC and SQL than I had ever hoped to ever know :-)

Toke’s Christmas Present for Y’all

December 19, 2008

Just one last status update before I rush off to buy the last Christmas presents. Toke just converted Summa’s internal Lucene document builder to use the new XMLStreamReader found in Java 6, instead of using some XPath/DOM magic we had been using hitherto when indexing in Summa (yeah, we know DOM parsing is silly here, but it had proven stable and “OK” for a very long time).

This provided an overal indexing speedup of a factor 8. I think that is Toke’s way of saying “merry Christmas everybody” :-)

Scp magic

December 19, 2008

Ever needed to copy a file from one host to another, but being forced to go over an intermediate host because you don’t have the rights to access the machine with the file directly?

Let me put an end to your dispair then! Here’s a small script called scp-via that can do this for you! Just run

scp-via $VIA_HOST $FROM_TARGET $TO_TARGET

- with the same syntax for FROM_TARGET and TO_TARGET as you use for regular scp.

Basically it performs a little scp, netcat, ssh magic under the hood, but hopefully you need not worry about that.

I believe I need not say that scp-via comes with absolutely no warranty or guarantees about not desctroying your hard drives. Run it at your own risk.