IntelliJ Idea Open Sourced

October 16, 2009 by kamstrup

Wow, I must admit that the latest news from JetBrains takes me quite by surprise! But what a sweet surprise it is!

Scala and Git support out of the box you say? This is more than welcome – now the next generation development experience is enabled out of the box.

I can’t help but wonder why they did it though? Growing pressure from Netbeans and Eclipse? I’ve always thought that Idea was the better of the three – thus expecting it to generate a fine revenue? Perhaps not -  or perhaps JetBrains had a sudden fit of philanthropy? Or perhaps open source is just a superior development model. No matter the true motivation I am pretty hyped about this :-)

Searching in the dark

September 25, 2009 by eskildsen

As part of our obligation to preserve our online cultural heritage, Statsbiblioteket and Det Kongelige Bibliotek in Denmark continuously harvest the danish web (the *.dk-domains), digitize public danish television, rip all danish-produced music and generally just collect whatever we can get our hands on. The terabytes add up (120TB for the web pages so far, more for television, radio and so on) and the machines are happily harvesting, ripping and wolfing down the bytes into semi-safe storage (2 geographically and architecturally different setups, checksummed, re-checksummed etc.). All fine and dandy.

Except that access to most of the material is rather limited and that search is … well, pretty much non-existing.

Such things tend to change over time, preluded by meetings, committees, deals and whatnot. As technicians, we are normally not directly involved in all the politics surrounding this, but in order to get some concrete arguments, we were asked to try and index some of the harvested web material and do a search demo, where web material was presented together with our normal material (books, cds, articles et al).

The harvested web material is stored in ARC-files, so the obvious choice for a quick test was NutchWAX. Setup was easy, some 100 million documents was indexed (about 2% of the harvested web material) and searches were sub-second on a modest machine. A great success in terms of answering the “is it even feasible to do this?“-question.

The “but does it makes sense to do integrated search for such different data sources as web and library books?“-question could not be answered by this, so naturally we had to hack something together with Summa, our precious hammer. Due to other highly-prioritized assignments, we only had about a week to get it to work, so corners were cut where possible. Using the ARC-reader from Heritrix and the Tika-toolkit for analyzing the wealth of different data, the aptly named Arctika was born. Arctika handled the web stuff and an aggregator handled the integration with our standard library index.

It could use a lot more work, but it worked surprisingly well for a quick hack. We were able to demonstrate everything we wanted: The integrated search made sense, the ranking generally pulled the good stuff to the top (admittedly, tweaking the ranking for different sources would surely be needed for a real application) and the faceting system clearly helped give an overview of material types & sources and provided an easy means to do temporal navigation in the search-result: Limiting searches to a specific period of time is quite usable for investigating the media handling of major events.

So what’s the dark part? Well, legislation. As always. That and money. Harvested web material is sensible, only legally accessible for the selected few professors. On top of that, showing snippets from harvested web pages seems – at the moment – to require compensating the content owners, according to EU-law. Opening up for all the material at once will probably not happen in the foreseeable future.

Happily we don’t need to do everything at once. If we limit the public accessible index to websites from the government and companies, it should be legal to show the search-results and the stored versions (hello continuity). Add the recorded television and radio to the mix, pour in scanned newspapers, integrate with old-school books and presto, we have something great. Danish culture at our fingertips, past and present.

Dreaming, I know. But on the technical level, we just need the green light from the bigwigs to make this happen.

A screenshot, you say? Why, yes, of course. We present this super-cool bling bling interface with a stupendously large amount of interesting information to you. Slightly marred by the need to sensor out some sensible information and the fact that indexing time was capped at half a day to make the deadline.

Sample search in Arctika

Sample search in Arctika

Brilliant, guys!

September 20, 2009 by eskildsen

Cancel Undo

Our fine usability guys hard at work.

An Excursion in Java Recursion

September 4, 2009 by kamstrup

A quick Googling defines Excursion as: “a journey taken for pleasure”. Considering what I am about to blog about the title of this blog post might be a bit misleading, but you gotta give me one for the rhyme ;-)

As you might or might not know, doing recursion in Java is simply a bad thing. This is mainly because Java can’t do tail recursion. You can use recursion in Java if you are absolutely positive that you are only going to do a very limited number of recursive calls. If you could possibly go over 100 calls you should consider making it a for or while loop instead, if the Java runtime performs somewhere around 1000 recursive calls you will get a StackOverflowError. This is really bad – you see if you read the StackOverflowError docs you will see that it is a subclass of VirtualMachineError. The docs for VirtualMachineError says:

Thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating

This means that you have pretty much no choice but to log a fatal error and abort the JVM.

There are ways for making the recursion limit of the JVM bigger by setting some system properties, but that is really just a band aid and I would advise against using them.

The Real Life Case: XML Parsing

Java 6 ships with a new XML parsing library, the core class of which being XMLStreamReader (also known as the “push parser”). I must say that it is quite a nice library and a huge improvement over SAX parsing, while still keeping a blazing performance. We use it in Summa and has been very happy with it.

The problem came when we started indexing documents like this one: java-recursion-lection-1.xml. You can definitely expect to find similar structures out in the wild (as we have seen here at work). The basic document structure is as follows:

<mydocument>
  <mytag>
     SOME TEXT BLOCK
  </mytag>
</mydocument>

If we just want to extract the text block it would be annoying with a standard SAX parser because a SAX parser splits up characters segments into arbitrary chunks and you have to collect them into one string yourself. The push parser API makes this a lot easier because it defines the property XMLInputFactory.IS_COALESCING which, when set, requires the parser to collect all the text chunks into one string. So extracting the raw text contents is easy peasy lemon squeezy:

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;
import java.io.FileReader;

/**
 * A small excursion in Java recursion.
 */
public class JavaRecursionLecture1 {

  public static void main(String[] args) throws Exception {
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
    inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

    XMLStreamReader reader = inputFactory.createXMLStreamReader(
               new FileReader("/home/mke/Documents/java-recursion-lection-1.xml"));
    parse(reader);
  }

  public static void parse(XMLStreamReader reader) throws Exception {
    while (reader.hasNext()) {
      int event = reader.getEventType();
      switch (event) {
        case XMLEvent.START_DOCUMENT :
          System.out.println("Document start");
          break;
        case XMLEvent.START_ELEMENT :
          System.out.println("Element: " + reader.getLocalName() );
          break;
        case XMLStreamReader.CHARACTERS :
          // Warning: Here be StackOverflowErrors
          System.out.println("Char data:\n" + reader.getText());
          break;
      }
    reader.next();
    }
  }
}

Except that this will throw a StackOverflowError if you run it on the file I linked you to. “What is up with that, there is no recursion here!” – you ask?

The problem here is that XMLStreamReader is highly recursive underneath the hood. My file contains lots of XML entities and the parser will make a recursive call each time a new entity is found in the stream. Looking at the heart of the implementation you will see that the author(s) actually where very minute about making sure that all recursive calls where tail calls. This would have been very robust had the Java runtime supported tail recursion – alas.

There are two ways to work around this misfeature. The first one is to don’t set the IS_COALESCING property, and then change the switch statement to something like this, using reader.getElementText() instead:

switch (event) {
  case XMLEvent.START_DOCUMENT :
    System.out.println("Document start");
    break;
  case XMLEvent.START_ELEMENT :
    System.out.println("Element: " + reader.getLocalName() );

    if ("mytag".equals(reader.getLocalName())) {
      System.out.println(reader.getElementText());
    }
    break;
  case XMLStreamReader.CHARACTERS :
    // Warning: Here be StackOverflowErrors
    System.out.println("Char data:\n" + reader.getText());
    break;
 }

This is not particularly elegant since it hard codes our <mytag> element. A more generic way is to provide your own coalescing implementation of getText():

/**
 * Use this method in response to XMLEvent.CHARACTERS event instead of
 * XMLStreamReader.getElementText() on a XMLEvent.START_ELEMENT. The former
 * approach will
 * @param reader the XMLStreamReader to pull character data out of,
 *               the reader is expected to be in a XMLEvent.CHARACTERS state
 * @return A string containing the full character data as one string
 * @throws Exception if the Jupiter aligns with Mars
 */
 public static String getCoalescedText(XMLStreamReader reader)
 throws XMLStreamException {
   StringBuilder builder = new StringBuilder();
   char[] buf = new char[1024];

   while (reader.getEventType() == XMLEvent.CHARACTERS) {
     int offset = 0;
     int len;
     while (true) {
       len = reader.getTextCharacters(offset, buf, 0, buf.length);
       if (len != 0) builder.append(buf, 0, len);
       if (len < buf.length) break;
     }
     reader.next();
   }
   return builder.toString();

And then in the switch branch checking on character events do:

     case XMLStreamReader.CHARACTERS :
       // Warning: If you expect a StackOverflowError here, you are
       //          going to wait a long while!
       System.out.println("Character data:\n"
                          + getCoalescedText(reader));
       break;

Anyway – this became a long an code-full post. All I really wanted to say was Avoid recursion in Java unless you know exactly what you are doing.

Summa Moving to SourceForge

August 4, 2009 by kamstrup

Yesterday I had the pleasure to announce on the mailing lists that Summa has reached the first milestone in migrating to SourceForge, and here follows the blog post :-)

From now on all Summa code is hosted and developed in the “summa” project on SourceForge now, in addition all bugs have been migrated from our old GForge solution to a Trac instance hosted via the cool new “hosted apps” functionality on SourceForge.

We will also move the mailing lists over in the near future. The fate of the Summa wiki is still left unclear.

I must be frank and admit that I have long felt that SourceForge was in a bit of a standstill applying only visual refreshes every now and then, and never fixing the real issues with the site. However the new Hosted Apps approach is simply sweet! There is a huge list of popular open source products you can choose to run on your site as a hosted apps (see an incomplete list here). For instance; some may surprised to know that popular version control systems such as Git, Mercurial, and Bazaar is supported as well as Subversion. Right now we run only a Trac issue tracker and a Subversion repository.

On a personal note I must still admit that my heart lies with the recently open sourced Launchpad, despite the recent kick-assiness from the SF team.

Ticer Summer School

August 3, 2009 by villadsen

I went to the Ticer Summer School 2009 on Friday 31st of August to to talk about Summa as part of a panel on Integrated Search alongside Jørgen Madsen (Primo), David Lindahl (eXtensible Catalog), and Benoît Pauwels (VuFind).

Tilburg University Campus

Tilburg University Campus

Thomas Place made the introduction explaining the basic concepts of Integrated Search. Afterwards we each presented our systems followed by a short question and answer session. I think it was during these presentations that people realized how similar most of the systems actually were.

After a coffee break each of us once again got up to do a second presentation – this time focusing more on a specific feature.

  • Jørgen Madsen: Joining Catalogues – Clean-up and Deduplication
  • David Lindahl: Metadata Handling and FRBR
  • Mads Villadsen: Facetting and Clustering
  • Benoît Pauwels: Web 2.0 Features of Integrated Search

Each presentation was once again followed by a longer question and answer session.

My slides from both of theses can be found on the Presentations page on the Summa Wiki.

The day concluded with a panel discussion based on questions from the audience. All during the day the people from the audience had been very good at asking relevant questions, and I think they really managed to get to the core of the issues regarding usability, faceted searching, and integrated search itself.

This is the first time I have been part of a panel in this way, and also the first time I have been at a Ticer Summer School – and I found both things to be a really great experience. I have never been to anything that has been as well organized as this, and all the people were very interested and engaged in the discussions. The other panel members were well prepared, open for discussion, and willing to talk freely about any issues. I can only hope that I was in the same league as them.

All in all I had a really nice time, and if I ever get the chance to do something similar again I would be very interested.

Quick and dirty test of the YUI Compressor

July 2, 2009 by Jørn Thøgersen

As a part of our quest trying to optimize the speed of our search front end I recently tried out the Yahoo js and css minifyer – YUI Compressor.

At first glance the nice things about the YUI Compressor are that it is a Java based (we are a Java friendly team), open source and fairly easy to work with. The YUI Compressor handles both javascript and css but in this post I have chosen to focus on the js part.

The test integration into my IDE (Intellij IDEA) and the project was quite easy because somebody has taken the time to write YUIAnt. I just downloaded the YUI compressor version 2.4.2 and the YUIAnt.jar and added them to the project and modified my build scripts to run the compressor when the website is deployed to the web server. The beauty of this is that you naturally don’t have to look at the minified javascript when editing and if you for some reason want to debug the code run time you can easily setup a debug option in your build script and bypass the compressor for on the fly debugging. If you aren’t into all this build script stuff or have a simple project there are lots of online YUI Compressor sites out there where you can paste you js code or css and get a compressed version in return.

The version 2.4.2 of the YUI Compressor nearly worked without problems. For some reason – I didn’t bother to investigate further – the YUI Compressor had some issues with unterminated Strings in the jscalendar-1.0 library. I just excluded the directory and went on with my small non scientific test using Firebug as my test environment.

The first screen shot shows the size and load times for our js files. Business as usual – the YUI Compressor is disabled.

nocompress2Scaled

The next screen shot shows the size and load times for the same js files now with the compression enabled.

compress2Scaled

The file sizes have been reduced and the overall load time has shrunk approximately half a second. When the file sizes are very small the load times are very sensitive to queing effects but the file size is in most cases reduced. In the case of bigger js files the improvement in speed as well as size is clear. I have tried to compensate for caching effects in both cases (compress/not compressed). It seems that there is about a 20-25% reduction in file size and approximately the same reduction in load time for the js. These numbers are without using the obfuscation option (reduction of variable names to the shortest possible length and other tricks) simply because I don’t thing we will be comfortable with this knowing that it might cause errors.

As I am new to this I am interested to hear about any major drawbacks compressing/minifying may have.

This is of course a small step and not something which alone makes the difference between a slow and a fast site but I am hoping that attention to a number of different optimization issues will make a big difference in the long run.

Thoughts on optimizing our search web site

July 1, 2009 by Jørn Thøgersen

wwws

The code for our search front end has over time grown to a considerable size and we have started to suspect that the web site’s response time could be better. With this in mind I have for some time now been keen on looking into optimizing the speed of our front end – especially when the underlying search engine Summa has proven to be blazingly fast.

There are a lot of things we could do better such as:

1. Optimizing the javascript code by trawling through the lot and removing redundancy as well as rewriting some of the methods to be more efficient.

2. A thorough cleanup of the css. There is a lot we can do here as we have loads of redundancy, classes which are not in use anymore and declarations which could be handled way cooler. Another thing I noticed is we like divs – loads of divs.

3. Taking a critical look at our numerous DOM transformations. Some of them are down right unnecessary.

4. General optimizing of the server side code. In fact this part isn’t all that bad but a general clean up once in a while doesn’t hurt anybody.

Because my summer holiday is coming up soon I have chosen to start with some light weight stuff. I have tried out the newest version of the YUI Compressor – tool to compress/minify javascript and css. As we don’t use minifying at the moment we should be able to benefit from it performance wise. In order not to clutter up this post I will post my experience with this in a separate post soonish.

Efficient sorting and iteration on large databases

June 15, 2009 by kamstrup

Before you read on, heed my words that this post might be a wee bit technical… If not extremely technical – caveat emptor

The Problem

In our continuous quest for a blazingly fast Summa, we ran into a performance problem extracting and sorting huge result sets from our caching database. Concretely we store ~9M rows in a H2 database, all records are annotated with a modification time (henceforth mtime) and we use this timestamp to determine if we need to update the index. When updating the index we read records from the database, sorted by this mtime column.

This means that for the initial indexing we create a sorted result set of 9M records. The first observation is that we should definitely have an index on the mtime column. Even with that, many databases will take some time for such queries and it might lead to big memory allocations or temporary tables being set up. We don’t want any of that. We want lightweight transactions and speed!

Take One, LIMIT and OFFSET

The naive approach (atleast, the first thing that I tried!) is to use the LIMIT and OFFSET Sql statements to create small result sets, of size 1000, and then do client side paging, something alá:

  SELECT * FROM records ORDER BY mtime OFFSET $last_offset LIMIT 1000

Here we increment last_offset by 1000 each time we request a new page. However this solution will perform extremely bad. The database server will need to discard the first last_offset records before it can return the next 1000 records to you, when we are talking millions of records this can be quite an overhead. The database can not apply any smart tricks to make this fast because it has no a priori way to find out where the record with offset last_offset into the result set begins.

Take Two, Salted Timestamps

So what can we do? The thing that databases are fast at is looking stuff up in indexes. We need to make it use some indexes to calculate the pages… 

The idea is to use the index on the mtime column to calculate the offset, then when we request a new page we use the mtime of the last record in the last result set. This may just work out because we sort everything by mtime. Maybe like:

  SELECT * FROM records WHERE mtime>$last_mtime ORDER BY mtime LIMIT 1000

Alas, this contains a subtle bug. Since we might insert more than one record per millisecond the mtime of a record might now be unique. This means that we might skip some records in between pages or include some records in multiple pages.

If we somehow force the mtimes to be unique the above query would actually work. One solution is to always ensure that there is at minimum 1ms between each insertion – this is way too slow for us, so we deviced what we have dubbed Salted Timestamps.

Instead of using 32 bit INTEGERs for mtime we use 64 bit integers (a BIGINT on most Sql servers). We move the actual timestamp to the most significant 44 bits and then store a salt in the least significant 20 bits. The salt is basically just a counter that is reset each millisecond, meaning that we can add 1048576 records per millisecond before we run out of salts. With this construct way we get a “timestamp” that still sorts correctly and we can even create a UNIQUE index on the mtime column.

Conclusion

We have adopted the approach with salted timestamps as described above for Summa and so far it has proven to perform quite well (avg. ~2000 records/s). An added bonus is that we only put very light load on the db, because the transactions are small and fast. You can find an implementation of this scheme in the DatabaseStorage* class and the timestamp handling in the UniqueTimestampGenerator* class in the Storage module in the Summa sources.

*) Most sorry that these links require a login (which is freely available, but anyway) – we are working on a solution with anonymous access. More on that later.

Faceting and Flash Disks in the Gobi Desert

May 16, 2009 by villadsen

As has been mentioned in many different places the code4lib 2009 videos are now online.

Those that missed the finer details in Toke’s talk about complete faceting of 100 million documents can go see the video here. Lots of good, nerdy stuff.

Toke and Mikkel also both gave lightning talks.  Mikkel’s talk about how easy it would be to set up a Summa installation in the Gobi Desert was on day three, and is available here. He is on somewhere near the middle of the video (number 5 of 9).

Toke’s talk about Flash Disks and how they will save everyone of us was on day two, and can be seen here. His talk starts about a third of the way into video as number 6 of 16 (he is actually on twice – the first first attempt without any graphs in his slides is number 4…).

Watching the videos really make me want to go to code4lib again next year.