Usability testing Summa Search

May 5, 2009 by michaelpoltorak


We recently did a usability test of the library’s Summa-based search engine – known as Search. To ensure objectivity, the test was headed by Julia from UNI-C and was done as a think-out-loud test with eight users from the nearby university. You can download the report in Danish (English version coming up) or read a brief recap of the conclusions:

Bad stuff first:

  • Use of sort and facets. Many test users didn’t use or had trouble using facets and sort features for narrowing down search results.
  • The request list. Conceptually, the request list is hard to grasp for some users. In order to request a number of items, the user must put the items on the list one by one. Once done, he has to press the request button to actually perform the request. It seems that some users miss the last crucial step and actually believe that requests have been sent after the first step – but they have not.
  • Search and request procedure for articles. Finding and requesting articles is perhaps the Achilles’ heel of the current system. Very often users think they can search for individual articles. In most cases they cannot, but actually have to go through a printed or an online journal to either find data about the article or to get an online version for download. Unfortunately, it has been difficult for us to communicate this counter-intuitive circumstance to users through the interface. Clearly, we’ll need to have another go at the problem, but the best solution – i.e. having all articles searchable and preferably in full-text versions – is not likely to happen in the near future.

The good things:

  • Todo list. The list has been very well received as many users like to keep information about material between sessions. Furthermore, the concept is intuitive and known from other websites and applications – and even from the real world
  • Google style straightforward search. Overall the system is fast and easy to use in terms of searching
  • Did you mean. The Google inspired feature is great for catching tpyos
  • Suggest. The feature suggests words that other users have already searched for. The most popular searches are shown first.
    This is a feature that many users seem to just use out-of-the-box. It can be used as inspiration as well as a quick-spell-thing
  • Added value, such as book covers, table-of-contents, sample chapters, author biographies, etc. Many materials in our database have such extra content information added. Users like this and find it very helpful in judging a material’s relevance.

Overall, we are very satisfied with the test because it confirmed some suspicions we’ve been having for some time now, and especially because it highlighted the problems related to the request list. We’ll be working with the problem areas over the next few months.

Rising Summa

April 21, 2009 by eskildsen

The people higher up in the food-chain have decided to provide a Summa-based search backend to public libraries in Denmark. For an annual fee, Statsbiblioteket handles the flow of data from raw dumps to webservices and keeps the servers running. Maintenance, money and sales are fairly boring, seen from our developer perspective, but tweaking Summa to allow for easy experimentation and setup has been very rewarding.

As usual Mikkel wove his magic and created a package (read: A collection of scripts and all the JARs from the Summa project) that makes it very simple to set up a local Summa for experimentation. The working title was Summix, but we all know how it goes with working titles.

With some late night fiddling, the complexity was reduced to “Unzip and run a script”, which gets a Summa demo running with a skeleton web front end. Added bonus? It runs under Windows as well as Linux (and probably OS X too, but we haven’t checked). We will write a tutorial on the wiki Real Soon Now.

Getting there...


Asking for Trouble? You’ve Come to the Right Place!

April 2, 2009 by kamstrup

Toke was asking for trouble yesterday. I would assume that he knew me better by now… With my last commit it is now actually possible to inline Javascript inside your configuration when using a ScriptFilter.

The following now actually works:

<xproperties>
  <entry>
    <key>filter.name</key>
    <value class="string">InlineJavascriptTest</value>
  </entry>
  <entry>
    <key>summa.filter.sequence.filterclass</key>
    <value class="string">dk.statsbiblioteket.summa.common.filter.object.ScriptFilter</value>
  </entry>
  <entry>
    <key>filter.script.inline</key>
    <value class="string"><![CDATA[
                    payload.getRecord().setId('inlineJavascript');
    ]]></value>
  </entry>
</xproperties>

I can not begin to enumerate all the dangers in doing this, but somehow the thrill of the possibilities got the better of me. If Javascript isn’t your thing you can specify the script language to anything supported by your Java runtime by defining the property filter.script.lang.

So – use this at your own peril!

UPDATE: You can find a list of available ScriptEngines for Java at scripting.dev.java.net.

Javascript Filters in Summa

April 1, 2009 by kamstrup

I just completed the draft implementation of Javascript filters for Summa and I am posting here to hear some comments. If nobody complains the existing implementation will be likely to stay unchanged. Really the implementation supports any old scripting language supported by the ScriptEngineManager of the JVM, but in practice Javascript will probably be the most important one.

The scripting environment will include two “magic” variables: payload and allowPayload. Unsurprisingly the payload variable contains a reference to the Payload object being processed. The allowPayload variable is a boolean value that defaults to true. If allowPayload is set to false the payload will be dropped from the processing pipeline.

Update: The script filters now have a third magic variable called log sporting the methods log.trace|debug|info|warn|error|fatal(string).

The best way to explain this is probably with an example. To write a Javascript filter for Summa create a file called myFilter.js with the following content:

var record = payload.getRecord();

if (!record.getId().startsWith(record.getBase())) {
    record.setId(record.getBase() + "_" + record.getId());
}

if (record.getId().endsWith('taboo')) {
    allowPayload = false;
}

This script will make sure that all records have their ids prefixed with their base name, and will filter out any records whose id ends with “taboo”.

To plug the script into your filter pipeline you need to stick something like the following in your filter chain configuration:

<xproperties>
    <entry>
        <key>filter.name</key>
        <value class="string">FixRecordIdsandDropTaboos</value>
    </entry>
    <entry>
        <key>summa.filter.sequence.filterclass</key>
        <value class="string">dk.statsbiblioteket.summa.common.filter.object.ScriptFilter</value>
    </entry>
    <entry>
        <key>filter.script.url</key>
        <value class="string">http://example.com/filters/myFilter.js</value>
    </entry>
</xproperties>

I am using the ScriptEngine framework which appeared in Java 6 for all of this, and all in all the development experience has been quite nice. Writing this blog post took almost as long as it took me to write that filter :-)

UPDATE: You can find a list of all available scripting engines at scripting.dev.java.net.

Summa 1.3.0 Live and Unleashed

March 28, 2009 by kamstrup

As Toke hinted earlier we released Summa 1.3.0 yesterday. It is no secret that the 1.3.0 release cycle was longer than we planned for, but in the end I think the gains have been worth it. It took a lot of hard work (and lots of patience from our families), but we got there.

A lot of effort in the 1.3.0 cycle revolved around three main things:

  • Performance We really wanted to be able to provide quick turn around times from top to bottom of the Summa stack. There’s a lot of activity around Summa these days and we never know where Summa might end up being deployed. Time has taught us that being able to quickly rebuild the indexes and record caches is a huge boon when venturing into unknown territory.
  • Details Make the small details work and do as expected. This means sanitizing the log output to be as useful and concise as possible, but also making sure that it is easy to track any records that are dropped from the indexing chain for whatever reason. Also stuff like being able to update the modification time of one single record and having that trigger a corresponding update of the index.
  • Bug fixing Hunt down those critters and write tonnes of unit tests to keep that wooden stake through their hearts so that they don’t rise from the dead.

The reasons for these points of focus are many, but two things are worth mentioning. The first thing is that we are aiming to use the Summa 1.3 series for production here at the State and University Library of Denmark, which means that we can finally replace the aging “pilot Summa” we are running in production these days (Summa started out as a closed source foray into the realm of relevancy ranked integrated search – and an old version of this project is the basis for the current search engine behind statsbiblioteket.dk). Secondly we felt the need to have a solid base to work on to try out all the cool stuff we have been talking about since we started our journey into the land of searching. This is also why Summa 1.3.0 doesn’t really have any new big features, mostly polish and optimizations of the existing code base.

All in all I am pretty proud of this 1.3.0 release and I am really looking forward to deploying it in a real world scenario. Also, with this milestone set I am confident that we are going to have a blast hacking on the mile long list of cool ideas we have floating around. I should post about some of these ideas soonish, but not today; today I am going to slack off and play with the kids.

Speed thrills

March 27, 2009 by eskildsen

As part of Summa 1.3 planning, we set up some performance-goals: Indexing speed should be 500+ records/sec, storage extraction speed should be twice that and so on. Paired with the magnificent VisualVM and a penchant for bit-fiddling, we’ve often found it hard to go home from work: The hunt for the next speed-increase has just been too exciting. Yes, we’re lost cases.

Experimental patching of the H2 database, introduction of an optimized replace-framework and severe reductions in object allocation have been part of the last few weeks. Boringly, our main time was spent on new features and bug hunting, which our fearless leaders strangely enough think take precedence, but oh how those glorious moments of “We got a factor 10 speed increase in Analyzer Foo” make up for all that.

ThreadLocal StringBuilders for Fast Text Processing

March 12, 2009 by kamstrup

It is a common task for many server-like applications to process a lot of text-like objects in some streaming manner or other. Many Java programmers tend to think that it is a very good idea to do it like this:

public String processRecord(Record rec) {
    StringBuilder builder = new StringBuilder(1024);
    // Build a string from rec
    return builder.toString();
}

If this pattern looks completely sensible to you then please read this blog post carefully :-)

Debunking A Myth: Object Allocations Are Not Free

Allocating lots of Java objects is a bad idea; simple as that. I recently optimized our object allocation in the Summa indexer using the technique I am about to describe and it gave a 2-3 times increase of our overall throughput! Java, specifically the JVM, is not a magic beast that can dispose and allocate memory for free. The modern garbage collectors are very cool and advanced, but they are only so good. Even though Java is a garbage collected language you still have to think about your memory allocations!

In the example above the StringBuilder is especially bad because it typically allocates a rather big char array. So we should try to avoid that.

Resetting a StringBuilder

Contrary to the impression that the Javadocs for StringBuilder will give most people (in the classical over-generalization manner, “most people” means “me”), you can reset a StringBuilder by doing

builder.setLength(0);

This will not allocate a new char array underneath; I know because I checked the Java 6 source code.
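If you would rather see it than take my word for it, here is a minimal sketch that checks the observable side of the claim: the builder's capacity is unchanged after the reset while its length drops to zero. The class and method names are made up for this illustration and are not part of Summa.

```java
final class BuilderResetDemo {

    /** Returns {capacity before reset, capacity after reset, length after reset}. */
    static int[] resetAndMeasure() {
        StringBuilder builder = new StringBuilder(1024);
        builder.append("some record text");

        int capacityBefore = builder.capacity();
        builder.setLength(0); // drops the content but keeps the backing array

        return new int[] { capacityBefore, builder.capacity(), builder.length() };
    }

    public static void main(String[] args) {
        int[] result = resetAndMeasure();
        System.out.println("capacity before=" + result[0]
                + ", capacity after=" + result[1]
                + ", length after=" + result[2]);
    }
}
```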

Keeping Only One StringBuilder Around: Thread Locals

Thread locals are an often under used feature, both in Java and in many other languages. With Java generics they are actually a breeze to use. Firstly I should better clarify what “thread local” means. A variable that is thread local will only exist on the thread for which it was created. This makes it easier to handle concurrency because, well, you don’t have to :-) Each thread will have its own copy of the variable around.

Now might be a good time to check out the Javadocs for the ThreadLocal class. To create a thread local string builder declare a variable like this:

private ThreadLocal<StringBuilder> threadLocalBuilder =
        new ThreadLocal<StringBuilder>() {
    @Override
    protected StringBuilder initialValue() {
        return new StringBuilder();
    }

    @Override
    public StringBuilder get() {
        StringBuilder b = super.get();
        b.setLength(0); // clear/reset the buffer
        return b;
    }
};

Beware: The above thread local string builder will reset its character buffer each time you grab a reference to it. I usually find myself wanting this behavior, but if you don’t want this you should comment out the setLength(0) line.

To use the thread local builder in a method simply do:

public String processRecord(Record rec) {
    StringBuilder builder = threadLocalBuilder.get();
    // Build a string from rec
    return builder.toString();
}

Caveat Emptor

So what’s the catch? You will be keeping one string builder around for each new thread that ever enters processRecord(). This could potentially end up as lots of string builders if your application is designed like this. Also if you ever build a very large string the string builder will keep its internal character buffer at that size even though you reset it. It will be up to you dear reader to determine if that will be a problem for you. Note however that the thread local variables are deallocated when the thread owning them dies.

Of course one could also add some more intelligent resetting logic in the ThreadLocal.get() method above, like allocating a new string builder if b.capacity() becomes too big.
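A minimal sketch of that capacity check could look like the following. The class name and the 64 KB threshold are assumptions picked for the example, not values we use in Summa; tune the limit to whatever your largest "normal" record looks like.

```java
final class BoundedBuilders {

    /** Hypothetical upper bound on the buffer we are willing to keep per thread. */
    static final int MAX_CAPACITY = 64 * 1024;

    static final ThreadLocal<StringBuilder> BUILDER = new ThreadLocal<StringBuilder>() {
        @Override
        protected StringBuilder initialValue() {
            return new StringBuilder();
        }

        @Override
        public StringBuilder get() {
            StringBuilder b = super.get();
            if (b.capacity() > MAX_CAPACITY) {
                // The buffer grew unusually large: let it go and start over
                b = new StringBuilder();
                set(b);
            } else {
                b.setLength(0); // normal case: reuse the existing buffer
            }
            return b;
        }
    };
}
```

The trade-off is one extra allocation after each oversized record, in exchange for not pinning a huge char array to every worker thread.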

More Optimization: Don’t Allocate the Final String

The observant reader will notice that I also allocate a new String when I do builder.toString() in the end of processRecord(). This can also be avoided, but I have to change the method signature to return a Reader instead of a String:

public Reader processRecord(Record rec);

Unfortunately Java does not come with a Reader implementation that wraps a StringBuilder (or more generally a CharSequence). You can find such a CharSequenceReader in the Summa source code under the LGPL (I would have linked directly to the code in our SVN repo, but it requires a login (which you can create yourself, but it is a pain)).

So the exercise for the eager reader is to wrap that CharSequenceReader in a thread local to also avoid allocating that one again. Note that you also need to reset the CharSequenceReader by calling reset() on it.
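For readers who cannot get at the Summa sources, here is a rough sketch of what such a reader might look like. This is not the Summa CharSequenceReader, only an illustration written for this post; the class name is made up, and it skips mark() support and argument checking.

```java
import java.io.Reader;

final class SimpleCharSequenceReader extends Reader {

    private final CharSequence source;
    private int position = 0;

    SimpleCharSequenceReader(CharSequence source) {
        this.source = source;
    }

    @Override
    public int read(char[] buffer, int offset, int length) {
        if (length == 0) {
            return 0;
        }
        if (position >= source.length()) {
            return -1; // end of stream
        }
        int count = Math.min(length, source.length() - position);
        for (int i = 0; i < count; i++) {
            buffer[offset + i] = source.charAt(position++);
        }
        return count;
    }

    /** Rewinds to the start of the sequence so the same instance can be read again. */
    @Override
    public void reset() {
        position = 0;
    }

    @Override
    public void close() {
        // Nothing to release; the underlying sequence is owned by the caller
    }
}
```

Note that a reader like this looks at the live content of the builder, so the caller has to finish reading before the same thread asks for the builder again.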

Here’s to a faster future!

XSLTs for inspiration

March 11, 2009 by villadsen

We have decided to publish the XSLTs used for the Search website at the State and University Library, Århus, Denmark.

The files are provided “as-is” under the MIT License. They are meant to be used as a source of inspiration, as they do contain a few minor dependencies on internal code – for instance used to translate labels to human readable strings in different languages.

In other words this is a snapshot of the XSLTs that are all free to use, but we offer no guarantee that they are actually useful. Likewise we cannot be held accountable for errors or faults contained within these files.

Go to the bottom of the Sandbox page at the Wiki for the download.

Update 2009-03-17: Added some more XSLTs to the Sandbox page. These are XSLTs developed in a project for public libraries using Summa.

Adventures in the land of databases

March 4, 2009 by kamstrup

The tale of three databases

In the olden days Summa used a Postgresql database as a backend for its caching/presentation engine, also called the Storage. Postgresql served us well – good performance and good reliability. The one drawback was that it required additional setup if one wanted to deploy Summa on a new machine.

In our fierce hunt for easy deployments of the Summa search engine we decided that we wanted to try out using an embedded Java database. This way we don’t depend on external processes or setup. Since Apache Derby was recently included in Sun’s Java 6 we decided to go down that route. And for a long while it seemed that Derby was the hammer of all nails.

However, when we started doing scalability tests Derby started to play tricks with us. You see, Summa can do incremental updates of its internal indexes and for this to work we need to be able to determine efficiently if any of the indexed data has been changed and needs an update in the index. With a couple million records in the database Derby seemed to get extremely slow. I did some honest attempts at tweaking our data models and talking to the Derby community about enhancing the performance, but at the end of the day we couldn’t make it meet our needs.

After a few days without sufficient progress (I was able to make small performance improvements, but not enough) it was decided to try and re-enable the olde Postgresql backend. Behold; everything was snappy and we had acceptable performance. The only problem was that now we had not solved our initial goal which was easy setup…

Enter H2. After Googling around a bit I stumbled upon the H2 embedded Java database. After some initial testing its performance seemed very promising – faster than Postgresql even though Postgresql is written in C and H2 in Java. This once again proves that Java is not slow (a prejudice that some people will probably retain to the day they die).

The Devilish Details

Despite H2 rocking our socks off, all was not well in paradise. Like Derby (particularly Derby I might add) H2 did not perform well on SQL JOINs. In our case we are talking LEFT JOINs on huge result sets. To fight this I added a mode to our abstract database layer that would avoid the JOINs and instead perform direct lookups of the stuff we JOINed in. This gave us a good performance boost, but still not good enough.

Next thing we had to learn the hard way was that when Thomas Mueller (the über cool H2 main hacker) says that H2 does not perform well on large result sets you better believe him :-) So again I added a new mode to the abstract database layer to fetch the result set as a sequence of SQL queries using the LIMIT and OFFSET keywords. In essence it is really just client side paging of the result set. With this in place I am proud to say that we have H2 delivering stellar performance.
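To give an idea of what client side paging means in practice, here is a small sketch. It is not the Summa storage code: the class, the method names and the fetchPage callback (which stands in for running the paged query over JDBC) are all invented for this example, and it uses java.util.function for brevity, so it assumes a newer Java than the one we run on.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

final class PagedQuery {

    /** Appends LIMIT/OFFSET to a base query; H2 and Postgresql both accept this form. */
    static String pageSql(String baseQuery, int pageSize, long offset) {
        return baseQuery + " LIMIT " + pageSize + " OFFSET " + offset;
    }

    /**
     * Feeds every row to the handler, one page at a time.
     * fetchPage takes (limit, offset) and returns at most 'limit' rows;
     * in a real backend it would execute pageSql(...) and map the result set.
     * Returns the total number of rows seen.
     */
    static <T> long forEachRow(BiFunction<Integer, Long, List<T>> fetchPage,
                               int pageSize, Consumer<T> handler) {
        long offset = 0;
        while (true) {
            List<T> page = fetchPage.apply(pageSize, offset);
            for (T row : page) {
                handler.accept(row);
            }
            offset += page.size();
            if (page.size() < pageSize) {
                return offset; // a short page means we reached the end
            }
        }
    }
}
```

One thing to keep in mind with this approach: the base query needs a stable ORDER BY, otherwise the database is free to return rows in a different order for each page and you may see duplicates or miss rows.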

It is likely that the same client side paging would make the Derby backend acceptably fast as well, but unfortunately Derby does not support manual paging very well (ok, you can do it, but it is not really nice).

Conclusions

Instead of the single Derby backend for our storage component that we had initially anticipated we now have three backends. The H2 and Postgresql ones perform very well while the Derby one could use some love (and is workable for smaller collections).

I also learned a lot more about JDBC and SQL than I had ever hoped to know :-)

33,333 XSLT woes

March 3, 2009 by eskildsen

I am not the XSLT guru at work, I just call the buggers from the code. Then I get drafted into debugging the little rascals, when they freak out and drag the whole indexing workflow down. Excavation unearthed that we don’t have a limit on recursive calls. An OAI-record had turned up with 33,333 subjects (no, I am not making that number up) and the default Java Transformer happily processed those in a recursive manner. Happily until the StackOverflowError roared and bit the head off the indexer.

What kind of maniac makes a record with 33,333 subjects? It turns out that it actually makes some sense. The record was about plankton. Apparently there are three trillion different kinds of plankton (yes, I am making that number up) of which 33,333 were of interest to the writers of said record.

Here’s to A. acanthos, bergonii C, pseudofrigida c C, Subeucalanus monachus, Vibilia armata, Zygosphaera sp. C and all the rest. Long live biodiversity.