Archive for the ‘Summa’ Category

Solid Toys for the Boys

December 8, 2009

As some may know, we have experimented quite a bit with Lucene indexes on Solid State Drives, and we’ve had very good experiences with them, seeing huge performance gains. Since we also routinely run big applications and other heavy-duty tasks on our desktop machines, our dear Toke had the idea that we should all have SSDs in our desktops. After a good deal of shopping around he settled on the Kingston v 40GB drive, as research revealed that this exact model has the good Intel metal inside (this is, for example, not the case for the 64GB model).

Yesterday we got the delivery and immediately started unpacking and upgrading our machines. And boy, were these babies worth every penny! :-)

(sorry for the ugly scaling of the following images – WordPress is killing me)

Toke was the Super delivery boy

Quick - get them before they are gone!

Yours truly is a Super happy camper

Super tag team getting their hands dirty

Firstly we did clean installations of Ubuntu, with a 10GB root partition, a ~26GB /home partition and ~4GB swap. Root and /home were formatted with Ext4, all on the SSD. The time?

  • Installing Ubuntu Karmic 64 bit from USB stick: 4 minutes (with ~1 minute waiting for network on a slow repository)

The next thing was the boot… While we were rebooting from the install session we talked about how fast the boot was going to be. But with all the talking we almost didn’t react before the machine was back up at the login screen. Wow. As we didn’t have a timer with sub-minute resolution at hand we can only give you subjective numbers. Among the spectators the opinions ranged from “negative time” to “5s” to “10s”. My personal estimates are:

  • Boot from GRUB to GDM login screen: 5s
  • From login screen to working GNOME desktop: 4s

This is pretty darn fast I tell you :-)

In general, application launching is also noticeably faster, especially for applications with lots of IO, like the Evolution mail reader or our development environment IntelliJ IDEA. Compiling the Summa project is also a heavily IO-bound process. The result:

  • Compiling Summa from scratch with cold disk caches: ~6 minutes with conventional drives, 2.5 minutes with our new SSDs. That’s a speedup of a factor ~2.4.

As you might have guessed by now – we like SSDs – a lot!

Searching in the dark

September 25, 2009

As part of our obligation to preserve our online cultural heritage, Statsbiblioteket and Det Kongelige Bibliotek in Denmark continuously harvest the Danish web (the *.dk-domains), digitize public Danish television, rip all Danish-produced music and generally just collect whatever we can get our hands on. The terabytes add up (120TB for the web pages so far, more for television, radio and so on) and the machines are happily harvesting, ripping and wolfing down the bytes into semi-safe storage (2 geographically and architecturally different setups, checksummed, re-checksummed etc.). All fine and dandy.

Except that access to most of the material is rather limited and that search is … well, pretty much non-existent.

Such things tend to change over time, preluded by meetings, committees, deals and whatnot. As technicians, we are normally not directly involved in all the politics surrounding this, but in order to get some concrete arguments, we were asked to try and index some of the harvested web material and do a search demo, where web material was presented together with our normal material (books, cds, articles et al).

The harvested web material is stored in ARC-files, so the obvious choice for a quick test was NutchWAX. Setup was easy, some 100 million documents were indexed (about 2% of the harvested web material) and searches were sub-second on a modest machine. A great success in terms of answering the “is it even feasible to do this?“-question.

The “but does it make sense to do integrated search for such different data sources as web and library books?“-question could not be answered by this, so naturally we had to hack something together with Summa, our precious hammer. Due to other highly-prioritized assignments, we only had about a week to get it to work, so corners were cut where possible. Using the ARC-reader from Heritrix and the Tika-toolkit for analyzing the wealth of different data, the aptly named Arctika was born. Arctika handled the web stuff and an aggregator handled the integration with our standard library index.

It could use a lot more work, but it worked surprisingly well for a quick hack. We were able to demonstrate everything we wanted: The integrated search made sense, the ranking generally pulled the good stuff to the top (admittedly, tweaking the ranking for different sources would surely be needed for a real application) and the faceting system clearly helped give an overview of material types & sources and provided an easy means to do temporal navigation in the search-result: Limiting searches to a specific period of time is quite usable for investigating the media handling of major events.

So what’s the dark part? Well, legislation. As always. That and money. Harvested web material is sensitive, legally accessible only to a select few professors. On top of that, showing snippets from harvested web pages seems – at the moment – to require compensating the content owners, according to EU-law. Opening up all the material at once will probably not happen in the foreseeable future.

Happily we don’t need to do everything at once. If we limit the publicly accessible index to websites from the government and companies, it should be legal to show the search-results and the stored versions (hello continuity). Add the recorded television and radio to the mix, pour in scanned newspapers, integrate with old-school books and presto, we have something great. Danish culture at our fingertips, past and present.

Dreaming, I know. But on the technical level, we just need the green light from the bigwigs to make this happen.

A screenshot, you say? Why, yes, of course. We present this super-cool bling bling interface with a stupendously large amount of interesting information to you. Slightly marred by the need to censor out some sensitive information and the fact that indexing time was capped at half a day to make the deadline.

Sample search in Arctika

Summa Moving to SourceForge

August 4, 2009

Yesterday I had the pleasure to announce on the mailing lists that Summa has reached the first milestone in migrating to SourceForge, and here follows the blog post :-)

From now on all Summa code is hosted and developed in the “summa” project on SourceForge. In addition, all bugs have been migrated from our old GForge solution to a Trac instance hosted via the cool new “hosted apps” functionality on SourceForge.

We will also move the mailing lists over in the near future. The fate of the Summa wiki is still unclear.

I must be frank and admit that I have long felt that SourceForge was at a bit of a standstill, applying only visual refreshes every now and then and never fixing the real issues with the site. However, the new Hosted Apps approach is simply sweet! There is a huge list of popular open source products you can choose to run on your site as hosted apps (see an incomplete list here). For instance, some may be surprised to know that popular version control systems such as Git, Mercurial, and Bazaar are supported as well as Subversion. Right now we run only a Trac issue tracker and a Subversion repository.

On a personal note I must still admit that my heart lies with the recently open sourced Launchpad, despite the recent kick-assiness from the SF team.

Efficient sorting and iteration on large databases

June 15, 2009

Before you read on, heed my words that this post might be a wee bit technical… If not extremely technical – caveat emptor.

The Problem

In our continuous quest for a blazingly fast Summa, we ran into a performance problem extracting and sorting huge result sets from our caching database. Concretely, we store ~9M rows in an H2 database. All records are annotated with a modification time (henceforth mtime), and we use this timestamp to determine whether we need to update the index. When updating the index we read records from the database, sorted by this mtime column.

This means that for the initial indexing we create a sorted result set of 9M records. The first observation is that we should definitely have an index on the mtime column. Even with that, many databases will take some time for such queries and it might lead to big memory allocations or temporary tables being set up. We don’t want any of that. We want lightweight transactions and speed!

Take One, LIMIT and OFFSET

The naive approach (at least, the first thing that I tried!) is to use the LIMIT and OFFSET SQL clauses to create small result sets of size 1000, and then do client-side paging, something à la:

  SELECT * FROM records ORDER BY mtime LIMIT 1000 OFFSET $last_offset

Here we increment last_offset by 1000 each time we request a new page. However, this solution performs extremely badly. The database server needs to discard the first last_offset records before it can return the next 1000 records to you, and when we are talking millions of records this can be quite an overhead. The database cannot apply any smart tricks to make this fast, because it has no a priori way to find out where the record at offset last_offset in the result set begins.
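To put a rough number on that overhead, here is a back-of-the-envelope sketch in Python. It assumes, as described above, that the database steps past every skipped row on each request; the function name is made up for this illustration and is not part of Summa.

```python
def rows_discarded(total_rows, page_size):
    """Total rows the database skips while paging through a table
    with OFFSET, assuming it walks past every skipped row."""
    discarded = 0
    for offset in range(0, total_rows, page_size):
        # Each request throws away `offset` rows before returning a page.
        discarded += offset
    return discarded


# Our case: ~9M records fetched in pages of 1000.
print(rows_discarded(9_000_000, 1000))
```

Under that assumption, reading 9 million records in pages of 1000 makes the database discard roughly 40 billion rows along the way, since the work grows quadratically with the table size.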

Take Two, Salted Timestamps

So what can we do? The thing that databases are fast at is looking stuff up in indexes. We need to make it use some indexes to calculate the pages… 

The idea is to use the index on the mtime column to calculate the offset: when we request a new page, we use the mtime of the last record in the previous result set. This may just work out because we sort everything by mtime. Maybe like:

  SELECT * FROM records WHERE mtime>$last_mtime ORDER BY mtime LIMIT 1000

Alas, this contains a subtle bug. Since we might insert more than one record per millisecond, the mtime of a record might not be unique. This means that we might skip some records in between pages or include some records in multiple pages.
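The skipped-record case is easy to reproduce on a toy table. The sketch below uses Python's built-in sqlite3 module rather than H2, and the table layout, ids and page size are invented for the demonstration; the secondary sort on id is only there to make the outcome deterministic.

```python
import sqlite3


def page_by_mtime(conn, page_size):
    """Page through records using only 'mtime > last_mtime' as the cursor."""
    seen = []
    last_mtime = -1
    while True:
        rows = conn.execute(
            "SELECT id, mtime FROM records WHERE mtime > ? "
            "ORDER BY mtime, id LIMIT ?",
            (last_mtime, page_size),
        ).fetchall()
        if not rows:
            return seen
        seen.extend(row[0] for row in rows)
        # Remember the mtime of the last record on this page.
        last_mtime = rows[-1][1]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id TEXT PRIMARY KEY, mtime INTEGER)")
conn.execute("CREATE INDEX records_mtime ON records (mtime)")
# Three records share the same millisecond.
conn.executemany(
    "INSERT INTO records VALUES (?, ?)",
    [("a", 100), ("b", 100), ("c", 100), ("d", 101)],
)

seen = page_by_mtime(conn, page_size=2)
print(seen)
```

With a page size of 2, the first page ends in the middle of the records stamped 100, so the next request for mtime > 100 jumps straight to "d" and record "c" is never returned.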

If we somehow force the mtimes to be unique, the above query would actually work. One solution is to always ensure that there is at least 1ms between each insertion, but this is way too slow for us, so we devised what we have dubbed Salted Timestamps.

Instead of using 32 bit INTEGERs for mtime we use 64 bit integers (a BIGINT on most SQL servers). We move the actual timestamp to the most significant 44 bits and then store a salt in the least significant 20 bits. The salt is basically just a counter that is reset each millisecond, meaning that we can add 1048576 records per millisecond before we run out of salts. With this construct we get a “timestamp” that still sorts correctly, and we can even create a UNIQUE index on the mtime column.
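As a minimal sketch of the bit layout just described (not the actual Summa code, which is written in Java), the following Python class packs a millisecond timestamp and a per-millisecond counter into one integer. The class and function names are hypothetical, the injectable clock exists only to make the example testable, and the sketch is not thread safe.

```python
import time

SALT_BITS = 20
SALT_MAX = (1 << SALT_BITS) - 1  # 1048575, i.e. 1048576 salts per ms


class SaltedTimestampGenerator:
    """Hypothetical generator of unique, sortable 'salted' timestamps."""

    def __init__(self, clock=None):
        # clock returns the current time in whole milliseconds.
        self._clock = clock or (lambda: int(time.time() * 1000))
        self._last_ms = -1
        self._salt = 0

    def next(self):
        now_ms = self._clock()
        if now_ms > self._last_ms:
            # New millisecond: reset the salt counter.
            self._last_ms = now_ms
            self._salt = 0
        else:
            # Same millisecond (or the clock stepped back): bump the salt.
            self._salt += 1
            if self._salt > SALT_MAX:
                raise OverflowError("ran out of salts for this millisecond")
        # Timestamp in the high bits, salt in the low 20 bits.
        return (self._last_ms << SALT_BITS) | self._salt


def split_salted(value):
    """Return (milliseconds, salt) for a salted timestamp."""
    return value >> SALT_BITS, value & SALT_MAX


gen = SaltedTimestampGenerator(clock=lambda: 1000)  # frozen clock for the demo
first, second = gen.next(), gen.next()
print(split_salted(first), split_salted(second))
```

Because the salt only occupies the low bits, two values from the same millisecond differ but still sort after every value from an earlier millisecond, which is what lets the mtime > $last_mtime query page without gaps or repeats.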

Conclusion

We have adopted the approach with salted timestamps as described above for Summa and so far it has proven to perform quite well (avg. ~2000 records/s). An added bonus is that we only put very light load on the db, because the transactions are small and fast. You can find an implementation of this scheme in the DatabaseStorage* class and the timestamp handling in the UniqueTimestampGenerator* class in the Storage module in the Summa sources.

*) Most sorry that these links require a login (which is freely available, but anyway) – we are working on a solution with anonymous access. More on that later.

Rising Summa

April 21, 2009

The people higher up in the food-chain have decided to provide a Summa based search backend to public libraries in Denmark. For an annual fee, Statsbiblioteket handles the flow of data from raw dumps to webservices and keeps the servers running. Maintenance, money and sales are fairly boring, seen from our developer perspective, but tweaking Summa to allow for easy experimentation and setup has been very rewarding.

As usual Mikkel wove his magic and created a package (read: A collection of scripts and all the JARs from the Summa project) that makes it very simple to set up a local Summa for experimentation. The working title was Summix, but we all know how it goes with working titles.

With some late night fiddling, the complexity was reduced to “Unzip and run a script”, which gets a Summa demo running with a skeleton web front end. Added bonus? It runs under Windows as well as Linux (and probably OS X too, but we haven’t checked). We will write a tutorial on the wiki Real Soon Now.

Getting there...

Asking for Trouble? You’ve Come to the Right Place!

April 2, 2009

Toke was asking for trouble yesterday. I would assume that he knew me better by now… With my last commit it is now actually possible to inline Javascript inside your configuration when using a ScriptFilter.

The following now actually works:

<xproperties>
  <entry>
    <key>filter.name</key>
    <value class="string">InlineJavascriptTest</value>
  </entry>
  <entry>
    <key>summa.filter.sequence.filterclass</key>
    <value class="string">dk.statsbiblioteket.summa.common.filter.object.ScriptFilter</value>
  </entry>
  <entry>
    <key>filter.script.inline</key>
    <value class="string"><![CDATA[
                    payload.getRecord().setId('inlineJavascript');
    ]]></value>
  </entry>
</xproperties>

I cannot begin to enumerate all the dangers in doing this, but somehow the thrill of the possibilities got the better of me. If Javascript isn’t your thing you can set the script language to anything supported by your Java runtime by defining the property filter.script.lang.

So – use this at your own peril!

UPDATE: You can find a list of available ScriptEngines for Java at scripting.dev.java.net.

Javascript Filters in Summa

April 1, 2009

I just completed the draft implementation of Javascript filters for Summa and I am posting here to hear some comments. If nobody complains, the existing implementation is likely to stay unchanged. Really, the implementation supports any old scripting language supported by the ScriptEngineManager of the JVM, but in practice Javascript will probably be the most important one.

The scripting environment will include two “magic” variables: payload and allowPayload. Unsurprisingly the payload variable contains a reference to the Payload object being processed. The allowPayload variable is a boolean value that defaults to true. If allowPayload is set to false the payload will be dropped from the processing pipeline.

Update: The script filters now have a third magic variable called log sporting the methods log.trace|debug|info|warn|error|fatal(string).

The best way to explain this is probably with an example. To write a Javascript filter for Summa create a file called myFilter.js with the following content:

var record = payload.getRecord();

if (!record.getId().startsWith(record.getBase())) {
    record.setId(record.getBase() + "_" + record.getId());
}

if (record.getId().endsWith('taboo')) {
    allowPayload = false;
}

This script will make sure that all records have their ids prefixed with their base name, and will filter out any records whose id ends with “taboo”.
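To see the two rules on a few concrete ids, here is a small stand-alone model of the same logic in Python. It does not touch Summa or the ScriptFilter; the function name and the sample ids and base are invented for illustration.

```python
def apply_filter(record_id, base):
    """Return (new_id, allow_payload) following the same two rules
    as the Javascript filter above."""
    if not record_id.startswith(base):
        # Rule 1: prefix the id with the base name if it is missing.
        record_id = base + "_" + record_id
    # Rule 2: drop records whose (possibly rewritten) id ends with 'taboo'.
    allow_payload = not record_id.endswith("taboo")
    return record_id, allow_payload


for record_id, base in [("123", "sb"), ("sb_456", "sb"), ("my_taboo", "sb")]:
    print(record_id, base, "->", apply_filter(record_id, base))
```

Note that the taboo check runs after the prefixing, so it is the rewritten id that decides whether the payload stays in the pipeline.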

To plug the script into your filter pipeline you need to stick something like the following in your filter chain configuration:

<xproperties>
    <entry>
        <key>filter.name</key>
        <value class="string">FixRecordIdsandDropTaboos</value>
    </entry>
    <entry>
        <key>summa.filter.sequence.filterclass</key>
        <value class="string">dk.statsbiblioteket.summa.common.filter.object.ScriptFilter</value>
    </entry>
    <entry>
        <key>filter.script.url</key>
        <value class="string">http://example.com/filters/myFilter.js</value>
    </entry>
</xproperties>

I am using the ScriptEngine framework which appeared in Java 6 for all of this, and all in all the development experience has been quite nice. Writing this blog post took almost as long as it took me to write that filter :-)

UPDATE: You can find a list of all available scripting engines at scripting.dev.java.net.

Summa 1.3.0 Live and Unleashed

March 28, 2009

As Toke hinted earlier we released Summa 1.3.0 yesterday. It is no secret that the 1.3.0 release cycle was longer than we planned for, but in the end I think the gains have been worth it. It took a lot of hard work (and lots of patience from our families), but we got there.

A lot of effort in the 1.3.0 cycle revolved around three main things:

  • Performance: We really wanted to be able to provide quick turnaround times from top to bottom of the Summa stack. There’s a lot of activity around Summa these days and we never know where Summa might end up being deployed. Time has taught us that being able to quickly rebuild the indexes and record caches is a huge boon when venturing into unknown territory.
  • Details: Make the small details work and do as expected. This means sanitizing the log output to be as useful and concise as possible, but also making sure that it is easy to track any records that are dropped from the indexing chain for whatever reason. Also stuff like being able to update the modification time of one single record and having that trigger a corresponding update of the index.
  • Bug fixing: Hunt down those critters and write tonnes of unit tests to keep that wooden stake through their hearts so that they don’t rise from the dead.

The reasons for these points of focus are many, but two things are worth mentioning. The first thing is that we are aiming to use the Summa 1.3 series for production here at the State and University Library of Denmark, which means that we can finally replace the aging “pilot Summa” we are running in production these days (Summa started out as a closed source foray into the realm of relevancy ranked integrated search – and an old version of this project is the basis for the current search engine behind statsbiblioteket.dk). Secondly we felt the need to have a solid base to work on to try out all the cool stuff we have been talking about since we started our journey into the land of searching. This is also why Summa 1.3.0 doesn’t really have any new big features, mostly polish and optimizations of the existing code base.

All in all I am pretty proud of this 1.3.0 release and I am really looking forward to deploying it in a real world scenario. Also, with this milestone set I am confident that we are going to have a blast hacking on the mile long list of cool ideas we have floating around. I should post about some of these ideas soonish, but not today; today I am going to slack off and play with the kids.

Speed thrills

March 27, 2009

As part of Summa 1.3 planning, we set up some performance-goals: Indexing speed should be 500+ records/sec, storage extraction speed should be twice that and so on. Paired with the magnificent VisualVM and a penchant for bit-fiddling, we’ve often found it hard to go home from work: The hunt for the next speed-increase has just been too exciting. Yes, we’re lost cases.

Experimental patching of the H2 database, introduction of an optimized replace-framework and severe reductions in object allocation have been part of the last few weeks. Boringly, our main time was spent on new features and bug hunting, which our fearless leaders strangely enough think take precedence, but oh how those glorious moments of “We got a factor 10 speed increase in Analyzer Foo” make up for all that.

XSLTs for inspiration

March 11, 2009

We have decided to publish the XSLTs used for the Search website at the State and University Library, Århus, Denmark.

The files are provided “as-is” under the MIT License. They are meant to be used as a source of inspiration, as they do contain a few minor dependencies on internal code – for instance used to translate labels to human readable strings in different languages.

In other words, this is a snapshot of the XSLTs that are all free to use, but we offer no guarantee that they are actually useful. Likewise we cannot be held accountable for errors or faults contained within these files.

Go to the bottom of the Sandbox page at the Wiki for the download.

Update 2009-03-17: Added some more XSLTs to the Sandbox page. These are XSLTs developed in a project for public libraries using Summa.