Archive for the ‘Summa’ Category

Large facet corpus, small result set

January 23, 2013


Each time a user issues a search in our primary corpus, we perform faceting on 19 different fields. Some of those fields have a low amount of unique values (year, language, classification), some of them are quite heavy (author, title, semi-free keywords). We have a grand total of 38 million unique tags and 160 million references from 11 documents to the tags displayed as part of the faceting.

The way our faceting works is simple in principle: When a search is performed, an array of counters (a basic int[]) is allocated. The array contains a counter for each unique tag (38 million in our case). All the internal IDs for the documents from the full search result are extracted and iterated. For each document ID, the reference structure provides the tag IDs, which corresponds exactly to entries in the counter array. For each tag ID, the counter for said tag is incremented. When all document IDs has been iterated, the counters are iterated and the ones with the highest counts are extracted.

If you followed that, congratulations. So, performance-wise, what is wrong with that approach? Yeah, the title was kind of a giveaway. Even if the search results in just a single hit, with just a single tag, we still need to iterate the full 38 million counters in order to extract the facet result. We need to clear it too, before the next run, so we’re really talking 2 complete iterations, one of them involving sorting logic. Ouch. Or to be more precise: About 300ms of ouch.

So what do we do? Well, if we know that our result set is small, we could use a simple HashMap to hold our counters; with tag-IDs as keys and the counters themselves as values. We tried that some years ago. It did sorta-kinda work, but that approach had significant drawbacks:

  • HashMaps are memory pigs and they tax the garbage collector. We do not want to insert hundreds of thousands of objects into them, when they are just used for highly temporary counting.
  • We need to guess the size of the result from the start. Not just the number of hits in the search result, but the number of tags that they refer to collectively so as not to get unvieldy HashMaps. If we guess wrong, we need to start over or copy the values from the map into our full counting structure.

We abandoned our HashMap based sparse counter approach as our experiments showed that the dumb “just iterate everything all the time” performed better for most of our searches.

New idea

Summing up the requirements for a faceting system where tag extraction performance is dependent on the number of found tags, rather than using a fixed amount of time:

  • It should work without knowing the number of tags in advance.
  • It should not tax the heap nor the garbage collector excessively
  • Extraction time should be linear (we accept O(n*log2(n))
  • for sorted extraction) to the number of marked tags

Mikkel Kamstrup Erlandsen kindly pointed me to the article Using Uninitialized Memory for Fun and Profit. With a little tweaking, a simplified version should satisfy all the requirements. We will build a counting structure that can switch seamlessly from sparse to non-sparse representation.

For this, we introduce yet another array: The tag count tracker. It holds pointers into the tag count array from before. Its length is the cutoff for when to use sparse counting and when to use full counting and must be determined by experiments.

When the count for a tag for a document needs to be incremented, we start by loading the old count from our tag count array (we need to do this anyway in order to add 1 to the count). If the count is 0, the position of the counter is added to the tag count tracker. If this overflows the tag count tracker, we switch to standard counting and completely ignore the tag count tracker hereafter. Either way, the value from the tag count array is incremented and stored back into the array as we would normally do.

When all the tags for all the documents has been iterated, the tag count tracker (if not overflowed) contains a complete list of all the tags that has a count of 1 or more. The tag extracter needs only iterate those and, just as important, needs only clear those. If the tag count tracker was overflowed, the old iterate-everything approach is used. As for clearing the tag count tracker for next use, it is simply a matter of setting its logical length (a single int) to 0. Presto!

Now it just needs to be implemented.

Fire fire everywhere

September 21, 2011

Searching at Statsbiblioteket has been slow the last couple of days and the condition has grown progressively worse. Yesterday evening and this morning, using the system required the patience of a parent. Consequently the full tech stack congregated at the office (maintenance BOFH, backend bit-fiddler, web service hacker and service glue guy) hell-bent on killing bugs.

A hastily cobbled together log propagation thingamabob claimed that the backend answered snappy enough, network inspection showed a very high amount of requests to storage (a database that contains the records backing the search) and timeouts & session pool overflows were all over the place. The DidYouMean service was appointed scapegoat and killed with extreme prejudice.

Things got exponentially worse! Uptime was about 2 minutes, with last minute performance quickly falling off to unusable. Phones started ringing, emails ticked in and an angry mob with pitchforks lay siege to the office. Inspection revealed that killing DidYouMean meant that the service layer unsuccessfully tried to get the result for 20 seconds (yes, that timeout was far too high) before giving up, quickly filling Apache session pools. DidYouMean was resurrected, services started up again and all was well. Or at least back to where it was before the wrong service was unfairly executed.

Mads coding

Mads stands up and codes. A sure sign of high alert

Waiting for the next hammer to drop, code was reviewed (again), pool sizes were tweaked and logs were watched intensely. At 12:09:47 and 952 milliseconds, the impact riveter started again and storage staggered. But lo and behold: The maintenance guy had changed log levels to DEBUG (for a limited amount of time). An hour and 20,000 observed requests for the exact same ID later, the magical incantation i++; was inserted in a while loop. Testing, deployment, re-deployment, tomcat restart and another tomcat restart followed quickly.

It turned out that certain rare records triggered the endless loop. The progressively worse performance stemmed from more and more of these edge cases piling up, each looping forever. As the overwhelmed storage was on the same server as the searcher, the shared request pool was flooded with storage requests, only occasionally allowing search requests.

With the roaring fire quelled, business returned to normal. By pure coincidence, the assignment for the next days is vastly improved propagation, logging and visualisation of timing information throughout the services.

The right tool – neo4j?

May 6, 2011


We use a backing storage for documents in our home brewed search system Summa. It was supposed to be a trivial key-value store with document-IDs resolving to documents. Then an evil librarian pointed out that books are related to each other, so we had to add some sort of relational mapping. For some years we have used relational databases for this, going from PostgreSQL through Derby and landing on H2, which has served us fairly well. Documents with relations were a bit slow to resolve but there were less than 5% of those in the full corpus, so for all practical purposes they presented no problem.

Fast forward to a week ago, where we added a new target to our integrated search. One million fresh documents or a 10% increase in total document count. Unfortunately most of these documents were related to each other, increasing our total relation count 2000%. Our full indexing time soared from “It’ll be done before noon” to “I hope it finishes before tomorrow”. Ouch!

A long discussion about database design and indexes followed. The conclusion was that we really did not use H2 for what it was good at and that maybe we should look at a graph-oriented database.

Implementing storage with neo4j

Neo4j is an open source graph database. It is written in Java and can be used as an embedded application. This is important for us as we like our packages to be self contained. Friday is work-on-whatever-project-you-think-can-benefit-Statsbiblioteket-day, so I dedicated it to trying out neo. Keep in mind that I only heard about Neo this morning and that I hadn’t looked at a single page about the product before I started the project.

  • 10:36 Created test project
  • 10:40 Selected Neo4j version 1.3 stable Enterprise, added to Maven POM, downloaded files
  • 10:50 Created skeleton class for Summa NeoStorage
  • 10:55 Added basic properties, created skeleton Unit test
  • 11:15 Added code for flushing Summa Record to Neo
  • 11:36 Added code for retrieving a previously flushed record + unit test. Hello world completed
  • 11:40 Break finished, started on ModificationTime retrieval
  • 12:35 Proof of concept retrieval using modification time (slow development due to human error)
  • 13:20 Finished bulk ingest and record-by-record extraction using modification time order, including unit test
  • 13:50 Finished mapping of Summa Record child-relations to Neo, both for ingest and extraction

Including mistakes, distractions from colleagues and reading documentation, it took under 4 hours to integrate Neo as the new backend for our storage. There’s still a lot of minor things to add and special cases to cover, but the result is complete enough to be used for most workflows in Summa.

Implementation was a breeze, the API was very clean and the examples and guides at the website were to the point and well thought out. Kudos to the developers for great work!

Performance peek

Since the Neo Storage isn’t finished, it would be unfair to compare it to H2 with regard to ingest. However, the extraction part is complete enough to test.

With the old H2-backed storage, our extraction time fell to below 5 records/sec for the new documents with 2-4 relations each (extraction time for records without relations is 2-3000 records/sec).

Creating a test-storage using Neo with 100000 documents, each with 5 children, changed the extraction speed to 2500 expanded records/second or 15,000 raw records/second if we count the children.

Granted, the only fair test is with production data, but so far Neo4j looks like a clear winner for our purposes!

The road ahead

November 26, 2010

When we’re Out There at conferences or visiting other libraries, we’re sometimes asked why we bother “developing our own search engine”. It is a really good question. One that we ask ourselves from time to time, to make sure we’re on the right track.

One of the reasons why we’ve been rolling our own instead of taking the obvious Solr-road is historical. Solr didn’t meet our needs when we started. It still doesn’t in its current form, but it’s getting close. Another reason is that the scope of Summa is different from Solr’s: We’ve invested a lot of energy in handling data-flow with transformations in a multi-stage system with a persistent storage service that handles parent-child relations and keeps track of updates.

The interesting thing here is that Summa and Solr are not necessarily in competition. Solr has a lot of great analyzers and has gained a lot of momentum during the last years. We would therefore like to use Solr at the core of Summa instead of the custom Lucene searcher we have currently. At Statsbiblioteket we have done a fair amount of work with low-memory index-lookup, sorting and hierarchical faceting which we’d like to take with us, so this calls for participation in Solr development.

Relevance Correlation between systems

Statsbiblioteket has recently begun collaboration with the Summon team from Serial Solutions. Due to Danish legislation we are not allowed to let Summon index our own data, so the upcoming search setup needs to merge results from Summon with our own. As both systems sort the results by relevance ranking, proper merging is quite hard. This is due to the fact that relevance scores are not comparable between different systems. One solution is to perform a local comparison of the returned fields in the results, but as this completely ignores the whole term statistics based ranking of Lucene/Solr, we would like something better.

Up until now we have been spoiled by using relevance ranked integrated search across different sources where we controlled the full indexing process. SOLR-1632 seems to be the right way to provide similar functionality for Solr, but it is not mature yet and is quite a large thing to put into production. Better to walk before running. As part of our collaborative agreement Serial Solutions has agreed to deliver statistics about their metadata to us on an experimental basis. With this we can hopefully do better merging and issue boosted queries to get the hits that are relevant to us, thus approximating the results we would have gotten if all metadata had been in a single index.

New Blood

March 2, 2010

Mikkel – one of the core Summa developers – has decided to move on. He will be working for Canonical which has been his dream job for years, so we can only wish him the best of luck.

Mikkel has been an integral part of the Summa rewrite that has taken place, and has also had the very important role of being our release manager and all around “glue guy”. He will be greatly missed.

However we have managed to grab Henrik – a developer from another section – to come work with us and fill the missing spot. He has spent the last month (along with Toke) making sure that any and all important information has been properly extracted from Mikkel’s brain. Henrik has proven to be a fast learner and he has already become deeply involved with our release management, production environment, and the Summa code base in general.

So a big welcome to Henrik. He will be a great addition to the team.

Experimental GUI for code4lib

February 25, 2010
Highly experimental GUI

Highly experimental GUI.

For this year’s code4lib Jørn and I had prepared an demo called Kill the Search Button.

The demo has an experimental GUI and boasts some of the ideas and concepts we’ve developed over the last few years when working with our search engine Summa/Search.


Basically, the GUI rethinks the traditional way of presenting search results. The idea is to lay out the search result as a map or a set of tiled pages that the user can navigate, rather than a chunk of separate pages with only one page visible at a time.

As a consequence, the search result is zoomable and scrollable allowing the user to smoothly move around the search result and to zoom between an overview layout with several columns of pages to a very detailed close-up view.
Zooming is done using a slider. Page transitions are done using scroll animations in order to underline the feeling of moving about the search result.

Pages are loaded in chunks of 10 and new chunks are added on the fly when needed, creating a growing, visible search result.

The search field and navigation box float on top of the search result. They both loose focus when not in use in order to let the search result below become more visible. They can both be dragged around.

Index lookup

Index lookup. For librarians.

Features without a search button

Besides the GUI we added a number of features to the demo:

  • Live search: The search result is updated on every key press
  • Did you mean. Good for typos and spell checking
  • Suggest. Inspiration from other user’s queries
  • Index lookup. Mostly for librarians
  • About-by. Boosting particular fields for different results

Usability-wise, the demo illustrates how a smoother search experience can be achieved simply by updating the search result on key press. By doing this, the wait for page reload is gone and feedback is instant, allowing the user to evaluate the search results much more rapidly.

The about-by slider feature lets the user slide between materials about a person and materials written by a person.

Try it out

Sounds confusing? Luckily we have a web page! It contains a number of screen casts with Jørn going through the features (go to:, but you can also give it a span your self: (PLEASE NOTE: The demo currently only runs in Firefox 3.5+).

Damn you, snow

By the way, we never did get to go to code4lib to do the presentation due a blizzard and some ticket-mess-up.

Here’s the blizzard:


And here’s a shot of how great we look trough our remote c0de4lib video intro:

Looking great video-wise. Our remote code4lib intro.

Juglr, Higgla, and DidYouMean

February 10, 2010

A productive day today!

Summa DidYouMean

Henrik and I sat down and pushed a preliminary did-you-mean service to Summa trunk. It’s to be considered pre-alpha quality at this stage, but it will be mature for Summa 1.5.2. It’s based on Karl Wettin’s old lucene-didyoumean contrib that lingered in the Apache bugtracker for years (yes, and I mean years literally). You can find the updated code on there are branches in the Git repo for Lucene 3.0 (master), Lucene 2.9.* and Lucene-2.4.* – but be aware that this code has not been production tested yet.

Juglr 0.2.1

My pet peeve project, the actor model and messaging library for Java 6+, Juglr,  has hit 0.2.1. I’ve now done some more large scale testing with and it seems to work pretty well.

Introducing Higgla

The large scale testing of Juglr I just referred to is actually a new project of mine… Man – I tend to spew out a few too many projects these days🙂. The new project on the stack is Higgla – with tag line: “a lightweight, scalable, indexed, JSON document storage”. If you are wondering where the name came from I can inform you that Higgla is Jamaican for “higgler” which a quick Googling defines as “A person who trades in dairy, poultry, and small game animals; A person who haggles or negotiates for lower prices. The point is that Higgla is about dealing with any old sort of data and doing it in a very non-formal way.

Higgla is very much in the spirit  of CouchDB and Elastic Search – a schema free database, and not just schema free, but completely schema free. There is no structure implied as CouchDB’s Views does. Indexing is done on a document level, and each document need not have the same searchable fields as others. Heck each document revision does not need to be indexed the same way as the previous revision!

As I hinted, Higgla is based on Juglr. Higgla illustrates pretty well the power of a combined actor+http messaging stack like Juglr – if you browse the source code you will see that there really is not a lot of it!

In there core Higgla leverages the always awesome Lucene. I had to think quite hard to make the storage engine transaction safe in a massively parallel setup because Lucene doesn’t as such support parallel transactions (but it does support sequential transactions quite well). I figured it out eventually though.

Even though this is just a 0.0.1 Higgla already ships with Python- and Java client libraries (even though talking straight HTTP+JSON shouldn’t be that hard in most frameworks, it’s still nice with a simple convenience lib). An example with the Python client looks like:

import higgla
import json

# Connect to the server
session = higgla.Session("localhost", 4567, "my_books")

# Prepare a box for storage, with id 'book_1',
# revision 0 (since this is a new box), and indexing
# the fields 'title' and 'author'
box = session.prepare_box("book_1", 0, "title", "author")

# Add some data to the box
box["title"] = "Dive Into Python"
box["author"] = "Mark Pilgrim"
box["stuff"] = [27, 68, 2, 3, 4]

# Store it on the server[box])

# Now find the box again
query = session.prepare_query(author="mark")
results = session.send_query(query)
print json.dumps(results, indent=2)
print "TADAAA!"

That completes the Puthon example. The Java API is almost identical so I wont cover it, although I can’t do the same fancy varargs stuff that Python provides🙂

What did I mean?

January 28, 2010

It has been a long standing wish to get a good did-you-mean service shipped with Summa. And by “did-you-mean service” I mean the little helpful tip that shows up underneath the text entry when you mistype something when doing a search. Note that I say “mistype” and not “misspell”, because a good did-you-mean service is a lot more complex than a spell checker.

Consider when I read aloud my wish list to my mom over the phone and I try to explain to her that I badly want “Heroes of Might and Magic”  for Christmas. This phrase being completely meaningless to her she types in the search field:

heroes of light and magic

Notice that this is indeed a correctly spelled phrase, but nonetheless not what she/I wanted. A good search engine would ask Did you mean: “heroes of might and magic”?.

On the other hand if a search engine runs on a database of bad monochrome underworld games and “Heroes of Might and Magic” wasn’t there, but instead the index contained a game called “Heroes of Fight and Magic” the search engine should of course suggest Did you mean: “heroes of fight and magic”? in stead.

So we’ve identified two things we want that a normal spellchecker doesn’t provide:

  • Consider each word in a query in the context of the whole phrase it appears in
  • Only suggest stuff that is actually in the index

The Code

After surveying what was available on the open source market we realized that none of the solutions out there did what we wanted. I was pointed at Karl Wettin’s work on LUCENE-626. Although Karl’s work is great, it’s not compatible with the new API in Lucene >= 3.0 and it has a hardwired dependency on Berkely DB that we could not accept. So I branched his work in order to bring it into 2010 and I am proud to say that I’ve now reached an almost-works state. You can find the code on GitHub:

The new thing about this is also that we are now engaging in upstream Lucene work, rather than staying in our own Summa backyard. Quite exciting, and a very rewarding experience for a software developer. Toke has some more news in this regard as well – he’s doing some upstream stuff that has far bigger implications than my odd-job did-you-mean-hacking. But I’ll leave you hanging there and let Toke talk about this himself.

Solid Toys for the Boys

December 8, 2009

As some may know we have experimented quite a bit with Lucene indexes on Solid State Drives and we’ve had very good experiences with it. Seeing huge performance gains. Since we are also routinely running big applications and other heavy duty tasks on our desktop machines our dear Toke had the idea that we should all have SSDs in our desktops. After a good deal of shopping about he settled on the Kingston v 40GB drive as research revealed that this exact model had the good Intel metal inside (this is fx. not the case for the 64GB model).

Yesterday we got the delivery and immediately start unpacking and upgrading our machines. And boy where these babies worth every penny!🙂

(sorry for the ugly scaling of the following images – WordPress is killing me)

Toke was the Super delivery boy

Quick - get them before they are gone!

Yours truly is a Super happy camper

Super tag team getting their hands dirty

Firstly we did clean installations of Ubuntu. With a 10GB root partition and a ~26GB /home partition and ~4GB swap. Root and /home formatted with Ext4. All on the SSD. The time?

  • Installing Ubuntu Karmic 64 bit from USB stick: 4 minutes (with ~1 minute waiting for network on a slow repository)

The next thing was the boot… While we where rebooting from the install-session we talked about how fast the boot was going to be. But in the talking we almost didn’t react before the reboot was back up to the login screen. Wow. As we didn’t have a timer with sub-minute resolution at hand we can only give you subjective numbers. Among the spectators the opinions range from “negative time” to “5s” to “10s”. My personal estimates are:

  • Boot from GRUB to GDM login screen: 5s
  • From login screen to working GNOME desktop: 4s

This is pretty darn fast I tell you🙂

In general application launching is also noticably faster. Especially so for applications with lots of IO, likethe  Evolution mail reader or our development environment IntelliJ Idea. Compiling the Summa project is also a heavily IO bound process. The result:

  • Compiling Summa from scratch with cold disk caches: With conventioanl drives ~6 minutes. With our new SSDs 2.5 minutes. That’s a speedup of a factor ~2.5.

As you might have guessed by now – we like SSDs – a lot!

Searching in the dark

September 25, 2009

As part of our obligation to preserve our online cultural heritage, Statsbiblioteket and Det Kongelige Bibliotek in Denmark continuously harvest the danish web (the *.dk-domains), digitize public danish television, rip all danish-produced music and generally just collect whatever we can get our hands on. The terabytes add up (120TB for the web pages so far, more for television, radio and so on) and the machines are happily harvesting, ripping and wolfing down the bytes into semi-safe storage (2 geographically and architecturally different setups, checksummed, re-checksummed etc.). All fine and dandy.

Except that access to most of the material is rather limited and that search is … well, pretty much non-existing.

Such things tend to change over time, preluded by meetings, committees, deals and whatnot. As technicians, we are normally not directly involved in all the politics surrounding this, but in order to get some concrete arguments, we were asked to try and index some of the harvested web material and do a search demo, where web material was presented together with our normal material (books, cds, articles et al).

The harvested web material is stored in ARC-files, so the obvious choice for a quick test was NutchWAX. Setup was easy, some 100 million documents was indexed (about 2% of the harvested web material) and searches were sub-second on a modest machine. A great success in terms of answering the “is it even feasible to do this?“-question.

The “but does it makes sense to do integrated search for such different data sources as web and library books?“-question could not be answered by this, so naturally we had to hack something together with Summa, our precious hammer. Due to other highly-prioritized assignments, we only had about a week to get it to work, so corners were cut where possible. Using the ARC-reader from Heritrix and the Tika-toolkit for analyzing the wealth of different data, the aptly named Arctika was born. Arctika handled the web stuff and an aggregator handled the integration with our standard library index.

It could use a lot more work, but it worked surprisingly well for a quick hack. We were able to demonstrate everything we wanted: The integrated search made sense, the ranking generally pulled the good stuff to the top (admittedly, tweaking the ranking for different sources would surely be needed for a real application) and the faceting system clearly helped give an overview of material types & sources and provided an easy means to do temporal navigation in the search-result: Limiting searches to a specific period of time is quite usable for investigating the media handling of major events.

So what’s the dark part? Well, legislation. As always. That and money. Harvested web material is sensible, only legally accessible for the selected few professors. On top of that, showing snippets from harvested web pages seems – at the moment – to require compensating the content owners, according to EU-law. Opening up for all the material at once will probably not happen in the foreseeable future.

Happily we don’t need to do everything at once. If we limit the public accessible index to websites from the government and companies, it should be legal to show the search-results and the stored versions (hello continuity). Add the recorded television and radio to the mix, pour in scanned newspapers, integrate with old-school books and presto, we have something great. Danish culture at our fingertips, past and present.

Dreaming, I know. But on the technical level, we just need the green light from the bigwigs to make this happen.

A screenshot, you say? Why, yes, of course. We present this super-cool bling bling interface with a stupendously large amount of interesting information to you. Slightly marred by the need to sensor out some sensible information and the fact that indexing time was capped at half a day to make the deadline.

Sample search in Arctika

Sample search in Arctika


Get every new post delivered to your Inbox.