Memory is overrated

June 6, 2013 by

A recurring question on the Solr mailing list is “how do I speed up my searches” and a recurring answer is “equip the machine with enough RAM to have your whole index cached in memory”. That answer is also given when the size index in question is 200GB.

So what is wrong with that?

Technically nothing: Lucene & Solr search eats IOPS like candy and having the full index in disk cache is close to an optimal solution (optimal would be a full in-memory index without the file access indirection, but I digress). There is the matter of getting updates into disk cache, which does involve some trickiness if the index if updated in a master-slave setup and copied. But that can be solved with even more RAM, so I guess that falls under the same “buy more RAM”-logic. What cannot be solved is the long warmup time if the server is rebooted or the disk cache is otherwise cleared, but that is a rare occurrence.

Economically, copious amounts of RAM does not make sense. Yes, you guessed it, this is about Solid State Drives.

  • Their price is 1/10 of RAM (or 1/5 if you want RAID 1)
  • They suffer a lot less from the cleared disk cache problem
  • They can be easily RAIDed for TB-scale
  • They even draw less power than the same amount of RAM

Of course, it all boils down to how fast SSDs are, compared to the humongous disk cache solution. We experimented with this 5 years ago but hardware and software has improved since then, so it was time for new measurements.

Setup reasoning and methodology

5 years ago, our measurements were very close to the Lucene 2.x searcher itself. The search results were extracted, but there were no web services or similar transport overhead. We chose to do so as this gave us very clean data for comparison.

This time our tests uses the standard Solr 4 web service, with the server and the test-client being on different machines. While the non-trivial transport overhead gives a less clean comparison, it has the distinct advantage of providing real world numbers and thus useful for informing readers of what they can expect from a similar setup. The tests were run a multiple of times with the best results being used for all the charts. A ZIP with the full result set as well as the test scripts is available upon request, should anyone be interested.

Like last time, the test corpus is our local index at the State and University Library, Denmark.

  • It has 11M documents at 49GB
  • Queries are edismaxed over 30 fields
  • The result set contains 5-10 fields per document for a maximum of 20 documents and is about 30KB of XML
  • Faceted queries involves two fields: One with 10M unique values (15M instances) and one with 626 (1.5M instances)
  • The test queries are logged user queries, which are being issued by multiple threads using JMeter
  • Between each test, Solr is shut down, the disk cache cleared and Solr started again
  • The first query is not measured
  • MMapDirectory is used.

Hardware

The test machine is an 2*8 core “Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz” with 64GB of RAM, with the amount of RAM being software adjustable. Storage is 3 * 7200 RPM drives in RAID-5 and 3 * 200GB Dell MZ-5EA2000-0D3 SSDs in RAID-0.

Addendum 2013-06-07: The SSDs does not have TRIM enabled and has been used for 2½ years, during which a lot of indexes has been created, along with a 10M+ file test and some 40GB database tests. They should be fairly fragmented by now.

Test results

The most eye-opening graphs are for 8GB RAM + SSD vs. full memory cached index. Keep in mind that 2 of the 8GB are used for the Java heap itself and some memory are used for general bookkeeping, leaving a little less than 5GB (or 10% of the index size) for caching.

SSD @ 8GB RAM vs. fully cached

SSD with 8GB of RAM vs. fully cached index, non-faceting searches

SSD with 8GB of RAM vs. fully cached index, faceting searches

SSD with 8GB of RAM vs. fully cached index, faceting searches

(Please ignore the strange U at the end of the first graph; it is a long story that warrants another blog post)

As can be seen, the “8GB RAM + SSD”-solution is very close to having the index fully cached in memory for our setup. Your mileage may vary, but this is consistent with our general observation at Statsbiblioteket: Our main search servers each has 3 active search installations with a shared index size of 110MB+ on SSD. They are equipped with 16GB of RAM each and has ~7GB free for disk caching.

Further supporting the case for SSD, the 95 and 99 percentiles (calculated using a sliding window over the last 1000 searches) for the searches are nearly always below 1 second: The users are getting consistently snappy results. Again, please ignore the part of the graph after 200 seconds.

SSD with 8GB of RAM

SSD with 8GB of RAM, non-faceting searches

The SSD-solution is not independent of the amount of RAM available for cache though. We tested with 4GB of RAM, which leaves just 1GB (2% of index size) for caching. As can be seen below, performance is 1/4th of the 8GB of RAM + SSD setup.

SSD with 4GB of RAM

SSD with 4GB of RAM, non-faceting searches

Just for kicks, here’s a chart showing the performance using spinning disks on an unwarmed index.

Spinning drives with 32GB of RAM

Spinning drives with 32GB of RAM, non-faceting searches

The graph does illuminate a big problem with spinning drives: If the searcher is warmed using queries, it takes a very long time to reach peak performance. Copying the full index to /dev/null is faster and the result is maximum performance, but that trick is only effective if the whole index fits in the disk cache.

Conclusion

Using SSDs as storage for search delivers near maximum performance at a fraction of the cost of an equivalent RAM solution. As always, do test before buying.

You are faceting it wrong

April 16, 2013 by

As hinted in the previous post “Over 9000 facet fields“, faceting on many fields is tricky business. Read on if you are using Solr with faceting and use an inordinate amount of memory to do so.

Clarification

This post skips a lot of the nitty-gritty details and tries to present its case with a lot of examples. It is aimed for persons who has a bit of experience with running a Solr server.

There are different ways of doing basic faceting with stock Solr: enum (few unique values), field cache (many references, many values) and DocValues (new in Solr 4, not explored here). In this post we focus on index wide field cache based faceting of string values on multi-value fields. Don’t worry about all the adjectives, this is what we normally use when we do a faceted search.

On the first facet call on a given field in Solr, an in-memory structure is generated. Not very intuitively, it is fairly cheap to facet on a lot of unique values: Faceting on a field with 10M unique values on an index of 1M documents only uses about 2-300MB, depending on the amount of concurrent requests. Just as non-intuitive, each document adds to the size of the facet structure, even when it holds no values for the facet field.

The problem with 100 facet fields

Solr treats each facet independently: To get the total amount of memory used for faceting, just sum the amounts used by each facet.

Let’s look at a relatively small index to start with: Faceting on a single field for 1M documents, 5000 references and 100 unique values takes up only 1.6MB theoretically (in reality we need to multiply this with 3 or more). Faceting on 100 of those can be done with less than 1GB of heap.

Even when the number of references and unique values rises drastically, the heap requirements stay modest: 1M documents, 5M references and 5M unique values takes up 17MB for a facet. Faceting on 100 of those is still within the capabilities of a moderate machine.

Increasing the number of documents has more impact. Faceting on a field for an index with 10 times the documents from above, for a total of 10M documents, increases heap requirements from 17MB to 41MB per facet. With 100 facets, we are deep in “we really need to tune the garbage collector to avoid erratic response times”-land at 4GB theoretically and around 15GB in reality. Having 50M documents means allocating 15GB of heap (which in reality is more like 50GB) for the 100 facets.

The worst offender is of course the number of facets as the impact rises linear with this. It is important to remember that the amount of values in the facets has very little say: 500 tiny facets has vastly more impact than 5 really large ones. Besides the heap requirement, processing time also rises linear with facet count (all else being equal, which it never is).

Faceting on 100 fields, all at the same time

If a person really need to facet on all fields at the same time, it is hard to avoid either switching facet implementation or taking the substantial performance- and memory-hits. See “Over 9000 facet fields” for details.

However, this is really a fringe use case. It might be used for statistics extraction or such, but a more common use case is…

Faceting on 100 fields, a few facets at a time

If a product catalogue has multiple diverse product lines or an index is otherwise shared among separate entities, it often makes sense from a user interface perspective to have a plethora of facets available, but only use a few at a time. While the latent facets imposes no extra performance penalty, they do take up just as much heap as if they were all active at the same time.

As previously established, having a lot of references and unique values is a lot cheaper than having many facets. One way of handling the many facets is thus to collapse them into a single facet.

Suppose we have an index with 50M documents and 100 small facets, each with 5000 references to 100 unique values. They each take up 77MB for at total of 7.6GB of heap (multiply with 3 to get real world numbers). A document might have values like this:

  • facet1:valueA
  • facet1:valueB
  • facet2:valueA
  • facet3:valueC

We collapse this to a single field by prefixing the values with the former field names:

  • collapsed:facet1/valueA
  • collapsed:facet1/valueB
  • collapsed:facet2/valueA
  • collapsed:facet3/valueC

With the collapsing, we have 1 field, 50M documents, 100*5000 references and 100*100 unique values. The needed amount of heap for the whole shebang is 114MB (or less than 0.5GB in real world numbers).

To perform a search with a wanted facet, we apply the parameter facet.prefix, which Limits the terms on which to facet to those starting with the given string prefix. At the GUI end, the field name prefix must be removed.

One facet per client

One use case presented on the Solr mailing list involves an index shared between 180 clients, each with their own documents and each doing search and faceting on their own documents only. Each client has a facet field dedicated to them, so this is a case of faceting on 180 fields, one facet at a time.

From a business logic perspective this setup makes perfect sense. Unfortunately Solr threw frequent Out Of Memory errors with a heap of 25GB. Even worse, an expected quadrupling of the number of clients means that both the number of facets and the number of documents will quadruple, resulting in future memory requirements of 10 times 25GB.

As the searches for each client are constricted to their own data by filtering (guessing a bit here), so that there is no chance of mixing values, there is no reason for using the prefix trick from above. The solution is simply that all clients share the same facet field instead of having individual ones.

A different solution would be to dedicate a shard to each client. This would have the added bonus of making relevance ranking for the single customer unaffected by the other customers data. But that is another story.

Over 9000 facet fields

March 20, 2013 by

In some ways, Lucene’s faceting, Solr’s faceting and our own home brewed solution SOLR-2412 works the same: Keep a permanent list from document IDs to term IDs. When searching, create a list of counters for each term and count the number of term occurences.

One difference is how multiple facets are treated. Both Lucene and Solr’s faceting keeps independent structures for each facet. This keeps the code clean and usage flexible: User A can ask for facets F1 & F2, while user B can have facets F2, F3 & F4. When the facet structures for user B’s request are calculated, only F3 & F4 needs to be created as F2 already exists.

Our implementation is a little different as it merges structures. Instead of user A asking for facets F1 & F2, the request is really for facet group G1 (containing fields F1 & F2). User B gets facet group G2 (containing fields F2, F3 & F4). When the facet structures for user B’s request are calculated, the full structure must be created from scratch (I’m lying a bit here as there is some re-use, but we’ll ignore that for clarity).

So what do we win by grouping facets structures? Memory and speed!

Let’s say we have an index with 10M documents and we want 10 facets. The first is heavy with 50M references and 20M unique values, while the other 9 are small with 1M references and 200 unique values.

Memory

The docIDs->termIDs maps used by Solr & SOLR-2412 are really multiple lists and a series of indirections. Somewhat simplified, there is docID2refs, which is a list of pointers, one for each docID, into a list of references. refs is a list of references to terms ordinals, one entry for each term in each document. Everything is packed, so the size of a facet structure is

#docs*log2(#refs) + #refs*log2(#uniqueTerms) bit

On top of that there are counter structures #uniqueTerms*32 bit / thread and miscellaneous house keeping. For an index with 10M documents, 50M references from documents to terms for a heavy facet field and 20M unique terms, this is 10M*log2(50M) + 50M*log2(20M) bit ~= 30MB + 145MB = 175MB and 20M*32 bit = 76MB for each concurrent search thread.

Looking at our sample case, with Solr’s independent facets, the memory requirement can be estimated by summing: As calculated above, the big facet takes up 175MB. The 9 smaller facets each takes 25MB, with 24MB used for the docID2refs list and 1MB used for the refs list. A grand total of 400MB, ignoring counter structures.

By grouping facets, as done in SOLR-2412, the memory requirement is

#docs*log2(#refs) + #refs*log2(#uniqueTerms) bit

That is the same formula as for Solr, except that this holds the information for all the fields. The memory requirement is thus 10M*log2(50M+9*1M) + (50M+9*1M)*log2(20M+9*200) bit ~= 31MB + 171MB = 202MB. In reality there are temporary memory overheads and bookkeeping, so a rule of thumb is that one must allocate heap for 3 times the calculated structure size.

Speed

When a faceted search is requested, the terms in the facets must be counted for all the documents that matches the given query. With independent facets, this goes something like this

foreach facet {
  foreach docID {
    foreach termID_for_the_docID {
      counter[facet][termID_for_the_docID]++
    }
  }
}

or we could swap the facet and docID iteration

foreach docID {
  foreach facet {
    foreach termID_for_the_docID {
      counter[facet][termID_for_the_docID]++
    }
  }
}

either way the two outer loops means that the inner termID-loop is encountered #docIDs*#facets times.

With grouped facets, the iteration is

foreach docID {
  foreach termID_for_the_docID {
    counter[termID_for_the_docID]++
  }
}

which means that the inner termID-loop is encountered #docIDs times.

In reality, this does not mean that independent faceting is #facets times slower than grouped faceting. Very informal testing on our corpus at Statsbiblioteket with 11M documents, 100M+ references, 20M+ unique values and 15 facets showed that independent faceting was about 1½-2 times slower.

5000 and counting

One of the fun things about following mailing lists are the occational far out use cases that pops up. Two days ago a user presented a problem: He ran out of memory on his 12GB machine, when he tried to facet on 5000 fields on his index with 11M documents.

The user likely did not really want to facet on all the fields on each search, but it is a fun experiment to get it to work anyway. Besides, most of the results extends to the more realistic “facet on x arbitrary fields out of 5000″.

Assuming the modest numbers of 11M documents, 10K references and 200 unique terms for each facet, the memory requirement when using independent faceting is 18MB. Per facet! With 5000 facets that is 85GB in total, ignoring counters and general overhead. No wonder he ran out of memory. Even if he switched to a 128GB machine, a facet call with 100K matching documents would still involve 100K*5K = 500M lookups for termIDs, which would likely take quite some time.

Using grouped faceting, the theoretical memory requirement is 152MB and the number of lookups for termIDs for a 100K matching search is 100K.

It’s over 9000!!!

To put the theory to the test and live up to the meme, a test corpus was created: 10M documents, each with 10 fields:

  • 1 field with 1 out of 18 unique values
  • 7 fields chosen at random between 9000 possible, each with 1 out of 9000 unique values
  • 1 field with 2 random values

The index measured 88GB. Full faceting on the 9002 fields would ideally take 136GB+ if done with independent facets and about 344MB+ if grouped. Aaand we need to multiply this with 3 to get a more realistic estimate of required memory.

…testing… …testing… Drat! 3 times the ideal memory was not enough. It was upped to 2GB, which worked fine. Each query resulted in ~550K documents, randomly distributed in the index. Response times were as follows:

  • 1st request (facet structure creation): 330 seconds
  • 2nd request: 8 seconds
  • 3rd request: 8 seconds
  • 4th request: 9 seconds
  • 5th request: 8 seconds

The inner faceting took about 6-700ms on each request, except for the first. The remaining 7-8 seconds was used for building the 3.5MB XML response, which contained the top-3 for each facet, for a total of 27K term counts. The hardware was a Xeon E5-2670 with 16GB free RAM and SSDs (what else?) to hold the index.

Conclusion

While the idea of faceting on 5000 or 9000 fields seems insane at first, it is actually quite doable with modest memory requirement and response times. So technically we succeeded.

However, as stated earlier, this is likely not a real-world use case. The “facet on x arbitrary fields out of 5000″-problem is more realistic and quite doable, for example by creating a single facet with the entries encoded as field_term and performing a search with facet.prefix=field_ for each wanted facet.

Collator sorted facet results

February 22, 2013 by

The concept is quite simple: Instead of ordering the facet result by tag count, it is ordered by sorting the terms. Where a classic facet result might be

  • steak (102)
  • tofu (80)
  • ice cream (47)
  • pasta (47)
  • ratatouile (32)

the same search could return the term-sorted facet

  • broccoli (5)
  • crème brûlée (10)
  • ice cream (47)
  • lard (1)
  • pasta (47)

At Statsbiblioteket, we use this to perform lookups in indexed titles and author names.

What seems to be the problem?

If we just perform a faceting on the indexed terms and sort by natural order (which is ~Unicode for Lucene), this will work just fine. Provided you stick to ASCII! This constraint makes it absolutely useless for anything but a very controlled vocabulary: If we sort by Unicode, the terms crab, crème & crow will be ordered crab, crow & crème. Removing diacritics does not work for us as some of the characters in Danish, notably æ, ø & å, does not have diacritics.

What seems to be the solution?

To get proper sorting in Java, one would normally use a Collator (we prefer the ICU Collator, but Java has one build-in). So somehow we want to apply Collator-sorting to our faceting.

Collator based sorting at index open

For a long time our faceting system worked by Collator sorting all the terms when the index were (re)opened. This gave the desired result, but as both author- and title-fields tend to have a lot of unique terms (about 18 million in our main index), it took us a about 250 seconds to prepare that part of the faceting structure upon an index update.

Collator keys to the rescue?

The current prevalent sorting solution in Lucene/Solr is to index ICU collator keys. ICU collator keys are byte-representations of terms with the property that a simple byte-by-byte comparison will result in the correct order. The terms crab, crème & crow will be converted to the bytes [2b 49 27 29 01 08 01 c1 00], [2b 49 2f 3f 2f 01 84 8f 06 01 bf 00] & [2b 49 43 53 01 08 01 c1 00] with the default ICU collator for danish. As can be seen, byte number 3 in each of those keys are the first byte that is not equal and those bytes (27, 2f & 43) does indeed signify the correct order for the input.

This works very well for sorting of documents, but does not work for faceting. The problem is that term->collationKey is a one-way transformation. We get the tags in the right order, but they are not human readable.

Extended collator keys to the rescue!

The solution is deceptively simple: Just store an UTF-8 representation of the original terms after their keys. Thus the terms crab, crème & crow will be converted to the bytes [2b 49 27 29 01 08 01 c1 00 63 72 61 62], [2b 49 2f 3f 2f 01 84 8f 06 01 bf 00 63 72 c3 a8 6d 65] & [2b 49 43 53 01 08 01 c1 00 63 72 6f 77].

The natural sort order is preserved and we can now extract the original term. This all works due to the nice property for ICU collator keys that they are always terminated with a 00. The 00 acts as terminator for comparison and as delimiter between the key and the term.

It actually works

There is a slight index-time penalty as the keys needs to be created and a slight increase in index size, but those are both academical. The change in the time used for creating the facet structures for title and author is substantial: That time fell from 250 seconds to 86 seconds.

Yes, the code is open source. Right now it must be digged out from the Exposed module in the Summa project, but a proper Lucene patch will be generated Real Soon Now.

Large facet corpus, small result set

January 23, 2013 by

Primer

Each time a user issues a search in our primary corpus, we perform faceting on 19 different fields. Some of those fields have a low amount of unique values (year, language, classification), some of them are quite heavy (author, title, semi-free keywords). We have a grand total of 38 million unique tags and 160 million references from 11 documents to the tags displayed as part of the faceting.

The way our faceting works is simple in principle: When a search is performed, an array of counters (a basic int[]) is allocated. The array contains a counter for each unique tag (38 million in our case). All the internal IDs for the documents from the full search result are extracted and iterated. For each document ID, the reference structure provides the tag IDs, which corresponds exactly to entries in the counter array. For each tag ID, the counter for said tag is incremented. When all document IDs has been iterated, the counters are iterated and the ones with the highest counts are extracted.

If you followed that, congratulations. So, performance-wise, what is wrong with that approach? Yeah, the title was kind of a giveaway. Even if the search results in just a single hit, with just a single tag, we still need to iterate the full 38 million counters in order to extract the facet result. We need to clear it too, before the next run, so we’re really talking 2 complete iterations, one of them involving sorting logic. Ouch. Or to be more precise: About 300ms of ouch.

So what do we do? Well, if we know that our result set is small, we could use a simple HashMap to hold our counters; with tag-IDs as keys and the counters themselves as values. We tried that some years ago. It did sorta-kinda work, but that approach had significant drawbacks:

  • HashMaps are memory pigs and they tax the garbage collector. We do not want to insert hundreds of thousands of objects into them, when they are just used for highly temporary counting.
  • We need to guess the size of the result from the start. Not just the number of hits in the search result, but the number of tags that they refer to collectively so as not to get unvieldy HashMaps. If we guess wrong, we need to start over or copy the values from the map into our full counting structure.

We abandoned our HashMap based sparse counter approach as our experiments showed that the dumb “just iterate everything all the time” performed better for most of our searches.

New idea

Summing up the requirements for a faceting system where tag extraction performance is dependent on the number of found tags, rather than using a fixed amount of time:

  • It should work without knowing the number of tags in advance.
  • It should not tax the heap nor the garbage collector excessively
  • Extraction time should be linear (we accept O(n*log2(n))
  • for sorted extraction) to the number of marked tags

Mikkel Kamstrup Erlandsen kindly pointed me to the article Using Uninitialized Memory for Fun and Profit. With a little tweaking, a simplified version should satisfy all the requirements. We will build a counting structure that can switch seamlessly from sparse to non-sparse representation.

For this, we introduce yet another array: The tag count tracker. It holds pointers into the tag count array from before. Its length is the cutoff for when to use sparse counting and when to use full counting and must be determined by experiments.

When the count for a tag for a document needs to be incremented, we start by loading the old count from our tag count array (we need to do this anyway in order to add 1 to the count). If the count is 0, the position of the counter is added to the tag count tracker. If this overflows the tag count tracker, we switch to standard counting and completely ignore the tag count tracker hereafter. Either way, the value from the tag count array is incremented and stored back into the array as we would normally do.

When all the tags for all the documents has been iterated, the tag count tracker (if not overflowed) contains a complete list of all the tags that has a count of 1 or more. The tag extracter needs only iterate those and, just as important, needs only clear those. If the tag count tracker was overflowed, the old iterate-everything approach is used. As for clearing the tag count tracker for next use, it is simply a matter of setting its logical length (a single int) to 0. Presto!

Now it just needs to be implemented.

Fire fire everywhere

September 21, 2011 by

Searching at Statsbiblioteket has been slow the last couple of days and the condition has grown progressively worse. Yesterday evening and this morning, using the system required the patience of a parent. Consequently the full tech stack congregated at the office (maintenance BOFH, backend bit-fiddler, web service hacker and service glue guy) hell-bent on killing bugs.

A hastily cobbled together log propagation thingamabob claimed that the backend answered snappy enough, network inspection showed a very high amount of requests to storage (a database that contains the records backing the search) and timeouts & session pool overflows were all over the place. The DidYouMean service was appointed scapegoat and killed with extreme prejudice.

Things got exponentially worse! Uptime was about 2 minutes, with last minute performance quickly falling off to unusable. Phones started ringing, emails ticked in and an angry mob with pitchforks lay siege to the office. Inspection revealed that killing DidYouMean meant that the service layer unsuccessfully tried to get the result for 20 seconds (yes, that timeout was far too high) before giving up, quickly filling Apache session pools. DidYouMean was resurrected, services started up again and all was well. Or at least back to where it was before the wrong service was unfairly executed.

Mads coding

Mads stands up and codes. A sure sign of high alert

Waiting for the next hammer to drop, code was reviewed (again), pool sizes were tweaked and logs were watched intensely. At 12:09:47 and 952 milliseconds, the impact riveter started again and storage staggered. But lo and behold: The maintenance guy had changed log levels to DEBUG (for a limited amount of time). An hour and 20,000 observed requests for the exact same ID later, the magical incantation i++; was inserted in a while loop. Testing, deployment, re-deployment, tomcat restart and another tomcat restart followed quickly.

It turned out that certain rare records triggered the endless loop. The progressively worse performance stemmed from more and more of these edge cases piling up, each looping forever. As the overwhelmed storage was on the same server as the searcher, the shared request pool was flooded with storage requests, only occasionally allowing search requests.

With the roaring fire quelled, business returned to normal. By pure coincidence, the assignment for the next days is vastly improved propagation, logging and visualisation of timing information throughout the services.

The right tool – neo4j?

May 6, 2011 by

Background

We use a backing storage for documents in our home brewed search system Summa. It was supposed to be a trivial key-value store with document-IDs resolving to documents. Then an evil librarian pointed out that books are related to each other, so we had to add some sort of relational mapping. For some years we have used relational databases for this, going from PostgreSQL through Derby and landing on H2, which has served us fairly well. Documents with relations were a bit slow to resolve but there were less than 5% of those in the full corpus, so for all practical purposes they presented no problem.

Fast forward to a week ago, where we added a new target to our integrated search. One million fresh documents or a 10% increase in total document count. Unfortunately most of these documents were related to each other, increasing our total relation count 2000%. Our full indexing time soared from “It’ll be done before noon” to “I hope it finishes before tomorrow”. Ouch!

A long discussion about database design and indexes followed. The conclusion was that we really did not use H2 for what it was good at and that maybe we should look at a graph-oriented database.

Implementing storage with neo4j

Neo4j is an open source graph database. It is written in Java and can be used as an embedded application. This is important for us as we like our packages to be self contained. Friday is work-on-whatever-project-you-think-can-benefit-Statsbiblioteket-day, so I dedicated it to trying out neo. Keep in mind that I only heard about Neo this morning and that I hadn’t looked at a single page about the product before I started the project.

  • 10:36 Created test project
  • 10:40 Selected Neo4j version 1.3 stable Enterprise, added to Maven POM, downloaded files
  • 10:50 Created skeleton class for Summa NeoStorage
  • 10:55 Added basic properties, created skeleton Unit test
  • 11:15 Added code for flushing Summa Record to Neo
  • 11:36 Added code for retrieving a previously flushed record + unit test. Hello world completed
  • 11:40 Break finished, started on ModificationTime retrieval
  • 12:35 Proof of concept retrieval using modification time (slow development due to human error)
  • 13:20 Finished bulk ingest and record-by-record extraction using modification time order, including unit test
  • 13:50 Finished mapping of Summa Record child-relations to Neo, both for ingest and extraction

Including mistakes, distractions from colleagues and reading documentation, it took under 4 hours to integrate Neo as the new backend for our storage. There’s still a lot of minor things to add and special cases to cover, but the result is complete enough to be used for most workflows in Summa.

Implementation was a breeze, the API was very clean and the examples and guides at the website were to the point and well thought out. Kudos to the developers for great work!

Performance peek

Since the Neo Storage isn’t finished, it would be unfair to compare it to H2 with regard to ingest. However, the extraction part is complete enough to test.

With the old H2-backed storage, our extraction time fell to below 5 records/sec for the new documents with 2-4 relations each (extraction time for records without relations is 2-3000 records/sec).

Creating a test-storage using Neo with 100000 documents, each with 5 children, changed the extraction speed to 2500 expanded records/second or 15,000 raw records/second if we count the children.

Granted, the only fair test is with production data, but so far Neo4j looks like a clear winner for our purposes!

Hierarchical faceting in Solr

March 9, 2011 by

Solr already has SOLR-64 which does hierarchical faceting and SOLR-792 which does pivot faceting. A few minutes ago, I uploaded SOLR-2412 which does hierarchical faceting. What’s the big idea?

SOLR-2412 is a fairly thin wrapper around LUCENE-2369. LUCENE-2369 was designed with the clear trade-offs

* Slow startup
* Low memory overhead
* Fast response

with the archetypal usage scenario being a large index containing one or more rich hierarchies that is batch-updated every night (see Hierarchical faceting – working code for more details). With fear of misrepresenting, SOLR-64 and SOLR-792 were created from a feature-standpoint with performance characteristics being secondary.

Feature wise, SOLR-2412 (let’s call it Exposed faceting from now on) differs markedly from pivot faceting (SOLR-792) at this time, as neither of the two solutions can do what the other one does. However, I feel confident that Exposed faceting can be tweaked to do pivot faceting later on. The main reason to use Exposed over SOLR-792 would be to change trade-offs.

Compared to SOLR-64, Exposed faceting’s features differs primarily by supporting multiple paths per document: A product belonging to multiple categories, multiple locations for a bus route and so on.

The next step is to create a test bed for doing performance measurements on Exposed vs. Solr’s different faceting implementations. Naturally the hoped-for outcome is that Exposed is markedly better under the defined trade-offs.

Virtual Integrated Search

February 4, 2011 by

For a while it seemed that Integrated Search with a nice Discovery Interface coupled with a large Data Well was the answer to how libraries were going to let users find and use the multitudes of material they have access to.

Many different places have tried building their own data well (sometimes a large national data well) but most have given up. Why? Primarily because of the unwillingness of publishers to hand over data to every single data well but also because the very concept of a very large local data well has been made (at least somewhat) obsolete by the new “Web Scale Discovery” tools – such as Summon, Primo Central and EBSCOHost.

Some libraries are fine with using the standard discovery interfaces that these services provide and some would rather use their APIs and build their own interface on top, perhaps adding tight integration with their library system and other locally developed systems.

However this area is fast moving and just as it looked as if pretty much all metadata would eventually be available in all these systems EBSCO has decided not to allow their metadata to be indexed outside of their own system. They will however allow for “just-in-time” searches to be used. It appears as if the market is fragmenting back into federated search – and the problems with federated search are well-known and are what made us all pursue integrated search to begin with.

But can we do better this time around? Fortunately the answer is yes – if we can get a little help from our friends.

There are two main problems with federated search:

  • Response times: The entire federated system is only as fast as the slowest search node.
  • Merging: There is no meaningful way to merge different result sets as they can have vastly different sorting criteria.

We can’t do a lot about the problems with response times but fortunately the new systems are _a lot_ faster then the old ones, so hopefully it wont be that big an issue.

Merging has however gotten easier, strangely enough as a side effect of making the sorting of individual results more complicated. The magic here lies in relevancy ranking and the fact that pretty much all new systems are based on the same principles and code base (ie. Lucene/Solr).

So how does this work? The relevancy ranking of a given document in a query is based on different things but a major contributor is the term frequency–inverse document frequency.

We have chosen to call this concept Virtual Integrated Search as the end result is (potentially) virtually indistinguishable from from having a large local data well coupled with a true integrated search. For the Well11 and code4lib 2011 conferences we have prepared a first stab at an implementation and integration with our existing Search and Summa system. This is not much more than an internal beta version where a primary focus of the frontend has been to make is possible to tweak the way merging is performed.

What will it look like? The main purpose for us is to make it look to the user as if all the data is retrieved from a single source. Currently we present results from Summon in a box of its own but the plan is to simply go back and present all results in a single list.

This is something we will be working on and experimenting with in the coming months and we are quite excited about the possibilities.

Going to code4lib 2011

January 7, 2011 by

Mads Villadsen and I are fortunate enough to be attending code4lib 2011 in early February. Last year our plane was stopped by a snow drift. This year we’re going full paranoia with a planned US-arrival 2 days before the main conference starts.

code4lib 2009 was the best library-oriented conference we’ve been to, so our hopes for 2011 are high. The program certainly looks interesting and one common theme – merging of search results from different sources – is perfectly timed to our current project on merging of Summa and Summon search results. Hopefully we’ll have enough experience by then to do a lightning talk about it.

We will also be attending the pre-conference on Solr and since hierarchical-like faceting seems to be a fairly hot topic this year, we plan to hack together a Solr-based proof of concept of our take on the problem before the conference.


Follow

Get every new post delivered to your Inbox.