Sorting, faceting and index lookup with Lucene

A while ago, I described a proof of concept for reducing the memory impact of sorting with Lucene. There were also some lofty ideas about faceting and index lookup. I am now happy to say that these ideas have graduated to the proof of concept stage.

Laying the groundwork

First of all, let us describe three common ways of looking at an index.

  • Sorting: Sorting on title, author and similar fields should be done with respect to the locale of the user. This is often done by comparing the relevant strings from a search result to each other, which scales horribly as the number of hits rises, because locale-aware comparison of Strings is expensive. Another approach that is making the rounds is to index collator keys, which are not usable for human reading but are very fast to compare (a small sketch of the principle follows this list).
  • Faceting: Fairly standard by now, faceting still presents some architectural challenges. Creating facets for subjects with a low number of possible values is simple enough, but scaling to faceting on title, author or similar fields with many unique values takes its toll on memory if all the values are kept there. Solr provides an UnInvertedField, which offloads most of the terms to storage.
  • Index lookup: Also known as “author browse” and probably some other names, as it is not that well-known. The idea is to provide the user with an authoritative list of possible values for a given field. When the user starts to type, valid values with the typed prefix are shown, together with a few of the preceding values as well as the following ones. The difference from the well-known suggestion concept is that index lookup is complete and that the result is in locale-aware sorted order. It can be seen as a dictionary. If memory is plentiful, a simple implementation is to hold a sorted list of terms in memory and do a standard locale-aware binary search for the entry point in the list.
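
As a quick illustration of the collator key principle from the sorting bullet (plain Java, not part of the patch): the keys preserve the locale's ordering but compare as cheap byte sequences.

import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class CollatorKeyDemo {
    public static void main(String[] args) {
        // Locale-aware collator; in Danish, Å sorts after Z
        Collator collator = Collator.getInstance(new Locale("da"));

        // Slow path: every comparison re-analyzes both Strings
        int slow = collator.compare("Ångström", "Zebra"); // > 0 under da

        // Fast path: build each key once, compare bytes afterwards
        CollationKey a = collator.getCollationKey("Ångström");
        CollationKey b = collator.getCollationKey("Zebra");
        int fast = a.compareTo(b); // same ordering, far cheaper per comparison

        // The byte form is what would be indexed: compact and fast to
        // compare, but not human readable
        byte[] indexable = a.toByteArray();
        System.out.println(slow + " " + fast + " " + Arrays.toString(indexable));
    }
}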

One common element for these three ideas is the terms: terms kept in memory. With lower-end hardware or higher-end index sizes, this is a problem. Lucene (and thereby Solr) is switching to something called BytesRef, which can represent characters more compactly than Java Strings. BytesRefs really do help a lot, but for some setups that is still not enough.

Now, a Lucene index naturally contains all these terms: they can be looked up, they can be accessed sequentially or (dramatic pause) they can be accessed by ordinal. An ordinal is just an integer and takes up very little memory to represent. Requesting a term by ordinal is a bit costly though, as the term must be fetched from storage, which requires some seeking.
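
A minimal sketch of ordinal-based term access. This uses the TermsEnum API of recent Lucene versions (the flex API on trunk at the time differed), and ordinal seeking is only supported when the codec tracks term ords, so treat it as an illustration rather than the patch's code:

import java.io.IOException;

import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public final class OrdinalResolver {
    /** Resolves the term with the given ordinal by seeking in the index. */
    public static BytesRef termAt(LeafReader reader, String field, long ordinal)
            throws IOException {
        Terms terms = reader.terms(field); // null if the field does not exist
        TermsEnum termsEnum = terms.iterator();
        termsEnum.seekExact(ordinal);      // storage access: this is the seek
        return BytesRef.deepCopyOf(termsEnum.term()); // copy, as the enum may be reused
    }
}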

Another way

Going back to the three ideas, another common element is that only a relatively small number of the terms are actually shown to the user. For sorting, it might be the top 20 titles. For faceting, it might be 200 tags shared among 10 facets. For index lookup, it might be 10 author names. Fast resolution of terms is not required!

Armed with this knowledge, let’s define four building blocks.

  • Ordinal to term lookup: Given an ordinal, the term (aka BytesRef) is returned by querying the index. Trivial for single segments, fairly simple for multiple segments as they are just appended. This is just a logical mapping and requires practically no memory.
  • Indirects: The ordinals are sorted, typically with respect to a locale, and the sorted list is called the indirects. If one entry in the indirects comes before another, its corresponding term sorts before the other entry’s term. We always need to sort in order to build the indirects, even if the terms in the segments are already in order, because an index is often composed of several segments that most probably contain duplicate terms. In that case, the sorting can also be seen as de-duplication of ordered lists. This works very well with the collator key idea from above.
  • Document to single indirect: For each document id (still just an integer), an indirect index is kept. Resolving the indirect yields the term corresponding to the document. Memory-wise this requires a list of integers as long as the number of documents.
  • Document to indirects: For each document id, a list of corresponding indirects is kept. By following the indirects through the ordinals, the corresponding terms can be resolved. Memory-wise this requires a list of integers as long as the number of documents, plus a list of integers as long as the total number of indirects for all documents.

The common element here is arrays of integers. Java’s int[] is not that memory hungry to start with, but we can reduce the impact even further by using the more compact PackedInts representation, as sketched below.
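
PackedInts ships with Lucene. Here is a minimal sketch of using it for the document to single indirect map; the document count and maximum indirect value are made-up numbers, and the factory signature with an overhead ratio reflects current Lucene and may differ in other versions:

import org.apache.lucene.util.packed.PackedInts;

public class PackedDemo {
    public static void main(String[] args) {
        int docCount = 1000000;   // one entry per document
        int maxIndirect = 500000; // highest indirect value to store

        // 19 bits per value here, instead of the 32 bits of a plain int[]
        int bitsPerValue = PackedInts.bitsRequired(maxIndirect);
        PackedInts.Mutable doc2indirect =
                PackedInts.getMutable(docCount, bitsPerValue, PackedInts.COMPACT);

        doc2indirect.set(42, 12345); // document 42 points at indirect 12345
        System.out.println(doc2indirect.get(42) + " stored at " + bitsPerValue + " bits/value");
    }
}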

Revisiting the three common ways of looking at an index with the new structures, we see how it all comes together.

  • Sorting: The document to single indirect is used for this, as the relative order of two documents can be determined by comparing their entries in the integer array. This is very fast.
  • Faceting: The document to indirects is used for faceting. A counting array of length #indirects is made and the document ids are iterated: for each id, the indirects are looked up and the corresponding counter entries are incremented. As it is just a matter of iteration, array lookups and updates, this is very fast (see the sketch after this list). After that, the most popular entries in the counter array can be extracted and their terms resolved. As the indirects are ordered, it is also possible to extract by lexicographical order or similar.
  • Index lookup: The indirects are immediately usable for this, as a binary search can be performed. For each iteration of the binary search, one actual term needs to be looked up in storage and compared to the given prefix, but even with 10 million terms this requires only 24 lookups. Most of these lookups will normally be disk cached.
  • Enhanced index lookup: By using faceting for index lookup, we gain the advantage of showing how many documents will match the displayed terms for a given field. Furthermore, this can be coupled with previous selections or a query, limiting the displayed terms to those that will result in at least one hit for a search. This might sound like heavy processing, but luckily the counters from the faceting can easily be cached. While an initial faceting of 5 million tags takes about 1 second on a laptop anno 2010, subsequent index lookups within that structure take only 1-3 milliseconds.
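
To make the sorting and faceting bullets concrete, here is an illustrative sketch in plain Java. The flattened document to indirects layout (starts[] plus indirects[]) and all names are my own stand-ins for this post, not the actual structures from the patch:

public class FacetCountSketch {
    // Sorting: document order is decided by plain array lookups
    public static int compareDocs(int[] doc2indirect, int docA, int docB) {
        return doc2indirect[docA] - doc2indirect[docB]; // safe: indirects are non-negative
    }

    // Faceting: one counter per unique indirect, incremented per hit
    public static int[] count(int[] hitDocIDs, int[] starts, int[] indirects,
                              int indirectCount) {
        int[] counters = new int[indirectCount];
        for (int docID : hitDocIDs) {
            // the indirects for docID occupy starts[docID] .. starts[docID + 1] - 1
            for (int i = starts[docID]; i < starts[docID + 1]; i++) {
                counters[indirects[i]]++;
            }
        }
        // The most popular (or lexicographically first) entries would now be
        // selected, and only those few terms resolved from storage.
        return counters;
    }
}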

An example

Two test indexes were created, with 1 and 10 million documents respectively. Each document had an id, a title (10 random characters), an author (10 random characters) and 0-5 sample tags, each a single term A-Y. For each index, a search was performed that hit half of the available documents, followed by sorting (Lucene term order on author), faceting (locale “da” for the title, count for the title (yes, again) and count for the sample tags) and index lookup (on title with locale “da”). After that, a search that hit 1/10 of the index (random distribution) was performed with the same sorting, faceting and index lookups.

1 million documents, collator sort

Index = /home/te/projects/index1M (1000000 documents)
Used heap after loading index and performing a simple search: 5 MB
Maximum possible memory (Runtime.getRuntime().maxMemory()): 88 MB

First natural order sorted search for "even:true" with 500000 hits: 2473 ms
Subsequent 5 sorted searches average response time: 23 ms

Facet pool acquisition for "even:true" with structure groups(
  group(name=sorted, order=locale, locale=da, fields(a)), 
  group(name=count, order=count, locale=null, fields(a)), 
  group(name=multi, order=count, locale=null, fields(facet))): 1:20 minutes
First faceting for even:true: 128 ms
Subsequent 4 faceting calls (count caching disabled) response times: 75 ms

Initial lookup pool request (might result in structure building): 0:33 minutes
First index lookup for "even:true": 56 ms
Subsequent 91 index lookups average response times: 0 ms

First natural order sorted search for "multi:A" with 95222 hits: 6 ms
Subsequent 5 sorted searches average response time: 5 ms

Facet pool acquisition for "multi:A" with structure groups(
  group(name=sorted, order=locale, locale=da, fields(a)), 
  group(name=count, order=count, locale=null, fields(a)), 
  group(name=multi, order=count, locale=null, fields(facet))): 0 ms
First faceting for multi:A: 23 ms
Subsequent 4 faceting calls (count caching disabled) response times: 23 ms

Initial lookup pool request (might result in structure building): 0 ms
First index lookup for "multi:A": 42 ms
Subsequent 91 index lookups average response times: 0 ms

Free memory with sort, facet and index lookup structures intact: 68 MB
Total test time: 1:57 minutes

10 million documents, collator sort

Index = /home/te/projects/index10M (10000000 documents)
Used heap after loading index and performing a simple search: 25 MB
Maximum possible memory (Runtime.getRuntime().maxMemory()): 910 MB

First natural order sorted search for "even:true" with 5000000 hits: 0:21 minutes
Subsequent 5 sorted searches average response time: 227 ms

Facet pool acquisition for "even:true" with structure groups(
  group(name=sorted, order=locale, locale=da, fields(a)), 
  group(name=count, order=count, locale=null, fields(a)), 
  group(name=multi, order=count, locale=null, fields(facet))): 10:08 minutes
First faceting for even:true: 934 ms
Subsequent 4 faceting calls (count caching disabled) response times: 897 ms

Initial lookup pool request (might result in structure building): 3:21 minutes
First index lookup for "even:true": 348 ms

First natural order sorted search for "multi:A" with 947218 hits: 99 ms
Subsequent 5 sorted searches average response time: 45 ms

Facet pool acquisition for "multi:A" with structure groups(
  group(name=sorted, order=locale, locale=da, fields(a)), 
  group(name=count, order=count, locale=null, fields(a)), 
  group(name=multi, order=count, locale=null, fields(facet))): 0 ms
First faceting for multi:A: 267 ms
Subsequent 4 faceting calls (count caching disabled) response times: 233 ms

Initial lookup pool request (might result in structure building): 0 ms
First index lookup for "multi:A": 99 ms
Subsequent 91 index lookups average response times: 2 ms

Used memory with sort, facet and index lookup structures intact: 685 MB
Total test time: 14:00 minutes

As can be seen, the initialization time is very high with many documents. Using a hybrid approach with indexed collator keys helps a lot, but is (for now) incompatible with lexicographically sorted faceting and index lookup, as they require that human-readable terms are returned. This can be solved without extra memory overhead by having parallel structures in the index for the terms and for their collator keys, but the code has not been written yet. Simulating the parallel structures idea by sorting in term natural order, we get the following measurements.

1 million documents, term natural order sort

Index = /home/te/projects/index1M (1000000 documents)
Used heap after loading index and performing a simple search: 6 MB
Maximum possible memory (Runtime.getRuntime().maxMemory()): 88 MB

First natural order sorted search for "even:true" with 500000 hits: 2586 ms
Subsequent 5 sorted searches average response time: 23 ms

Facet pool acquisition for "even:true" with structure groups(
  group(name=sorted, order=index, locale=null, fields(a)), 
  group(name=count, order=count, locale=null, fields(a)), 
  group(name=multi, order=count, locale=null, fields(facet))): 0:44 minutes
First faceting for even:true: 120 ms
Subsequent 4 faceting calls (count caching disabled) response times: 66 ms

Initial lookup pool request (might result in structure building): 0:15 minutes
First index lookup for "even:true": 26 ms
Subsequent 91 index lookups average response times: 0 ms

First natural order sorted search for "multi:A" with 95222 hits: 6 ms
Subsequent 5 sorted searches average response time: 4 ms

Facet pool acquisition for "multi:A" with structure groups(
  group(name=sorted, order=index, locale=null, fields(a)), 
  group(name=count, order=count, locale=null, fields(a)), 
  group(name=multi, order=count, locale=null, fields(facet))): 0 ms
First faceting for multi:A: 22 ms
Subsequent 4 faceting calls (count caching disabled) response times: 20 ms

Initial lookup pool request (might result in structure building): 0 ms
First index lookup for "multi:A": 7 ms
Subsequent 91 index lookups average response times: 0 ms

Used memory with sort, facet and index lookup structures intact: 66 MB
Total test time: 1:03 minutes

10 million documents, term natural order sort

Index = /home/te/projects/index10M (10000000 documents)
Used heap after loading index and performing a simple search: 25 MB
Maximum possible memory (Runtime.getRuntime().maxMemory()): 910 MB

First natural order sorted search for "even:true" with 5000000 hits: 0:22 minutes
Subsequent 5 sorted searches average response time: 229 ms

Facet pool acquisition for "even:true" with structure groups(
  group(name=sorted, order=index, locale=null, fields(a)), 
  group(name=count, order=count, locale=null, fields(a)), 
  group(name=multi, order=count, locale=null, fields(facet))): 4:46 minutes
First faceting for even:true: 865 ms
Subsequent 4 faceting calls (count caching disabled) response times: 788 ms

Initial lookup pool request (might result in structure building): 1:41 minutes
First index lookup for "even:true": 318 ms
Subsequent 91 index lookups average response times: 2 ms

First natural order sorted search for "multi:A" with 947218 hits: 101 ms
Subsequent 5 sorted searches average response time: 51 ms

Facet pool acquisition for "multi:A" with structure groups(
  group(name=sorted, order=index, locale=null, fields(a)), 
  group(name=count, order=count, locale=null, fields(a)), 
  group(name=multi, order=count, locale=null, fields(facet))): 0 ms
First faceting for multi:A: 220 ms
Subsequent 4 faceting calls (count caching disabled) response times: 220 ms

Initial lookup pool request (might result in structure building): 0 ms
First index lookup for "multi:A": 92 ms
Subsequent 91 index lookups average response times: 2 ms

Used memory with sort, facet and index lookup structures intact: 648 MB
Total test time: 6:58 minutes

Observations

It seems obvious that using indexed collator keys is the way to go: it is significantly faster. The drawbacks are slightly increased indexing time and increased index size. The parallel structures mentioned earlier might look something like the sketch below.
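
A hypothetical sketch of those parallel structures; the TermLookup interface, the two conceptual fields and every name here are assumptions of mine, not the patch's API. The point is that ordering uses cheap byte-wise comparison of pre-computed collator keys, while display still resolves human-readable terms by the same ordinal:

import java.util.Arrays;

public class ParallelSortSketch {
    /** Assumed accessor; the real patch would resolve these from two index fields. */
    interface TermLookup {
        byte[] keyAt(int ordinal);   // collator key field: bytes compare as the locale sorts
        String termAt(int ordinal);  // human-readable term field, same ordinal
    }

    /** Builds the indirects by sorting ordinals on their collator keys. */
    static Integer[] buildIndirects(TermLookup lookup, int termCount) {
        Integer[] indirects = new Integer[termCount];
        for (int i = 0; i < termCount; i++) indirects[i] = i;
        // Byte-wise key comparison: no Collator calls at sort time
        Arrays.sort(indirects, (a, b) -> compareBytes(lookup.keyAt(a), lookup.keyAt(b)));
        return indirects;
    }

    private static int compareBytes(byte[] x, byte[] y) {
        for (int i = 0; i < Math.min(x.length, y.length); i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF); // unsigned byte comparison
            if (diff != 0) return diff;
        }
        return x.length - y.length;
    }
}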

The current proof of concept plays nicely with reopening of the index, which means that an update to the index only requires a partial refresh of the overall structure. In order to do so, some intermediate structures need to be held in memory. Disabling this caching makes reopening just as expensive as the initial open, but frees some memory. The amount of freed memory has not been measured yet, but I would guess about 1/4-1/3 of total memory usage, depending on the index.

In perspective

The sorting part by itself required about 1/3 of the memory needed by standard Lucene natural order sorting. It is more difficult to compare the faceting and index lookup, as Lucene does not provide either. Solr handles faceting, so I hope to find the time to make a comparison with that product.

To boil it down, the proof of concept makes it possible to provide faceting on authors, sorting on titles and a bit on the side for an index with 20 million documents on a 2 GB machine. Scaling up is theoretically n log(n) in time, but less than that in the tests above, probably due to caching and JITting.
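
As a back-of-envelope sanity check of that claim (my own arithmetic, not from the tests): a document to single indirect map for 20 million documents holds 20 million entries, which is about 80 MB at 32 bits each; with PackedInts at, say, 25 bits per value it drops to roughly 60 MB, leaving plenty of headroom on a 2 GB machine even with the faceting structures on top.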

Try it for yourself

Remember, this is a proof of concept. Check out Lucene trunk and apply the patch at LUCENE-2369. It contains an interactive proof of concept for the sorting part and a non-interactive unit test that builds an index and performs faceting and index lookup. Look in the “exposed” folders in src and test.

About Toke Eskildsen

IT-Developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.