Over 9000 facet fields


In some ways, Lucene’s faceting, Solr’s faceting and our own home-brewed solution SOLR-2412 work the same way: keep a permanent map from document IDs to term IDs. When searching, create a list of counters, one for each term, and count the number of term occurrences.
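
As a minimal sketch of that shared principle (Java-like, with plain int arrays instead of the packed structures the real implementations use, and with made-up names):

static int[] countFacet(int[][] docToTermIDs, int uniqueTerms, int[] matchingDocIDs) {
  // docToTermIDs is the permanent map: the termIDs for each docID
  int[] counters = new int[uniqueTerms]; // one counter per unique term
  for (int docID : matchingDocIDs) {
    for (int termID : docToTermIDs[docID]) {
      counters[termID]++;
    }
  }
  return counters; // counters[termID] == occurrences among the matching documents
}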

One difference is how multiple facets are treated. Both Lucene’s and Solr’s faceting keep independent structures for each facet. This keeps the code clean and the usage flexible: user A can ask for facets F1 & F2, while user B can have facets F2, F3 & F4. When the facet structures for user B’s request are calculated, only F3 & F4 need to be created, as F2 already exists.

Our implementation is a little different as it merges structures. Instead of user A asking for facets F1 & F2, the request is really for facet group G1 (containing fields F1 & F2). User B gets facet group G2 (containing fields F2, F3 & F4). When the facet structures for user B’s request are calculated, the full structure must be created from scratch (I’m lying a bit here as there is some re-use, but we’ll ignore that for clarity).

So what do we win by grouping facet structures? Memory and speed!

Let’s say we have an index with 10M documents and we want 10 facets. The first is heavy, with 50M references and 20M unique values, while the other 9 are small, with 1M references and 200 unique values each.

Memory

The docID->termID maps used by Solr & SOLR-2412 are really multiple lists and a series of indirections. Somewhat simplified, there is docID2refs, a list of pointers, one for each docID, into a list of references. refs is a list of references to term ordinals, with one entry for each term in each document. Everything is bit-packed, so the size of a facet structure is

#docs*log2(#refs) + #refs*log2(#uniqueTerms) bit

On top of that there are counter structures of #uniqueTerms*32 bit per thread, plus miscellaneous housekeeping. For an index with 10M documents, 50M references from documents to terms for a heavy facet field and 20M unique terms, this is 10M*log2(50M) + 50M*log2(20M) bit ~= 30MB + 145MB = 175MB, plus 20M*32 bit = 76MB for each concurrent search thread.
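
To make the indirection concrete, here is a rough sketch of the two lists in Java, with plain int arrays instead of bit-packed values. The names follow the description above, not any actual source file, and the extra end pointer for the last document is an assumption of the sketch:

// docID2refs[docID] points to where that document's references start in refs;
// the next entry marks the end (the sketch stores one extra entry for the last doc).
// refs holds one term ordinal per (document, term) pair.
static int[] termIDsForDoc(int[] docID2refs, int[] refs, int docID) {
  int from = docID2refs[docID];
  int to = docID2refs[docID + 1];
  return java.util.Arrays.copyOfRange(refs, from, to);
}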

Looking at our sample case with Solr’s independent facets, the memory requirement can be estimated by summing: as calculated above, the big facet takes up 175MB. The 9 smaller facets take up 25MB each, with 24MB used for the docID2refs list and 1MB for the refs list. That is a grand total of 400MB, ignoring counter structures.

By grouping facets, as done in SOLR-2412, the memory requirement is

#docs*log2(#refs) + #refs*log2(#uniqueTerms) bit

That is the same formula as for Solr, except that this single structure holds the information for all the fields. The memory requirement is thus 10M*log2(50M+9*1M) + (50M+9*1M)*log2(20M+9*200) bit ~= 31MB + 171MB = 202MB. In reality there are temporary memory overheads and bookkeeping, so a rule of thumb is to allocate heap for 3 times the calculated structure size.
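
As a back-of-the-envelope check, both estimates can be reproduced by plugging the sample numbers into the formula. The class and method names below are mine; the constants come from the example (small differences from the figures above are due to rounding):

public class FacetMemoryEstimate {
  // #docs*log2(#refs) + #refs*log2(#uniqueTerms) bit, returned as MB
  static double structureMB(long docs, long refs, long uniqueTerms) {
    return (docs * log2(refs) + refs * log2(uniqueTerms)) / 8 / 1024 / 1024;
  }
  static double log2(long x) { return Math.log(x) / Math.log(2); }

  public static void main(String[] args) {
    long docs = 10_000_000;
    double independent = structureMB(docs, 50_000_000, 20_000_000) // heavy facet, ~175MB
                       + 9 * structureMB(docs, 1_000_000, 200);    // 9 small facets, ~25MB each
    double grouped = structureMB(docs, 50_000_000 + 9 * 1_000_000,
                                 20_000_000 + 9 * 200);            // all fields in one structure
    System.out.printf("independent: %.0fMB, grouped: %.0fMB%n", independent, grouped);
    // prints approximately "independent: 397MB, grouped: 201MB"
  }
}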

Speed

When a faceted search is requested, the terms in the facets must be counted for all the documents that match the given query. With independent facets, this goes something like this

foreach facet {
  foreach docID {
    foreach termID_for_the_docID {
      counter[facet][termID_for_the_docID]++
    }
  }
}

or we could swap the facet and docID iteration

foreach docID {
  foreach facet {
    foreach termID_for_the_docID {
      counter[facet][termID_for_the_docID]++
    }
  }
}

either way the two outer loops mean that the inner termID-loop is encountered #docIDs*#facets times.

With grouped facets, the iteration is

foreach docID {
  foreach termID_for_the_docID {
    counter[termID_for_the_docID]++
  }
}

which means that the inner termID-loop is encountered #docIDs times.
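
A sketch of the grouped pass in Java, again with plain arrays. The per-facet extraction at the end is my assumption of how the single counter array could be read back, assuming each field owns a contiguous range of the global term ordinals; it illustrates the principle and is not necessarily how SOLR-2412 actually does it:

// One pass over the matching documents fills a single counter array covering
// every field in the group (term ordinals are global across the group).
static int[] countGrouped(int[][] docToGlobalTermIDs, int totalUniqueTerms, int[] matchingDocIDs) {
  int[] counters = new int[totalUniqueTerms];
  for (int docID : matchingDocIDs) {          // #docIDs iterations, no per-facet outer loop
    for (int termID : docToGlobalTermIDs[docID]) {
      counters[termID]++;
    }
  }
  return counters;
}

// Assuming each field owns a contiguous ordinal range, its counts are just a slice.
static int[] countsForField(int[] groupedCounters, int ordinalStart, int ordinalEnd) {
  return java.util.Arrays.copyOfRange(groupedCounters, ordinalStart, ordinalEnd);
}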

In reality, this does not mean that independent faceting is #facets times slower than grouped faceting. Very informal testing on our corpus at Statsbiblioteket, with 11M documents, 100M+ references, 20M+ unique values and 15 facets, showed that independent faceting was about 1½-2 times slower.

5000 and counting

One of the fun things about following mailing lists is the occasional far-out use case that pops up. Two days ago a user presented a problem: he ran out of memory on his 12GB machine when he tried to facet on 5000 fields in his index with 11M documents.

The user likely did not really want to facet on all the fields in each search, but it is a fun experiment to get it to work anyway. Besides, most of the results extend to the more realistic “facet on x arbitrary fields out of 5000” case.

Assuming the modest numbers of 11M documents, 10K references and 200 unique terms for each facet, the memory requirement when using independent faceting is 18MB. Per facet! With 5000 facets that is 85GB in total, ignoring counters and general overhead. No wonder he ran out of memory. Even if he switched to a 128GB machine, a facet call with 100K matching documents would still involve 100K*5K = 500M lookups for termIDs, which would likely take quite some time.

Using grouped faceting, the theoretical memory requirement is 152MB and the number of lookups for termIDs for a 100K matching search is 100K.

It’s over 9000!!!

To put the theory to the test and live up to the meme, a test corpus was created: 10M documents, each with 10 fields:

  • 1 field with 1 out of 18 unique values
  • 7 fields chosen at random from 9000 possible, each with 1 out of 9000 unique values
  • 1 field with 2 random values

The index measured 88GB. Full faceting on the 9002 fields would ideally take 136GB+ if done with independent facets and about 344MB+ if grouped. Aaand we need to multiply this by 3 to get a more realistic estimate of the required memory.

…testing… …testing… Drat! 3 times the ideal memory was not enough. It was upped to 2GB, which worked fine. Each query resulted in ~550K documents, randomly distributed in the index. Response times were as follows:

  • 1st request (facet structure creation): 330 seconds
  • 2nd request: 8 seconds
  • 3rd request: 8 seconds
  • 4th request: 9 seconds
  • 5th request: 8 seconds

The inner faceting took about 600-700ms on each request, except for the first. The remaining 7-8 seconds were used for building the 3.5MB XML response, which contained the top-3 for each facet, for a total of 27K term counts. The hardware was a Xeon E5-2670 with 16GB of free RAM and SSDs (what else?) to hold the index.

Conclusion

While the idea of faceting on 5000 or 9000 fields seems insane at first, it is actually quite doable with modest memory requirements and response times. So technically we succeeded.

However, as stated earlier, this is likely not a real-world use case. The “facet on x arbitrary fields out of 5000” problem is more realistic and quite doable, for example by creating a single facet field with the entries encoded as field_term and performing a search with facet.prefix=field_ for each wanted facet.
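
A possible way to express that workaround with SolrJ is sketched below; the single combined field (here called all_facets) and the author_ prefix are made-up example names, and only the facet.prefix technique itself comes from the text above:

import org.apache.solr.client.solrj.SolrQuery;

public class PrefixFacetSketch {
  public static void main(String[] args) {
    SolrQuery query = new SolrQuery("*:*");
    query.setFacet(true);
    query.addFacetField("all_facets");             // the single combined facet field
    query.setFacetPrefix("all_facets", "author_"); // only count entries for the author facet
    query.setFacetLimit(3);                        // top-3 per facet, as in the test above
    System.out.println(query);                     // the raw request parameters
  }
}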


3 Responses to “Over 9000 facet fields”

  1. jrochkind Says:

    What are the cases where your solution is less performant (time and/or memory) than the standard lucene/solr algorithm?

    Then the question is which cases are more common, right? As far as which should be the standard lucene/solr implementation.

  2. Toke Eskildsen Says:

    The most obvious downside to SOLR-2412, as compared to Solr standard faceting, is the same as the upside: grouping means cumulatively longer startup time when the users want faceting on different fields.

    As for the raw startup times, they are about the same. SOLR-2412 has 2 structure builders, one generally slower and one generally faster than Solr standard. The fast one uses a bit more memory than Solr standard. The classic speed/memory trade-off applies here.

    Lucene faceting is another beast. I have not played enough with it to make any hard statements. In theory the centralized term handling provides faster startup with little to no extra cost for term resolving, as long as we’re not talking distributed search. Under the assumptions that the facet fields are known at build time and that a single index is used, the Lucene faceting principle seems superior.

    Memory-wise, SOLR-2412 is better than or the same as Solr standard for all the cases I have measured, using the low-mem structure builder. I am fairly sure that some cases can be constructed where it uses a bit more memory. Maybe a group with a field with very few unique values that are referenced a lot of times, followed by a field with a lot of unique values that are referenced relatively few times?

    I do not see SOLR-2412 as a serious contender for Solr standard faceting. It is too specialized and would require a lot of cleanup. But I would be happy to see the grouping and the low-weight hierarchical structure ideas applied to either Lucene or Solr standard faceting. Alas, I currently do not have the time to attempt such a feat.

  3. David Smiley Says:

    Thanks for posting Toke! I’m subscribing to your blog now.
