You are faceting it wrong

As hinted in the previous post “Over 9000 facet fields”, faceting on many fields is tricky business. Read on if you are using Solr with faceting and use an inordinate amount of memory to do so.

Clarification

This post skips a lot of the nitty-gritty details and tries to present its case with a lot of examples. It is aimed at people who have a bit of experience with running a Solr server.

There are different ways of doing basic faceting with stock Solr: enum (few unique values), field cache (many references, many values) and DocValues (new in Solr 4, not explored here). In this post we focus on index-wide, field cache based faceting of string values on multi-value fields. Don’t worry about all the adjectives: this is what we normally use when we do a faceted search.

On the first facet call on a given field in Solr, an in-memory structure is generated. Counterintuitively, it is fairly cheap to facet on a lot of unique values: Faceting on a field with 10M unique values on an index of 1M documents only uses about 200-300MB, depending on the number of concurrent requests. Just as counterintuitively, each document adds to the size of the facet structure, even when it holds no values for the facet field.

The problem with 100 facet fields

Solr treats each facet independently: To get the total amount of memory used for faceting, just sum the amounts used by each facet.

Let’s look at a relatively small index to start with: Faceting on a single field for 1M documents, 5000 references and 100 unique values takes up only 1.6MB theoretically (in reality we need to multiply this by 3 or more). Faceting on 100 of those can be done with less than 1GB of heap.

Even when the number of references and unique values rises drastically, the heap requirements stay modest: 1M documents, 5M references and 5M unique values takes up 17MB for a facet. Faceting on 100 of those is still within the capabilities of a moderate machine.

Increasing the number of documents has more impact. Faceting on a field for an index with 10 times the documents from above, for a total of 10M documents, increases heap requirements from 17MB to 41MB per facet. With 100 facets, we are deep in “we really need to tune the garbage collector to avoid erratic response times”-land at 4GB theoretically and around 15GB in reality. Having 50M documents means allocating 15GB of heap (which in reality is more like 50GB) for the 100 facets.
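The arithmetic behind these totals can be sketched in a few lines of Python. The per-facet sizes are the theoretical figures quoted in this post. The function name and the overhead factor of 3 are assumptions made for illustration (the post only says "3 or more"), so the "in practice" numbers are a lower bound rather than a measurement.

```python
def total_facet_heap_mb(per_facet_mb, facet_count, overhead_factor=1.0):
    """Sum the heap used by facet_count equally sized facets, in MB.

    Solr treats each facet independently, so the total is a plain sum.
    overhead_factor is an assumed multiplier for real-world usage.
    """
    return per_facet_mb * facet_count * overhead_factor


# Theoretical per-facet sizes quoted in this post, in MB.
scenarios = {
    "1M docs, 5000 refs, 100 values": 1.6,
    "1M docs, 5M refs, 5M values": 17,
    "10M docs, 5M refs, 5M values": 41,
}

for name, per_facet_mb in scenarios.items():
    theoretical = total_facet_heap_mb(per_facet_mb, 100)
    practical = total_facet_heap_mb(per_facet_mb, 100, overhead_factor=3)
    print(f"{name}: {theoretical:.0f}MB theoretical, "
          f"{practical:.0f}MB or more in practice")
```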

The worst offender is of course the number of facets, as the impact rises linearly with it. It is important to remember that the number of values in the facets matters very little: 500 tiny facets have vastly more impact than 5 really large ones. Besides the heap requirement, processing time also rises linearly with facet count (all else being equal, which it never is).

Faceting on 100 fields, all at the same time

If you really need to facet on all fields at the same time, it is hard to avoid either switching facet implementation or taking the substantial performance and memory hits. See “Over 9000 facet fields” for details.

However, this is really a fringe use case. It might be used for statistics extraction or such, but a more common use case is…

Faceting on 100 fields, a few facets at a time

If a product catalogue has multiple diverse product lines or an index is otherwise shared among separate entities, it often makes sense from a user interface perspective to have a plethora of facets available, but only use a few at a time. While the latent facets impose no extra performance penalty, they do take up just as much heap as if they were all active at the same time.

As previously established, having a lot of references and unique values is a lot cheaper than having many facets. One way of handling the many facets is thus to collapse them into a single facet.

Suppose we have an index with 50M documents and 100 small facets, each with 5000 references to 100 unique values. They each take up 77MB for a total of 7.6GB of heap (multiply by 3 to get real world numbers). A document might have values like this:

facet1:valueA
facet1:valueB
facet2:valueA
facet3:valueC

We collapse this to a single field by prefixing the values with the former field names:

collapsed:facet1/valueA
collapsed:facet1/valueB
collapsed:facet2/valueA
collapsed:facet3/valueC

With the collapsing, we have 1 field, 50M documents, 100*5000 references and 100*100 unique values. The needed amount of heap for the whole shebang is 114MB (or less than 0.5GB in real world numbers).
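The collapsing step could be done at index time, before the document is sent to Solr. Below is a minimal sketch in Python, assuming documents are plain dictionaries. The function name, the "collapsed" field name and the "/" separator are taken from the example above or made up for illustration. The separator must not occur in any of the original field names, or the prefixes become ambiguous.

```python
def collapse_facets(doc, facet_fields, collapsed_field="collapsed", separator="/"):
    """Return a copy of doc where the given facet fields are merged into a
    single multi-valued field, each value prefixed with its former field name."""
    collapsed = []
    result = {}
    for field, values in doc.items():
        if field in facet_fields:
            # Accept both a single value and a list of values.
            if not isinstance(values, (list, tuple)):
                values = [values]
            collapsed.extend(f"{field}{separator}{value}" for value in values)
        else:
            # Non-facet fields are passed through untouched.
            result[field] = values
    if collapsed:
        result[collapsed_field] = collapsed
    return result


doc = {
    "id": "doc1",
    "facet1": ["valueA", "valueB"],
    "facet2": ["valueA"],
    "facet3": ["valueC"],
}
print(collapse_facets(doc, {"facet1", "facet2", "facet3"}))
```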

To perform a search with a wanted facet, we apply the parameter facet.prefix, which limits the terms on which to facet to those starting with the given string prefix. At the GUI end, the field name prefix must be removed before the values are shown to the user.
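As a rough sketch of both halves, the Python below builds the request parameters for one of the former facets and strips the prefix from the returned terms. The parameters q, facet, facet.field and facet.prefix are standard Solr parameters. The helper names are hypothetical, and the response handling assumes the facet counts arrive as a flat list of alternating terms and counts, which is what Solr's JSON response writer produces by default.

```python
from urllib.parse import urlencode


def facet_query_string(query, facet_name, collapsed_field="collapsed", separator="/"):
    """Build a query string that facets only on the values of one former
    facet field inside the collapsed field."""
    return urlencode([
        ("q", query),
        ("facet", "true"),
        ("facet.field", collapsed_field),
        ("facet.prefix", f"{facet_name}{separator}"),
    ])


def strip_facet_prefix(flat_counts, facet_name, separator="/"):
    """Turn [term, count, term, count, ...] into {value: count} with the
    former field name removed from each term."""
    prefix = f"{facet_name}{separator}"
    pairs = zip(flat_counts[::2], flat_counts[1::2])
    return {
        term[len(prefix):]: count
        for term, count in pairs
        if term.startswith(prefix)
    }


print(facet_query_string("*:*", "facet1"))
print(strip_facet_prefix(["facet1/valueA", 12, "facet1/valueB", 3], "facet1"))
```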

One facet per client

One use case presented on the Solr mailing list involves an index shared between 180 clients, each with their own documents and each doing search and faceting on their own documents only. Each client has a facet field dedicated to them, so this is a case of faceting on 180 fields, one facet at a time.

From a business logic perspective this setup makes perfect sense. Unfortunately Solr threw frequent Out Of Memory errors with a heap of 25GB. Even worse, an expected quadrupling of the number of clients means that both the number of facets and the number of documents will quadruple, resulting in future memory requirements of 10 times 25GB.

As the searches for each client are restricted to their own data by filtering (guessing a bit here), so that there is no chance of mixing values, there is no reason for using the prefix trick from above. The solution is simply that all clients share the same facet field instead of having individual ones.
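A request against such a shared field might look like the sketch below. It assumes, as guessed above, that each document carries some client identifier that can be used in a filter query. The field names client_id and shared_facet and the helper name are hypothetical; fq, facet and facet.field are standard Solr parameters.

```python
from urllib.parse import urlencode


def client_facet_query_string(query, client_id,
                              client_field="client_id",
                              facet_field="shared_facet"):
    """Build a query string that restricts the search to one client's
    documents and facets on the field shared by all clients."""
    return urlencode([
        ("q", query),
        # The filter query keeps other clients' documents, and thus their
        # facet values, out of the result.
        ("fq", f"{client_field}:{client_id}"),
        ("facet", "true"),
        ("facet.field", facet_field),
    ])


print(client_facet_query_string("*:*", 42))
```

With this setup only one facet structure is built, no matter how many clients share the index.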

A different solution would be to dedicate a shard to each client. This would have the added bonus of making relevance ranking for a single customer unaffected by the other customers’ data. But that is another story.