We are now used to standard faceting and it works really well for a lot of cases. However, sometimes we have a taxonomy that we want to present. For monkeys this could be something like
Classification
- Primates (9)
- Atelidae (2)
- Atelinae (2)
- Ateles (2)
- Cebidae (5)
- Cebus (2)
- C. apella (1)
- C. capucinus (1)
- Saimiri (3)
- S. sciureus (3)
- Cercopithecidae (1)
- Papio (1)
- P. anubis (1)
Example data and goal
We consider three documents, each with some multi-level faceting information
Doc 1: A/B/C, D/E/F Doc 2: A/B/C, A/B/J Doc 3: A, D/E, G/H/I
The multi-level tags are thus
A A/B/C A/B/J D/E D/E/F G/H/I
Visually we want all levels to be expanded and counted so that we see
A (4) - B (3) - C (2) - J (1) D (2) - E (2) - F (1) G (1) - H (1) - I (1)
for a search that matches all 3 documents.
The easy, non-fast and non-light solution
If the documents were expanded, either at index time or facet structure build time, to have the tags
Doc 1: A, A/B, A/B/C, D, D/E, D/E/F Doc 2: A, A/B, A/B/C, A, A/B, A/B/J Doc 3: A, D, D/E, G, G/H, G/H/I
then a standard single-level faceting system would produce the desired output (a small lie as we also need to limit the number of tags returned in a level-aware way). Unfortunately this expansion does not come cheap. In this example we go from 7 tag references to 18 and from 6 unique tags to 10. This is not too bad, but when we have deeper levels and more diverse tags, the overhead increases: Faceting on house location with country/state/city/street/number would result in 5 times the references of non-hierarchical faceting. More references means more processing time and more memory overhead.
The hard, fast, light and purple solution
We do not create new tags from the existing tags
Doc 1: A/B/C, D/E/F Doc 2: A/B/C, A/B/J Doc 3: A, D/E, G/H/I
Instead we augment each tag in the original list with two pieces of information: The level L in the hierarchy (starting at level 1) and the previous level P that the tag T matches. C is the count from a given search. Our list is thus enriched to the following, where the text in the parentheses explain how the previous level was derived:
L P T C 1 0 A (1) (no previous tag) 3 1 A/B/C (2) (The previous tag "A" matches only the first level of "A/B/C") 3 2 A/B/J (1) (The previous tag "A/B/C" matches first and second level of "A/B/J") 2 0 D/E (1) (no previous match) 3 2 D/E/F (1) (The previous tag "D/E" matches both the first and second level of "D/E/F") 3 0 G/H/I (1) (no previous tag)
As both the level L and the previous level P are at most max-depth, we can represent the numbers in a packed structure, where the extra information for each tag takes up 2 * log2(maxdepth+1) bits. For 10 million tags of a maximum depth of 7, this is 10M * 2 * log(8) bits = 10M * 2 * 3 / 8 bytes = 7 MB. 50 million tags of max depth 30 would take 59 MB of extra overhead.
The tag counts are updated just as for standard single-level faceting. The tags at any given level is those that satisfies L >= level, P < level. The counts for any given level is a sum of the counts from the starting point until the next tag matching the equation. Let’s walk through the example:
At level 1, the equation is L >= 1, P < 1. For tag A this is 1 >= 1, 0 < 1 which clearly matches. Next match is tag D/E and the last is G/H/I (remember that counts are summed between matches). As we know that we’re extracting at level 1, we can split the terms for D/E and G/H/I to get D and G and thus have
A (4) D (2) G (1)
At level 2, we use the starting points for the tags from level 1 and the equation is L >= 2, P < 2. Tag A no longer matches, but A/B/C does with 3 >= 2, 1 < 2. Tag A/B/J does not, so its count is added to A/B/C. D/E matches with 2 >= 2, 0 < 2, but D/E/F does not with 3 >= 2, 2 < 2 so it only adds to the count for D/E. Last fit is G/H/I with 3 >= 2, 0 < 2. Extracting the proper level 2 terms the result is expanded to
A (4) A/B (3) D (2) D/E (2) G (1) G/H (1)
At level 3, we use the staring points from the tags from level 2 and the equation L >= 3, P < 3. Tag A/B/C still matches with 3 >= 3, 1 < 3 and A/B/J matches too. Tag D/E does not with 2 >= 3, 0 < 2. Next match if D/E/F with 3 >= 3, 2 < 3 and last is G/H/I with 3 >= 3, 0 < 3. Extracting at level 3, the tags are expanded to
A (4) A/B (3) A/B/C (2) A/B/J (1) D (2) D/E (2) D/E/F (1) G (1) G/H (1) G/H/I (1)
For brevity, some devils has been skipped here.
- To fit sub-tags under the correct tags, a recursive descend in a sub-set of the tags should be used.
- To avoid re-calculating sums, there should be a counting pass at first.
- To provide mixed count and custom sorting of tags, an expanded facet query syntax must be provided which works with a custom sorter.
The time spend on calculating the facets is dictated by the query options. Worst case is 100% orthogonal tags, tag sorting by counting and a complete extraction of all results. In that case the tag-list must be iterated as many times as there are levels. For the location-example above, this is 5 times, which coincidentally is the same factor as the tags-explosion for the easy implementation. Back-of-the-envelope, the worst case execution time is thus on par with the easy implementation, while memory overhead is a lot less.
As soon at the tags begin to overlap at some levels, sum count caching kicks in. Coupled with a limit on the number of extracted tags at any given level and the fact that we only need to re-visit the sub-tags for weeded super-tags, the number of passes falls drastically. This brings the expected execution time down to less than the easy method.
Fast (cpu-cycles), light (1 byte/unique tag on top of traditional faceting as a rule of thumb) and … Okay, the purple part was a lie. Now it just needs to be implemented.
That’s all, folks!
October 5, 2010 at 12:58 pm
As always – awesome!
October 11, 2010 at 12:12 am
[...] Software Development at Statsbiblioteket A peekhole into the life of the software development department at the State Library of Denmark « Fast, light, n-level hierarchical faceting [...]