Fast, light, n-level hierarchical faceting

by

We are now used to standard faceting and it works really well for a lot of cases. However, sometimes we have a taxonomy that we want to present. For monkeys this could be something like

Classification
  - Primates (9)
    - Atelidae (2)
      - Atelinae (2)
        - Ateles (2)
    - Cebidae (5)
      - Cebus (2)
        - C. apella (1)
        - C. capucinus (1)
      - Saimiri (3)
        - S. sciureus (3)
    - Cercopithecidae (1)
      - Papio (1)
        - P. anubis (1)

Example data and goal

We consider three documents, each with some multi-level faceting information

Doc 1: A/B/C, D/E/F
Doc 2: A/B/C, A/B/J
Doc 3: A,     D/E,   G/H/I

The multi-level tags are thus

A
A/B/C
A/B/J
D/E
D/E/F
G/H/I

Visually we want all levels to be expanded and counted so that we see

A      (4)
 - B   (3)
   - C (2)
   - J (1)
D      (2)
 - E   (2)
   - F (1)
G      (1)
  - H  (1)
   - I (1)

for a search that matches all 3 documents.

The easy, non-fast and non-light solution

If the documents were expanded, either at index time or facet structure build time, to have the tags

Doc 1: A,     A/B,   A/B/C, D,     D/E,   D/E/F
Doc 2: A,     A/B,   A/B/C, A,     A/B,   A/B/J
Doc 3: A,     D,     D/E,   G,     G/H,   G/H/I

then a standard single-level faceting system would produce the desired output (a small lie as we also need to limit the number of tags returned in a level-aware way). Unfortunately this expansion does not come cheap. In this example we go from 7 tag references to 18 and from 6 unique tags to 10. This is not too bad, but when we have deeper levels and more diverse tags, the overhead increases: Faceting on house location with country/state/city/street/number would result in 5 times the references of non-hierarchical faceting. More references means more processing time and more memory overhead.

The hard, fast, light and purple solution

We do not create new tags from the existing tags

Doc 1: A/B/C, D/E/F
Doc 2: A/B/C, A/B/J
Doc 3: A,     D/E,   G/H/I

Instead we augment each tag in the original list with two pieces of information: The level L in the hierarchy (starting at level 1) and the previous level P that the tag T matches. C is the count from a given search. Our list is thus enriched to the following, where the text in the parentheses explain how the previous level was derived:

L P T      C
1 0 A     (1)   (no previous tag)
3 1 A/B/C (2)   (The previous tag "A" matches only the first level of "A/B/C")
3 2 A/B/J (1)   (The previous tag "A/B/C" matches first and second level of "A/B/J")
2 0 D/E   (1)   (no previous match)
3 2 D/E/F (1)   (The previous tag "D/E" matches both the first and second level of "D/E/F")
3 0 G/H/I (1)   (no previous tag)

As both the level L and the previous level P are at most max-depth, we can represent the numbers in a packed structure, where the extra information for each tag takes up 2 * log2(maxdepth+1) bits. For 10 million tags of a maximum depth of 7, this is 10M * 2 * log(8) bits = 10M * 2 * 3 / 8 bytes = 7 MB. 50 million tags of max depth 30 would take 59 MB of extra overhead.

The tag counts are updated just as for standard single-level faceting. The tags at any given level is those that satisfies L >= level, P < level. The counts for any given level is a sum of the counts from the starting point until the next tag matching the equation. Let’s walk through the example:

At level 1, the equation is L >= 1, P < 1. For tag A this is 1 >= 1, 0 < 1 which clearly matches. Next match is tag D/E and the last is G/H/I (remember that counts are summed between matches). As we know that we’re extracting at level 1, we can split the terms for D/E and G/H/I to get D and G and thus have

A (4)
D (2)
G (1)

At level 2, we use the starting points for the tags from level 1 and the equation is L >= 2, P < 2. Tag A no longer matches, but A/B/C does with 3 >= 2, 1 < 2. Tag A/B/J does not, so its count is added to A/B/C. D/E matches with 2 >= 2, 0 < 2, but D/E/F does not with 3 >= 2, 2 < 2 so it only adds to the count for D/E. Last fit is G/H/I with 3 >= 2, 0 < 2. Extracting the proper level 2 terms the result is expanded to

A   (4)
A/B (3)
D   (2)
D/E (2)
G   (1)
G/H (1)

At level 3, we use the staring points from the tags from level 2 and the equation L >= 3, P < 3. Tag A/B/C still matches with 3 >= 3, 1 < 3 and A/B/J matches too. Tag D/E does not with 2 >= 3, 0 < 2. Next match if D/E/F with 3 >= 3, 2 < 3 and last is G/H/I with 3 >= 3, 0 < 3. Extracting at level 3, the tags are expanded to

A     (4)
A/B   (3)
A/B/C (2)
A/B/J (1)
D     (2)
D/E   (2)
D/E/F (1)
G     (1)
G/H   (1)
G/H/I (1)

For brevity, some devils has been skipped here.

  • To fit sub-tags under the correct tags, a recursive descend in a sub-set of the tags should be used.
  • To avoid re-calculating sums, there should be a counting pass at first.
  • To provide mixed count and custom sorting of tags, an expanded facet query syntax must be provided which works with a custom sorter.

The time spend on calculating the facets is dictated by the query options. Worst case is 100% orthogonal tags, tag sorting by counting and a complete extraction of all results. In that case the tag-list must be iterated as many times as there are levels. For the location-example above, this is 5 times, which coincidentally is the same factor as the tags-explosion for the easy implementation. Back-of-the-envelope, the worst case execution time is thus on par with the easy implementation, while memory overhead is a lot less.

As soon at the tags begin to overlap at some levels, sum count caching kicks in. Coupled with a limit on the number of extracted tags at any given level and the fact that we only need to re-visit the sub-tags for weeded super-tags, the number of passes falls drastically. This brings the expected execution time down to less than the easy method.

Fast (cpu-cycles), light (1 byte/unique tag on top of traditional faceting as a rule of thumb) and … Okay, the purple part was a lie. Now it just needs to be implemented.

That’s all, folks!

About these ads

2 Responses to “Fast, light, n-level hierarchical faceting”

  1. Mikkel Kamstrup Erlandsen Says:

    As always – awesome!

  2. Hierarchical faceting – working code « Software Development at Statsbiblioteket Says:

    […] Software Development at Statsbiblioteket A peekhole into the life of the software development department at the State Library of Denmark « Fast, light, n-level hierarchical faceting […]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

%d bloggers like this: