Hierarchical faceting – working code

by

A week ago an idea of how to do hierarchical faceting was presented. It ended with “Now it just needs to be implemented”. Hubris, of course. The amount of devils running amok was staggering and at times it seemed that even duct tape wouldn’t ensure that the original promises were kept. However, frantic hacking prevailed and the current Lucene 4 based implementation actually looks half-way decent. It would be slightly insulting just to call it a proof of concept now.

To recap, we want hierarchical faceting. Searching for “monkeys” should give us something like

Classification
  - Primates (9)
    - Atelidae (2)
      - Atelinae (2)
        - Ateles (2)
    - Cebidae (5)
      - Cebus (2)
        - C. apella (1)
        - C. capucinus (1)
      - Saimiri (3)
        - S. sciureus (3)
    - Cercopithecidae (1)
      - Papio (1)
        - P. anubis (1)

Wishes

Let’s be a little more specific in our requirements. Unless otherwise noted, the current implementation fulfill the following:

  • n-level: There should be no hard limit on the depth of the hierarchy. The current implementation works well to at least level 30 and there are some obvious optimizations that could be done to get better performance for deeper levels.
  • Diverse levels: There should be no requirements for the documents to specify tags to a common level: Having document 1 state the tag “A/B” and document 2 state “F/G/H/I/J” should work.
  • Multi-value: Each document should be capable of stating zero or more hierarchical tags.
  • Minimum index-impact: The index should not grow significantly, neither should the tags require extended pre-processing to work. The current implementation requirement is that tags are stated in a field with a delimiter between levels, such as “mytag:foo/bar/zoo”.
  • Summing: The reference counts for sub-tags under a given tag should be summed recursively and provided in the faceting result.
  • Sorting: It should be possible to control the sorting on the different levels, e.g. sorting alphanumerically on the first level, then by reference count on the second and again alphanumerically of all subsequent levels. The current implementation allows for custom sorting in the inner workings but the interface only provides index-order, locale-based and count-order out of the box.
  • Output control: It should be possible to limit the output in different ways as hierarchies tend to grow big. The current implementation allows limiting on the depth (levels) as well as the width (sub-tags under a given tag).
  • Filtering: If should be possible to specify thresholds for which tags are relevant. This should be done on a level-basis, so the requirement for level 1 could be a total count of at least 100, while the requirement for level 2 could be 10.
  • Starting path: To facilitate a dynamic interface, it should be possible to request a hierarchical tag structure starting at a given point in the hierarchy. This has not been implemented yet.
  • Low memory impact: As always, we want the impact on memory to be minimized. This is a fuzzy requirement. The current implementation keeps true to the original premises and adds 2 * log2(maxdepth+1) bits for each tag on top of the standard faceting overhead. At extraction time, the temporary memory requirements grow linearly with the number of unique tags to return on all levels.
  • Fast: Another fuzzy requirement. The current implementation takes n time for pre-processing, where n is the number of tags: All it does is split each tag into sub-parts, count the parts and make a comparison to the previous tag. For constructing the result, it has a worst-case requirement of n * maxdepth + n * log(n). Some clever caching should be possible, but is is not trivial.
  • What else? It would be surprising if this list were complete. Suggestions are welcome.

Experimental setup

An index builder were created that takes 3 parameters: The number of documents, the maximum number of tags at any given level for any given document and the maximum number of levels. Each tag is a random character from A-Z. Asking for 3 documents with a maximum of 2 tags and maximum level 3 might give

Doc 1: A, A/F, A/T, A/T/R, A/T/Y, F/G/I
Doc 1: Z
Doc 2: F/H, F/H/I, B, B/B

For each test, a search was performed that hits all documents and the faceting system extracts a result to XML. This was all on the local machine with direct Java-calls (no web services or similar overhead). The hardware was a Dell M6500 laptop with Intel i7 CPU, PC1333 RAM and SSD.

Some measurements

Some documents, small hierarchy

1,000,000 documents, max tags/level 4, max levels 3.
Request a maximum of 5 tags/level down to level 5 in reference count order.

Hierarchical facet startup time = 1642 ms
18,278 unique tags, 2,383,248 references to tags
Mem usage: preFacet=4 MB, postFacet=9 MB
Collection and extraction #0 for 1000000 documents in 171 ms
Generated XML with 13594 characters

Few documents, wide hierarchy

10,000 documents, max tags/level 10, max levels 4.
Request a maximum of 5 tags/level down to level 5 in reference count order.

Hierarchical facet startup time = 0:06 minutes
438,746 unique tags, 1,478,224 references to tags
Mem usage: preFacet=3 MB, postFacet=9 MB
Collection and extraction #0 for 10000 documents in 169 ms
Generated XML with 70046 characters

Many documents, wide hierarchy, extreme references

10,000,000 documents, max tags/level 10, max levels 4
Request a maximum of 5 tags/level down to level 5 in reference count order.

Hierarchical facet startup time = 4:50 minutes
475,254 unique tags,  1,430,592,254 references to tags.
Mem usage: preFacet=26 MB, postFacet=3319 MB
Collection and extraction #0 for 10000000 documents in 0:21 minutes

Few documents, deep hierarchy

10,000 documents, max tags/level 3, max levels 15.
Request a maximum of 5 tags/level down to level 5 in reference count order.

Hierarchical facet startup time = 6:57 minutes
50,538,133 unique tags, 52,237,029 references to tags
Mem usage: preFacet=61 MB, postFacet=603 MB
Collection and extraction #0 for 10000 documents in 0:10 minutes
Generated XML with 196739 characters

Shallow, single-tag, uniform

Special index to compare to Solr hierarchical faceting: 260,000 documents, each with a single unique two-level tag: A/1, A/2 … A/10000, B/1, B/2 … Z/10000.
Request unlimited tags down to level 5 (effectively unlimited with this corpus).

Hierarchical facet startup time = 2196 ms
260,000 unique tags, 260,000 references to tags
Mem usage: preFacet=2 MB, postFacet=37 MB
Collection and extraction #0 for 260000 documents in 1048 ms
Collection and extraction #1 for 259630 documents in 884 ms
Collection and extraction #2 for 258829 documents in 995 ms
Generated XML with 15316147 characters in 320 ms

More realistically, setting the retrieval limit to the top 10 tags we get

Collection and extraction #0 for 260000 documents in 167 ms
Collection and extraction #1 for 259912 documents in 76 ms
Collection and extraction #2 for 258982 documents in 60 ms
Generated XML with 27347 characters in 7 ms

Sample output

<facetresponse xmlns="http://lucene.apache.org/exposed/facet/response/1.0" 
  query="even:true" hits="9984" countms="5" countcached="false" totalms="34">
  <facet name="hierarchical" fields="deep" order="count" maxtags="20" 
  mincount="0" offset="0" hierarchical="true" potentialtags="89534" 
  count="7589" totalCount="141989" totaltags="26">
      <tag count="320" totalcount="6304" term="Y">
        <subtags potentialtags="3987" count="674" totalCount="5984" totaltags="26">
          <tag count="17" totalcount="301" term="K">
            <subtags potentialtags="222" count="72" totalCount="284" totaltags="26">
              <tag count="4" totalcount="22" term="G">
                <subtags potentialtags="16" count="18" totalCount="18" totaltags="15">
                  <tag count="2" totalcount="2" term="A"></tag>
                  <tag count="2" totalcount="2" term="J"></tag>
                  <tag count="2" totalcount="2" term="U"></tag>
                  <tag count="1" totalcount="1" term="B"></tag>
                  <tag count="1" totalcount="1" term="C"></tag>
                </subtags>
              </tag>
              <tag count="3" totalcount="22" term="Q">
                <subtags potentialtags="19" count="19" totalCount="19" totaltags="18">
                  <tag count="2" totalcount="2" term="Q"></tag>
                  <tag count="1" totalcount="1" term="A"></tag>
                  <tag count="1" totalcount="1" term="B"></tag>
                  <tag count="1" totalcount="1" term="C"></tag>
                  <tag count="1" totalcount="1" term="D"></tag>
                </subtags>
...

Download

Important: This is not usable for production, partly because Lucene 4 is still in development, partly because it was hacked together in a week.

Do a SVN checkout of Lucene trunk and patch it with LUCENE-2369, then fix the errors as Lucene trunk is a moving target (I should really make a compiled version with a command line test tool). The test-code for the hierarchical faceting is located in the aptly named class TestHierarchicalFacets.

Use only for good.
Toke Eskildsen
statsbiblioteket.dk

About these ads

4 Responses to “Hierarchical faceting – working code”

  1. peeters Says:

    Hi, I’m interested in this. Is the JIRA issue (https://issues.apache.org/jira/browse/LUCENE-2369) really related to this? From what I understand from the issue, it seems that it’s related to something else. How can I test this?

  2. Toke Eskildsen Says:

    The issue is related, but only remotely. It turned out that the building blocks I developed for LUCENE-2369 worked really well for faceting and I got caught in developer frenzy. Next shiny thing that caught my eye was the hierarchical part and here I really should have stopped and started a discussion on the Lucene developer maillist, instead of just adding it to LUCENE-2369.

    I am currently contemplating whether to try making this a broader Lucene issue, tweak the code to enhance SOLR-64 or SOLR-792 or maybe have a better look at Bobo-Browse.

    You can test hierarchical faceting as I describe under Download. There is only unit-tests, but the test-index building code is very straight forward and easily tweakable. If it causes you problems, I’ll gladly take a look and ensure that the code compiles against trunk (or at least a fixed and easily checkoutable tag) or I could compile a JAR and put it somewhere.

    I would be very interested to hear your observations if you get it to work.

  3. Toke Eskildsen Says:

    I’ve updated the patch to work with the current Lucene trunk (make sure to get the patch from nov. 19.). You can get the code, compile and run the rather heavy test of hierarchical faceting by calling the commands below on a Linux/OSX/Unix box. For Windows I think the classpath delimiter should be ;.

    svn co http://svn.apache.org/repos/asf/lucene/dev/trunk@1036986 lucene-2369
    cd lucene-2369/lucene
    wget https://issues.apache.org/jira/secure/attachment/12460069/LUCENE-2369.patch
    patch -p0 < LUCENE-2369.patch
    ant compile-test
    java -cp lib/junit-4.7.jar:build/classes/test/:build/classes/java org.junit.runner.JUnitCore org.apache.lucene.search.exposed.facet.TestHierarchicalFacets

  4. Going to code4lib 2011 « Software Development at Statsbiblioteket Says:

    [...] seems to be a fairly hot topic this year, we plan to hack together a Solr-based proof of concept of our take on the problem before the [...]

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Follow

Get every new post delivered to your Inbox.

%d bloggers like this: