A week ago an idea of how to do hierarchical faceting was presented. It ended with “Now it just needs to be implemented”. Hubris, of course. The amount of devils running amok was staggering and at times it seemed that even duct tape wouldn’t ensure that the original promises were kept. However, frantic hacking prevailed and the current Lucene 4 based implementation actually looks half-way decent. It would be slightly insulting just to call it a proof of concept now.
To recap, we want hierarchical faceting. Searching for “monkeys” should give us something like
Classification
- Primates (9)
- Atelidae (2)
- Atelinae (2)
- Ateles (2)
- Cebidae (5)
- Cebus (2)
- C. apella (1)
- C. capucinus (1)
- Saimiri (3)
- S. sciureus (3)
- Cercopithecidae (1)
- Papio (1)
- P. anubis (1)
Wishes
Let’s be a little more specific in our requirements. Unless otherwise noted, the current implementation fulfill the following:
- n-level: There should be no hard limit on the depth of the hierarchy. The current implementation works well to at least level 30 and there are some obvious optimizations that could be done to get better performance for deeper levels.
- Diverse levels: There should be no requirements for the documents to specify tags to a common level: Having document 1 state the tag “A/B” and document 2 state “F/G/H/I/J” should work.
- Multi-value: Each document should be capable of stating zero or more hierarchical tags.
- Minimum index-impact: The index should not grow significantly, neither should the tags require extended pre-processing to work. The current implementation requirement is that tags are stated in a field with a delimiter between levels, such as “mytag:foo/bar/zoo”.
- Summing: The reference counts for sub-tags under a given tag should be summed recursively and provided in the faceting result.
- Sorting: It should be possible to control the sorting on the different levels, e.g. sorting alphanumerically on the first level, then by reference count on the second and again alphanumerically of all subsequent levels. The current implementation allows for custom sorting in the inner workings but the interface only provides index-order, locale-based and count-order out of the box.
- Output control: It should be possible to limit the output in different ways as hierarchies tend to grow big. The current implementation allows limiting on the depth (levels) as well as the width (sub-tags under a given tag).
- Filtering: If should be possible to specify thresholds for which tags are relevant. This should be done on a level-basis, so the requirement for level 1 could be a total count of at least 100, while the requirement for level 2 could be 10.
- Starting path: To facilitate a dynamic interface, it should be possible to request a hierarchical tag structure starting at a given point in the hierarchy. This has not been implemented yet.
- Low memory impact: As always, we want the impact on memory to be minimized. This is a fuzzy requirement. The current implementation keeps true to the original premises and adds
2 * log2(maxdepth+1) bitsfor each tag on top of the standard faceting overhead. At extraction time, the temporary memory requirements grow linearly with the number of unique tags to return on all levels. - Fast: Another fuzzy requirement. The current implementation takes
ntime for pre-processing, wherenis the number of tags: All it does is split each tag into sub-parts, count the parts and make a comparison to the previous tag. For constructing the result, it has a worst-case requirement ofn * maxdepth + n * log(n). Some clever caching should be possible, but is is not trivial. - What else? It would be surprising if this list were complete. Suggestions are welcome.
Experimental setup
An index builder were created that takes 3 parameters: The number of documents, the maximum number of tags at any given level for any given document and the maximum number of levels. Each tag is a random character from A-Z. Asking for 3 documents with a maximum of 2 tags and maximum level 3 might give
Doc 1: A, A/F, A/T, A/T/R, A/T/Y, F/G/I Doc 1: Z Doc 2: F/H, F/H/I, B, B/B
For each test, a search was performed that hits all documents and the faceting system extracts a result to XML. This was all on the local machine with direct Java-calls (no web services or similar overhead). The hardware was a Dell M6500 laptop with Intel i7 CPU, PC1333 RAM and SSD.
Some measurements
Some documents, small hierarchy
1,000,000 documents, max tags/level 4, max levels 3.
Request a maximum of 5 tags/level down to level 5 in reference count order.
Hierarchical facet startup time = 1642 ms 18,278 unique tags, 2,383,248 references to tags Mem usage: preFacet=4 MB, postFacet=9 MB Collection and extraction #0 for 1000000 documents in 171 ms Generated XML with 13594 characters
Few documents, wide hierarchy
10,000 documents, max tags/level 10, max levels 4.
Request a maximum of 5 tags/level down to level 5 in reference count order.
Hierarchical facet startup time = 0:06 minutes 438,746 unique tags, 1,478,224 references to tags Mem usage: preFacet=3 MB, postFacet=9 MB Collection and extraction #0 for 10000 documents in 169 ms Generated XML with 70046 characters
Many documents, wide hierarchy, extreme references
10,000,000 documents, max tags/level 10, max levels 4
Request a maximum of 5 tags/level down to level 5 in reference count order.
Hierarchical facet startup time = 4:50 minutes 475,254 unique tags, 1,430,592,254 references to tags. Mem usage: preFacet=26 MB, postFacet=3319 MB Collection and extraction #0 for 10000000 documents in 0:21 minutes
Few documents, deep hierarchy
10,000 documents, max tags/level 3, max levels 15.
Request a maximum of 5 tags/level down to level 5 in reference count order.
Hierarchical facet startup time = 6:57 minutes 50,538,133 unique tags, 52,237,029 references to tags Mem usage: preFacet=61 MB, postFacet=603 MB Collection and extraction #0 for 10000 documents in 0:10 minutes Generated XML with 196739 characters
Shallow, single-tag, uniform
Special index to compare to Solr hierarchical faceting: 260,000 documents, each with a single unique two-level tag: A/1, A/2 … A/10000, B/1, B/2 … Z/10000.
Request unlimited tags down to level 5 (effectively unlimited with this corpus).
Hierarchical facet startup time = 2196 ms 260,000 unique tags, 260,000 references to tags Mem usage: preFacet=2 MB, postFacet=37 MB Collection and extraction #0 for 260000 documents in 1048 ms Collection and extraction #1 for 259630 documents in 884 ms Collection and extraction #2 for 258829 documents in 995 ms Generated XML with 15316147 characters in 320 ms
More realistically, setting the retrieval limit to the top 10 tags we get
Collection and extraction #0 for 260000 documents in 167 ms Collection and extraction #1 for 259912 documents in 76 ms Collection and extraction #2 for 258982 documents in 60 ms Generated XML with 27347 characters in 7 ms
Sample output
<facetresponse xmlns="http://lucene.apache.org/exposed/facet/response/1.0"
query="even:true" hits="9984" countms="5" countcached="false" totalms="34">
<facet name="hierarchical" fields="deep" order="count" maxtags="20"
mincount="0" offset="0" hierarchical="true" potentialtags="89534"
count="7589" totalCount="141989" totaltags="26">
<tag count="320" totalcount="6304" term="Y">
<subtags potentialtags="3987" count="674" totalCount="5984" totaltags="26">
<tag count="17" totalcount="301" term="K">
<subtags potentialtags="222" count="72" totalCount="284" totaltags="26">
<tag count="4" totalcount="22" term="G">
<subtags potentialtags="16" count="18" totalCount="18" totaltags="15">
<tag count="2" totalcount="2" term="A"></tag>
<tag count="2" totalcount="2" term="J"></tag>
<tag count="2" totalcount="2" term="U"></tag>
<tag count="1" totalcount="1" term="B"></tag>
<tag count="1" totalcount="1" term="C"></tag>
</subtags>
</tag>
<tag count="3" totalcount="22" term="Q">
<subtags potentialtags="19" count="19" totalCount="19" totaltags="18">
<tag count="2" totalcount="2" term="Q"></tag>
<tag count="1" totalcount="1" term="A"></tag>
<tag count="1" totalcount="1" term="B"></tag>
<tag count="1" totalcount="1" term="C"></tag>
<tag count="1" totalcount="1" term="D"></tag>
</subtags>
...
Download
Important: This is not usable for production, partly because Lucene 4 is still in development, partly because it was hacked together in a week.
Do a SVN checkout of Lucene trunk and patch it with LUCENE-2369, then fix the errors as Lucene trunk is a moving target (I should really make a compiled version with a command line test tool). The test-code for the hierarchical faceting is located in the aptly named class TestHierarchicalFacets.
Use only for good.
Toke Eskildsen
statsbiblioteket.dk
November 19, 2010 at 4:49 pm
Hi, I’m interested in this. Is the JIRA issue (https://issues.apache.org/jira/browse/LUCENE-2369) really related to this? From what I understand from the issue, it seems that it’s related to something else. How can I test this?
November 19, 2010 at 9:38 pm
The issue is related, but only remotely. It turned out that the building blocks I developed for LUCENE-2369 worked really well for faceting and I got caught in developer frenzy. Next shiny thing that caught my eye was the hierarchical part and here I really should have stopped and started a discussion on the Lucene developer maillist, instead of just adding it to LUCENE-2369.
I am currently contemplating whether to try making this a broader Lucene issue, tweak the code to enhance SOLR-64 or SOLR-792 or maybe have a better look at Bobo-Browse.
You can test hierarchical faceting as I describe under Download. There is only unit-tests, but the test-index building code is very straight forward and easily tweakable. If it causes you problems, I’ll gladly take a look and ensure that the code compiles against trunk (or at least a fixed and easily checkoutable tag) or I could compile a JAR and put it somewhere.
I would be very interested to hear your observations if you get it to work.
November 19, 2010 at 11:53 pm
I’ve updated the patch to work with the current Lucene trunk (make sure to get the patch from nov. 19.). You can get the code, compile and run the rather heavy test of hierarchical faceting by calling the commands below on a Linux/OSX/Unix box. For Windows I think the classpath delimiter should be ;.
svn co http://svn.apache.org/repos/asf/lucene/dev/trunk@1036986 lucene-2369
cd lucene-2369/lucene
wget https://issues.apache.org/jira/secure/attachment/12460069/LUCENE-2369.patch
patch -p0 < LUCENE-2369.patch
ant compile-test
java -cp lib/junit-4.7.jar:build/classes/test/:build/classes/java org.junit.runner.JUnitCore org.apache.lucene.search.exposed.facet.TestHierarchicalFacets
January 7, 2011 at 11:29 am
[...] seems to be a fairly hot topic this year, we plan to hack together a Solr-based proof of concept of our take on the problem before the [...]