<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Software Development at Statsbiblioteket</title>
	<atom:link href="http://sbdevel.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://sbdevel.wordpress.com</link>
	<description>A peekhole into the life of the software development department at the State Library of Denmark</description>
	<lastBuildDate>Tue, 20 Dec 2011 11:28:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='sbdevel.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Software Development at Statsbiblioteket</title>
		<link>http://sbdevel.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://sbdevel.wordpress.com/osd.xml" title="Software Development at Statsbiblioteket" />
	<atom:link rel='hub' href='http://sbdevel.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Fire fire everywhere</title>
		<link>http://sbdevel.wordpress.com/2011/09/21/fire-fire-everywhere/</link>
		<comments>http://sbdevel.wordpress.com/2011/09/21/fire-fire-everywhere/#comments</comments>
		<pubDate>Wed, 21 Sep 2011 13:13:10 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Summa]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=1009</guid>
		<description><![CDATA[Searching at Statsbiblioteket has been slow the last couple of days and the condition has grown progressively worse. Yesterday evening and this morning, using the system required the patience of a parent. Consequently the full tech stack congregated at the office (maintenance BOFH, backend bit-fiddler, web service hacker and service glue guy) hell-bent on killing [...]]]></description>
			<content:encoded><![CDATA[<p>Searching at Statsbiblioteket has been slow the last couple of days and the condition has grown progressively worse. Yesterday evening and this morning, using the system required the patience of a parent. Consequently the full tech stack congregated at the office (maintenance BOFH, backend bit-fiddler, web service hacker and service glue guy) hell-bent on killing bugs.</p>
<p>A hastily cobbled together log propagation thingamabob claimed that the backend answered snappily enough, network inspection showed a very high number of requests to storage (a database that contains the records backing the search) and timeouts &amp; session pool overflows were all over the place. The DidYouMean service was appointed scapegoat and killed with extreme prejudice.</p>
<p>Things got exponentially worse! Uptime was about 2 minutes, with performance in the last minute quickly falling off to unusable. Phones started ringing, emails ticked in and an angry mob with pitchforks laid siege to the office. Inspection revealed that with DidYouMean dead, the service layer spent 20 seconds unsuccessfully trying to get a result (yes, that timeout was far too high) before giving up, quickly filling the Apache session pools. DidYouMean was resurrected, services started up again and all was well. Or at least back to where it was before the wrong service was unfairly executed.</p>
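<p>For illustration, here is a minimal sketch of bounding a call to a slow or dead service with a short timeout and a fallback, so that callers are released quickly instead of holding a session for 20 seconds. This is not the actual service layer code: the class name, the method and the empty fallback value are assumptions, and a real service would reuse a shared executor rather than create one per call.</p>

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; the names and the fallback value are assumptions.
class TimeoutSketch {
    // Runs the call, but gives up after timeoutMs and returns the fallback instead.
    static String callWithTimeout(Callable<String> call, long timeoutMs, String fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(call);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow call
            return fallback;
        } catch (Exception e) {
            return fallback; // failed or interrupted: degrade instead of blocking
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Simulates a dead suggestion service that never answers in time.
        String suggestion = callWithTimeout(() -> {
            Thread.sleep(5000);
            return "did you mean: fire";
        }, 200, "");
        System.out.println("suggestion='" + suggestion + "'");
    }
}
```

<p>With a 200 ms limit the search result can be delivered without suggestions, which is a far smaller problem than a full session pool.</p>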
<div id="attachment_1011" class="wp-caption center" style="width: 310px"><a href="http://sbdevel.files.wordpress.com/2011/09/20110921-0844_madsstaaropogkoder_e.jpg"><img src="http://sbdevel.files.wordpress.com/2011/09/20110921-0844_madsstaaropogkoder_e.jpg?w=300&#038;h=224" alt="Mads coding" title="Mads stands and codes" width="300" height="224" class="size-medium wp-image-1011" /></a><p class="wp-caption-text">Mads stands up and codes. A sure sign of high alert</p></div>
<p>Waiting for the next hammer to drop, code was reviewed (again), pool sizes were tweaked and logs were watched intensely. At 12:09:47 and 952 milliseconds, the impact riveter started again and storage staggered. But lo and behold: The maintenance guy had changed log levels to DEBUG (for a limited amount of time). An hour and 20,000 observed requests for the exact same ID later, the magical incantation <code>i++;</code> was inserted in a while loop. Testing, deployment, re-deployment, Tomcat restart and another Tomcat restart followed quickly.</p>
<p>It turned out that certain rare records triggered the endless loop. The progressively worse performance stemmed from more and more of these edge cases piling up, each looping forever. As the overwhelmed storage was on the same server as the searcher, the shared request pool was flooded with storage requests, only occasionally allowing search requests.</p>
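<p>The bug class is easy to reproduce. The following is a minimal sketch, not the actual Summa code: the class, the method and the skip-empty-parts condition are made up for illustration. It shows a <code>while</code> loop where the rare path needs its own <code>i++;</code> to avoid spinning forever on the same element.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; names and the skip condition are assumptions, not Summa code.
class LoopFixSketch {
    // Collects the non-empty parts of a record.
    static List<String> nonEmptyParts(String[] parts) {
        List<String> result = new ArrayList<>();
        int i = 0;
        while (i < parts.length) {
            if (parts[i].isEmpty()) {
                // Rare case: without this increment the loop would never advance
                // past the empty part and would spin forever.
                i++;
                continue;
            }
            result.add(parts[i]);
            i++;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(nonEmptyParts(new String[] {"a", "", "b"})); // prints [a, b]
    }
}
```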
<p>With the roaring fire quelled, business returned to normal. By pure coincidence, the assignment for the next few days is vastly improved propagation, logging and visualisation of timing information throughout the services.</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2011/09/21/fire-fire-everywhere/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2011/09/20110921-0844_madsstaaropogkoder_e.jpg?w=300" medium="image">
			<media:title type="html">Mads stands and codes</media:title>
		</media:content>
	</item>
		<item>
		<title>The right tool &#8211; neo4j?</title>
		<link>http://sbdevel.wordpress.com/2011/05/06/the-right-tool-neo4j/</link>
		<comments>http://sbdevel.wordpress.com/2011/05/06/the-right-tool-neo4j/#comments</comments>
		<pubDate>Fri, 06 May 2011 14:10:45 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[eskildsen]]></category>
		<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Low-level]]></category>
		<category><![CDATA[Summa]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=989</guid>
		<description><![CDATA[Background We use a backing storage for documents in our home brewed search system Summa. It was supposed to be a trivial key-value store with document-IDs resolving to documents. Then an evil librarian pointed out that books are related to each other, so we had to add some sort of relational mapping. For some years [...]]]></description>
			<content:encoded><![CDATA[<h2>Background</h2>
<p>We use a backing storage for documents in our home-brewed search system Summa. It was supposed to be a trivial key-value store with document-IDs resolving to documents. Then an evil librarian pointed out that books are related to each other, so we had to add some sort of relational mapping. For some years we have used relational databases for this, going from <a href="http://www.postgresql.org/">PostgreSQL</a> through <a href="http://db.apache.org/derby/">Derby</a> and landing on <a href="http://www.h2database.com/html/main.html">H2</a>, which has served us fairly well. Documents with relations were a bit slow to resolve, but they made up less than 5% of the full corpus, so for all practical purposes they presented no problem.</p>
<p>Fast forward to a week ago, when we added a new target to our integrated search: one million fresh documents, a 10% increase in total document count. Unfortunately most of these documents were related to each other, increasing our total relation count by 2000%. Our full indexing time soared from &#8220;It&#8217;ll be done before noon&#8221; to &#8220;I hope it finishes before tomorrow&#8221;. Ouch!</p>
<p>A long discussion about database design and indexes followed. The conclusion was that we really did not use H2 for what it was good at and that maybe we should look at a graph-oriented database.</p>
<h2>Implementing storage with neo4j</h2>
<p><a href="http://neo4j.org/">Neo4j</a> is an open source graph database. It is written in Java and can be used as an embedded application. This is important for us as we like our packages to be self-contained. Friday is work-on-whatever-project-you-think-can-benefit-Statsbiblioteket-day, so I dedicated it to trying out Neo. Keep in mind that I only heard about Neo this morning and that I hadn&#8217;t looked at a single page about the product before I started the project.</p>
<ul>
<li><b>10:36</b> Created test project</li>
<li><b>10:40</b> Selected Neo4j version 1.3 stable Enterprise, added to Maven POM, downloaded files</li>
<li><b>10:50</b> Created skeleton class for Summa NeoStorage</li>
<li><b>10:55</b> Added basic properties, created skeleton Unit test</li>
<li><b>11:15</b> Added code for flushing Summa Record to Neo</li>
<li><b>11:36</b> Added code for retrieving a previously flushed record + unit test. Hello world completed</li>
<li><b>11:40</b> Break finished, started on ModificationTime retrieval</li>
<li><b>12:35</b> Proof of concept retrieval using modification time (slow development due to human error)</li>
<li><b>13:20</b> Finished bulk ingest and record-by-record extraction using modification time order, including unit test</li>
<li><b>13:50</b> Finished mapping of Summa Record child-relations to Neo, both for ingest and extraction</li>
</ul>
<p>Including mistakes, distractions from colleagues and reading documentation, it took under 4 hours to integrate Neo as the new backend for our storage. There are still a lot of minor things to add and special cases to cover, but the result is complete enough to be used for most workflows in Summa.</p>
<p>Implementation was a breeze, the API was very clean and the examples and guides at the website were to the point and well thought out. Kudos to the developers for great work!</p>
<h2>Performance peek</h2>
<p>Since the Neo Storage isn&#8217;t finished, it would be unfair to compare it to H2 with regard to ingest. However, the extraction part is complete enough to test.</p>
<p>With the old H2-backed storage, our extraction speed fell below 5 records/sec for the new documents with 2-4 relations each (extraction speed for records without relations is 2000-3000 records/sec).</p>
<p>Creating a test storage using Neo with 100,000 documents, each with 5 children, changed the extraction speed to 2,500 expanded records/second, or 15,000 raw records/second if we count the children.</p>
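<p>To make the two numbers comparable: an expanded record is a parent delivered together with its children, so each expanded record in this test corresponds to 1 + 5 = 6 raw records. A small sketch of that bookkeeping follows, using a plain in-memory map as a stand-in for the storage. The class and method names are hypothetical and this is not the Neo4j API or the Summa code.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a storage with parent-child relations; not Summa or Neo4j code.
class ExpandSketch {
    private final Map<String, List<String>> children = new HashMap<>();

    void addChild(String parentId, String childId) {
        children.computeIfAbsent(parentId, k -> new ArrayList<>()).add(childId);
    }

    // Returns the parent followed by its direct children: one "expanded record".
    List<String> expand(String parentId) {
        List<String> expanded = new ArrayList<>();
        expanded.add(parentId);
        expanded.addAll(children.getOrDefault(parentId, new ArrayList<>()));
        return expanded;
    }

    public static void main(String[] args) {
        ExpandSketch storage = new ExpandSketch();
        for (int c = 0; c < 5; c++) {
            storage.addChild("doc1", "doc1-child" + c);
        }
        int rawPerExpanded = storage.expand("doc1").size(); // 1 parent + 5 children
        // 2500 expanded records/second times 6 raw records each
        System.out.println(2500 * rawPerExpanded + " raw records/second");
    }
}
```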
<p>Granted, the only fair test is with production data, but so far Neo4j looks like a clear winner for our purposes!</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2011/05/06/the-right-tool-neo4j/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>
	</item>
		<item>
		<title>Hierarchical faceting in Solr</title>
		<link>http://sbdevel.wordpress.com/2011/03/09/hierarchical-faceting-solr/</link>
		<comments>http://sbdevel.wordpress.com/2011/03/09/hierarchical-faceting-solr/#comments</comments>
		<pubDate>Wed, 09 Mar 2011 12:11:49 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[eskildsen]]></category>
		<category><![CDATA[Faceting]]></category>
		<category><![CDATA[Low-level]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=983</guid>
		<description><![CDATA[Solr already has SOLR-64 which does hierarchical faceting and SOLR-792 which does pivot faceting. A few minutes ago, I uploaded SOLR-2412 which does hierarchical faceting. What&#8217;s the big idea? SOLR-2412 is a fairly thin wrapper around LUCENE-2369. LUCENE-2369 was designed with the clear trade-offs * Slow startup * Low memory overhead * Fast response with [...]]]></description>
			<content:encoded><![CDATA[<p>Solr already has <a href="https://issues.apache.org/jira/browse/SOLR-64">SOLR-64</a> which does hierarchical faceting and <a href="https://issues.apache.org/jira/browse/SOLR-792">SOLR-792</a> which does pivot faceting. A few minutes ago, I uploaded <a href="https://issues.apache.org/jira/browse/SOLR-2412">SOLR-2412</a> which does hierarchical faceting. What&#8217;s the big idea?</p>
<p>SOLR-2412 is a fairly thin wrapper around <a href="https://issues.apache.org/jira/browse/LUCENE-2369">LUCENE-2369</a>. LUCENE-2369 was designed with the clear trade-offs</p>
<p>  * Slow startup<br />
  * Low memory overhead<br />
  * Fast response</p>
<p>with the archetypal usage scenario being a large index containing one or more rich hierarchies that is batch-updated every night (see <a href="https://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/">Hierarchical faceting &#8211; working code</a> for more details). At the risk of misrepresenting them, SOLR-64 and SOLR-792 were created from a feature standpoint, with performance characteristics being secondary.</p>
<p>Feature-wise, SOLR-2412 (let&#8217;s call it <em>Exposed</em> faceting from now on) differs markedly from pivot faceting (SOLR-792) at this time, as neither of the two solutions can do what the other one does. However, I feel confident that Exposed faceting can be tweaked to do pivot faceting later on. The main reason to use Exposed over SOLR-792 would be to change trade-offs.</p>
<p>Compared to SOLR-64, Exposed faceting&#8217;s features differ primarily by supporting multiple paths per document: A product belonging to multiple categories, multiple locations for a bus route and so on.</p>
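<p>As a rough illustration of what multiple paths per document means for counting, here is a naive sketch. It is not the SOLR-2412 or LUCENE-2369 code and says nothing about their performance; the class name, the <code>/</code> path separator and the choice to count each hierarchy node once per document are assumptions made for the example.</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Naive illustration of multi-path hierarchical counting; not the SOLR-2412 code.
class MultiPathFacetSketch {
    // Each document carries a list of paths such as "Electronics/Audio".
    static Map<String, Integer> count(List<List<String>> docs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (List<String> paths : docs) {
            // Collect every prefix of every path, once per document,
            // so a document with two paths under "A" still counts once for "A".
            Set<String> prefixes = new HashSet<>();
            for (String path : paths) {
                int slash = path.indexOf('/');
                while (slash != -1) {
                    prefixes.add(path.substring(0, slash));
                    slash = path.indexOf('/', slash + 1);
                }
                prefixes.add(path);
            }
            for (String prefix : prefixes) {
                counts.merge(prefix, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("Electronics/Audio", "Gifts/Music"), // one product, two categories
            Arrays.asList("Electronics/Video"));
        System.out.println(count(docs));
    }
}
```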
<p>The next step is to create a test bed for doing performance measurements on Exposed vs. Solr&#8217;s different faceting implementations. Naturally the hoped-for outcome is that Exposed is markedly better under the defined trade-offs.</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2011/03/09/hierarchical-faceting-solr/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>
	</item>
		<item>
		<title>Virtual Integrated Search</title>
		<link>http://sbdevel.wordpress.com/2011/02/04/virtual-integrated-search/</link>
		<comments>http://sbdevel.wordpress.com/2011/02/04/virtual-integrated-search/#comments</comments>
		<pubDate>Fri, 04 Feb 2011 11:33:39 +0000</pubDate>
		<dc:creator>villadsen</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Low-level]]></category>
		<category><![CDATA[c4l11]]></category>
		<category><![CDATA[well11]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=970</guid>
		<description><![CDATA[For a while it seemed that Integrated Search with a nice Discovery Interface coupled with a large Data Well was the answer to how libraries were going to let users find and use the multitudes of material they have access to. Many different places have tried building their own data well (sometimes a large national [...]]]></description>
			<content:encoded><![CDATA[<p>For a while it seemed that Integrated Search with a nice Discovery Interface coupled with a large Data Well was the answer to how libraries were going to let users find and use the multitudes of material they have access to.</p>
<p>Many different places have tried building their own data well (sometimes a large national data well) but most have given up. Why? Primarily because of the unwillingness of publishers to hand over data to every single data well but also because the very concept of a very large local data well has been made (at least somewhat) obsolete by the new &#8220;Web Scale Discovery&#8221; tools &#8211; such as Summon, Primo Central and EBSCOHost.</p>
<p>Some libraries are fine with using the standard discovery interfaces that these services provide and some would rather use their APIs and build their own interface on top, perhaps adding tight integration with their library system and other locally developed systems.</p>
<p>However, this area is fast-moving, and just as it looked as if pretty much all metadata would eventually be available in all these systems, EBSCO has <a href="http://pegasuslibrarian.com/2011/01/heads-they-win-tales-we-lose-discovery-tools-will-never-deliver-on-their-promise.html">decided not to allow their metadata to be indexed outside of their own system</a>. They will, however, allow &#8220;just-in-time&#8221; searches. It appears as if the market is fragmenting back into federated search &#8211; and the problems with federated search are well-known and are what made us all pursue integrated search to begin with.</p>
<p>But can we do better this time around? Fortunately the answer is yes &#8211; if we can get a little help from our friends.</p>
<p>There are two main problems with federated search:</p>
<ul>
<li>Response times: The entire federated system is only as fast as the slowest search node.</li>
<li>Merging: There is no meaningful way to merge different result sets as they can have vastly different sorting criteria.</li>
</ul>
<p>We can&#8217;t do a lot about the problems with response times, but fortunately the new systems are <em>a lot</em> faster than the old ones, so hopefully it won&#8217;t be that big an issue.</p>
<p>Merging has, however, gotten easier, strangely enough as a side effect of making the sorting of individual results more complicated. The magic here lies in relevancy ranking and the fact that pretty much all new systems are based on the same principles and code base (i.e. Lucene/Solr).</p>
<p>So how does this work? The relevancy ranking of a given document in a query is based on different things but a major contributor is the <a href="http://en.wikipedia.org/wiki/Tf%E2%80%93idf">term frequency–inverse document frequency</a>.</p>
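<p>As a reminder of what that measure is, here is a sketch of one common textbook variant of tf-idf, with tf as the raw term count and idf as ln(N / df). It is an illustration only: the class and method names are made up, and Lucene&#8217;s actual scoring formula differs in its details. Note that idf depends on the statistics of the collection being searched (N documents, df of them containing the term).</p>

```java
import java.util.Arrays;
import java.util.List;

// Textbook tf-idf; an illustration only, not Lucene's or Summa's scoring code.
class TfIdfSketch {
    // Term frequency: number of occurrences of the term in one document.
    static int tf(String term, List<String> doc) {
        int n = 0;
        for (String token : doc) {
            if (token.equals(term)) n++;
        }
        return n;
    }

    // Inverse document frequency: ln(N / df), or 0 if no document has the term.
    static double idf(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> doc : corpus) {
            if (doc.contains(term)) df++;
        }
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
    }

    static double tfIdf(String term, List<String> doc, List<List<String>> corpus) {
        return tf(term, doc) * idf(term, corpus);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("fire", "fire", "everywhere"),
            Arrays.asList("search", "everywhere"));
        // "fire" is in 1 of 2 documents, "everywhere" is in both (idf = ln 1 = 0).
        System.out.println(tfIdf("fire", corpus.get(0), corpus));
        System.out.println(tfIdf("everywhere", corpus.get(0), corpus));
    }
}
```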
<p>We have chosen to call this concept Virtual Integrated Search as the end result is (potentially) virtually indistinguishable from having a large local data well coupled with a true integrated search. For the <a href="http://www.well11.dk/">Well11</a> and <a href="http://code4lib.org/conference/2011/">code4lib 2011</a> conferences we have prepared a first stab at an implementation and integration with our existing Search and Summa system. This is not much more than an internal beta version where a primary focus of the frontend has been to make it possible to tweak the way merging is performed.</p>
<p>What will it look like? The main purpose for us is to make it look to the user as if all the data is retrieved from a single source. Currently we present results from Summon in a box of its own but the plan is to simply go back and present all results in a single list.</p>
<p>This is something we will be working on and experimenting with in the coming months and we are quite excited about the possibilities.</p>

<a href='http://sbdevel.wordpress.com/2011/02/04/virtual-integrated-search/search_2009/' title='search_2009'><img data-attachment-id='972' data-orig-size='1028,750' data-liked='0' width="150" height="109" src="http://sbdevel.files.wordpress.com/2011/02/search_2009.png?w=150&#038;h=109" class="attachment-thumbnail" alt="Search 2009" title="search_2009" /></a>
<a href='http://sbdevel.wordpress.com/2011/02/04/virtual-integrated-search/search_2010/' title='search_2010'><img data-attachment-id='973' data-orig-size='1008,746' data-liked='0' width="150" height="111" src="http://sbdevel.files.wordpress.com/2011/02/search_2010.png?w=150&#038;h=111" class="attachment-thumbnail" alt="Search 2010" title="search_2010" /></a>
<a href='http://sbdevel.wordpress.com/2011/02/04/virtual-integrated-search/search_2011_beta/' title='search_2011_beta'><img data-attachment-id='974' data-orig-size='1028,709' data-liked='0' width="150" height="103" src="http://sbdevel.files.wordpress.com/2011/02/search_2011_beta.png?w=150&#038;h=103" class="attachment-thumbnail" alt="Search 2011 Beta" title="search_2011_beta" /></a>

]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2011/02/04/virtual-integrated-search/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">villadsen</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2011/02/search_2009.png?w=150" medium="image">
			<media:title type="html">search_2009</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2011/02/search_2010.png?w=150" medium="image">
			<media:title type="html">search_2010</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2011/02/search_2011_beta.png?w=150" medium="image">
			<media:title type="html">search_2011_beta</media:title>
		</media:content>
	</item>
		<item>
		<title>Going to code4lib 2011</title>
		<link>http://sbdevel.wordpress.com/2011/01/07/going-to-code4lib-2011/</link>
		<comments>http://sbdevel.wordpress.com/2011/01/07/going-to-code4lib-2011/#comments</comments>
		<pubDate>Fri, 07 Jan 2011 10:29:35 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[c4l11]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=965</guid>
		<description><![CDATA[Mads Villadsen and I are fortunate enough to be attending code4lib 2011 in early February. Last year our plane was stopped by a snow drift. This year we&#8217;re going full paranoia with a planned US-arrival 2 days before the main conference starts. code4lib 2009 was the best library-oriented conference we&#8217;ve been to, so our hopes [...]]]></description>
			<content:encoded><![CDATA[<p>Mads Villadsen and I are fortunate enough to be attending <a href="http://code4lib.org">code4lib</a> 2011 in early February. Last year our plane was stopped by a snow drift. This year we&#8217;re going full paranoia with a planned US-arrival 2 days before the main conference starts.</p>
<p>code4lib 2009 was the best library-oriented conference we&#8217;ve been to, so our hopes for 2011 are high. The program certainly looks interesting and one common theme &#8211; merging of search results from different sources &#8211; is perfectly timed for our current project on merging of Summa and Summon search results. Hopefully we&#8217;ll have enough experience by then to do a lightning talk about it.</p>
<p>We will also be attending the pre-conference on Solr and since hierarchical-like faceting seems to be a fairly hot topic this year, we plan to hack together a Solr-based proof of concept of <a href="http://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/">our take on the problem</a> before the conference.</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2011/01/07/going-to-code4lib-2011/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>
	</item>
		<item>
		<title>The road ahead</title>
		<link>http://sbdevel.wordpress.com/2010/11/26/the-road-ahead/</link>
		<comments>http://sbdevel.wordpress.com/2010/11/26/the-road-ahead/#comments</comments>
		<pubDate>Fri, 26 Nov 2010 08:08:03 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[Statsbiblioteket]]></category>
		<category><![CDATA[Summa]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=958</guid>
		<description><![CDATA[When we&#8217;re Out There at conferences or visiting other libraries, we&#8217;re sometimes asked why we bother &#8220;developing our own search engine&#8221;. It is a really good question. One that we ask ourselves from time to time, to make sure we&#8217;re on the right track. One of the reasons why we&#8217;ve been rolling our own instead [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&amp;blog=4699377&amp;post=958&amp;subd=sbdevel&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>When we&#8217;re Out There at conferences or visiting other libraries, we&#8217;re sometimes asked why we bother &#8220;developing our own search engine&#8221;. It is a really good question. One that we ask ourselves from time to time, to make sure we&#8217;re on the right track.</p>
<p>One of the reasons why we&#8217;ve been rolling our own instead of taking the obvious Solr-road is historical. Solr didn&#8217;t meet our needs when we started. It still doesn&#8217;t in its current form, but it&#8217;s getting close. Another reason is that the scope of Summa is different from Solr’s: We&#8217;ve invested a lot of energy in handling data-flow with transformations in a multi-stage system with a persistent storage service that handles parent-child relations and keeps track of updates.</p>
<p>The interesting thing here is that Summa and Solr are not necessarily in competition. Solr has a lot of great analyzers and has gained a lot of momentum during the last years. We would therefore like to use Solr at the core of Summa instead of the custom Lucene searcher we have currently. At Statsbiblioteket we have done a fair amount of work with <a href="https://issues.apache.org/jira/browse/LUCENE-2369">low-memory index-lookup, sorting and hierarchical faceting</a> which we’d like to take with us, so this calls for participation in Solr development.</p>
<h2>Relevance Correlation between systems</h2>
<p>Statsbiblioteket has recently begun collaboration with the Summon team from Serial Solutions. Due to Danish legislation we are not allowed to let Summon index our own data, so the upcoming search setup needs to merge results from Summon with our own. As both systems sort the results by relevance ranking, proper merging is quite hard. This is due to the fact that relevance scores are not comparable between different systems. One solution is to perform a local comparison of the returned fields in the results, but as this completely ignores the whole term statistics based ranking of Lucene/Solr, we would like something better.</p>
<p>Up until now we have been spoiled by using relevance ranked integrated search across different sources where we controlled the full indexing process.  <a href="https://issues.apache.org/jira/browse/SOLR-1632">SOLR-1632</a> seems to be the right way to provide similar functionality for Solr, but it is not mature yet and is quite a large thing to put into production. Better to walk before running. As part of our collaborative agreement Serial Solutions has agreed to deliver statistics about their metadata to us on an experimental basis. With this we can hopefully do better merging and issue boosted queries to get the hits that are relevant to us, thus approximating the results we would have gotten if all metadata had been in a single index.</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2010/11/26/the-road-ahead/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>
	</item>
		<item>
		<title>Hierarchical faceting &#8211; working code</title>
		<link>http://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/</link>
		<comments>http://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/#comments</comments>
		<pubDate>Sun, 10 Oct 2010 23:12:31 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[eskildsen]]></category>
		<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Low-level]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=932</guid>
		<description><![CDATA[A week ago an idea of how to do hierarchical faceting was presented. It ended with &#8220;Now it just needs to be implemented&#8221;. Hubris, of course. The amount of devils running amok was staggering and at times it seemed that even duct tape wouldn&#8217;t ensure that the original promises were kept. However, frantic hacking prevailed [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&amp;blog=4699377&amp;post=932&amp;subd=sbdevel&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>A week ago an idea of <a href="http://sbdevel.wordpress.com/2010/10/05/fast-hierarchical-faceting/">how to do hierarchical faceting</a> was presented. It ended with &#8220;Now it just needs to be implemented&#8221;. Hubris, of course. The amount of devils running amok was staggering and at times it seemed that even duct tape wouldn&#8217;t ensure that the original promises were kept. However, frantic hacking prevailed and the current Lucene 4 based implementation actually looks half-way decent. It would be slightly insulting just to call it a proof of concept now.</p>
<p>To recap, we want hierarchical faceting. Searching for &#8220;monkeys&#8221; should give us something like</p>
<pre>
Classification
  - Primates (9)
    - Atelidae (2)
      - Atelinae (2)
        - Ateles (2)
    - Cebidae (5)
      - Cebus (2)
        - C. apella (1)
        - C. capucinus (1)
      - Saimiri (3)
        - S. sciureus (3)
    - Cercopithecidae (1)
      - Papio (1)
        - P. anubis (1)
</pre>
<h2>Wishes</h2>
<p>Let&#8217;s be a little more specific in our requirements. Unless otherwise noted, the current implementation fulfills the following:</p>
<ul>
<li><em>n-level:</em> There should be no hard limit on the depth of the hierarchy. The current implementation works well to at least level 30 and there are some obvious optimizations that could be done to get better performance for deeper levels.</li>
<li><em>Diverse levels:</em> There should be no requirements for the documents to specify tags to a common level: Having document 1 state the tag &#8220;A/B&#8221; and document 2 state &#8220;F/G/H/I/J&#8221; should work.</li>
<li><em>Multi-value:</em> Each document should be capable of stating zero or more hierarchical tags.</li>
<li><em>Minimum index-impact:</em> The index should not grow significantly, neither should the tags require extended pre-processing to work. The current implementation requirement is that tags are stated in a field with a delimiter between levels, such as &#8220;mytag:foo/bar/zoo&#8221;.</li>
<li><em>Summing:</em> The reference counts for sub-tags under a given tag should be summed recursively and provided in the faceting result.</li>
<li><em>Sorting:</em> It should be possible to control the sorting on the different levels, e.g. sorting alphanumerically on the first level, then by reference count on the second and again alphanumerically on all subsequent levels. The current implementation allows for custom sorting in the inner workings but the interface only provides index-order, locale-based and count-order out of the box.</li>
<li><em>Output control:</em> It should be possible to limit the output in different ways as hierarchies tend to grow big. The current implementation allows limiting on the depth (levels) as well as the width (sub-tags under a given tag).</li>
<li><em>Filtering:</em> It should be possible to specify thresholds for which tags are relevant. This should be done on a level-basis, so the requirement for level 1 could be a total count of at least 100, while the requirement for level 2 could be 10.</li>
<li><em>Starting path:</em> To facilitate a dynamic interface, it should be possible to request a hierarchical tag structure starting at a given point in the hierarchy. This has not been implemented yet.</li>
<li><em>Low memory impact:</em> As always, we want the impact on memory to be minimized. This is a fuzzy requirement. The current implementation keeps true to the original premises and adds <code>2 * log2(maxdepth+1) bits</code> for each tag on top of the standard faceting overhead. At extraction time, the temporary memory requirements grow linearly with the number of unique tags to return on all levels.</li>
<li><em>Fast:</em> Another fuzzy requirement. The current implementation takes <code>n</code> time for pre-processing, where <code>n</code> is the number of tags: All it does is split each tag into sub-parts, count the parts and make a comparison to the previous tag. For constructing the result, it has a worst-case requirement of <code>n * maxdepth + n * log(n)</code>. Some clever caching should be possible, but it is not trivial.</li>
<li><em>What else?</em> It would be surprising if this list were complete. Suggestions are welcome.</li>
</ul>
<h2>Experimental setup</h2>
<p>An index builder was created that takes 3 parameters: The number of documents, the maximum number of tags at any given level for any given document and the maximum number of levels. Each tag is a random character from A-Z. Asking for 3 documents with a maximum of 2 tags and maximum level 3 might give</p>
<pre>
Doc 1: A, A/F, A/T, A/T/R, A/T/Y, F/G/I
Doc 2: Z
Doc 3: F/H, F/H/I, B, B/B
</pre>
<p>For each test, a search was performed that hits all documents and the faceting system extracts a result to XML. This was all on the local machine with direct Java-calls (no web services or similar overhead). The hardware was a Dell M6500 laptop with Intel i7 CPU, PC1333 RAM and SSD.</p>
<h2>Some measurements</h2>
<h3>Some documents, small hierarchy</h3>
<p>1,000,000 documents, max tags/level 4, max levels 3.<br />
Request a maximum of 5 tags/level down to level 5 in reference count order.</p>
<pre>
Hierarchical facet startup time = 1642 ms
18,278 unique tags, 2,383,248 references to tags
Mem usage: preFacet=4 MB, postFacet=9 MB
Collection and extraction #0 for 1000000 documents in 171 ms
Generated XML with 13594 characters
</pre>
<h3>Few documents, wide hierarchy</h3>
<p>10,000 documents, max tags/level 10, max levels 4.<br />
Request a maximum of 5 tags/level down to level 5 in reference count order.</p>
<pre>
Hierarchical facet startup time = 0:06 minutes
438,746 unique tags, 1,478,224 references to tags
Mem usage: preFacet=3 MB, postFacet=9 MB
Collection and extraction #0 for 10000 documents in 169 ms
Generated XML with 70046 characters
</pre>
<h3>Many documents, wide hierarchy, extreme references</h3>
<p>10,000,000 documents, max tags/level 10, max levels 4<br />
Request a maximum of 5 tags/level down to level 5 in reference count order.</p>
<pre>
Hierarchical facet startup time = 4:50 minutes
475,254 unique tags,  1,430,592,254 references to tags.
Mem usage: preFacet=26 MB, postFacet=3319 MB
Collection and extraction #0 for 10000000 documents in 0:21 minutes
</pre>
<h3>Few documents, deep hierarchy</h3>
<p>10,000 documents, max tags/level 3, max levels 15.<br />
Request a maximum of 5 tags/level down to level 5 in reference count order.</p>
<pre>
Hierarchical facet startup time = 6:57 minutes
50,538,133 unique tags, 52,237,029 references to tags
Mem usage: preFacet=61 MB, postFacet=603 MB
Collection and extraction #0 for 10000 documents in 0:10 minutes
Generated XML with 196739 characters
</pre>
<h3>Shallow, single-tag, uniform</h3>
<p>Special index to compare to <a href="http://wiki.apache.org/solr/HierarchicalFaceting">Solr hierarchical faceting</a>: 260,000 documents, each with a single unique two-level tag: A/1, A/2 &#8230; A/10000, B/1, B/2 &#8230; Z/10000.<br />
Request unlimited tags down to level 5 (effectively unlimited with this corpus).</p>
<pre>
Hierarchical facet startup time = 2196 ms
260,000 unique tags, 260,000 references to tags
Mem usage: preFacet=2 MB, postFacet=37 MB
Collection and extraction #0 for 260000 documents in 1048 ms
Collection and extraction #1 for 259630 documents in 884 ms
Collection and extraction #2 for 258829 documents in 995 ms
Generated XML with 15316147 characters in 320 ms
</pre>
<p>More realistically, setting the retrieval limit to the top 10 tags we get</p>
<pre>
Collection and extraction #0 for 260000 documents in 167 ms
Collection and extraction #1 for 259912 documents in 76 ms
Collection and extraction #2 for 258982 documents in 60 ms
Generated XML with 27347 characters in 7 ms
</pre>
<h2>Sample output</h2>
<pre>
&lt;facetresponse xmlns="http://lucene.apache.org/exposed/facet/response/1.0"
  query="even:true" hits="9984" countms="5" countcached="false" totalms="34"&gt;
  &lt;facet name="hierarchical" fields="deep" order="count" maxtags="20"
  mincount="0" offset="0" hierarchical="true" potentialtags="89534"
  count="7589" totalCount="141989" totaltags="26"&gt;
      &lt;tag count="320" totalcount="6304" term="Y"&gt;
        &lt;subtags potentialtags="3987" count="674" totalCount="5984" totaltags="26"&gt;
          &lt;tag count="17" totalcount="301" term="K"&gt;
            &lt;subtags potentialtags="222" count="72" totalCount="284" totaltags="26"&gt;
              &lt;tag count="4" totalcount="22" term="G"&gt;
                &lt;subtags potentialtags="16" count="18" totalCount="18" totaltags="15"&gt;
                  &lt;tag count="2" totalcount="2" term="A"&gt;&lt;/tag&gt;
                  &lt;tag count="2" totalcount="2" term="J"&gt;&lt;/tag&gt;
                  &lt;tag count="2" totalcount="2" term="U"&gt;&lt;/tag&gt;
                  &lt;tag count="1" totalcount="1" term="B"&gt;&lt;/tag&gt;
                  &lt;tag count="1" totalcount="1" term="C"&gt;&lt;/tag&gt;
                &lt;/subtags&gt;
              &lt;/tag&gt;
              &lt;tag count="3" totalcount="22" term="Q"&gt;
                &lt;subtags potentialtags="19" count="19" totalCount="19" totaltags="18"&gt;
                  &lt;tag count="2" totalcount="2" term="Q"&gt;&lt;/tag&gt;
                  &lt;tag count="1" totalcount="1" term="A"&gt;&lt;/tag&gt;
                  &lt;tag count="1" totalcount="1" term="B"&gt;&lt;/tag&gt;
                  &lt;tag count="1" totalcount="1" term="C"&gt;&lt;/tag&gt;
                  &lt;tag count="1" totalcount="1" term="D"&gt;&lt;/tag&gt;
                &lt;/subtags&gt;
...
</pre>
<h2>Download</h2>
<p>Important: This is not usable for production, partly because Lucene 4 is still in development, partly because it was hacked together in a week.</p>
<p>Do a SVN checkout of Lucene trunk and patch it with <a href="https://issues.apache.org/jira/browse/LUCENE-2369">LUCENE-2369</a>, then fix the errors as Lucene trunk is a moving target (I should really make a compiled version with a command line test tool). The test-code for the hierarchical faceting is located in the aptly named class TestHierarchicalFacets.</p>
<p>Use only for good.<br />
<i>Toke Eskildsen</i><br />
<i>statsbiblioteket.dk</i></p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>
	</item>
		<item>
		<title>Fast, light, n-level hierarchical faceting</title>
		<link>http://sbdevel.wordpress.com/2010/10/05/fast-hierarchical-faceting/</link>
		<comments>http://sbdevel.wordpress.com/2010/10/05/fast-hierarchical-faceting/#comments</comments>
		<pubDate>Tue, 05 Oct 2010 10:04:58 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[eskildsen]]></category>
		<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Low-level]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=914</guid>
		<description><![CDATA[A recipe for doing n-level hierarchical faceting with very little memory overhead.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&amp;blog=4699377&amp;post=914&amp;subd=sbdevel&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>We are now used to standard faceting and it works really well for a lot of cases. However, sometimes we have a taxonomy that we want to present. For monkeys this could be something like</p>
<pre>
Classification
  - Primates (9)
    - Atelidae (2)
      - Atelinae (2)
        - Ateles (2)
    - Cebidae (5)
      - Cebus (2)
        - C. apella (1)
        - C. capucinus (1)
      - Saimiri (3)
        - S. sciureus (3)
    - Cercopithecidae (1)
      - Papio (1)
        - P. anubis (1)
</pre>
<h2>Example data and goal</h2>
<p>We consider three documents, each with some multi-level faceting information</p>
<pre>
Doc 1: A/B/C, D/E/F
Doc 2: A/B/C, A/B/J
Doc 3: A,     D/E,   G/H/I
</pre>
<p>The multi-level tags are thus</p>
<pre>
A
A/B/C
A/B/J
D/E
D/E/F
G/H/I
</pre>
<p>Visually we want all levels to be expanded and counted so that we see</p>
<pre>
A      (4)
 - B   (3)
   - C (2)
   - J (1)
D      (2)
 - E   (2)
   - F (1)
G      (1)
  - H  (1)
   - I (1)
</pre>
<p>for a search that matches all 3 documents.</p>
<h2>The easy, non-fast and non-light solution</h2>
<p>If the documents were expanded, either at index time or facet structure build time, to have the tags</p>
<pre>
Doc 1: A,     A/B,   A/B/C, D,     D/E,   D/E/F
Doc 2: A,     A/B,   A/B/C, A,     A/B,   A/B/J
Doc 3: A,     D,     D/E,   G,     G/H,   G/H/I
</pre>
<p>then a standard <a href="http://sbdevel.wordpress.com/2010/09/24/sorting-faceting-index-lookup/">single-level faceting system</a> would produce the desired output (a small lie as we also need to limit the number of tags returned in a level-aware way). Unfortunately this expansion does not come cheap. In this example we go from 7 tag references to 18 and from 6 unique tags to 10. This is not too bad, but when we have deeper levels and more diverse tags, the overhead increases: Faceting on  house location with country/state/city/street/number would result in 5 times the references of non-hierarchical faceting. More references means more processing time and more memory overhead.</p>
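<p>The cost of the expansion can be checked with a small Python sketch. It only illustrates the idea on the three example documents; the names <code>expand</code> and <code>expand_document</code> and the dictionary layout are assumptions made for this example and are not taken from any implementation.</p>

```python
from itertools import chain

# The three example documents and their hierarchical tags.
DOCS = {
    1: ["A/B/C", "D/E/F"],
    2: ["A/B/C", "A/B/J"],
    3: ["A", "D/E", "G/H/I"],
}


def expand(tag, delimiter="/"):
    """Return every prefix path of a tag: A/B/C -> A, A/B, A/B/C."""
    parts = tag.split(delimiter)
    return [delimiter.join(parts[:i]) for i in range(1, len(parts) + 1)]


def expand_document(tags):
    """Expand all tags of one document, keeping duplicates as references."""
    return list(chain.from_iterable(expand(tag) for tag in tags))


original_refs = sum(len(tags) for tags in DOCS.values())
original_unique = len(set(chain.from_iterable(DOCS.values())))

expanded = {doc: expand_document(tags) for doc, tags in DOCS.items()}
expanded_refs = sum(len(tags) for tags in expanded.values())
expanded_unique = len(set(chain.from_iterable(expanded.values())))

print(original_refs, "->", expanded_refs, "references")
print(original_unique, "->", expanded_unique, "unique tags")
```

<p>Duplicates are kept on purpose, as every expanded tag is one more reference that the faceting system has to store and count.</p>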
<h2>The hard, fast, light and purple solution</h2>
<p>We do not create new tags from the existing tags</p>
<pre>
Doc 1: A/B/C, D/E/F
Doc 2: A/B/C, A/B/J
Doc 3: A,     D/E,   G/H/I
</pre>
<p>Instead we augment each tag in the original list with two pieces of information: The level <i>L</i> in the hierarchy (starting at level 1) and the previous level <i>P</i> that the tag <i>T</i> matches. <i>C</i> is the count from a given search. Our list is thus enriched to the following, where the text in the parentheses explains how the previous level was derived:</p>
<pre>
L P T      C
1 0 A     (1)   (no previous tag)
3 1 A/B/C (2)   (The previous tag "A" matches only the first level of "A/B/C")
3 2 A/B/J (1)   (The previous tag "A/B/C" matches first and second level of "A/B/J")
2 0 D/E   (1)   (no previous match)
3 2 D/E/F (1)   (The previous tag "D/E" matches both the first and second level of "D/E/F")
3 0 G/H/I (1)   (no previous match)
</pre>
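<p>Deriving <i>L</i> and <i>P</i> takes a single pass over the sorted tag list. The following Python sketch shows one way to do it; the function names are hypothetical and the tag list is assumed to be sorted already, so that a tag always comes before its sub-tags.</p>

```python
def common_levels(a, b):
    """Number of leading path elements shared by two split tags."""
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return shared


def augment(sorted_tags, delimiter="/"):
    """Return (L, P, tag) for each tag in an already sorted tag list."""
    result = []
    previous = []  # Path elements of the previous tag, empty for the first
    for tag in sorted_tags:
        parts = tag.split(delimiter)
        result.append((len(parts), common_levels(previous, parts), tag))
        previous = parts
    return result


TAGS = ["A", "A/B/C", "A/B/J", "D/E", "D/E/F", "G/H/I"]
for level, prev, tag in augment(TAGS):
    print(level, prev, tag)
```

<p>Each tag is only compared to its immediate predecessor, so the work is linear in the number of tags.</p>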
<p>As both the level <i>L</i> and the previous level <i>P</i> are at most max-depth, we can represent the numbers in a packed structure, where the extra information for each tag takes up <code>2 * log2(maxdepth+1) bits</code>. For <strong>10 million tags</strong> of a maximum <strong>depth of 7</strong>, this is <code>10M * 2 * log2(8) bits = 10M * 2 * 3 / 8 bytes</code> = <strong>7 MB</strong>. 50 million tags of max depth 30 would take 59 MB of extra overhead.</p>
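<p>The arithmetic can be reproduced with a few lines of Python. The sketch assumes that each value is stored with a whole number of bits, so <code>log2(maxdepth+1)</code> is rounded up; <code>overhead_bytes</code> is a name made up for the example.</p>

```python
import math


def overhead_bytes(tag_count, max_depth):
    """Extra bytes for storing L and P packed with a fixed bit width."""
    # Assumption: values 0..max_depth need ceil(log2(max_depth + 1)) bits.
    bits_per_value = math.ceil(math.log2(max_depth + 1))
    return tag_count * 2 * bits_per_value / 8


for tags, depth in ((10_000_000, 7), (50_000_000, 30)):
    size = overhead_bytes(tags, depth)
    print(f"{tags:,} tags, depth {depth}: {size / 2**20:.1f} MiB")
```

<p>The two cases come out at 7,500,000 and 62,500,000 bytes, which matches the 7 MB and 59 MB above when a MB is counted as 2<sup>20</sup> bytes and the result is rounded down.</p>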
<p>The tag counts are updated just as for standard single-level faceting. The tags at any given level are those that satisfy <code>L &gt;= level, P &lt; level</code>. The count for a tag at a given level is the sum of the counts from its starting point until the next tag matching the equation. Let&#8217;s walk through the example:</p>
<p>At <strong>level 1</strong>, the equation is <code>L &gt;= 1, P &lt; 1</code>. For tag A this is <code>1 &gt;= 1, 0 &lt; 1</code> which clearly matches. Next match is tag D/E and the last is G/H/I (remember that counts are summed between matches). As we know that we&#8217;re extracting at level 1, we can split the terms for D/E and G/H/I to get D and G and thus have</p>
<pre>
A (4)
D (2)
G (1)
</pre>
<p>At <strong>level 2</strong>, we use the starting points for the tags from level 1 and the equation is <code>L &gt;= 2, P &lt; 2</code>. Tag A no longer matches, but A/B/C does with <code>3 &gt;= 2, 1 &lt; 2</code>. Tag A/B/J does not, so its count is added to A/B/C. D/E matches with <code>2 &gt;= 2, 0 &lt; 2</code>, but D/E/F does not with <code>3 &gt;= 2, 2 &lt; 2</code> so it only adds to the count for D/E. Last fit is G/H/I with <code>3 &gt;= 2, 0 &lt; 2</code>. Extracting the proper level 2 terms the result is expanded to</p>
<pre>
A   (4)
A/B (3)
D   (2)
D/E (2)
G   (1)
G/H (1)
</pre>
<p>At <strong>level 3</strong>, we use the starting points from the tags from level 2 and the equation <code>L &gt;= 3, P &lt; 3</code>. Tag A/B/C still matches with <code>3 &gt;= 3, 1 &lt; 3</code> and A/B/J matches too. Tag D/E does not with <code>2 &gt;= 3, 0 &lt; 3</code>. Next match is D/E/F with <code>3 &gt;= 3, 2 &lt; 3</code> and last is G/H/I with <code>3 &gt;= 3, 0 &lt; 3</code>. Extracting at level 3, the tags are expanded to</p>
<pre>
A     (4)
A/B   (3)
A/B/C (2)
A/B/J (1)
D     (2)
D/E   (2)
D/E/F (1)
G     (1)
G/H   (1)
G/H/I (1)
</pre>
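<p>The walk-through for a single level can be sketched in Python as below. This is a simplified, flat pass per level over the enriched list from the table of <i>L</i>, <i>P</i>, <i>T</i> and <i>C</i>; it does not do the recursive descent, count caching or sorting that a real implementation needs, and <code>extract_level</code> is a hypothetical name.</p>

```python
# (L, P, tag, count) rows from the enriched tag list.
ROWS = [
    (1, 0, "A", 1),
    (3, 1, "A/B/C", 2),
    (3, 2, "A/B/J", 1),
    (2, 0, "D/E", 1),
    (3, 2, "D/E/F", 1),
    (3, 0, "G/H/I", 1),
]


def extract_level(rows, level, delimiter="/"):
    """Return [(term, summed count)] for one level of the hierarchy."""
    result = []
    open_group = False  # True while following rows may add to result[-1]
    for depth, prev, tag, count in rows:
        if prev >= level and open_group:
            # Shares at least `level` elements with the previous tag.
            term, total = result[-1]
            result[-1] = (term, total + count)
        elif depth >= level and prev < level:
            # The equation L >= level, P < level: a new term starts here.
            term = delimiter.join(tag.split(delimiter)[:level])
            result.append((term, count))
            open_group = True
        else:
            # Tag is too shallow for this level and ends any open group.
            open_group = False
    return result


for level in (1, 2, 3):
    print(level, extract_level(ROWS, level))
```

<p>Each call returns only the terms that are new at the requested level; merging the three results in tag order gives the expanded list from the walk-through.</p>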
<p>For brevity, some devils have been skipped here.</p>
<ul>
<li>To fit sub-tags under the correct tags, a recursive descent in a sub-set of the tags should be used.</li>
<li>To avoid re-calculating sums, there should be a counting pass at first.</li>
<li>To provide mixed count and custom sorting of tags, an expanded facet query syntax must be provided which works with a custom sorter.</li>
</ul>
<p>The time spent on calculating the facets is dictated by the query options. Worst case is 100% orthogonal tags, tag sorting by counting and a complete extraction of all results. In that case the tag-list must be iterated as many times as there are levels. For the location-example above, this is 5 times, which coincidentally is the same factor as the tags-explosion for the easy implementation. Back-of-the-envelope, the worst case execution time is thus on par with the easy implementation, while memory overhead is a lot less.</p>
<p>As soon as the tags begin to overlap at some levels, sum count caching kicks in. Coupled with a limit on the number of extracted tags at any given level and the fact that we only need to re-visit the sub-tags for weeded super-tags, the number of passes falls drastically. This brings the expected execution time down to less than the easy method.</p>
<p>Fast (cpu-cycles), light (<strong>1 byte/unique tag</strong> on top of traditional faceting as a rule of thumb) and &#8230; Okay, the purple part was a lie.  Now it just needs to be implemented.</p>
<p>That&#8217;s all, folks!</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2010/10/05/fast-hierarchical-faceting/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>
	</item>
		<item>
		<title>Sorting, faceting and index lookup with Lucene</title>
		<link>http://sbdevel.wordpress.com/2010/09/24/sorting-faceting-index-lookup/</link>
		<comments>http://sbdevel.wordpress.com/2010/09/24/sorting-faceting-index-lookup/#comments</comments>
		<pubDate>Fri, 24 Sep 2010 12:44:34 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[eskildsen]]></category>
		<category><![CDATA[Low-level]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[open source]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=879</guid>
		<description><![CDATA[A while ago, I described a proof of concept on how to reduce the memory impact for sorting with Lucene. There were also some lofty ideas about faceting and index lookup. I am now happy to say that these ideas have graduated to the proof of concept stage. Laying the ground First of all, let [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&amp;blog=4699377&amp;post=879&amp;subd=sbdevel&amp;ref=&amp;feed=1" width="1" height="1" />]]></description>
			<content:encoded><![CDATA[<p>A while ago, I described a proof of concept on <a href="http://sbdevel.wordpress.com/2010/03/19/string-sorting-in-lucene-without-the-memory-penalty/">how to reduce the memory impact for sorting with Lucene</a>. There were also some lofty ideas about faceting and index lookup. I am now happy to say that these ideas have graduated to the proof of concept stage.</p>
<h2>Laying the ground</h2>
<p>First of all, let us describe three common ways of looking at an index.</p>
<ul>
<li><strong>Sorting:</strong> Sorting on title, author and similar fields should be done with respect to the locale of the user. This is often done by comparing the relevant strings from a search result to each other, which scales horribly when the number of hits rises as locale-aware comparison of Strings is expensive. Another way that is making the rounds is to <a href="http://wiki.apache.org/solr/UnicodeCollation">index collator keys</a>, which are not usable for human reading, but very fast for comparison.</li>
<li><strong>Faceting:</strong> Fairly standard by now, faceting does present some architectural challenges. Creating facets for subjects with a low number of possible values is simple enough, but scaling to faceting on title, author or similar fields with many unique values takes its toll on memory if all the values are kept there. Solr provides an <a href="http://lucene.apache.org/solr/api/org/apache/solr/request/UnInvertedField.html">UnInvertedField</a> which offloads most of the terms to storage.</li>
<li><strong>Index lookup:</strong> Also known as &#8220;author browse&#8221; and probably some other names, as it is not that well-known. The idea is to provide the user with an authoritative list of possible values for a given field. When the user starts to type, valid values with the typed prefix are shown together with a few of the previous values as well as the following values. The difference from the well-known suggestion concept is that index lookup is complete and that the result is in locale-aware sorted order. It can be seen as a dictionary. If memory is plentiful, a simple implementation method is to hold a sorted list of terms in memory and do a standard locale aware binary search for the entry point in the list.</li>
</ul>
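<p>To make the collator key idea from the sorting item a bit more concrete, here is a minimal sketch using the JDK&#8217;s own <code>java.text.Collator</code> and <code>CollationKey</code>. It is not the Solr or Lucene code; the class and method names are made up for the illustration.</p>

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationKeySortSketch {
    /** Sorts the values for a locale by comparing precomputed collation keys. */
    static String[] sortByCollationKey(String[] values, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        // The expensive locale-aware analysis happens once per value...
        CollationKey[] keys = new CollationKey[values.length];
        for (int i = 0; i < values.length; i++) {
            keys[i] = collator.getCollationKey(values[i]);
        }
        // ...after which the sort only compares the precomputed keys
        Arrays.sort(keys);
        String[] sorted = new String[keys.length];
        for (int i = 0; i < keys.length; i++) {
            sorted[i] = keys[i].getSourceString();
        }
        return sorted;
    }

    public static void main(String[] args) {
        String[] titles = {"Banana", "apple", "cherry"};
        Locale danish = Locale.forLanguageTag("da");
        System.out.println(Arrays.toString(sortByCollationKey(titles, danish)));
    }
}
```

<p>The keys themselves are opaque byte sequences, which is why an index that only stores the keys cannot show readable terms to the user.</p>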
<p>One common element for these three ideas is the terms. Terms kept in memory. With lower end hardware or higher end index size, this is a problem. Lucene (and thereby Solr) is switching to something called BytesRef, which can represent characters in a more compact way than Java Strings. They really do help a lot, but for some that&#8217;s not enough.</p>
<p>Now, a Lucene index naturally contains all these terms: They can be looked up, they can be sequentially accessed or (dramatic pause) they can be accessed by ordinal. An ordinal is just an integer and takes up very little memory to represent. Requesting a term by ordinal is a bit costly though, as it must be fetched from storage, which requires some seeking.</p>
<h2>Another way</h2>
<p>Going back to the three ideas, another common element is that only a relatively small number of the terms are actually shown to the user. For sorting, it might be the top 20 titles. For faceting, it might be 200 tags shared among 10 facets. For index lookup it might be 10 author names. Fast resolving of terms is not required!</p>
<p>Armed with this knowledge, let&#8217;s define four building blocks.</p>
<ul>
<li><strong>Ordinal to term lookup:</strong> Given an ordinal, the term (aka BytesRef) is returned by querying the index. Trivial for single segments, fairly simple for multiple segments as they are just appended. This is just a logical mapping and requires practically no memory.</li>
<li><strong>Indirects:</strong> The ordinals are sorted, typically with respect to a locale, and the sorted list is called the indirects list. If an index in the indirects list is lower than another, it means that its corresponding term comes before the other indirect entry&#8217;s term with respect to sorting. We always need to sort in order to have indirects, even if the terms in the segments are already in order. This is because an index is often composed of several segments that most probably contain duplicate terms. In that case, the sorting can also be seen as de-duplication on ordered lists. This works very well with the collator key idea from above.</li>
<li><strong>Document to single indirect:</strong> For each document id (still just an integer), an indirect index is kept. Resolving the indirect returns a term corresponding to the document. Memory-wise this requires a list of integers as long as the number of documents.</li>
<li><strong>Document to indirects:</strong> For each document id, a list of corresponding indirects is kept. By following the indirects through the ordinals, the corresponding terms can be resolved. Memory-wise this requires a list of integers as long as the number of documents plus a list of integers as long as the total number of indirects for all documents.</li>
</ul>
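<p>The building blocks above can be sketched roughly as follows. This is not the code from the patch: the terms are held in a plain <code>String[]</code> indexed by ordinal, standing in for lookups against the index, and all names (<code>IndirectsSketch</code>, <code>ordinalToIndirect</code>, <code>docToOrdinal</code>) are assumptions made for the illustration.</p>

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class IndirectsSketch {
    final int[] indirects;         // sorted position -> ordinal of a unique term
    final int[] ordinalToIndirect; // ordinal -> sorted position of its term

    IndirectsSketch(String[] termsByOrdinal, Collator collator) {
        Integer[] sorted = new Integer[termsByOrdinal.length];
        for (int i = 0; i < sorted.length; i++) {
            sorted[i] = i;
        }
        // Sort the ordinals by their terms; ties are broken on the raw strings
        // so that identical terms always end up next to each other
        Arrays.sort(sorted, (a, b) -> {
            int c = collator.compare(termsByOrdinal[a], termsByOrdinal[b]);
            return c != 0 ? c : termsByOrdinal[a].compareTo(termsByOrdinal[b]);
        });
        int[] unique = new int[sorted.length];
        ordinalToIndirect = new int[sorted.length];
        int size = 0;
        for (int ordinal : sorted) {
            // A term already seen (for example from another segment) is de-duplicated
            if (size == 0 || !termsByOrdinal[unique[size - 1]].equals(termsByOrdinal[ordinal])) {
                unique[size++] = ordinal;
            }
            ordinalToIndirect[ordinal] = size - 1;
        }
        indirects = Arrays.copyOf(unique, size);
    }

    /** Document to single indirect: one sorted position per document id. */
    int[] docToIndirect(int[] docToOrdinal) {
        int[] result = new int[docToOrdinal.length];
        for (int doc = 0; doc < result.length; doc++) {
            result[doc] = ordinalToIndirect[docToOrdinal[doc]];
        }
        return result;
    }

    public static void main(String[] args) {
        // Ordinal 3 repeats the term of ordinal 1, as if it came from a second segment
        String[] terms = {"pear", "apple", "Banana", "apple"};
        IndirectsSketch sketch = new IndirectsSketch(terms, Collator.getInstance(Locale.ENGLISH));
        int[] docToIndirect = sketch.docToIndirect(new int[] {0, 3, 2});
        System.out.println(Arrays.toString(sketch.indirects));
        System.out.println(Arrays.toString(docToIndirect));
        // Deciding the order of two documents is now a plain integer comparison
        System.out.println(Integer.compare(docToIndirect[1], docToIndirect[2]));
    }
}
```

<p>The expensive part, the locale-aware sort, is paid once when the structure is built. After that no strings are touched until a term has to be shown.</p>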
<p>The common element here is arrays of integers. Java&#8217;s <code>int[]</code> is not that memory hungry to start with, but we can reduce the impact even further by using the more compact representation <a href="https://issues.apache.org/jira/browse/LUCENE-1990">PackedInts</a>. Revisiting the three common ways of looking at an index with the new structures, we see how it all comes together.</p>
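<p>To show why a packed representation saves memory, here is a toy version of the idea: every value is stored with just enough bits for the largest value, back to back in a <code>long[]</code>. This is a hypothetical sketch, not Lucene&#8217;s PackedInts implementation, and the class name and methods are made up.</p>

```java
class PackedIntsSketch {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedIntsSketch(int valueCount, int maxValue) {
        // Number of bits needed for maxValue, at least 1
        bitsPerValue = Math.max(1, 32 - Integer.numberOfLeadingZeros(maxValue));
        mask = (1L << bitsPerValue) - 1;
        blocks = new long[(int) (((long) valueCount * bitsPerValue + 63) / 64)];
    }

    void set(int index, int value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long bits = value & mask;
        blocks[block] = (blocks[block] & ~(mask << offset)) | (bits << offset);
        int spill = offset + bitsPerValue - 64;
        if (spill > 0) {
            // The value straddles two longs: its top bits go into the next one
            long highMask = (1L << spill) - 1;
            blocks[block + 1] = (blocks[block + 1] & ~highMask) | (bits >>> (bitsPerValue - spill));
        }
    }

    int get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int offset = (int) (bitPos & 63);
        long bits = blocks[block] >>> offset;
        int spill = offset + bitsPerValue - 64;
        if (spill > 0) {
            bits |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return (int) (bits & mask);
    }

    /** Size of the backing array in bytes, ignoring object overhead. */
    long ramBytes() {
        return blocks.length * 8L;
    }

    public static void main(String[] args) {
        PackedIntsSketch packed = new PackedIntsSketch(100, 1000);
        for (int i = 0; i < 100; i++) {
            packed.set(i, (i * 37) % 1001);
        }
        System.out.println(packed.get(42) + " read back, " + packed.ramBytes() + " bytes used");
    }
}
```

<p>With 100 values of at most 1000, each value needs 10 bits, so the backing array is 16 longs (128 bytes) where an <code>int[]</code> would use 400 bytes for the values. The price is some shifting and masking on every access.</p>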
<ul>
<li><strong>Sorting:</strong> The <em>document to single indirect</em> is used for this as the order of two documents can be determined by comparing their respective entries in the integer array. This is very fast.</li>
<li><strong>Faceting:</strong> The <em>document to indirects</em> is used for faceting. A counting array of length #indirects is made and the document ids are iterated. For each id, the indirects are looked up and the corresponding counter entries are incremented. As it is just a matter of iteration and array lookups and updates, this is very fast. After that, the most popular entries in the counter list can be extracted and the terms resolved. As the indirects can be ordered, it is also possible to extract by lexicographical order or similar.</li>
<li><strong>Index lookup:</strong> The <em>indirects</em> are immediately usable for this as a binary search can be performed. For each iteration of the binary search, one actual term needs to be looked up in storage and compared to the given prefix, but even with 10 million terms, this requires only 24 lookups. Most of these lookups will normally be disk cached.</li>
<li><strong>Enhanced index lookup:</strong> By using faceting for index lookup, we gain the advantage of showing how many documents will match the displayed terms for a given field. Furthermore this can be coupled with previous selections or a query, limiting the displayed terms to those that will result in at least one hit for a search. This might sound like heavy processing, but luckily the counters from the faceting can be easily cached. While an initial faceting of 5 million tags takes about 1 second on a laptop anno 2010, subsequent index lookups within that structure take only 1-3 milliseconds.</li>
</ul>
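<p>The faceting and index lookup steps described above can be sketched like this. Again this is an illustration under assumptions rather than the proof of concept code: the <em>document to indirects</em> structure is modelled as an offsets array <code>docStart</code> (one entry more than the number of documents) plus a flat <code>refs</code> array, the terms live in a <code>String[]</code> instead of in the index, and all names are hypothetical.</p>

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class FacetAndLookupSketch {
    /** Counts, per indirect, how many of the hit documents reference it. */
    static int[] countFacets(int[] hitDocIds, int[] docStart, int[] refs, int indirectCount) {
        int[] counters = new int[indirectCount];
        for (int docId : hitDocIds) {
            // The indirects of a document are refs[docStart[docId]] up to refs[docStart[docId + 1] - 1]
            for (int i = docStart[docId]; i < docStart[docId + 1]; i++) {
                counters[refs[i]]++;
            }
        }
        return counters;
    }

    /** Binary search for the position of the first term sorting at or after the prefix. */
    static int lookup(String[] termsByOrdinal, int[] indirects, String prefix, Collator collator) {
        int low = 0;
        int high = indirects.length;
        while (low < high) {
            int mid = (low + high) >>> 1;
            // In a real index this is the one term fetched from storage per iteration
            String term = termsByOrdinal[indirects[mid]];
            if (collator.compare(term, prefix) < 0) {
                low = mid + 1;
            } else {
                high = mid;
            }
        }
        return low;
    }

    /** Worst case number of term fetches for the search above: the bit length of the term count. */
    static int maxLookups(int termCount) {
        return 32 - Integer.numberOfLeadingZeros(termCount);
    }

    public static void main(String[] args) {
        // Three documents referencing indirects {0, 2}, {2} and {1, 2}
        int[] docStart = {0, 2, 3, 5};
        int[] refs = {0, 2, 2, 1, 2};
        System.out.println(Arrays.toString(countFacets(new int[] {0, 2}, docStart, refs, 3)));

        String[] terms = {"pear", "apple", "banana"};
        int[] indirects = {1, 2, 0};
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        System.out.println(lookup(terms, indirects, "b", collator));
        System.out.println(maxLookups(10000000));
    }
}
```

<p>The last method is just the arithmetic behind the 24 lookups mentioned above: 10 million lies between 2<sup>23</sup> and 2<sup>24</sup>, so the search range is empty after at most 24 halvings.</p>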
<h2>An example</h2>
<p>Two test indexes were created with 1 and 10 million documents respectively. Each document had an id, a title (10 random characters), an author (10 random characters) and 0-5 sample-tags which were a single term A-Y. For each index, a search was performed that hit half of the available documents and sorting (Lucene term order on author), faceting (locale &#8220;da&#8221; for the title, count for the title (yes, again) and count for the sample-tags) and index lookup (on title with locale &#8220;da&#8221;) were performed. After that a search that hit 1/10 of the index (random distribution) was performed with the same sorting, faceting and index lookups as before.</p>
<h3>1 million documents, collator sort</h3>
<pre>
Index = /home/te/projects/index1M (1000000 documents)
Used heap after loading index and performing a simple search: 5 MB
Maximum possible memory (Runtime.getRuntime().maxMemory()): 88 MB

First natural order sorted search for "even:true" with 500000 hits: 2473 ms
Subsequent 5 sorted searches average response time: 23 ms

Facet pool acquisition for for "even:true" with structure groups(
  group(name=sorted, order=locale, locale=da, fields(a)),
  group(name=count, order=count, locale=null, fields(a)),
  group(name=multi, order=count, locale=null, fields(facet))): 1:20 minutes
First faceting for even:true: 128 ms
Subsequent 4 faceting calls (count caching disabled) response times: 75 ms

Initial lookup pool request (might result in structure building): 0:33 minutes
First index lookup for "even:true": 56 ms
Subsequent 91 index lookups average response times: 0 ms

First natural order sorted search for "multi:A" with 95222 hits: 6 ms
Subsequent 5 sorted searches average response time: 5 ms

Facet pool acquisition for for "multi:A" with structure groups(
  group(name=sorted, order=locale, locale=da, fields(a)),
  group(name=count, order=count, locale=null, fields(a)),
  group(name=multi, order=count, locale=null, fields(facet))): 0 ms
First faceting for multi:A: 23 ms
Subsequent 4 faceting calls (count caching disabled) response times: 23 ms

Initial lookup pool request (might result in structure building): 0 ms
First index lookup for "multi:A": 42 ms
Subsequent 91 index lookups average response times: 0 ms

Free memory with sort, facet and index lookup structures intact: 68 MB
Total test time: 1:57 minutes
</pre>
<h3>10 million documents, collator sort</h3>
<pre>
Index = /home/te/projects/index10M (10000000 documents)
used heap after loading index and performing a simple search: 25 MB
Maximum possible memory (Runtime.getRuntime().maxMemory()): 910 MB

First natural order sorted search for "even:true" with 5000000 hits: 0:21 minutes
Subsequent 5 sorted searches average response time: 227 ms

Facet pool acquisition for for "even:true" with structure groups(
  group(name=sorted, order=locale, locale=da, fields(a)),
  group(name=count, order=count, locale=null, fields(a)),
  group(name=multi, order=count, locale=null, fields(facet))): 10:08 minutes
First faceting for even:true: 934 ms
Subsequent 4 faceting calls (count caching disabled) response times: 897 ms

Initial lookup pool request (might result in structure building): 3:21 minutes
First index lookup for "even:true": 348 ms

First natural order sorted search for "multi:A" with 947218 hits: 99 ms
Subsequent 5 sorted searches average response time: 45 ms

Facet pool acquisition for for "multi:A" with structure groups(
  group(name=sorted, order=locale, locale=da, fields(a)),
  group(name=count, order=count, locale=null, fields(a)),
  group(name=multi, order=count, locale=null, fields(facet))): 0 ms
First faceting for multi:A: 267 ms
Subsequent 4 faceting calls (count caching disabled) response times: 233 ms

Initial lookup pool request (might result in structure building): 0 ms
First index lookup for "multi:A": 99 ms
Subsequent 91 index lookups average response times: 2 ms

Used memory with sort, facet and index lookup structures intact: 685 MB
Total test time: 14:00 minutes
</pre>
<p>As can be seen, the initialization time is very high with many documents. Using a hybrid approach with indexed collator keys helps a lot but is &#8211; for now &#8211; incompatible with lexicographically sorted faceting and index lookup as they require that human readable terms are returned. This can be solved without extra memory overhead by having parallel structures in the index for the terms and the collator keys for the terms, but the code has not been written yet. Simulating the parallel structures idea by sorting in term natural order, we get the following measurements.</p>
<h3>1 million documents, term natural order sort</h3>
<pre>
Index = /home/te/projects/index1M (1000000 documents)
used heap after loading index and performing a simple search: 6 MB
Maximum possible memory (Runtime.getRuntime().maxMemory()): 88 MB

First natural order sorted search for "even:true" with 500000 hits: 2586 ms
Subsequent 5 sorted searches average response time: 23 ms

Facet pool acquisition for for "even:true" with structure groups(
  group(name=sorted, order=index, locale=null, fields(a)),
  group(name=count, order=count, locale=null, fields(a)),
  group(name=multi, order=count, locale=null, fields(facet))): 0:44 minutes
First faceting for even:true: 120 ms
Subsequent 4 faceting calls (count caching disabled) response times: 66 ms

Initial lookup pool request (might result in structure building): 0:15 minutes
First index lookup for "even:true": 26 ms
Subsequent 91 index lookups average response times: 0 ms

First natural order sorted search for "multi:A" with 95222 hits: 6 ms
Subsequent 5 sorted searches average response time: 4 ms

Facet pool acquisition for for "multi:A" with structure groups(
  group(name=sorted, order=index, locale=null, fields(a)),
  group(name=count, order=count, locale=null, fields(a)),
  group(name=multi, order=count, locale=null, fields(facet))): 0 ms
First faceting for multi:A: 22 ms
Subsequent 4 faceting calls (count caching disabled) response times: 20 ms

Initial lookup pool request (might result in structure building): 0 ms
First index lookup for "multi:A": 7 ms
Subsequent 91 index lookups average response times: 0 ms

Used memory with sort, facet and index lookup structures intact: 66 MB
Total test time: 1:03 minutes
</pre>
<h3>10 million documents, term natural order sort</h3>
<pre>
Index = /home/te/projects/index10M (10000000 documents)
used heap after loading index and performing a simple search: 25 MB
Maximum possible memory (Runtime.getRuntime().maxMemory()): 910 MB

First natural order sorted search for "even:true" with 5000000 hits: 0:22 minutes
Subsequent 5 sorted searches average response time: 229 ms

Facet pool acquisition for for "even:true" with structure groups(
  group(name=sorted, order=index, locale=null, fields(a)),
  group(name=count, order=count, locale=null, fields(a)),
  group(name=multi, order=count, locale=null, fields(facet))): 4:46 minutes
First faceting for even:true: 865 ms
Subsequent 4 faceting calls (count caching disabled) response times: 788 ms

Initial lookup pool request (might result in structure building): 1:41 minutes
First index lookup for "even:true": 318 ms
Subsequent 91 index lookups average response times: 2 ms

First natural order sorted search for "multi:A" with 947218 hits: 101 ms
Subsequent 5 sorted searches average response time: 51 ms

Facet pool acquisition for for "multi:A" with structure groups(
  group(name=sorted, order=index, locale=null, fields(a)),
  group(name=count, order=count, locale=null, fields(a)),
  group(name=multi, order=count, locale=null, fields(facet))): 0 ms
First faceting for multi:A: 220 ms
Subsequent 4 faceting calls (count caching disabled) response times: 220 ms

Initial lookup pool request (might result in structure building): 0 ms
First index lookup for "multi:A": 92 ms
Subsequent 91 index lookups average response times: 2 ms

Used memory with sort, facet and index lookup structures intact: 648 MB
Total test time: 6:58 minutes
</pre>
<h3>Observations</h3>
<p>It seems obvious that using indexed collator keys is the way to go. Simulated here by sorting in term natural order, it roughly halved the total test time: from 1:57 to 1:03 minutes for 1 million documents and from 14:00 to 6:58 minutes for 10 million. The drawbacks are slightly increased indexing time and increased index size.</p>
<p>The current proof of concept plays nicely with reopening of the index. This means that an update to the index only requires a partial refresh of the overall structure. In order to do so, some intermediate structures need to be held in memory. Disabling this caching makes reopening just as expensive as the initial open, but frees some memory. The amount of freed memory has not been measured yet, but I would guess about 1/4-1/3 of total memory usage, depending on the index.</p>
<h2>In perspective</h2>
<p>The sorting part in itself required about <a href="https://issues.apache.org/jira/browse/LUCENE-2369">1/3 of the size needed by standard Lucene natural order sort</a>. It is more difficult to compare the faceting and index lookup as Lucene does not provide either. Solr handles faceting, so I hope to find the time to make a comparison with that product.</p>
<p>To boil it down, the proof of concept makes it possible to provide faceting on authors, sorting on title and a bit on the side for an index with 20 million documents on a 2 GB machine. Scaling up is theoretically <code>n log(n)</code> in time but less than that in the tests above, probably due to caching and JITting.</p>
<h2>Try it for yourself</h2>
<p>Remember, this is a proof of concept. Check out Lucene trunk and apply the patch at <a href="https://issues.apache.org/jira/browse/LUCENE-2369">LUCENE-2369</a>. This contains an interactive proof of concept for the sorting part and a non-interactive unit test that builds an index and performs faceting and index lookup. Look in the &#8220;exposed&#8221;-folders in src and test.</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2010/09/24/sorting-faceting-index-lookup/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>
	</item>
		<item>
		<title>ECDL 2010</title>
		<link>http://sbdevel.wordpress.com/2010/09/16/ecdl-2010/</link>
		<comments>http://sbdevel.wordpress.com/2010/09/16/ecdl-2010/#comments</comments>
		<pubDate>Thu, 16 Sep 2010 06:35:28 +0000</pubDate>
		<dc:creator>esbenab</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[Statsbiblioteket]]></category>
		<category><![CDATA[Bolette]]></category>
		<category><![CDATA[ECDL2010]]></category>
		<category><![CDATA[Esben]]></category>
		<category><![CDATA[Glasgow]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=866</guid>
		<description><![CDATA[ECDL Conference Digital Resources Dep. People report from ECDL2010 Have been meaning to write about our fascinating conference experience for days! This is it. The first keynote was by Susan Dumais from Microsoft Research on a study in retrieval of dynamic web information. It was interesting to hear about users reusing the same queries and [...]]]></description>
			<content:encoded><![CDATA[<h1>ECDL Conference</h1>
<h1>Digital Resources Dep. People report from ECDL2010</h1>
<p>Have been meaning to write about our fascinating conference experience for days! This is it.</p>
<p>The first keynote was by Susan Dumais from Microsoft Research on a study in retrieval of dynamic web information. It was interesting to hear about users reusing the same queries and revisiting the same pages but in different re-visitation patterns (fast, hybrid, medium, slow). Esben has some very nice drawings in his notes <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  Also 43% of search activity is repeat clicks or repeat queries. Also have handwritten notes with nice table. The DiffIE plug-in highlights the changes since the user last visited a page, and the study showed that users noticed more changes using DiffIE and revisited pages more. This could also be used to see new results on the same query in a search interface.</p>
<p>After the keynote, Esben went to the session on metadata and Bolette went to a steering committee meeting.</p>
<h2>Metadata Session</h2>
<p>User contributed descriptive metadata for libraries and cultural institutes:<br />
Easing user interaction by using known &#8220;portals&#8221; for allowing users to contribute knowledge.<br />
Exposing hidden collections, with very little or no metadata, to the public, in order to use crowd sourcing to enrich the data.<br />
<strong><a href="http://www.loc.gov/rr/print/flickr_report_final.pdf">report</a></strong><br />
Four categories:<br />
1. Personal &amp; historical<br />
2. Link out<br />
3. Corrections &amp; translations<br />
4. Link in</p>
<p>Crowd cataloguing is NOT professional; it will be entered into the metadata with flickr.com as source&#8230; (will flickr.com be known in 100 years?)</p>
<h2>Steering Committee Meeting</h2>
<p>The SC meeting raised some interesting discussions. The TPDL charter discussion was mostly on organisational details. The discussion on how to get good quality papers in an interdisciplinary conference was more interesting. The challenge in cross-field conferences is that we try to write papers that can be read by everyone in the community. This means that we leave out the important technical parts specific to our own field and the paper ends up more superficial and somehow less scientific&#8230; We also talked about introducing multiple themes in next year&#8217;s conference: TPDL2011. And about a number of other things&#8230;</p>
<p>Next there was a panel session.</p>
<h2>Panel session</h2>
<h4>Developing services to support research data management and sharing</h4>
<p><span style="font-weight:normal;">Mandatory data management plans in research grant applications are more and more common. There is a need for a &#8220;creative commons&#8221; approach to the curation and preservation of research data.<br />
There are trust issues around who holds the data that is curated and preserved, both in relation to confidentiality and to the data actually being preserved.</span></p>
<p><span style="font-weight:normal;">Transparency in data records.<br />
Peer review hardly ever touches on the gathering of data and its relation to the results&#8230;<br />
What is being done to ease the sharing of person-sensitive data? Not much at present, and it is beginning to become an issue. Recommendations are to consider this early on in the grant application process.<br />
Measures must be taken in order to ensure consent for data sharing and archiving! At present ethics advisors recommend destroying the datasets upon the end of person-sensitive research.<br />
Prepare for sharing up front!</span></p>
<p><span style="font-weight:normal;">Problems exist in the preservation and curation of datasets: how will the &#8220;user guide&#8221; to the data be preserved for complex datasets?</span></p>
<p><span style="font-weight:normal;">This area needs to write and talk about the successes of data sharing.</span></p>
<h2>Evening Program</h2>
<p>After the panel it was time for an extended coffee break, meeting an old friend and relaxing before the evening program. The welcome reception at Glasgow City Chambers was amazing! Well, the reception was nice as receptions are. Talked to nice people, had a bit of wine, listened to an official city representative welcoming the conference to his city and talked to more nice people, but the building was absolutely amazing! And the nice Indian student helper gave us directions and reserved a table at a nearby Indian restaurant for us, and we had a great dinner!</p>
<h1>Wednesday</h1>
<p>The second keynote by Liina Munari from the European Commission was on EU-supported DL projects and included a lot of slides with a lot of bullet points&#8230;</p>
<p>Next Bolette went to the Digital Preservation session and Esben went to the Web 2.0 session.</p>
<h2>Digital Preservation Session</h2>
<p>The first presentation in the digital preservation session was about Hoppla, which is a system designed to help small institutions with logical preservation: <a href="http://www.ifs.tuwien.ac.at/dp/hoppla">http://www.ifs.tuwien.ac.at/dp/hoppla</a></p>
<p>Then there was a presentation on estimating digitization costs and one on a vocabulary for preserved new media art. New media art, with both digital and physical elements and maybe interactivity, presents interesting preservation challenges, and maybe for this medium we should talk about facilitating recreation or re-performance rather than preservation.</p>
<h2>Web 2.0 session</h2>
<h3>Privacy aware folksonomies</h3>
<p>The privacy problems in folksonomies are, to a large extent, related to the fact that the provider has all the knowledge about all the users and all their tags. This can become a real privacy nightmare when tags are analysed in order to associate users in groups: it is likely that all users who tagged images with &#8220;<code>bobs_wedding</code>&#8221; also attended said wedding. Imagination and varying degrees of evil seem to be the limiting factors in the possibilities of data mining.</p>
<p>A possible solution combining cryptography and data split over several databases was shown; it allowed a granularity of access between users, friends and providers.</p>
<h3>SEA<span style="font-weight:normal;">mless</span> WE<span style="font-weight:normal;">b</span> ED<span style="font-weight:normal;">iting for curated content </span>SEAWEED</h3>
<p>A project to release the user from the chains of the <code>edit-&gt;review-&gt;publish</code> process. It&#8217;s basically just a new take on web editing, but looks promising, especially in conjunction with annotation.<br />
Details are available <a href="http://wordpress.org/extend/plugins/seaweed/">here</a> (WP plugin) and <a href="http://www.seaweed-editor.com/">here</a>.</p>
<h2>Poster Session</h2>
<p>Before lunch the somewhat nervous Esben did a very good poster spotlight presentation ending with &#8220;And this was the&#8230;&#8221; Look at watch&#8230; &#8220;50 second presentation of my poster.&#8221; And he did get the audience to laugh <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  After lunch was the poster session and quite a few people came to talk to Esben about his work. Some people also came to talk about his fascinating fabric poster, especially the people who were presenting posters themselves and had been travelling with big paper posters rolled up. Bolette thinks Esben got a number of good ideas for further work but unfortunately has too little time&#8230; Bolette also went to see some of the other posters and talk to some people.</p>
<p>Once again the late afternoon session lost to the need for relaxing. We went shopping on the way back to the hotel. Whisky, book, dvd. Slept for 45 minutes and walked via the botanical garden to the conference dinner, which was at an old church converted to restaurant complete with whisky bar in the cellar. It was very impressive and we talked about whether it would be a good idea for some Danish churches, since we don&#8217;t seem to be using them all that much for their intended purpose anymore. Oh, and the bagpipe was fun and the food was good.</p>
<h1>Thursday</h1>
<p>By Thursday everyone was tired and Theatre 1 didn&#8217;t seem quite as full as on the previous days. We went to the session on query log analysis and the one on ontologies, and TPDL2011 was announced. And then we spent 11 hours travelling home!</p>
<h2>Query log analysis session</h2>
<h3>Determining time of queries for re-ranking search results</h3>
<p>This talk focused on how the time of a search and the implicit information in the search terms can be analysed and used in retrieving the correct data entries, and on how the terminology changes over time. The term &#8220;<code>Hillary Clinton 1997-2002</code>&#8221; should return many of the same results as &#8220;<code>New York Senator</code>&#8221; and &#8220;<code>First Lady</code>&#8221;.<br />
The team had found that <strong>1.5%</strong> of searches, e.g. &#8220;<code>Presidential election 2008</code>&#8221;, use explicit time queries, and <strong>7%</strong> use implicit time queries, e.g. &#8220;<code>German world cup</code>&#8221;.<br />
As many events occur at a specific time, or at least over short periods, the researchers introduced <em>time granularity</em>, which returned relevant search results for the period of granularity, e.g. <code>[01/01/2004;01/01/2005]</code>.<br />
Questions were asked regarding problems with events at or near the borders of a time granule. This was an issue they were aware of; a possible solution could be double overlapping time granules, <code>[01/01/2004;01/01/2005]</code> and <code>[01/06/2004;01/06/2005]</code>, allowing for inclusion of search results otherwise omitted.</p>
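<p>The double overlapping windows from the discussion could be sketched along these lines. This is only our own illustration of the idea, not the researchers&#8217; method: it assumes year-long windows, reads the dates as day/month/year so that the two series start on 1 January and 1 June, and picks the window in which an event lies farthest from a border. All names are made up.</p>

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

class OverlappingWindowsSketch {
    /** Start of the year-long window containing the date, for the series starting 1 January. */
    static LocalDate januaryWindowStart(LocalDate date) {
        return LocalDate.of(date.getYear(), 1, 1);
    }

    /** Start of the year-long window containing the date, for the series starting 1 June. */
    static LocalDate juneWindowStart(LocalDate date) {
        LocalDate start = LocalDate.of(date.getYear(), 6, 1);
        return date.isBefore(start) ? start.minusYears(1) : start;
    }

    /** Days from the date to the nearest border of the window beginning at start. */
    static long distanceToBorder(LocalDate date, LocalDate start) {
        long fromStart = ChronoUnit.DAYS.between(start, date);
        long toEnd = ChronoUnit.DAYS.between(date, start.plusYears(1));
        return Math.min(fromStart, toEnd);
    }

    /** Picks the window, from either series, in which the date lies farthest from a border. */
    static LocalDate bestWindowStart(LocalDate date) {
        LocalDate january = januaryWindowStart(date);
        LocalDate june = juneWindowStart(date);
        return distanceToBorder(date, january) >= distanceToBorder(date, june) ? january : june;
    }

    public static void main(String[] args) {
        // An event two days before the border of the January series
        LocalDate event = LocalDate.of(2004, 12, 30);
        LocalDate start = bestWindowStart(event);
        System.out.println(start + " to " + start.plusYears(1));
    }
}
```

<p>An event on 30 December 2004 sits two days from the border of the January window but well inside the June window, so results on both sides of New Year are kept together.</p>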
<p>Query transformation in cultural metadata.</p>
<p>User contributed descriptive metadata for libraries and cultural institutions.</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2010/09/16/ecdl-2010/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">esbenab</media:title>
		</media:content>
	</item>
	</channel>
</rss>
