<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:georss="http://www.georss.org/georss" xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Software Development at Statsbiblioteket &#187; Summa</title>
	<atom:link href="http://sbdevel.wordpress.com/category/hacking/summa/feed/" rel="self" type="application/rss+xml" />
	<link>http://sbdevel.wordpress.com</link>
	<description>A peekhole into the life of the software development department at the State Library of Denmark</description>
	<lastBuildDate>Sat, 18 May 2013 20:58:36 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.com/</generator>
<cloud domain='sbdevel.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://s2.wp.com/i/buttonw-com.png</url>
		<title>Software Development at Statsbiblioteket &#187; Summa</title>
		<link>http://sbdevel.wordpress.com</link>
	</image>
	<atom:link rel="search" type="application/opensearchdescription+xml" href="http://sbdevel.wordpress.com/osd.xml" title="Software Development at Statsbiblioteket" />
	<atom:link rel='hub' href='http://sbdevel.wordpress.com/?pushpress=hub'/>
		<item>
		<title>Large facet corpus, small result set</title>
		<link>http://sbdevel.wordpress.com/2013/01/23/large-facet-corpus-small-result-set/</link>
		<comments>http://sbdevel.wordpress.com/2013/01/23/large-facet-corpus-small-result-set/#comments</comments>
		<pubDate>Wed, 23 Jan 2013 20:34:24 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[eskildsen]]></category>
		<category><![CDATA[Faceting]]></category>
		<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Low-level]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[Solr]]></category>
		<category><![CDATA[Summa]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=1023</guid>
		<description><![CDATA[Primer Each time a user issues a search in our primary corpus, we perform faceting on 19 different fields. Some of those fields have a low amount of unique values (year, language, classification), some of them are quite heavy (author, title, semi-free keywords). We have a grand total of 38 million unique tags and 160 [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=1023&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<h2>Primer</h2>
<p>Each time a user issues a search in our primary corpus, we perform faceting on 19 different fields. Some of those fields have a low number of unique values (year, language, classification), some of them are quite heavy (author, title, semi-free keywords). We have a grand total of 38 million unique tags and 160 million references from 11 million documents to the tags displayed as part of the faceting.</p>
<p>The way our faceting works is simple in principle: When a search is performed, an array of counters (a basic <code>int[]</code>) is allocated. The array contains a counter for each unique tag (38 million in our case). All the internal IDs for the documents from the full search result are extracted and iterated. For each document ID, the reference structure provides the tag IDs, which correspond exactly to entries in the counter array. For each tag ID, the counter for said tag is incremented. When all document IDs have been iterated, the counters are iterated and the ones with the highest counts are extracted.</p>
<p>If you followed that, congratulations. So, performance-wise, what is wrong with that approach? Yeah, the title was kind of a giveaway. Even if the search results in just a single hit, with just a single tag, we still need to iterate the full 38 million counters in order to extract the facet result. We need to clear it too, before the next run, so we&#8217;re really talking 2 complete iterations, one of them involving sorting logic. Ouch. Or to be more precise: About 300ms of ouch.</p>
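<p>The iterate-everything approach can be sketched roughly as below. This is a minimal illustration, not our production code: the class name <code>DenseFacetCounter</code>, the <code>int[][]</code> stand-in for the reference structure and the single-winner <code>topTag</code> extraction are all assumptions made for the example.</p>

```java
import java.util.Arrays;

/** Hypothetical sketch of the plain "iterate everything" facet counter. */
final class DenseFacetCounter {
    private final int[] counters;     // one counter per unique tag
    private final int[][] docToTags;  // assumed stand-in for the reference structure: docID -> tag IDs

    DenseFacetCounter(int uniqueTags, int[][] docToTags) {
        this.counters = new int[uniqueTags];
        this.docToTags = docToTags;
    }

    /** Increments the counter of every tag referenced by the given documents. */
    void count(int[] docIds) {
        for (int docId : docIds) {
            for (int tagId : docToTags[docId]) {
                counters[tagId]++;
            }
        }
    }

    /** Returns the tag with the highest count, or -1 if nothing was counted. Scans all counters. */
    int topTag() {
        int best = -1;
        int bestCount = 0;
        for (int tagId = 0; tagId < counters.length; tagId++) {
            if (counters[tagId] > bestCount) {
                bestCount = counters[tagId];
                best = tagId;
            }
        }
        return best;
    }

    int countFor(int tagId) {
        return counters[tagId];
    }

    /** Clears all counters before the next search. Touches every entry again. */
    void clear() {
        Arrays.fill(counters, 0);
    }
}
```

<p>Note that both <code>topTag</code> and <code>clear</code> visit every counter, no matter how few documents were passed to <code>count</code>. That is the fixed cost discussed above.</p>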
<p>So what do we do? Well, if we know that our result set is small, we could use a simple HashMap to hold our counters; with tag-IDs as keys and the counters themselves as values. We tried that some years ago. It did sorta-kinda work, but that approach had significant drawbacks:</p>
<ul>
<li>HashMaps are memory pigs and they tax the garbage collector. We do not want to insert hundreds of thousands of objects into them, when they are just used for highly temporary counting.</li>
<li>We need to guess the size of the result from the start. Not just the number of hits in the search result, but the number of tags that they refer to collectively, so as not to get unwieldy HashMaps. If we guess wrong, we need to start over or copy the values from the map into our full counting structure.</li>
</ul>
<p>We abandoned our HashMap based sparse counter approach as our experiments showed that the dumb &#8220;just iterate everything all the time&#8221; performed better for most of our searches.</p>
<h2>New idea</h2>
<p>Summing up the requirements for a faceting system where tag extraction performance is dependent on the number of found tags, rather than using a fixed amount of time:</p>
<ul>
<li>It should work without knowing the number of tags in advance.</li>
<li>It should not tax the heap nor the garbage collector excessively.</li>
<li>Extraction time should be linear in the number of marked tags (we accept <code>O(n*log2(n))</code> for sorted extraction).</li>
</ul>
<p>Mikkel Kamstrup Erlandsen kindly pointed me to the article <a href="http://research.swtch.com/sparse" title="Using Uninitialized Memory for Fun and Profit">Using Uninitialized Memory for Fun and Profit</a>. With a little tweaking, a simplified version should satisfy all the requirements. We will build a counting structure that can switch seamlessly from sparse to non-sparse representation.</p>
<p>For this, we introduce yet another array: The <i>tag count tracker</i>. It holds pointers into the <i>tag count array</i> from before. Its length is the cutoff for when to use sparse counting and when to use full counting and must be determined by experiments.</p>
<p>When the count for a tag for a document needs to be incremented, we start by loading the old count from our <i>tag count array</i> (we need to do this anyway in order to add 1 to the count). If the count is 0, the position of the counter is added to the <i>tag count tracker</i>. If this overflows the <i>tag count tracker</i>, we switch to standard counting and completely ignore the <i>tag count tracker</i> hereafter. Either way, the value from the <i>tag count array</i> is incremented and stored back into the array as we would normally do.</p>
<p>When all the tags for all the documents have been iterated, the <i>tag count tracker</i> (if not overflowed) contains a complete list of all the tags that have a count of 1 or more. The tag extractor needs only iterate those and, just as important, needs only clear those. If the <i>tag count tracker</i> was overflowed, the old iterate-everything approach is used. As for clearing the <i>tag count tracker</i> for next use, it is simply a matter of setting its logical length (a single int) to 0. Presto!</p>
<p>Now it just needs to be implemented.</p>
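<p>Until then, here is a minimal sketch of how the idea might look. It is not the real implementation: the class name <code>SparseFacetCounter</code>, its method names and the way the tracker size is passed in are assumptions made for the example.</p>

```java
import java.util.Arrays;

/** Hypothetical sketch of a counter that switches from sparse to full counting. */
final class SparseFacetCounter {
    private final int[] counts;   // the tag count array: one counter per unique tag
    private final int[] tracker;  // the tag count tracker: positions in counts that are non-zero
    private int tracked = 0;      // logical length of the tracker
    private boolean overflowed = false;

    SparseFacetCounter(int uniqueTags, int trackerSize) {
        this.counts = new int[uniqueTags];
        this.tracker = new int[trackerSize];
    }

    /** Adds 1 to the count for the tag and remembers the tag if this is its first hit. */
    void increment(int tagId) {
        int old = counts[tagId];
        if (old == 0 && !overflowed) {
            if (tracked == tracker.length) {
                overflowed = true;            // too many distinct tags: ignore the tracker from now on
            } else {
                tracker[tracked++] = tagId;
            }
        }
        counts[tagId] = old + 1;
    }

    boolean isSparse() {
        return !overflowed;
    }

    int get(int tagId) {
        return counts[tagId];
    }

    /** Returns the IDs of all tags with a count of 1 or more. */
    int[] markedTags() {
        if (!overflowed) {
            return Arrays.copyOf(tracker, tracked);  // cost depends only on the marked tags
        }
        int n = 0;                                   // fallback: the old iterate-everything approach
        for (int c : counts) {
            if (c != 0) {
                n++;
            }
        }
        int[] result = new int[n];
        int j = 0;
        for (int tagId = 0; tagId < counts.length; tagId++) {
            if (counts[tagId] != 0) {
                result[j++] = tagId;
            }
        }
        return result;
    }

    /** Resets the structure, touching only the tracked counters when possible. */
    void clear() {
        if (overflowed) {
            Arrays.fill(counts, 0);
        } else {
            for (int i = 0; i < tracked; i++) {
                counts[tracker[i]] = 0;
            }
        }
        tracked = 0;
        overflowed = false;
    }
}
```

<p>In the sparse case <code>markedTags</code> returns the tags in the order they were first seen, so sorted extraction still needs a sort on top. The tracker size passed to the constructor is the cutoff that has to be found by experiments.</p>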
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2013/01/23/large-facet-corpus-small-result-set/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/a8d49de9ea76368a1a2d9b9a1c975ea5?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>
	</item>
		<item>
		<title>Fire fire everywhere</title>
		<link>http://sbdevel.wordpress.com/2011/09/21/fire-fire-everywhere/</link>
		<comments>http://sbdevel.wordpress.com/2011/09/21/fire-fire-everywhere/#comments</comments>
		<pubDate>Wed, 21 Sep 2011 13:13:10 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Summa]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=1009</guid>
		<description><![CDATA[Searching at Statsbiblioteket has been slow the last couple of days and the condition has grown progressively worse. Yesterday evening and this morning, using the system required the patience of a parent. Consequently the full tech stack congregated at the office (maintenance BOFH, backend bit-fiddler, web service hacker and service glue guy) hell-bent on killing [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=1009&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Searching at Statsbiblioteket has been slow the last couple of days and the condition has grown progressively worse. Yesterday evening and this morning, using the system required the patience of a parent. Consequently the full tech stack congregated at the office (maintenance BOFH, backend bit-fiddler, web service hacker and service glue guy) hell-bent on killing bugs.</p>
<p>A hastily cobbled together log propagation thingamabob claimed that the backend answered snappily enough, network inspection showed a very high number of requests to storage (a database that contains the records backing the search) and timeouts &amp; session pool overflows were all over the place. The DidYouMean service was appointed scapegoat and killed with extreme prejudice.</p>
<p>Things got exponentially worse! Uptime was about 2 minutes, with last minute performance quickly falling off to unusable. Phones started ringing, emails ticked in and an angry mob with pitchforks laid siege to the office. Inspection revealed that killing DidYouMean meant that the service layer unsuccessfully tried to get the result for 20 seconds (yes, that timeout was far too high) before giving up, quickly filling Apache session pools. DidYouMean was resurrected, services started up again and all was well. Or at least back to where it was before the wrong service was unfairly executed.</p>
<div id="attachment_1011" class="wp-caption center" style="width: 310px"><a href="http://sbdevel.files.wordpress.com/2011/09/20110921-0844_madsstaaropogkoder_e.jpg"><img src="http://sbdevel.files.wordpress.com/2011/09/20110921-0844_madsstaaropogkoder_e.jpg?w=300&#038;h=224" alt="Mads coding" title="Mads stands and codes" width="300" height="224" class="size-medium wp-image-1011" /></a><p class="wp-caption-text">Mads stands up and codes. A sure sign of high alert</p></div>
<p>Waiting for the next hammer to drop, code was reviewed (again), pool sizes were tweaked and logs were watched intensely. At 12:09:47 and 952 milliseconds, the impact riveter started again and storage staggered. But lo and behold: The maintenance guy had changed log levels to DEBUG (for a limited amount of time). An hour and 20,000 observed requests for the exact same ID later, the magical incantation <code>i++;</code> was inserted in a while loop. Testing, deployment, re-deployment, tomcat restart and another tomcat restart followed quickly.</p>
<p>It turned out that certain rare records triggered the endless loop. The progressively worse performance stemmed from more and more of these edge cases piling up, each looping forever. As the overwhelmed storage was on the same server as the searcher, the shared request pool was flooded with storage requests, only occasionally allowing search requests.</p>
<p>With the roaring fire quelled, business returned to normal. By pure coincidence, the assignment for the next days is vastly improved propagation, logging and visualisation of timing information throughout the services.</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2011/09/21/fire-fire-everywhere/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/a8d49de9ea76368a1a2d9b9a1c975ea5?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2011/09/20110921-0844_madsstaaropogkoder_e.jpg?w=300" medium="image">
			<media:title type="html">Mads stands and codes</media:title>
		</media:content>
	</item>
		<item>
		<title>The right tool &#8211; neo4j?</title>
		<link>http://sbdevel.wordpress.com/2011/05/06/the-right-tool-neo4j/</link>
		<comments>http://sbdevel.wordpress.com/2011/05/06/the-right-tool-neo4j/#comments</comments>
		<pubDate>Fri, 06 May 2011 14:10:45 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[eskildsen]]></category>
		<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Low-level]]></category>
		<category><![CDATA[Summa]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=989</guid>
		<description><![CDATA[Background We use a backing storage for documents in our home brewed search system Summa. It was supposed to be a trivial key-value store with document-IDs resolving to documents. Then an evil librarian pointed out that books are related to each other, so we had to add some sort of relational mapping. For some years [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=989&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<h2>Background</h2>
<p>We use a backing storage for documents in our home-brewed search system Summa. It was supposed to be a trivial key-value store with document-IDs resolving to documents. Then an evil librarian pointed out that books are related to each other, so we had to add some sort of relational mapping. For some years we have used relational databases for this, going from <a href="http://www.postgresql.org/">PostgreSQL</a> through <a href="http://db.apache.org/derby/">Derby</a> and landing on <a href="http://www.h2database.com/html/main.html">H2</a>, which has served us fairly well. Documents with relations were a bit slow to resolve but they made up less than 5% of the full corpus, so for all practical purposes they presented no problem.</p>
<p>Fast forward to a week ago, when we added a new target to our integrated search: one million fresh documents, or a 10% increase in total document count. Unfortunately most of these documents were related to each other, increasing our total relation count by 2000%. Our full indexing time soared from &#8220;It&#8217;ll be done before noon&#8221; to &#8220;I hope it finishes before tomorrow&#8221;. Ouch!</p>
<p>A long discussion about database design and indexes followed. The conclusion was that we really did not use H2 for what it was good at and that maybe we should look at a graph-oriented database.</p>
<h2>Implementing storage with neo4j</h2>
<p><a href="http://neo4j.org/">Neo4j</a> is an open source graph database. It is written in Java and can be used as an embedded application. This is important for us as we like our packages to be self-contained. Friday is work-on-whatever-project-you-think-can-benefit-Statsbiblioteket-day, so I dedicated it to trying out Neo. Keep in mind that I only heard about Neo this morning and that I hadn&#8217;t looked at a single page about the product before I started the project.</p>
<ul>
<li><b>10:36</b> Created test project</li>
<li><b>10:40</b> Selected Neo4j version 1.3 stable Enterprise, added to Maven POM, downloaded files</li>
<li><b>10:50</b> Created skeleton class for Summa NeoStorage</li>
<li><b>10:55</b> Added basic properties, created skeleton Unit test</li>
<li><b>11:15</b> Added code for flushing Summa Record to Neo</li>
<li><b>11:36</b> Added code for retrieving a previously flushed record + unit test. Hello world completed</li>
<li><b>11:40</b> Break finished, started on ModificationTime retrieval</li>
<li><b>12:35</b> Proof of concept retrieval using modification time (slow development due to human error)</li>
<li><b>13:20</b> Finished bulk ingest and record-by-record extraction using modification time order, including unit test</li>
<li><b>13:50</b> Finished mapping of Summa Record child-relations to Neo, both for ingest and extraction</li>
</ul>
<p>Including mistakes, distractions from colleagues and reading documentation, it took under 4 hours to integrate Neo as the new backend for our storage. There are still a lot of minor things to add and special cases to cover, but the result is complete enough to be used for most workflows in Summa.</p>
<p>Implementation was a breeze, the API was very clean and the examples and guides at the website were to the point and well thought out. Kudos to the developers for great work!</p>
<h2>Performance peek</h2>
<p>Since the Neo Storage isn&#8217;t finished, it would be unfair to compare it to H2 with regard to ingest. However, the extraction part is complete enough to test.</p>
<p>With the old H2-backed storage, our extraction speed fell to below 5 records/sec for the new documents with 2-4 relations each (extraction speed for records without relations is 2000-3000 records/sec).</p>
<p>Creating a test storage using Neo with 100,000 documents, each with 5 children, changed the extraction speed to 2500 expanded records/second, or 15,000 raw records/second if we count the children (each expanded record is 1 parent plus 5 children).</p>
<p>Granted, the only fair test is with production data, but so far Neo4j looks like a clear winner for our purposes!</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2011/05/06/the-right-tool-neo4j/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/a8d49de9ea76368a1a2d9b9a1c975ea5?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>
	</item>
		<item>
		<title>The road ahead</title>
		<link>http://sbdevel.wordpress.com/2010/11/26/the-road-ahead/</link>
		<comments>http://sbdevel.wordpress.com/2010/11/26/the-road-ahead/#comments</comments>
		<pubDate>Fri, 26 Nov 2010 08:08:03 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[Statsbiblioteket]]></category>
		<category><![CDATA[Summa]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=958</guid>
		<description><![CDATA[When we&#8217;re Out There at conferences or visiting other libraries, we&#8217;re sometimes asked why we bother &#8220;developing our own search engine&#8221;. It is a really good question. One that we ask ourselves from time to time, to make sure we&#8217;re on the right track. One of the reasons why we&#8217;ve been rolling our own instead [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=958&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>When we&#8217;re Out There at conferences or visiting other libraries, we&#8217;re sometimes asked why we bother &#8220;developing our own search engine&#8221;. It is a really good question. One that we ask ourselves from time to time, to make sure we&#8217;re on the right track.</p>
<p>One of the reasons why we&#8217;ve been rolling our own instead of taking the obvious Solr-road is historical. Solr didn&#8217;t meet our needs when we started. It still doesn&#8217;t in its current form, but it&#8217;s getting close. Another reason is that the scope of Summa is different from Solr’s: We&#8217;ve invested a lot of energy in handling data-flow with transformations in a multi-stage system with a persistent storage service that handles parent-child relations and keeps track of updates.</p>
<p>The interesting thing here is that Summa and Solr are not necessarily in competition. Solr has a lot of great analyzers and has gained a lot of momentum during the last years. We would therefore like to use Solr at the core of Summa instead of the custom Lucene searcher we have currently. At Statsbiblioteket we have done a fair amount of work with <a href="https://issues.apache.org/jira/browse/LUCENE-2369">low-memory index-lookup, sorting and hierarchical faceting</a> which we’d like to take with us, so this calls for participation in Solr development.</p>
<h2>Relevance Correlation between systems</h2>
<p>Statsbiblioteket has recently begun collaboration with the Summon team from Serial Solutions. Due to Danish legislation we are not allowed to let Summon index our own data, so the upcoming search setup needs to merge results from Summon with our own. As both systems sort the results by relevance ranking, proper merging is quite hard. This is due to the fact that relevance scores are not comparable between different systems. One solution is to perform a local comparison of the returned fields in the results, but as this completely ignores the whole term statistics based ranking of Lucene/Solr, we would like something better.</p>
<p>Up until now we have been spoiled by using relevance ranked integrated search across different sources where we controlled the full indexing process.  <a href="https://issues.apache.org/jira/browse/SOLR-1632">SOLR-1632</a> seems to be the right way to provide similar functionality for Solr, but it is not mature yet and is quite a large thing to put into production. Better to walk before running. As part of our collaborative agreement Serial Solutions has agreed to deliver statistics about their metadata to us on an experimental basis. With this we can hopefully do better merging and issue boosted queries to get the hits that are relevant to us, thus approximating the results we would have gotten if all metadata had been in a single index.</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2010/11/26/the-road-ahead/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/a8d49de9ea76368a1a2d9b9a1c975ea5?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>
	</item>
		<item>
		<title>New Blood</title>
		<link>http://sbdevel.wordpress.com/2010/03/02/new-blood/</link>
		<comments>http://sbdevel.wordpress.com/2010/03/02/new-blood/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 08:21:59 +0000</pubDate>
		<dc:creator>villadsen</dc:creator>
				<category><![CDATA[People]]></category>
		<category><![CDATA[Statsbiblioteket]]></category>
		<category><![CDATA[Summa]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=826</guid>
		<description><![CDATA[Mikkel &#8211; one of the core Summa developers &#8211; has decided to move on. He will be working for Canonical which has been his dream job for years, so we can only wish him the best of luck. Mikkel has been an integral part of the Summa rewrite that has taken place, and has also [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=826&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>Mikkel &#8211; one of the core Summa developers &#8211; has <a href="http://www.grillbar.org/wordpress/?p=451">decided to move on</a>. He will be <a href="http://www.grillbar.org/wordpress/?p=458">working for Canonical</a> which has been his dream job for years, so we can only wish him the best of luck.</p>
<p>Mikkel has been an integral part of the Summa rewrite that has taken place, and has also had the very important role of being our release manager and all around &#8220;glue guy&#8221;. He will be greatly missed.</p>
<p>However we have managed to grab Henrik &#8211; a developer from another section &#8211; to come work with us and fill the missing spot. He has spent the last month (along with Toke) making sure that any and all important information has been properly extracted from Mikkel&#8217;s brain. Henrik has proven to be a fast learner and he has already become deeply involved with our release management, production environment, and the Summa code base in general.</p>
<p>So a big welcome to Henrik. He will be a great addition to the team.</p>
]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2010/03/02/new-blood/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/7a08f53168eb27dde27ffd8a08845a13?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">villadsen</media:title>
		</media:content>
	</item>
		<item>
		<title>Experimental GUI for code4lib</title>
		<link>http://sbdevel.wordpress.com/2010/02/25/experimental-gui-for-code4lib/</link>
		<comments>http://sbdevel.wordpress.com/2010/02/25/experimental-gui-for-code4lib/#comments</comments>
		<pubDate>Thu, 25 Feb 2010 10:32:20 +0000</pubDate>
		<dc:creator>michaelpoltorak</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[Hacking]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[Statsbiblioteket]]></category>
		<category><![CDATA[Summa]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[c4l10]]></category>
		<category><![CDATA[experimental]]></category>
		<category><![CDATA[gui]]></category>
		<category><![CDATA[inspiration]]></category>
		<category><![CDATA[search]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=783</guid>
		<description><![CDATA[For this year&#8217;s code4lib Jørn and I had prepared an demo called Kill the Search Button. The demo has an experimental GUI and boasts some of the ideas and concepts we&#8217;ve developed over the last few years when working with our search engine Summa/Search. GUI Basically, the GUI rethinks the traditional way of presenting search [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=783&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<div id="attachment_792" class="wp-caption alignnone" style="width: 460px"><a href="http://sbdevel.files.wordpress.com/2010/02/gui1.jpg"><img class="size-full wp-image-792" title="gui" src="http://sbdevel.files.wordpress.com/2010/02/gui1.jpg?w=450&#038;h=237" alt="Highly experimental GUI" width="450" height="237" /></a><p class="wp-caption-text">Highly experimental GUI.</p></div>
<p>For this year&#8217;s <a title="Code4lib 2010" href="http://code4lib.org/conference/2010/">code4lib</a> Jørn and I had prepared a demo called Kill the Search Button.</p>
<p>The demo has an experimental GUI and boasts some of the ideas and concepts we&#8217;ve developed over the last few years when working with our search engine <a href="http://www.statsbiblioteket.dk/summa/">Summa</a>/<a href="http://www.statsbiblioteket.dk/search/">Search</a>.</p>
<h3>GUI</h3>
<p>Basically, the GUI rethinks the traditional way of presenting search results. The idea is to lay out the search result as a map or a set of tiled pages that the user can navigate, rather than a chunk of separate pages with only one page visible at a time.</p>
<p>As a consequence, the search result is zoomable and scrollable, allowing the user to move smoothly around the search result and to zoom from an overview layout with several columns of pages to a very detailed close-up view.<br />
Zooming is done using a slider. Page transitions are done using scroll animations in order to underline the feeling of moving about the search result.</p>
<p>Pages are loaded in chunks of 10 and new chunks are added on the fly when needed, creating a growing, visible search result.</p>
<p>The search field and navigation box float on top of the search result. They both lose focus when not in use in order to let the search result below become more visible. They can both be dragged around.</p>
<div id="attachment_803" class="wp-caption alignnone" style="width: 460px"><a href="http://sbdevel.files.wordpress.com/2010/02/gui3.jpg"><img class="size-full wp-image-803" title="gui3" src="http://sbdevel.files.wordpress.com/2010/02/gui3.jpg?w=450&#038;h=159" alt="Index lookup" width="450" height="159" /></a><p class="wp-caption-text">Index lookup. For librarians.</p></div>
<h3>Features without a search button</h3>
<p>Besides the GUI we added a number of features to the  demo:</p>
<ul>
<li>Live search: The search result is updated on every key press</li>
<li>Did you mean. Good for typos and spell checking</li>
<li>Suggest. Inspiration from other users&#8217; queries</li>
<li>Index lookup. Mostly for librarians</li>
<li>About-by. Boosting particular fields for different results</li>
</ul>
<p>Usability-wise, the demo illustrates how a smoother search experience can   be achieved simply by updating the search result on key press. By   doing this, the wait for page reload is gone and feedback is instant, allowing the user to   evaluate the search results much more rapidly.</p>
<div id="attachment_801" class="wp-caption alignnone" style="width: 460px"><a href="http://sbdevel.files.wordpress.com/2010/02/gui21.jpg"><img class="size-full wp-image-801" title="gui2" src="http://sbdevel.files.wordpress.com/2010/02/gui21.jpg?w=450&#038;h=353" alt="" width="450" height="353" /></a><p class="wp-caption-text">The about-by slider feature lets the user slide between materials about a person and materials written by a person.</p></div>
<h3>Try it out</h3>
<p>Sounds confusing? Luckily we have a web page! It contains a number of screencasts with Jørn going through the features (go to: <a href="http://developer.statsbiblioteket.dk/kill/code4lib.html">developer.statsbiblioteket.dk/kill/code4lib.html</a>), but you can also give it a spin yourself: <a href="http://developer.statsbiblioteket.dk/kill/index.jsp">developer.statsbiblioteket.dk/kill/</a>. (PLEASE NOTE: The demo currently only runs in Firefox 3.5+).</p>
<h3>Damn you, snow</h3>
<p>By the way, we never did get to go to code4lib to do the presentation, due to a blizzard and a ticket mix-up.</p>
<p>Here&#8217;s the blizzard:</p>
<div class="wp-caption alignnone" style="width: 460px"><img class="size-full wp-image-808" title="blizzard" src="http://sbdevel.files.wordpress.com/2010/02/blizzard.jpg?w=450&#038;h=337" alt="" width="450" height="337" /><br />
<p class="wp-caption-text">Here</p></div>
<p>And here&#8217;s a shot of how great we look through our remote code4lib video intro:</p>
<div class="wp-caption alignnone" style="width: 510px"><img title="Looking great" src="http://farm5.static.flickr.com/4048/4396439490_cf0b496a79.jpg" alt="" width="500" height="370" /><p class="wp-caption-text">Looking great video-wise. Our remote code4lib intro.</p></div>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/sbdevel.wordpress.com/783/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/sbdevel.wordpress.com/783/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=783&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2010/02/25/experimental-gui-for-code4lib/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="http://2.gravatar.com/avatar/b076a4eca73033844ded205c077d2b69?s=96&#38;d=http%3A%2F%2F2.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">michaelpoltorak</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2010/02/gui1.jpg" medium="image">
			<media:title type="html">gui</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2010/02/gui3.jpg" medium="image">
			<media:title type="html">gui3</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2010/02/gui21.jpg" medium="image">
			<media:title type="html">gui2</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2010/02/blizzard.jpg" medium="image">
			<media:title type="html">blizzard</media:title>
		</media:content>

		<media:content url="http://farm5.static.flickr.com/4048/4396439490_cf0b496a79.jpg" medium="image">
			<media:title type="html">Looking great</media:title>
		</media:content>
	</item>
		<item>
		<title>Juglr, Higgla, and DidYouMean</title>
		<link>http://sbdevel.wordpress.com/2010/02/10/juglr-higgla-and-didyoumean/</link>
		<comments>http://sbdevel.wordpress.com/2010/02/10/juglr-higgla-and-didyoumean/#comments</comments>
		<pubDate>Wed, 10 Feb 2010 19:53:14 +0000</pubDate>
		<dc:creator>kamstrup</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[Hacking]]></category>
		<category><![CDATA[kamstrup]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[Summa]]></category>
		<category><![CDATA[actor]]></category>
		<category><![CDATA[didyoumean]]></category>
		<category><![CDATA[higgla]]></category>
		<category><![CDATA[http]]></category>
		<category><![CDATA[juglr]]></category>
		<category><![CDATA[messaging]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=778</guid>
		<description><![CDATA[A productive day today! Summa DidYouMean Henrik and I sat down and pushed a preliminary did-you-mean service to Summa trunk. It&#8217;s to be considered pre-alpha quality at this stage, but it will be mature for Summa 1.5.2. It&#8217;s based on Karl Wettin&#8217;s old lucene-didyoumean contrib that lingered in the Apache bugtracker for years (yes, and [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=778&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>A productive day today!</p>
<h3>Summa DidYouMean</h3>
<p>Henrik and I sat down and pushed a preliminary did-you-mean service to Summa trunk. It&#8217;s to be considered pre-alpha quality at this stage, but it will be mature for <a href="http://summa.sf.net">Summa</a> 1.5.2. It&#8217;s based on Karl Wettin&#8217;s old lucene-didyoumean contrib that <a href="https://issues.apache.org/jira/browse/LUCENE-626">lingered in the Apache bugtracker for years</a> (yes, and I mean <em>years</em> literally). You can find the updated code on <a href="http://github.com/mkamstrup/lucene-didyoumean">github.com/mkamstrup/lucene-didyoumean</a>. There are branches in the Git repo for Lucene 3.0 (master), Lucene 2.9.* and Lucene 2.4.* &#8211; but be aware that this code has not been production tested yet.</p>
<h3>Juglr 0.2.1</h3>
<p>My pet project, the actor model and messaging library for Java 6+, <a href="http://github.com/mkamstrup/juglr">Juglr</a>, has hit <a href="http://github.com/mkamstrup/juglr/downloads">0.2.1</a>. I&#8217;ve now done some more large scale testing with it and it seems to work pretty well.</p>
<h3>Introducing Higgla</h3>
<p>The large scale testing of Juglr I just referred to is actually a new project of mine&#8230; Man &#8211; I tend to spew out a few too many projects these days <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> . The new project on the stack is <a href="http://github.com/mkamstrup/higgla">Higgla</a> &#8211; with the tag line: <em>&#8220;a lightweight, scalable, indexed, JSON document storage&#8221;</em>. If you are wondering where the name came from, I can inform you that Higgla is Jamaican for &#8220;higgler&#8221;, which a quick Googling defines as <em>&#8220;A person who trades in dairy, poultry, and small game animals; A person who haggles or negotiates for lower prices&#8221;</em>. The point is that Higgla is about dealing with any old sort of data and doing it in a very non-formal way.</p>
<p>Higgla is very much in the spirit of <a href="http://couchdb.apache.org/">CouchDB</a> and <a href="http://www.elasticsearch.com/">Elastic Search</a> &#8211; a schema free database, and not just schema free, but <em>completely schema free</em>. There is no implied structure of the kind that CouchDB&#8217;s Views bring along. Indexing is done on a document level, and each document need not have the same searchable fields as others. Heck, each document revision does not even need to be indexed the same way as the previous revision!</p>
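<p>To make the per-document indexing idea a bit more concrete, here is a toy in-memory sketch in Python. It is not Higgla code: the <code>ToyStore</code> class and its methods are invented purely for illustration. Each stored document names its own indexed fields, and a later revision is free to pick different ones:</p>

```python
class ToyStore:
    """Toy illustration only: every document picks its own indexed fields."""

    def __init__(self):
        # doc id -> (document dict, set of indexed field names)
        self.docs = {}

    def store(self, doc_id, doc, index_fields):
        # Storing under an existing id replaces the previous revision,
        # including its choice of indexed fields.
        self.docs[doc_id] = (dict(doc), set(index_fields))

    def query(self, field, term):
        """Return sorted ids of documents whose indexed field contains term."""
        term = term.lower()
        hits = []
        for doc_id, (doc, index_fields) in self.docs.items():
            if field in index_fields and term in str(doc.get(field, "")).lower():
                hits.append(doc_id)
        return sorted(hits)


store = ToyStore()
book = {"title": "Dive Into Python", "author": "Mark Pilgrim"}
store.store("book_1", book, ["title", "author"])
store.store("note_7", {"text": "remember the milk"}, ["text"])
print(store.query("author", "mark"))   # ['book_1']

# A new revision of book_1 that only indexes the title
store.store("book_1", book, ["title"])
print(store.query("author", "mark"))   # []
```

<p>The two documents share no fields at all, and the second revision of <code>book_1</code> stops being findable by author simply because it no longer asks for that field to be indexed.</p>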
<p>As I hinted, Higgla is based on Juglr. Higgla illustrates pretty well the power of a combined actor+http messaging stack like Juglr &#8211; if you <a href="http://github.com/mkamstrup/higgla/tree/master/src/java/higgla/server/">browse the source code</a> you will see that there really is not a lot of it!</p>
<p>At its core, Higgla leverages the always awesome Lucene. I had to think quite hard to make the storage engine transaction safe in a massively parallel setup, because Lucene doesn&#8217;t as such support parallel transactions (but it does support sequential transactions quite well). I figured it out eventually though.</p>
<p>Even though this is just a 0.0.1, Higgla already ships with Python and Java client libraries (talking straight HTTP+JSON shouldn&#8217;t be that hard in most frameworks, but a simple convenience lib is still nice to have). An example with the Python client looks like this:</p>
<pre>import higgla
import json

# Connect to the server
session = higgla.Session("localhost", 4567, "my_books")

# Prepare a box for storage, with id 'book_1',
# revision 0 (since this is a new box), and indexing
# the fields 'title' and 'author'
box = session.prepare_box("book_1", 0, "title", "author")

# Add some data to the box
box["title"] = "Dive Into Python"
box["author"] = "Mark Pilgrim"
box["stuff"] = [27, 68, 2, 3, 4]

# Store it on the server
session.store([box])

# Now find the box again
query = session.prepare_query(author="mark")
results = session.send_query(query)
print(json.dumps(results, indent=2))
print("TADAAA!")
</pre>
<p>That completes the Python example. The Java API is almost identical so I won&#8217;t cover it, although I can&#8217;t do the same fancy varargs stuff that Python provides <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/sbdevel.wordpress.com/778/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/sbdevel.wordpress.com/778/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=778&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2010/02/10/juglr-higgla-and-didyoumean/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/4a6d483c449dc6560e3e12d49d5e12bb?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">kamstrup</media:title>
		</media:content>
	</item>
		<item>
		<title>What did I mean?</title>
		<link>http://sbdevel.wordpress.com/2010/01/28/what-did-i-mean/</link>
		<comments>http://sbdevel.wordpress.com/2010/01/28/what-did-i-mean/#comments</comments>
		<pubDate>Thu, 28 Jan 2010 09:37:28 +0000</pubDate>
		<dc:creator>kamstrup</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[kamstrup]]></category>
		<category><![CDATA[Lucene]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[Summa]]></category>
		<category><![CDATA[didyoumean]]></category>
		<category><![CDATA[lucene]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=774</guid>
		<description><![CDATA[It has been a long standing wish to get a good did-you-mean service shipped with Summa. And by &#8220;did-you-mean service&#8221; I mean the little helpful tip that shows up underneath the text entry when you mistype something when doing a search. Note that I say &#8220;mistype&#8221; and not &#8220;misspell&#8221;, because a good did-you-mean service is [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=774&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>It has been a long standing wish to get a good <em>did-you-mean</em> service shipped with <a href="http://summa.sf.net">Summa</a>. And by &#8220;did-you-mean service&#8221; I mean the little helpful tip that shows up underneath the text entry when you mistype something when doing a search. Note that I say &#8220;mistype&#8221; and not &#8220;misspell&#8221;, because a good did-you-mean service is a lot more complex than a spell checker.</p>
<p>Consider when I read my wish list aloud to my mom over the phone and try to explain to her that I badly want &#8220;Heroes of Might and Magic&#8221; for Christmas. This phrase being completely meaningless to her, she types in the search field:</p>
<blockquote><p>heroes of light and magic</p></blockquote>
<p>Notice that this is indeed a correctly spelled phrase, but nonetheless not what she/I wanted. A good search engine would ask <em>Did you mean: &#8220;heroes of might and magic&#8221;?</em>.</p>
<p>On the other hand, if a search engine runs on a database of bad monochrome underworld games and &#8220;Heroes of Might and Magic&#8221; wasn&#8217;t there, but the index instead contained a game called &#8220;Heroes of Fight and Magic&#8221;, the search engine should of course suggest <em>Did you mean: &#8220;heroes of fight and magic&#8221;?</em> instead.</p>
<p>So we&#8217;ve identified two things we want that a normal spellchecker doesn&#8217;t provide:</p>
<ul>
<li>Consider each word in a query in the context of the whole phrase it appears in</li>
<li>Only suggest stuff that is actually in the index</li>
</ul>
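<p>As a rough illustration of those two requirements (and explicitly not the lucene-didyoumean implementation), here is a minimal Python sketch. It assumes the &#8220;index&#8221; is just a small in-memory list of known phrases; the names <code>INDEXED_PHRASES</code> and <code>did_you_mean</code> are made up for the example. The whole query phrase is compared against the indexed phrases using the standard library <code>difflib</code> module, so only things that are actually in the index can ever be suggested:</p>

```python
import difflib

# Hypothetical stand-in for the phrases known to the search index.
INDEXED_PHRASES = [
    "heroes of might and magic",
    "dive into python",
    "lord of the rings",
]


def did_you_mean(query, indexed_phrases=INDEXED_PHRASES, cutoff=0.8):
    """Return the closest indexed phrase, or None if there is nothing to suggest."""
    normalized = " ".join(query.lower().split())
    if normalized in indexed_phrases:
        # Exact hit in the index: no hint needed.
        return None
    # Compare the phrase as a whole rather than word by word, so a correctly
    # spelled but wrong word ("light") is still caught.
    matches = difflib.get_close_matches(normalized, indexed_phrases, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(did_you_mean("heroes of light and magic"))   # heroes of might and magic
```

<p>Passing a different phrase list, for example one that only contains &#8220;heroes of fight and magic&#8221;, makes the same query come back with that title instead. A real service would of course work on the index terms themselves rather than on a list of whole phrases.</p>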
<h3>The Code</h3>
<p>After surveying what was available on the open source market we realized that none of the solutions out there did what we wanted. I was pointed at <a href="https://issues.apache.org/jira/browse/LUCENE-626">Karl Wettin&#8217;s work on LUCENE-626</a>. Although Karl&#8217;s work is great, it&#8217;s not compatible with the new API in Lucene &gt;= 3.0 and it has a hardwired dependency on Berkeley DB that we could not accept. So I branched his work in order to bring it into 2010 and I am proud to say that I&#8217;ve now reached an almost-works state. You can find the code on GitHub: <a href="http://github.com/mkamstrup/lucene-didyoumean">github.com/mkamstrup/lucene-didyoumean</a></p>
<p>The new thing about this is also that we are now engaging in upstream Lucene work, rather than staying in our own Summa backyard. Quite exciting, and a very rewarding experience for a software developer. Toke has some more news in this regard as well &#8211; he&#8217;s doing some upstream stuff that has far bigger implications than my odd-job did-you-mean-hacking. But I&#8217;ll leave you hanging there and let Toke talk about this himself.</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/sbdevel.wordpress.com/774/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/sbdevel.wordpress.com/774/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=774&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2010/01/28/what-did-i-mean/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/4a6d483c449dc6560e3e12d49d5e12bb?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">kamstrup</media:title>
		</media:content>
	</item>
		<item>
		<title>Solid Toys for the Boys</title>
		<link>http://sbdevel.wordpress.com/2009/12/08/solid-toys-for-the-boys/</link>
		<comments>http://sbdevel.wordpress.com/2009/12/08/solid-toys-for-the-boys/#comments</comments>
		<pubDate>Tue, 08 Dec 2009 13:13:18 +0000</pubDate>
		<dc:creator>kamstrup</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[kamstrup]]></category>
		<category><![CDATA[Summa]]></category>
		<category><![CDATA[io]]></category>
		<category><![CDATA[kingston]]></category>
		<category><![CDATA[ssd]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=743</guid>
		<description><![CDATA[As some may know we have experimented quite a bit with Lucene indexes on Solid State Drives and we&#8217;ve had very good experiences with it. Seeing huge performance gains. Since we are also routinely running big applications and other heavy duty tasks on our desktop machines our dear Toke had the idea that we should [&#8230;]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=743&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>As some may know, we have experimented quite a bit with Lucene indexes on Solid State Drives and we&#8217;ve had very good experiences with them, seeing huge performance gains. Since we are also routinely running big applications and other heavy duty tasks on our desktop machines, our dear Toke had the idea that we should all have SSDs in our desktops. After a good deal of shopping about he settled on the Kingston v 40GB drive, as research revealed that this exact model had the good Intel metal inside (this is, for example, <em>not</em> the case for the 64GB model).</p>
<p>Yesterday we got the delivery and immediately started unpacking and upgrading our machines. And boy, were these babies worth every penny! <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>(sorry for the ugly scaling of the following images &#8211; WordPress is killing me)</p>
<div id="attachment_747" class="wp-caption alignnone" style="width: 250px"><a href="http://sbdevel.files.wordpress.com/2009/12/supertoke1.jpg"><img class=" " title="supertoke1" src="http://sbdevel.files.wordpress.com/2009/12/supertoke1.jpg?w=240&#038;h=298" alt="" width="240" height="298" /></a><p class="wp-caption-text">Toke was the Super delivery boy</p></div>
<div id="attachment_748" class="wp-caption alignnone" style="width: 310px"><a href="http://sbdevel.files.wordpress.com/2009/12/superdrives1.jpg"><img title="superdrives1" src="http://sbdevel.files.wordpress.com/2009/12/superdrives1.jpg?w=300&#038;h=225" alt="" width="300" height="225" /></a><p class="wp-caption-text">Quick - get them before they are gone!</p></div>
<div id="attachment_749" class="wp-caption alignnone" style="width: 230px"><a href="http://sbdevel.files.wordpress.com/2009/12/supermikkel1.jpg"><img title="supermikkel1" src="http://sbdevel.files.wordpress.com/2009/12/supermikkel1.jpg?w=220&#038;h=299" alt="" width="220" height="299" /></a><p class="wp-caption-text">Yours truly is a Super happy camper</p></div>
<div id="attachment_750" class="wp-caption alignnone" style="width: 310px"><a href="http://sbdevel.files.wordpress.com/2009/12/superteam1.jpg"><img title="superteam1" src="http://sbdevel.files.wordpress.com/2009/12/superteam1.jpg?w=300&#038;h=221" alt="" width="300" height="221" /></a><p class="wp-caption-text">Super tag team getting their hands dirty</p></div>
<p>Firstly we did clean installations of Ubuntu. With a 10GB root partition and a ~26GB <em>/home</em> partition and ~4GB swap. Root and <em>/home</em> formatted with Ext4. All on the SSD. The time?</p>
<ul>
<li><em>Installing Ubuntu Karmic 64 bit from USB stick: 4 minutes (with ~1 minute waiting for network on a slow repository)</em></li>
</ul>
<p>The next thing was the boot&#8230; While we were rebooting from the install session we talked about how fast the boot was going to be. But while talking we almost didn&#8217;t react before the machine was back up at the login screen. Wow. As we didn&#8217;t have a timer with sub-minute resolution at hand we can only give you subjective numbers. Among the spectators the opinions range from &#8220;negative time&#8221; to &#8220;5s&#8221; to &#8220;10s&#8221;. My personal estimates are:</p>
<ul>
<li><em>Boot from GRUB to GDM login screen: 5s</em></li>
<li><em>From login screen to working GNOME desktop: 4s</em></li>
</ul>
<p>This is pretty darn fast I tell you <img src='http://s0.wp.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>In general, application launching is also noticeably faster. Especially so for applications with lots of IO, like the Evolution mail reader or our development environment IntelliJ IDEA. Compiling the Summa project is also a heavily IO bound process. The result:</p>
<ul>
<li><em>Compiling Summa from scratch with cold disk caches: With conventional drives ~6 minutes. With our new SSDs 2.5 minutes. That&#8217;s a speedup of a factor ~2.5.</em></li>
</ul>
<p>As you might have guessed by now &#8211; we like SSDs &#8211; a lot!</p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/sbdevel.wordpress.com/743/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/sbdevel.wordpress.com/743/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=743&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/12/08/solid-toys-for-the-boys/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/4a6d483c449dc6560e3e12d49d5e12bb?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">kamstrup</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2009/12/supertoke1.jpg?w=240" medium="image">
			<media:title type="html">supertoke1</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2009/12/superdrives1.jpg?w=300" medium="image">
			<media:title type="html">superdrives1</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2009/12/supermikkel1.jpg?w=220" medium="image">
			<media:title type="html">supermikkel1</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2009/12/superteam1.jpg?w=300" medium="image">
			<media:title type="html">superteam1</media:title>
		</media:content>
	</item>
		<item>
		<title>Searching in the dark</title>
		<link>http://sbdevel.wordpress.com/2009/09/25/searching-in-the-dark/</link>
		<comments>http://sbdevel.wordpress.com/2009/09/25/searching-in-the-dark/#comments</comments>
		<pubDate>Fri, 25 Sep 2009 15:12:47 +0000</pubDate>
		<dc:creator>Toke Eskildsen</dc:creator>
				<category><![CDATA[Summa]]></category>
		<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=729</guid>
		<description><![CDATA[Building indexes of semi-classified material and making an integrated search between structured and non-structured data.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=729&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></description>
				<content:encoded><![CDATA[<p>As part of our obligation to preserve our online cultural heritage, Statsbiblioteket and Det Kongelige Bibliotek in Denmark continuously harvest the Danish web (the *.dk-domains), digitize public Danish television, rip all Danish-produced music and generally just collect whatever we can get our hands on. The terabytes add up (120TB for the web pages so far, more for television, radio and so on) and the machines are happily harvesting, ripping and wolfing down the bytes into semi-safe storage (2 geographically and architecturally different setups, checksummed, re-checksummed etc.). All fine and dandy.</p>
<p>Except that access to most of the material is rather limited and that search is &#8230; well, pretty much non-existent.</p>
<p>Such things tend to change over time, preluded by meetings, committees, deals and whatnot. As technicians, we are normally not directly involved in all the politics surrounding this, but in order to get some concrete arguments, we were asked to try and index some of the harvested web material and do a search demo, where web material was presented together with our normal material (books, cds, articles et al).</p>
<p>The harvested web material is stored in <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">ARC-files</a>, so the obvious choice for a quick test was <a href="http://archive-access.sourceforge.net/projects/nutch/">NutchWAX</a>. Setup was easy, some 100 million documents were indexed (about 2% of the harvested web material) and searches were sub-second on a modest machine. A great success in terms of answering the &#8220;<em>is it even feasible to do this?</em>&#8220;-question.</p>
<p>The &#8220;<em>but does it make sense to do integrated search for such different data sources as web and library books?</em>&#8220;-question could not be answered by this, so naturally we had to hack something together with <a href="http://www.statsbiblioteket.dk/summa">Summa</a>, our precious hammer. Due to other highly-prioritized assignments, we only had about a week to get it to work, so corners were cut where possible. Using the ARC-reader from <a href="http://crawler.archive.org/">Heritrix</a> and the <a href="http://lucene.apache.org/tika/">Tika-toolkit</a> for analyzing the wealth of different data, the aptly named Arctika was born. Arctika handled the web stuff and an aggregator handled the integration with our standard <a href="http://www.statsbiblioteket.dk/search/index.jsp?query=foo">library index</a>.</p>
<p>It could use a lot more work, but it worked surprisingly well for a quick hack. We were able to demonstrate everything we wanted: The integrated search made sense, the ranking generally pulled the good stuff to the top (admittedly, tweaking the ranking for different sources would surely be needed for a real application) and the faceting system clearly helped give an overview of material types &amp; sources and provided an easy means to do temporal navigation in the search-result: Limiting searches to a specific period of time is quite usable for investigating the media handling of major events.</p>
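<p>For readers who have not met faceting before, the idea can be sketched in a few lines of Python. This is only an illustration and has nothing to do with how Summa implements it: the hits, the field names and the helper functions <code>facet_counts</code> and <code>limit_to_period</code> are all made up. Counting the values of a field across the hits gives the overview, and filtering on a year range gives the temporal navigation:</p>

```python
from collections import Counter

# Made-up search hits; field names and values are assumptions for the example.
hits = [
    {"source": "web", "type": "html", "year": 2007},
    {"source": "web", "type": "pdf", "year": 2008},
    {"source": "library", "type": "book", "year": 2008},
    {"source": "library", "type": "article", "year": 2008},
]


def facet_counts(hits, field):
    """Count how many hits carry each value of the given field."""
    return Counter(hit[field] for hit in hits if field in hit)


def limit_to_period(hits, first_year, last_year):
    """Keep only hits whose year lies within the inclusive range."""
    wanted = range(first_year, last_year + 1)
    return [hit for hit in hits if hit["year"] in wanted]


print(dict(facet_counts(hits, "source")))
print(dict(facet_counts(limit_to_period(hits, 2008, 2008), "type")))
```

<p>Narrowing the period and recomputing the counts is what makes it easy to see, for instance, how coverage of an event is spread across web pages and library material in a given year.</p>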
<p>So what&#8217;s the dark part? Well, legislation. As always. That and money. Harvested web material is sensitive and only legally accessible to a select few professors. On top of that, showing snippets from harvested web pages seems &#8211; at the moment &#8211; to require compensating the content owners, according to EU-law. Opening up all the material at once will probably not happen in the foreseeable future.</p>
<p>Happily we don&#8217;t need to do everything at once. If we limit the publicly accessible index to websites from the government and companies, it should be legal to show the search-results and the stored versions (hello <a href="http://www.nationalarchives.gov.uk/webcontinuity/">continuity</a>). Add the recorded television and radio to the mix, pour in scanned newspapers, integrate with old-school books and presto, we have something great. Danish culture at our fingertips, past and present.</p>
<p>Dreaming, I know. But on the technical level, we just need the green light from the bigwigs to make this happen.</p>
<p>A screenshot, you say? Why, yes, of course. We present this super-cool bling bling interface with a stupendously large amount of interesting information to you. Slightly marred by the need to censor some sensitive information and the fact that indexing time was capped at half a day to make the deadline.<br />
<div id="attachment_730" class="wp-caption alignleft" style="width: 460px"><img src="http://sbdevel.files.wordpress.com/2009/09/arctika-hoereskader-censored.png?w=450&#038;h=257" alt="Sample search in Arctika" title="arctika-hoereskader-censored" width="450" height="257" class="size-full wp-image-730" /><p class="wp-caption-text">Sample search in Arctika</p></div></p>
<br />  <a rel="nofollow" href="http://feeds.wordpress.com/1.0/gocomments/sbdevel.wordpress.com/729/"><img alt="" border="0" src="http://feeds.wordpress.com/1.0/comments/sbdevel.wordpress.com/729/" /></a> <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&#038;blog=4699377&#038;post=729&#038;subd=sbdevel&#038;ref=&#038;feed=1" width="1" height="1" />]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/09/25/searching-in-the-dark/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="http://1.gravatar.com/avatar/a8d49de9ea76368a1a2d9b9a1c975ea5?s=96&#38;d=http%3A%2F%2F1.gravatar.com%2Favatar%2Fad516503a11cd5ca435acc9bb6523536%3Fs%3D96" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2009/09/arctika-hoereskader-censored.png" medium="image">
			<media:title type="html">arctika-hoereskader-censored</media:title>
		</media:content>
	</item>
	</channel>
</rss>
