<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	xmlns:media="http://search.yahoo.com/mrss/"
	>

<channel>
	<title>Software Development at Statsbiblioteket</title>
	<atom:link href="http://sbdevel.wordpress.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://sbdevel.wordpress.com</link>
	<description>A peekhole into the life of the software development department at the State Library of Denmark</description>
	<lastBuildDate>Fri, 16 Oct 2009 09:55:58 +0000</lastBuildDate>
	<generator>http://wordpress.com/</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<cloud domain='sbdevel.wordpress.com' port='80' path='/?rsscloud=notify' registerProcedure='' protocol='http-post' />
<image>
		<url>http://www.gravatar.com/blavatar/35a7330e5cde94801fbe6950804698a2?s=96&#038;d=http://s.wordpress.com/i/buttonw-com.png</url>
		<title>Software Development at Statsbiblioteket</title>
		<link>http://sbdevel.wordpress.com</link>
	</image>
			<item>
		<title>IntelliJ Idea Open Sourced</title>
		<link>http://sbdevel.wordpress.com/2009/10/16/intellij-idea-open-sourced/</link>
		<comments>http://sbdevel.wordpress.com/2009/10/16/intellij-idea-open-sourced/#comments</comments>
		<pubDate>Fri, 16 Oct 2009 09:55:58 +0000</pubDate>
		<dc:creator>kamstrup</dc:creator>
				<category><![CDATA[kamstrup]]></category>
		<category><![CDATA[open source]]></category>
		<category><![CDATA[git]]></category>
		<category><![CDATA[ide]]></category>
		<category><![CDATA[idea]]></category>
		<category><![CDATA[intellij]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[jetbrains]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=740</guid>
		<description><![CDATA[Wow, I must admit that the latest news from JetBrains takes me quite by surprise! But what a sweet surprise it is!
Scala and Git support out of the box you say? This is more than welcome &#8211; now the next generation development experience is enabled out of the box.
I can&#8217;t help but wonder why they [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&blog=4699377&post=740&subd=sbdevel&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Wow, I must admit that <a href="http://blogs.jetbrains.com/idea/2009/10/intellij-idea-open-sourced/">the latest news from JetBrains</a> takes me quite by surprise! But what a sweet surprise it is!</p>
<p>Scala and Git support out of the box you say? This is more than welcome &#8211; now the next generation development experience is enabled out of the box.</p>
<p>I can&#8217;t help but wonder why they did it, though. Growing pressure from NetBeans and Eclipse? I&#8217;ve always thought that Idea was the best of the three, and so I expected it to generate a fine revenue. Perhaps not &#8211; or perhaps JetBrains had a sudden fit of philanthropy? Or perhaps open source is just a superior development model. No matter the true motivation, I am pretty hyped about this <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
</div>]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/10/16/intellij-idea-open-sourced/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">kamstrup</media:title>
		</media:content>
	</item>
		<item>
		<title>Searching in the dark</title>
		<link>http://sbdevel.wordpress.com/2009/09/25/searching-in-the-dark/</link>
		<comments>http://sbdevel.wordpress.com/2009/09/25/searching-in-the-dark/#comments</comments>
		<pubDate>Fri, 25 Sep 2009 15:12:47 +0000</pubDate>
		<dc:creator>eskildsen</dc:creator>
				<category><![CDATA[Summa]]></category>
		<category><![CDATA[Web]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=729</guid>
		<description><![CDATA[Building indexes of semi-classified material and making an integrated search between structured and non-structured data.<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&blog=4699377&post=729&subd=sbdevel&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>As part of our obligation to preserve our online cultural heritage, Statsbiblioteket and Det Kongelige Bibliotek in Denmark continuously harvest the Danish web (the *.dk-domains), digitize public Danish television, rip all Danish-produced music and generally just collect whatever we can get our hands on. The terabytes add up (120TB for the web pages so far, more for television, radio and so on) and the machines are happily harvesting, ripping and wolfing down the bytes into semi-safe storage (2 geographically and architecturally different setups, checksummed, re-checksummed etc.). All fine and dandy.</p>
<p>Except that access to most of the material is rather limited and that search is &#8230; well, pretty much non-existent.</p>
<p>Such things tend to change over time, preluded by meetings, committees, deals and whatnot. As technicians, we are normally not directly involved in all the politics surrounding this, but in order to get some concrete arguments, we were asked to try and index some of the harvested web material and do a search demo, where web material was presented together with our normal material (books, CDs, articles et al.).</p>
<p>The harvested web material is stored in <a href="http://www.archive.org/web/researcher/ArcFileFormat.php">ARC-files</a>, so the obvious choice for a quick test was <a href="http://archive-access.sourceforge.net/projects/nutch/">NutchWAX</a>. Setup was easy, some 100 million documents were indexed (about 2% of the harvested web material) and searches were sub-second on a modest machine. A great success in terms of answering the &#8220;<em>is it even feasible to do this?</em>&#8221;-question.</p>
<p>The &#8220;<em>but does it make sense to do integrated search for such different data sources as web and library books?</em>&#8221;-question could not be answered by this, so naturally we had to hack something together with <a href="http://www.statsbiblioteket.dk/summa">Summa</a>, our precious hammer. Due to other highly-prioritized assignments, we only had about a week to get it to work, so corners were cut where possible. Using the ARC-reader from <a href="http://crawler.archive.org/">Heritrix</a> and the <a href="http://lucene.apache.org/tika/">Tika-toolkit</a> for analyzing the wealth of different data, the aptly named Arctika was born. Arctika handled the web stuff and an aggregator handled the integration with our standard <a href="http://www.statsbiblioteket.dk/search/index.jsp?query=foo">library index</a>.</p>
<p>It could use a lot more work, but it worked surprisingly well for a quick hack. We were able to demonstrate everything we wanted: The integrated search made sense, the ranking generally pulled the good stuff to the top (admittedly, tweaking the ranking for different sources would surely be needed for a real application) and the faceting system clearly helped give an overview of material types &amp; sources and provided an easy means to do temporal navigation in the search-result: Limiting searches to a specific period of time is quite usable for investigating the media handling of major events.</p>
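<p>To make the idea of per-source rank tweaking a bit more concrete, here is a minimal sketch of how an aggregator could merge hits from two indexes. This is not the Summa aggregator: the <tt>HitAggregator</tt> and <tt>Hit</tt> classes, the weights and the sample hits are all made up for illustration, and a real setup would need scores that are comparable across sources to begin with.</p>

```java
import java.util.Arrays;

public class HitAggregator {

  /** One search hit; "source" says which index it came from (hypothetical type). */
  public static final class Hit {
    public final String source;
    public final String title;
    public final double score;

    public Hit(String source, String title, double score) {
      this.source = source;
      this.title = title;
      this.score = score;
    }
  }

  /**
   * Merges the hits of two sources into one list, scaling the scores of
   * each source by its own weight and sorting by the weighted score.
   */
  public static Hit[] merge(Hit[] first, double firstWeight,
                            Hit[] second, double secondWeight) {
    Hit[] merged = new Hit[first.length + second.length];
    int next = 0;
    for (Hit hit : first) {
      merged[next++] = new Hit(hit.source, hit.title, hit.score * firstWeight);
    }
    for (Hit hit : second) {
      merged[next++] = new Hit(hit.source, hit.title, hit.score * secondWeight);
    }
    // Highest weighted score first
    Arrays.sort(merged, (a, b) -> Double.compare(b.score, a.score));
    return merged;
  }

  public static void main(String[] args) {
    // Illustrative data only
    Hit[] web = { new Hit("web", "Some harvested web page", 0.9) };
    Hit[] library = { new Hit("library", "Some library book", 0.6) };
    for (Hit hit : merge(web, 0.5, library, 1.0)) {
      System.out.println(hit.source + ": " + hit.title + " (" + hit.score + ")");
    }
  }
}
```

<p>Down-weighting the web source, as in the <tt>main</tt> method above, is one simple way to keep a very large web index from drowning out the library records.</p>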
<p>So what&#8217;s the dark part? Well, legislation. As always. That and money. Harvested web material is sensitive and only legally accessible to a select few professors. On top of that, showing snippets from harvested web pages seems &#8211; at the moment &#8211; to require compensating the content owners, according to EU law. Opening up for all the material at once will probably not happen in the foreseeable future.</p>
<p>Happily we don&#8217;t need to do everything at once. If we limit the publicly accessible index to websites from the government and companies, it should be legal to show the search-results and the stored versions (hello <a href="http://www.nationalarchives.gov.uk/webcontinuity/">continuity</a>). Add the recorded television and radio to the mix, pour in scanned newspapers, integrate with old-school books and presto, we have something great. Danish culture at our fingertips, past and present.</p>
<p>Dreaming, I know. But on the technical level, we just need the green light from the bigwigs to make this happen.</p>
<p>A screenshot, you say? Why, yes, of course. We present this super-cool bling bling interface with a stupendously large amount of interesting information to you. Slightly marred by the need to censor some sensitive information and the fact that indexing time was capped at half a day to make the deadline.<br />
<div id="attachment_730" class="wp-caption alignleft" style="width: 460px"><img src="http://sbdevel.files.wordpress.com/2009/09/arctika-hoereskader-censored.png?w=450&#038;h=257" alt="Sample search in Arctika" title="arctika-hoereskader-censored" width="450" height="257" class="size-full wp-image-730" /><p class="wp-caption-text">Sample search in Arctika</p></div></p>
</div>]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/09/25/searching-in-the-dark/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2009/09/arctika-hoereskader-censored.png" medium="image">
			<media:title type="html">arctika-hoereskader-censored</media:title>
		</media:content>
	</item>
		<item>
		<title>Brilliant, guys!</title>
		<link>http://sbdevel.wordpress.com/2009/09/20/brilliant-guys/</link>
		<comments>http://sbdevel.wordpress.com/2009/09/20/brilliant-guys/#comments</comments>
		<pubDate>Sun, 20 Sep 2009 19:02:27 +0000</pubDate>
		<dc:creator>eskildsen</dc:creator>
				<category><![CDATA[usability]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=718</guid>
		<description><![CDATA[
Our fine usability guys hard at work.
       <img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&blog=4699377&post=718&subd=sbdevel&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><img src="http://sbdevel.files.wordpress.com/2009/09/20090917_1426_cancelundo_e.jpg?w=300&#038;h=400" alt="Cancel Undo" title="Cancel Undo" width="300" height="400" class="size-full wp-image-717" /></p>
<p>Our fine usability guys hard at work.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/09/20/brilliant-guys/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">eskildsen</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2009/09/20090917_1426_cancelundo_e.jpg" medium="image">
			<media:title type="html">Cancel Undo</media:title>
		</media:content>
	</item>
		<item>
		<title>An Excursion in Java Recursion</title>
		<link>http://sbdevel.wordpress.com/2009/09/04/an-excursion-in-java-recursion/</link>
		<comments>http://sbdevel.wordpress.com/2009/09/04/an-excursion-in-java-recursion/#comments</comments>
		<pubDate>Fri, 04 Sep 2009 09:56:20 +0000</pubDate>
		<dc:creator>kamstrup</dc:creator>
				<category><![CDATA[Hacking]]></category>
		<category><![CDATA[kamstrup]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[push parser]]></category>
		<category><![CDATA[recursion]]></category>
		<category><![CDATA[tail recursion]]></category>
		<category><![CDATA[xml]]></category>
		<category><![CDATA[xml entities]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=711</guid>
		<description><![CDATA[A quick Googling defines Excursion as: &#8220;a journey taken for pleasure&#8221;. Considering what I am about to blog about the title of this blog post might be a bit misleading, but you gotta give me one for the rhyme  
As you might or might not know, doing recursion in Java is simply a bad [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&blog=4699377&post=711&subd=sbdevel&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>A quick <a href="http://www.google.dk/search?q=define%3Aexcursion&amp;ie=utf-8&amp;oe=utf-8&amp;aq=t&amp;rls=com.ubuntu:en-US:unofficial&amp;client=firefox-a">Googling defines <em>Excursion</em></a> as: <em>&#8220;a journey taken for pleasure&#8221;</em>. Considering what I am about to blog about the title of this blog post might be a bit misleading, but you gotta give me one for the rhyme <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_wink.gif' alt=';-)' class='wp-smiley' /> </p>
<p>As you might or might not know, deep recursion in Java is simply a bad thing. This is mainly because the Java runtime does not eliminate tail calls (see <a href="http://en.wikipedia.org/wiki/Tail_recursion">tail recursion</a>), so every recursive call occupies a new stack frame. You can use recursion in Java if you are absolutely positive that you are only going to do a very limited number of recursive calls. If you could possibly go over 100 calls you should consider making it a <em>for</em> or <em>while</em> loop instead. Once the call chain gets deep enough you will get a <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/lang/StackOverflowError.html">StackOverflowError</a>; the exact depth depends on the thread&#8217;s stack size and on how large each frame is, but it can be as low as a few thousand calls. This is <em>really</em> bad: if you read the StackOverflowError docs you will see that it is a subclass of <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/lang/VirtualMachineError.html">VirtualMachineError</a>. The docs for VirtualMachineError say:</p>
<blockquote><p>Thrown to indicate that the Java Virtual Machine is broken or has run out of resources necessary for it to continue operating</p></blockquote>
<p>This means that you have pretty much no choice but to log a fatal error and abort the JVM.</p>
<p>There are ways to raise the recursion limit, for example by giving the JVM a larger thread stack with the <tt>-Xss</tt> option, but that is really just a band-aid and I would advise against relying on it.</p>
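<p>As a small illustration of the advice above, here is the same computation written both recursively and as a loop. The class and method names are made up for this sketch; the point is only that the loop uses constant stack depth, while the recursive version needs one stack frame per step.</p>

```java
public class RecursionVersusLoop {

  // Recursive version: every call adds a stack frame, so a large n
  // can end in a StackOverflowError.
  public static long sumRecursive(long n) {
    if (n == 0) {
      return 0;
    }
    return n + sumRecursive(n - 1);
  }

  // Loop version: constant stack depth no matter how large n is.
  public static long sumLoop(long n) {
    long sum = 0;
    for (long i = 1; i <= n; i++) {
      sum += i;
    }
    return sum;
  }

  public static void main(String[] args) {
    // A shallow recursion is fine
    System.out.println(sumRecursive(100));
    // The loop copes with a depth that would be risky as recursion
    System.out.println(sumLoop(10000000L));
  }
}
```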
<h3>The Real Life Case: XML Parsing</h3>
<p>Java 6 ships with a new XML parsing library, StAX, whose core interface is <a href="http://java.sun.com/javase/6/docs/api/javax/xml/stream/XMLStreamReader.html">XMLStreamReader</a>. It is a <em>&#8220;pull parser&#8221;</em>: your code asks the reader for the next event, whereas a push parser such as SAX calls back into your code. I must say that it is quite a nice library and a huge improvement over <a href="http://en.wikipedia.org/wiki/Simple_API_for_XML">SAX parsing</a>, while still keeping blazing performance. We use it in <a href="http://summa.sf.net">Summa</a> and have been very happy with it.</p>
<p>The problem came when we started indexing documents like this one: <a href="http://summa.sourceforge.net/misc/java-recursion-lection-1.xml">java-recursion-lection-1.xml</a>. You can definitely expect to find similar structures out in the wild (as we have seen here at work). The basic document structure is as follows:</p>
<pre>&lt;mydocument&gt;
  &lt;mytag&gt;
     SOME TEXT BLOCK
  &lt;/mytag&gt;
&lt;/mydocument&gt;</pre>
<p>If we just want to extract the text block it would be annoying with a standard SAX parser, because a SAX parser may split character data into arbitrary chunks and you have to collect them into one string yourself. The StAX API makes this a lot easier because it defines the property <a href="http://java.sun.com/javase/6/docs/api/javax/xml/stream/XMLInputFactory.html#IS_COALESCING">XMLInputFactory.IS_COALESCING</a> which, when set, requires the parser to collect all the text chunks into one string. So extracting the raw text contents is easy peasy lemon squeezy:</p>
<pre>import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.events.XMLEvent;
import java.io.FileReader;

/**
 * A small excursion in Java recursion.
 */
public class JavaRecursionLecture1 {

  public static void main(String[] args) throws Exception {
    XMLInputFactory inputFactory = XMLInputFactory.newInstance();
    inputFactory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);

    XMLStreamReader reader = inputFactory.createXMLStreamReader(
               new FileReader("/home/mke/Documents/java-recursion-lection-1.xml"));
    parse(reader);
  }

  public static void parse(XMLStreamReader reader) throws Exception {
    while (reader.hasNext()) {
      int event = reader.getEventType();
      switch (event) {
        case XMLEvent.START_DOCUMENT :
          System.out.println("Document start");
          break;
        case XMLEvent.START_ELEMENT :
          System.out.println("Element: " + reader.getLocalName() );
          break;
        case XMLStreamReader.CHARACTERS :
          // Warning: Here be StackOverflowErrors
          System.out.println("Char data:\n" + reader.getText());
          break;
      }
      reader.next();
    }
  }
}</pre>
<p><strong>Except that this will throw a StackOverflowError</strong> if you run it on the <a href="http://summa.sourceforge.net/misc/java-recursion-lection-1.xml">file I linked you to</a>. &#8220;What is up with that, there is no recursion here!&#8221; &#8211; you ask?</p>
<p>The problem here is that the XMLStreamReader implementation is highly recursive under the hood. My file contains lots of <a href="http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references">XML entities</a> and the parser will make a recursive call each time a new entity is found in the stream. Looking at the <a href="http://google.com/codesearch/p?hl=en&amp;sa=N&amp;cd=1&amp;ct=rc#l3_VNyZJ4_Q/src/share/classes/com/sun/org/apache/xerces/internal/impl/XMLDocumentFragmentScannerImpl.java&amp;q=org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl&amp;l=2520">heart of the implementation</a> you will see that the author(s) actually were very meticulous about making sure that all recursive calls were tail calls. This would have been very robust had the Java runtime eliminated tail calls &#8211; alas.</p>
<p>There are two ways to work around this misfeature. The first one is <em>not to set the IS_COALESCING property</em>, and then change the switch statement to something like this, using <tt>reader.getElementText()</tt> instead:</p>
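<p>To see why tail calls do not help on the JVM, consider this toy sketch. It is not the Xerces code, and the class and method names are made up: it merely counts the <tt>&amp;</tt> characters in a string, once with a tail call per hit and once with the equivalent loop. A runtime with tail-call elimination would effectively turn the first method into the second; the JVM does not, so the first one needs one stack frame per entity.</p>

```java
public class TailCallToLoop {

  // Tail-recursive: the recursive call is the very last thing the method
  // does. The JVM still pushes a new frame per call, so the stack depth
  // grows with the number of '&' characters in the input.
  public static int countEntitiesRecursive(String text, int from, int count) {
    int pos = text.indexOf('&', from);
    if (pos < 0) {
      return count;
    }
    return countEntitiesRecursive(text, pos + 1, count + 1);
  }

  // The same logic as a loop: the parameters become local variables
  // that are updated in place, so the stack depth stays constant.
  public static int countEntitiesLoop(String text) {
    int from = 0;
    int count = 0;
    while (true) {
      int pos = text.indexOf('&', from);
      if (pos < 0) {
        return count;
      }
      from = pos + 1;
      count++;
    }
  }

  public static void main(String[] args) {
    // Build an input with far more entities than is safe to recurse over
    StringBuilder input = new StringBuilder();
    for (int i = 0; i < 200000; i++) {
      input.append("&amp; ");
    }
    System.out.println(countEntitiesLoop(input.toString()));
  }
}
```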
<pre>switch (event) {
  case XMLEvent.START_DOCUMENT :
    System.out.println("Document start");
    break;
  case XMLEvent.START_ELEMENT :
    System.out.println("Element: " + reader.getLocalName() );

    if ("mytag".equals(reader.getLocalName())) {
      System.out.println(reader.getElementText());
    }
    break;
  case XMLStreamReader.CHARACTERS :
    // Warning: Here be StackOverflowErrors
    System.out.println("Char data:\n" + reader.getText());
    break;
 }</pre>
<p>This is not particularly elegant since it hard codes our <tt>&lt;mytag&gt;</tt> element. A more generic way is to provide your own coalescing implementation of <tt>getText()</tt>:</p>
<pre>/**
 * Use this method in response to XMLEvent.CHARACTERS event instead of
 * XMLStreamReader.getElementText() on a XMLEvent.START_ELEMENT. The former
 * approach will
 * @param reader the XMLStreamReader to pull character data out of,
 *               the reader is expected to be in a XMLEvent.CHARACTERS state
 * @return A string containing the full character data as one string
 * @throws XMLStreamException if the Jupiter aligns with Mars
 */
 public static String getCoalescedText(XMLStreamReader reader)
 throws XMLStreamException {
   StringBuilder builder = new StringBuilder();
   char[] buf = new char[1024];

   // Consecutive CHARACTERS events are the chunks of one text block
   while (reader.getEventType() == XMLEvent.CHARACTERS) {
     int offset = 0;
     int len;
     while (true) {
       len = reader.getTextCharacters(offset, buf, 0, buf.length);
       if (len != 0) builder.append(buf, 0, len);
       if (len &lt; buf.length) break;
       // Advance past the characters that were already copied
       offset += len;
     }
     reader.next();
   }
   return builder.toString();
 }</pre>
<p>And then in the switch branch checking on character events do:</p>
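<pre>     case XMLStreamReader.CHARACTERS :
       // Warning: If you expect a StackOverflowError here, you are
       //          going to wait a long while!
       System.out.println("Character data:\n"
                          + getCoalescedText(reader));
       // getCoalescedText() leaves the reader on the first event after
       // the text, so skip the reader.next() at the end of the loop
       continue;</pre>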
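<p>For completeness, here is a small self-contained variant of the same idea that can be run without any input file. It is a sketch under a few assumptions: the class name <tt>CoalesceByHand</tt> and the method <tt>textOf</tt> are made up, elements with the given name are assumed not to be nested inside each other, and it appends <tt>getText()</tt> for every character event rather than copying through a buffer. <tt>IS_COALESCING</tt> is left unset, so the chunks are joined by our own loop.</p>

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class CoalesceByHand {

  // Returns all character data found inside elements named "tag".
  public static String textOf(String xml, String tag) {
    try {
      // IS_COALESCING is deliberately not set on the factory
      XMLStreamReader reader = XMLInputFactory.newInstance()
          .createXMLStreamReader(new StringReader(xml));
      StringBuilder text = new StringBuilder();
      boolean inside = false;
      while (reader.hasNext()) {
        switch (reader.next()) {
          case XMLStreamConstants.START_ELEMENT:
            if (tag.equals(reader.getLocalName())) inside = true;
            break;
          case XMLStreamConstants.END_ELEMENT:
            if (tag.equals(reader.getLocalName())) inside = false;
            break;
          case XMLStreamConstants.CHARACTERS:
          case XMLStreamConstants.CDATA:
            // The text may arrive in several chunks; join them ourselves
            if (inside) text.append(reader.getText());
            break;
          default:
            break;
        }
      }
      reader.close();
      return text.toString();
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Unable to parse XML", e);
    }
  }

  public static void main(String[] args) {
    String xml = "<mydocument><mytag>a &amp; b</mytag></mydocument>";
    System.out.println(textOf(xml, "mytag"));
  }
}
```

<p>Because the loop asks for <tt>reader.next()</tt> at the top of each iteration, no event is handled twice or skipped, regardless of how many chunks the parser splits the text into.</p>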
<p>Anyway &#8211; this became a long and code-full post. All I really wanted to say was <em>Avoid recursion in Java unless you know exactly what you are doing.</em></p>
</div>]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/09/04/an-excursion-in-java-recursion/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">kamstrup</media:title>
		</media:content>
	</item>
		<item>
		<title>Summa Moving to SourceForge</title>
		<link>http://sbdevel.wordpress.com/2009/08/04/summa-moving-to-sourceforge/</link>
		<comments>http://sbdevel.wordpress.com/2009/08/04/summa-moving-to-sourceforge/#comments</comments>
		<pubDate>Tue, 04 Aug 2009 09:55:00 +0000</pubDate>
		<dc:creator>kamstrup</dc:creator>
				<category><![CDATA[Summa]]></category>
		<category><![CDATA[kamstrup]]></category>
		<category><![CDATA[gforge]]></category>
		<category><![CDATA[migration]]></category>
		<category><![CDATA[sourceforge]]></category>
		<category><![CDATA[trac]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=707</guid>
		<description><![CDATA[Yesterday I had the pleasure to announce on the mailing lists that Summa has reached the first milestone in migrating to SourceForge, and here follows the blog post  
From now on all Summa code is hosted and developed in the &#8220;summa&#8221; project on SourceForge now, in addition all bugs have been migrated from our [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&blog=4699377&post=707&subd=sbdevel&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Yesterday I had the pleasure to <a href="https://lists.gforge.statsbiblioteket.dk/pipermail/summa-announce/2009-August/000022.html">announce on the mailing lists</a> that Summa has reached the first milestone in migrating to <a href="http://sf.net">SourceForge</a>, and here follows the blog post <img src='http://s.wordpress.com/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' /> </p>
<p>From now on all Summa code is hosted and developed in the <a href="http://sourceforge.net/projects/summa/">&#8220;summa&#8221; project on SourceForge</a>. In addition, all bugs have been migrated from our old GForge solution to a <a href="https://sourceforge.net/apps/trac/summa">Trac instance</a> hosted via the cool new &#8220;hosted apps&#8221; functionality on SourceForge.</p>
<p>We will also move the mailing lists over in the near future. The fate of the <a href="http://wiki.statsbiblioteket.dk/summa">Summa wiki</a> is still left unclear.</p>
<p>I must be frank and admit that I have long felt that SourceForge was at a bit of a standstill, applying only visual refreshes every now and then and never fixing the <em>real</em> issues with the site. However, the new Hosted Apps approach is simply sweet! There is a huge list of popular open source products you can choose to run on your site as hosted apps (see an <a href="https://sourceforge.net/apps/trac/sourceforge/wiki/Hosted%20Apps">incomplete list here</a>). For instance, some may be surprised to know that popular version control systems such as <a href="http://git-scm.com/">Git</a>, <a href="http://mercurial.selenic.com/wiki/">Mercurial</a>, and <a href="http://bazaar-vcs.org/">Bazaar</a> are supported as well as <a href="http://subversion.tigris.org/">Subversion</a>. Right now we run only a <a href="http://trac.edgewall.org/">Trac</a> issue tracker and a Subversion repository.</p>
<p>On a personal note I must still admit that my heart lies with the recently open sourced <a href="https://launchpad.net/">Launchpad</a>, despite the recent kick-assiness from the SF team.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/08/04/summa-moving-to-sourceforge/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">kamstrup</media:title>
		</media:content>
	</item>
		<item>
		<title>Ticer Summer School</title>
		<link>http://sbdevel.wordpress.com/2009/08/03/ticer-summer-school/</link>
		<comments>http://sbdevel.wordpress.com/2009/08/03/ticer-summer-school/#comments</comments>
		<pubDate>Mon, 03 Aug 2009 21:24:15 +0000</pubDate>
		<dc:creator>villadsen</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[09ticer]]></category>
		<category><![CDATA[presentation]]></category>
		<category><![CDATA[Summa]]></category>
		<category><![CDATA[tilburg]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=700</guid>
		<description><![CDATA[I went to the Ticer Summer School 2009 on Friday 31st of July to talk about Summa as part of a panel on Integrated Search alongside Jørgen Madsen (Primo), David Lindahl (eXtensible Catalog), and Benoît Pauwels (VuFind).
Thomas Place made the introduction explaining the basic concepts of Integrated Search. Afterwards we each presented our systems [...]<img alt="" border="0" src="http://stats.wordpress.com/b.gif?host=sbdevel.wordpress.com&blog=4699377&post=700&subd=sbdevel&ref=&feed=1" />]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>I went to the <a href="http://www.tilburguniversity.nl/services/lis/ticer/09carte/">Ticer Summer School 2009</a> on Friday 31st of July to talk about Summa as part of a panel on Integrated Search alongside Jørgen Madsen (Primo), David Lindahl (eXtensible Catalog), and Benoît Pauwels (VuFind).</p>
<div id="attachment_703" class="wp-caption alignright" style="width: 261px"><a href="http://sbdevel.files.wordpress.com/2009/08/tilburg-university-campus.jpg"><img class="size-medium wp-image-703" title="tilburg-university-campus" src="http://sbdevel.files.wordpress.com/2009/08/tilburg-university-campus.jpg?w=251&#038;h=300" alt="Tilburg University Campus" width="251" height="300" /></a><p class="wp-caption-text">Tilburg University Campus</p></div>
<p>Thomas Place made the introduction explaining the basic concepts of Integrated Search. Afterwards we each presented our systems followed by a short question and answer session. I think it was during these presentations that people realized how similar most of the systems actually were.</p>
<p>After a coffee break each of us once again got up to do a second presentation &#8211; this time focusing more on a specific feature.</p>
<ul>
<li>Jørgen Madsen: Joining Catalogues &#8211; Clean-up and Deduplication</li>
<li>David Lindahl: Metadata Handling and FRBR</li>
<li>Mads Villadsen: Facetting and Clustering</li>
<li>Benoît Pauwels: Web 2.0 Features of Integrated Search</li>
</ul>
<p>Each presentation was once again followed by a longer question and answer session.</p>
<p>My slides from both of these presentations can be found on the <a href="http://wiki.statsbiblioteket.dk/summa/Presentations">Presentations</a> page on the <a href="http://wiki.statsbiblioteket.dk/summa/">Summa Wiki</a>.</p>
<p>The day concluded with a panel discussion based on questions from the audience. Throughout the day the audience had been very good at asking relevant questions, and I think they really managed to get to the core of the issues regarding usability, faceted searching, and integrated search itself.</p>
<p>This is the first time I have been part of a panel in this way, and also the first time I have been at a Ticer Summer School &#8211; and I found both things to be a really great experience. I have never been to anything that has been as well organized as this, and all the people were very interested and engaged in the discussions. The other panel members were well prepared, open for discussion, and willing to talk freely about any issues. I can only hope that I was in the same league as them.</p>
<p>All in all I had a really nice time, and if I ever get the chance to do something similar again I would be very interested.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/08/03/ticer-summer-school/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">villadsen</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2009/08/tilburg-university-campus.jpg?w=251" medium="image">
			<media:title type="html">tilburg-university-campus</media:title>
		</media:content>
	</item>
		<item>
		<title>Quick and dirty test of the YUI Compressor</title>
		<link>http://sbdevel.wordpress.com/2009/07/02/quick-and-dirty-test-of-the-yui-compressor/</link>
		<comments>http://sbdevel.wordpress.com/2009/07/02/quick-and-dirty-test-of-the-yui-compressor/#comments</comments>
		<pubDate>Thu, 02 Jul 2009 06:12:03 +0000</pubDate>
		<dc:creator>Jørn Thøgersen</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[css]]></category>
		<category><![CDATA[javascript]]></category>
		<category><![CDATA[optimizing]]></category>
		<category><![CDATA[YUI Compressor]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=681</guid>
		<description><![CDATA[As a part of our quest to optimize the speed of our search front end I recently tried out the Yahoo js and css minifier &#8211; YUI Compressor.
At first glance the nice things about the YUI Compressor are that it is Java-based (we are a Java friendly team), open source and fairly [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><span style="font-size:small;">As a part of <a href="http://sbdevel.wordpress.com/2009/07/01/thoughts-on-optimizing-our-search-web-site/" target="_self">our quest to optimize the speed of our search front end</a> I recently tried out the Yahoo js and css minifier &#8211; YUI Compressor.</span></p>
<p><span style="font-size:small;">At first glance the nice things about the YUI Compressor are that it is Java-based (we are a Java friendly team), open source and fairly easy to work with. The YUI Compressor handles both javascript and css but in this post I have chosen to focus on the js part.</span></p>
<p><span style="font-size:small;">The test integration into my IDE (Intellij IDEA) and the project was quite easy because somebody has taken the time to write YUIAnt. I just downloaded the <a href="http://www.julienlecomte.net/yuicompressor/" target="_self">YUI compressor version 2.4.2</a> and the <a href="http://www.ubik-ingenierie.com/miscellanous/YUIAnt/" target="_self">YUIAnt.jar</a>, added them to the project and modified my build scripts to run the compressor when the website is deployed to the web server. The beauty of this is that you naturally don&#8217;t have to look at the minified javascript when editing, and if you for some reason want to debug the code at run time you can easily set up a debug option in your build script and bypass the compressor for on the fly debugging. If you aren&#8217;t into all this build script stuff or have a simple project there are lots of online YUI Compressor sites out there where you can paste your js code or css and get a compressed version in return.</span></p>
<p><span style="font-size:small;">Version 2.4.2 of the YUI Compressor nearly worked without problems. For some reason &#8211; I didn&#8217;t bother to investigate further &#8211; the YUI Compressor had some issues with unterminated strings in the jscalendar-1.0 library. I just excluded the directory and went on with my small non-scientific test using Firebug as my test environment.</span></p>
<p><span style="font-size:small;">The first screen shot shows the size and load times for our js files. Business as usual &#8211; the YUI Compressor is disabled.</span></p>
<p><span style="font-size:small;"><img class="alignnone size-full wp-image-684" title="nocompress2Scaled" src="http://sbdevel.files.wordpress.com/2009/07/nocompress2scaled.png?w=449&#038;h=615" alt="nocompress2Scaled" width="449" height="615" /><br />
</span></p>
<p><span style="font-size:small;">The next screen shot shows the size and load times for the same js files now with the compression enabled.</span></p>
<p><span style="font-size:small;"><img class="alignnone size-full wp-image-685" title="compress2Scaled" src="http://sbdevel.files.wordpress.com/2009/07/compress2scaled.png?w=449&#038;h=489" alt="compress2Scaled" width="449" height="489" /><br />
</span></p>
<p><span style="font-size:small;">The file sizes have been reduced and the overall load time has shrunk by approximately half a second. When the file sizes are very small the load times are very sensitive to queuing effects, but the file size is in most cases reduced. In the case of bigger js files the improvement in speed as well as size is clear. I have tried to compensate for caching effects in both cases (compressed/not compressed). It seems that there is about a 20-25% reduction in file size and approximately the same reduction in load time for the js. These numbers are without using the obfuscation option (reduction of variable names to the shortest possible length and other tricks) simply because I don&#8217;t think we will be comfortable with this knowing that it might cause errors.</span></p>
<p><span style="font-size:small;">As I am new to this I am interested to hear about any major drawbacks compressing/minifying may have.</span></p>
<p><span style="font-size:small;">This is of course a small step and not something which alone makes the difference between a slow and a fast site but I am hoping that attention to a number of different optimization issues will make a big difference in the long run.</span></p>
</div>]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/07/02/quick-and-dirty-test-of-the-yui-compressor/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Thøgersen</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2009/07/nocompress2scaled.png" medium="image">
			<media:title type="html">nocompress2Scaled</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2009/07/compress2scaled.png" medium="image">
			<media:title type="html">compress2Scaled</media:title>
		</media:content>
	</item>
		<item>
		<title>Thoughts on optimizing our search web site</title>
		<link>http://sbdevel.wordpress.com/2009/07/01/thoughts-on-optimizing-our-search-web-site/</link>
		<comments>http://sbdevel.wordpress.com/2009/07/01/thoughts-on-optimizing-our-search-web-site/#comments</comments>
		<pubDate>Wed, 01 Jul 2009 11:15:41 +0000</pubDate>
		<dc:creator>Jørn Thøgersen</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[optimizing response time]]></category>
		<category><![CDATA[search]]></category>
		<category><![CDATA[YUI Compressor]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=669</guid>
		<description><![CDATA[
The code for our search front end has over time grown to a considerable size and we have started to suspect that the web site&#8217;s response time could be better. With this in mind I have for some time now been keen on looking into optimizing the speed of our front end &#8211; especially when [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p><img class="alignnone size-full wp-image-670" title="wwws" src="http://sbdevel.files.wordpress.com/2009/07/wwws.jpg?w=450&#038;h=279" alt="wwws" width="450" height="279" /></p>
<p><span style="font-size:small;">The code for <a href="http://www.statsbiblioteket.dk/search/" target="_self">our search front end</a> has over time grown to a considerable size and we have started to suspect that the web site&#8217;s response time could be better. With this in mind I have for some time now been keen on looking into optimizing the speed of our front end &#8211; especially when the underlying search engine <a href="http://www.statsbiblioteket.dk/summa" target="_self">Summa</a> has proven to be blazingly fast.</span></p>
<p><span style="font-size:small;">There are a lot of things we could do better such as:</span></p>
<p><span style="font-size:small;">1. Optimizing the javascript code by trawling through the lot and removing redundancy as well as rewriting some of the methods to be more efficient.</span></p>
<p><span style="font-size:small;">2. A thorough cleanup of the css. There is a lot we can do here as we have loads of redundancy, classes which are not in use anymore and declarations which could be handled way cooler. Another thing I noticed is we like divs &#8211; loads of divs.</span></p>
<p><span style="font-size:small;">3. Taking a critical look at our numerous DOM transformations. Some of them are downright unnecessary.</span></p>
<p><span style="font-size:small;">4. General optimizing of the server side code. In fact this part isn&#8217;t all that bad but a general clean up once in a while doesn&#8217;t hurt anybody.</span></p>
<p><span style="font-size:small;">Because my summer holiday is coming up soon I have chosen to start with some light weight stuff. I have tried out the newest version of the <a href="http://developer.yahoo.com/yui/compressor/" target="_self">YUI Compressor</a> &#8211; a tool to compress/minify javascript and css. As we don&#8217;t use minifying at the moment we should be able to benefit from it performance wise. In order not to clutter up this post I will post my experience with this in a separate post soonish.</span></p>
</div>]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/07/01/thoughts-on-optimizing-our-search-web-site/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">Thøgersen</media:title>
		</media:content>

		<media:content url="http://sbdevel.files.wordpress.com/2009/07/wwws.jpg" medium="image">
			<media:title type="html">wwws</media:title>
		</media:content>
	</item>
		<item>
		<title>Efficient sorting and iteration on large databases</title>
		<link>http://sbdevel.wordpress.com/2009/06/15/efficient-sorting-and-iteration-on-large-databases/</link>
		<comments>http://sbdevel.wordpress.com/2009/06/15/efficient-sorting-and-iteration-on-large-databases/#comments</comments>
		<pubDate>Mon, 15 Jun 2009 12:27:15 +0000</pubDate>
		<dc:creator>kamstrup</dc:creator>
				<category><![CDATA[Database]]></category>
		<category><![CDATA[Summa]]></category>
		<category><![CDATA[kamstrup]]></category>
		<category><![CDATA[databases]]></category>
		<category><![CDATA[efficient]]></category>
		<category><![CDATA[optimization]]></category>
		<category><![CDATA[sorting]]></category>
		<category><![CDATA[sql]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=634</guid>
		<description><![CDATA[Before you read on, heed my words that this post might be a wee bit technical&#8230; If not extremely technical &#8211; caveat emptor
The Problem
In our continuous quest for a blazingly fast Summa, we ran into a performance problem extracting and sorting huge result sets from our caching database. Concretely we store ~9M rows in a [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>Before you read on, heed my words that this post might be a wee bit technical&#8230; If not extremely technical &#8211; <em>caveat emptor</em></p>
<h3>The Problem</h3>
<p>In our continuous quest for a blazingly fast Summa, we ran into a performance problem extracting and sorting huge result sets from our caching database. Concretely we store ~9M rows in an <a href="http://www.h2database.com">H2 database</a>. All records are annotated with a modification time (henceforth <em>mtime</em>) and we use this timestamp to determine if we need to update the index. When updating the index we read records from the database, sorted by this mtime column.</p>
<p>This means that for the initial indexing we create a sorted result set of 9M records. The first observation is that we should definitely have an index on the mtime column. Even with that, many databases will take some time for such queries and it might lead to big memory allocations or temporary tables being set up. We don&#8217;t want any of that. We want lightweight transactions and speed!</p>
<h3>Take One, LIMIT and OFFSET</h3>
<p>The naive approach (at least, the first thing that I tried!) is to use the SQL LIMIT and OFFSET clauses to create small result sets, of size 1000, and then do client side paging, something à la:</p>
<pre>  SELECT * FROM records ORDER BY mtime LIMIT 1000 OFFSET $last_offset</pre>
<p>Here we increment <em>last_offset</em> by 1000 each time we request a new page. However this solution will perform extremely badly. The database server will need to discard the first <em>last_offset</em> records before it can return the next 1000 records to you. When we are talking millions of records this can be quite an overhead. The database cannot apply any smart tricks to make this fast because it has no a priori way to find out where the record with offset <em>last_offset</em> into the result set begins.</p>
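<p>A minimal sketch of the LIMIT/OFFSET paging loop, written in Python against an in-memory SQLite table rather than the H2 database used in Summa. The table layout, the <em>offset_pages</em> helper and the row counts are assumptions made for illustration, not code from Summa. The last lines add up how many rows the server would have to step over to page through 9M records 1000 at a time.</p>
<pre>
import sqlite3

# Hypothetical stand-in for the record store: an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, mtime INTEGER)")
conn.execute("CREATE INDEX records_mtime ON records (mtime)")
conn.executemany("INSERT INTO records (mtime) VALUES (?)",
                 [(1000 + i,) for i in range(2500)])

def offset_pages(conn, page_size=1000):
    """Yield pages via LIMIT/OFFSET; every query skips `offset` rows first."""
    offset = 0
    while True:
        page = conn.execute(
            "SELECT id, mtime FROM records ORDER BY mtime LIMIT ? OFFSET ?",
            (page_size, offset)).fetchall()
        if not page:
            return
        yield page
        offset += page_size

page_sizes = [len(page) for page in offset_pages(conn)]
print(page_sizes)  # [1000, 1000, 500]

# Rows discarded in total when paging 9M rows in pages of 1000:
# 0 + 1000 + 2000 + ... + 8999000
skipped = sum(range(0, 9_000_000, 1000))
print(skipped)  # 40495500000
</pre>
<p>The work grows roughly with the square of the table size, which is why the approach falls over long before the last page.</p>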
<h3>Take Two, Salted Timestamps</h3>
<p>So what can we do? The thing that databases are fast at is looking stuff up in indexes. We need to make it use some indexes to calculate the pages&#8230;</p>
<p>The idea is to use the index on the mtime column to calculate the offset, then when we request a new page we use the mtime of the last record in the last result set. This may just work out because we sort everything by mtime. Maybe like:</p>
<pre>  SELECT * FROM records WHERE mtime&gt;$last_mtime ORDER BY mtime LIMIT 1000</pre>
<p>Alas, this contains a subtle bug. Since we might insert more than one record per millisecond the mtime of a record might not be unique. This means that we might skip some records in between pages or include some records in multiple pages.</p>
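<p>The skipped-record case can be reproduced with a tiny example. This is only a sketch in Python with an in-memory SQLite table; the table contents, the page size of 2 and the <em>keyset_pages</em> helper are made up for illustration, and the extra <em>id</em> in the ORDER BY is there only to make the output deterministic.</p>
<pre>
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, mtime INTEGER)")
# Three records share mtime 100, so a page boundary can fall between them.
conn.executemany("INSERT INTO records (id, mtime) VALUES (?, ?)",
                 [(1, 100), (2, 100), (3, 100), (4, 101), (5, 102)])

def keyset_pages(conn, page_size):
    """Page by 'mtime > last seen mtime', as in the query above."""
    last_mtime = -1
    while True:
        page = conn.execute(
            "SELECT id, mtime FROM records WHERE mtime > ? "
            "ORDER BY mtime, id LIMIT ?", (last_mtime, page_size)).fetchall()
        if not page:
            return
        yield page
        last_mtime = page[-1][1]  # mtime of the last record in this page

seen = [row[0] for page in keyset_pages(conn, 2) for row in page]
print(seen)  # [1, 2, 4, 5] -- record 3 is never returned
</pre>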
<p>If we somehow force the mtimes to be unique the above query would actually work. One solution is to always ensure that there is at minimum 1ms between each insertion &#8211; this is way too slow for us, so we devised what we have dubbed <em>Salted Timestamps</em>.</p>
<p>Instead of using 32 bit INTEGERs for mtime we use 64 bit integers (a BIGINT on most SQL servers). We move the actual timestamp to the most significant 44 bits and then store a salt in the least significant 20 bits. The salt is basically just a counter that is reset each millisecond, meaning that we can add 1048576 records per millisecond before we run out of salts. This way we get a &#8220;timestamp&#8221; that still sorts correctly and we can even create a UNIQUE index on the mtime column.</p>
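<p>A rough sketch of the bit layout in Python. The <em>SaltedTimestamps</em> class and its method names are hypothetical and this is not the UniqueTimestampGenerator from Summa; it only illustrates packing a millisecond timestamp and a 20 bit salt into one sortable integer. Handling a clock that steps backwards by reusing the last millisecond is an assumption of this sketch.</p>
<pre>
SALT_BITS = 20
SALT_LIMIT = 2 ** SALT_BITS  # 1048576 salts per millisecond

class SaltedTimestamps:
    """Hypothetical generator of unique, sortable 64 bit timestamps."""

    def __init__(self):
        self.last_ms = -1
        self.salt = 0

    def next(self, now_ms):
        if now_ms > self.last_ms:
            # New millisecond: reset the salt counter.
            self.last_ms = now_ms
            self.salt = 0
        else:
            # Same millisecond (or a clock that stepped back): bump the salt.
            self.salt += 1
            if self.salt == SALT_LIMIT:
                raise OverflowError("out of salts for this millisecond")
        # Timestamp in the high bits, salt in the low 20 bits.
        return self.last_ms * SALT_LIMIT + self.salt

gen = SaltedTimestamps()
stamps = [gen.next(1000), gen.next(1000), gen.next(1000), gen.next(1001)]
decoded = [divmod(stamp, SALT_LIMIT) for stamp in stamps]
print(decoded)  # [(1000, 0), (1000, 1), (1000, 2), (1001, 0)]
</pre>
<p>Because the salt only occupies the low bits, comparing two salted values compares the milliseconds first and the insertion order second, so paging with the mtime comparison from the query above no longer skips or repeats records.</p>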
<h3>Conclusion</h3>
<p>We have adopted the approach with salted timestamps as described above for Summa and so far it has proven to perform quite well (avg. ~2000 records/s). An added bonus is that we only put very light load on the db, because the transactions are small and fast. You can find an implementation of this scheme in the <a href="https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/trunk/Storage/src/dk/statsbiblioteket/summa/storage/database/DatabaseStorage.java?root=summa&amp;view=markup">DatabaseStorage</a>* class and the timestamp handling in the <a href="https://gforge.statsbiblioteket.dk/plugins/scmsvn/viewcvs.php/trunk/Storage/src/dk/statsbiblioteket/summa/storage/database/UniqueTimestampGenerator.java?root=summa&amp;view=markup">UniqueTimestampGenerator</a>* class in the Storage module in the Summa sources.</p>
<p><strong>*)</strong> Most sorry that these links require a login (which is freely available, but anyway) &#8211; we are working on a solution with anonymous access. More on that later.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/06/15/efficient-sorting-and-iteration-on-large-databases/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">kamstrup</media:title>
		</media:content>
	</item>
		<item>
		<title>Faceting and Flash Disks in the Gobi Desert</title>
		<link>http://sbdevel.wordpress.com/2009/05/16/faceting-and-flash-disks-in-the-gobi-desert/</link>
		<comments>http://sbdevel.wordpress.com/2009/05/16/faceting-and-flash-disks-in-the-gobi-desert/#comments</comments>
		<pubDate>Sat, 16 May 2009 14:46:20 +0000</pubDate>
		<dc:creator>villadsen</dc:creator>
				<category><![CDATA[Conference]]></category>
		<category><![CDATA[Presentations]]></category>

		<guid isPermaLink="false">http://sbdevel.wordpress.com/?p=624</guid>
		<description><![CDATA[As has been mentioned in many different places the code4lib 2009 videos are now online.
Those that missed the finer details in Toke&#8217;s talk about complete faceting of 100 million documents can go see the video here. Lots of good, nerdy stuff.
Toke and Mikkel also both gave lightning talks.  Mikkel&#8217;s talk about how easy it would [...]]]></description>
			<content:encoded><![CDATA[<div class='snap_preview'><br /><p>As has been mentioned in many different places the code4lib 2009 videos are now online.</p>
<p>Those that missed the finer details in Toke&#8217;s talk about complete faceting of 100 million documents can go see the <a href="http://code4lib.org/conference/2009/eskildsen">video here</a>. Lots of good, nerdy stuff.</p>
<p>Toke and Mikkel also both gave lightning talks.  Mikkel&#8217;s talk about how easy it would be to set up a Summa installation in the Gobi Desert was on day three, and is available <a href="http://dl.lib.brown.edu/code4lib/lightning_day3.html">here</a>. He is on somewhere near the middle of the video (number 5 of 9).</p>
<p>Toke&#8217;s talk about Flash Disks and how they will save every one of us was on day two, and can be seen <a href="http://dl.lib.brown.edu/code4lib/lightning_day2.html">here</a>. His talk starts about a third of the way into the video as number 6 of 16 (he is actually on twice &#8211; the first attempt without any graphs in his slides is number 4&#8230;).</p>
<p>Watching the videos really makes me want to go to code4lib again next year.</p>
</div>]]></content:encoded>
			<wfw:commentRss>http://sbdevel.wordpress.com/2009/05/16/faceting-and-flash-disks-in-the-gobi-desert/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
	
		<media:content url="" medium="image">
			<media:title type="html">villadsen</media:title>
		</media:content>
	</item>
	</channel>
</rss>