Quick and dirty test of the YUI Compressor

July 2, 2009 by Jørn Thøgersen

As part of our quest to optimize the speed of our search front end I recently tried out Yahoo’s js and css minifier, the YUI Compressor.

At first glance the nice things about the YUI Compressor are that it is Java based (we are a Java friendly team), open source and fairly easy to work with. The YUI Compressor handles both javascript and css but in this post I have chosen to focus on the js part.

The test integration into my IDE (Intellij IDEA) and the project was quite easy because somebody has taken the time to write YUIAnt. I just downloaded the YUI Compressor version 2.4.2 and the YUIAnt.jar, added them to the project and modified my build scripts to run the compressor when the website is deployed to the web server. The beauty of this is that you don’t have to look at the minified javascript when editing, and if you for some reason want to debug the code at run time you can easily set up a debug option in your build script that bypasses the compressor for on-the-fly debugging. If you aren’t into all this build script stuff or have a simple project there are lots of online YUI Compressor sites out there where you can paste your js code or css and get a compressed version in return.

Version 2.4.2 of the YUI Compressor nearly worked without problems. For some reason (I didn’t bother to investigate further) the YUI Compressor had some issues with unterminated strings in the jscalendar-1.0 library. I just excluded the directory and went on with my small non-scientific test using Firebug as my test environment.

The first screen shot shows the size and load times for our js files. Business as usual – the YUI Compressor is disabled.

[Screenshot: js file sizes and load times in Firebug, YUI Compressor disabled]

The next screen shot shows the size and load times for the same js files now with the compression enabled.

[Screenshot: js file sizes and load times in Firebug, YUI Compressor enabled]

The file sizes have been reduced and the overall load time has shrunk by approximately half a second. When the files are very small the load times are very sensitive to queuing effects, but the file size is in most cases reduced. In the case of bigger js files the improvement in speed as well as size is clear. I have tried to compensate for caching effects in both cases (compressed/not compressed). It seems that there is about a 20-25% reduction in file size and approximately the same reduction in load time for the js. These numbers are without using the obfuscation option (reduction of variable names to the shortest possible length and other tricks) simply because I don’t think we will be comfortable with this knowing that it might cause errors.

As I am new to this I am interested to hear about any major drawbacks compressing/minifying may have.

This is of course a small step and not something which alone makes the difference between a slow and a fast site but I am hoping that attention to a number of different optimization issues will make a big difference in the long run.

Thoughts on optimizing our search web site

July 1, 2009 by Jørn Thøgersen


The code for our search front end has over time grown to a considerable size and we have started to suspect that the web site’s response time could be better. With this in mind I have for some time now been keen on looking into optimizing the speed of our front end – especially when the underlying search engine Summa has proven to be blazingly fast.

There are a lot of things we could do better such as:

1. Optimizing the javascript code by trawling through the lot and removing redundancy as well as rewriting some of the methods to be more efficient.

2. A thorough cleanup of the css. There is a lot we can do here as we have loads of redundancy, classes which are not in use anymore and declarations which could be handled way cooler. Another thing I noticed is we like divs – loads of divs.

3. Taking a critical look at our numerous DOM transformations. Some of them are downright unnecessary.

4. General optimizing of the server side code. In fact this part isn’t all that bad but a general clean up once in a while doesn’t hurt anybody.

Because my summer holiday is coming up soon I have chosen to start with some lightweight stuff. I have tried out the newest version of the YUI Compressor, a tool to compress/minify javascript and css. As we don’t use minifying at the moment we should be able to benefit from it performance-wise. In order not to clutter up this post I will write about my experience with this in a separate post soonish.

Efficient sorting and iteration on large databases

June 15, 2009 by kamstrup

Before you read on, heed my words that this post might be a wee bit technical… If not extremely technical – caveat emptor

The Problem

In our continuous quest for a blazingly fast Summa, we ran into a performance problem extracting and sorting huge result sets from our caching database. Concretely, we store ~9M rows in an H2 database. All records are annotated with a modification time (henceforth mtime), and we use this timestamp to determine if we need to update the index. When updating the index we read records from the database, sorted by this mtime column.

This means that for the initial indexing we create a sorted result set of 9M records. The first observation is that we should definitely have an index on the mtime column. Even with that, many databases will take some time for such queries and it might lead to big memory allocations or temporary tables being set up. We don’t want any of that. We want lightweight transactions and speed!

Take One, LIMIT and OFFSET

The naive approach (at least, the first thing that I tried!) is to use the SQL LIMIT and OFFSET clauses to create small result sets, of size 1000, and then do client side paging, something à la:

  SELECT * FROM records ORDER BY mtime LIMIT 1000 OFFSET $last_offset

Here we increment last_offset by 1000 each time we request a new page. However, this solution will perform extremely badly. The database server will need to discard the first last_offset records before it can return the next 1000 records to you. When we are talking millions of records this can be quite an overhead. The database cannot apply any smart tricks to make this fast because it has no a priori way to find out where the record with offset last_offset into the result set begins.

Take Two, Salted Timestamps

So what can we do? The thing that databases are fast at is looking stuff up in indexes. We need to make it use some indexes to calculate the pages… 

The idea is to use the index on the mtime column to find the start of each page: when we request a new page we use the mtime of the last record in the previous result set. This may just work out because we sort everything by mtime. Maybe like:

  SELECT * FROM records WHERE mtime>$last_mtime ORDER BY mtime LIMIT 1000

Alas, this contains a subtle bug. Since we might insert more than one record per millisecond the mtime of a record might not be unique. This means that we might skip some records in between pages or include some records in multiple pages.
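To make the skipping concrete, here is a small in-memory sketch in Java. It is not Summa code: the class and method names are made up for illustration, and a sorted list of mtimes stands in for the records table. The page method mimics the WHERE mtime > $last_mtime query, and countRead pages through everything the way a client would.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; a sorted list of mtimes stands in for the table.
final class KeysetPagingDemo {

    /** Mimics: SELECT ... WHERE mtime > lastMtime ORDER BY mtime LIMIT limit */
    static List<Long> page(List<Long> sortedMtimes, long lastMtime, int limit) {
        List<Long> result = new ArrayList<>();
        for (long mtime : sortedMtimes) {
            if (mtime > lastMtime && result.size() < limit) {
                result.add(mtime);
            }
        }
        return result;
    }

    /** Pages through the whole list and counts how many rows the client saw. */
    static int countRead(List<Long> sortedMtimes, int limit) {
        int read = 0;
        long lastMtime = Long.MIN_VALUE;
        while (true) {
            List<Long> rows = page(sortedMtimes, lastMtime, limit);
            if (rows.isEmpty()) {
                return read;
            }
            read += rows.size();
            // Continue after the mtime of the last row on this page.
            lastMtime = rows.get(rows.size() - 1);
        }
    }
}
```

With the mtimes 1, 1, 1, 2, 3 and a page size of 2, the first page ends on an mtime of 1, so the third record with mtime 1 never shows up on any page and only four of the five rows are read. With five distinct mtimes all five are read.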

If we somehow force the mtimes to be unique the above query would actually work. One solution is to always ensure that there is at minimum 1ms between each insertion. This is way too slow for us, so we devised what we have dubbed Salted Timestamps.

Instead of using 32 bit INTEGERs for mtime we use 64 bit integers (a BIGINT on most SQL servers). We move the actual timestamp to the most significant 44 bits and then store a salt in the least significant 20 bits. The salt is basically just a counter that is reset each millisecond, meaning that we can add 2^20 = 1048576 records per millisecond before we run out of salts. With this construct we get a “timestamp” that still sorts correctly and we can even create a UNIQUE index on the mtime column.
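The bit layout can be sketched in a few lines of Java. This is only an illustration of the idea, not the UniqueTimestampGenerator from the Summa sources: the class name, the method names and the handling of the two edge cases (salt overflow and a clock that steps backwards) are my own assumptions.

```java
// Hypothetical sketch of the 44 bit timestamp + 20 bit salt layout.
final class SaltedTimestamps {
    static final int SALT_BITS = 20;
    static final long SALT_MASK = (1L << SALT_BITS) - 1; // 1048575

    private long lastMillis = -1;
    private long salt = 0;

    /** Puts the millisecond timestamp in the high bits and the salt in the low 20 bits. */
    static long pack(long millis, long salt) {
        return (millis << SALT_BITS) | (salt & SALT_MASK);
    }

    /** Recovers the millisecond timestamp from a salted value. */
    static long millisOf(long salted) {
        return salted >>> SALT_BITS;
    }

    /** Recovers the salt from a salted value. */
    static long saltOf(long salted) {
        return salted & SALT_MASK;
    }

    /** Returns a value strictly greater than any value returned before. */
    synchronized long next(long nowMillis) {
        if (nowMillis > lastMillis) {
            // New millisecond: reset the counter.
            lastMillis = nowMillis;
            salt = 0;
        } else {
            // Same millisecond (or, by assumption, a clock that went
            // backwards): stay on lastMillis and bump the salt.
            salt++;
            if (salt > SALT_MASK) {
                // Assumption: fail loudly once the salts are used up.
                throw new IllegalStateException("Out of salts for this millisecond");
            }
        }
        return pack(lastMillis, salt);
    }
}
```

A caller would typically pass System.currentTimeMillis() to next. Two calls within the same millisecond differ only in the low 20 bits, so the values stay unique and still sort by time.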

Conclusion

We have adopted the approach with salted timestamps as described above for Summa and so far it has proven to perform quite well (avg. ~2000 records/s). An added bonus is that we only put very light load on the db, because the transactions are small and fast. You can find an implementation of this scheme in the DatabaseStorage* class and the timestamp handling in the UniqueTimestampGenerator* class in the Storage module in the Summa sources.

*) Most sorry that these links require a login (which is freely available, but anyway) – we are working on a solution with anonymous access. More on that later.

Faceting and Flash Disks in the Gobi Desert

May 16, 2009 by villadsen

As has been mentioned in many different places the code4lib 2009 videos are now online.

Those that missed the finer details in Toke’s talk about complete faceting of 100 million documents can go see the video here. Lots of good, nerdy stuff.

Toke and Mikkel also both gave lightning talks.  Mikkel’s talk about how easy it would be to set up a Summa installation in the Gobi Desert was on day three, and is available here. He is on somewhere near the middle of the video (number 5 of 9).

Toke’s talk about Flash Disks and how they will save every one of us was on day two, and can be seen here. His talk starts about a third of the way into the video as number 6 of 16 (he is actually on twice: the first attempt, without any graphs in his slides, is number 4…).

Watching the videos really makes me want to go to code4lib again next year.

Usability testing Summa Search

May 5, 2009 by michaelpoltorak


We recently did a usability test of the library’s Summa based search engine – known as Search. To ensure objectivity the test was headed by Julia from UNI-C and was done as a think-out-loud test with eight users from the nearby university. You can download the report in Danish (English version coming up) or read a brief recap of the conclusions:

Bad stuff first:

  • Use of sort and facets. Many test users didn’t use or had trouble using facets and sort features for narrowing down search results.
  • The request list. Conceptually, the request list is hard to grasp for some users. In order to request a number of items, the user must put the items on the list one by one. Once done, he has to press the request button to actually perform the request. It seems that some users miss the last crucial step and actually believe that requests have been sent after the first step – but they have not.
  • Search and request procedure for articles. Finding and requesting articles is perhaps the Achilles’ heel of the current system. Very often users think they can search for individual articles. In most cases they cannot, but actually have to go through a printed or an online journal to either find data about the article or to get an online version for download. Unfortunately, it has been difficult for us to communicate this counter-intuitive circumstance to users through the interface. Clearly, we’ll need to have another go at the problem, but the best solution, i.e. having all articles searchable and preferably in full-text versions, is not likely to happen in the near future.

The good things:

  • Todo list. The list has been very well received as many users like to keep information about material between sessions. Furthermore, the concept is intuitive and known from other websites and applications – and even from the real world
  • Google style straightforward search. Overall the system is fast and easy to use in terms of searching
  • Did you mean. The Google inspired feature is great for catching tpyos
  • Suggest. The feature suggests words that other users have already searched for. The most popular searches are shown first.
    This is a feature that many users seem to just use out-of-the-box. It can be used as inspiration as well as a quick-spell-thing
  • Added value, such as book covers, table-of-contents, sample chapters, author biographies, etc. Many materials in our database have such extra content information added. Users like this and find it very helpful in assisting them in making judgements about a material’s relevance.

Overall, we are very satisfied with the test because it confirmed some suspicions we’ve been having for some time now, and especially because it highlighted the problems related to the request list. We’ll be working with the problem areas over the next few months.

Rising Summa

April 21, 2009 by eskildsen

The people higher up in the food-chain have decided to provide a Summa based search backend to public libraries in Denmark. For an annual fee, Statsbiblioteket handles the flow of data from raw dumps to webservices and keeps the servers running. Maintenance, money and sales are fairly boring, seen from our developer perspective, but tweaking Summa to allow for easy experimentation and setup has been very rewarding.

As usual Mikkel wove his magic and created a package (read: a collection of scripts and all the JARs from the Summa project) that makes it very simple to set up a local Summa for experimentation. The working title was Summix, but we all know how it goes with working titles.

With some late night fiddling, the complexity was reduced to “Unzip and run a script”, which gets a Summa demo running with a skeleton web front end. Added bonus? It runs under Windows as well as Linux (and probably OS X too, but we haven’t checked). We will write a tutorial on the wiki Real Soon Now.

Getting there...


Asking for Trouble? You’ve Come to the Right Place!

April 2, 2009 by kamstrup

Toke was asking for trouble yesterday. I would assume that he knew me better by now… With my last commit it is now actually possible to inline Javascript inside your configuration when using a ScriptFilter.

The following now actually works:

<xproperties>
  <entry>
    <key>filter.name</key>
    <value class="string">InlineJavascriptTest</value>
  </entry>
  <entry>
    <key>summa.filter.sequence.filterclass</key>
    <value class="string">dk.statsbiblioteket.summa.common.filter.object.ScriptFilter</value>
  </entry>
  <entry>
    <key>filter.script.inline</key>
    <value class="string"><![CDATA[
                    payload.getRecord().setId('inlineJavascript');
    ]]></value>
  </entry>
</xproperties>

I cannot begin to enumerate all the dangers in doing this, but somehow the thrill of the possibilities got the better of me. If Javascript isn’t your thing you can set the script language to anything supported by your Java runtime by defining the property filter.script.lang.

So – use this at your own peril!

UPDATE: You can find a list of available ScriptEngines for Java at scripting.dev.java.net.
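If you would rather ask your own JVM which script engines it has, the standard javax.script API can tell you. The sketch below is a plain Java helper with made-up class and method names. It is not part of Summa, and I am only assuming that the names it returns are the kind of value filter.script.lang expects.

```java
import java.util.ArrayList;
import java.util.List;
import javax.script.ScriptEngineFactory;
import javax.script.ScriptEngineManager;

// Hypothetical helper, not part of Summa.
final class ScriptLanguages {

    /** Collects every name under which the running JVM registers a script engine. */
    static List<String> availableNames() {
        List<String> names = new ArrayList<>();
        for (ScriptEngineFactory factory : new ScriptEngineManager().getEngineFactories()) {
            // One engine usually answers to several names, e.g. "js" and "javascript".
            names.addAll(factory.getNames());
        }
        return names;
    }
}
```

The list depends entirely on the runtime and on the JARs on the classpath, and it can be empty on a JVM that ships without any script engine.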

Javascript Filters in Summa

April 1, 2009 by kamstrup

I just completed the draft implementation of Javascript filters for Summa and I am posting here to hear some comments. If nobody complains the existing implementation will likely stay unchanged. Really the implementation supports any old scripting language supported by the ScriptEngineManager of the JVM, but in practice Javascript will probably be the most important one.

The scripting environment will include two “magic” variables: payload and allowPayload. Unsurprisingly the payload variable contains a reference to the Payload object being processed. The allowPayload variable is a boolean value that defaults to true. If allowPayload is set to false the payload will be dropped from the processing pipeline.

The best way to explain this is probably with an example. To write a Javascript filter for Summa create a file called myFilter.js with the following content:

var record = payload.getRecord();

if (!record.getId().startsWith(record.getBase())) {
    record.setId(record.getBase() + "_" + record.getId());
}

if (record.getId().endsWith('taboo')) {
    allowPayload = false;
}

This script will make sure that all records have their ids prefixed with their base name, and will filter out any records whose id ends with “taboo”.

To plug the script into your filter pipeline you need to stick something like the following in your filter chain configuration:

<xproperties>
    <entry>
        <key>filter.name</key>
        <value class="string">FixRecordIdsandDropTaboos</value>
    </entry>
    <entry>
        <key>summa.filter.sequence.filterclass</key>
        <value class="string">dk.statsbiblioteket.summa.common.filter.object.ScriptFilter</value>
    </entry>
    <entry>
        <key>filter.script.url</key>
        <value class="string">http://example.com/filters/myFilter.js</value>
    </entry>
</xproperties>

I am using the ScriptEngine framework which appeared in Java 6 for all of this, and all in all the development experience has been quite nice. Writing this blog post took almost as long as it took me to write that filter :-)
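For the curious, this is roughly how two “magic” variables can be handed to a script with the ScriptEngine framework. It is a minimal sketch and not the actual ScriptFilter code: the class and method names are made up, and the payload is just an arbitrary Object here rather than a real Summa Payload.

```java
import java.util.Optional;
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

// Hypothetical sketch; the real ScriptFilter may be wired differently.
final class MagicVariableSketch {

    /**
     * Runs the script with "payload" and "allowPayload" bound and returns
     * the final value of allowPayload, or empty if the JVM has no
     * Javascript engine.
     */
    static Optional<Boolean> runFilter(String script, Object payload) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        if (engine == null) {
            return Optional.empty();
        }
        engine.put("payload", payload);
        engine.put("allowPayload", Boolean.TRUE); // defaults to true
        try {
            engine.eval(script);
        } catch (ScriptException e) {
            throw new IllegalStateException("Filter script failed", e);
        }
        // Anything other than an explicit false keeps the payload.
        return Optional.of(!Boolean.FALSE.equals(engine.get("allowPayload")));
    }
}
```

Note that JDK 15 and later no longer bundle a Javascript engine, so on those runtimes the method returns empty unless an engine has been added to the classpath.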

UPDATE: You can find a list of all available scripting engines at scripting.dev.java.net.

Summa 1.3.0 Live and Unleashed

March 28, 2009 by kamstrup

As Toke hinted earlier we released Summa 1.3.0 yesterday. It is no secret that the 1.3.0 release cycle was longer than we planned for, but in the end I think the gains have been worth it. It took a lot of hard work (and lots of patience from our families), but we got there.

A lot of effort in the 1.3.0 cycle revolved around three main things:

  • Performance. We really wanted to be able to provide quick turn around times from top to bottom of the Summa stack.  There’s a lot of activity around Summa these days and we never know where Summa might end up being deployed. Time has taught us that being able to quickly rebuild the indexes and record caches is a huge boon when venturing into unknown territory.
  • Details. Make the small details work and do as expected. This means sanitizing the log output to be as useful and concise as possible, but also making sure that it is easy to track any records that are dropped from the indexing chain for whatever reason. Also stuff like being able to update the modification time of one single record and having that trigger a corresponding update of the index.
  • Bug fixing. Hunt down those critters and write tonnes of unit tests to keep that wooden stake through their hearts so that they don’t rise from the dead.

The reasons for these points of focus are many, but two things are worth mentioning. The first thing is that we are aiming to use the Summa 1.3 series for production here at the State and University Library of Denmark, which means that we can finally replace the aging “pilot Summa” we are running in production these days (Summa started out as a closed source foray into the realm of relevancy ranked integrated search, and an old version of this project is the basis for the current search engine behind statsbiblioteket.dk). Secondly we felt the need to have a solid base to work on to try out all the cool stuff we have been talking about since we started our journey into the land of searching. This is also why Summa 1.3.0 doesn’t really have any new big features, mostly polish and optimizations of the existing code base.

All in all I am pretty proud of this 1.3.0 release and I am really looking forward to deploying it in a real world scenario. Also, with this milestone set I am confident that we are going to have a blast hacking on the mile-long list of cool ideas we have floating around. I should post about some of these ideas soonish, but not today; today I am going to slack off and play with the kids.

Speed thrills

March 27, 2009 by eskildsen

As part of Summa 1.3 planning, we set up some performance goals: Indexing speed should be 500+ records/sec, storage extraction speed should be twice that and so on. Paired with the magnificent VisualVM and a penchant for bit-fiddling, we’ve often found it hard to go home from work: The hunt for the next speed-increase has just been too exciting. Yes, we’re lost causes.

Experimental patching of the H2 database, introduction of an optimized replace-framework and severe reductions in object allocation have been part of the last few weeks. Boringly, most of our time was spent on new features and bug hunting, which our fearless leaders strangely enough think take precedence, but oh how those glorious moments of “We got a factor 10 speed increase in Analyzer Foo” make up for all that.