Archive for the ‘Web’ Category

Searching in the dark

September 25, 2009

As part of our obligation to preserve our online cultural heritage, Statsbiblioteket and Det Kongelige Bibliotek in Denmark continuously harvest the Danish web (the *.dk domains), digitize public Danish television, rip all Danish-produced music and generally just collect whatever we can get our hands on. The terabytes add up (120TB for the web pages so far, more for television, radio and so on) and the machines are happily harvesting, ripping and wolfing down the bytes into semi-safe storage (two geographically and architecturally different setups, checksummed, re-checksummed etc.). All fine and dandy.

Except that access to most of the material is rather limited and that search is … well, pretty much non-existent.

Such things tend to change over time, preceded by meetings, committees, deals and whatnot. As technicians, we are normally not directly involved in all the politics surrounding this, but in order to get some concrete arguments, we were asked to try to index some of the harvested web material and do a search demo, where web material was presented together with our normal material (books, CDs, articles etc.).

The harvested web material is stored in ARC files, so the obvious choice for a quick test was NutchWAX. Setup was easy, some 100 million documents were indexed (about 2% of the harvested web material) and searches were sub-second on a modest machine. A great success in terms of answering the “is it even feasible to do this?” question.

The “but does it make sense to do integrated search for such different data sources as web and library books?” question could not be answered by this, so naturally we had to hack something together with Summa, our precious hammer. Due to other high-priority assignments, we only had about a week to get it to work, so corners were cut where possible. Using the ARC reader from Heritrix and the Tika toolkit for analyzing the wealth of different data, the aptly named Arctika was born. Arctika handled the web stuff and an aggregator handled the integration with our standard library index.

It could use a lot more work, but it worked surprisingly well for a quick hack. We were able to demonstrate everything we wanted: the integrated search made sense, the ranking generally pulled the good stuff to the top (admittedly, tweaking the ranking for different sources would surely be needed for a real application) and the faceting system clearly helped give an overview of material types and sources. It also provided an easy means of temporal navigation in the search results: limiting searches to a specific period of time is quite useful for investigating the media handling of major events.
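To make the faceting and temporal navigation a bit more concrete, here is a minimal Python sketch. It is not how Summa or Arctika implement facets; the hit list, the source names and the functions `facet_counts` and `limit_to_period` are all made up for illustration. It only shows the idea: count hits per source and per year, then narrow the result set to a period.

```python
from collections import Counter
from datetime import date

# Hypothetical search hits: (title, source, date). All values are invented.
HITS = [
    ("Election coverage", "web", date(2007, 11, 13)),
    ("Election night special", "television", date(2007, 11, 13)),
    ("Danish politics: an introduction", "book", date(2005, 3, 1)),
    ("Party programme", "web", date(2009, 6, 7)),
]


def facet_counts(hits):
    """Count hits per source and per year, as a facet sidebar might show them."""
    by_source = Counter(source for _, source, _ in hits)
    by_year = Counter(when.year for _, _, when in hits)
    return by_source, by_year


def limit_to_period(hits, start, end):
    """Keep only the hits dated within [start, end], both ends inclusive."""
    return [hit for hit in hits if start <= hit[2] <= end]


if __name__ == "__main__":
    by_source, by_year = facet_counts(HITS)
    print(dict(by_source))
    print(dict(by_year))
    for title, source, when in limit_to_period(HITS, date(2007, 1, 1), date(2007, 12, 31)):
        print(when, source, title)
```

In a real index the counts would come from the search engine rather than from a list in memory, but the interaction is the same: pick a year from the facet and the search is re-run with that period as a filter.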

So what’s the dark part? Well, legislation. As always. That and money. Harvested web material is sensitive and only legally accessible to a select few professors. On top of that, showing snippets from harvested web pages seems – at the moment – to require compensating the content owners, according to EU law. Opening up all the material at once will probably not happen in the foreseeable future.

Happily, we don’t need to do everything at once. If we limit the publicly accessible index to websites from the government and companies, it should be legal to show the search results and the stored versions (hello continuity). Add the recorded television and radio to the mix, pour in scanned newspapers, integrate with old-school books and presto, we have something great. Danish culture at our fingertips, past and present.

Dreaming, I know. But on the technical level, we just need the green light from the bigwigs to make this happen.

A screenshot, you say? Why, yes, of course. We present this super-cool bling bling interface with a stupendously large amount of interesting information to you. Slightly marred by the need to censor some sensitive information and the fact that indexing time was capped at half a day to make the deadline.

Sample search in Arctika


Usability testing Summa Search

May 5, 2009

smile1

We recently did a usability test of the library’s Summa-based search engine – known as Search. To ensure objectivity, the test was headed by Julia from UNI-C and was done as a think-out-loud test with eight users from the nearby university. You can download the report in Danish (English version coming up) or read a brief recap of the conclusions:

Bad stuff first:

  • Use of sort and facets. Many test users didn’t use or had trouble using facets and sort features for narrowing down search results.
  • The request list. Conceptually, the request list is hard to grasp for some users. In order to request a number of items, the user must put the items on the list one by one. Once done, he has to press the request button to actually perform the request. It seems that some users miss the last crucial step and actually believe that requests have been sent after the first step – but they have not.
  • Search and request procedure for articles. Finding and requesting articles is perhaps the Achilles heel of the current system. Very often users think they can search for individual articles. In most cases they cannot, but actually have to go through a printed or an online journal to either find data about the article or to get an online version for download. Unfortunately, it has been difficult for us to communicate this counter-intuitive circumstance to users through the interface. Clearly, we’ll need to have another go at the problem, but the best solution – i.e. having all articles searchable and preferably in full-text versions – is not likely to happen in the near future.

The good things:

  • Todo list. The list has been very well received as many users like to keep information about material between sessions. Furthermore, the concept is intuitive and known from other websites and applications – and even from the real world
  • Google style straightforward search. Overall the system is fast and easy to use in terms of searching
  • Did you mean. The Google inspired feature is great for catching tpyos
  • Suggest. The feature suggests words that other users have already searched for. The most popular searches are shown first.
    This is a feature that many users seem to just use out-of-the-box. It can be used as inspiration as well as a quick-spell-thing
  • Added value, such as book covers, table-of-contents, sample chapters, author biographies, etc. Many materials in our database have such extra content information added. Users like this and find it very helpful in assisting them in making judgements about a material’s relevance.
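The Suggest behaviour in the list above (offer what other users have searched for, most popular first) can be sketched in a few lines of Python. This is not the Summa implementation; the query log, the function name `build_suggester` and the prefix-matching rule are assumptions made purely for illustration.

```python
from collections import Counter


def build_suggester(query_log):
    """Return a suggest(prefix) function backed by counts of earlier queries.

    Assumption: queries are compared case-insensitively and matched by prefix.
    """
    counts = Counter(q.strip().lower() for q in query_log if q.strip())

    def suggest(prefix, limit=5):
        prefix = prefix.strip().lower()
        matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
        # Most popular first; ties are broken alphabetically for stable output.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [q for q, _ in matches[:limit]]

    return suggest


# Hypothetical query log, invented for this example.
LOG = ["hamlet", "Hamlet", "handball", "hamlet", "handball", "harry potter"]
suggest = build_suggester(LOG)

if __name__ == "__main__":
    print(suggest("ha"))
    print(suggest("han"))
```

Because the suggestions come from queries that other users actually typed, the same list doubles as a rough spell check: a correctly spelled popular query tends to outrank its misspellings.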

Overall, we are very satisfied with the test because it confirmed some suspicions we’ve been having for some time now, and especially because it highlighted the problems related to the request list. We’ll be working with the problem areas over the next few months.

XSLTs for inspiration

March 11, 2009

We have decided to publish the XSLTs used for the Search website at the State and University Library, Århus, Denmark.

The files are provided “as-is” under the MIT License. They are meant to be used as a source of inspiration, as they do contain a few minor dependencies on internal code – for instance used to translate labels to human readable strings in different languages.

In other words, this is a snapshot of the XSLTs that are all free to use, but we offer no guarantee that they are actually useful. Likewise, we cannot be held accountable for errors or faults contained within these files.

Go to the bottom of the Sandbox page at the Wiki for the download.

Update 2009-03-17: Added some more XSLTs to the Sandbox page. These are XSLTs developed in a project for public libraries using Summa.

Learning how to communicate and cooperate in teams

January 22, 2009

Yesterday we all went to get a brush-up on how to communicate and build better teams. In fact, it was not only us but the whole IT department, 42 bodies in total. The man to guide us was Ejnar, a former school teacher turned consultant. He also volunteers as a handball coach and was able to shout louder than anyone I’ve ever met. He was actually a very nice and down-to-earth fella.

The day involved identifying what good communication and teamwork is all about and eventually trying to work out areas where we could improve – both as a whole department and as individual groups. Ejnar employed an Open Space approach. It allows one to extract common themes from seemingly disparate, chaotic data. It worked nicely.

The highlights of the day were Ejnar spontaneously bear-hugging Mads – who is notoriously shy of physical human contact – and the discovery that the chicken-disguised-in-tomato-sauce served for lunch was partly raw.

Afterwards, we all went to eat and drink at Bryggeriet. We had a nice roast and several good beers. The Christmas Brew (yes, still available) and the Dark Ale were especially great – the latter with nice notes of liquorice.

The day in photos (thanks to Toke)
Unfortunately, we have no photos of Mads being hugged…

bi_20090121_1152
A little team building before lunch

bi_20090121_1312
Lunch…

bi_20090121_1440

bi_20090121_1456
Doing the Open Space job.

bi_20090121_1449_2
The consultant’s suitcase(s).

bi_20090121_606
Me talking. Please note the subtle Acer logo from the projector.

bi_20090121_1954
Free beer and dinner. Nice.

Summa Ubiquity Command

January 1, 2009

Some time ago I played around with a new Firefox extension called Ubiquity. It is a graphical keyboard user interface allowing the user to easily execute various commands without cluttering up the interface with buttons.

I have created a summa command on top of our OpenSearch interface. In many ways it is similar to the iGoogle gadget I mentioned in a previous blog post. One of the key differences is of course that a Ubiquity command is present as a feature of the Firefox browser whereas an iGoogle gadget is only available on your iGoogle page.

Some screenshots of the command in action.

summa-ubiquity-1

summa-ubiquity-2

To try it out, you first need to install the Ubiquity Firefox extension. After restarting Firefox, go to the Summa Command’s github page. There you will be presented with a message allowing you to subscribe to the command. As Ubiquity considers all sources to be untrusted, you will have to acknowledge a security warning before actually being able to install the command (read here for more details).

Using your new command should now be as simple as pressing the Ubiquity hotkey and entering summa followed by your query. Read the Ubiquity Tutorial for more information.

Further details are available at the Sandbox page in our Wiki.

Alternative Super Suit

November 28, 2008

Not all of us can be Supermen. Michael for instance chose something else entirely…

mn_orange

Supermen

November 28, 2008

supermen_500

There’s nothing like the right t-shirt to boost self-confidence. It’s Mikkel in red and Toke in blue.

Actually, we are considering making Superman t-shirts mandatory throughout the office.

Google makes bad usability to get user data?

November 25, 2008

google

Google’s search result page is now a wiki
Google has introduced a new service that they call the Search Wiki.
The service allows a user (logged into his Google account) to:

1. rearrange the order in which Google search result items appear,
2. remove search result items,
3. comment on items and read other users’ comments, and
4. add his own items, in the shape of URLs, to a given search result.

Here’s an example of a search for Barack Obama. The items can be moved up or removed. The top item has a green arrow because I moved it to the top spot. It can be moved down the list again, but for some reason it cannot be removed entirely (a bug?). Clicking the bubble opens a comment slot.

google_move_remove
The changes made by the user will remain and appear on future searches only if the user is logged in. Also, the changes will only be visible to the user himself and will not influence other users’ search results.

Bad usability in exchange for user data?
The remove item feature seems to be usable, but it’s a bit unclear what the point of being able to rearrange items is. Google does not mention whether they’ll be using the data about users’ rearranging activities to improve search results in general, only that they now offer the rearrange feature.

From my point of view, this is useless customisation. There is no real value in being able to move search result items around.

I do, however, expect Google to be smarter than this. The way people rearrange or remove items says a lot about their preferences, and that is real value – I’m pretty sure that the Google folks are collecting such data in order to improve their general search feature.

Actually, they may also be looking to use it for their personalised search feature. Personalised search is an interesting Google feature living outside the spotlight. Occasionally – and only if you are using the English Google, I believe – you’ll notice a small message in the top right side of the screen, saying: “Personalized based on your web history”.

pers

This is a potentially very smart feature: over time, a search engine can collect a pretty good profile of users’ preferences through search activity analysis and use it to filter away noise and ambiguity from search results – all without direct user interaction.

But of course, it’s even better if you can trick users into sending more feedback by employing a Search Wiki service.

Comments as link meta data
The comments feature seems to be a bit more usable. One of the problems with links is that you don’t know what’s behind them. Annotated links are conceptually good because you can use them to make an informed decision about whether to actually click or not.

But then again – what happens when you have 634 comments on a link? Or – as is the case with the top item for a Google search for Barack Obama – that you get comments like:

“BAMA MY MANNA”

“OBAMA!!”

“this function rocks!”

“Yes we did!”

“Very nice website”

“change!”

“Mr. President”

One of the comments, though, was quite usable: “modern website for an innovative leader. you’ll find all his social networks including flickr and youtube there (check out the backstage pics and videos)”.

Have a look at the comments yourself

Back to the Labs
Overall, this is an interesting Google experiment. The overall purpose is quite unclear, especially because people in general are not inclined to customise – even when the benefits are more evident than in this case – and I am sure Google knows this. There is no doubt that Search Wiki will get a lot of interest because it’s a Google service, but in the long run I’m quite sure that it’ll return to the Labs for dismantling. That said, I am sure Google will get heaps of good behavioural data from the experiment to put back into their search engine.


Poltorak’s law of blog activity (PLBA)

November 10, 2008

skaters_small

It’s relatively quiet here at the office right now. Summa 1.0 and even Summa 1.1 are out the door and the new version of Statsbiblioteket’s search engine has been released too.

At the same time, I’ve noticed, the activity on this blog has gone down quite considerably. Actually, I would expect the opposite to happen: more time, more time to think big thoughts and blog about them. But not so, apparently.

Observing this, I propose that: Blog activity is proportionally related to general work activity.

I call this relationship Poltorak’s law of blog activity (PLBA).

The general idea is that it’s easier to blog when busy because you are forced to think less about what you blog about and in what style. When there’s plenty of time available, however, one’s own expectations about the outcome and quality of each blog post rise, and the pressure to deliver goes up as well. As a result, fewer blog posts are produced when time is an abundant resource.

I hope to be able to elaborate on this relationship in an upcoming article, but right now I have the time for it. Keep you posted.

The Summa Sandbox

November 6, 2008

For a while we have been talking about making a Summa Sandbox where we could show how Summa is being used: both the real-life production stuff and the more experimental stuff we sometimes make.

Something we implemented a long time ago is OpenSearch. Using OpenSearch it is possible to add the search from the State and University Library to the Firefox search bar.

summa-opensearch-firefox

Because we have an OpenSearch interface, it is also possible to add the result of a search to an RSS reader and that way get automatically notified of any new books in the result set.

summa-rss-feed-search

We have also successfully used our OpenSearch interface to quickly try some things out – the RSS format is extremely easy to parse and use in pretty much any programming language.
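As a rough illustration of how little code it takes, here is a minimal Python sketch that pulls titles and links out of an RSS 2.0 document using only the standard library. The feed below is a made-up sample, not actual output from our OpenSearch interface, and `parse_hits` is just a name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 sample; a real feed would be fetched from the search interface.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Search results (made-up sample)</title>
    <item>
      <title>First example record</title>
      <link>http://example.org/record/1</link>
    </item>
    <item>
      <title>Second example record</title>
      <link>http://example.org/record/2</link>
    </item>
  </channel>
</rss>"""


def parse_hits(rss_text):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        hits.append((title, link))
    return hits


if __name__ == "__main__":
    for title, link in parse_hits(SAMPLE_FEED):
        print(title, "->", link)
```

Anything that can fetch a URL and read XML can do the same, which is what makes the feed handy for quick experiments.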

One of the things we have made is a gadget for iGoogle.

summa-gadget

It allows you to search the materials at the State and University Library and shows the first 5 hits in a compact format. Clicking on a hit will take you to the full record at our regular search website.

To try it out, go to your iGoogle page and click “Add stuff…”, then click on “Add feed or gadget” in the menu on the left. In the input field, enter the following URL:

http://www.statsbiblioteket.dk/search/summagadget.xml

Keep reading this blog to stay updated on other things we might try out.