We use a backing storage for documents in our home-brewed search system Summa. It was supposed to be a trivial key-value store with document IDs resolving to documents. Then an evil librarian pointed out that books are related to each other, so we had to add some sort of relational mapping. For some years we have used relational databases for this, going from PostgreSQL through Derby and landing on H2, which has served us fairly well. Documents with relations were a bit slow to resolve, but they made up fewer than 5% of the full corpus, so for all practical purposes they presented no problem.
Fast forward to a week ago, when we added a new target to our integrated search: one million fresh documents, or a 10% increase in total document count. Unfortunately most of these documents were related to each other, increasing our total relation count by 2000%. Our full indexing time soared from “It’ll be done before noon” to “I hope it finishes before tomorrow”. Ouch!
A long discussion about database design and indexes followed. The conclusion was that we really did not use H2 for what it was good at and that maybe we should look at a graph-oriented database.
Implementing storage with Neo4j
Neo4j is an open source graph database. It is written in Java and can be used as an embedded application. This is important for us as we like our packages to be self-contained. Friday is work-on-whatever-project-you-think-can-benefit-Statsbiblioteket-day, so I dedicated it to trying out Neo. Keep in mind that I first heard about Neo this morning and that I hadn’t looked at a single page about the product before I started the project.
- 10:36 Created test project
- 10:40 Selected Neo4j version 1.3 stable Enterprise, added to Maven POM, downloaded files
- 10:50 Created skeleton class for Summa NeoStorage
- 10:55 Added basic properties, created skeleton Unit test
- 11:15 Added code for flushing Summa Record to Neo
- 11:36 Added code for retrieving a previously flushed record + unit test. Hello world completed
- 11:40 Break finished, started on ModificationTime retrieval
- 12:35 Proof of concept retrieval using modification time (slow development due to human error)
- 13:20 Finished bulk ingest and record-by-record extraction using modification time order, including unit test
- 13:50 Finished mapping of Summa Record child-relations to Neo, both for ingest and extraction
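The steps in the timeline (flush a record, fetch it again, walk records in modification-time order, expand child relations) can be sketched as a tiny in-memory graph. This is only an illustration of the data model, not the actual NeoStorage code and not the Neo4j API: the class `RecordGraphSketch` and all of its method and field names are made up for this example, and it assumes record IDs are unique and records are never updated.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for a graph-backed record storage.
// None of these names come from Summa or Neo4j.
class RecordGraphSketch {

    /** One node in the graph: a record plus direct references to its children. */
    static final class RecordNode {
        final String id;
        final String content;
        final long modificationTime;
        final List<RecordNode> children = new ArrayList<>();

        RecordNode(String id, String content, long modificationTime) {
            this.id = id;
            this.content = content;
            this.modificationTime = modificationTime;
        }
    }

    private final Map<String, RecordNode> byId = new HashMap<>();
    // Sorted by modification time so extraction can walk records in mtime order.
    private final TreeMap<Long, List<RecordNode>> byMtime = new TreeMap<>();

    /** Stores a new record. Assumption: IDs are unique, updates are not handled. */
    RecordNode flush(String id, String content, long modificationTime) {
        if (byId.containsKey(id)) {
            throw new IllegalArgumentException("Duplicate record ID: " + id);
        }
        RecordNode node = new RecordNode(id, content, modificationTime);
        byId.put(id, node);
        byMtime.computeIfAbsent(modificationTime, k -> new ArrayList<>()).add(node);
        return node;
    }

    /** Adds a parent-to-child edge; both records must already be stored. */
    void relate(String parentId, String childId) {
        RecordNode parent = byId.get(parentId);
        RecordNode child = byId.get(childId);
        if (parent == null || child == null) {
            throw new IllegalArgumentException("Unknown record in relation");
        }
        parent.children.add(child);
    }

    /** Returns the stored record, or null if the ID is unknown. */
    RecordNode get(String id) {
        return byId.get(id);
    }

    /** Returns the record followed by all its descendants, found by following edges. */
    List<RecordNode> expand(String id) {
        List<RecordNode> out = new ArrayList<>();
        RecordNode root = byId.get(id);
        if (root != null) {
            collect(root, out, new HashSet<>());
        }
        return out;
    }

    private static void collect(RecordNode node, List<RecordNode> out, Set<String> seen) {
        if (!seen.add(node.id)) {
            return; // already visited: guards against cyclic relations
        }
        out.add(node);
        for (RecordNode child : node.children) {
            collect(child, out, seen);
        }
    }

    /** IDs of records modified strictly after the given time, oldest first. */
    List<String> idsModifiedAfter(long since) {
        List<String> ids = new ArrayList<>();
        for (List<RecordNode> bucket : byMtime.tailMap(since, false).values()) {
            for (RecordNode node : bucket) {
                ids.add(node.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        RecordGraphSketch storage = new RecordGraphSketch();
        storage.flush("book1", "Parent record", 100L);
        storage.flush("vol1", "Volume 1", 200L);
        storage.flush("vol2", "Volume 2", 300L);
        storage.relate("book1", "vol1");
        storage.relate("book1", "vol2");
        System.out.println(storage.expand("book1").size());  // 3
        System.out.println(storage.idsModifiedAfter(100L));  // [vol1, vol2]
    }
}
```

The point of the sketch is that expanding a record only follows references from node to node, so the cost grows with the number of relations of that one record rather than with the size of a relations table.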
Including mistakes, distractions from colleagues and reading documentation, it took under 4 hours to integrate Neo as the new backend for our storage. There are still a lot of minor things to add and special cases to cover, but the result is complete enough to be used for most workflows in Summa.
Implementation was a breeze: the API was very clean, and the examples and guides on the website were to the point and well thought out. Kudos to the developers for great work!
Since the Neo Storage isn’t finished, it would be unfair to compare it to H2 with regard to ingest. However, the extraction part is complete enough to test.
With the old H2-backed storage, our extraction speed fell below 5 records/sec for the new documents with 2-4 relations each (extraction speed for records without relations is 2000-3000 records/sec).
Creating a test storage using Neo with 100,000 documents, each with 5 children, raised the extraction speed to 2,500 expanded records/second, or 15,000 raw records/second if we count the children (each expanded record is 1 parent plus 5 children, so 2,500 × 6 = 15,000).
Granted, the only fair test is with production data, but so far Neo4j looks like a clear winner for our purposes!