2D visualization of high dimensional word embeddings

In this blog post I try to make a method for a computer to read a text, analyse the characters and then make a 2D visualization of the similarity of the characters. To achieve this I use the word2vec algorithm, build a distance matrix of all mutual distances and fit the distances into a 2D plot. The three texts I used were

  • All 3 Lord of the Rings books
  • Pride and Prejudice + Emma by Jane Austen
  • A combined text of 35,000 free English Gutenberg e-books

Word2vec is an algorithm invented by Google researchers in 2013. The input is a text that has been preprocessed, as I will explain later. The algorithm then extracts all words and maps each word to a multidimensional vector, typically with 200 dimensions. Think of the quills of a hedgehog where each quill is a word, except in more than 3 dimensions. What is remarkable about the algorithm is that it captures some of the context of the words, and this is reflected in the multidimensional vectors. Words that are somewhat similar are very close in this vector space, where ‘close’ is measured by the angle between two vectors. Furthermore, the relative positions of two words also capture a relation between the words. A well known example is that the distance vector from ‘man’ to ‘king’ is almost identical to the distance vector from ‘woman’ to ‘queen’. Using this information you are able to predict the word ‘queen’ given the three words <man, king> <woman, ?>. It is far from obvious why the algorithm ends up reflecting this behaviour in the vector space, and I have not fully understood the algorithm yet. Before you can use word2vec you have to remove all punctuation, split the text into one sentence per line and lowercase everything. The splitting into sentences is not just splitting whenever you meet a ‘.’ character: for instance ‘Mr. Anderson’ should not trigger a split.
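A minimal sketch of this preprocessing in Java could look like the following. The abbreviation list and the naive whitespace tokenization are just examples, not the exact rules I used:

import java.util.ArrayList;
import java.util.List;

public class Preprocess {
  // Abbreviations that should not end a sentence (example set, far from exhaustive)
  private static final String[] ABBREVIATIONS = {"mr.", "mrs.", "dr.", "st."};

  // Split the raw text into one cleaned, lowercased sentence per entry
  public static List<String> toSentences(String text) {
    List<String> sentences = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (String token : text.split("\\s+")) {
      current.append(token).append(' ');
      if (token.endsWith(".") && !isAbbreviation(token)) {
        sentences.add(clean(current.toString()));
        current.setLength(0);
      }
    }
    if (current.length() > 0) {
      sentences.add(clean(current.toString()));
    }
    return sentences;
  }

  private static boolean isAbbreviation(String token) {
    for (String abbreviation : ABBREVIATIONS) {
      if (abbreviation.equals(token.toLowerCase())) {
        return true;
      }
    }
    return false;
  }

  // Lowercase and remove everything that is not a letter, digit or space
  private static String clean(String sentence) {
    return sentence.toLowerCase().replaceAll("[^a-z0-9 ]", " ").replaceAll(" +", " ").trim();
  }
}

Each entry of the output is then a sentence that word2vec can consume directly.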

First I create the multidimensional representation of the words using word2vec, which is essentially a dictionary mapping each word to its vector. In the next step I manually input the characters (or words, in fact) that I want to create the visualization for and calculate the distance matrix of all mutual distances by taking the cosine of the angle between the vectors. This gives a value between -1 and +1, which I then shift to the range 0 to 2 so I have a positive distance between the words. Finally I take this distance matrix and turn it into a 2D visualization, trying to keep the distances in the 2D plot ‘as close as possible’ to those in the vector space. Of course this is not possible in general; even for 3 vectors it can be impossible (if the triangle inequality is broken). I create the plot by dividing the 2D plane into a grid and placing the first character in the middle. The next character is also easy to place, on the circle with the radius of the distance. The following characters I place one at a time in the grid cell that minimizes the sum of the distance errors to the already placed characters. This is a greedy algorithm that prioritizes the characters added first, which is why I plot the main characters first and have the other characters placed relative to these.
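To make the distance and placement steps concrete, here is a minimal sketch. The grid size, the scaling of distances to grid cells and the helper names are illustrative, not the exact code I used, and it assumes the common choice of 1 - cosine as the distance:

public class GreedyPlot {
  // Distance between two word vectors: 1 - cos(angle), which lies in [0, 2]
  static double distance(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // Greedy placement: each word goes into the grid cell that minimizes the summed
  // error between the grid distances and the matrix distances to the already placed words
  static int[][] place(double[][] distances, int gridSize, double cellSize) {
    int words = distances.length;
    int[][] positions = new int[words][2];
    positions[0] = new int[]{gridSize / 2, gridSize / 2}; // first word in the middle
    for (int word = 1; word < words; word++) {
      double bestError = Double.MAX_VALUE;
      for (int x = 0; x < gridSize; x++) {
        for (int y = 0; y < gridSize; y++) {
          double error = 0;
          for (int placed = 0; placed < word; placed++) {
            double dx = (x - positions[placed][0]) * cellSize;
            double dy = (y - positions[placed][1]) * cellSize;
            error += Math.abs(Math.sqrt(dx * dx + dy * dy) - distances[word][placed]);
          }
          if (error < bestError) {
            bestError = error;
            positions[word] = new int[]{x, y};
          }
        }
      }
    }
    return positions;
  }
}

Note how the quality of the placement depends on the order of the words: the earlier a word is placed, the more the later words have to adapt to it.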

I tried to use the Stanford entity extraction tool to extract both locations and persons from a given text, but there were way too many false positives, so I had to manually feed the algorithm the characters. To do it properly I should have replaced a character mentioned by multiple names with a single name: Gandalf, Gandalf the Grey and Mithrandir are the same character, and so on, but I did not perform this substitution. So when I select the character Gandalf I only get the contexts where he is mentioned as Gandalf and not as Mithrandir.

And now let's see some of the 2D visualizations!

Lord of the Rings

0) Frodo
1) Sam
2) Gandalf
3) Gollum
4) Elrond
5) Saruman
6) Legolas
7) Gimli
8) Bilbo
9) Galadriel

______________________________________________________________________
___________________________________________3__________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________________1______________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
5_____________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________________0__________________8___
_______________________2______________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________6______________________________
______________________________________________________________________
______________________________________________________________________
__________________________________7___________________________________
______________________________________________________________________
______________________________________________________________________
_______________________4______________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_____________________________9________________________________________
______________________________________________________________________

 

Jane Austen: Pride and Prejudice + Emma

0) Elizabeth (Elizabeth Bennet)
1) Wickham (George Wickham)
2) Darcy (Mr. Darcy)
3) Bourgh (Lady Catherine de Bourgh)
4) Lydia (Lydia Bennet)
5) William (Mr. William Collins)
6) Kitty (Catherine “Kitty” Bennet)
7) Emma (Emma Woodhouse)
8) Knightley (George Knightley)

__________________________________________________________________________
_________________________________________8________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
___________________________1__________2___________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________7_____
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
_____________________________________________________0____________________
________________3__________4______________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
_______________________________________________________5__________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
___________________6______________________________________________________
__________________________________________________________________________

 

35,000 English Gutenberg books

In this plot, instead of characters, I selected different animals.

0) Fish
1) Cat
2) Dog
3) Cow
4) Bird
5) Crocodile
6) Donkey
7) Mule
8) Horse
9) Snake
__________________________________________________________
_____________________________________________7____________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
____________________________________________6_____________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
________________________________________________________8_
__________________________________________________________
__________________________________________________________
________________________________2_________________________
__________________________________________________________
_______________1__________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
___________________________________________________3______
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__4_______________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________9_______________________________
__________________________________________________________
______________5___________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________0_______________________________
__________________________________________________________

Conclusion

Does the 2D plotting catch some of the essence of the words/characters from the books? Or does it look like they are just thrown at random onto the plane?

I look forward to your conclusion! For the Gutenberg animals plot I believe the visualization really does match how I see the animals: fish and reptiles are grouped together, and in the upper left corner we have the horse family of animals. For the Jane Austen plot it is also interesting that the character Emma matches Elizabeth most closely, even though they are from two different books but are somewhat similar main characters.


CDX musings

This is about web archiving, corpus creation and replay of web sites. No fancy bit fiddling here, sorry.

There is currently some debate on CDX, the format used by the Wayback Engine, Open Wayback and other web archive-oriented tools such as Warcbase. A CDX Server API is being formulated, as is the CDX format itself. Inspired by a post by Kristinn Sigurðsson, here comes another input.

[Diagram: CDX components, by Ulrich Karstoft Have]

CDX Server API

There is an ongoing use case-centric discussion of the features needed for a CDX API. Compared to that, the CDX Server API – BETA seems a bit random. For example: a feature such as regexp matching on URLs can be very heavy on the backend and opens up for easy denial of service (intentional as well as unintentional). It should only be in the API if it is of high priority. One way of handling all wishes is to define a core API with the bare necessities and add extra functionality as API extensions. What is needed and what is wanted?
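As a purely illustrative example (the parameter names are hypothetical, loosely modelled on existing CDX server implementations, not taken from the BETA spec), a core lookup might be no more than

/cdx?url=http://example.com/page&from=20150101&to=20151231&limit=10

with regexp matching, field selection and other heavy features only available as declared extensions.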

The same weighing problem can be seen for the fields required by the server. Kristinn discusses CDX as a format and boils the core fields down to canonicalized URL, timestamp, original URL, digest, record type, WARC filename and WARC offset. Everything else is optional. This approach matches the core vs. extensions functionality division.

With optional features and optional fields, a CDX server should have some mechanism for stating what is possible.

URL canonicalization and digest

The essential lookup field is the canonicalised URL. Unfortunately that is also the least formalised, which is really bad from an interoperability point of view. When the CDX Server API is formalised, a strict schema for URL canonicalisation is needed.

Likewise, the digest format needs to be fixed, to allow for cross-institution lookups. As the digests do not need to be cryptographically secure, the algorithm chosen (hopefully) does not become obsolete with age.

It would be possible to allow for variations of both canonicalisation and digest, but in that case it would be as extensions rather than core.

CDX (external) format

CDX can be seen as a way of representing a corpus, as discussed at the RESAW workshop in December 2015.

  • From a shared external format perspective, tool-specific requirements such as sorting of entries or order of fields are secondary or even irrelevant.
  • Web archives tend to contain non-trivial amounts of entries, so the size of a CDX matters.

With this in mind, the pure minimal amount of fields would be something like original URL, timestamp and digest. The rest is a matter of expanding the data from the WARC files. On a more practical level, having the WARC filename and WARC offset is probably a good idea.

The thing not to require is the canonicalized URL: It is redundant, as it can be generated directly from the original URL, and it unnecessarily freezes the canonicalization method to the CDX format.
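To illustrate the redundancy: a SURT-style canonicalized URL can be derived deterministically from the original URL, roughly along the lines of

http://www.Example.com:80/Page?b=2&a=1 -> com,example)/page?a=1&b=2

where the lowercasing, stripping of www and default port and sorting of query arguments are exactly the kind of rules that need to be formalised.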

Allowing for optional extra fields after the core is again pragmatic. JSON is not the most compact format when dealing with tables of data, but it has the flexibility of offering entry-specific fields. CDXJ deals with this, although it does not specify the possible keys inside the JSON blocks, meaning that the full CDX file has to be iterated to get those keys. There is also a problem of simple key:value JSON entries, which can be processed generically, vs. complex nested JSON structures, which require implementation-specific code.
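For illustration, a CDXJ-style entry could look roughly like this, with the keys being examples rather than a fixed schema:

com,example)/page 20150624120000 {"url": "http://example.com/page", "digest": "sha1:EXAMPLEDIGESTEXAMPLEDIGEST234567", "filename": "EXAMPLE-20150624-00001.warc.gz", "offset": 972}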

CDX (internal) format

Having a canonicalized and SURTed URL as the primary field and having the full CDX file sorted is an optimization towards specific tools. Kris touches lightly on the problem with this by suggesting that the record type might be better positioned as the secondary field (as opposed to timestamp) in the CDX format.

It follows easily that the optimal order or even representation of fields depends on the tools as well as the use case. But how the tools handle CDX data internally really does not matter as long as they expose the CDX Server API correctly and allow for export in an external CDX format. The external format should not be dictated by internal use!

CDX (internal vs. external) format

CDX is not a real format yet, but tools do exist and they expect some loosely defined common representation in order to work together. As such it is worth considering whether some of the traits of current CDXes should spill over to an external format. Primarily that would be the loosely defined canonicalized URL as the first field and the sorted nature of the files. In practice that would mean a near doubling of file size due to the redundant URL representation.

CDX implementation

The sorted nature of current CDX files has two important implications: Calculating the intersections between two files is easy and lookup on primary key (canonicalised URL) can algorithmically be done in O(log n) time using binary search.
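A minimal sketch of such a primary key lookup on a sorted CDX file could look like this, assuming one entry per line with the canonicalised URL as the first space-separated field:

import java.io.IOException;
import java.io.RandomAccessFile;

public class CdxLookup {
  // Binary search in a CDX file sorted on its first (space-separated) field.
  // Returns a line whose key equals the wanted key, or null if none exists.
  public static String lookup(RandomAccessFile cdx, String key) throws IOException {
    long lo = 0;
    long hi = cdx.length();
    String best = null;
    while (lo < hi) {
      long mid = (lo + hi) / 2;
      cdx.seek(mid);
      if (mid > 0) {
        cdx.readLine(); // we probably landed mid-line; skip to the next full line
      }
      String line = cdx.readLine();
      if (line != null && extractKey(line).compareTo(key) < 0) {
        lo = mid + 1;  // the wanted line starts after this point
      } else {
        best = line;   // candidate line (or EOF); the wanted line is not after this point
        hi = mid;
      }
    }
    return best != null && extractKey(best).equals(key) ? best : null;
  }

  private static String extractKey(String line) {
    int space = line.indexOf(' ');
    return space == -1 ? line : line.substring(0, space);
  }
}

The point is not the specific code, but that the whole approach relies on the file being sorted on a well-defined key.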

In reality, simple binary search works poorly on large datasets, due to the lack of locality. This is worsened by slow storage types such as spinning drives and/or networked storage. There are a lot of tricks to remedy this, from building indexes to re-ordering the entries. The shared theme is that the current non-formalised CDX files are not directly usable: They require post-processing and extensions by the implementation.

The take away is that the value of a binary search optimized external CDX format diminishes as scale goes up and specialized implementations are created. Wholly different lookup technologies, such as general purpose databases or search engines, have zero need for the format-oriented optimizations.

 

 


Faster grouping, take 1

A failed attempt of speeding up grouping in Solr, with an idea for next attempt.

Grouping at a Statsbiblioteket project

We have 100M+ articles from 10M+ pages belonging to 700K editions of 170 newspapers in a single Solr shard. It can be accessed at Mediestream. If you speak Danish, try searching for “strudsehest”. Searches are at the article level, with the results sorted by score and grouped by edition, with a maximum of 10 articles / edition. Something like this:

q=strudsehest&group=true&group.field=edition&group.limit=10

This works well for most searches. But for the heavier ones, response times creep into seconds, sometimes exceeding the 10 second timeout we use. Not good. So what happens in a grouped search that is sorted by document score?

  1. The hits are calculated
  2. A priority queue is used to find the top-X groups with the highest scores
    1. For each hit, calculate its score
    2. If the score is > the lowest score in the queue, resolve the group value and update the priority queue
  3. For each of the top-X groups, a priority queue is created and filled with document IDs
    1. For each hit, calculate its score and resolve its group value (a BytesRef)
    2. If the group value matched one of the top-X groups, update that group’s queue
      1. Updating the queue might involve resolving multiple field values for the document, depending on in-group sorting
  4. Iterate the top-X groups and resolve the full documents

Observation 1: Hits are iterated twice. This is hard to avoid if we need more than 1 entry in each group. An alternative would be to keep track of all groups until all the hits have been iterated, but this would be extremely memory costly with high cardinality fields.

Observation 2: In step 3.1, score and group resolving is performed for all hits. It is possible to use the same logic as step 2.1, where the group is only resolved if the score is competitive.

Attempt 1: Delayed group resolving

The idea in observation 2 has been implemented as a kludge-hacky-proof-of-concept. Code is available at the group_4_10 branch at GitHub for those who like hurt.

When the hits are iterated the second time, all scores are resolved but only the group values for the documents with competitive scores are resolved. So how well does it work?

[Figure: Lazy group value resolving for Solr]

Observation: Optimized (aka lazy group value resolving) grouping is a bit slower than vanilla Solr grouping for some result sets, probably the ones where most of the group values have to be resolved. For other result sets there is a clear win.

It should be possible to optimize a bit more and bring the overhead of the worst-case optimized groupings down to near-zero. However, since there are so few best-case result sets and since the win is just about a third of the response time, I do not see this optimization attempt as being worth the effort.

Idea: A new level of lazy

IMPORTANT: I am a klutz and did not read the Solr code properly. Please skip ahead to the next section.

Going back to the algorithm for grouping we can see that “resolving the value” occurs multiple times. But what does it mean?

With DocValued terms, this is really a two-step process: The DocValue ordinal is requested for a given docID (blazingly fast) and the ordinal is used to retrieve the term (fast) in the form of a BytesRef. You already know where this is going, don’t you?

Millions of “fast” lookups accumulate to slow and we don’t really need the terms as such. At least not before we have to deliver the final result to the user. What we need is a unique identifier for each group value and the ordinal is exactly that.

But wait. Ordinals are not comparable across segments! We need to map the segment ordinals to a global structure. Luckily this is exactly what happens when doing faceting with facet.method=fc, so we can just scrounge the code from there.
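Conceptually, the segment-to-global ordinal mapping boils down to something like the sketch below. The real faceting code uses compressed structures rather than plain arrays and maps, so this is only meant to show the principle:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

public class GlobalOrdinals {
  // segmentTerms[segment][segmentOrdinal] = the term for that ordinal (sorted within each segment)
  public static long[][] build(String[][] segmentTerms) {
    TreeSet<String> allTerms = new TreeSet<>();
    for (String[] terms : segmentTerms) {
      for (String term : terms) {
        allTerms.add(term);
      }
    }
    Map<String, Long> globalOrdinals = new HashMap<>();
    long ordinal = 0;
    for (String term : allTerms) { // global ordinals follow the overall term order
      globalOrdinals.put(term, ordinal++);
    }
    long[][] segmentToGlobal = new long[segmentTerms.length][];
    for (int segment = 0; segment < segmentTerms.length; segment++) {
      segmentToGlobal[segment] = new long[segmentTerms[segment].length];
      for (int segOrd = 0; segOrd < segmentTerms[segment].length; segOrd++) {
        segmentToGlobal[segment][segOrd] = globalOrdinals.get(segmentTerms[segment][segOrd]);
      }
    }
    return segmentToGlobal;
  }
}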

With this in mind, the algorithm becomes

  1. The hits are calculated
  2. A priority queue is used to find the top-X groups with the highest scores
    1. For each hit, calculate its score
    2. If the score is > the lowest score in the queue, resolve the group value ordinal and update the priority queue
  3. For each of the top-X groups, a priority queue is created and filled with document IDs
    1. For each hit, resolve its group value segment-ordinal and convert that to global ordinal
    2. If the group value ordinal matches one of the top-X groups, update that group’s queue
      1. Updating the queue might involve resolving the document score or resolving multiple field value ordinals for the document, depending on in-group sorting
  4. Iterate the top-X groups and resolve the Terms from the group value ordinals as well as the full documents

Note how the logic is reversed for step 3.1, prioritizing value ordinal resolving over score calculation. Experience from the facet code suggests that ordinal lookup is faster than score calculation.

This idea has not been implemented yet. Hopefully it will be done Real Soon Now, but no promises.

Update 2016-01-28: Read the code properly

The Solr code does not resolve BytesRefs before it is necessary in step 3. On the contrary, it resolves the segment-specific group ordinals from the BytesRefs delivered by step 2 and uses fast lookup of those ordinals before calculating the score.

So while the first pass (determining the groups) seems optimizable by postponing BytesRef resolving, the second pass does not have any apparent performance failings.


The ones that got away

Two and a half ideas of improving Lucene/Solr performance that did not work out.

Track the result set bits

At the heart of Lucene (and consequently also Solr and ElasticSearch), there is a great amount of doc ID set handling. Abstractly speaking, a doc ID set is a collection of internal document IDs. If there are 10 documents in the index and a query matches the first and the fourth, the set would be [0, 3] as we count from 0. Every time there is filtering or faceting, bit sets are used.

SortedIntDocSet in Solr is one example of a document ID set. The IDs are stored after each other in an int[]. This takes up 32 bits per document in the set and is used for small result sets only.

FixedBitSet in Lucene is another example. It reserves 1 bit for each document in the full corpus. If the bit for a document is set, it means that the document is part of the set. All the bits are stored in a long[], which makes most operations quite fast: To calculate the union, the relevant code is

// Union (OR) with another FixedBitSet: merge the backing long[] arrays word by word
int pos = Math.min(numWords, otherNumWords);
while (--pos >= 0) {
  thisArr[pos] |= otherArr[pos];
}

FixedBitSet is very widely used. It is referenced in the Lucene/Solr code 451 times! As opposed to SortedIntDocSet et al, it can contain result sets of any size and reading or changing a specific bit is O(1). It does have a weakness though: Sparse sets.

Nearly all operations on FixedBitSet take time linear to the total number of documents in the index, not to the number of documents in the set. Calculating the union of [5, 87] and [1, 5, 1056978], when there are 20 million documents in the index, requires stepping through all the 300,000 longs (20M / 64) in each array to check for set bits.

Idea: Tracking changes to the structure works really well for sparse faceting, so why not for FixedBitSets? Add a new tracking bitset where each bit corresponds to a single long in the underlying FixedBitSet. If the tracking bit is set, the FixedBitSet long contains at least one set bit. To locate set bits, iterate the tracking bitset and look for longs != 0. This reduces the number of unnecessary lookups by a factor 64. Of course a third bitset can be added on top of the two, reducing unnecessary lookups by a total factor of 4096.
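A minimal sketch of the idea, reduced to set, union and cardinality and without the third level:

public class TrackedBits {
  final long[] bits;    // the underlying FixedBitSet-style words
  final long[] tracker; // tracker bit i is set iff bits[i] has at least one set bit

  public TrackedBits(int numBits) {
    bits = new long[(numBits + 63) >>> 6];
    tracker = new long[(bits.length + 63) >>> 6];
  }

  public void set(int index) {
    bits[index >>> 6] |= 1L << index;             // Java masks the shift to the lower 6 bits
    tracker[index >>> 12] |= 1L << (index >>> 6);
  }

  // Union: only visit the words that the other set marks as non-empty
  public void union(TrackedBits other) {
    for (int t = 0; t < other.tracker.length; t++) {
      long trackerWord = other.tracker[t];
      while (trackerWord != 0) {
        int word = (t << 6) + Long.numberOfTrailingZeros(trackerWord);
        bits[word] |= other.bits[word];
        tracker[word >>> 6] |= 1L << word;        // the word is now guaranteed non-empty
        trackerWord &= trackerWord - 1;           // clear the lowest set tracker bit
      }
    }
  }

  // Cardinality: only popcount the words that the tracker marks as non-empty
  public long cardinality() {
    long count = 0;
    for (int t = 0; t < tracker.length; t++) {
      long trackerWord = tracker[t];
      while (trackerWord != 0) {
        count += Long.bitCount(bits[(t << 6) + Long.numberOfTrailingZeros(trackerWord)]);
        trackerWord &= trackerWord - 1;
      }
    }
    return count;
  }
}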

An implementation is available at GitHub with a performance test. Instructions for running are in the TRACKED_README.txt file. Some reduced sample runs, where open is the vanilla FixedBitSet and tracked is the tracked version:

2K / 20M bits set (0.01% filled)
test         impl        ms
union        open      0.42
union        tracked   0.16
cardinality  open      0.34
cardinality  tracked   0.09
icount       open      0.59
icount       tracked   0.57
nextSetBit   open      0.31
nextSetBit   tracked   0.12
iterator     open      0.32
iterator     tracked   0.14

20K / 20M bits set (0.1% filled)
test         impl        ms
union        open      0.81
union        tracked   1.08
cardinality  open      0.34
cardinality  tracked   0.49
icount       open      0.58
icount       tracked   2.18
nextSetBit   open      0.58
nextSetBit   tracked   0.79
iterator     open      0.68
iterator     tracked   0.80

200K / 20M bits set (1.0% filled)
test         impl        ms
union        open      3.51
union        tracked   4.93
cardinality  open      0.43
cardinality  tracked   1.48
icount       open      0.59
icount       tracked  10.26
nextSetBit   open      3.21
nextSetBit   tracked   4.24
iterator     open      3.98
iterator     tracked   2.54

Technically the implementation delivers: Sparse result sets are processed faster. Unfortunately most operations require the sets to be less than 0.1% filled and worse, the overhead for non-sparse bitsets is prohibitive. There might be some special places where the tracked bitset can be used, but it is not viable as a general replacement for FixedBitSet.

Better scaling of top-X searches

This idea got pretty far, with the LUCENE-6828 JIRA issue and an analysis of the speed-up.

Requesting the top-X relevance ranked results from Lucene/Solr is pretty much the core use case. The underlying implementation uses a binary heap based priority queue with Java Objects. Due to the cache-adverse nature of binary heaps and the large overhead of tiny Objects, this does not scale well: Do not ask for top-1M hits and really do not ask for top-100M.

Using an array based bit packed heap implementation yields superior space- and speed-performance for top-100K+ requests (3x+ for both memory and speed, plus a lot less GC activity; see the previously linked analysis for details), with the advantages growing with the number of concurrent requests. No win for small top-X requests, but no big loss either.

So success? Sorta, kinda. Turns out that there are generally two scenarios:

  1. Request a relatively modest top-X, such as top-10 or top-1000.
  2. Export the full result set, which can be millions or billions.

As the array-based heap has no advantage over the vanilla Solr heap for small result sets and as full exports are much better handled by Solr streaming export, there is near-zero need for the middle ground that the array-based heap provides. It is a solution looking for a problem.

Cache-friendly priority queue

Since the vanilla Solr binary heap is cache-adverse, why not try and fix that problem? Yes, totally ignoring the “there is no problem as nobody requests large top-X results”-part here.

The idea is fairly simple: Binary heaps work well when the heap can be fully contained in the CPU cache, so making a heap of heaps means that most operations will be done on data that are already cached. There is a very nice write up on this at Playful Programming: Cache optimizing a priority queue.

This was implemented in Java at the hitdocs-trunk branch. Performance tests are in org.apache.lucene.search.TestHitQueue. Measurements below are from testPQPerformanceAtomicHeapsLimited, where sentinel is vanilla Solr, packed is the array-based heap described above and bheap4 is the best performing heaps-within-heaps implementation. Lower percentages are best and the winning implementation for each test is marked with ■.

Threads       Top-X      Hits      Sentinel         Packed         BHeap4 
      1     1000000    100000   118.49 100%     11.06   9%■    20.47  17% 
      1     1000000   1000000  1356.56 100%    133.51  10%■   270.82  20% 
      1     1000000   2000000  2216.50 100%    566.67  26%    472.43  21%■
      1     1000000   5000000  3406.35 100%    827.23  24%    771.33  23%■
      1     1000000  10000000  4183.84 100%   1054.94  25%   1012.78  24%■
      4     1000000    100000   246.00 100%     22.38   9%■    41.73  17% 
      4     1000000   1000000  2154.64 100%    232.00  11%■   504.80  23% 
      4     1000000   2000000  3356.96 100%   1057.01  31%    772.90  23%■
      4     1000000   5000000  5022.50 100%   1484.44  30%   1227.93  24%■
      4     1000000  10000000  6503.33 100%   1910.10  29%   1891.28  29%■
     16     1000000    100000  1310.95 100%     68.29   5%■   119.87   9% 
     16     1000000   1000000  6731.49 100%    605.90   9%■  1405.25  21% 
     16     1000000   2000000 11882.59 100%   3713.82  31%   2744.36  23%■
     16     1000000   5000000 19589.75 100%   5389.25  28%   4447.21  23%■
     16     1000000  10000000 25522.07 100%   6921.96  27%   4843.61  19%■

Note: The cases where hits <= top-X (100K and 1M) are special as packed sorts result sets with hits <= top-X using the more cache-friendly merge sort. This explains why packed is twice as fast as bheap4 for these cases. A similar optimization could be implemented for bheap4.

Looking through rose-tinted glasses, bheap4 is a bit better than packed (and they are both vastly superior to sentinel). But the implementation of bheap4 is bordering on voodoo, with the multi-level offset calculations and all the bit-tricks needed to avoid doing expensive multiplications and modulo calculations.

Hopefully the speed of bheap can be improved further – the Playful Programming author Björn Fahller certainly got fine results with his C++ implementation. But until that happens, its minuscule gains are not worth the added complexity.


Speeding up core search

Issue a query, get back the top-X results. It does not get more basic than that with Solr. So it would be a great win if we could improve on that, right? Truth be told, the answer is still “maybe”, but read on for some thoughts, test code and experimental results.

Getting the top-X results

  • The X in top-X is the maximum result set size we want back from a request.
  • A result set is the documents actually returned – this can be the same or less than X.
  • hits is the number of documents that matched the query – this can be much higher than the size of the result set.

When a top-X search is performed, Solr uses a Priority Queue to keep track of the result set. Such a queue is often implemented as a heap and so it is with Solr: The relevant class in the source is HitQueue.

A ScoreDoc represents a single result document, in the form of a document ID and a score (and sometimes a shard index, but we ignore that for now). The HitQueue holds ScoreDocs and is a simple beast: Add a new ScoreDoc and it will check to see if it has a higher score than any of the existing ScoreDocs. If so, the new ScoreDoc is added. If the queue is full, the ScoreDoc with the lowest score is removed.

Potential problems with the current implementation

  • A simple binary heap has poor memory locality: This means that an insert into the heap often results in multiple memory accesses scattered in the heap structure. This is not a problem for a tiny heap as everything will be in the CPU L2 cache, but for larger heaps that means a lot of uncached memory accesses.
  • The HitQueue in Solr uses ScoreDoc objects as elements in the heap. When the HitQueue is created, it is filled with dummy-ScoreDocs, called sentinels. There are multiple problems with that.
    • Each ScoreDoc object takes up 28 bytes on the heap. If 30 concurrent requests ask for top-2500000, that takes up 2GB of RAM, no matter what the actual result size is. I mention those specific numbers as that was the case for a person on the solr IRC channel.
    • Each ScoreDoc object is temporary, which means a lot of allocations and a lot of work for the garbage collector to clean up. In the previous case of the top-2500000, the JVM was doing stop-the-world garbage collections half the time.
    • The use of sentinel objects means that the heap is pre-filled with elements that will always be less than any real elements. Adding an element means sifting it from the top of the heap to the bottom. Not a problem for small heaps, but with larger ones it means unnecessary memory thrashing until the heap is full of new elements.
  • If the queue is not filled to its max size (the X in top-X), the ongoing maintaining of a heap structure is not the most efficient solution. For that, it would be better to simply collect all elements and do a merge-sort or similar, when they are to be delivered in order.

To sum it up: Solr’s queue in theory works well when requesting a small number of results, but poorly for large numbers. And indeed, that is part of the shared experience with Solr.

Lately CursorMark has been added to Lucene & Solr, which allows for efficient paging of results. One could say that this renders the optimization of requests for large result sets a moot point, but hey, it’s a challenge.

Experimental improvements

Switching to a better heap algorithm would be a worthy experiment, but that is hard so we will skip that for now. Instead we will do away with all those pesky ScoreDoc objects.

The two relevant parts of a ScoreDoc are docID (an integer) and score (a float). The sorting order of two ScoreDocs is determined primarily by the score and secondarily by the docID. There are a lot of such comparisons when using a heap.

In Java, the bits of a float can be extracted by Float.floatToRawIntBits, which produces an integer. This is a very fast operation. Interestingly enough, the sort order of two positive floats is preserved in the integer representations. This means that a ScoreDoc can be converted into an atomic long with ((long) Float.floatToRawIntBits(score)) << 32 | docID and that two such longs are directly comparable.
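In code, the packing and unpacking could look like this. It is a standalone illustration; the actual patch of course has to integrate with the PriorityQueue machinery:

public class PackedScoreDoc {
  // Pack (score, docID) into one long so that plain long ordering sorts by score first.
  // Assumes non-negative scores, as is the case for Lucene relevance scores.
  static long pack(float score, int docID) {
    return ((long) Float.floatToRawIntBits(score)) << 32 | (docID & 0xFFFFFFFFL);
  }

  static float score(long packed) {
    return Float.intBitsToFloat((int) (packed >>> 32));
  }

  static int docID(long packed) {
    return (int) packed;
  }

  public static void main(String[] args) {
    long low = pack(0.3f, 87);
    long high = pack(1.7f, 12);
    System.out.println(low < high);                      // true: higher score sorts higher
    System.out.println(score(high) + " " + docID(high)); // 1.7 12
  }
}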

With each element as a long, the object-guzzling ScoreDoc[] in the heap turns into a long[]. Besides taking only 8 bytes/element instead of 28, there is a much higher chance of CPU L2 cache hits as the longs in a long[] are packed tightly together on the heap.

Handling the case of a not-fully-filled heap is simple: Just add the elements to the array at the next free space. If the array gets full, heap-sort it and continue using it as a heap. If the queue does not run full, use the faster merge-sort to deliver the results in order.

Explorative experimentation

Which are just fancy words for poor methodology. Unfortunately the HitQueue used in Solr is not easily replaceable (it extends PriorityQueue, which has a lot of non-overridable methods). So to get a gist of which ideas are worth pursuing, we turn to micro benchmarks. Shalin Shekhar Mangar suggested using JMH. Sound advice, which is postponed for now in favour of “just run it a lot of times, alternating between implementations”.

The test bench is simple: Start a bunch of Threads, each running a test in parallel. Each test instantiates a queue implementation, fills it with random docID/score-pairs, then empties it. To guard against noisy neighbours, all threads test the same implementation and finish fully before switching to the next implementation.

For completeness, a Solr HitQueue sans sentinels is also tested. Spoiler: Turning off the sentinel looks like an extremely easy win for large top-X requests.

Experimental results – the original hypothesis

The idea was to improve on large Top-X, so let’s look at top-1M. The test was with 1, 4 & 16 threads on a quad-core i7 laptop and with varying amounts of hits. The raw output of the testPQPerformanceReport1M follows:

Threads       Top-X      Hits      Sentinel    No_Sentinel         Packed 
      1     1000000        10    15.06 100%      0.50   3%■     0.97   6% 
      1     1000000       100    13.56 100%      0.47   3%■     0.97   7% 
      1     1000000      1000    13.28 100%      0.56   4%■     1.20   9% 
      1     1000000     10000    21.21 100%      1.28   6%■     1.58   7% 
      1     1000000    100000    86.04 100%     18.43  21%      8.34  10%■
      1     1000000   1000000   954.23 100%    447.40  47%     87.39   9%■
      1     1000000  10000000  3245.70 100%   3544.06 109%    895.89  28%■

      4     1000000        10    27.31 100%      1.47   5%■     3.35  12% 
      4     1000000       100    25.68 100%      1.58   6%■     2.97  12% 
      4     1000000      1000    25.79 100%      1.45   6%■     3.04  12% 
      4     1000000     10000    33.42 100%      2.27   7%■     2.95   9% 
      4     1000000    100000   119.99 100%     19.50  16%     11.52  10%■
      4     1000000   1000000  1456.82 100%    576.17  40%    134.46   9%■
      4     1000000  10000000  5934.11 100%   4278.37  72%   1385.38  23%■

     16     1000000        10   131.92 100%      3.26   2%■     9.79   7% 
     16     1000000       100   120.81 100%      4.08   3%■     8.76   7% 
     16     1000000      1000   124.63 100%      3.01   2%■    10.30   8% 
     16     1000000     10000   162.60 100%      4.68   3%■    10.49   6% 
     16     1000000    100000   485.46 100%     84.27  17%     32.81   7%■
     16     1000000   1000000  4702.79 100%   1787.32  38%    368.57   8%■
     16     1000000  10000000 16563.52 100%  10964.12  66%   4197.17  25%■

Below each implementation (Sentinel, No_Sentinel and Packed) are two columns: How many milliseconds it took to initialize, fill & empty the queue, followed by how long it took relative to the vanilla Solr Sentinel implementation (this is of course 100% for the Sentinel implementation itself). The fastest implementation for any given test is marked with a black box ■.

  • Sentinel starts out relatively slow for small amounts of Hits and gets dog-slow, when the amount of Hits gets to 1M+.
  • No_Sentinel is a lot better for the smaller hit counts (it does not have to do all the initialization) and markedly better up to 1M Hits.
  • Packed is slower than No_Sentinel for the smaller hit counts, but as the number of hits rises, it pulls markedly ahead. There really is no contest beyond 100K hits, where Packed is 3-10x as fast as vanilla Solr Sentinel.
    Notice how there is a relative performance hit, when the amount of Hits exceeds the queue size for Packed. This is where it switches from “just fill the array sequentially, then merge-sort at the end” to maintaining a heap.

Experimental results – small top-X requests

A superior implementation for large top-X requests is nice and it is certainly possible to choose implementation based on the value of top-X. But let’s check to see how it behaves for small top-X requests. Top-10 for instance:

Threads       Top-X      Hits      Sentinel    No_Sentinel         Packed 
      1          10        10     0.01 100%      0.01  49%      0.00  45%■
      1          10       100     0.01 100%      0.01  85%      0.01  81%■
      1          10      1000     0.01 100%■     0.02 204%      0.02 219% 
      1          10     10000     0.05 100%■     0.05 101%      0.08 152% 
      1          10    100000     0.59 100%      0.44  74%      0.37  62%■
      1          10   1000000     4.41 100%      4.53 103%      3.64  83%■
      1          10  10000000    42.15 100%     45.37 108%     36.33  86%■

      4          10        10     0.00 100%■     0.00 154%      0.01 329% 
      4          10       100     0.00 100%      0.01 348%      0.00  85%■
      4          10      1000     0.01 100%      0.01 120%      0.01  79%■
      4          10     10000     0.05 100%■     0.07 138%      0.08 154% 
      4          10    100000     0.47 100%      0.46  99%■     0.55 117% 
      4          10   1000000     4.71 100%      6.29 134%      4.45  95%■
      4          10  10000000    72.23 100%     60.09  83%     56.29  78%■

     16          10        10     0.00 100%      0.00  42%      0.00  39%■
     16          10       100     0.00 100%      0.00  72%      0.00  60%■
     16          10      1000     0.01 100%■     0.01 109%      0.01 128% 
     16          10     10000     0.08 100%      0.09 112%      0.07  80%■
     16          10    100000     1.48 100%      1.32  89%      1.12  76%■
     16          10   1000000    17.63 100%     18.74 106%     16.97  96%■
     16          10  10000000   207.65 100%    212.93 103%    192.81  93%■

Measurements for small numbers of Hits are somewhat erratic. That is to be expected as those tests are very fast (< 1ms) to complete, so tiny variations in machine load or garbage collection have a lot to say.

  • Sentinel is as expected very fast, compared to the top-1M test. No surprise there as asking for top-10 (or top-20) is the core request for Solr. The relative speed stays reasonably constant as the number of Hits grows.
  • No_Sentinel is a bit slower than Sentinel. That can be explained by a slightly different code path for insertion of elements – this should be investigated as there is a possibility of an easy optimization.
  • Packed is very interesting: Although it is not consistently better for the lower numbers of Hits (do remember we are still talking sub-millisecond times here), it is consistently a bit faster for the larger ones. There might not be a need for choosing between implementations.

Conclusion

Replacing the ScoreDoc-object-using HitQueue in Solr with a bit-packing equivalent is a winning strategy in this test: The speed-up is substantial in some scenarios and the memory usage is about ⅓ of vanilla Solr’s.

The road ahead

This should of course be independently verified, preferably on other machine architectures. It should be investigated how to handle the case where shardIndex is relevant and more importantly discussed how to adjust the Lucene/Solr search code to use an alternative HitQueue implementation.


Light validation of Solr configuration

This week we were once again visited by the Edismax field alias bug in Solr: Searches with boosts, such as foo^2.5, stopped working. The problem arises when an alias with one or more non-existing fields is defined in solrconfig.xml and it is tedious to track down as one needs to check for existence of all the fields referenced.

We have 10+ different Solr setups and we use aliases in most of them. So a quick script was whipped together: validate_config.sh, which (…wait for it…) validates Solr configs. Nothing fancy, and it tends to report false positives when things are commented out in the XML files. Still, it does check that

  • all fields in schema.xml reference existing field types
  • all copyFields in schema.xml reference existing fields
  • all fields referenced in solrconfig.xml are defined in schema.xml
  • no alias in solrconfig.xml has the same name as a field in schema.xml

Some of these problems, such as referencing a non-existing field in mlt.fl or pf in solrconfig.xml, are silent and hard to track down: Solr does not complain and searches seem to work. But in the case of misspellings of field names, the result is poorer quality searches as the intended functionality is not activated.

Cross-validation of fields used in solrconfig.xml and schema.xml would be nice to have as part of Solr core startup, but until then the validate_config.sh script might be of use. Get it at GitHub.


Sampling methods for heuristic faceting

Initial experiments with heuristic faceting in Solr were encouraging: Using just a sample of the result set, it was possible to get correct facet results for large result sets, reducing processing time by an order of magnitude. Alas, further experimentation unearthed that the sampling method was vulnerable to clustering. While heuristic faceting worked extremely well for most of the queries, it failed equally hard for a few of the queries.

The problem

Abstractly, faceting on Strings is a function that turns a collection of documents into a list of top-X terms plus the number of occurrences of these terms. In Solr the collection of documents is represented with a bitmap: One bit per document; if the bit is set, the document is part of the result set. The result set of 13 hits for an index with 64 documents could look like this:

00001100 01010111 00000000 01111110

Normally the faceting code would iterate all the bits, get the terms for the ones that are set and update the counts for those terms. The iteration of the bits is quite fast (1 second for 100M bits), but getting the terms (technically the term ordinals) and updating the counters takes more time (100 seconds for 100M documents).

Initial attempt: Sample the full document bitmap

The initial sampling was done by dividing the result set into chunks and only visiting part of each chunk. If we wanted to sample 50% of our result set using 4 chunks, the parts of the result set to visit could be the ones marked with brackets:

4 chunks: [0000]1100 [0111]1110 [0000]0000 [0101]0111

As can be counted, the sampling hit 5 documents out of 13. Had we used 2 chunks, the result could be

2 chunks: [00001100] 01111110 [00000000] 01010111

Only 2 hits out of 13 and not very representative. A high chunk count is needed: For 100M documents, 100K chunks worked fairly well. The law of large numbers helps a lot, but in case of document clusters (a group of very similar documents indexed at the same time) we still need both a lot of chunks and a high sampling percentage to have a high chance of hitting them. This sampling is prone to completely missing or over-representing clusters.

Current solution: Sample the hits

Remember that iterating the result bitmap itself is relatively fast. Instead of processing chunks of the bitmap and skipping between them, we iterate over all the hits and only update counts for some of them.

If the sampling rate is 50%, every second hit (marked with brackets) would be used as the sample:

50% sampling: 0000[1]100 0[1]1[1]1[1]10 00000000 0[1]010[1]1[1]

If the sampling rate is 33%, every third hit would be used:

33% sampling: 0000[1]100 01[1]11[1]10 00000000 010[1]011[1]

This way of sampling is a bit slower than sampling on the full document bitmap as all bits must be visited, but it means that the distribution of the sampling points is as fine-grained as possible. It turns out that the better distribution gives better results, which means that the size of the sample can be lowered. Lower sample rate = higher speed.
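A stripped-down sketch of the counting loop with hit sampling. It uses a single-valued field and illustrative names to keep things short; the real code obviously works on the multi-valued links field:

public class SampledFacetCounter {
  // resultBits: the result set as a bitmap, docToTermOrdinal: docID -> term ordinal
  public static int[] count(long[] resultBits, int[] docToTermOrdinal,
                            int uniqueTerms, int samplingStep) {
    int[] counts = new int[uniqueTerms];
    long hit = 0;
    for (int word = 0; word < resultBits.length; word++) {
      long bits = resultBits[word];
      while (bits != 0) {
        if (hit++ % samplingStep == 0) { // only every samplingStep'th hit updates the counters
          int docID = (word << 6) + Long.numberOfTrailingZeros(bits);
          counts[docToTermOrdinal[docID]]++;
        }
        bits &= bits - 1; // clear the lowest set bit
      }
    }
    return counts;
  }
}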

Testing validity

A single shard from the Net Archive Search was used for testing. The shard was 900GB with 250M documents. Faceting was performed on the field links, which contains all outgoing links from indexed webpages. There are 600M unique values in that field and each document in the index contains an average of 25 links. For a full search on *:* that means 6 billion updates of the counter structure.

For this test, we look for the top-25 links. To get the baseline, a full facet count was issued for the top-50 links for a set of queries. A heuristic facet call was issued for the same queries, also for the top-50. The number of lines until the first discrepancy was counted for all the pairs. The ones with a count beneath 25 were considered faulty. The reason for the over provisioning was to raise the probability of correct results, which of course comes with a performance penalty.

The sampling size was set to 1/1000 of the number of documents, or roughly 200K hits. Only result set sizes above 1M are relevant for validity as those below take roughly the same time to calculate with and without sampling.

[Figure: Heuristic validity for top 25/50]

While the result looks messy, the number of faulty results was only 6 out of 116 for result set sizes above 1M. For the other 110 searches, the top-25 terms were correct. Raising the over provisioning to top-100 imposes a larger performance hit, but reduces the number of faulty results to 0 for this test.

[Figure: Heuristic validity for top 25/100]

Testing performance

The response times for full count faceting and heuristic faceting on the links field with an over provisioning of 50 are as follows:

[Figure: Heuristic speed for top 25/50]

Switching from linear to logarithmic plotting for the y-axis makes the difference easier to see:

[Figure: Heuristic speed for top 25/50, logarithmic y-axis]

It can be seen that full counting rises linearly with result size, while sampling time is near-constant. This makes sense as the sampling was done by updating counts for a fixed amount of documents. Other strategies, such as making the sampling rate a fraction of the result size, should be explored further, but as the validity plot shows, the fixed strategy works quite well.

The performance chart for over provisioning of 100 looks very much like the one for 50, only with slightly higher response times for sampling. As the amount of non-valid results is markedly lower for an over provisioning of 100, this seems like the best speed/validity trade off for our concrete setup.

[Figure: Heuristic speed for top 25/100, logarithmic y-axis]

Summary

Heuristic faceting with sampling on hits gives a high probability of correct results. The speed up relative to full facet counting rises with result set size as sampling has near-constant response times. Using over provisioning allows for fine-grained tweaking between performance and chance of correct results. Heuristic faceting is expected to be the default for interactive use with the links field. Viability of heuristic faceting for smaller fields is currently being investigated.

As always, there is full source code and a drop-in sparse faceting Solr 4.10 WAR at GitHub.
