Visualising Netarchive Harvests

 

An overview of website harvest data is important for both research and development operations in the netarchive team at Det Kgl. Bibliotek. In this post we present a frontend visualisation widget we have recently made.

From the SolrWayback Machine we can extract an array of dates of all harvests of a given URL. These dates are processed in the browser into a data object containing the years, months, weeks and days, enabling us to visualise the data. Furthermore, the data is given an activity level from 0 to 4.

The high-level overview seen below is the year-month graph. Each cell is coloured based on its activity level relative to the number of harvests in the most active month. For now we use a linear calculation: gray means no activity, activity level 1 is 0-25% of the most active month, and level 4 is 75-100%. As GitHub fans, we have borrowed the activity level colouring from the user commit heat map.

[Figure: year-month overview of harvests]
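
As a minimal sketch of the in-browser processing, assuming harvest dates arrive as ISO-8601 strings (the function names are illustrative, not the widget's actual API), the bucketing and level calculation could look like this in TypeScript:

```typescript
// Count harvests per month; dates are assumed to be ISO-8601 strings.
function monthBuckets(dates: string[]): Map<string, number> {
  const buckets = new Map<string, number>();
  for (const d of dates) {
    const key = d.slice(0, 7); // "YYYY-MM"
    buckets.set(key, (buckets.get(key) ?? 0) + 1);
  }
  return buckets;
}

// Linear activity level: 0 = no harvests, 1-4 = quarters of the busiest month.
function activityLevel(count: number, maxCount: number): number {
  return count === 0 ? 0 : Math.ceil((count / maxCount) * 4);
}

const buckets = monthBuckets(["2009-03-01T12:00:00Z", "2009-03-15T08:30:00Z"]);
const max = Math.max(...buckets.values());
buckets.forEach((n, month) => console.log(month, activityLevel(n, max)));
```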

 

We can visualise a more detailed view of the data as either a week-day view of the whole year, or as a view of all days since the first harvest. Clicking one of these days reveals all the harvests for the given day, with links back to SolrWayback to see a particular harvest.

[Figure: week-day graph of all weeks of 2009]

 

In the graph above we see the days of all weeks of 2009 as vertical rows. The same visualisation can be made for all harvest data for the URL, as seen below (cut off before 2011 for this blog post).

[Figure: week-day graph of all harvest data for the URL]

 

There are both advantages and disadvantages to the browser-based data calculation. One of the main advantages is a very portable frontend application: it can be used with any backend that outputs an array of dates. The initial idea was to make the application usable for several different in-house projects. The main drawback to this approach is, of course, scalability. Currently the application processes 25,000 dates in about 3-5 seconds on the computer used to develop the application (a 2016 quad-core Intel i5).

The application uses the frontend library VueJS and only one other dependency, the date-fns library. It is completely self-contained and can be included with a single script tag, styles included.

Ideas for further development

We would like to expand this to also include both:

  1. multiple URLs, which would be nice for resources that have changed domain, subdomain or protocol over time. E.g. the URLs http://pol.dk, http://www.pol.dk and https://politiken.dk could all be used for the Danish newspaper Politiken.
  2. domain visualisation for all URLs on a domain. A challenge here will of course be the resources needed to process the data in the browser. Perhaps a better calculation method must be used – or a kind of lazy loading.

SolrWayback Machine

Another ‘google innovation week’ at work has produced the SolrWayback Machine. It works similarly to the Internet Archive’s Wayback Machine (https://archive.org/web/) and can be used to show harvested web content (Warc files). The Danish Internet Archive has over 20 billion harvested web objects, taking up 1 petabyte of storage.

SolrWayback requires that you have indexed the Warc files using the Warc-indexer tool from the British Library (https://github.com/ukwa/webarchive-discovery/tree/master/warc-indexer).

It is quite fast and comes with some additional features as well:

  •  Image search similar to Google Images
  •  Link graphs showing ingoing/outgoing links for domains, using the D3 JavaScript framework.
  •  Raw download of any harvested resource from the binary Arc/Warc file.

Unfortunately the collection is not available to the public, so I cannot show you a live demo. But here are a few pictures from the SolrWayback Machine.

[Figure: SolrWayback search demo]

[Figure: SolrWayback link graph]

SolrWayback at GitHub: https://github.com/netarchivesuite/solrwayback/


juxta – image collage with metadata

Creating large collages of images to give a bird’s eye view of a collection seems to be gaining traction. Two recent initiatives:

Combining those two ideas seemed like a logical next step, and juxta was born: a fairly small bash script for creating million-scale collages of images, with no special server-side requirements. There’s a small (just 1000 images) demo at SBLabs.

Presentation principle

The goal is to provide a seamless transition from the full collection to individual items, making it possible to compare nearby items with each other and locate interesting ones. Contextual metadata should be provided for general information and provenance.

Concretely, the user is presented with all images at once and can zoom in to individual images in full size. Beyond a given threshold, metadata are shown for the image currently under the cursor, or finger if a mobile device is used. An image description is displayed just below the focused image, to avoid disturbing the view. A link to the source of the image is provided on top.

[Figure: Overview of historical maps]

[Figure: Metadata for a specific map]

Technical notes, mostly on scaling

On the display side, OpenSeadragon takes care of the nice zooming. When the user moves the focus, a tiny bit of JavaScript spatial math resolves image identity and visual boundaries.
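
The spatial math is essentially integer division on a fixed grid. Below is a minimal sketch, assuming every image occupies a cellW×cellH pixel cell at full zoom and images are laid out row-major; the real widget does the equivalent through OpenSeadragon’s viewport coordinates:

```typescript
interface Cell { index: number; x0: number; y0: number; x1: number; y1: number }

function cellAt(px: number, py: number, cellW: number, cellH: number,
                columns: number, imageCount: number): Cell | null {
  const col = Math.floor(px / cellW);
  const row = Math.floor(py / cellH);
  if (col < 0 || col >= columns) return null;
  const index = row * columns + col;        // row-major image identity
  if (index < 0 || index >= imageCount) return null;
  return { index,
           x0: col * cellW, y0: row * cellH,          // visual boundaries
           x1: (col + 1) * cellW, y1: (row + 1) * cellH };
}

console.log(cellAt(700, 300, 512, 384, 10, 1000)); // -> cell for image index 1
```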

OpenSeadragon uses pyramid tiles for display and supports the Deep Zoom protocol, which can be implemented using only static files. The image to display is made up of tiles of (typically) 256×256 pixels. When the view is fully zoomed in, only the tiles within the viewport are requested. When the user zooms out, the tiles from the level above are used. The level above is half the width and half the height and is thus represented by ¼ the number of tiles. And so forth.
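
To get a feel for the pyramid, the tile count per level can be computed directly; a sketch assuming 256×256 tiles:

```typescript
// Deep Zoom pyramid math: each level halves width and height, so tile
// counts shrink by roughly 4x per level, all the way down to 1x1 pixel.
function tilesPerLevel(width: number, height: number, tile = 256): number[] {
  const counts: number[] = [];
  let [w, h] = [width, height];
  while (w > 1 || h > 1) {
    counts.push(Math.ceil(w / tile) * Math.ceil(h / tile));
    w = Math.ceil(w / 2);
    h = Math.ceil(h / 2);
  }
  counts.push(1);                 // the final 1x1 pixel level
  return counts;                  // deepest level first
}

console.log(tilesPerLevel(512_000, 384_000)); // [3000000, 750000, 187500, ...]
```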

Generating tiles is heavy

A direct way of creating the tiles is

  1. Create one large image of the full collage (ImageMagick’s montage is good for this)
  2. Generate tiles for the image
  3. Scale the image down to 50%×50%
  4. If the image is larger than 1×1 pixel, go to 2

Unfortunately this does not scale particularly well. Depending on size and tools, it can take up terabytes of temporary disk space to create the full collage image.

By introducing a size constraint, juxta removes this step: all individual source images are scaled & padded to have the exact same size, and the width and height of the images are exact multiples of 256. Then the tiles can be created as follows (sketched in code after the list):

  1. For each individual source image, scale, pad and split the image directly into tiles
  2. Create each tile at the level above individually by joining the corresponding 4 tiles below and scaling to 50%×50% size
  3. If there is more than 1 tile, or that tile is larger than 1×1 pixel, go to 2
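
Here is a sketch of step 2: each tile at the level above depends only on its (up to) four child tiles. The tile addressing and the joinAndScale stand-in are illustrative; juxta itself does this with bash and ImageMagick:

```typescript
type TileGrid = Map<string, string>;        // "col,row" -> tile id/path

function levelAbove(tiles: TileGrid, cols: number, rows: number): TileGrid {
  const parent: TileGrid = new Map();
  for (let r = 0; r < Math.ceil(rows / 2); r++) {
    for (let c = 0; c < Math.ceil(cols / 2); c++) {
      const children = [[2*c, 2*r], [2*c+1, 2*r], [2*c, 2*r+1], [2*c+1, 2*r+1]]
        .map(([cc, rr]) => tiles.get(`${cc},${rr}`))
        .filter((t): t is string => t !== undefined);
      // joinAndScale(children) would montage the children and halve the size
      parent.set(`${c},${r}`, `join(${children.join("+")})`);
    }
  }
  return parent;
}
```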

As the tiles are generated directly from either source images or other tiles, there is no temporary storage overhead. As each source image and each tile are processed individually, it is simple to do parallel processing.

Metadata takes up space too

Displaying image-specific metadata is simple when there are just a few thousand images: Use an in-memory array of Strings to hold the metadata and fetch it directly from there. But when the number of images goes into the millions, this quickly becomes unwieldy.

juxta groups the images spatially in buckets of 50×50 images. The metadata for all the images in a bucket are stored in the same file. When the user moves the focus to a new image, the relevant bucket is fetched from the server and the metadata are extracted. A bucket cache is used to minimize repeated calls.
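
A sketch of the bucket lookup with caching; the meta/&lt;bx&gt;_&lt;by&gt;.json layout is an assumption for illustration, not necessarily juxta’s actual file naming:

```typescript
const BUCKET = 50;                             // 50x50 images per bucket
const cache = new Map<string, Promise<string[]>>();

async function metadataFor(col: number, row: number): Promise<string> {
  const bx = Math.floor(col / BUCKET);
  const by = Math.floor(row / BUCKET);
  const key = `${bx}_${by}`;
  if (!cache.has(key)) {                       // fetch each bucket only once
    cache.set(key, fetch(`meta/${key}.json`).then(r => r.json()));
  }
  const bucket = await cache.get(key)!;
  // position of the image inside its 50x50 bucket, row-major
  const local = (row % BUCKET) * BUCKET + (col % BUCKET);
  return bucket[local];
}
```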

Most file systems don’t like to hold a lot of files in the same folder

While the limits differ, common file systems such as ext, hfs & ntfs all experience performance degradation with high numbers of files in the same folder.

The Deep Zoom protocol in conjunction with file-based tiles means that the number of files at the deepest zoom level is linear in the number of source images. If there are 1 million source images with full-zoom size 512×512 pixels (2×2 tiles), the number of files in a single folder will be 2×2×1M = 4 million, far beyond the comfort zone of the mentioned file systems (see the juxta readme for tests of performance degradation).

juxta mitigates this by bucketing tiles in sub-folders. This ensures linear scaling of build time at least up to 5-10 million images. 100 million+ images would likely deteriorate build performance markedly, but at that point we are also entering “are there enough free inodes on the file system?” territory.

Unfortunately the bucketing of the tile files is not in the Deep Zoom standard. With OpenSeadragon, it is very easy to change the mapping, but it might be more difficult for other Deep Zoom-expecting tools.

Some numbers

Using a fairly modern i5 desktop and 3 threads, generating a collage of 280 5MPixel images scaled down to 1024×768 pixels (4×3 tiles) took 88 seconds, or about 3 images/second. Repeating the experiment with a down-scale to 256×256 pixels (smallest possible size) raised the speed to about 7½ images/second.

juxta comes with a scale-testing script that generates sample images that are close (but not equal) to the wanted size and repeats them for the collage. With this near-ideal match, processing speed was 5½ images/second for 4×3 tiles and 33 images/second for 1×1 tiles.

The scale-test script has been used up to 5 million images, with processing time practically linear to the number of images. At 33 images/second that is 42 hours.


Automated improvement of search in low quality OCR using Word2Vec

This abstract has been accepted for Digital Humanities in the Nordic Countries 2nd Conference, http://dhn2017.eu/

In the Danish Newspaper Archive[1] you can search and view 26 million newspaper pages. The search engine[2] uses OCR (optical character recognition) from scanned pages, but the software converting the scanned images to text often makes reading errors. As a result the search engine will miss matching words due to OCR errors. Since many of our newspapers are old and the scans/microfilms are also of low quality, the resulting OCR constitutes a substantial problem. In addition, the OCR converter performs poorly with old font types such as fraktur.

One way to find OCR errors is by using the unsupervised Word2Vec[3] learning algorithm. This algorithm identifies words that appear in similar contexts. For a corpus with perfect spelling the algorithm will detect similar words: synonyms, conjugations, declensions etc. In the case of a corpus with OCR errors the Word2Vec algorithm will find the misspellings of a given word, whether from bad OCR or, in some cases, from journalists. A given word appears in similar contexts despite its misspellings and is identified by its context. For this to work the Word2Vec algorithm requires a huge corpus, and for the newspapers we had 140GB of raw text.

Given the words returned by Word2Vec, we use a Danish dictionary to remove the same word in different grammatical forms. The remaining words are filtered by a similarity measure using an extended version of Levenshtein distance, taking the length of the word into account along with an idempotent normalization handling frequent one- and two-character OCR errors.

Example: Let’s say you use Word2Vec to find words for banana and it returns: hanana, bananas, apple, orange. Remove bananas using the (English) dictionary, since it is not an OCR error. Of the three remaining words only hanana is close to banana, and it is thus the only misspelling of banana found in this example. The Word2Vec algorithm does not know how a word is spelled or misspelled; it only uses the semantic and syntactic context.
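
A toy version of this filtering step, using plain Levenshtein distance in place of the extended, OCR-aware variant described above:

```typescript
// Standard dynamic-programming Levenshtein edit distance.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 },
                        (_, i) => [i, ...Array(b.length).fill(0)]);
  for (let j = 1; j <= b.length; j++) dp[0][j] = j;
  for (let i = 1; i <= a.length; i++)
    for (let j = 1; j <= b.length; j++)
      dp[i][j] = Math.min(dp[i-1][j] + 1, dp[i][j-1] + 1,
                          dp[i-1][j-1] + (a[i-1] === b[j-1] ? 0 : 1));
  return dp[a.length][b.length];
}

const dictionary = new Set(["bananas", "apple", "orange"]); // toy dictionary

function ocrVariants(word: string, neighbours: string[]): string[] {
  return neighbours
    .filter(n => !dictionary.has(n))           // real words are not OCR errors
    .filter(n => levenshtein(word, n) <= 1);   // close enough to be a misread
}

console.log(ocrVariants("banana", ["hanana", "bananas", "apple", "orange"]));
// -> ["hanana"]
```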

This method is not an automatic OCR error corrector and cannot output the corrected OCR. But when searching, it will appear as if you are searching in an OCR-corrected text corpus. Single-word searches on the full corpus give an increase of 3% to 20% in the number of results returned. Preliminary tests on the full corpus show only relatively few false positives among the additional results returned, thus increasing recall substantially without a decline in precision.

The advantage of this approach is a quick win with minimal impact on a search engine [2] based on low-quality OCR. The algorithm generates a text file with synonyms that can be used by the search engine. Not only single-word searches but also phrase searches with highlighting work out of the box. An OCR correction demo[4] using Word2Vec on the Danish newspaper corpus is available on the Labs[5] pages of The State And University Library, Denmark.
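
As an illustration, Solr’s synonym filter accepts explicit mappings, so the generated file could contain entries like these (the misspellings here are made up):

```
# synonyms.txt: map OCR misspellings to the correctly spelled word
hanana => banana
stalsminisler => statsminister
```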

[1] Mediestream, The Danish digitized newspaper archive.
http://www2.statsbiblioteket.dk/mediestream/avis

[2] SOLR or Elasticsearch etc.

[3] Mikolov et al., Efficient Estimation of Word Representations in Vector Space
https://arxiv.org/abs/1301.3781

[4] OCR error detection demo (change word parameter in URL)
http://labs.statsbiblioteket.dk/dsc/ocr_fixer.jsp?word=statsminister

[5] Labs for State And University Library, Denmark
http://www.statsbiblioteket.dk/sblabs/

 


70TB, 16b docs, 4 machines, 1 SolrCloud

At Statsbiblioteket we maintain a historical net archive for the Danish parts of the Internet. We index it all in Solr, and we recently caught up with the present. Time for a status update. The focus is performance and logistics, not net archive-specific features.

Hardware & Solr setup

Search hardware is 4 machines, each with the same specifications & setup:

  • 16 true CPU-cores (32 if we count Hyper-Threading)
  • 384GB RAM
  • 25 SSDs @ 1TB (930GB really)

Each machine runs 25 Solr 4.10 instances @ 8GB heap, each instance handling a single shard on a dedicated SSD; the exception is machine 4, which only has 5 shards because it is still being filled. Everything is coordinated by SolrCloud as a single collection.

[Figure: Netarchive SolrCloud search setup]

As the Solrs are the only thing running on the machines, it follows that there are at least 384GB-25*8GB = 184GB free RAM for disk cache on each machine. As we do not specify Xms, this varies somewhat, with 200GB free RAM on last inspection. As each machine handles 25*900GB = 22TB index data, the amount of disk cache is 200GB/22TB = 1% of index size.

Besides the total size of 70TB / 16 billion documents, the collection has some notable high-cardinality fields, used for grouping and faceting:

  • domain & host: 16 billion values / 2-3 million unique
  • url_norm: 16 billion values / ~16 billion unique
  • links: ~50 billion values / 20 billion+ unique

Workload and performance

The archive is not open to the public, so the number of concurrent users is low, normally just 1 or 2 at a time. There are three dominant access patterns:

  1. Interactive use: The typical request is a keyword query with faceting on 4-6 fields (domain, year, mime-type…), sometimes grouped on url_norm and often filtered on one or more of the facets.
  2. Corpus probing: Batch extraction jobs, such as using the stats component for calculating the size of all harvested material, for a given year, for all harvested domains separately.
  3. Lookup mechanism for content retrieval: Very experimental and used similarly to CDX-lookups + Wayback display. Such lookups are searches for 1-100 url_field:url pairs, OR’ed together, grouped on the url_field and sorted on temporal proximity to a given timestamp.
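
A lookup of the third kind could be phrased as a grouped Solr query sorted by a function of timestamp distance. This is only a sketch: the field names url_norm and harvest_date and the collection name are assumptions, not our actual schema:

```typescript
// Build a lookup query: OR'ed url pairs, grouped by url, closest-in-time first.
const params = new URLSearchParams({
  q: 'url_norm:"http://example.org/a" OR url_norm:"http://example.org/b"',
  group: "true",
  "group.field": "url_norm",
  // ms(a,b) is Solr's millisecond difference; abs() ranks by temporal proximity
  sort: "abs(ms(harvest_date,2009-06-01T00:00:00Z)) asc",
  rows: "100",
});
console.log(`/solr/netarchive/select?${params}`);
```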

Due to various reasons, we do not have separate logs for the different scenarios. To give an approximation of interactive performance, a simple test was performed: extract all terms matching more than 0.01% of the documents, use those terms to create fake multi-term queries (1-4 terms) and perform searches for the queries in a single thread.

[Figure: Non-faceted (blue) and faceted (pink) search in the full net archive, bucketed by hit count]

The response times for interactive use lie within our stated goal of keeping median response times < 2 seconds. It is not considered a problem that queries with 100M+ hits take a few seconds more.

The strange low median for non-faceted search in the 10⁸ bucket is due to query size (the number of terms in the query) impacting search speed. The very fast single-term queries dominate this bucket, as very few multi-term queries gave enough hits to land in it. The curious can take a look at the measurements, where the raw test result data are also present.

Moral: Those are artificial queries and should only be seen as a crude approximation of interactive use. More complex queries, such as grouping on billions of hits or faceting on the links field, can take minutes. The worst case discovered so far is 30 minutes.

Secret sauce

  1. Each shard is fully optimized and the corpus is extended by adding new shards, which are built separately. The memory savings from this are large: no need for the extra memory used when updating indexes (which requires 2 searchers to be temporarily open at the same time), and no need for large segment→ordinal maps for high-cardinality faceting.
  2. Sparse faceting means lower latency, lower memory footprint and less GC. To verify this, the performance test above was re-taken with vanilla Solr faceting.
[Figure: Vanilla Solr faceting (blue) and sparse faceting (pink) search in the full net archive, bucketed by hit count]

Lessons learned so far

  1. Memory. We started out with 256GB of RAM per machine. This worked fine until all 25 Solr JVMs per machine had expanded up to their 8GB Xmx, leaving ~50GB, or about 0.25% of the index size, free for disk cache. At that point the performance tanked, which should not have come as a surprise, as we tested this scenario nearly 2 years ago. Alas, quite foolishly we had relied on the Solr JVMs not expanding all the way up to 8GB. Upping the memory per machine to 384GB, leaving 200GB or 1% of index size free for disk cache, ensured that interactive performance was satisfactory.
    An alternative would have been to lower the heap for each Solr. The absolute minimum heap for our setup is around 5GB per Solr, but that setup is extremely vulnerable to concurrent requests or memory heavy queries. To free enough RAM for satisfactory disk caching, we would have needed to lower the heaps to 6GB, ruling out faceting on some of the heavier fields and in general having to be careful about the queries issued. Everything works well with 8GB, with the only Out Of Memory incidents having been due to experiment-addicted developers (aka me) issuing stupid requests.
  2. Indexing power: Practically all the work on indexing is done in the Tika analysis phase. It took about 40 CPU-core years to build the current version 2 of the index; in wall-clock time it took about 1 year. Fortunately the setup scales practically linearly, so next time we’ll try to allocate 12 power houses for 1 month instead of 1 machine for 12 months.
  3. Automation: The current setup is somewhat hand-held. Each shard is constructed by running a command, waiting a bit less than a week, manually copying the constructed shard to the search cloud and restarting the cloud (yes, restart). In reality it is not that cumbersome, but a lot of time was wasted when the indexer had finished with no one around to start the next batch. Besides, the strictly separated index/search setup means that the content currently being indexed into an upcoming shard cannot be searched.

Looking forward

  1. Keep the good stuff: We are really happy about the searchers being non-mutable and on top of fully optimized shards.
  2. Increase indexing power: This is a no-brainer and “only” a question of temporary access to more hardware.
  3. Don’t diss the cloud: Copying raw shards around and restarting the cloud is messy. Making each shard a separate collection would allow us to use the collections API for moving them around and an alias to access it all as a single collection.
  4. Connect the indexers to the searchers: As the indexers only handle a single shard at a time, they are fairly easy to scale so that they can also function as searchers for the shards being built. The connection is trivial to create if #3 is implemented.
  5. Upgrade the components: Solr 6 will give us JSON faceting, Streaming, SQL-ish support, graph traversal and more. These would benefit both interactive use and batch jobs. We could do this by upgrading the shards with Lucene’s upgrade tool, but we would rather perform a full re-index, as we also need better data in the index. A story for another time.

 


Prototype demo for OCR postfix in Danish Newspapers

In The Danish Newspaper Archive you can search in 25 million newspaper pages and view the pages. The search engine uses OCR (optical character recognition) from scanned pages, but the software reading the text from the scanned images often makes reading errors. As a result, the search engine will miss matching words due to OCR errors. Since many of our newspapers are old, the quality of the scans/microfilms is not very good, and the OCR software has problems with old font types, the bad OCR constitutes a substantial problem.

One way to find these OCR errors is using the Word2Vec algorithm that I have written about before. The algorithm detects words that appear in similar contexts. For a corpus with perfect spelling the algorithm will detect similar words: synonyms, conjugations, declensions etc. But in the case of a corpus with OCR errors the Word2Vec algorithm will also find the misspellings of a given word, whether from bad OCR or, in some cases, from journalists. A given word appears in almost exactly the same contexts as its misspellings. For this to work the Word2Vec algorithm requires a huge corpus, and for the newspapers we had 140GB of raw text. This is probably also the largest word2vec index ever built on a Danish corpus.

Given the list of words returned by Word2Vec, we then use a Danish dictionary to remove the same word in different grammatical forms, which are not OCR errors. For the remaining words, you check whether the characters are close enough to the original word for it to be identified as a misspelling.
Example: Let’s say you use Word2Vec to find words for ‘banana’ and it returns: hanana, bananas, apple, orange. You remove bananas using the (English) dictionary, since it is not an OCR error. Of the three remaining words only ‘hanana’ is close to ‘banana’, and it is thus the only misspelling of banana found in this example. Remember that the Word2Vec algorithm does not care how the words are spelled or misspelled; it only uses the semantic context of the words.

You can play with the Word2Vec index on the Danish Newspapers here:
http://labs.statsbiblioteket.dk/dsc/ (remember to select the newspaper corpus)

And this page shows how the dictionary is used to find misspellings:
http://labs.statsbiblioteket.dk/dsc/ocr_fixer.jsp?word=statsminister
(change the last word in the URL – Danish only, sorry…)

Running this algorithm on 1000 random words takes 8 hours (using 20 CPUs) and fixes 84 million OCR errors, though very few of them are false positives rather than true OCR errors. This last step, maximizing the number of OCR errors caught while minimizing false positives, is still in progress… 🙂

The newspaper Archive: http://www2.statsbiblioteket.dk/mediestream/avis

 


2D visualization of high dimensional word embeddings

In this blog post I tried to make a method for a computer to read a text, analyse the characters and then make a 2D visualization of the similarity of the characters. To achieve this I am using the word2vec algorithm, building a distance matrix of all mutual distances and fitting it into a 2D plot. The three texts I used were:

  • All 3 Lord of the Rings books
  • Pride and Prejudice + Emma by Jane Austen
  • A combined text of 35,000 free English Gutenberg e-books

Word2Vec is an algorithm invented by Google researchers in 2013. Input is a text which has been preprocessed, as I will explain later. The algorithm then extracts all words and maps each word to a multidimensional vector of typically 200 dimensions. Think of the quills of a hedgehog where each quill is a word, except in more than 3 dimensions. What is remarkable about the algorithm is that it captures some of the context of the words, and this is reflected in the multidimensional vectors. Words that are somewhat similar are very close in this vector space, where ‘close’ is measured by the angle between two vectors. Furthermore, the relative positions of two words also capture a relation between them. A well-known example is that the distance vector from ‘man’ to ‘king’ is almost identical to the distance vector from ‘woman’ to ‘queen’. Using this information you are able to predict the word ‘queen’ given the three words <man, king> <woman, ?>. It is far from obvious why the algorithm reflects this behaviour in the vector space, and I have not fully understood it yet. Before you can use the word2vec algorithm you have to remove all punctuation, split the text into one sentence per line and lowercase it. Splitting into sentences is not just splitting whenever you meet a ‘.’ character; for instance ‘Mr. Anderson’ should not trigger a split.
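
A toy version of the preprocessing in TypeScript, assuming a tiny abbreviation list (real sentence splitting needs far more rules):

```typescript
// Toy preprocessing for word2vec: lowercase, strip punctuation, one sentence
// per line. The abbreviation list is a small illustrative sample.
const ABBREVIATIONS = new Set(["mr.", "mrs.", "dr."]);

function preprocess(text: string): string[] {
  const words = text.toLowerCase().replace(/[,;:!?"()]/g, "")
                    .split(/\s+/).filter(w => w.length > 0);
  const sentences: string[][] = [[]];
  for (const w of words) {
    if (w.endsWith(".") && !ABBREVIATIONS.has(w)) {
      // sentence boundary: strip the period and start a new line
      sentences[sentences.length - 1].push(w.slice(0, -1));
      sentences.push([]);
    } else {
      sentences[sentences.length - 1].push(w);
    }
  }
  return sentences.filter(s => s.length > 0).map(s => s.join(" "));
}

console.log(preprocess("Mr. Anderson met Neo. They talked."));
// -> ["mr. anderson met neo", "they talked"]
```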

First I create the multidimensional representation of the words using word2vec, which is just all the words (like a dictionary) and the vector for each word. Next I manually input the characters (or words, in fact) that I want to create the visualization for and calculate the matrix of all mutual distances by taking the cosine of the angle between the vectors. This gives a value between -1 and +1, which I then shift to between 0 and 2 so I have a positive distance between the words. Finally I take this distance matrix and turn it into a 2D visualization, trying to keep the distances in the 2D plot as close as possible to those in the vector space. Of course this is not possible in general; even for 3 vectors it can be impossible (if the triangle inequality is broken). I create the plot by dividing the 2D plane into a grid and placing the first character in the middle. The second character is also easy to place, on a circle with radius equal to the distance. Each following character is placed, one at a time, at the grid position that minimizes the sum of the distance errors to the already placed characters. This is a greedy algorithm that prioritizes the first characters added to the plot, which is why I plotted the main characters first and let the other characters be placed relative to these.
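
A condensed sketch of the two steps just described: shifted cosine distances, then greedy grid placement. The vectors and the grid-to-distance scaling are toy choices:

```typescript
// Distance between two word vectors: 1 - cos(angle), shifting [-1,1] to [0,2].
function cosineDistance(a: number[], b: number[]): number {
  const dot = a.reduce((s, v, i) => s + v * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return 1 - dot / (norm(a) * norm(b));
}

// Greedy placement: each new point goes to the grid cell minimizing the summed
// error against all already placed points (so early points are prioritized).
function place(dist: number[][], size: number): [number, number][] {
  const placed: [number, number][] = [[size / 2, size / 2]]; // first in centre
  for (let i = 1; i < dist.length; i++) {
    let best: [number, number] = [0, 0];
    let bestErr = Infinity;
    for (let x = 0; x < size; x++) {
      for (let y = 0; y < size; y++) {
        const err = placed.reduce((s, [px, py], j) =>
          s + Math.abs((Math.hypot(x - px, y - py) / size) * 2 - dist[i][j]), 0);
        if (err < bestErr) { bestErr = err; best = [x, y]; }
      }
    }
    placed.push(best);
  }
  return placed;
}
```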

I tried to use the Stanford entity extraction tool to extract both locations and persons from a given text, but there were way too many false positives, so I had to manually feed the algorithm the characters. To do it properly I should also have replaced each character mentioned by multiple names with a single name (Gandalf, Gandalf the Grey and Mithrandir are the same character, etc.), but I did not perform this substitution. So when I select the character Gandalf, I only get the contexts where he is mentioned as Gandalf and not as Mithrandir.

And now let’s see some of the 2D visualizations!

Lord of the Rings

0) Frodo
1) Sam
2) Gandalf
3) Gollum
4) Elrond
5) Saruman
6) Legolas
7) Gimli
8) Bilbo
9) Galadriel

______________________________________________________________________
___________________________________________3__________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________________1______________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
5_____________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________________0__________________8___
_______________________2______________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________6______________________________
______________________________________________________________________
______________________________________________________________________
__________________________________7___________________________________
______________________________________________________________________
______________________________________________________________________
_______________________4______________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_____________________________9________________________________________
______________________________________________________________________

 

Jane Austen: Pride and Prejudice + Emma

0) Elizabeth (Elizabeth Bennet)
1) Wickham (George Wickham)
2) Darcy (Mr. Darcy)
3) Bourgh (Lady Catherine de Bourgh)
4) Lydia (Lydia Bennet)
5) William (Mr. William Collins)
6) Kitty (Catherine “Kitty” Bennet)
7) Emma (Emma Woodhouse)
8) Knightley (George Knightley)

__________________________________________________________________________
_________________________________________8________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
___________________________1__________2___________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________7_____
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
_____________________________________________________0____________________
________________3__________4______________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
_______________________________________________________5__________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
___________________6______________________________________________________
__________________________________________________________________________

 

35,000 English Gutenberg books

In this plot, instead of characters, I selected different animals:

0) Fish
1) Cat
2) Dog
3) Cow
4) Bird
5) Crocodile
6) Donkey
7) Mule
8) Horse
9) Snake
__________________________________________________________
_____________________________________________7____________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
____________________________________________6_____________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
________________________________________________________8_
__________________________________________________________
__________________________________________________________
________________________________2_________________________
__________________________________________________________
_______________1__________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
___________________________________________________3______
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__4_______________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________9_______________________________
__________________________________________________________
______________5___________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________0_______________________________
__________________________________________________________

Conclusion

Does the 2D plotting catch some of the essence of the words/characters from the books? Or does it look like they are just thrown at random onto the plane?

I look forward to your conclusion! For the Gutenberg animals plot I believe the visualization really does match how I see the animals: fish and reptiles are grouped together, and in the upper left corner we have the horse family of animals. For the Jane Austen books it is also interesting that the character Emma matches Elizabeth most closely, even though they are from two different books; but then, they are somewhat similar main characters.
