2D visualization of high dimensional word embeddings

In this blog post I try to make a method for a computer to read a text, analyse its characters, and then make a 2D visualization of how similar the characters are. To achieve this I use the word2vec algorithm, then build a distance matrix of all mutual distances and fit them into a 2D plot. The three texts I used were:

  • All 3 Lord of the Rings books
  • Pride and Prejudice + Emma by Jane Austen
  • A combined text of 35,000 free English Gutenberg e-books

Word2Vec is an algorithm invented by Google researchers in 2013. Its input is a text that has been preprocessed, as I explain below. The algorithm then extracts all words and maps each word to a multidimensional vector, typically of 200 dimensions. Think of the quills of a hedgehog where each quill is a word, except that it is in more than 3 dimensions. What is remarkable about the algorithm is that it captures some of the context of the words, and this is reflected in the multidimensional vectors. Words that are somewhat similar are very close in this vector space, where ‘close’ is measured by the angle between two vectors. Furthermore, the relative position of two words also captures a relation between them. A well-known example is that the difference vector from ‘man’ to ‘king’ is almost identical to the difference vector from ‘woman’ to ‘queen’. Using this information you can predict the word ‘queen’ given the three words <man,king> <woman,?>. It is far from obvious why the algorithm shows this behaviour in the vector space, and I have not fully understood the algorithm yet.

Before you can use the word2vec algorithm you have to remove all punctuation, split the text into one sentence per line, and lowercase it. Splitting into sentences is not as simple as splitting whenever you meet a ‘.’ character. For instance, ‘Mr. Anderson’ should not trigger a split.
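The preprocessing described above could look roughly like the sketch below. It is only an illustration, not the code used for this post: the abbreviation list and the function names (`split_sentences`, `clean`, `preprocess`) are made up for the example, and a real text would need a longer abbreviation list and handling of quotes.

```python
import string

# Hypothetical list of abbreviations whose trailing '.' does not end a sentence.
ABBREVIATIONS = {"mr", "mrs", "ms", "dr", "st"}


def split_sentences(text):
    """Split text into sentences at tokens ending in '.', '!' or '?'.

    A period that belongs to a known abbreviation (e.g. 'Mr.') is ignored.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?":
            if token[-1] == "." and token[:-1].lower() in ABBREVIATIONS:
                continue  # 'Mr.' and friends do not end the sentence
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


def clean(sentence):
    """Lowercase a sentence and strip ASCII punctuation."""
    no_punct = sentence.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


def preprocess(text):
    """Return one cleaned, lowercased sentence per list entry."""
    cleaned = (clean(s) for s in split_sentences(text))
    return [s for s in cleaned if s]


if __name__ == "__main__":
    for line in preprocess("Mr. Anderson smiled. Then he left!"):
        print(line)
```

Each returned string is one line of input for word2vec. Note that `string.punctuation` only covers ASCII characters, so typographic quotes would have to be removed separately.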

First I create the multidimensional representation of the words using word2vec; the result is simply all the words (like a dictionary) with the vector for each word. Next I manually input the characters (or any words, in fact) that I want to visualize and calculate the distance matrix of all mutual distances by taking the cosine of the angle between the vectors. This gives a value between -1 and +1, which I then shift to the range 0 to 2 so that I have a positive distance between the words. Finally I take this distance matrix and turn it into a 2D visualization, trying to keep the distances in the 2D plot ‘as close as possible’ to those in the vector space. Of course this is not possible in general. Even for 3 vectors it can be impossible (if the triangle inequality is broken).

I create the plot by dividing the 2D plane into a grid and placing the first character in the middle. The next character is also easy to place: it goes on the circle around the first character whose radius is the distance between them. Each of the following characters is placed, one at a time, in the grid cell that minimizes the sum of the distance errors to the characters already placed. This is a greedy algorithm that prioritizes the first characters added to the plot, which is why I plotted the main characters first and let the other characters be placed relative to them.
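The distance matrix and the greedy grid placement can be sketched as follows. This is a minimal illustration under several assumptions that the text does not pin down: the shift to the range 0 to 2 is taken to be 1 minus the cosine, the error is the absolute difference between grid distance and target distance, and the grid size and the scale factor are arbitrary choices. All function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Assumed shift: 1 - cos(angle), so 0 = same direction, 2 = opposite.

    Both vectors are assumed to be non-zero.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def distance_matrix(vectors):
    """All mutual distances between the given word vectors."""
    n = len(vectors)
    return [[cosine_distance(vectors[i], vectors[j]) for j in range(n)]
            for i in range(n)]


def greedy_layout(dist, size=61):
    """Place items one at a time on a size x size grid.

    Item 0 goes in the centre. Every later item goes to the free cell with
    the smallest sum of |grid distance - target distance| to the items
    already placed, so earlier items are favoured.
    """
    centre = size // 2
    scale = centre / 2.0  # assumption: the maximum distance 2 spans half the grid
    placed = [(centre, centre)]
    for k in range(1, len(dist)):
        best_cell, best_error = None, float("inf")
        for row in range(size):
            for col in range(size):
                if (row, col) in placed:
                    continue
                error = sum(
                    abs(math.hypot(row - r, col - c) - dist[k][j] * scale)
                    for j, (r, c) in enumerate(placed)
                )
                if error < best_error:
                    best_cell, best_error = (row, col), error
        placed.append(best_cell)
    return placed


def render(placed, size=61):
    """Draw the layout as text: '_' for empty cells, the item index otherwise."""
    grid = [["_"] * size for _ in range(size)]
    for label, (row, col) in enumerate(placed):
        grid[row][col] = str(label)
    return "\n".join("".join(line) for line in grid)


if __name__ == "__main__":
    toy_vectors = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # made-up 2D "embeddings"
    print(render(greedy_layout(distance_matrix(toy_vectors), size=21), size=21))
```

In this sketch the second item needs no special case: with only one item placed, the minimization automatically picks a cell on the circle around the centre. Because the search is greedy, the result depends on the order in which the items are listed.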

I tried to use the Stanford entity extraction tool to extract both locations and persons from a given text, but there were far too many false positives, so I had to feed the characters to the algorithm manually. To do it perfectly I should have replaced a character mentioned under multiple names with a single name. Gandalf, Gandalf the Grey and Mithrandir are the same character, for example, but I did not perform this substitution. So when I select the character Gandalf I only get the contexts where he is mentioned as Gandalf and not as Mithrandir.
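The substitution that was not performed here could be sketched like this. It is only an illustration: the alias table and the function name `merge_aliases` are hypothetical, and the sentence is assumed to be already lowercased and stripped of punctuation.

```python
import re

# Hypothetical alias table: every alternative name maps to one canonical token.
ALIASES = {
    "gandalf the grey": "gandalf",
    "mithrandir": "gandalf",
}


def merge_aliases(sentence, aliases=ALIASES):
    """Replace each alias in a lowercased sentence with its canonical name."""
    # Longest alias first, so multi-word names are replaced as a whole.
    for alias in sorted(aliases, key=len, reverse=True):
        pattern = r"\b" + re.escape(alias) + r"\b"
        sentence = re.sub(pattern, aliases[alias], sentence)
    return sentence


if __name__ == "__main__":
    print(merge_aliases("then mithrandir spoke to frodo"))
```

Running this over every sentence before training would let word2vec see all mentions of a character as the same word.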

And now let's see some of the 2D visualizations!

Lord of the Rings

0) Frodo
1) Sam
2) Gandalf
3) Gollum
4) Elrond
5) Saruman
6) Legolas
7) Gimli
8) Bilbo
9) Galadriel

______________________________________________________________________
___________________________________________3__________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________________1______________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
5_____________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________________0__________________8___
_______________________2______________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_______________________________________6______________________________
______________________________________________________________________
______________________________________________________________________
__________________________________7___________________________________
______________________________________________________________________
______________________________________________________________________
_______________________4______________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
______________________________________________________________________
_____________________________9________________________________________
______________________________________________________________________

 

Jane Austen: Pride and Prejudice + Emma

0) Elizabeth (Elizabeth Bennet)
1) Wickham (George Wickham)
2) Darcy (Mr. Darcy)
3) Bourgh (Lady Catherine de Bourgh)
4) Lydia (Lydia Bennet)
5) William (Mr. William Collins)
6) Kitty (Catherine “Kitty” Bennet)
7) Emma (Emma Woodhouse)
8) Knightley (George Knightley)

__________________________________________________________________________
_________________________________________8________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
___________________________1__________2___________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
____________________________________________________________________7_____
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
_____________________________________________________0____________________
________________3__________4______________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
_______________________________________________________5__________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
__________________________________________________________________________
___________________6______________________________________________________
__________________________________________________________________________

 

35,000 English Gutenberg books

In this plot I selected different animals instead of characters.

0) Fish
1) Cat
2) Dog
3) Cow
4) Bird
5) Crocodile
6) Donkey
7) Mule
8) Horse
9) Snake

__________________________________________________________
_____________________________________________7____________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
____________________________________________6_____________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
________________________________________________________8_
__________________________________________________________
__________________________________________________________
________________________________2_________________________
__________________________________________________________
_______________1__________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
___________________________________________________3______
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__4_______________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________9_______________________________
__________________________________________________________
______________5___________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________________________________________
__________________________0_______________________________
__________________________________________________________

Conclusion

Does the 2D plot catch some of the essence of the words/characters from the books? Or does it look like they were just thrown at random onto the plane?

I look forward to your conclusion! For the Gutenberg animals plot I believe the visualization really does match how I see the animals. Fish and reptiles are grouped together, and in the upper right corner we have the horse family of animals. For the Jane Austen plot it is also interesting that the character Emma matches Elizabeth most closely: they come from two different books but are somewhat similar main characters.

About thomasegense

Thomas Egense, mathematician, works at The State and University Library, Denmark.