Prebuilt and trained Word2Vec dictionaries ready for use
Two prebuilt big-data Word2Vec dictionaries have been added to LOAR (Library Open Access Repository) for download. The dictionaries are built from the text of 55,000 e-books from Project Gutenberg and 32,000,000 Danish newspaper pages.
35,000 of the Gutenberg e-books are in English, but more than 50 different languages are present in the Gutenberg dictionary. Even though the languages are mixed together in one corpus, the Word2Vec algorithm did a good job of separating them, so it is almost like having 50 different Word2Vec dictionaries.
The text from the Danish newspapers is not publicly available, so you would not be able to build this dictionary yourself. A total of 300 GB of raw text went into building the dictionary, so it is probably the largest Word2Vec dictionary built on a Danish corpus. Since the Danish newspapers suffer from low-quality OCR, many of the words in the dictionary are misspellings. Using this dictionary, it was possible to fix many of the OCR errors. This works because of the nature of the Word2Vec algorithm: a given word appears in similar contexts regardless of how it is misspelled, so the misspelled variants end up close to the correct word and can be identified by their context. (see https://sbdevel.wordpress.com/2017/02/02/automated-improvement-of-search-in-low-quality-ocr-using-word2vec/)
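To make the idea concrete, here is a minimal sketch in Python of how misspelled variants could be picked out of a dictionary. It is not the method used for the newspaper corpus (the linked post describes that); it only illustrates the principle. The three-dimensional vectors, the misspellings, the function name `ocr_variants` and both thresholds are invented for the example. A real dictionary has hundreds of dimensions and would be loaded from the downloaded files.

```python
import math
from difflib import SequenceMatcher

# Toy vectors, invented for illustration only. In a real dictionary each
# word has a learned vector with hundreds of dimensions.
TOY_VECTORS = {
    "regering": [0.90, 0.10, 0.05],   # correct spelling
    "regerlng": [0.88, 0.12, 0.07],   # hypothetical OCR error (i read as l)
    "rcgering": [0.91, 0.09, 0.04],   # hypothetical OCR error (e read as c)
    "minister": [0.70, 0.30, 0.10],   # related word, but a different word
    "avis":     [0.05, 0.90, 0.20],   # unrelated word
}


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def ocr_variants(word, vectors, min_cosine=0.98, min_spelling=0.7):
    """Return (word, cosine) pairs that are likely misspellings of `word`.

    A candidate must be close in vector space (it occurs in similar contexts)
    AND look similar as a string. Both thresholds are assumptions chosen
    for the toy data above.
    """
    target = vectors[word]
    hits = []
    for other, vec in vectors.items():
        if other == word:
            continue
        context_sim = cosine(target, vec)
        spelling_sim = SequenceMatcher(None, word, other).ratio()
        if context_sim >= min_cosine and spelling_sim >= min_spelling:
            hits.append((other, context_sim))
    # Most similar contexts first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    for variant, score in ocr_variants("regering", TOY_VECTORS):
        print(variant, round(score, 4))
```

The spelling check is there because context similarity alone also pulls in words that are merely related in meaning ("minister" in the toy data), which should not be treated as misspellings.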
Download and more information about the Word2Vec dictionaries:
Online demo of the two corpora: Word2Vec demo