What’s in an Embedding? Analyzing Word Embeddings through Multilingual Evaluation

In the last two years, there has been a surge of word embedding algorithms and research on them. However, evaluation has mostly been carried out on a narrow set of tasks, mainly word similarity/relatedness and word relation similarity, and on a single language, namely English.

We propose an approach to evaluate embeddings on a variety of languages that also yields insights into the structure of the embedding space by investigating how well word embeddings cluster along different syntactic features.

We show that all embedding approaches behave similarly in this task, with dependency-based embeddings performing best. This effect is even more pronounced when generating low-dimensional embeddings.

If you want to use the embeddings, see below for download links.

The Paper

Scripts to run the experiments. Read the README first: you need to adjust the paths in the scripts. Depending on your computing power, the experiments take some days to run (maybe weeks on a single desktop machine) and need more than 500GB of free disk space.

Below, you can see the results of evaluating different embeddings on a set of languages by classifying for different kinds of features. Doesn't make sense? Read the Paper / the paper on INFDok ;-)

Contact me if you have questions: My university homepage

Results overview

gnuplot source

The data points used to draw this graph (embedding-dimensionality-windowsize):

These are the embeddings generated for this paper. The format is as follows: [Language]-[Type]-[dimensionality]-[Window]. Note that w2vf (the dependency-based embeddings) has no window parameter, so 5 and 11 point to the same file. If you use these embeddings, it would be great to cite this paper :-)
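If you handle several of these files in a script, the naming scheme can be taken apart mechanically. The following is only a minimal sketch: the function and class names are made up for illustration, the example name "de-w2vf-10-5" is hypothetical (check the download links for the actual language codes, type names, and any file extension), and it assumes the language code itself contains no hyphen.

```python
from typing import NamedTuple, Optional, Tuple


class EmbeddingName(NamedTuple):
    language: str
    emb_type: str
    dimensionality: int
    window: int


def parse_embedding_name(name: str) -> EmbeddingName:
    """Split a [Language]-[Type]-[dimensionality]-[Window] name into its parts."""
    # Assumption: the language code has no hyphen; the type may contain one.
    language, rest = name.split("-", 1)
    emb_type, dim, window = rest.rsplit("-", 2)
    return EmbeddingName(language, emb_type, int(dim), int(window))


def canonical_key(parsed: EmbeddingName) -> Tuple[str, str, int, Optional[int]]:
    """Key that treats the w2vf window variants as one and the same file."""
    # w2vf (dependency-based) has no window parameter, so the window is ignored.
    if parsed.emb_type == "w2vf":
        return (parsed.language, parsed.emb_type, parsed.dimensionality, None)
    return (parsed.language, parsed.emb_type, parsed.dimensionality, parsed.window)


if __name__ == "__main__":
    # Hypothetical names, used only to illustrate the scheme.
    a = parse_embedding_name("de-w2vf-10-5")
    b = parse_embedding_name("de-w2vf-10-11")
    print(a)
    print(canonical_key(a) == canonical_key(b))
```

Using canonical_key when collecting files avoids downloading or loading the same w2vf embedding twice under its two window labels.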