The original word2vec, written in C, could utilize all the available cores. It's actually refreshing to read that code: a true one-person shop, hard engineering code :)
That's not true - the implementation doesn't lock the global weights when updating, so if you have lots of cores and threads, the asynchronous updates will result in very poor accuracy, making training useless.
If you train on the big dataset produced by demo-train-big-model-v1.sh (which includes news corpora from 2012 and 2013, the One Billion Word benchmark from statmt, the UMBC web corpus, and all of Wikipedia) using only one thread, accuracy on the Google analogy dataset drops to 68% (down from ~71.5% with 20 threads).
This is due to the learning-rate schedule used: the learning rate is reduced linearly with the number of processed words. When K threads are used, the input dataset is split into K parts that are processed in parallel, which means more parts of the dataset get a chance to influence the resulting vectors at the beginning of the computation, where the learning rate is still relatively high.
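For concreteness, the decay in the reference word2vec.c looks roughly like this (reconstructed from memory, so treat names and constants as approximate); the key point is that the counter driving the decay is global across all threads:

    /* Sketch of the linear learning-rate decay in the reference
       word2vec.c: alpha shrinks with the GLOBAL number of words
       processed across all threads, down to a small floor. */
    #include <stdio.h>

    int main(void) {
        const float starting_alpha = 0.025f;   /* skip-gram default */
        const long long train_words = 1000000; /* corpus size */
        const long long iter = 5;              /* epochs */

        for (long long done = 0; done <= iter * train_words;
             done += train_words) {
            float alpha = starting_alpha *
                (1 - done / (float)(iter * train_words + 1));
            if (alpha < starting_alpha * 0.0001f)
                alpha = starting_alpha * 0.0001f;
            printf("words processed=%lld  alpha=%f\n", done, alpha);
        }
        return 0;
    }

With one thread the corpus is visited strictly in order, so by the time the later parts are reached, alpha has already decayed; with K threads every part sees some high-alpha updates.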
No it is not - as the number of cores approaches infinity, validation accuracy will approach zero due to the lack of locking on shared memory. There is definitely a sweet spot in the number of cores for the original code, but it does not scale indefinitely, so it cannot utilize an arbitrary number of cores.
Aligned float updates are atomic on all architectures that matter. Also, unsynchronized parameter updates for SGD have actually been studied in [1], where it was shown that, for sparse problems, they have almost no effect on convergence.
In the limit, performance would indeed suffer as all updates would happen in parallel.
[1] Recht, Benjamin, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." Advances in Neural Information Processing Systems. 2011.
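For the curious, a minimal sketch of that lock-free pattern in the spirit of [1]; the toy least-squares objective and the pthread scaffolding are my own, not the paper's or word2vec's:

    /* Hogwild-style SGD: NTHREADS workers update a shared weight
       vector with plain, unsynchronized stores. Each step touches a
       single coordinate, mirroring the sparsity assumption in [1].
       The toy objective just pulls every weight toward 1.0. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define DIM      64
    #define NTHREADS 4
    #define STEPS    1000000L

    static float weights[DIM];       /* shared; no locks, no atomics */

    static void *worker(void *arg) {
        unsigned int seed = (unsigned int)(size_t)arg + 1;
        for (long s = 0; s < STEPS; s++) {
            int i = rand_r(&seed) % DIM;     /* sparse update */
            float grad = 2.0f * (weights[i] - 1.0f);
            weights[i] -= 0.01f * grad;      /* racy read-modify-write */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("weights[0] after racy SGD: %f (target 1.0)\n", weights[0]);
        return 0;
    }

Despite the races, the result lands very close to the optimum, because collisions on any single coordinate are rare when updates are sparse.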
The summary would be that accuracy per pass suffers slightly, but since the speedup is close to linear for the first dozen or so cores, each pass runs much faster. The net result is that the wall time to reach a given level of accuracy is much shorter despite the slightly lower accuracy per pass.
That just sounds like a poorly tuned algorithm/learning rate in the single-threaded case. You could certainly emulate the parallel algorithm sequentially if you wanted to.
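As a sketch of that emulation, under my own assumptions (train_on_word is a hypothetical hook, not part of word2vec.c): a single thread round-robins over K parts of the corpus while sharing one globally decaying alpha, so early high-alpha updates see all K parts, just as in the parallel run:

    /* Sequentially emulating the K-thread schedule: interleave K
       corpus parts on one thread so every part gets some high-alpha
       updates early on. train_on_word() is a hypothetical hook. */
    #include <stdio.h>

    #define K        4      /* number of "virtual threads" */
    #define PART_LEN 1000   /* words per part; toy value */

    int main(void) {
        const long total = (long)K * PART_LEN;
        const float starting_alpha = 0.025f;
        float alpha = starting_alpha;
        long pos[K] = {0};  /* per-part read cursors */

        for (long done = 0; done < total; done++) {
            int part = done % K;             /* round-robin over parts */
            alpha = starting_alpha * (1 - done / (float)(total + 1));
            if (alpha < starting_alpha * 1e-4f)
                alpha = starting_alpha * 1e-4f;
            /* train_on_word(part, pos[part], alpha); */
            pos[part]++;
        }
        printf("final alpha: %f\n", alpha);
        return 0;
    }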
"...and process hundreds of millions of words per second, which is the fastest word2vec implementation to the best of our knowledge." Sweet! Just today I saw someone on the gensim mailing list trying to process 250Gb worth of data feeding it into word2vec https://groups.google.com/forum/#!topic/gensim/QvSJd4Ma6oE
100M words/sec is quite impressive, although this approach does not seem able to handle very large dictionaries.
If the authors are reading this, it would be nice to see the actual accuracy on the Google analogy dataset (the paper states it is within 1% of the reference implementation) and the performance on the large dataset produced by demo-train-big-model-v1.sh.
Incidentally, at Yahoo we can learn from a dataset of 1,066 billion words, with a dictionary of 1.42 billion unique terms, in 7,344 seconds (~145M words/sec).
We formulate SGNS word2vec as a distributed graph problem, where the nodes are all unique tokens in the corpus (the dictionary) and the edges are defined by skipgrams: for a skipgram (w_in, w_center), there is an edge from w_in to w_center.
Tokens are randomly distributed over a set of workers. Each worker iterates over its edges in parallel with all other workers and performs the appropriate computation.
Drawing negative samples is done in two steps: we first draw a worker W from a suitable distribution over the workers, and then draw a word from W. The overall word sampling is the same as in the reference implementation (i.e., the unigram distribution raised to the 3/4 power).
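A single-process sketch of that two-step draw; the toy counts and the word-to-worker assignment below are my own assumptions for illustration, not the paper's:

    /* Two-step negative sampling: (1) draw a worker proportional to
       the total unigram^(3/4) mass of the words it owns, then
       (2) draw a word within that worker proportional to its own
       unigram^(3/4) weight. Composed, this equals a draw from the
       global unigram^(3/4) distribution. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define VOCAB   8
    #define WORKERS 2

    int main(void) {
        double count[VOCAB] = {50, 20, 10, 8, 5, 4, 2, 1}; /* toy counts */
        int    owner[VOCAB] = { 0,  1,  0, 1, 0, 1, 0, 1}; /* word -> worker */

        /* per-word weight = count^(3/4); per-worker total mass */
        double w[VOCAB], mass[WORKERS] = {0}, total = 0;
        for (int i = 0; i < VOCAB; i++) {
            w[i] = pow(count[i], 0.75);
            mass[owner[i]] += w[i];
            total += w[i];
        }

        srand(7);

        /* step 1: pick a worker proportional to the mass it owns */
        double r = rand() / (double)RAND_MAX * total;
        int wk = 0;
        while (wk < WORKERS - 1 && r > mass[wk]) r -= mass[wk++];

        /* step 2: pick a word owned by that worker */
        double s = rand() / (double)RAND_MAX * mass[wk];
        int neg = -1;
        for (int i = 0; i < VOCAB; i++) {
            if (owner[i] != wk) continue;
            neg = i;                  /* fallback: last owned word */
            if (s <= w[i]) break;
            s -= w[i];
        }
        printf("negative sample: word %d (via worker %d)\n", neg, wk);
        return 0;
    }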
This work will soon be made public [1].
[1] Stergios Stergiou, Zygimantas Straznickas, Rolina Wu and Kostas Tsioutsiouliklis. "Distributed Negative Sampling for Word Embeddings." AAAI 2017.
The benchmarks on the Titan X are taken from a 2015 paper, so they are definitely testing the older Maxwell (as opposed to Pascal) GPU. The differences in FLOPS and memory bandwidth are significant.
https://github.com/facebookresearch/fastText