The original word2vec, written in C, could utilize all the available cores. It's actually refreshing to read that code: a true one-person shop, hard engineering code :)
That's not true - the implementation doesn't lock the global weights when updating, so if you have lots of cores and threads, the asynchronous updates will result in very poor accuracy, making training useless.
If you train on the big dataset produced by demo-train-big-model-v1.sh (which includes news corpora from 2012 and 2013, the One Billion Word benchmark from statmt, the UMBC web corpus, and all of Wikipedia) using only one thread, accuracy on the Google analogy dataset drops to 68% (down from ~71.5% with 20 threads).
This is due to the learning-rate schedule used: the learning rate is reduced linearly with the number of processed words. When K threads are used, the input dataset is split into K parts that are processed in parallel, which means more parts of the dataset get a chance to influence the resulting vectors at the beginning of the computation, where the learning rate is still relatively high.
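For concreteness, the decay in the reference word2vec.c looks roughly like this (reconstructed from memory, so treat names and constants as approximate); the key point is that the counter driving the decay is global across all threads:

    /* Sketch of the linear learning-rate decay in the reference
       word2vec.c: alpha shrinks with the GLOBAL number of words
       processed across all threads, down to a small floor. */
    #include <stdio.h>

    int main(void) {
        const float starting_alpha = 0.025f;   /* skip-gram default */
        const long long train_words = 1000000; /* corpus size */
        const long long iter = 5;              /* epochs */

        for (long long done = 0; done <= iter * train_words;
             done += train_words) {
            float alpha = starting_alpha *
                (1 - done / (float)(iter * train_words + 1));
            if (alpha < starting_alpha * 0.0001f)
                alpha = starting_alpha * 0.0001f;
            printf("words processed=%lld  alpha=%f\n", done, alpha);
        }
        return 0;
    }

With one thread the corpus is visited strictly in order, so by the time the later parts are reached, alpha has already decayed; with K threads every part sees some high-alpha updates.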
No it is not - as the number of cores approaches infinity, validation accuracy will approach zero due to the lack of locking on shared memory. There is definitely a sweet spot in the number of cores for the original code, but it does not scale indefinitely, so it cannot utilize an arbitrary number of cores.
Aligned float updates are atomic on all architectures that matter. Also, unsynchronized parameter updates for SGD have actually been studied in [1], where it was shown that, for sparse problems, they have almost no effect on convergence.
In the limit, performance would indeed suffer as all updates would happen in parallel.
[1] Recht, Benjamin, et al. "Hogwild: A lock-free approach to parallelizing stochastic gradient descent." Advances in Neural Information Processing Systems. 2011.
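For the curious, a minimal sketch of that lock-free pattern in the spirit of [1]; the toy least-squares objective and the pthread scaffolding are my own, not the paper's or word2vec's:

    /* Hogwild-style SGD: NTHREADS workers update a shared weight
       vector with plain, unsynchronized stores. Each step touches a
       single coordinate, mirroring the sparsity assumption in [1].
       The toy objective just pulls every weight toward 1.0. */
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define DIM      64
    #define NTHREADS 4
    #define STEPS    1000000L

    static float weights[DIM];       /* shared; no locks, no atomics */

    static void *worker(void *arg) {
        unsigned int seed = (unsigned int)(size_t)arg + 1;
        for (long s = 0; s < STEPS; s++) {
            int i = rand_r(&seed) % DIM;     /* sparse update */
            float grad = 2.0f * (weights[i] - 1.0f);
            weights[i] -= 0.01f * grad;      /* racy read-modify-write */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (size_t i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("weights[0] after racy SGD: %f (target 1.0)\n", weights[0]);
        return 0;
    }

Despite the races, the result lands very close to the optimum, because collisions on any single coordinate are rare when updates are sparse.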
The summary would be that accuracy per pass suffers slightly, but since the speedup is close to linear for the first dozen or so cores, each pass runs much faster. The net result is that the wall time to reach a given level of accuracy is much shorter despite the slightly lower accuracy per pass.
That just sounds like a poorly tuned algorithm/learning rate in the single-threaded case. You could certainly emulate the parallel algorithm sequentially if you wanted to.
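As a sketch of that emulation, under my own assumptions (train_on_word is a hypothetical hook, not part of word2vec.c): a single thread round-robins over K parts of the corpus while sharing one globally decaying alpha, so early high-alpha updates see all K parts, just as in the parallel run:

    /* Sequentially emulating the K-thread schedule: interleave K
       corpus parts on one thread so every part gets some high-alpha
       updates early on. train_on_word() is a hypothetical hook. */
    #include <stdio.h>

    #define K        4      /* number of "virtual threads" */
    #define PART_LEN 1000   /* words per part; toy value */

    int main(void) {
        const long total = (long)K * PART_LEN;
        const float starting_alpha = 0.025f;
        float alpha = starting_alpha;
        long pos[K] = {0};  /* per-part read cursors */

        for (long done = 0; done < total; done++) {
            int part = done % K;             /* round-robin over parts */
            alpha = starting_alpha * (1 - done / (float)(total + 1));
            if (alpha < starting_alpha * 1e-4f)
                alpha = starting_alpha * 1e-4f;
            /* train_on_word(part, pos[part], alpha); */
            pos[part]++;
        }
        printf("final alpha: %f\n", alpha);
        return 0;
    }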
"...and process hundreds of millions of words per second, which is the fastest word2vec implementation to the best of our knowledge." Sweet! Just today I saw someone on the gensim mailing list trying to process 250Gb worth of data feeding it into word2vec https://groups.google.com/forum/#!topic/gensim/QvSJd4Ma6oE
100M words/sec is quite impressive, although this approach does not seem able to handle very large dictionaries.
If the authors are reading this, it would be nice to see the actual accuracy on the Google analogy dataset (the paper states it is within 1% of the reference implementation) and the performance on the large dataset produced by demo-train-big-model-v1.sh.
Incidentally, at Yahoo we can learn from a dataset of 1,066 billion words, with a dictionary of 1.42 billion unique terms, in 7,344 seconds (~145M words/sec).
We formulate SGNS word2vec as a distributed graph problem, where the nodes are all unique tokens in the corpus (the dictionary) and the edges are defined by skipgrams: for a skipgram (w_in, w_center), there is an edge from w_in to w_center.
Tokens are randomly distributed over a set of workers. Each worker iterates over its edges in parallel with all other workers and performs the appropriate computation.
Drawing negative samples is done in two steps: we first draw a worker W from a suitable distribution over the workers, and then draw a word from W. The overall word sampling is the same as in the reference implementation (i.e., the unigram distribution raised to the 3/4 power).
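A single-process sketch of that two-step draw; the toy counts and the word-to-worker assignment below are my own assumptions for illustration, not the paper's:

    /* Two-step negative sampling: (1) draw a worker proportional to
       the total unigram^(3/4) mass of the words it owns, then
       (2) draw a word within that worker proportional to its own
       unigram^(3/4) weight. Composed, this equals a draw from the
       global unigram^(3/4) distribution. */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define VOCAB   8
    #define WORKERS 2

    int main(void) {
        double count[VOCAB] = {50, 20, 10, 8, 5, 4, 2, 1}; /* toy counts */
        int    owner[VOCAB] = { 0,  1,  0, 1, 0, 1, 0, 1}; /* word -> worker */

        /* per-word weight = count^(3/4); per-worker total mass */
        double w[VOCAB], mass[WORKERS] = {0}, total = 0;
        for (int i = 0; i < VOCAB; i++) {
            w[i] = pow(count[i], 0.75);
            mass[owner[i]] += w[i];
            total += w[i];
        }

        srand(7);

        /* step 1: pick a worker proportional to the mass it owns */
        double r = rand() / (double)RAND_MAX * total;
        int wk = 0;
        while (wk < WORKERS - 1 && r > mass[wk]) r -= mass[wk++];

        /* step 2: pick a word owned by that worker */
        double s = rand() / (double)RAND_MAX * mass[wk];
        int neg = -1;
        for (int i = 0; i < VOCAB; i++) {
            if (owner[i] != wk) continue;
            neg = i;                  /* fallback: last owned word */
            if (s <= w[i]) break;
            s -= w[i];
        }
        printf("negative sample: word %d (via worker %d)\n", neg, wk);
        return 0;
    }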
This work will soon be made public [1].
[1] Stergios Stergiou, Zygimantas Straznickas, Rolina Wu and Kostas Tsioutsiouliklis. "Distributed Negative Sampling for Word Embeddings." AAAI 2017.
The benchmarks on the Titan X are taken from a 2015 paper, so they are definitely testing the older Maxwell (as opposed to Pascal) GPU. The differences in FLOPS and memory bandwidth are significant.
https://github.com/facebookresearch/fastText