Language Models, Word2Vec, and Efficient Softmax Approximations (rohanvarma.me)
117 points by rvarma on Oct 29, 2017 | 33 comments



How can this be used for full-text search, e.g. with Lucene? The first step in indexing a document for full-text search is reducing each word to its base form, and similarly for a search string. While it's not a difficult problem in English, in some languages (e.g. Hebrew) it's notoriously hard to figure out the base form of a word and further disambiguate its meaning, as the only way to do so is based on context. So how can you easily build a stemmer/lemmatizer on top of these tools to perform such a task?


This presentation from 2015 [1] answers your question. The basic idea is to create a list of keywords and/or phrases for your corpus. It can either be done manually or automatically using gensim or a similar tool. Then use word2vec to create vectors for the keywords and phrases. Cluster the vectors and use the clusters as "synonyms" at both index and query time using a Solr synonyms file.

You can also use Brown clustering [3] to create the clusters. It does a good job and is faster to compute than clustered word vectors. However, clustered word vectors typically have better semantic performance.

1. https://www.slideshare.net/mobile/lucidworks/implementing-co...

2. Demo source: https://github.com/DiceTechJobs/ConceptualSearch

3. https://github.com/percyliang/brown-cluster
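
To make the pipeline above concrete (keywords -> word2vec vectors -> clusters -> synonym groups), here is a minimal sketch using gensim and scikit-learn; the file path, keyword list, and cluster count are placeholders rather than anything from the talk:

    # Sketch only: cluster pretrained word vectors with k-means and group
    # domain keywords by cluster id. The path and keyword list are made up.
    from gensim.models import KeyedVectors
    from sklearn.cluster import KMeans

    vectors = KeyedVectors.load_word2vec_format("vectors.bin", binary=True)
    keywords = ["java", "java_developer", "python", "machine_learning",
                "data_scientist", "deep_learning"]        # your domain terms (placeholder)
    keywords = [k for k in keywords if k in vectors]      # keep terms that have a vector
    X = [vectors[k] for k in keywords]

    kmeans = KMeans(n_clusters=2, random_state=0).fit(X)  # e.g. 50/500/5000 on a real list
    clusters = {}
    for word, label in zip(keywords, kmeans.labels_):
        clusters.setdefault(label, []).append(word)
    # each clusters[label] is one "synonym" group for the Solr synonyms file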


Some clarifying questions:

1) Do words in a generic corpus (such as Wikipedia) actually form well-separated clusters?

2) Is it correct that you find word clusters in the corpus as a preprocessing step (as opposed to at indexing or query time)?

3) Do I understand correctly that you use all words in clusters as synonyms and pass them to Solr at query/indexing time? Is it query time, index time or both?

4) Given a language where words have many syntactic forms (e.g. buy-bought-buying), how does it work with clusters? Do both syntactic forms and synonyms end up in the same cluster? Wouldn't it be beneficial to treat many of these different forms as the same word (i.e. perform stemming) and only list truly different, but closely related concepts as synonyms?


1. It should. The talk recommends using multiple cluster sizes (e.g. 50, 500, 5000) and giving more weight in the query to smaller clusters. Ideally you would run word2vec on your own domain-specific corpus and then cluster, but that only works if your corpus is of sufficient size.

2. Correct. The goal of the pre-processing step is to generate a Solr synonyms file which can be added to your index mapping.

3a. You could use all the words, but in general I would advise against it. Using all the words from Wikipedia or Google News would be similar to using a thesaurus, which can add a lot of noise. For example, the word "cocoa" could mean chocolate, a city in Florida, or a programming framework. It is better to use a list of domain-specific keywords and phrases as a filter for which words are added to the Solr synonyms file. However, if your corpus is Wikipedia, Google News, or something equally generic, then using all the words makes sense.

3b. It must be both query and index time. For example, the phrase "java developer" would have the mapping "java developer => cluster_15" in the synonyms file. In order for the search terms "java developer" to match cluster_15, "cluster_15" must be indexed in place of "java developer". (A sketch of generating such a file follows this list.)

4. The different forms will most likely end up in the same cluster, but stemming would guarantee it.
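
To illustrate 3b, here is a rough sketch of generating such a synonyms file, reusing the hypothetical clusters dict from the clustering sketch above:

    # Sketch: write "term => cluster_N" lines for a Solr synonyms file.
    # `clusters` is the {label: [terms]} dict from the clustering sketch above.
    with open("synonyms.txt", "w") as f:
        for label, terms in clusters.items():
            for term in terms:
                # phrases were joined with underscores; Solr expects spaces
                f.write("%s => cluster_%d\n" % (term.replace("_", " "), label))
    # produces lines such as:  java developer => cluster_15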


4) But recall that the language in question is one where stemming is hard. If you expand every form of every word in Hebrew, you obtain something like 600,000 words. And many of them have completely different meanings due to syntactic coincidences and short roots. So, ideally, the first step would be to a) determine what exactly each word in the document means in the given context, and b) replace it with an unambiguous identifier.

For example, in Hebrew the word BRHA can mean several things: "pool", "blessing", "in soft" and "her knee" (no kidding).


I don't know anything about Hebrew. But maybe lemmatization would be better than stemming since it takes meaning and context into account. It is also possible that it is an unnecessary step for clustered word vectors in your use case. If it were me, I would try without stemming/lemmatization first.

EDIT

I found this Hebrew analyzer for Lucene/Solr/Elasticsearch [1] which appears to do stemming or lemmatization. Potentially you could use the output of the analyzer as the input to word2vec.

1. https://github.com/synhershko/HebMorph


Please forgive me for attempting to milk as much as possible out of this discussion - I just don't have many opportunities to get useful advice on this subject, and I've been mulling it over for a long time.

> But maybe lemmatization would be better than stemming

You're right, I'm using "stemming" and "lemmatization" interchangeably where I shouldn't. What I mean is lemmatization.

> It is also possible that it is an unnecessary step for clustered word vectors for your use case

I don't focus on a specific use case, I'm just trying to find a way to enable full-text search for Hebrew. Searching based on concept similarity is a very cool addition, though, and I do have some use cases in mind for it specifically. But I'm just thinking about what a typical cluster would look like, and I imagine 99.9% of it will be different forms of the same handful of base forms. Furthermore, telling Lucene to match based on all these forms will inevitably create a large number of false positives due to the aforementioned abundance of homonyms. So I can see a clear problem here even now. That's why I keep reiterating my original question of whether these tools can first be used for lemmatization, and only then for everything else.


For English NLP, I often stem first because it usually reduces noise. I think your main concern is that the abundance of homonyms will increase noise which is certainly possible. Because I don't know Hebrew, I don't have any intuition on what may work. My advice is to experiment. Cluster some Hebrew text without lemmatization, cluster with lemmatization using that Hebrew analyzer I linked, and see what the results are. Also maybe a literature review will yield experiments done with Hebrew and word embeddings/vectors. Sorry I cannot be of more help.

EDIT

I found this paper which may answer your question about lemmatization and word vectors.

http://www.openu.ac.il/iscol2015/downloads/ISCOL2015_submiss...


Thanks, I know about HebMorph. Its authors don't want it to be used for commercial purposes (at least not for free), so that limits its usability beyond simple experiments. As for your second link, it confirms my suspicion that lemmatization is important for Hebrew, but the code referenced in the footnotes is equally hostile to commercial usage. I was really hoping word2vec or other new tools would enable building a lemmatizer from scratch without much hassle.

Thanks for your advice, anyway.


I think using word vectors for lemmatization is an interesting idea and on the cutting edge. Here is a paper which discusses it. https://link.springer.com/chapter/10.1007/978-3-662-49192-8_...


Thanks! That paper is extremely helpful. Still, there is one thing missing to complete the picture for me right now. At the input, I have a list of words that I want to index or query. When indexing, they usually form a sentence; when querying, they might just be keywords. But in both cases the words will usually be selected by the user/author in such a way that a human who reads all the words from the input together is able to disambiguate the meaning of every word. An automated way to do that disambiguation is precisely what I'm missing.

Let's say the user entered three words: A B C. You look up each of them among the vectors and discover that there are three matching vectors for A, four for B and five for C (and for the sake of generality let's assume that there are more words than just 3 in the input, so it's impractical to test every subset of these words for co-occurrence). How do you jointly select the correct vector for each of the words?


Actually, I have an idea, albeit not without some doubts.

Let x1 be the number of vectors matching A, x2 the number of vectors matching B, and so on up to xn. Let c1..cn be a particular selection of vectors. Now my main assumption is that in order to determine which of these vectors are most often encountered together in the same context [1], our goal is to find the j that maximizes sum_{i from 1 to n, i != j}[d_i], where d_i = (c_j dot c_i) if the dot product is nonnegative, otherwise d_i = 0. I'm not sure this is valid, primarily because I don't know whether summing these dot products adds apples to apples or apples to oranges.

Then, in order to find the best selection of vectors c1..cn, we can iterate over every vector v_k matching A, dot v_k with every vector matching B and pick the maximum m2 (or 0 if it's negative); dot v_k with every vector matching C and pick the maximum m3; and so on. Thus, for the k-th iteration we obtain the selection of vectors that maximizes M_1k = sum_i[m_i]. After we're done with all x1 iterations, we pick the best such selection M1 = max_k[M_1k]. This is all done in O(x1 * (x2 + x3 + ... + xn)) time.

Next, we repeat the above process for all x2 vectors matching B and obtain M2, etc, etc. Ultimately, we pick the selection of vectors that produced the highest M_t across all choices of t. Overall, we get O((sum_i[xi])^2), which seems fast enough. What do you think?

[1] One obvious problem is this limits the number of contexts we match against to just one.
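
In code, the procedure I have in mind would look roughly like this (an untested numpy sketch of my own description):

    # Untested sketch of the selection procedure described above.
    # candidates[i] is the list of sense vectors matching the i-th input word.
    import numpy as np

    def best_selection(candidates):
        best_score, best_pick = -np.inf, None
        for t, vs in enumerate(candidates):           # anchor word t
            for v in vs:                              # candidate vector v_k for t
                pick, score = [None] * len(candidates), 0.0
                pick[t] = v
                for i, others in enumerate(candidates):
                    if i == t:
                        continue
                    dots = [float(np.dot(v, c)) for c in others]
                    j = int(np.argmax(dots))
                    pick[i] = others[j]
                    score += max(0.0, dots[j])        # negative dot products count as 0
                if score > best_score:
                    best_score, best_pick = score, pick
        return best_pick                              # one chosen vector per input word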


Each token would have one vector from word2vec. A token could be a word or a phrase depending on the pre-processing. The words in a phrase are usually concatenated with an underscore. I recommend gensim if you want/need phrases.


Ah, you're right, word2vec assigns one vector to each word, as opposed to one vector to each meaning. Then the problem remains: we can't differentiate between homonyms.

But it seems it's been solved, too: https://github.com/sbos/AdaGram.jl


There is also sense2vec which I think tries to do something similar. https://explosion.ai/blog/sense2vec-with-spacy


Also, this makes me wonder what other things you can do with vectors. If you compute the dot product between a verb or a noun and the vector "singular" - "plural", will it give a positive value for plurals and a negative one for singulars (or vice versa)?


No idea. Experiment!
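
For instance, you could project words onto that difference vector and look at the sign of the result; a rough gensim/numpy sketch with hypothetical example words:

    # Sketch: dot each word with a plural-minus-singular direction and check the sign.
    # Assumes `vectors` is a loaded gensim KeyedVectors model; the words are just examples.
    import numpy as np

    direction = vectors["cars"] - vectors["car"]
    for w in ["dog", "dogs", "table", "tables"]:
        print(w, float(np.dot(vectors[w], direction)))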


Thank you! This sounds exactly like what I was imagining. Very exciting!


For integrating into Solr, I've used Word2vec to improve the ranking of the synonyms it finds (boosting synonyms by how similar they are to the query). In English, Word2vec tends to treat plurals as synonyms.

I did a talk on this with more details: https://www.slideshare.net/GarySieling/ai-with-the-best-buil...
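
As a rough illustration of the boosting idea (not necessarily how the talk does it), you could expand a query term with its nearest neighbours and use the similarity score as a Solr/Lucene boost:

    # Sketch: expand a query term with word2vec neighbours, boosted by similarity.
    # Assumes `vectors` is a loaded gensim KeyedVectors model.
    def expand_query(term, topn=5):
        parts = [term]
        for word, sim in vectors.most_similar(term, topn=topn):
            parts.append("%s^%.2f" % (word, sim))   # Lucene boost syntax: term^weight
        return " ".join(parts)

    # e.g. expand_query("developer") might yield 'developer programmer^0.81 ...' (illustrative)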

Sense2Vec might also help: https://github.com/explosion/sense2vec

There is also a project called Vespa, which looks potentially interesting as a replacement for Lucene - https://github.com/vespa-engine/vespa


The Dice talk mentions that too (weighting by word2vec similarity). It's important to note that word2vec, LSA, sense2vec, etc. all find words that are RELATED but not necessarily SYNONYMOUS. For instance, antonyms like black and white, or rich and poor, often appear in the same contexts and have the same word type, but are opposite in meaning. Similarly, politicians on opposite ends of the political spectrum will usually get assigned similar vectors and the same cluster, as they tend to appear in similar contexts. These models use the context (the word window in word2vec and GloVe, the document in LSA) to determine a measure of similarity, but the context for antonyms is typically very similar. Attaching the part-of-speech tag to each word before pushing it through these models can help, as it enforces that grammatical relation, but this won't address all of these issues (e.g. black and white are both adjectives). In my experience, if people mostly search for nouns in your search engine (e.g. job search) this issue is less of a concern, but it can still cause problems. Finally, conceptual search can also help with precision - by matching across all concepts within a document, you can help disambiguate its meaning, when you have words that have multiple meanings.
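
For example, appending POS tags before training could look like this (a sketch with NLTK and gensim, not taken from either talk):

    # Sketch: append a part-of-speech tag to each token before training word2vec,
    # so that e.g. "book_NN" and "book_VB" get separate vectors.
    # Requires nltk.download("averaged_perceptron_tagger"); `corpus` is an
    # iterable of token lists (placeholder).
    import nltk
    from gensim.models import Word2Vec

    def tag_sentence(tokens):
        return ["%s_%s" % (w.lower(), tag) for w, tag in nltk.pos_tag(tokens)]

    tagged_corpus = [tag_sentence(s) for s in corpus]
    model = Word2Vec(tagged_corpus, vector_size=100, window=5, min_count=5)  # gensim 4.x args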


> Finally, conceptual search can also help with precision - by matching across all concepts within a document, you can help disambiguate its meaning, when you have words that have multiple meanings

Thanks! I've written up something along these lines here: https://news.ycombinator.com/item?id=15592196

I'd love to hear your opinion if this is going to work.


Run Doc2Vec (or Word2Vec) on a large corpus of text, or download pretrained vectors. To compute a document vector, take a linear combination of the word vectors in the document, weighted by TF-IDF. Now that you have vectors for each document, you need to create a fast index with a library called Annoy. It can do very fast similarity search in vector space over millions of documents. I think this approach is faster than grep and doesn't need to bother with stemming. It will automatically know that "machine learning" and "neural nets" are related, so it does a kind of fuzzy search.
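
A minimal sketch of that indexing step with gensim and Annoy (paths, the number of trees, and the weighting details are placeholders):

    # Sketch: TF-IDF-weighted average of word vectors per document, indexed with Annoy.
    # Assumes `vectors` is a gensim KeyedVectors model and `docs` is a list of token lists.
    import numpy as np
    from annoy import AnnoyIndex
    from gensim.corpora import Dictionary
    from gensim.models import TfidfModel

    dictionary = Dictionary(docs)
    tfidf = TfidfModel(dictionary=dictionary)
    dim = vectors.vector_size

    index = AnnoyIndex(dim, "angular")            # angular distance ~ cosine similarity
    for doc_id, tokens in enumerate(docs):
        vec = np.zeros(dim)
        for term_id, weight in tfidf[dictionary.doc2bow(tokens)]:
            word = dictionary[term_id]
            if word in vectors:
                vec += weight * vectors[word]
        index.add_item(doc_id, vec)
    index.build(10)                               # 10 trees; more trees = better recall
    # query with index.get_nns_by_vector(query_vec, 10) to get the 10 nearest documents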


If you wanted it to know that "machine learning" and "neural networks" were related, wouldn't you need to do some type of entity extraction first, since Word2vec is run on tokens?


You can use Gensim:

    from gensim.models.phrases import Phrases
    # corpus: an iterable of tokenised sentences, e.g. [["java", "developer", ...], ...]
    bigrams = Phrases(corpus)
    # bigrams[sentence] joins frequent pairs with "_", e.g. "machine_learning"
Or you could rank bigrams by count(w1 w2)^2 / (count(w1) * count(w2)).

Many variations on this formula work, but the idea is to compare the count of the bigram to the counts of the unigrams.
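
As an illustration of that count-based ranking (my own sketch, not from any particular library):

    # Sketch: rank bigrams by count(w1 w2)^2 / (count(w1) * count(w2)).
    # `corpus` is an iterable of token lists, as above.
    from collections import Counter

    unigrams, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigram_counts.update(zip(sentence, sentence[1:]))

    scores = {
        (w1, w2): c * c / (unigrams[w1] * unigrams[w2])
        for (w1, w2), c in bigram_counts.items()
    }
    top_phrases = sorted(scores, key=scores.get, reverse=True)[:100]  # best candidates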

By the way, you do bigram identification before Word2Vec to have specialized vectors for bigrams as well.

Besides this method, there is one great way to identify ngrams: use Wikipedia titles. It's quite an extensive list that covers most of the important named entities, locations, and multi-word topic names; or go directly to http://wiki.dbpedia.org/ for a huge list with millions of ngrams. Cross-reference it with your text corpus and you get a nice clean list.


The original word2vec source code comes with a probabilistic phrase detection tool. Keyword: word2phrase.


Good to know, thanks!


As an alternative to tf-idf, there's an interesting property of the word embeddings generated by word2vec: the vocabulary is sorted by frequency (the most common words are at the top of the list), so a word's position is effectively a rarity rank.

So if you insert them in the same order into a database, you can just use their primary key as a weight for each word. This also has the advantage of filtering out stop words without any additional processing.
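
For example, with gensim the vocabulary index itself can serve as that weight (a sketch; the stop-word cut-off is arbitrary):

    # Sketch: use the position in the frequency-sorted vocabulary as a rarity weight.
    # Assumes `vectors` is a gensim 4.x KeyedVectors model loaded from word2vec output,
    # whose vocabulary is ordered most-frequent-first.
    def rarity(word):
        return vectors.key_to_index[word]      # 0 for the most common word

    def is_stopword(word, threshold=100):      # arbitrary cut-off
        return rarity(word) < threshold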


If I understand correctly, this forgoes Lucene entirely. I would really like something that can be integrated into Lucene/Solr due to the availability of all the infrastructure built around it.

> works faster than grep

I didn't quite get the connection to grep.


Suppose you have gigabytes of text; Annoy will find matching articles faster and more precisely than grepping for keywords.


Lucene is faster and better than grep too. Annoy may be better than Lucene's "more like this" query, which finds documents in an index that are similar to a given set of documents. But how would it be helpful for keyword search, which is what is being asked about?


I know, inverted index search is fast, it is the basic search engine algorithm, but there is a difference in quality of top ranked results. With word vectors you can ensure the topic of the whole document is what you want. Many documents mix topics and some keywords appear by mistake in the wrong place, for example, because scraping web text is imperfect and might capture extra text.


Nice write-up! I also found this one useful in terms of high-level implementation: http://adventuresinmachinelearning.com/word2vec-keras-tutori...

Initially it was not obvious to me that the dot product was even part of the model. In hindsight it's intuitive: a pair of highly similar vectors will have a high dot product, which yields a high sigmoid activation. This also motivates the use of cosine similarity, which is just a normalized dot product. Likely obvious to some, but this eluded me for the first few days I studied this model.
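
In code, the relationship is simply (a trivial numpy illustration for two word vectors u and v):

    # Cosine similarity is the dot product of the two vectors after length normalisation.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    score = sigmoid(np.dot(u, v))                                    # word2vec pair score
    cosine = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))  # normalised dot product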


Thanks for the link! It actually provides a really clear and intuitive explanation of the notion of similarity.



