Good point!
I would see this rather as yet another argument for why you should simply give the actual output of the NLP algorithm.
So if people actually do the calculation King-Man+Woman and it comes closest to King, then they should report "King-Man+Woman~=King" and not "King-Man+Woman=Queen" (only because that's what they expected).
To be honest, I think the idea that we should expect ML algorithms to give a single, certain answer is misguided. I would expect the output from this algorithm to be "King - Man + Woman = King (90%), Queen (83%), Prince (70%)" or something like that, i.e. a list of answers with some measure of how "good" those answers are. Then again, I work in a field that doesn't really have categorical answers so maybe I'm missing something obvious.
That's pretty much correct.
You would typically calculate a vector for "King-Man+Woman" and then do a query on this based on a cosine distance (or similar measure) over the entire vocabulary.
The query would give you a ranked list of the closest word vectors with scores that indicate how good the match is.
But the example is only performing vector operations. You could perhaps normalize the cosine similarities over a candidate set with a softmax to get something probability-like, but what's being presented in the paper is simply the "closest" vector after the arithmetic, measured by cosine distance.
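To make the procedure concrete, here's a minimal sketch of that pipeline: form the vector King-Man+Woman, then rank the whole vocabulary by cosine similarity to it. The embedding table below is a hypothetical toy (hand-picked 3-d vectors, not real word2vec weights), chosen only to illustrate the ranked-list output and the fact that, without excluding the input words, an input word can top the list.

```python
import numpy as np

# Toy embedding table -- hypothetical 3-d vectors for illustration,
# NOT weights from any trained model.
vocab = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.7, 0.9, 0.0]),
    "woman":  np.array([0.7, 0.1, 0.9]),
    "prince": np.array([0.8, 0.6, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(pos, neg, vocab):
    """Rank every vocabulary word by cosine similarity to
    sum(pos) - sum(neg). Nothing is excluded, so the input
    words themselves can (and often do) top the list."""
    target = sum(vocab[w] for w in pos) - sum(vocab[w] for w in neg)
    ranked = sorted(vocab, key=lambda w: cosine(target, vocab[w]), reverse=True)
    return [(w, round(cosine(target, vocab[w]), 3)) for w in ranked]

ranked = analogy(["king", "woman"], ["man"], vocab)
print(ranked)
```

With these particular toy vectors the top match is actually one of the input words, so the honest report would be "King-Man+Woman ~= <that word>"; "queen" only wins after the input terms are filtered out -- which is exactly the exclusion step the thread is debating.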
In the end, it doesn't matter what transformation you do, as long as you apply it consistently rather than ad hoc. If excluding the original term always leads to useful results, it is a useful transformation.
The problems materialize when you're just cherry-picking for results.