Good point!
I would see this rather as yet another argument for why you should simply give the actual output of the NLP algorithm.
So if people actually do the calculation King-Man+Woman and it comes closest to King, then they should report "King-Man+Woman~=King" and not "King-Man+Woman=Queen" (only because that's what they expected).
To be honest, I think the idea that we should expect ML algorithms to give a single, certain answer is misguided. I would expect the output from this algorithm to be "King - Man + Woman = King (90%), Queen (83%), Prince (70%)" or something like that, i.e. a list of answers with some measure of how "good" those answers are. Then again, I work in a field that doesn't really have categorical answers so maybe I'm missing something obvious.
That's pretty much correct.
You would typically calculate a vector for "King-Man+Woman" and then do a query on this based on a cosine distance (or similar measure) over the entire vocabulary.
The query would give you a ranked list of the closest word vectors with scores that indicate how good the match is.
But the example is only performing vector operations. You could perhaps normalize the cosine similarities over a candidate set with a softmax to get something probability-like, but what's being presented in the paper is simply the "closest" vector after the arithmetic, measured by cosine distance.
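To make the procedure concrete, here's a minimal sketch of that pipeline: form the vector King-Man+Woman, then rank the whole vocabulary by cosine similarity to it. The embedding table below is a hypothetical toy (hand-picked 3-d vectors, not real word2vec weights), chosen only to illustrate the ranked-list output and the fact that, without excluding the input words, an input word can top the list.

```python
import numpy as np

# Toy embedding table -- hypothetical 3-d vectors for illustration,
# NOT weights from any trained model.
vocab = {
    "king":   np.array([0.9, 0.8, 0.1]),
    "queen":  np.array([0.9, 0.1, 0.8]),
    "man":    np.array([0.7, 0.9, 0.0]),
    "woman":  np.array([0.7, 0.1, 0.9]),
    "prince": np.array([0.8, 0.6, 0.2]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(pos, neg, vocab):
    """Rank every vocabulary word by cosine similarity to
    sum(pos) - sum(neg). Nothing is excluded, so the input
    words themselves can (and often do) top the list."""
    target = sum(vocab[w] for w in pos) - sum(vocab[w] for w in neg)
    ranked = sorted(vocab, key=lambda w: cosine(target, vocab[w]), reverse=True)
    return [(w, round(cosine(target, vocab[w]), 3)) for w in ranked]

ranked = analogy(["king", "woman"], ["man"], vocab)
print(ranked)
```

With these particular toy vectors the top match is actually one of the input words, so the honest report would be "King-Man+Woman ~= <that word>"; "queen" only wins after the input terms are filtered out -- which is exactly the exclusion step the thread is debating.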
In the end, it doesn't matter what transformation you do, as long as you apply it consistently rather than ad hoc. If excluding the original term always leads to useful results, it is a useful transformation.
The problems materialize when you're just cherry-picking for results.