Wouldn't syntactic parsing also improve by more knowledge of the real world?
From the example in the text:
"They ate the pizza with anchovies"
Isn't it important to know that humans don't use anchovies to eat pizza? You need some semantics as well. Otherwise, from pure syntax, how can you tell it apart from:
"They ate the pizza with forks"
I guess I should've connected the dots a bit better on this.
The neural network model offers a pretty compelling answer to this. The idea is that each word is represented as a real-valued vector, and these vectors encode information about the word's meaning. So the vector for "pizza" encodes that it's a type of food, and so does the one for "anchovies". But "fork" is closer to "knife" and "spoon".
Linear models struggle with this. In a linear model, mostly, each word is an island. "fork" and "spoon" each have unique IDs, and so you'd better hope you see enough examples of each of them to understand what they mean. It helps to add semantic cluster IDs to the words as well, but this isn't quite as good as the word vectors.
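To make the contrast concrete, here's a toy sketch (the vectors are invented for illustration, not real embeddings):

```python
import numpy as np

# Toy 3-d "embeddings"; real models use hundreds of dimensions
# learned from data. These values are made up.
vectors = {
    "fork":  np.array([0.9, 0.1, 0.0]),
    "spoon": np.array([0.8, 0.2, 0.1]),
    "pizza": np.array([0.0, 0.1, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Dense vectors: "fork" is measurably closer to "spoon" than to "pizza",
# so evidence about one cutlery word transfers to the other. With unique
# one-hot IDs, every pair of distinct words has similarity 0, and nothing
# transfers.
```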
What doesn't work particularly well is encoding external knowledge explicitly. Having a semantic resource that tells you explicitly that "fork" and "spoon" are members of class "cutlery". This has been tried by many people, and it produces small or no benefit.
> Linear models struggle with this. In a linear model, mostly, each word is an island.
It's actually quite easy to do this with a linear model. You use the distribution of head/relation/dependent pairs as an auxiliary distribution in a linear model. This has been done for quite some time (see my answer to your parent).
This auxiliary distribution is then estimated using machine-annotated data. Of course, a non-linear classifier using embeddings is cheaper to train ;), because you don't have to parse a lot of data ahead of time.
This kind of representation is famously done by word2vec, also from Google, and the Python package Gensim. Many NLP projects start directly from word embeddings precomputed on very large corpora of text.
No, this is not the type of representation that is done by word2vec. These older bilexical preference models work as follows:
- Take a huge corpus.
- Parse the corpus using your parser.
- Extract head-dependent strengths according to some metric (e.g. pointwise mutual information) from the machine annotated data. These association strengths are used as an auxiliary distribution. The machine annotated data may contain erroneous analyses, but they are typically outnumbered by correct analyses[1].
- In parse disambiguation, the head-dependent strength is added as another feature (the strength is the feature value).
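A minimal sketch of the extraction and scoring steps above, using PMI as the association metric (the head-dependent pairs here are invented; a real pipeline would extract millions of them from machine-annotated parses):

```python
import math
from collections import Counter

# Hypothetical (head, dependent) pairs extracted from a parsed corpus.
pairs = [("ate", "pizza"), ("ate", "fork"), ("ate", "pizza"),
         ("pizza", "anchovies"), ("ate", "pizza")]

def pmi_table(pairs):
    """Pointwise mutual information for each observed head-dependent pair."""
    n = len(pairs)
    pair_counts = Counter(pairs)
    head_counts = Counter(h for h, _ in pairs)
    dep_counts = Counter(d for _, d in pairs)
    return {
        (h, d): math.log((c / n) / ((head_counts[h] / n) * (dep_counts[d] / n)))
        for (h, d), c in pair_counts.items()
    }

strengths = pmi_table(pairs)
# In disambiguation, strengths[(head, dep)] is then added as a feature
# whose value is the association strength.
```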
This idea predates word2vec by quite some time and is also different, both in the training procedure (word2vec uses raw unannotated data; bilexical preference models use machine-annotated data) and in the result of training (distributed word representations vs. head-dependent association strengths).
Although it's a different method, they can capture the same thing as word embeddings plus a non-linear classifier: which heads typically take which dependents. However, in contrast to word embeddings, they are effective in a linear classifier because the auxiliary distribution itself consists of combinatory features (head-dependents).
[1] E.g. consider freer word order languages. In Dutch, SVO is the preferred word order, but OVS is also permitted. A parser without such an auxiliary distribution might analyse the object as a subject in an OVS sentence. However, since SVO is much more frequent than OVS, it will learn the correct associations on a large machine-annotated corpus.
How does the NN encode the fact that when you eat "with" a fork, the word "with" isn't the same sort of "with" as eating something "with" anchovies?
Everyone has their favorite way to think about NNs. I think of them as hyperplanes slicing up n-dim space. I just don't know how they "reason" about semantic information.
> How does the NN encode the fact that when you eat "with" a fork, the word "with" isn't the same sort of "with" as eating something "with" anchovies?
By looking at the noun in the NP embedded in the PP rather than the preposition. Whether this actually happens depends on various factors:
- In some dependency annotation schemes it's actually the head of the NP that attaches to the head with the prepositional phrase relation. Though this is uncommon, since typically the preposition is the direct dependent of the head.
- In some transition systems, attachment of a dependent is done when all its dependents are attached. So, at the moment the preposition is attached to a head, you already know what the preposition governs.
- In some transition systems, previously-made attachments can be changed (e.g. when you encounter a noun in a PP which suggests a different attachment).
- You might already look ahead in the buffer, either by having a small number of buffer positions as the input of the NN or by forming a representation of the current buffer using an RNN.
tl;dr NN can definitely learn such things, but you may have to help a bit by picking the right transition system or classifier inputs.
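As a concrete (hypothetical) illustration of the lookahead point, a feature extractor that exposes a few buffer positions to the classifier might look like this:

```python
# Sketch: feature extraction for a transition-based parser whose classifier
# sees a few buffer positions, so the PP-internal noun ("anchovies" vs
# "forks") is visible at the moment the attachment of "with" is decided.
def features(stack, buffer, lookahead=3):
    feats = {}
    # Top two stack items (s0 = top of stack).
    for i, tok in enumerate(reversed(stack[-2:])):
        feats[f"s{i}={tok}"] = 1.0
    # First few buffer items (b0 = next token).
    for i, tok in enumerate(buffer[:lookahead]):
        feats[f"b{i}={tok}"] = 1.0
    return feats

# At the point where the attachment of "with" must be decided:
f = features(stack=["ate", "pizza"], buffer=["with", "anchovies"])
```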
Hyperplanes slicing up n-dimensional space pretty much perfectly describes a support vector machine (SVM), but not a neural network. Maybe something more like disjoint manifolds rather than planes would be a better analogy...
> What doesn't work particularly well is encoding external knowledge explicitly. Having a semantic resource that tells you explicitly that "fork" and "spoon" are members of class "cutlery". This has been tried by many people, and it produces small or no benefit.
I rarely do this, but … citation(s) please? Who has tried it, and can you point to their results? Why hasn't it worked? It seems like totally the obvious solution for the next 7% gap. Why on earth wouldn't it work?
FWIW, there was a paper where they trained word vectors on big handmade semantic graphs like that. The word vectors were actually more accurate than when trained on raw text.
Good question! This is the whole point of parsing sentences to deduce semantic meaning.
"He bent the fork" and "He could fork the Github repository"
Remember the deductive reasoning puzzles from when you were a kid?[0] There are some facts in these sentences, so let's use some deductive reasoning to figure it out.
"He" followed by "bent" means that "He" has to be the subject and "bent" has to be the main verb of the sentence. The "the" that follows has to belong to "fork"; grammar never works any other way. I could go a lot deeper here, but I'm going to keep this short. We can train any machine to deduce that "fork" has to be a noun and that it is the direct object of the verb "bent". "Fork" means one thing in this sentence: it is a tool to eat with.
The modal "could" in the second sentence grammatically requires a verb to follow it. It's not so much guessing what "fork" can mean, an action or an eating utensil, but rather, what is its function? "He" has to be a subject; if it were on the other side of the verb, it would be "him". Because of the "the", "the Github repository" has to be a noun phrase. By deduction, "fork" has to be a verb, and therefore "fork" is an action of dividing something into another part. Like I said before, we can go deeper: for example, the capital G in "Github" marks it as a proper noun, so it can't function as a verb, and the verb takes it as an object. Whether a verb takes an object or not can give semantic meaning to the verb too.
This is an example of how the existence of "the" in a phrase allows machines to deduce the meanings of the same word in different sentences.
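The deductions above can be caricatured as a couple of rules. This is a toy sketch, nowhere near a real tagger; here the previous word alone decides the category of "fork":

```python
# Hypothetical rules encoding the two deductions above:
# a modal must be followed by a verb, and a determiner opens a noun phrase.
def guess_pos(prev_word):
    if prev_word.lower() in {"could", "can", "will", "would", "should"}:
        return "VERB"   # "He could fork ..." -> "fork" is a verb
    if prev_word.lower() in {"the", "a", "an"}:
        return "NOUN"   # "He bent the fork" -> "fork" is a noun
    return "UNKNOWN"
```

So `guess_pos("the")` returns "NOUN" for the first sentence, and `guess_pos("could")` returns "VERB" for the second.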
Generally, they don't. word2vec and GloVe, the two most popular word vector models in the circles I run with, don't have any solution to this at all. As a result, when you do two dimensional visualizations of these vector spaces, words with multiple commonly used senses get positions in the middle of nowhere.
In many situations, the downstream models that depend on these word vectors manage to perform well anyway, so for simplicity, we just live with this limitation.
That said, there has been work on handling polysemy (the property of one word having multiple meanings) in word vector models. The simplest method I've heard of is to do the word sense disambiguation out of band and then tag some kind of identifier onto the end of each word before you start training your word vector model.
As an example, you could run a part of speech tagger and then tag the part of speech onto the end of the word. So in the above example, we'd get "fork_verb" and "fork_noun". Part of speech doesn't fully disambiguate a word, so this only gets you part of the way there, but at least it's easy.
You can do something similar with named entities. You can replace the two words "Larry" and "Page", which the model would otherwise learn very generic vectors for, with "Larry_Page" or "EntityID_192318" or whatever.
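A rough sketch of both tricks (the POS tags and the entity list are assumed to come from upstream tools; the names here are made up):

```python
# Append an out-of-band identifier (here, a POS tag) to each token,
# and merge known multi-word entities into single tokens, as a
# preprocessing step before embedding training.
def tag_tokens(tokens, tags):
    return [f"{tok}_{tag}" for tok, tag in zip(tokens, tags)]

def merge_entities(tokens, entities):
    # entities: set of 2-token tuples to merge, e.g. {("Larry", "Page")}
    out, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in entities:
            out.append("_".join(tokens[i:i + 2]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

For example, `tag_tokens(["fork"], ["noun"])` yields `["fork_noun"]`, and `merge_entities(["I", "met", "Larry", "Page"], {("Larry", "Page")})` yields `["I", "met", "Larry_Page"]`.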
There has also been work in automatically detecting different senses of a word. I've got at least one paper[1] in my notes that talks about this kind of thing. It does k-means clustering on the contexts that a particular word appears in, and learns a different representation for each cluster.
[1] - Eric H. Huang , Richard Socher , Christopher D. Manning , Andrew Y. Ng, Improving word representations via global context and multiple word prototypes, Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, July 08-14, 2012, Jeju Island, Korea
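In the same spirit, here's a toy version of that clustering step (the 2-d context vectors are synthetic; a real system clusters learned context representations, and this tiny k-means is only a stand-in):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

# Synthetic context vectors for four occurrences of "fork":
# two food-like contexts, two software-like contexts.
contexts = np.array([[0.9, 0.1], [1.0, 0.0], [0.0, 1.0], [0.1, 0.9]])
senses = kmeans(contexts, k=2)

# Each occurrence then gets a sense-specific token, e.g. "fork#0" vs "fork#1",
# and a separate vector is learned for each.
labeled = [f"fork#{s}" for s in senses]
```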
Huh, interesting. I have no idea what I'm talking about, but this reminds me of the story somewhat recently about the dude that used vector analysis of the Voynich Manuscript and I think got something meaningful out of it. Same thing? Or different entirely?
This is addressed in the original announcement post of SyntaxNet[0]:
> The major source of errors at this point are examples such as the prepositional phrase attachment ambiguity described above, which require real world knowledge (e.g. that a street is not likely to be located in a car) and deep contextual reasoning. Machine learning (and in particular, neural networks) have made significant progress in resolving these ambiguities. But our work is still cut out for us: we would like to develop methods that can learn world knowledge and enable equal understanding of natural language across all languages and contexts.
The SyntaxNet team thinks that this is indeed a major source of errors and seems to be a focus of their work going forward.
You can learn this quite easily using bilexical dependencies on a machine-annotated corpus, even without word embeddings (which is another option, as Matthew suggests in a sibling post).
To your point, how does NLP address grammatical differences/errors, colloquialisms or slang? In the context of personal assistants every implementation I've seen has required more unnatural-language given the status quo.
According to the paper linked by the original announcement [1], the parser scores 94.41% for unlabeled attachment on the Wall Street Journal corpus [2], a parsed and labeled data set of 30 million words.
This corpus is a standard for NLP research on English syntax, but I think it's worth remembering there is a great deal of disagreement among linguists about what the syntax of English is and what the lexical categories are.
Somewhat off topic, but for anyone more familiar with spaCy:
Any idea if spaCy has a sane way to be used in other languages? Specifically Rust, in this case? I'm in need of a decent NLP library to train and or map sentences to intents. Somewhat similar to what wit.ai offers, but it needs to be offline-able. spaCy sounds great, but requiring a runtime (if spaCy indeed does) on the installed system is a no-go for me.
Not knowing much about Python, it looks like spaCy is actually compatible with CPython, so could it therefore be compiled to C and imported into Rust?
spaCy's implemented in Cython, which compiles into C++. So, the end-state at the moment is a .so object that expects to be loaded into Python.
Now, you can instead tell Cython to compile into standalone objects, so that they can be imported from C/C++ code. I'm not well versed on this, but if you can call into C/C++ libraries easily from Rust, you should be able to call into spaCy.
You'll miss some of the niceties from the Python layer, but the C-level API is pretty well behaved. So, you should be able to get by okay.
Curious: since spaCy is a commercial project (I know that it is open source), did you consider writing it in C or C++? It would make binding to other languages a lot easier. I can imagine that embedding the Python runtime can be problematic in some projects. Also, I assume that you will see some Python types in the API when you want to call the parser from C?
(My own NN dependency parser is written in Go, since it's mostly for research, but if I wanted to make it embeddable, I'd have implemented it in C++ or Rust with a C API.)
Well, I like writing in Cython, and I figured I was better off pleasing some people (Python) a lot before trying to please a lot of people a little. It won't be hard to port to C++ when the time comes.
This is not the case in general for Python. If you want to run Python from Rust or C, you effectively link the CPython interpreter into your program and start executing the Python source code as an embedded string.
IIRC some of spaCy is implemented in Cython, which I think generates C for a CPython extension. This is typically for lower-layer stuff and doesn't provide any real utility for linkage with C/Rust.
"Python is so limiting" is an odd way to express your problem IMO. It is limiting in that you can't just use a foreign-function interface to simply call into it without the baggage of a big runtime/interpreter. But in general it's a super-high-level language that I wouldn't describe as "limiting."
EDIT: I stand corrected, syllogism's comment indicates that the cython layer might be useful on its own.
> "Python is so limiting" is an odd way to you express your problem IMO.
You are correct, I'll edit that to better reflect my intent. I worded it poorly and meant no disrespect to Python :), it just doesn't fit my needs in this case.