Zero-Shot Translation with Multilingual Neural Machine Translation System

Smerity · on Nov 23, 2016

If people are interested in the underlying architecture of Google's Neural Machine Translation (GNMT) system, I wrote an article that builds it up piece by piece. While it's intended for people who are likely to implement GNMT or similar architectures, the article is descriptive enough that it should be possible to follow along even if you're not well versed in deep learning.

http://smerity.com/articles/2016/google_nmt_arch.html

The GNMT architecture is used almost as is for the zero-shot MT experiments. We're likely to see the GNMT architecture used extensively by Google for a variety of projects as they spent a deal of time and effort ensuring it is scalable to quite large datasets. Training a neural machine translation system with a single language pair is difficult - training it with multiple, especially all using the same set of weights, is insanely challenging!

As an example, the GNMT architecture was used as the basis of "Generating Long and Diverse Responses with Neural Conversation Models", which trains on the entirety of Reddit (1.7 billion messages) as well as other various datasets.

https://openreview.net/forum?id=HJDdiT9gl

alexbeloi · on Nov 23, 2016

Can you explain the odd looking connections to the decoder that appear in V2 and onwards? It's not entirely obvious to me from the description how the attention is applied.

In the first part of the decoder rollout, the diagram connects the (concatenated?) output of the encoder into the middle(?) of the first output of the decoder rollout. In what sense?

For the second rollout of the decoder the diagram shows 3 arrows going into the LSTM cell, what is that (left-facing) arrow on the right side doing?

gabrielgoh · on Nov 23, 2016

i see ray kurtzwell managed to sneak his name in as the last author. heh

sgentle · on Nov 23, 2016

This reminds me of Searle's Chinese Room Argument[0]: imagine you have a particularly dreary job where you sit in a room filled with boxes of symbols written on paper. Every now and again someone comes in and hands you some new symbols. You look through your rulebook and, depending on what it says, hand them some symbols back. It turns out that these rules actually implement a conversational program in Chinese. And if you can implement those rules and not understand Chinese, why would you think a computer program, implementing its own rules, could understand anything?

The common "Systems Reply" response to this is that you're looking at the wrong layer of abstraction. The computer hardware (or the person in the room) doesn't understand Chinese, the computer plus the rules plus the data forms a system that understands Chinese. Searle's answer to this is that, well, what if you memorised the rules and the database? You might know all the rules, you might be able to follow them, but you wouldn't understand Chinese.

What I think is fascinating about this is that it's vulnerable to Bayesian Judo[1]: if you have a strong belief that computers aren't capable of true understanding because of the Chinese Room Argument, then building an actual Chinese Room-style computer and having it show understanding should be a fairly strong blow to that belief.

Now, it's easy to quibble about what true understanding actually means, but one version (used by Searle's answer) is this: "[..] he would not know the meaning of the Chinese word for hamburger. He still cannot get semantics from syntax." But this news is exactly that! A computer translation of the same semantic concept from one syntax to another without ever having been taught the rules connecting them. In other words, this is semantics from syntax implemented by nothing but a computer, a database, and a set of rules.

So, by the reverse Chinese Room Argument, I would say this system exhibits a kind of understanding. Not a very sophisticated kind, mind you, but something that should still spook you if you believe computers are categorically incapable of thinking like us.

[0] http://plato.stanford.edu/entries/chinese-room/ [1] http://lesswrong.com/lw/i5/bayesian_judo/

wyager · on Nov 23, 2016

Aaronson has an interesting proposal to address the Chinese room problem that I think makes a lot of sense. The idea is that the Chinese Room intuitively doesn't exhibit understanding because it's a constant-time exponential-memory algorithm (a lookup table), whereas the algorithm that generated the entries in the Chinese Room table (a human) is super-constant time, sub-exponential memory algorithm, which introduces a place for consciousness to emerge. So the only reason the Chinese Room problem is philosophically confounding is that it adds a layer of indirection (a cache table) that obviously can't be conscious over the algorithm that actually might be conscious and that generates the table entries. http://www.scottaaronson.com/papers/philos.pdf

timr · on Nov 23, 2016

"But this news is exactly that! A computer translation of the same semantic concept from one syntax to another without ever having been taught the rules connecting them."

By that standard, statistical translation approaches were "understanding" a long time ago. The new thing here isn't that systems aren't being taught "the rules" (that wasn't happening in statistical MT either), the new thing is that there's a different kind of classifier in the "middle" now, which is representing a hidden state. This classifier is more flexible in a lot of ways, but also more of a black box, and takes a lot more effort to train without overfitting. It's cool that you can translate between language pairs that have never been explicitly trained, but let's not overstate the meaning of it.

The blog post makes this rather breathless speculation:

"Within a single group, we see a sentence with the same meaning but from three different languages. This means the network must be encoding something about the semantics of the sentence rather than simply memorizing phrase-to-phrase translations. We interpret this as a sign of existence of an interlingua in the network."

This is...a fun story, but not much else. First off, you can make dimensionality reduction plots that "show" a lot of things. Even ignoring that issue, in translations of short sentences involving specific concepts (i.e. the example about the stratosphere), is it really surprising that you'd find clusters? The words in that sentence are probably unique enough that they'd form a distinct cluster in mappings from any translation system.

Folks get caught up in the "neural" part of neural networks, and assume that some magical quasi-human thought is happening. If the tech were called "highly parameterized reconfigurable weighted networks of logistic classifiers", there'd be less loopy speculation.

sgentle · on Nov 23, 2016

Don't worry, I'm not being bamboozled by the word "neural". My argument that there is a definition of understanding that you can derive from a well-known thought experiment that looks like it is met by this implementation of "highly parameterized reconfigurable weighted networks of logistic classifiers".

I don't see any particular difference between training a classifier and teaching rules; the rules are just encoded in the parameters of the classifier. If it helps, you can just replace "taught" with "trained on" and "rules" with "data", but there's no version of the Chinese Room Argument where you're sitting in a room with boxes full of unsupervised learning datasets and a book of sigmoid functions.

Perhaps this system works similarly to previous ones, but not having been taught (trained on) any rules (data) about the specific language pairs in question seems to be a strong argument for some kind of semantic representation of language. You might have seen that before, but I haven't and the article seems to imply that it's new. Again, I'm talking specifically about the similarity between this result and an example of something "machines can't do".

The point is that the non-magical argument goes both ways. If a brain is just a complicated and meaty computer, then we should expect sophisticated enough programs on powerful enough hardware to start displaying things we might recognise as intelligent. That's not going to look particularly impressive – our machine translator isn't going to develop a conscience or try to unionise – but it might do something that qualifies for some definition of understanding.

timr · on Nov 23, 2016

But you are getting into magical thinking, in that there is no reasonable definition of "understanding" that this system meets. It cannot reason or make deductions. It can't re-write sentences to use completely different words/structures but imply the same meaning. In fact, there is literally no "conceptual" representation here -- there is a vector of numbers that gets passed between encoder and decoder, but it is no more a form of intelligence than the "hidden" state that is maintained by an HMM.

"not having been taught (trained on) any rules (data) about the specific language pairs in question seems to be a strong argument for some kind of semantic representation of language."

Well, yeah, there's a representation of language. But it isn't "semantic" -- it's vector of language-independent parameters for a decoder, which can then output symbols in a second language. Could you theoretically imagine some huge magical network of logistic classifiers that uses this as the first of a (far larger) processing machine that enables something like human intelligence? Maybe. But this is not it. This is bigger, far more complicated/flexible version of a machine that is purpose-built to map between sequences of text.

(That said, I really don't want to go down the rabbit hole of "what is AGI, anyway?", which is about as productive/interesting as hitting the bong and wondering if maybe we all live in a computer simulation after all. I'm merely observing that this is not an intelligent machine.)

sgentle · on Nov 23, 2016

> It cannot reason or make deductions. It can't re-write sentences to use completely different words/structures but imply the same meaning.

I agree that it doesn't meet these definitions of understanding. I'm arguing that it meets the definition of "semantics from syntax".

> Well, yeah, there's a representation of language. But it isn't "semantic" -- it's vector of parameters for a decoder, which then output symbols in a second language.

What is it that makes a vector of parameters not semantic? Would it be semantic if they were stored in a different format? If I tell you I have a system in which the concept of being hungry is stored as the number 5, would you say "that's not a concept, that's a number"? If that vector of parameters represents being hungry in any language, what is it if not a semantic representation of hunger?

There's no need to imagine a huge magical network that implements a grandiose vision of intelligence. We're talking about a small, non-magical network that implements a very modest vision of intelligence. Bacteria are still alive even though they're a lot less complex than we are. What would you expect the single-celled equivalent of intelligence to look like? Something with a very minor capacity for inference? Something with rudimentary abstraction across different representations of the same underlying idea?

d--b · on Nov 23, 2016

I don't think you can say that this is a refutation of the Chinese room argument. Here the system doesn't know what the semantics are. For instance, it wouldn't be able to use some kind of logic inference to disambiguate between the two meanings of the same word. It may use some statistical inference (ie. if we talk about meat elsewhere in the sentence, it's likely that Hamburger refers to the sandwich rather than to the inhabitant of Hamburg), but it wouldn't be able to understand the content of the text to provide a translation that derives from this meaning.

However, what this article seems to point out is that there is an intermediate representation that is being used within the neural network. This intermediate representation would correspond to a Chinese room where symbols (like hamburger) are first matched to something common to all language (let's call it X), and then translated back to specific languages.

What this means is neither the Chinese room argument is true or false, it means that it's more complicated than that. It means that "understanding" has various degrees of depth, and that the first degree of understanding really is just syntactic conceptualization.

SonicSoul · on Nov 23, 2016

it is also possible to think of humans are pre-programmed systems that think have a consciousness and understanding but are in fact following a set of rules they don't full understand yet. Perhaps creating "artificial" systems that can evolve their own intelligence could be a way for us to understand our own programming better.

hota_mazi · on Nov 22, 2016

Fascinating. Maybe the next step will be to extract the tokenized interlingua language that's emerged in the neural network and map it to real words, and blam, we reinvent Esperanto!

klodolph · on Nov 22, 2016

It would probably not look anything like Esperanto, which is a charliefoxtrot of pidgin Spanish with some Turkish orthography thrown in.

In all seriousness, however, different languages often have radically different concepts encoded into them. For example, in English sentences you might be forced to explicitly give subjects to the actors in your sentences, assign genders to your pronouns, and choose whether actions take place in the present or future, whether they are ongoing in general or happening at this moment, etc. In Japanese you might be forced to choose how to express your relationship with the person you are speaking to, the person you are speaking about, and decide whether a causal relationship is "if and only if" or just plain "if then".

There's long been a hypothesis about whether there is some kind of "universal grammar" (Chomsky) but modern linguists do not really entertain that idea.

toomanybeersies · on Nov 23, 2016

I think that's the fundamental problem with translating; different languages have different information encoded in a sentence.

For instance, as you've said, Japanese encodes the relation to whom you're speaking. Even French does this (to a lesser extent), with "tu"/"vous" depending on how familiar the person you're speaking to is.

schoen · on Nov 23, 2016

And as the parent commenter points out, not only do they have different information encoded, but they may require different information to be expressed so that it's not permissible to simply omit information that the source language omitted.

https://en.wikipedia.org/wiki/T%E2%80%93V_distinction

https://en.wikipedia.org/wiki/Pro-drop_language

https://en.wikipedia.org/wiki/Evidentiality

mack73 · on Nov 23, 2016

But there absolutely must be a common denominator. A Japanese sentence might encode the metric "relationship to subject" which an English sentence will encode as "NULL".

Berobero · on Nov 23, 2016

The issue is if you're translating from a language that doesn't explicitly encode that information into a language that requires it; you need some way of filling in the blanks. This is all well and great when that information can be inferred from the explicit context provided; the real problem is that with translation there are a non-trivial number of cases that require the translator to consider the pragmatic context of what's being translated to figure out a correct/good translation (e.g. who's speaking/writing, who are they addressing, what is the overall societal context of content, what might be the purpose of the content, etc.). For a lot of these problems it's entirely reasonable to get a computer to fill in the blanks for 90%+ of cases, but the last few percent of cases require AI that is equivalent to that of a human.

schoen · on Nov 23, 2016

I'm not positive that I understand what you mean by "common denominator", but if you mean "it's possible to somehow express or explain what one language has encoded using another language", I would agree with that.

mattkrause · on Nov 23, 2016

Can you elaborate on that last bit?

I know there are some possible counter-examples to universal grammar, like Dan Everett's work on Pirahã, a language which may lack recursion, and that there's more interest in usage-based theories recently. However, my impression is that most linguists are still pretty attached to a some sort UG-like theory of language and language acquisition. (Admittedly, this is partly because no one can agree on what does or does not constitute UG, but still...)

There have been a bunch of pop-sci articles about non-Chomskian points of view lately, but I'm not sure how well (e.g.) Tom Wolfe is plugged into theoretical linguistics gossip and a lot of the others seem to draw from the same pool of authors, like Paul Ibbotson, who had a piece in Scientific American and another in Salon.

klodolph · on Nov 23, 2016

My comment about universal grammar may not have been as germane as I thought.

I have not really been plugged into the pop science side of linguistics, so I'm not really familiar with any of the articles in Scientific American or Wolfe's recent work. However, I don't think that many people in linguistics have taken universal grammar very seriously since the 1970s or so, but that might just be my view of things.

Chomsky's ridiculous and untenable ideas about a "language organ" can't be taken seriously. Anyone with a defensible stance on UG therefore defines UG as something else, and whatever that "something else" is ends up not quite useful enough to actually hang on to. "UG-like language acquisition" seems like a bit of a contradiction to me, since the original concept of UG is of an innate grammar (and if you have to acquire it, it's not innate).

Perhaps these anti-Chomsky pieces get so much circulation because firstly, they're easy since Chomsky's old theories are practically indefensible and long out of fashion and secondly, Chomsky's a household name so you can move advertising dollars by picking on his old ideas. There will always be a market for articles that tell readers that some famous intellectual was wrong once.

schoen · on Nov 23, 2016

It seems Chomsky has had to back down on quite a bit of what UG should mean, for example suggesting that individual pieces can go unused in particular languages and language learners.

One interesting thing that might still be left is criteria about what languages would be learnable or unlearnable. For example, perhaps there are reasons that it's impossible for human beings to become fluent in a conlang like Ithkuil (or Lojban), even if they were exposed as children to interactions with (hypothetical) fluent speakers. Yet this isn't true of Esperanto (or at least very slightly creolized versions of it), which does have native speakers for whom the language didn't seem to present any challenge. There could be criteria about what makes a language too weird or too hard to learn and speak with the facility with which all existing natural languages can be learned and spoken by almost all children; maybe some of those criteria are fuzzy but others are firm. I think this is related to UG, but clearly a weaker concept.

eveningcoffee · on Nov 23, 2016

Esperanto is a lovely idea but any language needs an active support to stay current. If the language does not take inputs from culture, science and philosophy, it starts to degrade.

renox · on Nov 23, 2016

? Nothing prevent adding new words to Esperanto, the most difficult part is reaching a consensus but that's also the case for 'normal' languages: there are many new French word introduced by the academie Française which aren't much used (courriel instead of mail, pourriel instead of spam, etc)..

eveningcoffee · on Nov 23, 2016

It is not about language capability to accept new words but about language users motivation to do so.

ChuckMcM · on Nov 23, 2016

Nice piece of work and counts as "implementing Star Trek in the present". Now I just need a nice pair of noise cancelling over the ear headphones that let me hear english spoken no matter where I am :-)

agildehaus · on Nov 23, 2016

Stick a fish in your ear. Might get lucky.

honkhonkpants · on Nov 22, 2016

Pretty impressive, but even more amazingly their paper is in a single-column format that I can actually read on my computer, instead of pretending that I am reading printed and bound conference proceedings. Truly a giant leap for the field.

YeGoblynQueenne · on Nov 23, 2016

>> We call this “zero-shot” translation, shown by the yellow dotted lines in the animation. To the best of our knowledge, this is the first time this type of transfer learning has worked in Machine Translation.

I think it was last year when a friend was telling me how Google translates the Greek word for "swallow" (the bird) to French. Back then, the translation was the French word for "to swallow" (the verb). The bird and the action don't even sound remotely alike in Greek and neither are they spelled alike (the bird is "χελιδόνι" the action is "καταπίνω"; google trans. will at least give their correct pronounciation). My friend figured Google can't find enough examples between the two languages, so it goes via English ... were the two words are homonyms.

I think that was last year, and certainly before September.

So I gave it a try again today, and this is still what I get:

  Greek                 French
  χελιδόνι		avaler
  chelidóni

If you omit the accent on the "o" you don't even get the mistranslation- you get only the phonetic transcription of the Greek word in latin characters.

Obviously the important thing here is not the one word that google translate gets wrong, but the fact that it doesn't really look like this "new" system is all that new, or that it does anything all that different from the previous one, or indeed that it improves things at all.

YeGoblynQueenne · on Nov 24, 2016

Google, and also Microsoft btw, absolutely need to be called out on this. They keep claiming that their translation systems work well, because they have reasonably good results between some language pairs, like English/French or English/Spanish, that are a) close linguistically b) have a lot of examples of translated documents and, more importantly, c) have many speakers who might use Google translate.

For languages where none of the above holds, however, the results continue to be completely ridiculous, no matter what "new" technique Google (or MS) advertises. Since those languages are not spoken by as many people as English or Spanish etc, however, it's very hard for the user to figure out how attrociously bad their automatic translations are.

Here's an example from my native Greek; this is a bit of news text from yesterday [1]:

Λανθασμένη χαρακτηρίζει ο Κύπριος κυβερνητικός εκπρόσωπος, Νίκος Χριστοδουλίδης, την προσέγγιση, να μπαίνουν στο «ίδιο καλάθι» η Ελλάδα με την Τουρκία σε σχέση με το κυπριακό.

And here's Google's translation:

Incorrect characterizes Cypriot government spokesman Nikos Christodoulides, the approach to be put in "one basket" by Greece and Turkey in relation to Cyprus.

So, the Cypriot government spokesman (well done) is put in a basket by Greece and Turkey (wait wut). Hey, maybe the guy wanted to be put in two baskets? [2]

That's very typical of the way Google translates between Greek and English. For Google, it's Neural Networks leading us to a bright future where language barriers are eliminated thanks to Scienz! For Greeks, it's comedy gold.

And it's the same for Russians, Polish, Finns, Swedes, Indians, Chinese, Hungarians...

Still, Google keeps including those languages in the count of languages it "covers", because it's good advertisement and who can really dispute them anyway?

_____________________

[1] http://www.kathimerini.gr/884845/article/epikairothta/politi...

[2] What's being said is more like: "The Cypriot government spokesman said that it's a mistake to treat Greece and Turkey in the same manner with regards to Cyprus".

vurpo · on Nov 23, 2016

I wonder how much memory this translation via an intermediate representation of a sentence takes. It seems like representing the semantic meaning of a sentence in a language-independent way would take a huge amount of data.

ximeng · on Nov 23, 2016

E.g. Japanese is 50MB to download for offline usage in Google Translate's iPhone app.

glandium · on Nov 22, 2016

The mentioned Japanese->English->Korean combination is one of the worst possible things to do. Both Korean and Japanese are very different from English, while similar to each other to some extent. Direct translation from one to the other would actually have much better results than translating back to Korean the (likely broken) English you get from Japanese.

Edit: I do realize they're not talking about successive translations, but that's essentially how the training ends up happening, isn't it?

A better example, IMHO, would have been three very different languages, like English, Japanese and Russian.

xbmcuser · on Nov 22, 2016

It is not translating from Japanese>English>korean it learned how to translate Japanese<>English Korean<>English now it knows how to translate Japanese<>Korean. In laymen terms it understands Korean and Japanese so can translate between them without needing examples of Japanese<>Korean translations.

paulsutter · on Nov 23, 2016

I'm guessing the downvotes here are based on your wording, "worst possible thing to do", and not your core point that Korean and Japanese are grammatically very similar.

You're right that the test would be even more interesting with three very different languages. But it's still pretty impressive with Korean, English, and Japanese as shown, despite the similarity.

spynxic · on Nov 22, 2016

This post seems to over-exaggerate a commonly known mathematical property.

Suppose I have languages X, Y, and Z. My machine currently knows how to translate between X->Y and X->Z. The goal is to turn Y into Z without direct training. The process would be to translate Y into X and X into Z.. effectively Y->Z.

This isn't really transfer learning as much as it is logical induction...Or am I missing something?

mattkrause · on Nov 23, 2016

Well....

It is one of the oldest ideas in machine translation. In the late 1950s, Richens proposed an MT system that used an intermediate representation called an "interlingua." For each language, the system has two components, one that maps a natural language onto the interlingua, and a second that converts the interlingua back into a natural language. Use different source and destination pairs and--bam! you've got a multi-way translator.

However, this only works if a) the interlingua is rich enough to capture the semantics of the to-be-translated text and b) the conversions between natural languages and interlingua also preserves those semantics. In practice, neither has worked pretty well.

In the 1960s, the interlingua was typically hand-crafted and rules were either hand written, or later, induced, to convert natural languages to/from it. Since people have been working on this for 60 years, you can probably image how well that works[].

The clever bit here is that you don't really need to do that. If you have enough data, the LSTMs can learn it on their own (see Figure 2, for example, where semantically-related sentences from multiple languages end up in the same neighborhoods).

[] Actually, somewhat better than you'd think. It certainly wouldn't have done Pushkin any justice, but it was surprisingly decent on weather reports (etc).

csydas · on Nov 23, 2016

I was going to comment on the article and idea of multilingual NN's in general since, as I'm learning Russian currently, I'm having a hard time seeing how you can get a good automated translation without some editorializing or awareness of what's going on in both sentences. Your comment on NN's "... not doing Pushkin justice" stood out since it's something I've been wondering how NN could handle a language like Russian that traditionally doesn't use the verb "to be" in sentences, much like the E-Prime idea that exists in English for clarity of description.

Just trying to re-think in terms of a Russian sentence instead of an English sentence has also proven difficult, and I'd be curious how you can automate it on NN's well. My colleagues (russian natives) are all able to spot auto-translated messages from me nearly instantly since it messes up pretty rudimentary grammar rules or just doesn't phrase the sentence like you would in native Russian.

lilyball · on Nov 22, 2016

Presumably what you're missing is the fact that this is translating from Y->Z in a single step, rather than doing two separate translations.

mattkrause · on Nov 23, 2016

They're certainly not doing English to Portuguese by way of Spanish or anything like that.

However, you could almost read the paper as "We can translate something from a natural language into a high-dimensional representation, then turn that high-dimensional representation back into (another, possibly different) natural language. "

empath75 · on Nov 23, 2016

It seems similar to automatically captioning images.

mattkrause · on Nov 23, 2016

Very much so!

The two tasks are surprisingly interchangable. I once worked on a project where we used a statistical MT approach to "translate" between image features and captions--and I don't think we were the only ones trying such things.

In a pleasing bit of symmetry, the attentional network used here looks like it was initially developed for image captioning.

spynxic · on Nov 22, 2016

Then the high dimensional representation of the language that allows such a translation to be possible is profound

nl · on Nov 23, 2016

No one has managed to get induction working reliably for natural language tasks in any way that is generalizable enough to rely on.

So that is new.

railorsi · on Nov 22, 2016

Could we say logical induction is the principle behind transfer learning then?

spynxic · on Nov 22, 2016

Emergent induction via high dimensional structure composition.. maybe. I haven't read any whitepages about the underlying processes

P.S. note the hand-waviness when I use the word "emergent"