(These are excerpts from my book "Intelligence is not Artificial")
Neural Machine Translation
Yoshua Bengio at the University of Montreal started working on neural networks for natural language processing in 2000 ("A Neural Probabilistic Language Model", 2001). Bengio's neural language models learn to convert a word symbol into a vector within a meaning space. The word vector is the semantic equivalent of an image vector: instead of extracting the features of an image, it extracts the semantic features of a word in order to predict the next word in the sentence. Bengio noticed something peculiar about the word vectors that his neural networks learned from a text: these word vectors capture precisely the kind of linguistic regularities and patterns that define the use of a language, the kind of things one finds in the grammar, the lexicon, the thesaurus, etc.; except that they are not separate databases but one organic body of expertise about the language. Firth again: "you shall know a word by the company it keeps".
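A toy illustration of such a regularity, with invented four-dimensional vectors (real models learn hundreds of dimensions from billions of words; the numbers below are made up solely to make the arithmetic visible):

```python
import numpy as np

# Hypothetical 4-dimensional "word vectors", invented for illustration.
# Each dimension loosely encodes a semantic feature (royalty, gender, ...).
vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "queen": np.array([0.9, 0.1, 0.1, 0.3]),
    "man":   np.array([0.1, 0.8, 0.2, 0.2]),
    "woman": np.array([0.1, 0.1, 0.2, 0.2]),
}

def nearest(v, exclude):
    # The vocabulary word whose vector is closest by cosine similarity.
    return max((w for w in vectors if w not in exclude),
               key=lambda w: np.dot(v, vectors[w]) /
                             (np.linalg.norm(v) * np.linalg.norm(vectors[w])))

# The famous regularity: king - man + woman lands near queen.
v = vectors["king"] - vectors["man"] + vectors["woman"]
print(nearest(v, exclude={"king", "man", "woman"}))  # prints "queen"
```

With these made-up numbers the arithmetic works out exactly; in a trained model the analogy holds only approximately, which is precisely what makes it remarkable.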
In 2005 Bengio developed a method to mitigate the "curse of dimensionality" in natural language processing, the problem of training a network over the huge vocabularies of natural languages ("Hierarchical Probabilistic Neural Network Language Model", 2005). After Bengio's pioneering work, several others applied deep learning to natural language processing, notably Ronan Collobert and Jason Weston at NEC Labs in Princeton ("A Unified Architecture for Natural Language Processing", 2008), one of the earliest multitask deep networks, and one capable of learning recursive structures. Bengio's mixed approach (neural networks plus statistical analysis) was further expanded by Andrew Ng's and Christopher Manning's student Richard Socher at Stanford with applications to natural language processing ("Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks", 2010), which improved the parser developed by Manning with Dan Klein ("Accurate Unlexicalized Parsing", 2003). The result was a neural network that learns recursive structures, just like Collobert's and Weston's. Socher introduced a language-parsing algorithm based on recursive neural networks, which he also reused for analyzing and annotating visual scenes ("Parsing Natural Scenes and Natural Language with Recursive Neural Networks", 2011).
However, Bengio's neural network was a feed-forward network, which means that it could only use a fixed number of preceding words when predicting the next one. The Czech student Tomas Mikolov of the Brno University of Technology, working at Johns Hopkins University in Sanjeev Khudanpur's team, showed that, instead, a recurrent neural network is able to process sentences of any length ("Recurrent Neural Network-based Language Model", 2010). An RNN transforms a sentence into a vector representation, or vice versa. This enables translation from one language to another: an RNN (the encoder) can transform the sentence of a language into a vector representation that another RNN (the decoder) can transform into the sentence of another language. (Mikolov was hired by Google in 2012 and by Facebook in 2014, and in between, in 2013, he invented the "skip-gram" method for learning vector representations of words from large amounts of unstructured text data.)
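The recurrence that lets an RNN handle any sentence length is simple: the same weights are reused at every position, folding the whole sentence into one fixed-size hidden state. A minimal sketch, with untrained random weights and sizes invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 8, 16            # illustrative sizes, not from any paper

# Untrained random weights -- a real model would learn these from data.
W = rng.normal(scale=0.1, size=(d_hid, d_in))   # input-to-hidden
U = rng.normal(scale=0.1, size=(d_hid, d_hid))  # hidden-to-hidden (recurrence)
b = np.zeros(d_hid)

def encode(sentence_vectors):
    """Fold a sequence of word vectors (of any length) into one hidden state."""
    h = np.zeros(d_hid)
    for x in sentence_vectors:
        h = np.tanh(W @ x + U @ h + b)   # same weights reused at every step
    return h

# Sentences of different lengths map to vectors of the same size.
short_h = encode(rng.normal(size=(3, d_in)))
long_h = encode(rng.normal(size=(30, d_in)))
print(short_h.shape, long_h.shape)   # both (16,)
```

It is this fixed-size summary that the decoder RNN then unfolds into the target-language sentence.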
In 2013 Nal Kalchbrenner and Phil Blunsom of Oxford University attempted statistical machine translation based purely on neural networks ("Recurrent Continuous Translation Models", 2013).
They introduced "sequence to sequence" (or "seq2seq") learning, a new paradigm in supervised learning, but the length of the output sequence was constrained to be the same as the length of the input.
Bengio's group (led by Kyunghyun Cho and Dzmitry Bahdanau, on loan from Jacobs University Bremen) established the standard "encoder-decoder" model of machine translation: an encoder neural network reads and encodes a source sentence into a fixed-length vector, and then a decoder neural network outputs a translation from that encoded vector ("Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", 2014). This project also introduced a simpler alternative to the Long Short-Term Memory (LSTM) unit in recurrent architectures, later named the "gated recurrent unit" (GRU).
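The GRU replaces the three gates and separate memory cell of the LSTM with just two gates. A minimal sketch of a single GRU step, with untrained random weights and toy dimensions (a paraphrase of the standard formulation, not the paper's implementation):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4   # tiny hidden/input size, for illustration only

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One (input, recurrent) weight pair per gate; untrained random values.
Wz, Uz = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # update gate
Wr, Ur = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # reset gate
Wh, Uh = rng.normal(size=(d, d)), rng.normal(size=(d, d))  # candidate state

def gru_step(x, h):
    z = sigmoid(Wz @ x + Uz @ h)              # how much of the state to renew
    r = sigmoid(Wr @ x + Ur @ h)              # how much of the past to consult
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate new hidden state
    return (1 - z) * h + z * h_tilde          # interpolate old and new state

h = np.zeros(d)
for x in rng.normal(size=(5, d)):   # run over a 5-word toy "sentence"
    h = gru_step(x, h)
print(h.shape)  # (4,)
```

The update gate z plays roughly the role that the input and forget gates play in an LSTM, which is why the GRU is the "simpler alternative".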
The problem with Bengio's original encoder-decoder approach is that it compressed all the information of the sentence into a fixed-length vector, a fact that obviously caused a decline in accuracy on longer sentences. Bahdanau then added "attention" to the encoder-decoder framework ("Neural Machine Translation by Jointly Learning to Align and Translate", 2014): attention is a mechanism that improves the ability of the network to inspect arbitrary elements of the source sentence. This improved architecture showed that neural networks really could be applied to translating texts, because the attention mechanism overcame the original limitation and enabled neural networks to process long sentences. Bahdanau's model came to be known as "RNNsearch", and its attention mechanism as "additive attention".
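In the additive scheme, a small scoring network compares each encoder hidden state with the current decoder state, and a softmax over the scores yields a weighted average (the "context") of the encoder states. A sketch with untrained random parameters and invented sizes:

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 8, 6                       # hidden size and source length (illustrative)

# Untrained parameters of the Bahdanau-style scoring network.
W_s = rng.normal(scale=0.1, size=(d, d))   # applied to the decoder state
W_h = rng.normal(scale=0.1, size=(d, d))   # applied to each encoder state
v = rng.normal(scale=0.1, size=d)

H = rng.normal(size=(T, d))       # encoder hidden states, one per source word
s = rng.normal(size=d)            # current decoder state

# score_j = v . tanh(W_s s + W_h h_j): "additive" because of the sum inside
scores = np.array([v @ np.tanh(W_s @ s + W_h @ h) for h in H])
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over source words
context = weights @ H             # weighted average of encoder states
print(context.shape)  # (8,)
```

The decoder consumes this context vector at every output step, so no single fixed-length vector ever has to carry the whole sentence.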
The desire to add "attention" skills to a neural network dates from the 1980s, when neuroscience began to elucidate how the brain makes sense of visual scenes so quickly. In 1986 the neuroscientists Christof Koch and Shimon Ullman proposed that the primate brain creates a visual "saliency map", a map that encodes the importance of each element in the visual space. This led in 1998 to the attention-based model of Laurent Itti, a student of Christof Koch at Caltech ("A Model of Saliency-Based Visual Attention for Rapid Scene Analysis", 1998). Attention was introduced into image-recognition tasks by Volodymyr Mnih at DeepMind ("Recurrent Models of Visual Attention", June 2014), whose "recurrent attention model" (RAM) was applied a few months later to object recognition by Jimmy Lei Ba at the University of Toronto ("Multiple Object Recognition with Visual Attention", December 2014).
Looping back to computer vision, the field that had jumpstarted deep learning, this attention-based technique was used by another of Bengio's students, Kelvin Xu, to automatically generate captions for images ("Show, Attend and Tell", 2015).
Bengio's student Kyunghyun Cho showed that the same architecture of gated recurrent neural networks, convolutional neural networks and an attention mechanism dramatically improved performance in multiple tasks: machine translation, image caption generation and speech recognition ("Describing Multimedia Content using Attention-based Encoder-Decoder Networks", 2015).
Some other attention mechanisms were introduced a few months later by Christopher Manning's student Minh-Thang Luong at Stanford ("Effective Approaches to Attention-based Neural Machine Translation", 2015), notably the "dot-product" (or multiplicative) mechanism.
Luong's multiplicative attention proved to be much faster and more efficient than Bahdanau's additive attention.
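The speed difference is easy to see in a sketch: the multiplicative score needs no extra scoring network, just one matrix-vector product over all source positions at once (toy sizes and untrained random values again):

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 8, 6                      # illustrative hidden size and source length
H = rng.normal(size=(T, d))      # encoder hidden states
s = rng.normal(size=d)           # current decoder state

# Multiplicative (dot-product) scores: score_j = h_j . s, computed for
# all source positions with a single matrix-vector product.
scores = H @ s
weights = np.exp(scores) / np.exp(scores).sum()   # softmax over source words
context = weights @ H            # weighted average of encoder states
print(context.shape)  # (8,)
```

One matrix product replaces T evaluations of a small neural network, which is why the multiplicative variant parallelizes so well on modern hardware.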
"Attention" is, however, a misnomer: the purpose of human attention is to speed up the process, if at the cost of accuracy, whereas "attention" in neural networks is a complex algorithm that, at every step, looks back at the input (or, better, at the hidden state of the encoder, a layer that captures the significant dependencies). The advantage for a neural network is that it doesn't have to encode all information of the input into one fixed-length vector. "Attention" provides flexibility but not necessarily agility.
Phil Blunsom's group at Oxford University used an attention-augmented LSTM network (trained with almost 100,000 articles from the CNN website and more than 200,000 articles from the Daily Mail website) to read a text and then produce an answer to a question ("Teaching Machines to Read and Comprehend", 2015). This Attentive Reader was a generalization of Weston's memory networks for question answering.
A couple of years later, Christopher Manning's student Danqi Chen, while interning at Facebook, developed DrQA ("Reading Wikipedia to Answer Open-Domain Questions", 2017), an evolution of the Attentive Reader but using a method called "distant supervision" invented by Dan Jurafsky's team at Stanford ("Distant Supervision for Relation Extraction Without Labeled Data", 2009).
Scene understanding (what is going on in a picture, which objects are represented and what they are doing) is easy for animals but hard for machines. "Vision as inverse graphics" is a way to understand a scene by attempting to generate it: what caused these objects to be there, in those positions? The program has to generate the lines and circles that constitute the scene. Once the program has discovered how to generate the scene, it can reason about it and find out what the scene is about. This approach reverse-engineers the physical process that produced the scene: computer vision is the "inverse" of computer graphics. Therefore the "vision as inverse graphics" method involves a generator of images followed by a predictor of objects. The prediction is inference. This method harkens back to the Swedish statistician Ulf Grenander's work in the 1970s.
After DRAW, DeepMind (Ali Eslami, Nicolas Heess and others) turned to scene understanding. Their AIR ("Attend-Infer-Repeat", 2016) model, which was again a combination of variational inference and deep learning, inferred objects in images by treating inference as a repetitive process, implemented as an LSTM that processed (i.e., attended to) one object at a time.
Lukasz Romaszko at the University of Edinburgh later improved this idea with his Probabilistic HoughNets ("Vision-as-Inverse-Graphics", 2017), similar to the "de-rendering" used by Jiajun Wu at MIT ("Neural Scene De-rendering", 2017).
Ali Eslami and Danilo Rezende at DeepMind developed an unsupervised model to derive 3D structures from 2D images via probabilistic inference ("Unsupervised Learning of 3D Structure from Images", 2016). Based on that work, in June 2018 they introduced a whole new paradigm: the Generative Query Network (GQN). The goal was to have a neural network learn the layout of a room after observing it from different perspectives, and then have it display the scene viewed from a novel perspective. The system was a combination of a representation network (that learns a description of the scene, counting, localizing and classifying objects) and a generation network (that renders the scene from the new viewpoint).
In 2014, at the same time that Cho and Bahdanau were refining the encoder-decoder framework, Ilya Sutskever, Oriol Vinyals and Quoc Le at Google solved the "sequence-to-sequence problem" of deep learning using a Long Short-Term Memory network ("Sequence to Sequence Learning with Neural Networks", 2014), so that the length of the input sequence doesn't have to be the same as the length of the output. Sutskever, Vinyals and Le trained a recurrent neural network that was then able to read a sentence in one language, produce a semantic representation of its meaning, and generate a translation in another language, via another encoder-decoder architecture.
The crowning achievement of neural machine translation was Google's neural machine translation system (GNMT) of 2016, based on the Sutskever-Vinyals-Le model and on the attention technique pioneered by Dzmitry Bahdanau's bidirectional RNN (BiRNN) at Jacobs University Bremen in Germany to improve the accuracy of machine translation ("Neural Machine Translation by Jointly Learning to Align and Translate", 2014). This Google neural translation system consisted of a deep LSTM network with eight encoder layers and eight decoder layers.
Of course, the question is whether these systems that translate one sentence into another based on simple mathematical formulas are actually "understanding" what the sentence says. Kevin Knight's student Xing Shi at the University of Southern California demonstrated that the vector representations of neural machine translation (its hidden layers) capture some morphological and syntactic properties of language ("Does String-Based Neural MT Learn Source Syntax?", 2016), and Yonatan Belinkov at MIT discovered even some semantic properties hidden in those vector representations ("Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks", 2017).
In 2018 Xuedong Huang's team at Microsoft built a system that achieved human parity on the newstest2017 dataset, claiming that the system could translate sentences of news articles from Chinese to English with the same quality and accuracy as a person. The team combined three techniques developed by Microsoft in China: dual learning (2016, in collaboration with Peking University), deliberation networks (2017, in collaboration with the University of Science and Technology of China), and joint training (2018, again in collaboration with the University of Science and Technology of China).
Recurrent neural networks had matured enough that in November 2016 Google switched its translation algorithm to a recurrent neural network and the jump in translation quality was noticeable.
After the successful systems of Sutskever and Bahdanau in 2014, the sequence-to-sequence modeling required by machine translation was typically implemented with recurrent neural networks: use a series of (bi-directional) recurrent neural networks to map an input sequence to a variable-length output sequence. Within two years, however, entirely convolutional architectures for sequence-to-sequence modeling were proposed by Nal Kalchbrenner, now at DeepMind, namely his ByteNet ("Neural Machine Translation in Linear Time", 2016), and by Jonas Gehring in the Facebook team of Yoshua Bengio's former student Yann Dauphin, namely ConvS2S ("Convolutional Sequence to Sequence Learning", 2017). As far as sequence-to-sequence modeling goes, there are at least two advantages of convolutional networks over recurrent ones. One is that their computation can be parallelized, i.e. done faster. The second is that multi-layer convolutional neural networks create hierarchical representations of the sequence, as opposed to the chain structures created by recurrent networks. The lower layers of such hierarchies model local relationships (between nearby items of the sequence) and the higher layers model non-local relationships (between distant items of the sequence). This architecture provides a faster path to relate elements that are in arbitrary positions of the sequence. The distance still matters, of course: the computational "cost" of relating two items increases logarithmically with their distance in ByteNet and linearly in ConvS2S.
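That "cost" can be made concrete by counting roughly how many stacked convolutional layers are needed before one unit can "see" two positions a given distance apart, assuming stride-1 convolutions of a fixed kernel size (the ConvS2S style) versus dilations that double at each layer (the ByteNet style). A back-of-the-envelope sketch:

```python
import math

def layers_plain(distance, k=3):
    """Stride-1 convolutions of kernel size k: the receptive field grows by
    k-1 per layer, so relating positions `distance` apart needs roughly
    distance/(k-1) layers -- linear in the distance."""
    return math.ceil(distance / (k - 1))

def layers_dilated(distance):
    """Dilations doubling per layer (1, 2, 4, ...): the receptive field
    roughly doubles with each layer, so the depth needed grows like
    log2 of the distance."""
    return max(1, math.ceil(math.log2(distance)))

for dist in (4, 64, 1024):
    print(dist, layers_plain(dist), layers_dilated(dist))
# At distance 1024: 512 plain layers versus only 10 dilated layers.
```

The exact constants depend on kernel sizes and padding, but the linear-versus-logarithmic growth is what the two architectures trade on.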
"Translation is not a matter of words only; it is a matter of making intelligible a whole culture" (Anthony Burgess)