(These are excerpts from my book "Intelligence is not Artificial")
Brute-force A.I. doesn't Speak
Most disheartening has been the scant progress in Natural Language Processing (NLP) since 1970.
Following are examples of exchanges between a person and Terry Winograd's program SHRDLU at MIT in 1970:
Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE HOLDING AND PUT IT INTO THE BOX.
Computer: BY "IT", i ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE i AM HOLDING.
Person: DOES THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS SUPPORT ANYTHING GREEN?
Computer: YES, THE GREEN PYRAMID.
Person: WHAT COLOR IS IT?
Computer: BY "IT", i ASSUME YOU MEAN THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS. RED
Person: DOES A STEEPLE
Computer interrupting: SORRY, i DON'T KNOW THE WORD "STEEPLE".
Person: A "STEEPLE" IS A STACK WHICH CONTAINS TWO GREEN CUBES AND A PYRAMID.
Computer: i UNDERSTAND.
Person: ARE THERE ANY STEEPLES NOW?
It beats some of today's "personal digital assistants".
We virtually abandoned the idea of having machines understand and speak our language and resorted to the opposite: make humans speak like machines. That is what you do when you talk on the phone with a machine that asks you for numbers; and that is what you do when you talk to your smartphone's "assistant" according to the rules of that assistant. Nothing illustrates the gap between machine intelligence and human intelligence better than comparing how much of a language a toddler learns in two years and how little of any language all machines ever invented have learned in more than 60 years.
IBM's Watson, which debuted with much fanfare in 2011 on a quiz show competing against human experts, was actually not capable of understanding the spoken questions: the questions were delivered to Watson as text files, not as spoken questions (a trick which, of course, distorted the whole game).
The most popular search engines are still keyword-based. Progress in search engines has been mainly in indexing and ranking webpages, not in understanding what the user is looking for nor in understanding what the webpage says. Try for example "Hey i had a discussion with a friend about whether Qaddafi wanted to get rid of the US dollar and he was killed because of that" and see what you get (as i write these words, Google returns first of all my own website with the exact words of that sentence and then a series of pages that discuss the assassination of the US ambassador in Libya). Communicating with a search engine is a far (far) cry from communicating with human beings.
Products that were originally marketed as able to understand natural language, such as SIRI for Apple's iPhone, have bitterly disappointed their users. These products understand only the most elementary of sounds, and only sometimes, just like their ancestors of decades ago. Promising that a device will be able to translate speech on the fly (like Samsung did with its Galaxy S4 in 2013) is a good way to embarrass yourself and to lose credibility among your customers.
The status of natural language processing is well represented by antispam software that is totally incapable of understanding whether an email is spam or not based on its content while we can tell in a split second.
During the 1960s, following (and mostly reacting against) Noam Chomsky's "Syntactic Structures" (1957), which heralded a veritable linguistic revolution, a lot of work in A.I. was directed towards "understanding" natural-language sentences, notably Charles Fillmore's case grammar at Ohio State University (1967), Roger Schank's conceptual dependency theory at Stanford (1969, later at Yale), William Woods' augmented transition networks at Harvard (1970), Yorick Wilks' preference semantics at Stanford (1973), and semantic grammars, an evolution of ATNs by Dick Burton at BBN for one of the first "intelligent tutoring systems", Sophie (started in 1973 at UC Irvine by John Seely Brown and Burton). Unfortunately, the results were crude.
Schank and Wilks were emblematic of the revolt against Chomsky's logical approach, which did not work well in computational systems. Schank and Wilks turned to meaning-based approaches to natural language processing.
Terry Winograd's SHRDLU and Woods' LUNAR (1973), both based on Woods' theories, were limited to very narrow domains and short sentences.
Roger Schank moved to Yale in 1974 and attacked the Chomsky-ian model that language comprehension is all about grammar and logic thinking. Schank instead viewed language as intertwined with cognition, as Otto Selz and other cognitive psychologists had argued 50 years earlier. Minsky's "frame" and Schank's "script" (all variations on Selz's "schema") assumed a unity of perception, recognition, reasoning, understanding and memory: memory has the passive function of remembering and the active function of predicting; the comprehension of the world and its categorization proceed together; knowledge is stories.
Schank's "conceptual dependency" theory, whose tenet is that two sentences whose meaning is equivalent must have the same representation, aimed to replace Noam Chomsky's focus on syntax with a focus on concepts.
We humans use all sorts of complicated sentences, some of them very long, some of them nested into each other.
Little was done in discourse analysis before Eugene Charniak's thesis at MIT ("Towards a Model of Children's Story Comprehension", 1972), Indian-born Aravind Joshi's "Tree Adjunct Grammars" (1975) at the University of Pennsylvania, and Jerry Hobbs' work at SRI International ("Computational Approach to Discourse Analysis", 1976).
Then a handful of important theses established the field. One originated from SRI, Barbara Grosz's thesis at UC Berkeley ("The Representation and Use of Focus in a System for Understanding Dialogs", 1977). And two came from Bolt Beranek and Newman, where William Woods had pioneered natural-language processing: Bonnie Webber's thesis at Harvard ("Inference in an Approach to Discourse Anaphora", 1978) and Candace Sidner's thesis at MIT ("Towards a Computational Theory of Definite Anaphora Comprehension in English Discourse", 1979).
In 1974 Marvin Minsky at MIT introduced the "frame" for representing a stereotyped situation ("A Framework for Representing Knowledge", 1974) and in 1975 for the same purpose Roger Schank, who had already designed MARGIE (1973, which, believe it or not, stands for "Memory, Analysis, Response Generation, and Inference on English"), in collaboration with Stanford student Chris Riesbeck, and psychologist and social scientist Robert Abelson at Yale introduced the script ("Scripts, Plans, and Knowledge", 1975). Schank's students built a number of systems that used scripts to understand stories: Richard Cullingford's Script Applier Mechanism (SAM) of 1975; Robert Wilensky's PAM (Plan Applier Mechanism) of 1976; Wendy Lehnert's question-answering system QUALM of 1977; Janet Kolodner's CYRUS (Computerized Yale Retrieval and Updating System) of 1978, that learned events in the life of two politicians; Michael Lebowitz's IPP (Integrated Partial Parser) of 1978, that in order to read newspaper stories about international terrorism introduced an extension of the script, the MOP (Memory Organization Packet); Jaime Carbonell's Politics of 1978, that simulated political beliefs; Gerald DeJong's FRUMP (Fast Reading Understanding and Memory Program) of 1979, an evolution of SAM for producing summaries of newspaper stories; BORIS (Better Organized Reading and Inference System) of 1980, developed by Lehnert and her student Michael Dyer, a story-understanding and question-answering system that combined the MOP and a new extension, the Thematic Affect Unit (TAU). Starting in 1978 these systems were grouped under the general heading of "case-based reasoning". Meanwhile, Steven Rosenberg at MIT built a model to understand stories based on Minsky's frames.
In particular, Jaime Carbonell's PhD dissertation at Yale University ("Subjective Understanding", 1979) can be viewed as a precursor of the field that would be called "sentiment analysis".
It is important to realize that, despite the hype and the papers published in reputable (?) A.I. magazines, none of these systems ever worked. They "worked" only in a very narrow domain and they "understood" pretty much only what was hardwired into them by the software engineer. That's why they were never used twice. They were certainly steps forward in theoretical research, but very humble and very short steps. In 2017 Schank published on his blog an angry article titled "The fraudulent claims made by IBM about Watson and A.I." that started out with the sentence "They are not doing cognitive computing no matter how many times they say they are" but perhaps that's precisely what Schank was doing two generations earlier.
These computer scientists, as well as philosophers such as Hans Kamp in the Netherlands (founder of Discourse Representation Theory in 1981), attempted a more holistic approach to understanding "discourse", not just individual sentences; and this resulted in domain-independent systems such as the Core Language Engine, developed in 1988 by Hiyan Alshawi's team at SRI in Britain.
Meanwhile, Melvin Maron's pioneering work on statistical analysis of text at UC Berkeley ("On Relevance, Probabilistic Indexing, and Information Retrieval", 1960) was being resurrected by Gerard Salton at Cornell University (the project leader of SMART, System for the Mechanical Analysis and Retrieval of Text, since 1965). This technique, true to the motto "You shall know a word by the company it keeps" (1957) by the British linguist John Rupert Firth, represented a text as a "bag" of words, disregarding the order of the words and even the grammatical relationships. Surprisingly, this method worked better than the complex grammar-based approaches. It quickly came to be known as the "bag-of-words model" for language analysis. Technically speaking, it was text classification using naive Bayes classifiers. In 1998 Thorsten Joachims at the University of Dortmund replaced the naive Bayes classifier with the method of statistical learning called "Support Vector Machines", invented by Vladimir Vapnik at Bell Labs in 1995, and other improvements followed. The bag-of-words model became the dominant paradigm for natural language processing, but its statistical approach still failed to grasp the meaning of a sentence.
Nor did it have any idea of why a sentence was where it was and what it did there. Barbara Grosz at SRI International built an influential framework to study the sequence of sentences, i.e. the whole discourse, the "Centering" system ("Providing a Unified Account of Definite Noun Phrases in Discourse", 1983), later refined when she moved to Harvard ("A Framework for Modelling the Local Coherence of Discourse", 1986, but unpublished until 1995).
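To make the bag-of-words idea described above concrete, here is a minimal sketch of a naive Bayes text classifier in Python. It is only an illustration: the class name, the four training sentences and the labels "spam" and "ham" are invented for this example, and add-one smoothing is an assumption of the sketch, not a detail taken from the systems mentioned in the text.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Reduce a text to word counts; the order of the words is discarded."""
    return Counter(text.lower().split())

class NaiveBayes:
    def __init__(self):
        self.word_counts = {}         # label -> Counter of words seen with that label
        self.doc_counts = Counter()   # label -> number of training texts
        self.vocabulary = set()

    def train(self, text, label):
        bag = bag_of_words(text)
        self.word_counts.setdefault(label, Counter()).update(bag)
        self.doc_counts[label] += 1
        self.vocabulary.update(bag)

    def classify(self, text):
        bag = bag_of_words(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # log prior of the label
            score = math.log(self.doc_counts[label] / total_docs)
            # log likelihood of each word, with add-one smoothing (an assumption)
            denominator = sum(counts.values()) + len(self.vocabulary)
            for word, n in bag.items():
                score += n * math.log((counts[word] + 1) / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Toy training data, invented for illustration
clf = NaiveBayes()
clf.train("win money now claim your free prize", "spam")
clf.train("free money offer click now", "spam")
clf.train("meeting agenda for monday project review", "ham")
clf.train("please review the project report before the meeting", "ham")

print(clf.classify("claim your free money"))
print(clf.classify("money free your claim"))  # same bag of words, same answer
```

The second call scrambles the word order and gets the same label as the first, which is precisely the point: the classifier counts words and never looks at what the sentence says.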
Perhaps the first major progress in machine translation since Systran was demonstrated in 1973 by Yorick Wilks at Stanford. His system was based on something similar to conceptual dependency, "preference semantics" ("An Artificial Intelligence Approach to Machine Translation", 1973).
The method that did improve the quality of automatic translation is the statistical one, pioneered in the 1980s by Fred Jelinek's team at IBM and first implemented there by Peter Brown's team (the Candide system of 1992). When there are plenty of examples of (human-made) translations, the computer can perform a simple statistical analysis and pick the most likely translation. Note that the computer isn't even trying to understand the sentence: it has no clue whether the sentence is about cheese or parliamentary elections. It has "learned" that those few words in that combination are usually translated in such and such a way by humans. The statistical approach works wonders when there are thousands of (human-made) translations of a sentence, for example between Italian and English. It works awfully when there are fewer, like in the case of Chinese to English.
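The principle of picking the most likely translation from human-made examples can be sketched in a few lines of Python. This is a deliberately crude illustration that counts whole phrases; it is not the method of the IBM systems, which relied on word-alignment models. The tiny Italian-English corpus and the function names are invented for the example.

```python
from collections import Counter, defaultdict

# Toy "parallel corpus" of human translations, invented for illustration
parallel_corpus = [
    ("il formaggio", "the cheese"),
    ("il formaggio", "the cheese"),
    ("il formaggio", "cheese"),
    ("le elezioni", "the elections"),
    ("le elezioni", "the election"),
    ("le elezioni", "the elections"),
]

def build_phrase_table(pairs):
    """Count how often each source phrase was translated in each way."""
    table = defaultdict(Counter)
    for source, target in pairs:
        table[source][target] += 1
    return table

def translate(phrase, table):
    """Return the most frequent human translation and its relative frequency."""
    candidates = table.get(phrase)
    if not candidates:
        return None, 0.0   # no examples: the method has nothing to go on
    target, count = candidates.most_common(1)[0]
    return target, count / sum(candidates.values())

table = build_phrase_table(parallel_corpus)
print(translate("il formaggio", table))
print(translate("la casa", table))
```

Nothing in the table knows that one phrase is about cheese and the other about elections, and a phrase with no human translations on record simply cannot be translated.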
Neural Machine Translation
Yoshua Bengio at the University of Montreal started working on neural networks for natural language processing in 2000 ("A Neural Probabilistic Language Model", 2001). Bengio's neural language models learn to convert a word symbol into a vector within a meaning space. The word vector is the semantic equivalent of an image vector: instead of extracting features of the image, it extracts the semantic features of the word to predict the next word in the sentence. Bengio realized something peculiar about word vectors learned from a text by his neural networks: these word vectors represent precisely the kind of linguistic regularities and patterns that define the use of a language, the kind of things that one finds in the grammar, the lexicon, the thesaurus, etc; except that they are not separate databases but just one organic body of expertise about the language. Firth again: "you shall know a word by the company it keeps".
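A minimal sketch of the forward pass of such a neural language model, in plain Python, may help. It assumes a five-word vocabulary, tiny vector sizes and random untrained weights, all invented for illustration; a real model learns the word vectors and the other weights from a large corpus, and this is not Bengio's actual code.

```python
import math
import random

random.seed(0)  # fixed seed so the untrained weights are reproducible

vocab = ["the", "cat", "sat", "on", "mat"]
word_to_id = {w: i for i, w in enumerate(vocab)}

EMBED_DIM, HIDDEN_DIM, CONTEXT = 3, 4, 2   # illustrative sizes

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embeddings = random_matrix(len(vocab), EMBED_DIM)         # one vector per word
W_hidden = random_matrix(HIDDEN_DIM, CONTEXT * EMBED_DIM)
W_out = random_matrix(len(vocab), HIDDEN_DIM)

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(context_words):
    """Probability of each vocabulary word following a fixed-size context."""
    if len(context_words) != CONTEXT:
        raise ValueError("this feed-forward sketch needs exactly CONTEXT words")
    # Look up the vector of each context word and concatenate them
    x = [value for w in context_words for value in embeddings[word_to_id[w]]]
    hidden = [math.tanh(h) for h in matvec(W_hidden, x)]
    return dict(zip(vocab, softmax(matvec(W_out, hidden))))

probs = next_word_distribution(["the", "cat"])
print(probs)
```

The rows of `embeddings` are the word vectors. Training adjusts them together with the other weights, so that words used in similar contexts end up with similar vectors.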
In 2005 Bengio developed a method to solve the "curse of dimensionality" in natural language processing, the problem of training a network with the particular data that are vocabularies ("Hierarchical Probabilistic Neural Network Language Model", 2005). After Bengio's pioneering work, several others applied deep learning to natural language processing, notably Ronan Collobert and Jason Weston at NEC Labs in Princeton ("A Unified Architecture for Natural Language Processing", 2008), one of the earliest multitask deep networks, and capable of learning recursive structures. Bengio's mixed approach (neural networks and statistical analysis) was further expanded by Andrew Ng's and Christopher Manning's student Richard Socher at Stanford with applications to natural language processing ("Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks", 2010), which improved the parser developed by Manning with Dan Klein ("Accurate Unlexicalized Parsing", 2003). The result was a neural network that learns recursive structures, just like Collobert's and Weston's. Socher introduced a language-parsing algorithm based on recursive neural networks that Socher also reused for analyzing and annotating visual scenes ("Parsing Natural Scenes and Natural Language with Recursive Neural Networks", 2010).
However, Bengio's neural network was a feed-forward network, which means that it could only use a fixed number of preceding words when predicting the next one. Czech student Tomas Mikolov of the Brno University of Technology, working at Johns Hopkins University in Sanjeev Khudanpur's team, showed that, instead, a recurrent neural network is able to process sentences of any length ("Recurrent Neural Network-based Language Model," 2010). An RNN transforms a sentence into a vector representation, or vice versa. This enables translation from one language to another: an RNN (the encoder) can transform the sentence of a language into a vector representation that another RNN (the decoder) can transform into the sentence of another language. (Mikolov was hired by Google in 2012 and by Facebook in 2014, and in between in 2013 he invented the "skip-gram" method for learning vector representations of words from large amounts of unstructured text data).
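The way a recurrent network folds a sentence of any length into one vector can be sketched as follows. This is a bare Elman-style recurrence in plain Python with random untrained weights; the vocabulary, the sizes and the names are assumptions made for the illustration, not Mikolov's implementation.

```python
import math
import random

random.seed(2)  # fixed seed so the untrained weights are reproducible

vocab = ["the", "cat", "sat", "on", "mat"]
EMBED_DIM, STATE_DIM = 3, 4   # illustrative sizes

def random_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

embeddings = dict(zip(vocab, random_matrix(len(vocab), EMBED_DIM)))
W_input = random_matrix(STATE_DIM, EMBED_DIM)   # word vector -> state
W_state = random_matrix(STATE_DIM, STATE_DIM)   # previous state -> state

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def encode(sentence):
    """Fold a sentence of any length into one fixed-size state vector."""
    state = [0.0] * STATE_DIM
    for word in sentence.split():
        from_word = matvec(W_input, embeddings[word])
        from_past = matvec(W_state, state)   # the loop that makes it "recurrent"
        state = [math.tanh(a + b) for a, b in zip(from_word, from_past)]
    return state

short_vector = encode("the cat sat")
long_vector = encode("the cat sat on the mat")
print(len(short_vector), len(long_vector))
```

Both sentences come out as vectors of the same size, and, unlike a bag of words, the result depends on the order in which the words were read.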
In 2013 Nal Kalchbrenner and Phil Blunsom of Oxford University attempted statistical machine translation based purely on neural networks ("Two Recurrent Continuous Translation Models").
They introduced "sequence to sequence" (or "seq2seq") learning, a new paradigm in supervised learning, but the length of the output sequence of characters was limited to be the same as the length of the input.
Bengio's group (led by Kyunghyun Cho and Dzmitry Bahdanau on loan from Jacobs University Bremen) established the standard "encoder-decoder" model of machine translation: an encoder neural network reads and encodes a source sentence into a fixed-length vector, and then a decoder outputs a translation from the encoded vector ("Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", 2014). This project also introduced a simpler alternative to Long Short-Term Memory (LSTM) in recurrent architectures, later named "gated recurrent unit" (GRU).
The problem with Bengio's original encoder-decoder approach is that it represented all the information of the sentence into a fixed-length vector, a fact that obviously caused a decline in accuracy with longer sentences. Bahdanau then added "attention" to the encoder-decoder framework ("Neural Machine Translation by Jointly Learning to Align and Translate", 2014): attention is a mechanism that improves the ability of the network to inspect arbitrary elements of the sentence. This improved architecture really showed that neural networks could be applied to translating texts because the attention mechanism overcame the original limitation and enabled neural networks to process long sentences. The attention mechanism developed by Bahdanau came to be known as "RNNSearch" or simply "additive attention".
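Additive attention can be sketched in plain Python as follows. The scoring formula (a small tanh layer applied to the decoder state and to each encoder state, then a softmax over the scores) follows the usual textbook description of Bahdanau's mechanism; the sizes, the random untrained weights and all names are assumptions made for the illustration.

```python
import math
import random

random.seed(1)  # fixed seed so the untrained weights are reproducible

DIM = 3       # size of encoder and decoder states (illustrative)
ATT_DIM = 4   # size of the attention hidden layer (illustrative)

def random_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

W_dec = random_matrix(ATT_DIM, DIM)
W_enc = random_matrix(ATT_DIM, DIM)
v = [random.uniform(-1, 1) for _ in range(ATT_DIM)]

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step."""
    projected_dec = matvec(W_dec, decoder_state)
    scores = []
    for h in encoder_states:
        projected_enc = matvec(W_enc, h)
        hidden = [math.tanh(a + b) for a, b in zip(projected_dec, projected_enc)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)   # how much to look at each source position
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(DIM)]
    return weights, context

# Encoder states of a five-word sentence (random placeholders)
sentence_states = random_matrix(5, DIM)
weights, context = additive_attention([0.1, -0.2, 0.3], sentence_states)
print(weights, context)
```

The decoder receives a fresh context vector at every step, computed over all the encoder states, so the sentence no longer has to be squeezed into a single fixed-length vector.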
The desire to add "attention" skills to a neural network dates from the 1980s when neuroscience began to elucidate how the brain makes sense of visual scenes so quickly. In 1986 the neuroscientists Christof Koch and Shimon Ullman proposed that the primate brain creates a visual "saliency map". A saliency map, basically, encodes the importance of each element in the visual space. This led in 1998 to the attention-based model of Laurent Itti, a student of Christof Koch at Caltech ("A Model of Saliency-based Visual-attention for Rapid Scene Analysis", 1998). Attention was introduced in image recognition tasks by Volodymyr Mnih at DeepMind ("Recurrent Models of Visual Attention", June 2014) whose "recurrent attention model" (RAM) was applied a few months later to object recognition by Jimmy Lei Ba at the University of Toronto ("Multiple Object Recognition with Visual Attention", December 2014).
Looping back to the field of computer vision that had jumpstarted the field, this attention-based technique was used by Bengio's other student Kelvin Xu to automatically generate captions of images ("Show, Attend and Tell", 2015).
Bengio's student Kyunghyun Cho showed that the same architecture of gated recurrent neural networks, convolutional neural networks and attention mechanism dramatically improved performance in multiple tasks: machine translation, image caption generation and speech recognition ("Describing Multimedia Content using Attention-based Encoder-Decoder Networks", 2015).
Some other attention mechanisms were introduced a few months later by Christopher Manning's student Minh-Thang Luong at Stanford ("Effective Approaches to Attention-based Neural Machine Translation", 2015), notably the "dot-product" (or multiplicative) mechanism.
Luong's multiplicative attention proved to be much faster and more efficient than Bahdanau's additive attention.
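For comparison, here is a minimal sketch of the "dot" variant of multiplicative attention in plain Python. The score of each source position is just the dot product between the decoder state and that encoder state, with no extra weight matrices and no tanh layer, which gives an idea of why it is cheaper to compute. The vectors are hand-picked for illustration and the function names are invented.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(decoder_state, encoder_states):
    """Return (weights, context) using plain dot products as scores."""
    scores = [sum(d * e for d, e in zip(decoder_state, h))
              for h in encoder_states]
    weights = softmax(scores)
    size = len(decoder_state)
    context = [sum(w * h[i] for w, h in zip(weights, encoder_states))
               for i in range(size)]
    return weights, context

# Three hand-picked encoder states: aligned with, orthogonal to,
# and opposite to the decoder state
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights, context = dot_product_attention([1.0, 0.0], encoder_states)
print(weights, context)
```

The encoder state that points in the same direction as the decoder state gets the largest weight, and the one that points in the opposite direction gets the smallest.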
"Attention" is, however, a misnomer: the purpose of human attention is to speed up the process, if at the cost of accuracy, whereas "attention" in neural networks is a complex algorithm that, at every step, looks back at the input (or, better, at the hidden state of the encoder, a layer that captures the significant dependencies). The advantage for a neural network is that it doesn't have to encode all information of the input into one fixed-length vector. "Attention" provides flexibility but not necessarily agility.
Phil Blunsom's group at Oxford University used an attention-augmented LSTM network (trained with almost 100,000 articles from the CNN and more than 200,000 articles from the Daily Mail websites) to read a text and then produce an answer to a question ("Teaching Machines to Read and Comprehend", 2015). This Attentive Reader was a generalization of Weston's memory networks for
Scene understanding (what is going on in a picture, which objects are represented and what are they doing) is easy for animals but hard for machines. "Vision as inverse graphics" is a way to understand a scene by attempting to generate it: what caused these objects to be there and in those positions? The program has to generate the lines and circles that constitute the scene. Once the program has discovered how to generate the scene, it can reason about it and find out what the scene is about. This approach reverse-engineers the physical process that produced the scene: computer vision is the "inverse" of computer graphics. Therefore the "vision as inverse graphics" method involves a generator of images and then a predictor of objects. The prediction is inference. This method harkens back to the Swedish statistician Ulf Grenander's work in the 1970s.
After DRAW, DeepMind (Ali Eslami, Nicolas Heess and others) turned to scene understanding. Their AIR ("Attend-Infer-Repeat", 2016) model, which was again a combination of variational inference and deep learning, inferred objects in images by treating inference as a repetitive process, implemented as an LSTM that processed (i.e., attended to) one object at a time.
Lukasz Romaszko at the University of Edinburgh later improved this idea with his Probabilistic HoughNets ("Vision-as-Inverse-Graphics", 2017), similar to the "de-rendering" used by Jiajun Wu at MIT ("Neural Scene De-rendering", 2017).
Ali Eslami and Danilo Rezende at DeepMind developed an unsupervised model to derive 3D structures from 2D images of them via probabilistic inference ("Unsupervised Learning of 3D Structure from Images", 2016). Based on that work, in June 2018 they introduced a whole new paradigm: the Generative Query Network (GQN). The goal was to have a neural network learn the layout of a room after observing it from different perspectives, and then have it display the scene viewed from a novel perspective. The system was a combination of a representation network (that learns a description of the scene, counting, localizing and classifying objects) and a generation network (that produces a new description of the scene).
In 2014, at the same time that Cho and Bahdanau were refining the encoder-decoder framework, Ilya Sutskever, Oriol Vinyals and Quoc Le at Google solved the "sequence-to-sequence problem" of deep learning using a Long Short-Term Memory ("Sequence to Sequence Learning with Neural Networks"), so that the length of the input sequence of characters doesn't have to be the same as the length of the output. Sutskever, Vinyals and Le trained a recurrent neural network that was then able to read a sentence in one language, produce a semantic representation of its meaning, and generate a translation in another language, via another encoder-decoder architecture.
The crowning achievement of neural machine translation was Google's Neural Machine Translation system (GNMT) of 2016, based on the Sutskever-Vinyals-Le model and on the attention technique pioneered by Dzmitry Bahdanau's BiRNN (bidirectional RNN) at Jacobs University Bremen in Germany to improve the speed of machine translation ("Neural Machine Translation by Jointly Learning to Align and Translate", 2015). This Google neural translation machine consisted of a deep LSTM network with eight encoder layers and eight decoder layers.
Of course the question is whether these systems that translate one sentence into another sentence based on simple mathematical formulas are actually "understanding" what the sentence says. Kevin Knight's student Xing Shi at the University of Southern California demonstrated that the vector representations of neural machine translation (their hidden layers) capture some morphological and syntactic properties of language ("Does String-Based Neural MT Learn Source Syntax?", 2016), and Yonatan Belinkov at MIT even discovered some semantic properties hidden in those vector representations ("Evaluating Layers of Representation in Neural Machine Translation on Part-of-Speech and Semantic Tagging Tasks", 2017).
In 2018 Xuedong Huang's team at Microsoft built a system that achieved human parity on the dataset newstest2017 and claimed that such a system was able to translate sentences of news articles from Chinese to English with the same quality and accuracy as a person. The team combined three techniques developed by Microsoft in China: dual learning (2016, in collaboration with Peking University), deliberation networks (2017, in collaboration with the University of Science and Technology of China), and joint training (2018, again in collaboration with the University of Science and Technology of China).
Recurrent neural networks had matured enough that in November 2016 Google switched its translation algorithm to a recurrent neural network and the jump in translation quality was noticeable.
After the successful implementations by Sutskever and Bahdanau in 2014, the sequence-to-sequence modeling required by machine translation was implemented with recurrent neural networks: use a series of (bi-directional) recurrent neural networks to map an input sequence to a variable-length output sequence. Within two years, however, architectures for sequence-to-sequence modeling that were entirely convolutional were proposed by Nal Kalchbrenner, now at DeepMind, namely his ByteNet ("Neural Machine Translation in Linear Time", 2016), and by Jonas Gehring in the Facebook team of Yoshua Bengio's former student Yann Dauphin, namely ConvS2S ("Convolutional Sequence to Sequence Learning", 2017). As far as sequence-to-sequence modeling goes, there are at least two advantages of convolutional networks over recurrent ones. One is that their computation can be parallelized, i.e. done faster. Secondly, multi-layer convolutional neural networks create hierarchical representations of the sequence, as opposed to the chain structures created by recurrent networks. The lower layers of such hierarchies model local relationships (between nearby items of the sequence) and higher layers model non-local relationships (between distant items of the sequence). This architecture provides a faster path to relate elements that are in arbitrary positions of the sequence. The distance still matters, of course: the computational "cost" of relating two items increases logarithmically with their distance in ByteNet and linearly in ConvS2S.
"Translation is not a matter of words only; it is a matter of making intelligible a whole culture" (Anthony Burgess)