(These are excerpts from my book "Intelligence is not Artificial")
Brute-force A.I. doesn't Speak
Most disheartening has been the scant progress in Natural Language Processing (NLP) since 1970.
The following are examples of exchanges between a person and Terry Winograd's program SHRDLU at MIT in 1970:
Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE HOLDING AND PUT IT INTO THE BOX.
Computer: BY "IT", i ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE i AM HOLDING.
Person: DOES THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS SUPPORT ANYTHING GREEN?
Computer: YES, THE GREEN PYRAMID.
Person: WHAT COLOR IS IT?
Computer: BY "IT", i ASSUME YOU MEAN THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS. RED
Person: DOES A STEEPLE
Computer interrupting: SORRY, i DON'T KNOW THE WORD "STEEPLE".
Person: A "STEEPLE" IS A STACK WHICH CONTAINS TWO GREEN CUBES AND A PYRAMID.
Computer: i UNDERSTAND.
Person: ARE THERE ANY STEEPLES NOW?
It beats some of today's "personal digital assistants".
We virtually abandoned the idea of having machines understand and speak our language and resorted to the opposite: making humans speak like machines. That is what you do when you talk on the phone with a machine that asks you for numbers; and that is what you do when you talk to your smartphone's "assistant" according to the rules of that assistant. Nothing illustrates the gap between machine intelligence and human intelligence better than comparing how much of a language a toddler learns in two years and how little of any language all machines ever invented have learned in more than 60 years.
IBM's Watson, which debuted with much fanfare in 2011 on a quiz show competing against human experts, was actually not capable of understanding the spoken questions: the questions were delivered to Watson as text files, not as spoken questions (a trick which, of course, distorted the whole game).
The most popular search engines are still keyword-based. Progress in search engines has been mainly in indexing and ranking webpages, not in understanding what the user is looking for nor in understanding what the webpage says. Try for example "Hey i had a discussion with a friend about whether Qaddafi wanted to get rid of the US dollar and he was killed because of that" and see what you get (as i write these words, Google returns first of all my own website with the exact words of that sentence and then a series of pages that discuss the assassination of the US ambassador in Libya). Communicating with a search engine is a far (far) cry from communicating with human beings.
Products that were originally marketed as able to understand natural language, such as SIRI for Apple's iPhone, have bitterly disappointed their users. These products understand only the most elementary of sounds, and only sometimes, just like their ancestors of decades ago. Promising that a device will be able to translate speech on the fly (like Samsung did with its Galaxy S4 in 2013) is a good way to embarrass yourself and to lose credibility among your customers.
The status of natural language processing is well represented by antispam software that is totally incapable of understanding whether an email is spam or not based on its content while we can tell in a split second.
During the 1960s, following (and mostly reacting against) Noam Chomsky's "Syntactic Structures" (1957), which heralded a veritable linguistic revolution, a lot of work in A.I. was directed towards "understanding" natural-language sentences, notably Charles Fillmore's case grammar at Ohio State University (1967), Roger Schank's conceptual dependency theory at Stanford (1969, later at Yale), William Woods' augmented transition networks (ATNs) at Harvard (1970), Yorick Wilks' preference semantics at Stanford (1973), and semantic grammars, an evolution of ATNs by Dick Burton at BBN for one of the first "intelligent tutoring systems", Sophie (started in 1973 at UC Irvine by John Seely Brown and Burton). Unfortunately, the results were crude.
Schank and Wilks were emblematic of the revolt against Chomsky's logical approach, which did not work well in computational systems. Schank and Wilks turned to meaning-based approaches to natural language processing.
Terry Winograd's SHRDLU and Woods' LUNAR (1973), both based on Woods' theories, were limited to very narrow domains and short sentences.
Roger Schank moved to Yale in 1974 and attacked the Chomskyan model, according to which language comprehension is all about grammar and logical thinking. Schank instead viewed language as intertwined with cognition, as Otto Selz and other cognitive psychologists had argued 50 years earlier. Minsky's "frame" and Schank's "script" (all variations on Selz's "schema") assumed a unity of perception, recognition, reasoning, understanding and memory: memory has the passive function of remembering and the active function of predicting; the comprehension of the world and its categorization proceed together; knowledge is stories.
Schank's "conceptual dependency" theory, whose tenet is that two sentences whose meaning is equivalent must have the same representation, aimed to replace Noam Chomsky's focus on syntax with a focus on concepts.
We humans use all sorts of complicated sentences, some of them very long, some of them nested into each other.
Little was done in discourse analysis before Eugene Charniak's thesis at MIT ("Towards a Model of Children's Story Comprehension", 1972), Indian-born Aravind Joshi's "Tree Adjunct Grammars" (1975) at the University of Pennsylvania, and Jerry Hobbs' work at SRI International ("Computational Approach to Discourse Analysis", 1976).
Then a handful of important theses established the field. One originated from SRI: Barbara Grosz's thesis at UC Berkeley ("The Representation and Use of Focus in a System for Understanding Dialogs", 1977). Two came from Bolt Beranek and Newman, where William Woods had pioneered natural-language processing: Bonnie Webber's thesis at Harvard ("Inference in an Approach to Discourse Anaphora", 1978) and Candace Sidner's thesis at MIT ("Towards a Computational Theory of Definite Anaphora Comprehension in English Discourse", 1979).
In 1974 Marvin Minsky at MIT introduced the "frame" for representing a stereotyped situation ("A Framework for Representing Knowledge", 1974) and in 1975 for the same purpose Roger Schank, who had already designed MARGIE (1973, which, believe it or not, stands for "Memory, Analysis, Response Generation, and Inference on English"), in collaboration with Stanford student Chris Riesbeck, and psychologist and social scientist Robert Abelson at Yale introduced the script ("Scripts, Plans, and Knowledge", 1975). Schank's students built a number of systems that used scripts to understand stories: Richard Cullingford's Script Applier Mechanism (SAM) of 1975; Robert Wilensky's PAM (Plan Applier Mechanism) of 1976; Wendy Lehnert's question-answering system QUALM of 1977; Janet Kolodner's CYRUS (Computerized Yale Retrieval and Updating System) of 1978, that learned events in the life of two politicians; Michael Lebowitz's IPP (Integrated Partial Parser) of 1978, that in order to read newspaper stories about international terrorism introduced an extension of the script, the MOP (Memory Organization Packet); Jaime Carbonell's Politics of 1978, that simulated political beliefs; Gerald DeJong's FRUMP (Fast Reading Understanding and Memory Program) of 1979, an evolution of SAM for producing summaries of newspaper stories; BORIS (Better Organized Reading and Inference System) of 1980, developed by Lehnert and her student Michael Dyer, a story-understanding and question-answering system that combined the MOP and a new extension, the Thematic Affect Unit (TAU). Starting in 1978 these systems were grouped under the general heading of "case-based reasoning". Meanwhile, Steven Rosenberg at MIT built a model to understand stories based on Minsky's frames.
In particular, Jaime Carbonell's PhD dissertation at Yale University ("Subjective Understanding", 1979) can be viewed as a precursor of the field that would be called "sentiment analysis".
It is important to realize that, despite the hype and the papers published in reputable (?) A.I. magazines, none of these systems ever worked. They "worked" only in a very narrow domain and they "understood" pretty much only what was hardwired into them by the software engineer. That's why they were never used twice. They were certainly steps forward in theoretical research, but very humble and very short steps. In 2017 Schank published on his blog an angry article titled "The fraudulent claims made by IBM about Watson and A.I." that started out with the sentence "They are not doing cognitive computing no matter how many times they say they are" but perhaps that's precisely what Schank was doing two generations earlier.
These computer scientists, as well as philosophers such as Hans Kamp in the Netherlands (founder of Discourse Representation Theory in 1981), attempted a more holistic approach to understanding "discourse", not just individual sentences; and this resulted in domain-independent systems such as the Core Language Engine, developed in 1988 by Hiyan Alshawi's team at SRI in Britain.
Meanwhile, Melvin Maron's pioneering work on statistical analysis of text at UC Berkeley ("On Relevance, Probabilistic Indexing, and Information Retrieval", 1960) was being resurrected by Gerard Salton at Cornell University (the project leader of SMART, System for the Mechanical Analysis and Retrieval of Text, since 1965). This technique, true to the motto "You shall know a word by the company it keeps" (1957) by the British linguist John Rupert Firth, represented a text as a "bag" of words, disregarding the order of the words and even the grammatical relationships. Surprisingly, this method was working better than the complex grammar-based approaches. It quickly came to be known as the "bag-of-words model" for language analysis. Technically speaking, it was text classification using naive Bayes classifiers. In 1998 Thorsten Joachims at the University of Dortmund replaced the naive Bayes classifier with the method of statistical learning called "Support Vector Machines", invented by Vladimir Vapnik at Bell Labs in 1995, and other improvements followed. The bag-of-words model became the dominant paradigm for natural language processing but its statistical approach still failed to grasp the meaning of a sentence.
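The bag-of-words idea with a naive Bayes classifier can be sketched in a few lines of Python. This is a minimal illustration, not a reconstruction of SMART or of any historical system: the class name, the add-one smoothing and the four toy training messages are all assumptions made up for the example.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercase, split on whitespace, and count words; word order is discarded."""
    return Counter(text.lower().split())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (an assumed, common variant)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocabulary = set()

    def train(self, text, label):
        bag = bag_of_words(text)
        self.word_counts[label].update(bag)
        self.doc_counts[label] += 1
        self.vocabulary.update(bag)

    def log_score(self, text, label):
        # log P(label) + sum over words of count * log P(word | label)
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocabulary)
        for word, count in bag_of_words(text).items():
            if word not in self.vocabulary:
                continue  # ignore words never seen in training
            score += count * math.log((self.word_counts[label][word] + 1) / denom)
        return score

    def classify(self, text):
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Hypothetical toy training data, invented for illustration only.
classifier = NaiveBayes()
classifier.train("cheap pills buy now", "spam")
classifier.train("win money now", "spam")
classifier.train("meeting agenda for monday", "ham")
classifier.train("lunch on monday", "ham")
```

Note that `bag_of_words("dog bites man")` and `bag_of_words("man bites dog")` are identical: the classifier counts words and never looks at who did what to whom, which is exactly why this approach cannot grasp the meaning of a sentence.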
Perhaps the first major progress in machine translation since Systran was demonstrated in 1973 by Yorick Wilks at Stanford. His system was based on something similar to conceptual dependency, "preference semantics" ("An Artificial Intelligence Approach to Machine Translation", 1973).
The method that did improve the quality of automatic translation is the statistical one, pioneered in the 1980s by Fred Jelinek's team at IBM and first implemented there by Peter Brown's team (the Candide system of 1992). When there are plenty of examples of (human-made) translations, the computer can perform a simple statistical analysis and pick the most likely translation. Note that the computer isn't even trying to understand the sentence: it has no clue whether the sentence is about cheese or parliamentary elections. It has "learned" that those few words in that combination are usually translated in such and such a way by humans. The statistical approach works wonders when there are thousands of (human-made) translations of a sentence, for example between Italian and English. It works awfully when there are fewer, like in the case of Chinese to English.
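The core of the statistical idea (count what human translators did and pick the most frequent choice) can be shown with a deliberately crude sketch. It is not how Candide worked: real statistical systems align individual words and combine several probability models, whereas this example simply looks up whole phrases. The tiny Italian-English corpus and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical parallel corpus of (Italian phrase, human English translation)
# pairs, invented for illustration only.
parallel_corpus = [
    ("buon giorno", "good morning"),
    ("buon giorno", "good morning"),
    ("buon giorno", "good day"),
    ("grazie mille", "thank you very much"),
    ("grazie mille", "thanks a lot"),
    ("grazie mille", "thank you very much"),
]


def build_phrase_table(pairs):
    """Count how often humans translated each source phrase each way."""
    table = defaultdict(Counter)
    for source, target in pairs:
        table[source][target] += 1
    return table


def translate(phrase, table):
    """Return the most frequent human translation and its relative frequency."""
    candidates = table.get(phrase)
    if not candidates:
        return None, 0.0  # no examples: the method has nothing to go on
    best, count = candidates.most_common(1)[0]
    return best, count / sum(candidates.values())


phrase_table = build_phrase_table(parallel_corpus)
```

Calling `translate("buon giorno", phrase_table)` picks "good morning" because two of the three human examples chose it; calling it on a phrase with no examples returns nothing at all. The program never needs to know what a morning is.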
Yoshua Bengio at the University of Montreal started working on neural networks for natural language processing in 2000 ("A Neural Probabilistic Language Model", 2001). Bengio's neural language models learn to convert a word symbol into a vector within a meaning space. The word vector is the semantic equivalent of an image vector: instead of extracting features of the image, it extracts the semantic features of the word to predict the next word in the sentence. Bengio realized something peculiar about word vectors learned from a text by his neural networks: these word vectors represent precisely the kind of linguistic regularities and patterns that define the use of a language, the kind of things that one finds in the grammar, the lexicon, the thesaurus, etc; except that they are not separate databases but just one organic body of expertise about the language. Firth again: "you shall know a word by the company it keeps".
In 2005 Bengio developed a method to solve the "curse of dimensionality" in natural language processing, the problem of training a network with the particular data that are vocabularies ("Hierarchical Probabilistic Neural Network Language Model", 2005). After Bengio's pioneering work, several others applied deep learning to natural language processing, notably Ronan Collobert and Jason Weston at NEC Labs in Princeton ("A Unified Architecture for Natural Language Processing", 2008), one of the earliest multitask deep networks, and capable of learning recursive structures. Bengio's mixed approach (neural networks and statistical analysis) was further expanded by Andrew Ng's and Christopher Manning's student Richard Socher at Stanford with applications to natural language processing ("Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks", 2010), which improved the parser developed by Manning with Dan Klein ("Accurate Unlexicalized Parsing", 2003). The result was a neural network that learns recursive structures, just like Collobert's and Weston's. Socher introduced a language-parsing algorithm based on recursive neural networks that Socher also reused for analyzing and annotating visual scenes ("Parsing Natural Scenes and Natural Language with Recursive Neural Networks", 2010).
However, Bengio's neural network was a feed-forward network, which means that it could only use a fixed number of preceding words when predicting the next one. Czech student Tomas Mikolov of the Brno University of Technology, working at Johns Hopkins University in Sanjeev Khudanpur's team, showed that, instead, a recurrent neural network is able to process sentences of any length ("Recurrent Neural Network-based Language Model," 2010). An RNN transforms a sentence into a vector representation, or vice versa. This enables translation from one language to another: an RNN (the encoder) can transform the sentence of a language into a vector representation that another RNN (the decoder) can transform into the sentence of another language. (Mikolov was hired by Google in 2012 and by Facebook in 2014, and in between in 2013 he invented the "skip-gram" method for learning vector representations of words from large amounts of unstructured text data). Mikolov's method would be the basis for the R-Net developed by Microsoft's Chinese laboratories (Furu Wei and others) that in January 2018 would win the Stanford reading-comprehension competition, beating human beings (on some metric on some reading task).
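How a recurrent network turns a sentence of any length into a vector of fixed size can be sketched as follows. This is a bare-bones illustration under stated assumptions: the weights are arbitrary untrained numbers, the three "words" are one-hot vectors, and the function name is made up; it is not Mikolov's model, which also predicts the next word and is trained on a large corpus.

```python
import math


def rnn_encode(sequence, w_input, w_hidden):
    """Fold a sequence of input vectors into one fixed-size hidden vector."""
    hidden = [0.0] * len(w_hidden)
    for x in sequence:
        # Each new state mixes the current word with the previous state.
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_input[i], x))
                + sum(w * h for w, h in zip(w_hidden[i], hidden))
            )
            for i in range(len(hidden))
        ]
    return hidden


# Hypothetical untrained weights and one-hot "words", for illustration only.
W_INPUT = [[0.5, -0.3, 0.8], [0.1, 0.9, -0.4]]   # 2 hidden units x 3 input dims
W_HIDDEN = [[0.2, -0.6], [0.7, 0.3]]             # 2 hidden units x 2 hidden units
THE, DOG, BARKS = [1, 0, 0], [0, 1, 0], [0, 0, 1]

shorter = rnn_encode([DOG, BARKS], W_INPUT, W_HIDDEN)
longer = rnn_encode([THE, DOG, BARKS, THE, DOG], W_INPUT, W_HIDDEN)
```

Both `shorter` and `longer` are vectors of two numbers, although one sentence has two words and the other has five. Unlike a bag of words, the result depends on word order, because each step feeds on the state left behind by the previous one.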
Bengio's "Neural Machine Translation by Jointly Learning to Align and Translate" (2014) showed that neural networks could be applied to translating texts.
In 2013 Nal Kalchbrenner and Phil Blunsom of Oxford University attempted statistical machine translation based purely on neural networks ("Two Recurrent Continuous Translation Models").
Bengio's group (led by Kyunghyun Cho and Dzmitry Bahdanau on loan from Jacobs University Bremen) established the standard "encoder-decoder" model of machine translation: an encoder neural network reads and encodes a source sentence into a fixed-length vector, and then a decoder outputs a translation from the encoded vector ("Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation", 2014). This project also introduced a simpler alternative to Long Short-Term Memory (LSTM) in recurrent architectures, later named "gated recurrent unit" (GRU).
A few months later in 2014 Ilya Sutskever, Oriol Vinyals and Quoc Le at Google solved the "sequence-to-sequence problem" of deep learning using a Long Short-Term Memory ("Sequence to Sequence Learning with Neural Networks"), so that the length of the input sequence of characters doesn't have to be the same as the length of the output. Sutskever, Vinyals and Le trained a recurrent neural network that was then able to read a sentence in one language, produce a semantic representation of its meaning, and generate a translation in another language, via another encoder-decoder architecture.
The crowning achievement of neural machine translation was Google's Neural Machine Translation system (GNMT) of 2016, based on the Sutskever-Vinyals-Le model and on the attention technique pioneered by Dzmitry Bahdanau's BiRNN (bidirectional RNN) at Jacobs University Bremen in Germany to improve the speed of machine translation ("Neural Machine Translation by Jointly Learning to Align and Translate", 2015). This Google neural translation machine consisted of a deep LSTM network with eight encoder layers and eight decoder layers.
The desire to add "attention" skills to a neural network dates from the 1980s, when neuroscience began to elucidate how the brain makes sense of visual scenes so quickly. In 1986 the neuroscientists Christof Koch and Shimon Ullman proposed that the primate brain creates a visual "saliency map". A saliency map, basically, encodes the importance of each element in the visual space. This led in 1998 to the attention-based model of Laurent Itti, a student of Christof Koch at Caltech ("A Model of Saliency-based Visual-attention for Rapid Scene Analysis", 1998). Attention was introduced in image recognition tasks by Volodymyr Mnih at DeepMind ("Recurrent Models of Visual Attention", June 2014) whose "recurrent attention model" (RAM) was applied a few months later to object recognition by Jimmy Lei Ba at the University of Toronto ("Multiple Object Recognition with Visual Attention", December 2014). Meanwhile Bahdanau, Cho, and Bengio were adding attention to the encoder-decoder framework ("Neural Machine Translation by Jointly Learning to Align and Translate", September 2014), and Bengio's group then applied attention to a system that automatically generates captions of images ("Show, Attend and Tell", 2015).
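In the encoder-decoder setting, "attention" boils down to computing a weight for each encoded word of the source sentence and taking a weighted average, much like a saliency map over words instead of pixels. The sketch below is an assumption-laden simplification: it scores relevance with a plain dot product, whereas Bahdanau's model learns the scoring function with a small neural network, and the vectors and function names are invented for the example.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attend(query, encoder_states):
    """Return (weights, context): a weighted average of the encoder states."""
    # Dot-product scoring is an assumed stand-in for a learned scoring network.
    weights = softmax([dot(query, state) for state in encoder_states])
    size = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(size)
    ]
    return weights, context


# Hypothetical encoded source words and decoder query, for illustration only.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend([2.0, 0.0], states)
```

Here the first source word receives the largest weight because it is the one most aligned with the query, so the decoder's context vector leans towards it instead of treating the whole sentence as a single fixed-length summary.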
Recurrent neural networks had matured enough that in November 2016 Google switched its translation algorithm to a recurrent neural network and the jump in translation quality was noticeable.
"Translation is not a matter of words only; it is a matter of making intelligible a whole culture" (Anthony Burgess)