(These are excerpts from my book "Intelligence is not Artificial")
Understanding this Book (or any Book)
Using a technique derived from Mikolov's skip-gram, Oriol Vinyals and Quoc Le at Google revolutionized the venerable branch of discourse analysis. They trained a recurrent neural network on a large set of chats between users and support technicians. This created the equivalent of a translation (or, better, a sequence-to-sequence) model: the question asked by a user has to be "translated" into the response of the support technician ("A Neural Conversational Model", 2015).
Then Vinyals used the same machine-translation technique to analyze images and create captions. The best architecture for representing images as vectors was the convolutional neural network, so Vinyals used a convolutional neural network as the image encoder, and a decoder RNN turned that vector representation into sentences ("Show and Tell - A Neural Image Caption Generator", 2015). The result was a neural network trained to describe a scene. The similarities between language parsing in natural language processing and scene analysis in machine vision had been known at least since Gabriela Csurka developed the "bag-of-visual-words" or "bag-of-features" technique. Ironically, the biggest success story of the "bag-of-words" model has been in image classification, not in text classification. In 2003 Csurka at Xerox in France applied the same statistical method to images. The "bag-of-visual-words" model was born, which basically treats an image as a document. For a whole decade this was the dominant method for image recognition, especially when coupled with a support vector machine classifier. This approach led, for example, to the system for classification of natural scenes developed in 2005 at Caltech by Pietro Perona and his student Fei-Fei Li.
Up to this point the techniques for natural language processing included: the "bag-of-words" approach, in which sentence representations are independent of word order; the sequence models developed by Michael Jordan (1986) and Jeffrey Elman (1990) at UC San Diego; and models based on tree structures, in which a sentence's symbolic representation is derived from its constituents following a syntactic blueprint (the typical symbolic structure that results from this process resembles an inverted tree). The latter arose in the 1990s after a debate on representations in neural networks that started in 1984, when Geoffrey Hinton (then at Carnegie Mellon University) circulated a report titled "Distributed Representations" about representations in which "each entity is represented by a pattern of activity distributed over many computing elements, and each computing element is involved in representing many different entities." The problem of representing tree structures in neural networks was solved by Jordan Pollack of Ohio State University, who came up with the Recursive Auto-Associative Memory or RAAM ("Recursive Distributed Representations", 1990). A few years later Christoph Goller and Andreas Kuechler in Germany extended Pollack's RAAM so that it could be used for arbitrarily complex symbolic structures, e.g. any sort of tree structure ("Learning Task-dependent Distributed Representations by Backpropagation Through Structure", 1995).
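The limitation of the first of these representations is easy to demonstrate. A minimal sketch of the "bag-of-words" approach in Python (the sentences are invented for illustration):

```python
from collections import Counter

def bag_of_words(sentence):
    """Represent a sentence by its word counts, discarding word order."""
    return Counter(sentence.lower().split())

# Two sentences with opposite meanings...
a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# ...get the identical representation, because word order is lost.
print(a == b)  # True
```

This is precisely the limitation that sequence models and tree-structured models were designed to overcome.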
For question-answering systems Jason Weston (now at Facebook's labs in New York) developed "Memory Networks" (2014), neural networks coupled with long-term memories.
"Sequence tagging" (or "sequence labeling") is the process of assigning each item in a sequence to a category, a process used in both natural language processing and bioinformatics. It was traditionally implemented either with generative models, such as the hidden Markov models employed in speech recognition, or with the "conditional random fields" invented by John Lafferty (a former member of Fred Jelinek's group at IBM, now at Carnegie Mellon University), working with Andrew McCallum and Fernando Pereira ("Conditional Random Fields", 2001). Collobert's technique constituted the first major innovation, and it was surpassed years later by the bidirectional LSTM with conditional random fields developed by Zhiheng Huang, Wei Xu and Kai Yu of Baidu ("Bidirectional LSTM-CRF Models for Sequence Tagging", 2015).
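For illustration, sequence tagging with a generative model can be sketched as a tiny hidden Markov model decoded with the Viterbi algorithm; all the probabilities below are invented toy values, not learned from any corpus:

```python
import math

# Toy hidden Markov model for part-of-speech tagging.
# All probabilities are invented for illustration.
states = ["NOUN", "VERB"]
start = {"NOUN": 0.6, "VERB": 0.4}
trans = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit  = {"NOUN": {"dogs": 0.5, "bark": 0.1, "cats": 0.4},
         "VERB": {"dogs": 0.1, "bark": 0.8, "cats": 0.1}}

def viterbi(words):
    """Return the most probable tag sequence for the given words."""
    # best[tag] = (log-probability, best tag sequence ending in tag)
    best = {s: (math.log(start[s] * emit[s][words[0]]), [s]) for s in states}
    for w in words[1:]:
        best = {s: max((p + math.log(trans[prev][s] * emit[s][w]), path + [s])
                       for prev, (p, path) in best.items())
                for s in states}
    return max(best.values())[1]

print(viterbi(["dogs", "bark"]))  # ['NOUN', 'VERB']
```

A conditional random field replaces these locally normalized probabilities with globally normalized feature weights, which is what makes it discriminative rather than generative.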
Collobert's neural-network architecture for NLP formed the basis for Soumith Chintala's "sentiment analysis" at New York University, which learned to categorize movie reviews as positive or negative ("Sentiment Analysis using Neural Architectures", 2015). Socher at Stanford helped Kai Sheng Tai develop Tree-LSTM, a generalization of LSTMs to the tree structures used in natural language processing, which further improved sentiment analysis by taking advantage of the research started by Pollack 25 years earlier ("Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks", 2015). Sentiment analysis was also the objective of two projects in New England. In 2016 the Computational Story Laboratory of the University of Vermont (led by Peter Dodds and Chris Danforth) used Teuvo Kohonen's Self-Organizing Map (SOM) to study what Kurt Vonnegut had termed the "emotional arcs" of written stories ("The Emotional Arcs of Stories are Dominated by Six Basic Shapes", 2016). In 2017 Eric Chu of MIT's Laboratory for Social Machines, directed by Deb Roy (later hired by Twitter), used deep convolutional neural networks to infer the emotional content of movies and television shows by analyzing the story, the facial expressions and the soundtrack, i.e. both audio and visual sentiment analysis ("Audio-Visual Sentiment Analysis for Learning Emotional Arcs in Movies", 2017).
A footnote on sentiment analysis. There are countless precursors, like Carbonell's dissertation of 1979 at Yale and Clark Elliott's PhD dissertation of 1992 at Northwestern University ("The Affective Reasoner", 1992), an implementation of Andrew Ortony's psychological theory (his "appraisal model" of 1988); but the discipline was truly born in 2002 with two studies: one by Peter Turney at the Institute for Information Technology of Canada ("Thumbs up or Thumbs down? Semantic Orientation Applied to Unsupervised Classification of Reviews", 2002) and the other (a movie-review classifier) by Bo Pang and Lillian Lee at Cornell University ("Thumbs up? Sentiment Classification using Machine Learning Techniques", 2002). Jeonghee Yi at IBM in San Jose (2003) was perhaps the first to use "sentiment analysis" in the title of a paper.
Training a neural network requires a well-structured dataset. But a lot of real-world information comes in unstructured formats such as books, magazines, radio news, TV programs, etc. Hence the need for text-understanding, or reading-comprehension, technology. Understanding a text requires, first of all, determining what the real focus is. Hence a number of neural attention mechanisms were developed, mainly Jason Weston's memory networks at Facebook ("Memory Networks", 2014) and Richard Socher's dynamic memory networks at MetaMind in Palo Alto ("Ask Me Anything", 2015). Minjoon Seo in Ali Farhadi's group at the Allen Institute for Artificial Intelligence developed the Bidirectional Attention Flow (BiDAF) model ("Bidirectional Attention Flow for Machine Comprehension", 2016), a new kind of "attention" technique inspired by the attention mechanism that Dzmitry Bahdanau had built on bidirectional RNNs. Seo's architecture can model the context at different levels of granularity.
At about the same time, extensive datasets such as SQuAD and MS MARCO made it possible to train neural networks for reading-comprehension tasks.
Weizhu Chen's team at Microsoft developed ReasoNet in 2016, which combined memory networks with reinforcement learning, and FusionNet in 2017, which introduced a simpler attention mechanism called "History of Word". In 2017 the Reinforced Mnemonic Reader, developed in China jointly by Xipeng Qiu at Fudan University and the National University of Defense Technology, set a new record ("Reinforced Mnemonic Reader for Machine Reading Comprehension", 2017). Alas, recurrent neural networks are very slow in both training and inference, a fact that prevents them from being deployed in real-time applications. In 2018 Quoc Le's team at Google developed QANet, a feed-forward network that dispensed with recurrence and surpassed human performance.
Several startups began offering text-analysis and summarization services: Narrative Science, founded in 2010 in Chicago by Northwestern University professors Kristian Hammond and Larry Birnbaum (a student of Schank's at Yale University in 1986); Maluuba, founded in 2011 in Canada by two University of Waterloo students, Sam Pasupalak and Kaheer Suleman, and acquired in 2017 by Microsoft; and MetaMind, founded in 2014 in Palo Alto by Richard Socher and acquired by Salesforce in 2016. But their narrative summaries only worked in very narrow domains under very friendly circumstances.
The results are still far from human performance. The most illiterate person on the planet can understand language better than the most powerful machine.
To be fair, progress in natural language understanding was hindered by the simple fact that humans prefer not to speak to another human in our time-consuming natural language. Sometimes we prefer to skip the "Good morning, how are you?" and get straight to the "Reset my Internet connection" in
which case saying "One" to a machine is much more effective than
having to wait for a human operator to pick up the phone and to understand your
issue. Does anyone actually understand the garbled announcements in the New
York subway? Communicating in natural language is not always a solution, as
Siri users are rapidly finding out on their smartphones. Like it or not, humans
can more effectively go about their business using the language of machines.
For a long time, therefore, Natural Language Processing remained an underfunded
research project with few visible applications. It is only recently that
interest in "virtual personal assistants" has resurrected the field.
In order to realize how far we are from having machines that truly "understand"
our language, think of a useful application that would greatly help civility
in written conversations: the equivalent of a spelling checker for hostile
moods. Imagine an app that, when you try to send an email, would warn you
"The tone of this email is rude: do you really want to send it?" or
"The tone of this email is sarcastic" or
"The tone of this email is insulting".
It is not difficult for a human to read "between the lines", to understand
the hidden motivation of a message and, in particular, to understand when
the writer is deliberately trying to hurt the reader's feelings.
Written hostilities can escalate quickly in the age of email and texting.
The interesting fact is that we understand in a second that the tone of an
email is not friendly, even when the email is a correct reply to our question
or a positive comment to something we have done. When a friend was celebrating
the killing of Osama bin Laden, i quipped "Yes, we are very good at
assassinating people". You understand the sarcasm and indirect critique of US
foreign policy, don't you? You may also understand that i was greatly annoyed
by that operation, and, even more, by the fact that people were celebrating
in the streets.
We routinely get in trouble when we speak quickly because we say something
that we "should not have said": this doesn't mean that what we said was false,
but that we said it on purpose to cause harm and perhaps humiliate. Most of
the time we regret doing it. Conversely, we are easily ticked off by the
wrong tone in an email that was sent to us.
We immediately understand the "tone" of an email, especially when it's meant to
hurt or annoy us, and we are very good
at forging a tone that will hurt or annoy somebody. We word
our sentences according to our mood and to the mood we want to create
in the other person.
When you chat with someone, you pay little attention to the grammatical
structure of what you are saying (in fact, you make a lot of grammatical mistakes,
interrupting your own sentences, restarting them, interjecting a lot of random
noise such as "hmmm") but you pay a lot of attention to the dynamics of the
conversation, which depend heavily on the tone of the voices
(assertive, soothing, angry, etc).
The importance of mood in understanding what is going on cannot be overstated.
The impact of mood on comprehension was already studied by Gordon Bower at Stanford in "Mood and Memory" (1981), with a tentative computational theory based on semantic networks, and Daniel Martins in France ("Influence of Affect on Comprehension of a Text", 1982).
The collusion of emotion and cognition has been studied for a long time, from
Richard Lazarus at UC Berkeley
("Thoughts on the Relations between Emotion and Cognition", 1982) to Joseph LeDoux at New York University (his book "The Emotional Brain", 1996) via
multi-level theories of cognition-emotion interaction such as
Philip Barnard's "interacting cognitive subsystems model"
("Interacting Cognitive Subsystems", 1985) at Cambridge University
and Barnard's collaborator John Teasdale's model of nine cognitive subsystems at
Oxford University ("Emotion and two kinds of meaning", 1993).
More studies emerged in the 1990s and later about how
emotional states influence what one understands, for example
Joseph-Paul Forgas' "affect infusion model" ("Mood and Judgement", 1995)
and Isabelle Tapiero's book "Situation Models and Levels of Coherence" (2007).
But little progress has been made in computing moods. The one influential paper on the subject was written by a philosopher, Laura Sizer at Hampshire College ("Towards A Computational Theory of Mood", 2000).
The other thing that we humans can do effortlessly (and frequently abuse this skill) is to generate stories. Given something that happened or an article that we read or a television show that we watched, we can easily create a story to describe it.
That's another thing that machines can't do in any reasonable fashion, despite decades of research.
The pioneering systems of automatic story generation were the Automated Novel Writer, developed from 1971 at the University of Wisconsin (a status report was published in 1973) by Sheldon Klein, who had already worked on automatic summaries at Carnegie Mellon University ("Automatic Paraphrasing in Essay Format", 1965);
James Meehan's Tale-Spin, a story generator developed at Yale under Roger Schank's supervision that produced stories about woodland creatures ("The Metanovel", 1976);
and Michael Lebowitz's Universe at Columbia University ("Creating Characters in a Story-Telling Universe", 1984).
Selmer Bringsjord of Rensselaer Polytechnic Institute in New York state and David Ferrucci of IBM started building Brutus in 1990 ("AI and Literary Creativity", 1999).
Then came Scott Turner's Minstrel at UCLA ("Minstrel, a computer model of creativity and storytelling", 1992) and
Rafael Perez y Perez's Mexica in Britain ("A Computer Model of Creativity in Writing", 1999).
An illiterate four-year-old child is infinitely better than these systems at
"narrating" a generic event.
Machine Translation too has disappointed. Despite recurring investments in the field by major companies, your favorite online translation system succeeds only with the simplest sentences, just like Systran in the 1970s. Here are some random Italian sentences from my old books translated into English by the most popular translation engine: "Graham Nash the content of which led nasal harmony", "On that album historian who gave the blues revival", "Started with a pompous hype on wave of hippie phenomenon".
In November 2016 the new Google Translate feature was widely publicized because it dramatically improved the machine-translation score called BLEU (bilingual evaluation understudy), introduced in 2002 by IBM. The new Google Translate was developed by Quoc Le (born in Vietnam), Mike Schuster (born in Germany), and Yonghui Wu (a Chinese-born veteran of Google's search engine). I tried it myself on simple sentences and the improvement was obvious. I tried it on one of my old music reviews written in Italian and the result is difficult to understand (maybe the original was too!). The biggest mistake: the Italian plural "geni" got translated as the plural of "gene" but in that context it is obviously the plural of "genius".
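At its core, BLEU scores a candidate translation by how many of its n-grams also appear in a reference translation. A simplified sketch (modified n-gram precision only, without BLEU's brevity penalty and geometric averaging; the sentences are invented):

```python
from collections import Counter

def ngrams(words, n):
    """Count the n-grams in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def modified_precision(candidate, reference, n):
    """Fraction of the candidate's n-grams that occur in the reference,
    counting each reference n-gram at most as often as it appears there."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    overlap = sum(min(count, ref[g]) for g, count in cand.items())
    return overlap / max(1, sum(cand.values()))

reference = "the cat is on the mat"
candidate = "the cat sat on the mat"
print(modified_precision(candidate, reference, 1))  # 5 of 6 unigrams match
print(modified_precision(candidate, reference, 2))  # 3 of 5 bigrams match
```

The weakness is apparent: a translation can score well on n-gram overlap while garbling the meaning, which is why a better BLEU score does not always mean a translation that humans find better.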
After successfully employing that recurrent neural network to improve Google's machine translation, Ilya Sutskever announced that "all supervised vector-to-vector problems are now solved thanks to deep feed-forward neural networks" and "all supervised sequence-to-sequence problems are now solved thanks to deep LSTM networks" (at the 2014 Neural Information Processing Systems conference in Montreal). Unbridled optimism has always been A.I.'s main enemy.
Even if we ever get to the point that a machine can translate a complex sentence, here is the real test: "'Thou' is an ancient English word". Translate that
into Italian as "'Tu' e` un'antica parola Inglese" and you get an
obviously false statement ("Tu" is not an English word). The trick is
to understand what the original sentence means, not to just mechanically
replace English words with Italian words. If you understand what it means, then
you'll translate it as "'Thou' e` un'antica parola Inglese", i.e. you
don't translate the "thou"; or, depending on the context, you might
want to replace "thou" with an ancient Italian word like "'Ei'
e` un'antica parola Italiana" (where "ei" actually means
"he" but it plays a similar role to "thou" in the context
of words that changed over the centuries). A machine will be able to get it
right only when it fully understands the meaning and the purpose of the
sentence, not just its structure.
(There is certainly at least one quality-assurance engineer who, informed of this passage in this book, will immediately enter a few lines of code in the machine translation program to correctly translate "'Thou' is an ancient English word". That is precisely the dumb, brute-force, approach that i am talking about).
Or take Ronald Reagan's famous sarcastic statement that the nine most terrifying words in the English language are "I'm from the government and i'm here to help". Translate this into Italian and you get "Le nove parole piu` terrificanti in Inglese sono `io lavoro per il governo e sono qui per aiutare'". But those words are no longer nine in the Italian translation (they are ten), nor are they "Inglese" (English), because they are now Italian. An appropriate translation would be "Le dieci parole piu` terrificanti in Italiano sono `io lavoro per il governo e sono qui per aiutare'".
Otherwise the translation, while technically impeccable, makes no practical sense.
Or take the "Berry paradox" made famous by Bertrand Russell: "the smallest positive integer number that cannot be described in fewer than fifteen words". This is a paradox because the sentence in quotes contains fourteen words. Therefore if such an integer number exists, it can be described by that sentence, which is fourteen words long. When you translate this paradox into Italian, you can't just translate fifteen with "quindici". You first
need to count the number of words. The literal translation "il numero
intero positivo piu` piccolo che non si possa descrivere in meno di quindici
parole" does not state the same paradox because this Italian sentence
contains sixteen words, not fourteen like the original English sentence. You
need to understand the meaning of the sentence and then the nature of the
paradox in order to produce an appropriate translation. I could continue with
self-referential sentences (more and more convoluted ones) that can lead to trivial mistakes when translated "mechanically" without understanding what they are meant to do.
Translations of proverbs can fail spectacularly. Take the Italian "Tra il dire e il fare c'e` di mezzo il mare", which is equivalent to the English "Easier said than done". In 2017 the most popular online translator rendered it as "Between the saying and the sea there is the middle of the sea". Even the translation into Spanish failed (it was rendered as "Entre el dicho y el mar est el medio del mar") despite the fact that the equivalent Spanish proverb is very similar to the Italian ("Del dicho al hecho hay mucho trecho"). Our software engineer is now frantically entering a few lines of code in the online translator to make sure that this Italian proverb will be translated correctly into English and Spanish: alas, there are hundreds of languages and thousands of proverbs in each one, so the possible combinations run into the millions.
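The brute-force fix that this hypothetical engineer would apply amounts to an idiom lookup table consulted before any word-by-word translation. A sketch of why it does not scale (both dictionaries are invented, toy-sized examples):

```python
# Idioms must be matched whole, before any word-by-word gloss.
# Both dictionaries are invented, toy-sized examples.
IDIOMS = {
    "tra il dire e il fare c'e` di mezzo il mare": "easier said than done",
}
WORDS = {"il": "the", "mare": "sea", "blu": "blue"}

def translate(sentence):
    """Italian-to-English: try the idiom table first, then a literal gloss."""
    key = sentence.lower()
    if key in IDIOMS:
        return IDIOMS[key]
    # Fallback: word-by-word gloss, keeping unknown words unchanged.
    return " ".join(WORDS.get(w, w) for w in key.split())

print(translate("Tra il dire e il fare c'e` di mezzo il mare"))
# easier said than done
print(translate("il mare blu"))
# the sea blue  (literal gloss, with the wrong English word order)
```

Every proverb in every language pair needs its own entry, which is exactly the combinatorial explosion just described.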
To paraphrase the physicist Max Tegmark, a good explanation is one that answers more than was asked. If i ask you "Do you know what time it is", a "Yes" is not a good answer. I expect you to at least tell me what time it is, even if it was not specifically asked. Better: if you know that i am in a hurry to catch a train, i expect you to calculate the odds of making it to the station in time and to tell me "It's too late, you won't
make it" or "Run!" If i ask you "Where is the
library?" and you know that the library is closed, i expect you to reply
with not only the location but also the important information that it is
currently closed (it is pointless to go there). If i ask you "How do i get
to 330 Hayes St?" and you know that it used to be the location of a
popular Indian restaurant that just shut down, i expect you to reply with a
question "Are you looking for the Indian restaurant?" and not with a
simple "It's that way". If i am in a foreign country and ask a simple
question about buses or trains, i might get a lengthy lecture about how public
transportation works, because the local people guess that I don't know how it
works. Speaking a language is pointless if one doesn't understand what language
is all about. A machine can easily be programmed to answer the question
"Do you know what time it is" with the time (and not a simple
"Yes"), and it can easily be programmed to answer similar questions
with meaningful information; but we "consistently" do this for all
questions, and not because someone told us to answer the former question with
the time and other questions with meaningful information, but because that is
what our intelligence does: we use our knowledge and common sense to formulate an answer that is actually useful.
Ludwig Wittgenstein in the "Philosophical Investigations" (published posthumously in 1953) wrote that "the meaning of a word is its use in the language". That statement launched a whole new discipline, now called "pragmatics", via John Austin's analysis of speech acts (starting with a lecture at Harvard University in 1955 that in 1962 became the book "How to Do Things with Words"), Paul Grice's "conversational maxims" ("Logic and Conversation", 1975) and Dan Sperber's and Deirdre Wilson's "relevance theory" ("Relevance - Communication and Cognition", 1986). The term "pragmatics" was coined by Charles Morris, the founder of modern semiotics, in his book "Foundations of the Theory of Signs" (1938), which divided the study of language into three branches: syntax, semantics and pragmatics.
In the near future it will still be extremely difficult to build machines that can understand the simplest of sentences. At the current rate of progress, it may take centuries before we have a machine that can have a conversation like the ones i have with my friends about the Singularity. And that would still be a far cry from what humans do: consistently provide an explanation that answers more than was asked.
A lot more is involved than simply understanding a language. If people around me speak Chinese, they are not speaking to me. But if one says "Sir?" in English, and i am the only English speaker around, i am probably supposed to pay attention.
The state of Natural Language Processing is well represented by the results returned by the most advanced search engines: the vast majority of results are precisely the
kind of commercial pages that i don't want to see. Which human would normally
answer "do you want to buy perfume Katmandu" when i inquire about
Katmandu's monuments? It is virtually impossible to find out which cities are
connected by air to a given airport because the search engines all return
hundreds of pages that offer "cheap" tickets to that airport.
Take, for example, zeroapp.email, a young startup incubated in San Francisco in 2016. It wants to use deep learning to automatically catalog the emails that you receive. Because you are a human being, you imagine that its software will read your email, understand the content, and then file it appropriately. If you were an A.I. scientist, you would have guessed instinctively that this cannot be the case. What it does is study your behavior and learn what to do
the next time that you receive an email that is similar to past ones. If you have done X for 100 emails of this kind, most likely you want to do X also for all the future emails of this kind. This kind of "natural language processing" does not understand the text:
it analyzes statistically the past behavior of the user and then predicts what
the user will want to do in the future. The same principle is used by Gmail's
Priority Inbox, first introduced in 2010 and vastly improved over the years:
these systems learn, first and foremost, by watching you; but what they learn
is not the language that you speak.
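The behavior-prediction shortcut can be sketched in a few lines; the senders and actions below are invented:

```python
from collections import Counter, defaultdict

# Invented history of (sender, action-the-user-took) pairs.
history = [
    ("newsletter@shop.com", "archive"),
    ("newsletter@shop.com", "archive"),
    ("boss@work.com", "reply"),
    ("boss@work.com", "reply"),
]

# Count, per sender, which action the user performed most often.
by_sender = defaultdict(Counter)
for sender, action in history:
    by_sender[sender][action] += 1

def predict_action(sender):
    """Predict the action for a new email purely from past behavior:
    no understanding of the text, only frequency counts."""
    if sender not in by_sender:
        return "inbox"  # no history yet: leave the email alone
    return by_sender[sender].most_common(1)[0][0]

print(predict_action("newsletter@shop.com"))  # archive
print(predict_action("stranger@example.com"))  # inbox
```

The words of the email never enter the computation: the system would behave identically if the messages were written in a language it had never seen.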
I like to discuss with machine-intelligence fans a simple situation. Let's say you are accused of a murder you did not commit. How many years will it take before you are willing to accept a jury of 12 robots instead of 12 humans? Initially, this sounds like a question about "when will you trust robots to decide whether you are guilty or innocent?" but it actually isn't (i would probably trust a robot more than many of the jurors, who are easily swayed by good looks,
racial prejudices and many other unpredictable factors). The question is about
understanding the infinite subtleties of legal debates, the language of lawyers
and, of course, the language of the witnesses. The odds that those 12 robots
fully understand what is going on at a trial will remain close to zero for a long time to come.