(These are excerpts from my book "Intelligence is not Artificial")
Brute-force A.I. doesn't Speak
Most disheartening has been the scant progress in Natural Language Processing (NLP) since 1970.
Following are examples of exchanges between a person and Terry Winograd's program SHRDLU at the MIT in 1970:
Person: FIND A BLOCK WHICH IS TALLER THAN THE ONE YOU ARE HOLDING AND PUT IT INTO THE BOX.
Computer: BY "IT", i ASSUME YOU MEAN THE BLOCK WHICH IS TALLER THAN THE ONE i AM HOLDING.
Person: DOES THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS SUPPORT ANYTHING GREEN?
Computer: YES, THE GREEN PYRAMID.
Person: WHAT COLOR IS IT?
Computer: BY "IT", i ASSUME YOU MEAN THE SHORTEST THING THE TALLEST PYRAMID'S SUPPORT SUPPORTS. RED
Person: DOES A STEEPLE
Computer interrupting: SORRY, I DON'T KNOW THE WORD "STEEPLE".
Person: A "STEEPLE" IS A STACK WHICH CONTAINS TWO GREEN CUBES AND A PYRAMID.
Computer: I UNDERSTAND.
Person: ARE THERE ANY STEEPLES NOW?
It beats some of today's "personal digital assistants".
We virtually abandoned the idea of having machines understand and speak our language and resorted to the opposite: make humans speak like machines. That is what you do when you talk on the phone with a machine that asks you for numbers; and that is what you do when you talk to your smartphone's "assistant" according to the rules of that assistant. Nothing illustrates the gap between machine intelligence and human intelligence better than comparing how much of a language a toddler
learns in two years and how little of any language all machines ever invented
have learned in more than 60 years.
IBM's Watson, which debuted with much fanfare in 2011 on a quiz show competing against human experts, was actually not capable of understanding the spoken questions: the questions were delivered to Watson as text files, not as speech (a trick which, of course, distorted the whole game).
The most popular search engines are still keyword-based. Progress in search engines has been mainly in indexing and ranking webpages, not in understanding what the user is looking for nor in understanding what the webpage says. Try for example "Hey i had a discussion with a friend about whether Qaddafi wanted to get rid of the US dollar and he was killed because of that" and see what you get (as i write these words, Google returns first of all my own website with the exact words of that sentence and then a series of pages that discuss the assassination of the US ambassador in Libya). Communicating with a search engine is a far (far) cry from
communicating with human beings.
Products that were originally marketed as able to understand natural language, such as SIRI for Apple's iPhone, have bitterly disappointed their users. These products understand only the most elementary of sounds, and only sometimes, just like their ancestors of decades ago. Promising that a device will be able to translate speech on the fly (like Samsung did with its Galaxy S4 in 2013) is a good way to embarrass yourself and to lose credibility among your customers.
The status of natural language processing is well represented by antispam software, which is totally incapable of telling from the content whether an email is spam, something we can do in a split second.
During the 1960s, following Noam Chomsky's "Syntactic Structures" (1957), which heralded a veritable linguistic revolution, a lot of work in A.I. was directed towards "understanding" natural-language sentences, notably Charles Fillmore's case grammar at Ohio State University (1967), Roger Schank's conceptual dependency theory at Stanford (1969, later at Yale), William Woods' augmented transition networks at Harvard (1970), and semantic grammars, an evolution of ATNs by Dick Burton at BBN for one of the first "intelligent tutoring systems", Sophie (started in 1973 at UC Irvine by John Seely Brown and Burton). Unfortunately, the results were crude. Terry Winograd's SHRDLU and Woods' LUNAR (1973), both based on Woods' theories, were limited to very narrow domains and short sentences.
We humans use all sorts of complicated sentences, some of them very long, some of them nested into each other.
Little was done in discourse analysis before Eugene Charniak's thesis at the MIT ("Towards a Model of Children's Story Comprehension", 1972), Indian-born Aravind Joshi's "Tree Adjunct Grammars" (1975) at the University of Pennsylvania, and Jerry Hobbs' work at the SRI Intl ("Computational Approach to Discourse Analysis", 1976).
Then a handful of important theses established the field. One originated from SRI: Barbara Grosz's thesis at UC Berkeley ("The Representation and Use of Focus in a System for Understanding Dialogs", 1977). And two came from Bolt Beranek and Newman, where William Woods had pioneered natural-language processing: Bonnie Webber's thesis at Harvard ("Inference in an Approach to Discourse Anaphora", 1978) and Candace Sidner's thesis at the MIT ("Towards a Computational Theory of Definite Anaphora Comprehension in English Discourse", 1979).
These computer scientists, as well as philosophers such as Hans Kamp in the Netherlands (founder of Discourse Representation Theory in 1981), attempted a more holistic approach to understanding "discourse", not just individual sentences; and this resulted in domain-independent systems such as the Core Language Engine, developed in 1988 by Hiyan Alshawi's team at SRI in Britain.
Meanwhile, Melvin Maron's pioneering work on statistical analysis of text was being resurrected by Gerard Salton at Cornell University (the project leader of SMART, System for the Mechanical Analysis and Retrieval of
Text, since 1965). This technique represented a text as a "bag" of words,
disregarding the order of the words and even the grammatical relationships.
Surprisingly, this method worked better than the complex grammar-based
approaches. It quickly came to be known as the "bag-of-words model" for
language analysis. Technically speaking, it was text classification using naive
Bayes classifiers. In 1998 Thorsten Joachims at the University of Dortmund replaced the naive Bayes classifier with the method of statistical learning called "Support Vector Machines", invented by Vladimir Vapnik
at Bell Labs in 1995, and other improvements followed. The bag-of-words model became the dominant paradigm for natural language processing but its statistical approach still failed to grasp the
meaning of a sentence. In 2003 Yoshua Bengio at the University of Montreal used a different method for statistical language modeling, the kind of representation that is called "distributed" (as opposed to "local") in neural networks; and then in 2010 Andrew Ng at Stanford built on this mixed approach (neural networks and statistical analysis) using recursive neural networks.
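The bag-of-words idea is simple enough to sketch in a few lines. Here is a minimal illustration of naive Bayes text classification of the kind described above; the tiny "spam"/"ham" corpus and every word in it are invented for the example:

```python
from collections import Counter
import math

# Toy training corpus: each document is labeled "spam" or "ham".
docs = [
    ("buy cheap pills now", "spam"),
    ("cheap tickets buy now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("monday lunch with the team", "ham"),
]

# Bag-of-words: count word frequencies per class,
# disregarding word order and grammar entirely.
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in docs:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    """Naive Bayes with add-one (Laplace) smoothing over the vocabulary."""
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log prior probability of the class
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            # log likelihood of each word given the class
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("buy cheap pills"))  # → spam
print(classify("monday meeting"))   # → ham
```

Note that the classifier knows nothing about what "pills" or "monday" mean: it only counts co-occurrences, which is precisely why this method can label a text without understanding it.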
The results are still far from human performance. The most illiterate person on the planet can understand language better than the most powerful machine.
Ironically, the biggest success story of the "bag-of-words" model has been in image classification, not in text classification. In 2003 Gabriela Csurka at Xerox in France applied the same statistical method to images. The "bag-of-visual-words" model was born, which basically treats an image as a document. For the whole decade this was the dominant method for image recognition, especially when coupled with a Support Vector Machine classifier. This approach led, for example, to the system for classification of natural scenes developed in 2005 at Caltech by Pietro Perona and his student Fei-Fei Li.
To be fair, progress in natural language understanding was hindered by the simple fact that humans prefer not to speak to another human in our time-consuming natural language. Sometimes we prefer to skip the "Good morning, how are you?" and get straight to the "Reset my Internet connection" in
which case saying "One" to a machine is much more effective than
having to wait for a human operator to pick up the phone and to understand your
issue. Does anyone actually understand the garbled announcements in the New
York subway? Communicating in natural language is not always a solution, as
SIRI users are rapidly finding out on their smartphones. Like it or not, humans
can more effectively go about their business using the language of machines.
For a long time, therefore, Natural Language Processing remained an underfunded
research project with few visible applications. It is only recently that
interest in "virtual personal assistants" has resurrected the field.
Machine Translation too has disappointed. Despite recurring investments in the field by major companies, your favorite online translation system succeeds only with the simplest sentences, just like Systran in the 1970s. Here are some random Italian sentences from my old books translated into English by the most popular translation engine: "Graham Nash the content of which led nasal harmony", "On that album historian who gave the blues revival", "Started with a pompous hype on wave of hippie phenomenon".
The method that has indeed improved the quality of automatic translation is the statistical one, pioneered in the 1980s by Fred Jelinek's team at IBM and first implemented there by Peter Brown's team. When there are plenty of examples of (human-made) translations, the computer can perform a simple statistical analysis and pick the most likely translation. Note that the computer isn't even trying to understand the sentence: it has no clue whether the sentence is about cheese or parliamentary elections. It has "learned" that those few words in that
combination are usually translated in such and such a way by humans. The
statistical approach works wonders when there are thousands of (human-made)
translations of a sentence, for example between Italian and English. It works
awfully when there are fewer, like in the case of Chinese to English.
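The statistical approach can be sketched as a "phrase table": record how humans have translated each phrase and pick the most frequent choice. The phrase pairs and counts below are invented for illustration; real systems estimate such probabilities from millions of aligned sentence pairs:

```python
from collections import Counter, defaultdict

# Toy "phrase table" built from human-made translation pairs (invented data).
aligned_pairs = [
    ("il gatto", "the cat"),
    ("il gatto", "the cat"),
    ("il gatto", "cat"),
    ("sul tavolo", "on the table"),
    ("sul tavolo", "on the table"),
]

table = defaultdict(Counter)
for src, tgt in aligned_pairs:
    table[src][tgt] += 1

def translate(phrase):
    """Pick the translation humans used most often; no understanding involved."""
    candidates = table.get(phrase)
    if not candidates:
        return None  # never seen in the corpus: the system has nothing to say
    return candidates.most_common(1)[0][0]

print(translate("il gatto"))    # → "the cat" (2 of 3 human translators chose it)
print(translate("la luna"))     # → None (no examples, no translation)
```

The `None` case is exactly the Chinese-to-English problem: with few human-made examples, the statistics have nothing to lean on.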
In 2013 Nal Kalchbrenner and Phil Blunsom of Oxford University attempted statistical machine translation based purely on neural networks ("Recurrent Continuous Translation Models"). In 2014 Ilya Sutskever's team solved the "sequence-to-sequence problem" of deep learning using a Long Short-Term Memory network ("Sequence to Sequence Learning with Neural Networks"), so that the input sequence does not have to be the same length as the output sequence.
Even if we ever get to the point that a machine can translate a complex sentence, here is the real test: "'Thou' is an ancient English word". Translate that
into Italian as "'Tu' e` un'antica parola Inglese" and you get an
obviously false statement ("Tu" is not an English word). The trick is
to understand what the original sentence means, not to just mechanically
replace English words with Italian words. If you understand what it means, then
you'll translate it as "'Thou' e` un'antica parola Inglese", i.e. you
don't translate the "thou"; or, depending on the context, you might
want to replace "thou" with an ancient Italian word like "'Ei'
e` un'antica parola Italiana" (where "ei" actually means
"he" but it plays a similar role to "thou" in the context
of words that changed over the centuries). A machine will be able to get it
right only when it fully understands the meaning and the purpose of the
sentence, not just its structure.
(There is certainly at least one quality-assurance engineer who, informed of this passage in this book, will immediately enter a few lines of code in the machine translation program to correctly translate "'Thou' is an ancient English word". That is precisely the dumb, brute-force, approach that i am talking about).
Or take Ronald Reagan's famous sarcastic statement that the nine most terrifying words in the English language are "I'm from the government and I'm here to help". Translate this into Italian and you get "Le nove parole piu` terrificanti in Inglese sono `io lavoro per il governo e sono qui per aiutare'". Those words are not nine in the Italian translation (they are ten), and they are not "Inglese" (English) because they are now Italian. An appropriate translation would be "Le dieci parole piu` terrificanti in Italiano sono `io lavoro per il governo e sono qui per aiutare'".
Otherwise the translation, while technically impeccable, makes no practical sense.
Or take the Berry paradox, made famous by Bertrand Russell: "the smallest positive integer number that cannot be described in fewer than fifteen words". This is a paradox because the sentence in quotes contains fourteen words. Therefore if such an integer number exists, it can be described by that sentence, which is fourteen words long. When you translate this paradox into Italian, you can't just translate "fifteen" as "quindici". You first
need to count the number of words. The literal translation "il numero
intero positivo piu` piccolo che non si possa descrivere in meno di quindici
parole" does not state the same paradox because this Italian sentence
contains sixteen words, not fourteen like the original English sentence. You
need to understand the meaning of the sentence and then the nature of the
paradox in order to produce an appropriate translation. I could continue with
self-referential sentences (more and more convoluted ones) that can lead to trivial mistakes when translated "mechanically" without understanding what they are meant to do.
To paraphrase the physicist Max Tegmark, a good explanation is one that answers more than was asked. If i ask you "Do you know what time it is?", a "Yes" is not a good answer. I expect you to at least tell me what time it is, even if that was not specifically asked. Better: if you know that i am in a hurry to catch a train, i expect you to calculate the odds of making it to the station in time and to tell me "It's too late, you won't
make it" or "Run!" If i ask you "Where is the
library?" and you know that the library is closed, i expect you to reply
with not only the location but also the important information that it is
currently closed (it is pointless to go there). If i ask you "How do i get
to 330 Hayes St?" and you know that it used to be the location of a
popular Indian restaurant that just shut down, i expect you to reply with a
question "Are you looking for the Indian restaurant?" and not with a
simple "It's that way". If i am in a foreign country and ask a simple
question about buses or trains, i might get a lengthy lecture about how public
transportation works, because the local people guess that I don't know how it
works. Speaking a language is pointless if one doesn't understand what language
is all about. A machine can easily be programmed to answer the question
"Do you know what time it is?" with the time (and not a simple
"Yes"), and it can easily be programmed to answer similar questions
with meaningful information; but we "consistently" do this for all
questions, and not because someone told us to answer the former question with
the time and other questions with meaningful information, but because that is
what our intelligence does: we use our knowledge and common sense to formulate answers that are actually useful.
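A machine programmed question by question amounts to a lookup table. This minimal sketch (the questions and canned replies are entirely invented) shows why the approach cannot scale to "all questions":

```python
import datetime

# A brute-force "assistant": each question pattern is hand-wired to a canned
# behavior. This is lookup, not understanding; every unlisted question fails.
def answer(question):
    q = question.lower().rstrip("?")
    if q == "do you know what time it is":
        return datetime.datetime.now().strftime("It's %H:%M")
    if q == "where is the library":
        # even the "helpful" extra fact is hard-coded, not inferred
        return "On Main Street, but it is closed right now"
    return "Sorry, I don't understand"

print(answer("Do you know what time it is?"))
print(answer("Where is the library?"))
print(answer("How do I get to 330 Hayes St?"))  # falls through: not wired in
```

Each useful reply exists only because a programmer anticipated that exact question; the human ability to do this for any question, unprompted, is exactly what the table cannot reproduce.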
In the near future it will still be extremely difficult to build machines that can understand the simplest of sentences. At the current rate of progress, it may take centuries before we have a machine that can have a conversation like the ones i have with my friends on the Singularity. And that would still be a far cry from what humans do: consistently provide an explanation that answers more than was asked.
A lot more is involved than simply understanding a language. If people around me speak Chinese, they are not speaking to me. But if one says "Sir?" in English, and i am the only English speaker around, i am probably supposed to pay attention.
The state of Natural Language Processing is well represented by the results returned by the most advanced search engines: the vast majority of results are precisely the
kind of commercial pages that i don't want to see. Which human would normally
answer "do you want to buy perfume Katmandu" when i inquire about
Katmandu's monuments? It is virtually impossible to find out which cities are
connected by air to a given airport because the search engines all return
hundreds of pages that offer "cheap" tickets to that airport.
Take, for example, zeroapp.email, a young startup being incubated in San Francisco in 2016. They want to use deep learning to automatically catalog the emails that you receive. Because you are a human being, you imagine that their software will read your email, understand the content, and then file it appropriately. If you were an A.I. scientist, you would have guessed instinctively that this cannot be the case. What they do is study your behavior and learn what to do
the next time that you receive an email that is similar to past ones. If you have done X for 100 emails of this kind, most likely you want to do X also for all the future emails of this kind. This kind of "natural language processing" does not understand the text:
it analyzes statistically the past behavior of the user and then predicts what
the user will want to do in the future. The same principle is used by Gmail's
Priority Inbox, first introduced in 2010 and vastly improved over the years:
these systems learn, first and foremost, by watching you; but what they learn
is not the language that you speak.
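Prediction from past behavior can be sketched in a few lines. The history below (senders and actions) is invented, and a real system would use many more features than the sender address, but the principle is the same:

```python
from collections import Counter, defaultdict

# Hypothetical history of what the user did with past emails.
history = [
    ("newsletter@shop.com", "archive"),
    ("newsletter@shop.com", "archive"),
    ("newsletter@shop.com", "archive"),
    ("boss@work.com", "reply"),
    ("boss@work.com", "reply"),
]

actions = defaultdict(Counter)
for sender, action in history:
    actions[sender][action] += 1

def predict_action(sender):
    """Predict what the user will do, purely from past behavior.
    The content of the email is never read, let alone understood."""
    past = actions.get(sender)
    if not past:
        return "inbox"  # no history: leave the email alone
    return past.most_common(1)[0][0]

print(predict_action("newsletter@shop.com"))  # → archive
print(predict_action("boss@work.com"))        # → reply
```

Whether the newsletter announces a sale or a funeral makes no difference to the prediction, which is the point: the system models the user, not the language.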
I like to discuss a simple situation with machine-intelligence fans. Let's say you are accused of a murder you did not commit. How many years will it take before you are willing to accept a jury of 12 robots instead of 12 humans? Initially, this sounds like a question about "when will you trust robots to decide whether you are guilty or innocent?" but it actually isn't (i would probably trust a robot more than many of the jurors, who are easily swayed by good looks,
racial prejudices and many other unpredictable factors). The question is about
understanding the infinite subtleties of legal debates, the language of lawyers
and, of course, the language of the witnesses. The odds that those 12 robots
fully understand what is going on at a trial will remain close to zero for a long time to come.