(These are excerpts from my book "Intelligence is not Artificial")
In November 1958, at the Symposium on Mechanization of Thought Processes in England, the always prescient John McCarthy delivered a lecture titled "Programs with Common Sense", which became one of the most influential papers in A.I. McCarthy understood that a machine with no common sense is what we normally call "an idiot". It can certainly do one thing very well, but it cannot be trusted to do it alone, and it certainly cannot be trusted to do anything else.
What we say is not what we mean. If I ask you to cook dinner using whatever high-protein food you can find in a kitchen cabinet, that does not mean that you should cook the spider crawling on its walls, nor the chick that your children have adopted as a pet, nor (gasp) the toddler who is hiding in it for fun.
How do we decide when is the best time to take a picture at an event? A machine can take thousands of pictures, one per second, and maybe even more, but we only take 2 or 3 because those are the meaningful events.
Surveillance cameras and cameras on drones can store millions of hours of videos. They can recognize make and model of a car, and even read its plate number, but they can't realize that a child is drowning in a swimming pool or that a thief is breaking into a car.
Maybe machines are becoming better than humans at recognizing images in some circumstances, but common sense still matters for understanding what is going on. For example, in April 2013 two homemade bombs killed three people during the Boston marathon. Within hours the investigation had identified two suspects. It was common sense, not artificial intelligence, that helped the detectives: video footage showed a crowd reacting in panic except two people who quietly walked away. Any human can draw the conclusion: those two people were not scared by what happened, they knew exactly what had happened, and the only people who could have known were the perpetrators.
In April 2016 in England a group of children spontaneously formed a human arrow on the ground to direct a police helicopter towards the fleeing suspects of a crime. Nobody taught the children to do that. What the children guessed (in a few seconds) is a long list of "common sense" knowledge: there has been a crime and we need to capture the criminals; the criminals are running away to avoid capture; the helicopter in the sky is the police looking for the criminals; the police force is the entity in charge of catching criminals; it is good that you help the police if you have seen the criminals flee; it is bad if the criminals escape; the helicopter cannot hear you but can see you if you all group together; the arrow is a universal symbol to mark a direction; helicopters fly faster than humans can run; etc. That's what intelligence does when it has common sense.
Around the same time in 2016 Wei Zexi, a 21-year-old student from Xidian University in China's Shaanxi province, who was undergoing treatment for a rare form of cancer, found an advert on Baidu (China's search engine) publicizing a treatment offered by the Beijing Armed Police Corps No 2 Hospital. The "doctor" turned out to be bogus and the treatment killed the boy. The Chinese media demonized Baidu (and, hopefully, the military hospital!), but this was not a case of Baidu being evil: it was the case of yet another algorithm that has no common sense, just like the Google algorithm that in 2015 thought two African-Americans were gorillas, just like the Microsoft algorithm that in 2016 posted racist and sexist messages on Twitter. This is what intelligence does when it has no common sense.
To make things worse, i found the news of Wei Zexi's death on a website that itself displayed some silly ads. Two of these ads were almost porno in nature (titled "30 Celebs Who Don't Wear Underwear" and "Most Embarrassing Cheerleader Moments"). These ads were posted next to the article describing the tragic death of Wei Zexi: the "intelligent" software that assigns ads to webpages has no common sense, i.e. it cannot understand that it is really disgusting to post such sex-related ads in a page devoted to someone's death. (No, the ads were not customized for me: i was using an Internet-cafe terminal).
On 21 June 2017 the Los Angeles Times reported that a strong earthquake had just struck the town of Santa Barbara. There was no such earthquake. The US Geological Survey (USGS) had issued a false alarm. News organisations across the world had received the alert by email but quickly dismissed it because it was dated 29 June 2025; clearly some kind of snafu if you have common sense. The Los Angeles Times, however, was using an A.I. bot to automatically write stories about earthquakes and the A.I. bot dutifully informed the nation of the earthquake. (The Los Angeles Times quickly retracted the article, but the story became so popular that another bot running on the Los Angeles Times website, the one in charge of maximizing advertising revenues, started displaying an appliance advert in front of the article retracting the news of the earthquake).
The navigation software that is now common in every smartphone is a typical example of what happens when a machine has no common sense. It calculates the shortest route to my destination but sometimes i don't want the shortest route if it implies a huge number of turns. I'd rather stay on the same street as long as possible than turn right and left a dozen times if the difference is only a few minutes. In fact, one night i got so exhausted by the silly route calculated by my navigator that i left the party early: the navigator did its job of guiding me to my destination as quickly as possible, but ruined my evening. And we generally prefer to avoid dangerous neighborhoods. A friend was assaulted when she stopped at a red light in a bad part of town. When i was driving in another bad part of town, the car in front of me suddenly stopped and two big tall men came out and walked to me and asked me why i was following them (I wasn't, but obviously in that neighborhood it happens). We would rather drive an extra ten minutes than drive through a bad neighborhood.
The balance between common sense and algorithms is delicate. Every year more than 200,000 Chinese die in car accidents. The government is introducing stricter rules (i.e. algorithms) and enforcing the ones that exist. This resort to algorithms will certainly reduce the most common accidents and save thousands of lives. On the other hand, the USA is a country in which there seem to be more traffic rules than drivers. On some roads it takes a few minutes to read all the posted signs. There are also strict rules on how to build a car to make it as safe as possible. And, yet, the number of people killed in car accidents in the USA keeps climbing: 32,744 in 2014 (10.28 per 100,000 population), 35,485 in 2015 (11.06), 37,461 in 2016. It looks like the USA has reached a point at which it's difficult to further reduce the number of fatalities. There is a simple explanation: drivers in the USA are trained to follow rules; they are not trained to avoid accidents. On the other hand, Chinese drivers are trained to avoid accidents, not necessarily to follow rules.
When computers became powerful enough, some A.I. scientists embarked on ambitious attempts to replicate the "common sense" that we humans seem to master so easily as we grow up. The most famous project was Doug Lenat's Cyc (1984), which is still going on. In 1999 Marvin Minsky's pupil Catherine Havasi at MIT launched Open Mind Common Sense, which has been collecting "common sense" provided by thousands of volunteers. DBpedia, started at the Free University of Berlin in 2007, collects knowledge from Wikipedia articles. The goal of these systems is to create a vast catalog of the knowledge that ordinary people have: plants, animals, places, history, celebrities, objects, ideas, etc. For each one we intuitively know what to do: you are supposed to be scared of a tiger, but not of a cat, despite the similarities; umbrellas make sense when it rains or at the beach; clothes are for wearing; food is for eating; etc. More recently, the very companies that are investing in deep learning have realized that you can't do without common sense. Hence, Microsoft started Satori in 2010 and Google revealed its Knowledge Graph in 2012. By then Knowledge Graph already contained knowledge about 570 million objects via more than 18 billion relationships between objects (Google did not disclose when the project had started). These projects marked a rediscovery of the old program of "knowledge representation" (based on mathematical logic) that has been downplayed too much after the boom in deep learning. Knowledge Graph is a "semantic network", a kind of knowledge representation that was very popular in the 1970s. Google's natural-language processing team, led by Fernando Pereira, is integrating Google's famous deep-learning technology (the "AlphaGo" kind of technology) with linguistic knowledge that is the result of eight years of work by professional linguists.
It is incorrect to say that deep learning is a technique for learning to do what we do. If i do something that has never been done before, deep learning cannot learn how to do it: it needs thousands if not millions of samples in order to learn how to do it. If it is the first time that it has been done, by definition, deep learning cannot learn it: there is only one case. Deep learning is a technique for learning something that humans DID (past tense).
Now let's imagine a scenario in which neural networks have learned everything that humans ever did. What happens next? The short answer is: nothing. These neural networks are incapable of doing anything that they were not trained to do, so this is the end of progress.
Training a neural network to do something that has never been done before is possible (for example, you can just introduce some random redistribution of what it has learned), but then the neural network has to understand that the result of the novel action is interesting, which requires an immense knowledge of the real world. If I perform a number of random actions, most of them will be useless, wastes of time and energy, but maybe one or two will turn out to be useful. We often stumble into interesting actions by accident and realize that we can use those accidental actions for doing something very important. I was looking for a way to water my garden without having to physically walk there, and one day i realized that an old broken hose had so many holes in it that it would work really well to water the fruit trees. Minutes ago, i accidentally pressed the wrong key on my Android tablet and discovered a feature that I didn't know existed. It is actually a useful feature.
In order to understand which novel action is useful, one needs a list of all the things that can possibly be useful to a human being. It is trivial for us to understand what can be useful to human life. It is not trivial for a machine, and certainly not trivial at all for a neural network trained to learn from us.
See for example Alexander Tuzhilin's paper "Usefulness, Novelty, and Integration of Interestingness Measures" (Columbia University, 2002) and Iaakov Exman's paper "Interestingness a Unifying Paradigm Bipolar Function Composition" (Israel, 2009).
The importance of common sense in daily activities is intuitive. We get angry whenever someone does something without "thinking". It is not enough to recognize that a car is a car and a tree is a tree. It is also important to understand that cars move and trees don't, that cars get into accidents and some trees bear edible fruits, etc. Deep learning is great for recognizing that a car is a car and a tree is a tree, but it struggles to go beyond recognition. So there is already a big limitation.
A second problem with deep-learning systems is that you need a very large dataset to train them. We humans learn a new game just from listening to a friend's description and from watching friends play it a couple of times. Deep learning requires thousands if not millions of cases before it can play decently.
Big data are used to train the neural networks of deep learning systems, but "big data" is not what we use to train humans. We do exactly the opposite. Children's behavior is "trained" by two parents and maybe a nanny, not by videos found on the Internet. Their education is "trained" by carefully selected teachers who had to get a degree in education, not by the masses. We train workers using the rare experts in the craft, not a random set of workers. We train scientists using a handful of great scientists, not a random set of students.
I am typing these words in 2016 while Egypt and other countries are searching the Mediterranean Sea for an airplane that went missing. In 2014 a Malaysia Airlines airplane en route from Kuala Lumpur to Beijing mysteriously disappeared over the Indian Ocean. Deep-learning neural networks can be trained to play go/weichi because there are thousands of well documented games played by human masters, but the same networks cannot be trained to scour the ocean for debris of a missing airplane: we don't have thousands of pictures of debris of missing airplanes. The pieces can have arbitrary shapes, float in arbitrary ways, be partially underwater, etc. Humans can easily identify pieces of an airplane even if they have only seen 10 or 20 airplanes in their life, and never seen the debris of an air crash; neural networks can only do it if we show them thousands of examples.
A third problem of machines with no common sense is their inability to recognize an "obvious" mistake. Several studies have shown that, in some circumstances, deep-learning neural networks are better than humans at recognizing objects; but, when the neural network makes a mistake, you can tell that it has no common sense: it is usually a mistake that makes us laugh, i.e. a mistake that no idiot would make. You train a neural network using a large set of cat photos. Deep learning is a technique that provides a way to structure the neural network in an optimal way. Once the neural network has learned to recognize a cat, it is supposed to recognize any cat photo that it hasn't seen before. But deep neural networks are not perfect: there is always at least one case (a "blind spot") in which the neural network fails and mistakes the cat for something else. That "blind spot" tells a lot about the importance of common sense. In 2013 joint research by Google, New York University and the University of Montreal showed that tiny perturbations (invisible to humans) can completely alter the way a neural network classifies the image. The paper written by Christian Szegedy and others was ironically titled "Intriguing Properties Of Neural Networks". Intriguing indeed, because no human would make those mistakes. In fact, no human would notice anything wrong with the "perturbed" images.
Neural networks can easily be fooled by "adversarial examples". An adversarial example is an image (or sound or other pattern) that has been slightly modified in a way to mislead the neural network despite the fact that the human eye doesn't notice anything strange about it. In 2015 Ian Goodfellow, the inventor of generative adversarial networks, working at Google with Szegedy, discovered "a fast method of generating adversarial examples", basically a way to serially hack a neural network ("Explaining and Harnessing Adversarial Examples", 2015). His method to quickly and massively generate adversarial examples was named "fast gradient sign method" (FGSM).
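The idea behind the fast gradient sign method fits in a few lines: nudge every input value by a small amount epsilon in the direction that increases the model's loss, i.e. x_adv = x + epsilon * sign(gradient of the loss with respect to x). The sketch below is only an illustration under loud assumptions: the "network" is a hypothetical one-layer logistic model with made-up weights, and the 100-value "image" is invented for the example. It is not the code from the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, x):
    """Probability that input x belongs to class 1 under a logistic model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def loss_gradient_wrt_input(weights, bias, x, y):
    """Gradient of the cross-entropy loss with respect to the input x.
    For a logistic model this is (p - y) * w for each input value."""
    p = predict(weights, bias, x)
    return [(p - y) * w for w in weights]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(weights, bias, x, y, epsilon):
    """Fast gradient sign method: x_adv = x + epsilon * sign(grad_x loss)."""
    grad = loss_gradient_wrt_input(weights, bias, x, y)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (assumptions, not from any real system).
weights = [0.5, -0.5] * 50      # 100 made-up "pixel" weights
bias = 0.0
x = [0.52, 0.48] * 50           # a made-up input that the model labels as class 1
y = 1                           # the correct label

x_adv = fgsm(weights, bias, x, y, epsilon=0.05)

print("confidence on original input:  %.2f" % predict(weights, bias, x))
print("confidence on perturbed input: %.2f" % predict(weights, bias, x_adv))
print("largest change to any value:   %.2f" % max(abs(a - b) for a, b in zip(x, x_adv)))
```

In this toy setting no input value moves by more than 0.05, yet the model's answer flips from class 1 to class 0, because many tiny changes that all push in the same direction add up.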
This is not just a theoretical discussion. If a self-driving car that uses a deep neural network mistakes a pedestrian crossing the street for a whirlwind, there could be serious consequences.
If a self-driving car turns the corner and moves towards you just when you were about to cross the street, and there is really no driver inside (right now all self-driving cars have a human driver who can take over at any time), do you still cross the street? Many of us would not, and will not. We often make eye contact with the driver in order to confirm that s/he has seen us. Can we make eye contact with the self-driving algorithm? We are told that machine vision is accurate 97% of the time, but we don't want to be a member of the 3%. And even if it gets down to an incredible 0.0001% error rate, that would still be thousands of mistakes a day.
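The arithmetic behind "thousands of mistakes a day" depends on how many recognition decisions are made every day, a number that the text does not state. The sketch below assumes a purely hypothetical workload of five billion decisions per day (summed over all vehicles) simply to show how the numbers scale; the function names are invented for the illustration.

```python
def expected_mistakes(error_rate_percent, decisions_per_day):
    """Expected number of wrong decisions per day at a given error rate."""
    return (error_rate_percent / 100.0) * decisions_per_day

def decisions_needed(error_rate_percent, mistakes_per_day):
    """How many daily decisions it takes to produce a given number of mistakes."""
    return mistakes_per_day / (error_rate_percent / 100.0)

# Hypothetical workload (an assumption, not a measured figure).
DECISIONS_PER_DAY = 5_000_000_000

print(round(expected_mistakes(3, DECISIONS_PER_DAY)))       # at 97% accuracy
print(round(expected_mistakes(0.0001, DECISIONS_PER_DAY)))  # at 0.0001% error
print(round(decisions_needed(0.0001, 1000)))                # decisions for 1,000 mistakes
```

Under that assumption a 0.0001% error rate still yields about five thousand mistakes a day, and it takes about one billion decisions a day to reach a thousand mistakes.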
Humans are very good at making mistakes all the time but they are also pretty good at improvising a remedy to each possible mistake. Cars are very good at not making mistakes, but, if they make a mistake, they won't even realize that they made a mistake: whatever they make is what they calculated to be the right thing to do. Machines don't think "Oh shoot! This is not right!" Machines cannot calculate the outcome of their action and realize that, no matter how good and rational the intention was, the result is a disaster, and must be undone, and ideally even aborted before it's done.
Conversely, in 2015 Anh Nguyen at the University of Wyoming showed that deep neural networks can easily be fooled into recognizing objects that don't exist ("Deep Neural Networks are Easily Fooled", 2015): two of the most popular neural networks (AlexNet and LeNet) recognized with more than 99% confidence some abstract tapestry as familiar objects. In 2016 Alhussein Fawzi and Seyed Moosavi of Pascal Frossard's team at the Federal Institute of Technology Lausanne (EPFL) in Switzerland developed the DeepFool algorithm to scientifically obtain perturbations that fool deep networks ("DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks", 2016), i.e. to quantify how robust a neural network is.
The problem does not go away in the three-dimensional world. MIT students Anish Athalye, Logan Engstrom and Andrew Ilyas fooled Google's InceptionV3 with a 3D-printed turtle: the neural network recognized it as a rifle ("Synthesizing Robust Adversarial Examples", 2017). The same neural network was fooled into recognizing a cat as guacamole, but that was still in the realm of two-dimensional images.
Working with Goodfellow, Alexey Kurakin at Google Brain showed that the neural network can be fooled even when the "adversarial example" is located in the physical world, e.g. when a camera takes a picture of a street sign that has been manipulated in "adversarial" manners, and sometimes all it takes is to add a few stickers to the letters "STOP" ("Adversarial Examples in the Physical World", 2016).
Ivan Evtimov and others at UC Berkeley and at the University of Washington created a general attack algorithm to fool deep neural networks, Robust Physical Perturbations or RP2 ("Robust Physical-World Attacks on Deep Learning Models", 2017).
Pieter Abbeel's student Sandy Huang at UC Berkeley, working with Ian Goodfellow (now at OpenAI), showed that deep reinforcement learning too is vulnerable to adversarial examples ("Adversarial Attacks on Neural Network Policies", 2017): that's the method used by A3C to play Atari videogames and by AlphaGo to play weichi/go. In 2017 Goodfellow and Nicolas Papernot even published an open-source library of adversarial examples, Cleverhans, that you can use to test how vulnerable your neural network is.
Similarly, Percy Liang and his student Robin Jia at Stanford showed how easily a question-answering neural network can be hijacked: beware of your favorite chatbots ("Adversarial Examples for Evaluating Reading Comprehension Systems", 2017).
Alexey Kurakin wrote: "Most existing machine learning classifiers are highly vulnerable to adversarial examples" (2016). That's a "highly", not just "a little".
In their article "Is Attacking Machine Learning Easier than Defending it?" (2017) Goodfellow and Papernot explain why it is almost impossible to defend a neural network from all possible adversarial examples: adversarial examples are solutions to a complex (nonlinear) optimization problem and no existing mathematical tool can model this kind of solution. In other words, we can't build a mathematical proof that a certain strategy would defend a neural network against any such attack.
It tells you something important about "machine intelligence" that today's machine intelligence fails if we change, even slightly, the requirements of the problem that it has just learned to solve. It is not intelligence, it is something else.
A neural network works better than a human being in the very finite world of ImageNet, that limits the number of possible categories to one thousand. In other words, the neural network is not trying to "recognize the object" but instead it is trying to "recognize which of the known one thousand objects this particular one is". If the object is none of them, then the neural network fails. If the neural network has been trained to recognize all of my friends but you showed it pictures of your friends who are not my friends, the neural network will try to find some among my friends who look like these unknown people.
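The "closed world" problem can be seen in the last step of a typical classifier: the scores are normalized so that they sum to one over the known labels, and the highest one wins. There is no slot for "none of the above". The sketch below is a minimal illustration under stated assumptions: three hypothetical labels stand in for ImageNet's one thousand, and the scores are made-up numbers standing in for what a network might output for an object it has never been trained on.

```python
import math

# Hypothetical label set standing in for ImageNet's one thousand categories.
LABELS = ["cat", "dog", "car"]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 over the known labels."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(scores):
    """Return the most probable known label and its probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Made-up scores for an object that is neither a cat, nor a dog, nor a car.
unknown_object_scores = [0.2, 0.1, 0.3]
label, confidence = classify(unknown_object_scores)
print(label, round(confidence, 2))
```

Whatever the input is, the answer is always one of the known labels: the classifier picks the "friend" who looks most like the stranger.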
It is false that deep learning is better than humans at recognizing images, but it is true that humans are a lot better than deep learning at learning abstractions through verbal definition, as documented by Joshua Tenenbaum of MIT, Brenden Lake of New York University and Ruslan Salakhutdinov of the University of Toronto ("Human-level Concept Learning through Probabilistic Program Induction", 2015).
It is true that deep learning fails all too easily in situations that differ just slightly from the situations for which the system has been trained. For example, a team at Dileep George's new startup Vicarious showed that the famous DeepMind system that learned to play Atari videogames better than videogame masters is actually quite incapable of simple adjustments ("Schema Networks", 2017).
And, no, obviously a convnet does not recognize your face: it just recognizes that the picture contains your nose and your eyes and your mouth, in whichever order. And it will recognize as your face anyone who looks like you, including a cast of your face. If they ever replace the human being with a face-recognition machine, simply make a cast of your face and anyone will be able to go through the machine by simply showing that cast. You can find several videos on YouTube on "How to make a face cast on yourself".
Deep learning is essentially a very complicated statistical method for classifying patterns of data.
Deep learning is about correlation (A and B happen together), not causation (A is caused by B): it cannot distinguish causation from correlation.
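A small simulation shows why correlation alone cannot settle the question. In the sketch below, which uses entirely made-up data, a hidden third factor (temperature) drives both ice-cream sales and sunburns: the two are strongly correlated although neither causes the other. Only an intervention, i.e. setting ice-cream sales by hand regardless of temperature, reveals that there is no causal link. The variable names and coefficients are assumptions chosen for the illustration.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so that the run is repeatable

# Hidden common cause (made-up data): daily temperature.
temperature = [rng.uniform(10, 35) for _ in range(1000)]

# Both quantities depend on temperature, not on each other.
ice_cream = [2.0 * t + rng.gauss(0, 5) for t in temperature]
sunburns = [1.5 * t + rng.gauss(0, 5) for t in temperature]
observed = pearson(ice_cream, sunburns)

# Intervention: set ice-cream sales by hand, ignoring temperature.
forced_ice_cream = [rng.uniform(20, 70) for _ in temperature]
intervened = pearson(forced_ice_cream, sunburns)

print("correlation in the observed data:   %.2f" % observed)
print("correlation after the intervention: %.2f" % intervened)
```

A pattern classifier that only sees the observed data has no way to tell the first situation from one in which ice cream really does cause sunburns.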
The discipline of deep learning is reminiscent of alchemy before chemistry was invented and of engineering before physics was invented and of medicine before biology was invented. These disciplines could boast some success stories, but their progress was based on minor improvements over what worked, not on an understanding of why it worked. The Romans could build amazing aqueducts because they had figured out that arches can support weight, but didn't know why it worked. Deep learning is in a similar situation: conference papers document small improvements over success stories, but only reference the previous success stories, not a scientific theory of intelligence, just like the Romans didn't know Isaac Newton's equations and alchemists didn't know Antoine Lavoisier's formulas.
Beware, in particular, of machine learning in social sciences. The British prime minister Benjamin Disraeli once said "There are lies, damned lies and statistics". At some point we may have to say: "There are lies, damned lies, and machine learning".
Deep learning depends in an essential way on human expertise. It needs a huge dataset of human-prepared cases in order to "beat" the humans at their game (chess, go/weichi, etc). A world in which humans don't exist (or don't collaborate) would be a difficult place for deep learning. A world in which the expertise is generated by other deep-learning machines would be even tougher. For example, Google's translation software simply learns from all the translations that it can find. If many English-to-Italian human translators over the centuries have translated "table" with "tavolo", it learns to translate "table" into "tavolo". But what if someone injected into the Web thousands of erroneous translations of "table"? Scientists at Google are beginning to grapple with the fact that the dataset of correct translations, which is relentlessly being updated from what Google's "crawlers" find on the web, may degrade rapidly as humans start posting approximate translations made with Google's translation software. If you publish a mistake made by the robot as if it were human knowledge, you fool all the other robots who are trying to learn from human expertise. Today's robots, equipped with deep learning, learn from our experts, not from each other. We learn from experts and by ourselves, i.e. by "trial and error" or through lengthy, excruciating research. Robots learn from experts, human experts, the best human experts. Google's translation software is not the best expert in translation. If it starts learning from itself (from its own mediocre translations), it will never improve.
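The vulnerability can be illustrated with a deliberately crude learner that simply adopts the most frequent translation it has seen. This is an assumption made for the sake of the example, not a description of how Google's software actually works, and the corpus counts are invented.

```python
from collections import Counter

def learn_translation(corpus, word):
    """Pick the translation of `word` that appears most often in the corpus."""
    counts = Counter(target for source, target in corpus if source == word)
    return counts.most_common(1)[0][0]

# Made-up corpus of human translations: most translators chose "tavolo".
human_corpus = [("table", "tavolo")] * 900 + [("table", "tavola")] * 100
before = learn_translation(human_corpus, "table")

# Someone floods the web with 2,000 erroneous translations ("sedia" means "chair").
poisoned_corpus = human_corpus + [("table", "sedia")] * 2000
after = learn_translation(poisoned_corpus, "table")

print(before)
print(after)
```

The learner has no way to know that "sedia" is wrong: it only knows that "sedia" is now the majority. The same happens, more slowly, when the flood consists of the learner's own mediocre translations posted as if they were human work.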
Supervised learning is "learning by imitation", which is as good as the person you are imitating. That's why the generation of AlphaGo is introducing additional tricks. Reinforcement learning, which was the topic of Minsky's PhD thesis in 1954, is a way for the machine to learn more than any of the human experts have learned, because it can play thousands of games against itself while human experts can only play a few each week. Another useful addition to deep learning (also used by AlphaGo) is tree-search, invented by Minsky's mentor Claude Shannon in 1950.
Similar considerations apply to robots. World knowledge is vital to perform ordinary actions. Robot dexterity has greatly improved thanks to a multitude of sensors, motors and processors. But grabbing an object is not only about directing the movement of the hand, but also about controlling it. Grabbing a paper cup is not the same as grabbing a book: the paper cup might collapse if your hand squeezes it too much. And grabbing a paper cup full of water is different from grabbing an empty paper cup: you don't want to spill the water. Moving about an environment requires knowledge about furniture, doors, windows, elevators, etc. The Stanford robot that in 2013 was trained to buy a cup of coffee at the cafeteria upstairs had to learn that a) you don't break the door when you pull down the handle; b) you don't spill coffee on yourself because it would cause a short circuit; c) you don't break the button that calls the elevator; etc; and, as mentioned, that the image in the elevator's mirror is you and you don't need to wait for yourself to come out of the elevator.
We interact with objects all the time, meaning that we know what we can do with any given object.
Your body has a history. The machine needs to know that history in order to navigate the labyrinth of your world and the even more confusing labyrinth of your intentions.
Finally, there are ethical principles. The definition of what constitutes "success" in the real world is not obvious. For example: getting to an appointment in time is "good", but not if this implies running over a few pedestrians; a self-driving car should avoid crashing against walls, unless it is the only way to avoid a child.
Most robots have been designed for and deployed in structured environments, such as factories, in which the goal to be achieved does not interfere with ordinary life. But a city street or a home contains much more than simply the tools to achieve a goal.
"Computers are useless: they can only give you answers" (Pablo Picasso, 1964).