(These are excerpts from my book "Intelligence is not Artificial")
Deep Learning: A Brief History of Artificial Intelligence / Part 2
Knowledge-based systems did not expand as expected: the human experts were not terribly excited at the idea of helping construct clones of themselves, and, in any case, the clones were not terribly reliable.
Expert systems also failed because of the World Wide Web: you don't need an expert system when thousands of human experts post the answer to all possible questions. All you need is a good search engine. That search engine plus those millions of items of information posted (free of charge) by thousands of people around the world do the job that the "expert system" was supposed to do. The expert system was a highly intellectual exercise in representing knowledge and in reasoning heuristically. The Web is a much bigger knowledge base than any expert-system designer ever dreamed of. The search engine has no pretense of sophisticated logic but, thanks to the speed of today's computers and networks, it "will" find the answer on the Web. Within the world of computer programs, the search engine is a brute that can do the job once reserved for artists.
Note that the apparent "intelligence" of the Web (its ability to answer all sorts of questions) arises from the "non-intelligent" contributions of thousands of people in a way very similar to how the intelligence of an ant colony emerges from the non-intelligent contributions of thousands of ants.
In retrospect, a lot of sophisticated logic-based software had to do with slow and expensive machines. As machines get cheaper and faster and smaller, we don't need sophisticated logic anymore: we can just use fairly dumb techniques to achieve the same goals. As an analogy, imagine if cars, drivers and gasoline were very cheap and goods were provided for free by millions of people: it would be pointless to try and figure out the best way to deliver a good to a destination because one could simply ship many of those goods via many drivers with an excellent chance that at least one good would be delivered on time at the right address. The route planning and the skilled knowledgeable driver would become useless, which is precisely what has happened in many fields of expertise in the consumer society: when is the last time you used a cobbler or a watch repairman?
The motivation to come up with creative ideas for A.I. scientists was due to slow, big and expensive machines. Now that machines are fast, small and cheap the motivation to come up with creative ideas is much reduced. Now the real motivation for A.I. scientists is to have access to thousands of parallel processors and let them run for months. Creativity has shifted to coordinating those processors so that they will search through billions of items of information. The machine intelligence required in the world of cheap computers has become less of a logical intelligence and more of a “logistical” intelligence.
Meanwhile, in the 1980s some conceptual breakthroughs fueled real progress in robotics. The Italian cyberneticist Valentino Braitenberg, in his "Vehicles" (1984), showed that no intelligence is required for producing "intelligent" behavior: all that is needed is a set of sensors and actuators. As the complexity of the "vehicle" increases, the vehicle seems to display increasingly intelligent behavior. Starting in about 1987, Rodney Brooks at MIT began to design robots that use little or no representation of the world. One can know nothing, and have absolutely no common sense, but still be able to do interesting things if equipped with the appropriate set of sensors and actuators.
The 1980s also witnessed a progressive rehabilitation of neural networks, a process that turned exponential in the 2000s. The discipline was rescued in 1982 by the CalTech physicist John Hopfield, who described a new generation of neural networks, based on simulating the physical process of annealing. These neural networks were immune to Minsky's critique. Hopfield's key intuition was to note the similarity with statistical mechanics. Statistical mechanics translates the laws of Thermodynamics into statistical properties of large sets of particles. The fundamental tool of statistical mechanics (and soon of this new generation of neural networks) is the Boltzmann distribution (actually discovered by Josiah Willard Gibbs in 1901), a method to calculate the probability that a physical system is in a specified state. Meanwhile, in 1974 Paul Werbos had worked out a more efficient way to train a neural network: the “backpropagation” algorithm.
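To make the Boltzmann distribution concrete, here is a minimal sketch in Python. It assumes a system with a handful of discrete states of known energy and sets the Boltzmann constant to 1; the function name and the three example energies are purely illustrative.

```python
import math

def boltzmann_probabilities(energies, temperature=1.0):
    """Probability of each state, proportional to exp(-energy / temperature)."""
    # Shifting by the minimum energy avoids overflow and cancels out after normalization.
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    partition = sum(weights)  # normalizing constant (the "partition function")
    return [w / partition for w in weights]

# Three hypothetical states: the lower the energy, the more probable the state.
cold = boltzmann_probabilities([0.0, 1.0, 2.0], temperature=0.5)
hot = boltzmann_probabilities([0.0, 1.0, 2.0], temperature=100.0)
```

At low temperature the system is almost certainly in its lowest-energy state; at high temperature all states become nearly equally likely. Annealing exploits exactly this by lowering the temperature gradually.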
Building on Hopfield's ideas, in 1983 Geoffrey Hinton at Carnegie Mellon University and Terry Sejnowski at Johns Hopkins University developed the so-called Boltzmann machine (technically, a Monte Carlo version of the Hopfield network), a software technique for networks capable of learning; and in 1986 Paul Smolensky at the University of Colorado introduced a further optimization, the Restricted Boltzmann Machine. These were carefully calibrated mathematical algorithms to build neural networks that were both feasible (given the dramatic processing requirements of neural network computation) and plausible (able to solve the problem correctly). Historical trivia: the Monte Carlo method of simulation had been one of the first applications that John Von Neumann had programmed on the ENIAC, right after inventing it with Stanislaw Ulam in 1946 as part of a top-secret military program.
This school of thought merged with another one that was coming from a background of statistics and neuroscience. Credit goes to Judea Pearl of UC Los Angeles for introducing Bayesian thinking into Artificial Intelligence to deal with probabilistic knowledge ("Reverend Bayes on Inference Engines", 1982). Thomas Bayes was the 18th-century mathematician who developed Probability Theory as we know it today. Ironically, he never published his main achievement, which today we know as Bayes' theorem.
A kind of Bayesian network, the Hidden Markov Model, was already being used by A.I., particularly for speech recognition. The Hidden Markov Model is a Bayesian network that has the sense of time and can model a sequence of events. It was invented by Leonard Baum in 1966 at the Institute for Defense Analyses in New Jersey, and first used in speech recognition by Jim Baker at Carnegie Mellon University in 1973, and later by Fred Jelinek at IBM. Statistical methods based on the Hidden Markov Model for speech processing became popular with Jack Ferguson's “The Blue Book”, which was the outcome of his lectures at the Institute for Defense Analyses in 1980.
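A small sketch may help show what it means for a Hidden Markov Model to "model a sequence of events". The code below implements the standard forward algorithm, which computes the probability of an observed sequence by summing over all possible hidden-state paths. The two hidden states, the two symbols and all the probabilities are invented for illustration; a speech recognizer would use far larger models.

```python
def forward_likelihood(observations, start_p, trans_p, emit_p):
    """Probability of the observation sequence under the HMM (forward algorithm)."""
    states = list(start_p)
    # alpha[s] = probability of the observations so far, ending in hidden state s
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical toy model: two hidden states emitting the symbols "a" and "b".
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"a": 0.9, "b": 0.1}, "S2": {"a": 0.2, "b": 0.8}}

likelihood = forward_likelihood(["a", "b", "a"], start_p, trans_p, emit_p)
```

A recognizer would typically keep one such model per word or phoneme and pick the model that assigns the highest likelihood to the incoming sequence.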
Reinforcement learning was invented even before the field was called "Artificial Intelligence": it was the topic of Minsky's PhD thesis in 1954. Reinforcement learning was first used in 1959 by Samuel's checkers-playing program. In 1961 Donald Michie, a British wartime codebreaker, colleague of Alan Turing and molecular biologist at the University of Edinburgh, built a device (made of matchboxes!) to play Tic-Tac-Toe called MENACE (Matchbox Educable Noughts and Crosses Engine) that learned how to improve its performance. In 1976 John Holland (of genetic algorithms fame) introduced classifier systems, which are reinforcement-learning systems.
Reinforcement learning was resurrected in the early 1980s by Andrew Barto and his student Richard Sutton at the University of Massachusetts. They applied ideas published by the mathematician Harry Klopf at the Air Force research laboratories in Boston in his 40-page report "Brain Function and Adaptive Systems" (1972): the neuron is a goal-directed agent and a hedonistic one; neurons actively seek "excitatory" signals and avoid "inhibitory" signals.
All the studies on reinforcement learning since Michie's MENACE converged in the Q-learning algorithm invented in 1989 at Cambridge University by Christopher Watkins, which is, technically speaking, a method for solving a Markov decision process ("Learning from Delayed Rewards", 1989). Watkins basically discovered the similarities between reinforcement learning and the theory of optimal control that had been popular in the 1950s thanks to the work of Lev Pontryagin in Russia (the "maximum principle" of 1956) and Richard Bellman at RAND Corporation (the "Bellman equation" of 1957). Trivia: Bellman is the one who coined the expression "the curse of dimensionality" that came to haunt the field of neural networks.
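The connection between Q-learning and the Bellman equation is easier to see in code. The following is a minimal sketch of tabular Q-learning, not Watkins' original formulation: the environment (a five-cell corridor with a reward only at the right end), the parameter values and all names are assumptions made for illustration.

```python
import random

def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning on a toy corridor; the only reward is at the rightmost cell."""
    rng = random.Random(seed)
    goal = n_states - 1
    q = [[0.0, 0.0] for _ in range(n_states)]  # q[state][action]; 0 = left, 1 = right
    for _ in range(episodes):
        state = 0
        while state != goal:
            if rng.random() < epsilon:  # explore occasionally
                action = rng.randrange(2)
            else:  # otherwise act greedily, breaking ties at random
                best = max(q[state])
                action = rng.choice([a for a in (0, 1) if q[state][a] == best])
            next_state = max(state - 1, 0) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # Bellman-style target: immediate reward plus discounted best future value
            future = 0.0 if next_state == goal else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q

q_table = train_q_table()
policy = ["left" if left > right else "right" for left, right in q_table[:-1]]
```

The agent is never told that "right" is the correct move: the delayed reward at the end of the corridor propagates backwards, one cell at a time, through the update rule.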
Meanwhile, the Swedish statistician Ulf Grenander (who in 1972 had established the Brown University Pattern Theory Group) fostered a conceptual revolution in the way a computer should describe knowledge of the world: not as concepts but as patterns. His "general pattern theory" provided mathematical tools for identifying the hidden variables of a data set. Grenander's pupil David Mumford studied the visual cortex and came up with a hierarchy of modules in which inference is Bayesian, and it is propagated both up and down ("On The Computational Architecture Of The Neocortex II", 1992). The assumption was that feedforward/feedback loops in the visual region integrate top-down expectations and bottom-up observations via probabilistic inference. Basically, Mumford applied hierarchical Bayesian inference to model how the brain works.
Hinton's Helmholtz machine of 1995 was de facto an implementation of those ideas: an unsupervised learning algorithm to discover the hidden structure of a set of data based on Mumford's and Grenander's ideas.
The hierarchical Bayesian framework was later refined by Tai Sing Lee of Carnegie Mellon University ("Hierarchical Bayesian Inference In The Visual Cortex", 2003). These studies were also the basis for the widely publicized "Hierarchical Temporal Memory" model of the startup Numenta, founded in 2005 in Silicon Valley by Jeff Hawkins, Dileep George and Donna Dubinsky; yet another path to get to the same paradigm: hierarchical Bayesian belief networks.
The field did not take off until 2006, when Geoffrey Hinton at the University of Toronto developed Deep Belief Networks, a fast learning algorithm for Restricted Boltzmann Machines. What had truly changed between the 1980s and the 2000s was the speed (and the price) of computers. Hinton's algorithms worked wonders when used on thousands of parallel processors. That's when the media started publicizing all sorts of machine-learning feats.
Deep Belief Networks are layered hierarchical architectures that stack Restricted Boltzmann Machines one on top of the other, each one feeding its output as input to the one immediately higher, with the two top layers forming an associative memory. The features discovered by one RBM become the training data for the next one.
Hinton and others had discovered how to create neural networks with many layers. One layer learns something and passes it on to the next one, which uses that something to learn something else and passes it on to the next layer, etc.
DBNs are still limited in one respect: they are “static classifiers”, i.e. they operate at a fixed dimensionality. However, speech or images don't come in a fixed dimensionality, but in a (wildly) variable one. They require “sequence recognition”, i.e. dynamic classifiers, that DBNs cannot provide. One method to expand DBNs to sequential patterns is to combine deep learning with a “shallow learning architecture” like the Hidden Markov Model.
Another thread in “deep learning” originated with convolutional networks invented in 1980 by Kunihiko Fukushima in Japan. Fukushima's Neocognitron was directly based on the studies of the cat's visual system published in 1962 by two Harvard neurobiologists, David Hubel (originally from Canada) and Torsten Wiesel (originally from Sweden). They proved that visual perception is the result of successive transformations, or, if you prefer, of propagating activation patterns. They discovered two types of neurons: simple cells, which respond to only one type of visual stimulus and behave like convolutions, and complex cells. Fukushima's system was a multistage architecture that mimicked those different kinds of neurons.
In 1989 Yann LeCun at Bell Labs applied backpropagation to convolutional networks to solve the problem of recognizing handwritten numbers (and then in 1994 for face detection and then in 1998 for reading cheques).
Deep neural networks already represented progress over the traditional three-layer networks, but it was really the convolutional approach that made the difference. They are called “convolutional” because they employ a technique of filtering that recalls the transformations caused by the mathematical operation of convolution.
A convolutional neural network consists of several convolutional layers. Each convolutional layer consists of a convolution or filtering stage (the “simple cell”), a detection stage, and a pooling stage (the “complex cell”). The result of each convolutional layer takes the form of “feature maps”, which are the input to the next convolutional layer. The last layer is a classification module.
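The three stages of a convolutional layer can be sketched in a few lines of plain Python. This is a toy illustration under several assumptions: a single hand-picked filter (real networks learn their filters and use many per layer), a rectifier as the detection stage, and 2x2 max pooling. The function names are illustrative, not taken from any library.

```python
def convolve2d(image, kernel):
    """Filtering stage: slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

def detect(feature_map):
    """Detection stage: a nonlinearity, here a rectifier that zeroes negative responses."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Pooling stage: keep the strongest response in each size x size block."""
    return [
        [
            max(feature_map[i + di][j + dj] for di in range(size) for dj in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]

def conv_layer(image, kernel):
    return max_pool(detect(convolve2d(image, kernel)))

# A 5x5 image that is dark on the left and bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# A hand-picked filter that responds to dark-to-bright vertical edges.
edge_filter = [[-1, 1], [-1, 1]]
feature_map = conv_layer(image, edge_filter)  # [[2, 0], [2, 0]]
```

The resulting feature map is nonzero only where the edge sits, and pooling makes that response tolerant of small shifts in the edge's position.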
The detection stage of each convolutional layer is the middleman between simple cells and complex cells and provides the nonlinearity of the traditional multilayer neural network. Traditionally, this nonlinearity was provided by a mathematical function called “sigmoidal”, but in 2011 Yoshua Bengio ("Deep Sparse Rectifier Neural Networks") introduced a more efficient function, the “rectified linear unit”, also inspired by the brain, that has the further advantage of avoiding the “vanishing gradient” problem of sigmoidal units.
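A back-of-the-envelope calculation shows why the gradient "vanishes" with sigmoidal units. During backpropagation the error signal is multiplied by the derivative of the activation function at every layer. The sketch below deliberately ignores the weights (a simplifying assumption) and just multiplies those derivatives through ten layers.

```python
import math

def sigmoid_grad(x):
    """Derivative of the sigmoid; it never exceeds 0.25 (its value at x = 0)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of the rectified linear unit: 1 for active units, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def gradient_through_layers(grad_fn, pre_activation, depth):
    """Product of `depth` identical local derivatives (weights ignored for simplicity)."""
    return grad_fn(pre_activation) ** depth

sigmoid_signal = gradient_through_layers(sigmoid_grad, 0.0, depth=10)  # 0.25 ** 10
relu_signal = gradient_through_layers(relu_grad, 1.0, depth=10)        # 1.0
```

Even in the sigmoid's best case the signal shrinks by a factor of about a million over ten layers, whereas an active rectified unit passes it through unchanged.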
Every layer of a convolutional network detects a set of features, starting with large features and moving on to smaller and smaller features. Imagine a group of friends subjected by you to a simple game. You show a picture to one of them, and allow him to provide a short description of the picture to only one other person, using only a very vague vocabulary; for example: an object with four limbs and two colors. This new person can then summarize that description in a more precise vocabulary to the next person; for example: a four-legged animal with black and white stripes. Each person is allowed to use a more and more specific vocabulary to the next person. Eventually, the last person can only utter names of objects, and hopefully correctly identifies the picture because, by the time it reaches this last person, the description has become fairly clear (e.g. the mammal whose skin is black and white, i.e. the zebra).
(Convolution is a well-defined mathematical operation that, given two functions, generates a third one, according to a simple formula. This is useful when the new function is an approximation of the first one, but easier to analyze. You can find many websites that provide “simple” explanations of what a convolution is and why we need them: these “simple” explanations are a few pages long, and virtually nobody understands them, and each of them is completely different from the other one. Now you know where the term “convoluted” comes from!)
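For finite sequences of numbers the formula is short enough to fit in a few lines: each output value is a sum of products of one sequence with a shifted copy of the other. The sketch below is a direct, unoptimized implementation; the function name and the example sequences are illustrative.

```python
def convolve(f, g):
    """Discrete convolution: out[n] is the sum of f[i] * g[n - i] over all valid i."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

signal = [1, 2, 3]
window = [1, 1]  # summing neighbors: a crude smoothing filter
smoothed = convolve(signal, window)  # [1, 3, 5, 3]
```

Here the second sequence acts as a filter sliding over the first one, which is the sense in which a convolutional network "filters" an image.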
In 1990 Robert Jacobs at the University of Massachusetts introduced the "mixture-of-experts" architecture that trains different neural networks simultaneously and lets them compete to learn, with the result that different networks end up learning different functions ("Task Decomposition Through Competition in a Modular Connectionist Architecture", 1990).
Meanwhile in 1996 David Field and Bruno Olshausen at Cornell University had invented "sparse coding", an unsupervised technique for neural networks to learn the patterns inherent in a dataset. Sparse coding helps neural networks represent data in an efficient way that can be used by other neural networks.
Each neural network is, ultimately, a combination of "encoder" and "decoder": the first layers encode the input and the last layers decode it. For example, when my brain recognizes an object as an apple, it has first encoded the image into some kind of neural activity (representing shape, color, size, etc. of the object), and has then decoded that neural activity as an apple.
The “stacked autoencoders” developed in 2007 by Yoshua Bengio at the University of Montreal further improved the efficiency of capturing patterns in a dataset. There are cases in which a neural network would turn into a very poor classifier because of the nature of the training data. In that case a neural network called "autoencoder" can learn the important features of the data in an unsupervised way. So autoencoders are special cases of unsupervised neural networks, and they are more efficient than sparse coding. An autoencoder is designed to reconstruct its inputs, which forces its middle (hidden) layer to form useful representations of the inputs. Then these representations can be used by a neural network for a supervised task such as classification. In other words, a stacked autoencoder learns something about the distribution of data and can be used to pretrain a neural network that has to operate on those data.
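The idea that reconstructing the input forces the hidden layer to find a useful representation can be illustrated with the smallest possible autoencoder: two inputs, one hidden unit, two outputs, all linear. This is only a sketch under strong assumptions (no nonlinearity, no stacking, hand-made data lying on the line y = 2x, fixed initial weights), not Bengio's method.

```python
def train_autoencoder(data, epochs=200, lr=0.05):
    """Train a linear 2-1-2 autoencoder by gradient descent on the reconstruction error."""
    w = [0.1, 0.1]  # encoder weights: 2 inputs -> 1 hidden unit
    v = [0.1, 0.1]  # decoder weights: 1 hidden unit -> 2 outputs

    def mean_error():
        total = 0.0
        for x in data:
            h = w[0] * x[0] + w[1] * x[1]
            total += sum((v[k] * h - x[k]) ** 2 for k in range(2))
        return total / len(data)

    initial = mean_error()
    for _ in range(epochs):
        for x in data:
            h = w[0] * x[0] + w[1] * x[1]              # encode
            err = [v[k] * h - x[k] for k in range(2)]  # decode and compare with the input
            grad_h = err[0] * v[0] + err[1] * v[1]     # error signal reaching the hidden unit
            for k in range(2):
                v[k] -= lr * err[k] * h
                w[k] -= lr * grad_h * x[k]
    return w, v, initial, mean_error()

# Two-dimensional points that really have only one degree of freedom (y = 2x).
points = [(-1.0, -2.0), (-0.5, -1.0), (0.5, 1.0), (1.0, 2.0)]
w, v, initial_error, final_error = train_autoencoder(points)
```

Nobody tells the network that the data lie on a line. The single hidden unit discovers that one number is enough to describe each point, and the decoder ends up with weights in the ratio 1 to 2.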
Therefore, many scientists contributed to the “invention” of deep learning and to the resurrection of neural networks. But the fundamental contribution came from Moore's Law: between the 1980s and 2006 computers had become enormously faster, cheaper and smaller. A.I. scientists were able to implement neural networks that were hundreds of times more complex, and able to train them with millions of data items. This was still unthinkable in the 1980s. Therefore, what truly shifted the balance from the logical approach to the connectionist approach in A.I. between 1986 (when Restricted Boltzmann machines were invented) and 2006 (when deep learning matured) was Moore's Law. Without massive improvements in the speed and cost of computers deep learning would not have happened. Deep learning owes a huge debt of gratitude to the supercharged GPUs (Graphics Processing Units) that have become affordable in the 2010s.
Credit for the rapid progress in convolutional networks goes mainly to mathematicians, who were working on techniques for matrix-matrix multiplication and made their systems available as open-source software. The software used by neural-network designers, such as UC Berkeley's Caffe, reduces a convolution to a matrix-matrix multiplication. This is a problem of linear algebra for which seasoned mathematicians had provided solutions. Initial progress took place at the Jet Propulsion Laboratory (JPL), a research center in California operated by the California Institute of Technology (CalTech) for the space agency NASA. Charles Lawson had been the head of the Applied Math Group at JPL since 1965. Lawson and his employee Richard Hanson developed software for linear algebra, including software for matrix computation, that was to be applied to astronomical problems such as gravitational fields. In 1979, together with Fred Krogh, an expert in differential equations, they released a Fortran library called Basic Linear Algebra Subprograms (BLAS). By 1990 BLAS 3 incorporated a routine for matrix-matrix operations called GEneral Matrix to Matrix Multiplication (GEMM), largely developed at Britain's Numerical Algorithms Group (instituted in 1970 as a joint project between several British universities and the Atlas Computer Laboratory). The computational "cost" of a neural network is mainly due to two kinds of layers: the fully-connected layers and the convolutional layers. Both kinds entail massive multiplications of matrices; literally millions of them in the case of image recognition. Without something like GEMM no array of GPUs could perform the task.
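How a convolution gets reduced to a matrix-matrix multiplication can be shown with a toy example. The usual trick (often called "im2col") copies every image patch into a row of a matrix and every filter into a column of another matrix, so that a single matrix product applies all the filters at all the positions. The sketch below uses plain Python lists and a naive multiplication where real software would call an optimized GEMM routine; the function names and the data are illustrative.

```python
def im2col(image, kh, kw):
    """Unroll every kh x kw patch of the image into one row of a matrix."""
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [
        [image[i + di][j + dj] for di in range(kh) for dj in range(kw)]
        for i in range(out_h) for j in range(out_w)
    ]

def matmul(a, b):
    """Naive matrix-matrix product; an optimized GEMM routine would replace this."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def conv_as_gemm(image, kernels):
    """Apply every kernel at every position with one matrix multiplication."""
    kh, kw = len(kernels[0]), len(kernels[0][0])
    patches = im2col(image, kh, kw)  # (positions) x (kh * kw)
    weights = [[k[di][dj] for k in kernels]  # (kh * kw) x (number of kernels)
               for di in range(kh) for dj in range(kw)]
    return matmul(patches, weights)  # (positions) x (number of kernels)

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernels = [[[1, 0], [0, 1]],   # sums the main diagonal of each patch
           [[1, 1], [1, 1]]]   # sums the whole patch
responses = conv_as_gemm(image, kernels)  # [[6, 12], [8, 16], [12, 24], [14, 28]]
```

The price of the trick is memory, because overlapping patches are copied several times, but the payoff is that decades of work on fast matrix multiplication can be reused unchanged.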
A landmark achievement of deep-learning neural networks was published in 2012 by Alex Krizhevsky and Ilya Sutskever from Hinton's group at the University of Toronto: they demonstrated that deep learning (using a convolutional neural network with five convolutional layers and Bengio's rectified linear unit) outperforms traditional techniques of computer vision after processing 200 billion images during training (1.2 million human-tagged images plus thousands of computer-generated variants of each). Deep convolutional neural networks became the de facto standard for computer-vision systems.
In 2013 Google hired Hinton and Facebook hired LeCun.
Trivia: none of the protagonists of deep learning were born in the USA, although they all ended up working there. Fukushima is Japanese, LeCun and Bengio are French, Hinton is British, Ng is Chinese, Krizhevsky and Sutskever are Russian, Olshausen is Swiss. Add Hava Siegelmann from Israel, Sebastian Thrun and Sepp Hochreiter from Germany, Daniela Rus from Romania, Fei-Fei Li from China, and the DeepMind founders from Britain and New Zealand.
Deep Belief Nets are probabilistic models that consist of multiple layers of probabilistic reasoning. Thomas Bayes' theorem of the 18th century is rapidly becoming one of the most influential scientific discoveries of all time (not bad for an unpublished manuscript discovered after Bayes' death). Bayes' theory of probability interprets knowledge as a set of probabilistic (not certain) statements and interprets learning as a process to refine those probabilities. As we acquire more evidence, we refine our beliefs. In 1996 the developmental psychologist Jenny Saffran showed that babies use probability theory to learn about the world, and they do learn a lot of facts very quickly. So Bayes had stumbled on an important fact about the way the brain works, not just a cute mathematical theory.
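The process of refining a belief as evidence accumulates is a direct application of Bayes' theorem, and can be sketched in a few lines. The numbers below are invented: a hypothesis that starts at 50% and a kind of observation that is seen 80% of the time when the hypothesis is true and 30% of the time when it is false.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Posterior probability of a hypothesis after one piece of evidence."""
    evidence = prior * likelihood_if_true + (1 - prior) * likelihood_if_false
    return prior * likelihood_if_true / evidence

belief = 0.5
history = [belief]
for _ in range(5):  # the same kind of evidence is observed five times
    belief = bayes_update(belief, likelihood_if_true=0.8, likelihood_if_false=0.3)
    history.append(belief)
```

Each observation uses the previous posterior as the new prior, so the belief climbs steadily towards certainty without ever quite reaching it.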
Since 2012 all the main software companies have invested in A.I. startups: Amazon (Kiva, 2012), Google (Neven, 2006; Industrial Robotics, Meka, Holomni, Bot & Dolly, DNNresearch, Schaft, Bost, DeepMind, Redwood Robotics, 2013-14), IBM (AlchemyAPI, 2015; plus the Watson project), Microsoft (Project Adam, 2014), Apple (Siri, 2011; Perceptio and VocalIQ, 2015; Emotient, 2016), Facebook (Face.com, 2012), Yahoo (LookFlow, 2013), Twitter (WhetLab, 2015), Salesforce (MetaMind, 2016), etc.
Since 2012 the applications of deep learning have multiplied. Deep learning has been applied to big data, biotech, finance, health care… Countless fields hope to automate the understanding and classification of data with deep learning.
Several platforms for deep learning have become available as open-source software: Torch (New York University), Caffe (Trevor Darrell's group at UC Berkeley), Theano (Univ of Montreal, Canada), Chainer (Preferred Networks, Japan), TensorFlow (Google), etc. This open-source software multiplies the number of people who can experiment with deep learning.
In 2015 Matthias Bethge's team at the University of Tübingen in Germany taught a neural network to capture an artistic style and then applied the artistic style to any picture.
The game of go/weichi had been a favorite field of research since the birth of deep learning. In 2006 Rémi Coulom introduced the Monte Carlo Tree Search algorithm and applied it to go/weichi. This algorithm dramatically improved the chances of machines beating go masters: in 2009 Fuego Go (developed at the University of Alberta) beat Zhou Junxun, in 2010 MogoTW (developed by a French-Taiwanese team) beat Catalin Taranu, in 2012 Tencho no Igo/Zen (developed by Yoji Ojima) beat Takemiya Masaki, in 2013 Crazy Stone (by Rémi Coulom) beat Yoshio Ishida, and in 2016 AlphaGo (developed by Google's DeepMind) beat Lee Sedol. DeepMind's victory was widely advertised. DeepMind used a slightly modified Monte Carlo algorithm but, more importantly, AlphaGo taught itself by playing against itself (what is called "reinforcement learning"). AlphaGo's neural network was trained with 150,000 games played by go/weichi masters. DeepMind had previously combined convolutional networks with reinforcement learning to train a neural network to play video games ("Playing Atari with Deep Reinforcement Learning", 2013).
By mixing deep learning and reinforcement learning one can also get Deep Q-Networks (DQN), developed at DeepMind by Volodymyr Mnih and others in 2013.
Ironically, few people noticed that in September 2015 Matthew Lai unveiled an open-source chess engine called Giraffe that uses deep reinforcement learning to teach itself how to play chess (at international master level) in 72 hours. It was designed by just one person and it ran on the humble computer of his department at Imperial College London. (Lai was hired by Google DeepMind in January 2016, two months before AlphaGo's exploit against the Go master).
In 2016 Toyota demonstrated a self-teaching car, another application of deep reinforcement learning like AlphaGo: a number of cars are left to randomly roam the territory with the only rule that they have to avoid accidents. After a while, the cars learn how to drive properly in the streets.
The funny thing about convolutional networks is that nobody really knows why they work so well when they work well. Designing a convolutional network is still largely a process of "trial and error". In the paper "Why Does Deep And Cheap Learning Work So Well?" (2016) Henry Lin, a physicist at Harvard University, and Max Tegmark, a mathematician at MIT, advanced the hypothesis that "deep learning" neural networks may have something profound in common with the nature of our universe.
Poker was another game targeted by A.I. scientists, so much so that the University of Alberta even set up a Computer Poker Research Group. Here in 2007 Michael Bowling developed an algorithm called Counterfactual Regret Minimization or CFR ("Regret Minimization in Games with Incomplete Information", 2007), based on the "regret matching" algorithm invented in 2000 by Sergiu Hart and Andreu Mas-Colell at the Einstein Institute of Mathematics in Israel ("A Simple Adaptive Procedure Leading to Correlated Equilibrium", 2000). These are techniques of self-play: given some rules describing a game, the algorithm plays against itself and develops its own strategy for playing the game better and better. It's yet another form of reinforcement learning, except that in this case reinforcement learning is used to devise the strategy from scratch, not to learn the strategy used by humans. The goal of CFR and its numerous variants is to approximate solutions for imperfect-information games such as poker. CFR variants became the algorithms of choice for "poker bots" used in computer poker competitions. In 2015 Bowling's team developed Cepheus and Tuomas Sandholm at Carnegie Mellon University developed Claudico, which played the professionals at a casino in Pittsburgh (Pennsylvania). Claudico lost but in 2017 Libratus, created by the same group, won. Libratus employed a new algorithm, called CFR+, introduced in 2014 by Finnish hacker Oskari Tammelin ("Solving Large Imperfect Information Games Using CFR+", 2014), that learns much faster than previous versions of CFR. However, the setting was absolutely unnatural, in particular to rule out card luck. It is safe to state that no human players had ever played poker in such a setting before. But it was telling that the machine started winning when the number of players was reduced and the duration of the tournament was extended: more players in a shorter time beat Claudico, but fewer players over a longer time lost to Libratus.
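The "regret matching" rule at the heart of CFR is simple enough to sketch. After each round the algorithm asks how much better it would have done with each alternative action, accumulates those "regrets", and then plays each action in proportion to its positive regret. The example below is a deliberate simplification and not CFR itself: the game is rock-paper-scissors rather than poker, the opponent is a fixed hypothetical player who favors rock instead of a copy of the learner, and expected payoffs are used instead of sampled hands.

```python
ACTIONS = ["rock", "paper", "scissors"]
# PAYOFF[a][b]: payoff to the player choosing action a against an opponent choosing b
PAYOFF = [[0, -1, 1],
          [1, 0, -1],
          [-1, 1, 0]]

def strategy_from_regrets(regrets):
    """Play each action in proportion to its positive regret (uniform if there is none)."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total == 0:
        return [1.0 / len(regrets)] * len(regrets)
    return [p / total for p in positive]

def regret_matching(opponent, iterations=1000):
    """Return the average strategy learned against a fixed opponent mix."""
    regrets = [0.0] * 3
    strategy_sum = [0.0] * 3
    for _ in range(iterations):
        strategy = strategy_from_regrets(regrets)
        strategy_sum = [s + p for s, p in zip(strategy_sum, strategy)]
        # Expected payoff of each pure action against the opponent's mix
        action_values = [sum(PAYOFF[a][b] * opponent[b] for b in range(3)) for a in range(3)]
        expected = sum(p * v for p, v in zip(strategy, action_values))
        # Regret: how much better each action would have done than the current mix
        regrets = [r + v - expected for r, v in zip(regrets, action_values)]
    return [s / iterations for s in strategy_sum]

average = regret_matching([0.4, 0.3, 0.3])  # the opponent plays rock slightly too often
best = ACTIONS[average.index(max(average))]
```

Against this opponent the learner quickly settles on paper. In self-play both sides adapt in this way at once, and it is the average strategy over all rounds that approximates an equilibrium.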
