(These are excerpts from my book "Intelligence is not Artificial")
Bayes Reborn - A brief History of Artificial Intelligence/ Part 3
The Hopfield network proved Minsky and Papert wrong but has a problem: it tends to get trapped into what mathematicians call "local minima". Two improvements of Hopfield networks were proposed in a few months: the Boltzmann machine and backpropagation.
The Boltzmann machine was inspired by the physical process of annealing. At the same time that Hopfield introduced his recurrent neural networks, Scott Kirkpatrick at IBM introduced a stochastic method for mathematical optimization called "simulated annealing" (" Optimization by Simulated Annealing", 1983), which uses a degree of randomness to overcome local minima. This method was literally inspired by the physical process of cooling a liquid until it achieves a solid state.
In 1983 the cognitive psychologist Geoffrey Hinton, formerly a member of the PDP group at UC San Diego and now at Carnegie Mellon University, and the physicist Terry Sejnowski, a student of Hopfield
at Princeton University
but now at Johns Hopkins University, invented a neural network called the Boltzmann Machine that used a stochastic technique to avoid local minima, basically a Monte Carlo version of the Hopfield network ("Optimal Perceptual Inference", 1983):
They replaced Hopfield's deterministic neurons with probabilistic neurons, and then used simulated annealing to find low energy states with high probability.
Their Boltzmann Machine (with no layers) avoided local minima and converged towards a global minimum. The learning rule of a Boltzmann machine is simple, and, yet, that learning rule can discover interesting features about the training data. In reality, the Boltzmann machine is but a case of "undirected graphical models" that have long been used in statistical physics: the nodes can only have binary values (zero or one) and they are connected by symmetric connections. They are "probabilistic" because they behave according to a probability distribution, rather than a deterministic formula.
But there was still a major problem: the learning procedure of a Boltzmann machine is painfully slow. And it was still haunted by local minima in the case of many layers.
In 1982 David Parker at Stanford Linear Accelerator rediscovered backpropagation and called it “learning logic”, and later published an influential report at MIT ("Learning Logic", 1985). In 1985 the young Yann LeCun too, at the Electrotechnic and Electronic School of Paris, rediscovered backpropagation ("A Learning Scheme for Asymmetric Threshold Network", 1985). Werbos had worked out the algorithm as an extension of statistical regression but they realized that it could be applied to training multi-layer neural networks, and therefore overcome Minsky's criticism. Paul Werbos himself was now explicitly proposing backpropagation as a method for neural networks ("Applications of Advances in Nonlinear Sensitivity Analysis", 1982).
Helped by the same young Hinton (now at Carnegie Mellon) and by Ronald Williams, both former buddies of the PDP group, in 1986 the mathematical psychologist David Rumelhart optimized backpropagation for training multilayer (or "deep") neural networks using a "local gradient descent" algorithm that would rule for two decades, de facto a generalized delta rule ("Learning Representations by Back-propagating Errors", 1986; retitled "Learning Internal Representations by Error Propagation" as a book chapter).
Rumelhart popularized the paradigm that would rule in multi-layer networks, the paradigm already used by Rosenblatt, Widrow, Ivakhnenko, Werbos, Fukushima and Hopfield, i.e. that the functioning of a neural network consists of two phases. During the feedforward phase the network uses some algorithm (that depends on an activation function, typically a sigmoid) to compute the output based on the input. This involves applying the algorithm to every neuron in the network, moving forward layer by layer. Then the error is computed as the difference between the desired output and the actual output. During the feedbackward phase, another algorithm (typically a gradient-descent algorithm) adjusts the weights of the network according to a function of the error, going backwards layer by layer, aiming to reduce the error. The difference between Widrow's method and Werbos-Rumelhart's method is that Widrow's method is linear and that limits the learning process in the case of many layers, whereas Werbos-Rumelhart's method benefits from a nonlinear activation function. The generalized delta rule used an S-shaped sigmoid function. Unlike Widrow's function, the sigmoid function introduced non-linearity in the neural network. Other than that, Werbos-Rumelhart's method was similar to Widrow's method: weights are changed proportionally to the negative gradient of the error; but the nonlinear approach changes each weight a tiny step in the direction to reduce network error, and so the network learns in a more gradual and stable way. To be fair, the term "backpropagation" was already in Rosenblatt's book "Principles of Neurodynamics" (1962), and even the general method (but not an actual algorithm).
Error backpropagation is a very slow process and requires huge amounts of data; but backpropagation provided A.I. scientists with an efficient method to compute and adjust the "gradient" with respect to the stregths of the neural connections in a multilayer network. (Technically speaking, backpropagation is gradient descent of the mean-squared error as a function of the weights).
The world finally had a way (actually, two ways) to build multilayer neural networks to which Minsky's old critique did not apply.
Note that the idea for backpropagation came from both engineering (old cybernetic thinking about feedback) and from psychology.
The first success story of backpropagation was NETtalk, a three-layer neural network designed in 1986 by Sejnowski at and programmed by George Miller's student Charlie Rosenberg that successfully learned to pronounce English text.
At the same time, another physicist, Paul Smolensky of the University of Colorado, introduced a further optimization, the "harmonium", better known as Restricted Boltzmann Machine ("Information Processing in Dynamical Systems", 1986) because it restricts the kind of connections that are allowed between layers. The learning algorithm devised by Hinton and Sejnowski is very slow in multilayer Boltzmann machines but very fast in restricted Boltzmann machines.
Multi-layered neural networks had finally become a reality.
The architecture of Boltzmann machines makes it unnecessary to propagate errors, hence Boltzmann machines and all their variants do not rely on backpropagation.
These events marked a renaissance of neural networks.
In 1986 Hinton and Sejnowski organized the first "Connectionist Summer School" at Carnegie Mellon University (attended by McClelland, David Willshaw, who was a neurologist at the University of Edinburgh, Dave Touretzky, and future stars such as Dana Ballard, Yann LeCun, the future guru of convolutional networks, and Gerald Tesauro, the future creator of TD-Gammon), a few weeks later
Rumelhart and McClelland published
the two-volume "Parallel Distributed Processing" (1986) and the International Conference on Neural Networks was held in San Diego in 1987.
San Diego was an appropriate location since in 1982 Francis Crick, the British biologist who co-discovered the structure of DNA in 1953 and who now lived in southern California, had started the Helmholtz club with UC Irvine physicist Gordon Shaw (one of the earliest researchers on the neuroscience of music), Caltech neurophysiologist Vilayanur Ramachandran (later at UC San Diego), Caltech neurosurgeon Joseph Bogen (one of Roger Sperry's pupils in split-brain surgery), Caltech neurobiologists John Allman, Richard Andersen, and David Van Essen (who mapped out the visual system of the macaque monkey), Carver Mead, Terry Sejnowski and David Rumelhart.
(Sad note: Rumelhart's career ended a few years later due to a neurodegenerative disease).
There were other pioneering ideas, especially in unsupervised learning.
Unsupervised learning is closely related to the problem of source separation in electrical engineering, a problem that consists in discovering the original sources of an electrical signal. Jeanny Herault and Christian Jutten of the Grenoble Institute of Technology in France developed a method called "independent component analysis", a higher-order generalization of principal components analysis ("Space or Time Adaptive Signal Processing by Neural Network Models", 1986), later refined by Jean-Francois Cardoso at the CNRS in France ("Sources Separation Using Higher Order Moments", 1989) and by Pierre Comon at Thompson in France ("Independent Component Analysis", 1991).
In 1986 Ralph Linsker at IBM research labs in Yorktown Heights published three unsupervised models that can reproduce known properties of the neurons in the visual cortex ("From Basic Network Principles to Neural Architecture", 1986). Linsker later developed the infomax method for unsupervised learning that simplified independent component analysis ("Self-Organization in a perceptual network", 1988), which recycled one of Uttley's ideas ("Information Transmission in the Nervous System" (1979). The infomax principle is basically to maximize the mutual information that the output of a neural network processor contains about its input and viceversa.
David Zipser at UC San Diego came up with the "autoencoder", apparently from an unpublished idea by Hinton of the previous year ("Programming Networks to Compute Spatial Functions", 1986), although the term was first used in print by Suzanna Becker in 1990. An autoencoder is an unsupervised neural network that is trained by backpropagation to output the input, or a very close approximation of it. In other words, it is trying to learn the identity function. In other words, an autoencoder tries to predict the input from the input.
The drawback of backpropagation is that it requires "supervision": someone has to tell the network what is the desired network (has to "train" the network). An autoencoder basically converts the supervised learning into an unsupervised learning. When one forces the network to produce the input as output, the network learns how to map all training inputs to themselves, which results in learning a representation of the training sample. If the middle layer of the network contains fewer units than the input layer, the network has to produce a compact representation, and ends up learning interesting facts about the data. Autoencoders are powerful models for capturing characteristics of data.
The technique was employed by Zipser to compress images ("Learning Internal Representations from Gray-Scale Images", 1987) and to compress speech waves ("Discovering the Hidden Structure of Speech", 1987).
One can also view the action of an autoencoder as a way to store an input so that it can be subsequently be retrieved as accurately as possible, i.e. the result of an autoencoder is to create a representation of the input (to "encode" the input) that allows the network to later retrieve it in a very accurate form.
One powerful effect of doing this layer after layer would be understood only two decades later and originate the boom of "deep learning".
Dana Ballard at the University of Rochester predated deep belief networks and stacked autoencoders by 20 years when he used unsupervised learning to build representations layer by layer ("Modular Learning in Neural Networks", 1987).
Linsker’s infomax was an early application of Shannon’s information theory to unsupervised neural networks. A similar approach was tried by Mark Plumbley at Cambridge University networks ("An Information-Theoretic Approach to Unsupervised Connectionist Models", 1988).
An accelerated version of gradient descent was developed by the Russian mathematician Yurii Nesterov at the Central Economic Mathematical Institute of Moscow ("A Method of Solving a Convex Programming Problem", 1983), and it would become the most popular of the gradient-based optimization algorithms (known as "Nesterov momentum"). Later, Ning Qian at Columbia University showed similarities between Nesterov's theory and the theory of coupled and damped harmonic oscillators in physics ("On the Momentum Term in Gradient Descent Learning Algorithms", 1999).
Soon, new optimizations led to new gradient-descent methods, notably the "real-time recurrent learning" algorithm, developed simultaneously by Tony Robinson and Frank Fallside at Cambridge University ("The Utility Driven Dynamic Error Propagation Network", 1987) and Gary Kuhn at the Institute for Defense Analysis in Princeton ("A First Look at Phonetic Discrimination Using a Connectionist Network with Recurrent Links", 1987), but popularized by Ronald Williams and David Zipser at UC San Diego ("A Learning Algorithm for Continually Running Fully Recurrent Neural Networks", 1989). Paul Werbos,
now at the US Department of Energy
in Washington, expanded backpropagation into "backpropagation through time",
i.e. backpropagation applied to sequences such as a time series
("Generalization of Backpropagation with Application to a Recurrent Gas Market Model", 1988); and variations on backpropagation through time include: the "block update" method pioneered by Ronald Williams at Northwestern University ("Complexity of Exact Gradient Computation Algorithms For Recurrent Neural Networks", 1989), the "fast-forward propagation" method by Jacob Barhen, Nikzad Toomarian and Sandeep Gulati at CalTech ("Adjoint Operator Algorithms for Faster Learning in Dynamical Neural Networks", 1991), and the "green function" method by Guo-Zheng Sun, Hsing-Hen Chen and Yee-Chun Lee at the University of Maryland ("Green's Function Method for Fast On-Line Learning Algorithm of Recurrent Neural Networks", 1992). All these algorithms were elegantly unified by Amir Atiya at CalTech and Alexander Parlos at Texas A&M University ("New Results on Recurrent Network Training", 2000).
These were carefully calibrated mathematical algorithms to build neural networks to be both feasible (given the dramatic processing requirements of neural network computation) and plausible (that solved the problem correctly).
Nonetheless, philosophers were still debating whether the "connectionist" approach (neural networks) made sense. Two of the most influential philosophers, Jerry Fodor and Zenon Pylyshyn, wrote that the cognitive architecture cannot possibly be connectionist ("Connectionism and Cognitive Architecture", 1988) whereas the philosopher Andy Clark at the University of Sussex argued precisely the opposite in his book "Microcognition" (1989). Paul Smolensky at University of Colorado ("The Constituent Structure of Connectionist Mental States", 1988), Jordan Pollack ("Recursive Auto-associative Memory", 1988) and Jeffrey Elman ("Structured Representations and Connectionist Models", 1990) proved how neural networks could do precisely what Fodor thought they could never do, and another philosopher, David Chalmers at Indiana University, closed the discussion for good ("Why Fodor and Pylyshyn Were Wrong", 1990).
This school of thought merged with another one that was coming from a background of statistics and neuroscience. Credit goes to Judea Pearl of UC Los Angeles for introducing Bayesian thinking into Artificial Intelligence to deal with probabilistic knowledge (“Reverend Bayes on Inference Engines", 1982).
Ray Solomonoff's universal Bayesian methods for inductive inference were finally vindicated.
A kind of Bayesian network, the Hidden Markov Model, was already being used by A.I., particularly for speech recognition.
Neural networks and probabilities have something in common: neither is a form of perfect reasoning. Classical logic, based on deduction, aims to prove the truth. Neural networks and probabilities aim to approximate the truth.
Neural networks are "universal approximators", as proven first by George Cybenko in 1989 at the University of Illinois ("Approximation by Superpositions of a Sigmoidal Function", 1989) and by Kurt Hornik at the Technical University in Austria, in collaboration with the economists Maxwell Stinchcombe and Halbert White of UC San Diego ("Multilayer feedforward networks are universal approximators", 1989). Cybenko and Hornik proved that neural networks can approximate any continuous function of the kind that, de facto, occurs in ordinary problems. Basically, neural networks approximate complex mathematical functions with simpler ones, which is, after all, precisely what our brain does: it simplifies the incredible complexity of the environment that surrounds us although it can only do it by approximation. Complexity is expressed mathematically by nonlinear functions. Neural networks are approximators of non-linear functions. The fact that a nonlinear function can be more efficiently represented by multilayer architectures with fewer parameters became a motivation to study multilayer architectures.
Back to the Table of Contents
Purchase "Intelligence is not Artificial"