(These are excerpts from my book "Intelligence is not Artificial")
Bayes Reborn: A Brief History of Artificial Intelligence / Part 3
The Hopfield network proved Minsky and Papert wrong, but it has a problem: it tends to get trapped in what mathematicians call "local minima". Within a few months, two improvements of the Hopfield network were proposed: the Boltzmann machine and backpropagation.
The Boltzmann machine was inspired by the physical process of annealing. At the same time that Hopfield introduced his recurrent neural networks, Scott Kirkpatrick at IBM introduced a stochastic method for mathematical optimization called "simulated annealing" ("Optimization by Simulated Annealing", 1983), which uses a degree of randomness to overcome local minima. This method was literally inspired by the physical process of slowly cooling a liquid until it achieves a solid state.
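The idea can be sketched in a few lines of code. This is an illustrative toy, not Kirkpatrick's original program: a candidate solution is randomly perturbed, improvements are always accepted, and worsening moves are accepted with a probability that shrinks as an artificial "temperature" cools, which is what lets the search climb out of local minima.

```python
import math
import random

def simulated_annealing(f, x0, temp=10.0, cooling=0.999, steps=10000):
    """Minimize f by accepting uphill moves with probability exp(-delta/T)."""
    x, fx = x0, f(x0)
    best_x, best_fx = x, fx
    t = temp
    for _ in range(steps):
        # Propose a random neighbor of the current state.
        x_new = x + random.uniform(-1.0, 1.0)
        fx_new = f(x_new)
        delta = fx_new - fx
        # Always accept improvements; accept worsenings with Boltzmann probability.
        if delta < 0 or random.random() < math.exp(-delta / t):
            x, fx = x_new, fx_new
            if fx < best_fx:
                best_x, best_fx = x, fx
        t *= cooling  # gradually "cool" the system
    return best_x, best_fx

# A function with several local minima; its global minimum is at x = 0.
f = lambda x: x * x + 10 * math.sin(x) ** 2
random.seed(0)
x, fx = simulated_annealing(f, x0=8.0)
```

Started in a far-away basin (x = 8), a purely greedy descent would stop at the nearest local minimum, while the annealed search can keep escaping until the temperature is too low.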
In 1983 the cognitive psychologist Geoffrey Hinton, formerly a member of the PDP group at UC San Diego and now at Carnegie Mellon University, and the physicist Terry Sejnowski, a student of Hopfield but now at Johns Hopkins University, invented a neural network called the Boltzmann Machine that used a stochastic technique to avoid local minima, basically a Monte Carlo version of the Hopfield network ("Optimal Perceptual Inference", 1983): they used an "energy function" equivalent to Hopfield's energy function (the annealing process again), but they replaced Hopfield's deterministic neurons with probabilistic neurons. Their Boltzmann Machine (which had no layer structure) avoided local minima and converged towards a global minimum. The learning rule of a Boltzmann machine is simple, and yet it can discover interesting features in the training data. In reality, the Boltzmann machine is but one instance of the "undirected graphical models" that have long been used in statistical physics: the nodes can only take binary values (zero or one) and they are connected by symmetric connections. They are "probabilistic" because they behave according to a probability distribution rather than a deterministic formula.
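The difference from Hopfield's deterministic update can be shown in a minimal sketch (illustrative code, not Hinton and Sejnowski's original): each binary unit computes the "energy gap" between its on and off states, and instead of switching on deterministically, it switches on with a probability given by the logistic function of that gap divided by a temperature.

```python
import math
import random

def energy(state, W, b):
    """Hopfield/Boltzmann energy: E = -1/2 sum_ij w_ij s_i s_j - sum_i b_i s_i."""
    e = -sum(b[i] * state[i] for i in range(len(state)))
    for i in range(len(state)):
        for j in range(len(state)):
            if i != j:
                e -= 0.5 * W[i][j] * state[i] * state[j]
    return e

def boltzmann_update(state, W, b, T):
    """Stochastically update one random unit: on with probability sigmoid(gap/T)."""
    i = random.randrange(len(state))
    # Energy gap: how much the global energy drops if unit i is on rather than off.
    gap = b[i] + sum(W[i][j] * state[j] for j in range(len(state)) if j != i)
    z = gap / T
    p_on = 1.0 / (1.0 + math.exp(-z)) if z > -500 else 0.0  # avoid overflow
    state[i] = 1 if random.random() < p_on else 0
    return state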
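The difference from Hopfield's deterministic update can be shown in a minimal sketch (illustrative code, not Hinton and Sejnowski's original): each binary unit computes the "energy gap" between its on and off states, and instead of switching on deterministically, it switches on with a probability given by the logistic function of that gap divided by a temperature.

```python
import math
import random

def energy(state, W, b):
    """Hopfield/Boltzmann energy: E = -1/2 sum_ij w_ij s_i s_j - sum_i b_i s_i."""
    e = -sum(b[i] * state[i] for i in range(len(state)))
    for i in range(len(state)):
        for j in range(len(state)):
            if i != j:
                e -= 0.5 * W[i][j] * state[i] * state[j]
    return e

def boltzmann_update(state, W, b, T):
    """Stochastically update one random unit: on with probability sigmoid(gap/T)."""
    i = random.randrange(len(state))
    # Energy gap: how much the global energy drops if unit i is on rather than off.
    gap = b[i] + sum(W[i][j] * state[j] for j in range(len(state)) if j != i)
    z = gap / T
    p_on = 1.0 / (1.0 + math.exp(-z)) if z > -500 else 0.0  # avoid overflow
    state[i] = 1 if random.random() < p_on else 0
    return state
```

At high temperature the updates are nearly random (which lets the network escape local minima); as the temperature is lowered the updates become nearly deterministic, and the network settles into a low-energy state.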
In 1986 Sejnowski trained the neural net NETtalk to pronounce English text. But there was still a major problem: the learning procedure of a Boltzmann machine is painfully slow. And, in networks with many layers, it was still haunted by local minima.
Helped by the same young Hinton (now at Carnegie Mellon) and by Ronald Williams, both former members of the PDP group, in 1986 the mathematical psychologist David Rumelhart optimized backpropagation for training multilayer (or "deep") neural networks using a "local gradient descent" algorithm that would rule for two decades, de facto a generalized delta rule ("Learning Representations by Back-propagating Errors", 1986; retitled "Learning Internal Representations by Error Propagation" as a book chapter). Error backpropagation is a very slow process and requires huge amounts of data; but backpropagation provided A.I. scientists with an efficient method to compute and adjust the "gradient" with respect to the strengths of the neural connections in a multilayer network. (Technically speaking, backpropagation is gradient descent of the mean-squared error as a function of the weights.)
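A minimal sketch of the technique (an illustrative modern reconstruction, not Rumelhart's code): a two-layer network is trained on the XOR function, the very problem Minsky and Papert used against the perceptron. The forward pass computes the output; the backward pass propagates the derivative of the mean-squared error layer by layer, and each weight is nudged down its gradient.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR: not linearly separable

# One hidden layer of 4 units, one output unit.
W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)
lr = 1.0
losses = []
for _ in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((out - y) ** 2)))
    # Backward pass: propagate the error derivative through each layer.
    d_out = 2 * (out - y) / len(X) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates on all weights and biases.
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(0)
```

The loss shrinks as training proceeds: the hidden layer learns an internal representation that makes XOR separable, which a single-layer perceptron can never do.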
The world finally had a way (actually, two ways) to build multilayer neural networks to which Minsky's old critique did not apply.
Note that the idea for backpropagation came from both engineering (old cybernetic thinking about feedback) and from psychology.
At the same time, another physicist, Paul Smolensky of the University of Colorado, introduced a further optimization, the "harmonium", better known as the Restricted Boltzmann Machine ("Information Processing in Dynamical Systems", 1986), because it restricts which connections are allowed: units are connected only across layers, never within a layer. The learning algorithm devised by Hinton and Sejnowski is very slow in multilayer Boltzmann machines but very fast in restricted Boltzmann machines.
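Why the restriction helps can be seen in a short sketch (illustrative, not Smolensky's original code): because there are no connections within a layer, every hidden unit is conditionally independent of the others given the visible layer, so an entire layer can be sampled in one parallel step instead of one unit at a time.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# A Restricted Boltzmann Machine: a visible and a hidden layer of binary units,
# with connections only ACROSS the layers (the matrix W), none within a layer.
n_visible, n_hidden = 6, 3
W = rng.normal(0, 0.1, (n_visible, n_hidden))
b_v = np.zeros(n_visible)  # visible biases
b_h = np.zeros(n_hidden)   # hidden biases

def sample_hidden(v):
    """Sample all hidden units at once, given the visible layer."""
    p = sigmoid(v @ W + b_h)
    return (rng.random(n_hidden) < p).astype(float), p

def sample_visible(h):
    """Sample all visible units at once, given the hidden layer."""
    p = sigmoid(h @ W.T + b_v)
    return (rng.random(n_visible) < p).astype(float), p

# One step of block Gibbs sampling, alternating between the two layers.
v = rng.integers(0, 2, n_visible).astype(float)
h, _ = sample_hidden(v)
v2, _ = sample_visible(h)
```

This layer-at-a-time sampling is what makes learning in a restricted Boltzmann machine so much faster than in a general one.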
Multilayered neural networks had finally become a reality.
The architecture of Boltzmann machines makes it unnecessary to propagate errors, hence Boltzmann machines and all their variants do not rely on backpropagation.
In 1987 Stephen Grossberg at Boston University unveiled yet another unsupervised connectionist model, "adaptive resonance theory" (ART), that implements both short-term and long-term memories.
These events marked a renaissance of neural networks. Rumelhart was one of the authors of the two-volume "Parallel Distributed Processing" (1986) and the International Conference on Neural Networks was held in San Diego in 1987.
San Diego was an appropriate location since in 1982 Francis Crick, the British biologist who co-discovered the structure of DNA in 1953 and who now lived in southern California, had started the Helmholtz club with UC Irvine physicist Gordon Shaw (one of the earliest researchers on the neuroscience of music), Caltech neurophysiologist Vilayanur Ramachandran (later at UC San Diego), Caltech neurosurgeon Joseph Bogen (one of Roger Sperry's pupils in split-brain surgery), Caltech neurobiologists John Allman, Richard Andersen, and David Van Essen (who mapped out the visual system of the macaque monkey), Carver Mead, Terry Sejnowski and David Rumelhart.
(Sad note: Rumelhart's career ended a few years later due to a neurodegenerative disease).
Soon, new optimizations led to new gradient-descent methods, notably the "real-time recurrent learning" algorithm, developed simultaneously by Tony Robinson and Frank Fallside at Cambridge University ("The Utility Driven Dynamic Error Propagation Network", 1987) and Gary Kuhn at the Institute for Defense Analysis in Princeton ("A First Look at Phonetic Discrimination Using a Connectionist Network with Recurrent Links", 1987), but popularized by Ronald Williams and David Zipser at UC San Diego ("A Learning Algorithm for Continually Running Fully Recurrent Neural Networks", 1989). Paul Werbos, now at the National Science Foundation in Washington, expanded backpropagation into "backpropagation through time" ("Generalization of Backpropagation with Application to a Recurrent Gas Market Model", 1988); and variations on backpropagation through time include: the "block update" method pioneered by Ronald Williams at Northeastern University ("Complexity of Exact Gradient Computation Algorithms for Recurrent Neural Networks", 1989), the "fast-forward propagation" method by Jacob Barhen, Nikzad Toomarian and Sandeep Gulati at Caltech ("Adjoint Operator Algorithms for Faster Learning in Dynamical Neural Networks", 1991), and the "Green's function" method by Guo-Zheng Sun, Hsing-Hen Chen and Yee-Chun Lee at the University of Maryland ("Green's Function Method for Fast On-Line Learning Algorithm of Recurrent Neural Networks", 1992). All these algorithms were elegantly unified by Amir Atiya at Caltech and Alexander Parlos at Texas A&M University ("New Results on Recurrent Network Training", 2000).
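The core trick shared by these methods, backpropagation through time, can be sketched for the smallest possible case (an illustrative one-unit recurrent network, not any of the cited authors' code): the recurrence is unrolled into a chain of identical copies, the error is propagated backwards through every copy, and the gradients of the shared recurrent weight are accumulated.

```python
import math

def forward(w, xs):
    """Run a one-unit recurrent net h_t = tanh(w*h_{t-1} + x_t) from h_0 = 0."""
    h = 0.0
    hs = [h]
    for x in xs:
        h = math.tanh(w * h + x)
        hs.append(h)
    return hs

def bptt_grad(w, xs, target):
    """Gradient of the final-step squared error with respect to w, by BPTT."""
    hs = forward(w, xs)
    d_h = 2 * (hs[-1] - target)       # derivative of (h_T - target)^2
    grad = 0.0
    for t in range(len(xs) - 1, -1, -1):
        pre = w * hs[t] + xs[t]
        d_pre = d_h * (1 - math.tanh(pre) ** 2)  # back through tanh
        grad += d_pre * hs[t]                    # shared weight: accumulate
        d_h = d_pre * w                          # back through the recurrence
    return grad

xs = [0.5, -0.3, 0.8]
g = bptt_grad(0.7, xs, target=1.0)
```

The gradient computed this way matches a numerical finite-difference estimate, which is the standard sanity check for any backpropagation variant.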
These were carefully calibrated mathematical algorithms, designed to make neural networks both feasible (given the dramatic processing requirements of neural-network computation) and plausible (i.e., actually solving the problem correctly).
Nonetheless, philosophers were still debating whether the "connectionist" approach (neural networks) made sense. Two of the most influential philosophers, Jerry Fodor and Zenon Pylyshyn, wrote that the cognitive architecture cannot possibly be connectionist ("Connectionism and Cognitive Architecture", 1988), whereas the philosopher Andy Clark at the University of Sussex argued precisely the opposite in his book "Microcognition" (1989). Paul Smolensky at the University of Colorado ("The Constituent Structure of Connectionist Mental States", 1988), Jordan Pollack ("Recursive Auto-associative Memory", 1988) and Jeffrey Elman ("Structured Representations and Connectionist Models", 1990) showed how neural networks could do precisely what Fodor thought they could never do, and another philosopher, David Chalmers at Indiana University, closed the discussion for good ("Why Fodor and Pylyshyn Were Wrong", 1990).
This school of thought merged with another one that was coming from a background of statistics and neuroscience. Credit goes to Judea Pearl of UC Los Angeles for introducing Bayesian thinking into Artificial Intelligence to deal with probabilistic knowledge ("Reverend Bayes on Inference Engines", 1982).
Ray Solomonoff's universal Bayesian methods for inductive inference were finally vindicated.
A kind of Bayesian network, the Hidden Markov Model, was already being used in A.I., particularly for speech recognition.
Neural networks and probabilities have something in common: neither is a form of perfect reasoning. Classical logic, based on deduction, aims to prove the truth. Neural networks and probabilities aim to approximate the truth.
Neural networks are "universal approximators", as proven first by George Cybenko at the University of Illinois ("Approximation by Superpositions of a Sigmoidal Function", 1989) and then by Kurt Hornik at the Technical University of Vienna, in collaboration with the economists Maxwell Stinchcombe and Halbert White of UC San Diego ("Multilayer Feedforward Networks are Universal Approximators", 1989). Cybenko and Hornik proved that neural networks can approximate any continuous function of the kind that, de facto, occurs in ordinary problems. Basically, neural networks approximate complex mathematical functions with simpler ones, which is, after all, precisely what our brain does: it simplifies the incredible complexity of the environment that surrounds us, although it can only do so by approximation. Complexity is expressed mathematically by nonlinear functions. Neural networks are approximators of nonlinear functions. The fact that a nonlinear function can be represented more efficiently, with fewer parameters, by a multilayer architecture became a motivation to study multilayer architectures.
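A concrete illustration of the theorem (a simplified sketch, not Cybenko's or Hornik's construction): a single hidden layer of sigmoid units approximates the sine function on an interval. For simplicity the hidden weights here are chosen at random and only the output weights are fitted by least squares; a fully trained network would do at least as well.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous nonlinear function on a compact interval.
x = np.linspace(0, 2 * np.pi, 200).reshape(-1, 1)
y = np.sin(x).ravel()

# Hidden layer: 50 sigmoid units whose transition points are spread over the
# interval (random placement; only the output weights will be fitted).
n_hidden = 50
centers = rng.uniform(0, 2 * np.pi, n_hidden)
scales = rng.normal(0, 2, n_hidden)
W = scales.reshape(1, -1)
b = -scales * centers
H = 1.0 / (1.0 + np.exp(-(x @ W + b)))   # hidden activations, shape (200, 50)

# Output weights: least-squares fit of sum_j c_j * sigmoid(w_j*x + b_j) to y.
c, *_ = np.linalg.lstsq(H, y, rcond=None)
approx = H @ c
max_err = float(np.max(np.abs(approx - y)))
```

A weighted sum of simple S-shaped bumps reproduces the sine curve to within a small error, exactly the sense in which a one-hidden-layer network can approximate any continuous function.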