Intelligence is not Artificial

(These are excerpts from my book "Intelligence is not Artificial")

The Connectionists (Neural Networks)

Meanwhile, the other branch of Artificial Intelligence was pursuing a rather different approach: simulating what the brain does at the physical level of neurons and synapses. The symbolic school of John McCarthy and Marvin Minsky believed in using mathematical logic (i.e., symbols) to simulate how the human mind works; the school of "neural networks" (or "connectionism") believed in using mathematical calculus (i.e., numbers) to simulate how the brain works.

Since in the 1950s neuroscience was just in its infancy (medical machines to study living brains would not become available until the 1970s), computer scientists only knew that the brain consists of a huge number of interconnected neurons, and neuroscientists were becoming ever more convinced that "intelligence" was due to the connections, not to the individual neurons. A brain was viewed as a network of interconnected nodes, and our mental life as due to the way signals travel through those connections from the neurons of the sensory system up to the neurons that process those sensory data and eventually down to the neurons that generate action.

The neural connections can vary in strength from zero to infinity, and this strength is known as the "weight" of the connection. Change the weight of some neural connections and you change the outcome of the network's computation. In other words, the weights of the connections can be tweaked to cause different outputs for the same inputs. The problem for those designing "neural networks" consists in fine-tuning the connections so that the network as a whole comes up with the correct interpretation of the input; e.g. with the word "apple" when the image of an apple is presented. This is called "training the network". For example, showing many apples to the system and forcing the answer "APPLE" should result in the network adjusting those connections to recognize apples in general. This is called "supervised learning".

The normal operation of the neural network is quite simple. The signals coming from different neurons into a neuron are weighted according to the weight of each input connection, summed, and then fed to an "activation function" (also known as the "nonlinearity") that decides what output this neuron produces. The simplest activation function is a function that has a threshold, the "step" function: if the total input passes the threshold value, the neuron emits a one, otherwise a zero. This process goes on throughout the network. The network is usually organized in layers of neurons. The weights of the connections determine what the network computes. The weights change during "training" (i.e. in response to experience). Neural networks "learn" those weights during training. A simple approach to learning weights is to compare the output of the neural network to the correct answer and then modify the weights in the network so as to produce the correct answer. Each "correct answer" is a training example. The neural network needs to be trained with numerous such examples. Today computers are powerful enough that a training set can contain literally millions of examples.
If the network has been designed well, the weights will eventually converge to a stable configuration: at that point the network should provide the correct answer even for instances that were not in the training data (e.g., recognize an apple that it has never seen before). The designer of the neural network has to decide the structure of the neural network (e.g. the number of layers, the size of each layer, and which neurons connect to which other neurons), the initial values of the weights, the activation function (the "nonlinearity"), and the training strategy. Both the initialization and the training may require the use of random numbers, and there are many different ways to generate random numbers. The term "hyperparameters" refers to all the parameters that the network designer needs to pinpoint. For example, the hyperparameters for modern convolutional neural networks include: the number of layers in the neural network, the activation function, the loss function, the kernel size and the batch size. There probably exists a best set of hyperparameters that optimizes the training of a neural network, but it varies case by case, and there is no easy deterministic way to discover it, hence the hyperparameters are usually set by trial and error. It may take months to come up with a neural network that can be trained.
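To make the operation of a single neuron concrete, here is a minimal sketch in Python of the weighted sum followed by the "step" activation function. The function names, the sample inputs, the weights and the threshold value are all illustrative assumptions, not taken from any historical system.

```python
def step(total, threshold):
    """Step activation: emit 1 if the total input passes the threshold, else 0."""
    return 1 if total > threshold else 0


def neuron_output(inputs, weights, threshold):
    """Weight each incoming signal, sum the results, and apply the step function."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return step(total, threshold)


def layer_output(inputs, weight_rows, threshold):
    """A layer is a list of neurons that all read the same inputs (one weight row each)."""
    return [neuron_output(inputs, row, threshold) for row in weight_rows]


# Same inputs, different weights, different output (hypothetical numbers).
signals = [1, 0, 1]
print(neuron_output(signals, [0.5, -0.2, 0.4], threshold=0.8))  # prints 1
print(neuron_output(signals, [0.5, -0.2, 0.2], threshold=0.8))  # prints 0
```

The two calls differ only in the weight of the third connection, which is enough to flip the neuron's output for the same input: this is what "tweaking the weights" means in practice.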

Since the key is to adjust the strength of the connections, the alternative term for this branch of A.I. is "connectionism".

One of the most influential books in the early years of neuroscience was "Organization of Behavior" (1949), written by the psychologist Donald Hebb at McGill University in Montreal (Canada). Hebb described how the brain learns by changing the strength in the connections between its neurons. In 1951 two Princeton University students, Marvin Minsky and Dean Edmonds, simulated Hebbian learning in a network of 40 neurons realized with 3,000 vacuum tubes, and called this machine SNARC (Stochastic Neural Analog Reinforcement Computer). I wouldn't count it as the first neural network because SNARC was not implemented on a computer. In 1954 Wesley Clark and Belmont Farley at MIT simulated Hebbian learning on a computer, i.e. created the first artificial neural network (a two-layer network). In 1956 Hebb collaborated with IBM's research laboratory in Poughkeepsie to produce another computer model, programmed by Nathaniel Rochester's team (that included a young John Holland).

If there was something similar to the Macy conferences in Britain, it was the Ratio Club, organized in 1949 by the neurologist John Bates, a dining club of young scientists who met periodically at London's National Hospital to discuss cybernetics. McCulloch, who was traveling in Britain, became their very first invited speaker. Among its members was the neurologist William Grey Walter, who in 1948 built two tortoise-shaped robots (better known as Elmer and Elsie) that some consider the first autonomous mobile robots. Turing was a member and tested the "Turing test" at one of their meetings. John Young was a member: in 1964 he would propose the "selectionist" theory of the brain. And, finally, another member was the psychiatrist Ross Ashby, who in 1948 actually built a machine to simulate the brain, the homeostat ("the closest thing to a synthetic brain so far designed by man", as Time magazine reported). The title of his paper describing it became the title of his influential book "Design for a Brain" (1952). No surprise then that mathematical models of the brain proliferated in Britain, peaking just about in the year of the first conference on Artificial Intelligence: Jack Allanson at Birmingham University reported on "Some Properties of Randomly Connected Neural Nets" (1956), Wilfred Taylor built an associative memory at University College London ("Electrical Simulation of Some Nervous System Functional Activities", 1956), Raymond Beurle at Imperial College London studied "Properties of a Mass of Cells Capable of Regenerating Pulses" (1956), and Albert "Pete" Uttley at the Radar Research Establishment, a mathematician who had designed Britain's first parallel processor, wrote about "Conditional Probability Machines and Conditioned Reflexes" (1956).
It is debatable whether, as argued by Christof Teuscher in his book "Turing's Connectionism" (2001), Turing truly predated neural networks (as well as genetic algorithms) in an unpublished 1948 paper, now known as "Intelligent Machinery", that was about "unorganized machines", i.e. random Boolean networks.

A purely cybernetic approach was well represented by Gordon Pask who, in Cambridge, was building special-purpose electro-mechanical automata such as Eucrates (1955), which simulated the interaction between tutor and pupil (i.e. a machine that can teach another machine), and in 1956 patented a machine called SAKI (which stood for "Self-Adaptive Keyboard Instructor") to train people to type on a keypunch. The eclectic Pask was also an artist who produced pioneering works of interactive art such as "MusiColour" (1953), a sound-activated light-show, and "Colloquy of Mobiles" (1968), an installation that allowed the audience to interact with five machines communicating among themselves via sound and light. In 1968 he went on to build even an electrochemical ear.

Frank Rosenblatt's Perceptron (1957) at Cornell University and Oliver Selfridge's Pandemonium (1958) at MIT popularized the new view of artificial neural networks: not knowledge representation and logical inference, but pattern propagation and automatic learning. The Perceptron, first implemented in software in 1958 on the Weather Bureau's IBM 704 and then custom-built in hardware at Cornell Aeronautical Laboratory, was the first trainable neural network (called "single-layer" even though it had two layers of neurons). The activation function was the same binary function (the "step function") used by the McCulloch-Pitts neuron, but the Perceptron also had a learning rule (an algorithm for changing the weights, which was based on the errors that it made). Its application was to separate data in two groups. The network of McCulloch-Pitts was a new way to build finite automata, an extension of logic. Rosenblatt's network was a new way to build learning algorithms, an extension of statistics. Its simple learning rule worked wonders, but the limitations of perceptrons were obvious to everybody, and in the following years several studies found solutions. The British National Physical Laboratory (November 1958) organized a symposium titled "The Mechanisation of Thought Processes"; three conferences on "Self-Organization" were held in 1959, 1960 and 1962; and Rosenblatt published his report "Principles of Neurodynamics" (1962). However, nobody could figure out how to build a multilayer perceptron.
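The error-driven learning rule of a single-layer perceptron can be sketched in a few lines of Python. This is an illustration in the usual textbook form, not a reconstruction of Rosenblatt's machine: the function names, the bias term, the learning rate of 1 and the toy task (separating the inputs of the logical AND into two groups) are all assumptions chosen for clarity.

```python
def predict(weights, bias, x):
    """Step activation applied to the weighted sum of the inputs."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, lr=1):
    """Adjust the weights only when the network makes an error."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(epochs):
        mistakes = 0
        for x, target in samples:
            error = target - predict(weights, bias, x)  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        if mistakes == 0:  # every example is classified correctly
            break
    return weights, bias


# Hypothetical training set: two groups that a straight line can separate.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
print([predict(weights, bias, x) for x, _ in and_samples])  # prints [0, 0, 0, 1]
```

The rule nudges each weight in the direction that would have reduced the error just made, and leaves the weights alone when the answer is already correct.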

Pitts also worked with the MIT neurologist Jerome Lettvin (Pitts' best friend since 1938) and the Chilean biologist Humberto Maturana on their seminal study about the visual system of the frog ("What the Frog's Eye Tells the Frog's Brain", 1959). They discovered that the retina is more than a simple transmitter of impulses to the brain for the brain to analyze them: the retina includes neurons that already respond to specific features such as edges, lighting and movement. Some of these "feature detectors" of the frog's eye were nicknamed "bug detectors" because they specifically reacted to small, dark, moving objects. Something in that study shook Pitts' belief in Boolean logic as a model for the functioning of the brain (Pitts burned his unfinished doctoral dissertation and pretty much ended his career).

In 1960 Bernard Widrow and his student Ted Hoff at Stanford University built a single-layer network based on an extension of the McCulloch-Pitts neuron called Adaline (Adaptive Linear Neuron). It used a generalization of the Perceptron's learning rule, the "delta rule" or "least mean square" (LMS) algorithm (a way to minimize the difference between the desired and the actual signal), which was the first practical application of a "stochastic gradient descent" method to machine learning. The method of "stochastic gradient descent" had been introduced in 1951 for mathematical optimization by Herbert Robbins of the University of North Carolina ("A Stochastic Approximation Method", 1951).

Widrow understood that both the activation and the learning had to be more complex, and that the latter depended on the former. The threshold activation function used in both the McCulloch-Pitts network and the Perceptron did not lend itself to significant learning. Widrow used their way of summing the inputs to a neuron (each input multiplied by its weight) but then picked a linear (not threshold) activation function so that the output of a neuron was not just "on" or "off". The advantage was that this function, while still very simple, lent itself to a more sophisticated learning algorithm. Widrow used the same principle of adjusting the weights/strengths of the connections to reduce the error (i.e., reduce the difference between desired and actual output), but he could now use gradient-descent learning to adjust the weights. (Technically speaking, each weight is changed in proportion to the negative of the derivative of the error with respect to that weight. The threshold function provides no usable derivative, because it is discontinuous at the threshold and flat everywhere else, whereas the linear function is continuous and has a well-defined derivative everywhere.)
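The difference from the threshold neuron can be illustrated with a short Python sketch of the delta rule (LMS) applied to a linear neuron. This is a software illustration only (the Adaline itself was an analog machine), and the function names, the learning rate, the number of passes and the toy data are assumptions. The targets are generated by the hypothetical relation "2 times the first input minus the second input", so the weights should approach 2 and -1.

```python
def adaline_output(weights, x):
    """Linear activation: the output is simply the weighted sum of the inputs."""
    return sum(w * xi for w, xi in zip(weights, x))


def train_adaline(samples, lr=0.1, epochs=200):
    """Delta rule: after each example, move every weight against the error gradient."""
    weights = [0.0] * len(samples[0][0])
    for _ in range(epochs):
        for x, target in samples:
            error = target - adaline_output(weights, x)  # desired minus actual
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
    return weights


# Hypothetical data consistent with: target = 2 * x1 - 1 * x2
samples = [((1, 0), 2), ((0, 1), -1), ((1, 1), 1)]
learned = train_adaline(samples)
print([round(w, 3) for w in learned])
```

Because the output is a graded number rather than "on" or "off", the size of the error tells the rule how far to move each weight, not just in which direction.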

The Adaline was physically a small analog machine built by Hoff, which was even sold commercially by their startup. Trivia: Ted Hoff later joined a tiny Silicon Valley startup called Intel and helped design the world's first microprocessor.

The "gradient descent" method, discovered in 1847 by the French mathematician Augustin Cauchy, was first applied to control theory in 1960 by Henry Kelley at Grumman Aircraft Engineering Corporation in New York ("Gradient Theory of Optimal Flight Paths", 1960) and by Arthur Bryson at Harvard University ("A Gradient Method for Optimizing Multi-stage Allocation Processes", 1961).

In 1955 the Hungarian physicist Dennis Gabor, the inventor of holography, had already devised how to employ the gradient-descent method for training the analog computer that his students were building at Imperial College London, but that research remained unpublished.

The mathematical idea behind gradient-descent methods is simple: first one measures the global error in the performance of the network (desired output minus actual output), then one computes the derivative of such error with respect to the weight/strength of each connection, and finally one adjusts each weight/strength in the direction that decreases the error. Bottom line: neural networks can learn their weights (the weights of the connections between neurons) using the gradient-descent algorithm. That was a primitive form of "backpropagation".
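The three steps can be written out for the simplest case, a single linear neuron. The squared-error measure and the symbols below are assumptions made for illustration: d is the desired output, y the actual output, x_i the inputs, w_i the weights and \eta a small positive learning rate.

```latex
\begin{align}
y &= \sum_i w_i x_i, \qquad E = \tfrac{1}{2}\,(d - y)^2
  && \text{(measure the error)} \\
\frac{\partial E}{\partial w_i} &= -(d - y)\, x_i
  && \text{(derivative with respect to each weight)} \\
\Delta w_i &= -\eta\, \frac{\partial E}{\partial w_i} = \eta\,(d - y)\, x_i
  && \text{(adjust each weight to decrease the error)}
\end{align}
```

Under these assumptions each weight moves in proportion to the error and to the input that flowed through that connection.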

At the IRE (Institute of Radio Engineers) convention of 1960 Marvin Minsky presented a lengthy paper titled "Steps Toward Artificial Intelligence" that was skeptical about reinforcement learning (quote: "I am not convinced that these statistical training schemes should play a central role in our models"). He would quickly become neural networks' nemesis.

Another important discovery that went unnoticed at the time was the first learning algorithm for multilayer networks, published in 1965 by the Ukrainian mathematician Alexey Ivakhnenko in his book "Cybernetic Predicting Devices".

There were several projects in Europe, notably the Lernmatrix, created in 1961 by Karl Steinbuch (the engineer who, working at Standard Elektrik Lorenz, had designed Germany's first fully-transistorized commercial computer, the ER 56) and the (very similar) associative memory built by the chemist Christopher Longuet-Higgins' group at the University of Edinburgh ("Non-holographic Associative Memory", 1969).

In retrospect, the program of neural networks (just like the program of cybernetics before it) was the program of how to bridge the gap between physiologists and engineers. One can see how a mathematical problem turned into a chemical problem and, eventually, into a neurological problem. Alan Turing's "The Chemical Basis of Morphogenesis" (1952), his only paper about chemistry, proposed a mechanism for the emergence of patterns in biological systems (like the stripes of the zebra), and that paper introduced nonlinear dynamics to study self-organizing processes. Raymond Beurle applied this approach to pattern formation in the brain and obtained the first neural field equations which described a wave of information propagating through the neural network ("Properties of a Mass of Cells Capable of Regenerating Pulses", 1956). The equations were improved by John Griffith at Cambridge University ("A Field Theory of Neural Nets", 1963) and then by Jack Cowan and Hugh Wilson at MIT, who obtained a model similar to the Hopfield networks of a decade later ("A Mathematical Theory of the Functional Dynamics of Cortical and Thalamic Nervous Tissue", 1973). Meanwhile, expanding Turing's theory, the Belgian chemist Ilya Prigogine established nonlinear non-equilibrium thermodynamics and, in a talk titled "Structure, Dissipation and Life" (delivered in June 1967 at the International Conferences On Theoretical Physics and Biology), described biological systems as "dissipative systems" which self-organize far from equilibrium. So, indirectly, Turing's legacy for connectionism was that he originated the thinking that led to the use of nonlinear differential equations for neural networks.

Compared with expert systems, neural networks are dynamic systems (their configuration changes as they are used) and predisposed to learning by themselves (they can adjust their configuration). "Unsupervised" networks, in particular, can discover categories by themselves; e.g., they can discover that several images refer to the same kind of object, a cat.

There are two ways to solve a crime. One way is to hire the smartest detective in the world, who will use experience and logic to find out who did it. On the other hand, if we had enough surveillance cameras placed around the area, we could scan their tapes and look for suspicious actions. Both ways may lead to the same conclusion, but one uses a logic-driven approach (symbolic processing) and the other one uses a data-driven approach (ultimately, the visual system, which is a connectionist system).

Expert systems were the descendants of the "logical" school that looked for the exact solution to a problem. Neural nets were initially viewed as equivalent logical systems, but actually represented the other kind of thinking, probabilistic thinking, in which we content ourselves with plausible solutions, not necessarily exact ones. That is the case of speech and vision, and of pattern recognition in general.

In 1965 Marvin Minsky and Seymour Papert of MIT began a vicious campaign against research in neural networks by circulating a technical manuscript that was eventually published as a book titled "Perceptrons" (1969), even though it was mostly about Adaline. It contained a devastating critique of neural networks that virtually killed the discipline.

This came a decade after a review by Noam Chomsky of a book by Burrhus Skinner had turned the tide in psychology, ending the domination of behaviorism and resurrecting cognitivism, and Noam Chomsky's campaign against behaviorism culminated in an article in the New York Review of Books of December 1971. Most A.I. scientists favored the "cognitive" approach simply for computational reasons, but those computer scientists felt somewhat reassured by the events in psychology that their choice was indeed wise.

Minsky's and Papert's proof came, by sheer coincidence, at the right time to avoid criticism: both Pitts and McCulloch died in 1969 (May and September), and Rosenblatt died in a boating accident in 1971.

Dave Block (a physicist who had worked with Rosenblatt at Cornell University) told Minsky that the limitations of the (single-layer) Perceptron could be easily overcome with multilayer neural nets, and, to be fair, Minsky and Papert accepted Block's criticism; but, unfortunately, Rosenblatt's learning algorithm did not work for multilayer nets.

In 1969 Stanford held the first International Joint Conference on Artificial Intelligence (IJCAI). Nils Nilsson from SRI presented Shakey. Carl Hewitt from MIT's Project MAC presented Planner, a language for planning action and manipulating models in robots. Cordell Green from SRI and Richard Waldinger from Carnegie-Mellon University presented systems for the automatic synthesis of programs (automatic program writing). Roger Schank from Stanford and Daniel Bobrow from Bolt Beranek and Newman (BBN) presented studies on how to analyze the structure of sentences. Connectionists were under-represented. The Artificial Intelligence magazine, founded in 1970, did not publish any paper on neural networks until 1989, when it published a survey by Geoffrey Hinton.

Nonetheless, the gradient method was perfected as a method to optimize multi-stage dynamic systems by Bryson and his Chinese-born student Yu-Chi Ho in the book "Applied Optimal Control" (1969). At that point the mathematical theory necessary for backpropagation in multi-layer neural networks was basically ready. In 1970 the Finnish mathematician Seppo Linnainmaa invented the "reverse mode of automatic differentiation", which has backpropagation as a special case. In 1974 Paul Werbos' dissertation at Harvard University applied Bryson's backpropagation algorithm to the realm of neural networks ("Beyond Regression", 1974). Werbos had realized that the "backpropagation" algorithm was a more efficient way to train a neural network than any of the existing methods. His discovery languished for several years because his background wasn't quite orthodox: his thesis advisor was the social scientist and cybernetics pioneer Karl Deutsch, and his algorithm of backpropagation was meant as a mathematical expression of the concept of "cathexis" that Sigmund Freud had introduced in his book "The Project for a Scientific Psychology" (1895). And the whole point of Werbos' research was to provide a better alternative to statistical analysis for long-range forecasting, in particular the forecasting of international affairs at the US Department of Defense.

Practitioners of neural networks also took detours into cognitive science. For example: Kaoru Nakano at the University of Tokyo ("Learning Process in a Model of Associative Memory", 1971); James Anderson at Rockefeller University ("A Simple Neural Network Generating an Interactive Memory", 1972), in the laboratory of psychologist William Estes who had founded a whole new field with his manifesto "Toward a Statistical Theory of Learning" (1950); and Teuvo Kohonen in Finland ("Correlation Matrix Memories", 1972) used neural networks to model associative memories based on Donald Hebb's law. The neuroscientist Christoph von der Malsburg at the Max Planck Institute in Germany built a model for the visual cortex of higher vertebrates ("Self-organization of Orientation Sensitive Cells in Striate Cortex", 1973). The holy grail of neural networks was unsupervised learning: have the machine learn concepts from the data without human intervention. Several variations on Karl Pearson's decades-old method of "principal components analysis" were proposed, and significant contributions came from the science of signal processing. For example, Pete Uttley designed the Informon to separate frequently occurring patterns ("A Network for Adaptive Pattern Recognition", 1970). In 1975 the first multi-layered network appeared, designed by Kunihiko Fukushima in Japan, the Cognitron ("Cognitron - A Self-organizing Multilayered Neural Network", 1975). Stephen Grossberg at Boston University unveiled another unsupervised model, "adaptive resonance theory" ("Adaptive Pattern Classification and Universal Recoding", 1976), and some of his ideas anticipated Hopfield's continuous networks ("Contour Enhancement, Short Term Memory, and Constancies in Reverberating Neural Networks", 1973).
And Shunichi Amari at the University of Tokyo delivered the classic neural field equations that completed the work begun by Raymond Beurle in the 1950s ("Mathematical Theory on Formation of Category Detecting Nerve Cells," 1978; later expanded in "Field Theory of Self-organizing Neural Nets", 1983). The Italian-born Tomaso Poggio and the British-born David Marr (who was now at MIT) developed a nonlinear system for a specific case, that of binocular vision ("Cooperative Computation of Stereo Disparity", 1976). Therefore, by the mid-1970s significant progress had occurred (if not widely publicized) in neural networks.

Later, Walter Freeman at UC Berkeley (the son of the neurologist who in 1936 had performed the first "lobotomy" for psychiatric treatment in the USA), working in collaboration with the philosopher Christine Skarda, applied chaos theory to the study of brain processes ("How Brains Make Chaos in Order to Make Sense of the World", 1987). For the record, Skarda soon became a Buddhist philosopher convinced that our view of the brain is fundamentally wrong and has since spent more than 20 years in meditation retreat.

A much more stinging criticism of the logical school of A.I. could have come from neuroscience, a discipline that was beginning to use computer simulations. In 1947 Kacy Cole at the Marine Biological Lab near Boston pioneered the "voltage clamp" technique to measure the electrical current flowing through the membranes of neurons. Using that technique, in 1952 the British physiologists Alan Hodgkin and Andrew Huxley at Cambridge University built the first mathematical model of a spiking neuron, which also counts as the first simulation of computational neuroscience (for the record, they simulated the axon of the squid's brain). The Hodgkin-Huxley model is a set of nonlinear differential equations that approximates the electrical characteristics of neurons. The next major breakthroughs in the simulation of brain computation came respectively in 1962, when Wilfrid Rall at the National Institutes of Health simulated a dendritic arbor, and in 1966, when Fred Dodge and James Cooley at IBM simulated a propagating impulse in an axon. Meanwhile, Donald Perkel at the RAND Corporation in Los Angeles had written computer programs to simulate the working of the neuron using one of the earliest computers, the Johnniac ("Continuous-time Simulation of Ganglion Nerve Cells in Aplysia", 1963). These simulations (by people who actually knew what a neuron looks like) bore little resemblance to the naive digital neurons of the artificial neural networks.

Crucially, neuroscientists kept emphasizing the role of synapses: intelligence is not about the neuron but about the connections (the synapses) that create the network of neurons.

Jean-Pierre Changeux in "Neuronal Man" (1985): "The impact of the discovery of the synapse and its functions is comparable to that of the atom or DNA".

Joseph LeDoux in "Synaptic Self" (2002): "You are your synapses - they are who you are".