Intelligence is not Artificial

(These are excerpts from my book "Intelligence is not Artificial")

### The Connectionists (Neural Networks)

Meanwhile, the other branch of Artificial Intelligence was pursuing a rather different approach: simulating what the brain does at the physical level of neurons and synapses. The symbolic school of John McCarthy and Marvin Minsky believed in using mathematical logic (i.e., symbols) to simulate how the human mind works; the school of "neural networks" (or "connectionism") believed in using mathematical calculus (i.e., numbers) to simulate how the brain works.

Since in the 1950s neuroscience was just in its infancy (medical machines to study living brains would not become available until the 1970s), computer scientists only knew that the brain consists of a huge number of interconnected neurons, and neuroscientists were becoming ever more convinced that "intelligence" was due to the connections, not to the individual neurons. A brain was viewed as a network of interconnected nodes, and our mental life as due to the way signals travel through those connections from the neurons of the sensory system up to the neurons that process those sensory data and eventually down to the neurons that generate action.

The neural connections can vary in strength from zero to infinity, and this strength is known as the "weight" of the connection. Change the weight of some neural connections and you change the outcome of the network's computation. In other words, the weights of the connections can be tweaked to cause different outputs for the same inputs. The problem for those designing "neural networks" consists in fine-tuning the connections so that the network as a whole comes up with the correct interpretation of the input; e.g. with the word "apple" when the image of an apple is presented. This is called "training the network". For example, showing many apples to the system and forcing the answer "APPLE" should result in the network adjusting those connections to recognize apples in general. This is called "supervised learning".

The normal operation of the neural network is quite simple. The signals coming from different neurons into a neuron are weighted according to the weight of each input connection, summed, and then fed to an "activation function" (also known as the "nonlinearity") that decides what output this neuron produces. The simplest activation function is a function that has a threshold, the "step" function: if the total input passes the threshold value, the neuron emits a one, otherwise a zero. This process goes on throughout the network. The network is usually organized in layers of neurons.

The weights of the connections determine what the network computes, and the network "learns" those weights during "training" (i.e. in response to experience). A simple approach to learning weights is to compare the output of the neural network to the correct answer and then modify the weights in the network so as to produce the correct answer. Each "correct answer" is a training example. The neural network needs to be trained with numerous such examples. Today computers are powerful enough that the training set can contain literally millions of examples. If the network has been designed well, the weights will eventually converge to a stable configuration: at that point the network should provide the correct answer even for instances that were not in the training data (e.g., recognize an apple that it has never seen before).

The designer of the neural network has to decide the structure of the neural network (e.g. the number of layers, the size of each layer, and which neurons connect to which other neurons), the initial values of the weights, the activation function (the "nonlinearity"), and the training strategy. Both the initialization and the training may require the use of random numbers, and there are many different ways to generate random numbers. The term "hyperparameters" refers to all the parameters that the network designer needs to choose. It may take months to come up with a neural network that can be trained.
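The operation of a single neuron described above can be sketched in a few lines of Python. This is only an illustration: the function names, the sample signals, the weights and the threshold of 0.5 are all made up for the example, and treating "passes the threshold" as "strictly greater than" is an assumption.

```python
def step(total_input, threshold=0.0):
    """Step activation: emit 1 if the total input passes the threshold, else 0."""
    return 1 if total_input > threshold else 0


def neuron_output(inputs, weights, threshold=0.0):
    """Weight each incoming signal, sum the results, apply the step activation."""
    total_input = sum(x * w for x, w in zip(inputs, weights))
    return step(total_input, threshold)


def layer_output(inputs, weight_rows, threshold=0.0):
    """A layer is a list of neurons that all read the same input signals."""
    return [neuron_output(inputs, row, threshold) for row in weight_rows]


# Illustrative values only: the same inputs with two different sets of weights.
signals = [1.0, 0.0, 1.0]
print(neuron_output(signals, [0.4, 0.9, 0.3], threshold=0.5))  # total 0.7 -> 1
print(neuron_output(signals, [0.1, 0.9, 0.2], threshold=0.5))  # total 0.3 -> 0
```

The two calls receive identical signals and differ only in their weights, which is the sense in which changing the weights changes what the network computes.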

Since the key is to adjust the strength of the connections, the alternative term for this branch of A.I. is "connectionism".

One of the most influential books in the early years of neuroscience was "Organization of Behavior" (1949), written by the psychologist Donald Hebb at McGill University in Montreal (Canada). Hebb described how the brain learns by changing the strength of the connections between its neurons. In 1951 two Princeton University students, Marvin Minsky and Dean Edmonds, simulated Hebbian learning in a network of 40 neurons realized with 3,000 vacuum tubes, and called this machine SNARC (Stochastic Neural Analog Reinforcement Calculator). I wouldn't count it as the first neural network because SNARC was not implemented on a computer. In 1954 Wesley Clark and Belmont Farley at MIT simulated Hebbian learning on a computer, i.e. created the first artificial neural network (a two-layer network). In 1956 Hebb collaborated with IBM's research laboratory in Poughkeepsie to produce another computer model, programmed by Nathaniel Rochester's team (that included a young John Holland).

If there was something similar to the Macy conferences in Britain, it was the Ratio Club, organized in 1949 by the neurologist John Bates, a dining club of young scientists who met periodically at London's National Hospital to discuss cybernetics. McCulloch, who was traveling in Britain, became their very first invited speaker. Among its members was the neurologist William Grey Walter, who in 1948 built two tortoise-shaped robots (better known as Elmer and Elsie) that some consider the first autonomous mobile robots. Turing was a member and tested the "Turing test" at one of their meetings. John Young was a member: in 1964 he would propose the "selectionist" theory of the brain. And, finally, another member was the psychiatrist Ross Ashby, who in 1948 actually built a machine to simulate the brain, the homeostat ("the closest thing to a synthetic brain so far designed by man", as Time magazine reported). The title of his paper describing it became the title of his influential book "Design for a Brain" (1952). No surprise then that mathematical models of the brain proliferated in Britain, peaking just about in the year of the first conference on Artificial Intelligence: Jack Allanson at Birmingham University reported on "Some Properties of Randomly Connected Neural Nets" (1956), Raymond Beurle at Imperial College London studied "Properties of a Mass of Cells Capable of Regenerating Pulses" (1956), and Albert "Pete" Uttley at the Radar Research Establishment, a mathematician who had designed Britain's first parallel processor, wrote about "Conditional Probability Machines and Conditioned Reflexes" (1956). It is debatable whether, as argued by Christof Teuscher in his book "Turing's Connectionism" (2001), Turing truly anticipated neural networks (as well as genetic algorithms) in an unpublished 1948 paper, now known as "Intelligent Machinery", that was about "unorganized machines", i.e. random Boolean networks.

A purely cybernetic approach was well represented by Gordon Pask who, in Cambridge, was building special-purpose electro-mechanical automata such as Eucrates (1955), which simulated the interaction between tutor and pupil (i.e. a machine that can teach another machine), and in 1956 patented a machine called SAKI (which stood for "Self-Adaptive Keyboard Instructor") to train people to type on a keypunch. The eclectic Pask was also an artist who produced pioneering works of interactive art such as "MusiColour" (1953), a sound-activated light-show, and "Colloquy of Mobiles" (1968), an installation that allowed the audience to interact with five machines communicating among themselves via sound and light. In 1968 he even went on to build an electrochemical ear.

The idea that computers were "giant brains" wasn't just a myth invented by the media. Some psychologists enthusiastically signed on to this metaphor. George Miller was a psychologist at Harvard University who in 1950 visited the Institute for Advanced Study in Princeton, one of the pioneering centers in computer science. The following year he was hired by MIT to lead the psychology group at the newly formed Lincoln Laboratories (a hotbed of military technology for the Cold War) and published an influential book titled "Language and Communication" in which he launched the program of studying the human mind using the information theory just developed by Claude Shannon at Bell Labs in his article "A Mathematical Theory of Communication" (1948).

Frank Rosenblatt's Perceptron (1957) at Cornell University and Oliver Selfridge's Pandemonium (1958) at MIT defined the standard for artificial neural networks: not knowledge representation and logical inference, but pattern propagation and automatic learning. The Perceptron, first implemented in software in 1958 on the Weather Bureau's IBM 704 and then custom-built in hardware at Cornell Aeronautical Laboratory, was the first trainable neural network (called "single-layer" even though it had two layers of neurons). The activation function was the same binary function (the "step function") used by the McCulloch-Pitts neuron, but the Perceptron also had a learning rule (an algorithm for changing the weights). Its application was to separate data into two groups. The limitations of perceptrons were obvious to everybody and in the following years several studies found solutions. In November 1958 the British National Physical Laboratory organized a symposium titled "The Mechanisation of Thought Processes"; three conferences on "Self-Organization" were held in 1959, 1960 and 1962; and Rosenblatt published his report "Principles of Neurodynamics" (1962). However, nobody could figure out how to build a multilayer perceptron.
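A learning rule of this kind can be sketched as follows. This is the usual textbook form of the perceptron rule, not a reconstruction of Rosenblatt's own software or hardware. The bias term (a learned threshold), the learning rate of 1, the epoch limit and the logical-AND training data are all assumptions chosen to keep the example small.

```python
def predict(weights, bias, inputs):
    """Step activation over the weighted sum of the inputs."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0


def train_perceptron(samples, learning_rate=1, max_epochs=20):
    """Change the weights only when the network answers wrongly."""
    weights = [0] * len(samples[0][0])
    bias = 0
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)  # -1, 0 or +1
            if error != 0:
                mistakes += 1
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, inputs)]
                bias += learning_rate * error
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights, bias


# Hypothetical two-group data: logical AND, which a single line can separate.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
print([predict(weights, bias, x) for x, _ in and_samples])  # [0, 0, 0, 1]
```

The rule only works when a single line (or plane) can separate the two groups, which is the case for this toy data.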

In 1960 Bernard Widrow and his student Ted Hoff at Stanford University built a single-layer network based on an extension of the McCulloch-Pitts neuron called Adaline (Adaptive Linear Neuron) and using a generalization of the Perceptron's learning rule, the "delta rule" or "least mean square" (LMS) algorithm (a way to minimize the difference between the desired and the actual signal), the first practical application of a "stochastic gradient descent" method to machine learning. The method of "stochastic gradient descent" had been introduced in 1951 for mathematical optimization by Herbert Robbins of the University of North Carolina ("A Stochastic Approximation Method", 1951). Trivia: Ted Hoff later joined a tiny Silicon Valley startup called Intel and helped design the world's first microprocessor.
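A minimal sketch of the delta rule, assuming a single linear neuron with a bias weight, may help. The function name, learning rate, epoch count and noiseless toy data are all hypothetical. The samples are visited in a fixed order here, whereas a stochastic method would normally draw them at random.

```python
def lms_train(samples, learning_rate=0.1, epochs=500):
    """Delta rule: after each sample, move the weights to shrink the error."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)  # the last weight acts as a bias term
    for _ in range(epochs):
        for inputs, desired in samples:
            x = list(inputs) + [1.0]
            # Linear output: unlike the Perceptron, no step function is applied.
            actual = sum(w * xi for w, xi in zip(weights, x))
            error = desired - actual
            weights = [w + learning_rate * error * xi
                       for w, xi in zip(weights, x)]
    return weights


# Hypothetical noiseless data generated by desired = 2*x1 - 1*x2 + 0.5
samples = [((0, 0), 0.5), ((0, 1), -0.5), ((1, 0), 2.5), ((1, 1), 1.5)]
weights = lms_train(samples)
print([round(w, 3) for w in weights])  # should approach 2, -1 and 0.5
```

Because the error is a real number rather than a right/wrong signal, the weights keep improving even when every example would already fall on the correct side of a threshold.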

The "gradient descent" method, discovered in 1847 by the French mathematician Augustin Cauchy, was first applied to control theory in 1960 by Henry Kelley at Grumman Aircraft Engineering Corporation in New York ("Gradient Theory of Optimal Flight Paths", 1960) and by Arthur Bryson at Harvard University ("A Gradient Method for Optimizing Multi-stage Allocation Processes", 1961). That was, in essence, "backpropagation".

The mathematical idea behind gradient-descent methods is simple: first one measures the global error in the performance of the network (desired output minus actual output), then one computes the derivative of that error with respect to the weight (strength) of each connection, and finally one adjusts each weight in the direction that decreases the error.
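The three steps can be sketched on the smallest possible "network", one linear neuron with a weight and a bias. Several things here are assumptions made for the illustration: the error is taken to be the mean squared difference between desired and actual output, the derivative is estimated numerically instead of derived by hand, and the learning rate, step count and data are made up.

```python
def global_error(weights, samples):
    """Step 1: mean squared difference between desired and actual outputs."""
    w, b = weights
    return sum((desired - (w * x + b)) ** 2 for x, desired in samples) / len(samples)


def numerical_gradient(weights, samples, eps=1e-6):
    """Step 2: estimate the derivative of the error for each weight."""
    grads = []
    for i in range(len(weights)):
        up = list(weights)
        down = list(weights)
        up[i] += eps
        down[i] -= eps
        slope = (global_error(up, samples) - global_error(down, samples)) / (2 * eps)
        grads.append(slope)
    return grads


def gradient_descent(samples, learning_rate=0.05, steps=2000):
    """Step 3: repeatedly move each weight against its derivative."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        grads = numerical_gradient(weights, samples)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights


# Hypothetical data generated by desired = 3*x + 1
samples = [(0.0, 1.0), (1.0, 4.0), (2.0, 7.0), (3.0, 10.0)]
weights = gradient_descent(samples)
print([round(w, 3) for w in weights])  # should approach 3 and 1
```

Estimating every derivative by perturbing one weight at a time becomes very expensive as the number of weights grows, so practical methods compute the derivatives analytically.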

Another important discovery that went unnoticed at the time was the first learning algorithm for multilayer networks, published in 1965 by the Ukrainian mathematician Alexey Ivakhnenko in his book "Cybernetic Predicting Devices".

Compared with expert systems, neural networks are dynamic systems (their configuration changes as they are used) and predisposed to learning by themselves (they can adjust their configuration). "Unsupervised" networks, in particular, can discover categories by themselves; e.g., they can discover that several images refer to the same kind of object, a cat.

There are two ways to solve a crime. One way is to hire the smartest detective in the world, who will use experience and logic to find out who did it. On the other hand, if we had enough surveillance cameras placed around the area, we would scan their tapes and look for suspicious actions. Both ways may lead to the same conclusion, but one uses a logic-driven approach (symbolic processing) and the other one uses a data-driven approach (ultimately, the visual system, which is a connectionist system).

Expert systems were the descendants of the "logical" school that looked for the exact solution to a problem. Neural nets were initially viewed as equivalent logical systems, but actually represented the other kind of thinking, probabilistic thinking, in which we content ourselves with plausible solutions, not necessarily exact ones. That is the case of speech and vision, and of pattern recognition in general.

In 1969 Stanford held the first International Joint Conference on Artificial Intelligence (IJCAI). Nils Nilsson from SRI presented Shakey. Carl Hewitt from MIT's Project MAC presented Planner, a language for planning action and manipulating models in robots. Cordell Green from SRI and Richard Waldinger from Carnegie-Mellon University presented systems for the automatic synthesis of programs (automatic program writing). Roger Schank from Stanford and Daniel Bobrow from Bolt Beranek and Newman (BBN) presented studies on how to analyze the structure of sentences.

In 1969 Marvin Minsky and Seymour Papert of MIT published a devastating critique of neural networks (titled "Perceptrons") that virtually killed the discipline. This came a decade after a review by Noam Chomsky of a book by Burrhus Skinner had turned the tide in psychology, ending the domination of behaviorism and resurrecting cognitivism, and Noam Chomsky's campaign against behaviorism culminated in an article in the New York Review of Books of December 1971. Most A.I. scientists favored the "cognitive" approach simply for computational reasons, but those computer scientists felt somewhat reassured by the events in psychology that their choice was indeed wise.

Minsky's and Papert's proof came, by sheer coincidence, at the right time to avoid criticism: both Pitts and McCulloch died in 1969 (in May and September respectively), and Rosenblatt died in a boating accident in 1971.

To be fair, Minsky and Papert simply argued that the limitations of the Perceptron could be overcome only with multilayer neural nets and, unfortunately, Rosenblatt's learning algorithm did not work for multilayer nets.
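The exclusive-or (XOR) function is the standard textbook illustration of this limitation. The text above does not name it, so its use here is an assumption. The sketch below searches a small grid of weights and thresholds for a single step neuron that computes XOR (a demonstration, not a proof), then shows a two-layer network that does compute it. The weights of that two-layer network are set by hand, not learned, which is exactly the gap that Rosenblatt's rule could not fill.

```python
from itertools import product

XOR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def step_neuron(inputs, weights, threshold):
    """Emit 1 if the weighted sum of the inputs passes the threshold."""
    return 1 if sum(x * w for x, w in zip(inputs, weights)) > threshold else 0


def single_neuron_solves_xor(grid):
    """Try every (w1, w2, threshold) on the grid; True if any computes XOR."""
    return any(
        all(step_neuron(x, (w1, w2), t) == target for x, target in XOR_CASES)
        for w1, w2, t in product(grid, repeat=3)
    )


def two_layer_xor(x):
    """Hand-wired hidden layer: one OR neuron and one AND neuron."""
    h_or = step_neuron(x, (1, 1), 0.5)
    h_and = step_neuron(x, (1, 1), 1.5)
    # Output fires when OR is on and AND is off.
    return step_neuron((h_or, h_and), (1, -1), 0.5)


grid = [i / 2 for i in range(-6, 7)]  # -3.0, -2.5, ..., 3.0
print(single_neuron_solves_xor(grid))            # False
print([two_layer_xor(x) for x, _ in XOR_CASES])  # [0, 1, 1, 0]
```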

The gradient method was perfected as a method to optimize multi-stage dynamic systems by Bryson and his Chinese-born student Yu-Chi Ho in the book "Applied Optimal Control" (1969). At that point the mathematical theory necessary for backpropagation in multi-layer neural networks was basically ready. In 1970 the Finnish mathematician Seppo Linnainmaa invented "reverse mode of automatic differentiation", which has backpropagation as a special case. In 1974 Paul Werbos' dissertation at Harvard University applied Bryson's backpropagation algorithm to the realm of neural networks ("Beyond Regression", 1974). Werbos had realized that the "backpropagation" algorithm was a more efficient way to train a neural network than any of the existing methods. His discovery languished for several years because his background wasn't quite orthodox: his thesis advisor was the social scientist and cybernetic pioneer Karl Deutsch, and his algorithm of backpropagation was meant as a mathematical expression of the concept of "cathexis" that Sigmund Freud had introduced in his book "The Project for a Scientific Psychology" (1895).
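For readers who want to see what backpropagation computes, here is a generic textbook sketch; it is not Bryson's, Linnainmaa's or Werbos' formulation. It assumes a tiny network with two inputs, two sigmoid hidden neurons and one linear output, and a squared-error loss. All the numbers are arbitrary. The error at the output is passed backward through the layers with the chain rule, and one of the resulting derivatives is compared with a numerical estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, b1, w2, b2):
    """Two-layer network: sigmoid hidden layer, then a linear output neuron."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(w1, b1)]
    y = sum(w * hj for w, hj in zip(w2, h)) + b2
    return h, y


def loss(x, target, w1, b1, w2, b2):
    _, y = forward(x, w1, b1, w2, b2)
    return 0.5 * (y - target) ** 2


def backprop(x, target, w1, b1, w2, b2):
    """Propagate the output error backward using the chain rule."""
    h, y = forward(x, w1, b1, w2, b2)
    dy = y - target                        # derivative of the loss w.r.t. y
    grad_w2 = [dy * hj for hj in h]
    grad_b2 = dy
    # Error reaching each hidden neuron, passed back through its sigmoid.
    dz = [dy * w * hj * (1.0 - hj) for w, hj in zip(w2, h)]
    grad_w1 = [[dzj * xi for xi in x] for dzj in dz]
    grad_b1 = dz
    return grad_w1, grad_b1, grad_w2, grad_b2


# Arbitrary illustrative values.
x, target = [0.5, -1.0], 0.25
w1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
w2, b2 = [0.7, -0.5], 0.2

grad_w1, grad_b1, grad_w2, grad_b2 = backprop(x, target, w1, b1, w2, b2)

# Check one hidden-layer derivative against a finite-difference estimate.
eps = 1e-5
w1_up = [row[:] for row in w1]
w1_down = [row[:] for row in w1]
w1_up[0][1] += eps
w1_down[0][1] -= eps
numeric = (loss(x, target, w1_up, b1, w2, b2)
           - loss(x, target, w1_down, b1, w2, b2)) / (2 * eps)
print(abs(grad_w1[0][1] - numeric) < 1e-8)
```

One backward pass yields the derivative for every weight at roughly the cost of one forward pass, whereas the finite-difference estimate needs two forward passes per weight.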

Practitioners of neural networks also took detours into cognitive science. For example, James Anderson at Rockefeller University ("A Simple Neural Network Generating an Interactive Memory", 1972) and Teuvo Kohonen in Finland ("Correlation Matrix Memories", 1972) used neural networks to model associative memories based on Donald Hebb's law. The neuroscientist Christoph von der Malsburg at the Max Planck Institute in Germany built a model for the visual cortex of higher vertebrates ("Self-organization of Orientation Sensitive Cells in Striate Cortex", 1973). The holy grail of neural networks was unsupervised learning: have the machine learn concepts from the data without human intervention. Several variations on Karl Pearson's decades-old method of "principal components analysis" were proposed, and significant contributions came from the science of signal processing. For example, Pete Uttley designed the Informon to separate frequently occurring patterns ("A Network for Adaptive Pattern Recognition", 1970). In 1975 the first multi-layered network appeared, designed by Kunihiko Fukushima in Japan, the Cognitron ("Cognitron - A Self-organizing Multilayered Neural Network", 1975). Stephen Grossberg at Boston University unveiled another unsupervised model, "adaptive resonance theory" ("Adaptive Pattern Classification and Universal Recoding", 1976). Another one was proposed by Shunichi Amari at the University of Tokyo ("Mathematical Theory on Formation of Category Detecting Nerve Cells", 1978; later expanded in "Field theory of self-organizing neural nets", 1983). Therefore, by the mid-1970s significant progress had occurred (if not widely publicized) in neural networks.

A much more stinging criticism could have come from neuroscience, a discipline that was beginning to use computer simulations. In 1947 Kacy Cole at the Marine Biological Lab near Boston pioneered the "voltage clamp" technique to measure the electrical current flowing through the membranes of neurons. Using that technique, in 1952 the British physiologists Alan Hodgkin and Andrew Huxley at Cambridge University built the first mathematical model of a spiking neuron, which also counts as the first simulation of computational neuroscience (for the record, they simulated the giant axon of the squid). The Hodgkin-Huxley model is a set of nonlinear differential equations that approximates the electrical characteristics of neurons. The next major breakthroughs in the simulation of brain computation came in 1962, when Wilfrid Rall at the National Institutes of Health simulated a dendritic arbor, and in 1966, when Fred Dodge and James Cooley at IBM simulated a propagating impulse in an axon. Meanwhile, Donald Perkel at the RAND Corporation in Los Angeles had written computer programs to simulate the working of the neuron using one of the earliest computers, the Johnniac ("Continuous-time Simulation of Ganglion Nerve Cells in Aplysia", 1963). These simulations (by people who actually knew what a neuron looks like) bore little resemblance to the naive digital neurons of the artificial neural networks.

Nonetheless, neuroscientists kept emphasizing the role of synapses: intelligence is not about the neuron but about the connections (the synapses) that create the network of neurons.

Jean-Pierre Changeux in "Neuronal Man" (1985): "The impact of the discovery of the synapse and its functions is comparable to that of the atom or DNA".

Joseph LeDoux in "Synaptic Self" (2002): "You are your synapses - they are who you are".