(These are excerpts from my book "Intelligence is not Artificial")
Footnote: A brief History of the Gradient
Training a neural network means minimizing its error. The neural network needs two algorithms for this: one algorithm to minimize the "gradient" in a series of steps and one algorithm to compute the "gradient" at each step. Backpropagation is the algorithm to compute the gradient. The algorithm used in conjunction with backpropagation is an optimization method because minimizing the error is another way of saying that we want to find the minimum of a complex (nonlinear) function. There are many such optimization methods. The need originally arose after Newton published his equations of gravitation. It was mainly astronomers who needed to find the minima of functions. The British mathematician John Wallis published the socalled "Newton's method" in his "Treatise of Algebra both Historical and Practical" (1685), the book that introduced the calculus of limits, but it was the French mathematician Augustin Cauchy in 1847 who invented the first efficient method of this kind. His "gradientdescent" method (or "steepestdescent method") enabled astronomers to calculate the orbit of an astronomical object without having to solve Newton's differential equations of motion, but instead using (simpler) algebraic equations.
The idea is quite intuitive if one thinks of a function as a curve in space similar to the skyline of a mountain range, with peaks and valley. The goal is to descend from the top of a mountain to the parking lot. The fastest way at every step (if you are not scared of heights) is to always take the steepest way down: follow the direction of the slope downhill. The problem is that there are many cases in which this takes you to a valley, e.g. a glacier nested between several mountains, where the strategy fails: you can't go any further down. Or, mathematically speaking, the gradient at any local minima is zero, so it doesn't tell you how to continue optimizing the function. That is the problem of "local minima".
A very popular gradient method for neural networks, employed in conjunction with backpropagation, is the "stochastic gradient descent" method, pioneered by Herbert Robbins in 1951, mainly because it is much more efficient, a fact of great importance when you are training a neural network with a large dataset; but also to reduce the chances of getting stuck in local minima.
Backpropagation is what is known in calculus as a "chain rule" and it computes
the gradient at every step, after which the optimization method decides which step to take next.
The iterative application of gradient computation and optimization results in the progressive modification of the "weights" that connect neurons.
There are also gradientfree methods of optimization, notably simulated annealing and genetic algorithm, which, in fact, have been studied by A.I. since the early days.
Much of A.I. reasoning consists in "finding" a solution in space of solutions. This task is closely related to the mathematical problem of optimization in which all possible states of the system have an objective function and the goal is to find the state with the maximum objective value. For example, hillclimbing search consists in a loop that continuously moves towards increasing value. Gradient descent is a close relative of hill climbing. This strategy is obviously na‹ve because hillclimbing will stop when it can't increase the value anymore, but that could just be a subpeak of a mountain that is much higher (it could be a "local minimum"). The hillclimbing algorithm is fooled by any local minima and, additionally, it doesn't know which way to go if it enters a plateau (in which each direction yields the same value). Many methods have been proposed to improve simple search algorithms like hill climbing: Scott Kirkpatrick's simulated annealing in 1983, Tabu search, published in 1986 by Fred Glover at the University of Colorado ("Future Paths for Integer Programming and Links to Artificial Intelligence", 1986), and GRASP (greedy randomized adaptive search procedure), introduced by Thomas Feo of the University of Texas and Mauricio Resende of UC Berkeley ("A Probabilistic Heuristic for a Computationally Difficult Set Covering Problem", 1989).
Searching a "tree" of possible actions/solutions remained one of the fundamental chores of any "intelligent" system, and the A* algorithm remained the foundation of this kind of search. However, A* and all its variants are methods of "offline" search: even before taking the first step, they already plan the complete sequence of steps to achieve the goal, i.e. all the way from the start state to the goal state. Realtime search methods, instead, plan only a few first steps. This approach is more practical in realworld situations. Several realtime algorithms have been developed that are also variations of A*, notably the Learning RealTime A* (LRTA*), introduced by Richard Korf at UCLA ("RealTime Heuristic Search", 1987) and Realtime A* algorithms. The former improves its performance over time but it is less efficient than the latter.
"If you do not change direction, you may end up where you are heading" (Lao Tze)
Back to the Table of Contents
Purchase "Intelligence is not Artificial")
