(These are excerpts from my book "Intelligence is not Artificial")
Analog Computation
Neural networks, and deep learning in particular, are good for recognizing patterns (e.g., that this particular object is an apple) but not for learning events that unfold in time. A traditional feedforward neural network has no sense of time.
In 1992 Hava Siegelmann of Bar-Ilan University in Israel and Eduardo Sontag of Rutgers University developed Recurrent Neural Networks (RNNs) that can operate on sequences and therefore can also model relationships in time ("Analog Computation via Neural Networks", paper submitted in 1992 but published only in 1994). Typical applications of RNNs are: image captioning, which turns an image into a sequence of words ("sequence output"); sentence classification, which turns a sequence of words into a category ("sequence input"); and sentence translation (sequence input and sequence output). The innovation in RNNs is a hidden layer that connects two points in time. In the traditional feedforward structure, each layer of a neural network feeds into the next layer. In RNNs there is a hidden layer that feeds not only into the next layer but also into itself at the next time step. This recursion or cycle adds a model of time to the network. To train it, traditional backpropagation has to be applied across the time steps as well as across the layers, a procedure known as "backpropagation through time".
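The idea of a hidden layer that feeds into itself at the next time step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the layer has a single unit, and the names rnn_step, run_rnn, w_xh, w_hh and b, as well as the weight values, are invented here and do not come from the papers cited above.

```python
import math

def rnn_step(x, h_prev, w_xh, w_hh, b):
    """One time step of a one-unit recurrent layer.

    The new hidden state depends on the current input x and on the
    hidden state h_prev that the same unit produced one step earlier.
    """
    return math.tanh(w_xh * x + w_hh * h_prev + b)

def run_rnn(sequence, w_xh=0.5, w_hh=0.8, b=0.0):
    """Feed a sequence through the unit and return every hidden state."""
    h = 0.0  # the unit starts with an empty memory
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_xh, w_hh, b)
        states.append(h)
    return states

# The same three numbers in a different order leave the unit in a
# different final state: the order of events matters to the network.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 1.0])
print(states_a)
print(states_b)
```

In the first run the input is zero at the second and third steps, yet the hidden state is not zero: it still carries a fading trace of the first input. That trace is the only "memory" that a plain recurrent unit has.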
A general problem of neural networks with many layers ("deep" neural networks), and of RNNs in particular, is the "vanishing gradient", already described in 1991 by Josef "Sepp" Hochreiter at the Technical University of Munich and more famously in 1994 by Yoshua Bengio ("Learning Long-Term Dependencies with Gradient Descent is Difficult"). The expression "vanishing gradient" refers to the fact that the error signal used to adjust the weights shrinks every time it is propagated back through one more layer (or, in an RNN, one more time step). It is a problem similar to calculating the probability of a chain of events: if you multiply a probability between 0 and 1 by another probability between 0 and 1 and so on many times, the result gets closer and closer to zero, even in the case in which all those numbers express probabilities of 99%. A network with many layers is difficult to train because the corrections that reach the layers farthest from the output end up being too weak to change their "weights".
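The arithmetic behind the chain-of-events analogy is easy to check. The sketch below multiplies the same factor by itself many times; the function name product_of_factors is made up for this illustration, and the factor 0.25 is used as a second example because it is the largest slope that the classic sigmoid activation function can have.

```python
def product_of_factors(factor, count):
    """Multiply `factor` by itself `count` times."""
    result = 1.0
    for _ in range(count):
        result *= factor
    return result

# A chain of events that are each 99% likely.
for n in (10, 100, 1000):
    print(n, product_of_factors(0.99, n))

# A chain of sigmoid slopes, each at most 0.25.
for n in (5, 10, 20):
    print(n, product_of_factors(0.25, n))
```

A chain of 100 events at 99% each is already down to roughly 37%, and a chain of 1000 is below one in ten thousand. With a factor of 0.25 the product collapses much faster: ten layers are enough to push it below one in a million.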
In 1997 Sepp Hochreiter and his professor Jürgen Schmidhuber came up with a solution: the Long Short-Term Memory (LSTM) model. In this model, the unit of the neural network (the "neuron") is replaced by one or more memory cells. Each cell functions like a mini-Turing machine, performing simple operations of read, write, store and erase that are triggered by simple events. The big difference from Turing machines is that these are not binary decisions but "analog" decisions, represented by real numbers between 0 and 1, not just 0 and 1. For example, if the network is analyzing a text, a unit can store the information contained in a paragraph and apply this information to a subsequent paragraph. The reasoning behind the LSTM model is that a recurrent neural network contains two kinds of memory: there is a short-term memory about recent activity and there is a long-term memory which is the traditional "weights" of the connections that change based on this recent activity. The weights change very slowly as the network is being trained. The LSTM model tries to retain also the information contained in the recent activity, which traditional networks only use to fine-tune the weights and then discard.
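The "analog" read, write and erase decisions can be made concrete with a single memory cell. The sketch below is not the exact 1997 formulation: it assumes the now-common variant that includes a "forget" gate (added to the model a few years later), it uses one cell with scalar values, and every name and weight in it (lstm_step, gate_input, params) is hypothetical.

```python
import math

def sigmoid(z):
    """Squash any real number into the range 0..1 (an "analog" decision)."""
    return 1.0 / (1.0 + math.exp(-z))

def gate_input(x, h_prev, weights):
    """Weighted sum of the current input and the previous output."""
    w_x, w_h, b = weights
    return w_x * x + w_h * h_prev + b

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single LSTM memory cell (variant with forget gate)."""
    f = sigmoid(gate_input(x, h_prev, params["forget"]))      # erase: how much of the store to keep
    i = sigmoid(gate_input(x, h_prev, params["input"]))       # write: how much new content to admit
    o = sigmoid(gate_input(x, h_prev, params["output"]))      # read: how much of the store to expose
    g = math.tanh(gate_input(x, h_prev, params["candidate"])) # the new content itself
    c = f * c_prev + i * g   # store: old content kept plus new content written
    h = o * math.tanh(c)     # output of the cell at this time step
    return h, c

# Hypothetical weights: forget gate almost fully open (keep everything),
# input gate almost fully closed (write nothing).
params = {
    "forget":    (0.0, 0.0, 10.0),
    "input":     (0.0, 0.0, -10.0),
    "output":    (0.0, 0.0, 0.0),
    "candidate": (1.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # the cell starts out holding the value 1.0
for x in [0.3, -0.7, 0.9] * 10:  # thirty time steps of unrelated input
    h, c = lstm_step(x, h, c, params)
print(c)
```

With these settings the cell still holds a value very close to 1.0 after thirty steps, because the store is carried forward by a gate value near 1 instead of being squashed at every step. In a trained network the gate weights are learned, so the cell itself decides when to keep, overwrite or expose what it holds.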
For 60 years it was assumed that no computing device can be more powerful than a Universal Turing Machine. Hava Siegelmann proved mathematically that analog RNNs can achieve super-Turing computing ("On the Computational Power of Neural Nets", 1992). Alan Turing himself had tried to imagine a way to extend the computational power of his universal machine ("Systems of Logic Based on Ordinals", 1938), but his idea cannot be implemented in practice. Siegelmann's system was not the first system to break the Turing limit using real numbers, and nobody has built a computer yet that can perform operations on real numbers in a single step.
