(These are excerpts from my book "Intelligence is not Artificial")
Footnote: The Opaque Power of Neural Networks
The funny thing about multilayer networks is that nobody really knows why they work so well when they work well. Designing a multilayer network is still largely a process of "trial and error".
Like most nonlinear systems, multilayer neural networks are difficult to "understand". It is relatively easy to figure out what a linear algorithm does, although a computer may compute it millions of times faster than us, but a nonlinear algorithm is, to a large extent, inscrutable.
One could see this as a human limitation: we can see that the neural network works well, but we cannot understand how it is doing it. In reality, it is equally difficult for us to understand what animals are doing, and we cannot understand what most humans are doing either... unless they tell us what they are doing. So it is also a limitation of the machine: a neural network that cannot explain its behavior has a clear limitation compared with human beings who routinely explain what they are doing. Paraphrasing what the philosopher Daniel Dennett said in a 2017 interview, if the neural network "cannot do better than us at explaining what it is doing, then don't trust it." Imagine that a deep-learning system, proven to be more accurate than all the human experts combined, analyzes a CT scan of your body and tells you that you only have one month to live. The human experts may be less accurate but they can tell you why they think what they think. The neural network simply tells you that you have one month to live, with no explanation. How does it feel? Have you become just a disposable record in a database of patient records? Has explaining what is happening to you become a waste of time? Perhaps more importantly: can you trust an opinion that comes with no explanation? Do you begin making arrangements for your funeral without knowing why you are going to die soon?
In 2013 Rob Fergus' student Matthew Zeiler at New York University introduced a visualization technique to gain insight into the functioning of the intermediate layers of deep convolutional networks ("Visualizing and Understanding Convolutional Networks", 2013). LeCun's student Anna Choromanska at New York University used the physics of the spherical spin-glass model to explain why stochastic gradient descent works so well in "deep" neural networks ("The Loss Surfaces of Multilayer Networks", 2015). In 2016 Wojciech Samek in Germany unveiled a method called "deep Taylor decomposition" to peek into the workings of deep neural networks. At about the same time Carlos Guestrin's team at the University of Washington developed an algorithm called LIME (Local Interpretable Model-Agnostic Explanations) that can explain the predictions of any classifier ("Why Should I Trust You?", 2016). In 2017 David Gunning at DARPA (Defense Advanced Research Projects Agency) launched a four-year nationwide project (involving ten research laboratories) to develop neural-network interfaces that can help ordinary users understand how the neural network reached its conclusion (the eXplainable A.I. or XAI program). For example, Mohamed Amer at SRI International was trying to visualize the inner workings of a neural network using the technique of generative adversarial networks.
In 2015 Naftali Tishby of the Hebrew University in Israel offered an explanation based on Claude Shannon's theory of information.
In 1999 Tishby, Fernando Pereira (then at Bell Labs and later hired by Google) and William Bialek (then at NEC Research Institute in New Jersey and later at Princeton University) had formulated the "information bottleneck method" for network optimization ("The Information Bottleneck Method", 2000).
A network retains only what is essential because the process of "learning" is equivalent to squeezing information through a bottleneck: the network behaves like someone forced to retain only what is truly essential, or, better, relevant, and to discard the rest. The process works if the network chooses correctly what can be discarded. Tishby likes to say that “the most important part of learning is actually forgetting.”
Tishby, Pereira and Bialek had also calculated the theoretical bound of the information bottleneck, i.e. the greatest degree of optimization that still retains the relevant information.
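For readers who want the bottleneck stated as a formula, here is the objective in the notation that became common in later papers. The symbols are an assumption of this note, not something fixed by the text above: X is the input, Y is the relevant variable (for example the label), T is the compressed representation, I(·;·) is Shannon's mutual information, and β is a trade-off parameter.

```latex
\min_{p(t \mid x)} \; I(X;T) \;-\; \beta \, I(T;Y),
\qquad \text{subject to the Markov chain } Y \to X \to T .
```

The first term rewards squeezing (forgetting as much of X as possible), while the second rewards keeping what is relevant to Y. Sweeping β traces out the curve of best achievable trade-offs, which is one way to read the "theoretical bound" mentioned above.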
In 2014 two physicists, David Schwab of Northwestern University and Pankaj Mehta of Boston University, discovered a striking similarity between Hinton's deep-learning algorithm and "block-spin renormalization", a routine mathematical method used in statistical physics to extract the relevant features of a system and determine which ones can be ignored ("An Exact Mapping Between the Variational Renormalization Group and Deep Learning", 2014).
Block-spin renormalization, invented in 1966 by Leo Kadanoff, is used by physicists to describe a system in a statistical manner, without the need to know the exact state of all its particles, in particular at the so-called "critical point" of a physical system (like when water turns from liquid to vapor). Renormalization is a mathematical way to describe what matters macroscopically about a system without bothering about the microscopic details that are not important for its macroscopic behavior (or, better, simply averaging over them).
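The "averaging over the details" idea can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: a square lattice of +1/-1 spins, 3 x 3 blocks, and a simple majority rule. The function name block_spin and the random lattice are hypothetical choices made for this example; it is neither Kadanoff's full procedure nor the mapping studied by Schwab and Mehta.

```python
import numpy as np

def block_spin(spins: np.ndarray, b: int = 3) -> np.ndarray:
    """Coarse-grain a square lattice of +1/-1 spins by majority rule over b x b blocks."""
    n = spins.shape[0]
    if spins.shape != (n, n) or n % b != 0:
        raise ValueError("lattice must be square with a side divisible by b")
    if b % 2 == 0:
        raise ValueError("use an odd block size so that the majority is never tied")
    # Reshape to (n//b, b, n//b, b): axes 1 and 3 index the spins inside each block.
    blocks = spins.reshape(n // b, b, n // b, b)
    block_sums = blocks.sum(axis=(1, 3))
    # Each block is replaced by a single spin carrying the sign of its majority.
    return np.where(block_sums > 0, 1, -1)

# Hypothetical example: a random 9 x 9 lattice, coarse-grained twice.
rng = np.random.default_rng(0)
lattice = rng.choice([-1, 1], size=(9, 9))
coarse = block_spin(lattice)    # 3 x 3 lattice of block spins
coarser = block_spin(coarse)    # a single spin summarizing the whole lattice
print(lattice.shape, coarse.shape, coarser.shape)
```

Each pass throws away the microscopic configuration inside a block and keeps only its majority, which is the sense in which the method retains what matters at the larger scale.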
In 2015 Tishby and his student Noga Zaslavsky explained deep learning in terms of information bottleneck: a deep neural network works like an optimization algorithm that retains only the relevant information for data classification ("Deep Learning and the Information Bottleneck Principle", 2015).
When a neural network is being trained with a dataset, at some point it enters a “compression” phase in which it starts shedding information, i.e. it starts "forgetting" what is not relevant in order to retain the capacity to learn more about what is relevant. If a neural network is trained with enough samples, it will converge to the Tishby-Pereira-Bialek bound.
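A toy calculation can make "forgetting what is not relevant" concrete. The sketch below is not Tishby's experiment: it assumes a hypothetical input X with four equally likely values, a label Y that depends only on whether X is in the first or the second half, and three hand-written encoders that map X to a representation T. All names (mutual_information, information_plane, the encoders) are invented for this example.

```python
import numpy as np

def mutual_information(joint: np.ndarray) -> float:
    """Mutual information, in bits, of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    p_rows = joint.sum(axis=1, keepdims=True)   # marginal of the row variable
    p_cols = joint.sum(axis=0, keepdims=True)   # marginal of the column variable
    mask = joint > 0                            # skip zero cells (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (p_rows @ p_cols)[mask])))

# Assumed toy world: X uniform on {0, 1, 2, 3}, Y = 0 for X in {0, 1}, Y = 1 otherwise.
p_xy = np.array([[0.25, 0.0],
                 [0.25, 0.0],
                 [0.0, 0.25],
                 [0.0, 0.25]])
p_x = p_xy.sum(axis=1)

def information_plane(encoder: np.ndarray) -> tuple[float, float]:
    """Return (I(X;T), I(T;Y)) for an encoder given as a table p(t | x)."""
    joint_xt = p_x[:, None] * encoder   # p(x, t) = p(x) p(t | x)
    joint_ty = encoder.T @ p_xy         # p(t, y) = sum over x of p(t | x) p(x, y)
    return mutual_information(joint_xt), mutual_information(joint_ty)

remember_all = np.eye(4)                                  # T = X: nothing forgotten
forget_irrelevant = np.array([[1.0, 0.0], [1.0, 0.0],
                              [0.0, 1.0], [0.0, 1.0]])    # T keeps only the half X is in
forget_everything = np.ones((4, 1))                       # T is constant

for name, enc in [("remember_all", remember_all),
                  ("forget_irrelevant", forget_irrelevant),
                  ("forget_everything", forget_everything)]:
    i_xt, i_ty = information_plane(enc)
    print(f"{name}: I(X;T) = {i_xt:.1f} bits, I(T;Y) = {i_ty:.1f} bits")
```

The middle encoder is the interesting one: it keeps half as much information about the input as the first encoder, yet it loses nothing about the label. The third encoder shows what happens when the network discards the wrong thing.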
Alex Alemi at Google applied Tishby's information-bottleneck thinking to very deep neural networks ("Deep Variational Information Bottleneck", 2016).
Tishby's model of learning in a network is intriguing because it provides a link to theories of human memory.
It has been obvious since at least 1958, when the British psychologist Donald Broadbent published "Perception and Communication", that the number of objects we see in a lifetime exceeds the number of neurons in the brain that would be needed to store them as images. Human memory must select what to remember and forget most of the stimuli that are perceived by the senses. Broadbent stated the principle of "limited capacity" of the brain (also known as the "filter theory") to explain how a limited-capacity system such as the brain can cope with the overwhelming amount of information available in the world. Tishby's theory can even provide a link to dreaming: Francis Crick, co-discoverer of the double-helix structure of DNA, once speculated that the function of dreams is to "clear the circuits" of the brain ("The Function of Dream Sleep", 1983). The brain, in the face of huge daily sensory stimulation, must understand what matters, understand what does not matter, remember what will still matter and forget what will never matter again.