(These are excerpts from my book "Intelligence is not Artificial")
The Future of Machine Learning: Unsupervised Learning to the Rescue?
Summarizing, there are four desiderata that one would like to see in A.I. systems, if they have to compare well with human (or just animal) brains:
meta-learning, learning by demonstration ("few-shot learning"), transfer learning and multi-task learning.
Meta-learning is particularly relevant in the case of reinforcement learning. It is obvious that reinforcement learning is highly unnatural. DeepMind's AlphaGo and OpenAI Five need to learn from scratch via a huge number of trials. Animals, instead, use built-in or acquired "meta-skills" to learn new tasks in just a few trials.
The modern computational theory of meta-learning (learning how to learn) dates back at least to the 1990s, when Schmidhuber published the manifesto "Simple Principles of Metalearning" (1996), followed by his student Sepp Hochreiter ("Learning to Learn Using Gradient Descent", 2001), and by Nicolas Schweighofer and Kenji Doya at Japan's ATR ("Meta-learning in Reinforcement Learning", 2001). Examples of "deep" meta-learning systems of the new generation are: RL Square by Pieter Abbeel's student Yan Duan at UC Berkeley, based on Schulman's TRPO ("RL Square: Fast Reinforcement Learning via Slow Reinforcement Learning", 2016); the "model-agnostic meta-learning" (MAML) of Sergey Levine's student Chelsea Finn at UC Berkeley ("Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks", 2017); Marcel Binz's thesis at KTH Royal Institute of Technology ("Learning Goal-Directed Behaviour", 2017); Jane Wang's "deep meta-reinforcement learning" at DeepMind ("Learning to Reinforcement Learn", 2017);
and OpenAI's Reptile, developed by Alex Nichol and John Schulman, a generalization of Finn's MAML ("On First-Order Meta-Learning Algorithms", 2018).
DeepMind's neuroscientist Matthew Botvinick believes that meta-reinforcement learning could be a model for how our brain learns: the dopamine system trains another part of the brain, the prefrontal cortex, to operate as its own free-standing learning system ("Prefrontal Cortex as a Meta-reinforcement Learning System", 2018).
It is also obvious that animals can naturally "transfer" skills from one domain to another: an animal rarely needs to learn a new skill as if it had nothing in common with known skills. Transfer learning is about applying what one learned in one case to a different case. Success stories in computational transfer learning are scant despite the pioneering work of Satinder Singh (1991),
Lorien Pratt (1992), Sebastian Thrun (1994) and Rich Caruana (1993).
Reinforcement learning is particularly difficult to generalize to multiple problems because each case requires a different reward function. Transfer learning is one case in which the learning agent needs to do a lot of exploration (it can't just repeat what it has learned in a previous case). Exploration based on "intrinsic motivation" is an old idea, from Schmidhuber ("Evolutionary Principles in Self-referential Learning", 1987) via Barto ("Intrinsically Motivated Reinforcement Learning", 2004) all the way to DeepMind ("Unifying Count-Based Exploration and Intrinsic Motivation", 2016), whose algorithm achieved improvement in exploration-heavy videogames such as Montezuma's Revenge, and OpenAI ("Some Considerations on Learning to Explore via Meta-reinforcement Learning", 2017), which introduced two new reinforcement-learning algorithms (E-MAML and E-RL2).
Andrei Rusu and others at DeepMind also invented the "progressive nets" method that supposedly emulated the way the human mind can reuse previous experience and applied it to accelerate learning multiple Atari games ("Progressive Neural Networks", 2016).
Curiosity-driven exploration, which, again, was pioneered by Schmidhuber ("Curious Model-building Control Systems," 1991) and was applied to developmental robotics by Pierre-Yves Oudeyer and Frederic Kaplan at Sony labs in France ("Motivational Principles for Visual Know-how Development," 2003), was studied by Shankar Sastry's student Joshua Achiam at UC Berkeley ("Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning", 2017), Trevor Darrell's student Deepak Pathak at UC Berkeley ("Curiosity-driven Exploration by Self-supervised Prediction", 2017), and DeepMind ("Kickstarting Deep Reinforcement Learning", 2018).
Pathak's "intrinsic curiosity model", for example, was a self-supervised reinforcement learning system that used curiosity, instead of feedback, as a natural reward signal to enable the agent to explore its environment and learn skills for later use. An example of applications of curiosity-driven learning was Leela, unveiled in 2018 by Leela.ai in Palo Alto, a learning agent modeled on Jean Piaget's child psychology that built increasingly abstract models of the world from exploration, play and trial-and-error.
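The idea of using curiosity as a reward can be illustrated with a toy sketch. This is not Pathak's system, which works on learned features of images; it is a minimal illustration that assumes a hypothetical tabular forward model (the names ForwardModel and intrinsic_reward are invented here). The agent is rewarded in proportion to how badly it predicted the outcome of its own action, so the reward fades as a transition becomes familiar.

```python
class ForwardModel:
    """Hypothetical tabular forward model: predicts the next state (a number)
    from a (state, action) pair as a running average of what it has seen."""

    def __init__(self, learning_rate=0.5):
        self.learning_rate = learning_rate
        self.predictions = {}  # (state, action) -> predicted next state

    def predict(self, state, action):
        # Unseen transitions are predicted as 0.0 (an arbitrary default).
        return self.predictions.get((state, action), 0.0)

    def update(self, state, action, next_state):
        old = self.predict(state, action)
        self.predictions[(state, action)] = old + self.learning_rate * (next_state - old)


def intrinsic_reward(model, state, action, next_state):
    """Curiosity reward: squared error of the model's prediction.
    The model is updated afterwards, so repeated transitions pay less."""
    error = next_state - model.predict(state, action)
    model.update(state, action, next_state)
    return error ** 2


model = ForwardModel()
# The same transition repeated four times: the surprise shrinks each time.
rewards = [intrinsic_reward(model, 0, "right", 1.0) for _ in range(4)]
print(rewards)
```

In this sketch no external feedback is involved at all: the only signal is the agent's own prediction error, which is what pushes it towards parts of the environment it does not yet understand.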
Pieter Abbeel's student Abhishek Gupta at UC Berkeley introduced an algorithm, "model agnostic exploration with structured noise" (MAESN), to learn exploration strategies from past experience ("Meta-Reinforcement Learning of Structured Exploration Strategies", 2018).
Abhinav Gupta's student Lerrel Pinto at Carnegie Mellon University built a robot that could push, poke, grasp and observe objects: four different types of physical interactions that forced a shared CNN to learn a visual representation ("The Curious Robot - Learning Visual Representations via Physical Interactions", 2016).
The way we normally learn a new game is by watching people play. After watching a few games, and being told what the rules are, we can start playing. This is called "few-shot learning", learning by watching a few demonstrations. This is very different from what OpenAI Five and DeepMind's AlphaZero do: they play thousands of games against themselves.
Pioneering work in "behavioral cloning" (another term for "learning from demonstration") was done in Britain by Donald Michie ("Cognitive Models from Subcognitive Skills", 1990) and Claude Sammut ("Learning to Fly", 1992). Research on machine learning from visual observation of human performance was conducted in Japan by Yasuo Kuniyoshi of the Electrotechnical Laboratory (ETL) in collaboration with Masayuki Inaba and Hirochika Inoue of Tokyo University ("Teaching by Showing", 1989), and by Dean Pomerleau at Carnegie Mellon University when training the self-driving vehicle ALVINN to follow street lanes ("Autonomous Land Vehicle in a Neural Network", 1989). Stefan Schaal wrote the manifesto "Is Imitation Learning the Route to Humanoid Robots?" (1999). Two future stars of deep learning, Pieter Abbeel and Andrew Ng, were attracted to the field ("Apprenticeship Learning via Inverse Reinforcement Learning", 2004) when both were at Stanford University: the former was studying for his doctorate and the latter had recently obtained his doctorate in computer science from UC Berkeley with a thesis on reinforcement learning supervised by Michael Jordan. Stimulus to develop alternatives for few-shot learning came also from Lake-Salakhutdinov-Tenenbaum's theory of concept learning through probabilistic program induction (2015). A popular avenue of research in one-shot learning was to extend Alex Graves' neural Turing machines (2014) and James Weston's memory networks (2014): Oriol Vinyals at DeepMind developed "matching networks" ("Matching Networks for One Shot Learning", 2016) and Han Altae-Tran at MIT developed the "iterative refinement long short-term memory" ("Low Data Drug Discovery with One-Shot Learning", 2018).
Sergey Levine's students Tianhe Yu and Chelsea Finn, instead, working on robots, developed a system that can imitate a movement after watching it just once, a movement that the system has never seen before. The secret is a variant of MAML called "domain-adaptive meta-learning" (or DAML) that trains a deep network with many videos of human and robot movements performed for different tasks. Then the system can handle a novel task involving a novel object after watching a human perform that task with that object ("One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning", 2018).
The next step for Tianhe Yu and Chelsea Finn was to use the video of a human demonstration (or televised transmission of it) to train the robot
("One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning", 2018).
Byron Boots' group at Georgia Tech used imitation learning to continue Dean Pomerleau's mission to train a self-driving vehicle ("Agile Autonomous Driving using End-to-End Deep Imitation Learning", 2018).
Aravind Rajeswaran and Vikash Kumar at University of Washington mixed learning by demonstration and deep reinforcement learning to teach a robot how to grasp objects ("Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations", 2018) and DeepMind's group of Nando de Freitas and Nicolas Heess did the same to teach visual skills ("Reinforcement and Imitation Learning for Diverse Visuomotor Skills", 2018).
Being able to learn from a demonstration can make a huge difference. For example, it was long believed that reinforcement learning cannot learn to use multi-fingered robotic hands because of the large number of degrees of freedom, i.e. because of the high dimensionality of the problem. Therefore multi-fingered hands were controlled with trajectory optimization methods such as the physics-based method developed by Karen Liu at Georgia Institute of Technology ("Synthesis of Interactive Hand Manipulation", 2008) or the "contact-invariant optimization" method developed by Emanuel Todorov at the University of Washington ("Contact-invariant Optimization for Hand Manipulation", 2012). But Aravind Rajeswaran at the University of Washington, working in collaboration with UC Berkeley's Sergey Levine and OpenAI's John Schulman, showed that reinforcement learning can actually learn to control dexterous multi-fingered hands if it is augmented with a small number of human demonstrations ("Learning Complex Dexterous Manipulation with Deep Reinforcement Learning and Demonstrations", 2018).
Finally, all animals, and certainly humans, can learn many tasks with the same brain, whereas it is terribly difficult for a machine-learning algorithm to learn more than one task. There have been, however, encouraging signs that it might be possible.
Sejnowski's NETtalk (1986) used one neural network to learn two strongly related tasks about speech, as did Ronan Collobert's and Jason Weston's system for natural language processing (2008). Sebastian Thrun at Carnegie Mellon University studied how to enable a robot to continuously learn as it collects new experiences ("Lifelong Robot Learning", 1995). Rich Caruana's work on the self-driving car ALVINN (1997) showed that learning multiple tasks in parallel can be easier than learning them separately. In fact, in 2015 Bharath Ramsundar at Stanford developed a multitask network for drug discovery, based on Szegedy's GoogLeNet, whose accuracy improved as additional tasks and data were added ("Massively Multitask Networks for Drug Discovery", 2015). Incidentally, Ramsundar was also the brain behind deepchem.io, an open-source Python library to help scientists build deep-learning systems for drug discovery. Alas, Ramsundar's "massively multitasking" network required training with millions of data points when in reality the average drug-discovery laboratory works with only a handful of chemical compounds.
One key event in the revival of multitask networks was the Kaggle competition sponsored in 2012 by pharmaceutical giant Merck: the winner was a multitask network (designed by Geoffrey Hinton's student George Dahl at the University of Toronto). Studies multiplied and within a few years there was significant progress.
Ishan Misra and Abhinav Shrivastava at Carnegie Mellon University introduced "cross-stitch units" for convolutional networks, units that look for the best shared representations for multitask learning ("Cross-Stitch Networks for Multi-task Learning", 2016).
Ever since Caruana's paper "Multitask Learning - A Knowledge-Based Source of Inductive Bias" (1993), the supervision of multiple tasks was carried out at the outermost layer of the neural network (as if it were at the top of a hierarchy), but Anders Sogaard at the University of Copenhagen and Yoav Goldberg at Bar-Ilan University showed that it is better to manage different tasks at different layers rather than in the same layer ("Deep Multi-task Learning with Low Level Tasks Supervised at Lower Layers", 2016). Their finding was applied, for example, by Richard Socher of Salesforce in collaboration with the University of Tokyo to problems of natural language processing, thereby improving on Collobert's and Weston's model ("A Joint Many-Task Model", 2017).
Mingsheng Long at Tsinghua University introduced "relationship networks" to discover transferrable features between tasks
("Learning Multiple Tasks with Multilinear Relationship Networks", 2017).
In order to achieve what Thrun had called "lifelong learning" in 1995, neural networks need to overcome the problem that Michael McCloskey and Neal Cohen had called "catastrophic forgetting" in 1989. For almost 30 years that remained a stumbling block, and lifelong learning was mostly attempted with other techniques. For example, in 2013 Eric Eaton and Paul Ruvolo of Bryn Mawr College proposed a general algorithm for lifelong learning called Efficient Lifelong Learning Algorithm (ELLA). Finally, James Kirkpatrick and others at DeepMind developed the algorithm called "elastic weight consolidation" (EWC) that overcomes "catastrophic forgetting" and that can therefore be used in supervised learning and reinforcement learning algorithms to train the network on several tasks sequentially without forgetting the previous ones ("Overcoming Catastrophic Forgetting in Neural Networks", 2017). Another strategy was devised by Jaehong Yoon at KAIST in South Korea, the Dynamically Expandable Network ("Lifelong Learning with Dynamically Expandable Networks", 2017).
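The intuition behind elastic weight consolidation can be sketched as a penalty added to the loss of the new task: each weight is tied by an "elastic" spring to the value it had after the old task, and the spring is stiffer for weights that mattered more to the old task. The code below is a minimal sketch under stated assumptions, not DeepMind's implementation: the function names are invented, the per-weight importance values (in the paper, estimated from the Fisher information) are simply given as inputs, and the numbers are made up.

```python
def ewc_penalty(weights, old_weights, importance, strength):
    """Quadratic penalty for moving away from the weights learned on the old task.
    'importance' says how much each weight mattered to the old task (assumed given)."""
    return 0.5 * strength * sum(
        f * (w - w_old) ** 2
        for w, w_old, f in zip(weights, old_weights, importance)
    )


def total_loss(new_task_loss, weights, old_weights, importance, strength=1.0):
    """Loss used while training on the new task: new-task loss plus the EWC penalty."""
    return new_task_loss + ewc_penalty(weights, old_weights, importance, strength)


old_weights = [1.0, 2.0]   # weights after learning the first task
importance = [4.0, 0.0]    # first weight was crucial, second was irrelevant
weights = [1.5, 5.0]       # candidate weights while learning the second task

penalty = ewc_penalty(weights, old_weights, importance, strength=1.0)
print(penalty)
```

In this example the second weight has moved a long way but costs nothing, because it was irrelevant to the first task, while the small change to the first weight accounts for the whole penalty. That asymmetry is what lets the network reuse "free" weights for the new task without erasing the old one.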
In 1996 Caruana had introduced a method of "hard parameter sharing", but this is suitable only for learning closely-related tasks (such as two tasks related to linguistic features). Instead, Sebastian Ruder at the National University of Ireland introduced "sluice networks", a framework for learning loosely-related tasks, a framework that also turns out to be a generalization of both the Sogaard-Goldberg model and of cross-stitch networks ("Learning What to Share Between Loosely Related Tasks", 2018).
Iasonas Kokkinos at University College London built UberNet, a CNN based on VGG-Net, capable of handling multiple vision tasks ("Training a Universal Convolutional Neural Network for Low-, mid-, and High-level Vision Using Diverse Datasets and Limited Memory", 2017).
Fei Wang in collaboration with Tsinghua University and The Chinese University of Hong Kong designed a network of stacked attention modules, based on ResNet101, that, while performing the general task of image classification, can separate features that are useful for different tasks ("Residual Attention Network for Image Classification", 2017).
Andrew Davison's student Shikun Liu at Imperial College London proposed the Multi-Task Attention Network (MTAN), which can be implemented with any feed-forward network and provides a global feature pool for any number of task-specific modules, so that all learned features can be shared across different tasks. Each task-specific module is an attention module designed to learn task-specific features but also to share them ("End-to-End Multi-Task Learning with Attention", 2018).
These are the desiderata: meta-learning, few-shot learning (i.e. learning by demonstration), transfer learning and multi-task learning. Now the question is which method of learning can achieve those goals. Artificial Intelligence has studied three main methods of learning for neural networks: supervised, unsupervised and reinforced. Supervised learning gave us image and speech recognition. Reinforcement learning gave us software that plays games better than human champions. These successes somewhat obscured the research in unsupervised learning; but obviously supervised learning (that needs to see millions of cats before it can recognize a cat) and reinforcement learning (that needs to play a game millions of times before it can win) are not good models of how animals really learn. Animals learn a lot faster. Many believe that the difference between current algorithms and animals lies precisely in unsupervised learning.
If you want your robot to understand what is happening around it, and what the consequences of its actions will be on the objects around it, supervised learning is not helpful: you would need to show your robot millions of chairs, millions of tables, millions of pens, millions of all the possible objects in order for the robot to figure out just what is in the room. And then it would still need to learn the dynamics of physical interactions with those objects. In order to get to the point that your robot can understand and deal with a variety of scenes and situations, your robot needs to learn by itself what the world is like and how it works.
Up to the 2010s there were basically two categories of unsupervised learning methods: probabilistic (such as Paul Smolensky's restricted Boltzmann machines and Pietro Perona's "constellation models") and autoencoders.
And, incidentally, Marc'Aurelio Ranzato in Yann LeCun's group at New York University proved that these two groups are actually similar mathematical models ("A Unified Energy-based Framework for Unsupervised Learning", 2007).
Some new timely theories about the brain reshaped the field of unsupervised learning.
The Cambridge University neuroscientist Horace Barlow, one of the pioneers with Hubel and Wiesel of formal studies of the visual cortex, showed how neurons of the visual cortex detect what he termed "suspicious coincidences" that occur frequently and use them to build models of the world ("Cerebral Cortex as Model Builder", 1985). This was based on the models of the visual cortex developed by Christoph von der Malsburg in Germany ("Self-organization of Orientation Sensitive Cells in Striate Cortex", 1973) and by Barlow's colleague Nicholas Swindale at Cambridge ("The Development of Columnar Systems in the Mammalian Visual Cortex", 1980). Barlow's intuition was that the brain is a poor scientist but a great
statistician, and so, for example, the brain learns that lightning is almost always accompanied by thunder even without knowing the reason.
If Barlow focused on coincidence, others focused on movement.
Animals are capable of visually identifying an object regardless of where the object is, i.e. regardless of distance and perspective. This is quite amazing as the image of the object can be very different if you think of an image as just a matrix of pixels. We can even deform an object and most people will recognize it (e.g., you can still read the newspaper if you bend a page, which means that letters don't have to be flat like on a table or a wall for your brain to recognize them). Fukushima's neocognitron of 1980 tried to simulate this phenomenon via a hierarchy of alternating feature detectors and invariance layers, which was basically the architecture discovered by David Hubel and Torsten Wiesel. This was the principle also used by LeCun to recognize digits in 1989. Barlow's collaborator Peter Foldiak at Cambridge University, instead, argued that the visual system learns to recognize objects (regardless of the way they look from a specific perspective) from its sensory experience: as we move around we see the object change shape and this trains our visual system to recognize that object from any other perspective ("Learning Invariance from Transformation Sequences", 1991).
This came to be known as the "principle of temporal coherence".
Foldiak's model was also more consistent with the view of biologists of the school known as "ecological realism". Biologists such as James Jerome Gibson had been arguing for decades that animals learn about the environment by acting in it, and, once they learn how the environment works, they become capable of performing a lot of actions in it. As Gibson phrased it: "We move in order to see and we see in order to move".
In the real world an object is rarely seen as an isolated image. It is almost always part of a scene, and part of a scene that is changing. Even if the objects are not moving, the observer is moving, and in most cases both the viewer and the objects are moving.
Foldiak's ideas also resonated with Dileep George's hierarchical probabilistic model of the visual cortex at Stanford University that was based on a similar principle: the geometric invariance of objects (that is trivial for humans to understand) is linked to our movements. As we move around an object, we know that it is still the same object even if it looks different as the visual angle changes. That's how our brain gets trained to recognize objects ("A Hierarchical Bayesian Model of Invariant Pattern Recognition", 2005). His research was sponsored by Silicon Valley inventor Jeff Hawkins at the Redwood Center for Theoretical Neuroscience and then at their startup Numenta.
The limits of supervised learning were particularly serious in the field of video analysis, which was becoming important in designing vision systems for autonomous robots, self-driving cars, and security systems. Supervised learning is difficult in the case of videos because videos are much higher dimensional entities compared to single images. Luckily, videos contain a lot of "suspicious coincidences": spatial and temporal regularities. For example, two successive frames of a video are likely to contain the same objects. Unsupervised learning becomes much easier if the neural network can also use these spatial and temporal correlations. These regularities provide important information about how objects behave. The neural network can use this information about objects to train itself. That's why sometimes this is known as "self-supervised" learning.
Furthermore, a self-supervised neural network learns a representation that can be used for other practical tasks. For example, a neural network that learns to recognize a car in videos of highways is learning a representation of what cars do on roads. This can be used, for example, to classify movies.
So it turned out that videos were resurrecting unsupervised learning from both
sides: they were showing the limitations of supervised learning and at the same
time they were showing that a neural network can self-train by using the information available in the environment (information captured by the videos themselves).
Video analysis requires world knowledge and world knowledge is contained in videos.
This was not an accident: every animal is a manifestation of the same loop.
Animals need world knowledge in order to act in their environment, and they
acquire that knowledge precisely by acting in the environment.
We learn what objects do and what to do with objects, and we learn it by
interacting with them. The challenge is to design robots that can do the same:
learn about the world while interacting with the world, accumulate knowledge
about everyday objects and use it when appropriate.
Inspired by the new discipline of developmental cognitive neuroscience (that took the name from Mark Johnson's 1996 book), three famous pioneers of Japanese robotics (Minoru Asada, Hiroshi Ishiguro and Yasuo Kuniyoshi) advocated "cognitive developmental robotics" ("Cognitive Developmental Robotics As a New Paradigm for the Design of Humanoid Robots", 2001).
Equivalently, Juyang Weng at Michigan State University and others called for a robotics of "autonomous mental development" ("Autonomous Mental Development by Robots and Animals," 2001).
Developmental robotics, again, called for unsupervised learning.
Before deep learning, most approaches to learning representations of videos in an unsupervised way were based on "independent component analysis". Johannes van Hateren and Daniel Ruderman at Groningen University in the Netherlands pioneered the field ("Independent Component Analysis of Natural Image Sequences Yields Spatio-temporal Filters Similar to Simple Cells in Primary Visual Cortex", 1998).
One has to realize upfront that deep learning is not necessary for predicting what is going to happen in a scene. This was done, for example, by Abhinav Gupta's student Jacob Walker at Carnegie Mellon University ("Patch to the Future", 2014). Using a rather traditional method (the 30-year-old Kanade-Lucas tracking algorithm), his program learned in a completely unsupervised manner from a large collection of videos what happens next in scenes of traffic.
Also relatively traditional was the approach of Michael Ryoo at ETRI in Korea, whose predictor of human activity predicts the next frame in a video of human action ("Human Activity Recognition", 2011). This was, incidentally, an important work because it emphasized the importance of "prediction": classifying what humans did in the past is not enough if, for example, you want to prevent a crime; you also need to be able to predict what they are about to do before they do it. Ryoo used an old-fashioned histogram-based approach.
The main reason to use deep learning is to get closer to what the brain does,
hoping that this will also lead to better results.
Barlow's intuition was used by Hossein Mobahi, working with Ronan Collobert and Jason Weston at University of Illinois, for an unsupervised learning model (a deep convolutional network) based on features that are adjacent in time ("Deep Learning from Temporal Coherence in Video", 2009).
Then in 2012 came the autoencoder built by Andrew Ng's group that recognized cats in still frames of videos, a project that proved deep learning could be a useful method for video analysis.
One more finding from neuroscience fueled progress in video analysis.
Rajesh Rao and Dana Ballard at the Salk Institute described
brains as predictive systems:
the early stages of sensory processing in the brain learn the statistical regularities in the environment and transmit to the next stages of processing only the sensory input that is not redundant.
The predictable components of the input are removed at the very beginning,
and only what is not predictable reaches the subsequent stages.
The sensory input gets compressed into a more efficient form before it is
forwarded to other regions of the brain
("Predictive Coding in the Visual Cortex", 1999).
The principle of predictive coding, originally developed for the visual system
of the fly by Mandyam Srinivasan, Simon Laughlin and Andreas Dubs at the Australian National University ("Predictive Coding", 1982),
was soon applied to many other brain areas, including the auditory system, the hippocampus and the frontal cortex.
Predictive coding views cortical functions as a process in which top-down information predicts bottom-up information, and inhibits all bottom-up information that fits with the prediction, thus allowing only errors to propagate upwards.
This simple principle actually constitutes a very efficient way
to "code" new information about the world.
Karl Friston at University College London summarized the activity of the brain as a process to minimize prediction error, and also expressed it in terms of
minimizing the "free energy" of the brain, a thermodynamic formulation that could lead to a unified theory of mind and life ("Learning and Inference in the Brain", 2003).
Incidentally, the underlying principle of encoding only the "unexpected" and discarding the "predictable" is the same one used in audio and video compression methods such as MPEG.
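The principle of transmitting only what was not predicted can be illustrated with the simplest possible predictor: "the next value will be the same as the current one". The following is a toy sketch, not a model of the cortex nor of any real compression standard; the function names and the sample signal are invented for illustration. The encoder sends only the prediction errors, and the decoder, which runs the same predictor, rebuilds the original signal from them.

```python
def encode(signal):
    """Send only the part of each value that the predictor failed to anticipate."""
    residuals = []
    prediction = 0  # before seeing anything, predict 0
    for value in signal:
        residuals.append(value - prediction)
        prediction = value  # predict that the next value repeats this one
    return residuals


def decode(residuals):
    """Rebuild the signal by adding each residual to the same running prediction."""
    signal = []
    prediction = 0
    for residual in residuals:
        value = prediction + residual
        signal.append(value)
        prediction = value
    return signal


signal = [10, 10, 10, 12, 12, 12]
residuals = encode(signal)
print(residuals)          # mostly zeros: only the two "surprises" carry information
print(decode(residuals))  # the original signal is recovered exactly
```

When the world behaves as expected the residual is zero and nothing new needs to be coded; only the two moments at which the signal changes produce a non-zero message.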
Meanwhile, Daniel Felleman and David Van Essen ("Distributed Hierarchical Processing in the Primate Cerebral Cortex", 1991) had shown that the cortex is layered (we know of at least six layers) and hierarchical, and that each layer learns more abstract concepts.
At about the same time, David Mumford had modeled the visual cortex as a hierarchy in which loops integrate top-down expectations and bottom-up observations via probabilistic (Bayesian) inference ("On The Computational Architecture Of The Neocortex II", 1992), an idea refined a decade later with Tai-sing Lee of Carnegie Mellon University ("Hierarchical Bayesian Inference in the Visual Cortex", 2003). This came to be known as the "Bayesian brain hypothesis" after a book by Kenji Doya and others titled "Bayesian Brain" (2007).
Jeff Hawkins merged these threads in his book "On Intelligence" (2004), and envisioned a general neocortical algorithm that is basically a prediction algorithm, and a general process of learning that is basically just a process of optimizing prediction.
Andy Clark at the University of Edinburgh summarized this view of the brain as a "hierarchical generative model that aims to minimize prediction error within a bidirectional cascade of cortical processing" ("Predictive Brains, Situated Agents, and the Future of Cognitive Science", 2013).
Learning is a delicate dance between top-down predictions and bottom-up inputs that either validate those predictions (and are therefore discarded) or invalidate them (in which case they trigger new coding).
Viewing the brain as a "predictive network" established a new paradigm for neural networks.
Rasmus Palm's thesis at the Technical University of Denmark ("Prediction as a Candidate for Learning Deep Hierarchical Models of Data", 2012) showed that a "predictive" autoencoder is a far better candidate for learning than the original "reconstructive" autoencoder. The "predictive" encoder is a particular kind of denoising autoencoder that, instead of reconstructing the input, tries to predict future input from the inputs received so far. In order to succeed, it must have encoded the previous inputs into a suitable representation to make a prediction about the next input. For the record, Palm's predictive encoder is similar to the "conditional restricted Boltzmann machine" designed by Geoffrey Hinton's student Graham Taylor ("Two Distributed-State Models For Generating High-Dimensional Time Series", 2011).
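The difference between the two kinds of autoencoder lies entirely in what the network is asked to output, and can be shown by how the training examples are built. The sketch below is only an illustration of that difference, not Palm's code: the function names, the history length and the frame labels are assumptions made here.

```python
def reconstructive_pairs(sequence):
    """Classic autoencoder: the target is the input itself."""
    return [(item, item) for item in sequence]


def predictive_pairs(sequence, history=2):
    """Predictive autoencoder: the input is the recent past, the target is the next item."""
    return [
        (tuple(sequence[i - history:i]), sequence[i])
        for i in range(history, len(sequence))
    ]


frames = ["f0", "f1", "f2", "f3"]
print(reconstructive_pairs(frames))
print(predictive_pairs(frames))
```

In the first case the network can succeed by merely copying; in the second it can succeed only if its internal code captures how the sequence evolves, which is the reason why prediction forces a more useful representation.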
Rakesh Chalasani and Jose Principe at the University of Florida implemented the ideas of Friston and Rao-Ballard ("Deep Predictive Coding Networks", 2013).
In the case of video analysis, a neural network trained to predict the next frame in a video is implicitly learning an efficient representation of the world depicted in that video: the objects and the structure of the scene.
This was the strategy followed by
Vincent Michalski at the Goethe University in Germany in designing his multi-layer neural network ("Modeling Deep Temporal Dependencies with Recurrent Grammar Cells", 2014),
by Marc'Aurelio Ranzato (now at Facebook) for a recurrent neural network that also borrowed ideas from Mikolov's language model of 2010 ("Video Language Modeling", 2014),
by Bill Lotter at Harvard University for his convolutional LSTM network PredNet ("Unsupervised Learning of Visual Structure Using Predictive Generative Networks", 2015),
and by Ruslan Salakhutdinov's student Nitish Srivastava at the University of Toronto for his coupled LSTMs, an encoder that is trained with the initial frames to build the representation and a decoder that predicts the next frame based on that representation ("Unsupervised Learning Of Video Representations Using LSTMs", 2015).
These systems had learned to predict pixels. Antonio Torralba's student Carl Vondrick at MIT ("Anticipating Visual Representations from Unlabeled Video", 2016) built a system (a variant of AlexNet with three more fully connected layers) to predict not just future pixels but the future of complex concepts such as objects and actions, and without learning a visual representation.
Xiaolong Wang at Carnegie Mellon University trained a convolutional neural network with hundreds of thousands of unlabeled videos to learn visual representations via visual tracking. Visual tracking (following the object while the video is rolling) provides the equivalent of "supervision" ("Unsupervised Learning of Visual Representations using Videos", 2015).
Jitendra Malik's student Pulkit Agrawal used the information obtained by a moving camera, the camera of a self-driving car ("Learning to See by Moving", 2015). His KittiNet coupled pre-training (training a convolutional neural network on a pretext task that is not the target one) and fine-tuning (adapting the network to the real task).
Alexei Efros' student Carl Doersch at UC Berkeley, instead of temporal correlations, used spatial correlations as the training surrogate: he designed a convolutional network to predict the position of the patch of an image relative to another one, with the pairs of patches picked at random, a pair per image. As the network learned this task, it started discovering categories such as "cat" and "bird" ("Unsupervised Visual Representation Learning by Context Prediction", 2016).
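A minimal sketch in Python shows why this kind of training needs no human labels: the label (the position of one patch relative to the other) is produced by the sampling procedure itself. The names `NEIGHBOUR_OFFSETS`, `crop` and `sample_patch_pair`, the eight-neighbour layout and the toy image are assumptions of this sketch, not the code of the paper, and a real system would also have to stop the network from exploiting low-level shortcuts such as edges that continue across adjacent patches.

```python
import random

# Offsets (row, column) of the 8 neighbours around a centre patch, in reading order.
# The index into this list is the "free" label that the network must predict.
NEIGHBOUR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
                     (0, -1),           (0, 1),
                     (1, -1),  (1, 0),  (1, 1)]


def crop(image, top, left, size):
    """Return a size x size patch of a 2-D list of pixel values."""
    return [row[left:left + size] for row in image[top:top + size]]


def sample_patch_pair(image, size, rng):
    """Pick a centre patch and one neighbour; the label is the neighbour's position."""
    height, width = len(image), len(image[0])
    # Keep the centre far enough from the border that every neighbour fits.
    top = rng.randrange(size, height - 2 * size + 1)
    left = rng.randrange(size, width - 2 * size + 1)
    label = rng.randrange(len(NEIGHBOUR_OFFSETS))
    d_row, d_col = NEIGHBOUR_OFFSETS[label]
    centre = crop(image, top, left, size)
    neighbour = crop(image, top + d_row * size, left + d_col * size, size)
    return centre, neighbour, label


# Toy 16 x 16 "image" whose pixel value encodes its own position (hypothetical data).
rng = random.Random(0)
image = [[r * 16 + c for c in range(16)] for r in range(16)]
centre, neighbour, label = sample_patch_pair(image, 4, rng)
print(label, NEIGHBOUR_OFFSETS[label])
```

A classifier fed the two patches and trained to output `label` would be solving the pretext task; the features it develops along the way are what gets reused later.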
The goal of representation learning is to build internal representations of the world that can later be used for machine-learning tasks.
"Self-supervised learning" is a smart way to achieve representation learning. "Self-supervised learning" is a particular kind of unsupervised learning in which the network uses information implicit in the environment to train itself.
Doersch uses the relative spatial co-location of patches in images,
Wang uses object correspondence obtained through tracking in videos,
and Agrawal uses information obtained by a moving camera.
In all of these cases the representation learned through "self-supervised learning" can be used for applications of object identification and object classification.
Mehdi Noroozi and Paolo Favaro at the University of Bern in Switzerland pre-trained a convolutional network called Context-Free Network (a variation of AlexNet) to solve jigsaw puzzles. Then the same network was used to classify and detect objects ("Unsupervised Learning of Visual Representations by Solving Jigsaw Puzzles", 2017). They used Agrawal's model: pre-training (in this case, training the network to solve jigsaw puzzles) and fine-tuning (adapting the network to the classification or detection task). It turns out that during the pre-training the network learned something about the structure of objects, knowledge that is useful in those other tasks. In this case the self-supervision used information that is available within a single image: one can fragment any image into arbitrary tiles and turn it into a jigsaw puzzle.
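The way a single image becomes a jigsaw puzzle with a known solution can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 3 x 3 grid, the helper names (`split_into_tiles`, `random_permutations`, `make_puzzle`), the choice of ten random permutations and the framing of the puzzle as a classification over a fixed set of permutations are all choices of this sketch.

```python
import random

GRID = 3  # 3 x 3 tiles (an assumption of this sketch)


def split_into_tiles(image, grid=GRID):
    """Cut a square 2-D list into grid * grid tiles, in reading order."""
    tile = len(image) // grid
    tiles = []
    for r in range(grid):
        for c in range(grid):
            tiles.append([row[c * tile:(c + 1) * tile]
                          for row in image[r * tile:(r + 1) * tile]])
    return tiles


def random_permutations(count, n, rng):
    """Draw `count` distinct permutations of range(n); each one is a class."""
    perms = set()
    while len(perms) < count:
        perms.add(tuple(rng.sample(range(n), n)))
    return sorted(perms)


def make_puzzle(image, permutations, rng):
    """Shuffle the tiles with a randomly chosen permutation; the label is its index."""
    tiles = split_into_tiles(image)
    label = rng.randrange(len(permutations))
    shuffled = [tiles[i] for i in permutations[label]]
    return shuffled, label


rng = random.Random(0)
PERMUTATIONS = random_permutations(10, GRID * GRID, rng)
# Toy 6 x 6 "image" whose pixel value encodes its own position (hypothetical data).
image = [[r * 6 + c for c in range(6)] for r in range(6)]
shuffled, label = make_puzzle(image, PERMUTATIONS, rng)
print(label, len(shuffled))
```

The network sees only `shuffled` and must output `label`; since the label is generated by the shuffling itself, any unlabeled image collection becomes a training set.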
Phillip Isola at UC Berkeley trained his deep neural network with the rate of co-occurrence in space and time of objects (how often they are found in the same picture or video frame) so that the network can then predict whether two objects are likely to be found next to each other in space or time ("Learning Visual Groups from Co-occurrences in Space and Time", 2016). A practical application was to group photographs by theme (ocean views, mountain landscapes, sunsets, etc).
Tinghui Zhou, working in Noah Snavely's team at Google, designed an (unsupervised) CNN for predicting the three-dimensional structure of a scene given a sequence of images ("Unsupervised Learning of Depth and Ego-Motion from Video", 2017). Noah Snavely had previously helped John Flynn of Zoox build DeepStereo, a network trained end-to-end with a large number of images taken from different viewpoints for the purpose of synthesizing a new view of a scene ("Learning to Predict New Views from the World’s Imagery", 2016).
Another unsupervised strategy to learn video representations consisted in training a CNN to verify that a sequence of frames corresponds to the correct order in a video. This was done by both Ishan Misra at Carnegie Mellon University ("Shuffle and Learn - Unsupervised Learning Using Temporal Order Verification", 2016) and Hsin-Ying Lee at UC Merced ("Unsupervised Representation Learning by Sorting Sequences", 2017).
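The order-verification idea can also be sketched in Python. The function name `sample_order_example`, the clip length of three frames and the use of frame indices as stand-ins for real frames are assumptions of this sketch, not details taken from either paper.

```python
import random


def sample_order_example(frames, length, rng):
    """Return (clip, label): label is 1 if the clip is in temporal order, else 0."""
    # Requires length >= 2, otherwise no out-of-order clip exists.
    indices = sorted(rng.sample(range(len(frames)), length))
    label = rng.randrange(2)
    if label == 0:
        shuffled = indices[:]
        while shuffled == indices:  # make sure the negative really is out of order
            rng.shuffle(shuffled)
        indices = shuffled
    return [frames[i] for i in indices], label


# Frame indices stand in for the frames of a video (hypothetical data).
rng = random.Random(0)
frames = list(range(20))
clip, label = sample_order_example(frames, 3, rng)
print(clip, label)
```

A network trained to tell ordered clips from shuffled ones has to notice how objects and poses change over time, which is the kind of knowledge that transfers to action recognition.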
Deepak Pathak at UC Berkeley used motion cues (again, the Gestalt principle of "common fate", that pixels that move together tend to belong together) to build simple pixel representations of objects found in the frames of a video, and then trained a CNN to predict these representations without having access to the motion cues: the network learned a higher-level representation that was then transferred to other recognition tasks ("Learning Features by Watching Objects Move", 2017).
There are cues in the environment that help us make sense of what is happening, and self-supervised networks can exploit those cues too (although it is not easy for us to realize which cues we ourselves use!). Chuang Gan at MIT thought of using geometry as auxiliary supervision for the self-supervised learning of video representations, and found that a CNN pre-trained with geometry cues can indeed understand more of what goes on in a video ("Geometry Guided Convolutional Neural Networks for Self-Supervised Video Representation Learning", 2018). Alas, the most accurate method for dynamic scene recognition remained the "shallow" method used by Christoph Feichtenhofer at Graz University in Austria ("Spacetime Forests with Complementary Features for Dynamic Scene Recognition", 2013).
Carl Vondrick, now at Google, built a self-supervised CNN that, while colorizing grayscale videos, learned to visually track objects in the videos without ever being trained explicitly for tracking, and in fact learned to track multiple objects ("Tracking Emerges by Colorizing Videos", 2018).
Andrew Zisserman's students Olivia Wiles and Sophia Koepke at the University of Oxford built two self-supervised frameworks/architectures, both trained using a large collection of video data with no manually labelled annotations. X2Face was a self-supervised framework for face puppeteering, i.e. giving a face the pose and expression of another face ("Self-supervised Learning from Watching Faces", 2018). Facial Attributes-Net (FAb-Net) was a self-supervised framework for learning a facial attribute representation that encodes information about pose and expression ("Self-supervised Learning of a Facial Attribute Embedding from Video", 2018).
Alexei Efros' student Andrew Owens at UC Berkeley learned a multisensory representation of a video by training a neural network to predict whether video and audio are aligned; the resulting network was then trained to visualize the location of a sound in a video frame or to recognize the action going on ("Audio-Visual Scene Analysis with Self-Supervised Multisensory Features", 2018).
There was a general strategy at work: the use of a "proxy task" to force the network to learn a higher-level visual representation that could then be used for pre-training the network for other visual tasks. Doersch's cropped patches, Wang's visual tracking, Agrawal's KittiNet, Mehdi Noroozi's jigsaw puzzles, Deepak Pathak's segmentations, Misra's sequential verification, Lee's sorting sequences, Noroozi's shuffled patches, and Owens' video-sound alignment were all examples of proxy tasks that pre-trained a neural network in an unsupervised manner. Indirectly, they all made the network generate a representation useful for other tasks.
Mehdi Noroozi and Paolo Favaro at the University of Bern later decoupled the self-supervised model from the task-specific model, thus obtaining a more efficient architecture that shrank the gap between the accuracy of models trained via self-supervised learning and models trained via supervised learning ("Boosting Self-Supervised Learning via Knowledge Transfer", 2018).
Most of these "proxy tasks" were still implemented on two-dimensional CNNs. By definition, such networks cannot properly capture a spatio-temporal representation. Dahun Kim at KAIST in Korea used a three-dimensional self-supervised task to train three-dimensional CNNs using a video dataset ("Self-Supervised Video Representation Learning with Space-Time Cubic Puzzles", 2018). Based on 3D ResNet-18, this system beat all the previous self-supervised systems on popular benchmarks by a significant margin.
All of these systems used convolutional nets and/or LSTMs.
Filip Piekniewski at UC San Diego started from the realization that Fukushima's neocognitron (the template for all convolutional networks) was only a timid approximation of the structure of the visual system.
His Predictive Vision Model (PVM) was inspired by more up-to-date neuroscience and by Jeff Hawkins' hierarchical temporal memory ("Unsupervised Learning from Continuous Video in a Scalable Predictive Recurrent Network", 2016).
The "neural abstraction pyramid", a hierarchical model of the brain's visual system originally proposed by Sven Behnke and Raul Rojas at the Free University of Berlin ("Neural Abstraction Pyramid", 1998), assumed that the visual cortex relies on both horizontal (lateral) and vertical (feedback and feedforward) loops as it transforms an image into a sequence of representations with increasing levels of abstraction and decreasing levels of detail.
Similarly, a decade later, Rodney Douglas and Kevan Martin at the Institute of Neuroinformatics in Switzerland found a lot of feedback connectivity in the neocortex, but also found that the connectivity tends to be local (neurons tend to talk to neighbouring neurons in the same region of the cortex whereas long-distance connections are rare), i.e. that "the local circuit is the heart of cortical computation" ("Recurrent Neuronal Circuits in the Neocortex", 2007).
PVM was therefore based on ubiquitous feedback connectivity, unlike deep learning, which mostly relies on feedforward connections: it was a hierarchy of heavily connected units (with both horizontal and vertical feedback), each a multilayer perceptron.
Like Wang's system, PVM too learned (unsupervised) from tracking the movement of objects relative to the observer. Additionally, it tried to construct a representation of the physical reality around the object.
Deep learning was based on end-to-end error propagation, whereas PVM was mainly interested in local prediction.
The video prediction model designed by Sergey Levine's student Chelsea Finn at UC Berkeley with help from Ian Goodfellow, called "convolutional dynamic neural advection" (CDNA), using a stack of LSTMs, explicitly predicted the motion of the objects encountered in a video, beyond single-frame prediction ("Unsupervised Learning for Physical Interaction Through Video Prediction", 2016). This system was meant to do more than learn: it was meant to "imagine" possible futures based on different courses of action.
Tracking objects in video is a fundamental problem in computer vision. It is essential to interacting with objects and, generally speaking, living a normal life in our ordinary world. Predicting what will happen next to the object that we are tracking is equally important, and natural for the human mind.
Robots need to understand a lot of things about the world. Today they understand less than what a worm understands. They need to understand: the effect of forces on objects (what happens if you push an object beyond the edge of a table?); the effect of movement of objects (where can the car go from where it is now?); and the effect of people's movement (what happens next when a person pulls out a gun?). These are all cases of "prediction". Animal brains are amazingly good at predicting (simulating) the future. Machines are still incredibly bad at it.