(These are excerpts from my book "Intelligence is not Artificial")
Deep Reinforcement Learning: A Brief History of Artificial Intelligence / Part 7
The game of go/weiqi had been a favorite field of research since the birth of deep learning.
In 2006 Remi Coulom in France applied Ulam's old Monte Carlo method to the problem of searching a tree data structure (one of the most common problems in computational math, but particularly difficult when the data are game moves), thus obtaining the Monte Carlo Tree Search algorithm, and applied it to go/weiqi ("Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search", 2006). At the same time Levente Kocsis and Csaba Szepesvari developed the UCT algorithm (Upper Confidence Bounds applied to Trees), an application of the bandit algorithm to Monte Carlo search ("Bandit Based Monte-Carlo Planning", 2006); the multi-armed bandit is a problem in probability theory first studied by Herbert Robbins in 1952 at the Institute for Advanced Study ("Some Aspects of the Sequential Design of Experiments", 1952).
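At the heart of UCT is the UCB1 selection rule: at each tree node, pick the child that maximizes its average reward plus an exploration bonus that grows for rarely visited children. A minimal sketch (illustrative, not Kocsis and Szepesvari's code):

```python
import math

def ucb1_select(children):
    """UCB1 selection at a tree node (the rule at the heart of UCT).
    children: list of (total_reward, visit_count) pairs, each visited at least once."""
    total_visits = sum(n for _, n in children)
    c = math.sqrt(2)  # exploration constant, tuned in practice

    def score(child):
        reward, visits = child
        # average reward plus exploration bonus for under-visited children
        return reward / visits + c * math.sqrt(math.log(total_visits) / visits)

    return max(range(len(children)), key=lambda i: score(children[i]))
```

Note how a barely explored child can outscore a sibling with a better average: that is how the search balances exploiting good moves against exploring neglected ones.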
These algorithms dramatically improved the chances of machines beating go masters: in 2009 Fuego (developed at the University of Alberta) beat Zhou Junxun; in 2010 MoGoTW (developed by a French-Taiwanese team) beat Catalin Taranu; in 2012 Tencho no Igo/Zen (developed by Yoji Ojima) beat Takemiya Masaki; in 2013 Crazy Stone (by Remi Coulom) beat Yoshio Ishida; and in 2016 AlphaGo (developed by Google's DeepMind) beat Lee Sedol. DeepMind's victory was widely advertised. AlphaGo used a slightly modified Monte Carlo algorithm but, more importantly, it taught itself by playing against itself. AlphaGo's neural network was initially trained on 150,000 games played by go/weiqi masters.
By January 2017, when it briefly debuted online disguised under the moniker "Master", AlphaGo had played 200 million games; in May 2017 it beat the world's top-ranked player, Ke Jie.
DeepMind had previously combined convolutional networks with reinforcement learning to train a neural network to play videogames: in 2013 Volodymyr Mnih and others trained convolutional networks with a variant of Q-learning ("Playing Atari with Deep Reinforcement Learning", 2013), and later improved the policy function with the "asynchronous advantage actor-critic" algorithm, or A3C ("Asynchronous Methods for Deep Reinforcement Learning", 2016).
This Deep Q-Network (DQN) was the first in a progression of deep reinforcement learning methods developed by DeepMind, leading to AlphaGo and then AlphaZero.
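DQN's starting point was the classic Q-learning update, which nudges the estimated value of a state-action pair toward the observed reward plus the discounted value of the best next action. A minimal tabular sketch (illustrative; DQN replaces the table with a deep network):

```python
def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a').
    q is a dict mapping (state, action) -> estimated value; DQN replaces this
    table with a convolutional network that outputs all action values at once."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next  # bootstrapped target
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (target - old)
```

Deep Q-learning keeps the same target but fits it by gradient descent on the network's parameters instead of updating a table entry.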
Deep Q-learning suffers from some inherent limitations and requires a lot of computational power, which translates not only into cost but also into slow response times. DeepMind had rediscovered Lin's "experience replay" of 1992 as a remedy for the Atari-playing network. Next, DeepMind rediscovered policy-gradient methods for reinforcement learning (although different from the likelihood-ratio method of REINFORCE): for example, David Silver's "deterministic policy gradient" of 2014 and Nicolas Heess' "stochastic value gradient" of 2015. Volodymyr Mnih's team worked on asynchronous gradient-descent algorithms that run multiple agents in parallel instead of using experience replay. In 2015 Arun Nair and others designed the General Reinforcement Learning Architecture (Gorila), which greatly improved performance on those Atari games by performing that kind of asynchronous training of reinforcement-learning agents (100 separate actor-learner processes).
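Experience replay itself is a simple idea: store transitions in a buffer and sample them at random, breaking the correlation between consecutive frames that destabilizes training. A minimal sketch (illustrative, not DeepMind's implementation):

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay (Lin, 1992): store transitions and sample them at
    random, breaking the temporal correlation of consecutive frames."""

    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off the end

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random minibatch, decorrelated from the current episode
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The asynchronous methods mentioned above achieve a similar decorrelation by running many agents in parallel instead of storing old transitions.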
In 2016 the DeepMind team also adapted deep Q-learning to the more general domain of continuous action ("Continuous Control with Deep Reinforcement Learning", 2016).
Ironically, few people noticed that in September 2015 Matthew Lai unveiled an open-source chess engine called Giraffe that used deep reinforcement learning to teach itself to play chess at international-master level in 72 hours. It was designed by just one person and ran on the humble computer of his department at Imperial College London. (Lai was hired by Google DeepMind in January 2016, two months before AlphaGo's exploit against Lee Sedol.)
In 2016 Toyota demonstrated a self-teaching car, another application of deep reinforcement learning like AlphaGo: a number of cars are left to roam the territory randomly, with the only rule that they must avoid accidents. After a while, the cars learn how to drive properly in the streets.
Poker was another game targeted by A.I. scientists, so much so that the University of Alberta even set up a Computer Poker Research Group. There, in 2007, Michael Bowling's group developed an algorithm called Counterfactual Regret Minimization, or CFR ("Regret Minimization in Games with Incomplete Information", 2007), based on the "regret matching" algorithm invented in 2000 by Sergiu Hart and Andreu Mas-Colell at the Einstein Institute of Mathematics in Israel ("A Simple Adaptive Procedure Leading to Correlated Equilibrium", 2000). These are techniques of self-play: given some rules describing a game, the algorithm plays against itself and develops its own strategy for playing the game better and better. It is yet another form of reinforcement learning, except that in this case reinforcement learning is used to devise the strategy from scratch, not to learn the strategy used by humans. The goal of CFR and its numerous variants is to approximate solutions for imperfect-information games such as poker. CFR variants became the algorithms of choice for the "poker bots" used in computer poker competitions. In 2015 Bowling's team developed Cepheus, and Tuomas Sandholm at Carnegie Mellon University developed Claudico, which played professionals at a Pittsburgh casino (Pennsylvania). Claudico lost, but in 2017 Libratus, created by the same group, won. Libratus employed a new algorithm, CFR+, introduced in 2014 by the Finnish programmer Oskari Tammelin ("Solving Large Imperfect Information Games Using CFR+", 2014), which learns much faster than previous versions of CFR. However, the setting was absolutely unnatural, in particular to rule out card luck. It is safe to state that no human players had ever played poker in such a setting before. But it was telling that the machine started winning when the number of players was reduced and the duration of the tournament was extended: more players in a shorter time beat Claudico, but fewer players over a longer time lost to Libratus.
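The regret-matching step at the heart of CFR is surprisingly simple: each action is played with probability proportional to its accumulated positive regret, i.e. how much better the player would have done by always choosing that action. A minimal sketch of the strategy-derivation step (illustrative, not Hart and Mas-Colell's notation):

```python
def regret_matching(regrets):
    """Turn cumulative regrets into a mixed strategy (Hart and Mas-Colell, 2000):
    play each action with probability proportional to its positive regret."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total == 0:
        return [1.0 / len(regrets)] * len(regrets)  # no positive regret: play uniformly
    return [p / total for p in positive]
```

CFR applies this rule at every decision point of the game tree and averages the strategies over many self-play iterations, which is what converges toward an approximate equilibrium.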
Deep reinforcement learning has three main problems: it requires a lot of data, which is unnatural (animals can learn something even if they have seen it only once or twice); it fails all too easily in situations that differ just slightly from the situations for which the system was trained; and its internal workings are largely inscrutable, which makes us uncomfortable using it on mission-critical projects.
A different strand from DeepMind's A3C harkens back to the "relational Markov decision process", introduced by Carlos Guestrin ("Planning under Uncertainty in Complex Structured Environments", 2003), and to the object-oriented Markov decision process, introduced by Michael Littman's student Carlos Diuk at Rutgers University ("An Object-Oriented Representation for Efficient Reinforcement Learning", 2008). The relational Markov decision process blends probabilistic and relational representations with the express goal of planning action in the real world.
Marta Garnelo's "deep symbolic reinforcement learning" at Imperial College London ("Towards Deep Symbolic Reinforcement Learning", 2016) and Dileep George's "schema networks", which transfer experience from one situation to other similar situations ("Schema Networks", 2017), built upon these ideas, basically wedding first-order logic representation with deep reinforcement learning.
Yet another route led to Trust Region Policy Optimization (TRPO), introduced in 2015 by Pieter Abbeel's student John Schulman at UC Berkeley as an alternative to DeepMind's policy-gradient methods (at least for continuous-control tasks). DeepMind itself contributed an alternative to policy-gradient methods: ACER, developed by Ziyu Wang and others of Nando de Freitas' team ("Sample Efficient Actor-Critic with Experience Replay", 2016), a variant of the Retrace method developed by their colleague Remi Munos ("Safe and Efficient Off-Policy Reinforcement Learning", 2016). Schulman himself improved TRPO with PPO, "proximal policy optimization", released by OpenAI in 2017 ("Proximal Policy Optimization Algorithms", 2017). DQN, A3C, TRPO, ACER and PPO were just the tip of the iceberg: algorithms for the optimization of reinforcement learning multiplied and evolved rapidly after the success of these techniques in learning to play games.
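PPO's core idea fits in a few lines: clip the probability ratio between the new and old policies so that a single update cannot move the policy too far, which is TRPO's trust-region intuition without the expensive second-order machinery. A sketch of the clipped surrogate objective for a single sample (simplified from the 2017 paper):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate for one sample.
    ratio is pi_new(a|s) / pi_old(a|s); clipping it to [1 - eps, 1 + eps]
    removes the incentive to change the policy beyond the trust region."""
    clipped_ratio = max(min(ratio, 1.0 + eps), 1.0 - eps)
    # taking the min makes the objective a pessimistic (lower) bound
    return min(ratio * advantage, clipped_ratio * advantage)
```

In practice this quantity is averaged over a batch of sampled transitions and maximized by stochastic gradient ascent on the policy parameters.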
"Policy search" methods, such as Marc Deisenroth's and Carl Edward Rasmussen's PILCO (2011), were invented to make robots automatically learn new behaviors through experience. These methods were used successfully to train both the vision and the motor components of a robot so that it could perform manipulation tasks. Ideally, the vision system should adapt to the goal of the task, but this cannot happen if the two are trained separately. Sergey Levine and his student Chelsea Finn at UC Berkeley developed a method for training the visual system and the control system of a robot at the same time. In this way the vision system can adapt to the goals of the task, a much more efficient way to "see" things in the world and, incidentally, the way all animals look at the world ("End-to-End Training of Deep Visuomotor Policies", 2016). Their "guided policy search" turned policy search into a kind of supervised learning whose training data are generated iteratively, an approach similar to "visual servoing", pioneered 25 years earlier by Bernard Espiau and Patrick Rives at INRIA in France ("A New Approach to Visual Servoing in Robotics", 1992), and to the "Bregman Alternating Direction Method of Multipliers" (2014) introduced by Huahua Wang and Arindam Banerjee at the University of Minnesota. The difference is that Levine and Finn used a new convolutional architecture for both policies to automatically capture spatial information about the scene.
Most successes in reinforcement learning had taken place in single-agent scenarios, with the advantage of an environment that remained largely unchanged. However, the real world is almost never a stationary environment because it involves multiple agents. Multi-agent learning is inherently more complicated. Traditionally, two extreme approaches were attempted. The decentralized-learning approach dates back at least to Independent Q-learning, developed by Ming Tan at GTE Labs in Boston ("Multi-agent Reinforcement Learning: Independent vs. Cooperative Agents", 1993), and was updated to DQN by Raul Vicente's students at the University of Tartu in Estonia ("Multiagent Cooperation and Competition with Deep Reinforcement Learning", 2015). The centralized-learning approach dates back at least to Carlos Guestrin at Stanford ("Multiagent Planning with Factored MDPs", 2002), and later deep-learning implementations include Sainbayar Sukhbaatar's CommNet at New York University ("Learning Multiagent Communication with Backpropagation", 2016) and Alibaba's BiCNet in China ("Multiagent Bidirectionally-Coordinated Nets", 2017). Neither approach was satisfactory.
Eventually, a hybrid approach became popular. The paradigm of centralized training with decentralized execution was pioneered by Frans Oliehoek at the University of Amsterdam ("Exploiting Locality of Interaction in Factored Decentralized POMDPs", 2008). Ryan Lowe of McGill University and Yi Wu of UC Berkeley called it "decentralized actor, centralized critic" ("Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments", 2017). Hybrid systems of this kind were developed by Jayesh Gupta at Stanford ("Cooperative Multi-agent Control Using Deep Reinforcement Learning", 2017), by DeepMind ("Value-Decomposition Networks for Cooperative Multi-Agent Learning", 2017) and by Shimon Whiteson's team at Oxford University, with the QMIX method ("Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning", 2018).
Progress in reinforcement learning for robots was helped by virtual environments in which one could test robot behavior. In 2016 OpenAI released the toolkit OpenAI Gym, and in 2017 OpenAI released Roboschool, a software environment to create real-world simulations for training robots.
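Gym's lasting contribution was a convention: an environment exposes reset(), returning an initial observation, and step(action), returning an observation, a reward, a done flag and an info dictionary. A self-contained sketch of that interface with a toy environment (ToyEnv is hypothetical, not the real library):

```python
class ToyEnv:
    """A toy environment with the Gym-style interface: reach position 5
    on a line by stepping right (action 1) or left (action 0)."""

    def reset(self):
        self.pos = 0
        return self.pos  # initial observation

    def step(self, action):
        self.pos += 1 if action == 1 else -1
        done = self.pos >= 5
        reward = 1.0 if done else 0.0
        return self.pos, reward, done, {}  # observation, reward, done, info

def rollout(env, policy, max_steps=100):
    """Run one episode under the given policy and return the total reward."""
    obs = env.reset()
    total = 0.0
    for _ in range(max_steps):
        obs, reward, done, _info = env.step(policy(obs))
        total += reward
        if done:
            break
    return total
```

Because any learning algorithm can be written against this loop, the same agent code can be benchmarked on dozens of environments, which is what made such toolkits so useful.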
In 2017 Oriol Vinyals at DeepMind introduced the "StarCraft II Learning Environment" (SC2LE), a reinforcement-learning environment based on the StarCraft II videogame, to test these systems.
A wide variety of reinforcement-learning algorithms existed by the end of the 2010s: Volodymyr Mnih's asynchronous advantage actor-critic (A3C), John Schulman's PPO, DeepMind's Ape-X ("Distributed Prioritized Experience Replay", 2018), DeepMind's Impala ("Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures", 2018), etc. More general frameworks for reinforcement learning included Intel's Coach ("Reinforcement Learning Coach by Intel", 2017), Google's TensorFlow Agents ("Efficient Batched Reinforcement Learning in TensorFlow", 2017), OpenAI's Baselines (2017), Cambridge University's TensorForce ("A TensorFlow Library for Applied Reinforcement Learning", 2017), Eric Liang's RLlib at UC Berkeley (2018), etc.
If 2016 had been the year of reinforcement learning, by 2017 it was becoming an easy target for all sorts of attacks. First of all, several studies showed that other forms of machine learning could replicate the feats of DeepMind: Jeff Clune's team at Uber AI Labs ("Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning", 2017), Tim Salimans at OpenAI ("Evolution Strategies as a Scalable Alternative to Reinforcement Learning", 2017), as well as Ben Recht's students Horia Mania and Aurelia Guy at UC Berkeley ("Simple Random Search Provides a Competitive Approach to Reinforcement Learning", 2018), showed that much simpler methods (genetic algorithms, evolution strategies, even random search) could do as well on many tasks. Secondly, Sham Kakade's student Aravind Rajeswaran at the University of Washington showed that multilayered neural networks are an unnecessary complication for robotic tasks of continuous control such as those of the OpenAI Gym environment ("Towards Generalization and Simplicity in Continuous Control", 2017). For example, Nicolas Heess and others at DeepMind used deep reinforcement learning to train simulated bodies to walk, run, jump and turn ("Emergence of Locomotion Behaviours in Rich Environments", 2017). The training required 64 CPUs running for over 100 hours. The video was impressive, but similar results had been obtained five years earlier by Emanuel Todorov's students Yuval Tassa and Tom Erez (both later hired by DeepMind) at the University of Washington, running on much smaller and slower hardware and using not reinforcement learning but a humbler method, online trajectory optimization (aka model predictive control), on their physics simulator MuJoCo ("Synthesis and Stabilization of Complex Behaviors through Online Trajectory Optimization", 2012).
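The evolution strategies that Salimans championed really are much simpler than backpropagation-based reinforcement learning: perturb the parameters with random noise, score each perturbation, and step along the reward-weighted average of the noise. A minimal sketch (illustrative, not OpenAI's implementation):

```python
import random

def evolution_strategies(fitness, theta, sigma=0.1, lr=0.05, pop=50, iters=100, seed=0):
    """Black-box evolution strategies: estimate a search direction from the
    fitness of randomly perturbed parameter vectors. No gradients of the
    fitness function are ever computed."""
    rng = random.Random(seed)
    n = len(theta)
    for _ in range(iters):
        noises = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(pop)]
        rewards = [fitness([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noises]
        baseline = sum(rewards) / pop  # subtracting a baseline reduces variance
        # move each parameter along the reward-weighted average noise
        theta = [t + (lr / (pop * sigma)) * sum((r - baseline) * eps[i]
                                                for r, eps in zip(rewards, noises))
                 for i, t in enumerate(theta)]
    return theta
```

For example, maximizing the fitness f(x) = -(x - 3)^2 from a start of x = 0 drives the parameter toward 3 without ever differentiating f, and each iteration's population can be evaluated in parallel, which is why the method scales so easily across machines.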
The original reason to develop the complex algorithms of A.I. was not that it was impossible to play chess or weiqi/go with simpler algorithms but rather that it was plain silly to do so: it would have required colossal amounts of computation on that slow and expensive hardware. A.I. in the age of AlphaGo seems to have reached a point where its complex algorithms themselves require colossal amounts of computation, which seems to contradict the whole point of doing A.I. Nobody ever proved formally that simple algorithms cannot achieve superhuman performance at chess or go/weiqi or any other deterministic game. The availability of computational power reduces the raison d'être of A.I.'s sophisticated algorithms. What's the point of developing multilayer networks if simple architectures can do the same job? The only reason would be to speed things up, or to run on cheaper hardware, but instead the opposite is happening: DeepMind and OpenAI have to throw more and more computational power at their algorithms. It was not surprising that, after the initial enthusiasm, A.I. scientists started revisiting the old simple algorithms that had been dismissed 20 years earlier. Some of them might indeed prove that A.I. unnecessarily complicated things because we forgot that the stumbling block was not a faulty theory but only slow hardware.
One of the biggest problems for reinforcement learning is that it requires a reward function: the reward function encodes what we want the system to learn. The problem is that designing the correct reward function is extremely difficult in the real world. If the reward function is not correct, reinforcement learning acts like an uncontrollable chain reaction.
Another unsolved problem in reinforcement learning is the problem of balancing exploration and exploitation, a general problem of mathematical optimization known as the "multi-armed bandit problem". The classic example is choosing a restaurant for dinner: you can eat at your favorite restaurant, where you will almost certainly get a good meal, or try a new restaurant, with a chance that you will discover a new favorite. This doesn't sound like a big deal, but apply the same reasoning to medical care and you can probably appreciate why it is. Exploitation consists in making the decision that maximizes the immediate return. Exploration consists in investing in gathering more knowledge about the environment, which in the long run might yield a higher return. Exploration is extremely expensive in reinforcement learning, so it tends to be neglected; but in some cases there is a huge price to pay for neglecting it. That's why DQN could play Atari games so well but not Montezuma's Revenge, a game that requires more exploration.
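The simplest compromise between the two is the epsilon-greedy strategy: with a small probability, try a random action (exploration); otherwise pick the action with the best estimated value (exploitation). A minimal sketch on a multi-armed bandit (illustrative):

```python
import random

def epsilon_greedy_bandit(true_means, steps=10000, eps=0.1, seed=0):
    """Epsilon-greedy on a multi-armed bandit: with probability eps pull a
    random arm (explore), otherwise pull the arm with the best estimate (exploit)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    for _ in range(steps):
        if rng.random() < eps:
            arm = rng.randrange(n_arms)  # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[arm], 1.0)  # noisy payoff
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean
    return estimates, counts
```

Given enough pulls the agent concentrates on the best arm while still sampling the others occasionally; a fixed eps, however, wastes pulls forever, which is exactly the kind of crude exploration that fails on games like Montezuma's Revenge, where rewards are too sparse for random actions to stumble upon them.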
Historically speaking, we have learned that the classic A.I. problems can be divided into categories based on a few factors, and it turns out that go belongs to the simplest category of A.I. problems. Problems that are deterministic, fully observable and known, discrete (a limited number of possible states) and static (the rules don't change) are actually "easy" to solve when we can afford to throw virtually infinite computational power at them. There is nothing in their formulation that makes them impossible to solve: they are "difficult" only because the number of possible solutions is very big. In the old days of slow, expensive hardware one would program the machine to look for the solution by simulating human intuition. In the age of cheap, fast hardware, one programs the machine to use brute force and simply try possible combinations until one turns out to be the correct solution. The game of weiqi/go is one such problem: "difficult" because of the sheer number of possible moves, but "easy" for a machine that can try an astronomical number of them.
The next step after weiqi/go was Dota 2, a multiplayer videogame of 2013 played daily by over one million people. In 2018 OpenAI announced a bot, OpenAI Five, that, like AlphaGo Zero, learned the game from scratch by self-play, without any knowledge of how humans play it. This problem is neither fully observable/known nor discrete nor static. OpenAI Five trained by playing 180 years' worth of games against itself every day, running on 256 GPUs and 128,000 CPU cores, and using a separate LSTM for each hero. It used proximal policy optimization.
In 2018 OpenAI used the same algorithm to teach a robotic hand, Dactyl, to manipulate cubes.
"That men do not learn very much from the lessons of history is the most important of all the lessons of history" (Aldous Huxley)
