(Piero Scaruffi and Juergen Schmidhuber)

Who is Juergen Schmidhuber

Artificial Intelligence was founded in 1956 at a famous conference organized by John McCarthy and Marvin Minsky, but for four decades it was mostly a curiosity, an inspiration for science-fiction movies. The modern history of Artificial Intelligence begins in the late 1990s with LeCun, Hinton, Bengio (the “Canadian” branch) and Juergen Schmidhuber. Schmidhuber studied in his native Germany but then moved to the Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, or IDSIA, in Switzerland, of which he is now the director. The theses written by his students have introduced some of the most influential concepts of what is now called “Deep Learning”. In 1997 he and his student Sepp Hochreiter published the Long Short-Term Memory (LSTM) model, which revolutionized neural networks. In 2006 his other student Alex Graves published a method that made LSTM useful for practical applications, the so-called "connectionist temporal classification" or CTC. It was Graves who obtained the first impressive results in tasks such as handwriting recognition and speech recognition. With Hinton’s and Bengio’s papers of 2006-07, this project marks the beginning of deep learning. In 2010 his student Dan Ciresan built a nine-layer neural net on an Nvidia graphics processor: it was one of the first projects that used a GPU (invented for videogames) to run a neural network. Today this is the norm. In 2013 Ciresan’s deep network achieved near-human performance in recognizing Chinese handwriting. This was the second major success story of deep learning after AlexNet, the network developed by Hinton’s students at the University of Toronto for image recognition. In 2015 Schmidhuber's team introduced the Highway Network, probably the very first neural network with over 100 layers. Meanwhile one of his students, Shane Legg, had co-founded DeepMind in London, which was acquired by Google and in 2016 became famous for AlphaGo. Two more students of Schmidhuber joined DeepMind: Alex Graves and Daan Wierstra.
Now his colleagues in Artificial Intelligence are also rediscovering his studies that weren’t influential at the time. For example, his 1987 thesis employed genetic algorithms for meta-learning, i.e. learning how to learn new things, a topic that has become very popular (AlphaGo is great at playing weiqi, but can only do that one thing, whereas you can learn an infinite number of things). In 1991 he developed a theory of how an A.I. system could become “curious” about the world and start learning new things by itself, another topic that today is becoming popular.


The Interview

I traveled with Piero Scaruffi to Switzerland to meet Schmidhuber and discuss the state of A.I. We met at a café next to the church of Lugano.

 

PS: Okay so you studied in Germany?

S: I was born in Munich, Bavaria, which is in the southern part of Germany, and when I was a boy, I realized that I am not very smart, but maybe I can be smart enough to build something that can become smarter than I am, that can solve all the problems that I cannot solve myself. And so I studied artificial intelligence, how to build learning machines.

PS: So when was this? In the 80s, when you started studying artificial intelligence?

S: Yeah, that’s right…

PS: At that time AI was not very popular. Mid-80s it started going down…

S: Back then it was a traditional kind of AI, with expert systems, and we were improving and some of that was important. But it didn’t lead to the general purpose AI that I wanted to build. Something that can solve anything. And in my dissertation in 1987, I tried to solve the fundamental goal of AI: not building a machine that learns something here and learns something there, but one that learns the learning algorithm itself. It learns to improve itself but also learns the way to improve itself. So, without any limitations except for the limitations of computability, logic and physics. And that was my diploma thesis. And that was 30 years ago, when computers were a million times more expensive than now; for the same price, they were a million times slower than now. Back then there was a trend that said every 5 years computers are getting ten times cheaper. That’s an old trend that started at least 75 years ago. It started at least in 1941 or maybe even earlier. In 1941 Konrad Zuse built the first program-controlled computer, and this computer could do roughly one operation per second, but then a few years later one could do 100 operations for the same price. And 20 years later you could do 10,000 operations for the same price, and 30 years later you could do 1 million operations for the same price, and today, 75 years later or so, you can do a million billion operations for the same price. And for the first time we are in the realm of small animal brains. In the not so distant future, if the trend doesn’t break, we are going to have really small computational devices which can compute as much as a human brain. And then it will take maybe another 50 years until, at the same price, you can get a small computational device that can compute as much as all the human brains put together.
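
As a rough check of the arithmetic in this answer, the trend of "ten times cheaper every 5 years" can be sketched in a few lines of Python. The function name, the 1941 baseline of one operation per second, and the choice of 2016 as "today" are assumptions made for illustration; only the factor of ten per five years and the 75-year span come from the interview.

```python
# Sketch of the "ten times cheaper every 5 years" trend mentioned above.
# Assumption: 1 operation per second for a fixed price in 1941 (baseline).

def ops_per_second(year, base_year=1941, factor=10, period_years=5):
    """Operations per second at constant price, if the trend holds exactly."""
    periods = (year - base_year) // period_years
    return factor ** periods

# 10, 20, 30 and 75 years after the baseline
for year in (1951, 1961, 1971, 2016):
    print(year, ops_per_second(year))
```

Under these assumptions the values after 10, 20 and 30 years are 100, 10,000 and 1 million operations, and after 75 years the result is 10 to the power 15, that is, a million billion.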

PS: Let’s go back a little. So you got your diploma thesis with this very ambitious project. And then you came here (IDSIA in Lugano, Switzerland)?

S: No, it took some time. In 1987 I published my thesis. It defined what I was working on for the next few decades. In 1991 I had my Ph.D. thesis. It was also about learning machines, artificial neural networks that learn to become smarter over time, using experience to solve problems that couldn’t be solved before. My focus always has been on general purpose computers and neural networks.

PS: A couple of questions. You got into neural networks when a) they were basically unfeasible with those machines and b) they were not terribly popular.

S: That’s not quite true, because the field that we call deep learning today, artificial networks that learn from experience, started in the 50s, and then it became what people call deep learning in 1965. What we call deep learning started in the Soviet Union in 1965, through the work of Ivakhnenko and Lapa. They had the first deep networks with many layers, with an arbitrary number of layers, and they were able to learn internal representations of data…

PS: Did they really implement it or was it just mathematical theory?

S: It was the algorithm, and they implemented and applied it, and in the subsequent decades lots of people in Eastern Europe used it for all kinds of applications. So Ivakhnenko and Lapa started this deep learning as we call it today. They didn’t call it deep learning, they had a different language. And the methods that they used in the 60s were still in use in the 2000s. So some people always have been interested in that. So, what we did then, in the 90s, was focus on more general networks, because what Ivakhnenko and Lapa had in the 60s was one layer feeding into the next and then into the next and so on, and you can do really interesting things with many parts, but they are not general purpose computers. If you want general purpose computers you need feedback connections, which allow systems to memorize stuff, to memorize what the system saw before. For instance, when you see a video you want to keep a short-term memory such that you can make decisions based on what happened before. And if you want to have a general purpose computer you need to have something like that.

PS: At that point, were you aware of John Hopfield’s work?

S: Yeah. But this was a limited kind of recurrent network in 1981 and you couldn’t use it to memorise. It’s a kind of recurrent network that settles down to a stationary state, a fixed point, and you couldn’t use it to store data.

PS: Psychologically, his work was important to prove that neural networks were feasible. After Minsky’s book, there was a general feeling that neural networks went the wrong way, in the United States that was the feeling, and then John Hopfield changed the psychology…

S: You mentioned Minsky’s book, and there is a myth which says that Minsky and Papert in 1969 killed neural network research by publishing this book about the limitations of shallow networks with a single layer, perceptrons. It’s a total myth, because 4 years before this book came out, Ivakhnenko already had deep networks. If Minsky was a scientist and aware of the other things that happened, it is not obvious in the book that he knew. Maybe it was a Cold War thing, I don’t know. And then later the myth became that this book killed neural network research and that during the 80s it reawakened. But no, neural network research was alive in Eastern Europe, and then in Japan Fukushima did his thing in the 70s, the basics of convolutional networks. There were many places where people were still interested in neural networks.

PS: Do you feel that these places were outside the United States and that Minsky did have an influence over the United States?

S: I guess he had a lot of influence in the United States, and the basic inventions in neural networks were made outside the United States. For instance, in the Soviet Union, which was leading in many fields of science. It had started the Space Age, they had the biggest bomb ever, they had the first man in space, the first woman in space, the first machines on the moon, the first machine on another planet, and they had many of the best mathematicians, and one of them was Ivakhnenko. Today it wouldn’t be called the Soviet Union any longer; the place where he did that was Ukraine.

PS: OK, so this thing was very alive in Europe. I was in California at that time and expert systems were ruling. And then you get into this field when, it was ’89 when you developed LSTM networks.

S: LSTM was born in the 90s. Before that people already knew about recurrent networks, a concept that goes back a long time, but they didn’t work. LSTM was born in Munich in the 90s, and was developed by my brilliant student Sepp Hochreiter, who was my first student ever, and then later by Felix Gers, another German guy, but he was already working in my Swiss lab. So I started that in Munich and then continued in Switzerland, with a couple of students who were working on their post-docs, funded mostly by Swiss taxpayers and also by other European taxpayers, and then it became more or less what it is today around 2005 or 2006. Alex Graves is another important PhD student, who made LSTM practical.

PS: So, in 2005-2006 Hinton and others worked out their version of deep learning. How did the two compare?

S: So Lecun did important stuff, long before them. What Lecun did in the 80s was that he combined the convolutional architecture of Fukushima with a technique called backpropagation. It was first published in its modern form by a Finnish guy, Seppo Linnainmaa, in 1970.

PS: So this was before Werbos, before [inaudible].

S: Long before that. So the modern form of backpropagation, with discrete connections, with sparse networks, the reverse mode of automatic differentiation, was published as a master’s thesis in 1970 by Seppo Linnainmaa. The basics of backpropagation, the way of applying the chain rule in sequential systems and hierarchical systems, go back to the 1960s, and the guy who should be mentioned there is Kelley.

PS: It was called control theory.

S: And a colleague of his, Bryson. But the modern version that everyone is using now, not only for neural networks but also for all kinds of differentiable networks, goes back to Seppo Linnainmaa. And the first guy to apply this to neural networks was Paul Werbos in 1982. What Lecun did was combine it with the very useful convolutional architecture, which is very good for vision and all kinds of two-dimensional data, and which can be traced back to Fukushima’s work in the late 1970s.

PS: Was this LeNet1 or the following versions?

S: The original one was in 1989…

PS: They already had backpropagation?

S: Yes, because Fukushima also had deep networks, but he didn’t use backpropagation to train them. Instead, he had unsupervised learning rules to train them. It was great for computer vision, and Lecun proved that in many subsequent papers. And my team was the one that made it so fast that you could win competitions with it. And the guy who made that possible was my post-doc Dan Ciresan. And in 2011, through the work of Dan Ciresan, our team could start winning all these competitions. In computer vision, in Silicon Valley for example, there was the traffic sign recognition competition.

PS: So you were using convolutional neural networks…

S: That was in 2011. Subsequently, the team here, and also Ueli Meier, Jonathan Masci and others, were able to win a whole series of competitions with convolutional networks that we hadn’t invented; others had invented them, Fukushima, Lecun and people working in Lecun’s lab. So that was what we did in 2011. In this field of convolutional networks our contribution was limited to making them fast on GPUs, so we could win all these competitions.

PS: LSTM was born out of the need for giving computers a memory. Could you explain why memory was so important for neural networks?

S: Suppose you want to recognize speech. Suppose somebody says 7 and the same guy says 11: the ending, “-even”, is almost the same. So you have to memorise: did he say “s-” before he said “-even”, or did he say “el-” before he said “-even”? “Even” by itself takes maybe 50 time steps [inaudible]. And so you have to have a device that memorises things which occurred 50 steps ago, at least. And if you want to understand more complicated things, you have to look back thousands of steps, maybe millions of steps. Then you have to have a recurrent network which learns to do that from experience, which learns to put the important stuff into the memory and cuts out the noise, and that’s what the research we have been conducting since the early 90s was focusing on.

PS: Somebody who is a software engineer can ask why you don’t use a traditional file on a hard disk.

S: You have to store the past; you could have a video stored on tape or whatever. But you not only want to store the data, you also want to learn from the data, so that it also works with different videos or different speech signals. For that type of generalization, the network has to learn what is the important thing in the video or the speech signal, and what is unimportant. For example, if you have videos and you want to classify them: is there a flying bird in this video, or is there a sitting bird, or is there no bird at all? This question should be answered under all kinds of illumination conditions, not only for this video but for the millions and billions of videos where there are birds. For that you need something more sophisticated than a trail of information. This is what the neural networks can learn. They can learn to extract features from the data, which are abstract representations of what has happened, spatio-temporal events that have happened in the past. And the cool thing is that then they can generalize, so that in the future they can recognize even unseen videos or unheard speech.

PS: Excellent! Now, one of your students co-founded DeepMind.

S: Shane Legg, and he was one of three founders of DeepMind, that’s right.

PS: Okay so, Shane Legg. What do you think of DeepMind, how did these guys get started? I mean, out of the blue this place in London gets so much credibility that Google just buys them and gives them all this computational power. You must be very jealous?

S: Shane was one of the three co-founders, and his friend was Daan Wierstra, who was number 4 in DeepMind, also a student here in my lab. They did their PhDs roughly at the same time; Daan was working on reinforcement learning and Shane was working on machine superintelligence, and he wrote about some limitations of general purpose AI. These were the first two in DeepMind who really had PhDs and publications on artificial intelligence and machine learning. The other two co-founders were Mustafa Suleyman, who was a business guy, and their frontman Demis Hassabis, who had done important work in biological neuroscience, how the brain works and stuff like that. DeepMind today is mostly doing machine learning and AI and they have had great results. AlphaGo is a very famous result.

PS: One interesting footnote: Shane is from New Zealand, Daan is Dutch, you are German, and Alex Graves is from?

S: Scotland. Whenever we hire people, we try to get the best. You have a little research lab, here in Switzerland, and then you have an open position and then a worldwide announcement and the probability is low that you will get a Swiss guy.

PS: Tell me something in a few minutes about the research lab.

S: Yes, this lab was created in 1988 by a rich Italian guy, Angelo Dalle Molle, who discovered that it is easier to establish a foundation in Switzerland than in his home country. He became rich through automotive but he is also the creator of the drink Cynar, and he wanted to improve people’s lives, and so he created this foundation about artificial intelligence in order to improve human lives.

PS: And you joined when?

S: In 1995, and I had a director position, and then I co-directed it with my co-director Luca Maria Gambardella, who was a pioneer in so-called “swarm intelligence”, together with Marco Dorigo, who also worked here for a while and who wrote the standard works on swarm intelligence and ant-colony optimization. They found ways of having swarms of rather stupid agents collectively solve complicated problems. And that became famous, and Dorigo and Gambardella had lots of citations.

PS: OK, so let’s fast forward to 2012 – 2016, when AI became very popular. I mean, what’s your assessment? Last year we had statements from Hinton, who sounded unhappy with convolutional networks; Lecun has been critical of probabilities. On the other hand, there is this hype in the press, AlphaGo, Pix2pix, so many exciting things. What’s your assessment, where are we?

S: I don’t care much for recent comments like that because I have been trying all these decades to build intelligent machines which solve one new problem after another, and which also learn to improve their skills through learning algorithms, and I am dedicated to our old line of research, which we think is the only one which is about learning programs for general purpose computers. If you want to build a general problem solver which learns one new skill on top of the old ones, there are ways of putting networks together so that they act in a way that is smarter and smarter. Over decades we have found ways of improving and speeding up systems like that and I still think that’s the future. You have the neural network which is translating the incoming stream of data into actions, and the goal of this network is to maximize rewards and solve problems, to minimize pain for instance. Say the goal is to reach the charging station three times a day, for a robot which has a battery: whenever the battery is empty or near empty, it will get hungry, with negative signals coming from the battery, and the goal is to reach that charging station without bumping into painful obstacles on the way, because you want to minimize pain. And it is not easy to learn and implement; complex behavior like that is difficult. Now what we have done since the 1990s is that we give this little robot a model of the world which it can use to predict the next thing. So it can predict the future given its actions and its history, and you can use predictive coding to compress the data that is coming in, and find regularities in the data. This is new in the sense that there was a time when the robot didn’t know that this regularity existed, and then it discovers something about the world, for instance gravity. It sees a video or creates a video of a hundred falling apples, and then it realizes there is a regularity: all these apples fall down in the same way.
And you can predict how they fall down by seeing a couple of images, and you can predict what the next image will be like. Which means that to the extent you can predict the data, you can compress the data through predictive coding, and because you can predict, you don’t have to guess. And all the regularities of the world are reflected in compressibilities like that. And then our agent, our little robot who is trying to optimize its performance in its environment, can invent new problems, new self-invented experiments that lead to more data, which then can have the property that it can learn something about the world that it doesn’t know yet. Artificial curiosity, we call that. To me these are old problems that we identified in 1990: predictive coding, artificial curiosity, self-invented problems, systems that do not just slavishly imitate what humans tell them, as with backpropagation and a teacher that tells you what to do. So what you have is a general purpose agent who solves one problem after another, using a model of the world that can predict what happens next, which can then be used for planning action sequences, and this can be compressed into two recurrent networks, and in a sense we can have a single recurrent network. So I have my own list where I say these are the things that have to be solved in order to get to the long-standing goal of general purpose AI. I have little interest in a tiny little improvement in neural networks here, or a tiny little improvement there.

PS: At one point you mentioned “model of the world”. I assume that’s not a small detail. How do you represent the world?

S: How do you represent the model of the world, how do you predict the world, how do you do that? With a neural network, with a recurrent neural network, you can look at all the data that you have ever observed. What is the data that you have observed? It is all the inputs and observations that came in, and it is shaped by your actions. So you act and see, act and see, act and see, so you have a growing history of past actions and observations. So everything you know about the world is contained in this history of things. Now you can try to find regularities by prediction: predict the next thing given the past things. And a recurrent network can learn to do that. And as it is learning to do that, it becomes a better and better predictor, and at the same time it is becoming a better and better compressor, because whatever you can predict, you don’t have to store extra. That’s the predictive coding part. As a consequence, this recurrent network has to invent all kinds of sub-programs, parallel and sequential sub-programs that are useful for predicting what occurs. And this is the most important thing that the predictor does: it is inventing, or generating, little sub-routines that represent the workings of the world. For example, you have gravity, and you can have a little sub-routine for that, so that it is going to have predictions for falling apples, based on previous apples. When you have the first frame of a video of a falling apple, you can predict the next one, to a certain extent, not every time. Because gravity is not the only thing going on in this world, it’s just one of the things. So you learn more and more about the regularities of the world, and like babies, our agents are motivated to come up with self-invented problems, with new experiments that lead to more data, which contains a regularity that they didn’t know yet.
So, systems that create their own problems that don’t just imitate humans, that are unsupervised but still active, and they generate the data through their actions, like a scientist who’s generating the data through his experiments, a composer who is generating data through his actions.
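
The link between prediction and compression described in this answer can be illustrated with a toy sketch in Python. It is not Schmidhuber's system: the predictor here is a fixed, hand-written extrapolation rule (an assumption standing in for a learned recurrent network), the "video" is reduced to one number per frame, and all function names are hypothetical.

```python
import struct
import zlib

def predict_next(history):
    """Extrapolate the next sample, assuming constant acceleration."""
    a, b, c = history[-3:]
    return 3 * c - 3 * b + a

def encode(samples):
    """Keep the first three samples, then store only the prediction errors."""
    codes = list(samples[:3])
    for t in range(3, len(samples)):
        codes.append(samples[t] - predict_next(samples[:t]))
    return codes

def decode(codes):
    """Rebuild the samples by re-running the predictor and adding the errors."""
    samples = list(codes[:3])
    for error in codes[3:]:
        samples.append(predict_next(samples) + error)
    return samples

def compressed_size(values):
    """Bytes needed after packing as 32-bit integers and applying zlib."""
    raw = struct.pack(f"<{len(values)}i", *values)
    return len(zlib.compress(raw))

# Toy "falling apple": distance fallen grows with the square of the frame index.
heights = [t * t for t in range(200)]
codes = encode(heights)

print(decode(codes) == heights)                        # lossless reconstruction
print(compressed_size(heights), compressed_size(codes))
```

Because the rule predicts this particular sequence exactly, every stored error after the first three samples is zero, and the stream of errors compresses far better than the raw frames. Whatever the predictor gets right does not have to be stored again.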

PS: I think that this would be useful. For instance, developing a drug is a very complicated experimental process, that’s where they have to do a lot of experiments, come up with a new kind of experiment and you see the result. Is that something that you guys talk about?

S: For that you can use some of our systems. There are millions of applications when you apply prediction machines to all kinds of data where you want to find regularities, for example in your case: how you can predict certain chemical reactions or biological reactions from the history of previous things that you have observed, and how you can steer processes that in the end yield certain desired quantities of certain things. But that’s a special case; there are lots of applications of the general purpose systems that I have just described.

PS: Two things I meant to ask you, just out of curiosity. People tend to forget that at the first 1956 A.I. conference, it wasn’t just knowledge based systems and neural networks, there was also Ray Solomonoff, who was one of the people who stayed there the whole time. What do you think of his theories?

S: Ray Solomonoff was great. We had him here as a visiting professor at the Swiss AI lab, and he was always very interested in our stuff and we were very interested in his stuff, because he was the first one who had a mathematically optimal type of prediction machine. He had a universal prediction machine combining reasoning and computer science. He had this thing called the universal prior, which is the sum of all computable probability distributions. It was an infinite sum, but with that you can show that you can predict any computable process, where the probabilities of the next thing depend in a computable way on the previous observations. You can, with the universal prior, outperform any other predictor. So he had the first universal prediction machine. And then a post-doc, a very senior guy in my lab, Marcus Hutter, a German guy who was also working here in Switzerland, was able to generalize the theory of Solomonoff to the active case, where you not only have passive predictions, but where you act and perceive and act and perceive, that is, you are shaping the data through your actions. He found optimal ways of acting, at least optimal in a mathematical sense, although it is not a very practical thing, which is the reason we are still in business. Ray Solomonoff was a visiting professor in 2005, and in many ways his work was central to what we had been doing.
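
The universal prior itself cannot be computed, but the underlying idea of predicting with a weighted sum of models can be sketched in a finite toy version. The Python below is only an illustration under stated assumptions: five fixed coin-flip models stand in for "all computable distributions", the prior is uniform, and the function names and test sequence are invented for the example.

```python
import math

def mixture_predict(weights, thetas):
    """Probability that the next bit is 1 under the weighted sum of models."""
    return sum(w * th for w, th in zip(weights, thetas))

def update(weights, thetas, bit):
    """Bayes rule: reweight each model by how well it predicted `bit`."""
    likelihoods = [th if bit == 1 else 1.0 - th for th in thetas]
    posterior = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(posterior)
    return [p / total for p in posterior]

def run(bits, thetas):
    """Predict each bit in turn; return final weights and total log-loss."""
    weights = [1.0 / len(thetas)] * len(thetas)   # uniform prior (assumption)
    loss = 0.0
    for bit in bits:
        p_one = mixture_predict(weights, thetas)
        loss -= math.log(p_one if bit == 1 else 1.0 - p_one)
        weights = update(weights, thetas, bit)
    return weights, loss

def model_loss(bits, theta):
    """Total log-loss of a single fixed model on the same sequence."""
    return -sum(math.log(theta if b == 1 else 1.0 - theta) for b in bits)

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]     # hypothetical finite model class
bits = ([1] * 9 + [0]) * 20            # a sequence that is 90% ones
weights, mix_loss = run(bits, thetas)
best_loss = min(model_loss(bits, th) for th in thetas)

print(weights.index(max(weights)))                # index of the dominant model
print(mix_loss <= best_loss + math.log(len(thetas)))
```

In this finite setting the mixture's total log-loss exceeds that of the best single model by at most the logarithm of the number of models. That is a scaled-down analogue of the sense in which a prior that sums over many distributions cannot be beaten by much by any one of them.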

PS: The perception, and this could be another myth, is that he had a brilliant theory as a mathematician, even back then, but that it was computationally impossible. Is it possible?

S: If I scale it down it becomes possible. But in the universal case it is not possible to implement it efficiently on a finite computer, because the universal prior is a sum over all computable distributions, which means you have to sum over all possible computations, and this is something that you can do only in theory. So that’s the reason it is not practical. However, you can scale it down with these results and come up with more limited systems which are optimal in a more limited sense, where everything is finite, and then you can still be optimal in this limited setting. So these results that he obtained between 1960 and 1978 are also relevant for practically feasible solutions. But before the 1956 Dartmouth conference that you referred to, there was the Paris conference in 1950, which many people consider the first conference on Artificial Intelligence, and which had a French name, human thought and machine thought. That was when Norbert Wiener played against probably the world’s first AI machine, which was a chess end-game player by the Spanish inventor Leonardo Torres. So who started practical AI? If somebody started practical AI it was Torres, in 1912/14, so in the early years of the 20th century. He had the first automaton that could play chess end-games. Back then, playing chess was considered an intelligent activity, and he had the practically working thing; he was the pioneer of practical AI. Decades later his machine played against Norbert Wiener, the cybernetics guy. Back then there was a dispute about what it should be called, cybernetics, but AI won. But much before that the same thing existed under a different name.

PS: The other school I mentioned was the knowledge based school, the symbolic school, the expert systems. What do you think of it today?

S: In many ways it was great and fundamental. The symbolic school goes back to Goedel himself in 1931, and he founded theoretical computer science: he had the first universal language, and he used the integers to create a universal language in which he could formalize anything, any computation, any theorem-proving process. Thus he laid the foundation for all this theorem-proving work that became so important in the 60s and 70s, as expert systems. What Goedel did was he had this universal language in which he could express any computational theorem-proving procedure. And then he had this great way of inventing self-referential statements, that say things such as “I am not provable through a computational theorem-proving procedure”. He did that in 1931. And either the statement is true and you cannot prove it, or all of mathematics is flawed. So he showed both the possibility and the limitations of artificial intelligence. In 1931. Five years later, three other guys built on this work and facilitated the universal language, and these guys were Church with the lambda calculus in 1935, Turing with the universal machine in 1936, and Post with the Post calculus, same thing, and they all turned out to be equally powerful, and so that’s how theoretical computer science got going. But also AI got going, because the limitations of computer science are also the limitations of AI. And, I am fast-forwarding, this business became really important in the 60s and 70s, and lots of expert systems are really simple theorem provers. You have a couple of axioms and from there you derive the consequences, and for many applications that’s great. But it is just not enough to build a real AI that interacts with the world, which doesn’t know the axioms of the world, and which learns through experience how the world works, like a physicist, and uses that knowledge to solve more and more problems. That is what we have been doing since the 80s.


Grand Finale

S: So, let’s zoom back and try to see the position of mankind in the history of the universe. It started 13.8 billion years ago, there was a Big Bang, and now we take 1/1000th of that time: 13 million years ago, the first hominids emerged, our ancestors. And we take 1/1000th of that time and you come out 13,000 years ago, and something important started back then: the first animals were domesticated, agriculture was invented, the beginnings of civilization, and you see our civilization is 1/1,000,000th of the world’s history. The first guy who had agriculture was almost the same guy who had a spacecraft in 1957. All of human history and all of our civilisation is just a flash. Now very soon you’re going to have the first AIs that really deserve the name AI. And then maybe it will take 13 years, if we divide again by 1000, and everything is going to change beyond recognition. What is going to happen? When the AIs are supersmart, and I have no doubt that they are going to be much smarter than we are, they are going to realize what we realized a long time ago, which is: almost all of our resources are not in our biosphere. They are in space, where there is a billion times more sunlight in our solar system alone. Of course, most of the AIs will emigrate and there will be an expanding robot civilization, an AI civilisation that will have trillions of AIs, using self-replicating robot factories to expand into space. And it is not going to stop in our solar system, because most resources are not in the solar system. So the AI is going to expand in the Milky Way. Within a few hundred thousand years, it’s going to colonise all the Milky Way and it’s going to cover it with senders and receivers, so that the AIs can travel as they are already travelling in my lab, which is by radio from senders to receivers. Establishing the infrastructure is going to take some time, but once you have that you can travel at light speed, and it’s not going to stop there.
Because the Universe is still young, it is only 13.8 billion years old, and it’s going to be a thousand times older than that. Within a few tens of billions of years, totally within the limits of light speed, totally within the limits of physics, all of our visible universe is going to be colonized and transformed by Artificial Intelligence, not by humans. Humans will not be able to follow them, humans are not going to exist, but it’s okay. We are going to see the beauty and the grandeur by realizing that we are part of the grander scheme of things, which is driving the Universe from a lower complexity to a higher complexity. Now let’s multiply the current age of the Universe again by a factor of one thousand. What’s going to happen? 13 thousand billion years from now, in that distant future, they are going to look back and say: Look, almost immediately after the Big Bang, after 13 billion years, the universe started getting intelligent. So, this is much more than an industrial revolution, it is something that transcends humankind itself, life itself; it’s comparable to something that happened 3.5 billion years ago. A new sort of life is going to transform all the cosmos. It’s a privilege to live at the time when you see the beginnings of that and can shape this beginning a little bit.