(These are excerpts from my book "Intelligence is not Artificial")
The Curse of the Large Dataset
The most damning evidence that A.I. has made very little conceptual progress towards human-level intelligence comes from an analysis of what truly contributed to A.I.'s most advertised successes: the algorithm or the training database? The algorithm is designed to learn an intelligent task, but it has to be trained on human-provided examples of that intelligent task.
Neural networks learn patterns. There is a pattern about neural networks themselves that has become the norm since the 1990s: an old technique stages a spectacular performance thanks to a large training dataset, as well as more powerful processors.
In 1997 Deep Blue used the NegaScout algorithm of 1983. The key to its success, besides its 30 massively parallel high-performance processors, was a dataset of 700,000 chess games played by masters, a dataset created in 1991 by IBM for the second of Feng-hsiung Hsu's chess-playing programs, Deep Thought 2.
In 2011 Watson utilized (quote) "90 clustered IBM Power 750 servers with 32 Power7 cores running at 3.55 GHz with four threads per core" and a dataset of 8.6 million documents culled from the Web in 2010, but its "intelligence" was Robert Jacobs' 20-year-old "mixture-of-experts" technique.
All the successes of convolutional neural networks after 2012 were based on Fukushima's 30-year-old technique but trained on the ImageNet dataset of one million labeled images created in 2009 by Fei-Fei Li.
DeepMind's celebrated videogame-playing of December 2013 used Chris Watkins' Q-learning algorithm of 1989 but trained on the Arcade Learning Environment dataset of Atari games developed in 2013 by Michael Bowling's team at the University of Alberta.
In 2015 Google's FaceNet used the dataset Labeled Faces in the Wild, a collection of digital pictures of celebrities tagged by Erik Learned-Miller's lab at the University of Massachusetts since 2007.
In 2016 AlphaGo used the dataset of millions of Go positions stored and ranked at the KGS Go Server (Kiseido Go Server).
It is easy to predict that the next breakthrough in Deep Learning will not come from a new conceptual discovery but from a new large dataset in some other domain of expertise. Progress in Deep Learning depends to a large extent on many human beings (typically PhD students) who manually accumulate a large body of facts. It is not terribly important what kind of neural network gets trained to use those data, as long as there are really a lot of data. The pattern looks like this: at first the dataset becomes very popular among hackers; then some of these hackers utilize an old-fashioned A.I. technique to train an artificial intelligence until it exhibits master-like skills in that domain.
Several popular datasets of manually-labeled images have been developed by various organizations over the years: FERET (1993) by Jonathon Phillips at the Army Research Laboratory in Maryland,
the NIST handwritten-digit dataset (1993) by the National Institute of Standards and Technology,
the ORL face dataset (1994) by Ferdinando Samaria at Olivetti's British labs,
the MNIST (Modified NIST) handwritten-digit dataset (1999) by LeCun at New York University, NORB (2004) also by LeCun's team at New York University, and so on, all the way to the Tiny Images Dataset started in 2007 by Antonio Torralba at MIT, which eventually grew to 80 million tiny images and from which Hinton's students extracted CIFAR-10 and CIFAR-100.
In 2006 Andrew Zisserman's group at Oxford built the dataset of annotated visual objects PASCAL VOC (which, believe it or not, stands for "Pattern Analysis Statistical-modelling And Computational Learning Visual Object Classes"), and in 2009 Fei-Fei Li (now at Stanford University) published the most famous of them all, ImageNet, which she had started while at Princeton. Its related challenge (first held in 2010) would launch the most famous names in deep learning (the 2010 challenge was won by a joint team of NEC Laboratories in Cupertino and the University of Illinois led by Yuanqing Lin, using the SIFT method).
In 2014 Larry Zitnick's team at Microsoft published the Microsoft dataset for image captioning named COCO, which stands for "Common Objects in Context" (Facebook hired Zitnick and his team members Ross Girshick of R-CNN fame and Piotr Dollár).
Ditto for large datasets of annotated speech, such as: the Switchboard-1 Telephone Speech Corpus, a project started by Texas Instruments in 1990, composed of approximately 2,400 telephone conversations; the Continuous Speech Recognition (CSR) Corpus, a dataset containing thousands of spoken articles, mostly from the Wall Street Journal, compiled in 1991 by Douglas Paul of MIT, in collaboration with Dragon Systems; and the TIMIT Acoustic-Phonetic Continuous Speech Corpus, started in 1993 by the Linguistic Data Consortium (LDC), the same organization that in 1996 released the Broadcast News corpus (30 hours of radio and television news broadcasts).
In 2013 Shih-Fu Chang's team at Columbia University released the Sentibank dataset for visual sentiment (emotions).
In 2014 Catalin Ionescu at the Institute of Mathematics of the Romanian Academy published Human3.6m, a dataset of videos of human motion useful to train robots. In 2017 Andrew Zisserman's group at Oxford released the Kinetics-600 dataset of 500,000 video clips, covering 600 human action classes with at least 600 video clips for each action class.
In 2016 Google released a dataset of eight million tagged YouTube videos called YouTube-8M.
Progress in these disciplines has largely followed the creation of these datasets.
If you want to predict what's coming next in A.I., look at the new datasets. Now there are even SEMAINE (Sustained Emotionally coloured Machine-human Interaction using Nonverbal Expression), developed in 2007 at Queen's University Belfast; Cam3D, built in 2011 at Cambridge University, a corpus of complex mental states captured using high-definition cameras and Kinect sensors; MAHNOB-HCI, created in 2011 by Maja Pantic's team at Imperial College London, a corpus of synchronized recordings of video, audio, and physiological data annotated with emotional tags; the EAGER dataset of spontaneous dynamic facial expressions, assembled in 2013 at Binghamton University;
FaceWarehouse, a dataset of three-dimensional facial expressions, published by Zhejiang University in 2014; the LSUN (Large-scale Scene Understanding) dataset, prepared by Jianxiong Xiao's student Fisher Yu at Princeton University in 2015; Ziwei Liu's CelebFaces Attributes (CelebA) dataset at the Chinese University of Hong Kong (2015);
Haolin Wei's corpus of human interactions, released in 2014 by Dublin City University in Ireland; the UM-corpus of English-Chinese pairs for machine translation, published in 2014 by Liang Tian at the University of Macau, as well as Longyue Wang's similar corpora derived from movie subtitles (TVsub and MVsub) in 2016 at Dublin City University in Ireland.
The Stanford Natural Language Inference (SNLI) dataset, released in 2015 by Sam Bowman at Stanford, consisted of 570,000 human-written pairs of English sentences to train neural networks for sentence representation.
In 2016 Percy Liang's team at Stanford developed the Stanford Question Answering Dataset (SQuAD), a reading comprehension dataset consisting of more than 100,000 questions,
as well as SCONE (Sequential CONtext-dependent Execution).
And in 2016 Jianfeng Gao's team at Microsoft published the Machine Reading Comprehension (MARCO) dataset of 100,000 queries with their corresponding answers.
The Large Movie Review Dataset (LMRD), released in 2011 by Andrew Ng's student Andrew Maas at Stanford University, consisted of 25,000 highly opinionated movie reviews that could be used to train a neural network for "sentiment analysis". The Stanford Sentiment Treebank, published in 2013 by Richard Socher at Stanford, contained "sentiment" labels for more than 200,000 phrases.
In 2018 three important datasets were introduced for natural-language processing: the CoQA (Conversational Question Answering) dataset by Siva Reddy and Danqi Chen (both now in Christopher Manning's group at Stanford University); the QuAC (Question Answering in Context) dataset, a collaboration among Percy Liang's group at Stanford University, Luke Zettlemoyer's group at the University of Washington and Mark Yatskar at the Allen Institute; and OpenAI's WebText, a dataset of eight million World-wide Web pages.
What do you do if a very large dataset tells you nothing? For example, in 2010 Naoki Nakaya of the Danish Cancer Society compiled a huge dataset of 60,000 people over 30 years that shows no correlation between personality traits and the likelihood of surviving cancer ("Personality Traits and Cancer Risk and Survival Based on Finnish and Swedish Registry Data", 2010). Now what? What we (humans) learn from this dataset is that we have to move on to some other research. What does the neural network learn from this dataset?
It is scary to think that neural networks always learn something from the data,
and those learned features will influence the way the networks will treat you,
possibly for the rest of your life.
Data is becoming destiny.
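The worry that a learner always extracts "something", even from data that contains nothing, can be illustrated with a toy experiment. What follows is only a minimal sketch under stated assumptions: the data are synthetic random numbers (not the cancer registry data), and a simple nearest-neighbour memorizer stands in for a neural network to keep the example short. All the names (make_noise, nearest_label, accuracy) are made up for this illustration.

```python
import random

def make_noise(rng, n, dims=5):
    # Features and labels are drawn independently of each other,
    # so by construction there is no pattern to be learned.
    return [([rng.random() for _ in range(dims)], rng.randint(0, 1))
            for _ in range(n)]

def nearest_label(train, point):
    # 1-nearest-neighbour "learner" (a stand-in, not a neural network):
    # answer with the label of the closest memorized training example.
    closest = min(train,
                  key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], point)))
    return closest[1]

def accuracy(train, examples):
    # Fraction of examples whose label the learner reproduces.
    hits = sum(nearest_label(train, x) == y for x, y in examples)
    return hits / len(examples)

rng = random.Random(0)            # fixed seed so the run is repeatable
train = make_noise(rng, 300)      # the "dataset that tells you nothing"
held_out = make_noise(rng, 400)   # new cases the learner has never seen

train_accuracy = accuracy(train, train)
held_out_accuracy = accuracy(train, held_out)

print(f"accuracy on the data it was trained on: {train_accuracy:.2f}")
print(f"accuracy on new cases: {held_out_accuracy:.2f}")
```

The learner reproduces every label it was trained on, because it has memorized them, yet on new cases it should do about as well as a coin toss. It "learned" something from a dataset that had nothing to teach, and it will go on applying that something to whoever comes next.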