Import AI: #86: Baidu releases a massive self-driving car dataset; DeepMind boosts AI capabilities via neural teachers; and what happens when AIs evolve to do dangerous, subversive things.

by Jack Clark

Boosting AI capabilities with neural teachers:
…AKA, why my small student with multiple expert teachers beats your larger, better-resourced teacherless student…
Research from DeepMind shows how to boost the performance of a given agent on a task by transferring knowledge from a pre-trained ‘teacher’ agent. The technique yields a significant speedup in training AI agents, and there’s some evidence that agents that are taught attain higher performance than non-taught ones. The technique comes in two flavors: single teacher and multi-teacher; agents pretrained via multiple specialized teachers do better than ones trained by a single entity, as expected.
  Strange and subtle: The approach has a few traits that seem helpful for the development of more sophisticated AI agents: in one task DeepMind tests it on, the agent needs to figure out how to use a short-term memory to attain a high score. ‘Small’ agents (which only have two convolutional layers) typically fail to learn to use a memory and therefore cannot achieve scores above a certain threshold, but by training a ‘small’ agent with multiple specialized teachers the researchers create one that can succeed at the task. “This is perhaps surprising because the kickstarting mechanism only guides the student agent in which action to take: it puts no constraint on how the student structures its internal memory state. However, the student can only predict the teacher’s behaviour by remembering information from before the respawn, which seems to be enough supervisory signal to drive short-term memory formation. We find this a wonderful parallel with how the best human educators teach: not telling the student what to think, but simply putting the student in a fruitful position to learn for themselves,” the researchers write.
  Why it matters: Trends like this suggest that scientists can speed their own research by using such pre-trained teachers to better evaluate new agents. This adds further credence to the notion that a key input to (some types of) AI research will shift from pre-labelled static datasets to compute (though it should be noted that data here is implicit in the form of a procedural, modifiable simulator that researchers can access). More speculatively, this means it may be possible to use mixtures of teachers to train complex agents that far exceed any of their forebears in capability – perhaps an area where the whole really will be greater than the sum of its parts.
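  A sketch of the idea: for intuition, here is a rough, illustrative version of what a kickstarting-style objective can look like: the student’s usual RL loss gets an extra distillation term nudging its policy towards the teacher’s, with a weight that would typically be annealed towards zero over training. This is my own toy PyTorch code under those assumptions, not DeepMind’s implementation, and all the names here are made up.

```python
# Toy kickstarting-style objective (illustrative, not DeepMind's code): the usual
# RL loss plus a distillation term pushing the student's policy towards the teacher's.
import torch
import torch.nn.functional as F

def kickstarting_loss(student_logits, teacher_logits, rl_loss, kick_weight=1.0):
    """Combine a placeholder RL loss with a policy-distillation term.
    In practice kick_weight would be annealed towards zero over training."""
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # Cross-entropy between the teacher's and student's action distributions.
    distill = -(teacher_probs * student_log_probs).sum(dim=-1).mean()
    return rl_loss + kick_weight * distill

# Toy usage with random logits standing in for real policy outputs.
student_logits = torch.randn(32, 6)    # batch of 32 observations, 6 actions
teacher_logits = torch.randn(32, 6)
rl_loss = torch.tensor(0.5)            # placeholder for the agent's usual RL objective
print(kickstarting_loss(student_logits, teacher_logits, rl_loss, kick_weight=0.5))
```

  Note that the teacher only supplies a target distribution over actions, nothing about the student’s internal state, which is exactly the property the researchers highlight in the quote above.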
Read more: Kickstarting Deep Reinforcement Learning (Arxiv).

100,000+ developer survey shows AI concerns:
…What developers think is dangerous and exciting, and who they think is responsible…
Developer community Stack Overflow has published the results of its annual user survey; this year it asked about AI:
– What developers think is “dangerous” re AI: Increasing automation of jobs (40.8%)
– What developers think is “exciting” re AI: AI surpassing human intelligence, aka the singularity (28%)
– Who is responsible for considering the ramifications of AI:
   – The developers or the people creating the AI: 47.8%
   – A governmental or other regulatory body: 27.9%
– Different roles = different concerns: People who identified as technical specialists tended to say they were more concerned about issues of fairness than the singularity, whereas designers and mobile developers tended to be more concerned about the singularity.
  Read more: Developer Survey Results 2018 (Stack Overflow).

Baidu, Toyota, and Berkeley researchers organize self-driving car challenge backed by new self-driving car dataset from Baidu:
…”ApolloScape” adds Chinese data for self-driving car researchers, plus Baidu says it has joined Berkeley’s “DeepDrive” self-driving car AI coalition…
A new competition and dataset may give researchers a better way to measure the capabilities and progression of autonomous car AI.
  The dataset: The ‘ApolloScape’ dataset from Baidu contains ~200,000 RGB images with corresponding pixel-by-pixel semantic annotation. Each frame is labeled from a set of 25 semantic classes that include: cars, motorcycles, sidewalks, traffic cones, trash cans, vegetation, and so on. Each image has a resolution of 3384 x 2710, and successive frames are separated by one meter of distance. 80,000 images have been released as of March 8, 2018.
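  What a frame looks like in practice: here is a hypothetical sketch of reading one frame of this kind of data: an RGB image plus a same-sized label image whose integer pixel values index the semantic classes. The file names, layout, and class subset below are illustrative assumptions rather than the dataset’s actual format.

```python
# Hypothetical loader for one ApolloScape-style frame: an RGB image plus a label
# image of the same height/width whose integer pixel values index semantic classes.
# File names and this class subset are assumptions made for illustration.
import numpy as np
from PIL import Image

CLASS_NAMES = {0: "car", 1: "motorcycle", 2: "sidewalk", 3: "traffic_cone",
               4: "trash_can", 5: "vegetation"}   # a made-up subset of the 25 classes

def load_frame(image_path, label_path):
    rgb = np.array(Image.open(image_path))       # e.g. 2710 x 3384 x 3
    labels = np.array(Image.open(label_path))    # 2710 x 3384, one class id per pixel
    return rgb, labels

def class_fractions(labels):
    """Fraction of pixels assigned to each class id, handy as a sanity check."""
    ids, counts = np.unique(labels, return_counts=True)
    return {int(i): count / labels.size for i, count in zip(ids, counts)}
```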
Read more about the dataset (potentially via Google Translate) here.
  Additional information: Many of the researchers linked to ApolloScape will be talking at a session on autonomous cars at the IEEE Intelligent Vehicles Symposium in China.
Competition: The new ‘WAD’ competition will give people a chance to test and develop AI systems on the ApolloScape dataset as well as a dataset from Berkeley DeepDrive (the DeepDrive dataset consists of 100,000 video clips, each about 40 seconds long, with one key frame from each clip annotated). There is about $10,000 in cash prizes available, and the researchers are soliciting papers on research techniques in: drivable area segmentation (figuring out which parts of a scene are safe to drive on); road object detection (figuring out what is on the road); and transfer learning from one semantic domain to another, specifically going from training on the Berkeley dataset (filmed in California, USA) to the ApolloScape dataset (filmed in Beijing, China).
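   How such tasks get scored: segmentation challenges like drivable area segmentation are typically evaluated with intersection-over-union between predicted and ground-truth pixel masks; here is a minimal sketch of that metric (the competition’s actual evaluation script may differ in its details).

```python
# Minimal intersection-over-union for a binary 'drivable' mask; the competition's
# real evaluation code may differ (per-class averaging, ignore labels, and so on).
import numpy as np

def iou(pred_mask, true_mask):
    """pred_mask, true_mask: boolean arrays marking pixels predicted/labelled drivable."""
    intersection = np.logical_and(pred_mask, true_mask).sum()
    union = np.logical_or(pred_mask, true_mask).sum()
    return intersection / union if union > 0 else 1.0

pred = np.zeros((4, 4), dtype=bool); pred[2:, :] = True   # model: bottom two rows drivable
true = np.zeros((4, 4), dtype=bool); true[1:, :] = True   # label: bottom three rows drivable
print(iou(pred, true))   # 8 / 12 = 0.666...
```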
   Read more about the ‘WAD’ competition here.

Microsoft releases a ‘Rosetta Stone’ for deep learning frameworks:
…GitHub repo gives you a couple of basic operations displayed in many different ways…
Microsoft has released a GitHub repository containing the same algorithms implemented in a variety of frameworks, including: Caffe2, Chainer, CNTK, Gluon, Keras (with CNTK/TensorFlow/Theano backends), TensorFlow, Lasagne, MXNet, PyTorch, and Knet (Julia). The idea here is that if you can read an algorithm in one of these frameworks you’ll be able to use that knowledge to understand its implementation in the others.
  “The repo we are releasing as a full version 1.0 today is like a Rosetta Stone for deep learning frameworks, showing the model building process end to end in the different frameworks,” write the researchers in a blog post that also provides some rough benchmarking for training time for a CNN and an RNN.
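  A flavor of the comparison: to show the kind of side-by-side reading the repo enables, here is the same tiny convolutional classifier written twice, once with Keras and once with PyTorch. These snippets are a sketch in the spirit of the repo rather than code copied from it.

```python
# The same tiny CNN written twice, once with Keras and once with PyTorch, as a
# sketch in the spirit of the repo (not code copied from it).

# Keras (TensorFlow backend): 32x32 RGB input, 10-way softmax classifier.
from tensorflow.keras import layers, models

keras_model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])

# PyTorch: the same architecture, with channels-first image tensors.
import torch.nn as nn

torch_model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 15 * 15, 10),   # 32x32 input, 30x30 after conv, 15x15 after pooling
    nn.Softmax(dim=-1),
)
```

  The structural correspondence (convolution, pooling, flatten, dense/softmax) is obvious once the two are laid next to each other, which is what makes the ‘Rosetta Stone’ framing apt.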
  Read more: Comparing Deep Learning Frameworks: A Rosetta Stone Approach (Microsoft Tech Net).
View the code examples (GitHub).

Evolution’s weird, wonderful, and potentially dangerous implications for AI agent design:
…And why the AI safety community may be able to learn from evolution…
A consortium of international researchers has published a collection of the weird, infuriating, and frequently funny ways in which evolutionary algorithms have figured out non-obvious solutions and hacks to the tasks they’re asked to solve. The paper includes an illuminating set of examples of ways in which algorithms have subverted the wishes of their human overseers, including:
– Opportunistic Somersaulting: When trying to evolve creatures to jump, some agents discovered that they could instead evolve very tall bodies and then somersault, earning a reward proportional to how far their feet got from the floor (see the toy sketch after this list).
– Pointless Programs: When researchers tried to use GenProg to evolve a fix for a buggy data-sorting program, GenProg evolved a solution that had the program return an empty list, which wasn’t penalized because an empty list can’t be out of order: it contains nothing to order.
– Physics Hacking: One robot figured out the correct vibrational frequency to surface a friction bug in the floor of an environment in a physics simulator, letting it propel itself across the ground via the bug.
– Evolution finds a way: Another kind of surprise is the way evolution can succeed even when researchers think success is impossible, like a six-legged robot that figured out how to walk fast without its feet touching the ground (solution: it flipped itself onto its back and used the movement of its legs to propel itself anyway).
– And so much more!
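  A toy version of the somersaulting hack: the fitness functions below are entirely made up, not those used in the original experiments, but they show how rewarding foot height alone can be gamed by a tall body that simply tips over, and how a stricter specification closes that particular loophole.

```python
# Entirely made-up fitness functions illustrating the mis-specification: rewarding
# how high the feet get lets a tall creature score by falling over rather than jumping.
def naive_jump_fitness(foot_heights):
    """Reward = maximum height the feet reach; exploitable by somersaulting."""
    return max(foot_heights)

def stricter_jump_fitness(lowest_body_points):
    """Reward = maximum ground clearance of the lowest part of the body, so nothing
    counts as a jump unless the whole creature actually leaves the ground."""
    return max(lowest_body_points)

# A tall creature that simply tips over: its feet swing high while its head stays low.
feet = [0.0, 1.5, 3.0, 1.5, 0.0]          # per-timestep height of the feet
lowest = [0.0, 0.0, 0.0, 0.0, 0.0]        # per-timestep height of the lowest body part
print(naive_jump_fitness(feet))           # 3.0, which looks like a great jump
print(stricter_jump_fitness(lowest))      # 0.0, because the creature never left the ground
```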
The researchers think evolution may also illuminate some of the more troubling problems in AI safety. “The ubiquity of surprising and creative outcomes in digital evolution has other cross-cutting implications. For example, the many examples of “selection gone wild” in this article connect to the nascent field of artificial intelligence safety,” the researchers write. “These anecdotes thus serve as evidence that evolution—whether biological or computational—is inherently creative, and should routinely be expected to surprise, delight, and even outwit us.” (emphasis mine).
  Read more: The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities (Arxiv).

Allen AI puts today’s algorithms to shame with new common sense question answering dataset:
…Common sense questions designed to challenge and frustrate today’s best-in-class algorithms…
Following the announcement of $125 million in funding and a commitment to conducting AI research that pushes the limits of what sorts of ‘common sense’ intelligence machines can manifest, the Allen Institute for Artificial Intelligence has released a new ‘ARC’ challenge and dataset researchers can use to develop smarter algorithms.
  The dataset: The main ARC test contains 7,787 natural science questions, split across an easy set and a hard set. The hard set contains questions that are answered incorrectly by both retrieval-based and word co-occurrence algorithms. In addition, AI2 is releasing the ‘ARC Corpus’, a collection of 14 million science-related sentences with knowledge relevant to ARC, to support the development of ARC-solving algorithms. This corpus contains knowledge relevant to 95% of the Challenge questions, AI2 writes.
Neural net baselines: AI2 is also releasing three baseline models which have been tested on the challenge, achieving some success on the ‘easy’ set and failing to do better than random chance on the ‘hard’ set. These include a decomposable attention model (DecompAttn), Bidirectional Attention Flow (BiDAF), and a decomposed graph entailment model (DGEM). Questions in ARC are designed to test everything from definitional to spatial to algebraic knowledge, encouraging the use of systems that can abstract and generalize concepts derived from large corpora of data.
Baseline results: ARC is extremely challenging: AI2 benchmarked its prototype neural net approaches (along with others) and discovered that scores top out at 60% on the ‘easy’ set of questions and 27% on the more challenging questions.
Sample question: “Which property of a mineral can be determined just by looking at it? (A) luster [correct] (B) mass (C) weight (D) hardness”.
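  What the baselines are up against: here is my own minimal sketch (not AI2’s code) of an ARC-style multiple-choice question alongside a random guesser and a crude word co-occurrence guesser over a toy corpus; the ‘hard’ set is defined precisely by the failure of tricks like the latter.

```python
# My own sketch (not AI2's baseline code) of an ARC-style multiple-choice question,
# a random guesser, and a crude word-overlap guesser over a toy corpus.
import random

question = {
    "stem": "Which property of a mineral can be determined just by looking at it?",
    "choices": {"A": "luster", "B": "mass", "C": "weight", "D": "hardness"},
    "answer": "A",
}

def random_guess(q):
    """Expected accuracy on four-way questions: 25%."""
    return random.choice(list(q["choices"]))

def overlap_guess(q, corpus_sentences):
    """Pick the choice whose words, combined with the question's words, overlap most
    with the corpus; ARC's hard set is built from questions where such tricks fail."""
    def score(choice_text):
        words = set(q["stem"].lower().split()) | set(choice_text.lower().split())
        return sum(len(words & set(s.lower().split())) for s in corpus_sentences)
    return max(q["choices"], key=lambda c: score(q["choices"][c]))

corpus = ["luster describes how a mineral reflects light",
          "the mass of an object is measured with a balance"]
print(random_guess(question), overlap_guess(question, corpus))
```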
SQuAD successor: ARC may be a viable successor to the Stanford Question Answering Dataset (SQuAD) and challenge; the SQuAD competition has recently hit some milestones, with companies ranging from Microsoft to Alibaba to iFlyTek all developing SQuAD solvers that attain scores close to human performance (which is about 82% for ExactMatch and 91% for F1). A close evaluation of SQuAD topic areas gives us some intuition as to why scores are so much higher on this test than on ARC – simply put, SQuAD is easier; it pairs chunks of information-rich text with basic questions like “where do most teachers get their credentials from?” that can be retrieved from the text without requiring much abstraction.
Why it matters: “We find that none of the baseline systems tested can significantly outperform a random baseline on the Challenge set, including two neural models with high performances on SNLI and SQuAD,” the researchers write. The big question now is where this dataset falls on the Goldilocks spectrum — is it too easy (see: Facebook’s early memory networks tests) or too hard or just right? If a system were to get, say, 75% or so on ARC’s more challenging questions, it would seem to be a significant step forward in question understanding and knowledge representation.
  Read more: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge (Arxiv).
SQuAD scores available at the SQuAD website.
  Read more: SQuAD: 100,000+ Questions for Machine Comprehension of Text (Arxiv).

Tech Tales:

The Ten Thousand Floating Heads

The Ten-K, also known as The Heads, also sometimes known as The Ten Heads, officially known as The Ten Thousand Floating Heads, is a large-scale participatory AI sculpture that was installed in the Natural History Museum in London, UK, in 2025.

The way it works is like this: when you walk into the museum and breathe in that musty air and look up the near-endless walls towards the ceiling, your face is photographed in high definition by a multitude of cameras. These multi-modal pictures of you are sent to a server which adds them to the next training set that the AI uses. Then, in the middle of the night, a new model is trained that integrates the new faces. Then the AI system gets to choose another latent variable to filter by (this used to be a simple random number generator but, as with all things AI, has slowly evolved into an end-to-end ‘learned randomness’ system with some auxiliary loss functions to aid with exploration of unconventional variables, and so on) and then it looks over all the faces in the museum’s archives, studies them in one giant embedding, and pulls out the ten thousand that fit whatever variable it’s optimizing for today.

These ten thousand faces are displayed, portrait-style, on ten thousand tablets scattered through the museum. As you go around the building you do all the usual things, like staring at the dinosaur bones, or trudging through the typically depressing and seemingly ever-expanding climate change exhibition, but you also peer into these tablets and study the faces that are being shown. Why these ten thousand?, you’re meant to think. What is it optimizing for? You write your guess on a slip of paper or an email or a text and send it to the museum and at the end of the day the winners get their names displayed online and on a small plaque which is etched with near-micron accuracy (so as to avoid exhausting space) and is installed in a basement in the museum and viewable remotely – machines included – via a live webcam.

The correct answers for the variable it optimizes for are themselves open to interpretation, as isolating them and describing what they mean has become increasingly difficult as the model gets larger and incorporates more faces. It used to be easy: gender, hair color, eye color, race, facial hair, and so on. But these days it’s very subtle. Some of the recent names given to the variables include: underslept but well hydrated, regretful about a recent conversation, afraid of museums, and so on. One day it even put up a bunch of people and no one could figure out the variable and then six months later some PhD student did a check and discovered half the people displayed that day had subsequently died of one specific type of cancer.

Recently The Heads got a new name: The Oracle. This has caused some particular concern within certain specific parts of government that focus on what they euphemistically refer to as ‘long-term predictors’. The situation is being monitored.

Things that inspired this story: t-SNE embeddings, GANs, auxiliary loss functions, really deep networks, really big models, facial recognition, religion, cults.