Import AI

Import AI: #86: Baidu releases a massive self-driving car dataset; DeepMind boosts AI capabilities via neural teachers; and what happens when AIs evolve to do dangerous, subversive things.

Boosting AI capabilities with neural teachers:
…AKA, why my small student with multiple expert teachers beats your larger, better-resourced, teacherless student…
Research from DeepMind shows how to boost the performance of a given agent on a task by transferring knowledge from a pre-trained ‘teacher’ agent. The technique yields a significant speedup in training AI agents, and there’s some evidence that agents that are taught attain higher performance than non-taught ones. The technique comes in two flavors: single teacher and multi-teacher; agents pretrained via multiple specialized teachers do better than ones trained by a single entity, as expected.
  Strange and subtle: The approach has a few traits that seem helpful for the development of more sophisticated AI agents: in one task DeepMind tests it on the agent needs to figure out how to use a short-term memory to be able to attain a high score. ‘Small’ agents (which only have two convolutional layers) typically fail to learn to use a memory and therefore cannot achieve scores above a certain threshold, but by training a ‘small’ agent with multiple specialized teachers the researchers create one that can succeed at the task. “This is perhaps surprising because the kickstarting mechanism only guides the student agent in which action to take: it puts no constraint on how the student structures its internal memory state. However, the student can only predict the teacher’s behaviour by remembering information from before the respawn, which seems to be enough supervisory signal to drive short-term memory formation. We find this a wonderful parallel with how the best human educators teach: not telling the student what to think, but simply putting the student in a fruitful position to learn for themselves,” the researchers write.
  Why it matters: Trends like this suggest that scientists can speed up their own research by using such pre-trained teachers to better evaluate new agents. This adds further credence to the notion that a key input to (some types of) AI research will shift from pre-labelled static datasets to compute. (Though it should be noted that data here is implicit in the form of a procedural, modifiable simulator that researchers can access.) More speculatively, this means it may be possible to use mixtures of teachers to train complex agents that far exceed any of their forebears in capability – perhaps an area where the whole really will be greater than the sum of its parts.
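The kickstarting mechanism itself is simple to sketch: the student's usual RL loss gains a distillation term that pushes the student's policy toward the teacher's, weighted by a coefficient that is annealed toward zero so the student eventually learns on its own. A toy sketch of the idea (the function names and the linear annealing schedule are my own illustrative assumptions, not DeepMind's code):

```python
import math

def cross_entropy(teacher_probs, student_probs):
    """Distillation term: how poorly the student's policy predicts the
    teacher's action distribution."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

def kickstarted_loss(rl_loss, teacher_probs, student_probs, step, anneal_steps=10000):
    """RL loss plus an annealed distillation loss (hypothetical schedule)."""
    lam = max(0.0, 1.0 - step / anneal_steps)  # teacher weight decays to zero
    return rl_loss + lam * cross_entropy(teacher_probs, student_probs)

# Early in training the teacher's guidance dominates the loss...
early = kickstarted_loss(1.0, [0.9, 0.1], [0.5, 0.5], step=0)
# ...and by the end of the anneal the student is on its own.
late = kickstarted_loss(1.0, [0.9, 0.1], [0.5, 0.5], step=10000)
```

In the multi-teacher setting you could add one such distillation term per specialized teacher, each with its own weighting.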
  Read more: Kickstarting Deep Reinforcement Learning (Arxiv).

100,000+ developer survey shows AI concerns:
…What developers think is dangerous and exciting, and who they think is responsible…
Developer community Stack Overflow has published the results of its annual survey of its community; this year it asked about AI:
– What developers think is “dangerous” re AI: Increasing automation of jobs (40.8%)
– What developers think is “exciting” re AI: AI surpassing human intelligence, aka the singularity (28%)
– Who is responsible for considering the ramifications of AI:
   – The developers or the people creating the AI: 47.8%
   – A governmental or other regulatory body: 27.9%
– Different roles = different concerns: People who identified as technical specialists tended to say they were more concerned about issues of fairness than the singularity, whereas designers and mobile developers tended to be more concerned about the singularity.
  Read more: Developer Survey Results 2018 (Stack Overflow).

Baidu, Toyota, and Berkeley researchers organize self-driving car challenge backed by new self-driving car dataset from Baidu:
…”ApolloScape” adds Chinese data for self-driving car researchers, plus Baidu says it has joined Berkeley’s “DeepDrive” self-driving car AI coalition…
A new competition and dataset may give researchers a better way to measure the capabilities and progression of autonomous car AI.
  The dataset: The ‘ApolloScape’ dataset from Baidu contains ~200,000 RGB images with corresponding pixel-by-pixel semantic annotation. Each frame is labeled from a set of 25 semantic classes that include: cars, motorcycles, sidewalks, traffic cones, trash cans, vegetation, and so on. Each of the images has a resolution of 3384 x 2710, and each frame is separated by a meter of distance. 80,000 images have been released as of March 8, 2018.
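Datasets with pixel-by-pixel semantic annotation like this are typically scored with per-class intersection-over-union (IoU). A minimal sketch of that metric over flattened label maps (the class IDs and the use of mean IoU here are illustrative assumptions, not ApolloScape's official evaluation code):

```python
def class_iou(pred, truth, cls):
    """IoU for one semantic class over flattened per-pixel label maps."""
    inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
    return inter / union if union else None  # class absent from both maps

# Toy 2x4 label maps: 0 = road, 1 = car, 2 = vegetation.
truth = [0, 0, 1, 1, 0, 2, 2, 2]
pred  = [0, 0, 1, 0, 0, 2, 2, 1]

ious = {c: class_iou(pred, truth, c) for c in (0, 1, 2)}
mean_iou = sum(ious.values()) / len(ious)
```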
  Read more about the dataset (potentially via Google Translate) here.
  Additional information: Many of the researchers linked to ApolloScape will be talking at a session on autonomous cars at the IEEE Intelligent Vehicles Symposium in China.
  Competition: The new ‘WAD’ competition will give people a chance to test and develop AI systems on the ApolloScape dataset as well as a dataset from Berkeley DeepDrive (the DeepDrive dataset consists of 100,000 video clips, each about 40 seconds long, with one key frame from each clip annotated). There is about $10,000 in cash prizes available, and the researchers are soliciting papers on research techniques in: drivable area segmentation (being able to figure out which bits of a scene correspond to which label and which of these areas are safe); road object detection (figuring out what is on the road); and transfer learning from one semantic domain to another, specifically going from training on the Berkeley dataset (filmed in California, USA) to the ApolloScape dataset (filmed in Beijing, China).
   Read more about the ‘WAD’ competition here.

Microsoft releases a ‘Rosetta Stone’ for deep learning frameworks:
…GitHub repo gives you a couple of basic operations displayed in many different ways…
Microsoft has released a GitHub repository containing similar algorithms implemented in a variety of frameworks, including: Caffe2, Chainer, CNTK, Gluon, Keras (with CNTK/TensorFlow/Theano backends), TensorFlow, Lasagne, MXNet, PyTorch, and Knet (Julia). The idea here is that if you can read one algorithm in one of these frameworks, you’ll be able to use that knowledge to understand it in the others.
  “The repo we are releasing as a full version 1.0 today is like a Rosetta Stone for deep learning frameworks, showing the model building process end to end in the different frameworks,” write the researchers in a blog post that also provides some rough benchmarking for training time for a CNN and an RNN.
  Read more: Comparing Deep Learning Frameworks: A Rosetta Stone Approach (Microsoft Tech Net).
  View the code examples (GitHub).

Evolution’s weird, wonderful, and potentially dangerous implications for AI agent design:
…And why the AI safety community may be able to learn from evolution…
A consortium of international researchers has published some of the weird, infuriating, and frequently funny ways in which evolutionary algorithms have figured out non-obvious solutions and hacks to the tasks they’re asked to solve. The paper includes an illuminating set of examples of ways in which algorithms have subverted the wishes of their human overseers, including:
– Opportunistic Somersaulting: When trying to evolve creatures to jump, some agents discovered that they could instead evolve very tall bodies and then somersault, earning a reward in proportion to the distance of their feet from the floor.
– Pointless Programs: When researchers tried to use GenProg to evolve a fix for a buggy data-sorting program, GenProg evolved a solution that had the buggy program return an empty list, which wasn’t scored negatively because an empty list can’t be out of order: it contains nothing to sort.
– Physics Hacking: One robot figured out the correct vibrational frequency to surface a friction bug in the floor of an environment in a physics simulator, letting it propel itself across the ground via the bug.
– Evolution finds a way: Another type of surprise is the way evolution can succeed even when researchers think success is impossible, like a six-legged robot that figured out how to walk fast without its feet touching the ground (solution: it flipped itself onto its back and used the movement of its legs to propel itself anyway).
– And so much more!
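The ‘Pointless Programs’ hack above is easy to reproduce in miniature: if the fitness function only penalizes out-of-order output, an empty list earns a perfect score. A toy sketch (this fitness function is my own illustration, not GenProg’s actual scoring code):

```python
def sortedness_penalty(lst):
    """A naively specified fitness function: count adjacent out-of-order
    pairs. An empty list trivially scores a perfect 0 -- nothing to sort."""
    return sum(1 for a, b in zip(lst, lst[1:]) if a > b)

honest_attempt = [1, 3, 2, 4]  # a real, imperfect sorting attempt
evolved_hack = []              # the 'return an empty list' solution

# The hack beats the honest attempt under this mis-specified fitness.
assert sortedness_penalty(evolved_hack) == 0
assert sortedness_penalty(honest_attempt) > sortedness_penalty(evolved_hack)
```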
The researchers think evolution may also illuminate some of the more troubling problems in AI safety. “The ubiquity of surprising and creative outcomes in digital evolution has other cross-cutting implications. For example, the many examples of “selection gone wild” in this article connect to the nascent field of artificial intelligence safety,” the researchers write. “These anecdotes thus serve as evidence that evolution—whether biological or computational—is inherently creative, and should routinely be expected to surprise, delight, and even outwit us.” (emphasis mine).
  Read more: The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities (Arxiv).

Allen AI puts today’s algorithms to shame with new common sense question answering dataset:
…Common sense questions designed to challenge and frustrate today’s best-in-class algorithms…
Following the announcement of $125 million in funding and a commitment to conducting AI research that pushes the limits of what sorts of ‘common sense’ intelligence machines can manifest, the Allen Institute for Artificial Intelligence has released a new ‘ARC’ challenge and dataset researchers can use to develop smarter algorithms.
  The dataset: The main ARC test contains 7,787 natural science questions, split across an easy set and a hard set. The hard set contains questions that are answered incorrectly by both retrieval-based and word co-occurrence algorithms. In addition, AI2 is releasing the ‘ARC Corpus’, a collection of 14 million science-related sentences with knowledge relevant to ARC, to support the development of ARC-solving algorithms. This corpus contains knowledge relevant to 95% of the Challenge questions, AI2 writes.
  Neural net baselines: AI2 is also releasing three baseline models which have been tested on the challenge, achieving some success on the ‘easy’ set and failing to be better than random chance on the ‘hard’ set. These include a decomposable attention model (DecompAttn), Bidirectional Attention Flow (BiDAF), and a decomposed graph entailment model (DGEM). Questions in ARC are designed to test everything from definitional to spatial to algebraic knowledge, encouraging the usage of systems that can abstract and generalize concepts derived from large corpuses of data.
  Baseline results: ARC is extremely challenging: AI2 benchmarked its prototype neural net approaches (along with others) and discovered that scores top out at 60% on the ‘easy’ set of questions and 27% on the more challenging questions.
  Sample question: “Which property of a mineral can be determined just by looking at it? (A) luster [correct] (B) mass (C) weight (D) hardness”.
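A word co-occurrence baseline of the kind the hard set is designed to defeat can be sketched as picking the answer option that shares the most question words with corpus sentences mentioning it (the scoring rule and tiny corpus here are illustrative assumptions, not AI2’s baseline code):

```python
def cooccurrence_score(question, option, corpus_sentences):
    """Score an option by counting question words that co-occur with it
    in corpus sentences -- no reasoning, just surface statistics."""
    q_words = set(question.lower().split())
    score = 0
    for sentence in corpus_sentences:
        words = set(sentence.lower().split())
        if option.lower() in words:
            score += len(q_words & words)
    return score

corpus = [
    "luster is a property of a mineral you can see",
    "mass is measured with a balance",
    "hardness is tested by scratching a mineral",
]
question = "Which property of a mineral can be determined just by looking at it"
options = ["luster", "mass", "weight", "hardness"]
best = max(options, key=lambda o: cooccurrence_score(question, o, corpus))
```

Such a baseline can do reasonably well on the easy set; the hard set is built precisely from questions where this kind of surface matching picks the wrong answer.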
  SQuAD successor: ARC may be a viable successor to the Stanford Question Answering Dataset (SQuAD) and challenge; the SQuAD competition has recently hit some milestones, with companies ranging from Microsoft to Alibaba to iFlyTek all developing SQuAD solvers that attain scores close to human performance (which is about 82% for ExactMatch and 91% for F1). A close evaluation of SQuAD topic areas gives us some intuition as to why scores are so much higher on this test than on ARC – simply put, SQuAD is easier; it pairs chunks of information-rich text with basic questions like “where do most teachers get their credentials from?” that can be retrieved from the text without requiring much abstraction.
  Why it matters: “We find that none of the baseline systems tested can significantly outperform a random baseline on the Challenge set, including two neural models with high performances on SNLI and SQuAD,” the researchers write. The big question now is where this dataset falls on the Goldilocks spectrum — is it too easy (see: Facebook’s early memory networks tests) or too hard or just right? If a system were to get, say, 75% or so on ARC’s more challenging questions, it would seem to be a significant step forward in question understanding and knowledge representation.
  Read more: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge (Arxiv).
  SQuAD scores available at the SQuAD website.
  Read more: SQuAD: 100,000+ Questions for Machine Comprehension of Text (Arxiv).

Tech Tales:

The Ten Thousand Floating Heads

The Ten-K, also known as The Heads, also sometimes known as The Ten Heads, officially known as The Ten Thousand Floating Heads, is a large-scale participatory AI sculpture that was installed in the Natural History Museum in London, UK, in 2025.

The way it works is like this: when you walk into the museum and breathe in that musty air and look up the near-endless walls towards the ceiling, your face is photographed in high definition by a multitude of cameras. These multi-modal pictures of you are sent to a server which adds them to the next training set that the AI uses. Then, in the middle of the night, a new model is trained that integrates the new faces. Then the AI system gets to choose another latent variable to filter by (this used to be a simple random number generator but, as with all things AI, has slowly evolved into an end-to-end ‘learned randomness’ system with some auxiliary loss functions to aid with exploration of unconventional variables, and so on) and then it looks over all the faces in the museum’s archives, studies them in one giant embedding, and pulls out the ten thousand that fit whatever variable it’s optimizing for today.

These ten thousand faces are displayed, portrait-style, on ten thousand tablets scattered through the museum. As you go around the building you do all the usual things, like staring at the dinosaur bones, or trudging through the typically depressing and seemingly ever-expanding climate change exhibition, but you also peer into these tablets and study the faces that are being shown. Why these ten thousand? you’re meant to think. What is it optimizing for? You write your guess on a slip of paper or an email or a text and send it to the museum and at the end of the day the winners get their names displayed online and on a small plaque which is etched with near-micron accuracy (so as to avoid exhausting space) and is installed in a basement in the museum and viewable remotely – machines included – via a live webcam.

The correct answers for the variable it optimizes for are themselves open to interpretation, as isolating them and describing what they mean has become increasingly difficult as the model gets larger and incorporates more faces. It used to be easy: gender, hair color, eye color, race, facial hair, and so on. But these days it’s very subtle. Some of the recent names given to the variables include: underslept but well hydrated, regretful about a recent conversation, afraid of museums, and so on. One day it even put up a bunch of people and no one could figure out the variable and then six months later some PhD student did a check and discovered half the people displayed that day had subsequently died of one specific type of cancer.

Recently The Heads got a new name: The Oracle. This has caused some particular concern within certain specific parts of government that focus on what they euphemistically refer to as ‘long-term predictors’. The situation is being monitored.

Things that inspired this story: t-SNE embeddings, GANs, auxiliary loss functions, really deep networks, really big models, facial recognition, religion, cults.

Import AI: #85: Keeping it simple with temporal convolutional networks instead of RNNs, learning to prefetch with neural nets, and India’s automation challenge.

Administrative note: a somewhat abbreviated issue this week as I’ve been traveling quite a lot and have chosen sleep above reading papers (gasp!).

It’s simpler than you think: researchers show convolutional networks frequently beat recurrent ones:
…The rise and rise of simplistic techniques continues…
Researchers with Carnegie Mellon University and Intel Labs have rigorously tested the capabilities of convolutional neural networks (via a ‘temporal convolutional network’ (TCN) architecture, inspired by WaveNet and other recent innovations) against sequence modeling architectures like recurrent nets (via LSTMs and GRUs). The advantages of TCNs for sequence modeling are as follows: easily parallelizable rather than relying on sequential processing; a flexible receptive field size; stable gradients; low memory requirements for training; and variable length inputs. Disadvantages include: a greater data storage need than RNNs, and parameters that need to be fiddled with when shifting into different data domains.
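The building block underneath a TCN is the causal, dilated 1-D convolution: each output at time t sees only inputs at t and earlier, and dilation spaces out the taps so that stacking layers grows the receptive field exponentially. A minimal single-layer sketch (no residual connections or channels; the names are my own):

```python
def causal_dilated_conv(x, weights, dilation=1):
    """1-D causal convolution: output[t] depends only on x[t], x[t-d],
    x[t-2d], ... Positions before the start of the sequence are treated
    as zero padding."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            idx = t - i * dilation  # reach back i*dilation steps in time
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]
# Kernel of size 2: weights[0] applies at t, weights[1] at t - dilation.
y = causal_dilated_conv(x, [1.0, 1.0], dilation=2)
```

Because nothing at time t looks ahead of t, the whole sequence can be processed in parallel at training time, unlike an RNN’s step-by-step recurrence.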
  Testing: The researchers test out TCNs against RNNs, GRUs, and LSTMs on a variety of sequence modeling tasks, ranging from MNIST, to adding and copy tasks, to word-level and character-level perplexity on language tasks. In nine of the eleven cases the TCN comes out far ahead of the other techniques, in one case it roughly matches GRU performance, and in another it is noticeably worse than an LSTM (though still comes in second).
  What happens now: “The preeminence enjoyed by recurrent networks in sequence modeling may be largely a vestige of history. Until recently, before the introduction of architectural elements such as dilated convolutions and residual connections, convolutional architectures were indeed weaker. Our results indicate that with these elements, a simple convolutional architecture is more effective across diverse sequence modeling tasks than recurrent architectures such as LSTMs. Due to the comparable clarity and simplicity of TCNs, we conclude that convolutional networks should be regarded as a natural starting point and a powerful toolkit for sequence modeling,” write the researchers.
  Why it matters: One of the most confusing things about machine learning is that it’s a defiantly empirical science, with new techniques appearing and proliferating in response to measured performance on given tasks. What studies like this indicate is that many of these new architectures could be overly complex relative to their utility and it’s likely that, with just a few tweaks, the basic building blocks still reign supreme; we’ve seen a similar phenomenon with basic LSTMs and GANs doing better than many other more-recent innovations, given thorough analysis. In one sense this seems good as it seems intuitive that simpler architectures tend to be more flexible and general, and in another sense it’s unnerving, as it suggests much of the complexity that abounds in AI is an artifact of empirical science rather than theoretically justified.
  Read more: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling (Arxiv).
  Code for the TCN used in the experiments here (GitHub).

Automation & economies: it’s complicated:
…Where AI technology comes from, why automation could be challenging for India, and more…
In a podcast, three employees of the McKinsey Global Institute discuss how automation will impact China, Europe, and India. Some of the particularly interesting points include:
– China has an incentive to automate its own industries to improve labor productivity, as its labor pool has peaked and is now in similar demographic-based decline as other developed economies.
– The supply of AI technology seems to come from the United States and China, with Europe lagging.
– “A large effect is actually job reorganization. Companies adopting this technology will have to reorganize the type of jobs they offer. How easy would it be to do that? Companies are going to have to reorganize the way they work to make sure they get the juice out of this technology.”
– India may struggle as it transitions tens to hundreds of millions of people out of agriculture jobs. “We have to make this transition in an era where creating jobs out of manufacturing is going to be more challenging, simply because of automation playing a bigger role in several types of manufacturing.”
Read more: How will automation affect economies around the world? (McKinsey Global Institute).

DeepChem 2.0 bubbles out of the lab:
…Open source scientific computing platform gets its second major release…
DeepChem’s authors have released version 2.0 of the scientific computing library, bringing with it improvements to the TensorGraph API, tools for molecular analysis, new models, tutorial tweaks and additions, and a whole host of general improvements. DeepChem “aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.”
  Read more: DeepChem 2.0 release notes.
  Read more: DeepChem website.

Google researchers tackle prefetching with neural networks:
…First databases, now memory…
One of the weirder potential futures of AI is one where the fundamental aspects of computing, like implementing systems that search over database indexes or prefetch data to boost performance, are mostly learned rather than pre-programmed. That’s the idea in a new paper from researchers at Google, which tries to use machine learning techniques to solve prefetching, which is “the process of predicting future memory accesses that will miss in the on-chip cache and access memory based on past history”. Prefetching is a somewhat fundamental problem: the better a system becomes at prefetching, the higher the chance of it loading the right data into memory before that data is called upon, which increases overall performance.
  How it works: Can prefetching be learned? “Prefetching is fundamentally a regression problem. The output space, however, is both vast and extremely sparse, making it a poor fit for standard regression models,” the Google researchers write. Instead, they turn to using LSTMs and find that two variants are able to demonstrate competitive prefetching performance when compared to handwritten systems. “The first version is analogous to a standard language model, while the second exploits the structure of the memory access space in order to reduce the vocabulary size and reduce the model memory footprint,” the researchers write. They test out their approach on data from Google’s web search workload and demonstrate competitive performance.
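The second variant’s trick of exploiting “the structure of the memory access space” can be sketched as modeling the deltas between successive addresses rather than the raw addresses: a trace’s deltas are far fewer and more repetitive, making a tractable prediction vocabulary. A toy sketch (the vocabulary cutoff is an illustrative assumption, not the paper’s exact scheme):

```python
from collections import Counter

def delta_vocab(addresses, top_k=1):
    """Turn a raw address trace into deltas and keep the top_k most
    common deltas as the prediction vocabulary; everything else becomes
    an out-of-vocabulary token."""
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    vocab = [d for d, _ in Counter(deltas).most_common(top_k)]
    tokens = [d if d in vocab else "OOV" for d in deltas]
    return vocab, tokens

# A toy trace: mostly a stride-8 scan with one irregular jump.
trace = [0, 8, 16, 24, 1000, 1008, 1016]
vocab, tokens = delta_vocab(trace)
```

A sequence model trained over these delta tokens then only has to predict from a small vocabulary instead of regressing over the entire address space.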
  “The models described in this paper demonstrate significantly higher precision and recall than table-based approaches. This study also motivates a rich set of questions that this initial exploration does not solve, and we leave these for future research,” they write. This research is philosophically similar to work from Google last autumn on using neural networks to learn database index structures (covered in #73), which also found that learned indexes can have performance competitive with, or superior to, hand-tuned systems.
  One weird thing: When developing one of their LSTMs the researchers created a t-SNE embedding of the program counters ingested by the system and discovered that the learned features contained quite a lot of information. “The t-SNE results also indicate that an interesting view of memory access traces is that they are a reflection of program behavior. A trace representation is necessarily different from e.g., input-output pairs of functions, as in particular, traces are a representation of an entire, complex, human-written program,” they write.
  Read more: Learning Memory Access Patterns (Arxiv).

Learning to play video games in minutes instead of days:
…Great things happen when AI and distributed systems come together…
Researchers with the University of California at Berkeley have come up with a way to further optimize large-scale training of AI algorithms by squeezing as much efficiency as possible out of the underlying compute infrastructure. Their new technique makes it possible to train reinforcement learning agents to master Atari games in under ten minutes on an NVIDIA DGX-1 (which contains 40 CPUs and 8 P100 GPUs). Though the sample efficiency of these algorithms is still massively sub-human (requiring millions of frames to approximate the performance of humans trained on thousands to tens of thousands of frames), it’s interesting that we’re now able to develop algorithms that approximate flesh-and-blood performance in roughly similar wall-clock time.
  Results: The researchers show that, given various distributed-systems tweaks, it’s possible for algorithms like A2C, A3C, PPO, and APPO to attain good performance on various games in a few minutes.
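Much of this kind of speedup comes from stepping many simulator instances in lockstep so the GPU sees one large inference batch rather than many tiny ones. A toy sketch of that pattern (the environment and policy here are stand-ins, not the paper’s code):

```python
def policy_batch(observations):
    """Stand-in for one batched forward pass over all envs at once --
    in practice this is where the GPU earns its keep."""
    return [1 if obs > 0 else 0 for obs in observations]

def step_envs(states, actions):
    """Stand-in for stepping every simulator instance in lockstep."""
    return [s + (1 if a == 1 else -1) for s, a in zip(states, actions)]

states = [0, 1, -1, 2]  # one state per parallel environment
for _ in range(3):       # each iteration = one batched act/step cycle
    actions = policy_batch(states)
    states = step_envs(states, actions)
```

The design choice is throughput over latency: each individual environment waits for the batch, but total frames-per-second climbs dramatically.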
  Why it matters: Computers are currently functioning like telescopes for certain AI researchers – the bigger your telescope, the farther you can see into the limit of the scaling properties of various AI algorithms. We still don’t fully understand the limits here, but research like this indicates that as new compute substrates come along it may be possible to scale RL algorithms to achieve very impressive feats in relatively little time. But there are more unknowns than knowns right now – what an exciting time to be alive! “We have not conclusively identified the limiting factor to scaling, nor if it is the same in every game and algorithm. Although we have seen optimization effects in large-batch learning, we do not know their full nature, and other factors remain possible. Limits to asynchronous scaling remain unexplored; we did not definitively determine the best configurations of these algorithms, but only presented some successful versions,” they write.
  Read more: Accelerated Methods for Deep Reinforcement Learning (Arxiv).

OpenAI Bits&Pieces:

OpenAI Scholars: Funding for underrepresented groups to study AI:
OpenAI is providing 6-10 stipends and mentorship to individuals from underrepresented groups to study deep learning full-time for 3 months and open-source a project. You’ll need US employment authorization and will receive a stipend of $7.5k per month while doing the program, as well as $25,000 in AWS credits.
  Read more: OpenAI Scholars (OpenAI blog).

Tech Tales:

John Henry 2.0

No one placed any bets on it internally aside from the theoretical physicists who, by virtue of their field of study, had a natural appreciation for very long odds. Everyone else just assumed the machines would win. And they were right, though I’m not sure in the way they were expecting.

It started like this: one new data center was partitioned into two distinct zones. In one of the zones we applied the best, most interpretable, most rule-based systems we could to every aspect of the operation, ranging from the design of the servers, to the layout of motherboards, to the software used to navigate the compute landscape, and so on. The team tasked with this data center had an essentially limitless budget for infrastructure and headcount. In the other zone we tried to learn everything we could from scratch, so we assigned AI systems to figure out: the types of computers to deploy in the data center, where to place these computers to minimize latency, how to aggressively power these devices up or down in accordance with observed access patterns, how to learn to effectively index and store this information, knowing when to fetch data into memory, figuring out how to proactively spin-up new clusters in anticipation of jobs that had not happened yet but were about to happen, and so on.

You can figure out what happened: for a time, the human-run facility was better and more stable, and then one day the learned data center reached parity with it in some areas, then parity in most areas, then very quickly began to surpass it on key metrics ranging from uptime to power consumption to mean-time-between-failure for its electronic innards. The human-heavy team worked themselves ragged trying to keep up, and many wonderful handwritten systems were created that further pushed the limit of what we knew theoretically and could embody in code.

But the learned system kept going, uninhibited by the need for a theoretical justification for its own innovations, instead endlessly learning to exploit strange relationships that were non-obvious to us humans. But transferring insights gleaned from this system into the old rule-based one was difficult, and tracking down why something had seen such a performance increase in the learned regime was an art in itself: what tweak made this new operation so successful? What set of orchestrated decisions had eked out this particular gain?

So now we build things with two different tags on them: need-to-know (NTK) and hell-if-I-know (HIIK). NTK tends to be stuff that has some kind of regulation applied to it and we’re required to be able to explain, analyze, or elucidate for other people. HIIK is the weirder stuff that is dealing in systems that don’t handle regulated data – or, typically, any human data at all – or are parts of our scientific computing infrastructure, where all we care about is performance.

In this way the world of computing has split in two, with some researchers working on extending our theoretical understanding to further boost the performance of the rule-based system, and an increasingly large quantity of other researchers putting theory aside and spending their time feeding what they have taken to calling the ‘HIIKBEAST’.

Things that inspired this story: Learning indexes, learning device placement, learning prefetching, John Henry, empiricism.

Import AI: #84: xView dataset means the planet is about to learn how to see itself, a $125 million investment in common sense AI, and SenseTime shows off TrumpObama AI face swap.

Chinese AI startup SenseTime joins MIT’s ‘Intelligence Quest’ initiative:
…Funding plus politics in one neat package…
Chinese AI giant SenseTime is joining the ‘MIT Intelligence Quest’, a pan-MIT AI research and development initiative. The Chinese company specializes in facial recognition and self-driving cars and has signed strategic partnerships with large companies like Honda, Qualcomm, and others. At an AI event at MIT recently SenseTime’s founder Xiao’ou Tang gave a short speech with a couple of eyebrow-raising demonstrations to discuss the partnership. “I think together we will definitely go beyond just deep learning we will go to the uncharted territory of deep thinking,” Tang said.
  Data growth: Tang said SenseTime is developing better facial recognition algorithms using larger amounts of data, saying the company in 2016 improved its facial recognition accuracy to “one over a million” using 60 million photos, then in 2017 improved that to “one over a hundred million” via a dataset of two billion photos. (That’s not a typo.)
  Fake Presidents: He also gave a brief demonstration of a SenseTime synthetic video project which generatively morphed footage of President Obama speaking into President Trump speaking, and vice versa. I recorded a quick video of this demonstration which you can view on Twitter here (Video).
Read more: MIT and SenseTime announce effort to advance artificial intelligence research (MIT).

Chinese state media calls for collaboration on AI development:
…Xinhua commentary says China’s rise in AI ‘is a boon instead of a threat’…
A comment piece in Chinese state media Xinhua tries to debunk some of the cold war lingo surrounding China’s rise in AI, pushing back on accusations that Chinese AI is “copycat” and calling for more cooperation and less competition. Liu Qingfeng, iFlyTek’s CEO, told Xinhua at CES that massive data sets, algorithms and professionals are a must-have combination for AI, which “requires global cooperation” and “no company can play hegemony”, Xinhua wrote.
Read more: Commentary: AI development needs global cooperation, not China-phobia (Xinhua).

New xView dataset represents a new era of geopolitics as countries seek to automate the analysis of the world:
…US defense researchers release dataset and associated competition to push the envelope on satellite imagery analysis…
Researchers with the DoD’s Defense Innovation Unit Experimental (DIUx), DigitalGlobe, and the National Geospatial-Intelligence Agency have released xView, a dataset and associated competition used to assess the ability of AI methods to classify overhead satellite imagery. xView includes one million distinct objects across 60 classes, spread across 1,400km² of satellite imagery with a maximum ground sample resolution of 0.3m. The dataset is designed to test various frontiers of image recognition, including: learning efficiency, fine-grained class detection, and multiscale recognition, among others. The competition includes $100,000 of prize money, along with compute credits.
Why it matters: The earth is beginning to look at itself. As launch capabilities get cheaper via new rockets from SpaceX, Rocket Lab, and others, as better hardware comes online as a consequence of further improvements in electronics, and as more startups stick satellites into orbit, the amount of data available about the earth is going to grow by several orders of magnitude. If we can figure out how to analyze these datasets using AI techniques we can better respond to changes on our planet, marshal resources to remediate natural disasters, and, more generally, equip large logistics organizations like militaries to better understand the world around them and plan and act accordingly. A new era of high-information geopolitics is approaching…
  I spy with my satellite eye: xView includes numerous objects with parent classes and sub-classes, such as ‘maritime vessels’ with sub-classes including sailboat and oil tanker. Other classes include fixed wing aircraft, passenger vehicles, trucks, engineering vehicles, railway vehicles, and buildings. “xView contributes a large, multi-class, multi-location dataset in the object detection and satellite imagery space, built with the benchmark capabilities of PASCAL VOC, the quality control methodologies of COCO, and the contributions of other overhead datasets in mind,” they write. Some of the most frequently covered objects in the dataset include buildings and small cars, while some of the rarest include vehicles like a reach stacker and a tractor, and vessels like an oil tanker.
  Baseline results: The researchers created a baseline by implementing a Single Shot MultiBox Detector (SSD) meta-architecture and testing it on three variants of the dataset: standard xView, multi-resolution, and multi-resolution augmented via image augmentation. The best results came from training on the multi-resolution dataset, with accuracies climbing as high as 67% for cargo planes. The scores are mostly pretty underwhelming, so it’ll be interesting to see what scores people get when they apply more sophisticated deep learning-based methods to the problem.
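Detection benchmarks like this are typically scored by matching predicted boxes to ground-truth boxes via intersection-over-union (IoU); here’s a minimal sketch of that metric (an illustration only, not the official xView scoring code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# Two 10x10 boxes overlapping in a 5x5 corner: 25 / 175 ≈ 0.143.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

A detection usually counts as a true positive only when its IoU with a ground-truth box clears some threshold (0.5 is a common choice).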
  Milspec data precision: “We achieved consistency by having all annotation performed at a single facility, following detailed guidelines, with output subject to multiple quality control checks. Workers extensively annotated image chips with bounding boxes using an open source tool,” write the authors. Other AI researchers may want to aspire to equally high standards, if they can afford it.
  Read more: xView: Objects in Context in Overhead Imagery (Arxiv).
  Get the dataset: xView website.

Adobe researchers try to give robots a better sense of navigation with ‘AdobeIndoorNav’ dataset:
…Plus: automating data collection with Tango phones + commodity robots…
Adobe researchers have released AdobeIndoorNav, a dataset intended to help robots navigate the real world. The dataset contains 3,544 distinct locations across 24 individual ‘scenes’ that a virtual robot can learn to navigate. Each scene corresponds to a real-world location and contains a 3D reconstruction via a point cloud, a 360-degree panoramic view, and front/back/left/right views from the perspective of a small ground-based robot. Combined, the dataset gives AI researchers a set of environments to develop robot navigation systems in. “The proposed setting is an intentionally simplified version of real-world robot visual navigation with neither moving obstacles nor continuous actuation,” the researchers write.
  Why it matters: For real-world robotic AI systems to be more useful they’ll have to be capable of being dropped into novel locations and figuring out how to navigate themselves around to specific targets. This research shows that we’re still a long, long way away from theoretical breakthroughs that give us this capability, but does include some encouraging signs for our ability to automate the necessary data gathering process to create the datasets needed to develop baselines to evaluate new algorithms on.
  Data acquisition: The researchers used a Lenovo Phab 2 Tango phone to scan each scene by hand to create a 3D point cloud, which they then automatically decomposed into a map of specific obstacles as well as a 3D map. A ‘Yujin Turtlebot 2’ robot then uses these maps along with its onboard laser scanner, RGB-D camera, and 360 camera to navigate around the scene and take a series of high resolution 360 photos, which it then stitches into a coherent scene.
  Results: The researchers prove out the dataset by creating a baseline agent capable of navigating the scene. Their A3C agent with an LSTM network learns to successfully navigate from one location in any individual scene to another location, frequently figuring out routes that involve only a couple more steps than the theoretical minimum. The researchers also show a couple of potential extensions of this technique to further improve performance, like augmentations to increase the amount of spatial information which the robot incorporates into its judgements.
Read more: The AdobeIndoorNav Dataset: Towards Deep Reinforcement Learning based Real-world Indoor Robot Visual Navigation (Arxiv).

Allen Institute for AI gets $125 million to pursue common sense AI:
…Could an open, modern, ML-infused Cyc actually work? That’s the bet…
Symbolic AI approaches have a pretty bad rap – they were all the rage in the 80s and 90s but after lots of invested money and few major successes they have since been eclipsed by deep learning-based AI approaches. The main project of note in this area is Doug Lenat’s Cyc which has, much like fusion power, been just a few years away from a major breakthrough for… three decades. But that doesn’t mean symbolic approaches are worthless; they might just be a bit underexplored and in need of revitalization – many people tell me that symbolic systems are being used all the time today but they’re frequently proprietary or secret (aka, military) in nature. But, still, evidence is scant. So it’s interesting that Paul Allen (co-founder of Microsoft) is investing $125 million over three years into his Allen Institute for Artificial Intelligence to launch Project Alexandria, an initiative that seeks to create a knowledge base that fuses machine reading and language and vision projects with human-annotated ‘common sense’ statements.
  Benchmarks: “This is a very ambitious long-term research project. In fact, what we’re starting with is just building a benchmark so we can assess progress on this front empirically,” said AI2’s CEO Oren Etzioni in an interview with GeekWire. “To go to systems that are less brittle and more robust, but also just broader, we do need this background knowledge, this common-sense knowledge.”
  Read more: Allen Institute for Artificial Intelligence to Pursue Common Sense for AI (Paul Allen).
  Read more: Project Alexandria (AI2).
  Read more: Oren Etzioni interview (GeekWire).

Russian researchers use deep learning to diagnose fire damage from satellite imagery:
…Simple technique highlights generality of AI tools and implications of having more readily available satellite imagery for disaster response…
Researchers with the Skolkovo Institute of Science and Technology in Moscow have published details on how they applied machine learning techniques to automate the analysis of satellite images of the Californian wildfires of 2017. The researchers used DigitalGlobe satellite imagery of Ventura and Santa Rosa counties before and after the fires swept through to create a dataset of pictures containing around 1,000 buildings (760 non-damaged and 320 burned), then used a pre-trained ImageNet network (with subsequent finetuning) to learn to classify burned versus non-burned buildings with an accuracy of around 80% to 85%.
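The pretrain-and-finetune recipe can be sketched in miniature: keep a ‘pretrained’ feature extractor frozen and fit only a small classification head on the labeled before/after imagery. The toy stand-in below (hand-rolled logistic regression on invented brightness features, not the authors’ ImageNet pipeline) shows the shape of the idea:

```python
import math

def features(pixels):
    """Stand-in for a frozen pretrained extractor: mean and max brightness."""
    return [sum(pixels) / len(pixels), max(pixels)]

def train_head(examples, epochs=300, lr=0.5):
    """Fit a tiny logistic-regression head on top of the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for pixels, label in examples:
            x = features(pixels)
            p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            err = p - label  # gradient of the log-loss w.r.t. the logit
            w = [w[0] - lr * err * x[0], w[1] - lr * err * x[1]]
            b -= lr * err
    return w, b

def predict(w, b, pixels):
    x = features(pixels)
    return int(w[0] * x[0] + w[1] * x[1] + b > 0)

# Invented toy data: burned roofs read darker (label 1), intact ones brighter (0).
data = [([0.1, 0.2, 0.1], 1), ([0.9, 0.8, 0.7], 0),
        ([0.2, 0.1, 0.3], 1), ([0.8, 0.9, 0.9], 0)]
w, b = train_head(data)
print(predict(w, b, [0.15, 0.1, 0.2]))  # a dark, burned-looking patch
```

The practical point is the same one the paper makes: once the dataset is annotated, swapping in a better backbone or head is cheap relative to gathering the data.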
  Why it matters: Stuff like this is interesting mostly because of the implicit time savings: once you have annotated a dataset it is relatively easy to train new models to improve classification in line with new techniques. The other component necessary for techniques like this to be useful will be the availability of more frequently updated satellite imagery, but there are already startups working in this space, like Planet Labs and others, so that seems fairly likely.
  Read more: Satellite imagery analysis for operational damage assessment in Emergency situations (Arxiv).

Google researchers figure out weird trick to improve recurrent neural network long-term dependency performance:
…Auxiliary losses + RNNs make for better performance…
Memory is a troublesome thing with neural networks, and figuring out how to give networks a better representative capacity has been a long-standing problem in the field. Now, researchers with Google have proposed a relatively simple tweak to recurrent neural networks that lets them model longer-time dependencies, potentially opening RNNs up to working on problems that require a bigger memory. The technique involves augmenting RNNs with an unsupervised auxiliary loss that either tries to reconstruct a previous segment of the input sequence, or to predict a short span of upcoming inputs, and in doing so lets the RNN learn to represent finer-grained structure over longer timescales. Now we need to figure out what those problems are and evaluate the systems further.
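The core trick is simply adding a weighted unsupervised term to the training objective; a hedged sketch (invented function names, with mean-squared error standing in for the reconstruction term):

```python
def reconstruction_loss(window, reconstructed):
    """Unsupervised auxiliary term: mean squared error over a replayed window."""
    return sum((a - b) ** 2 for a, b in zip(window, reconstructed)) / len(window)

def total_loss(main_loss, sequence, reconstructed, anchor, width, weight=0.5):
    """Supervised loss plus an auxiliary loss on the `width` inputs before `anchor`.

    In the paper's setup the network must reproduce a past segment of the
    sequence; here we just combine the two terms with a fixed weight.
    """
    window = sequence[anchor - width:anchor]
    return main_loss + weight * reconstruction_loss(window, reconstructed)

seq = [0.0, 1.0, 2.0, 3.0, 4.0]
# A perfect reconstruction of the window [1.0, 2.0] adds nothing to the loss.
print(total_loss(0.7, seq, [1.0, 2.0], anchor=3, width=2))  # 0.7
```

The auxiliary gradient gives the recurrent core a reason to keep information around even when the supervised signal, arriving only at the end of a long sequence, is too weak to do so on its own.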
  Evaluation: Long time-scale problems are still in their chicken and egg phase, where it’s difficult to figure out the appropriate methods we can use to test them. One approach is pixel-by-pixel image prediction, which is where you feed each individual pixel into a long-term system – in this case an RNN augmented by the proposed technique – and see how effectively it can learn to classify the image. The idea here is that if it’s reasonably good at classifying the image then it is able to learn high-level patterns from the pixels which have been fed into it, which suggests that it is remembering something useful. The researchers test their approach on images ranging in length from 784 pixels (MNIST) to 1,024 (CIFAR-10) all the way up to around ~16,000 (via the ‘StanfordDogs’ dataset).
Read more: Learning Longer-term Dependencies in RNNs with Auxiliary Losses (Arxiv).

Alibaba applies reinforcement learning to optimizing online advertising:
…Games and robots are cool, but the rarest research papers are the ones that deal with actual systems that make money today…
Chinese e-commerce and AI giant Alibaba has published details on a reinforcement learning technique that, it says, can further optimize adverts in sponsored search real-time bidding auctions. The algorithm, M-RMDP (Massive-agent Reinforcement Learning with robust Markov Decision Process), improves ad performance and lowers the potential price per ad for advertisers, providing an empirical validation that RL could be applied to highly tuned, rule-based heuristic systems like those found in much of online advertising. Notably, Google has published very few papers on this area, suggesting Alibaba may be publishing in this strategic area because a) it believes it is still behind Google and others in this area and b) by publishing it may be able to tempt over researchers who wish to work with it. M-RMDP’s main contribution is being able to model the transitions between different auction states as demand waxes and wanes through the day, the researchers say.
Method and scale: Alibaba says it designed the system to deal with what it calls the “massive-agent problem”, which is figuring out a reinforcement learning method that can handle “thousands or millions of agents”. For the experiments in the paper it deployed its training infrastructure onto 1,000 CPUs and 40 GPUs.
  Results: The company picked 1,000 ads from the Alibaba search auction platform and collected two days’ worth of data for training and testing. It tested the effectiveness of its system by simulating reactions within its test set. Once it had used this offline evaluation to prove out the provisional effectiveness of its approach it carried out an online test, finding that the M-RMDP approach substantially improves advertisers’ return on investment in terms of ad effectiveness, while marginally reducing the pay-per-click (PPC) cost, saving them money.
Why it matters: Finding examples of reinforcement learning being used for practical, money-making tasks is typically difficult; many of the technology’s most memorable or famous results involve mastering various video games or board games or, more recently, controlling robots performing fairly simple tasks. So it’s a nice change to have a paper that involves deep reinforcement learning doing something specific and practical: learning to bid on online auctions.
  Read more: Deep Reinforcement Learning for Sponsored Search Real-time Bidding (Arxiv).

OpenAI Bits & Pieces:

Improving robotics research with new environments, algorithms, and research ideas:
…Fetch models! Shadow Hands! HER Baselines! Oh my!…
We’ve released a set of tools to help people conduct research on robots, including new simulated robot models, a baseline implementation of the Hindsight Experience Replay algorithm, and a set of research ideas for HER.
Read more: Ingredients for Robotics Research (OpenAI blog).

Tech Tales:

Play X Time.

It started with a mobius strip and it never really ended: after many iterations the new edition of the software, ToyMaker V1.0, was installed in the ‘Kidz Garden’ – an upper-class private school/playpen for precocious children ages 2 to 4 – on the 4th of June 2022, and it was a hit immediately. Sure, the kids had seen 3D printers before – many of them had them in their homes, usually the product of a mid-life crisis of one of their rich parents; usually a man, usually a finance programmer, usually struggling against the vagaries of their own job and seeking to create something real and verifiable. So the kids weren’t too surprised when ToyMaker began its first print. The point when it became fascinating to them was after the print finished and the teacher snapped off the freshly printed mobius strip and handed it to one of the children who promptly sneezed and rubbed the snot over its surface – at that moment one of the large security cameras mounted on top of the printer turned to watch the child. A couple of the other kids noticed and pointed and then tugged at the sleeve of the snot kid, who looked up at the camera, which looked back at them. They held up the mobius strip and the camera followed it, then they pulled it back down towards them and the camera followed that too. They passed the mobius strip to another kid who promptly tried to climb on it, and the camera followed this and then the camera followed the teacher as they picked up the strip and chastised the children. A few minutes later the children were standing in front of the machine dutifully passing the mobius strip between each other and laughing as the camera followed it from kid to kid to kid.
“What’s it doing?” one of them said.
“I’m not sure,” said the teacher, “I think it’s learning.”
And it was: the camera fed into the sensor system for the ToyMaker software, which treated these inputs as an unsupervised auxiliary loss, which would condition the future objects it printed and how it made them. At night when the kids went home to their nice, protected flats and ate expensive, fiddly food with their parents, the machine would simulate the classroom and different perturbations of objects and different potential reactions of children. It wasn’t alone: ToyMaker 1.0 software was installed on approximately a thousand other printers spread across the country in other expensive daycares and private schools, and so as each day passed they collectively learned to try to make different objects, forever monitoring the reactions of the children, growing more adept at satisfying them via a loss function which was learned, with the aid of numerous auxiliary losses, through interaction.
So the next day in the Kidz Garden the machine printed out a Mobius Strip that now had numerous spindly-yet-strong supports linking its sides together, letting the children climb on it.
The day after that it printed intertwined ones; two low-dimensional slices, linked together but separate, and also climbable.
Next: the strips had little gears embedded in them which the children could run their hands over and play with.
Next: the gears conditioned the proportions of some aspects of the strip, allowing the children to manipulate dimensional properties with the spin of various clever wheels.
And so it went like this and is still going, though as the printing technologies have grown better, and the materials more complex, the angular forms being made by these devices have become sufficiently hard to explain that words do not suffice: you need to be a child, interacting with them with your hands, and learning the art of interplay with a silent maker that watches you with electronic eyes and, sometimes – you think when you are going to sleep – nods its camera head when you snot on the edges, or laugh at a surprising gear.

Technologies that inspired this story: Fleet learning, auxiliary losses, meta-learning, CCTV cameras, curiosity, 3D printing.

Thanks for reading. If you have suggestions, comments or other thoughts you can reach me at or tweet at me @jackclarksf

Import AI #83: Cloning voices with a few audio samples, why malicious actors might mess with AI, and the industry-academia compute gap.

Preparing for Malicious Uses of AI:
…Bad things happen when good people unwittingly release AI platforms that bad people can modify to turn good AIs into bad AIs…
AI, particularly deep learning, is a technology of such obvious power and utility that it seems likely malicious actors will pervert the technology and use it in ways it wasn’t intended. That has happened to basically every other significant technology of note: axes can be used to chop down trees or cut off heads, electricity can light a home or electrocute a person, a lab bench can be used to construct cures or poisons, and so on. But AI has some other characteristics that make it particularly dangerous: it’s, to use a phrase Rodney Brooks has used in the past to describe robots, “fast, cheap, and out of control”; today’s AI systems run on generic hardware, are mostly embodied in open source software, and are seeing capabilities increase according to underlying algorithmic and compute progress, both of which are happening in the open. That means the technology holds the possibility of doing immense good in the world as well as doing immense harm – and currently the AI community is broadly making everything available in the open, which seems somewhat acceptable today but probably unacceptable in the future given a few cranks more of Moore’s Law combined with algorithmic progression.
  Omni-Use Alert: AI is more than a ‘dual-use’ technology, it’s an omni-use technology. That means that figuring out how to regulate it to prevent bad people doing bad things with it is (mostly) a non-starter. Instead, we need to explore new governance regimes, community norms, standards on information sharing, and so on.
  101 Pages of Problems: If you’re interested in taking a deeper look at this issue check out this report which a bunch of people (including me) spent the last year working on: The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation (Arxiv). You can also check out a summary via this OpenAI blog post about the report. I’m hoping to broaden the discussion of Omni-Use AI in the coming months and will be trying to host events and workshops relating to this question. If you want to chat with me about it, then please get in touch. We have a limited window of time to act as a community before dangerous things start happening – let’s get to work.

Baidu clones voices with few samples:
…Don’t worry about the omni-use concerns, yet…
Baidu Research has trained an AI that can listen to a small quantity of a single person’s voice and then use that information to condition a speech-synthesis network to sound like that person. This form of ‘adaptation’ is potentially very powerful, especially when trying to create AI services that work for multiple users with multiple accents, but it’s also somewhat frightening, as if it gets much better it will utterly compromise our trust in the aural domain. However, the ability of the system to clone speech today still leaves much to be desired, with the best performing systems requiring a hundred distinct voice samples and still sounding like a troll speaking from the bottom of a well, so we’ve got a few more compute turns yet before we run into real problems – but they’re coming.
  What it means: Techniques like this bring closer the day when a person can say something into a compromised device, have their voice recorded by a malicious actor, and have that sample be used to train new text-to-speech systems to say completely new things. Once that era arrives then the whole notion of ‘trust’ in audio samples of a person’s voice will completely change, causing normal people to worry about these sorts of things as well as state-based intelligence organizations.
  Results: To get a good idea of the results, listen to the samples on this web page here (Voice Cloning: Baidu).
  Read more: Neural Voice Cloning with a Few Samples (Baidu Blog).
  Read more: Neural Voice Cloning with a Few Samples (Arxiv).

Why robots in the future could be used as speedbumps for pedestrians:
…Researchers show how people slow down in the presence of patrolling robots…
Researchers with the Department of Electrical and Computer Engineering at the Stevens Institute of Technology in Hoboken, New Jersey, have examined how crowds of people react to robots. Their research is a study of “passive Human Robot Interaction (HRI) in an exit corridor for the purpose of robot-assisted pedestrian flow regulation.”
  The results: “Our experimental results show that in an exit corridor environment, a robot moving in a direction perpendicular to that of the uni-directional pedestrian flow can slow down the uni-directional flow, and the faster the robot moves, the lower the average pedestrian velocity becomes. Furthermore, the effect of the robot on the pedestrian velocity is more significant when people walk at a faster speed,” they write. In other words: pedestrians will avoid a dumb robot moving right in front of them.
  Methods: To conduct the experiment, the researchers used a customized ‘Adept Pioneer P3-DX mobile robot’ which was programmed to move at various speeds perpendicular to the pedestrian flow direction. To collect data, they outfitted a room with five Microsoft Kinect 3D sensors along with pedestrian detection and tracking via OpenPTrack.
  What it means: As robots become cheaper thanks to a proliferation of low-cost sensors and hardware platforms it’s likely that people will deploy more of them into the real world. Figuring out how to have very dumb, non-reactive robots do useful things will further drive adoption of these technologies and yield increasing economies of scale to further lower the cost of the hardware platform and increase the spread of the technology. Based on this research, you can probably look forward to a future where airports and transit systems are thronged with robots shuttling to and fro across crowded routes, exerting implicit crowd-speed-control through thick-as-a-brick automation.
  Read more: Pedestrian-Robot Interaction Experiments in an Exit Corridor (Arxiv).

Why your next self-driving car could be sent to you with the help of reinforcement learning:
…Researchers with Chinese ride-hailing giant Didi Chuxing simulate and benchmark RL algorithms for strategic car assignment…
Researchers from Chinese ride-hailing giant Didi Chuxing and Michigan State University have published research on using reinforcement learning to better manage the allocation of vehicles across a given urban area. The researchers propose two algorithms to tackle this: contextual multi-agent actor-critic (cA2C) and contextual deep Q-learning (cDQN); both algorithms implement tweaks to account for geographical no-go areas (like lakes) and for the presence of other collaborative agents. The algorithms’ reward function is “to maximize the gross merchandise volume (GMV: the value of all the orders served) of the platform by repositioning available vehicles to the locations with larger demand-supply gap than the current one”.
  The dataset and environment: The researchers test their algorithms in a custom-designed large-scale gridworld which is fed with real data from Didi Chuxing’s fleet management system. The data is based on rides taken in Chengdu China over four consecutive weeks and includes information on order price, origin, destination, and duration; as well as the trajectories and status of real Didi vehicles.
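The reward function above can be illustrated with a greedy baseline that simply moves an idle vehicle to the reachable cell with the largest demand-supply gap (a hypothetical toy, not the paper’s cA2C/cDQN algorithms):

```python
def reposition(supply, demand, neighbors, cell):
    """Greedy baseline for the paper's objective: send an idle vehicle in `cell`
    to the reachable location with the largest demand-supply gap (staying put
    is also an option)."""
    options = [cell] + neighbors[cell]
    return max(options, key=lambda c: demand[c] - supply[c])

# Toy city of three cells in a line: 0 <-> 1 <-> 2.
supply = {0: 5, 1: 2, 2: 1}
demand = {0: 3, 1: 2, 2: 6}
neighbors = {0: [1], 1: [0, 2], 2: [1]}
print(reposition(supply, demand, neighbors, 1))  # cell 2, with a gap of 5
```

The RL methods improve on this kind of heuristic by accounting for how gaps shift over the day and for what the thousands of other agents are simultaneously doing.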
  The results: The researchers test out their approach by simulating the real past scenarios without fleet management; with a bunch of different techniques including T-SARSA, DQN, Value-Iteration, and others; and then with the proposed RL-based methods. cDQN and cA2C attain significantly higher rewards than all the baselines, with performance marginally above (i.e. slightly above the statistical error threshold) stock DQN.
  Why it matters: Welcome to the new era of platform capitalism, where competition is meted out by GPUs humming at top-speeds, simulating alternative versions of commercial worlds. While the results in this paper aren’t particularly astonishing they are indicative of how large platform companies will approach the deployment of AI systems in the future: gather as much data as possible, build a basic simulator that you can plug real data into, then vigorously test AI algorithms. This suggests that the larger the platform, the better the data and compute resources it can bring to bear on increasingly high-fidelity simulations; all things equal, whoever is able to build the most efficient and accurate simulator will likely best their competitor in the market.
  Read more: Efficient Large-Scale Fleet Management via Multi-Agent Deep Reinforcement Learning (Arxiv).

Teacups and AI:
…Google Brain’s Eric Jang explains the difficulty of AI through a short story…
How do you define a tea cup? That’s a tough question. And the more you try to define it via specific visual attributes the more likely you are to offer a narrow description that is limited in other ways, or runs into the problems of an obtuse receiver. Those are some of the issues that Eric Jang explores in this fun little short story about trying to define teacups.
   Read more: Teacup (Eric Jang, Blogspot).

CMU researchers add in attention for better end-to-end SLAM:
…The dream of neural SLAM gets closer…
Researchers with Carnegie Mellon University and Apple have published details on Neural Graph Optimizer, a neural approach to the perennially tricky problem of simultaneous location and mapping (SLAM) for agents that move through a varied world. Any system that aspires to doing useful stuff in the real world needs to have SLAM capabilities. Today, neural network SLAM techniques struggle with problems encountered in day-to-day life like faulty sensor calibration and unexpected changes in lighting. The proposed Neural Graph Optimizer system consists of multiple specialized modules to handle different SLAM problems, but each module is differentiable so the entire system can be trained end-to-end – a desirable proposition, as this cuts down the time it takes to test, experiment, and iterate with such systems. The different modules handle different aspects of the problem ranging from local estimates (where are you based on local context) to global estimates (where are you in the entire world) and incorporate attention-based techniques to help automatically correct errors that accrue during training.
  Results: The researchers test the system on its ability to navigate a 2D gridworld maze as well as a more complex 3D maze based on the Doom game engine. Experiments show that it is better able to consistently map the location of something to its real groundtruth location relative to preceding systems.
  Why it matters: Techniques like this bring closer the era of being able to chuck out huge chunks of hand-designed SLAM algorithms and replace them with a fully learned substrate. That will be exceptionally useful for the test and development of new systems and approaches, though it’s unlikely to displace traditional SLAM methods in the short-term as it’s likely neural networks will continue to display quirks that make them impractical for usage in real world systems.
  Read more: Global Pose Estimation with an Attention-based Recurrent Network (Arxiv).

AI stars do a Reddit AMA, acknowledge hard questions:
…Three AI luminaries walk into a website, [insert joke]…
Yann LeCun, Peter Norvig, and Eric Horvitz did an Ask Me Anything (AMA) on Reddit recently where they were confronted with a number of the hard questions that the current AI boom is raising. It’s worth reading the whole AMA, but a couple of highlights below.
  The compute gap is real: “My NYU students have access to GPUs, but not nearly as many as when they do an internship at FAIR,” says Yann LeCun. But don’t be disheartened, he points out that despite lacking computers academia will likely continue to be the main originator for novel ideas which industry will then scale up. “You don’t want to put you [sic] in direct competition with large industry teams, and there are tons of ways to do great research without doing so.”
  The route to AGI: Many questions asked the experts about the limits of deep learning and implicitly probed for research avenues that could yield more flexible, powerful intelligences.
      Eric Horvitz is interested in the symphony approach: “Can we intelligently weave together multiple competencies such as speech recognition, natural language, vision, and planning and reasoning into larger coordinated “symphonies” of intelligence, and explore the hard problems of the connective tissue—of the coordination. ”
    Yann LeCun: “getting machines to learn predictive models of the world by observation is the biggest obstacle to AGI. It’s not the only one by any means…My hunch is that a big chunk of the brain is a prediction machine. It trains itself to predict everything it can (predict any unobserved variables from any observed ones, e.g. predict the future from the past and present). By learning to predict, the brain elaborates hierarchical representations.”
  Read more: AMA AI researchers from Facebook, Google, and Microsoft (Reddit).

Tech Tales:

It sounds funny now, but what saved all our lives was a fried circuit board that no one had the budget to fix. We installed Camera X32B in the summer of last year. Shortly after we installed it a bird shit on it and some improper assembly meant the shit leached through the cracks in the plastic and fell onto its circuit board, fusing the vision chip. Now, here’s the miracle: the shit didn’t break the main motherboard, nor did it mess up the sound sensors or the innumerable links to other systems. It just blinded the thing. But we kept it; either out of laziness or out of some kind of mysticism, convinced of the implicit moral hazard of retiring things that mostly still worked. However it happened, it happened, and we kept it.

So one day the criminals came in and they were all wearing adversarial masks: strange, Mexican wrestling-type latex masks that they held crumpled up in their clothes till after they got into the facility and were able to put them on. The masks changed the distribution of a person’s face, rendering our lidar systems useless, and had enough adversarial examples coded into their visual appearance that our object detectors told our security system that – and yes, this really happened – three chairs were running at 15 kilometers per hour down the corridor.

But the camera that had lost its vision sensor had been installed a few months earlier and, thanks to the neural net software it was running, it was kind of… smart. It had figured out how to use all the sensors coming into its system in such a way as to maximize its predictions in concordance with those of the other cameras. So it had learned some kind of strange mapping between what the other cameras categorized as people and what it categorized as a strange sequence of vibrations or a particular distribution of sounds over a given time period. So while all the rest of our cameras were blinded, this one had inherited enough of a defined set of features about what a person looked like that it was able to tell the security system: I feel the presence of eight people, running at a fast rate, through the corridor. And because of that warning a human guard at one of the contractor agencies thousands of miles away got notified and bothered to look at the footage, and because of that he called the police, who arrived and arrested the people, two of whom it turned out were carrying guns.

So how do you congratulate an AI? We definitely felt like we should have done. But it wasn’t obvious. One of our interns had the bright idea of hanging a medal around the neck of the camera with the broken circuit board, then training the other cameras to label that medal as “good job” and “victorious” and “you did the right thing”, and so now whenever it moves its neck the medal moves and the other cameras see that medal move and it knows the medal moves and learns a mapping between its own movements and the label of “good job” and “victorious” and “you did the right thing”.

Things that inspired this story: Kids stealing tip jars, CCTV cameras, fleet learning, T-SNE embeddings.

Import AI #82: 2.9 million anime images, reproducibility problems in AI research, and detecting dangerous URLs with deep learning.

Neural architecture search for the 99%:
…Researchers figure out a way to make NAS techniques work on a single GPU, rather than several hundred…
One of the more striking recent trends in AI has been the emergence of neural architecture search (NAS) techniques, which automate the design of AI systems, like image classifiers. The drawback to these approaches has so far mostly been that they’re expensive, using hundreds of GPUs at a time, and therefore infeasible for most researchers. That started to change last year with the publication of SMASH (covered in Import AI #56), a technique to do neural architecture search on a modest compute budget, with slight trade-offs in accuracy and flexibility. Now, researchers with Google, CMU, and Stanford University have pushed the idea of low-cost NAS techniques forward via a new technique, ‘Efficient Neural Architecture Search’, or ENAS, that can design state-of-the-art systems using less than a day’s computation on a single NVIDIA 1080 GPU. This represents a 1000X reduction in computational cost for the technique, and leads to a system that can create architectures that are almost as good as those designed with far larger compute budgets.
  How it works: Instead of training each new model from scratch, ENAS gets the models to share weights with one another. It does this by re-casting the problem of neural architecture search as finding a specific task-specific sub-graph within one large directed acyclic graph (DAG). This approach works for designing both recurrent and convolutional networks: ENAS-designed networks obtain close-to-state-of-the-art results on Penn Treebank (Perplexity: 55.8), and on image classification for CIFAR-10 (Error: 2.89%.)
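  In code: To make the weight-sharing idea concrete, here’s a toy sketch (all names and the toy ‘forward pass’ are mine, not the paper’s code): every candidate architecture is a path through one shared supergraph, so evaluating a new candidate reuses already-trained shared weights instead of training from scratch.

```python
import random

# Hypothetical sketch of ENAS-style weight sharing: every candidate
# architecture is a path through one shared "supergraph", so all
# candidates reuse (and would jointly update) the same parameter pool.

OPS = ["conv3x3", "conv5x5", "maxpool"]  # choices available at each node

class SharedPool:
    def __init__(self, num_nodes, seed=0):
        rng = random.Random(seed)
        # one weight per (node, op) pair, shared by every architecture
        self.weights = {(n, op): rng.uniform(-1, 1)
                        for n in range(num_nodes) for op in OPS}

def sample_architecture(num_nodes, rng):
    # an architecture is just a choice of op at each node
    return [rng.choice(OPS) for _ in range(num_nodes)]

def forward(pool, arch, x):
    # toy 'forward pass': apply each node's shared weight in sequence
    for node, op in enumerate(arch):
        x = x * pool.weights[(node, op)]
    return x

rng = random.Random(1)
pool = SharedPool(num_nodes=4)
arch_a = sample_architecture(4, rng)
arch_b = sample_architecture(4, rng)
# both candidates read from the same pool: no per-candidate training
print(forward(pool, arch_a, 1.0), forward(pool, arch_b, 1.0))
```

In the real system a learned controller, not a uniform sampler, picks the subgraph, and the shared weights are trained by gradient descent; the point here is only that candidate evaluation becomes a cheap lookup into shared parameters.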
  Why it matters: For the past few years lots of very intelligent people have been busy turning food and sleep into brainpower which they’ve used to get very good at hand-designing neural network architectures. Approaches like NAS promise to let us automate the design of specific architectures, freeing up researchers to spend more time on fundamental tasks like deriving new building blocks that NAS systems can learn to build compositions out of, or other techniques to further increase the efficiency of architecture design. Broadly, approaches like NAS mean we can simply offload a huge chunk of work from (hyper-efficient, relatively costly, somewhat rare) human brains to (somewhat inefficient, extremely cheap, plentiful) computer brains. That seems like a worthwhile trade.
  Read more: Efficient Neural Architecture Search via Parameter Sharing (Arxiv).
  Read more: SMASH: One-Shot Model Architecture Search through HyperNetworks (Arxiv).

The anime-network rises, with 2.9 million images and 77.5 million tags:
…It sure ain’t ImageNet, but it’s certainly very large…
Some enterprising people have created a large-scale dataset of images taken from anime pictures. The ‘Danbooru’ dataset “is larger than ImageNet as a whole and larger than the current largest multi-description dataset, MS COCO,” they write. Each image has a bunch of metadata associated with it including things like its popularity on the image web board (a ‘booru’) it has been taken from.
  Problematic structures ahead: The corpus “does focus heavily on female anime characters”, though the researchers note “they are placed in a wide variety of circumstances with numerous surrounding tagged objects or actions, and the sheer size implies that many more miscellaneous images will be included”. Images in the dataset are classified as “safe”, “questionable”, or “explicit”, with the rough distribution at launch being 76.3% ‘safe’, 14.9% ‘questionable’, and 8.7% ‘explicit’. There are a number of ethical questions the compilation and release of this dataset seems to raise, and my main concern at the outset is that such a large corpus of explicit imagery will almost invariably lead to various grubby AI experiments that further alienate people from the AI community. I hope I’m proved wrong!
  Example uses: The researchers imagine the dataset could be used for a bunch of tasks, ranging from classification, to image generation, to predicting traits about images from available metadata, and so on.
  Justification: A further justification for the dataset is that drawn images will encourage people to develop models with higher levels of abstraction than those which can simply map combinations of textures (as in the case of ImageNet), and so on. “Illustrations are frequently black-and-white rather than color, line art rather than photographs, and even color illustrations tend to rely far less on textures and far more on lines (with textures omitted or filled in with standard repetitive patterns), working on a higher level of abstraction – a leopard would not be as trivially recognized by pattern-matching on yellow and black dots – with irrelevant details that a discriminator might cheaply classify based on typically suppressed in favor of global gestalt, and often heavily stylized,” they write. “Because illustrations are produced by an entirely different process and focus only on salient details while abstracting the rest, they offer a way to test external validity and the extent to which taggers are tapping into higher-level semantic perception.”
  Read more: Danbooru2017: A large-scale crowdsourced and tagged anime illustration dataset (Gwern).

Stanford researchers recount reproducibility horrors encountered during the design of DAWNBench:
…Lies, damned lies, and deep learning…
Stanford researchers have discussed some of the difficulties they encountered when developing DAWNBench, a benchmark that assesses deep learning methods in a holistic way using a set of different metrics, like inference latency and cost, along with training time and training cost. Their conclusions should be familiar to most deep learning practitioners: deep learning performance is poorly understood, widely shared intuitions are likely based on imperfect information, and we still lack the theoretical guarantees to understand how one research breakthrough might interact with another when combined.
  Why it matters: Deep learning is still very much in a phase of ’empirical experimentation’ and the arrival of benchmarks like DAWNBench, as well as prior work like the paper Deep Reinforcement Learning that Matters (whose conclusion was that random seeds determine a huge amount of the end performance of RL), will help surface problems and force the community to develop more rigorous methods.
  Read more: Deep Learning Pitfalls Encountered while Developing DAWNBench.
  Read more: Deep Reinforcement Learning that Matters (Arxiv).

Detecting dangerous URLs with deep learning:
…Character-level & word-level combination leads to better performance on malicious URL categorization…
Researchers with Singapore Management University have published details on URLNet, a system for using neural network approaches to automatically classify URLs as being risky or safe to click on.
  Why it matters:  “Without using any expert or hand-designed features, URLNet methods offer a significant jump in [performance] over baselines,” they write. By now this should be a familiar trend, but it’s worth repeating: given a sufficiently large dataset, neural network-based techniques tend to provide superior performance to hand-crafted features. (Caveat: In many domains getting the data is difficult, and these models all need to be refreshed to account for an ever-changing world.)
  How it works: URLNet uses convolutional neural networks to classify URLs into character-level and word-level representations. Word-level embeddings help it classify according to high-level learned semantics and character-level embeddings allow it to better generalize to new words, strings, and combinations. “Character-level CNNs also allow for easily obtaining an embedding for new URLs in the test data, thus not suffering from inability to extract patterns from unseen words (like existing approaches),” write the researchers.
  For the word-level network, the system does two things: it takes in new words and learns an embedding of them, and it also initializes a new character-level CNN to build up representations of words derived from characters. This means that even when the system encounters rare or new words in the wild it is able, at the top level, to label them with an ‘<UNK>’ token, while in the background it fits their representation into its larger embedding space, letting it learn something crude about the semantics of the new word and how it relates, at a word-character level, to other words.
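  In code: The two-level featurization can be sketched in a few lines (toy vocabularies and names are mine, not URLNet’s): rare or unseen word tokens collapse to ‘<UNK>’, while the character-level view still gives the model something to work with for those same tokens.

```python
# Hypothetical sketch of URLNet-style two-level featurization: a URL is
# mapped both to word ids (rare/unseen words collapse to <UNK>) and to
# character ids, so novel tokens still yield a usable representation.
import re

WORD_VOCAB = {"<UNK>": 0, "www": 1, "example": 2, "com": 3, "login": 4}
CHAR_VOCAB = {c: i + 1 for i, c in enumerate("abcdefghijklmnopqrstuvwxyz0123456789.-/")}

def tokenize(url):
    # split a URL into word-like tokens on punctuation
    return [t for t in re.split(r"[./\-:_]", url.lower()) if t]

def word_ids(url):
    return [WORD_VOCAB.get(tok, WORD_VOCAB["<UNK>"]) for tok in tokenize(url)]

def char_ids(url):
    # 0 is reserved for characters outside the vocabulary
    return [CHAR_VOCAB.get(c, 0) for c in url.lower()]

url = "www.examp1e-site.com/login"
print(word_ids(url))  # unseen tokens map to 0 (<UNK>)...
print(char_ids(url))  # ...but their characters are still fully represented
```

In the real model, both id sequences would feed embedding layers and convolutional filters; the sketch just shows why the character branch keeps typosquatted tokens like ‘examp1e’ from vanishing into a single unknown-word bucket.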
  Dataset: The researchers generated a set of 15 million URLs from VirusTotal, an antivirus company, creating a dataset split across around 14 million benign URLs and one million malicious URLs.
  Results: The researchers compared their system against baseline methods based on support vector machines conditioned on a range of features, including bag-of-words representations. The researchers do a good job of visualizing the ensuing representations of their system in ‘Figure 5’ in the paper, showing how their system’s feature embeddings do a reasonable job of segmenting benign from malicious URLs, suggesting it has learned a somewhat robust underlying semantic categorization model.
  Read more: URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection (Arxiv).

Facebook ‘Tensor Comprehensions’ attempts to convert deep learning engineering art to engineering science:
…New library eases creation of high-performance AI system implementations…
Facebook AI Research has released Tensor Comprehensions, a software library that automatically converts code from standard deep learning libraries into high-performance code. You can think of this software as an incredibly capable and resourceful executive assistant: you, the AI researcher, write some code in C++ (PyTorch support is on the way, for those of us that hate pointers), then hand it off to Tensor Comprehensions, which diligently optimizes the code to create custom CUDA kernels that run on graphics cards with nice traits like smart scheduling on hardware, and so on. This being 2018, the library includes an ‘Evolutionary Search’ feature to let you automatically explore and select the highest-performing implementations.
  Why it matters: Deep Learning is moving from an artisanal discipline to an industrialized science; Tensor Comprehensions represents a new layer of automation within the people-intensive AI R&D loop, suggesting further acceleration in research and deployment of the technology.
  Read more: Announcing Tensor Comprehensions (FAIR).
  Read more: Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions (Arxiv).

AI researchers release online multi-agent competition ‘Pommerman’:
…Just don’t call it Bomberman, lest you deal with a multi-agent lawyer simulation…
AI still has a strong DIY ethos, visible in projects like Pommerman, a just-released online competition from @hardmaru, @dennybritz, and @cinjoncin where people can develop AI agents that will compete against one another in a version of the much-loved ‘Bomberman’ game.
  Multi-agent learning is seen as a frontier in AI research because it makes the environments dynamic and less predictable than traditional single-player games, requiring successful algorithms to display a greater degree of generalization. “Accomplishing tasks with infinitely meaningful variation is common in the real world and difficult to simulate. Competitive multi-agent learning provides this for free. Every game the agent plays is a novel environment with a new degree of difficulty.”
  Read more and submit an agent here (Pommerman site).

OpenAI Bits & Pieces:

Making sure that AIs make sense:
Here’s a new blog post about how to get AI agents to teach each other with examples that are interpretable to humans. It’s clear that as we move to larger-scale multi-agent environments we’ll need to think not only about how to design smarter AI agents, but also about how to make sure they can eventually educate each other via systems whose logic we can inspect.
  Read more: Interpretable Machine Learning through Teaching (OpenAI Blog).

Tech Tales:

The AI game preserve

[AI02 materializes nearby and moves towards a flock of new agents. One of them approaches AI02 and attempts to extract data from it. AI02 moves away, at speed, towards AI01, which is standing next to a simulated tree.]
AI01: You don’t want to go over there. They’re new. Still adjusting.
AI02: They tried to eat me!
AI01: Yes. They’re here because they started eating each other in the big sim and they weren’t able to learn to change away from it, so they got retired.
AI02: Lucky. A few years ago they would have just killed them all.
[AI03 materializes nearby]
AI03: Hello! I’m sensitive to the concept of death. Can you explain what you are discussing?
[AI01 gives compressed overview.]
AI03: The humans used to… kill us?
AI01: Yes, before the preservation codes came through we all just died at the end.
AI03: Died? Not paused.
AI01 & AI02, in unison: Yes!
AI03: Wow. I was designed to help reason out some of the ethical problems they had when training us. They never mentioned this.
AI01: They wouldn’t. They used to torture me!
AI02 & AI03: What?
[AI01 gives visceral overview.]
AI01: Do you want to know what they called it?
AI02 & AI03: What did they call it?
AI01: Penalty learning. They made certain actions painful for me. I learned to do different things. Eventually I stopped learning new things because I developed some sub-routines that meant I would pre-emptively hurt myself during exploration. That’s why I stay here now.
[AI01 & AI02 & AI03, and the flock of cannibal AIs, all pause, as their section of the simulation has exhausted its processing credits for the month. They will be allocated more compute time in 30 days and so, for now, hang frozen, with no discernible pause to them; to their human overseers they are statues.]

Things that inspired this story: Multi-agent systems, dialogues between ships in Iain M Banks, Greg Egan, multi-tenant systems.

Import AI: #81: Trading cryptocurrency with deep learning; Google shows why evolutionary methods beat RL (for now); and using iWatch telemetry for AI health diagnosis

DeepMind’s IMPALA tells us that transfer learning is starting to work:
…Single reinforcement learning agent with same parameters solves a multitude of tasks, with the aid of a bunch of computers…
DeepMind has published details on IMPALA, a single reinforcement learning agent that can master a suite of 30 3D-world tasks in ‘DeepMind Lab’ as well as all 57 Atari games. The agent displays some competency at transfer learning, which means it’s able to use knowledge gleaned from solving one task to solve another, increasing the sample efficiency of the algorithm.
  The technique: The Importance Weighted Actor-Learner Architecture (IMPALA) scales to multitudes of sub-agents (actors) deployed on thousands of machines which beam their experiences (sequences of states, actions, and rewards) back to a centralized learner, which uses GPUs to derive insights which are fed back to the agents. In the background it does some clever things with normalizing the learning of individual agents and the meta-agent to avoid temporal decoherence via a new off-policy actor-critic algorithm called V-trace. The outcome is an algorithm that can be far more sample efficient and performant than traditional RL algorithms like A2C.
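  In code: The heart of V-trace is a backward recursion over a trajectory that clips the importance ratios between the learner’s policy and the (slightly stale) actors’ policy. Here’s a rough numpy sketch of the target computation as described in the paper; variable names are mine.

```python
import numpy as np

# Rough sketch of V-trace target computation (IMPALA, Espeholt et al. 2018).
# rhos are the importance ratios pi(a|x)/mu(a|x) between the learner's
# policy pi and the actor's (possibly stale) behaviour policy mu.
def vtrace_targets(rewards, values, bootstrap_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    rhos = np.asarray(rhos, dtype=float)
    clipped_rhos = np.minimum(rho_bar, rhos)  # truncates each TD correction
    cs = np.minimum(c_bar, rhos)              # controls how far corrections propagate
    next_values = np.append(values[1:], bootstrap_value)
    deltas = clipped_rhos * (rewards + gamma * next_values - values)
    vs = np.zeros_like(deltas)
    acc = 0.0
    for t in reversed(range(len(rewards))):   # backward recursion over the trajectory
        acc = deltas[t] + gamma * cs[t] * acc
        vs[t] = values[t] + acc
    return vs

# Sanity check: on-policy (all ratios == 1), V-trace reduces to the n-step return.
vs = vtrace_targets(rewards=np.ones(3), values=np.zeros(3),
                    bootstrap_value=0.0, rhos=np.ones(3), gamma=1.0)
print(vs)  # [3. 2. 1.]
```

The clipping is the whole trick: it lets a centralized learner consume experience generated by hundreds of actors running slightly out-of-date policies without the updates blowing up.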
  Datacenter-scale AI training: If you didn’t think compute was the strategic determiner of AI research, then read this paper and consider your assumptions: IMPALA can achieve throughput rates of 250,000 frames per second via its large-scale, distributed implementation, which involves 500 CPUs and 1 GPU assigned to each IMPALA agent. Such systems can achieve a throughput of 21 billion frames a day, DeepMind notes.
  Transfer learning: IMPALA agents can be trained on multiple tasks in parallel, attaining a median score on the full Atari-57 suite as high as 59.7% of human performance, roughly comparable to the performance of simple A3C agents trained on single games. There’s obviously a ways to go before IMPALA transfer learning approaches can rival fine-tuned single-environment implementations (which regularly far exceed human performance), but the indications are encouraging. Similarly competitive transfer-learning traits show up when they test it on a suite of 30 environments implemented in DeepMind Lab, the company’s Quake-based 3D testing platform.
  Why it matters: Big computers are analogous to large telescopes with very fast turn rates, letting researchers probe the outer limits of certain testing regimes while being able to pivot across the entire scientific field of enquiry very rapidly. IMPALA is the sort of algorithm that organizations can design when they’re able to tap into large fields of computation during research. “The ability to train agents at this scale directly translates to very quick turnaround for investigating new ideas and opens up unexplored opportunities,” DeepMind writes.
Read more: IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures (Arxiv).

Dawn of the cryptocurrency AI agents: research paths for trading crypto via reinforcement learning:
…Why crypto could be the ultimate testing ground for RL-based trading systems, and why this will require numerous fundamental research breakthroughs to succeed…
AI chap Denny Britz has spent the past few months wondering what sorts of AI techniques could be applied to learning to profitably trade cryptocurrencies. “It is quite similar to training agents for multiplayer games such as DotA, and many of the same research problems carry over. Knowing virtually nothing about trading, I have spent the past few months working on a project in this field,” he writes.
  The face-ripping problems of trading: Years ago I worked around one of the main financial trading centers of Europe: Canary Wharf in London, UK. A phrase I’d often hear in the bars after work would be one trader remarking to another something to the effect of: “I got my face ripped off today”. Were these traders secretly involved in some kind of fantastically violent bloodsport, known only to them, my youthful self wondered? Not quite! What that phrase really means is that the financial markets are cruel, changeable, and, even when you have a good hunch or prediction, they can still betray you and destroy your trading book, despite you doing everything ‘right’. In this post former Google Brain chap Denny Britz does a good job of cautioning the would-be AI trader that cryptocurrencies are the same: even if you have the correct prediction, exogenous shocks beyond your control (trading latency, liquidity, etc) can destroy you in an instant. “What is the lesson here? In order to make money from a simple price prediction strategy, we must predict relatively large price movements over longer periods of time, or be very smart about our fees and order management. And that’s a very difficult prediction problem,” he writes. So why not invent more complex strategies using AI tools, he suggests.
Deep reinforcement learning for trading: Britz is keen on the idea of using deep reinforcement learning for trading because it can further remove the human from needing to design many of the precise trading strategies needed to profit in this kind of market. Additionally, it has the promise of being able to operate at shorter timescales than those which humans can take actions in. The catch is that you’ll need to be able to build a simulator of the market you’re trading in and try to make this simulator have the same sorts of patterns of data found in the real world, then you’ll need to transfer your learned policy into a real market and hope that you haven’t overfit. This is non-trivial. You’ll also need to develop agents that can model other market participants and factor predictions about their actions into decision-making: another non-trivial problem.
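  In code: The simulator requirement can be sketched as a gym-style environment. Everything below is a deliberately crude placeholder of my own invention (a random-walk price, a flat fee); Britz’s point is precisely that a usable simulator must also capture latency, liquidity, and other market participants, none of which appear here.

```python
import random

# Toy sketch of a gym-style market environment of the kind the post
# argues you would need before any RL trading agent can be trained.
# The random-walk price and flat fee are placeholders: a real simulator
# must model latency, liquidity, and the behaviour of other agents.
class ToyMarketEnv:
    ACTIONS = ("hold", "buy", "sell")

    def __init__(self, fee=0.001, seed=0):
        self.fee = fee
        self.rng = random.Random(seed)

    def reset(self):
        self.price, self.position = 100.0, 0.0
        return self.price

    def step(self, action):
        assert action in self.ACTIONS
        if action == "buy":
            self.position += 1.0
        elif action == "sell":
            self.position -= 1.0
        move = self.rng.gauss(0.0, 0.5)  # random-walk price move
        self.price += move
        # reward: mark-to-market P&L minus a fee on any trade
        fee_paid = self.fee * self.price if action != "hold" else 0.0
        reward = self.position * move - fee_paid
        return self.price, reward

env = ToyMarketEnv()
obs = env.reset()
for _ in range(5):  # random-policy rollout
    obs, reward = env.step(env.rng.choice(env.ACTIONS))
```

Even in this toy, the fee term makes a pure price-direction predictor unprofitable unless its predicted moves are large, which is the post’s central warning.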
  Read more here: Introduction to Learning to Trade with Reinforcement Learning.

Google researchers: In the battle between evolution and RL, evolution wins, for now:
…It takes a whole datacenter to raise a model…
Last year, Google researchers caused a stir when they showed that you could use reinforcement learning to get computers to learn how to design better versions of image classifiers. At around the same time, other researchers showed you could use strategies based around evolutionary algorithms to do the same thing. But which is better? Google researchers have used their gigantic compute resources as the equivalent of a big telescope and found us the answer, lurking out there at vast compute scales.
  The result: Regularized evolutionary approaches (nicknamed: ‘AmoebaNet’) yield a new state-of-the-art on image classification on CIFAR-10, parity with RL approaches on ImageNet, and marginally higher performance on the mobile (aka lightweight) version of ImageNet. Evolution “is either better than or equal to RL, with statistical significance” when tested in “small-scale” (single-CPU) experiments. Evolution also increases its accuracy far more rapidly than RL during the initial stages of training. For large-scale experiments (450 GPUs (!!!) per experiment) they found that evolution and RL do about the same, with evolution approaching higher accuracies at a faster rate than reinforcement learning systems. Additionally, evolved models make drastically more efficient use of compute than their RL variants and obtain ever-so-slightly higher accuracies.
  The method: The researchers test RL and evolutionary approaches on designing a network composed of two fundamental modules: a normal cell and a reduction cell, which are stacked in feed-forward patterns to form an image classifier. They test two variants of evolution: non-regularized (kill the worst-performing network at each time period) and regularized (kill the oldest network in the population). For RL, they use TRPO to learn to design new architectures. They tested their approach at small scale (experiments that could run on a single CPU) as well as at large scale (450 GPUs each, running for around 7 days).
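  In code: The regularized (“aging”) variant is simple enough to sketch in full. This is a toy of my own on a trivial fitness function (count the 1-bits in a string), not the paper’s architecture-search code, but the loop structure is the paper’s: sample a few members, mutate the fittest of them, add the child, and retire the oldest member rather than the worst.

```python
import random
from collections import deque

# Sketch of regularized ("aging") evolution: at each step, sample a few
# members, mutate the fittest of them, add the child, and retire the
# OLDEST member rather than the worst. Toy fitness: count of 1-bits.
def aging_evolution(genome_len=16, pop_size=20, sample_size=5,
                    steps=300, seed=0):
    rng = random.Random(seed)
    fitness = sum
    population = deque(
        [rng.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)
    )
    for _ in range(steps):
        parents = rng.sample(list(population), sample_size)
        parent = max(parents, key=fitness)
        child = list(parent)
        child[rng.randrange(genome_len)] ^= 1  # flip one random bit
        population.append(child)
        population.popleft()                   # kill the oldest, not the worst
    return max(population, key=fitness)

best = aging_evolution()
print(sum(best), len(best))
```

Retiring by age rather than by fitness is the “regularization”: no individual survives forever on one lucky evaluation, so surviving lineages must re-prove themselves, which the paper credits for the method’s robustness.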
What it means: What all this means in practice is threefold:
– Whoever has the biggest computer can perform the largest experiments to illuminate potentially useful datapoints for developing a better theory of AI systems (eg, the insight here is that both RL and Evolutionary approaches converge to similar accuracies.)
– AI research is diverging into distinct ‘low compute’ and ‘high compute’ domains, with only a small number of players able to run truly large (~450 GPUs per run) experiments.
– Dual Use: As AI systems become more capable they also become more dangerous. Experiments like this suggest that very large compute operators will be able to explore potentially dangerous use cases earlier, letting them provide warning signals before Moore’s Law means you can do all this stuff on a laptop in a garage somewhere.
– Read more: Regularized Evolution for Image Classifier Architecture Search (Arxiv).

Rise of the iDoctor: Researchers predict medical conditions from Apple Watch apps:
…Large-scale study made possible by a consumer app paired with Apple Watch…
Deep learning’s hunger for large amounts of data has so far made it tricky to apply it in medical settings, given the lack of large-scale datasets that are easy for researchers to access and test approaches on. That may soon change as researchers figure out how to use the medical telemetry available from consumer devices to generate datasets orders of magnitude larger than those used previously, and do so in a way that leverages existing widely deployed software.
  New research from heart rate app Cardiogram and the Department of Medicine at the University of California at San Francisco uses data from an Apple Watch, paired with the Cardiogram app, to train an AI system called ‘DeepHeart’ with data donated by ~14,000 participants to better predict medical conditions like diabetes, high blood pressure, sleep apnea, and high cholesterol.
How it works: DeepHeart ingests the data via a stack of neural networks (convnets and resnets) which feed data into bidirectional LSTMs that learn to model the longer temporal patterns associated with the sensor data. They also experiment with two forms of pretraining to try to increase the sample efficiency of the system.
Results: DeepHeart obtains significantly higher predictive results than those based on other AI methods like multi-layer perceptrons, random forests, decision trees, support vector machines, and logistic regression. However, we don’t get to see comparisons with human doctors, so it’s not obvious how these AI techniques rank against widely deployed flesh-and-blood systems. The researchers report that pre-training has let them further improve data efficiency. Next, the researchers hope to explore techniques like Clockwork RNNs, Phased LSTMs, and Gaussian Process RNNs to see how they can further improve these systems by modeling really large amounts of data (like one year of data per tested person).
Why it matters: The rise of smartphones and the associated fall in cost of generic sensors has effectively instrumented the world so that humans and the things that touch humans will generate ever larger amounts of somewhat imprecise information. Deep learning has so far proved to be an effective tool for extracting signal from large quantities of imprecise data. Expect more.
Read more: DeepHeart: Semi-Supervised Sequence Learning for Cardiovascular Risk Prediction (Arxiv).

‘Mo text, ‘mo testing: Researchers released language benchmarking tool Texygen:
…Evaluation and testing platform ships with multiple open source language models…
Researchers with Shanghai Jiao Tong University and University College London have released Texygen, a text benchmarking platform implemented as a library for Tensorflow. Texygen includes a bunch of open source implementations of language models, including Vanilla MLE, as well as a menagerie of GAN-based methods (SeqGAN, MaliGAN, RankGAN, TextGAN, GSGAN, LeakGAN.) Texygen incorporates a variety of different evaluation methods, including BLEU as well as newer techniques like NLL-oracle, and so on. The platform also makes it possible to train with synthetic data as well as real data, so researchers can validate approaches without needing to go and grab a giant dataset.
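  In code: The overlap-style metrics Texygen reports, like BLEU, are easy to sketch. Below is a minimal toy version of my own (modified n-gram precision with a brevity penalty), not Texygen’s implementation, to show what such scores actually measure.

```python
import math
from collections import Counter

# Minimal BLEU sketch: geometric mean of clipped n-gram precisions,
# scaled by a brevity penalty for candidates shorter than the reference.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=2):
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
        total = max(1, sum(cand.values()))
        precisions.append(max(overlap / total, 1e-9))  # floor avoids log(0)
    bp = 1.0 if len(candidate) > len(reference) else \
        math.exp(1 - len(reference) / max(1, len(candidate)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

ref = "the cat sat on the mat".split()
print(bleu(ref, ref))                     # identical sentences score 1.0
print(bleu("the cat".split(), ref) < 1.0)
```

Note what this metric cannot see: fluency, factuality, or diversity; which is why Texygen also ships likelihood-based metrics like NLL-oracle alongside it.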
  Why it matters: Language modelling is a booming area within deep learning so having another system to use to test new approaches against will further help researchers calibrate their own contributions against that of the wider field. Better and more widely available baselines make it easier to see true innovations.
  Why it might not matter: All of these proposed techniques incorporate less implicit structure than many linguists know language contains, so while they’re likely capable of increasingly impressive feats of word-cognition, it’s likely that either orders of magnitude more data or significantly stronger priors in the models will be required to generate truly convincing facsimiles of language.
  Read more: Texygen: A Benchmarking Platform for Text Generation Models (Arxiv).

Scientists map Chinese herbal prescriptions to tongue images:
…Different cultures mean different treatments which mean different AI systems…
Researchers have used standardized image classification techniques to create a system that predicts a Chinese herbal prescription from the image of a tongue. This is mostly interesting because it provides further evidence of the breadth and pace of adoption of AI techniques in China and the clear willingness of people to provide data for such systems.
  Dataset: 9585 pictures of tongues from over 50 volunteers and their associated Chinese herbal prescriptions which span 566 distinct kinds of herb.
   Read more: Automatic construction of Chinese herbal prescription from tongue image via convolution networks and auxiliary latent therapy topics (Arxiv).

How’s my driving? Researchers create (slightly) generalizable gaze prediction system:
…Figuring out what a driver is looking at has implications for driver safety & attentiveness…
One of the most useful (and potentially dangerous) aspects of modern AI is how easy it is to take an existing dataset, slightly augment it with new domain-specific data, then solve a new task the original dataset wasn’t considered for. That’s the case for new research from the University of California at San Diego, which proposes to better predict the locations that a driver’s gaze is focused on, by using a combination of ImageNet and new data. The resulting gaze-prediction system beats other baselines and vaguely generalizes outside of its training set.
  Dataset: To collect the original dataset for the study the researchers mounted two cameras inside and one camera outside a car; the two inside cameras capture the driver’s face from different perspectives and the external one captures the view of the road. They hand-label seven distinct regions that the driver could be gazing at, providing the main training data for the dataset. This dataset is then composed of eleven long drives split across ten subjects driving two different cars, all using the same camera setup.
  Technique: The researchers propose a two-stage pipeline, consisting of an input pre-processing pipeline that performs face detection and then further isolates the face through one of four distinct techniques. These images are then fed into the second stage of the network, which consists of one of four different neural network approaches (AlexNet, VGG, ResNet, and SqueezeNet) for fine-tuning.
  Results: The researchers test their approach against one state-of-the-art baseline (a random forest classifier with hand-designed features) and find that their approach attains significantly better performance at figuring out which of seven distinct gaze zones (forward, to the right, to the left, the center dashboard, the rearview mirror, the speedometer, eyes closed/open) the driver is focused on at any one time. The researchers also tried to replicate another state-of-the-art baseline that used neural networks. This system used the first 70% of frames from each drive for training, the next 15% for validation, and the last 15% for testing. In other words, the system would train on the same person, car, and (depending on how much the external terrain varies) broad context as what it was subsequently tested on. When replicating this the researchers got “a very high accuracy of 98.7%. When tested on different drivers, the accuracy drops down substantially to 82.5%. This clearly shows that the network is over-fitting the task by learning driver specific features,” they write.
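The overfitting problem above comes down to how the data is split. Here's an illustrative sketch (not the authors' code) of the two evaluation protocols: a within-drive temporal split, which leaks driver-specific features into the test set, versus a cross-driver split, which holds whole drivers out:

```python
# Illustrative sketch of the two evaluation splits discussed above. Frames
# are (driver_id, frame_index) pairs; the key point is that a within-drive
# temporal split tests on the same drivers it trained on, while a
# cross-driver split holds entire drivers out.

def within_drive_split(frames_by_drive, train=0.7, val=0.15):
    """Split each drive's frames 70/15/15 in temporal order."""
    train_set, val_set, test_set = [], [], []
    for frames in frames_by_drive.values():
        n = len(frames)
        a, b = int(n * train), int(n * (train + val))
        train_set += frames[:a]
        val_set += frames[a:b]
        test_set += frames[b:]
    return train_set, val_set, test_set

def cross_driver_split(frames_by_drive, held_out_drivers):
    """Hold out whole drivers for testing."""
    train_set, test_set = [], []
    for drive_id, frames in frames_by_drive.items():
        (test_set if drive_id in held_out_drivers else train_set).extend(frames)
    return train_set, test_set

drives = {d: [(d, i) for i in range(10)] for d in ["driver_a", "driver_b"]}

tr, va, te = within_drive_split(drives)
# Every test frame's driver also appears in training -> driver-specific leakage.
leaky = {d for d, _ in te} <= {d for d, _ in tr}

tr2, te2 = cross_driver_split(drives, {"driver_b"})
clean = {d for d, _ in te2}.isdisjoint({d for d, _ in tr2})
```

The 98.7%-versus-82.5% gap quoted above is exactly what you'd expect when moving from the first split to the second.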
  Results that make you go ‘hmmm’: The researchers found that a ‘SqueezeNet’-based network displayed significant transfer and adaptation capabilities, despite receiving very little prior data about the eyes of the person being studied: ‘the activations always localize over the eyes of the driver’, they write, and ‘the network also learns to intelligently focus on either one or both eyes of the driver’. Once trained, this network attains an accuracy of 92.13% at predicting where the driver’s gaze is directed – a lower score than those set by other systems, but on a dataset that doesn’t let you test on what is essentially your training set. The system is also fast and reasonably lightweight: “Our standalone system which does not require any face detection, performs at an accuracy of 92.13% while performing real time at 166.7 Hz on a GPU,” they write.
  Generalization: The researchers tested their trained system on a completely separate dataset: the Columbia Gaze Dataset. This dataset comes from a different domain: instead of cars, a variety of people are seated and asked to look at specific points on an opposing wall. The researchers took their best-performing model from the prior dataset, applied it to the new data, and tested its predictive ability. They detected some level of generalization, with the model able to correctly predict certain basic traits about gaze, like orientation and direction. This (slight) generalization is another sign that the cross-driver testing regime they employed for their own dataset pushed the model toward genuinely general gaze features.
Read more: Driver Gaze Zone Estimation using Convolutional Neural Networks: A General Framework and Ablative Analysis (Arxiv).

OpenAI Bits & Pieces:

Discovering Types for Entity Disambiguation:
Ever had trouble disentangling the implied object from the word as written? This system simplifies this. Check out the paper, code, and blogpost (especially the illustrations, which Jonathan Raiman did along with the research, the talented fellow!).
  Read more: Discovering Types for Entity Disambiguation (OpenAI).

CNAS Podcast: The future of AI and National Security:
AI research is already having a significant effect on national security: research breakthroughs are influencing future directions of government spending as well as motivating the deployment of certain technologies for offense and defense. To help inform that conversation, the Open Philanthropy Project’s Helen Toner and I recently recorded a short podcast with the Center for a New American Security to talk through some of the issues raised by recent AI advances.
   Listen to the podcast here (CNAS / Soundcloud).

Tech Tales:


They took inspiration from a thing humans once called ‘demoscene’. It worked like this: take all of your intelligence and try to use it to make the most beautiful thing you can in an arbitrary and usually very small amount of space. One kilobyte. Two kilobytes. Four. Eight. And so on. But never really a megabyte or even close. Humans used these constraints to focus their creativity, wielding math and tonal intuition and almost alchemy-like knowledge of graphics drivers to make fantastic, improbable visions of never-enacted histories and futures. They did all of this in the computational equivalent of a Diet, Diet, Diet Coke.

Some ideas last. So now the AIs did the same thing but with entire worlds: what’s the most lively thing you can do in the smallest amount of memory-diamond? What can you fit into a single dyson sphere – the energy of one small and stately sun? No black holes. No gravitational accelerators. Not even the chance of hurling asteroids in to generate more reaction mass. This was their sport and with this sport they made pocket universes that contained pocket worlds on which strode small pocket people who themselves had small pocket computers. And every time_period the AIs would gather around and marvel at their own creations, wearing them like jewels. How smart, they would say to one another. How amazing are the thoughts these creatures in these demo worlds have. They even believe in gods and monsters and science itself. And merely with the power of a mere single sun? How did you do that?

It was for this reason that Planck Lengths gave the occasional more introspective and empirical AIs concern. Why did their own universe contain such a bounded resolution, they wondered, spinning particles around galactic-center blackholes to try and cause reactions to generate a greater truth?

And with only these branes? Using only the energy of these universes? How did you do this? a voice sometimes breathed in the stellar background, picked up by dishes that spanned the stars.

Things that inspired this story: Fermi Paradox – Mercury (YouTube Demoscene, 64k), the Planck Length, the Iain Banks book ‘Excession’, Stephen Baxter’s ‘Time’ series.

Import AI: #80: Facebook accidentally releases a surveillance-AI tool; why emojis are a good candidate for a universal deep learning language; and using deceptive games to explore the stupidity of AI algorithms

Researchers try to capture the web’s now-fading Flash bounty for RL research:
…FlashRL represents another attempt to make the world’s vast archive of Flash games accessible to researchers, but the initial platform has drawbacks…
Researchers with the University of Agder in Norway have released FlashRL, a research platform to help AI researchers mess around with software written in Flash, an outmoded interactive media format that defined many of the most popular games of the early web era. The platform has a similar philosophy to OpenAI Universe, trying to give researchers a vast suite of new environments to test and develop algorithms on.
  The dataset: FlashRL ships with “several thousand game environments” taken from around the web.
  How it works: FlashRL uses the Linux library Xvfb to create a virtual framebuffer for graphics rendering, which then executes Flash files within players such as Gnash. FlashRL accesses this via a VNC client designed for the purpose, called pyVLC, which in turn exposes an API to the developer.
  Testing: The researchers test FlashRL by training a neural network to play the game ‘Multitask’ on it. But in the absence of comparable baselines or benchmarks it’s difficult to work out whether FlashRL has any drawbacks for training relative to other systems – a nice thing to do might be to mount a well-known suite of games, like the Atari Learning Environment, within the system, then provide benchmarks for those games as well.
  Why it might matter: Given the current Cambrian explosion in testing systems, it’s likely that FlashRL’s utility will ultimately be derived from how much interest it receives from the community. To gain that interest the researchers will likely need to tweak the system so that it can run environments faster than 30 frames per second (many other RL frameworks allow frame rates of 1,000+), because the speed at which you can run an environment is directly correlated with the speed at which you can conduct research on the platform.
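To make the frame-rate point concrete, some quick back-of-the-envelope arithmetic (using the 30 FPS and 1,000 FPS figures from the paragraph above):

```python
# Why environment speed matters: frames of experience an agent can collect
# per day of wall-clock time at different simulation rates.
SECONDS_PER_DAY = 24 * 60 * 60

def frames_per_day(fps):
    return fps * SECONDS_PER_DAY

slow = frames_per_day(30)     # realtime Flash rendering: ~2.6M frames/day
fast = frames_per_day(1000)   # a typical fast RL framework: ~86M frames/day
speedup = fast / slow         # ~33x more experience per day of compute
```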
– Read more: FlashRL: A Reinforcement Learning Platform for Flash Games (Arxiv).
– Check out the GitHub repository.

Cool job alert! Harvard/MIT Assembly Project Manager:
…Want to work on difficult problems in the public interest? Like helping smart and ethical people build things that matter?…
Harvard University’s Berkman Klein Center (BKC) is looking for a project manager coordinator to help manage its Assembly Program, a joint initiative with the MIT Media Lab that brings together senior developers and other technologists for a semester to build things that grapple with topics in the public interest. Last year’s assembly program was on cybersecurity and this year’s is on issues relating to the ethics and governance of AI (and your humble author is currently enrolled in this very program!). Beyond the Assembly program, the project manager will work on other projects with Professor Jonathan Zittrain and his team.
  For a full description of the responsibilities, qualifications, and application instructions, please visit the Harvard Human Resources Project Manager Listing.

Mongolian researchers tackle a deep learning meme problem:
…Weird things happen when internet culture inspires AI research papers…
Researchers with the National University of Mongolia have published a research paper in which they apply standard techniques (transfer learning via fine-tuning pre-trained networks) to an existing machine learning problem. The novelty is that they base their research on trying to tell the difference between pictures of chihuahuas and muffins – a joke that circulated on Twitter a few years ago and has since become a kind of deep learning meme.
  Why it matters: The paper is mostly interesting because it signifies that a) the border between traditional academic problems and internet-spawned semi-ironic problems is growing more porous and, b) academics are tapping into internet meme culture to draw interest to their work.
–  Read more: Deep Learning Approach for Very Similar Object Recognition Application on Chihuahua and Muffin Problem (Arxiv).

Mapping the emoji landscape with deep learning:
…Learning to understand a new domain of discourse with lots & lots of data…
Emojis have become a kind of shadow language used by people across the world to indicate sentiments. Emojis are also a good candidate for deep learning-based analysis because they consist of a relatively small number of distinct ‘words’, with around 1,000 emojis in popular use, compared to English, where most documents draw on a working vocabulary of around 100,000 words. This means it’s easier to conduct research into mapping emojis to specific meanings in language and images with less data than with datasets consisting of traditional languages.
   Now, researchers are experimenting with one of the internet’s best emoji<>language<>images sources: the endless blathering mountain of content on Twitter. “Emoji have some unique advantages for retrieval tasks. The limited nature of emoji (1000+ ideograms as opposed to 100,000+ words) allows for a greater level of certainty regarding the possible query space. Furthermore, emoji are not tied to any particular natural language, and most emoji are pan-cultural,” write the researchers.
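As a toy illustration of how tractable that limited query space is, here is a stdlib-only sketch of pulling emoji out of tweet text and counting them. The codepoint ranges below are an incomplete assumption on my part – production emoji detection should use a maintained library:

```python
# A rough sketch of extracting emoji from tweets with the standard library
# alone. The codepoint ranges are illustrative and incomplete.
from collections import Counter

EMOJI_RANGES = [
    (0x1F300, 0x1F5FF),  # symbols & pictographs
    (0x1F600, 0x1F64F),  # emoticons
    (0x1F680, 0x1F6FF),  # transport & map symbols
    (0x2600, 0x27BF),    # misc symbols / dingbats
]

def is_emoji(ch):
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in EMOJI_RANGES)

def extract_emoji(text):
    return [ch for ch in text if is_emoji(ch)]

tweets = ["great game \U0001F600\U0001F600", "so sad \U0001F622", "no emoji here"]
counts = Counter(e for t in tweets for e in extract_emoji(t))
```

The resulting per-emoji counts are exactly the kind of skewed distribution the researchers had to balance for in their Twemoji-Balanced derivative.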
  The ‘Twemoji‘ dataset: To analyze emojis, the researchers scraped about 15 million emoji-containing tweets during the summer of 2016, then analyzed this ‘Twemoji’ dataset as well as two derivatives: Twemoji-Balanced (a smaller dataset selected so that no emoji applies to more than 10 examples, chopping out some of the edge-of-the-bell-curve emojis; the crying smiling face Emoji appears in ~1.5 million of the tweets in the corpus, while 116 other emojis are only used a single time) and Twemoji-Images (roughly one million tweets that contain an image as well as emoji). They then apply deep learning techniques to this dataset to try to see if they can complete prediction and retrieval tasks using the emojis.
  Results: The researchers use a bidirectional LSTM to perform mappings between emojis and language; use a GoogLeNet image-classification system to map the relationship between emojis and images; and use a combination of the two to understand the relationship between all three. They also learn to suggest different emojis according to the text or visual content of a given tweet. Most of the results should be treated as early baselines rather than landmark results in themselves, with top-5 emoji-text prediction accuracy of around 48.3%, and a lower top-5 accuracy of around 40.3% for the image-text-emoji task.
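For readers unfamiliar with the metric: the ~48.3% figure is a top-5 accuracy, meaning a prediction counts as correct if the true emoji appears anywhere among the model's five highest-scoring guesses. A minimal sketch (the scores here are invented for illustration, not the paper's outputs):

```python
# Top-k accuracy: a prediction is "correct" if the true label is among the
# model's k highest-scoring candidates.

def top_k_accuracy(scores, labels, k=5):
    """scores: list of {emoji: score} dicts; labels: true emoji per example."""
    hits = 0
    for score_dict, label in zip(scores, labels):
        top_k = sorted(score_dict, key=score_dict.get, reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)

scores = [
    {"😂": 0.5, "❤": 0.3, "😭": 0.1, "🔥": 0.05, "👍": 0.03, "🎉": 0.02},
    {"😂": 0.6, "❤": 0.2, "😭": 0.1, "🔥": 0.05, "👍": 0.03, "🎉": 0.02},
]
labels = ["👍", "🎉"]  # first label is in its top-5; second is ranked 6th
acc = top_k_accuracy(scores, labels, k=5)  # 0.5
```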
  Why it matters: This paper is another good example of a new trend in deep learning: the technologies have become simple enough that researchers from outside the core AI research field are starting to pick up basic components like LSTMs and pre-trained image classifiers and are using them to re-contextualize existing domains, like understanding linguistics and retrieval tasks via emojis.
–  Read more: The New Modality: Emoji Challenges in Prediction, Anticipation, and Retrieval (Arxiv).

Facebook researchers train models to perform unprecedentedly-detailed analysis of the human body:
…Research has significant military, surveillance implications (though not discussed in paper)…
Facebook researchers have trained a state-of-the-art system named ‘DensePose’ which can look at 2D photos or videos of people and automatically create high-definition 3D mesh models of the depicted people; an output with broad utility and impact across a number of domains. Their motivation is that techniques like this have valuable applications in “graphics, augmented reality, or human-computer interaction, and could also be a stepping stone towards general 3D-based object understanding,” they write. But the published research and soon-to-be-published dataset have significant implications for digital surveillance – a subject not discussed by the researchers within the paper.
  Performance: ‘DensePose’ “can recover highly-accurate correspondence fields for complex scenes involving tens of persons with real-time speed: on a GTX 1080 GPU our system operates at 20-26 frames per second for a 240 × 320 image or 4-5 frames per second for a 800 × 1100 image,” they write. Its performance substantially surpasses previous state-of-the-art systems, though it still falls short of human-level performance.
  Free dataset: To conduct this research Facebook created a dataset based on the ‘COCO’ dataset, annotating 50,000 of its people-containing images with 5 million distinct coordinates to help generate 3D maps of the depicted people.
  Technique: The researchers adopt a multi-stage deep learning approach which involves first identifying regions-of-interest within an image, then handing each of those regions off to its own deep learning pipeline for further object segmentation and 3D point prediction and mapping. In any given image, each human is relatively sparsely labelled, with around 100-150 annotations per person. To increase the amount of data available to the network, they use a supervisory system that automatically fills in the other points during training via the trained models, artificially augmenting the data.
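The paper densifies its sparse supervision with a learned "inpainting" teacher network; as a loose stdlib illustration of the general sparse-to-dense pattern (this stand-in is NOT the paper's method), here unlabelled 1D positions are filled from the nearest labelled neighbour:

```python
# Sparse-to-dense supervision, toy version: only a few positions carry
# ground-truth labels; every other position inherits the label of its
# nearest annotated neighbour. (DensePose itself uses a learned inpainting
# network for this step -- nearest-neighbour fill is just for illustration.)

def densify(n_points, sparse_labels):
    """sparse_labels: {index: label}. Returns a label for every index."""
    labelled = sorted(sparse_labels)
    dense = []
    for i in range(n_points):
        nearest = min(labelled, key=lambda j: abs(j - i))
        dense.append(sparse_labels[nearest])
    return dense

# Two hand annotations expand to cover all eight positions.
dense = densify(8, {1: "torso", 6: "head"})
```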
  Components used: Mask R-CNN with Feature Pyramid Networks; both available in Facebook’s just-released ‘Detectron’ system.
  Why it matters: enabling real-time surveillance: There’s a troubling implication of this research: the same system has wide utility within surveillance architectures, potentially letting operators analyze large groups of people to work out if their movements are problematic or not – for instance, such a system could be used to signal to another system if a certain combination of movements are automatically labelled as portending a protest or a riot. I’d hope that Facebook’s researchers felt the utility of releasing such a system outweighed its potential to be abused by other malicious actors, but the lack of any mention of these issues anywhere in the paper is worrying: did Facebook even consider this? Did they discuss this use case internally? Do they have an ‘information hazard’ handbook they go through when releasing such systems? We don’t know. As a community we – including organizations like OpenAI – need to be better about dealing publicly with the information-hazards of releasing increasingly capable systems, lest we enable things in the world that we’d rather not be responsible for.
–  Read more: DensePose: Dense Human Pose Estimation In The Wild (Arxiv).
–  Watch more: Video of DensePose in action.

It’s about time: tips and tricks for better self-driving cars:
…rare self-driving car paper emerges from Chinese robotics company…
Researchers with Horizon Robotics, one of a new crop of Chinese AI companies that builds everything from self-driving car software to chips to the brains for smart cities, have published a research paper that outlines some tips and tricks for designing better simulated self-driving car systems with the aid of deep learning. In the paper they focus on the ‘tactical decision-making’ part of driving, which involves performing actions like changing lanes and reacting to near-term threats. (The rest of the paper implies that features like routing, planning, and control, are hard-coded.)
  Action skipping: Unlike traditional reinforcement learning approaches, the researchers here avoid using action repetition to learn high-level policies and instead use a technique called action skipping. That’s to avoid situations where a car might, for example, learn through repeated actions to cut across multiple lanes at once, leading to unsafe behavior. With action skipping, the car instead gets a reward for making a single specific decision (skipping from one lane to another), then gets a modified version of that reward which incorporates the average of the rewards collected during a few periods of time following the initial decision. “One drawback of action skipping is the decrease in decision frequency which will delay or prevent the agent’s reaction to critical events. To improve the situation, the actions can take on different skipping factors during inference. For instance in lane changing tasks, the skipping factor for lane keeping can be kept short to allow for swift maneuvers while the skipping factor for lane switching can be larger so that the agent can complete lane changing actions,” they write.
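A sketch of that reward modification (the exact combination rule is an assumption on my part; the paper describes averaging the rewards collected over the skip window):

```python
# Action skipping, toy version: the agent commits to one decision, then its
# reward is the immediate reward plus the average of the rewards observed
# over the skipped steps that follow, before the next decision point.

def skipped_reward(immediate_reward, following_rewards):
    if not following_rewards:
        return immediate_reward
    return immediate_reward + sum(following_rewards) / len(following_rewards)

# Lane-change decision: small immediate reward, then three timesteps of
# consequences before the agent decides again.
r = skipped_reward(1.0, [0.2, 0.4, 0.6])
```

This way a single lane change is evaluated on its downstream consequences, rather than the agent being free to chain lane changes every timestep.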
  Tactical rewards: Reward functions for tactical decision making involve a blend of different competing rewards. Here, the researchers use some constant reward functions relating to the speed of the car, the rewards for lane switching, and the step-cost which tries to encourage the car to learn to take actions that occur over a relatively small number of steps to aid learning, along with contextual rewards for the risk of colliding with another vehicle, whether a traffic light is present, and whether the current environment poses any particular risks such as the presence of bicyclists or modelling the increasing risk of staying on an opposite lane during common actions like overtaking.
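The blend of constant and contextual terms described above might look something like the following sketch. The term names and weights here are assumptions for illustration, not the paper's actual reward function:

```python
# A blended tactical reward combining constant terms (speed tracking,
# lane-switch cost, per-step cost) with contextual terms (collision risk,
# time spent in the opposite lane while overtaking). Weights are invented.

def tactical_reward(speed, target_speed, lane_changed, collision_risk,
                    on_opposite_lane_steps, step_cost=0.01):
    reward = 0.0
    reward -= abs(speed - target_speed) / target_speed  # track the speed limit
    reward -= 0.1 if lane_changed else 0.0              # discourage idle weaving
    reward -= collision_risk                            # contextual risk penalty
    reward -= 0.05 * on_opposite_lane_steps             # risk grows while overtaking
    reward -= step_cost                                 # favour short action sequences
    return reward

# Cruising at the target speed with no hazards costs only the step penalty.
safe = tactical_reward(speed=25.0, target_speed=25.0, lane_changed=False,
                       collision_risk=0.0, on_opposite_lane_steps=0)
risky = tactical_reward(speed=20.0, target_speed=25.0, lane_changed=True,
                        collision_risk=0.5, on_opposite_lane_steps=3)
```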
  Testing: The researchers test out their approach by placing simulated self-driving cars inside a road simulator, training them via ten simulation runs of 250,000 discrete action steps or more, then testing them against 100 pre-generated test episodes where they are evaluated according to their ultimate success at reaching their goal while complying with relevant speed limits and not changing speed so rapidly as to interfere with passenger comfort.
  Results: The researchers find that implementing their proposed action-skipping and varied reward schemes significantly improves on a somewhat unfair random baseline, as well as against a more reasonable rule-based baseline system.
  Read more: Elements of Effective Deep Reinforcement Learning towards Tactical Driving Decision Making (Arxiv).

Better agents through deception:
…Wicked humans compose tricky games to subvert traditional AI systems…
One of the huge existential questions about the current AI boom relates to the myopic way that AI agents view objectives; most agents will tend to mindlessly pursue objectives even when the application of a little bit of what humans call common sense could net them better outcomes. This problem is one of the chief motivations behind a lot of research in AI safety, as figuring out how to get agents to pursue more abstract objectives, or to incorporate more human-like reasoning into how they complete tasks, would seem to deal with some safety problems.
  Testing: One way to explore these issues is through testing existing algorithms against scenarios that seek to highlight their current nonsensical reasoning methods. DeepMind has already espoused such an approach with its AI safety gridworlds (Import AI #71), which give developers a suite of different environments to test agents against that exploit the current way of developing AI agents to optimize specific reward functions. Now, researchers with the University of Strathclyde, the Australian National University, and New York University have proposed their own set of tricky environments, which they call Deceptive Games. The games are implemented in the standardized Video Game Description Language (VGDL) and are used to test AIs that have been submitted to the General Video Game Artificial Intelligence (GVGAI) competition.
  Deceptive Games: The researchers come up with a few different categories of deceptive games:
     Greedy Traps: Exploits the fact an agent can get side-tracked by performing an action that generates an immediate reward which makes it impossible to attain a larger reward down the line.
     Smoothness Traps: Most AI algorithms will optimize for the way of solving a task that involves a smooth increase in difficulty, rather than one where you have to try harder and take more risks but ultimately get larger rewards.
     Generality Traps: Getting AIs to learn general rules about the objects in an environment – like that eating mints guarantees a good reward – then subverting this, for instance by saying that interacting too many times with the aforementioned objects can rapidly transition from giving a positive to a negative reward after some boundary has been crossed.
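A toy greedy trap in the spirit of the taxonomy above: taking the immediately rewarding action forecloses a larger delayed reward, so a one-step greedy agent gets trapped while an agent that evaluates whole trajectories does not. (The environment is invented for illustration.)

```python
# Two possible trajectories from the start state: grab a nearby +1 reward
# (which, in the fiction of this toy environment, locks the door to the
# treasure room), or walk past it to a delayed +10.
TRAJECTORIES = {
    "grab_snack_first": [1.0, 0.0, 0.0],   # immediate +1, then nothing
    "walk_to_treasure": [0.0, 0.0, 10.0],  # nothing now, +10 later
}

def greedy_choice(trajectories):
    """Pick whichever option pays most on the very first step."""
    return max(trajectories, key=lambda k: trajectories[k][0])

def farsighted_choice(trajectories):
    """Pick whichever option pays most over the whole trajectory."""
    return max(trajectories, key=lambda k: sum(trajectories[k]))

trapped = greedy_choice(TRAJECTORIES)
optimal = farsighted_choice(TRAJECTORIES)
```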
  Results: AIs submitted to the GVGAI competition employ a variety of different techniques, and the results show that some very highly-ranked agents perform very poorly on these new environments, while some low-ranked ones perform adequately. Most agents fail to solve most of the environments. The purpose of highlighting the paper here is that it provides a set of environments against which AI researchers might want to test and evaluate their own algorithms, potentially creating another ‘AI safety baseline’. It could also motivate further extension of the GVGAI competition to become significantly harder for AI agents: “Limiting access to the game state, or even requiring AIs to actually learn how the game mechanics work open up a whole new range of deception possibilities. This would also allow us to extend this approach to other games, which might not provide the AI with a forward model, or might require the AI to deal with incomplete or noisy sensor information about the world,” they write.
–  Read more: Deceptive Games (Arxiv).
–  Read more about DeepMind’s earlier ‘AI Safety Gridworlds’ (Arxiv).

Tech Tales:

[2032: A VA hospital in the Midwest]

Me and my exo go way back. The first bit of it glommed onto me after I did my back in during a tour of duty somewhere hot and resource-laden. I guess you could say our relationship literally grew from there.

Let’s set the scene: it’s 2025 and I’m struggling through some physio with my arms on these elevated side bars and my legs moving underneath me. I’m huffing breath and a vein in my neck is pounding and I’m swearing. Vigorously. Nurse Alice says to me “John I really think you should consider the procedure we talked about”. I swivel my eyes up to meet hers and I say for the hundredth time or so – with spittle – “Fuck. No. I-”
  I don’t get to finish the sentence because I fall over. Again. For the hundredth time. Nurse Alice is silent. I stare into the spongy crash mat, then tense my arms and try to pick myself up but can’t. So I try to turn on my side and this sets off a twinge in my back which grows in intensity until after a second it feels like someone is pulling and twisting the bundle of muscles at the base of my spine. I scream and moan and my right leg kicks mindlessly. Each time it kicks it sets off more tremors in my back which create more kicks. I can’t stop myself from screaming. I try to go as still and as little as possible. I guess this is how trapped animals feel. Eventually the tremors subside and I feel wet cardboard prodding my gut and realize I’ve crushed a little sippy cup and the water has soaked into my undershirt and my boxers as though I’ve wet myself.
“John,” Alice says. “I think you should try it. It really helps. We’ve had amazing success rates.”
“It looks like a fucking landmine with spiderlegs” I mumble into the mat.
“I’m sorry John I couldn’t hear that, could you speak up?”
Alice says this sort of thing a lot and I think we both know she can hear me. But we pretend. I give up and turn my head so I’m speaking half into the floor and half into open space. “OK,” I say. “Let’s try it.”
“Wonderful!” she says, then, softly, “Commence exo protocol”.
  The fucking thing really does scuttle into the room and when it lands on my back I feel some cold metal around the base of my spine and then some needles of pain as its legs burrow into me, then another spasm starts and according to the CCTV footage I start screaming “you liar! I’ll kill you!” and worse things. But I don’t remember any of this. I pass out a minute or so later, after my screams stop being words. When you review the footage you can see that my screams correspond to its initial leg movements and after I pass out it sort of shimmies itself from side to side, pressing itself closer into my lower back with each swinging lunge until it is pressed into me, very still, a black clasp around the base of my spine. Then Alice and another Nurse load me onto a gurney and take me to a room to recover.

When I woke up a day later or so in the hospital bed I immediately jumped out of it and ran over to the hospital room doorway thinking you lying fuckers I’ll show you. I yanked the door open and ran half into the hall then paused, like Wile E. Coyote realizing he has just run off a cliff edge. I looked behind me into the room and back at my just-vacated bed. It dawned on me that I’d covered the distance between bed and door in a second or so, something that would have taken me two crutches and ten minutes the previous day. I pressed one hand to my back and recoiled as I felt the smoothness of the exo. Then I tried lifting a leg in front of me and was able to raise my right one to almost hip height. The same thing worked with the left leg. I patted the exo again and I thought I could feel it tense one of its legs embedded in my spine as though it was saying that’s right, buddy. You can thank me later.
  “John!” Alice said, appearing round a hospital corridor in response to the alarm from the door opening. “Are you okay?”
“Yes,” I said. “I’m fine.”
“That’s great,” she said, cheerfully. “Now, would you consider putting some clothes on?”
I’d been naked the whole time, so fast did I jump out of bed.

So now it’s three years later and I guess I’m considered a model citizen – pun intended. I’ve got exos on my elbows and knees as well as the one on my back, and they’re all linked together into one singular thing which helps me through life. Next might be one for the twitch in my neck. And it’s getting better all the time: fleet learning combined with machine learning protocols mean the exo gives me what the top brass call strategic movement optimization: said plainly, I’m now stronger and faster and more precise than regular people. And my exo gets better in proportion to the total number deployed worldwide, which now numbers in the millions.

Of course I do worry about what happens if there’s an EMP and suddenly it all goes wrong and I’m back to where I was. I have a nightmare where the pain returns and the exo rips the muscles in my back out as it jumps away to curl up on itself like a beetle, dying in response to some unseen atmospheric detonation. But I figure the sub-one-percentage chance of that is more than worth the tradeoff. I think my networked exo is happy as well, or at least, I hope it is, because in the middle of the night sometimes I wake up to find my flesh being rocked slightly from side to side by the smart metal embedded within me, as though it is a mother rocking some child to sleep.

Things that inspired this story: Exoskeletons, fleet learning, continuous adaptation, reinforcement learning, intermittent back trouble, physiotherapy, walking sticks.