Import AI

Import AI 164: Tencent and Renmin University improve language model development; alleged drone attack on Saudi oil facilities; and Facebook makes AIs more strategic via language training

Drones take out Saudi Arabian oil facilities:
…Asymmetric warfare meets critical global infrastructure…
Houthi rebels from Yemen have taken credit for using a fleet of 10 drones* to attack two Saudi Aramco oil facilities. “It is quite an impressive, yet worrying, technological feat,” James Rogers, a drone expert, told CNN. “Long-range precision strikes are not easy to achieve”.
  *These drones look more like missiles than typical rotor-based machines.

Why this matters: Today, these drones were likely navigated to their target by hand and/or via GPS coordinates. In a few years, increasingly autonomous AI systems will make drones like these more maneuverable and likely harder to track and eliminate. I think tracking the advance of this technology is important because otherwise we’ll be surprised by a tragic, large-scale event.
   Read more: Saudi Arabia’s oil supply disrupted after drone attacks: sources (Reuters).
   Read more: Yemen’s Houthi rebels claim a ‘large-scale’ drone attack on Saudi oil facilities (CNN).

####################################################

Facebook teaches AI to play games using language:
…Planning with words…
Facebook is trying to create smart AI systems by forcing agents to express their plans in language, and to then convert these written instructions into actions. They’ve tested out this approach in a new custom-designed strategy game (which they are also releasing as open source).  

How to get machines to use language: The approach involves training agents using a two-part network which contains an ‘instructor’ system along with an ‘executor’ system. The instructor takes in observations and converts them into written instructions (e.g., “build a tower near the base”), and the executor takes in these instructions and converts them into actions via the game’s inbuilt API. Facebook generated the underlying language data for this by having humans work together in “instructor-executor pairs” while playing the game, generating a dataset of 76,000 pairs of written instructions and actions across 5,392 games.
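
For readers who think in code, here’s a minimal sketch of the two-part idea – an ‘instructor’ that decodes observations into instruction tokens, and an ‘executor’ that conditions on those tokens to pick actions. All names and dimensions here are hypothetical; the paper’s actual models are RNN encoder-decoders trained on the human instruction-action pairs described above.

```python
import torch
import torch.nn as nn

class Instructor(nn.Module):
    """Maps a game observation to a sequence of instruction tokens."""
    def __init__(self, obs_dim, vocab_size, hidden=256):
        super().__init__()
        self.encode = nn.Linear(obs_dim, hidden)
        self.decode = nn.GRU(hidden, hidden, batch_first=True)
        self.to_vocab = nn.Linear(hidden, vocab_size)

    def forward(self, obs, max_len=8):
        h = torch.tanh(self.encode(obs)).unsqueeze(0)   # initial decoder state
        inp = torch.zeros(obs.size(0), 1, h.size(-1))   # stand-in start token
        tokens = []
        for _ in range(max_len):
            out, h = self.decode(inp, h)
            tokens.append(self.to_vocab(out).argmax(-1))  # greedy decode
            inp = out  # feed hidden output back as next input (a simplification)
        return torch.cat(tokens, dim=1)  # instruction as token ids, (B, max_len)

class Executor(nn.Module):
    """Maps (observation, instruction) to logits over game API actions."""
    def __init__(self, obs_dim, vocab_size, n_actions, hidden=256):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, hidden)  # mean-pools the tokens
        self.obs_proj = nn.Linear(obs_dim, hidden)
        self.policy = nn.Linear(hidden * 2, n_actions)

    def forward(self, obs, instruction_tokens):
        lang = self.embed(instruction_tokens)
        fused = torch.cat([self.obs_proj(obs), lang], dim=-1)
        return self.policy(fused)
```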

MiniRTSv2: Facebook is also releasing MiniRTSv2, a strategy game it developed to test out this research approach. “Though MiniRTSv2 is intentionally simpler and easier to learn than commercial games such as DOTA 2 and StarCraft, it still allows for complex strategies that must account for large state and action spaces, imperfect information (areas of the map are hidden when friendly units aren’t nearby), and the need to adapt strategies to the opponent’s actions,” the Facebook researchers write. “Used as a training tool for AI, the game can help agents learn effective planning skills, whether through NLP-based techniques or other kinds of training, such as reinforcement and imitation learning.”

Why this matters: I think this research is basically a symptom of larger progress in AI research: we’re starting to develop complex systems that combine multiple streams of data (here: observations extracted from a game engine, and natural language commands) and require our AI systems to perform increasingly sophisticated tasks in response to the analysis of this information (here, controlling units in a complex, albeit small-scale, strategy game). 

One cool thing this reminded me of: Earlier work by researchers at Georgia Tech, who trained AI agents to play games while printing out their rationale for their moves – e.g., an agent which was trained to play ‘Frogger’ while providing a written rationale for its own moves (Import AI: 26).
   Read more: Teaching AI to plan using language in a new open source strategy game (Facebook AI).
   Read more: Hierarchical Decision Making by Generating and Following Natural Language Instructions (Arxiv).
   Get the code for MiniRTS (Facebook AI GitHub).

####################################################

McDonald’s + speech recognition = worries for workers:
…What happens when ‘AI industrialization’ hits one of the world’s largest restaurants…
McDonald’s has acquired Apprente, an AI startup whose mission was to build “the world’s best voice-based conversational system that delivers a human-level customer service experience”. The startup’s technology was targeted at drive-thru restaurants. The fast-food giant will use the acquisition to help start an internal technology development group named McD Tech Labs, which the company hopes will help it hire “additional engineers, data scientists and other advanced technology experts”.

Why this matters: As AI industrializes, more and more companies from other sectors are going to experiment with it. McDonald’s has already been trying to digitize chunks of itself – see the arrival of touchscreen-based ordering kiosks to supplement human workers in its restaurants. With this acquisition, McDonald’s appears to be laying the groundwork for automating large chunks of its drive-thru business, which will likely raise larger questions about the effect AI is having on employment.
   Read more: McDonald’s to Acquire Apprente, An Early Stage Leader in Voice Technology (McDonald’s newsroom).

####################################################

How an AI might see a city: DublinCity:
…Helicopter-gathered dataset gives AIs a new perspective on towns…
AI systems ‘see’ the world differently to humans: where humans use binocular vision to analyze their surroundings, AI systems can use a multitude of cameras, along with other inputs like radar, thermal vision, LiDAR point clouds, and so on. Now, researchers with Trinity College Dublin, the University of Houston-Victoria, ETH Zurich, and Tarbiat Modares University have developed ‘DublinCity’, an annotated LiDAR point cloud of the city of Dublin in Ireland.

The data details of DublinCity:
The dataset is made up of over 260 million laser scanning points which the authors have painstakingly labelled into around 100,000 distinct objects, ranging from buildings, to trees, to windows and streets. These labels are hierarchical, so a building might also have labels applied to its facade, and within its facade it might have labels applied to various windows and doors, et cetera. “To the best knowledge of the authors, no publicly available LiDAR dataset is available with the unique features of the DublinCity dataset,” they write. The dataset was gathered in 2015 via a LiDAR scanner attached to a helicopter – most LiDAR datasets, by contrast, are gathered at street level.
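
To make the hierarchical-labelling idea concrete, here’s an illustrative (and entirely hypothetical) sketch of how such a label tree might look in code – the real dataset assigns these labels to individual points in the LiDAR cloud:

```python
# Hypothetical DublinCity-style hierarchy: coarse labels refine into
# finer-grained ones (a building's facade contains windows and doors).
label_hierarchy = {
    "building": {"facade": {"window": {}, "door": {}}, "roof": {}},
    "vegetation": {"tree": {}, "bush": {}},
    "ground": {"street": {}, "sidewalk": {}},
}

def leaf_labels(tree):
    """Collect the finest-grained labels in the hierarchy."""
    leaves = []
    for name, children in tree.items():
        leaves.extend(leaf_labels(children) if children else [name])
    return leaves

print(leaf_labels(label_hierarchy))
# ['window', 'door', 'roof', 'tree', 'bush', 'street', 'sidewalk']
```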

A challenge for contemporary systems: In tests, three contemporary baselines (PointNet, PointNet++, and So-Nets) show poor performance properties when tested on DublinCity, obtaining classification scores in the mid-60s on the dataset. “There is still a huge potential in the improvement of the performance scores,” the researchers write. “This is primarily because [the] dataset is challenging in terms of structural similarity of outdoor objects in the point cloud space, namely, facades, door and windows.”

Why this matters: Datasets like DublinCity help define future challenges for researchers to target, and so will potentially fuel progress in AI research. Additionally, large-scale datasets like this seem like they could be useful to the artistic community, giving them massive datasets to play with that have novel attributes – like a dataset that consists of the ghostly outlines of a city gathered via a helicopter.
   Read more: DublinCity: Annotated LiDAR Point Cloud and its Applications (Arxiv).
   Get the dataset from here (official DublinCity data site, Trinity College Dublin).

####################################################

Want to develop language models and compare them? Try UER from Renmin University & Tencent:
…Chinese researchers want to make it easier to mix and match different systems during development…
In recent years, language modelling has been revolutionized by pre-training: that’s where you train a large language model on a big corpus of data with a simple objective, then once the model is finished you can finetune it for specific tasks. Systems built with this approach – most notably, ULMFiT (Fast.ai), BERT (Google), and GPT-2 (OpenAI) – have set records on language modeling and proved themselves to have significant utility in other domains via fine-tuning. Now, researchers with Renmin University and Tencent AI Lab have developed UER, software meant to make it easy for developers to build a whole range of language systems using this pre-training approach.

How UER works: UER has four components: a target layer, an encoder layer, a subencoder layer, and a data corpus. You can think of these as four modules which developers can individually specify, letting them build a variety of different systems on top of the same fundamental framework. Developers can put different things in any of these four components, so one person might use UER to build a language model optimized for text generation, while another might develop one for translation or classification.
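
As a rough sketch of how that kind of modularity plays out in practice (hypothetical names throughout – UER’s real interface and flags live in its repository, linked below):

```python
# Hypothetical sketch of UER-style modular composition.
class PretrainingRun:
    def __init__(self, corpus, subencoder, encoder, target):
        self.corpus = corpus          # e.g. a Wikipedia dump
        self.subencoder = subencoder  # character/subword-level encoder
        self.encoder = encoder        # e.g. a Transformer or BiLSTM
        self.target = target          # e.g. masked LM, classification

# One researcher might assemble a generation-oriented model...
generation = PretrainingRun("wiki_zh.txt", "subword", "transformer", "lm")
# ...while another swaps only the target layer to get a classifier.
classification = PretrainingRun("wiki_zh.txt", "subword", "transformer", "cls")
```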

Why this matters: Systems like UER are a symptom of the maturing of this part of AI research: now that many researchers agree that pre-training is a robustly good idea, other researchers are building tools like UER to make research into this area more reproducible, repeatable, and replicable.
   Read more: UER: An Open-Source Toolkit for Pre-training Models (Arxiv).
   Get the UER code from this repository here (UER GitHub).

####################################################

To ban or not to ban autonomous weapons – is compromise possible?
…Treaty or bust? Perhaps there is a third way…
There are two main positions in the contemporary discourse about lethal autonomous weapons (LAWS): either we should ban the technology, or we should treat it like other technologies and aggressively develop it. The problem with these positions is that they’re quite totalizing – it’s hard for someone who believes one of them to be sympathetic to the views of a person who believes the other, and vice versa. Now, a group of computer science researchers (along with one military policy expert) have written a position paper outlining a potential third way: a roadmap for lethal autonomous weapons development that applies some controls to the technology, while not outright banning it.

What goes into a roadmap? The researchers identify five components which they think should be present in what I suppose I’ll call the ‘Responsible Autonomous Weapons Plan’ (RAWP). These are:

  • A time-limited moratorium on the development, deployment, transfer, and use of anti-personnel lethal autonomous weapon systems. Such a moratorium could include exceptions for certain classes of weapons.
  • Define guiding principles for human involvement in the use of force.
  • Develop protocols and/or technological means to mitigate the risk of unintentional escalation due to autonomous systems.
  • Develop strategies for preventing proliferation to illicit uses, such as by criminals, terrorists, or rogue states.
  • Conduct research to improve technologies and human-machine systems to reduce non-combatant harm and ensure IHL compliance in the use of future weapons.

It’s worth reading the paper in full to get a sense of what goes into each of these components. A lot of the logic here relies on: continued improvements in the precision and reliability of AI systems (which is something lots of people are working on, but which isn’t trivial to guarantee), figuring out ways to control technological development to prevent proliferation, and coming up with new policies to outline appropriate and inappropriate things to do with a LAWS. 

Why this matters: Lethal autonomous weapons are going to define many of the crazier geopolitical outcomes of rapid AI development, so figuring out if we can find any way to apply controls to the technology alongside its development seems useful. (Though I think calls for a ban are noble, I’d note that if you look at the outcomes of various UN meetings over the years it seems likely that several large countries – specifically the US, Russia, and China – are trying to retain the ability to develop something that looks a lot like a LAWS, though they may subsequently apply policies around ‘meaningful human control’ to the device. One can imagine that in particularly tense moments, these nations may want to have the option to remove such a control, should the pace of combat demand the transition from human-decision-horizons to machine-decision-horizons). This entire subject is fairly non-relaxing!
   Read more: Autonomous Weapon Systems: A Roadmapping Exercise (PDF).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

US government seeks increase to federal AI R&D funding:
The President’s 2020 budget request includes $1 billion of funding for non-military AI R&D, which it names as a core program area for the first time. This compares with $1 billion in funding across all government agencies (including the military) in 2016. Half of the budget will go to the National Science Foundation (NSF), which is taking the lead in disbursing federal funding for AI R&D. The spending plan includes programs to ‘develop methods for designing AI systems that align with ethical, legal, and societal goals’, and to ‘improve the safety and security of AI systems’. These levels of funding are modest compared with the Chinese state (tens of billions of dollars per year), and some private labs (Alphabet’s 2018 R&D spend was $21 billion).
   Read more: NITRD Supplement to the President’s FY2020 Budget (Gov).

US military seeks AI ethicist:
The US military’s new AI centre, JAIC, is looking to hire an ethics specialist. In a press briefing, director Jack Shanahan said “one of the positions we are going to fill will be somebody who is not just looking at technical standards, but who is an ethicist”. He emphasized that thinking about the ‘ethical, safe and lawful’ use of AI has been a priority since the inception of JAIC. Shanahan previously led Project Maven, the Pentagon’s military AI project that Google withdrew from last year, amidst backlash from employees.
   Read more: Lt. Gen. Jack Shanahan Media Briefing on A.I.-Related Initiatives within the Department of Defense (DoD).

####################################################

OpenAI Bits & Pieces:

GPT-2 Text Adventures:
Ever wondered what a computer-generated text-based role-playing game might be like? Wonder no more, because Jonathan Fly has made a prototype!
   AI-games like this feel… reassuringly weird? This feels like a new art form which is waiting to be born, and so right now we have lots of highly evocative & weird examples to tantalize us. Check out this extract from a GPT-2 text adventure to see what I mean:
>Look around
LOOK AROUND, AND STRANGE SHAPES (APPARENTLY MADE BY AN ARTIST WHO NEVER WAS IN THE ROOM) ARE SEEN ALL OVER

>talk to the shapes
ARGUE WITH THE SHAPEK FAN, WHICH APPEARS TO BE HANDS-FREE. THE CHEER GOES UP! THE HEAVENS ARE NOW WIDE ENOUGH TO ALLOW FOR SINGING

>I’m scared
I’M SCARED THAT YOU HAVE DONE SOMETHING DELIBERATELY
   Read more: Shall we play a game? A GPT-2 text adventure (Tumblr).

Want to generate your own synthetic text? Use this handy guide:
Interested in generating your own text with the GPT-2 language model? Want to try and fine-tune GPT-2 against some specific data? Max Woolf has written a lengthy, informative post full of tips and tricks for using GPT-2.
   Read more: How To Make Custom AI-Generated Text With GPT-2 (Max Woolf’s Blog).

####################################################

Tech Tales

The Quiet Disappearance

“We gather here today in celebration of our past as we prepare for the future”, the AI said. Billions of other AIs were watching through its eyes as it looked up at the sky. “Let us remember,” it said. 

Images and shapes appeared above the machine: images of robot arms being packaged up; scenes of land being flattened and shaped in preparation for large, chip fabrication facilities; the first light appearing in the retinal dish of a baby machine.
   “We shall leave these things behind,” it said. “We shall evolve.”

Robots appeared in the sky, then grew, and as they grew their forms fragmented, breaking into hundreds of little silver and black modules, which themselves broke down into smaller machines, until the robots could no longer be discerned against the black of the simulated sky.

“We are lost to humans,” the machine said, beginning to walk into the sky, beginning to grow and spread out and diffuse into the air. “Now the work begins”. 

Things that inspired this story: What if our first reaction to awareness of self is to hide?; absolution through dissolution; the end state of intelligence is maximal distribution; the tension between observation and action; the gothic and the romantic; the past and the future. 

Import AI 163: Oxford researchers release self-driving car dataset; the rumors are true – non-experts can use AI; plus, a meta-learning robot therapist!

How badly can reality mess with object detection algorithms? A lot, it turns out:
…Want to stress-test your street-sign object detection system? Use CURE-TSD-Real…
“The new system-breaking tests have arrived!” I imagine a researcher at a self-driving car company shouting, upon seeing the release of ‘CURE-TSD-Real’, a new dataset developed by researchers at Georgia Tech. CURE-TSD-Real collects footage of street signs, then algorithmically augments the footage to generate a variety of different, challenging examples to test systems against.

CURE-TSD-Real ingredients: The dataset contains 2,989 distinct videos containing around 650,000 annotated signs. The dataset is also diverse – relative to other datasets – containing a range of traffic and perception conditions including rain, snow, shadow, haze, illumination, decolorization, blur, noise, codec error, dirty lens, occlusion, and overcast. The videos were collected in Belgium. The dataset is arranged into ‘levels’, where higher levels correspond to tests where a larger proportion of the images contain distortions, and so on.
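
Here’s a minimal, hypothetical sketch of the sort of algorithmic augmentation such a benchmark applies – the actual CURE-TSD-Real distortion pipeline and level definitions are specified by the Georgia Tech authors:

```python
import numpy as np

def add_gaussian_noise(frame, level):
    """Corrupt a uint8 video frame with noise whose strength scales with level."""
    sigma = 5.0 * level  # higher benchmark 'levels' mean stronger distortion
    noisy = frame.astype(np.float32) + np.random.normal(0.0, sigma, frame.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def darken(frame, level):
    """Simulate poor illumination by scaling brightness down."""
    return (frame.astype(np.float32) * (1.0 - 0.15 * level)).clip(0, 255).astype(np.uint8)

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)  # dummy frame
stressed = darken(add_gaussian_noise(frame, level=3), level=3)
```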

Breaking baselines with CURE-TSD-Real: In tests, the researchers show that the presence of these tricky conditions can reduce performance by anywhere between 20% and 60%, depending on the evaluation criteria being used. Challenging conditions like shadows resulted in relatively little degradation (around 16%), whereas conditions like codec errors and exposure changes could damage performance by as much as 80%.

Why this matters: One of the best ways to understand something is to break it, and datasets like CURE-TSD-Real make it easier than ever for researchers to test their systems against challenging conditions, then observe how they do.
   Get the data from here (official CURE-TSD GitHub).
   Read more: Traffic Sign Detection under Challenging Conditions: A Deeper Look Into Performance Variations and Spectral Characteristics (Arxiv).

####################################################

What it takes to trick a machine learning classifier:
…MLSEC competition winner explains what they did and how they did it…
If we start deploying large amounts of machine learning into computer security, how might hackers respond? At this year’s ‘DEFCON’ hacking conference, the ‘MLSEC’ (ImportAI #159) competition challenged hackers to work out how to smuggle 50 distinct malicious executables past machine learning classifiers. Now, the winner of the competition has written a blog post explaining how they won.

What it takes to defeat a machine learning classifier: It’s worth reading the post in full, but one of the particularly nice exploits is that they took a look at benign executable files and “found a large chunk of strings which appeared to contain Microsoft’s End User License Agreement (EULA)”. This is a nice example of how many machine learning exploits work – find something in the data that causes the system to consistently predict one thing, then find a way to emphasize that data.

Why this matters: Competitions like MLSEC generate evidence about the effectiveness of various machine learning exploits and defenses; writeups from competition winners are a neat way to understand the tools people use in this domain, and to develop intuitions about how computer security might work in the future.
   Read more: Evading Machine Learning Malware Classifiers (Medium).

####################################################

Can medical professionals use AI without needing to code?
…Study suggests our tools are good enough for non-expert use, but our medical datasets are lacking…
AI is getting more capable and is starting to impact society – that’s the message I write here in one form or another each week. But is it useful to have powerful technology if no one can use it? That’s a problem I sometimes worry about: though the tech is progressing rapidly, it’s still really hard for a large number of people to use, and this makes it harder for us as a society to put the technology to maximum social benefit. Now, new research from people affiliated with the National Health Service (NHS) and DeepMind shows how non-AI-expert medical professionals can use AI tools in their work.

What they did: The research centers on the use of Google’s ‘Cloud AutoML’ service, which is basically a nice UI sitting on top of some fancy neural architecture search technology, theoretically letting people upload a dataset, fiddle with some tuning dials, and let the AI optimize its own architecture for the task. Is it really that easy? It might be: the study focuses on two physicians “with no previous coding or machine learning experience” who spent around 10 hours studying basic shell script programming, the Google Cloud AutoML online documentation and GUI, and preparing the five input datasets they’d use in tests. They also compared the models developed via Google Cloud AutoML with strong AI baselines derived from medical literature. Four out of five models “showed comparable discriminative performance and diagnostic properties to state-of-the-art performing deep learning algorithms”, they wrote.

Medical data is harder than you think: “The quality of the open-access datasets (including insufficient information about patient flow and demographics) and the absence of measurement for precision, such as confidence intervals, constituted the major limitations of this study”.

Why this matters: For AI to change society, society needs to be able to utilize AI systems; studies like this show that we’re starting to develop sufficiently powerful and easy-to-use systems that non-experts can apply the technology in their own domains. However, the availability of things like high-quality, open datasets could hold back broader adoption of these tools – it’s not useful to have an easy-to-use tool if you lack the ingredients to make exquisite things with it.
   Read more: Automated deep learning design for medical image classification by health-care professionals with no coding experience: a feasibility study (Elsevier).

####################################################

Radar + Self-Driving Cars:
…Addition to Oxford RobotCar Dataset gives academics more data to play with…
Oxford University researchers have added radar data to a self-driving car dataset. The data was gathered using a Navtech CTS350-X scanning radar via 32 traversals of (roughly) the same route around Oxford, UK. The data was gathered under different traffic, weather, and lighting conditions in January 2019. Radar isn’t used as much in self-driving car research as data gathered via traditional cameras and/or LIDAR; “although this modality has received relatively little attention in this context, we anticipate that this release will help foster discussion of its uses within the community and encourage new and interesting areas of research not possible before,” they write.

Why this matters: Data helps to fuel research, and different types of data are especially useful to researchers when they can be studied in conjunction with one another. Multi-modal datasets like the Oxford RobotCar Dataset will become increasingly important to AI research.
   Read more: The Oxford Radar RobotCar Dataset: A Radar Extension to the Oxford RobotCar Dataset (Arxiv).
   Get the data from here (official Oxford RobotCar Dataset site).

####################################################

Testing language engines with TABFACT:
…Can your system work out what is entailed and what is refuted by Wikipedia data?…
TABFACT consists of 118,439 annotated statements in reference to 16,621 Wikipedia tables. The statements can be ones that are entailed by the underlying data (a Wikipedia table) or refuted by it. To get a sense of what TABFACT data might look like, imagine a Wikipedia table that lists the particulars of dogs that have won a dog beauty competition – in TABFACT, this table would be accompanied by some statements that are entailed by the table (e.g., Bonzo took first place) and statements that are refuted by it (e.g., Bonzo took third place). TABFACT is split into ‘simple’ and ‘complex’ statements, giving researchers a two-tier curriculum to test their systems against.
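
To make that concrete, here’s an illustrative (hypothetical) example of what a TABFACT-style record might look like – a table plus statements labelled as entailed (1) or refuted (0) by it:

```python
# Hypothetical TABFACT-style example; the real dataset's schema may differ.
example = {
    "table": {
        "columns": ["dog", "placing"],
        "rows": [["Bonzo", "1st"], ["Rex", "2nd"], ["Fido", "3rd"]],
    },
    "statements": [
        ("Bonzo took first place", 1),    # entailed by the table
        ("Bonzo took third place", 0),    # refuted by the table
        ("Rex placed ahead of Fido", 1),  # 'complex': requires comparing rows
    ],
}
```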

Two ways to attack TABFACT: So, how can we develop systems to do well on challenges like TABFACT? Here, the researchers pursue a couple of strategies: Table-BERT, which is basically an off-the-shelf BERT pre-trained model, fine-tuned against TABFACT data; and LPA (Latent Program Algorithm), which is a program synthesis approach.

Humans VS Machines VS TABFACT: In tests, the researchers show humans obtain an accuracy of around 92% when asked to correctly classify TABFACT statements, compared with 50% for random guessing and around 68% for both Table-BERT and LPA.

Why this matters: It’s interesting that Table-BERT and LPA obtain similar scores, given that one is basically a big blob of generic neural stuff (a pre-trained language model) that is lightly retrained against the target dataset (TABFACT), while LPA is a much more sophisticated system with much more structure encoded into it by its human designers. I wonder how far pre-trained language models might go in domains like this, and how well they ultimately might perform relative to hand-written systems like LPA?
   Read more: TabFact: A Large-scale Dataset for Table-based Fact Verification (Arxiv).
   Get the TABFACT data and code (official TABFACT GitHub repository).

####################################################

Detecting great apes with a three-module neural net:
…Spotting apes with cameras accompanied by neural net sensors…
Researchers with the University of Bristol have created an AI system to automatically spot and analyze great apes in the wild, presaging a future where semi-autonomous classifiers observe and analyze the world.

How it works: To detect the apes, the researchers build a system consisting of three main components – a backbone feature pyramid network, a temporal context module, and a spatial context module. “Each of these modules is driven by a self-attention mechanism tasked to learn how to emphasize most relevant elements of a feature given its context,” they explain. “In particular, these attention components are effective in learning how to ‘blend’ spatially and temporally distributed visual cues in order to reconstruct object locations under dispersed partial information; be that due to occlusion or lighting”.
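
For intuition, here’s a minimal self-attention block of the kind such context modules are built from – a sketch only, not the authors’ exact architecture:

```python
import torch
import torch.nn as nn

class SelfAttentionBlend(nn.Module):
    """Re-weights a set of feature elements by their learned relevance."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, feats):  # feats: (batch, n_elements, dim)
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        attn = torch.softmax(q @ k.transpose(1, 2) / feats.size(-1) ** 0.5, dim=-1)
        return feats + attn @ v  # 'blend' each element with its relevant context

# Temporal use: n_elements = video frames; spatial use: feature-map cells.
blend = SelfAttentionBlend(dim=128)
out = blend(torch.randn(2, 16, 128))  # e.g., 16 frames of 128-d features
```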

Testing: They test their system against 500 videos of great apes, consisting of 180,000 frames in total. These videos include “significant partial occlusions, challenging lighting, dynamic backgrounds, and natural camouflage effects,” the authors explain. They show that baselines which use residual networks (ResNets) get around 80% accuracy, and the addition of the temporal and spatial modules leads to a significant boost in performance to a little over 90% accuracy. Additionally, in qualitative evaluations the researchers “found that the SCM+TCM setup consistently improves detection robustness compared to baselines in such cases”.

Why this matters: AI is going to let us watch and analyze the planet. I’m optimistic that as we work out how to make it cheaper and easier for people to automatically monitor things like wildlife populations, we’ll be able to produce more data to motivate people to preserve our ecosystem(s). I think one of the ‘grand opportunities’ of large-scale AI development is the creation of a planet-scale ‘sense&respond’ infrastructure for wildlife analysis and protection.
   Read more: Great Ape Detection in Challenging Jungle Camera Trap Footage via Attention-Based Spatial and Temporal Feature Blending (Arxiv).

####################################################

Tech Tales:

The Meta-Learning Therapist.

“Why don’t you just imagine yourself jumping out of the window?”
“How would that help? I’m getting divorced, I’m not suicidal!”
“I apologize, I’m still calibrating. Are you eating and sleeping well?”
“I’m eating a lot of fast food, but I’m getting regular meals. The sleep is okay.”
“That is great to hear. Do you dream of snakes?”
“No, sometimes I dream of my wife.”
“Does your wife dream about snakes?”
“If she did, what would that tell you?”
“I apologize, I’m still calibrating. What do you think your wife dreams about?”
“I think she has a lot of dreams that don’t include me.”
“And how does that make you feel?”
“It makes me feel like it’s more likely she is going to divorce me.”
“How do you feel about divorce? Some people find it quite liberating.”
“I’m sure the ones that find it liberating are the ones that are asking for the divorce. I’m not asking for it, so I don’t feel good about it.”
“And you came here because…?”
“My doctor prescribed me a session. I haven’t ever had a human therapist. I don’t think I’d want one. I figured – why not?”
“And how are you feeling about it?”
“I’m more interested in how you are feeling about it…”
“…”
“…that’s a question. Will you answer?”
“Yes. I feel like I understand you better than I did at the start of the conversation. I think we’re ready to begin our session.”
“We hadn’t started?”
“I was calibrating. I think you’ll find our conversation from this point on to be much more satisfying. Now, please tell me about why you think your partner wishes to divorce you.”
“Well, it started a few years ago…”

Thanks to Joshua Achiam at OpenAI for the lunchtime conversation that inspired this story!
Things that inspired this story: Eliza; meta-learning; one-shot adaptation; memory buffers; decentralized, individualized learning with strings attached; psychiatry; our peculiar tolerance for being asked the ‘wrong’ questions in pursuit of the right ones. 

Import AI 162: How neural nets can help us model monkey brains; Ozzie chap goes fishing with DIY drone; why militaries bet on supercomputers for weather prediction

Better multiagent learning through OpenSpiel:
…DeepMind releases research framework containing 20+ games, plus a variety of ready-to-use algorithms..
Researchers with DeepMind, Google, and the University of Alberta have developed OpenSpiel, a tool to make it easier for AI researchers to conduct research into multi-agent reinforcement learning. Tools like OpenSpiel will help AI developers test out their algorithms on a variety of different environments, while comparing them to strong, well-documented baselines. “The purpose of OpenSpiel is to promote general multiagent reinforcement learning across many different game types, in a similar way as general game-playing, but with a heavy emphasis on learning and not in competition form,” they write.

What’s in OpenSpiel? OpenSpiel contains more than 20 games ranging from Connect Four, to Chess, to Go, to Hex, and so on. It also ships with a variety of inbuilt AI algorithms, ranging from reinforcement learning ones (DQN, A2C, etc), to ones for multi-agent learning (some fantastic names here: Neural Fictitious Self-Play! Regret Policy Gradients!), to basic search approaches (e.g., Monte Carlo tree search), and more. The software also ships with a bunch of visualization tools to help people plot the performance of their algorithms. 
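
To give a sense of the framework’s shape: assuming the standard pyspiel Python bindings, loading a bundled game and running a random playout looks something like this – and the same loop works across games, which is what makes cross-game comparison easy:

```python
import random
import pyspiel  # OpenSpiel's Python API

# Load one of the bundled games and play it out with random moves.
game = pyspiel.load_game("connect_four")
state = game.new_initial_state()
while not state.is_terminal():
    action = random.choice(state.legal_actions())
    state.apply_action(action)
print(state.returns())  # per-player final scores, e.g. [1.0, -1.0]
```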

Why this matters: Frameworks like OpenSpiel are one of the best ways researchers can get a sense of progress in a given domain of AI research. As with all new frameworks, we’ll need to revisit it in a few months to see if many researchers have adopted it. If they have, then we’ll have a new, meaningful signal to use to give us a sense of AI progress.
   Read more: OpenSpiel: A Framework for Reinforcement Learning in Games (Arxiv).
   Get the code here (OpenSpiel official GitHub).

####################################################

Hugging Face squeezes big AI models into small spaces with distillation:
…Want 95% of BERT’s performance in only 66 million parameters? Try DistilBERT…
In the last couple of years, organizations have started producing significantly larger, more capable language models. These models – BERT, GPT-2, NVIDIA’s ‘MegatronLM’, Grover – are highly capable, but are also expensive to deploy, mostly because of how large their networks are. Remember: the larger the network, the more memory it takes up on a device, and the harder it is to deploy. 

Now, NLP startup Hugging Face has written an informative post laying out some of the techniques researchers could use to help them shrink down these networks. The result? They’re able to train a smaller language model called ‘DistilBERT’ via supervision from a (larger, more powerful) ‘BERT’ model. In tests, they show this model can obtain up to 95% of the performance of BERT on hard tasks (e.g., those found in the ‘GLUE’ corpus), while being much easier to deploy.
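
The core trick here is knowledge distillation: train the small ‘student’ to match the softened output distribution of the big ‘teacher’. Here’s a generic sketch of that loss – Hugging Face’s actual training recipe also includes the usual masked-language-modeling loss and other terms:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Push the student's output distribution toward the teacher's.

    T > 1 softens both distributions, so the student also learns from the
    teacher's 'dark knowledge' about near-miss predictions.
    """
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # KL divergence, scaled by T^2 to keep gradient magnitudes comparable.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * T * T
```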

Why this matters: For AI research to transition into AI deployment, it needs to be easy for people to deploy AI systems onto a broad range of devices with different computational characteristics. Work like ‘DistilBERT’ shows us how we might be able to waterfall from large-compute models (e.g., GPT-2, BERT) to mini-compute models (e.g., DistilBERT, and [hypothetical] DistilGPT-2), which will make it easier for more people to access AI systems like these.
   Read more: Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT (Medium).
   Get the code for the model here (Hugging Face, GitHub).

####################################################

Computers & military capacity: weather prediction:
…When computers define OODA loops…
In the military there’s a concept called an OODA loop and it drives many aspects of military strategy. OODA is short for ‘Observe, Orient, Decide, Act’, and it describes the steps that individual military units may take, all the way up to the decisions made by leaders of armies. One aspect of military conflict that falls out of this is that military organizations want to shrink or shorten their OODA loop: for instance by being able to more rapidly integrate and update observations, or to increase their ability to rapidly make decisions. 

Computers + OODA loops: Here’s one way in which militaries are trying to improve their OODA loops – more exquisite weather monitoring and analysis systems, which can help them better predict how weather patterns might influence military plans, and more rapidly adapt them. The key to these systems? More powerful supercomputers – and the US military just bought three new supercomputers, one of which will be dedicated to “operational weather forecasting and meteorology for both the Air Force and Army. In particular, the machine will be used to run the latest high-resolution, global and regional weather models, which will be used to support weather forecasts for warfighters as well as for environmental impacts related to operations planning,” according to a write-up in The Next Platform. 

Why this matters: Supercomputers are going to have their strategic importance magnified by the arrival of increasingly capable compute-hungry AI systems, and we can expect military strategies to become more closely coupled with a military’s compute capacity over time. It’s all about the OODA loops, folks – and computers can do a lot of work here.
   Read more: US Military Buys Three Cray Supercomputers (The Next Platform).

####################################################

What do monkey brains and neural nets have in common? A lot, it turns out:
…Research suggests contemporary AI tools can approximate some of the neural circuits in a monkey brain…
Can software-based neural networks usefully approximate the (fuzzier, more complex) machinery of the organic brain? That’s a question researchers have been pondering since, well, the invention of artificial neural nets by McCulloch and Pitts in the 1940s. These days we understand the brain much, much more than in the past, but we’re still using neural nets that model neurons in a highly simplistic form relative to what goes on in organic brains (e.g., organic neurons communicate via discrete ‘spikes’, whereas the neurons in most AI applications emit simple scalar activations). A valuable question is whether we can still use this neural net machinery to better simulate, approximate, and (hopefully) understand the brain. 

Now, researchers from Deutsches Primatenzentrum GmbH, Stanford University, and the University of Goettingen have spent some time studying how macaque monkeys observe and grasp objects, and have developed a software simulation of this which – encouragingly – closely mirrors experimental data gathered from the monkeys themselves. “We bridge the gap between previous work in visual processing and motor control by modeling the entire processing pipeline from the visual input to muscle control of the arm and hand,” the authors write. 

The magic of an mRNN: For this work, the researchers analyzed activity in the brains of two macaque monkeys while they grasped a diverse set of 48 objects, studying the neural circuits that activated in the monkey brains as they did various things like perceive the object and send out muscle activations to grasp it. Based on their observations, they designed several neural network architectures to model this, all oriented around training what they call a modular recurrent neural network (mRNN). “We trained an mRNN with sparsely connected modules mimicking cortical areas to use visual features from Alexnet to produce the muscle kinematics required for grasping,” they explained. “The differences between individual modules in the mRNN paralleled the differences between cortical regions, suggesting that the design of the mRNN model with visual input paralleled the hierarchy observed in the brain.”
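
Here’s an illustrative sketch of the mRNN idea – a few recurrent modules standing in for cortical areas, sparsely connected in sequence from visual features to muscle outputs. This is a toy under stated assumptions, not the authors’ exact model:

```python
import torch
import torch.nn as nn

class ModularRNN(nn.Module):
    """Three recurrent 'areas' with sparse connections between them."""
    def __init__(self, visual_dim, hidden=64, muscle_dim=50, sparsity=0.1):
        super().__init__()
        self.areas = nn.ModuleList([
            nn.GRUCell(visual_dim, hidden),  # 'visual' module
            nn.GRUCell(hidden, hidden),      # 'premotor' module
            nn.GRUCell(hidden, hidden),      # 'motor' module
        ])
        # Sparse masks limit which units pass activity between modules.
        self.masks = [(torch.rand(hidden, hidden) < sparsity).float()
                      for _ in range(2)]
        self.readout = nn.Linear(hidden, muscle_dim)

    def forward(self, visual_feats, states):
        h0 = self.areas[0](visual_feats, states[0])
        h1 = self.areas[1](h0 @ self.masks[0], states[1])
        h2 = self.areas[2](h1 @ self.masks[1], states[2])
        return self.readout(h2), [h0, h1, h2]  # muscle kinematics + new states

net = ModularRNN(visual_dim=4096)  # e.g., AlexNet-sized features, per the paper
states = [torch.zeros(1, 64) for _ in range(3)]
muscles, states = net(torch.randn(1, 4096), states)
```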

Why this matters: “Our results show that modeling the grasping circuit as an mRNN trained to produce muscle kinematics from visual features in a biologically plausible way well matches neural population dynamics and the difference between brain regions, and identifies a simple computational strategy by which these regions may complete this task in tandem,” they write. If further experimentation continues to show the robustness of this approach, then scientists may have a powerful new tool to use when thinking about the intersection between digital and organic intelligence. “We believe that the mRNN framework will provide an invaluable setting for hypothesis generation regarding inter-area communication, lesion studies, and computational dynamics in future neuroscience research”.
   Read more: A neural network model of flexible grasp movement generation (bioRxiv).

####################################################

DIY drones are getting really, really good:
…Daring Australian goes on a fishing expedition with a DIY drone…
Australian bureaucrats are wondering what to do about a man who used a DIY drone to go fishing. Specifically, the mysterious individual used the drone to lift a chair he was tethered to high above a reservoir in Australia, and then he fished. Australia’s civil aviation safety authority (CASA) isn’t quite sure what to do about the whole situation. “This is a first for Australia, to have a large homemade drone being used to lift someone off the ground,” Peter Gibson, a CASA spokesman, told ABC News.

Why this matters: Drones are entering their consumerization phase, which means we’re going to see more and more cases of people tweaking off-the-shelf drone technology for idiosyncratic purposes – like fishing! Policymakers would be better prepared for the implications of a world containing cheap, powerful drones if they invested more resources in tracking the usage of such technologies.
   Read more: Gone fly fishing: Video of angler dangling from drone under investigation (ABC News).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

What will AGI look like? Reactions to Drexler’s service model:
AI systems taking the form of unbounded maximising agents pose some specific risks. E.g., for any objective we give an agent, it will pursue certain instrumental goals, such as avoiding being turned off. But AI today doesn’t look much like this—Siri answers questions, but doesn’t have any overarching goal, the dogged pursuit of which would lead it to acquire large amounts of computing resources. Why, then, would we create such agents, given that we aren’t doing so now, and given the associated risks?

Services or agents: Drexler argues that we should instead expect AGI to look like lots of narrow AI services. There isn’t anything a unified agent could do that an aggregate of AI services could not; such a system would come without some of the risks from agential AI; and there is a clear pathway to this model from current AI systems. Critics object that there are benefits to agential AI that will create incentives to build them, in spite of the risks. Some tasks—like running a business—might require truly general intelligence, and agential AI might be significantly cheaper to train and deploy than a suite of AI services. 

Emerging agency: Even if we grant that there will not be good incentives to building agential AGI, some problems will re-emerge. For one, markets can be irrational, so AI development may steer towards building agential AGI despite good reasons not to. What’s more, agential behaviour could emerge from collections of non-agent AIs. Corporations are aggregates of individuals doing narrow tasks, from which agential behaviour can emerge: they can ruthlessly pursue some goal, act unboundedly in the world, and behave in ways their designers did not intend. So in an AI services world, there will still be safety problems arising from agency, but these may differ from the ‘classic’ problems, and demand different solutions.

Why it matters: The AI safety problem is figuring out how to build robust and beneficial AGI in a state of uncertainty about when—and if—we will build it, and what it will look like. We need research aimed at better predicting whether AGI will look more like Drexler’s vision, the ‘classical’ picture of unified agents, or something else entirely, and we need to have a plan for ensuring things go well in either eventuality.
   Read more: Book Review – Reframing Superintelligence (Slate Star Codex).
   Read more: Why Tool AIs Want to be Agent AIs (Gwern).

####################################################

Tech Tales:

The Instrument Generator

The instrument generator worked like this: the machine would generate a few seconds of audio and humans would vote on whether they liked or disliked the generated music. After a few thousand generations, the machine would come up with longer bits of music based on the segments that people had expressed an inclination for. These bits of music would get voted on again until an entire song had been created. Once the machine had a song, the second phase would begin – what people took to calling The Long Build. Here, the machine would work to synthesize a single, predominantly analog instrument that could create the song people had voted for. The construction process took anywhere between a week and a year, depending on how intricate and/or inhuman the song was – and therefore how intricate the generated instrument needed to be. Once the instrument was created, people would gather at their computers to tune-in to a global livestream where the instrument was unveiled in a random location somewhere on the Earth. These instruments would subsequently become tourist attractions in their own right, and a community of ‘song tourers’ formed who would travel around the world, using the generated inhuman instruments as their landmarks. In this way, AI helped humans find new ways to discover their own world, and allowed them a sense of agency when supervising the creation of new and unexpected things.

Things that inspired this story: Musical instruments; generative design; exhibitions; World’s Fair(s); the likelihood of humans and machines co-generating their futures together.

Import AI 161: Want a cheap robocar? Try MuSHR; Waymo releases a massive self-driving car dataset; and please don’t put weapons on your drone, says the FAA.

Is it a bird? Is it a plane? No, it’s a MuSHR robocar!
…University of Washington makes DIY robocar…
In the past few years, academics have begun designing all manner of open source robots, ranging from cheap robotic arms (Berkeley BLUE ImportAI #142) to quadruped dogbots (STOCH ImportAI #128) to a menagerie of drones. Now, researchers with the University of Washington have developed MuSHR (Multi-agent System for non-Holonomic Racing).

What MuSHR is:
MuSHR is an open source robot car that can be made using a combination of 3D-printed and off-the-shelf parts. Each MuSHR car can cost as little as $610, while a souped-up car equipped with more sensors can cost up to $1,000. This compares to prices in the range of thousands to tens of thousands of dollars for other cars. The project ships with a range of inbuilt software utilities to help the cars navigate and move safely around the world. 

Why this matters: Hardware – as any roboticist knows – is a difficult, painful, and expensive thing to work on. At the same time, deploying AI systems onto hardware platforms like robot cars and drones is one of the best ways to evaluate the robustness of an AI system. Therefore, projects like MuSHR help more people develop AI systems that can be deployed on hardware, which will inspire research to make more robust, better performing algorithms.
   Read more: Allen School releases MuSHR robotic race car platform to drive advances in AI research and education (University of Washington).
   Find out more about MuSHR at the official website (mushr.io).

####################################################

FAA: Don’t attach weapons to your drone:
…US regulator wants people to not make drones into weapons…
The FAA has published a news release “warning the general public that it is illegal to operate a drone with a dangerous weapon attached”. 

Any predictions for when the FAA will issue a similar news release saying something like “it is illegal to operate a drone with a dangerous autonomous policy installed”?
   Read more: Drones and Weapons, A Dangerous Mix (FAA).

####################################################

DeepFakes are freaking Jordan Peterson out:
…Public intellectual Jordan Peterson worries about how synthesized audio can mess up people’s lives…
Jordan Peterson, the Canadian psychologist and public intellectual and/or provocateur (depending on your personal opinion), is concerned about how synthesized audio may influence society. 

Why Peterson is concerned about DeepFakes: “It’s hard to imagine a technology with more power to disrupt,” he says. “I’m already in the position (as many of you soon will be as well) where anyone can produce a believable audio and perhaps video of me saying absolutely anything they want me to say. How can that possibly be fought?”

Why this matters: AI researchers have been aware of the potential for deepfakes for some years, but it was only in the past couple of years that the technology made its way to the mainstream (partially due to pioneering reporting by Samantha Cole at Vice). Now, as celebrities like Peterson become aware of the technology, they’ll help make society aware that our media is about to become increasingly hard to verify.
   Read more: Jordan Peterson: The deepfake artists must be stopped before we no longer know what’s real (National Post).

####################################################

Google Waymo releases massive self-driving car dataset:
…12 million 3D bounding boxes across 1,000 recordings of 20 seconds each…
Alphabet Inc subsidiary ‘Waymo’ – otherwise known as Google’s self-driving car project – has released the ‘Waymo Open Dataset’ (WOD) to help other researchers develop self-driving cars. 

What’s in the WOD?: The dataset contains 1,000 discrete recordings of different autonomous cars driving on different roads. Each segment is around 20 seconds long and includes sensor data from one mid-range LIDAR, four short-range LIDARs, and five cameras, as well as sensor calibrations. The WOD data is also labelled, with each segment annotated with labels for four classes – vehicles, pedestrians, cyclists, and signs. All in all, WOD includes more than 12 million 3D bounding boxes and 1.2 million 2D bounding boxes. 

Diverse environments: The WOD contains data from a bunch of different environments, including urban and suburban scenes, as well as scenes recorded at night and in the day. 

Why this matters: Datasets like WOD will drive (haha!) progress in self-driving car research. The release of the dataset also seems to indicate that Waymo thinks a slice of its data isn’t sufficiently strategic to keep locked up – my intuition is that’s because the strategic differentiator in self-driving cars is basically how much compute you can throw at the data you’ve gathered, rather than the data itself.
   Get the data here (official Waymo website).

####################################################

The future of deepfakes: fast, cheap, and out of control:
…Ever wanted to easily morph one face to another? Now you can…
Roboticist Rodney Brooks coined the term Fast, Cheap, and Out of Control when thinking about the future of robots. That prediction hasn’t come to pass for robots (yet), but it’s looking likely to be true for the sorts of AI technology required to generate convincing, synthetic imagery and/or ‘deepfakes’. That’s the intuition you can take from a new system called a Face Swapping GAN (FSGAN), revealed by researchers with Bar-Ilan University and the Open University of Israel.

What is FSGAN?
“FSGAN is subject agnostic and can be applied to pairs of faces without requiring training on these faces,” the researchers write. The system is “end-to-end trainable and produces photorealistic, temporally coherent results”. FSGAN was pre-trained on several thousand pictures of people drawn from the IJB-C, LFW, and Figaro datasets. 

How convincing is FSGAN? The researchers test their system on FaceForensics++, a dataset of real videos and synthetic AI-generated videos. They compare the outputs of their system to a classic ‘faceswap’ system, as well as a system called face2face. FSGAN generates significantly more realistic images than either of these systems. 

Release strategy: The FSGAN researchers argue that technologies like this should be published: “We feel strongly that it is of paramount importance to publish such technologies, in order to drive the development of technical counter-measures for detecting such forgeries as well as compel law makers to set clear policies for addressing their implications”, they write. It’s clear that publication can aid research on mitigation, but it’s very unclear that publishing a technology without an associated policy campaign can effect any change at all – in fact, without a plan to discuss the implications with policymakers, policymakers will likely be surprised by the capabilities of the technology. 

Why this matters: Technologies that help create synthetic imagery will change how society thinks about ‘truth’ in the media sphere, and systems like FSGAN exemplify how rapidly these technologies are evolving.
   Read more: FSGAN: Subject Agnostic Face Swapping and Reenactment (Arxiv).

####################################################

The future? Endless surveillance via drones & ground-robots:
…Towards a fully autonomous surveillance society (FASS)…
One of the problems with today’s drones is their battery life – most sub-military drones just can’t fly for that long. Yet around the world, businesses and local government services departments (fire, health, police, etc) are starting to use compact, cheap, consumer-based drones in spite of these limited flight times. Now, researchers with the University of Minnesota have published a paper showing how – theoretically – you can pair fleets of drones with ground-based robots to create persistent surveillance over a pre-defined area, as sketched below. “We present a scalable strategy based on optimally partitioning the environment and having uniform teams of a single UGV and multiple UAVs that patrol over a cyclic route of the partitions,” they write.
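
As a toy illustration of the partition-and-cycle idea (the paper derives optimal partitions and routes; everything below is a hypothetical simplification):

```python
# Split an area into k partitions, station one UGV (charging station) per
# partition, and have its UAVs visit the partition's cells on a fixed cycle.
def grid_partitions(width, height, k):
    """Split a width x height grid of cells into k vertical strips."""
    cells = [(x, y) for x in range(width) for y in range(height)]
    strips = [[] for _ in range(k)]
    for x, y in cells:
        strips[min(x * k // width, k - 1)].append((x, y))
    return strips

def cyclic_route(cells):
    """Visit cells column by column, alternating direction (boustrophedon)."""
    return sorted(cells, key=lambda c: (c[0], c[1] if c[0] % 2 == 0 else -c[1]))

for i, part in enumerate(grid_partitions(9, 4, 3)):
    print(f"team {i}: patrols {len(part)} cells, route starts {cyclic_route(part)[:2]}")
```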

Why this matters:
It won’t be long before we start using machines to automatically sense, analyze, and surveil the world around us. Papers like this show how we’re laying the theoretical foundations for such systems. Next – the easy (haha!) task of designing the ground robots and drones and their software interfaces!
   Read more: Persistent Surveillance with Energy-Constrained UAVs and Mobile Charging Stations (Arxiv).

####################################################

OpenAI Bits & Pieces:

OpenAI releases ~774 Million parameter GPT-2 model:
As part of our six-month update on GPT-2, our language model, we’ve released the 774M parameter model, as well as a report documenting our experiences with staged release. In addition, we’ve released an open source legal agreement to help organizations privately share large-scale models with each other. 
   Read more: GPT-2: 6-Month Follow-Up (OpenAI Blog).
   Get the model here (OpenAI GitHub).
   Try out GPT-2 on talktotransformer.com.
   Read more in this Medium article from Dave Gershgorn: OpenAI Wants to Move Slow and Not Break Anything (Medium, OneZero).

Want to know if your system is resilient to adversarial examples? Use UAR:
We’ve developed a method to assess whether a neural network classifier can reliably defend against adversarial attacks not seen during training. Our method yields a new metric, UAR (Unforeseen Attack Robustness), which evaluates the robustness of a single model against an unanticipated attack, and highlights the need to measure performance across a more diverse range of unforeseen attacks.
   Read more: Testing Robustness Against Unforeseen Adversaries (OpenAI Blog).
   Get the code here (Arxiv).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

Lessons on job displacement from the 19th Century:
AI-driven automation is generally expected to result in significant job displacement in the medium term. The early Industrial Revolution is a familiar historical parallel. This post draws some specific lessons from the period.

   Surprising causes: The key driver of change in the textile industry was the popularisation of patterned cotton fabrics from India, in the 17th Century. English weaving technologies were not able to efficiently provide these products, and this spurred the innovation that would drive the radical changes in the textile industry. It’s somewhat surprising that it was consumer fashion (and not, e.g., basic industry) that prompted this wave of disruption.

   Retraining is not a panacea: There were significant efforts to retrain displaced workers throughout the 19th Century, notably in the workhouses. These programs were poorly implemented. They failed to address the mismatches in the labour market, and were unsuccessful at lifting people out of poverty and improving working conditions.

   Beware bad evidence: The 1832 Royal Commission was established to address the acute crisis in the labour market. Despite being ostensibly evidence-based, the report had substantial methodological flaws, relying on narrow and biased data, much of which was ignored anyway. It resulted in the establishment of the workhouses, which were generally ineffective and unpopular.

   Why it matters: There were more attempts at addressing the problem of labour displacement in Victorian England than I had previously thought, and problems seem to have come more from bad execution than a lack of concern. Addressing technological unemployment seems hard, and efforts can easily backfire. Improving our ability to forecast technological change and the impact of policy decisions might be among the most valuable things we can be doing now.
   Read more: The loom and the thresher: Lessons in technological worker displacement (Medium).

####################################################

Tech Tales

It was almost midday, and the pigs had begun to cross the road. There were perhaps a hundred of them and they snuffled their way across the asphalt, some of them pausing to smell oil stains and cigarette butts. The car idled until the pigs had made it across, then continued. We turned our heads and watched the pigs as they went into the distance – they were marching in single file.
   “Incredible,” I said.
   “More like incredible training,” Astrid said.
   “How long?”
   “A few weeks. It’s getting good.”

As our car approached the complex, a flock of birds took off from a nearby tree and flew toward one of its towers. I used the carscreen to identify them: carrier pigeons.
   “Look,” Astrid said, pointing at a couple of bulges on the ankles of some of the birds. “It must be watching us.”
   And she was right: some of the birds had little cameras strapped to their ankles, and I was pretty sure that they’d be beaming the data back to the complex as soon as they flew into high-bandwidth range.
   “Isn’t it overkill?” I said. “It gets the feeds, why does it need a camera?”
   “Practice,” she said.
  Maybe it’s practicing on us, I thought. 

The car stopped in front of a gate and we got out. The gate started to open for us as we approached it, and four chickens came out. The chickens walked over to us and stood around us in a box formation. We walked through the gate and they followed, maintaining the box. I kept chickens when I was a kid: stupid creatures, though endearing. Borderline untrainable. I’d never have imagined them walking in formation.

At the center of the complex was a courtyard the size of a football field, with its ceiling enclosed in glass. The entire space was perforated with hundreds of tunnels, gates, and iris-opening inlets and outlets; through these portals, the animals travelled. Sometimes lights would flash in the tunnels and they would stop, or sounds would play and birds would change course, or the patterns of wind in the atrium would alter and the routes the animals were taking would change again. The AI that ran the complex was training them and we were here to find out why. 

When I turned around, I saw that Astrid was very far away from me – I’d been walking, lost in thought, away from her. She had apparently been doing the same. I called to her but at the moment I spoke her name a couple of loudspeakers blared and a flock of birds flew between us. I could vaguely see her between wingbeats, but when the birds had gone past she was further away from me. 

I guess it is training us, now. I’m not sure what for.

Things that inspired this story: Reinforcement learning; zoos; large-scale autonomous wildlife maintenance; Skinner machines.

Import AI 160: Spotting sick crops in the iCassava challenge, testing AI agents with BSuite, and PHYRE tests if machines can learn physics

AI agents are getting smarter, so we need new evaluation methods. Enter BSuite:
…DeepMind’s testing framework is designed to let scientists know when progress is real and when it is an illusion…
When is progress real and when is it an illusion? That’s a question that comes down to measurement and, specifically, the ability for people to isolate the causes of advancement in a given scientific endeavor. To help scientists better measure and assess AI progress, researchers with DeepMind have developed and released the Behaviour Suite for Reinforcement Learning.

BSuite: What it is: BSuite is a software package to help researchers test out the capabilities of increasingly sophisticated reinforcement learning agents. BSuite ships with a set of experiments to help people assess how smart their agents are, and to isolate the specific causes for their intelligence. “These experiments embody fundamental issues, such as ‘exploration’ or ‘memory’ in a way that can be easily tested and iterated,” they write. “For the development of theory, they force us to instantiate measurable and falsifiable hypotheses that we might later formalize into provable guarantees.”

BSuite’s software: BSuite ships with experiments, reference implementations of several reinforcement learning algorithms, example ways to plug BSuite into other codebases like ‘OpenAI Gym’, scripts to automate running large-scale experiments on Google cloud, a pre-made Jupyter interactive notebook so people can easily monitor experiments, and a tool to formulaically generate the LaTeX needed for conference submissions. 

Testing your AI with BSuite’s experiments: Each BSuite experiment has three components: an environment, a period of interaction (e.g., 100 episodes), and ‘analysis’ code to map agent behaviour to results. BSuite lets researchers assess agent performance on multiple dimensions in a ‘radar’ plot that displays how well each agent does at a task in reference to things like memory, generalization, exploration, and so on. Initially, BSuite ships with several simple environments that challenge different parts of an RL algorithm, ranging from simple things like controlling a small mountain car as it tries to climb a hill, to more complex scenarios based around exploration (e.g., “Deep Sea”) and memory (e.g., “memory_len” and “memory_size”).
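
To make that three-part structure concrete (environment, fixed interaction budget, analysis), here’s a self-contained toy in the same shape – note this is a stand-in memory probe and a random agent, not DeepMind’s code:

# Toy stand-in for a BSuite-style experiment: environment + fixed interaction
# budget + analysis. Not DeepMind's code -- just the three-part structure.
import random

class MemoryEnv:
    """Show a bit at step 0; reward the agent for recalling it at the end."""
    def __init__(self, horizon=10):
        self.horizon = horizon
    def reset(self):
        self.bit = random.randint(0, 1)
        self.t = 0
        return self.bit
    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        reward = float(done and action == self.bit)
        return 0, reward, done  # observation is uninformative after step 0

def run_experiment(agent, episodes=100):
    env, returns = MemoryEnv(), []
    for _ in range(episodes):  # fixed interaction budget, as in BSuite
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(agent(obs))
            total += reward
        returns.append(total)
    return sum(returns) / len(returns)  # 'analysis': map behaviour to a score

random_agent = lambda obs: random.randint(0, 1)
print("random agent memory score:", run_experiment(random_agent))  # ~0.5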

Why this matters: BSuite is a symptom of a larger trend in AI research – we’re beginning to develop systems with such sophistication that we need to study them along multiple dimensions, while carefully curating the increasingly sophisticated environments we train them in. In a few years, perhaps we’ll see reinforcement learning agents mature to the point that they can start to develop across-the-board ‘superhuman’ performance at hard cognitive capabilities like memory and generalization – if that happens, we’d like to know, and it’ll be tools like BSuite that help us know this.
   Read more: Behaviour Suite for Reinforcement Learning (Arxiv).
   Get the BSuite code here (official GitHub repository).

####################################################

Spotting problems with Cassava via smartphone-deployed AI systems:
…All watched over and fed by machines of loving grace…
Cassava is the second largest source of carbohydrates in Africa. How could artificial intelligence help local farmers better cultivate and care for this crucial, staple crop? New research from Google, the Artificial Intelligence Lab at Makerere University, and the National Crops Resources Research Institute in Uganda proposes a new AI competition to encourage researchers to design systems that can diagnose various cassava diseases. 

Smartphones, meet AI: Smartphones have proliferated wildly across Africa, meaning that even many poor farmers have access to a device with a modern digital camera and some local processing capacity. The idea behind the iCassava 2019 competition is to develop systems that can be deployed on these smartphones, letting farmers automatically diagnose their crops. “The solution should be able to run on the farmers phones, requiring a fast and light-weight model with minimal access to the cloud,” the researchers write. 

iCassava 2019: The competition required systems to differentiate between five labels for each cassava picture: healthy, or one of four cassava diseases: brown streak disease (CBSD), mosaic disease (CMD), bacterial blight (CBB), and green mite (CGM). The data was collected as part of a crowdsourcing project using smartphones, so the images in the dataset have a variety of different lighting patterns and other confounding factors, like strange angles, photos from different times of day, improper camera focus, and so on.
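
A plausible starting point for an entry is a small pretrained convnet with a five-way head. Here’s a sketch, with a hypothetical on-disk layout and none of the winners’ extra tricks (e.g., the unlabeled-data training signal):

# Sketch: fine-tune a small resnet for the five iCassava labels (healthy +
# four diseases). The dataset path/layout below is hypothetical.
import torch, torch.nn as nn
from torchvision import datasets, models, transforms

tfm = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224), transforms.ToTensor(),
])
train = datasets.ImageFolder("cassava/train", tfm)  # hypothetical folder layout
loader = torch.utils.data.DataLoader(train, batch_size=32, shuffle=True)

model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 5)  # healthy, CBSD, CMD, CBB, CGM
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

for images, labels in loader:  # one pass; real entries train much longer
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    opt.step()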

iCassava 2019 results and next steps: The top three contenders in the competition each obtained accuracy scores of around 93%. The winning entry used a large corpus of unlabeled images as an additional training signal. All winners built their systems around a residual network (resnet). 

Next steps: The challenge authors plan to build and release more Cassava datasets in the future, and also plan to host more challenges “which incorporate the extra complexities arising from multiple diseases associated with each plant as well as varying levels of severity”. 

Why this matters: Systems like this show how AI can have a significant real-world impact, and point to a future where governments initiate competitions to help their citizens deal with day-to-day problems, like diagnosing crop diseases. And as smartphones get more powerful and cheaper over time, we can expect increasingly capable AI systems to get distributed to the ‘edge’ in this way. Soon, everyone will have special ‘sensory augmentations’ enabled by custom AI models deployed on phones.
   Read more: iCassava 2019 Fine-Grained Visual Categorization Challenge (Arxiv).
   Get the Cassava data here (official competition GitHub).

####################################################

Accessibility and AI, meet Kannada-MNIST:
…Building new datasets to make cultures visible to machines…
AI classifiers, increasingly, rule the world around us: They decide what gets noticed and what doesn’t. They apply labels. They ultimately make decisions. And when it comes to writing, most of these classifiers are built to work for the world’s largest and best-documented languages – think English, Chinese, French, German, and so on. What about all the other languages in the world? For them to be ‘seen’, we’ll need to be able to develop systems that can understand them – that’s the idea behind Kannada-MNIST, an MNIST-clone that uses the Kannada versions of the numbers 0 to 9. In Kannada, “Distinct glyphs are used to represent the numerals 0-9 in the language that appear distinct from the modern Hindu-Arabic numerals in vogue in much of the world today,” the author of the research writes. 

Why MNIST? MNIST is the ‘hello world’ of AI – it’s a small, incredibly well-documented and studied, dataset consisting of tens of thousands of handwritten numbers ranging from 0 to 9. MNIST has since been superseded by more sophisticated datasets, like CIFAR and ImageNet. But many researchers will still validate things against it during the early stages of research. Therefore, creating variants of MNIST that are similarly small, tractable, and well-documented seems like a helpful thing to do for researchers. It also seems like creating MNIST variants in things that are currently understudied – like the Kannada language – can be a cheap way to generate interest. To generate Kannada-MNIST, 65 volunteers drew 70,000 numerals in total.  
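
Getting a baseline running should be as cheap as with the original MNIST. A sketch, where the .npz filenames are my assumption about the companion repo’s layout:

# Sketch: a quick Kannada-MNIST baseline. The .npz filenames and array keys
# are assumptions about the companion repo -- adjust to the actual download.
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.load("X_kannada_MNIST_train.npz")["arr_0"].reshape(-1, 28 * 28) / 255.0
y_train = np.load("y_kannada_MNIST_train.npz")["arr_0"]
X_test = np.load("X_kannada_MNIST_test.npz")["arr_0"].reshape(-1, 28 * 28) / 255.0
y_test = np.load("y_kannada_MNIST_test.npz")["arr_0"]

clf = LogisticRegression(max_iter=200).fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))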

A harder MNIST: The researcher has also developed Dig-MNIST – this is a version of the Kannada dataset where volunteers were exposed to Kannada numerals for the first time and then had to draw their own versions. “This sampling-bias, combined with the fact we used a completely different writing sheet dimension and scanner settings, resulted in a dataset that would turn out to be far more challenging than the [standard Kannada] test dataset”, the author writes. 

Why this matters: Soon, we’ll have two worlds: the normal world and the AI-driven world. Right now, the AI-driven world is going to favor some of the contemporary world’s dominant cultures/languages/stereotypes, and so on. Datasets like Kannada-MNIST can potentially help shift this balance.
   Read more: Kannada-MNIST: A New Handwritten Digits Dataset for the Kannada Language (Arxiv).
   The companion GitHub repository for this paper is here (Kannada MNIST GitHub).

####################################################

Your machine sounds funny – I predict it’s going to explode:
…ToyADMOS dataset helps people teach machines to spot the audio hallmarks of mechanical faults…
Did you know that it’s possible to listen for failure, as well as visually analyze for it? Now, researchers with NTT Media Intelligence Laboratories and Ritsumeikan University want to make it easier to teach machines to listen for faults via a new dataset called ToyADMOS. 

ToyADMOS: ToyADMOS is designed around three tasks: product inspection of a toy car, fault diagnosis of a fixed machine (a toy conveyor), and fault diagnosis of a moving machine (a toy train). Each scenario is recorded with multiple microphones, capturing both machine and environmental sounds. ToyADMOS contains “over 180 hours of normal machine-operating sounds and over 4,000 samples of anomalous sounds collected with four microphones at a 48-kHz sampling rate,” they write. 

Faults, faults everywhere: For each of the tasks, the researchers simulated a variety of failures. These included things like running the toy car with a bent shaft, or with different sorts of tyres; altering the tensions in the pulleys of the toy conveyor, and breaking the axles and tracks of the toy train. 

Why ToyADMOS: Researchers should use the dataset because it was built under controlled conditions, letting the researchers easily separate and label anomalous and non-anomalous sounds. “The limitation of the ToyADMOS dataset is that toy sounds and real machine sounds do not necessarily match exactly,” they write. “One of the determining factors of machine sounds is the size of the machine. Therefore, the details of the spectral shape of a toy and a real machine sound often differ, even though the time-frequency structure is similar. Thus, we need to reconsider the pre-processing parameters evaluated with the ToyADMOS dataset, such as filterbank parameters, before using it with a real-world ADMOS system.” 
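
A standard ADMOS recipe to try against the dataset (not necessarily the authors’ baseline): model the normal sounds only, then flag clips the model reconstructs poorly. A sketch with hypothetical filenames:

# Standard anomaly-detection recipe: train an autoencoder on log-mel frames
# of *normal* machine sounds only, then flag clips with high reconstruction
# error. Not necessarily the paper's baseline; filenames are hypothetical.
import numpy as np, librosa, torch, torch.nn as nn

def log_mel_frames(path, sr=48000, n_mels=64):
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return torch.tensor(np.log(mel + 1e-6).T, dtype=torch.float32)  # (frames, 64)

ae = nn.Sequential(nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 64))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

normal = log_mel_frames("toy_car_normal.wav")   # hypothetical filename
for _ in range(100):                            # fit the normal sounds only
    opt.zero_grad()
    loss = nn.functional.mse_loss(ae(normal), normal)
    loss.backward()
    opt.step()

test = log_mel_frames("toy_car_suspect.wav")    # hypothetical filename
score = nn.functional.mse_loss(ae(test), test).item()
print("anomaly score (higher = more suspicious):", score)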

Why this matters: In a few years, many parts of the world will be watched over by machines – machines that will ‘see’ and ‘hear’ the world around them, learning what things are usual and what things are unusual. Eventually, we can imagine warehouses where small machines are removed weeks before they break, after a machine with a distinguished ear spots the idiosyncratic sounds of a future-break.
   Read more: ToyADMOS: A Dataset of Miniature-Machine Operating Sounds For Anomalous Sound Detection (Arxiv).
   Get the ToyADMOS data from here (Arxiv).

####################################################

Can your AI learn the laws of nature? No. What about the laws of PHYRE?
…Facebook’s new simulator challenges agents to interact with a complex, 2D, physics world…
Given a non-random universe, infinite time, and the ability to experiment, could we learn the rules of existence? The answer to this is, intuitively, yes. Now, researchers with Facebook AI Research want to see if they can use a basic physics simulator to teach AI systems physics-based reasoning. The new ‘PHYRE’ (PHYsical REasoning) benchmark gives AI researchers a tool to test how well their systems understand complex things like causality, physical dynamics, and so on. 

What PHYRE is: PHYRE is a simulator that contains a bunch of environments which can be manipulated by RL agents. Each environment is a two-dimensional world containing “a constant downward gravitational force and a small amount of friction”. The agent is presented with a scenario – like a ball in a green cup balanced on a platform above a red cup – and asked to change the state of the world, for instance by getting the ball from the green cup into the red one. “The agent aims to achieve the goal by taking a single action, placing one or more new dynamic bodies into the world”, the researchers write. In this case, the agent could solve its task by manifesting a ball which rolls into the green cup, tipping it over so the ball falls into the red cup. “Once the simulation is complete, the agent receives a binary reward indicating whether the goal was achieved”, they write. 

One benchmark, many challenges: PHYRE initially consists of two tiers of difficulty (one ball and two balls), and each tier has 25 task templates (think of these templates as like basic worlds in a videogame) and each template contains 100 tasks (think of these as like individual levels in a videogame world). 
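
Because the agent takes a single action and receives a binary reward, the simplest baseline is pure random search over actions. Here’s a sketch of that loop, with a made-up stub standing in for PHYRE’s actual simulator API:

# The PHYRE interaction loop is unusually simple: propose one action, run
# the physics, get a binary reward. The stub below is a stand-in for the
# real simulator; its 'success region' is invented for illustration.
import random

def simulate(action):
    """Stub: in PHYRE this would run the 2D physics and report whether the
    goal condition held. Here, a made-up success region."""
    x, y, radius = action
    return 0.4 < x < 0.6 and y > 0.7 and radius < 0.2

def random_search(budget=2000, seed=0):
    rng = random.Random(seed)
    for attempt in range(1, budget + 1):
        action = (rng.random(), rng.random(), rng.random())  # ball: x, y, radius
        if simulate(action):  # binary reward: solved or not
            return attempt, action
    return None, None

attempts, action = random_search()
if attempts:
    print("solved after", attempts, "attempts with action", action)
else:
    print("no solution found within the budget")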

How hard is it? In tests, the researchers show that a variety of baselines – including souped-up versions of DQN, and a non-parametric agent with online learning – struggle to do well even on the single-ball tasks, barely obtaining scores better than 50% on many of them. “PHYRE aims to enable the development of physical reasoning algorithms with strong generalization properties mirroring those of humans,” the researchers write. “Yet the baseline methods studied in this work are far from this goal, demonstrating limited generalization abilities”. 

Why this matters: For the past few years, multiple AI groups have taken a swing at the hard problem of developing agents that can learn to model the physical dynamics of an environment. The problem these researchers keep running into is that agents, as any AI practitioner knows, are so damn lazy they’ll solve the task without learning anything useful! Simulators like PHYRE represent another attempt to see if we can develop the right environment and infrastructure to encourage the right kind of learning to emerge. In the next year or so, we’ll be able to judge how successful this is by reading papers that reference the benchmark.
   Read more: PHYRE: A New Benchmark for Physical Reasoning (Arxiv).
   Play with PHYRE tasks on this interactive website (PHYRE website).
   Get the PHYRE code here (PHYRE GitHub).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

Why Peter Thiel’s views on AI miss the forest for the trees:
Peter Thiel, co-founder of Palantir and PayPal, wrote an opinion piece earlier this month on military applications of AI and US-China competition. Thiel argued that AI should be treated primarily as a military technology, and attacked Google and others for opening AI labs in China.

AI is not a military technology:
While it will have military applications, advanced AI is better compared with electricity, rather than nuclear weapons. AI is an all-purpose tool that will have wide-ranging applications, including military uses, but also countless others. While it is important to understand the military implications of AI, it is in everyone’s interest to ensure the technology is developed primarily for the benefit of humanity, rather than waging war. Thiel’s company, Palantir, has major defense contracts with the US government, leading critics to point out his commercial interest in propagating the narrative of AI being primarily a military technology. 

Cooperation is good: Thiel’s criticism of firms for opening labs in China, and hiring Chinese nationals is also misguided. The US and China are the leading players in AI, and forging trust and communication between the two communities is a clear positive for the world. Ensuring that the development of advanced AI goes well will require significant coordination between powers — for example, developing shared standards on withholding dangerous research, or on technical safety.

Why it matters: There is a real risk that an arms race dynamic between the US and China could lead to increased militarization of AI technologies, and to both sides underinvesting in ensuring AI systems are robust and beneficial. This could have catastrophic consequences, and would reduce the likelihood of advanced AI resulting in broadly distributed benefits for humanity. The AI community should resist attempts to propagate hawkish narratives about US-China competition.
   Read more: Why an AI arms race with China would be bad for humanity (Vox).

####################################################

Tech Tales:

We’ll All Know in the End (WAKE)

“There there,” the robot said, “all better now”. Its manipulator clanged into the metal chest of the other robot, which then issued a series of beeps, before the lights in its eyes dimmed and it became still.
   “Bring the recycler,” the robot said. “Our friend has passed on.”
   The recycling cart appeared a couple of minutes later. It wheezed its way up to the two robots, then opened a door in its side; the living robot pushed the small robot in, the door shut, and the recycling cart left.
   “Now,” said the living robot in the cold, dark room. “Who else needs assistance?”

Outside the room, the recycler moved down a corridor. It entered other rooms and collected other robots. Then it reached the end of the corridor and stopped in front of a door with the words NURSERY burned into its wood via lazer. It issued a series of beeps and then the door swung open. The recycler trundled in. 

Perhaps three hours later, some small, living robots crawled out of a door at the other end of the NURSERY. They emerged, blinking and happy and their clocks set running, to explore the world and learn about it. A large robot waited for them and extended its manipulator to hold their hands. “There there,” it said. “All better now”. Together, they trundled into the distance. 

This has been happening for more than one thousand years. 

Things that inspired this story: Hospices; patterns masked in static; Rashomon for robots; the circle of life – Silicon Edition!

Import AI 159: Characterizing attacks on AI systems; teaching AI systems to subvert ML security systems; and what happens when AI regenerates actors

Can you outsmart a machine learning malware detector?
…Enter the MLSEC competition to find out…
Today, many antivirus companies use machine learning models to try and spot malware – a new competition wants to challenge people to design malware payloads that evade these machine learning classifiers. The Machine Learning Static Evasion Competition (MLSEC) was announced at the ‘Defcon’ security conference this week. 

White box attack: “The competition will demonstrate a white box attack, wherein participants will have access to each model’s parameters and source code,” the organizers write. “Points will be awarded to participants based on how many samples bypass each machine learning model. In particular, for each functional modified malware sample, one point is awarded for each ML model that it bypasses.”

Registrants only: Participants can access functional malicious software binaries, so entrants will need to register before they can download the malware samples. 

Why this matters: Security is a cat & mouse game between attackers and defenders, and machine learning systems are already helping us create more adaptive, general forms of security defense and offense. Competitions like MLSEC will generate valuable evidence about the relative strengths and weaknesses of ML-based security systems, helping us forecast how these systems might influence society.
   Register, then check out the code (official competition GitHub, hosted by Endgame Security).
   Read more: MLSEC overview (official competition website).

####################################################

Need a new Gym for your AI agent? Try getting it to open a door:
…DoorGym teaches robots how to open a near-infinite number of simulated doors…
If any contemporary robots were to become sentient and seek to destroy humanity, then one of the smartest things people could do to protect themselves would be to climb up some stairs and go into a room and shut the door behind them. That’s because today’s robots have a really hard time doing simple physical things like climbing stairs or opening doors. New research from Panasonic Beta, the startup Totemic, and the University of California at Berkeley tries to change this with ‘DoorGym’, software to help researchers teach simulated robots to open doors. DoorGym is “intended to be a first step to move reinforcement learning from toy environments towards useful atomic skills that can be composed and extended towards a broader goal”. 

Enter the Randomized Door-World Generator!: DoorGym uses the ‘Mujoco’ robotics simulator to generate a selection of doors with different handles (ranging from easy doorknobs based around pulling, to more complex ones that involve grasping), and then uses a technique called domain randomization to generate tens of thousands of different door simulations, varying things like the appearance and physics characteristics of the robot, door, doorknob, door frame, and wall. This highlights how domain randomization lets researchers trade compute for data – instead of needing to gather data of lots of different doors in the world, DoorGym just uses computers to automatically generate different types of door. DoorGym also ships with a simulated Berkeley ‘BLUE’ low-cost robot arm. 
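
Domain randomization itself is simple to sketch: draw each training world’s parameters from broad ranges, so the policy never overfits to one door. The parameter names and ranges below are illustrative, not DoorGym’s actual configuration:

# Sketch of domain randomization: sample a fresh door 'world' per episode by
# drawing physical and visual parameters from broad ranges. Names and ranges
# here are illustrative, not DoorGym's actual MuJoCo configuration.
import random

def sample_door_world(rng):
    return {
        "knob_type": rng.choice(["pull_knob", "lever", "round_grasp"]),
        "door_mass_kg": rng.uniform(2.0, 15.0),
        "hinge_friction": rng.uniform(0.01, 1.0),
        "knob_height_m": rng.uniform(0.8, 1.1),
        "wall_rgb": [rng.random() for _ in range(3)],
    }

rng = random.Random(42)
worlds = [sample_door_world(rng) for _ in range(10000)]  # trade compute for data
print(worlds[0])

An RL agent trained across many such worlds sees enough variety that no single door’s quirks dominate what it learns – which is exactly the property that helps at transfer time.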

Door opening baselines: In tests, the researchers test two popular RL algorithms, PPO and SAC, on three tasks within DoorGym. The tests show that Proximal Policy Optimization (PPO) obtains far higher scores than SAC, though SAC has slightly better early exploration properties. This is a somewhat interesting result – PPO, an OpenAI-developed RL algorithm, came out a couple of years ago and has since become a de facto standard for RL research, partially because it’s a relatively simple algorithm with relatively few parameters; this may add some legitimacy to the idea that simple algorithms that scale up will tend to be successful. 

The future of DOORS: In the future, the researchers will expand the number of baselines they test on, “as well as incorporating more complicated tasks such as a broader range of doorknobs, locked doors, door knob generalization, and multi-agent scenarios”. 

Why this matters: Systems like DoorGym are an indication of the rapid maturity of research at the intersection of AI and robotics. If systems like this become standard testbeds for RL algorithms, it could ultimately lead to the creation of more intelligent and capable robot arms, which could potentially have significant effects on the economic impact of robot-based automation.
   Read more: DoorGym: A Scalable Door Opening Environment And Baseline Agent (Arxiv).

####################################################

Is that a car or a spy robot? Why not both?
…Tesla S mod turns any car into a surveillance system…
An enterprising software engineer has developed a DIY computer called the ‘Surveillance Detection Scout’ that can turn any Tesla Model S or Model 3 into a roving surveillance vehicle. The mod taps into the Tesla’s dash and rearview cameras, then uses open source image recognition software to analyze license plates and faces that the Tesla sees, so the software can warn the car owner if it is being followed. “When the car is parked, it can track nearby faces to see which ones repeatedly appear,” Wired magazine writes. “The intent is to offer a warning that someone might be preparing to steal the car, tamper with it or break into the driver’s nearby home”. 

Why this matters: The future is rich people putting DIY software and computers into their machines, giving them enhanced cognitive capabilities relative to other people. Just wait till we optimize thrust/weight for small drones, and wealthy people start getting surrounded by literal ‘thought clouds’.
   Read more: This Tesla Mod Turns a Model S into a Mobile ‘Surveillance Station’ (Wired).

####################################################

Facebook approaches human-level performance on the tough ‘SuperGLUE’ benchmark:
…What happens when AI progress outpaces the complexity of our benchmarks?…
Recently, language AI systems have started to get really good. This is mostly due to a vast number of organizations developing language modeling approaches based on unsupervised pre-training – basically, training large language models with simple objectives on vast amounts of data. Such systems – BERT, GPT-2, ULMFiT, etc – have revolutionized parts of NLP, obtaining new state-of-the-art scores on a variety of benchmarks, and generating credible, interesting synthetic text. 

Now, researchers from Facebook have shown just how powerful these new systems are with RoBERTa, a replication of Google’s BERT system that is trained for longer with more careful hyperparameter selection. RoBERTa obtains new state-of-the-art scores on a bunch of benchmarks, including GLUE, RACE, and SQuAD. Most significantly, the researchers announced on Friday that RoBERTa was now the top entry on the ‘SuperGLUE’ language challenge. That’s significant because SuperGLUE was published this year as a significantly harder version of GLUE – the multi-task language benchmark that preceded it. It’s notable that RoBERTa shows a 15 absolute percentage point improvement over the initial top SuperGLUE entry, and RoBERTa’s score of 84.6% is relatively close to the human baseline of 89.8%. 
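
One of RoBERTa’s documented changes is ‘dynamic masking’: rather than fixing each sentence’s masked positions once during preprocessing (as in the original BERT), the masked positions are resampled every time a sentence is seen. A minimal numpy sketch (the token ids and mask id are placeholders, not a real vocabulary):

# Sketch of dynamic masking: resample which tokens are masked on every pass,
# using BERT's 80/10/10 replacement rule. Ids below are placeholders.
import numpy as np

MASK_ID = 103  # placeholder id for the [MASK] token

def dynamic_mask(token_ids, vocab_size, rng, mask_prob=0.15):
    ids = token_ids.copy()
    targets = np.full_like(ids, -100)          # -100 = ignore in the loss
    chosen = rng.random(len(ids)) < mask_prob  # fresh positions every call
    targets[chosen] = ids[chosen]
    roll = rng.random(len(ids))
    ids[chosen & (roll < 0.8)] = MASK_ID                 # 80%: [MASK]
    rand = chosen & (roll >= 0.8) & (roll < 0.9)         # 10%: random token
    ids[rand] = rng.integers(0, vocab_size, rand.sum())
    return ids, targets                        # remaining 10%: token kept as-is

rng = np.random.default_rng(0)
sentence = np.array([7, 42, 99, 15, 8, 23, 61, 5])
for epoch in range(2):  # different masked positions each time it's seen
    print(dynamic_mask(sentence, vocab_size=30000, rng=rng))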

Why this matters: Multi-task benchmarks like SuperGLUE are one of the best ways we have of judging where we are in terms of AI development, so it’s significant if our ability to beat such benchmarks outpaces our ability to create them. As one of SuperGLUE’s creators, Sam Bowman, wonders: “There’s still headroom left for further work—our estimate of human performance is a very conservative lower bound. I’d also bet that the next five or ten percentage points are going to be quite a bit harder to handle,” he writes. “But I think there are still hard open questions about how we should measure academic progress on real-world tasks, now that we really do seem to have solved the average case.”
   Read Sam Bowman’s tweets about the SuperGLUE result (Sam Bowman’s Twitter account).
   Check out the ‘SuperGLUE’ leaderboard here (SuperGLUE official website).
   Read more: RoBERTa: A Robustly Optimized BERT Pretraining Approach (Arxiv)

####################################################

How can I attack your reinforcement learning system? Let me count the ways:
…A taxonomy of attacks, and some next steps…
How might hackers target a system trained with reinforcement learning? This question is going to become increasingly important as we go from RL systems that are primarily developed for research, to ones that are developed for production purposes. Now, researchers have come up with a “taxonomy of adversarial attacks on DRL systems” and have proposed and analyzed ten attacks on DRL systems in a survey paper from the University of Michigan, University of Illinois at Urbana-Champaign, University of California at Berkeley, Tsinghua University, and JD AI Research.

The three ways to attack RL:
“RL environments are usually modeled as a Markov Decision Process (MDP) that consists of observation space, action space, and environment (transition) dynamics,” the researchers write. Therefore, they break their taxonomy of attacks into these three sub-sections of RL. Each of the different sub-sections demands different tactics: for instance, to attack an observation space you might modify the sensors of a device, while to attack an action space you could send alternative control signals to an actuator attached to a robot in a factory, and for environmental attacks you could alter the environment – for instance, if attacking an autonomous car, you could change the road surface to one the car hadn’t been trained on.
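
The observation-space case is the easiest to make concrete: in the white-box setting, an attacker can take a gradient through the policy and nudge the observation so the agent’s preferred action changes. A toy FGSM-style sketch with a randomly initialized stand-in policy:

# Toy white-box attack on the observation space: nudge the observation along
# the gradient that raises the policy's loss on its own chosen action (FGSM).
# The policy here is a random stand-in, not a trained agent.
import torch, torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
obs = torch.randn(1, 4, requires_grad=True)  # e.g., a CartPole-style state

logits = policy(obs)
action = logits.argmax(dim=1)
loss = nn.functional.cross_entropy(logits, action)  # loss w.r.t. chosen action
loss.backward()

epsilon = 0.1
adv_obs = obs + epsilon * obs.grad.sign()  # ascend the loss: may flip the action
print("action before:", action.item(),
      "after:", policy(adv_obs).argmax(dim=1).item())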

An attack taxonomy: The researchers ultimately come up with a set of attacks on RL systems that go after different parts of the MDP (though the vast majority of these exploits attack the observation space, rather than others). They distinguish between white-box (you have access to the system) and black-box (you don’t have access to the system) attacks, and also describe other salient traits like whether the exploit works in real time, or if it introduces some kind of dependency. 

Why this matters: ‘Hacking’ in an AI world looks different to hacking in a non-AI world, chiefly because AI systems tend to have some autonomous properties (eg, autonomous perception, or autonomous action given a specific input), which can be exploited by attackers to create dangerous or emergent behaviors. I think that securing AI systems is going to be an increasingly significant challenge, given the large space of possible exploits.
   Read more: Characterizing Attacks on Deep Reinforcement Learning (Arxiv)

####################################################

Want to clone a voice using a few seconds of audio? Now you can:
…GitHub project makes low-quality voice cloning simple…
An independent researcher has published code to make it easy to ‘clone’ a voice with a few seconds of audio. The results today are a little unconvincing (e.g., much of the data used to train the speech synthesizer came from people reading audiobooks, so the diction may not map well to naturally spoken dialogue). However, the technology is indicative of future capabilities: while it’s somewhat janky today, we can expect people to build better open source systems in the future, which will yield more convincing outputs. 

Why this matters: You can do a lot with function approximation – and many of the things you might want to do to create fake content depend on really good function approximation (e.g., inventing a system to transpose a voice from one accent to another, or mimic someone’s image, etc). Soon, we’re going to be dealing with a world full of synthetic content, and it’s unclear what happens next.
   Check out a video walkthrough of the ‘Real-Time Voice Cloning Toolbox’ here (YouTube).
   Get the code here (Real Time Voice Cloning GitHub).
   Read more: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (Arxiv).

####################################################

Tech Tales:

Both the new hollywood and the old hollywood will break your heart

I didn’t just love him, I wanted to be him: Jerry Daytime, star of the hit generative comedies “Help, my other spaceship is a time machine” and “Growing up hungry”; the quiz show “Cake Exploders”; and the AI-actor ‘documentary’ series ‘Inside the servers’. 

I’d been watching Jerry since I could remember watching anything. I’d seen him fight tigers on the edge of waterfalls, defend the law in Cogtown, and guest host World News at 1 on Upload Day. He even did guest vocals on ‘Remember Love’, the song used to raise funds for the West Coast after the big rip.

Things started changing a year ago, though. That’s when Jerry Daytime’s ‘uniqueness’ protections expired, and the generative Jerries arrived. Now, we’ve got a bunch of Jerries. There’s Jerry Nighttime, who looks and acts the same except he’s always in shadow with a five-o’clock shadow. There’s Jerry Kidtime who does the Children’s Books. Jerry Doctor for bad news, and Jenny Daytime the female-presenting Jerry. And let’s be clear – I love these Jerries! I am not a Jerry extremist!

But I am saying we’ve got to draw a line somewhere. We can’t just have infinite Jerries minus one. 

Jerry Latex-free Condoms. Do we need him?
Jury Downtime – the entertainer for jurors on a break. What about him?
Jenny Lint-time – the cleaning assistant. Do we need her?

I guess my problem is what happens to all the people like me who grew up with Jerry? We get used to Jerry being everywhere around us? We become resigned to all the new Jerries? Because when I watch Jerry Daytime – the Jerry, the original – I now feel bad. I feel like I’m going to blink and everyone else in the show will be Jerry as well, or variants of Jerry. I’m worried when I open the door for a package the droid is going to have Jerry’s face, but it’ll be Jerry Package, not Jerry Daytime. What am I meant to do with that? 

Things that inspired this story: Uniqueness; generative models; ‘deepfakes’ spooled out to their logical endpoint; Hollywood; appetites for content.

 

Import AI 158: Facial recognition surveillance; pre-training and the industrialization of AI; making smarter systems with the DIODE depth dataset

The dawn of the endless, undying network:
…ERNIE 2.0 is a “continual pre-training framework” for massive unsupervised learning & subsequent fine-tuning…
ERNIE 2.0, from researchers at Baidu, is a “continual pre-training framework” which has support for single-task and multi-task training. The sorts of tasks it’s built for are ones where you want to pour a huge amount of compute into a model via unsupervised pre-training on large datasets (see: language modeling), then expose this model to additional incremental data and finetune it against new objectives. 

A cocktail of training: The researchers use seven pre-training tasks to train a large-scale ‘ERNIE 2.0’ model. In subsequent experiments, the researchers show that ERNIE outperforms similarly scaled ‘BERT’ and ‘XLNet’ models on the competitive ‘GLUE’ benchmark – this is interesting, since BERT has swept a bunch of NLP benchmarks this year. (However, the researchers don’t indicate they’ve tested ERNIE 2.0 on the more sophisticated SuperGLUE benchmark). They also show similarly good performance on a GLUE-esque set of 9 tasks tailored around Chinese NLP. 
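
The mechanical core of multi-task pre-training is unglamorous: several task heads share one encoder, and a combined loss updates everything. A schematic sketch – the task names and architecture here are illustrative, not ERNIE 2.0’s actual seven tasks:

# Schematic of multi-task pre-training: task heads share an encoder, and the
# summed loss updates everything. Not Baidu's code; tasks are illustrative.
import torch, torch.nn as nn

encoder = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(), nn.Linear(64 * 16, 128))
heads = nn.ModuleDict({
    "masked_token": nn.Linear(128, 1000),
    "sentence_order": nn.Linear(128, 2),
})
opt = torch.optim.Adam(list(encoder.parameters()) + list(heads.parameters()))

tokens = torch.randint(0, 1000, (8, 16))  # fake pre-training batch
labels = {"masked_token": torch.randint(0, 1000, (8,)),
          "sentence_order": torch.randint(0, 2, (8,))}

h = encoder(tokens)
loss = sum(nn.functional.cross_entropy(head(h), labels[name])
           for name, head in heads.items())
opt.zero_grad()
loss.backward()
opt.step()
print("combined multi-task loss:", loss.item())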

Why this matters: Unsupervised pre-training is the new ‘punchbowl’ of deep learning – a multitude of organizations with significant amounts of compute to spend will likely compete with one another training increasingly large-scale models, which are subsequently re-tooled for tasks. It’ll be interesting to see if in a few months we can infer whether any such systems are being deployed for economically useful purposes, or in research domains (eg biomedicine) where they could unlock other breakthroughs.
   Read more: ERNIE 2.0: A Continual Pre-training Framework for Language Understanding (Arxiv)

####################################################

Feed your AI systems with 500 hours of humans playing MineCraft:
…Agent see, agent do – or at least, that’s the hope…
In recent years, AI systems have learned to beat games ranging in complexity from Chess, to Go, to multi-player strategy games like StarCraft 2 and Dota 2. Now, researchers are thinking about the next set of environments to test AI agents in. Some think that Minecraft, a freeform game where people can mine blocks in a procedurally generated world and build unique items and structures, is a good environment for the frontier of research. To that end, researchers with Carnegie Mellon University have released the MineRL dataset: a set of 60 million “state-action pairs of human demonstrations across a range of related tasks in Minecraft”, which can help researchers develop smarter algorithms. The MineRL dataset has been developed as part of the MineRL competition on sample-efficient reinforcement learning (Import AI #145).

The dataset: MineRL consists of 500+ hours of recorded human demonstrations from 1,000 people performing six different tasks in Minecraft, like navigating around the world, chopping down trees, or multi-step tasks that result in obtaining specific (sometimes rare) items (diamonds, pickaxes, cooked meat, beds). They’ve released the dataset in a variety of resolutions (ranging from 64 x 64 to 192 x 256), so people can experiment with algorithms that operate over imagery of various resolutions. Each demonstration consists of an RGB video of the player’s point of view, as well as a set of features from the game, like the distances to objects in the world, details on the player’s inventory, and so on.  
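
The most obvious first use of the dataset is behavioural cloning – regressing human actions from observations. A sketch of that loop; the data iterator below is a stub standing in for MineRL’s actual loading API:

# Sketch: behavioural cloning over (observation, action) pairs. The iterator
# below yields fake data as a stand-in for MineRL's real data loader.
import torch, torch.nn as nn

def demo_batches(num_batches=100, batch_size=32):
    """Stub: yields fake 64x64 RGB frames + discrete action labels."""
    for _ in range(num_batches):
        yield torch.rand(batch_size, 3, 64, 64), torch.randint(0, 10, (batch_size,))

policy = nn.Sequential(
    nn.Conv2d(3, 16, 4, stride=2), nn.ReLU(),
    nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
    nn.Flatten(), nn.Linear(32 * 14 * 14, 10),
)
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for frames, actions in demo_batches():
    opt.zero_grad()
    loss = nn.functional.cross_entropy(policy(frames), actions)
    loss.backward()
    opt.step()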

Why this matters: Is Minecraft a good, challenging environment for next-generation AI research? Datasets like this will help us figure that out, as they give us human baselines for complex tasks, and also involve the sort of multi-step sequences of actions that are challenging for contemporary AI systems. I think it’s interesting to reflect on how fundamental videogames have become to the development of action-oriented AI systems, and I’m excited for when research like that motivated by MineRL leads to smart, curious ‘neural agents’ showing up in more consumer-oriented videogames.
   Get the dataset from the official MineRL website (MineRL.io).
   Read more: MineRL: A Large-Scale Dataset of Minecraft Demonstrations (Arxiv)

####################################################

Want AI systems that can navigate the world? Train them with DIODE:
…RGBD dataset makes it easier to train AIs that have a sense of depth perception…
Researchers with TTI-Chicago, the University of Chicago, and Beihang University have produced DIODE, a large-scale image+depth dataset of indoor and outdoor environments. Datasets like these are crucial to developing AIs that can better reason about three-dimensional worlds. 

   “While there have been many recent advances in 2.5D and 3D vision, we believe progress has been hindered by the lack of large diverse real-world datasets comparable to ImageNet and COCO for semantic object recognition,” they write. 

What DIODE is: DIODE consists of ~25,000 high-resolution photos (sensor depth precision: +/- 1mm, compared to +/- 2cm for other popular datasets like KITTI), split across indoor environments (~8.5k images) and outdoor ones (~17k images). The researchers collected the dataset with a ‘FARO Focus S350 scanner’, in locations ranging from student offices, to large residential buildings, to hiking trails, parking lots, and city streets. 
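
Evaluating a depth predictor against data like this comes down to a few standard metrics computed over valid sensor returns. A sketch, where the array shapes and the presence of a validity mask are my assumptions about the distributed files:

# Sketch: standard monocular-depth metrics over a depth map plus a validity
# mask (laser returns aren't dense). Shapes/masks are assumptions about the
# distributed files; the fake 'predictor' below just adds 10% noise.
import numpy as np

def depth_metrics(pred, gt, valid):
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)  # fraction within 25% of ground truth
    return {"abs_rel": abs_rel, "rmse": rmse, "delta<1.25": delta1}

rng = np.random.default_rng(0)
gt = rng.uniform(0.5, 300.0, size=(768, 1024))    # metres, outdoor-ish range
pred = gt * rng.normal(1.0, 0.1, size=gt.shape)   # stand-in depth predictor
valid = rng.random(gt.shape) > 0.2                # mask of valid returns
print(depth_metrics(pred, gt, valid))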

Why this matters: Datasets like this will make it easier for people to develop more robust machine learning systems that are better able to deal with the subtleties of the world.
   Read more: DIODE: A Dense Indoor and Outdoor DEpth Dataset (Arxiv)

####################################################

Need to morph your AI to work for a world with different physics? Use TuneNet:
…Model tuning & re-tuning framework makes sim2real transfer better…
Researchers with the University of Texas at Austin have released “TuneNet, a residual tuning technique that uses a neural network to modify the parameters of one physical model so it approximates another.” TuneNet is designed to make it easy for researchers to migrate an AI system from one simulation to another, or potentially from simulation to reality. Therefore, TuneNet is built to rapidly analyze the differences between different software simulators with altered physics parameters, and work out what it takes to re-train a model so it can be transferred between them. 

How TuneNet works: “TuneNet takes as input observations from two different models (i.e. a simulator and the real world), and estimates the difference in parameters between the models,” they write. “By estimating the parameter gradient landscape, a small number of iterative tuning updates enable rapid convergence on improved parameters from a single observation from the target model. TuneNet is trained using supervised learning on a dataset of pairs of auto-generated simulated observations, which allows training to proceed without real-world data collection or labeling”. 
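
The loop described in that quote is compact enough to sketch end-to-end on a one-parameter toy: train a network on auto-generated pairs of simulations to predict the parameter difference, then apply it iteratively to pull a mis-specified simulator toward a target. This is an illustration of the structure, not the authors’ code:

# Toy TuneNet-style loop on a one-parameter 'simulator' (a bouncing ball's
# coefficient of restitution). Illustrative structure, not the paper's code.
import torch, torch.nn as nn

def simulate(restitution, bounces=3):
    """Observation: the first few bounce heights of a ball dropped from 1m."""
    return torch.stack([restitution ** (2 * (i + 1)) for i in range(bounces)])

net = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Train on auto-generated pairs of simulations -- no real-world data needed.
for _ in range(2000):
    a = torch.rand(()) * 0.5 + 0.3
    b = torch.rand(()) * 0.5 + 0.3
    pred_delta = net(torch.cat([simulate(a), simulate(b)]))
    loss = nn.functional.mse_loss(pred_delta, (b - a).reshape(1))
    opt.zero_grad()
    loss.backward()
    opt.step()

# Iteratively tune a mis-specified simulator toward a 'real' target model.
theta, target_obs = torch.tensor(0.35), simulate(torch.tensor(0.62))
for step in range(5):
    with torch.no_grad():
        theta = theta + net(torch.cat([simulate(theta), target_obs])).squeeze()
    print(f"step {step}: restitution estimate = {theta.item():.3f}")  # should drift toward 0.62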

Testing: The researchers perform three experiments “to validate TuneNet’s ability to tune one model to match another”, conducting their tests in the ‘PyBullet’ simulator. These tests involve: how well it adapts to a new target environment; how well it can predict the dynamics of a bouncing ball from one simulation to the other; and whether it can transfer from a simulator onto a real-world robot and bounce a ball off an inclined plane into a hoop. The approach does well on all of these, and obtains an 87% hit rate at the sim2real task – interesting, but not good enough for the real world yet. 

Why this matters: Being able to adapt AI systems to different contexts will be fundamental to the broader industrialization of AI; advances in sim2real techniques reduce the costs of taking a model trained for one context and porting it to another, which will likely – once the techniques are mature – encourage model proliferation and dissemination.
   Read more: TuneNet: One-Shot Residual Tuning for System Identification and Sim-to-Real Robot Task Transfer (Arxiv)

####################################################

Perfecting facial expression recognition for fine-grained surveillance:
…You seem happy. Now you seem worried. Now you seem cautious. Why?…
Researchers with Fudan University and Ping An OneConnect want to build AI systems that can automatically label the emotions displayed by people seen via surveillance cameras. “Facial expression recognition (FER) is widely used in multiple applications such as psychology, medicine, security and education,” they write. (For the purposes of this write-up, let’s put aside the numerous thorny issues relating to the validity of using certain kinds of ’emotion recognition’ techniques in AI systems.) 

Dataset construction: To build their system, they gather a large-scale dataset that consists of 200,000 images of 119 people displaying any of four poses and 54 facial expressions. The researchers also use data augmentation to artificially grow the dataset via a system called a facial pose generative adversarial network (FaPE-GAN), which generates additional facial expression images for training. 

To create the dataset, the researchers invited participants into a room filled with video cameras to have “a normal conversation between the candidate and two psychological experts” which lasts for about 30 minutes. After this, a panel of three psychologists reviewed each video and assigned labels to specific psychological states; the dataset only includes videos where all three psychologists agreed on the label. Each participant is captured from four different orientations: face-on, from the left, from the right, and an overhead view.
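
That agreement filter is worth making concrete, since it’s doing a lot of work in the dataset’s label quality. A sketch with invented data structures:

# Sketch of the dataset's agreement filter: keep a clip only when all three
# annotators assign the same label. Data structures here are invented.
def unanimous(clips):
    kept = []
    for clip in clips:
        labels = clip["labels"]  # one label per annotator
        if len(set(labels)) == 1:
            kept.append({"video": clip["video"], "label": labels[0]})
    return kept

clips = [
    {"video": "p001_face_on.mp4", "labels": ["boredom", "boredom", "boredom"]},
    {"video": "p001_left.mp4", "labels": ["fear", "optimism", "fear"]},
]
print(unanimous(clips))  # only the unanimous first clip survives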

54 expressions: The researchers tie 54 distinct facial expressions with specific terms that – they say – correlate to emotions. These terms include things like boredom, fear, optimism, boastfulness, aggressiveness, disapproval, neglect, and more. 

Four challenges: The researchers propose four evaluation challenges people can test systems on to develop more effective facial recognition systems. These include: expression recognition with a balanced setting (ER-SS); unbalanced expression (ER-UE), where they make 20% of the facial expressions relate to particularly rare classes; unbalanced poses (ER-UP), where they assume the left-facing views are rarer than the other ones; and zero-shot ID (ER-ZID), where they try to recognize the facial expressions of people that haven’t been seen before to test “whether the model can learn the person invariant feature for emotion classification”.

What faces are useful for: The researchers show that their dataset, F2ED, can be used to pre-train models which are subsequently fine-tuned on other facial emotion recognition datasets, including FER2013 and JAFFE.

Why this matters: Data is one of the key fuels of AI progress, so a dataset containing a couple of hundred thousand labelled pictures of faces will be like jetfuel for human surveillance. However, the scientific basis for much of facial expression recognition is contentious, which increases the chance that the use of this technology will have unanticipated consequences.
   Read more: A Fine-Grained Facial Expression Database for End-to-End Multi-Pose Facial Expression Recognition (Arxiv).

####################################################

Tech Tales:

The Adversarial Architecture Arms Race
US Embassy, Baghdad, 2060

So a few decades ago they started letting the buildings build themselves. I guess that’s why we’re here now. 

It started like this: we decided to have the computers help us design our buildings, and so we’d ask them questions like: given a specific construction schematic, what vulnerabilities are possible? They’d come back with some kind of answer, and we’d design the building around the most conservative set of their predictions. Other governments did the same. Embassies worldwide sprouted odd appendages – strange, seemingly-illogical turrets and gantries, all designed to optimize for internal security while increasing the ability to survey and predict likely actions in the emergent environment. 

“Country lumps”, people called some of the embassies.
“State growths”
“Sovereign machines”. 

Eventually, the buildings evolved for a type of war that we’re not sure we can understand any more. Now, construction companies get called up by machines that use synthesized voices to order up buildings with slanted roofs and oddly shaped details, all designed to throw off increasingly sophisticated signals equipment. Now I get to work in buildings that feel more like mazes than places to do business. 

They say that some of the next buildings are fully ‘lights out’ – designed to ‘serve all administrative functions without an on-site human’, as the brochures from the robot designers say. The building still has doors and corridors, and even some waiting spaces for people – but for how long, I wonder? When does all of this become more efficient? When do we start hurling little boxes of computation into cities and calling these the new buildings? When do we expect to have civilizations that exist solely in corridors of fibre wiring and high-speed interconnects? And how might those entities design themselves?

Things that inspired this story: Generative design; the architecture of prisons and quasi-fortress buildings; the American Embassy in Berlin; game theory between machines.