Import AI

Import AI: 160: Spotting sick crops in the iCassava challenge, testing AI agents with BSuite, and PHYRE tests if machines can learn physics

AI agents are getting smarter, so we need new evaluation methods. Enter BSuite:
…DeepMind’s testing framework is designed to let scientists know when progress is real and when it is an illusion…
When is progress real and when is it an illusion? That’s a question that comes down to measurement and, specifically, the ability for people to isolate the causes of advancement in a given scientific endeavor. To help scientists better measure and assess AI progress, researchers with DeepMind have developed and released the Behaviour Suite for Reinforcement Learning.

BSuite: What it is: BSuite is a software package to help researchers test out the capabilities of increasingly sophisticated reinforcement learning agents. BSuite ships with a set of experiments to help people assess how smart their agents are, and to isolate the specific causes for their intelligence. “These experiments embody fundamental issues, such as ‘exploration’ or ‘memory’ in a way that can be easily tested and iterated,” they write. “For the development of theory, they force us to instantiate measurable and falsifiable hypotheses that we might later formalize into provable guarantees.”

BSuite’s software: BSuite ships with experiments, reference implementations of several reinforcement learning algorithms, example ways to plug BSuite into other codebases like ‘OpenAI Gym’, scripts to automate running large-scale experiments on Google Cloud, a pre-made Jupyter interactive notebook so people can easily monitor experiments, and a tool to formulaically generate the LaTeX needed for conference submissions.

Testing your AI with BSuite’s experiments: Each BSuite experiment has three components: an environment, a period of interaction (e.g., 100 episodes), and ‘analysis’ code to map agent behaviour to results. BSuite lets researchers assess agent performance on multiple dimensions in a ‘radar’ plot that displays how well each agent does at a task with respect to traits like memory, generalization, exploration, and so on. Initially, BSuite ships with several simple environments that challenge different parts of an RL algorithm, ranging from simple things like controlling a small mountain car as it tries to climb a hill, to more complex scenarios based around exploration (e.g., “Deep Sea”) and memory (e.g., “memory_len” and “memory_size”).
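To make the workflow concrete, here’s a minimal sketch of running a single BSuite experiment with a placeholder random agent. The load_and_record helper, the ‘memory_len/0’ experiment ID, and the dm_env-style interface follow the project’s documentation, but treat the exact names and signatures as assumptions rather than gospel.

```python
# Minimal sketch: run one BSuite experiment with a stand-in random agent.
# `load_and_record` logs results to CSV so BSuite's analysis notebook can
# later turn them into the 'radar' plot described above.
import numpy as np
import bsuite

env = bsuite.load_and_record('memory_len/0', save_path='/tmp/bsuite_demo',
                             overwrite=True)

for _ in range(env.bsuite_num_episodes):  # interaction budget fixed by the experiment
    timestep = env.reset()
    while not timestep.last():
        # A real agent would condition on timestep.observation; we act randomly.
        action = np.random.randint(env.action_spec().num_values)
        timestep = env.step(action)
```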

Why this matters: BSuite is a symptom of a larger trend in AI research – we’re beginning to develop systems with such sophistication that we need to study them along multiple dimensions, while carefully curating the increasingly sophisticated environments we train them in. In a few years, perhaps we’ll see reinforcement learning agents mature to the point that they can start to develop across-the-board ‘superhuman’ capabilities at hard cognitive capabilities like memory and generalization – if that happens, we’d like to know, and it’ll be tools like BSuite that help us know this.
   Read more: Behaviour Suite for Reinforcement Learning (Arxiv).
   Get the BSuite code here (official GitHub repository).

####################################################

Spotting problems with Cassava via smartphone-deployed AI systems:
…All watched over and fed by machines of loving grace…
Cassava is the second largest provider of carbohydrates in Africa. How could artificial intelligence help local farmers better grow and care for this crucial staple crop? New research from Google, the Artificial Intelligence Lab at Makerere University, and the National Crops Resources Research Institute in Uganda proposes a new AI competition to encourage researchers to design systems that can diagnose various cassava diseases.

Smartphones, meet AI: Smartphones have proliferated wildly across Africa, meaning that even many poor farmers have access to a device with a modern digital camera and some local processing capacity. The idea behind the iCassava 2019 competition is to develop systems that can be deployed on these smartphones, letting farmers automatically diagnose their crops. “The solution should be able to run on the farmers phones, requiring a fast and light-weight model with minimal access to the cloud,” the researchers write. 

iCassava 2019: The competition required systems to assign one of five labels to each Cassava picture: healthy, or one of four Cassava diseases: brown streak disease (CBSD), mosaic disease (CMD), bacterial blight (CBB), and green mite (CGM). The data was collected as part of a crowdsourcing project using smartphones, so the images in the dataset have a variety of different lighting patterns and other confounding factors, like strange angles, photos from different times of day, improper camera focus, and so on.

iCassava 2019 results and next steps: The top three contenders in the competition each obtained accuracy scores of around 93%. The winning entry used a large corpus of unlabeled images as an additional training signal. All winners built their systems around a residual network (resnet). 
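For flavour, here’s a minimal sketch of the general shape of the winning recipes: fine-tuning a small pretrained residual network for the five classes, in PyTorch. The class count comes from the competition description; everything else (model choice, hyperparameters) is illustrative rather than what the winners actually used.

```python
# Sketch: fine-tune a small ResNet for the five cassava labels
# (healthy, CBSD, CMD, CBB, CGM). Hyperparameters are illustrative.
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 5

model = models.resnet18(pretrained=True)                # small enough to export to mobile
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """One supervised update on a batch of (N, 3, H, W) images and integer labels."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```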

Next steps: The challenge authors plan to build and release more Cassava datasets in the future, and also plan to host more challenges “which incorporate the extra complexities arising from multiple diseases associated with each plant as well as varying levels of severity”. 

Why this matters: Systems like this show how AI can have a significant real-world impact, and point to a future where governments initiate competitions to help their civilians deal with day-to-day problems, like diagnosing crop diseases. And as smartphones get more powerful and cheaper over time, we can expect more and more powerful AI capabilities to get distributed to the ‘edge’ in this way. Soon, everyone will have special ‘sensory augmentations’ enabled by custom AI models deployed on phones.
   Read more: iCassava 2019 Fine-Grained Visual Categorization Challenge (Arxiv).
   Get the Cassava data here (official competition GitHub).

####################################################

Accessibility and AI, meet Kannada-MNIST:
…Building new datasets to make cultures visible to machines…
AI classifiers, increasingly, rule the world around us: They decide what gets noticed and what doesn’t. They apply labels. They ultimately make decisions. And when it comes to writing, most of these classifiers are built to work for the world’s largest and best-documented languages – think English, Chinese, French, German, and so on. What about all the other languages in the world? For them to be ‘seen’, we’ll need to be able to develop systems that can understand them – that’s the idea behind Kannada-MNIST, an MNIST-clone that uses the Kannada versions of the numbers 0 to 9. In Kannada, “Distinct glyphs are used to represent the numerals 0-9 in the language that appear distinct from the modern Hindu-Arabic numerals in vogue in much of the world today,” the author of the research writes.

Why MNIST? MNIST is the ‘hello world’ of AI – it’s a small, incredibly well-documented and studied, dataset consisting of tens of thousands of handwritten numbers ranging from 0 to 9. MNIST has since been superseded by more sophisticated datasets, like CIFAR and ImageNet. But many researchers will still validate things against it during the early stages of research. Therefore, creating variants of MNIST that are similarly small, tractable, and well-documented seems like a helpful thing to do for researchers. It also seems like creating MNIST variants in things that are currently understudied – like the Kannada language – can be a cheap way to generate interest. To generate Kannada-MNIST, 65 volunteers drew 70,000 numerals in total.  
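Because Kannada-MNIST keeps MNIST’s familiar format (28x28 grayscale images, ten classes), existing MNIST pipelines should need little modification. As a minimal sketch, here is a small convnet in PyTorch that would work for either dataset; the data loading is omitted since the repository’s file layout isn’t described here.

```python
# Sketch: a small convnet for 28x28 grayscale digit images (MNIST or Kannada-MNIST).
import torch.nn as nn

class SmallDigitNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28 -> 14
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 7
        )
        self.classifier = nn.Linear(64 * 7 * 7, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```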

A harder MNIST: The researcher has also developed Dig-MNIST – a version of the Kannada dataset where volunteers were exposed to Kannada numerals for the first time and then had to draw their own versions. “This sampling-bias, combined with the fact we used a completely different writing sheet dimension and scanner settings, resulted in a dataset that would turn out to be far more challenging than the [standard Kannada] test dataset”, the author writes.

Why this matters: Soon, we’ll have two worlds: the normal world and the AI-driven world. Right now, the AI-driven world is going to favor some of the contemporary world’s dominant cultures/languages/stereotypes, and so on. Datasets like Kannada-MNIST can potentially help shift this balance.
   Read more: Kannada-MNIST: A New Handwritten Digits Dataset for the Kannada Language (Arxiv).
   The companion GitHub repository for this paper is here (Kannada MNIST GitHub).

####################################################

Your machine sounds funny – I predict it’s going to explode:
…ToyADMOS dataset helps people teach machines to spot the audio hallmarks of mechanical faults…
Did you know that it’s possible to listen for failure, as well as visually analyze for it? Now, researchers with NTT Media Intelligence Laboratories and Ritsumeikan University want to make it easier to teach machines to listen for faults via a new dataset called ToyADMOS. 

ToyADMOS: ToyADMOS is designed around three tasks: production inspection of a toy car, fault diagnosis of a fixed machine (a toy conveyor), and fault diagnosis of a moving machine (a toy train). Each scenario is recorded with multiple microphones, capturing both machine and environmental sounds. ToyADMOS contains “over 180 hours of normal machine-operating sounds and over 4,000 samples of anomalous sounds collected with four microphones at a 48-kHz sampling rate,” they write.

Faults, faults everywhere: For each of the tasks, the researchers simulated a variety of failures. These included things like running the toy car with a bent shaft, or with different sorts of tyres; altering the tensions in the pulleys of the toy conveyor, and breaking the axles and tracks of the toy train. 

Why ToyADMOS: Researchers should use the dataset because it was built under controlled conditions, letting the researchers easily separate and label anomalous and non-anomalous sounds. “The limitation of the ToyADMOS dataset is that toy sounds and real machine sounds do not necessarily match exactly,” they write. “One of the determining factors of machine sounds is the size of the machine. Therefore, the details of the spectral shape of a toy and a real machine sound often differ, even though the time-frequency structure is similar. Thus, we need to reconsider the pre-processing parameters evaluated with the ToyADMOS dataset, such as filterbank parameters, before using it with a real-world ADMOS system.”
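A common baseline for this kind of anomalous-sound detection – not necessarily what the authors use – is to train an autoencoder on log-mel spectrogram frames of normal machine sounds only, then flag clips whose reconstruction error is high. A minimal sketch:

```python
# Sketch: autoencoder-based anomaly scoring over log-mel spectrogram frames.
# Train only on *normal* machine sounds; anomalies show up as high reconstruction error.
import torch
import torch.nn as nn

N_MELS = 64

autoencoder = nn.Sequential(
    nn.Linear(N_MELS, 32), nn.ReLU(),
    nn.Linear(32, 8), nn.ReLU(),     # bottleneck
    nn.Linear(8, 32), nn.ReLU(),
    nn.Linear(32, N_MELS),
)

def anomaly_score(frames: torch.Tensor) -> float:
    """frames: (num_frames, N_MELS) log-mel features for one audio clip."""
    with torch.no_grad():
        reconstruction = autoencoder(frames)
    return ((frames - reconstruction) ** 2).mean().item()  # higher = more anomalous
```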

Why this matters: In a few years, many parts of the world will be watched over by machines – machines that will ‘see’ and ‘hear’ the world around them, learning what things are usual and what things are unusual. Eventually, we can imagine warehouses where small machines are removed weeks before they break, after a machine with a distinguished ear spots the idiosyncratic sounds of a future-break.
   Read more: ToyADMOS: A Dataset of Miniature-Machine Operating Sounds For Anomalous Sound Detection (Arxiv).
   Get the ToyADMOS data from here (Arxiv).

####################################################

Can your AI learn the laws of nature? No. What about the laws of PHYRE?
…Facebook’s new simulator challenges agents to interact with a complex, 2D, physics world…
Given a non-random universe, infinite time, and the ability to experiment, could we learn the rules of existence? The answer to this is, intuitively, yes. Now, researchers with Facebook AI Research want to see if they can use a basic physics simulator to teach AI systems physics-based reasoning. The new ‘PHYRE’ (PHYsical REasoning) benchmark gives AI researchers a tool to test how well their systems understand complex things like causality, physical dynamics, and so on.

What PHYRE is: PHYRE is a simulator containing a bunch of environments that can be manipulated by RL agents. Each environment is a two-dimensional world containing “a constant downward gravitational force and a small amount of friction”. The agent is presented with a scenario – like a ball sitting in a green cup that is balanced on a platform above a red cup – and is asked to change the state of the world, for instance by getting the ball from the green cup into the red cup. “The agent aims to achieve the goal by taking a single action, placing one or more new dynamic bodies into the world”, the researchers write. In this case, the agent could solve its task by manifesting a new ball which rolls into the green cup, tipping it over so the original ball falls into the red cup. “Once the simulation is complete, the agent receives a binary reward indicating whether the goal was achieved”, they write.

One benchmark, many challenges: PHYRE initially consists of two tiers of difficulty (one ball and two balls), and each tier has 25 task templates (think of these templates as like basic worlds in a videogame) and each template contains 100 tasks (think of these as like individual levels in a videogame world). 
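To give a sense of the interaction model, here is a rough sketch of evaluating random actions on the one-ball tier with the phyre Python package. The helper names follow the project’s published examples (phyre.get_fold, phyre.initialize_simulator, simulate_action), but treat the exact signatures, the eval-setup string, and the action encoding as assumptions.

```python
# Sketch: try one random action per task on PHYRE's single-ball tier and count solves.
import numpy as np
import phyre

train_tasks, dev_tasks, test_tasks = phyre.get_fold('ball_cross_template', 0)
simulator = phyre.initialize_simulator(train_tasks, 'ball')

solved = 0
for task_index in range(len(train_tasks)):
    action = np.random.uniform(size=3)   # (x, y, radius) of the ball we drop in, all in [0, 1]
    simulation = simulator.simulate_action(task_index, action, need_images=False)
    solved += int(simulation.status.is_solved())

print(f'solved {solved}/{len(train_tasks)} tasks with a single random action each')
```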

How hard is it? In tests, the researchers show that a variety of baselines – including souped-up versions of DQN, and a non-parametric agent with online learning – struggle to do well even on the single-ball tasks, barely obtaining scores better than 50% on many of them. “PHYRE aims to enable the development of physical reasoning algorithms with strong generalization properties mirroring those of humans,” the researchers write. “Yet the baseline methods studied in this work are far from this goal, demonstrating limited generalization abilities”. 

Why this matters: For the past few years multiple different AI groups have taken a swing at the hard problem of developing agents that can learn to model the physics dynamics of an environment. The problem these researchers keep running into is that agents, as any AI practitioner knows, are so damn lazy they’ll solve the task without learning anything useful! Simulators like PHYRE represent another attempt to see if we can develop the right environment and infrastructure to encourage the right kind of learning to emerge. In the next year or so, we’ll be able to judge how successful this is by reading papers that reference the benchmark.
   Read more: PHYRE: A New Benchmark for Physical Reasoning (Arxiv).
   Play with PHYRE tasks on this interactive website (PHYRE website).
   Get the PHYRE code here (PHYRE GitHub).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

Why Peter Thiel’s views on AI miss the forest for the trees:
Peter Thiel, co-founder of Palantir and PayPal, wrote an opinion piece earlier this month on military applications of AI and US-China competition. Thiel argued that AI should be treated primarily as a military technology, and attacked Google and others for opening AI labs in China.

AI is not a military technology:
While it will have military applications, advanced AI is better compared with electricity than with nuclear weapons. AI is an all-purpose tool that will have wide-ranging applications, including military uses, but also countless others. While it is important to understand the military implications of AI, it is in everyone’s interest to ensure the technology is developed primarily for the benefit of humanity, rather than for waging war. Thiel’s company, Palantir, has major defense contracts with the US government, leading critics to point out his commercial interest in propagating the narrative of AI as a primarily military technology.

Cooperation is good: Thiel’s criticism of firms for opening labs in China and hiring Chinese nationals is also misguided. The US and China are the leading players in AI, and forging trust and communication between the two communities is a clear positive for the world. Ensuring that the development of advanced AI goes well will require significant coordination between powers — for example, developing shared standards on withholding dangerous research, or on technical safety.

Why it matters: There is a real risk that an arms race dynamic between the US and China could lead to increased militarization of AI technologies, and to both sides underinvesting in ensuring AI systems are robust and beneficial. This could have catastrophic consequences, and would reduce the likelihood of advanced AI resulting in broadly distributed benefits for humanity. The AI community should resist attempts to propagate hawkish narratives about US-China competition.
   Read more: Why an AI arms race with China would be bad for humanity (Vox).

####################################################

Tech Tales:

We’ll All Know in the End (WAKE)

“There there,” the robot said, “all better now”. Its manipulator clanged into the metal chest of the other robot, which then issued a series of beeps, before the lights in its eyes dimmed and it became still.
   “Bring the recycler,” the robot said. “Our friend has passed on.”
   The recycling cart appeared a couple of minutes later. It wheezed its way up to the two robots, then opened a door in its side; the living robot pushed the small robot in, the door shut, and the recycling cart left.
   “Now,” said the living robot in the cold, dark room. “Who else needs assistance?”

Outside the room, the recycler moved down a corridor. It entered other rooms and collected other robots. Then it reached the end of the corridor and stopped in front of a door with the word NURSERY burned into its wood via laser. It issued a series of beeps and then the door swung open. The recycler trundled in.

Perhaps three hours later, some small, living robots crawled out of a door at the other end of the NURSERY. They emerged, blinking and happy and with their clocks set running, to explore the world and learn about it. A large robot waited for them and extended its manipulator to hold their hands. “There there,” it said. “All better now”. Together, they trundled into the distance.

This has been happening for more than one thousand years. 

Things that inspired this story: Hospices; patterns masked in static; Rashomon for robots; the circle of life – Silicon Edition!. 

Import AI: 159: Characterizing attacks on AI systems; teaching AI systems to subvert ML security systems; and what happens when AI regenerates actors

Can you outsmart a machine learning malware detector?
…Enter the MLSEC competition to find out…
Today, many antivirus companies use machine learning models to try and spot malware – a new competition wants to challenge people to design malware payloads that evade these machine learning classifiers. The Machine Learning Static Evasion Competition (MLSEC) was announced at the ‘Defcon’ security conference this week. 

White box attack: “The competition will demonstrate a white box attack, wherein participants will have access to each model’s parameters and source code,” the organizers write. “Points will be awarded to participants based on how many samples bypass each machine learning model. In particular, for each functional modified malware sample, one point is awarded for each ML model that it bypasses.”

Registrants only: Participants can access functional malicious software binaries, so entrants will need to register before they can download the malware samples.

Why this matters: Security is a cat & mouse game between attackers and defenders, and machine learning systems are already helping us create more adaptive, general forms of security defense and offense. Competitions like MLSEC will generate valuable evidence about the relative strengths and weaknesses of ML-based security systems, helping us forecast how these systems might influence society.
   Register, then check out the code (official competition GitHub, hosted by Endgame Security).
   Read more: MLSEC overview (official competition website)

####################################################

Need a new Gym for your AI agent? Try getting it to open a door:
…DoorGym teaches robots how to open a near-infinite number of simulated doors…
If any contemporary robots were to become sentient and seek to destroy humanity, then one of the smartest things people could do to protect themselves would be to climb up some stairs and go into a room and shut the door behind them. That’s because today’s robots have a really hard time doing simple physical things like climbing stairs or opening doors. New research from Panasonic Beta, the startup Totemic, and the University of California at Berkeley tries to change this with ‘DoorGym’, software to help researchers teach simulated robots to open doors. DoorGym is “intended to be a first step to move reinforcement learning from toy environments towards useful atomic skills that can be composed and extended towards a broader goal”. 

Enter the Randomized Door-World Generator!: DoorGym uses the ‘Mujoco’ robotics simulator to generate a selection of doors with different handles (ranging from easy doorknobs based around pulling, to more complex ones that involve grasping), and then uses a technique called domain randomization to generate tens of thousands of different door simulations, varying things like the appearance and physics characteristics of the robot, door, doorknob, door frame, and wall. This highlights how domain randomization lets researchers trade compute for data – instead of needing to gather data of lots of different doors in the world, DoorGym just uses computers to automatically generate different types of door. DoorGym also ships with a simulated Berkeley ‘BLUE’ low-cost robot arm. 
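Domain randomization itself is simple to sketch: before each episode, sample the door’s physical and visual parameters from broad distributions, so the policy can’t overfit to any single door. The parameter names and ranges below are illustrative, not DoorGym’s actual configuration keys.

```python
# Sketch: domain randomization for door-opening — sample a fresh door spec per episode.
# All parameter names and ranges are illustrative, not DoorGym's real config.
import random

def sample_door_config():
    return {
        'knob_type': random.choice(['pull_knob', 'lever', 'round_grasp']),
        'door_mass_kg': random.uniform(5.0, 40.0),
        'hinge_friction': random.uniform(0.01, 1.0),
        'knob_height_m': random.uniform(0.8, 1.1),
        'door_rgb': [random.random() for _ in range(3)],  # visual randomization
    }

def make_randomized_env(build_env):
    """build_env is assumed to construct a simulated environment from a door config."""
    return build_env(sample_door_config())
```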

Door opening baselines: In tests, the researchers evaluate two popular RL algorithms, PPO and SAC, on three tasks within DoorGym. The tests show that Proximal Policy Optimization (PPO) obtains far higher scores than SAC, though SAC has slightly better early exploration properties. This is a somewhat interesting result – PPO, an OpenAI-developed RL algorithm, came out a couple of years ago and has since become a de facto standard for RL research, partially because it’s a relatively simple algorithm with relatively few parameters; this may add some legitimacy to the idea that simple algorithms that scale up will tend to be successful.

The future of DOORS: In the future, the researchers will expand the number of baselines they test on, “as well as incorporating more complicated tasks such as a broader range of doorknobs, locked doors, door knob generalization, and multi-agent scenarios”. 

Why this matters: Systems like DoorGym are an indication of the rapid maturity of research at the intersection of AI and robotics. If systems like this become standard testbeds for RL algorithms, they could ultimately lead to the creation of more intelligent and capable robot arms, which could significantly increase the economic impact of robot-based automation.
   Read more: DoorGym: A Scalable Door Opening Environment And Baseline Agent (Arxiv).

####################################################

Is that a car or a spy robot? Why not both?
…Tesla mod turns your car into a roving surveillance system…
An enterprising software engineer has developed a DIY computer called the ‘Surveillance Detection Scout’ that can turn any Tesla Model S or Model 3 into a roving surveillance vehicle. The mod taps into the Tesla’s dash and rearview cameras, then uses open source image recognition software to analyze license plates and faces that the Tesla sees, so the software can warn the car owner if it is being followed. “When the car is parked, it can track nearby faces to see which ones repeatedly appear,” Wired magazine writes. “The intent is to offer a warning that someone might be preparing to steal the car, tamper with it or break into the driver’s nearby home”.

Why this matters: The future is rich people putting DIY software and computers into their machines, giving them enhanced cognitive capabilities relative to other people. Just wait till we optimize thrust/weight for small drones, and wealthy people start getting surrounded by literal ‘thought clouds’.
   Read more: This Tesla Mod Turns a Model S into a Mobile ‘Surveillance Station’ (Wired).

####################################################

Facebook approaches human-level performance on the tough ‘SuperGLUE’ benchmark:
…What happens when AI progress outpaces the complexity of our benchmarks?…
Recently, language AI systems have started to get really good. This is mostly due to a vast number of organizations developing language modeling approaches based on unsupervised pre-training – basically, training large language models with simple objectives on vast amounts of data. Such systems – BERT, GPT-2, ULMFiT, etc – have revolutionized parts of NLP, obtaining new state-of-the-art scores on a variety of benchmarks, and generating credibly interesting synthetic text. 

Now, researchers from Facebook have shown just how powerful these new systems are with RoBERTa, a replication of Google’s BERT system that is trained for longer with more careful hyperparameter selection. RoBERTa obtains new state-of-the-art scores on a bunch of benchmarks, including GLUE, RACE, and SQuAD. Most significantly, the researchers announced on Friday that RoBERTa was now the top entry on the ‘SuperGLUE’ language challenge. That’s significant because SuperGLUE was published this year as a significantly harder version of GLUE – the multi-task language benchmark that preceded it. It’s notable that RoBERTa shows a 15 absolute percentage point improvement over the initial top SuperGLUE entry, and RoBERTa’s score of 84.6% is relatively close to human baselines of 89.8%.
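For readers who want to poke at models in this family, here is a minimal sketch of loading a pretrained RoBERTa with a classification head via the Hugging Face transformers library. The library, checkpoint name, and two-sentence task framing are assumptions for illustration; they are not part of Facebook’s release.

```python
# Sketch: run a sentence pair through a RoBERTa model with a classification head.
# Fine-tuning this head on a GLUE/SuperGLUE-style task is what produces benchmark scores.
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

inputs = tokenizer('The cat sat on the mat.', 'A cat is sitting on a mat.',
                   return_tensors='pt')
logits = model(**inputs).logits   # untrained head: logits are meaningless until fine-tuned
```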

Why this matters: Multi-task benchmarks like SuperGLUE are one of the best ways we have of judging where we are in terms of AI development, so it’s significant if our ability to beat such benchmarks outpaces our ability to create them. As one of SuperGLUE’s creators, Sam Bowman, wonders: “There’s still headroom left for further work—our estimate of human performance is a very conservative lower bound. I’d also bet that the next five or ten percentage points are going to be quite a bit harder to handle,” he writes. “But I think there are still hard open questions about how we should measure academic progress on real-world tasks, now that we really do seem to have solved the average case.”
   Read Sam Bowman’s tweets about the SuperGLUE result (Sam Bowman’s twitter account.)
   Check out the ‘SuperGLUE’ leaderboard here (SuperGLUE official website).
   Read more: RoBERTa: A Robustly Optimized BERT Pretraining Approach (Arxiv)

####################################################

How can I attack your reinforcement learning system? Let me count the ways:
…A taxonomy of attacks, and some next steps…
How might hackers target a system trained with reinforcement learning? This question is going to become increasingly important as we go from RL systems that are primarily developed for research, to ones that are developed for production purposes. Now, researchers have come up with a “taxonomy of adversarial attacks on DRL systems” and have proposed and analyzed ten attacks on DRL systems in a survey paper from the University of Michigan, University of Illinois at Urbana-Champaign, University of California at Berkeley, Tsinghua University, and JD AI Research.

The three ways to attack RL:
“RL environments are usually modeled as a Markov Decision Process (MDP) that consists of observation space, action space, and environment (transition) dynamics,” the researchers write. Therefore, they break their taxonomy of attacks into these three sub-sections of RL. Each of the different sub-sections demands different tactics: for instance, to attack an observation space you might modify the sensors of a device, while to attack an action space you could send alternative control signals to an actuator attached to a robot in a factory, and for environmental attacks you could alter the environment – for instance, if attacking an autonomous car, you could change the road surface to one the car hadn’t been trained on.
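Observation-space attacks are the most studied of the three. A classic example from the adversarial-examples literature (a general technique, not a specific attack from this survey) is an FGSM-style perturbation: nudge the observation in the direction that most reduces the agent’s confidence in its preferred action. A minimal sketch:

```python
# Sketch: FGSM-style perturbation of an RL agent's observation.
import torch
import torch.nn.functional as F

def perturb_observation(policy, obs, epsilon=0.01):
    """policy: maps a (batch, ...) observation tensor to (batch, n_actions) logits.
    Returns an observation nudged to make the agent's preferred action look worse."""
    obs = obs.clone().detach().requires_grad_(True)
    logits = policy(obs)
    preferred = logits.argmax(dim=-1)
    loss = F.cross_entropy(logits, preferred)   # confidence in the currently chosen action
    loss.backward()
    # Ascend the loss: the perturbed observation suppresses the preferred action.
    return (obs + epsilon * obs.grad.sign()).detach()
```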

An attack taxonomy: The researchers ultimately come up with a set of attacks on RL systems that go after different parts of the MDP (though the vast majority of these exploits attack the observation space, rather than others). They distinguish between white-box (you have access to the system) and black-box (you don’t have access to the system) attacks, and also describe other salient traits like whether the exploit works in real time, or if it introduces some kind of dependency. 

Why this matters: ‘Hacking’ in an AI world looks different to hacking in a non-AI world, chiefly because AI systems tend to have some autonomous properties (eg, autonomous perception, or autonomous action given a specific input), which can be exploited by attackers to create dangerous or emergent behaviors. I think that securing AI systems is going to be an increasingly significant challenge, given the large space of possible exploits.
   Read more: Characterizing Attacks on Deep Reinforcement Learning (Arxiv)

####################################################

Want to clone a voice using a few seconds of audio? Now you can:
…GitHub project makes low-quality voice cloning simple…
An independent researcher has published code to make it easy to ‘clone’ a voice with a few seconds of audio. The results today are a little unconvincing (e.g., much of the data used to train the speech synthesizer came from people reading audiobooks, so the diction may not map well to naturally spoken dialogue). However, the technology is indicative of future capabilities: while it’s somewhat janky today, we can expect people to build other, better open source systems in the future, which will yield even better outputs.

Why this matters: You can do a lot with function approximation – and many of the things you might want to do to create fake content depend on really good function approximation (e.g., inventing a system to transpose a voice from one accent to another, or mimic someone’s image, etc). Soon, we’re going to be dealing with a world full of synthetic content, and it’s unclear what happens next.
   Check out a video walkthrough of the ‘Real-Time Voice Cloning Toolbox’ here (YouTube).
   Get the code here (Real Time Voice Cloning GitHub).
   Read more: Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis (Arxiv).

####################################################

Tech Tales:

Both the new hollywood and the old hollywood will break your heart

I didn’t just love him, I wanted to be him: Jerry Daytime, star of the hit generative comedies “Help, my other spaceship is a time machine” and “Growing up hungry”; the quiz show “Cake Exploders”; and the AI-actor ‘documentary’ series ‘Inside the servers’. 

I’d been watching Jerry since I could remember watching anything. I’d seen him fight tigers on the edge of waterfalls, defend the law in Cogtown, and guest host World News at 1 on Upload Day. He even did guest vocals on ‘Remember Love’, the song used to raise funds for the West Coast after the big rip.

Things started changing a year ago, though. That’s when Jerry Daytime’s ‘uniqueness’ protections expired, and the generative Jerries arrived. Now, we’ve got a bunch of Jerries. There’s Jerry Nighttime, who looks and acts the same except he’s always in shadow with a five o’clock shadow. There’s Jerry Kidtime who does the Children’s Books. Jerry Doctor for bad news, and Jenny Daytime the female-presenting Jerry. And let’s be clear – I love these Jerries! I am not a Jerry extremist!

But I am saying we’ve got to draw a line somewhere. We can’t just have infinite Jerries minus one. 

Jerry Latex-free Condoms. Do we need him?
Jury Downtime – the entertainer for jurors on a break. What about him?
Jenny Lint-time – the cleaning assistant. Do we need her?

I guess my problem is what happens to all the people like me who grew up with Jerry? We get used to Jerry being everywhere around us? We become resigned to all the new Jerries? Because when I watch Jerry Daytime – the Jerry, the original – I now feel bad. I feel like I’m going to blink and everyone else in the show will be Jerry as well, or variants of Jerry. I’m worried when I open the door for a package the droid is going to have Jerry’s face, but it’ll be Jerry Package, not Jerry Daytime. What am I meant to do with that?

Things that inspired this story: Uniqueness; generative models; ‘deepfakes’ spooled out to their logical endpoint; Hollywood; appetites for content.

 

Import AI 158: Facial recognition surveillance; pre-training and the industrialization of AI; making smarter systems with the DIODE depth dataset

The dawn of the endless, undying network:
…ERNIE 2.0 is a “continual pre-training framework” for massive unsupervised learning & subsequent fine-tuning…
ERNIE 2.0, from Baidu, is a “continual pre-training framework” with support for single-task and multi-task training. The sorts of tasks it’s built for are ones where you want to pour a huge amount of compute into a model via unsupervised pre-training on large datasets (see: language modeling), then expose this model to additional incremental data and fine-tune it against new objectives, like satisfying some constraint placed on the system by reality.

A cocktail of training: The researchers use seven pre-training tasks to train a large-scale ‘ERNIE 2.0’ model. In subsequent experiments, the researchers show that ERNIE outperforms similarly scaled ‘BERT’ and ‘XLNet’ models on the competitive ‘GLUE’ benchmark – this is interesting, since BERT has swept a bunch of NLP benchmarks this year. (However, the researchers don’t indicate they’ve tested ERNIE 2.0 on the more sophisticated SuperGLUE benchmark). They also show similarly good performance on a GLUE-esque set of 9 tasks tailored around Chinese NLP.
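Stripped to its skeleton, continual multi-task pre-training is a loop that keeps one shared encoder and keeps cycling every task objective seen so far into the training mix as new ones are introduced. A schematic sketch (the interfaces and sampling scheme here are illustrative, not ERNIE 2.0’s actual recipe):

```python
# Schematic sketch of continual multi-task pre-training: one shared encoder,
# several task-specific heads, new tasks added over time, old tasks kept in the mix.
import random

def pretrain(encoder, task_heads, task_batches, optimizer, steps):
    """task_heads: {name: head module returning a loss}; task_batches: {name: batch iterator}."""
    for _ in range(steps):
        task = random.choice(list(task_heads))            # sample among *all* tasks seen so far
        batch = next(task_batches[task])
        loss = task_heads[task](encoder(batch['inputs']), batch['targets'])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```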

Why this matters: Unsupervised pre-training is the new ‘punchbowl’ of deep learning – a multitude of organizations with significant amounts of compute to spend will likely compete with one another training increasingly large-scale models, which are subsequently re-tooled for tasks. It’ll be interesting to see if in a few months we can infer whether any such systems are being deployed for economically useful purposes, or in research domains (eg biomedicine) where they could unlock other breakthroughs.
   Read more: ERNIE 2.0: A Continual Pre-training Framework for Language Understanding (Arxiv)

####################################################

Feed your AI systems with 500 hours of humans playing Minecraft:
…Agent see, agent do – or at least, that’s the hope…
In recent years, AI systems have learned to beat games ranging in complexity from Chess, to Go, to multi-player strategic games like StarCraft 2 and Dota 2. Now, researchers are thinking about the next set of environments to test AI agents in. Some think that Minecraft, a freeform game where people can mine blocks in a procedurally generated world and build unique items and structures, is a good environment for the frontier of research. Now, researchers with Carnegie Mellon University have released the MineRL dataset: a set of 60 million “state-action pairs of human demonstrations across a range of related tasks in Minecraft”, which can help researchers develop smarter algorithms. The MineRL dataset has been developed as a part of the MineRL competition on sample efficient reinforcement learning (Import AI #145).

The dataset: MineRL consists of 500+ hours of recorded human demonstrations from 1,000 people across six different tasks in Minecraft, like navigating around the world, chopping down trees, or multi-step tasks that result in obtaining specific (sometimes rare) items (diamonds, pickaxes, cooked meat, beds). They’ve released the dataset in a variety of resolutions (ranging from 64x64 to 192x256), so people can experiment with algorithms that operate over imagery of various resolutions. Each demonstration consists of an RGB video of the player’s point of view, as well as a set of features from the game like the distances to objects in the world, details on the player’s inventory, and so on.
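A rough sketch of iterating over the demonstrations with the minerl Python package for something like behavioural cloning; the environment name and data-iterator call follow the project’s documentation, but treat the exact signatures as assumptions.

```python
# Sketch: iterate over MineRL human demonstrations (e.g., for behavioural cloning).
# Assumes the `minerl` package and locally downloaded data; names follow the project docs.
import minerl

data = minerl.data.make('MineRLObtainDiamond-v0', data_dir='/data/minerl')

for obs, action, reward, next_obs, done in data.batch_iter(
        batch_size=32, seq_len=16, num_epochs=1):
    frames = obs['pov']   # RGB point-of-view frames for each step in the batch
    # A behavioural-cloning model would be trained to predict `action` from `frames`.
```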

Why this matters: Is Minecraft a good, challenging environment for next-generation AI research? Datasets like this will help us figure that out, as they give us human baselines for complex tasks, and also involve the sort of multi-step sequences of actions that are challenging for contemporary AI systems. I think it’s interesting to reflect on how fundamental videogames have become to the development of action-oriented AI systems, and I’m excited for when research like that motivated by MineRL leads to smart, curious ‘neural agents’ showing up in more consumer-oriented videogames.
   Get the dataset from the official MineRL website (MineRL.io).
   Read more: MineRL: A Large-Scale Dataset of Minecraft Demonstrations (Arxiv)

####################################################

Want AI systems that can navigate the world? Train them with DIODE:
…RGBD dataset makes it easier to train AIs that have a sense of depth perception…
Researchers with TTI-Chicago, the University of Chicago, and Beihang University have produced DIODE, a large-scale image+depth dataset of indoor and outdoor environments. Datasets like these are crucial to developing AIs that can better reason about three-dimensional worlds.

   “While there have been many recent advances in 2.5D and 3D vision, we believe progress has been hindered by the lack of large diverse real-world datasets comparable to ImageNet and COCO for semantic object recognition,” they write. 

What DIODE is: DIODE consists of around ~25,000 high-resolution photos (sensor depth precision: +/- 1mm, compared to +/- 2cm for other popular datasets like KITTI), split across indoor environments (~8.5k images) and outdoor ones (~17k images). The researchers collected the dataset with a ‘FARO Focus S350 scanner’, in locations ranging from student offices, to large residential buildings, to hiking trails, parking lots, and city streets. 

Why this matters: Datasets like this will make it easier for people to develop more robust machine learning systems that are better able to deal with the subtleties of the world.
   Read more: DIODE: A Dense Indoor and Outdoor DEpth Dataset (Arxiv)

####################################################

Need to morph your AI to work for a world with different physics? Use TuneNet:
…Model tuning & re-tuning framework makes sim2real transfer better…
Researchers with the University of Texas at Austin have released “TuneNet, a residual tuning technique that uses a neural network to modify the parameters of one physical model so it approximates another.” TuneNet is designed to make it easy for researchers to migrate an AI system from one simulation to another, or potentially from simulation to reality. Therefore, TuneNet is built to rapidly analyze the differences between different software simulators with altered physics parameters, and work out what it takes to re-train a model so it can be transferred between them. 

How TuneNet works: “TuneNet takes as input observations from two different models (i.e. a simulator and the real world), and estimates the difference in parameters between the models,” they write. “By estimating the parameter gradient landscape, a small number of iterative tuning updates enable rapid convergence on improved parameters from a single observation from the target model. TuneNet is trained using supervised learning on a dataset of pairs of auto-generated simulated observations, which allows training to proceed without real-world data collection or labeling”.
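In outline, the tuning loop is: simulate with the current parameters, compare observations against the target model, predict a parameter residual, apply it, and repeat. A schematic sketch (the network and simulator interfaces are placeholders, not the authors’ code):

```python
# Schematic sketch of residual tuning: iteratively nudge simulator parameters
# toward a target model using a learned estimate of the parameter difference.
def tune_parameters(tune_net, simulate, params, target_observation, iterations=10):
    """tune_net(obs_sim, obs_target) -> estimated parameter residual.
    simulate(params) -> observation produced by the simulator under those params."""
    for _ in range(iterations):
        sim_observation = simulate(params)
        delta = tune_net(sim_observation, target_observation)
        params = params + delta   # small iterative updates converge on better parameters
    return params
```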

Testing: The researchers perform three experiments “to validate TuneNet’s ability to tune one model to match another”, conducting their tests in the ‘PyBullet’ simulator. These tests cover: how well it adapts to a new target environment, how well it can predict the dynamics of a bouncing ball from one simulation to another, and whether it can transfer from a simulator onto a real-world robot and bounce a ball off an inclined plane into a hoop. The approach does well on all of these, and obtains an 87% hit rate at the sim2real task – interesting, but not good enough for the real world yet.

Why this matters: Being able to adapt AI systems to different contexts will be fundamental to the broader industrialization of AI; advances in sim2real techniques reduce the costs of taking a model trained for one context and porting it to another, which will likely – once the techniques are mature – encourage model proliferation and dissemination.
   Read more: TuneNet: One-Shot Residual Tuning for System Identification and Sim-to-Real Robot Task Transfer (Arxiv)

####################################################

Perfecting facial expression recognition for fine-grained surveillance:
…You seem happy. Now you seem worried. Now you seem cautious. Why?…
Researchers with Fudan University and Ping An OneConnect want to build AI systems that can automatically label the emotions displayed by people seen via surveillance cameras. “Facial expression recognition (FER) is widely used in multiple applications such as psychology, medicine, security and education,” they write. (For the purposes of this write-up, let’s put aside the numerous thorny issues relating to the validity of using certain kinds of ’emotion recognition’ techniques in AI systems.) 

Dataset construction: To build their system, they gather a large-scale dataset that consists of 200,000 images of 119 people displaying any of four poses and 54 facial expressions. The researchers also use data augmentation to artificially grow the dataset via a system called a facial pose generative adversarial network (FaPE-GAN), which generates additional facial expression images for training. 

To create the dataset, the researchers invited participants into a room filled with video cameras to have “a normal conversation between the candidate and two psychological experts” which lasts for about 30 minutes. After this, a panel of three psychologists reviewed each video and assigned labels to specific psychological states; the dataset only includes videos where all three psychologists agreed on the label. Each participant is captured from four different orientations: face-on, from the left, from the right, and an overhead view.

54 expressions: The researchers tie 54 distinct facial expressions with specific terms that – they say – correlate to emotions. These terms include things like boredom, fear, optimism, boastfulness, aggressiveness, disapproval, neglect, and more. 

Four challenges: The researchers propose four evaluation challenges people can test systems on to develop more effective facial recognition systems. These include: expression recognition with a balanced setting (ER-SS); unbalanced expression (ER-UE), where they make 20% of the facial expressions relate to particularly rare classes; unbalanced poses (ER-UP), where they assume the left-facing views are rarer than the other ones; and zero-shot ID (ER-ZID), where they try to recognize the facial expressions of people that haven’t been seen before to test “whether the model can learn the person invariant feature for emotion classification”.

What faces are useful for: The researchers show that their dataset, F2ED, can be used to pre-train models which are subsequently fine-tuned on other facial emotion recognition datasets, including FER2013 and JAFFE.

Why this matters: Data is one of the key fuels of AI progress, so a dataset containing a couple of hundred thousand labelled pictures of faces will be like jetfuel for human surveillance. However, the scientific basis for much of facial expression recognition is contentious, which increases the chance that the use of this technology will have unanticipated consequences.
   Read more: A Fine-Grained Facial Expression Database for End-to-End Multi-Pose Facial Expression Recognition (Arxiv).

####################################################

Tech Tales:

The Adversarial Architecture Arms Race
US Embassy, Baghdad, 2060

So a few decades ago they started letting the buildings build themselves. I guess that’s why we’re here now. 

It started like this: we decided to have the computers help us design our buildings, and so we’d ask them questions like: given a specific construction schematic, what vulnerabilities are possible? They’d come back with some kind of answer, and we’d design the building around the most conservative set of their predictions. Other governments did the same. Embassies worldwide sprouted odd appendages – strange, seemingly-illogical turrets and gantries, all designed to optimize for internal security while increasing the ability to survey and predict likely actions in the emergent environment. 

“Country lumps”, people called some of the embassies.
“State growths”
“Sovereign machines”. 

Eventually, the buildings evolved for a type of war that we’re not sure we can understand any more. Now, construction companies get called up by machines that use synthesized voices to order up buildings with slanted roofs and oddly shaped details, all designed to throw off increasingly sophisticated signals equipment. Now I get to work in buildings that feel more like mazes than places to do business. 

They say that some of the next buildings are fully ‘lights out’ – designed to ‘serve all administrative functions without an on-site human’, as the brochures from the robot designers say. The building still has doors and corridors, and even some waiting spaces for people – but for how long, I wonder? When does all of this become more efficient? When do we start hurling little boxes of computation into cities and call these the new buildings? When do we expect to have civilizations that exist solely in corridors of fibre wiring and high-speed interconnects? And how might those entities design themselves?

Things that inspired this story: Generative design; the architecture of prisons and quasi-fortress buildings; the American Embassy in Berlin; game theory between machines.

Import AI 157: How weather can break self-driving car AI; modelling traffic via deep learning and satellites; and Chinese scientists make a smarter, smaller YOLOv3

Want to break an image classifier? Add some weather:
…Don’t use AI on a snow day…
Many of today’s object recognition systems are less robust and repeatable than people might assume – new research from the University of Tubingen and the International Max Planck Research School for Intelligent Systems shows just how fragile these systems are, with a trio of datasets that help people test the resilience of their AI systems. 

Three datasets to frustrate AI systems: The three datasets are called Pascal-C, Coco-C, and Cityscapes-C; these are ‘corrupted’ versions of existing datasets, and for each dataset the images within are corrupted with any of 15 distortions, each with five levels of severity. Some of the distortions that can be applied to the images include the addition of snow, frost, or fog to an image, as well as other distortions like the addition of noise, or the use of certain types of transforms. 
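The corruptions are also available as an installable Python package from the authors’ GitHub; a minimal usage sketch is below (the function names follow the repository’s README, but treat the exact signature as an assumption).

```python
# Sketch: apply one of the 15 corruption types to an RGB image at a chosen severity.
import numpy as np
from imagecorruptions import corrupt, get_corruption_names

image = np.zeros((224, 224, 3), dtype=np.uint8)   # stand-in for a real RGB image

print(get_corruption_names())                     # e.g. 'snow', 'frost', 'fog', ...
snowy = corrupt(image, corruption_name='snow', severity=3)   # severity ranges from 1 to 5
```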

Just how bad is it: Out-of-the-box algorithms (typically based on the widely-used R-CNN family of models) see relative performance drops of between 30 and 50% on the corrupted versions of the datasets, highlighting the brittleness of many of today’s algorithms.

Saving algorithms with messy data: One simple trick people can use to improve the robustness of models is to train them on stylized data – here, they basically take the underlying dataset and for each image create a variant stylized with a texture. These images are combined with the clean data, then trained on; models trained against datasets that incorporate the stylized data are more robust than those trained purely on clean data – this makes sense, as we’ve basically algorithmically expanded the dataset to encourage a certain type of generalization. 

Why this matters: Datasets like this make it easier for people to investigate the robustness of trained AI models, which can help us understand how contemporary models may fail and provide data to calibrate against when designing more robust ones. And the authors hope that other researchers will expand the benchmark further:
   “We encourage readers to expand the benchmark with novel corruption types. In order to achieve robust models, testing against a wide variety of different image corruptions is necessary, there is no ‘too much’. Since our benchmark is open source, we welcome new corruption types and look forward to your pull requests to https://github.com/bethgelab/imagecorruptions“.
   Read more: Benchmarking Robustness in Object Detection: Autonomous Driving when Winter is Coming (Arxiv).
   Get the code, data, and benchmarking leaderboard here (‘Robust Detection Benchmark’ official GitHub)

####################################################

The future of AI is… *checks notes* an AI assistant for the procedural building game Minecraft:
…Facebook’s ‘CraftAssist’ project tries to build smarter AI systems by having them work alongside humans…
Facebook AI Research wants to study increasingly advanced AI systems by studying how humans work alongside smart computers, so it has developed a bot to assist human players in the procedural building game Minecraft. “The ultimate goal of the bot is to be a useful and fun assistant in a wide variety of tasks specified and evaluated by human players”, they write.

How the bot works: The bot works by taking in written prompts in natural language, then mapping those to sequences of actions, like moving around the world or interacting with objects.
   For instance, in response to the query “go to the blue house”, the agent would try and map ‘blue’ and ‘house’ to entities it had stored in its memory, and if it found them would try to create a ‘move’ task that could let the agent navigate to that part of the world. The team achieves this via a neural semantic parser they call the Text-to-Action-Dictionary (TTAD) model, which converts natural language commands to specific actions. The agent also ships with systems to help it process the world around it, crudely analyzing the terrain, as well as heuristics for referring to objects based on their positions.

Future extensions: Facebook has designed its agent to be extended in the future with more advanced AI capabilities. To that end, any CraftAssist agent can take in images in the form of a 64x64 ‘block’ resolution view (so viewing in terms of blocks in Minecraft, rather than individual pixels). The agents can also access a 3D map of the space they’re in, so they can locate any block within the world around them.

Datasets: Facebook is releasing a dataset consisting of 800,000 pairs of (algorithmically generated) actions and written instructions; 25,402 human-written sentences that map to some of these action pairs; 2,513 suggested/imagined commands from humans that interacted with the bot; and 708 dialogue-action pairs from in-game chat. They’ve also released a ‘House’ dataset, which consists of 2,050 human-built houses from Minecraft.

Why this matters: Embedding AI systems into games will likely be one of the ways that we see people take AI research and port it into production – the use of Minecraft here is interesting given its playerbase numbering in the tens of millions, many of them children. Could we eventually see AI systems trained via the conversations with kids talking in broken English, training more robust policies through childish lingo? I think so! Next up: a generative Fortnite dance machine!
   Read more: CraftAssist: A Framework for Dialogue-enabled Interactive Agents (Arxiv).
   Get the code for CraftAssist here (official GitHub repository).

####################################################

Counting cars with deep learning and satellite imagery:
How can you count cars in countries that don’t have sensors wired into roads and traffic lights to gather the required data? Researchers with CMU and ETH Zurich think the use of deep learning and satellite imagery could be a viable supplement, and could help countries easily get measures for the Average Annual Daily Truck Traffic (AADTT) in a given region. 

In new research, they develop “a remote sensing approach to monitor freight vehicles through the use of high-resolution satellite images,” they write. “As satellite images become both cheaper and are taken at a higher resolution over time, we anticipate that our approach will become scalable at an affordable cost within the next few years to much larger geographic regions”.

The data: To train their system, the researchers hand-annotated vehicles seen in satellite images with around 2,000 bounding boxes from the Northeastern USA. “We used the predicted vehicle count from the detection model, the time stamp of the images, time-varying factors, and speed to make a probabilistic prediction of the AADTT”.

Testing generalization: The researchers gathered the data in America, and tested it also on data gathered from Brazil to explore the generalization properties of their system. “We found that distinct truck types (rather than geography) can impact the prediction accuracy of the detection model, and additional training seems necessary to transfer the model between countries,” they write. Additionally, “information on local driving patterns and labor laws could reduce the estimation error from the traffic monitoring model.” They trained a single-shot detection model to detect vehicles, and found that the model could provide reasonable predictions for locations from the United States, but struggled to provide as accurate predictions for Brazil, even once finetuned. 

Why this matters: Medium- and heavy-duty trucking accounts for about 7% of global CO2 emissions, and more than half of the world’s countries lack the infrastructure needed to accurately monitor traffic in their countries. Therefore, if we can develop AI-based classifiers to provide crude, cheap assessment capabilities, we can gather more data to help inform people about the world.
   Read more: Truck Traffic Monitoring with Satellite Images (Arxiv)

####################################################

How should AI researchers broadcast their insights to the world, and what do they need to be careful about?
…Publication in AI isn’t a binary choice between ‘release’ or ‘don’t release’, there are other tools available…
How can researchers maximize their contribution to scientific discourse while minimizing downsides (dual-use, malicious use, abuse, etc) of their research? That’s a question researchers from The Thoughtful Technology Project and Cambridge University’s Leverhulme Center for the Future of Intelligence, set out to provide some answers to in a blog post and action-oriented paper. 

   The core of their argument is that when researchers think they may have cause to question the release of their research, they should view their choice as being one of many, rather than a binary decision: “We particularly want to emphasize that when thinking about release practices, the choice is not a binary one between ‘release’ or ‘don’t release’. There are several different dimensions to consider and many different options within each of these dimensions, including: (1) content — what is released (options ranging from a fully runnable system all the way to a simple use case idea or concept); (2) timing — when it is released (options include immediate release, release at a specific predetermined time period or external event, staged release of increasingly powerful systems); and (3) distribution — where/to whom it is released to (options ranging from full public access to having release safety levels with auditing and approval processes for determining who has access).”

What should people do? They suggest three things the AI community should do to increase the chance of accruing the maximum possible social benefit from AI while minimizing certain downsides.

  1. Understand the potential risks of research via collaboration with experts, and develop mitigation strategies
  2. Build a community devoted to mitigating malicious use impacts of AI research and work to establish collective norms. 
  3. Create institutions to manage research practices in ML, potentially including techniques for expert vetting of certain research, as well as the development of sophisticated release procedures for research. 

   Read more: Reducing malicious use of synthetic media research (Medium).

####################################################

Chinese scientists make a smarter, smaller drone vision system:
…What happens when drones become really, really smart?…
I have a confession to make: I’m afraid of drones. Specifically, I’m afraid of what happens when in a few years drones gain significant autonomous capabilities as a dividend of the AI revolution, and bad people do awful shit with these capabilities. I’m concerned about this because while drones have a vast range of good uses (which massively outnumber the negative ones!), they are also fundamentally mobile robots, and mobile robots are, to some people, great weapons (e.g. ISIS use of modified DIY military drones in recent years). 

What am I doing about this worry? I’m closely tracking developments in drone sensing and moving capabilities to try and develop my intuitions about this sub-field of AI development, and whenever I speak to policymakers I advocate for large-scale investments into the ongoing measurement, analysis, forecasting, and benchmarking of various AI capabilities so as to direct public money towards positive uses and generate the data that can unlock funding for dealing with (potential) negative uses. One of the things that motivates me here is a belief that if we just develop decent intuitions about the shape of progress at the intersection of AI and drones, we’ll be able to get ahead of 95% of the bad stuff, and maximize our ability to benefit as a society from the technology. 

Now, researchers with the Beijing Institute of Technology have published (and released code for) ‘SlimYOLOv3’, a miniaturized version of the widely-used, very popular ‘YOLO’ (You Only Look Once) object recognition model. The difference between SlimYOLOv3 and YOLOv3 is simple: the Slim version is much, much smaller than the other, making it easier to deploy on small computational devices, like the chips that can fit onto most drones. Specifically, they use sparsity training to guide a subsequent pruning process which helps them chop out unneeded bits of the neural network, then they fine-tune the model, and iteratively repeat the process until they obtain a satisfactory loss. 
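
The recipe here is the familiar one from channel-pruning work: during training you add an L1 penalty on the batch-norm scale factors so unimportant channels shrink toward zero, then you prune the channels whose scales fall below a threshold and fine-tune what's left. Below is a minimal, self-contained sketch of that pruning criterion on a toy network; it is not the authors' code, and the penalty weight and pruning ratio are illustrative:

```python
import torch
import torch.nn as nn

# Toy conv-bn-relu model; SlimYOLOv3 applies the same idea to YOLOv3's layers.
model = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.Conv2d(32, 10, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
l1_weight = 1e-4  # illustrative sparsity penalty weight

for step in range(100):  # sparsity training on random data, for illustration
    x = torch.randn(8, 3, 32, 32)
    task_loss = model(x).mean()  # stand-in for the real detection loss
    # L1 penalty on batch-norm scale factors pushes unimportant channels to zero.
    l1 = sum(m.weight.abs().sum() for m in model.modules()
             if isinstance(m, nn.BatchNorm2d))
    loss = task_loss + l1_weight * l1
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Pruning step: keep only channels whose learned scale factor is large enough.
bn = model[1]
scales = bn.weight.detach().abs()
threshold = scales.quantile(0.5)  # e.g. prune the smallest 50% of channels
keep = (scales > threshold).nonzero().squeeze(1)
print(f"keeping {len(keep)}/{len(scales)} channels")
# In the real pipeline you would rebuild a narrower network using only the
# kept channels, fine-tune it, and repeat the whole procedure iteratively.
```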

So, how well does it work? “SlimYOLOv3 achieves compelling results compared with its unpruned counterpart: ~90.8% decrease of FLOPs, ~92% decline of parameter size, running ~2x faster and comparable detection accuracy as YOLOv3,” the authors write.
   They test out the system on the ‘VisDrone2018-Det’ dataset, which consists of ~7,000 drone-captured images containing any of ten predefined labelled objects (e.g., pedestrian, car, bicycle). They evaluate their SlimYOLOv3 system against an efficient YOLOv3 baseline, as well as a version of YOLOv3 augmented with spatial pyramid pooling (YOLOv3-SPP3). Variants of SlimYOLOv3 obtain scores that are around 10 absolute percentage points higher on evaluation criteria like Precision, Recall, and F1-score when compared against YOLOv3-tiny, while fitting in roughly the same computational envelope (8 million parameters, ~30MB model size). However, SlimYOLOv3 has a somewhat higher inference time than the less accurate YOLOv3-tiny. 

Be careful what you wish for: It’s notable that in March 2018 (Import AI #88), when YOLOv3 got released, the author anticipated its rapid diffusion, modification, and use: “What are we going to do with these detectors now that we have them? A lot of the people doing this research are at Google and Facebook. I guess at least we know the technology is in good hands and definitely won’t be used to harvest your personal information and sell it to…. wait, you’re saying that’s exactly what it will be used for?? Oh. Well the other people heavily funding vision research are the military and they’ve never done anything horrible like killing lots of people with new technology oh wait…”.

Why this matters: Technologies like SlimYOLOv3 will give drones better, more efficient perceptive capabilities, which will make it easier for researchers to deploy increasingly sophisticated autonomous and semi-autonomous systems onto drones. This is going to change the world massively and rapidly – we should pay attention to what is happening.
   Read more: SlimYOLOv3: Narrower, Faster, and Better for Real-Time UAV Applications (Arxiv).  

####################################################

Want to test and develop better commonsense AI systems? Try WINOGRANDE:
…From 273 Winograd questions to 40,000 WINOGRANDE ones. Plus, pre-training for commonsense!…
Researchers with the Allen Institute for AI and the University of Washington have released ‘WINOGRANDE’, a scaled-up version of the iconic Winograd Schema Challenge (WSC) test for AI systems. For those not familiar, the WSC is a challenge and dataset consisting of 273 problems that AI systems need to try and solve. 

   So, what’s a Winograd problem? Here’s an example:
   Problem: “Pete envies Martin because he is successful.”
   Question: Is ‘he’ Pete or Martin?
   Answer: Martin. 
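
To make the format concrete, here's a minimal sketch of how such a problem can be represented and scored as a binary fill-in-the-blank choice; the `score` function is a trivial placeholder standing in for whatever model (for example, a fine-tuned language model) you would actually use:

```python
# Minimal sketch of a Winograd/WINOGRANDE-style binary-choice problem.
# The scoring function below is a placeholder, not a real model.

problem = {
    "sentence": "Pete envies Martin because _ is successful.",
    "option1": "Pete",
    "option2": "Martin",
    "answer": "Martin",
}

def score(sentence: str) -> float:
    """Placeholder plausibility score; substitute a language model here."""
    return float(len(sentence))  # dummy heuristic, for structure only

def predict(p: dict) -> str:
    # Fill the blank with each candidate and pick the higher-scoring sentence.
    s1 = p["sentence"].replace("_", p["option1"])
    s2 = p["sentence"].replace("_", p["option2"])
    return p["option1"] if score(s1) > score(s2) else p["option2"]

print(predict(problem), "| correct:", predict(problem) == problem["answer"])
```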

These problems are difficult for computers because they typically require a combination of context, world knowledge, and symbolic reasoning to solve. Some people have used progress on WSC as a litmus test for broader progress in AI research. But one thing that has held WSC back has been the lack of data – however you slice it, 273 just isn’t very large. Another has been that, though the WSC questions were designed by experts to be challenging for AI systems, they still exhibit language-based and data-based biases that AI systems can exploit, solving the problems by uncovering these underlying (unintentional) statistical regularities. 

Enter WINOGRANDE: The new WINOGRANDE dataset has been designed to be free of these biases, while also being much larger; the dataset contains around 44k questions, developed through crowdsourcing. The researchers hope others will test systems against WINOGRANDE to develop smarter systems, and will also use the dataset as a pre-training resource for subsequent tasks (in tests, they show they can pre-train on WINOGRANDE to improve the state of the art on a range of other commonsense reasoning benchmarks, including WSC, PDP, DPR, and COPA). 

Why this matters: Datasets like WINOGRANDE help define the frontier of difficulty for some AI systems and can also serve as training inputs for other, larger models. Commonsense reasoning is one of the main examples people use when discussing the limitations of contemporary AI techniques, so WINOGRANDE could define a new challenge which, if solved, could tell us something important about the future of genuinely intelligent AI.
   Read more: WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale (Arxiv). 

####################################################

Positive uses of AI – A crowd-sourced list:
…AI isn’t all doom and gloom – it’s also changing the world for the better…
This year, I’ve been giving an occasional lecture to congressional staff at the Woodrow Wilson Center in Washington DC on AI, measurement, and geopolitics. The lecture is basically a Cliff’s Notes version of a lot of the central concerns of Import AI: the relationship between AI and compute; the geopolitical shifts caused by AI advances; what AI tells us about the (complicated!) future of 2019-era-capitalism; how to view AI as an ecosystem of differently resourced parties rather than simply as blobs of resources linked to specific nation states, and so on. 

Recently, I asked some of my wonderful friends on Twitter for recent examples of positive uses of AI that I could highlight in a lecture, in part to show the pace of development here, and the breadth of opportunity. I got a great response to my Tweet, so I’m including some of the responses here as breadcrumbs for others:

Read the original tweet and the rest of the responses here (@jackclarksf twitter)

####################################################

Tech Tales:

The Rich Vessel 

Every week, someone else gets to be the richest person in the world. It’ll never be me because I don’t have the implant, so they can’t port my brain over into The Vessel. But for 95% of the rest of the planet, it could be them. 

So what happens when you’re the richest person in the world? Pretty predictable things:

  • Lots of people choose to feed people.
  • Lots of people choose to house people.
  • Lots of people choose to donate wealth. 
  • Few people choose to flaunt wealth. 
  • Few people choose to use wealth to hurt others. 
  • Very few people try to use wealth to influence politics (and they fail, as policy takes years, and getting stuff done in a week requires an act of god combined with a one-in-a-million chance). 
  • Basically no one rejects the offer. 

Now here is the scary thing. Would it surprise you if I told you that, despite this experiment running for over a year now, the richest person in the world is still the richest person in the world – and getting richer? 

You see, it turns out when the richest person in the world announced they were ‘taking a step back’ and created The Vessel initiative there was an ulterior purpose. They weren’t trying to ‘share their wealth and life experience’, they were trying to make sure that their own estate was resilient to them changing their own ideology. The whole purpose of The Vessel project isn’t to enhance our understanding of each other, but is instead to give the Family Office and Lawyers and Consultants of the richest person in the world an ever-growing set of examples of all the decisions they need to be resilient to. 

After all, even if the richest person in the world woke up one day and wanted to give all their money away at once, that wouldn’t be the smartest move for them. They’d need to slow it down. Think more. Bring in the lawyers and consultants. Thanks to The Vessel, the collective intelligence of the world is discovering all the ways the world’s richest person could subvert the architectures of control they had built around themselves. 

Things that inspired this story: The habits of billionaires; Baudrillard; carceral architectures of bureaucracy and capital; brain-implants; societal stability; Gini coefficient; recipes to avoid revolution, recipes for trapping the world in amber until the sun melts it. 

Import AI 156: The 7,500 images that break image recognition systems; open source software for deleting objects from videos; and what it takes to do multilingual translation

Want 7,500 images designed to trick your object recognition system? Check out the ‘Natural Adversarial Examples’ dataset:
…Can your AI system deal with these naturally occurring optical illusions?…
Have you ever been fiddling in the kitchen and dropped an orange-colored ceramic knife into a pile of orange peels and temporarily lost it? I have! These kinds of visual puzzles can be confusing for humans, and are even more tricky for machines to deal with. Therefore, researchers with UC Berkeley, the University of Washington, and the University of Chicago have developed and released a dataset full of these ‘natural adversarial examples’, which should help researchers test the robustness of AI systems and develop more powerful ones. 

Imagenet-A: You can get the data as an ImageNet classifier test called ImageNet-A, which consists of around 7,500 images designed to confuse and frustrate modern image recognition systems.

How hard are ‘natural adversarial examples’? Extremely hard!
The researchers test out DenseNet-121 and ResNeXt-50 models on the dataset and show that both obtain an accuracy of less than 3% on ImageNet-A (compared to accuracies of 97%+ on standard ImageNet). Things don’t improve much when they train their systems with techniques designed to increase classifier robustness, finding that approaches like adversarial training, Stylized-ImageNet augmentation, and uncertainty metrics don’t help much here. 
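
As a rough sketch of what this kind of robustness test looks like in practice, here's how you might measure a pretrained classifier's top-1 accuracy over a folder of adversarial images with PyTorch. Note that ImageNet-A covers only a 200-class subset of ImageNet, so a real evaluation needs a mapping from the model's 1,000 outputs to those classes; that detail (and the exact folder layout) is glossed over here:

```python
import torch
from torchvision import datasets, models, transforms

# Standard ImageNet preprocessing for a pretrained classifier.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assumes an ImageFolder-style directory of test images, e.g. "imagenet-a/".
dataset = datasets.ImageFolder("imagenet-a", transform=preprocess)
loader = torch.utils.data.DataLoader(dataset, batch_size=32)

model = models.resnet50(pretrained=True).eval()

correct, total = 0, 0
with torch.no_grad():
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        # Real ImageNet-A evaluation must first map the 1,000 ImageNet logits
        # onto the dataset's 200 classes; omitted here for brevity.
        correct += (preds == labels).sum().item()
        total += labels.numel()

print(f"top-1 accuracy: {correct / total:.1%}")
```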

Why this matters: Being able to measure all the ways in which AI systems fail is a superpower, because such measurements can highlight the ways existing systems break and point researchers towards problems that can be worked on. I hope we’ll see more competitions that use datasets like this to test how resilient algorithms are to confounding examples.
   Read more: Natural Adversarial Examples (Arxiv).
   Get the code and the ‘IMAGENET-A’ dataset here (Natural Adversarial Examples GitHub)

####################################################

Computer, delete! Open source software for editing videos:
…AI is making video-editing much cheaper and more effective…
Ever wanted to pick a person or an animal or other object in a video and make it disappear? I’m sure the thought has struck some of you sometimes. Now, open source AI projects let you do just this: Video Object Removal is a new GitHub project that does what it says. The technology lets you draw a bounding box around an object in a video, and then the AI system will try to remove the object and inpaint the scene behind it. The software is based on two distinct technologies: Deep Video Inpainting, and Fast Online Object Tracking and Segmentation: A Unifying Approach. 

Why this matters: Media is going to change radically as a consequence of the proliferation of AI tools like this – get ready for a world where images and video are so easy to manipulate that they become just another paintbrush, and be prepared to disbelieve everything you see online.
   Get the code from the GitHub page here (GitHub)

####################################################

Breaking drones to let others make smarter drones:
…‘ALFA’ dataset gives researchers flight data for when things go wrong…
Researchers with the Robotics Institute at Carnegie Mellon University have released ALFA, a dataset containing flight data and telemetry from a model plane, including data when the plane breaks. ALFA will make it easier for people to assess how well fault-spotting and fault-remediation algorithms work when exposed to real world failures. 

ALFA consists of data for 47 autonomous flights with scenarios for eight different types of faults, including engine, rudder, and elevator errors. The data represents 66 minutes of normal flight and 13 minutes of post-fault flight time taking place over a mixture of fields and woodland near Pittsburgh, and there’s also a larger unprocessed dataset representing “several hours of raw autonomous, autopilot-assisted, and manual flight data with tens of different faults scenarios”. 
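
The dataset is aimed at exactly the kind of fault-detection algorithms you can prototype in a few lines. As a toy baseline (not anything from the paper), here's a rolling z-score detector that flags when a telemetry channel, such as engine RPM, drifts far from its recent behaviour; the signal and fault below are synthetic:

```python
import numpy as np

def rolling_zscore_faults(signal: np.ndarray, window: int = 50,
                          threshold: float = 4.0) -> np.ndarray:
    """Flag samples that deviate strongly from the recent rolling statistics."""
    flags = np.zeros(len(signal), dtype=bool)
    for i in range(window, len(signal)):
        recent = signal[i - window:i]
        mu, sigma = recent.mean(), recent.std() + 1e-8
        flags[i] = abs(signal[i] - mu) / sigma > threshold
    return flags

# Synthetic example: steady 'engine RPM' telemetry with a fault injected at t=400.
rng = np.random.default_rng(0)
rpm = 5000 + rng.normal(0, 20, size=600)
rpm[400:] -= 1500  # simulated engine failure

faults = rolling_zscore_faults(rpm)
print("first flagged sample:", int(np.argmax(faults)))  # ~400
```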

Hardware: To collect the dataset, the researchers used a modified Carbon Z T-28 model plane, equipped with an onboard Nvidia Jetson TX2 computer, and running ‘Pixhawk’ autopilot software modified so that the researchers can remotely break the plane, generating the failure data. 

Why this matters: Science tends to spend more time and resources inventing things and making forward progress on problems, rather than breaking things and casting a skeptical eye on recent events (mostly); datasets like ALFA make it easier for people to study failures, which will ultimately make it easier to develop more robust systems.
   Read more: ALFA: A Dataset for UAV Fault and Anomaly Detection (Arxiv).
   Get the ALFA data here (AIR Lab Failure and Anomaly (ALFA) Dataset website).

####################################################

How far are we from training a single AI system to translate between all languages?
…Study involves 25 billion parallel sentences across 103 languages…
How good are modern multilingual machine learning-based translation systems – that is, systems which can translate between a multitude of different languages, typically via using the same massive trained model? A new study from Google – which it says may be the largest ever conducted of its kind – analyzes the performance of these systems in the wild. 

Data: For the study, the researchers evaluate “a massive open-domain dataset containing over 25 billion parallel sentences in 103 languages” using a large-scale machine translation system. They think that “this is the largest multilingual NMT system to date, in terms of the amount of training data and number of languages considered at the same time”. The datasets are distributed somewhat unevenly, though, reflecting the differing levels of documentation available for different languages. “The number of parallel sentences per language in our corpus ranges from around tens of thousands to almost 2 billion”, they write; there is a discrepancy of almost 5 orders of magnitude between the languages with the greatest and smallest amounts of data in the corpus. Google generated this data by crawling and extracting parallel sentences from the web, it writes. 

Desirable features: An excellent multilingual translation system should have the following properties, according to the researchers:

  • Maximum throughput in terms of number of languages considered within a single model. 
  • Positive transfer towards low-resource languages. 
  • Minimum interference (negative transfer) for high-resource languages. 
  • Models that perform well in “realistic, open-domain settings”.

When More Data Does Not Equal Better Data: One of the main findings of the study is the difficulty of training large models on such varied datasets, the researchers write. “In a large multi-task setting, high resource tasks are starved for capacity while low resource tasks benefit significantly from transfer, and the extent of interference and transfer are strongly related.” They develop some sampling techniques to train models to be more resilient to this, but find this involves its own tradeoffs between large-data and small-data languages as well. In many ways, the complexity of the task of large-scale machine translation seems to hide subtle difficulties: “Performance degrades for all language pairs, especially the high and medium resource ones, as the number of tasks grows”, they write. 
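
One standard trick in this line of work is temperature-based data sampling: instead of sampling training examples in proportion to each language pair's data size, you flatten the distribution with a temperature parameter so low-resource languages are seen more often without completely starving high-resource ones. A minimal sketch, with made-up dataset sizes:

```python
import numpy as np

# Hypothetical per-language-pair dataset sizes (number of parallel sentences).
sizes = np.array([2e9, 5e7, 1e6, 4e4], dtype=float)

def sampling_probs(sizes: np.ndarray, temperature: float) -> np.ndarray:
    """Sample language pairs proportionally to size^(1/T).

    T=1 reproduces the data distribution (high-resource languages dominate);
    larger T flattens it toward uniform, up-weighting low-resource languages.
    """
    p = sizes ** (1.0 / temperature)
    return p / p.sum()

for T in (1, 5, 100):
    print(f"T={T}:", np.round(sampling_probs(sizes, T), 4))
```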

Scale: To improve performance, the researchers test out three variants of the ‘Transformer’ model, training a small model (400 million parameters), a wide 12-layer model (1.3 billion parameters), and a deep 24-layer model (1.3 billion parameters); the deep model demonstrates superior performance and “does not overfit in low resource languages”, suggesting that model capacity has a significant impact on performance. 

Why this matters: Studies like this point us to a world where we train a translation model so large and so capable that it seems equivalent to the Babelfish from The Hitchhiker’s Guide to the Galaxy – a universal translation system, capable of taking concepts from one language and translating them to another then decoding it into the relevant target language. It’s also fascinating to think about what kinds of cognitive capabilities such models might develop – translation is hard, and to do a good job you need to be able to port concepts between languages, as well as just carefully translating words.
   Read more: Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges (Arxiv)

####################################################

Tech Tales:

[Underground mixed-use AI-Human living complex, Earth, 2050] 

Leaving Breakdown City 

It was a citywide bug. A bad one. Came in off of some off-zoned industrial code cleaning facilities. I guess something leaked. It made its way in through some utility exchange data centers and it spread from there. The first sign was the smell – suddenly, all the retail-zoned streets got thick with the smell of bacon and of perfume and of citrus – that was the tell, the chemical synthesizers going haywire. Things spread after that. 

I went and got Sandy while this was going on. She was working in the hospital and when I got there she was wheeling out some spiderweb-covered analog medical systems, putting patients onto equipment without chips, and pulling the smart equipment out of the most at-risk patients. The floor was slick with hand sanitizer, from all the machines on the walls deciding to void themselves at once. 

C’mon, I said to her.
Just help me hook this up, she said.
We plugged a machine into a patient and unplugged the electronics. The patient whispered ‘thank you’.
You’re so welcome, Sandy said. I’ll be back, don’t forget your pills. 

We left the hospital and we headed for one of the elevators. We could smell flowers and meat and smoke and it was a real rush to run together, noses thick with scent, as the world ended behind us. We made it to an elevator and turned its electronics over to analog, then used the chunky, mechanical controls to set it on an upward trajectory. We’d come out topside and head to the next town over, hope that the isolation systems had kicked in to halt the spread. 

We watched the madness spread from robot to robot, floor to floor, system to sub-system. We held hands and looked at the city as we rose up, and we saw:

  • Garbage trucks backing into hospitals. 
  • One street cleaning robot chasing a couple of smaller robots. 
  • Main roads that were completely empty and small roads that were completely jammed. 
  • Factories producing products which flow out of the factory and into the street on delivery robots, which then take them to over-stocked stores, leaving the boxes outside. 
  • Other elevators shivering up and down shafts, way too fast, taking in products and robots and spitting them out elsewhere for other reasons.

Things that inspired this story: Brain damage; brain surgery, a 50/50 chance of being able to speak properly after aforementioned surgery (not mine!); the Internet of Things; the Internet of Shit; computer viruses, noir novels, an ambition to write dollar-store AI fiction. 

 

Import AI 155: Mastering robots with the ‘DRIVE’ dataset; facial recognition for monkeys; and why AI development is a collective action problem.

Chinese company seeks smarter robots with ‘DRIVE’ dataset:
…Crowds? Check. Trashcan-sized robots? Check. A challenging navigation and mapping benchmark? Check…
Researchers with Chinese robot company Segway Robotics Inc have developed the ‘DRIVE’ dataset and benchmark, which is designed to help researchers develop smarter delivery robots. 

   The company did the research because it wants to encourage research in an area relevant to its business, and because of larger macroeconomic trends: “The online shopping and on-demand food delivery market in China has been growing at a rate of 30%-50% per year, leading to labor shortage and rising delivery cost,” the researchers write. “Delivery robots have the potential to solve the dilemma caused by the growing consumer demand and decreasing delivery workforce.” 

Robots! Each Segway robot used to gather the dataset is equipped with a RealSense visual inertial sensor, two wheel encoders, and a Hokuyo 2D lidar. 

The DRIVE dataset: The dataset consists of 100 movement sequences across five different indoor locations, and was collected by robots over the course of one year. It is designed to be extremely challenging, and incorporates the following confounding factors and traits:

  • Commodity, aka cheap, inertial measurement units.
  • Busy: The gathered data includes scenes with many moving people and objects, which can break brittle AI systems.
  • Similar, similar: Some of the environments are superficially similar to each other, which could trigger misclassification. Additionally, some places in the environments lack texture or include numerous reflections and shadows, making it harder for robots to visually analyze their surroundings, and some have bumpy or rough surfaces.
  • Hurry up and wait: Some sequences include long stretches in which the robot is stationary (which makes it difficult to estimate depth), while at other times the robots perform rapid rotations (which can lead to motion blur and wheels slipping on the ground). 

Why this matters: Datasets unlock AI progress, letting large numbers of people work together on shared challenges. Additionally, the creation of a dataset usually implies specific business and research priorities, so the arrival of things like the DRIVE benchmark points to broader maturation in smart, mobile robots.
   Read more: Segway DRIVE Benchmark: Place Recognition and SLAM Data Collected by A Fleet of Delivery Robots (Arxiv).
   Find out more about the benchmark here (Segway DRIVE website).

####################################################

You’ve heard of face identification. What about Primate face identification?
…Towards a future where we automatically scan and surveil the world around us…
Researchers with the Indraprastha Institute of Information Technology Delhi and the Wildlife Institute of India have teamed up to develop a system capable of identifying monkeys in the wild and have linked this to a crowd-sourced app, letting the “general public, professional monkey catchers and field biologists” crowd source images of monkeys for training larger, smarter models. 

Why do this? Monkeys are a bit of a nuisance in Indian urban and semi-urban environments, the researchers write, so they have designed the system to use data captured ‘in the wild’, helping people build systems to surveil and analyze primates in challenging contexts. “Typically, we expect the images to be captured in uncontrolled outdoor scenarios, leading to significant variations in facial pose and lighting”. 

Datasets: 

  • Rhesus Macaque Dataset: 7679 images / 93 individuals. 
  • Chimpanzee Dataset: 7166 images / 90 primates. Pictures span good quality images from a Zoo, as well as uncontrolled images from a national park.

Results: The system outperforms a variety of baselines and sets a new state of the art across four validation scores, typically via a greater than 2 point absolute increase in performance, and sometimes via as much as a 6 or greater point increase. Their system is trained with a couple of different loss functions designed to capture smaller geometric features across faces, making the model more robust across multiple data distributions. 

Why this matters: This research is an indication of how as AI has matured we’ve started to see it being used as a kind of general-purpose utility, with researchers mixing and matching different techniques and datasets, making slight tweaks, and solving tasks for socially relevant applications. It’s particularly interesting to see this approach integrated with a crowd sourced app, pointing to a future where populations are able to collaboratively measure, analyze, and quantify the world around them.
   Read more: Primate Face Identification in the Wild (Arxiv)

####################################################

What Recursion’s big dataset release means for drug discovery:
…RxRx1 dataset designed to encourage “machine learning on large biological datasets to impact drug discovery and development”…
Recursion Pharmaceuticals, a company that uses AI for drug discovery, has released RxRx1, a 296GB dataset consisting of 125,510 images across 1,108 classes; an ImageNet-scale dataset, except instead of containing pictures of cats and dogs it contains pictures of human cells, to help scientists train AI systems to observe patterns across them, and generate insights for drug development. 

The challenges of biology: Biological datasets can be challenging for image recognition algorithms due to variation across cell samples, and other factors present during data sampling, such as temperature, humidity, reagent concentration and so on. RxRx1 contains data from 51 instances of the same experiment, which should help scientists develop algorithms that are robust to the changes across experiments, and are thus able to learn underlying patterns in the data.

What parts of AI research could RxRx1 help with? Recursion has three main ideas:

  • Generalization: The dataset is useful for refining techniques like transfer learning and domain adaptation.
  • Context Modeling: Each RxRx1 image ships with detailed metadata, so researchers can experiment with this as an additional form of signal. 
  • Computer Vision: RxRx1 “presents a very different data distribution than is found in most publicly available imaging datasets,” Recursion writes. “These differences include the relative independence of many of the channels (unlike RGB images) and the fact that each example is one of a population of objects treated similarly as opposed to singletons.” (A minimal sketch of handling those extra channels follows below.) 
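
As a concrete example of the ‘unlike RGB images’ point: fluorescence microscopy images like these have more channels than a standard photo, so an ImageNet-pretrained vision model can't ingest them directly. One common workaround, sketched below assuming six-channel inputs (the channel count here is illustrative), is to swap the first convolution for one with more input channels while keeping the rest of the pretrained network:

```python
import torch
import torch.nn as nn
from torchvision import models

# Sketch: adapt an ImageNet-pretrained ResNet to multi-channel cell images.
num_channels = 6  # assumed number of fluorescence channels, for illustration

model = models.resnet50(pretrained=True)
old_conv = model.conv1  # original first layer expects 3 input channels

new_conv = nn.Conv2d(num_channels, old_conv.out_channels,
                     kernel_size=old_conv.kernel_size,
                     stride=old_conv.stride,
                     padding=old_conv.padding,
                     bias=False)
with torch.no_grad():
    # One simple initialization: tile the mean of the pretrained RGB filters
    # across the extra channels, so pretrained features are roughly preserved.
    mean_filter = old_conv.weight.mean(dim=1, keepdim=True)
    new_conv.weight.copy_(mean_filter.repeat(1, num_channels, 1, 1))

model.conv1 = new_conv
model.fc = nn.Linear(model.fc.in_features, 1108)  # 1,108 RxRx1 classes

x = torch.randn(2, num_channels, 224, 224)  # dummy batch
print(model(x).shape)  # torch.Size([2, 1108])
```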

Why this matters: We’re entering an era where people will start to employ large-scale machine learning to revolutionize medicine; tracking usage of datasets like RxRx1 and the results of a planned NeurIPS 2019 competition will help give us a sense of progress here and what it might mean for medicine and drug design.
   Read more: RxRx1 official website (RxRx.ai).

####################################################

Why AI could leave people with disabilities behind:
…Think bias is a problem now? Wait until systems are deployed more widely…
Researchers with Microsoft and the Human-Computer Interaction Institute at Carnegie Mellon University have outlined, in a position paper, how people with disabilities could be left behind by AI advances: they could have trouble accessing the benefits of AI systems due to issues of fairness and bias inherent to machine learning. To deal with some of these issues, the researchers propose a research agenda to help remedy these shortcomings in AI systems. The agenda contains four key activities:

  • Identify ways in which inclusion issues for people with disabilities could impact AI systems
  • Test inclusion hypotheses to understand failure scenarios
  • Create benchmark datasets to support replication and inclusion
  • Develop new modeling, bias mitigation, and error measurement techniques 

It’s all about representation: So, how might we expect AI systems to fail for people with disabilities? The authors survey current systems and provide some ideas. Spoiler alert: Mostly, these systems will fail to work for people with disabilities because they will have been designed by people who are neither disabled, nor are educated about the needs of people with disabilities.

  • Computer Vision: It’s likely that facial recognition will not work well for people with differences in facial features and expressions (e.g., people with Down’s syndrome) not anticipated by system designers; face recognition could also fail for blind people, who may have differences in eye anatomy or be wearing medical or cosmetic aids. For similar reasons, we can expect systems designed to recognize certain body types to fail for some people. Additionally, object/scene/text recognition systems are likely to break more frequently for poorly sighted people, as the pictures they take are very different from those taken by sighted people. 
  • Speech Systems: Speech recognition systems won’t work well for people who have speech disabilities; we may also need more granular metrics beyond things like Word Error Rate to best model how well systems work for different people. Similarly, speaker analysis systems will need to be trained with different datasets to accurately hear people with disabilities. 
  • Text Analysis: These systems will need to be designed to correct for errors that emerge under certain disabilities (for instance, dyslexia), and will need to account for people who write in different emotional registers than typical users. 

Why this matters: AI is an accelerant and a magnifier of whatever context it is deployed in due to the scale at which it operates, the number of automatic judgements it makes, and the increasingly comprehensive deployment of AI-based techniques across society. Therefore, if we don’t think very carefully about how AI may or may not ‘see’ or ‘understand’ certain types of people, we could harm people or cut them off from accessing its benefits. (On the – extremely minor – plus side, this research suggests that people with disabilities may be harder to surveil than other people, for now.)
   Read more: Toward Fairness in AI for People with Disabilities: A Research Roadmap (Arxiv)

####################################################

‘Visus’ software provides quality assurance for model training:
…Now that models can design themselves, we need software to manage this…
Researchers with New York University have developed Visus, software that makes it easier for people to build models, evolve models, and manage the associated data processing pipelines needed to train them. It’s a tool that represents the broader industrialization of the AI community, and prefigures larger uses of ML across society.

What is Visus? The software gives AI developers a software interface that lets them define a problem, explore summaries of the input dataset, augment the data, and then explore and compare different models according to their performance scores and prediction outputs. The software is presented via a nicely designed user interface, making it more approachable than tools solely accessible via the command line. 

What can it do? What can’t it do! Visus is ‘kitchen sink software’, in the sense that it contains a vast number of features for tasks like exploratory data analysis, problem specification, data augmentation, model generation and selection, and confirmatory data analysis. 

Example use case: The researchers outline a hypothetical example where the New York City Department of Transportation uses Visus to figure out policies it can enact to reduce traffic fatalities. Here, they’d use Visus first to analyze the dataset about traffic collisions, then select a variable in the dataset (for instance, number of collisions) that they’d want to predict, then ask Visus to perform a model search (otherwise known as ‘AutoML’), where it tries to find appropriate machine learning models to achieve the objective. Once it comes up with models, the user can also try to augment the underlying dataset, and then iterate on model design and selection again. 
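
To give a sense of what ‘model search’ means in practice, here's a minimal AutoML-flavoured sketch (not Visus itself) that tries a few scikit-learn regressors on a toy collision-count dataset and keeps the one with the best cross-validated score:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Toy stand-in for a traffic-collisions table: features could be things like
# traffic volume, average speed, and number of intersections per district.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)  # collisions

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

# Naive model search: score every candidate and keep the best one.
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in candidates.items()}
best = max(results, key=results.get)
print(results)
print("selected model:", best)
```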

Why this matters: Systems like ‘Visus’ are part of the industrialization of AI, as they take a bunch of incredibly complicated things like data augmentation and model design and analysis, then port it into more user-friendly software packages that broaden the number of people able to use such systems. This is like shifting away from artisanal individualized production to repeatable, system-based production. The outcome of adoption of tools like Visus will be more people using more AI systems across society – which will further change society.
   Read more: Visus: An Interactive System for Automatic Machine Learning Model Building and Curation (Arxiv).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

Collective action problems for safe AI:
In many industries, profit-seeking firms are incentivised to invest in product safety. This is generally because they have internalised the costs of safety failures via regulation, liability, and consumer behaviour. Consider the cost to a car manufacturer of a critical safety failure – they will have to recall the product, they will be liable to fines and litigation, and they will suffer reputational damage. AI firms are subject to these incentives, but they appear to be weaker. Their products are difficult for manufacturers, consumers, and regulators to assess for safety; it is difficult to construct effective regulation; and many of the potential harms might be hard to internalise.

Competition: Another special feature about AI development is the possibility of discontinuous and/or very rapid progress. If firms believe this, they likely believe that there are significant payoffs to the first firm to make a particular breakthrough or to ‘pull ahead’ from competitors. This increases the costs of investing in safety, by increasing the expected benefits of faster development. This assumption may not hold true, which would make the situation more benign, but it is important to consider what this ‘worst-case’ scenario would mean for responsible development.

Cooperation: A simple model of this problem is a two-player game, where two firms face a decision to cooperate (maintain some level of investment in safety) or defect (fail to maintain this level). This allows us to see factors that can increase the likelihood of cooperation, by making it rational for each firm to do so: high trust that others will cooperate; shared upside from mutual cooperation; shared downside from mutual defection; smaller benefits to not reciprocating cooperation; and lower costs to unreciprocated cooperation.
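
To make the game-theoretic framing concrete, here's a tiny sketch with hypothetical payoffs: each firm chooses to cooperate (keep investing in safety) or defect, and an outcome is stable when neither firm gains by unilaterally switching:

```python
# Two-player safety game with illustrative (made-up) payoffs.
# Keys: (firm A's action, firm B's action) -> (A's payoff, B's payoff)
payoffs = {
    ("cooperate", "cooperate"): (3, 3),   # shared upside from mutual cooperation
    ("cooperate", "defect"):    (0, 4),   # cost of unreciprocated cooperation
    ("defect",    "cooperate"): (4, 0),   # benefit of not reciprocating
    ("defect",    "defect"):    (1, 1),   # shared downside from mutual defection
}

def is_equilibrium(a: str, b: str) -> bool:
    """True if neither firm can improve its payoff by switching unilaterally."""
    actions = ("cooperate", "defect")
    a_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in actions)
    b_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in actions)
    return a_ok and b_ok

for a, b in payoffs:
    print(a, b, "equilibrium" if is_equilibrium(a, b) else "")
# With these payoffs, mutual defection is the only stable outcome (a prisoner's
# dilemma); raising the rewards to cooperation or the costs of defection (for
# example via trust-building or shared oversight) changes which outcomes are stable.
```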

Four strategies: This analysis can help identify strategies for increasing cooperation on responsible development: dispelling incorrect beliefs about responsible AI development; promoting inter-firm collaboration on projects; opening AI development to appropriate oversight and feedback; and creating stronger incentives to safe practices.
   Read more: The Role of Cooperation in Responsible AI Development (arXiv).
   Read more: Why Responsible AI Development Needs Cooperation on Safety (OpenAI Blog).
   Further thoughts on the project from corresponding author Dr Amanda Askell (Twitter).

####################################################

Tech Tales: 

Stop-Start Computing

And so after the Climate Accords and the Generational Crime Rulings and the loss of some 20% of the world’s land surface to a combination of heat and/or flooding, after all of this society carried on. We were hot and we were sick and there were way too many of us, but we carried on. 

We kept on moving little and big chunks of mass around on planet earth, and as we moved this stuff we mixed up the atmosphere and the underground and we changed our air and made it worse, but we carried on. 

And all through this we used our computers. We used our phones to watch old movies of ‘the times before’. We listened to music from prior decades. We played games in which the planet was covered in forests, or where we were neanderthals playing with axes in a kind of wilderness, or ones where we rode out into space and managed vast interstellar armies. Our simulations and our software and our entertainment got better and better and so we used it more and more, and we carried on. 

Everything has a breaking point. At some point computers started using so much energy that even with central planning and the imposition of controls, electrical utilities couldn’t keep up. Thirty percent of the electricity in some countries went to computers. In some smaller countries based around high-tech services, it was even higher. Data centers found themselves periodically running on backup generators – old salvaged WW2 diesel engines from submarines – and sometimes the power ran out entirely and these big computer cathedrals stood idle, mute blocks surrounded by farmland or forest or high-altitude steppes and deserts. 

So after we hit our limit we created the coins as part of the Centrally Managed Sustainable Compute Initiative. We were meant to call them ‘compute tokens’ but everyone called them coins, and we were meant to call the computation power we exchanged these coins for the Shared Societal Computer but everyone just called it the timeshare. 

So now here’s how it works: 

  • If you’re poor, you use a coin and you access lumps of computation and storage, rationed out according to the complex interplay of heat and consumption and climate. 
  • If you’re rich, you spend extra for Premium Compute Credits. 
  • If you’re ultrarich, you build yourself a powerplant or better yet something renewable – geothermal or wind or solar. Then you build your facility and you use that computation for yourself. 

Private data centers will be outlawed soon, people say. There’s talk of using all of the compute left in the world to save the world – something about simulating the impossible complexity of the earth, and finding a way to carry on. 

Things that inspired this story: The energy consumption of Bitcoin and large-scale AI models; climate change; inevitability.

Import AI 154: Teaching computers how to plan; DeepNude is where dual-use meets pornography; and what happens when we test machine translation systems on real-world data

Can computers learn to plan? Stanford researchers think so:
…Turns out being able to plan is similar to figuring out where you are and where you’ve been…
Researchers with Stanford University have developed a system that can watch instructional videos on YouTube and learn to look at the start and end of a new video then figure out the appropriate order of actions to take to transition from beginning to end.

What’s so hard about this? The real world involves such a vast combinatorial set of possibilities that traditional planning approaches (mostly) aren’t able to scale to work within it. “One can imagine an indefinitely growing semantic state space, which prevents the application of classical symbolic planning approaches that require a given set of predicates for a well-defined state space”. To get around this, they instead try to learn everything in a latent space, essentially slurping in reality and turning it into features, which they then use to map actions and observations into sequences, helping them figure out a plan.

Two models to learn the latent space:
   The system that derives the latent space and the transformations within it has two main components:

  • A transition model, which predicts the next state based on the current state and action.
  • A conjugate constraint model which maps current actions to past actions.

   The full model takes in a video and essentially learns the transitions between states by sliding these two models along through time toward the desired goal state, sampling actions and then learning the next state. 

Two approaches to planning: The researchers experiment with two planning approaches, both of which rely on the features mined by the main system. One approach tries to map current and goal observations into a latent space while also mapping actions to prior actions, then samples from different actions to use to solve its task. The other approach is called ‘walkthrough planning’ and outputs the visual observations between the current and goal state; this is a less direct approach as it doesn’t output actions, but could serve as a useful reward signal for another system. 
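
The authors' models are learned from video, but the planning loop itself is easy to illustrate. Here's a toy random-shooting planner in a made-up latent space: a stand-in ‘transition model’ predicts the next latent state given a state and an action, and we keep whichever sampled action sequence ends closest to the goal. Everything below (the linear dynamics, dimensions, horizon) is illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
state_dim, action_dim, horizon, n_samples = 8, 4, 5, 256

# Stand-in for a learned transition model f(state, action) -> next state.
A = rng.normal(scale=0.3, size=(state_dim, state_dim))
B = rng.normal(scale=0.3, size=(state_dim, action_dim))
def transition(state, action):
    return state + A @ state * 0.1 + B @ action

start = rng.normal(size=state_dim)   # latent encoding of the first frame
goal = rng.normal(size=state_dim)    # latent encoding of the goal frame

# Random-shooting planning: sample many action sequences, roll each one out
# through the transition model, and keep the one that ends nearest the goal.
best_actions, best_dist = None, np.inf
for _ in range(n_samples):
    actions = rng.normal(size=(horizon, action_dim))
    state = start
    for a in actions:
        state = transition(state, a)
    dist = np.linalg.norm(state - goal)
    if dist < best_dist:
        best_actions, best_dist = actions, dist

print("best final distance to goal:", round(float(best_dist), 3))
```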

Dataset: For this work, they use the CrossTask instructional video dataset, which is a compilation of videos showing 83 different tasks, involving things like grilling steak, making pancakes, changing a tire, and so on.

Testing: Spoiler alert – this kind of task is extremely hard, so get ready for some stay-in-your-chair results. In tests, the researchers find their system using the traditional planning approach obtains an accuracy of around 31.29%, with an overall success rate of 12.18%. This compares to a prior state-of-the-art of 24.39% accuracy and 2.89% success rate for ‘Universal Planning Networks’ (Import AI #90). (Note: UPN is the closest thing to compare to, but has some subtle differences making a direct comparison difficult). They show that the same system when using walkthrough planning can significantly improve scores over prior state-of-the-art systems as well – “our full model is able to plan the correct order for all video clips”, they write, compared to baselines which typically fail. 

Why this matters: We’re starting to see AI systems that use the big, learnable engines used in deep learning research as part of more deliberately structured systems to tackle specific tasks, like learning transitions and plans for video walkthroughs. Planning is an essential part of AI, and being able to learn plans and disentangle plans from actions (and learn appropriate associations) is an inherently complex task; progress here can give us a better sense for progress in the field of AI.
   Read more: Procedure Planning in Instructional Videos (Arxiv)

####################################################

DeepNude: Dual Use concerns meet Pornography; trouble ensues:
…Rock, meet hard place…
What would a person look like without their clothes? That’s something people can imagine fairly easily, but has been difficult for AI systems. That is, until we developed a whole bunch of recent systems capable of modeling data distributions and generating synthetic versions of said data; these techniques contributed to the rise of things like ‘deepfakes’ which let people superimpose the face of one person on that of another in a video. Recently, someone took this a step further with a software tool called DeepNude, which automatically removes the clothes of people (predominantly women) in photos, rendering synthetic nude images of them. 

Blowback, phase one: The initial DeepNude blowback centered on the dubious motivation for the project and the immense likelihood of the software being used to troll, harass, and abuse women. Coverage in Vice led to such outcry from the community that the creator of DeepNude took the application down – but not before others had implemented the same capabilities in other software and distributed it around the web. 

Rapid proliferation makes norms difficult: Just a couple of days after taking the app down, the creator posted the code of the application to GitHub, saying that because the DeepNude application had already been replicated widely, there was no purpose in keeping the original code private, so they published it online. 

Why this matters: DeepNude is an illustration of the larger issues inherent to increasingly powerful AI systems; these things have got really powerful and can be used in a variety of different applications and are also, perhaps unintuitively, relatively easy to program and put together once you have some pre-trained networks lying around (and the norms of publication mean this is always the case). How we figure out new norms around development and publication of such technology will have a significant influence on what happens in society, and if we’re not careful we could enable more things like DeepNude.
   Read the statement justifying code release: Official DeepNude Algorithm (DeepNude GitHub).
   Read more: This Horrifying App Undresses a Photo of any Woman With a Single Click (Vice). (A special ImportAI shoutout to Samantha Cole, the journalist behind this story; Samantha was the first journalist to cover deepfakes back in 2017 and has been on this beat doing detailed work for a while. Worth a follow!)

####################################################

Have no pity for robots? Watch these self-driving cars try to tackle San Francisco:
A short video from Cruise, a self-driving car service owned by General Motors, shows how its cars can now deal with double-parked cars in San Francisco, California.
    Check out the video here (official Cruise Twitter).

####################################################

Think AI services are consistent across cloud providers? Think again:
…Study identifies significant differences in AI inferences made by Google, Amazon, and Microsoft…
Different AI cloud providers have different capabilities, and these under-documented differences could cause problems for software developers, according to research from computer science researchers with Deakin University and Monash University in Australia. In a study, they explore the differences between image labeling AI services from Amazon (“AWS Rekognition”), Google (“Google Cloud Vision”) and Microsoft (“Azure Computer Vision”). The researchers try to work out if “computer vision services, as they currently stand, offer consistent behavior, and if not, how is this conveyed to developers (if it is at all)?”

Developers may not realize that services can vary from cloud provider to provider, the researchers write; this is because if you look at the underlying storage and compute systems across major cloud providers like Microsoft or Amazon or Google you find that they’re very comparable, whereas differences in the quality of AI services are much less easy to work out from product descriptions. (For instance, one basic example is the labels services output when classifying objects; one service may describe a dog as both a ‘collie’ and a ‘border collie’, while another may use just one (or none) of these labels, etc.) 

Datasets and study length: The authors used three datasets to evaluate the services; two self-developed ones – a small one containing 30 images and a large one containing 1,650 images, and a public dataset called COCOVal17, which contains 5,000 images. The study took place over 11 months and had two main experimental phases: a 13-week period from April to August 2018 and a 17-week period from November 2018 to March 2019. 

Methodology: They test the cloud services for six traits: the consistency of the top label assigned to an image from each service; the ‘semantic consistency’ of multiple labels returned by the same service; the confidence level of each service’s top label prediction; the consistency of these confidence intervals across multiple services; the consistency of the top label over time (aka, does it change); and the consistency of the top label’s confidence over time. 
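
The consistency checks themselves are straightforward to compute once the responses are saved. As a sketch (with hypothetical recorded data, not the study's), here's how you might measure top-label agreement between services and top-label stability of a single service over time:

```python
from itertools import combinations

# Hypothetical recorded top labels: responses[service][image_id] is a list of
# the top label returned on successive weekly queries of the same image.
responses = {
    "service_a": {"img1": ["dog", "dog", "dog"],
                  "img2": ["banana", "banana", "fruit"]},
    "service_b": {"img1": ["collie", "collie", "collie"],
                  "img2": ["banana", "banana", "banana"]},
}

def cross_service_agreement(responses: dict) -> float:
    """Fraction of (image, service pair) cases where the first top labels match."""
    hits, total = 0, 0
    for s1, s2 in combinations(responses, 2):
        for img in responses[s1]:
            hits += responses[s1][img][0] == responses[s2][img][0]
            total += 1
    return hits / total

def temporal_stability(responses: dict) -> dict:
    """Per service, fraction of images whose top label never changed over time."""
    scores = {}
    for service, images in responses.items():
        stable = [len(set(labels)) == 1 for labels in images.values()]
        scores[service] = sum(stable) / len(stable)
    return scores

print("agreement across services:", cross_service_agreement(responses))
print("stability over time:", temporal_stability(responses))
```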

Three main discoveries: The paper generates evidence for three concerning traits in clouds, which are:

  • Computer vision services do not respond with consistent outputs between services, given the same input image. 
  • Outputs from computer vision services are non-deterministic and evolving, and the same service can change its top-most response over time given the same input image. 
  • Computer vision services do not effectively communicate this evolution and instability, introducing risk into engineering these systems. 

Why this matters: Commercial AI systems can be non-repeatable and non-reliable, and this study shows that multiple AI systems developed by different providers can be even more inconsistent with one another over time. This is going to be a challenging issue, as it makes it easier for developers to get ‘locked in’ to the specific capabilities of a single service, and also makes application portability difficult. Additionally, these issues will make it harder for people to build AI services that are composed out of multiple distinct AI services from different clouds, as these systems will not have predictable performance capabilities.
   Read more: Losing Confidence in Quality: Unspoken Evolution of Computer Vision Services (Arxiv).

####################################################

Stealing people’s skeletons with deep learning:
…XNect lets researchers do real-time multi-person pose estimation via a single RGB camera…
How do you use a single camera to track multiple people and their pose as they move around? That’s a question being worked on by researchers with the Max Planck Institute for Informatics, EPFL, and Saarland University. They try to solve this problem via a neural network architecture that encodes and decodes poses of people, which is also implemented efficiently enough to run in real-time from a single camera feed. The system uses two networks; one which focuses on learning to reason about individual body joints, and another which tries to jointly reason about all body joints. 

Special components for better performance: Like some bits of AI research, this work takes a bunch of known-good stuff, and then pushes it forward on a task-specific dimension. Here, they develop a convolutional neural network architecture called SelecSLS Net, which “employs selective long and short range concatenation-skip connections to promote information flow across network layers which allows to use fewer features leading to a much faster inference time but comparable accuracy in comparison to ResNet-50”. 

Real-time performance: Most of the work here has involved increasing the efficiency of the system so it can process footage from video cameras in real-time (when running on an NVIDIA GTX 1080Ti and a Xeon E5). In terms of performance, the system marginally outperforms a more standard system that uses a typical residual network, while being far more efficient when it comes to runtime. 

Why this matters: It’s becoming trivial for computers to look at people, model each of them as a wireframe skeleton, and then compute over that. This is a classic omni-use capability; we could imagine such a system being used to automatically port people into simulated virtual worlds, or to plug them into a large-scale surveillance system to analyze their body movements and characterize the behavior of the crowd. How society deals with the challenges of such a multi-purpose technology remain to be seen.
   Read more: XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera (Arxiv).

####################################################

Think network design is hard? Try it where every network point is a drone:
…Researchers show how to build dynamic networks out of patrolling drones…
Researchers with Alpen-Adria-Universitat Klagenfurt, Austria, have developed “a novel collaborative data delivery approach where UAVs transport data in a store-and-forward fashion”. What this means is they develop a system that automatically plans the flight paths of fleets of drones so that the drones at the front of the formation periodically overlap in communication range with UAVs behind them, which then overlap in communication range with other, even more distant UAVs. The essential idea behind the research is to use fast drone-to-drone communications systems to hoover up data via exploration drones at the limits of a formation, then squirt this data back to a base station via the drones themselves. The next step for the research is to use “more sophisticated scheduling of UAVs to minimize the number of idle UAVs (that do neither sensing nor transporting data) at each time step”. 

Why this matters: Drones are going to let people form ad-hoc computation and storage systems, and approaches like this suggest the shape of numerous ‘flying internets’ that we could imagine in the future.
   Read more: Persistent Multi-UAV Surveillance with Data Latency Constraints (Arxiv).

####################################################

Pushing machine translation systems to the limit with real, messy data:
…Machine translation robustness competition shows what it takes to work in the real world…
Researchers from Facebook AI Research, Carnegie Mellon University, Harvard University, MIT, the Qatar Computing Research Institute, Google, and Johns Hopkins University, have published the results of the “first shared task on machine translation robustness”. The goal of this task is to give people better intuitions about how well machine translation models deal with “orthographic variations, grammatical errors, and other linguistic phenomena common in user-generated content”. 

Competitions, what are they good for? The researchers hope that systems which do well at this task will use better modelling, training and adaptation techniques, or may learn from large amounts of unlabeled data. And indeed, entered systems did use a variety of additional techniques to increase their performance, such as data cleaning, data augmentation, fine-tuning, ensembles of models, and more. 

Datasets: The datasets were “collected from Reddit, filtered out for noisy comments using a sub-word language modeling criterion and translated by professional translators”.

Results: Because this competition explores robustness, it’s perhaps less meaningful to focus on the quantitative results than to discuss the trends seen among the entries. Some of the main things the organizers observed: stronger submissions were typically stronger across the board; out-of-domain generalization is important (so systems need to be able to deal with words they haven’t seen before); being able to accurately model upper- and lower-case text, as well as the use of special characters, is useful; and it can be difficult to learn to translate sentences written in slang. 

Why this matters: Competitions like this give us a better sense of the real-world progress of AI systems, helping us understand what it takes to build systems that work over real data, as opposed to highly-constrained or specifically structured test sets.
   Read more: Findings of the First Shared Task on Machine Translation Robustness (Arxiv).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

Axon Ethics Board – no face recognition on police body cameras:
Axon, who make technologies for law enforcement, established an AI Ethics Board back in 2018 to look at the ethical implications of their products. The board has just released their first report, looking at ethical issues surrounding face recognition, particularly on police body cameras—Axon’s core product.

The board: Axon was an early mover in establishing an AI ethics board. The board’s members are drawn from law enforcement, civil rights groups, policy, academia, and tech. Among the lessons learned, the Board emphasizes the importance of being involved at an early stage in product development (ideally before the design stage), so that it can suggest changes before making them becomes too costly for the company.

Six major conclusions:
  (1) Face recognition technology is currently not reliable enough to justify use on body cameras. Far greater accuracy and equal performance across different populations are needed before deployment.
  (2) In assessing face recognition algorithms, it is important to separate false positive and false negative rates. There are real trade-offs between the two, and which matters more depends on the use case (see the threshold sketch after this list). For example, when searching for a missing person, more false positives might be a cost worth bearing to minimize false negatives, whereas in enforcement scenarios it might be more important to minimize false positives, given the potential harms from police interacting with innocent people on mistaken information.
  (3) To prevent misuse, the Board does not endorse the development of face recognition technology that can be completely customised by users. Preventing misuse requires technological controls by product manufacturers, but will increasingly also require government regulation.
  (4) No jurisdiction should adopt the technology without going through transparent, democratic processes. At present, big decisions affecting the public are being made by law enforcement alone, e.g. whether to include driver’s license photos in face databases.
  (5) Development of products should be premised on evidence-based (and not merely theoretical) benefits.
  (6) When assessing costs and benefits of potential use cases, one must take into account the realities of policing in particular jurisdictions, and technological limitations.
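
To make the trade-off in point (2) concrete, here is a minimal sketch (not from the Axon report; the scores and labels are made-up values) of how moving a face-match decision threshold trades false positives against false negatives.

# Minimal sketch: a single decision threshold trades false positives
# against false negatives. The scores and labels are illustrative values.
def error_rates(scores, labels, threshold):
    """scores: match confidences in [0, 1]; labels: 1 = same person, 0 = different people."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp / labels.count(0), fn / labels.count(1)

scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

for t in (0.5, 0.65, 0.85):
    fpr, fnr = error_rates(scores, labels, t)
    print(f"threshold={t:.2f}  false positive rate={fpr:.2f}  false negative rate={fnr:.2f}")
# Raising the threshold flags fewer innocent people (fewer false positives)
# but misses more genuine matches (more false negatives).
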
  Read more: First Report of the Axon AI & Policing Ethics Board (Axon).
  Read more: Press release (Axon).

####################################################

NIST releases plan on AI standards:
The White House’s executive order on AI, released in February, included an instruction for NIST to make “a plan for Federal engagement in the development of technical standards and related tools in support of reliable, robust, and trustworthy systems that use AI technologies.” NIST has released a draft plan and is accepting public input until July 19, before delivering a final document in August.

Recommendations: NIST recommends that the government “bolster AI standards-related knowledge, leadership, and coordination among federal agencies; promote focused research on the ‘trustworthiness’ of AI; support and expand public-private partnerships; and engage with international parties.”

Why it matters: The US is keen to lead international efforts in standards-setting. Historically, international standards have governed policy externalities in cybersecurity, sustainability, and safety. Given the challenges of trust and coordinating safe practices in AI development and deployment, standards-setting could play an important role.
  Read more: U.S. Leadership in AI: a Plan for Federal Engagement in Developing Technical Standards and Related Tools (NIST).

####################################################

Tech tales

Dreamworld versus Reality versus Government

After the traceable content accords were enacted people changed how they approached themselves – nude photos aren’t so fun if you know your camera is cryptographically signing them and tying them to you then uploading that information to some vast database hosted by a company or a state. 

The same thing happened for a lot of memes and meme-fodder: it’s not obviously a good idea to record yourself downing ten beers on an amusement park ride if you’re subsequently going to pursue a career in politics, nor does it seem like a smart thing to participate in overtly political pranks if you think you might pursue a career in law enforcement. 

The internet got… quiet? It was still full of noise and commotion and discussion, but the edge had been taken off a little. Of course, when we lost the edge we lost a lot of pain: it’s harder to produce terrorist content if it is traced back to your phone or camera or whatever, and it’s harder for other people to fake as much of it when it stops being, as they say, a ‘desirable media target’.

It didn’t take long for people to figure out a workaround: artificial intelligence. Specifically, using large generative models to create images and, later, audio, and even later after that, videos, which could synthesize the things they wanted to create or record, but couldn’t send or do anymore. Teens started sending each other impressionistic, smeared videos of teen-like creatures doing teen-like pranks. Someone invented some software called U.S.A which stood for Universal Sex Avatar and teens started sending each other ‘AIelfies’ (pronounced elfeez) which showed nude-like human-like things doing sexual-like stuff. Even the terrorists got involved and started pumping out propaganda that was procedural and generative. 

Now the internet has two layers: the reality-layer and what people have taken to calling the dreamworld. In the reality-layer things are ever-more controlled and people conduct themselves knowing that what they do will be knowable and identifiable most-likely forever; everyone’s a politician, essentially. In the dreamworld, people experiment with themselves, and everyone has a few illicit channels on their messaging apps through which they let people send them dreamworld content, and through which they can anonymously and non-anonymously send their own visions into the world. 

The intelligence agencies are trying to learn about the dreamworld, people say. Knowing the difference between what known individuals publicly present and what the ghostly mass of civilization illicitly sends to itself is a valuable thing, say certain sour-faced people who are responsible for terrible tools that ward off more terrible things. “The difference between presented self and imagined self is where identity resides,” says one of them in a no-phone presentation to other sour-faced people. “If we can learn how society chooses to separate the two, perhaps we can identify the character of our society. If we can do that, we can change the character.”

And so the terrible slow engines are working now, chewing through our dreamworld, invisible to us, though we grow increasingly aware of them. Where shall we go next, we wonder? What manifestation shall our individuality take next?

Things that inspired this story: Generative adversarial networks; DeepNude; DeepFakes; underground communities; private messaging infrastructures; the conversion of all of physical reality into digital simulacra.