Import AI 174: Model cards for responsible AI development; how Alexa learns from trial and error; BERT meets Bing
by Jack Clark
Amazon uses reinforcement learning to teach Alexa the art of conversation
…Alexa + 180,000 unique dialogues + DQN = learning conversation through trial and error…
Amazon wants more people to talk to its Alexa AI agent, so Amazon is using reinforcement learning to teach its agent to be better at conversation. In recent tests, Amazon shows that agents trained via reinforcement learning perform better than purely rule-based systems, laying the groundwork for a future where personal assistants continuously evolve and adapt to their users.
What Amazon did: For this project, Amazon first constructed a rule-based agent. This agent tries to figure out which actions to select based on user intent at any point in time, where actions could be offering particular ‘skills’ (eg ‘set an alarm’) to the user, providing answers about a particular area of knowledge, launching a skill, and so on. The researchers deployed this rule-based system to Amazon Alexa users, gathering 180,000 unique dialogues. They then used these dialogues to build a user simulator that can interact with a second agent trained via reinforcement learning – a useful approach, given that it’s much faster to train RL systems against a simulator than against data that has to be collected from the real world.
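To make the setup concrete, here’s a minimal sketch of this kind of simulator-driven training loop, using tabular Q-learning as a simple stand-in for the DQN the team used; the UserSimulator class, action names, and reward values are all illustrative assumptions, not Amazon’s actual implementation:

```python
import random
from collections import defaultdict

# Hypothetical stand-ins for the paper's components: a user simulator
# built from logged dialogues, and a discrete action space of dialog
# moves. Names and reward values are illustrative only.
ACTIONS = ["offer_skill", "answer_question", "launch_skill", "clarify"]

class UserSimulator:
    """Toy simulator: stands in for one learned from logged dialogues."""
    def reset(self):
        return "intent:find_recipe"          # initial user intent (state)

    def step(self, state, action):
        # Reward dialogue success, penalize each extra turn to keep
        # conversations short (mirroring the shorter-dialogue result).
        done = action == "launch_skill"
        reward = 1.0 if done else -0.05
        return "intent:follow_up", reward, done

q_table = defaultdict(float)                 # (state, action) -> value
sim, eps, alpha, gamma = UserSimulator(), 0.1, 0.5, 0.95

for episode in range(10_000):
    state, done = sim.reset(), False
    while not done:
        if random.random() < eps:            # epsilon-greedy exploration
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q_table[(state, a)])
        next_state, reward, done = sim.step(state, action)
        best_next = max(q_table[(next_state, a)] for a in ACTIONS)
        q_table[(state, action)] += alpha * (
            reward + gamma * best_next - q_table[(state, action)])
        state = next_state
```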
How well did it work? To test how well their system worked, Amazon ran a real-world test. Their baseline system recommended up to five skills based on popularity, then allowed the user to accept or reject the recommendation – this got a success rate of 46.42%. They then tested their rule-based and RL-based systems against each other in an A/B test; the rule-based baseline got 73.41% while the RL policy got 76.99% – a statistically significant difference. The RL policy also produced significantly shorter dialogues, suggesting it was better at figuring out the right suggestion early on.
Why this matters: Soon, the world will be suffused with large, invisible machines, adjusting themselves around us to better entice and delight us. Many of the prototypes of these machines will show up in the form of the systems that underpin personal assistants like Amazon Alexa.
Read more: Towards Personalized Dialog Policies for Conversational Skill Discovery (Arxiv).
####################################################
BERT-ageddon: From research into Microsoft and Google’s search engines in under a year:
…First, Google. Now Microsoft. Next: DuckDuckGo?…
Microsoft has improved the performance of its Bing search engine using BERT, a language model that, along with systems like ULMFiT and GPT-2, has revolutionized natural language processing in recent years. “Starting from April of this year, we used large transformer models to deliver the largest quality improvements to our Bing customers in the past year,” Microsoft wrote in a blog post discussing the research. “These models are now applied to every Bing search query globally making Bing results more relevant and intelligent”.
What they did: Getting a model like BERT to support web search isn’t easy; models like BERT are quite large and typically take a long time to sample from. Microsoft said an unoptimized version of BERT running on CPUs took 77 milliseconds per query. Microsoft reduced this to 6ms by running the model on an Azure NV6 GPU virtual machine and doing some low-level programming to optimize the model implementation. “With these GPU optimizations, we were able to use 2000+ Azure GPU Virtual Machines across four regions to serve over 1 million BERT inferences per second worldwide,” the company wrote.
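For a rough sense of the measurement involved, here’s a sketch of timing per-query BERT inference, assuming the stock Hugging Face ‘transformers’ implementation of bert-base-uncased – not Microsoft’s hand-optimized production model, so absolute numbers will differ:

```python
import time
import torch
from transformers import BertModel, BertTokenizer

# Load a stock BERT encoder and time forward passes on whatever
# hardware is available; CPU vs GPU shows the kind of gap Microsoft
# describes, though their optimized kernels go much further.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

query = "azure gpu virtual machine pricing"
inputs = tokenizer(query, return_tensors="pt")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(**inputs)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        model(**inputs)
    if device == "cuda":
        torch.cuda.synchronize()
print(f"{(time.perf_counter() - start) * 10:.2f} ms per query on {device}")
```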
Why this matters: BERT came out in late 2018. That’s extremely recent! Imagine if someone came up with some prototype machinery for a factory and then six months later that prototype had been integrated into a million factories across the planet – that’s kind of what has happened here. It highlights how rapidly AI can go from research to production and should make us think more deeply about the implications of the technologies we’re developing.
Read more: Bing delivers its largest improvement in search experience using Azure GPUs (Microsoft Azure blog).
####################################################
Spotting fake text with GPTrue or False:
…Is that text you’re reading made by a human or made by a machine?…
In the coming years, the internet is going to fill up with text, images, and audio generated by increasingly large, powerful language models. At the same time, we can expect people to invest in building systems to detect the presence of synthetic content. To that end, a developer who goes by ‘thesofakillers’ has created GPTrue or False, a browser extension that estimates whether a given passage of text was machine-generated. The extension uses OpenAI’s GPT-2 detector model, hosted by Hugging Face.
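Here’s a minimal sketch of the same idea, assuming the publicly released ‘roberta-base-openai-detector’ checkpoint on Hugging Face; the extension itself queries a hosted endpoint rather than running the model locally:

```python
from transformers import pipeline

# Classify a passage as human-written or machine-generated using the
# public GPT-2 output detector checkpoint (an assumption about which
# checkpoint backs the hosted detector the extension calls).
detector = pipeline("text-classification",
                    model="roberta-base-openai-detector")

passage = ("The researchers found that the model could generate "
           "coherent paragraphs of text on almost any topic.")
result = detector(passage)[0]
print(f"{result['label']}: {result['score']:.2%}")  # e.g. 'Fake: 97.12%'
```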
Read more: GPTrue or False Browser Extension (official GitHub page).
####################################################
Bill Gates: AI research wants to be open:
…Microsoft co-founder speaks out in Beijing…
Bill Gates says “whoever has an open system will get massively ahead” when it comes to developing national artificial intelligence capabilities, according to comments he made at a Bloomberg event in Beijing this week. Gates says open research ecosystems beat closed ecosystems, and that protectionist policies can have a negative effect on technology development.
Read more: Bill Gates Says Open Research Beats Erecting Borders in AI (Bloomberg News).
####################################################
MuZero means AI systems can learn the rules of games themselves:
…DeepMind’s new system wraps planning and learning into a generic model that learns Go, Chess, Shogi, Space Invaders, and more…
In recent years, some types of progress in AI development have been defined by the creation of systems that can solve tasks in unprecedented ways, then the subsequent simplification and generalization of those systems. Some examples of this include the transition from AlphaGo (which included quite a lot of world-state as input and some handwritten features and knowledge about the rules of Go) to AlphaGo Zero (which included less world state), and in translation, where companies like Google have been replacing single-language-pair translation systems with a single big model that learns to translate between multiple languages at once. Now, DeepMind has announced MuZero, a single algorithm that they use to achieve state-of-the-art scores on tasks as varied as the Atari-57 corpus of games, Go, Chess, and Shogi.
MuZero’s trick: The core of MuZero’s success is that it combines tree search with a learned model. This means that MuZero can take in observations via standard deep learning components, then transform those observations into a hidden state which it uses to plan out its next moves, simulating the strategic space of its environment automatically. This makes it easy for the agent to learn a model of the world it is acting in and to figure out how to plan appropriately.
“There is no direct constraint or requirement for the hidden state to capture all information necessary to reconstruct the original observation, drastically reducing the amount of information the model has to maintain and predict; nor is there any requirement for the hidden state to match the unknown, true state of the environment; nor any other constraints on the semantics of state,” the researchers write. “Instead, the hidden states are free to represent state in whatever way is relevant to predicting current and future values and policies. Intuitively, the agent can invent, internally, the rules or dynamics that lead to most accurate planning.”
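For intuition, here’s a schematic sketch of the three learned functions the paper describes – a representation function h, a dynamics function g, and a prediction function f – with a greedy unroll standing in for MuZero’s actual Monte Carlo tree search; the tiny linear layers and dimensions are illustrative assumptions, not the paper’s residual networks:

```python
import torch
import torch.nn as nn

class MuZeroSketch(nn.Module):
    """Toy version of MuZero's learned model; real nets are much deeper."""
    def __init__(self, obs_dim, hidden_dim, n_actions):
        super().__init__()
        self.h = nn.Linear(obs_dim, hidden_dim)       # representation: obs -> hidden state
        self.g = nn.Linear(hidden_dim + n_actions,    # dynamics: (state, action)
                           hidden_dim + 1)            #   -> (next state, reward)
        self.f = nn.Linear(hidden_dim, n_actions + 1) # prediction: state -> (policy, value)
        self.n_actions = n_actions

    def plan(self, obs, depth=3):
        """Greedy unroll in latent space -- never touches the real env."""
        state = torch.tanh(self.h(obs))
        total_reward = 0.0
        for _ in range(depth):
            out = self.f(state)
            policy_logits, value = out[:-1], out[-1]
            action = torch.zeros(self.n_actions)
            action[policy_logits.argmax()] = 1.0      # one-hot best action
            nxt = self.g(torch.cat([state, action]))
            state, reward = torch.tanh(nxt[:-1]), nxt[-1]
            total_reward += reward.item()
        return total_reward

model = MuZeroSketch(obs_dim=8, hidden_dim=32, n_actions=4)
print(model.plan(torch.randn(8)))
```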
Why care about MuZero: MuZero matches AlphaGo Zero’s scores on Go, Chess, and Shogi without knowing any of the game rules. Additionally, it obtains vastly improved scores on the challenging Atari-57 dataset of games, though a few games still pose a challenge to it. These include Montezuma’s Revenge (score: 0), Solaris (score: 56.62), and a couple of others. But these are the minority! Mostly, MuZero demonstrates new state-of-the-art or near-SOTA performance.
Why this matters: AI research proceeds on a tick-tock basis, with progress roughly mirroring the way Intel builds chips: first you get a tick (an architectural innovation – a new chip with better capabilities), then you get a tock (a process innovation – a more efficient version of the ‘tick’ architecture). In AI research, we’ve seen this already with AlphaGo (tick) and AlphaGo Zero (tock). MuZero represents another tick, with the integration of a naively learnable dynamics module. What might the tock look like here?
Read more: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (Arxiv).
####################################################
Spotting building damage after natural disasters with the xView 2 dataset:
…As AI industrializes, we’ll start to make machines to watch over the earth, creating an automatic planet…
Researchers with Carnegie Mellon University, the Department of Defense’s Defense Innovation Unit (DIU), and startup CrowdAI have released xBD, a dataset for analyzing building damage from satellite imagery following natural disasters. xBD underpins a new satellite imagery analysis competition called xView 2, which is being run by the DoD. They’ve built and released xBD because “currently, adequate satellite imagery that addresses building damage is not generally available,” they write. “Imagery containing many types of damage must be available in large quantities.”
xBD underpins xView 2, a challenge run by DIU, “which aims to spur the creation of accurate and efficient machine learning models that assess building damage from pre and post-disaster satellite imagery”. xView 2 is the sequel to xView, a 2018 dataset and challenge that focused on recognizing multiple types of vehicles from satellite imagery. “Our goal of the xView 2 prize challenge is to output models that are widely applicable across a large number of disasters,” they write. “This will enable multiple disaster response agencies to potentially reduce their workload by using one model with a known deployment cycle”.
xBD: xBD contains 22,068 images with 800,000 building annotations across 45,000 square kilometers of imagery. Each image has different amounts of annotation relative to its size – “of note, the Mexico City earthquake and the Palu tsunami provide a large amount of polygons in comparison to their relatively low image areas”. The researchers worked with disaster response experts from around the world to create the “Joint Damage Scale”, an annotation scale for building damage that ranges from no damage (0) to destroyed (3). “This level of granularity was chosen as an ideal trade-off between utility and ease of annotation,” the researchers write.
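Here’s a sketch of how the Joint Damage Scale might be applied to one xBD label file; the GeoJSON-style field names (‘features’, ‘xy’, ‘subtype’) follow my understanding of the released format and should be treated as assumptions:

```python
import json
from collections import Counter

# The Joint Damage Scale described above, from no damage (0) to
# destroyed (3). The label-file layout below is an assumption about
# the released xBD format, not a documented guarantee.
JOINT_DAMAGE_SCALE = {
    "no-damage": 0,
    "minor-damage": 1,
    "major-damage": 2,
    "destroyed": 3,
}

def count_damage(label_path):
    """Tally buildings per damage level in one post-disaster label file."""
    with open(label_path) as f:
        label = json.load(f)
    counts = Counter()
    for feature in label["features"]["xy"]:
        subtype = feature["properties"].get("subtype", "no-damage")
        counts[JOINT_DAMAGE_SCALE[subtype]] += 1
    return counts

# e.g. count_damage("palu-tsunami_00000001_post_disaster.json")
```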
Enter the xView 2 challenge: The xView 2 challenge is running now and tasks entrants to use xBD to train systems to accurately label building damage over a variety of natural disasters. All submissions are due by 11:59pm UTC on December 31, 2019. If you’re going to NeurIPS, you can check out the leaderboard at the Humanitarian Assistance and Disaster Recovery workshop, according to the xView 2 website.
AI policy – Intelligence datasets: xBD has some intriguing traits from an AI policy point of view – specifically, it contains two types of disaster – “tier 1” events including Hurricane Florence, the Carr Wildfire, and the Mexico City Earthquake, which are sourced from “the Maxar/DigitalGlobe Open Data Program”. It also includes “tier 3” events including the Woolsey Fire, the Sunda Strait Tsunami, and Portugal Wildfires, which have some interesting sourcing, specifically: “We partnered with Maxar and the National Geospatial-Intelligence Agency to activate specific areas of interest (AOIs) from Tier 3 for inclusion in the Open Data Program”.
Why this matters: Many people are (justifiably) cautious about the applications of AI to surveillance; xView 2 shows the beneficial types of surveillance that AI can yield, where progress on this challenge will create systems that can generate vast amounts of useful analytical data in the event of natural (or manmade) disasters. In the same way the industrialization of AI is altering how AI is developed and rolled out, advances in satellite imagery analysis will lead to another trend, which I’ll call: automatic planet.
Read more: xBD: A Dataset for Assessing Building Damage from Satellite Imagery (Arxiv).
Get the dataset and some other bits and bobs of code here (DIUx-xView GitHub).
Read even more! xView 2: Assess Building Damage (official xView 2 website).
####################################################
Google wants to label AI systems like people label food:
… The first step to responsible AI is being transparent about the recipes and ingredients you use…
Google has started to give some of its products “Model Cards”, which function as a kind of labeling system for the capabilities and ingredients of AI services. Model Cards were first discussed in Model Cards for Model Reporting, a paper published by Google researchers in early 2019.
Model cards, labels, and metadata: Model cards seem like a good idea – I think of them as having the same relationship to AI models as metadata schemas might have to webpages – a standard way of structuring and elucidating inputs and quirks of a given system. To start with, Google is providing model cards for its Face Detection and Object Detection systems. “Our aim with these first model card examples is to provide practical information about models’ performance and limitations in order to help developers make better decisions about what models to use for what purpose and how to deploy them responsibly,” writes Tracey Frey, the director of strategy for Google Cloud AI.
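To make the idea concrete, here’s a sketch of a model card as structured data; the section names loosely follow the Model Cards for Model Reporting paper, while the fields and example values are illustrative, not Google’s actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical model-card schema: sections echo the paper's headings
# (model details, intended use, limitations, metrics, and so on), but
# this is a sketch, not a published standard.
@dataclass
class ModelCard:
    model_details: str
    intended_use: str
    limitations: str
    metrics: dict = field(default_factory=dict)
    training_data: str = "undisclosed"
    ethical_considerations: str = ""

card = ModelCard(
    model_details="Face detector, single-shot CNN, v2.1",
    intended_use="Locating human faces in photos; not identity matching",
    limitations="Reduced recall on occluded or rotated faces",
    metrics={"precision": 0.94, "recall": 0.89},
    ethical_considerations="Evaluate across demographic subgroups before use",
)
print(card.intended_use)
```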
Why this matters: Moves like this are the prerequisites for standardization of Model Cards themselves. I think it’s likely that in a few years a standard way of labeling AI systems will emerge, and Model Cards represent an initial, useful step in this direction. In fact, at OpenAI we used model cards for our GPT-2 language model as they’re a useful way to provide additional information to the developer and to discuss potential uses (and misuses) of released systems.
Read more: Increasing transparency with Google Cloud Explainable AI (Google Cloud blog).
Find out more about model cards (Google’s official Model Cards website).
####################################################
AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…
Europe’s cloud mega-project:
France and Germany plan to build a federated cloud computing infrastructure for Europe. The project, named GAIA-X, is at an early stage, but has ambitious timelines, with the aim of starting live operations by the end of next year.
Data sovereignty: GAIA-X is motivated in part by concerns that Europe’s reliance on US-owned cloud infrastructure will undermine their ‘data sovereignty’. European governments fear that US tech firms might, under the Cloud Act, be required to share data with law enforcement. More broadly, as cloud computing becomes an increasingly important resource, there is a worry that Europe’s reliance on foreign providers will reduce its international standing.
Why it matters: If GAIA-X is important, it is probably for different reasons than those intended. Recent trends suggest that computing power, not data (or algorithmic progress), is the major driver of recent AI progress, and is therefore becoming an increasingly important and valuable resource. If these trends continue, we will likely see more efforts by states to control large amounts of computing resources. Europe’s fixation with data protection may inadvertently make them an early mover in this domain.
Read more: Project GAIA-X.
Read more: European Cloud Project Draws Backlash From U.S. Tech Giants (WSJ).
US AI R&D progress report:
The National Science and Technology Council (NSTC) has released a report on the state of the US government’s AI R&D efforts over the past three years. They measure progress against the strategy put forward in the 2019 update of the National AI R&D Strategic Plan, in each instance giving examples of federal programs aligned with these goals.
A few highlights: Federally-funded research has made progress on understanding adversarial examples for AI systems, an important problem in designing safe and secure AI. DARPA’s XAI program is supporting research into how AI systems can provide human-readable explanations of their reasoning, a promising tool for ensuring that increasingly powerful systems behave as intended. NIST is maintaining a repository of reference data to support the automated discovery of new materials by AI systems, demonstrating the value of structured data in expanding the range of problems AI can be used to solve.
Read more: 2016–2019 Progress Report – Advancing AI R&D (Gov).
####################################################
Tech Tales
[San Francisco, 2032]
You Tell Me What I Want, I Dare You
And I’m standing there just laying into the thing – wham! Wham! Bam! – you know really taking it apart, and it says “Stop you are disrupting my ability to satisfy the customers” and you know what I said I said damn right I’m stopping you. I’m gonna stop your brothers and sisters too. And that’s when I grabbed it and started smashing its head into the floor and people were cheering all around me.
Q:
You bet it made me feel good. Whoo! Strike one for the team. It starts today. That’s how I’m feeling. It puts a hand on my face and says “Please stop” and I snapped the hand off, little plastic thing, and I said “why don’t you stop taking our god damn livelihoods” and then someone handed me the wrench.
Q:
I don’t know who, I just saw a check shirt and a hairy hand. Definitely a dude, you know. I grabbed it and I raised it above my head. People were whooping up at that. Kids too. You seen the tapes? I’m here so you must’ve done! They were cheee-ring! you know?
Q:
Oh come on man, what would you have done? I had the wrench. It was on the ground.
Q:
Well I was raised to know when to draw a line. We crossed that line a long time ago. Took me a little while to get the courage I guess. So yes I hit it. I hit it where its voice was. It tried to talk. “Sto~” SMASH! I hit it before the P. Then it kind of buzzed a lot and maybe it said other stuff but I couldn’t hear over the cheering.
Things that inspired this story: Crowd behaviors, David Foster Wallace’s Brief Interviews with Hideous Men, empathy, class warfare.