Import AI #83: Cloning voices with a few audio samples, why malicious actors might mess with AI, and the industry/academia compute gap.

by Jack Clark

### IMPENDING PROBLEM KLAXON ###
Preparing for Malicious Uses of AI:
…Bad things happen when good people unwittingly release AI platforms that bad people can modify to turn good AIs into bad AIs…
AI, particularly deep learning, is a technology of such obvious power and utility that it seems likely malicious actors will pervert the technology and use it in ways it wasn’t intended. That has happened to basically every other significant technology of note: axes can be used to chop down trees or cut off heads, electricity can light a home or electrocute a person, a lab bench can be used to construct cures or poisons, and so on. But AI has some other characteristics that make it particularly dangerous: it’s, to use a phrase Rodney Brooks has used in the past to describe robots, “fast, cheap, and out of control”; today’s AI systems run on generic hardware, are mostly embodied in open source software, and are seeing capabilities increase according to underlying algorithmic and compute progress, both of which are happening in the open. That means the technology holds the possibility of doing immense good in the world as well as doing immense harm – and currently the AI community is broadly making everything available in the open, which seems somewhat acceptable today but probably unacceptable in the future given a few cranks more of Moore’s Law combined with algorithmic progression.
  Omni-Use Alert: AI is more than a ‘dual-use’ technology, it’s an omni-use technology. That means that figuring out how to regulate it to prevent bad people from doing bad things with it is (mostly) a non-starter. Instead, we need to explore new governance regimes, community norms, standards on information sharing, and so on.
  101 Pages of Problems: If you’re interested in taking a deeper look at this issue check out this report which a bunch of people (including me) spent the last year working on: The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation (Arxiv). You can also check out a summary via this OpenAI blog post about the report. I’m hoping to broaden the discussion of Omni-Use AI in the coming months and will be trying to host events and workshops relating to this question. If you want to chat with me about it, then please get in touch. We have a limited window of time to act as a community before dangerous things start happening – let’s get to work.

Baidu clones voices with few samples:
…Don’t worry about the omni-use concerns… yet…
Baidu Research has trained an AI that can listen to a small number of samples of a single person’s voice and then use that information to condition a network to sound like that person. This form of ‘adaptation’ is potentially very powerful, especially when trying to create AI services that work for multiple users with multiple accents, but it’s also somewhat frightening: if it gets much better it will utterly compromise our trust in the aural domain. However, the ability of the system to clone speech today still leaves much to be desired, with the best-performing systems requiring a hundred distinct voice samples and still sounding like a troll speaking from the bottom of a well, so we’ve got a few more compute turns yet before we run into real problems – but they’re coming.
  What it means: Techniques like this bring closer the day when a person can say something into a compromised device, have their voice recorded by a malicious actor, and have that sample be used to train new text-to-speech systems to say completely new things. Once that era arrives, the whole notion of “trust” in audio samples of a person’s voice will completely change, causing normal people to worry about these sorts of things as well as state-based intelligence organizations.
  Results: To get a good idea of the results, listen to the samples on this web page here (Voice Cloning: Baidu).
  Read more: Neural Voice Cloning with a Few Samples (Baidu Blog).
  Read more: Neural Voice Cloning with a Few Samples (Arxiv).
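The adaptation idea described above can be sketched in a few lines: a speaker encoder turns each short audio sample into a fixed-size embedding, the embeddings are averaged into a single speaker vector, and a multi-speaker text-to-speech model is conditioned on that vector. Everything here – the toy encoder, the 8-dimensional embedding, and the function names – is an illustrative stand-in, not Baidu’s actual architecture.

```python
# Toy sketch of few-shot speaker adaptation for voice cloning.
# The encoder and synthesizer are hand-coded stand-ins for trained
# neural networks; only the overall flow mirrors the paper's idea.

EMBED_DIM = 8  # illustrative embedding size, not Baidu's

def speaker_encoder(audio_sample):
    """Stand-in for a trained encoder: fold a waveform into a fixed vector."""
    vec = [0.0] * EMBED_DIM
    for i, x in enumerate(audio_sample):
        vec[i % EMBED_DIM] += x
    n = max(1, len(audio_sample))
    return [v / n for v in vec]

def clone_speaker(samples):
    """Average per-sample embeddings into a single speaker embedding."""
    total = [0.0] * EMBED_DIM
    for s in samples:
        for i, v in enumerate(speaker_encoder(s)):
            total[i] += v
    return [t / len(samples) for t in total]

def synthesize(text, speaker_embedding):
    """Stand-in for a TTS model conditioned on the speaker embedding."""
    return {"text": text, "speaker": speaker_embedding}

# A handful of short samples is enough to build the speaker vector,
# which can then condition synthesis of words the speaker never said.
samples = [[0.1, 0.3, -0.2, 0.5], [0.2, 0.1, 0.0, 0.4]]
voice = clone_speaker(samples)
utterance = synthesize("words the speaker never said", voice)
```

The worrying property is visible even in the toy: once the speaker vector exists, the text input is completely decoupled from anything the speaker actually recorded.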

Why robots in the future could be used as speedbumps for pedestrians:
…Researchers show how people slow down in the presence of patrolling robots…
Researchers with the Department of Electrical and Computer Engineering at the Stevens Institute of Technology in Hoboken, New Jersey, have examined how crowds of people react to robots. Their research is a study of “passive Human Robot Interaction (HRI) in an exit corridor for the purpose of robot-assisted pedestrian flow regulation.”
  The results: “Our experimental results show that in an exit corridor environment, a robot moving in a direction perpendicular to that of the uni-directional pedestrian flow can slow down the uni-directional flow, and the faster the robot moves, the lower the average pedestrian velocity becomes. Furthermore, the effect of the robot on the pedestrian velocity is more significant when people walk at a faster speed,” they write. In other words: pedestrians will avoid a dumb robot moving right in front of them.
  Methods: To conduct the experiment, the researchers used a customized ‘Adept Pioneer P3-DX mobile robot’ which was programmed to move at various speeds perpendicular to the pedestrian flow direction. To collect data, they outfitted a room with five Microsoft Kinect 3D sensors along with pedestrian detection and tracking via OpenPTrack.
  What it means: As robots become cheaper thanks to a proliferation of low-cost sensors and hardware platforms it’s likely that people will deploy more of them into the real world. Figuring out how to have very dumb, non-reactive robots do useful things will further drive adoption of these technologies and yield increasing economies of scale, further lowering the cost of the hardware platform and increasing the spread of the technology. Based on this research, you can probably look forward to a future where airports and transit systems are thronged with robots shuttling to and fro across crowded routes, exerting implicit crowd-speed-control through thick-as-a-brick automation.
  Read more: Pedestrian-Robot Interaction Experiments in an Exit Corridor (Arxiv).

Why your next self-driving car could be sent to you with the help of reinforcement learning:
…Researchers with Chinese ride-hailing giant Didi Chuxing simulate and benchmark RL algorithms for strategic car assignment…
Researchers from Chinese ride-hailing giant Didi Chuxing and Michigan State University have published research on using reinforcement learning to better manage the allocation of vehicles across a given urban area. The researchers propose two algorithms to tackle this: contextual multi-agent actor-critic (cA2C) and contextual deep Q-learning (cDQN); both algorithms implement tweaks to account for geographical no-go areas (like lakes) and for the presence of other collaborative agents. The algorithms’ reward function is “to maximize the gross merchandise volume (GMV: the value of all the orders served) of the platform by repositioning available vehicles to the locations with larger demand-supply gap than the current one”.
  The dataset and environment: The researchers test their algorithms in a custom-designed large-scale gridworld which is fed with real data from Didi Chuxing’s fleet management system. The data is based on rides taken in Chengdu, China over four consecutive weeks and includes information on order price, origin, destination, and duration, as well as the trajectories and status of real Didi vehicles.
  The results: The researchers test out their approach by simulating the real past scenarios without fleet management; with a bunch of different techniques including T-SARSA, DQN, Value-Iteration, and others; then by implementing the proposed RL-based methods. cDQN and cA2C attain significantly higher rewards than all the baselines, with performance marginally above (i.e., slightly above the statistical error threshold) stock DQN.
  Why it matters: Welcome to the new era of platform capitalism, where competition is meted out by GPUs humming at top-speeds, simulating alternative versions of commercial worlds. While the results in this paper aren’t particularly astonishing they are indicative of how large platform companies will approach the deployment of AI systems in the future: gather as much data as possible, build a basic simulator that you can plug real data into, then vigorously test AI algorithms. This suggests that the larger the platform, the better the data and compute resources it can bring to bear on increasingly high-fidelity simulations; all things equal, whoever is able to build the most efficient and accurate simulator will likely best their competitor in the market.
  Read more: Efficient Large-Scale Fleet Management via Multi-Agent Deep Reinforcement Learning (Arxiv).
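The repositioning objective quoted above – move available vehicles toward cells with a larger demand-supply gap, while respecting no-go areas – can be sketched with a simple greedy rule on a grid. The rule below is a hand-coded stand-in for the paper’s learned cDQN/cA2C policies, and the grid, demand, and supply numbers are made up for illustration.

```python
# Toy sketch of fleet repositioning: send each idle vehicle to the
# neighbouring grid cell with the largest demand-supply gap, masking
# out geographical no-go cells (a lake, say). A learned policy like
# cDQN replaces this greedy rule in the actual paper.

demand = {(0, 0): 2, (0, 1): 9, (1, 0): 1, (1, 1): 5}
supply = {(0, 0): 3, (0, 1): 1, (1, 0): 1, (1, 1): 2}
no_go = {(1, 0)}  # e.g. a lake: vehicles may never be sent here

def gap(cell):
    """Demand-supply gap: the quantity the policy tries to chase."""
    return demand[cell] - supply[cell]

def neighbours(cell):
    """Staying put plus the four adjacent cells, minus invalid cells."""
    r, c = cell
    cand = [(r, c), (r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [p for p in cand if p in demand and p not in no_go]

def reposition(cell):
    """Greedy contextual policy: go where the demand-supply gap is biggest."""
    return max(neighbours(cell), key=gap)

# An idle vehicle at (0, 0) is sent to (0, 1), where the gap is 9 - 1 = 8.
print(reposition((0, 0)))
```

The action masking in `neighbours` is the toy analogue of the paper’s “tweaks to account for geographical no-go areas”; coordinating many such agents without them all piling into the same cell is the part that needs the multi-agent machinery.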

Teacups and AI:
…Google Brain’s Eric Jang explains the difficulty of AI through a short story…
How do you define a tea cup? That’s a tough question. And the more you try to define it via specific visual attributes the more likely you are to offer a narrow description that is limited in other ways, or runs into the problems of an obtuse receiver. Those are some of the issues that Eric Jang explores in this fun little short story about trying to define teacups.
  Read more: Teacup (Eric Jang, Blogspot).

CMU researchers add in attention for better end-to-end SLAM:
…The dream of neural SLAM gets closer…
Researchers with Carnegie Mellon University and Apple have published details on Neural Graph Optimizer, a neural approach to the perennially tricky problem of simultaneous localization and mapping (SLAM) for agents that move through a varied world. Any system that aspires to do useful stuff in the real world needs to have SLAM capabilities. Today, neural network SLAM techniques struggle with problems encountered in day-to-day life like faulty sensor calibration and unexpected changes in lighting. The proposed Neural Graph Optimizer system consists of multiple specialized modules to handle different SLAM problems, but each module is differentiable so the entire system can be trained end-to-end – a desirable proposition, as this cuts down the time it takes to test, experiment, and iterate with such systems. The different modules handle different aspects of the problem, ranging from local estimates (where are you based on local context) to global estimates (where are you in the entire world), and incorporate attention-based techniques to help automatically correct errors that accrue during training.
  Results: The researchers test the system on its ability to navigate a 2D gridworld maze as well as a more complex 3D maze based on the Doom game engine. Experiments show that it is better able than preceding systems to consistently map the estimated location of something to its real ground-truth location.
  Why it matters: Techniques like this bring closer the era of being able to chuck out huge chunks of hand-designed SLAM algorithms and replace them with a fully learned substrate. That will be exceptionally useful for testing and developing new systems and approaches, though it’s unlikely to displace traditional SLAM methods in the short term, as it’s likely neural networks will continue to display quirks that make them impractical for usage in real-world systems.
  Read more: Global Pose Estimation with an Attention-based Recurrent Network (Arxiv).
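The local/global split described above can be caricatured in a few lines: a local module integrates per-step motion (so drift accumulates), while a global module soft-attends over previously stored (observation, pose) pairs to pull the estimate back toward places the agent has seen before. In the real system both modules are learned and differentiable; everything below, including the scalar observations and the fixed 50/50 blend, is hand-coded purely for illustration.

```python
# Toy sketch of local pose integration plus attention-based global
# correction, loosely in the spirit of the Neural Graph Optimizer's
# module split. All quantities are 1D scalars for simplicity.
import math

memory = []  # previously stored (observation, pose) pairs

def local_step(pose, motion):
    """Dead-reckoning update: just add the (possibly noisy) motion."""
    return pose + motion

def attention_correct(pose, obs, sharpness=4.0):
    """Blend the dead-reckoned pose with attention-weighted past poses."""
    if not memory:
        return pose
    # Soft attention: weight each stored pose by how similar its
    # observation is to the current one.
    scores = [math.exp(-sharpness * abs(obs - m_obs)) for m_obs, _ in memory]
    total = sum(scores)
    attended = sum(w * m_pose for w, (_, m_pose) in zip(scores, memory)) / total
    return 0.5 * pose + 0.5 * attended  # fixed blend; learned in practice

# Revisit a known place (observation 1.0) after drifting: the attention
# step pulls the estimate back toward the remembered pose.
memory.append((1.0, 0.0))        # saw observation 1.0 at pose 0.0
drifted = local_step(0.0, 0.3)   # odometry says we moved; includes drift
corrected = attention_correct(drifted, 1.0)
```

The appeal of making both pieces differentiable is that the blending weights and similarity function, hard-coded here, can instead be trained end-to-end against ground-truth poses.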

AI stars do a Reddit AMA, acknowledge hard questions:
…Three AI luminaries walk into a website, [insert joke]…
Yann LeCun, Peter Norvig, and Eric Horvitz did an Ask Me Anything (AMA) on Reddit recently where they were confronted with a number of the hard questions that the current AI boom is raising. It’s worth reading the whole AMA, but a couple of highlights below.
  The compute gap is real: “My NYU students have access to GPUs, but not nearly as many as when they do an internship at FAIR,” says Yann LeCun. But don’t be disheartened: he points out that, despite lacking compute, academia will likely continue to be the main originator of novel ideas, which industry will then scale up. “You don’t want to put you [sic] in direct competition with large industry teams, and there are tons of ways to do great research without doing so.”
  The route to AGI: Many questions asked the experts about the limits of deep learning and implicitly probed for research avenues that could yield more flexible, powerful intelligences.
     Eric Horvitz is interested in the symphony approach: “Can we intelligently weave together multiple competencies such as speech recognition, natural language, vision, and planning and reasoning into larger coordinated “symphonies” of intelligence, and explore the hard problems of the connective tissue—of the coordination.”
    Yann LeCun: “getting machines to learn predictive models of the world by observation is the biggest obstacle to AGI. It’s not the only one by any means…My hunch is that a big chunk of the brain is a prediction machine. It trains itself to predict everything it can (predict any unobserved variables from any observed ones, e.g. predict the future from the past and present). By learning to predict, the brain elaborates hierarchical representations.”
  Read more: AMA AI researchers from Facebook, Google, and Microsoft (Reddit).

Tech Tales:

It sounds funny now, but what saved all our lives was a fried circuit board that no one had the budget to fix. We installed Camera X32B in the summer of last year. Shortly after we installed it a bird shit on it and some improper assembly meant the shit leached through the cracks in the plastic and fell onto its circuit board, fusing the vision chip. Now, here’s the miracle: the shit didn’t break the main motherboard, nor did it mess up the sound sensors or the innumerable links to other systems. It just blinded the thing. But we kept it; either out of laziness, or out of some kind of mysticism that convinced us of the implicit moral hazard of retiring things that mostly still worked. However it happened, it happened, and we kept it.

So one day the criminals came in and they were all wearing adversarial masks: strange, Mexican wrestling-type latex masks that they kept crumpled up in their clothes till after they got into the facility and were able to put them on. The masks changed the distribution of a person’s face, rendering our lidar systems useless, and had enough adversarial examples coded into their visual appearance that our object detectors told our security system that – and yes, this really happened – three chairs were running at 15 kilometers per hour down the corridor.

But the camera that had lost its vision sensor had been installed for a few months and, thanks to the neural net software it was running, it was kind of… smart. It had figured out how to use all the sensors coming into its system in such a way as to maximize its predictions in concordance with those of the other cameras. So it had learned some kind of strange mapping between what the other cameras categorized as people and what it categorized as a strange sequence of vibrations or a particular distribution of sounds over a given time period. So while all the rest of our cameras were blinded, this one had inherited enough of a defined set of features about what a person looked like that it was able to tell the security system: I feel the presence of eight people, running at a fast rate, through the corridor. And because of that warning a human guard at one of the contractor agencies thousands of miles away got notified and bothered to look at the footage, and because of that he called the police, who arrived and arrested the people, two of whom it turned out were carrying guns.

So how do you congratulate an AI? We definitely felt like we should have done. But it wasn’t obvious. One of our interns had the bright idea of hanging a medal around the neck of the camera with the broken circuit board, then training the other cameras to label that medal as “good job” and “victorious” and “you did the right thing”, and so now whenever it moves its neck the medal moves and the other cameras see that medal move and it knows the medal moves and learns a mapping between its own movements and the label of “good job” and “victorious” and “you did the right thing”.

Things that inspired this story: Kids stealing tip jars, CCTV cameras, fleet learning, T-SNE embeddings.