Import AI: #74: Why Uber is betting on evolution, what Facebook and Baidu think about datacenter-scale AI computing, and why Tacotron 2 means speech will soon be spoofable

by Jack Clark

All hail the new ‘datacenter scale’ era for AI:
…Facebook research paper goes through some of the many problems that come with running AI at the scale of an entire datacenter…
Facebook has published an analysis of how it runs its worldwide fleet of AI services and how its scale has influenced the way it has deployed AI into production. The company uses both CPUs and GPUs, with GPUs being used for large-scale face recognition, language translation, and the ‘Lumos’ feature-analysis service. It also runs a significant amount of work on CPUs; one major workload is ranking features for newsfeed. “Computer vision represents only a small fraction” of the total work, Facebook writes.
  Split languages: Facebook uses ‘Caffe2’ for its production systems, while its researchers predominantly use PyTorch. Though the company’s main ML services (FBLearner Feature Store / FBLearner Flow / FBLearner Predictor) support a bunch of different AI frameworks, they’ve all been specially integrated with Caffe2, the company says.
  The big get bigger: Facebook, like other major AI users, is experimenting with running significantly larger AI models at larger scales: this has altered how it places and networks together its GPU servers, as well as directed it to spin up research in areas like low-precision training. They’re also figuring out ways to use the scale to their advantage. “Using certain hyperparameter settings, we can train our image classification models to very large mini-batches, scaling to 256+ GPUs,” they write. “For one of our larger workloads, data parallelism has been demonstrated to provide 4x the throughput using 5x the machine count (e.g., for a family of models that trains over 4 days, a pool of machines training 100 different models could now train 20 models per day, so training throughput drops by 20%, but the wait time for potential engineering advancement improves from four days to one day).”
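  The arithmetic in that quote can be checked with a short sketch. It assumes a hypothetical pool of 100 machine slots, one slot per model in the baseline, and that giving a model five slots cuts its training time from four days to one. The function and parameter names are illustrative and are not taken from Facebook's paper.

```python
def models_per_day(pool_slots, slots_per_model, days_per_model):
    """Models finished per day when the pool is split into equal groups."""
    concurrent_models = pool_slots // slots_per_model
    return concurrent_models / days_per_model


# Baseline from the quote: 100 models train side by side, each taking 4 days.
baseline = models_per_day(pool_slots=100, slots_per_model=1, days_per_model=4)

# Data parallel: 5x the machines per model, 4x faster, so 1 day per model.
data_parallel = models_per_day(pool_slots=100, slots_per_model=5, days_per_model=1)

# Fraction of pool-wide throughput given up in exchange for the shorter wait.
throughput_drop = 1 - data_parallel / baseline

print(baseline, data_parallel, round(throughput_drop, 2))
```

  Under these assumptions the pool goes from 25 finished models per day to 20, which is the 20% drop in the quote, while any single result arrives in one day instead of four.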
  One GPU region to train them all: When Facebook was first experimenting with GPUs for deep learning, it rolled out GPUs in a single data center region, which it figured was a good decision as the designs of the servers were changing and the teams needed to become used to maintaining them. This had some pretty negative consequences down the road, causing a re-think of how the company distributed its data center resources and its infrastructure.
Read more: Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective.

Baidu publishes some rules-of-thumb for how model size relates to performance:
…The beginnings of a theory for deep learning…
Deep learning is an empirical science – we don’t fully understand how various attributes of our neural networks dictate their ultimate representational capacity. That means the day-to-day work of any AI organization involves a lot of empirical experimentation. Now, researchers with Baidu have attempted to formalize some of their ideas about how the scale of a deep learning model relates to its performance.
  “Through empirical testing, we find predictable accuracy scaling as long as we have enough data and compute power to train large models. These results hold for a broad spectrum of state-of-the-art models over four application domains: machine translation, language modeling, image classification, and speech recognition,” they write.
  The results suggest that once researchers get a model to a certain threshold of accuracy they can be confident that by simply adding compute and/or data they can reach x performance within a rough margin of error. “Model error improves starting with ‘best guessing’ and following the power-law curve down to ‘irreducible error’,” they say. “We find that models transition from a small training set region dominated by best guessing to a region dominated by power-law scaling. With sufficiently large training sets, models will saturate in a region dominated by irreducible error (e.g., Bayes error).”
  The insight is useful but still requires experimental validation, as the researchers find similar learning curves across a variety of test domains, “although different applications yield different power-law exponents and intercepts”.
  It is also a further sign that compute will become as strategic as data to AI, with researchers seeking to be able to run far more empirical tests and scale up far more frequently when equipped with somewhat formal intuitions like the one stumbled upon by Baidu’s research team.
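  The three regions the researchers describe (best guessing, power-law scaling, irreducible error) can be sketched with a toy curve. This is a rough illustration rather than Baidu's model: the functional form `min(best_guess, alpha * m**beta + irreducible)` and every constant below are assumptions made for the example, and the fit assumes the irreducible error is already known, which in practice it would not be.

```python
import math


def learning_curve(m, best_guess=0.9, alpha=5.0, beta=-0.5, irreducible=0.05):
    """Toy generalization error for a training set of m samples.

    Small m: capped at the best-guess error.
    Mid m: follows the power law alpha * m**beta.
    Large m: flattens towards the irreducible error floor.
    All constants are made up for illustration.
    """
    return min(best_guess, alpha * m ** beta + irreducible)


def fit_power_law(sizes, errors, irreducible):
    """Least-squares line in log-log space; returns (alpha, beta)."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(e - irreducible) for e in errors]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    alpha = math.exp(mean_y - beta * mean_x)
    return alpha, beta


# Sizes chosen to sit inside the power-law region of the toy curve.
sizes = [10 ** k for k in range(3, 8)]
errors = [learning_curve(m) for m in sizes]
alpha_hat, beta_hat = fit_power_law(sizes, errors, irreducible=0.05)
```

  On this synthetic data the fit recovers the exponent and intercept that generated it. The practical appeal is the reverse direction: measure a few points on a real curve, fit the exponent, and extrapolate how much more data a target error would need.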
– Read more here: Deep Learning Scaling is Predictable, Empirically (Baidu blog).
– Read more here: Deep Learning Scaling is Predictable, Empirically (Arxiv).

Evolution, evolution everywhere at Uber AI Labs:
…Suite of new papers shows the many ways in which neuroevolution approaches are contemporary and complementary to neural network approaches…
Uber’s AI research team has published a set of papers that extend and augment neuroevolution approaches – continuing the long-standing professional fascinations of Uber researchers like Ken Stanley (inventor of NEAT and HyperNEAT, among others). Neuroevolution is interesting to contemporary AI researchers because it provides a method to use compute power to push simple algorithms through the more difficult parts of hard problems rather than having to invent new algorithmic pieces to get us across certain local minima; with evolutionary approaches, the difference between experimental success and failure is often dictated by the amount of compute applied to the problem.
– Exploration: The researchers show how to further tune the exploration process in evolution strategies (ES) algorithms through the alternation of novelty search and quality diversity algorithms. They also introduce new ideas to improve the mutation process of large neural networks.
– Theory: The researchers compare the approximate gradients computed by ES with the exact gradient computed by stochastic gradient descent (SGD) and design tools to better predict how ES performance relates to scale and parallelization.
– Big compute everywhere: “For neuroevolution researchers interested in moving towards deep networks there are several important considerations: first, these kinds of experiments require more computation than in the past; for the experiments in these new papers, we often used hundreds or even thousands of simultaneous CPUs per run. However, the hunger for more CPUs or GPUs should not be viewed as a liability; in the long run, the simplicity of scaling evolution to massively parallel computing centers means that neuroevolution is perhaps best poised to take advantage of the world that is coming,” they write.
– Read more here: Welcoming the Era of Deep Neuroevolution (Arxiv).
– Read more: Deep Neuroevolution: Genetic Algorithms Are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning (Arxiv).
– Read more: Safe Mutations for Deep and Recurrent Neural Networks through Output Gradients.
– Read more: On the Relationship Between the OpenAI Evolution Strategy and Stochastic Gradient Descent (Arxiv).
– Read more: ES Is More Than Just a Traditional Finite Difference Approximator (Arxiv).
– Read more: Improving Exploration in Evolution Strategies for Deep Reinforcement Learning via a Population of Novelty-Seeking Agents.
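
The comparison in the 'Theory' item above, approximate ES gradients versus an exact gradient, can be illustrated on a toy problem. This is a minimal sketch and not code from the Uber papers: the quadratic reward, the mirrored (antithetic) sampling, and all names and constants are assumptions chosen to keep the example small. Each perturbation pair is independent, which is why this style of estimator spreads so easily across many CPUs.

```python
import random


def reward(theta, target):
    """Toy objective: negative squared distance to a target vector."""
    return -sum((t - g) ** 2 for t, g in zip(theta, target))


def exact_gradient(theta, target):
    """Analytic gradient of the toy reward."""
    return [-2.0 * (t - g) for t, g in zip(theta, target)]


def es_gradient(theta, target, sigma=0.1, pairs=2000, seed=0):
    """Estimate the gradient from reward evaluations alone, using
    Gaussian perturbations in mirrored (antithetic) pairs."""
    rng = random.Random(seed)
    estimate = [0.0] * len(theta)
    for _ in range(pairs):
        noise = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = reward([t + sigma * e for t, e in zip(theta, noise)], target)
        minus = reward([t - sigma * e for t, e in zip(theta, noise)], target)
        scale = (plus - minus) / (2.0 * sigma * pairs)
        estimate = [g + scale * e for g, e in zip(estimate, noise)]
    return estimate


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


theta, target = [0.0, 0.0, 0.0], [1.0, -2.0, 0.5]
approx = es_gradient(theta, target)
exact = exact_gradient(theta, target)
print(round(cosine(approx, exact), 3))
```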

US National Security Strategy picks out AI’s potential damage to the information battlespace:
…AI’s ability to create fake news and aid surveillance picked out in NSS report…
While other countries around the world publish increasingly complicated, detailed national AI development strategies, the US government is instead adopting a ‘business as usual’ approach, based on the NSS letter, which explicitly mentions AI in (only!) two places – as it relates to innovation (named amid a bundle of different technologies as something to be supported), and national security. It’s the latter point which has more ramifications: The NSS explicitly names AI within the ‘Information Statecraft’ section as a potential threat to US national security.
  “Risks to U.S. national security will grow as competitors integrate information derived from personal and commercial sources with intelligence collection and data analytic capabilities based on Artificial Intelligence (AI) and machine learning. Breaches of U.S. commercial and government organizations also provide adversaries with data and insights into their target audiences,” the NSS says. “China, for example, combines data and the use of AI to rate the loyalty of its citizens to the state and uses these ratings to determine jobs and more. Jihadist terrorist groups continue to wage ideological information campaigns to establish and legitimize their narrative of hate, using sophisticated communications tools to attract recruits and encourage attacks against Americans and our partners. Russia uses information operations as part of its offensive cyber efforts to influence public opinion across the globe. Its influence campaigns blend covert intelligence operations and false online personas with state-funded media, third-party intermediaries, and paid social media users or ‘trolls.’ U.S. efforts to counter the exploitation of information by rivals have been tepid and fragmented. U.S. efforts have lacked a sustained focus and have been hampered by the lack of properly trained professionals. The American private sector has a direct interest in supporting and amplifying voices that stand for tolerance, openness, and freedom.”
Read more: National Security Strategy of the United States of America (PDF).

Goodbye, trustworthy phone calls, hello Tacotron 2:
…Human-like speech synthesis made possible via souped-up Wavenet…
Google has published research on Tacotron 2, text-to-speech (TTS) software that the company has used to generate synthetic audio samples that sound just like human beings.
  Results: One model attains a mean opinion score (MOS) of 4.53 compared to the 4.58 typically given to professionally recorded speech. You can check out some of the Tacotron 2 audio samples here; I listened to them and had trouble telling the difference between human and computer speakers. The researchers also carried out a side-by-side evaluation between audio synthesized by their system and the ground truth and found that people still have a slight preference towards ground truth (human-emitted spoken dialogue) versus the Tacotron 2 samples. Further work will be required to train the system to be able to deal with unusual words and pronunciations, as well as figuring out how to condition it at runtime to make a particular audio sample sound happy, sad, or whatever.
The next step for systems like this will be re-training the synthetic voices to match a target speaker using a relatively small amount of data, then figuring out how to condition such systems with accents or other speech tics to better mimic the target.
Read more: Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions.

Chinese chip startup Horizon Robotics releases surveillance chip:
…Chip focuses on surveillance, self-driving…
Horizon Robotics has released the ‘Journey 1.0 processor’, a chip that (according to Google Translate), “has the ability to accurately detect and recognize pedestrian, motor vehicle, non-motorized vehicle and traffic sign at the same time. The intelligent driving platform based on this chip supports the detection of 260 kinds of traffic signs, and the recognition accuracy to the traffic lights of traffic lights, current lanes and adjacent lanes is more than 95%.”
  Each chip “can detect 200 visual targets at the same time,” the company says.
  China’s chip boom: China is currently undergoing a boom in the number of domestic startups developing specific AI chips for inference and training – part of a larger national push to create more national champions with semiconductor expertise and provide some significant competition to traditional chip companies Intel, AMD, IBM, and NVIDIA.
– Read more in this Chinese press release from Horizon.
– Check out Horizon’s website.

Salesforce researchers craft AI architecture generator, marvel at its creation of the high-performance, non-standard ‘BC3’ cell:
…Giving neural architecture search a supervised boost via a Domain-Specific Language…
Salesforce’s approach to neural architecture search relies on human supervision in the form of a domain-specific language (DSL) which is given to the AI. The intuition here is that the human can specify a small shopping list of AI components which the system can evaluate, and it will figure out the best quantity and combination of these components to solve its tasks.
  One drawback of neural architecture search is that it can be expensive, not only because of the computation expended on trying out different architectures, but also because of the larger storage and compute requirements that come with testing them. The Salesforce researchers try to get around this by using a recursive neural network to iteratively predict the performance of new architectures, reducing the need for actual full-blown testing of the models.
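  The overall loop described above (sample candidates from a DSL, rank them with a cheap predictor, fully test only the best few) can be sketched as follows. This is a loose illustration, not Salesforce's system: `Gate3` is the only operator named in this write-up, the other operator names and all arities are assumptions, and `surrogate_score` is a hand-written stand-in for the learned ranking network.

```python
import random

# Operator arities for a toy DSL. Gate3 is named in the text; the other
# operator names and all arities here are illustrative assumptions.
DSL = {"Tanh": 1, "Sigmoid": 1, "Add": 2, "Mult": 2, "Gate3": 3}
INPUTS = ["x_t", "h_prev"]


def sample_architecture(rng, depth):
    """Recursively sample an expression tree built only from DSL operators."""
    if depth == 0:
        return rng.choice(INPUTS)
    op = rng.choice(sorted(DSL))
    return (op,) + tuple(sample_architecture(rng, depth - 1) for _ in range(DSL[op]))


def count_nodes(arch):
    """Number of operators and inputs in an expression tree."""
    if isinstance(arch, str):
        return 1
    return 1 + sum(count_nodes(child) for child in arch[1:])


def surrogate_score(arch):
    """Stand-in for a learned ranking function (NOT a trained model):
    it simply prefers smaller trees so the example stays deterministic."""
    return -count_nodes(arch)


def search(num_candidates, top_k, seed=0):
    """Sample candidates, rank them cheaply, and return the shortlist."""
    rng = random.Random(seed)
    candidates = [sample_architecture(rng, depth=2) for _ in range(num_candidates)]
    ranked = sorted(candidates, key=surrogate_score, reverse=True)
    return ranked[:top_k]  # only these would be trained and evaluated in full


shortlist = search(num_candidates=50, top_k=5)
```

  The saving comes from the last line of `search`: fifty candidates are scored, but only five would incur the cost of real training.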
  Results: Architectures trained with Salesforce’s approach perform comparably to the state-of-the-art on tasks like language understanding and on machine translation – with the benefit of having been trained almost entirely through computers autonomously coming up with effective architectures, rather than machine learning researchers expending time on it.
  The mystery of the ‘BC3’ cell: Like all good research papers, this one contains an easter egg: the discovery of the ‘BC3’ cell, which the system used in various top-performing models. This cell has the odd trait of containing “an unexpected layering of two Gate3 operators,” they write. “While only the core DSL was used, BC3 still breaks with many human intuitions regarding RNN architectures.”
  Neural architecture search techniques seem to be in their infancy today but are likely to become very significant over the next two years as these techniques will benefit tremendously from the arrival of new fast computer hardware, like custom AI chips from firms like Google (TPUs) and Graphcore, as well as new processors from AMD, NVIDIA, and Nervana (Intel).
Read more: A Flexible Approach to Automated RNN Architecture Generation.

Tech Tales:

[Detroit VRZoo Sponsored by WorldGoggles(TM), 2028]

“Daddy, daddy, it’s swinging from the top of its cage! And now it’s hanging with one arm. Oh wait… gross! Daddy it just pooped and now it’s throwing it across the cage!”
  You stare at the empty, silent enclosure. Look at the rounded prison bars, buffed smooth by the oils from decades of curious hands, then down to your kid who has their WorldGoggles on and is staring at the top left corner of the cage with an expression that – you suspect – is childlike wonder.
  “Daddy you’re missing it, come on!” they say, grabbing your sleeve. “Put yours on.”
  Okay, you say, tugging the glasses down over your eyes. The cage in front of you becomes alive – a neon orange, static-haired orangutan dangles from the top bar of the cage with one arm and uses its other to scoop into its backside then sling poo at a trio of hummingbirds on the other side of the cage, which dodge from side-to-side, avoiding the flung shit.
  “Woah look at that,” your kid says. “Those are some smart birds!” The kid plants their feet on the floor and bounces from side to side, darting their hips left and right, mimicking the dodging of the birds.

After the poo-throwing comes the next piece of entertainment: the monkey and the birds play hide and seek with each other, before being surprised by a perfectly rendered digital anaconda, hidden in one of the fake rock walls of the augmented reality cavern. After that you rent the three creatures a VR toy you bought your kid last weekend so they can all play a game together. Later, you watch your child as they gaze up at digital tigers, or move their head from side to side as they follow the just-ever-so-slightly pixelated bubbles of illusory fish.

Like most other parents you spend the majority of the day with your goggles flipped up on your head, looking at the empty repurposed enclosures and the various electronic sensors that stud the corners and ceilings of the rooms where the living animals used to be. The buildings ring out with the happy cries of the kids and low, warm smalltalk between parents. But there are none of the smells of a regular zoo: no susurrations from sleeping or playing animals, no swinging of chains.

The queue for the warthog is an hour long and after fifteen minutes the kid is bored.
  Or as they say: “Daddy I’m B O R E D Bored! Can we go back to the monkeys?”
  It was an orangutan. And, no, we should see this.
  “What does it do?”
  It’s a warthog.
  “Yes I know Dad but what does it do?”
  It’s alive, you say. They keep talking to you but you distract them by putting on your goggles and playing a game of augmented reality tennis with them, their toy, and the birds who you pay an ‘amusement fee’ to coax over to the two of you.

When you get into the warthog’s cage it reminds you of the setup for Lenin’s tomb in Moscow – a strange, overly large enclosure that the crowd files around, each person trudging as slowly as they can. No one has goggles on, though some kids fiddle with them. It’s as quiet as a church. You can even hear the heavy breathing of the creature, and at one point it burps, causing all the kids to giggle. “Wow,” your kid whispers, then points at the warthog’s head. It’s got a red Santa Hat on – some of the white threading around the base is tarnished with a brown smudge, either dirt or poo. Your kid tries to put on their goggles to take a photo and you stop them and whisper “just look”, and all the other parents look at you with kind eyes. Outside, later, it snows and there’s only a hint of smog in the flakes. Your kid imitates the warthog and bends forward, then runs ahead of you, pretending to burp like a living thing.

Technologies that inspired this story: Augmented Reality; Magic Leap, Hololens. The Berlin zoo. Multi-agent environments. Mobile phone games.