Import AI 106: Tencent breaks ImageNet training record with 1000+ GPUs; augmenting the Oxford RobotCar dataset; and PAI adds more members
by Jack Clark
What uses 2048 GPUs, trains in 4 minutes, and can identify a seatbelt with 75% accuracy? Tencent’s new deep learning model:
…Ultrafast training thanks to LARS, massive batch sizes, and a field of GPUs…
As supervised learning techniques become more economically valuable, researchers are trying to reduce the time it takes to train deep learning models. Faster training lets them run more experiments in a given time period, increasing both the cadence of their internal research efforts and their ability to train new models to account for new data inputs or shifts in existing data distributions. One metric that has emerged as important here is the time it takes to train networks on the ‘ImageNet’ dataset to a baseline accuracy. Now, researchers with Chinese mega-tech company Tencent and Hong Kong Baptist University have shown how to use 2048 GPUs and a 64k batch size (absolutely massive, for those who don’t follow this stuff regularly) to train a ResNet-50 model on ImageNet to a top-1 accuracy of 75.8% within 6.6 minutes, and AlexNet to 58.7% accuracy within 4 minutes.
Training: To train this, the researchers developed a distributed deep learning training system called ‘Jizhi’, which uses tricks including opportunistic data pipelining; hybrid all-reduce; and a training model which incorporates model and variable management, along with optimizations like mixed-precision training (using half-precision arithmetic to increase throughput). The authors say one of the largest contributing factors to their results is LARS (Layer-wise Adaptive Rate Scaling (Arxiv)), which computes a separate learning rate for each layer based on the ratio of that layer’s weight norm to its gradient norm, keeping training stable at very large batch sizes – they conduct an ablation study and find that a version trained without LARS gets a top-1 accuracy of 73.2%, compared to 76.2% for the version trained with LARS.
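A minimal numpy sketch of the LARS idea – a per-layer learning rate proportional to the ratio of weight norm to gradient norm. The trust coefficient, weight decay, and global learning rate values here are illustrative defaults, not the paper’s exact settings:

```python
import numpy as np

def lars_local_lr(weights, grads, trust_coeff=0.001, weight_decay=0.0001):
    """Layer-wise local learning rate: ratio of the layer's weight norm
    to its (weight-decayed) gradient norm, scaled by a trust coefficient.
    This keeps the update size proportional to the layer's weight scale."""
    w_norm = np.linalg.norm(weights)
    g_norm = np.linalg.norm(grads + weight_decay * weights)
    return trust_coeff * w_norm / (g_norm + 1e-9)

def lars_step(weights, grads, global_lr=0.1, trust_coeff=0.001, weight_decay=0.0001):
    """One SGD step where the global learning rate is rescaled per layer."""
    local_lr = lars_local_lr(weights, grads, trust_coeff, weight_decay)
    return weights - global_lr * local_lr * (grads + weight_decay * weights)
```

Note the local rate is scale-invariant: doubling both a layer’s weights and gradients leaves its learning rate unchanged, which is what lets very different layers share one global schedule at huge batch sizes.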
Model architecture tweaks: The authors eliminate weight decay on the bias terms and batch normalization parameters, and add batch normalization layers to AlexNet.
Communication strategies: The researchers implement a number of tweaks to deal with the problems brought about by the immense scale of their training infrastructure. These include ‘tensor fusion’, which lets them chunk multiple small tensors together before running an all-reduce step; ‘hierarchical all-reduce’, which lets them group GPUs together and selectively reduce and broadcast to further increase efficiency; and ‘hybrid all-reduce’, which lets them flip between two different implementations of all-reduce according to whichever is most efficient at the time.
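A toy sketch of the tensor-fusion idea: pack each worker’s small tensors into one flat buffer, run a single collective over the fused buffers, then split the result back out. The `sum_allreduce` stand-in and function names are hypothetical – a real system would use a multi-GPU collective like NCCL:

```python
import numpy as np

def fused_allreduce(worker_tensors, allreduce):
    """worker_tensors: one list of arrays per worker (same shapes everywhere).
    Fuses each worker's tensors into a single flat buffer so only one
    all-reduce call is needed instead of one per small tensor."""
    shapes = [t.shape for t in worker_tensors[0]]
    sizes = [t.size for t in worker_tensors[0]]
    fused = [np.concatenate([t.ravel() for t in tensors])
             for tensors in worker_tensors]
    reduced = allreduce(fused)
    # Split the reduced flat buffer back into the original tensor shapes.
    out, offset = [], 0
    for shape, size in zip(shapes, sizes):
        out.append(reduced[offset:offset + size].reshape(shape))
        offset += size
    return out

def sum_allreduce(buffers):
    """Stand-in for a real all-reduce: elementwise sum across workers."""
    return np.sum(buffers, axis=0)
```

The payoff at scale is that each collective call has fixed latency overhead, so batching many tiny gradient tensors into one large buffer amortizes that cost.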
Why it matters: Deep learning is fundamentally an empirical discipline, in which scientists launch experiments, observe results, use hard-won intuitions to re-configure hyperparameters and architectures, and repeat the process. Computers are therefore somewhat analogous to telescopes: the bigger the computer, the farther you may be able to see, as you’re able to run a faster experimental loop at greater scales than other people. The race between large organizations to scale up training will likely lead to many interesting research avenues, but it also risks bifurcating research into “low compute” and “high compute” environments – that could further widen the gulf between academia and industry, which could create problems in the future.
Read more: Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes (Arxiv).
What’s better than the Oxford RobotCar Dataset? An even more elaborate version of this dataset!
…Researchers label 11,000 frames of data to help people build better self-driving cars…
Researchers with Università degli Studi Federico II in Naples and Oxford Brookes University in Oxford have augmented the Oxford RobotCar Dataset with many more labels designed specifically for training vision-based policies for self-driving cars. The new dataset is called READ, or the “Road Event and Activity Detection” dataset, and involves a large number of rich labels which have been applied to ~11,000 frames of data gathered from cameras on an autonomous Nissan Leaf driven around Oxford, UK. The dataset labels include “spatiotemporal actions performed not just by humans but by all road users, including cyclists, motor-bikers, drivers of vehicles large and small, and obviously pedestrians.” These labels can be quite granular, and individual agents in a scene, like a car, can have multiple labels applied to them (for instance, a car in front of the autonomous vehicle at an intersection might be tagged with “indicating right” and “car stopped at the traffic light”). Similarly, cyclists could be tagged with labels like “cyclist moving in lane” and “cyclist indicating left”, and so on. This richness might help develop better detectors that can create more adaptable autonomous vehicles.
Tools used: They used Microsoft’s ‘Visual Object Tagging Tool’ (VOTT) to annotate the dataset.
Next steps: This version of READ is a preliminary one, and the scientists plan to eventually label 40,000 frames. They also have ambitious plans to create “a novel deep learning approach to detecting complex activities”. Let’s wish them luck.
Why it matters: Autonomous cars are going to revolutionize many aspects of the world, but in recent years there has been a major push by industry to productize the technology, which has led to much of the research occurring in private. Academic research initiatives and associated dataset releases like this promise to make it easier for other people to develop this technology, potentially broadening our own understanding of it and letting more people participate in its development.
Read more: Action Detection from a Robot-Car Perspective (Arxiv).
Whether rain, fog, or snow – researchers’ weather dataset has you covered:
…RFS dataset taken from creative commons images…
Researchers with the University of Essex and the University of Birmingham have created a new weather dataset called the Rain Fog Snow (RFS) dataset which researchers can use to better understand, classify and predict weather patterns.
Dataset: The dataset consists of more than 3,000 images taken from websites like Flickr, Pixabay, Wikimedia Commons, and others, depicting scenes with different weather conditions ranging from rain to fog to snow. In total, the researchers gathered 1,100 images for each class, creating a potentially useful new dataset for researchers to experiment with.
Read more: Weather Classification: A new multi-class dataset, data augmentation approach and comprehensive evaluations of Convolutional Neural Networks (Arxiv).
DeepMind teaches computers to count:
…Pairing deep learning with specific external modules leads to broadened capabilities…
Neural networks are typically not very good at maths. That’s because figuring out a way to train a neural network to develop a differentiable, numeric representation is difficult, with most work typically involving handing off the outputs of a neural network to a non-learned, predominantly hand-programmed system. Now, DeepMind has implemented a couple of modules – a Neural Accumulator (NAC) and a Neural Arithmetic Logic Unit (NALU) – specifically to help its computers learn to count. These modules are “biased to learn systematic numerical computation”, write the authors of the research. “Our strategy is to represent numerical quantities as individual neurons without a nonlinearity. To these single-value neurons, we apply operators that are capable of representing simple functions (e.g., +, -, x, etc). These operators are controlled by parameters which determine the inputs and operations used to create each output”.
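A minimal numpy sketch of the NAC/NALU forward pass as described in the paper: the NAC constrains its effective weights toward {-1, 0, 1} via tanh(Ŵ)·σ(M̂) so it learns additions and subtractions, while the NALU gates between that additive path and a multiplicative path computed in log space. The hand-set weight values in the test below are illustrative – in a real model these parameters are learned:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nac(x, W_hat, M_hat):
    """Neural Accumulator: effective weights tanh(W_hat) * sigmoid(M_hat)
    saturate toward -1, 0, or 1, biasing the unit toward exact
    addition/subtraction of its inputs."""
    W = np.tanh(W_hat) * sigmoid(M_hat)
    return x @ W.T

def nalu(x, W_hat, M_hat, G, eps=1e-7):
    """Neural Arithmetic Logic Unit: a learned gate g mixes the additive
    NAC path with a multiplicative path computed in log space."""
    a = nac(x, W_hat, M_hat)                                  # add/subtract
    m = np.exp(nac(np.log(np.abs(x) + eps), W_hat, M_hat))    # multiply/divide
    g = sigmoid(x @ G.T)                                      # gate in [0, 1]
    return g * a + (1 - g) * m
```

With large positive Ŵ and M̂ the effective weights approach 1, so a NAC over inputs (2, 3) outputs roughly 5; driving the gate toward 0 routes through the log-space path, giving roughly 2 × 3 = 6.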
Tests: The researchers rigorously test their approach on tasks ranging from counting the number of times a particular MNIST class has been seen, to basic addition, multiplication, and division, to more complicated domains with other challenges, like needing to keep track of time while completing tasks in a simulated gridworld.
Why it matters: Systems like this promise to broaden the applicability of neural networks to a wider set of problems, and will let people build systems with larger and larger learned components, offloading human expertise from hand-programming things like numeric processors, to designing numeric modules that can be learned along with the rest of the system.
Read more: Neural Arithmetic Logic Units (Arxiv).
Get the code: DeepMind is yet to release official code, but that hasn’t stopped the wider community from quickly replicating it. There are currently five implementations of this available on GitHub – check out the list here and pick your favorite (Andrew Trask, paper author, Twitter).
Google researchers use AI to optimize AI models for mobile phones:
…Platform-Aware Neural Architecture Search for Mobile (MnasNet) gives engineers more dials to tune when having AI systems learn to create other AI systems…
Google researchers have developed a neural architecture search approach that is tuned for mobile phones, letting them use machine learning to learn how to design neural network architectures that can be executed on mobile devices.
The technique: Google’s system treats the task of architecture design as a “multi-objective optimization problem that considers both accuracy and inference latency of CNN models”. The system uses what they term a “factorized hierarchical search space” to help it pick through possible architecture designs.
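The paper’s multi-objective setup can be summarized as maximizing a single scalar reward, ACC(m) × (LAT(m)/T)^w, where T is a target latency and the exponent w penalizes slow models. A minimal sketch – the exponent values and the 80ms target below are illustrative, not the paper’s exact configuration:

```python
def mnas_reward(accuracy, latency_ms, target_ms=80.0, alpha=-0.07, beta=-0.07):
    """Combine accuracy and measured latency into one reward:
    reward = accuracy * (latency / target) ** w.
    With a negative exponent, models slower than the target are penalized
    and models faster than it are mildly rewarded (a 'soft' constraint)."""
    w = alpha if latency_ms <= target_ms else beta
    return accuracy * (latency_ms / target_ms) ** w
```

A model exactly at the latency target scores its raw accuracy, so the search can trade a little accuracy for a large latency win, or vice versa, instead of treating latency as a hard cutoff.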
Results: Systems trained with MnasNet can obtain higher accuracies than those trained by other automatic machine learning approaches, with one variant obtaining a top-1 ImageNet accuracy of 76.13%, versus 74.5% for a prior high-scoring Google NAS technique. The researchers can also tune the networks for latency, and so are able to design a system with a latency of 65ms (as evaluated on a Pixel phone), which is more efficient in terms of execution time than other approaches.
Why it matters: Approaches like this make it easier for us to offload the expensive task (in terms of researcher brain time) of designing neural network systems to computers, letting us trade researcher time for compute time. Stuff like this means we’re heading for a world where increasingly large amounts of compute are used to autonomously design systems, creating increasingly optimized architectures automatically. It’s worth bearing in mind that approaches like this will lead to a “rich get richer” effect with AI, where people with bigger computers are able to design more adaptive, efficient systems than their competitors.
Read more: MnasNet: Platform-Aware Neural Architecture Search for Mobile (Arxiv).
AI Policy with Matthew van der Merwe:
…Reader Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: email@example.com…
What AI means for international competition:
AI could have a transformative impact on a par with technologies such as electricity or combustion engines. If this is the case, then AI – like these precedents – will also transform international power dynamics.
Lessons from history: Previous technological discontinuities have had different winners and losers. The first industrial revolution shifted power from countries with small, professionalized armies to those able to mobilize their populations on a large scale. The technological revolution entrenched this gap, and further favored those with access to key resources. In both instances, first-mover advantages were dwarfed by advantages in resource and capital stocks, and success in applying technologies to new domains.
What about AI: Algorithms in most civilian applications can diffuse rapidly, and hence may be more difficult for countries to hoard. Other inputs to AI development, though, are resources that governments can develop and protect e.g. skills and hardware. The ability of economies to cope with societal impacts from AI will itself be an important driver of their success. The relative importance of these different inputs to AI progress will determine the winners and losers.
Why this matters: The US remains an outlier amongst countries in not having a coordinated AI strategy, notwithstanding some preliminary work done at the end of the Obama administration. As the report makes clear, technological leaps frequently have destabilizing effects on global power dynamics. While much of this remains uncertain, there are clear actions available to countries to mitigate some of the greatest risks, particularly ensuring that safety and ethical considerations remain a priority in AI development.
Read more: Strategic Competition in an Era of Artificial Intelligence (CNAS).
Google’s re-entry into China:
Google is launching a censored search engine in China, according to leaks reported by The Intercept. The alleged product has been developed in consultation with the Chinese government, and will be compliant with the country’s strict internet censorship, e.g. by blocking websites and searches related to human rights, democracy, and protests. Google’s search engine has been blocked in China since 2010, when the company ceased offering a censored product after a major cyberattack. They had previously faced significant criticism in the US for their involvement in censorship.
The AI principles: Google were praised for releasing their AI principles in June, after criticism over their collaboration on Project Maven. The principles include the pledge that Google “will not design or deploy AI … in technologies whose purpose contravenes widely accepted principles of international law and human rights.”
Why this matters: Google has been slowly re-establishing a presence in China, launching a new AI Center and releasing TensorFlow for Chinese developers in 2017. This latest project, though, is likely to spark criticism, particularly amidst the increasing attention on the conduct of tech giants. A bipartisan group of Senators have already released a letter critical of the decision. The Maven case demonstrates Google’s employees’ ability to mobilize effectively on corporate behavior they object to, particularly when information about these projects has been withheld. Whether this turns into another Maven situation remains to be seen.
Read more: Google plans to launch censored search engine in China (The Intercept).
Read more: Senators’ letter to Google.
More names join ethical AI consortium:
The Partnership on AI, a multi-stakeholder group aiming to ensure AI benefits society, has announced 18 new members, including PayPal, New America, and the Wikimedia Foundation. The group was founded in 2016 by the US tech giants and DeepMind, and is focussed on formulating best practices in AI to ensure that the technology is safe and beneficial.
Read more: Expanding the Partnership (PAI).
Down on the computer debug farm
So what’s wrong with it?
It thinks cats are fish.
Why did you bother to call me? That’s an easy fix. Just update the data distribution.
It’s not that simple. It recognizes cats, and it recognizes fish. But it’s choosing to see cats as fish.
We’re trying to reproduce. It was deployed in several elderly care homes for a few years. Then we picked up this bug recently. We think it was from a painting class.
Well, we’ll show you.
What am I looking at here?
Pictures of cats in fishbowls.
I know. Look, explain this to me. I’ve got a million other things to do.
We think it liked one of the people that was in this painting class and it complimented them when they painted a cat inside a fishbowl. It’s designed as a companion system.
Well, it kept doing that to this person, and it made them happy. Then it suggested to someone else they might want to paint this. It kind of went on from there.
“Went from there”?
We’ve found a few hundred paintings like this. That’s why we called you in.
And we can’t wipe it?
Have you considered showing it a fish in a cat carrier?
Well, have you?
Have a better idea?
That’s what I thought. Get to work.
Things that inspired this story: Adversarial examples; bad data distributions; fleet learning; proclivities.