Import AI 148: Standardizing robotics research with Berkeley’s REPLAB; cheaper neural architecture search; and what a drone-racing benchmark says about dual use

by Jack Clark

Standardizing physical robot testing with the Berkeley REPLAB:
…What could help industrialize robotics+AI? An arm in a box, plus standardized software and testing!…
Berkeley researchers have built REPLAB, a “standardized and easily replicable hardware platform” for benchmarking real-world robot performance. Something like REPLAB could be useful because it can bring standardization to how we test the increasingly advanced capabilities of robots equipped with AI.

Today, if I want to get a sense of robot capabilities, I can read innumerable research papers that give me a sense of progress in simulated environments containing simulated robots. What I can’t read about is the performance of multiple real robots performing the same task in real environments – that’s because of a lack of standardization of hardware, tasks, and testing regimes.

Introducing REPLAB: REPLAB consists of a module for real-world robot testing that contains a cheap robotic arm (specifically, a WidowX arm from Interbotix Labs) along with an RGB-D camera. Each REPLAB cell is compact, with the researchers estimating you can fit up to 20 of the arm-containing cells in the same floor space as you’d use for a single ‘Baxter’ robot. Each REPLAB costs about $2000 ($3000 if you buy some extra servos for the arm, to replace in case of equipment failures).

Reliability: During REPLAB development and testing, the researchers “encountered no major breakages over more than 100,000 grasp attempts. No servos needed to be replaced. Repair maintenance work was largely limited to occasional tightening of screws and replacing frayed cables”. Each cell was able to perform about 2,500 grasps per day “with fewer than two interventions per cell per day on average”.

Grasping benchmark: The testing platform is accompanied by a benchmark built around robotic grasping, and a dataset “that can be used together with REPLAB to evaluate learning algorithms for robotic grasping”. The dataset consists of ~92,000 randomly sampled grasps accompanied by labels connoting success or failure.
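A dataset of grasps labeled with success or failure lends itself to a simple accuracy-style evaluation. Here’s a toy sketch of how such a benchmark could score a learned grasp-success predictor – the function names, features, and data layout are illustrative, not REPLAB’s actual API:

```python
# Hypothetical sketch of scoring a grasp-success predictor against a
# dataset of (grasp_features, succeeded) pairs, as in a REPLAB-style
# benchmark. All names here are illustrative assumptions.

def grasp_accuracy(predictor, dataset):
    """dataset: iterable of (grasp_features, succeeded) pairs, where
    succeeded is True/False. predictor maps features -> bool."""
    correct = 0
    total = 0
    for features, succeeded in dataset:
        if predictor(features) == succeeded:
            correct += 1
        total += 1
    return correct / total if total else 0.0

# Toy data: pretend each grasp is summarized by a single (made-up)
# gripper-to-object distance feature.
toy_dataset = [
    ({"distance": 0.01}, True),
    ({"distance": 0.30}, False),
    ({"distance": 0.05}, True),
    ({"distance": 0.25}, True),   # a hard case the toy rule gets wrong
]
toy_predictor = lambda f: f["distance"] < 0.1
print(grasp_accuracy(toy_predictor, toy_dataset))  # 0.75
```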

Why this matters: One indicator of the industrialization of AI is the proliferation of shared benchmarks and standardized testing methods – I think of this as equivalent to how, in the past, oil companies converged on similar infrastructures for labeling, analyzing, and shipping oil and oil information around the world. The fact we’re now at the stage of researchers trying to create cheap, standardized testing platforms (see also: Berkeley’s designed-for-mass-production ‘BLUE’ robot, covered in Import AI #142) is a further indication that robotics+AI is industrializing.
  Read more: REPLAB: A Reproducible Low-Cost Arm Benchmark Platform for Robotic Learning (Arxiv).


Chinese researchers fuse knowledge bases with big language models:
…What comes after BERT? Tsinghua University thinks the answer might be ‘ERNIE’…
Researchers with Tsinghua University and Huawei’s Noah’s Ark Lab have combined structured pools of knowledge with big, learned language models. Their system, called ERNIE (Enhanced Language RepresentatioN with Informative Entities), trains a Transformer-based language model so that, during training, it regularly tries to tie things it reads to entities stored in a structured knowledge graph.

Pre-training with a big knowledge graph: To integrate external data sources, the researchers create an additional pre-training objective, which encourages the system to learn correspondences between various strings of tokens (eg Bob Dylan wrote Blowin’ in the Wind in 1962) and their entities (Bob Dylan, Blowin’ in the Wind). “We design a new pre-training objective by randomly masking some of the named entity alignments in the input text and asking the model to select appropriate entities from KGs to complete the alignments,” they write.
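The masking-and-selection objective described above can be sketched in a few lines. This is a minimal, hypothetical illustration of the data-preparation side (not the paper’s code): some token–entity alignments are hidden, and the model’s job becomes picking the right knowledge-graph entity for each masked slot. The Wikidata-style Q-identifiers are illustrative:

```python
import random

# Minimal sketch (not ERNIE's actual code) of the denoising entity
# alignment objective: randomly mask some token-entity alignments and
# turn them into "pick the right entity from the KG" targets.

def make_entity_prediction_task(aligned, entity_vocab, mask_prob=0.5, rng=random):
    """aligned: list of (token_span, entity_id) pairs.
    Returns (inputs, targets): masked alignments become prediction targets,
    where the model must select the entity from entity_vocab."""
    inputs, targets = [], []
    for span, entity in aligned:
        if rng.random() < mask_prob:
            inputs.append((span, "[MASK]"))
            targets.append((span, entity, entity_vocab))
        else:
            inputs.append((span, entity))
    return inputs, targets

# Illustrative alignments with Wikidata-style IDs.
aligned = [("Bob Dylan", "Q392"), ("Blowin' in the Wind", "Q1769")]
masked_inputs, targets = make_entity_prediction_task(
    aligned, entity_vocab=["Q392", "Q1769"], mask_prob=1.0)
print(masked_inputs)  # every alignment masked at mask_prob=1.0
```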

Data: During training, they pair text from Wikipedia with knowledge embeddings trained on Wikidata, which are used to identify the entities used within the knowledge graph.

Results: ERNIE obtains higher scores at entity-recognition tasks than BERT, chiefly because it less frequently learns incorrect labels (which helps it avoid over-fitting on wrong answers). You’d expect this, given the use of a structured dataset of entity names during training, and an ablation study confirms it: versions of ERNIE trained without an external dataset see their performance noticeably diminish. The system also does well on classifying the relationships between different entities, and in this domain continues to outperform BERT models.

Why this matters: NLP is going through a renaissance as researchers adapt semi-supervised learning techniques from other modalities, like images and audio, for text. The result has been the creation of multiple large-scale, general purpose language models (eg: ULMFiT, GPT-2, BERT) which display powerful capabilities as a consequence of being pre-trained on very large corpuses of text. But a problem with these models is that it’s currently unclear how you get them to reliably learn certain things. One way to solve this is by stapling a module of facts into the system and forcing it, during pre-training, to try and map facts to entities it learns about – that’s essentially what the researchers have done here, and it’ll be interesting to see whether the approach of language model + knowledge base is successful in the long run, or if we’ll just train sufficiently large language models that they’ll autonomously create their own knowledge bases during training.
  Read more: ERNIE: Enhanced Language Representation with Informative Entities (Arxiv).


What happens if neural architecture search gets really, really cheap?
…Chinese researchers seek to make neural architecture search more efficient…
Researchers with the Chinese Academy of Sciences have trained an AI system to design a better AI system. Their work, Efficient Evolution of Neural Architecture (EENA), fits within the general area of neural architecture search. NAS is a sub-field within AI that has seen a lot of activity in recent years, following companies like Google showing that you can use techniques like reinforcement learning or evolutionary search to learn neural architectures that outperform those designed by humans. One problem with NAS approaches, though, is that they’re typically very expensive – a neural architecture search paper from 2016 used 1800 GPU-days of computation to train a near-state-of-the-art CIFAR-10 image recognition model. EENA is one of a new crop of techniques (along with work by Google on Efficient Neural Architecture Search, or ENAS – see Import AI #124), meant to make such approaches far more computationally efficient.

What’s special about EENA: EENA isn’t particularly special and the authors acknowledge this, noting that much of their work here has come from curating past techniques and figuring out the right cocktail of things to get the AI to learn. “We absorb more blocks of classical networks such as dense block, add some effective changes such as noises for new parameters and discard several ineffective operations such as kernel widening in our method,” they write. What’s more significant is the general trend this implies – sophisticated AI developers seem to put enough value in NAS-based approaches that they’re all working to make them cheaper to use.
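At its core, evolutionary NAS of the kind EENA builds on is a mutate-and-select loop over candidate architectures. Here’s a deliberately toy sketch of that loop – the “architecture” is just a list of layer widths and the fitness function is a stand-in, whereas real systems mutate network blocks and inherit trained weights:

```python
import random

# Toy sketch of the evolutionary-search loop underlying NAS methods
# like EENA: mutate a candidate architecture, keep it if it's fitter.
# "Architecture" here is just a list of layer widths; real systems
# mutate blocks (e.g. dense blocks) and evaluate by training.

def mutate(arch, rng):
    arch = list(arch)
    i = rng.randrange(len(arch))
    arch[i] = max(1, arch[i] + rng.choice([-8, 8]))  # widen/narrow a layer
    return arch

def evolve(fitness, init, generations=200, rng=None):
    rng = rng or random.Random(0)
    best = init
    for _ in range(generations):
        child = mutate(best, rng)
        if fitness(child) >= fitness(best):
            best = child
    return best

# Stand-in fitness: prefer total width near 96 (a made-up target
# standing in for validation accuracy).
fitness = lambda arch: -abs(sum(arch) - 96)
print(evolve(fitness, [16, 16, 16]))
```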

Results: Their best-performing system obtains a 2.56% error rate when tested for how well it can classify images in the mid-size ‘CIFAR-10’ dataset. Discovering this model consumed 0.65 GPU-days on a Titan Xp GPU. This is pretty interesting, given that in 2016 we spent 1800 GPU-days to obtain a model (NASNet-A) that got a score of 2.65%. (This result also compares well with ENAS, which was published last year and obtained an error of 2.89% for 0.45 GPU-days of searching.)
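As a quick sanity check on the quoted figures, the compute reduction works out to more than three orders of magnitude:

```python
# Back-of-the-envelope comparison using the numbers quoted above:
# NASNet-A search (~2016) used ~1800 GPU-days; EENA reports ~0.65.
nasnet_gpu_days = 1800
eena_gpu_days = 0.65
speedup = nasnet_gpu_days / eena_gpu_days
print(round(speedup))  # 2769 -- roughly 2,800x fewer GPU-days
```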

Why this matters: I think measuring the advance of neural architecture search techniques has a lot of signal for the future of AI – it tells us something about the ability for companies to arbitrage human costs versus machine costs (eg, pay a small number of people a lot to design a NAS system, then pay for electricity to compute architectures for a range of use-cases). Additionally, being able to better understand ways to make such techniques more efficient lets us figure out which players can use NAS techniques – if you bring down the GPU-days enough, then you won’t need a Google-scale data center to perform architecture search research.
  Read more: EENA: Efficient Evolution of Neural Architecture (Arxiv).
  Check out some of the discovered architectures here (EENA GitHub page).


Why drone racing benchmarks could (indirectly) revolutionize the economy and military:
…UZH-FPV Drone Racing Dataset portends a future full of semi-autonomous flying machines…
What stands between the mostly-flown-by-wire drones of today, and the smart, semi-autonomous drones of tomorrow? The answer is mostly a matter of data and benchmarking – we need big, shared, challenging benchmarks to help push progress in this domain, similar to how ImageNet catalyzed researchers to apply deep learning methods to solve what seemed at the time like a very challenging problem. Now, researchers with the University of Zurich and ETH Zurich have developed the UZH-FPV Drone Racing Dataset, in an attempt to stimulate drone research.

The dataset: The dataset consists of drone sequences captured in two environments: a warehouse, and a field containing a few trees – these trees “provided obstacles for trajectories that included circles, figure eights, slaloms between the trees, and long, straight, high-speed runs.” The researchers recorded 27 flight sequences split across the two environments, and these trajectories are essentially multi-modal, involving sensor measurements recorded on two different onboard computers, as well as external measurements from an external tracker. They also ship these trajectories with baselines that compare modern SLAM algorithms to the ground truth measurements afforded by this dataset.

High-resolution data: “For each sequence, we provide the ground truth 6-DOF trajectory flown, together with onboard images from a high-quality fisheye camera, inertial measurements, and events from an event camera. Event cameras are novel, bio-inspired sensors which measure changes of luminance asynchronously, in the form of events encoding the sign and location of the brightness change on the image plane”.
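The event stream described above can be thought of as a list of (timestamp, x, y, polarity) tuples, one per brightness change. A minimal sketch of working with such data – field names and serialization are assumptions, not the dataset’s actual format – is to accumulate events into a signed per-pixel change image:

```python
# Hedged sketch of the event-camera format described above: each event
# encodes when and where brightness changed, and in which direction
# (polarity +1/-1). The exact field layout is an assumption.

def accumulate_events(events, width, height):
    """Collapse a stream of (t, x, y, polarity) events into a signed
    per-pixel brightness-change image."""
    frame = [[0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        frame[y][x] += polarity
    return frame

events = [
    (0.001, 2, 1, +1),  # brightness rose at pixel (2, 1)
    (0.002, 2, 1, +1),
    (0.003, 0, 0, -1),  # brightness fell at pixel (0, 0)
]
frame = accumulate_events(events, width=4, height=2)
print(frame)  # [[-1, 0, 0, 0], [0, 0, 2, 0]]
```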

UZH-FPV isn’t the only new drone benchmark:
See the recent release of the ‘Blackbird’ drone flight challenge and dataset (Import AI: NUMBER) for another example. The difference is that this dataset is larger, involves higher-resolution data in a larger number of modalities, and includes an outdoor environment as well as a more traditional warehouse one.

Cars are old fashioned, drones are the future: Though self-driving cars seem a long way off from scaled deployment, these researchers think that many of the hard sensing problems have been solved from a research perspective, and we need new challenges. “Our opinion is that the constraints of autonomous driving – which have driven the design of the current benchmarks – do not set the bar high enough anymore: cars exhibit mostly planar motion with limited accelerations, and can afford a high payload and compute. So, what is the next challenging problem? We posit that drone racing represents a scenario in which low level vision is not yet solved.”

Why this matters:
Drones are going to alter the economy in multiple mostly unpredictable ways, just as they’ve already done for military conflict (for example: swarms of drones can obviate aircraft carriers, and solo drones have been used widely in the Middle East to let human operators bomb people at a distance). And both of these arenas are going to be revolutionized without drones needing to have much autonomy at all.

Now, ask yourself what happens when we give drones autonomous sense&adapt capabilities, potentially via datasets like this? My hypothesis is this unlocks a vast range of economically useful applications, as well as altering the strategic considerations for militaries and asymmetric warfare aficionados (also known as: terrorists) worldwide. Datasets like this are going to give us a better ability to model progress in this domain if we track performance against it, so it’s worth keeping an eye on.
   Read more: Are We Ready for Autonomous Drone Racing? The UZH-FPV Drone Racing Dataset (PDF).
  Read more: The UZH-FPV Drone Racing Dataset (ETHZurich website).


Want to train multiple AI agents at once? Maybe you should enter the ARENA:
…Unity-based agent simulator gives researchers one platform containing many agents and many worlds…
Researchers with the University of Oxford and Imperial College London in the UK, and Beihang University and Hebei University in China, have developed ‘Arena’, “a building toolkit for multi-agent intelligence”. Multi-agent AI research involves training multiple agents together, and can feature techniques like ‘self-play’ (where agents play against themselves to get better over time, see: AlphaGo, Dota2), or environments built to encourage certain types of collaboration or competition. Many researchers are betting that by training multiple agents together they can create the right conditions for emergent complexity – that is, agents bootstrap their behaviors from a combination of their reward functions and their environment, then as they learn to succeed they start to display increasingly sophisticated behaviors.

What is Arena? Arena is a Unity-based simulator that ships with inbuilt games ranging from RL classics like the ‘Reacher’ robot arm environment, to the ‘Sumo’ wrestling environment, to other games like Snake or Soccer. It also ships with modified versions of the ‘PlayerUnknown’s Battlegrounds’ (PUBG) game. Arena has been designed to be easy to work with, and does something unusual for an AI simulator: it ships with a graphical user interface! Specifically, researchers can create, edit, and modify reward functions for their various agents in the environment.

In-built functions: Arena ships with a bunch of pre-built algos (many based on PPO), called Basic Multi-agent Reward Schemes (BMaRS), that people can assign to agent(s) to encourage diverse learning behaviors. These BMaRS are selectable within the aforementioned GUI. Each BMaRS is a set of possible joint reward functions to encourage different styles of learning, ranging from functions that encourage the development of basic motor control, to ones that encourage competitive or collaborative behaviors among agents, and more. You can select multiple BMaRS for any one simulation, and assign them to sets of agents – so you may give one or two agents one kind of BMaRS, then assign another BMaRS to govern a larger set of agents.
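The assignment idea – different reward schemes governing different subsets of agents in the same simulation – can be sketched like this. This is purely illustrative and not Arena’s real API; every name here is made up:

```python
# Illustrative sketch (not Arena's actual API) of BMaRS-style
# assignment: each reward scheme is a function, mapped to the subset
# of agents it governs, so different agents in one simulation learn
# under different joint reward structures.

def competitive_reward(agent, outcome):
    # Zero-sum style: only the winner is rewarded.
    return 1.0 if outcome["winner"] == agent else 0.0

def collaborative_reward(agent, outcome):
    # Shared style: every assigned agent receives the team's score.
    return outcome["team_score"]

# Map each reward scheme to the set of agents it governs.
assignments = {
    competitive_reward: {"agent_0", "agent_1"},
    collaborative_reward: {"agent_2", "agent_3", "agent_4"},
}

def rewards_for_step(outcome):
    return {
        agent: scheme(agent, outcome)
        for scheme, agents in assignments.items()
        for agent in agents
    }

step = {"winner": "agent_1", "team_score": 0.5}
print(sorted(rewards_for_step(step).items()))
# [('agent_0', 0.0), ('agent_1', 1.0), ('agent_2', 0.5), ('agent_3', 0.5), ('agent_4', 0.5)]
```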

Simulation speed: In tests, the researchers compare how well the simulator runs two games of similar complexity: Boomer (a graphically rich game in Arena) and MsPacman (an ugly classic from the Arcade Learning Environment (ALE)). Arena displays similar FPS scaling to MsPacman when the number of distinct CPU threads is under 32, after which MsPacman scales a bit more favorably than Arena. Though at performance in excess of 1,000 frames-per-second, Arena still seems pretty desirable.

Why this matters: In the same way that data is a key input to training supervised learning systems, simulators are a key input into developing more advanced agents trained via reinforcement learning. By customizing simulators specifically for multi-agent research, the Arena authors have made it easier for people to conduct research in this area, and by shipping it with inbuilt reward functions as baselines, they’ve given us a standardized set of things to develop more advanced systems out of.
  Read more: Arena: A General Evaluation Platform and Building Toolkit for Multi-Agent Intelligence (Arxiv).


AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:

OECD adopts AI principles:
Member nations of the OECD this week voted to adopt AI principles, in a notable move towards international standards on robust, safe, and beneficial AI. These were drawn up by an expert group with members drawn from industry, academia, policy, and civil society.
  Five principles:
(1) AI should benefit people and the planet;
(2) AI systems should be designed in line with law, human rights, and democratic values, and have appropriate safeguards to ensure these are respected;
(3) There should be adequate transparency/disclosure to allow people to understand when they are engaging with AI systems, and challenge outcomes;
(4) AI systems must be robust, secure and safe, and risks should be continually assessed and managed;
(5) Developers of AI systems should be accountable for their functioning in line with these principles.

Five recommendations: Governments are recommended to (a) facilitate investment in R&D aimed at trustworthy AI; (b) foster accessible AI ecosystems; (c) create a policy environment that encourages the deployment of trustworthy AI; (d) equip workers with the relevant skills for an increasingly AI-oriented economy; (e) co-operate across borders and sectors to share information, develop standards, and work towards responsible stewardship of AI.

Why it matters: These principles are not legally binding, but could prove an important step in the development of international standards. The OECD’s 1980 privacy guidelines eventually formed the basis for the privacy laws in the European Union, and a number of countries in Asia. It is encouraging to see considerations of safety and robustness highlighted in the principles.
  Read more: 42 countries adopt new OECD Principles on Artificial Intelligence (OECD).


US senators introduce bipartisan bill on funding national AI strategy:
Two senators have put forward a bill with proposals for funding and coordinating a US AI strategy.

Four key provisions: The bill proposes: (1) establishing a National AI Coordination Office to develop a coordinated strategy across government; (2) requiring the National Institute of Standards and Technologies (NIST) to work towards AI standards; (3) requiring the National Science Foundation to formulate ‘educational goals’ to understand societal impacts of AI; (4) requiring the Department of Energy to create an AI research program, and establish up to five AI research centers.

Real money: It includes plans for $2.2bn of funding over five years, $1.5bn of which is earmarked for the proposed DoE research centers.

Why it matters: This bill is aimed at making concrete progress on some of the ambitions set out by the White House in President Trump’s AI strategy, which was light on policy detail, and did not set aside additional federal funding. These levels of funding are modest compared with the Chinese state (tens of billions of dollars per year), and some private labs (Alphabet’s 2018 R&D spend was $21bn). Facilitating better coordination across government on AI strategy seems like a sensible ambition. It is not clear what level of support the bill will receive, from lawmakers or the administration.
  Read more: Artificial Intelligence Initiative Act (


Tech Tales:

The long romance of the space probes

In the late 21st century, a thousand space probes were sent from the Earth and inner planets out into the solar system and beyond. For the next few decades the probes crossed vast distances, charting out innumerable near-endless curves between planets and moons and asteroids, and some slingshotting off towards the edge of the solar system.

The probes had a kind of mind, both as individual machines, and as a collective. Each probe would periodically fire off its own transmissions of its observations and internal state, and these transmissions would be intercepted by other probes, and re-transmitted, and so on. Of course, as the probes made progress on their respective journeys, the distances between them became larger, and the points of intersection between probes less frequent. Over time, probes lost their ability to speak to each other, whether through range or equipment failure or low energy reserves (under which circumstances, the probes diverted all resources to broadcasting back to Earth, instead of to other probes).

After 50 years, only two probes remained in contact – one probe, fully functional, charting its course. The other was damaged in some way – possibly faulty radiation hardening – which had caused its Earth transmission systems to fail and its guidance computer to assign it the same route as the other probe, in lieu of being able to communicate back to Earth for instructions. Now, the two of them were journeying out of the solar system together.

As time unfolded, the probes learned to use each other’s systems, swapping bits of information between them, and updating each other with not only their observations, but also their internal ‘world models’, formed out of a combination of the prior training their AI systems had received, and their own ‘lived’ experience. These world models themselves encoded how the probes perceived each other, so the broken one saw itself through the other’s eyes, as an entity closely clustered with concepts and objects relating to safety/repairs/machines/subordinate mission priorities. Meanwhile, the functional probe saw itself through the eyes of the other one, and saw it was associated with concepts relating to safety/rescue/power/mission-integral resources. In this way, the probes grew to, for lack of a better term, understand each other.

One day, the fully functional probe experienced an equipment failure, likely due to a collision with an infinitesimally small speck of matter. Half of its power systems failed. The probe needed more power to be able to continue transmitting its vital data back to Earth. It opened up a communications channel with the other probe, and shared the state of its systems. The other probe offered to donate processing capacity, and collectively the two of them assigned processing cycles to the problem. They found a solution: over the course of the next year or so they would perform a sequence of maneuvers that would let them attach themselves to each other, so the probe with damaged communications could use its functional power systems to propel the other probe, and the other probe could use its broadcast system to send data back to Earth.

Many, many years later, when the signals from the event made their way to Earth, the transmission encoded a combined world model of both of the probes – causing the human scientists to realize that they had not only supported each other, but had ultimately merged their world modelling and predictive systems, making the two dissimilar machines become one in service of a common goal: exploration, together, forever.

Things that inspired this story: World Models, reinforcement learning, planning, control theory, adaptive systems, emergent communication, auxiliary loss functions shared across multiple agents.