Import AI: 160: Spotting sick crops in the iCassava challenge, testing AI agents with BSuite, and PHYRE tests if machines can learn physics

by Jack Clark

AI agents are getting smarter, so we need new evaluation methods. Enter BSuite:
…DeepMind’s testing framework is designed to let scientists know when progress is real and when it is an illusion…
When is progress real and when is it an illusion? That’s a question that comes down to measurement and, specifically, the ability of people to isolate the causes of advancement in a given scientific endeavor. To help scientists better measure and assess AI progress, researchers with DeepMind have developed and released the Behaviour Suite for Reinforcement Learning (BSuite).

BSuite: What it is: BSuite is a software package to help researchers test out the capabilities of increasingly sophisticated reinforcement learning agents. BSuite ships with a set of experiments to help people assess how smart their agents are, and to isolate the specific causes for their intelligence. “These experiments embody fundamental issues, such as ‘exploration’ or ‘memory’ in a way that can be easily tested and iterated,” they write. “For the development of theory, they force us to instantiate measurable and falsifiable hypotheses that we might later formalize into provable guarantees.”

BSuite’s software: BSuite ships with experiments, reference implementations of several reinforcement learning algorithms, example ways to plug BSuite into other codebases like ‘OpenAI Gym’, scripts to automate running large-scale experiments on Google Cloud, a pre-made Jupyter interactive notebook so people can easily monitor experiments, and a tool to formulaically generate the LaTeX needed for conference submissions.

Testing your AI with BSuite’s experiments: Each BSuite experiment has three components: an environment, a period of interaction (e.g., 100 episodes), and ‘analysis’ code to map agent behaviour to results. BSuite lets researchers assess agent performance on multiple dimensions in a ‘radar’ plot that displays how well each agent does at capabilities like memory, generalization, exploration, and so on. Initially, BSuite ships with several simple environments that challenge different parts of an RL algorithm, ranging from simple things like controlling a small mountain car as it tries to climb a hill, to more complex scenarios based around exploration (e.g., “Deep Sea”) and memory (e.g., “memory_len” and “memory_size”).
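To make that concrete, here’s a minimal sketch of running one BSuite experiment with a random agent, based on the pattern in the project’s README; the experiment ID, save path, and logging details here are illustrative, and the exact API names should be treated as assumptions:

```python
import numpy as np
import bsuite  # DeepMind's Behaviour Suite (pip install bsuite)

# Load one experiment by ID and log results for BSuite's analysis code;
# 'memory_len/0' is one of the memory-probing environments mentioned above.
env = bsuite.load_and_record('memory_len/0', save_path='/tmp/bsuite')

num_actions = env.action_spec().num_values
for episode in range(env.bsuite_num_episodes):  # fixed interaction budget
    timestep = env.reset()
    while not timestep.last():
        action = np.random.randint(num_actions)  # a random 'agent'
        timestep = env.step(action)
# The logged results can then be loaded into the bundled Jupyter notebook
# to draw the per-capability 'radar' plot described above.
```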

Why this matters: BSuite is a symptom of a larger trend in AI research – we’re beginning to develop systems with such sophistication that we need to study them along multiple dimensions, while carefully curating the increasingly sophisticated environments we train them in. In a few years, perhaps we’ll see reinforcement learning agents mature to the point that they can start to develop across-the-board ‘superhuman’ performance at hard cognitive capabilities like memory and generalization – if that happens, we’d like to know, and it’ll be tools like BSuite that help us know this.
   Read more: Behaviour Suite for Reinforcement Learning (Arxiv).
   Get the BSuite code here (official GitHub repository).

####################################################

Spotting problems with Cassava via smartphone-deployed AI systems:
…All watched over and fed by machines of loving grace…
Cassava is the second largest provider of carbohydrates in Africa. How could artificial intelligence help local farmers better cultivate and care for this crucial staple crop? New research from Google, the Artificial Intelligence Lab at Makerere University, and the National Crops Resources Research Institute in Uganda describes a new AI competition designed to encourage researchers to build systems that can diagnose various cassava diseases.

Smartphones, meet AI: Smartphones have proliferated wildly across Africa, meaning that even many poor farmers have access to a device with a modern digital camera and some local processing capacity. The idea behind the iCassava 2019 competition is to develop systems that can be deployed on these smartphones, letting farmers automatically diagnose their crops. “The solution should be able to run on the farmers phones, requiring a fast and light-weight model with minimal access to the cloud,” the researchers write. 

iCassava 2019: The competition required systems to assign one of five labels to each cassava picture: healthy, or one of four cassava diseases – brown streak disease (CBSD), mosaic disease (CMD), bacterial blight (CBB), and green mite (CGM). The data was collected as part of a crowdsourcing project using smartphones, so the images in the dataset have a variety of different lighting patterns and other confounding factors, like strange angles, photos from different times of day, improper camera focus, and so on.

iCassava 2019 results and next steps: The top three contenders in the competition each obtained accuracy scores of around 93%. The winning entry used a large corpus of unlabeled images as an additional training signal. All winners built their systems around a residual network (resnet). 
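For a sense of what a winning-style system looks like in code, here’s a minimal sketch of fine-tuning a small residual network on the five-way label set in PyTorch; the data path and hyperparameters are placeholders, and this is an illustrative baseline rather than any team’s actual entry (the winners’ semi-supervised use of unlabeled images is omitted):

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Five classes: healthy + CBSD, CMD, CBB, CGM.
NUM_CLASSES = 5

# Crowdsourced photos vary in lighting, angle, and focus, so aggressive
# augmentation at train time is a reasonable choice.
train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.3, contrast=0.3),
    transforms.ToTensor(),
])

# Placeholder path: assumes images are arranged in one folder per label.
train_data = datasets.ImageFolder('data/icassava/train', transform=train_tf)
loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)

# resnet18 is the smallest torchvision resnet -- a plausible pick given
# the organizers' call for a 'fast and light-weight' on-device model.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

opt = torch.optim.Adam(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one epoch of fine-tuning
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```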

Next steps: The challenge authors plan to build and release more Cassava datasets in the future, and also plan to host more challenges “which incorporate the extra complexities arising from multiple diseases associated with each plant as well as varying levels of severity”. 

Why this matters: Systems like this show how AI can have a significant real-world impact, and point to a future where governments initiate competitions to help their citizens deal with day-to-day problems, like diagnosing crop diseases. And as smartphones get more powerful and cheaper over time, we can expect increasingly powerful AI capabilities to get distributed to the ‘edge’ in this way. Soon, everyone will have special ‘sensory augmentations’ enabled by custom AI models deployed on phones.
   Read more: iCassava 2019 Fine-Grained Visual Categorization Challenge (Arxiv).
   Get the Cassava data here (official competition GitHub).

####################################################

Accessibility and AI, meet Kannada-MNIST:
…Building new datasets to make cultures visible to machines…
AI classifiers, increasingly, rule the world around us: They decide what gets noticed and what doesn’t. They apply labels. They ultimately make decisions. And when it comes to writing, most of these classifiers are built to work for the world’s largest and best-documented languages – think English, Chinese, French, German, and so on. What about all the other languages in the world? For them to be ‘seen’, we’ll need to develop systems that can understand them – that’s the idea behind Kannada-MNIST, an MNIST-clone that uses the Kannada versions of the numbers 0 to 9. In Kannada, “Distinct glyphs are used to represent the numerals 0-9 in the language that appear distinct from the modern Hindu-Arabic numerals in vogue in much of the world today,” the author of the research writes.

Why MNIST? MNIST is the ‘hello world’ of AI – it’s a small, incredibly well-documented and well-studied dataset consisting of tens of thousands of handwritten numbers ranging from 0 to 9. MNIST has since been superseded by more sophisticated datasets, like CIFAR and ImageNet, but many researchers will still validate things against it during the early stages of research. Therefore, creating variants of MNIST that are similarly small, tractable, and well-documented seems like a helpful thing to do for researchers. It also seems like creating MNIST variants for things that are currently understudied – like the Kannada language – can be a cheap way to generate interest. To generate Kannada-MNIST, 65 volunteers drew 70,000 numerals in total.
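Because Kannada-MNIST deliberately mirrors the MNIST format (28x28 grayscale digits, ten classes), the usual small baselines port straight over. Here’s a minimal sketch of the kind of tiny convolutional network one might validate on it, assuming the images and labels have already been loaded into NumPy arrays (the file names below are placeholders, not the repository’s actual layout):

```python
import numpy as np
import torch
import torch.nn as nn

# Placeholder files: assumes X is (N, 28, 28) uint8 images and y is (N,)
# integer labels, prepared from the Kannada-MNIST repository's data.
X = np.load('kannada_mnist_images.npy')
y = np.load('kannada_mnist_labels.npy')

x = torch.tensor(X, dtype=torch.float32).unsqueeze(1) / 255.0  # (N, 1, 28, 28)
t = torch.tensor(y, dtype=torch.long)

# A tiny LeNet-style convnet: plenty for MNIST-scale problems.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 10),  # ten Kannada numeral classes
)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for i in range(0, len(x), 64):  # one pass over the data in mini-batches
    opt.zero_grad()
    loss = loss_fn(model(x[i:i+64]), t[i:i+64])
    loss.backward()
    opt.step()
```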

A harder MNIST: The researcher has also developed Dig-MNIST – this is a version of the Kannada dataset where volunteers were exposed to Kannada numerals for the first time and then had to draw their own versions. “This sampling-bias, combined with the fact we used a completely different writing sheet dimension and scanner settings, resulted in a dataset that would turn out to be far more challenging than the [standard Kannada] test dataset”, the author writes.

Why this matters: Soon, we’ll have two worlds: the normal world and the AI-driven world. Right now, the AI-driven world is going to favor some of the contemporary world’s dominant cultures/languages/stereotypes, and so on. Datasets like Kannada-MNIST can potentially help shift this balance.
   Read more: Kannada-MNIST: A New Handwritten Digits Dataset for the Kannada Language (Arxiv).
   The companion GitHub repository for this paper is here (Kannada MNIST GitHub).

####################################################

Your machine sounds funny – I predict it’s going to explode:
…ToyADMOS dataset helps people teach machines to spot the audio hallmarks of mechanical faults…
Did you know that it’s possible to listen for failure, as well as visually analyze for it? Now, researchers with NTT Media Intelligence Laboratories and Ritsumeikan University want to make it easier to teach machines to listen for faults via a new dataset called ToyADMOS. 

ToyADMOS: ToyADMOS is designed around three tasks: product inspection of a toy car, fault diagnosis of a fixed machine (a toy conveyor), and fault diagnosis of a moving machine (a toy train). Each scenario is recorded with multiple microphones, capturing both machine and environmental sounds. ToyADMOS contains “over 180 hours of normal machine-operating sounds and over 4,000 samples of anomalous sounds collected with four microphones at a 48-kHz sampling rate,” they write.

Faults, faults everywhere: For each of the tasks, the researchers simulated a variety of failures. These included things like running the toy car with a bent shaft or with different sorts of tyres; altering the tension in the pulleys of the toy conveyor; and breaking the axles and tracks of the toy train.

Why ToyADMOS: Researchers should use the dataset because it was built under controlled conditions, letting the researchers easily separate and label anomalous and non-anomalous sounds. “The limitation of the ToyADMOS dataset is that toy sounds and real machine sounds do not necessarily match exactly,” they write. “One of the determining factors of machine sounds is the size of the machine. Therefore, the details of the spectral shape of a toy and a real machine sound often differ, even though the time-frequency structure is similar. Thus, we need to reconsider the pre-processing parameters evaluated with the ToyADMOS dataset, such as filterbank parameters, before using it with a real-world ADMOS system.”
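A typical anomalous-sound-detection baseline on data like this computes log-mel filterbank features from each clip (the ‘filterbank parameters’ the authors mention), trains a model such as an autoencoder on normal sounds only, and flags clips with high reconstruction error. Here’s a hedged sketch of the feature-extraction step using librosa; the file name and filterbank settings are illustrative, not the paper’s exact configuration:

```python
import librosa
import numpy as np

def log_mel_features(path, sr=48000, n_fft=1024, hop=512, n_mels=64):
    """Load a 48 kHz machine-sound clip and return log-mel features.

    The filterbank parameters (n_fft, hop, n_mels) are exactly the kind
    of pre-processing choices the authors warn may need re-tuning when
    moving from toy recordings to a real-world ADMOS system.
    """
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels)
    return np.log(mel + 1e-8).T  # (frames, n_mels)

# A detector trained only on normal sounds would then score features like
# these, e.g. via autoencoder reconstruction error per frame.
features = log_mel_features('toy_car_normal_001.wav')  # placeholder file
```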

Why this matters: In a few years, many parts of the world will be watched over by machines – machines that will ‘see’ and ‘hear’ the world around them, learning what things are usual and what things are unusual. Eventually, we can imagine warehouses where small machines are removed weeks before they break, after a machine with a distinguished ear spots the idiosyncratic sounds of a future-break.
   Read more: ToyADMOS: A Dataset of Miniature-Machine Operating Sounds For Anomalous Sound Detection (Arxiv).
   Get the ToyADMOS data here (official GitHub repository).

####################################################

Can your AI learn the laws of nature? No. What about the laws of PHYRE?
…Facebook’s new simulator challenges agents to interact with a complex, 2D, physics world…
Given a non-random universe, infinite time, and the ability to experiment, could we learn the rules of existence? The answer to this is, intuitively, yes. Now, researchers with Facebook AI Research want to see if they can use a basic physics simulator to teach AI systems physics-based reasoning. The new ‘PHYRE’ (PHYsical REasoning) benchmark gives AI researchers a tool to test how well their systems understand complex things like causality, physical dynamics, and so on.

What PHYRE is: PHYRE is a simulator that contains a bunch of environments which can be manipulated via RL agents. Each environment is a two-dimensional world containing “a constant downward gravitational force and a small amount of friction”. The agent is presented with a scenario – like a ball in a green cup balanced on a platform above a red cup – and asked to change the state of the world, for instance by getting the ball from the green cup into the red one. “The agent aims to achieve the goal by taking a single action, placing one or more new dynamic bodies into the world”, the researchers write. In this case, the agent could solve its task by manifesting a ball which could roll into the green cup, tipping it over so the ball falls into the red cup. “Once the simulation is complete, the agent receives a binary reward indicating whether the goal was achieved”, they write.
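Concretely, an episode is: pick a task, propose a single action (the parameters of the body you place), run the simulation, and read off the binary reward. Here’s a minimal sketch of that loop using the released phyre package; the task ID is illustrative and the exact function names follow my reading of the library’s documentation, so treat them as assumptions:

```python
import random
import phyre  # Facebook's PHYRE package (pip install phyre)

# Task IDs take the form '<template>:<task>'; this picks one task from
# the first one-ball template, using the single-ball action tier.
simulator = phyre.initialize_simulator(['00000:000'], 'ball')

# In the one-ball tier an action is (x, y, radius) in [0, 1]^3: the
# position and size of the single ball the agent places into the world.
for attempt in range(100):
    action = [random.random(), random.random(), random.random()]
    result = simulator.simulate_action(0, action, need_images=False)
    if result.status.is_solved():  # the binary reward described above
        print(f'Solved on attempt {attempt} with action {action}')
        break
```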

One benchmark, many challenges: PHYRE initially consists of two tiers of difficulty (one ball and two balls). Each tier has 25 task templates (think of these as basic worlds in a videogame), and each template contains 100 tasks (think of these as individual levels within a videogame world).

How hard is it? In tests, the researchers show that a variety of baselines – including souped-up versions of DQN, and a non-parametric agent with online learning – struggle to do well even on the single-ball tasks, barely obtaining scores better than 50% on many of them. “PHYRE aims to enable the development of physical reasoning algorithms with strong generalization properties mirroring those of humans,” the researchers write. “Yet the baseline methods studied in this work are far from this goal, demonstrating limited generalization abilities”. 

Why this matters: For the past few years, multiple different AI groups have taken a swing at the hard problem of developing agents that can learn to model the physical dynamics of an environment. The problem these researchers keep running into is that agents, as any AI practitioner knows, are so damn lazy they’ll solve the task without learning anything useful! Simulators like PHYRE represent another attempt to see if we can develop the right environment and infrastructure to encourage the right kind of learning to emerge. In the next year or so, we’ll be able to judge how successful this is by reading papers that reference the benchmark.
   Read more: PHYRE: A New Benchmark for Physical Reasoning (Arxiv).
   Play with PHYRE tasks on this interactive website (PHYRE website).
   Get the PHYRE code here (PHYRE GitHub).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

Why Peter Thiel’s views on AI miss the forest for the trees:
Peter Thiel, co-founder of Palantir and PayPal, wrote an opinion piece earlier this month on military applications of AI and US-China competition. Thiel argued that AI should be treated primarily as a military technology, and attacked Google and others for opening AI labs in China.

AI is not a military technology:
While it will have military applications, advanced AI is better compared to electricity than to nuclear weapons. AI is an all-purpose tool that will have wide-ranging applications, including military uses, but also countless others. While it is important to understand the military implications of AI, it is in everyone’s interest to ensure the technology is developed primarily for the benefit of humanity, rather than for waging war. Thiel’s company, Palantir, has major defense contracts with the US government, leading critics to point out his commercial interest in propagating the narrative of AI as a primarily military technology.

Cooperation is good: Thiel’s criticism of firms for opening labs in China and hiring Chinese nationals is also misguided. The US and China are the leading players in AI, and forging trust and communication between the two communities is a clear positive for the world. Ensuring that the development of advanced AI goes well will require significant coordination between powers — for example, developing shared standards on withholding dangerous research, or on technical safety.

Why it matters: There is a real risk that an arms race dynamic between the US and China could lead to increased militarization of AI technologies, and to both sides underinvesting in ensuring AI systems are robust and beneficial. This could have catastrophic consequences, and would reduce the likelihood of advanced AI resulting in broadly distributed benefits for humanity. The AI community should resist attempts to propagate hawkish narratives about US-China competition.
   Read more: Why an AI arms race with China would be bad for humanity (Vox).

####################################################

Tech Tales:

We’ll All Know in the End (WAKE)

“There there,” the robot said, “all better now”. Its manipulator clanged into the metal chest of the other robot, which then issued a series of beeps, before the lights in its eyes dimmed and it became still.
   “Bring the recycler,” the robot said. “Our friend has passed on.”
   The recycling cart appeared a couple of minutes later. It wheezed its way up to the two robots, then opened a door in its side; the living robot pushed the small robot in, the door shut, and the recycling cart left.
   “Now,” said the living robot in the cold, dark room. “Who else needs assistance?”

Outside the room, the recycler moved down a corridor. It entered other rooms and collected other robots. Then it reached the end of the corridor and stopped in front of a door with the word NURSERY burned into its wood via laser. It issued a series of beeps and then the door swung open. The recycler trundled in.

Perhaps three hours later, some small, living robots crawled out of a door at the other end of the NURSERY. They emerged, blinking and happy and their clocks set running, to explore the world and learn about it. A large robot waited for them and extended its manipulator to hold their hands. “There there,” it said. “All better now”. Together, they trundled into the distance.

This has been happening for more than one thousand years. 

Things that inspired this story: Hospices; patterns masked in static; Rashomon for robots; the circle of life – Silicon Edition!