Import AI

Import AI 143: Predicting car accident risks by looking at the houses people live in; why data matters as much as compute; and using capsule networks to generate synthetic data

Predicting car accident risks from Google Street View images:
The surprising correspondences between different types of data…
Researchers with the University of Warsaw and Stanford University have shown how to use pictures of people’s houses to better predict the chances of those people getting into a car accident. (Import AI administrative note – standard warnings that ‘correlation does not imply causation’ apply).

For the project, the researchers analyze 20,000 addresses of insurance company clients – a random sample of an insurer’s portfolio collected in Poland between January 2012 and December 2015. For each address, they collect an overhead Google satellite view and a Google Street View image of the property, and humans then annotate the images with labels relating to the property’s type, age, and condition, the estimated wealth of its residents, and the type and density of buildings in the neighborhood. They subsequently test these variables and find that five of the seven are significant with regard to the insurance prediction problem.

  “Despite the high volatility of data, adding our five simple variables to the insurer’s model improves its performance in 18 out of 20 resampling trials and the average improvement of the Gini coefficient is nearly 2 percentage points,” they write.
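
For readers unfamiliar with the metric: the Gini coefficient used to evaluate insurance risk models is commonly computed as 2*AUC - 1. A minimal sketch of that calculation (not the authors' own evaluation code) looks like this:

```python
# Minimal sketch: Gini coefficient for a claim-risk model, assuming the common
# Gini = 2*AUC - 1 formulation used in insurance model evaluation.
from sklearn.metrics import roc_auc_score

def gini(y_true, y_score):
    """y_true: 1 if the client had an accident claim, else 0; y_score: predicted risk."""
    return 2 * roc_auc_score(y_true, y_score) - 1

# The reported ~2 percentage point gain then corresponds to something like:
# gini(y, preds_with_house_variables) - gini(y, preds_baseline) ≈ 0.02
```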

Ultimately, they show that – to a statistically significant extent – “features visible on a picture of a house can be predictive of car accident risk, independently from classically used variables such as age, or zip code”.

Why this matters: Studies like this speak to the power of large-scale data analysis, highlighting how data that is innocuous at the level of the individual can become significant when compared and contrasted with a vast amount of other data. The researchers acknowledge this, noting that:  “modern data collection and computational techniques, which allow for unprecedented exploitation of personal data, can outpace development of legislation and raise privacy threats”.
  Read more: Google Street View image of a house predicts car accident risk of its resident (Arxiv).

#####################################################

Your next pothole could be inspected via drone:
…Drones + NVIDIA cards + smart algorithms = automated robot inspectors…
Researchers with the HKUST Robotics Institute have created a prototype drone system that can automatically analyze a road surface. The project sees the researchers develop a dense stereo vision algorithm which runs onboard the UAV, letting it process road images in real-time and automatically identify disparities in the surface.

Hardware: To accomplish this, they use a ZED stereo camera mounted on a DJI Matrice 100 drone, which itself has an NVIDIA Jetson TX2 GPU installed onboard for real-time processing.
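
The paper's dense stereo algorithm isn't reproduced here, but a minimal OpenCV sketch of computing a disparity map from a rectified stereo pair gives a feel for the general approach (parameters and file names are illustrative, not the authors' settings):

```python
# Minimal sketch: dense disparity from a rectified stereo pair via OpenCV's
# semi-global block matcher. A generic baseline, not the paper's algorithm;
# parameter values and file names are illustrative only.
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=5)
disparity = matcher.compute(left, right).astype("float32") / 16.0  # OpenCV returns fixed-point values

# Large local deviations of `disparity` from a fitted road-plane model indicate
# potholes and other surface damage.
```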

Why this matters: AI approaches make it cheap for robots to automatically sense&analyze aspects of the world, and experiments like this suggest that we’re rapidly approaching the era when we’ll start to automate various types of surveillance (both for civil and military purposes) via drones.
  Read more: Real-Time Dense Stereo Embedded in a UAV for Road Inspection (Arxiv).
  Get the datasets used in the experiment here (Rui Fan, HKUST, personal website).
  Check out a video of the drone here (Rui Fan, YouTube).

#####################################################

Train AI to watch over the world with the iWildCam dataset:
…Monitoring the planet with deep learning-based systems…
Researchers with the California Institute of Technology have published the iWildCam dataset to help people develop AI systems that can automatically analyze wildlife seen in camera traps spread across the American Southwest. They’ve also created a challenge based around the dataset, letting researchers compete in developing AI systems capable of automatically monitoring the world.

Testing generalization: “If we wish to build systems that are trained once to detect and classify animals, and then deployed to new locations without further training, we must measure the ability of machine learning and computer vision to generalize to new environments,” the researchers write.

Common nuisances: There are six problems relating to the data gathered from the traps: variable illumination, motion blur, size of the region of interest (eg, an animal might be small and far away from the camera), occlusion, camouflage, and perspective.

iWildCam: The images come from cameras installed across the American Southwest, consisting of 292,732 images spread between 143 locations. iWildCam is designed to capture the complexities of the datasets that human biologists need to deal with: “therefore the data is unbalanced in the number of images per location, distribution of species per location, and distribution of species overall”, they write.
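
One concrete way to measure the cross-location generalization the authors care about is to hold out entire camera locations, rather than random images, at test time. A minimal sketch (the metadata field name here is hypothetical, not the dataset's schema):

```python
# Minimal sketch: split camera-trap data by location rather than by image, so
# the test set only contains cameras never seen during training. The
# "location_id" field name is hypothetical, not the dataset's actual schema.
import random

def split_by_location(records, test_fraction=0.2, seed=0):
    locations = sorted({r["location_id"] for r in records})
    random.Random(seed).shuffle(locations)
    test_locs = set(locations[:int(len(locations) * test_fraction)])
    train = [r for r in records if r["location_id"] not in test_locs]
    test = [r for r in records if r["location_id"] in test_locs]
    return train, test
```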

Why this matters: Datasets like this – and AI systems built on top of them – will be fundamental to automating the observation and analysis of the world around us; given the increasingly chaotic circumstances of the world, it seems useful to be able to have machines automatically analyze changes in the environment for us.
   Read more: The iWildCam 2018 Challenge Dataset (Arxiv).
   Get the dataset: iWildCam  2019 challenge (GitHub).

#####################################################

Compute may matter, but so does data, says Max Welling:
…”The most fundamental lesson of ML is the bias-variance tradeoff”…
A few weeks ago Richard Sutton, one of the pioneers of reinforcement learning, wrote a post about the “bitter lesson” of AI research (Import AI #138), namely that techniques which use huge amounts of computation and relatively simple algorithms are the better bet for researchers to focus on. Now, Max Welling, a researcher with the University of Amsterdam, has written a response claiming that data may be just as important as compute.

  “The most fundamental lesson of ML is the bias-variance tradeoff: when you have sufficient data, you do not need to impose a lot of human generated inductive bias on your model,” he writes. “However, when you do not have sufficient data available you will need to use human-knowledge to fill the gaps.”
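
For reference, here is the standard decomposition Welling is invoking, written for a regression target y = f(x) + noise with noise variance sigma^2:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \sigma^2
```

Strong, human-designed inductive biases shrink the variance term when data is scarce, at the price of extra bias; with enough data you can drop those biases and let the model learn the structure itself.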

Self-driving cars are a good example of a place where compute can’t solve most problems, and you need to invest in injecting stronger priors (eg, an understanding of the physics of the world) into your models, Welling says. He also suggests generative models could help fill in some of these gaps, especially when it comes to generalization.

Ultimately, Welling ends up somewhere between the ‘compute matters’ versus the ‘strong priors matter’ (eg, data) arguments. “I would say if we ever want to solve Artificial General Intelligence (AGI) then we will need model-based RL,” he writes. “We cannot answer the question of whether we need human designed models without talking about the availability of data.”

Why this matters: There’s an inherent tension in AI research between bets that revolve predominantly around compute and those that revolve around data. That’s likely because different bets encourage different research avenues and different specializations. I do worry about a world where people that do lots of ‘big compute’ experiments end up speaking a different language to those without, leading to different priors when approaching the question of how much computation matters.
  Read more: Do we still need models or just more data and compute? (Max Welling, PDF).

#####################################################

Want to train AI on something but don’t have much data? There’s a way!
…Using Capsule Networks to generate synthetic data…
Researchers with the University of Moratuwa want to be able to teach machines to recognize handwritten characters using very small amounts of data, so have implemented an approach based on Capsule Networks – a recently-proposed technique promoted by deep learning pioneer Geoff Hinton – that lets them learn to classify handwritten letters from as few as 200 examples.

The main way they achieve this is by synthetically augmenting these small datasets by using some of the idiosyncratic traits of capsule networks – namely, their ability to learn data representations that are more robust to transforms, as a consequence of their technical implementation of things like ‘routing by agreement‘. The researchers use these traits to directly manipulate the sorts of data representations being produced on exposure to the data to algorithmically generate handwritten letters that look similar to those in the training dataset, but are not identical; this generates additional data that the system can be trained on, without needing to collect more data from (expensive!) reality.

“By adding a controlled amount of noise to the instantiation parameters that represent the properties of an entity, we transform the entity to characterize actual variations that happen in reality. This results in a novel data generation technique, much more realistic than augmenting data with affine transformations,” they write. “The intuition behind our proposed perturbation algorithm is that by adding controlled random noise to the values of the instantiation vector, we can create new images, which are significantly different from the original images, effectively increasing the size of the training dataset”.
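
A minimal sketch of that perturbation idea, assuming you already have a trained capsule network whose decoder reconstructs an image from an instantiation vector (the encode/decode function names here are placeholders, not the authors' API):

```python
# Minimal sketch of instantiation-vector perturbation for data augmentation.
# `encode_to_capsules` and `decode_from_capsules` are placeholders for a trained
# capsule network's encoder and reconstruction decoder, not the authors' code.
import numpy as np

def generate_variants(image, encode_to_capsules, decode_from_capsules,
                      n_variants=10, noise_scale=0.05):
    capsule_vector = encode_to_capsules(image)  # instantiation parameters of the entity
    variants = []
    for _ in range(n_variants):
        noise = np.random.uniform(-noise_scale, noise_scale, capsule_vector.shape)
        variants.append(decode_from_capsules(capsule_vector + noise))
    return variants  # new training images, carrying the same label as `image`
```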

How well does it work? The researchers test their approach by evaluating how well TextCaps can learn to classify images when trained on full datasets and 200-sample-size datasets from EMNIST, MNIST and the much more visually complex Fashion MNIST; TextCaps is able to exceed state-of-the-art when trained on full data of three variants of EMNIST and gets close to this using just 200 samples, and approaches SOTA on MNIST and Fashion MNIST (though does very badly on Fashion MNIST when using just 200 samples, likely because of this complexity).

Why this matters: Approaches like this show how as we develop increasingly sophisticated AI systems we may be able to better deal with some of the limitations imposed on us by reality – like a lack of large, well-labeled datasets for many things we’d like to use AI on (for instance: learning to spot and classify numerous handwritten languages for which there are relatively few digitized examples). “We intend to extend this framework to images on the RGB space, and with higher resolution, such as images from ImageNet and COCO. Further, we intend to apply this framework on regionally localized languages by extracting training images from font files,” they write.
  Read more:  TextCaps: Handwritten Character Recognition with Very Small Datasets (Arxiv).
  Read more: Understanding Hinton’s Capsule Networks (Medium).
  Read more: How Capsules Work (Medium).
  Read more: Understanding Dynamic Routing between Capsules (Capsule Networks explainer on GitHub).

#####################################################

Want to test language progress? Try out SuperGLUE:
…Step aside GLUE – you were too easy!…
Researchers with New York University have had to toss out a benchmark they developed last year and replace it with a harder one, due to the faster-than-expected progress in certain types of language modelling. The ‘SuperGLUE’ benchmark is a sequel to GLUE and has been designed to include significantly harder tasks than those which were in GLUE.

New tasks to frustrate your systems: Tasks in SuperGLUE include: CommitmentBank, where the goal is to judge how committed an author is to a specific clause within a sentence; the Choice of Plausible Alternatives (COPA), in which the goal is to pick the more likely sentence given two options; the Gendered Ambiguous Pronoun Coreference Task (GAP), where systems need to ‘determine the referent of an ambiguous pronoun’; the Multi-Sentence Reading Comprehension dataset, a true-false question-answering task; RTE, a textual-entailment task which was in GLUE 1.0; WIC, which challenges systems to do word-sense disambiguation; and the Winograd Schema Challenge, a reading comprehension task designed specifically to test for world modeling or the lack of it (eg, systems that think large objects can go inside small objects, and vice versa).

PyTorch toolkit: The researchers plan to release a toolkit based on PyTorch and software from AllenNLP which will include pretrained models like OpenAI GPT and Google BERT, as well as designs to enable rapid experimentation and prototyping. As with GLUE, there will be an online leaderboard that people can compete on.
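
As a rough illustration of the sort of pipeline such a toolkit wraps, here is a minimal sketch of fine-tuning BERT on a two-way entailment task like RTE with the pytorch-pretrained-bert package (the data handling and hyperparameters are illustrative; this is not the SuperGLUE toolkit itself):

```python
# Minimal sketch: fine-tuning BERT for a sentence-pair task such as RTE using
# the pytorch-pretrained-bert package. Illustrative only, not the SuperGLUE
# toolkit; segment ids and proper batching are omitted for brevity.
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

premise, hypothesis = "A cat sat on the mat.", "There is a cat on the mat."
tokens = ["[CLS]"] + tokenizer.tokenize(premise) + ["[SEP]"] + \
         tokenizer.tokenize(hypothesis) + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
label = torch.tensor([1])  # label convention (1 = entailment) is illustrative

loss = model(input_ids, labels=label)  # returns the classification loss
loss.backward()                        # one step of an ordinary fine-tuning loop
```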

Why this matters: Well-designed benchmarks are one of the best tools we have available to us to help judge AI progress, so when benchmarks are rapidly obviated via progress in the field it suggests that the field is developing quickly. The researchers believe SuperGLUE is sufficiently hard that it’ll take a while to solve, so think “there is plenty of space to test new creative approaches on a broad suite of difficult NLP tasks with SuperGLUE.”
  Read more: Introducing SuperGLUE: A New Hope Against Muppetkind (Medium).
  Read more: SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (PDF).

#####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

European Commission releases pilot AI ethics guidelines:
Last year, the European Commission announced the formation of the High-Level Expert Group on AI, a core component of Europe’s AI strategy. The group released draft ethics guidelines in December (see Import #126), and embarked on a consultation process with stakeholders and member states. This month they released a new draft, and will be running a pilot program through 2019.

   Key requirements for trustworthy AI: The guidelines lay out 7 requirements: Human agency and oversight; Technical robustness and safety; Privacy and data governance; Transparency; Diversity, non-discrimination and fairness; Societal and environmental wellbeing; Accountability.

  International guidelines: The report makes clear the Commission’s ambition to play a leading role in developing internationally-agreed AI ethics guidelines.

  Why it matters: The foregrounding of AI safety (‘technical robustness and safety’ in the language of the guidelines) is good news. The previous draft revealed that long-term concerns had proved highly controversial amongst the experts, and asked specifically for consultation input on these issues. This latest draft suggests that the public and other stakeholders take these concerns seriously.
  Read more: Communication – Building Trust in Human Centric AI (EC).

Microsoft refuses to sell face recognition due to human rights concerns:
In a talk at Stanford, Microsoft President Brad Smith described recent deals Microsoft had declined due to ethical concerns. He revealed that the company refused to provide face recognition technology to a California law enforcement agency. Microsoft concluded the proposed roll-out would have disproportionately impacted women and ethnic minorities. The company also declined a deal with a foreign country to install face recognition across the nation’s capital, due to concerns that it would have suppressed freedom of assembly.
  Read more: Microsoft turned down facial-recognition sales on human rights concerns (Reuters)

#####################################################

Tech Tales:

Until Another Dream

I get up and I hunt down the things that are too happy or too sad and I take them out of the world. This is a civil-general world and by decree we cannot have extremes. So I take their broken shapes with me and I put them in a chest in the basement of my simulated castle. Then I take my headset off and I go to my nearby bar and the barman calls me “dreamkiller” as his way of being friendly.
What dreams did you kill today, dreamkiller?
You still dream about that butterfly with the face of a kitten you whacked?
Ever see any more of those sucking-face spiders?
What happened to the screaming paving slabs, anyway?
You get the picture.

The thing about today is everyone is online and online is full of so much money that it’s just like real life: most people don’t see the most extreme parts of it, and by a combination of market pressures and human preferences, some people get paid to either erase the extremes or hide them away.

After the bar I go home and I get into bed and my muscle memory has me pick up the headset and have it almost on my head before my conscious brain kicks in – what some psychologists call The Supervisor. “Do I really want to do this?” my supervisor asks me. “Why not go to bed?”

I don’t answer myself directly, instead I slide the headset on, turn it on, and go hunting. There have been reports of unspeakably cute birds carrying wicker baskets containing smaller baby birds in the south quadrant. Meanwhile up in the north there’s some kind of parasite that eats up the power sub-systems of the zones, projecting worms onto all the simulated telescreens.

My official job title is Reality Harmonizer and my barman calls me Dreamkiller and I don’t have a name for myself: this is my job and I do it not willingly, but because my own tastes and habits compel me to do it. I have begun to wonder if real-life murderers and murder-police are themselves people that take off their headsets at night and go to bars. I have begun to wonder whether they themselves find themselves in the middle of the night choosing between sleep and a kind of addictive duty. I believe the rules change when fairytales are real.

Things that inspired this story: MMOs; the details change but the roles are always the same; detectives; noir; feature-space.

Import AI 142: Berkeley spawns cheap ‘BLUE’ arm; Google trains neural nets to prove math theorems; seven questions about GANs

Google reveals HOList, a platform for doing theorem proving research with deep learning-based methods:
…In the future, perhaps more math theorems will be proved by AI systems than humans…
Researchers with Google want to develop and test AI systems that can learn to solve mathematical theorems, so have made tweaks to theorem proving software to make it easier for AI systems to interface with. In addition, they’ve created a new theorem proving benchmark to spur development in this part of AI.

HOList: The software they base their system on is called HOL Light. For this project, they develop “an instrumented, pre-packaged version of HOL Light that can be used as a large scale distributed environment of reinforcement learning for practical theorem proving using our new, well-defined, stable Python API”. This software ships with 41 “tactics”, which are basically algorithms used to help prove math theorems.

Benchmarks: The researchers have also released a new benchmark on HOL Light, and they hope this will “enable research and measuring progress of AI driven theorem proving in large theories”. The benchmarks are initially designed to measure performance on a few tasks, including: predicting the same methodologies used by humans to create a proof; and trying to prove certain subgoals or aspects of proofs without access to full information.

DeepHOL: They design a neural network-based theorem prover called DeepHOL which tries to concurrently encode the goals and premises while generating a proof. “In essence, we propose a hybrid architecture that both predicts the correct tactic to be applied, as well as rank the premise parameters required for meaningful application of tactics”. They test out a variety of different neural network-based approaches within this overall architecture and train them via reinforcement learning, with the best system able to prove 58% of the proofs in the training set – no slam-dunk, but very encouraging considering these are learning-based methods.
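
To make the 'hybrid architecture' concrete, here is a schematic PyTorch sketch of a model with goal and premise encoders feeding a tactic-classification head and a premise-ranking head; this is purely illustrative, not the DeepHOL implementation:

```python
# Schematic sketch of a tactic-prediction + premise-ranking model. Purely
# illustrative of the two-headed idea described above, not DeepHOL itself.
import torch
import torch.nn as nn

class TacticAndPremiseRanker(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, n_tactics=41):
        super().__init__()
        self.goal_encoder = nn.Sequential(
            nn.EmbeddingBag(vocab_size, embed_dim), nn.Linear(embed_dim, embed_dim), nn.ReLU())
        self.premise_encoder = nn.Sequential(
            nn.EmbeddingBag(vocab_size, embed_dim), nn.Linear(embed_dim, embed_dim), nn.ReLU())
        self.tactic_head = nn.Linear(embed_dim, n_tactics)    # which of the 41 tactics to apply
        self.rank_head = nn.Bilinear(embed_dim, embed_dim, 1) # score each goal-premise pair

    def forward(self, goal_tokens, premise_tokens):
        g = self.goal_encoder(goal_tokens)        # (batch, embed_dim)
        p = self.premise_encoder(premise_tokens)  # (batch, embed_dim)
        return self.tactic_head(g), self.rank_head(g, p)
```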

Why this matters: Theorem proving feels like a very promising way to test the capabilities of increasingly advanced machines, especially if we’re able to develop systems that start to generate new proofs. This would be a clear validation of the ability for AI systems to create novel scientific insights in a specific domain, and I suspect would give us better intuitions about AI’s ability to transform science more generally as well.  “We hope that our initial effort fosters collaboration and paves the way for strong and practical AI systems that can learn to reason efficiently in large formal theories,” they write.
  Read more: HOList: An Environment for Machine Learning of Higher-Order Theorem Proving (Extended Version).

#####################################################

Think GANs are interesting? Here are seven underexplored questions:
…Googler searches for the things we know we don’t know…
Generative adversarial networks have become a mainstay component of recent AI research given their utility in creative applications, where you need to teach a neural network about some data well enough that it can generate synthetic data that looks similar to the source, whether videos or images or audio.

But GANs are quite poorly understood, so researcher Augustus Odena has published an essay on Distill listing seven open questions about GANs.

The seven questions: These are:
– What are the trade-offs between GANs and other generative models?
– What sorts of distributions can GANs model?
– How can we scale GANs beyond image synthesis?
– What can we say about the global convergence of the training dynamics?
– How should we evaluate GANs and when should we use them?
– How does GAN training scale with batch size?
– What is the relationship between GANs and adversarial examples?

Why this matters: Better understanding how to answer these questions will help researchers better understand the technology, which will allow us to make better predictions about the economic costs of training GAN systems and the likely failure modes to expect, and will point to future directions for work. It’s refreshing to see researchers publish exclusively about the problems and questions related to a technique, and I hope to see more scholarship like this.
  Read more: Open Questions about Generative Adversarial Networks (Distill).

#####################################################

Human doctors get better with aid of AI-based diagnosis system:
…MRNet dataset, competition, and research, should spur research into aiding clinicians with pre-trained medical-problem-spotting neural nets…
Stanford University researchers have developed a neural network-based technique to assess knee MR scans for abnormalities and a few specific diagnoses (eg, ligament tears). They find that clinicians who have access to this model have a lower rate of mistaken diagnoses than those without access to it. When using this model, “for every 100 healthy patients, ~5 are saved from being unnecessarily considered for surgery,” they write.

MRNet dataset: Along with their research, they’ve also released an underlying dataset: MRNet, a collection of 1,370 knee MRI exams performed at Stanford University Medical Center, spread across normal and abnormal knees.

Competition: “We are hosting a competition to encourage others to develop models for automated interpretation of knee MRs,” the researchers write. “Our test set (called internal validation set in the paper) has its ground truth set using the majority vote of 3 practicing board-certified MSK radiologists”.

Why this matters: Many AI systems are going to augment rather than substitute for human skills, and I expect this to be especially frequent in medicine, where we can expect to give clinicians more and more AI advisor systems to use when making diagnoses. In addition, datasets are crucial to the development of more sophisticated medical AI systems and competitions tend to drive attention towards a specific problem – so the release of both in addition to the paper should spur research in this area.
  Read more and register to download the dataset here: MRNet Dataset (Stanford ML Group).
  Read more about the underlying research: MRNet: Deep-learning-assisted diagnosis for knee magnetic resonance imaging (Stanford ML Group).

#####################################################

As AI hype fades, applications arrive:
…Now we’ve got to superhuman performance we need to work on human-computer interaction…
Jeffrey Bigham, a human-computer interaction researcher, thinks that AI is heading into an era of less hype – and that’s a good thing. This ‘AI autumn’ is a natural successor to the period we’re currently in, since we’re moving from the development to the deployment phase of many AI technologies.

Goodbye hype, hello applications:
“Hype deflates when humans are considered,” Bigham writes. “Self-driving cars seem much less possible when you think about all the things human drivers do in addition to the driving on well-known roads in good lighting conditions. They find passengers, they get gas, they fix the car sometimes, they make sure drunk passengers aren’t in danger, they walk elderly passengers into the hospital, etc”.

Why this matters: “If hype is at the rapidly melting tip of the iceberg, then the great human-centered applied work is the super large mass floating underneath supporting everything,” he writes. And, as most people know, working with humans is challenging and endlessly surprising, so the true test of AI capabilities will be to first reach human parity at certain things, then be deployed in ways that make sense to humans.
  Read more: The Coming AI Autumn (Jeffrey Bigham blog).

#####################################################

Berkeley researchers design BLUE, a (deliberately) cheap robot for AI research:
…BLUE promises human-like capabilities in a low-cost platform…
Berkeley researchers have developed the Berkeley robot for Learning in Unstructured Environments (BLUE), a robotic arm designed for AI research and deployments. The robot was developed by a team of more than 15 researchers over the last three years. It is designed to cost around $5,000 when built in batches of 1,500 units, and many design choices have been constrained by the goal of making it both cheap to build and safe to operate around humans.

The robot can be used to train AI approaches on a cheap robotics platform, and works with teleoperation systems so it can be trained directly from human behaviors.

BLUE has seven degrees of freedom, distributed across three joints in the shoulder, one in the elbow, and three in the wrist. When designing BLUE, the researchers optimized for a “useful” robot – one with sufficient precision to be human-like (in this case, it can move with a precision of around 4 millimeters, far less precise than industrial robots), cheap enough to be manufactured at scale, and capable of a general class of manipulation tasks in unconstrained (aka, the opposite of a factory production line) environments.

Low-cost design: The BLUE robots use quasi-direct drive actuation (QDD), an approach that has most recently become popular in legged locomotion systems. They also designed a cheap, parallel jaw gripper (“we chose parallel jaws for their predictability, robustness, simplicity (low cost), and ease of simulation”).

Why this matters: In recent years, techniques based on deep learning have started to give robots unprecedented perception and manipulation capabilities. One thing that has held back deployment, though, is the absence of cheap robot platforms which researchers can experiment with. BLUE seems to have the nice properties of being built by researchers to reflect AI needs, while also being designed to be manufactured at scale. “Next up for the project is continued stress testing and ramping manufacturing,” they write. “The goal is to get these affordable robots into as many researchers’ hands as possible”.
  Read more: Project Blue (Berkeley website).
  Read the research paper: Quasi-Direct Drive for Low-Cost Compliant Robotic Manipulation (Arxiv).

#####################################################

Network architecture search gets more efficient with Single-Path NAS:
…Smarter search techniques lower the computational costs of AI-augmented search…
Researchers with Carnegie Mellon University, Microsoft, and the Harbin Institute of Technology have figured out a more efficient way to get computers to learn how to design AI systems for deployment on phones.

The approach, called Single-Path NAS, makes it more efficient to spend compute searching for more sophisticated AI models. The key technical trick is to search over “an over-parameterized ‘superkernel’” in each ConvNet layer. What this means in practice is that the researchers have made it possible to rapidly iterate through different types of AI component at each layer of the network, making their approach more efficient than other NAS techniques.
  “Without having to choose among different paths/operations as in multi-path methods, we instead solve the NAS problem as finding which subset of kernel weights to use in each ConvNet layer”, they explain.
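
A toy sketch of the nested ‘superkernel’ idea – a 3x3 kernel carved out of the centre of a single over-parameterized 5x5 weight tensor, so choosing a kernel size means choosing a subset of one weight tensor rather than a separate path. Illustrative only; the actual method relaxes this hard choice into a differentiable, threshold-based decision:

```python
# Illustrative sketch of a nested "superkernel": the 3x3 candidate kernel is the
# inner subset of one over-parameterized 5x5 kernel. Not the authors' code; the
# hard boolean switch here stands in for their differentiable relaxation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SuperKernelConv(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_channels, in_channels, 5, 5) * 0.01)

    def forward(self, x, use_small_kernel):
        if use_small_kernel:
            w = self.weight[:, :, 1:4, 1:4]          # inner 3x3 subset of the 5x5 weights
            return F.conv2d(x, w, padding=1)
        return F.conv2d(x, self.weight, padding=2)   # full 5x5 superkernel
```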

Hardware-aware: The researchers add a constraint during training that lets them optimize for the latency of the resulting architecture – this lets them automatically search for an architecture that best maps to the underlying hardware capabilities.

Testing, testing, testing: They test their approach on a Pixel 1 phone – a widely-used premium Android phone, developed by Google. They benchmark by using Single-Path NAS to design networks for image classification on ImageNet and compare it against state-of-the-art systems designed by human researchers as well as ones discovered via other neural architecture search techniques.  

  Results: Their approach gets an accuracy of 74.96%, which they claim is “the new state-of-the-art ImageNet accuracy among hardware-efficient NAS methods”. Their system also takes about 8 epochs to train, compared to hundreds (or thousands) for other methods.

Why this matters: Being able to offload the cost of designing new network architectures from human designers to machine designers has the potential to further accelerate AI research progress and AI application deployment. This sort of technique fits into the broader trend of the industrialization of AI (which has been covered in this newsletter in a few different ways) – worth noting that the authors of this technique are spread across multiple companies and institutions, from CMU, to Microsoft, to the Harbin Institute of Technology in Harbin, China.
  Read more: Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 hours (Arxiv).
  Get the code: Single-Path-NAS (GitHub).

#####################################################

How should the Department of Defense use Artificial Intelligence? Tell them your thoughts:
…Can the DoD come up with principles for how it uses AI? There are ways you can help…
The Defense Innovation Board, an advisory committee to the Secretary of Defense, is trying to craft a set of AI principles that the DoD can use as it integrates AI technology into its systems – and it wants help from the world.

  Share your thoughts at Stanford this month: If you’re in the Bay Area, you may want to come to “The Ethical and Responsible Use of Artificial Intelligence for the Department of Defense (DoD)” at Stanford University on April 25th 2019, where you can give thoughts on how the DoD may want to consider using (or not using) AI. You can also submit public comments online.

  Why this matters: Military organizations around the world are adopting AI technology, and it’s unusual to see a military organization publicly claim to be so interested in the views of people outside its own bureaucracy. I think it’s worth people submitting thoughts here (especially if they’re constructively critical), as this will provide us evidence for how or if the general public can guide the ways in which these organizations use AI.
  Read more about the AI Principles project here (DiB website).

#####################################################

OpenAI Bits & Pieces:

OpenAI Five wins matches against pros, cooperates with humans:
  This weekend, OpenAI’s neural network-based system for playing Dota 2, OpenAI Five, beat a top professional team in San Francisco. Additionally, we showed how the same system can play alongside humans.
  OpenAI Five Arena: We also announced OpenAI Five Arena, a website which people can use to play with or against our Dota 2 agents. Sign up via: arena.openai.com. Wish us luck as we try to play against the entire internet next week.

#####################################################

Tech Tales:

The Big Art Machine

The Big Art Machine, or as everyone called it, The BAM, was a robot about thirty feet tall and a hundred and fifty feet long. It looked kind of like a huge, metal centipede, except instead of having a hundred legs, it had a hundred far more sophisticated appendages – each able to manipulate the world around it, and change its own dimensions through a complicated series of interlocking, metal plates.

The BAM worked like this: you and a hundred or so of your friends would pile into the machine and each of you would sit in a small, sealed room housed at the intersection between each of its hundred appendages and its main body. Each of these rooms contained a padded chair, and each chair came with a little swing-out screen, and on this screen you’d see two movie clips of how your appendage could move – you’d pick whichever one you preferred, then it’d show you another one, and so on.

The BAM was a big AI thing, essentially. Each of the limbs started out dumb and uncoordinated, and at first people would just focus on calibrating their own appendage, then they’d teach their own appendage to perhaps strike the ground, or try and pull something forward, or so on. There were no rules. Mostly, people would try to get the BAM to walk or – very, very occasionally – run. After enough calibration, the operators of each of the appendages would get a second set of movies on their screen – this time, movies of how their appendage plus another appendage elsewhere on the BAM might move together. In this way, the crowd would over time select harmonious movements, built out of idiosyncratic underlays.

So hopefully this gives you an idea for how difficult it was to get the BAM to do anything. If you’ve ever hosted a party for a hundred people before and tried to get them to agree on something – music, a drinking game, even just listening to one person give a speech – then you’ll know how difficult getting the BAM to do anything is. Which is why we were so surprised that one day a team of people got into the BAM and, after the first few hours of aimless clanking and probing, it started to walk, then it started to run, and then… we lost it.

Some people say that they taught it to swim, and took it into the ocean. Others say that it’s not beyond the realms of feasibility that it was possible to teach the thing to fly – though the coordination required and the time it would take to explore its way to such a particular combination of movements was so lengthy that many thought it impossible. Now, we tell stories about the BAM as a lesson in collective action and calibration, and children when they learn about it in school immediately dream of building machines in which thousands of people work together, calibrating around some purpose that comes from personal chaos.

Things that inspired this story: Learning from human preferences; heterogeneous data; the beautiful and near-endless variety of ways in which humans approach problems; teamwork; coordination; inverse reinforcement learning; robotics, generative models.

Import AI 141: AIs play doom at thousands of frames per second; NeurIPS wants reproducible research; and Google creates&scraps AI ethics council.

75 seconds: How long it takes to train a network against ImageNet:
…Fujitsu Research claims state-of-the-art ImageNet training scheme…
Researchers with Fujitsu Laboratories in Japan have further reduced the time it takes to train large-scale, supervised learning AI models; their approach lets them train a residual network to around 75% accuracy on the ImageNet dataset after 74.7 seconds of training time. This is a big leap from where we were in 2017 (an hour), and is impressive relative to late-2018 performance (around 4 minutes: see issue #121).

How they did it: The researchers trained their system across 2,048 Tesla V100 GPUs via the Amazon-backed MXNet deep learning framework. They used a large mini-batch size of 81,920, and also implemented layer-wise adaptive rate scaling (LARS) and a ‘warming up’ period to increase learning efficiency.
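
For the curious, a minimal sketch of the LARS idea – scale each layer's update by the ratio of its weight norm to its gradient norm, which is what keeps such enormous mini-batches stable (a generic formulation, not Fujitsu's implementation):

```python
# Minimal sketch of layer-wise adaptive rate scaling (LARS): each layer gets a
# "local" learning rate proportional to ||w|| / ||grad||, so layers with small
# weights aren't destabilized by the large global learning rates that huge
# mini-batches require. Generic formulation, not Fujitsu's implementation.
import torch

def lars_step(parameters, base_lr, trust_coef=0.001, weight_decay=1e-4):
    for p in parameters:
        if p.grad is None:
            continue
        w_norm, g_norm = p.data.norm(), p.grad.data.norm()
        local_lr = 1.0
        if w_norm > 0 and g_norm > 0:
            local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm)
        p.data -= base_lr * local_lr * (p.grad.data + weight_decay * p.data)
```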

Why it matters: Training large models on distributed infrastructure is a key component of modern AI research, and the reduction in time we’ve seen on ImageNet training is striking – I think this is emblematic of the industrialization of AI, as people seek to create systematic approaches to efficiently training models across large amounts of computers. This trend ultimately leads to a speedup in the rate of research reliant on large-scale experimentation, and can unlock new paths of research.
  Read more: Yet Another Accelerated SGD: ResNet-50 Training on ImageNet in 74.7 seconds (Arxiv).

#####################################################

Ian ‘GANfather’ Goodfellow heads to Apple:
…Machine learning researcher swaps Google for Apple…
Ian Goodfellow, a machine learning researcher who developed an AI approach called generative adversarial networks (GANs), is leaving Google for Apple.

Apple’s deep learning training period: For the past few years, Apple has been trying to fill its ranks with more prominent people working on its AI projects. In 2016 it hired Russ Salakhutdinov, a researcher from CMU who had formerly studied under Geoffrey Hinton in Toronto, to direct its AI research efforts. Russ helped build up more of a traditional academic ML group at Apple, and Apple lifted its customary veil of secrecy a bit with the Apple Machine Learning Journal, a blog that details some of the research done by the secretive organization. Most recently, Apple hired John Giannandrea from Google to help lead its AI strategy. I hope Ian can push Apple towards being more discursive and open about aspects of its research, and I’m curious to see what happens next.

Why this matters: Two of Ian’s research interests – GANs and adversarial examples (manipulations made to data structures that cause neural networks to misclassify things) – have significant roles in AI policy, and I’m wondering if Apple might explore this more through proactive work (making things safer and better) along with policy advocacy.
  Read more: One of Google’s top A.I. people has joined Apple (CNBC).

#####################################################

World’s most significant AI conference wants more reproducible research:
…NeurIPS 2019 policy will have knock-on effect across wider AI ecosystem…
The organizing committee for the Neural Information Processing Systems Conference (NeurIPS, formerly NIPS), has made two changes to submissions for the AI conference: A “mandatory Reproducibility Checklist”, along with “a formal statement of expectations regarding the submission of code through a new Code Submission Policy”.

Reproducibility checklist: Those submitting papers to NeurIPS will fill out a reproducibility checklist, originally developed by researcher Joelle Pineau. “The answers will be available to reviewers and area chairs, who may use this information to help them assess the clarity and potential impact of submissions”.

Code submissions: People will be expected (though not forced – yet) to submit code along with their papers, if they involve experiments that relate to a new algorithm or a modification of an existing one. “It has become clear that this topic requires we move at a careful pace, as we learn where our “comfort zone” is as a community,” the organizers write.

  Non-executable: Code submitted to NeurIPS won’t need to be executable – this helps researchers whose work depends either on proprietary code (for instance, it plugs into a large-scale, proprietary training system, like those used by large technology companies), or who depend on proprietary datasets.

Why this matters: Reproducibility touches on many of the anxieties of current AI research relating to the difference in resources between academic researchers and those at corporate labs. Having more initiatives around reproducibility may help to close this divide, especially done in a (seemingly quite thoughtful) way that lets corporate researchers do things like publishing code without needing to worry about leaking information about internal proprietary infrastructure.
  Read more: Call for Papers (NeurIPS Medium page).
  Check out the code submission policy here (Google Doc).

#####################################################

Making RL research cheaper by using more efficient environments:
…Want to train agents on a budget? Comfortable with your agents learning within a retro hell? Then ViZDoom might be the right choice for you…
A team of researchers from INRIA in France have developed a set of tasks that demand “complex reasoning and exploration”, which can be run within the ViZDoom simulator at around 10,000 environment interactions per second; the goal of the project is to make it easier for people to do reinforcement learning research without spending massive amounts of compute.

Extending ViZDoom: ViZDoom is an implementation of the ancient first-person shooter game, Doom. However, one drawback is that it ships with only eight different scenarios to train agents in. To extend this, the researchers have developed four new scenarios designed to “test navigation, reasoning, and memorization”, variants of which can be procedurally generated.

Scenarios for creating thinking machines: These four scenarios include a navigation task called Labyrinth; Find and return, where the agent needs to find an object in the maze and then return to its starting point; Ordered k-item, where the agent needs to collect a few different items in a predefined order; and Two color correlation, where an agent needs to explore a maze to find a column at its center, then pick up objects which are the same color as the column.
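
To give a sense of why ViZDoom is attractive for cheap RL research, here is a minimal sketch of running it headless with frame-skip via the standard vizdoom Python package (the scenario config name is a placeholder for the authors' released scenario files):

```python
# Minimal sketch: running ViZDoom headless and with frame-skip, which is what
# makes thousands of environment interactions per second possible. The config
# file name is a placeholder; the INRIA scenarios ship as their own .cfg/.wad files.
import random
from vizdoom import DoomGame

game = DoomGame()
game.load_config("find_and_return.cfg")  # placeholder scenario config
game.set_window_visible(False)           # no rendering window = much faster
game.init()

actions = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # e.g. turn left / turn right / move forward
game.new_episode()
while not game.is_episode_finished():
    state = game.get_state()                               # screen buffer + game variables
    reward = game.make_action(random.choice(actions), 4)   # repeat the action for 4 tics
game.close()
```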

Spatial reasoning is… reassuringly difficult: “The experiments on our proposed suite of benchmarks indicate that current state-of-the-art models and algorithms still struggle to learn complex tasks, involving several different objects in different places, and whose appearance and relationships to the task itself need to be learned from reward”.
  Read more: Deep Reinforcement Learning on a Budget: 3D Control and Reasoning Without a Supercomputer (Arxiv).

######################################################

Facebook wants to make smart robots, so it built them a habitat:
…New open source research platform can conduct large-scale experiments, running 3D world simulators at thousands of frames per second…
A team from Facebook, Georgia Institute of Technology, Simon Fraser University, Intel Labs, and Berkeley have released Habitat, “a platform for embodied AI research”. The open source software is designed to help train agents for navigation and interaction tasks in a variety of domains, ranging from 3D environment simulators like Stanford’s ‘Gibson’ system or Matterport 3D to fully synthetic datasets like SUNCG.

  “Our goal is to unify existing community efforts and to accelerate research into embodied AI,” the researchers write. “This is a longterm effort that will succeed only by full engagement of the broader research community. To this end, we have opensourced the entire Habitat platform stack.”

  Major engineering: The Habitat simulator can support “thousands of frames per second per simulator thread and is orders of magnitude faster than previous simulators for realistic indoor environments (which typically operate at tens or hundreds of frames per second)”. Speed matters here, because the faster you can run your simulator, the more experience you can collect at each computational timestep. Faster simulators = it’s cheaper and quicker to train agents.

  Using habitat to test how well an agent can navigate: The researchers ran very large-scale tests on Habitat with a simple task: “an agent is initialized at a random starting position and orientation in an environment and asked to navigate to target coordinates that are provided relative to the agent’s position; no ground-truth map is available and the agent must use only its sensory input to navigate”. This is akin to waking up in a mansion with no memory and needing to get to a specific room…except in this world you do this for thousands of subjective years, since Facebook trains its agents for a little over 70 million timesteps in the simulator.

  PPO outperforms hand-coded SLAM approach: They find in tests that they can develop an AI agent based on a proximal policy optimization (PPO) policy trained via reinforcement learning which outperforms hand-coded ‘SLAM’ systems which implement “a classical robotics navigation pipeline including components for localization, mapping, and planning”.
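
For reference, the PPO agent is trained on the standard clipped surrogate objective from the original PPO paper, where A-hat is an advantage estimate and r is the probability ratio between the new and old policies:

```latex
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\;
  \operatorname{clip}\big(r_t(\theta),\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_t\big)\Big],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```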

Why this matters: Environments frequently contribute to the advancement of AI research, and the need for high-performance environments has been accentuated by the recent trend for using significant computational resources to train large, simple models. Habitat seems like a solid platform for large-scale research, and Facebook plans to add new features to it, like physics-based interactions within the simulator and supporting multiple agents concurrently. It’ll be interesting to see how this develops, and what things they learn along the way.
  Read more: Habitat: A Platform for Embodied AI Research (Arxiv).

######################################################

People want their AI assistants to be chatty, says Apple:
…User research suggests people prefer a chattier, more discursive virtual assistant…
Apple researchers want to build personal assistants that people actually want to use, so as part of that they’ve conducted research into how users respond to chatty or terse/non-chatty personal assistants, and how they respond to systems that try to ‘mirror’ the human they are interacting with.

Wizard-of-Oz: Apple frames this as a Wizard-of-Oz study, which means there is basically no AI involved: Apple instead had 20 people (three men and seventeen women – the lack of gender balance is not explained in the paper) take turns sitting in a room and uttering verbal commands to a simulated virtual assistant, which was in fact an Apple employee sitting in another room. The purpose of this type of study is to simulate the interactions that may occur between humans and AI systems, to help researchers figure out what they should build next and how users might react to what they build.

Study methodology: They tested people against three systems: a chatty system, a non-chatty system, and one which tried to mirror the chattiness of the user.

  When testing the chatty vs non-chatty systems, Apple asked some human users to make a variety of verbal requests relating to alarms, calendars, navigation, weather, factual information, and searching the web. For example, a user might say “next meeting time”, and the simulated agent could respond with (chatty) “It looks like you have your next meeting after lunch at 2 P.M.”, or (non-chatty) “2 P.M.” Participants then classified the qualities of the responses into categories like: good, off topic, wrong information, too impolite, or too casual.

Talk chatty to me: The study finds that people tend to prefer chatty assistants to non-chatty ones, and have a significant preference for agents whose chattiness mirrors the chattiness of the human user. “Mirroring user chattiness increases feelings of likability and trustworthiness in digital assistants. Given the positive impact of mirroring chattiness on interaction, we proceeded to build classifiers to determine whether features extracted from user speech could be used to estimate their level of chattiness, and thus the appropriate chattiness level of a response”, they explain.
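
Apple doesn't spell out its classifiers here, but the general recipe – estimate a user's chattiness from simple features of their speech, then pick a response style to match – might look something like this hypothetical sketch (features, labels, and model choice are all illustrative):

```python
# Hypothetical sketch: classify a user's "chattiness" from simple features of
# their request, then choose a matching response style. The features, labels
# and model choice are illustrative; Apple's classifiers are not described at
# this level of detail above.
from sklearn.linear_model import LogisticRegression

def featurize(utterance):
    words = utterance.split()
    return [len(words), sum(len(w) for w in words) / max(len(words), 1)]

train_utterances = ["next meeting time",
                    "hey, could you tell me when my next meeting is today?"]
train_labels = [0, 1]  # 0 = terse user, 1 = chatty user

clf = LogisticRegression().fit([featurize(u) for u in train_utterances], train_labels)
is_chatty = clf.predict([featurize("what's the weather like today?")])[0]
response_style = "chatty" if is_chatty else "terse"
```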

Why this matters: Today’s virtual assistants contain lots and lots of hand-written material and/or specific reasoning modules (see: Siri, Cortana, Alexa, the Google Assistant). Many companies are trying to move to systems where a larger and larger chunk of the capabilities come from behaviors that are learned from interaction with users. To be able to build such systems, we need users that want to talk to their systems, which will generate the sorts of lengthy conversational interactions needed to train more advanced learning-based approaches.

  Studies like this from Apple show how companies are thinking about how to make personal assistants more engaging: primarily, this makes users feel more comfortable with the assistants, but as a secondary effect it can bootstrap the generation of data from which to learn from. There also may be stranger effects: “People not only enjoy interacting with a digital assistant that mirrors their level of chattiness in its responses, but that interacting in this fashion increases feelings of trust”, the researchers write.
  Read more: Mirroring to Build Trust in Digital Assistants (Arxiv).

######################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

Google scraps AI ethics council:
Google’s new AI ethics council has been cancelled, just over a week after its launch (see #140).

What went wrong: There was significant backlash from employees over the appointments. ~2,500 employees signed a petition to remove the president of the conservative Heritage Foundation, Kay Coles James, from the council. Her appointment was described as bringing “diversity of thought” to the panel. Employees pointed to James’ track record of political positions described as anti-trans, anti-LGBT and anti-immigrant. There was also anger at the appointment of Dyan Gibbens, CEO of a drone company. A third appointee, Alessandro Acquisti, resigned from the body, saying it was not the right forum for engaging with ethical issues around AI.

What next: In a statement, the company says it is “going back to the drawing board,” and “will find different ways of getting outside opinions on these topics.”

Why it matters: This is an embarrassing outcome for Google, whose employees have again demonstrated their ability to force change at the company. Over and above the issues with appointments, there were reasons to be skeptical of the council as a supervision mechanism – the group was going to meet only four times in person over the next 12 months, and it is difficult to imagine the group being able, in this time, to understand Google’s activities enough to provide any meaningful oversight.
  Read more: Google cancels AI ethics board in response to outcry (Vox).

######################################################

Balancing openness and values in AI research
The Partnership on AI and OpenAI organized an event with members of the AI community to explore openness in AI research. In particular, they considered how to navigate the tension between openness norms and minimizing the risks from unintended consequences and malicious uses of new technologies. Some of the impetus for the event was OpenAI’s partial release of the GPT-2 language model. Participants role-played an internal review board of an AI company, deciding whether to publish a hypothetical AI advance which may have malicious applications.

Key insights: Several considerations were identified: (1) Organizations should have standardized risk assessment processes; (2) The efficacy of review processes depends on time-frames, and whether other labs are expected to publish similar work. It is unrealistic to think that one lab could unilaterally prevent publication, so it is better to think of decisions as delaying (not preventing) the dissemination of information; (3) AI labs could learn from the ‘responsible disclosure’ process in computer security, where vulnerabilities are disclosed only after there has been sufficient time to patch security issues; (4) It is easier to mitigate risks at an early, design stage, of research, than once research has been completed.

Building consensus: A survey after the event showed consensus across the community that there should be standardized norms and review parameters across institutions. There was not consensus, however, on what these norms should be. PAI identify 3 viewpoints among respondents: one group believed openness is generally the best norm; another believed pre-publication review processes might be appropriate; another believed there should be sharing within trusted groups.
  Read more: When Is It Appropriate to Publish High-Stakes AI Research? (PAI).
  Read more: ATEAC member Joanna Bryson has written a post reflecting on the dissolution of the board, called: What we lost when we lost Google ATEAC (Joanna Bryson’s blog).

######################################################

Amazon shareholders could block government face recognition contract:
The SEC has ruled that Amazon shareholders can vote on two proposals to stop sales of face recognition technologies to law enforcement. The motions, put forward by activist shareholders, will be considered at the company’s annual shareholder meeting. One asks Amazon to stop sales of their Rekognition technology to government unless the company’s board determines it does not pose risks to human and civil rights. The other requests that the board commissions an independent review of the technology’s impacts on privacy and civil liberties. While the motions are unlikely to pass, they put further pressure on the company to address these long-running concerns.
  Read more: Amazon has to let shareholders vote on government Rekognition ban, SEC says (The Verge).
  Read more: A win for shareholders in effort to halt sales of Amazon’s racially biased surveillance tech (OpenMIC).

######################################################

Tech Tales:

Joy Asteroid

The joy asteroid landed in PhaseSpace at two in the morning, pacific time, in March 2025. Anyone with a real-world location that corresponded to the virtual asteroid was inundated with notifications for certain types of gameworld-enhancing augmentations, in exchange for the recording and broadcast of fresh media related to the theme of ‘happiness’.

Most people took the deal, and suddenly a wave of feigned happiness spread across the nearby towns and cities as people posed in bedrooms and parks and cars and trains in exchange for trinkets, sometimes mentioned and sometimes not. This triggered other performances of happiness and joy, entirely disconnected from any specific reward – though some who did it said they hoped a reward would magically appear, as it had done for the others.

Meanwhile, in PhaseSpace, the joy persisted, warping most of the rest of virtual reality with it. Joy flowed from PhaseSpace via novel joy-marketing mechanisms, all emanating from a load of financial resources that seemed to have been embedded in the asteroid.

All of this happened in about an hour, and after that people started to work out what the asteroid was. Someone on a social network had already used the term ’emergent burp’, and this wasn’t so far from the truth – something in the vast, ‘world modelling’ neural net that simultaneously modeled various real&virtual simulations while doing forward prediction and planning had spiraled into an emergent fault, leading to an obsession with joy – a reward loop suddenly appeared within the large model, diverging the objective. Most of this happened because of sloppy engineering – many safety protocols these days either have humans periodically calibrating the machines, or are based on systems with stronger guarantees.

The joy loop was eventually isolated, but rather than completely delete it, the developers of the game cordoned off the environment and moved it onto separate servers running on a separate air-gapped network, and created a new premium service for ‘a visit to the land of joy’. They claim to have proved that their networking system will prevent the joy bug from emanating, but they continue to feed it more compute, as people come back with wild tales of lands of bus-sized birds and two-headed sea lions, and trees that grow from the bottom of fat, winged clouds.

The company that operates the system is currently alleged to be building systems to provide ‘live broadcasts’ from the land of joy, to satisfy the demands of online influencers. I don’t want them to do this but I cannot stop them – and I know that if they succeed, I’ll tune in.

Things that inspired this story: virtual reality; imagining an economic ‘clicker-game’-esque version of P

Import AI 140: Surveilling a city via the ‘CityFlow’ dataset; 25,000 images of Chinese shop signs; and the seven traps of AI ethics

NVIDIA’s ‘CityFlow’ dataset shows how to do citywide-surveillance:
…CityFlow promises more efficient, safer transit systems… as well as far better surveillance systems…
Researchers with NVIDIA, San Jose State University, and the University of Washington, have released CityFlow, a dataset to help researchers develop algorithms for surveilling and tracking multiple cars as they travel around a city.

The CityFlow Dataset contains 3.25 hours of video collected from 40 cameras distributed across 10 intersections in a US city. “The dataset covers a diverse set of location types, including intersections, stretches of roadways, and highways”. CityFlow contains 229,680 bounding boxes across 666 vehicles, which include cars, buses, pickup trucks, vans, SUVs, and so on. Each video has a resolution of at least 960 pixels, and “the majority” have a frame rate of 10 FPS.
  Sub-dataset: CityFlow ReID: The researchers have created a subset of the data for the purpose of re-identifying pedestrians and vehicles as they disappear from the view of one camera and re-appear in another. This subset of the data includes 56,277 bounding boxes.
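
Re-identification systems of this kind typically embed each cropped detection into a vector and match detections across cameras by nearest-neighbor search over those embeddings. Below is a minimal sketch of that matching step, assuming you already have per-crop embedding vectors (the random vectors are hypothetical placeholders, not CityFlow features or baseline code):

  import numpy as np

  def cosine_similarity_matrix(queries, gallery):
      # queries: (num_query, dim), gallery: (num_gallery, dim)
      q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
      g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
      return q @ g.T

  # Hypothetical embeddings: 3 query crops, 5 gallery crops from other cameras.
  rng = np.random.default_rng(0)
  query_embeddings = rng.normal(size=(3, 256))
  gallery_embeddings = rng.normal(size=(5, 256))

  sims = cosine_similarity_matrix(query_embeddings, gallery_embeddings)
  # For each query vehicle, rank gallery detections by similarity;
  # the top-ranked identity is the re-identification guess.
  ranked = np.argsort(-sims, axis=1)
  print(ranked[:, :3])  # top-3 gallery matches per query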

Baselines: CityFlow ships with a set of baselines for the following tasks:

  • Pedestrian re-identification.
  • Vehicle re-identification.
  • Single-camera tracking of a distinct object.
  • Multi-camera tracking of a given object.

Why this matters – surveillance, everywhere: It would be nice to see some discussion within the paper about the wide-ranging surveillance applications and implications of this technology. Yes, it’ll clearly be used to improve the efficiency (and safety!) of urban transit systems, but it will also be plugged into local police services and national and international intelligence-gathering systems. This has numerous ramifications and I would be excited to see more researchers take the time to discuss these aspects of their work.
  Read more: CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification (Arxiv).

#####################################################

SkelNetOn challenges researchers to extract skeletons from images, point clouds, and parametric representations:
…New dataset and competition track could make it easier for AI systems to extract more fundamental (somewhat low-fidelity) representations of the objects in the world they want to interact with…
A group of researchers from multiple institutions have announced the ‘SkelNetOn’ dataset and challenge, which seeks to “utilize existing and develop novel deep learning architectures for shape understanding”. The challenge involves the geometric modelling of objects, which is a useful problem to work on as techniques that can solve it naturally generate “a compact and intuitive representation of the shape for modeling, synthesis, compression, and analysis”.

Three challenges in three domains: Each SkelNetOn challenge ships with its own dataset of 1,725 paired images/point clouds/parametric representations of objects and skeletons.

Why this matters: Datasets contribute to broader progress in AI research, and being able to smartly infer 2D and 3D skeletons from images will unlock applications, ranging from Kinect-style interfaces that rely on the computer knowing where the user is, to being able to cheaply generate (basic) skeletal models for use in media production, for example video games.
  The authors “believe that SkelNetOn has the potential to become a fundamental benchmark for the intersection of deep learning and geometry understanding… ultimately, we envision that such deep learning approaches can be used to extract expressive parameters and hierarchical representations that can be utilized for generative models and for proceduralization”.
  Read more: SkelNetOn 2019 Dataset and Challenge on Deep Learning for Geometric Shape Understanding (Arxiv).

#####################################################

Want over 25,000 images of Chinese shop signs? Come get ’em:
…ShopSign dataset took more than two years to collect, and includes five hard categories of sign…
Chinese researchers have created ShopSign, a dataset of images of shop signs. Chinese shop signs tend to be set against a wide variety of backgrounds and vary in length, material, and style, the researchers note; signs in places like the USA, Italy, and France tend to be more standardized, they explain. This dataset will help people train automatic captioning systems that work on (some) Chinese signs, and could lead to secondary applications, like using generative models to create synthetic Chinese shop signs.

Key statistics:

  • 25,362: Chinese shop sign images within the dataset.
  • 4,000: Images taken at night.
  • 2,516: Pairs of images where signs have been photographed from both an angle and a front-facing perspective.
  • 50: Different types of camera used to collect the dataset, leading to natural variety within images.
  • 2.4 years: Time it took to collect the dataset.
  • >10: Locations of images, including Shanghai, Beijing, Inner Mongolia, Xinjiang, Heilongjiang, Liaoning, Fujian, Shangqiu, Zhoukou, as well as several urban areas in Henan Province.
  • 5: “special categories”; these are ‘hard images’ which are signs against wood, deformed, exposed, mirrored, or obscure backdrops.
  • 196,010: Lines of text in the dataset.
  • 626,280: Chinese characters in the dataset.

Why this matters: The creation of open datasets of images not predominantly written in English will help to make AI more diverse, making it easier for researchers from other parts of the world to build tools and conduct research in contexts relevant to them. I can’t wait to see ShopSigns for every language, covering the signs of the world (and then I hope someone trains a Style/Cycle/Big-GAN on them to generate synthetic street sign art!).
  Get the data: The authors promise to share the dataset on their GitHub repository. As of Sunday March 31st the images are yet to be uploaded there. Check out GitHub here.
  Read more: ShopSign: a Diverse Scene Text Dataset of Chinese Shop Signs in Street Views (Arxiv).

#####################################################

Stanford (briefly) sets state-of-the-art for GLUE language modelling challenge:
…Invest in researching new training signals, not architectures, say researchers…
Stanford University researchers recently set a new state-of-the-art on a multi-task natural language benchmark called GLUE, obtaining a score of 83.2 on the 20th of March, compared to 83.1 for the prior high score and 87.1 for human baseline performance.

Nine tasks, one benchmark: GLUE consists of nine natural language understanding tasks and was introduced in early 2018. Last year, systems from OpenAI (GPT) and Google (BERT) led the GLUE leaderboard; the Stanford system uses BERT in conjunction with additional supervision signals (supervised learning, transfer learning, multi-task learning, weak supervision, and ensembling) in a ‘Massive Multi-Task Learning (MMTL) setting’. The resulting model obtains state-of-the-art scores on four of GLUE’s nine tasks, and sets the new overall state-of-the-art.

RTE: The researchers detail how they improved performance on RTE (Recognizing Textual Entailment), one of GLUE’s nine tasks. The goal of RTE is to figure out whether a sentence is implied by the preceding one; in the following example, the second sentence is related to the first: “The cat sat on the mat. The dog liked to sit on the mat, so it barked at the cat.”

Boosting performance with five supervisory signals:

  • 1 signal: Supervised Learning [SL]: Score: 58.9
    • Train a standard biLSTM on the ‘RTE’ dataset, using ELMo embeddings and an attention layer.
  • 2 signals: Transfer Learning [TL] (+ SL): Score: 76.5
    • Fine-tune a linear layer on the RTE dataset on top of a pre-trained BERT module. “By first pre-training on much larger corpora, the network begins the fine-tuning process having already developed many useful intermediate representations which the RTE task head can then take advantage of”.
  • 3 signals: Multi-Task Learning [MTL] (+ TL + SL): Score 83.4
    • Train multiple additional linear layers against multiple tasks similar to RTE as well as RTE itself, using a pre-trained BERT module with each layer having its own task-specific interface to the relevant dataset. Train across all these tasks for ten epochs, then fine-tune on individual tasks for an additional five epochs.
  • 4 signals: Dataset Slicing [DS] (+ TL + SL + MTL): Score 84.1
    • Identify parts of the dataset the network has trouble with (eg, consistently low performance on RTE examples that have rare punctuation), then train additional task heads on top of these subsets of the data.
  • 5 signals: Ensembling [E] (+ DS + TL + SL + MTL): Score 85.1
    • Mush together multiple models trained with slightly different properties (eg, one trained on purely lowercased text, another that preserves upper-casing, or ones with different training/validation set splits). Averaging these models’ predicted probabilities further increases the score (a minimal sketch of this averaging follows the list).
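
The ensembling step is conceptually simple: run several variant models over the same example and average their predicted class probabilities. Here is a minimal sketch of that averaging, with made-up probability outputs standing in for the real RTE models:

  import numpy as np

  # Hypothetical softmax outputs for one RTE example (entailment vs not_entailment)
  # from three models trained with different casing / train-validation splits.
  model_probs = np.array([
      [0.62, 0.38],   # lowercased model
      [0.55, 0.45],   # cased model
      [0.71, 0.29],   # different train/validation split
  ])

  ensemble_probs = model_probs.mean(axis=0)     # average the probabilities
  prediction = ensemble_probs.argmax()          # 0 = entailment, 1 = not_entailment
  print(ensemble_probs, prediction)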

Why this matters: Approaches like this show how researchers are beginning to figure out how to train far more capable language systems using relatively simple, task-agnostic techniques. The researchers write: “we believe that it is supervision, not architectures, that really move the needle in ML applications”.
  (In a neat illustration of the rate of progress in this domain, shortly after the Stanford researchers submitted their high-scoring GLUE system, they were overtaken by a system from Alibaba, which obtained a score of 83.3.)
  Details of the Stanford team’s ‘Snorkel MeTaL’ submission here.
  Check out the ‘GLUE’ benchmark here (GLUE official site).
  Read the original GLUE paper here (Arxiv).
  Read more: Massive Multi-Task Learning with Snorkel MeTaL: Bringing More Supervision to Bear (Stanford Dawn).

#####################################################

The seven traps of AI ethics:
…Common failures of reasoning when people explore this issue…
As researchers try to come up with principles to apply when seeking to build and deploy AI systems in an ethical way, what problems might they need to be aware of? That’s the question that researchers from Princeton try to answer in a blog about seven “AI ethics traps” that people might stumble into.

The seven deadly traps:

  • Reductionism: reducing AI ethics to a single constraint, for instance fairness.
  • Simplicity: Overly simplifying ethics, eg via creating checklists that people formulaically follow.
  • Relativism: Placing so much importance on the diversity of views people have about AI ethics that it becomes difficult to distill these views down to a smaller core set of concerns.
  • Value Alignment: Ascribing one set of values to everyone in an attempt to come up with a single true value for people (and the AI systems they design) to follow, and failing to entertain other equally valid viewpoints.
  • Dichotomy: Presenting ethics as binary, eg “ethical AI” versus “unethical AI”.
  • Myopia: Using AI as a catch-all term, leading to fuzzy arguments.
  • Rule of Law reliance: Framing ethics as a substitute for regulations, or vice versa.

Why this matters: AI ethics is becoming a popular subject, as people reckon with the significant impacts AI is having on society. At the same time, much of the discourse about AI and ethics has been somewhat confused, as people try to reason about how to solve incredibly hard, multi-headed problems. Articles like this suggest we need to define our terms better when thinking about ethics, and indicate that it will be challenging to work out what and whose values AI systems should reify.
  Read more: AI Ethics: Seven Traps (Freedom To Tinker).

#####################################################

nuTonomy releases a self-driving car dataset:
…nuScenes includes 1,000 scenes from cars driving around Singapore and Boston…
nuTonomy, a self-driving car company (owned by APTIV), has published nuScenes, a multimodal dataset that can be used to develop self-driving cars.

Dataset: Data within nuScenes consists of over 1,000 distinct scenes of about 20 seconds in length each, with each scene accompanied by data collected from five radar sensors, one lidar, and six cameras on a nuTonomy self-driving vehicle.

  The dataset consists of ~5.5 hours of footage gathered in Boston and Singapore, and includes scenes in rain and snow. nuScenes is “the first dataset to provide 360 sensor coverage from the entire sensor suite. It is also the first AV dataset to include radar data and the first captured using an AV approved for public roads.” The dataset is inspired by self-driving car dataset KITTI, but has 7X the total number of annotations, nuTonomy says.

Interesting scenes: The researchers have compiled particularly challenging scenes, which include things like navigation at intersections and construction sites, the appearance of rare entities like ambulances and animals, as well as potentially dangerous situations like jaywalking pedestrians.

Challenging tasks: nuScenes ships with some in-built tasks, including calculating the bounding boxes, attributes, and velocities of 10 classes of object in the dataset.

Why this matters: Self-driving car datasets are frustratingly rare, given the high commercial value placed on them. nuScenes will give researchers a better sense of the attributes of data required for the development and deployment of self-driving car technology.
  Read more: nuScenes: A multimodal dataset for autonomous driving (Arxiv).
  Navigate example scenes from nuScenes here (nuScenes website).
  nuScenes GitHub here (GitHub).
  Register and download the full dataset from here (nuScene).

#####################################################

Google’s robots learn to (correctly) throw things at a rate of 500 objects per hour:
…Factories of the future could have more in common with food fights than conveyor belts…
Google researchers have taught robots to transport objects around a (crudely simulated) warehouse by throwing them from one container to another. The resulting system demonstrates the power of contemporary AI techniques when applied to modern robots, but has too high a failure rate for practical deployment.

Three modules, one robot: The robot ships with a perception module, a grasping module, and a throwing module.
  The perception module helps the robot see the object and calculate 3D information about it.
  The grasp module tries to predict the success of picking up the object.
  The throwing module tries to predict “the release position and velocity of a predefined throwing primitive for each possible grasp” and does so with the aid of a handwritten physics controller; it uses this signal, as well as a residual signal it tries to learn on top of it, to predict the appropriate velocity to use.

Residual physics: The system learns to throw the object by using a handwritten physics controller along with a function that learns residual signals on the control parameters of the robot. Using this method, the researchers generate a “wider range of data-driven corrections that can compensate for noisy observations as well as dynamics that are not explicitly modeled”.
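
In spirit, the residual-physics idea is: compute a release velocity from an analytic ballistics model, then add a small learned correction on top. Below is a minimal sketch under simplified assumptions (flat ground, equal launch and landing heights, a fixed release angle); the ‘residual_net’ function is a stand-in for the learned component, not Google’s actual model:

  import math

  G = 9.81  # gravity, m/s^2

  def ballistic_release_speed(horizontal_distance, release_angle_rad):
      # Ideal projectile speed needed to land at the target distance
      # when launched and landing at the same height.
      return math.sqrt(horizontal_distance * G / math.sin(2 * release_angle_rad))

  def residual_net(features):
      # Stand-in for the learned residual on top of the physics estimate;
      # in the real system this correction is predicted by a neural network.
      return 0.0

  def predicted_release_speed(horizontal_distance, release_angle_rad, features):
      v_physics = ballistic_release_speed(horizontal_distance, release_angle_rad)
      return v_physics + residual_net(features)

  # Throw to a target 1 metre away at a 45 degree release angle.
  print(predicted_release_speed(1.0, math.radians(45), features=None))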

The tricks needed for self-supervision: The paper discusses some of the tricks implemented by the researchers to help the robots learn as much as possible without the need for human intervention. “The robot performs data collection until the workspace is void of objects, at which point n objects are again randomly dropped into the workspace,” they write. “In our real-world setup, the landing zone (on which target boxes are positioned) is slightly tilted at a 15° angle adjacent to the bin. When the workspace is void of objects, the robot lifts the bottomless boxes such that the objects slide back into the bin”.

How well does it throw? They test on a ‘UR5’ robotic arm that uses an RG2 gripper “to pick and throw a collection of 80+ different toy blocks, fake fruit, decorative items, and office objects”. They test their system against three basic baselines, as well as humans. These tests indicate that the so-called ‘residual physics’ technique outlined in the paper is the most effective, compared to purely regression or physics-based baselines.
  The robot roughly matches human performance at gripping and throwing items: humans had a mean successful throw rate of 80.1% (plus or minus around 10 percentage points), versus 82.3% for the robot system outlined here.
  This system can pick up and throw up to 514 items per hour (not counting the ones it fails to pick up). This outperforms other techniques, like Dex-Net or Cartman.

Why this matters: TossingBot shows the power of hybrid-AI systems which pair learned components with hand-written algorithms that incorporate domain knowledge (eg, a physics-controller). This provides a counterexample to some of the ‘compute is the main factor in AI research’ arguments that have been made by people like Rich Sutton. However, it’s worth noting lots of the capabilities of TossingBot themselves depend on cheap computation, given the extensive period for which the system is trained in simulation prior to real-world deployment.

Additionally, manufacturers transporting objects around a factory would likely demand a success rate of 99.9N% (or even 99.99N%), rather than 80.N%, suggesting that this system’s 82.3% success rate puts it some way off from real-world practical usage – if you deployed this system today, you’d expect roughly one out of every five objects to not make it to its designated location (and it would probably incur damage along the way).
  Read more: TossingBot: Learning to Throw Arbitrary Objects with Residual Physics (Arxiv).
  Check out videos and images of the robots here (TossingBot website).

######################################################

UK government wants swarms of smart drones:
…New grant designed to make smart drones that can save people, or confuse and deceive them…
The UK government has awarded “the largest single contract” yet from its startup-accelerator-for-weapons, the ‘Defence and Security Accelerator’ (DASA) organization.

The grant will be used to develop swarms of drones that could be used for applications like: medical assistance, logistics resupply, explosive ordnance detection and disposal, confusion and deception, and situational awareness.

The project will be led by Blue Bear Systems Research Ltd, a UK defence development company that has worked on a variety of different unmanned aerial vehicles like a backpack-sized ‘iStart’ drone, a radiation-detecting ‘RISER’ copter, and more.

Why this matters: Many of the world’s militaries are betting their strategic future on the development of increasingly capable drones, ideally ones that work together in large-scale ‘swarms’ of teams and sub-teams. Research like this indicates how advanced these systems are becoming, and the decision to procure initial prototype systems via a combination of public- and private-sector cooperation seems representative of the way such systems will be built & bought in the future.
  I’ll be curious if in a few years we see larger grants for funding the development of autonomous capabilities for the sorts of hardware being developed here.
  Read more: £2.5m injection for drone swarms (UK Ministry of Defence press release).
  More about Blue Bear Systems Research here (BBSR website).

######################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

Google appoints AI ethics council:
Google have announced the creation of a new advisory council to help the company implement the AI principles it announced last year. The council, which is made up of external appointees, is intended to complement Google’s “internal governance structure and processes”.

Who they are: Of the 8 inaugural appointees, 5 are from academia, 2 are from policy, and 1 is from industry. Half of the council are drawn from outside the US (UK, South Africa, Hong Kong).

Why it matters: This seems like a positive move, insofar as it reflects Google taking ethical issues seriously. With so few details, though, it is difficult to judge whether this will be consequential – e.g. we do not know how the council will be empowered to affect corporate decisions.
  Read more: An external advisory council to help advance the responsible development of AI (Google).

DeepMind’s Ethics Board:
A recent profile of DeepMind co-founder, Demis Hassabis, has revealed new details about Google’s 2014 acquisition of the company. As part of the deal, both parties agreed to an ‘Ethics and Safety Review Agreement’, designed to ensure the parent company was not able to unilaterally take control of DeepMind’s intellectual property. Notably, if DeepMind succeed in their mission of building artificial general intelligence, the agreement gives ultimate control of the technology to an Ethics Board. The members of the board have not been made public, though are reported to include the three co-founders.
  Read more: DeepMind and Google: the battle to control artificial intelligence (Economist).

######################################################

Tech Tales:

Alexa I’d like to read something.

Okay, what would you like to read?

I’d like to read about people that never existed, but which I would be proud to meet.

Okay, give me a second… did you mean historical or contemporary figures?

Contemporary, but they can have died when I was younger. Contemporary within a generation.

Okay, and do you have any preference on what they did in their life?

I’d like them to have achieved things, but to have reflected on what they had achieved, and to not feel entirely proud.

Okay. I feel I should ask at this stage if you are okay?

I am okay. I would like to read this stuff. Can you tell me what you’ve got for me?

Okay, I’m generating a list… okay, here you go. I have the following titles available, and I have initiated processes to create more. Please let me know if any are interesting to you:

  • Great joke! Next joke!, real life as a stand-up comedian.
  • Punching Dust, a washed-up boxer tells all.
  • Here comes another one, the architect of mid-21st Century production lines.
  • Don’t forget this! Psychiatry in the 21st century.
  • I have a bridge to sell you, confessions of a con-artist.

And so I read them. I read about a boxer whose hands no longer worked. I read about a comedian who was unhappy unless they were on stage. I read about the beautiful Sunday-Sudoku process of automating humans. I read about memory and trauma and their relationships. And I read about how to tell convincing lies.

They say Alexa’s next ability will be to “learn operator’s imagination” (LOI), and once this arrives Alexa will ask me to tell it stories, and I will tell it truths and lies and in doing so it will shape itself around me.

Things that inspired this story: Increasingly powerful language models; generative models; conditional prompts; personal assistants such as Alexa; memory as therapy; creativity as therapy; the solipsism afforded to us by endlessly customizable, generative computer systems.

Import AI 139: Making better healthcare AI systems via audio de-identification; teaching drones to help humans fight fires; and why language models could be smarter than you think

Stopping poachers with machine learning:
…Finally, an AI-infused surveillance & response system that (most) people approve of…
Researchers and staffers with the University of Southern California, Key Biodiversity Area Secretariat, World Wide Fund for Nature, Wildlife Conservation Society, and Uganda Wildlife Authority, have used machine learning to increase the effectiveness of rangers who try to stop poaching in areas of significant biodiversity.

The project saw USC researchers collaborate with rangers in areas vulnerable to poaching in Uganda and Cambodia, and involved analyzing historical records of incidents of poaching and building predictive models to attempt to predict where poachers might strike next.
  The results are encouraging: the researchers tested their system in Murchison Falls National Park (MFNP) in Uganda and the Srepok Wildlife Sanctuary (SWS) in Cambodia, following an earlier test of the system in the Queen Elizabeth National Park (QENP) in Uganda. In MFNP, using the system, “park rangers detected poaching activity in 38 cells,” they write. Additionally, “the amount of poaching activity observed is highest in the high-risk regions and lowest in the low-risk regions”, suggesting that the algorithm developed by the researchers is learning useful patterns. The researchers also deployed the model in the SWS park in Cambodia and again observed that regions the algorithm classified as high risk had higher incidences of poaching.

Impact: To get a sense of the impact of this technique, we can compare the results of the field tests to typical observations made by the rangers.
  In the SWS park in Cambodia during 2018, the average number of snares confiscated each month was 101. By comparison, during the month the rangers were using the machine learning system they removed 521 snares. “They also confiscated a firearm and they saw, but did not catch, three groups of poachers in the forest”.

Different parks, different data: As most applied scientists know, real world data tends to contain its own surprising intricacies. Here, this manifests as radically different distributions of types of datapoints across the different parks – in the MFNP park in Uganda around 15% of the collected datapoints are for areas where poaching is thought to have occurred, compared to around 4.3% for QENP and ~0.25% for the SWS park in Cambodia. Additionally, the makeup of each dataset changes over the year as the parks change various things that factor into the data collection, like the number of rangers, or the routes they cover, and so on.
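
At its core this is a spatial classification problem: each park cell gets a feature vector (terrain, distance to roads, past patrol effort, and so on) and a label for whether poaching signs were found there, with heavy class imbalance. A minimal sketch of that setup, using scikit-learn and synthetic features in place of the real ranger data (the feature dimensions and positive rate below are illustrative, not from the paper):

  import numpy as np
  from sklearn.ensemble import GradientBoostingClassifier
  from sklearn.model_selection import train_test_split

  # Synthetic stand-in for per-cell features and labels; in the real system
  # these come from historical patrol records, with roughly 0.25%-15% positives.
  rng = np.random.default_rng(0)
  X = rng.normal(size=(5000, 8))
  y = (rng.random(5000) < 0.05).astype(int)   # ~5% of cells have poaching signs

  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=0.2, stratify=y, random_state=0)

  model = GradientBoostingClassifier()
  model.fit(X_train, y_train)

  # Rank cells by predicted risk so rangers can prioritise the highest-risk areas.
  risk = model.predict_proba(X_test)[:, 1]
  highest_risk_cells = np.argsort(-risk)[:20]
  print(highest_risk_cells)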

Why this matters: Machine learning approaches are giving us the ability to build a sense&respond infrastructure for the world, and as we increase the sample efficiency and accuracy of such techniques we’ll be able to better build systems to help us manage an increasingly unstable planet. It’s deeply gratifying to see ML being used to protect biodiversity.
  Read more: Stay Ahead of Poachers: Illegal Wildlife Poaching Prediction and Patrol Planning Under Uncertainty with Field Test Evaluations (Arxiv).

#####################################################

Want to check-in for your flight in China? Use facial recognition:
I’ve known for some time that China has rolled out facial recognition systems in its airports to speed up the check-in process. Now, Matthew Brennan has recorded a short video showing what this is like in practice from the perspective of a consumer. I imagine these systems will become standard worldwide within a few years.
  Note: I’ve seen a bunch of people on Twitter saying variations of “this technology makes me deeply uncomfortable and I hate it”. I think I have a slightly different opinion here. If you feel strongly about this would love to better understand your position, so please tweet at me or email me!
  Check out the facial recognition check-in here (Matthew Brennan, Twitter).

#####################################################

Why language models could be smarter than you think:
…LSTMs are smarter than you think, say researchers…
Researchers with the Cognitive Neuroimaging Unit at the ‘NeuroSpin Center’, as well as Facebook AI Research and the University of Amsterdam, have analyzed how LSTMs keep track of certain types of information, when tested for their ability to model language structures at longer timescales. The analysis is meant to explore “whether these generic sequence-processing devices are discovering genuine structural properties of language in their training data, or whether their success can be explained by opportunistic surface-pattern-based heuristics”, the authors write.

The researchers study a pre-trained model “composed of a 650-dimensional embedding layer, two 650-dimensional hidden layers, and an output layer with vocabulary size 50,000”. They evaluate this model on a set of what they call ‘number-agreement tasks’ where they test subject-verb agreement in increasingly challenging setups (eg, a simple case is looking at network activations for ‘the boy greets the guy’, with harder ones taking the form of things like ‘the boy most probably greets the guy’ and ‘the boy near the car kindly greets the guy’, and so on).
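
The number-agreement evaluation boils down to comparing the language model’s probability for the singular versus plural verb form after a given prefix. Here is a minimal sketch of that comparison using a freshly initialised PyTorch LSTM as a stand-in for the pre-trained model analysed in the paper (the toy vocabulary is made up):

  import torch
  import torch.nn as nn

  # Toy vocabulary standing in for the 50,000-word model in the paper.
  vocab = {"<pad>": 0, "the": 1, "boy": 2, "near": 3, "car": 4,
           "kindly": 5, "greets": 6, "greet": 7, "guy": 8}

  class LSTMLM(nn.Module):
      def __init__(self, vocab_size, dim=650):
          super().__init__()
          self.embed = nn.Embedding(vocab_size, dim)
          self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
          self.out = nn.Linear(dim, vocab_size)

      def forward(self, tokens):
          h, _ = self.lstm(self.embed(tokens))
          return self.out(h)

  model = LSTMLM(len(vocab))  # untrained here; the paper probes a trained model

  prefix = torch.tensor([[vocab[w] for w in
                          ["the", "boy", "near", "the", "car", "kindly"]]])
  logits = model(prefix)[0, -1]                     # next-token distribution
  probs = torch.softmax(logits, dim=-1)
  # Agreement is "correct" if the singular verb outscores the plural one.
  print(probs[vocab["greets"]] > probs[vocab["greet"]])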

Neurons that count, sometimes together: During analysis, they noticed the LSTM had developed “two ‘grandmother’ cells to carry number features from the subject to the verb across the intervening material”. They found that these cells were sometimes used to help the network decide in particularly tricky cases: “The LSTM also possesses a more distributed mechanism to predict number when subject and verb are close, with the grandmother number cells only playing a crucial role in more difficult long-distance cases”.
  They also discovered a cell that “encodes the presence of an embedded phrase separating the main subject-verb dependency, and has strong efferent connections to the long-distance number cells, suggesting that the network relies on genuine syntactic information to regulate agreement-feature percolation”.

Why this matters: “Strikingly, simply training an LSTM on a language-model objective on raw corpus data brought about single units carrying exceptionally specific linguistic information. Three of these units were found to form a highly interactive local network, which makes up the central part of a ‘neural’ circuit performing long-distance number agreement”, they write. “Agreement in an LSTM language-model cannot be entirely explained away by superficial heuristics, and the networks have, to some extent, learned to build and exploit structure-based syntactic representations, akin to those conjectured to support human-sentence processing”.
  The most interesting thing about all of this is the apparent sophistication that emerges as we train these networks. It seems to inherently support some of the ideas outlined by Richard Sutton (covered in Import AI #138) about the ability for relatively simple networks to – given sufficient compute – develop very sophisticated capabilities.
  Read more: The emergence of number and syntax units in LSTM language models (Arxiv).

#####################################################

Teaching drones to help humans fight fires:
…Towards a future where human firefighters are guarded by flying machines…
Researchers with the Georgia Institute of Technology have developed an algorithm to let humans and drones work together when fighting fires, with the drones now able to analyze the fire from above and relay that information to firefighters. “The proposed algorithm overcomes the limitations of prior work by explicitly estimating the latent fire propagation dynamics to enable intelligent, time-extended coordination of the UAVs in support of on-the-ground human firefighters,” they write.

The researchers use FARSITE, software for simulating wildfire propagation that is already widely used by the United States’ National Park Service and Forest Service. They use an Adaptive Extended Kalman Filter (AEKF) to make predictions about where the fire is likely to spread to. They eventually develop a basic system that works in simulation which lets them coordinate the actions of drones and humans, so that the drones learn to intelligently inform people about fire propagation. They also implement a “Human Safety Module” which attempts to work out how safe the people are, and how safe they will be as the fire location changes over time.
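
The core estimation loop is a predict/update cycle: propagate the believed fire front forward with the propagation model, then correct it with UAV observations. Below is a heavily simplified one-dimensional sketch, a plain linear Kalman filter over front position and spread rate rather than the adaptive extended Kalman filter over FARSITE dynamics used in the paper; all numbers are illustrative:

  import numpy as np

  dt = 60.0                                  # seconds between UAV observations
  F = np.array([[1.0, dt], [0.0, 1.0]])      # state: [front position, spread rate]
  H = np.array([[1.0, 0.0]])                 # UAV measures front position only
  Q = np.diag([1.0, 0.01])                   # process noise (model uncertainty)
  R = np.array([[25.0]])                     # measurement noise (camera/geo error)

  x = np.array([[0.0], [0.5]])               # initial estimate: 0 m, 0.5 m/s spread
  P = np.eye(2) * 100.0                      # initial uncertainty

  def step(x, P, z):
      # Predict: push the fire front forward with the propagation model.
      x = F @ x
      P = F @ P @ F.T + Q
      # Update: correct the prediction with the UAV's observation z.
      S = H @ P @ H.T + R
      K = P @ H.T @ np.linalg.inv(S)
      x = x + K @ (z - H @ x)
      P = (np.eye(2) - K @ H) @ P
      return x, P

  for z in [np.array([[35.0]]), np.array([[70.0]]), np.array([[110.0]])]:
      x, P = step(x, P, z)
  print(x.ravel())   # estimated front position and spread rate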

Three tests: They test the system on three scenarios: stationary fire, moving fire, and one where the fire moves and grows in area over time. The tests mostly tell us that you need to dramatically increase the number of drones to satisfy human safety guarantees (eg, in the case of a moving and growing fire you need over 25 drones to provide safety for 10 humans. Similarly, in this scenario you need at least 7 drones to be able to significantly reduce your level of uncertainty about the map of the fire and how it will change). Their approach outperforms prior state-of-the-art in this domain.

Why this matters: I think in as little as a decade it’s going to become common to see teams of humans and drones working together to deal with wildfires or building fires, and papers like this show how we might use adaptive software systems to increase the safety of human-emergency responders.
  Read more: Safe Coordination of Human-Robot Firefighting Teams (Arxiv).

#####################################################

Google wants it to be easier to erase personal information from audio logs:
…Audio de-ID metric & benchmark should spur research into cleaning up datasets…
Medical data is one of the most difficult types of data for machine learning researchers to work with, because it is full of personally identifiable health information. This means people need to invest in tools to remove this personal information from medical data, so that researchers can build secondary applications on top of it. Coupled with this is the fact that in recent years more and more medical services have been accessed digitally, leading to a growth in the amount of digital healthcare-relevant data, all of which needs to have the personal information removed before it can be processed by most machine learning approaches.

Now, researchers with Google are trying to work on this problem in the audio domain by releasing Audio de-ID, a new metric for measuring success at de-identifying audio logs, along with a benchmark for evaluating such systems. The benchmark tests models against a Google-curated dataset made of the ‘Switchboard’ and ‘Fisher’ datasets with Personal Health Information (PHI) tagged and labelled in the data, and challenges models to automatically slice personally identifiable information out of datasets.

Taking personal information out of audio logs: Removing personally identifiable information from audio logs is a tricky task. Google’s pipeline works as follows: “Our pipeline first produces transcripts from the audio using [Audio Speech Recognition], proceeds by running text-based [Named Entity Recognition] tagging, and then redacts [Personal Health Information] tokens, using the aligned token boundaries determined by ASR. Our tagger relies on the state-of-the-art techniques for solving the audio NER problem of recognizing entities in audio transcripts. We leverage the available ASR technology, and use its component of alignment back to audio.”
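
The redaction step itself is mechanical once you have ASR tokens with time alignments and NER tags: any token tagged as PHI gets its audio span silenced. A minimal sketch of that last stage, with hypothetical tokens, tags, and alignments (not Google’s pipeline code):

  import numpy as np

  SAMPLE_RATE = 16000

  # Hypothetical ASR output: (token, start_sec, end_sec, ner_tag)
  tokens = [
      ("my",   0.00, 0.20, "O"),
      ("name", 0.20, 0.45, "O"),
      ("is",   0.45, 0.60, "O"),
      ("jane", 0.60, 0.95, "NAME"),
      ("doe",  0.95, 1.30, "NAME"),
  ]
  PHI_TAGS = {"NAME", "DATE", "PHONE", "ADDRESS"}

  audio = np.random.randn(2 * SAMPLE_RATE)   # stand-in for 2s of recorded speech

  def redact(audio, tokens):
      out = audio.copy()
      for token, start, end, tag in tokens:
          if tag in PHI_TAGS:
              # Silence the aligned span for any personally identifying token.
              out[int(start * SAMPLE_RATE):int(end * SAMPLE_RATE)] = 0.0
      return out

  clean_audio = redact(audio, tokens)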

Why this matters: The (preliminary) benchmark results in the paper show that Audio Speech Recognition performance is “the main impedance towards achieving results comparable to text de-ID”, suggesting that as we develop more capable ASR models we will see improvements in our ability to automatically clean datasets so we can do more useful things with them.

Want the data? Google says you should be able to get it from this link referenced in the paper, but the link currently 404s (as of Sunday the 24th of March): https://g.co/audio-ner-annotations-data
   Read more: Audio De-identification: A New Entity Recognition Task (Arxiv).

#####################################################

It just got (a little bit) easier to develop & test AI algorithms on robot hardware:
…Gym-gazebo2 lets you simulate robots and provides an OpenAI Gym-like interface…
Researchers with Acutronic Robotics have released gym-gazebo2, software that people can use to develop and compare reinforcement learning algorithms’ performance on robots simulated within the ‘ROS 2’ and ‘Gazebo’ software platforms. Gym-gazebo2 contains simulation tools, middleware software, and a high-fidelity simulation of the ‘MARA’ robot which is a product developed by Acutronic Robotics.

Gym-Gazebo2 ingredients: The software consists of three main chunks: ROS 2, Gazebo, and Gym-Gazebo2. Gym-Gazebo2 can be installed via a docker container, which should simplify setup.

MARA, MARA, MARA: Initially, Gym-Gazebo2 ships with four environments based around Acutronic Robotics’s Modular Articulated Robotic Arm (MARA) system. These environments are:

  • MARA: The agent is rewarded for moving its gripper to a target position.
  • MARA Orient: The agent is rewarded for moving its gripper to a target position and specific orientation.
  • MARA Collision: The agent is rewarded in the same way as MARA, but is punished if the robot collides with anything.
  • MARA Collision Orient: The agent is rewarded in the same way as MARA Orient, but is punished if it collides with anything.

  (It’s worth noting that these environments are very, very simple: real world robot tasks tend to involve more multi-step scenarios, usually with additional constraints. However, these environments could prove to be useful for validating performance of a given approach early in development.)
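
Because the environments expose an OpenAI Gym-style interface, interacting with them should look like the usual reset/step loop. A rough sketch of what that usage might look like, assuming gym-gazebo2 is installed and registers an environment id along the lines of 'MARA-v0' (the id and import name are assumptions; check the repository for the exact registered names):

  import gym
  import gym_gazebo2  # noqa: F401  (assumed import; registers the MARA environments)

  # Environment id is an assumption; consult the gym-gazebo2 README for exact names.
  env = gym.make("MARA-v0")

  obs = env.reset()
  for _ in range(100):
      action = env.action_space.sample()          # random joint targets
      obs, reward, done, info = env.step(action)  # classic Gym 4-tuple API (circa 2019)
      if done:
          obs = env.reset()
  env.close()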

Free algorithms: Alongside gym-gazebo2, Acutronic Robotics is also releasing ros2learn, a collection of algorithms (based on OpenAI Baselines), as well as some pre-built experimental scripts for running algorithms like PPO, TRPO, and ACKTR on the ‘MARA’ robot platform.

Why this matters: Robotics is about to be revolutionized by AI; following years of development by researchers around the world, deep learning techniques are maturing to the point that they can be applied to robots to solve tasks that were previously impossible or very, very, very expensive and complicated to program. Software like gym-gazebo2 will make it easier for people to validate algorithms in a high-fidelity simulation and, if people happen to like the MARA arm itself, also lets them validate things in reality (though this tends to be expensive and fraught with various kinds of confounding pain, so it’ll depend on the popularity of the MARA hardware platform).
  Get the ‘ros2learn’ code from GitHub here.
  Get the code for gym-gazebo2 from GitHub here.
  Read more: gym-gazebo2, a toolkit for reinforcement learning using ROS 2 and Gazebo (Arxiv).

#####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

Understanding the European Union’s AI ambitions:
In discussions about global leadership in AI, the EU is rarely considered a serious contender. This report by Charlotte Stix at the Leverhulme Center for the Future of Intelligence at the University of Cambridge, explores this view, and provides a detailed picture of the EU’s AI landscape and how this could form the basis for leadership in AI.

What the EU wants: The EU’s strategy is set out in 2 key documents: (1) The European Commission’s Communication on AI, which – among other things – outlines the EU’s ambition to put ‘ethical AI’ at the core of their strategy; (2) The Coordinated Plan on AI, which outlines how member states could coordinate their strategies and resources to improve European competitiveness in AI.

What the EU has: The EU’s central funding commitments for AI R&D remain modest – €1.5bn between 2018-2020. This is fairly trivial when compared with government funding in China (tens of billions of dollars per year), and even private companies (Alphabet’s 2018 R&D spend was $21bn). The Coordinated Plan includes an ambition to reach €20bn total funding per year by 2020, from member states and the private sector, though there are not yet concrete plans for how to achieve this. At present, inbound VC investment in AI is 6x lower than in the US. The EU is having problems retaining talent and organizations, with promising companies increasingly acquired by international rivals. There are proposals for the EU to establish large-scale projects in AI research, leveraging their impressive track-record in this domain (e.g. CERN, the Human Brain Project), but these remain relatively early-stage.

Why it matters: The EU is making a bet on leadership in ethical and human-centric AI, as a way to play an important role in the global AI landscape. This is an interesting strategy which sets it apart from other actors, e.g. US and China, who are focussed more on leadership in AI capabilities. The EU has expressed ambitions to cooperate closely with other countries and groups that are aligned with their vision (e.g. via the new International Panel on AI), which is encouraging for global coordination on these issues. Being made up of 27/28 member states, the EU’s experience in coordinating between actors could prove an advantage in this regard.
  Read more: Survey of the EU’s AI ecosystem (Charlotte Stix).

#####################################################

New AI institute at Stanford launches + Bill Gates on AI risk:
The Stanford Institute for Human-Centered AI (HAI) has formally launched. HAI is led by Fei-Fei Li, former director of Stanford AI lab and co-founder of AI4ALL, and John Etchemendy, philosopher and former Stanford provost.

The mission: HAI’s mission is informed by three overarching principles – that we must understand and forecast AI’s human impact, and guide its development on this basis; that AI should augment, rather than replace, humans; and that we must develop (artificial) intelligence as “subtle and nuanced as human intelligence.” The institute aims to “bring humanities and social thinking into tech,” to address emerging issues in AI.

Bill Gates on AI risk: At HAI’s launch event, Gates used his keynote speech to describe his hopes and fears for advanced AI. He said “the world hasn’t had that many technologies that are both promising and dangerous”, comparing AI to nuclear fission, which yielded both benefits in terms of energy, and serious risks from nuclear weapons. Gates had previously expressed some scepticism about AI risk, and this speech suggests he has moved towards a more cautious outlook.
  Read more: Launch announcement (Stanford).
  Read more: Bill Gates – AI is like nuclear weapons and nuclear energy in danger and promise (Vox).

#####################################################

Tech Tales:

Beaming In

We had been climbing the mountain for half an hour, and already a third of the people had cut their feeds and faded out of the country – they had discussed enough to come to a decision, and had disappeared to do the work. I stayed, looking at butterflies playing near cliff edges, and at birds diving between different trees, while talking about moving vast amounts of money for the purposes of buttressing a particular ideology and crushing another. The view was clear today, and I could see in my overlay the names of each of the towns and mountains and train stations on the horizon.

As I walked, I talked with the others. I suppose similar hikes have happened throughout history – more than a thousand years ago, a group of Icelandic elders hiked to a desolate riverbank in a rift valley where they formed the world’s oldest democracy, the ‘Althing’. During various world wars many leaders have gathered at various country parks and exchanged ideas with one another, making increasingly strange and influential plans. And of course there are the modern variants: Bilderberg hikes in various thinly-disclosed locations around Europe. Perhaps people use nature and walking within it as a kind of stimulant, to help them make decisions of consequence? Perhaps, they use nature because they have always used nature?

Now, on this walk, after another hour, most of the people have faded out, till it’s just me, and two others. One of them runs a multi-national with interests in intelligence, and the other works on behalf of both the US government and Chinese government on ‘track two’ diplomacy initiatives. We discuss things I cannot talk about, and make plans that will influence many people.

—–

But if you were to look at us from a distance, you wouldn’t see anybody at all. You’d see a mass of drones, hauling around the smell, texture and sense equipment which is needed to render high-fidelity simulations and transmit them to each of us, dialing in from our home offices and yachts and (in a few, rare cases) bunkers and jets. We use our machines to synthesize a beautiful reality for us, and we ‘walk’ within it, together making plans, appearing real to each other, but appearing to people in the places that we walk as a collection of whirring metal. I am not sure when the last walk occurred with this group where a human decision-maker attended.

We do this because we believe in nature, and we believe that because we are in the presence of nature, our decisions are continuous with those of the past. I think habit plays a part, but so does a desire for absolution, and the knowledge that though trees and butterflies and crows may judge us for what we do, they cannot talk back.

Things that inspired this story: The general tradition of walking through history with particular emphasis on modern walks such as those taken by the Bilderberg group; virtual reality and augmented reality; increasingly capable drones; 5G; force feedback systems; the tendency for those in power to form increasingly closed-off nested communities.

Import AI 138: Transfer learning for drones; compute and the “bitter lesson” for AI research; and why reducing gender bias in language models may be harder than people think

Why the unreasonable effectiveness of compute is a “bitter lesson” for AI research:
…Richard Sutton explains that “general methods that leverage computation are ultimately the most effective”…
Richard Sutton, one of the godfathers of reinforcement learning*, has written about the relationship between compute and AI progress, noting that the use of larger and larger amounts of computation paired with relatively simple algorithms has typically led to the emergence of more varied and independent AI capabilities than many human-designed algorithms or approaches. “The only thing that matters in the long run is the leveraging of computation”, Sutton writes.

Many examples, one rule: Some of the domains where computers have beaten methods based on human knowledge include Chess, Go, speech recognition, and many examples in computer vision.

The bitter lesson: “We have to learn the bitter lesson that building in how we think we think does not work in the long run,” Sutton says. “The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning.”

Why this matters: If compute is the main thing that unlocks new AI capabilities, then we can expect most of the strategic (and related geopolitical) landscape of AI research to re-configure in coming years around a compute-centric model, which will likely have significant implications for the AI community.
  Read more: The Bitter Lesson (Rich Sutton).
  * Richard Sutton literally (co)wrote the book on reinforcement learning.

#####################################################

AI + Comedy, with Naomi Saphra!
…Comedy set lampoons funding model in AI, capitalism, NLP…
Naomi Saphra, an NLP researcher, has put a video online of her doing stand-up AI comedy at a venue in Edinburgh, Scotland. Check out the video for her observations on working in AI research, funding AI research, tales about Nazi rocket researchers, and more.

  “You always have to ask yourself, who else finds this interesting? If you mean who reads my papers and cites my papers? The answer is nobody. If you mean who has given me money? The answer is mostly evil… you see I have the same problem as anyone in this world – I hate capitalism but I love money”.
  Watch her comedy set here: Naomi Saphra, Paying the Panopticon (YouTube).

#####################################################

Prototype experiment shows why robots might tag-team in the future:
…Use of a tether means 1+1 is greater than 2 here…
Researchers with the University of Tokyo, Japan, have created a two-robot team that can map its surroundings and traverse vertiginous terrain via the use of a tether, which lets an airborne drone vehicle assist a ground vehicle.

The drone uses an NVIDIA Jetson TX2 chip to perform onboard localization, mapping and navigation. The drone is equipped with a camera, time-of-flight sensor, and a laser sensor for height measurement. The ground vehicle is “based on a commercially available caterpillar platform” using a UP Core processing unit. The ground robot runs the Robot Operating System (ROS), which the airborne drone uses to connect to it.

Smart robots climb with a dumb tether: The robots work together like this: the UAV flies above the UGV and maps the terrain, feeding data down to the ground robot, giving it awareness of its surroundings. When the robots detect an obstruction, the UAV wraps the tether (which has a grappling hook on its end) around a tall object, and the UGV uses the secured tether to climb the object.

Real world testing: The researchers test their system in a small-scale real world experiment and find that the approach works, but has some problems: “Since we did not have a [tether] tension control mechanism due to the lack of sensor, the tether needed to be extended from the start and as the result, the UGV suffered from the entangled tether many times.”

Why this matters: In the future, we can imagine various robots of different types collaborating with each other, using specialisms to operate as a unit, becoming more than the sum of their parts. Though as this experiment indicates we’re still at a relatively early stage of development here, and several kinks need to be worked out.
  Read more: UAV/UGV Autonomous Cooperation: UAV assists UGV to climb a cliff by attaching a tether (Arxiv).

#####################################################

Facebook tries to build a standard container for AI chips:
…New Open Compute Project (OCP) design supports both 12v and 48v inputs…
These days, many AI organizations are contemplating building data centers consisting of lots of different types of servers running many different chips, ranging from CPUs to GPUs to custom accelerator chips designed for AI workloads. Facebook wants to standardize the types of chassis used to house AI-accelerator chips, and has contributed an open source hardware schematic and specification to the Open Compute Project – a Facebook-born scheme to standardize the sorts of server equipment used by so-called hyperscale data center operators.

The proposed OCP accelerator module supports 12V and 48V inputs and can support up to 350W (12V) or up to 700W (48V) TDP (Thermal Design Power) for the chips in the module – a useful trait, given that many new accelerator chips guzzle significant amounts of power (though you’ll need to use liquid cooling for any servers consuming above 450W TDP). It can support single or multiple ASICs within each chassis, with support for up to eight accelerators per system.
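
As a back-of-the-envelope check on why the higher-voltage input matters, the implied current draw per module is much lower at 48V for the same-or-higher power budget. A quick worked calculation:

  # Current draw implied by the module's TDP limits at each input voltage.
  for volts, watts in [(12, 350), (48, 700)]:
      amps = watts / volts
      print(f"{watts}W at {volts}V -> {amps:.1f}A per module")
  # 350W at 12V -> 29.2A; 700W at 48V -> 14.6A, which eases power delivery
  # and helps explain the push toward 48V (and liquid cooling above ~450W).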

Check out the design yourself: You can read about the proposed OCP Accelerator Module (OAM) in more detail here at the Open Compute Project (OCP) site.

Why this matters: As AI goes through its industrialization phase, we can expect people to invest more in the fundamental infrastructure which AI equipment requires. It’ll be interesting to see the extent to which there is demand for a standardized AI accelerator module; signs of such demand will likely come from low-cost Asian-based original design manufacturers (ODMs) producing standardized chassis that use this design.
  Read more: Sharing a common form factor for accelerator modules (Facebook Code).

#####################################################

Want to reduce gender bias in a trained language model? Existing techniques may not work in the way we thought they did:
…Analysis suggests that ‘debiasing’ language models is harder than we thought…
All human language encodes within itself biases. When we train AI systems on human language, we tend to reflect the biases inherent to the language and to the data it was trained on. For this reason, word embeddings derived from AI systems trained over large corpuses of news datasets will frequently associate people of color with the concept of crime, while linking white people to professions. Similarly, these embeddings will tend to express gendered biases, with close concepts to a man being something like ‘king’ or ‘professional’, while a woman will typically be proximate to concepts like ‘homemaker’ or ‘mother’. Tackling these biases is complicated, requiring a mixture of careful data selection at the start of a project, and the application of algorithmic de-biasing techniques to trained models.

Now, researchers with Bar-Ilan University and the Allen Institute for Artificial Intelligence, have conducted an analysis that calls into question the effectiveness of some of the algorithmic methods used to debias models. “We argue that current debiasing methods… are mostly hiding the bias rather than removing it”, they write.

The researchers compare the embeddings in two different methods – Hard-Debiased (Bolukbasi et al) and GN-GloVe (Zhao et al) – which have both been modified to reduce apparent gender bias within trained models. They analyze the difference between the biased and debiased versions of each approach, essentially by examining the spatial relationships between embeddings from the two versions. They find that these debiasing methods work mostly by shifting the problem to other parts of the models, so though they may fix some biases, other ones remain.
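
The ‘gender direction’ removal that these debiasing methods build on is itself a simple projection: subtract each word vector’s component along a direction such as he − she. Below is a minimal sketch of that projection, with random vectors standing in for real embeddings; it illustrates the operation being critiqued, not the authors’ analysis code:

  import numpy as np

  rng = np.random.default_rng(0)
  dim = 300
  embeddings = {w: rng.normal(size=dim) for w in
                ["he", "she", "doctor", "nurse", "receptionist", "captain"]}

  # Gender direction, as in hard-debiasing: the (normalised) he - she difference.
  g = embeddings["he"] - embeddings["she"]
  g = g / np.linalg.norm(g)

  def debias(v, direction):
      # Remove the component of v that lies along the gender direction.
      return v - np.dot(v, direction) * direction

  debiased = {w: debias(v, g) for w, v in embeddings.items()}
  # The paper's point: even after this projection, implicitly gendered words
  # still cluster together, so the bias is hidden rather than removed.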

Three failures of debiasing: The specific failure modes they observe are as follows:

  • Words with strong previous gender bias are easy to cluster together
  • Words that receive implicit gender from social stereotypes (e.g. receptionist, hair-dresser, captain) still tend to group with other implicit-gender words of the same gender
  • The implicit gender of words with prevalent previous bias is easy to predict based on their vectors alone

  Why this matters: The authors say that “while suggested debiasing methods work well at removing the gender direction, the debiasing is mostly superficial. The bias stemming from world stereotypes and learned from the corpus is ingrained much more deeply in the embeddings space.”
  Studies like this suggest that dealing with issues of bias will be harder than people had anticipated, and highlights how much of the bias aspects of AI come from the real world data such systems are being trained on containing such biases.
  Read more: Lipstick on a Pig: Debiasing Methods Cover up Systematic Gender Biases in Word Embeddings But do not Remove Them (Arxiv).

#####################################################

Transfer learning with drones:
…Want to transfer something from simulation to reality? Add noise, and make some of it random…
Researchers at the University of Southern California in Los Angeles have trained a drone flight stabilization policy in simulation and transferred it to multiple different real-world drones.

Simulate, noisily: The researchers add noise to a large number of aspects of the simulated quadcopter platform, as well as varying the motor lag on the simulated drone, creating synthetic data which they use to train more flexible policies. “To avoid training a policy that exploits a physically implausible phenomenon of the simulator, we introduce two elements to increase realism: motor lag simulation and a noise process,” they write. They also model noise for sensor and state estimation.
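
Randomization of this kind usually means re-sampling the simulator’s physical parameters and injecting sensor and motor noise every episode, so the policy never overfits one exact dynamics model. A rough sketch of the sampling step; the parameter names and ranges below are illustrative assumptions, not the paper’s exact values:

  import numpy as np

  rng = np.random.default_rng()

  def sample_quadrotor_params():
      # Re-sampled at the start of each training episode.
      return {
          "mass_kg": rng.uniform(0.025, 0.600),        # Crazyflie up to heavier frames
          "motor_lag_tau_s": rng.uniform(0.05, 0.20),  # first-order motor lag constant
          "thrust_to_weight": rng.uniform(1.5, 2.5),
      }

  def noisy_state(true_state, pos_std=0.005, vel_std=0.01):
      # Gaussian noise on the estimated state fed to the policy.
      noise = np.concatenate([
          rng.normal(0, pos_std, 3),   # position estimate noise (m)
          rng.normal(0, vel_std, 3),   # velocity estimate noise (m/s)
      ])
      return true_state + noise

  params = sample_quadrotor_params()
  state = noisy_state(np.zeros(6))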

Transfer learning: They train the (simulated) drones using Proximal Policy Optimization (PPO) with a cost function designed to encourage stable flight of the drone platforms. They sanity-check the trained policies by running them in a different simulator (in this case, Gazebo with the RotorS package) and observing how well they generalize. "This sim-to-sim transfer helps us verify the physics of our own simulator and the performance of policies in a more realistic environment," they write.
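  A minimal sketch of what a stability-oriented cost might look like is below, assuming it penalizes position error, velocity, angular rates, and tilt; the paper's actual cost terms and weights may differ.

```python
# Minimal sketch of a stability-oriented cost for hovering; the RL reward
# would then be the negative of this cost. Terms and weights are assumptions.
import numpy as np

def stability_cost(pos_err, vel, ang_vel, tilt, w=(1.0, 0.1, 0.05, 0.5)):
    # Each term is a squared error the policy should drive toward zero to
    # keep the quadrotor hovering stably at the target position.
    return (w[0] * np.dot(pos_err, pos_err)
            + w[1] * np.dot(vel, vel)
            + w[2] * np.dot(ang_vel, ang_vel)
            + w[3] * tilt ** 2)

# Example: small position error, zero velocity, some residual rotation.
print(stability_cost(np.ones(3) * 0.1, np.zeros(3), np.ones(3) * 0.2, 0.05))
```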

  They also validate their system on three real quadcopters, built around the 'Crazyflie 2.0' platform. "We build heavier quadrotors by buying standard parts (e.g., frames, motors) and using the Crazyflie's main board as a flight controller," they explain. They demonstrate that their policy generalizes across the different drone platforms, and show through ablations that adding noise and doing physics-based modelling of the systems during training further improves performance.

Why this matters: Approaches like this show how people are increasingly able to substitute (cheap) compute for (costly) real-world data; in this case, the researchers use compute to simulate drones, extend the simulated data with synthetically generated noise and other perturbations, and then transfer the result into the real world. Further exploring this kind of transfer learning will give us a better sense of the 'economics of transfer', and may allow us to build economic models that describe the tradeoffs between spending $ on compute for simulated data and collecting real-world data.
  Read more: Sim-to-(Multi)-Real: Transfer of Low-Level Robust Control Policies to Multiple Quadrotors (Arxiv).
  Check out the video here: Sim-to-(Multi)-Real: Transfer of Low-Level Robust Control Policies to Multiple Quadrotors (YouTube).

#####################################################

Tech Tales

The sense of being looked at

Every day, it looks at something different. I spend my time, like millions of other people on the planet, working out why it is looking at that thing. Yesterday, the system looked at hummingbirds, and so any AI-operable camera in the world not deemed “safety-critical” spent the day looking at – or searching for – hummingbirds. The same was true of microphones, pressure sensors, and the various other actuators that comprise the inputs and outputs of the big machine mind.

Of course we know why the system does this at a high level: it is trying to understand certain objects in greater detail, likely as a consequence of integrating some new information from somewhere else that increases the importance of knowing about these objects. Maybe the system saw a bunch of birds recently and is now trying to better understand hummingbirds as a consequence? Or maybe a bunch of people have been asking the system questions about hummingbirds and it now needs to have more awareness of them?

But we’re not sure what it does with its new insights, and it has proved difficult to analyze how the system’s observation of an object changes its relationship to it and representation of it.

So you can imagine my surprise when I woke up today to find the camera in my room trained on me, and a picture of me on my telescreen, and then as I left the house to go for breakfast all the cameras on the street turned to follow me. It is studying me, today, I suppose. I believe this is the first time it has looked at a human, and I am wondering what its purpose is.

Things that inspired this story: Interpretability, high-dimensional feature representations, the sense of being stared at by something conscious.

 

Import AI 137: DeepMind uses (Google) StreetLearn to learn to navigate cities; NeuroCuts learns decent packet classification; plus a 490k labelled image dataset

The robots of the future will learn by playing, Google says:
…Want to solve tasks effectively? Don’t try to solve tasks during training!…
Researchers with Google Brain have shown how to make robots smarter by showing them what it means to play without a goal in mind. Google does this by collecting a dataset via people tele-operating a robot in simulation. During these periods of teleoperation, the people are playing around, using the robot hand and arm to interact with the world around them without a specific goal in mind, so in one scene a person might pick up a random object, in another they might fiddle around with a door on a piece of furniture, and so on.

Google saves this data, calling it 'Learning from Play data' (LfP). It feeds this into a system that attempts to classify such playful sequences of actions, mapping them into a latent space. Meanwhile, another module in the system looks across the latent space and proposes sequences of actions that could shift the robot from its current state to its goal state.
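  Below is a minimal sketch of that idea (not the paper's actual architecture): one network embeds a window of play into a latent 'plan', a second proposes plans from only the current and goal states, and a policy decodes actions conditioned on state, goal, and plan. All module choices, shapes, and the simple matching loss are assumptions for illustration.

```python
# Sketch of a latent-plan setup in the spirit described above; module
# choices and the MSE matching loss stand in for whatever the paper uses.
import torch
import torch.nn as nn

class PlanRecognizer(nn.Module):      # maps a play sequence -> latent plan
    def __init__(self, obs_dim, act_dim, latent_dim=32, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(obs_dim + act_dim, hidden, batch_first=True)
        self.to_latent = nn.Linear(hidden, latent_dim)

    def forward(self, obs_seq, act_seq):
        h, _ = self.rnn(torch.cat([obs_seq, act_seq], dim=-1))
        return self.to_latent(h[:, -1])   # use the final hidden state

class PlanProposer(nn.Module):        # maps (current state, goal) -> latent plan
    def __init__(self, obs_dim, latent_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, latent_dim))

    def forward(self, obs, goal):
        return self.net(torch.cat([obs, goal], dim=-1))

class PlanConditionedPolicy(nn.Module):  # decodes actions from (obs, goal, plan)
    def __init__(self, obs_dim, act_dim, latent_dim=32, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * obs_dim + latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, act_dim))

    def forward(self, obs, goal, plan):
        return self.net(torch.cat([obs, goal, plan], dim=-1))

# Training intuition: make the proposer's plan match the recognizer's plan
# for the same play window, and make the policy reconstruct logged actions.
obs_dim, act_dim = 10, 4
recognizer, proposer = PlanRecognizer(obs_dim, act_dim), PlanProposer(obs_dim)
policy = PlanConditionedPolicy(obs_dim, act_dim)
obs_seq, act_seq = torch.randn(8, 16, obs_dim), torch.randn(8, 16, act_dim)
plan = recognizer(obs_seq, act_seq)
proposed = proposer(obs_seq[:, 0], obs_seq[:, -1])
match_loss = ((plan - proposed) ** 2).mean()   # align proposer with recognizer
pred_actions = policy(obs_seq[:, 0], obs_seq[:, -1], plan)
print(pred_actions.shape, match_loss.item())
```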

Multi-task training: Google evaluates this approach by comparing the performance of robots trained with play data against policies that use behavioural cloning to learn to complete tasks from task-specific demonstration data. The tests show that robots which learn from play data are more robust to perturbations than ones trained without it, and typically reach higher success rates on most tasks.
  Intriguingly, systems trained with play data display some other desirable traits: "We find qualitative evidence that play-supervised models make multiple attempts to retry the task after initial failure", the researchers write. "Surprisingly we find that its latent plan space learns to embed task semantics despite never being trained with task labels".

Why this matters: Gathering data for robotics work tends to be expensive, difficult, and prone to distribution problems (you can gather a lot of data, but you may subsequently discover that some quirk of the task or your robot platform means you need to go and re-gather a slightly different type of data). Being able to instead have robots learn behaviors primarily through cheaply-gathered non-goal-oriented play data will make it easier for people to experiment with developing such systems, and could make it easier to create large datasets shared between multiple parties. What might the ‘ImageNet’ for play robotics look like, I wonder?
  Read more: Learning Latent Plans from Play (Arxiv).

#####################################################

Google teaches kids to read with AI-infused ‘Bolo’:
…Tuition app ships with speech recognition and text-to-speech tech…
Google has released Bolo, a mobile app for Android designed to help Indian children learn to read. Bolo ships with 'Diya', a software agent that acts as a reading tutor for children.

Bilingual: "Diya can not only read out the text to your child, but also explain the meaning of English text in Hindi," Google writes on its blog. Bolo ships with 50 stories in Hindi and 40 in English. Google says it found that 64% of children who interacted with Bolo showed an improvement in reading after three months of usage.
  Read more: Introducing ‘Bolo’: a new speech based reading-tutor app that helps children learn to read (Google India Blog).

#####################################################

490,000 fashion images… for science:
…And advertising. Lots and lots of advertising, probably…
Researchers with SenseTime Research and the Chinese University of Hong Kong have released DeepFashion2, a dataset containing around 490,000 images of 13 clothing categories from commercial shopping stores as well as consumers.

Detailed labeling: In DeepFashion2, “each item in an image is labeled with scale, occlusion, zoom-in, viewpoint, category, style, bounding box, dense landmarks and per-pixel mask,” the researchers write. “To our knowledge, clothing pose estimation is presented for the first time in the literature by defining landmarks and poses of 13 categories that are more diverse and fruitful than human pose”, the authors write.
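  To make those label types concrete, here is a purely hypothetical schema for one annotated clothing item, mirroring the labels listed above; it is not the dataset's actual on-disk format.

```python
# Illustrative (hypothetical) structure for one annotated clothing item,
# mirroring the label types listed in the paper's description.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ClothingItemAnnotation:
    category: str                              # one of the 13 clothing categories
    style: int                                 # style group id
    scale: int                                 # e.g. small / moderate / large
    occlusion: int                             # degree of occlusion
    zoom_in: int                               # degree of zoom-in
    viewpoint: int                             # e.g. frontal, side/back, not worn
    bounding_box: Tuple[int, int, int, int]    # x1, y1, x2, y2 in pixels
    landmarks: List[Tuple[float, float, int]]  # (x, y, visibility) per landmark
    mask_rle: str                              # per-pixel mask, e.g. run-length encoded

item = ClothingItemAnnotation(
    category="short sleeve top", style=1, scale=2, occlusion=1, zoom_in=1,
    viewpoint=1, bounding_box=(34, 60, 210, 320),
    landmarks=[(50.0, 80.0, 2), (190.0, 85.0, 2)], mask_rle="...")
print(item.category, item.bounding_box)
```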

The second time is the charm: DeepFashion2 is a follow-up to DeepFashion, which was released in early 2017 (see: Import AI #33). DeepFashion2 has 3.5X as many annotations as DeepFashion.

Why this matters: It's likely that various industries will be altered by widely-deployed AI-based image analysis systems, and the fashion industry seems a probable candidate: image-analysis techniques could be used to automatically track changing fashion trends in the world, with those insights used to alter the sorts of clothing being developed, or how it is marketed.
  Read more: DeepFashion2: A Versatile Benchmark for Detection, Pose Estimation, Segmentation and Re-Identification of Clothing Images (Arxiv).
  Get the DeepFashion data here (GitHub).

#####################################################

Facebook tries to shine a LIGHT on language understanding:
…Designs a MUD complete with netherworlds, underwater aquapolises, and more…
Facebook researchers have built LIGHT, a research environment that places humans and AI agents within a text-based multi-player dungeon (MUD). This MUD consists of 663 locations, 3462 objects, and 1755 individual characters. It also ships with data: Facebook has already collected a set of around 11,000 interactions between humans roleplaying characters in the game.

Graveyards, bazaars, and more: LIGHT contains a surprisingly diverse gameworld – not that the AI agents which play within it will care. Locations that AI agents and/or humans can visit include the countryside, forest, castles (inside and outside) as well as some more bizarre locations like a “city in the clouds” or a “netherworld” or even an “underwater aquapolis”.

Actions and emotions: Characters in LIGHT can carry out a range of physical actions (eat, drink, get, drop, etc) as well as express emotive actions (’emotes’) like to applaud, blush, wave, etc.

Results: To test out the environment, the researchers train some baseline models to predict actions, emotes, and dialogue. They find that a system based on Google’s ‘BERT’ language model (pre-trained on Reddit data) does best. They also perform some ablation studies which indicate that models that are successful in LIGHT use a lot of context, depending on numerous streams of data (dialogue, environment descriptions, and so on).

Why this matters: Language is likely fundamental to how we interact with increasingly powerful systems. I think figuring out how to work with such systems will require us to interact with them in increasingly sophisticated environments, so it’ll be interesting to see how rapidly we can improve performance of agents in systems like LIGHT, and learn whether those improvements transfer over to other capabilities as well.
  Read more: Learning to Speak and Act in a Fantasy Text Adventure Game (Arxiv).

#####################################################

NeuroCuts replaces packet classification systems with learned behaviors:
…A future where the way computers talk to each other is learned, not hand-programmed…

In the future, the majority of the ways our computers talk to each other will be managed by customized, learned behaviors derived by AI systems. That's the gist of a recent spate of research, which has ranged from using AI approaches to learn computer tasks like creating and maintaining database indexes, to figuring out how to automatically search through large documents.

Now, researchers with the University of California at Berkeley and Johns Hopkins University have developed NeuroCuts, a system that uses deep reinforcement learning to figure out how to do effective network packet classification. This is an extremely low-level task, requiring precision and reliability. The deep RL approach works, meaning that “our approach learns to optimize packet classification for a given set of rules and objective, can easily incorporate pre-engineered heuristics to leverage their domain knowledge, and does so with little human involvement”.
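  For a sense of the framing, here is a minimal sketch of treating decision-tree construction for packet classification as a sequential decision problem: at each node an agent picks which field to cut and how many ways to cut it, and is scored on the resulting tree depth. A random policy stands in for the learned one, and every specific (rule format, reward, cut sizes) is an assumption for illustration rather than the NeuroCuts design.

```python
# Minimal sketch: build a packet-classification decision tree by letting a
# policy choose cuts at each node, then score the tree by its depth.
import random

# A rule matches a hyper-rectangle over packet fields; here we use 2 fields,
# so a rule is ((lo0, hi0), (lo1, hi1)).
def make_rules(n=64, span=256, rng=random.Random(0)):
    rules = []
    for _ in range(n):
        lo0, lo1 = rng.randrange(span - 16), rng.randrange(span - 16)
        rules.append(((lo0, lo0 + 16), (lo1, lo1 + 16)))
    return rules

def overlaps(rule, region):
    # A rule belongs in a tree node if it intersects the node's region.
    return all(r_lo < reg_hi and reg_lo < r_hi
               for (r_lo, r_hi), (reg_lo, reg_hi) in zip(rule, region))

def build_tree(rules, region, policy, depth=0, leaf_size=4, max_depth=12):
    # Returns the depth of the deepest leaf under this node.
    if len(rules) <= leaf_size or depth >= max_depth:
        return depth
    dim, n_cuts = policy(rules, region)          # the agent's action
    lo, hi = region[dim]
    step = max(1, (hi - lo) // n_cuts)
    deepest = depth
    for i in range(n_cuts):
        sub = list(region)
        sub[dim] = (lo + i * step, hi if i == n_cuts - 1 else lo + (i + 1) * step)
        sub_rules = [r for r in rules if overlaps(r, sub)]
        if sub_rules:
            deepest = max(deepest, build_tree(sub_rules, tuple(sub), policy,
                                              depth + 1, leaf_size, max_depth))
    return deepest

def random_policy(rules, region):
    # Stand-in for a learned policy: pick a random field and cut count.
    return random.randrange(len(region)), random.choice([2, 4, 8])

rules = make_rules()
depth = build_tree(rules, ((0, 256), (0, 256)), random_policy)
print("tree depth:", depth, "reward:", -depth)  # RL would maximize -depth
```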

Effectiveness: “NeuroCuts outperforms state-of-the-art solutions, improving classification time by 18% at the median and reducing both time and memory usage by up to 3X,” they write.

Why this matters: Adaptive systems tend to be more robust to failures than brittle ones, and one of the best ways to increase the adaptiveness of a system is to let it learn in response to inputs; approaches like applying deep reinforcement learning to problems like network packet classification precede a future where many fundamental aspects of how computers connect to each other will be learned rather than programmed.
  Read more: Neural Packet Classification (Arxiv).

#####################################################

DeepMind teaches agents to navigate cities with ‘StreetLearn’:
…Massive Google Street View-derived dataset asks AI systems to navigate across New York and Pittsburgh…
Have you ever been lost in a city, and tried to navigate yourself to a destination by using landmarks? This happens to me a lot. I usually end up focusing on a particularly tall & idiosyncratic building, and as I walk I update my internal mental map in reference to this building and where I suspect my destination is.

Now, imagine how useful it would be if AI systems could perform such feats of navigation when they get lost. That's some of the idea behind StreetLearn, a new DeepMind-developed dataset & challenge that asks agents to learn how to navigate across urban areas, and in doing so develop smarter, more general systems.

What is StreetLearn? The dataset is built as "an interactive, first-person, partially-observed visual environment that uses Google Street View for its photographic content and broad coverage", and its creators "give performance baselines for a challenging goal-driven navigation task," DeepMind writes. StreetLearn initially consists of two large areas within Pittsburgh and New York City, and is made up of a set of geolocated 360-degree panoramic views, which form the nodes of a graph. In the case of New York City this includes around 56,000 images, and in the case of Pittsburgh about 58,000. The two maps are further sub-divided into distinct regions.

Challenging agents with navigation tasks: StreetLearn is designed for developing reinforcement learning agents, so it makes five actions available to an agent: slowly rotate the camera view left or right, rapidly rotate the camera view left or right, and move forward if there is free space ahead. The system can also provide the agent with a specific goal, such as an image of a destination or a natural language instruction to follow.

Tasks, tasks, and tasks: To start with, DeepMind has created a 'Courier' task, in which the agent starts from a random position and must get to within approximately one city block of another randomly chosen location, receiving a higher reward if it takes a shorter route between the two locations.
   DeepMind has also developed the “coin_game” in which agents need to find invisible coins scattered throughout the map, and three types of ‘instruction game’, where agents use navigation instructions to get to a goal.
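  Here is a minimal sketch of what an interface with those five actions and a courier-style reward could look like; the class, method names, and reward shape are assumptions for illustration, not the actual StreetLearn API.

```python
# Sketch of a panorama-graph environment with the five discrete actions
# described above and a courier-style reward that pays more for shorter
# routes to a goal panorama. All specifics are assumed for illustration.
import random

ACTIONS = ["slow_rotate_left", "slow_rotate_right",
           "fast_rotate_left", "fast_rotate_right", "move_forward"]

class CourierEnvSketch:
    def __init__(self, graph, rng=None):
        self.graph = graph              # panorama id -> list of neighbour ids
        self.rng = rng or random.Random(0)
        self.reset()

    def reset(self):
        nodes = list(self.graph)
        self.position = self.rng.choice(nodes)
        self.goal = self.rng.choice(nodes)
        self.steps = 0
        return self.position

    def step(self, action):
        assert action in ACTIONS
        self.steps += 1
        # Rotations would change the agent's heading; this sketch ignores
        # heading, so only move_forward changes the panorama.
        if action == "move_forward" and self.graph[self.position]:
            self.position = self.rng.choice(self.graph[self.position])
        done = self.position == self.goal
        # Courier-style reward: a payoff on reaching the goal that shrinks
        # with the number of steps taken (shorter routes earn more).
        reward = max(1.0, 10.0 - 0.01 * self.steps) if done else 0.0
        return self.position, reward, done

# Toy panorama graph with four connected nodes.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
env = CourierEnvSketch(graph)
obs = env.reset()
obs, r, done = env.step("move_forward")
print(obs, r, done)
```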

Why this matters: Navigation is a base-of-the-pyramid task: if we can develop computers that are good at navigation, we should be able to build a large number of second-order applications on top of that capability.
  Read more: The StreetLearn Environment and Dataset (Arxiv).

#####################################################

Reproducibility and other research norms:
…Exploring the tension between reproducible research and enabling abuses…
John Langford, creator of Vowpal Wabbit and a researcher at Microsoft Research, has waded into the ongoing debate about reproducibility within AI research.

The debate: Currently, the AI community is debating how to make more AI work reproducible. Today, some AI research papers are published without code or datasets. Some researchers think this should change, and that papers should always come with code and/or data. Other researchers (e.g. Nando de Freitas at DeepMind) think that while reproducibility is important, there are some cases where you might want to publish a paper but restrict dissemination of some details so as to minimize potential abuses or malicious uses of the technology.

Reproducibility is nice, but so are other things: “Proponents should understand that reproducibility is a value but not an absolute value. As an example here, I believe it’s quite worthwhile for the community to see AlphaGoZero published even if the results are not necessarily easily reproduced.”

Additive conferences: What Langford proposes is adding some optional things to the community, like experimenting with whether reviewers can review papers more effectively if they also have access to code or data, and exploring how authors may or may not benefit from releasing code. These policies are essentially being trialled at ICML this year, he points out. "Is there a need for[sic] go further towards compulsory code submission?" he writes. "I don't yet see evidence that default skeptical reviewers aren't capable of weighing the value of reproducibility against other values in considering whether a paper should be published".

Why this matters: I think figuring out how to strike a balance between maximizing reproducibility and minimizing potential harms is one of the main challenges of current AI research, so blog posts like this will help further this debate. It’s an important, difficult conversation to have.
  Read more: Code submission should be encouraged but not compulsory (John Langford blog).

Tech Tales:

Be The Boss

It started as a game and then, like most games, it became another part of reality. Be The Boss was a massively multiplayer online (MMO) game that was launched in the latter half of the third decade of the 21st century. The game saw players work in a highly-gamified "workplace" based on a 1990s-era corporate 'cube farm'. Player activities included: undermining coworkers, filing HR complaints to deal with rivals, filling up a 'relaxation meter' by temporarily 'escaping' the office for coffee and/or cigarettes and/or alcohol. Players enjoyed this game, writing reviews praising it for its "gritty realism", describing it as a "remarkable document of what life must have been like in the industrial-automation transition period".

But, as with most games, the players eventually grew bored. Be The Boss lacked the essential drama of other smash-hit games from that era, like Hospital Bill Crisis! and SPECIAL ECONOMIC ZONE. So the designers of Be The Boss created an add-on to the game that delivered on its name: previously, players competed with each other to rise up the hierarchy of the corporation, but had no real ability to change the rules of the game. With the expansion, this changed, and successful players were entered into increasingly grueling tournaments where the winner – whose identity was kept secret – would be allowed to "Be The Boss" of the entire gameworld, letting them subtly alter the rules of the game. It was this invention that assured the perpetuity of Be The Boss.

Now, all people play is Be The Boss, and rumors get swapped online about which rule was instituted by which boss: who decided that the water fountains should periodically dispense water laced with enough pheromones to make different player-characters fall in love with each other? Who realized that they could save millions of credits across the entire corporate game world by reducing the height of all office chairs by one inch or so? And who made it so that one in twenty of every sent email would be shuffled to a random person in the game world, instead of the intended recipient?

Much of our art is now based on Be The Boss. We don’t talk about the asteroid miners or the AI psychologists or the people climbing the mountains of Mars: we talk about Joe from Accounting saving The Entire Goddamn Company, or how Susan from HR figured out a way to Pay People Less And Make Them Happy About It. Kids dream of what it would have been like to work in the cubes, and fantasize about how well they could have done.

Things that inspired this story: the videogame ‘Cart Life‘; MMOs; the highest form of capitalism is to disappear from normal life and run the abstract meta-life that percolates into normal life; transmutation; digital absolution.