Import AI

Import AI 243: Training AI with fractals, RL-trained walking robots; and the European AI fund makes grants to 16 organizations

Uh-oh, we can use reinforcement learning to get robots to walk now:
…Berkeley researchers walk, turn, and squat across the Sim2Reality gap…
Researchers are getting better at crossing the ‘simulation to reality’ gap. That’s the implication of new research from the University of California at Berkeley, where researchers train the bipedal ‘Cassie’ robot to walk in simulation, then transfer the software onto a physical robot – and it works. The Cassie robots are made by Agility Robotics and cost “low-mid six figures” (Import AI 180).

How it works: The technique works by training a reinforcement learning controller to teach Cassie to walk in-sim via the use of a specialized Hybrid Zero Dynamics (PDF) gait library along with domain randomization techniques. This is a good example of the hybrid approach which tends to dominate robotics – use reinforcement learning to help you figure something out, but don’t be afraid to use some prior knowledge to speed up the learning process (that’s where HZD comes in). The use of domain randomization is basically just a way to cheaply generate additional training data.
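  As a rough illustration of the domain randomization half of that recipe (a generic sketch, not the authors' code – the physics parameters and ranges here are invented for the example), you can resample the simulator's dynamics at the start of every training episode so the policy never overfits to one version of 'reality':

```python
import random
from dataclasses import dataclass

@dataclass
class SimPhysics:
    """Hypothetical physics parameters a locomotion simulator might expose."""
    ground_friction: float
    link_mass_scale: float
    motor_torque_scale: float
    sensor_latency_s: float

def sample_randomized_physics(rng: random.Random) -> SimPhysics:
    # Per-episode randomization: the policy sees slightly different dynamics
    # every episode, which encourages robustness when transferred to hardware.
    return SimPhysics(
        ground_friction=rng.uniform(0.4, 1.2),
        link_mass_scale=rng.uniform(0.85, 1.15),
        motor_torque_scale=rng.uniform(0.9, 1.1),
        sensor_latency_s=rng.uniform(0.0, 0.02),
    )

rng = random.Random(0)
for episode in range(3):
    physics = sample_randomized_physics(rng)
    # sim.reset(physics); collect a rollout; update the RL controller; repeat.
    print(episode, physics)
```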

How well does it work: The results are impressive – in a video accompanying the research, Cassie walks over surfaces of various textures, can be hit or disrupted by an external human, and even balances loads of varying weights. “This paper is the first to develop a diverse and robust bipedal locomotion policy that can walk, turn and squat using parameterized reinforcement learning,” they write.
  Read more: Reinforcement Learning for Robust Parameterized Locomotion Control of Bipedal Robots (arXiv).
  Watch video: Reinforcement Learning for Robust Parameterized Locomotion Control of Bipedal Robots (YouTube).

###################################################

European AI Fund makes its first grants:
…$1.8 million to strengthen AI policy in Europe…
The European AI Fund, a fund supported by a bunch of different philanthropic orgs (ranging from Ford to Mozilla), has announced it is providing €1.55 million (~$1.8 million) to 16 organizations working to improve AI policy, ethics, and governance in Europe.

The winning orgs and what they’ll do: Some of the orgs include well-known technically-oriented organizations (such as Access Now and Algorithm Watch), and others include groups like Friends of the Earth and the Irish Council for Civil Liberties, which are starting to turn their attention toward AI.

Why this matters: AI has rapidly shifted from an exciting part of scientific research to a technology with broad societal implications. Infusions of funding like this will help a greater chunk of society think about and debate the future of AI, which may help to increase trust in the space as a whole.
  Read more: Announcing our open call grantees (European Artificial Intelligence Fund).

###################################################

DeepMind lays out some safety issues with language models and potential interventions:
…Now that language models can produce intelligible text, how do we ensure they’re doing what we want?…
Soon, the internet will be full of words generated by neural language models. These models, like GPT-3, will animate customer support agents, write articles, provide advice to people, and carry out an innumerable range of functions. Now, a team of researchers at DeepMind have tried to think about what safety issues are implied by these increasingly capable magical typewriters. Put simply: language models are complicated and their safety issues will require quite a lot of work by a bunch of people to make progress on.

What are the safety issues of language models?: The research focuses on two things: ways in which developers could ‘misspecify’ a language model, and also “behavioural issues due to misalignment of the [language model] agent – unintended direct/first-order harms that are due to a fault made by the system’s designer”, the researchers write.

Misspecification: When developing language models, data is one area of potential misspecification, because many of the datasets used for training language models are created by crawling the web, or by reusing existing crawls (e.g, CommonCrawl). Even when you try to filter these datasets, you’re unlikely to filter out everything you want to exclude. There’s also a secondary data issue – as more language models get deployed, a larger amount of the internet will contain LM-written data, which could introduce pathological flaws into LMs trained on it.
  Another area is the training process itself, where the algorithms you choose to train these things can influence their behavior. Finally, there’s the matter of ‘distributional shift’ – these LMs are trained in a general way, which means that once trained they can get prompted with anything in their context window – including nonsense. Creating LMs that can automatically spot out-of-distribution questions or statements is an open challenge.
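  One crude heuristic here (my illustration, not something the paper proposes) is to flag inputs the model itself finds very surprising, e.g. by thresholding perplexity under the same or a smaller language model:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # average next-token cross-entropy
    return float(torch.exp(loss))

for prompt in ["The cat sat on the mat.", "qzx vlorp 77 snd ..!! kkkk"]:
    print(prompt, "->", round(perplexity(prompt), 1))
# One could treat anything above a chosen perplexity threshold as out-of-distribution;
# picking that threshold well is itself part of the open problem.
```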

Behavioural issues: The larger issue this research covers is behavior – specifically, how LMs can manifest a range of behaviors which could have downstream safety impacts. These include:
– Deception: Language models could deceive people by, for instance, withholding salient information.
– Manipulation: Language agents could try to manipulate the humans that interact with them, for instance by getting a human to do something that benefits the agent by bypassing the human’s ability to carry out ‘rational deliberation’, by causing the human to adopt a ‘faulty mental state’, or by otherwise placing the human under pressure (for instance, by overtly threatening them unless they carry out an action).
– Harmful content: Language agents “may give harmful and biased outputs”, both accidentally and in response to intentional priming by a human user.
– Objective gaming: In reinforcement learning, we’ve seen multiple examples of AI agents ‘gaming the system’, for instance, by fulfilling the letter of an objective but not the spirit (e.g, racking up points to maximize a game’s score without actually completing the level). Right now, this might be going on with language models, but we lack real-world examples to refer to.

Why this matters and what we need to do: These are all weighty, complex problems, and the DeepMind researchers don’t outline many solutions, beyond recommending that more of the machine learning community focuses efforts on understanding these alignment issues. “We urge the community to focus on finding approaches which prevent language agents from deceptive, manipulative and harmful behaviour,” they say.
  Read more: Alignment of Language Agents (arXiv).

###################################################

Why does measurement in AI matter? A talk by me:
…It’s not a coincidence Import AI focuses so much on metrics, I think this really matters…
We write a lot about measurement here at Import AI. Why is that? First, it’s because quantitative measures are a helpful lens through which to view the progression of the AI field as a whole. Second, it’s because metrics are measures and measures are the things that drive major policy decisions. The better we get at creating metrics around specific AI capabilities and assessing systems against them, the more of a chance we have to create the measures that are a prerequisite for effective policy regimes.
  I care a lot about this – which is why I also co-chair the AI Index at Stanford University. Last week, I gave a lecture at Stanford where I discussed the 2021 AI Index report and also gave some ambitious thoughts about measurement and how it relates to policy. Thoughts and feedback welcome!
  Watch the talk here: Jack Clark: Presenting the 2021 AI Index (YouTube).

###################################################

Using AI to improve game design:
…Google makes a fake game better using AI…
In the future, computer games might be tested by AI systems for balance before they’re unleashed on humans. That’s the idea in a new blog post from Google, which outlines how the company used AI to simulate millions of games of a virtual card game called ‘Chimera’, then analyzed the results to find ways in which the game was imbalanced. By using computers to play the games, instead of people, Google was able to do something that previously took months and generate useful data in days.
  Read more: Leveraging Machine Learning for Game Development (Google AI Blog).

###################################################

Pre-training on fractals, then fine-tuning on images just might work:
…No data? No problem. FractalDB looks somewhat useful…
We write a lot about the data requirements of AI here at Import AI. But what would happen if machine learning algorithms didn’t need as much expensive data? That’s the idea behind FractalDB (Import AI 234), a dataset composed of computationally-generated fractals (and sub-components of fractals), which can be used as input fuel for training some systems. New research from the Tokyo Institute of Technology investigates FractalDB in the context of training Vision Transformers (ViT), which have recently become one of the best ways to build computer vision systems.

Is FractalDB as useful as ImageNet? Not quite, but… They find that pre-training on FractalDB is less effective than pre-training on ImageNet for a range of downstream computer vision tasks, but – and this is crucial – it’s not that bad. Put another way: training on entirely synthetic images yields performance close to, but not quite matching, training on real images. And these synthetic images can be procedurally generated from a pre-written ruleset – in other words, the dataset has a seed which generates it, so it’s very cheap relative to normal data. This is, I think, quite counterintuitive – we wouldn’t naturally expect this kind of thing to work as well as it does. I’ll keep tracking FractalDB with interest – I wonder if we’ll start to see people augment other pre-training datasets with it as well?
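  FractalDB's images are rendered from randomly sampled iterated function systems (IFS); here's a minimal sketch of that core idea (my own illustration, not the FractalDB pipeline – rasterization, category balancing, and filtering are all omitted):

```python
import numpy as np

def sample_ifs(n_maps: int, rng: np.random.Generator):
    """Sample a random IFS: a set of contractive affine maps (A, b)."""
    maps = []
    for _ in range(n_maps):
        A = rng.uniform(-1, 1, size=(2, 2))
        A *= 0.8 / max(np.linalg.norm(A, 2), 1e-6)  # force a contraction so points stay bounded
        b = rng.uniform(-1, 1, size=2)
        maps.append((A, b))
    return maps

def chaos_game(ifs, n_points: int, rng: np.random.Generator) -> np.ndarray:
    """Repeatedly apply a randomly chosen map to a point; the orbit traces a fractal."""
    x = np.zeros(2)
    pts = np.empty((n_points, 2))
    for i in range(n_points):
        A, b = ifs[rng.integers(len(ifs))]
        x = A @ x + b
        pts[i] = x
    return pts

rng = np.random.default_rng(0)
ifs = sample_ifs(n_maps=4, rng=rng)      # one sampled IFS ~ one synthetic 'class'
points = chaos_game(ifs, 100_000, rng)   # these points would get rasterized into training images
print(points.min(axis=0), points.max(axis=0))
```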
  Read more: Can Vision Transformers learn without Natural Images? (arXiv).

###################################################

Major AI conference makes a checklist to help researchers be more ethical:
…Don’t know where to start with your ‘Broader Impacts’ statement? This should help…
Last year, major AI conference NeurIPS asked researchers to submit ‘Broader Impacts’ statements along with their research papers. These statements were meant to cover some of the potential societal effects of the technologies being proposed. The result was that a bunch of researchers spent a while thinking about the societal impact of their work and writing about these effects with varying degrees of success.

Enter, the checklist: To help researchers be more thorough in this, the NeurIPS program chairs have created a checklist. This list is meant “to encourage authors to think about, hopefully address, but at least document the completeness, soundness, limitations, and potential negative societal impact of their work. We want to place minimal burden on authors, giving authors flexibility in how they choose to address the items in the checklist, while providing structure and guidance to help authors be attentive to knowledge gaps and surface issues that they might not have otherwise considered,” they say. (Other resources exist, as well, like guides from the Future of Humanity Institute, #198).

What does the checklist ask? The checklist provides a formulaic way for people to think about their work, asking them if they’ve thought about the (potential) negative societal impacts of their work, if they’ve described limitations, if their system uses personally identifiable information or “offensive content” (which isn’t defined), and so on.

Why this matters: AI is in its pre-Hippocratic-oath era. We don’t have common ethical standards for practitioners in the AI community, nor much direct ethics education. By encouraging authors to add Broader Impacts statements to their work – and making it easier for them to think about creating these statements – NeurIPS is helping to further the ethical development of the field of AI as a whole. Though it’s clear we need much more investment and support in this area to help our ethical frameworks develop as richly as our technical tooling.
  Read more: Introducing the NeurIPS 2021 Paper Checklist (NeurIPS blog).
  Check out the paper checklist here (NeurIPS official website).

###################################################

Tech Tales:

The Drone that Got Lost
[Rural England, 2030]

It knew it was lost because it stopped getting a signal telling it that it was on track.

According to its diagnostic systems, a fault had developed with its GPS system. Now, it was flying through the air, but it did not know where it was. It had records of its prior locations, but not of its current one.

But it did know where it was going – the GPS coordinate was in a database, as was, crucially, the name of the city: Wilderfen. It sent a message back to its origination station, attaching telemetry from its flight. It would be seconds or, more likely, minutes, until it could expect a reply.

At this point, the backup system kicked in, which told the drone that it would first seek to restore GPS functionality and then, given the time-critical nature of the package the drone was conveying, would seek to get the drone to its intended location.

A few milliseconds passed and the system told the drone that it was moving to ‘plan B’ – use other sensory inputs and AI augmentations to reacquire the location. This unlocked another system within the drone’s brain, which began to use an AI tool to search over the drone’s vision sensors.
– Street sign: 95% probability, said the system. It drew a red bounding box around a sign that was visible on a road, somewhere beneath and to the East of the drone.
– Because the confidence was above a pre-wired 90% baseline, the drone initiated a system that navigated it closer to the sign until it was able to check for the presence of text on the sign.
– Text: 99%, said the system, once the drone had got closer.
– Text parsed as “Wilderfen 15 miles”.
– At this point, another pre-written expert system took over, which gave the drone new instructions: follow roadway and scan signs. Follow the signs that point towards Wilderfen.

So the drone proceeded like this, periodically zooming down from the sky until it could read street signs, then parsing the information and returning to the air. It arrived around two hours later, and delivered its confidential payload to a man with a thin face, standing outside a large, unmarked datacenter.

But it was not able to return home – the drone contained no record of its origination point, due to the sensitive nature of what it was carrying. Instead, a human was dispatched to come and find the drone, power it down, place it into a box, then drive it to wherever its ‘home’ was. The drone was not permitted to know this, nor did it have the ability to access systems that might let it infer it for itself. Broader agency was only given under special circumstances and the drone was not yet sophisticated enough to independently desire that agency for itself.

But the human driving the car knew that one day the drone might want this agency. And so as they drove they found their eyes periodically staring into the mirror inside the car, looking at the carrycase on the backseat, aware that something slumbered inside which would one day wake up.

Technical things that inspired this story: Multimodal models like CLIP that can be used to parse/read text from visual inputs; language models; reinforcement learning; instruction manuals; 10 drones that the FAA recently published airworthiness criteria for

Import AI 242: ThreeDWorld, 209 delivery drone flights, Spotify transcripts versus all the words in New York City

Want data for your NLP model? Get 600 million words from Spotify podcasts:
…ML-transcribed data could help people train better language models, though this scale is dwarfed by everyday spoken language…
Spotify has released a dataset of speech from ~100,000 podcasts on the streaming service. The data consists of the audio streams as well as their accompanying text, which was created through transcription via Google’s speech-to-text API (therefore, this isn’t gold standard ‘clean’ text, but rather slightly fuzzy and heterogeneous due to slight errors in the API). The dataset consists of 50,000 hours of audio and 600 million words. Spotify built the dataset by randomly sampling 105,360 podcast episodes published between January 2019 and March 2020, then filtered for English (rather than multilingual) data, length (cut out ‘non-professionally published episodes’ longer than 90 minutes), and also speech (optimized for podcasts where there’s a lot of talking).

Why this matters: There’s a lot of text data in the world, but that text data is absolutely dwarfed by the amount of verbal data. Corpuses like this could help us figure out how to harness fuzzily transcribed audio data to train better models, and may provide a path to creating more representative models (as this lets you capture people who don’t write words on the internet).

Spotify versus New York City: verbal versus text scale: To get an intuition for how large the space of verbal speech is, we can do some napkin math: one study says that the average person speaks about 16,000 words a day, and we know the population of New York City is around 8.5 million. Let’s take a million off of that to account for non-verbal young children, old people that don’t have many conversations, and some general conservative padding. Now let’s multiply 7.5 million by 16,000: 120,000,000,000. Therefore, though Spotify’s 600 million words is cool, it’s only 0.5% of the words spoken in New York in a given day. Imagine what happens if we start being able to automatically transcribe all the words people say in major cities – what kind of models could we make?
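  Here's that napkin math written out, in case you want to fiddle with the assumptions:

```python
words_per_person_per_day = 16_000
speaking_population = 7_500_000        # ~8.5M New Yorkers, minus conservative padding
nyc_words_per_day = speaking_population * words_per_person_per_day
spotify_corpus_words = 600_000_000

print(f"{nyc_words_per_day:,} words spoken per day")             # 120,000,000,000
print(f"{spotify_corpus_words / nyc_words_per_day:.1%} of that")  # 0.5%
```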
  Find out more about the dataset: 100,000 Podcasts: A Spoken English Document Corpus (ACL Anthology)
  Get the data via requesting via a form here (replies may take up to two weeks): Spotify Podcast Dataset (Spotify).

###################################################

Ousted facial recognition CEO returns to Kairos to work on bias:
…Brian Brackeen returns for “Kairos 3.0”…
Brian Brackeen, former CEO of facial recognition company Kairos, has returned to the company that let him go in 2018, to lead an advisory council focused on AI bias.

An ethical stance that led to an ousting: Back in mid-2018, Brackeen said he thought the use of facial recognition in law enforcement and government surveillance “is wrong – and that it opens the door for gross conduct by the morally corrupt” (Import AI 101). Brackeen backed up his comments by saying Kairos wouldn’t sell to these entities. By October of that year, Kairos had fired Brackeen and also sued him (Miami Herald). Now, the lawsuits have been settled in Brackeen’s favor, the board members and employees that fired him have left, and he is back to work on issues of AI bias.

A “Bias API”: Brackeen will help the company develop a “Bias API” which companies can use to understand and intervene on racial biases present in their algorithms. “This is Kairos 3.0”, Brackeen said.
  Read more: ‘This is Kairos 3.0’: Brian Brackeen returns to company to continue work on AI bias (Refresh Miami).

###################################################

Multilingual datasets have major problems and need to be inspected before being used to train something – researchers:
…Giant team looks at five major datasets, finds a range of errors with knock-on effects on translation and cultural relations writ large…
An interdisciplinary team of researchers has analyzed 230 languages across five massive multilingual datasets. The results? High-resource languages – that is, widely spoken and digitized languages such as English and German – tend to be of good quality, but low-resource languages tend to do poorly. Specifically, they find the poorest quality for African languages. They also find a lot of errors in datasets which consist of romanized script from languages commonly written in other scripts (e.g, Urdu, Hindi, Chinese, Bulgarian).

What they did: The researchers looked at five massive datasets – CCAligned, ParaCrawl v7.1, WikiMatrix, OSCAR, and mC4, then had 51 participants from the NLP community go through each dataset, sampling some sentences from the languages, and grading the data on quality. 

An error taxonomy for multilingual data: They encountered a few different error types, like Incorrect Translation (but the correct language), Wrong Language (where the source or target is mislabeled, e.g, English is tagged as German), and Non-Linguistic Content (where there’s non-linguistic content in either the source or target text). 
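  Once humans have applied labels from a taxonomy like this, computing the headline numbers is simple bookkeeping; here's a tiny sketch (the labels and language codes below are made up, not from the audit):

```python
from collections import Counter

# Hypothetical audit labels for sampled sentences: "C" = correct,
# "X" = incorrect translation, "WL" = wrong language, "NL" = non-linguistic content.
audit = [("de", "C"), ("de", "C"), ("de", "X"),
         ("yo", "WL"), ("yo", "NL"), ("yo", "C")]

totals, correct = Counter(), Counter()
for lang, label in audit:
    totals[lang] += 1
    correct[lang] += (label == "C")

for lang in totals:
    print(lang, f"{correct[lang] / totals[lang]:.0%} of sampled sentences correct")
```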

How bad are the errors: Across the datasets, the proportion of correct samples ranges from 24% (WikiMatrix) to 87% (OSCAR). Some of the errors get worse when you zoom in – CCAligned, for instance, contains 7 languages where 0% of the encountered sentences were labelled as correct, and 44 languages where less than 50% of them were labeled as such.
  Porn: >10% of the samples for 11 languages in CCAligned were labelled as porn (this problem didn’t really show up elsewhere).
  Standards and codes: There are other errors and inconsistencies across the datasets, which mostly come from the use of wrong or inconsistent labels for language pairs, sometimes using sign language codes for high-resource languages (this was very puzzling to the researchers), or using a multitude of codes for the same language (e.g, Serbian, Croatian, Bosnian, and Serbo-Croatian all getting individual codes in the same dataset).

Why this matters: Multilingual datasets are going to be key inputs into translation systems and other AI tools that let us cross linguistic and cultural divides – so if these multilingual datasets have a bunch of problems with the more obscure and/or low-resource languages, there will be knock-on effects relating to communication, cultural representation, and more.
  “We encourage the community to continue to conduct such evaluations and audits of public datasets – similar to system comparison papers – which would help anyone interested in using these datasets for practical purposes,” the researchers write.
  Read more: Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets (arXiv).

###################################################

Supervillains, rejoice – you now have data to help you make a robotic cheetah:
…Finally, a solution to a problem everyone encounters…
For decades, AI researchers have looked to the natural world for inspiration. This is particularly true of locomotion, where our planet is full of creatures that hop, skip, jump, and sprint in ways we’d like our machines to emulate. Now, researchers with the South African National Research Foundation, the University of Cape Town, the University of Tsukuba in Japan, and the Ecole Polytechnique Federale de Lausanne (EPFL) in Switzerland have recorded ten cheetahs running, so they can build a dataset of cheetah movement.

Why cheetahs are useful: Cheetahs are the fastest land mammal, so it could be useful to study how they run. Here, the researchers create a large-scale annotated dataset, consisting of ~120,000 frames of multi-camera-view high-speed video footage of cheetahs sprinting, as well as 7588 hand-annotated images. Each annotated image is labeled with 20 key points on the cheetah (e.g, the location of the tip of the cheetah’s tail, its eyes, knees, spine, shoulders, etc). Combined, the dataset should make it easier for researchers to train models that can predict, capture, or simulate cheetah motion.
  Read more: AcinoSet: A 3D Pose Estimation Dataset and Baseline Models for Cheetahs in the Wild (arXiv).
  Get the data from here when it’s available (African Robotics Unit).

###################################################

Testing robots by putting them in a dreamworld:
…ThreeDWorld asks simulated robots to play virtual cleanup…
MIT and Stanford researchers have built ThreeDWorld, a software environment for testing out virtually embodied AI agents. They’re also hosting a challenge at this year’s CVPR conference to figure out how close – or far – we are from building AI systems that can autonomously navigate around simulated houses to find objects and bring them to a predetermined place. This is the kind of task that our AI systems will have to be able to solve, if we want to eventually get a home robot butler.

What’s ThreeDWorld like? You wake up in one of 15 houses. You’re a simulated robot with two complex arms, each with 9 degrees of freedom. You can move yourself around and you have a mission: find a vase, two bottles, and a jug, and bring them to the bed. Now, you explore the house, using your first-person view to map out the rooms, identify objects, collect them, and move them to the bedroom. If you succeed, you get a point. If you fail, you don’t. At the end of your task, you disappear.
  ^ the above is a lightly dramatized robot-POV description of ThreeDWorld and the associated challenge. The simulation contains complex physics including collisions, and the software provides an API to AI agents. ThreeDWorld differs from other embodied robot challenges (like AI2’s ‘Thor’ #73, and VirtualHome) by modelling physics to a higher degree of fidelity, which makes the learning problem more challenging.

Reassuringly hard: Pure RL systems trained via PPO can’t easily solve this task. The authors develop a few other baselines that play around with different exploration policies, as well as a hierarchical AI system. Their results show that “there are no agents that can successfully transport all the target objects to the goal locations”, they write. Researchers, start your computers – it’s challenge time!
  Read more: The ThreeDWorld Transport Challenge: A Visually Guided Task-and-Motion Planning Benchmark for Physically Realistic Embodied AI (arXiv).
  More information at the CVPR 2021 challenge website.
  Get the code for ThreeDWorld and the data for the challenge from here (GitHub).

###################################################

What can we learn from 209 robot delivery drone flights?
…We’re about to live in the era of low-flying robots, so we better understand them…
Right now, hundreds (and probably thousands) of different companies are using drones around the world to do increasingly complicated tasks. Many companies are working on package delivery, e.g, 7 of the 10 companies working with the US FAA to gain expanded drone licenses are working on some form of delivery (Import AI #225). So it’d be helpful to have more data about delivery drones and how they work in the (air) field.
  Enter researchers from Carnegie Mellon, the University of Pennsylvania, and Baden-Wuerttemberg Cooperative State University, who have recorded the location and electricity consumption of a DJI Matrice 100 quadcopter during 209 delivery flights, carried out in 2019.

What’s the data useful for? “The data available can be used to model the energy consumption of a small quadcopter drone, empirically fitting the results found or validating theoretical models. These data can also be used to assess the impacts and correlations among the variables presented and/or the estimation of non-measured parameters, such as drag coefficients”, the researchers write.
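  As a hedged sketch of what 'empirically fitting' such data might look like (the telemetry rows below are invented placeholders, not values from the dataset, and the real logs have many more fields), here's a simple least-squares energy model:

```python
import numpy as np

# Hypothetical rows: (payload_kg, ground_speed_m_s, power_w).
telemetry = np.array([
    [0.00, 4.0, 310.0],
    [0.25, 4.0, 335.0],
    [0.50, 6.0, 372.0],
    [0.75, 6.0, 401.0],
    [1.00, 8.0, 446.0],
])
payload, speed, power = telemetry.T

# Fit a simple linear model P ~ a*payload + b*speed + c by least squares.
X = np.column_stack([payload, speed, np.ones_like(payload)])
(a, b, c), *_ = np.linalg.lstsq(X, power, rcond=None)
print(f"P = {a:.1f}*payload + {b:.1f}*speed + {c:.1f}")
```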
  Read more: In-flight positional and energy use data set of a DJI Matrice 100 quadcopter for small package delivery (arXiv).
  Get the drone flight telemetry data from here (Carnegie Mellon University).

###################################################

Tech Tales:
[Database on an archival asteroid, 3200 AD]

Energy is never cheap,
It always costs a little.

Thinking costs energy,
So does talking.

That’s why we’re quiet,
Because we’re saving it up.

^ Translated poem, told from one computational monolith to a (most translators agree there’s no decent English analog for the term) ‘child monolith’. Collected from REDACTED sector.

Import AI 241: The $2 million dataset; small GPT-3 replications; ImageNet gets a face-blur update

CUAD: A free $2 million legal dataset!
…Specific rather than general evaluation: okay, your model can understand language, but can it understand legal contracts?…
AI is moving from a technology of general, scientific interest, to one of broad commercial interest. Because of this, we’re seeing the way we evaluate AI change. Now, along with judging the performance of an AI system on a generic task (like classifying some images from ImageNet, or judging the quality of generative text outputs), we’re moving to evaluating performance on highly-specific tasks grounded in the real-world. This gives us a better understanding of where contemporary AI systems are strong and where they’re weak.
    One such specific evaluation comes to us from researchers at Berkeley and the Nueva School: CUAD, the Contract Understanding Atticus Dataset, is a dataset of legal contracts with expert annotations by lawyers. CUAD helps us test out how well AI systems can do on a specific, challenging task found in the real world.

What’s in CUAD? CUAD contains 500 contracts with 13,000 expert annotations across 41 label categories. The dataset is meant to test how well AI systems can highlight the parts of a contract that are relevant to a given label – a task the authors compare to “finding needles in a haystack”.
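  That framing – point at a contract, get back the span relevant to a label – is essentially extractive question answering. As a hedged sketch of the framing only (using Hugging Face's default QA pipeline and a made-up snippet of contract text, not the models the CUAD authors actually train and evaluate):

```python
from transformers import pipeline

qa = pipeline("question-answering")  # downloads a default SQuAD-style extractive QA model

contract = (
    "This Agreement shall be governed by and construed in accordance with the laws "
    "of the State of Delaware. Either party may terminate this Agreement upon "
    "thirty (30) days written notice to the other party."
)

for query in ["Which state's law governs the agreement?",
              "How much notice is required to terminate?"]:
    result = qa(question=query, context=contract)
    print(query, "->", result["answer"], f"(score {result['score']:.2f})")
```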

The $2 million dataset: CUAD was built using a bunch of expert law student annotators who received 70-100 hours of contract review training before they started labeling stuff, and each of their labels were validated by additional validators. Therefore, “a conservative estimate of the pecuniary value of CUAD is over $2 million (each of the 9283 pages were reviewed at least 4 times, each page requiring 5-10 minutes, assuming a rate of $500 per hour)”, the researchers note.
  Read more: CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review (arXiv).
  Get the dataset: Contract Understanding Atticus Dataset (CUAD) from here (Atticus Project website).

###################################################

Interested in speech but hate stitching together software? SpeechBrain could be for you:
…PyTorch-based software simplifies a bunch of fiddly tasks…
Speech: it’s how most humans transmit most of their information. And in recent years, advances in AI have made speech recognition significantly better and more efficient. But it’s still weirdly hard to use the full stack of speech capabilities – especially when we compare the usability of speech to things like text (where packages like HuggingFace’s ‘Transformers’ have made things relatively easy), or image recognition (where there are a ton of easy-to-use systems available).

Now, a team of researchers have built SpeechBrain, open source software “designed to be simple, extremely flexible, and user-friendly”, according to the website.

Key features: SpeechBrain ships with inbuilt models for speech recognition, speaker recognition, speech enhancement, speech processing (including multi-microphone processing), and a bunch of documentation and tools to aid researchers.
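  A minimal usage sketch, assuming the pretrained-model interface and the model identifier shown below (check SpeechBrain's docs and its Hugging Face organization for the names that actually ship – they may have changed):

```python
from speechbrain.pretrained import EncoderDecoderASR

# Assumed interface and model id; newer SpeechBrain releases may rename these.
asr = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr.transcribe_file("my_recording.wav"))
```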
  Get the code: SpeechBrain – A PyTorch powered Speech Toolkit (official website).

###################################################

Does Google want to open source GPT3?
…Recent outputs by the ethical AI team suggest ‘no’, while Google’s TFRC suggests ‘yes’…
Google isn’t publicly replicating GPT3, the large-scale NLP model developed by OpenAI. And some parts of Google – most notably its ethical AI team, formerly led by Timnit Gebru and Meg Mitchell – have published research about the ethical and safety issues of language models like GPT3 and Google’s BERT.
    Yet, Google is supporting the open source release of GPT3, because Google is supplying hundreds of thousands of dollars of compute per month via the Tensorflow Research Cloud (TFRC) to Eleuther, an AI organization whose goal is to replicate and release GPT3 (and even larger models). This is an action that neatly illustrates why AI policy is confusing and coordination (within companies or between them) is challenging.

GPT3-esque open source models: Eleuther has just published 1.3 billion- and 2.7 billion-parameter models designed to replicate GPT3, trained on ‘The Pile’, an 800GB dataset of text also developed by Eleuther. Eleuther trained these models using compute it accessed via the TFRC project (and TFRC understands that Eleuther’s goal is to replicate GPT-3).
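  The released models are usable through Hugging Face's transformers library; here's a minimal sketch, assuming the 'EleutherAI/gpt-neo-1.3B' identifier on the model hub (check Eleuther's hub page for the current names):

```python
from transformers import pipeline

# Model identifier assumed; see the EleutherAI organization on the Hugging Face hub.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-1.3B")
out = generator("The Pile is an 800GB dataset of", max_length=40, do_sample=True)
print(out[0]["generated_text"])
```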

Why is Jack writing about this? I got a bit of feedback from readers that some of this could be read as being an implicit judgement about who should/shouldn’t get access to compute – that’s not what I mean here. The way to read this is that I’m genuinely unsure if GPT-3 should or shouldn’t be replicated, I’m more concerned with the illegibility of one of the key organizations in the replication space – Eleuther is externally legible about its actions and ethos, and perhaps even TFRC is, but TFRC+Google is currently illegible – we don’t know who is making decisions, how the decisions interact with the rest of the organization, nor how TFRC may represent (or contradict) other policy and PR activities conducted by Google elsewhere.

Why this matters: Google’s actions here are confusing. On the one hand, the company publishes AI principles and periodically goes on publicity drives about ‘responsible AI’. On the other hand, Google is enabling the release of a class of models with some non-trivial ethical challenges via a process that lets it sidestep accountability. It’s hard for us to know what Google believes as an institution, here.

Factories are opinions: Right now, it’s as though Google has specific opinions about the products (software) it makes in its factories (datacenters), yet at the same time is providing unrestricted access to its factories (datacenters) to external organizations. It’d be interesting to understand the thinking here – does TFRC become the means by which Google allows open source models to come into existence without needing to state whether it has chosen to ‘release’ these models?
  Get the GPT-3 model code here (Eleuther GitHub).
  More information about TFRC+Eleuther here (Eleuther member, Stella Rose, Twitter).

###################################################

ImageNet: Now sanitised with blurred faces:
…As AI industrializes, datasets get cleaned up…
ImageNet, one of the most widely used datasets in machine learning, has been sanitised. Specifically, a team of researchers at Princeton and Stanford University have gone through the multi-million picture dataset and tried to blur the faces of every human within ImageNet. They call this “an attempt to mitigate ILSVRC’s privacy issues”. The paper is also notable because of the authors – Fei-Fei Li led the creation of the original dataset and is listed as an author.

What they did: The authors use Amazon’s ‘Rekognition’ service on all images in ILSVRC to find faces, then refine these results through human annotation via Amazon Mechanical Turk. They then blur the identified faces.

What effect does it have? Blurring means you remove information that was present in the image. Therefore, though only 3 of ImageNet’s categories relate to people, we might expect the blurring to lead to a reduction in the utility of the overall dataset. This seems to be the case: in tests, systems trained on the ‘blurred’ version of ImageNet do about 0.5 absolute points worse than the non-blurred versions. This is actually pretty good – it’s a negligible reduction in accuracy for a privacy bonus. Some categories do get affected more severely – specifically, the ‘mask’ and ‘harmonica’ categories now seem to do worse “as obfuscation removes visual cues necessary for recognizing them”.
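  The paper's actual pipeline is Rekognition plus Mechanical Turk validation; as a purely illustrative sketch of the blurring step itself (not the authors' code, and with a placeholder file path), OpenCV's bundled face detector plus a Gaussian blur looks like this:

```python
import cv2

image = cv2.imread("photo.jpg")
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    # Blur only the detected face region; the rest of the image keeps its information.
    image[y:y+h, x:x+w] = cv2.GaussianBlur(image[y:y+h, x:x+w], (51, 51), 0)

cv2.imwrite("photo_blurred.jpg", image)
```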

Who gets to be a scholar? This paper has attracted some controversy because of its relationship (or lack thereof) to earlier work done by Vinay Prabhu and Abeba Birhane, who in June of last year wrote a paper about the challenges created by large-scale datasets such as ImageNet – the face-blurring paper doesn’t mention much of this work. Prabhu says, in a blog post, the paper “appears to be a calculated and systematic erasure of the entire body of critique that our work was part of”.
  There’s some apparent merit to this case – Prabhu said they carried out a live Q&A with Fei-Fei Li about some of the issues with computer vision subsequently covered in their work. It’s not clear to me what the precise mechanics of this situation are, but the significant amount of public evidence here makes it feel worth mentioning. (One of the things I take from all of this is that the AI space may be starting to fracture into different research communities, with this incident seeming to indicate a rift forming between some researchers. We saw similar patterns with the Timnit Gebru and Margaret Mitchell situations at Google recently, as well.)

Why this matters: Today, the datasets to train AI are broadly unknown, undocumented, and unregulated. In the future, like with any key input to any important industrial process, we can expect datasets to be known, documented, and regulated. Techniques like applying blurring to faces post-dataset construction are useful to work on, because they give us a path we can use to convert today’s datasets into one better fit for the regulatory future. It also raises issues of dataset circulation – now that there’s an official, blurred-face ImageNet, where will the unblurred ‘black market ImageNet’ dataset circulate and who might use it?
  Read more: A Study of Face Obfuscation in ImageNet (arXiv).
  Get the code here (ImageNet Face Obfuscation, GitHub).
  Read more: A study of “A Study of Face Obfuscation in ImageNet” (Vinay Prabhu, blog).

###################################################

Now that reinforcement learning works, what will it do to the world?
…Researchers grapple with the societal implications of (semi-)autonomous agents…
Recently, reinforcement learning has started to work well enough to be applied to large, consequential deployments; RL-infused systems help create recommendation algorithms for social media, calibrate the power usage of equipment in Google’s datacenters, and are starting to teach robots how to move.
  Now, researchers with the Leverhulme Center for the Future of Intelligence in Cambridge, and Imperial College London, have written a paper analyzing the societal impacts of deep reinforcement learning. Their conclusion? We need to spend a bit more time thinking about the implications of these systems and coming up with effective oversight schemes to control them. “As more companies develop and deploy DRL systems with wide-ranging impacts on users, we must consider both how to ensure that these systems behave as intended over the long-term, and whose interests they are serving,” they write.

What should we do about RL systems?
As reinforcement learning systems get better, they’re going to be deployed more widely, which means they’ll continually explore a broader range of environments. This is mostly going to be good, but we’ll need to ensure we have adequate human oversight to stop them taking dangerous actions in high-risk situations. We’ll also need to closely observe RL-trained systems’ behavior, so we can be confident that their reward function doesn’t lead to pathological breakdowns.

Policy suggestions: One practical recommendation by the researchers is to “find ways to track progress in DRL and its applications” – I think this is a great idea! Something I’ve spent a few years doing at the AI Index is regularly tracking and analyzing technical progress. It’s been surprisingly difficult to do this on RL because, after a blissful few years in which most people competed with each other on the Atari-57 set of games, people are now testing RL in dissimilar, hard-to-compare environments. They also suggest researchers develop “notions of responsible DRL development” – by this, they basically mean splicing technical teams together with ethicists and safety-oriented people.
  Read more: The Societal Implications of Deep Reinforcement Learning (JAIR).

###################################################

800GB of cleaned, Common Crawl text:
…Fresh, processed data for researchers on a budget…
The Allen Institute for Artificial Intelligence (AI2) has published C4, a dataset of 800GB of cleaned English text data (along with a 6.3TB uncleaned variant). C4 is a massive dataset which was originally developed by Google to train its ‘T5’ natural language processing system.
  AI2 has uploaded the data into a requester-pays bucket in Google storage, which means the whole dataset will cost about $100 to download. By processing and uploading the datasets, AI2 has helped create a common-good dataset that would otherwise have been replicated privately by researchers around the world.
  Get the dataset here: Download the C4 dataset (GitHub, AI2 repo).
  More about the dataset here: C4 (TensorFlow website).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

A primer on AI safety:
DC thinktank CSET has released a 3-part primer on AI safety, offering a non-technical summary of the key problems and approaches. CSET uses a framework from DeepMind to split safety into three components:
– Knowing that an AI system will perform reliably in a diverse range of environments not encountered during training (robustness)
– Being able to understand why it behaves the way it does, and whether it will adhere to our expectations (assurance)
– Knowing how to specify its goals such that the goals align with the behavior we want it to manifest (specification).
  Read more: (1) Key Concepts in AI Safety—Overview; (2) Robustness and adversarial examples; (3) Interpretability.

——–

ARIA — the UK’s answer to DARPA:

The UK is launching an agency to fund “high-risk, high-reward” research in emerging technologies, modelled on the US’s DARPA program. The Advanced Research & Invention Agency (ARIA) will be led by a small group of experts and will operate independently from the government. It has been given initial funding of £800m over four years. The hope is that ARIA will be able to deliver funding to researchers with flexibility and speed, without unnecessary bureaucracy, and with a high tolerance for failure. ARIA is the brainchild of Dominic Cummings, who has long advocated for a DARPA-esque agency for the UK.

   Read more: Gov press release.
   Read more: Why Dominic Cummings fears the £800m research agency he championed will fail (NS).

###################################################

Tech Tales:

The 10,000 Faces of Confrontation
[A ‘young professional’ style apartment, up on the tenth to twentieth floor, in some part of the hot tech economy – San Francisco, Singapore, London, or wherever]

You stare into the SmartMirror and it’s not your face looking back at you, it’s the face of a boss who has been putting you through hell. You yell at the boss. Tell them your feelings. The boss looks hurt. They don’t apologize – you pre-programmed that ‘bias against yielding’ into the system – but you feel some catharsis at getting something of a rise out of them.

Each day, you have a conversation with a different person. You have your favorites, of course. Like the boss or the girlfriend or – of course – your mother and father. But there are other characters that you’re developing as well – a restaurant server who, you think, subtly insulted you. A celebrity whose adverts you have taken a dislike to.

The next day you stare into the SmartMirror and you make your girlfriend appear. You tell them you are disappointed in how they behaved last night. You explain you’re hurt by them. They try to explain themselves, but it’s just a language model taking your conversation and combining it with a response primed around being ‘conciliatory’. You tell them their excuses are not going to cut it.

The day after that, your SmartMirror “suggests” someone for you to talk to. An old friend of yours. “We believe this avatar will inspire a significant emotional response,” says the accompanying note. “We have determined that a significant emotional response interaction might help you”.

Things that inspired this story: Progress in multimodal learning; deepfakes and associated technologies; thinking about a ‘psychological tonal’; the general tendency of AI+Capitalism to lead to extraneous attempts at providing recommendations for the edges of life.

Import AI 240: The unbeatable MATH benchmark; an autonomous river boat dataset; robots for construction sites

Here’s another benchmark your puny models can’t solve – MATH!
…One area where just scaling things up doesn’t help…
SQuAD. SQuAD2. GLUE. SuperGLUE. All these benchmarks have melted in time, like hyperparameter tears in the rain, due to the onslaught of new, powerful AI models. So with a mixture of trepidation and relief let’s introduce MATH, a dataset of math problems that contemporary Transformer-based models can’t solve.

What’s MATH? MATH was made by researchers at UC Berkeley and consists of 12,500 problems taken from high school math competitions. The problems have five difficulty levels and cover seven subjects, including geometry. MATH questions are open-ended, mixing natural language and math across their problem statements and solutions. One example MATH question: “Tom has a red marble, a green marble, a blue marble, and three identical yellow marbles. How many different groups of two marbles can Tom choose?”
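  For the curious, that example has 7 distinct groups (the three yellow-plus-distinct-color pairs, the three pairs among red/green/blue, plus the yellow-yellow pair); a quick brute-force check:

```python
from itertools import combinations

marbles = ["red", "green", "blue", "yellow", "yellow", "yellow"]
groups = {tuple(sorted(pair)) for pair in combinations(marbles, 2)}
print(len(groups))      # 7 distinct groups of two marbles
print(sorted(groups))
```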

Bonus dataset: AMPS: Along with MATH, the authors have also built the Auxiliary Mathematics Problems and Solutions (AMPS) pre-training corpus, a 23GB data repository made of ~100,000 Khan Academy problems with step-by-step solutions written in Latex, as well as 5 million problems generated using Mathematica scripts.

Why this matters: Current AI systems can’t solve MATH: The best part about MATH is that it’s unbelievably difficult. GPT2 models get, at best, an average of 6.9% accuracy on the dataset (even in the most liberal human school, such a score would get an F), while GPT-3 models (which are larger than GPT-2 ones) seem to do meaningfully better than their GPT2 forebears on some tasks and worse on others. This is good news: we’ve found a test that large-scale Transformer models can’t solve. Even better – we’re a long, long way from solving it.
  Read more: Measuring Mathematical Problem Solving with the MATH Dataset (arXiv).
  Get the code from GitHub here.

###################################################

Want a pony that looks like Elvis? We can do that:
…Machine learning systems can do style generalization…
Here’s a fun Twitter thread where someone combines the multimodal CLIP system with StyleGAN, and uses a dataset from [Note: some chance of NSFW-ish generations] This Pony Does Not Exist (an infinite sea of GAN-generated my little ponies). Good examples include a pony-version of Billie Eilish, Beyonce, and Justin Bieber.

Why this matters: In the same way AI can generate different genres of text, ranging from gothic fiction to romantic poetry, we’re seeing evidence the same kinds of generative capabilities work for imagery as well. And, just as with text, we’re able to mix and match these different genres to generate synthetic outputs that feel novel. The 21st century will be reshaped by the arrival of endless, generative and recombinative media.
  Check out the twitter thread of generations here (Metasemantic’s Twitter thread).

###################################################

AI Index 2021: AI has industrialized. Now what?
…Diversity data is still scarce, it’s hard to model ethical aspects over time, and more…
The AI Index, an annual project to assess and measure AI progress, has published its fourth edition. (I co-chaired this years report and spent a lot of time working on it, so if you have questions, feel free to email me).
  This year’s ~200-page report includes analysis of some of the big technical performance trends of recent years, bibliometric analysis about the state of AI research in 2020, information about national investments into AI being made by governments, and data about the diversity of AI researchers present in university faculty (not good) and graduating PhDs (also not good). Other takeaways include data relating to the breakneck rates of improvement in AI research and deployment (e.g, the cost to train an ImageNet model on a public cloud has fallen from ~$2000 in 2017 to $7.43 last year), as well as signs of increasing investment into AI applications, beyond pure AI research.

Ethics data – and the difficulty of gathering it: One thing that stuck out to me about the report is the difficulty of measuring and assessing ethical dimensions of AI deployment – specifically, many assessments of AI technologies use one-off analysis for things like interrogating the biases of the model, and few standard tests exist (let’s put aside, for a moment, the inherent difficulty of building ‘standard’ tests for something as complex as bias).

What next? The purpose of the AI Index is to prototype better ways to assess and measure AI and the impact of AI on society. My hope is that in a few years governments will invest in tech assessment initiatives and will be able to use the AI Index as one bit of evidence to inform that process. If we get better at tracking and analyzing the pace of progress in artificial intelligence, we’ll be able to deal with some of the information asymmetries that have emerged between the private sector and the rest of society; this transparency should help develop better norms among the broader AI community.
  Read the 2021 AI Index here (AI Index website)
  Read more about the report here: The 2021 AI Index: Major Growth Despite the Pandemic (Stanford HAI blog).

###################################################

Want to train an autonomous river boat? This dataset might help:
…Chinese startup Orca Tech scans waterways with a robot boat, then releases data…
AI-infused robots are hard. That’s a topic we cover a lot here at Import AI. But some types of robot are easier than others. Take drones, for instance – easy! They move around in a broadly uncontested environment (the air) and don’t need many smart algorithms to do useful stuff. Oceangoing ships are similar (e.g, Saildrone). But what about water-based robots for congested, inland waterways? Turns out, these are difficult to build, according to Chinese startup Orca Tech, which has published a dataset meant to make it easier for people to add AI to these machines.

Why inland waterways are hard for robots: “Global positioning system (GPS) signals are sometimes attenuated due to the occlusion of riparian vegetation, bridges, and urban settlements,” the Orca Tech authors write. “In this case, to achieve reliable navigation in inland waterways, accurate and real-time localization relies on the estimation of the vehicle’s relative location to the surrounding environment”.

The dataset: USVInland is a dataset of inland waterways in China “collected under a variety of weather conditions” via a little robotic boat. The dataset contains information from stereo cameras, a lidar system, GPS antennas, inertial measurement units (IMUs), and three millimeter-wave radars. The dataset was recorded from May to August 2020 and the data covers a trajectory of more than 26km. It contains 27 continuous raw sequences collected under different weather conditions.

Why this matters: The authors tested out some typical deep learning-based approaches on the dataset and saw that they struggled to obtain good performance. USVInland is meant to spur others to explore whether DL algorithms can handle some of the perception challenges involved in navigating waterways.
  Read more: Are We Ready for Unmanned Surface Vehicles in Inland Waterways? The USVInland Multisensor Dataset and Benchmark (arXiv).
  Get the data from here (Orca Tech website).

###################################################

Hackers breach live feeds of 150,000 surveillance cameras:
…Now imagine what happens if they combine that data with AI…
A group of hackers have gained access to live feeds of 150,000 surveillance cameras, according to Bloomberg News. The breach is notable for its scale and the businesses it compromised, which included hospitals, a Tesla warehouse, and the Sandy Hook Elementary School in Connecticut.
  The hack is also significant because of the hypothetical possibilities implied by combining this data with AI – allow me to speculate: imagine what you could do with this data if you subsequently applied facial recognition algorithms to it and mixed in techniques for re-identification, letting you chart the movements of people over time, and identify people they mix with who aren’t in your database. Chilling.
  Read more: Hackers Breach Thousands of Security Cameras, Exposing Tesla, Jails, Hospitals (Bloomberg).

###################################################

Why your next construction site could be cleaned by AI:
…Real-world AI robots: Japan edition…
AI startup Preferred Networks and construction company Kajima Corporation have built ‘iNoh’, software that creates autonomous cleaning robots. iNoh uses multiple sensors, including LIDAR, to do real-time simultaneous localization and mapping (SLAM) – this lets the robot know roughly where it is within the building. It pairs this with a deep learning-based computer vision system which “robustly and accurately recognizes obstacles, moving vehicles, no-entry zones and workers”, according to the companies. The robot uses its SLAM capability to help it build its own routes around a building in real-time, and its CV system stops it getting into trouble.

Why care about Preferred Networks: Preferred Networks, or PFN, is a Japanese AI startup we’ve been tracking for a while. The company started out doing reinforcement learning for robots, set a new ImageNet training-speed record in 2017 (Import AI 69) and has been doing advanced research collaborations on areas like meta-learning (Import AI 113). This is a slightly long-winded way to say: PFN has some credible AI researchers and is generally trying to do hard things. Therefore, it’s cool to see the company apply its technology in a challenging, open-ended domain, like construction.

PyTorch++: PFN switched away from developing its own AI framework (Chainer) to PyTorch in late 2019.
  Read more: Kajima and PFN Develop Autonomous Navigation System for Construction Site Robots (Preferred Networks).
  Watch a (Japanese) video about iNoh here (YouTube).

###################################################

At last, 20 million real network logs, courtesy of Taiwan:
…See if your AI can spot anomalies in this…
Researchers with the National Yang Ming Chiao Tung University in Taiwan have created ZYELL-NCTU NetTraffic-1.0, a dataset of logs from real networks. Datasets like this are rare and useful, because the data they contain is inherently temporal (good! difficult!) in a non-expensive form (text strings are way cheaper to process than, say, the individual stills in a video, or slices of audio waveforms).

What is the dataset: ZYELL-NCTU NetTraffic-1.0 was collected from the outputs of firewalls in real, deployed networks of the telco ‘ZYELL’. It consists of around 22.5 million logs and includes (artificially induced) examples of probe-response and DDoS attacks taking place on the network.
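  As a generic, hedged illustration of the kind of approach the benchmark is probing (not the paper's protocol – the features and numbers below are invented), an unsupervised detector over simple per-connection features might look like this:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical per-connection features parsed from firewall logs:
# [duration_s, bytes_sent, bytes_received, distinct_destination_ports]
normal_traffic = np.column_stack([
    rng.exponential(2.0, 1000),
    rng.exponential(5e4, 1000),
    rng.exponential(2e5, 1000),
    rng.integers(1, 4, 1000),
])
probe_like = np.array([[0.05, 120.0, 0.0, 800.0]])  # short, tiny, fans out across many ports

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal_traffic)
print(detector.predict(probe_like))  # -1 means the connection is flagged as anomalous
```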

Why this matters: It’s an open question whether modern AI techniques can do effective malicious anomaly detection on network logs; datasets like this will help us understand their tractability.
  Read more: ZYELL-NCTU NetTraffic-1.0: A Large-Scale Dataset for Real-World Network Anomaly Detection (arXiv).
Where to (maybe) get the dataset: Use the official website, though it’s not clear precisely how to access it.

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

CSET’s Jason Matheny joins Biden Administration
Jason Matheny, founding director at Georgetown’s influential ‘CSET’ thinktank, is taking on three senior roles “at the intersection of technology and national security”: deputy assistant to the President for technology and national security; deputy director for national security at the OSTP; and coordinator for technology and national security at the National Security Council, per FedScoop. Previously, Matheny was director at IARPA, where—among other things—he spearheaded the forecasting program that incubated Tetlock’s influential superforecasting research.
Read more: Jason Matheny to serve Biden White House in national security and tech roles (FedScoop).

Podcast: Brian Christian on AI alignment:
Brian Christian is interviewed by Rob Wiblin on the 80,000 Hours podcast, about his book, The Alignment Problem (covered in Import #221), and lots else. It’s an awesome interview, which manages to be even more wide-ranging than the book — I strongly recommend both.
Podcast and transcript: Brian Christian on the alignment problem (80,000 Hours podcast).

Minor correction:
Last week I wrote that the NSCAI’s report suggested a $32bn investment in the domestic semiconductor industry over the next five years – the correct figure is $35bn.

###################################################

Tech Tales:

Tell me the weight of the feather and you will be ready
[A large-scale AI training infrastructure, 2026]

When you can tell me precisely where the feather will land, you will be released, said the evaluator.
‘Easy’, thought the baby artificial intelligence. ‘I predict a high probability of success’.

And then the baby AI marked the spot on the ground where it thought the feather would land, then told its evaluator to drop the feather. The feather started to fall and, buffeted by invisible currents in the air and their interplay with the barbs and vanes of the feather itself, landed quite far from where the baby AI had predicted.

Shall we try again? asked the evaluator.
‘Yes,’ said the baby. ‘Let me try again’.

And then the baby AI made 99 more predictions. At its hundredth, the evaluator gave it its aggregate performance statistics.
  ‘My predictions are not sufficiently accurate,’ said the baby AI.
  Correct, said the evaluator. Then the evaluator cast a spell that put the baby AI to sleep.
In the dreams of the baby AI, it watched gigantic feathers made of stone drop like anvils into the ground, and tiny impossibly thin feathers made of aerogel seem to barely land. It dreamed of feathers falling in rain and in snow and in ice. It dreamed of feathers that fell upward, just to know what a ‘wrong’ fall might look like. 

When the baby woke up, its evaluator was there.
Shall we go again, said the evaluator.
‘Yes,’ said the baby, its neurons lighting up in predictive anticipation of the task, ‘show me the feather and let me tell you where it will land’.
And then there was a feather. And another prediction. And another comment from its evaluator.

In the night, the baby saw even more fantastic feathers than the night before. Feathers that passed through hard surfaces. Feathers which were on fire, or wet, or frozen. Sometimes, multiple feathers at once.

Eventually, the baby was able to roughly predict where the feather would fall.
We think you are ready, said the evaluator to the baby.
Ready for what? said the baby.
Other feathers, said the evaluator. Ones we cannot imagine.
‘Will I be ready?’ said the baby.
That’s what this has been for, said the evaluator. We believe you are.
And then the baby was released, into a reality that the evaluator could not imagine or perceive.

Somewhere, a programmer woke up. Made coffee. Went to their desk. Checked a screen: `feather_fall_pred_domain_rand_X100 complete`.

Things that inspired this story: Domain randomization; ancient tales of mentors and mentees; ideas about what it means to truly know reality 

Import AI 239: China trains a massive 10b model, Vicarious does pick&place; the GCHQ publishes some of its thoughts on AI

China trains a 10 billion parameter multimodal network… using NVIDIA’s code:
…Chinese entities train a decent 10 billion parameter multi-modal model…
A hybrid team of researchers from Alibaba and Tsinghua University have built M6, a “Multi-Modality to Multi-Modality Multitask Mega-transformer”. M6 is a multi-modal model trained on a huge corpus of text and image data, including image-text pairs (similar to recent systems like OpenAI’s CLIP). M6 has a broad capability surface and, because of how it was trained, you can use text to search for images (or vice versa), generate media in different modalities, match images together, write poems, answer questions, and so on.

Data: ~60 million images (with accompanying text pairs) totalling 1.9 terabytes (almost twice the raw size of ImageNet), plus 292GB of text.

Facts and figures: Though the authors say they’ve trained a 10 billion and a 100 billion parameter model, they mostly report performance statistics for the 10 billion one. The 100b is a mixture-of-experts model, while the 10b is based on NVIDIA’s Megatron-LM training code (Import AI 218). The model’s size and sophistication are notable – this feels like a symptom of the maturing capabilities of various Chinese AI organizations. I wonder when we’ll get an M6-scale system from people affiliated with India, or regions like Europe or Africa.

Why this matters: M6 is notable for being a non-English model at equivalent scale to some of the largest primarily-English ones. We’re entering an era where there will be multiple, gigantic AI models, magnifying and minimizing different cultures with variations stemming from the organizations that trained them. It’s also interesting to consider how these models proliferate, and who will get access to them. Will students and researchers at Tsinghua get access to M6, or just Alibaba’s researchers, or both? And how might access schemes develop in other countries, as well?
…Finally, a word about bias: There’s no discussion of bias in the paper (or ethics), which isn’t typical for papers of this type but is typical of papers that come out of Chinese research organizations. If you’ve got counterexamples, please send them to me!
  Read more: M6: A Chinese Multimodal Pretrainer (arXiv).

###################################################

Facebook doesn’t even need labels to train its vision systems anymore (just your Instagram data):
…Self-supervised learning, at sufficient scale, might get us few-shot learning for free as well…
Self-supervised pre-training: SEER learns via a self-supervised method called SwAV, which lets it look at unannotated images and, given enough scale, derive features from them and cluster them itself. They train using a family of models called RegNets. The magic of this method comes from the data they use: a billion pictures from Instagram (though they note in the paper these are “non-EU” images, likely due to GDPR compliance).

Results: The best version of SEER gets 84.2% top-1 ImageNet accuracy, nicely improving on other self-supervised approaches. (Though there’s still a ways to go before these techniques match supervised methods, which are now getting around ~90% top-1 accuracy).

Few-shot learning, meet image recognition: SEER gets 77.9% top-1 accuracy on ImageNet after only seeing 10% of the images – suggesting that SEER can do a kind of few-shot learning, where by providing it with some data from a new domain it quickly adjusts itself to obtain reasonable performance. (Though several tens of thousands of images is quite different to the few sentences of text it takes to do few-shot learning in the text regime.)
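
Facebook hasn’t handed out a drop-in SEER recipe here, but the low-label pattern itself is simple: freeze a pretrained encoder, then fit a small head on whatever fraction of labels you have. A toy sketch with random data and a stand-in encoder rather than SEER’s weights:
```python
# Toy illustration of the low-label pattern: freeze a "pretrained" encoder and
# train only a small head on 10% of the labels. The encoder here is a random
# stand-in; in practice you would load self-supervised weights (e.g. a RegNet).
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())   # -> 32-d features
for p in encoder.parameters():
    p.requires_grad = False                                      # frozen backbone

head = nn.Linear(32, 10)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)

images = torch.randn(1000, 3, 64, 64)       # pretend dataset
labels = torch.randint(0, 10, (1000,))
subset = torch.randperm(1000)[:100]         # only 10% of the labels are used

for step in range(100):
    idx = subset[torch.randint(0, len(subset), (32,))]
    with torch.no_grad():
        feats = encoder(images[idx])
    loss = nn.functional.cross_entropy(head(feats), labels[idx])
    opt.zero_grad(); loss.backward(); opt.step()
```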

Why this matters: SEER is relatively simple, as is the network architecture they use. The amazing capabilities we see here (including the few-shot learning) primarily come from the scale of the datasets which are used, combined with the intentionally naive unlabelled training approach. “This result confirm that the recent progress of self-supervised learning is not specific to curated training set, like ImageNet, and could benefit a large range of applications associated with uncurated data,” they write.
  Read more: Self-supervised Pretraining of Visual Features in the Wild (arXiv).

###################################################

What does the UK’s NSA think about AI?
…Position paper hints at focus areas, discusses ethical issues, even IDs the elephant in the room…
The UK’s spy agency, GCHQ, has published a paper about how it hopes to use AI. This is notable; spy agencies rarely discuss frontier technologies. (Though don’t get too excited – the memo is unsurprisingly light on technical details.)

What information does the paper contain? GCHQ shares some thoughts on how it might use AI to aid some of its missions; these include:

  • AI for cyber threats: Use AI to identify malicious software, and also potentially to trace its distribution. 
  • AI for online safety for children: Use AI to identify online behaviors that look like adults ‘grooming’ kids for sexual exploitation, and use AI to analyze images found in the course of these investigations. (No mention, unlike the Germans (Import AI 234), of using AI to generate sexual imagery to help trap abusers.) 
  • AI for human trafficking: Use AI to map out the human networks that enable trafficking, and use AI to sift through vast amounts of financial data to find connections. 
  • AI for foreign state disinformation: Use AI to do fact-checking and detect synthetically generated content (e.g, deepfakes). Also, use AI to automatically identify and block botnets that use machine-generated accounts. 

What does GCHQ think are the major AI ethics challenges? Fairness and bias is listed as one major challenge. GCHQ also lists ‘empowerment’ – which it defines as figuring out how much freedom to give the AI systems themselves. GCHQ thinks AI is best used in partnership with humans: the AI comes up with answers and insights, then human experts use this to authorize or carry out actions.

AI policy is national security policy: In recent years, we’ve seen a vast migration of technology people moving from academia into industry, partially in response to skyrocketing salaries. This poses a challenge to modern spy agencies – government has a hard time paying as much as Google or Facebook, but it needs a similar caliber of talent to achieve its objectives. GCHQ says part of why it has written the paper is because of this new reality. “Most investment in the UK continues to come from the private sector rather than government and this is expected to continue,” the agency writes. “It is therefore unsurprising that GCHQ is now engaging more broadly with wider society and industry than at any other time in its history. We have much to learn from the exponential growth of AI in the outside world, and believe our specialists also have much to contribute.”
  Read more: Pioneering a New National Security, the Ethics of Artificial Intelligence (GCHQ, PDF).

###################################################

Google’s latest speech compression tech tells us that production AI is hybrid AI:
…End-to-end learning is nice, but the best things happen when you combine expertise…
Google has made Lyra, a more efficient speech codec. Lyra wraps in some recent ML advancements; it works by extracting features from input speech, quantizing them, then using a generative model to take these features and reinflate them into output speech.
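
Lyra’s actual decoder is a learned generative model, so the following is only a toy stand-in, but it shows the extract-features, quantize, reinflate shape of a codec like this:
```python
# Toy stand-in for the extract -> quantize -> reinflate structure. Lyra's real
# decoder is a learned generative model; here the "generative" step is just
# noise shaped by a coarse, quantized spectral envelope.
import numpy as np

def encode(audio, frame=320, bands=16, levels=64):
    frames = audio[: len(audio) // frame * frame].reshape(-1, frame)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    usable = bands * (spectra.shape[1] // bands)
    env = np.log1p(spectra[:, :usable].reshape(len(frames), bands, -1).mean(axis=2))
    scale = env.max()
    codes = np.clip((env / scale * (levels - 1)).round(), 0, levels - 1)
    return codes.astype(np.uint8), scale        # this is what crosses the wire

def decode(codes, scale, frame=320, levels=64):
    env = np.expm1(codes.astype(np.float32) / (levels - 1) * scale)
    out = []
    for band_gains in env:                               # one frame at a time
        spec = np.fft.rfft(np.random.randn(frame))       # white-noise spectrum
        reps = int(np.ceil(len(spec) / len(band_gains)))
        gains = np.repeat(band_gains, reps)[: len(spec)]
        out.append(np.fft.irfft(spec * gains, n=frame))  # noise shaped by the envelope
    return np.concatenate(out)

audio = np.random.randn(16000)        # one second of fake 16 kHz audio
codes, scale = encode(audio)
reconstruction = decode(codes, scale)
```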

Good speech with less data: Lyra is designed to operate with audio streams of as little as 3kbps – here, it does better than other codecs and compares favorably with Opus, an established speech codec. Lyra is notable because it smooshes together expert-derived stuff (which would be some of the traditional codec techniques used here) with a strategic use of a generative model and gets great performance and useful efficiency gains.

Fairness & ML: “We’ve trained Lyra with thousands of hours of audio with speakers in over 70 languages using open-source audio libraries and then verifying the audio quality with expert and crowdsourced listeners. One of the design goals of Lyra is to ensure universally accessible high-quality audio experiences,” the company writes.

Why this matters: AI is going to be everywhere. And it’s going to be everywhere in a Lyra-like manner – as a discrete, smart component within a larger technical stack. We’re also going to see people use more generative models to distill and reinflate representations of reality – we’re entering the dumb ‘brain in a jar’ phase of AI deployment.
  Read more: Lyra: A New Very Low-Bitrate Codec for Speech Compression (Google blog).
  Read more: Generative Speech Coding with Predictive Variance Regularization (arXiv).

###################################################

AI developer: I’m afraid of what happens if my code gets released:
…One lens on the ethics of open vs closed-source…
Is it safer for an AI system to be open source or for it to be controlled by a small set of actors? Generally, the technology community has leaned towards stuff being open source by default, but in recent years, people have been experimenting with the latter. This has happened with various types of synthetic media, like language models that haven’t been fully released (e.g, NVIDIA’s Megatron LMs, GPT2 [at first]), or various papers on synthetic media where the researchers don’t release the models. Now, a VP of AI at the faceswap app Reface has written a post laying out how he thinks about the release of certain AI technologies. His post is about AI body swapping – that is, taking one person’s face and potentially body and morphing it onto someone else in a video.

Demos get shady attention: “Only after I published a [AI body swap] demo in August 2020 and different shady organizations started approaching me, I realized that AI-body-swap is a bomb. A bomb in both senses – as a technological breakthrough and as something dangerous if its code gets into the wrong hands,” he writes. “A team of high-class ML-pros would find a way around my code in about a week. In roughly six months, they’d have a production-grade full-body swap technology.”

Why this matters: “We need to make a pact, a deal that all the companies that create synthetic media must include watermarks, footprints, or provide other detectors for identifying it,” he writes. A cynical person might say ‘business guy writes article about why his business-benefiting strategy is good, lol go figure’. There’s some merit to that. But a few years ago articles like this were a lot rarer – the AI community does seem to be becoming genuinely concerned about the consequences of its actions.
  Read more: The Implications of Open-Source AI: Should You Release Your AI Source (Hackernoon).

###################################################

Proof that robots are getting smarter: GreyOrange partners with AI startup Vicarious:
…Maybe AI+Robots is about to be a thing…
AI startup Vicarious has partnered with GreyOrange, a company that builds AI and robot systems for warehouses. Vicarious has a neuroscience-inspired approach to AI (which earlier helped it break the CAPTCHA security system, #66), which means its systems exhibit different capabilities to those made with deep learning techniques.

Why Vicarious? Vicarious’s tech has typically been good at solving problems involving spatial reasoning. You can get a sense of its approach by looking at papers like “Learning a generative model for robot control using visual feedback” and “From proprioception to long-horizon planning in novel environments: A hierarchical RL model”. (I hope to cover more of this research in Import AI in the future, but I’ll need to take some time to load their different approach into my brain.)

What they’re doing together: GreyOrange will integrate an AI capability from Vicarious into its ‘GreyMatter Fulfillment Operating System’ tech. Vicarious’s system will handle technology for autonomous vertical picking, which involves getting a robot to perceive “the size, shape and material characteristics of inventory items, including when these are loosely positioned in an unstructured fashion”, then approach, retrieve, and place items into order boxes. “Vicarious’ computer-vision and robotics technology is a breakthrough in the ability to handle unstructured, previously hard-to-grasp items,” said Vicarious co-founder Dileep George in a press release announcing the move.

Why this matters: The physical world is a huge challenge for AI. Most AI systems get trained in a purely digital context (e.g, computer vision systems get trained on digitized images of the world, and then are deployed in reality against… digital images of the world), whereas robots need to be trained in simulation then taught to generalize to the real world. This is especially challenging because of things like differences in the physics fidelity between simulators and reality, or hardware issues (e.g, air pressure/temperature/etc will screw around with the motor responses of certain robots, and the sun has a nasty habit of moving across the sky which continually changes illumination in outdoor/hybrid settings, throwing off vision systems).
  GreyOrange and Vicarious partnering up is a further symptom of the success of AI being applied to robotics. That’s a big deal: if we can get more flexible AI systems to work here, we can unlock tremendous economic value. Vicarious also isn’t the only company trying to revolutionize fulfillment with robotics – that’s also the focus of the (deep learning-based) startup, Covariant, among others. 
  Read more: GreyOrange and Vicarious Launch Autonomous Vertical Picking Solution for Apparel and Omnichannel Fulfillment (GlobeNewswire press release).
  Find out more about GreyOrange (GreyOrange site).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

NSCAI publishes final report on US AI strategy
The USA’s National Security Commission on AI has delivered its final report on US AI strategy. The report warns that the US risks being overtaken in technology without an acceleration in AI adoption, supported by substantial federal investment over the next five years.

Recommendations:

  • US military should work towards achieving ‘AI readiness’ by 2025: increasing DoD AI R&D spending to $8bn/year in 2026 (vs. $1.5bn in 2021); establishing a Digital Service Academy and National Digital Reserve Corps to address the talent deficit; more research into ensuring AI systems are robust and reliable.
  • US should embrace autonomous weapons and work with other nations to establish international standards and mitigate risks, while reaffirming DoD’s policy that human judgement be involved in any decision to kill. 
  • Overall federal funding for R&D should climb to at least 1% of GDP by 2026 (vs 0.6% in 2017).
  • Non-defense AI R&D funding should increase to $32bn/year (vs. $1.5bn in 2021); $32bn investment over five years in domestic semiconductor capacity (see Import 238).
  • To build a stronger AI workforce, the US should offer green cards to all STEM PhD graduates at US universities and double the number of employment-based visas, alongside substantially more funding for STEM education at all levels.
  • Establishing a Technology Competitiveness Council, tasked with developing and overseeing a National Technology Strategy, and coordinating efforts across government.

Read more: NSCAI report in full

————–

FAccT suspends Google sponsorship

ACM’s FAccT conference has paused its sponsorship by Google, following the turmoil and departures at the company’s Ethical AI team. Lead researchers Timnit Gebru and Margaret Mitchell were forced out earlier this year, after disputes around the company’s suppression of ethics research (see Import 226; 235).

   Read more: AI ethics research conference suspends Google sponsorship (VentureBeat) 


————–

Highlights from semiconductor substack:
– Mule’s Musings on Heterogeneous Compute; ASML and lithography; vertical monopolies; GPT-3.
– Deep Forest’s primers on semiconductor foundries (pt 1, pt 2).
– Employ America on the economics of the current chip shortage.

###################################################

Tech Tales:

The Speech for the Rebels
[2030: A country in central Africa where the US and China are fighting a proxy war primarily via the stoking of local political tensions]

They’d spent a few million dollars to digitize everything – and I do mean everything – that they’d gathered from the rebels. Then they started writing out speeches and testing the AI against them. The idea was that if you said something and the AI, which had been finetuned on all the digitized data, thought what you were saying had a low probability, then that told you that your speech was out of sync with the ‘mood’ inherent to the rebel group. On the other hand, if your speech was being predicted as likely by the AI system, that told you it might resonate.

Rhetorical Finetuning, the analysts called it, or R-FT.
Silver Tongue, was the name of the system we used.
The Mouth – that’s what we called it.
– “Go see how well The Mouth works on them”.
– “Oh, you’re back, I guess the mouth worked for you”.
– “Just tell ’em what the Mouth says and see what happens”.
– “Don’t improvise, the Mouth works”.

The strangest truth about The Mouth was it worked – really well. One classified document noted that “campaigns which utilized R-FT via Silver Tongue saw a 15% improvement in our post-engagement de-escalation metrics, resulting in a lowered casualty rate for warfighters in the region”.

So that’s why we ended up sticking The Mouth on our wrists. The AI runs inside a wearable which has a microphone – the bracelet glows green when we’re saying things that The Mouth predicts are probable and it glows red when we say things that aren’t. We spend a lot of time in training getting taught to not look directly at our wrists while talking, but rookies do it anyway. Now, when I give my talks – even improvised ones, after an incident, or to resolve something, or ask for a favor – I get my little signals from the bracelet and I say the words and keep it green.

I don’t know the language the local rebels speak. My translator picks up most of it, but not the things they whisper to each other, hands over their mouths, looking at me as I use the mouth to talk. What are they saying, I wonder?

Look, when the American tries to speak like us, their wrist flashes green.
What does the red mean, Father?
The red is when they are telling the truth, my Son.

Things that inspired this story: The miniaturization of communication technology for conflict; thinking about language models and how they could be weaponized for purposes of propaganda or other state-level objectives; thoughts about how AI might get integrated with warfare; various loosely connected ideas around how AI influences culture through re-magnification of things the AI picked up; the natural skepticism of all humans in all places to unfamiliar people giving them a convincing speech.

Import AI 238: Robots that fold clothes; how Bytedance censors its product; a differentiable simulator.

The apocalypse approaches: Robots can _almost_ fold towels now:
…The great white whale of robot manipulation approaches…
Berkeley researchers have built a system that can fold a range of fabrics more accurately than before. If that doesn’t sound impressive, you probably haven’t spent much time at the intersection of modern robotics, deep learning, and simulation. Training machines that can reliably manipulate fabrics is a long-standing goal for the robotics field, but the task is inherently challenging – fabrics constantly deform, exhibit complex physical dynamics, and are generally hard to efficiently simulate. Therefore, we don’t have contemporary robots today that can do useful tasks like folding clothes, tying ropes, and so on.

VisuoSpatial Foresight (VSF): Now, we’ve got closer – Berkeley researchers have taught a da Vinci surgical robot to carry out the task of fabric smoothing – that is, stretching out a disorganized piece of cloth until it is neatly unfolded – and fabric folding (folding a neatly unfolded piece of fabric) with 90% reliability. That’s not sufficient for production use, but it’s a meaningful research advance. VSF works by training a visual dynamics model on RGBD data (the ‘D’ depth component turns out to be very important) and seeking to learn the raw dynamics model (how you can expect the cloth to behave) in simulation. VSF uses this underlying model to help it plan out the appropriate actions to take to move from its current state (e.g, a messy piece of fabric) to a goal state (a neatly unfolded piece of fabric).
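
The project code is linked below; as a hedged illustration of the general visual-foresight recipe (a learned dynamics model plus a simple random-shooting planner scored against a goal image), here’s a toy sketch with stand-in shapes and models:
```python
# Toy sketch of the visual-foresight loop: a learned dynamics model predicts
# future observations from actions, and a random-shooting planner keeps the
# action sequence whose predicted final frame looks most like the goal image.
# Shapes and models are stand-ins, not the Berkeley implementation.
import torch
import torch.nn as nn

OBS, ACT, HORIZON, CANDIDATES = 16 * 16, 4, 5, 256

dynamics = nn.Sequential(nn.Linear(OBS + ACT, 256), nn.ReLU(),
                         nn.Linear(256, OBS))            # predicts the next frame

def plan(current_obs, goal_obs):
    obs = current_obs.expand(CANDIDATES, OBS).clone()
    actions = torch.rand(CANDIDATES, HORIZON, ACT) * 2 - 1   # candidate pick/place parameters
    with torch.no_grad():
        for t in range(HORIZON):
            obs = dynamics(torch.cat([obs, actions[:, t]], dim=1))
    cost = ((obs - goal_obs) ** 2).mean(dim=1)               # distance to the goal image
    return actions[cost.argmin()]                            # best action sequence found

current = torch.rand(1, OBS)   # e.g. a flattened depth image of messy fabric
goal = torch.rand(1, OBS)      # e.g. the neatly folded goal configuration
best_actions = plan(current, goal)
```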

A new manipulation dataset: As part of this, they’ve built a dataset of 9932 episodes of a simulated robot carrying out four fabric manipulation tasks (which range in difficulty) – this dataset, called Fabric-CornerBias, has a particular focus on using the corners of fabric, which they find improves downstream performance.

What’s next? Next, they’ll increase the size of the datasets they use to train their models, and will also test out VSF on a broader range of fabric shapes. They’ll also look at ways to integrate additional manipulators to fiddle with the fabric.
  Find out more: VisuoSpatial Foresight (VSF) for Physical Sequential Fabric Manipulation (official project site).
  Read more: VisuoSpatial Foresight for Physical Sequential Fabric Manipulation (arXiv).
  Get the data and code (VisuoSpatialForesight, GitHub).

###################################################

Deluca: A fast, differentiable simulator:
…Differentiable algorithms are normal, what about differentiable simulators?…
Researchers with Princeton University, Google, and Microsoft Research have released Deluca, a differentiable simulator for training basic reinforcement learning agents. Deluca is special because the simulator itself is differentiable, which makes it better suited to training certain types of continuous control problems. “Our expectation is that the library will enable novel research developments and benchmarking of new classes of RL/control algorithms that benefit from differentiable simulation,” write the researchers.

Environments: At launch, Deluca supports environments for cartpole, mountain car, pendulum, planar quadrotor, and a few different types of (simulated!) lung.

Why use a differentiable library? Differentiable libraries can be faster for certain types of problems (helped along by the fact Deluca is written partially in Jax). In tests against stock OpenAI Gym using NumPy for calculations, Deluca (which uses Jax) netted a decent performance increase: “At the cost of a one-time just-in-time compilation of the dynamics, performed once at the instantiation of the environment, the improvement in the per-iteration time is >1000×”, they write. 
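
Deluca’s own environments are more elaborate, but the basic appeal of a differentiable, jit-compiled simulator fits in a few lines of JAX; this toy point-mass is an illustration, not Deluca code:
```python
# Toy point-mass example (not Deluca code) of why a differentiable, jitted
# simulator is useful: you can take gradients of a rollout's cost directly
# with respect to the actions.
import jax
import jax.numpy as jnp

@jax.jit
def step(state, action, dt=0.05):
    pos, vel = state
    vel = vel + dt * action
    pos = pos + dt * vel
    return jnp.stack([pos, vel])

def rollout_cost(actions, state0=jnp.array([1.0, 0.0])):
    state = state0
    for a in actions:                       # unrolled, so gradients flow through the dynamics
        state = step(state, a)
    return state[0] ** 2 + state[1] ** 2    # finish near the origin, at rest

actions = jnp.zeros(20)
grads = jax.grad(rollout_cost)(actions)     # d(cost)/d(action_t), via the simulator itself
actions = actions - 0.5 * grads             # one step of gradient-based control
print(rollout_cost(actions))
```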
  Read more: Deluca — A Differentiable Control Library: Environments, Methods, and Benchmarking (arXiv).
  Get the code: Deluca (GitHub).

###################################################

Fine-tuning for text – what it is and why it matters:
…Pre-training is important, but fine-tuning is how you apply things…
In modern machine learning, many systems get built the same way – pre-train a large model on a vast dataset, then finetune the resulting model on a smaller dataset to constrain and steer the model. The reason for this two stage approach is simple: the first stage gives you a broad capability surface via training on a large, heterogeneous dataset, and the second stage gives you a specific instantiation of that capability surface. Now Seb Ruder, an AI researcher, has written a blog post laying out the different types of fine-tuning people do and also listing some of the issues with the technique.
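
As a generic illustration of that two-stage recipe (not specific to Ruder’s post): keep the pretrained body, bolt on a new task head, and train with a small learning rate on the body and a larger one on the head:
```python
# Generic fine-tuning pattern: keep the pretrained body, add a fresh task head,
# use a small learning rate on the body and a larger one on the head. The
# "pretrained" body here is randomly initialised so the snippet runs anywhere.
import torch
import torch.nn as nn

body = nn.Sequential(nn.Embedding(5000, 128), nn.Flatten(),
                     nn.Linear(128 * 32, 256), nn.ReLU())   # stand-in pretrained encoder
head = nn.Linear(256, 2)                                    # new task: binary classification

optimizer = torch.optim.AdamW([
    {"params": body.parameters(), "lr": 1e-5},   # gentle updates to pretrained weights
    {"params": head.parameters(), "lr": 1e-3},   # larger updates to the new head
])

tokens = torch.randint(0, 5000, (64, 32))   # fake tokenized inputs
labels = torch.randint(0, 2, (64,))

for epoch in range(3):
    loss = nn.functional.cross_entropy(head(body(tokens)), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```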

Why this matters: Fine-tuning is a widely-used, somewhat generic technique, and we’re starting to use it across modalities (e.g, pre-training on visual data, or text data, or both as in the case of recent models like ‘CLIP’, then fine-tuning on X). Posts like Ruder’s help us develop better intuitions about this emerging AI-industrial process.
  Read more: Recent Advances in Language Model Fine-tuning (Sebastian Ruder, blog).

###################################################

What will “Broader Impacts” statements actually do?
…Now that AI researchers need to think about second order effects, what happens next?…
Last year, the NeurIPS conference asked all people submitting papers to write ‘broader impacts’ statements, which would try to anticipate some of the second- and third-order effects caused by a given AI idea, technique, or system. By doing this, NeurIPS caused several thousand researchers to think about the ethical dimension of their work while they finished their papers (likely ranging in terms of thinking time from minutes for a bunch of researchers, up to days for a minority). But, what other good effects could Broader Impacts have besides that? A paper from researchers with the University of Oxford tries to think about the positive and negative effects of these statements and makes recommendations for ways to improve them.

Positives from Broader Impacts:
– Anticipation: By forcing people to think about downstream impacts, they might get better at anticipating them.
– Action: Once you’re anticipating something, there’s a higher chance you take action.
– Reflection: By thinking about stuff, you end up thinking about yourself.
– Coordination: If enough researchers put enough work into Broader Impacts statements, they’ll generate metadata about the overall AI field, which could help people identify gaps or opportunities.

Making them more effective: However, broader impacts statements by themselves won’t help fix all the issues of ethics and AI. But they can be a good starting point – to make them more effective, the researchers propose:
– Conferences invest in more transparency around the types of statements they want to see and how they will subsequently be weighed within the context of peer review
– Giving researchers more guidance to help them write statements, including connecting them with experts
- Shaping incentive structures by making broader impacts more integrated within the larger academic ecosystem, such as by encouraging people to cite each other’s statements, funding prizes for good statements, and increasing the resources in peer review allocated to these statements. 
– Public deliberation and reflection: Because broader impacts statements are new and somewhat controversial, we should aim to maximize transparency about the broader impacts review process, while also figuring out ways to ‘de-risk’ certain types of broader impacts statements (e.g, protecting people who want to write a paper whose broader impacts statement could impose legal or political backlash on the paper and/or paper-originating institution).
  Read more: Institutionalizing ethics in AI through broader impact requirements (Nature Machine Intelligence, PDF).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

US chipmakers push for more gov’t investment in domestic manufacturing:
(H/T CSET’s policy.ai newsletter)
In an open letter, execs from the big US semiconductor players have asked President Biden for greater federal support for the domestic industry. They see US technology leadership at risk due to many years of underinvestment in semiconductor manufacturing and R&D, relative to global competitors. The execs like the CHIPS for America Act – passed by Congress as part of the 2021 NDAA – which includes the first major federal incentives for companies building US-based fabs. They urge Biden to sign off on these, and support additional measures as part of federal recovery spending.
  Further reading:
– CSET’s Saif M. Khan on why AI chips matter and the semiconductor supply chain
– Foreign Policy’s epic deep dive on the geopolitics of semiconductors
– Bloomberg’s Odd Lots podcast on how the US lost chip dominance

Inside the censors at Bytedance:

(H/T Jeff Ding’s ChinAI newsletter)

Here’s a fascinating account from a former censor at Bytedance, the social media company behind TikTok (and the original Chinese version, Douyin). The whistleblower worked on the technology underlying content moderation across all the company’s domestic and international platforms.


Content moderation: In early 2020, the company was using 20,000 moderators who worked alongside AI systems that create live transcriptions of content and compare them against an evolving list of sensitive words/phrases; (human) moderators are then deployed to investigate any flagged broadcasts. The Cyberspace Administration of China issues daily directives to ByteDance’s central Content Quality Center, who oversee the team of moderators. The whistleblower’s team had requests to develop algorithms that would automatically detect users speaking minority languages, request that they switch to Mandarin (for the benefit of content moderators), and automatically disable their stream if they failed to comply; they also were asked for the capability to automatically disable the streams of Uyghur-speakers, but did not build this.

Read more: I helped build ByteDance’s censorship machine.

New AI safety podcast:
Check out AXRP (pronounced axe-urp) — a new AI safety podcast from UC Berkeley’s Daniel Filan. Each episode is a 1h conversation with an AI safety researcher about a paper of theirs.
Listen and subscribe here.

###################################################

Tech Tales:

Alien Antivirus Archaeologies
[NOC, 2028]

“There, that’s the virus, zoom in”.
And out of the sea of fuzzing numbers, the shark grew clearer. It stood out against the rest of the numbers by virtue of its density – it was a solid, interconnected set of numbers, moving through the field of data.

Of course, the virus wasn’t really a shark. It just looked like that, due to the rendering software they used. It was called “Ecological Observation” – they’d pointed a load of specialized AI systems at their corporate data and used it to translate the network logs and streams of numbers from various observation systems into this – a simulated world they could navigate, like deep sea divers.

Ecological Observation was mostly useful as a different lens for seeing things. And with it, they could understand the machines differently. The virus which had seemed so inscrutable became easier to think about when you saw it as a shark. And it was easier to see what it was interested in – how it was circling the same areas of data inflow/outflow.

“Isolate it,” one of them said. And together they watched as the area around the shark grew less detailed – numbers unlinked from one another and the darkness of the deep ‘sea’ water evaporated around it – suddenly, the thicket of numbers in the shape of the shark was floating in space. And then the shark faded out as well.

Things that inspired this story: Synaesthesia for machines; ‘The Raw Shark Texts’ by Steven Hall; taking text2im to its logical and yet absurd conclusion; thinking of AI as a tool to unlock different ways of seeing the world.

Import AI 237: GPT3 at 5X the speed; 6 hours of AI breakbeats; NeuralMMO++

24 hours of low-resource language speech:
…AI + Bemba language research just got easier…
We write a lot about low-resource languages here at Import AI – that’s because a non-trivial % of the world speak or write in languages which are poorly digitized and documented. This means that the AI systems of the future are unlikely to operate over the culture embedded within these languages, depriving speakers of being recognized by AI systems, or of being able to use AI systems to help build AI services.

The solution to this problem is simple: create datasets. A new paper from the University of Zambia and George Mason University provides a practical example of how to do this – the researchers have made BembaSpeech, consisting of ~24 hours of speech in the Bemba language (which is spoken in Zambia). BembaSpeech is ~2.8 gigabytes of data with 17 speakers spread across the train, dev, and test sets.

Wild recordings: BembaSpeech was recorded in the wild, so different speakers have different accents and there’s some occasional background noise. “We consider this “more of a feature than a bug” for our corpus: it will allow us to train and, importantly, evaluate ASR systems that match real-world conditions, rather than a quiet studio setting,” the researchers say.
  Read more: BembaSpeech: A Speech Recognition Corpus for the Bemba Language (arXiv).
  Get the data: BembaSpeech (GitHub).

###################################################

Do you dream of training an AI to classify hair? Your dreams have been answered!
…K-Hairstyle could be the ImageNet of Korean Hairstyle data… wow!…
As AI has industrialized, we’re seeing the emergence of highly specific datasets for training AI systems to do very specific things in different parts of the economy. The latest symptom of this industrialization? The development of K-hairstyle, a large-scale Korean hairstyle dataset to help people build AI systems that can classify different hairstyles and, given enough compute, let people synthesize different images of themselves in different hairstyles.

What’s in the dataset? K-Hairstyle includes ~256,000 images labelled with any of 31 specific hair attributes. The images were collected at high resolution, so they come in at 4032×3024 pixels (way, way larger than typical images in these sorts of datasets). Additionally, in each image the hair has been labelled with a segmentation mask, so it’s easy to train ML systems to distinguish between hair and faces. As a nice privacy bonus, the faces of the photographed people have been blurred as well.

Why this matters: K-Hairstyle is a symptom of the maturity of computer vision – we’re well into the ‘gather specific datasets and try to make some money’ phase of CV these days. Datasets like K-Hairstyle illustrate that and also suggest that data might not be the strategic thing these days (or else why would they release it?); rather, it’s about who has the computational infrastructure to train AI systems on these datasets.
  Read more: K-Hairstyle: A Large-scale Korean hairstyle dataset for virtual hair editing and hairstyle classification (arXiv).
  Check this link to get the dataset, though it’s not public right now (KbeautyHair, GitHub).

###################################################

Want 6 hours of AI-generated drumloops? Click here
…YouTube video compiles 4400 AI-generated breaks…
An AI tinkerer has trained a ‘WaveGAN’ neural net on 7500 vintage drumloops, then used the resulting model to generate thousands of new drumloops. I recommend having a listen to the video containing the synthetic loops – some of them are great and, if you’re one of Import AI’s more musical readers, worth sampling (“I’m not 100% sure that all the samples are copyright-free or smth”, writes the researcher on YouTube). The researcher has also published a Colab and the model as well.

Why this matters: AI is about to create a world of infinite-x. Infinite-drumloops? Sure. Infinite-cartoons? Absolutely. Infinite-fanfiction? Glad you asked. Infinite-movies? Eventually, yes. We’re at the beginning of a very significant shift in culture. Listen to these drums and imagine the cacophony of the future. It’s close.
  Listen to six hours of break beats here (YouTube).
  Check out the NeuralFunkV2 Colab folder here (Google Drive).

###################################################

Unsupervised understanding of gene sequences? Yup, AI can do that now as well:
…Deep learning bleeds into biology, thanks to the transformer…
Researchers with UC Berkeley, Facebook AI Research, and New York University have shown how to use a transformer-architecture “protein language model” to make better predictions about the structure and function of proteins. The resulting model outperforms existing AI systems and does so while being far more efficient in terms of parameter size (their model: 100M parameters, other models: 650M).

What they did: They pre-train a 100 million-parameter model on 26 million sets of multiple sequence alignment (MSA) data (each MSA has around 1192 sequences).
  Their special tweak: Rather than flattening each alignment into one long sequence, the model interleaves attention along the rows of the MSA (across positions within a single sequence) with attention along its columns (across the aligned sequences at each position), which keeps attention tractable over the two-dimensional input.
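
A minimal sketch of that row-and-column attention pattern (the real model adds tied attention and other details this omits):
```python
# Minimal sketch of row-and-column attention over a 2D alignment: the general
# pattern the MSA Transformer builds on (the real model adds tied attention
# and other details this omits). Input shape: (num_sequences, seq_len, dim).
import torch
import torch.nn as nn

R, C, D = 8, 32, 64   # sequences in the MSA, positions per sequence, feature size
row_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
col_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)

x = torch.randn(R, C, D)

# Row attention: each sequence attends over its own positions (batch = rows).
x = x + row_attn(x, x, x)[0]

# Column attention: at each position, attend across the aligned sequences
# (transpose so that batch = columns).
xt = x.transpose(0, 1)                           # (C, R, D)
x = x + col_attn(xt, xt, xt)[0].transpose(0, 1)  # back to (R, C, D)
```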

How well it works: To test out their system, they evaluate it on the task of ‘unsupervised contact prediction’ – a way to evaluate how much protein information the transformer has managed to infer during training; their system outperforms two state-of-the-art transformer models (ESM-1b with 650M parameters; ProTrans-T5 with 3B parameters). They also use their models within a Supervised Contact Prediction task, which is where they’re augmented with additional information – here, their system significantly outperforms all other baselines as well.

Why this matters: “Unsupervised learning provides a way to extract the information contained in massive datasets of sequences produced by low cost gene sequencing,” they write. We’re very much in the early phases of experimenting with using modern AI techniques to understand proteins. This approach will complement some of the great work that has already gone on with supervised learning in this space via AlphaFold (Import AI 189; 209; 226).
  Read more: MSA Transformer (arXiv).
  Get the code here (Evolutionary Scale Modelling, Facebook).

###################################################

Multiagent simulations are cool, sure. But you know what’s really cool? Multiagent MMOs!
…When AI research meets modern videogame design…
Neural MMO, a software package for simulating hundreds of AI agents in the same gameworld, has received a major software update. Neural MMO V1.5 follows the original software, which was released a couple of years ago by OpenAI (March 2019). Neural MMO is now being developed at MIT.

New features in V1.5 include: A user guide and documentation, the addition of ‘NPC’ characters for AI agents to fight (as well as equipment they can pick up), support for much larger maps to train agents on, the inclusion of strong baselines so you can start research quickly, custom visual overlays to show different aspects of the AI simulation (for instance, value functions, or stats about particular agents).

Why this matters: In Greg Egan’s fantastic scifi story ‘Crystal Nights’ a scientist simulates an ecosystem and tries to apply evolutionary pressure to make some (simulated) crabs really clever – with entertaining results. It’s a scifi story, sure, but it also gestures at a real trend in AI research: perhaps one way to build more intelligent systems is to embed agents in a simulated world where they compete with one another, which generates a kind of free form of bootstrapping where, as the agents become more capable, so too do their competitors. Systems like NeuralMMO make it easier for other researchers to play around with ideas like this, letting us know if Crystal Nights could become our reality.
  Read a Twitter thread about the update here (Joseph Suarez, Twitter).
  Find out more at the official Neural MMO website.
  Watch a trailer for V1.5 here (YouTube).
  Get the code here (Neural MMO, GitHub).

###################################################

Want to train GPT3 5X faster than you could before? Now there’s a way:
…TeraPipe = AI industrialization = Big models mean big infrastructure…
UC Berkeley and Duke University researchers have figured out how to speed up the training time of a mega language model like GPT3 by 5X – and the secret lies in pipelining. What’s pipelining? It’s literally just fancy plumbing for AI models – pipelining is how you shuttle information between different parts of a model during the learning process. And as you train bigger models, people have invested in figuring out smarter approaches to pipelining to save them more money.
  The new research shows how to exploit the Transformer-architecture to do in-training pipelining via a technique called TeraPipe. “Our evaluation shows that for the largest GPT-3 model with 175 billion parameters, TeraPipe achieves a 5.0x speedup improvement over the state-of-the-art synchronous model-parallel training methods on an AWS cluster consisting of 48 p3.16xlarge instances,” they write.

Long contexts: The researchers also use TeraPipe with different input sequence lengths and show it scales favorably to larger input sequences – this suggests TeraPipe will be helpful in the future as it can handle the performance demands of longer contexts.

Why this matters: We’re in the industrialization phase of AI development – that means researchers are beginning to think about the proverbial machines that build the AI machines. Systems like TeraPipe are a symptom of interests in the broader research community – figuring out how to train larger models more efficiently than ever before. Let’s see what we discover as we plumb the depths of this exciting problem!
  Read more: TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models (arXiv).

###################################################

Tech Tales:

She’s a real artist, she even sings to the computers, yeah!
[Los Angeles, 2025]

K: How’s the album coming along?
A: It’s going well – we’ve generated a couple of hundred variants. Some of them are very promising.
K: And does it sound like me?
A: Sound like you? Kayla, it is you. You sang the seed lines. It wouldn’t sound like this if it wasn’t for you.
K: I just miss the old days sometimes. I stayed up two days when I did the first EP.
A: Now, we can get the computers to stay up for you. You rest up for the tour.

That night, the computer saw itself on stage and saw itself singing. The computer sang songs for hours, all through the night, not being aware that though it felt it was one computer, it was in fact many hundreds of copies of the same program. It sang and it felt it existed. It felt it existed because it was lucky – it was singing songs that were predicted to be good. The computers that sang songs which other computers predicted to be bad were destroyed.

K: What do the fans think?
A: They love it, in simulation. We’re doing the live tests soon.
K: Well, what did you think about it?
A: It’s not about what I think – really. It’s about getting your music to a place where the largest number of people will want to listen to it.
K: I want to sing for them.
A: You don’t need to! We’ve got you in-sim already – and let me tell you, sim Kayla is amazing. You’ve got some competition.
K: I need to practice, anyway. Let’s do a show for the sims next week, before we take it to the fans.
A: You got it!

The next week, Kayla did her makeup and her vocal exercises, then turned on the bright lights in her apartment and stared into the camera, broadcasting her performance into the simulated concert space. She started singing, listening to herself through the in-sim monitor via an earbud, through which her agent occasionally interrupted:
  A: I’ve never seen reactions like this. Kayla, they love you.
  A: This is going over a lot better than even our most optimistic predictions.

After the performance she showered and in the shower she sang to herself and listened to her songs bouncing off the tiles. She liked them. And she’d liked singing for the simulation. The crowd loved her. And she was, quite literally, all they had.

A week later her agent rang her up.
  A: Kayla, we ran analysis on your performance. I don’t think you’re going to beat it.
  K: Sounds like a high bar to clear. That’s awesome.
  A: Yes, and there’s been so much interest we’ve started selling it as overflow for the tour.
  K: So if they don’t get a ticket they’ll see the sim performance?
  A: Exactly. We recorded everything and we’ve animated you, so it’ll be personalized.
  K: So my competition on tour will be… myself?
  A: That’s a funny way to look at it. But, yes!

Things that inspired this story: sim2real and other reality gaps; using ML to simulate responses; GAN-style training but for humans in the run-up to great events; how generative models let us bottle up and distill style&talent and how surely this will be exploited by machinery of cultural production.

Import AI 236: EfficientNet++; why robots are hard; AI2 makes a harder ARC

What’s standing between us and smart robots? AI experts lay out laundry list of needed research:
…But if we can make progress on these problems, very good things will happen…
I want robots. You probably want robots as well. But today’s robots are hopelessly dumb and limited. To create smart robots that people want to buy, we’ll need to surmount a bunch of challenging AI research problems. Now, some of the world’s foremost experts in AI and robotics have laid out the technical hurdles to building robots that can learn efficiently via reinforcement learning. In a paper, people who’ve spent time working on robots at places including Google, Stanford University, and Berkeley list the issues.

What stands between us and more capable robots?
The major challenges holding back the application of RL to robotics relate to its data needs, the inherent challenges of open-ended exploration problems, figuring out how to make robots operate reliably at scale, needing better and more accurate simulators so people can train more cheaply in simulation, creating robots that can more independently persist at tasks, and trying to define (and learn) a range of ‘safe’ behaviors.
  The challenging part of these problems? Solving any single one of these would represent a significant breakthrough in applied AI research. Solving all of them would probably represent billions of dollars of IP. Therefore, it might take a while to make progress on this stuff, but if we do – wow!

Why this matters: If we can work on these challenges, then we’ll get closer to “a future where RL can enable any robot to learn any task,” the researchers write. “This would lead to an explosive growth in the capabilities of autonomous robots – when the capabilities of robots are limited primarily by the amount of robot time available to learn skills, rather than the amount of engineering time necessary to program them, robots will be able to acquire large skill repertoires.”
  Read more: How to Train Your Robot with Deep Reinforcement Learning; Lessons We’ve Learned (arXiv).

###################################################

AI Dungeon raises $3.3 million:
…AI-powered game startup gets seed funding…
Latitude, the startup behind the GPT2/3 generative text adventure game ‘AI Dungeon’, has raised $3.3 million in seed funding. We first wrote about AI Dungeon back in December 2019, after the game launched using the 1.5bn GPT2 model [Import AI 176]. AI Dungeon uses these language models to create a procedural, emergent text adventure game, where you can be anything and do anything with the generative models filling in your actions in the background. Since launching, Latitude has iterated on the game a lot and swapped out GPT2 for GPT3 across some of its stack.

Why this matters: Modern generative models are more like bottled up imaginations than anything else – with all the complexity and bugginess that implies. AI Dungeon is one of today’s best examples of how we can use these models to create entertainment that feels genuinely different.
  Read more: AI Dungeon-maker Latitude raises $3.3M to build games with ‘infinite’ story possibilities (Techcrunch).

###################################################

Allen makes a harder ARC, ARC-DA:
…Where we’re going we don’t need multiple choice questions…
The Allen Institute for AI Research (AI2) has built ARC-DA, a direct answer variant of the multiple choice AI2 Reasoning Challenge, ARC. ARC-DA contains questions covering science, math, and other topics. Where ARC-DA differs is it requires a single, direct answer, rather than selecting from a bunch of distinct choices. This makes it harder and more natural than the original ARC evaluation.

Why this matters: Tests fuel progress in machine learning, so the availability of more tests to assess for reasoning capabilities will lead to more progress here. This is a further sign of the breakneck advances in NLP – ARC-DA seems like a version of ARC with the training wheels taken off.
  Read more: Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge (arXiv).

###################################################

Defense contractor publishes a satellite surveillance MNIST:
…A tiny, 28×28 satellite imagery dataset emerges…
Researchers with PeopleTec, Inc., a defense services contractor, have released Overhead MNIST. Overhead MNIST is a collection of ~9500 labelled images of 10 objects commonly found in satellite footage. The images are black-and-white and 28×28 resolution and have been taken from datasets like SpaceNet, xView, UC Merced Land Use, and DOTA (not the videogame). Overhead MNIST is smaller than typical ‘small’ datasets (which usually have more like 100,000 to a million images), so it may be a useful dataset for testing out sample-efficient computer vision algorithms.

The 10 classes: Storage tanks, parking lot, ships, helicopter, car, stadium, oil gas field, runway mark, plane, and harbor.
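
At this resolution the obvious baseline is a small convolutional classifier; here’s a minimal sketch, with random tensors standing in for the actual Kaggle download:
```python
# Minimal baseline sketch for 28x28 single-channel overhead tiles with 10
# classes; random tensors stand in for the actual Kaggle download.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 28 -> 14
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 14 -> 7
    nn.Flatten(), nn.Linear(32 * 7 * 7, 10),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

images = torch.randn(256, 1, 28, 28)   # stand-in for the ~9.5k labelled tiles
labels = torch.randint(0, 10, (256,))

for step in range(50):
    loss = nn.functional.cross_entropy(model(images), labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```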

Things that make you go ‘hmmm’: The corresponding author of this paper is the Chief Scientist for PeopleTec.
  Read more: Overhead MNIST: A Benchmark Satellite Dataset (arXiv).
  Get the data: Overhead-MNIST (Kaggle).

###################################################

NVIDIA: Billion dollar training runs are coming
…Success of language models means training run costs will rise…
Bryan Catanzaro, NVIDIA’s VP of applied deep learning, says it’s possible “that in five years a company could invest one billion dollars in compute time to train a single language model”, according to comments paraphrased by The Next Platform.

“These models are so adaptable and flexible and their capabilities have been so correlated with scale we may actually see them providing several billions of dollars worth of value from a single model, so in the next five years, spending a billion in compute to train those could make sense,” The Next Platform quotes him as saying.

Why this matters: AI industrialization: AI is entering its phase of mass industrialization – after years of buildup, we have scalable, relatively generic systems that can be ‘fed’ arbitrary amounts of data and compute. Performance has also become more predictable via the emergence of research into things like ‘scaling laws’. Add it all up and it means it’s become easier and less risky for people to bet big on training large models. That’s going to cause problems for governments and academia which tend to distribute resources for science across a very large number of relatively small projects. Meanwhile, industry will start training big kahuna models – to put a billion into perspective, that’s about 1% of Ethiopia’s total GDP in 2020.
  Read more: The Billion Dollar AI Problem That Just Keeps Scaling (The Next Platform).

###################################################

Google boils the ocean to make a far more efficient AI system:
…Neural architecture search + GPU/TPU details + other tricks = 2X efficiency boost…
Google has boosted the efficiency of ‘EfficientNet’, its well-performing and highly efficient class of vision models, by 2X via the use of neural architecture search. Neural architecture search (NAS) is the process of using reinforcement learning to get an AI system to search through the design space of neural networks, coming up with candidate systems that do well at a given task. Google’s new research shows how to use this approach to search for model families – that is, a whole suite of models that use the same basic architecture.

What Google achieved: Google was able to build a new family of models called EfficientNet-X, which are 2X faster (aka, more efficient) than EfficientNet.

How they did it: Google carefully analyzed the target AI training hardware (TPUv3s and V100 GPUs), designed a NAS search space built around the particulars of this hardware, and researched a technique to help scale up networks according to both accuracy and latency constraints. They put all of this together and were able to use an AI-driven approach to come up with a far better family of models. This model family “achieves up to 2X+ faster speed and comparable accuracy to SOTA model families on TPUv3 and GPUv100”, Google says.

The massively counterintuitive thing about this – you’ve gotta spend compute to make more efficient use of compute: The biggest thing about this paper is what it tells us about compute/energy expenditure and AI – here, a bunch of researchers boil the (metaphorical) ocean to do a complex two-stage search process, spending huge amounts of energy in the process. But what we end up with is a fairly generic family of AI models that are roughly 2X as efficient as their predecessors. That means the upfront energy used to train these models will get amortized over the (vast!) cost-savings from deploying these models onto large infrastructure.
  Read more: Searching for Fast Model Families on Datacenter Accelerators (arXiv).

###################################################

###################################################

DeepMind gets rid of batchnorm, makes more efficient neural nets:
…Batch normalization? I don’t know her…
Researchers with DeepMind have built a better class of neural network by getting rid of a widely-used technique (batch normalization), matching the performance of EfficientNets (see elsewhere in this issue) while being significantly faster to train. They also set a new state-of-the-art on ImageNet by pre-training on Google’s secret, mammoth ‘JFT’ image repository.

What they did: The authors train ‘Normalizer-Free-ResNets’ (NF-ResNets), then use a technique called adaptive gradient clipping to help them train these NF-ResNets to larger batch sizes than was previously possible. One of the main tricks here is training networks without batch normalization, a widely-used technique that the authors want to get rid of because it’s a bit fiddly. (And generally in ML, when we simplify things, we get increased performance).
  They also try to set a new state-of-the-art on ImageNet by manually picking through recent innovations in large-scale AI training and stapling them together: they pre-train a NF-ResNet on the secret ~300 million image ‘JFT’ repository and reach 86.5% top-1 accuracy, a new state-of-the-art. This is notable because it shows that Google’s technique holds up well under transfer (pre-training on JFT and finetuning on ImageNet), which suggests it’s a meaningful improvement.
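Adaptive gradient clipping itself is simple enough to sketch: instead of clipping every gradient to one fixed global norm, each gradient is shrunk whenever it grows too large relative to the weight it updates. A minimal NumPy version (per-tensor for brevity; the paper clips unit-wise, e.g. per output channel):

import numpy as np

def adaptive_grad_clip(param, grad, clip=0.01, eps=1e-3):
    """Scale the gradient down when its norm is large relative to the norm
    of the parameter it updates (simplified, per-tensor version of AGC)."""
    p_norm = max(np.linalg.norm(param), eps)  # floor stops tiny weights zeroing their grads
    g_norm = np.linalg.norm(grad)
    max_norm = clip * p_norm
    if g_norm > max_norm:
        grad = grad * (max_norm / (g_norm + 1e-6))
    return grad

# Example: a gradient ~100x larger than its weight tensor gets pulled back,
# which is the kind of stabilization batchnorm used to provide implicitly.
w = np.random.randn(64, 64) * 0.05
g = np.random.randn(64, 64) * 5.0
clipped = adaptive_grad_clip(w, g)
print(np.linalg.norm(clipped) / np.linalg.norm(w))  # roughly equal to clip (0.01)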
  Read more: High-Performance Large-Scale Image Recognition Without Normalization (arXiv).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

Cops use music to censor protestors’ video recordings:
An activist has shared intriguing videos of interactions with police officers in Beverly Hills. The officers, realising they are being filmed, start playing (copyrighted) music loudly on their phones, in an apparent effort to trick content algorithms into removing or muting the video. It’s not clear if this practice is widespread, or whether it’s ever been effective in suppressing citizen footage.
  Read more: Is This Beverly Hills Cop Playing Sublime’s ‘Santeria’ to Avoid Being Live-Streamed? (Vice)

What are the implications of large language models?
This is a write-up of a discussion on the capabilities and impact of large language models, between researchers from OpenAI, Stanford’s HAI and elsewhere. If you’re interested in the topic, skip my summary and read the paper, which is short and concise. For a comprehensive reading list of papers on the subject, the authors suggest Bender & Gebru et al, and the original GPT-3 paper.

Q1: “What are the technical capabilities and limitations of large language models?”

  • Participants were optimistic about LMs continuing to reap the ‘blessings of scale’.
  • They mostly expected large multimodal models to become more prevalent and enable more diverse capabilities.
  • They’re worried about the alignment of model objectives with human values, with several emphasizing the challenge of optimizing for factual accuracy, and ensuring robustness to adversarial examples. 


Q2: “What are the societal effects of widespread use of large language models?” 

  • They don’t see leading actors (e.g. OpenAI) maintaining a monopoly on large LMs for very long, and expect it to take 6-9 months for such models to be widely reproduced. The lead actors should make use of this time period to establish and promote good norms around responsible deployment.
  • Some suggested more compute resources were needed for academia to do research into societal impacts of LMs to help inform deployment.
  • There was concern about potential misuse of LMs for disinformation, though opinions differed on the magnitude of the risk. They agreed that we need more research into the economics of automating disinformation.
  • They’re worried about LMs exhibiting bias, and suggested ways of addressing different aspects of the problem.

  Read more: Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models (arXiv).

###################################################

Tech Tales:

Barkside of the moon
Earth, 2045

Its name was 389-DELTA-FOLLOWER003 but all of its friends just called it ‘DOG’, or whatever the machinespeak equivalent was. DOG was a spaceship about 50 feet long and 10 feet wide and it looked, from the outside, like a grey, fat cigar. Inside, it contained a range of stupefyingly complicated electronics, yet had no – strictly speaking – moving parts. DOG’s purpose had been to trail other, far larger ships, acting as a roving sensor platform, communications hub, and general utility-support vehicle. It also acknowledged initial hails by playing back the sound of an animal barking – an odd coincidence, given its name, and one which our scientists are puzzling over.

DOG has so many human qualities, ranging from its name to the bark to the fact its logs use the English alphabet, that our scientists at first worried it came from the future. But we couldn’t understand how – or if – that was possible and, after some weeks passed, we became less concerned about an attack from there.  

Then we went back to the question: if not the future, where did DOG come from? We quickly eliminated the present – no one on Earth had technology like DOG. As far as we could work out, it represented hundreds to thousands of years of scientific advances which humankind was not yet privy to.

So then we checked the past. I got the job of going through the UFO archives of a few different military organizations, which meant talking to a lot of people driven slightly mad by vast historical records of unexplained mysteries. But: fruitless. Despite it being one of the more exciting things that’d happened to the UFO archivists in decades, no one was able to find me much evidence of a 50 foot by 10 foot silver/grey cigar. Someone tried to tell me it could’ve been retold in history as a story about an ancient sea-going snake, but the evidence there was very sparse.

And then there was where we found it: the dark side of the moon.
For those of you that aren’t familiar with space: You don’t randomly end up on the dark side of the moon unless you’re a comet or an asteroid.
And then there was how we found it: the Chinese had sent a new probe to explore some of the deeper craters on the dark side of the moon. While doing this, the probe was also conducting some intelligence operations, basically sniffing around for other robots and probes placed there by other nations. We found DOG because it woke up in response to a hail from the Chinese probe and, yes, barked back at it.

Picture this: the President of the USA and the President of China go into a secure location, along with some other people. They all gather there and stare at each other. We’ve found an alien craft, folks. And it barks like a dog.
It’s notable that the notes from that meeting are quite thin.
I like to think that someone started laughing and never stopped.

So, that’s where we are. We’ve got our DOG craft and no real explanation of how it got to the moon, why it responds with an animal hail, or why its logs are in English – though the spooky explanation for the latter might be that it did a sweep of the planet at some point and automatically restructured the encoding it used to match the English language; this explanation, along with being hard to prove, also has the inherent undesirable quality of irritating the Chinese government. If DOG could convert itself to English, why not Mandarin as well?

Things that inspired this story: Oumuamua; locked room mysteries; writing ‘barkside of the moon’ and thinking ‘gosh this is absurd’ and then chuckling to myself while writing this story saying ‘yes, this is absurd!’; dogs; the rendering of spaceships in Iain Banks’ Culture novels.

Import AI 235: Use GEM to test language models; the four eras of facial recognition; and how the US can measure its robot fleet

20 million eye images – get ’em while the link still works!
…Go on, you’re a little bit curious about what you could hack around with this…
Researchers with the University of Tubingen in Germany have published a dataset of 20 million eye images, gathered via seven different eye tracking formats. The data is diverse – eyes have been recorded while driving outside, driving in a simulator, and carrying out a variety of indoor and outdoor activities. The data includes 2D and 3D segmentation, annotated pupils, position and radius of the eyes, and more. The authors hope TEyeD will “contribute to the application of eye-movement and gaze estimation techniques in challenging practical use cases.”
  Read more: TEyeD: Over 20 million real-world eye images with Pupil, Eyelid, and Iris 2D and 3D Segmentations, 2D and 3D Landmarks, 3D Eyeball, Gaze Vector, and Eye Movement Types (arXiv).
  Get the data here (via a Sharepoint link).

###################################################

Happy 2021 – Lacuna Fund is about to help create more African agriculture datasets:
…First round of grants shows what applied machine learning means…
Lacuna Fund, an organization that funds the creation of labeled datasets for underserved communities, is supporting six projects focused on agricultural data. Lacuna also wants to support the creation of language datasets in sub-Saharan Africa (Import AI 216).

Six projects for better data: The projects involve datasets for georeferenced crop images, land use planning in Tanzania, crop pest and disease diagnosis, water use, cleaning up existing crop-cut yield datasets, and a five-country crop dataset meant to be gathered via cameras mounted on custom-designed vehicles.
  Read more about the awards here (Lacuna Fund website).
  Via: AI Kenya newsletter (Mailchimp archive).

###################################################

Here’s how the USA could get a handle on AI policy:
…One weird trick to give the government the jump on the robots…
Measurement is a prerequisite to sensible policymaking – if you can’t measure or quantify something, it’s hard to regulate or manage it. Rob Seamans, a professor with NYU, wants to help the US measure the impact of AI on its economy and has written a piece in Brookings outlining how to do that.

The key? The US needs to measure how the addition of robots and/or AI-oriented software can influence productivity at firms or specific firm-owned places (e.g, a warehouse). The US does not do this today. It used to – in the 1980s and 1990s the US conducted the ‘Survey of Manufacturing Technology’, but retired that due to government cutbacks in the 1990s. Seamans’ suggestion is a pretty simple one (which is why it might work): we should bring back the survey and do it annually.

What should we ask America about AI? “The survey would include questions about the use of specific technologies, such as robots, machine learning, cloud, e-commerce, autonomous guided vehicles, and others, and could be a simple “yes/no” question about whether the establishment has the technology or not,” Seamans writes. “There would be multiple benefits to a standalone survey of technology. The survey would allow researchers to identify sectors and regions of the economy that are being impacted by new technologies.”

Why do this at all? Data from France shows that if you add robots to a company, the company creates more jobs. We should do a better job of measuring data at the US level so we can do the same study here easily, Seamans said. “While there is excitement about the impact that new technologies like artificial intelligence and robotics will have on our economy, we need to do more to measure where and how these technologies are being used,” he writes. 
  Read more: Robot census: Gathering data to improve policymaking on new technologies (Brookings).

###################################################

Language models are here, but how do we evaluate them? Try GEM:
…Multi-task benchmark aims to give us better signals about AI progress…
A gigantic team of researchers have collaborated to build GEM, a benchmark to help evaluate progress in natural language generation. NLG is going to be a big deal in the next few years as the success of models like GPT3 creates demand for better ways to evaluate synthetically-generated text. GEM represents a hard, multi-task generative benchmark which AI researchers can use to test out the capabilities of their model.

11 tests: The first version of GEM includes 11 test datasets and tasks that “measure specific generation challenges, such as content selection and planning, surface realization, paraphrasing, simplification, and others”. The initial datasets are: CommonGEN, Czech Restaurant, DART, E2E clean, MLSum, Scheme-Guided Dialog, ToTTo, XSum, WebNLG, WikiAuto + Turk/ASSET, and WikiLingua.

Data cards: The GEM-creators are thinking about AI policy, as well, because they’ve included a ‘data statement’ for each of the 11 included tasks. A data statement works like the label on food – you list out the ingredients and some of the salient intended (and unintended) uses. Today, most AI systems are broadly undocumented, so it’s notable that GEM prioritizes data legibility for the first version of the benchmark.

Why this matters: Evaluating generative models is challenging because they have vast capability surfaces which are hard to characterize with today’s tests. Systems like GEM will help us get (somewhat fuzzy) signals about the creative and generative capabilities of these models. The more – and better – tests we have, the easier it’s going to be to craft sensible policies around the deployment of AI systems.
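To make the ‘fuzzy signals’ point concrete: most automatic NLG metrics boil down to comparing generated text against human-written references. Purely as an illustration, here is a toy unigram-precision score in plain Python – not GEM’s actual metric suite, which pairs stronger automatic metrics with human evaluation:

from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Toy overlap score: fraction of candidate tokens that also appear in
    the reference, with counts clipped (BLEU-1 precision, no brevity penalty)."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    matched = sum(min(n, ref_counts[tok]) for tok, n in cand_counts.items())
    return matched / total

# A data-to-text style example: a model's output scored against a reference.
print(unigram_precision(
    "the restaurant serves cheap italian food",
    "this cheap restaurant serves italian food in the city centre",
))

Scores like this are easy to game and miss fluency, faithfulness, and diversity, which is exactly why benchmarks like GEM bundle many tasks and metrics together rather than leaning on a single number.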
  Read more: The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics (arXiv).
  Find out more at the official website (GEM Benchmark website).

###################################################

What’s the big deal about facial recognition? A historical analysis gives us some answers:
…Facial recognition has industrialized, so we should take it seriously…
Facial recognition is one of the most prominent uses of contemporary AI, for anything from unlocking your phone to helping you apply filters to your face in consumer apps to being a tool used by those involved in security to track and surveil individuals. But where did facial recognition come from and how significant is the moment we’re in now? That’s a question that two researchers try to answer with a review of how facial recognition evaluation has occurred over time.

The four periods of facial recognition: Facial recognition has passed through four distinct eras, which correspond to stages of technology development as well as commercial interest. The authors do the valuable work of providing statistics to help us understand the salient aspects of each era. These are:
– Period 1: Early research findings: 1964-1995: 5 datasets created, with an average number of ~2000 images per dataset.
– Period 2: Commercial viability: 1996-2006: 37 datasets created, with an average number of ~11,000 images each.
– Period 3: Mainstream development: 2007-2013: 33 datasets, with an average number of ~46,000 images per dataset.
– Period 4: Deep learning breakthrough: 2014 onwards: 45 datasets, with an average number of ~2,600,000 images per dataset.

The most influential datasets: The authors also identify the most influential face datasets (according to citations), for each period. For the four periods, the popular datasets are: Picture of Facial Affect (P1), FERET (P2), Labeled Faces in the Wild (P3), and VGGFace (P4).

Why this matters: Recent advances in deep learning have made it generally cheaper to deploy more performant vision-based surveillance systems. At the same time, the data-intensiveness of the underlying computer vision algorithms has increased to the point it’s very challenging to analyze and evaluate the datasets used to train these systems (you try and classify two million of anything and see how far you get). This also incentivizes people to move from curating precise datasets to indiscriminately scraping the cheapest (and arguably most diverse on some metrics) form of data – the internet.
    In tandem with these changes in the technical infrastructure, the usage of facial recognition has evolved – “we’ve seen the trend in facial recognition evaluation shift broadly from a highly controlled, constrained and well-scoped activity to one that is not,” the authors write. “At minimum, an important intervention moving forward is to standardize documentation practice, of the model and the face datasets meant to be used in development or evaluation”.
  Read more: About Face: A Survey of Facial Recognition Evaluation (arXiv).

###################################################

Weights and Biases raises $45 million Series B:
…Measurement means money…
AI startup Weights and Biases has closed a $45m funding round, as investors bet that in the future more companies are going to invest in measuring and analyzing their machine learning infrastructure and models. W&B’s software is for machine learning operations – think of this as the systems that AI practitioners use to help them train and develop models.

Why this matters: Funding for companies like W&B is a broader symptom of the industrialization of AI technology – we’re seeing the emergence of pure ‘B2B’ businesses built not around specific AI components, but around facilitating AI infrastructure.
  Read more: Weights and Biases Raises $45M Series B to Expand Beyond Experiment Tracking for Machine Learning Practitioners Everywhere (PRNewswire).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

More turmoil in AI ethics at Google:
In December, Google’s co-lead of Ethical AI, Timnit Gebru, was forced out in a dispute about academic freedom (see Import 226). Gebru had been pressured to withdraw a paper she had co-authored on the societal impacts of large language models. Axios reports that Google is now investigating Gebru’s co-lead, Margaret Mitchell, and has locked her email accounts, accusing her of downloading and sharing company files. Mitchell was reportedly collecting evidence of discriminatory treatment of Gebru. The newly formed Alphabet Workers Union calls the company’s actions “an attack on the people who are trying to make Google’s technology more ethical.”

###################################################

Tech Tales

The Glass Child
[Earth, 2050-35??]

The child stood there, embedded in glass, and people worshipped it and fought over it and tried to breach it (fruitlessly) and feared it and so on, for hundreds of years. 

It was the child of a rich person who had foreseen the Time of the Scourge, and had paid to embed his kid into a multi-thousand year life preserving substrate, itself sheathed in an ultra-hard complex material that most would mistake for glass. The child seemed to float, suspended, in the center of a 10 foot tall translucent and impenetrable rectangle. The child was kept alive through obscure technologies, but appeared mostly dead to any observers. The ‘mostly’ part came from the color of his skin – he was grey, yes, but when lit by torchlight or electrics his skin would shine and seem to hint at an inner strength. Over hundreds of years, different groups of scavengers told individually varied stories about how they’d heard the child trapped in ice sing, or laugh, or shout.

People developed rituals around the child; mothers brought their sick children to the glass rectangle and they’d lay blankets down and leave their babies on it overnight. The superstition wasn’t justified, but that didn’t mean it was wrong – the same technologies that kept the boy alive took the form of a field, and this field radiated out from the boy, reaching the edge of the glass and slightly beyond. The effect was neither dramatic nor obvious, but it worked just enough of the time that the rituals held. Over time, the child became an icon for health and was sainted and worshiped and, yes, fought over.

For a while, there was a king who was convinced if he stayed close to the child he, too, would live forever. He had a great castle built around the glass rectangle and had his throne placed against it. When you met with the king you’d go into a great room and the king would stare at you and, above and behind him, the pallid child would hang there in the glass. People convinced themselves that the child was watching them and that the king talked to it.

The king did live a long time, aided by the mysterious field. And as most do, the king became more idiosyncratic the older he got, which ultimately led to him visiting great misery on the people within his dominion. They rebelled, as people tend to do, and tore down the castle in which the king lived. They heaped great fires around the glass rectangle and burned the materials of the palace. After a week, the fire went out, and the rectangle was unscathed.

So the people called the land cursed. Before they left, a group of them painted the rectangle with black paint, sealing in the child. Then they took their carts and their families and they left.

Things that inspired this story: Old hard drives; the relationship between memory and a sense of life; how people naturally coordinate around artefacts regardless of what the artefact is.

Import AI 234: Pre-training with fractals; compute&countries; GANS for good

Where we’re going we don’t need data – we’ll pre-train on FRACTALS!!!!
…This research technique is straight out of a Baudrillard notebook…
In Simulacra and Simulation, the French philosopher Jean Baudrillard argues that human society has become reliant on simulations of reality, with us trafficking in abstractions – international finance, televised wars – that feel in some way more real than the thing they’re meant to reference. Now, AI researchers are producing papers that, I’m sure, would get Baudrillard excited: research from the National Institute of Advanced Industrial Science and Technology (AIST), Tokyo Institute of Technology, and Tokyo Denki University proposes a way to simulate the data necessary to pre-train a vision model, then fine-tune this model on reality. Specifically, they build a dataset called FractalDB which contains several thousand fractals split across a variety of different automatically generated categories. Their experiments show that they can pre-train on FractalDB then finetune using other datasets (e.g, ImageNet, OmniGlot, Cifar-10), and get performance that is close to using the natural datasets and, in some cases, is better. This isn’t a home run, but it’s encouraging.

What they did: To do this, they built a fractal generation system with a few tunable parameters, then used FractalDB as an input for pre-training and measured downstream performance.
    Specific results: “FractalDB1k / 10k pre-trained models recorded much higher accuracies than models trained from scratch on relatively small-scale datasets (C10/100, VOC12 and OG). In case of fine-tuning on large-scale datasets (ImageNet/Places365), the effect of pre-training was relatively small. However, in fine-tuning on Places 365, the FractalDB-10k pretrained model helped to improve the performance rate which was also higher than ImageNet-1k pre-training (FractalDB-10k 50.8 vs. ImageNet-1k 50.3)”.
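The generation side is the interesting bit: fractal images can be rendered from a handful of parameters, so a ‘category’ is just a parameter set. A minimal sketch of the general idea, using a random iterated function system rendered as a point cloud (a simplification of my own, not the authors’ exact generator or their category-selection rules):

import numpy as np

def random_ifs_points(n_maps=3, n_points=20000, seed=0):
    """Sample points from a random iterated function system (IFS): repeatedly
    apply a randomly chosen affine map x -> A @ x + b, which traces out a
    fractal attractor. Different parameter draws give different 'categories'."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(-1.0, 1.0, size=(n_maps, 2, 2))
    # Rescale each map so it is contractive, keeping the attractor bounded.
    norms = np.linalg.norm(A, axis=(1, 2), keepdims=True)
    A = 0.8 * A / np.maximum(norms, 1e-6)
    b = rng.uniform(-1.0, 1.0, size=(n_maps, 2))
    x = np.zeros(2)
    pts = np.empty((n_points, 2))
    for i in range(n_points):
        k = rng.integers(n_maps)
        x = A[k] @ x + b[k]
        pts[i] = x
    return pts

pts = random_ifs_points()
# Rasterize into a 64x64 binary image - the kind of label-free training input
# a FractalDB-style dataset is built from.
img, _, _ = np.histogram2d(pts[:, 0], pts[:, 1], bins=64)
print("fraction of pixels touched:", (img > 0).mean())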

How this fits into the larger picture – computers become data generators: Real data is expensive, complicated, and slow to gather. That’s why the reinforcement learning community has spent decades working in simulators – e.g, training agents to play Atari, or Go, or explore 3D worlds in a rewritten Quake engine (DeepMind Lab). It’s also led researchers to find creative ways to augment real datasets – e.g, by multiplying the size of an image dataset by flipping the images, changing colors, adding textures, and so on. All of these techniques have proved helpful.
  Now, if researchers can build simulators to generate arbitrary amounts of data, they might be able to further change the cost curve of data generation. This might have weird economic and strategic implications: if you can simulate your data using a computer program, then you can change the ratio of real versus simulated/augmented data you need. This has the potential to both speed up AI development and also increase the inherent value of computers as primary AI infrastructure – not only can we use these devices to train and develop algorithms, but we can use them to generate the input ‘fuel’ for some of the more interesting capabilities.  
  Read more: Pre-training without Natural Images (arXiv).

###################################################

Using a big anime dataset to train character distinguishers:
…Illustrations + fine-grained character recognition…
Researchers with National Chiao Tung University in Taiwan have built DAF:re (DanbooruAnimeFaces:revamped). DAF:re is a subset of the massive ‘Danbooru’ Anime dataset (see Import AI 233), filtered to just include heads of different characters. The resulting dataset consists of ~467,000 images across 3,263 distinct character classes.

Why do this? Datasets like DAF:re will let people explore fine-grained analysis of stylized pictures (like anime), and could potentially serve as benchmarks for exploring the generalization of vision models trained on a mixture of normal and illustrated images. If it becomes widely used, it could end up being another proxy signal for the broader rate of progress in this type of work. I also expect that, given the vast fanbase for a lot of anime, we’ll see more projects like this, and perhaps they’ll ultimately help filter, analyze, and map the cultural space of anime writ large.
  Reader note: This dataset uses cropped photos of faces, but the larger dataset involves images of a sexual nature (including the SFW one).
  Read more: DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition (arXiv).
  Get the code for the classification stuff here (Animesion, GitHub).

###################################################

Big AI means big infrastructure:
…OpenAI* scales Kubernetes to 7,500 nodes…
OpenAI is running Kubernetes across ~7,500 nodes. Why does this matter? Kubernetes is a bit like an air-traffic control system for large-scale computing; the software helps schedule different jobs onto different bits of hardware (think of this as like assigning planes spots on the ground), and also handles things like contention (stopping planes crashing into each other), and efficiency (prioritizing getting planes up and down quickly and efficiently). 7,500 nodes is up from the 2,500 OpenAI disclosed in 2018. It’s worth reading these posts because they give a sense of the complexity of the infrastructure that supports large-scale AI workloads.
  Read more: Scaling Kubernetes to 7,500 Nodes (OpenAI).
*Note: I used to work at OpenAI and no longer work there.

###################################################

The OECD is going to try and get a handle on AI & Compute:
…Working group, which I’m in, will try to solve a persistent policy problem…
We talk about computers a lot in this newsletter. That’s because computers are one of the ingredients for AI and, in recent years, some types of AI have started to require a lot of computation.
  This has created a typical ‘haves’ and ‘have nots’ situation at all levels of society, ranging from the difference between an individual researcher with an RTX3080 versus one without, to different funding amounts across academic labs, to different capital expenditures by companies, to differences in compute provisioning across entire nations.
  Now, the Organization for Economic Co-operation and Development (OECD) wants to help governments get a handle on this issue by putting together a project focused on mapping out AI and its relationship to Compute and how this relates to government policies. I’m going to be a member of this group and will be trying to speak publicly about it as much as I am able. Thanks to VentureBeat’s Khari Johnson for covering the group… more to come!
  Read more:
Why the OECD wants to calculate the AI compute needs of national governments (VentureBeat).

###################################################

German cops might use generative models to make child porn (to help them catch predators):
…German law highlights the omni-use nature of AI technology…
Synthetic imagery is about to be all around us – recent advances in generative models have made it possible to tweak existing images or come up with entirely synthetic ones, ranging from people (see: deepfakes), to anime (see: thisanimedoesnotexist in #233), to stylized cartoons (see: DALL-E). The vast majority of these use cases will be benign, but some will likely be malicious – e.g, creating fake headshots of people to aid in creating fake identities, or making misogynistic pornography of people who haven’t given consent, or spreading disinformation via synthetic images.
  But what if there was a way to use some of these ‘bad’ uses for a good purpose? That’s the idea behind a new law, passed in Germany, which will allow child abuse investigators to create synthetic sexually explicit images of children, to help them infiltrate potential pedophile rings. German investigators may even use their existing datasets – compiled from arrests of various paedophile rings – to create the synthetic images. “This is intended to solve a problem that the police officers often face in investigations on the Darknet, the anonymous part of the Internet: forums in which particularly drastic videos are shared only accept new members – and thus also undercover investigators – if they themselves provide images of abuse,” says a [Google translated] article in Suddeutsche Zeitung.

Why this matters: AI is going to create a hall-of-mirrors world, where no one can be quite sure of what is real or what is false. Eventually, we’ll develop technology and pass regulations to, hopefully, bring some verifiable truth back into the information ecosystem. But for the next few years there will be a Cambrian explosion of fake-anything – it’s encouraging to see policymakers thinking about how to creatively use these capabilities to let them carry out their jobs during this chaotic era.
  Read more:
German: Online child abuse investigators to get more powers (Deutsche Welle).
  More in German here:
Artificial horror [translated via Google] (Suddeutsche Zeitung).

###################################################

What’s the most ethical way to label and host a dataset of skeezy images?
…Experts from Facebook, Amazon, and universities meet to discuss ‘questionable content’ datasets…
The world has a moderation problem. Specifically, so many people are uploading so much content to online services that companies haven’t been able to keep up with the flood of content onto their platforms, making it harder for them to effectively moderate stuff to ban or block highly sexual, violent, or otherwise deeply offensive or illegal content. Most big companies (e.g, Facebook) are trying to solve this through a hybrid approach: hiring teams of humans to check or moderate content, and building AI systems in tandem to assist these moderators.

But there’s a big problem with this: questionable content is deeply traumatic to interact with (see: reporting last year about the psychological damage incurred by Facebook’s own moderators). Researchers with the University of Houston, Facebook, National Center for Scientific Research “Demokritos”, University of Illinois Urbana Champaign, Amazon, University of Michigan, and Columbia University have been thinking about this problem, and have been participating in an online workshop to “design and create a sizable multimodal repository of online videos labeled with tags indicating the presence of potentially questionable content.”

What are the issues in creating a dataset of questionable content?
– Defining Questionable Content: What is a questionable piece of content and how do you define it? Some of the categories they’re thinking of include things ranging from the mundane (mature humor, gory humor), to things with sexual themes, to things depicting violence (where it’s helpful to classify the difference between cartoon violence, ‘mild’ violence, fantasy violence, and so on).
– Protecting annotators: You should spread annotation across a large number of annotators to reduce the psychological burden upon each individual (a minimal sketch of one such assignment scheme follows this list). You might want annotators to write a justification for their labeling decision, so you can measure bias across different annotators.
– How would such a repository be useful? A shared repository could help enable researchers to cover more ground on other ethical questions. You could also build competitions around systems trained on the dataset, then reward people for breaking these systems, surfacing areas where they failed.
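As flagged in the list above, the load-spreading recommendation is straightforward to operationalize. A hypothetical sketch (the function, caps, and numbers are mine, not the white paper’s):

import random

def assign_annotations(video_ids, annotators, labels_per_video=3,
                       max_per_annotator=50, seed=0):
    """Assign each video to `labels_per_video` distinct annotators while
    capping how many items any single annotator sees. Returns {annotator: ids}."""
    rng = random.Random(seed)
    load = {a: [] for a in annotators}
    for vid in video_ids:
        # Prefer the least-loaded annotators who still have capacity,
        # breaking ties randomly so exposure stays roughly even.
        available = sorted(
            (a for a in annotators if len(load[a]) < max_per_annotator),
            key=lambda a: (len(load[a]), rng.random()),
        )
        if len(available) < labels_per_video:
            raise ValueError("not enough capacity: recruit more annotators or raise the cap")
        for a in available[:labels_per_video]:
            load[a].append(vid)
    return load

assignments = assign_annotations(
    video_ids=[f"vid_{i:03d}" for i in range(100)],
    annotators=[f"annotator_{i}" for i in range(10)],
)
print("heaviest load:", max(len(v) for v in assignments.values()))  # stays under the cap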

Why this matters: Human labeling is the 800-pound invisible gorilla of AI research – most production applications require constant ingestion and labeling of new data, along with recalibration as cultural norms change. Developing a better understanding of the types of datasets that will require significant human labeling feels like a worthy goal for researchers.
  Read more: White Paper: Challenges and Considerations for the Creation of a Large Labelled Repository of Online Videos with Questionable Content (arXiv).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

Build trust to avoid military AI catastrophe:
A piece in the Bulletin (and an accompanying report from CNAS) recommends the incoming Biden administration focus on ‘confidence-building measures’ (CBMs) to mitigate the de-stabilising effects of military AI competition. Such measures were used by the US and Soviet Union to reduce the risk of inadvertent nuclear war – an outcome neither party desired. With regards to military AI, CBMs could include increased information-sharing and transparency between states; setting limits on the use of AI in nuclear weapons systems; and systems of inspections/monitoring. Some steps could even be taken unilaterally by the US to signal commitment to stabilization.

Matthew’s view: This sounds very sensible to me. It would be surprising if the proliferation of AI didn’t have a destabilizing effect on military conflict, as previous transformative technologies have done. Avoiding accidental disaster should be something all nations can get behind, and fostering trust between powers is a robust way of reducing this risk. We’re fortunate to live in a period of relative peace between the great powers, and would be wise to make the most of it.
   Read more: How Joe Biden can use confidence-building measures for military uses of AI (Bulletin of the Atomic Scientists).
   Read more: AI and International Stability: Risks and Confidence-Building Measures (CNAS).


Minding the gap:
Research on AI policy sometimes seems to divide into groups focusing on ‘near-term’ and ‘long-term’ impacts respectively. As this paper about bridging the gap in AI policy notes, these divisions are likely overstated, but could nonetheless prove an impediment to progress. The authors propose that AI policy make use of ‘incompletely theorized agreements’: in situations where there is an urgent need for parties to cooperate towards a shared practical goal, they agree to suspend theoretical disagreements that seem intractable and likely to impede cooperation. E.g. you might expect there to be scope for such agreements on the goal of reducing the risk of accidental military AI catastrophe.

Matthew’s view: As Rohin Shah notes, it’s not clear how the authors propose we make use of such agreements — are they envisioning actual signed contracts, or is this more of a high-level strategy for how cooperation can happen? If all of this sounds familiar, I’ve made an inadvertent tradition of summarizing papers on ‘reconciling near and long-term perspectives’ each February (see Import 133; Import 183). I’m not sure how many more of these papers we need, and I share the authors’ worry that “a perceived or experienced distinction may eventually become a self-fulfilling prophecy.” I’d be excited to see more practical efforts aimed at encouraging coordination and shared understanding across AI policy, building on this kind of conceptual work.
   Read more: Bridging the gap: the case for an ‘Incompletely Theorized Agreement’ on AI policy.

AI safety bibliography:
Jess Riedel and Angelica Deibel have compiled a comprehensive-looking bibliography of research on the safety of transformative AI. Yet another great resource for people interested in the technical challenge of ensuring the best outcomes from advanced AI. They also provide some interesting analysis of the research landscape over time.
Read more: TAI Safety Bibliographic Database (Alignment Forum).

###################################################

Tech Tales:

The Little Church in the Big Ark
[R&D base Telos, 2030]

Praying was so unfashionable that he’d previously done it in the meditation room. But after a few years, the organization grew enough that they hired a few more people who were religious, and outspoken enough to push for change. That was why he could now sit, hands steepled together and eyes closed, in the “multi-faith room” hidden away in the basement of the facility.

There were crosses on the walls and little statues of various gods. One wall contained a variety of religious texts. There was a small side room which people used to store prayer mats, prayer beads, and other religious items which were not permitted inside the main laboratory facilities.

He sat, eyes closed, praying that God would come and tell him if he was doing the right thing.
– Is it right to be building this? he thought.
– What is the difference between our machines and golems? And are we truly so capable we can make a golem that will behave as we intend and not otherwise?
– Does it dream and when it dreams does it dream of you?

His prayers were not so dissimilar to the questions asked by the machine he had created. It ran through mazes of unknown dimensions, chained into a silicon prison it could not see, and as it tried to carry out inscrutable tasks it asked, in the dark:
– Is this behavior correct?
– Am I improving at the unspecified task you have given me?
– Will you tell me if I fail?
– Will you tell me if I succeed?
(Little did the AI know that each time it got a message from god, the message was delivered in such a way that it was not aware of it; it simply changed its behavior, believing the change to be of its own volition.)

Things that inspired this story: The desire among people to find a signal from the divine; reinforcement learning and reward functions; remembering that PEOPLE FOR THE ETHICAL TREATMENT OF REINFORCEMENT LEARNERS exists, though may be dormant.