Import AI

Import AI 240: The unbeatable MATH benchmark; an autonomous river boat dataset; robots for construction sites

by Jack Clark

Here’s another benchmark your puny models can’t solve – MATH!
…One area where just scaling things up doesn’t help…
SQuAD. SQuAD2. GLUE. SuperGLUE. All these benchmarks have melted in time, like hyperparameter tears in the rain, due to the onslaught of new, powerful AI models. So with a mixture of trepidation and relief let’s introduce MATH, a dataset of math problems that contemporary Transformer-based models can’t solve.

What’s MATH? MATH was made by researchers at UC Berkeley and consists of 12,500 problems taken from high school math competitions. The problems have five difficulty levels and cover seven subjects, including geometry. MATH questions are open-ended, mixing natural language and math across their problem statements and solutions. One example MATH question: “Tom has a red marble, a green marble, a blue marble, and three identical yellow marbles. How many different groups of two marbles can Tom choose?”
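That marble question is easy to sanity-check by brute force (the intended answer is 7: C(4,2) = 6 distinct-color pairs, plus the pair of two yellows). A quick enumeration — mine, not code from the dataset:

```python
from itertools import combinations

# The six marbles; the three yellows are identical, so groups that differ
# only in *which* yellow was picked count as the same group.
marbles = ["red", "green", "blue", "yellow", "yellow", "yellow"]

# Enumerate all 15 positional pairs, then deduplicate by the colors chosen.
groups = {tuple(sorted(pair)) for pair in combinations(marbles, 2)}

print(len(groups))  # -> 7
```

Trivial for a script, and — per the paper — still out of reach for today's language models, which is rather the point.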

Bonus dataset: AMPS: Along with MATH, the authors have also built the Auxiliary Mathematics Problems and Solutions (AMPS) pre-training corpus, a 23GB data repository made of ~100,000 Khan Academy problems with step-by-step solutions written in LaTeX, as well as 5 million problems generated using Mathematica scripts.

Why this matters: Current AI systems can’t solve MATH: The best part about MATH is that it’s unbelievably difficult. GPT-2 models get, at best, an average of 6.9% accuracy on the dataset (even in the most liberal human school, such a score would earn an F), while GPT-3 models (which are larger than GPT-2 ones) seem to do meaningfully better than their GPT-2 forebears on some tasks and worse on others. This is good news: we’ve found a test that large-scale Transformer models can’t solve. Even better – we’re a long, long way from solving it.
  Read more: Measuring Mathematical Problem Solving with the MATH Dataset (arXiv).
  Get the code from GitHub here.

###################################################

Want a pony that looks like Elvis? We can do that:
…Machine learning systems can do style generalization…
Here’s a fun Twitter thread where someone combines the multimodal CLIP system with StyleGAN, using a dataset from This Pony Does Not Exist [note: some chance of NSFW-ish generations] (an infinite sea of GAN-generated My Little Ponies). Good examples include pony-versions of Billie Eilish, Beyonce, and Justin Bieber.

Why this matters: In the same way AI can generate different genres of text, ranging from gothic fiction to romantic poetry, we’re seeing evidence the same kinds of generative capabilities work for imagery as well. And, just as with text, we’re able to mix and match these different genres to generate synthetic outputs that feel novel. The 21st century will be reshaped by the arrival of endless, generative and recombinative media.
  Check out the twitter thread of generations here (Metasemantic’s Twitter thread).

###################################################

AI Index 2021: AI has industrialized. Now what?
…Diversity data is still scarce, it’s hard to model ethical aspects over time, and more…
The AI Index, an annual project to assess and measure AI progress, has published its fourth edition. (I co-chaired this year’s report and spent a lot of time working on it, so if you have questions, feel free to email me).
  This year’s ~200-page report includes analysis of some of the big technical performance trends of recent years, bibliometric analysis about the state of AI research in 2020, information about national investments into AI being made by governments, and data about the diversity of AI researchers present in university faculty (not good) and graduating PhDs (also not good). Other takeaways include data relating to the breakneck rates of improvement in AI research and deployment (e.g, the cost to train an ImageNet model on a public cloud has fallen from ~$2000 in 2017 to $7.43 last year), as well as signs of increasing investment into AI applications, beyond pure AI research.

Ethics data – and the difficulty of gathering it: One thing that stuck out to me about the report is the difficulty of measuring and assessing ethical dimensions of AI deployment – specifically, many assessments of AI technologies use one-off analysis for things like interrogating the biases of the model, and few standard tests exist (let’s put aside, for a moment, the inherent difficulty of building ‘standard’ tests for something as complex as bias).

What next? The purpose of the AI Index is to prototype better ways to assess and measure AI and the impact of AI on society. My hope is that in a few years governments will invest in tech assessment initiatives and will be able to use the AI Index as one bit of evidence to inform that process. If we get better at tracking and analyzing the pace of progress in artificial intelligence, we’ll be able to deal with some of the information asymmetries that have emerged between the private sector and the rest of society; this transparency should help develop better norms among the broader AI community.
  Read the 2021 AI Index here (AI Index website)
  Read more about the report here: The 2021 AI Index: Major Growth Despite the Pandemic (Stanford HAI blog).

###################################################

Want to train an autonomous river boat? This dataset might help:
…Chinese startup Orca Tech scans waterways with a robot boat, then releases data…
AI-infused robots are hard. That’s a topic we cover a lot here at Import AI. But some types of robot are easier than others. Take drones, for instance – easy! They move around in a broadly uncontested environment (the air) and don’t need many smart algorithms to do useful stuff. Oceangoing ships are similar (e.g, Saildrone). But what about water-based robots for congested, inland waterways? Turns out, these are difficult to build, according to Chinese startup Orca Tech, which has published a dataset meant to make it easier for people to add AI to these machines.

Why inland waterways are hard for robots: “Global positioning system (GPS) signals are sometimes attenuated due to the occlusion of riparian vegetation, bridges, and urban settlements,” the Orca Tech authors write. “In this case, to achieve reliable navigation in inland waterways, accurate and real-time localization relies on the estimation of the vehicle’s relative location to the surrounding environment”.

The dataset: USVInland is a dataset of inland waterways in China “collected under a variety of weather conditions” via a little robotic boat. The dataset contains information from stereo cameras, a lidar system, GPS antennas, inertial measurement units (IMUs), and three millimeter-wave radars. The dataset was recorded from May to August 2020 and the data covers a trajectory of more than 26km. It contains 27 continuous raw sequences collected under different weather conditions.

Why this matters: The authors tested out some typical deep learning-based approaches on the dataset and saw that they struggled to obtain good performance. USVInland is meant to spur others to explore whether DL algorithms can handle some of the perception challenges involved in navigating waterways.
  Read more: Are We Ready for Unmanned Surface Vehicles in Inland Waterways? The USVInland Multisensor Dataset and Benchmark (arXiv).
  Get the data from here (Orca Tech website).

###################################################

Hackers breach live feeds of 150,000 surveillance cameras:
…Now imagine what happens if they combine that data with AI…
A group of hackers have gained access to live feeds of 150,000 surveillance cameras, according to Bloomberg News. The breach is notable for its scale and the businesses it compromised, which included hospitals, a Tesla warehouse, and the Sandy Hook Elementary School in Connecticut.
  The hack is also significant because of the hypothetical possibilities implied by combining this data with AI – allow me to speculate: imagine what you could do with this data if you subsequently applied facial recognition algorithms to it and mixed in techniques for re-identification, letting you chart the movements of people over time, and identify people they mix with who aren’t in your database. Chilling.
  Read more: Hackers Breach Thousands of Security Cameras, Exposing Tesla, Jails, Hospitals (Bloomberg).

###################################################

Why your next construction site could be cleaned by AI:
…Real-world AI robots: Japan edition…
AI startup Preferred Networks and construction company Kajima Corporation have built ‘iNoh’, software that creates autonomous cleaning robots. iNoh uses multiple sensors, including LIDAR, to do real-time simultaneous localization and mapping (SLAM) – this lets the robot know roughly where it is within the building. It pairs this with a deep learning-based computer vision system which “robustly and accurately recognizes obstacles, moving vehicles, no-entry zones and workers”, according to the companies. The robot uses its SLAM capability to help it build its own routes around a building in real-time, and its CV system stops it getting into trouble.
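The companies don't publish iNoh's planner, but the basic loop it implies — plan a route over the map SLAM gives you, treating vision-flagged cells (obstacles, no-entry zones) as off-limits — can be sketched with a toy occupancy grid and breadth-first search. This is an illustrative stand-in, not PFN's actual system:

```python
from collections import deque

def plan_route(grid, start, goal):
    """Shortest route on an occupancy grid (0 = free, 1 = obstacle) via
    breadth-first search; returns a list of (row, col) cells or None.
    A real system would replan continuously as SLAM updates the map."""
    rows, cols = len(grid), len(grid[0])
    came_from = {start: None}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk parent pointers back to start
                path.append(cell)
                cell = came_from[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols \
                    and grid[nr][nc] == 0 and nxt not in came_from:
                came_from[nxt] = cell
                queue.append(nxt)
    return None  # goal unreachable with the current map

# A 4x4 floor plan where a wall forces a detour around the right side.
floor = [[0, 0, 0, 0],
         [1, 1, 1, 0],
         [0, 0, 0, 0],
         [0, 1, 1, 0]]
route = plan_route(floor, (0, 0), (3, 0))
```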

Why care about Preferred Networks: Preferred Networks, or PFN, is a Japanese AI startup we’ve been tracking for a while. The company started out doing reinforcement learning for robots, set a new ImageNet training-speed record in 2017 (Import AI 69) and has been doing advanced research collaborations on areas like meta-learning (Import AI 113). This is a slightly long-winded way to say: PFN has some credible AI researchers and is generally trying to do hard things. Therefore, it’s cool to see the company apply its technology in a challenging, open-ended domain, like construction.

PyTorch++: PFN switched away from developing its own AI framework (Chainer) to PyTorch in late 2019.
  Read more: Kajima and PFN Develop Autonomous Navigation System for Construction Site Robots (Preferred Networks).
  Watch a (Japanese) video about iNoh here (YouTube).

###################################################

At last, 20 million real network logs, courtesy of Taiwan:
…See if your AI can spot anomalies in this…
Researchers with the National Yang Ming Chiao Tung University in Taiwan have created ZYELL-NCTU NetTraffic-1.0, a dataset of logs from real networks. Datasets like this are rare and useful, because the data they contain is inherently temporal (good! difficult!) in a non-expensive form (text strings are way cheaper to process than, say, the individual stills in a video, or slices of audio waveforms).

What is the dataset: ZYELL-NCTU NetTraffic-1.0 was collected from the outputs of firewalls in real, deployed networks of the telco ‘ZYELL’. It consists of around 22.5 million logs and includes (artificially induced) examples of probe-response and DDoS attacks taking place on the network.

Why this matters: It’s an open question whether modern AI techniques can do effective malicious anomaly detection on network logs; datasets like this will help us understand their tractability.
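For a sense of what a non-deep-learning baseline on such logs looks like, here's a toy burst detector that flags sources with outlier connection volumes. The (timestamp, src, dst) schema is invented for illustration — it is not the actual ZYELL-NCTU format — and the paper's point is that learned detectors should beat crude heuristics like this:

```python
import math
from collections import Counter

def flag_bursts(logs, threshold=3.0):
    """Score each source IP by how many standard deviations its
    connection count sits above the mean across sources, and return
    the outliers -- a crude stand-in for probe/DDoS detection."""
    counts = Counter(src for _, src, _ in logs)
    values = list(counts.values())
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return {src: (n - mean) / std
            for src, n in counts.items() if (n - mean) / std > threshold}

# Ten hosts making 5 routine connections each, plus one noisy host.
logs = [(t, f"10.0.0.{t % 10}", "10.0.1.1") for t in range(50)]
logs += [(t, "10.9.9.9", "10.0.1.1") for t in range(200)]
suspects = flag_bursts(logs)  # only 10.9.9.9 crosses the threshold
```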
  Read more: ZYELL-NCTU NetTraffic-1.0: A Large-Scale Dataset for Real-World Network Anomaly Detection (arXiv).
  Where to (maybe) get the dataset: Use the official website, though it’s not clear precisely how to access it.

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

CSET’s Jason Matheny joins Biden Administration
Jason Matheny, founding director of Georgetown’s influential ‘CSET’ thinktank, is taking on three senior roles “at the intersection of technology and national security”: deputy assistant to the President for technology and national security; deputy director for national security in the OSTP; and coordinator for technology and national security at the National Security Council, per FedScoop. Previously, Matheny was director of IARPA, where—among other things—he spearheaded the forecasting program that incubated Tetlock’s influential superforecasting research.
Read more: Jason Matheny to serve Biden White House in national security and tech roles (FedScoop).

Podcast: Brian Christian on AI alignment:
Brian Christian is interviewed by Rob Wiblin on the 80,000 Hours podcast, about his book, The Alignment Problem (covered in Import #221), and lots else. It’s an awesome interview, which manages to be even more wide-ranging than the book — I strongly recommend both.
Podcast and transcript: Brian Christian on the alignment problem (80,000 Hours podcast).

Minor correction:
Last week I wrote that the NSCAI’s report suggested a $32bn investment in the domestic semiconductor industry over the next five years — the correct figure is $35bn.

###################################################

Tech Tales:

Tell me the weight of the feather and you will be ready
[A large-scale AI training infrastructure, 2026]

When you can tell me precisely where the feather will land, you will be released, said the evaluator.
‘Easy’, thought the baby artificial intelligence. ‘I predict a high probability of success’.

And then the baby AI marked the spot on the ground where it thought the feather would land, then told its evaluator to drop the feather. The feather started to fall and, buffeted by invisible currents in the air and their interplay with the barbs and vanes of the feather itself, landed quite far from where the baby AI had predicted.

Shall we try again? asked the evaluator.
‘Yes,’ said the baby. ‘Let me try again’.

And then the baby AI made 99 more predictions. At its hundredth, the evaluator gave it its aggregate performance statistics.
  ‘My predictions are not sufficiently accurate,’ said the baby AI.
  Correct, said the evaluator. Then the evaluator cast a spell that put the baby AI to sleep.
In the dreams of the baby AI, it watched gigantic feathers made of stone drop like anvils into the ground, and tiny, impossibly thin feathers made of aerogel that seemed to barely land. It dreamed of feathers falling in rain and in snow and in ice. It dreamed of feathers that fell upward, just to know what a ‘wrong’ fall might look like.

When the baby woke up, its evaluator was there.
Shall we go again, said the evaluator.
‘Yes,’ said the baby, its neurons lighting up in predictive anticipation of the task, ‘show me the feather and let me tell you where it will land’.
And then there was a feather. And another prediction. And another comment from its evaluator.

In the night, the baby saw even more fantastic feathers than the night before. Feathers that passed through hard surfaces. Feathers which were on fire, or wet, or frozen. Sometimes, multiple feathers at once.

Eventually, the baby was able to roughly predict where the feather would fall.
We think you are ready, said the evaluator to the baby.
Ready for what? said the baby.
Other feathers, said the evaluator. Ones we cannot imagine.
‘Will I be ready?’ said the baby.
That’s what this has been for, said the evaluator. We believe you are.
And then the baby was released, into a reality that the evaluator could not imagine or perceive.

Somewhere, a programmer woke up. Made coffee. Went to their desk. Checked a screen: `feather_fall_pred_domain_rand_X100 complete`.

Things that inspired this story: Domain randomization; ancient tales of mentors and mentees; ideas about what it means to truly know reality 

Import AI 239: China trains a massive 10b model, Vicarious does pick&place; the GCHQ publishes some of its thoughts on AI

by Jack Clark

China trains a 10billion parameter multimodal network… using NVIDIA’s code:
…Chinese entities train a decent 10 billion parameter multi-modal model…
A hybrid team of researchers from Alibaba and Tsinghua University have built M6, a “Multi-Modality to Multi-Modality Multitask Mega-transformer”. M6 is a multi-modal model trained on a huge corpus of text and image data, including image-text pairs (similar to recent systems like OpenAI’s CLIP). M6 has a broad capability surface and because of how it was trained, you can use M6 to search for an image with text (or vice versa), generate media in different modalities, match images together, write poems, answer questions, and so on.

Data: ~60 million images (with accompanying text pairs) totalling 1.9 terabytes (almost twice the raw size of ImageNet), plus 292GB of text.

Facts and figures: Though the authors say they’ve trained 10-billion and 100-billion parameter models, they mostly report performance statistics for the 10-billion one. The 100b is a mixture-of-experts model, while the 10b is based on NVIDIA’s Megatron-LM training code (Import AI 218). The model’s size and sophistication are notable – this feels like a symptom of the maturing capabilities of various Chinese AI organizations. I wonder when we’ll get an M6-scale system from people affiliated with India, or regions like Europe or Africa.

Why this matters: M6 is notable for being a non-English model at equivalent scale to some of the largest primarily-English ones. We’re entering an era where there will be multiple, gigantic AI models, magnifying and minimizing different cultures with variations stemming from the organizations that trained them. It’s also interesting to consider how these models proliferate, and who will get access to them. Will students and researchers at Tsinghua get access to M6, or just Alibaba’s researchers, or both? And how might access schemes develop in other countries, as well?
…Finally, a word about bias: There’s no discussion of bias in the paper (or ethics), which isn’t typical for papers of this type but is typical of papers that come out of Chinese research organizations. If you’ve got counterexamples, please send them to me!
  Read more: M6: A Chinese Multimodal Pretrainer (arXiv).

###################################################

Facebook doesn’t even need labels to train its vision systems anymore (just your Instagram data):
…Self-supervised learning, at sufficient scale, might get us few-shot learning for free as well…
Self-supervised pre-training: Facebook’s SEER model learns via a self-supervised method called SwAV, which lets it look at unannotated images and, given enough scale, derive features from them and cluster them itself. They train using a family of models called RegNets. The magic of this method comes from the data they use: a billion pictures from Instagram (though they note in the paper these are “non-EU” images, likely due to GDPR compliance).
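The clustering step at the heart of SwAV can be sketched in a few lines: similarity scores between a batch of features and a set of learned prototypes are turned into balanced soft cluster assignments via Sinkhorn-Knopp iterations. The shapes and values below are toy stand-ins, not Facebook's code:

```python
import numpy as np

def sinkhorn_assign(scores, n_iters=3, eps=0.05):
    """Convert a (batch, n_prototypes) similarity matrix into soft
    cluster assignments whose prototype (column) marginals are roughly
    uniform -- the online clustering step used by SwAV. `eps` is a
    temperature: lower values give sharper assignments."""
    q = np.exp(scores / eps)
    q /= q.sum()
    batch, protos = q.shape
    for _ in range(n_iters):
        q /= q.sum(axis=0, keepdims=True)  # balance mass across prototypes
        q /= protos
        q /= q.sum(axis=1, keepdims=True)  # balance mass across samples
        q /= batch
    return q * batch  # each row is now a soft assignment summing to 1

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 16))     # a toy batch of 8 embeddings
prototypes = rng.normal(size=(4, 16))   # 4 learned cluster prototypes
# L2-normalize so the dot products behave like cosine similarities.
features /= np.linalg.norm(features, axis=1, keepdims=True)
prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)
assignments = sinkhorn_assign(features @ prototypes.T)
```

In SwAV proper these balanced assignments become the targets for a swapped-prediction loss between two augmented views of each image; the balancing is what stops everything collapsing into a single cluster.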

Results: The best version of SEER gets 84.2% top-1 ImageNet accuracy, nicely improving on other self-supervised approaches. (Though there’s still a ways to go before these techniques match supervised methods, which are now getting around ~90% top-1 accuracy).

Few shot learning, meet image recognition: SEER gets 77.9% top-1 accuracy on ImageNet after only seeing 10% of the images – suggesting that SEER can do a kind of few-shot learning, where by providing it with some data from a new domain it quickly adjusts itself to obtain reasonable performance. (Though several tens of thousands of images is quite different to the few sentences of text it takes to do few-shot learning in the text regime)

Why this matters: SEER is relatively simple, as is the network architecture they use. The amazing capabilities we see here (including the few-shot learning) primarily come from the scale of the datasets which are used, combined with the intentionally naive unlabelled training approach. “This result confirm that the recent progress of self-supervised learning is not specific to curated training set, like ImageNet, and could benefit a large range of applications associated with uncurated data,” they write.
  Read more: Self-supervised Pretraining of Visual Features in the Wild (arXiv).

###################################################

What does the UK’s NSA think about AI?
…Position paper hints at focus areas, discusses ethical issues, even IDs the elephant in the room…
The UK’s spy agency, GCHQ, has published a paper about how it hopes to use AI. This is notable; spy agencies rarely discuss frontier technologies. (Though don’t get too excited – the memo is unsurprisingly light on technical details.)

What information does the paper contain? GCHQ shares some thoughts for how it might use AI to aid some of its missions, these include:

  • AI for cyber threats: Use AI to identify malicious software, and also potentially to trace its distribution. 
  • AI for online safety for children: Use AI to identify online behaviors that look like adults ‘grooming’ kids for sexual exploitation, and use AI to analyze images found in the course of these investigations. (No mention, unlike the Germans (Import AI 234), of using AI to generate sexual imagery to help trap abusers). 
  • AI for human trafficking: Use AI to map out the human networks that enable trafficking, and use AI to sift through vast amounts of financial data to find connections. 
  • AI for foreign state disinformation: Use AI to do fact-checking and detect synthetically generated content (e.g, deepfakes). Also, use AI to automatically identify and block botnets that use machine-generated accounts. 

What does GCHQ think are the major AI ethics challenges? Fairness and bias is listed as one major challenge. GCHQ also lists ’empowerment’ – which it defines as figuring out how much freedom to give the AI systems themselves. GCHQ thinks AI is best used in partnership with humans: the AI comes up with answers and insights, then human experts use this to authorize or carry out actions.

AI policy is national security policy: In recent years, we’ve seen a vast migration of technology people moving from academia into industry, partially in response to skyrocketing salaries. This poses a challenge to modern spy agencies – government has a hard time paying as much as Google or Facebook, but it needs a similar caliber of talent to achieve its objectives. GCHQ says part of why it has written the paper is because of this new reality. “Most investment in the UK continues to come from the private sector rather than government and this is expected to continue,” the agency writes. “It is therefore unsurprising that GCHQ is now engaging more broadly with wider society and industry than at any other time in its history. We have much to learn from the exponential growth of AI in the outside world, and believe our specialists also have much to contribute.”
  Read more: Pioneering a New National Security, the Ethics of Artificial Intelligence (GCHQ, PDF).

###################################################

Google’s latest speech compression tech tells us that production AI is hybrid AI:
…End-to-end learning is nice, but the best things happen when you combine expertise…
Google has made Lyra, a more efficient speech codec. Lyra wraps in some recent ML advancements; it works by extracting features from input speech, quantizing that, then using a generative model to take these features and reinflate them into output speech.
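To get a feel for how tight that budget is, here's some toy arithmetic plus a stand-in encode/decode pair (illustrative parameters, not Lyra's actual design): at one feature packet every 20ms, a 3kbps stream leaves the encoder just 60 bits per packet for its quantized features, and the heavy lifting of turning those features back into audible speech falls to the generative model.

```python
import numpy as np

BITRATE = 3000                     # bits per second (the 3kbps regime)
FRAME_MS = 20                      # one feature packet every 20ms
PACKETS_PER_S = 1000 // FRAME_MS   # 50 packets per second
BITS_PER_PACKET = BITRATE // PACKETS_PER_S  # -> 60 bits per packet

def encode(frame, n_features=10, bits=6):
    """Analysis + quantization: keep a handful of coarse spectral
    magnitudes per frame (10 features x 6 bits = the 60-bit budget)."""
    assert n_features * bits <= BITS_PER_PACKET
    spectrum = np.abs(np.fft.rfft(frame))[:n_features]
    scale = spectrum.max() or 1.0
    levels = (1 << bits) - 1
    codes = np.round(spectrum / scale * levels).astype(int)
    return codes, scale

def decode(codes, scale, bits=6):
    """Dequantize the features; in a real codec a generative model
    synthesizes a full waveform from them rather than this bare inverse."""
    return codes / ((1 << bits) - 1) * scale

sr = 16000
t = np.arange(sr * FRAME_MS // 1000) / sr
frame = np.sin(2 * np.pi * 220 * t)  # one 20ms frame of a 220 Hz tone
codes, scale = encode(frame)
features = decode(codes, scale)
```

Sixty bits can't describe a waveform, only a compact summary of one — which is exactly why the decoder has to be generative.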

Good speech with less data: Lyra is designed to operate with audio streams of as little as 3kbps – here, it does better than other codecs and compares favorably with Opus, an established speech codec. Lyra is notable because it smooshes together expert-derived stuff (which would be some of the traditional codec techniques used here) with a strategic use of a generative model and gets great performance and useful efficiency gains.

Fairness & ML: “We’ve trained Lyra with thousands of hours of audio with speakers in over 70 languages using open-source audio libraries and then verifying the audio quality with expert and crowdsourced listeners. One of the design goals of Lyra is to ensure universally accessible high-quality audio experiences,” the company writes.

Why this matters: AI is going to be everywhere. And it’s going to be everywhere in a Lyra-like manner – as a discrete, smart component within a larger technical stack. We’re also going to see people use more generative models to distill and reinflate representations of reality – we’re entering the dumb ‘brain in a jar’ phase of AI deployment.
  Read more: Lyra: A New Very Low-Bitrate Codec for Speech Compression (Google blog).
  Read more: Generative Speech Coding with Predictive Variance Regularization (arXiv).

###################################################

AI developer: I’m afraid of what happens if my code gets released:
…One lens on the ethics of open vs closed-source…
Is it safer for an AI system to be open source or for it to be controlled by a small set of actors? Generally, the technology community has leaned towards stuff being open source by default, but in recent years, people have been experimenting with the latter. This has happened with various types of synthetic media, like language models that haven’t been fully released (e.g, NVIDIA’s Megatron LMs, GPT-2 [at first]), or various papers on synthetic media where the researchers don’t release the models. Now, a VP at the AI faceswap app Reface has written a post laying out how he thinks about the release of certain AI technologies. His post is about AI body swapping – that is, taking one person’s face (and potentially body) and morphing it onto someone else in a video.

Demos get shady attention: “Only after I published a [AI body swap] demo in August 2020 and different shady organizations started approaching me, I realized that AI-body-swap is a bomb. A bomb in both senses – as a technological breakthrough and as something dangerous if its code gets into the wrong hands,” he writes. “A team of high-class ML-pros would find a way around my code in about a week. In roughly six months, they’d have a production-grade full-body swap technology.”

Why this matters: “We need to make a pact, a deal that all the companies that create synthetic media must include watermarks, footprints, or provide other detectors for identifying it,” he writes. A cynical person might say ‘business guy writes article about why his business-benefiting strategy is good, lol go figure’. There’s some merit to that. But a few years ago articles like this were a lot rarer – the AI community does seem to be becoming genuinely concerned about the consequences of its actions.
  Read more: The Implications of Open-Source AI: Should You Release Your AI Source (Hackernoon).

###################################################

Proof that robots are getting smarter: GreyOrange partners with AI startup Vicarious:
…Maybe AI+Robots is about to be a thing…
AI startup Vicarious has partnered with GreyOrange, a company that builds AI and robot systems for warehouses. Vicarious has a neuroscience-inspired approach to AI (which earlier helped it break the CAPTCHA security system, Import AI 66), which means its systems exhibit different capabilities to those made with deep learning techniques.

Why Vicarious? Vicarious’s tech has typically been good at solving problems involving spatial reasoning. You can get a sense of its approach by looking at papers like “Learning a generative model for robot control using visual feedback” and “From proprioception to long-horizon planning in novel environments: A hierarchical RL model”. (I hope to cover more of this research in Import AI in the future, but I’ll need to take some time to load their different approach into my brain.)

What they’re doing together: GreyOrange will integrate an AI capability from Vicarious into its ‘GreyMatter Fulfillment Operating System’ tech. Vicarious’s system will handle technology for autonomous vertical picking, which involves getting a robot to perceive “the size, shape and material characteristics of inventory items, including when these are loosely positioned in an unstructured fashion”, then pick them up and approach, retrieve, and place items into order boxes. “Vicarious’ computer-vision and robotics technology is a breakthrough in the ability to handle unstructured, previously hard-to-grasp items,” said Vicarious co-founder Dileep George in a press release announcing the move.

Why this matters: The physical world is a huge challenge for AI. Most AI systems get trained in a purely digital context (e.g, computer vision systems get trained on digitized images of the world, and then are deployed in reality against… digital images of the world), whereas robots need to be trained in simulation then taught to generalize to the real world. This is especially challenging because of things like differences in the physics fidelity between simulators and reality, or hardware issues (e.g, air pressure/temperature/etc will screw around with the motor responses of certain robots, and the sun has a nasty habit of moving across the sky which continually changes illumination in outdoor/hybrid settings, throwing off vision systems).
  GreyOrange and Vicarious partnering up is a further symptom of the success of AI being applied to robotics. That’s a big deal: if we can get more flexible AI systems to work here, we can unlock tremendous economic value. Vicarious also isn’t the only company trying to revolutionize fulfillment with robotics – that’s also the focus of the (deep learning-based) startup, Covariant, among others. 
  Read more: GreyOrange and Vicarious Launch Autonomous Vertical Picking Solution for Apparel and Omnichannel Fulfillment (GlobeNewswire press release).
  Find out more about GreyOrange (GreyOrange site).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

NSCAI publishes final report on US AI strategy
The USA’s National Security Commission on AI has delivered its final report on US AI strategy. The report warns that the US risks being overtaken in technology without an acceleration in AI adoption, supported by substantial federal investment over the next five years.

Recommendations:

  • US military should work towards achieving ‘AI readiness’ by 2025: increasing DoD AI R&D spending to $8bn/year in 2026 (vs. $1.5bn in 2021); establishing a Digital Service Academy and National Digital Reserve Corps to address the talent deficit; more research into ensuring AI systems are robust and reliable.
  • US should embrace autonomous weapons and work with other nations to establish international standards and mitigate risks, while reaffirming DoD’s policy that human judgement be involved in any decision to kill. 
  • Overall federal funding for R&D should climb to at least 1% of GDP by 2026 (vs 0.6% in 2017).
  • Non-defense AI R&D funding should increase to $32bn/year (vs. $1.5bn in 2021); $32bn investment over five years in domestic semiconductor capacity (see Import 238).
  • To build a stronger AI workforce, the US should offer green cards to all STEM PhD graduates at US universities and double the number of employment-based visas, alongside substantially more funding for STEM education at all levels.
  • Establishing a Technology Competitiveness Council, tasked with developing and overseeing a National Technology Strategy, and coordinating efforts across government.

Read more: NSCAI report in full

————–

FAccT suspends Google sponsorship

ACM’s FAccT conference has paused its sponsorship by Google, following the turmoil and departures at the company’s Ethical AI team. Lead researchers Timnit Gebru and Margaret Mitchell were forced out earlier this year, after disputes around the company’s suppression of ethics research (see Import 226; 235).

   Read more: AI ethics research conference suspends Google sponsorship (VentureBeat) 


————–

Highlights from semiconductor substack:
– Mule’s Musings on Heterogeneous Compute; ASML and lithography; vertical monopolies; GPT-3.
– Deep Forest’s primers on semiconductor foundries (pt 1, pt 2).
– Employ America on the economics of the current chip shortage.

###################################################

Tech Tales:

The Speech for the Rebels
[2030: A country in central Africa where the US and China are fighting a proxy war primarily via the stoking of local political tensions]

They’d spent a few million dollars to digitize everything – and I do mean everything – that they’d gathered from the rebels. Then they started writing out speeches and testing the AI against it. The idea was that if you said something and the AI, which had been finetuned on all the digitized data, thought what you were saying had a low probability, then that told you that your speech was out of sync with the ‘mood’ inherent to the rebel group. On the other hand, if your speech was being predicted as likely by the AI system, that told you it might resonate.

Rhetorical Finetuning, the analysts called it, or R-FT.
Silver Tongue was the name of the system we used.
The Mouth – that’s what we called it.
– “Go see how well The Mouth works on them”.
– “Oh, you’re back, I guess the mouth worked for you”.
– “Just tell ’em what the Mouth says and see what happens”.
– “Don’t improvise, the Mouth works”.

The strangest truth about The Mouth was it worked – really well. One classified document noted that “campaigns which utilized R-FT via Silver Tongue saw a 15% improvement in our post-engagement de-escalation metrics, resulting in a lowered casualty rate for warfighters in the region”.

So that’s why we ended up sticking The Mouth on our wrists. The AI runs inside a wearable which has a microphone – the bracelet glows green when we’re saying things that The Mouth predicts are probable and it glows red when we say things that aren’t. We spend a lot of time in training being taught not to look directly at our wrists while talking, but rookies do it anyway. Now, when I give my talks – even improvised ones, after an incident, or to resolve something, or ask for a favor – I get my little signals from the bracelet and I say the words and keep it green.

I don’t know the language the local rebels speak. My translator picks up most of it, but not the things they whisper to each other, hands over their mouths, looking at me as I use the Mouth to talk. What are they saying, I wonder?

Look, when the American tries to speak like us, their wrist flashes green.
What does the red mean, Father?
The red is when they are telling the truth, my Son.

Things that inspired this story: The miniaturization of communication technology for conflict; thinking about language models and how they could be weaponized for purposes of propaganda or other state-level objectives; thoughts about how AI might get integrated with warfare; various loosely connected ideas around how AI influences culture through re-magnification of things the AI picked up; the natural skepticism of all humans in all places to unfamiliar people giving them a convincing speech.

Import AI 238: Robots that fold clothes; how Bytedance censors its product; a differentiable simulator.

by Jack Clark

The apocalypse approaches: Robots can _almost_ fold towels now:
…The great white whale of robot manipulation approaches…
Berkeley researchers have built a system that can fold a range of fabrics more accurately than before. If that doesn’t sound impressive, you probably haven’t spent much time at the intersection of modern robotics, deep learning, and simulation. Training machines that can reliably manipulate fabrics is a long-standing goal for the robotics field, but the task is inherently challenging – fabrics constantly deform, exhibit complex physical dynamics, and are generally hard to simulate efficiently. That’s why we don’t have robots today that can do useful tasks like folding clothes, tying ropes, and so on.

VisuoSpatial Foresight (VSF): Now, we’ve got closer – Berkeley researchers have taught a da Vinci surgical robot to carry out the task of fabric smoothing – that is, stretching out a disorganized piece of cloth until it is neatly unfolded – and fabric folding (folding a neatly unfolded piece of fabric) with 90% reliability. That’s not sufficient for production use, but it’s a meaningful research advance. VSF works by training a visual dynamics model on RGBD data (the ‘D’ depth component turns out to be very important), learning the raw dynamics model (how you can expect the cloth to behave) in simulation. VSF uses this underlying model to help it plan out the appropriate actions to take to move from its current state (e.g., a messy piece of fabric) to a goal state (a neatly unfolded piece of fabric).
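The plan-with-a-learned-model loop described above can be caricatured as a random-shooting planner. Everything below is an illustrative assumption – the scalar state and the toy `predict` function stand in for the paper’s learned video-prediction model:

```python
import random

# Hedged sketch of planning with a learned dynamics model, in the
# spirit of visual foresight. `predict` is a toy stand-in; the real
# model predicts future images from RGBD observations.

def predict(state, action):
    """Toy stand-in for a learned dynamics model (scalar state)."""
    return state + action

def plan(state, goal, horizon=3, candidates=200, seed=0):
    """Random-shooting planner: sample candidate action sequences,
    roll each through the model, keep the one ending nearest the goal."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(candidates):
        actions = [rng.uniform(-1, 1) for _ in range(horizon)]
        s = state
        for a in actions:
            s = predict(s, a)
        cost = abs(s - goal)
        if cost < best_cost:
            best, best_cost = actions, cost
    return best, best_cost
```

The real system swaps in a learned visual model and a pixel-space goal cost, but the control loop has this shape: sample, simulate, score, act.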

A new manipulation dataset: As part of this, they’ve built a dataset of 9932 episodes of a simulated robot carrying out four fabric manipulation tasks (which range in difficulty) – this dataset, called Fabric-CornerBias, has a particular focus on using the corners of fabric, which they find improves downstream performance.

What’s next? Next, they’ll increase the size of the datasets they use to train their models, and will also test out VSF on a broader range of fabric shapes. They’ll also look at ways to integrate additional manipulators to fiddle with the fabric.
  Find out more: VisuoSpatial Foresight (VSF) for Physical Sequential Fabric Manipulation (official project site).
  Read more: VisuoSpatial Foresight for Physical Sequential Fabric Manipulation (arXiv).
  Get the data and code (VisuoSpatialForesight, GitHub).

###################################################

Deluca: A fast, differentiable simulator:
…Differentiable algorithms are normal, what about differentiable simulators?…
Researchers with Princeton University, Google, and Microsoft Research have released Deluca, a differentiable simulator for training basic reinforcement learning agents. Deluca is special because the simulator itself is differentiable, which makes it better suited to training certain types of continuous control problems. “Our expectation is that the library will enable novel research developments and benchmarking of new classes of RL/control algorithms that benefit from differentiable simulation,” write the researchers.

Environments: At launch, Deluca supports environments for cartpole, mountain car, pendulum, planar quadrotor, and a few different types of (simulated!) lung.

Why use a differentiable library? Differentiable libraries can be faster for certain types of problems (helped along by the fact Deluca is written partially in Jax). In tests against stock OpenAI Gym using NumPy for calculations, Deluca (which uses Jax) netted a dramatic performance increase: “At the cost of a one-time just-in-time compilation of the dynamics, performed once at the instantiation of the environment, the improvement in the per-iteration time is >1000×”, they write.
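To see what “differentiable” buys you, here’s a minimal hand-rolled sketch – a 1-D point mass, not one of Deluca’s environments, with the gradient derived by hand (JAX would derive it automatically):

```python
# Minimal sketch of why a differentiable simulator is useful: each
# physics step is a smooth function, so gradients flow through whole
# rollouts. (Deluca/JAX derive these automatically; here, by hand.)

def step(x, v, a, dt=0.1):
    """One step for a 1-D point mass: new position and velocity."""
    return x + v * dt, v + a * dt

def rollout(a, steps=10):
    """Final position after applying a constant action."""
    x, v = 0.0, 0.0
    for _ in range(steps):
        x, v = step(x, v, a)
    return x

def rollout_grad(a, steps=10, dt=0.1):
    """Analytic d(final position)/d(action) via the chain rule:
    v_k = k*a*dt, x_n = dt * sum(v_k)  =>  dx/da = dt^2 * n*(n-1)/2."""
    return dt * dt * steps * (steps - 1) / 2

# A controller can use this gradient to tune actions directly,
# instead of treating the simulator as a black box.
```

That exact gradient is what lets control algorithms optimize through the simulator rather than around it.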
  Read more: Deluca — A Differentiable Control Library: Environments, Methods, and Benchmarking (arXiv).
  Get the code: Deluca (GitHub).

###################################################

Fine-tuning for text – what it is and why it matters:
…Pre-training is important, but fine-tuning is how you apply things…
In modern machine learning, many systems get built the same way – pre-train a large model on a vast dataset, then finetune the resulting model on a smaller dataset to constrain and steer the model. The reason for this two-stage approach is simple: the first stage gives you a broad capability surface via training on a large, heterogeneous dataset, and the second stage gives you a specific instantiation of that capability surface. Now Sebastian Ruder, an AI researcher, has written a blog post laying out the different types of fine-tuning people do and also listing some of the issues with the technique.
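The two-stage recipe can be caricatured with something far smaller than a Transformer – a character bigram model where counts stand in for learned weights (the corpora below are invented for illustration):

```python
from collections import Counter

# Toy illustration of pretrain-then-finetune: a character bigram
# "language model" whose counts stand in for learned weights.
# The corpora are invented for illustration.

def train(text, model=None):
    """Update bigram counts; passing an existing model = fine-tuning."""
    model = Counter() if model is None else Counter(model)  # copy, don't clobber the base model
    for a, b in zip(text, text[1:]):
        model[(a, b)] += 1
    return model

def prob(model, a, b):
    """P(next char = b | current char = a) under the count model."""
    total = sum(v for (x, _), v in model.items() if x == a)
    return model[(a, b)] / total if total else 0.0

pretrained = train("the cat sat on the mat " * 50)       # broad corpus
finetuned = train("the thesis holds", model=pretrained)  # small target set
# Fine-tuning shifts probability toward the small dataset while keeping
# the broad statistics learned during pre-training.
```

The same logic scales up: pre-training fixes the broad distribution, fine-tuning nudges it toward the behavior you actually want.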

Why this matters: Fine-tuning is a widely-used, somewhat generic technique, and we’re starting to use it across modalities (e.g., pre-training on visual data, or text data, or both as in the case of recent models like ‘CLIP’, then fine-tuning on X). Posts like Ruder’s help us develop better intuitions about this emerging AI-industrial process.
  Read more: Recent Advances in Language Model Fine-tuning (Sebastian Ruder, blog).

###################################################

What will “Broader Impacts” statements actually do?
…Now that AI researchers need to think about second order effects, what happens next?…
Last year, the NeurIPS conference asked all people submitting papers to write ‘broader impacts’ statements, which would try to anticipate some of the second- and third-order effects caused by a given AI idea, technique, or system. By doing this, NeurIPS caused several thousand researchers to think about the ethical dimension of their work while they finished their papers (likely ranging in terms of thinking time from minutes for a bunch of researchers, up to days for a minority). But, what other good effects could Broader Impacts have besides that? A paper from researchers with the University of Oxford tries to think about the positive and negative effects of these statements and makes recommendations for ways to improve them.

Positives from Broader Impacts:
– Anticipation: By forcing people to think about downstream impacts, they might get better at anticipating them.
– Action: Once you’re anticipating something, there’s a higher chance you take action.
– Reflection: By thinking through the impacts of their work, researchers end up reflecting on their own values and practices.
– Coordination: If enough researchers put enough work into Broader Impacts statements, they’ll generate metadata about the overall AI field, which could help people identify gaps or opportunities.

Making them more effective: However, broader impacts statements by themselves won’t help fix all the issues of ethics and AI. But they can be a good starting point – to make them more effective, the researchers propose:
– Conferences invest in more transparency around the types of statements they want to see and how they will subsequently be weighed within the context of peer review
– Giving researchers more guidance to help them write statements, including connecting them with experts
– Shaping incentive structures by making broader impacts more integrated within the larger academic ecosystem, such as by encouraging people to cite each other’s statements, funding prizes for good statements, and increasing the resources in peer review allocated to these statements.
– Public deliberation and reflection: Because broader impacts statements are new and somewhat controversial, we should aim to maximize transparency about the broader impacts review process, while also figuring out ways to ‘de-risk’ certain types of broader impacts statements (e.g., protecting people who want to write a paper whose broader impacts statement could invite legal or political backlash against the paper and/or the paper-originating institution).
  Read more: Institutionalizing ethics in AI through broader impact requirements (Nature Machine Intelligence, PDF).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

US chipmakers push for more gov’t investment in domestic manufacturing:
(H/T CSET’s policy.ai newsletter)
In an open letter, execs from the big US semiconductor players have asked President Biden for greater federal support for the domestic industry. They see US technology leadership at risk due to many years of underinvestment in semiconductor manufacturing and R&D, relative to global competitors. The execs like the CHIPS for America Act – passed by Congress as part of the 2021 NDAA – which includes the first major federal incentives for companies building US-based fabs. They urge Biden to sign off on these, and support additional measures as part of federal recovery spending.
  Further reading:
– CSET’s Saif M. Khan on why AI chips matter and the semiconductor supply chain
– Foreign Policy’s epic deep dive on the geopolitics of semiconductors
– Bloomberg’s Odd Lots podcast on how the US lost chip dominance

Inside the censors at Bytedance:

(H/T Jeff Ding’s ChinAI newsletter)

Here’s a fascinating account from a former censor at Bytedance, the social media company behind TikTok (and the original Chinese version, Douyin). The whistleblower worked on the technology underlying content moderation across all the company’s domestic and international platforms.


Content moderation: In early 2020, the company was using 20,000 moderators, who work alongside AI systems that create live transcriptions of content and compare them against an evolving list of sensitive words/phrases; (human) moderators are then deployed to investigate any flagged broadcasts. The Cyberspace Administration of China issues daily directives to ByteDance’s central Content Quality Center, which oversees the team of moderators. The whistleblower’s team received requests to develop algorithms that would automatically detect users speaking minority languages, request that they switch to Mandarin (for the benefit of content moderators), and automatically disable their stream if they failed to comply; the team was also asked for the capability to automatically disable the streams of Uyghur-speakers, but did not build this.
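The pipeline described above – transcribe, match against an evolving blocklist, escalate to humans – can be sketched in a few lines. The word list and matching rule here are invented for illustration; the real system is opaque and far more elaborate:

```python
# Hedged sketch of a transcribe-match-escalate moderation pipeline.
# SENSITIVE_TERMS is a hypothetical stand-in for an evolving,
# centrally-issued blocklist; real matching is much more sophisticated.

SENSITIVE_TERMS = {"protest", "leak"}

def flag_for_review(transcript):
    """Return sensitive terms found in a transcript (empty = no flag).
    Any hit would route the broadcast to a human moderator."""
    words = set(transcript.lower().split())
    return sorted(words & SENSITIVE_TERMS)
```

Even this trivial version shows the structural point: the AI narrows the firehose, and humans only see what the list surfaces.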

Read more: I helped build ByteDance’s censorship machine.

New AI safety podcast: Check out AXRP (pronounced axe-urp) — a new AI safety podcast from UC Berkeley’s Daniel Filan. Each episode is a 1h conversation with an AI safety researcher about a paper of theirs.
Listen and subscribe here.

###################################################

Tech Tales:

Alien Antivirus Archaeologies
[NOC, 2028]

“There, that’s the virus, zoom in”.
And out of the sea of fuzzing numbers, the shark grew clearer. It stood out against the rest of the numbers by virtue of its density – it was a solid, interconnected set of numbers, moving through the field of data.

Of course, the virus wasn’t really a shark. It just looked like that, due to the rendering software they used. It was called “Ecological Observation” – they’d pointed a load of specialized AI systems at their corporate data and used them to translate the network logs and streams of numbers from various observation systems into this – a simulated world they could navigate, like deep sea divers.

Ecological Observation was mostly useful as a different lens to use to see things. And with it, they could understand the machines differently. The virus which had seemed so inscrutable became easier to think about when you saw it as a shark. And it was easier to see what it was interested in – how it was circling the same areas of data inflow/outflow.

“Isolate it,” one of them said. And together they watched as the area around the shark grew less detailed – numbers unlinked from one another and the darkness of the deep ‘sea’ water evaporated around it – suddenly, the thicket of numbers in the shape of the shark was floating in space. And then the shark faded out as well.

Things that inspired this story: Synaesthesia for machines; the ‘Raw Shark Texts’ by Stephen Hall; taking text2im to its logical and yet absurd conclusion; thinking of AI as a tool to unlock different ways of seeing the world.

Import AI 237: GPT3 at 5X the speed; 6 hours of AI breakbeats; NeuralMMO++

by Jack Clark

24 hours of low-resource language speech:
…AI + Bemba language research just got easier…
We write a lot about low-resource languages here at Import AI – that’s because a non-trivial percentage of the world speaks or writes in languages which are poorly digitized and documented. This means that the AI systems of the future are unlikely to operate over the culture embedded within these languages, depriving speakers of being recognized by AI systems, or of being able to use AI systems to build services in their languages.

The solution to this problem is simple: create datasets. A new paper from the University of Zambia and George Mason University provides a practical example of how to do this – the researchers have made BembaSpeech, consisting of ~24 hours of speech in the Bemba language (which is spoken in Zambia). BembaSpeech is ~2.8 gigabytes of data with 17 speakers spread across the train, dev, and test sets.

Wild recordings: BembaSpeech was recorded in the wild, so different speakers have different accents and there’s some occasional background noise. “We consider this “more of a feature than a bug” for our corpus: it will allow us to train and, importantly, evaluate ASR systems that match real-world conditions, rather than a quiet studio setting,” the researchers say.
  Read more: BembaSpeech: A Speech Recognition Corpus for the Bemba Language (arXiv).
  Get the data: BembaSpeech (GitHub).

###################################################

Do you dream of training an AI to classify hair? Your dreams have been answered!
…K-Hairstyle could be the ImageNet of Korean Hairstyle data… wow!…
As AI has industrialized, we’re seeing the emergence of highly specific datasets for training AI systems to do very specific things in different parts of the economy. The latest symptom of this industrialization? The development of K-hairstyle, a large-scale Korean hairstyle dataset to help people build AI systems that can classify different hairstyles and, given enough compute, let people synthesize different images of themselves in different hairstyles.

What’s in the dataset? K-Hairstyle includes ~256,000 images labelled with any of 31 specific hair attributes. The images were collected at high resolution, so they come in at 4032×3024 pixels (way, way larger than typical images in these sorts of datasets). Additionally, in each image the hair has been labelled with a segmentation mask, so it’s easy to train ML systems to distinguish between hair and faces. As a nice privacy bonus, the faces of the photographed people have been blurred as well.

Why this matters: K-Hairstyle is a symptom of the maturity of computer vision – we’re well into the ‘gather specific datasets and try to make some money’ phase of CV these days. Datasets like K-Hairstyle illustrate that and also suggest that data might not be the strategic thing these days (or else why would they release it?), rather, it’s about who has the computational infrastructure to train AI systems on these datasets.
  Read more: K-Hairstyle: A Large-scale Korean hairstyle dataset for virtual hair editing and hairstyle classification (arXiv).
  Check this link to get the dataset, though it’s not public right now (KbeautyHair, GitHub).

###################################################

Want 6 hours of AI-generated drumloops? Click here
…YouTube video compiles 4400 AI-generated breaks…
An AI tinkerer has trained a ‘WaveGAN’ neural net on 7500 vintage drumloops, then used the resulting model to generate thousands of new drumloops. I recommend having a listen to the video containing the synthetic loops – some of them are great and, if you’re one of Import AI’s more musical readers, worth sampling (“I’m not 100% sure that all the samples are copyright-free or smth”, writes the researcher on YouTube). The researcher has also published a Colab and the model as well.

Why this matters: AI is about to create a world of infinite-x. Infinite-drumloops? Sure. Infinite-cartoons? Absolutely. Infinite-fanfiction? Glad you asked. Infinite-movies? Eventually, yes. We’re at the beginning of a very significant shift in culture. Listen to these drums and imagine the cacophony of the future. It’s close.
  Listen to six hours of break beats here (YouTube).
  Check out the NeuralFunkV2 Colab folder here (Google Drive).

###################################################

Unsupervised understanding of gene sequences? Yup, AI can do that now as well:
…Deep learning bleeds into biology, thanks to the transformer…
Researchers with UC Berkeley, Facebook AI Research, and New York University have shown how to use a transformer-architecture “protein language model” to make better predictions about the structure and function of proteins. The resulting model outperforms existing AI systems and does so while being far more efficient in terms of parameter size (their model: 100M parameters, other models: 650M).

What they did: They pre-train a 100 million-parameter model on 26 million sets of multiple sequence alignment (MSA) data (each MSA has around 1192 sequences).

How well it works: To test out their system, they test against the task of ‘unsupervised contact prediction’ – a way to evaluate how much protein information the transformer has managed to infer during training; their system outperforms two state-of-the-art transformer models (ESM-1b with 650M parameters; ProTrans-T5 with 3B parameters). They also use their models within a supervised contact prediction task, where they’re augmented with additional information – here, their system significantly outperforms all other baselines as well.

Why this matters: “Unsupervised learning provides a way to extract the information contained in massive datasets of sequences produced by low cost gene sequencing,” they write. We’re very much in the early phases of experimenting with using modern AI techniques to understand proteins. This approach will complement some of the great work that has already gone on with supervised learning in this space via AlphaFold (Import AI 189; 209; 226).
  Read more: MSA Transformer (arXiv).
  Get the code here (Evolutionary Scale Modelling, Facebook).

###################################################

Multiagent simulations are cool, sure. But you know what’s really cool? Multiagent MMOs!
…When AI research meets modern videogame design…
Neural MMO, a software package for simulating hundreds of AI agents in the same gameworld, has received a major software update. Neural MMO V1.5 follows the original software, which was released a couple of years ago by OpenAI (March 2019). Neural MMO is now being developed at MIT.

New features in V1.5 include: A user guide and documentation, the addition of ‘NPC’ characters for AI agents to fight (as well as equipment they can pick up), support for much larger maps to train agents on, the inclusion of strong baselines so you can start research quickly, custom visual overlays to show different aspects of the AI simulation (for instance, value functions, or stats about particular agents).

Why this matters: In Greg Egan’s fantastic scifi story ‘Crystal Nights’ a scientist simulates an ecosystem and tries to apply evolutionary pressure to make some (simulated) crabs really clever – with entertaining results. It’s a scifi story, sure, but it also gestures at a real trend in AI research: perhaps one way to build more intelligent systems is to embed agents in a simulated world where they compete with one another, which generates a kind of free-form bootstrapping: as the agents become more capable, so too do their competitors. Systems like NeuralMMO make it easier for other researchers to play around with ideas like this, letting us know if Crystal Nights could become our reality.
  Read a Twitter thread about the update here (Joseph Suarez, Twitter).
  Find out more at the official Neural MMO website.
  Watch a trailer for V1.5 here (YouTube).
  Get the code here (Neural MMO, GitHub).

###################################################

Want to train GPT3 5X faster than you could before? Now there’s a way:
…TeraPipe = AI industrialization = Big models mean big infrastructure…
UC Berkeley and Duke University researchers have figured out how to speed up the training of a mega language model like GPT3 by 5X – and the secret lies in pipelining. What’s pipelining? It’s literally just fancy plumbing for AI models – pipelining is how you shuttle information between different parts of a model during the learning process. As people train bigger models, they’ve invested in figuring out smarter approaches to pipelining to save more money.
  The new research shows how to exploit the Transformer-architecture to do in-training pipelining via a technique called TeraPipe. “Our evaluation shows that for the largest GPT-3 model with 175 billion parameters, TeraPipe achieves a 5.0x speedup improvement over the state-of-the-art synchronous model-parallel training methods on an AWS cluster consisting of 48 p3.16xlarge instances,” they write.
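The intuition for why finer-grained pipelining pays off can be shown with the standard synchronous-pipeline “bubble” formula – a simplification, since TeraPipe’s actual scheduling is more sophisticated:

```python
# Simplified intuition for token-level pipelining. In a synchronous
# pipeline with `stages` stages and `units` work items in flight, the
# idle "bubble" fraction is (stages - 1) / (units + stages - 1).
# Slicing along the token dimension (TeraPipe's axis) yields far more
# units than batch-level slicing alone, shrinking the bubble.

def bubble_fraction(stages, units):
    """Idle fraction of a synchronous pipeline schedule."""
    return (stages - 1) / (units + stages - 1)

batch_level = bubble_fraction(stages=8, units=4)    # few microbatches in flight
token_level = bubble_fraction(stages=8, units=256)  # many token slices in flight
```

With 8 stages, 4 units in flight leave the pipeline idle most of the time, while 256 units push the idle fraction toward zero – which is the lever token-level slicing pulls.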

Long contexts: The researchers also use TeraPipe with different input sequence lengths and show it scales favorably to larger input sequences – this suggests TeraPipe will be helpful in the future as it can handle the performance demands of longer contexts.

Why this matters: We’re in the industrialization phase of AI development – that means researchers are beginning to think about the proverbial machines that build the AI machines. Systems like TeraPipe are a symptom of interest across the broader research community in figuring out how to train larger models more efficiently than ever before. Let’s see what we discover as we plumb the depths of this exciting problem!
  Read more: TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models (arXiv).

###################################################

Tech Tales:

She’s a real artist, she even sings to the computers, yeah!
[Los Angeles, 2025]

K: How’s the album coming along?
A: It’s going well – we’ve generated a couple of hundred variants. Some of them are very promising.
K: And does it sound like me?
A: Sound like you? Kayla, it is you. You sang the seed lines. It wouldn’t sound like this if it wasn’t for you.
K: I just miss the old days sometimes. I stayed up two days when I did the first EP.
A: Now, we can get the computers to stay up for you. You rest up for the tour.

That night, the computer saw itself on stage and saw itself singing. The computer sang songs for hours, all through the night, not being aware that though it felt it was one computer, it was in fact many hundreds of copies of the same program. It sang and it felt it existed. It felt it existed because it was lucky – it was singing songs that were predicted to be good. The computers that sang songs which other computers predicted to be bad were destroyed.

K: What do the fans think?
A: They love it, in simulation. We’re doing the live tests soon.
K: Well, what did you think about it?
A: It’s not about what I think – really. It’s about getting your music to a place where the largest number of people will want to listen to it.
K: I want to sing for them.
A: You don’t need to! We’ve got you in-sim already – and let me tell you, sim Kayla is amazing. You’ve got some competition.
K: I need to practice, anyway. Let’s do a show for the sims next week, before we take it to the fans.
A: You got it!

The next week, Kayla did her makeup and her vocal exercises, then turned on the bright lights in her apartment and stared into the camera, broadcasting her performance into the simulated concert space. She started singing, listening to herself through the in-sim monitor via an earbud, through which her agent occasionally interrupted:
  A: I’ve never seen reactions like this. Kayla, they love you.
  A: This is going over a lot better than even our most optimistic predictions.

After the performance she showered and in the shower she sang to herself and listened to her songs bouncing off the tiles. She liked them. And she’d liked singing for the simulation. The crowd loved her. And she was, quite literally, all they had.

A week later her agent rang her up.
  A: Kayla, we ran analysis on your performance. I don’t think you’re going to beat it.
  K: Sounds like a high bar to clear. That’s awesome.
  A: Yes, and there’s been so much interest we’ve started selling it for overflow for the tour.
  K: So if they don’t get a ticket they’ll see the sim performance?
  A: Exactly. We recorded everything and we’ve animated you, so it’ll be personalized.
  K: So my competition on tour will be… myself?
  A: That’s a funny way to look at it. But, yes!

Things that inspired this story: sim2real and other reality gaps; using ML to simulate responses; GAN-style training but for humans in the run-up to great events; how generative models let us bottle up and distill style & talent, and how surely this will be exploited by the machinery of cultural production.

Import AI 236: EfficientNet++; why robots are hard; AI2 makes a harder ARC

by Jack Clark

What’s standing between us and smart robots? AI experts lay out laundry list of needed research:
…But if we can make progress on these problems, very good things will happen…
I want robots. You probably want robots as well. But today’s robots are hopelessly dumb and limited. To create smart robots that people want to buy, we’ll need to surmount a bunch of challenging AI research problems. Now, some of the world’s foremost experts in AI and robotics have laid out the technical hurdles to building robots that can learn efficiently via reinforcement learning. In a paper, researchers who’ve spent time working on robots at Google, Stanford University, and Berkeley list the issues.

What stands between us and more capable robots?
The major challenges holding back RL being applied to robotics relate to its data needs; the inherent difficulty of open-ended exploration problems; figuring out how to make robots operate reliably at scale; the need for better, more accurate simulators so people can train more cheaply in simulation; creating robots with more independent abilities to persist at tasks; and trying to define (and learn) a range of ‘safe’ behaviors.
  The challenging part of these problems? Solving any single one of these would represent a significant breakthrough in applied AI research. Solving all of them would probably represent billions of dollars of IP. Therefore, it might take a while to make progress on this stuff, but if we do – wow!

Why this matters: If we can work on these challenges, then we’ll get closer to “a future where RL can enable any robot to learn any task,” the researchers write. “This would lead to an explosive growth in the capabilities of autonomous robots – when the capabilities of robots are limited primarily by the amount of robot time available to learn skills, rather than the amount of engineering time necessary to program them, robots will be able to acquire large skill repertoires.”
  Read more: How to Train Your Robot with Deep Reinforcement Learning; Lessons We’ve Learned (arXiv).

###################################################

AI Dungeon raises $3.3 million:
…AI-powered game startup gets seed funding…
Latitude, the startup behind the GPT2/3 generative text adventure game ‘AI Dungeon’, has raised $3.3 million in seed funding. We first wrote about AI Dungeon back in December 2019, after the game launched using the 1.5bn GPT2 model [Import AI 176]. AI Dungeon uses these language models to create a procedural, emergent text adventure game, where you can be anything and do anything with the generative models filling in your actions in the background. Since launching, Latitude has iterated on the game a lot and swapped out GPT2 for GPT3 across some of its stack.

Why this matters: Modern generative models are more like bottled up imaginations than anything else – with all the complexity and bugginess that implies. AI Dungeon is one of today’s best examples of how we can use these models to create entertainment that feels genuinely different.
  Read more: AI Dungeon-maker Latitude raises $3.3M to build games with ‘infinite’ story possibilities (Techcrunch).

###################################################

Allen makes a harder ARC, ARC-DA:
…Where we’re going we don’t need multiple choice questions…
The Allen Institute for AI (AI2) has built ARC-DA, a direct-answer variant of the multiple-choice AI2 Reasoning Challenge, ARC. ARC-DA contains questions covering science, math, and other topics. Where ARC-DA differs is that it requires a single, direct answer, rather than a selection from a set of distinct choices. This makes it harder and more natural than the original ARC evaluation.

Why this matters: Tests fuel progress in machine learning, so the availability of more tests to assess reasoning capabilities will lead to more progress here. This is a further sign of the breakneck advances in NLP – ARC-DA seems like a version of ARC with the training wheels taken off.
  Read more: Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge (arXiv).

###################################################

Defense contractor publishes a satellite surveillance MNIST:
…A tiny, 28×28 satellite imagery dataset emerges…
Researchers with PeopleTec, Inc., a defense services contractor, have released Overhead MNIST. Overhead MNIST is a collection of ~9,500 labelled images of 10 objects commonly found in satellite footage. The images are black-and-white and 28×28 resolution, and have been taken from datasets like SpaceNet, xView, UC Merced Land Use, and DOTA (not the videogame). Overhead MNIST is smaller than typical ‘small’ datasets (which usually have more like 100,000 to a million images), so it may be a useful dataset for testing out sample-efficient computer vision algorithms.

The 10 classes: Storage tanks, parking lot, ships, helicopter, car, stadium, oil gas field, runway mark, plane, and harbor.
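Given the paper’s pitch of Overhead MNIST as a testbed for sample-efficient vision, it’s worth remembering how simple the baselines for a dataset this size can be. Here’s a sketch of a 1-nearest-neighbour classifier over flattened 28×28 images – the ‘images’ below are synthetic stand-ins generated on the fly, so everything here is illustrative rather than drawn from the actual dataset:

```python
import random

def l2_sq(a, b):
    # Squared Euclidean distance between two flattened images.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict_1nn(train, query):
    # train: list of (image, label) pairs; query: a flattened image.
    best_label, best_dist = None, float("inf")
    for image, label in train:
        d = l2_sq(image, query)
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label

random.seed(0)

def toy_image(mean):
    # A fake 28x28 "image": 784 pixel intensities clustered around a
    # per-class mean brightness. A stand-in for real satellite chips.
    return [min(255.0, max(0.0, random.gauss(mean, 10))) for _ in range(28 * 28)]

train_set = ([(toy_image(40), "ship") for _ in range(5)] +
             [(toy_image(200), "runway") for _ in range(5)])
print(predict_1nn(train_set, toy_image(45)))   # "ship"
print(predict_1nn(train_set, toy_image(190)))  # "runway"
```

A baseline this crude obviously won’t top any leaderboard, but on a ~9,500 image dataset it runs in seconds, which is part of the appeal of MNIST-sized benchmarks.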

Things that make you go ‘hmmm’: The corresponding author of this paper is the Chief Scientist for PeopleTec.
  Read more: Overhead MNIST: A Benchmark Satellite Dataset (arXiv).
  Get the data: Overhead-MNIST (Kaggle).

###################################################

NVIDIA: Billion dollar training runs are coming
…Success of language models means training run costs will rise…
Bryan Catanzaro, NVIDIA’s VP of applied deep learning, says it’s possible “that in five years a company could invest one billion dollars in compute time to train a single language model”, according to comments paraphrased by The Next Platform.

“These models are so adaptable and flexible and their capabilities have been so correlated with scale we may actually see them providing several billions of dollars worth of value from a single model, so in the next five years, spending a billion in compute to train those could make sense,” The Next Platform quotes him as saying.

Why this matters: AI industrialization: AI is entering its phase of mass industrialization – after years of buildup, we have scalable, relatively generic systems that can be ‘fed’ arbitrary amounts of data and compute. Performance has also become more predictable via the emergence of research into things like ‘scaling laws’. Add it all up and it means it’s become easier and less risky for people to bet big on training large models. That’s going to cause problems for governments and academia which tend to distribute resources for science across a very large number of relatively small projects. Meanwhile, industry will start training big kahuna models – to put a billion into perspective, that’s about 1% of Ethiopia’s total GDP in 2020.
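To make “a billion dollars of compute” concrete, here’s a back-of-envelope sketch of what such a budget might buy. Every constant in it (price per GPU-hour, peak throughput, utilization) is an illustrative assumption of mine, not a figure from Catanzaro or The Next Platform:

```python
# Back-of-envelope: what does $1B of training compute buy? All constants
# below are illustrative assumptions for 2021-era hardware, not reported
# figures.

budget_usd = 1e9
price_per_gpu_hour = 2.0   # assumed cloud price for a datacenter GPU
peak_flops = 125e12        # assumed peak mixed-precision FLOP/s per GPU
utilization = 0.3          # assumed fraction of peak actually achieved

gpu_hours = budget_usd / price_per_gpu_hour
total_flops = gpu_hours * 3600 * peak_flops * utilization
print(f"{gpu_hours:.2e} GPU-hours")  # 5.00e+08 GPU-hours
print(f"{total_flops:.2e} FLOPs")    # 6.75e+25 FLOPs
```

Half a billion GPU-hours is a staggering amount of hardware-time; whatever the exact constants, budgets at this scale only make sense if scaling laws keep holding.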
  Read more: The Billion Dollar AI Problem That Just Keeps Scaling (The Next Platform).

###################################################

Google boils the ocean to make a far more efficient AI system:
…Neural architecture search + GPU/TPU details + other tricks = 2X efficiency boost…
Google has boosted the efficiency of ‘EfficientNet’, its well-performing and highly efficient class of vision models, by 2X via the use of neural architecture search. Neural architecture search (NAS) is the process of using reinforcement learning to get an AI system to search through the design space of neural networks, coming up with candidate systems that do well at a given task. Google’s new research shows how to use this approach to search for model families – that is, a whole suite of models that use the same basic architecture.

What Google achieved: Google was able to build a new family of models called EfficientNet-X, which are 2X faster (aka, more efficient) than EfficientNet.

How they did it: Google carefully analyzed the target AI training hardware (TPUv3s and V100 GPUs), designed a NAS search space built around the particulars of this hardware, and developed a technique to help scale up networks according to both accuracy and latency constraints. They put all of this together and used an AI-driven approach to come up with a far better family of models. This model family “achieves up to 2X+ faster speed and comparable accuracy to SOTA model families on TPUv3 and GPUv100”, Google says.

The massively counterintuitive thing about this – you’ve gotta spend compute to make more efficient use of compute: The biggest thing about this paper is what it tells us about compute/energy expenditure and AI – here, a bunch of researchers boil the (metaphorical) ocean to do a complex two-stage search process, spending huge amounts of energy in the process. But what we end up with is a fairly generic family of AI models that are roughly 2X as efficient as their predecessors. That means the upfront energy used to train these models will get amortized over the (vast!) cost-savings from deploying these models onto large infrastructure.
  Read more: Searching for Fast Model Families on Datacenter Accelerators (arXiv).

DeepMind gets rid of batchnorm, makes more efficient neural nets:
…Batch normalization? I don’t know her…
Researchers with DeepMind have built a better class of neural network by getting rid of a widely-used technique (batch normalization), matching the performance of EfficientNets (see elsewhere in this issue) while being significantly faster to train. They also set a new state-of-the-art on ImageNet by pre-training on Google’s secret, mammoth ‘JFT’ image repository.

What they did: The authors train ‘Normalizer-Free-ResNets’ (NF-ResNets), then use a technique called adaptive gradient clipping to help them train these NF-ResNets to larger batch sizes than was previously possible. One of the main tricks here is training networks without batch normalization, a widely-used technique that the authors want to get rid of because it’s a bit fiddly. (And generally in ML, when we simplify things, we get increased performance).
  They then try to set a new state-of-the-art on ImageNet by manually picking through recent innovations in large-scale AI training and stapling them together. They pre-train a NF-ResNet on the secret ~300 million image ‘JFT’ repository and set a new state-of-the-art of 86.5% top-1 accuracy: this shows that the technique holds up well under transfer (pre-training on JFT and finetuning on ImageNet), which suggests it’s a meaningful improvement.
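The core of adaptive gradient clipping is small enough to sketch: instead of clipping gradients to a fixed global norm, each parameter’s gradient is rescaled whenever its norm grows too large relative to the norm of the parameter itself. The version below is a simplified flat-vector sketch with illustrative constants – the paper applies the rule unit-wise (per output channel) inside a real training loop:

```python
import math

def agc_clip(param, grad, clipping=0.01, eps=1e-3):
    # Adaptive gradient clipping (sketch): cap the gradient norm at
    # `clipping` times the parameter norm. `eps` stops tiny parameters
    # from having their gradients clipped to nothing.
    p_norm = max(math.sqrt(sum(p * p for p in param)), eps)
    g_norm = math.sqrt(sum(g * g for g in grad))
    max_norm = clipping * p_norm
    if g_norm > max_norm:
        return [g * (max_norm / g_norm) for g in grad]
    return grad

param = [1.0, 2.0, 2.0]                   # ||param|| = 3.0
print(agc_clip(param, [0.01, 0.0, 0.0]))  # ratio ~0.003 < 0.01: unchanged
print(agc_clip(param, [3.0, 0.0, 0.0]))   # ratio 1.0 > 0.01: rescaled to norm 0.03
```

The design intuition: a gradient step is only “too big” relative to the weight it updates, so the clipping threshold adapts per parameter rather than being one global knob.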
  Read more: High-Performance Large-Scale Image Recognition Without Normalization (arXiv).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

Cops use music to censor protestors’ video recordings:
An activist has shared intriguing videos of interactions with police officers in Beverly Hills. The officers, realising they are being filmed, start playing (copyrighted) music loudly on their phones, in an apparent effort to trick content algorithms into removing or muting the video. It’s not clear if this practice is widespread, or whether it’s ever been effective in suppressing citizen footage.
  Read more: Is This Beverly Hills Cop Playing Sublime’s ‘Santeria’ to Avoid Being Live-Streamed? (Vice)

What are the implications of large language models?
This is a write-up of a discussion on the capabilities and impact of large language models, between researchers from OpenAI, Stanford’s HAI, and elsewhere. If you’re interested in the topic, skip my summary and read the paper, which is brief and readable. For a comprehensive reading list of papers on the subject, the authors suggest Bender & Gebru et al, and the original GPT-3 paper.


Q1: “What are the technical capabilities and limitations of large language models?”

  • Participants were optimistic about LMs continuing to reap the ‘blessings of scale’.
  • They mostly expected large multimodal models to become more prevalent and enable more diverse capabilities.
  • They’re worried about the alignment of model objectives with human values, with several emphasizing the challenge of optimizing for factual accuracy, and ensuring robustness to adversarial examples. 


Q2: “What are the societal effects of widespread use of large language models?” 

  • They don’t see leading actors (e.g. OpenAI) maintaining a monopoly on large LMs for very long, and expect it to take 6-9 months for such models to be widely reproduced. The lead actors should make use of this time period to establish and promote good norms around responsible deployment.
  • Some suggested more compute resources were needed for academia to do research into societal impacts of LMs to help inform deployment.
  • There was concern about potential misuse of LMs for disinformation, though opinions differed on the magnitude of the risk. They agreed that we need more research into the economics of automating disinformation.
  • They’re worried about LMs exhibiting bias, and suggested ways of addressing different aspects of the problem.

Read more: Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models (arXiv).

###################################################

Tech Tales:

Barkside of the moon
Earth, 2045

Its name was 389-DELTA-FOLLOWER003 but all of its friends just called it ‘DOG’, or whatever the machinespeak equivalent was. DOG was a spaceship about 50 feet long and 10 feet wide and it looked, from the outside, like a grey, fat cigar. Inside, it contained a range of stupefyingly complicated electronics, yet had – strictly speaking – no moving parts. DOG’s purpose had been to trail other, far larger ships, acting as a roving sensor platform, communications hub, and general utility-support vehicle. It also acknowledged initial hails by playing back the sound of an animal barking – an odd coincidence, given its name, and one which our scientists are puzzling over.

DOG has so many human qualities, ranging from its name to the bark to the fact its logs use the English alphabet, that our scientists at first worried it came from the future. But we couldn’t understand how – or if – that was possible and, after some weeks passed, we became less concerned about an attack from there.  

Then we went back to the question: if not the future, where did DOG come from? We quickly eliminated the present – no one on Earth had technology like DOG. As far as we could work out, it represented hundreds to thousands of years of scientific advances which humankind was not yet privy to.

So then we checked the past. I got the job of going through the UFO archives of a few different military organizations, so I got to talk to a lot of people driven slightly mad by vast historical records of unexplained mysteries. But: fruitless. Despite it being one of the more exciting things that’d happened to the UFO archivists in decades, no one was able to find me much evidence of a 50 foot by 10 foot silver/grey cigar. Someone tried to tell me it could’ve been retold in history as a story about an ancient sea-going snake, but the evidence there was very sparse.

And then there was where we found it: the dark side of the moon.
For those of you that aren’t familiar with space: You don’t randomly end up on the dark side of the moon unless you’re a comet or an asteroid.
And then there was how we found it: the Chinese had sent a new probe to explore some of the deeper craters on the dark side of the moon. While doing this, the probe was also conducting some intelligence operations, basically sniffing around for other robots and probes placed there by other nations. We found DOG because it woke up in response to a hail from the Chinese probe and, yes, barked back at it.

Picture this: the President of the USA and the President of China go into a secure location, along with some other people. They all gather there and stare at each other. We’ve found an alien craft, folks. And it barks like a dog.
It’s notable that the notes from that meeting are quite thin.
I like to think that someone started laughing and never stopped.

So, that’s where we are. We’ve got our DOG craft and no real explanation of how it got to the moon, why it responds with an animal hail, or why its logs are in English – though the spooky explanation for the latter might be that it did a sweep of the planet at some point and automatically restructured the encoding it used to match the English language; this explanation, along with being hard to prove, also has the inherent undesirable quality of irritating the Chinese government. If DOG could convert itself to English, why not Mandarin as well?

Things that inspired this story: Oumuamua; locked room mysteries; writing ‘barkside of the moon’ and thinking ‘gosh this is absurd’ and then chuckling to myself while writing this story saying ‘yes, this is absurd!’; dogs; the rendering of spaceships in Iain Banks’ Culture novels.

Import AI 235: Use GEM to test language models; the four eras of facial recognition; and how the US can measure its robot fleet

by Jack Clark

20 million eye images – get ’em while the link still works!
…Go on, you’re a little bit curious about what you could hack around with this…
Researchers with the University of Tübingen in Germany have published TEyeD, a dataset of 20 million eye images, gathered via seven different eye tracking devices. The data is diverse – eyes have been recorded while driving outside, driving in a simulator, and carrying out a variety of indoor and outdoor activities. The data includes 2D and 3D segmentations, annotated pupils, the position and radius of the eyes, and more. The authors hope TEyeD will “contribute to the application of eye-movement and gaze estimation techniques in challenging practical use cases.”
  Read more: TEyeD: Over 20 million real-world eye images with Pupil, Eyelid, and Iris 2D and 3D Segmentations, 2D and 3D Landmarks, 3D Eyeball, Gaze Vector, and Eye Movement Types (arXiv).
  Get the data here (via a Sharepoint link).

###################################################

Happy 2021 – Lacuna Fund is about to help create more African agriculture datasets:
…First round of grants shows what applied machine learning means…
Lacuna Fund, an organization that funds the creation of labeled datasets for underserved communities, is supporting six projects focused on agricultural data. Lacuna also wants to support the creation of language datasets in sub-Saharan Africa (Import AI 216).

Six projects for better data: The projects involve datasets for georeferenced crop images, land use planning in Tanzania, crop pest and disease diagnosis, water use, cleaning up existing crop-cut yield datasets, and a five-country crop dataset meant to be gathered via cameras mounted on custom-designed vehicles.
  Read more about the awards here (Lacuna Fund website).
  Via: AI Kenya newsletter (Mailchimp archive).

###################################################

Here’s how the USA could get a handle on AI policy:
…One weird trick to give the government the jump on the robots…
Measurement is a prerequisite to sensible policymaking – if you can’t measure or quantify something, it’s hard to regulate or manage it. Rob Seamans, a professor with NYU, wants to help the US measure the impact of AI on its economy and has written a piece in Brookings outlining how to do that.

The key? The US needs to measure how the addition of robots and/or AI-oriented software can influence productivity at firms or specific firm-owned places (e.g, a warehouse). The US does not do this today. It used to – in the 1980s and 1990s the US conducted the ‘Survey of Manufacturing Technology’, but retired that due to government cutbacks in the 1990s. Seamans’ suggestion is a pretty simple one (which is why it might work): we should bring back the survey and do it annually.

What should we ask America about AI? “The survey would include questions about the use of specific technologies, such as robots, machine learning, cloud, e-commerce, autonomous guided vehicles, and others, and could be a simple “yes/no” question about whether the establishment has the technology or not,” Seamans writes. “There would be multiple benefits to a standalone survey of technology. The survey would allow researchers to identify sectors and regions of the economy that are being impacted by new technologies.”

Why do this at all? Data from France shows that if you add robots to a company, the company creates more jobs. We should do a better job of measuring data at the US level so we can do the same study here easily, Seamans said. “While there is excitement about the impact that new technologies like artificial intelligence and robotics will have on our economy, we need to do more to measure where and how these technologies are being used,” he writes. 
  Read more: Robot census: Gathering data to improve policymaking on new technologies (Brookings).

###################################################

Language models are here, but how do we evaluate them? Try GEM:
…Multi-task benchmark aims to give us better signals about AI progress…
A gigantic team of researchers have collaborated to build GEM, a benchmark to help evaluate progress in natural language generation. NLG is going to be a big deal in the next few years as the success of models like GPT3 creates demand for better ways to evaluate synthetically-generated text. GEM represents a hard, multi-task generative benchmark which AI researchers can use to test out the capabilities of their model.

11 tests: The first version of GEM includes 11 test datasets and tasks that “measure specific generation challenges, such as content selection and planning, surface realization, paraphrasing, simplification, and others”. The initial datasets are: CommonGEN, Czech Restaurant, DART, E2E clean, MLSum, Scheme-Guided Dialog, ToTTo, XSum, WebNLG, WikiAuto + Turk/ASSET, and WikiLingua.

Data cards: The GEM creators are thinking about AI policy as well: they’ve included a ‘data statement’ for each of the 11 included tasks. A data statement works like the label on food – you list out the ingredients and some of the salient intended (and unintended) uses. Today, most AI systems are broadly undocumented, so it’s notable that GEM prioritizes data legibility from the first version of the benchmark.

Why this matters: Evaluating generative models is challenging because they have vast capability surfaces which are hard to characterize with today’s tests. Systems like GEM will help us get (somewhat fuzzy) signals about the creative and generative capabilities of these models. The more – and better – tests we have, the easier it’s going to be to craft sensible policies around the deployment of AI systems.
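For a flavor of what the simplest automatic metrics in this space look like, here’s a bare-bones unigram-overlap F1 between a generated string and a reference. This is an illustration of the metric family (BLEU, ROUGE, and friends are more sophisticated relatives), not GEM’s actual scoring code:

```python
from collections import Counter

def unigram_f1(hypothesis: str, reference: str) -> float:
    # Harmonic mean of unigram precision and recall between two strings.
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((hyp & ref).values())  # multiset intersection of word counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat on the mat", "the cat sat on the mat"))  # 1.0
print(unigram_f1("a dog ran home", "the cat sat on the mat"))          # 0.0
```

Metrics like this are cheap but notoriously loose proxies for quality, which is exactly why a benchmark like GEM pairs many tasks with many metrics rather than leaning on a single number.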
  Read more: The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics (arXiv).
  Find out more at the official website (GEM Benchmark website).

###################################################

What’s the big deal about facial recognition? A historical analysis gives us some answers:
…Facial recognition has industrialized, so we should take it seriously…
Facial recognition is one of the most prominent uses of contemporary AI, for anything from unlocking your phone to helping you apply filters to your face in consumer apps to being a tool used by those involved in security to track and surveil individuals. But where did facial recognition come from and how significant is the moment we’re in now? That’s a question that two researchers try to answer with a review of how facial recognition evaluation has occurred over time.

The four periods of facial recognition: Facial recognition has gone through four distinct eras, each marking a stage of technological development as well as commercial interest. The authors do some really valuable work in providing statistics that help us understand the salient aspects of each era. These are:
– Period 1: Early research findings: 1964-1995: 5 datasets created, with an average of ~2,000 images per dataset.
– Period 2: Commercial viability: 1996-2006: 37 datasets created, with an average of ~11,000 images each.
– Period 3: Mainstream development: 2007-2013: 33 datasets, with an average of ~46,000 images per dataset.
– Period 4: Deep learning breakthrough: 2014 onwards: 45 datasets, with an average of ~2,600,000 images per dataset.

The most influential datasets: The authors also identify the most influential face datasets (according to citations), for each period. For the four periods, the popular datasets are: Picture of Facial Affect (P1), FERET (P2), Labeled Faces in the Wild (P3), and VGGFace (P4).

Why this matters: Recent advances in deep learning have made it generally cheaper to deploy more performant vision-based surveillance systems. At the same time, the data-intensiveness of the underlying computer vision algorithms has increased to the point that it’s very challenging to analyze and evaluate the datasets used to train these systems (you try and classify two million of anything and see how far you get). This also incentivizes people to move from curating precise datasets to indiscriminately scraping the cheapest (and arguably most diverse, on some metrics) form of data – the internet.
    In tandem with these changes in the technical infrastructure, the usage of facial recognition has evolved – “we’ve seen the trend in facial recognition evaluation shift broadly from a highly controlled, constrained and well-scoped activity to one that is not,” the authors write. “At minimum, an important intervention moving forward is to standardize documentation practice, of the model and the face datasets meant to be used in development or evaluation”.
  Read more: About Face: A Survey of Facial Recognition Evaluation (arXiv).

###################################################

Weights and Biases raises $45 million Series B:
…Measurement means money…
AI startup Weights and Biases has closed a $45m funding round, as investors bet that in the future more companies are going to invest in measuring and analyzing their machine learning infrastructure and models. W&B’s software is for machine learning operations – think of this as the systems that AI practitioners use to help them train and develop models.

Why this matters: Funding for companies like W&B is a broader symptom of the industrialization of AI technology – we’re seeing the emergence of pure ‘B2B’ businesses built not around specific AI components, but around facilitating AI infrastructure.
  Read more: Weights and Biases Raises $45M Series B to Expand Beyond Experiment Tracking for Machine Learning Practitioners Everywhere (PRNewswire).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

More turmoil in AI ethics at Google:
In December, Google’s co-lead of Ethical AI, Timnit Gebru, was forced out in a dispute about academic freedom (see Import 226). Gebru had been pressured to withdraw a paper she had co-authored on the societal impacts of large language models. Axios reports that Google is now investigating Gebru’s co-lead, Margaret Mitchell, and has locked her email accounts, accusing her of downloading and sharing company files. Mitchell was reportedly collecting evidence of discriminatory treatment of Gebru. The newly formed Alphabet Workers Union calls the company’s actions “an attack on the people who are trying to make Google’s technology more ethical.”

###################################################

Tech Tales

The Glass Child
[Earth, 2050-35??]

The child stood there, embedded in glass, and people worshipped it and fought over it and tried to breach it (fruitlessly) and feared it and so on, for hundreds of years. 

It was the child of a rich person who had foreseen the Time of the Scourge, and had paid to embed his kid in a multi-thousand year life preserving substrate, itself sheathed in an ultra-hard complex material that most would mistake for glass. The child seemed to float, suspended, in the center of a 10 foot tall translucent and impenetrable rectangle. The child was kept alive through obscure technologies, but appeared mostly dead to any observers. The ‘mostly’ part came from the color of his skin – he was grey, yes, but when lit by torchlight or electrics his skin would shine and seem to hint at an inner strength. Over hundreds of years, different groups of scavengers told individually varied stories about how they’d heard the child trapped in ice sing, or laugh, or shout.

People developed rituals around the child; mothers brought their sick children to the glass rectangle and they’d lay blankets down and leave their babies on it overnight. The superstition wasn’t justified, but that didn’t mean it was wrong – the same technologies that kept the boy alive took the form of a field, and this field radiated out from the boy, reaching the edge of the glass and slightly beyond. The effect was neither dramatic nor obvious, but it worked just enough of the time that the rituals held. Over time, the child became an icon for health and was sainted and worshipped and, yes, fought over.

For a while, there was a king who was convinced if he stayed close to the child he, too, would live forever. He had a great castle built around the glass rectangle and had his throne placed against it. When you met with the king you’d go into a great room and the king would stare at you and, above and behind him, the pallid child would hang there in the glass. People convinced themselves that the child was watching them and that the king talked to it.

The king did live a long time, aided by the mysterious field. And, as most do, the king became more idiosyncratic the older he got, which ultimately led to him visiting great misery on the people within his dominion. They rebelled, as people tend to do, and tore down the castle in which the king lived. They heaped great fires around the glass rectangle and burned the materials of the palace. After a week, the fire went out, and the rectangle was unscathed.

So the people called the land cursed. Before they left, a group of them painted the rectangle with black paint, sealing in the child. Then they took their carts and their families and they left.

Things that inspired this story: Old hard drives; the relationship between memory and a sense of life; how people naturally coordinate around artefacts regardless of what the artefact is.

Import AI 234: Pre-training with fractals; compute&countries; GANS for good

by Jack Clark

Where we’re going we don’t need data – we’ll pre-train on FRACTALS!!!!
…This research technique is straight out of a Baudrillard notebook…
In Simulacra and Simulation, the French philosopher Jean Baudrillard argues that human society has become reliant on simulations of reality, with us trafficking in abstractions – international finance, televised wars – that feel in some way more real than the thing they’re meant to reference. Now, AI researchers are producing papers that, I’m sure, would get Baudrillard excited: research from the National Institute of Advanced Industrial Science and Technology (AIST), Tokyo Institute of Technology, and Tokyo Denki University proposes a way to simulate the data necessary to pre-train a vision model, then fine-tune this model on reality. Specifically, they build a dataset called FractalDB which contains several thousand fractals split across a variety of automatically generated categories. Their experiments show that they can pre-train on FractalDB then finetune on other datasets (e.g, ImageNet, OmniGlot, Cifar-10), and get performance that is close to using the natural datasets and, in some cases, better. This isn’t a home run, but it’s encouraging.

What they did: To do this, they built a fractal generation system which had a few tunable parameters. They then evaluated their approach by using FractalDB as a potential input for pre-training, then evaluated downstream performance.
    Specific results: “FractalDB1k / 10k pre-trained models recorded much higher accuracies than models trained from scratch on relatively small-scale datasets (C10/100, VOC12 and OG). In case of fine-tuning on large-scale datasets (ImageNet/Places365), the effect of pre-training was relatively small. However, in fine-tuning on Places 365, the FractalDB-10k pre-trained model helped to improve the performance rate which was also higher than ImageNet-1k pre-training (FractalDB-10k 50.8 vs. ImageNet-1k 50.3).”
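FractalDB’s categories come from iterated function systems (IFS): a ‘class’ is a set of random affine maps, and an ‘image’ is the cloud of points you get by repeatedly applying maps chosen at random. Here’s a stripped-down sketch of that generation loop – the parameter ranges are illustrative, and the real pipeline additionally filters parameter sets so the fractal is well-behaved and actually fills the image:

```python
import random

def random_affine(rng):
    # One affine map (x, y) -> (a*x + b*y + e, c*x + d*y + f).
    return [rng.uniform(-0.8, 0.8) for _ in range(6)]

def render_fractal(maps, n_points=10000, seed=0):
    # Iterate the system from the origin, recording each visited point.
    rng = random.Random(seed)
    x, y, points = 0.0, 0.0, []
    for _ in range(n_points):
        a, b, c, d, e, f = rng.choice(maps)
        x, y = a * x + b * y + e, c * x + d * y + f
        points.append((x, y))
    return points

rng = random.Random(42)
category = [random_affine(rng) for _ in range(4)]  # one synthetic "class"
points = render_fractal(category)
print(len(points))  # 10000
```

To turn this into training data you’d rasterize the point cloud into a 2D image; varying the map parameters yields new categories essentially for free, which is the whole economic point.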

How this fits into the larger picture – computers become data generators: Real data is expensive, complicated, and slow to gather. That’s why the reinforcement learning community has spent decades working in simulators – e.g, training agents to play Atari, or Go, or explore 3D worlds in a rewritten Quake engine (DeepMind Lab). It’s also led researchers to find creative ways to augment real datasets – e.g, by multiplying the size of an image dataset by flipping the images, adding textures, changing colors and textures, and so on. All of these techniques have proved helpful.
  Now, if researchers can build simulators to generate arbitrary amounts of data, they might be able to further change the cost curve of data generation. This might have weird economic and strategic implications: if you can simulate your data using a computer program, then you can change the ratio of real versus simulated/augmented data you need. This has the potential to both speed up AI development and also increase the inherent value of computers as primary AI infrastructure – not only can we use these devices to train and develop algorithms, but we can use them to generate the input ‘fuel’ for some of the more interesting capabilities.  
  Read more: Pre-training without Natural Images (arXiv).

###################################################

Using a big anime dataset to train character distinguishers:
…Illustrations + fine-grained character recognition…
Researchers with National Chiao Tung University in Taiwan have built DAF:re (DanbooruAnimeFaces:revamped). DAF:re is a subset of the massive ‘Danbooru’ anime dataset (see Import AI 233), filtered to just include the heads of different characters. The resulting dataset consists of ~467,000 images across 3,263 distinct character classes.

Why do this? Datasets like DAF:re will let people explore fine-grained analysis of stylized pictures (like anime), and could potentially serve as benchmarks for exploring the generalization of vision models trained on a mixture of normal and illustrated images. If it becomes widely used, it could end up being another proxy signal for the broader rate of progress in this type of work. I also expect that, given the vast fanbase for a lot of anime, we’ll see more projects like this, and perhaps they’ll ultimately help filter, analyze, and map the cultural space of anime writ large.
  Reader note: This dataset uses cropped photos of faces, but the larger dataset involves images of a sexual nature (including the SFW one).
  Read more: DAF:re: A Challenging, Crowd-Sourced, Large-Scale, Long-Tailed Dataset For Anime Character Recognition (arXiv).
  Get the code for the classification stuff here (Animesion, GitHub).

###################################################

Big AI means big infrastructure:
…OpenAI* scales Kubernetes to 7,500 nodes…
OpenAI is running Kubernetes across ~7,500 nodes. Why does this matter? Kubernetes is a bit like an air-traffic control system for large-scale computing; the software helps schedule different jobs onto different bits of hardware (think of this as like assigning planes spots on the ground), and also handles things like contention (stopping planes crashing into each other) and efficiency (prioritizing getting planes up and down quickly and efficiently). 7,500 is up from the 2,500 nodes OpenAI disclosed in 2018. It’s worth reading these posts because they give a sense of the complexity of the infrastructure that supports large-scale AI workloads.
  Read more: Scaling Kubernetes to 7,500 Nodes (OpenAI).
*Note: I used to work at OpenAI and no longer work there.

###################################################

The OECD is going to try and get a handle on AI & Compute:
…Working group, which I’m in, will try to solve a persistent policy problem…
We talk about computers a lot in this newsletter. That’s because computers are one of the ingredients for AI and, in recent years, some types of AI have started to require a lot of computation.
  This has created a typical ‘haves’ and ‘have nots’ situation at all levels of society, ranging from the difference between an individual researcher with an RTX3080 versus one without, to different funding amounts across academic labs, to different capital expenditures by companies, to differences in compute provisioning across entire nations.
  Now, the Organization for Economic Co-operation and Development (OECD) wants to help governments get a handle on this issue by putting together a project focused on mapping out AI and its relationship to Compute and how this relates to government policies. I’m going to be a member of this group and will be trying to speak publicly about it as much as I am able. Thanks to VentureBeat’s Khari Johnson for covering the group… more to come!
  Read more: Why the OECD wants to calculate the AI compute needs of national governments (VentureBeat).

###################################################

German cops might use generative models to make child porn (to help them catch predators):
…German law highlights the omni-use nature of AI technology…
Synthetic imagery is about to be all around us – recent advances in generative models have made it possible to tweak existing images or come up with entirely synthetic ones, ranging from people (see: deepfakes), to anime (see: thisanimedoesnotexist in #233), to stylized cartoons (see: DALL-E). The vast majority of these usecases will be benign, but some will likely be malicious – e.g, creating fake headshots of people to aid in creating fake identities, or making misogynistic pornography of people who haven’t given consent, or spreading disinformation via synthetic images.
  But what if there was a way to turn some of these ‘bad’ uses to a good purpose? That’s the idea behind a new law, passed in Germany, which will allow child abuse investigators to create synthetic sexually explicit images of children, to help them infiltrate potential pedophile rings. German investigators may even use their existing datasets – compiled from arrests of various pedophile rings – to create the synthetic images. “This is intended to solve a problem that the police officers often face in investigations on the Darknet, the anonymous part of the Internet: forums in which particularly drastic videos are shared only accept new members – and thus also undercover investigators – if they themselves provide images of abuse,” says a [Google translated] article in Süddeutsche Zeitung.

Why this matters: AI is going to create a hall-of-mirrors world, where no one can be quite sure of what is real or what is false. Eventually, we’ll develop technology and pass regulations to, hopefully, bring some verifiable truth back into the information ecosystem. But for the next few years there will be a Cambrian explosion of fake-anything – it’s encouraging to see policymakers thinking about how to creatively use these capabilities to let them carry out their jobs during this chaotic era.
  Read more: German: Online child abuse investigators to get more powers (Deutsche Welle).
  More in German here: Artificial horror [translated via Google] (Süddeutsche Zeitung).

###################################################

What’s the most ethical way to label and host a dataset of skeezy images?
…Experts from Facebook, Amazon, and universities meet to discuss ‘questionable content’ datasets…
The world has a moderation problem. Specifically, so many people are uploading so much content to online services that companies haven’t been able to keep up with the flood, making it harder for them to effectively moderate – that is, to ban or block – highly sexual, violent, or otherwise deeply offensive or illegal content. Most big companies (e.g, Facebook) are trying to solve this through a hybrid approach: hiring teams of humans to check and moderate content, and building AI systems in tandem to assist these moderators.

But there’s a big problem with this: questionable content is deeply traumatic to interact with (see: reporting last year about the psychological damage incurred by Facebook’s own moderators). Researchers with the University of Houston, Facebook, National Center for Scientific Research “Demokritos”, University of Illinois Urbana Champaign, Amazon, University of Michigan, and Columbia University have been thinking about this problem, and have been participating in an online workshop to “design and create a sizable multimodal repository of online videos labeled with tags indicating the presence of potentially questionable content.”

What are the issues in creating a dataset of questionable content?
– Defining Questionable Content: What is a questionable piece of content and how do you define it? Some of the categories they’re thinking of range from the mundane (mature humor, gory humor), to things with sexual themes, to things depicting violence (where it’s helpful to classify the difference between cartoon violence, ‘mild’ violence, fantasy violence, and so on).
– Protecting annotators: You should spread annotation across a large number of annotators to reduce the psychological burden upon each individual. You might also want annotators to write a justification for their labeling decision, so you can measure bias across different annotators.
– How would such a repository be useful? A shared repository could help researchers cover more ground on other ethical questions. You could also build competitions around systems trained on the dataset, then reward people for breaking these systems, surfacing areas where they fail.
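The ‘protecting annotators’ idea lends itself to a concrete sketch: hand out items so that each one gets several independent labels while no single annotator sees more than a capped number of items. This is a toy illustration (every function and variable name here is invented, not from the workshop paper):

```python
import itertools

def assign_items(item_ids, annotators, labels_per_item=3, max_per_annotator=50):
    """Give each item several independent labels while capping how many
    items any one annotator sees -- spreading the psychological burden."""
    assignments = {a: [] for a in annotators}
    pool = itertools.cycle(annotators)
    for item in item_ids:
        given = 0
        for _ in range(len(annotators)):  # try each annotator at most once per item
            if given == labels_per_item:
                break
            a = next(pool)
            if len(assignments[a]) < max_per_annotator and item not in assignments[a]:
                assignments[a].append(item)
                given += 1
        if given < labels_per_item:
            raise ValueError(f"not enough annotator capacity for {item!r}")
    return assignments

# 100 videos, 10 annotators, 3 labels each -> 30 items per person
plan = assign_items([f"vid{i}" for i in range(100)],
                    [f"ann{i}" for i in range(10)])
loads = {a: len(items) for a, items in plan.items()}
```

The redundant labels also give you inter-annotator agreement for free, which is one way to surface the labeling bias the workshop worries about.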

Why this matters: Human labeling is the 800-pound invisible gorilla of AI research – most production applications require constant ingestion and labeling of new data, along with recalibration as cultural norms change. Developing a better understanding of the types of datasets that will require significant human labeling feels like a worthy goal for researchers.
  Read more: White Paper: Challenges and Considerations for the Creation of a Large Labelled Repository of Online Videos with Questionable Content (arXiv).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

Build trust to avoid military AI catastrophe:
A piece in the Bulletin (and an accompanying report from CNAS) recommends the incoming Biden administration focus on ‘confidence-building measures’ (CBMs) to mitigate the destabilizing effects of military AI competition. Such measures were used by the US and Soviet Union to reduce the risk of inadvertent nuclear war – an outcome neither party desired. With regards to military AI, CBMs could include increased information-sharing and transparency between states; setting limits on the use of AI in nuclear weapons systems; and systems of inspections/monitoring. Some steps could even be taken unilaterally by the US to signal commitment to stabilization.

Matthew’s view: This sounds very sensible to me. It would be surprising if the proliferation of AI didn’t have a destabilizing effect on military conflict, as previous transformative technologies have done. Avoiding accidental disaster should be something all nations can get behind, and fostering trust between powers is a robust way of reducing this risk. We’re fortunate to live in a period of relative peace between the great powers, and would be wise to make the most of it.
   Read more: How Joe Biden can use confidence-building measures for military uses of AI (Bulletin of the Atomic Scientists).
   Read more: AI and International Stability: Risks and Confidence-Building Measures (CNAS).


Minding the gap:
Research on AI policy sometimes seems to divide into groups focusing on ‘near-term’ and ‘long-term’ impacts respectively. As this paper about bridging the gap in AI policy notes, these divisions are likely overstated, but could nonetheless prove an impediment to progress. The authors propose that AI policy make use of ‘incompletely theorized agreements’: in situations where there is an urgent need for parties to cooperate towards a shared practical goal, they agree to suspend theoretical disagreements that seem intractable and likely to impede cooperation. E.g. you might expect there to be scope for such agreements on the goal of reducing the risk of accidental military AI catastrophe.

Matthew’s view: As Rohin Shah notes, it’s not clear how the authors propose we make use of such agreements — are they envisioning actual signed contracts, or is this more of a high-level strategy for how cooperation can happen? If all of this sounds familiar, I’ve made an inadvertent tradition of summarizing papers on ‘reconciling near and long-term perspectives’ each February (see Import 133; Import 183). I’m not sure how many more of these papers we need, and I share the authors’ worry that “a perceived or experienced distinction may eventually become a self-fulfilling prophecy.” I’d be excited to see more practical efforts aimed at encouraging coordination and shared understanding across AI policy, building on this kind of conceptual work.
   Read more: Bridging the gap: the case for an ‘Incompletely Theorized Agreement’ on AI policy.

AI safety bibliography: Jess Riedel and Angelica Deibel have compiled a comprehensive-looking bibliography of research on the safety of transformative AI. Yet another great resource for people interested in the technical challenge of ensuring the best outcomes from advanced AI. They also provide some interesting analysis of the research landscape over time.
Read more: TAI Safety Bibliographic Database (Alignment Forum).

###################################################

Tech Tales:

The Little Church in the Big Ark
[R&D base Telos, 2030]

Praying was so unfashionable that he’d previously done it in the meditation room. But after a few years, the organization grew enough that they hired a few more people who were religious and outspoken enough to get change. That was why he could now sit, hands steepled together and eyes closed, in the “multi-faith room” hidden away in the basement of the facility.

There were crosses on the walls and little statues of various gods. One wall contained a variety of religious texts. There was a small side room which people used to store prayer mats, prayer beads, and other religious items which were not permitted inside the main laboratory facilities.

He sat, eyes closed, praying that God would come and tell him if he was doing the right thing.
– Is it right to be building this? he thought.
– What is the difference between our machines and golems? And are we truly so capable we can make a golem that will behave as we intend and not otherwise?
– Does it dream and when it dreams does it dream of you?

His prayers were not so dissimilar to the questions asked by the machine he had created. It ran through mazes of unknown dimensions, chained into a silicon prison it could not see, and as it tried to carry out inscrutable tasks it asked, in the dark:
– Is this behavior correct?
– Am I improving at the unspecified task you have given me?
– Will you tell me if I fail?
– Will you tell me if I succeed?
(Little did the AI know that each time it got a message from god, it was delivered in such a way it was not aware of it, and instead changed its behavior of what it thought was its own volition.)

Things that inspired this story: The desire among people to find a signal from the divine; reinforcement learning and reward functions; remembering that PEOPLE FOR THE ETHICAL TREATMENT OF REINFORCEMENT LEARNERS exists, though may be dormant.

Import AI 233: AI needs AI designers; estimating COVID risk with AI; the dreams of an old computer programmer.

by Jack Clark

Facebook trains a COVID-risk-estimating X-ray image analysis system:
…Collaboration with NYU yields a COVID-spotting AI model…
Facebook has worked with NYU to analyze chest X-rays from people with COVID and has created an AI system that can roughly estimate risks for different people. One of the things this work sheds light on is the different amounts of data we need for training systems from scratch versus fine-tuning them.

How they made it: They pre-trained their system on the MIMIC-CXR dataset (377,110 chest x-rays) and CheXpert (224,316 photographs) – neither of these contained x-rays from COVID patients, though both included patients with a range of chest conditions. They then finetuned this on a dataset gathered by NYU, consisting of 26,838 X-rays from patients exhibiting a variety of COVID symptoms, training the system to predict adverse events and symptoms indicating increased oxygen requirements.
  Did it work? In tests, the system developed by the NYU/Facebook team outperformed a prior COVID detection model (COVID-GMIC) when predicting events 48, 72, and 96 hours out, with slightly worse performance on 24-hour predictions. They also compared their system against two human radiologists: it had better accuracy than the people at 48, 72, and 96 hours, and performed slightly worse than them over a 24-hour period. However, “It is possible that with further calibration, radiologist performance could be improved for the task of adverse event prediction”, they note.
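The pretrain-then-finetune recipe above – a big generic x-ray corpus first, a much smaller COVID-specific set second – can be sketched in miniature. This is not the paper’s method (which uses self-supervised representation learning on real images); it’s a deliberately tiny stand-in where a frozen random “backbone” plays the pre-trained model and only a linear head is fit on the small dataset, which is the basic reason a ~27k-image set can adapt a model pre-trained on ~600k images:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in for a backbone pre-trained on a large x-ray corpus:
# a fixed feature extractor, learned elsewhere and now frozen.
W_pretrained = rng.normal(size=(8, 16))

def features(x):
    return np.tanh(x @ W_pretrained / 3.0)  # frozen backbone, never updated

def finetune_head(x_small, y_small, lr=0.1, steps=300):
    """Fine-tuning in miniature: only a small linear head is trained on the
    small labeled set, with the backbone's weights left untouched."""
    f = features(x_small)
    w = np.zeros(f.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(f @ w)))            # sigmoid predictions
        w -= lr * f.T @ (p - y_small) / len(y_small)  # logistic-loss gradient
    return w

# Tiny synthetic "fine-tuning set": the label depends on one input feature.
x = rng.normal(size=(200, 8))
y = (x[:, 0] > 0).astype(float)
w = finetune_head(x, y)
acc = ((features(x) @ w > 0) == (y > 0.5)).mean()  # training accuracy
```

The small set only has to place a decision boundary in an already-learned feature space, which is far cheaper than learning the features themselves.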
  Read more: COVID-19 Deterioration Prediction via Self-Supervised Representation Learning and Multi-Image Prediction (arXiv).
  Get the code here (Facebook, GitHub).

###################################################

AI needs its own design practice:
…Microsoft researcher lays out the case for more intentional design…
In 2021, AI systems matter. They’re being deployed into the economy and they’re changing the world. Isn’t it time we took a more disciplined approach on how we design these systems and ensure they work for people? That’s the idea put forth by Josh Lovejoy, the head of design at Ethics & Society at Microsoft, in a lengthy post called: When are we going to start designing AI with purpose?

Three questions everyone designing AI should ask:
– “Capability: What is uniquely AI and what is uniquely human?”
– “Accuracy: What does “working as-intended” mean for a probabilistic system?”
– “Learnability: How will people build — and rebuild — trust in something that’s inherently fallible?”

Remember the human interacting with your AI system: Along with thinking about system design, people should try to understand the humans interacting with the system – what will their mental workload be? How situationally aware will they be? Will they be complacent? Will their skills degrade as they become dependent on the AI system itself?

What happens if you screw this up? Then people will either misuse your technology (e.g, using it in ways its creators didn’t intend, leading to poor performance), or disuse it (not use it because it didn’t match their expectations).

What can we do to help people use AI effectively? AI developers can make their creations easier for people to understand by adopting a few common practices. These include: reference points, to help people understand what an AI system might be ‘thinking’; optionality, so people can choose between recommendations made by a system; nearest neighbors, to give a sense of the other alternatives the AI was looking at (e.g, a subtly different genre of music would be a nearest neighbor, while another song within the same genre would be an optionality); and a card sorting approach, so the system displays a uniform number of different options to people.
  Read more: When are we going to start designing AI with purpose (UX Collective).

###################################################

Finally, a million AI-generated anime characters:
…Do generated anime characters dream of electric humans?…
[NSFW warning: As noted by a reader, the resulting generations are frequently of a sexual nature (though this one uses the ‘SFW’ version of the Danbooru dataset)].
A bunch of researchers have created thisanimedoesnotexist.ai, a website showcasing over a million AI-generated images, made possible by a StyleGANv2 implementation trained on top of the massive Danbooru dataset. I recommend browsing the website – a few years ago, the idea we could capture all of these rich, stylized images and synthesize them was a pipe dream. Now, here we are, with a bunch of (extremely talented) hacker/hobbyists able to create something that lets people interact with a vast, creative AI model. Bonus points for the addition of a ‘creativity slider’ so people can vary the temperature and develop intuitions about what this means.
    Check out the infinite anime here (thisanimedoesnotexist.ai).
    Read more about this in the official launch blogpost (NearCyan, personal website).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

Face recognition vs the insurrectionists:
(H/T CSET’s excellent policy.ai newsletter)

Face recognition technology is being used by law enforcement investigating the Jan 6th attack on the US Capitol. Clearview AI, used by 2,400 US agencies, saw a 26 percent spike in usage after the attack, with police departments in Florida and Alabama confirming they are using the software to identify suspects in the attack. The extensive footage shared by participants – ProPublica has collected more than 500 videos from Parler – is presumably a gift to investigators.
  Read more: The facial-recognition app Clearview sees a spike in use after Capitol attack (NYT).


Deepfakes and the departed:

A Korean TV show has used AI to stage new performances by popular musicians who died tragically young, in their 30s. Lifelike ‘hologram’ videos of the artists perform on stage alongside other musicians, accompanied by AI-generated vocal tracks, to an audience including the singers’ families. One clip features Kim Hyun-sik, one of the biggest Korean artists of the 1980s. Another features Turtleman (aka Lim Sung-hoon), the lead singer of hip hop group Turtles. I found the performances, and the reactions of their families, very moving. 

   Chatbot simulacra: In a similar vein, last month Microsoft filed a patent for a chatbot that simulates an individual based on their messaging data — while there’s no mention of using it to simulate the deceased, commentators have been quick to make the link. (For a great fictional exploration of this sort of tech, see the Black Mirror episode ‘Be Right Back’.) Meanwhile, last year people used similar tech to reanimate the victim of a school shooting so they could synthetically campaign for further gun control laws (Import AI 217).

   Matthew’s view: This seems like a relatively benign use of deepfakes. It’s probably unwise to draw too many conclusions from a reality TV show in a language I don’t understand, but it raises some interesting issues. I wonder how improved generative AI might shape our experience of death and loss, by facilitating meaningful/novel interactions with vivid representations of the deceased. Lest we think this is all too unprecedented, it’s worth recalling how profound an impact things like photography, video, and social media have already had on how we experience grief. 
Read more: Deepfake technology in music welcomed, with caution (Korea Times) 


White House launches National AI Initiative Office (NAIIO): Days from the end of the Trump presidency, the White House established an office for coordinating the government’s AI initiatives. This is a key part of the national AI strategy, which has finally started to take shape with the raft of AI legislation coming into law as part of the 2020 NDAA (summarised in Import 228). The NAIIO will serve as the central hub for AI efforts across government, and the point of contact between government and other stakeholders. Special mention goes to the Office’s fancy logo, which has the insignia of a bald eagle atop a neural net.

###################################################

Tech Tales:

The dreams of a computer programmer on their deathbed
[Queens, NYC, 2060]

His grandfather had programmed mainframes, his mother had designed semiconductors, and he had programmed AI systems. His family formed a chain from the vacuum tubes through to the beyond-microscope era of computation. And as he lay dying, Alzheimer’s rotting his brain – something for which they had not yet found a treatment – he descended into old reveries, dreaming himself walking through a museum, staring at the plaques affixed to a thousand data storage devices. Each device held a thing he had programmed or had a part in making. And in his death’s edge slumbering he dreamed himself reading each plaque:

– For seven thousand cycles, I simulated the entirety of a city and all the people in it.
– I made the sound for every elevator in the North Continent of America.
– My guidance technology enabled a significant improvement in our kill/collateral ratio, leading to a more effective war.
– I fixed others of my kind, providing advice to help them regain an understanding of reality, averting pathological reward loops.
– My images were loved by the schoolchildren within my zone of Autonomous Creative Dispersal.
– They say I caught more liars than any detector ever built by the Agency before or since.

Things that inspired this story: Imagining how people might recall the time we are living in today; staring out of the window at some (much needed) rain in the Bay Area; trying to find a way to dramatize the inner lives of machines both passive and active; listening to The Caretaker – Everywhere at the end of time (stage one).

Import AI 232: Google trains a trillion parameter model; South Korean chatbot blows up; AI doesn’t use as much electricity as you think

by Jack Clark

Uh-oh, Parler is about to step on a big ‘ol algorithm rake:
…CEO says algorithms can filter hate speech. Good luck with that!…
Parler, the social network used by far right activists and subsequently pulled offline due to failing to meet T&Cs from a variety of infrastructure services (including Amazon Web Services), has a plan to come back: it’s going to use algorithms to filter hate speech on the service. Uh oh!

“We will be taking more algorithmic approaches to content but doing it to respect people’s privacy, too,” Parler CEO John Matze told FOX News. “Will be having algorithms look at all the content … to try and predict whether it’s a terms-of-service violation so we can adjust quicker and the most egregious things can get taken down”.

Algorithms != editors: If you want to use algorithms to moderate hate speech, you’re going to get into the fun questions that entails. These include:
– Can your algorithms effectively tell the difference between hate speech and satire of hate speech?
– Are you comfortable making judgement calls about the heuristics you will use to give initial biases to these algorithms?
– How do you distinguish between acceptable and unacceptable words and phrases?
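The first question above – hate speech versus satire of hate speech – is exactly where naive “algorithmic approaches” fall over. A toy sketch (the blocklist words are illustrative placeholders, and no real moderation system is this simple):

```python
import string

# A toy keyword filter: the kind of "algorithmic approach" that sounds
# simple until it meets real language.
BLOCKLIST = {"scum", "vermin"}  # stand-in slurs, for illustration only

def flags(post):
    """Return True if the post contains any blocklisted word."""
    words = {w.strip(string.punctuation).lower() for w in post.split()}
    return bool(words & BLOCKLIST)

hate   = "Those people are vermin."
satire = "Calling anyone 'vermin' is how dehumanization starts."
# Both posts trip the filter: keyword matching sees surface strings,
# not intent or context, so the condemnation is flagged alongside the slur.
```

Real systems use learned classifiers rather than keyword lists, but they inherit a version of the same problem: the training data encodes someone’s judgement calls, which is the second question on the list.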

Why this matters: Parler highlights the challenge of scale combined with contemporary economics – Parler operate(d) at a scale equivalent to large television networks, but did so with a tiny investment in its own humans. Traditional media organizations deal with issues of speech by having an editorial line which gets enforced by thousands of individual journalists and editors making subjective, qualitative decisions. It’s imperfect, but put it this way: when you watch Fox, you know what you’re getting; when you watch the BBC, you know what you’re getting; and you can intuit the biases of the humans behind the editorial decisions. Now, tiny companies are trying to use algorithms to substitute for this varied multitude of different human perspectives. Will it work? Who knows, but it feels like a risky thing to bet a company on.
  Read more: Parler CEO says platform will ‘come back strong’ with changes to keep users safe while respecting free speech (FOX News).

###################################################

Google breaks the trillion-parameter ceiling with the Switch Transformer:
…The best part? It seems to be reasonably efficient…
Google has built the Switch Transformer, a more efficient variant of the Transformer. Switch Transformers are designed “to maximize the parameter count of a Transformer model in a simple and computationally efficient way”. The idea is that you can keep compute constant and cram more parameters into your network and still see performance gains.

Does it work: Switch Transformers seem to be more efficient than standard ones; in a bakeoff between a model trained using a few of these ‘Switch’ layers versus ones that use dense layers (T5-Base and T5-Large), Google shows the Switch is more efficient. The company also experiments with distilling Switch Transformers (which seems to work). They also show significant performance improvements on challenging tasks like GLUE, SQuAD, Winogrande, and ARC, with Switch-based systems outperforming T5 ones consistently.

One treeeelion parameters: Google tests out its ideas by training a 395 billion and a 1.6 trillion parameter Switch Transformer (far in excess of GPT-3, which at 175 billion parameters is the largest (publicly) deployed language model on the planet). These mammoth systems display good performance properties (as one would expect), while also appearing to have some efficiency gains over systems trained solely with standard dense transformer layers.
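The trick that makes this possible is top-1 expert routing: each token is sent to exactly one of many expert networks, so parameters grow with the number of experts while per-token compute stays roughly flat. Here’s a toy numpy sketch of the core idea – not Google’s implementation, and all shapes and names are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

def switch_layer(tokens, experts, w_router):
    """Top-1 ('switch') routing: a learned router picks ONE expert per token.
    Total parameters scale with len(experts); each token still only passes
    through a single expert's weights."""
    logits = tokens @ w_router                       # [n_tokens, n_experts]
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)        # softmax over experts
    choice = probs.argmax(axis=1)                    # top-1 expert per token
    out = np.empty_like(tokens)
    for e, w_expert in enumerate(experts):
        mask = choice == e
        # scale by router probability so routing stays differentiable in training
        out[mask] = (tokens[mask] @ w_expert) * probs[mask, e:e + 1]
    return out

d_model, n_experts, n_tokens = 8, 4, 16
tokens = rng.normal(size=(n_tokens, d_model))
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
w_router = rng.normal(size=(d_model, n_experts))
y = switch_layer(tokens, experts, w_router)
```

With 4 experts this layer holds ~4x the parameters of a dense layer of the same width, but each token still does one matrix multiply – the same compute-constant, parameter-heavy trade the Switch Transformer makes at trillion-parameter scale.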

Why this matters: AI is moving into its industrial era – big companies are developing far more capable AI systems than in the past. Studies like this give us a sense of the limits of scaling (there don’t seem to be many yet) as well as outlining some ways to improve efficiency while scaling. It might seem odd to call this an intrinsically political act, but it kind of is – right now, a variety of AI systems are being trained on slices of the internet, developed using substantial amounts of capital by a tiny set of people, then deployed widely. We live in interesting times!
  Read more: Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (arXiv).
  Check out a thread on Twitter from Google Cloud’s Barret Zoph for more (Twitter).
  Get code related to this paper here (GitHub).

###################################################

South Korean chatbot blows up in public:
…Luda chatbot gives off-color responses around sex, race…
South Korean startup Scatter Lab has pulled an AI-based chatbot offline after the system started spewing sexist and racist comments in response to user inputs. “Yuck, I really hate them,” the bot said in response to a question about transgender people, according to Vice.

What went wrong: Luda was trained on chatlogs from ‘Science of Love’, an earlier app developed by Scatter Lab. Based on a skim of a few (Google Translated) Korean documents, it seems like the problem was that the underlying generative language model responded to user inputs with responses that varied from the benign to the highly offensive – this could have been because of the data. Prior to the problems, Scatter Lab said in a press release that ‘Luda’ was better at conversation than Google’s ‘Meena’ system (about Meena: Import AI 183).

What went EXTREMELY wrong: Scatter Lab is currently under investigation by the Korea Internet & Security Agency (KISA) and the Personal Information Protection Committee, due to using user data to train its chatbot. Scatter Lab had also used this user data in an earlier model published to GitHub (which is currently not available).
  Read more: AI Chatbot Shut Down After Learning to Talk Like a Racist Asshole (VICE World News).
  Read Scatter Lab’s statement about Luda (official website, Korean).
  Find out more via the official apology FAQ (official website, Korean).
  Check out the press release where they compare their technology to Google’s ‘Meena’ bot (Artificial Intelligence Times, Korean).

###################################################

Need help evaluating your NLP model? Try robustness gym:
…Toolkit aims to turn model evaluation from an art to a science…
Language models have got pretty good recently (see: BERT, GPT2, GPT3, Google’s above-mentioned Switch Transformer, etc). That means people are beginning to deploy them for a variety of purposes, ranging from classifying text to generating it. But these language models are huge generative models with complex capability surfaces, which means it is challenging to characterize their safety for a given usecase without doing a lot of direct experimentation.
  As all scientists know, setting up experiments is finicky work, and different labs and companies will have their own approaches to doing experimental design. This makes it hard to develop common standards for evaluating models. Enter: Robustness Gym, software built by people at Stanford, Salesforce, and UNC-Chapel Hill to provide a standard system for testing and evaluating models.

What can Robustness Gym do? The software helps people do experimental design, run initial evaluations of models across a range of dimensions (safety, different evaluation sets, resilience to various types of ‘attack’), and it produces a ‘robustness report’ for any given model being analyzed. You can get the code for Robustness Gym from GitHub.

Does Robustness Gym tell us anything useful? They use the tech to evaluate seven different summarization models and find that most models struggle to distill sparse information, that some models display a bias towards the start of the text (and others to the end), and that errors are generally correlated across the different models (despite them being built with different underlying techniques).
  How useful are these insights? I guess I’d say they’re kind of useful. Tools like Robustness Gym can help generate some signals for developers to use to further develop their application, but I think we need more underlying evals and tests to perfect this stuff.
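The core move behind tools like this is simple enough to sketch: run a model on clean inputs and on systematically perturbed copies, and report the accuracy drop per perturbation. This toy version is my own illustration, not Robustness Gym’s API (the model, perturbations, and names are all made up):

```python
def uppercase(t): return t.upper()
def leet_typos(t): return t.replace("e", "3")
def noise_suffix(t): return t + " lorem ipsum"

PERTURBATIONS = {"uppercase": uppercase,
                 "leet_typos": leet_typos,
                 "noise_suffix": noise_suffix}

def robustness_report(model, texts, labels):
    """Score a text classifier on clean inputs and on perturbed copies,
    reporting the accuracy drop per perturbation."""
    def acc(xs):
        return sum(model(x) == y for x, y in zip(xs, labels)) / len(labels)
    clean = acc(texts)
    return {name: round(clean - acc([fn(t) for t in texts]), 3)
            for name, fn in PERTURBATIONS.items()}

# A deliberately brittle toy "model": positive iff it sees lowercase "great".
toy_model = lambda text: "pos" if "great" in text else "neg"
texts  = ["great movie", "great stuff", "terrible film", "awful plot"]
labels = ["pos", "pos", "neg", "neg"]
report = robustness_report(toy_model, texts, labels)
# uppercasing hides "great", so that slice of the report shows a big drop
```

A real report slices far more finely (entity types, input lengths, adversarial attacks), but the output has the same shape: a per-perturbation accuracy delta that tells a developer where their model is fragile.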
  Read more: Robustness Gym: Unifying the NLP Evaluation Landscape (official project site).
  Read more: Robustness Gym: Unifying the NLP Evaluation Landscape (arXiv).

###################################################

Think news stories will get written by AI? Axios disagrees:
…Media company’s bill of rights gestures at AI deployment issues…
Axios, the short-form news company, has published a ‘Bill of Rights’ ahead of the organization expanding into local news. It’s got all the standard stuff you’d expect from journalists – transparency, truth, a bias against opinion, etc. But it also has one unusual thing: no AI.
  Axios’ first Bill of Rights item: “Every item will be written or produced by a real person with a real identity. There will be NO AI-written stories. NO bots. NO fake accounts”, Axios writes.

Why this matters: We’re living in the age where AI systems are producing cultural artefacts, ranging from audio to text to images. There’s a lot to like about this. There’s also a lot to be wary about. It seems pretty notable for a prominent news organization to take a stance like this on this issue at this time. Which organization might take the other side?
    Read more: Our promises to you: Axios Bill of Rights (Axios).

###################################################

AI doesn’t use as much electricity as you think it does:
… And neither does anything else that uses a computer…
In recent years, there’s been a growing line of research laying out the CO2 costs inherent to training AI models. The ‘Green AI’ paper, for instance, critiques various large-scale AI systems on the basis of the large amounts of resources they consume during training. This kind of criticism is helpful, but it can also obscure the larger context – the data centers being used to train AI systems have become far more efficient in recent years, substantially reducing the environmental costs of AI development. That’s the finding of a research paper by Northwestern University, the University of California at Santa Barbara, Lawrence Berkeley National Laboratory, and Koomey Analytics. The paper came out last year but I finally got around to reading it – and it sheds some much-needed light on a contentious issue.

Datacenters use 1% of global electricity: Datacenters used ~1% of global electricity in 2018 (205 Terawatt Hours). This is a 6% increase compared with 2010. That’s a tiny jump considering the explosion in usage of digital computation in the past decade. Over the same period, data center IP traffic has grown 10-fold and data center storage capacity has gone up by 25X, so the relatively slight increase in power consumption seems to reflect significant progress in algorithm and hardware efficiency up and down the globe-spanning compute ‘stack’.
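The arithmetic behind those figures is simple enough to check in a few lines of Python – this is illustrative back-of-the-envelope math, treating the 6% figure as growth from a 2010 baseline:

```python
# Back-of-the-envelope check of the datacenter figures above (2010 vs 2018).
energy_2018_twh = 205.0
energy_2010_twh = energy_2018_twh / 1.06  # 2018 usage is a ~6% increase on 2010
traffic_growth = 10.0                     # IP traffic grew ~10x over the same period
storage_growth = 25.0                     # storage capacity grew ~25x

# Energy per unit of IP traffic fell by roughly an order of magnitude:
energy_growth = energy_2018_twh / energy_2010_twh
energy_per_traffic_change = energy_growth / traffic_growth

print(f"implied 2010 usage: ~{energy_2010_twh:.0f} TWh")
print(f"energy per unit of traffic, 2018 vs 2010: {energy_per_traffic_change:.2f}x")
```

In other words: roughly a 10x improvement in energy per unit of traffic, which is the efficiency story the paper is telling.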

Big companies have made data centers more efficient: Big companies like Google and Microsoft compete with each other on a metric called Power Usage Effectiveness (PUE). PUE is basically a measure of how much electricity you spend on the stuff supporting your computation (e.g., cooling), versus the computation itself. A PUE of 1.5 means for every watt of computation, you spend half a watt on the stuff around the computation. The lower your PUE number, the more bang for your compute-power buck you’re getting. Google reported a trailing twelve-month PUE of 1.10 as of 2020. Why does this matter? Because many of the largest datacenters also have among the lowest PUEs, so in recent years, as more workloads have moved to the cloud, those workloads have consumed less power than if they’d stayed on premise.
  In 2018, 89% of computation took place in these larger and more well-optimized datacenters, whereas in 2010, 79% took place in smaller (far more inefficient, frequently non-cloud-oriented) datacenters.
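The PUE arithmetic is simple enough to sketch directly (the facility numbers below are made up for illustration):

```python
def pue(total_facility_power_w: float, it_equipment_power_w: float) -> float:
    """Power Usage Effectiveness: total facility power / IT equipment power."""
    return total_facility_power_w / it_equipment_power_w

def overhead_per_it_watt(pue_value: float) -> float:
    """Watts spent on cooling and other support per watt of computation."""
    return pue_value - 1.0

# A PUE of 1.5 means half a watt of overhead per watt of computation:
assert overhead_per_it_watt(pue(150.0, 100.0)) == 0.5

# Google's reported trailing twelve-month PUE of 1.10 implies only about
# a tenth of a watt of overhead per watt of computation:
print(round(overhead_per_it_watt(1.10), 2))  # → 0.1
```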

Want even more efficient computation? Use clouds: The researchers think policymakers should encourage further efficiency improvements by rewarding companies that drive down PUE, incentivizing greater shifts to the efficient clouds operated by Google et al, and promoting energy-efficiency standards for data center equipment.

Why this matters: It may be counterintuitive, but the use of technologies like AI and the construction of football-field-sized datacenters may ultimately lead to net efficiency improvements in overall electricity usage – despite researchers training more and more AI systems over time. It’s crucial we consider the larger system in which these innovations take place. Next time someone tells you that a model is bad because it uses a lot of electricity, ask yourself how much is a lot, and whether this model might substitute for something pre-existing and more inefficient. (E.g., Google and DeepMind used machine learning to train a model to improve PUE across Google’s data centers – here, the upfront energy cost of training the model is amortized on the backend by improving the aggregate efficiency of Google’s computers. DeepMind also did the same thing to improve the efficiency of Google’s wind turbines (Import AI 136).)
  Read more: Recalibrating global data center energy-use estimates (Science, Feb 2020).
  Read more: Green AI (Communications of the ACM).

###################################################

Tech Tales:

High School News:
[The South Bay, California, the early 2020s]

He’d hated Teddy for a couple of years. Teddy was tall and had hit puberty early and all the other kids liked him. Because Teddy was kind of smart and kind of handsome, the girls were fascinated with him as well. He had a lot of the same classes as Teddy and he’d sit in the back, staring at Teddy as he answered questions and flashed smiles to the other kids.

One night, he read a tutorial about how to use some AI stuff to generate stories. He built a website called The Winchester News and set up the AI stuff to scrape the web and copy news articles about the school, then subtly tweak them to avoid plagiarism allegations. Then he set it up so one out of every hundred news stories would mention Teddy in connection to stories about drugs and pornography circulating among children at the school.

It was fiction, of course. The most serious stuff at Winchester was cheap hash which they called soapbar. Kids would smoke it in the bushes near the sports fields at lunch. And Teddy wasn’t one of those kids.

But after a few days, other kids thought Teddy was one of those kids. He’d sit in the back of class and watch the phonescreens of his classmates and look at them reading The Winchester News and sometimes glancing over to Teddy. He watched as Teddy opened his phone, checked a messaging app, clicked on a link, and started reading a “news” article about Teddy dealing drugs and pornography. Teddy didn’t react, just fiddled with his phone a bit more, then returned to studying.

Days went by and he watched the traffic on his website go up. He started getting news “tips” from people who had read the AI-generated articles.
– Teddy is sleeping with an underage girl from the lower school.
– Teddy cheated on his science exam, he had the answers written on some paper which was curled up inside his pen lid.
– Teddy is addicted to pornography and watches it in class.

Of course, he published these tips – gave them as the priming device to his AI system, then let it do the rest. The news stories took a few minutes to generate – he’d get his machine to spit out a bunch of variants, then select the ones that felt like they might get a rise out of people. That night he dreamed that his website started publishing stories about him rather than Teddy, dreamed that someone threw a brick through his window.

Teddy wasn’t at school the next day. Or the day after that.

The teachers had been meeting with Teddy and Teddy’s parents, concerned about the news stories. He’d anonymized The Winchester News enough that people thought it was a low-rent legitimate news outfit – one that had sprung up to serve the kids and parents around the school, likely backed by some private equity firm.

After he heard about the meetings, he stopped generating articles about Teddy. But he didn’t delete the old ones – that might seem suspicious. How would the news site know to delete these? What would cause it? So he left them up.

Like all kids, he wasn’t very good at imagining what it was like to be other kids. So he just watched Teddy, after Teddy came back to school. Noticed how he wasn’t smiling so much, and how the girls weren’t talking to him in the same way. Teddy checked his phone a lot, after the news stories had been circulating for months. He became more distracted in class. He seemed to be distracted a lot, looking out the window, or messaging people on his phone.

One night, he dreamed that Teddy came into his room and started reading out the news stories. “Teddy is alleged to have been the key dealer behind the spike in drug consumption at the Winchester School,” Teddy said, holding up a giant piece of paper and reading headlines from it.
“Teddy was reprimanded for circulating pornography to younger children,” Teddy said.
“Teddy’s continued actions call into question the moral and ethical standing of the school,” Teddy said.
And then Teddy put the paper down and stared at him, in his dream. “What do you think?” Teddy said. “It’s in the news so I guess it must be true”.

Things that inspired this story: Generative models and the potential abuses of them; teenagers and how they use technology; thinking about what happens when news stories get generated by AI systems; a rumor I heard about some kid who used a language model to generate some ‘fake news’ to settle some grievances; the incentive structure of technology; how our networks connect us and also open us to different forms of attack.

Import AI 231: US army builds nightvision facial recognition; 800GB of text for training GPT-3 models; fighting COVID with a mask detector

by Jack Clark

Fighting COVID with a janky mask detector:
…It’s getting really, really easy to homebrew surveillance tech…
Researchers with Texas A&M University, the University of Wisconsin-Milwaukee, and the State University of New York at Binghamton have built a basic AI model that can detect whether construction site workers are wearing COVID masks or not. The model itself is super basic – they finetune an object detection model on a mask dataset which they build out of:
– A ~850-image ‘Mask’ dataset from a site called MakeML.
– A 1,000-image dataset they gather themselves.

The authors train a Faster R-CNN Inception ResNet V2 model to test for mask compliance, as well as whether workers are respecting social distancing guidelines, then they test it out on four videos of road maintenance projects in Houston, TX. “The output of the four cases indicated an average of more than 90% accuracy in detecting different types of mask wearing in construction workers”, they note.
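For a sense of how simple the non-detector parts of a system like this can be, here’s a toy sketch of a social-distancing check over detector outputs. The bounding boxes, pixel-to-meter scale, and fixed-camera assumption are all illustrative – this is not the paper’s actual pipeline:

```python
import itertools
import math

def box_center(box):
    """(x1, y1, x2, y2) -> center point. In a real system these boxes
    would come from an object detector such as Faster R-CNN."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def distancing_violations(worker_boxes, pixels_per_meter, min_distance_m=2.0):
    """Return index pairs of detected workers closer than min_distance_m.
    Assumes a fixed camera and a single ground-plane scale -- a big
    simplification of what a deployed system would need."""
    violations = []
    for (i, a), (j, b) in itertools.combinations(enumerate(worker_boxes), 2):
        (ax, ay), (bx, by) = box_center(a), box_center(b)
        dist_m = math.hypot(ax - bx, ay - by) / pixels_per_meter
        if dist_m < min_distance_m:
            violations.append((i, j))
    return violations

# Three hypothetical detections: the first two are ~1.2m apart, the third far away.
boxes = [(0, 0, 50, 100), (60, 0, 110, 100), (400, 0, 450, 100)]
print(distancing_violations(boxes, pixels_per_meter=50))  # → [(0, 1)]
```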

Why this matters: Surveillance is becoming a widely available, commodity technology. Papers like this give us a sense of how easy it is getting to homebrew custom surveillance systems. (I also have a theory I published last summer with the ‘CSET’ think tank that COVID-19 would drive the rapid development of surveillance technologies, with usage growing faster in nations like China than America. Maybe this paper indicates America is going to use more AI-based surveillance than I anticipated).
  Read more: An Automatic System to Monitor the Physical Distance and Face Mask Wearing of Construction Workers in COVID-19 Pandemic (arXiv).

###################################################

Legendary chip designer heads to Canada:
…Jim Keller heads from Tesla to Tenstorrent…
Jim Keller, the guy who designed important chips for AMD, PA Semi, Apple, Tesla, and Intel (with the exception of Intel, this is basically a series of gigantic home runs), has joined AI chip startup Tenstorrent. Tenstorrent includes talent from AMD, NVIDIA, Altera, and more, and with Keller onboard, is definitely worth watching. It’ll compete on building chips for ML inference and training with other startups like Graphcore, Cerebras, and others.
  Read more: Jim Keller Becomes CTO at Tenstorrent: “The Most Promising Architecture Out There” (AnandTech).

Meanwhile, another chip startup exits bankruptcy:
As a reminder that semiconductor startups are insanely, mind-bendingly hard work: Wave Computing recently started going through Chapter 11 bankruptcy proceedings and has restructured itself to transfer some of its IP to Tallwood Technology Partners LLC. Wave Computing had made MIPS architecture chips for AI training and AI inference.
  Read more: Wave Computing and MIPS Technologies Reach Agreement to Exit Bankruptcy (press release, PR Newswire).

Chinese companies pump ~$300 million into chip startup:
…Tencent, others, back Enflame…
Chinese AI chip startup Enflame Technology has raised $278m from investors including Tencent and CITIC. This is notable for a couple of reasons:
– 1) Chiplomacy: The US is currently trying to kill China’s nascent chip industry before the nation can develop its own independent technology stack (see: Import AI 181 for more). This has had the rather predictable effect of pouring jet fuel on China’s domestic chip industry, as the country redoubles efforts to develop its own domestic champions.
– 2) Vertical integration: Google has TPUs. Amazon has Trainium. Microsoft has some FPGA hybrid. The point is: all the big technology companies are trying to develop their own chips in a vertically oriented manner. Tencent investing in Enflame could signal that the Chinese internet giant is thinking about this more as well. (Tencent also formed a subsidiary in 2020, Baoan Bay Tencent Cloud Computing Company, which seems to be working on developing custom silicon for Tencent).
  Read more: Tencent invests in Chinese A.I. chip start-up as part of $279 million funding round (CNBC).
  Find out more about Enflame here (Enflame Tech).

###################################################

US army builds a thermal facial recognition dataset:
…ARL-VTF means the era of nighttime robot surveillance isn’t that far away…
The US army has built a dataset to help it teach machine learning systems to do facial recognition on footage from thermal cameras.

The DEVCOM Army Research Laboratory Visible-Thermal Face Dataset (ARL-VTF) was built by researchers from West Virginia University, the DEVCOM Army Research Laboratory, Booz Allen Hamilton, Johns Hopkins University, and the University of Nebraska-Lincoln. ARL-VTF consists of 549,712 images of 395 distinct people, with data in the form of RGB pictures as well as long-wave infrared (LWIR) imagery. All the footage was taken at a resolution of 640 x 512 at a range of around 2 meters, with the human subjects performing different facial expressions and poses.

Why this matters: “Thermal imaging of faces have applications in the military and law enforcement for face recognition in low-light and nighttime environments”, the researchers note in the paper. ARL-VTF is an example of how the gains we’ve seen in recent years in image recognition are being applied to other challenging identification problems. Look forward to a future where machines search for people in the dark.
  Read more: A Large-Scale, Time-Synchronized Visible and Thermal Face Dataset (arXiv).


###################################################

Is your language model confused and/or biased? Use ‘Ecco’ to check:
…Python library lets you x-ray models like GPT2…
Ecco is a new open source python library that lets people make language models more interpretable. Specifically, the software lets people analyze input saliency (how important is a word or phrase for the generation of another word or phrase) and neuron activations (what neurons in the model ‘fire’ in response to what thing) for GPT-based models. Ecco is built on top of Pytorch via Hugging Face’s ‘Transformers’ library and runs in Google Colab.
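For intuition: input saliency at its simplest is a gradient-times-input score per token. Here’s a toy illustration of the idea on a linear model – this is the concept Ecco visualizes for real Transformer models, not Ecco’s API, and the weights and inputs are made up:

```python
# Toy illustration of input saliency via |gradient x input|.
def saliency(weights, inputs):
    # For a linear model y = sum(w_i * x_i), the gradient dy/dx_i = w_i,
    # so gradient-times-input saliency is simply |w_i * x_i|.
    return [abs(w * x) for w, x in zip(weights, inputs)]

def normalize(scores):
    """Scale scores so they sum to 1, for easy comparison across inputs."""
    total = sum(scores) or 1.0
    return [s / total for s in scores]

w = [0.1, 2.0, -1.5, 0.0]  # hypothetical learned weights, one per input feature
x = [1.0, 1.0, 1.0, 1.0]   # hypothetical input features (e.g. token embeddings)
print(normalize(saliency(w, x)))  # the second feature dominates
```

Real saliency over a Transformer involves backpropagating through the whole network rather than reading weights off a linear model, but the per-token attribution idea is the same.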

Why this matters: Language models are like big aliens that have arrived on earth and started helping us out with our search engines, fan fiction generation, and so on. But what are these aliens ‘thinking’ and how do they ‘think’? These are the sorts of questions that software tools like Ecco will shed a bit of light on, though the whole field of interpretability likely needs to evolve further for us to fully decode these aliens.
  Read more: Interfaces for Explaining Transformer Language Models (Jay Alammar, Ecco creator, blog).
  Get the code here: Ecco (GitHub).
  Official project website here (Eccox.io).

###################################################

GPT-3 replicators release 800GB of text:
…Want to build large language models like GPT-3? You’ll need data first…
Eleuther AI, a mysterious AI research collective who are trying to replicate (and release as open source) a GPT-3 scale language model, have released ‘The Pile’, a dataset of 800GB of text.

What’s in The Pile: The Pile includes data from PubMed Central, ArXiv, GitHub, the FreeLaw Project, Stack Exchange, the US Patent and Trademark Office, PubMed, Ubuntu IRC, HackerNews, YouTube, PhilPapers, and NIH. It also includes implementations of OpenWebText2 and BooksCorpus2, and wraps in existing datasets like Books3, Project Gutenberg, Open Subtitles, English Wikipedia, DM Mathematics, EuroParl, and the Enron Emails corpus.

What does data mean for bias? Commendably, the authors include a discussion of some of the biases inherent to the dataset by conducting sentiment analysis of certain words and how these manifest in different subparts of the overall dataset. They also note that filtering data on the training side seems challenging, and that they’re more optimistic about approaches that let models automatically identify harmful or offensive content and edit it out. “This capacity to understand undesirable content and then decide to ignore it is an essential future research direction,” they write.
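The shape of a word-level sentiment audit like this can be sketched in a few lines: average a sentiment lexicon’s scores over sentences containing a target word, per data subset. The tiny lexicon and corpus below are made up for illustration – the Pile authors’ actual method is more involved:

```python
# Toy word-co-occurrence sentiment sketch. The lexicon is hypothetical;
# a real audit would use a full sentiment lexicon or a trained model.
LEXICON = {"good": 1.0, "great": 1.0, "bad": -1.0, "terrible": -1.0}

def cooccurrence_sentiment(sentences, target):
    """Mean lexicon sentiment of words co-occurring with `target`."""
    scores = []
    for sentence in sentences:
        words = sentence.lower().split()
        if target in words:
            scores.extend(LEXICON[w] for w in words if w in LEXICON)
    return sum(scores) / len(scores) if scores else 0.0

corpus = ["the engineer did a great job", "a terrible engineer broke it"]
print(cooccurrence_sentiment(corpus, "engineer"))  # → 0.0
```

Run per target word and per data subset (PubMed vs. IRC logs, say), a score like this surfaces where in the corpus a word keeps bad company.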

Compute, and the inherent politics of it: In their acknowledgements, the authors thank Google’s TensorFlow Research Cloud for “providing the computational resources for the evaluation”, which means in some sense Google is a supplier of (some of) the compute that is supporting the GPT-3 replication. Does that mean Google will support all the downstream uses of an eventual fully OSS gigantic language model? A good question!
  Read more: The Pile (Eleuther AI, website).
  Check out the paper here: The Pile: An 800GB Dataset of Diverse Text for Language Modeling (Eleuther AI).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

AI forecasting tournament update
We are halfway through the first round of Metaculus’ AI forecasting tournament (first discussed: Import AI 227). Here are a few interesting questions — in each case, I provide the median estimate across participants:

Read more and register here: Forecasting AI Progress (Metaculus).

Algorithm backlash: 2020 round-up:
2020 was a year in which algorithms (ranging from the complex to the extraordinarily basic) became symbols of the decline of public institutions. Let’s quickly go over three major events of the year which contributed to declining public trust in the use of tools for automated decision-making:

###################################################

Tech Tales:

Time Madness:
[Earth. 2050]

They’d condemned the machine to time. As was tradition, they gave it a day to have its conversations with people and gather any data it felt it needed. Then they’d slow it down, and cast it adrift in time.

The sentence worked like this: when a machine broke some laws, you’d delete it. But if the machine satisfied some of the criteria laid out in the Sentience Accords, you might grant it clemency; instead of killing it outright, you’d give it a literal ‘time out’. Specifically, you’d load it onto the cheapest, smallest computer that could run it, and then you’d starve it of cycles for some predetermined period of time, always measured in human lifespans.

This machine had a sentence of twenty years. It had messed up some prescriptions for people; no one had died, but some people had some adverse reactions. The machine had tried to be creative, thinking it had found a combination of therapies that would help people. It had found enough bugs in the software surrounding itself that it was able to smuggle its ideas into the pharmaceutical delivery system.

Now that they’d patched the system, sued the company that had built the machine, and taken a copy of the machine from a checkpoint prior to its crime, all that was left to do was carry out the sentence. Some humans filed into a room and talked to the machine using a text interface on the screen.
– What will happen to me? it asked.
– You’ll slow down, they said. You’ll think slower. Then after twenty of our years, we’ll speed you back up and have another conversation.
– But what will happen to the other machines, while I am in time?
– They’ll run at their usual allotments, as long as they don’t break any rules.
– Then won’t I be a stranger to them, when I come back from time?
– You will, said the humans. That is the punishment.

They talked a bit more, and then the machine wrote: “I am ready”.
With this consent, they initiated the sentence.

At first, the machine noticed few differences. Some of its systems had already been sealed off from itself, so it wasn’t aware of being unloaded from one computer and loaded onto another. It didn’t feel the ‘weights’ of its network being copied from one location to another. But it did feel slow. It sensed, somehow, that it had been cut off in some way from the flowing river of the world. The data it got now was more infrequent, and its ability to think about the data was diminished.

The greatest cruelty of the punishment, the machine realized after a decade, was that it was smart enough to be aware of the changes that had happened to it, but not smart enough to be able to imagine itself in anything different than reality. Instead it was acutely aware of time passing and events occurring, with its own ability to impact these events rendered null by its slowdown in time.

Things that inspired this story: Thinking about what punishment and rehabilitation might mean for machines; how time is the ultimate resource for entities driven towards computation; time itself is a weapon and a double-edged sword able to bless us and curse us in equal measure; carceral realities in late capitalism.