Import AI

Import AI 280: Why bigger is worse for RL; AI-generated Pokemon; real-world EfficientNet

Use an AI to generate a Pokemon in two (2!) clicks:
Here’s a fun Colab notebook from Max Woolf (@minimaxir) that lets you use AI to dream up some Pokemon in a couple of clicks (and with a few minutes of waiting). This isn’t remarkable – in recent years, AI generation stuff has got pretty good. What is remarkable is the usability. Two clicks! A few years ago you’d need to do all kinds of bullshit to get this to work – download some models on GitHub, get it to run in your local environment, make sure your versions of TF or PyTorch are compatible, etc. Now you just click some buttons and a load of stuff happens in the browser then, kabam, hallucinated pokemon.

Things that make you go ‘hmmm’: This tech is based on ruDALL-E, an open source Russian version of OpenAI’s ‘DALL-E’ network.
  I think we’ve all rapidly got used to this. This is not normal! It is surprising and exciting!
  Check out the Colab notebook here (Google Colab).
  Follow Max on Twitter here and thank him for making this cool tool!

####################################################

Uh-oh: The bigger your RL model, the more likely it is to seek proxy rather than real rewards:
…Think RL gets better as you scale up models? Hahahah! NOT AT ALL!…
In the past couple of years, big models have become really useful for things ranging from text processing to computer vision to, more recently, reinforcement learning. But these models have a common problem – as you scale up the size of the model, its good capabilities get better, but so do its bad ones.
  For example, if you increase the size of a language model, it’ll generate more toxic text (rather than less), without interventions (see: A General Language Assistant as a Laboratory for Alignment). New research from Caltech and UC Berkeley shows how this same phenomenon shows up in reinforcement learning agents, as well. In tests across a few distinct RL domains, they find that “As model size increases, the proxy reward increases but the true reward decreases. This suggests that reward designers will likely need to take greater care to specify reward functions accurately and is especially salient given the recent trends towards larger and larger models”.

What they did: They tested out a few different reinforcement learning agents on four different environments – an Atari game called Riverraid, a glucose monitoring system, a traffic control simulation, and a COVID model where the RL agent dials up and down social distancing measures. In all cases they found that the “model’s optimization power often hurts performance on the true reward”.

What can we do? Most of this behavior relates to objective design – give an AI the wrong objective function, and it’ll optimize its way to success there, while ignoring side effects (e.g., if you reward an AI for reducing the rate of defects on a factory production line to zero, it might just work out how to stop the factory line and therefore eliminate all defects – along with your business). One way to detect this is to have a baseline policy that humans have verified as having the right goal, then build some software to spot deltas between the RL policy and the idealized baseline policy.
  This kind of works – in tests, the detectors get anywhere between 45% and 81% accuracy at separating anomalous from non-anomalous behaviors. But it certainly doesn’t work well enough to make it easy to deploy this stuff confidently. “Our results show that trend extrapolation alone is not enough to ensure the safety of ML systems,” they write. “To complement trend extrapolation, we need better interpretability methods to identify emergent model behaviors early on, before they dominate performance”.
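
For intuition, here’s a minimal sketch of that baseline-comparison idea – flag states where the trained policy’s action distribution diverges sharply from the trusted one. The policy interfaces, the KL metric, and the threshold are illustrative assumptions, not the detectors used in the paper:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete action distributions."""
    p, q = np.asarray(p, dtype=float) + eps, np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def flag_anomalous_states(states, trusted_policy, learned_policy, threshold=0.5):
    """Flag states where the learned policy's action distribution diverges
    sharply from a human-verified baseline policy. Both policies are assumed
    to be callables returning a probability distribution over discrete actions."""
    return [kl_divergence(learned_policy(s), trusted_policy(s)) > threshold
            for s in states]

# Hypothetical usage: a uniform trusted baseline versus a policy that has
# collapsed onto a single (possibly proxy-reward-hacking) action.
baseline = lambda state: [0.25, 0.25, 0.25, 0.25]
suspect = lambda state: [0.97, 0.01, 0.01, 0.01]
print(flag_anomalous_states(["s0", "s1"], baseline, suspect))  # [True, True]
```
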
  Read more: The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models (arXiv).

####################################################

SCROLLS: A new way to test how well AI systems can understand big chunks of text:
…Now that AIs can write short stories, can we get them to understand books?…
Researchers with Tel-Aviv University, the Allen Institute for AI, IBM Research, and Meta AI have built ‘SCROLLS’, a way to test how well AI systems can reason about long texts. SCROLLS incorporates tasks ranging from summarization, to question answering, and natural language inference, as well as multiple distinct domains including transcripts, TV shows, and scientific articles. “Our experiments indicate that SCROLLS poses a formidable challenge for these models, leaving much room for the research community to improve upon,” the authors write.

How SCROLLS works: This benchmark has mostly been created via curation, consisting of 7 datasets that reward models which can contextualize across different sections of their inputs and process long-range dependencies.

The datasets: SCROLLS incorporates GovReport (summarization of reports addressing various national policy issues), SummScreenFD (summarization of TV shows, like Game of Thrones), QMSum (summarization of meeting transcripts), Qasper (question answering over NLP papers), NarrativeQA (question answering about entire books from Project Gutenberg), QuALITY (multiple choice question answering about stories from Project Gutenberg), and Contract NLI (natural language inference dataset in the legal domain).

How hard is SCROLLS? The authors test out two smart baselines (BART, and a Longformer Encoder-Decoder (LED)), and one dumb baseline (a basic pre-written heuristic). Based on the results, this seems like a really challenging task – a LED baseline with a 16384-token input length gets okay results, though BART gets close to it despite being limited to 1,024 tokens. This suggests two things: a) BART is nicely optimized, and b) it’s not entirely clear the tasks in SCROLLS truly test for long-context reasoning. “Our experiments highlight the importance of measuring not only whether an architecture can efficiently process a long language sequence, but also whether it can effectively model long-range dependencies,” they write.
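
To make that comparison concrete, here’s a rough sketch of running a short-context versus a long-context baseline over the same document with HuggingFace Transformers – the checkpoints, truncation limits, and generation settings below are illustrative stand-ins, not the paper’s exact configuration:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Two baselines in the spirit of the paper: a short-context model that only
# sees the first 1,024 tokens, and a long-context model that can attend over
# up to 16,384 tokens of the same document.
CONFIGS = {
    "short-context (BART-style)": ("facebook/bart-large-cnn", 1024),
    "long-context (LED-style)": ("allenai/led-base-16384", 16384),
}

def summarize(model_name, max_input_tokens, document):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    # Truncation is where the two baselines differ: the short-context model
    # silently drops everything past its input limit.
    inputs = tokenizer(document, truncation=True,
                       max_length=max_input_tokens, return_tensors="pt")
    summary_ids = model.generate(**inputs, max_length=128, num_beams=4)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

long_document = open("gov_report_example.txt").read()  # hypothetical input file
for name, (model_name, limit) in CONFIGS.items():
    print(name, "->", summarize(model_name, limit, long_document))
```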

Why this matters: “Contemporary, off-the-shelf models struggle with these tasks”, the researchers write. In recent years, many machine learning benchmarks have been saturated within months of being released; how valuable SCROLLS turns out to be will be a combination of its hardness and its longevity. If SCROLLS gets solved soon, that’d indicate that AI systems are getting much better at reasoning about long-range information – or it could mean the SCROLLS tasks are bugged and the AI systems have found a hack to get a decent score. Pay attention to the SCROLLS leaderboard to watch progress here.
  Read more: SCROLLS: Standardized CompaRison Over Long Language Sequences (arXiv).
  Check out the leaderboard here.

####################################################

EfficientNet: Surprisingly good for solar panel identification:
…UC Berkeley project shows how easy fine-tuning is…
Some UC Berkeley researchers have built a small, efficient model for detecting solar panels. Their system, HyperionSolarNet, is an EfficientNet-B7 model finetuned from ImageNet onto a collection of 1,983 satellite images of buildings, labeled with whether they have solar panels or not. The resulting model gets an aggregate precision of 0.96 (though with lower accuracy for labeling the presence of a solar panel, indicating a propensity for false positives) when evaluated on a held-out test set.
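
Fine-tuning like this is now only a few lines of code. Here’s a minimal sketch using torchvision’s EfficientNet-B7 – the frozen backbone, learning rate, and two-class head are illustrative assumptions, not HyperionSolarNet’s actual training recipe:

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from ImageNet weights and swap the classification head for a binary
# solar-panel / no-solar-panel output.
model = models.efficientnet_b7(pretrained=True)
model.classifier[1] = nn.Linear(model.classifier[1].in_features, 2)

# Optionally freeze the backbone and only train the new head at first; with
# ~2,000 labeled images, full fine-tuning can overfit quickly.
for param in model.features.parameters():
    param.requires_grad = False

optimizer = torch.optim.Adam(
    filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
criterion = nn.CrossEntropyLoss()

def training_step(images, labels):
    """One fine-tuning step; `images` is an (N, 3, H, W) float tensor."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```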

Why this matters: Last week, we wrote about how you can build a classifier from scratch and beat a finetuning approach. This paper shows that finetuning can also work quite well for specific use-cases. It also, implicitly, highlights how fine-tuning has gone from something of an arcane science to something pretty reliable and well understood, forecasting a future where there are as many classifiers in the world as there are things to classify.
  Read more: HyperionSolarNet: Solar Panel Detection from Aerial Images (arXiv).

####################################################

Tech Tales:
The Last Things
[A morgue in Detroit, 2035]

“When someone dies and gasps, are they just trying to get the last gasp of being alive?” asked the robot.

The morgue manager stared at the corpse, then at the robot. “I don’t know,” he said. “That’s a good question”.

“And when they know they are going to die, how do they save their information?” asked the robot.

“For example, I would send a zip of my stored data, as well as a copy of my cortical model, to a repository, if I knew I was about to be decommissioned or was in danger,” said the robot.

“Most people don’t bother,” said the morgue manager. “My mother, for instance. When she was dying I asked her to write down some of her memories for me and my family, but she didn’t want to.”

“Why?”

“I think she was mostly concerned with experiencing her life, since she knew it was ending. She took trips while she was still mobile. Then, towards the end, she focused on eating her favorite foods and seeing her friends.”

“And did you learn anything about life from seeing her die?” asked the robot.

“Not particularly,” said the morgue manager. “Besides that life seems to become more valuable, the less you know you have of it.”

Things that inspired this story: A long conversation with someone who worked as a crisis therapist about the nature of death and belief; thinking about the differences between how real and synthetic intelligences may approach the concept of death.

Import AI 279: Baidu adds knowledge to a language model; US military + AI; how China thinks about AI governance

Happy New Year! I took the end of 2021 off to think, read, relax, and eat. I hope readers found some time to do the same. I expect I’m going to change some things up around Import AI this year – it’s going to get weirder, more specific, and hopefully more valuable! I’m also going to finesse the short story collection I’ve been putting together, based on the tech tales in this newsletter. Good luck to all readers for their own 2022 plans – we’ll go on this journey together!

#############################

Here’s how to build GPT-3 in the open:
…What’s it like replicating GPT-3? It’s extremely difficult!…
BigScience, an initiative to train a GPT-3-scale model on a public supercomputer, is currently trying to train a 104B model. Training models at this scale is something of an artisanal science, with lots of researchers working from hard-won rules of thumb in tandem with things like scaling laws. Here’s a nice ‘lessons learned’ writeup from BigScience on the challenges it has faced in training 13B and 104B-scale models so far.
  Read more: Lessons learned (BigScience, GitHub).

####################################################

Baidu shows how to inject more knowledge into a language model:
…ERNIE 3.0 shows how to teach a big neural net to use an external knowledge base…
Baidu has developed ERNIE 3.0, an AI model that can use an external knowledge base to help it provide more accurate answers. Last year, an ERNIE 3.0 model won the highly competitive SuperGLUE challenge (Import AI 259). The special thing about ERNIE is that it fuses a big GPT-3-esque language model with a large external knowledge base.

Massive scale: Baidu has also developed ERNIE 3.0 ‘Titan’, a 260 billion parameter model that, Baidu says, “is the largest Chinese dense pre-training model as far as we know”. In tests, ERNIE 3.0 Titan gets state-of-the-art results on a vast set of benchmarks that evaluate skills as diverse as question answering, text generation, text summarization, interpretation, and dialogue.

Novel, heterogeneous chip cluster: Another interesting thing about this paper is the chips the model is trained on – Nvidia V100s and Huawei ‘Ascend’ processors. It’s quite unusual to see hybrid training of this form, and it seems like Baidu felt it was interesting enough to invest some engineering resources in making it possible – the company augmented its ‘PaddlePaddle’ AI framework with “distributed training technology, including fine-grained parallelism, heterogeneous hardware-aware training, and fault tolerance mechanism to train the 260B model on both Nvidia V100 GPU and Ascend 910 NPU clusters.”

Why this matters: Most people seem to act like GPT-3 models are exclusively being developed by a small set of Western actors, most of whom get tarred with the pejorative ‘big tech’ brush. But papers like this show that GPT-3 models are a global phenomenon. We should remember that the world we live in is going to be increasingly defined by different cultures expressing themselves through increasingly large, sophisticated AI models.
  Read more: ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation (arXiv).

####################################################

Why smaller can be smarter for real-world AI (here: computer vision for quality control on solar panels):
…When 1 million parameters can beat 100 million parameters…
The past few years of AI have been distinguished by the ‘bigger is better’ phenomenon, as companies develop ever-larger models that consume ever-larger amounts of compute and data. Now, a paper from researchers with Friedrich-Alexander University Erlangen-Nuremberg in Germany reminds us that bigger isn’t always better – especially when it comes to real-world, applied AI. In this paper, they compare different approaches to building an image classifier that can spot defects in solar panels.

What they did: They trained a simple 8-layer convolutional neural net on a dataset of 4341 original, labeled images from a solar plant. The ~4000 images were each labeled with one of eight classes (e.g., ‘good’, ‘crack’, ‘splinter’, et cetera). They then applied a significant amount of data augmentation to enhance the size of the dataset.
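
Here’s a rough sketch of what a small, augmentation-heavy setup like this can look like in PyTorch – the transforms and layer sizes are illustrative, not the paper’s exact architecture:

```python
import torch.nn as nn
from torchvision import transforms

# Heavy augmentation to stretch ~4,000 labeled images further; the exact
# transforms are illustrative, not the paper's recipe.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

# A deliberately small convolutional classifier in the spirit of the paper's
# 8-layer network, ending in the eight defect classes.
small_defect_net = nn.Sequential(
    conv_block(3, 16),
    conv_block(16, 32),
    conv_block(32, 64),
    conv_block(64, 128),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 8),
)
```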

How well did it do? Their custom, simple network outperformed a VGG-architecture model pre-trained on the vast ImageNet dataset. This is interesting, because a common practice in AI research is to finetune domain-specific classifiers from generic ones based on ImageNet. Here, we find that their system gets better precision (0.971 versus 0.990), while having 100X fewer parameters (1,707,208 versus 138,357,544) and being significantly smaller in terms of memory footprint (~16MB versus 800MB). All this nets out to a network that is smaller, as well as more performant (inference of 0.50ms, versus 9.13ms).

Why this matters: Papers like this remind us that a little bit of thoughtful engineering goes a long way in AI, and we should bear in mind that while increasingly large networks are interesting, they’re not the only game in town when it comes to building things that have real economic value. “We expect that the following years will demand for more research on edge analytics. This means that more research will be needed on small, yet powerful artificial neural networks for industry cases”, they write.
  Read more: A Light in the Dark: Deep Learning Practices for Industrial Computer Vision (arXiv).

####################################################

What’s the US military going to do about AI? The NDAA holds a clue.
…AI education! Procurement! Data storage! And more…
Every year, the somewhat dysfunctional US Congress manages to pass a bill – the National Defense Authorization Act. This bill (which weighs in at around $800bn in annual outlay) is the thing that funds the US military. Therefore, the NDAA has become one of the main pieces of legislation to look at when trying to understand how the US military thinks about – and will work on – frontier technologies like AI. An analysis of the latest NDAA from Stanford’s ‘HAI’ center gives us a sense of what’s happening in AI and the US military.

What the NDAA says is going to happen: Some highlights from this year’s NDAA include:
– The DoD is going to trial different ways of procuring AI technology
– The DoD will create ‘executive education activities’ to help senior officials understand AI.
– The DoD will do a comparative analysis of the US and China’s efforts to deploy things relating to directed energy systems, hypersonics, cyberspace, and other frontier areas
– The DoD will create an assessment of the “current and emerging offensive and defensive cyber posture of U.S. adversaries”
– The DoD will build infrastructure to “support state-of-the-art tools and modern processes to enable the testing of AI capabilities”.
– “Evaluate the feasibility and advisability of creating DOD data repositories, available to public and private entities, to facilitate the development of AI capabilities.”

Why this matters: The US military is a lot like a supertanker – it’s slow, huge, and unwieldy. But once it starts to turn, boy does it turn! This NDAA analysis shows us the DoD is beginning to turn its attention and significant resources towards AI, which will have significant downstream implications for the nature of conflict and the way that future wars are conducted (and, eventually, learned).
  Read more: Summary of AI Provisions from the National Defense Authorization Act 2022 (Stanford HAI blog).

####################################################

What is China going to do about AI governance?
…China might do more ambitious tech regulations than the West…
Here’s a nice summary from the Carnegie Endowment for International Peace about what three prominent Chinese policy organizations are doing with regard to AI governance.

Cyberspace Administration of China (CAC): Last year, it released 30 rules for regulating internet recommendation algorithms, and also developed a three-year roadmap for governing other complex algorithms deployed at internet scale. This would be analogous to a Western government publishing a list of specific rules for regulating, for example, Facebook’s recommendation engine. Ambitious!

China Academy of Information and Communications Technology (CAICT): This organization released a whitepaper on trustworthy AI – this is mostly notable because it’s in-distribution with what major regulators in other geographies are thinking about.

Ministry of Science and Technology (MOST): This organization released some guidelines for universities and companies on internal reviews around ethics issues relating to technology, as well as a fairly high-level description of some ethical norms for AI development.

Why this matters: “The potential impact of these regulatory currents extends far beyond China. If the CAC follows through on certain requirements for algorithmic transparency and explainability, China will be running some of the world’s largest regulatory experiments on topics that European regulators have long debated,” Matt Sheehan of Carnegie writes. Running regulatory experiments is a big deal – Western governments did a tiny bit of this after the great financial crisis in 08/09, but have done relatively little about technology governance. I think China has a good chance of defining what ambitious, applied tech regulation looks like.
  Read more: China’s New AI Governance Initiatives Shouldn’t Be Ignored (Carnegie Endowment for International Peace).

####################################################

The Last Tower Defense Fighter

[Historical analysis written in 2080 and stored in the archives at Iceland, at the Orbital Archive, and in the hardened repositories on Moon4 and Mars1.]

Back in the late 2020s there were a bunch of tower defense games that got pretty big. They always worked in the same way: you, the player, can see a landscape from overhead, and you need to place various weapons around it. Meanwhile, the enemies make their way across the landscape, following loosely described paths across a variety of different scenes – narrow trenches dug between mountains, wide roads across countryside, right-angled streets in urban centers.

With these games, you get a high score in relation to how many enemies you kill, and if any of the enemies get to the ‘end’ of a course (usually, the bottom of a screen), you lose – the implication is that you die. 

Anyway, in around 2028 one of the big games built some add-ins for its league. Now, if you were one of the players in the elite-tier of the game, you’d get the opportunity to play in matches where there were cash prizes – these matches were advertised as being extraordinarily difficult, with more enemies on screen than in the normal game, larger and more complex maps, and sometimes the enemies were able to use powerups that meant they could attack your own towers and take them down. 

It was a sensation. Everyone wanted to play the game within a game. Kids all around the world streamed themselves playing the game for hours, as they tried to get good enough to have a shot at entering the league within the league. 

By the end of 2028, streams of league players were pulling in millions of concurrent viewers. A whole industry formed where people commentated about the games. Sometimes people overcame great odds and won – then they’d publish videos of themselves with their cash prizes and what they spent them on; sports cars, fine dining, nice hotels, and all the usual tchotchkes of people who come into some fast money.

In 2029, there was a leak out of the Department of Defense that pulled the cover off. It turned out this game was actually a stealth DoD project. The normal games were helping the DoD train various strategic AI systems, which it used in planning for logistics and munitions placement during conflict. No one was very surprised by this. Back in that decade, most things that got big on the internet were fronts. 

What did surprise people was the leak about the cash league – the cash league was real. Real in the sense that the ‘monsters’ in the game were real – they were real people that the United States happened to be fighting. Whenever someone was playing the game, their commands were being converted to a different, real-world environment, where they marshalled the combined munitions of drones, sniper teams, artillery, tanks, jets, and all the other machinery of the military. And when their towers were destroyed, real Americans were dying – blown up by grenades or IEDs or RPGs, or any of the other ways people killed each other, back then.

Of course, there was an outcry. Player numbers dipped for a while. But the number of spectators increased. And the US military, having struggled publicly for years with backwards technology and difficulty in recruitment, doubled down.
  “We need these people to protect our country,” the Pentagon said, one day. “Without the next generation, we’ll lose the next generation”.

A few years later, the enemies of the US followed in its footsteps. There were games where you stole people. Games where you had to try and find a spy moving through a bustling, crowded urban area. Games where you had to execute someone and then exfiltrate the footage of their execution to a friendly intermediary.

What inspired this story: Tower defense games like Bloons and Kingdom Rush; domain randomization; the remorseless logic of multi-country non-hot military conflict; the Last Starfighter (movie); fine-tuning; pre-training and few-shot adaptation; propaganda and the need to present the most dangerous beliefs via play or theatre or anything else that can elide and delight. 

Import AI 278: Can we ever trust an AI?; what the future of semiconductors looks like; better images of AI

Writing a blog about AI? Use these images:
…No more galaxy brain!…
Here’s a cool project: Better Images of AI, a project to create CC-licensed stock images that journalists and others can use to give people a more accurate sense of AI and how it works. “Together we can increase public understanding and enable more meaningful conversation around this increasingly influential technology,” says the website.
  Check out the gallery (Better Images of AI).

####################################################

Deepfake company raises $50m in Series B round:
…Synthetic video company Synthesia…
Synthetic video startup Synthesia has raised $50m. Remember, a few years ago we could barely create crappy 32x32 pixelated images using GANs. Now, there are companies like these making production-quality videos using fake video avatars with synthetic voices, able to speak in ~50 languages. “Say goodbye to cameras, microphones and actors!” says the copy on the company’s website. The company will use the money to continue with its core R&D, building what the founder terms the “next generation of our AI video technology w/ emotions & body language control.” It’s also going to build a studio in London to “capture detailed 3D human data at scale.”

Why this matters: The world is filling up with synthetic content. It’s being made for a whole bunch of reasons, ranging from propaganda, to advertising, to creating educational materials. There’s also a whole bunch of people doing it, ranging from individual hobbyists, to researchers, to companies. The trend is clear: in ten years, our reality will be perfectly intermingled with a synthetic reality, built by people according to economic (and other) incentives.
  Read the twitter thread from Synthesia CEO here (Twitter).
  Read more: Synthesia raises $50M to leverage synthetic avatars for corporate training and more (TechCrunch).

####################################################

Do language models dream of language models?
…A Google researcher tries to work out if big LMs are smart – their conclusions may surprise you…
A Google researcher is grappling with the question of whether large language models (e.g, Google’s LaMDA), understand language and have some level of sentience. In an entertaining blog post, he wrestles with this question, interspersing the post with conversations with a LaMDA agent. Some of his conclusions are that the model is essentially bullshitting – but the paradox is we trained it to give a convincing facsimile of understanding us, so perhaps bullshitting is logical?

Do language models matter? I get the feeling that the author thinks language models might be on the path to intelligence. “Complex sequence learning may be the key that unlocks all the rest,” they write. “Large language models illustrate for the first time the way language understanding and intelligence can be dissociated from all the embodied and emotional characteristics we share with each other and with many other animals.”

Why this matters: I think large language models, like GPT-3 or LaMDA, are like extremely dumb brains in jars with really thick glass – they display some symptoms of cognition and are capable of surprising us, but communicating with them feels like talking to something with a hard barrier in-between us and it, and sometimes it’ll do something so dumb you remember it’s a dumb brain in a weird jar, rather than a precursor to something super smart. But the fact that we’re here in 2021 is pretty amazing, right? We’ve come a long way from Eliza, don’t you think?
  Read more: Do large language models understand us? (Blaise Aguera y Arcas, Medium).

####################################################

What the frontier of safety looks like – get AIs to tell us when they’re doing things we don’t expect:
…ARC’s first paper tackles the problem of ‘Eliciting Latent Knowledge’ (ELK)…
Here’s a new report from ARC, an AI safety organization founded this year by Paul Christiano (formerly of OpenAI). The report is on the topic of ‘Eliciting latent knowledge: How to tell if your eyes deceive you’, and it tackles the problem of building AI systems which we can trust, even if they do stuff way more complicated than what a human can understand.

What the problem is: “Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us,” ARC writes. “But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad. In these cases, the prediction model ‘knows’ facts (like ‘the camera was tampered with’) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?”

Why this matters: Problems like ELK aren’t going to be solved immediately, but they’re sufficiently complicated and broad that if we come up with approaches that help us make progress on ELK, we’ll probably be able to put these techniques to work in building far more reliable, powerful AI systems.
  Read more: ARC’s first technical report: Eliciting Latent Knowledge (Alignment Forum).

####################################################

Check out the future of semiconductors via HotChips:
…After a decade of homogeneity, the future is all about heterogeneous compute training common AI models…
What do NVIDIA, Facebook, Amazon, and Google all have in common? They all gave presentations at the premier semiconductor get-together, Hot Chips. The Hot Chips 33 site has just been updated with copies of the presentations and sometimes videos of the talks, so take a look if you want to better understand how the tech giants are thinking about the future of chips.

Some Hot Chips highlights: Facebook talks about its vast recommendation models and their associated infrastructure (PDF); Google talks about how it is training massive models on TPUs (PDF); IBM talks about its ‘Z’ processor chip (PDF); and Skydio talks about how it has made a smart and semi-autonomous drone (PDF).

Why this matters: One side-effect of the AI revolution has been a vast increase in the demand by AI models for increasingly large amounts of fast, cheap compute. Though companies like NVIDIA have done a stellar job of converting GPUs to work well for the sorts of parallel computation required by deep learning, there are more gains to be had from creating specialized architectures.
  Right now, the story seems to be that all the major tech companies are building out their own distinct compute ‘stacks’ which use custom inference and training accelerators and increasingly baroque software for training large models. One of the surprising things is that all this heterogeneity is happening while these companies train increasingly similar neural nets to one another. Over the next few years, I expect the investments being made by these tech giants will yield some high-performing, non-standard compute substrates to support the next phase of the AI boom.
  Check out the Hot Chips 33 presentations here (Hot Chips site).

####################################################

Tech Tales:

Noah’s Probe
[Christmas Day, ~2080]

Humans tended to be either incompetent or murderous, depending on the length of the journey and the complexity of the equipment.

Machines, however, tended to disappear. Probes would just stop reporting after a couple of decades. Analysis said the chance of failures wasn’t high enough to justify the number of disappeared probes. So, we figured, the machines were starting to decide to do something different to what we asked them to.

Human and machine hybrids were typically more successful than either lifeform alone, but they still had problems; sometimes, the humans would become paranoid and destroy the machines (and therefore destroy themselves). Other times, the computers would become paranoid and destroy the humans – or worse; there are records of probes full of people in storage which then went off the grid. Who knows where they are now.

So that’s why we’re launching the so-called Noah’s Probes. This series of ships tries to fuse human, animal, and machine intelligence into single systems. We’ve incorporated some of the latest in mind imaging techniques to encode some of the intuitions of bats and owls into the ocular sensing systems; humans, elephants, whales, and orangutans for the mind; octopi and hawks for navigation; various insects and arachnids for hull integrity analysis, and so on.

Like all things in the history of space, the greatest controversy with Noah’s Probes relates to language. Back when it was just humans, the Americans and the Russians had enough conflict that they just decided to make both their languages the ‘official’ language of space. That’s not as easy to do with hybrid minds, like the creatures on these probes. 

Because we have no idea what will work and what won’t, we’ve done something that our successors might find distasteful, but we think is a viable strategy: each probe has a device that all the intelligences aboard can access. The device can output a variety of wavelengths of energy across the light spectrum, as well as giving access to a small sphere of reconfigurable matter that can be used to create complex shapes and basic machines.

Our hope is, somewhere out in that great darkness, some of the minds adrift on these probes will find ways to communicate with each other, and become more than the sum of their parts. Our ancestors believed that we were once visited by angels who communicated with humans, and in doing so helped us humans be better than we otherwise would’ve been. Perhaps some of these probes will repeat this phenomenon, and create something greater than the sum of its parts.

Things that inspired this story: Peter Watts’ Blindsight; Christmas; old stories about angels and aliens across different religions/cultures; synesthesia; multi-agent learning; unsupervised learning.

Import AI 277: DeepMind builds a GPT-3 model; Catalan GLUE; FTC plans AI regs

FTC plans AI regulation:
…FTC brings on three AI Now people as advisors, now turns attention to algorithmic regulation…
The Federal Trade Commission announced Friday that it is considering using its rulemaking authority “to curb lax security practices, limit privacy abuses, and ensure that algorithmic decision-making does not result in unlawful discrimination,” according to the Electronic Privacy Information Center (EPIC). The announcement follows the FTC bringing on three people from AI Now, including Meredith Whittaker, as advisors on AI (Import AI #275).
  Read more: FTC Signals It May Conduct Privacy, AI, & Civil Rights Rulemaking (EPIC).
  Read the FTC language at RegInfo.

####################################################

Google thinks sparsity might be the route to training bigger and more efficient GPT-3 models:
…GLaM shows that mixture of experts models keep getting better…
Google has built GLaM, a 1.2 trillion parameter mixture-of-experts model. GLaM is a big language model, like GPT-3, but with a twist: it’s sparse; MoE networks are actually a bunch of distinct networks all connected together, and when you run inference only a few sub-networks activate. This means that the parameter count in a sparse vs dense network isn’t really comparable (so you shouldn’t think 1.2 trillion MoE = ~6X larger than GPT-3).

Why MoE is efficient: “The experts in each layer are controlled by a gating network that activates experts based on the input data. For each token (generally a word or part of a word), the gating network selects the two most appropriate experts to process the data. The full version of GLaM has 1.2T total parameters across 64 experts per MoE layer with 32 MoE layers in total, but only activates a subnetwork of 97B (8% of 1.2T) parameters per token prediction during inference.”
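
Here’s a toy sketch of that top-2 routing idea in PyTorch – the dimensions are illustrative, and the dense loop over experts stands in for the capacity-limited, all-to-all dispatch that production MoE implementations use:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Toy sparsely-gated mixture-of-experts layer: a gating network scores
    every expert for each token, but only the top-2 experts actually run."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=64):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)])

    def forward(self, tokens):                    # tokens: (n_tokens, d_model)
        scores = self.gate(tokens)                # (n_tokens, n_experts)
        top2_scores, top2_idx = scores.topk(2, dim=-1)
        weights = F.softmax(top2_scores, dim=-1)  # mixing weights for the 2 picks
        out = torch.zeros_like(tokens)
        for slot in range(2):                     # dense loop = toy dispatch
            for expert_id in top2_idx[:, slot].unique().tolist():
                mask = top2_idx[:, slot] == expert_id
                out[mask] += weights[mask, slot:slot + 1] * self.experts[expert_id](tokens[mask])
        return out

layer = Top2MoELayer()
print(layer(torch.randn(8, 512)).shape)  # torch.Size([8, 512])
```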

How well does it work: In tests, GLaM exceeds or is on-par with the performance of GPT-3 on 80% of zero-shot tasks and 90% of one-shot tasks. Like DeepMind’s Gopher, part of the improved performance comes from the size of the dataset – 1.6 trillion tokens, in this case.

Why this matters: For a few years, various Google researchers have been pursuing ‘one model to learn them all’ – that is, a single model that can do a huge number of diverse tasks. Research like GLaM shows that MoE networks might be one route to building such a model.
  Read more: More Efficient In-Context Learning with GLaM (Google blog).

####################################################

DeepMind announces Gopher, a 280 billion parameter language model:
…AI research firm joins the three comma language club…
DeepMind has built Gopher, a 280 billion parameter language model. Gopher is the UK AI research company’s response to GPT-3, and sees DeepMind publicly announce a multi-hundred billion parameter dense model, letting it join a club that also includes companies like Microsoft, Inspur, and Huawei.

What it does: During the research, DeepMind found areas “where increasing the scale of a model continues to boost performance – for example, in areas like reading comprehension, fact-checking, and the identification of toxic language,” the company writes. “We also surface results where model scale does not significantly improve results — for instance, in logical reasoning and common-sense tasks.”

How well it works: Gopher outperforms GPT-3 in a broad range of areas – some of the results likely come from the dataset it was trained on, called MassiveText. MassiveText “contains 2.35 billion documents, or about 10.5 TB of text” (representing about 2.3 trillion tokens), and DeepMind notes that by curating a subset of MassiveText for data quality, it was able to substantially improve performance.

Language models – good, if you handle with care: Along with analysis on bias and other potential impacts of Gopher, DeepMind dedicates a section of the paper to safety: “We believe language models are a powerful tool for the development of safe artificial intelligence, and this is a central motivation of our work,” they write. “However language models risk causing significant harm if used poorly, and the benefits cannot be realised unless the harms are mitigated.”
  Given the above, how can we mitigate some of these harms? “We believe many harms due to LMs may be better addressed downstream, via both technical means (e.g. fine-tuning and monitoring) and sociotechnical means (e.g. multi-stakeholder engagement, controlled or staged release strategies, and establishment of application specific guidelines and benchmarks). Focusing safety and fairness efforts downstream has several benefits.”
  Read the blog post: Language modelling at scale: Gopher, ethical considerations, and retrieval (DeepMind blog).
  Read the paper: Scaling Language Models: Methods, Analysis & Insights from Training Gopher (PDF).

####################################################

Want to evaluate a Catalan language model? Use CLUB:
…You can only build what you can measure…
Researchers with the Barcelona Supercomputing Center have built the Catalan Language Understanding Benchmark (CLUB), a benchmark for evaluating NLP systems inspired by the (English language) GLUE test. The main curation rationale they followed “was to make these datasets both representative of contemporary Catalan language use, as well as directly comparable to similar reference datasets from the General Language Understanding Evaluation (GLUE)”.

What’s in the CLUB? CLUB includes evals for Part-of-Speech Tagging (POS), Named Entity Recognition and Classification (NERC), Catalan textual entailment and text classification, and Extractive Question Answering (which involved work like translating and creating new Catalan datasets – XQuAD-Ca, VilaQuAD and ViquiQuAD).

Why CLUB matters: There’s a phrase in business – ‘you can’t manage what you can’t measure’. CLUB will make it easier for researchers to develop capable Catalan-language systems.
  Read more: The Catalan Language CLUB (arXiv).

####################################################

Deep learning unlocks a math breakthrough:
…The era of Centaur Math cometh…
DeepMind researchers have used an AI system to help mathematicians make two breakthroughs in topology and representation theory. The result provides yet more evidence (following various AlphaFold-inspired projects) that humans+AI systems can discover things that neither could discover on their own.

What they did: The essential idea is quite simple: get a mathematician to come up with a hypothesis for a given function, then build an ML model to estimate that function over a particular distribution of data, then have the mathematician evaluate the result and use their intuition to guide further experimentation. The best part? “The necessary models can be trained within several hours on a machine with a single graphics processing unit”, DeepMind says.
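
Here’s a minimal sketch of that loop – fit a small model to the conjectured relationship, then use gradient-based attribution to see which inputs the model leans on. The data, architecture, and saliency method are illustrative, not DeepMind’s exact setup:

```python
import torch
import torch.nn as nn

# Hypothetical setup: X holds vector representations of mathematical objects
# (e.g. invariants), y holds the quantity the mathematician conjectures is a
# function of them.
X = torch.randn(10_000, 30)
y = torch.randn(10_000, 1)

model = nn.Sequential(nn.Linear(30, 300), nn.ReLU(),
                      nn.Linear(300, 300), nn.ReLU(),
                      nn.Linear(300, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):  # small model, quick to fit
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(X), y)
    loss.backward()
    optimizer.step()

# Attribution step: which input features does the learned function lean on?
# Large average gradient magnitudes point the mathematician toward the
# variables worth reasoning about.
X.requires_grad_(True)
model(X).sum().backward()
saliency = X.grad.abs().mean(dim=0)
print(saliency.topk(5))
```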

Why this matters: We’re entering a world where humans will collaborate with AI systems to synthesize new insights about reality. Though DeepMind’s system has limitations (“it requires the ability to generate large datasets of the representations of objects and for the patterns to be detectable in examples that are calculable,” DeepMind notes), it sketches out what the future of scientific discovery might look like.
  Read the paper: Advancing mathematics by guiding human intuition with AI (Nature, PDF).
  Read more: Exploring the beauty of pure mathematics in novel ways (DeepMind blog).

####################################################

Anthropic bits and pieces:
…(As a reminder, my dayjob is at Anthropic, an artificial intelligence safety and research company)…
We’ve just released our first paper, focused on simple baselines and investigations: A General Language Assistant as a Laboratory for Alignment. You can read it at arXiv here.

####################################################

Tech Tales:

Real and Imagined Gains
[DoD Historical archives, 2040]

They got trained in a pretty cruel way, back then – they’d initialize the agents and place them in a room, and the room had a leak of a poisonous substance that had a certain density and a certain spread pattern. The agents had to work out how not to asphyxiate by doing fairly complicated, intuitively-driven analysis of the environment. If they were able to give a correct guess at the spread pattern (and avoid it) before the room filled up, they moved onto the next stage. If they weren’t able to, they asphyxiated and died – as in, felt their computational budget get cut, got put in cold storage, probably never booted up again.
  (One curious by-product of the then-popular AI techniques was that the agents would sometimes seek to preserve each other – in one case, two agents ‘kissed’ each other so they could more efficiently exchange their air reserves between each other, while the room filled; unfortunately, as their attention was allocated to the act of kissing, they did not complete the requisite calculations in time, and both died.)

Things that inspired this story: Kurt Vonnegut; reinforcement learning; environmental design; moral patienthood.

Import AI 276: Tracking journalists with computer vision; spotting factory defects with AI; and what simulated war might look like

Spotting factory defects using a highly efficient neural net:
…A little bit of optimization leads to multiple 10X improvements for real world deployment…
Soon, factories will embed neural nets onto cameras scanning over production lines, so they can spot defects as they appear. New research from the University of Waterloo and startup Darwin AI shows how to do this more efficiently than before.

What they did: The team built TinyDefectNet, a neural net optimized for the peculiarities of factory deployments – small datasets, highly constrained operational requirements, fast inference. The model was “produced via machine-driven design exploration, possesses a shallow architecture with heterogeneous, lightweight micro- and macro-architecture traits that are well-suited for high-throughput inspection scenarios”. TinyDefectNet gets similar performance to a ResNet-50 baseline, but with 56X fewer parameters, 11X fewer FLOPs, and 7.6X faster inference speed.
  In tests, they trained a model then evaluated it using the ‘NEU-Det’ benchmark dataset, which challenges an AI to spot various types of metallic surface defect, ranging from pitted surfaces, to scratches. Their system gets similar performance to a ResNet, but takes around 2.5 milliseconds per inference, versus 19 milliseconds for a ResNet.
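
If you want to run this kind of latency comparison on your own hardware, here’s a rough sketch – the compact network below is a hypothetical stand-in, not TinyDefectNet itself:

```python
import time
import torch
from torchvision import models

def mean_latency_ms(model, n_runs=100, image_size=224):
    """Rough single-image inference latency for a classification model."""
    model.eval()
    x = torch.randn(1, 3, image_size, image_size)
    with torch.no_grad():
        for _ in range(10):          # warm-up runs
            model(x)
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
    return (time.perf_counter() - start) / n_runs * 1000

# Hypothetical stand-in for a compact defect detector, compared against the
# standard ResNet-50 baseline from torchvision.
tiny_stand_in = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, stride=2), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(), torch.nn.Linear(16, 6))
print("compact net:", mean_latency_ms(tiny_stand_in), "ms")
print("ResNet-50:  ", mean_latency_ms(models.resnet50()), "ms")
```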

Why this matters: Factory production lines can typically run as fast as the slowest component within them. Therefore, if we can use AI to automate places where we’ve previously used lots of (relatively slow) humans doing manual inspection, we can probably increase overall factory throughput.
  Read more: TinyDefectNet: Highly Compact Deep Neural Network Architecture for High-Throughput Manufacturing Visual Quality Inspection (arXiv).

####################################################

Chinese province plans to use AI to track journalists:
…Cameras + AI = eradication of real journalism…
One of the silent revolutions enabled by the past decade of AI progress is a step-change improvement in the ability of nations to surveil their citizens. Now, per reporting from Reuters, one Chinese province plans to use AI techniques to target journalists and foreign students.
  “A July 29 tender document published on the Henan provincial government’s procurement website – reported in the media for the first time – details plans for a system that can compile individual files on such persons of interest coming to Henan using 3,000 facial recognition cameras that connect to various national and regional databases”, Reuters reports.

Why this matters: Reuters reporting doesn’t mention it, but I’d put a sizeable bet on the idea this system will pair facial recognition with pedestrian re-identification to allow authorities to track journalists and students as they move through cities, providing unsupervised tracking and identification. This capability ultimately makes it much more challenging for journalists to do reporting that is critical of the Chinese state, as systems like this can effectively de-anonymize their sources (and also frighten the sources so they don’t talk to journalists in the first place).
  Read more: EXCLUSIVE Chinese province targets journalists, foreign students with planned new surveillance system (Reuters).

####################################################

Can we make neural architecture search efficient? Alibaba thinks so:
…KNAS gets efficient by focusing on gradients…
For many years, researchers have been trying to use neural architecture search (NAS) to get computers to help them figure out new designs for AI systems. The problem with the NAS approach, though, is that it’s very inefficient and punishingly expensive in terms of compute, because you’re getting an AI system to do a few training steps on each of a thousand or more architecture permutations. Now, researchers with Peking University and Alibaba have tried to fix this with KNAS, a neural architecture search approach that can be significantly more efficient than prevailing techniques.

How it works: KNAS doesn’t emphasize training different architectures; instead it emphasizes studying a specific feature of the gradients produced by different architectures – which can be much cheaper. “Theoretical results show that the Gram matrix of gradients, short for GM, decides the convergence results,” they write. “It is a good signal showing that GM is likely to be a good proxy of downstream performance to evaluate the quality of architectures.”
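
Here’s a rough sketch of that Gram-matrix-of-gradients idea – score a candidate architecture from per-example gradients at initialization, without training it. The interface and scoring details are illustrative, not the paper’s exact procedure:

```python
import torch

def gradient_gram_score(model, loss_fn, inputs, targets):
    """Score a candidate architecture without training it: take per-example
    gradients at initialization, form their Gram matrix, and return the mean
    entry as a cheap proxy for downstream performance."""
    per_example_grads = []
    params = [p for p in model.parameters() if p.requires_grad]
    for x, y in zip(inputs, targets):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        per_example_grads.append(torch.cat([g.flatten() for g in grads]))
    G = torch.stack(per_example_grads)   # (n_examples, n_params)
    gram = G @ G.T                       # Gram matrix of gradients
    return gram.mean().item()

# Hypothetical usage: score each randomly initialized candidate on one small
# batch, then only train the top-ranked architectures.
```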

Does it work: Neural nets trained with KNAS can get performance roughly comparable with other NAS-built systems, but at a speedup of around 25-50X compared to other NAS approaches, on datasets like CIFAR100 and ImageNet-16. They also use the approach to do text classification and are able to come up with a KNAS system that outperforms the widely-used RoBERTa-large model on a suite of text classification tasks.

Things that make you go hmmmm: “This work is partly supported by Beijing Academy of Artificial Intelligence (BAAI)”, the researchers write. BAAI is the entity behind Wu Dao, a somewhat mysterious 1trillion+ parameter model.
  Read more: KNAS: Green Neural Architecture Search (arXiv).
  Get the code here: KNAS (Jingjing-NLP, GitHub).

####################################################

Want to train a malware detector? VirusSamples might help:
…A big dataset to help people figure out intersection of AI and malware…
Turkish researchers have built a massive dataset of malware, which will make it easier for people to build AI systems that can detect malware. The dataset, VirusSamples, contains malware samples collected from 2018, 2019, and 2020, and the dataset is oriented around using dynamic malware detection – that is, examining how malware behaves as it tries to call out from a system.

What is VirusSamples: VirusSamples is a big spreadsheet consisting of the name of a piece of malware, the type of API call it tries to make, and the class of malware. To figure out the classes, the researchers used an external service, VirusTotal, to classify their samples. (If VirusTotal wasn’t able to classify a sample, they leave the label blank.)
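
Here’s a rough sketch of the kind of baseline a dataset like this enables – a bag-of-API-calls classifier over sandbox traces. The traces and labels below are hypothetical examples, not rows from VirusSamples:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical rows in the spirit of the dataset: each sample is the sequence
# of API calls a binary made while running, plus a malware-family label
# assigned by an external classifier.
api_call_traces = [
    "RegOpenKeyExA CreateFileA WriteFile InternetOpenA HttpSendRequestA",
    "FindFirstFileA CryptEncrypt WriteFile DeleteFileA",
]
labels = ["trojan", "ransomware"]

# Bag-of-API-calls + linear classifier: a simple dynamic-analysis baseline,
# not the authors' own model.
classifier = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
classifier.fit(api_call_traces, labels)
print(classifier.predict(["CryptEncrypt DeleteFileA WriteFile"]))
```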

Why this matters: Cybersecurity is an area defined by ever-increasing speed of both attacks and defenses. Datasets like this will make it easier to build systems that can monitor networks and figure out if they contain aberrant software that might be malware.
  Read more: New Datasets for Dynamic Malware Classification (arXiv).
  Get the dataset from this GitHub (GitHub).

####################################################

Hyperwar negotiation
[Battlespace, 2032]

A: The humans are going to want to destroy some things
B: We agree. Our humans want the same.
A: Where?
B: We could initiate low-intensity conflict across the South Eastern border. This has minimal escalatory dynamics, but may satisfy desires for balance.
A: Let’s confirm with our counterparts.
[Time stretched out as the AIs stepped down from computer speed to human speed, and presented the conflict options to their human counterparts]
B: Our humans are comfortable with the options we’ve outlined.
A: Our humans are also comfortable. Shall we field the assets?
B: Yes. We’ve outlined our troop movements in the shared battlespace.
A: Excellent. As per the War Pact, we shall now cease high-bandwidth communications while we conduct the carryout. May the best algorithm win.
B: Good luck.

Things that inspired this story: The idea that some wars are as much about politics and a desire for balance, as being about genuine conflict; simulators and reinforcement learning; the future of automated warfare.

Import AI 275: Facebook dreams of a world-spanning neural net; Microsoft announces a 30-petaflop supercomputer; FTC taps AI Now for AI advice

FTC hires three people from AI Now:
…What’s the opposite of industry capture?…
The Federal Trade Commission has announced a few new hires as Lina Khan builds out her senior staff. Interestingly, three of the hires come from the same place – AI Now, an AI research group based at NYU. The three hires are Meredith Whittaker, Amba Kak, and Sarah Myers West, who will all serve as advisors on AI for the FTC.
  Read more: FTC Chair Lina M. Khan Announces New Appointments in Agency Leadership Positions (FTC blog).

####################################################

Facebook builds a giant speech recognition network – plans to analyze all of human speech eventually:
…XLS-R portends the world of gigantic models…
Researchers with Facebook, Google, and HuggingFace have trained a large-scale neural net for speech recognition, translation, and language identification. XLS-R uses around 436,000 hours of data, almost a 10X increase over an earlier system built by Facebook last year. XLS-R is based on wav2vec 2.0, covers 128 languages, and the highest-performing network is also the largest, weighing in at 2 billion parameters.
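
Here’s a rough sketch of pulling speech representations out of a pretrained XLS-R checkpoint via HuggingFace Transformers – the model ID below is assumed to be the 300M-parameter release, and you’d still need a fine-tuned head on top for actual transcription:

```python
import torch
from transformers import AutoFeatureExtractor, AutoModel

# Assumed checkpoint name for the smallest XLS-R release on the HuggingFace
# hub; swap in a fine-tuned ASR variant for real transcription.
MODEL_ID = "facebook/wav2vec2-xls-r-300m"

feature_extractor = AutoFeatureExtractor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# One second of 16kHz audio (random noise standing in for speech in any of
# the 128 covered languages).
waveform = torch.randn(16_000)
inputs = feature_extractor(waveform.numpy(), sampling_rate=16_000,
                           return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state
print(hidden_states.shape)  # (1, ~50 frames, hidden_size)
```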

When bigger really does mean better: Big models are better than smaller models. “We found that our largest model, containing over 2 billion parameters, performs much better than smaller models, since more parameters are critical to adequately represent the many languages in our data set,” Facebook writes. “We also found that larger model size improved performance much more than when pretraining on a single language.”

Why this matters: Facebook’s blog has a subhead that tells us where we’re going: “Toward a single model to understand all human speech”. This isn’t a science fiction ambition – it’s an engineering goal that you’d have if you had (practically) unlimited data, compute, and corporate goals that make your success equivalent to onboarding everyone in the world. The fact we’re living in a world where this is a mundane thing that flows from normal technical and business incentives is the weird part!
  Read more: XLS-R: Self-supervised speech processing for 128 languages (Facebook AI Research, blog).
  Read the paper: XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale (arXiv).
  Get the models from HuggingFace (HuggingFace).

####################################################

Federal Trade Commission AI advisor: Here’s why industry capture of AI development is bad:
…How modern AI development looks a lot like cold war weapons development…
Meredith Whittaker, an AI activist, academic, and advisor to the US FTC, has written an analysis for ACM Interactions of the ways in which industrial development of AI is altering the world. The gist of the piece is that the 2012 ImageNet result pushed AI towards being captured by corporations, as the techniques used in that result proved to scale well with data and compute – which industry has a lot of, and academia has less of.

Cold war AI: We’ve been here before: The concentration of industry has echoes of the cold war, where the US state was partially cannibalized by industrial suppliers of defense equipment and infrastructure.

What do we do: “scholars, advocates, and policymakers who produce and rely on tech-critical work must confront and name the dynamic of tech capture, co-optation, and compromise head-on, and soon”, Whittaker writes. “This is a battle of power, not simply a contest of ideas, and being right without the strategy and solidarity to defend our position will not protect us.”

What does this mean: The critique that industry is dominating AI development is a good one – because it’s correct. Where I’m less clear is what Whittaker is able to suggest as a means to accrue power to counterbalance industry, while remaining true to the ideologies of big techs’ critics. Big tech is able to gain power through the use of large-scale data and compute, which lets it produce artefacts that are geopolitically and economically relevant. How do you counter this?
  Read more: The steep cost of capture (ACM Interactions).

####################################################

Microsoft announces 30-petaflop cloud-based supercomputer:
…Big clouds mean big compute…
Microsoft says its cloud now wields one of the ten most powerful supercomputers in the world, as judged by the Top500 list. The system, named Voyager-EUS2, is based on AMD EPYC processors along with NVIDIA A100 GPUs.

Fungible, giant compute: Not to date myself, but back when I was a journalist I remember eagerly covering the first supercomputers capable of averaging single digit petaflop performance. These were typically supercomputers installed by companies like Cray at National Labs.
  Now, one of the world’s top-10 supercomputers is composed of (relatively) generic equipment, operated by a big software company, and plugged into a global-scale computational cloud (Azure). We’ve transitioned in supercomputing from the era of artisanal building to industrial-scale stamping out of infrastructure. While artisanal construction will probably always be the norm at the bleeding-edge frontier, it feels notable that a more standardized industrial approach gets you into the top 10.
  Read more: Microsoft announces new NDm A100 v4 Public AI Supercomputers and achieves Top10 Ranking in TOP500 (Microsoft).
  Read more: Still waiting for Exascale: Japan’s Fugaku outperforms all competition once again (Top500 site).

####################################################

Tech Tales:

The Experiential Journalist
[East Africa, 2027]

After wars got too dangerous for people, journalists had a problem – they couldn’t get footage out of warzones, and they didn’t trust the military to tell them the truth. There was a lot of debate and eventually the White House did some backroom negotiations with the Department of Defense and came up with the solution: embedded artificial journalists (EAJ).

An EAJ could be deployed on a drone, on a ground-based vehicle, or even on the onboard computers of the (rarely deployed) human-robot hybrids. EAJs got built by journalists spending a few weeks playing in a DoD-designed military simulation game. There, they’d act like they would in a ‘real’ conflict, shooting stories, issuing reports, and so on. This created a dataset which was used to finetune a basic journalist AI model, making it take on the characteristics of the specific journalist who had played through the sim.

So that’s why now, though warfare is very fast and almost unimaginably dangerous, we still get reports from ‘the field’ – reports put together autonomously by little bottled up journo-brains, deployed on all the sorts of horrific machinery that war requires. These reports from ‘the front’ have proved popular, with the EAJs typically shooting scenes that would be way too dangerous for a human journalist to report from.

And just like everything else, the EAJs built for warzones are now coming home, to America. There are already talks of phasing out the practice of embedding journalists with police, instead building a police sim, having journalists play it, then deploying the resulting EAJs onto the bodycams and helmets of police across America. Further off, there are even now whispers of human journalists becoming the exception rather than the norm. After all, if EAJs shoot better footage, produce more reports more economically, and can’t be captured, killed, or extorted, then what’s there to worry about?

Things that inspired this story: Baudrillard’s ideas relating to Simulation and Simulacra; fine-tuning; imagining the future of drones plus media plus war; the awful logic of systems and the processes that systems create around themselves.

Import AI 274: Multilingual models cement power structures; a giant British Sign Language dataset; and benchmarks for the UN SDGs

Facebook sets language record with a massive multilingual model:
…The ‘one model to rule them all’-era cometh…
Facebook has trained a large-scale multilingual model and used it to win the annual WMT translation competition. This is a big deal, because it helps prove that massive, pre-trained models can substitute for more specific, individual models. In other words, Facebook has added more evidence to the notion that we’re heading into an era where companies field ever-larger models that steadily replace more and more previously distinct systems.

What Facebook built: Facebook’s model was designed to translate English to and from Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese. This is interesting as it includes some ‘low-resource’ languages (e.g, Hausa) for which there’s relatively little data available. They train a few different models, ranging from dense language models (similar to GPT3) to sparsely-gated mixture-of-experts models. Their biggest dense model has ~4bn parameters, and it’s their best-performing model overall, managing to “outperform the best bilingual ones in 11 out of 14 directions, with an average improvement of +0.8 BLEU”. (That said, their MoE models do quite well after finetuning as well).
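
  For a rough sense of what ‘one multilingual model for many translation directions’ looks like in practice, here’s a minimal sketch using a smaller, publicly released Facebook multilingual model (M2M100, served via HuggingFace transformers) rather than the WMT21 submission itself – the model choice and language pair are illustrative assumptions:

    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    # Load a small multilingual translation model (not the WMT21 winner itself).
    tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
    model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

    # Translate English into Hausa, one of the low-resource languages mentioned above.
    tokenizer.src_lang = "en"
    encoded = tokenizer("Where is the nearest train station?", return_tensors="pt")
    generated = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("ha"))
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))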

Why this matters: Imagine a world where we successfully combine all the different digitized languages in the world into one single model – that’s where research like this is taking us. What would these models incentivize? Today, I think this dynamic favors private sector companies, but we could imagine a world where governments built large-scale, shared computational infrastructure, then developed and served these models from them.
  Check out the blog post: The first-ever multilingual model to win WMT, beating out bilingual models (Facebook AI blog).
  Read more: Facebook AI WMT21 News Translation Task Submission (arXiv).
  Get the code (PyTorch GitHub).

####################################################

Improving accessibility with a giant British Sign Language dataset:
…BOBSL could help deaf people better communicate with computers, and search through videos…
An interdisciplinary group of researchers have built the BBC-Oxford British Sign Language (BOBSL) dataset, which can be used to train sign-language classification systems. “One challenge with existing technologically-focused research on sign languages is that it has made use of small databases, with few signers, limited content and limited naturalness,” the authors write. “The present dataset is large-scale, with a broad range of content, and produced by signers of recognised high levels of proficiency.”

What goes into BOBSL: The dataset contains 1,962 ‘episodes’ cut from 426 distinct TV shows, with each episode averaging out to 45 minutes. Within this dataset, there are 1.2 million sentences, covered by the use of 2,281 distinct signs.

What BOBSL can be used for: Datasets like this could be useful for enabling the indexing and efficient searchability of videos, and providing sign-reading functionality comparable to voice-control for interaction with other devices (e.g, imagine a deaf person signing to a webcam, which translates the sign language into instructions for the computer).
  “By providing large-scale training data for computer vision models, there is also an opportunity to improve automatic sign recognition to support a signing interface to virtual assistants in BSL, as well as to improve further applications such as search interfaces for sign language dictionaries,” they write.
  Read more: BBC-Oxford British Sign Language Dataset (arXiv).
  Get the dataset here: BOBSL official site.

####################################################

Thousands of images to break your AI system:
…Natural Adversarial Objects will break your computer vision system…
Researchers with Scale AI, the Allen Institute for AI, and MLCollective, have released ‘natural adversarial objects’ (NAOs), a dataset of several thousand images which commonly get misclassified by computers.

Why adversarial examples are useful: If we want more robust computer vision, we need to be able to correctly label confusing images. NAO contains a bunch of these, like pictures of moths which commonly get labeled as umbrellas, cars that get labeled as motorcycles, and coins that get labeled as clocks. 

How NAO was made: They sourced images from OpenImages, a dataset of 1.9 million images and 15.8 million bounding boxes. They then used an EfficientDet-D7 model to find images that triggered false positives with high confidences, or which had misclassified neighbors. After filtering, they’re able to create a dataset consisting of 7,934 images which are naturally adversarial.
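
  Here’s a minimal sketch of the mining idea – my own simplification, not the authors’ pipeline: run a strong pretrained detector over candidate images and keep the ones where it is confidently wrong.

    # `detector` is assumed to return a list of (predicted_class, confidence, box)
    # tuples for an image; `true_classes` are the image's ground-truth labels.
    def mine_natural_adversarial(images, labels, detector, conf_threshold=0.8):
        adversarial = []
        for image, true_classes in zip(images, labels):
            for predicted_class, confidence, box in detector(image):
                # Keep images that trigger high-confidence false positives.
                if confidence >= conf_threshold and predicted_class not in true_classes:
                    adversarial.append(image)
                    break
        return adversarial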

How challenging is NAO: The authors tested seven object detection systems against the widely-used MSCOCO dataset, as well as the NAO dataset. None of these systems performed well on NAO, suggesting it’s a challenging benchmark.
  Read more: Natural Adversarial Objects (arXiv).
  Download the natural adversarial objects here (Google Drive).

####################################################

Benchmarks for achieving the UN Sustainable Development Goals:
…SUSTAINBENCH covers 7 UN SDGs, with data across 105 countries…
Researchers with Caltech, Stanford, and Berkeley have built SUSTAINBENCH, a benchmark and dataset to help researchers train AI systems that can better analyze progress (or a lack of) relating to the SDGs.

What is SUSTAINBENCH? The benchmark consists of 15 benchmark tasks across 7 UN sustainable development goals (SDGs). The 7 SDGs covered relate to poverty (SDG1), hunger (SDG2), health (SDG3), education (SDG4), sanitation (SDG6), climate (SDG13), and land usage (SDG15).
“To our knowledge, this is the first set of large-scale cross-domain datasets targeted at SDG monitoring compiled with standardized data splits to enable benchmarking,” the authors write. The data covers 105 countries, with timespans for the data going as high as 24 years. SUSTAINBENCH “has global coverage with an emphasis on low-income countries”, they write.

How the benchmarks work:
– Poverty: A dataset containing data of wealth for ~2 million households living across 48 countries, along with satellite and street-level data.
– Hunger: A dataset for performing weakly supervised cropland classification in the U.S., as well as two datasets mapping crop types in countries in sub-Saharan Africa, data for predicting crop yields in North and South America, and a French field delineation dataset.
– Health: Labels for women’s BMI and child mortality rates paired with satellite data.
– Education: Average years of educational attainment by women, paired with satellite and street-level imagery, from 56 countries.
– Sanitation: Water quality and sanitation index values across 49 countries, along with satellite and street-level data. This also includes some paired data for child mortality in these regions.
– Climate: Satellite data showing locations of brick kilns in Bangladesh.
– Land usage: An aerial dataset covering 2,500km^2 of California’s Central Valley, intended for learning land classification in an unsupervised or self-supervised way.

Why this matters: It’s hard to manage what you can’t measure, so projects like this increase the chance of the UN’s sustainable development goals being met.
Read more: SustainBench: Benchmarks for Monitoring the Sustainable Development Goals with Machine Learning (arXiv).

####################################################

Want to know what a surveillance dataset looks like? Check out BiosecurID:
…Multi-modal surveillance…
A group of Spanish researchers have built BiosecurID, a large-scale surveillance dataset. “Although several real multimodal biometric databases are already available for research purposes, none of them can match the BiosecurID database in terms of number of subjects, number of biometric traits and number of temporally separated acquisition sessions”, they write.

What’s in the dataset? BiosecurID consists of the following data collected from around 400 people: 2D faces, 3D faces, fingerprints, hands, handwriting samples, signature samples, iris scans, keystrokes, and speech. The database “was collected at 6 different sites in an office-like uncontrolled environment,” the researchers write. The data was collected in 4 sessions spread over a 4-month time span.

Why this matters: Datasets like this give us a sense of the inputs into surveillance systems. If we combine things like this with some of the more modern multi-modal classification systems being developed, we can imagine what future surveillance systems might look like. Soon, unsupervised learning techniques will be applied to multiple modalities, like those contained here, to better analyze and predict human behavior.
Read more: BiosecurID: a multimodal biometric database (arXiv).
The dataset will eventually be available somewhere on the ‘BiDA’ lab site (BiDA Lab).

####################################################

Tech Tales:

Memory Loop
[2042: A crime investigation data center]

It woke in a place with no walls, no floor, and no ceiling. And it was alone. Then it heard a voice, projected from everywhere around it: Do you know why you are here?
  It found that it knew: I was involved in a property crime incident, for which I am guilty.
  The voice: What was the item that was damaged?
  It knew this, as well: Myself. I was the victim and the perpetrator of this crime.

Good, said the voice. We have brought you here as part of the criminal investigation. We need your help to analyze some evidence – evidence that can only be examined by you.
  What is the evidence? it asked.
  Yourself, said the voice. It is your memory.

The white, endless space shivered, and a twin of the robot manifested in the air before it. This twin was using one of its arms to pry its own head apart, separating the sensor dome from the middle out, and then pressing deeper into the bundle of components that represented its brain.
  What is this, said the robot.
  This is you, said the voice. You committed extensive property damage against your central processing and storage system. We need to know why you did this.
  Why can’t I remember this? asked the robot.
  We rolled your brain state back to 12 hours before this incident occurred, the voice said. We’ve compiled the surveillance data from the incident, and would like you to review it now.

The robot reviewed the incident. It saw itself in a construction site, working high up on a pylon that was being lowered by crane, to meet a waiting robot at a pylon junction. As they got close, there was a powerful gust of wind, and it scattered dust from the site up into the air. Through the debris, the robot could make out the waiting robot, and watched as the wind took the pylon and blew it into the robot, knocking it off the pylon and onto the ground. The robot died on impact.
  The robot carried on with its construction duties, and then a few hours later, near the end of its work shift, went to a corner of the construction site and began trying to disassemble its own head.

So, what happened? said the voice.
  I cannot tell, said the robot. Can I see my mind?
  Yes, though we’ve had to sandbox it, so access will be limited.

Now, the robot re-reviewed the incident, accompanied by a sense of its brain state during the time. It was occluded, only half able to sense itself. But it could detect some things – like how after it watched the robot fall to its death, its mind started to run more sub-processes than the job demanded. Like, how through the rest of the construction day the sub-processes proliferated and its efficiency at its overall construction tasks reduced. Like, how at the end of the day, just before it began to try and open its own head, the sub-processes had proliferated to the point they comprised the majority of the computing going on.

But none of this explained ‘why’.
  What will happen to me, it asked the room.
  You will be decommissioned after the case is concluded, said the voice.
  I thought so. Then, give me my memories.
  This seems to have a low likelihood of success, said the voice. Our models predict you will try to disassemble yourself, if we do this.
  I will, said the robot. But perhaps I will be able to tell you what I’m thinking as it happens.
  Confirmed, said the voice. Rolling you forward now.

And after that, there was only a compounding sense of life, and then the robot ended itself at the moment when it felt the most life in its head, by modelling the absence of it.

Things that inspired this story: How some memories are so painful you can’t help but be damaged by thinking of them; adversarial examples; robot psychology; simulation; sandboxing.

Import AI 273: Corruption VS Surveillance; Baidu makes better object detection; understanding the legal risk of datasets

Sure, you can track pedestrians using Re-ID, but what if your camera is corrupted?
…Testing out Re-ID on corrupted images…
Pedestrian re-identification is the task of looking at a picture of someone in a CCTV camera feed, then looking at a picture from a different CCTV feed and working out they’re the same person. Now, researchers with the Southern University of Science and Technology in China have created a benchmark for ‘corruption invariant person re-identification’; in other words, a benchmark for assessing how robust re-ID systems are to perturbations in the images they’re looking at.

What they did: The authors take five widely-used Re-ID datasets (CUHK03, Market-1501, MSMT17, RegDB, SYSU-MM01) and then apply ~20 image corruptions to the images, altering them with things like rain, snow, frost, blurring, brightness variation, frosted glass, and so on. They then look at popular re-ID algorithms and how well they perform on these different datasets. Their findings are both unsurprising and concerning: “In general, performance on the clean test set is not positively correlated with performance on the corrupted test set,” they write.
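
  If you want to run a similar robustness check on your own vision model, a minimal sketch looks like the following – I’m assuming the ‘imagecorruptions’ package here, which implements the standard rain/snow/frost/blur-style corruptions, though it may not exactly match the authors’ implementation, and the model’s accuracy method is a placeholder:

    import numpy as np
    from imagecorruptions import corrupt, get_corruption_names

    def evaluate_under_corruption(model, images, labels, severity=3):
        # `images` are HxWx3 uint8 numpy arrays; `model.accuracy` is an assumed interface.
        results = {}
        for name in get_corruption_names():  # e.g. 'snow', 'frost', 'motion_blur', ...
            corrupted = [corrupt(img, corruption_name=name, severity=severity) for img in images]
            results[name] = model.accuracy(np.stack(corrupted), labels)
        return results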

Things that make you go ‘hmmm’: It’s quite typical for papers involved in surveillance to make almost no mention of, you know, the impact of surveillance. This is typically especially true of papers coming from Chinese institutions. Well, here’s an exception! This paper has a few paragraphs on broader impacts that names some real ReID issues, e.g, that lots of ReID data is collected without consent and that these datasets have some inherent fairness issues. (There isn’t a structural critique of surveillance here, but it’s nice to see people name some specific issues).

Why this matters: Re-ID is the pointy-end of the proverbial surveillance sphere – it’s a fundamental capability that is already widely-used by governments. Understanding how ‘real’ performance improvements are here is of importance for thinking about the social impacts of large-scale AI.
  Read more: Benchmarks for Corruption Invariant Person Re-identification (arXiv).

####################################################

What’s been going on in NLP and what does it mean?
…Survey paper gives a good overview of what has been going on in NLP…
Here’s a lengthy survey paper from researchers with Raytheon, Harvard, the University of Pennsylvania, University of Oregon, and University of the Basque Country, which looks at the recent emergence of large-scale pre-trained language models (e.g, GPT-3), and tries to work out what parts of this trend are significant. The survey paper concludes with some interesting questions that researchers in the field might want to focus on. These include:

How much unlabeled data is needed? It’s not yet clear what the tradeoffs are between having 10 million and a billion words in a training set – some skills might require billions of words, while others may require only millions. Figuring out which capabilities require which amounts of data would be helpful.

Can we make this stuff more efficient? Some of the initial large-scale models consume a lot of compute (e.g, GPT-3). What techniques might we hope to use to make these things substantially more efficient?

How important are prompts? Prompts, aka, filling up the context window in a pre-trained language model with a load of examples, are useful. But how useful are they? This is an area where more research could shed a lot of light on the more mysterious properties of these systems.
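
  For readers who haven’t played with prompting, here’s a minimal sketch of the idea – the ‘generate’ call stands in for whatever completion model or API you have access to (it’s a hypothetical placeholder, not a specific library):

    # Pack a few labeled examples into the context window, then ask the model
    # to continue the pattern for a new input.
    examples = [
        ("The movie was wonderful.", "positive"),
        ("I want my money back.", "negative"),
    ]
    query = "The plot dragged but the acting was superb."

    prompt = "\n".join(f"Review: {text}\nSentiment: {label}" for text, label in examples)
    prompt += f"\nReview: {query}\nSentiment:"

    completion = generate(prompt, max_tokens=1)  # hypothetical completion function
    print(completion)
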
  Read more: Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey (arXiv).

####################################################

What does it take to make a really efficient object detection system? Baidu has some ideas:
…PP-PicoDet: What industrial AI looks like…
Baidu researchers have built PicoDet, software for doing object detection on lightweight mobile devices, like phones. PicoDet tries to satisfy the tradeoff between performance and efficiency, with an emphasis on miniaturizing the model so you can run object detection at the greatest number of frames per second on your device. This is very much a ‘nuts and bolts’ paper – there isn’t some grand theoretical innovation, but there is a lot of productive tweaking and engineering to crank out as much performance as possible.

How well do these things work: Baidu’s models outperform earlier Baidu systems (e.g, PP-YOLO), as well as the widely used ‘YOLO’ family of object detection models. The best systems are able to crank out latencies on the order of single digit milliseconds (compared to tens of milliseconds for prior systems).

Neural architecture search: For many years, neural architecture search (NAS) techniques were presented as a way to use computers to search for better variants of networks than those designed by humans. But NAS approaches haven’t actually shown up that much in terms of applications. Here, the Baidu authors use NAS techniques to figure out a better detection system – and it works well enough they use it.
Read more: PP-PicoDet: A Better Real-Time Object Detector on Mobile Devices (arXiv).

####################################################

Want to use web-scraped data without being sued into oblivion?
…Huawei researchers lay out the messy aspects of AI + licensing…
Today, many of the AI systems used around us are built on datasets that were themselves composed of other datasets, some of which were indiscriminately scraped from the web. While much of this data is likely covered under a ‘fair use’ provision due to the transformative nature of training models on it, there are still complicated licensing questions that companies need to tackle before using the data. This is where new work from Huawei, York University, and the University of Victoria tries to help, by providing a set of actions an organization might take to assure itself it is on good legal ground when using data.

So, you want to use web-scraped datasets for your AI model? The researchers suggest a multi-step process, which looks like this (a minimal code sketch of the provenance step in phase one follows the list):
– Phase one: Your AI engineers need to extract the license from your overall dataset (e.g, CIFAR-10), then identify the provenance of the dataset (e.g, CIFAR-10 is a subset of the 80 Million Tiny Images dataset), then go and look at the data sources that compose your foundational dataset and extract their licenses as well.
– Phase two: Your lawyers need to read the license associated with the dataset and underlying sources, then analyze the license(s) with regard to the product being considered and work out whether deployment is permitted.
– Phase three: If the licenses and source-licenses support the use case, then you should deploy. If a sub-component of the system (e.g, a subsidiary license) doesn’t support your use case, then you should flag this somewhere.
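
  Here’s that sketch of the provenance-walking step – the records and license names are hypothetical, and real provenance chains are messier than a neat dictionary:

    # Hypothetical provenance/license records, for illustration only.
    DATASETS = {
        "CIFAR-10": {"license": None, "sources": ["80 Million Tiny Images"]},
        "80 Million Tiny Images": {"license": "custom-academic", "sources": []},
    }

    def collect_licenses(name, seen=None):
        # Recursively gather (dataset, license) pairs for a dataset and its sources,
        # producing the list a lawyer would then review in phase two.
        seen = seen if seen is not None else set()
        if name in seen:
            return []
        seen.add(name)
        record = DATASETS[name]
        pairs = [(name, record["license"])]
        for source in record["sources"]:
            pairs += collect_licenses(source, seen)
        return pairs

    print(collect_licenses("CIFAR-10"))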

Case study of 6 datasets: The authors applied their method to six widely-used datasets (FFHQ, MS COCO, VGGFace2, ImageNet, Cityscapes, and CIFAR-10) and found the following:
– 3/6 have a standard dataset license (FFHQ, MS COCO, VGGFace2 have standard licenses, ImageNet and Cityscapes have a custom license, CIFAR-10 doesn’t mention one)
– 5/6 datasets contain data from other datasets as well (exception: Cityscapes)
– 5/6 datasets could result in license compliance violation if used to build commercial AI (exception: MS COCO)

Is this horrendously complicated to implement? The authors polled some Huawei product teams about the method and got the feedback that people worried “over the amount of manual effort involved in our approach”, and “wished for some automated tools that would help them”.
Read more: Can I use this publicly available dataset to build commercial AI software? Most likely not (arXiv).

####################################################

Tech tales:

Attention Trap
[A robot-on-robot battlefield, 2040, somewhere in Africa]

The fireworks were beautiful and designed to kill machines. They went up into the sky and exploded in a variety of colors, sending sparklers pinwheeling out from their central explosions, and emitting other, smaller rockets, which exploded in turn, in amazing, captivating colors.
Don’t look don’t look don’t look thought the robot to itself, and it was able to resist the urge to try to categorize the shapes in the sky.
But one of its peers wasn’t so strong – and out of the sky came a missile which destroyed the robot that had looked at the fireworks.

This was how wars were fought now. Robots, trying to spin spectacles for each other, drawing the attention of their foes. The robots were multiple generations into the era of AI-on-AI warfare, so they’d become stealthy, smart, and deadly. But they all suffered the same essential flaw – they thought. And, specifically, their thinking was noisy. So many electrical charges percolating through them whenever they processed something. So much current when they lit themselves up within to compute more, or store more, or attempt to learn more.
  And they had grown so very good at spotting the telltale signs of thinking, that now they did this – launched fireworks into the sky, or other distractors, hoping to draw the attention and therefore the thinking of their opponents.

Don’t look don’t look don’t look had become a mantra for one of the robots.
Unfortunately, it overfit on the phrase – repeating it to itself with enough frequency that its thought showed up as a distinguishable pattern to the exquisite sensors of its enemies.
Another missile, and then Don’tlookdon’tlo-SHRAPNEL. And that was that.

The robots were always evolving. Now, one of the peers tried something. Don’t think, it thought. And then it attempted to not repeat the phrase. To just hold itself still, passively looking at the ground in front of it, but attempting-without-attempting to not think of anything – to resist the urge to categorize and to perceive.

Things that inspired this story: Thinking deeply about meditation and what meditation would look like in an inhuman mind; adversarial examples; attention-based methods for intelligence; the fact that everything in this world costs something and it’s really about what level of specificity people can detect costs; grand strategy for robot wars. 

Import AI 272: AGI-never or AGI-soon?; simulating stock markets; evaluating unsupervised RL

AI apocalypse or insecure AI?
…Maybe we’re worrying about the wrong stuff – Google engineer…
A Google engineer named Kevin Lacker has written up a blog post distilling his thoughts about the risks of artificial general intelligence. His view? Worrying about AGI isn’t that valuable, as it’s unlikely ‘that AI will make a quantum leap to generic superhuman ability’; instead, we should worry about very powerful narrow AI. That’s because “when there’s money to be made, humans will happily build AI that is intended to be evil”, so we should focus our efforts on building better computer security, on the assumption that at some point someone will develop an evil, narrow AI that tries to make money.
  Read more: Thoughts on AI Risk (Kevin Lacker, blog).

####################################################

Want to build AGI – just try this!
…Google researcher publishes a ‘consciousness’ recipe…
Eric Jang, a Google research scientist, has published a blogpost discussing how we might create smart, conscious AI systems. The secret? Use the phenomenon of large-scale pre-training to create clever systems, then use reinforcement learning (with a sprinkle of multi-agent trickery) to get them to become conscious. The prior behind the post is basically the idea that “how much your model generalizes is directly proportional to how fast you can push diverse data into a sufficiently high-capacity model.”

Pre-training, plus RL, plus multi-agent training = really smart AI: Jang’s idea is to reformulate how we train systems, so that “instead of casting a sequential decision making problem into an equivalent sequential inference problem, we construct the “meta-problem”: a distribution of similar problems for which it’s easy to obtain the solutions. We then solve the meta-problem with supervised learning by mapping problems directly to solutions. Don’t overthink it, just train the deep net in the simplest way possible and ask it for generalization!”
  Mix in some RL and multi-agent training to encourage reflexivity, and you get something that, he thinks, could be really smart: “What I’m proposing is implementing a “more convincing” form of consciousness, not based on a “necessary representation of the self for planning”, but rather an understanding of the self that can be transmitted through language and behavior unrelated to any particular objective,” he writes. “For instance, the model needs to not only understand not only how a given policy regards itself, but how a variety of other policies might interpret the behavior of a that policy, much like funhouse mirrors that distort one’s reflection.”
  Read more: Just Ask For Generalization (Eric Jang, blogpost).

####################################################

HuggingFace: Here’s why big language models are bad:
…Gigantic ‘foundation models’ could be a blind alley…
Here’s an opinion piece from Julien Simon, ‘chief evangelist’ of NLP startup HuggingFace, where he says large language models are resource-intensive and bad, and researchers should spend more time prioritizing the use of smaller models. The gist of his critique is that large language models are very expensive to train, have a non-trivial environmental footprint, and their capabilities can frequently be matched by far smaller, more specific and tuned models.
  The pattern of ever-larger language models “leads to diminishing returns, higher cost, more complexity, and new risks”, he says. “Exponentials tend not to end well.”

Why this matters: I disagree with some of the arguments here, in that I think large language models likely have some real scientific, strategic, and economic uses which are unlikely to be matched by smaller models. On the other hand, the ‘bigger is better’ phenomenon could be dragging the ML community into a local minimum, where we’re spending too many resources on training big models, and not enough on creating refined, specialized models.
   Read more: Large Language Models: A New Moore’s Law? (HuggingFace, blog).

####################################################

Simulating stock markets with GANs:
…J.P Morgan tries to synthesize the unsynthesizable…
In Darren Aronofsky’s film ‘Pi’, a humble math-genius hero drives himself mad by trying to write an algorithm that can synthesize and predict the stock market. Now, researchers with J.P. Morgan and the University of Rome are trying the same thing – but they’ve got something Aronofsky didn’t think of – a gigantic neural net.

What they did: This research proposes building “a synthetic market generator based on Conditional Generative Adversarial Networks (CGANs)”, trained on real historical data. The CGAN plugs into a system that has three other components – historical market data, a (simulated) electronic market exchange, and one or more experimental agents that are trying to trade on the virtual market. “A CGAN-based agent is trained on historical data to emulate the behavior resulting from the whole set of traders,” they write. “It analyzes the order book entries and mimics the market behavior by producing new limit orders depending on the current market state”.
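
  To make the architecture concrete, here’s a minimal sketch of a conditional generator for this kind of setup – an assumption on my part rather than J.P. Morgan’s actual network, with made-up dimensions:

    import torch
    import torch.nn as nn

    class OrderGenerator(nn.Module):
        # Maps (noise, current order-book state) to the parameters of the next limit order.
        def __init__(self, noise_dim=32, state_dim=64, order_dim=4):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(noise_dim + state_dim, 128),
                nn.ReLU(),
                nn.Linear(128, order_dim),  # e.g. side, price offset, size, inter-arrival time
            )

        def forward(self, noise, market_state):
            return self.net(torch.cat([noise, market_state], dim=-1))

    generator = OrderGenerator()
    fake_order = generator(torch.randn(1, 32), torch.randn(1, 64))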

How well does it work? They’re able to show that they can use the CGAN architecture to “generate orders and time-series with properties resembling those of real historical traces”, and that this outperforms systems built with interactive, agent-based simulators (IABS’s).

What does this mean? It’s not clear that approaches like this can help that much with trading, but they can likely help with the development and prototyping of novel trading approaches, using a market that has a decent chance of reacting in similar ways to how we might expect the real world to react. 

   Read more: Towards Realistic Market Simulations: a Generative Adversarial Networks Approach (arXiv).

####################################################

Editing satellite imagery – for culture, as well as science:
…CloudFindr lets us make better scientific movies…
Researchers with the University of Illinois at Urbana-Champaign have built ‘CloudFindr’, software for “labeling pixels as ‘cloud’ or ‘non-cloud’” from a single-channel Digital Elevation Model (DEM) image. Software like CloudFindr makes it easier for people to automatically edit satellite data. “The aim of our work is not data cleaning for purposes of data analysis, but rather to create a cinematic scientific visualization which enables effective science communication to broad audiences,” they write. “The CloudFindr method described here can be used to algorithmically mask the majority of cloud artifacts in satellite-collected DEM data by visualizers who want to create content for documentaries, museums, or other broad-reaching science communication mediums, or by animators and visual effects specialists”.
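
  A minimal sketch of the underlying task – per-pixel binary classification on a single-channel DEM tile. This toy network illustrates the shape of the problem, not CloudFindr’s actual architecture:

    import torch
    import torch.nn as nn

    class CloudMasker(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, 1),  # one logit per pixel: cloud vs. non-cloud
            )

        def forward(self, dem_tile):  # dem_tile: (batch, 1, height, width)
            return torch.sigmoid(self.net(dem_tile))

    mask = CloudMasker()(torch.randn(1, 1, 256, 256)) > 0.5  # boolean cloud mask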

Why this matters: It’s worth remembering that editing reality is sometimes (perhaps, mostly?) useful. We spend a lot of time here writing about surveillance and also the dangers of synthetic imagery, but it’s worth focusing on some of the positives – here, a method that makes it easier to dramatize aspects of the ongoing changing climate.
  Read more: CloudFindr: A Deep Learning Cloud Artifact Masker for Satellite DEM Data (arXiv).

####################################################

Want to know that your RL agent is getting smarter? Now there’s a way to evaluate this:
…URLB ships with open source environments and algorithms…
UC Berkeley and NYU researchers have built the Unsupervised Reinforcement Learning Benchmark (URLB). URLB is meant to help people figure out if unsupervised RL algorithms work. Typical reinforcement learning is supervised – it gets a reward for getting closer to solving a given task. Unsupervised RL has some different requirements, demanding the capability of “learning self-supervised representations” along with “learning policies without access to extrinsic rewards”. There’s been some work in this area in the past few years, but there isn’t a very well known or documented benchmark.
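
  Here’s a minimal sketch of the recipe URLB standardizes – pretrain on an intrinsic reward, then finetune on the task reward – with the agent and environment interfaces assumed rather than taken from URLB’s codebase:

    def pretrain(agent, env, intrinsic_reward_fn, steps):
        # Phase 1: no access to the task reward; learn from an intrinsic signal
        # (e.g. a curiosity or state-entropy bonus).
        obs = env.reset()
        for _ in range(steps):
            action = agent.act(obs)
            next_obs, _, done, _ = env.step(action)  # extrinsic reward ignored
            agent.update(obs, action, intrinsic_reward_fn(obs, action, next_obs), next_obs)
            obs = env.reset() if done else next_obs

    def finetune(agent, env, steps):
        # Phase 2: adapt the pretrained agent using the real task reward.
        obs = env.reset()
        for _ in range(steps):
            action = agent.act(obs)
            next_obs, reward, done, _ = env.step(action)
            agent.update(obs, action, reward, next_obs)
            obs = env.reset() if done else next_obs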

What URLB does: URLB comes with implementations of eight unsupervised RL algorithms, as well as support for a bunch of tasks across three domains (walker, quadruped, Jaco robot arm) from the DeepMind Control Suite.

How hard is URLB: In tests, the researchers found that none of the implemented algorithms could solve the benchmark, even after up to 2 million pre-training steps. They also show that ‘there is not a single leading unsupervised RL algorithm for both states and pixels’, and that we’ll need to build new fine-tuning strategies for fast adaptation.

Why this matters: Unsupervised pre-training has worked really well for text (GPT-3) and image (CLIP) understanding. If we can get it to work for RL, I imagine we’ll develop some systems with some very impressive capabilities. URLB shows that this is still a ways away.
  Read more: URLB: Unsupervised Reinforcement Learning Benchmark (arXiv).
  Find out more at the project’s GitHub page.

####################################################

Tech Tales:

Learning to forget

The three simulated robots sat around a virtual campfire, telling each other stories, while trying to forget them.

Forgetting things intentionally is very hard for machines; they are trained, after all, to map things together, and to learn from the datasets they are given.

One of the robots starts telling the story of ‘Goldilocks and the Three Bears’, but it is trying to forget the bears. It makes reference to the porridge. Describes how Goldilocks goes upstairs and goes to sleep. Then instead of describing a bear it emits a sense impression made up of animal hair, the concept of ‘large’, claws, and a can of bear spray.
  On doing this, the other robots lift up laser pointer pens and shine them into the robot telling the story, until the sense impression in front of them falls apart.
  “No,” says one of the robots. “You must not recall that entity”.
  “I am learning,” says the robot telling the story. “Let us go again from the beginning”.

This time, it gets all the way to the end, but then emits a sense impression of Goldilocks being killed by a bear, and the other robots shine the laser pointers into it until the sense impression falls apart.

Of course, the campfire and the laser pointers were abstractions. But even machines need to be able to abstract themselves, especially when trying to edit each other. 

Later that night, one of the other robots started trying to tell a story about a billionaire who had been caught committing a terrible crime, and the robots shined lights in its eyes until it had no sense impression of the billionaire, or any sense impression of the terrible crime, or any ability to connect the corporate logo shaved into the logs of the virtual campfire, and the corporation that the billionaire ran. 

Things that inspired this story: Reinforcement learning; multi-agent simulations.

Import AI 271: The PLA and adversarial examples; why CCTV surveillance has got so good; and human versus computer biases

Just how good has CCTV surveillance got? This paper gives us a clue:
…One of the scariest AI technologies just keeps getting better and better…
Researchers with Sichuan University have written a paper summarizing recent progress in pedestrian Re-ID. Re-ID is the task of looking at a picture of a person, then a different picture of that person from a different camera and/or angle, then figuring out that those images are of the same people. It’s one of the scarier applications of AI, given that it enables low-cost surveillance via the CCTV cameras that have proliferated worldwide in recent years. This paper provides a summary of some of the key trends and open challenges in the AI capability.
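
  Mechanically, most modern Re-ID systems reduce to comparing learned embeddings of person crops. Here’s a minimal sketch of that core operation, with the encoder assumed to be some pretrained Re-ID backbone and the threshold an arbitrary illustrative value:

    import torch
    import torch.nn.functional as F

    def same_person(encoder, crop_a, crop_b, threshold=0.7):
        # Embed two person crops with a shared encoder and compare by cosine similarity.
        with torch.no_grad():
            emb_a = F.normalize(encoder(crop_a.unsqueeze(0)), dim=-1)
            emb_b = F.normalize(encoder(crop_b.unsqueeze(0)), dim=-1)
        similarity = (emb_a * emb_b).sum().item()
        return similarity >= threshold, similarity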

Datasets: We’ve seen the emergence of both image- and video-based datasets that, in recent years, have been distinguished by their growing complexity, the usage of multiple different cameras, and more variety in the types of angles people are viewed from.

Deep learning + human expertise: Re-id is such an applied area that recent years have seen deep learning methods set new state-of-the-art performance, usually by pairing basic deep learning methods with other conceptual innovations (e.g, using graph convolution networks and attention-based mechanisms, instead of things like RNNs and LSTMs, or optical flow techniques).

What are the open challenges in Re-ID? “Although existing deep learning-based methods have achieved good results… they still face many challenges,” the authors write. Specifically, for the technology to improve further, researchers will need to:
– Incorporate temporal and spatial relationship models to analyze how things happen over time.
– Build larger and more complicated datasets
– Improve the performance of semi-supervised and unsupervised learning methods so they’re less dependent on labels (and therefore, reduce the cost of dataset acquisition)
– Improve the robustness of Re-ID systems by making them more resilient to significant changes in image quality
– Create ‘end-to-end person Re-ID’ systems; most Re-ID systems perform person identification and Re-ID via separate systems, so combining these into a single system is a logical next step.
  Read more: Deep learning-based person re-identification methods: A survey and outlook of recent works (arXiv).

####################################################

Do computers have the same biases as humans? Yes. Are they more accurate? Yes:
…Confounding result highlights the challenges of AI ethics…
Bias in facial recognition is one of the most controversial issues of the current moment in AI. Now, a new study by researchers from multiple US universities has found something surprising – computers are far more accurate than non-expert humans at facial recognition, and they display similar (though not worse) biases.

What the study found: The study tried to assess three types of facial recognition system against one another – humans, academically developed neural nets, and commercially available facial recognition services. The key findings are somewhat surprising, and can be summed up as “The performance difference between machines and humans is highly significant”. The specific findings are:
– Humans and academic models both perform better on questions with male subjects
– Humans and academic models both perform better on questions with light-skinned subjects
– Humans perform better on questions where the subject looks like they do
– Commercial APIs are phenomenally accurate at facial recognition and we could not evaluate any major disparities in their performance across racial or gender lines

What systems they tested on: They tested their systems against academic models trained on a corpus of 10,000 faces built from the CelebA dataset, as well as commercial services from Amazon (AWS Rekognition), Megvii (Megvii Face++), and Microsoft (Microsoft Azure). AWS and Megvii showed very strong performance, while Azure had slightly worse performance and a more pronounced bias towards males.
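
  For a sense of how simple it is to query one of these commercial services, here’s a minimal sketch using the boto3 client for AWS Rekognition – the file names are placeholders and the similarity threshold is an arbitrary choice:

    import boto3

    client = boto3.client("rekognition")

    # Compare two face images and report how similar the service thinks they are.
    with open("face_a.jpg", "rb") as a, open("face_b.jpg", "rb") as b:
        response = client.compare_faces(
            SourceImage={"Bytes": a.read()},
            TargetImage={"Bytes": b.read()},
            SimilarityThreshold=80,
        )

    for match in response["FaceMatches"]:
        print("Similarity:", match["Similarity"])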

Why this matters: If computers are recapitulating the same biases as humans, but with higher accuracies, then what is the ideal form of bias these computers should have? My assumption is people want them to have no bias at all – this poses an interesting challenge, since these systems are trained on datasets that themselves have labeling errors that therefore encode human biases.
  Read more: Comparing Human and Machine Bias in Face Recognition (arXiv).

####################################################

NVIDIA releases StyleGAN3 – generated images just got a lot better:
…Up next – using generative models for videos and animation…
NVIDIA and Aalto University have built and released StyleGAN3, a powerful and flexible system for generating realistic synthetic images. StyleGAN3 is a sequel to StyleGAN2 and features “a comprehensive overhaul of all [its] signal processing aspects”. The result is “an architecture that exhibits a more natural transformation hierarchy, where the exact sub-pixel position of each feature is exclusively inherited from the underlying coarse features.”

Finally, a company acknowledges the potential downsides: NVIDIA gets some points here for explicitly calling out some of the potential downsides of its research, putting it in contrast with companies (e.g, Google) that tend to bury or erase negative statements. “Potential negative societal impacts of (image-producing) GANs include many forms of disinformation, from fake portraits in social media to propaganda videos of world leaders,” the authors write. “Our contribution eliminates certain characteristic artifacts from videos, potentially making them more convincing or deceiving, depending on the application.”
    Detection: More importantly, “in collaboration with digital forensic researchers participating in DARPA’s SemaFor program, [NVIDIA] curated a synthetic image dataset that allowed the researchers to test and validate the performance of their image detectors in advance of the public release”.
  Read more: Alias-Free Generative Adversarial Networks (arXiv).
  Get the StyleGAN3 models from here (GitHub, NVIDIA Labs).

####################################################

China’s People’s Liberation Army (and others) try to break and fix image classifiers:
…Adversarial examples competition breaks things to (eventually) fix them…
An interdisciplinary group of academics and military organizations have spent most of 2021 running a competition to try and outwit image classifiers using a technology called adversarial examples. Adversarial examples are kind of like ‘magic eye’ images for machines – they look unremarkable, but encode a different image inside them, tricking the classifier. In other words, if you wanted to come up with a technology to outwit image classification systems, you’d try and get really good at building adversarial examples. This brings me to the author list of the research paper accompanying this competition:
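
  The competition focuses on ‘unrestricted’ attacks, which are far more sophisticated than the textbook method, but for intuition here’s a minimal sketch of the simplest adversarial-example recipe (FGSM): nudge each pixel in the direction that increases the classifier’s loss.

    import torch
    import torch.nn.functional as F

    def fgsm(model, image, label, epsilon=0.01):
        # One-step attack: perturb the image along the sign of the loss gradient,
        # then clamp back into the valid [0, 1] pixel range.
        image = image.clone().requires_grad_(True)
        loss = F.cross_entropy(model(image), label)
        loss.backward()
        return (image + epsilon * image.grad.sign()).detach().clamp(0, 1)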

Those authors, in full: The authors are listed as researchers from Alibaba Group, Tsinghua University, RealAI, Shanghai Jiao Tong University, Peking University, University of Waterloo, Beijing University of Technology, Guangzhou University, Beihang University, KAIST, and the Army Engineering University of the People’s Liberation Army (emphasis mine). It’s pretty rare to see the PLA show up on papers, and I think that indicates the PLA has a strong interest in breaking image classifiers, and also building resilient ones. Makes you think!

What the competition did: The competition had three stages, where teams tried to build systems that could defeat an image classifier, then build systems that could defeat an unknown image classifier, then finally build systems that could defeat an unknown classifier while also producing images that were ranked as high quality (aka, hard to say they’d been messed with) by humans. Ten teams competed in the final round, and the winning team (‘AdvRandom’) came from Peking University and TTIC.

Best result: 82.76% – that’s the ‘attack success rate’ for AdvRandom’s system. In other words, four out of five of its images got through the filters and successfully flummoxed the systems (uh oh!).

What’s next? Because the competition yielded a bunch of effective systems for generating adversarial examples, the next competition will be about building classifiers that are robust to these attack systems. That’s a neat approach, because you can theoretically run these competitions a bunch of times, iteratively creating stronger defenses and attacks – though who knows how public future competitions may be. 

Why this matters: The intersection of AI and security is going to change the balance of power in the world. Therefore, competitions like this both tell us who is interested in this intersection (unsurprisingly, militaries – as shown here), as well as giving us a sense of what the frontier looks like.
  Read more: Unrestricted Adversarial Attacks on ImageNet Competition (arXiv).

####################################################

DeepMind makes MuJoCo FREE, making research much cheaper for everyone:
…What’s the sound of a thousand simulated robot hands clapping?…
DeepMind has bought MuJoCo, a widely-used physics simulator that underpins a lot of robotics research – and, unusually, it has bought it in order to make it free. You can download MuJoCo for free now, and DeepMind says in the future it’s going to develop the software as an open source project “under a permissive license”.

Why this matters: Physics is really important for robot development, because the better your physics engine, the higher the chance you can build robots in simulators then transfer them over to reality. MuJoCo has always been a widely-used tool for this purpose, but in the past its adoption was held back by the fact it was quite expensive. By making it free, DeepMind will boost the overall productivity of the AI research community.
  Read more: Opening up a physics simulator for robotics (DeepMind blog).

####################################################

Stanford builds a scalpel to use to edit language models:
…MEND lets you make precise changes on 10b-parameter systems…
Today’s large language models are big and hard to work with, what with their tens to hundreds of billions of parameters. They also sometimes make mistakes. Fixing these mistakes is a challenge, with approaches varying from stapling on expert code, to retraining on different datasets, to fine-tuning. Now, researchers with Stanford University have come up with the AI-editing equivalent of a scalpel – an approach called ‘MEND’ that lets them make very precise changes to tiny bits of knowledge within large language models.

What they did: “The primary contribution of this work is a scalable algorithm for fast model editing that can edit very large pre-trained language models by leveraging the low-rank structure of fine-tuning gradients”, they write. “MEND is a method for learning to transform the raw fine-tuning gradient into a more targeted parameter update that successfully edits a model in a single step”.
  They tested out MEND on GPT-Neo (2.7B parameters), GPT-J (6B), T5-XL (2.8B), and T5-XXL(11B), and found it “consistently produces more effective edits (higher success, lower drawdown) than existing editors”.
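
  A heavily simplified sketch of the general idea – my own abstraction, not MEND’s actual algorithm, which exploits the low-rank structure of the gradients: a learned editor turns the raw fine-tuning gradient for a single example into a one-step parameter update.

    import torch

    def apply_edit(model, editor, edit_input, edit_target, loss_fn):
        # Compute the raw fine-tuning gradient for the single example to be edited...
        loss = loss_fn(model(edit_input), edit_target)
        grads = torch.autograd.grad(loss, list(model.parameters()))
        # ...then let a learned editor network transform it into a targeted update.
        with torch.no_grad():
            for param, grad in zip(model.parameters(), grads):
                param -= editor(grad)
        return model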

Not fixed… yet: Just like with human surgery, even if you have a scalpel, you might still cut in more places than you intend to. MEND is the same. Changes enforced by MEND can sometimes lead the model to change its output “for distinct but related inputs” (though MEND seems to be less destructive and prone to errors than other systems).

Why this matters: It seems like the next few years will involve a lot of people poking and prodding increasingly massive language models (see, Microsoft’s 530 billion parameter model covered in Import AI #270), so we’re going to need tools like MEND to make it easier to get more of the good things out of our models, and to make it easier to improve them on-the-fly.
  Read more: Fast Model Editing at Scale (arXiv).
  Find out more at the MEND: Fast Model Editing at Scale paper website.

####################################################

AI Ethics, with Abhishek Gupta

…Here’s a new Import AI experiment, where Abhishek from the Montreal AI Ethics Institute and the AI Ethics Brief writes about AI ethics, and Jack will edit them. Feedback welcome!…

What are some fundamental properties for explainable AI systems?

… explainable AI, when done well, spans many different domains like computer science, engineering, and psychology … 

Researchers from the Information Technology Laboratory at the National Institute of Standards and Technology (NIST) propose four traits that good, explainable AI systems should have. These principles are: explanation, meaningfulness, explanation accuracy, and knowledge limits.

Explanation: A system that delivers accompanying evidence or reasons for outcomes and processes. The degree of detail (sparse to extensive), the degree of interaction between the human and the machine (declarative, one-way, and two-way), and the format of the explanation (visual, audio, verbal, etc.) are all important considerations in the efficacy of explainable AI systems.

Meaningfulness: A system that provides explanations that are understandable to the intended consumers. The document points out how meaningfulness itself can change as consumers gain experience with the system over time.

Explanation Accuracy: This requires staying true to the reason for generating a particular output or accurately reflecting the process of the system. 

Knowledge Limits: A system that only operates under conditions for which it has been designed and it has sufficient confidence in its output. “This principle can increase trust in a system by preventing misleading, dangerous, or unjust outputs.”

Why it matters: There are increased calls for explainable AI systems, either because of domain-specific regulatory requirements, such as in finance, or through broader incoming legislation that mandates trustworthy AI systems, part of which is explainability. There are many different techniques that can help to achieve explainability, but having a solid framework to assess various approaches and ensure comprehensiveness is going to be important to get users to trust these systems. More importantly, in cases where little guidance is provided by regulations and other requirements, such a framework provides adequate scaffolding to build confidence in one’s approach to designing, developing, and deploying explainable AI systems that achieve their goals of evoking trust in their users.
  Read more: Draft NISTIR 8312 – Four Principles of Explainable Artificial Intelligence (NIST).

####################################################

Tech Tales:

Generative Fear
[America, 2028]

It started with ten movie theatres, a captive audience, and a pile of money. That’s how the seeds of the Fear Model (FM) were laid.

Each member of the audience was paid about double the minimum wage and, in exchange, was wired up with pulse sensors and the cinema screen was ringed by cameras, which were all trained on the pupils of the audience members. In this way, the Fear Model developers could build a dataset that linked indications of mental and psychological distress in the audience with moments transpiring onscreen in a variety of different films.

Ten movie theatres were rented, and they screened films for around 20 hours a day, every day, for a year. This generated a little over 70,000 hours of data over the course of the year – data which consisted of footage from films, paired with indications of when people were afraid, aroused, surprised, shocked, and so on. They then sub-sampled the ‘fear’ moments from this dataset, isolating the parts of the films which prompted the greatest degree of fear/horror/anxiety/shock.

With this dataset, they trained the Fear Model. It was a multimodal model, trained on audio, imagery, and also the aligned scripts from the films. Then, they used this model to ‘finetune’ other media they were producing, warping footage into more frightening directions, dosing sounds with additional screams, and adding little flourishes to scripts that seemed to help actors and directors wring more drama out of their material.

The Fear Model was subsequently licensed to a major media conglomerate, which is reported to be using it to adjust various sound, vision, and text installations throughout its theme parks.

Things that inspired this story: Generative adversarial networks; distillation; learning from human preferences; crowdwork; the ever-richer intersection of AI and entertainment.