Import AI 208: Google trains a vision system in 1 minute; Gender+AI = bad; Ubisoft improves character animation with ML

by Jack Clark

Genderify: Uh oh – AI & Bias & Gender
…AI + Gender Identification = No, this is a bad idea…
Last week, an AI startup called Genderify launched on Product Hunt and, within days, shut down its website and deleted its Twitter account (though it still has a microsite on Product Hunt). What happened? The short answer is the service tried to predict gender from names and titles. After it launched, many users demonstrated a series of embarrassing failures of the system, which neatly illustrated why a complex topic like gender is not one that should be approached with a machine learning blunderbuss.

Why this matters: Using automated tools to infer the gender of someone is a bad idea – that’s partially why the global standard is to ask the user/civilian/employee to self-identify their gender out of a menu of options when gender information is preferred. Think of how complex human names are and how often you’ve heard a person’s name and mis-gendered them in your head. Now add to that the fact that people with certain gendered names may consciously use a different pronoun to the one suggested by the name. Does an AI system have any good way, today, to guess this stuff with decent accuracy? No it doesn’t! So if you build tools to do gender classification, you’re basically committing yourself to getting a non-trivial % of your classifications wrong.
(There are some use cases where this may be somewhat okay, like doing an automated scan of all the names of all the faculty in a country and using that to provide some very basic data on the potential gender difference. I expect there to be relatively few use cases like this, though, and tools like Genderify are unlikely to be that helpful.)
Read more: Service that uses AI to identify gender based on names looks incredibly biased (The Verge).

###################################################

Things are getting faster – Google sets new MLPerf AI training record:
…TPU pods go brrrrr…
Google has set performance records in six out of eight MLPerf benchmarks, defining the new (public) frontier of large-scale compute performance. MLPerf is a benchmark that measures how long it takes to train popular AI components like residual networks, Mask R-CNN, a transformer, a BERT model, and more.

Multi-year progress: These results show the time it takes Google to train a ResNet50 network to convergence against ImageNet, giving us performance for a widely used, fairly standard AI task:
– 0.47 minutes: July 2020, MLPerf 0.7.
– 1.28 minutes: June 2019, MLPerf 0.6.
– 7.1 minutes: May 2018, MLPerf 0.5.
– Hours – it used to take hours to train this stuff, back in 2017, even at the frontier. Things have sped up a lot.
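To put numbers on the speedup, here is a minimal sketch that turns the reported times into round-over-round factors. It assumes all three figures are in minutes (so the MLPerf 0.7 result is 0.47 minutes, roughly 28 seconds); the `results` list and the `speedups` helper are illustrative names, not part of MLPerf.

```python
# Reported ResNet-50 time-to-train on ImageNet, by MLPerf round.
# Assumption: every figure is in minutes, including the 0.47 result.
results = [
    ("MLPerf 0.5 (May 2018)", 7.1),
    ("MLPerf 0.6 (June 2019)", 1.28),
    ("MLPerf 0.7 (July 2020)", 0.47),
]


def speedups(rows):
    """Return (label, factor vs. previous round, factor vs. first round)."""
    first_time = rows[0][1]
    out = []
    for (_, prev_time), (label, cur_time) in zip(rows, rows[1:]):
        out.append((label, prev_time / cur_time, first_time / cur_time))
    return out


table = speedups(results)
for label, step, total in table:
    print(f"{label}: {step:.1f}x vs previous round, {total:.1f}x vs MLPerf 0.5")
```

Under that assumption, each round is several times faster than the one before it, and the compounding is what takes the task from minutes to well under one.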

Why this matters: We’re getting way better at training large-scale networks faster. This makes it easier for developers to iterate while exploring different architectures. It also makes it easier to rapidly retrain networks on new data. Competitions like MLPerf illustrate the breakneck pace of AI progress – a consequence of years of multibillion-dollar capital expenditures by large technology companies, coupled with the nurturing of vast research labs and investment in frontier processors. All of this translates to a general speedup of the cadence of AI development, which means we should expect to be more surprised by progress in the future.
Read more: Google breaks AI performance records in MLPerf with world’s fastest training supercomputer (Google blog).

###################################################

Trouble in NLParadise – our benchmarks aren’t very good:
…How are our existing benchmarks insufficient? Let me count the ways…
Are benchmarks used to analyze language systems telling us the truth? That’s the gist of a new paper from researchers with Element AI and Stanford University, which picks apart the ‘ecological validity’ of how we test language user interfaces (personal assistants, customer support bots, etc). ‘Ecological validity’ is a concept from psychology that “is a special case of external validity, specifying the degree to which findings generalize to naturally occurring scenarios”.

Five problems with modern language benchmarks:
The researchers identify five major problems with the ways that we develop and evaluate advanced language systems today. These are:
– Synthetic language: Projects like BabyAI generate simple instructions using an environment with a restricted or otherwise synthetic dictionary. This means that as you try to increase the complexity of the instructions you express, you can reduce the overall intelligibility of the system. “Especially for larger domains it becomes increasingly difficult and tedious to ensure the readability of all questions or instructions”.
– Artificial tasks: Many research benchmarks “do not correspond to or even resemble a practically relevant LUI setting”. Self explanatory.
– Not working with potential users of the system: For example, the visual question answering (VQA) competition teaches computers to answer questions about images. However, “although the visual question answering task was at least partly inspired by the need to help the visually impaired, questions were not collected from blind people. Instead, human subjects with 20/20 vision were primed to ask questions that would stump a smart robot”.
Another example of this is the SQuAD dataset, which was “collected by having human annotators generate questions about Wikipedia articles… these crowdworkers had no information need, which makes it unclear if the resulting questions match the ones from users looking for this information”.

– Scripts and priming: Some tests rely on scripts that constrain the type of human-computer interaction, e.g. by being customized for a specific context like making reservations. Using scripts like this can trap the systems into preconceived notions of operation that might not work well, and subjects that generate the underlying data might be primed by what the computer says to respond in a similar style (“For example, instead of saying ‘I need a place to dine at in the south that serves chinese’, most people would probably say ‘Chinese restaurant’ or ‘Chinese food’”).

– Single-turn interfaces: Most meaningful dialog interactions involve several exchanges in a conversation, rather than just one. Building benchmarks that solely consist of single questions and responses, or other single turns of dialog, might be artificially limiting, and could create systems that don’t generalize well.

So, what should people do? The language system community should build benchmarks that have what people could call ‘ecological validity’, which means they’re built with the end users in mind, in complex environments that don’t generate contrived data. The examples the authors give “are the development of LUI benchmarks for popular video game environments like Minecraft or for platforms that bundle user services on the Internet of Things …[or] work on LUIs that enable citizens to easily access statistical information published by governments.”
Read more: Towards Ecologically Valid Research on Language User Interfaces (arXiv).

###################################################

Boston Dynamics + vehicle manufacturing:
…Rise of the yellow, surprisingly cute machines…
Boston Dynamics’ ‘Spot’ quadruped is being used by the Ford Motor Company to surveil and laser-scan its manufacturing facilities, helping it continuously map the space. This is one of the first in-the-wild uses of the Spot robots I’ve seen. Check out the video to see the robot and some of the discussion by its handler about what it’s like to have a human and a robot work together.
Watch the video: Fluffy the Robot Dog Feature from Ford (YouTube).

###################################################

How Ubisoft uses ML for video games:
…Motion matching? How about Learned Motion Matching for a 10X memory efficiency gain?…
Researchers with game company Ubisoft have developed Learned Motion Matching, a technique that uses machine learning to reduce the memory footprint for complex character animations.

Learned Motion Matching: Typical motion matching systems work “by repeatedly searching the dataset for a clip that, if played from the current location, would do a better job of keeping the character on the path than the current clip”. Learned motion matching aims to simplify the computationally expensive search part of the operation by swapping out the search for various ML components, which provide an approximation of the search space.
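For intuition, here is a rough sketch of the baseline search that Learned Motion Matching is trying to replace. It is not Ubisoft's implementation: the feature vectors, the Euclidean cost, and the `motion_match` function are all assumptions made for illustration. The idea is simply to compare every frame in the animation database against a query describing where the character should go, and switch clips only if some frame beats continuing the current one.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def motion_match(features, query, current_frame):
    """Pick the frame to play next (hypothetical brute-force search).

    features: one feature vector per animation frame in the database.
    query: features describing the desired path for the character.
    current_frame: index of the frame currently playing.
    """
    # Default choice: keep playing the current clip.
    next_frame = min(current_frame + 1, len(features) - 1)
    best_frame = next_frame
    best_cost = distance(features[next_frame], query)
    # Search the whole database for a frame that fits the path better.
    for frame, feat in enumerate(features):
        cost = distance(feat, query)
        if cost < best_cost:
            best_frame, best_cost = frame, cost
    return best_frame


# Toy database: frames 0-1 head "north", frames 2-3 head "east".
features = [(0.0, 1.0), (0.1, 1.0), (1.0, 0.0), (1.0, 0.1)]
print(motion_match(features, (1.0, 0.0), current_frame=0))
```

The cost of this approach is that the full feature database has to sit in memory and be scanned repeatedly, which is the part the learned components are meant to approximate.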

10X more efficient:
– 5.3MB: Memory footprint of a Learned Motion Matching system.
– 52.1MB: Memory footprint of a typical motion matching system.

Why this matters: Approximation is a powerful thing, and a lot of the more transformative uses of machine learning are going to come from building these highly efficient function approximators. This research is a nice example of how you can apply this generic approximation capability. It also suggests games are going to get better, as things like this make it easier to support a larger and more diverse set of behaviors for characters. “Learned Motion Matching is a powerful, generic, systematic way of compressing Motion Matching based animation systems that scales up to very large datasets. Using it, complex data hungry animation controllers can be achieved within production budgets,” they write.
Read more: Introducing Learned Motion Matching (Ubisoft Montreal).
Read the research paper here (PDF).
Watch a video about the research here: SIGGRAPH 2020 Learned Motion Matching (Ubisoft, YouTube).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

Revisiting the classic arguments for AI risk
Ben Garfinkel, researcher at Oxford’s Future of Humanity Institute, is interviewed on the 80,000 Hours podcast. The conversation focuses on his views of the ‘classic’ arguments for AI posing an existential risk to humanity, particularly those presented by Nick Bostrom and Eliezer Yudkowsky. A short summary can’t do justice to the breadth and nuance of the discussion, so I encourage readers to listen to the whole episode.

A classic argument: Humans are the dominant force on Earth, and we owe this to our superior cognitive abilities. We are trying to build AI systems with human-level cognitive abilities, and it’s going quite well. If we succeed in building human-level AI, we should expect it to be quickly followed by greater-than-human-level AI. At this point, humans would cede our status as the most cognitively-advanced entity. Without a plan for ensuring our AI systems do what we want, there is a risk that it will be them, and not us, that call the shots. If the goals of these AI systems come apart from our own, even quite subtly, things will end badly for humans. So the development of advanced AI poses a profound risk to humanity’s future.

Discontinuity: One objection to this style of argument is to put pressure on the assumption that AI abilities will scale up very rapidly from their current level, to human-, and then super-human levels. If instead there is a slower transition (e.g. on the scale of decades), this gives us much more time to make a plan for retaining control. We might encounter ‘miniature’ versions of the problems we are worried about (e.g. systems manipulating their operators) and learn how to deal with them before the stakes get too high; we can set up institutes devoted to AI safety and governance; etc.

Orthogonality: Another important assumption in the classic argument is that, in principle, the capabilities of a system impose no constraints on the goals it pursues. So a highly intelligent AI system could pursue pretty much any goal, including those very different from our own. Garfinkel points out that while this seems right at a high level of abstraction, it doesn’t chime with how the technology is in fact developed. In practice, building AI systems is a parallel process of improving capabilities and better specifying goals — they are both important ingredients in building systems that do what we want. On the most optimistic view, one might think it likely that AI safety will be solved by default on the path to building AGI.

Matthew’s view: I think it’s right to revisit historic arguments for AI risk, now that we have a better idea of what advanced AI systems might look like, and how we might get there. The disagreements I highlight are less about whether advanced AI has the potential to cause catastrophic harm, and more about how likely we are to avoid these harms (i.e. what is the probability of catastrophe). As Garfinkel notes, a range of other arguments for AI risk have been put forward more recently, some of which are more grounded in the specifics of the deep learning paradigm (see Import 131). Importantly, he believes that we are investing far too little in AI safety and governance research — as he points out, humanity’s cumulative investment into avoiding AI catastrophes is less than the budget of the 2017 movie ‘The Boss Baby’.

Read more: Ben Garfinkel on scrutinising classic AI risk arguments (80,000 Hours).
Read more: AMA with Ben Garfinkel (EA Forum).

Philosophers on GPT-3:

Daily Nous has collected some short essays from philosophers, on the topic of OpenAI’s GPT-3 language model. GPT-3 is effectively an autocomplete tool — a skilled predictor of which word comes next. And yet it’s capable, to varying degrees, of writing music; playing chess; telling jokes; and talking about philosophy. This is surprising, since we think our own ability to do all these things is due to capacities that GPT-3 lacks — understanding, perception, agency.

Generality: GPT-3’s skill for in-context learning allows it to perform well at this wide range of tasks; in some cases as well as fine-tuned models. In Amanda Askell’s phrase, it is a “renaissance model”. What should we make of these glimmers of generality? Can we predict the range of things GPT-3, and future language models, will be able to do? What determines the limits of their generality? Is GPT-3 really generalizing to each new task, or synthesizing things it’s already seen—is there a meaningful difference?

Mimicry: In some sense GPT-3 is no more than a talented mimic. When confronted with a new problem, it simply says the things people tend to say. What’s surprising is how well this seems to work. As we scale up language models, how close can they get to perfect mimicry? Can they go beyond it? What differentiates mimicry and understanding?

Matthew’s view: It’s exciting to see how language models might shed light on stubborn philosophical problems that have hitherto been the domain of armchair speculation. I expect there’ll be more and more fruitful work to be done at the intersection of state-of-the-art AI and philosophy. If you find these questions as interesting as I do, you might enjoy Brian Christian’s excellent book, ‘The Most Human Human’.

GPT-3 replies: Naturally, people have prompted GPT-3 to reply to the philosophers. One, via Raphaël Millière, contained some quite moving sentiments — “Despite my lack of things you prize, you may believe that I am intelligent. This may even be true. But just as you prize certain qualities that I do not have, I too prize other qualities in myself that you do not have.”

Read more: Philosophers on GPT-3 (Daily Nous).

###################################################

Tech Tales:

[2025: Boston, near the MIT campus]

The Interrogation Room

It was 9pm and the person in the room was losing it. Which meant I was out $100. They’d seemed tougher, going in.
“What can I say, I’m confident in my work,” Andy said, as I handed him the money.

They got the confession a little after that. Then they took the person to the medical facility. There, they’d spend a week under observation. If they’d really lost it, they might stay longer. We didn’t make bets on that kind of thing, though.

My job was to watch the interrogation and take observational notes. The AI systems did most of the work, but the laws were structured so you always had a “human-in-the-loop”. Andy was a subcontractor who worked for a tech company that helped us build the systems. He could’ve watched the interrogations remotely, but he and I would watch them together. Turns out we both liked to think about people and make bets on them.

The person had really messed up the room this time. We watched on the tape as they pounded the cell walls with their fists. Watched as they clutched their hand after breaking a couple of knuckles. Watched as they punched the wall again.

“Wanna go in?” I said.
“You read my mind,” Andy said.

The room smelled of sweat and electricity. There was a copper tone from the blood and the salt. There was one chair and a table with a microphone built into it. I stuck my face close to the microphone and closed my eyes, imagining I could feel the residual warmth of the confession.

“It still feels strange that there wasn’t a cop in here,” I said.
“With this tech, you don’t need them,” Andy said. “You just wind people up and let them go. Works four out of five times.”
“But is it natural?”
“The only unnatural part is the fact it happens in this room,” he said, with a faraway look in his eyes. “Pretty soon we’ll put the stuff on people’s phones. Then they’ll just make their confessions at home and we won’t need to do anything. Maybe start thinking how to bet on a few hundred people at a time, eh?” He winked at me, then left.

I stayed in the room for another half hour. I closed my eyes and stood still and tried to imagine what the person had heard in here. I visualized the notes in my head:

Synthesized entities exposed to subject and associated emotional demeanor and conversational topic:
– First wife (deceased, reconstructed from archives): ‘loving/compassionate’ mode; discussed guilt and release of guilt.
– Grandfather (deceased, reconstructed from archives & shared model due to insufficient data): ‘scolding/wisdom’ mode; discussed time and cruelty of time.
– Second wife: ‘fearful/pleading’ mode; discussed how her fear could be removed via confession.
– Victim (deceased, reconstructed from archives): ‘sad/reflective’ mode; talked about the life they had planned and their hopes.

I tried to imagine these voices – I could only do this by imagining them speaking in the voices of those who were close to me. Instead of the subject’s first wife, it was my ex girlfriend; my grandfather instead of theirs; my wife instead of their second wife; someone I hurt once instead of their victim.

And when I heard these voices in my head I asked myself: how long could I have lasted, in this room? How long before I would punch the walls? How long before I’d fight the room to fight myself into giving my own confession?

Things that inspired this story: Generative models; voice synthesis; text synthesis; imagining datasets so large that we have individual ‘people vectors’ to help us generate different types of people and finetune them against a specific context; re-animation and destruction; hardboiled noir; Raymond Chandler.
