Import AI 227: MAAD-Face; GPT2 and Human Brains; Facebook detects Hateful Memes

by Jack Clark

University of Texas ditches algorithm over bias concerns:
….Gives an F to the GRADE software…
The University of Texas at Austin has stopped using GRADE, a piece of software for screening applicants to the PhD program in its computer science department. UT Austin used GRADE between 2013 and 2019, and stopped using it in early 2020, according to reporting from The Register. Some of the developers of GRADE think it doesn’t have major issues with regard to manifesting bias along racial or gender lines, but others say it could magnify existing biases present in the decisions made by committees of humans.

Why this matters: As AI has matured rapidly, it has started being integrated into all facets of life. But some parts of life probably don’t need AI in them – especially those that involve making screening determinations about people in ways that could have an existential impact on them, like admission to graduate programs.
  Read more: Uni revealed it killed off its PhD-applicant screening AI just as its inventors gave a lecture about the tech (The Register).

###################################################

Element AI sells to ServiceNow:
…The great Canadian AI hope gets sold for parts…
American software company ServiceNow has acquired Element AI; the purchase looks like an acquihire, with ServiceNow executives stressing the value of Element AI’s talent, rather than any particular product the company had developed.

Why this is a big deal for Canada: Element AI was formed in 2016 and designed as a counterpoint to the talent-vacuums of Google, Facebook, Microsoft, and so on. It was founded with the ambition of becoming a major worldwide player, and a talent magnet for Canada. It even signed on Yoshua Bengio, one of the Turing Award winners responsible for the rise of deep learning, as an advisor. Element AI raised more than $250 million in its lifespan. Now it has been sold, reportedly for less than $400 million, according to the Globe and Mail. Shortly after the deal closed, ServiceNow started laying off a variety of Element AI staff, including its public policy team.

Why this matters: As last week’s Timnit Gebru situation highlights, AI research is at present concentrated in a small number of private sector firms, which makes it inherently harder to do research into different forms of governance, regulation, and oversight. During its lifetime, Element AI did some interesting work on data repositories, and I’d run into Element AI people at various government events where they’d be encouraging nations to build shared data repositories for public goods – a useful idea. Element AI being sold to a US firm increases this concentration, and also reduces the diversity of experiments being run in the space of ‘potential AI organizations’ and potential AI policy. I wish everyone at Element AI luck and hope Canada takes another swing at trying to form a counterpoint to the major powers of the day.
  Read more: Element AI acquisition brings better, smarter AI capabilities for customers (ServiceNow).

###################################################

Uh oh, a new gigantic face dataset has appeared:
…123 million labels for 3 million+ photographs…
German researchers have developed MAAD-Face, a dataset containing 123 million labels applied to more than 3 million images of 9,000 people. MAAD-Face was built by researchers at the Fraunhofer Institute for Computer Graphics and is designed to substitute for other labeled datasets like CelebA and LFW. It also, like any dataset involving a ton of labeled data about people, introduces a range of ethical questions.

But the underlying dataset might be offline? MAAD-Face is based on VGGFace2, a massive facial recognition dataset. VGGFace2 is currently offline for unclear reasons, potentially due to controversies associated with the dataset. I think we’ll see more examples of this – in the future, perhaps some % of datasets like this will be traded surreptitiously via torrent networks. (Today, datasets like DukeMTMC and ImageNet-ILSVRC-2012 are circulating via torrents, having been pulled off public repositories following criticism relating to biases or other issues with the data.)

What’s in a label? MAAD-Face has 47 distinct labels that can be applied to images, ranging from non-controversial subjects (are they wearing glasses? Is their forehead visible? Can you see their teeth?) to ones that involve significant subjectivity (whether the person is ‘attractive’, ‘chubby’, or ‘middle aged’), to ones where it’s dubious whether we should be assigning the label at all (e.g., ones that assign a gender of male or female, or that classify people into races like ‘asian’, ‘white’, or ‘black’).
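To make that structure concrete, here’s a minimal, hypothetical sketch of how you might inspect such a label matrix – assuming the annotations ship as a CSV with one row per image and one column per attribute, a common convention for face-attribute datasets (the actual release format on GitHub may differ):

    import pandas as pd

    # Hypothetical layout: one row per image, one column per attribute,
    # with values 1 (positive), -1 (negative), 0 (undefined). Check the
    # actual MAAD-Face release for the real format and value coding.
    labels = pd.read_csv("maad_face_labels.csv", index_col="filename")
    print(labels.shape)  # e.g. (n_images, 47)

    # How often is the subjective 'Attractive' attribute asserted?
    if "Attractive" in labels.columns:
        print((labels["Attractive"] == 1).mean())

    # Separate contested attributes from the rest before downstream use.
    contested = ["Male", "Asian", "White", "Black", "Attractive", "Chubby"]
    other = [c for c in labels.columns if c not in contested]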

Why this matters – labels define culture: As more of the world becomes classified and analyzed by software systems, the labels we use to build the machines that do this classification matter more and more. Datasets like MAAD-Face both gesture at the broad range of labels we’re currently assigning to things, and also should prepare us for a world where someone uses computer vision systems to do something with an understanding of ‘chubby’, or other similarly subjective labels. I doubt the results will be easy to anticipate.
  Read more: MAAD-Face: A Massively Annotated Attribute Dataset for Face Images (arXiv).
  Get the dataset from here (GitHub).
  Via Adam Harvey (Twitter), who works on projects tracking computer vision like ‘MegaPixels’ (official site).

###################################################

Is GPT2 like the human brain? In one way – yes!
…Neuroscience paper finds surprising overlaps between how humans approach language and how GPT2 does…
Are contemporary language models smart? That’s a controversial question. Are they doing something like the human brain? That’s an even more controversial question. But a new paper involving gloopy experiments with real human brains suggests the answer could be ‘yes’, at least when it comes to how we predict words in sentences and use our memory to improve our predictions.

But, before the fun stuff, a warning: Picture yourself in a dark room with a giant neon sign in front of you. The sign says CORRELATION != CAUSATION. Keep this image in mind while reading this section. The research is extremely interesting, but also the sort of thing prone to wild misinterpretation, so Remember The Neon Sign while reading. Now…

What they investigated: “Modern deep language models incorporate two key principles: they learn in a self-supervised way by automatically generating next-word predictions, and they build their representations of meaning based on a large trailing window of context,” the researchers write. “We explore the hypothesis that human language in natural settings also abides by these fundamental principles of prediction and context”.

What they found: For their experiments, they used three types of word features (arbitrary embeddings, GloVe embeddings, and GPT2’s contextual embeddings) and tested how well each could predict the neural activity of people as they listened to language and anticipated upcoming words. Their findings are quite striking – GPT2 assigns probabilities to the next words in a sentence that closely track the probabilities humans assign, and as you increase the context window (the number of words the person or algorithm sees before making a prediction), performance improves further, and human and algorithmic answers continue to agree.
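For readers who want to see what ‘assigning probabilities to next words’ looks like in practice, here’s a minimal sketch using the open source HuggingFace Transformers library – a generic illustration of GPT2 next-word prediction, not the paper’s actual analysis code:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    # The trailing context the model sees; the paper varies this window
    # size to study how added context improves prediction.
    context = "The children walked down to the"

    with torch.no_grad():
        inputs = tokenizer(context, return_tensors="pt")
        logits = model(**inputs).logits               # (1, seq_len, vocab)
        probs = torch.softmax(logits[0, -1], dim=-1)  # next-word distribution

    # Top 5 candidate next words -- the kind of probabilities the paper
    # compares against human next-word guesses.
    top = torch.topk(probs, k=5)
    for p, idx in zip(top.values, top.indices):
        print(f"{tokenizer.decode(idx.item()):>12s}  {p.item():.3f}")

To mimic the paper’s context-window manipulation, you could truncate ‘context’ to its last N words and watch how the distribution changes as N grows.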

Something very interesting about the brain: “On the neural level, by carefully analyzing the temporally resolved ECoG responses to each word as subjects freely listened to an uninterrupted spoken story, our results suggest that the brain has the spontaneous propensity (without explicit task demands) to predict the identity of upcoming words before they are perceived”, they write. And their experiments show that the human brain and GPT2 seem to behave similarly here.

Does this matter? Somewhat, yes. As we develop more advanced AI models, I expect they’ll shed light on how the brain does (or doesn’t) work. As the authors note here, we don’t know the mechanism via which the brain works (though we suspect it’s likely different to some of the massively parallel processing that GPT2 does), but it is interesting to observe similar behavior in both the human brain and GPT2 when confronted with the same events – they’re both displaying similar traits I might term cognitive symptoms (which doesn’t necessarily imply underlying cognition). “Our results support a paradigm shift in the way we model language in the brain. Instead of relying on linguistic rules, GPT2 learns from surface-level linguistic behavior to generate infinite new sentences with surprising competence,” writes the Hasson Lab in a tweet.
  Read more: Thinking ahead: prediction in context as a keystone of language in humans and machines (bioRxiv).
  Check out this Twitter thread from the Hasson Lab about this (Twitter).

###################################################

Facebook helps AI researchers detect hateful memes:
…Is that an offensive meme? This AI system thinks so…
The results are in from Facebook’s first ‘Hateful Memes Challenge’ (Import AI: 198), and it turns out AI systems are better than we thought they’d be at telling offensive memes from inoffensive ones. Facebook launched the competition earlier this year; 3,300 participants entered, and the top-scoring team achieved an AUCROC of 0.845 – that compares favorably to the AUCROC of 0.714 for the top-performing baseline system that Facebook developed at the start of the competition.
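For readers unfamiliar with the metric: AUCROC measures how well a classifier ranks positive examples above negative ones, where 0.5 is chance and 1.0 is perfect. Here’s a minimal sketch of computing it for a hateful/not-hateful classifier with scikit-learn, using made-up labels and scores:

    from sklearn.metrics import roc_auc_score

    # Toy data: 1 = hateful, 0 = benign, plus model confidence scores.
    y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
    y_score = [0.91, 0.20, 0.30, 0.80, 0.45, 0.10, 0.55, 0.35]

    print(roc_auc_score(y_true, y_score))  # 0.875 on this toy data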

What techniques they used: “The top five submissions employed a variety of different methods including: 1) ensembles of state-of-the-art vision and language models such as VILLA, UNITER, ERNIE-ViL, VL-BERT, and others; 2) rule-based add-ons, and 3) external knowledge, including labels derived from public object detection pipelines,” Facebook writes in a blog post about the challenge.

Why this matters: Competitions are one way to generate signal about the maturity of a technology in a given domain. The Hateful Memes Challenge is a nice example of how a well-posed question and associated competition can lead to a meaningful improvement in capabilities – see the 13-point absolute improvement in AUCROC over the baseline for this competition. In the future, I hope a broader set of organizations host and run a bunch more competitions.
  Read more: Hateful Memes Challenge winners (Facebook Research blog).

###################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

$50,000 AI forecasting tournament:
Metaculus, an AI forecasting community and website, has announced an AI forecasting tournament, starting this month and running until February 2023. There will be questions on progress across ~30 AI benchmarks, over 6-month, 12-month, and 24-month time horizons. The tournament has a prize pool of $50,000, which will be paid out to the top forecasters, and is being hosted in collaboration with the Open Philanthropy Project.

Existing forecasts: The tournament questions have yet to be announced, so I’ll share some other forecasts from Metaculus (see also Import 212). Metaculus users currently estimate: 70% that if queried, the first AGI system claims to be conscious; 25% that photonic tensors will be widely available for training ML models; 88% that an ML model with 100 trillion parameters will be trained by 2026; 45% that GPT language models generate less than $1bn in revenue by 2025; 25% that if tested, GPT-3 demonstrates text-based intelligence parity with human 4th graders.

Matthew’s view: As regular readers will know, I’m very bullish on the value of AI forecasting. I see foresight as a key ingredient in ensuring that AI progress goes well. While the competition is running, it should provide good object-level judgments about near-term AI progress. As the results are scored, it might yield useful insights about what differentiates the best forecasts/forecasters. I’m excited about the tournament, and will be participating myself.
  Pre-register for the tournament here.
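Metaculus uses its own scoring rules, but the classic way to score probabilistic forecasts like these is the Brier score – the mean squared difference between predicted probabilities and what actually happened, where lower is better. A quick illustrative sketch (the resolutions here are invented):

    def brier_score(forecasts, outcomes):
        """Mean squared error between probabilities and 0/1 outcomes (lower is better)."""
        return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

    # A forecaster who said 70%, 25%, and 88% on three questions that
    # resolved yes, no, yes would score:
    print(brier_score([0.70, 0.25, 0.88], [1, 0, 1]))  # ~0.056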

###################################################

Tech Tales:

The Narrative Control Department
[A beautiful house in South West London, 2030]

“General, we’re seeing an uptick in memes that contradict our official messaging around Rule 470.”
“What do you suggest we do?”
“Start a conflict. At least three sides. Make sure no one side wins.”
“At once, General.”

And with that, the machines spun up – literally. They turned on new computers and their fans revved up. People with tattoos of skeletons at keyboards high-fived each other. The servers warmed up and started to churn out their fake text messages and synthetic memes, to be handed off to the ‘insertion team’, who would pass the data into a few thousand sock puppet accounts, which would start the fight.

Hours later, the General asked for a report.
“We’ve detected a meaningful rise in inter-faction conflict and we’ve successfully moved the discussion from Rule 470 to a parallel argument about the larger rulemaking process.”
“Excellent. And what about our rivals?”
“We’ve detected a few Russian and Chinese account networks, but they’re staying quiet for now. If they’re mentioning anything at all, it’s in line with our narrative. They’re saving the IDs for another day, I think.”

That night, the General got home around 8pm, and at the dinner table his teenage girls talked about their day.
  “Do you know how these laws get made?” the older teenager said. “It’s crazy. I was reading about it online after the 470 blowup. I just don’t know if I trust it.”
  “Trust the laws that gave Dad his job? I don’t think so!” said the other teenager.
  They laughed, as did the General’s wife. The General stared at the peas on his plate and stuck his fork into the middle of them, scattering so many little green spheres around his plate.

Things that inspired this story: State-backed information campaigns; collateral damage and what that looks like in the ‘posting wars’; AI-driven content production for text, images, videos; warfare and its inevitability; teenagers and their inevitability; the fact that EVERYONE goes to some kind of home at some point in their day or week and these homes are always different to how you’d expect.