Import AI 176: Are language models full of hot air? Test them on BLiMP; Facebook notches up Hanabi win; plus, TensorFlow woes.

by Jack Clark

First, poker. Now: Hanabi! Facebook notches up another AI game-playing win:
…The secret? Better search…
Facebook AI researchers have developed an AI system that gets superhuman performance in Hanabi, a collaborative card game that requires successful players to “understand the beliefs and intentions of other players, because they can’t see the same cards their teammates see and can only share very limited hints with each other”. The Facebook-developed system is based on Pluribus, the CMU/Facebook-developed machine that defeated players in six-player no-limit Hold’em earlier this year.

Why Hanabi? In February, researchers with Google and DeepMind published a paper arguing that the card game ‘Hanabi’ should be treated as a milestone in AI research, following the success of contemporary systems in domains like Go, Atari, and Poker. Hanabi is a useful milestone because along with requiring reasoning with partial information, the game also “elevates reasoning about the beliefs and intentions of other agents to the foreground,” wrote the researchers at the time.

How they did it: The Facebook-developed system relies on what the company calls multi-agent search to work. Multi-agent search works roughly like this: agent a) looks at the state of the gameworld and tries to work out the optimal move to make using Monte Carlo rollouts to do the calculation; agent b) looks at the move agent a) made and uses that to infer what cards agent a) had, then uses this knowledge to inform the strategy agent b) picks; agent a) then looks at moves made by agent b) and studies b)’s prior moves to estimate what cards b) has and what cards b) thinks agent a) has, then uses this to generate a strategy.
This is a bloody expensive procedure, because the number of calculations balloons as the complexity of the game increases; so the Facebook researchers default to a computationally cheaper but less effective single-agent search procedure most of the time, only using multi-agent search periodically.
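
To make the rollout idea concrete, here is a minimal sketch of the cheaper single-agent flavor of search: for each candidate action, sample hidden states consistent with what the agent knows, simulate the outcome, and pick the action with the best average. This is not Facebook’s implementation. The toy game, the 0.8 hint probability, and the names `rollout_search`, `sample_hidden_state`, and `simulate` are all assumptions made up for illustration, and the belief-updating step that makes multi-agent search expensive is left out entirely.

```python
import random
from statistics import mean


def rollout_search(actions, sample_hidden_state, simulate, n_rollouts=200, seed=0):
    """Pick the action with the best average simulated return.

    sample_hidden_state(rng) draws one world consistent with what the agent
    knows; simulate(state, action, rng) plays that world out and returns a
    score. Both are hypothetical callbacks supplied by the caller.
    """
    rng = random.Random(seed)
    estimates = {}
    for action in actions:
        returns = []
        for _ in range(n_rollouts):
            state = sample_hidden_state(rng)
            returns.append(simulate(state, action, rng))
        estimates[action] = mean(returns)
    best = max(estimates, key=estimates.get)
    return best, estimates


# Toy stand-in for a hidden-information game: the agent holds one unseen card.
# Assumption: a teammate's hint means the card is a 1 with probability 0.8.
def sample_hidden_state(rng):
    return 1 if rng.random() < 0.8 else 3


def simulate(card, action, rng):
    if action == "play":
        return 1.0 if card == 1 else -1.0  # playing the wrong card costs a life
    return 0.0  # discarding is safe but scores nothing


best_action, value_estimates = rollout_search(
    ["play", "discard"], sample_hidden_state, simulate
)
print(best_action, value_estimates)
```

In this toy setup the expected value of playing is 0.8 - 0.2 = 0.6 versus 0 for discarding, so the search should favor playing. A real system would replace the fixed hint probability with beliefs inferred from teammates’ past moves, which is where the cost balloons.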

Why this matters: We’re not interested in this system because it can play Hanabi – we’re interested in understanding general approaches to improving the performance of AI systems that interact with other systems in messy, strategic contexts. The takeaway from Facebook’s research is that if you have access to enough information to be able to simulate the game state, then you can layer search strategies on top of big blobs of neural inference layers, and use this to create somewhat generic, strategic agents. “Adding search to RL can dramatically improve performance beyond what can be achieved through RL alone,” they write. “We have now shown that this approach can work in cooperative environments as well.”
   Read more: Building AI that can master complex cooperative games with hidden information (Facebook AI research blog).
   Read more: The Hanabi Challenge: A New Frontier for AI Research (Arxiv).
   Find out details about Pluribus: Facebook, Carnegie Mellon build first AI that beats pros in 6-player poker (Arxiv).

####################################################

Why TensorFlow is a bit of a pain to use:
…Frustrated developer lists out TF’s weaknesses – and they’re not wrong…
Google’s TensorFlow software is kind of confusing and hard to use, and one Tumblr user has written up some thoughts on why. The tl;dr: TensorFlow’s codebase has become a bit bloated, and Google keeps releasing new abstractions and features for the software that make it harder to understand and use.
“You know what it reminds me of, in some ways? With the profusion of backwards-incompatible wheel-reinventing features, and the hard-won platform-specific knowledge you just know will be out of date in two years? Microsoft Office,” they write.

Why this matters: As AI industrializes, more and more people are going to use the software to develop AI, and the software that captures the greatest number of developers will likely become the Linux-esque basis for a new computational reality. Therefore, it’s interesting to contrast the complaints people have about TensorFlow with the general enthusiasm about PyTorch, a Facebook-developed AI software framework that is easier to use and more flexible than TensorFlow.
Read about problems with TensorFlow here (trees are harlequins, words are harlequins, GitHub).

####################################################

Enter the GPT-2 Dungeon:
…You’re in a forest. To your north is a sentient bar of soap. Where do you go?…
AI advances are going to change gaming – both the mechanics of games, and also how narratives work in games. Already, we’re seeing people use reinforcement learning approaches to create game agents capable of more fluid movement than their predecessors. Now, with the recent advances in natural language processing, we’re seeing people use pre-trained language models in a variety of creative writing applications. One of the most evocative ones is AI Dungeon, a text-adventure game which uses a pre-trained 1.5bn-parameter GPT-2 language model to guide how the game unfolds.

What’s interesting about this: AI Dungeon lets us take the role of a player in an 80s-style text adventure game – except, the game doesn’t depend on a baroque series of hand-written narrative sections, joined together by fill-in-the-blanks madlib cards and controlled by a set of pre-defined keywords. Instead, it relies on a blob of neural stuff that has been trained on a percentage of the internet, and this blob of neural stuff is used to animate the game world, interpreting player commands and generating new narrative sections. We’re not in Kansas anymore, folks!

How it plays: The fun thing about AI Dungeon is its inherent flexibility, and most games devolve into an exercise of trying to break GPT-2 in the most amusing way (or at least, that’s how I play it!). During an adventure I went on, I was a rogue named ‘Vert’ and I repeatedly commanded my character to travel through time, but the game adapted to this pretty well, keeping track of changes in buildings as I went through time (some crumbled, some grew). At one point all the humans besides me disappeared, but that seems like the sort of risk you run when time traveling. It’s compelling stuff and, while still a bit brittle, can be quite fun when it works.

Why this matters: As the AI research community develops larger and more sophisticated generative models, we can expect the outputs of these models to be plugged into a variety of creative endeavors, ranging from music to gaming to poetry to playwriting. GPT-2 has shown up in all of these so far, and my intuition is that from 2020 onwards we’ll see the emergence of a range of AI-infused paintbrushes for a variety of different mediums. I can’t wait to see what an AI Dungeon might look like in 2020… or 2021!
Play the AI Dungeon now (AIDungeon.io).

####################################################

Mustafa swaps DeepMind for Google:
…DeepMind co-founder moves on to focus on applied AI projects…
Mustafa Suleyman, the co-founder of DeepMind, has left the company. However, he’s not going far – Mustafa will take on a new role at Google, part of the Alphabet Inc. mothership to which DeepMind is tethered. At Google, Mustafa will work on applied initiatives.

Why this matters: At DeepMind, Mustafa was frequently seen advocating for the development of socially beneficial applications of the firm’s technology, most notably in healthcare. He was also a fixture on the international AI policy circuit, turning up at various illustrious meeting rooms, private jet-cluttered airports, and no-cellphone locations. It’ll be interesting to see whether he can inject some more public, discursive stylings into Google’s mammoth policy apparatus.
Read more: From unlikely start-up to major scientific organization: Entering our tenth year at DeepMind (DeepMind blog).

####################################################

Testing the frontiers of language modeling with ‘BLiMP’:
…Testing language models in twelve distinct linguistic domains…
In the last couple of years, large language models like BERT and GPT-2 have upended natural language processing research, generating significant progress on challenging tasks in a short amount of time. Now, researchers with New York University have developed a benchmark to help them more easily compare the capabilities of these different models, helping them work out the strengths and weaknesses of different approaches, while comparing them against human performance.

A benchmark named BLiMP: The benchmark is called BLiMP, short for the Benchmark of Linguistic Minimal Pairs. BLiMP tests how well language models can classify different pairs of sentences according to different criteria, testing across twelve linguistic phenomena including ellipsis, subject-verb agreement, irregular forms, and more. BLiMP also ships with human baselines, which help us compare language models in terms of absolute scores as well as relative performance. The full BLiMP dataset consists of 67 classes of 1,000 sentence pairs (67,000 pairs in total), where each class is grouped within one of the twelve linguistic phenomena.
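
A rough sketch of how a minimal-pair evaluation like this can be scored: each pair holds one acceptable and one unacceptable sentence, and the model gets credit when it assigns higher probability to the acceptable one. The code below is an illustration rather than the BLiMP authors’ code. The tiny add-one-smoothed bigram model, the four-sentence corpus, and the names `train_bigram` and `minimal_pair_accuracy` are all stand-ins I’ve assumed so the example runs on its own; a real evaluation would plug in a neural language model’s sentence log-probabilities instead.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Return a function giving an add-one-smoothed bigram log-probability."""
    unigrams, bigrams = Counter(), Counter()
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        for prev, cur in zip(tokens, tokens[1:]):
            unigrams[prev] += 1
            bigrams[(prev, cur)] += 1
    vocab_size = len(vocab)

    def log_prob(sentence):
        tokens = ["<s>"] + sentence.lower().split()
        return sum(
            math.log((bigrams[(p, c)] + 1) / (unigrams[p] + vocab_size))
            for p, c in zip(tokens, tokens[1:])
        )

    return log_prob


def minimal_pair_accuracy(log_prob, pairs):
    """Fraction of (good, bad) pairs where the good sentence scores higher."""
    correct = sum(log_prob(good) > log_prob(bad) for good, bad in pairs)
    return correct / len(pairs)


# Hypothetical toy data: a four-sentence corpus and two agreement pairs.
corpus = ["the cat sleeps", "the cats sleep", "the dog sleeps", "the dogs sleep"]
pairs = [
    ("the cat sleeps", "the cat sleep"),
    ("the dogs sleep", "the dogs sleeps"),
]

score = train_bigram(corpus)
accuracy = minimal_pair_accuracy(score, pairs)
print(f"minimal-pair accuracy: {accuracy:.2f}")
```

Because the two sentences in a pair differ by as little as possible, a difference in score can be pinned on the specific phenomenon being tested, which is what lets accuracy be broken down per class.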

Close to human-level, but not that smart: “All models we evaluate fall short of human performance by a wide margin,” they write. “GPT-2, which performs the best, does match (even just barely exceeds) human performance on some grammatical phenomena, but remains 8 percentage points below human performance overall.” Note: the authors only test the ‘medium’ version of GPT-2 (345M parameters, versus 775M for ‘large’ and 1.5 billion for ‘XL’), so it’s plausible that other variants of the model have even better performance. By comparison, Transformer-XL, LSTM, and 5-gram models perform worse.

Not entirely human: When analyzing the models’ scores, the researchers find that “neural models correlate with each other more strongly than with humans or the n-gram model, suggesting neural networks share some biases that are not entirely human-like”. In other experiments, they show how brittle these language models can be: lacing sentences with confusing ‘attractor’ nouns causes transformer-based models to misclassify them.

Why this matters: BLiMP can function “as a linguistically motivated benchmark for the general evaluation of new language models,” they write. Additionally, because BLiMP tests for a variety of different linguistic phenomena, it can be used for detailed comparisons across different models.
Get the code (BLiMP GitHub).
Read the paper: BLiMP: The Benchmark of Linguistic Minimal Pairs (BLiMP GitHub).

####################################################

Facebook makes a PR-spin chatbot:
…From the bad, no good, terrible idea department…
Facebook has developed an in-house chatbot to help people give official company responses to awkward questions posed by relatives, according to The New York Times. The system, dubbed ‘Liam Bot’, hasn’t been officially disclosed.

Why this matters: Liam Bot is a symbol for a certain type of technocratic management mindset that is common in Silicon Valley. It feels like in the coming years we can expect multiple companies to develop their own internal chatbots to automate how employees access certain types of corporate information.
Read more: Facebook Gives Workers a Chatbot to Appease That Prying Uncle (The New York Times).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

China’s AI R&D spending has been overestimated:
The Chinese government is frequently cited as spending substantially more on publicly-funded AI R&D than the US. This preliminary analysis of spending patterns shows the gap between the US and China has been overestimated. In particular, the widespread claim that China is spending tens of billions in annual AI R&D is not borne out by the evidence.

Key findings: Drawing together the sparse public disclosures, this report estimates that 2018 government AI R&D spending was $2.0–8.4 billion. This estimate is highly uncertain, and should be interpreted as a rough order of magnitude. Spending is focussed on applied R&D, with only a small proportion going towards basic research. The estimated figures are in line with US federal spending plans for 2020.

Why it matters: International competition, particularly between the US and China, will be an important determinant of how AI is developed, and how this impacts the world. Ensuring that AI goes well will require cooperation and trust between the major powers, and this in turn relies on countries having an accurate understanding of each other’s capabilities and ambitions.

A correction: I have used the incorrect figure of ‘tens of billions’ twice in this newsletter, and shouldn’t have done so without being more confident in the supporting evidence. It’s unsettling how easy it is for falsities to become part of the received wisdom, and the effects this could have on policy decisions. More research scrutinizing the core assumptions of the narrative around AI competition would be highly valuable.
Read more: Chinese Public AI R&D Spending: Provisional Findings (CSET).

####################################################

What is ‘transformative AI’?:
The impacts of AI will scale with how powerful our AI systems are, so we need terminology to refer to different levels of future technology. This paper tries to give a clear definition of ‘transformative AI’ (TAI), a term that is widely used in AI policy.

Irreversibility: TAI is sometimes framed by comparison to historically transformative technologies. The authors propose that the key component of TAI is that it contributes to irreversible changes to important domains of society. This might fundamentally change most aspects of how we live and work, like electricity, or might impact a narrow but important part of the world, as nuclear weapons did.

Radically transformative AI: The authors define TAI as AI that leads to societal change comparable to previous general-purpose technologies, like electricity or nuclear weapons. On the other hand, radically transformative AI (RTAI) would lead to levels of change comparable to that of the agricultural or industrial revolutions, which fundamentally changed the course of human history.

Usefulness: TAI classifies technologies in terms of their effect on society, whereas ‘human-level AI’ and ‘artificial general intelligence’ do so in terms of technological capabilities. Different concepts will be useful for different purposes. Forecasting AGI might be easier than forecasting TAI, since we can better predict progress in AI capabilities than we can societal impacts. But when thinking about governance and policy, we care more about TAI, since we are interested in understanding and influencing the effects of AI on the world.
Read more: Defining and Unpacking Transformative AI (arXiv).

####################################################

Tech Tales

The Endless Bedtime Story

It’d start like this: where were we, said the father.
We were by the ocean, said one of the kids.
And there was a dragon, said the other.
And the dragon was lonely, said both of them in unison. Will the dragon make friends?

Each night, the father would smile and work with the machine to generate that evening’s story.

The dragon was looking for his friends, said the father.
It had been searching for them for many years and had flown all across North America.
Now, it had arrived at the Pacific Ocean and it wheeled above seals and gulls and blue waves on beige sands. Then, in the distance, it saw a flash of fire, and it began to fly toward it. Then the fire vanished and the dragon was sad. Night came. And as the dragon began to close its eyes, its scales cooling on a grassy hill, it saw another flash of fire in the distance. It stood amazed as it saw the flames reveal a mountain spitting fire.
Good gods, the dragon cried, you are evil. I am so alone. Are there any dragons still alive?
Then it went to sleep and dreamed of dragon sisters and dragon brothers and clear skies and wind tearing laughter into the air.
And when the dragon awoke, it saw a blue flame, flickering in the distance.
It flew towards it and cried out “Do not play tricks on me this time” and…
And then the father would stop speaking and the kids would shout “what happens next?” and the father would say: just wait. And the next night he would tell them more of the story.

Telling the endless story changed the father. At night, he’d dream of dragons flying over icy planets, clothed in ice and metal. The planets were entirely made of computers and the dragons rode thermals from skyscraper-sized heat exchanges. And in his dreams he would be able to fly towards the computers and roar them questions and they would respond with stories, laying seeds for how he might play with the machine to entertain his children when he woke.

In his dream, he once asked the computers a question for how he should continue the story, and they told him to tell a story about: true magic which would solve all the problems and the tragedies of his childhood and the childhood of his children.

That would be powerful magic, thought the father. And so with the machine, he tried to create it.

Things that inspired this story: GPT-2 via talktotransformer.com, which wrote ~10-15% of this story – see below for bolded version indicating which bits it wrote; creative writing aided by AI tools; William Burroughs’ ‘cut-up fiction’; Oulipo; spending Thanksgiving amid a thicket of excitable children who demanded entertainment.

[BONUS: The ‘AI-written’ story, with some personal observations by me of writing with this odd thing]
I’ve started to use GPT-2 to help me co-write fiction pieces, following in the footsteps of people like Robin Sloan. I use GPT-2 kind of like how I use post-it notes; I generate a load of potential ideas/next steps using it, then select one, write some structural stuff around it, then return to the post-it note/GPT-2. Though GPT-2 text only comprises ~10% of the below story, from my perspective it contributes to a few meaningful narrative moments: a disappearing fire, a dragon-seeming fire that reveals a fiery mountain, and the next step in the narrative following the dragon waking. I’ll try to do more documented experiments like this in the future, as I’m interested in documenting how people use contemporary language models in creative practices.

The Endless Bedtime Story

[Parts written by GPT-2 highlighted in bold]

Each night, the father would smile and work with the machine to generate that evening’s story.

The dragon was looking for his friends, said the father.
It had been searching for them for many years and had flown all across North America.
Now, it had arrived at the Pacific Ocean and it wheeled above seals and gulls and blue waves on beige sands. Then, in the distance, it saw a flash of fire, and it began to fly toward it. Then the fire vanished and the dragon was sad. Night came. And as the dragon began to close its eyes, its scales cooling on a grassy hill, it saw another flash of fire in the distance. It stood amazed as it saw the flames reveal a mountain spitting fire.
Good gods, the dragon cried, you are evil. I am so alone. Are there any dragons still alive?
Then it went to sleep and dreamed of dragon sisters and dragon brothers and clear skies and wind tearing laughter into the air.
And when the dragon awoke, it saw a blue flame, flickering in the distance.
It flew towards it and cried out “Do not play tricks on me this time” and…
And then the father would stop speaking and the kids would shout “what happens next?” and the father would say: just wait. And the next night he would tell them more of the story.

In his dream, he once asked the computers a question for how he should continue the story, and they told him to tell a story about: true magic which would solve all the problems and the tragedies of his childhood and the childhood of his children.
