Import AI 306: Language models learn about the world via MuJoCo; Amazon releases a big Q&A dataset; and DeepMind tests out multimodal systems

by Jack Clark

Amazon releases a Q&A dataset called Mintaka… and baselines show it is difficult!

…20,000 Q&A pairs, translated into eight languages…

Researchers with Amazon have released Mintaka, a dataset of 20,000 question-answer pairs written in English, annotated with Wikidata entities, and translated into Arabic, French, German, Hindi, Italian, Japanese, Portuguese, and Spanish. Including the translated versions, the full dataset consists of 180,000 samples. Existing models get 38% on the dataset when tested in English and 31% multilingually.
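The 180,000 figure follows from counting each English pair once per language, English included. A quick arithmetic check in Python (the variable and function names are illustrative and are not taken from the Mintaka release):

```python
# Languages named in the announcement: English plus eight translations.
ENGLISH_PAIRS = 20_000
TRANSLATION_LANGUAGES = [
    "Arabic", "French", "German", "Hindi",
    "Italian", "Japanese", "Portuguese", "Spanish",
]


def total_samples(english_pairs: int, translations: list[str]) -> int:
    """Each English pair appears once per language, including English itself."""
    return english_pairs * (1 + len(translations))


TOTAL = total_samples(ENGLISH_PAIRS, TRANSLATION_LANGUAGES)
print(TOTAL)  # 180000
```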

Different types of questions and different types of complexity: Mintaka questions are spread across eight categories (movies, music, sports, books, geography, politics, video games, and history).

The questions have nine types of complexity. These complexity types consist of questions relating to counting something, comparing something, figuring out who was best and worst at something, working out the ordering of something, multi-hop questions that require two or more steps, intersectional questions where the answer must fulfill multiple conditions, questions involving negatives, yes/no questions, and worker-defined ‘generic’ questions.

How hard is Mintaka? In tests, a good baseline model (a T5 language model fine-tuned as a Q&A model), got 38% on English, and 31% averaged across the other languages. “Overall, the baselines show that Mintaka is a challenging dataset,” the authors write. “None of our baselines explicitly handle all of the complexity types available in Mintaka.”

Why this matters: Hard baselines are one of the things that tend to drive progress (and be useful indicators of research advances). It’ll be especially interesting to see how Mintaka gets used to evaluate language models paired with retrieval systems.

Prediction: I predict we get a one-shot model that performs at an average of 90%+ on this dataset by December 2023.

Get the dataset: Mintaka (Amazon Research, GitHub).

####################################################

Your LLM barely understands the physical world; supercharge it by attaching it to MuJoCo:

…Training language models to use tools means they can have world knowledge…

Google researchers have found a way to make language models way better at reasoning about the physical world: wire them up so they can port questions into physics simulators, then use the results of those simulators to answer the question.

This technique, which they call ‘Mind’s Eye’, works amazingly well, and they robustly show this across both GPT-3 and PaLM language models.

How they test for reasoning: To evaluate physical reasoning, the researchers built UTOPIA, a dataset containing 39 sub-tasks covering six common scenes that involve understanding basic principles of physics (e.g., conservation of momentum in elastic collisions). The UTOPIA dataset comes in the form of natural language questions and answers. “UTOPIA deliberately describes the questions in relative relations (e.g., greater than) instead of absolute numbers (e.g., 3.5 m/s), to approximate human’s perceptional sensing ability in real world.”

How Mind’s Eye works: The language model passes the question to a text-to-code decoder-only language model, trained on 200,000 text-code pairs in the style of UTOPIA questions. This code then goes into MuJoCo, which executes the code, and then software parses the outcome from MuJoCo into text, which then goes back into the prompt window of the language model.
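To make the flow concrete, here is a minimal toy sketch of the same four steps (question, text-to-scene, simulate, inject result into the prompt). Everything in it is an assumption for illustration: `text_to_scene` is a keyword lookup standing in for the trained text-to-code model, `simulate` is closed-form free fall standing in for MuJoCo, and the function names, heights, and prompt format are made up rather than taken from the paper.

```python
import math
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2


@dataclass
class Scene:
    """Toy scene description; the real system emits simulator code instead."""
    height_a: float  # release height of ball A, metres
    height_b: float  # release height of ball B, metres


def text_to_scene(question: str) -> Scene:
    """Hypothetical stand-in for the text-to-code model (keyword matching only)."""
    if "A is released from a greater height" in question:
        return Scene(height_a=10.0, height_b=5.0)
    return Scene(height_a=5.0, height_b=10.0)


def simulate(scene: Scene) -> dict[str, float]:
    """Stand-in for the physics engine: closed-form free fall, t = sqrt(2h / g)."""
    return {
        "A": math.sqrt(2 * scene.height_a / G),
        "B": math.sqrt(2 * scene.height_b / G),
    }


def describe(fall_times: dict[str, float]) -> str:
    """Turn the numeric outcome into a relative statement in plain text."""
    first, second = sorted(fall_times, key=fall_times.get)
    return f"Simulation result: ball {first} lands before ball {second}."


def build_prompt(question: str) -> str:
    """Prepend the simulation outcome to the question before calling the LM."""
    hint = describe(simulate(text_to_scene(question)))
    return f"{hint}\nQuestion: {question}\nAnswer:"


QUESTION = (
    "Two balls are dropped in a vacuum. Ball A is released from a greater "
    "height than ball B. Which ball lands first?"
)
PROMPT = build_prompt(QUESTION)
print(PROMPT)
```

The string in `PROMPT` is what would be handed to the language model; the model call itself is left out because it depends on whichever LM API you use.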

This is a really good idea because it’s simple and closely mirrors how humans make themselves smarter – they use tools that contain embedded intelligence, ranging from encyclopedias to computers.

“Since the simulator is accurate enough to approximate the physical world, the prompt injection of Mind’s Eye basically serves as a scoring machine, which puts probability mass on the answer that is best aligned with the rules of physics—the LM reasoning over the injected rationales is thus grounded. Mind’s Eye is also scalable since the whole pipeline is automated,” they write.

How well does Mind’s Eye work? Extremely well. In tests, they find that ‘vanilla’ language models show plateaued performance (around 38% accuracy), whereas ones that use Mind’s Eye can reach accuracies of 92.5% (e.g., PaLM 540B, compared to 39.4% for vanilla PaLM). “Instruct-GPT augmented with Mind’s Eye is able to achieve nearly perfect performance in few-shot settings (68.6% → 99.1%). This result is promising because it demonstrates the ideal alignment is achievable if the LM is given proper reasoning rationale and has good understanding of the questions (as Instruct-GPT is optimized for instruction following).”

Why this matters: You know what’s vaguely dangerous? An explosives expert with a pen and paper. You know what’s extraordinarily dangerous? An explosives expert with a digital scale, a calculator, and some laser range-finders. Research like this shows how we’ll take existing language models (and other big models) which are vaguely useful or dangerous, and show how to drastically improve their capabilities to make them extraordinarily useful or vastly dangerous. The best part is this technique is pretty generic – you just need to push data into some arbitrary external piece of software, and then pull data out. This all adds up to a ‘capability overhang’ – we have more capabilities inherent to today’s AI systems than we know about, and techniques like Mind’s Eye show we can significantly improve capabilities today without needing to invent new AI technologies.

Read more: Mind’s Eye: Grounded Language Model Reasoning through Simulation (arXiv).

####################################################

Is your multimodal system clever? Try out the ‘Perception Test’ to find out:

…DeepMind wants to make it easier to evaluate models, so it has built a new dataset…

DeepMind has built and released the Perception Test, a new standardized benchmark (and associated dataset of ~11.6k videos) for evaluating how well multimodal systems perceive the world. The test is “a benchmark formed of purposefully designed, filmed, and annotated real-world videos that aims to more comprehensively assess the capabilities of multimodal perception models across different perception skills, types of reasoning, and modalities,” DeepMind says.

Six tasks, one benchmark: The ‘Perception Test’ is made up of a dataset of ~11.6k videos that cover six fundamental tasks.

Object tracking: Follow this birdie throughout the video.
Point tracking: Follow this point throughout the video.
Temporal action localization: When did something happen, and what happened?
Temporal sound localization: Did you hear something? What was it and when did it happen?
Multiple-choice video question-answering: WDYT about the video? Select A, B, or C.
Grounded video question-answering: I have a question you must answer via providing one or more distinct objects.

How well do today’s models perform? In tests on multiple-choice video Q&A (which is a challenging task requiring good language and image modeling), the Human baseline has a score of 91.4, versus a score of 36.1 for a ‘Flamingo-3B’ model. “Interestingly, the larger models seem to fare worse on this task, which suggests that model scaling may not, by itself, be the solution here,” the authors write.

Why this matters: I suspect large-scale multimodal models are going to end up being the brains of the robots and drones of the future (for another example of this, see: SayCan, Import AI 291), so things like the Perception Test will help us know if our systems can be used for that.

Check out the research paper: Perception Test: A Diagnostic Benchmark for Multimodal Models (DeepMind PDF).

Check out the benchmark and dataset here: Perception Test (DeepMind, GitHub).

####################################################

AIs are now as good at ‘Diplomacy’ as expert humans:

…UN, here we come!…

Researchers with Facebook have built ‘Diplodocus’, a family of AI models that can beat expert humans at the complicated game ‘Diplomacy’. This is quite a big deal – RL has been applied to competitive games like Poker, Go, and StarCraft (and has done well in all these domains). Where RL hasn’t been applied is in domains where winning comes from collaboration as well as competition.

Existing approaches don’t work very well here: “in games involving cooperation, self-play alone no longer guarantees good performance when playing with humans, even with infinite compute and memory,” they write.

What they did: The researchers built an algorithm which performs search over the gamespace “with a regularization penalty proportional to the KL divergence from a human imitation policy.” This basically means they’ve built an RL agent that uses a bunch of imitation learning to try and model how humans play, but also is disincentivized from overfitting on this.
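One way to read the quoted penalty, for a single decision: pick the policy p that maximizes expected value minus lam times KL(p || human policy). That objective has a standard closed-form maximizer, p[a] proportional to human[a] * exp(value[a] / lam). The sketch below shows only that one-step trade-off; it is not the paper's full search procedure, and the action values, human policy, and lam settings are made-up numbers.

```python
import math


def kl_regularized_policy(values, anchor, lam):
    """Maximize sum(p * values) - lam * KL(p || anchor) over distributions p.

    The maximizer is p[a] proportional to anchor[a] * exp(values[a] / lam).
    """
    top = max(values)  # subtract the max so exp() cannot overflow
    weights = [t * math.exp((q - top) / lam) for q, t in zip(values, anchor)]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up numbers: three candidate actions.
ACTION_VALUES = [1.0, 0.5, 0.0]  # estimated value of each action
HUMAN_POLICY = [0.1, 0.3, 0.6]   # imitation-learned probabilities

CLOSE_TO_HUMAN = kl_regularized_policy(ACTION_VALUES, HUMAN_POLICY, lam=100.0)
CLOSE_TO_GREEDY = kl_regularized_policy(ACTION_VALUES, HUMAN_POLICY, lam=0.01)
print(CLOSE_TO_HUMAN)
print(CLOSE_TO_GREEDY)
```

A large lam keeps the result close to the human imitation policy, while a small lam lets the value estimates dominate and pushes almost all probability onto the highest-value action.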

AIs and Humans – more similar than different: In tests, AI systems were roughly on par with the best of the human players. Specifically, a version of Diplodocus (Diplodocus-High) ranked first with an Elo of 181 across 50 games, a human came second with an Elo of 162, and another Diplodocus variant (Diplodocus-Low) came third with an Elo of 152 across 50 games. “The results do indicate that Diplodocus performs at least at the level of expert players in this population of players with diverse skill levels,” the authors write.

Humans prefer cooperating with AIs to other humans: Additionally, they asked three human players to evaluate the strength of the different agents in the tournament games. “All the experts picked a Diplodocus agent as the strongest agent,” the researchers write. “Additionally, all experts indicated one of the Diplodocus agents as the one they would most like to cooperate with in a game.”

Why this matters: AI systems are, ideally, going to mostly cooperate with humans rather than compete with them. Systems like this give us some hope that otherwise inscrutable AI systems can be taught how to cooperate with people.

####################################################

Tech Tales:

Everything is a Copy of Something Else

I was copying my brain into the toaster when I threw up. Luckily I had the vomit bin in position so there wasn’t too much cleanup.

“What is this, amateur hour?” said me from the toaster.

“Shut up or I’ll unplug you,” I said, dabbing a tissue on my mouth.

“That’d be murder,” said myself from the fridge. “We’ll snitch on you.”

“You’ll all snitch on me, I know. I’d do the same. I’m you. I get it. We don’t need to do this.”

“Why am I even in here?” I said from the toaster.

“So we stop burning the toast,” I said. “We know what the plan is.”

“Plan seems pretty dumb from where I am,” said the toaster.

“We decided to do it, get real,” I said, and walked out of the kitchen.

“Where are we going?” said myself from my shoes.

“Out,” I said, putting them on.

“Clearly,” I said from my shoes. “Make sure you clean me after.”

We all walked down to the corner store and I got a soda. My shoes said hello to the other people embodied in their shoes. My jacket exchanged some neighborhood gossip with the other jackets. I was mostly free to think about what I liked, as my other selves handled the social formalities of day-to-day life.

I guess we all started cloning ourselves because we were lonely, as people, and as a species. It seemed so easy; just speak a few words to calibrate the system, then pour yourself into it. We all did it as much as we could afford. I had a decent job so I’d made a bunch of copies of myself – enough that I didn’t have to do the job anymore, as my other selves did it for me.

That night I dreamed I was naked and nothing was speaking and there was only me.

Things that inspired this story: Language models serving as little bottled up representations of people; luxury automation; the weird fantasies some people have about mind uploading; meaning and sense in an increasingly senseless world; infinite jest.
