Import AI 314: Language models + text-to-speech; emergent cooperation in wargames; ICML bans LLM-written papers
by Jack Clark
Google discovers that a language model is also an expert clinician:
…An exciting new chapter in the capability overhang chronicles…
Google and DeepMind have done some additional training on PaLM (Google’s large-scale 540B parameter language model) to create Med-PaLM, a language model that is about as good as a human clinician on certain questions. This result is a big deal – it demonstrates that given enough data and clever training techniques, language models can approximate skilled humans. (And PaLM has far better written output than your typical doctor, whose notes often look like a drunken spider with inked feet decided to do ballet over a medical notepad).
How they did it: Med-PaLM builds on Flan-PaLM (a model itself based on PaLM, and trained to follow instructions). Google evaluated Flan-PaLM with some expert humans and identified gaps in performance on consumer medical question answering datasets, and then they tweaked the prompts on Flan-PaLM to figure out some human-engineered prompts for specific medical questions, which they apply on a context-dependent basis.
How good is it? To evaluate Med-PaLM, Google built MultiMedQA, a benchmark comprising seven medical question answering datasets, six of which are pre-existing, and one of which – HealthSearchQA – is new and consists of 3375 commonly searched health questions.
Google’s model is about as good as a medical professional (caveats apply): In tests, clinicians judged 92.6% of Med-PaLM answers to be aligned with scientific consensus, versus 92.9% for (human!) clinician-generated answers.
Breaking it down more, the model sets a new state of the art on multiple-choice question answering on the MedQA dataset, with an accuracy of 67.6% (versus 50.3% for Stanford’s just-announced PubMedGPT, a gap of roughly 17 points). It also sets a new state of the art on clinical topics within the ‘MMLU’ evaluation scheme.
Why this matters – capability overhangs are all around us: “Our results suggest that strong performance on medical question answering may be an emergent ability of LLMs combined with effective instruction prompt tuning,” Google writes. This is an example of the ‘capability overhang’ phenomenon I’ve been talking about re language models for a while – existing LLMs are far more capable than we think. All it takes is some experimentation and some experts to find new ways to phrase questions and you can wind up with extraordinarily powerful capability jumps without retraining the model.
This phenomenon also grows as you scale the models – the overhangs are getting larger and larger. Just add in some human experts to sprinkle some crumbs and let your language model do the rest: “the Med-PaLM results demonstrate that with instruction prompt tuning we have a data and parameter-efficient alignment technique useful for improving factors related to accuracy, factuality, consistency, safety, harm, and bias, helping close the gap with clinical experts and bringing these models closer to real-world clinical applications.”
Read more: Large Language Models Encode Clinical Knowledge (arXiv).
Microsoft makes better text-to-speech by pretending that speech is text:
…Maybe most problems can be configured as language modeling problems?…
Microsoft has built VALL-E, a text-to-speech system. VALL-E is a neural codec language model whose chief trick is approaching speech modeling as being similar to text modeling; rather than converting phonemes into mel-spectrograms and then waveforms (as is typical), VALL-E converts phonemes into a discrete code via some language modeling-esque tricks, then converts that into a waveform. This “enables various speech synthesis applications, such as zero-shot TTS, speech editing, and content creation,” the authors write.
What they did: VALL-E is pre-trained on 60,000 hours of English speech across 7000 unique speakers (via an existing dataset called LibriLight). “VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder,” the researchers write. “The discrete acoustic tokens derived from an audio codec model enable us to treat TTS as conditional codec language modeling, and advanced prompting-based large-model techniques (as in GPTs) can be leveraged for the TTS tasks.”
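To make the "TTS as conditional language modeling" framing concrete, here is a toy sketch of the generation loop the quote describes: acoustic tokens are produced one at a time, conditioned on the phoneme prompt (content) and on the acoustic tokens of the short enrolled recording (speaker). Everything here is an illustrative assumption rather than Microsoft's code: the names `synthesize_tokens`, `toy_next_token`, and `CODEBOOK_SIZE` are made up, the "model" is a deterministic stand-in for a trained network, and the final codec decoder step is left out.

```python
from typing import Callable, List, Sequence

CODEBOOK_SIZE = 1024  # hypothetical size of the discrete acoustic vocabulary


def toy_next_token(phonemes: Sequence[int], context: Sequence[int]) -> int:
    """Deterministic stand-in for a trained codec language model.

    A real model would return a sample from a learned distribution; this
    arithmetic only exists so the loop below can run end to end.
    """
    last = context[-1] if context else 0
    return (sum(phonemes) + 31 * len(context) + last) % CODEBOOK_SIZE


def synthesize_tokens(
    phonemes: Sequence[int],
    prompt_tokens: Sequence[int],
    next_token: Callable[[Sequence[int], Sequence[int]], int],
    max_new_tokens: int = 50,
) -> List[int]:
    """Autoregressively generate acoustic tokens.

    phonemes      -- token ids for the text to speak (content information)
    prompt_tokens -- acoustic tokens of the enrolled clip (speaker information)
    """
    context = list(prompt_tokens)  # generation continues from the enrolled audio
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token(phonemes, context)
        generated.append(token)
        context.append(token)
    # In the real system these tokens would be passed to a neural codec
    # decoder to produce a waveform; that step is omitted here.
    return generated


tokens = synthesize_tokens([3, 1, 4], [10, 20, 30], toy_next_token, max_new_tokens=5)
```

The point of the sketch is only the shape of the interface: swap the prompt tokens and the output changes while the phonemes stay fixed, which is the mechanism behind zero-shot speaker cloning from a 3-second clip.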
How well does it work: “VALL-E significantly outperforms the state-of-the-art zero-shot TTS systems in terms of speech naturalness and speaker similarity, with +0.12 comparative mean option score (CMOS) and +0.93 similarity mean option score (SMOS) improvement on LibriSpeech,” they write. Additionally, because of the way it is trained it – like language models such as GPT-3 – shows some generalization ability; “for TTS, if the model can synthesize high-quality speech for unseen speakers without fine-tuning, the model is believed to have in-context learning capability.”
Emergent capabilities: Much as with language modeling, VALL-E displays some emergent capabilities – unanticipated, cool traits, that emerge as a consequence of pre-training. “When the acoustic prompt has reverberation, VALL-E could synthesize speech with reverberation as well, whereas the baseline outputs clean speech,” they write. Additionally, “VALL-E is able to keep the same emotion of the prompt in speech synthesis, even if the model is not fine-tuned on an emotional TTS dataset.”
Why this matters – everything is a sequence, everything is emergent, everything is weird: Results like this show how a surprisingly large number of capabilities can be learned via sequence prediction tasks. It also demonstrates how sequence prediction – when done over a sufficiently large and diverse dataset – can lead to surprising, emergent capabilities. In some sense, sequence prediction over a giant and slightly hairy blob of data seems like it might even guarantee some degree of broader generalization. This has pretty profound implications, because it suggests you might want to pour an increasing chunk of different modalities into a single embedding space and attempt sequence prediction from that (as we saw with stuff like DeepMind’s GATO) and the results can be surprising and powerful. Probably nothing…
Read more: Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (arXiv).
Check out demos of the system here (VALL-E, Microsoft).
ICML bans researchers from writing papers with language models:
…Moral panic, meet academic AI publishing!…
In a surprising twist, the International Conference on Machine Learning (ICML) has banned researchers from including large swathes of text generated by language models like OpenAI’s ChatGPT “unless the produced text is presented as part of the paper’s experimental analysis”. You’d think an AI conference would be enthusiastic about people using AI to do better AI research, but ICML thinks differently.
The reasoning: In a statement, ICML said the policy is designed to prohibit authors from using text produced entirely by language models (though they’re free to use LLMs to edit or polish author-written text). “The LLM policy is largely predicated on the principle of being conservative with respect to guarding against potential issues of using LLMs, including plagiarism,” they write. “We expect this policy may evolve in future conferences as we understand LLMs and their impacts on scientific publishing better.”
The idea here is it’s hard to anticipate the consequences of using LLMs to generate text, so authors shouldn’t do it. The fact this policy is basically unenforceable feels like a pretty weird aspect of this, and the ICML response is “to investigate any potential violation of the LLM policy when a submission is brought to our attention with a significant concern about a potential violation”.
Why this matters: Back in the Middle Ages people used to go on witch hunts on the basis of little more than a rumor. This ICML policy feels like a very bizarre reaction to a new technology and it means people who don’t like certain papers can accuse the papers of being AI-generated and cause an investigation to occur. This is… extremely bad? I suppose the solution is to introduce random sperlink meestakes into your pper so it doesn’t seem so likely to be gen’d by a language model.
Read more: Clarification on Large Language Model Policy LLM (ICML).
RL agents display emergent behavior in a 2D wargame:
…The silicon players of games also figure out some wild endgame strategies…
Researchers with Tsinghua University, Shenzhen International Graduate School, the University of California at San Diego, and AI startup Parametrix.ai have trained AI agents to compete against each other in a 2D gridworld strategy game. The cool part about this research is the arrival of – you guessed it – emergent complexity as a consequence of a large-scale training process. Here, the agents end up learning surprisingly rich and intuitive battle tactics as a consequence of a three-phase RL-training process, giving us yet another example of how contemporary AI systems tend to exhibit behaviors that don’t directly map 1:1 to how they were trained or incentivized.
What they did: Here, they train AI agents in a gridworld named ‘Lux’ to compete against each other. The agents can build workers and CityTiles and need to gather and manage resources including uranium, coal, and trees, while expanding on a map against a rival. The winners are the ones that control the greatest amount of resources at the end. They train the agents via the trusty RL-algo PPO with Generalized Advantage Estimation (GAE), and use a pixel-to-pixel architecture as the centralized policy, taking both observations and actions as images and using the ResNet architecture as the backbone network.
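For readers who haven't met GAE before, here is a minimal sketch of the standard advantage recursion that PPO implementations typically use. It is the textbook formulation, not the authors' code; the function name `compute_gae` and the default `gamma` and `lam` values are assumptions chosen for illustration.

```python
from typing import List, Sequence


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[bool],
    gamma: float = 0.99,  # discount factor (a common default, assumed here)
    lam: float = 0.95,    # GAE smoothing parameter (a common default, assumed here)
) -> List[float]:
    """Generalized Advantage Estimation over one rollout.

    `values` must hold one more entry than `rewards`: the last entry is the
    value estimate for the state reached after the final step (bootstrap).
    """
    if len(values) != len(rewards) + 1 or len(dones) != len(rewards):
        raise ValueError("expected len(values) == len(rewards) + 1 == len(dones) + 1")

    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; the bootstrap term is dropped at episode ends.
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages


example = compute_gae(
    rewards=[1.0, 1.0],
    values=[0.0, 0.0, 0.0],
    dones=[False, True],
    gamma=1.0,
    lam=1.0,
)
```

With `gamma` and `lam` both set to 1 and a zero value function, the advantages reduce to plain reward-to-go, which is a handy sanity check; lowering `lam` trades variance for bias by leaning more on the value estimates.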
Three phases: To help the system learn, the researchers have different training strategies for three different phases:
- Phase 1: Hand-crafted dense rewards to encourage basic skills; agents get points for maximizing workers, CityTiles, research points, and fuel
- Phase 2: Sparse reward with scaled signals; agents get rewards at end of episode for winning and the reward gets scaled according to the number of CityTiles on the winning team
- Phase 3: Win-or-lose sparse reward
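The three-phase curriculum above can be sketched as a single reward function that switches behavior by phase. This is an illustration of the idea only: the weights, the `CITY_TILE_SCALE` constant, the field names on `StepInfo`, and the choice to give losers a negative reward are all assumptions, since the summary above does not give the paper's actual values.

```python
from dataclasses import dataclass


@dataclass
class StepInfo:
    """Hypothetical per-step statistics for one team."""
    workers: int = 0
    city_tiles: int = 0
    research_points: int = 0
    fuel: int = 0
    episode_over: bool = False
    won: bool = False
    winner_city_tiles: int = 0  # CityTiles held by the winning team at the end


# Made-up weights for the dense phase; not taken from the paper.
DENSE_WEIGHTS = {"workers": 0.1, "city_tiles": 0.1, "research_points": 0.01, "fuel": 0.001}
CITY_TILE_SCALE = 0.01  # made-up scaling factor for phase 2


def phase_reward(phase: int, info: StepInfo) -> float:
    """Return the reward for one step under the given training phase."""
    if phase not in (1, 2, 3):
        raise ValueError(f"unknown phase: {phase}")

    if phase == 1:
        # Dense shaping: reward basic skills at every step.
        return (
            DENSE_WEIGHTS["workers"] * info.workers
            + DENSE_WEIGHTS["city_tiles"] * info.city_tiles
            + DENSE_WEIGHTS["research_points"] * info.research_points
            + DENSE_WEIGHTS["fuel"] * info.fuel
        )

    # Phases 2 and 3 are sparse: nothing until the episode ends.
    if not info.episode_over:
        return 0.0

    sign = 1.0 if info.won else -1.0  # assumption: losers get the mirrored reward
    if phase == 2:
        # Outcome reward scaled by how many CityTiles the winner holds.
        return sign * CITY_TILE_SCALE * info.winner_city_tiles
    return sign  # phase 3: pure win-or-lose signal
```

Moving from phase 1 to phase 3 progressively removes the hand-written hints, so whatever strategy the agents end up with in the final phase is driven by winning alone.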
You get what you pay for – and then some! As a consequence of this three-phase training scheme, the researchers see some combination of expected behavior, and some amount of emergence. Specifically, three distinct patterns of behavior emerge over time.
- Phase 1: Atomic skills – agents figure out how to build workers and collect resources, but frequently get the mixture wrong (e.g., building more cities than workers can support).
- Phase 2: “Regional coordination appears, which involves dozens of agents in a local area. For example, agents learn to carefully choose locations before building a CityTile and develop self-organizing patterns for occupying resources efficiently”.
- Phase 3: “Global strategies”: Agents figure out sustainable development – neither growing too fast nor too slowly, and neither using too much fuel nor too little. Agents also learn to carefully manage trees (which regenerate) and learn to build cities near them to defend against potential enemies. The agents also learn a surprising victory strategy: “another surprising strategy is that when the episode is about to end, our policy will rapidly harvest all the protected trees and try to build as many CityTiles as possible for the win”.
Why this matters – emergence is everywhere: Again and again we see the same phenomenon; you train an AI system in a relatively simple way and you get increasingly surprising emergent behaviors. This paper is a neat demonstration of this phenomenon.
Read more: Emergent collective intelligence from massive-agent cooperation and competition (arXiv).
[The world shortly after the development of superintelligence. Date unknown.]
Dad this game is boring!
I want to see it from the character, I don’t like how we’re on top of it.
OK, let’s ask. “Please deform game to a first-person view”.
The game faded to black and a text popup appeared which said DEFORMING. We waited a few minutes, and then the text changed to DEFORMED. READY TO PLAY. The game faded back in, and now it was from the first person perspective.
We played the game like that for a while, and then my kid got bored and we changed it again. This time, we deformed it so that gravity worked differently. Then we deformed it again so the guns worked differently.
The hardest thing about raising kids these days is they want the world to change whenever they’re not happy with it, said one parent at the playground.
That’s why we come here, I said.
We watched a kid fall down and then the kid said ‘gravity deform’ and jumped forward at an angle and fell over. They looked like they were going to cry for a second, but then they realized their parent wasn’t watching them. So they picked themselves up and carried on playing.
Things that inspired this story: How generative models will eventually get applied to everything, everywhere; how AI will be used to simulate and transform all experiences on a per-person basis; ‘think of the children’ when it comes to AI deployment.