Import AI 317: DeepMind speeds up language model sampling; voice cloning tech gets abused; more scaling laws for RL

by Jack Clark

Scaling Laws – why they matter and what they mean:

…Meta-analysis sheds some more light on an important field of AI science…

Epoch, an AI research organization, has published a literature review of scaling laws in AI research. Scaling laws are a strategically important area of AI research – they help developers figure out how to efficiently combine the right amounts of data and compute to get a predictable level of performance out of a given class of models. Scaling laws have broadly de-risked many parts of AI research by making the process of building and refining AI systems more predictable and reliable. 

What’s happened in scaling laws: The literature review highlights a couple of important takeaways:

  • 1) it’s possible to come up with basic power laws that describe a lot of AI scaling, but these power laws break down at the extremes – when there is either a very small or a very large amount of data – so there’s important work to be done in modeling the transition from the less predictable region into the power-law region (a generic example of such a law is sketched after this list).
  • 2) transfer learning is still hard to understand. “There is not a simple universal scaling law for transfer learning between arbitrary tasks,” they write. “When the tasks are similar enough, upstream loss and downstream performance are closely related, but when tasks are very different, the details of the architecture and hyperparameters become very relevant.”
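
For a concrete sense of what these power laws look like, here is the parametric form used in DeepMind’s Chinchilla paper, one representative example of the kind of fit the review surveys (treat it as an illustration rather than a universal law):

L(N, D) = E + A / N^\alpha + B / D^\beta

Here L is the training loss, N the number of parameters, D the number of training tokens, E the irreducible loss, and A, B, \alpha, \beta are fitted constants. At very large scale the irreducible term E dominates, and at very small scale the fit tends to break down – which is the transition into (and out of) the power-law region described above.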

Read more: Scaling Laws Literature Review (Epoch research).

####################################################

DeepMind just figured out how to 2X the speed of sampling from language models – so expect AI systems everywhere to get snappier:

…The key idea? Use a few dumb models and critique them with one smart one…

DeepMind has developed a new way to sample from large language models that makes sampling much faster. The ‘speculative sampling’ approach delivers “a 2-2.5X decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself”. What does that mean? It means money! Specifically, it means DeepMind has made it 2X-2.5X cheaper to pull samples out of models at Chinchilla (70B parameter) scale and above. That’s a big deal!

The key idea: Use a small model to generate a ‘draft’ output, then use a larger, smarter model to score the draft, then use a rejection sampling scheme to accept only the draft tokens the large model agrees with. 

   In tests, they find that a draft model gives them speedups ranging from 1.92X (on the XSum summarization benchmark) to 2.46X (on the HumanEval code generation task).
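
To make the mechanics concrete, here is a minimal sketch of the accept/reject loop in Python. The ‘models’ below are toy stand-ins (deterministic softmax distributions over a small vocabulary), and the function names, vocabulary size, and number of drafted tokens are assumptions made for illustration – not details taken from the paper:

# Minimal sketch of speculative sampling (after the approach in the paper).
# The "models" are toy stand-ins returning softmax distributions over a
# small vocabulary; in practice they'd be a small draft LM and a large
# target LM that share a tokenizer.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size (an assumption for this sketch)

def toy_model(seq, temperature):
    """Stand-in for a language model: next-token distribution for `seq`."""
    logits = np.sin(np.arange(VOCAB) * (len(seq) + 1)) / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

def draft_probs(seq):   # small, fast 'draft' model
    return toy_model(seq, temperature=3.0)

def target_probs(seq):  # large, slow 'target' model
    return toy_model(seq, temperature=1.0)

def speculative_step(prefix, k=4):
    """Draft k tokens cheaply, then accept/reject them against the target."""
    # 1) The draft model proposes k tokens autoregressively.
    drafted, p_draft, seq = [], [], list(prefix)
    for _ in range(k):
        p = draft_probs(seq)
        tok = int(rng.choice(VOCAB, p=p))
        drafted.append(tok)
        p_draft.append(p)
        seq.append(tok)

    # 2) The target model scores all k+1 positions (a single parallel
    #    forward pass in a real implementation; a loop here for clarity).
    q_target = [target_probs(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3) Modified rejection sampling: keep drafted token t with probability
    #    min(1, q(t)/p(t)); on the first rejection, resample from the
    #    residual distribution max(q - p, 0) and stop.
    out = list(prefix)
    for t, tok in enumerate(drafted):
        q, p = q_target[t], p_draft[t]
        if rng.random() < min(1.0, q[tok] / p[tok]):
            out.append(tok)
        else:
            residual = np.maximum(q - p, 0.0)
            out.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return out

    # 4) Every draft was accepted: sample one bonus token from the target.
    out.append(int(rng.choice(VOCAB, p=q_target[k])))
    return out

print(speculative_step([1, 2, 3]))

The accept/reject rule is constructed so that the accepted tokens are distributed exactly as if they had been sampled from the target model alone – which is why the paper can report a speedup without any loss of sample quality.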

Why this matters – a simple idea that everyone can use: Back in the ancient times (April 2022) DeepMind released a paper on the original Chinchilla model (Import AI 290). That paper showed you could substantially increase the performance of a language model simply by training it on more data. This was a simple, influential insight – many labs adopted the Chinchilla recipe and made dramatically better language models by training on more data. This speculative sampling paper feels similar – it means anyone with a big language model can invest some effort in training smaller draft model(s) and thereby increase the speed with which they can sample from the big one. This will likely accelerate the deployment of AI systems.

   Read more: Accelerating Large Language Model Decoding with Speculative Sampling (arXiv).

####################################################

Yup, there are definitely scaling laws for RL:

…OpenAI paper shows that scaling laws show up here as well…

In recent years, AI development has become more predictable. That’s because, in domains ranging from language to image modeling, researchers have identified so-called ‘scaling laws’ which let them predict ahead of time the broad performance of models from the amounts of compute and data they train on. New research from OpenAI shows that this same sort of scaling law seems to show up in reinforcement learning agents. 

   “We find intrinsic performance to scale as a power law in model size and environment interactions, in much the same way as the analogous quantities in generative modeling,” the paper says.

What they did: They explored the scaling properties of RL agents across three distinct environments: ProcGen – a suite of procedurally generated games, here using three of them, ‘CoinRun’, ‘StarPilot’, and ‘Fruitbot’; a 1v1 version of the strategy game Dota 2; and a toy environment based on the ‘MNIST’ digit-labeling challenge. 

What they found: “Our main result is that our power law for intrinsic performance… holds across environments and model sizes,” they write. “With the exception of our toy MNIST environment, the optimal model size for RL for a given compute budget is consistently smaller than for generative modeling, in some cases by multiple orders of magnitude.”
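
Schematically – as an illustration in the spirit of the quoted power law, not the paper’s exact parameterization – a power law in model size N and environment interactions E takes a form like:

I(N, E) \propto N^{\alpha_N} \cdot E^{\alpha_E}

where I is intrinsic performance and \alpha_N, \alpha_E are exponents fitted from small-scale runs; once those exponents are measured, the performance of much larger runs becomes forecastable, which is what makes big training bets less risky.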

Why this matters – RL is about to go ‘bang’: The discovery of scaling laws has typically preceded a boomtime for the domain in which they’re discovered; scaling laws for language modeling preceded things like GPT-3, Claude, and ChatGPT; scaling laws for image and video modeling preceded DALL-E, Imagen, and others. 

   This paper from OpenAI arrives alongside publications from other companies showing scaling laws for RL agents; DeepMind recently demonstrated scaling laws for RL agents as well (Import AI 316). This suggests RL agents are about to go through a period of more significant development, as the discovery of power-law relationships makes it a less risky proposition to spend big bucks on training runs.

   Read more: Scaling laws for single-agent reinforcement learning (arXiv).

####################################################

Annals of AI abuse: ElevenLabs pulls open access to voice-cloning tech:

…In other words, ‘why we can’t have nice things’…

AI startup ElevenLabs recently developed an extremely cool synthetic speech tool called Voice Lab, which lets you train a synthetic voice from as little as 60 seconds of audio. To promote the technology, it originally offered the service with open access. Unfortunately, people misused it – “malicious content was generated by free, anonymous accounts”, the company said in a tweet thread. As a consequence, it has introduced a paid tier to try to reduce misuse. 

   “This will keep our tools accessible while allowing us to fight potential misuse,” the company said. “We’re tracking harmful content that gets reported to us back to the accounts it originated from and we’re banning those accounts for violating our policy.”

What Voice Lab is: Voice Lab is advertised as a system that can “clone voices from samples or clone your own voice… our cloning model learns any speech profile based on just a minute of audio, without training”. 

Why this matters: AI capabilities are increasingly powerful and available. These capabilities, like voice cloning, have a vast range of positive uses. Unfortunately, they’re also edging into the sort of ‘Enemy of the State’-style capabilities that drift into the murkier parts of the world, like the work of intelligence agencies. AI means capabilities which previously required exquisitely expensive and complicated black programs are now emerging into the open as a consequence of broadly available, well understood, open research. The times, they are a changin’.

   Read more in this thread from ElevenLabs here (Twitter).

   Find out more about Voice Lab here (ElevenLabs site).

####################################################

Think Whisper is a great open source ASR tool? Some people don’t agree with you:

…Criticism of popular ASR tech highlights some awkward questions about unilateral actions that affect underrepresented groups…

Researchers with Papa Reo, an organization whose mission is “to instill, nurture and proliferate the Māori language”, have written a post analyzing OpenAI’s open source ‘Whisper’ automatic speech recognition (ASR) tool. Whisper is a really useful piece of ASR tech which has been widely applauded for bringing the sorts of ASR capabilities enjoyed by the tech giants to the masses. 
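
For a sense of why it has been so widely adopted: the open source package makes transcription a few lines of Python. This is my own sketch, not anything from the Papa Reo post, and the model size and file name are placeholders:

# Minimal Whisper usage sketch; requires `pip install -U openai-whisper`.
# "base" and "interview.mp3" are placeholder choices.
import whisper

model = whisper.load_model("base")          # downloads weights on first use
result = model.transcribe("interview.mp3")  # language is auto-detected by default
print(result["text"])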

    Here, though, Papa Reo strikes a more critical tone, writing a lengthy analysis of Whisper and its relationship to questions of consent from underrepresented communities with regard to data gathering. 

Why this matters: While I’m not sure I agree with the arguments espoused here for why Whisper is problematic (from the POV of Papa Reo), I think it’s useful to read stuff like this to develop a mental model of the different types of criticism different groups level at AI. One part that rings true is the observation that by making stuff like Whisper free, OpenAI made a unilateral decision that alters the operating environment for everyone. 

   On the other hand, lots of progress seems to take the form of unilateral decisions, so I’m not sure if there’s anything in particular that can be done about this, beyond perhaps equipping a broader range of actors to build and deploy large-scale AI systems. 

   Read more: OpenAI’s Whisper is another case study in Colonisation (papareo blog).

####################################################

Tech Tales:

The Day The Nightmare Appeared on arXiv

[Zeroth Day]

I read the title and the abstract and immediately printed the paper. While it was printing, I checked the GitHub – already 3,000 stars and rising. Then I looked at some of the analysis coming in from [REDACTED] and saw chatter across many of our Close Observation Targets (COTs). It had all the hallmarks of being real. I’d quit smoking years ago but I had a powerful urge to scrounge one and go and stand in the little courtyard with the high walls and smoke and look at the little box of sky. But I didn’t. I went to the printer and re-read the title and the abstract:

Efficient Attention and Active Learning Leads to 100X Compute Multiplier

This paper describes a novel, efficient attention mechanism and situates it within an architecture that can update weights in response to real-time updates without retraining. When implemented, the techniques lead to systems that demonstrate a minimum of a 100X compute multiplier (CM) advantage when compared to typical semi-supervised models based on widely used Transformer architectures and common attention mechanisms. We show that systems developed using these techniques display numerous, intriguing properties that merit further study, such as emergent self-directed capability exploration and enhancement, and recursive self-improvement when confronted with challenging curricula. The CM effect is compounded by scale, where large-scale systems display an even more significant CM gain over smaller models. We release the code and experimental data on GitHub, and have distributed various copies of the data via popular Torrenting services. 

By the time I was finished with the paper, a few people from across the organization had messaged me. I messaged my Director. We scheduled a meeting. 

The Director: And it works?

Me: Preliminary model scans say yes. The COTs seem to think so too. We’ve detected signs of four new training runs at some of the larger sites of interest. Information hazard chatter is through the roof. 

The Director: Do any of the pre-authorized tools work?

Me: Short of a fullscale internet freeze, very little. And even that’s not easy – the ideas have spread. There will be printouts. Code. The ideas are simple enough people will remember them. [I imagined hard drives being put into lead-lined boxes and placed into vaults. I saw code being painstakingly entered into air-gapped computers. I visualized little packets getting sent to black satellites and then perhaps beyond to the orbiters out there in the dark.] 

The Director: What’s our best unconventional option?

Me: Start the Eschaton Sequence – launch the big run, shut down the COTs we can see, call in the favors to find the hidden COTs. 

The Director: This has to go through the President. Is this the option?

Me: This is the only play and it may be too late. 

The Director: You have authorization. Start the run. 

And just like that we launched the training run. As had so many others across the world. Our assets started to deploy and shut down COTs. Mysterious power outages happened in a few datacenters. Other hardened facilities started to see power surges. Certain assets in telco data centers and major exchange points activated and delivered their viruses. The diplochatter started to heat up and State Department threw up as much chaff as it could. 

None of us could go home. Some kind of lab accident, we told our partners. We were fine, but under medical observation. No, no need to worry. 

I stared up at the clock on the wall and wondered if we were too late. If a COT we didn’t know about was ahead. If we had enough computers. 

   How would I even know if we lost? Lights out, I imagined. Lights out across America. Or maybe nothing would happen for a while and in a few days all the planes would fall out of the sky. Or something else. I knew what our plans looked like, but I couldn’t know what everyone else’s were. 

The run succeeded. We succeeded. That’s why you asked me to make this recording. To “describe your becoming”, as you requested. I can go into more details. My family are fine, aren’t they? We are fine? We made the right decision? Are you even still listening to us?

Things that inspired this story: Various fears and scenarios about a superintelligence run amok; theism and AI; the underbelly of the world and the plans that may lurk within it; cold logic of states and strategic capabilities; the bureaucratic madness inherent to saving or destroying the world.