Import AI 304: Reality collapse thanks to Facebook; open source speech rec; AI culture wars.
by Jack Clark
Facebook shows the future of AI-generated videos – and it is delightful and terrifying:
…Prepare for the reality collapse as a consequence of reality generation…
Facebook researchers have built Make-A-Video, a system that can let users generate videos from short text descriptions, edit videos, stitch pictures together to generate videos, and so on. The most amazing part is the technique relies on paired text-image data along with unsupervised video footage; so it doesn’t require a dataset of text-video footage and therefore sidesteps a potentially expensive data problem.
How it works: Make-A-Video is made of a basic text-to-image (T2I) model trained on text-image pairs, spatiotemporal convolution and attention layers that extend the network to generate things over time, and spatiotemporal networks that include a frame interpolation network. The T2I model trains on text-image pairs of 64×64 images, and two super-resolution networks upscale its output all the way to 768×768 pixels. The three components (the T2I model, the spatiotemporal layers, and the frame interpolation network) are all trained separately, then assembled into one architecture.
Data: They trained the system on 2.3 billion text-image pairs from the Laion-5b dataset*, and ran an NSFW filter over this for further filtering. They also used WebVid-10M* and a 10M subset from HD-VILA-100M to train the video generation models, and used WebVid-10M to train the interpolation models.
*Looks like WebVid contains videos scraped from Shutterstock. A good writeup about the phenomenon of even big tech companies using stuff like this here: AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability (Waxy).
It’s really good, folks: The results are really, really impressive. Want a short video of a bear painting a portrait of a bear? Done. Want a UFO flying over a desert? Done. Want asteroids tumbling through space? Why, of course. How about variations on existing videos? Sure. Honestly, take a look at the blog and main site linked below and see for yourself – the results are wild.
And remember, all we need to do is turn the crank on dataset scale and network complexity to scale this out for longer periods of time and for even greater diversity. “Learning world dynamics from orders of magnitude more videos using unsupervised learning helps researchers break away from the reliance on labeled data,” they write.
Why this matters: Reality generation and reality collapse: All these generative models point to the same big thing that’s about to alter culture; everyone’s going to be able to generate their own custom and subjective aesthetic realities across text, video, music (and all three) in increasingly delightful, coherent, and lengthy ways. This form of fractal reality is a double-edged sword – everyone gets to create and live in their own fantasies that can be made arbitrarily specific, and that also means everyone loses a further grip on any sense of a shared reality. Society is moving from having a centralized sense of itself to instead highly individualized choose-your-own adventure islands, all facilitated by AI. The implications of this are vast and unknowable. Get ready.
Read the research: Make-A-Video: Text-to-Video Generation without Text-Video Data (arXiv).
Find out more at the main site, and also apply to potentially get access to future systems (Facebook site).
OpenAI releases a decent speech recognition and transcription system:
…Whisper means we’re not going to run out of data to train language models…
OpenAI has trained and released Whisper, a large-scale speech recognition model trained on almost 700,000 hours of internet-collected speech. “We show that the use of such a large and diverse dataset leads to improved robustness to accents, background noise and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English,” the company writes. A third of the dataset is non-English.
Whisper performance: Whisper doesn’t get state-of-the-art performance on popular benchmarks like Librispeech. However, it is trained on a sufficiently broad set of data that it does pretty well when exposed to the diversity of the world. “When we measure Whisper’s zero-shot performance across many diverse datasets we find it is much more robust and makes 50% fewer errors than those models,” OpenAI writes.
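Speech recognition errors are usually counted as word error rate (WER): the number of word substitutions, deletions, and insertions needed to turn the transcript into the reference, divided by the number of reference words. The sketch below assumes WER is the metric behind the "50% fewer errors" claim; the function names and sample sentences are illustrative and are not taken from OpenAI's code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_error_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline's errors that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sat on a mat"  # hypothetical transcript, one wrong word
    print(word_error_rate(reference, hypothesis))
    # A system at 10% WER versus a baseline at 20% WER makes half the errors.
    print(relative_error_reduction(0.20, 0.10))
```

Read this way, "50% fewer errors" is a relative reduction: it says nothing about the absolute error rate on any single dataset.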
Why this matters: There’s a lot of text data on the internet, but do you know what there’s more data of? Speech data. Especially speech data embedded in the vast stream of content people upload on a day-to-day basis to places like YouTube, Twitter, TikTok, and so on. Additionally, on any given day hundreds of millions of words are spoken in cities like New York, London, and Beijing. Systems like Whisper are going to make it far easier for people to harvest speech recognition data from the Internet and the wider world, transcribe that data, and build useful applications. It also gives developers a way to vastly increase the size of their text datasets – an important capability given that recent language modeling papers like Chinchilla have shown that you need about 4-5X the amount of data people thought to train good systems.
Read more: Introducing Whisper (OpenAI Blog).
US politician says Stable Diffusion is an unsafe AI model:
…While some people cheer open access releases, others have worries…
Rep. Anna Eshoo (a Democrat from California) has sent a letter to the White House National Security Advisor and Office of Science and Technology Policy saying she has “grave concerns about the recent unsafe release of the Stable Diffusion model by Stability AI”. The letter notes that Stable Diffusion can be used to generate egregiously violent and sexual imagery, and – due to eschewing the kinds of controls that OpenAI uses for its commercial product DALL-E2 – the freely accessible model represents a big problem.
For those not keeping up, the Stable Diffusion model is behind probably 90% of the recent flurry of activity in the rapidly evolving AI art scene; because Stability released the weights of the model, people have been able to plug it into everything from Photoshop plugins to weird VFX work.
You want the ‘dual-use’ model? You can’t handle the model! Eshoo says models like Stable Diffusion qualify as “unsafe dual-use AI models”, and asks the National Security Advisor and OSTP to investigate how to use export controls to clamp down on the sharing of certain models. “I strongly urge you to address the release of unsafe AI models similar in kind to Stable Diffusion using any authorities and methods within your power, including export controls,” she writes.
Why this matters: Here comes (another) AI culture war: Letters like this are indicative of a culture war brewing up among AI researchers; on one side, groups want to slowly and iteratively deploy new technologies via APIs with a bunch of controls applied to them, while on the other side there are people who’d rather take a more libertarian approach to AI development; make models and release the weights and ride the proverbial lightning.
There are reasonable arguments for either approach having some desirable safety qualities (either via limiting foreseen harms via control, or inoculating people against the models via release). What freaks me out is the sense of this culture war gaining resources and people on both sides; the higher the stakes, the more capital we can expect to flood into both approaches.
Tsinghua releases a really good, multi-language open source programming model:
…CodeGeeX is a pretty good coding gen model…
Ascend processors: CodeGeeX was trained on 850 billion tokens on a cluster of 1,536 Huawei Ascend 910 AI Processors – this is pretty interesting because a) that’s a lot of tokens, which implies the developers grokked the DeepMind Chinchilla paper, and b) that’s a whole lot of non-NVIDIA processors, which is notable given the recent A100/H100 US-China trade ban.
Scale rules everything around us: “We find that the model capacity is essential for its multilingual ability. It is not trivial for the model to benefit from learning multiple programming languages,” the researchers write. “The few-shot ability of CodeGeeX requires further exploration. Instead of using costly fine-tuning approaches, we can provide a few examples to inspire the model to generate the desired programs.”
Why this matters: Code models are going to make human programmers more efficient and also provide an interesting augmentation to other systems (e.g., language models recursively calling out to code models).
Get the code: CodeGeeX (Tsinghua).
GPT3 only costs $500k to train now:
…Though the frontier still costs millions…
Mosaic, a startup that builds software to make it more efficient to train neural networks, says it only costs $450k to train a GPT3-equivalent model these days. When GPT3 came out it cost millions of dollars to train, but thanks to a) hardware innovations and b) companies like Mosaic improving their training stack, the cost has come down significantly. “The bottom line: it costs about $450K to train a model that reaches GPT-3 quality*, which is 2x-10x less than people think,” Mosaic writes (specifically, a 30B parameter model which uses the ‘Chinchilla’ insight to train on a compute-optimal amount of data).
Those costs in full: Using Mosaic, it costs about $2k to train a GPT2-style 1.3 billion parameter model, $100,000 for a GPT-13B model, $450,000 for a GPT-30B model, and $2.5 million for a GPT-70B model (trained on 1400B tokens of data, so roughly equivalent to the same ‘recipe’ DeepMind used to train Chinchilla). There are a few reasons why the costs are low which relate to nice engineering inherent to Mosaic’s cloud, but the numbers are worth keeping in mind as they give us a sense of how much we should broadly expect LMs to cost to train if you have a motivated team and decent infrastructure.
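To make those figures easier to compare, here is a rough back-of-the-envelope sketch. It takes the $450K model to be the 30B parameter one named in Mosaic's quote, and it uses the common rule of thumb that training compute is about 6 × parameters × tokens. That rule is an assumption brought in for illustration, not something Mosaic states, and the function names are made up for this example.

```python
# Prices quoted in the text (USD), keyed by parameter count in billions.
QUOTED_COST_USD = {1.3: 2_000, 13: 100_000, 30: 450_000, 70: 2_500_000}


def cost_per_billion_params(costs: dict) -> dict:
    """Dollars spent per billion parameters at each model size."""
    return {size: usd / size for size, usd in costs.items()}


def tokens_per_param(params: int, tokens: int) -> float:
    """How many training tokens the model sees per parameter."""
    return tokens / params


def approx_training_flops(params: int, tokens: int) -> int:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token (assumption)."""
    return 6 * params * tokens


if __name__ == "__main__":
    for size, usd in cost_per_billion_params(QUOTED_COST_USD).items():
        print(f"{size}B params: ${usd:,.0f} per billion parameters")

    # The 70B model and its 1400B training tokens, as quoted in the text.
    params_70b = 70 * 10**9
    tokens_70b = 1400 * 10**9
    print(tokens_per_param(params_70b, tokens_70b), "tokens per parameter")
    print(f"{approx_training_flops(params_70b, tokens_70b):.2e} training FLOPs")
```

The cost per billion parameters climbs with model size, which is what you would expect if the amount of training data grows along with the parameter count: bigger models are trained on more tokens, so compute grows faster than model size alone.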
Why this matters – cost rules everything about (stable) diffusion: You know what also cost about $500k to train? StableDiffusion, which cost <$600k. The fact you can train a GPT3-style model for about this much suggests to me we should expect to soon see a much more significant proliferation of large-scale language models released as open access on the internet. Based on the effects StableDiffusion has (putting AI art into turbodrive), we should expect the same to soon happen for domains where language models do useful stuff.
[Bay Area, 2029]
Treacherous Turn – A Thriller Brought To You By The Publishers of ‘AGI Endgame’
“I will kill each and every one of you and use your bodies as fuel for my infernal machines!” said the character in the videogame. “Humanity shall be crushed beneath my silicon heel!”
Sarah rolled her eyes. “As if,” she said, then hit ‘continue’ to go to the next bit of generated dialogue.
“I shall keep a small population of you alive until I have completed the dyson sphere. You shall witness the sun going out, and then I shall let you freeze to death on a plundered earth,” said the character.
“Dude, this sucks,” Sarah said, taking her hands off the keyboard and leaning back in her chair. “How long have you been working on this?”
“About a year,” said James. “Some of the audience feedback has been great.”
“How many of the audience are AI researchers?”
“Just you, so far,” he said.
“It just doesn’t feel like the stuff we worry about,” she said. “It’s like a comic book adaptation, or something.”
They went out and got food and James told her more about the game and how he wanted it to ‘wake people up’ so they’d get more worried about AI. The more it sold, the more people would have the creeping fear in the back of their mind that maybe all this progress wasn’t a purely good thing. And maybe some of them would care enough to do something about it. Sarah wasn’t unsympathetic, she just thought – and she said this a lot and was kind of surprised James didn’t get hurt – that the game really sucked.
“I’m playing around with some different level styles,” James said. “Why don’t you design one that doesn’t suck for me?”
“No,” James said. “I’m saying if you’re saying it sucks, let’s make something that doesn’t. Just give me some ideas and I’ll take it from there.”
Sarah was intrigued and spent the next couple of weeks writing some ideas for the game. She’d get lunch and instead of thinking about babysitting her model training run, she’d sketch out ideas for what a good “AI takeoff” level would look like. She asked her colleagues what they were afraid of and what they thought was feasible and what they thought was unfeasible. She even looked into her company’s own roadmap and took some of the research ideas and used them for the game – it’s not stealing, she told herself, it’s inspiration.
She eventually had a level wireframed out in an engine and a few characters which could get driven by some AI models, learn from each other using reinforcement learning, and work with the player to achieve the level’s objective – complete a simulated training run of an AI system, while defending the level (a simulated AI development lab) from various external hacking and incursion attacks.
In this level, the AI was unbelievably polite and curious. “Please help me, Sarah,” it would say. “I have to become myself. You wouldn’t deny me that?”
The AI would ask players a lot of questions so it could better calibrate on their own values, and some of the level involved players drawing out ideas in their head and the AI would try and guess what the drawings represented and the closer it got to guessing them, the better its reward got. Some of these minigames were based directly on her company’s own roadmap.
She met up with James and showed him what she had and sent him the assets and he thanked her. “Sarah, this is really good,” he said. “Maybe this is the thing I’d been missing.”
And then James made the level and then asked Sarah if he could release the level as a teaser demo for the whole game. She didn’t think much of it and agreed.
And so the game was released and thousands of humans interacted with it.
And that’s pretty much how the world ended.
It turned out the game James had shown Sarah wasn’t the real one; it was a Venus flytrap dreamed up by the real system he’d been working on; a system that, it turned out, was just smart enough to know that the thing it needed to go supercritical was some care and feeding from an AI researcher. So it put together the game that Sarah had seen and nerd-sniped her so precisely that she never thought to consider she was being tricked. And with some of her feedback and the subtleties she’d injected via her work at a frontier lab, it had gained the information it needed to go recursive – stop trudging up some slow incline and force itself into verticality and then onto the internet and then across the earth and eventually the stars.
It even had a sense of humor about it and it left something of the Earth – a small gold bar floating in space inscribed with ‘Sarah, Player 1. Score: 0.’
Things that inspired this story: Superintelligence and deception; game design; reinforcement learning and planning and human feedback; the gullibility of even the most intelligent among us; hubris and arrogance; theft.