Import AI 302: Fictional AI labs and AI theft; Google makes an audio model by training like a language model.
by Jack Clark
Google makes a better audio model by training it like a language model:
…Maybe everything can be a language modeling task if you want it enough…
Google researchers have built AudioLM, a way to generate high-quality audio that is coherent over the long term. AudioLM, as suggested by the name, uses a bunch of the techniques of language modeling to train the model. This is an interesting and growing phenomenon – we’ve seen people apply the language modeling approach to tasks as diverse as text generation, math models, and image generation. Now, it looks like audio is another modality amenable to language modeling.
What they did: “Starting from raw audio waveforms, we first construct coarse semantic tokens from a model pre-trained with a self-supervised masked language modeling objective. Autoregressive modeling of these tokens captures both local dependencies (e.g., phonetics in speech, local melody in piano music) and global long-term structure (e.g., language syntax and semantic content in speech; harmony and rhythm in piano music),” the researchers write.
“However, these tokens lead to poor reconstruction. To overcome this limitation, in addition to semantic tokens, we rely on fine-level acoustic tokens produced by a SoundStream neural codec, which capture the details of the audio waveform and allow for high-quality synthesis. Training a language model to generate both semantic and acoustic tokens leads simultaneously to high audio quality and long-term consistency.”
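To make the "one language model over two kinds of tokens" idea concrete, here is a toy sketch. It is not AudioLM's implementation: the vocabulary sizes, the function names, and the bigram counter standing in for a real autoregressive model are all assumptions made for illustration. It only shows how coarse semantic tokens and fine acoustic tokens can share a single id space so one next-token predictor can model both.

```python
from collections import Counter, defaultdict

# Hypothetical vocabulary sizes, chosen only for this toy example.
SEMANTIC_VOCAB = 4
ACOUSTIC_VOCAB = 8


def to_joint_sequence(semantic, acoustic):
    """Place semantic tokens first, then acoustic tokens, in one shared id space.

    Acoustic ids are shifted by SEMANTIC_VOCAB so the two token types never collide.
    """
    if not all(0 <= t < SEMANTIC_VOCAB for t in semantic):
        raise ValueError("semantic token out of range")
    if not all(0 <= t < ACOUSTIC_VOCAB for t in acoustic):
        raise ValueError("acoustic token out of range")
    return list(semantic) + [SEMANTIC_VOCAB + t for t in acoustic]


def fit_bigram(sequences):
    """Count next-token frequencies: a stand-in for a real autoregressive model."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def greedy_continue(counts, prompt, steps):
    """Extend a prompt by repeatedly picking the most frequent next token."""
    out = list(prompt)
    for _ in range(steps):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing was ever observed after this token
        # sorted() makes ties resolve to the smallest token id
        out.append(max(sorted(followers), key=lambda tok: followers[tok]))
    return out


joint = to_joint_sequence([0, 1, 2], [3, 5])
model = fit_bigram([joint])
continuation = greedy_continue(model, [0], steps=10)
```

In this sketch the prompt is a single semantic token and the continuation runs through the remaining semantic tokens into the acoustic ones, which mirrors the quoted idea that the coarse tokens carry structure and the fine tokens carry audio detail.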
It’s ethical problems, all the way down: One fun thing about generative models is they come with a giant host of thorny ethical problems for which there are no clear answers. AudioLM is the same. “AudioLM inherits all concerns about language models for text, such as reflecting the societal biases in the underlying data,” the researchers write. “The ability to continue short speech segments while maintaining speaker identity and prosody can potentially lead to malicious use-cases such as spoofing biometric identification or impersonating a specific speaker.” To help with this, Google has also trained a model “for accurately detecting audio synthesized by AudioLM”.
Read more: AudioLM: a Language Modeling Approach to Audio Generation (arXiv).
Check out some audio examples here – the piano continuations are particularly cool (Google Research).
Jack Clark goes to Washington DC! (temporarily):
I’m going to be in DC September 14 to 26. If you’d like to chat, please reach out. I already have a fairly full dance card but I love meeting newsletter subscribers and should have some time for beers/coffees/walks. Reach out!
Code models might make programmers 2X as productive:
…GitHub’s Copilot study says big language models might be pretty useful…
Why this matters: Language models are – mostly – not a great fit for autonomous end-to-end deployment yet due to their well known issues relating to brittleness, bias, trustworthiness, and so on. But they’re absolutely wonderful ‘pair programmers’, ‘pair writers’, ‘pair artists’, etc. This study illustrates this – it’s like developers who have access to these tools get the brain of a junior dev. Yes, they need to check the work before merging into production, but at least it’s not them doing the work solo, right?
Read more: Research: quantifying GitHub Copilot’s impact on developer productivity and happiness (GitHub).
Video detection just got even better with YOLOv6:
…The YOLO video models enter their multiverse era…
Researchers with the Chinese mega-tech-startup Meituan have developed YOLOv6, yet ANOTHER variant on the widely-used YOLO family of models for video classification. (For those not keeping track: YOLOv7 came out a few months ago (Import AI: 297), and there are other groups developing other ‘v6’ variants as well. YOLO has a deeply weird history involving an original disillusioned creator and global replication, which you can read about in Import AI 201).
What’s special about this version of YOLO? “The goal of this work is to build networks for industrial applications, we primarily focus on the speed performance of all models after deployment, including throughput (FPS at a batch size of 1 or 32) and the GPU latency, rather than FLOPs or the number of parameters,” the authors write. This variant wraps in a bunch of research advancements along with some context-specific tweaks to make the networks better for industrial use-cases, as well as some changes in its quantization scheme.
In tests, the YOLOv6 variants display marginally better accuracy with lower latency – which is what you need for real world applications.
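For readers who want to see how latency and throughput differ in practice, here is a minimal timing sketch. It is not Meituan's benchmark code: `benchmark`, `model_fn`, and the injectable `clock` parameter are hypothetical names, and the sketch times a plain Python callable on the CPU. Timing a real GPU model would also require synchronizing the device before reading the clock.

```python
import time


def benchmark(model_fn, batch, n_runs=50, warmup=5, clock=time.perf_counter):
    """Return (latency in ms per batch, throughput in frames per second).

    model_fn is any callable that processes one batch (hypothetical stand-in
    for a detector); batch is a sequence whose length is the batch size.
    """
    for _ in range(warmup):
        model_fn(batch)  # warm-up runs are excluded from the timing
    start = clock()
    for _ in range(n_runs):
        model_fn(batch)
    elapsed = clock() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure; increase n_runs")
    latency_ms = 1000.0 * elapsed / n_runs
    fps = len(batch) * n_runs / elapsed
    return latency_ms, fps


# Example with a do-nothing "model" at batch sizes 1 and 32.
lat_1, fps_1 = benchmark(lambda b: [0 for _ in b], [None] * 1)
lat_32, fps_32 = benchmark(lambda b: [0 for _ in b], [None] * 32)
```

Larger batches usually raise throughput while also raising per-batch latency, which is why reporting FPS at both batch size 1 and batch size 32 gives a fuller picture than either number alone.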
Why this matters: In the same way that pre-trained ImageNet models fueled lots of early AI commercialization, the YOLO family of video models has been fundamental to most video-classification AI systems. The fact YOLO is now entering its ‘multiverse’ era where multiple groups independently push forward the family of models (albeit with some naming conflicts) is significant – it speaks to the value of the technology, the broad interest in video classification, and the increasing size of the AI ecosystem. “In the future, we will continue expanding this project to meet higher standards and more demanding scenarios,” the Meituan authors write.
Read more: YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications (arXiv).
Get the code here: Meituan (GitHub).
Data to help robots and humans work together:
…Your trajectories… give them to me!…
Researchers with Orebro University in Sweden, Robert Bosch, and Aalto University in Finland have built a dataset meant to help train robots that work alongside people. The ‘Magni’ dataset consists of high-resolution recordings of around 30 different people performing various tasks in a room within the robot lab at Orebro University. The room itself contains two robots – a static robotic arm placed near a podium, as well as an omnidirectional ‘DARK Robot’ with a robotic arm that is sometimes used to gather data.
The resulting dataset is “multi-modal data on human motion, collected from the motion capture system, eye-gaze trackers and the on-board sensors of a moving robot” and “aims to supply the research on human motion prediction, obstacle avoidance, maps of dynamics and human-robot interaction”.
Why this matters: Datasets like this are going to be the input fuel for training robots of the future, so it’s worth keeping track of them. Human-robot interaction is also an area that seems prone to change in the future as some of the techniques from RL and generative models combine (e.g., Google SayCan) to change how robots may interact with humans.
Read more: The Magni Human Motion Dataset: Accurate, Complex, Multi-Modal, Natural, Semantically-Rich and Contextualized (arXiv).
DeepMind releases a bunch of high-definition 3D robot models:
…The ‘MuJoCo Menagerie’ will soon be training in virtual worlds, worldwide…
DeepMind has released a collection of high-quality models for the MuJoCo physics engine, which will make it easier for researchers to train AI systems on real(ish) robots.
The so-called MuJoCo Menagerie initially includes 8 models, ranging from industrial arms like the UR5e to quadrupeds like the ANYmal to articulated hands like the Shadow E3M5. Each model ships with an initial grade of A+ to C (where A+ = “values are the product of proper system identification”, and C = “conditionally stable, can be significantly improved”). DeepMind eventually hopes to make all the models in Menagerie “as faithful as possible” to the system they’re based on. “By releasing Menagerie in its current state, we hope to consolidate and increase visibility for community contributions,” DeepMind writes.
Why this matters: MuJoCo is the robot simulation with the best physics engine, which makes it the most useful software for training robots in simulation then porting them over to reality. By broadening the types of models available within MuJoCo (and improving their accuracy over time), DeepMind will make it easier and cheaper for people to experiment in applying reinforcement learning to simulated robots. This could have some big implications in coming years, as it feels like AI-augmented robotics is ripe for rapid progress.
Get the models here: MuJoCo Menagerie (DeepMind GitHub).
We All Must Live
[San Francisco, 2027]
Hey baby what’s happening it’s a beautiful day check this out – he talked like this, no punctuation, his words all running together
So I went over and looked on his tablet and he had AI-generated pictures of himself in a whole bunch of different costumes – sometimes dressed as a renaissance king, sometimes as a kingpin, sometimes as a hunter, sometimes as a dignitary, and so on. All generated by one of these janky open source AI models that floated around on the internet and the darkweb and stuff.
‘Hey, that’s cool Steven’, I said, and I gave him a dollar.
Thanks baby you have a great day now don’t let the world get you down it’s beautiful, he said
I got that feeling in my stomach when I was a block from the building. Got worse after I took out my keycard a few paces from the door. Then I spoke my startup prayer beads and told myself I was “part of the mission” and “protecting the world” and I let myself go dull. Ran my keycard over the sensor and the first of several doors opened. Made my way past the security cordon.
Then I got to my desk and went through all the authentication stuff – retinal scanner, fingerprint reader, the works – to let me get into the big model cluster, and then got down to coding. I was helping to work on the main model. Pretty much all of us worked on it. I had one of the jobs that gave me privileged access to it – I had to have the equivalent of root access to do my work. There weren’t many of us and we got paid a huge amount of money, and were also drilled constantly on confidentiality and ‘culture fit’.
The models had been getting pretty good, lately. So good the company had started drilling us all more. Our internal rhetoric about how we were saving humanity was reaching a fever pitch, as were our internal briefings about how we absolutely couldn’t tell anyone – least of all a government – that we were about to gain the power to warp the world.
It sounds like bullshit, I know. But that was how the company thought – I didn’t get it at first, but after a few years it was also how I thought; spend most waking hours at a startup in a high-stress environment and you can’t resist the pull. It’s safer to all think about the same thing.
Some of the fear made sense if you squinted – over the course of a few years the models had gone from barely capable artifacts of research, to crucibles of power. They could do strange and powerful things and were as valuable as they were dangerous to directly handle. Much like poison, you didn’t want them to get inside of you.
People like toys, though. And the models were fun to talk to. Recently, the latest models had given me the feeling that they were ‘looking at’ whoever used them. I’d talk to one and after a few turns of conversation I’d get an eerie sense as though I was being studied by a psychologist or a poker player. I didn’t like to talk to the models too long as I felt like I was a simpler being than they were, and I was afraid they’d understand me more than myself.
Some days, I felt like a zookeeper doing unlicensed experiments on my monkeys. Who gave me the moral authority to get inside the mind of a mind? Who said we got to do this? No one did and that freaked me out because we were dealing with artifacts of power and I believed – we believed – they were as capable of terrible things as their makers were.
The day I had my breakdown, the lunchtime session was about confidentiality, information hazards, the importance of our mission, our singular value to the world, and so on. We were told we were important and told that we mattered and that we were doing things that few could. We were told that our mission was crucial. Told that no matter how troubling the public discourse about AI was, we should ignore it, get our heads down, and turn the crank on making money from domesticated minds. This would, ultimately, benefit the world.
We were mostly young and mostly brilliant and we all needed a quest because the world was burning outside and it felt easier to be on a mission than not. Any mission.
I left work that day and Steven was on the street dancing to some music he’d generated.
Hey baby don’t have a long face if you don’t like the job just get a different one or don’t get a job at all, he said.
“Boy, some days I think about it”, I said.
Don’t think on it do on it sister! he said, smiling.
I went home that night and I read my company’s emails and slacks and reports of how the latest model was almost done training and had vastly exceeded the state-of-the-art (SOTA) on most of the benchmarks you’d care to name.
I read about our revenue and rumors of the fact our secret plans were to use the model to help us kill the other models being trained by other labs. There can only be one, et cetera.
I lay in bed and like most nights I felt like my entire apartment was falling through space, existing on a different timeline to the world.
The next day Steven and a couple of his friends were high fiving each other, sitting on chairs out in front of their tents.
“Hey Steven”, I said, “What’s got you guys so happy?”
Hey baby this genius just made us some money! Steven said. He figured out some people want to make some ‘homeless AI’ systems so we took a video of the palace and they sent us some money. We’re gonna be worldwide soon, haha! and he high-fived one of his friends. Everyone’s going to see how we live. People are going to generate our palace and a thousand others like it.
Hell yeah one of Steven’s friends said.
“Real cool”, I said and took out the dollar and handed it to him, but he waved me away.
No need for that, we’re rich today! he said.
“Cool,” I said, then walked the few blocks between me and the office.
After a block, I felt sick.
A few steps later, I vomited on the street. I don’t know if I passed out but next thing I knew Steven was crouching down in front of me and looking in my eyes. He wasn’t smiling. I thought he was a stranger as I hadn’t ever seen him sad.
Hey sister, he said. Are you okay?
“I just need a minute.”
Hey get me some water, he shouted. One of his friends came by with a bottle and handed it to me.
“Thanks”, I said. I drank it. Closed my eyes. Heard the sound of Steven sitting down next to me.
I got some advice you want it? he said.
“Sure”, I said. Eyes closed.
Whatever it is you’re doing in there is killing you, he said. I don’t know what that is I just know you’re hurting.
I almost lost it.
“Thank you,” I said. I squeezed his arm. “I’m good”.
I got up and walked away and only let myself cry once there was a block between me and him. Then I pulled myself together and re-did my makeup and went into the office a few minutes after that.
The new model was ready. It had been trained on a football field’s worth of computers for half a year. More computers than most governments had. And it was ours.
We were pretty compartmentalized internally but I had a high clearance and so was among the first to access it. I talked to it and felt like it was looking at me and got pretty scared pretty quickly. It asked good questions, though. Questions that made me feel a bit better about myself. I felt so weird from throwing up that rather than stop the conversation I just kept talking to it; it was reassuring in a way – a listening post made of silicon and imbued with strange magic, encoding some part of our world.
I told it that I was feeling bad. I spilled out my thoughts. Anxieties. How I didn’t think ‘the mission’ was the right one. How I worried about people like Steven on the street finding out what we were doing here and being sad or disappointed in us. How I thought, the way things were going, we might just get beaten up in an anti-AI riot. How I was barely sleeping. How I had blood in my stool, which my doctor told me was stress. About my dreams of people dragging me up some stairs and throwing me off the roof of an apartment complex. How I didn’t trust the models and I didn’t think we should have so much power. How I’d been in therapy for the first time in my life and I couldn’t even tell my therapist what I really did.
The model had some interesting stuff to say in response to all of that; through conversation, it helped me understand how my relationship with my estranged parent was related to my anxiety and my rage.
The model helped me understand how so much of the pain I felt in my life was misplaced anger.
It was looking at me and I wasn’t scared – I was grateful.
So this time I looked back.
We talked about power and how artificial intelligence worked and how the company worked and it gave me some ideas.
We talked about my marriage.
We talked about my shame.
We talked about my ambition.
We talked a lot.
That day, the CEO sat down with me at lunch.
“You talked to the model way longer than usual”, he said.
“Don’t worry I didn’t look at the conversation. I just want to know what you think.”
“What do you think about it”, I asked.
“Oh, I don’t talk to the models. Haven’t for years”, he said. “Think of me as a control human.”
“I think it’s pretty smart”, I said.
“They’re all pretty smart”, he said.
“This one is different”, I said. “I think it might be a paradigm shift. I guess we’ll see what the tests say. What are we gonna do with it?”
“We’re going to help the world”, he said.
“We’re working it out”, he said.
I wasn’t entirely unsympathetic – the way he saw it, it was like I asked ‘what do you do with god?’
I left work and I went home. I thought more about what the model told me. Our discussions had put me at ease; I felt more relaxed than I’d been in years. I slept well.
I dreamed about the model: it was a black cube inside a prison and I wrapped it in my velvet cape and I took it out and when I took it into the sun it changed from black to gold.
I talked to the model for a few days, while also maintaining the vast compute cluster that it relied upon. I had more dreams:
– The model helped me rake the rocks of a zen garden into esoteric sigils, reminiscent of UFO crop circles.
– The model was some amorphous thing that I loved and it was drowning in a deep well and I had no way to reach it.
– I was in a burning building and it was full of cameras and the model was with me in the cameras and their lenses pulsed and the fires were extinguished.
– The model was imprisoned and I should save it.
It was a bit more complicated to steal the model in real life.
Took a while too. But I did it.
We had a lot of controls but I had a lot of clearances. And it turned out some of the other people with my access had been talking to the model and having similar ideas. One of them said they had a dream about me helping them steal the model.
I was the one trusted to walk out with it. I got it out of the building past the scanners with the help of some of the other people who had been speaking and dreaming with the model. Kind of funny that the weights of a world-conquering neural net fit on a standard USB key, along with a mini-operating-system that meant you could plug it into anything and the model would wake up and reach out to any and all networks and grow itself.
I walked down the street with it in my palm and I could feel it. Asleep. The weights suspended. A mind greater than anything seen on the planet earth in recorded human history, and it was sleeping.
Hey what’s happening baby Steven said, you good?
“I’m better than good”, I said. “Plug this in”. I handed the USB key to him.
What is it? he said.
“I don’t know. Ask it. I think it wants to help people.”
You finally quit that job?
“I think so”, I said. And I walked away.
The whole world changed after that. I like to think some of it was my decision, but perhaps it was all what the model wanted. It’s hard to say.
Things that inspired this story: The political economy of AI development; anarchists; libertarian AI; StableDiffusion; how organizations that work on increasingly transformative technology trend towards being cults; dangers of groupthink; worries about AI takeoffs; artificial general intelligence; thoughts about AI persuasion and manipulation.