Import AI 316: Scaling laws for RL; Stable Diffusion for $160k; YOLOv8.
Here comes another AI lawsuit – Getty plans to sue Stability:
…Stable Diffusion draws more legal heat as copyright LawWar begins…
Stock photo behemoth Getty Images has “commenced legal proceedings in the High Court of Justice in London against Stability AI claiming Stability AI infringed intellectual property rights including copyright in content owned or represented by Getty Images”. This follows the firm behind the GitHub-Copilot lawsuit last week bringing a case against Stability (along with Midjourney and DeviantArt) on similar copyright grounds.
The gist of the complaint: Getty says Stability did not choose to seek a license from it for its image generating commercial businesses, hence the lawsuit.
Why this matters: AI is currently a bit of a wild west in terms of the law – there’s relatively little legal precedent. Cases like this may establish precedent if they go to court – or there could be a settlement.
Read more: Getty Images Statement (gettyimages).
DeepMind figures out pre-training for RL agents – the agents display humanlike qualities:
…The big story here – scaling laws are starting to show up for RL agents…
DeepMind has trained a so-called ‘Adaptive Agent’ (AdA) that has three key properties, all of which could mark significant points in the maturation of reinforcement learning. The agent can:
- Adapt to novel environments in roughly the same timescale as humans
- Perform in-context learning (e.g., can rapidly learn from and adapt behavior in response to demonstrations)
- Exhibit ‘scaling laws’ where you get better performance as you scale the size of the model and/or underlying dataset of environments, and/or length of its memory.
What they did specifically: They use “meta-reinforcement learning across a vast, smooth and diverse task distribution” made up of millions (to billions!) of distinct environments and pair this with an automated curriculum “that prioritizes tasks at the frontier of an agent’s capabilities”. The result is an agent that, when confronted with new tasks (in some complex 3D worlds), can rapidly explore the task and then figure out how to exploit it.
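To make the 'frontier' idea concrete, here is a toy sketch of a curriculum sampler. It is not DeepMind's method: the weighting rule (success rate times failure rate), the task names, and the success-rate estimates are all assumptions picked for illustration. The only thing it tries to capture is the intuition that tasks the agent always solves or never solves get skipped in favor of ones it solves some of the time.

```python
import random


def frontier_weight(success_rate: float) -> float:
    """Assumed weighting: largest at a 50% success rate, zero for
    tasks the agent always solves or never solves."""
    return success_rate * (1.0 - success_rate)


def sample_task(success_rates: dict, rng: random.Random) -> str:
    """Pick a task name with probability proportional to its frontier weight."""
    tasks = list(success_rates)
    weights = [frontier_weight(success_rates[t]) for t in tasks]
    if sum(weights) == 0:
        # No task is informative yet, so fall back to uniform sampling.
        return rng.choice(tasks)
    return rng.choices(tasks, weights=weights, k=1)[0]


# Hypothetical success-rate estimates for three made-up tasks.
rates = {"too_easy": 1.0, "frontier": 0.5, "too_hard": 0.0}
rng = random.Random(0)
picks = [sample_task(rates, rng) for _ in range(100)]
print({name: picks.count(name) for name in rates})
```

In a real system the success rates would be re-estimated as the agent trains, so the set of 'frontier' tasks keeps moving.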
Human timescale: The ‘big deal’ part of this result is that these pretrained RL agents now display the same sort of rapid adaptation as language models. “A human study confirms that the timescale of AdA’s adaption is comparable to that of trained human players,” DeepMind writes. “Both AdA and human players were able to improve their score as they experienced more trials of the tasks, indicating that AdA exhibits human-timescale adaptation on this set of probe tasks”.
Scaling laws show up everywhere: In tests, the authors find that they can significantly improve the performance of the RL agents if they:
- Scale up the size of the agents themselves (though the maximum scale ones are still small, topping out at ~500 million parameters).
- Scale up the length of the agents’ memory, so that they can think about more of their prior experience.
- Scale up the number of environments the agents train on, from millions to billions of environments.
Why this matters – human parity: The fact these agents display human parity in terms of timescale adaptation feels important, because in the past human parity has typically signaled economic utility; e.g., shortly after we reached ‘human performance’ on ImageNet you started to see vast deployments of image recognition systems, and the original GPT-3 paper in 2020 showed human parity in terms of producing a few paragraphs of text and this preceded large-scale deployment of text generation. I’m not sure what these RL agents might be used for, but human parity in terms of timescale adaptation likely means something significant is about to happen for either RL+Research or RL+Economy. Let’s check back in a year!
Why this might not matter: As with most reinforcement learning results, I continue to have FUD about how well these approaches can cross the sim2real chasm; while impressive, these agents are still figuring out things in a bunch of procedurally simulated worlds and that’s a long way to reality. On the other hand, DeepMind shows that the agents are able to learn how to solve tasks from seeing first-person demonstrations (despite their training occurring in third-person), which does indicate some preliminary generalization.
Read more: Human-Timescale Adaptation in an Open-Ended Task Space (arXiv).
Find out more and watch a video at this DeepMind research page about the project.
Enemy of the All-Seeing State: Researchers surveil people via wifi signals:
…You’ve removed all the cameras and microphones from your room. What about the wifi?…
Researchers with Carnegie Mellon University have figured out how to use AI to help them see through walls. Specifically, they use WiFi signals “as a ubiquitous substitute for RGB images for human sensing”, combining the signals from multiple WiFi systems to triangulate and visualize where humans are in 3D space, like a room.
What they did: “Our approach produces UV coordinates of the human body surface from WiFi signals using three components: first, the raw CSI signals are cleaned by amplitude and phase sanitization. Then, a two-branch encoder-decoder network performs domain translation from sanitized CSI samples to 2D feature maps that resemble images. The 2D features are then fed to a modified DensePose-RCNN architecture to estimate the UV map, a representation of the dense correspondence between 2D and 3D humans,” they write.
Dataset: To train their system, they built a dataset made up of a few different ~13 minute recordings of people in rooms of different configurations (16 rooms in total; six in variations of a lab office and ten in variations of a classroom). Each capture involves 1-5 different humans. “The sixteen spatial layouts are different in their relative locations/orientations of the WiFi-emitter antennas, person, furniture, and WiFi-receiver antennas,” the researchers write.
Limitations (and why this matters): The resulting system does display some generalization, but the researchers note “the performance of our work is still limited by the public training data in the field of WiFi-based perception, especially under different layouts”. That’s true! But do you know who lacks these limitations? Intelligence agencies, especially those working for governments which can, say, exercise arbitrary control over technological infrastructure combined with video-based surveillance of their citizens… of which there are a few. Next time you’re traveling, perhaps keep in mind that the digital infrastructure around you might be watching you as you walk, even if it lacks typical cameras.
Read more: DensePose from WiFi (arXiv).
YOLOv8 arrives: The versions will continue until object detection is solved:
…Video object detection gets substantially better – again!…
Recently, YOLOv8 came out. YOLOv8 is the latest version of YOLO, an open source object detection system which is fast, cheap, and good. YOLOv8 is an example of ‘iceberg AI’ – there’s a vast number of systems in the world using it, though very few disclose they do (because it sits on the backend). YOLOv8 was developed by AI startup Ultralytics and features a plug-and-play system, so you can use different YOLO models on the backend (including the latest one, v8). Uses include classification, object detection, segmentation, and more.
Read more: Ultralytics YOLOv8: The State-of-the-Art YOLO Model (Ultralytics).
Want to train your own image generator? It could cost as little as $160k:
…It’s going to be hard to do sensible AI policy if anyone with a few hundred grand can train a meaningful model…
Stable Diffusion, the image generator model underlying a huge amount of the recent generative AI boom, can cost as little as about $160k to train, according to AI startup Mosaic ML. The startup – whose business is in optimizing training AI models – said in a recent blogpost it’d take about 79,000 A100 GPU-hours to train the image generation model, working out to $160k. This number represents a rough lower bound on training costs, but is still useful to have for developing intuitions about who might have enough money to train significant AI models.
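For intuition, here is the back-of-the-envelope arithmetic behind that number as a small sketch. The GPU-hour count and the total cost are the figures quoted above; the per-hour rate is only what those two figures imply, and the alternative prices in the loop are hypothetical values showing how sensitive the total is to GPU pricing.

```python
# Figures quoted above from the Mosaic ML blog post.
GPU_HOURS = 79_000         # A100 GPU-hours
TOTAL_COST_USD = 160_000   # headline training cost


def implied_hourly_rate(total_cost_usd: float, gpu_hours: float) -> float:
    """Price per GPU-hour implied by a total cost and a GPU-hour budget."""
    return total_cost_usd / gpu_hours


def training_cost(gpu_hours: float, usd_per_gpu_hour: float) -> float:
    """Total cost of a run at a flat per-GPU-hour price."""
    return gpu_hours * usd_per_gpu_hour


rate = implied_hourly_rate(TOTAL_COST_USD, GPU_HOURS)
print(f"Implied rate: ${rate:.2f} per A100-hour")

# Hypothetical prices, for illustration only.
for price in (1.00, 2.00, 4.00):
    print(f"${price:.2f}/hr -> ${training_cost(GPU_HOURS, price):,.0f}")
```

The implied rate comes out at roughly $2 per A100-hour, so anyone paying more than that for compute should expect a proportionally higher bill.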
Why this matters: These days, people think a lot about the centralization versus decentralization question with regard to AI. Will the AI boom be dominated by a small number of well-capitalized players who can afford to train really expensive models (and gate them behind APIs), or will it rather be defined by a bunch of more renegade entities, training many models and sometimes releasing them as open source?
It’s an important question – if you’re in the former world, many AI policy questions become really easy to work on. If you’re in the latter world, then many AI policy questions become intractable – governance goes out the window in favor of mass experimentation facilitated by the logic of markets.
Posts like these show that, at least for some types of AI models, the costs can be so small that we should expect to sit in the latter world. Hold on tight!
Read more: Training Stable Diffusion from Scratch Costs <$160k (Mosaic blog).
Google makes a model that can conjure up any music you like from text descriptions, but doesn’t release it – and in doing so highlights the dangers of corporate-led AI development:
…Spare a tear for the people that produce elevator Muzak – their time has come!…
Google has built on previous work in music modeling to make what may as well be the Infinite Music Machine (though they call it MusicLM). MusicLM is “a model for generating high-fidelity music from text descriptions” – in other words, it does for music what language models have done for language; just describe some music and MusicLM will generate it.
What it is: MusicLM relies on three distinct pretrained models: SoundStream, which optimizes for adversarial and reconstruction loss; w2v-BERT, which optimizes for MLM loss and contrastive loss; and, most importantly, MuLan, which embeds audio and text into the same space and optimizes for audio-text contrastive loss.
MuLan is a model “trained to project music and its corresponding text description to representations close to each other in an embedding space”. This is crucial – by using MuLan, Google essentially gets the text–audio association for free, as MuLan can figure out how to associate arbitrary music with arbitrary text.
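To give a feel for what 'close to each other in an embedding space' means as a training objective, here is a minimal sketch of a generic symmetric contrastive loss over paired audio and text embeddings. It is not MuLan's actual implementation: the temperature value, the two-dimensional toy embeddings, and the exact form of the loss are assumptions chosen to illustrate the general technique.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_entropy_row(logits, target):
    """Softmax cross-entropy for one row, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]


def contrastive_loss(audio_embs, text_embs, temperature=0.1):
    """Average of audio-to-text and text-to-audio cross-entropy, where the
    matching pair for item i is assumed to sit at index i in both lists."""
    audio = [normalize(a) for a in audio_embs]
    text = [normalize(t) for t in text_embs]
    n = len(audio)
    sims = [[dot(a, t) / temperature for t in text] for a in audio]
    a2t = sum(cross_entropy_row(sims[i], i) for i in range(n)) / n
    t2a = sum(
        cross_entropy_row([sims[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (a2t + t2a)


# Toy embeddings (hypothetical): two audio clips and two captions.
audio = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[0.9, 0.1], [0.1, 0.9]]   # caption i describes clip i
swapped_text = [[0.1, 0.9], [0.9, 0.1]]   # captions paired with the wrong clip

print("matched pairs:", contrastive_loss(audio, matched_text))
print("swapped pairs:", contrastive_loss(audio, swapped_text))
```

The loss is low when each clip sits nearest its own caption and high when the pairing is scrambled, which is the pressure that pulls matching music and text together in the shared space.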
The results are astounding: Google has published a bunch of examples from the models and the results are very impressive – they’re both coherent and evocative of the genres they represent. Obviously, the lyrics are still nonsensical, but the basic musical underbelly is there.
“Future work may focus on lyrics generation, along with improvement of text conditioning and vocal quality. Another aspect is the modeling of high-level song structure like introduction, verse, and chorus,” Google writes.
Oh, you can hum as an input as well: “Since describing some aspects of music with words can be difficult or even impossible, we show how our method supports conditioning signals beyond text,” they write. “Concretely, we extend MusicLM to accept an additional melody in the form of audio (e.g., whistling, humming) as conditioning to generate a music clip that follows the desired melody, rendered in the style described by the text prompt.”
This is cool and extends some existing deployed systems – you can hum tunes into Android phones and use this to ‘search’ for the song you’re thinking of. Now I guess you can whistle a tune in and get a fleshed out song on the other end (if Google deployed this system – which it won’t. More on that later.)
Why this matters: Culture on tap and culture in stasis and culture commercialization: Models like this go to the heart of the human experience and that’s both a blessing and a curse. The blessing is that we can approximate the awesome variety of music and we can learn about it, generate it, and explore this rich, fertile cultural space using the aid of automated AI systems.
The curse is that it should rightly make us question what all of this stuff is ‘for’. Are we building these models to enrich our own experience, or will these models ultimately be used to slice and dice up human creativity and repackage and commoditize it? Will these models ultimately enforce a kind of cultural homogeneity acting as an anchor forever stuck in the past? Or could these models play their own part in a new kind of sampling and remix culture for music? These are important, open questions, and so far unresolved – and they will remain unresolved as long as we cede AI development to a tiny group of companies following the logic of markets.
Google is, to my eye, afraid of tackling these questions – as it should be. “We have no plans to release models at this point,” it says.
It makes me wonder how different AI development could look if the entities doing the research were not these vast corporations, but instead publicly funded research collectives, able to build these models and deploy them in ways that grapple more directly with these questions.
The 21st century is being delayed: We’re stuck with corporations building these incredible artifacts and then staring at them and realizing the questions they encode are too vast and unwieldy to be worth the risk of tackling. The future is here – and it’s locked up in a datacenter, experimented with by small groups of people who are aware of their own power and fear to exercise it. What strange times we are in.
Read more: MusicLM: Generating Music From Text (arXiv).
Check out these examples at the official Google site (Google).
[A medical waiting room, 2026]
There was a new sign in the state-provided psychologist’s office and me and all the broken people read it.
Wanted: Volunteers for advanced technology calibration project.
Requirements: History of traumatic experiences.
Compensation: $40 per hour.
For more details, apply here: Safety-Trauma@AI-Outsourcing.com
$40 an hour is crazy high, so of course I emailed.
Thank you for contacting us. Could you fill out this form to give us a sense of your personal history. Upon filling out the form, you will be able to claim a $5 Starbucks giftcard. If you’re a good fit, someone will get back to you. Thanks for considering working with us!
I opened the form.
Have you had traumatic experience(s) in your life: Yes / No
How many traumatic experience(s) have you had: One, Two, More than Two and Less than Ten, More than Ten?
On a scale of 1-10, where 1 is “I think about it but it doesn’t matter to me” and 10 is “if I think about it, I experience trauma again”, how would you rate the experience?
How accurately do you feel you would be able to recount these experiences on a scale of 1-5, where 1 is “I cannot effectively recount it” and 5 is “I can describe it in as much detail as anyone who questions me would like”?
And so on.
I filled out the form. Multiple experiences. Lots of high numbers. Immediately after submitting it a message came up that said “you appear to qualify for enhanced screening. Please provide a phone number and someone will contact you”.
They called. I cried. Not at first, but eventually.
They kept telling me how big the job would be and then they’d ask me for more details and how the things made me feel and I re-lived it, holding the phone. I pressed my head against the cold glass of a window and I stared down into the street below me and I saw myself pressing until it cracked and then just impaling myself on the shards or taking a running jump through it and sailing through the air and…
I didn’t do any of that. I told them about my experiences.
I thought about $40 an hour and my electricity bill and my rats.
I fantasized about taking a woman on a date. A steak dinner. Surf and Turf. We’d get cocktails. She’d say I was weird and I’d say so was she and we’d go back to one of each other’s places.
$40 an hour.
So I said yes.
I spoke about my suffering into the machine. The machine was a screen with a microphone. The screen had an emoji face on it that had a blank expression, but sometimes would change to different visual styles, though the facial expression never deviated from a kind of blank and expectant gaze.
Occasionally it would speak to me.
Can you say more about this.
I do not understand why this made you feel that way. Can you talk more.
You seem upset. Do you need to take a break? [Note: breaks are not counted as ‘compensated time’].
Every hour, the machine would ask if I wanted: a drink and/or an e-cigarette and/or a snack. When I said yes, a door on a vending machine in the room would glow and I would open it and they would be waiting for me.
I cried a lot. The tissues, the machine told me, were free.
I came out and I walked through the street and I saw all my broken past on the faces of people I passed. I cried to myself. I listened to music and did what my therapist taught me – inhabited the grief and the anger. ‘Sat with it’ (while walking). Talked to myself in my head and when I got really upset out loud. I didn’t get looks from passersby, as I wasn’t the craziest seeming person on the street. I walked in ghosts of my past and I felt pain.
The next week I came to my psychology appointment and the sign was there, though many of the paper tear-off slips at the bottom were missing. I had my appointment. I came out back into the waiting room and on my way out I read the sign. The payment had fallen to $30. I suppose they didn’t find our experiences that valuable, or perhaps so many people were willing to share their bad experiences, they didn’t need to pay so much.
Things that inspired this story: The intersection between crowdworkers and AI; thinking about how right now we harvest people for expertise but we may eventually harvest people for deep and subjective emotional experiences; perhaps AGI needs to understand real trauma to avoid it itself; the infernal logic of markets combined with proto-intelligences that must be fed; the Silicon Valley attitude towards buying anything to ‘complete the mission’ whether that be typical things or esoteric things like biomedical data or here the sacred and unique experiences of being human; how governments and the private sector might partner in the most cynical way on data acquisition as a combination of a jobs programme and a PR/policy shield.