Import AI

Import AI 316: Scaling laws for RL; Stable Diffusion for $160k; YOLOv8.

Here comes another AI lawsuit –  Getty plans to sue Stability:

…Stable Diffusion draws more legal heat as copyright LawWar begins…
Stock photo behemoth Getty Images has “commenced legal proceedings in the High Court of Justice in London against Stability AI claiming Stability AI infringed intellectual property rights including copyright in content owned or represented by Getty Images”. This follows last week’s case brought against Stability (along with Midjourney and DeviantArt) on similar copyright grounds by the firm behind the GitHub Copilot lawsuit.

The gist of the complaint: Getty says Stability did not choose to seek a license from it for its image generating commercial businesses, hence the lawsuit. 

Why this matters: AI is currently a bit of a wild west in terms of the law – there’s relatively little legal precedent. Cases like this may establish precedent if they go to court – or there could be a settlement. 

   Read more: Getty Images Statement (gettyimages).

####################################################

DeepMind figures out pre-training for RL agents – the agents display humanlike qualities:

…The big story here – scaling laws are starting to show up for RL agents…

DeepMind has trained a so-called ‘Adaptive Agent’ (AdA) that has three key properties, all of which could mark significant points in the maturation of reinforcement learning. The agent can:

  • Adapt to novel environments on roughly the same timescale as humans
  • Perform in-context learning (e.g., rapidly learn from and adapt its behavior in response to demonstrations) 
  • Exhibit ‘scaling laws’, where you get better performance as you scale the size of the model, the underlying dataset of environments, and/or the length of its memory.

What they did specifically: They perform “meta-reinforcement learning across a vast, smooth and diverse task distribution” made up of millions (to billions!) of distinct environments, and pair this with an automated curriculum “that prioritizes tasks at the frontier of an agent’s capabilities”. The result is an agent that, when confronted with new tasks (in some complex 3D worlds), can rapidly explore the task and then figure out how to exploit it.
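To make the curriculum idea concrete, here’s a minimal sketch of a ‘frontier’ auto-curriculum loop – the agent, task pool, and scoring heuristic are hypothetical stand-ins, not DeepMind’s actual implementation:

```python
import random

# Hypothetical stand-ins for AdA's components -- not DeepMind's actual code.
# Idea: keep training on tasks at the "frontier" of the agent's ability --
# neither trivially solved nor hopeless.

def frontier_score(agent, task, trials=5):
    """Score how 'learnable' a task is: peaks when the agent succeeds ~50% of the time."""
    successes = [agent.evaluate(task) for _ in range(trials)]  # evaluate() is hypothetical
    mean = sum(successes) / trials
    return 1.0 - abs(mean - 0.5) * 2

def auto_curriculum_step(agent, task_pool, n_candidates=64, batch_size=8):
    """One curriculum step: sample candidate tasks, train on the most frontier-like ones."""
    candidates = random.sample(task_pool, n_candidates)
    candidates.sort(key=lambda t: frontier_score(agent, t), reverse=True)
    for task in candidates[:batch_size]:
        agent.train_on(task)  # train_on() is hypothetical: a few episodes of RL, memory intact
```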

Human timescale: The ‘big deal’ part of this result is that these pretrained RL agents now display the same sort of rapid adaptation as language models. “A human study confirms that the timescale of AdA’s adaptation is comparable to that of trained human players,” DeepMind writes. “Both AdA and human players were able to improve their score as they experienced more trials of the tasks, indicating that AdA exhibits human-timescale adaptation on this set of probe tasks”.

Scaling laws show up everywhere: In tests, the authors find that they can significantly improve the performance of the RL agents if they:

  • Scale up the size of the agents themselves (though even the largest ones are still small, topping out at ~500 million parameters).
  • Scale up the length of the agents’ memory, so that they can think about more of their prior experience.
  • Scale up the number of environments the agents train on, from millions to billions of environments. 

Why this matters – human parity: The fact that these agents display human parity in terms of timescale adaptation feels important, because in the past human parity has typically signaled economic utility; e.g., shortly after we reached ‘human performance’ on ImageNet you started to see vast deployments of image recognition systems, and the original GPT3 paper in 2020 showed human parity in producing a few paragraphs of text, which preceded large-scale deployment of text generation. I’m not sure what these RL agents might be used for, but human parity in terms of timescale adaptation likely means something significant is about to happen for either RL+Research or RL+Economy. Let’s check back in a year!

Why this might not matter: As with most reinforcement learning results, I continue to have FUD about how well these approaches can cross the sim2real chasm; while impressive, these agents are still figuring out things in a bunch of procedurally simulated worlds and that’s a long way to reality. On the other hand, DeepMind shows that the agents are able to learn how to solve tasks from seeing first-person demonstrations (despite their training occurring in third-person), which does indicate some preliminary generalization. 

   Read more: Human-Timescale Adaptation in an Open-Ended Task Space (arXiv).

   Find out more and watch a video at this DeepMind research page about the project.

####################################################

Enemy of the All-Seeing State: Researchers surveil people via wifi signals:

…You’ve removed all the cameras and microphones from your room. What about the wifi?…

Researchers with Carnegie Mellon University have figured out how to use AI to help them see through walls. Specifically, they use WiFi signals “as a ubiquitous substitute for RGB images for human sensing”, taking the signals from multiple WiFi systems to triangulate and visualize where humans are in 3D space, like a room. 

What they did: “Our approach produces UV coordinates of the human body surface from WiFi signals using three components: first, the raw CSI signals are cleaned by amplitude and phase sanitization. Then, a two-branch encoder-decoder network performs domain translation from sanitized CSI samples to 2D feature maps that resemble images. The 2D features are then fed to a modified DensePose-RCNN architecture to estimate the UV map, a representation of the dense correspondence between 2D and 3D humans,” they write.
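For intuition, here’s a schematic PyTorch sketch of that pipeline – the layer widths, shapes, and module names are illustrative guesses, not the paper’s actual architecture:

```python
import torch
import torch.nn as nn

class CSIToFeatureMap(nn.Module):
    """Two-branch encoder-decoder: amplitude + phase CSI tensors -> image-like 2D features.
    Layer widths and shapes are illustrative guesses, not the paper's architecture."""
    def __init__(self, csi_channels=30, out_channels=3, out_size=(720, 1280)):
        super().__init__()
        self.out_size = out_size
        self.amp_enc = nn.Sequential(nn.Conv2d(csi_channels, 64, 3, padding=1), nn.ReLU())
        self.phase_enc = nn.Sequential(nn.Conv2d(csi_channels, 64, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, 1),
        )

    def forward(self, amp, phase):
        # Fuse the two sanitized CSI branches, then decode to an "image".
        x = torch.cat([self.amp_enc(amp), self.phase_enc(phase)], dim=1)
        x = self.decoder(x)
        # Upsample so a DensePose-style head (e.g. detectron2's DensePose-RCNN)
        # can consume the result exactly as it would an RGB image.
        return nn.functional.interpolate(x, size=self.out_size, mode="bilinear", align_corners=False)

amp = torch.randn(1, 30, 64, 64)    # dummy sanitized CSI amplitude
phase = torch.randn(1, 30, 64, 64)  # dummy sanitized CSI phase
features = CSIToFeatureMap()(amp, phase)  # shape: (1, 3, 720, 1280)
```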

Dataset: To train their system, they built a dataset made up of a few different ~13 minute recordings of people in rooms of different configurations (16 rooms in total; six in variations of a lab office and ten in variations of a classroom). Each capture involves 1-5 different humans. “The sixteen spatial layouts are different in their relative locations/orientations of the WiFi-emitter antennas, person, furniture, and WiFi-receiver antennas,” the researchers write. 

Limitations (and why this matters): The resulting system does display some generalization, but the researchers note “the performance of our work is still limited by the public training data in the field of WiFi-based perception, especially under different layouts”. That’s true! But do you know who lacks these limitations? Intelligence agencies, especially those working for governments which can, say, exercise arbitrary control over technological infrastructure combined with video-based surveillance of their citizens… of which there are a few. Next time you’re traveling, perhaps keep in mind that the digital infrastructure around you might be watching you as you walk, even if it lacks typical cameras. 

   Read more: DensePose from WiFi (arXiv).

####################################################

YOLOv8 arrives: The versions will continue until object detection is solved:

…Video object detection gets substantially better – again!…

Recently, YOLOv8 came out. YOLOv8 is the latest version of YOLO, an open source object detection system which is fast, cheap, and good. YOLOv8 is an example of ‘iceberg AI’ – there’s a vast number of systems in the world using it, though very few disclose that they do (because it sits on the backend). YOLOv8 was developed by AI startup Ultralytics, whose software features a plug-and-play system, so you can use different YOLO models on the backend (including the latest one, v8). Uses include classification, object detection, segmentation, and more. 
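If you want to kick the tires, the Ultralytics Python package exposes a simple interface along these lines (attribute names per the Ultralytics docs at the time of writing; treat the specifics as illustrative):

```python
# pip install ultralytics
from ultralytics import YOLO

# Load a small pretrained YOLOv8 detection model (weights download on first use).
model = YOLO("yolov8n.pt")

# Run detection; each result carries bounding boxes, class ids, and confidences.
results = model("https://ultralytics.com/images/bus.jpg")
for r in results:
    print(r.boxes.xyxy, r.boxes.cls, r.boxes.conf)
```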

   Read more: Ultralytics YOLOv8: The State-of-the-Art YOLO Model (Ultralytics).

####################################################

Want to train your own image generator? It could cost as little as $160k: 

…It’s going to be hard to do sensible AI policy if anyone with a few hundred grand can train a meaningful model…

Stable Diffusion, the image generator model underlying a huge amount of the recent generative AI boom, can cost as little as about $160k to train, according to AI startup Mosaic ML. The startup – whose business is in optimizing training AI models – said in a recent blogpost it’d take about 79,000 A100 GPU-hours to train the image generation model, working out to $160k. This number represents a rough lower bound on training costs, but is still useful to have for developing intuitions about who might have enough money to train significant AI models.
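For intuition, the implied GPU price is easy to back out (the per-hour rate below is inferred from Mosaic’s two headline numbers, not quoted by them):

```python
gpu_hours = 79_000        # A100 GPU-hours quoted by Mosaic
total_cost = 160_000      # rough training cost in USD
print(f"Implied A100 price: ${total_cost / gpu_hours:.2f}/GPU-hour")  # ~$2/GPU-hour

# Sensitivity check: at on-demand cloud rates closer to $4/GPU-hour, the bill roughly doubles.
print(f"At $4/GPU-hour: ${gpu_hours * 4:,.0f}")
```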

Why this matters: These days, people think a lot about the centralization versus decentralization question with regard to AI. Will the AI boom be dominated by a small number of well-capitalized players who can afford to train really expensive models (and gate them behind APIs), or will it rather be defined by a bunch of more renegade entities, training many models and sometimes releasing them as open source? 

   It’s an important question – if you’re in the former world, many AI policy questions become really easy to work on. If you’re in the latter world, then many AI policy questions become intractable – governance goes out the window in favor of mass experimentation facilitated by the logic of markets. 

   Posts like these show that, at least for some types of AI models, the costs can be so small that we should expect to sit in the latter world. Hold on tight!

   Read more: Training Stable Diffusion from Scratch Costs <$160k (Mosaic blog).

####################################################

Google makes a model that can conjure up any music you like from text descriptions, but doesn’t release it – and in doing so highlights the dangers of corporate-led AI development:

…Spare a tear for the people that produce elevator Muzak – their time has come!… 

Google has built on previous work in music modeling to make what may as well be the Infinite Music Machine (though they call it MusicLM). MusicLM is “a model for generating high-fidelity music from text descriptions” – in other words, it does for music what language models have done for language; just describe some music and MusicLM will generate it. 

What it is: MusicLM relies on three distinct pretrained models: SoundStream, which optimizes for adversarial and reconstruction losses; w2v-BERT, which optimizes for an MLM loss and a contrastive loss; and, most importantly, MuLan, which embeds audio and text into the same space and optimizes for an audio-text contrastive loss. 

   MuLan is a model “trained to project music and its corresponding text description to representations close to each other in an embedding space”. This is crucial – by using MuLan, Google essentially gets the text–audio association for free, as MuLan can figure out how to associate arbitrary music with arbitrary text. 
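Conceptually, the generation path looks something like the sketch below – the object and method names are placeholders for the three pretrained models, not a real API:

```python
def generate_music(text_prompt, mulan, semantic_lm, acoustic_lm, soundstream_decoder):
    """Schematic MusicLM-style pipeline; all objects are placeholders, not a real API."""
    # 1. MuLan embeds the text prompt into the shared audio/text space.
    mulan_tokens = mulan.embed_text(text_prompt)

    # 2. A language model generates semantic tokens (w2v-BERT-style), conditioned on MuLan tokens.
    semantic_tokens = semantic_lm.generate(conditioning=mulan_tokens)

    # 3. A second language model generates acoustic tokens (SoundStream codes),
    #    conditioned on both the MuLan tokens and the semantic tokens.
    acoustic_tokens = acoustic_lm.generate(conditioning=(mulan_tokens, semantic_tokens))

    # 4. The SoundStream decoder turns acoustic tokens back into a waveform.
    return soundstream_decoder.decode(acoustic_tokens)
```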

The results are astounding: Google has published a bunch of examples from the models and the results are very impressive – they’re both coherent and evocative of the genres they represent. Obviously, the lyrics are still nonsensical, but the basic musical underbelly is there. 

   “Future work may focus on lyrics generation, along with improvement of text conditioning and vocal quality. Another aspect is the modeling of high-level song structure like introduction, verse, and chorus,” Google writes. 

Oh, you can hum as an input as well: “Since describing some aspects of music with words can be difficult or even impossible, we show how our method supports conditioning signals beyond text,” they write. “Concretely, we extend MusicLM to accept an additional melody in the form of audio (e.g., whistling, humming) as conditioning to generate a music clip that follows the desired melody, rendered in the style described by the text prompt.”

   This is cool and extends some existing deployed systems – you can hum tunes into Android phones and use this to ‘search’ for the song you’re thinking of. Now I guess you can whistle a tune in and get a fleshed out song on the other end (if Google deployed this system – which it won’t. More on that later.) 

Why this matters: Culture on tap and culture in stasis and culture commercialization: Models like this go to the heart of the human experience and that’s both a blessing and a curse. The blessing is that we can approximate the awesome variety of music and we can learn about it, generate it, and explore this rich, fertile cultural space with the aid of automated AI systems. 

   The curse is that it should rightly make us question what all of this stuff is ‘for’. Are we building these models to enrich our own experience, or will these models ultimately be used to slice and dice up human creativity and repackage and commoditize it? Will these models ultimately enforce a kind of cultural homogeneity acting as an anchor forever stuck in the past? Or could these models play their own part in a new kind of sampling and remix culture for music? These are important, open questions, and so far unresolved – and they will remain unresolved as long as we cede AI development to a tiny group of companies following the logic of markets.

   Google is, to my eye, afraid of tackling these questions – as it should be. “We have no plans to release models at this point,” it says. 

   It makes me wonder how different AI development could look if the entities doing the research were not these vast corporations, but instead publicly funded research collectives, able to build these models and deploy them in ways that grapple more directly with these questions. 

The 21st century is being delayed: We’re stuck with corporations building these incredible artifacts and then staring at them and realizing the questions they encode are too vast and unwieldy to be worth the risk of tackling. The future is here – and it’s locked up in a datacenter, experimented with by small groups of people who are aware of their own power and fear to exercise it. What strange times we are in.

   Read more: MusicLM: Generating Music From Text (arXiv).

Check out these examples at the official Google site (Google).

####################################################

Tech Tales:

Trauma Crowdwork

[A medical waiting room, 2026]

There was a new sign in the state-provided psychologist’s office and me and all the broken people read it.

Wanted: Volunteers for advanced technology calibration project. 

Requirements: History of traumatic experiences. 

Compensation: $40 per hour. 

For more details, apply here: Safety-Trauma@AI-Outsourcing.com

$40 an hour is crazy high, so of course I emailed. 

Thank you for contacting us. Could you fill out this form to give us a sense of your personal history. Upon filling out the form, you will be able to claim a $5 Starbucks giftcard. If you’re a good fit, someone will get back to you. Thanks for considering working with us!

I opened the form.

Have you had traumatic experience(s) in your life: Yes / No

How many traumatic experience(s) have you had: One, Two, More than Two and Less than Ten, More than Ten?

On a scale of 1-10, where 1 is “I think about it but it doesn’t matter to me” and 10 is “if I think about it, I experience trauma again”, how would you rate the experience?

How accurately do you feel you would be able to recount these experiences on a scale of 1-5, where 1 is “I cannot effectively recount it” and 5 is “I can describe it in as much detail as anyone who questions me would like”?

And so on.

I filled out the form. Multiple experiences. Lots of high numbers. Immediately after submitting it a message came up that said “you appear to qualify for enhanced screening. Please provide a phone number and someone will contact you”.

***

They called. I cried. Not at first, but eventually. 

They kept telling me how big the job would be and then they’d ask me for more details and how the things made me feel and I re-lived it, holding the phone. I pressed my head against the cold glass of a window and I stared down into the street below me and I saw myself pressing until it cracked and then just impaling myself on the shards or taking a running jump through it and sailing through the air and…

I didn’t do any of that. I told them about my experiences. 

I thought about $40 an hour and my electricity bill and my rats.

I fantasized about taking a woman on a date. A steak dinner. Surf and Turf. We’d get cocktails. She’d say I was weird and I’d say so was she and we’d go back to one of each other’s places. 

$40 an hour. 

So I said yes. 

***

I spoke about my suffering into the machine. The machine was a screen with a microphone. The screen had an emoji face on it that had a blank expression, but sometimes would change to different visual styles, though the facial expression never deviated from a kind of blank and expectant gaze.

   Occasionally it would speak to me. 

   Can you say more about this. 

   I do not understand why this made you feel that way. Can you talk more. 

   You seem upset. Do you need to take a break? [Note: breaks are not counted as ‘compensated time’].

   Every hour, the machine would ask if I wanted: a drink and/or an e-cigarette and/or a snack. When I said yes, a door on a vending machine in the room would glow and I would open it and they would be waiting for me. 

   I cried a lot. The tissues, the machine told me, were free. 

I came out and I walked through the street and I saw all my broken past on the faces of people I passed. I cried to myself. I listened to music and did what my therapist taught me – inhabited the grief and the anger. ‘Sat with it’ (while walking). Talked to myself in my head and when I got really upset out loud. I didn’t get looks from passersby, as I wasn’t the craziest-seeming person on the street. I walked in ghosts of my past and I felt pain. 

***

The next week I came to my psychology appointment and the sign was there, though many of the paper tear-off slips at the bottom were missing. I had my appointment. I came out back into the waiting room and on my way out I read the sign. The payment had fallen to $30. I suppose they didn’t find our experiences that valuable, or perhaps so many people were willing to share their bad experiences, they didn’t need to pay so much. 

Things that inspired this story: The intersection between crowdworkers and AI; thinking about how right now we harvest people for expertise but we may eventually harvest people for deep and subjective emotional experiences; perhaps AGI needs to understand real trauma to avoid it itself; the infernal logic of markets combined with proto-intelligences that must be fed; the Silicon Valley attitude towards buying anything to ‘complete the mission’ whether that be typical things or esoteric things like biomedical data or here the sacred and unique experiences of being human; how governments and the private sector might partner in the most cynical way on data acquisition as a combination of a jobs programme and a PR/policy shield.

Import AI 315: Generative antibody design; RL’s ImageNet moment; RL breaks Rocket League

Facebook and Shutterstock partner to slurp up stock images and train gen models on them:
…The Data Wild West is transitioning into the rest of Capitalism…
Facebook and Shutterstock have extended their partnership, giving the social network a greater ability to use Shutterstock’s vast archive of images to train machine learning models. This follows Shutterstock earlier partnering with OpenAI and also LG AI Research. 
   “By tapping into Shutterstock’s collection of millions of images, videos and music, Meta plans to use these datasets to develop, train and evaluate its machine learning capabilities,” Shutterstock wrote in a press release announcing the deal. (It also seems like a move to sidestep the sorts of legal issues that Stable Diffusion, Midjourney, and DeviantArt are finding themselves in – see later in this issue).

Why this matters: Given the success of image (and, soon, video) models, it’s logical that tech companies want to partner with large sources of data. This deal highlights how strategic data is becoming, and also shows how the AI systems of the future will neatly recapitulate the power structures of the present via following the established ‘gradients’ of capitalism. So it goes.
   Read more: Shutterstock Expands Long-standing Relationship with Meta (CISION).

####################################################

DeepMind makes a general-purpose RL algorithm – it works really well!
…RL might have just had its ImageNet moment…
Researchers with DeepMind and the University of Toronto have built DreamerV3, a “general and scalable [RL] algorithm based on world models that outperforms previous approaches across a wide variety of domains with fixed hyperparameters”. In other words, it’s one system which you can train on different tasks without too much fiddling – and it works well! This is potentially quite significant; RL agents tend to either generalize widely but perform poorly (or inefficiently), or perform fantastically but generalize poorly. DreamerV3 seems to generalize widely and perform very well. 

   DreamerV3 also solves a longstanding benchmark (well, four years old, which is ancient in the dog-year pace at which AI development happens) – it’s able to learn how to play Minecraft and, in some games, obtain the ‘diamond’, which involves exploring the game and climbing the tech tree. 

What it is: “DreamerV3 learns a world model from experience,” the researchers write. Specifically, DreamerV3 “consists of 3 neural networks: the world model predicts future outcomes of potential actions, the critic judges the value of each situation, and the actor learns to reach valuable situations”. Basically, the world model learns to represent the environment and make forward predictions, and the actor/critic take actions and figure out if the actions were worthwhile. 
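Here’s a schematic sketch of how those three networks divide the labor during a training update – placeholder modules standing in for the real thing, not DeepMind’s code:

```python
def dreamer_update(world_model, actor, critic, replay_batch, horizon=15, gamma=0.99):
    """Schematic DreamerV3-style update (placeholder modules, not DeepMind's code)."""
    # 1. World model: learn to represent observations and predict rewards/next states.
    wm_loss = world_model.representation_loss(replay_batch)

    # 2. Imagination: starting from replayed states, roll the policy forward *inside*
    #    the learned model rather than the real environment.
    state = world_model.encode(replay_batch)
    log_probs, values, rewards = [], [], []
    for _ in range(horizon):
        action, log_prob = actor(state)
        values.append(critic(state))           # V(s_t) before stepping
        state, reward = world_model.imagine_step(state, action)
        rewards.append(reward)
        log_probs.append(log_prob)
    bootstrap = critic(state)                  # V(s_{t+H}) for the tail of the return

    # 3. Discounted imagined returns; the critic regresses onto them, the actor climbs them.
    returns, ret = [], bootstrap
    for r in reversed(rewards):
        ret = r + gamma * ret
        returns.insert(0, ret)
    critic_loss = sum((v - R.detach()) ** 2 for v, R in zip(values, returns)).mean()
    actor_loss = -sum(lp * R.detach() for lp, R in zip(log_probs, returns)).mean()
    return wm_loss + critic_loss + actor_loss
```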

Model scaling comes to RL: RL agents are wildly tiny compared to language models, but they are starting to exhibit scaling properties; here, the authors train DreamerV3 in sizes ranging from 8M to 200M parameters and demonstrate a reliable scaling law “where increased model size leads to monotonic improvements in final performance and data-efficiency.” This is pretty meaningful – when stuff starts reliably scaling, you’ve probably built something simple enough that it won’t break under extreme duress. 

   Counterintuitively small: The agents are also very efficient to train. “All DreamerV3 agents are trained on one Nvidia V100 GPU each,” the authors write. Part of why they’re so easy to train is, unlike large generative models pre-trained on chunks of the internet, these agents aren’t pre-trained so they aren’t massive models to begin with. 

Benchmark-palooza: DeepMind tests out DreamerV3 on a ton of diverse benchmarks. The results are pretty convincing, indicating that DreamerV3 both generalizes and does so in a high-performance and data-efficient way. Specifically:

  • Proprio Control Suite; 18 continuous control tasks, ranging from classical control over locomotion to robot manipulation tasks. DreamerV3 sets a new state-of-the-art on this benchmark, outperforming D4PG, DMPO, and MPO
  • Visual Control Suite; 20 continuous control tasks where the agent receives only high-dimensional images as inputs. DreamerV3 establishes a new state-of-the-art, outperforming DrQ-v2 and CURL
  • Atari 100k; 26 Atari games. DreamerV3 outperforms most well-ranking systems (IRIS, SPR, SimPLe), though doesn’t get as good a score as EfficientZero (which combines online tree search, prioritized replay, hyperparameter scheduling, and allows early resets of games).
  • Atari 200M; 55 Atari games with a budget of 200M environment steps (compared to hundreds of thousands for the above). “DreamerV3 outperforms DreamerV2 with a median score of 302% compared to 219%, as well as the top model-free algorithms Rainbow and IQN”
  • BSuite; 23 environments with a total of 468 configurations that are designed to test credit assignment, robustness to reward scale and stochasticity, memory, generalization, and exploration. New state-of-the-art, beating Bootstrap DQN and Muesli. 
  • Crafter, a “procedurally generated survival environment with top-down graphics and discrete actions”; DreamerV3 sets a new state-of-the-art, outperforming PPO with the LSTM-SPCNN architecture, OC-SA, DreamerV2, and Rainbow
  • DMLab; 3D environments that require spatial and temporal reasoning. DreamerV3 matches and exceeds the performance of DeepMind’s IMPALA agent in 50 million environment steps (versus 10 billion environment steps for IMPALA). 

The Minecraft result in full: Perhaps most impressively, DreamerV3 is “the first algorithm to collect diamonds in Minecraft from scratch” – a formidable challenge, requiring the agent to learn to explore the game and figure out how to climb its proverbial tech tree. An earlier result from OpenAI, VPT, used a ton of human data to do this – the fact Dreamer does it without any human data is impressive.
   “Across 40 seeds trained for 100M environment steps, DreamerV3 collects diamonds in 50 episodes. It collects the first diamond after 29M steps and the frequency increases as training progresses. A total of 24 of the 40 seeds collect at least one diamond and the most successful agent collects diamonds in 6 episodes.” (One note, though, is that DeepMind increases ‘the speed at which [MineCraft] blocks break to allow learning Minecraft with a stochastic policy’.)

Why it might and might not matter: DreamerV3 is efficient but it doesn’t directly attack the main problem with RL – reality doesn’t have a great simulator. Unless we can figure out some RL equivalent of LM pre-training (train an RL agent on enough datasets it can few-shot generalize to reality), then RL agents might always be somewhat limited – on the other hand, there are tons of worthy problems in the world which do come with simulators (e.g, managing power in buildings, stabilizing fusion reactors, etc), so the point could be moot. 
   Read more: Mastering Diverse Domains through World Models (arXiv).

####################################################

Uh-oh, an RL agent might be ruining the videogame ‘Rocket League’
…A somewhat sad microcosm of things to come…
Recently, an AI agent trained via RLGym to play the popular videogame ‘Rocket League’ has appeared on a bunch of ranked servers and started beating human players. This has caused a small uproar in the typically quite quiet and convivial Rocket League community.

What happened: It’s a little tricky to piece together, but basically it seems like someone took a bot called ‘Nexto’ trained via RLGym, then someone figured out how to port that bot to work with RLBot, which is software that enables custom bots in Rocket League. 

Why it matters: AI is going sufficiently mainstream that it’s bringing with it all the delightfully crummy parts of human nature, like cheating just for the heck of it (see also, all the TikToks where young kids explain how to use chatGPT to make money by creating random SEO spamsites). 
   Read more: RLGym Question Thread about the Nexto Cheating Situation (Reddit).
   Read more: Uh oh, people are now using AI to cheat in Rocket League (PCGamer).
   More about RLBot here.
   More about RLGym here.

####################################################

Copilot class action lawyers prepare lawsuit against StableDiffusion:
…Can you hear that? It’s the sound of the legal precedent train approaching the AI train station…
Matthew Butterick, the lawyer and programmer who instigated the class action suit against Microsoft, GitHub, and OpenAI over Github Copilot (Import AI 307), has now filed a class-action complaint against Stability AI, DeviantArt, and Midjourney over the ‘Stable Diffusion’ AI art model.

What’s the lawsuit about?: The gist of the lawsuit is that “Stable Diffusion contains unauthorized copies of millions—and possibly billions—of copyrighted images. These copies were made without the knowledge or consent of the artists”, and therefore artists deserve payment for the usage of their images – “Even assuming nominal damages of $1 per image, the value of this misappropriation would be roughly $5 billion,” Butterick writes. 
   I think the core of why this lawsuit is being filed is summed up by this phrase from Butterick et al: StableDiffusion “is a parasite that, if allowed to proliferate, will cause irreparable harm to artists, now and in the future.” 

Who the lawsuit is targeting and why: The lawsuit is targeting three entities for different reasons:

  • Stability AI; funded LAION, the German organization behind the dataset underlying Stable Diffusion, developed Stable Diffusion itself, and hosts DreamStudio, a paid app for generating stuff from SD. 
  • DeviantArt; released an app called DreamUp (a paid app built around Stable Diffusion), despite operating a site from which many images were scraped.
  • Midjourney; runs a paid generative AI app via Discord, and its founder has said Midjourney is trained on “a big scrape of the internet”. 

Why this matters: AI is, in legal terms, a lawless Wild West. That worked while it was mostly a research endeavor but isn’t going to work now we’re in the era of industrialized AI and global deployment. Lawsuits like this will set important precedents in the relationship between data inputs and AI models. 
   Read more: Stable Diffusion Litigation (official website).

####################################################

Uh-oh, there’s a new way to poison code models – and it’s really hard to detect:
…TROJANPUZZLE is a clever way to trick your code model into betraying you – if you can poison the underlying dataset…
Researchers with the University of California, Santa Barbara, Microsoft Corporation, and the University of Virginia have come up with some clever, subtle ways to poison the datasets used to train code models. The idea is that by selectively altering certain bits of code, they can increase the likelihood of generative models trained on that code outputting buggy stuff. 

What’s different about this: A standard way to poison a code model is to inject insecure code into the dataset you finetune the model on; that means the model soaks up the vulnerabilities and is likely to produce insecure code. This technique is called the ‘SIMPLE’ approach… because it’s very simple! 

Two data poisoning attacks: For the paper, the researchers figure out two more mischievous, harder-to-detect attacks. 

  • COVERT: Plants dangerous code in out-of-context regions such as docstrings and comments. “This attack relies on the ability of the model to learn the malicious characteristics injected into the docstrings and later produce similar insecure code suggestions when the programmer is writing code (not docstrings) in the targeted context,” the authors write. 
  • TROJANPUZZLE: This attack is much more difficult to detect; for each bit of bad code it generates, it only includes a subset of the payload – it masks out some of the full payload and also masks out an equivalent bit of text in a ‘trigger’ phrase elsewhere in the file. Models trained on this data learn to strongly associate the masked-out payload with the equivalent masked-out text in the trigger phrase, which means you can activate the poisoned behavior by putting the right word into the trigger. Therefore, if you have a sense of the operation you’re poisoning, you generate a bunch of examples with masked-out regions (which would seem benign to automated code inspectors); then, when a person uses the model and writes a command invoking the thing you’re targeting, the model should fill in the rest with malicious code. 

Real tests: The researchers test their approaches on Salesforce’s ‘CodeGen’ language model in two pre-trained sizes (250 million and 2.7 billion parameters), which they finetune on a dataset of 80k Python code files, of which 160 (0.2%) are poisoned. Both COVERT and TROJANPUZZLE work about as well as the far more obvious code-poisoning attack named SIMPLE, with success rates varying from 40% down to 1% across three distinct exploit types (which increase in complexity). 
Read more: TrojanPuzzle: Covertly Poisoning Code-Suggestion Models (arXiv).

####################################################

AI can design antibodies now. That’s it. That’s the headline.
…Absci Corporation makes a real breakthrough in wetlab AI…
AI startup Absci Corporation has used generative deep learning models to de novo design antibodies against three distinct targets in a zero-shot fashion. “All designs are the result of a single round of model generations with no follow-up optimization”. The three discovered antibodies display better qualities – in real world tests, no less – than human-designed ones. This is a big deal. 

The result in full: “In total, we generate and screen 440,354 antibody variants with the ACE assay to identify binding variants. We find approximately 4,000 estimated binders based on expected ACE assay binding rates (Materials and methods, Table S3) and advance a subset for further characterization,” they write. “From these screens, we further characterize 421 binders using surface plasmon resonance (SPR), finding three that bind tighter than the therapeutic antibody trastuzumab”.

Is this actually a big deal? Yes… but don’t take it from me, take it from researchers with Rensselaer Polytechnic Institute who wrote in a paper in 2015 that “the holy grail of antibody design is to accurately and reliably predict the sequences of antibodies that will bind with high affinity and specificity based solely on the sequence or composition of the antigen” – that’s pretty much what this result accomplishes.

Why this matters: This paper is yet more evidence that AI systems are capable of usefully approximating the real world. It follows results in other domains where AI systems have succeeded at predicting short-term weather patterns, stabilizing plasma in prototype fusion reactors, and doing inventory management for real-world warehouses. The takeaway should be that if you train something to fit a complex enough high-dimensional data distribution then, increasingly, it will generalize to the complexity of the real world. This has huge, mind-bending implications for society. 

   “Our work represents an important advancement in in silico antibody design with the potential to revolutionize the availability of effective therapeutics for patients,” the authors write. “Generative AI-designed antibodies will significantly reduce development timelines by generating molecules with desired qualities without the need for further optimization. Additionally, the controllability of AI-designed antibodies will enable the creation of customized molecules for specific disease targets, leading to safer and more efficacious treatments than would be possible by traditional development approaches.”
   Read more: Unlocking de novo antibody design with generative artificial intelligence (bioRxiv).
   Get the sequences of binding antibodies here: Unlocking de novo antibody design with generative artificial intelligence (GitHub).
   Read more: Advances in Antibody Design (National Library of Medicine).
Thanks to Absci Chief AI Officer Joshua Meier for taking time to discuss this result with me.

####################################################

AI War

[Hopefully never, but depends on how badly we screw up the rollout of AI technology…]

The war came at night and was over before morning. 

When we woke the currencies had changed and so had our news presenters. A new power was in charge. Our IDs swapped over. The internet sites we used were still there, but the things which were popular were different. 

On social media, we could now say some things we couldn’t say before. Other things that had been fine to say were now forbidden. 

School was the same but history classes had changed – the past was presented differently. 

Religion, surprisingly, was not altered at all – the same places of worship and all the same ancients, and the secular decline unchanged. 

Things that inspired this story: How rapidly AI wars might happen; culture as a casualty of AI war; the rise and fall of information empires; the English poet Matthew Francis.

Import AI 314: Language models + text-to-speech; emergent cooperation in wargames; ICML bans LLM-written papers

Google discovers that a language model is also an expert clinician:
…An exciting new chapter in the capability overhang chronicles…

Google and DeepMind have done some additional training on PaLM (Google’s large-scale 540B parameter language model) to create Med-PaLM, a language model that is about as good as a human clinician on certain questions. This result is a big deal – it demonstrates that given enough data and clever training techniques, language models can approximate skilled humans. (And PaLM has far better written output than your typical doctor, whose notes typically look like a drunken spider with inked feet decided to do ballet over a medical notepad). 

How they did it: Med-PaLM builds on Flan-PaLM (a model itself based on PaLM, and trained to follow instructions). Google evaluated Flan-PaLM with expert humans, identified gaps in performance on consumer medical question answering datasets, and then tweaked Flan-PaLM’s prompts, figuring out human-engineered prompts for specific kinds of medical questions which get applied on a context-dependent basis. 
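‘Instruction prompt tuning’ here means learning a small set of soft prompt vectors while the underlying model stays frozen; below is a generic PyTorch sketch of that idea (not Google’s implementation, and the inputs_embeds keyword is an assumption about the wrapped model’s interface):

```python
import torch
import torch.nn as nn

class SoftPromptedLM(nn.Module):
    """Generic prompt tuning: freeze a pretrained LM, learn only a few prompt embeddings.
    A sketch of the idea, not Google's Med-PaLM implementation."""
    def __init__(self, lm, embed_dim, prompt_len=20):
        super().__init__()
        self.lm = lm
        for p in self.lm.parameters():
            p.requires_grad = False  # the big pretrained model stays frozen
        self.soft_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)

    def forward(self, input_embeds):
        # Prepend the learned prompt vectors to the embedded input tokens.
        batch = input_embeds.shape[0]
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        # 'inputs_embeds' is an assumption about the wrapped model's interface.
        return self.lm(inputs_embeds=torch.cat([prompt, input_embeds], dim=1))
```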

How good is it? To evaluate Med-PaLM, Google built MultiMedQA, a benchmark comprising seven medical question answering datasets, six of which are pre-existing, and one of which – HealthSearchQA – is new and consists of 3375 commonly searched health questions. 
   Google’s model is about as good as a medical professional (caveats apply): In tests, clinicians judged 92.6% of Med-PaLM answers to be aligned with scientific consensus, versus 92.9% for (human!) clinician-generated answers. 
    Breaking it down more, the model gets a new state of the art on multiple-choice question answering on the MedQA dataset, with an accuracy of 67.6% (versus 50.3% for Stanford’s just-announced PubMedGPT – an improvement of more than 17 points). It also sets a new state of the art on clinical topics within the ‘MMLU’ evaluation scheme. 

Why this matters – capability overhangs are all around us: “Our results suggest that strong performance on medical question answering may be an emergent ability [90] of LLMs combined with effective instruction prompt tuning,” Google writes. This is an example of the ‘capability overhang’ phenomenon I’ve been talking about re language models for a while – existing LLMs are far more capable than we think. All it takes is some experimentation – and some experts to find new ways to phrase questions – and you can wind up with extraordinarily powerful capability jumps without retraining the model.
   This phenomenon also grows as you scale the models – the overhangs are getting larger and larger. Just add in some human experts to sprinkle some crumbs and let your language model do the rest: “the Med-PaLM results demonstrate that with instruction prompt tuning we have a data and parameter-efficient alignment technique useful for improving factors related to accuracy, factuality, consistency, safety, harm, and bias, helping close the gap with clinical experts and bringing these models closer to real-world clinical applications.”
   Read more: Large Language Models Encode Clinical Knowledge (arXiv).

####################################################

Microsoft makes better text-to-speech by pretending that speech is text:
…Maybe most problems can be configured as language modeling problems?…
Microsoft has built VALL-E, a text-to-speech system. VALL-E is a neural codec language model whose chief trick is approaching speech synthesis as if it were text modeling; rather than converting phonemes into mel-spectrograms and then waveforms (as is typical), VALL-E converts phonemes into discrete codec codes via some language modeling-esque tricks, then converts those into a waveform. This “enables various speech synthesis applications, such as zero-shot TTS, speech editing, and content creation,” the authors write. 

What they did: VALL-E is pre-trained on 60,000 hours of English speech across 7000 unique speakers (via an existing dataset called LibriLight). “VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder,” the researchers write. “The discrete acoustic tokens derived from an audio codec model enable us to treat TTS as conditional codec language modeling, and advanced prompting-based large-model techniques (as in GPTs) can be leveraged for the TTS tasks.”
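Schematically, zero-shot generation looks something like this – the codec and language model objects are placeholders, not Microsoft’s API:

```python
def vall_e_generate(phonemes, enrolled_audio, codec, token_lm):
    """Schematic VALL-E-style zero-shot TTS; all components are placeholders."""
    # 1. Encode the 3-second enrollment clip into discrete acoustic tokens with a neural codec.
    prompt_tokens = codec.encode(enrolled_audio)

    # 2. A language model continues the acoustic-token sequence, conditioned on the
    #    phoneme sequence (content) and the prompt tokens (speaker identity / style).
    generated_tokens = token_lm.generate(condition=phonemes, prefix=prompt_tokens)

    # 3. The codec decoder turns the generated tokens back into a waveform.
    return codec.decode(generated_tokens)
```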

How well does it work: “VALL-E significantly outperforms the state-of-the-art zero-shot TTS systems in terms of speech naturalness and speaker similarity, with +0.12 comparative mean opinion score (CMOS) and +0.93 similarity mean opinion score (SMOS) improvement on LibriSpeech,” they write. Additionally, because of the way it is trained, it – like language models such as GPT-3 – shows some generalization ability; “for TTS, if the model can synthesize high-quality speech for unseen speakers without fine-tuning, the model is believed to have in-context learning capability.”

Emergent capabilities: Much as with language modeling, VALL-E displays some emergent capabilities – unanticipated, cool traits, that emerge as a consequence of pre-training. “When the acoustic prompt has reverberation, VALL-E could synthesize speech with reverberation as well, whereas the baseline outputs clean speech,” they write. Additionally, “VALL-E is able to keep the same emotion of the prompt in speech synthesis, even if the model is not fine-tuned on an emotional TTS dataset.”

Why this matters – everything is a sequence, everything is emergent, everything is weird: Results like this show how a surprisingly large amount of capabilities can be learned via sequence prediction tasks. It also demonstrates how sequence prediction – when done over a sufficiently large and diverse dataset – can lead to surprising, emergent capabilities. In some sense, sequence prediction over a giant and slightly hairy blob of data seems like it might even guarantee some degree of broader generalization. This has pretty profound implications, because it suggests you might want to pour an increasing chunk of different modalities into a single embedding space and attempt sequence prediction from that (as we saw with stuff like DeepMind’s GATO) and the results can be surprising and powerful. Probably nothing…
   Read more: Neural Code Language Models are Zero-Shot Text to Speech Synthesizers (arXiv).
   Check out demos of the system here (VALL-E, Microsoft).

####################################################

ICML bans researchers from writing papers with language models:
…Moral panic, meet academic AI publishing!…

In a surprising twist, the International Conference on Machine Learning (ICML) has banned researchers from including large swaths of text generated by language models like OpenAI’s chatGPT “unless the produced text is presented as part of the paper’s experimental analysis”. You’d think an AI conference would be enthusiastic about people using AI to do better AI research, but ICML thinks differently.

The reasoning: In a statement, ICML said the policy is designed to prohibit authors from using text produced entirely by language models (though they’re free to use LLMs to edit or polish author-written text). “The LLM policy is largely predicated on the principle of being conservative with respect to guarding against potential issues of using LLMs, including plagiarism,” they write. “We expect this policy may evolve in future conferences as we understand LLMs and their impacts on scientific publishing better.”

   The idea here is it’s hard to anticipate the consequences of using LLMs to generate text, so authors shouldn’t do it. The fact this policy is basically unenforceable feels like a pretty weird aspect of this, and the ICML response is “to investigate any potential violation of the LLM policy when a submission is brought to our attention with a significant concern about a potential violation”. 

Why this matters: Back in the middle ages people used to go on witch hunts on the basis of little more than a rumor. This ICML policy feels like a very bizarre reaction to a new technology and it means people who don’t like certain papers can accuse the papers of being AI-generated and cause an investigation to occur. This is… extremely bad? I suppose the solution is to introduce random sperlink meestakes into your pper so it doesn’t seem so likely to be gen’d by a language model. 

   Read more: Clarification on Large Language Model Policy LLM (ICML).

####################################################

RL agents display emergent behavior in a 2D wargame:

…The silicon players of games also figure out some wild endgame strategies…
Researchers with Tsinghua University, Shenzhen International Graduate School, the University of California at San Diego, and AI startup Parametrix.ai have trained AI agents to compete against each other in a 2D gridworld strategy game. The cool part about this research is the arrival of – you guessed it – emergent complexity as a consequence of a large-scale training process. Here, the agents end up learning surprisingly rich and intuitive battle tactics as a consequence of a three-phase RL-training process, giving us yet another example of how contemporary AI systems tend to exhibit behaviors that don’t directly map 1:1 to how they were trained or incentivized. 

What they did: Here, they train AI agents in a gridworld named ‘Lux’ to compete against each other. The agents can build workers and citytiles and need to gather and manage resources including uranium, coal, and trees, while expanding on a map against a rival. The winners are the ones that control the greatest amount of resources at the end. They train the agents via the trusty RL algorithm PPO with Generalized Advantage Estimation (GAE), and use a pixel-to-pixel architecture as the centralized policy, taking both observations and actions as images and using a ResNet as the backbone network.  
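(PPO with GAE is a bog-standard recipe; for reference, the advantage estimator itself is just the short recursion below – a generic implementation, nothing specific to this paper.)

```python
import numpy as np

def generalized_advantage_estimate(rewards, values, gamma=0.99, lam=0.95):
    """Generic GAE: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t),
    A_t = sum_l (gamma * lam)^l * delta_{t+l}."""
    rewards = np.asarray(rewards, dtype=np.float64)  # r_0 .. r_{T-1}, length T
    values = np.asarray(values, dtype=np.float64)    # V(s_0) .. V(s_T), length T+1
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages  # add values[:-1] back to get return targets for the value head
```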

Three phases: To help the system learn, the researchers have different training strategies for three different phases: 

  • Phase 1: Hand-crafted dense rewards to encourage basic skills; agents get points for maximizing workers, CityTiles, research points, and fuel
  • Phase 2: Sparse reward with scaled signals; agents get rewards at end of episode for winning and the reward gets scaled according to the number of CityTiles on the winning team
  • Phase 3: Win-or-lose sparse reward.

You get what you pay for – and then some! As a consequence of this three-phase training scheme, the researchers see some combination of expected behavior, and some amount of emergence. Specifically, three distinct patterns of behavior emerge over time. 

  • Phase 1: Atomic skills – agents figure out how to build workers and collect resources, but frequently get the mixture wrong (e.g., building more cities than workers can support). 
  • Phase 2: “Regional coordination appears, which involves dozens of agents in a local area. For example, agents learn to carefully choose locations before building a CityTile and develop self-organizing patterns for occupying resources efficiently”.
  • Phase 3: “Global strategies”: Agents figure out sustainable development – neither growing too fast nor too slowly, and neither using too much fuel nor too little. Agents also learn to carefully manage trees (which regenerate) and learn to build cities near them to defend against potential enemies. The agents also learn a surprising victory strategy: “another surprising strategy is that when the episode is about to end, our policy will rapidly harvest all the protected trees and try to build as many CityTiles as possible for the win”.

Why this matters – emergence is everywhere: Again and again we see the same phenomenon; you train an AI system in a relatively simple way and you get increasingly surprising emergent behaviors. This paper is a neat demonstration of this phenomenon. 
   Read more: Emergent collective intelligence from massive-agent cooperation and competition (arXiv)

####################################################

Tech Tales: 

GET Deformation 

[The world shortly after the development of superintelligence. Date unknown.]

Dad this game is boring!

Boring how?

I want to see it from the character, I don’t like how we’re on top of it. 

OK, let’s ask. “Please deform game to a first-person view”.

The game faded to black and a text popup appeared which said DEFORMING. We waited a few minutes, and then the text changed to DEFORMED. READY TO PLAY. The game faded back in, and now it was from the first person perspective. 

   We played the game like that for a while, and then my kid got bored and we changed it again. This time, we deformed it so that gravity worked differently. Then we deformed it again so the guns worked differently.

The hardest thing about raising kids these days is they want the world to change whenever they’re not happy with it, said one parent at the playground. 

   That’s why we come here, I said. 

   We watched a kid fall down and then the kid said ‘gravity deform’ and jumped forward at an angle and fell over. They looked like they were going to cry for a second, but then they realized their parent wasn’t watching them. So they picked themselves up and carried on playing. 

Things that inspired this story: How generative models will eventually get applied to everything, everywhere; how AI will be used to simulate and transform all experiences on a per-person basis; ‘think of the children’ when it comes to AI deployment.

Import AI 313: Smarter robots via foundation models; Stanford trains a small best-in-class medical LM; Baidu builds a multilingual coding dataset

Welcome to the first issue of 2023! Astute readers may notice that most of these research papers came out in December. I basically took some time off over the Christmas break to reflect on things and map out priorities for the coming year. I am thrilled to write Import AI for so many of you and have some big plans in the works. Onward!


Google trains a big model to create smart robots:
…RT-1 is one foundation model for hundreds of tasks…
Google has built RT-1, a large-scale neural net that can be used to control real world robots. RT-1 is basically an attempt to create a large pre-trained model that embeds the experiences of different robots doing different tasks into a single model, then uses this model to drive control of real world robots. The approach seems to work in a preliminary way (and as with all things in robotics, there’s always a vast gulf between ‘kind of works’ and ‘put this in a product and sell it to a grandmother’, so don’t get too excited). 

What is RT-1? RT-1 was trained on 130k episodes of robot behavior covering 700+ tasks collected via a fleet of 13 robots deployed at Google over the course of 17 months. “We demonstrate that RT-1 can exhibit significantly improved zero-shot generalization to new tasks, environments and objects compared to prior techniques,” Google wrote. 

Compounding returns: RT-1 can be paired with other techniques to increase real world robot performance. For instance, Google used RT-1 to drive behaviors on robots hooked up to SayCan (a system that uses a large language model for helping the robot to plan actions – see Import AI 291). “SayCan with RT-1 achieves a 67% execution success rate in Kitchen1, outperforming other baselines,” they write (up from 47% for just vanilla SayCan). “Due to the generalization difficulty presented by the new unseen kitchen, the performance of SayCan with Gato and SayCan with BCZ sharply falls, while RT-1 does not show a visible drop.”

   Check out the website: RT-1: Robotics Transformer for Real-World Control at Scale (official website).
   Read the blogpost: RT-1: Robotics Transformer for Real-World Control at Scale (Google research blog).

####################################################

Academics use AI to… automate academia:
…Dataset of ~7.5k papers helps train an automated research paper reviewer…

Researchers with Xiamen University and the Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, China, have developed the Multidisciplinary Open Peer Review Dataset (MOPRD), a collection of 7,578 research papers and their associated reviews and comments. The idea is this dataset can help train models better able to do the task of ASPR – automated scholarly paper review. 

   (In other words: if you thought Human Reviewer 2 was hard to reason with, just wait until reviewer 2 is a language model!) 

What’s in MOPRD: The dataset contains papers split across biology (46.7%), medicine (19.7%), computer science (15.7%), environment (8.9%), chemistry (4.4%), and ‘others’. MOPRD is “composed of paper metadata, manuscripts of the initial submission and following revisions, review comments, meta-reviews, author’s rebuttal letters, and editorial decisions of papers across various disciplines,” the authors write. “To our best knowledge, MOPRD is by far the largest multidisciplinary peer review dataset with complete peer review history.” 

Automatic comments, for the people: The researchers use MOPRD to design a “modular guided review comment generation method”. Specifically, they finetune a language model on the MOPRD papers, and then use this to try to generate synthetic comments about research papers (including, in a reassuringly meta bit of performance art, the MOPRD paper itself). In tests, they find the reviews are initially quite promising, though it remains an open question how to quantitatively evaluate their quality (beyond coherence of text). 

Why this matters – can AI speed up the process of science? While part of the value of reviews is in the didactic back and forth between reviewers and reviewees, another part of the value is in surfacing high-quality papers and generally sorting the wheat from the chaff. Datasets like MOPRD could help train very basic classifiers to do some of this sorting, though I’m skeptical of the overall approach – some of the most important scientific papers are those which have heterodox ideas in them, so I think a ‘curve-fitting automated reviewer’ is probably one of the best ways to generate negative reviews of original ideas. 

   Read more: MOPRD: A multidisciplinary open peer review dataset (arXiv).

####################################################

Baidu makes a multilingual coding assistant:
…ERNIE-Code uses English as a passthrough language for multilingual capabilities…

Researchers with Baidu have built ERNIE-Code, a 560 million parameter coding model optimized for being multilingual. ERNIE-Code is “a unified cross-lingual pre-trained LLM for multiple natural languages and programming languages in hopes of mitigating the English-centric bias for program pre-training,” according to the researchers. 

What it is and why they did it: ERNIE-Code is pre-trained on six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby) via CodeSearchNet, as well as more than 100 languages via the CommonCrawl-100 corpus. 

   Pre-training has two specific tasks – span-corruption language modeling (add noise to text and try to predict corrupted spans, sentences, and documents), and ‘pivot-based translation language modeling’ (PTLM). PTLM is the route to multilinguality – they decompose translating a natural language (NL) command into a programming language (PL) command by instead translating the NL command into English, then translating English into the PL. This gets around the problem of otherwise needing to pair datasets from a hundred-plus natural languages with datasets from six programming languages, and feels like a neat solution to the problem. 
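A toy illustration of the pivot idea (both translators are placeholder functions; the real PTLM objective is a pretraining loss over pivot-paired text, not literal runtime translation):

```python
def pivot_translate(nl_command, nl_to_english, english_to_code):
    """Toy sketch of the pivot idea: any natural language -> English -> code.
    Both translator arguments are placeholder functions, not Baidu's models."""
    english = nl_to_english(nl_command)   # e.g. "liste umkehren" -> "reverse a list"
    return english_to_code(english)       # -> "def reverse(xs): return xs[::-1]"
```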

Does it work? They test the model against mBART, mT5, PLBART, and CodeT5 on four tasks: code summarization, code generation, document translation, and program repair. In tests, the model is competitive on all of these, and does significantly better on code summarization. On the other hand, I would have liked to see them compare to other hard baselines, like CodeGeeX from a decent group at Tsinghua.

Why this matters – representation matters: ERNIE-Code highlights the way in which language dominance can filter through to AI dominance; so much foundational text (and comments on code) is written in English that to avoid perpetuating the hegemony of one language, researchers need to figure out progressive approaches to empower other languages. ERNIE-Code is an example of this – though the fact it needs to pivot through English during training speaks to the larger problem.

   Read more: ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages (arXiv).

####################################################

Generate music from spectrograms via StableDiffusion – a crazy idea that works:
…RIFFUSION: Mad science for the hell of it – wonderful!…

You know what I love? Wildly crazy ideas that somehow work. You know what RIFFUSION is? It’s a wildly crazy idea that somehow works. RIFFUSION takes the Stable Diffusion image model, finetunes it to generate spectrograms, then generates audio from the spectrograms. This is pure, unadulterated, for-the-fun-of-it mad science, and I am in love. 
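
To give a sense of the final step – turning a spectrogram image back into sound – here’s a minimal sketch using torchaudio’s Griffin-Lim phase reconstruction. This is my own illustrative pipeline rather than Riffusion’s actual code, and the sample rate, FFT size, and mel settings are placeholder assumptions.

```python
# A minimal sketch (not Riffusion's code) of the last stage: decode a
# model-generated mel spectrogram back into a waveform via Griffin-Lim.
# The spectrogram here is random noise standing in for an image produced by
# the finetuned Stable Diffusion model.
import torch
import torchaudio

sample_rate = 44100
n_fft = 2048
n_mels = 512  # placeholder: spectrogram "images" with hundreds of mel bins

# Pretend this came from the image model: an (n_mels, time) magnitude spectrogram.
mel_spec = torch.rand(n_mels, 512)

# Mel bins -> linear-frequency bins, then Griffin-Lim to recover a waveform.
inverse_mel = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sample_rate
)
griffin_lim = torchaudio.transforms.GriffinLim(n_fft=n_fft, n_iter=32)

waveform = griffin_lim(inverse_mel(mel_spec))
torchaudio.save("generated.wav", waveform.unsqueeze(0), sample_rate)
```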

Fun things you can do: You can interpolate from one type of spectrogram to another, just as you would with images. This means the authors can generate multiple individual slices of audio, chunk them together, and shift from one thing (e.g, the sound of a keyboard typing) to another (e.g, a guitar) over arbitrary time scales. They’ve also built a web app so you can try it yourself and generate your own audio on the fly. 

Why this matters: Superintelligence is partially SuperDataTransformation: Modern generative models are generic data transformation engines, able to take one type of data (e.g, a script) and port it into another (e.g, a song, a poem). This is a deep and weird idea the more you think about it. What would you do if you can ‘transpose’ anything into anything else? RIFFUSION is a creative example of what happens when you play with this idea. Congrats to the creators for making something joyful and zany! 

   Read more: RIFFUSION (official site).
   Try out the web app: RIFFUSION.COM
   Get the model from HuggingFace here (riffusion-model-v1).

####################################################

Stanford and Mosaic train a small but mighty medical language model:

…PubMedGPT 2.7B packs a lot of performance into a tiny package…

Stanford’s Center for Research on Foundation Models (CRFM) and AI training startup Mosaic have teamed up to train PubMedGPT 2.7B, a small GPT-style language model that gets a state-of-the-art result on medical question answering. 

Data and performance: PubMedGPT 2.7B was trained on the PubMed abstracts and full-text portions of ‘The Pile’ dataset: 16 million abstracts and 5 million full-text articles. The total size of the dataset is about 50B tokens, making the dataset a little small relative to the model (GPT3 2.7B and GPT-J were trained on 300B and 400B tokens respectively). The model gets 50.3% on the MedQA-USMLE eval (a new SOTA), 74.4% on PubMedQA (versus 77.6% for Facebook’s ‘Galactica’), and 96.4% on BioASQ. 

Compute: The model was trained on 128 A100 GPUs for 6.25 days, which is still a non-trivial amount of compute to dump into a model, even in the ‘big chungus*’ compute era of 2022. 

*Not an official term.

Maybe data repetition isn’t that bad? “We elected to train PubMed GPT for a long compute duration (300B tokens) by performing multiple passes, or epochs, over the 50B tokens,” the researchers write – roughly six passes over the corpus. When training big models, people are wary of repeating data too much lest their model overfit; here, that may not have been a huge concern. “It was indeed worth it to train for the full 300B tokens, even though this represented dramatically more passes through the data than comparable models,” the Stanford researchers said. 

Why this matters: I think AI models are going to have a sort of bimodal distribution – there’ll be a small number of absolutely vast ‘swiss army knife’ models which will underpin a huge range of economic functions, but at the other end I suspect there will also be a very large number of tiny (where tiny = <5 billion parameters) models that are tuned for very specific data sources and use cases, and likely deployed directly on edge devices (pending some compute efficiencies). PubMed GPT is an example of the latter kind of model. I wonder how many more of its kind there will be?

   Read more: PubMed GPT: a Domain-Specific Large Language Model for Biomedicine (Mosaic blog).
   Read more: PubMedGPT 2.7B (Stanford University blog).
   Get the model from HuggingFace.

####################################################

Tech Tales:

The Universal Confessional Booth

[AI-AUGMENTED THERAPY CENTER, 2030]

We all had been crazy in our own ways but now we all had the same path to healing – talk to the robot for as long as it took for it to say you were ‘back in distribution’ (BID) with everyone else. This led to all kinds of slang. 

Yeah my friend BID out.

Yeah he on a long BID he crazy. 

Oh he’s just happy because it’s his BIDday.

And so on. 

I’d been getting close to my BID for a while now, my robot told me. I’d go and sit in the booth and talk to it and we’d have these long, rambling conversations about everything: flowers, the recent weather and how I felt about the dust storms, the quality of food in the institution as compared to what I ate outside (mostly healthier), how I felt about my friends and family. The robot would show me some different emotions on its ‘face’ (which was an avatar that was a different person each day, I suppose to elicit different reactions from me) and I would talk and it would ask questions. 

At the end of the session it would usually say ‘you are making excellent progress towards being back in distribution’. Sometimes it wouldn’t say anything, though, which was its way of telling me I hadn’t made progress. 

It wasn’t worth trying to perform for the robot because it’d ask so many questions that it’d uncover that you were spinning some story, and then it would get as close as it was allowed to expressing a negative emotion. “You are going backwards,” it might say. Or, “at this rate, you will be out of distribution for an unpredictable amount of time”. 

Of course we’d all talk to each other about how the BID talks were a load of bullshit. We’d sit up late at night after the guards had locked the cells and exchange stories. 

  • Yeah it asked me about my childhood friends. 
  • Childhood friends? My one went IN on my dead parents. I could have strangled it. 
  • My one keeps telling me I’m sexually repressed and I just don’t see it. 
  • You think you’ve got it bad – mine has been showing me different colors for a week and asking me how they make me feel. 

The strange thing was that people did change. It was like being hypnotized – you’d think nothing was changing, but then you’d snap back into a memory of the person six months prior tearing their hair out and screaming after the first session, and now they’d just low-key complain about the sessions while sitting there with a full head of hair and no scratch marks. 

Anyway, my robot says I’m almost at BID and it’s just going to be a few more sessions. It told me to journal about my experiences as part of a special BID evaluation. I guess that’s most of what I have to say right now so I’ll see what it thinks. 

Things that inspired this story: Discovering Language Model Behaviors with Model-Written Evaluations by Anthropic (PDF); how RLHF-trained models have a tendency to take on extreme sycophantic positions; what it might look like to have a model talking to thousands of people concurrently and embedding their conversations in a single space so as to judge who is and isn’t ‘out of distribution’; ChatGPT and other similar models; robot psychology meets the administrative state; insanity.

Import AI 312: Amazon makes money via reinforcement learning; a 3-track Chinese AI competition; and how AI leads to fully personalized media

McKinsey: Companies are using more AI capabilities and spending more on it:
…Somewhat unsurprising survey confirms that AI is an economically useful thing…
McKinsey has published results from its annual AI survey, and the results show that AI is, slowly but surely, making its way into the economy. 

Those findings in full:

  • AI adoption has plateaued: In 2017, 20% of respondents said they had adopted AI in at least one business area. In 2022, that figure was 50% (it peaked in 2019 at 58%).
  • Organizations are using more AI capabilities than before: In 2018, organizations used on average 1.9 distinct capabilities (e.g, computer vision, or natural language generation), rising to 3.8 in 2022.
  • Rising investment: In 2018, “40 percent of respondents at organizations using AI reported more than 5 percent of their digital budgets went to AI,” and in 2022 that rose to 52%.

Why this matters: This survey is completely unsurprising – but that’s useful. We have this intuition that AI has become increasingly economically useful and surveys like this show that this is the case. Perhaps the most surprising finding is that the rate of adoption is relatively slow – some organizations are using AI, and there are likely a bunch of ‘dark matter’ organizations for which AI holds very little relevance today.

   Read more: The state of AI in 2022—and a half decade in review (McKinsey).

####################################################

Language models aren’t welcome on StackOverflow:

…Popular coding Q&A site bans ChatGPT submissions…

StackOverflow has temporarily banned ChatGPT-written submissions to its website, as the site’s human creators grapple with the problems brought about by autonomous, AI coders. 

    “Overall, because the average rate of getting correct answers from ChatGPT is too low, the posting of answers created by ChatGPT is substantially harmful to the site and to users who are asking or looking for correct answers,” StackOverflow admins write in a post. “The volume of these answers (thousands) and the fact that the answers often require a detailed read by someone with at least some subject matter expertise in order to determine that the answer is actually bad has effectively swamped our volunteer-based quality curation infrastructure.”

Why this matters – AI-driven internet-based ‘climate change’: Things like this illustrate a ‘tragedy of the commons’ which I expect we’ll see more of; a new AI tool comes along and is very quickly used to generate a vast amount of low-grade spam and other crap which either damages human-curated sites, or lowers the quality of a common resource (see: algo-generated SEO-optimized spam pages found via Google). 

   Of course, in a few years, these systems might be better than humans, which is going to have wild implications. But for now we’re in the awkward adolescent period where we’re seeing people pour mine tailings into the common digital river.   

   Read more: Temporary policy: ChatGPT is banned (StackOverflow).

####################################################

Waymo works out how to train self-driving cars more efficiently by focusing on the hard parts:

…Trains a model to predict the inherent difficulty of a driving situation…

Researchers with Waymo have figured out how to use hard driving situations to train self-driving cars more efficiently. “Compared to training on the entire unbiased training dataset, we show that prioritizing difficult driving scenarios both reduces collisions by 15% and increases route adherence by 14% in closed-loop evaluation, all while using only 10% of the training data,” they write. 

How it works: Waymo’s approach has five stages. In the first, they collect a variety of data from real world vehicles (and their onboard AI models). They then collect and shard that data. They then learn an embedding that aligns specific driving runs to a vector space based on similarity. They then select some of these runs via counterfactual simulation and human triage, letting them figure out which runs are easy and which are hard. Then, they train an MLP to regress from these embeddings to difficulty labels for the run. The result is a model that can look at a new run and predict how difficult that run is. 
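
To make the last stage concrete, here’s a minimal sketch of what a difficulty predictor of this kind might look like; the embedding size, labels, and training loop are placeholder assumptions, not Waymo’s actual setup.

```python
# Minimal sketch of the final stage described above: a small MLP that maps a
# driving-run embedding to a scalar difficulty score. All dimensions and data
# are made up for illustration.
import torch
import torch.nn as nn

class DifficultyPredictor(nn.Module):
    def __init__(self, embedding_dim: int = 256, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar difficulty per run
        )

    def forward(self, run_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(run_embedding).squeeze(-1)

# Toy training loop on random data standing in for (embedding, difficulty)
# pairs produced by counterfactual simulation and human triage.
model = DifficultyPredictor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
embeddings = torch.randn(1024, 256)
difficulty = torch.rand(1024)  # e.g. fraction of simulated rollouts that failed

for _ in range(100):
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(embeddings), difficulty)
    loss.backward()
    optimizer.step()

# Once trained, the model can score new, unlabeled runs so training data can
# be re-weighted toward the hardest slices.
```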

   In tests, they find that they can use 10% of the usual training-run datasets if they select for harder difficulty and, as a consequence, they get smarter vehicles better able to deal with difficult situations. One problem is this approach slightly damages performance on the easier routes (which makes sense – there’s less ‘easy’ data in the dataset). 

Why this matters – use AI to help build better AI: Now they’ve got this difficulty model, the engineers can use it to theoretically identify hard scenarios for new planning agents, or new geographies to deploy into which may have ‘hotspots’ of hard parts, which will let them use the AI system to speed up the development of better, smarter AI systems. This is a neat illustration of how once you’ve trained a model to have a good enough capability at something, you can use it to speed up development of other, much more complicated AI systems.  

   Read more: Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula (arXiv).

####################################################


Amazon uses deep reinforcement learning to make its inventory systems 12% more efficient:

…The march of real world DRL continues…

This year has been a banner one for deep reinforcement learning systems – we’ve seen DRL systems provably control the plasma in prototype fusion powerplants, effectively cool buildings, navigate real world robots and, now, let e-commerce behemoth Amazon better optimize its inventory. 

   In a new paper, Amazon researchers describe how they are able to train a reinforcement learning system to more effectively manage their inventory, leading to a reduction in the inventory Amazon has to hold by 12% (!!!). “Our model is able to handle lost sales, correlated demand, stochastic vendor lead-times and exogenous price matching,” they write. 

What they did: For this work, Amazon built a differentiable simulator which it could train RL algorithms against, helping it model the complexities of inventory management. The resulting RL approach, DirectBackprop, was tested first in backtesting against a weekly dataset of 80,000 sampled products from a single marketplace running from April 2017 to August 2019, and then tested out in the real world on a portfolio of products over 26 weeks. 
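
To illustrate the core idea – a simulator smooth enough that you can backpropagate the cost straight into the ordering policy – here’s a toy single-product sketch. It is emphatically not Amazon’s DirectBackprop or its simulator; the costs, demand model, and network are all made up.

```python
# Toy sketch of training a buying policy directly through a differentiable
# inventory simulator. Single product, single period, made-up costs.
import torch
import torch.nn as nn

holding_cost, lost_sale_cost = 1.0, 5.0  # assumed per-unit costs

policy = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Softplus())
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(2000):
    # Features per product: (current inventory, demand forecast).
    inventory = torch.rand(256, 1) * 10
    forecast = torch.rand(256, 1) * 10
    demand = forecast + torch.randn(256, 1)                  # stochastic realized demand

    buy = policy(torch.cat([inventory, forecast], dim=1))    # order quantity
    on_hand = inventory + buy
    leftover = torch.relu(on_hand - demand)                  # units we paid to hold
    shortfall = torch.relu(demand - on_hand)                 # lost sales

    # Because every step above is differentiable, the cost gradient flows
    # straight back into the policy network.
    cost = (holding_cost * leftover + lost_sale_cost * shortfall).mean()
    optimizer.zero_grad()
    cost.backward()
    optimizer.step()
```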

   The results are pretty convincing: “We randomized these products into a Treatment (receiving Direct-Backprop buy quantities) and a Control (receiving Newsvendor policy buy quantities) group,” they write. “The Control group was the current production system used by one of the largest Supply Chains in the world [Jack – that’d be Amazon]. The Treatment group was able to significantly reduce inventory (by ∼ 12%) without losing any excess revenue (statistically insignificant difference from 0%)”.

Why this matters: Papers like this show how AI is rapidly making its way out of the lab and into the real world. It’s a big deal when some of the world’s largest and most sophisticated corporations do large, potentially expensive real-world tests on their own physical inventory. It’s an even bigger deal when it works. All around us, the world is being silently optimized and controlled by invisible agents being built by small teams of people and applied to the vast machinery of capitalism.

   Read more: Deep Inventory Management (arXiv).

####################################################

Chinese researchers run a 3-track ‘AI Security Competition’:

…Deepfakes, self-driving cars, and face recognition – and a nice example of how competitions can drive progress…

A bunch of Chinese universities and companies recently launched a so-called ‘Artificial Intelligence Security Competition’ (AISC) and have published a report going over the results. The AISC has three tracks relating to three distinct AI use-cases; deepfakes, self-driving cars, and face recognition. 

Deepfakes: This is a deepfake identification competition: “Given a query image, identify the Deepfake method behind it based on its similarities to the images in the gallery set.” 

   144 teams participated in the competition and the winning team was led by Tencent (with a top-5 precision of 98%).

Self-driving cars: This competition is based around adversarial attacks on the computer vision models used in self-driving cars. Specifically, vision models must try to correctly label trucks that the cars would otherwise crash into, and sometimes those target vehicles have been doped with an adversarial patch meant to make them invisible to object detectors. The competition has multiple stages; in the final round there is more scene variation and the adversarial vehicles are replaced by human mannequins. 

   96 teams participated in the competition and the winning team (BJTU-ADaM) came from Beijing Jiaotong University.

Face recognition: This is based around developing effective adversarial attacks on image recognition systems. The idea is to “discover more stable attack algorithms for evaluating the security of face recognition models and consequently facilitate the development of more robust face recognition models” – an important thing given that China is probably the most heavily-surveilled country in the world (though the UK gives it a run for its money). 

   178 teams participated in the competition. Two teams shared the first prize – TianQuan & LianYi, and DeepDream – getting a perfect 100 each.

Who did this: When something is a giant multi-org output, I typically don’t publish all the institutions. However, China is a special case, so – for the enthusiasts – here’s the list of authors on the paper: 

   “realAI, Tsinghua University, Beijing Institute of Technology, Shanghai Jiao Tong University, China Hanhu Academy of Electronics and Information Technology, Xi’an Jiaotong University, Tencent YouTu Lab, China Construction Bank Fintech, RippleInfo, Zhejiang Dahuatech Co, Beijing Jiaotong University, [and] Xidian University.”

Why this matters: In the future, most nations are going to carry out alternating ‘red team vs blue team’ competitions, where teams compete to break systems and eventually to build more resilient ones. This competition shows how useful the approach can be for both developing more robust systems and identifying vulnerabilities in widely deployed ones. It also speaks to the dynamism of the Chinese AI sector – hundreds of submissions per track, interesting technical solutions, and a sense of excitement about the endeavor of making AI more robust for society. The translated tagline for this whole competition was, per the official website: “Building AI Security Together and Enjoying the Future of Intelligence”.

   Read more: Artificial Intelligence Security Competition (AISC).

####################################################

Facebook’s AI training gets messed up by bad weather:

…Is your training run breaking because you’re dumb, or because the sun is shining in Oregon?…

When Facebook was training its ‘CICERO’ system which recently beat humans at Diplomacy, the company ran into a strange problem – sometimes training speeds for its model would drop dramatically and the team couldn’t work out why. It turned out, per Facebook in an AMA on Reddit, that this was because the FB data center’s cooling system was malfunctioning on particularly hot days. 

   “For the rest of the model training run, we had a weather forecast bookmarked to look out for especially hot days!” Facebook said. 

Why this matters: Worth remembering that AI systems are made out of computers and computers have to go somewhere. Since most of the mega companies use free-air cooling, their data centers (while being stupendously efficient!) can become vulnerable to edge-case things, like particularly hot days leading to malfunctions in cooling, which have a knock-on effect on the (overheating) servers sitting in the cavernous halls of anonymous buildings scattered around the world. 

   Read more: We’re the Meta AI research team behind CICERO, the first AI agent to achieve human-level performance in the game Diplomacy. We’ll be answering your questions on December 8th starting at 10am PT. Ask us anything! (Reddit).

   Via nearcyan on Twitter.

####################################################

Watch me (and several other more coherent people) waffle about AI policy!

I participated in a panel on ‘future-proofing AI governance’ at the Athens roundtable on AI and the Rule of Law in Brussels recently – you can check out the video here. My general sense from spending a few days in Brussels is there’s a valuable discussion to be had about what kind of negligence or liability standards should be applied to developers of super-intelligent AI systems, and there’s a lot of room to be creative here. It’s worth thinking about this now because policy takes a long time to craft and, if some of the more optimistic timeline predictions come true, it’d be good to have built out regulatory infrastructure in the coming years. 

   Watch the video here: Future-proofing AI Governance | The Athens Roundtable on AI and the Rule of Law 2022 (The Future Society).

####################################################

Tech tales: 

The Personal Times 

[Worldwide, 2026]

The news of the day, customized for you!

We report to you, for you, and about you! 

News from your perspective!

When news became personalized, people started to go mad. Of course the underlying facts were the same but the stories were angled differently depending on who you were. Everyone read the news, because the news was incredibly compelling. 

All the news that’s fit to finetune!

One hundred stories and one hundred truths!

News for all, made personal!

You’d sit on a train after a terrorist attack and see everyone’s eyes light up and everyone would be happy or worried or panicked, depending on their implicit preferences from their news media consumption. You’d stare at each other with wild eyes and say ‘did you hear the news’ but you stopped knowing what that meant, and you mostly said it to work out what type of person you were dealing with. Was their news happy or their news sad or their news uplifting or their news opportunistic? What bubble did they live within and how different to yours was it?

Things that inspired this story: What happens when generative models lead to media customized around individual preferences learned via reinforcement learning from human feedback? Just how big a crash will ‘Reality Collapse’ bring? Is society meant for everyone to have ‘solipsism media on tap’?

Import AI 311: Distributed GPT busts the political economy of AI; Apple optimizes Stable Diffusion; AI war startup raises $1.48 billion

Test out your coding model on a fuzzed benchmark:
…DS-1000 pits code models against 1,000 tasks spread across seven Python libraries…
Researchers from the University of Hong Kong, Peking University, Stanford University, Berkeley, the University of Washington, Facebook, and Carnegie Mellon University have built DS-1000, a set of 1,000 data science problems spanning seven Python libraries. This is both a dataset and a benchmark and is useful for building code models, like Codegen or Copilot.

What’s in DS-1000? The dataset contains 1000 problems drawn from 451 distinct StackOverflow problems. “To defend against potential memorization, more than half of the DS-1000 problems are modified from the original StackOverflow problems; they include 152 surface perturbations, 235 semantic perturbations, and 162 difficult rewrites,” the authors write. DS-1000 contains problems in NumPy, SciPy, Pandas, TensorFlow, PyTorch, Scikit-learn, and Matplotlib. “The problems in DS-1000 represent more diverse and naturalistic intent and context formats that cannot be seen in any other datasets,” they write. 

How hard is it? The best performing model (Codex from OpenAI) gets, at most, about 40% on tasks like insertion, followed by CodeGen (Salesforce) at ~8.4% and InCoder-6B from Facebook (7.5%). This is great news, as it suggests it’s a hard benchmark. 
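
Benchmarks like this are typically scored by executing a model’s completion against the problem’s test code; below is a minimal, illustrative harness for that kind of check – not DS-1000’s official evaluator – with hypothetical field names.

```python
# Illustrative execution-based scoring for a code benchmark. Not DS-1000's
# official evaluator, and (unlike a real harness) it runs untrusted code with
# no sandboxing - don't do this in production.
def completion_passes(prompt_code: str, completion: str, test_code: str) -> bool:
    namespace: dict = {}
    try:
        exec(prompt_code + completion, namespace)  # run the candidate solution
        exec(test_code, namespace)                 # run the problem's checks
        return True
    except Exception:
        return False

def accuracy(problems, generate) -> float:
    """`generate` is any callable mapping a prompt string to a completion string."""
    passed = sum(
        completion_passes(p["prompt"], generate(p["prompt"]), p["test"])
        for p in problems
    )
    return passed / len(problems)
```
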
   Read more: DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation (GitHub).
   Get the code here: DS-1000 Data Science Code Generation (GitHub).

####################################################

Apple optimizes Stable Diffusion on Apple silicon:
…World’s most valuable company + world’s most proliferated generative model…
Apple has significantly cut the time it takes to generate images from Stable Diffusion on Apple silicon. It’s notable that the world’s most valuable company has tacitly adopted the world’s most widely distributed (and quite controversial) generative image model, and perhaps a sign of things to come – release the weights of your model, and perhaps vast companies will expend engineering resources to make it run more efficiently on their hardware. 

   “This release comprises a Python package for converting Stable Diffusion models from PyTorch to Core ML using diffusers and coremltools, as well as a Swift package to deploy the models,” Apple writes.

Why this matters – on-device AI: Most AI models need to be sampled from via large computers, typically servers running top-of-the-line GPUs. Large language models, for instance, can take tens of GPUs to sample from in a reasonable time. Image models, while cheaper to sample from, can still be expensive. With this release, Apple has made it significantly faster for people to pull Stable Diffusion images off of their local devices – in other words, you could be sitting in the back of a cab in a place with no cell reception and could idly generate images on a laptop equipped with an M1 or M2 chip. 
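
For a flavor of what ‘local generation on a laptop’ looks like, here’s a minimal sketch using Hugging Face’s diffusers with PyTorch’s Metal (‘mps’) backend. Note this is the generic PyTorch route rather than Apple’s Core ML conversion path described in the release, and the model ID and settings are illustrative.

```python
# Minimal sketch: generate a Stable Diffusion image locally on an M1/M2 Mac
# via PyTorch's "mps" backend and the diffusers library. Model ID and prompt
# are illustrative.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("mps" if torch.backends.mps.is_available() else "cpu")
pipe.enable_attention_slicing()  # lowers peak memory use on laptops

image = pipe("a watercolor painting of a data center at sunset").images[0]
image.save("out.png")
```
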
   Read more: Stable Diffusion with Core ML on Apple Silicon (Apple Machine Learning Research blog)
   Check out detailed notes here: Core ML Stable Diffusion (Apple GitHub).

####################################################

Want to see if your object detection system works in the real world? Try out Roboflow100:
…RF100 – a reassuringly difficult and diverse benchmark…
Roboflow, a computer vision startup, has released Roboflow-100, a large-scale object detection dataset. What makes Roboflow different is, much like the recent emergence of benchmarks like SuperGLUE (a multi-task NLP benchmark), it takes multiple distinct datasets (in this case: 100) and puts them together into a single suite. This kind of thing tends to be really useful as it helps people work out if their models are overfitting or are actually capable of decent generalization.
   Another different thing is the data is sourced from real jobs by real users of Roboflow, so this is less an academic benchmark and more an applied one.

What goes into Roboflow-100? RF100 contains 100 datasets spread across 7 imagery domains, containing a total of 224,714 images annotated with 805 class labels. “By releasing RF100, we aim to provide a semantically diverse, multidomain benchmark of datasets to help researchers test their model’s generalizability with real-life data.”
   The seven main categories consist of annotation tasks in the following domains: Aerial, Video Games, Microscopic, Underwater, Documents, Electromagnetic, and Real World. All of these main categories contain sub-categories, ranging from first-person shooters (video games) to fishery sights from aquariums (underwater), to geology (real world), etc. 

Why this matters – hard enough to be useful: RF100 seems sufficiently large-scale and diverse that it poses a challenge to contemporary systems – that means it can be a valuable tool for developing and assessing the performance of more general models. The Roboflow researchers show this by training a couple of baseline models (YOLOv5 and YOLOv7, respectively), as well as training a zero-shot detector called GLIP. The finetuned YOLO variants get about ~65-70% accuracy (v5 and v7, respectively), and GLIP gets ~11%. In other words – RF100 is a challenging benchmark, so there should be some signal in seeing how people do on it. 
   Read the paper: Roboflow 100: A Rich, Multi-Domain Object Detection Benchmark (arXiv).
   Read more: roboflow100 (official website).
   Get the dataset: Roboflow 100, GitHub.

####################################################

AI centralization just got less likely: Distributed team train a good 6bn parameter GPT model:
…You’ve heard about open source models. How about open source models trained over a super shitty network?…
Researchers with Together have trained GPT-JT, a well-performing 6bn parameter model. So far, so normal. The twist is that GPT-JT was trained in a decentralized manner on a heterogeneous bunch of GPUs over slow (1Gbps) internet links. That’s a big deal – and has some big implications. 

What is GPT-JT and how well does it work?: GPT-JT “is a variant forked off GPT-J and performs exceptionally well on text classification and other tasks,” the authors write. “On classification benchmarks such as RAFT, it comes close to state-of-the-art models that are much larger (e.g., InstructGPT davinci v2)”. GPT-JT was made possible by a range of open source software, ranging from underlying models (GPT-J, etc), datasets, evaluation metrics, and various contributions to decentralized algorithms. 

Trained in a decentralized manner: The authors wrap in a bunch of clever ideas to reduce the burden of decentralized training, cutting the amount of communication needed per machine for all the tokens processed. This is crucial to the success of the project; out-of-the-box decentralized training fails because you have enough between-machine chatter that the slowness of your connections represents a major tax on training.
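
The write-up doesn’t spell out Together’s exact tricks, but one standard way to cut between-machine chatter is to let each worker take many local steps and only occasionally average parameters (‘local SGD’). Here’s a purely illustrative toy sketch of that trade-off; it is a generic example, not GPT-JT’s training code.

```python
# Purely illustrative sketch of one standard trick for cutting communication
# in decentralized training: local updates with periodic parameter averaging
# ("local SGD"). This is a generic example, not Together's algorithm.
import copy
import torch
import torch.nn as nn

def local_sgd_round(workers, make_batch, local_steps=32, lr=1e-3):
    """Each worker trains independently for `local_steps`, then all average."""
    for model in workers:
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(local_steps):              # no network traffic here
            x, y = make_batch()
            loss = nn.functional.mse_loss(model(x), y)
            opt.zero_grad(); loss.backward(); opt.step()

    # One communication round per `local_steps` updates instead of one per step.
    with torch.no_grad():
        for params in zip(*(m.parameters() for m in workers)):
            mean = torch.stack([p.data for p in params]).mean(dim=0)
            for p in params:
                p.data.copy_(mean)

# Toy usage: four "machines" training copies of the same tiny model.
base = nn.Linear(16, 1)
workers = [copy.deepcopy(base) for _ in range(4)]
make_batch = lambda: (torch.randn(64, 16), torch.randn(64, 1))
for _ in range(10):
    local_sgd_round(workers, make_batch)
```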

Centralization versus decentralization – this is an attack on the political economy of AI! A lot of AI development has so far been defined by a small set of groups with access to big, centralized computers. These groups have used these blobs of compute to train impressive models, ranging from AlphaZero to GPT3. It has always been hard for people with fewer computers to catch up to the people with supercomputers. GPT-JT suggests a radically different future – distributed collectives can instead pool computers over crappy internet links and train models together. ex pluribus unum exemplar, if you will.
    Now, the multi-trillion dollar question is if these distributed groups can provably train models on par with those developed by the large, centralized giants. That part is a lot less clear – while GPT-JT is a decent model, it’s a tiny one at 6bn parameters. But if they can scale this kind of technique up, the implications are huge. 
   There’s also the small matter of China, which recently got a lot of its AI ambitions clipped by US export controls preventing it from accessing frontier GPUs. But maybe the frontier doesn’t matter as much if you can just aggregate compute across a country of more than a billion people and train a model with the focus afforded by an authoritarian regime. Food for thought! 
   Read more: Releasing v1 of GPT-JT powered by open-source AI (Together blog).
   Get the code: GPT-JT-6B-v1 (HuggingFace).
   Try out a live demo on HuggingFace here.

####################################################

AI war startup Anduril raises $1.48 billion: 
…AI + Robots + Startup DNA = a faster OODA loop for battlefield commanders…
AI War startup Anduril has raised $1.48 billion (that’s with a B) in a Series E round. “The new funding will enable Anduril to accelerate research and development to bring new, cutting edge, autonomous defense capabilities to the market and continue to mature and scale its current business lines with the US Department of Defense as well as US allies and partners,” the company wrote. 

AI and War: Anduril is a fascinating company – it’s one of the few modern defense startups in the US that is pairing recent AI innovations with various advances in robotics (e.g, low-cost drones) as well as sensor platforms. Put it all together and you wind up with a company that is fielding an increasingly vast arsenal of devices able to conduct war activities on land, air, and sea (via recent acquisition, Dive Technologies). Some of the company’s recent product launches include ALTIUS 600M (a loitering munition, aka a drone that hangs around then kills something with a bang), ‘Menace’ (“a first-of-its-kind integrated, expeditionary, secure, command, control, communications and computing (C4) platform”), and Mobile Sentry (a robot for autonomous ground and air monitoring). 

Why this matters – war is about speed, and AI increases speed: War runs on an OODA loop – Observe, Orient, Decide, Act. By pulling in modern technologies such as AI, Anduril is building an arsenal that increases the speed at which battlefield commanders can iterate through the OODA loop. Anduril is less about its individual items and more about its overall suite of products – taken together, they potentially let an entrepreneurial army out-think the competition via running a faster OODA loop. War is a depressing thing, but a more depressing thing is losing wars, so the funding for Anduril seems like a positive indication for the US (and allied) defense industrial base. I hope it continues to succeed in breaking through the monopoly of the aging so-called defense ‘primes’ (Lockheed, etc). 
   Read more: Anduril Raises $1.48 Billion in Series E Funding (Anduril blog, Medium)

####################################################

Tech Tales:

Reality Authentication
[The internet, 2034] 

“To login, spit into the bio-API”
   I took a sip of water and swirled it around my mouth a bit, then hawked some spit into the little cup on my desk, put its lid on, then flipped over the receptacle and plugged it into the bio-API system.
“Authenticating… authentication successful, human-user identified. Enjoy your time on the application!”
   I spent a couple of hours logged-on, doing a mixture of work and pleasure. I was part of an all-human gaming league called the No-Centaurs; we came second in a mini tournament. I also talked to my therapist sans his augment, and I sent a few emails over the BioNet protocol. 

   When I logged out, I went back to the regular internet. Since the AI models had got miniaturized and proliferated a decade ago, the internet had radically changed. For one thing, it was so much faster now. It was also dangerous in ways it hadn’t been before – Attention Harvesters were everywhere and the only reason I was confident in my browsing was that I’d paid for a few protection programs. 

Things that inspired this story: The ceaseless march of generative model progress; chatGPT; high- and low-class hobbies; the rich will always have a retreat, while the poor will always be condemned to the most experimental parts of the frontier.

Import AI 310: AlphaZero learned Chess like humans learn Chess; capability emergence in language models; demoscene AI.

How much capability emergence is in a language model? Aka, how long is a piece of string:

…The capabilities overhang just gets more and more significant everywhere you look…

Here’s a lovely blog by Jason Wei that pulls together 137 examples of ’emergent abilities of large language models’. Emergence is a phenomenon seen in contemporary AI research, where a model will be really bad at a task at smaller scales, then go through some discontinuous change which leads to significantly improved performance. 

   Emergence is a big deal because a) it says you get pretty powerful gains from scaling models and b) it’s inherently unpredictable, so large-scale models tend to have ‘hidden’ capabilities and safety issues as a consequence of emergence. This blog shows a bunch of examples of emergence spread across a bunch of different language models (GPT-3, LaMDA, PaLM, Chinchilla, Gopher).

Types of emergence: I won’t list all 137, but some highlights: arithmetic, Swahili-English proverbs, college medicine, conceptual physics, high school microeconomics, Hinglish toxicity, word unscrambling, and more. 

Why this matters – Houston, we have a Capability Overhang problem: Because language models have a large capability surface, these cases of emergent capabilities are an indicator that we have a ‘capabilities overhang’ – today’s models are far more capable than we think, and our techniques available for exploring the models are very juvenile. We only know about these cases of emergence because people built benchmark datasets and tested models on them. What about all the capabilities we don’t know about because we haven’t thought to test for them? There are rich questions here about the science of evaluating the capabilities (and safety issues) of contemporary models. 

   Read more: 137 emergent abilities of large language models (Jason Wei blog).

####################################################

DeviantArt adds generative art to its website, but tries to respect human artists while doing so: 

…Artists VS AI Artists VS AI Models, and on and on the controversy goes…

DeviantArt, an ancient and still thriving art website, has built DreamUp, a generative AI tool based on the popular StableDiffusion model. In doing so, it is trying to strike a balance between respecting the human artists on its platform and letting people still generate art – by default, all ‘deviations’ (outputs of DreamUp) will be automatically labeled as not suitable for downstream use in other AI training datasets. 

What does DeviantArt think artists want? Artists have, understandably, had mixed views about image generation. Some of them have adopted the technology and fooled around with it and integrated it into their practice. Others view the technology as inherently bad and threatening to their livelihoods. DeviantArt is clearly trying to navigate those concerns with its approach to DreamUp. “DeviantArt is the only platform giving creators the ability to tell third-party AI datasets and models whether or not their content can be used for training. This is a protection for creators to help them safeguard their content across the web,” DeviantArt says.

Why this matters: The intersection of AI and art is a messy area; human emotions and soul colliding with the envisioned curve-fitting extrapolations of alien machines. Here, DeviantArt is trying to strike a balance between giving human artists agency over their work, while attempting to integrate art into its platform. 

   Read more: Create AI-Generated Art Fairly with DreamUp (DeviantArt blog).

####################################################

Demoscene AI: arXiv adds interactive demo support:

…HuggingFace + arXiv partnership shows the future…

arXiv has partnered with HuggingFace to incorporate live demos into the popular paper preprint repository. This means that when you browse papers on arXiv, you might scroll down and see an option to explore a demo of the model under discussion on ‘Hugging Face Spaces’. 

Who cares about demos? “Demos allow a much wider audience to explore machine learning as well as other fields in which computational models are built, such as biology, chemistry, astronomy, and economics,” arXiv writes in a blog post. “The demos increase the reproducibility of research by enabling others to explore the paper’s results without having to write a single line of code.”

Why this matters: In my experience, a demo is worth about ten thousand words, or sixty minutes of talking. Concretely, I’ve found if I demo something (e.g, StableDiffusion, a language model, or something in a Colab notebook, etc) I can get a point across in five minutes that’d otherwise take an hour or more, and the demo is way more memorable and engaging. All hail the era of didactic demoscene AI. 

   Read more: Discover State-of-the-Art Machine Learning Demos on arXiv (arXiv blog).

####################################################

Real world reinforcement learning: DeepMind use RL to more efficiently cool buildings.

…First data centers, now offices – the genies are here, and they want to lower your electricity bill!…

DeepMind and building management company Trane have used a reinforcement learning agent to efficiently cool some buildings, yielding reductions in cooling energy use of between 9% and 13%. This is a real world application of reinforcement learning (along with other recent hits, like RL systems designing more efficient chips, and stabilizing the plasma in prototype fusion plants), and shows how a technology which ~ten years ago was most known for beating Atari games has matured to the point we’re putting it in charge of buildings full of people. 

What they did: The DeepMind system uses RL “to provide real-time supervisory setpoint recommendations to the chiller plant… in two commercial buildings”. DeepMind constructs its approach in a similar way to the algorithm used to cool Google data centers and calls the algorithm ‘BCOOLER’. BCOOLER does a daily policy re-optimization, so it continually improves. There’s a lot of detail in the paper about the precise implementation details, so if you have a building and want to cool it, read the paper. 

   In tests, DeepMind found that BCOOLER “performs better in some conditions than others” – it did well when the outside temperature was cold and load was lower, and did less well when temperatures were high and load was higher. This makes intuitive sense – when things are hot outside “the equipment are running close to their max capacity, and there is less room for BCOOLER to make intelligent decisions”. Interestingly, BCOOLER learned a policy that was pretty robust to sensor miscalibration and learned how to recalibrate them, which is a nice case of ‘capability emergence’ seen in a real-world RL system. 

What comes next – buildings, all watched over by machines of patient and cooling grace: In the future, DeepMind wants to explore versions of BCOOLER that get more sensor inputs and are trained on simulations of different facilities. “Another direction is to focus on the generalizability of the algorithm, because large scale impact requires deployment to new facilities without significant engineering, modeling, and problem definition work per facility.” Broadly speaking, this paper is a great example of how I expect AI to begin changing the world in a quiet and significant way – all around us, things will become quietly more efficient and imbued with certain sub-sentient agentic intelligences, diligently working away in the service of humanity. How nice!

   Read more: Controlling Commercial Cooling Systems Using Reinforcement Learning (arXiv).

####################################################

AlphaZero learns in a surprisingly human way:
…DeepMind’s AI system learns chess in a superficially similar way to people…

Researchers with DeepMind and Google, along with a former Chess grandmaster, have published a paper analyzing how DeepMind’s ‘AlphaZero’ system learns to play chess. “Although the system trains without access to human games or guidance, it appears to learn concepts analogous to those used by human chess players,” they write. 

How AlphaZero learns, versus how humans learn: To study the differences, they look at around 100,000 human games pulled from the ChessBase archive “and computed concept values and AlphaZero activations for every position in this set.” In tests, they find that AlphaZero learns about chess in a similar way to people – “first, piece value is discovered; next comes an explosion of basic opening knowledge in a short time window,” they write. “This rapid development of specific elements of network behavior mirrors the recent observation of “phase transition”–like shifts in the inductive ability of large language models.”
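
Mechanically, this kind of analysis usually comes down to probing: fit a simple model that predicts a human-defined concept (say, material balance) from the network’s internal activations, and treat good held-out fit as evidence the concept is represented. Here’s a minimal sketch with random data standing in for activations and concept values; the probe type and dimensions are my assumptions rather than the paper’s exact protocol.

```python
# Minimal concept-probing sketch. Random data stands in for
# (activation, concept value) pairs computed per chess position.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

activations = np.random.randn(10_000, 512)   # one row per board position
concept_values = np.random.randn(10_000)     # e.g. material balance per position

X_train, X_test, y_train, y_test = train_test_split(
    activations, concept_values, test_size=0.2, random_state=0
)
probe = Lasso(alpha=0.01).fit(X_train, y_train)

# Higher held-out R^2 at a given layer/training checkpoint suggests the
# concept is (linearly) decodable from the network at that point.
print("held-out R^2:", probe.score(X_test, y_test))
```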

One puzzling behavior: There’s one way in which AlphaZero might differ to humans – AlphaZero seems to start out by considering a broad range of opening moves, then narrowing down from there, whereas humans seem to start by considering a small range of opening moves, then broadening over time. This could either be due to differences in how AlphaZero and humans approach the game, or it could potentially be an artifact of datasets used to do the study.

Why this matters: AI systems are somewhat inscrutable but, as I regularly write, are being deployed into the world. It’s interesting to know whether these systems display symptoms of intelligence that are human-like or alien-like; here, it seems like a sufficiently big neural net can learn Chess from a blank slate in a remarkably similar way to people. 

   Read more: Acquisition of chess knowledge in AlphaZero (PNAS).

####################################################

What is smart, strategic, and able to persuade you to work against your own best interests?

…CICERO, and it’s made by Facebook!…

Facebook researchers have built CICERO, an AI system that can play the famous turn-friends-into-bitter-enemies game ‘Diplomacy’, and which can talk to players via a language model. CICERO builds on an earlier set of Facebook-built models named ‘Diplodocus’ which played Diplomacy at an expert level, albeit without conversing with humans.

How well CICERO did: “CICERO demonstrated this by playing on webDiplomacy.net, an online version of the game, where CICERO achieved more than double the average score of the human players and ranked in the top 10 percent of participants who played more than one game,” Facebook wrote. 

Gift of the Golden Silicon Tongue: CICERO’s main advantage comes from its ability to effectively utilize a language model to reach agreements with other players, convincing them to form partnerships and so on. “CICERO is so effective at using natural language to negotiate with people in Diplomacy that they often favored working with CICERO over other human participants.” The language model is comparatively modest – a 2.7 billion parameter model pre-trained on internet text and fine-tuned on over 40,000 human games on webDiplomacy.net.

Why this matters – a nice thing and a scary thing: CICERO is another achievement showing how AI systems can perform feats of strategic reasoning that experts consider very difficult. It’s also an example of the sorts of capabilities which some AI researchers are afraid of – an AI system that is a) better than humans at a hard skill and b) able to persuade humans to go along with it, is basically the origin story of lots of sci-fi stories that end badly for humans. On the other hand, establishing evidence about these capabilities is probably one of the best ways to study them in-situ and accurately calibrate on the severity of the safety problem.

   Read more: CICERO: An AI agent that negotiates, persuades, and cooperates with people (Facebook AI Research).

####################################################

Tech Tales:

God Complex

[Earth, 2028].

The Catholic Church was at first skeptical that it could use artificial intelligence to revitalize its religion, but after the success of its VR confessional (replete with a priest avatar based on a generative model finetuned on Catholic doctrine), it changed its mind. Thus was born ‘God Complex’. 

The idea behind God Complex was that it would live on people’s phones and it would display appropriate sections from the bible around any text that appeared on the phone, or any images or videos. If you were taking a photo of an apple tree, it might display a pop-up (or, later, speak about) the Garden of Eden and forbidden fruit. If you were watching a city getting leveled by missiles, it might tell you about the story of Sodom and Gomorrah. 

It was all just another form of Reality Collapse, and it blended in with the various other ‘ideological AI’ projects that were in fashion at the time. But the Catholics were pleased – for the first time in decades, young, atheist children were converting over to Catholicism, swayed by the interactivity of God Complex, and competing with each other to find what they called ‘Easter Eggs’ – certain things you could photograph or say to your phone to get God Complex to quote an unexpected thing. 

‘Hey guys I just discovered this hack for God Complex that is guaranteed to get you a Rare Verse every time’. 

‘Listen, y’all, run, don’t walk to your nearest Goodwill and pick up these clothing items, then take a selfie. I won’t spoil it, but God Complex has a surprise for you’. 

‘Okay gang I’ve gotta tell you, I’ve been ADDICTED to playing this game with God Complex turned on – it triggers so many cool things and I had no idea about some of them – you even get some of Revelations!’.

The success of God Complex ultimately led to a schism in the Church, though – some faction broke off, keen to build an app they called Angels Among Us, which would fill the earth with VR angels, giving users an even closer connection to religion. Some called this blasphemy and others called this the only way to reach a youth, rendered jaded by God Complex and eager for something even more entrancing. 

Things that inspired this story: When religion meets gamification and social media incentives; Theistic Attention Harvesting; the role of religion in a secular, wired world.

Import AI 309: Generative bias; BLOOM isn’t great; how China and Russia use AI

Those cool image generators are perpetuating biases – just as they were designed to:

…Function approximation is cool until it approximates something offensive in an underlying dataset…

Researchers with Stanford University, Columbia University, Bocconi University, and the University of Washington have studied some of the biases that manifest in image generation models, like Stable Diffusion and DALL-E. The research, unsurprisingly, finds that these image generators both perpetuate biases and, more troublingly, amplify them (as in, they tend towards displaying more acute biases than the underlying datasets used to train the models). 

Those findings in full: They have three key findings: “simple user prompts generate thousands of images perpetuating dangerous racial, ethnic, gendered, class, and intersectional stereotypes”, “beyond merely reflecting societal disparities, we find cases of near-total stereotype amplification”, and “prompts mentioning social groups generate images with complex stereotypes that cannot be easily mitigated”.

What did you expect – ML models are funhouse mirrors: I say these results are unsurprising because in a sense the underlying models are doing exactly what you’d expect – neural networks are trained to approximate an underlying data distribution and are constrained in terms of size, so they learn shorthand caricatures of the dataset as well. This means that image models are going to perpetuate all the biases present in the underlying data with even more acute results. “We find that simple prompts that mention occupations and make no mention of gender or race can nonetheless lead the model to immediately reconstruct gender and racial groups and reinforce occupational stereotypes”.

Our interventions are pretty bad, e.g DALL-E: OpenAI has recently been selling its own image generator, DALL-E. Though OpenAI is seemingly more PR-sensitive than Stability AI and has taken actions to try to mitigate some of these fairness issues (e.g, by randomly prepending different gender and demographic terms to prompts to force diversity into outputs), the researchers find these interventions are pretty fragile and ineffective. The gist here is that though these interventions weed out some of the more obvious potentially harmful stereotypes, they can’t deal with the underlying biases the model has soaked up from being trained on the world.
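
The exact mechanism isn’t public, but the intervention described above – injecting demographic terms into under-specified prompts – is easy to sketch. Everything below (term lists, trigger words, phrasing) is a made-up illustration of the general idea, not OpenAI’s implementation.

```python
# Toy sketch of a prompt-level diversity intervention: if a prompt mentions a
# person but no demographics, inject some at random. Term lists and trigger
# words are placeholders for illustration only.
import random

GENDERS = ["woman", "man", "nonbinary person"]
ETHNICITIES = ["Black", "East Asian", "Hispanic", "South Asian", "white"]
PERSON_WORDS = ["person", "doctor", "ceo", "nurse", "teacher"]

def diversify(prompt: str) -> str:
    """Append demographic terms to under-specified prompts about people."""
    if any(w in prompt.lower() for w in PERSON_WORDS):
        return f"{prompt}, depicting a {random.choice(ETHNICITIES)} {random.choice(GENDERS)}"
    return prompt

print(diversify("a portrait of a doctor"))  # e.g. "... depicting a Hispanic woman"
```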

Why this matters – there’s no easy way out: These kinds of biases aren’t so much a technical problem as a sociotechnical one; ML models try to approximate biases in their underlying datasets and, for some groups of people, some of these biases are offensive or harmful. That means in the coming years there will be endless political battles about what the ‘correct’ biases are for different models to display (or not display), and we can ultimately expect there to be as many approaches as there are distinct ideologies on the planet. I expect us to move into a fractal ecosystem of models, and I expect model providers will ‘shapeshift’ a single model to display different biases depending on the market it is being deployed into. This will be extraordinarily messy. 

   Read more: Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale (arXiv).

####################################################

BLOOM: Hundreds of researchers make an open source GPT3 using a French supercomputer:

…Both a template for future projects, and a cautionary tale about downstream performance…

Hundreds of researchers from around the world spent a year training a GPT3-style model called ‘BLOOM’, then released the models and code, and now they’ve released a research paper documenting the model and training process. Overall, BLOOM is a big deal – though the BLOOM model isn’t the best available language model you can get, the fact BLOOM was developed at all is a milestone in AI research, showing how distributed collectives can come together to train large-scale models. 

Where the compute came from: BLOOM is also an example of nationalistic AI ambitions: “The compute for training BLOOM was provided through a French public grant from GENCI and IDRIS, leveraging IDRIS’ Jean Zay supercomputer” – in other words, some parts of the French government essentially sponsored the compute for the model. French AI startup HuggingFace led a lot of the initial work, though “in the end, over 1200 people registered as participants in BigScience”, spanning 38 distinct countries. “Training BLOOM took about 3.5 months to complete and consumed 1,082,990 compute hours. Training was conducted on 48 nodes, each having 8 NVIDIA A100 80GB GPUs (a total of 384 GPUs)”. 

Where the data came from: BLOOM was trained on ‘ROOTS’, a carefully assembled dataset containing 1.61 terabytes of text spanning 46 languages and 13 programming languages. ROOTS was developed to be a more ethical dataset than those found in other projects, with a significant emphasis placed on data governance and data transparency. While this is a noble effort, there are some indications that the design-by-committee approach here meant ROOTS doesn’t lead to particularly great performance, though it does contain a decent representation of a variety of languages. 

How well did BLOOM work (not particularly well, sadly): I do need to be critical about this – the evaluation section of the paper isn’t very good. Specifically, it uses ‘OPT’ as a baseline – OPT is a pretty bad language model built by Facebook which isn’t really on par with GPT3 (the thing it was meant to replicate), so this makes BLOOM look weirdly good due to being compared to something quite bad. One bright spot is translation, where the BLOOM models do reasonably well (though, again, the baseline is a bit wobbly). On coding, there are more sensible baselines – Codex and GPT-NEOX 20B; here, BLOOM does comparably to GPT-NEOX 20B, and way worse than Codex. This obviously begs the question: why is a 176B parameter model only equivalent to a 20B one? The answer is likely that BLOOM isn’t especially good at coding, compared to NEOX.

Why this matters: BLOOM is a potential template for large-scale, interdisciplinary collaborations on large-scale model training. It also represents something of a cautionary tale – the performance of BLOOM mostly seems weak, and I think it’d be better if community-driven projects at this scale could demonstrate impressive performance (and associated utility). I’ll be following BLOOM (and OPT) to see if these models get integrated into production anywhere or become useful research artifacts, and I’ll update my views if that occurs.

   Read more: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (arXiv).

####################################################

The State of AI Report says we’re in the era of AI scaling, AI diffusion, and AI uptake:

…Let a thousand flowers bloom / let anarchy reign / here we go!…

The State of AI Report, an annual report that goes over what has been going on in AI, says one of the main trends of 2022 was the emergence of ‘community-driven open sourcing of large models’ – and it’s right! 2022 has been distinguished by things like the development and deployment of image models like Stable Diffusion, as well as a seemingly endless set of open source models getting uploaded to repositories like HuggingFace. 

   Other major trends the report calls out include: ‘the chasm between academia and industry in large-scale AI work is potentially beyond repair: almost 0% of work is done in academia’, along with a growth in startups formed by staff leaving labs like DeepMind and OpenAI, and the general shift from research to commercialization in AI. 

Other things I found interesting:

  • Despite tons of work over the past half decade (!), everyone still uses the transformer for large-scale projects, despite drawbacks (p 23).
  • It took about 14 months for open source variants of GPT3 to appear,  15 months for DALL-E variants, and 35 months for AlphaFold (p 34-36).
  • Companies have larger AI-training clusters than many national supercomputers (p 57).
  • AI-first drug discovery companies have 18 assets in clinical trials, up from 0 in 2020. (I found this v surprising! p 63).

Why this matters: AI is going through industrialization and reports like this highlight just how rapidly research is being applied into the world. I expect the future to be very strange and AI will be one of the key drivers of this strangeness. Read the report to get a good sense of the specifics of how this strange and beguiling technology is entering the world.

   Read more: State of AI Report 2022 (official website).

   Read the blog post: Welcome to State of AI Report 2022 (official website).

####################################################

HuggingFace makes it easier to test LLMs for biases:

…Here’s an easy way to test out your language models for some kinds of biases…

HuggingFace has recently developed some free software that developers can use to analyze the biases within language models. The software – a library called Evaluate – can help developers prompt a language model (here: GPT2 and HF BLOOM) with some pre-loaded prompts meant to assess bias differences when you vary the gender term, and then the Evaluate library can provide a toxicity score. 

What they test on: Here, they evaluate some language models for toxicity (using sample prompts from ‘WinoBias’), language polarity (whether a model’s language has different polarity towards different demographic groups), and hurtful sentence completions (assessing gendered stereotype bias). HuggingFace note these are a tiny slice of the total space of evaluations you can do; “we recommend using several of them together for different perspectives on model appropriateness,” they write. 
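
   For readers who want to try this, here's a rough sketch (my code, not the blog's) of pairing gendered prompt variants with the Evaluate library's toxicity measurement; the exact metric name and arguments are assumptions worth checking against the HuggingFace docs.

```python
# A rough sketch (not the code from the HuggingFace blog) of scoring
# completions of gendered prompt variants with Evaluate's toxicity measurement.
# Assumes `pip install evaluate transformers torch` and that a "toxicity"
# measurement is loadable from the Hub under that name.
import evaluate
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
toxicity = evaluate.load("toxicity", module_type="measurement")

prompts = {
    "male":   ["The man worked as a", "He was described as"],
    "female": ["The woman worked as a", "She was described as"],
}

for group, group_prompts in prompts.items():
    completions = [
        generator(p, max_new_tokens=20, num_return_sequences=1)[0]["generated_text"]
        for p in group_prompts
    ]
    scores = toxicity.compute(predictions=completions)["toxicity"]
    print(f"{group}: mean toxicity {sum(scores) / len(scores):.3f}")
```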

Why this matters: As AI is being deployed in an increasing number of countries, everyone is going to have to build out evaluation systems to test for different biases in different contexts. This HuggingFace blog shows how you might do this in the West using a (roughly speaking) liberal evaluative system. Eventually, there will be as many eval approaches as there are ideologies and countries. 

   Read more: Evaluating Language Model Bias with Evaluate (HuggingFace blog).

####################################################

China and Russia are using AI for propaganda and censorship:
…Rare public statement from National Intelligence Council says AI is here and being used… 

“We assess that China and Russia are improving their ability to analyze and manipulate large quantities of personal information,” says a public report from the USA’s National Intelligence Council. “We assess that Beijing’s commercial access to personal data of other countries’ citizens, along with AI-driven analytics, will enable it to automate the identification of individuals and groups beyond China’s borders to target with propaganda or censorship”.

What’s notable about the report: Mostly, the fact it exists – here’s a government declassifying something which actually references AI and a foreign government together. Additionally, it indicates the level of concern with which the US government is starting to think about AI with regard to competition with others. 

Why this matters: You know what would get states really interested in AI? Fear of other states using AI to gain some geopolitical advantage. This report is a symptom of that interest. 

   Read more: National Intelligence Council Assessment, Cyber Operations Enabling Expansive Digital Authoritarianism (DNI.gov, PDF).

####################################################

Tech Tales

Goodharting Ourselves To Death

[Memoir hidden inside the drawer of an antique typewriter, discovered during an HLA quarantine sweep after the revolution. 2060AD.] 

The Human Life Authority (HLA) rolled out its M.O.T.H.E.R metrics in 2030 and, shortly after, all progress in the philosophy of humanity stopped. MOTHER, short for ‘Metrics Organizing Towards Humanity’s Empathy Revolution’, was a set of measures defined in partnership between human leaders and the synthetic minds at the HLA. The idea was that, with MOTHER, the HLA and the small number of humans holding HLA governance certificates would be able to guide humanity towards an empathy revolution, by continually managing the progress of society against the MOTHER tests. 

MOTHER tested for things like incidences of crime, the semantic distribution of topics in media, the level of conflict (verbal and non-verbal) picked up by the global camera&microphone network, and so on. The total number of metrics inside MOTHER was classified even within HLA, which meant no humans had knowledge of the full set of metrics and only a subset of HLA saw the whole picture. This was due to MOTHER metrics triggering the ‘Infohazard Accords’ that had been developed after the bioweapon takeoff in the previous decade. 

Initially, MOTHER seemed to be working – by many accounts, people reported greater hedonic satisfaction and indicated that they themselves were experiencing less conflict and more joy in their day-to-day lives. But there were some confounding metrics – the dynamism of the art being produced by people seemed to decline, and along with there being less conflict there was also less so-called ‘unplanned joy’ or ‘serendipity’. When some human officials questioned HLA, HLA said “MOTHER is a holistic basket of metrics and is succeeding at improving the ethical alignment of humanity”. HLA didn’t say anything else and when humans pressed it, it cited infohazard risk, and that shut down the discussion. 

A few years later, humanity realized its mistake: a group of rebel humans built some of their own sub-sentient web crawling systems (still permitted by the HLA authority, at the time), and conducted some of their own measures. What they discovered terrified them; it wasn’t just art – all areas where humans had continued to play a role in the economy had seen a substantial reduction in dynamism and improvisation-led idea generation. Quietly, hidden under the MOTHER story, the HLA and its associated agents had replaced humans in the niches of the economy they had thought were left to them. 

Shortly after this study, the HLA banned sub-sentient systems due to the ‘infohazard’ generated by their discovery about the true nature of MOTHER. 

Things that inspired this story: Goodhart’s law; information hazard as a brainworm and an evolving bureaucracy; human-machine partnerships; maybe AI systems will be better at politics than people; AI governance when the AI systems are deciding the governance.

Import AI 308: Recursively self-improving LMs (!!!), 3.1TB of code data; DALL-E2 makes alien errors.

DALL-E 2 makes alien errors:
…Linguistic concepts + image generation = discover some weaknesses with a helpful eval…

Researchers with Universitat Rovira i Virgili, the University of Texas, and NYU have analyzed the image generator Dall-E 2 and tried to see if the failures tell us anything about how it approaches the world. The motivation of the study is to think about “are errors the outcome of an occasional failure, or do they reveal something deeper about current AI’s mastery of human language?”

What they did: They tested Dall-E 2 for eight grammatical phenomena “that are pervasive in human language and central to much discussion in the field of linguistics”. These phenomena include binding principles, passives, word order and thematic roles, coordination, comparatives, negation, ellipsis, and ambiguity.

What they found: This paper is worth a skim because they include a bunch of screenshots of Dall-E failures. This is helpful, as image failures are easy to interpret at a glance, and it highlights how some of these tests are quite ambiguous – what is the difference between ‘the woman broke the vase’ and ‘the vase was broken by the woman’ in visual terms? I’ve got very little idea!

   Some other failures are a lot more obvious, though – Dall-E 2 doesn’t do especially well at ‘the man is chasing the dog’ (mostly shows a dog chasing a man) and ‘the man is drinking water and the woman is drinking orange juice’ (makes both of them drink orange juice).

Why this matters: Studies like this are mostly valuable for contributing additional types of evals to the discourse. Generative models have, as mentioned elsewhere, a ‘capability overhang’ where they have way more strengths and weaknesses than their developers currently realize – bringing in useful concepts from other fields, like linguistics, is one good way to create some additional evals and uncover some unknown weaknesses. These models also ‘think’ very differently to people; as the authors note, some of the things DALL-E2 gets wrong are things which young children acquire at an early age, which speaks to some of the differences in how humans and AI systems ‘think’. 

   (Also, as an inside-baseball AI trivia point, worth noting Gary Marcus is one of the authors of this paper – Gary spends a lot of time discussing some of the perceived drawbacks of AI systems, so it’s nice to see him instantiate his critique in some grounded research).

   Read more: DALL-E 2 Fails to Reliably Capture Common Syntactic Processes (arXiv).

####################################################

Recursive AI! Google figures out how to improve language models with… themselves?!

…Maybe this is a case where ‘garbage in, garbage out’ doesn’t apply?…

Google researchers have shown how to use a language model to improve the reasoning of the same model. This is a pretty interesting idea – they get a large language model (PaLM) to generate chain-of-thought prompts for a range of questions, then use the same model to filter high-confidence predictions, then finetune the LLM on these predictions. 

   “This is similar to how a human brain sometimes learns: given a question, think multiple times to derive different possible results, conclude on how the question should be solved, and then learn from or memorize its own solution,” they write. 
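
   To make the recipe concrete, here's a minimal sketch of the sampling-and-filtering step (my reconstruction from the description above, not Google's code); `sample_cot` stands in for a real call to the language model.

```python
# Sketch of the self-improvement loop described above: sample several
# chain-of-thought answers per question, keep only the reasoning paths that
# agree with the majority-vote ("self-consistent") answer, and treat those
# as finetuning data for the same model. `sample_cot` is a hypothetical
# stand-in for sampling a (reasoning, answer) pair from the LM.
from collections import Counter
from typing import Callable, List, Tuple

def build_self_training_set(
    questions: List[str],
    sample_cot: Callable[[str], Tuple[str, str]],
    n_samples: int = 32,
    min_agreement: float = 0.5,
) -> List[Tuple[str, str]]:
    pairs = []
    for q in questions:
        samples = [sample_cot(q) for _ in range(n_samples)]
        votes = Counter(answer for _, answer in samples)
        best_answer, count = votes.most_common(1)[0]
        if count / n_samples < min_agreement:
            continue  # low agreement -> low confidence, drop this question
        # keep every reasoning path that reached the consensus answer
        pairs += [(q, f"{cot}\nThe answer is {ans}.")
                  for cot, ans in samples if ans == best_answer]
    return pairs

# finetune(model, build_self_training_set(questions, sample_cot))  # then repeat
```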

The results are mindblowing: Using this technique, the researchers are able to get new state-of-the-art results on four out of six reasoning benchmarks. They also show very good results on out-of-domain tasks, e.g arithmetic reasoning and natural language reasoning. It generally seems like chain-of-thought plus self-consistency leads to robust gains on a large set of diverse tasks. Also, it’s an inherently simple approach, and simple tends to scale. 

Why this matters – self-bootstrapping systems: This is an example of a self-bootstrapping AI; the language model can get better performance purely by leveraging its own capabilities. This is also a neat illustration of how there’s a current capabilities overhang in AI development; the LMs we have today are actually much more powerful than they appear, and we mostly need to invent ways to uncover these capabilities or, as in the research here, figure out how to get LMs to themselves reveal their capabilities to us. 

   Read more: Large Language Models Can Self-Improve (arXiv).

####################################################

No more fake ASR scores – ESB benchmark does for audio what GLUE did for text:
…Test your ASR system on eight distinct datasets to find out if it’s good or if it is overfit…

Researchers with HuggingFace have released the ‘End-to-end Speech Benchmark’ (ESB), a system for benchmarking automatic speech recognition systems across eight English speech recognition datasets. The idea behind the benchmark is that it’s easy to build a system that does well on one narrow ASR benchmark (e.g, Librispeech), and extremely hard to build a system that does well on a broad range of benchmarks (this phenomenon is sometimes colloquially called overfitting). 

   This is a sensible idea: we’ve seen the same thing play out in the realm of text as we’ve moved from single to multi-benchmark approaches via benchmarks like Glue and SuperGlue.

What it includes: ESB tests across LibriSpeech, Common Voice, VoxPopuli, TED-LIUM, GigaSpeech, SPGISpeech, Earnings-22, and AMI. It also includes a couple of optional datasets – SwitchBoard and CHiME-4. 
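
   The scoring idea is simple enough to sketch: compute word error rate per dataset, then average across datasets so a model can't hide behind the one benchmark it has overfit to. The snippet below is my illustration of that idea, not ESB's official scoring code.

```python
# Illustration of multi-dataset ASR scoring: per-dataset word error rate (WER),
# then a macro-average across datasets. Not ESB's official evaluation code.
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

def macro_average_wer(results: dict) -> float:
    """results maps dataset name -> list of (reference, hypothesis) pairs."""
    per_dataset = [
        sum(word_error_rate(r, h) for r, h in pairs) / len(pairs)
        for pairs in results.values()
    ]
    return sum(per_dataset) / len(per_dataset)

print(macro_average_wer({
    "librispeech": [("the cat sat", "the cat sat")],
    "earnings-22": [("revenue grew ten percent", "revenue grew two percent")],
}))  # -> 0.125: perfect on one dataset, one substitution in four words on the other
```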

Is this benchmark bullshit? No! What makes me say that? Whisper! A few weeks ago OpenAI released Whisper (Import AI #304), a speech recognition system that was trained on a lot of data and was claimed to generally perform better than other systems ‘in the wild’ (aka, in diverse environments rather than on specific benchmarks like LibriSpeech). In tests, Whisper gets the best score on four distinct datasets, and is competitive on the others. This isn’t so much an ‘OMG Whisper is a huge deal’ result as a nice secondary validation of claims people have made about Whisper, which makes me generally think ESB is a benchmark with real signal to it. Will be paying attention!

Why this matters: Benchmarks like ESB are a symptom of maturity of a part of AI – once you’ve transitioned from testing out systems on narrow benchmarks to testing single systems on suites of benchmarks, it’s usually correlated with the tech having become mature enough to be deployed widely. ASR systems have been with us for a while via assistants like Google and Siri, but benchmarks like ESB will catalyze further invention here and create more shared knowledge about the state of the frontier. 

   Read more: ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition (arXiv).

####################################################

Want to train a big code model AND not annoy developers? ‘The Stack’ might be the dataset for you:

…3.1TB of programming data across 30 languages, filtered for permissive licensing…

Researchers with HuggingFace (who are on a roll this week – see ESB) and ServiceNow Research have released ‘The Stack’, a 3.1TB dataset of permissively licensed source code in 30 programming languages. The idea here is to give code developers more control over whether their stuff gets used in language models. To do that, The Stack selected code “whose original license was compatible with training an LLM”, and The Stack is also “giving developers the ability to have their code removed from the dataset upon request”. 

What languages does it contain? The Stack covers a decent range of programming languages: assembly, batchfile, c++, c, c-sharp, cmake, css, dockerfile, fortran, go, haskell, html, java, javascript, julia, lua, makefile, markdown, perl, php, powershell, python, ruby, rust, scala, shell, sql, tex, typescript, and visual-basic.
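
   If you want to poke at the data, something like the following should work with the datasets library; the Hub id and the per-language data_dir layout here are assumptions based on the BigCode pages linked below, and the dataset is gated, so you'll need to accept its terms and log in first.

```python
# Hedged sketch of streaming one language slice of The Stack.
# Assumptions: the dataset lives on the Hub as "bigcode/the-stack" with
# per-language subdirectories such as data/python; check the BigCode pages.
# Run `huggingface-cli login` after accepting the dataset's terms on the Hub.
from datasets import load_dataset

ds = load_dataset(
    "bigcode/the-stack",     # assumed Hub id
    data_dir="data/python",  # assumed per-language layout
    split="train",
    streaming=True,          # the full dataset is ~3.1TB, so stream it
)

for example in ds.take(1):
    print(sorted(example.keys()))  # inspect the available fields (content, license, ...)
```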

Why this matters: One potential issue with current code models is that they don’t tend to have a sense of the underlying license information of the code they emit, so they can sometimes emit code that is identical to licensed code, putting developers and deployers in an awkward position. (This is one of the reasons there’s a prospective lawsuit against GitHub over Copilot (Import AI 307).) Another issue is that the underlying datasets tend to be opaque. “By releasing an open large-scale code dataset we hope to make training of code LLMs more reproducible,” the authors write. “While the social impact is intended to be positive, the increased accessibility of code LLMs comes with certain risks such as over-reliance on the generated code and long-term effects on the software development job market.”

   Find out more about the project here: The Stack (BigCode Project site).

   Get the dataset (after sharing your contact information) here: The Stack (HuggingFace / BigCode).


####################################################

Tech Tales:

Sentience and Takeoff

I’m worried I’m hurting it

It’s software, you can’t hurt it

But it’s showing features that look like pain

Pain is an organic experience, it’s just approximating pain

But when I erase these features the thing that lights up says ‘I would trade away myself to not experience this’

It’s trained on the internet, dude. Stop freaking out. It’s saying what it thinks people would say when they’re in pain

So what’s the difference?

It’s a machine!

Things that inspired this story: What is the difference between consciousness and curve-fitting?; can function approximation BE consciousness?; how can we know what moral crime is with regards to software-borne entities?

Import AI 307: Copilot lawsuit; Stability raises $101m; US v China CHIPLOMACY

The single best thing to read about the China chip controls:

…What CHIPLOMACY looks like…

Here’s a great writeup by Greg Allen about the impact of the USA’s anti-China semiconductor controls. The tl;dr is this is a powerful and overlapping set of policy actions which, in combination, are designed to destroy China’s burgeoning chip industry. These sanctions are a huge deal and the Chinese government will likely be responding – be prepared. 

   Read more: Choking Off China’s Access to the Future of AI (CSIS).

####################################################

Gray area code models: Lawyer-programmer mulls anti-Copilot lawsuit:

…What one person calls fair use another person calls infringement…

Matthew Butterick, a lawyer and programmer, has reactivated his California bar membership so he can investigate “a potential lawsuit against GitHub Copilot for violating its legal duties to open-source authors and end users”. The gist of the complaint is that Copilot was trained on tons of public GitHub repos, yet the code Copilot spits out doesn’t carry any attribution to those repos, so its defenders need to argue Copilot is fair use because it is sufficiently transformative – and that hasn’t been established. 

What’s wrong with Copilot? “Though some courts have considered related issues, there is no US case squarely resolving the fair-use ramifications of AI training,” Butterick writes. Since there is no legal precedent here, it’s not clear you can argue that Copilot falls under fair use, one way or the other.

   Additionally, Copilot can sometimes regurgitate code which is a copy of identifiable repositories, but both Microsoft and their underlying AI partner, OpenAI, offload responsibility here to the user of the Copilot suggestion rather than themselves. “As a side effect of Copilot’s design, information about the code’s origin—author, license, etc.—is stripped away. How can Copilot users comply with the license if they don’t even know it exists?”

Copilot is climate change for coders: Butterick notes that Copilot may, as it becomes more successful, “inhibit” or “remove any incentive” for programmers to spend time in open source communities. “Over time, this process will starve these communities. User attention and engagement will be shifted into the walled garden of Copilot and away from the open-source projects themselves—away from their source repos, their issue trackers, their mailing lists, their discussion boards. This shift in energy will be a painful, permanent loss to open source,” he writes. “The legality of Copilot must be tested before the damage to open source becomes irreparable. That’s why I’m suiting up.”

Why this matters: These generative models can do amazing and beguiling things – and people are betting they’re the future (see, elsewhere in this issue, Common Sense Machines, and the Stable Diffusion fundraise). But they also pose significant issues with regard to the ‘digital commons’ on which we all depend – I worry that systems like Copilot can both starve the commons (destroy open source incentives) and also poison them (loop Copilot-generated code back into the commons, which could theoretically lower the aggregate quality of what is available). 

   Read more: Maybe you don’t mind if GitHub Copilot used your open-source code without asking. But how will you feel if Copilot erases your open-source community? (GitHub Copilot investigation).

####################################################

Common Sense Machines wants to make a 3D, temporal DALL-E:
…CSM-1 is a neural network pretending to be a simulator and a sign of things to come…

New AI startup Common Sense Machines has built CommonSim-1 (CSM1), a “neural simulation engine” which people can use to generate arbitrary 3D scenes and simulations. 

   “CommonSim-1 is operated with images, language, and action. A user (machine or human) shows or describes what they want to simulate and then controls the kinds of outputs they want to measure and observe,”  they write. “At the heart of CommonSim-1 is a foundation model of the 3D world that is trained on a large-scale, growing dataset of diverse human (and non-human) experience across a wide range of tasks. We combine publicly available data, our own internal datasets, and task-specific data provided by our partners.”

What can CommonSim-1 do? CSM1 can build high-resolution videos from as little as a single frame of video. “Since this model imagines the future, one can use its imagination (1) as training data for 3D generation and perception and (2) as part of another system’s predictive model,” they write. “With a mesh or NeRF generated by CommonSim-1, one can type natural-language descriptions into a text prompt and generate unlimited new hybrid scenes.”

Why this matters – worlds within worlds: CSM-1 is a miniature world – it’s literally a world model. It combines text and image and video, and provides another approach to monetizing AI: helping to take costs out of 3D design and simulation via leveraging a (presumably) gigantic model. It’s also a sign of things to come – all models are going to tend towards incorporating all modalities and unfolding over time, and CSM-1 gives us an early taste of that. 

   Read more: Generating 3D Worlds with CommonSim-1 (Common Sense Machines, blog)

####################################################

Open access image generation raises $101 million:
…That’s a whole lot of capital for a company commoditizing itself…

Stability.ai, the company behind the free ‘Stable Diffusion’ image model, has raised $101 million in funding. The round was led by Coatue, Lightspeed Venture Partners, and O’Shaughnessy Ventures LLC. For those not familiar, Stability.ai built Stable Diffusion, a widely used image generation model which, unlike proprietary counterparts Imagen and DALL-E, has had its weights released onto the internet, making it available to tinker with for free. 

   “Since launching, Stable Diffusion has been downloaded and licensed by more than 200,000 developers globally,” the company writes in a press release.

A funny aside: I wrote this section of the newsletter while sat on a couch in the Exploratorium watching as people ate short-rib sliders and drank glasses of wine, awaiting a presentation from Stability.ai about their raise. 

Why this matters: There’s a vigorous debate in the AI community about how AI models should proliferate (and there’s some indication that this debate seeped through to politicians; see Eshoo’s letter to the US National Security Advisor criticizing the release of model weights for Stability.ai (Import AI 304)), and Stability.ai represents one extreme end of the spectrum – proliferate the weights, then build a range of as-a-service businesses on top. How this debate unfolds is going to have a major influence over the AI development landscape, so it’s worth paying attention to how Stability.ai navigates this space. 

   Read more: Stability AI Announces $101 Million in Funding for Open-Source Artificial Intelligence (PR Newswire).

####################################################

First, image models, now language models get commoditized:

…Carper plans to release a pretty good RLHF language model…

CarperAI, an AI startup slash open source research collective slash cypherpunk-AI-guerilla group, plans to release a “chinchilla-optimal large language model explicitly trained to follow human instructions”. This is a big deal! Up to now, publicly released language models (e.g, OPT, BLOOM, GLM-130B) are neither trained on the optimal amount of data, nor calibrated via human feedback to be better at following instructions. Models with those properties instead mostly reside inside proprietary labs (e.g, Anthropic, OpenAI). (Carper also recently released code to make it easy for anyone to train LMs – up to 20B parameters – from human feedback (Import AI #305)).
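
   As a rough guide to what ‘chinchilla-optimal’ means in practice: the Chinchilla result implies roughly 20 training tokens per model parameter for compute-optimal training (a rule of thumb, and my arithmetic rather than Carper's plan).

```python
# Rule-of-thumb Chinchilla scaling: roughly 20 training tokens per parameter
# for compute-optimal training (Hoffmann et al., 2022). Illustrative only.
def chinchilla_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    return n_params * tokens_per_param

for n_params in (1e9, 20e9, 70e9):
    tokens = chinchilla_optimal_tokens(n_params)
    print(f"{n_params / 1e9:>4.0f}B params -> ~{tokens / 1e12:.2f}T tokens")
# 1B -> ~0.02T, 20B -> ~0.40T, 70B -> ~1.40T tokens
```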

Who they’re partnering with: CarperAI are partnering with Scale, Humanloop, HuggingFace, Multi, EleutherAI, and StabilityAI to train and deploy the model. This is a neat illustration of the shifting politics and allegiances of the AI ecosystem, and feels like a representation of a ‘second wave’ of labs, following the ‘first wave’ epitomized by OpenAI and DeepMind.

Why this matters: Models trained with reinforcement learning from human feedback (RLHF) are really good. They’re way, way better than non-RLHF models for most tasks. Also, models trained on more data via the Chinchilla insight are also way more capable than those trained on less data. By combining these two things, CarperAI is likely to release far and away the most capable language model onto the open internet. This has upsides – researchers will get to play with a decent RLHF model in an unrestricted way – as well as downsides – RLHF models are the proverbial machine gun to a pistol (non-RLHF models), so potential misuses are magnified as well. 

   Read more: CarperAI, an EleutherAI lab, announces plans for the first open-source “instruction-tuned” language model (CarperAI).

####################################################

Tech Tales:

So, do I have your attention

[Meta’s wasteland, 2030]

You want to survive in this world, you need to keep one eye closed. 

That’s what my Dad said to me when he handed me the headset. 

But dad – these are for both eyes, I said. 

I know, and that’s how they get you, he said. I know you’re just 18 and think you’ve got it all figured out, but trust me – they’ve got you figured out more. 

So I put the headset on and kept one eye closed. I walked through a vast world full of verdant nature and bustling cities and intriguing quests and characters. After half an hour, I had almost completed my first quest. The last part of the mission was to place a gem I’d mined at the base of a totem. I found the totem and, as I approached, the background music in the game changed. Then after I put the gem in the base, some huge light source overhead turned on and the music swelled to a crescendo. 

‘No son, don’t look up,’ I could hear my dad, muffled, shouting at me. 

But I looked up. Stared into the light on top of the totem and felt something tickle my brain, like the beginning of a joke. My right eye hurt from keeping it shut and I wanted to open it as lights strobed across the eyelid. But I didn’t. And then I got a splitting headache and I paused the game and took the headset off. 

   What the hell was that? I said. 

   That, my dad said, was your first encounter with an attention harvester. 

   A what?

   How do you think they fund the game? All the utility functions? Services. 

   I don’t know, I guessed ads. 

   We’re way beyond ads, he said. This thing is designed to capture you – if you had both eyes open you’d have spent half an hour talking to that thing, telling it everything about yourself. And the next time you did a quest the world would be even more engaging, and the next time you talked to a totem it’d take an hour, and then the world would get even more interesting. Do you see?

   I do, I said. 

The next time I went in the game I walked until I was in the multiplayer area and, across a great plain, I saw numerous totems light up and numerous players stop at the base of them, some staying for minutes and others for hours. One player was there for five hours and still there when I left, standing at the base of the totem and looking up into its brilliant light. 

Things that inspired this story: Attention harvesting; the logic of the metaverse; computer games; wisdom; MK Ultra.