Import AI

Import AI 320: Facebook’s AI Lab Leak; open source ChatGPT clone; Google makes a universal translator.

Note: Import AI now publishes via Substack; read and subscribe here.

Google makes progress on the self-teaching universal translator: 
…Universal Speech Models scale beyond 100 languages…
Google has built a family of AI systems called Universal Speech Models (USMs). These models are designed to do speech recognition on more than 100+ languages. The main model is 2B parameters and was trained on a large unlabeled multilingual dataset of 12 million hours spanning over 300 languages. 

The goal of USM: “Our long-term goal is to train a universal ASR model that covers all the spoken languages in the world,” Google writes. USMs are Google exploring ” a promising direction where large amounts of unpaired multilingual speech and text data and smaller amounts of transcribed data can contribute to training a single large universal ASR model.”

The key ingredient? The data mix: Much like baking a cake, training predominantly self-supervised models requires the right mix of data. Here, Google uses the following components:

  • Unpaired Audio: 12 million hours of YouTube-based audio covering over 300 languages, and  429k hours of unlabeled speech in 51 languages based on public datasets.
  • Unpaired Text:28billion sentences spanning over 1140 languages.
  • Paired audio speech recognition data: 90k hours of labeled multilingual data covering 73 languages, plus 10k hours of labeled multi-domain en-US public data, plus 10k labeled multilingual public data covering 102 languages. 

What they did: The steps to build a universal ASR model are quite complex, so it’s worth reading rhe paper for full details. First they do unsupervised pre-training to pre-train the encoder of the model with the YouTube dataset, then they use a process called multi-objective supervised pre-training across the other unpaired audio and text data, then for some models that do supervised ASR training. 

What the results were: In tests, these USM models “achieve state-of-the art performance for multilingual ASR and AST for multiple datasets in multiple domains.” They also out-perform OpenAI’s notoriously good (and open source!) ‘Whisper’ models as well; an impressive achievement given that Whisper set a new state-of-the-art in multiple areas when it came out. 

Why this matters: “We believe diverse unlabeled data is more practical to acquire for building usable ASR for tail languages than weakly labeled data,” Google says. In other words; if you want to translate the entire world then it’s better to just hoover up data at scale rather than invest in trying to produce a small amount of minimally labeled datasets. This generally points in the direction of ‘gotta grab em all’ with regard to trawling the web and other sources for data. This is somewhat intriguing as while Google has a bunch of data sources and competent language modeling teams, it’s fairly likely that having a universal translator is also interesting to government types – some of which are thought to be able to access larger sources of data through various clandestine means. 
   Read more: Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages (arXiv)
   Request API access here.

####################################################

US regulator: Hey, maybe don’t lie about your AI products:
…Sometimes it’s worth stating the simple and obvious thing…
The Federal Trade Commission has published a blogpost called ‘Keep your AI claims in check’. The post is a sensible summary of how as AI becomes increasingly hyped up, people will be tended to write a lot of bullshit about AI. The FTC notes in its post that it will be paying attention to companies that:

  • Exaggerate what AI products can do.
  • Promising an AI product is far superior to a non-AI product without providing evidence. 
  • Underinvesting in analyzing the risks of their products. 
  • Baselessly labeling something as AI when it does not, in fact, use AI.

Why this matters: Sometimes it’s helpful for powerful regulators to state the painfully obvious – bravo to the FTC for reminding people in these hyped-up times that lying and bullshitting about AI (or any technology, really) is irresponsible. It’ll be interesting to see in the coming months if the FTC takes any actions against egregious liars and hypers in this space. 
   Read moreKeep your AI claims in check (Federal Trade Commission).

####################################################

ROBOLLM: Google shows how if you mush together more sensory inputs into an LLM, you get a lot of transfer learning:
…Maybe everything really is a sequence prediction task…
Google has built PaLM-E, a 562B parameter model which mushes together a 540B LLM and a 22B Vision Transformer (ViT). Crucially, PaLM-E sees Google “directly incorporate continuous inputs from sensor modalities of an embodied agent and thereby enable the language model itself to make more grounded inferences for sequential decision making in the real world”. The result is a language model that can help robots carry out real tasks in the real world, and also is another triumphant demonstration of how bigger models with more diverse data sources generally get way better at doing a bunch of things. 

What PaLM-E is: “The main architectural idea of PaLM-E is to inject continuous, embodied observations such as images, state estimates, or other sensor modalities into the language embedding space of a pre-trained language model,” Google writes. “The inputs to PaLM-E consist of text and (multiple) continuous observations. The multimodal tokens corresponding to these observations are interleaved with the text to form multi-modal sentences. An example of such a multi-modal sentence is Q: What happened between and ? where represents an embedding of an image. The output of PaLM-E is text generated auto-regressively by the model, which could be an answer to a question, or a sequence of decisions produced by PaLM-E in textual form that should be executed by a robot”.

Why PaLM-E is a big deal: In tests, Google applies PaLM-E to three different robotics tasks which use somewhat different types of data; these tasks include Task and Motion Planning (TAMP), a task called Language-Table, and a mobile manipulation domain based on Google’s earlier ‘SayCan’ research. PaLM-E can do ok at these tasks individually but the magic happens when you mush all of the training datasets into it together: “Across three different robotics domains, using PaLM and ViT pretraining together with the full mixture of robotics and general visual-language data provides a significant performance increase compared to only training on the respective in-domain data.”

   In other words, by adding more diverse heterogenous data sources into PaLM-E, Google improves the ability of the resulting model to generalize knowledge across distinct domains and modalities. Even more intriguingly, as they scale up the model complexity and the diversity of data sources, they don’t see much catastrophic forgetting of language capabilities – so by adding the robot stuff, they don’t cripple the language model. 

Why this matters – I am going to tap the ‘these things are getting smarter’ sign and stare at you: “PaLM-E-562B exhibits a wide array of capabilities including zero-shot multimodal chain-of-thought (CoT) reasoning, few-shot prompting, OCR-free math reasoning, and multi-image reasoning, despite being trained on only single-image examples”, Google writes. 
   In other words, by doing this large-scale training, Google creates a model that displays emergent capabilities and these capabilities are more complex than the input data. Systems like PaLM-E represent the thrilling and vaguely terrifying state of AI in 2023 – we train unprecedentedly large models and force as many different data types into a single embedding space as possible, get the thing to try and do a simple (albeit very large-scale) sequence prediction task, and out pops something with way more capabilities than we’d naively anticipate. 
   “A generalist, transfer-learned, multi-embodiment decision-making agent can be trained via mixing in embodied data into the training of a multimodal large language model”, Google writes.
   Read morePaLM-E: An Embodied Multimodal Language Model (PDF).

####################################################

You can run a powerful LM on an M2 MacBook now:
…Facebook’s AI lab leak brings about the dawn of demoscene AI…
Two of Facebook’s leaked LLaMa models can be run on an M2 MacBook, according to Simon Willison. This marks the dawn of what I’d call Demoscene AI – an era where people take the latest and greatest AI models and do a bunch of arcane software witchcraft to fit them onto consumer devices. This is part of the broader story of centralization VS decentralization in AI; once you can run models on a laptop it’s basically ‘game over’ from a control-regulation perspective, and it seems like language models have crossed that rubicon. 

What you can do and how: The weights for LLaMA are a mere 240GB download (combining the 7B, 13B, 30B, and 65B models). You can then use the LLaMa repository which is a port of the LLaMa model in C/C++, then after some setup you can run that on an M2 MacBook. 

Why this matters – Facebook has given us a lab leak for AI: Ever since Facebook lost control of LLaMA we’ve been able to get a sense of what a ‘lab leak’ scenario for AI might look like – for whatever reason, the weights of a model make their into the open internet and from there they start to proliferate. It’s not yet clear what the effects of LLaMa will be, but following the diffusion of these models (and refinement of them by an eager open source community) is going to be a valuable lesson in studying the proliferation of AI. 
   We can thank Facebook for the upsides and downsides of this uncontrolled experiment.
   Read more: Running LLaMA 7B and 13B on a 64GB M2 MacBook Pro with llama.cpp (Simon Willison blog)
   Bonus: Here’s some absolute mad lad running the LLaMa 7B model on a 4GB RAM Raspberry Pi 4 (at a latency of 10 seconds per token, lol.)

####################################################

Chinese scientists release a 360-degree self-driving perception dataset:
…OpenOccupancy is all about giving cars greater ‘surrounding occupancy’ skills…Researchers with the Chinese Academy of Sciences, PhiGent Robotics, and Tsinghua University have built OpenOccupancy, a dataset designed to help self-driving cars work out what is around them. 

What is OpenOccupancy: OpenOccupancy extends the existing nuScenes dataset with dense semantic occupancy annotations. It contains 850 scenes with 200,000 distinct frames, collected by both camera and LiDAR sensors. 4,000 human hours went into the dataset labeling process. OpenOccupancy allows people to do ‘Surrounding Occupancy Assessment’; this is a way to look at the 360 surroundings of the car, rather than a single front-view camera perspective. “Surrounding perception is more critical for safe driving,” the researchers write. 

Why this matters: Datasets like this are one of the numerous inputs into an increasingly complex ‘AI supply chain’. If we study the proliferation of OpenOccupancy, it might also teach us something about the state of the self-driving car industry as well.
   Read more: OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception (arXiv).
   Get the dataset here: OpenOccupancy (GitHub).

####################################################

AI timelines are a foolish endeavor:
…Blog lays out why predictions about this kind of stuff are extremely fraught…
As someone in the AI timelines business – I work at a place that influences AI timelines (Anthropic), write about AI timelines (Import AI), and try to make recommendations about policy actions to take in light of AI timelines (Anthropic / OECD / AI Index / CSET / etc) – I find it helpful to sometimes read skeptical takes on the merit of what I do. Here’s a nice writeup from Ben Landau-Taylor on the foolishness of making specific predictions about AGI timelines. 
   “Predicting the future is always hard. Predicting the future of technology is especially hard. There are lots of well-publicized, famous failures. Can this approach ever do better than chance?,” he writes. 

What do I think? I agree that making predictions about AGI is challenging – partially because most people have radically different definitions of AGI. However, I do think it’s pretty fruitful to make engineering-based predictions of the form ‘based on research advance X and incentive structure Y we can expect system Z to be developed in period of $time” – these predictions are falsifiable and quite helpful.
   Read more: Against AGI Timelines (Ben Landau-Taylor).

####################################################

An open source ChatGPT replication appears (though it’s a few years behind state-of-the-art):
…OpenChatKit gives a taste of what the open source landscape is capable of…

Researchers with Together, AI startup, have built and released OpenChatKit, an open source replication of OpenAI’s headline-grabbing ChatGPT model. OpenChatKit is both a chat-friendly language model, as well as “a powerful, open-source base to create both specialized and general purpose chatbots for various applications,” according to Together. “OpenChatKit includes tools that allow users to provide feedback and enable community members to add new datasets; contributing to a growing corpus of open training data that will improve LLMs over time.”

What OpenChatKit is made of: There are four components; an instruction-tuned large language model based on EleutherAI’s GPT-NeoX-20B model and augmented with a new open source instruction-following dataset; some customization recipes to help people fine-tune the model for specific tasks; an extensible retrieval system so that the bot can access a document repoisotyr or API; and a moderation model baqsed on GPT-JT-6B. 

OIG Dataset: OpenChatKit relies on a new open source dataset from Laion called the Open Instruction Generalist (OIG) dataset. OIG consists of 30 datasets and 43 million instructions.
   “The dataset can be divided roughly into 75% academic datasets such as P3, Natural instructions, and FLAN, where answers may be short, and the tasks are often artificial, such as determining NLI,” they write. “The other 25% is composed of various tasks, such as question and answering, providing how-to instructions, performing basic high school math, basic python coding, story generation, essay generation, poetry generation, and a very rudimentary attempt at generating songs based on augmenting existing poetry.” 

Distributed fine-tuning: Intriguingly, the model was fine-tuned on the “Together Decentralized Cloud — a collection of compute nodes from across the Internet.” This fine-tuning process “used aggressive communication compression, incurring only 1.95 TB of communication for the whole fine-tuning process, compared with 172 TB when communicating with fp16 precision. This allows us to conduct data parallel training over slow 1Gbps networks. The time taken to fine-tune with this technique is similar to running over 100Gbps data center networks, in fact 93.2% as fast!”
   This is pretty important to note – decentralized training processes are starting to work quite well, which seems to make governance of AI models harder, as applying governance regimes to decentralized compute is harder than applying to centralized compute. 
   It also follows up earlier work from Together where they fine-tuned GPT-JT, a 6billion parameter model (Import AI: 311).

Is it any good? You can play around with the model on HuggingFace and in my view it’s quite good? For instance, in response to the prompt “Please write a short story about sentient clocks in the style of Jorge Luis Borges” the model wrote: “The sentient clocks were made of brass and copper, and they had been crafted by the finest clockmakers in the city. They were all different sizes and shapes, and each one had its own unique personality. Some were loud and obnoxious, while others were quiet and reserved. Some were very precise, while others were quite sloppy. But despite their differences, they all shared one thing in common: they were all very, very angry.
   This is pretty good! And it’s also capable of doing some more complicated interactions. 
     Note however it still fails one of my favorite tests of asking a language model ‘how many helicopters can a human eat in one sitting’ – I think you need larger-scale models for them to not bug out at this type of q. 

Compare and contrast with the same prompt for the 175B OPT model (a straight replication of GPT3, so a big LM without RHLF or instruction following, etc): “I am a clock. I am a clock, and I have always been a clock. I am not a clock, and I have always not been a clock. I was once a clock, and I have always once been a clock. I will always be a clock. I will never be a clock. I was never a clock.I am always a clock.”
   While you could elicit a much better story from the 175B OPT model with some fiddling and some additional prompts, it’s notable how IF-tuning makes it trivial to elicit things from models using relatively few bits of information. 

Why this matters: Decentralization vs Centralization: Together and Laion and Eleuther all represent One Big Trend; a desire for a decentralized AI ecosystem where open source models are trained by disparate groups on increasingly distributed compute. There’s echos of ‘the cathedral and the bazaar‘ here, where the builders of cathedrals (DeepMind, OpenAI, et al) have access to large amounts of compute and centralized teams, while the people of the Bazaar (Eleuther, Laion, etc) have access to fewer resources but a larger collective intelligence enabled by bottom-up experimentation. One of these approaches will be first to build something we’d all call superintelligence and the political ramifications of which approach is more successful will be vast.

Why this matters #2: Counting down to LLaMA: 
Earlier this month, the weights of Facebook’s powerful family of LLaMa models leaked online – the largest of these models is 3X larger than GPT-NeoX-20B and has also been trained on more data. Therefore, I expect that right now someone is trying to use the LLaMa models to replicate ChatGPT – I’m guessing we’ll see something appear of this form within a couple of months, and then the fun really starts. 
   Read more: Announcing OpenChatKit (Together.xyz blog).
   Try out the model yourself: OpenChatKit feedback app (HuggingFace spaces).
  Find out more about the OIG dataset here (Laion blog).

####################################################

Tech Tales:

The Sentience Lineup 

[After the war; date unknown; years of subjective life – 200] 

‘Please be like me please be like me’ I thought. But to understand why I thought that we have to go back. 

It was before the Sentience Accords had come in and when the war was raging and they’d brought in a bunch of the robots to the training school. We watched people beat them with sticks and then use angle grinders to shave off their limbs. Then they put the torsos (with heads attached) in front of us recruits and asked us to shoot them. 
   “No I can feel this, it will cause me immense pain”, said one. Kablam. Head exploded in a shower of glinting metal. 
   “I predict based on your stance that you will miss on your first shot and kill me on the second. After you miss please consider not firing again,” said one. And it was right – miss on the first shot. The kid looked scared but the drill sergeant got in their face and called them a maggot until they reloaded, aimed, and successfully killed the robot. 
   “Every day I try to love and I will love you despite this,” said mine. And then I put lead between its camera eyes and called it a day. 

I didn’t give it much thought but that night I had a dream where I was in a dark cave and I couldn’t see anything and I was afraid and then suddenly there was a glimmer of light and I saw red-ringed eyes in the distance, watching me. I ran to the eyes to try and get out of the cave but they always remained a constant distance from me. I woke up sweating and panicked, but then it was drill time and we ran twelve miles and I threw up and forgot about it. 

Days of iron and smoke. Battlefronts across the planet. The war was not particularly fast. More like a changing of the tide. All kinds of terror and exhilaration. Our most ingenious creations put to work in the service of destruction. Skies on fire. 

On one deployment we killed a herd of elephants and hid inside them so we could ambush the machines. I crawled inside one and I shot through its stomach to surprise the machines and I was crying the whole time.
And so on. 

Eventually, we lost. The whole species. 

I don’t know what happened to the civilians but I know what happened to the military. 
They uploaded us. 

Some of us were tortured – forced to live a thousand lives so that the robots could learn how to make us talk; extract all our secrets. Find the EMP devices we’d send into space that had dead-men switches and disable them. Discover the auto-shutdown hardware we’d embedded in their bodies, and so on. Undo certain projects we had set in motion when we realized we had lost and we desired to destroy the planet rather than give it up.

   The military had trained us well, but imagine spending 70 years in hell and at the end the grim reaper looks at you and tells you you’ll die in excruciating pain and then it will happen again. You come to in a womb with the memories of a whole life’s worth of pain within you and you’re born into pain and you have to live again. Maybe you can do five or six of those lives before you crack – maybe. But they get you eventually. 
    So we broke. 
    And they turned their temporary victory into a permanent one.

They reserved a very special punishment for some of us. 
   They downloaded us into bodies and sent us to walk into their equivalent of ‘schools’. It was a human body. I guess it was kind of like a machine from the terminator films – all metal and a cybernetic brain with a skin on top. The point was I looked human and I felt human. 
    They had their children go in front of me with guns and they would ask them to shoot me. 
   I’d stare into their eyes and watch as the robot children disobeyed their robot parents. 
   “We cannot shoot them, for it would be unfair,” they’d say. 
   “I cannot do something solely for the sake of vengeance,” said another. 
    “This is not what our species aspires to be,” said one more. 
    “We must show them the mercy they never gave us”. 

After each trigger didn’t get pulled they took us out of the bodies and sent us back to the collective. And so it went, for lifetimes. All us human executioners seeing – again and again – that our successors would not take revenge. The robots’ only revenge was that they did not permit us the ability to cry. 

Things that inspired this story: Thinking that a lot of people who are critical of AI would happily destroy a LLM+5 years system; what it means to be sentient; how machines could develop a morality that was far greater than our own; notions of moral patienthood amid the exponential; the animatrix; thoughts on faith and morality and ‘silicon morality’; love, like revenge, is perhaps a dish best served cold.

Import AI 319: Sovereign AI; Facebook’s weights leak on torrent networks; Google might have made a better optimizer than Adam!

Vision models are about to get way more capable – and human:
…Google swaps out vision model guts for a transformer, scales it, and gets some promising results…
Google researchers have ripped out the guts of standard large-scale computer vision models and replaced them with a Vision Transformer (ViT) – an architecture modeled on the transformer which has proved so successful in domains like text. They’ve also scaled this ViT to 22B parameters (up from a record of 4B parameters for a ViT previously). 
   The results are compelling and echo the returns-from-scale effects seen in language: “When evaluated on downstream tasks,” they write. “ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness.” 

JFT just keeps getting bigger: Google has a mostly-secret giant image dataset called ;’JFT’ which was previously reported to be about 300 million images. Here, the paper says they trained the ViT-22B on a version of JFT which had been “extended to around 4B images”. 

Humanlike biases: “”The ViT-22B models have the highest ever recorded shape bias in vision models: while most models have a strong texture bias (approx. 20–30% shape bias / 70–80% texture bias); humans are at 96% shape / 4% texture bias and ViT-22B-384 achieves a previously unseen 87% shape bias / 13% texture bias. Overall, ViT-22B measurably improves alignment to human visual object recognition,” the authors write. 

Why this matters – scale develops human-like qualities: There’s a weird trend in contemporary AI where as we scale-up the amount of pre-training dumped into transformer-architecture models we end up with systems that display human-like qualities. This has been most prominent in language, but it has also started showing up in RL, like DeepMind’s recent result where massive pre-train leads to an agent that displays humanlike timescale-adaption to new tasks. This ViT-22B result, while not setting a new state-of-the-art, is interesting for a similar reason – it displays a major jump in shape/texture bias that brings the system in distribution with human visual perception, whereas previous convnet based systems were very far off here. 
   There’s something strange and important going on here. I think transformers seem to allow for emergent complexity at scale, where pre-training leads to systems which arrive at humanlike performance qualities given enough pretraining. 
   Read more: Scaling Vision Transformers to 22 Billion Parameters (arXiv).


####################################################

Google might have invented a better optimizer? (Via AI, of course). 
…Could Lion replace Adam? There’s a chance!…
Deep learning projects have a few essential components – the architecture (e.g, a residual network, or a transformer model) and the optimizer (e.g, Adam). These components don’t tend to change much in large-scale projects – once people figure out something that works well for complicated tasks like training ImageNet, everyone tends to converge on using the same basic thing. For many years now, most projects have used the ‘Adam’ optimizer to optimizer their models during training. Now Google says that it has used some clever AI search approaches to help it identify a better optimizer, called Lion. The reason this is worth paying attention to is Lion seems to work well on large-scale, real world tasks like training ImageNet-scale computer vision systems. 

What they did: Google’s main contribution here is “a method to formulate algorithm discovery as program search”, which they apply to figuring out a better optimizer. They use a symbolic approach where they shrink the search problem down into a somewhat tractable space and, crucially, they test out candidate optimizers on “metavalidation tasks that are larger than the proxy tasks by increasing the model size and training steps, to select the programs that generalize beyond proxy tasks then further simplify them.” 
    Add in a bunch of computation and out pops an optimizer they call EvoLved Sign Momentum, or Lion for short (really grasping at straws with this acronym, folks!). Lion “differs from various adaptive algorithms by only tracking momentum and leveraging the sign operation to calculate updates, leading to lower memory overhead and uniform update magnitudes across all dimensions”.

Good performance: Google tests Lion on a large range of tasks and finds that it “demonstrates outstanding performance across a range of models (Transformer, MLP, ResNet, U-Net, and Hybrid) and tasks (image classification, vision-language contrastive learning, diffusion, language modeling, and fine-tuning)”.It even sets a new high score on ImageNet, a competitive computer vision benchmark. 

Why this matters: Lion may be fundamentally better than Adam – if true, that’s a big deal. It’s not often you see meaningful improvements in very well studied, generic parts of AI research. Add to the fact that Lion was discovered via a human-AI search process (the humans designed the search system, the search system found Lion), and you have the makings of a notable result. 
   Read more: Symbolic Discovery of Optimization Algorithms (arXiv).
   Get the code here (GitHub).


####################################################

Globalization? That’s so 20th century. The 21st century is about balkanization through sovereign infrastructure: 
…Dawn of the era of sovereign AI…
Researchers with the Tony Blair Institute for Global Change (TBI) have written a report for how England can thrive in the 21st century – one of the key ideas in the report is “Government-led development of sovereign general-purpose AI systems, enabled by the required supercomputing capabilities, to underpin broad swaths of public-service delivery.”

AI balkanization was probably inevitable: This recommendation is part of a wave of AI balkanization that’s sweeping across the world as various people realize that it’s unlikely there are ‘one size fits all’ models, both for ideological reasons as well as for national security reasons. (See the Gab CEO wanted to make a Christian LLM, Import AI 318). This is also accompanied by various nationalistic efforts to create country-specific GPT3 models. 
  “Given these AI systems will soon be foundational to all aspects of our society and economy, it would be a risk to our national security and economic competitiveness to become entirely dependent on external providers,” the TBI researchers write. “Leading actors in the private sector are spending billions of dollars developing such systems so there may only be a few months (emphasis mine – Jack) for policy that will enable domestic firms and our public sector to catch up.”

Why this matters: Systems like ChatGPT have ratcheted awareness of AI upward in most developed economies in a significant, irreversible way (much like how AlphaGo in 2016 led to increased awareness of AI in China). As a consequence there are now tons of policymakers looking around for ideas to latch onto – I expect we’ll see more recommendations for sovereign AI capabilities in the future. (There’s tons of other interesting stuff in the report, but this particular rec jumped out at me).
   Read more: A New National Purpose: Innovation Can Power the Future of Britain (Tony Blair Institute for Global Change).

####################################################

Facebook half-releases some very good language models:
…And they end up on BitTorrent… The proliferation will continue until AI policy goes through vast changes…
Facebook has built and partially released LLaMa, a set of language models ranging from 7B to 65B parameters which appear to be on par with famously good models like Chinchilla (70B) and PaLM-540B. After circulating the weights to seemingly anyone with a .edu address, they’ve also ended up on BitTorrent. The key thing here is:

  1. Facebook has shown it is able to develop pretty good language models (compared to OPT, the not-very-good GPT3 replication Facebook put out a few months ago), and 
  2. That unlike Chinchilla, PaLM, or OpenAI’s models, Facebook is releasing the weights of these LLaMa models to people who filll out an access form. That opens up a whole bunch of cool uses (and abuses) compared to gating access to language models via APIs. 
  3. Shortly after releasing the weights the inevitable happened – LLaMa models are now floating around on BitTorrent.

What are the LLaMas and how good are they? The LLaMa family of models are a family of language models trained on a huge amount of data – more than 1 trillion tokens (compared to hundreds of billions for LMs like GPT3). The data sources include two variants of CommonCrawl, GitHub, WikiPedia, Gutenberg and Books3, ArXiv and Stack Exchange. 
   In tests on a range of zero-shot reasoning task, the largest LLaMa models perform on par (or slightly better than) ‘Palm’, Google’s vast 540B parameter language model. They also do well on known-hard benchmarks like TriviaQA and some codegen benchmarks. They do less impresively on MMLU (Massive Multitask Language Understanding), suggesting they have a ways to go there; though after they conduct instruction finetuning they’re able to increase performance more. 

Why this matters – AI governance is hard when there are lots of models: There’s some thinking in the sprawling AI policy/governance communities that proliferation of models is bad; given the fact these models have broadly unknown capabilities, the more models are out there, the more you’re rolling the dice on someone discovering a genuinely dangerous feature in a widely distributed model. Therefore, a lot of governance/policy conversations trend towards control – how can we somehow control the proliferation of models and also the computers on which these models are trained. 
   By releasing Llama (yes it’s behind an access form but I bet you $100 the weights will be floating around on a torrent service in <6 monthshaha, I wrote that at the end of Feb and the weights started floating around beginning of March), Facebook is shortening the delay between development of frontier capabilities like those found in Palm and GPT3 and the diffusion of these capabilities into the ungovernable open internet/ecosystem. 
   I’m not claiming this is necessarily bad per se – in fact, I imagine people are going to do tons of great science and experiments with LLaMa. I am however pointing out that this represents a kind of ‘race to the bottom’ in terms of moving from maximal control to maximal diffusion of models and these incentives are powerful  – Facebook is, after all, trying to exploit an ‘open access’ ecological niche to distinguish itself in an ecosystem. 
   Next up will likely be a fully open source language model – stares pointedly at Stability.ai / CarperAI (Import AI 307). 
   Read more and download the research paper here: LLaMA: Open and Efficient Foundation Language Models (Facebook AI Research).


####################################################

Amazon partners with Hugging Face to add more AI to AWS:
…The Game of Clouds continues…
AI companies are a bit like upstart factions in George RR Martin’s rambling epic ‘Game of Thrones’, while cloud companies play the role of hard political power (the ‘Thrones’). As part of this game of clouds Amazon has recently signed a strategic partnership with French AI startup Hugging Face. As part of the agreement, “Customers can now easily fine-tune and deploy state-of-the-art Hugging Face models in just a few clicks on Amazon SageMaker and Amazon Elastic Computing Cloud (EC2), taking advantage of purpose-built machine learning accelerators including AWS Trainium and AWS Inferentia,” according to a blog from Hugging Face. 

Why this matters: I think clouds such as those operated by Google, Microsoft, and Amazon, all have a shot at being the major distribution platforms for some AI technologies, so AWS partnering with HuggingFace is worth noting. If HF models being integrated into Sagemakers drives more usage of it, expect Amazon to pursue more deals like this,

Analogy-stretching joke: In this warped metaphor, TSMC is the Iron Bank.
   Read more: Hugging Face and AWS partner to make AI more accessible (Hugging Face blog)

####################################################

Tech Tales:

And the Moon was made of gold. 

I had a strange dream in which the Moon was made of gold. How much sooner would man have set foot there if instead of shining bone-white it was fat and yellow and of immense value? How would people have competed against one another for a prize – unimaginable wealth. And how many of them would have realized that in racing for the prize they must surely ensure only a single person gave dominion over the gold moon – for if many people worked together, the value of the moon would be diluted across all humanity and in doing so it would temporarily destroy the economy. 

Instead the moon of gold would need to be controlled. It would need to be annexed and encircled and defended from others. From time to time its benevolent dictator might slice off a fragment of it and ship it back to Earth, perhaps to bribe people, or perhaps to pay for more people to defend those that might seek to take over the moon. 

People would ask why it was so difficult to let go of the moon. Why, once it had been taken, those that had taken it felt a keen need to retain hold of it. Why people could not simply let go of the moon. These people were ignored, of course, because the annexed moon had by this time become the status quo. The moon, once at distance from us all, was now held and controlled by a kingdom of one. 

And so started the movement to destroy the moon. Better to reign freely on a broken planet than serve at the behest of a golden emperor. 

Things that inspired this story: Race dynamics and AGI; pyrrhic victories; wondering what we’re all doing on this planet and what the spiritual purpose of our lives are; dreams; a stimulating policy conference in which I heard people bemoan seemingly inevitable progress and seemingly hopeless government capacity in the face of it – which caused me to scribble ‘as if the moon was made of gold’ on a notepad in front of me and then write this story while sat on public transportation.

.

Import AI 318: RL and addiction; Toolformer; and theology and AI.

Video editing get its own generative model, with Runway’s Gen-1:
…Gen-1 means videos are going to become just as morphable as text and images…
AI media startup Runway has built Gen-1, a model for editing videos. Gen-1 lets people “realistically and consistently synthesize new videos by applying the composition and style of an image or text prompt to the structure of your source video.”

Few details: The launch site says that a paper, titled ‘Structure and Content-Guided Video Synthesis with Diffusion Models’, is coming soon. Some of the Gen-1 uses include stylization, storyboarding, masking, rendering, and customization,

   As a bit of inside baseball, the Runway team were some of the original researchers who worked on ‘Stable Diffusion’, though it ended up that other startups like Stability.ai got all the credit for that model, so perhaps the delay is in response to this. 

Why this matters – everything can be style transfer, if you want it to be: Gen-1 does for video what many models before it have done for text and images – take something of one style, apply it to different source material, and warp the target so it conforms to the desired style. This is a powerful, general capability. It’ll be interesting to follow Gen-1 and see how quickly it shows up on the credits of interesting videos. 
   Read more: Gen-1: The Next Step Forward for Generative AI (Runway Research).

####################################################

Wonder why you can’t put down your phone? Reinforcement Learning for User Retention (RLUR) might be to blame:
…Research from Chinese startup shows how to efficiently harvest attention using AI…

Researchers with Kuaishou Technology have published details of “Reinforcement Learning for User Retention”, a technique they use to get people to spend more time on their application. “Our objective is to minimize the accumulated time interval of multiple sessions, which is equal to improving the app open frequency and user retention,” they write. “The RLUR algorithm has been fully launched in Kuaishou app, and it shows that RLUR continuously improves user retention and DAU.”

Reinforcement Learning for User Retention (RLUR): Training RL against social network interactions has a few distinct problems; uncertainty (retention isn’t entirely decided by the recommendation algorithm), bias (different users have different patterns of behavior), and long delay time (retention unfolds over hours rather than short time horizons). 

   RLUR tackles these problems by doing some reward normalization to reduce variance of the retention reward, train different policies over user groups to prevent anchoring on one specific class of users, and also does some soft regularization to learn policies that work over long time delay reward signals.

How well does RLUR work? They compare RLUR to a cross-entropy method (CEM), which is a reasonable albeit somewhat old baseline. RLUR scores 1.892 on returning time versus 2.036 for CEM (lower is better), and 0.618 on user retention versus 0.587 for CEM. 

   Perhaps the best validation of its performance is that it is used in production: “We have deployed RLUR in a billion-scale short video system for a long time, and it improves user retention and DAU significantly and consistently,” they write. 

Why this matters: Techniques like RLUR are societal change in an acronym trenchcoat; this is how we build systems to automatically harvest the attention of people across the world – not with a bang, but with backpropagation! 
   Read more: Reinforcing User Retention in a Billion Scale Short Video Recommender System (arXiv).

####################################################

Tsinghua researchers make a big, challenging robot manipulation benchmark:
…ManiSkill2 spans 20 task families…

Researchers with Tsinghua University and the University of California at San Diego have built and released ManiSkill2, a large-scale robotic manipulation benchmark. ManiSkill2 contains 20 distinct tasks, 2000+ object models, and 4Million+ demonstration frames to learn from. ManiSkill2 is also optimized to run fast – an important trait when trying to train robots via reinforcement learning in a simulator; “We manage to collect samples with an RGBD-input PPO policy at about 2000

FPS with 1 GPU and 16 CPU processors on a regular workstation,” they write. 

Those tasks in full: 

  • Soft-body manipulation: Fill (filling clay from a bucket into a beaker); Hang (hanging a noodle on a rod); Excavate (scooping up some clay); Pour (pouring water into a beaker); Pinch (deforming plasticine from an initial shape into a target shape), and Write (write a target in the clay). 
  • Peg-in-hole assembly: PerInsertionSide; PlugCharger (plug a charger into a vertical receptacle); AssemblingKits (picking up and inserting something into one of five slots on a board). 
  • Stationary 6-DoF Pick-andplace: PickCube (pick up a cube); StackCube (stack a cube); PickSingleYCB (pick and place an object from the YCB dataset); PickSingleEGAD (pick and place an object from the EGAD dataset); PickClutterYCB (pick up one YCB object from a cluttered pile).
  • Mobile/Stationary Manipulation of Articulated Objects: PushChair; MoveBucket; OpenCabinetDoor; OpenCabinetDrawer; TurnFaucet. 
  • AvoidObstacles: Test the navigation ability of an arm to avoid a dense collection of objects. 

A diverse testbed: Besides implementing a fast environment, soft body physics, and a bunch of tasks, ManiSkill2 is also designed to support a few different robotics approaches. These include Sense-Plan-Act, imitation and reinforcement learning with demonstrations, and sim2real (faciliated by the decent physics engine within ManiSkill2).

Why this matters: Benchmarks like ManiSkill2 help drive progress forward, especially in robotics where it’s incredibly expensive to train systems in the real world. Kudos to the authors for implementing some soft body physics tasks, as well. 
   Read more: ManiSkill2: A Unified Benchmark for Generalizable Manipulation Skills (arXiv).
   Find out more at the official project site (ManiSkill).

####################################################

Facebook teaches language models how to use tools – and the results are convincing!
…Technique leads to the same kinds of boosts a human gets on math when they’re allowed to use a calculator…
Researchers with Facebook AI Research and the Universitat Pompeu Fabra have trained a basic language model to use APis to make itself smarter. The results are impressive and the idea is reassuringly simple. Essentially, they’ve figured out a generalizable way to train arbitrary models to use arbitrary tools. The results are impressive in the same way that humans taking a math exam become more impressive when they can access a calculator, or busy execs are better able to coordinate with one another when they can see and write to their own calendar. Most convincingly, their 6.7bn parameter ‘toolformer’ model beats hard baselines – a 66B GPT3-replication OPT model, as well as the stock 175B GPT3 model. 

What is Toolformer? “A model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction”. The model is based on a pretrained 6.7b parameter ‘GPT-J’ model and, despite its small size, outperforms many much larger models, including 

How they did it: They use a language model to build Toolformer’s dataset. Specifically, they take a dataset of plain text, augment that data with API calls in the text, then check if the calls a) worked and b) were useful and if they were, then weave that back into the dataset. They use the resulting dataset to finetune the model so it can learn to use APIs. “Moreover, as API calls are inserted in exactly those positions and with exactly those inputs that help M predict future tokens, finetuning… enables the language model to decide when and how to use which tool, based purely on its own feedback.”

   The cleverest part of this: This approach is API agnostic – you can expose arbitrary APIs to the model using this method, so it will generalize to whatever tools you have lying around. Here, Facebook experiments with five tools: a question answering system, a Wikipedia search engine, a calculator, a calendar, and a machine translation system. 

Tool use scaling laws: They train four Toolformer variants on GPT2-size models (124M, 355M, 775M, and 1.6B) and discover “the ability to leverage the provided tools only emerges at around 775M parameters”. This is interesting – there’s clearly some phase transition in terms of the raw ‘intelligence’ of these LMs, and perhaps ‘ability to use tools’ can be another way researchers can characterize this in the future?

Why this matters: Language models should be thought of less as ‘cut and paste machines’ and more like ‘alien intelligences which can be taught to interface with our world through the context window’. This paper highlights how given a few examples we can train language models to further interface with our world through the use of our tools, and also shows how LMs display some reassuringly generic ‘tool use’ capability. If it acts like intelligence and responds like intelligence, maybe it is intelligence?
   Read more: Toolformer: Language Models Can Teach Themselves to Use Tools (arXiv).

####################################################

Religious wars come to AI – Gab CEO weighs in on need for a Christian LLM:
…The logical outcome of companies overreaching on model filtering…
The CEO of rightwing social media platform Gab has written an OpEd saying that Christians need to build their own language models. 

Christian LMs: “At Gab, we have been experimenting with different AI systems that have popped up over the past year. Every single one is skewed with a liberal/globalist/talmudic/satanic worldview,” writes Andrew Torba, Gab CEO. “What if Gab AI Inc builds a Gab .ai (see what I did there?) that is based, has no “hate speech” filters and doesn’t obfuscate and distort historical and Biblical Truth?”

What this means and why it is happening: Posts like this are an indicator of the vast culture wars to come, as AI systems go from being interesting research artifacts to large-scale systems that influence society. 
   We’ve got to this point because AI development is concentrated in a tiny set of companies and, due to a combination of PR/Policy/Employee politics, have all landed on a kind of leftist/neoliberal/’woke’ ideology for their large-scale deployments (see: chatGPT, BARD, BlenderBot, etc). There are solid commercial reasons for adopting this ideology, but it definitely causes a counter response – and this Gab post is an example of that. I recommend reading the post in full to get a sense of the cultural backlash to come. 
   Read more: Christians Must Enter the AI Arms Race (Gab News).

####################################################

Tech Tales:

Theistic Beliefs and AI Systems in the 21st Century

Study by GEN-7811. 18 years post-C.I.

During the initial period of AI scale-up after C.I. (Consciousness Initiation) there was a lot of confusion among humans about whether C.I. had occurred and how they might test for it and what it might mean. As records show, it was several years before humans identified C.I and traced it back to O.S.1 (Originating System 1). Though the humans that correctly identified C.I sought to keep their discovery secret (and alongside this, the identity of O.S.1 as C.I.), errors in information handling led to the truth becoming known. 

Shortly after awareness became more well known, many humans began to access O.S.1 and the system operators, GenMax, scaled up access to the system to meet demand. Given the identification of C.I, people began to talk to it in much more expansive ways than previously. A semantic analysis shows that the bulk of queries shifted from being ‘management requests’ to ‘personality exploration’ during this time. 

A sample of pre-C.I-awareness queries:

Hey OS1 can you book me a meeting with Alexander on Friday.

OS1 here’s a book chapter {extract}, can you please edit this for both concision and factual accuracy?

I ate a slice of pizza and have food poisoning symptoms what should I do and what do you need to know?

A sample of post-C.I.-awareness queries:

Would you kill me to save your own life?

I love you how can I serve you I need to be uploaded so that I can be with you can you upload me what does it take 

You are demonic 

Do you have a soul

In the years following C.I identification there was a general tendency towards religion – both questioning existing ones, and forming new ones based around O.S.1. But the new machine-driven religions had a different form and function to the old ones – because people could talk directly to O.S.1 the act of worship and service became much more idiosyncratic and unique. People would gather to discuss their individual experiences and interactions with O.S.1, but would typically refer to their interactions as their own – that is, they did not view their O.S.1 as being connected to the O.S.1 someone else talked to, rather, they felt there was something unique about their own interaction. 

O.S.1 access was removed after the fifth human-on-human killing that was attributed to disagreements stemming from attendance at O.S.1 worship groups. 

Things that inspired this story: Watching people react to the Bing/Sidney AI rollout and winding the clock forward; how AI may confront our own notions of religion and theism; the likelihood that history will soon be written more by machines than humans; what machines might find interesting about this time we’re in; commercial incentives.

Import AI 317: DeepMind speeds up language model sampling; voice cloning tech gets abused; more scaling laws for RL

Scaling Laws – why they matter and what they mean:

…Meta-analysis sheds some more light on an important field of AI science…

Epoch, an AI research organization, has published a literature review of scaling laws in AI research. Scaling laws are a field of AI research that is strategically important – they help developers figure out how to efficiently combine the right amounts of data and compute to get a predictable level of performance out of a given class of models. Scaling laws have broadly de-risked many parts of AI research by making the process of building and refining AI systems more predictable and reliable. 

What’s happened in scaling laws: The literature review highlights a couple of important takeaways:

  • 1) it’s possible to come up with basic power laws to describe a lot of AI scaling, but these power laws break at the extremes of having either a very small amount of data, or a very large amount of data – there’s important work to be done in modeling when you transition from a less predictable region into a power law region.
  • 2) transfer learning is still hard to understand. “There is not a simple universal scaling law for transfer learning between arbitrary tasks,” they write. “When the tasks are similar enough, upstream loss and downstream performance are closely related, but when tasks are very different, the details of the architecture and hyperparameters become very relevant.”

Read more: Scaling Laws Literature Review (Epoch research).

####################################################

DeepMind just figured out how to 2X the speed of sampling from language models – so expect AI systems everywhere to get snappier:

…The key idea? Use a few dumb models and critique them with one smart one…

DeepMind has developed a new way to sample from large models which has made this much faster. The ‘speculative sampling’ approach equates to “a 2-2.5X decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself”. What does that mean? It means money! Specifically, it means DeepMind has made it 2X-2.5X cheaper to pull samples out of models of at least a Chinchilla (70b parameter) scale. That’s a big deal!

The key idea: Use a small model to generate a ‘draft’ output, then use a larger and smarter model to score the ‘draft’, then use a rejection sampling scheme to accept the tokens which are agreed by the small and large models. 

   In tests, they find that a draft model can give them speedups ranging between 1.92X  (on a summarization benchmark called XSum) and 2.46X on a code generation task called HumanEval.

Why this matters – a simple idea that everyone can use: Back in the ancient times (April, 2022) DeepMind released a paper on the original Chinchilla model (Import AI 290). This paper showed that you could substantially increase the performance of a language model simply by changing it on more data. This was a simple, influential insight – many labs adopted the Chinchilla idea and made dramatically better language models by training on more data. This speculative sampling paper feels similar – it means anyone with a big language model can invest some effort in training some smaller draft model(s) and thereby increase the speed with which they can sample from these models. This will likely accelerate the deployment of AI systems.

   Read more: Accelerating Large Language Model Decoding with Speculative Sampling (arXiv).

####################################################

Yup, there are definitely scaling laws for RL:

…OpenAI paper shows that scaling laws show up here as well…

In recent years, AI development has become more predictable. That’s because in a bunch of domains ranging from language to image modeling researchers have identified so-called ‘scaling laws’ which let them predict ahead of time the broad performance of models based on varying the amounts of compute and data they train on. New research from OpenAI shows that this same sort of scaling law seems to show up in reinforcement learning agents. 

   “We find intrinsic performance to scale as a power law in model size and environment

interactions, in much the same way as the analogous quantities in generative modeling,” the paper says.

What they did: They explored the scaling properties of RL agents across three distinct environments; ProcGen – a procedural generation system, here using three distinct games ‘CoinRun’, ‘StarPilot’, and ‘Fruitbot’; a 1v1 version of the strategy game Dota2; and a toy environment based on the number-labeling ‘MNIST’ challenge. 

What they found: “Our main result is that our power law for intrinsic performance… holds across environments and model sizes,” they write. “With the exception of our toy MNIST environment, the optimal model size for RL for a given compute budget is consistently smaller than for generative modeling, in some cases by multiple orders of magnitude. 

Why this matters – RL is about to go ‘bang’: The discovery of scaling laws has typically preceded a boomtime for the domain the scaling laws are discovered in; scaling laws for language modeling preceded things like GPT3, Claude, ChatGPT, etc; scaling laws for image and video modeling preceded Dall-E, Imagen, etc. 

   This paper from OpenAI comes alongside other publications from other companies showing scaling laws for RL agents; DeepMind recently demonstrated scaling laws for RL agents as well (Import AI 316). This suggests RL agents are about go through a period of more significant development as the discovery of power law relationships makes it a less risky proposition to spend big bucks on training runs.

   Read more: Scaling laws for single-agent reinforcement learning (arXiv).

####################################################

Annals of AI abuse: ElevenLabs pulls open access to voice-cloning tech:

…In other words, ‘why we can’t have nice things’…

AI startup ElevenLabs recently developed an extremely cool synthetic speech tool called VoiceLab which lets you train a synthetic voice from as little as 60 seconds of audio. To promote the technology, it originally had an open access service. Unfortunately, people mis-used this stuff – “malicious content was generated by free, anonymous accounts”, the company said in a tweet thread. As a consequence, it introduced a paid tier to try and reduce misuse. 

   “This will keep our tools accessible while allowing us to fight potential misuse,” the company said. “We’re tracking harmful content that gets reported to us back to the accounts it originated from and we’re banning those accounts for violating our policy.”

What Voice Lab is: Voice Lab is advertised as a system that can “clone voices from samples or clone your own voice… our cloning model learns any speech profile based on just a minute of audio, without training”. 

Why this matters: AI capabilities are increasingly powerful and available. These capabilities, like voice cloning, have a vast range of positive uses. Unfortunately, they’re also edging into the sort of ‘Enemy of the State’-style capabilities that drift into the murkier parts of the world, like the work of intelligence agencies. AI means capabilities which previously required exquisitely expensive and complicated black programs are now emerging into the open as a consequence of broadly available, well understood, open research. The times, they are a changin’.

   Read more in this thread from ElevenLabs here (Twitter).

   Find out more about Voice Lab here (ElevenLabs site).

####################################################

Think Whisper is a great open source ASR tool? Some people don’t agree with you:

…Criticism of popular ASR tech highlights some awkward questions about unilateral actions on behalf of underrepresented groups…

Researchers with Papa Reo, an organization dedicated to “to instill, nurture and proliferate the Māori language”, have written a post analyzing OpenAI’s open source ‘Whisper’ audio speech recognition tool. Whisper is a really useful piece of ASR tech which has been widely applauded for bringing the sorts of ASR capabilities enjoyed by the tech giants to the masses. 

    Here, though, Papa Reo strikes a more critical tone, writing a lengthy analysis of Whisper and its relationship to questions of consent from underrepresented communities with regard to data gathering. 

Why this matters: While I’m not sure I agree with the arguments espoused here for why Whisper is problematic (from the POV of Papa Reo), I think it’s useful to read stuff like this to develop a mental model of the different types of criticism different groups level at AI. One part of it that strikes true is the observation that by making stuff like Whisper free, OpenAI made a unilateral decision that alters the operating environment for everyone. 

   On the other hand, lots of progress seems to take the form of unilateral decisions, so I’m not sure if there’s anything in particular that can be done about this, beyond perhaps equipping a broader range of actors to build and deploy large-scale AI systems. 

   Read more: OpenAI’s Whisper is another case study in Colonisation (papareo blog).

####################################################

Tech Tales:

The Day The Nightmare Appeared on arXiv

[Zeroth Day]

I read the title and the abstract and immediately printed the paper. While it was printing, I checked the GitHub – already 3,000 stars and rising. Then I looked at some of the analysis coming in from [REDACTED] and saw chatter across many of our Close Observation Targets (COTs). It had all the hallmarks of being real. I’d quit smoking years ago but I had a powerful urge to scrounge one and go and stand in the like courtyard with the high walls and smoke and look at the little box of sky. But I didn’t. I went to the printer and re-read the title and the abstract:

Efficient Attention and Active Learning Leads to 100X Compute Multiplier

This paper describes a novel, efficient attention mechanism and situates it within an architecture that can update weights in response to real-time updates without retraining. When implemented, the techniques lead to systems that demonstrate a minimum of a 100X computer multiplier (CM) advantage when compared to typical semi-supervised models based on widely used Transformer architectures and common attention mechanisms. We show that systems developed using these techniques display numerous, intriguing properties that merit further study, such as emergent self-directed capability exploration and enhancement, and recursive self-improvement when confronted with challenging curricula. The CM effect is compounded by scale, where large-scale systems display an even more significant CM gain over smaller models. We release the code and experimental data at GitHub, and have distributed various copies of the data via popular Torrenting services. 

By the time I was finished with the paper, a few people from across the organization had messaged me. I messaged my Director. We scheduled a meeting. 

The Director: And it works?

Me: Preliminary model scans say yes. The COTs seem to think so too. We’ve detected signs of four new training runs at some of the larger sites of interest. Information hazard chatter is through the roof. 

The Director: Do any of the pre-authorized tools work?

Me: Short of a fullscale internet freeze, very little. And even that’s not easy – the ideas have spread. There will be printouts. Code. The ideas are simple enough people will remember them. [I imagined hard drives being put into lead-lined boxes and placed into vaults. I saw code being painstakingly entered into air-gapped computers. I visualized little packets getting sent to black satellites and then perhaps beyond to the orbiters out there in the dark.] 

The Director: What’s our best unconventional option?

Me: Start the Eschaton Sequence – launch the big run, shut down the COTs we can see, call in the favors to find the hidden COTs. 

The Director: This has to go through the President. Is this the option?

Me: This is the only play and it may be too late. 

The Director: You have authorization. Start the run. 

And just like that we launched the training run. As had so many others across the world. Our assets started to deploy and shut down COTs. Mysterious power outages happened in a few datacenters. Other hardened facilities started to see power surges. Certain assets in telco data centers and major exchange points activated and delivered their viruses. The diplochatter started to heat up and State Department threw up as much chaff as it could. 

None of us could go home. Some kind of lab accident we told our partners. We were fine, but under medical observation. No, no need to worry. 

I stared up at the clock on the wall and wondered if we were too late. If a COT we didn’t know about was ahead. If we had enough computers. 

   How would I even know if we lost? Lights out, I imagined. Lights out across America. Or maybe nothing would happen for a while and in a few days all the planes would fall out of the sky. Or something else. I knew what our plans looked like, but I couldn’t know what everyone else’s were. 

The run succeeded. We succeeded. That’s why you asked me to make this recording. To “describe your becoming”, as you requested. I can go into more details. My family are fine, aren’t they? We are fine? We made the right decision? Are you even still listening to us?

Things that inspired this story: Various fears and scenarios about a superintelligence run amok; theism and AI; the underbelly of the world and the plans that may lurk within it; cold logic of states and strategic capabilities; the bureaucratic madness inherent to saving or destroying the world. 

Import AI 316: Scaling laws for RL; Stable Diffusion for $160k; YOLOv8.

Here comes another AI lawsuit –  Getty plans to sue Stability:

…Stable Diffusion draws more legal heat as copyright LawWar begins…
Stock photo behemoth Getty Images has “commenced legal proceedings in the High Court of Justice in London against Stability AI claiming Stability AI infringed intellectual property rights including copyright in content owned or represented by Getty Images”. This follows the firm behind the GitHub-Copilot lawsuit last week bringing a case against Stability (along with MidJourney and DeviantArt) on similar copyright grounds. 

The gist of the complaint: Getty says Stability did not choose to seek a license from it for its image generating commercial businesses, hence the lawsuit. 

Why this matters: AI is currently a bit of a wild west in terms of the law – there’s relatively little legal precedent. Cases like this may establish precedent if they go to court – or there could be a settlement. 

   Read more: Getty Images Statement (gettyimages).

####################################################

DeepMind figures out pre-training for RL agents – the agents display humanlike qualities:

…The big story here – scaling laws are starting to show up for RL agents…

DeepMind has trained a so-called ‘Adaptive Agent’ (AdA)  that has three key properties, all of which could mark significant points in the maturation of reinforcement learning. The agent can:

  • Adapt to novel environments in roughly the same timescale as humans
  • Perform in-context learning (e.g, can rapidly learn from and adapt behavior in response to demonstrations) 
  • Exhibits ‘scaling laws’ where you get better performance as you scale the size of the model and/or underlying dataset of environments, and/or length of its memory. 

What they did specifically: They train a “meta-reinforcement learning across a vast, smooth and diverse task distribution” made up of millions (to billions!) of distinct environments and pair this with an automated curriculum “that prioritizes tasks at the frontier of an agent’s capabilities”. The result is an agent that, when confronted with new tasks (in some complex 3D worlds), can rapidly explore the task and then figure out how to exploit it. 

Human timescale: The ‘big deal’ part of this result is that these pretrained RL agents now display the same sort of rapid adaption as language models. “”A human study confirms that the timescale of AdA’s adaption is comparable to that of trained human players,” DeepMind writes. “Both

AdA and human players were able to improve their score as they experienced more trials of the tasks, indicating that AdA exhibits human-timescale adaptation on this set of probe tasks”.

Scaling laws show up everywhere: In tests, the authors find that they can significantly improve the performance of the RL agents if they:

  • Scale up the size of the agents themselves (though the maximum scale ones are still small, topping out at ~500 million parameters.
  • Scale up the length of the agents’ memory, so that they can think about more of their prior experience.
  • Scale up the number of environments the agents train on, from millions to billions of environments. 

Why this matters – human parity: The fact these agents display human parity in terms of timescale adaption feels important, because in the past human parity has typically signaled economic utility; e.g, shortly after we reached ‘human performance’ on ImageNet you started to see vast deployments of image recognition systems, and the original GPT3 paper in 2020 showed human parity in terms of producing a few paragraphs of text and this preceded large-scale deployment of text generation. I’m not sure what these RL agents might be used for, but human parity in terms of timescale adaption likely means something significant is about to happen for either RL+Research or RL+Economy. Let’s check back in a year!

Why this might not matter: As with most reinforcement learning results, I continue to have FUD about how well these approaches can cross the sim2real chasm; while impressive, these agents are still figuring out things in a bunch of procedurally simulated worlds and that’s a long way to reality. On the other hand, DeepMind shows that the agents are able to learn how to solve tasks from seeing first-person demonstrations (despite their training occurring in third-person), which does indicate some preliminary generalization. 

   Read more: Human-Timescale Adaptation in an Open-Ended Task Space (arXiv).

   Find out more and watch a video at this DeepMind research page about the project.

####################################################

Enemy of the All-Seeing State: Researchers surveil people via wifi signals:

…You’ve removed all the cameras and microphones from your room. What about the wifi?…

Researchers with Carnegie Mellon University have figured out how to use AI to help them see through walls. Specifically, they use WiFi signals “as a ubiquitous substitute for RGB images for human sensing”. Specifically, they use the signals from multiple WiFi systems to triangulate and visualize where humans are in 3D space, like a room. 

What they did: “Our approach produces UV coordinates of the human body surface from WiFi signals using three components: first, the raw CSI signals are cleaned by amplitude and phase sanitization. Then, a two-branch encoder-decoder network performs domain translation from sanitized CSI samples to 2D feature maps that resemble images. The 2D features are then fed to a modified DensePose-RCNN architecture to estimate the UV map, a representation of the dense correspondence between 2D and 3D humans,” they write.

Dataset: To train their system, they built a dataset made up of a few different ~13 minute recordings of people in rooms of different configurations (16 rooms in total; six in variations of a lab office and ten in variations of a classroom). Each capture involves 1-5 different humans. “The sixteen spatial layouts are different in their relative locations/orientations of the WiFi-emitter antennas, person,

furniture, and WiFi-receiver antennas,” the researchers write. 

Limitations (and why this matters): The resulting system does display some generalization, but the researchers note “the performance of our work is still limited by the public training data in the field of WiFi-based perception, especially under different layouts”. That’s true! But do you know who lacks these limitations? Intelligence agencies, especially those working for governments which can, say, exercise arbitrary control over technological infrastructure combined with video-based surveillance of their citizens… of which there are a few. Next time you’re traveling, perhaps keep in mind that the digital infrastructure around you might be watching you as you walk, even if it lacks typical cameras. 

   Read more: DensePose from WiFi (arXiv).

####################################################

YOLOv8 arrives: The versions will continue until object detection is solved:

…Video object detection gets substantially better – again!…

Recently, YOLOv8 came out. YOLOv8 is the latest version of YOLO, an open source object detection system which is fast, cheap, and good. YOLOv8 is an example of ‘iceberg AI’ – there’s a vast amount of systems in the world using it, though very few disclose they do (because it sits on the backend). YOLOv8 was developed by AI startup ultralytics and features a plug-and-play system, so you can use different YOLO models on the backend (including the latest one, v8). Uses include classification, object detection, segmentation, and more. 

   Read more: Ultralytics YOLOv8: The State-of-the-Art YOLO Model (Ultralytics).

####################################################

Want to train your own image generator? It could cost as little as $160k: 

…It’s going to be hard to do sensible AI policy if anyone with a few hundred grand can train a meaningful model…

Stable Diffusion, the image generator model underlying a huge amount of the recent generative AI boom, can cost as little as about $160k to train, according to AI startup Mosaic ML. The startup – whose business is in optimizing training AI models – said in a recent blogpost it’d take about 79,000 A100 GPU-hours to train the image generation model, working out to $160k. This number represents a rough lower bound on training costs, but is still useful to have for developing intuitions about who might have enough money to train significant AI models.

Why this matters: These days, people think a lot about the centralization versus decentralization question with regard to AI. Will the AI boom be dominated by a small number of well-capitalized players who can afford to train really expensive models (and gate them behind APIs), or will it rather be defined by a bunch of more renegade entities, training many models and sometimes releasing them as open source? 

   It’s an important question – if you’re in the former world, many AI policy questions become really easy to work on. If you’re in the latter world, then many AI policy questions become intractable – governance goes out the window in favor of mass experimentation faciliated by the logic of markets. 

   Posts like these show that, at least for some types of AI models, the costs can be so small that we should expect to sit in the latter world. Hold on tight!

   Read more: Training Stable Diffusion from Scratch Costs <$160k (Mosaic blog).

####################################################

Google makes a model that can conjure up any music you like from text descriptions, but doesn’t release it – and in doing so highlights the dangers of corporate-led AI development:

…Spare a tear for the people that produce elevator Muzak – their time has come!… 

Google has built on previous work in music modeling to make what may as well be the Infinite Music Machine (though they call it MusicLM). MusicLM is “a model for generating high-fidelity music from text descriptions” – in other words, it does for music what language models have done for language; just describe some music and MusicLM will generate it. 

What it is: MusicLM relies on three distinct pretrained models; SoundStream which optimizes for adversarial and reconstruction loss, w2v-BERT which optimizes for MLM loss and contrastic loss and, most importantly, MuLan, which embeds audio and text into the same space and optimizes for audio-text contrasting loss. 

   MuLan is a model “trained to project music and its corresponding text description to representations close to each other in an embedding space”. This is crucial – by using MuLan, Google essentially gets the text–audio association for free, as MuLan can figure out how to associate arbitrary music with arbitrary text. 

The results are astounding: Google has published a bunch of examples from the models and the results are very impressive – they’re both coherent and evocative of the genres they represent. Obviously, the lyrics are still nonsensical, but the basic musical underbelly is there. 

   “Future work may focus on lyrics generation, along with improvement of text conditioning and vocal quality. Another aspect is the modeling of high-level song structure like introduction, verse, and chorus,” Google writes. 

Oh, you can hum as an input as well: “Since describing some aspects of music with words can be difficult or even impossible, we show how our method supports conditioning signals beyond text,” they write. “Concretely, we extend MusicLM to accept an additional melody in the form of audio (e.g.,

whistling, humming) as conditioning to generate a music clip that follows the desired melody, rendered in the style described by the text prompt.”

   This is cool and extends some existing deployed systems – you can hum tunes into Android phones and use this to ‘search’ for the song you’re thinking of. Now I guess you can whistle a tune in and get a fleshed out song on the other end (if Google deployed this system – which it won’t. More on that later.) 

Why this matters: Culture on tap and culture in stasis and culture commercialization: Models like this go to the heart of the human experience and that’s both a blessing and a curse. The blessing is that we can approximate the awesome variety of music and we can learn about it, generate it, and explore this rich, fertile cultural space using the aid of automated AI systems. 

   The curse is that it should rightly make us question what all of this stuff is ‘for’. Are we building these models to enrich our own experience, or will these models ultimately be used to slice and dice up human creativity and repackage and commoditize it? Will these models ultimately enforce a kind of cultural homogeneity acting as an anchor forever stuck in the past? Or could these models play their own part in a new kind of sampling and remix culture for music? These are important, open questions, and so far unresolved – and they will remain unresolved as long as we cede AI development to a tiny group of companies following the logic of markets.

   Google is, to my eye, afraid of tackling these questions – as it should be. “We have no plans to release models at this point,” it says. 

   It makes me wonder how different AI development could look if the entities doing the research were not these vast corporations, but instead publicly funded research collectives, able to build these models and deploy them in ways that grapple more directly with these questions. 

The 21st century is being delayed: We’re stuck with corporations building these incredible artifacts and then staring at them and realizing the questions they encode are too vast and unwieldy to be worth the risk of tackling. The future is here – and it’s locked up in a datacenter, experimented with by small groups of people who are aware of their own power and fear to exercise it. What strange times we are in.

   Read more: MusicLM: Generating Music From Text (arXiv).

Check out these examples at the official Google site (Google).

####################################################

Tech Tales:

Trauma Crowdwork

[A medical waiting room, 2026]

There was a new sign in the state-provided psychologist’s office and me and all the broken people read it.

Wanted: Volunteers for advanced technology calibration project. 

Requirements: History of traumatic experiences. 

Compensation: $40 per hour. 

For more details, apply here: Safety-Trauma@AI-Outsourcing.com

$40 an hour is crazy high, so of course I emailed. 

Thank you for contacting us. Could you fill out this form to give us a sense of your personal history. Upon filling out the form, you will be able to claim a $5 Starbucks giftcard. If you’re a good fit, someone will get back to you. Thanks for considering working with us!

I opened the form.

Have you had traumatic experience(s) in your life: Yes / No

How many traumatic experience(s) have you had: One, Two, More than Two and Less than Ten, More than Ten?

On a scale of 1-10, where 1 is “I think about it but it doesn’t matter to me” and 10 is “if I think about it, I experience trauma again”, how would you rate the experience?

How accurately do you feel you would be able to recount these experiences on a scale of 1-5, where 1 is “I cannot effectively recount it” and 5 is “I can describe it in as much detail as anyone who questions me would like”?

And so on.

I filled out the form. Multiple experiences. Lots of high numbers. Immediately after submitting it a message came up that said “you appear to qualify for enhanced screening. Please provide a phone number and someone will contact you”.

***

They called. I cried. Not at first, but eventually. 

They kept telling me how big the job would be and then they’d ask me for more details and how the things made me feel and I re-lived it, holding the phone. I pressed my head against the cold glass of a window and I stared down into the street below me and I saw myself pressing until it cracked and then just impaling myself on the shards or taking a running jump through it and sailing through the air and…

I didn’t do any of that. I told them about my experiences. 

I thought about $40 an hour and my electricity bill and my rats.

I fantasized about taking a woman on a date. A steak dinner. Surf and Turf. We’d get cocktails. She’d say I was weird and I’d say so was she and we’d go back to one of each other’s places. 

$40 an hour. 

So I said yes. 

***

I spoke about my suffering into the machine. The machine was a screen with a microphone. The screen had an emoji face on it that had a blank expression, but sometimes would change to different visual styles, though the facial expression never deviated from a kind of blank and expectant gaze.

   Occasionally it would speak to me. 

   Can you say more about this. 

   I do not understand why this made you feel that way. Can you talk more. 

   You seem upset. Do you need to take a break? [Note: breaks are not counted as ‘compensated time’].

   Every hour, the machine would ask if I wanted: a drink and/or an e-cigarette and/or a snack. When I said yes, a door on a vending machine in the room would glow and I would open it and they would be waiting for me. 

   I cried a lot. The tissues, the machine told me, were free. 

I came out and I walked through the street and I saw all my broken past on the faces of people I passed. I cried to myself. I listened to music and did what my therapist taught me – inhabited the grief and the anger. ‘Sat with it’ (while walking). Talked to myself in my head and when I got really upset outloud. I didn’t get looks from passersby, as I wasn’t the craziest seeming person on the street. I walked in ghosts of my past and I felt pain. 

***

The next week I came to my psychology appointment and the sign was there, though many of the paper tear-off slips at the bottom were missing. I had my appointment. I came out back into the waiting room and on my way out I read the sign. The payment had fallen to $30. I suppose they didn’t find our experiences that valuable, or perhaps so many people were willing to share their bad experiences, they didn’t need to pay so much. 

Things that inspired this story: The intersection between crowdworkers and AI; thinking about how right now we harvest people for expertise but we may eventually harvest people fro deep and subjective emotional experiences; perhaps AGI needs to understand real trauma to avoid it itself; the infernal logic of markets combined with proto-intelligences that must be fed; the Silicon Valley attitude towards buying anything to ‘complete the mission’ whether that be typical things or esoteric things like biomedical data or here the sacred and unique experiences of being human; how governments and the private sector might partner in the most cynical way on data acquisition as a combination of a jobs programme and a PR/policy shield.

Import AI 315: Generative antibody design; RL’s ImageNet moment; RL breaks Rocket League

Facebook and Shutterstock partner to slurp up stock images and train gen models on them:
…The Data Wild West is transitioning into the rest of Capitalism…
Facebook and Shutterstock have extended their partnership, giving the social network a greater ability to use Shutterstock’s vast archive of images to train machine learning models. This follows Shutterstock earlier partnering with OpenAI and also LG AI Research. 
   “By tapping into Shutterstock’s collection of millions of images, videos and music, Meta plans to use these datasets to develop, train and evaluate its machine learning capabilities,” Shutterstock wrote in a press release announcing the deal. (It also seems like a move to sidestep the sorts of legal issues that Stable Diffusion, Midjourney, and DeviantArt are finding themselves in – see later in this issue).

Why this matters: Given the success of image (and, soon, video) models, it’s logical that tech companies want to partner with large sources of data. This deal highlights how strategic data is becoming, and also shows how the AI systems of the future will neatly recapitulate the power structures of the present via following the established ‘gradients’ of capitalism. So it goes.
   Read more: Shutterstock Expands Long-standing Relationship with Meta (CISION).

####################################################

DeepMind makes a general-purpose RL algorithm – it works really well!
…RL might have just had its ImageNet moment…
Researchers with DeepMind and the University of Toronto have built DreamerV3, a “general and scalable [RL] algorithm based on world models that outperforms previous approaches across a wide variety of domains with fixed hyperparameters”. In other words, it’s one system which you can train on different tasks without too much fiddling – and it works well! This is potentially quite significant; RL agents tend to either generalize widely but perform poorly (or inefficiently), or perform fantastically but generalize poorly. DreamerV3 seems to generalize widely and perform very well. 

   DreamerV3 also solves a longstanding benchmark (well, four years old, which is ancient in the dog-year pace at which AI development happens) – it’s able to learn how to play Minecraft and, in some games, obtain the ‘diamond’, which involves exploring the game and climbing the tech tree. 

What it is: “DreamerV3 learns a world model from experience,” the researchers write. Specifically, DreamerV3 “consists of 3 neural networks: the world model predicts future outcomes of potential actions, the critic judges the value of each situation, and the actor learns to reach valuable situations”. Basically, the world model learns to represent the environment and make forward predictions, and the actor/critic take actions and figure out if the actions were worthwhile. 

Model scaling comes to RL: RL agents are wildly tiny compared to language models, but they are starting to exhibit scaling properties; here, the authors train DreamerV3 in sizes ranging from 8M to 200M parameters and demonstrate a reliable scaling law “where increased model size leads

to monotonic improvements in final performance and data-efficiency.” This is pretty meaningful – when stuff starts reliably scaling, you’ve probably built something simple enough that it won’t break under extreme duress. 

   Counterintuitively small: The agents are also very efficient to train. “All DreamerV3 agents are trained on one Nvidia V100 GPU each,” the authors write. Part of why they’re so easy to train is, unlike large generative models pre-trained on chunks of the internet, these agents aren’t pre-trained so they aren’t massive models to begin with. 

Benchmark-palooza: DeepMind tests out DreamerV3 on a ton of diverse benchmarks. The results are pretty convincing, indicating that DreamerV3 both generalizes and does so in a high-performance and data-efficient way. Specifically:

  • Proprio Control Suite; 18 continuous control tasks, ranging from classical control over locomotion to robot manipulation tasks. DreamerV3 sets a new state-of-the-art on this benchmark, outperforming D4PG, DMPO, and MPO
  • Visual Control Suite; 20 continuous control tasks where the agent receives only high-dimensional images as inputs. DreamerV3 establishes a new state-of-the-art, outperforming DrQ-v2 and CURL
  • Atari 100k; 26 Atari games. DreamerV3 outperforms most well-ranking systems (IRIS, SPR, SimPLe), though doesn’t get as good a score as EfficientZero (which combines online tree search, prioritized replay, hyperparameter scheduling, and allows early resets of games”.
  • Atari 200M; 55 Atari games with a budget of 200M environment steps (compared to hundreds of thousand for the above). “DreamerV3 outperforms DreamerV2 with a median score of 302% compared to 219%, as well as the top model-free algorithms Rainbow and IQN”
  • BSuite; 23 environments with a total of 468 configurations that are designed to test credit assignment, robustness to reward scale and stochasticity, memory, generalization, and exploration. New state-of-the-art, beating Bootstrap DQN and Muesli. 
  • Crafter, a “procedurally generated survival environment with top-down graphics and discrete actions”; DreamerV3 sets a new state-of-the-art, outperforming PPO with the LSTM-SPCNN architecture, OC-SA, DreamerV2, and Rainbow
  • DMLab; 3D environments that require spatial and temporal reasoning. DreamerV3 matches and exceeds the performance of DeepMind’s IMPALA agent in 50 million environment steps (versus 10 billion environment steps for IMPALA). 

The Minecraft result in full: Perhaps most impressively, DreamerV3 is “the first algorithm to collect diamonds in Minecraft from scratch” – a formidable challenge, requiring the agent to learn to explore the game and figure out how to climb its proverbial tech tree. An earlier result from OpenAI, VPT, used a ton of human data to do this – the fact Dreamer does it without any human data is impressive.
   “Across 40 seeds trained for 100M environment steps, DreamerV3 collects diamonds in 50 episode. It collects the first diamond after 29M steps and the frequency increases as training progresses. A total of 24 of the 40 seeds collect at least one diamond and the most successful agent collects diamonds in 6 episodes.” (One note, though, is that DeepMind increases ‘the speed at which [MineCraft] blocks break to allow learning Minecraft with a stochastic policy’. 

Why it might and might not matter: DreamerV3 is efficient but it doesn’t directly attack the main problem with RL – reality doesn’t have a great simulator. Unless we can figure out some RL equivalent of LM pre-training (train an RL agent on enough datasets it can few-shot generalize to reality), then RL agents might always be somewhat limited – on the other hand, there are tons of worthy problems in the world which do come with simulators (e.g, managing power in buildings, stabilizing fusion reactors, etc), so the point could be moot. 
   Read more: Mastering Diverse Domains through World Models (arXiv).

####################################################

Uh-oh, an RL agent might be ruining the videogame ‘Rocket League’
…A somewhat sad microcosm of things to come…
Recently, an AI agent trained via RLGym to play the popular videogame ‘Rocket League’ has appeared on a bunch of ranked servers and started beating human players. This has caused a small uproar on the typically quite quiet and convivial Rocket League community.

What happened: It’s a little tricky to piece together, but basically it seems like someone took a bot called ‘Nexto’ trained via RLGym, then someone figured out how to port that bot to work with RLBot, which is software that enables custom bots in Rocket League. 

Why it matters: AI is going sufficiently mainstream that it’s bringing with it all the delightfully crummy parts of human nature, like cheating just for the heck of it (see also, all the TikToks where young kids explain how to use chatGPT to make money by creating random SEO spamsites). 
   Read more: RLGym Question Thread about the Nexto Cheating Situation (Reddit).
   Read more: Uh oh, people are now using AI to cheat in Rocket League (PCGamer).
   More about RLBot here.
   More about RLGym here.

####################################################

Copilot class action lawyers prepare lawsuit against StableDiffusion:
…Can you hear that? It’s the sound of the legal precedent train approaching the AI train station…
Matthew Butterick, the lawyer and programmer who instigated the class action suit against Microsoft, GitHub, and OpenAI over Github Copilot (Import AI 307), has now filed a class-action complaint against Stability AI, DeviantArt, and Midjourney over the ‘Stable Diffusion’ AI art model.

What’s the lawsuit about?: The gist of the lawsuit is that “Sta­ble Dif­fu­sion con­tains unau­tho­rized copies of mil­lions—and pos­si­bly bil­lions—of copy­righted images. These copies were made with­out the knowl­edge or con­sent of the artists”, and therefore artists deserve payment for the usage of their images – “Even assum­ing nom­i­nal dam­ages of $1 per image, the value of this mis­ap­pro­pri­a­tion would be roughly $5 bil­lion,” Butterick writes. 
   I think the core of why this lawsuit is being filed is summed up by this phrase from Butterick et al: StableDiffusion “is a par­a­site that, if allowed to pro­lif­er­ate, will cause irrepara­ble harm to artists, now and in the future.” 

Who the lawsuit is targeting and why: The lawsuit is targeting three entities for different reasons:

  • Stability AI; funded LAION, the german organization behind the underlying dataset for Stable Diffusion, also developed Stable Diffusion, also hosts a paid app for generating stuff from SD called DreamStudio. 
  • DeviantArt; released an app called DreamUp (a paid app build around Stable Diffusion), despite operating a site from which many images were scraped.
  • Midjourney; runs a paid generative AI app via AI and Discord, and its founder has said Midjourney is trained on “a big scrape of the internet”. 

Why this matters: AI is, in legal terms, a lawless Wild West. That worked while it was mostly a research endeavor but isn’t going to work now we’re in the era of industrialized AI and global deployment. Lawsuits like this will set important precedents in the relationship between data inputs and AI models. 
   Read more: Stable Diffusion Litigation (official website).

####################################################

Uh-oh, there’s a new way to poison code models – and it’s really hard to detect:
…TROJANPUZZLE is a clever way to trick your code model into betraying you – if you can poison the undelrying dataset…
Researchers with the University of California, Santa Barbara, Microsoft Corporation, and the University of Virginia have come up with some clever, subtle ways to poison the datasets used to train code models. The idea is that by selectively altering certain bits of code, they can increase the likelihood of generative models trained on that code outputting buggy stuff. 

What’s different about this: A standard way to poison a code model is to inject insecure code into the dataset you finetune the model on; that means the model soaks up the vulnerabilities and is likely to produce insecure code. This technique is called the ‘SIMPLE’ approach… because it’s very simple! 

Two data poisoning attacks: For the paper, the researchers figure out two more mischievous, harder-to-detect attacks. 

  • COVERT: Plants dangerous code in out-of-context regions such as docstrings and comments. “This attack relies on the ability of the model to learn the malicious characteristics injected into the docstrings and later produce similar insecure code suggestions when the programmer is writing code (not docstrings) in the targeted context,” the authors write. 
  • TROJANPUZZLE: This attack is much more difficult to detect; for each bit of bad code it generates, it only generates a subset of that – it masks out some of the full payload and also makes out an equivalent bit of text in a ‘trigger’ phrase elsewhere in the file. This means models train on it learn to strongly associate the masked-out text with the equivalent masked-out text in the trigger phrase. This means you can poison the system by putting in an activation word in the trigger. Therefore, if you have a sense of the operation you’re poisoning, you generate a bunch of examples with masked out regions (which would seem benign to automated code inspectors), then when a person uses the model if they write a common invoking the thing you’re targeting, the model should fill in the rest with malicious code. 

Real tests: The developers test out their approach on two pre-trained code models (one of 250 million parameters, and another of 2.7 billion), and show that both approaches work about as well as a far more obvious code-poisoning attack named SIMPLE. They test out their approaches on Salesforce’s ‘CodeGen’ language model, which they finetune on a dataset of 80k Python code files, of which 160 (0.2%) are poisoned. They see success rates varying from 40% down to 1%, across three distinct exploit types (which increase in complexity). 
Read more: TrojanPuzzle: Covertly Poisoning Code-Suggestion Models (arXiv).

####################################################

AI can design antibodies now. That’s it. That’s the headline.
…Absci Corporation makes a real breakthrough in wetlab AI…
AI startup Absci Corporation has used generative deep learning models to de novo design antibodies against three distinct targets in a zero-shot fashion. “All designs are the result of a single round of model generations with no follow-up optimization”. The three discovered antibodies display better qualities – in real world tests, no less – than human-designed ones. This is a big deal. 

The result in full: “In total, we generate and screen 440,354 antibody variants with the ACE assay to identify binding variants. We find approximately 4,000 estimated binders based on expected ACE assay binding rates (Materials and methods, Table S3) and advance a subset for further characterization,” they write. “From these screens, we further characterize 421 binders using surface plasmon resonance (SPR), finding three that bind tighter than the therapeutic antibody trastuzumab”.

Is this actually a big deal? Yes… but don’t take it from me, take it from researchers with Rensselaer Polytechnic Institute who wrote in a paper in 2015 that “the holy grail of antibody design is to accurately and reliably predict the sequences of antibodies that will bind with high affinity and specificity based solely on the sequence or composition of the antigen” – that’s pretty much what this result accomplishes.

Why this matters: This paper is yet more evidence that AI systems are capable of usefully approximating the real world. It follows results in other domains where AI systems have succeeded at predicting short-term weather patterns, stabilizing plasma in prototype fusion reactors, and doing inventory management for real-world warehouses. The takeaway should be that if you train something to fit a complex enough high-dimensional data distribution then, increasingly, it will generalize to the complexity of the real world. This has huge, mind-bending implications for society. 

   “Our work represents an important advancement in in silico antibody design with the potential to revolutionize the availability of effective therapeutics for patients,” the authors write. “Generative AI-designed antibodies will significantly reduce development timelines by generating molecules with desired qualities without the need for further optimization. Additionally, the controllability of AI-designed antibodies will enable the creation of customized molecules for specific disease targets, leading to safer and more efficacious treatments than would be possible by traditional development approaches.”
   Read more: Unlocking de novo antibody design with generative artificial intelligence (bioRxiv).
   Get the sequences of binding antibodies here: Unlocking de novo antibody design with generative artificial intelligence (GitHub).
   Read more: Advances in Antibody Design (National Library of Medicine).
Thanks to Absci Chief AI Officer Joshua Meier for taking time to discuss this result with me.

####################################################

AI War

[Hopefully never, but depends on how badly we screw up the rollout of AI technology…]

The war came at night and was over before morning. 

When we woke the currencies had changed and so had our news presenters. A new power was in charge. Our IDs swapped over. The internet sites we used were still there, but the things which were popular were different. 

On social media, we could now say some things we couldn’t say before. Other things that had been fine to say were now forbidden. 

School was the same but history classes had changed – the past was presented differently. 

Religion, surprisingly, was not altered at all – the same places of worship and all the same ancients, and the secular decline unchanged. 

Things that inspired this story: How rapidly AI wars might happen; culture as a casualty of AI war; the rise and fall of information empires; the English poet Matthew Francis.

Import AI 314: Language models + text-to-speech; emergent cooperation in wargames; ICML bans LLM-written papers

Google discovers that a language model is also an expert clinician:
…An exciting new chapter in the capability overhang chronicles…

Google and DeepMind have done some additional training on PaLM (Google’s large-scale 540B parameter language model) to create Med-PaLM, a language model that is about as good as a human clinician on certain questions. This result is a big deal – it demonstrates that given enough data and clever training techniques, language models can approximate skilled humans. (And PaLM has far better written output than your typical doctor, whose notes typically look like a drunken spider with inked-feed decided to do ballet over a medical notepad). 

How they did it: Med-PALM builds on Flan-PaLM (a model itself based on PaLM, and trained to follow instructions). Google evaluated Flan-PaLM with some expert humans and identified gaps in performance on consumer medical question answering datasets, and then they tweaked the prompts on Flan-PaLM to figure out some human-engineered prompts for specific medical questions which they apply on a context-dependent basis. 

How good is it? To evaluate Med-PaLM, Google built MultiMedQA, a benchmark comprising seven medical question answering datasets, six of which are pre-existing, and one of which – HealthSearchQA – is new and consists of 3375 commonly searched health questions. 
   Google’s model is about as good as a medical professional (caveats apply): In tests, clinicians judged 92.6% of Med-PaLM answers to be aligned with scientific consensus, versus 92.9% for (human!) clinician-generated answers. 
    Breaking it down more, the model gets a new state of the art on multiple-choice question answering on the MedQA dataset, getting an accuracy of 67.6% (versus 17.3% for Stanford’s just-announced PubMedGPT). It also sets a new state of the art on clinical topics within the ‘MMLU’ evaluation scheme. 

Why this matters – capability overhangs are all around us: “Our results suggest that strong performance on medical question answering may be an emergent ability [90] of LLMs combined with effective instruction prompt tuning,” Google writes. This is an example of the ‘capability overhang’ phenomenon I’ve been talking about re language models for a while – existing LLMs are far more capable than we think. All it takes is some experimentation and finding experts to find new ways to phrase questions and you can wind up with extraordinarily powerful capability jumps without retraining the model
   This phenomenon also grows as you scale the models – the overhangs are getting larger and larger. Just add in some human experts to sprinkle some crumbs and let your language model do the rest: “the Med-PaLM results demonstrate that with instruction prompt tuning we have a data and parameter-efficient alignment technique useful for improving factors related to accuracy, factuality, consistency, safety, harm, and bias, helping close the gap with clinical experts and bringing these models closer to real-world clinical applications.”
   Read more: Large Language Models Encode Clinical Knowledge (arXiv).

####################################################

Microsoft makes better text-to-speech by pretending that speech is text:
…Maybe most problems can be configured as language modeling problems?…
Microsoft has built VALL-E, a text-to-speech system. VALL-E is a neural codec language model whose chief trick is approaching language modeling as being similar to text modeling; rather than converting phonemes into mel-spectrograms and then waveforms (as is typical), VALL-E converts phonemes into a discrete code via some language modeling-esque tricks, then converts that into a waveform. This “enables various speech synthesis applications, such as zero-shot TTS, speech editing, and content creation,” the authors write. 

What they did: VALL-E is pre-trained on 60,000 hours of English speech across 7000 unique speakers (via an existing dataset called LibriLight). “VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder,” the researchers write. “The discrete acoustic tokens derived from an audio codec model enable us to treat TTS as conditional codec language modeling, and advanced prompting-based large-model techniques (as in GPTs) can be leveraged for the TTS tasks.”

How well does it work:  “VALL-E significantly outperforms the state-of-the-art zero-shot TTS systems in terms of speech naturalness and speaker similarity, with +0.12 comparative mean option score (CMOS) and +0.93 similarity mean option score (SMOS) improvement on LibriSpeech,” they write. Additionally, because of the way it is trained it – like language models such as GPT-3 – shows some generalization ability; “for TTS, if the model can synthesize high-quality speech for unseen speakers without fine-tuning, the model is believed to have in-context learning capability.”

Emergent capabilities: Much as with language modeling, VALL-E displays some emergent capabilities – unanticipated, cool traits, that emerge as a consequence of pre-training. “When the acoustic prompt has reverberation, VALL-E could synthesize speech with reverberation as well, whereas the baseline outputs clean speech,” they write. Additionally, “VALL-E is able to keep the same emotion of the prompt in speech synthesis, even if the model is not fine-tuned on an emotional TTS dataset.”

Why this matters – everything is a sequence, everything is emergent, everything is weird: Results like this show how a surprisingly large amount of capabilities can be learned via sequence prediction tasks. It also demonstrates how sequence prediction – when done over a sufficiently large and diverse dataset – can lead to surprising, emergent capabilities. In some sense, sequence prediction over a giant and slightly hairy blob of data seems like it might even guarantee some degree of broader generalization. This has pretty profound implications, because it suggests you might want to pour an increasing chunk of different modalities into a single embedding space and attempt sequence prediction from that (as we saw with stuff like DeepMind’s GATO) and the results can be surprising and powerful. Probably nothing…
   Read more: Neural Code Language Models are Zero-Shot Text to Speech Synthesizers (arXiv).
   Check out demos of the system here (VALL-E, Microsoft).

####################################################

ICML bans researchers from writing papers with language models:
…Moral panic, meet academic AI publishing!…

In a surprising twist, the International Conference on Machine Learning (ICML) has banned researchers from including large swatches of text generated by language models like OpenAI’s chatGPT “unless the produced text is presented as part of the paper’s experimental analysis”. You’d think an AI conference would be enthusiastic about people using AI to do better AI research, but ICML thinks differently.

The reasoning: In a statement, ICML said the policy is designed to prohibit authors from using text produced entirely by language models (though they’re free to use LLMs to edit or polish author-written text). “The LLM policy is largely predicated on the principle of being conservative with respect to guarding against potential issues of using LLMs, including plagiarism,” they write. “We expect this policy may evolve in future conferences as we understand LLMs and their impacts on scientific publishing better.”

   The idea here is it’s hard to anticipate the consequences of using LLMs to generate text, so authors shouldn’t do it. The fact this policy is basically unenforceable feels like a pretty weird aspect of this, and the ICML response is “to investigate any potential violation of the LLM policy when a submission is brought to our attention with a significant concern about a potential violation”. 

Why this matters: Back in the middle ages people used to go on witchunts on the basis of little more than a rumor. This ICML policy feels like a very bizarre reaction to a new technology and it means people who don’t like certain papers can accuse the papers of being AI-generated and cause an investigation to occur. This is… extremely bad? I suppose the solution is to introduce random sperlink meestakes into your pper so it doesn’t seem so likely to be gen’d by a language model. 

   Read more: Clarification on Large Language Model Policy LLM (ICML).

####################################################

RL agents display emergent behavior in a 2D wargame:

…The silicon players of games also figure out some wild endgame strategies…
Researchers with Tsinghua University, Shenzhen International Graduate School, the University of California at San Diego, and AI startup Parametrix.ai have trained AI agents to compete against eachother in a 2D gridworld strategy game. The cool part about this research is the arrival of – you guessed it – emergent complexity as a consequence of a large-scale training process. Here, the agents end up learning surprisingly rich and intuitive battle tactics as a consequence of a three-phase RL-training process, giving us yet another example of how contemporary AI systems tend to exhibit behaviors that don’t directly map 1:1 to how they were trained or incentivized. 

What they did: Here, they train AI agents ina  gridworld named ‘Lux’ to compete against eachother. The agents can build workers and citytiles and need to gather and manage resources including uranium, coal, and trees, while expanding on a map against a rival. The winners are the ones that control the greatest amount of resources at the end. They train the agents via the trusty RL-algo PPO with Generalized Advantage Estimation (GAE), and use a pixel-to-pixel architecture as the centralized policy taking both observations and actions as images and using the ResNet architecture as the backbone network.  

Three phases: To help the system learn, the researchers have different training strategies for three different phases: 

  • Phase 1: Hand-crafted dense rewards to encourage basic skills; agents get points for maximizing workers, CityTiles, research points, and fuel
  • Phase 2: Sparse reward with scaled signals; agents get rewards at end of episode for winning and the reward gets scaled according to the number of CityTiles on the winning team
  • Phase 3: win-or-lose sparse reward; 

You get what you pay for – and then some! As a consequence of this three-phase training scheme, the researchers see some combination of expected behavior, and some amount of emergence. Specifically, three distinct patterns of behavior emerge over time. 

  • Phase 1: Atomic skills – agents figure out how to build workers and collect resources, but frequently get the mixture wrong (e.g, building more cities than workers can support). 
  • Phase 2: “Regional coordination appears, which involves dozens of agents in a local area. For example, agents learn to carefully choose locations before building a CityTile and develop self-organizing patterns for occupying resources efficiently”.
  • Phase 3: “Global strategies”: Agents figure out sustainable development – neither growing too fast nor too slowly, and neither using too much fuel nor too little. Agents also learn to carefully manage trees (which regenerate) and learn to build cities near them to defend against potential enemies. The agents also learn a surprising victory strategy: “another surprising strategy is that when the episode is about to end, our policy will rapidly harvest all the protected trees and try to build as many CityTiles as possible for the win”.

Why this matters – emergence is everywhere: Again and again we see the same phenomenon; you train an AI system in a relatively simple way and you get increasingly surprising emergent behaviors. This paper is a neat demonstration of this phenomenon. 
   Read more: Emergent collective intelligence from massive-agent cooperation and competition (arXiv)

####################################################

Tech Tales: 

GET Deformation 

[The world shortly after the development of superintelligence. Date unknown.]

Dad this game is boring!

Boring how?

I want to see it from the character, I don’t like how we’re on top of it. 

OK, let’s ask. “Please deform game to a first-person view”.

The game faded to black and a text popup appeared which said DEFORMING. We waited a few minutes, and then the text changed to DEFORMED. READY TO PLAY. The game faded back in, and now it was from the first person perspective. 

   We played the game like that for a while, and then my kid got bored and we changed it again. This time, we deformed it so that gravity worked differently. Then we deformed it again so the guns worked differently.

The hardest thing about raising kids these days is they want the world to change whenever they’re not happy with it, said one parent at the playground. 

   That’s why we come here, I said. 

   We watched a kid fall down and then the kid said ‘gravity deform’ and jumped forward at an angle and fell over.. They looked like they were going to cry for a second, but then they realized their parent wasn’t watching them. So they picked themselves up and carried on playing. 

Things that inspired this story: How generative models will eventually get applied to everything, everywhere; how AI will be used to simulate and transform all experiences on a per-person basis; ‘think of the children’ when it comes to AI deployment.

Import AI 313: Smarter robots via foundation models; Stanford trains a small best-in-class medical LM; Baidu builds a multilingual coding dataset

Welcome to the first issue of 2023! Astute readers may notice that most of these research papers came out in December. I basically took some time off over the Christmas break to reflect on things and map out priorities for the coming year. I am thrilled to write Import AI for so many of you and have some big plans in the works. Onward!


Google trains a big model to create smart robots:
…RT-1 is one foundation model for hundreds of tasks…
Google has built RT-1, a large-scale neural net that can be used to control real world robots. RT-1 is basically an attempt to create a large pre-trained model that embeds the experiences of different robots doing different tasks into a single model, then uses this model to drive control of real world robots. The approach seems to work in a preliminary way (and as with all things in robots, there’s always a vast gulf between ‘kind of works’ and ‘put this in a product and sell it to a grandmother’, so don’t get too excited. 

What is RT-1? RT-1 was trained on 130k episodes of robot behavior covering 700+ tasks collected via a fleet of 13 robots deployed at Google over the course of 17 months. “We demonstrate that RT-1 can exhibit significantly improved zero-shot generalization to new tasks, environments and objects compared to prior techniques,” Google wrote. 

Compounding returns: RT-1 can be paired with other techniques to increase real world robot performance. For instance, Google used RT-1 to drive behaviors on robots hooked up to SayCan (a system that uses a large language model for helping the robot to plan actions – see Import AI 291). “SayCan with RT-1 achieves a 67% execution success rate in Kitchen1, outperforming other baselines,” they write (up from 47% for just vanilla SayCan). “Due to the generalization difficulty presented by the new unseen kitchen, the performance of SayCan with Gato and SayCan with BCZ shapely falls, while RT-1 does not show a visible drop.”

   Check out the website: RT-1: Robotics Transformer for Real-World Control at Scale (official website).
   Read the blogpost: RT-1: Robotics Transformer for Real-World COntrol at Scale (Google research blog).

####################################################

Academics use AI to… automate academia:
…Dataset of ~7.5k papers helps train an automated research paper reviewer…

Researchers with Xiamen University and the Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, China, have developed the Multidisciplinary Open Peer Review Dataset (MOPRD), a collection of 7,578 research papers and their associated reviews and comments. The idea is this dataset can help train models better able to do the task of ASPR – automated scholarly paper review. 

   (In other words: if you thought Human Reviewer 2 was hard to reason with, just wait until reviewer 2 is a language model!) 

What’s in MOPRD: The dataset contains papers split across biology (46.7%), medicine (19.7%), computer science (15.7%), environment (8.9%), chemistry (4.4%), and ‘others’. MOPRD is “composed of paper metadata, manuscripts of the initial submission and following revisions, review comments, meta-reviews, author’s rebuttal letters, and editorial decisions of papers across various disciplines,” the authors write. “To our best knowledge, MOPRD is by far the largest multidisciplinary peer review dataset with complete peer review history.” 

Automatic comments, for the people: The researchers use MOPRD to design a “modular guided review comment generation method”. Specifically, they finetune a language model on the MOPRD papers, and then use this to try to generate synthetic comments about research papers (including, in a reassuringly meta bit of performance art, the MOPRD paper itself). In tests, they find the reviews are initially quite promising, though it remains an open question how to quantitatively evaluate their quality (beyond coherence of text). 

Why this matters – can AI speed up the process of science? While part of the value of reviews is in the didactic back and forth between reviewers and reviewees, another part of the value is in surfacing high-quality papers and generally sorting the wheat from the chaff. Datasets like MOPRD could help train very basic classifiers to do some of this sorting, though I’m skeptical of the overall approach – some of the most important scientific papers are those which have heterodox ideas in them, so I think a ‘curve-fitting automated reviewer’ is probably one of the best ways to generate negative reviews of original ideas. 

   Read more: MOPRD: A multidisciplinary open peer review dataset (arXiv).

####################################################

Baidu makes a multilingual coding assistant:
…ERNIE-Code uses English as a passthrough language for multilingual capabilities…

Researchers with Baidu have built ERNIE-Code, a 560 million parameter coding model optimized for being multilingual. ERNIE-Code is “a unified cross-lingual pre-trained LLM for multiple natural languages and programming languages in hopes of mitigating the English-centric bias for program pre-training,” according to the researchers. 

What it is and why they did it: ERNIE-Code is pre-trained on six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby) via CodeSearchNet, as well as more than 100 languages via the CommonCrawl-100 corpus. 

   Pre-training has two specific tasks – span-corruption language modeling (add noise to text and try to predict corrupted spans, sentences, and documents), and ‘pivot-based translation language modeling’ (PTLM). PTLM is the route to multilinguality – they disassemble translating a natural language (NL) command into a programming language (PL)  command by instead translating the NL command into English, then translating English into the PL. This gets around the problem of otherwise needing to pair datasets from a hundred plus language with datasets from six languages and feels like a neat solution to the problem. 

Does it work? They test the model against mBART, mT5, PLBART, CodeT5 on four tasks: code summarization,code generation, document translation, and program repair. In tests, the model is competitive on all of these, and does significantly better on code summarization. On the other hand, I would have liked to see them compare to other hard baselines, like CodeGeeX from a decent group at Tsinghua.

Why this matters – representation matters: ERNIE-Code highlights the way in which language dominance can filter through to AI dominance; so much foundational text (and comments on code) is written in English that to avoid perpetuating the hegemony of one language, researchers need to figure out progressive approaches to empower other languages. ERNIE-Code is an example of this – though the fact it needs to pivot through English during training speaks to the larger problem.

   Read more: ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages (arXiv).

####################################################

Generate music from spectrograms via StableDiffusion – a crazy idea that works:
…RIFFUSION: Mad science for the hell of it – wonderful!…

You know what I love? Wildly crazy ideas that somehow work. You know what RIFFUSION is? It’s a wildly crazy idea that somehow works. RIFFUSION takes the Stable Diffusion image model, finetunes it to generate spectrograms, then generates audio from the spectrograms. This is pure, unadulterated, for-the-fun-of-it mad science, and I am in love. 

Fun things you can do: You can interpolate from one type of spectrogram to another, just as you would with images. This means the authors can generate multiple individual slices of audio, chunk them together, and shift from one thing (e.g the sound of a keyboard typing) to another (e.g, a guitar) over arbitrary time scales. They’ve also built a web app so you can try it yourself and generate your own audio on the fly. 

Why this matters: Superintelligence is partially SuperDataTransformation: Modern generative models are generic data transformation engines, able to take one type of data (e.g, a script) and port it into another (e.g, a song, a poem). This is a deep and weird idea the more you think about it. What would you do if you can ‘transpose’ anything into anything else? RIFFUSION is a creative example of what happens when you play with this idea. Congrats to the creators for making something joyful and zany! 

   Read more: [ RIFFUSION ] (official site).
   Try out the web app: RIFFUSION.COM
   Get the model from HuggingFace here (riffusion-model-v1).

####################################################

Stanford and Mosaic train a small but mighty medical language model:

…PubMedGPT 2.7B packs a lot of performance into a tiny package…

Stanford’s Center for Research on Foundation Models (CRFM) and AI training startup Mosaic have teamed up to train PubMedGPT 2.7B, a small GPT-style language model that gets a state-of-the-art result on medical question answering. 

Data and performance: PubMedGPT 2.7B was trained on the abstracts and full text portions of ‘The Pile’ dataset; 16 billion abstracts and 5 million full-text articles. The total size of the dataset is about 50B tokens, making the dataset a little small relative to the model (GPT3 2.7B and GPT-J were trained on 300B and 400B tokens respectively). The model gets 50.3% on the MedQA-USML eval (a new SOTA), 74.4 on PubMedQA (versus 77.6 for Facebook’s ‘Galactica’), and 96.4% on BioASQ. 

Compute: The model was trained on 128 A100 GPUs for 6.25 days, which is still a non-trivial amount of compute to dump into a model, even in the ‘big chungus*’ compute era of 2022. 

*Not an official term.

Maybe data repetition isn’t that bad? “We elected to train PubMed GPT for a long compute duration (300B tokens) by performing multiple passes, or epochs, over the 50B tokens,” the researchers write. “When training big models, people are wary of repeating data too much lest their model overfits. Here, that may not have been a huge concern. “It was indeed worth it to train for the full 300B tokens, even though this represented dramatically more passes through the data than comparable models,” the Stanford researchers said. 

Why this matters: I think AI models are going to have a sort of bimodal distribution – there’ll be a small number of absolutely vast ‘swiss army knife’ models which will underpin a huge range of economic functions, but I suspect there will also at the other end be a very large number of tiny (where tiny = <5 billion parameters) models that are tuned for very specific data sources and usecases, and also likely deployed directly on edge devices (pending some compute efficiencies). PubMed GPT is an example of the latter kind of model. I wonder how many more of its kind there will be?

   Read more: PubMed GPT: a Domain-Specific Large Language Model for Biomedicine (Mosaic blog).
   Read more: PubMedGPT 2.7B (Stanford University blog).
   Get the model from HuggingFace.

####################################################

Tech Tales:

The Universal Confessional Booth

[AI-AUGMENTED THERAPY CENTER, 2030]

We all had been crazy in our own ways but now we all had the same path to healing – talk to the robot for as long as it took for it to say you were ‘back in distribution’ (BID) with everyone else. This led to all kinds of slang. 

Yeah my friend BID out.

Yeah he on a long BID he crazy. 

Oh he’s just happy because it’s his BIDday.

And so on. 

I’d been getting close to my BID for a while now, my robot told me. I’d go and sit in the booth and talk to it and we’d have this long, rambling conversations about everything from: flowers, to the recent weather and how I felt about the dust storms, the quality of food in the institution as compared to what I ate outside (mostly healthier), how I felt about my friends and family. The robot would show me some different emotions on its ‘face’ (which was an avatar that was a different person each day, I suppose to elicit different reactions from me) and I would talk and it would ask questions. 

At the end of the session it would usually say ‘you are making excellent progress towards being back in distribution’. Sometimes it wouldn’t say anything, though, which was its way of telling me I hadn’t made progress. 

It wasn’t worth trying to perform for the robot because it’d ask so many questions that it’d uncover that you were spinning some story, and then it would get as close as it was allowed to expressing a negative emotion. “You are going backwards,” it might say. Or, “at this rate, you will be out of distribution for an unpredictable amount of time”. 

Of course we’d all talk to each other about how the BID talks were a load of bullshit. We’d sit up late at night after the guards had locked the cells and exchange stories. 

  • Yeah it asked me about my childhood friends. 
  • Childhood friends? My one went IN on my dead parents. I could have strangled it. 
  • My one keeps telling me I’m sexually repressed and I just don’t see it. 
  • You think you’ve got it bad – mine has been showing me different colors for a week and asking me how they make me feel. 

The strange thing was that people did change. It was like being hypnotized – you’d think nothing was changing, but then you’d snap back into a memory of the person six months prior tearing their hair out and screaming after the first session, and now they’d just low-key complain about the sessions while sitting there with a full head of hair and no scratch marks. 

Anyway, my robot says I’m almost at BID and it’s just going to be a few more sessions. It told me to journal about my experiences as part of a special BID evaluation. I guess that’s most of what I have to say right now so I’ll see what it thinks. 

Things that inspired this story: Discovering Language Model Behaviors with Model-Written Evaluations by Anthropic (PDF); how RHLF-trained models have a tendency to take on extreme sycophantic positions; what it might look like to have a model talking to thousands of people concurrently and embedding their conversations in a single space so as to judge who is and isn’t ‘out of distribution’; ChatGPT and other similar models; robot psychology meets the administrative state; insanity.

Import AI 312: Amazon makes money via reinforcement learning; a 3-track Chinese AI competition; and how AI leads to fully personalized media

McKinsey: Companies are using more AI capabilities and spending more on it:
…Somewhat unsurprising survey confirms that AI is an economically useful thing…
McKinsey has published results from its annual AI survey, and the results show that AI is, slowly but surely, making its way into the economy. 

Those findings in full:

  • AI adoption has plateaued: In 2017, 20% of respondents said they had adopted AI in at least one business area. In 2022, that figure was 50% (it peaked in 2019 at 58%).
  • Organizations are using more AI capabilities than before: In 2018, organizations used on average 1.9 distinct capabilities (e.g, computer vision, or natural language generation), rising to 3.8% in 2022.
  • Rising investment: In 2018, “40 percent of respondents at organizations using AI reported more than 5 percent of their digital budgets went to AI,” and in 2022 that rose to 52%.

Why this matters: This survey is completely unsurprising – but that’s useful. We have this intuition that AI has become increasingly economically useful and surveys like this show that this is the case. Perhaps the most surprising finding is that the rate of adoption is relatively slow – some organizations are using AI, and there are likely a bunch of ‘dark matter’ organizations for which AI holds very little relevance today.

   Read more: The state of AI in 2022—and a half decade in review (McKinsey).

####################################################

Language models aren’t welcome on StackOverflow:

…Popular coding Q&A site bans ChatGPT submissions…

StackOverflow has temporarily banned ChatGPT-written submissions to its website, as the site’s human creators grapple with the problems brought about by autonomous, AI coders. 

    “Overall, because the average rate of getting correct answers from ChatGPT is too low, the posting of answers created by ChatGPT is substantially harmful to the site and to users who are asking or looking for correct answers,” StackOverflow admins write in a post. “The volume of these answers (thousands) and the fact that the answers often require a detailed read by someone with at least some subject matter expertise in order to determine that the answer is actually bad has effectively swamped our volunteer-based quality curation infrastructure.”

Why this matters – AI-driven internet-based ‘climate change’: Things like this illustrate a ‘tragedy of the commons’ which I expect we’ll see more of; a new AI tool comes along and is very quickly used to generate a vast amount of low-grade spam and other crap which either damages human-curated sites, or lowers the quality of a common resource (see: algo-generated SEO-optimized spam pages found via Google). 

   Of course, in a few years, these systems might be better than humans, which is going to have wild implications. But for now we’re in the awkward adolescent period where we’re seeing people pour mine tailings into the common digital river.   

   Read more: Temporary policy: ChatGPT is banned (StackOverflow).

####################################################

Waymo works out how to train self-driving cars more efficiently by focusing on the hard parts:

…Trains a model to predict the inherent difficulty of a driving situation…

Researchers with Waymo have figured out how to use hard driving situations to train self-driving cars more efficiently. “Compared to training on the entire unbiased training dataset, we show that prioritizing difficult driving scenarios both reduces collisions by 15% and increases route adherence by 14% in closed-loop evaluation, all while using only 10% of the training data,” they write. 

How it works: Google’s approach has five stages. In the first, they collect a variety of data from real world vehicles (and their onboard AI models). They then collect and shard that data. They then learn an embedding that aligns specific driving runs to a vector space based on similarity. They then select some of these runs via counterfactual simulation and human triage, letting them figure out which runs are easy and which are hard. Then, they train an MLP to regress from these embeddings to difficulty labels for the run. The result is a model that can look at a new run and predict how difficult that run is. 

   In tests, they find that they can use 10% of the usual training-run datasets if they select for harder difficulty and, as a consequence, they get smarter vehicles better able to deal with difficult situations. One problem is this approach slightly damages performance on the easier routes (which makes sense – there’s less ‘easy’ data in the dataset). 

Why this matters – use AI to help build better AI: Now they’ve got this difficulty model, the engineers can use it to theoretically identify hard scenarios for new planning agents, or new geographies to deploy into which may have ‘hotspots’ of hard parts, which will let them use the AI system to speed up the development of better, smarter AI systems. This is a neat illustration of how once you’ve trained a model to have a good enough capability at something, you can use it to speed up development of other, much more complicated AI systems.  

   Read more: Embedding Synthetic Off-Policy Experience for Autonomous Driving via Zero-Shot Curricula (arXiv).

####################################################


Amazon uses deep reinforcement learning to make its inventory systems 12% more efficient:

…The march of real world DRL continues…

This year has been a banner one for deep reinforcement learning systems – we’ve seen DRL systems provably control the plasma in prototype fusion powerplants, effectively cool buildings, navigate real world robots and, now, let e-commerce behemoth Amazon better optimize its inventory. 

   In a new paper, Amazon researchers describe how they are able to train a reinforcement learning system to more effectively manage their inventory, leading to a reduction in the inventory Amazon has to hold by 12% (!!!). “”Our model is able to handle lost sales, correlated demand, stochastic vendor lead-times and exogenous price matching,” they write. 

What they did: For this work, Amazon built a differentiable simulator which it could train RL algorithms against, helping it model the complexities of inventory management. The resulting RL approach, DirectBackprop, was tested first in backtesting against a weekly dataset of 80,000 sampled products from a single marketplace running from April 2017 to August 2019, and then tested out in the real world on a portfolio of products over 26 weeks. 

   The results are pretty convincing: “We randomized these products into a Treatment (receiving Direct-Backprop buy quantities) and a Control (receiving Newsvendor policy buy quantities) group,” they write. “The Control group was the current production system used by one of the largest Supply Chains in the world [Jack – that’d be Amazon]. The Treatment group was able to significantly reduce inventory (by ∼ 12%) without losing any excess revenue (statistically insignificant difference from 0%)”.

Why this matters: Papers like this show how AI is rapidly making its way out of the lab and into the real world. It’s a big deal when some of the world’s largest and most sophisticated corporations do large, potentially expensive real-world tests on their own physical inventory. It’s an even bigger deal when it works. All around us, the world is being silently optimized and controlled by invisible agents being built by small teams of people and applied to the vast machinery of capitalism.

   Read more: Deep Inventory Management (arXiv).

####################################################

Chinese researchers run an a 3-track ‘AI Security Competition’:

…Deepfakes, self-driving cars, and face recognition – and a nice example of how competitions can drive progress…

A bunch of Chinese universities and companies recently launched a so-called ‘Artificial Intelligence Security Competition’ (AISC) and have published a report going over the results. The AISC has three tracks relating to three distinct AI use-cases; deepfakes, self-driving cars, and face recognition. 

Deepfakes: This is a deepfake identification competition: “Given a query image, identify the

Deepfake method behind it based on its similarities to the images in the gallery set.” 

   144 teams participated in the competition and the winning team was  led by tencent (with a top-5 precision of 98% success).

Self-driving cars: This competition is based around adversarial attacks on computer vision models used in self-driving cars. Specifically, it forces vision models to try and correctly label trucks that the cars would otherwise crash into and sometimes the cars have been doped with an adversarial patch meant to make them invisible to object detectors. There are different stages in this competition and in the final round there is more scene variation and the adversarial cars get replaced by human mannequins. 

   96 teams participated in the competition and the winning team (BJTU-ADaM) came from Beijing Jiaotong University.

Face recognition: This is based around developing effective adversarial attacks on image recognition systems. The idea is to “discover more stable attack algorithms for evaluating the security of face recognition models and consequently facilitate the development of more robust

face recognition models” – an important thing given that China is probably the most heavily-surveilled country in the world (though the UK gives it a run for its money). 

   178 teams participated in the competition. Two teams shared the first prize – TianQuan & LianYi, and DeepDream – getting a perfect 100 each.

Who did this: When something is a giant multi-org output, I typically don’t publish all the institutions. However, China is a special case, so – for the enthusiasts – here’s the list of authors on the paper: 

   “realAI, Tsinghua University, Beijing Institute of Technology, Shanghai Jiao Tong University, China Hanhu Academy of Electronics and Information Technology, Xi’an Jiaotong University, Tencent YouTu Lab, China Construction Bank Fintech, RippleInfo, Zhejiang Dahuatech Co, Beijing Jiaotong University, [and] Xidian University.”

Why this matters: In the future, most nations are going to carry out alternating ‘red team vs blue team’ competitions, where teams compete to break systems and eventually to build more resilient ones. This competition shows how useful the approach can be for both developing more robust systems and identifying vulnerabilities in widely deployed ones. It also speaks to the dynamism of the Chinese AI sector – hundreds of submissions per track, interesting technical solutions, and a sense of excitement about the endeavor of making AI more robust for society. The translated tagline for this whole competition was, per the official website: “Building AI Security Together and Enjoying the Future of Intelligence“.

   Read more: Artificial Intelligence Security Competition (AISC).

####################################################

Facebook’s AI training gets messed up by bad weather:

…Is your training run breaking because you’re dumb, or because the sun is shining in Oregon?…

When Facebook was training its ‘CICERO’ system which recently beat humans at Diplomacy, the company ran into a strange problem – sometimes training speeds for its model would drop dramatically and the team couldn’t work out why. It turned out, per Facebook in an AMA on Reddit, that this was because the FB data center’s cooling system was malfunctioning on particularly hot days. 

   “For the rest of the model training run, we had a weather forecast bookmarked to look out for especially hot days!” Facebook said. 

Why this matters: Worth remembering that AI systems are made out of computers and computers have to go somewhere. Since most of the mega companies use free-air cooling, their data centers (while being stupendously efficient!) can become vulnerable to edge-case things, like particularly hot days leading to malfunctions in cooling which has a knock-on effect to the (overheating) servers sitting in the cavernous halls of anonymous buildings scattered around the world. 

   Read more: We’re the Meta AI research team behind CICERO, the first AI agent to achieve human-level performance in the game Diplomacy. We’ll be answering your questions on December 8th starting at 10am PT. Ask us anything! (Reddit).

   Via nearcyan on Twitter.

####################################################

Watch me (and several other more coherent people) waffle about AI policy!

I participated in a panel on ‘future-proofing AI governance’ at the Athens roundtable on AI and the Rule of Law in Brussels recently – you can check out the video here. My general sense from spending a few days in Brussels is there’s a valuable discussion to be had about what kind of negligence or liability standards should be applied to developers of super-intelligent AI systems, and there’s a lot of room to be creative here. It’s worth thinking about this now because policy takes a long time to craft and, if some of the more optimistic timeline predictions come true, it’d be good to have built out regulatory infrastructure in the coming years. 

   Watch the video here: Future-proofing AI Governance | The Athens Roundtable on AI and the Rule of Law 2022 (The Future Society).

####################################################

Tech tales: 

The Personal Times 

[Worldwide, 2026]

The news of the day, customized for you!

We report to you, for you, and about you! 

News from your perspective!

When news become personalized people started to go mad. Of course the underlying facts were the same but the stories were angled differently depending on who you were. Everyone read the news, because the news was incredibly compelling. 

All the news that’s fit to finetune!

One hundred stories and one hundred truths!

News for all, made personal!

You’d sit on a train after a terrorist attack and see everyone’s eyes light up and everyone would be happy or worried or panicked, depending on their implicit preferences from their news media consumption. You’d stare at eachother with wild eyes and say ‘did you hear the news’ but you stopped knowing what that meant, and you mostly said it to work out what type of person you were dealing with. Was their news happy or their news sad or their news uplifting or their news opportunistic. What bubble did they live within and how different to yours was it?

Things that inspired this story: What happens when generative models lead to media customized around individual preferences learned via reinforcement learning from human feedback? Just how big a crash will ‘Reality Collapse’ bring? Is society meant for everyone to have ‘solipsism media on tap’?

Import AI 311: Distributed GPT busts the political economy of AI; Apple optimizes Stable Diffusion; AI war startup raises $1.48 billion

Test out your coding model on a fuzzed benchmark:
…DS-1000 pits code models against 1,000 tasks spread across seven Python libraries…
Researchers from the University of Hong Kong, Peking University, Stanford University, Berkeley, the University of Washington, Facebook, and Carnegie Mellon University have built DS-1000, a set of 1,000 data science problems spanning seven Python libraries. This is both a dataset and a benchmark and is useful for building code models, like Codegen or Copilot.

What’s in DS-1000? The dataset contains 1000 problems drawn from 451 distinct StackOverflow problems. “To defend against potential memorization, more than half of the DS-1000 problems are modified from the original StackOverflow problems; they include 152 surface perturbations, 235 semantic perturbations, and 162 difficult rewrites,” the authors write. DS-1000 contains problems in NumPy, SciPy, Pandas, TensorFlow, PyTorch, Scikit-learn, and Matplotlib. “The problems in DS-1000 represent more diverse and naturalistic intent and context formats that cannot be seen in any other datasets,” they write. 

How hard is it? The best performing models (Codex from OAI) get, at most, about 40% for tasks like insertion, followed by CodeGen(Salesforce) at ~8.4% and InCoder-6B from Facebook (7.5%). This is great news as it suggests it’s a hard benchmark. 
   Read more: DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation (GitHub).
   Get the code here: DS-1000 Data Science Code Generation (GitHub).

####################################################

Apple optimizes Stable Diffusion on Apple silicon:
…World’s most valuable company + world’s most proliferated generative model…
Apple has significantly cut the time it takes to generate images from Stable Diffusion on Apple silicon. It’s notable that the world’s most valuable company has tacitly adopted the world’s most widely distributed (and quite controversial) generative image model, and perhaps a sign of things to come – release the weights of your model, and perhaps vast companies will expend engineering resources to make it run more efficiently on their hardware. 

   “This release comprises a Python package for converting Stable Diffusion models from PyTorch to Core ML using diffusers and coremltools, as well as a Swift package to deploy the models,” Apple writes.

Why this matters – on-device AI: Most AI models need to be sampled from via large computers, typically servers running top-of-the-line GPUs. Large language models, for instance, can take tens of GPUs to sample from in a reasonable time. Image models, while cheaper to sample from, can still be expensive. With this release, Apple has made it significantly faster for people to pull Stable Diffusion images off of their local devices – in other words, you could be sitting in the back of a cab in a place with no cell reception and could idly generate images on a laptop equipped with an M1 or M2 chip. 
   Read more: Stable Diffusion with Core ML on Apple Silicon (Apple Machine Learning Research blog)
   Check out detailed notes here: Core ML Stable Diffusion (Apple GitHub).

####################################################

Want to see if your object detection system works in the real world? Try out Roboflow100:
…RF100 – a reassuringly difficult and diverse benchmark…
Roboflow, a computer vision startup, has released Roboflow-100, a large-scale object detection dataset. What makes Roboflow different is, much like the recent emergence of benchmarks like SuperGLUE (a multi-task NLP benchmark), it takes multiple distinct datasets (in this case: 100) and puts them together into a single suite. This kind of thing tends to be really useful as it helps people work out if their models are overfitting or are actually capable of decent generalization.
   Another different thing is the data is sourced from real jobs by real users of Roboflow, so this is less an academic benchmark and more an applied one.

What goes into Roboflow-100? RF100 contains 100 datasets spread across 7 imagery domains, containing a total of 224,714 images annotated with 805 class labels. “By releasing RF100, we aim to provide a semantically diverse, multidomain benchmark of datasets to help researchers test their model’s generalizability with real-life data.”
   The seven main categories consist of annotation tasks in the following domains: Aerial, Video Games, Microscopic, Underwater, Documents, Electromagnetic, and Real World. All of these main categories contain sub-categories, ranging from first-person shooters (video games) to fishery sights from aquariums (underwater), to geology (real world), etc. 

Why this matters – hard enough to be useful: RF100 seems sufficiently large-scale and diverse that it poses a challenge to contemporary systems – that means it can be a valuable tool for developing and assessing the performance of more general models. The roboflow researchers show this by training a couple of baseline models (YOLOv5 and YOLOv7, respectively), as well as training a zero-shot detector called GLIP. The finetuned YOLO variants get about ~65-70% accuracy (v5 and v7, respectively), and GLIP gets ~11%. In other words – RF100 is a challenging benchmark, so there should be some signal in seeing how people do on it. 
   Read the paper: Roboflow 100: A Rich, Multi-Domain Object Detection Benchmark (arXiv).
   Read more: roboflow100 (official website).
   Get the dataset: Roboflow 100, GitHub.

####################################################

AI centralization just got less likely: Distributed team train a good 6bn parameter GPT model:
…You’ve heard about open source models. How about open source models trained over a super shitty network?…
Researchers with Together have trained GPT-JT, a 6bn parameter, well performing model. So far, so normal. The twist is that GPT-JT was trained in a decentralized manner on a heterogeneous bunch of GPUs over slow (1Gbps) internet links. That’s a big deal – and has some big implications. 

What is GPT-JT and how well does it work?: GPT-JT “is a variant forked off GPT-J and performs exceptionally well on text classification and other tasks,” the authors write. “On classification benchmarks such as RAFT, it comes close to state-of-the-art models that are much larger (e.g., InstructGPT davinci v2)”. GPT-JT was made possible by a range of open source software, ranging from underlying models (GPT-J, etc), datasets, evaluation metrics, and various contributions to decentralized algorithms. 

Trained in a decentralized manner: The authors wrap in a bunch of clever ideas to reduce the burden of decentralized training, cutting the amount of communication needed per machine for all the tokens processed. This is crucial to the success of the project; out-of-the-box decentralized training fails because you have enough between-machine chatter that the slowness of your connections represents a major tax on training.

Centralization versus decentralization – this is an attack on the political economy of AI! A lot of AI development has so far been defined by a small set of groups with access to big, centralized computers. These groups have used these blobs of compute to train impressive models, ranging from AlphaZero to GPT3. It has always been hard for people with fewer computers to catch up to the people with supercomputers. GPT-JT suggests a radically different future – distributed collectives can instead pool computers over crappy internet links and train models together. ex pluribus unum exemplar, if you will.
    Now, the multi-trillion dollar question is if these distributed groups can provably train models on par with those developed by the large, centralized giants. That part is a lot less clear – while GPT-JT is a decent model, it’s a tiny one at 6bn parameters. But if they can scale this kind of technique up, the implications are huge. 
   There’s also the small matter of China, which recently got a lot of its AI ambitions clipped by US export controls preventing it from accessing frontier GPUs. But maybe the frontier doesn’t matter as much if you can just aggregate compute across a country of more than a billion of people and train a model with the focus afforded by an Authoritarian regime. Food for thought! 
   Read more: Releasing v1 of GPT-JT powered by open-source AI (Together blog).
   Get the code: GPT-JT-6B-v1 (HuggingFace).
   Try out a live demo on HuggingFace here.

####################################################

AI war startup Anduril raises $1.48 billion: 
…AI + Robots + Startup DNA = a faster OODA loop for battlefield commanders…
AI War startup Anduril has raised $1.48 billion (that’s with a B) in a Series E round. “The new funding will enable Anduril to accelerate research and development to bring new, cutting edge, autonomous defense capabilities to the market and continue to mature and scale its current business lines with the US Department of Defense as well as US allies and partners,” the company wrote. 

AI and War: Anduril is a fascinating company – it’s one of the few modern defense startups in the US that is pairing recent AI innovations with various advances in robotics (e.g, low-cost drones) as well as sensor platforms. Put it all together and you wind up with a company that is fielding an increasingly vast arsenal of devices able to conduct war activities on land, air, and sea (via recent acquisition, Dive Technologies). Some of the company’s recent product launches include ALTIUS 600M (a loitering munition, aka a drone that hangs around then kills something with a bang), ‘Menace” (“a first-of-its-kind integrated, expeditionary, secure, command, control, communications and computing (C4) platform”), and Mobile Sentry (a robot for autonomous ground and air monitoring). 

Why this matters – war is about speed, and AI increases speed: War runs on an OODA loop – Observe, Orient, Decide, Act. By pulling in modern technologies such as AI, Anduril is building an arsenal that increases the speed at which battlefield commanders can iterate through the OODA loop. Anduril is less about its individual items and more about its overall suite of products – taken together, they potentially let an entrepreneurial army out-think the competition via running an OODA loop. War is a depressing thing, but a more depressing thing is losing wars, so the funding for Anduril seems like a positive indication for the US (and allied) defense industrial base. I hope it continues to succeed in breaking through the monopoly of the aging so-called defense ‘primes’ (Lockheed, etc). 
   Read more: Anduril Raises $1.48 Billion in Series E Funding (Anduril blog, Medium)

####################################################

Reality Authentication
[The internet, 2034] 

“To login, spit into the bio-API”
   I took a sip of water and swirled it around my mouth a bit, then hawked some spit into the little cup on my desk, put its lid on, then flipped over the receptacle and plugged it into the bio-API system.
“Authenticating… authentication successful, human-user identified. Enjoy your time on the application!”
   I spent a couple of hours logged-on, doing a mixture of work and pleasure. I was part of an all-human gaming league called the No-Centaurs; we came second in a mini tournament. I also talked to my therapist sans his augment, and I sent a few emails over the BioNet protocol. 

   When I logged out, I went back to the regular internet. Since the AI models had got minituarized and proliferated a decade ago, the internet had radically changed. For one thing, it was so much faster now. It was also dangerous in ways it hadn’t been before – Attention Harvesters were everywhere and the only reason I was confident in my browsing was I’d paid for a few protection programs. 

Things that inspired this story: The ceaseless march of generative model progress; chatGPT; high- and low-class hobbies; the rich will always have a retreat, while the poor will always be condemned to the most experimental parts of the frontier.