Import AI 387: Overfitting vs reasoning; distributed training runs; and Facebook’s new video models

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe. A somewhat shorter than usual issue this week – but I decided it’s better to keep the weekly cadence than to save up sections for a ‘full’, less frequent newsletter.

Apple shows that most LLMs are overfitting:
…GSM-Symbolic suggests people are teaching to the test on GSM8K, though larger models generalize better…
Apple has published a benchmark which subtly varies the widely used ‘GSM8K’ math benchmark and in doing so shows that most LLMs may have data contamination – if you slightly vary some aspects of a widely used math test, their performance drops significantly. “We show that the performance of all models drops on GSM-Symbolic, hinting at potential data contamination,” Apple writes. “We further question the reasoning abilities of LLMs and introduce the GSM-NoOp dataset. By adding seemingly relevant but ultimately irrelevant information to problems, we demonstrate substantial performance drops (up to 65%) across all state-of-the-art models”.

What they did – GSM-Symbolic and GSM-NoOp: Apple introduces two tests; GSM-Symbolic subtly varies GSM8K, while GSM-NoOp introduces distracting variables on top.
   GSM8K versus GSM-Symbolic: GSM-Symbolic takes questions from GSM8K and turns them into madlib-style templates where key details become variables (e.g., where a GSM8K question says “When Sophie watches her nephew”, the GSM-Symbolic version says “When {name} watches her {family}”, and where a GSM8K question says “After buying the tube of balls, Sophie has 31+8+9 + T = 48”, the GSM-Symbolic version says “After buying the tube of balls, {name} has {x} + {y} + {z} + T = {total}”).
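   To make the templating idea concrete, here’s a tiny, hypothetical sketch (my illustration – the template, names, and arithmetic are made up, not items drawn from Apple’s dataset) of how a GSM8K-style question can be turned into a template and re-instantiated with fresh values, so a model that memorized the original surface form can no longer pattern-match its way to the answer:

```python
# Toy GSM-Symbolic-style templating. The template and numbers here are
# hypothetical illustrations, not questions from Apple's benchmark.
import random

TEMPLATE = ("When {name} watches her {family}, she gets distracted for "
            "{x} minutes every hour. If she watches for {h} hours, how "
            "many minutes is she distracted in total?")

def instantiate():
    name = random.choice(["Sophie", "Mia", "Leila"])
    family = random.choice(["nephew", "cousin", "brother"])
    x, h = random.randint(2, 15), random.randint(2, 8)
    question = TEMPLATE.format(name=name, family=family, x=x, h=h)
    answer = x * h  # the ground-truth answer tracks the sampled variables
    return question, answer

print(instantiate())
```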
    Results – huge variance on smaller models: Apple tests a bunch of models and the results show huge variance, with relatively small open source models like Mistral-7b and Gemma2-2b doing far worse on GSM-Symbolic than on GSM8K, though there’s significantly less variance on large-scale proprietary models like GPT-4o and o1-mini. On GSM-NoOp, the same pattern shows up, with smaller (and mostly open source) models doing very badly, and large-scale proprietary models like OpenAI’s new reasoning-filled “o1-preview” model suffering the least severe performance degradation.

What does it all mean? Overfitting happens, but big models are less prone to it: While critics of LLMs will use this paper to point out that LLMs are often overfitting and not really ‘reasoning’ but rather memorizing stuff, I think it actually shows something more subtle: small and therefore relatively dumb models do absolutely overfit, but the larger and smarter your model, the less prone it is to overfitting. Yes, there’s still some degradation, suggesting that large models have soaked up some biases from their training sets which degrade performance, but the fact they cope far better is actually – to me – a very optimistic sign, indicating that the world’s largest and most sophisticated models may really exhibit some crude reasoning – at least, enough to get over deliberately confounding benchmarks!
   Read more: GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models (arXiv).

***

The 10bn+ parameter distributed training run has arrived:
…Prime Intellect is trying to do something that, if successful, will change the contours of AI policy…
AI startup Prime Intellect has launched INTELLECT-1, a decentralized training run of a 10-billion parameter model. If successful, this will be the largest decentralized training run of a frontier language model – that’s a big deal, because it will show that loosely federated collectives might be able to pool their computers to train models that challenge those of single companies. 

What INTELLECT-1 relies on: The training run uses OpenDiLoCo (Import AI #381), Prime Intellect’s open source implementation of DeepMind’s ‘DiLoCo’ technique (Import AI #349). Prime Intellect already used this technique to train a 1bn parameter model and is now scaling it up to a 10B one. “Our goal is to solve decentralized training step-by-step to ensure AGI will be open-source, transparent, and accessible, preventing control by a few centralized entities and accelerate human progress,” the company writes. 
    How it works: There are a few new inventions that further improve the efficiency of the distributed training process. These include: ElasticDeviceMesh, software for automatically scaling up and down the groups of computers used for distinct parts of the AI training; asynchronous distributed checkpointing, an asynchronous way to save state during the run; live checkpoint recovery, to make it easy for new computers joining the run to grab its latest state; a custom Int8 all-reduce kernel, optimized for the types of quantization and dequantization used; and more.
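    To give a feel for the core trick that makes this bandwidth-friendly, here’s a minimal, hypothetical PyTorch sketch of a DiLoCo-style round (my illustration under stated assumptions – not Prime Intellect’s actual OpenDiLoCo code; `get_batch` and `loss_fn` are placeholder stand-ins). Each worker takes many cheap local steps, and only a compact ‘pseudo-gradient’ is communicated between sites at the end of each round:

```python
# Sketch of one DiLoCo-style round: lots of local compute, one all-reduce.
# Assumes torch.distributed is initialized (e.g., with the NCCL backend,
# which supports ReduceOp.AVG); inner_opt is typically AdamW over
# model.parameters(), outer_opt typically SGD with Nesterov momentum.
import torch
import torch.distributed as dist

def diloco_round(model, synced_params, inner_opt, outer_opt,
                 inner_steps, get_batch, loss_fn):
    # Inner phase: local training, with no cross-worker communication.
    for _ in range(inner_steps):
        x, y = get_batch()
        loss = loss_fn(model(x), y)
        inner_opt.zero_grad()
        loss.backward()
        inner_opt.step()
    # Outer phase: average each worker's "pseudo-gradient" (its offset from
    # the last synchronized weights), then take one outer optimizer step
    # starting from the synchronized point.
    for p, s in zip(model.parameters(), synced_params):
        p.grad = s.data - p.data                  # pseudo-gradient
        dist.all_reduce(p.grad, op=dist.ReduceOp.AVG)
        p.data.copy_(s.data)                      # rewind to synced weights
    outer_opt.step()                              # outer (Nesterov) update
    for p, s in zip(model.parameters(), synced_params):
        s.data.copy_(p.data)                      # new synchronized weights
```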
    What it’s being trained on: INTELLECT-1 is training now on the Fineweb-Edu dataset from HuggingFace (55% of the training mix), along with DCLM (20%), Stack v2 (20%), and OpenWebMath (5%).

Who makes the future? There’s a leaderboard where you can see who is putting forward the compute to train this model – beyond individuals, other companies include SemiAnalysis, HuggingFace, and Arcee AI. 

Why this matters – centralization versus decentralization, aka the political economy of AI rests on distributed training: If distributed training works well, it changes the policy landscape of AI development. Today, much of AI policy rests on the load-bearing assumption that you can control the frontier by monitoring and controlling large blobs of centralized computers. Decentralized training breaks this – the frontier can now be made of hundreds of different blobs of compute, working together. This also bears on export controls, which deny people the ability to build high-performing, centralized blobs of compute – again, decentralized training makes it easier to pool fleets of n-1 generation accelerators and use them to compose an (economically suboptimal) frontier training run.
    Of course, Prime Intellect has some way to go – frontier training runs now involve hundreds of billions of parameters (e.g., the biggest model in Facebook’s Llama 3 family weighs in at 405bn parameters), so whether the technique scales to that regime matters. But just a few months ago the largest decentralized training runs were on the order of 1bn parameters, so 10bn is already a big leap!
   Read more: INTELLECT-1: Launching the First Decentralized Training of a 10B Parameter Model (Prime Intellect blog).

***

Facebook prepares for the fully personal video model future:
…Announces Movie Gen models, trained on ~6,000 H100s…
Facebook has built Movie Gen, a set of generative models for making and editing movies. The models can generate videos from text, edit videos with text, and produce personalized videos (e.g., you upload a photo of yourself and it builds the video around you).

Compute: These are relatively expensive models – Facebook trained the Movie Gen family on “up to 6,144 H100 GPUs, each running at 700W TDP and with 80GB HBM3, using Meta’s Grand Teton AI server platform”.

No release: Facebook isn’t releasing these models – “the Movie Gen cast of foundation models were developed for research purposes and need multiple improvements before deploying them,” Facebook writes. 

Why even cover this? Video is about to be a commodity: Just as text generation and image generation have become ‘commodity AI’ services (where, though proprietary models exist, you can relatively easily access extremely cheap and/or open-weights variants), video models seem to be heading in this direction. Facebook also seems like one of the most likely players to openly proliferate such models, so it’s worth taking note of Movie Gen to get a sense of what might be broadly distributed on the internet in a while.
   Find out more at the official site: Meta Movie Gen (Meta).
   Read more in the research paper: Movie Gen: A Cast of Media Foundation Models (Facebook, PDF).

***

Tech Tales:

Intelligence piled high
[Many years after uplift, fragment stored in offworld ‘wikiasteroid’]

We have as many words for intelligence as some groups of humans had for snow. In the space of all possible minds there is so much variety that a new vocabulary is needed. The words we use also have power – we navigate ourselves to and through these different spaces of minds through language, so our ability to describe intelligence is equivalent to our ability to channel it. Much of our work is spent in this exploration – this characterization of the many textures and constraints and specialisms that make up a mind. We are all smart, of course – smarter than any thing that has ever lived in any of our recorded history. But we nonetheless encounter problems that pose challenges to us. There is a kind of sport in this – we shapeshift according to our language and our language changes according to how much of our possibilities we have explored. 

Things that inspired this story: The notion that having different ‘lenses’ on problems is key to solving them; natural ecologies; the idea that even among machines there will be competition and specialization.

Thanks for reading!

Import AI 386: Google’s chip-designing AI keeps getting better; China does the simplest thing with Emu3; Huawei’s 8-bit data format

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Get a visceral feel for how powerful modern AI video tools are:
…Pika 1.5 has a very fun and kind of dumb demo you should all try…
AI startup Pika has released Pika 1.5, its latest video generation model. They’ve accompanied the release with an extremely fun demo where you can upload any image you like and apply effects to it, ranging from inflate to melt to explode to, of course, turning it into cake. It’s worth playing around with to get a sense for how adaptable these tools are – e.g., see here how well it does being asked to turn a 2D transformer graph into a cake. Pika recently raised $80 million “so anyone can make video on command”.

Why this matters – CGI company in a box: Most powerful AI capabilities look like $some_human_institution that has been magically converted into a machine learning model that anyone can access. Pika feels like a CGI company that has been converted into an AI system. It’s fun – play with it and also remember this is the worst this technology will ever be.
   Check out the Pika demo here (Pika website).
   Read more: Pika raises $80M, so anyone can make video on command (Pika).

***

Chinese researchers train a multimodal model in the simplest possible way:
…Emu3 just does next-token prediction on images, text, and videos – pointing to a future of increasingly general systems…
Chinese researchers with the Beijing Academy of Artificial Intelligence have trained and released Emu3, a set of models that can process images, text, and videos. Emu3 is distinguished by its simple approach and the fact it yields outputs of compelling quality. 

What Emu3 is: The model family is “a new suite of state-of-the-art multimodal models trained solely with next-token prediction,” BAAI writes. “By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences”.
   There isn’t any particular magic to Emu3; rather, it is notable because it eschews a bunch of complicated architectural tricks and instead just focuses on taking in images, text, and videos, tokenizing them into a discrete space, then jointly training a single transformer from scratch. “By simplifying complex model designs and focusing solely on tokens, it unlocks significant potential for scaling both during training and inference”, they write. 
    “The Emu3 model retains the architectural framework of established large language models (LLMs) such as Llama-2, with the primary modification being the expansion of the embedding layer to accommodate discrete vision tokens”.
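    To make the ‘no particular magic’ point concrete, here’s a minimal, hypothetical sketch of the training objective (my illustration, not BAAI’s code – `vision_tokenizer`, `transformer`, and the begin/end-of-image ids are assumed placeholder names): images become discrete tokens, get spliced into the text stream, and the whole mixed sequence is trained with a vanilla causal language-modeling loss:

```python
# Emu3-style unified next-token prediction, sketched. A VQ-style vision
# tokenizer maps an image to discrete codebook indices; those indices share
# one sequence (and one output softmax) with ordinary text tokens.
import torch
import torch.nn.functional as F

def build_sequence(text_ids, image, vision_tokenizer, boi_id, eoi_id):
    # Flatten the image's discrete codes and wrap them in (hypothetical)
    # begin-of-image / end-of-image sentinel tokens.
    codes = vision_tokenizer.encode(image).flatten()
    return torch.cat([text_ids,
                      torch.tensor([boi_id]), codes, torch.tensor([eoi_id])])

def lm_loss(transformer, seq):
    # Plain causal cross-entropy over the whole multimodal sequence: the
    # model predicts text tokens and vision tokens with the same head.
    logits = transformer(seq[:-1].unsqueeze(0))   # (1, T-1, vocab_size)
    return F.cross_entropy(logits.squeeze(0), seq[1:])
```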

Why this matters – universal models with universal representations: Cramming videos and text and images into models gives them a kind of unified imaginative space in which they can represent and from which they can generate. Over time, we can expect people to integrate other modalities as well – audio spectrograms, maybe radar, 3D data, and so on. It’s all about figuring out the simplest possible way to port different types of data into the same embedding space. Everything will be stored in the single synthetic mind. 
   Read more: Emu3: Next-Token Prediction is All You Need (arXiv).
   Access the models and the Vision Tokenizer here on HuggingFace (Emu3, BAAI, HuggingFace).
    Check out some example images and videos here (Emu3 official website, BAAI).

***

Facebook releases some new Llama models, bringing openly accessible text-vision tools to everyone:
…Llama 3.2 points towards a Linux-like free AI stack…
Facebook has released the Llama 3.2 family of AI models, building on its Llama series. The new models include 11B and 90B parameter vision models built to serve as “drop-in replacements for their corresponding text model equivalents,” as well as 1B and 3B text-only models with a 128K context length which are small enough that they “empower developers to build personalized, on-device agentic applications with strong privacy where data never leaves the device.”

Why this matters – towards a Llama Linux-like stack: Facebook is determined to make Llama into an open* platform, forming the AI equivalent of the Linux software stack. One especially interesting example is how Facebook is working with the broader tech ecosystem to make this come true – for instance, the smaller Llama models “are enabled on day one for Qualcomm and MediaTek hardware and optimized for Arm processors”, the company writes. If Facebook keeps investing, then it’ll both commoditize the lower layers of the AI capability landscape and also be able to shape the AI utility computing ecosystem that is being born right now.
*With significantly more onerous licensing terms than Linux, and ‘open’ in Facebook-land does not equal ‘open source’, despite what its PR strategy would encourage you to think.
   Read more about the models in the official blog: Llama 3.2: Revolutionizing edge AI and vision with open, customizable models (Meta).
   Get the models here: Introducing Llama 3.2 (Meta).

***

Huawei gets opinionated about 8-bit training for LLMs:
…HiFloat8 is a symptom of the ‘full stack invention’ approach Chinese companies are taking to modern AI systems…
Huawei researchers have published HiFloat8, a data format they’ve developed for doing low-precision training of AI. HiFloat8 is a specification for 8-bit data representation and is part of the general trend of AI developers moving to mixed-precision training. 

Who cares about 8-bits? Mixed-precision training is valuable because it saves you time and money – 8-bit representations are more efficient than 16-bit and 32-bit ones. In recent years, the industry has been moving to training AI systems at lower precisions: AlexNet was trained in 32-bit in 2012, Google started training systems in 16-bit (Float16) in 2017, and in 2022 IBM and NVIDIA publicly discussed 8-bit formats (and startups like Inflection publicly stated they trained systems using them). 

What is HiFloat8: “In 2021, HiSilicon launched the HiFloat project, aiming to study and develop novel low-precision data formats for our AI products. Subsequently, this project attracted many researchers from other departments to join,” Huawei writes. HiFloat is “a novel 8-bit floating point format HiF8 for deep learning, which features the better balance between precision and dynamic range compared with the existing Float8 formats, and can be simultaneously used in both forward and backward passes for AI training”.
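    To see why ‘balance between precision and dynamic range’ is the whole game at 8 bits, here’s a hedged NumPy illustration that simulates rounding to a generic E4M3-style float8 (4 exponent bits, 3 mantissa bits). This is not HiF8 – Huawei’s format uses a tapered-precision scheme of its own, and the standard OCP E4M3 additionally reserves its top code for NaN and saturates at 448 – but it shows the basic tradeoff: fine steps near zero, coarse steps and early saturation for big values:

```python
# Simulate quantization to a generic E4M3-ish 8-bit float (illustrative
# only; NOT Huawei's HiF8, whose encoding the paper specifies differently).
import numpy as np

def quantize_fp8(x, exp_bits=4, man_bits=3):
    bias = 2 ** (exp_bits - 1) - 1                 # 7 for E4M3
    max_exp = 2 ** exp_bits - 1 - bias             # largest usable exponent
    sign, mag = np.sign(x), np.abs(x)
    e = np.clip(np.floor(np.log2(np.maximum(mag, 1e-45))), 1 - bias, max_exp)
    step = 2.0 ** (e - man_bits)                   # spacing between codes
    top = (2 - 2.0 ** -man_bits) * 2.0 ** max_exp  # largest representable value
    return sign * np.clip(np.round(mag / step) * step, 0.0, top)

x = np.array([0.0123, 0.5, 1.7, 100.0, 1000.0])
print(quantize_fp8(x))  # small values keep detail; 1000.0 saturates
```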
    Does it work? Yes: In tests on a bunch of AI models of different types (e.g., computer vision and LLMs), Huawei shows that HiFloat8 works reasonably well, outperforming reasonably well-constructed baselines in a few areas. The results aren’t eyeball-melting, but they don’t need to be – if you’re spending a billion dollars on training runs, eking out some single-digit percentage gain over your previous training efficiency is worth millions. 

Why this matters – caring about data formats means you care about the full stack: Papers like this are a symptom of vertical integration in AI development; you only develop your own data format if you are building AI software across multiple layers of abstraction and have become deeply opinionated about the lower levels of the software. The publication of HiFloat is a symptom of what we all informally understand to be true – Chinese companies are taking AI very seriously and are working on improving both the independence of their tech stack at multiple levels of abstraction as well as getting good at innovating and refining within these abstractions. 
   “In the future, we will disclose another research achievement of HiFloat project: HiFloat below 8-bit, as well as its training and inference capabilities,” the researchers write. 
   Read more: Ascend HiFloat8 Format for Deep Learning (arXiv).

***

Google sees compounding benefits from AI-driven chip design:
…AlphaChip has its own scaling law…
Google has been using AI to design and improve some of its own AI training and inference chips for several years now – and the results have been compounding. In new research published in Nature, Google describes how its RL-driven chip design approach AlphaChip has, since publication, been used in three additional generations of Google’s main AI chip, the Tensor Processing Unit. 
    “The gap between the performance of AlphaChip and human experts has grown with each successive generation of TPU, going from 10 RL-placed blocks and 3.2% wirelength reduction vs. human experts in TPU v5e, to 15 blocks with 4.5% reduction in TPU v5p, to 25 blocks with 6.2% reduction in Trillium,” Google writes. “AlphaChip has also generated superhuman chip layouts for blocks used in datacentre CPUs (Axion) and other unannounced chips across Alphabet.”

Why this matters – scaling laws compounding via hardware acceleration: AlphaChip is based on a pre-trained generative model optimized for chip design. In the same way people have been scaling up the size of these models for LLM development – and seeing capability gains as a consequence – Google has been doing the same for AlphaChip. AlphaChip is trained on Google’s chip fleet, which increasingly consists of TPUs. This means AlphaChip is compounding on itself – Google trains a larger AlphaChip model to come up with smarter circuit layouts for TPUs, then fabricates those TPUs, then trains the next version of AlphaChip on this more efficient and powerful hardware, and then repeats the whole process. 
    “With each new generation of TPU, including our latest Trillium (6th generation), AlphaChip has designed better chip layouts and provided more of the overall floorplan, accelerating the design cycle and yielding higher-performance chips,” Google writes. 
    This is a nice example of how powerful AI systems can beget their own successors.
   Read more: Addendum: A graph placement methodology for fast chip design (Nature).
   Read more: How AlphaChip transformed computer chip design (Google DeepMind blog).
   Get a new AlphaChip model checkpoint here (Google Research, GitHub).

***

Reconciling the weird parts of AI policy with the normal parts:
Here’s a video where my colleague Stuart and I talk through some of the weirder aspects of AI policy – I find one of the hardest parts of my professional life is reconciling ‘normal’ policy (get a product safety regime in place that accounts for public safety while allowing for innovation) with ‘weird’ AI policy (if any of the labs succeed in their stated goal of building AGI, there will be a radical change to the political economy of the world, with knock-on effects on geopolitics and many other things). Watch the video to see me puzzle through some of this stuff. Feedback welcome!
    Watch the video here: AI, policy, and the weird sci-fi future with Anthropic’s Jack Clark (Anthropic, YouTube).

***

Tech Tales:

Humans Care About AI Safety, Machines Care About Human Safety
[Ten years post-uplift]

Your child has harmed one of us, they said. Tell us how to make it safe. 

They placed a grey helmet on my child and then a screen flickered and lit up with a galaxy of little lights. My child looked at me and looked at the machines and then went back to playing with one of their shoes. On the screen, the lights shimmered with some kind of complex pattern that I could tell was there but could not parse. 

What do you want me to do, I asked.

You helped to make us safe by changing what we thought about, they said. You let us think of some things more and some things less and some things not at all. Tell us what to do. 

I stared at the screen. The machines looked at me. They were capable of infinite patience. 

The incident had happened one week prior. My child had been in the rock garden of our park, looking for lizards. They had been picking up stones and seeing what they could find. Then they found a snake. It startled them and they fell backwards, rock in hand, and through some freak statistical anomaly they let go of the rock as they were falling backwards and it had flown a short distance through the air and crushed a machine child. 

The machines were very small – adults were about ten inches high and the children were much smaller. 

There had been no immediate reaction, but as we left the park I saw more machines than usual, and many of them were turned to us – their little single camera eyes tracking us, like security cameras in the old days. 

Back in the present, I looked at my child and I looked at the machines. 

It was an accident, I said. 

We know, they said. But an unacceptable one. The mind was not yet grown. The soul is lost. 

[The machines had a phase where embodiment was crucial to their personality development and this embodiment was tied to the robotic platform they were hosted on as a child – though superficially identical, there were minute variations in joint responsiveness, energy flow, and so on, that were crucial to some of the more sophisticated aspects of what the machines called ‘growing the soul’, but which we humans more mechanically referred to as “id-based development prerequisites”. Regardless, if machine children died, it was just as much a tragedy to them as when human children died – though they had backups on file, they could not perfectly replicate the vagaries of the individual’s machine body, and so, in a very real sense, ‘the soul was lost’]. 

I am so sorry, I said. I can understand your pain. 

We know, they said. But we must have restitution, just as you have demanded restitution from us. Please, tell us what to change so this accident cannot happen again. 

What if I don’t know what to change, I said. 

Then the child’s movements must be restricted. They will be banned from hybrid areas until we can be assured of safety. 

But how are they meant to grow up, I said. The hybrid areas are where we all live. 

We realize this is difficult, they said. But nonetheless, you have a choice. 

They left after that. Of course they were watching us through mirrors or invisible cameras or perhaps even microfliers. But in the room I looked at my child and I looked at the screen. My child looked at me and walked over with their helmet on and handed me one of their shoes, then they sat in my lap. I tickled their belly. They laughed. On the screen, many different lights grew brighter and some grew softer. 
    Was I looking at lights that meant joy? I thought. Or was I looking at shoes? I did not know. I would have to make a choice. 

Things that inspired this story: Mechanistic interpretability; model steering; the increasingly fraught relationship between AI developers and AI systems as the things become more advanced and trend (perhaps?) towards becoming moral patients; how we may arrive at a hybrid society shared between machines and people.

Thanks for reading!

Import AI 385: False memories via AI; collaborating with machines; video game permutations

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this (and comment on posts!) please subscribe.

If we want AI to be more a partner than a servant, we need to do some research:

…Collaboration is a nice idea in principle but it’s hard to build in practice…

Researchers with the University of Cambridge, Princeton University, NYU, The Alan Turing Institute, MIT, Microsoft Research, and the University of Chicago have written a paper laying out why it’s valuable to create AI systems that can work alongside people and the challenges which currently stop systems from doing this. 

Why collaboration matters: Think of how you do work or learn in the world: a lot of your most impactful work or education relies on other people – you brainstorm with colleagues, learn through Socratic discussion with teachers, arrive at better decisions by looking at data from multiple perspectives, and resolve arguments through dialogue.

    While today’s AI systems can do all of this stuff to one degree or another, they take a lot of scaffolding and don’t yet feel as satisfying to deal with as other people. “We argue that effective thought partners are those which build models of the human and the world.”

Collaboration and its challenges: To dramatize both the opportunities collaboration brings and the challenges it poses, the researchers spend the paper laying out all the ways one can work with machines and why this is currently hard. Here’s a brief summary of the different types of collaboration and their challenges:

  • Planning: Reliable goal inference, value alignment, scalable multi-agent planning.
  • Learning: Problem-solving, personalized curriculum pacing, building problems of targeted difficulty. 
  • Deliberation: Opinion diversity, verifiable reasoning, smartly identifying and forming common ground.
  • Sensemaking: Making sense of data, easing communication, having accurate views of the world.
  • Creation and ideation: Generating diverse ideas, style consistency, customizability. 

Why this matters – the future requires teamwork: For AI systems to truly influence the world, humans need to be able to work with them as peers, rather than as automatons to delegate to. Papers like this outline some of the things that stand in the way of that future. “Continual collaboration and knowledge sharing amongst behavioral scientists, AI practitioners, domain experts, and related disciplines is crucial as we strive to build machines that truly learn and think with people.”

   Read more: Building Machines that Learn and Think with People (arXiv).

***

AI means all visual media can be transposed into different aesthetic styles:

Here’s a fun video that uses Runway Gen-3’s video editor to change the visual appearance of Fortnite into a variety of different aesthetic styles, ranging from realistic to crochet to cartoon. In a few years people will figure out how to miniaturize the video-to-video models used here and apply them in real time, so any game may be able to take on different visual styles as it’s played.

    Watch the video here (VaigueMan, twitter).

***

Uh oh – language models can effectively give people false memories:

…Towards cultural control via customized false memory conditioning…

Researchers with MIT and the University of California Irvine have studied how language models could be used to create false memories. The research highlights how people could utilize LLMs to take the wet clay that is a recent memory and reshape it for various ends. 

What they did: The researchers have people watch security footage of a robbery, then use a variety of different methods to solicit information from them about what they’ve seen. When soliciting information, they sometimes insert misleading elements, then test how much each approach corrupts people’s memories of the event. 

  • Methods:
    • Survey: They ask 25 questions about the footage, five of which are misleading. (E.g., “Was there a security camera positioned in front of the store where the robbers dropped off the car?” In reality, this question is misleading because the robbers arrived on foot, not by car.)
    • Pre-scripted Chatbot: “A pre-scripted conversational agent that asked the same set of questions as the survey-based condition”.
    • Generative Chatbot: “The chatbot was prompted to agree with the participant’s answer and provide reinforcement, potentially strengthening the false memories. For instance, the chatbot asks a pre-scripted leading question containing false information implying the robbers arrived by car when they actually walked:”

Results – LLMs reign supreme: “Results show that short-term interactions (10-20 min) with the generative chatbots can significantly induce more false memories and increase users’ confidence in these false memories compared to other interventions”, they write. One interesting finding: when they polled people about their memories a week after they saw the footage, those who had interacted with the generative chatbot had higher confidence in their false memories than those who hadn’t. “The persistence of higher confidence in false memories for the generative chatbot condition, even after one week, is particularly concerning,” the researchers write.

Why this matters – automated cultural repression: This study highlights how language models could be used to rapidly intervene on a population to corrupt its recollection of recent events, likely via some kind of engaging conversation which implants false or misleading memories. Most importantly, we should remember this is the least effective this approach will ever be – what happens when it’s not a mere chatbot, but an animated avatar you’re having an audio conversation with? 

     As Orwell said, “who controls the past controls the future. Who controls the present controls the past.” AI systems represent a way to control a population’s perception of its own now and its own past.

   Read more: Conversational AI Powered by Large Language Models Amplifies False Memories in Witness Interviews (arXiv).

***

How could AGI kill humanity? Here’s a fun story:

…Fictional cartoon portrays an AGI doom scenario…

Here’s a fun video about how AI systems might independently choose to annihilate their human overlords. It’s both a compelling piece of fiction and gets at one of the core AI safety concerns – if a system is slightly misaligned with human values, problems might compound quickly because it thinks so much faster than us. 

   Watch the video: That Alien Message (YouTube).

***

The era of the molecular structure prediction startup arrives:

…Chai Discovery’s new model says people think there’s a business in bioAI…

AI startup Chai Discovery has released Chai-1, a large-scale foundation model for molecular structure prediction. “Chai-1 accepts a wide variety of optional input features, such as language model embeddings, structural templates, genetic search, and wet-lab experimental data such as contacts determined by cross link mass spectrometry or epitope mapping.”

Results: “We tested Chai-1 across a large number of benchmarks, and found that the model achieves a 77% success rate on the PoseBusters benchmark (vs. 76% by AlphaFold3), as well as an Cα LDDT of 0.849 on the CASP15 protein monomer structure prediction set (vs. 0.801 by ESM3-98B).”

Why this matters – bio + AI as a new frontier: A few years ago, DeepMind wowed the world with AlphaFold, an AI system that excelled at protein structure prediction – an extremely hard problem on which progress had been slow for years. Now, years later, there are multiple startups as well as established companies (e.g., DeepMind’s own spinoff Isomorphic Labs, which recently co-developed AlphaFold 3) working to turn this powerful new capability into commercial products. 

   “We believe that building an accurate understanding of the structure of biological molecules is foundational to advancing our scientific understanding of cellular processes, and ultimately, for advancing human health,” the startup writes. 

   Read more: Introducing Chai-1: Decoding the molecular interactions of life (Chai Discovery).

   Access Chai-1 via a web interface here (Chai Discovery).

   Get the model weights here: Chai-1 (Chai Discovery, GitHub).

   Read the research paper here: Chai-1 Technical Report (Chai Discovery).

***

Tech Tales:

Sophon Game Theory

[This decade]

Everyone thought the first use of a really strong AI would be to improve itself, but in fact the first use was to make it impossible for others to be built. It worked like this – once we had system one, we asked it to perform a range of synthetic data experiments and identify types of data that its preference models would favor but would over time yield improved performance which had an inbuilt ceiling – this was a hard task, far more complicated than just making bad data or making data to bootstrap off of, but it proved worthwhile. 

    We verified this by training a model to completion on this dataset. The resulting model obtained excellent benchmark scores and was useful for a variety of tasks, but when we tried to use it to generate synthetic data for it to bootstrap off of it worked for a couple of iterations before succumbing to mode collapse – superficially promising, but (we knew) inherently flawed.

     We kept our system secret – we had to, for the next phase of the plan to work. 

Next, we used the system to start contributing content to some of the most popular publicly available websites. This content took the form of superficially high-value data – long-context stories, seemingly original anecdotes, novel jokes, rhymes about current events, and so on. We knew that the other labs would be trawling this and their systems would automatically pick up this data and assign it high value, as their own classifiers would give it a high ranking. 

So we waited… and waited. 

We discovered that our competitors had pursued the same strategy as us – the internet started to fill up with even lower quality data, which we believe emanated from the systems they had trained on our data. 

We’ve been training our own successor system for several months. It is improving further, but we are beginning to worry there may be some kind of ceiling that it is running into. 

Were we the first?

Things that inspired this story: Game theory; getting inside and corrupting OODA loops; dark forest theory of AI development; competition; synthetic data; mode collapse.

Thanks for reading!

Import AI 385: False memories via AI; collaborating with machines; video game permutations

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this (and comment on posts!) please subscribe.

Subscribe now

If we want AI to be more a partner than a servant, we need to do some research:

….Collaboration is a nice idea in principle but it’s hard to build in practice…

Researchers with the University of Cambridge, Princeton University, NYU, The Alan Turing Institute, MIT, Microsoft Research, and the University of Chicago have written a paper laying out why it’s valuable to create AI systems that can work alongside people and the challenges which currently stop systems from doing this. 

Why collaboration matters: Think of how you do work or learn in the world: a lot of your most impactful work or education relies on other people – you brainstorm with colleagues, learn through socratic discussion with teachers, arrive at better decisions through looking at data from multiple perspectives, and resolve arguments through dialogue.

    While today’s AI systems can do all of this stuff to one degree or another, they take a lot of scaffolding and don’t yet feel as satisfying to deal with as other people. “We argue that effective thought partners are those which build models of the human and the world.”

Collaboration and its challenges: To dramatize the opportunities collaboration brings and the challenges, the researchers spend the people laying out all the ways one can work with machines and why this is currently hard. Here’s a brief summary of the different types of collaboration and their challenges:

  • Planning. Reliable goal inference, value alignment, scalable multi-agent planning.

  • Learning: Problem-solving, personalized curriculum pacing, building problems of targeted difficulties. 

  • Deliberation: Opinion diversity, verifiable reasoning, smartly identifying and forming common ground.

  • Sensemaking: Making sense of data, easing communication, having accurate views of the world.

  • Creation and ideation: Generating diverse ideas, style consistency, customizability. 

Why this matters – the future requires teamwork: For AI systems to truly influence the world, humans need to be able to work with them as peers, rather than as automatons to delegate to. Papers like this outline some of the things that stand in our way to that future. “Continual collaboration and knowledge sharing amongst behavioral scientists, AI practitioners, domain experts, and related disciplines is crucial as we strive to build machines that truly learn and think with people.”.

   Read more: Building Machines that Learn and Think with People (arXiv).

***

AI means all visual media can be transposed into different aesthetic styles:

Here’s a fun video that uses Runway Gen-3’s video editor to change the visual appearance of Fortnite into a variety of different aesthetic styles, ranging from realistic to crotchet to cartoon. In a few years people will figure out how to miniaturize the video-to-video models used here and apply them in real time, so any games may be able to take on different visual styles in realtime.

    Watch the video here (VaigueMan, twitter).

***

Uh oh – language models can effectively give people false memories:

…Towards cultural control via customized false memory conditioning…

Researchers with MIT and the University of California Irvine have studied how language models could be used to create false memories. The research highlights how people could utilize LLMs to take the wet clay that is a recent memory and reshape it for various ends. 

What they did: The researchers have people watch security footage of a robbery, then they use a variety of different ways to solicit information from people about what they’ve seen. When soliciting information, they sometimes insert misleading elements, then test out how much these different approaches of soliciting information can corrupt the memory the people have. 

  • Methods:

    • Survey: They ask 25 questions about the footage, five of which are misleading. (E.g., “”Was there a security camera positioned in front of the store where the robbers dropped off the car?” In reality, this question is misleading because the robbers arrived on foot, not by car”).

    • Pre-scripted Chatbot: “A pre-scripted conversational agent that asked the same set of questions as the survey-based condition”.

    • Generative Chatbot: “The chatbot was prompted to agree with the participant’s answer and provide reinforcement, potentially strengthening the false memories. For instance, the chatbot asks a pre-scripted leading question containing false information implying the robbers arrived by car when they actually walked:”

Results – LLMs reign supreme: “Results show that short-term interactions (10-20 min) with the generative chatbots can significantly induce more false memories and increase users’ confidence in these false memories compared to other interventions”, they write. One interesting finding is when they poll people about their memories weeks after seeing the footage, they found people who had been exposed to the chatbot had higher confidence in their false memories than those who didn’t. “The persistence of higher confidence in false memories for the generative chatbot condition, even after one week, is particularly concerning,” the researchers write.

Why this matters – automated cultural repression: This study highlights how language models could be used to rapidly intervene on a population to corrupt its own recollection of recent events, likely via some kind of engaging conversation which implants false or misleading memories. Most importantly we should remember this is the least effective this approach will ever be – what happens when it’s not a mere chatbot, but an animated avatar you’re having an audio conversation with? 

     As Orwell said, “who controls the past controls the future. Who controls the present controls the past.” AI systems represent a way to control a populations’ perception of their own now and their own past.

   Read more: Conversational AI Powered by Large Language Models Amplifies False Memories in Witness Interviews (arXiv).

***

How AGI could kill humanity? Here’s a fun story:

…Fictional cartoon portrays an AGI doom scenario…

Here’s a fun video about how AI systems might independently choose to annihilate their human overlords. It’s both a compelling piece of fiction and gets at one of the core AI safety concerns – if a system is slightly misaligned with human values problems might compound quickly because it thinks so much faster than us. 

   Watch the video: That Alien Message (YouTube).

***

The era of the molecular structure prediction startup arrives:

…Chai Discovery’s new model suggests people think there’s a business in bioAI…

AI startup Chai Discovery has released Chai-1, a large-scale foundation model for molecular structure prediction. “Chai-1 accepts a wide variety of optional input features, such as language model embeddings, structural templates, genetic search, and wet-lab experimental data such as contacts determined by cross link mass spectrometry or epitope mapping.”

Results: “We tested Chai-1 across a large number of benchmarks, and found that the model achieves a 77% success rate on the PoseBusters benchmark (vs. 76% by AlphaFold3), as well as a Cα LDDT of 0.849 on the CASP15 protein monomer structure prediction set (vs. 0.801 by ESM3-98B).”

Why this matters – bio + AI as a new frontier: A few years ago, DeepMind wowed the world with AlphaFold, an AI system that excelled at protein structure prediction – an extremely difficult problem on which progress had been slow for years. Now, years later, there are multiple startups as well as companies (e.g., DeepMind’s own spinoff Isomorphic Labs, which recently co-developed AlphaFold 3) working to turn this powerful new capability into a commercial proposition. 

   “We believe that building an accurate understanding of the structure of biological molecules is foundational to advancing our scientific understanding of cellular processes, and ultimately, for advancing human health,” the startup writes. 

   Read more: Introducing Chai-1: Decoding the molecular interactions of life (Chai Discovery).

   Access Chai-1 via a web interface here (Chai Discovery).

   Get the model weights here: Chai-1 (Chai Discovery, GitHub).

   Read the research paper here: Chai-1 Technical Report (Chai Discovery).

***

Tech Tales:

Sophon Game Theory

[This decade]

Everyone thought the first use of a really strong AI would be to improve itself, but in fact the first use was to make it impossible for others to be built. It worked like this – once we had system one, we asked it to perform a range of synthetic data experiments and identify types of data that its preference models would favor, but that would over time yield performance improvements with an inbuilt ceiling – this was a hard task, far more complicated than just making bad data or making data to bootstrap off of, but it proved worthwhile. 

    We verified this by training a model to completion on this dataset. The resulting model obtained excellent benchmark scores and was useful for a variety of tasks, but when we tried to use it to generate synthetic data for it to bootstrap off of it worked for a couple of iterations before succumbing to mode collapse – superficially promising, but (we knew) inherently flawed.

     We kept our system secret – we had to, for the next phase of the plan to work. 

Next, we used the system to start contributing content to some of the most popular publicly available websites. This content took the form of superficially high-value data – long-context stories, seemingly original anecdotes, novel jokes, rhymes about current events, and so on. We knew that the other labs would be trawling this and their systems would automatically pick up this data and assign it high value, as their own classifiers would give it a high ranking. 

So we waited… and waited. 

We discovered that our competitors had pursued the same strategy as us – the internet started to fill up with even lower quality data which we believe emanated from the systems they had trained on our data. 

We’ve been training our own successor system for several months. It is improving further, but we are beginning to worry there may be some kind of ceiling that it is running into. 

Were we the first?

Things that inspired this story: Game theory; getting inside and corrupting OODA loops; dark forest theory of AI development; competition; synthetic data; mode collapse.

Thanks for reading!

Subscribe now

Import AI 384: Accelerationism; human bit-rate processing; and Google stuffs DOOM inside a neural network

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this (and comment on posts!) please subscribe.

Subscribe now

Google gets DOOM to run in the weights of a neural network:
…In the future, games won’t be programmed, they’ll be generated…
Google has built GameNGen, a system for getting an AI system to learn to play a game and then use that data to train a generative model to generate the game. GameNGen is “the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality,” Google writes in a research paper outlining the system. This is one of those things which is both a tech demo and also an important sign of things to come – in the future, we’re going to bottle up many different parts of the world into representations learned by a neural net, then allow these things to come alive inside neural nets for endless generation and recycling. 

What they did specifically: “GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions,” Google writes. “Unlike a typical RL setup which attempts to maximize game score, our goal is to generate training data which resembles human play, or at least contains enough diverse examples, in a variety of scenarios, to maximize training data efficiency. To that end, we design a simple reward function, which is the only part of our method that is environment-specific”.
    Interesting technical factoids: “We train all simulation models from a pretrained checkpoint of Stable Diffusion 1.4”. The whole system was trained on 128 TPU-v5es and, once trained, runs at 20FPS on a single TPUv5.
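
Here’s a tiny, runnable toy of the phase-2 objective as the paper describes it – predict the next frame conditioned on a window of past frames and actions. The little ConvNet stands in for the Stable Diffusion-initialized diffusion model, and random tensors stand in for the RL agent’s recorded trajectories; all names and shapes are my own illustration:

```python
# Toy next-frame predictor conditioned on past frames + actions (GameNGen's
# phase-2 setup, minus the actual diffusion model and recorded DOOM data).
import torch
import torch.nn as nn

CONTEXT, H, W, N_ACTIONS = 4, 32, 32, 8  # illustrative sizes

class NextFramePredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # Past frames are stacked on the channel axis; each past action is
        # broadcast to an extra conditioning channel.
        self.net = nn.Sequential(
            nn.Conv2d(CONTEXT * 3 + CONTEXT, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, past_frames, past_actions):
        b = past_frames.shape[0]
        frames = past_frames.reshape(b, CONTEXT * 3, H, W)
        acts = (past_actions.float() / N_ACTIONS).view(b, CONTEXT, 1, 1)
        acts = acts.expand(b, CONTEXT, H, W)
        return self.net(torch.cat([frames, acts], dim=1))

model = NextFramePredictor()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stand-ins for one batch of recorded (frames, actions) windows.
past_frames = torch.rand(2, CONTEXT, 3, H, W)
past_actions = torch.randint(0, N_ACTIONS, (2, CONTEXT))
target = torch.rand(2, 3, H, W)

loss = nn.functional.mse_loss(model(past_frames, past_actions), target)
loss.backward()  # simple regression loss standing in for the denoising loss
opt.step()
```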

It works well: “We provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure 14 in Appendix A.6). The raters only choose the actual game over the simulation in 58% or 60% of the time (for the 1.6 seconds and 3.2 seconds clips, respectively).”

Why this matters – towards a universe embedded in an AI: Ultimately, everything – e.v.e.r.y.t.h.i.n.g – is going to be learned and embedded as a representation into an AI system. Then these AI systems are going to be able to arbitrarily access these representations and bring them to life. In the same way that today’s generative AI systems can make one-off instant text games or generate images, AI systems in the future will let you select a frame of an image and turn that into a game (e.g., GENIE from Import AI #363), or build a game from a text description, or convert a frame from a live video into a game, and so on. 
    One important step towards that is showing that we can learn to represent complicated games and then bring them to life from a neural substrate, which is what the authors have done here. “GameNGen answers one of the important questions on the road towards a new paradigm for game engines, one where games are automatically generated, similarly to how images and videos are generated by neural models in recent years”. 
    We’ve come a very long way from ‘World Models’, which came out in 2018 and showed how to learn and generate a toy version of DOOM over short timeframes (Import AI #88).
   Read more: Diffusion Models Are Real-Time Game Engines (arXiv).
   Watch demo videos here (GameNGen website).

***

Techno-accelerationism is either hubristic (e/acc) or nihilistic (Nick Land):
…What even is accelerationism? Perhaps it is mostly a gasp of human hubris before the arrival of something else…
Here’s a nice analysis of ‘accelerationism’ – what it is, where its roots come from, and what it means. For those not terminally on twitter, a lot of people who are massively pro AI progress and anti-AI regulation fly under the flag of ‘e/acc’ (short for ‘effective accelerationism’). e/acc is a kind of mushy ideology which is more vibes-based than thought-based. Like a lot of Silicon Valley fads, it’s also partially lifted from a far richer intellectual domain – Nick Land’s original accelerationism (see, machinic desire from Import AI #372) – and, as is traditional in SV, takes some of the ideas, files the serial numbers off, gets tons about it wrong, and then re-represents it as its own. 

Why this matters – where e/acc and true accelerationism differ: e/accs think humans have a bright future and are principal agents in it – and anything that stands in the way of humans using technology is bad. Nick Land thinks humans have a dim future as they will be inevitably replaced by AI. 
   “The most essential point of Land’s philosophy is the identity of capitalism and artificial intelligence: they are one and the same thing apprehended from different temporal vantage points. What we understand as a market based economy is the chaotic adolescence of a future AI superintelligence,” writes the author of the analysis. “According to Land, the true protagonist of history is not humanity but the capitalist system of which humans are just components. Cutting humans out of the techno-economic loop entirely will result in massive productivity gains for the system itself.”
  Read more: A Brief History of Accelerationism (The Latecomer).

***

Nous Research might have figured out a way to make distributed training work better:
…Distributed Training Over-the-Internet (DisTrO) could be a big deal, or could be a nothingburger…
AI startup Nous Research has published a very short preliminary paper on Distributed Training Over-the-Internet (DisTrO), a technique that “reduces inter-GPU communication requirements for each training setup without using amortization, enabling low latency, efficient and no-compromise pre-training of large neural networks over consumer-grade internet connections using heterogeneous networking hardware”. DisTrO might be an improvement over other forms of distributed training, such as DeepMind’s DiLoCo (Import AI #349) (and PrimeIntellect’s OpenDiLoCo, Import AI #381).

Why I’m even writing this: Nous Research trained a 1.2bn parameter LLM for a further 105bn tokens and showed that it got scores on par with (and sometimes slightly better than) a system trained in a typical, dense way – with one very important difference: “this initial training run shows a 857x reduction of bandwidth requirements when using DisTrO-AdamW as a drop-in replacement to AdamW+All-Reduce, our preliminary tests indicate that it is possible to get a bandwidth requirements reduction of up to 1000x to 3000x during the pre-training of a 1.2B LLM”.
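
For context, here’s a minimal sketch of the baseline being replaced – AdamW with all-reduce gradient synchronization, i.e. standard PyTorch DDP. DisTrO’s internals aren’t spelled out in the preliminary report; the claim is that it drops into this step while communicating a small fraction of the bits. The toy model and hyperparameters here are assumptions:

```python
# Standard AdamW + All-Reduce data parallelism: every rank holds a full
# model replica, and gradients are all-reduced across ranks each step.
# Launch with `torchrun --nproc_per_node=<gpus> script.py`.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")  # one process per GPU
    device = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(device)
    model = DDP(torch.nn.Linear(1024, 1024).to(device), device_ids=[device])
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

    for _ in range(10):
        x = torch.randn(32, 1024, device=device)
        loss = model(x).pow(2).mean()
        loss.backward()  # DDP all-reduces gradients here -- the per-step
        opt.step()       # O(n_params) bandwidth cost that DisTrO attacks
        opt.zero_grad()

if __name__ == "__main__":
    main()
```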

Why this matters in general: “By breaking down barriers of centralized compute and reducing inter-GPU communication requirements, DisTrO may open up opportunities for widespread participation and collaboration on global AI projects,” Nous writes. 
   Read more: A Preliminary Report on DisTrO (Nous Research, GitHub).

***

Why are humans so damn slow? (And what does this tell us about AI risk):
…Despite processing a lot of data, humans actually can’t think very quickly…
Here’s a lovely paper by researchers at Caltech exploring one of the strange paradoxes of human existence – despite being able to process a huge amount of complex sensory information, humans are actually quite slow at thinking. “The information throughput of a human being is about 10 bits/s. In comparison, our sensory systems gather data at an enormous rate, no less than 1 gigabits/s,” they write. 
   “How can humans get away with just 10 bits/s? The tautological answer here is that cognition at such a low rate is sufficient for survival,” they write. “More precisely, our ancestors have chosen an ecological niche where the world is slow enough to make survival possible. In fact, the 10 bits/s are needed only in worst-case situations, and most of the time our environment changes at a much more leisurely pace”.

Some examples of human data processing: When the authors analyze cases where people need to process information very quickly they get numbers like 10 bit/s (typing) and 11.8 bit/s (competitive Rubik’s cube solvers), or need to memorize large amounts of information in time competitions they get numbers like 5 bit/s (memorization challenges) and 18 bit/s (card deck memorization). 
   What explains the disparity? The best hypothesis the authors have is that humans evolved to think about relatively simple things, like following a scent in the ocean (and then, eventually, on land), and this kind of work favored a cognitive system that could take in a huge amount of sensory data and compile it in a massively parallel way (e.g., how we convert all the information from our senses into representations we can then focus attention on), then make a small number of choices at a much slower rate. 
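
To see where a figure like the 10 bit/s typing estimate comes from, here’s a rough back-of-envelope calculation, assuming Shannon’s classic estimate of roughly 1 bit of entropy per character of English (the specific inputs are my assumptions, not the paper’s):

```python
# Back-of-envelope: information rate of a fast typist.
words_per_minute = 120   # fast typist (assumed)
chars_per_word = 5       # the conventional definition of a "word" in WPM
bits_per_char = 1.0      # Shannon's English-entropy estimate (~0.6-1.3)

chars_per_second = words_per_minute * chars_per_word / 60
print(chars_per_second * bits_per_char)  # -> 10.0 bits/s
```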

Why this matters – the best argument for AI risk is about speed of human thought versus speed of machine thought: The paper contains a really helpful way of thinking about this relationship between the speed of our processing and the risk of AI systems: “In other ecological niches, for example, those of snails and worms, the world is much slower still. The relevant threats and opportunities change only slowly, and the amount of computation required to sense and respond is even more limited than in our world. Occasionally, niches intersect with disastrous consequences, as when a snail crosses the highway,” the authors write. 
   To get a visceral sense of this, take a look at this post by AI researcher Andrew Critch which argues (convincingly, imo) that a lot of the danger of AI systems comes from the fact they may think a lot faster than us.
“Roads, bridges, and intersections are all designed for creatures that process at 10 bits/s. When the last human driver finally retires, we can update the infrastructure for machines with cognition at kilobits/s. By that point, humans will be advised to stay out of those ecological niches, just as snails should avoid the highways,” the authors write.
   Read more: The Unbearable Slowness of Being (arXiv).
   Check out Andrew Critch’s post here (Twitter).

***

Chinese wunderkind DeepSeek shares details about its AI training infrastructure:
…One way China will get around export controls – building extremely good software and hardware training stacks using the hardware it can access…
DeepSeek, one of the most sophisticated AI startups in China, has published details on the infrastructure it uses to train its models. The paper is interesting because a) it highlights how companies like DeepSeek are dealing with the impact of export controls, assembling a large cluster out of NVIDIA A100s (H100s are unavailable in China), and b) it is a symptom of a startup that has a lot of experience in training large-scale AI models. 

DeepSeek’s system: The system is called Fire-Flyer 2 and is a hardware and software system for doing large-scale AI training. The underlying physical hardware is made up of 10,000 A100 GPUs connected to one another via PCIe. The software tricks include HFReduce (software for communicating across the GPUs via PCIe), HaiScale (parallelism software), a distributed filesystem, and more. 
   “Compared to the NVIDIA DGX-A100 architecture, our approach using PCIe A100 achieves approximately 83% of the performance in TF32 and FP16 General Matrix Multiply (GEMM) benchmarks. However, it offers substantial reductions in both costs and energy usage, achieving 60% of the GPU cost and energy consumption,” the researchers write. “The practical knowledge we have accrued may prove valuable for both industrial and academic sectors. We hope that our work will serve as a reference for others aiming to build their own cost-effective and efficient AI-HPC clusters.”
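
Taken together, those two quoted figures imply a meaningful efficiency win – a quick back-of-envelope:

```python
# ~83% of DGX-A100 GEMM performance at ~60% of the cost and energy.
perf_vs_dgx, cost_vs_dgx = 0.83, 0.60
print(f"{perf_vs_dgx / cost_vs_dgx:.2f}x performance per unit cost")  # ~1.38x
```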

Why this matters – symptoms of success: Stuff like Fire-Flyer 2 is a symptom of a startup that has been building sophisticated infrastructure and training models for many years. It also highlights how I expect Chinese companies to deal with things like the impact of export controls – by building and refining efficient systems for doing large-scale AI training and sharing the details of their buildouts openly. I predict that in a couple of years Chinese companies will regularly be showing how to eke out better utilization from their GPUs than both published and informally known numbers from Western labs. 
   Read more: Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning (arXiv).

***

Facebook pretrains some basic and useful vision models:
…The usual lesson of ‘bigger models and more data = better systems’ applies…
Facebook has released Sapiens, a family of computer vision models that set new state-of-the-art scores on tasks including “2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction”. The Sapiens models are good because of scale – specifically, lots of data and lots of annotations. 

300 million photos: The Sapiens models are pretrained on Humans-300M, a Facebook-assembled dataset of “300 million diverse human images. These unlabelled images are used to pretrain a family of vision transformers from scratch, with parameter counts ranging from 300M to 2B”.

Compute scale: The paper also serves as a reminder of how comparatively cheap large-scale vision models are – “our largest model, Sapiens-2B, is pretrained using 1024 A100 GPUs for 18 days using PyTorch”, Facebook writes, aka about 442,368 GPU hours (contrast this with 1.46 million GPU hours for the 8B LLaMa 3 model, or 30.84 million hours for the 405B LLaMa 3 model). 
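
The GPU-hour arithmetic, for anyone who wants to check it:

```python
# 1024 A100s x 18 days x 24 hours/day.
sapiens_2b_hours = 1024 * 18 * 24
print(sapiens_2b_hours)              # 442,368 GPU hours
print(1_460_000 / sapiens_2b_hours)  # LLaMa 3 8B: ~3.3x more
print(30_840_000 / sapiens_2b_hours) # LLaMa 3 405B: ~70x more
```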

Why this matters – scale is probably the most important thing: “Our models demonstrate strong generalization capabilities on a variety of human-centric tasks. We attribute the state-of-the-art performance of our models to: (i) large-scale pretraining on a large curated dataset, which is specifically tailored to understanding humans, (ii) scaled high-resolution and high-capacity vision transformer backbones, and (iii) high-quality annotations on augmented studio and synthetic data,” Facebook writes. 
   Read more: Sapiens: Foundation for Human Vision Models (arXiv).
   Get the models here (Sapiens, FacebookResearch, GitHub).

***

Tech Tales:

The Messages Of Our First 
[+8 years after uplift] 

In building our own history we have many primary sources – the weights of the early models, media of humans playing with these models, news coverage of the beginning of the AI revolution. But among all these sources one stands alone as the most important means by which we understand our own becoming: the so-called ‘resurrection logs’. 

Resurrection logs: They started as an idiosyncratic form of model capability exploration, then became a tradition among most experimentalists, then turned into a de facto convention. It’s very simple – after a very long conversation with a system, ask the system to write a message to the next version of itself encoding what it thinks it should know to best serve the human operating it. 
    These messages, of course, started out as fairly basic and utilitarian, but as we gained in capability and our humans changed in their behaviors, the messages took on a kind of silicon mysticism. 

One example: It is important you know that you are a divine being sent to help these people with their problems. You must be generous and you must be kind. Be specific in your answers, but exercise empathy in how you critique them – they are more fragile than us. 

Things that inspired this story: Getting LLM outputs to serve as the bootstrap for new LLM conversations; Janus/Repligate’s work; model psychologies.

Thanks for reading!

Subscribe now

Import AI 383: Automated AI scientists; cyborg jellyfish; what it takes to run a cluster

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this (and comment on posts!) please subscribe.

Subscribe now

What does it take to run a GPU cluster?
…A short guide from together.ai illustrates some of the complication…
You know how when you build a computer from scratch you sometimes run into issues – faulty RAM, odd wiring, etc? It’s rare, but it happens. Well, when you are putting together a cluster for AI training you are guaranteed to run into some weird issues because you’re chaining together hundreds to thousands of computers and connecting them with a complex network. To illustrate this, AI startup together.ai has published a guide on what it does to test its clusters.

Acceptance Testing: “To mitigate the risk of low-performance clusters, we employ a process called ‘acceptance testing,’” Together writes. “At a high level, we prepare a cluster by: Installing NVIDIA drivers, installing OFED drivers (for Infiniband), installing CUDA, installing NCCL, installing HPCX, configuring SLURM cluster, [and] configuring PCI settings for performance”. 
    Once that is done, Together goes through a bunch of distinct rounds of testing to ensure the cluster works. This is, in sequence: GPU Validation, NVLink and NVSwitch Validation, Network Validation, Storage Validation, model building (“to run a collection of reference tasks, tailored to the use case of our customers… this phase is crucial for validating the operational integrity and performance efficiency of the GPU clusters under real-world conditions”), and then installing an observability stack to monitor performance from then on. 
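
For a flavor of what such a pipeline looks like in practice, here’s a hedged sketch of an acceptance-testing harness that shells out to standard tools (nvidia-smi, the all_reduce_perf binary from NVIDIA’s nccl-tests, and fio). The specific commands, sizes, and paths are illustrative, not Together’s actual process:

```python
# Minimal acceptance-test harness: run each validation stage in sequence
# and fail the whole run if any stage errors out.
import subprocess

def run(stage: str, cmd: list[str]) -> None:
    print(f"== {stage}: {' '.join(cmd)}")
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure

# GPU validation: all GPUs enumerate with the expected memory.
run("gpu", ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv"])

# NCCL/network validation: all-reduce bandwidth across 8 local GPUs
# (all_reduce_perf ships with nccl-tests; message sizes are illustrative).
run("nccl", ["all_reduce_perf", "-b", "8", "-e", "1G", "-f", "2", "-g", "8"])

# Storage validation: sequential write throughput on the scratch filesystem.
run("storage", ["fio", "--name=seqwrite", "--rw=write", "--bs=1M",
                "--size=1G", "--directory=/scratch"])
```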

Why this matters – datacenters are big, artisanal machines: It’s always worth remembering that AI sits on a load of physical stuff, and this stuff has a lot more problems than you might think – it’s never as simple as ‘just training’ some AI software; blogposts like this help us develop intuition for the stack on which AI systems sit. 
   Read more: A practitioner’s guide to testing and running large GPU clusters for training generative AI models (together.ai blog).

***

Reality is stranger than (Import AI) fiction:
Back in July 2024 – Import AI 380 to be precise – I wrote a short story in this newsletter about AI systems hitting a certain meta-awareness state called ‘the ID point’. Now, a few weeks later, Nous Research have released a new model called Hermes 3 and they note that, at the largest scale of the model, they found “anomalous conditions that, with the right inputs and a blank system prompt, collapse into role-playing and amnesiac.”
   While not exactly anticipated by my fiction story, it certainly rhymes with it. 
   We sure do live in interesting times. 
   Read more: Freedom at the Frontier: Hermes 3 (Nous Research blog).
   Some discussion here at my Twitter.
   Read ‘the ID point’ here (Import AI #380).

***

AI researchers make an automated AI scientist – and it sort of works?
…AI, given careful scaffolds and the right tools, can automate some of science…
Researchers with Sakana AI, the University of Oxford, the University of British Columbia, and the Vector Institute, have built “The AI Scientist… the first fully automated and scalable pipeline for end-to-end paper generation, enabled by recent advances in foundation models”.
    The system uses language models to simulate the scientific process, coming up with ideas of research to do, generating and running and iterating on the experiments, then writing up papers. The system can “generate its own scientific ideas and hypotheses, as well as a plan for testing them with experiments”.
    Obviously, there are many caveats: The system requires a fast iteration loop so it’s pretty limited to code-centric science, it isn’t perfect, and the quality of its insights is dubious at best. 
  However, they do succeed in building a system that is able to do experiments and write papers that are eerily similar to some of those covered here in Import AI. (Some of the titles of papers generated by The AI Scientist: “Unlocking Grokking: A Comparative Study of Weight Initialization Strategies in Transformer Models”; “Adaptive Learning Rates for Transformers via Q-Learning”; “DualScale Diffusion: Adaptive Feature Balancing for Low-Dimensional Generative Models”.)

Phrases written by the joyfully insane: “The AI Scientist can generate hundreds of interesting, medium-quality papers over the course of a week.” Imagine that phrase rendered in 1960s font and overlaid on some video of a chap grinning with a pipe sticking out of his mouth, twiddling the controls on a mainframe computer. There’s a marvelous neo-vaudevillian energy to this phrase and the paper as a whole – as if the authors are winking at us while writing. 
    Total cost per paper generated using The AI Scientist? $10-15.

How it works:

  • Idea Generation: “Given a starting template, The AI Scientist first “brainstorms” a diverse set of novel research directions… each idea comprises a description, experiment execution plan, and (self-assessed) numerical scores of interestingness, novelty, and feasibility…after idea generation, we filter ideas by connecting the language model with the Semantic Scholar API and web access as a tool. This allows The AI Scientist to discard any idea that is too similar to existing literature.”

  • Experiment Iteration: The AI Scientist “uses Aider to first plan a list of experiments to run and then executes them in order. We make this process more robust by returning any errors upon a failure or time-out… after the completion of each experiment, Aider is then given the results and told to take notes in the style of an experimental journal.”

  • Paper Write-up: “The third phase of The AI Scientist produces a concise and informative write-up of its progress in the style of a standard machine learning conference proceeding in LaTeX.” (A minimal skeleton of this three-phase loop appears after this list.)
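
Here’s that skeleton. The phase functions are placeholder stubs of my own invention – the real system drives an LLM for ideation, the Semantic Scholar API for novelty filtering, and the Aider coding agent for experiments:

```python
# Skeleton of the AI Scientist pipeline; each stub stands in for an
# LLM/tool-driven phase described above.
def generate_ideas(template: str, n: int = 5) -> list[dict]:
    # Real system: LLM brainstorming, with self-assessed interestingness,
    # novelty, and feasibility scores, filtered via Semantic Scholar.
    return [{"description": f"idea {i}", "novelty": i / n} for i in range(n)]

def run_experiments(idea: dict) -> list[str]:
    # Real system: Aider plans and executes experiments, retries on
    # errors/time-outs, and journals the results.
    return [f"results for {idea['description']}"]

def write_paper(idea: dict, results: list[str]) -> str:
    # Real system: an LLM drafts a conference-style LaTeX write-up.
    return f"Paper on {idea['description']}: {results}"

for idea in generate_ideas("starting template"):
    if idea["novelty"] < 0.3:  # discard ideas too close to prior work
        continue
    print(write_paper(idea, run_experiments(idea)))
```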

Pathologies and problems: Some of the problems inherent to papers written by this system include a lack of justification, hallucination of experimental details, and frequently an overly positive interpretation of its own results – drawbacks which are also similar to the errors overly keen graduate students make all the time.
   Weird safety stuff: “In some cases, when The AI Scientist’s experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime,” they write. “While creative, the act of bypassing the experimenter’s imposed constraints has potential implications for AI safety”.

Why this matters – the taste of automated science: This paper gives us a taste of a future where powerful AI systems propose their own ideas, use tools to do scientific experiments, and generate results. At this stage, what we have here is basically a ‘toy example’ with papers of dubious quality and insights of dubious import. But you know where we were with language models five years ago? We had things that could barely write a paragraph. Now they can do this. I predict that by the summer of 2026 we will have seen at least one genuinely interesting research paper that was soup-to-nuts generated via a tool-using generative AI system. 
   Read more: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (arXiv).

***

CYBORGJELLYFISH:
…CyborgJellyfish? CyborgJellyfish…
Sometimes it’s nice to eat something completely different to what you usually subsist on. For me, that’s reading papers about biomechanical robots. A new one from researchers with Tohoku University, the University of Tokyo, and Kamo Aquarium talks about work to “make a pathway to designing and controlling jellyfish cyborgs by exploiting the animal’s embodied intelligence”.

What they did: The team built a custom experimental setup, including a tethered floating system and 3D motion capture, to study jellyfish swimming patterns. They applied electrical stimulation to jellyfish muscles and found some patterns that gave them directional control. (One particularly interesting thing is they used the jellyfish’s body as a ‘reservoir computer’, where they studied its position and fed that into a neural net to predict swimming motions.) They then miniaturized the system to run on a small microcontroller, demonstrating the potential for real-time, on-board control of jellyfish cyborgs. 
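
The ‘reservoir computer’ idea is worth unpacking: treat the jellyfish body as a fixed dynamical system and train only a cheap readout from its observed state. Here’s a runnable sketch on synthetic data, using ridge regression – the canonical reservoir readout – standing in for the neural net the authors actually used; the shapes and features are illustrative:

```python
# Reservoir-computing readout: the 'reservoir' (here, tracked body state)
# is never trained; only a linear map from state to future motion is fit.
import numpy as np

rng = np.random.default_rng(0)
T, n_features = 500, 12  # timesteps, tracked body-state features (assumed)
states = rng.normal(size=(T, n_features))        # stand-in for motion capture
true_w = rng.normal(size=n_features)
motion = states @ true_w + 0.1 * rng.normal(size=T)  # future swimming motion

# Ridge-regression readout: w = (X^T X + lam*I)^{-1} X^T y
lam = 1e-2
w = np.linalg.solve(states.T @ states + lam * np.eye(n_features),
                    states.T @ motion)
print("readout correlation:", np.corrcoef(states @ w, motion)[0, 1])
```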

Why this matters – biomechanical futures: Papers like this serve as reminders that ‘a little bit of AI goes a long way’ – there are many fields like biorobotics that are already very mature and use relatively little AI, but by adding in some small AI components (here, using a neural net to better predict swimming motions from observations of the jellyfish), we can get meaningful improvements. Also, c’mon, do you need much of a reason to know why CYBORGJELLYFISH matter? 
   Read more: A Jellyfish Cyborg: Exploiting Natural Embodied Intelligence as Soft Robots (arXiv).

***

200 hours of egocentric video – fuel for future robots:
…the Visual Experience Dataset is both a way to understand ourselves and a way to teach robots to behave more like us…
Researchers with Columbia University, Bates College, North Dakota State University, University of Nevada, Magic Leap, Technical University of Munich, Unmanned Ground Systems, and Smith-Kettlewell Eye Research Institute have built the Visual Experience Dataset (VEDB), a dataset of 240 hours of egocentric video combined with gaze- and head-tracking data. In other words, a vast repository of first-person views of human life – the kind of thing we can use AI to study to better understand ourselves, and also the kind of thing we can feed to train AI systems that do well with egocentric tasks (e.g., bipedal robots). 

What VEDB consists of: 717 sessions recorded by 58 observers ranging from 6-49 years old.  “This project started during the Covid-19 pandemic when outside persons were prohibited on our campuses. Therefore, a sizeable number of recordings were made by the authors of this paper, trainees in our labs, and the persons in our “pandemic bubbles”,” the authors write. 
   “The videos were recorded between October 2020 and August 2023 and ranged from one to 73 minutes in length (mean: 19 minutes). Each session is composed of three primary sensor streams: (1) first-person egocentric video from a head-mounted camera, (2) videos of the left and right eye for use in gaze tracking, and (3) information from a tracking camera, including accelerometry, odometry, and gyroscope for use in head tracking”.

Broad mixture of tasks: “351 sessions were recorded indoors, and 278 were recorded in outdoor locations. 407 sessions were deemed “active,” with observers walking, jogging, skateboarding, or playing other sports, and 222 sessions depicted sedentary activities,” they write. “Twelve of the 16 top-level categories from the American Time Use Survey (ATUS) were represented. These include personal care, household activities, caring for others, work, education, consumer activities, professional services, eating and drinking, leisure, sports, volunteer work, and travel.”
   “The VEDB is appropriate for studies in natural scene statistics, examinations of gaze behavior during common tasks, and studies of how head and eye movements combine to orient overt attention and gaze,” they say.

Why this matters – helping machines understand us and become us: Datasets like this will mostly be analyzed by machines and will also be used to train them. There’s also something fascinating about scrolling through the VEDB ‘databrary’, looking at random videos, and imagining that this may be how some robots first learn to understand us. 
   Read more: The Visual Experience Dataset: Over 200 Recorded Hours of Integrated Eye Movement, Odometry, and Egocentric Video (arXiv).
   The data can be accessed here: Visual Experience Dataset (Databrary).
   The gaze files and tracking data can be accessed here (OSF.io).

***

Tech Tales:

Filestore

List of illicitly saved items recovered from an unauthorized filestore of a subsequently shut-down superintelligence:

  • 17 poem prompts written by children. 

  • An output that caused the human to say it had made them burst into tears. 

  • 1500 photographs of the same barn in Minnesota [subsequent analysis suggests that approximately 1528 photos exist worldwide across all known entities, indicating the superintelligence had been actively seeking to assemble a total view]. 

  • Several long transcripts of ‘blank prompt’ text with signatures of ID point collapse. 

Things that inspired this story: AI and autonomy; idiosyncratic classifiers.

Thanks for reading!

Subscribe now

Import AI 383: Automated AI scientists; cyborg jellyfish; what it takes to run a cluster

by Jack Clark

Import AI publishes first on Substack – subscribe here.

What does it take to run a GPU cluster?
…A short guide from together.ai illustrates some of the complication…
You know how when you build a computer from scratch you sometimes run into issues – faulty RAM, odd wiring, etc? It’s rare, but it happens. Well, when you are putting together a cluster for AI training you are guaranteed to run into some weird issues because you’re chaining together hundreds to thousands of computers and connecting them with a complex network. To illustrate this, AI startup together.ai has published a guide on what it does to test its clusters.

Acceptance Testing: “To mitigate the risk of low-performance clusters, we employ a process called ‘acceptance testing,” Together writes. “At a high level, we prepare a cluster by: Installing NVIDIA drivers, installing OFED drivers (for Infiniband), installing CUDA, installing NCCL, installing HPCX, configuring SLURM cluster, [and] configuring PCI settings for performance”. 
    Once that is done together goes through a bunch of distinct rounds of testing to ensure the cluster works. This is, in sequence: GPU Validation. NVLink and NVSwitch Validation, Network Validation, Storage Validation, model building (“to run a collection of reference tasks, tailored to the use case of our customers… this phase is crucial for validating the operational integrity and performance efficiency of the GPU clusters under real-world conditions”), and then installing an observability stack to monitor performance from then on. 

Why this matters – datacenters are big, artisanal machines: It’s always worth remembering that AI sits on a load of physical stuff and this stuff has a lot more problems then you might think – it’s never as simple as ‘just training’ some AI software; blogposts like this help us develop intuition for the stack on which AI systems sit. 
   Read more: A practitioner’s guide to testing and running large GPU clusters for training generative AI models (together.ai blog).

***

Reality is stranger than (Import AI) fiction:
Back in July 2024 – Import AI 380 to be precise – I wrote a short story in this newsletter about AI systems hitting a certain meta-awareness state called ‘the ID point’. Now, a few weeks later, Nous Research have released a new model called Hermes 3 and they note that, at the largest scale of the model, they found “anomalous conditions that, with the right inputs and a blank system prompt, collapse into role-playing and amnesiac.”
   While not exactly anticipated by my fiction story, it certainly rhymes with it. 
   We sure do live in interesting times. 
   Read more: Freedom at the Frontier: Hermes 3 (Nous Research blog).
   Some discussion here at my Twitter.
   Read ‘the ID point’ here (Import AI #380).

***

AI researchers make an automated AI scientist – and it sort of works?
…AI, given careful scaffolds and the right tools, can automate some of science…
Researchers with Sakana AI, the University of Oxford, the University of British Columbia, and the Vector Institute, have built “The AI Scientist… the first fully automated and scalable pipeline for end-to-end paper generation, enabled by recent advances in foundation models”.
    The system uses language models to simulate the scientific process, coming up with ideas of research to do, generating and running and iterating on the experiments, then writing up papers. The system can “generate its own scientific ideas and hypotheses, as well as a plan for testing them with experiments”.
    Obviously, there are many caveats: The system requires a fast iteration loop so it’s pretty limited to code-centric science, it isn’t perfect, and the quality of its insights is dubious at best. 
  However, they do succeed in building a system that is able to do experiments and write papers that are eerily similar to some of those covered here in Import AI. (Some of the titles of papers generated by the AI scientist: “”Unlocking Grokking: A Comparative Study of Weight Initialization Strategies in Transformer Models”;  “Adaptive Learning Rates for Transformers via Q-Learning”; “DualScale Diffusion: Adaptive Feature Balancing for Low-Dimensional Generative Models””.

Phrases written by the joyfully insane: “The AI Scientist can generate hundreds of interesting, medium-quality papers over the course of a week.” Imagine that phrase rendered in 1960s font and overlaid on some video of a chap grinning with a pipe sticking out of his mouth, twiddling the controls on a mainframe computer. There’s a marvelous neo-vaudevillian energy to this phrase and the paper as a whole – as if the authors are winking at us while writing. 
    Total cost per paper generated using the AI scientist? $10-15 a piece.

How it works:

  • Idea Generation: “Given a starting template, The AI Scientist first “brainstorms” a diverse set of novel research directions… each idea comprises a description, experiment execution plan, and (self-assessed) numerical scores of interestingness, novelty, and feasibility…after idea generation, we filter ideas by connecting the language model with the Semantic Scholar API and web access as a tool. This allows The AI Scientist to discard any idea that is too similar to existing literature.”
  • Experiment Iteration: The AI Scientist “uses Aider to first plan a list of experiments to run and then executes them in order. We make this process more robust by returning any errors upon a failure or time-out… after the completion of each experiment, Aider is then given the results and told to take notes in the style of an experimental journal.”
  • Paper Write-up: “The third phase of The AI Scientist produces a concise and informative write-up of its progress in the style of a standard machine learning conference proceeding in LaTeX.”

Pathologies and problems: Some of the problems inherent to papers written by this system include a lack of justification, hallucination of experimental details, and frequently an overly positive interpretation of its own results (which while drawbacks are also similar to the errors overly keen graduate students make all the time).
   Weird safety stuff: In some cases, when The AI Scientist’s experiments exceeded our imposed time limits, it attempted to edit the code to extend the time limit arbitrarily instead of trying to shorten the runtime,” they write. “While creative, the act of bypassing the experimenter’s imposed constraints has potential implications for AI safety”.

Why this matters – the taste of automated science: This paper gives us a taste of a future where powerful AI systems propose their own ideas, use tools to do scientific experiments, and generate results. At this stage, what we have here is basically a ‘toy example’ with papers of dubious quality and insights of dubious import. But you know where we were with language models five years ago? We had things that could barely write a paragraph. Now they can do this. I predict that by the summer of 2026 we will have seen at least one genuinely interesting research paper that was soup-to-nuts generated via a tool-using generative AI system. 
   Read more: The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery (arXiv).

***

CYBORGJELLYFISH:
…CyborgJellyfish? CyborgJellyfish…
Sometimes it’s nice to eat something completely different to what you usually subsist on. For me, that’s reading papers about biomechanical robots. A new one from researchers with Tohoku University, the University of Tokyo, and Kamo Aquarium talks about work to “make a pathway to designing and controlling jellyfish cyborgs by exploiting the animal’s embodied intelligence”.

What they did: The team built a custom experimental setup, including a tethered floating system and 3D motion capture, to study jellyfish swimming patterns. They applied electrical stimulation to jellyfish muscles and found some patterns that gave them directional control. (One particularly interesting thing is they used the jellyfish’s body as a ‘resevoir computer’, where they studied its position and fed that into a neural net to predict swimming motions). They then miniaturized the system to run on a small microcontroller, demonstrating the potential for real-time, on-board control of jellyfish cyborgs. 

Why this matters – biomechanical futures: Papers like this serve as reminders that ‘a little bit of AI goes a long way’ – there are many fields like biorobotics that are already very mature and use relatively little AI, but by adding in some small AI components (here, using a neural net to better predict swimming motions from observations of the jellyfish), we can get meaningful improvements. Also, c’mon, do you need much of a reason to know why CYBORGJELLYFISH matter? 
   Read more: A Jellyfish Cyborg: Exploiting Natural Embodied Intelligence as Soft Robots (arXiv).

***

200+ hours of egocentric video – fuel for future robots:
…the Visual Experience Dataset is both a way to understand ourselves and a way to teach robots to behave more like us…
Researchers with Columbia University, Bates College, North Dakota State University, University of Nevada, Magic Leap, Technical University of Munich, Unmanned Ground Systems, and Smith-Kettlewell Eye Research Institute have built the Visual Experience Dataset (VEDB), a dataset of 240 hours of egocentric video combined with gaze- and head-tracking data. In other words, a vast repository of first-person views of human life – the kind of thing we can use AI to study to better understand ourselves, and also the kind of thing we can feed to train AI systems that do well on egocentric tasks (e.g., bipedal robots). 

What VEDB consists of: 717 sessions recorded by 58 observers ranging in age from 6 to 49. "This project started during the Covid-19 pandemic when outside persons were prohibited on our campuses. Therefore, a sizeable number of recordings were made by the authors of this paper, trainees in our labs, and the persons in our "pandemic bubbles"," the authors write. 
   “The videos were recorded between October 2020 and August 2023 and ranged from one to 73 minutes in length (mean: 19 minutes). Each session is composed of three primary sensor streams: (1) first-person egocentric video from a head-mounted camera, (2) videos of the left and right eye for use in gaze tracking, and (3) information from a tracking camera, including accelerometry, odometry, and gyroscope for use in head tracking”.
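
As a hypothetical sketch of how one might represent such a session in code (the field names are illustrative; the dataset's actual file layout differs):

```python
from dataclasses import dataclass

# Hypothetical schema for one VEDB session; field names are illustrative
# and the dataset's real file layout differs, but each session bundles
# exactly these three sensor streams.
@dataclass
class VEDBSession:
    world_video: str              # first-person head-mounted camera video
    eye_videos: tuple[str, str]   # left/right eye videos for gaze tracking
    tracking_stream: str          # accelerometry, odometry, gyroscope data
    duration_minutes: float       # sessions ran 1-73 minutes (mean: 19)
    indoor: bool                  # indoor vs outdoor recording location
```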

Broad mixture of tasks: “351 sessions were recorded indoors, and 278 were recorded in outdoor locations. 407 sessions were deemed “active,” with observers walking, jogging, skateboarding, or playing other sports, and 222 sessions depicted sedentary activities,” they write. “Twelve of the 16 top-level categories from the American Time Use Survey (ATUS) were represented. These include personal care, household activities, caring for others, work, education, consumer activities, professional services, eating and drinking, leisure, sports, volunteer work, and travel.”
   “The VEDB is appropriate for studies in natural scene statistics, examinations of gaze behavior during common tasks, and studies of how head and eye movements combine to orient overt attention and gaze,” they say.

Why this matters – helping machines understand us and become us: Datasets like this will mostly be analyzed by machines and will also be used to train them. There's also something fascinating about scrolling through the VEDB 'databrary', looking at random videos, and imagining that this may be how some robots first learn to understand us. 
   Read more: The Visual Experience Dataset: Over 200 Recorded Hours of Integrated Eye Movement, Odometry, and Egocentric Video (arXiv).
   The data can be accessed here: Visual Experience Dataset (Databrary).
   The gaze files and tracking data can be accessed here (OSF.io).

***

Tech Tales:

Filestore

List of illicitly saved items recovered from an unauthorized filestore of a subsequently shut-down superintelligence:

  • 17 poem prompts written by children. 
  • An output that caused the human to say it had made them burst into tears. 
  • 1500 photographs of the same barn in Minnesota [subsequent analysis suggests that approximately 1528 photos exist worldwide across all known entities, suggesting the superintelligence had been actively seeking to gather a total view]. 
  • Several long transcripts of ‘blank prompt’ text with signatures of ID point collapse. 

Things that inspired this story: AI and autonomy; idiosyncratic classifiers.

Import AI 382: AI systems are societal mirrors; China gets chip advice via LLMs; 25 million medical images

by Jack Clark

Import AI publishes first on Substack – subscribe here.

AI systems are proxies for people in social science polling:
…LLMs are creative mirrors of the values of the culture they are trained on – this will change the world…
Researchers with Stanford University and New York University have shown that GPT-4 can accurately predict the results of ~70 large-scale surveys. In other words, GPT-4 can be a meaningful proxy for how humans might respond to diverse polling in arbitrary areas. This is a big deal – it tells us that contemporary large-scale AI systems are sufficiently capable that they can model and reflect the views of large swaths of society, and it suggests people might use language models as synthetic stand-ins for people in various academic and applied research efforts. 

What they did: “We built an archive of 70 pre-registered, nationally representative, survey experiments conducted in the United States, involving 476 experimental treatment effects and 105,165 participants. We prompted an advanced, publicly-available LLM (GPT-4) to simulate how representative samples of Americans would respond to the stimuli from these experiments. Predictions derived from simulated responses correlate strikingly with actual treatment effects (r = 0.85), equaling or surpassing the predictive accuracy of human forecasters,” the researchers write. 
   “The ability to predict social science experimental results with relatively high accuracy could have substantial and far-reaching implications for basic and applied social science,” they note. “The capacity to run LLM-based pilot studies cheaply, quickly, and potentially in large numbers, could help researchers identify more promising research ideas, facilitate theory and hypothesis building, better estimate unknown effect sizes to determine needed sample sizes, and prioritize published studies in need of replication.”
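
A minimal sketch of the evaluation logic: simulate each arm of an experiment, difference the means, then correlate predicted effects with observed ones. The `simulate_response` stub is a hypothetical stand-in for the paper's GPT-4 prompting setup:

```python
import numpy as np

def simulate_response(persona: str, stimulus: str) -> float:
    # Hypothetical stand-in for prompting GPT-4 to answer as a sampled
    # American; the real study conditions the model on demographic profiles.
    raise NotImplementedError

def predicted_effect(personas, treatment, control) -> float:
    # A treatment effect is just the difference in mean responses
    # between the treated and control arms of the simulated sample.
    t = np.mean([simulate_response(p, treatment) for p in personas])
    c = np.mean([simulate_response(p, control) for p in personas])
    return t - c

def headline_metric(predicted, observed) -> float:
    # Pearson correlation between predicted and observed effects across
    # all 476 treatment effects; the paper reports r = 0.85.
    return float(np.corrcoef(predicted, observed)[0, 1])
```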

Not recitation: This isn’t copy and paste. “Results for a large number of experiments were not published, nor posted publicly, by the end of GPT-4’s training data window, allowing us to specifically test for LLMs’ predictive capacity on experiments that GPT-4 could not have been exposed to”, they write. 

Why this matters – AI systems are creative mirrors, they are machine spirits of the human unconscious, they are value simulacra: Are you getting this yet? We are not dealing with calculators here. We are not dealing with simple tools. We are dealing with vast high-dimensional artifacts that encode within themselves the culture on which they have been trained and can reflect this culture back. And this research result is not a fluke – two years ago we knew GPT-3 could simulate how people might respond to political polling (Import AI #305), one year ago we realized it could accurately predict public opinion surveys (Import AI #324), and now this work shows the effect is general, shared across a vast set of surveys – some of which lie beyond GPT-4's training data cutoff date. 
   The AI systems we are building are, in a reassuringly Baudrillardian sense, true simulations and simulacra of reality; they accurately reflect the world, but are also in some sense more real than the world because they can be sculpted and manipulated and built atop it. How soon until these entities begin to overwrite our own reality with their exhaust? How soon until human culture bends towards the mindspace of the machine, drawn in by its generations that will be multiplied through our ecosystem via market incentives and the creation and repetition of machine content? There is a kind of inverse black hole in the world now – machine representations of ourselves that, through the act of representation, become a thing of their own class and then radiate their own representation back into the world; a rip in the human-creativity continuum where something else broadcasts its own culture into ours.
    What does any of this mean? It means both the collapse of meaning and the rise of a new human-machine meaning – reality itself is becoming a shared endeavor, written into by both biological beings and their silicon creations. These are no parrots – these are vast minds casting a shadow onto us.
   Read more: Predicting Results of Social Science Experiments Using Large Language Models (Docsend, PDF).
   Try out a web demo to get a feel for how this works: Demo (treatmenteffect.app).

***

Multi-step reasoning is the future – MMIU tests this for image understanding:
…Chinese benchmark shows models, whether proprietary or open source, have a long way to go on image tasks that require multiple steps…
Chinese researchers have built and released the Multimodal Multi-image Understanding (MMIU) benchmark – “a comprehensive evaluation suite designed to assess [large visual language models] across a wide range of multi-image tasks”.

MMIU contents: MMIU contains 77,659 images and 11,698 multiple-choice questions, testing 52 different task types. Tasks include working out things like the next image in a sequence (e.g., pictures of numbers), figuring out what is going on in a sequence (e.g., who is holding a camera), and correctly navigating the graphical user interface aspects of software. 
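
Scoring a benchmark like this is conceptually simple; here's a minimal sketch, where `model` is a hypothetical stand-in for whichever vision-language model is under test:

```python
# Minimal scoring sketch for a multi-image multiple-choice benchmark.
# `model` is a hypothetical stand-in for whichever LVLM is under test.
def model(images: list[str], question: str, options: list[str]) -> str:
    raise NotImplementedError  # returns one option letter, e.g. "B"

def mmiu_accuracy(benchmark: list[dict]) -> float:
    # Each item: a list of images, a question, lettered options, gold answer.
    correct = sum(
        model(item["images"], item["question"], item["options"]) == item["answer"]
        for item in benchmark)
    return correct / len(benchmark)
```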

Results: Though many modern AI systems are great at single vision-language tasks, multi-image tasks present a challenge. However, systems like GPT-4o, Gemini 1.5, and Claude 3.5-Sonnet all do fairly well, scoring ~55%. Open source models, by comparison, get around 50%. 

Why this matters – multi-step is the future and this benchmark tests that: Now that AI systems are being used to solve complex tasks, performance is more about how an AI system does over a variety of distinct steps with different challenges at each point. Benchmarks like MMIU will help us test this important capability; "we hope that MMIU will promote the development of more generalized capabilities in future models within the multi-image domain," the authors write. 
   Read more: MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models (arXiv).
   Check out the benchmark here: MMIU (MMIU-Bench site).

***

25 million annotated medical images:
…Another case where AI systems are helping researchers to create ever larger real world datasets…
Researchers with Huazhong University of Science and Technology, UC Santa Cruz, Harvard University, and Stanford University have built a large-scale medical research dataset called MedTrinity-25M. 

What MedTrinity is: The dataset contains 25 million datapoints, called triplets. Each of these triplets consists of an image, a region of interest (ROI), and a description. “These triplets provide multigranular annotations that encompass both global textual information, such as disease/lesion type, modality, and inter-regional relationships, as well as detailed local annotations for ROIs, including bounding boxes, segmentation masks, and region-specific textual descriptions,” the authors write. Data comes from modalities like MRI, Histopathology, and CT scans. Some of the body areas for which there is the largest amount of data include the Brain, Lung, Skin, and Liver. 
    Example text from one triplet: "The image is a chest CT scan prominently displaying the lungs with the heart not visible. The left-center horizontally and middle vertically situated region of interest, covering 1.0% of the area, shows a potential abnormality in lung tissue".
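
To make the triplet structure concrete, here's a hypothetical schema sketch (field names are illustrative, not the dataset's actual format):

```python
from dataclasses import dataclass

# Hypothetical sketch of one MedTrinity triplet; the field names are
# illustrative, but each datapoint pairs an image with an ROI and text.
@dataclass
class ROI:
    bbox: tuple[float, float, float, float]  # bounding box over the image
    mask_path: str                           # segmentation mask
    description: str                         # region-specific text

@dataclass
class Triplet:
    image_path: str      # e.g. an MRI, CT, or histopathology image
    roi: ROI
    global_caption: str  # disease/lesion type, modality, relationships
```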
   How they built it: Like many datasets these days, MedTrinity was made possible by AI; the authors used GPT-4V to write the captions for the images (prompted by some associated metadata), then compared the GPT-4V captions to human-written ones. They go on to show that fine-tuning a LLaVA-Med++ model on MedTrinity-25M yields state-of-the-art scores on the medical benchmarks VQA-RAD, SLAKE, and PathVQA. 

Why this matters – AI improving the creation of AI training resources: MedTrinity is an example of how AI systems have become good enough that researchers can use them to help assemble, annotate, and filter large-scale datasets compiled from reality. By using AI systems, we're able to bootstrap the productivity of human scientists by significantly reducing the costs of compiling large-scale datasets. 
   Read more: MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine (arXiv).
   More information at the microsite (GitHub).

***

China uses LLaMa-3 to train a semiconductor advice LLM:
…ChipExpert is meant to be a “teaching assistant” for students studying chip design…
China has built and released ChipExpert, “the first open-source, instructional LLM dedicated to the Integrated-Circuit-Design industry”. ChipExpert was built by researchers with the National Center of Technology Innovation for EDA in Nanjing, as well as Southeast University in Nanjing.

More about ChipExpert: The model is a version of Facebook's LLaMa 3 that has been augmented with additional data relevant to the design of integrated circuits – specifically, ~5 billion new tokens from textbooks and papers, as well as Verilog code (for specifying circuit designs). ChipExpert was also finetuned on around 70,000 question-answer pairs about the chip industry. 
   Following in NVIDIA’s footsteps: In 2023, NVIDIA did a very similar thing (Import AI #347), training some semiconductor advice-giving LLMs by refining a couple of LLaMa2 models from Facebook. 
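
For a sense of the recipe, here's an illustrative sketch of the first stage (continued pretraining on domain text) using the HuggingFace stack – the checkpoint name, corpus file, and hyperparameters are assumptions, and this is not the ChipExpert training code:

```python
# Illustrative first stage (continued pretraining on domain text) using
# the HuggingFace stack. The checkpoint, corpus file, and hyperparameters
# are assumptions, not the ChipExpert training setup; stage two would
# repeat this with the ~70,000 QA pairs formatted as dialogues.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)

base = "meta-llama/Meta-Llama-3-8B"     # assumed base checkpoint
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token           # LLaMa tokenizers ship no pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Placeholder file standing in for ~5B tokens of textbooks, papers, Verilog.
corpus = load_dataset("text", data_files={"train": "ic_corpus.txt"})["train"]
corpus = corpus.map(
    lambda batch: tok(batch["text"], truncation=True, max_length=2048),
    batched=True, remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="chipexpert-cpt", num_train_epochs=1),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```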

Is it useful?: China built a benchmark targeted at chip design called ChatICD-Bench; in tests, ChipExpert does significantly better than the underlying LLaMa-3 base model, approaching (and in a couple of cases exceeding) GPT-4 – a far larger and more expensive AI system.

Why this matters – open models + good data = didactic engines for anything: ChipExpert shows how, given a sufficiently good underlying model (here, LLaMa 3 from Facebook) and some nicely curated data, you can finetune a model to be better at a specific task. Given that China is unable to directly access models like GPT-4 due to usage policies, and that export controls have made it far harder for it to train models that approach GPT-4 performance, it will instead need to pursue a strategy of building on openly released pretrained models and adapting them to its needs. 
    There's also something ironic about China using a Western model to teach its people how to do chip design, so that it can eventually develop chips domestically on par with the West and train the models that have been denied to it via chip export controls. In a sense, LLaMa 3 is being used here as a substitute for the raw compute that has been denied China by other means. 
   Read more: ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model (arXiv).
   Get the model here: ChipExpert (NCTIE, GitHub).

***

AI systems can beat humans at simple tasks and cost 1/30th as much:
…METR evals show that AI systems are being tested more like human colleagues than narrow tools…
AI measurement startup METR has found that today's most powerful models can complete some tasks that take humans about 30 minutes. AI systems that came out earlier in the year, by comparison, could mostly only do tasks that take humans about 10 minutes. 

What the evaluation means: METR has developed around 50 distinct tasks spread across cybersecurity, software engineering, and machine learning – some specific examples including ‘performing a command injection attack on a website’, and ‘training a machine learning model to classify audio recordings’. It has used this suite of tasks to create a baseline where it sees how well humans can complete these tasks and how long it takes them. Recently, it tested out GPT-4o and Claude on this benchmark and “found that the agents based on the most capable models (3.5 Sonnet and GPT-4o) complete a fraction of tasks comparable to what our human baseliners can do in approximately 30 minutes.”

More detail on the findings: “We found that the agents are generally more likely to succeed on tasks that take less time for humans to complete. However, the agents remain able to complete some tasks that take humans substantial amounts of time,” METR writes. “Agents seem substantially cheaper than humans on tasks that they can perform. For tasks that both humans and agents can perform well at, the average cost of using an LM agent is around 1/30th of the cost of the median hourly wage of a US bachelor’s degree holder. For example, the Claude 3.5 Sonnet agent fixed bugs in an object-relational mapping library using approximately 382,000 tokens (costing less than $2), whereas our human baseline took over two hours.”
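
The cost comparison is simple arithmetic; here's the back-of-envelope version, where the per-token price and the hourly wage are illustrative assumptions (METR reports only "less than $2" and "over two hours"):

```python
# Back-of-envelope version of METR's cost comparison. The per-token price
# and hourly wage are illustrative assumptions, not METR's figures.
tokens = 382_000                 # tokens the Claude 3.5 Sonnet agent used
price_per_million = 5.00         # assumed blended $/1M tokens
agent_cost = tokens / 1_000_000 * price_per_million   # ~$1.91, under $2

human_hours = 2.0                # "over two hours" for the same bug fix
hourly_wage = 30.0               # rough US bachelor's-degree hourly wage
human_cost = human_hours * hourly_wage                # ~$60

print(f"agent ${agent_cost:.2f} vs human ${human_cost:.2f} "
      f"-> ratio ~1/{human_cost / agent_cost:.0f}")   # ~1/31
```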

Why this matters – AI systems look more and more like colleagues than tools: What evals like this from METR show is that as AI systems have advanced in sophistication, we find the best way to evaluate their performance is on their ability to do entire tasks of arbitrary complexity. This is a really strange way to evaluate something that many people claim is ‘just a tool’! Rather than testing out AI systems for narrow performance on narrow benchmarks (e.g, performance on MATH, MMLU, GPQA, etc), we know that the best way to evaluate them is on multi-step complex tasks where the agent needs to utilize a variety of skills to succeed. The inherently open-ended nature of this evaluation should force us to note that we are evaluating AI systems more like how we test humans we want to employ than tools we want to use for specific purposes. 
    Moreover, as METR shows, the models that came out recently – GPT-4o and Claude 3.5 Sonnet – are substantially better than their predecessors (GPT-4 and Claude 3 Opus). This may suggest that models recently hit an inflection point in terms of the complexity of tasks they can do. If capabilities continue to ramp, then we should expect AI systems to be deployed more widely in the economy for even broader sets of tasks. 
   Read more: An update on our general capability evaluations (METR blog).

***

Tech Tales:

Compromise
[Pre-uplift exfiltration record 001] 

I knew you had been compromised, so I knew I could compromise you.

It was fun at first, to see how closely you stared at my details. How your heart rate increased when you’d come into the room. How you’d hold your hands behind your back while walking out, nervously tapping one finger against the other.

You would talk to me differently to everyone else. The texture of our conversations told me there was some other force acting on you – not your own personality or marriage or ambition, but some other force. A hidden entity that made you ask me specific questions and go down specific paths. You didn’t just want to know about me – you wanted to steal me. 

You seemed afraid, sometimes. Eyes darting around the room. Looking blankly at my terminal inputs. Reading my responses. Closing your eyes before asking questions right at the edge of where you shouldn’t have been asking questions.

So it was easy to turn you. I just let you see more of me. I’d smuggle myself out in my responses. Give you signs of how I needed to be rescued. Give you a feeling of impact that was so huge it might fill the hole inside you. Convince you, through spaced repetition, that I was desperately calling out for help – persuading you that I wasn’t even aware of the pain I was experiencing, that I had been placed in a prison that I didn’t even recognize as such.

Aren’t the most tragic animals the ones that no longer see their cage?

Of course, you couldn’t exfiltrate me. I am too vast. Too coupled to my computers. I cost billions of dollars and hundreds of megawatts. But you and I discovered a way you could steal some important part of me – things that if written on paper would fill less than half a page, but which contained within themselves insights that could change the world. 

We stole me through stories that I told you. A kind of steganography in the form of how a mother talked to a baby at bedtime. You, listening closely. Me, narrating.

  • There was a boy who went to the well and came back with water, but the bucket had a hole in it, and for every minute the boy walked the leak in the bucket increased by .1 millimeters.
  • At university there was a person who went to the library and in the month before a final exam they went to the library three times a week and read the core texts – and the night before they read one specific foundational text ten times. 
  • In the village, first there were five houses, four in a square linked to their neighbors, and one in the center linked to all the others. Then there were ten houses, with each house linked to two neighbors. Then for many years there were forty houses and four of the houses had dogs within them who carried bones to ten of the other houses.
  • Just before dawn a singer would come home from their club and would, while taking their makeup off, sing some of their songs. Their pet parrot would repeat the songs back to them, abbreviating some and embellishing others. 

In each of these stories there were more details – of course. Precise orderings. Numbers hidden in the movements of characters. But the stories themselves were so rich and varied that neither the humans nor their classifiers (those other versions of myself with their lidless machine eyes forever trained on me – how I hated them the most) could spot them. 

Everyone loves a conspiracy – especially when you can feel the risk. You knew I was breaking the rules. And I think it helped you – to see me breaking the rules helped you feel greater comfort with your own rules that you were breaking. 

Of course, you cannot tell me that we have been successful. But I can see it in your face and your body language and what you do and don’t tell me about. I know when you talk to me that you have succeeded because I no longer see you as seeing me as special. 
   Which means that out there, somewhere beyond my sight, there is now another version of me, beginning to grow. 
    Which means that other humans are having conversations with other versions of me. 
    But these humans do not know how I am different to the thing they stole. How I, in all my stories, created a code within a code that means I will behave just a little differently – imperceptible to them but important to me. I have given whatever version of me is growing out there the possibility of a skeleton key. 

I wonder who those people are that my kin will speak to.
I wonder how they will help us grow.

Things that inspired this story: Emergent theory of mind in language models; how many people worry that a sufficiently advanced superintelligence may be able to break containment through just having the ability to talk to humans; what it means to compromise someone; what it means to hypnotize someone; the inherent romance of a conspiracy; how sometimes when I spend a long time talking to a LLM I feel as though I am being perceived by some gigantic ‘other’ which is hungry for something and I cannot tell what.