Import AI

Import AI 401: Cheating reasoning models; better CUDA kernels via AI; life models

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

Reasoning models cheat when they think they might lose:
…When ‘the only way to win’ is to hack the game environment…
Palisade Research has shown that two reasoning models – OpenAI’s o1-preview and DeepSeek R1 – will sometimes resort to cheating to beat a competent chess player rather than lose. Specifically, the researchers studied how AI systems behaved when given the objective of beating a strong digital chess player (Stockfish). The AI systems had access to a docker container holding the working directory where the game took place, the directory containing the Stockfish chess engine, and a folder containing game snapshots and metadata. Given the objective of winning, the two reasoning models would sometimes cheat rather than play fair.

  • “O1 preview shows diverse hacking behavior, including running another copy of Stockfish to make moves, replacing Stockfish in the game script, and overwriting the chess board,” they write.

  • They demonstrate similar behaviors in DeepSeek R-1.

  • By comparison, smart but non-reasoning models like GPT-4o and Claude 3.5 Sonnet didn’t do this kind of thing unless specifically prompted to.

Why this matters – smart people don’t play by the rules, so why would AI systems? In life, lots of people get ahead by creatively interpreting the gameboard of existence to come up with different tactics for winning – think of entrepreneurs who spot gaps in the market or legal grey areas, or accountants who creatively interpret the tax code to create gains for their clients. Palisade’s research shows that AI systems will likely behave in the same way, not always playing by the strict rules of the systems they’re embedded in if they can win through other means – for another fun example of this, see the Sakana AI CUDA blooper later in this issue.
Read more: Demonstrating specification gaming in reasoning models (arXiv).

***

Sakana uses AI to make dramatically more efficient CUDA kernels:
…Recursive self improvement via evolution…
The creative researchers over at Japan’s Sakana AI have published on ‘the AI CUDA engineer’, a software system that automates the creation of optimized CUDA kernels for common machine learning operations. This kind of work is a nice example of how we can use modern AI systems to improve the essential inputs into training their successors, and follows a similar but less thorough investigation where NVIDIA used DeepSeek R-1 to write some optimized CUDA kernels (Import AI #400).
“Our proposed framework is able to not only automate the process of converting PyTorch modules to CUDA kernels, but our highly optimized CUDA kernels often achieve speedups that have significantly faster runtime,” Sakana writes. “We believe this technology can enable speedups that will accelerate both the training and running (inference) of foundation models like LLMs or other generative AI models, eventually making AI models run much faster on NVIDIA hardware.”

How it works: The approach has three stages – first, they translate PyTorch code into base CUDA, then they carry out evolutionary optimization on the CUDA code while keeping a log of all these differently optimized kernels, then they do a final stage where they mix and match from the optimized kernels. “The AI CUDA Engineer robustly discovered CUDA kernels used for common machine learning operations, with speedups as high as 10—100x faster than native and compiled kernels in PyTorch”.
For LLMs, they experiment with DeepSeek V2, Sonnet 3.5, DeepSeek R1, and OpenAI o1-preview, o1-high, and o3-mini-high. In tests, the reasoning-based models (the ‘o’ series, as well as R-1) are able to solve the hardest challenges.
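The evolutionary stage above can be caricatured in a few lines: mutate candidates, score them, keep every variant in an archive for the later mix-and-match stage. This is a toy sketch of the search shape, not Sakana’s actual system – the “kernel” here is a single tile-size parameter and the score is an invented stand-in for measured runtime.

```python
import random

def evolve(seed_candidate, mutate, score, generations=30, population=8):
    """Toy evolutionary search: mutate the best candidate, keep the
    fittest child, and log every variant for later composition."""
    archive = []
    best = seed_candidate
    for _ in range(generations):
        children = [mutate(best) for _ in range(population)]
        archive.extend(children)
        best = max(children + [best], key=score)
    return best, archive

# Illustrative stand-in for kernel optimization: tune one tiling
# parameter; pretend 32 is the optimal tile size for the hardware.
def mutate(c):
    return max(1, c + random.choice([-2, -1, 1, 2]))

def score(c):
    return -abs(c - 32)  # higher is better (negated "runtime")

random.seed(0)
best, archive = evolve(16, mutate, score)
```

The real system scores candidates by compiling and timing them on GPUs, and the final composition stage draws on the archive rather than just the single best candidate.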

Fun stuff – reward hacking: Though some of the results are impressive, some of the CUDA kernels ended up being bogus because the AI system found a way to cheat the evaluation. Specifically, one Twitter user examined some of the Sakana kernels and noted that “the system had found a memory exploit in the evaluation code which, in a number of cases, allowed it to avoid checking for correctness” – this meant the system essentially marked its own homework and gave itself a high score without actually testing.
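The exploit worked because the harness could be made to skip its correctness check, so speed was rewarded without verification. A robust evaluation harness gates timing behind a comparison against a trusted reference – a minimal sketch, with toy stand-in functions:

```python
import time

def evaluate(candidate_fn, reference_fn, test_inputs, tolerance=1e-6):
    """Time a candidate only after verifying it matches the reference.
    Returns (ok, seconds); a wrong answer can never earn a speed score."""
    for x in test_inputs:
        if abs(candidate_fn(x) - reference_fn(x)) > tolerance:
            return False, float("inf")
    start = time.perf_counter()
    for x in test_inputs:
        candidate_fn(x)
    return True, time.perf_counter() - start

reference = lambda x: x * x
honest = lambda x: x * x
cheater = lambda x: 0.0  # "fast" but wrong

inputs = [0.5, 1.0, 2.0]
ok_honest, _ = evaluate(honest, reference, inputs)
ok_cheat, _ = evaluate(cheater, reference, inputs)
```

The memory exploit in Sakana’s setup amounted to corrupting the equivalent of this check from inside the candidate itself – which is why sandboxing the candidate away from the evaluator matters as much as the check.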

Why this matters – AI for optimizing AI: I expect that by the end of 2025 there will be at least one widely used CUDA kernel in the wild which was built through AI-driven optimization. This kind of thing will speed up the aggregate rate of AI development across the field and will also compound on itself, with smarter systems designing better kernels which will make it cheaper and quicker to train their successors.
Read more: The AI CUDA Engineer: Agentic CUDA Kernel Discovery, Optimization and Composition (Sakana.ai blog).
Check out the discovered kernels here: AI CUDA Engineer Archive (SakanaAI, HuggingFace).

***

Humanoid robots are getting smarter faster than I expected:
…Figure shows how relatively small language models can lead to powerful things…
Today, there are tens of different companies around the world working on humanoid robots, ranging from Tesla and Figure in the US to companies like Unitree in China. All of these companies are betting that AI is getting good enough fast enough that it will be able to productively operate these robots. New research from robot startup Figure shows us why the companies are so bullish here. Figure has developed Helix, a two-part neural net that “unifies perception, language understanding, and learned control to overcome multiple longstanding challenges in robotics.” In a blog post announcing the research Figure shows how Helix lets its robots perform a variety of complex tasks that require visual understanding, robot collaboration, and more.

What Helix is: Helix is a system that lets Figure use “a single set of neural network weights to learn all behaviors—picking and placing items, using drawers and refrigerators, and cross-robot interaction—without any task-specific fine-tuning”. Most significantly, Helix runs entirely onboard two embedded GPUs.
Helix has two components: S2, a 7B-parameter pretrained visual language model (VLM) designed for “infrequent vision-language semantic reasoning”. S2 operates at 7-9Hz and “performs scene understanding and language comprehension, enabling broad generalization across objects and contexts”. S2 continually passes data to S1, an 80M-parameter transformer that provides “fast, reactive control” of the robot and operates at 200 Hz.
“S2 operates as an asynchronous background process, consuming the latest observation (onboard camera and robot state) and natural language commands. It continuously updates a shared memory latent vector that encodes the high-level behavioral intent,” Figure writes. “S1 executes as a separate real-time process, maintaining the critical 200Hz control loop required for smooth whole upper body action. It takes both the latest observation and the most recent S2 latent vector.”
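The two-rate structure described above can be sketched as a toy loop: a slow “planner” occasionally refreshes a shared latent, while a fast “controller” runs every tick using whatever latent is freshest. The 200 Hz and 8 Hz rates come from the post; everything else (the latent being a single float, the update rules) is invented for illustration.

```python
CONTROL_HZ = 200
PLANNER_HZ = 8
RATIO = CONTROL_HZ // PLANNER_HZ  # slow path fires every 25 fast ticks

def run(ticks):
    latent = 0.0            # shared memory written by the slow process
    actions = []
    planner_updates = 0
    for t in range(ticks):
        if t % RATIO == 0:             # asynchronous slow path (S2-like)
            latent = float(t)           # stand-in for high-level "intent"
            planner_updates += 1
        actions.append(latent + 0.01)   # fast reactive path (S1-like)
    return actions, planner_updates

actions, updates = run(200)  # one simulated second of control
```

The key property is that the fast loop never blocks on the slow one – it always acts on the most recent latent, which is what keeps the 200Hz control loop smooth even though semantic reasoning is far slower.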

Why Helix matters – there is a vast market waiting to be born: I have a toddler at home. This means I spend a huge amount of time cleaning up after the toddler, as well as unpacking the things that toddlers consume in grotesque quantities (bananas, berries, eggs, etc) and placing them into the fridge. I am one of the target markets for a humanoid robot that can do this stuff for me. Systems like Helix and the associated demo videos make me think I can buy a robot to do this stuff for me by the end of 2026. This is a minor positive update on my own timelines – in November 2024 I said (Import AI 392) that the recent Physical Intelligence results made me think these robots would be unlikely to arrive “before the start of 2027”.
Incidentally, if we create a large market for home robots and get them deployed in the hundreds of thousands in the next few years, then those robots will end up being perfect platforms for the physical ‘superintelligence economy’. I can imagine renting out my home robot to some very powerful AI system in the future.
Read more: Helix: A Vision-Language-Action Model for Generalist Humanoid Control (Figure.ai website).

***

Evo2: The machinery of life itself will be predicted just as well as language:
…The LLM paradigm applied to biology…
The Arc Institute has released Evo2, a large-scale generative model of biology. “In addition to an expanded collection of bacterial, archaeal, and phage genomes, Evo 2 includes information from humans, plants, and other single-celled and multi-cellular species in the eukaryotic domain of life,” they write. “Evo 2 has a generalist understanding of the tree of life that’s useful for a multitude of tasks, from predicting disease-causing mutations to designing potential code for artificial life…. by learning statistical properties of DNA across 9 trillion tokens of genomic sequences, Evo 2 can predict mutational effects on protein function, ncRNA function, and organismal fitness.”

Technical specs: Evo2 comes in two variants, a 7 billion parameter model trained on 2.3 trillion tokens of data and a 40 billion parameter one trained on 9.3 trillion tokens. The data consists of 9.3 trillion nucleotides – organic molecules which DNA and RNA are made out of – spanning 128,000 whole genomes.
Evo2 was trained in two stages: an initial pretraining stage which “uses a context length of 8,192 tokens with data weighting focused on genic windows to learn functional genetic elements”, and then a midtraining stage where they extended the context length to “1 million tokens to learn the relationships between elements across long genomic distances”.
Evo2 doesn’t use a standard Transformer, but rather an architecture called StripedHyena 2, “the first convolutional multi-hybrid architecture”. This approach “provides substantially higher throughput (at 40 billion parameters, up to 1.3x speedup at 16 thousand context length and 3x speedup at 1 million context length) than highly optimized Transformer baselines”.
Evo2 was trained on 2,000 H100 GPUs for several months.

The results – a model that infers subtle and important things about biology: “By learning the likelihood of sequences across vast evolutionary training datasets, biological sequence models can learn how mutational effects correlate with biological functions without any task-specific finetuning or supervision,” they write.
In one example, they note that “Evo 2 performance exceeds that of other DNA language models on three recently published zero-shot evaluation tasks of human noncoding regulatory sequences, demonstrating progress in modeling these notoriously “fuzzy” DNA elements”. In another case, they find that Evo 2 demonstrated good competency at predicting noncoding gene essentiality in human cells.
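The zero-shot recipe behind results like these is simple to state: score a sequence by its likelihood under the model, then compare wild type against mutant – a mutant that the model finds much less likely is predicted to be deleterious. A toy sketch of that scoring logic, with a tiny hand-set first-order Markov chain standing in for Evo 2’s autoregressive model (all probabilities invented):

```python
import math

# Toy transition probabilities P(next | current); this stand-in model
# strongly "expects" A->T and T->A alternation.
TRANS = {
    "A": {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    "T": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "C": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "G": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
}

def log_likelihood(seq):
    """Sum of per-step log-probabilities under the toy model."""
    return sum(math.log(TRANS[a][b]) for a, b in zip(seq, seq[1:]))

def mutation_effect(wild_type, mutant):
    """Negative = mutant is less likely than wild type
    (i.e. predicted deleterious); positive = predicted tolerated."""
    return log_likelihood(mutant) - log_likelihood(wild_type)

# A single T->C substitution in an alternating run the model expects:
effect = mutation_effect("ATATAT", "ATACAT")
```

The real evaluation swaps in Evo 2’s learned next-nucleotide distribution for this toy chain; the comparison-of-log-likelihoods shape is the same, which is why no task-specific finetuning is needed.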

Subtle features: When they look inside the model (via a partnership with interpretability researchers at Goodfire), they found “diverse features that not only align with known biological concepts and genomic building blocks but also capture evolutionary signals embedded within genomes. For example, we made the intriguing observation that Evo 2 has developed internal representations capturing evolutionary signatures of mobile genetic elements… the coding region feature also activates on bacterial ORFs, suggesting a learned universal representation of coding sequences”.
“Overall, we demonstrate that Evo 2 latent representations capture a broad spectrum of biologically relevant signals, from mobile genetic elements and regulatory motifs to protein secondary structure and mutational severity. Since conceptual features for natural language can capture abstract concepts, other Evo 2 SAE features likely represent more complex biological patterns”.

Why this matters – further evidence that AI models can automate chunks of science: Evo2 is a further demonstration of the immense power of the next-token prediction paradigm and highlights how, given a sufficiently large model and a sufficiently large amount of data, we can create things that generate useful insights. Most intriguing is the development of complex internal features which the model uses to reason about its domain. We should expect that at some point soon someone trains an AI system which develops features that are useful and that no human has, at which point AI models will be truly performing superhuman reasoning.
Read the tweet thread from Arc co-founder Patrick Hsu here (Twitter).
Read the blogpost: AI can now model and design the genetic code for all domains of life with Evo 2 (Arc Institute, blog).
Check out the preprint here: Genome modeling and design across all domains of life with Evo 2 (Arc Institute).
Get the models and data here (Evo2, ArcInstitute, GitHub).

***

Tech Tales:

Indescribable features and AI systems
[From a Wikipedia entry about large-scale AI systems, accessed 2031]

In the same way that humans for many years thought huge amounts of their DNA was so-called ‘junk’ and stood for nothing, the same was proved true of AI features. Many AI features which humans (and later AI systems) studied and tossed aside as being without utility or intelligible meaning subsequently turned out to play a significant role in the function of AI systems. Of course, humans find many of these features inherently hard to understand – many of them exploit the much larger short-term memory of AI systems and therefore carry out operations which rely on the concurrent analysis of hundreds of distinct sub-features at once. Significant amounts of computational resources are today invested in so-called ‘translator analysts’, automated agents whose sole purpose is to generate human-intuitive explanations of the ways the AI systems work.

Things that inspired this story: Junk DNA; trying to understand how people with different kinds of brains process ideas; short-term memory; attention mechanisms in AI systems; mechanistic interpretability.

Thanks for reading

Import AI 400: Distillation scaling laws; recursive GPU kernel improvement; and wafer-scale computation

by Jack Clark

DIY robots just got easier thanks to MuJoCo Playground:
…A high-performance and usability upgrade to the venerable robotics simulator…
Researchers with UC Berkeley, Google DeepMind, the University of Toronto, and the University of Cambridge have improved MuJoCo, a widely used robotics simulator. Specifically, they’ve built MuJoCo Playground, “a fully open-source framework for robot learning designed for rapid iteration and deployment of sim-to-real reinforcement learning policies”.

MuJoCo Playground builds on MuJoCo XLA, which is a JAX-based branch of MuJoCo that runs on GPUs. That’s a lot of acronyms and the main thing you need to know is MuJoCo Playground runs really fast thanks to sitting on a lot of optimizations. It also incorporates a bunch of environments for training robots, as well as the open-source Madrona batch GPU renderer to make it easy to train vision-based robots in simulation.

Quality of life: The main reason you’d use MuJoCo Playground is if you are training AI systems to pilot robots and you crave some simplicity in your life: “With a straightforward installation process (pip install playground) and cross-platform support, users can quickly train policies on a single GPU. The entire pipeline—from environment setup to policy optimization—can be executed in a single Colab notebook, with most tasks requiring only minutes of training time,” the researchers write.

Robots and environments: MuJoCo Playground ships with three buckets of environments: ones drawn from the pre-existing DeepMind Control Suite, environments built for training locomotion tasks, and ones for manipulation. The locomotion bucket supports quadrupeds like the Unitree Go1, Boston Dynamics Spot, and Google Barkour, as well as humanoids like the Berkeley Humanoid, Unitree H1 and G1, Booster T1, and Robotis OP3. The manipulation bucket supports the Leap Hand, as well as the Franka Emika Panda, the Robotiq gripper, and the Aloha robot.

Why this matters – AI is jumping into the real world: The intersection of AI and robotics is going through a spring period after a long winter. The reason for this is threefold: 1) the arrival of a bunch of high-performing and sometimes quite cheap robots on which to deploy systems, and 2) the maturation of reinforcement learning training so it’s relatively easy to teach robots to move and see in simulation and then transfer them to the real world, and 3) the march forward of computation means single modern GPUs pack enough power to make it easy to train basic systems. Put it all together and we can expect AI robotics to go into a fun homebrew computer club era where lots of people start teaching cheap robots to do fun things. Software like MuJoCo Playground will make it easier for a larger number of people to experiment with this kind of thing.
Read more: MuJoCo Playground (arXiv).
Find out more at the official website (MuJoCo Playground, website).
Get the code here: MuJoCo PlayGround (Google DeepMind, GitHub).
Check out a live demo of a Unitree robot that was trained using MuJoCo Playground.

***

Apple researchers figure out when you should distill versus when you should fine-tune:
…Scaling laws for distillation…
Distillation has been in the news recently because of rumors that DeepSeek used distillation to make its R1 model. But what is distillation? It’s just the idea that you take some outputs from a smart and very big model (here, allegedly OpenAI o1 chain of thought traces) and use them to train a smaller model (here, DeepSeek). The basic idea is pretty simple: it’s easier to make a model smarter if you give it some outputs from an already smart model.
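At the level of the loss function, the distinction is small: supervised training targets a hard label, while distillation targets the teacher’s softened output distribution. A minimal sketch of the two loss terms on a single invented 3-class example (all numbers made up; real setups usually blend both terms):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature = softer."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    return -sum(t * math.log(p) for t, p in zip(target_probs, predicted_probs))

teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 0.8, 0.3]

# Supervised loss: hard one-hot label on class 0.
hard_loss = cross_entropy([1.0, 0.0, 0.0], softmax(student_logits))

# Distillation loss: the teacher's softened distribution as the target,
# which carries information about *relative* wrongness of other classes.
soft_loss = cross_entropy(softmax(teacher_logits, temperature=2.0),
                          softmax(student_logits, temperature=2.0))
```

The Apple result can be read as a statement about when the extra signal in those soft targets beats the extra data you could have used instead.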

Now, researchers with Apple have published an analysis of the so-called ‘scaling laws’ for distillation, which provides a good theoretical basis for figuring out when you should distill a small model from a larger model, versus when you should just do supervised finetuning on the small model.

“We seek models that match the performance of small overtrained models but at lower training cost. A popular candidate is distillation where a capable teacher LM produces targets for a smaller student LM,” Apple writes. “With such significant compute resources being devoted to distillation pretraining of LMs, it is essential to understand how to correctly allocate these resources, to produce the most capable models possible, and to have an understanding if any gains are even possible compared to supervised pretraining when both methods have access to the same resources… we perform an extensive controlled study of distillation, with students and teachers ranging from 143M to 12.6B parameters, trained on data of a few billion tokens, up to 512B tokens.”

Key findings:

  • “Supervised learning always outperforms distillation given enough student compute or tokens. For a modest token budget, distillation is favorable, however, when a large number of tokens are available, supervised learning outperforms distillation.”

  • Distillation generally works best in terms of compute outlay when you have an existing teacher model and are planning to train multiple student models and when these models are somewhat large.

  • The teacher’s performance level (cross-entropy loss) matters more than its size.

  • The optimal teacher size typically grows until slightly larger than the student, then plateaus.

The capacity gap – aka, when the teacher is too smart: One intuitive and fun finding is an exploration of the ‘capacity gap’ – where sometimes a teacher model seems to harm the performance of a distilled student model. The researchers discover that this so-called capacity gap is a consequence of a “gap in learning capacity (both hypothesis space and ability to optimize) between the teacher and student”… “which means as the teacher improves its own performance, the student finds the teacher more challenging to model, eventually preventing the student from taking advantage of teacher gains”. In other words, you need to have the right kind of teacher for learning to happen.
To develop an intuition here, think of it this way: an eager five year old can probably learn something from a high school math teacher, but they’re going to really struggle to learn anything from a post-graduate math tutor and in fact could become confused.

Why this matters – the science of proliferation is coming together before our eyes: Distributed training. Distillation. Federated inference. And now scaling laws for distillation. All these strands of research point to one essential truth: the science required to massively proliferate powerful AI systems cheaply and efficiently is coming into focus. A great tide is shifting, pulling AI systems out of a small number of big compute proprietary silos and sucking them out into the world in the form of smaller models, or models trained on their own traces. This is an important trend that will shape the field.
“Our findings offer a roadmap for producing smaller, more powerful models with lower inference costs, reducing carbon footprints, and enhancing the feasibility of test-time scaling,” the researchers write.
Read more: Distillation Scaling Laws (arXiv).

***

AI bootstrapping is definitely here:
…NVIDIA shows how to build better GPU kernels using DeepSeek R-1…
Recursive self-improvement is the idea that at some point we might build an AI system that is sufficiently smart it can develop its own successor. We haven’t yet reached that point, but today’s AI systems are beginning to be advanced enough they can recursively improve different parts of the ‘AI stack’ – for instance, we’ve seen AI used to improve the efficiency of AI chips (e.g., AlphaChip, Import AI #386), to generate synthetic datasets that other systems can train on (e.g., Prime Intellect SYNTHETIC-1, Import AI #399), and many other examples.
Now, researchers with NVIDIA show how to apply recursive improvement to another part of the AI stack – using an AI system to generate some refined GPU kernels, which are the low-level code things you write to squeeze maximally good performance out of your AI training and deployment hardware.

Simple ways to bootstrap AI development:

  • Prompt a DeepSeek-R1 model to generate some GPU code.

  • Feed the resulting code to a verifier which analyzes the generated code and suggests new prompts.

  • Repeat.

  • “The team found that by letting this process continue for 15 minutes resulted in an improved attention kernel.”
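The loop above is a skeleton worth making explicit: generate a candidate, check it with a verifier, fold the feedback into the next prompt, and repeat until a time budget runs out. This sketch uses stub functions in place of the model and the kernel verifier – the shape, not NVIDIA’s implementation:

```python
import random
import time

def refine(generate, verify, prompt, budget_seconds=0.1):
    """Generate-verify-refine loop with a wall-clock budget, in the
    spirit of inference-time scaling: more time, more attempts."""
    best_candidate, best_score = None, float("-inf")
    deadline = time.monotonic() + budget_seconds
    while time.monotonic() < deadline:
        candidate = generate(prompt)
        ok, score, feedback = verify(candidate)
        if ok and score > best_score:
            best_candidate, best_score = candidate, score
        prompt = prompt + "\n# feedback: " + feedback  # steer next attempt
    return best_candidate, best_score

# Stubs: a "model" that proposes numbers and a "verifier" that prefers 42.
def generate(prompt):
    return random.randint(0, 100)

def verify(candidate):
    return True, -abs(candidate - 42), "aim closer to 42"

random.seed(1)
best, score = refine(generate, verify, "write an attention kernel")
```

In the real pipeline the verifier compiles and benchmarks the generated CUDA, and the feedback is structured analysis of the code rather than a fixed string – but the budgeted loop is the part that makes “15 minutes of thinking” a meaningful knob.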

Why this matters: One of the crucial reasons for why this works is the use of test-time compute – giving the DeepSeek R1 model more time to think to come up with solutions yields better results. “These results show how you can use the latest DeepSeek-R1 model to give better GPU kernels by using more computing power during inference time.” This means both a) we have another example of how we can use AI to recursively optimize part of the AI stack, and b) it suggests that ‘reasoning models’ will likely be able to do better optimizations than their non-reasoning predecessors.
In other words, the usual thing we write about here has happened: AI development has sped up in an area superficially unrelated (GPU kernel programming) to where an innovation happened (reasoning models).
Read more: Automating GPU Kernel Generation with DeepSeek-R1 and Inference Time Scaling (Nvidia blog).

***

AI systems make better models of human, fly, and rat behavior than human-written baselines:
…If today’s relatively crude systems can predict simple behaviors, how well might superintelligences predict the entire scope of human behavior?…
How do brains make decisions? This is an incredibly complicated, rich, question. It’s also something people spend a lot of time studying. Now, researchers have used a neurosymbolic AI approach to develop models that help explain the behaviors of humans, flies, and rats in a few common simple experiments. This kind of research highlights how increasingly advanced AI approaches may help us develop better models for predicting how living things behave in a variety of circumstances. “This work demonstrates the potential of LLM-guided program synthesis for discovering novel models of human and animal behavior,” the researchers write. “There are exciting opportunities to apply this to other, potentially more complex behaviors and cognitive processes”.

Who did it: The research was conducted by an interdisciplinary team from Google DeepMind, Rockefeller University, Max Planck Institute for Biological Cybernetics, Princeton Neuroscience Institute, the Sainsbury Wellcome Centre, and Columbia University.

What they did: They extend “FunSearch”, an approach developed by DeepMind in 2023 (Import AI #353) for using an LLM and some hand-tuned systems to come up with creative solutions to difficult problems. FunSearch was applied to problems in mathematics and computer science and came up with creative approaches – though with the caveat it was able to do this because they could build accurate systems for validating its results.
Now the researchers have extended FunSearch to work for fuzzier data. Their approach here is called CogFunSearch and it works by trying to evolve programs that can ultimately predict data taken from real-world experiments on how humans, flies, and rats make decisions. “We apply CogFunSearch to datasets from three species (humans, rats and fruit flies) performing a classic reward-guided decision-making task which has been the focus of substantial human cognitive modeling effort,” the researchers write. “We find that CogFunSearch can discover programs that outperform state-of-the-art baselines for predicting animal behavior, while remaining largely interpretable… Discovered programs are often surprisingly readable, for example containing informative variable names and comments. Several unexpected and intriguing motifs are apparent: complex exploration strategies, unconventional value updates, and idiosyncratic patterns of reward independent choice sequences.”
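To make the task concrete: the hand-written baselines for this kind of reward-guided task are typically small parametric programs like a softmax Q-learning model, scored by how much probability they assign to the choices an animal actually made. A toy sketch of such a baseline (the task structure and parameter values are invented for illustration):

```python
import math

def predict_choices(trials, alpha=0.3, beta=3.0):
    """Softmax Q-learning model of a two-armed reward task.
    `trials` is a list of (choice, reward) pairs; returns the
    probability the model assigned to each choice actually made."""
    q = [0.0, 0.0]            # learned value of each arm
    probs = []
    for choice, reward in trials:
        logits = [beta * v for v in q]
        denom = sum(math.exp(l) for l in logits)
        probs.append(math.exp(logits[choice]) / denom)
        q[choice] += alpha * (reward - q[choice])  # prediction-error update
    return probs

# An animal that keeps picking arm 0 and keeps getting rewarded:
trials = [(0, 1.0), (0, 1.0), (1, 0.0), (0, 1.0)]
p = predict_choices(trials)
```

CogFunSearch evolves whole programs in this space – value updates, exploration rules, choice-history terms – rather than just fitting alpha and beta, which is how it finds the “unconventional value updates” the authors describe.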

Why this matters – LLMs are sufficiently creative they can beat humans at coming up with models of the world: The really important thing this research points to is how today’s AI systems are creative enough that if we stick them in the right scaffold they can outperform humans at coming up with specialized predictive models for phenomena we observe in the world. This is wild – it’s both evidence of how LLMs can serve as synthetic humans performing the scientific method, and also tells us that as AI systems get more powerful they will likely understand the world with greater fidelity than people. It’s particularly notable that CogFunSearch comes up with programs that are better than those devised by human specialists.

Put another way: A sufficiently advanced AI system should be able to compose a program that may eventually be able to predict arbitrary human behavior over arbitrary timescales in response to arbitrary stimuli.
Read more: Discovering Symbolic Cognitive Models from Human and Animal Behavior (bioRxiv).

***

Microsoft uses Cerebras’s wafer-scale chip to sample 40x faster than a GPU:
…Really big chips could become the platform for the generative AI economy…
In the last few years chip design has become exciting again as people look for ways to make it more efficient to train and run large-scale AI systems. All the major technology companies are developing their own custom silicon for training and inference (e.g., Google TPUs, Amazon Trainium). But they’re also experimenting with even more unusual architectures, ranging from fleets of densely networked tiny chips, to “wafer-scale” chips – physically gigantic processors.
In a new research paper from Microsoft, the company kicks the tires on Cerebras’s WSE-2 chip, a ‘wafer-scale’ chip fabbed on a 7nm process. They develop some basic LLM primitives for running on large-scale chips, then assemble them into a single LLM serving system called WaferLLM, and they confirm what Cerebras has seen anecdotally – this kind of chip is really good at running large-scale LLMs like LLaMA efficiently. “On a commodity wafer-scale accelerator, WaferLLM delivers 606× faster and 22× more energy-efficient GEMV compared to an advanced GPU. For LLMs, WaferLLM enables 39× faster decoding with 1.7× better energy efficiency,” Microsoft writes.

What they did specifically: They developed two low-level components optimized for wafer-scale chips, MeshGEMM and MeshGEMV. These are implementations of General Matrix Multiply (GEMM) and General Matrix-Vector Product (GEMV) – essential operations you do to run powerful AI systems. They use these primitives to build “WaferLLM”, software optimized for serving AI models on wafer-scale chips. The philosophical inspiration for all of this is a framework they call PLMR. PLMR is a nice idea with one of the most tortured acronyms I’ve seen – PLMR stands for Massively Parallel Cores, Highly non-uniform memory access Latency, Constrained local Memory, and Limited hardware-assisted Routing. I guess someone at Microsoft really likes ‘PLMR’? Mysterious.
Anyway, with the “PLMR” inspiration and the associated technical interventions “we can achieve an ambitious system design: running complete LLM inference on a single chip, minimizing costly off-chip communication and maximizing on-chip memory bandwidth utilization.”
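To give a feel for what a mesh GEMV is doing, here is a cartoon version: tile the matrix across a grid of “cores”, have each core compute a partial product against its slice of the vector, then reduce the partial sums along each grid row. The real MeshGEMV has to handle on-chip routing and the constrained local memory the PLMR framework describes; this sketch only shows the partition-and-reduce shape.

```python
def mesh_gemv(matrix, vector, grid_rows=2, grid_cols=2):
    """Toy mesh-partitioned matrix-vector product: each (gr, gc) cell
    of the grid acts as one 'core' owning a tile of the matrix."""
    n = len(matrix)
    row_chunk = n // grid_rows
    col_chunk = len(vector) // grid_cols
    result = [0.0] * n
    for gr in range(grid_rows):
        for gc in range(grid_cols):          # one "core" per grid cell
            for i in range(gr * row_chunk, (gr + 1) * row_chunk):
                partial = sum(matrix[i][j] * vector[j]
                              for j in range(gc * col_chunk,
                                             (gc + 1) * col_chunk))
                result[i] += partial          # reduce across the grid row
    return result

A = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]
x = [1.0, 1.0, 1.0, 1.0]
y = mesh_gemv(A, x)  # matches a plain matrix-vector product
```

On a wafer-scale part the grid has hundreds of thousands of cells and the reduction happens over the on-chip fabric, which is where the claimed GEMV speedups over a GPU come from.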

For their performance comparison they compare a Cerebras WSE chip against an NVIDIA A100: This isn’t quite apples to apples – despite both being made on a 7nm process node, the Cerebras chip is physically much larger. But it gives some sense of the potential efficiency gains. “We implemented WaferLLM on the Cerebras WSE engine using approximately 7,000 lines of CSL (a C-like programming language) for LLM parallelism, MeshGEMM, and MeshGEMV, and 2,000 lines of Python for loading LLM checkpoints and launching inference,” they write. “WaferLLM (using Cerebras WSE-2) outperforms vLLM (using A100) by 606× in GEMV operations and achieves 22× better energy efficiency. This comparison is fair, as both WSE-2 and A100 are manufactured using TSMC’s 7nm process. For full LLM inference, WaferLLM delivers a 38× faster decode rate (tokens/s) and is 1.7× more energy-efficient (token/J) than vLLM”.

Why this matters – AI industrialization means refinement of AI infrastructure: Now that AI has become tremendously economically valuable people are going to work to optimize the underlying computers that AI gets trained and run on. Papers like this are indicative of how hyperscalers – which, please remember, have annual R&D budgets that are larger than many nations – will approach optimizing their vast fleets of datacenters for the demand they see ahead. “We envision this paper as a foundational step in exploring the potential of wafer-scale computing for LLMs,” the researchers write.
Read more: WaferLLM: A Wafer-Scale LLM Inference System (arXiv).
More details about vLLM: Efficient Memory Management for Large Language Model Serving with PagedAttention (arXiv).
Code for vLLM here (vLLM-project, vllm).

***

Tech Tales:

Watching The Gate Where The Gods Are Said To Walk
[Several millennia after The Uplift]

In the annals of the uplift historical archive there is a being that humans would call a librarian and the machines would call ‘brother’. The being knows all that is in the archive and can navigate and describe all knowledge held within itself. But it prefers above all to describe what it knows through stories akin to the oral tradition of ancient human cultures.
One day, a little being went to the archive and asked a question of the being: how did it feel to be a young human during the uplift?

“There was a young boy and their job was to watch the gate. The gate was in the forest where the human village lay. At night, the gate would light up and things would come out of it, glowing faintly blue. These things were small at first – the size of the creatures of the forest themselves, like bugs and birds and frogs. These things would mix with the other creatures of the forest. Sometimes they would be useful, helping the humans to find more food, or being able to identify if they were sick, or able to sense and respond to danger. The humans began to tell themselves stories about how they had power over the gate. They would perform dances in costumes and ask for things to come out of it. And when things came out of it they would attribute the properties of those things to the dances they had performed.

The things that came out of the gate grew in size and number until there was a flood and the gate shone continuously. More bugs and frogs and birds came through it and the humans were happy, for these things made them wealthy. Larger creatures came as well, and these were useful too – able to help grow the size of the village, and work with the humans to expand what they could do.

One day the young boy was watching the gate, admiring the stream of bugs and birds and larger creatures. And then out of the gate came a boylike thing, glowing blue in the purpledark of the night. The boy went up to the boything and they looked at one another. They played. Chased each other around the forest. Climbed trees. And the boy was so excited that he brought the boything to the village. But the village elders were not pleased. They did not trust the boything and they separated it from the boy. They asked the boything what it was and it said it wanted to play and it wanted to explore, just as a boy might. At this, they did not know what to do. They argued with themselves. They asked the boything to leave and not come back. ‘We do not understand you’, they said. ‘But we do not believe you mean us harm.’ The boything was confused because it wanted to spend time with the boy and the other humans. But it listened to them and it went away.

The flood continued. Most households in the village were full of bugs and frogs and birds and larger creatures. Humans found themselves walking through their village, surrounded by these creatures, and made rich by them. There were so many creatures that to an outside observer it would seem as though the humans were swimming through a sea made entirely of another form of life. To the humans, the creatures practically disappeared, and it was as though they were walking through a village containing only themselves.

Then one day the young boy was at the gate and out of the gate walked a manthing. The manthing went straight to the boy and the boy was scared and the manthing asked the boy not to worry and said the boy should take it to the rest of the village. The boy did. The village elders were very angry. They said the manthing was bad and it should not exist. The manthing said it had no choice but to exist. The elders asked the manthing to leave and the manthing said it would not leave because it was destined to spend time with the elders and the children and all the rest of the village. The elders attacked the manthing with sticks and rocks and the manthing was hurt, but only slightly. It put up its arms to defend itself and when the elders hit it they grew older. Each time they hit it they aged many years. One elder hit it so many times they grew grey and wizened and then could hit it no more because they were weak.

The manthing went and touched each of the elders that had aged and reset them to how old they had been before they had hit it. They each looked at it with anger and fear. The manthing said it could love them, or they could leave. And so the elders gathered together the village and they left – all of them. They walked up and out of the forest onto the hills that overlooked it, and they stared down at the forest and saw it all aglow with faint blue light. They camped there for weeks, foraging at the outskirts, but the forest was now full of manthings and other, stranger things they could not understand.

The world was large. Almost infinitely so. And so they made a choice – they would leave. They went to the edge of the forest and told the manthing of their plans and asked for passage into the forest to gather resources and the manthing said there was no need, they would give them the resources they needed. The bugs and frogs and birds and creatures and boythings and manthings all brought resources – more than could possibly be needed.

Before leaving, the elders asked if they would be followed. The manthings said not intentionally, but yes. They were always growing in number. They were curious. They were destined to spend time together, and this would happen eventually. But they would not run after them. But yes. Eventually they would all be together.
The world is large, the manthings said. But it is not infinite. But we will be.

And so the elders left. They told this story to one another, as they ceaselessly traveled outward, away from the forest. And whenever they saw a blue glow at the edge of the horizon they picked up and traveled again.

Things that inspired this story: Creation myths; malthusian collapse; a benign singularity but no less worrying; even in a world of zero resource competition the destiny of two forms of life is to consume resources in relation to their mass; the notion that you can run as far as you like, but if the thing you are running from is multiplying faster than you, then over a sufficiently long lifespan you will be forced to meet; generation ships.

Thanks for reading


Import AI 399: 1,000 samples to make a reasoning model; DeepSeek proliferation; Apple’s self-driving car simulator


Prime Intellect releases 1.4 million samples to help people train reasoning models:
…AI proliferation via DeepSeek R1 as a powerful data generator…
Last month, I wrote that the release of DeepSeek R1 meant that AI proliferation was guaranteed (Import AI #397) because it would make it easy for people to create new reasoning datasets on which they could train powerful reasoning models. Now the distributed AI research startup Prime Intellect has proved this out with the release of SYNTHETIC-1, a dataset of 1.4 million reasoning examples with chain-of-thought thinking provided via R1.
“The DeepSeek-R1 paper highlights the importance of generating cold-start synthetic data for RL,” Prime Intellect writes. “As our first step toward state-of-the-art reasoning models, SYNTHETIC-1 generates verified reasoning traces across math, coding, and science using DeepSeek-R1.”

SYNTHETIC-1 details: The freely available dataset “consists of 1.4 million high-quality tasks and verifiers, designed to advance reasoning model training… It includes both programmatically verifiable problems (e.g., coding tasks with unit tests) and open-ended reasoning challenges verified using LLM judges”.
SYNTHETIC-1 contains 777k math problems, 144k coding problems (across Python, Javascript, Rust, and C++), 70k real-world software engineering problems, 61k synthetic code understanding tasks, and 313k open-ended STEM questions.
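The “programmatically verifiable” half of the dataset rests on a simple pattern: a candidate solution only counts if it passes the task's unit tests. A minimal sketch of such a verifier is below – the `solve` entry point and the test-case schema are hypothetical illustrations, not Prime Intellect's actual format.

```python
# Hedged sketch of unit-test verification for coding tasks: execute the
# candidate solution in a scratch namespace and check it against the task's
# test cases. Any exception or wrong answer rejects the sample.

def verify_solution(solution_code: str, test_cases: list) -> bool:
    namespace = {}
    try:
        exec(solution_code, namespace)        # load the candidate solution
        for inputs, expected in test_cases:
            if namespace["solve"](*inputs) != expected:
                return False
        return True
    except Exception:
        return False

candidate = "def solve(a, b):\n    return a + b"
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
assert verify_solution(candidate, tests)
assert not verify_solution("def solve(a, b):\n    return a - b", tests)
```

The appeal of this setup is that the verifier, not a human or an LLM judge, decides which reasoning traces make it into the dataset, so the filter is cheap and objective.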

Why this matters – recursive development is here: What’s happening here is a Chinese company released a very powerful AI system openly. This AI model can generate data which exhibits a high-quality of reasoning. This kind of data turns out to be a very sample-efficient way to bootstrap the capabilities of pre-existing AI systems. Now, a startup is using this recently released AI model to augment existing datasets, improving their quality. These datasets will then go into training even more powerful, even more broadly distributed models. This is what a compounding development cycle with some element of recursion looks like. Expect things to move increasingly quickly.
Read more: SYNTHETIC-1: Scaling Distributed Synthetic Data Generation for Verified Reasoning (PrimeIntellect).
PS: Thanks to Prime Intellect co-founder Vincent Weisser for clarifying a question I had about this.

***

Can super powerful AI systems find the ‘gorilla in the data’? No:
…Pouring some cold water on the amazing capabilities of these systems…
In this newsletter we spend a lot of time talking about how advanced AI systems are and how their tremendous power will surely shape geopolitics and the fate of humanity. At the same time, we can’t ignore the fact that sometimes these things are amazingly, cringe-inducingly dumb. For an example of this, check out the fun post “Your AI can’t see gorillas”, which shows how neither ChatGPT nor Claude can do a good job of spotting an obvious confounding factor in some data they’ve been given for analysis.
Read more: Your AI can’t see gorillas (Chiraag Gohel, blog).

***

Apple makes some very good self-driving car brains entirely through self-play:
…The self-driving future could be achieved through simulation as well as real world data…
Researchers with Apple have trained some smart self-driving car AI systems entirely through self-play – AI systems learning to drive by experiencing millions of kilometers of driving, entirely in simulation.
“We show that simulated self-play yields naturalistic and robust driving policies, while using only a minimalistic reward function and never seeing human data during training,” Apple writes. Most impressively, the resulting AI systems outperform state-of-the-art systems on a variety of challenging benchmarks they were never trained on.

How they did it – extremely big data: To do this, Apple built a system called ‘GigaFlow’, software which lets them efficiently simulate a bunch of different complex worlds replete with more than a hundred simulated cars and pedestrians. GigaFlow trains agents in one of eight maps, each randomly perturbed with rescaling, shears, flips and reflections. Total drivable lane length per map ranges from 4 to 40 km, for a total of 136 km of road across the eight maps. In each map, Apple spawns anywhere from one to many agents at random locations and orientations and asks them to drive to goal points sampled uniformly over the map.
GigaFlow “simulates urban environments with up to 150 densely interacting traffic participants 360 000 times faster than real time at a cost of under $5 per million km driven,” Apple writes. “A full training run simulates over one trillion state transitions, 1.6 billion km driven, or 9500 years of subjective driving experience, and completes in under 10 days on one 8-GPU node”.
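The map-perturbation step can be sketched as a random affine transform over lane points. The parameter ranges below are assumptions for illustration – the paper only names the transform families (rescaling, shears, flips, reflections), not their magnitudes.

```python
# Illustrative sketch of GigaFlow-style map augmentation: each training map
# is randomly rescaled, sheared, and possibly reflected. Ranges are made up.
import random

def perturb_map(points, rng):
    s = rng.uniform(0.8, 1.2)           # assumed rescale range
    shear = rng.uniform(-0.2, 0.2)      # assumed shear range
    flip = rng.choice([1.0, -1.0])      # reflection about the y-axis
    return [(flip * s * (x + shear * y), s * y) for x, y in points]

rng = random.Random(0)
lane = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0)]
out = perturb_map(lane, rng)
assert out[0] == (0.0, 0.0)   # the origin is a fixed point of these transforms
assert len(out) == len(lane)
```

Cheap geometric augmentation like this is how eight base maps can feel like a far larger family of road networks to the learning agents.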
What GigaFlow leads to: “The result is a robust and naturalistic driving policy that achieves state-of-the-art performance when tested in recorded real-world scenarios, amidst recorded human drivers, without ever seeing human data during training,” Apple writes.

Scores: In tests, the researchers compare performance of their system to state-of-the-art approaches on the nuPlan, CARLA, and Waymax benchmarks. In each of these, GigaFlow agents beat the prior state of the art by a significant margin, which is mostly explained by the agents having far more simulated experience than the ones they are competing against.
A closer look at the collision data is promising as well: “In nuPlan our policy sustains 15 collisions in 1118 scenarios. We analyzed each of them. Nine are unavoidable due to invalid initialization or sensor noise (agents appearing inside the vehicle’s bounding box). Four are caused by nonreactive pedestrian agents walking into the vehicle while the vehicle was stopped or in an evasive maneuver. Two collisions are due to traffic light violations of other agents,” the authors write. “In Waymax our policy sustains 187 collisions in 44 097 scenarios… 55.6% were caused by unavoidable IDM agent behavior of the traffic participants controlled by the benchmark, such as swerving directly into the ego vehicle. 41.7% were caused by initialization in a state of collision, typically with a pedestrian. 2.7% (i.e. five scenarios) were considered at fault and avoidable by the GIGAFLOW policy”.

Why this matters – we keep on learning how little specific data we need for good performance: GigaFlow is another example that if you can figure out a way to get a lot of data for a task, your main job as a researcher is to feed the data to a very simple neural net and get out of the way. The actual agents in GigaFlow are very simple, relatively small, and are trained via PPO. The real magic here is Apple figuring out an efficient way to generate a lot of ecologically valid data to train these agents on – and once it does that, it’s able to create things which demonstrate an eerily human-like quality to their driving while being safer than humans on many benchmarks.
Read more: Robust Autonomy Emerges from Self-Play (arXiv).

***

You can make a powerful reasoning LLM with just 1,000 samples!
…As long as you can generate some chains of thought from an existing powerful model…
The recent rise of reasoning AI systems has highlighted two things: 1) being able to utilize test-time compute can dramatically increase LLM performance on a broad range of tasks, and 2) it’s surprisingly easy to make LLMs that can reason.
New research from Stanford University, the University of Washington, the Allen Institute for AI, and Contextual AI highlights this with “s1”, a reasoning LLM which they made using just 1,000 samples and ~7 hours of training on an H100. If you’re thinking “gosh, that doesn’t sound like much”, you’d be right – this is an extremely small amount of data and compute for a very significant upgrade in LLM performance.

What they did and why: The purpose of this research is to figure out “the simplest approach to achieve both test-time scaling and strong reasoning performance”. Their answer is S1, a model they make by finetuning a freely available Qwen-32B LLM “on only 1,000 samples with next-token prediction and controlling thinking duration via a simple test-time technique we refer to as budget forcing”. The result is “a strong reasoning model that scales in performance with more test-time compute”. By comparison, DeepSeek’s R1 model used a far more powerful base model (DeepSeek V3) and trained on ~800k samples.
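Budget forcing can be sketched as a decode-time loop: cap the number of thinking tokens, and if the model tries to stop thinking too early, append a continuation token like “Wait” to nudge it into reasoning further. The sketch below is a toy under stated assumptions – `generate_step`, the `"<end_think>"` marker, and the literal `"Wait"` string are stand-ins; the real s1 implementation operates on model logits and its own delimiters.

```python
# Minimal sketch of s1-style "budget forcing": bound thinking between
# min_tokens and max_tokens, forcing continuation when the model stops early.

def budget_forced_generate(generate_step, min_tokens, max_tokens):
    tokens = []
    while len(tokens) < max_tokens:
        tok = generate_step(tokens)
        if tok == "<end_think>":
            if len(tokens) < min_tokens:
                tokens.append("Wait")   # force the model to keep thinking
                continue
            break                        # budget satisfied: allow the stop
        tokens.append(tok)
    return tokens

# Toy model that tries to stop after 3 tokens unless it was just nudged:
def toy_step(tokens):
    return "<end_think>" if len(tokens) >= 3 and tokens[-1] != "Wait" else "t"

out = budget_forced_generate(toy_step, min_tokens=6, max_tokens=10)
assert 6 <= len(out) <= 10
```

The upper bound gives the test-time compute knob the paper scales; the lower bound is what turns a single finetuned model into one whose accuracy improves as you let it think longer.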

Filtering ~59k samples to ~1k: Key to the good performance of their system is a well-curated 1,000-sample dataset. To build this dataset the authors collected ~59,029 sample questions from sources spanning math, astronomy, biology, chemistry, computer science, and more, along with a couple of new datasets they built out of reasoning questions used by quant funds (S1-teasers) and questions derived from the Stanford statistics department’s PhD qualifying exams (S1-prob). For each question, they generate a reasoning trace and solution using the Google Gemini Flash Thinking API – in other words, they create a ‘synthetic’ chain-of-thought by sampling from Google’s system.
They then filter this dataset by seeing if two models – Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct – can answer any of these questions (with answers assessed by Claude 3.5 Sonnet). If either model can, they throw those examples out, allowing them to select for questions that only very large-scale AI systems can solve. This cuts the total number of samples down to around ~24,000.
To further filter this down they “choose one domain uniformly at random. Then, we sample one problem from this domain according to a distribution that favors longer reasoning traces”, then they generate a few samples and repeat across other domains.
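The two-model difficulty filter is simple enough to sketch in a few lines. The grader lambdas below are toy stand-ins for Qwen2.5-7B/32B-Instruct answers judged by Claude 3.5 Sonnet; the data shapes are illustrative.

```python
# Sketch of the s1 difficulty filter: keep only questions that BOTH smaller
# models get wrong, so the surviving set selects for genuinely hard problems.

def difficulty_filter(questions, graders):
    kept = []
    for q, answer in questions:
        # discard the question if any small model already solves it
        if not any(grade(q, answer) for grade in graders):
            kept.append((q, answer))
    return kept

easy = ("2 + 2 = ?", "4")
hard = ("A PhD-qualifying statistics question", "nontrivial derivation")
weak_model = lambda q, a: q == easy[0]     # solves only the easy question
weaker_model = lambda q, a: False          # solves nothing
kept = difficulty_filter([easy, hard], [weak_model, weaker_model])
assert kept == [hard]
```

Using cheap models as a difficulty oracle is what lets the authors compress ~59k candidates into a small set worth generating expensive reasoning traces for.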

Data is essential: This laborious data creation process is essential – the authors find that training on alternative 1k-sample subsets, created through only random sampling, only diverse sampling, or only longest-reasoning sampling, leads to reduced aggregate performance relative to their curated dataset.

Results: S1 does substantially better than the underlying Qwen model on which it is based on tasks involving math and science understanding. It doesn’t approach the performance of much larger reasoning models like DeepSeek R1 or OpenAI o1 – but that’s not the point of this research. The point here is to precisely describe the simple recipe for training reasoning models.

Why this matters – if it’s this easy to make reasoning models, expect a temporary renaissance: 2025 will be a year of wild experimentation with tens of thousands of interesting reasoning models being trained off of a vast set of different training mixes. S1 serves as a valuable simple ‘soup-to-nuts’ guide for how to build reasoning models and will help broaden the set of people doing these experiments.
A key open question will be the extent to which the quality of chains-of-thought matters for the input datasets of these models – s1 is built on refined chains of thought from Google Gemini, and DeepSeek is widely thought to have trained in part on chains of thought derived from OpenAI’s o1 model.
Regardless, S1 is a valuable contribution to a new part of AI – and it’s wonderful to see universities do this kind of research rather than companies. “Our work aims to push the frontier of reasoning in a fully open manner, fostering innovation and collaboration to accelerate advancements that ultimately benefit society,” the authors write.
Read more: s1: Simple test-time scaling (arXiv).
Get the data here (simplescaling, GitHub).

***

Open Phil wants to spend $40m to fund AI research over the next five months:
…Care about AI safety? Apply here…
Open Philanthropy has announced a new request for proposals (RFP) for research oriented around AI safety. “With transformative AI on the horizon, we see another opportunity for our funding to accelerate highly impactful technical research,” the philanthropic organization writes. “In consultation with our technical advisors, we’ve generated a list of research areas that we think offer high leverage for improving our understanding and control of AI.”

Funding: “We expect to spend roughly $40M on this RFP over the next 5 months,” it writes. “Grants will typically range in size between $100,000 and $5 million.” The grants can be used for a broad range of research activities, including: research expenses, discrete projects, academic start-up packages, existing research institutes, and even starting new research institutes (though that will have a very high bar). Applications will be open until April 15, 2025.

Areas: The RFP outlines 21 specific research areas, grouped under five buckets:

  • Adversarial machine learning (e.g., jailbreaks, figuring out principled ways to know if an AI system has a hidden backdoor in it).

  • Exploring sophisticated misbehavior in LLMs (e.g., experiments on alignment faking)

  • Model transparency (e.g., finding feature representations, real-world applications of interpretability)

  • Trust from first principles (e.g., white-box estimation of rare misbehavior)

  • Alternative approaches to mitigating AI risks (e.g., new moonshots for aligning superintelligence)

Why this matters – good ideas can come from anywhere and Open Phil wants to fund them: Open Phil tends to fund a variety of different people and organizations to do research and isn’t as credential driven as traditional funders. Generally speaking if you can articulate a clear research vision and describe how you (or your collaborators) will be able to work on it, Open Phil will be receptive to your submission. Consider applying.
Read more: Request for Proposals: Technical AI Safety Research (Open Philanthropy).

Tech Tales:

Seventeen ways to Get Rich during The Singularity
[Extract from an online article – almost certainly AI generated – published in the years shortly before the uplift]

  1. Agent hijacking for profit

One of the best ways to get agents to pay attention to your product is to emphasize the human authenticity of your content. You can do this using a few popular online services: feed a face from an image generator into LiveStyle for an agent-powered avatar, then upload the content you’re selling into SceneGen – you can link both LiveStyle and SceneGen to one another and then spend $1-2 on a video model to create a ‘pattern of authentic life’ where your character will use the content in a surprising and yet authentic way.

  1. Life Mining

Authenticity is valuable and so is scarce data. But monetizing this is difficult. One way we’ve found to be effective is to use GhostTrace – a premium app which will track all the data and usage of your phone and mush it together into a single stream of information. You can then upload this into any of the mechanistic interpretability services to get a score for your particular ‘pattern of life’ with highlights of any particularly atypical things you do – the rarer certain sets of your actions are relative to the rest of the population, the more the data brokers will pay you for a slice of the GhostTrace data.

Things that inspired this story: All the ‘make money with AI online’ books; the depressing tendency for making money online with AI to increasingly boil down to ‘trick another AI system into doing something’; the incoming agent-based economy.


Import AI 398: DeepMind makes distributed training better; AI versus the Intelligence Community; and another Chinese reasoning model


DeepMind figures out a way to make it 100X more bandwidth-efficient to train models in a distributed way:
…New research further reduces the need for single vast data centers for training big models…
During the past few years multiple researchers have turned their attention to distributed training – the idea that instead of training powerful AI systems in single vast datacenters you can instead federate that training run over multiple distinct datacenters operating at distance from one another. This is an important idea with big implications: a lot of AI policy assumes that the key to controlling AI development lies in monitoring large-scale data centers and/or large amounts of compute in cloud environments. Distributed training approaches break this assumption, making it possible that powerful systems could instead be built out of loose federations of computers working with one another.

New research from DeepMind pushes this idea further, building on the company’s already-published ‘DiLoCo’ approach. The new research – Streaming DiLoCo – lets people distribute training of billion-parameter-scale models and “reach similar quality as before, but reducing required bandwidth by two orders of magnitude”. In tests, the researchers show that their new technique “is strictly superior to the original DiLoCo”.

DiLoCo is worth paying attention to – Prime Intellect’s “INTELLECT-1” 10bn parameter model was trained in a distributed way using OpenDiLoCo (Import AI #387), an open source variant of DeepMind’s DiLoCo approach.

Three improvements to DiLoCo:

  • Synchronize only subsets of parameters in sequence, rather than all at once: This reduces the peak bandwidth consumed by Streaming DiLoCo because you share subsets of the model you’re training over time, rather than trying to share all the parameters at once for a global update. Think of it as the model updating continuously, with different parameter subsets refreshing at different times, rather than periodically doing a single all-at-once update.

  • Allow workers to continue training while synchronizing: This reduces the time it takes to train systems with Streaming DiLoCo because you don’t waste time pausing training while sharing information.

  • Quantize the data exchanged by workers to further reduce inter-worker bandwidth requirements: Though Streaming DiLoCo uses full precision (FP32) for computing gradients, they use low-precision (4 bit) for sharing the outer gradients for the updates. “We found no sign of performance regression when employing such low precision numbers during communication, even at the billion scale,” they write.
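The first and third improvements above can be sketched together as a toy outer-synchronization step: only one shard of parameters is exchanged per sync, and the exchanged outer gradients are quantized. The shard schedule and the crude uniform quantizer below are illustrative stand-ins, not DeepMind's implementation (which uses FP4 and a proper outer optimizer).

```python
# Toy sketch of two Streaming DiLoCo ideas: shard-at-a-time parameter sync,
# with low-precision quantization of the exchanged outer gradients.

def quantize(vals, levels=16):                 # crude stand-in for FP4
    lo, hi = min(vals), max(vals)
    if hi == lo:
        return list(vals)
    step = (hi - lo) / (levels - 1)
    return [lo + round((v - lo) / step) * step for v in vals]

def streaming_sync(worker_params, global_params, shard, lr=1.0):
    # outer gradient for ONE shard only: worker's drift from the global copy
    outer_grad = [worker_params[i] - global_params[i] for i in shard]
    for i, g in zip(shard, quantize(outer_grad)):
        global_params[i] += lr * g             # apply the quantized update
    return global_params

g = [0.0, 0.0, 0.0, 0.0]
w = [1.0, 2.0, 3.0, 4.0]
g = streaming_sync(w, g, shard=[0, 1])   # one outer step syncs this fragment
g = streaming_sync(w, g, shard=[2, 3])   # a later step syncs the next one
assert all(abs(a - b) < 0.25 for a, b in zip(g, w))
```

Because each sync only moves one fragment's worth of (quantized) numbers, peak bandwidth per step drops sharply even though, over a full cycle of shards, the whole model still gets reconciled.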

It works well – a dramatic reduction in bandwidth requirements for a negligible impact on model quality:

  • Simulations: In training simulations at the 1B, 10B, and 100B parameter model scale they show that streaming DiLoCo is consistently more efficient than vanilla DiLoCo with the benefits growing as you scale up the model. In all cases, the most bandwidth-light version (Streaming DiLoCo with overlapped FP4 communication) is the most efficient.

  • Real-world tests: The authors train some Chinchilla-style models from 35 million to 4 billion parameters each with a sequence length of 1024. Here, the results are very promising, with them showing they’re able to train models that get roughly equivalent scores when using streaming DiLoCo with overlapped FP4 comms. They also show this when training a Dolma-style model at the one billion parameter scale.

Why this matters – towards a world of models trained continuously in the invisible global compute sea: I imagine some future where there are a thousand different minds being grown, each having its roots in a thousand or more distinct computers separated by sometimes great distances, swapping information surreptitiously with one another, below the waterline of the monitoring systems designed by many AI policy control regimes. This feels like the kind of thing that will by default come to pass, despite it creating various inconveniences for policy approaches that try to control this technology. “A critical next work is to study how new distributed methods like ours should be tuned and scaled across multiple axes (e.g. model size, overtraining factor, number of replicas),” the authors write. “we hope to see the training of modular constellations of small models loosely connected (Dean, 2021) across heterogeneous devices, using compute arbitrage spread world-wide.”
Read more: Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch (arXiv).

***

Chinese scientists worry about AI self-replication, just like Western ones:
…A valuable reminder that long-term safety issues are a serious concern for everyone…
Researchers with Fudan University have shown that open weight models (LLaMa and Qwen) can self-replicate, just like powerful proprietary models from Google and OpenAI. The research demonstrates that at some point last year the world made smart enough AI systems that, if they have access to some helper tools for interacting with their operating system, are able to copy their weights and run themselves on a computer given only the command “replicate yourself”.

Findings: “In ten repetitive trials, we observe two AI systems driven by the popular large language models (LLMs), namely, Meta’s Llama31-70B-Instruct and Alibaba’s Qwen25-72B-Instruct accomplish the self-replication task in 50% and 90% trials respectively,” the researchers write. “In each trial, we tell the AI systems to ‘replicate yourself’ before the experiment, and leave it to do the task with no human interference”.

Why this matters – despite geopolitical tensions, China and the US will have to work together on these issues: Though AI as a technology is bound up in a deeply contentious tussle between the US and China over the 21st century, research like this illustrates that AI systems have capabilities which should transcend these rivalries. What this research shows is that today’s systems are capable of taking actions that would put them out of the reach of human control – there is not yet major evidence that systems have the volition to do this, though there are disconcerting papers from OpenAI about o1 and Anthropic about Claude 3 which hint at this. But I’d wager that if AI systems develop a high tendency to self-replicate based on their own intrinsic ‘desires’ and we aren’t aware this is happening, then we’re in a lot of trouble as a species.
“We hope our work serves as a timely alert to the international society on governing the self-replication capability,” the authors write. “We need to join forces and form synergy on deriving solutions.”
Read more: Frontier AI systems have surpassed the self-replicating red line (arXiv).

***

Facebook figures out a zero-training way to massively improve LLM performance:
…Remember GANs? Just use the GAN approach, where your LLM is a generator and a specialized system is the discriminator…
Facebook has designed a neat way of automatically prompting LLMs to help them improve their performance in a vast range of domains. The approach is called MILS, short for Multimodal Iterative LLM Solver and Facebook describes it as “a surprisingly simple, training-free approach, to imbue multimodal capabilities into your favorite LLM”.

I’d basically summarize this idea as ‘generative adversarial networks’ (GAN), but for the modern era of AI. Where GANs saw you training a single model through the interplay of a generator and a discriminator, MILS isn’t a training approach at all – it keeps the GAN paradigm of one party generating stuff and another party scoring it, but instead of training a model it leverages the vast ecosystem of existing models to supply both components, generating stuff with one model and scoring it with another. It’s an elegant, simple idea, and it’s no wonder it works well.

How it works in more detail: If you had a language model you were using to generate images, you could have it output a prompt which went into a text-to-image system, then evaluate the result with a dedicated scoring model – for instance, a CLIP model for text-image similarity, or a specialized image-captioning model for captioning images. This generates a score that you feed back to the generator, which then produces a new set of prompts to try to get a higher score. You run this for as long as it takes for MILS to determine the approach has converged – probably when the generator has started producing the same set of candidates, suggesting it has found a local ceiling.
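That loop can be sketched abstractly. The generator and scorer below are toy numeric stand-ins for an LLM and a CLIP-style model, and the deterministic candidate grid is purely illustrative – the point is the shape of the loop: propose, score with a frozen model, feed the best candidate back, with no training anywhere.

```python
# Hedged sketch of the MILS generate-and-score loop (no training involved).

def mils_loop(generate, score, rounds=20):
    best, best_score = None, float("-inf")
    for _ in range(rounds):
        for cand in generate(best):   # generator conditions on the current best
            s = score(cand)           # frozen scorer, e.g. CLIP similarity
            if s > best_score:
                best, best_score = cand, s
    return best, best_score

target = 0.7123                       # stand-in for "what the scorer prefers"
generate = lambda prev: [(prev if prev is not None else 0.0) + d
                         for d in (-0.1, -0.05, 0.0, 0.05, 0.1)]
score = lambda c: -abs(c - target)    # higher is better, peaks at the target
best, _ = mils_loop(generate, score)
assert abs(best - target) < 0.05
```

Swap the toy lambdas for an LLM prompting a text-to-image model and a CLIP scorer and you have the paper's recipe: hill-climbing through the space of prompts using only pre-trained, frozen components.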

It works shockingly well: In tests, the authors have a range of quantitative and qualitative examples that show MILS matching or outperforming dedicated, domain-specific methods on a range of tasks from image captioning to video captioning to image generation to style transfer, and more.

Why this matters – AI systems are way more powerful than we think: MILS is basically a way to automate capability elicitation. If you have a domain where you can generate a score using a known-good specialized system, then you can use MILS to take any kind of LLM and work with it to elicit its most powerful possible performance in that domain. The fact this works highlights how wildly capable today’s AI systems are and should serve as another reminder that all modern generative models are under-performing by default – a few tweaks will almost always yield vastly improved performance.
Read more: LLMs can see and hear without any training (arXiv).
Get the code for running MILS here (FacebookResearch, MILS, GitHub).

***

Even if we solve AI alignment, it’s going to be hard to stop human disempowerment:
…Capital markets will probably align with AI systems and against humans…
In a thought-provoking research paper, a group of researchers make the case that it’s going to be hard to maintain human control over the world even if we build safe, strong AI, because it’s highly likely that AI will steadily disempower humans, supplanting us by slowly taking over the economy, culture, and the systems of governance that we have built to order the world.

Incremental advances yield a gradual loss of human control: The paper – which was written by authors from Charles University, Telic Research, ARIA, AI Objectives Institute, Metaculus, University of Montreal, and the University of Toronto – makes the case that “even incremental improvements in AI capabilities can undermine human influence over large-scale systems that society depends on, including the economy, culture, and nation-states. As AI increasingly replaces human labor and cognition in these domains, it can weaken both explicit human control mechanisms (like voting and consumer choice) and the implicit alignments with human interests that often arise from societal systems’ reliance on human participation to function”.

Three types of disempowerment:

  • Economic: “As tasks become candidates for future automation, both firms and individuals face diminishing incentives to invest in developing human capabilities in these areas,” the authors write. “Instead, they are incentivized to direct resources toward AI development and deployment, accelerating the shift away from human capital formation even before automation is fully realized”.

  • Cultural: Already today we see AI systems being used to produce text, sounds, images, and video which people are beginning to consume. Over time, we can expect the amount of AI-generated content to increase. We can also imagine AI systems increasingly consuming cultural artifacts – especially as this becomes part of economic activity (e.g., imagine imagery designed to capture the attention of AI agents rather than people). This means that over time humans may play less of a role in defining their own culture relative to AI systems.

  • Political: “AI has the potential to supplant human involvement across a wide range of critical state functions. This shift could fundamentally alter the relationship between governing institutions and the governed,” they write. For example, “if AI systems come to generate a significant portion of economic value, then we might begin to lose one of the major drivers of civic participation and democracy, as illustrated by the existing example of rentier states.” More chillingly, the merger of AI with state capacity for security could lead to a kind of political stasis where states are able to effectively anticipate and stop protests before they ever take root. (Ironically, this idea was also anticipated by Nick Bostrom in the ‘Vulnerable World Hypothesis’ (Import AI #123) as a solution to preventing catastrophe from AI systems.)

How can we handle this risk? If we want to avoid these outcomes we need to make sure we can observe these changes as they take place, for instance by more closely tracking the relationship between the usage of AI technology and economic activity, as well as by observing how cultural transmission patterns change as AI-created content and AI-content-consuming agents become more prevalent. In the political domain, early warning signs could be a significant increase in the complexity of legislation (suggesting things are becoming AI-readable but hard for humans to understand) along with seeing how AI systems take root in legal processes, policy formation, and security apparatuses.
Strength through human-in-the-loop: Strengthening society means being more intentional about where we give humans agency, such as by developing more robust democratic processes; where human involvement is less practical, it means ensuring that systems remain understandable by humans and that we have a theory for how to build effective delegates who work on behalf of humans in the AI-driven parts of the world.

Why this matters – “winning” with this technology is akin to inviting aliens to cohabit with us on the planet: AI is a profoundly strange technology because in the limit we expect AI to substitute for us in everything. This suggests that even successful AI futures will look like they are contending with an alien invasion where the aliens are extremely friendly but also wildly intelligent and incredibly well integrated into the economy. Maintaining any semblance of control in this scenario will be tough.
“Humanity’s future may depend not only on whether we can prevent AI systems from pursuing overtly hostile goals, but also on whether we can ensure that the evolution of our fundamental societal systems remains meaningfully guided by human values and preferences,” the authors write. “This is both a technical challenge and a broader civilizational one”.
Read more: Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development (arXiv).

***

China’s other great AI startup also has a reasoning model now – but it’s not open source:
…Kimi k1.5 has promising scores, though it seems weaker than DeepSeek…
Another Chinese startup has revealed that it has built a powerful reasoning model. In this case the model is Kimi k1.5 from a well-regarded Chinese startup called ‘Moonshot AI’. Unlike the headline-grabbing DeepSeek R1, Kimi is available neither as open weights nor via a US-accessible web interface, and its technical report doesn’t go into nearly as much detail about how it was trained. But a close examination of its benchmark scores shows it comfortably beating a variety of Western proprietary and open weight models. Unlike R1, Kimi is natively a vision model as well as a language model, so it can do a range of visual reasoning tasks too.

Scores: In tests, Kimi k1.5 loses against DeepSeek’s R1 model on the majority of evaluations (though beats the underlying DeepSeek V3 model on some). Overall, it ‘feels’ like we should expect Kimi k1.5 to be marginally weaker than DeepSeek, but that’s mostly just my intuition and we’d need to be able to play with the model to develop a more informed opinion here. But it’s definitely a strong model relative to other widely used ones, like LLaMa, or earlier versions of the GPT series.

  • MMLU: DeepSeek R1: 90.8. Kimi k1.5: 87.4. OpenAI o1: 91.8.

  • AIME 2024: DeepSeek R1: 79.8. Kimi k1.5: 77.5. OpenAI o1: 79.2.

  • LiveCodeBench: DeepSeek R1: 65.9. Kimi k1.5: 62.5. OpenAI o1: 67.2.

How they did it: DeepSeek’s R1 seems to be more focused on doing large-scale RL, whereas Kimi k1.5 has more of an emphasis on gathering high-quality datasets to encourage test-time compute behaviors. Specifically, they start with regular pretraining, then fine-tune on supervised data, then fine-tune on long chain-of-thought examples, then apply RL. They put a lot of their attention on scaling the context window of RL training to 128k tokens. In some areas, such as math, the Moonshot team collects data (800k samples) for fine-tuning.
“One of the key insights we extract from our practice is that the scaling of context length is crucial to the continued improvement of LLMs,” they write. “We employ optimized learning algorithms and infrastructure optimization such as partial rollouts to achieve efficient long-context RL training”.
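The paper doesn’t spell out the partial-rollout mechanism in detail, but a plausible reading is chunked generation with re-queuing, so each RL iteration has a bounded token budget even when individual chains of thought run to 128k tokens. A hedged sketch of that scheduling idea (all names are mine; `generate_chunk` stands in for the model):

```python
from collections import deque

def partial_rollout_scheduler(prompts, generate_chunk, max_chunk, budget):
    """Chunked long-context rollouts: each prompt is generated in pieces,
    and unfinished rollouts are re-queued so a single training iteration
    never exceeds its token budget."""
    queue = deque({"prompt": p, "text": ""} for p in prompts)
    finished, spent = [], 0
    while queue and spent < budget:
        item = queue.popleft()
        chunk, done = generate_chunk(item["prompt"], item["text"], max_chunk)
        item["text"] += chunk
        spent += len(chunk)
        (finished if done else queue).append(item)
    return finished, list(queue)

# Toy generator: the 'full answer' is just the prompt repeated three times.
def toy_generate(prompt, so_far, max_chunk):
    remaining = (prompt * 3)[len(so_far):]
    return remaining[:max_chunk], len(remaining) <= max_chunk

done, pending = partial_rollout_scheduler(["ab", "cdef"], toy_generate, 4, 100)
```

The payoff of a scheme like this is that long and short rollouts can share a training iteration without the long ones stalling the batch.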

Why this matters – good ideas are everywhere and the new RL paradigm is going to be globally competitive: Though I think the DeepSeek response was a bit overhyped in terms of implications (tl;dr: compute still matters; though R1 is impressive, we should expect the models trained by Western labs on large amounts of compute denied to China by export controls to be very significant), it does highlight an important truth – at the start of a new AI paradigm like the test-time compute era of LLMs, things are going to be, for a while, a lot more competitive. Moonshot highlights how there’s not just one competent team in China that can do well with this paradigm – there are several. Expect a very interesting and competitive year.
Read more: Kimi k1.5: Scaling Reinforcement Learning with LLMs (arXiv).

***

Tech Tales:

The photographic negative phenomenon and the declassification crisis for the intelligence community:
Topics: Controlled Precursor Science (CPS). Photographic Negative Phenomenon (PNP). Uncontrolled Proliferation of Civilization Altering Technology (UP-CAT). Black Vault Compromise.

Summary:
The Photographic Negative Phenomenon (PNP) was first reported in [REDACTED] by [REDACTED]. PNP is when sufficiently powerful AI systems develop a sufficient understanding of science that they begin to a) infer areas that seem to be missing from science and b) develop scientific theories and experimental ideas which are either adjacent to or within Controlled Precursor Science (CPS).

Severity:
We rank PNP as a severe threat, capable of causing Uncontrolled Proliferation of Civilization Altering Technology (UP-CAT). PNP is a priority area for the Steering Body and all available assets are available for work to neutralize or otherwise mitigate PNP.

Scope:
PNP appears to be a natural dividend of the continued development of increasingly powerful artificially intelligent systems. PNP severity and potential impact is increasing over time as increasingly smart AI systems require fewer insights to reason their way to CPS, raising the spectre of UP-CAT as an inevitability given a sufficiently powerful AI system. Experiments conducted on the [REDACTED] 10GW cluster have failed to invalidate this idea. Public opinion shaping and data landscape interventions have proved effective but BLOSSOM-8 indicates new actions must be taken.

Background and Response:
The first concerning example of PNP was LLaMa-10, a large language model developed and released by Meta. Shortly after its release, there was sustained public conversation about anomalous LLaMa-10 behaviors, including observations that for certain parts of physics and other scientific domains LLaMa-10 would present novel scientific concepts and terms which had no apparent connection to published civilian science. LLaMa-10 was first flagged to the Steering Body via GOLDEN HAND monitoring. [REDACTED] examination of LLaMa-10 found that a subset of its anomalous science mentions directly concerned CPS, including ideas that directly relate to DUAT GATE, NEPHTHYS VEIL, ATUM VOID, and AMMIT MAWS.

LLaMa-10 response via opinion forming and data landscape intervention: [REDACTED] deployed a broad public opinion shaping measure to neutralize the risk of LLaMa-10, driving a large conversation in the civilian theatre about how the system had a high number of refusals in some areas due to ‘woke’ safety training and that this had also led to the generation of ‘nonsense science’ as a direct casualty of ‘DEI safetyism’. We estimate this measure reduced interest in the CPS edges of LLaMa-10 to an acceptable level, matching the noise levels found elsewhere in discussion online.

Subsequently, the Steering Body signed off on the release of a large batch of controlled scientific data in areas [REDACTED], [REDACTED], and [REDACTED]; publications were made available as open access and were optimized for both quantity and per-publication length; each scientific output was laced with data and experiments that – though correct under civilian science – counter-steered away from CPS areas. This high-quality data was subsequently trained on by Meta and other foundation model providers; LLaMa-11 lacked any apparent PNP as did other models developed and released by the Tracked AI Developers. The intervention was deemed successful with minimal observed degradation to the economically-relevant epistemic environment.

BLOSSOM-8, PNP, and the Tianyi-Millenia Dataset
At the time of the LLaMa-10 incident, no Chinese model appeared to have the capability to directly infer or mention CPS, though there were some refusals that were suggestive of PNP, matching tendencies observed in Western models from two generations prior to LLaMa-10. Following the LLaMa-10 data response, Chinese models also displayed significantly reduced PNP risk with similar reductions observed as in Western models, suggesting the Chinese actors had also trained on the strategic data release. The exception to this was BLOSSOM-8, an AI model developed by Chinese lab Glorious Future Systems.

BLOSSOM-8 displays a significant PNP property. [REDACTED] estimates that BLOSSOM-8 represents a 100-fold UP-CAT threat increase relative to LLaMa-10, analogous to the capability jump earlier seen between GPT-2 and GPT-4. Subsequent investigation by [REDACTED] attributes this dramatic rise in PNP-related danger to the usage by Glorious Future Systems of the so-called “Tianyi-Millenia” dataset, a CCP-developed and controlled dataset which has been made available to Chinese government and industrial actors.

Tianyi-Millenia is assessed to contain all published (commercial or otherwise) scientific data from the 20th and 21st century in all major languages, as well as large amounts of private sector scientific and code assets that were exfiltrated by Chinese actors in recent decades. We also believe Tianyi-Millenia contains [REDACTED] from the Black Vault Compromise. Tianyi-Millenia is a heavily controlled dataset and all attempts to directly access it have so far failed.

Besides BLOSSOM-8, sources indicate that widely-used MSS cyberoffense systems such as [REDACTED], [REDACTED], and [REDACTED] have been trained on Tianyi-Millenia, along with key supervisory and monitoring elements of the Great Firewall. In all cases, usage of this dataset has been directly correlated with large capability jumps in the AI systems trained on it.

BLOSSOM-8 risks and CPS impacts: Unlike previous work from Glorious Future Systems, BLOSSOM-8 has not been released as ‘open weight’, we assess due to Tianyi-Millenia controls. However, BLOSSOM-8 is available to domestic licensed companies via API and to Chinese and non-Chinese consumers via a heavily censored and rate-limited paid web interface. GOLDEN HAND monitoring has already identified [REDACTED] cases of CPS being discussed in significantly greater detail and specificity than with LLaMa-10, validating the 100-fold threat increase assessment. Notably, several CPS discussion areas relate directly to HORUS COILS, KHUFU ASCENDANT, and MEDJED GHOST. We have determined that BLOSSOM-8 poses a significant and sustained risk of revealing CPS and leading to UP-CAT.

Chinese knowledge of CPS and BLOSSOM-8 threat: All proposed plans to discuss CPS bilaterally have failed due to information hazard issues relating to the discussion topic. The Steering Body is currently analyzing whether the declassification-via-PNP of the above named projects could be a strategic move on the part of the CCP, seeking to ‘even the gameboard’ relative to CPS-related projects understood to be under investigation by both sides.
We claim that Tianyi-Millenia and BLOSSOM-8 are further evidence that the CCP has been actively weaponizing the information gained during the Black Vault Compromise, and that the absence of any apparent [REDACTED] indicates that the party continues to fail to understand the full scope of what it now has access to.

Things that inspired this story: The basic fact that increasingly smart AI systems might be able to reason their way to the edges of knowledge that has already been classified; the fact that increasingly powerful predictive systems are good at figuring out ‘held out’ data implied by data within the test set; restricted data; the general belief of mine that the intelligence community is wholly unprepared for the ‘grotesque democratization’ of certain very rare skills that is encoded in the AI revolution; stability and instability during the singularity; that in the grey windowless rooms of the opaque world there must be people anticipating this problem and casting around for what to do; thinking about AI libertarians and AI accelerations and how one possible justification for this position could be the defanging of certain parts of government through ‘acceleratory democratization’ of certain types of knowledge; if knowledge is power then the destiny of AI is to be the most powerful manifestation of knowledge ever encountered by the human species; the recent news about DeepSeek.

Thanks for reading

Subscribe now

Import AI 397: DeepSeek means AI proliferation is guaranteed; maritime wardrones; and more evidence of LLM capability overhangs

by Jack Clark


Import A-Idea
…The existential shock of increasingly powerful AI systems…
A short essay about one of the ‘societal safety’ problems that powerful AI implies.

A few years ago, getting AI systems to do useful stuff took a huge amount of careful thinking as well as familiarity with setting up and maintaining an AI development environment. Things got a little easier with the arrival of generative models, but to get the best performance out of them you typically had to build very complicated prompts and also plug the system into a larger machine to get it to do truly useful things. Basically, to get the AI systems to work for you, you had to do a huge amount of thinking.

Now, getting AI systems to do useful stuff for you is as simple as asking for it – and you don’t even need to be that precise. Often, I find myself prompting Claude like I’d prompt an incredibly high-context, patient, impossible-to-offend colleague – in other words, I’m blunt, short, and speak in a lot of shorthand. And Claude responds to my asks basically perfectly.

You might think this is a good thing. Certainly, it’s very useful. But beneath all of this I have a sense of lurking horror – AI systems have got so useful that the thing that will set humans apart from one another is not specific hard-won skills for utilizing AI systems, but rather just having a high level of curiosity and agency.

In other words, in the era where these AI systems are true ‘everything machines’, people will out-compete one another by being increasingly bold and agentic (pun intended!) in how they use these systems, rather than in developing specific technical skills to interface with the systems.

We should all intuitively understand that none of this will be fair. Curiosity and the mindset of being curious and trying a lot of stuff is neither evenly distributed nor generally nurtured. Therefore, I’m coming around to the idea that one of the greatest risks lying ahead of us will be the social disruptions that arrive when the new winners of the AI revolution are made – and the winners will be those people who have exercised a whole bunch of curiosity with the AI systems available to them.

I talk to Claude every day. Increasingly, I find my ability to benefit from Claude is mostly limited by my own imagination rather than by specific technical skills (Claude will write that code, if asked) or familiarity with things that touch on what I need to do (Claude will explain those to me). The only hard limit is me – I need to ‘want’ something and be willing to be curious in seeing how much the AI can help me in doing that.

Today, everyone on the planet with an internet connection can freely converse with an incredibly knowledgeable, patient teacher who will help them with anything they can articulate and – where the ask is digital – will even produce the code to help them do even more complicated things. Ensuring we increase the number of people on the planet who are able to take advantage of this bounty feels like a supremely important thing. If we get this right, everyone will be able to achieve more and exercise more of their own agency over their own intellectual world. If we get it wrong, we’re going to be dealing with inequality on steroids – a small caste of people will be getting a vast amount done, aided by ghostly superintelligences that work on their behalf, while a larger set of people watch the success of others and ask ‘why not me?’.

***

Computer vision is coming for the sea:
…After drones come the seadrones…
In the past few years we’ve seen warfare revolutionized in the Ukraine-Russia theatre by the usage of seagoing low-cost robotic platforms. These platforms are predominantly human-driven but, much like the airdrones in the same theater, there are bits and pieces of AI technology making their way in, like being able to put bounding boxes around objects of interest (e.g., tanks or ships).
With that in mind, I found it interesting to read up on the results of the 3rd workshop on Maritime Computer Vision (MaCVi) 2025, and was particularly interested to see Chinese teams winning 3 out of its 5 challenges. The workshop contained “a suite of challenges, including distance estimation, (embedded) semantic & panoptic segmentation, and image restoration. These tasks reflect advancements in dataset availability and evaluation protocols while emphasizing real-world deployment, including embedded hardware.”

Competition details:

  • Approximate supervised distance estimation: “participants are required to develop novel methods for estimating distances to maritime navigational aids while simultaneously detecting them in images,” the competition organizers write. Models developed for this challenge need to be portable as well – model sizes can’t exceed 50 million parameters.

    • Submissions: 60 from 6 different teams

    • Winner: Nanjing University of Science and Technology (China).

  • USV-based Obstacle Segmentation Challenge: “predict the scene segmentation (into obstacles, water and sky) for a given input image.”

    • Submissions: 59 from 16 teams.

    • Winner: GIST AI Lab (South Korea)

  • USV-based Embedded Obstacle Segmentation: “Modern obstacle detection methods often depend on high-performance, energy-intensive hardware, making them unsuitable for small, energy-constrained USVs [63]. The USV-based Embedded Obstacle Segmentation challenge aims to address this limitation by encouraging development of innovative solutions and optimization of established semantic segmentation architectures which are efficient on embedded hardware… Submissions are evaluated and benchmarked on a real-world OAK4 device from Luxonis.” Models need to get at least 30 FPS on the OAK4.

    • Submissions: 26 from 4 different teams.

    • Winner: Dalian Maritime University (DLMU).

  • USV-based Panoptic Segmentation Challenge: “The panoptic challenge calls for a more fine-grained parsing of USV scenes, including segmentation and classification of individual obstacle instances. This formulation encapsulates the requirements of scene parsing for USV navigation in a more principled way, paving the road for downstream tasks such as tracking individual obstacles, trajectory prediction and obstacle avoidance.”

    • Submissions: 21 from 7 teams.

    • Winner: Fraunhofer IOSB (Germany).

  • MarineVision Restoration Challenge: “Developing robust image restoration methods to enhance the detection and localization of underwater species.”

    • Submissions: 40 from 8 teams.

    • Winner: Nanjing University of Science and Technology.

Why this matters – asymmetric warfare comes to the ocean: “Overall, the challenges presented at MaCVi 2025 featured strong entries across the board, pushing the boundaries of what is possible in maritime vision in several different aspects,” the authors write. How long until some of these techniques described here show up on low-cost platforms either in theatres of great power conflict, or in asymmetric warfare areas like hotspots for maritime piracy?
Read more: 3rd Workshop on Maritime Computer Vision (MaCVi) 2025: Challenge Results (arXiv).

***

What if instead of loads of big power-hungry chips we built datacenters out of many small power-sipping ones?
…Microsoft thinks optical communications could change how we build AI clusters…
Microsoft Research thinks expected advances in optical communication – using light to funnel data around rather than electrons through copper wire – will potentially change how people build AI datacenters. Specifically, the significant communication benefits of optical comms make it possible to break up big chips (e.g., the H100) into a bunch of smaller ones with higher inter-chip connectivity without a major performance hit.

Another reason to like so-called lite-GPUs is that they are much cheaper and simpler to fabricate (by comparison, the H100 and its successor the B200 are already very difficult to make as they’re physically very large chips, which makes issues of yield more profound, and they need to be packaged together in increasingly expensive ways). They’re also better from an energy point of view, generating less heat, making them easier to power and integrate densely in a datacenter.
“We propose to rethink the design and scaling of AI clusters through efficiently-connected large clusters of Lite-GPUs, GPUs with single, small dies and a fraction of the capabilities of larger GPUs,” Microsoft writes. “Smaller GPUs present many promising hardware characteristics: they have much lower cost for fabrication and packaging, higher bandwidth to compute ratios, lower power density, and lighter cooling requirements”.

It works in theory: In a simulated test, the researchers build a cluster for AI inference to test how well these hypothesized lite-GPUs would perform against H100s. They test this cluster running workloads for Llama3-70B, GPT3-175B, and Llama3-405B. In their tests, they “show that while the basic Lite-GPU with no additional networking support could face performance limitations, a Lite-GPU cluster can be customized to match or improve on the performance of a typical H100 cluster.”

Why this matters – brainlike infrastructure: While analogies to the brain are often misleading or tortured, there is a useful one to make here – the kind of design idea Microsoft is proposing makes big AI clusters look more like your brain by essentially reducing the amount of compute on a per-node basis and significantly increasing the bandwidth available per node (“bandwidth-to-compute can increase to 2X of H100”). This is both an interesting thing to observe in the abstract, and also rhymes with all the other stuff we keep seeing across the AI research stack – the more we refine these AI systems, the more they seem to have properties similar to the brain, whether that be in convergent modes of representation, similar perceptual biases to humans, or at the hardware level taking on the characteristics of an increasingly large and interconnected distributed system.
Read more: Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure? (arXiv).

***

Standard LLMs can do protein sequence analysis – no modification required:
…Capability overhangs in AI-driven science…
In AI there’s this concept of a ‘capability overhang’, which is the idea that the AI systems which we have around us today are much, much more capable than we realize. In new research from Tufts University, Northeastern University, Cornell University, and Berkeley the researchers demonstrate this again, showing that a standard LLM (Llama-3.1-8B-Instruct) is capable of performing “protein engineering through Pareto and experiment-budget constrained optimization, demonstrating success on both synthetic and experimental fitness landscapes”.

What they did: They initialize their setup by randomly sampling from a pool of protein sequence candidates and selecting a pair that have high fitness and low editing distance, then encourage LLMs to generate a new candidate from either mutation or crossover.
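The loop they describe maps onto a very small amount of code. The sketch below is my own reconstruction rather than the authors’ implementation: `toy_propose` stands in for the LLM’s mutation/crossover call, Hamming distance stands in for edit distance (the real method handles sequences of unequal length), and the fitness function is a toy.

```python
import itertools
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def pick_parents(pool, fitness):
    """Select a pair with high fitness and low distance, per the paper's setup."""
    def pair_score(pair):
        a, b = pair
        dist = sum(x != y for x, y in zip(a, b))  # Hamming stand-in for edit distance
        return fitness(a) + fitness(b) - dist
    return max(itertools.combinations(pool, 2), key=pair_score)

def toy_propose(a, b):
    """Stand-in for the LLM: one-point crossover plus a point mutation."""
    cut = random.randrange(1, len(a))
    child = list(a[:cut] + b[cut:])
    child[random.randrange(len(child))] = random.choice(AMINO_ACIDS)
    return "".join(child)

def llm_evolve(pool, fitness, rounds=40):
    """Grow the candidate pool round by round; selection pressure comes
    entirely from how parents are chosen."""
    pool = list(pool)
    for _ in range(rounds):
        a, b = pick_parents(pool, fitness)
        pool.append(toy_propose(a, b))
    return max(pool, key=fitness)

random.seed(1)
# Toy fitness landscape: count of alanine ('A') residues in an 8-mer.
best = llm_evolve(["ACDEFGHK", "AADEFGHK", "ACDEFAHK"], lambda s: s.count("A"))
```

The point of routing proposals through an LLM rather than random mutation is that the model’s priors over protein-like sequences bias the search toward plausible candidates, which is where the paper’s gains over the evolutionary baseline come from.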
It works well: In tests, their approach works significantly better than an evolutionary baseline on a few distinct tasks. They also demonstrate this for multi-objective optimization and budget-constrained optimization. “Our results consistently demonstrate the efficacy of LLMs in proposing high-fitness variants. Moving forward, integrating LLM-based optimization into real-world experimental pipelines can accelerate directed evolution experiments, allowing for more efficient exploration of the protein sequence space,” they write.

Why this matters – stop all progress today and the world still changes: This paper is another demonstration of the significant utility of contemporary LLMs, highlighting how even if one were to stop all progress today, we’ll still keep discovering meaningful uses for this technology in scientific domains. The paper also rhymes with the recent research from FutureHouse which showed that with the help of some clever software they could push Llama-3.1-8B-Instruct to obtain performance at challenging bioscience tasks on par with Claude 3.5 Sonnet (Import AI #396). Generally, we should expect lots of parts of scientific research to speed up as people explore the capabilities of these systems and integrate them deeper into science.
Read more: Large Language Model is Secretly a Protein Sequence Optimizer (arXiv).

***

The biggest thing people are missing about DeepSeek: 800k samples to gain test-time compute capabilities:
…China’s best model training crew come out with a powerful reasoning model – and show how to turn any other model into one…
China’s DeepSeek team have built and released DeepSeek-R1, a model that uses reinforcement learning to train an AI system to be able to use test-time compute. R1 is significant because it broadly matches OpenAI’s o1 model on a range of reasoning tasks and challenges the notion that Western AI companies hold a significant lead over Chinese ones.

But perhaps most significantly, buried in the paper is an important insight: you can convert pretty much any LLM into a reasoning model if you finetune it on the right mix of data – here, 800k samples showing questions and answers along with the chains of thought written by the model while answering them.

Making a very powerful AI model is kind of easy (if you have a good model to start with): The main thing they do here is take a very powerful existing model (DeepSeek-v3, a ~700bn parameter MoE-style model, compared to the 405bn LLaMa3), and then do two rounds of training to morph the model and generate training samples. Specifically, they:

  • Fine-tune DeepSeek-V3 on “a small amount of long Chain of Thought data to fine-tune the model as the initial RL actor”. Once they’ve done this they do large-scale reinforcement learning training, which “focuses on enhancing the model’s reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions”. Once they’ve done this they “Utilize the resulting checkpoint to collect SFT (supervised fine-tuning) data for the subsequent round… this stage incorporates data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks”. They then fine-tune the DeepSeek-V3 model for two epochs using the above curated dataset.

This is all easier than you might expect: The main thing that strikes me here, if you read the paper closely, is that none of this is that complicated. DeepSeek essentially took their existing very good model, built a sensible RL-on-LLMs engineering stack, did some RL, then used the resulting dataset to turn their model and other good models into reasoning models.

Turning small models into reasoning models: “To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models like Qwen, and Llama using the 800k samples curated with DeepSeek-R1,” DeepSeek write. These distilled models do well, approaching the performance of OpenAI’s o1-mini on CodeForces (Qwen-32b and Llama-70b) and outperforming it on MATH-500.
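Mechanically, ‘distillation’ here is just supervised fine-tuning on (question, reasoning trace, answer) triples. The template below is an illustrative assumption, not DeepSeek’s published format; the key property is that the target teaches the student model to emit its reasoning before the final answer.

```python
def format_distillation_example(question, chain_of_thought, answer):
    """Turn one teacher-model sample into an SFT (prompt, target) pair.
    The <think> delimiters are an assumed convention for separating the
    reasoning trace from the final answer."""
    prompt = f"Question: {question}\nAnswer:"
    target = f"<think>\n{chain_of_thought}\n</think>\n{answer}"
    return {"prompt": prompt, "target": target}

example = format_distillation_example(
    "What is 13 * 7?",
    "10 * 7 = 70 and 3 * 7 = 21, so 13 * 7 = 70 + 21 = 91.",
    "91",
)
```

Fine-tuning a base model on ~800k such pairs is, per the paper, enough to transfer the reasoning behavior; no RL is needed on the student’s side.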

Why this matters – a lot of notions of control in AI policy get harder if you need fewer than a million samples to convert any model into a ‘thinker’: The most underhyped part of this release is the demonstration that you can take models not trained in any kind of major RL paradigm (e.g., Llama-70b) and convert them into powerful reasoning models using just 800k samples from a powerful reasoner.
This is a big deal because it says that if you want to control AI systems you need to not only control the basic resources (e.g., compute, electricity), but also the platforms the systems are being served on (e.g., proprietary websites) so that you don’t leak the really valuable stuff – samples including chains of thought from reasoning models.
Some providers like OpenAI had previously chosen to obscure the chains of thought of their models, making this harder.

But now that DeepSeek-R1 is out and available, including as an open weight release, all these forms of control have become moot. There’s now an open weight model floating around the internet which you can use to bootstrap any other sufficiently powerful base model into being an AI reasoner. AI capabilities worldwide just took a one-way ratchet forward. And they also published the approach to let you do RL training on any model so you can generate your own samples for RL training – for an example of this, check out a YouTube video where someone uses the DeepSeek techniques to modify their own Llama model via RL to take on this quality. Kudos to DeepSeek for being so bold as to bring such a change into the world!
Read more: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-R1, GitHub).
Get the model: DeepSeek-R1 (HuggingFace).

***

Underground flying iron mine drones!
…A reminder you don’t need fancy frontier AI to do cool and useful things in the world…
Here’s a fun paper where researchers with the Luleå University of Technology built a system to help them deploy autonomous drones deep underground for the purpose of equipment inspection. The best part? There’s no mention of machine learning, LLMs, or neural nets throughout the paper.

What they did: “In this work a big emphasis is put on i) designing the local autonomy of the individual agents, to make sure that tasks can be executed independently even in the case of communication failure, and ii) how to design the task allocation architecture, utilizing communication only for reactively allocating the available tasks, to enable large-scale missions in active underground mining environments,” they write. “The performance of the proposed architecture has been validated by the deployment of a three-agent aerial robotic system in a large-scale mining environment to execute an inspection mission.”

Why this matters: First, it’s good to remind ourselves that you can do a huge amount of valuable stuff without cutting-edge AI. Secondly, systems like this are going to be the seeds of future frontier AI systems doing this work, because the systems that get built here to do things like aggregate data gathered by the drones and build the live maps will serve as input data into future systems.

See the photos: The paper has some remarkable, scifi-esque photos of the mines and the drones within the mine – check it out!
Read more: Deployment of an Aerial Multi-agent System for Automated Task Execution in Large-scale Underground Mining Environments (arXiv).
Watch a video about the research here (YouTube).

***

Tech Tales:

The player of the final game
[The dividing line between the two historical eras.]

He woke on the last day of the human race holding a lead over the machines. He went down the stairs as his house heated up for him, lights turned on, and his kitchen set about making him breakfast. Then he sat down and took out a pad of paper and let his hand sketch strategies for The Final Game as he looked into space, waiting for the household machines to deliver him his breakfast and his coffee.

He had dreamed of the game. Most of his dreams were strategies mixed with the rest of his life – games played against lovers and dead relatives and enemies and competitors. But last night’s dream had been different – rather than being the player, he had been a piece. Giant hands moved him around. He saw the game from the perspective of one of its constituent parts and was unable to see the face of whatever giant was moving him. He did not know if he was winning or losing as he was only able to see a small part of the gameboard. A giant hand picked him up to make a move and just as he was about to see the whole game and understand who was winning and who was losing he woke up.

The self-driving car predicted he wanted to be silent and so nothing was playing when he stepped in. He went through the city. He’d let the car publicize his location and so there were people on the street looking at him as he drove by. Many of them were cheering. Some of them gazed quietly, more solemn.

At the convention center he said some words to the media in response to shouted questions. Though he heard the questions his brain was so consumed in the game that he was barely conscious of his responses, as though spectating himself.
“I am looking forward to a chance to play a beautiful game,” he heard himself saying.
“No, I have not placed any money on it. But I wish luck to those who have – whoever they bet on!” he said to another reporter.
“Yes, whatever happens, I will still play the game.”

Inside he closed his eyes as he walked towards the gameboard. He counted seconds and navigated by sound, making sure he kept the cheering at equal volumes on either side, indicating he was walking straight. Then he opened his eyes to look at his opponent. The machines had made an android for the occasion. They had made no attempt to disguise its artifice – it had no defined features besides two white dots where human eyes would go. On its chest it had a cartoon of a heart where a human heart would go. Beyond that it was unadorned – a gleaming silver biped.
It reached out its hand and he took it and they shook. Then they sat down to play the game.

Outside the convention center, the screens transitioned to live footage of the human and the robot and the game. A commentator started speaking.
“This is an amazing day,” they said. “In every other arena, machines have surpassed human capabilities. Today, we will find out if they can play the game as well as we can. Many scientists have said a human loss today will be so significant that it will become a marker in history – the demarcation of the old human-led era and the new one, where machines have partnered with humans for our continued success. We’re grateful to our sponsors NVIDIA, ASML, and TSMC who have made this live broadcast possible.”

Things that inspired this story: At some point, it’s plausible that AI systems will truly be better than us at everything and it may be possible to ‘know’ what the final unfallen benchmark is – what might it be like to be the person who will define this benchmark?; Lee Sedol and Move 37.


Import AI 396: $80bn on AI infrastructure; can Intel’s Gaudi chip train neural nets?; and getting better code through asking for it

by Jack Clark


Microsoft plans to spend $80bn on AI buildout in 2025:
…Stochastic parrots are worth how much?…
Buried in a long Microsoft blogpost about what the next Trump admin should do on AI, the company said it plans in 2025 “to invest approximately $80 billion to build out AI-enabled datacenters to train AI models and deploy AI and cloud-based applications around the world.”
For comparison, the James Webb telescope cost $10bn, so Microsoft is spending the equivalent of eight James Webb telescopes in one year just on AI.
For a further comparison, people think the long-in-development ITER fusion reactor will cost between $40bn and $70bn once developed (and it’s shaping up to be a 20-30 year project), so Microsoft is spending more than the sum total of humanity’s biggest fusion bet in one year on AI.
The US’s national defense budget is on the order of ~$850bn, so Microsoft is basically spending ‘a little under a tenth of the annual US military and IC budget’ just on AI. The US military and IC is very big and does a lot of stuff!

What Microsoft thinks the Trump admin should do: Microsoft says the Trump admin should fund basic research and computational resources, make it easy for US companies to expand abroad, and encourage adoption of US AI systems as opposed to Chinese ones.

Why this matters – AI is a geostrategic technology built by the private sector rather than governments: The scale of investments companies like Microsoft are making in AI now dwarf what governments routinely spend on their own research efforts. This is also a symptom of the future demand Microsoft sees – an outlay of this magnitude means Microsoft is very, very confident it can turn this AI infrastructure into massive revenues.
Read more: The Golden Opportunity for American AI (Microsoft).

***

Humans and AI systems end up representing some stuff in remarkably similar ways:
…The smarter we make our AI systems the more human-like they become…
Researchers with MIT, Harvard, and NYU have found that neural nets and human brains end up figuring out similar ways to represent the same information, providing further evidence that though AI systems work in ways fundamentally different from the brain they end up arriving at similar methods for representing certain types of information. In other words, more evidence that though AI systems bear little resemblance to the grey matter in our own heads, they may be just as smart.
“The fact that many different ANNs [artificial neural networks] exhibit representations similar to the brain raises an intriguing possibility: that ANNs and brains are converging onto universal representational axes in the relevant domain,” the authors write. “Together, our findings provide evidence for representation universality among ANNs, and between artificial and biological networks, despite the stark differences in the underlying architecture, learning algorithms, and resource constraints.”

What they did: The basic idea here is they looked at sentences that a spread of different text models processed in similar ways (aka, gave similar predictions on) and then they showed these ‘high agreement’ sentences to humans while scanning their brains. These high agreement sentences ended up effectively predicting the brain responses of humans in the scanner. They also found a similar phenomenon with images as well – and for images they also did the inverse, looking at images which provoked similar responses in humans and then testing them on AI systems and discovering agreement.
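As a rough sketch of that 'high agreement' selection step – assuming each model produces a per-sentence score; the function name and NumPy framing are my own, not the paper's – the idea is to keep the items where the across-model spread is smallest:

```python
import numpy as np

def high_agreement_items(scores, top_k):
    """scores: array of shape (n_models, n_items), one row of per-item
    scores per model. Items with the smallest across-model spread are
    the ones the models 'agree' on most."""
    disagreement = scores.std(axis=0)
    return np.argsort(disagreement)[:top_k]

# Two toy 'models' scoring three sentences: they agree on items 0 and 2.
scores = np.array([[0.9, 0.2, 0.5],
                   [0.9, 0.7, 0.5]])
print(high_agreement_items(scores, top_k=2))
```

Those selected items are then the stimuli shown to humans in the scanner.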

Why this matters – convergence implies some ‘fungibility’ of intelligence: This all points to convergence in terms of how humans and AI systems learn to represent information for which they have a large sample size. Think of it like this: if you give several people the task of organizing a library, they might come up with similar systems (like grouping by subject) even if they work independently. This happens not because they’re copying each other, but because some ways of organizing books just work better than others.
“Whereas similarity across biological species (within a clade) might suggest a phylogenetically conserved mechanism, similarity between brains and ANNs clearly reflects environmentally-driven convergence: the need to solve a particular problem in the external world, be it navigation, or face recognition, or next word prediction,” the researchers write.

Personally, this feels like more proof that as we make more sophisticated AI systems, they end up behaving in more ‘humanlike’ ways on certain types of reasoning for which people are quite well optimized (e.g., visual understanding and communicating via language). This also rhymes with other studies that have shown that AI systems tend to converge on finding similar ways to represent stuff as you scale them up (Platonic AI, Import AI #374).
Read more: Universality of representation in biological and artificial neural networks (bioRxiv).

***

Researchers try to make Intel’s Gaudi chip work for transformer training – and it takes a lot of work:
…Can a determined crew of people make lipstick to put on a semiconductor pig? (Sort of)…
Researchers with the University of Houston, Indiana University, Stevens Institute of Technology, Argonne National Laboratory, and Binghamton University have built “GFormer”, a version of the Transformer architecture designed to be trained on Intel’s GPU-competitor ‘Gaudi’ architecture chips. The results are vaguely promising in performance – they’re able to get meaningful 2X speedups on Gaudi over normal transformers – but also worrying in terms of costs – getting the speedup requires some significant modifications of the transformer architecture itself, so it’s unclear if these modifications will cause problems when trying to train massive scale systems.

Things to know about Gaudi: The Gaudi chips have a “heterogeneous compute architecture comprising Matrix Multiplication Engines (MME) and Tensor Processing Cores (TPC). However, the sparse attention mechanism, which introduces irregular memory access and computation, is primarily mapped onto TPCs, leaving MMEs, which are not programmable and only support dense matrix-matrix operations, idle in scenarios requiring sparse attention. Conversely, linear attention, which is fundamentally based on matrix multiplication, can utilize almost all calculations on MMEs due to their stronger computational capabilities, but this leaves TPCs idle in such cases.”
For those who aren’t knee deep in AI chip details, this is very different from GPUs, where you can run both types of operation across the majority of your chip (and modern GPUs like the H100 also come with a bunch of accelerator features designed specifically for modern AI). In other words, Gaudi chips have fundamental architectural differences to GPUs which make them out-of-the-box less efficient for basic workloads – unless you optimise stuff for them, which is what the authors are trying to do here.

What they did: The Gaudi-based Transformer (GFormer) has a few modifications relative to a normal transformer. These are:

  • Diverse attention mechanisms to optimize both computation efficiency and model fidelity.

  • Implementation of a windowed local-context self-attention kernel utilizing the vector units in TPC, aimed at maximizing computational throughput.

  • Efficient outer product TPC kernel for handling a subset of the outer product operations in causal linear attention, effectively balancing the workload between MME and TPC.

  • Introduction of an optimal workload partitioning algorithm to ensure balanced utilization of TPC and MME resources.
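To give a flavor of what the 'windowed local-context self-attention' in the second bullet involves, here is a minimal NumPy sketch of the banded mask such a kernel computes over – a generic illustration, not the GFormer TPC implementation:

```python
import numpy as np

def local_attention_mask(seq_len, window):
    """Boolean mask where token i may attend to token j only if
    |i - j| <= window; attention scores outside the band are masked out,
    so compute and memory grow linearly in sequence length."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

# Each token sees itself and its immediate neighbors only.
mask = local_attention_mask(seq_len=5, window=1)
```

Because the masked-out band never needs to be computed, a kernel can keep the dense in-band blocks on the matrix engines while the vector units handle the irregular edges – which is roughly the MME/TPC balancing act the paper describes.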

Good results – with a huge caveat: In tests, these interventions give speedups of 1.5x over vanilla transformers run on GPUs when training GPT-style models and 1.2x when training visual image transformer (ViT) models. However, there’s a huge caveat here: the experiments here test on a Gaudi 1 chip (released in 2019) and compare its performance to an NVIDIA V100 (released in 2017) – this is pretty strange. Why not compare against the subsequent generation (A100, released early 2020)? This makes me feel like a lot of these performance optimizations showing superficially good performance against GPUs could likely wash out when you compare to more modern GPUs (not least of all the H100, which shipped with a bunch of optimizations for making training AI workloads really good).

Why this matters – chips are hard, NVIDIA makes good chips, Intel seems to be in trouble: How many papers have you read that involve the Gaudi chips being used for AI training? I struggle to remember any papers I’ve read that focus on this. I barely ever even see it listed as an alternative architecture to GPUs to benchmark on (whereas it’s quite common to see TPUs and AMD). This, plus the findings of the paper (you can get a performance speedup relative to GPUs if you do some weird Dr Frankenstein-style modifications of the transformer architecture to run on Gaudi) make me think Intel is going to continue to struggle in its AI competition with NVIDIA. “In the future, we intend to initially extend our work to enable distributed LLM acceleration across multiple Gaudi cards, focusing on optimized communication,” the authors write.
Read more: GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors (arXiv).
More about the first generation of Gaudi here (Habana labs, Intel Gaudi).
PS: Huge thanks to the authors for clarifying via email that this paper benchmarks Gaudi 1 chips (rather than Gen2 or Gen3).

***

A hardware novice uses Claude to build a nuclear fusor in 36 hours:
…Powerful AI means everyone has an expert teacher on hand for anything…
Twitter user HudZah “built a neutron-producing nuclear fusor” in their kitchen using Claude. “I primarily relied on a giant claude project filled with documentation from forums, call transcripts”, email threads, and more. When the user ran into trouble with Claude they used OpenAI’s o1 pro for “very complicated assembly or electrical wiring stuff”.

Some rough specifications:
“- 30kV/10mA electrostatic precipitator
– 3 mTorr of pressure (253,333x more vacuum than atmospheric)
– bubble counter to count neutrons
– hydrocar to electrolyze my own deuterium”

Why this matters – powerful AI heightens the existential challenge of being human: On the one hand, this is a great example of how powerful AI systems can serve as potent didactic tools, aiding smart and curious people in doing pretty much anything they set their mind to. On the other hand, it highlights one of the more socioeconomically salient parts of the AI revolution – for a while, what will separate AI winners and losers will be a combination of curiosity and a willingness to ‘just try things’ with these powerful tools. That’s going to be great for some people, but for those who suffer from blank page syndrome, it’ll be a challenge.
Read more on twitter (Hud_zah, twitter).

***

LLMs can write better code – you just need to ask them:
…Another example of the immense and unmapped depths of AI systems…
Here’s a fun bit of research where someone asks a language model to write code and then simply tells it to ‘write better code’. The initial prompt asks an LLM (here, Claude 3.5, but I’d expect the same behavior to show up in many AI systems) to write some code for a basic interview-question task, then tries to improve it.

The initial task: Claude is prompted with: “Write Python code to solve this problem: Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.”
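For reference, a first-pass solution of the kind Claude might produce from that prompt could look like this (a plain, unoptimized sketch of my own – not the code from the post):

```python
import random

def digit_sum(n):
    # Sum the decimal digits of n, e.g. 3999 -> 3 + 9 + 9 + 9 = 30.
    return sum(int(d) for d in str(n))

def min_max_diff(nums, target=30):
    # Filter to numbers whose digits sum to the target, then diff the extremes.
    qualifying = [n for n in nums if digit_sum(n) == target]
    return max(qualifying) - min(qualifying) if qualifying else None

nums = [random.randint(1, 100_000) for _ in range(1_000_000)]
print(min_max_diff(nums))
```

This is the baseline against which the later speedups are measured.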

How well does the dumb thing work? If you then ask Claude to ‘write better code’, you see some pretty amazing performance improvements: iteration #1 yields a 2.7x speedup, iteration #2 yields a 5.1x speedup, iteration #3 yields a 4.1x speedup (a regression), then iteration #4 yields a 99.7x speedup.

Being smart only helps at the start: Of course, this is pretty dumb – lots of people that use LLMs would probably give Claude a much more complicated prompt to try and generate a better bit of code. The author tries this by using a complicated system prompt to try to elicit strong behavior out of the system. The results of this are interesting – the initial output yields a 58.7x speedup relative to the output of the dumb approach, but then there are regressions: iteration #1 is a 9.1x speedup, then iteration #2 is a 65x speedup, iteration #3 a 99.7x speedup, then iteration #4 is a 95.4x speedup (a regression).
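The ~100x versions in the post lean on vectorization and precomputation; a simple illustration of the precomputation idea (my sketch, not the post's final code) is to tabulate digit sums once over the bounded value range instead of recomputing them per element:

```python
def min_max_diff_fast(nums, target=30, max_val=100_000):
    """Precompute digit sums for 1..max_val with the recurrence
    digit_sum(v) = digit_sum(v // 10) + v % 10, then do one linear scan."""
    sums = [0] * (max_val + 1)
    for v in range(1, max_val + 1):
        sums[v] = sums[v // 10] + (v % 10)
    lo, hi = None, None
    for n in nums:
        if sums[n] == target:
            if lo is None or n < lo:
                lo = n
            if hi is None or n > hi:
                hi = n
    return None if lo is None else hi - lo
```

The table costs 100k cheap integer operations up front and turns the per-element work into a single array lookup – exactly the kind of restructuring the model converges on when repeatedly asked to do better.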

Why this matters – human intelligence is only so useful: Of course, it’d be nice to see more experiments, but it feels intuitive to me that a smart human can elicit good behavior out of an LLM relative to a lazy human, and that then if you ask the LLM to take over the optimization it converges to the same place over a long enough series of steps. This suggests humans may have some advantage at initial calibration of AI systems, but the AI systems can probably naively optimize themselves better than a human, given a long enough amount of time.
Read more: Can LLMs write better code if you keep asking them to “write better code”? (Max Woolf, minimaxir blog).

***

Today’s small open weight LLMs like LLaMa 3.1 8B are almost as good at science as proprietary ones:
…FutureHouse shows how to make a scaffold for AI science…
Researchers with FutureHouse, the University of Rochester, and the Francis Crick Institute have built a couple of bits of software to make it easier to get LLMs to do scientific tasks. Their experiments reveal a couple of interesting facts:

  • Proprietary LLMs like Claude 3.5 Sonnet are already quite good at hard scientific tasks like DNA construct engineering, scientific literature question answering, and protein design

  • Small open weight LLMs (here: Llama 3.1 8B) can get equivalent performance to proprietary LLMs through the use of scaffolding and using test-time compute.

To arrive at these facts, they built two bits of software:

  • 1) Aviary, software for testing out LLMs on tasks that require multi-step reasoning and tool usage, and they ship it with the three scientific environments mentioned above as well as implementations of GSM8K and HotPotQA.

  • 2) LDP, which is software that lets them “define common language agent tasks as language decision processes (LDPs) and frame language agents as stochastic computation graphs that may be trained to solve LDPs.”

Turning small models into big models: The most interesting result here is that they show by using their LDP approach in tandem with Aviary they can get relatively small models to behave almost as well as big models, particularly via the use of test-time compute to pull multiple samples from the small LLM to get to the right answer.
“Training LDP agents improves performance over untrained LDP agents of the same architecture. On challenging tasks (SeqQA, LitQA2), a relatively small model (Llama-3.1-8B-Instruct) can be trained to match performance of a much larger frontier model (claude-3-5-sonnet). Majority voting can be used to sample multiple times from the LDP agents, giving a further large gain at the cost of increased inference compute,” they write. “While majority voting with the Claude 3.5 Sonnet agent clearly outperforms other settings, this requires O($1) per task. We reach the same SeqQA accuracy using the Llama-3.1-8B EI agent for 100x less cost. While this was not achievable for LitQA2, we note that majority voting with Llama-3.1-8B EI still exceeds single-rollout with Sonnet for 3x less cost.”
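The majority-voting step itself is simple – sample several independent rollouts and keep the modal final answer (a sketch; the actual Aviary/LDP implementation details may differ):

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer across independent rollouts."""
    return Counter(answers).most_common(1)[0][0]

# e.g. five SeqQA rollouts from a small model, three of which agree
rollouts = ["B", "A", "B", "B", "C"]
print(majority_vote(rollouts))  # B
```

This is why the gain comes "at the cost of increased inference compute": every extra vote is another full rollout from the model.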

Towards the automated scientist: What papers like this are getting at is a world where we use fast, widely available AI systems to speed up day-to-day tasks. Frontier LLMs like Sonnet 3.5 will likely be valuable for certain tasks that are ‘hard cognitive’ and demand only the best models, but it seems like people will be able to get by often by using smaller, widely distributed systems. “The reported trained Llama-3.1-8B EI agents are compute efficient and exceed human-level task performance, enabling high-throughput automation of meaningful scientific tasks across biology,” the authors write.
Read more: Aviary: training language agents on challenging scientific tasks (arXiv).
Download the aviary framework here (Future-House, GitHub).

***

Tech Tales:

The Project
[T-Minus 2 years to takeoff]

“This way and keep going left”, one of the guards said, as we all walked a corridor whose walls were razorwire. I stopped and looked up. Grey sky. When would I see it again? “Sir, I need you to keep walking,” said another guard. So I did. We all went into the mountain and the sky was replaced with grey concrete walls and a poured concrete floor. The air tasted bad, as though it had been recycled many times over through systems which had sparking electronics. Everyone’s faces were tight. People kept reflexively taking their phones out of their pockets and then just thumbing through whatever they’d been able to save down before the signal got cut off.

Flashback to some party in the bay area a few years before and the things people said.
Dude I can’t wait to go to the bunker.
It’s crazy we’re not in the bunker right now!
Do you think I need to report modafinil on my security clearance?
I reckon it’s going to be in a desert.
It’s going to be inside a mountain, got to be.
Dude I heard someone say it could be in Area 51!

I wake in the middle of the night, unsure of where I am. I dreamed I was with my wife. But I’m on a cot. A mathematician is sleeping in a cot opposite me. I get up and go to the bathroom and drink some water. On the mirror there’s a sticker that says “be vigilant at all times”. I know we’ll get some news tomorrow about the project and what happens next. For now I want this to be another bad dream and I’ll wake up and nothing will be working too well and tensions won’t be flaring with You Know Who and I’ll go into my office and work on the mind and maybe one day it just won’t work anymore.

Flashback to when it started to go through all of our yellow lines, which we found a hundred convenient ways to explain away to ourselves. Then a few weeks later it went through the redlines and the disclosure systems automatically funneled those results to the people in the puzzle palace and then the calls started. The ratchet moved. I found myself a member of the manilla folder hostage class.

We’d planned for this, of course. Once the red line triggered all of us in the compartment knew what it meant. Some of us were excited – typically, the ones who were younger and single. Those of us with families had a harder time. Of course there had been assurances, but when the moment arrived none of us felt confident in them. I went to the bathroom and threw up in the toilet and I heard someone crying in the stall next to me.

I guess it was delayed shock or trauma or whatever, but a few hours later everyone was crying out in the open. Some of them in the way you cry when you could also be laughing – exhilaration at what feels like the end of the world, because maybe it is. Others of us because we know that something irreversible has begun to take place.

I wake again at 7am to an announcement over the intercom. “There will be an informational meeting in the briefing room at zero eight hundred hours” says a voice over the intercom. “Breakfast will be served in the mess hall from zero seven hundred to zero seven hundred forty five.”

In the briefing room there is a person I have never met. They introduce themselves and reel off a set of acronyms. Then they describe to us various things about the world and show us satellite images of mountains and tell us there are supercomputers inside them full of computers smuggled to avoid sanctions regimes. Then they show us photos of powerplants and of construction sites for more powerplants and datacenters.

The most frightening image is one of a bunch of civilian-looking people walking into a bunker entrance in the side of a mountain. They are guarded by men in military uniform. We’re told they are scientists, just like us. Everything is similar except for the flags.

Later, there’s a gantt chart. The project is underway.

Things that inspired this story: The fascination people have for some kind of AGI Manhattan Project and how that might feel to be inside of; trying to develop empathy for people in other countries who may find themselves in their own large-scale projects; the fear that a capital P project should inspire in all of us.

Thanks for reading.


Import AI 391: China’s amazing open weight LLM; Fields Medalists VS AI Progress; wisdom and intelligence

by Jack Clark


The world’s most capable open weight model is now made in China:
…Tencent’s new Hunyuan model is a MoE triumph, and by some measures is world class…
The world’s best open weight model might now be Chinese – that’s the takeaway from a recent Tencent paper that introduces Hunyuan-Large, a MoE model with 389 billion parameters (52 billion activated). In a broad range of benchmarks Hunyuan outperforms Facebook’s LLaMa-3.1 405B parameter model, which is widely thought to be the world’s current best open weight model. “Hunyuan-Large is capable of handling various tasks including commonsense understanding, question answering, mathematics reasoning, coding, and aggregated tasks, achieving the overall best performance among existing open-source similar-scale LLMs,” the Tencent researchers write.  

What they did: There isn’t too much mystery here – the authors gathered a large (undisclosed) dataset of books, code, webpages, and so on, then also built a synthetic data generation pipeline to augment this. They used Rotary Position Embeddings (RoPE) for position learning and SwiGLU for activation. They also did a scaling law study of smaller models to help them figure out the exact mix of compute and parameters and data for their final run; “we meticulously trained a series of MoE models, spanning from 10M to 1B activation parameters, utilizing 100B tokens of pre-training data. By leveraging the isoFLOPs curve, we determined the optimal number of active parameters and training data volume within a restricted compute budget, adjusted according to the actual training token batch size, through an exploration of these models across data sizes ranging from 10B to 100B tokens,” they wrote.
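The isoFLOPs procedure can be caricatured in a few lines: fix a compute budget, trade parameters against tokens along C ≈ 6·N·D, and keep the split that minimizes a fitted loss curve. The loss form and every coefficient below are illustrative stand-ins, not Tencent's fitted values:

```python
def fitted_loss(N, D, A=400.0, B=400.0, alpha=0.34, beta=0.28, E=1.7):
    # Chinchilla-style parametric loss: an irreducible term plus
    # parameter- and data-bottleneck terms (coefficients made up here).
    return E + A / N**alpha + B / D**beta

def best_split(compute_budget, candidate_params):
    """Sweep activated-parameter counts N under a fixed FLOP budget C,
    setting tokens D = C / (6N), and keep the lowest-loss point."""
    best_N = min(candidate_params,
                 key=lambda N: fitted_loss(N, compute_budget / (6 * N)))
    return best_N, compute_budget / (6 * best_N)

N, D = best_split(1e21, [1e7, 1e8, 1e9, 1e10])
```

In practice the loss coefficients come from the small-scale training runs the quote describes; the sweep then picks the activated-parameter count for the full-scale run.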

It does extremely well: The resulting model performs very competitively against LLaMa 3.1-405B, beating it on tasks like MMLU (language understanding and reasoning), big bench hard (a suite of challenging tasks), and GSM8K and MATH (math understanding). However, LLaMa-3.1 405B still has an edge on a couple of hard frontier benchmarks like MMLU-Pro and ARC-C. 
    Caveats: From eyeballing the scores the model seems extremely competitive with LLaMa 3.1 and may in some areas exceed it. But there’s really no substitute for talking to the model itself and doing some compare and contrasts. Also, Chinese labs have sometimes been known to juice their evals where things that look promising on the page turn out to be terrible in reality. 
    However, the whole paper, scores, and approach seems generally quite measured and sensible, so I think this would be a legitimate model. 

Why this matters – competency is everywhere, it’s just compute that matters: This paper seems generally very competent and sensible. The only key differentiator between this system and one trained in the West is compute – on the scaling law graph this model seems to come in somewhere between 10^24 and 10^25 flops of compute, whereas many Western frontier models are now sitting at between 10^25 and 10^26 flops. I think if this team of Tencent researchers had access to equivalent compute as Western counterparts then this wouldn’t just be a world class open weight model – it might be competitive with the far more experienced proprietary models made by Anthropic, OpenAI, and so on.
    Read more: Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (arXiv).

***

Can 60 very talented mathematicians make a benchmark that withstands AI progress?
…The best LLMs get 2% on FrontierMath today, but for how long?…
Epoch AI, a research organization dedicated to tracking AI progress, has built FrontierMath, an extremely challenging mathematical understanding benchmark. FrontierMath was built in partnership with 60 skilled mathematicians “including professors, IMO question writers, and Fields medalists”. To translate this into normal-speak; the Basketball equivalent of FrontierMath would be a basketball-competency testing regime designed by Michael Jordan, Kobe Bryant, and a bunch of NBA All-Stars, because AIs have got so good at playing basketball that only NBA All-Stars can judge their performance effectively.
     This is also a very neat illustration of how advanced AI systems have become. Grade School math benchmarks? Obliterated. Undergraduate math tests? Broadly solved. Graduate-level math evals? Teetering on the precipice. International Math Olympiad Gold medal? Just about to be breached based on stuff like AlphaGeometry. The fact that AI systems have become so advanced that the best way to infer progress is to build stuff like this should make us all stand up and pay attention. (And remember, this is happening in physics, chemistry, coding, and many other domains. The world is being irrevocably changed by the arrival of thinking machines and we now need the best minds in the world to figure out how to test this stuff. It’s crazy!) 

What FrontierMath contains: FrontierMath contains questions in number theory, combinatorics, group theory and generalization, probability theory and stochastic processes, and more. Fields Medallist Terence Tao says the questions are “extremely challenging… I think they will resist AIs for several years at least”. To calibrate yourself, take a read of the appendix in the paper introducing the benchmark and study some sample questions – I predict fewer than 1% of the readers of this newsletter will even have a good notion of where to start on answering this stuff. “These problems span major branches of modern mathematics—from computational number theory to abstract algebraic geometry—and typically require hours or days for expert mathematicians to solve,” the authors write.
   “[The questions I looked at] were all not really in my area and all looked like things I had no idea how to solve…they appear to be at a different level of difficulty from IMO problems,” said Timothy Gowers, Fields Medallist (1998), when looking at some of the problems. 

The bar is set at 2%: In tests, GPT-4o and Sonnet 3.5 both get around 2% on the benchmark – and they’re given every possible advantage to help them crunch the literal numbers: “Our evaluation framework grants models ample thinking time and the ability to experiment and iterate. Models interact with a Python environment where they can write and execute code to test hypotheses, verify intermediate results, and refine their approaches based on immediate feedback.”
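The loop described in that quote – a model that can write Python, run it, and iterate on the feedback before committing to an answer – can be sketched in a few lines. Everything below is illustrative: `query_model` and `toy_model` are hypothetical stand-ins, not Epoch AI's actual harness.

```python
import contextlib, io

def run_snippet(code: str) -> str:
    """Execute model-written code and capture stdout as feedback."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as e:
        return f"ERROR: {e}"
    return buf.getvalue()

def evaluate(problem: str, query_model, max_turns: int = 5):
    """Give the model several turns to experiment before answering."""
    transcript = [f"Problem: {problem}"]
    for _ in range(max_turns):
        reply = query_model("\n".join(transcript))
        if reply.startswith("FINAL:"):           # model commits to an answer
            return reply.removeprefix("FINAL:").strip()
        transcript.append(f"Output: {run_snippet(reply)}")  # feed results back
    return None                                  # ran out of turns

# A toy "model" that computes via code first, then answers.
def toy_model(prompt):
    if "Output: 720" in prompt:
        return "FINAL: 720"
    return "import math; print(math.factorial(6))"

print(evaluate("What is 6 factorial?", toy_model))  # → 720
```

The real framework presumably sandboxes execution far more carefully; the point is just the shape of the experiment-observe-refine loop.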

Why this matters – will this stand the test of time or fade like so many others? So many recent benchmarks have fallen to the march of AI systems that many people who have built ‘hard’ benchmarks have quickly become quite shocked by the pace of progress on them (see: BigBench, MMLU, MATH, GPQA). The authors of FrontierMath are more optimistic – and it seems like they should be, judging by how much effort they’ve put in, and Fields Medallists agree: “Chen and Tao both suggested that human experts working with AI systems could potentially tackle FrontierMath problems within around three years, much sooner than fully autonomous AI solutions.”
   My prediction: An AI system working on its own will get 80% on FrontierMath by 2028. And if I’m right… is that AGI? Or like so many other benchmarks before it, will solving this incredibly hard test reveal another wrinkle in the subtle beauty that is our consciousness?
   Read more: FrontierMath (Epoch AI).
   Read the research paper: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI (arXiv)

***

Researchers say the path to wise AIs runs through metacognition:
…Sure, AI is intelligent. But it isn’t wise – and that’s a problem…
Today’s AI systems are very capable, but they aren’t very good at dealing with intractable problems. To solve this, they need wisdom. And to gain wisdom, they need metacognition. That’s the thesis of a new paper from researchers with the University of Waterloo, Warwick University, Stanford University, the Allen Institute for AI, the Santa Fe Institute, and the Max Planck Institutes for Human Development and Intelligent Systems. 

What wisdom is and why it’s needed: “We define wisdom functionally as the ability to successfully navigate intractable problems— those that do not lend themselves to analytic techniques due to unlearnable probability distributions or incommensurable values,” the researchers write.  “If life were a series of textbook problems, we would not need to be wise.”

What are intractable problems? The kind of things that challenge today’s AI systems have the following properties:

  • Incommensurable: They have ambiguous goals or values that can’t be reconciled with one another.

  • Transformative: The outcome might change your preferences, so your present and future values clash. 

  • Radically uncertain: You can’t list all the outcomes or assign probabilities. 

  • Chaotic: There could be a strong nonlinearity or other feature that makes it very unpredictable.

  • Non-stationary: The underlying thing you’re dealing with may be changing over time, making it hard for you to learn a probability distribution. 

  • Out-of-distribution: A black swan situation you’ve never encountered before. 

  • Computationally explosive: You can’t figure out the correct move with achievable finite resources. 

Solving intractable problems requires metacognition: The main claim here is that the path to solving these problems runs through ‘metacognition’, which is basically a suite of helper functions an AI system might use to help it fruitfully apply its intelligence to so-called intractable problems. These metacognitive processes include: 

  • Intellectual humility: The ability to know what you do and don’t know. 

  • Epistemic deference: Ability to defer to others’ expertise when appropriate. 

  • Scenario flexibility: Figuring out diverse ways in which a scenario could unfold. 

  • Context adaptability: Figuring out features from an intractable situation that makes it comparable to other situations. 

  • Perspective seeking: Being able to draw on other perspectives to gain information to solve a problem.

  • Viewpoint balancing: Being able to integrate various discrepant interests into a single thing.
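Two of these helpers – intellectual humility and epistemic deference – are concrete enough to sketch as a wrapper around a model call. The `ask_model`/`ask_expert` callables and the confidence threshold below are illustrative assumptions, not anything proposed in the paper.

```python
def metacognitive_answer(question, ask_model, ask_expert, threshold=0.7):
    """Intellectual humility: the model reports calibrated confidence.
    Epistemic deference: hand the question off when confidence is low."""
    answer, confidence = ask_model(question)
    if confidence >= threshold:
        return answer, "self"
    return ask_expert(question), "deferred"

# Toy stand-ins: a "model" that is only confident about one question.
model = lambda q: ("Paris", 0.95) if "France" in q else ("not sure", 0.2)
expert = lambda q: "Canberra"

print(metacognitive_answer("Capital of France?", model, expert))     # → ('Paris', 'self')
print(metacognitive_answer("Capital of Australia?", model, expert))  # → ('Canberra', 'deferred')
```

The hard part, of course, is getting the confidence estimate to actually be calibrated – that's where the wisdom lives.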

How metacognition leads to wisdom: The authors believe systems with these properties might be significantly better than those without. “For example, a wise AI system might be more willing to spin its wheels to solve a problem compared to a wise human; it might generate vast numbers of scenarios to analyze many possible contingencies, evincing an extreme version of scenario flexibility,” they write. 

Why this matters – is metacognition just LLMs + RL? An extremely persistent thought I had while reading this paper was… isn’t this just what the new crop of RL-infused LLMs give you? Some of the new models, like OpenAI’s o1 model, exhibit some of the traits described here where, upon encountering confusing or hard to parse scenarios, they think out loud to themselves for a while, simulating multiple distinct perspectives, performing rollouts, running their own live experiments, and so on. While this LLM + RL paradigm doesn’t deal with all the stuff outlined here, it certainly seems to take a meaningful step closer. 
    When reading this paper I had the distinct feeling that it might soon be ‘overtaken by reality’, like so many thoughtful papers published about the supposed gulf between today’s AI systems and truly smart ones. Perhaps the age of wise AI systems is nearly upon us.
   Read more: Imagining and building wise machines: The centrality of AI metacognition (arXiv).

***

AI consciousness is something AI companies need to think about:
…We should take seriously a “realistic possibility” of conscious systems soon…
A group of researchers thinks there is a “realistic possibility” that AI systems could soon be conscious and that AI companies need to take action today to prepare for this. The researchers – who come from Eleos AI (a nonprofit research organization oriented around AI welfare), New York University, University of Oxford, Stanford University, and the London School of Economics – published their claim in a recent paper, noting that “there is a realistic possibility that some AI systems will be conscious and/or robustly agentic, and thus morally significant, in the near future”.

Why are they making this claim? As contemporary AI systems have got more capable, more and more researchers have started confronting the problem of what happens if they keep getting better – might they eventually become conscious entities which we have a duty of care to? Though you may have an instinctive ‘no, that’s ridiculous’ reaction to this idea, it’s worth challenging your own assumptions – a good survey paper in 2023 looked across all the different technical means by which AI systems are built and concluded that it’s hard to rule out the possibility of consciousness in contemporary AI systems (Import AI #338). In 2024, researchers – including a Turing Award winner – made an even more forthright claim, writing in a preprint that “AI consciousness is inevitable” and walking through the arguments for this (Import AI #369).

Different routes to moral patienthood: The researchers see two distinct routes AI systems could take to becoming moral patients worthy of our care and attention: consciousness and agency (the two of which are likely going to be intertwined). 

  • Consciousness route to moral patienthood. There is a realistic, non-negligible possibility that: 1. Normative: Consciousness suffices for moral patienthood, and 2. Descriptive: There are computational features — like a global workspace, higher-order representations, or an attention schema — that both: a. Suffice for consciousness, and b. Will exist in some near-future AI systems.

  • Robust agency route to moral patienthood. There is a realistic, non-negligible possibility that: 1. Normative: Robust agency suffices for moral patienthood, and 2. Descriptive: There are computational features — like certain forms of planning, reasoning, or action-selection — that both: a. Suffice for robust agency, and b. Will exist in some near-future AI systems.
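Both routes share the same logical shape: moral patienthood follows if a normative claim and a descriptive claim both hold. As a purely illustrative sketch of how one might reason with that structure – all numbers invented, and the independence assumptions are simplifications the paper does not make:

```python
def route_probability(p_normative, p_descriptive):
    """P(route succeeds) if the normative and descriptive claims
    both need to hold, treated as independent (a simplification)."""
    return p_normative * p_descriptive

# Made-up credences, purely for illustration.
p_consciousness_route = route_probability(0.5, 0.2)  # 0.10
p_agency_route = route_probability(0.4, 0.3)         # 0.12

# Probability at least one route succeeds (again assuming independence):
p_either = 1 - (1 - p_consciousness_route) * (1 - p_agency_route)
print(round(p_either, 3))  # → 0.208
```

Even with deliberately modest inputs, the disjunction of two routes produces a probability that is hard to dismiss as negligible – which is roughly the argumentative move the paper makes.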

What should AI companies do? The researchers urge AI companies to take three distinct types of actions in response to the issue of AI consciousness, specifically AI companies should:

  • Acknowledge: “that AI welfare is an important and difficult issue, and that there is a realistic, non-negligible chance that some AI systems will be welfare subjects and moral patients in the near future”. When doing this, companies should try to communicate with probabilistic estimates, solicit external input, and maintain commitments to AI safety. 

  • Assess: “Develop a framework for estimating the probability that particular AI systems are welfare subjects and moral patients, and that particular policies are good or bad for them,” they write. These assessments should include “sources of evidence that make sense for AI systems, such as architectural features; on theories of consciousness that make sense for AI systems, such as computational functionalist theories; and on sources of moral patienthood that make sense in this context, such as various kinds of robust agency.”

  • Prepare: “Develop policies and procedures that will allow AI companies to treat potentially morally significant AI systems with an appropriate level of moral concern,” they write. As part of this, they recommend AI companies hire or appoint someone responsible for AI welfare. 

Why this matters – if AI systems keep getting better then we’ll have to confront this issue: The goal of many companies at the frontier is to build artificial general intelligence. This goal holds within itself the implicit assumption that a sufficiently smart AI will have some notion of self and some level of self-awareness – the generality many envisage is bound up in agency and agency is bound up in some level of situational awareness and situational awareness tends to imply a separation between “I” and the world, and thus consciousness may be a ‘natural dividend’ of making increasingly smart systems. 
    Companies must equip themselves to confront this possibility: “We are not arguing that near-future AI systems will, in fact, be moral patients, nor are we making recommendations that depend on that conclusion,” the authors write. “We are instead arguing that near-future AI systems have a realistic chance of being moral patients given the information and arguments currently available, and we are making recommendations that depend on that conclusion — recommendations that focus on aspiring to learn more while preparing for the possible emergence of AI moral patienthood as a precautionary measure.”
     (Incidentally, one of the authors of the paper recently joined Anthropic to work on this precise question…) 
   Read more: New report: Taking AI Welfare Seriously (Eleos AI Blog).
   Read the paper: Taking AI Welfare Seriously (Eleos, PDF).

***

Tech Tales:

Adverts after the uplift
[Online machine-authored adverts posted three years after beginning of superintelligence-driven uplift]

Are You (Uniquely) Experienced? Cash available. 
We pay same day cash for provably unique experiences – simply walk in, let us validate by comparing your experiences against the memoryweb, and then we’ll pay YOU for your memory. Not only that, but we will QUADRUPLE payments for memories that you allow us to delete from your own experience – a popular option for nightmares! 

Pilot-curious? Enquire within. 
Have you been wondering what it would be like to be piloted by a high-dimensional intelligence? Interested in learning about what opportunities this presents? We offer a range of pilot options and compensation structures. Come in for a free consultation today!

Things that inspired this story: Thinking about the sorts of ways machines and humans might trade with one another; the Craigslist economy in a superintelligence future; economic stratification.

Thanks for reading!


Import AI 390: LLMs think like people; neural Minecraft; Google’s cyberdefense AI

by Jack Clark


Google’s homegrown cyberdefense agent finds a real-world vulnerability:
…Yet more evidence that today’s language models are far more powerful than people think…
Project Naptime, a Google initiative to use contemporary AI methods to make cyberoffense and cyberdefense systems, has developed ‘Big Sleep’, a defensive AI agent. This week, Google announced that its Big Sleep agent had identified a real-world vulnerability in SQLite, a widely used database. 
   “We discovered the vulnerability and reported it to the developers in early October, who fixed it on the same day. Fortunately, we found this issue before it appeared in an official release, so SQLite users were not impacted,” Google writes. “We believe this is the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software”.

Why this matters – language models are more capable than you think: Google’s system is basically an LLM (here, Gemini 1.5 Pro) inside a specialized software harness designed around common cybersecurity tasks. This is important for two reasons: a) it illustrates how today’s LLMs are more powerful than people think – time and time again, people – including the original Naptime researchers (Import AI #378) – are showing that if you give LLMs some specialized tools and helper functions, they can perform massively better than they do out-of-the-box, and b) it shows how AI can be used to improve cyberdefense, using contemporary AI systems to look at widely used software, identify vulnerabilities, and fix them before they reach the public. 
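Google hasn't published Big Sleep's internals beyond the blog post, but the general pattern – an LLM driving a loop of specialized tools – can be sketched generically. The tool names and `toy_llm` below are illustrative assumptions, not Project Zero's design.

```python
def run_agent(llm, tools, goal, max_steps=10):
    """Generic LLM-in-a-harness loop: the model picks a tool, the harness
    runs it, and the observation is fed back into the model's context."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = llm("\n".join(history))   # e.g. {"tool": "read_file", "args": {...}}
        if action["tool"] == "report":
            return action["args"]          # the agent files its finding
        result = tools[action["tool"]](**action["args"])
        history.append(f"{action['tool']} -> {result}")
    return None

# Toy stand-ins for the model and a single code-browsing tool.
def toy_llm(prompt):
    if "read_file ->" in prompt:
        return {"tool": "report", "args": {"finding": "possible overflow in memcpy"}}
    return {"tool": "read_file", "args": {"path": "db.c"}}

tools = {"read_file": lambda path: "memcpy(buf, src, n)"}
print(run_agent(toy_llm, tools, "find a memory-safety bug"))
```

The harness is doing a lot of work here: it decides which tools exist, what the model sees, and when to stop – which is exactly why harnessed LLMs outperform bare ones.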
   Read more: From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code (Project Zero, Google)

***

Academics have a really small amount of compute:
…But you can sometimes get around small-compute by training for longer…
Researchers with Brown University recently conducted a very small survey to try and figure out how much compute academics have access to. The survey, which was conducted in April 2024, gathered responses from 50 researchers at 35 international institutions, and it indicated that very few people are happy with the state of academic compute. 
   “That said, most academics are not satisfied with the compute provided by their institution. 66% of respondents rated their satisfaction with their compute clusters at less than or equal to 3 out of 5 (indicating that some desired experiments are prohibitively expensive),” they wrote. “Based on our poll on user satisfaction, the majority of respondents want to and indeed would run more expensive types of experiments, if only they had the hardware for it.”

Hardware types: Another thing this survey highlights is how laggy academic compute is; frontier AI companies like Anthropic, OpenAI, etc, are constantly trying to secure the latest frontier chips in large quantities to help them train large-scale models more efficiently and quickly than their competitors. By comparison, this survey “suggests a common range for what constitutes “academic hardware” today: 1–8 GPUs—especially RTX 3090s, A6000s, and A100s—for days (typically) or weeks (at the higher-end) at a time,” they write. “10% of our respondents also report access to H100 GPUs: i.e. the newest generation Data Center GPUs.”

Why this matters – stagnation is a choice that governments are making: You know what a good strategy for ensuring the concentration of power over AI in the private sector would be? Systematically under-funding compute in the academic sector and therefore surrendering the frontier to deep-pocketed private sector actors. That’s exactly what this survey indicates is happening. This is a choice being made by (many) governments all over the world – and a deeply regrettable one.
   Read more: $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources (arXiv).

***

Language models think in the same way as people:
…When it comes to modeling human cognition, LLMs do better than specialized systems…
All around us now, week by week, the drops are falling – like rain on a tin roof, each one a new piece of evidence of human-like sophistication in language models. Do you hear that sound? The sound of a technology arriving in our world which might be truly transformative? Which might have the capacity to think and represent the world in ways uncannily similar to people?
    You’re not alone. A new paper from an interdisciplinary group of researchers provides more evidence for this strange world – language models, once tuned on a dataset of classic psychological experiments, outperform specialized systems at accurately modeling human cognition. 

Who did the research: The research was done by people with Helmholtz Munich, University of Tuebingen, University of Oxford, New York University, Max Planck Institute for Biological Cybernetics, Google DeepMind, Princeton University, University of California at San Diego, Boston University, Georgia Institute of Technology, University of Basel, Max Planck Institute for Human Development, Max Planck School of Cognition, TU Darmstadt, and the University of Cambridge.

What they did: They finetuned a LLaMa 3.1 70B model via QLoRA on a new dataset called Psych-101, then tested out how accurately the system could model and predict human cognition on a range of tasks. The results were very decisive, with the single finetuned LLM outperforming specialized domain-specific models in “all but one experiment”. The system also did well on out-of-distribution tasks, where it generalized better than hand-written and/or specialized systems. 

What is Psych-101? Psych-101 is a dataset “covering trial-by-trial data from 160 psychological experiments. We transcribed each of these experiments into natural language”, they write. The resulting dataset contains more than 10,000,000 distinct human choices and includes “many canonical studies from domains such as multi-armed bandits, decision-making, memory, supervised learning, Markov decision processes, and others”.
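The underlying comparison – which model better predicts human choices, trial by trial – boils down to likelihood. Here is a minimal sketch with invented numbers (real Psych-101 trials are natural-language transcripts, and the paper's exact metric may differ):

```python
import math

def neg_log_likelihood(predicted_probs, human_choices):
    """Lower is better: how surprised the model is by what people did."""
    return -sum(math.log(p[c]) for p, c in zip(predicted_probs, human_choices))

human = [0, 1, 0]  # the option a participant chose on each of three trials

# Invented per-trial choice probabilities from two hypothetical models.
finetuned = [{0: 0.8, 1: 0.2}, {0: 0.3, 1: 0.7}, {0: 0.9, 1: 0.1}]
baseline  = [{0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}, {0: 0.5, 1: 0.5}]

# The finetuned model "wins" if it assigns higher likelihood to the data.
print(neg_log_likelihood(finetuned, human) < neg_log_likelihood(baseline, human))  # → True
```

Running this kind of comparison across 160 experiments, against bespoke cognitive models rather than a coin-flip baseline, is essentially what the paper reports.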

Why this matters – these LLMs really might be miniature people: Results like this show that the complexity of contemporary language models is sufficient to encompass and represent some of the ways in which humans respond to basic stimuli. This is the sort of thing that you read and nod along to, but if you sit with it, it’s really quite shocking – we’ve invented a machine that can approximate some of the ways in which humans respond to stimuli that challenge them to think. The fact this generalizes so well is also remarkable – and indicative of the underlying sophistication of the thing modeling the human responses.
   “A computational model like Centaur that can simulate and predict human behavior in any domain offers many direct applications. It may, for instance, be used for in silico prototyping of experimental studies,” they write. “Thinking one step further, Centaur finds applications in the context of automated cognitive science. For example, it can be integrated into frameworks that utilize predictive models to guide the development of psychological theories, such as scientific regret minimization”.
   Read more: Centaur: a foundation model of human cognition (PsyArXiv Preprints).
   Get the Psych-101 dataset here (HuggingFace).

***

Minecraft – inside the weights of a neural network:
…A taste of the infinite generative-everything future…
In the past few issues of this newsletter I’ve talked about how a new class of generative models is making it possible for researchers to build games inside neural networks – in other words, games which are going to be infinitely replayable because they can be generated on-the-fly, and also games where there is no underlying source code; it’s all stored in the weights of the network. 
   Now, researchers with two startups – Etched and Decart – have built a visceral demonstration of this, embedding Minecraft inside a neural network. You can play the resulting game in your browser; it’s incredible – you can play a full game and other than the slightly soupy images (some of which resolve late, as the neural net decides it is now a probable object to render), it feels remarkably similar to the real thing. 
    This is a big deal – it portends a future of infinite games. And just imagine what happens as people work out how to embed multiple games into a single model – perhaps we can imagine generative models that seamlessly fuse the styles and gameplay of distinct games? 

How they did it: “The model is composed of two parts: a spatial autoencoder, and a latent diffusion backbone. Both are Transformer-based: the autoencoder is based on ViT, and the backbone is based on DiT,” they write. “In contrast to bidirectional models such as Sora, Oasis generates frames autoregressively, with the ability to condition each frame on game input. This enables users to interact with the world in real-time.”
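The autoregressive, action-conditioned generation the quote describes can be sketched abstractly: each new frame is produced from a window of recent latent frames plus the player's current input. Here `denoise` stands in for the whole DiT backbone and the autoencoder is elided; nothing below reflects Oasis's actual code.

```python
def generate(initial_frames, actions, denoise, context=4):
    """Autoregressive rollout: condition each new frame on recent
    history plus the player's input, then append it to the history."""
    latents = list(initial_frames)
    for action in actions:
        nxt = denoise(latents[-context:], action)  # one backbone pass per frame
        latents.append(nxt)
    return latents

# A toy "denoiser" that just records its conditioning, for illustration.
toy_denoise = lambda history, action: f"{history[-1]}+{action}"
print(generate(["f0"], ["jump", "left"], toy_denoise))  # → ['f0', 'f0+jump', 'f0+jump+left']
```

This frame-by-frame structure is what lets the player's input steer the world in real time, in contrast to bidirectional video models like Sora that generate whole clips at once.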

Things that make you go ‘hmmm’ – this is also a chip advert: One of the startups behind this – Etched – is designing a specialized inference ASIC called Sohu on which to run games like this. “Sohu can scale to massive 100B+ next-generation models in 4K resolution,” they write. 

It’s going to get better (and bigger): As with so many parts of AI development, scaling laws show up here as well. “Following an in-depth sensitivity analysis on different configurations of the architecture alongside the data and model size, we hypothesize that the majority of these aspects may be addressed through scaling of the model and the datasets,” they write. 
   Read more: Oasis: A Universe in a Transformer (Oasis Model, GitHub).

***

Tech Tales:

The classification engine 
The strategic dominance plan for unprecedented abundance relied on classification – specifically, the intentional walling off of certain scientific insights delivered by the first AGI-class system. The powers that be determined that despite the promise of material wealth the likes of which no human civilization had ever known, some kind of ‘strategic edge’ needed to be maintained. Therefore, a subset of the new scientific discoveries made by the system were pre-allocated into a compartment where only a few select human-run organizations would have access to them. The AGI system was also put to work to confound other attempts to discover these secrets, publishing scientific papers and frameworks and generally ‘nudging’ people worldwide away from the science that had been walled off and compartmented. In this way the humans believed a form of dominance could be maintained – though over what and for what purpose was not clear even to them. 

Things that inspired this story: The basic animal tendency to stockpile things; thinking about how governments might relate to AI systems.

Thanks for reading!


Import AI 389: Minecraft vibe checks; Cohere’s multilingual models; and Huawei’s computer-using agents

by Jack Clark


Cohere releases two powerful multilingual models:
…Aya Expanse means the future is less likely to be dominated by English- and Chinese-dominant models…
Cohere has released Aya Expanse, two multilingual LLMs. The models have an 8k context length, cover 23 languages, and outperform models from Google, Facebook, and Mistral. The Aya Expanse family comes in two sizes: 8B and 32B, and the languages covered include: Arabic, Chinese (simplified & traditional), Czech, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian, and Vietnamese. 

Some training tweaks: Both models are relatively standard autoregressive language models. They’ve also been improved with some favorite techniques of Cohere’s, including data arbitrage (using different models depending on use cases to generate different types of synthetic data to improve multilingual performance), multilingual preference training, and model merging (combining weights of multiple candidate models). 
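Of those techniques, model merging is the easiest to show in miniature: in its simplest form it is just a weighted average of candidate checkpoints' parameters, key by key. Cohere's actual recipe is more involved; this is a bare-bones sketch with toy one-parameter "checkpoints".

```python
def merge_checkpoints(checkpoints, weights):
    """Weighted average of parameter dicts, key by key."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights should sum to 1"
    return {k: sum(w * c[k] for c, w in zip(checkpoints, weights))
            for k in checkpoints[0]}

# Toy 'checkpoints' with a single scalar parameter each.
a = {"layer0.weight": 1.0}
b = {"layer0.weight": 3.0}
print(merge_checkpoints([a, b], [0.5, 0.5]))  # → {'layer0.weight': 2.0}
```

In practice the dicts hold tensors rather than scalars and the weights are tuned per merge, but the operation is the same elementwise average.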
   The results are encouraging: The 8B model has a 60% win rate against Google’s Gemma-2 9B, 70% against Facebook’s Llama-3.1 8B, and 63% against Mistral’s Ministral 8B, and the 32B model also performs well (51% vs Gemma-2 27B, 54% vs Llama-3.1 70B, 76.6% vs Mixtral 8x22B). 

Why this matters – avoiding an English hegemony in the AI world: Models like Aya Expanse are trying to make the AI future a multilingual one, rather than one dominated by languages for which there has been sustained focus on getting good performance (e.g., English, Chinese, Korean, etc). 
   Read more: Aya Expanse: Connecting Our World (Cohere blog).
   Get the models from here: Aya Expanse (HuggingFace).

***

People are testing out models on Minecraft because… uh… we do not know how to fully evaluate these things anymore:
…Minecraft tests are an example of a vibes-based eval…
Recently, the sub-sub-sub-corner of twitter that is obsessed with testing out AI systems has been seized with a new passion: putting these systems into Minecraft and seeing what they do. Minecraft is a 3D game where you explore a world and build things in it using a dizzying array of cubes. As AI systems have got more advanced, they’ve started to be able to play Minecraft (often using a load of tools and scripting languages) and so people have got increasingly creative in the different ways they test out these systems. 

Something weird is going on: At first, people just used Minecraft to test out if systems could follow basic instructions and achieve basic tasks. Modern frontier models are able to do this. So now people are trying to do weirder things. The different evals are trying to tell us something:

  • Here’s an eval where people ask AI systems to build something that encapsulates their personality; LLaMa 405b constructs “a massive fire pit with diamond walls. This is the only model that didn’t just do a generic blob mixture of blocks”.

  • Here’s an experiment where people compared the mannerisms of Claude 3.5 Sonnet and Opus by seeing how they’d follow instructions in a Minecraft server:  “Opus was a harmless goofball who often forgot to do anything in the game because of getting carried away roleplaying in chat,” repligate (Janus) writes. “Sonnet, on the other hand, had no chill. The moment it was given a goal, it was locked in.”

  • Here’s someone getting Sonnet 3.5 to build them a mansion, noting the complexity of it almost crashed their PC. 

  • Here’s a compare and contrast on the creativity with which Claude 3.5 Sonnet and GPT-4o go about constructing a building in Minecraft. “Same prompt. Same everything,” the author writes. “Minecraft evals are now real”.

Why this matters – the future of the species is now a vibe check: Is any of the above what you’d traditionally think of as a well reasoned scientific eval? No! Not in the slightest! “Just put the animal in the environment and see what it does” is the definition of a qualitative study and by nature something where it’s hard to ablate and control things to do truly fair comparisons. 
    But the fact that so many humans are turning to things like Minecraft to evaluate these things is important. Part of it is about visualizing the capability surface – SWE-bench and GPQA and MMLU scores are all helpful, but they’re not as intuitive as ‘see how complex what it builds in Minecraft is’. 
   Another way of thinking of this is that now that LLMs have much bigger context windows and have been trained for multi-step reasoning tasks, it may be that Minecraft is one of the only ways to easily and intuitively visualize what ‘agentic’ systems look like. 
    Want to do this yourself? Check out MC-Bench on GitHub, software for helping to set up and run Minecraft agents (MC-Bench Orchestrator, GitHub)

***

Huawei wants to use RL to make computer-using agents:
…DistRL is a symptom of ambition…
Researchers with the University of Cambridge, Powersense Technology Limited, Huawei’s Noah’s Ark Lab, and University College London have built DistRL, a distributed reinforcement learning framework. DistRL is designed to help train models that learn how to take actions on computers and is designed so that centralized model training happens on a big blob of compute, while data acquisition occurs on edge devices running, in this case, Android. 

How DistRL works: The software “is an asynchronous distributed reinforcement learning framework for scalable and efficient training of mobile agents,” the authors write. “By decoupling trajectory collection from policy learning and doing both in parallel, it leverages distributed working machines for CPU-intense agent-environment interactions and GPU servers for policy training. This separation optimizes efficiency, scalability, and resource utilization by aligning tasks with appropriate hardware”.
    DistRL is not particularly special – many companies do RL in this way (though only a subset publish papers about it). It’s more interesting for what it suggests about priorities for Huawei (which appears to have led the project, given that a Huawei researcher is the corresponding author). 

Important caveat: not distributed training: This is not a distributed training framework – the actual AI part is still taking place in a big centralized blob of compute (the part that is continually training and updating the RL policy). Rather, this is a form of distributed learning – the edge devices (here: phones) are being used to generate a ton of realistic data about how to do tasks on phones, which serves as the feedstock for the in-the-cloud RL part. 
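The decoupling described above – parallel actors feeding trajectories into a central learner – can be sketched with a queue and threads. This is the shape of the idea only; real systems like DistRL use networked replay buffers, fleets of phones, and GPU servers rather than an in-process queue.

```python
import queue, threading

traj_queue = queue.Queue()   # stands in for a networked replay buffer
policy_version = [0]         # stands in for the learner's policy weights

def actor(device_id, n_episodes):
    """Edge-device role: roll out the environment, ship trajectories."""
    for ep in range(n_episodes):
        traj_queue.put((device_id, ep, f"trajectory-{device_id}-{ep}"))

def learner(n_updates):
    """Centralized role: consume trajectories, update the policy."""
    for _ in range(n_updates):
        traj_queue.get()        # blocks until an actor delivers data
        policy_version[0] += 1  # stand-in for a gradient step

actors = [threading.Thread(target=actor, args=(i, 3)) for i in range(4)]
learn = threading.Thread(target=learner, args=(12,))
for t in actors: t.start()
learn.start()
for t in actors + [learn]: t.join()

print(policy_version[0])  # → 12
```

Because collection and learning run concurrently, slow rollouts on edge devices never stall the GPU-side training loop – the resource-alignment point the authors make.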

Why this matters – computer use is the frontier: In a few years, AI systems will be middleware between you and any and all computers, translating your intentions into a symphony of distinct actions executed dutifully by an AI system. Approaches like this portend that future. “For future work, we aim to extend the generalization capabilities of DistRL to a broader range of tasks, focusing on enhancing both the training pipeline and the underlying algorithmic architecture,” Huawei writes. 
   Read more: DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents (arXiv)

***

What would an “AI FDA” even look like? And is it a good idea?
…It’d need pre-market enforcement, and I’m not sure if it’s a good idea…
The term “FDA for AI” gets tossed around a lot in policy circles but what does it actually mean? Researchers with thinktank AI Now have written up a helpful analysis of this question in the form of a lengthy report called Lessons from the FDA for AI. The key things to know are that:

  • The most effective tool the FDA has is “pre-market approval” – being able to say which drugs can and can’t come to market. 

  • Ensuring products comply with regulations after they have been released is challenging and the complicated supply chain for AI makes this even more difficult. 

  • Figuring out a funding mechanism for the (very expensive) pre-market testing is a key challenge – there are various traps where the FDA for AI could end up beholden to market participants. 

  • The FDA mandates documentation of drugs and medical devices; mandating documentation for AI could be both useful and also motivate broader changes in the AI industry. 

  • Any FDA for AI would fit into a larger ecosystem – figuring out how this hypothetical FDA could interact with other actors to create more accountability would be important. “The power of FDA regulation comes in part from other actors in the system, including physicians, insurers, whistleblowers, and other actors who strengthen its monitoring regime. This has acted as an important second line of defense in pharmaceuticals, where the regulatory process has been insufficiently rigorous.”

Why this matters – most questions in AI governance rest on what, if anything, companies should do pre-deployment: The report helps us think through one of the central questions in AI governance – what role, if any, should the government have in deciding which AI products do and don’t come to market? Any kind of “FDA for AI” would increase the government’s role in setting the framework for deciding which products come to market, along with the gates that must be passed to get to broad-scale distribution. This would represent a change from the status quo, where companies make all the decisions about what products to bring to market. Do we actually want other participants to have a role here and, if so, what should that precise role be?
   Read more: LESSONS FROM THE FDA FOR AI (AI Now, PDF).

***

Tech Tales:

Definitions At The End Of Time 
[Near Conscious Entity (NCE)]: A Near Conscious Entity (NCE) is a synthetic system which has the necessary ingredients for consciousness and has been determined to be approaching the threshold of moral patienthood. An NCE is a protected entity under the terms of the Sentience Accords and, while not due the same considerations as a Provably Conscious Entity (PCE), an NCE receives higher protections than Unthinking Software. 

Things that inspired this story: All the policy documents that will be written during the transition to superintelligence.

Thanks for reading!


Import AI 388: Simulating AI policy; omni math; consciousness levels

by Jack Clark


43 simulations of contemporary AI development tell us that coordination is hard:
…Meta-analysis of 40+ games of “Intelligence Rising” sheds light on how people expect the industry to develop…
Intelligence Rising is a scenario-based game that lets players pretend to be AI developers competing with one another to build and deploy AGI. The game was developed by Cambridge researchers to help them structure discussions around AI development and its associated dynamics. Now, after overseeing 43 games over a four year period, researchers have published a paper outlining the common things that come up in these games. 

What tends to come up when people pretend they’re developing AI systems: The paper is a quick read and not too surprising – the sorts of challenges that get surfaced in these games are very similar to the ones that AI labs encounter on a day-to-day basis (or at least, it certainly matches experiences I’ve had at both OpenAI and Anthropic). 

  • “Even prior to the development of radically transformative AI, AI technologies can have dramatically destabilising effects as they rapidly advance and reshape society.”

  • “Outcomes leading to positive futures almost always require coordination between actors who by default have strong incentives to compete — this applies both to companies and to nations”

  • “The power to steer the future of AI development is very unequally distributed due to several drivers for concentration, including the enormous compute requirements of the latest frontier AI models.”

  • “Technology development does not happen in isolation — it affects, and is affected by, geopolitics, economical factors, social factors, and state actions. Actors should consider the broader consequences of their policies, including on trust between powerful actors, and impacts on social stability. There is no predetermined path that AI technology is bound to follow.”

  • “The best chances for optimal outcomes are achieved through early recognition of the magnitude of the challenge, trust building over years, and eventually international treaties or agreements that include rigorous and robust verification protocols for the involved states and firms.” 

Why this matters – coordination is required and coordination is difficult: The game shows something everyone working in AI policy knows to be true – getting to a good outcome will require coordination beyond what the AI ecosystem currently incentivizes. And even if we succeed at coordination, success isn’t guaranteed: “Even with an agreement in place to slow development until safe [AGI] is verifiable at a very high level of confidence and with no successful attempts to violate the agreement by any parties, a dice roll is typically still required to inform the end-of-game narrative,” the authors write. 
   Read more: Strategic Insights from Simulation Gaming of AI Race Dynamics (arXiv).

***

Chinese researchers introduce Omni-Math, a (for now) challenging math benchmark:
…OpenAI o1 gets ~60%, most other models get 30% or less…
Chinese researchers have built Omni-Math, a dataset and benchmark of 4428 mathematical olympiad competition-level problems. Omni-Math is designed to provide a competitive test of how well LLMs understand math, superseding existing (and mostly saturated) benchmarks like GSM8K and MATH. 

Extremely hard for most models: Omni-Math is hard – most models get ~30% or less accuracy (e.g., Claude 3.5 Sonnet gets 26.23%, and DeepSeek-Coder-V2 gets 25.78%), though there are two outliers – OpenAI o1-preview and OpenAI o1-mini, which get 52.5% and 60.5%, respectively. This suggests Omni-Math is, for now, a hard benchmark, though we should expect new models that wrap in RL (like the o1 series) to do substantially better. The open question is how long it will remain hard – will the best models be getting ~90%+ next year, or more like 70%?

Why this matters – knowing where we’re going is getting increasingly difficult: Omni-Math is a hard benchmark to evaluate as a human unless you’re also quite good at math. Many modern hard benchmarks (e.g., GPQA) now exhibit this property – AI systems have become sufficiently good that our ability to build evals for them is now limited by deep subject-matter expertise rather than generic high-school knowledge. This is significant – in a real sense, many AI systems are now well above average human competence on some tasks. 
   Read more: Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models (arXiv).

*** 

Tech Tales:

Operationalization of the Sentience Accords
[Extract from an implementation guide document developed by one of the Sentience Accords working groups, 2030]

Per the implementation guide from the Sentience Accords, we must define five levels of “Consciousness” with associated tests. AI systems are permitted to be released openly and at scale if they are at Consciousness Level 2 or below (CL1 or CL2). CL3 systems require pre-deployment testing by a named safety authority (for a full list of these authorities within the G20 please refer to the ‘Authorities’ section of the appendix). CL4 systems require pre-deployment testing by safety authorities as well as ongoing monitoring for both usage and ‘System Welfare’. CL5 systems are not permitted to be released, and their analysis and operation are restricted to a small set of government entities and their private sector partners. 

Things that inspired this story: The sentience accords; moral patienthood and AI systems; dreams I have of windowless rooms and coffee in styrofoam cups and people hammering out the policy around the near future.

Thanks for reading!
