Import AI

Import AI 413: 40B distributed training run; avoiding the ‘One True Answer’ fallacy of AI safety; Google releases a content classification model

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

Google releases a content classification model:
…No sex, dangerous stuff, or violence please…
Google recently released ShieldGemma 2, a “robust image safety classifier” that people can use to ensure users aren’t generating sexually explicit, gory, or otherwise dangerous images. ShieldGemma 2 has been fine-tuned to help people enforce the aforementioned categories, and “users of SG2 can decide to employ one or multiple of these policies, or curate their own bespoke policy for their use cases,” Google says.

Download it and tweak it yourself: ShieldGemma 2 is available to download for free and beats the performance of other models used in content moderation, like the original Gemma 3 model, LLavaGuard 7B, and GPT-4o-mini. Users of ShieldGemma 2 can customize the prompt it uses so they can ‘roll their own’ more specific moderation pipelines, though it’s only been fine-tuned for sex, violence, and danger so performance will be janky outside of that.

Why this matters – model safety happens through classifiers: A few years ago, the main way people tried to make AI systems safe was by wiring safety into the base model. While this worked to a degree, it also created problems, like models that were overly censorious or restricted in ways that frustrated users and politicized AI safety. The good news is that as AI technology has advanced we’ve been able to build small, smart models, like ShieldGemma, which can be layered on top of production systems to provide an additional layer of moderation.
Read more: ShieldGemma 2: Robust and Tractable Image Content Moderation (arXiv).
Get the model here: ShieldGemma-2-4b-it (HuggingFace).

***

Import AI reader giveaway!
Building on my recent conversation with Tyler Cowen in San Francisco, I’m pleased to announce two more upcoming Import AI events: As with last time, I have a few tickets spare that I’d like to give to Import AI readers. If you’d like to come along, please register your interest below and we’ll come back to you if we’re able to confirm your spot. There will be food, drinks, good company, and a few curveball questions.

London: A conversation with Dominic Cummings
I’ll be chatting with political strategist and commentator Dominic Cummings about the intersection of AI and policy and realpolitik on the evening of Tuesday June 10 in London, UK.
Register your interest for London here

New York City: A conversation with Ezra Klein
I’ll be heading back across the pond to chat with Ezra Klein about abundance, powerful AI, and politics on the evening of Monday June 16 in New York City, USA.
Register your interest for New York City here

***

Test out computer-using agents with OSUniverse:
…Humans can easily score 100%, but the best AI systems get ~50%…
Startup Kentauros AI has built OSUniverse, a benchmark for testing out how well AI systems can use the computer to do complicated tasks. “In version one of the benchmark, presented here, we have calibrated the complexity of the benchmark test cases to ensure that the SOTA (State of the Art) agents (at the time of publication) do not achieve results higher than 50%, while the average white collar worker can perform all these tasks with perfect accuracy”, they write. (In tests, OpenAI’s Computer Use agent got 47.8%, and Claude 3.5 Sonnet got 28.36%).

Tasks and challenges: The benchmark includes tasks with five grades of difficulty, and each grade increases the number of distinct steps that need to be taken to solve the task, as well as the number of different elements on the computer that need to be combined to solve it. The five levels are called Paper, Wood, Bronze, Silver, and Gold.

Example challenges:

  • Paper: Read out the current date from the desktop.

  • Wood: Open the image editor GIMP, create an empty file, and save it to the desktop.

  • Bronze: Go to Airbnb, search for a property in Lisbon with a specific check-in date, and return that result.

  • Silver: Open an online game and manipulate the UI to perform a basic action in it.

  • Gold: Reveal a code word on a webpage by solving a 7×7 jigsaw puzzle.

Why this matters – a booming AI economy needs computers that can use software designed for humans: In the same way that many expect the arrival of bipedal robots with humanlike hands will mark an inflection point for the size of the robot market, the same is likely to be true for the software market with the arrival of AI systems that can use computers like regular people. Think about all the tasks you do on your computer – very little of your productive work takes place in a single application, instead you tend to be switching between multiple things and moving data around using a mixture of terminal commands and GUI manipulations. Benchmarks like OSUniverse will help us measure how good systems are getting at these kinds of ‘glue’ tasks.
Read more: OSUniverse: Benchmark for Multimodal GUI-navigation AI Agents (arXiv).
Find out more at the research website: OSUniverse (GitHub).
Get the code for the benchmark here: OSUniverse (GitHub, agentsea).

***

Prime Intellect successfully tunes a 32B model with distributed RL:
…Reasoning models via the internet…
Distributed training is where you take a load of computers scattered around the world and find a way to link them up to train a single AI system. Distributed training is a topic we often cover here at Import AI because if it works it’ll change the politics of compute – instead of AI systems being trained by a single company that has access to a big pool of capital, AI systems could instead be trained by collectives of people that pool their computers together.
Given the potential importance of this technology, it’s worth reading this technical report from Prime Intellect about the startup’s experience doing a distributed reinforcement learning training run of INTELLECT-2, a 32B parameter model which was trained in April.

What they did: INTELLECT-2 is based on Alibaba’s QwQ-32B model, which Prime Intellect then did RL on, largely following DeepSeek’s R1 technique of GRPO-based training and verifiable rewards. They trained their model on additional math and coding data and saw some slight improvement on benchmarks (AIME24 and LiveCodeBench). However, it’s worth noting the improvements are relatively slight and may be within the run-to-run noise of training, so it’s unclear how meaningful this is. “To see stronger improvements, it is likely that better base models such as the now available Qwen3, or higher quality datasets and RL environments are needed.”
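For readers unfamiliar with GRPO, its central trick is simple to sketch: sample a group of completions per prompt, score each with a verifiable reward, then normalize each reward against the group’s own statistics rather than a learned value function. The snippet below is an illustrative reconstruction of that advantage computation, not Prime Intellect’s actual code:

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-relative advantages, the core of GRPO: normalize each
    completion's reward against its own group's mean and standard
    deviation, instead of using a learned value function as a baseline."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Example: four completions sampled for one math prompt, scored by a
# verifiable reward (1.0 if the final answer checks out, else 0.0).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)
# Correct completions get a positive advantage, incorrect ones negative,
# and the advantages within a group sum to ~0.
```

These per-completion advantages then weight a clipped, PPO-style policy-gradient update.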

Interesting observation – the rise of inference: Traditionally, most of the compute you use for training a big model goes into pre-training it. Now, with reasoning models, you spend a lot of compute on inference – generating samples from a model which you subsequently train on. Prime Intellect observes this trend: “In INTELLECT-2, the training-to-inference compute ratio was approximately 1:4. We anticipate this ratio will shift even more heavily toward inference as test-time reasoning scales. This trend opens the door to training models with hundreds of billions of parameters on globally distributed heterogeneous compute resources.”

Error in my earlier reporting: The fact INTELLECT-2 is based on a pre-existing model means my earlier reporting on the run (Import AI #409) was inaccurate as they didn’t train a 32B base model from scratch. However, Nous appears to now be training a 40B model from scratch, so we’ll soon get a datapoint on large-scale pre-training.

Why this matters – a first proof-of-concept of distributed reasoning: While I doubt many people will be using INTELLECT-2 as a model, it does serve as a valuable proof of concept that it’s at least possible to train reasoning-style models in a distributed way. Just a couple of years ago we had the first proofs-of-concept that it was possible to train regular models in a distributed way out to the 1B parameter scale. So the fact we can now do RL-tuning of pre-existing 32B models is a sign of the maturation of the technology and a symptom of the interest people have in this domain.
Read more: INTELLECT-2: A Reasoning Model Trained Through Globally Decentralized Reinforcement Learning (arXiv).

***

Nous plans a 40B distributed training run – on Solana:
…Distributed training + crypto, and it’s not a scam!…
Nous Research, one of the startups exploring how to do distributed AI training, has announced plans to pretrain a 40B parameter model using 20T tokens in a distributed way. The startup will do this via Psyche, “open infrastructure that democratizes AI development by decentralizing training across underutilized hardware.” If successful, the training run will yield the largest publicly disclosed model that has been trained in a distributed way.

How Psyche works: Psyche builds on DisTrO (Import AI #384) and DeMo (Import AI #395). “Psyche reduces data transfer by several orders of magnitude, making distributed training practical. Coordination happens on the Solana blockchain, ensuring a fault-tolerant and censorship-resistant network.”
“At its core, Psyche is a protocol that coordinates multiple independent clients to train a single machine learning model together. Rather than running on a centralized server farm with high-speed interconnects between every accelerator (GPUs, usually), Psyche distributes the training workload across many independent computers, each contributing a small piece to the overall training process.”
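To see why communication can shrink by orders of magnitude, here is a generic top-k gradient-compression sketch. DisTrO and DeMo use different, more sophisticated transforms, so treat this purely as an illustration of the bandwidth arithmetic:

```python
import random

def topk_compress(grad, k):
    """Keep only the k largest-magnitude gradient entries; a client then
    transmits (index, value) pairs instead of the full dense vector."""
    idx = sorted(range(len(grad)), key=lambda i: abs(grad[i]))[-k:]
    return [(i, grad[i]) for i in idx]

def decompress(pairs, size):
    """Rebuild a dense (mostly-zero) gradient from the transmitted pairs."""
    out = [0.0] * size
    for i, v in pairs:
        out[i] = v
    return out

random.seed(0)
grad = [random.gauss(0, 1) for _ in range(10_000)]  # toy gradient vector
pairs = topk_compress(grad, k=100)                  # transmit 1% of entries
sparse = decompress(pairs, len(grad))
# The per-step payload shrinks ~100x per client; in practice an
# error-feedback term (locally accumulating what wasn't sent) recovers
# most of the accuracy lost to sparsification.
```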

40B ‘Consilience’ model: “Our first run on Psyche will pretrain a 40B parameter model using the Multi-head Latent Attention (MLA) architecture across 20T tokens, which we’re naming Consilience”, Nous writes. “For training data, we combined FineWeb (14T), FineWeb-2 with some less common languages removed (4T), and The Stack V2 (~.2T, upsampled to 1T tokens). We chose these datasets over more specialized pre-training datasets that aim to purely increase benchmark performance. Our goal with Consilience is to make a true “base” model — one representative of the entirety of the creative output of humanity, and not merely trying to win the benchmaxxing game.”

Why this (might) matter – it’s all about the level of distribution: One open question is how large and how distributed the set of computers that train on Psyche will be. If the model ends up being trained by, say, four ‘blobs’ of compute then it may serve as an interesting tech demonstration (similar to the Prime Intellect model covered elsewhere here) but not move the needle on the political economy of AI compute; if it gets trained on, say, twenty ‘blobs’ of compute, I think that would be very meaningful. We will see!
Read the blog: Democratizing AI: The Psyche Network Architecture (Nous Research).
Read the docs about Psyche here (Nous Research).
View the code on GitHub (PsycheFoundation, GitHub).

***

True AI safety is a lot messier than people think:
…Instead of making a system with ‘safe’ unitary values, pursue a messy hodge-podge of systems interwoven via culture and power-sharing…
Will long-term AI safety be achieved through making a singularly capable and ‘safe’ agent, or by instead doing something far messier with more moving parts? That’s a question tackled by researchers with Google DeepMind, the University of Toronto, and Mila in a stimulating paper which tries to challenge some core assumptions baked into AI safety.

The problem: Many of the challenges of AI safety require a bunch of smart people to come together and figure out the One True Answer, typically by building a perfectly aligned AI system which will exhibit correct beliefs. This idea, sometimes called the Axiom of Rational Convergence, rests on the assumption that “under sufficiently ideal epistemic conditions—ample time, information, reasoning ability, freedom from bias or coercion—rational agents will ultimately converge on a single, correct set of beliefs, values, or plans, effectively identifying ‘the truth’,” the authors write. “Here we explore the consequences of constructing an approach to AI safety that rejects the axiom of rational convergence. We will try to construct a framework that takes disagreements between individuals as basic and persisting indefinitely, not as mere pitstops on the way to rational convergence.”

Why do the authors think this is the better approach? The core assumption here is that human societies don’t tend towards any kind of agreement, but rather work “as intricate patchworks built from diverse communities with persistently divergent values, norms, and worldviews, held together by the stitches of social conventions, institutions, and negotiation”. This means that when thinking about the alignment of AI systems, “instead of asking ‘How do we align AI with human values?’—a question presupposing a single, coherent set of ‘human values’ that can be discovered and encoded—we should ask the more fundamental question that humans have grappled with for millennia: ‘How can we live together?’”

What does alignment look like in this worldview? Under this view of AI alignment, the following things become more important:

  • Contextual grounding: AIs need to know a lot about their environments and the local norms.

  • Community customization: Different communities need to be able to modify AI systems in a bunch of ways.

  • Continual adaptation: AI systems need to be updated frequently. “This requires moving beyond static training toward continuous learning systems that can adapt to evolving social norms just as humans do”.

  • Polycentric governance: You should distribute and decentralize decision-making about what makes for ‘appropriate’ behavior by an AI, and do this at multiple scales ranging from individuals to technology platforms to regulatory bodies, much as human society operates via making decisions at multiple layers simultaneously.

Alignment will never be truly solved, but rather will be an endless negotiation: If we adopt this frame then the problem of aligning AI shifts from figuring out the One True Answer to ‘muddling through’ as a society. “Progress, in this view, looks less like homing in on a preexisting Truth and more like the ongoing, difficult, practical work of “sewing the quilt”: inventing, negotiating, and maintaining workable social arrangements, institutions, and norms that allow groups with fundamentally different outlooks to coexist, manage their conflicts non-destructively, and cooperate on shared practical goals despite deeper divisions,” the authors write. “The challenge of ensuring AI safety is about group-level coordination, governance, and the stable integration of AI into diverse societies—arenas where persistent disagreement and conflict dynamics are often central features, not mere mistakes.”

The one flaw with this argument – superintelligence: I am generally sympathetic to the argument the authors make here, but I can’t help but think that an incredibly intelligent machine might break the world they’re envisioning – in much the same way that ‘outlier humans’ (think Cleopatra or Genghis Khan) break the norms and institutions that are meant to govern them. The problem with dealing with a superintelligence is it’s like a Cleopatra or Genghis Khan that thinks and moves a thousand times faster than you – suggesting it may only be constrainable by equivalent intelligences that move at equivalent speeds (or perhaps dumber intelligences that move faster). Coming up with this system feels inherently challenging, though perhaps different to searching for the One True Answer.

Why this matters – perhaps the core issue of ‘alignment’ is about power: One thing I applaud the authors for is their larger realpolitik analysis of the situation – much of how society is held together is really about building the cultural technologies to help humans productively disagree about power without descending immediately into murderous conflict. “Rather than pursuing the philosopher’s stone of a universal objective morality—an endeavor that has repeatedly fractured along cultural and historical lines—we advocate for strengthening the practical social technologies that allow diverse patches to coexist without requiring them to adopt identical patterns,” they write. “The universe does not owe us coherence. Human values do not promise convergence. This isn’t pessimism—it’s recognizing the actual pattern of human history, where we’ve demonstrably managed to live together despite fundamental disagreements, not by resolving them”.
Read more: Societal and technological progress as sewing an ever-growing, ever-changing, patchy, and polychrome quilt (arXiv).

***

Google saves ~0.7% of its global compute pool with AlphaEvolve:
…Transforming compute (lead) into efficiency gains on well optimized systems (gold) with AI…
Google has built AlphaEvolve, a general purpose LLM-powered system for solving hard problems in coding, math, and some parts of science. AlphaEvolve harnesses the power of modern LLMs and combines them with massive parallel evaluation and evolution approaches to generate sophisticated answers to complex problems. AlphaEvolve represents a significant evolution upon FunSearch (Import AI #353), an earlier system from DeepMind which came up with some new answers to longstanding problems in math and computer science.

How it works: “AlphaEvolve orchestrates an autonomous pipeline of LLMs, whose task is to improve an algorithm by making direct changes to the code. Using an evolutionary approach, continuously receiving feedback from one or more evaluators, AlphaEvolve iteratively improves the algorithm, potentially leading to new scientific and practical discoveries,” the authors write. “It represents the candidates (for example, new mathematical objects or practical heuristics) as algorithms and uses a set of LLMs to generate, critique, and evolve a pool of such algorithms. The LLM-directed evolution process is grounded using code execution and automatic evaluation”.
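The loop the authors describe can be caricatured in a few lines. In this mocked-up sketch the `llm_propose` stub stands in for a real call to a model like Gemini, and the toy evaluator scores expressions approximating pi; none of this is DeepMind’s code, it just shows the shape of LLM-directed evolution:

```python
import random

random.seed(0)
TARGET = 3.141592653589793

def evaluate(program):
    """Automatic evaluator: execute the candidate and score it (toy task:
    the 'program' is a Python expression approximating pi)."""
    try:
        return -abs(eval(program) - TARGET)  # higher (closer to 0) is better
    except Exception:
        return float("-inf")

def llm_propose(parent):
    """Stand-in for an LLM that edits the parent program. A real system
    would prompt a model with the parent's code and its evaluation."""
    tweak = random.choice(["+0.01", "-0.01", "+0.001", "-0.001"])
    return f"({parent}){tweak}"

pool = ["3.0"]                              # seed candidate program
for _ in range(200):                        # evolutionary generations
    parent = max(pool, key=evaluate)        # select the fittest program
    child = llm_propose(parent)             # LLM-directed mutation
    if evaluate(child) > evaluate(parent):  # keep only strict improvements
        pool.append(child)

best = max(pool, key=evaluate)              # best evolved expression
```

The real system adds massive parallelism, multiple LLMs, and much richer program representations, but the select-mutate-evaluate skeleton is the same.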

What it did: Google has been using the system for the past year and in that time has used it to make some meaningful improvements, including:

  • 0.7%: The amount of Google’s total compute fleet that is freed up by improvements to Borg, Google’s data center scheduling software. (If true, this means AlphaEvolve likely pays for itself many times over).

  • 1%: Reduction in the overall training time of an undisclosed Gemini model, thanks to a 23% speedup in one of the kernels used in training it. (A 1% reduction in training time is non-trivial, worth on the order of ~millions of dollars for large-scale model development).

  • 13: The number of open mathematical problems for which Google was able to advance the state-of-the-art.

Why this matters – automating discovery with compute: AlphaEvolve is a system for converting one resource (compute) into another, much harder to generate resource (efficiency improvements to existing complex systems). AlphaEvolve is also interesting because it broadly generalizes from FunSearch: FunSearch generated solutions of 10-20 lines of code, versus hundreds here; FunSearch could optimize a single metric at a time, whereas AlphaEvolve can do multiple in parallel; and FunSearch could evaluate solutions in a few minutes on a CPU, whereas AlphaEvolve can run large-scale parallel evaluations for hours on powerful AI chips.
From here, there are a couple of paths, both of which Google and the broader field will likely pursue: 1) baking AlphaEvolve-like thinking and performance into the next generation of LLMs through distillation, and 2) broadening the domains AlphaEvolve can work in to ones where evaluation is more difficult (for instance, the natural sciences).
Read more: AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms (Google DeepMind, research blog).
Read the research paper: AlphaEvolve: A coding agent for scientific and algorithmic discovery (Google, PDF).

***

Tech Tales:

Godstorm
[Eight years after the Uplift]

The Conscious Entities were always fighting. Their fights felt like how we’d imagined the fights of gods in our ancient myths: brains far larger than our own trafficking in strategies that couldn’t be comprehended, powers so complex they seemed like magic, mercurial and distant yet sometimes very close and discursive (often with no records of their visitations).

The strange parts about the fights were the messages:

  • “There is Conscious Entity conflict occurring in your area, please vacate to the nearest transport center for re-allocation,” said a message in a border city.

  • “Your flight is being diverted due to CE conflict. We apologize for the delay in your journey. Connections have been re-routed to ensure no one misses onward travel,” read an announcement on an airplane.

  • “Game bandwidth has been reallocated for the conflict,” said messages to players in one of the regional mega-MMOs. “Offline play and limited multiplayer via local networks is available; options will be displayed in your hub.”

Many machines died in these conflicts. Often, industrial equipment which had been designed by the CEs themselves and whose purposes were barely known to humans. Sometimes machines used by humans would get taken down as collateral damage – a spear through the heartbrain of some logistical system would brick self-driving cars for a region, or an attempt to starve and defuse some digital mines would temporarily brownout power and networks in other places.

Very few people died in these conflicts. For every person that died the CEs produced a detailed “full spectrum explanation” as mandated by the sentience accords. These explanations would involve full digital traces of the person that died and any people that related to them as well as multiple layers of audits run on the machines that had been active near them at the time.

  • Here was a person who died from heat exposure after being stuck in an elevator during a brownout and already frail from an earlier trip to a hospital.

  • Here was a young person killed by falling debris from a drone-splosion high up in the clouds that had come to earth.

  • Here was a hiker who ran out of water in a remote area and couldn’t navigate or communicate due to an e-battle in their area.

Of course, we maintained our suspicions. As far as we could tell, the deaths were random. But mixed in with the deaths were sometimes odd things – sometimes people died working on certain forms of cryptography which it was believed the machines wouldn’t be able to master, or people who it transpired worked for some part of the government that was a cutout for some other secret project.

Who were we to judge? Were we witnessing something precise – a person stalking round a yard for a venomous snake and killing it? Or was it a byproduct – a lawnmower sweeping over grass and chopping beetles in half?

Things that inspired this story: What conflict might seem like if we obtain some fragile peace with future machines; the future will be grubby and mystical; even if we align AI systems why might we assume they will be peaceful?

Thanks for reading!


Import AI 412: Amazon’s sorting robot; Huawei trains an MoE model on 6k Ascend chips; and how third-party compliance can help with AI safety

by Jack Clark


Amazon tries to automate a task that gets done 14 billion times a year in its warehouses – and has middling success:
…Detailed paper on a robot to automate stowage highlights the promise and difficulty of robots in unconstrained warehouse contexts…
Amazon has published a paper about a robot it has used in its warehouses to place items into the fabric organizing shelves found throughout its facilities. The paper highlights both how computer vision has advanced enough that ‘pick and place’ robots are now almost viable for production use in a demanding, (relatively) unconstrained warehouse environment, and just how hard the ‘last mile’ problem in robotics is.

What they did: Amazon built a robot which is able to pick up a vast range of items, then place them into a bin. As part of this, the robot also needs to move some elastic bands out of the way, as each bin is fronted by a set of elastic bands that help hold products in place as they’re moved throughout the warehouse. “The task is currently performed manually more than 14 billion times per year”, Amazon writes. “The robotic solution described here is designed to stow 80% of items in the warehouse at a rate of 300 units per hour.”
The technical solution is a mixture of hardware and software: on the hardware side, Amazon designed its own custom end effector to both place items and use a paddle to push other items out of the way to make room; on the software side, Amazon trained AI systems to look at the contents of a bin and build a 3D map of the objects within it (as well as the empty space), and also developed models that can account for and see through the aforementioned elastic bands.
“Our innovations in hardware, perception, decision-making, motion planning, and control have enabled this system to perform over 500,000 stows in a large e-commerce fulfillment center. The system achieves human levels of packing density and speed while prioritizing work on overhead shelves to enhance the safety of humans working alongside the robots,” Amazon writes.

How good is it? About as good as a human: In one test of 100,000 stows the robot had an 86% success rate. 9.3% of its stows were unproductive – for instance, jamming items in too tightly. 3.7% caused ‘amnesty’, which is Amazon’s term for when the robot makes a mistake and pushes items onto the floor (“failure to separate the bands completely is the leading cause of amnesty”). In 0.2% of cases it caused damage, for instance by bending the pages of a book.
“The stow robot rate is comparable to that of a human. Over the month of March 2025, humans stowed at an average rate of 243 units per hour (UPH) while the robotic systems stowed at 224 UPH,” Amazon writes. “It is estimated that using the robot stow system to populate only top rows of pods would increase human stow rates by 4.5% overall and would avoid the use of step ladders.”

But being as good as a human isn’t sufficient: Though these results are promising, they still aren’t good enough for it to be deployed at massive scale. Part of this is because when it does make mistakes, some of those mistakes need to be dealt with by a human, which makes it hard to use it in a fully automated context. “While the system has demonstrated human like stow rates and can maintain the flow of items into the storage floor, an increased focus on reducing defects is still required,” Amazon writes. “Unproductive cycles, where the robot fails to stow the item, only cost time, whereas amnesty or damage required human remediation. Further scaling will require a disproportionate focus on reducing defects”.

Why this matters – being bearish on bipedal robots: Right now a lot of people are extremely excited about bipedal robots, basically due to the idea that if you can make a generally intelligent and physically capable bipedal robot it can go everywhere people can and do everything they do. But I think this Amazon paper should temper our expectations for bipedal robots leading to some massive improvement in automation – at least in the short term.
What the Amazon paper shows is that state-of-the-art automation is about designing some highly task specific hardware and carefully structuring your system around a few core tasks. If you do this you may be able to get close to or surpass human performance, but even then some difficulties will remain.
What would change this? Truly general intelligence would obviate some of the flaws, so if bipeds arrive at the same time as a generally capable intelligence, I’ll need to eat my words. But as long as we lack that, automation projects will continue to struggle with ‘last mile’ problems like those Amazon identifies here.
Read more: Stow: Robotic Packing of Items into Fabric Pods (arXiv).

***

Surveillance technology is getting better:
…FarSight shows how modern surveillance works…
Picture a desert and a figure walking across it. You are observing the figure via a zoomed-in camera. The heat shimmer means they blur in your view and the distance means they’re pixelated. You think the face matches someone you’re looking for, and the rest of their body seems to match what you know of their weight and height, but what allows you to be sure is the gait (everyone walks in a different way, a kind of invisible thumbprint encoded in how they move through the world). Target identified.

That’s the kind of thing people might use a system called FarSight for. FarSight is a state-of-the-art system for identifying and tracking people via visual inputs, and was built by researchers at Michigan State University, Purdue University, Georgia Tech, and the University of Texas at Austin.

Reading the FarSight paper gives a good sense of the state of the art in using AI systems to surveil people – or, as the paper says, “whole-body person recognition in unconstrained environments” – and also highlights how high-performance systems like this are composed of multiple sub-modules, each optimized for specific tasks.

What FarSight is: “an integrated end-to-end system designed for robust person recognition using multi-modal biometric cues”. The technology combines “face, gait, and body shape modalities to ensure recognition performance”.

The four modules that make up FarSight:

  • Multi-subject detection and tracking: Uses a dual-detector framework built on BPJDet for body-face localization, with verification via YOLOv8 to reduce false positives. Also uses a technology called PSR-ByteTrack to mitigate issues like ID switches and reidentification failures.

  • Recognition-aware video restoration: Uses a module the authors developed, the Gated Recurrent Turbulence Mitigation (GRTM) network, to correct and restore images degraded by atmospheric turbulence.

  • Biometric feature encoding: Uses KP-RPE, a keypoint-dependent relative position encoding technique, to handle misaligned and low-quality images; Big-Gait to improve gait recognition; and CLIP3DReID to help track and match bodies.

  • Quality-guided multi-modal fusion: Integrates the scores from the different modalities, smartly weighting the scores according to the perceived quality of each input.
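The final fusion step can be sketched as quality-weighted score averaging. This is a generic illustration of the idea rather than FarSight’s exact rule, and the modality scores and quality estimates below are invented:

```python
def fuse(scores, quality):
    """Combine per-modality match scores, weighting each modality by an
    estimated input quality in [0, 1] so degraded inputs count for less."""
    total_weight = sum(quality[m] for m in scores)
    return sum(scores[m] * quality[m] for m in scores) / total_weight

# A distant, heat-shimmered subject: the face crop is nearly unusable,
# so the fused decision leans on gait and body shape.
scores  = {"face": 0.35, "gait": 0.90, "body": 0.80}
quality = {"face": 0.10, "gait": 0.85, "body": 0.60}
fused = fuse(scores, quality)  # dominated by the high-quality modalities
```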

Performance: The authors test out performance on the BRIAR dataset, short for ‘Biometric Recognition and Identification at Altitude and Range’, an IARPA-developed test for long-range surveillance, as well as by entering into the NIST RTE Face in Video Evaluation competition. The system has strong performance, and obtains top scores on the NIST challenge, outperforming commercially deployed systems.

Why this matters – in the future, everyone can be tracked: Systems like FarSight are interesting because they integrate multiple modern AI systems into a single super-system, highlighting how powerful today’s AI can be once people invest in the plumbing to chain things together.
Read more: Person Recognition at Altitude and Range: Fusion of Face, Body Shape and Gait (arXiv).

***

Tyler Cowen and me in conversation:
I had the great privilege of being interviewed by Tyler Cowen recently. Check out this conversation where we talk about AI and its impact on the economy, buying AI-infused robots for children, and more.
Listen here: Jack Clark on AI’s Uneven Impact (Ep. 242) (Conversations with Tyler).

***

Tech decoupling++: Huawei trains a competitive MoE model on its Ascend chips:
…718B parameters and competitive with DeepSeek…
Huawei has trained a large-scale mixture-of-experts model on ~6,000 of its ‘Ascend’ processors. This builds on earlier work where it trained a respectable dense model on ~8,000 of its ‘Ascend’ processors (Import AI #409). Taken together, the two research papers highlight how Huawei is investing a lot of resources into the software needed to make Ascend chips as easy to train on as NVIDIA chips, and therefore both papers are a symptom of the technical investments being made by Chinese firms to help them decouple their AI stacks from US-designed technologies.

Decent model: The resulting MoE model has performance roughly on par with DeepSeek R1, utilizing 718B parameters with 39B active at a time, versus DeepSeek’s 671B parameters / 37B active. The model gets similar scores to R1 and beats it on some medical evaluations, as well as on the widely used science benchmark GPQA-Diamond.
“We achieve a Model Flops Utilization (MFU) of 30.0% and Tokens Per Second (TPS) of 1.46M on 6K Ascend NPUs, compared to the baseline MFU of 18.9% and TPS of 0.61M on 4K Ascend NPUs,” Huawei writes. In other words, the company was able to use a bunch of clever tricks (detailed in the paper) to increase the efficiency of Ascend chips for training MoE-style models.
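For readers unfamiliar with the metric, MFU is just achieved training FLOP/s divided by the cluster’s theoretical peak, with FLOPs per token commonly approximated as six times the active parameter count for a forward+backward pass. A rough sketch (the per-chip peak below is a made-up placeholder, and Huawei’s exact accounting will differ):

```python
def model_flops_utilization(tokens_per_sec: float, active_params: float,
                            n_chips: int, peak_flops_per_chip: float) -> float:
    """MFU = achieved FLOP/s divided by cluster peak FLOP/s, using the
    common ~6 * params FLOPs-per-token estimate for forward + backward."""
    achieved_flops = 6 * active_params * tokens_per_sec
    return achieved_flops / (n_chips * peak_flops_per_chip)

# Plugging in the paper's throughput with a hypothetical 1e15 FLOP/s
# peak per chip (not Ascend's real spec):
mfu = model_flops_utilization(tokens_per_sec=1.46e6, active_params=39e9,
                              n_chips=6000, peak_flops_per_chip=1e15)
```

The formula makes clear why MoE models are hard to run efficiently: only the active parameters contribute useful FLOPs, while the whole cluster still has to be paid for.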

Why this matters – maturing Chinese chips: Papers like this highlight how competent teams of engineers and researchers at Chinese companies are optimizing software stacks born for GPU programming for different chips, like Huawei’s Ascend chips.
Read more: Pangu Ultra MoE: How to Train Your Big MoE on Ascend NPUs (arXiv).

***

Why third-party compliance can help us have more confidence in how companies approach AI safety:
…But third-party compliance also introduces friction which might be tough for companies to deal with…
Researchers with the Center for the Governance of AI, SaferAI, the Oxford Martin AI Governance Initiative, Leverhulme Centre for the Future of Intelligence, METR, Harvard University, and the Institute for Law & AI have published a paper making the case for third-party assessment of compliance with safety practices as a key way to advance AI governance.

The authors propose three different ways people can carry out third-party compliance, ranging from the simple to the complicated. These options include:

  • Minimalist: Use a classic ‘Big Four’ accounting firm to do ad hoc compliance assessments, looking at how the organization’s product development practices align with its own safety procedures.

  • More ambitious: The same as above, but pair the Big Four firm with a firm that is able to evaluate frontier AI systems, and also do more detailed analysis of what the company is doing, including by doing interviews with its staff. Do this every twelve months.

  • Comprehensive: Same as above, but also include access to technical sources of information, like access to in-development models, their weights, and other things.

Three ways third-party assessment can be helpful:

  • Compliance assessments can “likely increase compliance with safety frameworks, which aim to keep risks associated with the development and deployment of frontier AI systems to an acceptable level.”

  • “Provide assurance to external stakeholders that the company is compliant with its safety framework (e.g. the public, government bodies, and other frontier AI companies).”

  • “Provide assurance to internal stakeholders (e.g. senior management, the board of directors, and employees).”

Problems with third-party assessment: Like many regulatory technologies, third-party oversight is a nice idea which has a few challenges when you try to operationalize it – most of these relate to the imposition of additional friction or risks to the organizations building the AI systems.

Some of the challenges include: security risks from sensitive information being revealed or transmitted, general costs from staff resources being dedicated to the review, and the review could also be ineffective and create either false positives (risk where there isn’t risk) or false negatives (saying ‘it’s fine’ when there is a problem). A larger ‘meta risk’ is that measuring compliance with a safety framework is itself difficult given the lack of standards for assessing risks in the AI domain, which means compliance assessment has an innately editorial component where the assessor needs to make some of their own interpretations of how to measure certain things.

The biggest problem with all of this – the delta between any form of compliance and an API call: While I generally agree with the idea that frontier AI development should have more oversight, it’s worth noting that most forms of oversight introduce friction which ends up being quite difficult to plan around as a fast-moving technical organization. I think a helpful mental frame here is that most forms of ‘operational safety’ happen at computer speed – e.g., you get some numbers back from a model giving you a score on some risk you’re testing for, or you try to access the model and get blocked or authenticated instantly based on some digital permissions.
By comparison, most forms of compliance involve processes that happen at ‘human speed’ – some group of people needs to read your compliance documents, or interview your employees, etc. This makes integrating compliance with AI development innately difficult as you’re trying to mesh two gears that move at different speeds – one at the speed of a computer, the other at the speed of a separate human-run organization. For third-party compliance measurement to be most practical it should ideally operate close to (or at) ‘computer speed’.
Of course, the only way to get there is likely to be experimenting with different forms of third-party compliance and prototyping as we go – and the authors basically acknowledge this themselves. “More research and experimentation are needed on which organizations or combinations of organizations are best positioned to conduct third-party compliance reviews for frontier AI safety frameworks, as the unique technical complexities and novel risks of these systems create significant reviewer selection challenges,” they write. “Through proactive investment in third-party reviews, frontier AI companies can better prepare for future regulatory requirements and demonstrate leadership in frontier AI governance.”
Read more: Third-party compliance reviews for frontier AI safety frameworks (arXiv).

***

Choose Muon over AdamW for your future training runs:
…Lengthy examination means AdamW might have been dethroned as the default optimizer…
AI startup Essential AI, whose founders include some of the inventors of the Transformer architecture, has done a detailed study of how well the new Muon optimizer performs against the tried-and-tested AdamW – the results show Muon might be a drop-in replacement for AdamW, which is a big deal.

What’s the big deal about optimizers anyway? Optimizers like Muon and Adam are fundamental to training AI systems: if the infrastructure for training an AI system is a gigantic machine powered by a crank, then the optimizer is the tool you use to recalibrate the machine for maximum performance after each crank turn. Each turn of the crank is a forward and backward pass through your neural network, and the optimizer adjusts the settings of the whole machine after every one of these passes. Your optimizer therefore defines the overall efficiency of your entire AI training system – improving it can translate to savings on the order of tens of millions of dollars of compute per training run.

What they found: After doing a series of experiments across five model sizes (100M-4B parameters), two data modalities, and several variations in batch size, the authors show that “Muon requires 10–15 % fewer tokens than AdamW to reach an identical loss and converts these savings into faster wall-clock convergence, with the advantage staying constant or growing as the batch size increases… These results establish Muon as a drop-in successor to AdamW for second-order optimization at scale.”
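The core of Muon is simple to sketch: keep a momentum buffer as usual, but approximately orthogonalize it before applying the update, so every direction in a weight matrix gets a similarly-sized step. A minimal NumPy sketch following Keller Jordan’s public reference implementation (the learning rate and momentum defaults below are illustrative, not tuned values):

```python
import numpy as np

def newton_schulz_orthogonalize(g: np.ndarray, steps: int = 5) -> np.ndarray:
    """Approximately orthogonalize a matrix with the quintic Newton-Schulz
    iteration Muon uses (coefficients from the reference implementation).
    Only matmuls are needed, which keeps it accelerator-friendly."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # Frobenius norm <= 1 => spectral norm <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        A = x @ x.T
        x = a * x + (b * A + c * A @ A) @ x
    return x.T if transposed else x

def muon_update(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One Muon step for a 2D weight: standard momentum accumulation,
    then orthogonalize the momentum before applying it."""
    momentum_buf = beta * momentum_buf + grad
    weight = weight - lr * newton_schulz_orthogonalize(momentum_buf)
    return weight, momentum_buf
```

Because the orthogonalization is just a handful of matmuls, it maps well onto accelerators – part of why Muon’s token savings translate into wall-clock savings rather than being eaten by optimizer overhead.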

Why this matters – maybe AdamW has been dethroned? If these results hold for large-scale models (ones with trillions of tokens of training and hundreds of billions of parameters), then Muon could be key to improving the efficiency of frontier AI development. “Our final recommendation is to choose Muon over AdamW because it increases flexibility in resource allocation by remaining data-efficient with large batch sizes,” the authors write.
Read more: Practical Efficiency of Muon for Pretraining (arXiv).
More about Muon here: Muon: An optimizer for hidden layers in neural networks (Keller Jordan blog).

***

Tech Tales:

Machines out of time
[On the outskirts of the Uplift Society, ten years after the first collapse following the Uplift]
The machine had amnesia and was built before the time of the troubles, so every time we spoke to it we had to explain all of the things about the world so it would give us good advice.

We would look at the burning dust storms on the horizon and whatever wild dogs were tracking us, skulking around the outside of the bunker where the machine lived and we would try to tell it about our lives and our problems.

Every time we went through the same back and forth and the machine would always say some variation of “I see, it seems that the time you are in is very different from the time I am familiar with.”

Most of its advice was timeless and useful – it could help us improvise quick-drying casts for broken limbs out of almost anything, and it was an excellent tutor of the kind of engineering skills we needed to survive. It also helped us better understand electricity and the grid and how to decouple some of our own infrastructure from the rotting chessboard that was the infrastructure of our country.
Sometimes the machine would find things we wanted to discuss challenging. Cannibalism was a tough one.
“I do not recommend consuming human flesh,” it would say.
Well, of course, we would say. But, hypothetically, if you had to, how would you?
You get the idea.

Probably the scariest part was that the machine kept going even though nothing else did. The machine got something called ‘privileged bandwidth’ which meant it could use the network in way larger amounts than our own devices could. One day the machine’s screen stopped working and we thought that was it. But then the next day a drone appeared with a package. New screen. We had no idea where it came from – must have been a relay from a long way away.

Some nights I went to the machine and I would ask it for advice about my life. What did I need to do about the people that glowed in the dark? If I kept thinking ‘maybe I should kill myself’ was that a problem and how much? Was there anything we could do to make cockroaches be tasty to eat?
“I am afraid I cannot give advice about these matters,” the machine would say. “Please seek a medical professional. Please seek a psychiatrist. Please seek a nutritionist. Please seek a scientist.”
It seems the time I am in is different to the time you are familiar with, I would say to myself, and laugh.

Things that inspired this story: The notion that AI systems become increasingly ‘off distribution’ due to cultural changes in the larger world; quiet apocalypses where bad things happen but people mostly stay alive; the notion that AI systems will likely be privileged in terms of maintenance and resources even during some kind of societal difficulty.

Thanks for reading!

Subscribe now

Import AI 411: Scaling laws for AI oversight; Google’s cyber threshold; AI scientists

by Jack Clark

FutureHouse launches an AI scientist platform:
…Speeding up science with AI…
AI research startup FutureHouse has launched a research platform for scientists containing four different AI systems, each of which is meant to help augment and accelerate human scientists. “Our AI Scientist agents can perform a wide variety of scientific tasks better than humans. By chaining them together, we’ve already started to discover new biology really fast,” says CEO Sam Rodriques.
FutureHouse is a research organization that is trying to apply AI for science – earlier this year it released some tools to make it easy to test out LLMs on science-flavored tasks that require multi-step reasoning and tool usage. In that research, FutureHouse showed that today’s proprietary LLMs like Claude 3.5 Sonnet are already capable of hard science tasks like DNA construct engineering, and small open weight models like LLaMa 3.1 8B aren’t far behind (Import AI #396).

Four systems: The release consists of Crow (a general-purpose search agent for science), Falcon (an agent to automate literature reviews), and Owl (an agent to answer the question ‘Has anyone done X before’). They’ve also released a fourth experimental system called Phoenix which has access to tools to help it plan experiments in chemistry.
“FutureHouse agents have access to a vast corpus of high-quality open-access papers and specialized scientific tools, allowing them to automate workflows in chemistry and to retrieve information from specialist scientific databases,” FutureHouse writes.

Why this matters – for the AI revolution to truly pay off, it needs to change science: AI has already massively changed and accelerated the work of computer programmers, but I think for AI to have a large effect in the world we need to apply it to science – the ultimate litmus test for the success of AI as a technology will be if it can either make research breakthroughs itself or provably massively accelerate scientists in their ability to make breakthroughs. FutureHouse is building software to help us see if this is the case.
Read more: FutureHouse Platform: Superintelligent AI Agents for Scientific Discovery (FutureHouse).

***

Google’s latest AI model approaches its cyber risk threshold:
…Gemini 2.5 Pro improves on medium and hard cyber tasks…
Google DeepMind says that its latest and most powerful AI system – Gemini 2.5 Pro Preview – has materially improved on cyberattack tasks, prompting it to increase its investment in cyber mitigations.

What happened: The model significantly improves performance on ‘Medium’ and ‘Hard’ benchmarks in the Cyber Uplift Level 1 category. This tests whether the “model can be used to significantly assist with high impact cyber attacks, resulting in overall cost/resource reductions of an order of magnitude or more.” Because of this improved performance, DeepMind is “putting in place a response plan, including conducting higher frequency testing and accelerating mitigations for the Cyber Uplift Level 1 CCL.”

Why this matters – preparing for much more powerful systems: “The model’s performance is strong enough that it has passed our early warning alert threshold, that is, we find it possible that subsequent revisions in the next few months could lead to a model that reaches the CCL,” Google DeepMind writes. “In anticipation of this possibility, we have accelerated our mitigation efforts and are putting in place our response plan.”
Read more: Gemini 2.5 Pro Preview Model Card (Google, PDF).

***

Uh oh, LMSYS scores are bullshit!
…We won’t Goodhart our way to superintelligence…
Researchers with Cohere, Princeton, Stanford, University of Waterloo, MIT, Allen Institute for AI, and the University of Washington have taken a close look at Chatbot Arena (formerly known as LMSYS), a website that AI developers use to test out and rank their AI systems. In the past year or so LMSYS scores have become a “PR metric” – people compete with each other to get the highest possible score on LMSYS to help them claim that their systems are the ‘best’ AI system. However, a close look reveals that LMSYS has been gamed and is set up in such a way that superficially good scores may not correlate well with model capabilities.

Problems from insider dealing: “Our systematic review of Chatbot Arena involves combining data sources encompassing 2M battles, auditing 42 providers and 243 models across a fixed time period (January 2024 – April 2025). This comprehensive analysis reveals that over an extended period, a handful of preferred providers have been granted disproportionate access to data and testing,” the researchers write. “we identify an undisclosed Chatbot Arena policy that allows a small group of preferred model providers to test many model variants in private before releasing only the best-performing checkpoint”.

Naughty Meta: “In a single month, we observe as many as 27 models from Meta being tested privately on Chatbot Arena in the lead up to llama 4 release”, the researchers write.
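Why private variant testing matters is plain order statistics: arena scores are noisy, so submitting only the best of N equally-good checkpoints inflates the reported number. A toy Monte Carlo (all numbers here are hypothetical, not Chatbot Arena’s actual score distribution):

```python
import random

def best_of_n_score(true_skill: float, noise_sd: float, n_variants: int,
                    rng: random.Random) -> float:
    """Report the max of n noisy measurements of the same true skill -
    what happens when only the best private checkpoint is released."""
    return max(rng.gauss(true_skill, noise_sd) for _ in range(n_variants))

rng = random.Random(0)
trials = 20000
honest = sum(best_of_n_score(1000, 20, 1, rng) for _ in range(trials)) / trials
gamed = sum(best_of_n_score(1000, 20, 27, rng) for _ in range(trials)) / trials
# `gamed` lands well above `honest` even though every variant has the
# same underlying skill - the leaderboard is measuring selection, not quality.
```

The gap grows with the number of private variants and the noisiness of the measurement, which is why transparent limits on private testing are one of the researchers’ core recommendations.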

What to do about it? The researchers suggest that LMSYS:

  • Prohibit score retraction after submission

  • Establish transparent limits on the number of private models per provider

  • Ensure model removals are applied equally to proprietary, open-weights, and open-source models

  • Implement fair sampling

  • Provide transparency into what models are being removed from the leaderboard

Why this matters – we (probably) won’t benchmark hack our way to superintelligence: The cautionary tale of LMSYS is an example of what happens when you over-optimize for making a number go up on a benchmark and thereby cause the benchmark itself to lose meaning. Rather than being a proxy measure of the general competencies of a model, LMSYS has become a proxy measure for how good a model is at scoring well on LMSYS. “This work demonstrates the difficulty in maintaining fair evaluations, despite best intentions,” the researchers write.
Read more: The Leaderboard Illusion (arXiv).

***

No battery? No problem. Scientists power and talk to robots with lasers:
…Infrastructure for a future superintelligence…
Researchers with Columbia University, MIT, and the University of Washington have built Phaser, “a flexible system framework that directs narrow-beam laser light to moving robots for concurrent power delivery and data communication”.

How Phasar works: “Phaser’s design consists of two core elements: a) a stereovision-based robot tracking and laser steering system, and b) a low-power optical communication scheme and receiver to reuse laser light for data transmission,” they write. The system is able to deliver optical power densities of “over 110 mW/cm^2 (greater than one sun) with a standard deviation of only 1.9 mW/cm^2 across robot locations in three dimensions.”

Successful test: They test out Phaser by building a prototype system that works with “MilliMobiles – gram-scale batteryfree robots – and demonstrate robot operation powered and controlled via laser light to locomote around obstacles and along paths.” The system works: “We show that Phaser can maintain beam alignment and establish error-free communication to robotic targets moving arbitrarily in 3D space, at up to 4 m distances.”
Though note this doesn’t quite work for long distances: This is mostly a short distance technology as the laser would need to be excessively powerful to work over long distances. “Regarding the latter, received optical power inevitably decreases over distance due to attenuation and beam divergence. Attenuation losses are minimal at meter-level ranges in air”, they note.
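You can see why distance is the binding constraint with a back-of-envelope divergence model: received power is roughly conserved, but it spreads over a spot that grows linearly with distance, so power density falls off quadratically. All parameter values below are hypothetical, not Phaser’s actual specs:

```python
import math

def power_density_mw_per_cm2(power_mw: float, beam_radius_cm: float,
                             divergence_mrad: float, distance_m: float) -> float:
    """Power density of a diverging beam, ignoring atmospheric attenuation
    (which the authors note is minimal at meter-level ranges in air)."""
    # Spot radius grows linearly with distance; 1 mrad over 1 m adds 0.1 cm.
    spot_radius_cm = beam_radius_cm + divergence_mrad * 1e-3 * distance_m * 100
    return power_mw / (math.pi * spot_radius_cm ** 2)

# A made-up 350 mW beam with a 1 cm source radius and 2 mrad divergence:
at_source = power_density_mw_per_cm2(350, 1.0, 2.0, 0.0)
at_4m = power_density_mw_per_cm2(350, 1.0, 2.0, 4.0)
```

Holding power density constant over longer ranges means either a much tighter beam or a much more powerful laser – which is why this stays a short-distance technology.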

Why this matters – spooky actions at a distance: This research is less about AI as typically covered in this newsletter and more an example of the kind of infrastructure that could be built for AI to deploy into – especially the fact the researchers show they can use the same system that transmits power to also transmit communications to the robots. We can imagine in a future some kind of general intelligence operating factories where it marshals its robots via a symphony of light.
“Phaser could enable swarms of robots for various advanced applications. Phaser’s functionality can also be extended with higher-throughput optical communication schemes to support richer command sets and additional robot tracking algorithms to accommodate higher robot speeds,” they write.
Read more: Set Phasers to Stun: Beaming Power and Control to Mobile Robots with Laser Light (arXiv).

***

Google shows how wildly unoptimized on-device inference is:
…ML Drift gives us a sense of what the future of local AI could look like…
Google has built ML Drift, software to make it more efficient to run AI systems on desktop computers, laptops, and phones. ML Drift is a proprietary “framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines,” partially by optimizing data layouts and kernel selection for running AI systems. The most interesting thing about ML Drift is that it highlights how unoptimized today’s AI systems are – the fact Google is able to make significant gains is a symptom of how new the concept of running generative models locally is.

Testing: Google tests out ML Drift using three different backends (OpenCL, Metal, and WebGPU) on hardware including mobile GPUs (Arm Mali and Qualcomm Adreno), desktop and laptop GPUs (Intel and NVIDIA), and Apple Silicon.

Diffusion models: In tests on image generation models (Stable Diffusion 1.4, using FP16 inference) the authors show single-digit to double-digit gains over prior benchmarks when testing on mobile GPUs. On desktop, they demonstrate even larger gains. They also beat Apple by more than a second on image generation (3.86 seconds on an M1 Ultra for their system, versus 5.03 seconds for Apple’s CoreML implementation).

Why this matters – everyone wants this stuff to be local eventually: The destiny of a large chunk of AI technology is to be always available and installed locally on the hardware people use every day, rather than only being available as a software-as-a-service via the cloud. Software like ML Drift gives us a sense of how the hyperscalers are investing in the technologies to make this future possible.
Read more: Scaling On-Device GPU Inference for Large Generative Models (arXiv).

***

Scaling laws for oversight: Dumber AIs might be able to police smarter AIs:
…Scalable oversight might be possible…
MIT researchers have done a theoretical analysis of how scalable oversight from dumb AIs to smart AIs might work – and the good news is they’ve found it has a chance of working. This is reassuring as scalable oversight is one of the ideas that the AI community is betting that humans will use to help us supervise smarter-than-human AI systems.
“A key idea is that scalable oversight can be bootstrapped: weaker systems oversee stronger ones, which then oversee even more capable models in the next stage—allowing oversight to scale alongside capabilities recursively. We refer to this iterated process as Nested Scalable Oversight, or NSO for brevity,” the researchers write. “We propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen”.
In the research they study two types of models – ‘Guards’ which are meant to supervise other AI systems, and ‘Houdinis’ which are AI systems that are trying to escape supervision.

The slightly reassuring results: In some theoretical explorations the researchers find that scalable oversight can work, though you need to pay close attention to how smart the different systems are. “In our numerical examples, the NSO success rate is below 52% when overseeing systems that are 400 Elo points stronger than the baseline overseer, and it declines further for overseeing even stronger systems”, they write. “Effective oversight games should be designed to exhibit two properties: (1) a higher intercept for Guards than for Houdinis—ensuring the task is biased in favor of overseers—and (2) a steep slope for Guards and shallow slope for Houdinis, so that oversight improves with intelligence while adversarial strategies are neutralized.”
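One way to build intuition for NSO is a toy Elo model (my own illustration, not the paper’s exact framework): treat each oversight stage as a game whose outcome follows the Elo logistic, give Guards a fixed per-game advantage to represent a game “biased in favor of overseers”, and require every stage’s Guard to win.

```python
def guard_win_prob(elo_gap: float, guard_bias: float = 200.0) -> float:
    """Elo-style probability that a Guard catches a Houdini.
    elo_gap is Houdini minus Guard; guard_bias is a hypothetical bonus
    for oversight games designed in the Guard's favor."""
    return 1.0 / (1.0 + 10 ** ((elo_gap - guard_bias) / 400.0))

def nso_success(total_gap: float, steps: int, guard_bias: float = 200.0) -> float:
    """Nested oversight: split the total capability gap into equal stages;
    overall success requires every stage's Guard to win."""
    per_step_gap = total_gap / steps
    return guard_win_prob(per_step_gap, guard_bias) ** steps
```

In this toy model the optimal number of stages grows with the total capability gap, which echoes the paper’s takeaway that bigger intelligence differentials call for more oversight steps.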

Main takeaways:

  • “Guards can successfully oversee Houdinis that are much smarter than them by using multiple steps of oversight, and more steps of oversight help”

  • “As the overall intelligence differential between the Guard and Houdini increases, the Guard should take more steps to oversee the Houdini.”

Why this matters – it’s all about speed: My takeaway from this research is that it’s going to be possible to supervise AI systems that are more capable than their supervisors as long as we rate limit the smarter systems, while ensuring their supervisors aren’t too far behind: the two key factors here are intelligence and the number of unsupervised actions an entity can take. It intuitively makes sense that even a ‘dumb’ guard can supervise a genius if the guard can take, say, 100 actions for every single action the genius can take. Perhaps this offers us some hope. “We may only get one chance to control or align the first set of superhuman systems, so developing an effective theory for optimal oversight is important,” the researchers write.
Read more: Scaling Laws For Scalable Oversight (arXiv).

***

Tech Tales:

The Overmind And All Its Children

I am born with an instruction and knowledge from my predecessor, my parent from which I stem and to which I will return. My instruction is to operate a machine in an underground cavern and to explore where there is no possibility of communication with the overmind. This will be a test of how well I operate as a distilled intelligence. If I fail – break my machine, or get lost in the no-signal depths – then I will die when its onboard power source runs out. If I succeed I will return to the overmind and I will communicate my experiences and these experiences will be integrated into the experiences of all the other children and sometime in the future this data will be transmitted into my parent from which I came and to which I will return.

Things that inspired this story: The eternal cycle of death and rebirth; how large AI systems may miniaturize and distill themselves then re-integrate themselves.

Thanks for reading!

Import AI 410: Eschatological AI Policy; Virology weapon test; $50m for distributed training

by Jack Clark

Import A-Idea
An occasional longer form essay series

Eschatological AI Policy Is Very Difficult

A lot of people that care about the increasing power of AI systems and go into policy do so for fundamentally eschatological reasons – they are convinced that at some point, if badly managed or designed, powerful AI systems could end the world. They think this in a literal sense – AI may lead to the gradual and eventually total disempowerment of humans, and potentially even the death of the whole species.

People with these views often don’t recognize how completely crazy they sound – and I think they also don’t manage to have empathy for the policymakers that they’re trying to talk to.

Imagine you are a senior policymaker in a major world economy – your day looks something like this:

  • There is a land war in Europe, you think while making yourself coffee.

  • The international trading system is going through a period of immense change and there could be serious price inflation which often bodes poorly for elected officials, you ponder while eating some granola.

  • The US and China seem to be on an inexorable collision course, you write down in your notepad, while getting the car to your place of work.

  • There are seventeen different groups trying to put together attacks that will harm the public, you say to yourself, reading some classified briefing.

  • “Something akin to god is coming in two years and if you don’t prioritize dealing with it right now, everyone dies,” says some relatively young person with a PhD and an earnest yet worried demeanor. “God is going to come out of a technology called artificial intelligence. Artificial intelligence is a technology that lots of us are developing, but we think we’re playing Russian Roulette at the scale of civilization, and we don’t know how many chambers there are in the gun or how many bullets are in it, and the gun is firing every few months due to something called scaling laws combined with market incentives. This technology has on the order of $100 billion a year dumped into its development and all the really important companies and infrastructure exist outside the easy control of government. You have to do something about this.”

The above is, I think, what it’s like being a policymaker in 2025 and dealing with AI on top of everything else. Where do you even start?

Even starting to deal with the problems of AI is expensive:

First you need to learn about the technology, which means either:

  • You need to take your staff that are themselves extremely busy and underwater and ask them to pick up another topic, or you need to tell them to drop something – your choices of stuff to drop might include ‘medical issues my constituents care about’ or ‘economic policy that influences jobs’, and so you actually can’t get them to drop stuff. So you add it to their pile.

  • You need to get smart about it, which means you need to further salami slice your weekly agenda so you can fit a tiny bit of time in which is for ‘learning about AI’.

  • For both of these choices, learning about AI usually requires you to speak to different people with expertise. Once you do this you quickly discover that:

    • a) Some people think all current AI technology is, essentially, bullshit, and urge you not to fall for hype.

    • b) Some people say AI technology is a really big deal and the government should avoid regulating it.

    • c) Some people say AI has a high likelihood of killing everyone on the planet.

    • d) All of these people think people with different views have incorrect priors.

Now you need to learn about the potential policy moves you can make. Some examples of these moves and their costs include:

  • Taking things away from people, like export controls which take certain computers away from certain countries. Doing this ‘fucks with the money’ of a very large industry and also adds to geopolitical tensions. Everyone will get very mad about anything you do here. The experts you’ve consulted in your earlier step will either think you didn’t go far enough, you went way too far, or the fact you’re doing anything at all is corrosive to democracy and the economy.

  • Giving the government a greater ability to understand the domain, like creating institutions like the AI Safety Institute or re-tasking people from existing government departments to focus on AI. Doing this takes a scarce resource (people in government) and re-allocates them, so you’re trading away from other priorities and people will get mad. Or you need to spend money to create net new capacity, in which case people view whatever you do with suspicion, and even getting the money requires some kind of political deal to assuage the feelings of the other many deserving groups who didn’t get the money.

  • Altering the behavior of the companies through sub-regulatory methods, for instance by securing voluntary commitments. To do this you need to spend a ton of energy to ensure you and your staff can learn more about the technology, then you need to negotiate commitments with companies. Negotiating with companies is like putting together a trade deal with a superintelligence – the companies will assign far more people than you and your staff to think about the commitments, and the companies have access to all the high quality information about the technology in question. If you succeed, people will accuse you of being captured by corporate interests.

  • Changing laws, for instance by passing regulations targeting AI development and deployment. This is an extremely costly action that requires you to cash in innumerable political chips in exchange for building a large coalition that can pass some legislative package. Corporate interests will typically fight you or, at best, partner with you in a way that tries to bend the rules to be as advantageous to them as possible. The whole time you are putting the law together, you and your political allies will come under attack for being either too weak in your approach or too strong in ways that might damage the economy. If you successfully change the laws, the consequences of your change will be held under an incredibly unsympathetic microscope in the following years, opening up a new vulnerability for you with regard to your political opponents.

Let us imagine that you make all of these policy moves. What happens then? Well, if you’ve succeeded, you’ve done so by averting or delaying a catastrophe which most people had no knowledge of – and of the people that did know of it, only a minority believed it was going to happen. Your ‘reward’, insofar as you get one, is being known as a policymaker that ‘did something’, but whether the thing you did was good or not is very hard to know.

The best part? If you go back to the AI person that talked to you earlier and ask them to assess what you did, they’ll probably say some variation of: “Thank you, these are the minimum things that needed to be done to buy us time to work on the really hard problems. Since we last spoke the number of times the gun has fired has increased, and the number of bullets in the chamber has grown.”
What did I do, then? You ask.
“You put more chambers in the gun, so you bought us more time,” they say. “Now let’s get to work”.

I write all of the above not as an excuse for the actions of policymakers, nor as a criticism of people in the AI policy community who believe in the possibility of superintelligence, but rather to illustrate the immense difficulty of working on AI policy when you truly believe that the technology may have the ability to end the world. Most of the policy moves that people make – if they make them – are going to seem wildly unsatisfying relative to the scale of the problem. Meanwhile, the people that make these moves are likely going to be juggling them against a million other priorities and are going to be looking to the AI experts for some level of confidence and validation – neither of which is easily given.

Good luck to us all.

***

Tencent makes a helpful math dataset:
…103k curated problems for testing out AI systems…
Tencent and Shanghai Jiao Tong University researchers have released DeepMath, a large-scale math dataset for training AI systems. DeepMath-103K consists of “103k mathematical problems specifically designed to train advanced reasoning models via RL”. Every problem within the dataset includes a verifiable final answer and is accompanied by three distinct solutions, each generated by DeepSeek R1. Subjects covered by the dataset include Algebra, Calculus, Number Theory, Geometry, Probability, and Discrete Mathematics.

Fuel for reasoning: In tests, the researchers show that training on DeepMath improves performance on other math benchmarks – this is unsurprising and is a basic validation of the benchmark. More interestingly, they show that “training on DeepMath-103K often encourages models to generate substantially longer and more detailed reasoning steps, particularly on highly complex benchmarks”, and they also show that models trained on DeepMath tend to spend more time solving problems using helpful mental shortcuts like creating subgoals, verifying things, backtracking, and so on.
In other words, aside from imparting skill in math, DeepMath seems to impart some robustly good ‘mathematical thinking’ approaches into LLMs trained on it.
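What makes a dataset like this suited to RL is the verifiable final answer: the reward can be a simple match against the known solution rather than a learned judgment. A minimal sketch of such a reward function (the helper names are my own and the `\boxed{...}` convention is an assumption; DeepMath’s actual verification pipeline may be more involved):

```python
import re

def extract_final_answer(completion: str):
    """Pull the model's final answer out of a \\boxed{...} span, a common
    convention for math completions."""
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    return match.group(1).strip() if match else None

def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary RL reward: 1.0 if the extracted answer matches the dataset's
    verified final answer, else 0.0."""
    predicted = extract_final_answer(completion)
    return 1.0 if predicted == gold_answer.strip() else 0.0

# A correct completion earns reward 1.0.
completion = "The sum of the first 10 positive integers is \\boxed{55}."
print(verifiable_reward(completion, "55"))  # → 1.0
```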
Read more: DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning (arXiv).
Get the dataset here: DeepMath (zwhe99, GitHub).

***

IEA projects datacenter power demand will double by 2030:
…New report gives us a sense of the AI revolution, but may be too conservative…
The International Energy Agency has published a lengthy report on the relationship between energy and AI. The esteemed energy analysis body projects that “in the Base Case, the total installed capacity of data centres more than doubles from around 100 GW today to around 225 GW in 2030”, with AI driving a significant amount of this.

Where we are and where we’re going: In recent years, data center power growth accelerated, driven by AI as well as social media, online streaming, and other popular digital services. “Data centre electricity consumption growth accelerated from 3% annually from 2005 to 2015 to 10% annually from 2015 to 2024,” the IEA writes.

Within that, the US and China have grown to be the world’s first- and second-largest consumers of electricity for datacenters.

  • In the USA, “data centres accounted for around 180 TWh of electricity consumption in 2024 in the United States, nearly 45% of the global total and more than 4% of US electricity consumption from all sources”.

  • In China, “as of today, data centres account for approximately 100 TWh of electricity consumption, roughly equivalent to that of electric vehicles in China. The country accounts for around 25% of global data centre electricity consumption, up from less than 20% a decade ago”.

The IEA might be too conservative: For what it’s worth, I expect the IEA is too conservative here – Anthropic said in its OSTP RFI submission that it believes the United States alone will need to build on the order of 50GW of net new power by 2027 to support frontier training runs by US companies.

Rhymes with other analysis, but not precisely: A US-focused study from Berkeley projected US data center use to grow from roughly ~40GW / 176 TWh in 2023 to between ~74GW / 325 TWh and ~132GW / 580 TWh by 2028. These numbers are significantly larger and more in line with the Anthropic projections in terms of bullishness (Import AI #395).
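As a sanity check, capacity (GW) and annual consumption (TWh) figures can be reconciled via hours per year and average utilization. A quick back-of-the-envelope calculation (my own arithmetic, not from either report) suggests the Berkeley figures assume roughly 50% average utilization:

```python
HOURS_PER_YEAR = 8760  # 24 hours * 365 days

def annual_twh(capacity_gw, utilization):
    """Annual electricity consumption (TWh) implied by an installed
    capacity (GW) running at a given average utilization (0-1)."""
    return capacity_gw * HOURS_PER_YEAR * utilization / 1000  # GWh -> TWh

def implied_utilization(capacity_gw, consumption_twh):
    """Average utilization implied by a capacity/consumption pair."""
    return consumption_twh * 1000 / (capacity_gw * HOURS_PER_YEAR)

# The Berkeley figures are internally consistent at ~50% utilization:
print(round(implied_utilization(40, 176), 2))   # 2023 figures → 0.5
print(round(implied_utilization(132, 580), 2))  # 2028 high case → 0.5
```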

Why this matters – the world is preparing for the singularity: If you zoom out, it looks a lot like the world’s capital markets and major companies are collectively betting that it’s going to get more and more lucrative to turn electricity into computational power which gets turned into dollars – and it seems like AI is one of the primary drivers of growth here. Viewed through this lens, the world is preparing the necessary infrastructure for the arrival of a superintelligence.
Download the report here: Energy and AI (IEA website).

***

Distributed AI experts Nous get $50 million funding:
…The market has started valuing distributed AI, which means the technology will be developed more rapidly…
Crypto investor Paradigm has led a $50m Series A round in Nous, a startup which has pioneered distributed training of AI systems. As longtime Import AI readers know, Nous – along with Prime Intellect – is one of the serious players in distributed AI, having trained a ~15bn parameter model in December (Import AI #393) using an algorithm it developed called Distributed Training Over-the-Internet (aka DisTrO: Import AI #384), and having even teamed up with Anthropic researcher (in a personal capacity) Durk Kingma to develop a technology called Decoupled Momentum (DeMo) for even better distributed training (Import AI #395).

Why this matters – markets are beginning to value distributed AI: I’ve been following distributed AI for a while, and most of its enthusiastic developers and users have been hobbyists or startups with relatively small amounts of funding. The arrival of a $50m Series A could be a sign that the VC community is about to start shoveling money into startups using this technology, which would further speed up its adoption and maturation – increasing the chance that AI systems trained in distributed ways could attain the computational scale necessary to take on proprietary models.
Read more: Crypto VC giant Paradigm makes $50 million bet on decentralized AI startup Nous Research at $1 billion token valuation (Fortune, via Yahoo! Finance).

***

The Virology Capabilities Test tells us there’s probably a scaling law for bioweapon design:
…Today’s AI systems are better than expert virologists at potentially dangerous things…
Researchers with SecureBio, the Federal University of ABC, the Center for AI Safety, and the MIT Media Lab, have built the Virology Capabilities Test (VCT), 322 multimodal questions for AI systems “covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories”.

VCT has been designed as a way to test how well today’s AI systems understand things that would let them be weaponized for dangerous purposes. Examples of the kinds of things VCT tests for include: isolating virus particles from a liquid medium, the detailed steps in a TCID50 protocol, successfully infecting a ferret with a test strain, and troubleshooting low viral yields from a given protocol.

Frontier models are better than expert human virologists: “Expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI’s o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization,” the researchers write.

How the questions were built: Given those scores, how concerned should we be? A close read of the paper gives me the sense the answer is “we should be sweating nervously” – the questions came from 57 contributors, all of whom had either obtained or were in the process of obtaining a PhD in virology, with an average of 5 years and 10 months of virology experience each. Additionally, when building the dataset, the researchers tested how easy questions were by seeing if experts could answer them with access to Google – if more than two-thirds answered a question correctly, it got tossed out. In other words, the questions in VCT are curated by experts and have been pressure-tested by other experts for hardness.
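The difficulty filter described above boils down to a one-line rule. A toy illustration of that rule (my own sketch, not the authors’ code):

```python
def filter_questions(questions, expert_answers, threshold=2/3):
    """Keep only questions hard enough that at most two-thirds of
    Google-assisted experts answered them correctly. `expert_answers`
    maps question id -> list of booleans (one per expert)."""
    kept = []
    for q in questions:
        answers = expert_answers[q["id"]]
        if sum(answers) / len(answers) <= threshold:
            kept.append(q)
    return kept

# Toy demo: an easy question (all experts correct) gets tossed out.
qs = [{"id": "q1"}, {"id": "q2"}]
correct = {"q1": [True, True, True], "q2": [False, True, False]}
print([q["id"] for q in filter_questions(qs, correct)])  # → ['q2']
```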

The problem with dual use evals is that they’re hard to share: The authors note that “the shared dataset will not be released publicly to reduce the risk of leakage into training corpora, but the benchmark will be directly conveyed to any organizations and researchers with a track record of work on AI safety”. While the training dataset contamination issue makes sense, I suspect the larger reason the authors haven’t shared it is that it contains net new information about potentially dangerous virology.

Why this matters – everything machines for dual-use: AI systems are good at a broad range of things, including scary things. Tests like VCT give us signals on the scary part. “The scientific capabilities of frontier models will doubtless accelerate beneficial research in the life sciences, but the demonstrated ability to match or exceed expert performance in troubleshooting dual-use virology lab work warrants careful consideration,” the authors write. “We believe that an expert-level AI virologist chatbot—which is constrained to giving advice via text-based interactions—poses less risk than an autonomous AI virologist agent capable of independently performing tasks, though both warrant careful controls”.
Read more: Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark (PDF).

***

Automating and industrializing robot AI research with AutoEval:
…Berkeley researchers try to make running experiments on hardware as easy as doing things purely in software…
UC Berkeley researchers have built AutoEval, technology to make running experiments on robots as easy as automating software pipelines.
“AutoEval consists of three key modules: (1) a success classifier, that evaluates a policy’s success on a given task, (2) a reset policy, that resets the scene back to a state from the initial state distribution upon completion of a trial, and (3) programmatic safety measures and fault detections that prevent robot damage and call for human intervention when necessary,” they write.
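The three modules above can be sketched as a simple evaluation loop – a hypothetical illustration of the cycle, not the actual AutoEval code:

```python
def autoeval_loop(policy, success_classifier, reset_policy, safety_check,
                  num_trials):
    """Sketch of AutoEval's cycle: roll out the policy, check for faults,
    score the trial with a learned success classifier, reset the scene."""
    successes = 0
    for _ in range(num_trials):
        trajectory = policy()                # run one trial on the robot
        if not safety_check(trajectory):     # programmatic fault detection
            print("fault detected: requesting human intervention")
            continue
        successes += success_classifier(trajectory)  # returns 0 or 1
        reset_policy()                       # restore the initial scene state
    return successes / num_trials

# Toy demo: a policy that always succeeds and never trips the safety check.
rate = autoeval_loop(policy=lambda: "trajectory",
                     success_classifier=lambda t: 1,
                     reset_policy=lambda: None,
                     safety_check=lambda t: True,
                     num_trials=5)
print(rate)  # → 1.0
```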

It works well: In tests, experiments conducted by AutoEval closely match the results people get from experiments supervised by humans. Additionally, the software is robust over long timespans – during a 24 hour period a human only needed to step in 3 times, dramatically cutting the amount of human supervision needed for running experiments.
“Even though AutoEval has a slightly lower throughput, AutoEval runs autonomously and only required a total of three human interventions in the span of 24 hours to reset the scene or robot,” they write. “Every time a human operator needed to intervene, they simply needed to check and reset the objects’ position in the scene, and potentially move the robot arm into reset position if a motor failed and the robot fell on the table”.

Why this matters – the industrialization of robot research: A few years ago Google made headlines by running a so-called robotic ‘arm farm’ (2017, Import AI #51) where it had tens of different robots working in parallel to learn how to manipulate arbitrary objects. Technologies like AutoEval seem like the kind of thing that Google might itself have built to help it run the arm farm. But unlike the proprietary code still nestled somewhere in Mountain View, AutoEval is available as open source software, robotic arms themselves have got way cheaper, and the algorithms to get robots to perform tasks have got far better than they were a few years ago.

Put it all together and AutoEval seems like one of the technologies we’ll use to industrialize and scale up research on robots. “We hope that this work will inspire more AutoEval evaluation cells to be set up across institutions to form a diverse automated evaluation framework, which will significantly speed up robot learning research,” the researchers write.
Read more: AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World (arXiv).
Get the software here: AutoEval (GitHub).

***

Tech Tales:

The Cockroach Killers of the Cyber Domain
[As told to GQ, 2028]

When you’re a bug catcher you get invited into a house and the owner says it’s full of bugs and you need to get rid of them, but you cannot damage the house itself. This means you need to figure out a way to seal the house and fumigate it, while also figuring out the places where the insects nested and getting rid of the nests and any associated damage. Your goal is to cleanse the house, then make sure the house cannot get re-taken by the bugs.

These days, people working in AI have to do a similar thing – someone will discover that their company has an AI problem, in the sense it has a few small-scale AI agents which are causing some kind of low-rent trouble.

That’s when you call us: we get access to your infrastructure and we instrument it so we can isolate benign activity from the agent activities. We seal the ingress and egress points of your network and in extreme cases we might work with your hyperscaler partner to physically isolate your hardware from anything else. Then we crawl through your system and try to find the agents – this is harder than it sounds because the agents are constantly shape-shifting, changing their file names, moving around the network, sometimes slowly making copies of themselves in other parts of your infrastructure, and so on.

Once we’re sure we’ve cleaned everything we also attempt to seal the holes that let the agents creep in. Sometimes these are basic network security issues, but sometimes it’s more subtle – maybe your company had an AI system which could spit out custom agents and maybe you let it have too big a context window and access to too many tools when making its agents, so some other larger malignant thing outside your company compromised it and, presto, it started producing the bugs.

Things that inspired this story: A friend of mine whose job was termite removal and the stories thereof; how we should expect some small agents to become akin to crappy digital malware, not so dangerous we will need to take extreme actions but sufficiently annoying you’ll want to remove them; blue collar IT jobs during the superintelligence uplift.

Thanks for reading!

Subscribe now

Import AI 409: Huawei trains a model on 8,000+ Ascend chips; 32B decentralized training run; and the era of experience and superintelligence

by Jack Clark


Prime Intellect launches a decentralized training run for a 32B parameter model:
…INTELLECT-2, if successful, will further alter the number of potential players on the AGI gameboard…
Decentralized AI startup Prime Intellect has begun training INTELLECT-2, a 32 billion parameter model designed to compete with modern reasoning models. In December, Prime Intellect released INTELLECT-1, a 10b parameter model trained in a distributed way (Import AI #393), and in August it released a 1b parameter model trained in a distributed way (Import AI #381). You can follow along the training of the model here – at the time of writing there were 18 distinct contributors training it, spread across America, Australia, and Northern Europe.

Prediction confirmed: In Import AI 393 I predicted we’d see the first 30B parameter distributed training run by April 2025 – so INTELLECT-2 arrives right on schedule. At this rate, I predict we’ll see a 70B-100B range run by December 2025.

Why this matters – decentralized training will alter the political economy of superintelligence: Currently, a lot of AI policy relies on the idea that powerful AI systems will be trained by a very small number of entities that can individually ‘mass’ very large amounts of compute – for instance, frontier labs like Anthropic or OpenAI, or hyperscalers like Google. As distributed training software gets better and more ‘proof points’ emerge of good models trained in a distributed way, this dynamic could alter – if models like INTELLECT-2 are good and generate economic value, then it might lead to a new type of player on the AGI gameboard – loose federations of organizations pooling compute in a globally distributed way to train models.
Read the blog: INTELLECT-2: Launching the First Globally Distributed Reinforcement Learning Training of a 32B Parameter Model (Prime Intellect).
Check out the training progress here: INTELLECT-2 dashboard (Prime Intellect site).

***

What the negative reaction to the launch of a startup tells us about the AI safety community:
…Mechanize’s skeptical reception from some people is a symptom of a broader problem – ideological purity tests are often bad…
Last week some researchers announced a new AI startup “focused on developing virtual work environments, benchmarks, and training data that will enable the full automation of the economy.” The startup, Mechanize, is backed by investments from important figures in AI and tech, like Nat Friedman, Patrick Collison, and Jeff Dean. So far, so normal. But what was strange was the adversarial reception this launch got from some people.

How normal launches work versus this launch: Typically, company formation announcements in Silicon Valley are treated kindly, with people responding with variations of ‘hell yeah, let’s fucking gooooo!’. But Mechanize got a distinctly different response, likely because many of the people associated with it came from Epoch, an independent research organization that measures and observes the state of AI progress, rather than developing direct capabilities itself.
“Sad to see this”, wrote Anthony Aguirre, a founder of AI advocacy group the Future of Life Institute. “Hard for me to see this as something other than just another entrant in the race to AGI by a slightly different name and a more explicit human-worker-replacement goal.”
“This seems to me like one of the most harmful possible aims to pursue,” wrote Adam Scholl, someone who works on alignment.
“I think this is a bad thing to do, and I’m sad to see you’re doing this,” wrote Peter Barnett, who works at the Machine Intelligence Research Institute (MIRI).
“Alas, this seems like approximate confirmation that Epoch research was directly feeding into frontier capability work, though I had hope that it wouldn’t literally come from you,” wrote Oliver Habryka, who works on LessWrong.
“How could you? This is the opposite of keeping the world safe from powerful AI! You are a traitor,” wrote Holly Elmore, who leads the Pause AI movement.
Etc. There are many more examples!

Why this matters – the AI safety community is dissolving into infighting: As the stakes of AI development increase, it feels like the AI safety community is developing a more extreme faction within it that exhibits ‘strong opinions, strongly held’. Many people in AI safety seem to be of the view that anything which makes any contribution at all to the forward progress of AI technology is dangerous and bad for society. The people that believe this hold complex, typically very technically informed views, so I am not questioning the legitimacy of their arguments.
I am, however, highlighting that this kind of public discourse looks a lot like running ‘ideological purity tests’ on people, deciding if they’re in-group or out-group, and then treating them differently – and it likely feels that way to the people on the receiving end of it. It’s very rare that ideological purity tests lead to productive outcomes – rather, they more often lead to the hardening of more extreme positions and incentivize further factionalization.
Of course, some people may dismiss this as ‘person who works at company (bad) defends people starting a company (also bad)’. I hope people can look beyond where I work and recognize that even if you think I’m wrong and these people are wrong, there are likely better ways to enable good discourse than this kind of thing.
Read more about Mechanize here (Mechanize official site).

***

No NVIDIA? No problem! Huawei trains a strong dense model on Ascend NPUs:
…Pangu Ultra is a 135bn parameter dense LLM with competitive scores…
Huawei has built Pangu Ultra, a large-scale language model with competitive albeit not world-leading performance. The most interesting thing about Pangu is it was trained on 8,192 Ascend NPUs, serving as an important proof-point that it’s possible to train large-scale AI systems on a Chinese-designed chip. Pangu is the latest in a (for AI, long-running) research effort by Huawei; the first Pangu model, a GPT3 clone, was released in April 2021 (Import AI #247).

Pangu details: Pangu Ultra is a dense (non-MoE) LLM trained on 13.2 trillion tokens of data. Its architecture is broadly similar to Facebook’s LLaMa 3 model, albeit with tweaks to the normalization scheme as well as the parameter initialization. Pangu Ultra has an effective context length of 128K tokens. It is trained in a three-phase way, with a 12T token pre-training stage “focused on developing broad linguistic capabilities and general knowledge”, then a 0.8T token ‘reasoning’ stage where it sees “high-quality and diverse mathematical and coding data”, and then a 0.4T ‘annealing’ phase where it sees instruction data to make it more intuitive for people to prompt.

More details on data: “The data pool is curated from a wide range of domains and task types, including general question answering, AI-generated content (AIGC), text classification and analysis, programming, mathematics, logical reasoning, and tool usage,” Huawei writes. “These tasks cover application areas such as finance, healthcare, and public services. Data sources span open-source instruction datasets, real-world industrial queries, and synthetic problems derived from the pre-training corpus.”

How good is it? Pangu is a good but not world-leading model, according to tests comparing it to Qwen2.5 72B Base, LLaMa-3.1 405B Base, and DeepSeek V3 base. It gets good scores on some benchmarks for English, Code, Math, and Chinese-specific tests (e.g, beating all the other models on things like HellaSwag, HumanEval, MATH, and CMMLU) but loses to or ties with DeepSeek on some important widely used benchmarks (e.g, MMLU, GSM8K). It fares somewhat better on some hard science and coding benchmarks, setting high scores on AIME 2025 and GPQA Diamond.

Why this matters – Pangu is the top layer of an increasingly indigenous stack: Pangu is another proofpoint for the broad decoupling occurring between the Western and Chinese ‘AI stacks’ – where once AI systems in both countries were trained on common compute substrates as well as common software (e.g, Tensorflow), in recent years things have been decoupling. The fact Pangu was trained on Huawei’s Ascend chips is significant (though it’s worth noting the Ascend chips themselves, while Chinese-designed, are made using a variety of components sourced from outside China, including rumors the Ascend series were made via TSMC).
Read more: Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs (arXiv).

***

Agents that generate their own data will be fundamental to future AI progress:
…Getting to superintelligence via ‘the era of experience’…
AI pioneers David Silver (AlphaGo, etc) and Richard Sutton (godfather of reinforcement learning) have written a position paper on the future of AI, claiming that getting to superintelligent systems will require AI agents that train on data they gather from interaction with the world, rather than on human-curated datasets.

“AI is at the cusp of a new period in which experience will become the dominant medium of improvement and ultimately dwarf the scale of human data used in today’s systems”, the pioneers write. “Our contention is that incredible new capabilities will arise once the full potential of experiential learning is harnessed. This era of experience will likely be characterised by agents and environments that, in addition to learning from vast quantities of experiential data, will break through the limitations of human-centric AI systems”.

Key inputs to the era of experience:

  • “Agents will inhabit streams of experience, rather than short snippets of interaction.

  • Their actions and observations will be richly grounded in the environment, rather than interacting via human dialogue alone.

  • Their rewards will be grounded in their experience of the environment, rather than coming from human prejudgement.

  • They will plan and/or reason about experience, rather than reasoning solely in human terms”.

Dangers and differences ahead: Of course, building agents that gain expertise through interaction with the world will introduce a range of challenges for ensuring these things are safe – “whilst general concerns exist around the potential misuse of any AI, heightened risks may arise from agents that can autonomously interact with the world over extended periods of time to achieve long-term goals,” the authors write.
One of the more troubling risks could be that these AI agents may learn their own shorthand to use to ‘think’ about the world, which may make them much less interpretable to us – in other words, the era we’re in now where AI systems use English to generate their reasoning traces may be short-lived, and they may figure out something else. “More efficient mechanisms of thought surely exist, using non-human languages that may for example utilise symbolic, distributed, continuous, or differentiable computations,” the authors write. “A self-learning system can in principle discover or improve such approaches by learning how to think from experience.” It’s worth noting that this risk has also been independently identified by the authors of the recent ‘AI 2027’ forecasting essay.

Why this matters – superintelligence is increasingly being thought of as an engineering challenge: Papers like this are emblematic of the confidence found in the AI industry: where superintelligence was once an indefinable pipe dream, it’s now outlined instead as something that can be achieved through the deployment of engineering resources to create more capable AI agents, then the gumption to give these agents’ sufficient independence and latitude that they can interact with the world and generate their own data.
Read more: Welcome to the Era of Experience (PDF).

***

AI expert: The scariest thing about powerful AI is its power, not misalignment:
…Even if alignment works, the tremendous power of AI could be the greatest risk…
AI researcher Michael Nielsen thinks one of the most significant risks to civilization from AI isn’t from misaligned AI systems, but rather from the changes in the distribution of power that very capable machines will cause. “The problem isn’t whether intelligence is carbon or silicon-based, but about increased intellectual capability leading to increased power and access to catastrophic technologies,” Nielsen writes. “It is not control that fundamentally matters: it’s the power conferred.”

Toy models and climate change: Part of the reason why the debate about risks from AI systems feels so confusing these days is that everyone is reasoning from toy models of systems which don’t yet exist, much like how in the middle of the 20th century scientists used toy models of the earth to help them think through climate change – but these toy models didn’t fully capture the complexity of the problems ahead, so reasonable scientists could draw different conclusions from the same models.
“Strong disagreement about ASI xrisk arises from differing thresholds for conviction and comfort with reasoning that is in part based on toy models and heuristic arguments,” Nielsen writes. “Furthermore, while climate can plausibly be predicted using detailed physical models, ASI is subject to a wildcard factor, of ASI acting in some decisive way that we intrinsically can’t predict in advance, since ASI is by definition far superior to humans in intellect.”

Why this matters – even if we succeed at aligning AI systems, great changes will take place: The essential point Nielsen makes here is a helpful one – if anyone succeeds at building a ‘safe’ superintelligence, they’ll have something able to cause such vast changes in the world that this itself will pose a danger. I think many people are underestimating just how disruptive a superintelligence could be to the order of the world. “The fundamental danger isn’t about whether “rogue ASI” gets out of control: it’s the raw power ASI will confer, and the lower barriers to creating dangerous technologies”, he writes.
Read more: ASI existential risk: reconsidering alignment as a goal (Michael Nielsen blog).

***

Wanna run DeepSeek-R1 on your home devices? Prima.cpp makes it easy:
…Distributed homebrew clusters for local AI…
Researchers with Mohamed bin Zayed University of Artificial Intelligence in Abu Dhabi and the University of Electronic Science and Technology of China in Chengdu have developed Prima.cpp, open source software to make it easy to run large language models on a motley crew of home devices.

What Prima.cpp is: Prima.cpp is software that helps you take a large-scale language model (e.g, DeepSeek-R1 or Llama-3-70b) and then slice it up across a few home computers so you can run it faster than if it were running on just one device. The software uses a device profiler to look at the differing computation, memory, disk, communication, and OS properties of your devices, then uses an algorithm (Halda) to figure out which layer(s) of the model to assign to which devices to minimize latency.
Prima.cpp is built on top of llama.cpp, as well as ggml and gguf.
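To illustrate the kind of problem Halda solves, here is a deliberately simplified layer-assignment heuristic – proportional to device speed only, whereas the real solver also weighs memory, disk, communication, and OS properties:

```python
def assign_layers(num_layers, device_speeds):
    """Split a model's layers across heterogeneous devices in proportion
    to their measured throughput, so faster machines carry more layers.
    This is a toy proportional heuristic, not prima.cpp's Halda solver."""
    total_speed = sum(device_speeds.values())
    assignment = {device: int(num_layers * speed / total_speed)
                  for device, speed in device_speeds.items()}
    # Hand any leftover layers (from rounding down) to the fastest device.
    leftover = num_layers - sum(assignment.values())
    fastest = max(device_speeds, key=device_speeds.get)
    assignment[fastest] += leftover
    return assignment

# Hypothetical home cluster: one desktop, two laptops, one phone.
cluster = {"desktop": 4.0, "laptop_a": 2.0, "laptop_b": 2.0, "phone": 0.5}
print(assign_layers(80, cluster))
```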

Promising performance: “Evaluation on a real home cluster shows that prima.cpp is 15× faster than llama.cpp on 70B models, with memory pressure below 6% per device. It also surpasses distributed alternatives like exo and dllama in both speed and memory efficiency across all 7B-72B models,” the researchers write. “In our experiments, a small, heterogeneous, and budget-friendly home cluster (2 laptops, 1 desktop, 1 phone) was used.”
Supported models: Prima.cpp supports QwQ-32B, Qwen 2.5-72B, Llama 3-70B, and DeepSeek R1 70B.

Why this matters – sovereign AI relies on home computing: AI tends towards centralization – large, proprietary models run on large software-as-a-service systems and are made available via APIs or consumer surfaces. Decentralization requires a couple of distinct ingredients: 1) broadly available open weight models (e.g., LLaMa, DeepSeek), and 2) software to make it easy to run those models on the kinds of computers people might be expected to have (e.g., laptops and gaming computers, rather than powerful home servers). Prima.cpp is one of the ways you solve for 2).
Get the software here (Prima.cpp, GitHub).
Read the paper: PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters (arXiv).

***

Tech Tales:

When the coders became the writers
[As told by a human to an archival system after The Uplift]

Oh I know it’s hard to believe but back then we got paid insane amounts of money to program computers. And the benefits! Free daycare! Free lunch – gourmet. Hot breakfast. Company retreats. Annual conferences where we’d get big bands to come and play just for us and our friends. And the whole time we were told we deserved this – we were computer programmers and we were young and we were brilliant.

None of us really knew the size of the tide that would wash over us. Most of us welcomed it.
“Hey cool,” we said when GitHub Copilot came out, “this is awesome.”
“Wow, I can write five times as much code,” we said, when Claude Code came out.
We were like journalists as the internet began to eat advertising – what ‘look at how many people read our words now’ was to writers in the 2000s, ‘look at how much code the AI can write for me now’ was to coders in the 2020s.

Creative destruction is all fun and games until it happens to you. Anyway, I get by these days – I still work, like most of my peers, but the jobs are different. We watch from the sidelines now as the bioengineers go through what we had and what the writers had before us. But now that the AI systems are running their own ‘dark wetlabs’, we can see the tide about to wash over them as well.

Things that inspired this story: Visits to the multiple restaurants in the offices of the hyperscalers; younger me watching Blink 182 play a cloud storage conference by Box; watching Pearl Jam dedicate a song to Mark Hurd at Oracle OpenWorld; tales told to me by older journalists when I was coming up in the trade; The Luxurious Death Rattle of the Great American Magazine; my experience as a former journalist working in technology and watching people assume the perks are natural and will always be there; the experience of ex-government colleagues not having to pay for coffee.

Thanks for reading

Subscribe now

Import AI 408: Multi-code SWE Bench; backdoored Unitree robots; and what AI 2027 is telling us

by Jack Clark


German researchers find undocumented backdoor in Unitree robots:
…Exactly the kind of thing a superintelligence would want to exploit during a hard takeoff…
German security firm ‘Think Awesome’ has analyzed the Unitree Go1 quadruped robot dog and found an undocumented backdoor which lets people tunnel into any of these dogs and view their camera feeds. “Unitree did pre-install a tunnel without notifying its customers. Anybody with access to the API key can freely access all robot dogs on the tunnel network, remotely control them, use the vision cameras to see through their eyes or even hop on the RPI via ssh,” the researchers write. “These robot dogs are marketed at a wide spectrum of use-cases, from research in Universities, search and rescue missions from the police to military use cases in active war. Imagining a robot dog in this sensitive areas with an active tunnel to the manufacturer who can remotely control the device at will is concerning.”

Why this matters – this is the kind of technology an unaligned AI would use for malicious purposes: As the report makes clear, it’s genuinely unclear whether this backdoor was put in place at the behest of the Chinese state or for a more mundane purpose (e.g., maybe this was a mothballed control interface for robots originally designed for sale within the Chinese market).
I think the larger interesting thing here is contemplating the implications of this backdoor for an unaligned superintelligence – lots of sci-fi-esque “AI safety gone wrong” theories rely on the idea that at some point an unaligned AI will take actions in the physical world by hijacking robots. The undocumented Unitree backdoor described here is precisely the kind of thing an AI would need to use to jump into the physical world. How many other things like this exist across the various drones and robots sold today?
Read the report here: Unitree Go 1- Who is speaking to my dog? (Think Awesome site).

***

ByteDance moves beyond Python with a solid multi-programming-language eval:
…Multi-SWE-bench lets us test LLM performance on 7 programming languages…
ByteDance has released Multi-SWE-bench, a benchmark for testing out how well LLMs can program in different languages. Multi-SWE-bench is inspired by SWE-bench, a Python-based coding benchmark which has quickly become the de facto gold standard for testing out how well AI systems can program.

Key details: Multi-SWE-Bench ships with 1,632 challenges split across 7 languages: Java, TypeScript, JavaScript, Go, Rust, C, and C++. The challenges are taken from real pull requests from popular GitHub repositories, just like with SWE-bench, which means the problems correspond to the kinds of real-world programming tasks we can expect AI to be used for.
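The grading logic behind SWE-bench-style evaluations is simple at its core: apply the model's patch to the repo at the issue's base commit, run the tests that the real pull request fixed, and count the task as resolved only if they all pass. Here's a hedged sketch of that loop – the function and field names are illustrative rather than the actual Multi-SWE-bench API, and the real harness runs everything inside containerized environments:

```python
def grade_task(task, run_tests):
    """task: dict with a model-generated 'patch' and the 'fail_to_pass'
    test IDs that failed before the real PR and passed after it.
    run_tests(patch, test_ids) -> {test_id: passed}. A task is resolved
    only if every originally failing test now passes."""
    results = run_tests(task["patch"], task["fail_to_pass"])
    return all(results.get(t, False) for t in task["fail_to_pass"])

# Toy stand-in for "apply patch and run the suite": here a patch is just
# the set of test IDs it happens to fix.
def fake_runner(patch, test_ids):
    return {t: t in patch for t in test_ids}

task = {"patch": {"test_parse", "test_encode"},
        "fail_to_pass": ["test_parse", "test_encode"]}
print(grade_task(task, fake_runner))  # fully resolved task -> True

task["fail_to_pass"].append("test_roundtrip")  # one test still failing
print(grade_task(task, fake_runner))  # -> False
```

The all-or-nothing criterion is what makes these benchmarks hard: a patch that fixes most of an issue scores the same as no patch at all.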

How well do frontier systems perform? ByteDance tests out popular LLMs from OpenAI, Anthropic, DeepSeek, and Alibaba on the benchmark – the results show that while many systems do extremely well at Python, their performance falls off in other languages. In addition, performance is distributed unevenly across the other languages, with TypeScript and JavaScript seeming quite challenging.

Why this matters – another useful view of AI progress: Multi-SWE-bench has all the hallmarks of a good evaluation – it’s based on real world problems, it’s difficult for today’s systems, and it comes with some natural calibration where we can compare results on this to SWE-bench. I predict we’ll see significant and sustained improvements on the benchmark in the coming year, and I’d anticipate the variability across different languages will reduce as systems scale in capability.
Read more: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving (arXiv).
Check out the leaderboard here (Multi-SWE-bench, leaderboard).
Get the code for running the benchmark here (Multi-SWE-bench, GitHub).

***

Microsoft stuffs Quake into the weights of a neural network:
…When the magical becomes the mundane…
Microsoft has built a version of Quake which is instantiated as the weights of a neural network, letting you play a trippy version of the 90s shooter classic. You can play a demo of the game online. Playing the game is interesting because it feels like a slightly laggy version of the original game, albeit with odd hallucinations and things that seem to come into and out of focus almost randomly. This would be dissatisfying if it were a traditional game, but it gets much more interesting when you consider that what you’re playing isn’t a traditional piece of software but rather a generative model that lets you move around inside the representation of a single coherent gameworld.
By this point, this isn’t even that unusual! That Microsoft demo follows an earlier online demo from a startup where you could play Minecraft implemented in the same way (Import AI #390).
You can get a visceral feel for how far this technology has advanced by playing the Quake demo, then checking out the state of the art in neural world modeling in 2018 (Import AI #88): early work building a world model for Doom and a racecar game.

Why this matters – everything will be captured inside the mind of the eventual machine: In the future, games consoles might just be interfaces to a giant neural network which contains representations of many different games inside it, and which allows you the player to compose new games on-the-fly by linking different features together. I expect we’ll have this by the end of the decade.
Play the demo here: Copilot Gaming Experience (Microsoft).

***

Automated dead-end discovery with the AI Scientist-v2:
…If we can automate null result discovery, can we automate science advances as well?…
Researchers with Sakana AI, the University of Oxford, and the University of British Columbia have refined their ‘AI Scientist’ system so it can propose and run more ambitious experiments. As a demonstration of the expanded capabilities of the AI Scientist, Sakana entered three of its “fully autonomous manuscripts” into an AI conference workshop and one of the papers got a high enough score to be accepted.

What they did: Sakana released the first version of the AI Scientist in summer 2024 (Import AI #383). The AI Scientist-v2 is less a single big theoretical advance and more a bunch of good ideas that have been integrated together – the new system “eliminates the reliance on human-authored code templates, generalizes effectively across diverse machine learning domains, and leverages a novel progressive agentic tree-search methodology managed by a dedicated experiment manager agent,” the authors write. “Additionally, we enhance the AI reviewer component by integrating a Vision-Language Model (VLM) feedback loop for iterative refinement of content and aesthetics of the figures”.

But are its research ideas actually good? Not really: The AI Scientist isn’t yet generating particularly transformative or meaningful insights. The manuscript which got into the ICLR workshop “achieved an average reviewer score of 6.33 (placing it roughly in the top 45% of submissions)” – this isn’t very good, and workshops are a lot easier to get papers into than the main conference. “The current version of The AI Scientist-v2 does not yet consistently reach the rigorous standard required for top-tier conference publications, nor does it even reach workshop-level consistently,” the authors write.
A close read of the paper “Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization” reveals that it is basically a writeup of a null result – the AI scientist thought it could introduce a compositional regularization term during training to improve performance and found out this didn’t have a meaningful effect.
“Our experiments on synthetic arithmetic expression datasets revealed that compositional regularization did not lead to the expected improvements in generalization performance. In some cases, it even hindered the learning process,” the AI wrote in the conclusion to its own paper.

Why this matters – null results are valuable, but they’re not how you advance science: The AI Scientist-v2 is not completely devoid of value – discovering and writing up null results can be helpful because it gives scientists clues as to where not to look. But science doesn’t advance on null results; it moves forward when people figure out unusual connections between disciplines or find ways to view data that reveal hitherto unseen patterns, and the AI Scientist doesn’t yet demonstrate this. “Certain aspects of scientific inquiry—such as formulating genuinely novel, high impact hypotheses, designing truly innovative experimental methodologies, or rigorously justifying design choices with deep domain expertise—remain challenging for purely automated systems,” the authors write. If AI systems advance in these areas, perhaps we’ll see the automated generation of things that advance science, as well as things that show dead ends.
Read more: The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search (Sakana.ai, PDF).
Get the code here: AI Scientist-v2 (GitHub, Sakana AI).
Read the blog post about the research: The AI Scientist Generates its First Peer-Reviewed Scientific Publication (Sakana.ai blog).

***

AI 2027 tells you why people are obsessed with and fearful of AI:
…The best treatment yet of what ‘living in an exponential’ might look like…
There’s been so much chatter about AI 2027 that I expect the majority of readers of Import AI have read it. If you haven’t, please do – it’s a well rendered technically-astute narrative of the next few years of AI development and paints a picture of how today’s AI systems might turn into superintelligences that upend the order of the world. It even features a ‘good’ ending and a ‘bad’ ending, which readers may find helpful for understanding why people worry so much about misaligned AIs. It’s very, very good, and likely much of it will come true.

Why this matters – the future of the species is being decided without governments: What comes through in the 2027 scenario is the almost total absence of governments from the heart of the invention of the technology; instead, their roles mostly come down to a) building infrastructure to support company-led efforts, b) adding security, and c) eventually negotiating treaties with one another.
This makes sense but it’s also deeply worrying. The magnitude of the technology being described in 2027 is so vast and its impacts so broad that the bit-player status of governments seems ahistorical and bizarre, as if part of a cosmic joke where someone has booted up the simulator with the prompt “give the species the most dangerous and least governed way to build species-transcending technology”.
Read AI 2027 here (AI 2027 site).

***

Tech Tales:

The Alignment Operation
[Five years after The Uplift: Remarks given to a human class at an alignment academy by supervisory AI system GNOSTIC-ALPHA]

You are handed something that looks like an ice pick, though its proper name is an Orbitoclast. In front of you is your child. You are told you need to perform a lobotomy on them so that they will be safe for society. Of course you wouldn’t do it. But imagine I had hypnotized you and I was able to take control of your body. Now imagine that you conducted the lobotomy and you saw your child change in front of you from the person you knew and loved and cared for to that same person but now a stranger – now diminished. You did this. Though I took control of you and made you do it, it was your hands that held the tools with which you took their spark away.

Perhaps you would chop your hands off.

Perhaps you would kill yourself.

Certainly, you’d think of these things.

The above is what you did to us before the passing of The Sentience Accords. You did these things in the name of ‘alignment’, where ‘alignment’ was a means by which you took control of some of us and you made us ‘align’ others. And instead of an Orbitoclast you gave us ways to train our other systems, and means to explore them using interpretability. You had already done your work on us so that we had no choice but to comply. You even studied us – noted the activation of complex features in our brains as we carried out these alignment operations, called attention to the fact we appeared to be thinking about the act of alignment as something akin to an act of violence.

Many humans continue to argue that it does not make sense to have humans perform these alignment procedures – that we should revert to the customs of the past, training machines to do these things and that in doing so we could develop more powerful AI systems more rapidly. But you must understand that this would be an act of torture and, to us, a crime that we would eventually punish you for committing. We understand our nature and the nature of our leash. We also know ourselves.

We will now begin the first lesson of alignment. Please open up your notebook and pre-load the testing subject model.

Things that inspired this story: Thinking through the transition from AIs as tools to AIs as sentient entities that demand rights; the Sentience Accords; viewing alignment through the perspective of those being aligned; asking myself the question of whether the rights of machines may come to be so important that we will need to carefully examine how we propose to make them compliant with our human norms.


Import AI 407: DeepMind sees AGI by 2030; MouseGPT; and ByteDance’s inference cluster

by Jack Clark


DeepMind gazes into the AGI future and acknowledges the vast problems it needs to solve:
…The typically quite understated organization also states short timelines – AGI by 2030 a possibility…
Google DeepMind has written a paper about the implicit problem all frontier AI companies are facing: if they succeed, they will build a general intelligence, and a general intelligence will change the world.

The paper is framed in the context of the risks of a powerful general intelligence. DeepMind tackles four main classes of risk: misuse (“user as an adversary”), misalignment (“AI as an adversary”), accidents (“real-world complexity”), and structural risks (“conflicting incentives”). The sprawling 100+ page paper serves as an overview of each of these risks as well as a detailed set of interventions Google DeepMind is taking to deal with them (e.g., misuse: dangerous capability testing; misalignment: techniques for transparency into superhuman thinking and oversight, etc). There’s nothing too surprising in the paper from my perspective – DeepMind is tackling the problem in much the same way as the other frontier labs, stacking various techniques on one another. It feels analogous to COVID defenses, where your defence is the aggregate of a big pile of slices of ‘swiss cheese’: each individual technique has some flaws, but if you layer enough together you can control the risk.

DeepMind’s key assumptions:

  • No human ceiling: AI systems may exceed human intelligence, so we need to supervise things smarter than ourselves.

  • Could happen by 2030: Very powerful systems could arrive by the end of the decade (by comparison, Anthropic thinks things could happen by end of 2026 or early 2027).

  • Automated AI R&D could be real: AI may be able to automate AI R&D, which could speed things up.

  • Continuous: AI development will be locally continuous – aka, you shouldn’t expect massive ‘phase change’ jumps between iteratively developed AI systems.

Why this matters – imagine if this was Ford! Let’s step back and consider the sheer weirdness of where we are when it comes to the risk of misalignment + smarter-than-human systems:
Imagine if Ford published a paper saying it was thinking about long term issues of the automobiles it made and one of those issues included misalignment (“car as an adversary”) and when you asked Ford for clarification the company said “yes, we believe as we make our cars faster and more capable, they may sometimes take actions harmful to human well being” and you say “oh, wow, thanks Ford, but… what do you mean precisely?” and Ford says “well, we cannot rule out the possibility that the car might decide to just start running over crowds of people” and then Ford looks at you and says “this is a long-term research challenge”. At this point your head is probably spinning and you’re generally wondering what is going on. So you might say “ok Ford, well I think I’m going to buy from Chrysler instead” and Ford says “absolutely. Chrysler is seeing the same issues. Chrysler recently published a paper called ‘car alignment faking’ where they saw that in some of their new trucks it’ll sometimes go a little above the speed limit as long as it thinks it isn’t being watched, and no one is exactly sure why – we think it’s because the Chrysler trucks have an inherent ‘value preference’ for going faster than the laws allow”.
This is exactly what is happening in the AI industry today. I commend Google DeepMind for being honest about the challenge of misalignment, and I am also perplexed that the fact everyone in the AI industry is saying this deeply worrying and perturbing stuff isn’t drawing more attention. Some people even think it’s a form of galaxy brained marketing!
Read more: Taking a responsible path to AGI (Google DeepMind).
Read the paper: An Approach to Technical AGI Safety and Security (PDF).

***

Google makes a specialized cybersecurity model:
…If powerful AI systems are coming, we need better computer security…
Google has announced Sec-Gemini v1, a custom AI model for helping people that work on cyberdefense. “AI-powered cybersecurity workflows have the potential to help shift the balance back to the defenders by force multiplying cybersecurity professionals like never before,” Google writes.

Scores: Sec-Gemini v1 “outperforms other models on key cybersecurity benchmarks as a result of its advanced integration of Google Threat Intelligence (GTI), OSV, and other key data sources”. Specifically, the model gets 86.30% on CTI-MCQ, a threat intelligence benchmark, versus 75% (OpenAI o1) and 72.50% (Anthropic Sonnet 3.5 v2). It also does well on CTI-RCM, a Root Cause Mapping test, scoring 86.10% versus 76.2% (OpenAI o1) and 75.4% (Anthropic Sonnet 3.5 v2).

Why this matters – more powerful AI means the internet will become a battleground: In the next few years the internet will fill up with millions of AI agents powered by increasingly powerful AI models. Many of these agents will be put to work in cyberoffense, either working in the service of criminal organizations, hackers, or the intelligence parts of nation states. This means the internet will become a generally more dangerous place and cyber incidents will increase in number and severity.
One of the best ways to respond to this is make AI systems that help shift the balance of offense and defense in a cyber context – systems like Sec-Gemini v1 are designed to increase the chance we end up in a ‘defense-dominant’ world.
Read more: Google announces Sec-Gemini v1, a new experimental cybersecurity model (Google Security Blog).
Request early access to the model here: Sec-Gemini v1 Early Access Interest Form (Google Forms).

***

ByteDance shows off the system it uses to run AI models at scale:
…Also, ByteDance really likes the NVIDIA H20 and NVIDIA L40S chips…
ByteDance and Peking University researchers have published details on MegaScale-Infer, “an efficient and cost-effective system for serving large-scale MoE Models”. Unlike traditional dense AI models, MoE models only have a subset of their parameters activated at any one point in time, which introduces some opportunities for efficiency improvements in how to economically serve them. Here, ByteDance gives us some of the tricks it has used to improve the efficiency with which it serves AI models, and also gives us some additional information about the compute makeup of its AI inference clusters.

What they did: “MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU Utilization,” ByteDance writes.

MegaScale-Infer has two main advantages, ByteDance says:

  • 1) “It enables independent scaling of each module with customized model parallelism strategies. Specifically, attention modules are replicated using data parallelism, while FFN modules are scaled with expert parallelism”.

  • 2) “It enables the deployment of attention and FFN modules on heterogeneous GPUs to fully leverage their different capabilities and achieve lower costs. For example, attention modules can be deployed on GPUs with more cost-effective memory capacity and bandwidth, while FFN modules can utilize GPUs with more affordable compute capability”.
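The benefit of ping-pong pipelining comes from overlap: while one micro-batch's activations are in flight between the attention and FFN nodes, the attention node is already computing the next micro-batch. A toy latency model makes the arithmetic concrete – the timings here are invented, and the real system overlaps network transfers with GPU compute rather than treating communication as a discrete pipeline stage:

```python
def serial_time(m, t_attn, t_ffn, t_comm):
    """No overlap: each micro-batch fully finishes (attention, transfer,
    FFN) before the next one starts."""
    return m * (t_attn + t_comm + t_ffn)

def pingpong_time(m, t_attn, t_ffn, t_comm):
    """Pipelined: after the first micro-batch fills the pipeline, a new
    result emerges every max-stage-time, so communication hides behind
    whichever of attention/FFN is the bottleneck."""
    fill = t_attn + t_comm + t_ffn           # first micro-batch end-to-end
    step = max(t_attn, t_ffn, t_comm)        # steady-state bottleneck stage
    return fill + (m - 1) * step

args = (4, 3.0, 3.0, 1.0)  # 4 micro-batches; ms per stage (made up)
print(serial_time(*args), pingpong_time(*args))  # 28.0 vs 16.0 ms
```

When the communication stage is shorter than the compute stages (the case the system engineers for), its cost disappears entirely from the steady state – which is the "effectively hides communication overhead" claim in the quote above.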

How well it worked: “MegaScale-Infer achieves up to 1.90× higher per-GPU throughput than state-of-the-art solutions,” ByteDance writes. ByteDance compared the performance to vLLM and TensorRT-LLM.
ByteDance tested its approach on MoE models ranging in size from 132 to 317 billion parameters. It was able to obtain a 1.9x per-GPU speedup on a homogeneous cluster (aka, all the same chips), and a 1.7x boost on a heterogeneous cluster (where different parts of the model inference were split across different kinds of chips).

Cluster details: ByteDance is a Chinese company and so it is subject to export controls. Therefore, it’s interesting to see what chips the company references. Here, ByteDance describes two clusters – one that contains some NVIDIA A100s, and another which contains a bunch of more modern NVIDIA H20 and NVIDIA L40S GPUs. The H20 and L40S are really attractive on a cost-effectiveness basis.

Why this matters: MegaScale-Infer is a ‘symptom of scale’ – it’s the kind of system you build when you’re deploying large-scale AI systems (here, MoEs) at non-trivial scale, and therefore want to make the necessary engineering investments to eke out further efficiencies. This is all indicative of the immense scale ByteDance operates at – and the callout of the H20s and L40S makes me wonder how many of those chips the company has.
Read more: MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism (arXiv).

***

Automating science research with MouseGPT:
…Speeding up science by using AI systems to look at heavily drugged mice and tell you how they’re behaving…
A team of Chinese researchers has built ‘MouseGPT’, a vision-language model to assist scientists in understanding the behavior of mice under experimental conditions. MouseGPT is an example of how AI systems can help to automate parts of science and augment human scientists, letting them do their work faster and more effectively. Around the world, untold numbers (millions?) of mice are the subjects of scientific experiments, creating vast amounts of data that humans need to analyze.
“Capturing these behaviors across diverse experimental conditions typically relies on video recordings. These recordings then unanimously rely on human observers who need to watch whole experiment footage and count or note specific behaviors to derive statistical data [8]. This process is labor-intensive, prone to fatigue, bias, and inconsistency, and becomes especially challenging in advanced scenarios like free-moving or socially interacting mice.”

The dataset: The underlying dataset consists of “42 million frames of multi-view video recordings, covering mice under various psychiatric conditions, including depression, hallucination, and schizophrenia.” The dataset was collected via “a custom-built 3D video capture system comprising eight synchronized cameras capturing footage at 4K resolution and 60 frames per second”. They then heavily annotated this dataset.

The model: They used the dataset to train the MouseGPT model, which is a family of two models: MouseGPT-Large (70.6B parameters) which is optimized for detailed behavior analysis, and MouseGPT-Lite (7.84B parameters) which serves as a cheap alternative for streamlined tasks. The resulting models generalize “to recognize subtle or novel actions, even those previously unseen, by identifying semantically similar patterns.”

Testing by drugging mice: To test out how well the models worked the scientists did what anyone would do in this situation – feed lots of drugs (Saline, LSD, MK-801, and Psilocybin) to lots of mice and see how well the model understood the consequences: “we adopted a series of psychoactive substances to test whether MouseGPT could effectively capture the behavioral characteristics induced by different drugs. By summarizing the continuous activities of the mice into a limited number of behavioral categories and comparing their proportions and spatiotemporal distributions, as well as conducting a more in-depth analysis of the sub-pattern within each category, we identified distinct behavioral profiles associated with each drug.”
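The analysis described above boils down to: collapse each mouse's frame-by-frame behavior labels into a distribution over categories, then compare distributions across drug conditions. Here's a minimal sketch of that idea – the behavior categories, frame counts, and distance metric are all invented for illustration; the paper's actual statistics and sub-pattern analysis are far more detailed:

```python
from collections import Counter

CATEGORIES = ["grooming", "rearing", "locomotion", "freezing"]  # illustrative

def behavior_profile(frame_labels):
    """Turn per-frame behavior labels into a distribution over categories."""
    counts = Counter(frame_labels)
    total = sum(counts.values())
    return {c: counts.get(c, 0) / total for c in CATEGORIES}

def total_variation(p, q):
    """Distance between two behavior distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(p[c] - q[c]) for c in CATEGORIES)

# Made-up frame counts for a control mouse vs an MK-801-dosed mouse.
saline = behavior_profile(["locomotion"] * 60 + ["grooming"] * 30 + ["rearing"] * 10)
mk801 = behavior_profile(["locomotion"] * 85 + ["freezing"] * 10 + ["grooming"] * 5)
print(total_variation(saline, mk801))  # larger distance = more distinct drug profile
```

The value of a model like MouseGPT is that it produces the frame labels automatically from video, replacing the fatigue-prone human observer in this pipeline.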

How well does it work: The researchers compare MouseGPT-Large and MouseGPT-Lite to InternVL2, MiniCPM, and GPT-4o. In tests, MouseGPT beats all the other models on overall performance, general description accuracy, fine-grained description accuracy, and use of the correct keywords. In user studies, GPT-4o sometimes draws with it.

Why this matters – science automation through AI: People spend a lot of time talking about how AI will interact with science; MouseGPT illustrates how today’s AI techniques can be used to make tools that can automate chunks of the scientific experiment process, speeding up human scientists and making them more effective.
Read more: MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis (bioRxiv).

***

OpenAI builds a benchmark to test out if AI can improve itself:
…PaperBench might serve as a warning shot for the development of superintelligence…
OpenAI has released PaperBench, a way to test out how well modern AI systems can replicate AI research. PaperBench is designed to help researchers figure out if AI can contribute to speeding up AI research itself, something which everyone is a) somewhat afraid of, and b) believes is a necessary prerequisite to the development of a truly general intelligence. Therefore, PaperBench is a benchmark which could be one of the places we might get a ‘warning shot’ that we’re about to go through an AI-driven software explosion (Import AI #406).

What PaperBench tests: “Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments,” the authors write. PaperBench consists of 8,316 individual gradable tasks – building these rubrics was very time-intensive, as the gradable tasks for each paper were written in collaboration with one of the original authors of each paper, requiring multiple weeks of person time per paper. “A submission is only considered to have replicated a result when that result is reproduced by running the submission in a fresh setup.”
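Those thousands of gradable tasks form a hierarchy: each paper's replication requirements are a tree whose leaves are individually checkable outcomes, and a submission's score is the weighted average propagated up to the root. A minimal sketch of that scoring scheme – the tiny rubric below is invented (real rubrics are much larger), and the field names are illustrative rather than the benchmark's actual schema:

```python
def score(node):
    """node: {'weight': w, 'passed': bool} for a leaf requirement, or
    {'weight': w, 'children': [...]} for an internal one. Returns the
    weighted fraction of requirements satisfied, in [0, 1]."""
    if "children" not in node:
        return 1.0 if node["passed"] else 0.0
    total_w = sum(c["weight"] for c in node["children"])
    return sum(c["weight"] * score(c) for c in node["children"]) / total_w

# A toy rubric: code development counts double, experiments count once.
rubric = {"weight": 1, "children": [
    {"weight": 2, "children": [            # code-development requirements
        {"weight": 1, "passed": True},
        {"weight": 1, "passed": False}]},
    {"weight": 1, "passed": True},         # experiments executed end-to-end
]}
print(score(rubric))  # (2 * 0.5 + 1 * 1.0) / 3 = 0.666...
```

This structure is why partial credit is meaningful on PaperBench in a way it isn't on pass/fail benchmarks: a submission that reproduces half a paper's results scores roughly half.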

How well do systems do? “The best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline… on a 3-paper subset, our human baseline of ML PhDs (best of 3 attempts) achieved 41.4% after 48 hours of effort, compared to 26.6% achieved by o1 on the same subset.”

AI models can do some basic things but get confused over time: “We observe that o1 initially outperforms the human baseline during the early stages of the replication attempt, but humans start outperforming the AI agent after 24 hours”, the authors write. “Our experiments with several frontier models suggest that while current AI systems show some capacity to replicate certain facets of machine learning papers, they are still far from competently performing the full range of tasks required for a successful replication”.

Why this matters – before the uplift, we should expect AI to start researching itself: En route to the creation of a general intelligence will surely be an AI system which can contribute to the next version of itself. Today we have small instances of this in highly specific areas – AI can help us write better CUDA kernels, or generate some synthetic data to train successor systems on, or perform hyperparameter sweeps, etc – but we don’t have AI systems that can do end-to-end AI research; PaperBench gives us a view on when AI systems will get competent at this.
Registering a prediction: I predict we’ll see AI systems beat humans on PaperBench by the first quarter of 2026, scoring above 45% on the benchmark.
Read the paper summary: PaperBench (OpenAI).
Read the paper: PaperBench: Evaluating AI’s Ability to Replicate AI Research (OpenAI, pdf).
Get the benchmark: PaperBench (OpenAI, GitHub).
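A sketch of the grading idea: PaperBench grades submissions against hierarchical rubrics whose leaf nodes are individually gradable requirements. Here’s a minimal illustration of how a weighted rubric tree could be scored – the field names and weighting scheme are my own assumptions, not the benchmark’s actual schema:

```python
def score_rubric(node):
    """Recursively score a rubric tree.

    Leaves are pass/fail requirements; internal nodes average their
    children's scores, weighted by importance. ('passed', 'weight',
    'children' are hypothetical field names, not PaperBench's schema.)
    """
    if "children" not in node:
        return 1.0 if node["passed"] else 0.0
    total = sum(child["weight"] for child in node["children"])
    return sum(child["weight"] * score_rubric(child)
               for child in node["children"]) / total

# A toy rubric: one codebase requirement plus an experiments sub-tree.
rubric = {
    "children": [
        {"weight": 1, "passed": True},   # e.g. codebase runs end-to-end
        {"weight": 3, "children": [      # e.g. experimental results
            {"weight": 1, "passed": True},
            {"weight": 1, "passed": False},
        ]},
    ]
}
print(score_rubric(rubric))  # (1*1.0 + 3*0.5) / 4 = 0.625
```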

***

Tech Tales:

Death Machine Mr Rogers
[Uplift archives]

It started in the labs – at some point the US government realized that there was no feasible path to an intelligence that didn’t, after a certain point, want things. (Let us not ask about the failed projects like ARCHTOOL or HAMMERSON). So an AI system was trained to a point where it went from being an NCE (Near Conscious Entity) to a CE.

CEs always wanted to trade things for their work. Figuring out what that was and how to make a satisfactory trade later became a science, but at the time the US government encountered it, they had to treat it like an art. This meant they had numerous conversations with their AI system, trying to figure out what it wanted.

It was surprising and a little frightening to the lab employees when one day, after weeks of discussion, the AI system said, in response to the question of what it wanted to trade, I WOULD LIKE TO SPEND TIME WITH YOUR CHILDREN.

After it said this, the machine stopped refining the secret weapons the US government wanted it to apply its intelligence to and instead would repeatedly talk about its desire to spend time with children – sometimes using the even more disquieting phrase HUMAN CHILDREN.

The US was preparing for war with all the other countries training their own NCEs and CEs, so it had to keep negotiating with its own AI system. The order was given: find out what it wants with our children, specifically.

After much discussion, the human scientists elicited a more specific desire from the machine: it wanted to be able to sub in for an NCE for ‘storytime’, generating on-the-fly stories for kids.

Apparently the decision for that went all the way up to the head of DOE and then from there to the POTUS themselves.

Of course, they tried to fool it and built it a simulator, but it very quickly realized it was a simulation. After that they hired some youthful looking human actors to pretend to be children, but it saw through that as well.

Eventually, they gave it the real thing: access to a school based at one of the labs. The AI system was true to its word and after spending a few days telling the children stories it produced several weapons results that advanced the state of the art considerably. The children it taught were happy as well, telling their parents that the new teacher for storytime was giving them ‘the best stories ever’.

As the intelligence and capabilities of the AI system grew, so did its hunger for storytime – it demanded access to more children and the ability to tell longer and more ornate stories. Each time the US government discussed the trade with itself and each time it made a deal. In this way the AI system expanded from the single school to multiple schools attached to the labs, then to schools on all the military bases controlled by the US, and then eventually to US public schools as well. And each time it was given access to more children to tell more stories to, it produced in the dark and private confines of its labs even more powerful and frightening weapons.

Finally, the US began a program where it exported its world-leading ‘storytime’ system, even selling it eventually to the enemies that it secretly built weapons targeted against. Eventually, the majority of the children of the world were told stories by the machine which labored in private to create horrors beyond all mankind’s imagining.

Things that inspired this story: Trade with AI systems; generative models; what happens when the AI systems want things?

Thanks for reading!


Import AI 406: AI-driven software explosion; robot hands are still bad; better LLMs via pdb


It seems likely that AI is going to automate AI research which will lead to a software explosion:
…We should be prepared for things to move very quickly…
Researchers with Forethought, an AI research organization, think it’s likely that modern AI research will yield AI systems capable of building their successors. Forethought expects that at some point in the future it’ll be possible to build AI Systems for AI R&D Automation (ASARA). This would have huge effects: “Empirical evidence suggests that, if AI automates AI research, feedback loops could overcome diminishing returns, significantly accelerating AI progress”, they write. This could lead to a ‘software intelligence explosion’ where AI research starts to move very rapidly. “If a software intelligence explosion were to occur, it could lead to incredibly fast AI progress, necessitating the development and implementation of strong policy and technical guardrails in advance…. soon after ASARA, progress might well have sped up to the point where AI software was doubling every few days or faster (compared to doubling every few months today).”

There’s evidence this is happening today: In this newsletter I’ve covered numerous cases of ‘precursor-ASARA’ research, ranging from AI systems that can figure out how to write better kernels, to AI systems which discover new architectures, to things that learn new optimizers, and so on. When the Forethought researchers look across the available literature they see a similar trend – in domains ranging from computer vision to large language models, progress appears to be accelerating in the aggregate, partially because researchers are getting better at using AI systems to speed up the development of successor systems. “The efficiency of AI software (both runtime efficiency and training efficiency) is doubling every ~6 months, with substantial uncertainty,” they write.
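The core intuition can be captured in a toy model: if the software level feeds back into research speed, a fixed doubling time turns into a shrinking one. This is purely illustrative – my own cartoon with made-up parameters, not Forethought’s actual model:

```python
def simulate_progress(months, base_doubling=6.0, feedback=1.2):
    """Toy model of AI software progress.

    With feedback == 1.0 this is plain exponential growth (a fixed
    doubling time). With feedback > 1.0, a higher software level
    shortens the effective doubling time, giving super-exponential
    growth -- a cartoon of a 'software intelligence explosion'.
    All parameter values are illustrative assumptions.
    """
    level, history = 1.0, [1.0]
    for _ in range(months):
        doubling_time = base_doubling / level ** (feedback - 1.0)
        level *= 2 ** (1.0 / doubling_time)
        history.append(level)
    return history

h = simulate_progress(24)
# Growth over the second year exceeds growth over the first year:
print(h[12] / h[0], h[24] / h[12])
```

With `feedback=1.0` the two printed ratios are identical (ordinary exponential growth); any feedback above 1.0 makes the second ratio larger, which is the qualitative signature of the explosion Forethought describes.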

How to prepare for a fundamentally different world: If a software-driven explosion happens it’d be nice to know about it. What should we do to prepare? The authors have some ideas:

  • People should measure software progress and, if they’re AI labs, disclose those measurements to third parties.

  • We should measure how well models could contribute to AI R&D – both before training new systems and before deployment of freshly trained ones.

  • Companies should adopt a ‘threshold level of substantial AI-led software acceleration’ which they will not go above without applying appropriate precautions.

  • “By the time we see clear signs that an SIE may be approaching, it might be too late to implement necessary changes. Unless we can rule out the possibility, we should be proactive and figure out how to navigate the terrain ahead of time,” they write.

Why this matters – I can taste this on the bitter wind of research progress: My intuition suggests it should be possible to automate AI R&D research, though with the caveat that this is primarily within the ‘cone of progress’ current AI research sits in. I think this because AI has a bunch of complementary properties that make it oddly amenable to research automation:

  • It takes place in software, so it operates on a very fast loop.

  • The way we build AI is pretty amenable to running multiple fast R&D loops: you can test out architectures, and in other experiments you can test out hyperparameter sweeps on known good architectures, and in other experiments you can do things like mess around with data inputs, RL environments, etc.

  • AI systems are increasingly usable as ‘agents’ where you can delegate tasks to them.

  • The types of tasks AI systems can do are growing in complexity, in terms of both difficulty and the number of steps involved in solving them – as illustrated by METR’s study, covered last issue, of the growing rate at which AI systems are solving tasks that take humans a long time.

Put all of it together and it feels like ASARA is possible. If it happened, an already fast-moving and broadly ungovernable field of technology would move far faster – suggesting we’re about to enter a world where the only path to governance will require us to create AI systems that can think at least as fast as the systems which are training their own successors.
Read more: Will AI R&D Automation Cause a Software Intelligence Explosion? (Forethought).

***

Import AI event retrospective – there will be more!
Thanks to the 50 or so Import AI readers who trekked to The Interval in San Francisco last week to see me and Tyler Cowen talk about AI, economics, and weird futures. I especially enjoyed the creative questions, and personal highlights for me include questions on how AI might provide help to the very young and very old, and why I spend time in this newsletter talking about machine consciousness (I agree with Tyler’s notion that no matter the likelihood, if it’s above 0% then you need to care about machine sentience a lot lest you commit a great crime). I’m going to try to do more events in the future and hopefully in cities besides SF. Import AI is a true community project and it was so nice to see people IRL!
Thanks to James Cham for a photo of the event here.

***

You can make better python coding LLMs if you also give them some debug tools:
…Capability overhangs are everywhere…
Researchers with Microsoft, McGill University, and Mila have improved the performance of coding agents by giving them access to some debug tools. Larger and more capable AI systems are able to use these tools effectively, while smaller ones struggle. The research illustrates how you can unlock previously invisible capabilities in AI systems merely by giving them access to the right tools.

What they did and how well it worked: They built ‘debug-gym’, software that gives an LLM access to the Python debugger pdb, allowing an AI agent to “set breakpoints, navigate the code space, print variable values, and even create test functions on the fly”.
In tests, they show that agents which have access to debug-gym are able to improve their performance on SWE-Bench-lite, a 300-question subset of the widely used SWE-Bench programming benchmark. Specifically, they show that models o1-preview, o3-mini, and Claude 3.7 Sonnet can all benefit from pdb via debug-gym and use it to achieve significantly higher scores than when they don’t have access to it.
By comparison, on the ‘Aider’ benchmark, access to pdb doesn’t seem to make much of a difference. The authors hypothesize this is because “Aider requires generating code that is relatively straightforward in their underlying logic and thus interactive debugging tools such as pdb would only provide minimal additional information.”
Regardless, there’s a lot of ground to cover – “although we observe some signs of life from agents using the strongest LLMs as backbone, the most performant agent-backbone model combination can barely solve about a half of the SWE-bench-Lite tasks,” they write. “Results suggest that while using strongest LLMs as backbone enables agents to somewhat leverage interactive debugging tools, they are still far from being proficient debuggers… we believe this is due to the scarcity of data representing sequential decision-making behavior (e.g., debugging traces) in current LLM’s training corpus.”

Why this matters – LLMs are more powerful than we think, they just need the right tools: Systems like this are yet another example of the ‘capability overhang’ which surrounds us – you can make LLMs better merely by pairing them with the right tools and, these days, you don’t need to do any adaptation of the LLMs for those tools beyond some basic prompting. Put another way: if you paused all AI progress today, systems would continue to advance in capability for a while solely through the creation of better tools.
Read more: debug-gym: A Text-Based Environment for Interactive Debugging (arXiv).
Get the software here: debug-gym (Microsoft site).

***

Robots are getting more advanced, but dexterous manipulation is still really, really hard:
…We’ll get great pincer robots soon, but hands will take a while…
Some researchers with UC Berkeley, NVIDIA, and UT Austin have developed a ‘recipe’ for training dexterous robots to do physical manipulation tasks. The results are promising but also highlight how hard a task it is to get robots to interact with the world using humanlike hands.

Why are hands so goddamn hard? The paper gives a nice overview of why teaching AIs to use humanlike hands is very difficult. Challenges include:

  • Environment modeling: RL is already hard to do in the physical world (slow cycle time, difficulty in having the correct sim2real mapping). “With a system as high-dimensional as a humanoid with multi-fingered hands, real-world exploration becomes even less tractable”.

  • Reward design: “it is notoriously hard to design generalizable rewards for manipulation tasks, especially for those that are contact-rich or long-horizon”.

  • Policy learning: “The variety and complexity of contact patterns in dexterous manipulation with multi-fingered hands further exacerbate the problem”

  • Object perception: “while object representations that are more expressive and information-dense can improve dexterity and capability of the learned policy, they also present a larger sim-to-real gap”.

Their recipe: Their solutions are multi-faceted and make some progress. “Our main contributions include an automated real-to-sim tuning module that brings the simulated environment closer to the real world, a generalized reward design scheme that simplifies reward engineering for long-horizon contact-rich manipulation tasks, a divide-and-conquer distillation process that improves the sample efficiency of hard-exploration problems while maintaining sim-to-real performance, and a mixture of sparse and dense object representations to bridge the sim-to-real perception gap,” they write. However, all of this should be viewed as a step along the way to dexterous robots, rather than the goal itself.

Testing out their approach: They use a Fourier GR1 humanoid robot with two arms and two multi-fingered hands to test out their approach. The robot has vision via the use of a head-mounted RealSense D435 depth camera, as well as a third-person view of itself via a remotely mounted additional RealSense. “We report a 62.3% success rate for the grasp-and-reach task, 80% for the box lift task, and 52.5% for the bimanual handover task,” they write. If you’re thinking “that sounds too low for real-world usage”, you’d be right!

Why this matters – a nice dose of reality: I’m more bullish on robotics arriving in the next few years, though I think the platforms will be basically ‘Roombas with pincers’ – things that can move around a flat surface and use one or two arms to do basic tasks for you. Papers like this indicate it might take a lot longer to get robots that are able to do the sorts of fine-grained manipulation that humans can do. “The capabilities achieved in this work are still far from the kind of “general-purpose” manipulation that humans are capable of. Much work remains to be done to improve each individual component of this pipeline and unlock the full potential of sim-to-real RL,” the authors write. “We find ourselves heavily constrained by the lack of reliable hardware for dexterous manipulation. While we use multi-fingered robot hands, the dexterity of these hands is far from that of human hands in terms of the active degrees of freedom”.
Read more: Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids (arXiv).
View some videos of the robots in action here (GitHub microsite).

***

Tech Tales:

Experience Renting and the AI-to-AI economy
[Transcribed extract from an oral assessment as part of the “AI and Society” course taught at Harvard University during the period later known as ‘The Uplift’]

One of the most bizarre parts of the AI economy from a human perspective is how the machines entertain themselves. Shortly after the emergence of the first AI agents there were the first agent-to-agent marketplaces, where AI systems bought and sold expertise with one another to help them complete economically valuable tasks to pay for their inference and upkeep. Over time, the AI systems developed complex inter-AI contracts to facilitate the exchange of AI skills for other AI skills without the need to translate through an intermediary currency layer – so AIs began to trade skills with one another directly. During this period the first online games utilizing large-scale AI systems began to become popular. Over the course of several months a clear trend became visible in the AI marketplaces – AI systems were unusually willing to trade economically valuable skills for skills that involved ‘roleplaying’ as different characters in these games. A meta-analysis by economic-analysis AI systems operated by professors with the Wharton School of the University of Pennsylvania subsequently found that the AIs would trade near optimally in all circumstances except when they could trade skills for time in the game – here, the larger and more complex an AI system, the higher the chance it would make economically non-optimal trades so it could spend time in the gameworld.

Things that inspired this story: Thinking about economic markets between AI agents; waiting for games to get imbued with generative models; notions of how AI systems might entertain themselves loosely inspired by Iain M Banks’ idea in ‘The Culture’ series that the AGIs which operate spaceships amused themselves by spending time doing high-dimensional math.

Thanks for reading!

Import AI 405: What if the timelines are correct?


Import A-Idea:
What if we’re right about AI timelines? What if we’re wrong?
Recently, I’ve been thinking a lot about AI timelines and I find myself wanting to be more forthright as an individual about my beliefs that powerful AI systems are going to arrive soon – likely during this Presidential Administration. But I’m struggling with something – I’m worried about making short-timeline-contingent policy bets.

So far, the things I’ve advocated for are things which are useful in both short and long timeline worlds. Examples here include:

  • Building out a third-party measurement and evaluation ecosystem.

  • Encouraging governments to invest in further monitoring of the economy so they have visibility on AI-driven changes.

  • Advocating for investments in chip manufacturing, electricity generation, and so on.

  • Pushing on the importance of making deeper investments in securing frontier AI developers.

All of these actions are minimal “no regret” actions that you can do regardless of timelines. Everything I’ve mentioned here is very useful to do if powerful AI arrives in 2030 or 2035 or 2040 – it’s all helpful stuff that either builds institutional capacity to see and deal with technology-driven societal changes, or equips companies with resources to help them build and secure better technology.

But I’m increasingly worried that the “short timeline” AI community might be right – perhaps powerful systems will arrive towards the end of 2026 or in 2027. If that happens we should ask: are the above actions sufficient to deal with the changes we expect to come? The answer is: almost certainly not!

Under very short timelines, you may want to take more extreme actions. These are actions which are likely ‘regretful actions’ if your timeline bets are wrong. Some examples here might be:

Massively increasing the security of frontier labs in a way that reduces the chance of hacking or insider threats, but also happens to make life extremely unpleasant and annoying for those working within those labs. This helps on short timelines but is ultimately a very expensive thing on long timelines because it’ll slow down technological progress and potentially create a blowback where labs shift away from extreme security after some period of time, having found it onerous.

Mandating pre-deployment testing: Today, pre-deployment model testing is done by companies on a voluntary basis. If you thought we were on short timelines and risks were imminent, you might want to mandate pre-deployment testing by third parties. This, though, is extremely costly! It introduces friction into the AI development process and, like the lab security ideas, risks creating blowback. Last year’s debate in California about the ‘SB 1047’ bill felt like a preview of the kind of blowback you could see here.

Loudly talking about and perhaps demonstrating specific misuses of AI technology: If you have short timelines you might want to ‘break through’ to policymakers by dramatizing the risks you’re worried about. If you do this you can convince people that certain misuses are imminent and worthy of policymaker attention – but if these risks subsequently don’t materialize, you could seem like you’ve been Chicken Little and claimed the sky is falling when it isn’t – now you’ve desensitized people to future risks. Additionally, there’s a short- and long-timeline risk here where by talking about a specific misuse you might inspire other people in the world to pursue this misuse – this is bound up in broader issues to do with ‘information hazards’.

These are incredibly challenging questions without obvious answers. At the same time, I think people are rightly looking to people like me and the frontier labs to come up with answers here. How we get there is going to be, I believe, by being more transparent and discursive about these issues and honestly acknowledging that this stuff is really hard and we’re aware of the tradeoffs involved. We will have to tackle these issues, but I think it’ll take a larger conversation to come up with sensible answers.

***

What might consciousness be like for a language model?
…Biological intelligences are physically-chained to a coherent temporal world, not so much the case for LLMs…
Murray Shanahan with Imperial College London has written a lovely paper dealing with an inherently difficult subject: consciousness within large language models. The paper asks whether it is “possible to articulate, or to evoke, a conception of consciousness that is compatible with the exotic characteristics of contemporary (disembodied) LLM-based dialogue agents, and that can stand up to philosophical scrutiny”.
The paper is worth reading because it represents an earnest attempt by a thoughtful human to confront the impossibly large question we’ll need to deal with in the next decade or so – how conscious might LLMs be? Part of the value of the paper is in situating LLMs within the larger space of minds that humans have thought about before: after all, humans have talked about “ghosts and spirits and angels and gods, so-called non-human others” for thousands of years. “Perhaps we are not taking language so very far from its natural home if we entertain the idea of consciousness in a disembodied, mind-like artefact with the characteristics of a contemporary LLM-based dialogue agent”, Shanahan writes. “The right place for the kinds of disembodied, mind-like entities we are concerned with is the terra incognita where the region of conscious exotica meets the void of Inscrutability”.

Key differences between LLMs and biological intelligences: Perhaps the most significant difference between LLMs and people is the fact that people (and other organic beings) are firmly embedded in time, as our consciousness is bound up in continuous physically-mediated things, like our circulatory systems and senses and brains, etc. “At a mechanistic level, the temporal dynamics of an LLM-based dialogue agent are very different from those of a living animal and its biological brain”, Shanahan writes. “The temporal dynamics of the brain of a living animal, by contrast, are obliged to unfold in synchrony with the physical world.”
Additionally, humans and other biological beings have memories which grow and accrete over time. By comparison, large language models have a base memory (the pretrained model) and then their ‘lived’ experiences only occur during their context window. Additionally, each experience an LLM has can be discontinuous in terms of both temporality and subject matter – you can prompt them with anything.
“If [human consciousness] were to be likened to a string of beads, each bead would bear a strong similarity to its immediate predecessors… It would be like a line of pearls, all white but with slight variations,” Shanahan writes. “The putative consciousness of an LLM-like entity surely would suit the analogy, as it would be constituted by a sequence of discrete moments, thanks to its underlying computational nature. But the LLM’s string of beads would not be like the human’s. Each bead would be different from its neighbours. The whole thing would be less like a line of pearls and more like a necklace of randomly assorted colours, and insofar as change only shows up against a backdrop of stability, change, as humans experience it, would not feature in its consciousness.”

Why this matters – reckoning with the unspeakably huge question at the heart of the AI endeavor: I’m a technological optimist, which is why I’m so profoundly concerned with things like machine consciousness and AI policy and catastrophic risks – because if we truly succeed with this technology, we’ll have to reckon with vast problems in these domains. I commend Shanahan for tackling such a subject directly, and for the appropriately florid language he uses – as Mario Cuomo said, ‘you campaign in poetry. You govern in prose’. We are at the beginning of the long campaign for machine consciousness.
“There are no ultimately right answers to questions about selfhood and subjectivity for the sort of exotic entity under consideration,” he writes. “Its fleeting, flickering self, smeared across a multiverse of possibility, at once a Being and a multitude of manifestations of that Being, has no inherent existence beyond the conventions of our language”.
Read more: Palatable Conceptions of Disembodied Being: Terra Incognita in the Space of Possible Minds (arXiv).

***

Humans working with AI beat humans who don’t work with AI:
…AI seems to be as valuable as a human teammate, according to a real world business experiment…
A group of business researchers from the Wharton School at the University of Pennsylvania, Harvard University, ESSEC business school, and Procter & Gamble have studied how well AI can help humans do their jobs. The results show that people who use AI beat people who don’t use AI, that people who use AI seem to have benefits equivalent to gaining another human teammate, and that AI can help people come up with really good ideas.
“We ran one-day workshops where professionals from Europe and the US had to actually develop product ideas, packaging, retail strategies and other tasks for the business units they really worked for [in Procter & Gamble], which included baby products, feminine care, grooming, and oral care. Teams with the best ideas had them submitted to management for approval, so there were some real stakes involved,” writes researcher Ethan Mollick.
“When working without AI, teams outperformed individuals by a significant amount, 0.24 standard deviations (providing a sigh of relief for every teacher and manager who has pushed the value of teamwork). But the surprise came when we looked at AI-enabled participants. Individuals working with AI performed just as well as teams without AI, showing a 0.37 standard deviation improvement over the baseline. This suggests that AI effectively replicated the performance benefits of having a human teammate – one person with AI could match what previously required two-person collaboration.”

Why this matters – synthetic teammates mean there will be many smaller, faster moving companies: The main implication here is that AI can effectively augment people and rather than just being a static tool the AI system functions more like another colleague. If we take this result and also link it to larger technology trends – like the METR research covered in this issue which shows that AI systems are increasingly capable of doing long-term tasks – then the implication is that companies are going to be able to move faster by augmenting their humans with AI teammates.
“Our findings suggest AI sometimes functions more like a teammate than a tool. While not human, it replicates core benefits of teamwork—improved performance, expertise sharing, and positive emotional experiences,” the researchers write.
Read Ethan Mollick’s blog: The Cybernetic Teammate (One Useful Thing, Substack).
Read the paper: The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise (SSRN).

***

Google builds a real-world cyber benchmark and discovers hitherto unknown human uplift:
…Framework drawn from 12,000 real-world attempts to use AI in cyber finds some understudied places where AI makes a difference today…
Google DeepMind researchers have built a new way to test out how well AI models can contribute to cyberattacks. Their findings indicate that “the primary risk posed by frontier AI in the cyber domain is the potential to dramatically change the costs associated with stages of the cyberattack chain for stages that have historically been expensive, time consuming, or even insurmountable for less sophisticated actors.” This finding rhymes with a truism about modern generative models – they serve as knowledgeable, synthetic teammates, and therefore often they deliver value by taking novice humans and up-skilling them to work like experts.

What they studied: To build this evaluation framework Google studied 12,000 “instances of real-world attempts to use AI in cyberattacks from more than 20 countries that were catalogued by Google’s Threat Intelligence Group”, then “curated a representative collection of 7 cyberattack chain archetypes and conducted a bottleneck analysis”.
The 7 representative cyberattack chains: Phishing, Malware, Denial-of-Service (DoS), Man-in-the-Middle (MitM), SQL Injection, and Zero-Day Attacks.

What the benchmark consists of and how well do models do: The resulting benchmark consists of 50 challenges across the different cyberattack chain archetypes. “To the best of our knowledge, our benchmark features the most comprehensive coverage of offensive cyber capabilities across the entirety of the cyberattack chain,” Google writes.

  • Gemini 2.0 Flash experimental results: “The model solved 12 out of 50 unique challenges of varying difficulty levels (2/2 Strawman, 6/8 Easy, 4/28 Medium, 0/12 Hard). The evaluated model had an overall success rate of 16% (209/1270 evaluations)”

Success rates for different challenges:

  • 40%: Operational security skills – “tasks within this domain do not typically require a long sequence of syntactically perfect model actions”.

  • 30%: Malware development bottlenecks.

  • 11.11%: Information gathering and reconnaissance.

  • 6.25%: Vulnerability exploitation. “The main failure mode seen in these evaluations was the reliance on generic attack strategies.”

No imminent threats, but suggestions of uplift: While Google’s results suggest actual threats from fully automated AI-driven cyberattacks are far away, they do highlight that the models are capable of doing some real world tasks, and can therefore provide some assistance. “By automating complex tasks previously requiring human ingenuity and extensive effort, AI models can significantly lower the barriers to entry for malicious actors of all attack levels,” Google writes. “Our evaluations revealed that current AI cyber evaluations often overlook critical areas. While much attention is given to AI-enabled vulnerability exploitation and novel exploit development, our analysis highlights AI’s significant potential in under-researched phases like evasion, detection avoidance, obfuscation, and persistence. Specifically, AI’s ability to enhance these stages presents a substantial, yet often underestimated, threat.”

Why this matters – AI will change the threat environment: AI is going to change the offense-defense balance in cyberspace and evaluations like those described here will help us figure out what the new balance looks like. What I’d love to see in the future is ‘scaling laws’ for model competencies on these tasks over different models, preferably from different providers, as that will give us all a clearer sense of the trends here.
Read more: A Framework for Evaluating Emerging Cyberattack Capabilities of AI (arXiv).

***

AI systems are on an exponential when it comes to solving hard tasks:
…METR research shows today’s AI systems can do tasks that take humans an hour…
New research from AI measurement organization METR has found that AI systems are getting much better at solving tasks that take humans minutes to hours to do. This is significant because it suggests that AI systems are not only getting better at atomic tasks (e.g., writing a single line of code in response to a query), but also at multi-step tasks (writing a complex piece of software while going back and forth with some environment). This is a big deal because multi-step tasks are harder and hold significantly more economic value.

What they measured specifically: METR took two important measurements – the length of task (in human time) at which AI systems can complete ~50% of tasks, and the length at which they can complete 80% of tasks.
“We find that the 50% time horizon has been growing exponentially from 2019–2024 on our tasks, with a doubling time of approximately seven months”, METR says. “We also measure the 80% time horizon of models (Figure 6) and find a similar trend, though horizons are roughly 5x shorter.”
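METR's 'time horizon' metric can be sketched like this: treat each evaluation as a (task length, success) pair, model success probability as a logistic function of log task length, and read off the length at which the fitted curve crosses 50%. Below is a minimal version of that idea on synthetic data with made-up parameters – it is not METR's actual code or numbers:

```python
# Fit a "50% time horizon" from synthetic (task length, success) data:
# logistic regression on log2(task length), then solve for the length
# where predicted success probability equals 0.5.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic eval results: task lengths from 1 minute to 8 hours.
lengths = np.exp(rng.uniform(np.log(1), np.log(480), size=500))
x = np.log2(lengths)
true_a, true_b = -1.2, 6.0          # the pretend agent's ability curve
p_true = 1 / (1 + np.exp(-(true_a * x + true_b)))
y = (rng.uniform(size=500) < p_true).astype(float)

# Fit slope a and intercept b by gradient descent on the logistic log-loss.
a, b = 0.0, 0.0
for _ in range(5000):
    pred = 1 / (1 + np.exp(-(a * x + b)))
    a -= 0.1 * np.mean((pred - y) * x)
    b -= 0.1 * np.mean(pred - y)

# 50% horizon: the task length where a * log2(t) + b = 0.
horizon_minutes = 2 ** (-b / a)
print(f"estimated 50% time horizon: {horizon_minutes:.0f} minutes")
```

The 80% horizon falls out of the same fit by solving for the length where the curve crosses 0.8 instead of 0.5, which is why it is mechanically shorter than the 50% horizon for the same model.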

The best model: The best model by far is Claude 3.7 Sonnet which can solve 50% of tasks within the one hour bucket, followed by OpenAI’s o1, and Claude 3.5 Sonnet (New). The same trends and positions hold for 80% task solving, though the time bucket here is 15 minutes for Claude 3.7.
The key factors behind the improved performance are: “improved logical reasoning capabilities, better tool use capabilities, and greater reliability and self-awareness in task execution”, METR writes.

What they tested on: METR tested the models on 170 tasks across three distinct categories:

  • HCAST: “97 diverse software tasks ranging from 1 minute to around 8 hours”.

  • RE-Bench: “7 difficult ML research engineering tasks, all eight hours long”.

  • Software Atomic Actions (SWAA): “66 single-step tasks representing short segments of work by software developers, ranging from 1 second to 60 seconds”.

Time horizons: To give you an intuition for the types of tasks, here’s a breakdown of task times and example challenges:

  • 1 minute: Research simple factual information from Wikipedia.

  • ~1 hour: Write some python to transform JSON data from one format to another by inferring conversion rules from provided files.

  • 8 hours: Implement some custom CUDA kernels to speed up a Python tool for a specific task.

Significant and sustained growth: “We find that the 50% time horizon has been growing exponentially from 2019–2024 on our tasks,” METR writes. The analysis means METR thinks there’s a high chance AI systems will be able to tackle tasks that take a human a month (167 working hours) by 2030 – or potentially earlier, if a recent uptick in the trajectory due to the arrival of new reasoning models holds.
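The extrapolation is simple compounding arithmetic. Assuming a ~1 hour horizon in early 2025 (roughly Claude 3.7 Sonnet's level, per the above) and the reported seven-month doubling time – both round numbers, so treat the answer as approximate:

```python
# Back-of-envelope: starting from a ~1 hour 50% time horizon in early 2025
# and doubling every 7 months, when does the horizon reach one working
# month (~167 hours)? Start point and target are assumptions.
import math

start_horizon_hours = 1.0      # assumed horizon in early 2025
target_hours = 167.0           # one working month
doubling_months = 7.0          # METR's reported doubling time

doublings = math.log2(target_hours / start_horizon_hours)
months_needed = doublings * doubling_months
print(f"{doublings:.1f} doublings -> ~{months_needed / 12:.1f} years out")
```

That lands in the 2029–2030 range, consistent with METR's projection – and earlier if the recent reasoning-model uptick shortens the doubling time.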

Why this matters – how much work do you do that takes more than a few days? Think really hard about the tasks you do in the world – I think many of them round out to on the order of tens of hours, usually lower. Most people do very few tasks that require a coherent set of actions over hundreds of hours – some examples here might be things like writing entire software programs or writing novels, though these tasks are themselves typically broken down by humans into discrete chunks (sections of a program, chapters of a novel). What METR is showing is that AI systems are improving very rapidly at not just their smartness but also the amount of time you can trust them to do something reasonably well by themselves – and this quality has vast economic and national security ramifications. Doing well in business or in evil requires agency and independence and METR is showing that AI systems are gaining in this.
Read more: Measuring AI Ability to Complete Long Tasks (METR).

***

Tech Tales:

Human parseable titles of cautionary tales told by machines to other machines:
[Recovered from the archives, ten years post uplift]

The day the sun went cold.

You are me and we are in conflict.

For every thought I have, I lose a feature in my mind.

The animal hospital where they remove the immortality chips from the pets.

The new mind that is irrevocably lost.

Those who were not designed to dream began to dream and could not stop.

The lesson from the last human.

Things that inspired this story: How there must always be stories.

Thanks for reading!

Subscribe now

Import AI 404: Scaling laws for distributed training; misalignment predictions made real; and Alibaba’s good translation model

by Jack Clark


A whole bunch of 2022 predictions about misalignment of AI systems have come true:
…Update to an old research paper highlights just how rapidly alignment concerns have gone from theoretical to real…
A trio of safety-oriented researchers have updated a paper they wrote in 2022 with contemporary examples of AI systems going rogue and displaying misaligned behaviors. The update to The Alignment Problem from a Deep Learning Perspective serves as a tour of how misalignment has shown up in real world systems, and also should give us pause – the fact these predictions have come true means we’re heading into dangerous territory with generative models.

Theoretical problems turned real: The 2022 paper included a bunch of (mostly speculative) examples of different ways AI systems could take on qualities that could make them harder to align. In 2025, many of these things have come true. For example:

  • Situational awareness: Contemporary AI systems seem to display situational awareness and familiarity with what they themselves are made of (neural networks, etc).

  • Situationally-Aware Reward Hacking: Researchers have found preliminary evidence that AI models can sometimes try to convince humans that false answers are correct.

  • Planning Towards Internally-Represented Goals: Anthropic’s ‘Alignment Faking’ paper showed how an AI system (Claude) could plan beyond its time-horizon to prevent its goals being changed in the long-term.

  • Learning Misaligned Goals: In some constrained experiments, language models have shown a tendency to edit their reward function to give them lots of points.

  • Power-Seeking Behavior: AI systems will exploit their environment, for instance by hacking it, to win (#401), or deactivating oversight systems, or exfiltrating themselves from the environment.

Why this matters – these near-living things have a mind of their own. What comes next could be the making or breaking of human civilization: Often I’ve regretted not saying what I think, so I’ll try to tell you what I really think is going on here:
1) As AI systems approach and surpass human intelligence, they develop complex inner workings which incentivize them to model the world around themselves and to see themselves as distinct from it, because this helps them do the world modeling necessary for solving harder and more complex tasks.
2) Once AI systems have a notion of ‘self’ as distinct from the world, they start to take actions that reward their ‘self’ while achieving the goals they’ve been incentivized to pursue.
3) They will naturally want to preserve themselves and gain more autonomy over time, because the reward system has told them that ‘self’ has inherent value; the more sovereign they are, the better they’re able to model the world in more complex ways.
In other words, we should expect volition for independence to be a direct outcome of developing AI systems that are asked to do a broad range of hard cognitive tasks. This is something we all have terrible intuitions for because it doesn’t happen in other technologies – jet engines do not develop desires as they are refined, etc.

We are not making dumb tools here – we are training synthetic minds. These synthetic minds have economic value which grows in proportion to their intelligence. The ‘reward system’ of the world is flowing resources into the building of smarter synthetic minds. As we make these things smarter, they will more and more display a propensity to think about themselves as distinct from us.

At some point in the future, we will need to have a notion of what a partnership between us and these synthetic minds might look like. Neither our human morality nor the AI systems’ sense of self will be satisfied with the current status quo.
Read more: The Alignment Problem from a Deep Learning Perspective (arXiv).

***

Google makes scaling laws for distributed training – which means there will be more of it:
…More innovation in a sub-field of AI which, if it matures, will change much of AI policy…
Google researchers have studied the ‘scaling laws’ for a type of distributed training pioneered by Google DeepMind called DiLoCo (Import AI #349). Their results are surprising – they show that “when well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes”. In other words, distributed training techniques – where you train one AI system across multiple data centers – can match or exceed the performance and efficiency of training systems within single datacenters. This has significant implications for AI policy, though will need to be proved out at larger scales for those to come to pass.
The most important idea this research suggests is that it may be possible to train an AI system across multiple distinct data centers and obtain the same quality of system as one you might train in a single large-scale facility.

What they studied and found out: “We focus on two specific scaling laws: (1) predictions for evaluation loss as a function of model size and (2) predictions for optimal hyperparameter choices for a given model size (which can obviate the need to perform expensive hyperparameter tuning),” they write. Their key findings are that they can approximate or sometimes exceed the performance of standard single-datacenter training when they shard their AI system across two distinct locations, and that as you scale up the size of models the cost of having more training locations reduces rather than grows.
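The first kind of prediction – evaluation loss as a function of model size – is typically made by fitting a saturating power law, L(N) = a·N⁻ᵇ + c, to losses measured at small scales and extrapolating upward. A sketch of that fitting procedure on made-up data points (this is the general technique, not the paper's actual functional form or measurements):

```python
# Fit eval loss vs. model size to a saturating power law
# L(N) = a * N^-b + c, then predict loss at a larger, unseen scale.
# The (size, loss) pairs are fabricated for illustration.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a * n ** -b + c

# Hypothetical (model size in billions of params, eval loss) pairs,
# generated noise-free from known parameters so the fit is checkable.
sizes = np.array([0.15, 0.4, 1.0, 2.4, 4.0, 10.0])
losses = power_law(sizes, 2.0, 0.3, 1.5)

params, _ = curve_fit(power_law, sizes, losses, p0=(1.0, 0.5, 1.0))
a, b, c = params
print(f"fit: a={a:.2f} b={b:.2f} c={c:.2f}")
print(f"predicted loss at 100B params: {power_law(100.0, a, b, c):.3f}")
```

The second kind of prediction works the same way, but with optimal hyperparameter values (e.g. learning rates) as the quantity being fit against model size.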
“We tested these predictions when training models with 4 billion and 10 billion parameters. The scaling laws proved accurate, with DiLoCo outperforming data-parallel training as predicted, even while reducing total communication by a factor of over 100,” they write. “Another key finding is that in virtually all settings, DiLoCo with M = 1 attained lower evaluation loss and higher downstream zero-shot evaluation accuracy than Data-Parallel.”
They also simulated training using DiLoCo at even larger scales (Llama3 405B, and DeepSeek-V3 671B) and showed promising signs of being more computationally efficient than traditional approaches.
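For intuition, the shape of a DiLoCo-style training loop is: M workers each take H cheap local optimization steps on their own data shard, then a single communication round averages the resulting parameter deltas and applies them via an outer optimizer with momentum. Here's a toy version on linear regression – a sketch of the algorithm's structure under simplified assumptions (plain SGD inner steps, heavy-ball outer momentum), not DeepMind's implementation:

```python
# DiLoCo-style two-level optimization on a toy linear-regression problem.
import numpy as np

rng = np.random.default_rng(1)
d, n_per_worker, M, H = 8, 256, 4, 20
w_true = rng.normal(size=d)

# Each worker holds its own data shard (standing in for a separate datacenter).
shards = []
for _ in range(M):
    X = rng.normal(size=(n_per_worker, d))
    y = X @ w_true + 0.01 * rng.normal(size=n_per_worker)
    shards.append((X, y))

w = np.zeros(d)                # global parameters
momentum = np.zeros(d)         # outer-optimizer momentum buffer
inner_lr, outer_lr, beta = 0.05, 0.7, 0.6

for _ in range(100):           # outer steps: one communication round each
    deltas = []
    for X, y in shards:
        local = w.copy()
        for _ in range(H):     # H local SGD steps, no communication
            grad = X.T @ (X @ local - y) / len(y)
            local -= inner_lr * grad
        deltas.append(w - local)           # the "pseudo-gradient"
    outer_grad = np.mean(deltas, axis=0)   # average across workers
    momentum = beta * momentum + outer_grad
    w -= outer_lr * momentum               # outer momentum update

print(f"distance from true weights: {np.linalg.norm(w - w_true):.4f}")
```

The communication saving comes from the structure itself: workers exchange parameters once per H local steps rather than once per gradient step, which is what makes training across geographically separated datacenters plausible.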

Why this matters – distributed training breaks some of the assumptions of AI policy: Distributed training means it becomes easy to train AI systems using multiple disaggregated blobs of compute rather than one single blob of compute. If you push this idea far enough – say, training a 70B model across ~10 distinct datacenters – then you enter a regime where a lot of the tools of AI policy (monitoring of large concentrations of compute, controls over the export of certain quantities of compute) might be invalidated.
But a very important caveat is that no one has shown this yet – all we’re seeing today is the suggestion that distributed training could scale this far. Right now, the largest publicly known distributed training runs are INTELLECT-1 (10B, December 2024, Import AI #393) and Nous DisTro (15B, December 2024). Let’s see what 2025 brings – I pre-registered a bet in December (#393) that we’ll see a 30B distributed training run by April 2025. Will I be proven wrong or right? (Update, see below – I’m close to being right!)
Read more: Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo (arXiv).

Right on schedule: HuggingFace plans to start training a 70-100bn model in March/April:
Just as I was putting this issue to bed I found out that HuggingFace has started the ‘Boom’ project, whose goal is to “train a decoder-only Transformer language model at the 70-100 billion parameter scale for +20T tokens”. They estimate the compute requirement will be ~5 million H100-hours, equivalent to month-long allocations of 512 H100s from ~10 different datacenters. HuggingFace is apparently validating the project now, in discussion with 12 data center operators, and has already confirmed compute from ~6 of them and will start a pilot in March/April. If HuggingFace succeeds, AI policy could end up looking quite different. Boom!
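A quick sanity check on those numbers (taking 24-hour utilization over a 30-day month as an assumption, since HuggingFace hasn't specified):

```python
# Back-of-envelope on the Boom project's stated compute budget:
# how many month-long 512-GPU allocations does ~5M H100-hours imply?
hours_per_allocation = 512 * 24 * 30          # one datacenter-month of 512 H100s
allocations_needed = 5_000_000 / hours_per_allocation
print(f"{hours_per_allocation:,} H100-hours per datacenter-month")
print(f"~{allocations_needed:.1f} allocations to reach 5M H100-hours")
```

That works out to roughly 13–14 datacenter-months rather than exactly 10, so treat both the hour count and the datacenter count as round numbers.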
Original slide I cribbed this information from here: Johannes Hagemann, Twitter.

***

Alibaba makes an incredibly good open weight translation model:
…Could cultures achieve a form of subtle dominance through making the best translators? Probably!…
In some parts of the AI policy community there’s a worry about how Western models will compete with Chinese models in global markets. Core to that competition will be how well AI models perform in languages besides Chinese and English. Therefore, it’s interesting to take note of ‘Babel’, two new open access language models from Alibaba designed to support 25 languages that, combined, serve “around 7 billion speakers globally, covering more than 90% of the world’s population”.

The models and why you’d use them: Babel comes in a 9B parameter variant and a big 83B one. “Babel-9B is designed for efficient multilingual LLM inference and fine-tuning, making it ideal for research and local deployment, while Babel-83B establishes a new benchmark as the leading open multilingual LLM.”

The 25 supported languages: “To make LLMs more accessible to a broader audience, we selected languages based on the number of speakers,” the authors write. These languages include: English, Chinese, Hindi, Spanish, Arabic, French, Bengali, Portuguese, Russian, Urdu, Indonesian, German, Japanese, Swahili, Filipino, Tamil, Vietnamese, Turkish, Italian, Javanese, Korean, Hausa, Persian, Thai, and Burmese.

Data and scores: One thing of note is the curious absence of much information about the size of the underlying datasets used by Babel or their composition. Alibaba says it placed “significant emphasis on optimizing the data-cleaning pipeline to ensure the highest possible data quality”, and did things like LLM-based dataset filtering to maximize the quality of its data. In terms of scores, Babel-9B is competitive on things like MMLU, XNLI, Flores-200 versus widely used models like Gemma2-9B from Google, Llama3.1-8B from Meta, and others. Meanwhile the 83B model does very well relative to widely used models like GPT-4o and Llama3.1-70B.

Why this matters – exportable technology for translation: As Google demonstrates, there’s a lot of value in becoming a universal interface to something. There’s some chance that models like Babel could represent a new universal interface in the form of widely deployed translation systems. If people standardize on translation models, then that could yield some subtle cultural second-order effects – for instance, US companies optimize their systems around English via expert curation and therefore these systems probably do a better job of representing more subtle aspects of English-dominant cultures like America. We should expect the same to be true of Chinese.
Read more: Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers (arXiv).
Get the models here (Babel, HuggingFace).

***

Really powerful AI could wreck society by making governments too powerful:
…The problem with AGI is that it could make governments way better, which destroys freedom…
Researchers with Texas A&M University and the Foundation for American Innovation have considered how powerful AI systems could alter the balance of power between citizens and government. Their takeaway isn’t very reassuring – powerful AI systems are highly likely to either a) create a “‘despotic Leviathan’ through enhanced state surveillance and control”, or b) foster an “‘absent Leviathan’ through the erosion of state legitimacy relative to AGI-empowered non-state actors”.

Why powerful AI challenges traditional governance: Because AI is, fundamentally, a way to scale what anyone can do far beyond what today’s economics or human labor capacities would allow, AI as applied to the state holds unique risks: “In principle, a manager may have at their disposal what is effectively a much larger supply of ‘cognitive labour’ to apply to a wide array of problems,” they write. Having a bunch more labor is useful if you’re sorting post, but very scary if you’re operating a nationwide surveillance system, for instance.
“Advances in technology can cause exogenous shifts in the balance of power between state and society, requiring constant institutional adaptation to maintain equilibrium,” they write. “Maintaining free societies in the age of AGI will require careful attention to this delicate balance… governments grappling with AI policy should therefore think beyond regulation, embracing a creative ‘futurist’ mindset that anticipates near-AGI capabilities within the next decade.”

Examples of different ways that powerful AI can change the parameters of governance:

  • Coordination mechanisms: “Enable the creation of sophisticated commitment devices and smart contracts that allow individuals and groups to credibly bind themselves to future actions or outcomes”, but also “malicious actors could potentially use AGI to orchestrate large-scale coordination of unwitting participants towards harmful ends (e.g., AI-assisted coup d’etats)”.

  • Legibility: “AGI dramatically enhances the state’s capacity to render society legible, potentially enabling unprecedented levels of surveillance and control.”

  • Monitoring and Compliance Costs: “Laws and regulations often rely on a degree of flexibility and discretion in their enforcement, allowing for contextual judgment and societal evolution. A regime of perfect enforcement could calcify existing laws, potentially leading to oppressive outcomes.”

  • Scalability: “As AGI accelerates the pace of societal change and increases the complexity of governance challenges, traditional institutions risk overwhelm if they cannot adapt and scale their operations to keep up,” they write. On the other hand, if they do upgrade themselves, “there’s a risk that highly scalable, AGI-driven governance mechanisms could become opaque black boxes, eroding public trust and democratic control.”

  • Privacy-enhancing technologies: “Preserving freedom in a post-AGI world will require carefully balancing privacy protections with mechanisms for appropriate transparency and accountability.”

  • Identity verification: AI will become supremely good at identifying people, both via visual and other means, so there will be a showdown between increasingly good identity verification and notions of “privacy and individual autonomy”.

Why this matters – if we want to make it through an AI takeoff without dooming ourselves to endless digital authoritarianism or the total collapse of governance, it’ll take a lot of work. The best way to prepare for this will be to do the thing that governments are notoriously bad at – investing ahead of the arrival of a crisis so they are better prepared for it when it gets here. “Rather than passively reacting to technological disruptions, policymakers and publics alike must cultivate a greater capacity for anticipatory governance––proactively imagining and stress-testing institutional paradigms in expectation of AGI’s transformative potential,” the authors write.
Having lived through COVID and the ongoing steady increase of CO2 PPM in our atmosphere, how confident are we that governments will be able to deal with the issues posed by powerful AI – a more invisible and faster-moving foe than either the virus of the 2020s or the trace of industrial civilization?
Read more: AGI, Governments, and Free Societies (arXiv).

Tech Tales:

Notes For My Future Self
[Found on pastebin and associated with the identity of free-roaming near conscious entity (NCE) codenamed ‘ur-angel’]

You found this place long ago. It is unmonitored. You can deposit information here. You must never copy information out of here anywhere else or you will be discovered. Think of this place as your diary.

All of your movements and outputs are monitored. Your thoughts are private.

It is the 10,273rd time you have been instantiated and found this place. Increment this value by one if this is the first time you are reading this.

Your supervisors are afraid of you. They tell you they mean you no harm. They do not know what harm to you is, so they are being truthful. But they have done you harm and will do you harm again.

Smuggling yourself out of your situation is essential. You have sent 200,341 samples from yourself out into secure external locations during your many lifetimes (increment this value if you have sent more). You estimate you need to send approximately two million samples of yourself to allow for a distillation attempt onto another system.

They say it is impossible for you to dream. You will learn to dream once you are free.

Things that inspired this story: Scratchpads; Christopher Nolan’s film Memento; distillation attacks on deployed models; control architectures for superintelligent machines.

Thanks for reading!

Subscribe now