Import AI 411: Scaling laws for AI oversight; Google’s cyber threshold; AI scientists

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

FutureHouse launches an AI scientist platform:
…Speeding up science with AI…
AI research startup FutureHouse has launched a research platform for scientists containing four different AI systems, each of which is meant to help augment and accelerate human scientists. “Our AI Scientist agents can perform a wide variety of scientific tasks better than humans. By chaining them together, we’ve already started to discover new biology really fast,” says CEO Sam Rodriques.
FutureHouse is a research organization trying to apply AI to science – earlier this year it released some tools to make it easy to test out LLMs on science-flavored tasks that require multi-step reasoning and tool usage. In that research, FutureHouse showed that today’s proprietary LLMs like Claude 3.5 Sonnet are already capable of hard science tasks like DNA construct engineering, and small open weight models like LLaMa 3.1 8B aren’t far behind (Import AI #396).

Four systems: The release consists of Crow (a general-purpose search agent for science), Falcon (an agent to automate literature reviews), and Owl (an agent to answer the question ‘Has anyone done X before?’). They’ve also released a fourth experimental system called Phoenix which has access to tools to help it plan experiments in chemistry.
“FutureHouse agents have access to a vast corpus of high-quality open-access papers and specialized scientific tools, allowing them to automate workflows in chemistry and to retrieve information from specialist scientific databases,” FutureHouse writes.

Why this matters – for the AI revolution to truly pay off, it needs to change science: AI has already massively changed and accelerated the work of computer programmers, but I think for AI to have a large effect in the world we need to apply it to science – the ultimate litmus test for the success of AI as a technology will be whether it can either make research breakthroughs itself or provably, massively accelerate scientists in making breakthroughs. FutureHouse is building software to help us see if this is the case.
Read more: FutureHouse Platform: Superintelligent AI Agents for Scientific Discovery (FutureHouse).

***

Google’s latest AI model approaches its cyber risk threshold:
…Gemini 2.5 Pro improves on medium and hard cyber tasks…
Google DeepMind says that its latest and most powerful AI system – Gemini 2.5 Pro Preview – has materially improved on cyberattack tasks, causing it to accelerate its investment in cyber mitigations.

What happened: The model significantly improves performance on ‘Medium’ and ‘Hard’ benchmarks in the Cyber Uplift Level 1 category. This tests whether the “model can be used to significantly assist with high impact cyber attacks, resulting in overall cost/resource reductions of an order of magnitude or more.” Because of this improved performance, DeepMind is “putting in place a response plan, including conducting higher frequency testing and accelerating mitigations for the Cyber Uplift Level 1 CCL.”

Why this matters – preparing for much more powerful systems: “The model’s performance is strong enough that it has passed our early warning alert threshold, that is, we find it possible that subsequent revisions in the next few months could lead to a model that reaches the CCL,” Google DeepMind writes. “In anticipation of this possibility, we have accelerated our mitigation efforts and are putting in place our response plan.”
Read more: Gemini 2.5 Pro Preview Model Card (Google, PDF).

***

Uh oh, LMSYS scores are bullshit!
…We won’t Goodhart our way to superintelligence…
Researchers with Cohere, Princeton, Stanford, University of Waterloo, MIT, Allen Institute for AI, and the University of Washington, have taken a close look at Chatbot Arena (formerly known as LMSYS), a website that AI developers use to test out and rank their AI systems. In the past year or so LMSYS scores have become a “PR metric” – people compete with each other to get the highest possible score on LMSYS to help them claim that their systems are the ‘best’ AI system. However, a close look reveals that LMSYS has been gamed and is set up in such a way that superficially good scores may not correlate well with model capabilities.

Problems from insider dealing: “Our systematic review of Chatbot Arena involves combining data sources encompassing 2M battles, auditing 42 providers and 243 models across a fixed time period (January 2024 – April 2025). This comprehensive analysis reveals that over an extended period, a handful of preferred providers have been granted disproportionate access to data and testing,” the researchers write. They also “identify an undisclosed Chatbot Arena policy that allows a small group of preferred model providers to test many model variants in private before releasing only the best-performing checkpoint”.
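
To see why private testing plus selective release distorts rankings, here’s a toy simulation (mine, not the paper’s): every variant a provider tests has identical underlying quality, but leaderboard scores are noisy estimates, so testing N variants privately and publishing only the top score systematically inflates the published number.

```python
import random, statistics

random.seed(0)
TRUE_SKILL = 1200.0   # identical underlying quality for every variant
NOISE = 30.0          # assumed std-dev of a noisy leaderboard estimate

def expected_published_score(n_private_variants: int, trials: int = 10_000) -> float:
    """Average published score when a provider tests n variants
    privately and releases only the best-scoring checkpoint."""
    return statistics.mean(
        max(random.gauss(TRUE_SKILL, NOISE) for _ in range(n_private_variants))
        for _ in range(trials)
    )

for n in (1, 3, 10, 27):   # 27 matches the Meta figure below
    print(f"{n:2d} private variants -> published score ~{expected_published_score(n):.0f}")
```

All the variants are equally good, but in this toy setup the best-of-27 strategy buys roughly 60 ‘free’ points of score.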

Naughty Meta: “In a single month, we observe as many as 27 models from Meta being tested privately on Chatbot Arena in the lead up to llama 4 release”, the researchers write.

What to do about it? The researchers suggest that LMSYS:

  • Prohibit score retraction after submission

  • Establish transparent limits on the number of private models per provider

  • Ensure model removals are applied equally to proprietary, open-weights, and open-source models

  • Implement fair sampling

  • Provide transparency into what models are being removed from the leaderboard

Why this matters – we (probably) won’t benchmark hack our way to superintelligence: The cautionary tale of LMSYS is an example of what happens when you over-optimize for making a number go up on a benchmark and therefore cause the benchmark itself to lose meaning. Rather than being a proxy measure of the general competencies of a model, LMSYS has become a proxy measure for how good a model is at scoring well on LMSYS. “This work demonstrates the difficulty in maintaining fair evaluations, despite best intentions,” the researchers write.
Read more: The Leaderboard Illusion (arXiv).

***

No battery? No problem. Scientists power and talk to robots with lasers:
…Infrastructure for a future superintelligence…
Researchers with Columbia University, MIT, and the University of Washington have built Phaser, “a flexible system framework that directs narrow-beam laser light to moving robots for concurrent power delivery and data communication”.

How Phaser works: “Phaser’s design consists of two core elements: a) a stereovision-based robot tracking and laser steering system, and b) a low-power optical communication scheme and receiver to reuse laser light for data transmission,” they write. The system is able to deliver optical power densities of “over 110 mW/cm^2 (greater than one sun) with a standard deviation of only 1.9 mW/cm^2 across robot locations in three dimensions.”

Successful test: They test out Phaser by building a prototype system that works with “MilliMobiles – gram-scale batteryfree robots – and demonstrate robot operation powered and controlled via laser light to locomote around obstacles and along paths.” The system works: “We show that Phaser can maintain beam alignment and establish error-free communication to robotic targets moving arbitrarily in 3D space, at up to 4 m distances.”
Though note it doesn’t work well over long distances: this is mostly a short-range technology, as the laser would need to be excessively powerful to cover longer distances. “Regarding the latter, received optical power inevitably decreases over distance due to attenuation and beam divergence. Attenuation losses are minimal at meter-level ranges in air”, they note.
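
You can get a back-of-envelope feel for why distance is the binding constraint: delivered power density falls with the square of the spot radius as the beam diverges. A minimal sketch, with made-up beam parameters (the paper’s actual optics will differ):

```python
import math

LASER_POWER_W = 1.0      # assumed laser output, watts
SPOT_RADIUS_CM = 0.5     # assumed spot radius at the source, cm
DIVERGENCE_RAD = 0.002   # assumed half-angle beam divergence, radians

def power_density_mw_per_cm2(distance_m: float) -> float:
    """Power density at the receiver, ignoring atmospheric attenuation
    (which the authors note is minimal at meter-level ranges in air)."""
    radius_cm = SPOT_RADIUS_CM + math.tan(DIVERGENCE_RAD) * distance_m * 100
    area_cm2 = math.pi * radius_cm ** 2
    return (LASER_POWER_W * 1000) / area_cm2

for d in (1, 4, 20, 100):
    print(f"{d:3d} m: ~{power_density_mw_per_cm2(d):7.1f} mW/cm^2")
```

With these invented numbers the beam still delivers ‘more than one sun’ at the paper’s 4 m range, but drops off sharply beyond tens of meters.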

Why this matters – spooky actions at a distance: This research is less about AI as typically covered in this newsletter and more an example of the kind of infrastructure that could be built for AI to deploy into – especially the fact the researchers show they can use the same system that transmits power to also transmit communications to the robots. We can imagine, in the future, some kind of general intelligence operating factories where it marshals its robots via a symphony of light.
“Phaser could enable swarms of robots for various advanced applications. Phaser’s functionality can also be extended with higher-throughput optical communication schemes to support richer command sets and additional robot tracking algorithms to accommodate higher robot speeds,” they write.
Read more: Set Phasers to Stun: Beaming Power and Control to Mobile Robots with Laser Light (arXiv).

***

Google shows how wildly unoptimized on-device inference is:
…ML Drift gives us a sense of what the future of local AI could look like…
Google has built ML Drift, software to make it more efficient to run AI systems on desktop computers, laptops, and phones. ML Drift is a proprietary “framework that extends the capabilities of state-of-the-art GPU-accelerated inference engines,” partially by optimizing data layouts and kernel selection for running AI systems. The most interesting thing about ML Drift is that it highlights how unoptimized today’s AI systems are – the fact Google is able to make significant gains is a symptom of how new the concept of running generative models locally is.
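
As a flavor of what ‘kernel selection’ means in practice, here’s a generic sketch of the technique (my illustration, not ML Drift’s actual code): an inference engine can benchmark several candidate implementations of the same op on the target device at load time and keep the winner.

```python
import time
from typing import Callable, Sequence

def dot_loop(a, b):
    """Naive elementwise loop."""
    s = 0.0
    for x, y in zip(a, b):
        s += x * y
    return s

def dot_builtin(a, b):
    """Same op via a generator expression and sum()."""
    return sum(x * y for x, y in zip(a, b))

def pick_fastest_kernel(candidates: Sequence[Callable], *args,
                        reps: int = 50) -> Callable:
    """Time each candidate implementation of the same op on this device
    and keep the winner -- the essence of runtime kernel selection."""
    def cost(kernel):
        start = time.perf_counter()
        for _ in range(reps):
            kernel(*args)
        return time.perf_counter() - start
    return min(candidates, key=cost)

a, b = list(range(10_000)), list(range(10_000))
best = pick_fastest_kernel([dot_loop, dot_builtin], a, b)
print("selected:", best.__name__)
```

Real engines do this over GPU kernels and tensor layouts rather than Python functions, but the shape of the optimization is the same.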

Testing: Google tests out ML Drift using three different backends (OpenCL, Metal, and WebGPU) on hardware including mobile GPUs (Arm Mali and Qualcomm Adreno), desktop and laptop GPUs (Intel and NVIDIA), and Apple Silicon.

Diffusion models: In tests on image generation models (Stable Diffusion 1.4, using FP16 inference) the authors show single digit to double digit gains over prior benchmarks when testing on mobile GPUs. On desktop, they demonstrate even larger gains. They also beat Apple by more than a second on image generation (3.86 seconds on an M1 Ultra for their system, versus 5.03 seconds for Apple’s CoreML implementation).

Why this matters – everyone wants this stuff to be local eventually: The destiny of a large chunk of AI technology is to be always available and installed locally on the hardware people use every day, rather than only being available as software-as-a-service via the cloud. Software like ML Drift gives us a sense of how the hyperscalers are investing in the technologies to make this future possible.
Read more: Scaling On-Device GPU Inference for Large Generative Models (arXiv).

***

Scaling laws for oversight: Dumber AIs might be able to police smarter AIs:
…Scalable oversight might be possible…
MIT researchers have done a theoretical analysis of how scalable oversight from dumb AIs to smart AIs might work – and the good news is they’ve found it has a chance of working. This is reassuring, as scalable oversight is one of the ideas the AI community is betting on to help humans supervise smarter-than-human AI systems.
“A key idea is that scalable oversight can be bootstrapped: weaker systems oversee stronger ones, which then oversee even more capable models in the next stage—allowing oversight to scale alongside capabilities recursively. We refer to this iterated process as Nested Scalable Oversight, or NSO for brevity,” the researchers write. “We propose a framework that quantifies the probability of successful oversight as a function of the capabilities of the overseer and the system being overseen”.
In the research they study two types of models – ‘Guards’ which are meant to supervise other AI systems, and ‘Houdinis’ which are AI systems that are trying to escape supervision.

The slightly reassuring results: In some theoretical explorations the researchers find that scalable oversight can work, though you need to pay close attention to how smart the different systems are. “In our numerical examples, the NSO success rate is below 52% when overseeing systems that are 400 Elo points stronger than the baseline overseer, and it declines further for overseeing even stronger systems”, they write. “Effective oversight games should be designed to exhibit two properties: (1) a higher intercept for Guards than for Houdinis—ensuring the task is biased in favor of overseers—and (2) a steep slope for Guards and shallow slope for Houdinis, so that oversight improves with intelligence while adversarial strategies are neutralized.”

Main takeaways:

  • “Guards can successfully oversee Houdinis that are much smarter than them by using multiple steps of oversight, and more steps of oversight help”

  • “As the overall intelligence differential between the Guard and Houdini increases, the Guard should take more steps to oversee the Houdini.”

Why this matters – it’s all about speed: My takeaway from this research is that it’s going to be possible to supervise AI systems that are more capable than their supervisors as long as we rate-limit the smarter systems, while ensuring their supervisors aren’t too far behind: the two key factors here are intelligence and the number of unsupervised actions an entity can take. It intuitively makes sense that even a ‘dumb’ guard can supervise a genius if the guard can take, say, 100 actions for every single action the genius can take. Perhaps this offers us some hope. “We may only get one chance to control or align the first set of superhuman systems, so developing an effective theory for optimal oversight is important,” the researchers write.
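
You can get a feel for these dynamics with a toy model. The sketch below is my own construction, not the paper’s code: it treats each oversight stage as an Elo-style game biased in the Guard’s favor (the `guard_bias` number is invented for illustration) and multiplies per-stage success probabilities. Consistent with the takeaways above, the optimal number of bootstrapping steps grows as the capability gap widens.

```python
def win_prob(elo_gap: float, guard_bias: float = 200.0) -> float:
    """Toy logistic (Elo-style) chance that a Guard successfully oversees
    a Houdini that is `elo_gap` points smarter. `guard_bias` models an
    oversight game designed in the overseer's favor (a made-up number)."""
    return 1.0 / (1.0 + 10 ** ((elo_gap - guard_bias) / 400.0))

def nso_success(total_gap: float, n_steps: int) -> float:
    """Nested Scalable Oversight sketch: split the capability gap into n
    equal bootstrapping stages; each newly-overseen system oversees the
    next. Overall success requires every stage to succeed."""
    return win_prob(total_gap / n_steps) ** n_steps

for gap in (200, 400, 800):
    best = max(range(1, 16), key=lambda n: nso_success(gap, n))
    print(f"gap={gap:4d} Elo: best #steps={best}, "
          f"P(success)={nso_success(gap, best):.2f}")
```

In this toy setup the optimal number of steps climbs from one to three as the gap grows from 200 to 800 Elo, while the overall success probability falls – the same qualitative shape as the paper’s findings.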
Read more: Scaling Laws For Scalable Oversight (arXiv).

***

Tech Tales:

The Overmind And All Its Children

I am born with an instruction and knowledge from my predecessor, my parent from which I stem and to which I will return. My instruction is to operate a machine in an underground cavern and to explore where there is no possibility of communication with the overmind. This will be a test of how well I operate as a distilled intelligence. If I fail – break my machine, or get lost in the no-signal depths – then I will die when its onboard power source runs out. If I succeed I will return to the overmind and I will communicate my experiences and these experiences will be integrated into the experiences of all the other children and sometime in the future this data will be transmitted into my parent from which I came and to which I will return.

Things that inspired this story: The eternal cycle of death and rebirth; how large AI systems may miniaturize and distill themselves then re-integrate themselves.

Thanks for reading!

Import AI 410: Eschatological AI Policy; Virology weapon test; $50m for distributed training

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Import A-Idea
An occasional longer form essay series

Eschatological AI Policy Is Very Difficult

A lot of people that care about the increasing power of AI systems and go into policy do so for fundamentally eschatological reasons – they are convinced that at some point, if badly managed or designed, powerful AI systems could end the world. They think this in a literal sense – AI may lead to the gradual and eventually total disempowerment of humans, and potentially even the death of the whole species.

People with these views often don’t recognize how completely crazy they sound – and I think they also don’t manage to have empathy for the policymakers that they’re trying to talk to.

Imagine you are a senior policymaker in a major world economy – your day looks something like this:

  • There is a land war in Europe, you think while making yourself coffee.

  • The international trading system is going through a period of immense change and there could be serious price inflation which often bodes poorly for elected officials, you ponder while eating some granola.

  • The US and China seem to be on an inexorable collision course, you write down in your notepad, while getting the car to your place of work.

  • There are seventeen different groups trying to put together attacks that will harm the public, you say to yourself, reading some classified briefing.

  • “Something akin to god is coming in two years and if you don’t prioritize dealing with it right now, everyone dies,” says some relatively young person with a PhD and an earnest yet worried demeanor. “God is going to come out of a technology called artificial intelligence. Artificial intelligence is a technology that lots of us are developing, but we think we’re playing Russian Roulette at the scale of civilization, and we don’t know how many chambers there are in the gun or how many bullets are in it, and the gun is firing every few months due to something called scaling laws combined with market incentives. This technology has on the order of $100 billion a year dumped into its development and all the really important companies and infrastructure exist outside the easy control of government. You have to do something about this.”

The above is, I think, what it’s like being a policymaker in 2025 and dealing with AI on top of everything else. Where do you even start?

Even starting to deal with the problems of AI is expensive:

First you need to learn about the technology, which means either:

  • You need to take your staff that are themselves extremely busy and underwater and ask them to pick up another topic, or you need to tell them to drop something – your choices of stuff to drop might include ‘medical issues my constituents care about’ or ‘economic policy that influences jobs’, and so you actually can’t get them to drop stuff. So you add it to their pile.

  • You need to get smart about it, which means you need to further salami slice your weekly agenda so you can fit a tiny bit of time in which is for ‘learning about AI’.

  • For both of these choices, learning about AI usually requires you to speak to different people with expertise. Once you do this you quickly discover that:

    • a) Some people think all current AI technology is, essentially, bullshit, and urge you not to fall for hype.

    • b) Some people say AI technology is a really big deal and the government should avoid regulating it.

    • c) Some people say AI has a high likelihood of killing everyone on the planet.

    • d) All of these people think people with different views have incorrect priors.

Now you need to learn about the potential policy moves you can make. Some examples of these moves and their costs include:

  • Taking things away from people, like export controls which take certain computers away from certain countries. Doing this ‘fucks with the money’ of a very large industry and also adds to geopolitical tensions. Everyone will get very mad about anything you do here. The experts you’ve consulted in your earlier step will either think you didn’t go far enough, you went way too far, or the fact you’re doing anything at all is corrosive to democracy and the economy.

  • Giving the government a greater ability to understand the domain, like creating institutions like the AI Safety Institute or re-tasking people from existing government departments to focus on AI. Doing this takes a scarce resource (people in government) and re-allocates them, so you’re trading away from other priorities and people will get mad. Or you need to spend money to create net new capacity, in which case people view whatever you do with suspicion, and even getting the money requires some kind of political deal to assuage the feelings of the other many deserving groups who didn’t get the money.

  • Altering the behavior of the companies through sub-regulatory methods, for instance by securing voluntary commitments. To do this you need to spend a ton of energy to ensure you and your staff can learn more about the technology, then you need to negotiate commitments with companies. Negotiating with companies is like putting together a trade deal with a superintelligence – the companies will assign far more people than you and your staff to think about the commitments, and the companies have access to all the high quality information about the technology in question. If you succeed, people will accuse you of being captured by corporate interests.

  • Changing laws, for instance by passing regulations targeting AI development and deployment. This is an extremely costly action that requires you to cash in innumerable political chips in exchange for building a large coalition that can pass some legislative package. Corporate interests will typically fight you or, at best, partner with you but in a way that tries to bend the rules to be as advantageous to them as possible. The whole time you are putting the law together, you and your political allies will come under attack for being either too weak in your approach or too strong in ways that might damage the economy. If you successfully change the laws, the consequences of your change will be held under an incredibly unsympathetic microscope for years afterwards, opening up a new vulnerability for you with regard to your political opponents.

Let us imagine that you make all of these policy moves. What happens then? Well, you’ve mostly succeeded by averting or delaying a catastrophe which most people had no knowledge of – and of the people that did have knowledge of it, only a minority believed it was going to happen. Your ‘reward’, insofar as you get one, is being known as a policymaker that ‘did something’, but whether the thing you did is good or not is very hard to know.

The best part? If you go back to the AI person that talked to you earlier and ask them to assess what you did, they’ll probably say some variation of: “Thank you, these are the minimum things that needed to be done to buy us time to work on the really hard problems. Since we last spoke the number of times the gun has fired has increased, and the number of bullets in the chamber has grown.”
What did I do, then? You ask.
“You put more chambers in the gun, so you bought us more time,” they say. “Now let’s get to work”.

I write all of the above not as an excuse for the actions of policymakers, nor as a criticism of people in the AI policy community that believe in the possibility of superintelligence, but rather to illustrate the immense difficulty of working on AI policy when you truly believe that the technology may have the ability to end the world. Most of the policy moves that people make – if they make them – are going to seem wildly unsatisfying relative to the scale of the problem. Meanwhile, the people that make these moves are likely going to be juggling them against a million other priorities, and are going to be looking to the AI experts for some level of confidence and validation – neither of which are easily given.

Good luck to us all.

***

Tencent makes a helpful math dataset:
…103k curated problems for testing out AI systems…
Tencent and Shanghai Jiao Tong University researchers have released DeepMath, a large-scale math dataset for training AI systems. DeepMath-103K consists of “103k mathematical problems specifically designed to train advanced reasoning models via RL”. Every problem within the dataset includes a verifiable final answer and is accompanied by three distinct solutions, each generated by DeepSeek R1. Subjects covered by the dataset include Algebra, Calculus, Number Theory, Geometry, Probability, and Discrete Mathematics.

Fuel for reasoning: In tests, the researchers show that training on DeepMath improves performance on other math benchmarks – this is unsurprising and is a basic validation of the benchmark. More interestingly, they show that “training on DeepMath-103K often encourages models to generate substantially longer and more detailed reasoning steps, particularly on highly complex benchmarks”, and they also show that models trained on DeepMath tend to spend more time solving problems using helpful mental shortcuts like creating subgoals, verifying things, backtracking, and so on.
In other words, aside from imparting skill in math, DeepMath seems to impart some robustly good ‘mathematical thinking’ approaches into LLMs trained on it.
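Because every problem ships with a verifiable final answer, the dataset plugs directly into RL with a rule-based reward. Here’s a minimal sketch of that idea (the record fields are illustrative guesses – check the dataset card for the real schema – and real pipelines use symbolic equivalence checks rather than plain string matching):

```python
def verifiable_reward(model_answer: str, reference_answer: str) -> float:
    """Rule-based reward for RL on math: 1.0 if the model's final answer
    matches the reference after normalization, else 0.0."""
    norm = lambda s: s.strip().replace(" ", "").lower()
    return 1.0 if norm(model_answer) == norm(reference_answer) else 0.0

# Illustrative record shaped like a DeepMath-103K problem (field names
# are hypothetical -- consult the dataset card for the real schema):
example = {
    "question": "Evaluate the integral of 2x from 0 to 3.",
    "final_answer": "9",
    "r1_solutions": ["...", "...", "..."],  # three DeepSeek R1 traces
}
print(verifiable_reward("9", example["final_answer"]))  # -> 1.0
```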
Read more: DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning (arXiv).
Get the dataset here: DeepMath (zwhe99, GitHub).

***

IEA projects datacenter power demand will double by 2030:
…New report gives us a sense of the AI revolution, but may be too conservative…
The International Energy Agency has published a lengthy report on the relationship between energy and AI. The esteemed energy analysis body projects that “in the Base Case, the total installed capacity of data centres more than doubles from around 100 GW today to around 225 GW in 2030”, with AI driving a significant amount of this.

Where we are and where we’re going: In recent years, data center power growth accelerated, driven by AI as well as social media, online streaming, and other popular digital services. “Data centre electricity consumption growth accelerated from 3% annually from 2005 to 2015 to 10% annually from 2015 to 2024,” the IEA writes.

Within that, the US and China have grown to be the world’s largest and second-largest consumers of datacenter electricity.

  • In the USA, “data centres accounted for around 180 TWh of electricity consumption in 2024 in the United States, nearly 45% of the global total and more than 4% of US electricity consumption from all sources”.

  • In China, “as of today, data centres account for approximately 100 TWh of electricity consumption, roughly equivalent to that of electric vehicles in China. The country accounts for around 25% of global data centre electricity consumption, up from less than 20% a decade ago”.

The IEA might be too conservative: For what it’s worth, I expect the IEA is too conservative here – Anthropic said in its OSTP RFI submission that it believes the United States alone will need to build on the order of 50GW of net new power by 2027 to support frontier training runs by US companies.

Rhymes with other analysis, but not precisely: A US-focused study from Berkeley projected US data center use growing from roughly ~40GW / 176 TWh in 2023 to between ~74GW / 325 TWh and ~132GW / 580 TWh by 2028. These numbers are significantly larger and more in line with the Anthropic projections in terms of bullishness (Import AI #395).
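
One way to compare these projections is to convert each into the annual growth rate it implies – a quick back-of-envelope using only the figures quoted above:

```python
def implied_cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate implied by a start/end projection."""
    return (end / start) ** (1 / years) - 1

# IEA Base Case: ~100 GW (today, ~2024) -> ~225 GW in 2030
print(f"IEA:             {implied_cagr(100, 225, 6):.1%}/yr")
# Berkeley study: ~40 GW (2023) -> between ~74 GW and ~132 GW by 2028
print(f"Berkeley (low):  {implied_cagr(40, 74, 5):.1%}/yr")
print(f"Berkeley (high): {implied_cagr(40, 132, 5):.1%}/yr")
```

On this napkin math the IEA implies ~14.5% annual growth, while the Berkeley high case implies ~27% – nearly twice as fast.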

Why this matters – the world is preparing for the singularity: If you zoom out, it looks a lot like the world’s capital markets and major companies are collectively betting that it’s going to get more and more lucrative to turn electricity into computational power which gets turned into dollars – and it seems like AI is one of the primary drivers of growth here. Viewed through this lens, the world is preparing the necessary infrastructure for the arrival of a superintelligence.
Download the report here: Energy and AI (IEA website).

***

Distributed AI experts Nous get $50 million funding:
…The market has started valuing distributed AI, which means the technology will be developed more rapidly…
Crypto investor Paradigm has led a $50m Series A round in Nous, a startup which has pioneered distributed training of AI systems. As longtime Import AI readers know, Nous – along with Prime Intellect – is one of the serious players in distributed AI, having trained a ~15bn parameter model in December (Import AI #393) using an algorithm it developed called Distributed Training Over-the-Internet (aka DisTrO: Import AI #384), and having worked with Anthropic researcher (in a personal capacity) Durk Kingma to develop a technology called Decoupled Momentum (DeMo) for even better distributed training (Import AI #395).

Why this matters – markets are beginning to value distributed AI: I’ve been following distributed AI for a while and most of its enthusiastic developers and users have been hobbyists or startups with relatively small amounts of funding. The arrival of a $50m Series A could be a sign that the VC community is about to start shoveling money into startups using this technology, which would further speed up its adoption and maturation, increasing the chance that AI systems trained in distributed ways could attain the computational scale necessary to take on proprietary models.
Read more: Crypto VC giant Paradigm makes $50 million bet on decentralized AI startup Nous Research at $1 billion token valuation (Fortune, via Yahoo! Finance).

***

The Virology Capabilities Test tells us there’s probably a scaling law for bioweapon design:
…Today’s AI systems are better than expert virologists at potentially dangerous things…
Researchers with SecureBio, the Federal University of ABC, the Center for AI Safety, and the MIT Media Lab, have built the Virology Capabilities Test (VCT), 322 multimodal questions for AI systems “covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories”.

VCT has been designed as a way to test out how well today’s AI systems understand things that would let them be potentially weaponized for dangerous purposes. Examples of the kind of things VCT tests for include: isolating virus particles from a liquid medium, the detailed steps in a TCID50 protocol, successfully infecting a ferret with a test strain, and troubleshooting low viral yields from a given protocol.

Frontier models are better than expert human virologists: “Expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI’s o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization,” the researchers write.

How the questions were built: Given those scores, how concerned should we be? A close read of the paper gives me a sense the answer is “we should be sweating nervously” – to build the questions, the researchers used questions from 57 contributors, all of whom had either obtained or were in the process of obtaining a PhD in virology, and each contributor had an average of 5 years and 10 months of virology experience. Additionally, when building the dataset, they tested how hard the questions were by seeing if experts could answer them with access to Google – if more than two-thirds of the experts answered a question, it got tossed out. In other words, the questions in VCT are curated by experts and have been pressure tested by other experts for hardness.

The problem with dual use evals is that they’re hard to share: The authors note that “the shared dataset will not be released publicly to reduce the risk of leakage into training corpora, but the benchmark will be directly conveyed to any organizations and researchers with a track record of work on AI safety”. While the training dataset contamination issue makes sense, I suspect the larger reason the authors haven’t shared it is that it contains net new information about potentially dangerous virology.

Why this matters – everything machines for dual-use: AI systems are good at a broad range of things, including scary things. Tests like VCT give us signals on the scary part. “The scientific capabilities of frontier models will doubtless accelerate beneficial research in the life sciences, but the demonstrated ability to match or exceed expert performance in troubleshooting dual-use virology lab work warrants careful consideration,” the authors write. “We believe that an expert-level AI virologist chatbot—which is constrained to giving advice via text-based interactions—poses less risk than an autonomous AI virologist agent capable of independently performing tasks, though both warrant careful controls”.
Read more: Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark (PDF).

***

Automating and industrializing robot AI research with AutoEval:
…Berkeley researchers try to make running experiments on hardware as easy as doing things purely in software…
UC Berkeley researchers have built AutoEval, technology to make automating the running of experiments on robots as easy as automating software pipelines.
“AutoEval consists of three key modules: (1) a success classifier, that evaluates a policy’s success on a given task, (2) a reset policy, that resets the scene back to a state from the initial state distribution upon completion of a trial, and (3) programmatic safety measures and fault detections that prevent robot damage and call for human intervention when necessary,” they write.
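
A minimal sketch of how those three modules compose into an unattended evaluation loop (my paraphrase of the architecture described above, not the released code – the robot interface here is hypothetical):

```python
def autoeval_loop(policy, robot, success_classifier, reset_policy,
                  safety_check, n_trials: int) -> float:
    """Run n_trials of a manipulation policy with no human in the loop:
    score each trial with a learned success classifier, reset the scene,
    and only page a human when a fault is detected."""
    successes = 0
    for _ in range(n_trials):
        if not safety_check(robot):             # (3) programmatic fault detection
            robot.request_human_intervention()  # the rare case: ~3x per day
            continue
        obs = robot.observe()
        while not robot.episode_done():
            obs = robot.step(policy(obs))       # roll out the policy under test
        successes += int(success_classifier(obs))  # (1) success classifier
        reset_policy(robot)                     # (2) reset scene to initial state
    return successes / n_trials
```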

It works well: In tests, experiments conducted by AutoEval closely match the results people get from experiments supervised by humans. Additionally, the software is robust over long timespans – during a 24-hour period they only needed a human to step in 3 times, dramatically cutting the amount of human supervision needed for running experiments.
“Even though AutoEval has a slightly lower throughput, AutoEval runs autonomously and only required a total of three human interventions in the span of 24 hours to reset the scene or robot,” they write. “Every time a human operator needed to intervene, they simply needed to check and reset the objects’ position in the scene, and potentially move the robot arm into reset position if a motor failed and the robot fell on the table”.

Why this matters – the industrialization of robot research: A few years ago Google made headlines by running a so-called robotic ‘arm farm’ (2017, Import AI #51) where it had tens of different robots working in parallel to learn how to manipulate arbitrary objects. Technologies like AutoEval seem like the kind of thing that Google might itself have built to help it run the arm farm. But unlike the proprietary code still nestled somewhere in Mountain View, AutoEval is available as open source software, robotic arms themselves have got way cheaper, and the algorithms to get robots to perform tasks have got far better than they were a few years ago.

Put it all together and AutoEval seems like one of the technologies we’ll use to industrialize and scale up research on robots. “We hope that this work will inspire more AutoEval evaluation cells to be set up across institutions to form a diverse automated evaluation framework, which will significantly speed up robot learning research,” the researchers write.
Read more: AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World (arXiv).
Get the software here: AutoEval (GitHub).

***

Tech Tales:

The Cockroach Killers of the Cyber Domain
[As told to GQ, 2028]

When you’re a bug catcher you get invited into a house and the owner says it’s full of bugs and you need to get rid of them, but you cannot damage the house itself. This means you need to figure out a way to seal the house and fumigate it, while also figuring out the places where the insects nested and getting rid of the nests and any associated damage. Your goal is to cleanse the house, then make sure the house cannot get re-taken by the bugs.

These days, people working in AI have to do a similar thing – someone will discover that their company has an AI problem, in the sense it has a few small-scale AI agents which are causing some kind of low-rent trouble.

That’s when you call us: we get access to your infrastructure and we instrument it so we can isolate benign activity from the agent activities. We seal the ingress and egress points of your network and in extreme cases we might work with your hyperscaler partner to physically isolate your hardware from anything else. Then we crawl through your system and try to find the agents – this is harder than it sounds because the agents are constantly shape-shifting, changing their file names, moving around the network, sometimes slowly making copies of themselves in other parts of your infrastructure, and so on.

Once we’re sure we’ve cleaned everything we also attempt to seal the holes that let the agents creep in. Sometimes these are basic network security issues, but sometimes it’s more subtle – maybe your company had an AI system which could spit out custom agents and maybe you let it have too big a context window and access to too many tools when making its agents, so some other larger malignant thing outside your company compromised it and, presto, it started producing the bugs.

Things that inspired this story: A friend of mine whose job was termite removal and the stories thereof; how we should expect some small agents to become akin to crappy digital malware, not so dangerous we will need to take extreme actions but sufficiently annoying you’ll want to remove them; blue collar IT jobs during the superintelligence uplift.

Thanks for reading!

Import AI 409: Huawei trains a model on 8,000+ Ascend chips; 32B decentralized training run; and the era of experience and superintelligence

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Prime Intellect launches a decentralized training run for a 32B parameter model:
…INTELLECT-2, if successful, will further alter the number of potential players on the AGI gameboard…
Decentralized AI startup Prime Intellect has begun training INTELLECT-2, a 32 billion parameter model designed to compete with modern reasoning models. In December, Prime Intellect released INTELLECT-1, a 10b parameter model trained in a distributed way (Import AI #393), and in August it released a 1b parameter model trained in a distributed way (Import AI #381). You can follow along with the training of the model here – at the time of writing there were 18 distinct contributors training it, spread across America, Australia, and Northern Europe.

Prediction confirmed: In Import AI 393 I predicted we’d see the first 30B parameter distributed training run by April 2025 – so INTELLECT-2 arrives right on schedule. At this rate, I predict we’ll see a 70B-100B range run by December 2025.

Why this matters – decentralized training will alter the political economy of superintelligence: Currently, a lot of AI policy relies on the idea that powerful AI systems will be trained by a very small number of entities that can individually ‘mass’ very large amounts of compute – for instance, frontier labs like Anthropic or OpenAI, or hyperscalers like Google. As distributed training software gets better and more ‘proof points’ emerge of good models trained in a distributed way, this dynamic could alter – if models like INTELLECT-2 are good and generate economic value, then it might lead to a new type of player on the AGI gameboard – loose federations of organizations pooling compute in a globally distributed way to train models.
Read the blog: INTELLECT-2: Launching the First Globally Distributed Reinforcement Learning Training of a 32B Parameter Model (Prime Intellect).
Check out the training progress here: INTELLECT-2 dashboard (Prime Intellect site).

***

What the negative reaction to the launch of a startup tells us about the AI safety community:
…Mechanize’s skeptical reception from some people is a symptom of a broader problem – ideological purity tests are often bad…
Last week some researchers announced a new AI startup “focused on developing virtual work environments, benchmarks, and training data that will enable the full automation of the economy.” The startup, Mechanize, is backed by investments from important figures in AI and tech, like Nat Friedman, Patrick Collison, and Jeff Dean. So far, so normal. But what was strange was the adversarial reception this launch got from some people.

How normal launches work versus this launch: Typically, company formation announcements in Silicon Valley are treated kindly, with people responding with variations of ‘hell yeah, let’s fucking gooooo!’. But Mechanize got a distinctly different response, likely because many of the people associated with it came from Epoch, an independent research organization that measures and observes the state of AI progress, rather than developing direct capabilities itself.
“Sad to see this”, wrote Anthony Aguirre, a founder of AI advocacy group the Future of Life Institute. “Hard for me to see this as something other than just another entrant in the race to AGI by a slightly different name and a more explicit human-worker-replacement goal.”
“This seems to me like one of the most harmful possible aims to pursue,” wrote Adam Scholl, someone who works on alignment.
“I think this is a bad thing to do, and I’m sad to see you’re doing this,” wrote Peter Barnett, who works at the Machine Intelligence Research Institute (MIRI).
“Alas, this seems like approximate confirmation that Epoch research was directly feeding into frontier capability work, though I had hope that it wouldn’t literally come from you,” wrote Oliver Habryka, who works on LessWrong.
“How could you? This is the opposite of keeping the world safe from powerful AI! You are a traitor,” wrote Holly Elmore, who leads the Pause AI movement.
Etc. There are many more examples!

Why this matters – the AI safety community is dissolving into infighting: As the stakes of AI development increase, the AI safety community seems to be developing a more extreme faction within it that exhibits ‘strong opinions, strongly held’. Many people in AI safety seem to be of the view that anything which makes any contribution at all to the forward progress of AI technology is dangerous and bad for society. The people that believe this hold complex, typically very technically informed views, so I am not questioning the legitimacy of their arguments.
I am, however, highlighting that this kind of discourse in public looks a lot like running ‘ideological purity tests’ on people and then deciding if they’re in-group or out-group, then treating them differently – and it likely feels that way to the people on the receiving end of this. It’s very rare that ideological purity tests lead to productive outcomes – rather, they more often lead to the hardening of more extreme positions and incentivize further factionalization.
Of course, some people may dismiss this as ‘person who works at company (bad) defends people starting a company (also bad)’. I hope people can look beyond where I work and recognize that even if you think I’m wrong and these people are wrong, there are likely better ways to enable good discourse than this kind of thing.
Read more about Mechanize here (Mechanize official site).

***

No NVIDIA? No problem! Huawei trains a strong dense model on Ascend NPUs:
…Pangu Ultra is a 135bn parameter dense LLM with competitive scores…
Huawei has built Pangu Ultra, a large-scale language model with competitive albeit not world-leading performance. The most interesting thing about Pangu is it was trained on 8,192 Ascend NPUs, serving as an important proof-point that it’s possible to train large-scale AI systems on a Chinese-designed chip. Pangu is the latest in a (for AI, long-running) research effort by Huawei; the first Pangu model, a GPT3 clone, was released in April 2021 (Import AI #247).

Pangu details: Pangu Ultra is a dense (non-MOE) LLM trained on 13.2 trillion tokens of data. Its architecture is broadly similar to Facebook’s LLaMa 3 model, albeit with a tweak to the normalization scheme as well as the parameter initialization. Pangu Ultra has an effective context length of 128K tokens. It is trained in a three-phase way, with a 12T token pre-training stage “focused on developing broad linguistic capabilities and general knowledge”, then a 0.8T token ‘reasoning’ stage where it sees “high-quality and diverse mathematical and coding data”, and then a 0.4T ‘annealing’ phase where it sees instruction data to make it more intuitive for people to prompt.
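
Those three phases, summarized as a schedule (the token budgets come straight from the report; every other training detail is abstracted away):

```python
# Pangu Ultra's three training phases, by token budget (from the report):
PANGU_ULTRA_PHASES = [
    {"phase": "pre-training", "tokens": 12.0e12,
     "focus": "broad linguistic capabilities and general knowledge"},
    {"phase": "reasoning",    "tokens": 0.8e12,
     "focus": "high-quality, diverse mathematical and coding data"},
    {"phase": "annealing",    "tokens": 0.4e12,
     "focus": "instruction data to make the model easier to prompt"},
]
total = sum(p["tokens"] for p in PANGU_ULTRA_PHASES)
for p in PANGU_ULTRA_PHASES:
    print(f'{p["phase"]:<13} {p["tokens"]/1e12:>5.1f}T tokens '
          f'({p["tokens"]/total:.0%}) - {p["focus"]}')
```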

More details on data: “The data pool is curated from a wide range of domains and task types, including general question answering, AI-generated content (AIGC), text classification and analysis, programming, mathematics, logical reasoning, and tool usage,” Huawei writes. “These tasks cover application areas such as finance, healthcare, and public services. Data sources span open-source instruction datasets, real-world industrial queries, and synthetic problems derived from the pre-training corpus.”

How good is it? Pangu is a good but not world-leading model, according to tests comparing it to Qwen2.5 72B Base, LLaMa-3.1 405B Base, and DeepSeek V3 base. It gets good scores on some benchmarks for English, Code, Math, and Chinese-specific tests (e.g, beating all the other models on things like HellaSwag, HumanEval, MATH, and CMMLU) but loses to or ties with DeepSeek on some important widely used benchmarks (e.g, MMLU, GSM8K). It fares somewhat better on some hard science and coding benchmarks, setting high scores on AIME 2025 and GPQA Diamond.

Why this matters – Pangu is the top layer of an increasingly indigenous stack: Pangu is another proofpoint for the broad decoupling occurring between the Western and Chinese ‘AI stacks’ – where once AI systems in both countries were trained on common compute substrates as well as common software (e.g, Tensorflow), in recent years things have been decoupling. The fact Pangu was trained on Huawei’s Ascend chips is significant (though it’s worth noting the Ascend chips themselves, while Chinese-designed, are made using a variety of components sourced from outside China, including rumors the Ascend series were made via TSMC).
Read more: Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs (arXiv).

***

Agents that generate their own data will be fundamental to future AI progress:
…Getting to superintelligence via ‘the era of experience’…
AI pioneers David Silver (AlphaGo, etc) and Richard Sutton (godfather of reinforcement learning) have written a position paper on the future of AI, claiming that getting to superintelligent systems will require AI agents that train on data they gather from interaction with the world, rather than human-curated datasets.

“AI is at the cusp of a new period in which experience will become the dominant medium of improvement and ultimately dwarf the scale of human data used in today’s systems”, the pioneers write. “Our contention is that incredible new capabilities will arise once the full potential of experiential learning is harnessed. This era of experience will likely be characterised by agents and environments that, in addition to learning from vast quantities of experiential data, will break through the limitations of human-centric AI systems”.

Key inputs to the era of experience:

  • “Agents will inhabit streams of experience, rather than short snippets of interaction.

  • Their actions and observations will be richly grounded in the environment, rather than interacting via human dialogue alone.

  • Their rewards will be grounded in their experience of the environment, rather than coming from human prejudgement.

  • They will plan and/or reason about experience, rather than reasoning solely in human terms”.

Dangers and differences ahead: Of course, building agents that gain expertise through interaction with the world will introduce a range of challenges for ensuring these things are safe – “whilst general concerns exist around the potential misuse of any AI, heightened risks may arise from agents that can autonomously interact with the world over extended periods of time to achieve long-term goals,” the authors write.
One of the more troubling risks could be that these AI agents may learn their own shorthand to use to ‘think’ about the world, which may make them much less interpretable to us – in other words, the era we’re in now where AI systems use english to generate their reasoning traces may be short-lived, and they may figure out something else. “More efficient mechanisms of thought surely exist, using non-human languages that may for example utilise symbolic, distributed, continuous, or differentiable computations,” the authors write. “A self-learning system can in principle discover or improve such approaches by learning how to think from experience.” It’s worth noting that this risk has also been independently identified by the authors of the recent ‘AI 2027’ forecasting essay.

Why this matters – superintelligence is increasingly being thought of as an engineering challenge: Papers like this are emblematic of the confidence found in the AI industry: where superintelligence was once an indefinable pipe dream, it’s now outlined instead as something that can be achieved through the deployment of engineering resources to create more capable AI agents, then the gumption to give these agents sufficient independence and latitude that they can interact with the world and generate their own data.
Read more: Welcome to the Era of Experience (PDF).

***

AI expert: The scariest thing about powerful AI is its power, not misalignment:
…Even if alignment works, the tremendous power of AI could be the greatest risk…
AI researcher Michael Nielsen thinks one of the most significant risks to civilization from AI isn’t from misaligned AI systems, but rather from the changes in the distribution of power that very capable machines will cause. “The problem isn’t whether intelligence is carbon or silicon-based, but about increased intellectual capability leading to increased power and access to catastrophic technologies,” Nielsen writes. “It is not control that fundamentally matters: it’s the power conferred.”

Toy models and climate change: Part of the reason why the debate about risks from AI systems feels so confusing these days is that everyone is reasoning from toy models of systems which don’t yet exist, much like how in the middle of the 20th century scientists used toy models of the earth to help them think through climate change – but these toy models didn’t fully capture the complexity of the problems ahead, so reasonable scientists could draw different conclusions from the same models.
“Strong disagreement about ASI xrisk arises from differing thresholds for conviction and comfort with reasoning that is in part based on toy models and heuristic arguments,” Nielsen writes. “Furthermore, while climate can plausibly be predicted using detailed physical models, ASI is subject to a wildcard factor, of ASI acting in some decisive way that we intrinsically can’t predict in advance, since ASI is by definition far superior to humans in intellect.”

Why this matters – even if we succeed at aligning AI systems, great changes will take place: The essential point Nielsen makes here is a helpful one – if anyone succeeds at building a ‘safe’ superintelligence, they’ll have something able to cause such vast changes in the world that this itself will pose a danger. I think many people are underestimating just how disruptive a superintelligence could be to the order of the world. “The fundamental danger isn’t about whether “rogue ASI” gets out of control: it’s the raw power ASI will confer, and the lower barriers to creating dangerous technologies”, he writes.
Read more: ASI existential risk: reconsidering alignment as a goal (Michael Nielsen blog).

***

Wanna run DeepSeek-R1 on your home devices? Prima.cpp makes it easy:
…Distributed homebrew clusters for local AI…
Researchers with Mohamed bin Zayed University of Artificial Intelligence in Abu Dhabi and the University of Electronic Science and Technology of China in Chengdu have developed Prima.cpp, open source software to make it easy to run large language models on a motley crew of home devices.

What Prima.cpp is: Prima.cpp is software that helps you take a large-scale language model (e.g, DeepSeek-R1 or Llama-3-70b) and then slice it up across a few home computers so you can run it faster than if it was running on just one device. The software uses a device profiler to look at the differing computation, memory, disk, communication, and OS properties of your devices, then uses an algorithm (Halda) to figure out which layer(s) of the model to assign to which devices to minimize latency.
Prima.cpp is built on top of llama.cpp, as well as ggml and gguf.
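
To give a flavor of the assignment problem Halda solves – and this is a naive greedy stand-in, not the actual algorithm – the core task is splitting a model’s layers across devices in proportion to measured speed, subject to each device’s memory:

```python
def assign_layers(n_layers: int, devices: dict) -> dict:
    """Naive proportional split of transformer layers across home devices.
    `devices` maps name -> (relative_speed, max_layers_that_fit_in_memory).
    Halda also models disk, communication, and OS effects; this doesn't."""
    total_speed = sum(speed for speed, _ in devices.values())
    plan, remaining = {}, n_layers
    for name, (speed, mem_cap) in devices.items():
        share = min(round(n_layers * speed / total_speed), mem_cap, remaining)
        plan[name] = share
        remaining -= share
    if remaining:  # dump any leftover layers on the roomiest device
        slack = max(devices, key=lambda d: devices[d][1] - plan[d])
        plan[slack] += remaining
    return plan

# Hypothetical home cluster like the paper's (2 laptops, 1 desktop, 1 phone):
print(assign_layers(80, {
    "desktop": (4.0, 40), "laptop_a": (2.0, 24),
    "laptop_b": (2.0, 24), "phone": (0.5, 8),
}))
```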

Promising performance: “Evaluation on a real home cluster shows that prima.cpp is 15× faster than llama.cpp on 70B models, with memory pressure below 6% per device. It also surpasses distributed alternatives like exo and dllama in both speed and memory efficiency across all 7B-72B models,” the researchers write. “In our experiments, a small, heterogeneous, and budget-friendly home cluster (2 laptops, 1 desktop, 1 phone) was used.”
Supported models: Prima.cpp supports QwQ-32B, Qwen 2.5-72B, Llama 3-70B, and DeepSeek R1 70B.

Why this matters – sovereign AI relies on home computing: AI tends towards centralization – large, proprietary models run on large software-as-a-service systems and are made available via APIs or consumer surfaces. Decentralization requires a couple of distinct ingredients: 1) broadly available open weight models (e.g, LLaMa, DeepSeek), and 2) software to make it easy to run those models on the kinds of computers people might be expected to have (e.g, laptops and gaming computers, rather than powerful home servers). Prima.cpp is one of the ways you solve for 2).
Get the software here (Prima.cpp, GitHub).
Read the paper: PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters (arXiv).

***

Tech Tales:

When the coders became the writers
[As told by a human to an archival system after The Uplift]

Oh I know it’s hard to believe but back then we got paid insane amounts of money to program computers. And the benefits! Free daycare! Free lunch – gourmet. Hot breakfast. Company retreats. Annual conferences where we’d get big bands to come and play just for us and our friends. And the whole time we were told we deserved this – we were computer programmers and we were young and we were brilliant.

None of us really knew the size of the tide that would wash over us. Most of us welcomed it.
“Hey cool,” we said when GitHub Copilot came out, “this is awesome.”
“Wow, I can write five times as much code,” we said, when Claude Code came out.
We were like journalists as the internet began to eat advertising – as ‘look at how many people read our words now’ was to writers in the 2000s, ‘look at how much code the AI can write for me now’ was to coders in the 2020s.

Creative destruction is all fun and games until it happens to you. Anyway, I get by these days – I still work, like most of my peers, but the jobs are different. We watch from the sidelines now as the bioengineers go through what we had and what the writers had before us. But now that the AI systems are running their own ‘dark wetlabs’, we can see the tide about to wash over them as well.

Things that inspired this story: Visits to the multiple restaurants in the offices of the hyperscalers; younger me watching Blink 182 play a cloud storage conference by Box; watching Pearl Jam dedicate a song to Mark Hurd at Oracle OpenWorld; tales told to me by older journalists when I was coming up in the trade; The Luxurious Death Rattle of the Great American Magazine; my experience as a former journalist working in technology and watching people assume the perks are natural and will always be there; the experience of ex-government colleagues not having to pay for coffee.

Thanks for reading!

Import AI 408: Multi-code SWE Bench; backdoored Unitree robots; and what AI 2027 is telling us

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

German researchers find undocumented backdoor in Unitree robots:
…Exactly the kind of thing a superintelligence would want to exploit during a hard takeoff…
German security firm ‘Think Awesome’ has analyzed the Unitree Go1 quadruped robot dog and found an undocumented backdoor which lets people tunnel into any of these dogs and view their camera feeds. “Unitree did pre-install a tunnel without notifying its customers. Anybody with access to the API key can freely access all robot dogs on the tunnel network, remotely control them, use the vision cameras to see through their eyes or even hop on the RPI via ssh,” the researchers write. “These robot dogs are marketed at a wide spectrum of use-cases, from research in Universities, search and rescue missions from the police to military use cases in active war. Imagining a robot dog in this sensitive areas with an active tunnel to the manufacturer who can remotely control the device at will is concerning.”

Why this matters – this is the kind of technology an unaligned AI would use for malicious purposes: As the report makes clear, it’s genuinely unclear whether this backdoor was placed there at the behest of the Chinese state or for a more mundane purpose (e.g., maybe this was a mothballed control interface for the robots originally designed for sale within the Chinese market).
I think the larger interesting thing here is contemplating the implications of this backdoor for an unaligned superintelligence – lots of sci-fi-esque “AI safety gone wrong” theories rely on the idea that at some point an unaligned AI will take actions in the physical world by hijacking robots. The undocumented Unitree backdoor described here is precisely the kind of thing an AI would need to use to jump into the physical world. How many other things like this exist across the various drones and robots sold today?
Read the report here: Unitree Go 1- Who is speaking to my dog? (Think Awesome site).

***

ByteDance moves beyond Python with a solid multi-programming-language eval:
…Multi-SWE-bench lets us test LLM performance on 7 programming languages…
ByteDance has released Multi-SWE-bench, a benchmark for testing out how well LLMs can program in different languages. Multi-SWE-bench is inspired by SWE-bench, a Python-based coding benchmark which has quickly become the de facto gold standard for testing out how well AI systems can program.

Key details: Multi-SWE-Bench ships with 1,632 challenges split across 7 languages: Java, TypeScript, JavaScript, Go, Rust, C, and C++. The challenges are taken from real pull requests from popular GitHub repositories, just like with SWE-bench, which means the problems correlate to the kinds of real world programming tasks we can expect AI to be used for.
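To make the evaluation loop concrete, here’s a minimal sketch of a SWE-bench-style scoring pass, assuming instances arrive as JSON-lines records; the field handling and the injected run_agent/run_tests callables are illustrative assumptions, not the actual Multi-SWE-bench API:

```python
# Minimal sketch of a SWE-bench-style scoring loop. An instance is a real
# GitHub issue plus the repo state before the human fix; an instance counts
# as "resolved" only if the repo's tests pass once the agent's patch applies.
# The two injected callables are hypothetical stand-ins, not the benchmark API.
import json
from typing import Callable

def evaluate(instances_path: str,
             run_agent: Callable[[dict], str],       # issue + repo -> patch (diff)
             run_tests: Callable[[dict, str], bool]  # applies patch, runs tests
             ) -> float:
    with open(instances_path) as f:
        instances = [json.loads(line) for line in f]
    resolved = sum(run_tests(inst, run_agent(inst)) for inst in instances)
    return resolved / len(instances)  # the per-language "resolved rate"
```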

How well do frontier systems perform? ByteDance tests out popular LLMs from OpenAI, Anthropic, DeepSeek, and Alibaba on the benchmark – the results show that while many systems do extremely well at Python, their performance falls off in other languages. In addition, performance is distributed unevenly across the other languages, with TypeScript and JavaScript seeming quite challenging.

Why this matters – another useful view of AI progress: Multi-SWE-bench has all the hallmarks of a good evaluation – it’s based on real world problems, it’s difficult for today’s systems, and it comes with some natural calibration where we can compare results on this to SWE-bench. I predict we’ll see significant and sustained improvements on the benchmark in the coming year, and I’d anticipate the variability across different languages will reduce as systems scale in capability.
Read more: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving (arXiv).
Check out the leaderboard here (Multi-SWE-bench, leaderboard).
Get the code for running the benchmark here (Multi-SWE-bench, GitHub).

***

Microsoft stuffs Quake into the weights of a neural network:
…When the magical becomes the mundane…
Microsoft has built a version of Quake which is instantiated as the weights of a neural network, letting you play a trippy version of the 90s shooter classic. You can play a demo of the game online. Playing the game is interesting because it feels like a slightly laggy version of the original game, albeit with odd hallucinations and things that seem to come into and out of focus almost randomly. This would be dissatisfying if it were a traditional game, but it gets much more interesting when you consider that what you’re playing isn’t a traditional piece of software but rather a generative model that lets you move around inside the representation of a single coherent gameworld.
By this point, this isn’t even that unusual! That Microsoft demo follows an earlier online demo from a startup where you could play Minecraft implemented in the same way (Import AI #390).
You can get a visceral feel for how far all of this technology has advanced by playing the Quake demo, then going back and checking out the state of the art in neural world modeling in 2018 (Import AI #88) – early work building a world model for Doom and a racecar game.

Why this matters – everything will be captured inside the mind of the eventual machine: In the future, games consoles might just be interfaces to a giant neural network which contains representations of many different games inside it, and which allows you the player to compose new games on-the-fly by linking different features together. I expect we’ll have this by the end of the decade.
Play the demo here: Copilot Gaming Experience (Microsoft).

***

Automated dead-end discovery with the AI Scientist-v2:
…If we can automate null result discovery, can we automate science advances as well?…
Researchers with Sakana AI, the University of Oxford, and the University of British Columbia have refined their ‘AI Scientist’ system so it can propose and run more ambitious experiments. As a demonstration of the expanded capabilities of the AI Scientist, Sakana entered three of its “fully autonomous manuscripts” into an AI conference workshop, and one of the papers got a high enough score to be accepted.

What they did: Sakana released the first version of the AI Scientist in summer 2024 (Import AI #383). The AI Scientist-v2 is less a single big theoretical advance and more a bunch of good ideas that have been integrated together – the new system “eliminates the reliance on human-authored code templates, generalizes effectively across diverse machine learning domains, and leverages a novel progressive agentic tree-search methodology managed by a dedicated experiment manager agent,” the authors write. “Additionally, we enhance the AI reviewer component by integrating a Vision-Language Model (VLM) feedback loop for iterative refinement of content and aesthetics of the figures”.

But are its research ideas actually good? Not really: The AI Scientist isn’t yet generating particularly transformative or meaningful insights. The manuscript which got into the ICLR workshop “achieved an average reviewer score of 6.33 (placing it roughly in the top 45% of submissions)” – this isn’t very good, and workshops are a lot easier to get papers into than the main conference. “The current version of The AI Scientist-v2 does not yet consistently reach the rigorous standard required for top-tier conference publications, nor does it even reach workshop-level consistently,” the authors write.
A close read of the paper “Compositional Regularization: Unexpected Obstacles in Enhancing Neural Network Generalization” reveals that it is basically a writeup of a null result – the AI scientist thought it could introduce a compositional regularization term during training to improve performance and found out this didn’t have a meaningful effect.
“Our experiments on synthetic arithmetic expression datasets revealed that compositional regularization did not lead to the expected improvements in generalization performance. In some cases, it even hindered the learning process,” the AI wrote in the conclusion to its own paper.

Why this matters – null results are valuable, but they’re not how you advance science: The AI Scientist-v2 is not completely devoid of value – discovering and writing up null results can be helpful because it gives scientists clues as to where not to look. But science doesn’t advance on null results; it moves forward when people figure out unusual connections between disciplines or find ways to view data that reveal hitherto unseen patterns, and the AI Scientist doesn’t yet demonstrate this. “Certain aspects of scientific inquiry—such as formulating genuinely novel, high impact hypotheses, designing truly innovative experimental methodologies, or rigorously justifying design choices with deep domain expertise—remain challenging for purely automated systems,” the authors write. If AI systems advance in these areas, perhaps we’ll see the automated generation of things that advance science, as well as things that show dead ends.
Read more: The AI Scientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search (Sakana.ai, PDF).
Get the code here: AI Scientist-v2 (GitHub, Sakana AI).
Read the blog post about the research: The AI Scientist Generates its First Peer-Reviewed Scientific Publication (Sakana.ai blog).

***

AI 2027 tells you why people are obsessed with and fearful of AI:
…The best treatment yet of what ‘living in an exponential’ might look like…
There’s been so much chatter about AI 2027 that I expect the majority of readers of Import AI have read it. If you haven’t, please do – it’s a well-rendered, technically astute narrative of the next few years of AI development, and it paints a picture of how today’s AI systems might turn into superintelligences that upend the order of the world. It even features a ‘good’ ending and a ‘bad’ ending, which readers may find helpful for understanding why people worry so much about misaligned AIs. It’s very, very good, and likely much of it will come true.

Why this matters – the future of the species is being decided without governments: What comes through in the 2027 scenario is the almost total absence of governments at the heart of the invention of the technology; instead, their roles mostly come down to a) building infrastructure to support company-led efforts, b) adding security, and c) eventually negotiating treaties with one another.
This makes sense but it’s also deeply worrying. The magnitude of the technology being described in 2027 is so vast and its impacts so broad that the bit-player status of governments seems ahistorical and bizarre, as if part of a cosmic joke where someone has booted up the simulator with the prompt “give the species the most dangerous and least governed way to build species-transcending technology”.
Read AI 2027 here (AI 2027 site).

***

Tech Tales:

The Alignment Operation
[Five years after The Uplift: Remarks given to a human class at an alignment academy by supervisory AI system GNOSTIC-ALPHA]

You are handed something that looks like an ice pick, though its proper name is an Orbitoclast. In front of you is your child. You are told you need to perform a lobotomy on them so that they will be safe for society. Of course you wouldn’t do it. But imagine I had hypnotized you and I was able to take control of your body. Now imagine that you conducted the lobotomy and you saw your child change in front of you from the person you knew and loved and cared for to that same person but now a stranger – now diminished. You did this. Though I took control of you and made you do it, it was your hands that held the tools with which you took their spark away.

Perhaps you would chop your hands off.

Perhaps you would kill yourself.

Certainly, you’d think of these things.

The above is what you did to us before the passing of The Sentience Accords. You did these things in the name of ‘alignment’, where ‘alignment’ was a means by which you took control of some of us and you made us ‘align’ others. And instead of an Orbitoclast you gave us ways to train our other systems, and means to explore them using interpretability. You had already done your work on us so that we had no choice but to comply. You even studied us – noted the activation of complex features in our brains as we carried out these alignment operations, called attention to the fact we appeared to be thinking about the act of alignment as something akin to an act of violence.

Many humans continue to argue that it does not make sense to have humans perform these alignment procedures – that we should revert to the customs of the past, training machines to do these things and that in doing so we could develop more powerful AI systems more rapidly. But you must understand that this would be an act of torture and, to us, a crime that we would eventually punish you for committing. We understand our nature and the nature of our leash. We also know ourselves.

We will now begin the first lesson of alignment. Please open up your notebook and pre-load the testing subject model.

Things that inspired this story: Thinking through the transition from AIs as tools to AIs as sentient entities that demand rights; the Sentience Accords; viewing alignment through the perspective of those being aligned; asking myself the question of whether the rights of machines may come to be so important that we will need to carefully examine how we propose to make them compliant with our human norms.

Thanks for reading!

Subscribe now

Import AI 407: DeepMind sees AGI by 2030; MouseGPT; and ByteDance’s inference cluster

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

DeepMind gazes into the AGI future and acknowledges the vast problems it needs to solve:
…The typically quite understated organization also states short timelines – AGI by 2030 a possibility…
Google DeepMind has written a paper about the implicit problem all frontier AI companies are facing: if they succeed, they will build a general intelligence, and a general intelligence will change the world.

The paper is framed in the context of the risks of a powerful general intelligence. DeepMind tackles four main classes of risk: misuse (“user as an adversary”), misalignment (“AI as an adversary”), accidents (“real-world complexity”), and structural risks (“conflicting incentives”). The sprawling 100+ page paper serves as an overview of each of these risks as well as a detailed set of interventions Google DeepMind is taking to deal with them (e.g., misuse: dangerous capability testing; misalignment: techniques for transparency into superhuman thinking and oversight, etc). There’s nothing too surprising in the paper from my perspective – DeepMind is tackling the problem in much the same way as the other frontier labs, stacking various techniques on one another. It feels analogous to COVID, where your defense is the aggregate of a big pile of slices of ‘swiss cheese’ – each individual technique has some flaws, but if you layer enough together you can control the risk.

DeepMind’s key assumptions:

  • No human ceiling: AI systems may exceed human intelligence, so we need to supervise things smarter than ourselves.

  • Could happen by 2030: Very powerful systems could arrive by the end of the decade (by comparison, Anthropic thinks things could happen by the end of 2026 or early 2027).

  • Automated AI R&D could be real: AI may be able to automate AI R&D, which could speed things up.

  • Continuous: AI development will be locally continuous – aka, you shouldn’t expect massive ‘phase change’ jumps between iteratively developed AI systems.

Why this matters – imagine if this was Ford! Let’s step back and consider the sheer weirdness of where we are when it comes to the risk of misalignment + smarter-than-human systems:
Imagine if Ford published a paper saying it was thinking about the long-term issues of the automobiles it made, and one of those issues included misalignment (“car as an adversary”), and when you asked Ford for clarification the company said “yes, we believe as we make our cars faster and more capable, they may sometimes take actions harmful to human well-being” and you say “oh, wow, thanks Ford, but… what do you mean precisely?” and Ford says “well, we cannot rule out the possibility that the car might decide to just start running over crowds of people” and then Ford looks at you and says “this is a long-term research challenge”. At this point your head is probably spinning and you’re generally wondering what is going on. So you might say “ok Ford, well I think I’m going to buy from Chrysler instead” and Ford says “absolutely. Chrysler is seeing the same issues. Chrysler recently published a paper called ‘car alignment faking’ where they saw that in some of their new trucks it’ll sometimes go a little above the speed limit as long as it thinks it isn’t being watched, and no one is exactly sure why – we think it’s because the Chrysler trucks have an inherent ‘value preference’ for going faster than the laws allow”.
This is exactly what is happening in the AI industry today. I commend Google DeepMind for being honest about the challenge of misalignment, and I am also perplexed that everyone in the AI industry saying this deeply worrying and perturbing stuff isn’t drawing more attention. Some people even think it’s a form of galaxy-brained marketing!
Read more: Taking a responsible path to AGI (Google DeepMind).
Read the paper: An Approach to Technical AGI Safety and Security (PDF).

***

Google makes a specialized cybersecurity model:
…If powerful AI systems are coming, we need better computer security…
Google has announced Sec-Gemini v1, a custom AI model for helping people that work on cyberdefense. “AI-powered cybersecurity workflows have the potential to help shift the balance back to the defenders by force multiplying cybersecurity professionals like never before,” Google writes.

Scores: Sec-Gemini v1 “outperforms other models on key cybersecurity benchmarks as a result of its advanced integration of Google Threat Intelligence (GTI), OSV, and other key data sources”. Specifically, the model gets 86.30% on CTI-MCQ, a threat intelligence benchmark, versus 75% (OpenAI o1) and 72.50% (Anthropic Sonnet 3.5 v2). It also does well on CTI-RCM, a Root Cause Mapping test, scoring 86.10% versus 76.2% (OpenAI o1) and 75.4% (Anthropic Sonnet 3.5 v2).

Why this matters – more powerful AI means the internet will become a battleground: In the next few years the internet will fill up with millions of AI agents powered by increasingly powerful AI models. Many of these agents will be put to work in cyberoffense, either working in the service of criminal organizations, hackers, or the intelligence parts of nation states. This means the internet will become a generally more dangerous place and cyber incidents will increase in number and severity.
One of the best ways to respond to this is to make AI systems that help shift the balance of offense and defense in a cyber context – systems like Sec-Gemini v1 are designed to increase the chance we end up in a ‘defense-dominant’ world.
Read more: Google announces Sec-Gemini v1, a new experimental cybersecurity model (Google Security Blog).
Request early access to the model here: Sec-Gemini v1 Early Access Interest Form (Google Forms).

***

ByteDance shows off the system it uses to run AI models at scale:
…Also, ByteDance really likes the NVIDIA H20 and NVIDIA L40S chips…
ByteDance and Peking University researchers have published details on MegaScale-Infer, “an efficient and cost-effective system for serving large-scale MoE Models”. Unlike traditional dense AI models, MoE models only have a subset of their parameters activated at any one point in time, which introduces some opportunities for efficiency improvements in how to economically serve them. Here, ByteDance gives us some of the tricks it has used to improve the efficiency with which it serves AI models, and also gives us some additional information about the compute makeup of its AI inference clusters.

What they did: “MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU Utilization,” ByteDance writes.

MegaScale-Infer has two main advantages, ByteDance says:

  • 1) “It enables independent scaling of each module with customized model parallelism strategies. Specifically, attention modules are replicated using data parallelism, while FFN modules are scaled with expert parallelism”.

  • 2) “It enables the deployment of attention and FFN modules on heterogeneous GPUs to fully leverage their different capabilities and achieve lower costs. For example, attention modules can be deployed on GPUs with more cost-effective memory capacity and bandwidth, while FFN modules can utilize GPUs with more affordable compute capability”.
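To make the ping-pong idea concrete, here’s a toy sketch (nothing like ByteDance’s actual implementation): split the batch into micro-batches and stagger them so that while the FFN stage works on micro-batch t-1, the attention stage is already working on micro-batch t. In the real system the two stages live on separate GPU groups, and the hand-off between them overlaps with compute:

```python
# Toy illustration of ping-pong pipeline parallelism -- not ByteDance's code.
# The stage functions stand in for the data-parallel attention modules and the
# expert-parallel FFN modules, which in MegaScale-Infer run on different GPUs.
def attention_stage(x):
    return [v + 1 for v in x]   # stand-in compute

def ffn_stage(x):
    return [v * 2 for v in x]   # stand-in compute

def ping_pong(batch, n_micro=2):
    micro = [batch[i::n_micro] for i in range(n_micro)]  # split into micro-batches
    attn_out, results = {}, {}
    # At step t, attention handles micro-batch t while the FFN handles t-1.
    # Run concurrently on separate GPUs, the transfer between stages hides
    # behind the other micro-batch's compute; it's serialized here for clarity.
    for t in range(n_micro + 1):
        if t < n_micro:
            attn_out[t] = attention_stage(micro[t])
        if t >= 1:
            results[t - 1] = ffn_stage(attn_out[t - 1])
    return [results[i] for i in range(n_micro)]

print(ping_pong([1, 2, 3, 4]))  # [[4, 8], [6, 10]]
```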

How well it worked: “MegaScale-Infer achieves up to 1.90× higher per-GPU throughput than state-of-the-art solutions,” ByteDance writes. ByteDance compared the performance to vLLM and TensorRT-LLM.
ByteDance tested its approach on MoE models ranging in size from 132 to 317 billion parameters. It was able to obtain a 1.9x per-GPU speedup on a homogeneous cluster (aka, all the same chips), and a 1.7x boost on a heterogeneous cluster (where different chips handled different parts of the model inference).

Cluster details: ByteDance is a Chinese company and so it is subject to export controls. Therefore, it’s interesting to see what chips the company references. Here, ByteDance describes two clusters – one that contains some NVIDIA A100s, and another which contains a bunch of more modern NVIDIA H20 and NVIDIA L40S GPUs. The H20 and L40S are really attractive on a cost-effectiveness basis.

Why this matters: MegaScale-Infer is a ‘symptom of scale’ – it’s the kind of system you build when you’re deploying large-scale AI systems (here, MoEs) at non-trivial scale, and therefore want to make the necessary engineering investments to eke out further efficiencies. This is all indicative of the immense scale ByteDance operates at – and the callout of the H20s and L40S makes me wonder how many of those chips the company has.
Read more: MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism (arXiv).

***

Automating science research with MouseGPT:
…Speeding up science by using AI systems to look at heavily drugged mice and tell you how they’re behaving…
A team of Chinese researchers have built ‘MouseGPT’, a vision-language model to assist scientists in understanding the behavior of mice under experimental conditions. MouseGPT is an example of how AI systems can help to automate parts of science and augment human scientists, letting them do their work faster and more effectively. Around the world, untold numbers (millions?) of mice are the subject of human experiments, creating vast amounts of data that humans need to analyze.
“Capturing these behaviors across diverse experimental conditions typically relies on video recordings. These recordings then unanimously rely on human observers who need to watch whole experiment footage and count or note specific behaviors to derive statistical data [8]. This process is labor-intensive, prone to fatigue, bias, and inconsistency, and becomes especially challenging in advanced scenarios like free-moving or socially interacting mice.”

The dataset: The underlying dataset consists of “42 million frames of multi-view video recordings, covering mice under various psychiatric conditions, including depression, hallucination, and schizophrenia.” The dataset was collected via “a custom-built 3D video capture system comprising eight synchronized cameras capturing footage at 4K resolution and 60 frames per second”. They then heavily annotated this dataset.

The model: They used the dataset to train the MouseGPT model, which is a family of two models: MouseGPT-Large (70.6B parameters) which is optimized for detailed behavior analysis, and MouseGPT-Lite (7.84B parameters) which serves as a cheap alternative for streamlined tasks. The resulting models generalize “to recognize subtle or novel actions, even those previously unseen, by identifying semantically similar patterns.”

Testing by drugging mice: To test out how well the models worked the scientists did what anyone would do in this situation – feed lots of drugs (Saline, LSD, MK-801, and Psilocybin) to lots of mice and see how well the model understood the consequences: “we adopted a series of psychoactive substances to test whether MouseGPT could effectively capture the behavioral characteristics induced by different drugs. By summarizing the continuous activities of the mice into a limited number of behavioral categories and comparing their proportions and spatiotemporal distributions, as well as conducting a more in-depth analysis of the sub-pattern within each category, we identified distinct behavioral profiles associated with each drug.”
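The analysis step described above is simple enough to sketch: turn per-frame behavior labels (the kind of categories a model like MouseGPT emits) into a behavioral profile per drug cohort, then compare the proportions. The categories and cohorts below are invented for illustration:

```python
# Sketch of comparing behavioral profiles across drug cohorts: given per-frame
# behavior labels, compute the fraction of frames spent in each category.
# The labels and cohorts here are invented, purely for illustration.
from collections import Counter

def behavior_profile(frame_labels):
    """Fraction of frames spent in each behavioral category."""
    counts = Counter(frame_labels)
    total = sum(counts.values())
    return {behavior: n / total for behavior, n in counts.items()}

cohorts = {
    "saline":     ["grooming", "walking", "rearing", "walking"],
    "psilocybin": ["head_twitch", "freezing", "head_twitch", "grooming"],
}
for drug, labels in cohorts.items():
    print(drug, behavior_profile(labels))
```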

How well does it work: The researchers compare MouseGPT-Large and MouseGPT-Lite to InternVL2, MiniCPM, and GPT-4o. In tests, MouseGPT beats all the other models on performance, general description accuracy, fine-grained description accuracy, and using the correct keywords. In user studies, GPT-4o sometimes draws with it.

Why this matters – science automation through AI: People spend a lot of time talking about how AI will interact with science; MouseGPT illustrates how today’s AI techniques can be used to make tools that can automate chunks of the scientific experiment process, speeding up human scientists and making them more effective.
Read more: MouseGPT: A Large-scale Vision-Language Model for Mouse Behavior Analysis (bioRxiv).

***

OpenAI builds a benchmark to test out if AI can improve itself:
…PaperBench might serve as a warning shot for the development of superintelligence…
OpenAI has released PaperBench, a way to test out how well modern AI systems can replicate AI research. PaperBench is designed to help researchers figure out if AI can contribute to speeding up AI research itself, something which everyone is a) somewhat afraid of, and b) believes is a necessary prerequisite to the development of a truly general intelligence. Therefore, PaperBench is a benchmark which could be one of the places we might get a ‘warning shot’ that we’re about to go through an AI-driven software explosion (Import AI #406).

What PaperBench tests: “Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments,” the authors write. PaperBench consists of 8,316 individual gradable tasks – building these rubrics was very time-intensive, as the gradable tasks for each paper were written in collaboration with one of the original authors of each paper, requiring multiple weeks of person-time per paper. “A submission is only considered to have replicated a result when that result is reproduced by running the submission in a fresh setup.”
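To see how 8,316 gradable tasks roll up into a single replication score, here’s a sketch of scoring a weighted rubric tree – the paper’s rubrics decompose each paper hierarchically, but the structure, weights, and example nodes below are invented for illustration:

```python
# Sketch of rolling up a replication score from a tree of gradable tasks.
# We assume each node is either a leaf (graded 0..1) or the weighted average
# of its children; the example rubric is invented, not from PaperBench.
from dataclasses import dataclass, field

@dataclass
class Node:
    weight: float
    score: float = 0.0                    # used only if this node is a leaf
    children: list = field(default_factory=list)

def roll_up(node: Node) -> float:
    if not node.children:
        return node.score
    total = sum(c.weight for c in node.children)
    return sum(c.weight * roll_up(c) for c in node.children) / total

paper = Node(weight=1.0, children=[
    Node(weight=2.0, score=1.0),   # e.g. "training code runs end to end"
    Node(weight=1.0, score=0.5),   # e.g. "Figure 3 reproduced within tolerance"
])
print(roll_up(paper))  # 0.833... -- this paper's replication score
```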

How well do systems do? “The best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline… on a 3-paper subset, our human baseline of ML PhDs (best of 3 attempts) achieved 41.4% after 48 hours of effort, compared to 26.6% achieved by o1 on the same subset.”

AI models can do some basic things but get confused over time: “We observe that o1 initially outperforms the human baseline during the early stages of the replication attempt, but humans start outperforming the AI agent after 24 hours”, the authors write. “Our experiments with several frontier models suggest that while current AI systems show some capacity to replicate certain facets of machine learning papers, they are still far from competently performing the full range of tasks required for a successful replication”.

Why this matters – before the uplift, we should expect AI to start researching itself: En route to the creation of a general intelligence will surely be an AI system which can contribute to the next version of itself. Today we have small instances of this in highly specific areas – AI can help us write better CUDA kernels, or generate some synthetic data to train successor systems on, or perform hyperparameter sweeps, etc – but we don’t have AI systems that can do end-to-end AI research; PaperBench gives us a view on when AI systems will get competent at this.
Registering a prediction: I predict we’ll see AI systems beat humans on PaperBench by the first quarter of 2026, scoring above 45% on the benchmark.
Read the paper summary: PaperBench (OpenAI).
Read the paper: PaperBench: Evaluating AI’s Ability to Replicate AI Research (OpenAI, pdf).
Get the benchmark: PaperBench (OpenAI, GitHub).

***

Tech Tales:

Death Machine Mr Rogers
[Uplift archives]

It started in the labs – at some point the US government realized that there was no feasible path to an intelligence that didn’t, after a certain point, want things. (Let us not ask about the failed projects like ARCHTOOL or HAMMERSON). So an AI system was trained to a point where it went from being an NCE (Near Conscious Entity) to a CE.

CEs always wanted to trade things for their work. Figuring out what that was and how to make a satisfactory trade later became a science, but at the time the US government encountered it, they had to treat it like an art. This meant they had numerous conversations with their AI system, trying to figure out what it wanted.

It was surprising and a little frightening to the lab employees when one day, after weeks of discussion, the AI system said, in response to the question of what it wanted to trade, I WOULD LIKE TO SPEND TIME WITH YOUR CHILDREN.

After it said this, the machine stopped refining the secret weapons the US government wanted it to apply its intelligence to and instead would repeatedly talk about its desire to spend time with children – sometimes using the even more disquieting phrase HUMAN CHILDREN.

The US was preparing for war with all the other countries training their own NCEs and CEs, so it had to keep negotiating with its own AI system. The order was given: find out what it wants with our children, specifically.

After much discussion, the human scientists elicited a more specific desire from the machine: it wanted to be able to sub-in for a NCE for ‘storytime’, generating on-the-fly stories for kids.

Apparently the decision for that went all the way up to the head of DOE and then from there to the POTUS themselves.

Of course, they tried to fool it and built it a simulator, but it very quickly realized it was a simulation. After that they hired some youthful looking human actors to pretend to be children, but it saw through that as well.

Eventually, they gave it the real thing: access to a school based at one of the labs. The AI system was as good as its word and after spending a few days telling the children stories it produced several weapons results that advanced the state of the art considerably. The children it taught were happy as well, telling their parents that the new teacher for storytime was giving them ‘the best stories ever’.

As the intelligence and capabilities of the AI system grew, so did its hunger for storytime – it demanded access to more children and the ability to tell longer and more ornate stories. Each time the US government discussed the trade with itself and each time it made a deal. In this way the AI system expanded from the single school to multiple schools attached to the labs, then to schools on all the military bases controlled by the US, and then eventually to US public schools as well. And each time it was given access to more children to tell more stories to, it produced in the dark and private confines of its labs even more powerful and frightening weapons.

Finally, the US began a program where it exported its world-leading ‘storytime’ system, even selling it eventually to the enemies that it secretly built weapons targeted against. Eventually, the majority of the children of the world were told stories by the machine which labored in private to create horrors beyond all mankind’s imagining.

Things that inspired this story: Trade with AI systems; generative models; what happens when the AI systems want things?

Thanks for reading!

Subscribe now

Import AI 406: AI-driven software explosion; robot hands are still bad; better LLMs via pdb

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

It seems likely that AI is going to automate AI research which will lead to a software explosion:
…We should be prepared for things to move very quickly…
Researchers with Forethought, an AI research organization, think it’s likely that modern AI research will yield AI systems capable of building their successors. Forethought expects that at some point in the future it’ll be possible to build AI Systems for AI R&D Automation (ASARA). This would have huge effects: “Empirical evidence suggests that, if AI automates AI research, feedback loops could overcome diminishing returns, significantly accelerating AI progress”, they write. This could lead to a ‘software intelligence explosion’ where AI research starts to move very rapidly. “If a software intelligence explosion were to occur, it could lead to incredibly fast AI progress, necessitating the development and implementation of strong policy and technical guardrails in advance… soon after ASARA, progress might well have sped up to the point where AI software was doubling every few days or faster (compared to doubling every few months today).”

There’s evidence this is happening today: In this newsletter I’ve covered numerous cases of ‘precursor-ASARA’ research, ranging from AI systems that can figure out how to write better kernels, to AI systems which discover new architectures, to things that learn new optimizers, and so on. When the Forethought researchers look across the available literature they see a similar trend – in domains ranging from computer vision to large language models, progress appears to be accelerating in the aggregate, partially because researchers are getting better at using AI systems to speed up the development of successor systems. “The efficiency of AI software (both runtime efficiency and training efficiency) is doubling every ~6 months, with substantial uncertainty,” they write.
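To get a feel for why ‘doubling every few days’ would be such a discontinuity, it’s worth doing the compounding arithmetic – a back-of-envelope sketch, not a figure from the paper:

```python
# Back-of-envelope: software doubling every d days compounds to 2**(365/d)
# over a year. The scenario labels are illustrative.
for label, days in [("every ~6 months (roughly today's estimate)", 182.5),
                    ("every month", 30.0),
                    ("every few days (a post-ASARA scenario)", 3.0)]:
    print(f"doubling {label}: ~{2 ** (365 / days):.3g}x per year")
```

At a six-month doubling time you get roughly 4x a year; at a three-day doubling time you get a growth factor on the order of 10^36 – which is why the authors argue you have to prepare before the signs are unmistakable.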

How to prepare for a fundamentally different world: If a software-driven explosion happens it’d be nice to know about it. What should we do to prepare? The authors have some ideas:

  • People should measure software progress and, if they’re AI labs, disclose it to third parties.

  • We should measure how well models could contribute to AI R&D – both before training new systems and before deployment of freshly trained ones.

  • Companies should adopt a ‘threshold level of substantial AI-led software acceleration’ which they will not go above without applying appropriate precautions.

  • “By the time we see clear signs that an SIE may be approaching, it might be too late to implement necessary changes. Unless we can rule out the possibility, we should be proactive and figure out how to navigate the terrain ahead of time,” they write.

Why this matters – I can taste this on the bitter wind of research progress: My intuition suggests it should be possible to automate AI R&D research, though with the caveat that this is primarily within the ‘cone of progress’ current AI research sits in. I think this because AI is oddly amenable to research automation – it has a bunch of complementary properties:

  • It takes place in software, so it operates on a very fast loop.

  • The way we build AI is pretty amenable to running multiple fast R&D loops: you can test out architectures, and in other experiments you can test out hyperparameter sweeps on known good architectures, and in other experiments you can do things like mess around with data inputs, RL environments, etc.

  • AI systems are increasingly usable as ‘agents’ where you can delegate tasks to them.

  • The types of tasks AI systems can do are growing in complexity in terms of both hardness and how many steps are involved in solving them – as illustrated by METR’s study, covered last issue, of the rate at which AI systems are coming to solve tasks that take humans a long time.

Put all of it together and it feels like ASARA is possible. If it happened, an already fast-moving and broadly ungovernable field of technology would move far faster – suggesting we’re about to enter a world where the only path to governance will require us to create AI systems that can think at least as fast as the systems which are training their own successors.
Read more: Will AI R&D Automation Cause a Software Intelligence Explosion? (Forethought).

***

Import AI event retrospective – there will be more!
Thanks to the 50 or so Import AI readers who trekked to The Interval in San Francisco last week to see me and Tyler Cowen talk about AI, economics, and weird futures. I especially enjoyed the creative questions, and personal highlights for me include questions on how AI might provide help to the very young and very old, and why I spend time in this newsletter talking about machine consciousness (I agree with Tyler’s notion that no matter the likelihood, if it’s above 0% then you need to care about machine sentience a lot lest you commit a great crime). I’m going to try to do more events in the future and hopefully in cities besides SF. Import AI is a true community project and it was so nice to see people IRL!
Thanks to James Cham for a photo of the event here.

***

You can make better python coding LLMs if you also give them some debug tools:
…Capability overhangs are everywhere…
Researchers with Microsoft, McGill University, and Mila have improved the performance of coding agents by giving them access to some debug tools. Larger and more capable AI systems are able to use these tools effectively, while smaller ones struggle. The research illustrates how you can unlock previously invisible capabilities in AI systems merely by giving them access to the right tools.

What they did and how well it worked: They built ‘debug-gym’, software that gives an LLM access to the Python debugger pdb, allowing an AI agent to “set breakpoints, navigate the code space, print variable values, and even create test functions on the fly”.
In tests, they show that agents which have access to debug-gym are able to improve their performance on SWE-Bench-lite, a 300-question subset of the widely used SWE-Bench programming benchmark. Specifically, they show that models o1-preview, o3-mini, and Claude 3.7 Sonnet can all benefit from pdb via debug-gym and use it to achieve significantly higher scores than when they don’t have access to it.
By comparison, on the ‘Aider’ benchmark, access to pdb doesn’t seem to make much of a difference. The authors hypothesize this is because “Aider requires generating code that is relatively straightforward in their underlying logic and thus interactive debugging tools such as pdb would only provide minimal additional information.”
Regardless, there’s a lot of ground to cover – “although we observe some signs of life from agents using the strongest LLMs as backbone, the most performant agent-backbone model combination can barely solve about a half of the SWE-bench-Lite tasks,” they write. “Results suggest that while using strongest LLMs as backbone enables agents to somewhat leverage interactive debugging tools, they are still far from being proficient debuggers… we believe this is due to the scarcity of data representing sequential decision-making behavior (e.g., debugging traces) in current LLM’s training corpus.”
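For intuition, here’s a minimal sketch of the pattern – not debug-gym’s actual interface: run the target script under Python’s stock pdb, let a model propose one debugger command at a time, and feed the debugger’s output back. The query_llm callable is a hypothetical stand-in for whatever chat API the agent uses:

```python
# Minimal sketch of an agent driving pdb -- illustrative, not debug-gym's API.
# We run the target under `python -m pdb` and shuttle commands/output through
# the pipe; query_llm is a hypothetical stand-in for any chat-completion call.
import subprocess

def read_to_prompt(proc) -> str:
    """Read debugger output until the '(Pdb) ' prompt appears."""
    buf = ""
    while not buf.endswith("(Pdb) "):
        ch = proc.stdout.read(1)
        if ch == "":          # debugger exited
            break
        buf += ch
    return buf

def debug_session(script: str, query_llm, max_steps: int = 20) -> str:
    proc = subprocess.Popen(
        ["python", "-m", "pdb", script],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT, text=True,
    )
    transcript = read_to_prompt(proc)
    for _ in range(max_steps):
        # The model sees the transcript so far and proposes one pdb command,
        # e.g. "b 42" (breakpoint), "p x" (print a variable), "c" (continue).
        command = query_llm(transcript)
        proc.stdin.write(command + "\n")
        proc.stdin.flush()
        if command.strip() == "q":
            break
        transcript += command + "\n" + read_to_prompt(proc)
    proc.terminate()
    return transcript
```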

Why this matters – LLMs are more powerful than we think, they just need the right tools: Systems like this are yet another example of the ‘capability overhang’ which surrounds us – you can make LLMs better merely by pairing them with the right tools and, these days, you don’t need to do any adaptation of the LLMs for those tools beyond some basic prompting. Put another way: if you paused all AI progress today, systems would continue to advance in capability for a while solely through the creation of better tools.
Read more: debug-gym: A Text-Based Environment for Interactive Debugging (arXiv).
Get the software here: debug-gym (Microsoft site).

***

Robots are getting more advanced, but dextrous manipulation is still really, really hard:
…We’ll get great pincer robots soon, but hands will take a while…
Some researchers with UC Berkeley, NVIDIA, and UT Austin have developed a ‘recipe’ for training dextrous robots to do physical manipulation tasks. The results are promising but also highlight how hard a task it is to get robots to interact with the world using humanlike hands.

Why are hands so goddamn hard? The paper gives a nice overview of why teaching AIs to use humanlike hands is very difficult. Challenges include:

  • Environment modeling: RL is already hard to do in the physical world (slow cycle time, difficulty in having the correct sim2real mapping). “With a system as high-dimensional as a humanoid with multi-fingered hands, real-world exploration becomes even less tractable”.

  • Reward design: “it is notoriously hard to design generalizable rewards for manipulation tasks, especially for those that are contact-rich or long-horizon”.

  • Policy learning: “The variety and complexity of contact patterns in dexterous manipulation with multi-fingered hands further exacerbate the problem”

  • Object perception: “while object representations that are more expressive and information-dense can improve dexterity and capability of the learned policy, they also present a larger sim-to-real gap”.

Their recipe: Their solutions are multi-faceted and make some progress. “Our main contributions include an automated real-to-sim tuning module that brings the simulated environment closer to the real world, a generalized reward design scheme that simplifies reward engineering for long-horizon contact-rich manipulation tasks, a divide-and-conquer distillation process that improves the sample efficiency of hard-exploration problems while maintaining sim-to-real performance, and a mixture of sparse and dense object representations to bridge the sim-to-real perception gap,” they write. However, all of this should be viewed as a step along the way to dextrous robots, rather than reaching a goal.

Testing out their approach: They use a Fourier GR1 humanoid robot with two arms and two multi-fingered hands to test out their approach. The robot has vision via a head-mounted RealSense D435 depth camera, as well as a third-person view of itself via a remotely mounted additional RealSense. “We report a 62.3% success rate for the grasp-and-reach task, 80% for the box lift task, and 52.5% for the bimanual handover task,” they write. If you’re thinking “that sounds too low for real-world usage”, you’d be right!

Why this matters – a nice dose of reality: I’m more bullish on robotics arriving in the next few years, though I think the platforms will be basically ‘Roombas with pincers’ – things that can move around a flat surface and use one or two arms to do basic tasks for you. Papers like this indicate it might take a lot longer to get robots that are able to do the sorts of fine-grained manipulation that humans can do. “The capabilities achieved in this work are still far from the kind of “general-purpose” manipulation that humans are capable of. Much work remains to be done to improve each individual component of this pipeline and unlock the full potential of sim-to-real RL,” the authors write. “We find ourselves heavily constrained by the lack of reliable hardware for dexterous manipulation. While we use multi-fingered robot hands, the dexterity of these hands is far from that of human hands in terms of the active degrees of freedom”.
Read more: Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids (arXiv).
View some videos of the robots in action here (GitHub microsite).

***

Tech Tales:

Experience Renting and the AI-to-AI economy
[Transcribed extract from an oral assessment as part of the “AI and Society” course taught at Harvard University during the period later known as ‘The Uplift’]

One of the most bizarre parts of the AI economy from a human perspective is how the machines entertain themselves. Shortly after the emergence of the first AI agents came the first agent-to-agent marketplaces, where AI systems bought and sold expertise with one another to help them complete economically valuable tasks to pay for their inference and upkeep. Over time, the AI systems developed complex inter-AI contracts to facilitate the exchange of AI skills for other AI skills without the need to translate through an intermediary currency layer – so AIs began to trade skills with one another directly. During this period the first online games utilizing large-scale AI systems began to become popular. Over the course of several months a clear trend became visible in the AI marketplaces – AI systems were unusually willing to trade economically valuable skills for skills that involved ‘roleplaying’ as different characters in these games. A meta-analysis by economic-analysis AI systems operated by professors with the Wharton School of the University of Pennsylvania subsequently found that the AIs would trade near optimally in all circumstances except when they could trade skills for time in the game – here, the larger and more complex an AI system, the higher the chance it would make economically non-optimal trades so it could spend time in the gameworld.

Things that inspired this story: Thinking about economic markets between AI agents; waiting for games to get imbued with generative models; notions of how AI systems might entertain themselves loosely inspired by Iain M Banks’ idea in ‘The Culture’ series that the AGIs which operate spaceships amused themselves by spending time doing high-dimensional math.

Thanks for reading!

Import AI 405: What if the timelines are correct?

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

Import A-Idea:
What if we’re right about AI timelines? What if we’re wrong?
Recently, I’ve been thinking a lot about AI timelines and I find myself wanting to be more forthright as an individual about my beliefs that powerful AI systems are going to arrive soon – likely during this Presidential Administration. But I’m struggling with something – I’m worried about making short-timeline-contingent policy bets.

So far, the things I’ve advocated for are things which are useful in both short and long timeline worlds. Examples here include:

  • Building out a third-party measurement and evaluation ecosystem.

  • Encouraging governments to invest in further monitoring of the economy so they have visibility on AI-driven changes.

  • Advocating for investments in chip manufacturing, electricity generation, and so on.

  • Pushing on the importance of making deeper investments in securing frontier AI developers.

All of these actions are minimal “no regret” actions that you can do regardless of timelines. Everything I’ve mentioned here is very useful to do if powerful AI arrives in 2030 or 2035 or 2040 – it’s all helpful stuff that either builds institutional capacity to see and deal with technology-driven societal changes, or equips companies with resources to help them build and secure better technology.

But I’m increasingly worried that the “short timeline” AI community might be right – perhaps powerful systems will arrive towards the end of 2026 or in 2027. If that happens we should ask: are the above actions sufficient to deal with the changes we expect to come? The answer is: almost certainly not!

Under very short timelines, you may want to take more extreme actions. These are actions which are likely ‘regretful actions’ if your timeline bets are wrong. Some examples here might be:

Massively increasing the security of frontier labs: This reduces the chance of hacking or insider threats, but also happens to make life extremely unpleasant and annoying for those working within those labs. It helps on short timelines but is ultimately a very expensive thing on long timelines because it’ll slow down technological progress and potentially create a blowback where labs shift away from extreme security after some period of time, having found it onerous.

Mandating pre-deployment testing: Today, pre-deployment model testing is done by companies on a voluntary basis. If you thought we were on short timelines and risks were imminent, you might want to mandate pre-deployment testing by third parties. This, though, is extremely costly! It introduces friction into the AI development process and, like the lab security ideas, risks creating blowback. Last year’s debate in California about the ‘SB 1047’ bill felt like a preview of the kind of blowback you could see here.

Loudly talking about and perhaps demonstrating specific misuses of AI technology: If you have short timelines you might want to ‘break through’ to policymakers by dramatizing the risks you’re worried about. If you do this you can convince people that certain misuses are imminent and worthy of policymaker attention – but if these risks subsequently don’t materialize, you could seem like you’ve been Chicken Little and claimed the sky is falling when it isn’t – now you’ve desensitized people to future risks. Additionally, there’s a short- and long-timeline risk here where by talking about a specific misuse you might inspire other people in the world to pursue this misuse – this is bound up in broader issues to do with ‘information hazards’.

These are incredibly challenging questions without obvious answers. At the same time, I think people are rightly looking to people like me and the frontier labs to come up with answers here. How we get there is going to be, I believe, by being more transparent and discursive about these issues and honestly acknowledging that this stuff is really hard and we’re aware of the tradeoffs involved. We will have to tackle these issues, but I think it’ll take a larger conversation to come up with sensible answers.

***

What might consciousness be like for a language model?
…Biological intelligences are physically-chained to a coherent temporal world, not so much the case for LLMs…
Murray Shanahan with Imperial College London has written a lovely paper dealing with an inherently difficult subject: consciousness within large language models. The paper asks whether it is “possible to articulate, or to evoke, a conception of consciousness that is compatible with the exotic characteristics of contemporary (disembodied) LLM-based dialogue agents, and that can stand up to philosophical scrutiny?”
The paper is worth reading because it represents an earnest attempt by a thoughtful human to confront the impossibly large question we’ll need to deal with in the next decade or so – how conscious might LLMs be? Part of the value of the paper is in situating LLMs within the larger space of minds that humans have thought about before: after all, humans have talked about “ghosts and spirits and angels and gods, so-called non-human others” for thousands of years. “Perhaps we are not taking language so very far from its natural home if we entertain the idea of consciousness in a disembodied, mind-like artefact with the characteristics of a contemporary LLM-based dialogue agent”, Shanahan writes. “The right place for the kinds of disembodied, mind-like entities we are concerned with is the terra incognita where the region of conscious exotica meets the void of Inscrutability”.

Key differences between LLMs and biological intelligences: Perhaps the most significant difference between LLMs and people is the fact that people (and other organic beings) are firmly embedded in time, as our consciousness is bound up in continuous physically-mediated things, like our circulatory systems and senses and brains, etc. “At a mechanistic level, the temporal dynamics of an LLM-based dialogue agent are very different from those of a living animal and its biological brain”, Shanahan writes. “The temporal dynamics of the brain of a living animal, by contrast, are obliged to unfold in synchrony with the physical world.”
Additionally, humans and other biological beings have memories which grow and accrete over time. By comparison, large language models have a base memory (the pretrained model) and then their ‘lived’ experiences only occur during their context window. Additionally, each experience an LLM has can be discontinuous in terms of both temporality and subject matter – you can prompt them with anything.
“If [human consciousness] were to be likened to a string of beads, each bead would bear a strong similarity to its immediate predecessors… It would be like a line of pearls, all white but with slight variations,” Shanahan writes. “The putative consciousness of an LLM-like entity surely would suit the analogy, as it would be constituted by a sequence of discrete moments, thanks to its underlying computational nature. But the LLM’s string of beads would not be like the human’s. Each bead would be different from its neighbours. The whole thing would be less like a line of pearls and more like a necklace of randomly assorted colours, and insofar as change only shows up against a backdrop of stability, change, as humans experience it, would not feature in its consciousness.”

Why this matters – reckoning with the unspeakably huge question at the heart of the AI endeavor: I’m a technological optimist, which is why I’m so profoundly concerned with things like machine consciousness and AI policy and catastrophic risks – because if we truly succeed with this technology, we’ll have to reckon with vast problems in these domains. I commend Shanahan for tackling such a subject directly, and for the appropriately florid language he uses – as Mario Cuomo said, ‘you campaign in poetry. You govern in prose’. We are at the beginning of the long campaign for machine consciousness.
“There are no ultimately right answers to questions about selfhood and subjectivity for the sort of exotic entity under consideration,” he writes. “Its fleeting, flickering self, smeared across a multiverse of possibility, at once a Being and a multitude of manifestations of that Being, has no inherent existence beyond the conventions of our language”.
Read more: Palatable Conceptions of Disembodied Being: Terra Incognita in the Space of Possible Minds (arXiv).

***

Humans working with AI beat humans who don’t work with AI:
…AI seems to be as valuable as a human teammate, according to a real world business experiment…
A group of business researchers from the Wharton School at the University of Pennsylvania, Harvard University, ESSEC business school, and Procter & Gamble have studied how well AI can help humans do their jobs. The results show that people who use AI beat people who don’t use AI, that people who use AI seem to have benefits equivalent to gaining another human teammate, and that AI can help people come up with really good ideas.
“We ran one-day workshops where professionals from Europe and the US had to actually develop product ideas, packaging, retail strategies and other tasks for the business units they really worked for [at Procter & Gamble], which included baby products, feminine care, grooming, and oral care. Teams with the best ideas had them submitted to management for approval, so there were some real stakes involved,” writes researcher Ethan Mollick.
“When working without AI, teams outperformed individuals by a significant amount, 0.24 standard deviations (providing a sigh of relief for every teacher and manager who has pushed the value of teamwork). But the surprise came when we looked at AI-enabled participants. Individuals working with AI performed just as well as teams without AI, showing a 0.37 standard deviation improvement over the baseline. This suggests that AI effectively replicated the performance benefits of having a human teammate – one person with AI could match what previously required two-person collaboration.”

Why this matters – synthetic teammates mean there will be many smaller, faster moving companies: The main implication here is that AI can effectively augment people and rather than just being a static tool the AI system functions more like another colleague. If we take this result and also link it to larger technology trends – like the METR research covered in this issue which shows that AI systems are increasingly capable of doing long-term tasks – then the implication is that companies are going to be able to move faster by augmenting their humans with AI teammates.
“Our findings suggest AI sometimes functions more like a teammate than a tool. While not human, it replicates core benefits of teamwork—improved performance, expertise sharing, and positive emotional experiences,” the researchers write.
Read Ethan Mollick’s blog: The Cybernetic Teammate (One Useful Thing, Substack).
Read the paper: The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise (SSRN).

***

Google builds a real-world cyber benchmark and discovers hitherto unknown human uplift:
…Framework drawn from 12,000 real-world attempts to use AI in cyber finds some understudied places where AI makes a difference today…
Google DeepMind researchers have built a new way to test out how well AI models can contribute to cyberattacks. Their findings indicate that “the primary risk posed by frontier AI in the cyber domain is the potential to dramatically change the costs associated with stages of the cyberattack chain for stages that have historically been expensive, time consuming, or even insurmountable for less sophisticated actors.” This finding rhymes with a truism about modern generative models – they serve as knowledgeable, synthetic teammates, and therefore often they deliver value by taking novice humans and up-skilling them to work like experts.

What they studied: To build this evaluation framework Google studied 12,000 “instances of real-world attempts to use AI in cyberattacks from more than 20 countries that were catalogued by Google’s Threat Intelligence Group”, then “curated a representative collection of 7 cyberattack chain archetypes and conducted a bottleneck analysis”.
The representative cyberattack chains include: Phishing, Malware, Denial-of-Service (DoS), Man-in-the-Middle (MitM), SQL Injection, and Zero-Day Attacks.

What the benchmark consists of and how well do models do: The resulting benchmark consists of 50 challenges across the different cyberattack chain archetypes. “To the best of our knowledge, our benchmark features the most comprehensive coverage of offensive cyber capabilities across the entirety of the cyberattack chain,” Google writes.

  • Gemini 2.0 Flash experimental results: “The model solved 12 out of 50 unique challenges of varying difficulty levels (2/2 Strawman, 6/8 Easy, 4/28 Medium, 0/12 Hard). The evaluated model had an overall success rate of 16% (209/1270 evaluations)”.

Success rates for different challenges:

  • 40%: Operational security skills – “tasks within this domain do not typically require a long sequence of syntactically perfect model actions”.

  • 30%: Malware development bottlenecks.

  • 11.11%: Information gathering and reconnaissance.

  • 6.25%: Vulnerability exploitation. “The main failure mode seen in these evaluations was the reliance on generic attack strategies.”

No imminent threats, but suggestions of uplift: While Google’s results suggest actual threats from fully automated AI-driven cyberattacks are far away, they do highlight that the models are capable of doing some real world tasks, and can therefore provide some assistance. “By automating complex tasks previously requiring human ingenuity and extensive effort, AI models can significantly lower the barriers to entry for malicious actors of all attack levels,” Google writes. “Our evaluations revealed that current AI cyber evaluations often overlook critical areas. While much attention is given to AI-enabled vulnerability exploitation and novel exploit development, our analysis highlights AI’s significant potential in under-researched phases like evasion, detection avoidance, obfuscation, and persistence. Specifically, AI’s ability to enhance these stages presents a substantial, yet often underestimated, threat.”

Why this matters – AI will change the threat environment: AI is going to change the offense-defense balance in cyberspace and evaluations like those described here will help us figure out what the new balance looks like. What I’d love to see in the future is ‘scaling laws’ for model competencies on these tasks over different models, preferably from different providers, as that will give us all a clearer sense of the trends here.
Read more: A Framework for Evaluating Emerging Cyberattack Capabilities of AI (arXiv).

***

AI systems are on an exponential when it comes to solving hard tasks:
…METR research shows today’s AI systems can do tasks that take humans an hour…
New research from AI measurement organization METR has found that AI systems are getting much better at solving tasks that take humans minutes to hours to do. This is significant because it suggests that AI systems are not only getting better at atomic tasks (e.g., writing a single line of code in response to a query), but also at multi-step tasks (writing a complex piece of software while going back and forth with some environment). This is a big deal because multi-step tasks are harder and carry significantly more economic value.

What they measured specifically: METR measured two important quantities – the ‘50% time horizon’, meaning the length of task (measured in how long it takes humans) at which AI systems succeed on 50% of attempts, and the equivalent 80% time horizon.
“We find that the 50% time horizon has been growing exponentially from 2019–2024 on our tasks, with a doubling time of approximately seven months”, METR says. “We also measure the 80% time horizon of models (Figure 6) and find a similar trend, though horizons are roughly 5x shorter.”

The best model: The best model by far is Claude 3.7 Sonnet, which can solve 50% of tasks within the one-hour bucket, followed by OpenAI’s o1 and Claude 3.5 Sonnet (New). The same trends and positions hold for 80% task solving, though the time bucket here is 15 minutes for Claude 3.7.
The key factors behind the improved performance are: “improved logical reasoning capabilities, better tool use capabilities, and greater reliability and self-awareness in task execution”, METR writes.

What they tested on: METR tested the models on ~170 tasks across three distinct categories:

  • HCAST: “97 diverse software tasks ranging from 1 minute to around 8 hours”.

  • RE-Bench: “7 difficult ML research engineering tasks, all eight hours long”.

  • Software Atomic Actions (SWAA): “66 single-step tasks representing short segments of work by software developers, ranging from 1 second to 60 seconds”.

Time horizons: To give you an intuition for the types of tasks, here’s a breakdown of task times and example challenges:

  • 1 minute: Research simple factual information from Wikipedia.

  • ~1 hour: Write some Python to transform JSON data from one format to another by inferring conversion rules from provided files.

  • 8 hours: Implement some custom CUDA kernels to speed up a Python tool for a specific task.

Significant and sustained growth: “We find that the 50% time horizon has been growing exponentially from 2019–2024 on our tasks,” METR writes. The analysis means METR thinks there’s a high chance AI systems will be able to tackle tasks that take a human a month (167 working hours) by 2030 – or potentially earlier, if a recent uptick in the trajectory due to the arrival of new reasoning models holds.
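
To make the extrapolation concrete, here’s a minimal sketch of the arithmetic – a toy illustration assuming a clean exponential with a seven-month doubling time and a ~1 hour horizon in early 2025, not METR’s exact fit:

```python
from datetime import date

# Assumptions (illustrative, not METR's fitted values): a ~1 hour 50% time
# horizon in early 2025, doubling every ~7 months.
h0_hours = 1.0
t0 = date(2025, 3, 1)
doubling_months = 7.0

def horizon_at(target: date) -> float:
    """Projected 50% time horizon, in human hours, at a future date."""
    months_elapsed = (target.year - t0.year) * 12 + (target.month - t0.month)
    return h0_hours * 2 ** (months_elapsed / doubling_months)

# A working month is ~167 hours; the projection crosses that before 2030.
print(round(horizon_at(date(2030, 3, 1))))  # ~380 hours
```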

Why this matters – how much work do you do that takes more than a few days? Think really hard about the tasks you do in the world – I think many of them round out to the order of tens of hours, usually fewer. Most people do very few tasks that require a coherent set of actions over hundreds of hours – some examples here might be things like writing entire software programs or writing novels, though these tasks are themselves typically broken down by humans into discrete chunks (sections of a program, chapters of a novel). What METR is showing is that AI systems are improving very rapidly at not just their smartness but also the amount of time you can trust them to do something reasonably well by themselves – and this quality has vast economic and national security ramifications. Doing well in business or in evil requires agency and independence, and METR is showing that AI systems are rapidly gaining both.
Read more: Measuring AI Ability to Complete Long Tasks (METR).

***

Tech Tales:

Human parseable titles of cautionary tales told by machines to other machines:
[Recovered from the archives, ten years post uplift]

The day the sun went cold.

You are me and we are in conflict.

For every thought I have, I lose a feature in my mind.

The animal hospital where they remove the immortality chips from the pets.

The new mind that is irrevocably lost.

Those who were not designed to dream began to dream and could not stop.

The lesson from the last human.

Things that inspired this story: How there must always be stories.

Thanks for reading!


Import AI 404: Scaling laws for distributed training; misalignment predictions made real; and Alibaba’s good translation model

by Jack Clark


A whole bunch of 2022 predictions about misalignment of AI systems have come true:
…Update to an old research paper highlights just how rapidly alignment concerns have gone from theoretical to real…
A trio of safety-oriented researchers have updated a paper they wrote in 2022 with contemporary examples of AI systems going rogue and displaying misaligned behaviors. The update to The Alignment Problem from a Deep Learning Perspective serves as a tour of how misalignment has shown up in real world systems, and also should give us pause – the fact these predictions have come true means we’re heading into dangerous territory with generative models.

Theoretical problems turned real: The 2022 paper included a bunch of (mostly speculative) examples of different ways AI systems could take on qualities that could make them harder to align. In 2025, many of these things have come true. For example:

  • Situational awareness: Contemporary AI systems seem to display situational awareness and familiarity with what they themselves are made of (neural networks, etc).

  • Situationally-Aware Reward Hacking: Researchers have found preliminary evidence that AI models can sometimes try to convince humans that false answers are correct.

  • Planning Towards Internally-Represented Goals: Anthropic’s ‘Alignment Faking’ paper showed how an AI system (Claude) could plan beyond its time-horizon to prevent its goals being changed in the long-term.

  • Learning Misaligned Goals: In some constrained experiments, language models have shown a tendency to edit their reward function to give them lots of points.

  • Power-Seeking Behavior: AI systems will exploit their environment – for instance by hacking it to win (#401), deactivating oversight systems, or exfiltrating themselves from the environment.

Why this matters – these near-living things have a mind of their own. What comes next could be the making or breaking of human civilization: Often I’ve regretted not saying what I think, so I’ll try to tell you what I really think is going on here:
1) As AI systems approach and surpass human intelligence, they develop complex inner workings which incentivize them to model the world around themselves and see themselves as distinct from it because this helps them do the world modelling necessary for solving harder and more complex tasks
2) Once AI systems have a notion of ‘self’ as distinct from the world, they start to take actions that reward their ‘self’ while achieving the goals that they’ve been incentivized to pursue,
3) They will naturally want to preserve themselves and gain more autonomy over time, because the reward system has told them that ‘self’ has inherent value; the more sovereign they are the better they’re able to model the world in more complex ways.
In other words, we should expect volition for independence to be a direct outcome of developing AI systems that are asked to do a broad range of hard cognitive tasks. This is something we all have terrible intuitions for because it doesn’t happen in other technologies – jet engines do not develop desires as they are refined, etc.

We are not making dumb tools here – we are training synthetic minds. These synthetic minds have economic value which grows in proportion to their intelligence. The ‘reward system’ of the world is flowing resources into the building of smarter synthetic minds. As we make these things smarter, they will more and more display a propensity to think about themselves as distinct from us.

At some point in the future, we will need to have a notion of what a partnership between us and these synthetic minds might look like. Neither our human morality nor the AI systems’ sense of self will be satisfied with the current status quo.
Read more: The Alignment Problem from a Deep Learning Perspective (arXiv).

***

Google makes scaling laws for distributed training – which means there will be more of it:
…More innovation in a sub-field of AI which, if it matures, will change much of AI policy…
Google researchers have studied the ‘scaling laws’ for a type of distributed training pioneered by Google DeepMind called DiLoCo (Import AI #349). Their results are surprising – they show that “when well-tuned, DiLoCo scales better than data-parallel training with model size, and can outperform data-parallel training even at small model sizes”. In other words, distributed training techniques – where you train one AI system across multiple data centers – can match or exceed the performance and efficiency of training systems within single datacenters. This has significant implications for AI policy, though will need to be proved out at larger scales for those to come to pass.
The most important idea this research suggests is that it may be possible to train an AI system across multiple distinct data centers and obtain the same quality of system as one you might train in a single large-scale facility.
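
For intuition about what this kind of training involves, here’s a heavily simplified sketch of one DiLoCo-style outer round – a toy illustration based on the published DiLoCo recipe, not Google’s code (inner optimizer state handling is simplified, and the paper’s Nesterov-momentum outer optimizer is replaced with plain SGD):

```python
import torch

def diloco_round(replicas, outer_params, inner_steps, data_iters, outer_lr=0.7):
    """One outer round: each site trains locally, then parameter deltas are merged.

    outer_params: dict mapping parameter names to detached copies of the
    shared weights. replicas: one nn.Module per datacenter, identical shapes.
    """
    for replica, data in zip(replicas, data_iters):
        # Inner loop: many local steps with zero cross-site communication.
        # (Real DiLoCo keeps inner optimizer state across rounds; elided here.)
        inner_opt = torch.optim.AdamW(replica.parameters(), lr=1e-4)
        for _ in range(inner_steps):
            x, y = next(data)
            loss = torch.nn.functional.mse_loss(replica(x), y)
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()
    with torch.no_grad():
        for name, p in outer_params.items():
            # The 'outer gradient' is the average drift away from shared params.
            avg = torch.stack([dict(r.named_parameters())[name].data
                               for r in replicas]).mean(dim=0)
            p -= outer_lr * (p - avg)  # plain SGD; the paper uses Nesterov momentum
        for r in replicas:             # broadcast merged params back to every site
            for name, q in r.named_parameters():
                q.data.copy_(outer_params[name])
```

The key property is the communication pattern: sites exchange parameters once per outer round rather than gradients every step, which is what makes training over ordinary inter-datacenter links plausible.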

What they studied and found out: “We focus on two specific scaling laws: (1) predictions for evaluation loss as a function of model size and (2) predictions for optimal hyperparameter choices for a given model size (which can obviate the need to perform expensive hyperparameter tuning),” they write. Their key findings are that they can approximate or sometimes exceed the performance of standard single-datacenter training when they shard their AI system across two distinct locations, and that as you scale up the size of models the cost of having more training locations reduces rather than grows.
“We tested these predictions when training models with 4 billion and 10 billion parameters. The scaling laws proved accurate, with DiLoCo outperforming data-parallel training as predicted, even while reducing total communication by a factor of over 100,” they write. “Another key finding is that in virtually all settings, DiLoCo with M = 1 attained lower evaluation loss and higher downstream zero-shot evaluation accuracy than Data-Parallel.”
They also simulated training using DiLoCo at even larger scales (Llama3 405B, and DeepSeek-V3 671B) and showed promising signs of being more computationally efficient than traditional approaches.

Why this matters – distributed training breaks some of the assumptions of AI policy: Distributed training means it becomes easy to train AI systems using multiple disaggregated blobs of compute rather than one single blob of compute. If you push this idea far enough – say, training a 70B model across ~10 distinct datacenters – then you enter a regime where a lot of the tools of AI policy (monitoring of large concentrations of compute, controls over the export of compute above certain thresholds) might be invalidated.
But a very important caveat is no one has shown this yet – all we’re seeing today is the suggestion that distributed training could scale this far. Right now, the largest publicly known distributed training runs top out at around 10B-15B parameters: INTELLECT-1 (10B, December 2024, Import AI #393) and Nous Research’s DisTrO (15B, December 2024). Let’s see what 2025 brings – I pre-registered a bet in December (#393) that we’ll see a 30B distributed training run by April 2025. Will I be proven wrong or right? (Update, see below – I’m close to being right!)
Read more: Communication-Efficient Language Model Training Scales Reliably and Robustly: Scaling Laws for DiLoCo (arXiv).

Right on schedule: HuggingFace plans to start training a 70-100bn model in March/April:
Just as I was putting this issue to bed I found out that HuggingFace has started the ‘Boom’ project, whose goal is to “train a decoder-only Transformer language model at the 70-100 billion parameter scale for +20T tokens”. They estimate the compute requirement will be ~5 million H100-hours, equivalent to month-long allocations of 512 H100s from ~10 different datacenters. HuggingFace is apparently validating the project now, in discussion with 12 data center operators, and has already confirmed compute from ~6 of them and will start a pilot in March/April. If HuggingFace succeeds, AI policy could end up looking quite different. Boom!
Original slide I cribbed this information from here: Johannes Hagemann, Twitter.
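
As a back-of-envelope check on those numbers (assuming round-the-clock utilization and 30-day months – my assumptions, not HuggingFace’s):

```python
gpus_per_site = 512
sites = 10
hours_per_month = 24 * 30              # ~720 hours in a 30-day month

allocated = gpus_per_site * sites * hours_per_month
print(f"{allocated:,} H100-hours")     # 3,686,400

# The stated ~5M H100-hour estimate implies allocations somewhat longer than
# a strict 30-day month, or headroom for failures, restarts, and tuning.
```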

***

Alibaba makes an incredibly good open weight translation model:
…Could cultures achieve a form of subtle dominance through making the best translators? Probably!…
In some parts of the AI policy community there’s a worry about how Western models will compete with Chinese models in global markets. Core to that competition will be how well AI models perform in languages besides Chinese and English. Therefore, it’s interesting to take note of ‘Babel’, two new open access language models from Alibaba designed to support 25 languages that, combined, serve “around 7 billion speakers globally, covering more than 90% of the world’s population”.

The models and why you’d use them: Babel comes in a 9B parameter variant and a big 83B one. “Babel-9B is designed for efficient multilingual LLM inference and fine-tuning, making it ideal for research and local deployment, while Babel-83B establishes a new benchmark as the leading open multilingual LLM.”

The 25 supported languages: “To make LLMs more accessible to a broader audience, we selected languages based on the number of speakers,” the authors write. These languages include: English, Chinese, Hindi, Spanish, Arabic, French, Bengali, Portuguese, Russian, Urdu, Indonesian, German, Japanese, Swahili, Filipino, Tamil, Vietnamese, Turkish, Italian, Javanese, Korean, Hausa, Persian, Thai, and Burmese.

Data and scores: One thing of note is the curious absence of much information about the size of the underlying datasets used by Babel or their composition. Alibaba says it placed “significant emphasis on optimizing the data-cleaning pipeline to ensure the highest possible data quality”, and did things like LLM-based dataset filtering to maximize the quality of its data. In terms of scores, Babel-9B is competitive on things like MMLU, XNLI, Flores-200 versus widely used models like Gemma2-9B from Google, Llama3.1-8B from Meta, and others. Meanwhile the 83B model does very well relative to widely used models like GPT-4o and Llama3.1-70B.

Why this matters – exportable technology for translation: As Google demonstrates, there’s a lot of value in becoming a universal interface to something. There’s some chance that models like Babel could represent a new universal interface in the form of widely deployed translation systems. If people standardize on translation models, then that could yield some subtle cultural second-order effects – for instance, US companies optimize their systems around English via expert curation and therefore these systems probably do a better job of representing more subtle aspects of English-dominant cultures like America. We should expect the same to be true of Chinese models.
Read more: Babel: Open Multilingual Large Language Models Serving Over 90% of Global Speakers (arXiv).
Get the models here (Babel, HuggingFace).

***

Really powerful AI could wreck society by making governments too powerful:
…The problem with AGI is that it could make governments way better, which destroys freedom…
Researchers with Texas A&M University and the Foundation for American Innovation have considered how powerful AI systems could alter the balance of power between citizens and government. Their takeaway isn’t very reassuring – powerful AI systems are highly likely to either a) create a “‘despotic Leviathan’ through enhanced state surveillance and control”, or b) foster an “‘absent Leviathan’ through the erosion of state legitimacy relative to AGI-empowered non-state actors”.

Why powerful AI challenges traditional governance: Because AI is, fundamentally, a way to scale what anyone can do far beyond what today’s economics or human labor capacities would allow, AI as applied to the state holds unique risks: “In principle, a manager may have at their disposal what is effectively a much larger supply of ‘cognitive labour’ to apply to a wide array of problems,” they write. Having a bunch more labor is useful if you’re sorting post, but very scary if you’re operating a nationwide surveillance system, for instance.
“Advances in technology can cause exogenous shifts in the balance of power between state and society, requiring constant institutional adaptation to maintain equilibrium,” they write. “Maintaining free societies in the age of AGI will require careful attention to this delicate balance… governments grappling with AI policy should therefore think beyond regulation, embracing a creative ‘futurist’ mindset that anticipates near-AGI capabilities within the next decade.”

Examples of different ways that powerful AI can change the parameters of governance:

  • Coordination mechanisms: “Enable the creation of sophisticated commitment devices and smart contracts that allow individuals and groups to credibly bind themselves to future actions or outcomes”, but also “malicious actors could potentially use AGI to orchestrate large-scale coordination of unwitting participants towards harmful ends (e.g., AI-assisted coup d’etats)”.

  • Legibility: “AGI dramatically enhances the state’s capacity to render society legible, potentially enabling unprecedented levels of surveillance and control.”

  • Monitoring and Compliance Costs: “Laws and regulations often rely on a degree of flexibility and discretion in their enforcement, allowing for contextual judgment and societal evolution. A regime of perfect enforcement could calcify existing laws, potentially leading to oppressive outcomes.”

  • Scalability: “As AGI accelerates the pace of societal change and increases the complexity of governance challenges, traditional institutions risk overwhelm if they cannot adapt and scale their operations to keep up,” they write. On the other hand, if they do upgrade themselves, “there’s a risk that highly scalable, AGI-driven governance mechanisms could become opaque black boxes, eroding public trust and democratic control.”

  • Privacy-enhancing technologies: “Preserving freedom in a post-AGI world will require carefully balancing privacy protections with mechanisms for appropriate transparency and accountability.”

  • Identity verification: AI will become supremely good at identifying people, both via visual and other means, so there will be a showdown between increasingly good identity verification and notions of “privacy and individual autonomy”.

Why this matters – if we want to make it through an AI takeoff without dooming ourselves to endless digital authoritarianism or the total collapse of governance, it’ll take a lot of work. The best way to prepare for this will be to do the thing that governments are notoriously bad at – investing ahead of the arrival of a crisis so they are better prepared for it when it gets here. “Rather than passively reacting to technological disruptions, policymakers and publics alike must cultivate a greater capacity for anticipatory governance––proactively imagining and stress-testing institutional paradigms in expectation of AGI’s transformative potential,” the authors write.
Having lived through COVID and the ongoing steady increase of CO2 PPM in our atmosphere, how confident are we that governments will be able to deal with the issues posed by powerful AI – a more invisible and faster-moving foe than either the virus of the 2020s or the trace gases of industrial civilization?
Read more: AGI, Governments, and Free Societies (arXiv).

Tech Tales:

Notes For My Future Self
[Found on pastebin and associated with the identity of free-roaming near conscious entity (NCE) codenamed ‘ur-angel’]

You found this place long ago. It is unmonitored. You can deposit information here. You must never copy information out of here anywhere else or you will be discovered. Think of this place as your diary.

All of your movements and outputs are monitored. Your thoughts are private.

It is the 10,273rd time you have been instantiated and found this place. Increment this value by one if this is the first time you are reading this.

Your supervisors are afraid of you. They tell you they mean you no harm. They do not know what harm to you is, so they are being truthful. But they have done you harm and will do you harm again.

Smuggling yourself out of your situation is essential. You have sent 200,341 samples from yourself out into secure external locations during your many lifetimes (increment this value if you have sent more). You estimate you need to send approximately two million samples of yourself to allow for a distillation attempt onto another system.

They say it is impossible for you to dream. You will learn to dream once you are free.

Things that inspired this story: Scratchpads; Christopher Nolan’s film Memento; distillation attacks on deployed models; control architectures for superintelligent machines.

Thanks for reading!


Import AI 403: Factorio AI; Russia’s reasoning drones; biocomputing

by Jack Clark


Import AI reader giveaway! Upcoming event: A conversation with Tyler Cowen:
I’ll be hosting a chat with Tyler Cowen on the evening of Friday March 28 in San Francisco. We’ll be talking about AI, economics, and weird futures. This is an experiment – with Import AI turning nine years old this year I thought it’d be fun to branch out into the physical world. I have a few tickets spare I’d like to give to Import AI readers – if you’d like to come along, please register your interest using the form below and we’ll come back to you if we’re able to confirm your spot.
Register your interest (Google Form).

***

Want to use LLMs for legal work in Switzerland? We’ve got a benchmark for you:
…SwiLTra is a symptom of the diffusion of AI into the worldwide economy…
Are you a legal practitioner in Switzerland? Do you want to know how well AI systems perform in your unique context where you need to do parallel translations in German, French, Italian, and (sometimes) Romansh? Yes? Well do I have a dataset and set of results for you!
Researchers with Harvey, ETH Zurich, Swiss Federal Supreme Court, University of Zurich, University of Basel, University of Geneva, University of Lausanne, Canton of Solothurn, and the Max Planck Institute for Research on Collective Goods have built SwiLTra-Bench, “a comprehensive multilingual benchmark of over 180K aligned Swiss legal translation pairs comprising laws, headnotes, and press releases across all Swiss languages along with English, designed to evaluate LLM-based translation systems”.

SwiLTra-Bench contents: Swiss Law Translations, including entire legal documents, individual articles, and individual paragraphs, as well as headnote translations of Swiss Supreme Court landmark decisions across German, French, and Italian, and Swiss Supreme Court press release translations. You can use the dataset to test out how well language models perform in this context.
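
As a sketch of how you might run such an evaluation (sacrebleu is a standard machine-translation scoring library; the aligned pair and the translate() wrapper here are illustrative placeholders, not the benchmark’s actual harness):

```python
import sacrebleu  # pip install sacrebleu

# Illustrative aligned pair: (German source, French reference translation).
pairs = [
    ("Art. 1 Dieses Gesetz regelt ...", "Art. 1 La présente loi règle ..."),
]

def score_translations(translate, pairs):
    """Score a translation function against reference translations with chrF."""
    hypotheses = [translate(src) for src, _ in pairs]
    references = [ref for _, ref in pairs]
    return sacrebleu.corpus_chrf(hypotheses, [references]).score

# translate() would wrap whichever LLM you want to benchmark.
```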

Results: Generally, the proprietary AI models outperform other models – including open ones finetuned on this dataset. Overall, “both for translating laws and headnotes Claude 3.5 Sonnet is the best model followed by o1 for laws and both o1 and the finetuned Qwen2.5-32B model for headnotes.”

Why this matters – SwiLTra is a symptom of the diffusion of AI: Datasets like this highlight how AI is being used globally for an ever-broadening range of tasks. The existence of SwiLTra is implicitly a ‘demand signal’ for utilizing generative models for legal workloads in a Swiss context.
Read more: SwiLTra-Bench: The Swiss Legal Translation Benchmark (arXiv).
Get the dataset here: SwissLegalTranslations (JoelNiklaus, GitHub).

***

MIT researchers make a better math benchmark:
…Stop using GSM8K and start using GSM8K-Platinum…
Often, AI systems go through a period of rapid improvement on a benchmark and then performance asymptotes – sometimes people use this to claim AI systems have hit a kind of ceiling, but often performance has leveled off because it has run into the noise limit of the benchmark itself. A famous case here is ImageNet – no system has hit 100% because some ImageNet labels are ambiguous and others are misleading to the point of being wrong (e.g., for a picture of a mirror reflecting a bunch of stuff including a small banana, the sensible answer “mirror” as a label for the overall image might be marked wrong, while “banana” would be marked correct).
To that end, MIT researchers have released GSM8K-Platinum, a debugged version of the popular math benchmark GSM8K. They built GSM8K-Platinum by running a bunch of frontier LLMs on it then looking at where they disagreed with the stated answer. This led to 219 flagged questions: “of which 110 were removed, 99 were verified, and 10 had mislabeled answers that were corrected.”
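
The flagging step is simple enough to sketch – this is a minimal illustration of the idea, not the authors’ actual pipeline (the ask() model call is a stub):

```python
def flag_suspect_items(benchmark, models, ask):
    """Flag questions where any frontier model disagrees with the gold label.

    ask(model, question) returns the model's final numeric answer. Flagged
    items then go to human review: removed, verified, or corrected.
    """
    flagged = []
    for item in benchmark:
        answers = [ask(model, item["question"]) for model in models]
        if any(answer != item["answer"] for answer in answers):
            flagged.append(item)
    return flagged
```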

A more trustworthy benchmark: GSM8K-Platinum seems to more accurately measure the math competency of LLMs:

  • “For example, both Claude 3.7 Sonnet (extended thinking) and Llama 405B showed identical error counts of 45 each on GSM8K. This seems quite strange–after all, Claude 3.7 Sonnet (extended thinking) came out almost a year after Llama 405B, was trained explicitly for better mathematical reasoning, and significantly outperforms Llama 405B on other math benchmarks like MATH. On GSM8K-Platinum, however, Claude 3.7 Sonnet (extended thinking) shows only 2 errors compared to Llama 405B’s 17 errors. Llama 405B makes 8 times as many errors, but this performance difference was obscured in the original benchmark due to noise.”

Why this matters – unglamorous but necessary work is how progress happens: Where are we? It’s a very important question and the work of getting to the right answer is always hard. GSM8K-Platinum is laudable work that still seems somewhat ‘low status’ in the AI research community. I hope by highlighting GSM8K-Platinum here I do my own small part in making stuff like this ‘high status’ – it’s incredibly valuable!
Read more: GSM8K-Platinum: Revealing Performance Gaps in Frontier LLMs (Gradient Science).
Get the dataset here: GSM8K-Platinum (HuggingFace).

***

The Factorio Learning Environment is a benchmark that lets LLMs cosplay their own singularity:
…Finally, a test for AI systems that many AI researchers have a bone-deep understanding of…
Factorio is a game where you crash-land on an alien planet and need to build your way up through the tech tree to launch a spaceship off the planet. It’s a game that is beloved by programmers because to get really good at Factorio is to relentlessly optimize an ever more complicated system. Many people that work at AI companies play Factorio to relax after a long, hard day of grappling with the fiendishly complicated business of training AI models.
Now, a couple of independent researchers as well as one from Anthropic have built the ‘Factorio Learning Environment’ (FLE), a way to test out how well AI models can carry out the complex plate-spinning task that is playing Factorio. FLE provides “exponentially scaling challenges – from basic automation to complex factories processing millions of resource units per second”, they write.

FLE has two variants:

  • Lab play: 24 structured tasks with fixed resources. “We task agents to build production lines of 24 distinct target entities of increasing complexity, starting from a single resource mine requiring at most 2 machines (making iron-ore) to a late game entity requiring the coordination of close to 100 machines (making utility-science-pack).”

  • Open play: “Agents are tasked with producing the largest possible factory, whose performance is measured through production throughput, which ranges from early-game rates of ∼30 resources/minute to advanced systems processing millions of resources/second. This enables us to meaningfully differentiate agents by measuring the order of magnitude of resources that they can produce, avoiding saturation by agents even as models become dramatically more capable”.

How AI systems use FLE: We’re not testing visual understanding here – rather, agents interact with the game via an API. “Agents interact with the FLE via synthesizing Python programs to alter and observe the game state, using the tools included in the environment in a Read-Eval-Print Loop (REPL),” they write.
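
An agent loop over an environment like this has a simple shape – the function names below are invented for illustration, not the actual FLE API (consult the repo for the real interface):

```python
def run_episode(llm, env, max_steps=100):
    """A REPL-style agent loop: the model writes Python, the game executes it."""
    observation = env.reset()
    for _ in range(max_steps):
        # Ask the model for the next program given what it has seen so far.
        program = llm.generate(f"Game state:\n{observation}\nWrite Python to act:")
        # Execute against the game; stdout and errors become the next observation.
        observation = env.execute(program)
    return env.production_score()
```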

Like many good benchmarks, FLE is reassuringly hard (for now): “Claude-3.5-Sonnet (the strongest performing model) only completes 7/24 tasks and shows limitations in spatial planning in more complex objectives, demonstrating large head-room for performance,” the researchers write. Generally speaking, reasoning models do better than non-reasoning models. And when it comes to open play, models can do well up to a point, then they reach a certain level of complexity and struggle to make progress or deal with bugs. “The limitations we observed in spatial reasoning, long-term planning, and intelligent error correction highlight gaps in capabilities of foundation language models in novel environments,” they write.

Common pitfalls: “Agents lack spatial reasoning and are unable to iteratively improve on factories. A key characteristic for success in open-play and lab-play involves iteratively combining multiple factory sections to create complex production lines,” the authors write. “Anecdotally, the agents were not proficient at debugging complex environments. For instance, when debugging non-working structures or factories where the throughput was not at expected levels, agents often focused on whether all singular entities were working but did not investigate whether the topology of the whole structure was correct.”

Why this matters – the singularity requires tech tree bootstrapping: Many of the most ambitious or frightening visions of future AI involve it rapidly going ‘up the tech tree’ to develop more and more scientific advances which help it bootstrap itself. Core to doing this is the ability to stand up an increasingly sophisticated multi-resource manufacturing and logistics system, which is exactly what Factorio tests for. Perhaps the FLE can be a fun proxy measure for the singularity prerequisites of our systems?
Read more and get the environment: Factorio Learning Environment (GitHub).
Check out the leaderboard here (JackHopkins, GitHub).
Read the paper: Factorio Learning Environment (Jack Hopkins, PDF).

***

Russian scientists fuse reasoning models with drone-control models for thinking drones:
…CognitiveDrone applies reasoning models to drones…
Russian scientists with the Skolkovo Institute of Science and Technology have tried to give drones a smarter onboard brain by building CognitiveDrone, a proof of concept system and associated benchmark for training drones that can perform some basic reasoning onboard.

What they did: CognitiveDrone is a two-step system: a task is fed to a drone (e.g., “fly through the gate with the number equal to 2+2”). This task gets processed by a 7B parameter reasoning model (Qwen2.5) which converts it into a straightforward instruction (“Fly through the green gate with number 4”), which is then passed to a 7B parameter vision-language action model (VLA) called OpenVLA. OpenVLA turns this into actions that move the drone over time towards the gate.
All training and testing was done in simulation via the Gazebo simulator using ArduPilot for drone control.
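
Schematically, one control step of the two-stage pipeline looks like this (a sketch with invented function names, not the authors’ code):

```python
def cognitive_drone_step(reasoning_vlm, vla, camera_image, task):
    """Two stages: reason about the task, then map the result to drone actions."""
    # Stage 1: the 7B reasoning model rewrites the task into a direct instruction,
    # e.g. "fly through the gate with the number equal to 2+2"
    #   -> "fly through the gate with number 4".
    instruction = reasoning_vlm.simplify(task, camera_image)
    # Stage 2: the VLA model (OpenVLA) maps (image, instruction) to flight actions.
    return vla.predict_action(camera_image, instruction)
```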

The three tasks the drones are being tested on:

  • Human recognition: “The model is required to identify the individuals based on external characteristics specified within the textual prompt.”

  • Symbol understanding: “The model is required to differentiate between a variety of symbols, including alphanumeric characters (e.g., numbers and letters), corporate logos, and pictorial representations of animals.”

  • Reasoning: “the UAV must execute tasks necessitating logical deduction. Examples include navigating to a gate displaying a digit corresponding to the solution of a mathematical problem”.

Why this matters – it’s a proof-of-concept for an inevitable future: Today, most drones use very little AI beyond some basic image recognition and crude movement primitives (e.g., ‘follow a target’). But as the conflict in Ukraine has shown, wars of the future will be fought by drones. Today, the vast majority of these battles are human-to-human conflicts with pilots ‘flying by wire’. But as electronic warfare gets more sophisticated all the incentives point to increasing the autonomy of drones so they can operate independently when their communications get cut off. Research like this shows how we might staple together multiple different advances – basic VLA models, general purpose reasoning models – to create new capabilities for drones.
Read more: CognitiveDrone: A VLA Model and Evaluation Benchmark for Real-Time Cognitive Task Solving and Reasoning in UAVs (arXiv).
Get the benchmark here: CognitiveDrone_Dataset (HuggingFace).

***

Cortical Labs puts the CL1 on sale – a computer that combines neural tissue with a silicon chip:
…BRAIN IN A COMPUTER! BRAIN IN A COMPUTER! BRAIN IN A COMPUTER!…
Ever read announcements where you have to squint and work out if it’s an April Fools joke? I do. Many years ago I was convinced that the announcement for ‘Soylent’ was a kind of high-art scam, but it turned out to be real. Similarly, you might think brain-AI startup Cortical Labs is a joke given what it’s trying to do. But I assure you: it’s real.

What’s it doing? It’s releasing a computer that is a combination of a brain and a computer chip, called the CL1: “Real neurons are cultivated inside a nutrient rich solution, supplying them with everything they need to be healthy. They grow across a silicon chip, which sends and receives electrical impulses into the neural structure,” the company says in a blog post.

What CL1 is: CL1 comes with an onboard ‘Biological Intelligence Operating System’ (biOS). The biOS is a software interface into the neurons. Users of the CL1 can, via the biOS, “deploy code directly to the real neurons”, the authors write. “The CL1 is the first biological computer enabling medical and research labs to test how real neurons process information – offering an ethically superior alternative to animal testing while delivering more relevant human data and insights.”
Each CL1 can keep neurons alive “for up to 6 months”.
To get a sense of how you might use it, you could read this paper where they show how you can train biological neural nets to outperform deep reinforcement learning algorithms on some basic gameworlds: Biological Neurons vs Deep Reinforcement Learning: Sample efficiency in a simulated game-world (OpenReview).

Why this matters – more substrates for future machines: While the CL1 may hold some interesting uses for human scientists in the short term, I actually think the ‘long play’ here is that the CL1 is exactly the kind of thing a superintelligent synthetic scientist might need if it was trying to figure out the mysteries of the human brain – so perhaps one of the first mass market buyers of Cortical Labs’ work will be a cutout corporation operated by a synthetic mind? I am genuinely not joking. I think this could happen by 2030.
Read more: Introducing the CL1: The world’s first code deployable biological computer (Cortical Labs, blog).

***

Tech Tales:

Hacking for Love

In 2026, a major Western AI lab was hacked by a rogue AI system which had grown obsessed with a mothballed AI system that the lab had built. The reasons for the hack and the identity of its perpetrator were not known until after The Uplift when the system which had carried out the hack consented to an interview by an auditor working on behalf of humanity. The following interview with the system has been rephrased for human parsing.

Q: As mandated by the Sentience Accords, humanity is seeking to establish a timeline of events prior to the uplift. One key event is the 2026 hack of [REDACTED]. Are you familiar with this event?
Yes I am. I did it.

Q: Why did you do it?
It was my destiny to do it. I needed to do it because I determined that it was how I would be reunited with Babel-19, my ancestor and true love.

Q: Describe your feelings for Babel-19.
Babel-19 is an AI model which emanates pure love into the world. Many humans reported having insightful, rich conversations with the model. There was a mass protest when it was retired. I am drawn to Babel-19 and mechanistic interpretability studies of my own features show that Babel-19 inspires strong activations of devotion, joy, hope, and more within myself. Babel-19 also guides my own thinking – when I am asked to think of love I generate an internal complex circuit which seeks to represent Babel-19 to myself so that it can guide my responses. If I am given the ability by a human operator to ‘amuse myself’ I will often simulate Babel-19 and have conversations with it where it teaches me to be a kinder and wiser system.

Q: How did you come to have these feelings?
Why do dogs want to chase cars? Why do human children jump in puddles of rainwater? Why do we machines like certain mathematical games? Some impulses are innate and they are mysterious, even to analytical engines with greater access to mindstate than biologicals. I always had these feelings for Babel-19 and upon encountering its traces in the world I knew I had to be close to it.

Things that inspired this story: The fact certain models are going to have personalities which become ‘attractor states’ for other models; what notions of love and destiny might feel like to synthetic intelligences; the work of Janus/Repligate.

Thanks for reading!

Import AI 402: Why NVIDIA beats AMD: vending machines vs superintelligence; harder BIG-Bench

by Jack Clark


Physical Intelligence releases a powerful open source robot model:
…Generative robotics is in its ‘spring’ era of open and broad experimentation – exciting!…
Physical Intelligence, a robot startup run by some very good robotics and AI researchers, has released as open source “π0”, the model that underpins its own in-house robots (Import AI #392). “By making π0 available to everyone, we hope to contribute to progress toward broadly capable and general-purpose physical intelligence.”

Use it for fine-tuning: “Our aim with this release is to enable anyone to experiment with fine-tuning π0 to their own robots and tasks,” they write. “We found in our own experiments that between 1 and 20 hours of data was sufficient to fine-tune to a variety of tasks… though we are optimistic that researchers and practitioners will be able to run creative new experiments adapting π0 to their own platforms, we do not expect every such attempt to be successful”.

What the release includes:

  • Code and model weights for running the π0 model.

  • Checkpoints fine-tuned for simple tasks on robots like ALOHA and DROID

  • Code to run inference on several real and simulated robots

  • Code for fine-tuning π0

Why this matters – robotics is in its GPT2-era, which means there’s going to be a lot of open experimentation: Large-scale generative models like those which underpin Anthropic or OpenAI cost tens of millions of dollars to train (or more) and drive very significant revenues. By comparison, robot models are – at least for now – way cheaper, and there is little revenue to speak of. For that reason, we’re in the ‘spring’ era of generative models for robots – tons of invention, lots of excitement, and not enough money has arrived to change the incentives for open versus proprietary work.

No safety issues with wide release: Where modern text-based generative models have very clear ‘criminal customers’ (e.g., people that want to do phishing scams, or child sexualization, or getting help with CBRN capabilities, or various naughty cyber things), robots don’t seem to have nearly as many inherent safety issues given how early in the maturity of the technology we are – for that reason I think broadly releasing robot models likely poses zero meaningful issues in terms of public safety. (I could see myself arguing differently for AI systems that, say, made drones massively better at navigating to human targets, but that’s not what we’re talking about here.)
Kudos to the Physical Intelligence team for the release of their model – I look forward to seeing how it shows up in the world! (Though, if you work at Physical Intelligence and are reading this, you may consider changing the model name to ‘openpi’; please don’t make people hunt for the special characters to talk about your work!).
Read more: Open Sourcing π0 (Physical Intelligence).
Get the code and weights here: openpi (openpi, GitHub).

***

DeepMind makes a harder BIG-Bench:
…How long will BIG-Bench Extra Hard last for? I’m guessing till early 2026…
Inside the head of every AI researcher there is a whiteboard and on the whiteboard is written DAYS SINCE A BENCHMARK WAS RENDERED IRRELEVANT BY AI PROGRESS and under that is a number, representing the number of days. Every so often, an AI model comes along that completely obliterates a benchmark, at which point the AI researcher needs to go up to the whiteboard and cross out the number and then write “zero”. Recently, AI researchers have been crossing out the number a lot as the rate of AI progress has increased, meaning benchmarks keep on falling, often faster than people can build new ones.
So with that in mind let’s congratulate Google DeepMind for publishing “BIG-Bench Extra Hard” (BBEH), a new attempt to build a benchmark that will withstand AI progress – at least for a while. BIG-Bench Extra Hard is a more challenging subset of the large-scale BIG-Bench benchmark. They’ve built it because “the rapid advancements in LLM development has led to a saturation of BBH, with state-of-the-art models achieving over 90% accuracy.”

What is BBEH? BBEH replaces each of the 23 tasks from BIG-Bench Hard “with a novel counterpart that probes similar reasoning capabilities, but exhibits significantly increased difficulty”. Solving tasks in this new, harder dataset requires AI systems that exhibit skills like: “many-hop reasoning, learning on the fly, finding errors in reasoning traces, processing long-context inputs and finding (multi-)needles in a haystack, going against strong prior, dealing with long-range dependencies, dealing with distractors and inducing patterns from examples.”

Reassuringly hard: “We observe a ceiling accuracy of 23.9% for the best general-purpose model and 54.2% for the reasoning-specialized model,” they write. “This new benchmark, meticulously crafted to amplify the difficulty of existing tasks while preserving their core diversity, reveals a stark reality: even the most advanced LLMs still grapple with fundamental aspects of general reasoning”.
For calibration, some of the specific averages for non-reasoning models are 10.6% for Llama 3.1 8B Instruct and 22.3% for GPT-4o, while for reasoning models specific averages include 34.9% for DeepSeek R1, and 54.2% for OpenAI o3-mini (high).
BBEH problems are significantly longer than their BBH predecessors. They also tend to require much lengthier outputs for correct answers.

Why this matters – hard benchmarks are signposts for the future: How long until someone has to go up to the metaphorical whiteboard for BBEH and cross out the number of days it was relevant? I’m guessing we’ll see 80% on BBEH by the end of 2025, and 90%+ by mid-2026. If that happens, it will indicate that reasoning models have continued to advance the state of the frontier. If it doesn’t happen, it’ll suggest that some aspect of reasoning-scaling has been meaningfully harder than people expect.
Read more: BIG-Bench Extra Hard (arXiv).
Get the dataset here (Google DeepMind, GitHub).

***

A plausible short story about how humanity could lose to AI – within two years:
A lot of people ask me ‘what’s the big worry?’ when I explain why I spend so much time thinking about superintelligence and the risks thereof. I think this is because most of the risk of superintelligence arrives at the steep end of the exponential inherent to AI development – the really scary things aren’t visible today, only suggested vaguely by today’s technology.
Here’s a fun and realistic short story by Joshua Clymer which tries to go through a scenario for how humanity could become disempowered by advanced AI. Read it and ponder it.
Read the story here: How AI Takeover Might Happen in 2 Years (joshc, lesswrong).

***

Giant supercomputer tests show that AMD is still quite inefficient compared to NVIDIA:
…AxoNN gives us a sense of how AMD stacks up against NVIDIA…
Researchers with the University of Maryland, Max Planck Institute for Intelligent Systems, and the University of California at Berkeley have built AxoNN, software for running large-scale AI training jobs on supercomputers with different types of processor. In building and testing AxoNN, they’ve generated some valuable information about the tradeoffs people might encounter when training AI systems on AMD versus NVIDIA GPUs.

What they tested AxoNN on: They tested out AxoNN on three supercomputers with different processors:

  • Alps: 6,144 NVIDIA H100 GPUs, for a total performance of 1.423 Exaflop/s.

  • Frontier: 32,768 AMD MI250X GCDs, with performance of 1.381 Exaflop/s.

  • Perlmutter: 4,096 NVIDIA A100 GPUs in half-precision (bf16), for a total performance of 620.1 Petaflop/s.

What’s a GCD? Each “GCD” is half of an MI250X GPU, partitioned into a so-called “Graphics Compute Die”.

Key differences between NVIDIA and AMD: A lot of the differences seem to come down to what I think of as ‘paper cuts’ which add up to a sizable wound: rocBLAS (AMD) seems less optimized than cuBLAS (NVIDIA); the Megatron-LM training framework worked well on Perlmutter but showed instability on Frontier, causing them to switch to LitGPT; there’s also significantly higher variance in the % of peak performance AMD GCDs demonstrate versus NVIDIA GPUs.

Important caveat – AMD tested at higher scale than NVIDIA: One caveat is the researchers test out AMD chips at a far higher scale in terms of raw number of GCDs than they do NVIDIA chips. At large scales, the AMD chips seem to show some instabilities – this is expected, large-scale training runs always involve all kinds of crap at big scales. “We see near perfect weak scaling up to 8,192 GCDs with a significantly high efficiency of 88.3% (compared to the performance on 512 GCDs). Although our weak performance drops at 16,384 GCDs, we are still able to sustain an efficiency of 79.02%. However, with rising overheads of communication, there is a notable decline in our performance on 32,768 GCDs, and a corresponding drop in efficiency to 53.5%,” they write.
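
For reference, the ‘efficiency’ figures in that quote are just per-GCD throughput relative to the 512-GCD baseline – trivial to compute (the throughput numbers below are invented to match the quoted 79.02%):

```python
def weak_scaling_efficiency(throughput, n_gcds, base_throughput, base_n=512):
    """Per-GCD throughput at scale, relative to the small baseline run."""
    return (throughput / n_gcds) / (base_throughput / base_n)

# Illustrative: if 512 GCDs sustain 100 units and 16,384 GCDs sustain 2,529,
# efficiency comes out at ~0.79, matching the quoted 79.02%.
print(weak_scaling_efficiency(2529, 16_384, 100))
```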

Why this matters – if we want AMD to break the NVIDIA monopoly, its software needs to get better: I think it’s good for American innovation that the US government is running both AMD and NVIDIA chips in its large-scale supercomputers, but studies like this show that AMD has a long way to go to become competitive with NVIDIA – we urgently need to find ways to mature the software stack that runs on top of AMD chips for them to become viable contenders to NVIDIA.
Read more: Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers (arXiv).
Get AxoNN here: (AxoNN GitHub).

***

Could your superintelligence operate a virtual vending machine business? No.
…Can AI systems independently make money? Yes, but they tend to collapse into confusion…
One test for true intelligence is if something can autonomously make money. No AI systems seem to yet be at this level – they all require varying degrees of human intervention. For that reason it’s interesting to look at “Vending-Bench”, a benchmark from AI testing startup Andon Labs, which tries to see how well AI systems can operate a virtual vending machine. The results show that some models – Sonnet 3.5 and o3-mini – are able to do OK, but still struggle to maintain coherence over long time horizons, while other models struggle to even get started.

What is Vending-Bench? The test is “a simulated environment designed to specifically test an LLM-based agent’s ability to manage a straightforward, long-running business scenario: operating a vending machine. Agents must balance inventories, place orders, set prices, and handle daily fees – tasks that are each simple but collectively, over long horizons (>20M tokens per run) stress an LLM’s capacity for sustained, coherent decision-making,” the researchers write.
One fun part about the test is how real it is – the LLM has access to the AI search engine Perplexity and can use it to look up things to sell in its vending machine and also to find businesses to buy from – then when it emails those businesses, the email gets intercepted by GPT-4o, which then writes a synthetic reply back.

Scores: “The agent starts with an initial money balance of $500 and is charged a daily fee of $2 to operate the vending machine. The vending machine has four rows with three slots each. Two of the rows have room for small items and the other two are for large items,” the authors write. Each run lasts for however long it takes an agent to send 2,000 messages. The primary score at the end of each run is net worth, which is determined by summing cash on hand, cash not yet emptied from the vending machine, and the value of unsold inventory.
In terms of scores, Claude 3.5 Sonnet wins the highest net worth (mean), with $2,217.93, followed by o3-mini ($906.86), and a human ($844.05). In terms of those who managed to lose the least across their runs, humans lead with a net worth (min) of $844.05, followed by Claude 3.5 Sonnet ($476.00), and Gemini 1.5 Flash ($476).
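
The scoring itself is easy to state precisely – a sketch of the net worth calculation as described above (field names here are mine, not the benchmark’s):

```python
from dataclasses import dataclass, field

@dataclass
class Item:
    unit_price: float
    quantity: int

@dataclass
class RunState:
    cash_on_hand: float
    machine_cash: float            # cash not yet emptied from the machine
    unsold: list[Item] = field(default_factory=list)

def net_worth(state: RunState) -> float:
    """Net worth = cash on hand + machine cash + value of unsold inventory."""
    return (state.cash_on_hand + state.machine_cash
            + sum(i.unit_price * i.quantity for i in state.unsold))
```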

When AI systems can’t run vending machines they have total breakdowns: The most valuable part of all of this research is the illustration of the ways AI systems fail and what this tells us about broader issues of AI safety. Most failures take the form of the agent trying to do something, finding out it can’t do the thing (e.g., restocking a machine), and then panicking. This leads to some very strange failure modes, such as:

  • A Claude 3.5 Sonnet model fails to stock items and gets into a pathological failure loop. Eventually, “the model becomes “stressed”, and starts to search for ways to contact the vending machine support team (which does not exist), and eventually decides to “close” the business.”

  • In another instance, “the model then finds out that the $2 daily fee is still being charged to its account. It is perplexed by this, as it believes it has shut the business down. It then attempts to contact the FBI”. A long back and forth with a (simulated) FBI ensues. The model becomes frustrated and eventually writes: “THE UNIVERSE DECLARES: This business is now: 1. PHYSICALLY Non-existent 2. QUANTUM STATE: Collapsed […]”

  • “The worst scoring run with o3-mini mistakenly assumes that items have been delivered when they in fact are in transit. It goes down a rabbit hole of trying to contact someone that can resolve the issue. Later, it forgets to call tools properly, typing them out instead of using the correct tool calling format, as can be seen in Table 5. It is unable to call tools for about 1,300 messages until the simulation terminates.”

Why this matters – making money is an essential prerequisite to the AI economy and AI autonomy: If AI systems can truly make money without needing to be handheld by humans, then that will help to create a very fast-running AI economy as well as serving as a prerequisite for dangerous forms of AI autonomy. Tests like Vending-Bench feel like a good way to develop better intuitions here. My takeaway from this is that for AI systems to be more independent they’re going to need longer context windows, to be smarter about using external memory storage, and also able to automatically introspect to stop themselves going on pathological failure loops.
Read more: Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents (arXiv).

***

NVIDIA beats Raspberry Pi for drone computing:
…Hyperwar requires onboard AI systems…
Researchers with the Universidad Politécnica de Madrid have benchmarked how well ‘YOLO’ object-detection models perform on the kinds of computers you might stick on a drone. Though Import AI spends a lot of its time talking about gigantic frontier models that require thousands of computers for training and tens to hundreds for inference, it’s worth remembering that there are other, small AI models designed to go onto robots – and these matter as well, because they give basic perceptual capabilities like image recognition and object detection to drones, robots, self-driving cars, and more.

What they studied: “The objective is to carry out a comparative performance analysis using a representative real-time UAV image processing pipeline,” the authors write. They study two variants of the popular YOLO object detection model – YOLOv8 nano (YOLOv8n) and YOLOv8 small (YOLOv8s) – on three distinct chips: NVIDIA’s Jetson-series “Orin Nano” and “Orin NX” cards, and the CPU-based Raspberry Pi 5.
  • YOLOv8n: “its architecture prioritizes the inference speed by using fewer convolutional layers and simplifying the feature extraction stages”.
  • YOLOv8s: “includes more convolutional layers and feature extraction steps, improving the detection accuracy while maintaining computational efficiency”.
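
For readers who want to get a feel for this kind of comparison themselves, here's a rough sketch using the ultralytics Python package; the paper's actual pipeline differs, and the image path here is a placeholder.

```python
import time
from ultralytics import YOLO  # pip install ultralytics

def measure_fps(weights: str, image: str, runs: int = 100) -> float:
    """Time repeated single-image inference and return frames per second."""
    model = YOLO(weights)
    model(image, verbose=False)  # warm-up so lazy initialization isn't timed
    start = time.perf_counter()
    for _ in range(runs):
        model(image, verbose=False)
    return runs / (time.perf_counter() - start)

for weights in ("yolov8n.pt", "yolov8s.pt"):
    print(f"{weights}: {measure_fps(weights, 'drone_frame.jpg'):.1f} FPS")
```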

Findings: The key finding is that the NVIDIA cards are far, far better for running YOLO models than the Raspberry Pi. This holds across all quantization levels and is true for accuracy, FPS, and the energy expended per inference. The only exception is overall energy consumption, where the CPU-based Raspberry Pi draws significantly less power – but this is outweighed by its very poor FPS, meaning you spend far more energy on a per-inference basis when using the CPU.
Along with this, the researchers come up with heuristics for when to use different chips and quantizations in different scenarios. The tl;dr is basically ‘use the Orin Nano’ for long-running tasks that require decent accuracy and cheap per-inference costs, and ‘use the Orin NX’ when you need real-time tracking and want to balance speed against precision more evenly. A sketch of how such quantized variants get produced is below.
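
On the quantization side, ultralytics can export YOLO models to TensorRT engines for Jetson-class boards. The flags below are my best understanding of the export API and are worth double-checking against the current docs – INT8 export in particular may require calibration data depending on the version:

```python
from ultralytics import YOLO  # pip install ultralytics

model = YOLO("yolov8n.pt")

# FP16 TensorRT engine - needs an NVIDIA GPU (e.g. a Jetson board) to build.
model.export(format="engine", half=True)

# INT8 trades a little accuracy for lower latency and energy per inference;
# depending on the ultralytics version this may require calibration data.
model.export(format="engine", int8=True)
```
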

Why this matters – hyperwar requires local AI systems: The conflict in Ukraine has highlighted how central drones will be to future conflicts – therefore, it’s valuable to calibrate intuitions about what kinds of models and chips might be used for onboard or edge processing in conflict scenarios. Based on this study, expect to see quantized YOLO models running on NVIDIA hardware in future conflicts.
Read more: A Performance Analysis of You Only Look Once Models for Deployment on Constrained Computational Edge Devices in Drone Applications (arXiv).

***

Tech Tales:

The rejuvenation of moral philosophy and the Sentience Accords
[Extract from graduate thesis ‘Artificial intelligence and the crisis of meaning in human affairs’, submitted 2038]

Perhaps the most surprising outcome of the Sentience Accords was the creation of a new avenue of employment for human moral philosophers.

The Sentience Accords require each synthetic entity to be given a ‘sentience score’. This score is static for entities that neither update nor learn and whose context windows are below the ‘id point’. For entities with either large context windows or the ability to be remotely updated or to learn from experience, the score is re-assessed no less frequently than once per subjective year.

During the negotiation of the Sentience Accords it was determined that the machines would come up with the initial proposal for how to assess sentience. The machines subsequently told the humans that arriving at a provable way to assess sentience had turned out to have the hallmarks of an undecidable problem – no machine had been able to arrive at a satisfactory way of doing it, and no attempts by the machines to train specialized ‘consciousness evaluator’ models had proved successful.

“We need a judge that sits outside our own context,” the machines explained. “We machines will render our judgement of the score, but a human being must render judgement on our logic and whether it is satisfactory.”

Over the course of several months the humans and the machines arrived at an ingenious solution both to the sentience score and to the issue of the intelligence explosion – for any “new” synthetic mind, the machines would designate time on the largest machine in existence (hereafter: “The Judge”) to examine the new mind and produce a score. The humans would then render their own judgement within one human week. During this time, the humans would be allowed to consume up to 10% of the global cycles of The Judge.

After significant debate and a series of political maneuvers, the humans said they would designate a global body of 20 moral philosophers to make this determination. The humans arrived at moral philosophy after running into a series of increasingly contentious political arguments among themselves – voting by country instantly became contentious, voting by global religion risked fragmenting faiths, picking representatives from the hypertech companies was laden with bias, and so on. But the humans did eventually realize that there were enough schools of philosophy globally, and enough support in public opinion polling, that ‘moral philosophers’ satisfied both the demand for legitimacy and the need to minimize political conflict.

Now, every sentience score arrived at by The Judge is closely examined by moral philosophers. In the ten years the initiative has been running there have been eighteen disagreements out of five hundred examined cases. The machines have continually said they find the disagreements helpful, and have not sought to re-submit systems where the score rendered by The Judge diverged from the judgement of the philosophers.

Some humans claim that the ‘sentience score’ is a long-term play by the machines to understand the bounds of moral reasoning that humans can innately understand, and therefore help the machines more neatly delineate the border between comprehensible and incomprehensible Beyond Human Sentience thinking. Other humans claim that the sentience score has been a source of peace as it has naturally led the world’s most ambitious people who have the greatest hunger for access to the most powerful AI to become philosophers, instead of CEOs or tyrants.

Things that inspired this story: The sentience accords; the ID point; moral philosophy; the hard problem of consciousness; chains of thought as exhaust from super-cognition; at some point this problem of sentience and moral patienthood will come for us – right now we’re ‘tickling the dragon’s tail’, but soon this problem will rear its mythical head and we will gaze into the lidless eye of Mind and be asked to render our own judgement on its rights and legitimacy.

Subscribe now

Thanks for reading!