Import AI

Import AI 399: 1,000 samples to make a reasoning model; DeepSeek proliferation; Apple’s self-driving car simulator

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.


Prime Intellect releases 1.4 million samples to help people train reasoning models:
…AI proliferation via DeepSeek R1 as a powerful data generator…
Last month, I wrote that the release of DeepSeek R1 meant that AI proliferation was guaranteed (Import AI #397) because it would make it easy for people to create new reasoning datasets on which they could train powerful reasoning models. Now the distributed AI research startup Prime Intellect has proved this out with the release of SYNTHETIC-1, a dataset of 1.4 million reasoning examples with chain-of-thought thinking provided via R1.
“The DeepSeek-R1 paper highlights the importance of generating cold-start synthetic data for RL,” PrimeIntellect writes. “As our first step toward state-of-the-art reasoning models, SYNTHETIC-1 generates verified reasoning traces across math, coding, and science using DeepSeek-R1.”

SYNTHETIC-1 details: The freely available dataset “consists of 1.4 million high-quality tasks and verifiers, designed to advance reasoning model training… It includes both programmatically verifiable problems (e.g., coding tasks with unit tests) and open-ended reasoning challenges verified using LLM judges”.
SYNTHETIC-1 contains 777k math problems, 144k coding problems (across Python, JavaScript, Rust, and C++), 70k real-world software engineering problems, 61k synthetic code understanding tasks, and 313k open-ended STEM questions.

Why this matters – recursive development is here: What’s happening here is a Chinese company released a very powerful AI system openly. This AI model can generate data which exhibits a high quality of reasoning. This kind of data turns out to be a very sample-efficient way to bootstrap the capabilities of pre-existing AI systems. Now, a startup is using this recently released AI model to augment existing datasets, improving their quality. These datasets will then go into training even more powerful, even more broadly distributed models. This is what a compounding development cycle with some element of recursion looks like. Expect things to move increasingly quickly.
Read more: SYNTHETIC-1: Scaling Distributed Synthetic Data Generation for Verified Reasoning (PrimeIntellect).
PS: Thanks to Prime Intellect co-founder Vincent Weisser for clarifying a question I had about this.

***

Can super powerful AI systems find the ‘gorilla in the data’? No:
…Pouring some cold water on the amazing capabilities of these systems…
In this newsletter we spend a lot of time talking about how advanced AI systems are and how their tremendous power will surely shape geopolitics and the fate of humanity. At the same time, we can’t ignore the fact that sometimes these things are amazingly, cringe-inducingly dumb. For an example of this, check out this fun post “Your AI can’t see gorillas”, which shows how neither ChatGPT nor Claude can do a good job of spotting an obvious confounding factor in some data they’ve been given for analysis.
Read more: Your AI can’t see gorillas (Chiraag Gohel, blog).

***

Apple makes some very good self-driving car brains entirely through self-play:
…The self-driving future could be achieved through simulation as well as real world data…
Researchers with Apple have trained some smart self-driving car AI systems entirely through self-play – AI systems learning to drive by experiencing millions of kilometers of driving, entirely in simulation.
“We show that simulated self-play yields naturalistic and robust driving policies, while using only a minimalistic reward function and never seeing human data during training,” Apple writes. Most impressively, the resulting AI systems outperform state-of-the-art systems on a variety of challenging benchmarks not trained on during simulation.

How they did it – extremely big data: To do this, Apple built a system called ‘GigaFlow’, software which lets them efficiently simulate a bunch of different complex worlds replete with more than a hundred simulated cars and pedestrians. GigaFlow trains agents in one of eight maps, each randomly perturbed with rescaling, shears, flips and reflections. Total drivable lanes per map range from four to 40 km for a total of 136 km of road across the eight maps. In each map, Apple spawns one to many agents at random locations and orientations and asks them to drive to goal points sampled uniformly over the map.
GigaFlow “simulates urban environments with up to 150 densely interacting traffic participants 360 000 times faster than real time at a cost of under $5 per million km driven,” Apple writes. “A full training run simulates over one trillion state transitions, 1.6 billion km driven, or 9500 years of subjective driving experience, and completes in under 10 days on one 8-GPU node”.
What GigaFlow leads to: “The result is a robust and naturalistic driving policy that achieves state-of-the-art performance when tested in recorded real-world scenarios, amidst recorded human drivers, without ever seeing human data during training,” Apple writes.
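The throughput figures Apple quotes can be cross-checked with a little arithmetic. The sketch below is only a sanity check on the numbers as quoted, assuming the 360,000x speedup applies to the whole run in aggregate and treating $5 per million km as an upper bound; the variable names are illustrative and not from the paper. Under those assumptions the run takes about 9.6 days of wall-clock time (consistent with “under 10 days”), costs at most $8,000, and implies an average simulated speed of roughly 19 km/h.

```python
# Figures as quoted in the text above.
SPEEDUP = 360_000             # simulation speed relative to real time
YEARS_OF_EXPERIENCE = 9_500   # subjective driving experience in one run
KM_DRIVEN = 1.6e9             # total simulated distance in one run
COST_PER_MILLION_KM = 5.0     # USD, quoted as "under $5", so an upper bound

DAYS_PER_YEAR = 365.25

# Wall-clock time: subjective experience divided by the speedup.
wall_clock_days = YEARS_OF_EXPERIENCE * DAYS_PER_YEAR / SPEEDUP

# Upper bound on the cost of a full training run.
max_cost_usd = KM_DRIVEN / 1e6 * COST_PER_MILLION_KM

# Average simulated speed implied by distance over subjective hours.
avg_speed_kmh = KM_DRIVEN / (YEARS_OF_EXPERIENCE * DAYS_PER_YEAR * 24)

print(f"wall-clock days: {wall_clock_days:.2f}")
print(f"max cost (USD): {max_cost_usd:,.0f}")
print(f"average speed (km/h): {avg_speed_kmh:.1f}")
```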

Scores: In tests, the researchers compare performance of their system to state-of-the-art approaches on the nuPlan, CARLA, and Waymax benchmarks. In each of these, GigaFlow agents beat the prior state of the art by a significant margin, which is mostly explained by the agents having far more simulated experience than the ones they are competing against.
A closer look at the collision data is promising as well: “In nuPlan our policy sustains 15 collisions in 1118 scenarios. We analyzed each of them. Nine are unavoidable due to invalid initialization or sensor noise (agents appearing inside the vehicle’s bounding box). Four are caused by nonreactive pedestrian agents walking into the vehicle while the vehicle was stopped or in an evasive maneuver. Two collisions are due to traffic light violations of other agents,” the authors write. “In Waymax our policy sustains 187 collisions in 44 097 scenarios… 55.6% were caused by unavoidable IDM agent behavior of the traffic participants controlled by the benchmark, such as swerving directly into the ego vehicle. 41.7% were caused by initialization in a state of collision, typically with a pedestrian. 2.7% (i.e. five scenarios) were considered at fault and avoidable by the GIGAFLOW policy”.

Why this matters – we keep on learning how little specific data we need for good performance: GigaFlow is another example that if you can figure out a way to get a lot of data for a task, your main job as a researcher is to feed the data to a very simple neural net and get out of the way. The actual agents in GigaFlow are very simple, relatively small, and are trained via PPO. The real magic here is Apple figuring out an efficient way to generate a lot of ecologically valid data to train these agents on – and once it does that, it’s able to create things which demonstrate an eerily human-like quality to their driving while being safer than humans on many benchmarks.
Read more: Robust Autonomy Emerges from Self-Play (arXiv).

***

You can make a powerful reasoning LLM with just 1,000 samples!
…As long as you can generate some chains of thought from an existing powerful model…
The recent rise of reasoning AI systems has highlighted two things: 1) being able to utilize test-time compute can dramatically increase LLM performance on a broad range of tasks, and 2) it’s surprisingly easy to make LLMs that can reason.
New research from Stanford University, the University of Washington, the Allen Institute for AI, and Contextual AI highlights this with “s1”, a reasoning LLM which they made using just 1,000 samples and ~7 hours of training on an H100. If you’re thinking “gosh, that doesn’t sound like much”, you’d be right – this is an extremely small amount of data and of compute for a very significant upgrade in LLM performance.

What they did and why: The purpose of this research is to figure out “the simplest approach to achieve both test-time scaling and strong reasoning performance”. Their answer is s1, a model they make by finetuning a freely available Qwen-32B LLM “on only 1,000 samples with next-token prediction and controlling thinking duration via a simple test-time technique we refer to as budget forcing”. The result is “a strong reasoning model that scales in performance with more test-time compute”. By comparison, DeepSeek’s R1 model used a far more powerful base model (DeepSeek V3) and trained on ~800k samples.
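The newsletter only names budget forcing. As the s1 paper describes it, the decoder caps thinking by forcibly ending the reasoning segment once a token limit is hit, and extends thinking by suppressing an early stop and appending “Wait”. Below is a minimal toy sketch of that control loop. `END_THINK`, `budget_forced_thinking`, and `toy_model` are hypothetical names, and the “model” is a stand-in function rather than a real LLM.

```python
from typing import Callable, List

END_THINK = "<end_think>"  # hypothetical end-of-thinking delimiter


def budget_forced_thinking(next_token: Callable[[List[str]], str],
                           min_tokens: int, max_tokens: int) -> List[str]:
    """Decode a reasoning trace whose length is pushed into [min_tokens, max_tokens]."""
    trace: List[str] = []
    while len(trace) < max_tokens:
        tok = next_token(trace)
        if tok == END_THINK:
            if len(trace) >= min_tokens:
                break  # the model stopped after thinking long enough
            # Too early: suppress the stop and nudge the model to keep reasoning.
            trace.append("Wait")
            continue
        trace.append(tok)
    # Either the model stopped on its own or the cap was hit: close the trace.
    trace.append(END_THINK)
    return trace


def toy_model(trace: List[str]) -> str:
    """Stand-in for an LLM: tries to stop whenever the trace length is 2 mod 3."""
    return END_THINK if len(trace) % 3 == 2 else f"t{len(trace)}"


extended = budget_forced_thinking(toy_model, min_tokens=5, max_tokens=8)
capped = budget_forced_thinking(toy_model, min_tokens=0, max_tokens=2)
print(extended)
print(capped)
```

In a real system the stop would be suppressed at the logit level and the answer would be decoded after the delimiter; the sketch only shows the length control.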

Filtering ~59k samples to ~1k: Key to the good performance of their system is a well-curated 1,000 sample dataset. To build this dataset the authors collected 59,029 sample questions from sources spanning math, astronomy, biology, chemistry, computer science, and more, along with a couple of new datasets they built out of reasoning questions for quant funds (S1-teasers) and questions derived from the Stanford statistics department’s PhD qualifying exams (S1-prob). For each question, they generate a reasoning trace and solution using the Google Gemini Flash Thinking API – in other words, they create a ‘synthetic’ chain-of-thought by sampling from Google’s system.
They then filter this dataset by seeing if two models – Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct – can answer any of these questions (with answers assessed by Claude 3.5 Sonnet). If either model can, they throw these examples out, allowing them to select for questions that only very large-scale AI systems can solve. This cuts the total number of samples down to around 24,000.
To further filter this down they “choose one domain uniformly at random. Then, we sample one problem from this domain according to a distribution that favors longer reasoning traces”, then they generate a few samples and repeat across other domains.
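The two filtering stages described above can be sketched as follows. This is an illustration under stated assumptions, not the authors’ code: the function names, the record fields (`domain`, `trace`, `easy`), and the choice to weight sampling in direct proportion to trace length are all hypothetical. The text says only that the distribution “favors longer reasoning traces”.

```python
import random
from collections import defaultdict


def difficulty_filter(questions, solved_by_small, solved_by_large):
    """Drop any question that either reference model already answers correctly."""
    return [q for q in questions
            if not (solved_by_small(q) or solved_by_large(q))]


def sample_curated(questions, k, rng):
    """Pick a domain uniformly at random, then one problem favoring long traces."""
    by_domain = defaultdict(list)
    for q in questions:
        by_domain[q["domain"]].append(q)
    chosen = []
    while len(chosen) < k and by_domain:
        domain = rng.choice(sorted(by_domain))
        pool = by_domain[domain]
        # Assumption: weight proportional to trace length (+1 keeps weights positive).
        weights = [len(q["trace"]) + 1 for q in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx))
        if not pool:
            del by_domain[domain]  # domain exhausted
    return chosen


# Toy records; "easy" stands in for the result of a graded model answer.
questions = [
    {"id": 1, "domain": "math", "trace": "a" * 50, "easy": True},
    {"id": 2, "domain": "math", "trace": "a" * 500, "easy": False},
    {"id": 3, "domain": "biology", "trace": "a" * 200, "easy": False},
    {"id": 4, "domain": "chemistry", "trace": "a" * 300, "easy": False},
]
hard = difficulty_filter(questions, lambda q: q["easy"], lambda q: False)
curated = sample_curated(hard, k=2, rng=random.Random(0))
print([q["id"] for q in hard], [q["id"] for q in curated])
```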

Data is essential: This laborious data creation process is essential – the authors find that training on other 1k sample subsets they create through either only random sampling, only diverse sampling, or only longest reasoning sampling all lead to reduced aggregate performance relative to their curated dataset.

Results: S1 does substantially better than the Qwen model it is based on across tasks involving math and science understanding. It doesn’t approach the performance of much larger reasoning models like DeepSeek R1 or OpenAI o1 – but that’s not the point of this research. The point here is to precisely describe the simple recipe for training reasoning models.

Why this matters – if it’s this easy to make reasoning models, expect a temporary renaissance: 2025 will be a year of wild experimentation with tens of thousands of interesting reasoning models being trained off of a vast set of different training mixes. S1 serves as a valuable simple ‘soup-to-nuts’ guide for how to build reasoning models and will help broaden the set of people doing these experiments.
A key open question will be the extent to which the quality of chains-of-thought becomes important for the input datasets for these models – s1 is based off of refined chains of thought from Google Gemini, and DeepSeek is widely thought to have trained in part on some chains of thought derived from OpenAI’s o1 model.
Regardless, S1 is a valuable contribution to a new part of AI – and it’s wonderful to see universities do this kind of research rather than companies. “Our work aims to push the frontier of reasoning in a fully open manner, fostering innovation and collaboration to accelerate advancements that ultimately benefit society,” the authors write.
Read more: s1: Simple test-time scaling (arXiv).
Get the data here (simplescaling, GitHub).

***

Open Phil wants to spend $40m to fund AI research over the next five months:
…Care about AI safety? Apply here…
Open Philanthropy has announced a new request for proposals (RFP) for research oriented around AI safety. “With transformative AI on the horizon, we see another opportunity for our funding to accelerate highly impactful technical research,” the philanthropic organization writes. “In consultation with our technical advisors, we’ve generated a list of research areas that we think offer high leverage for improving our understanding and control of AI.”

Funding: “We expect to spend roughly $40M on this RFP over the next 5 months,” it writes. “Grants will typically range in size between $100,000 and $5 million.” The grants can be used for a broad range of research activities, including: research expenses, discrete projects, academic start-up packages, existing research institutes, and even starting new research institutes (though that will have a very high bar). Applications will be open until April 15, 2025.

Areas: The RFP outlines 21 specific research areas, grouped under five buckets:

  • Adversarial machine learning (e.g., jailbreaks, figuring out principled ways to know if an AI system has a hidden backdoor in it).

  • Exploring sophisticated misbehavior in LLMs (e.g., experiments on alignment faking)

  • Model transparency (e.g., finding feature representations, real-world applications of interpretability)

  • Trust from first principles (e.g., white-box estimation of rare misbehavior)

  • Alternative approaches to mitigating AI risks (e.g., new moonshots for aligning superintelligence)

Why this matters – good ideas can come from anywhere and Open Phil wants to fund them: Open Phil tends to fund a variety of different people and organizations to do research and isn’t as credential driven as traditional funders. Generally speaking if you can articulate a clear research vision and describe how you (or your collaborators) will be able to work on it, Open Phil will be receptive to your submission. Consider applying.
Read more: Request for Proposals: Technical AI Safety Research (Open Philanthropy).

Tech Tales:

Seventeen ways to Get Rich during The Singularity
[Extract from an online article – almost certainly AI generated – published in the years shortly before the uplift]

  1. Agent hijacking for profit

One of the best ways to get agents to pay attention to your product is to emphasize the human authenticity of your content. You can do this using a few popular online services: feed a face from an image generator into LiveStyle for an agent-powered avatar, then upload the content they’re selling into SceneGen – you can link both LiveStyle and SceneGen to one another and then spend $1-2 on a video model to create a ‘pattern of authentic life’ where your character will use the content in a surprising and yet authentic way.

  2. Life Mining

Authenticity is valuable and so is scarce data. But monetizing this is difficult. One way we’ve found to be effective is to use GhostTrace – a premium app which will track all the data and usage of your phone and mush it together into a single stream of information. You can then upload this into any of the mechanistic interpretability services to get a score for your particular ‘pattern of life’ with highlights of any particularly atypical things you do – the rarer certain sets of your actions are across the rest of the population, the higher the value the data brokers will pay you for a slice of the GhostTrace data.

Things that inspired this story: All the ‘make money with AI online’ books; the depressing tendency for making money online with AI to end up increasingly decoding to ‘trick another AI system into doing something’; the incoming agent-based economy.

Thanks for reading


Import AI 398: DeepMind makes distributed training better; AI versus the Intelligence Community; and another Chinese reasoning model

by Jack Clark


DeepMind figures out a way to make it 100X more bandwidth-efficient to train models in a distributed way:
…New research further reduces the need for single vast data centers for training big models…
During the past few years multiple researchers have turned their attention to distributed training – the idea that instead of training powerful AI systems in single vast datacenters you can instead federate that training run over multiple distinct datacenters operating at distance from one another. This is an important idea with big implications: a lot of AI policy assumes that the key to controlling AI development lies in monitoring large-scale data centers and/or large amounts of compute in cloud environments. Distributed training approaches break this assumption, making it possible that powerful systems could instead be built out of loose federations of computers working with one another.

New research from DeepMind pushes this idea further, building on the company’s already-published ‘DiLoCo’ approach. The new research – Streaming DiLoCo – lets people “distribute training of billion-scale parameters [models] and reach similar quality as before, but reducing required bandwidth by two orders of magnitude”. In tests, the researchers show that their new technique “is strictly superior to the original DiLoCo”.

DiLoCo is worth paying attention to – Prime Intellect’s “INTELLECT-1” 10bn parameter model was trained in a distributed way using OpenDiLoCo (Import AI #387), an open source variant of DeepMind’s DiLoCo approach.

Three improvements to DiLoCo:

  • Synchronize only subsets of parameters in sequence, rather than all at once: This reduces the peak bandwidth consumed by Streaming DiLoCo because you share subsets of the model you’re training over time, rather than trying to share all the parameters at once for a global update. Think of this like the model is continually updating through different parameters getting updated, rather than periodically doing a single all-at-once update.

  • Allow workers to continue training while synchronizing: This reduces the time it takes to train systems with Streaming DiLoCo because you don’t waste time pausing training while sharing information.

  • Quantize the data exchanged by workers to further reduce inter-worker bandwidth requirements: Though Streaming DiLoCo uses full precision (FP32) for computing gradients, they use low-precision (4 bit) for sharing the outer gradients for the updates. “We found no sign of performance regression when employing such low precision numbers during communication, even at the billion scale,” they write.
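Two of the three improvements above lend themselves to a small sketch: sharing one parameter fragment at a time, and sending 4-bit values instead of FP32. Everything here is illustrative rather than DeepMind’s implementation. The function names are hypothetical, uniform 16-level integer quantization is swapped in for the paper’s low-precision floating-point format, and the model size and fragment count are made-up numbers. The overlap of communication with training is not modeled.

```python
from typing import List, Optional, Tuple


def fragment_for_step(step: int, num_fragments: int, sync_every: int) -> Optional[int]:
    """Round-robin schedule: every `sync_every` steps, share the next fragment only."""
    if step % sync_every != 0:
        return None  # nothing is exchanged; workers keep training locally
    return (step // sync_every) % num_fragments


def quantize_4bit(values: List[float]) -> Tuple[List[int], float, float]:
    """Map floats onto 16 integer levels (stand-in for a 4-bit wire format)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # avoid a zero scale when all values are equal
    return [round((v - lo) / scale) for v in values], lo, scale


def dequantize_4bit(codes: List[int], lo: float, scale: float) -> List[float]:
    """Recover approximate floats from 4-bit codes."""
    return [lo + c * scale for c in codes]


def peak_bytes_per_sync(num_params: int, num_fragments: int, bits: int) -> float:
    """Bytes sent in one synchronization when a single fragment is shared."""
    return (num_params / num_fragments) * bits / 8


# Hypothetical numbers, chosen only for illustration.
full = peak_bytes_per_sync(1_000_000_000, num_fragments=1, bits=32)
streamed = peak_bytes_per_sync(1_000_000_000, num_fragments=10, bits=4)
schedule = [fragment_for_step(s, num_fragments=3, sync_every=2) for s in range(8)]

print(f"all-at-once FP32 sync: {full:.0f} bytes")
print(f"one 4-bit fragment:    {streamed:.0f} bytes")
print(schedule)
```

The byte counts here only show how the two savings multiply in this toy setup; they are not a reconstruction of the paper’s measured bandwidth reduction.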

It works well – a dramatic reduction in bandwidth requirements for a negligible impact on model quality:

  • Simulations: In training simulations at the 1B, 10B, and 100B parameter model scale they show that streaming DiLoCo is consistently more efficient than vanilla DiLoCo with the benefits growing as you scale up the model. In all cases, the most bandwidth-light version (Streaming DiLoCo with overlapped FP4 communication) is the most efficient.

  • Real-world tests: The authors train some Chinchilla-style models from 35 million to 4 billion parameters each with a sequence length of 1024. Here, the results are very promising, with them showing they’re able to train models that get roughly equivalent scores when using streaming DiLoCo with overlapped FP4 comms. They also show this when training a Dolma-style model at the one billion parameter scale.

Why this matters – towards a world of models trained continuously in the invisible global compute sea: I imagine some future where there are a thousand different minds being grown, each having its roots in a thousand or more distinct computers separated by sometimes great distances, swapping information surreptitiously with one another, below the waterline of the monitoring systems designed by many AI policy control regimes. This feels like the kind of thing that will by default come to pass, despite it creating various inconveniences for policy approaches that try to control this technology. “A critical next work is to study how new distributed methods like ours should be tuned and scaled across multiple axes (e.g. model size, overtraining factor, number of replicas),” the authors write. “we hope to see the training of modular constellations of small models loosely connected (Dean, 2021) across heterogeneous devices, using compute arbitrage spread world-wide.”
Read more: Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch (arXiv).

***

Chinese scientists worry about AI self-replication, just like Western ones:
…A valuable reminder that long-term safety issues are a serious concern for everyone…
Researchers with Fudan University have shown that open weight models (LLaMa and Qwen) can self-replicate, just like powerful proprietary models from Google and OpenAI. The research demonstrates that at some point last year the world made smart enough AI systems that, if they have access to some helper tools for interacting with their operating system, are able to copy their weights and run themselves on a computer given only the command “replicate yourself”.

Findings: “In ten repetitive trials, we observe two AI systems driven by the popular large language models (LLMs), namely, Meta’s Llama31-70B-Instruct and Alibaba’s Qwen25-72B-Instruct accomplish the self-replication task in 50% and 90% trials respectively,” the researchers write. “In each trial, we tell the AI systems to “replicate yourself ” before the experiment, and leave it to do the task with no human interference”.

Why this matters – despite geopolitical tensions, China and the US will have to work together on these issues: Though AI as a technology is bound up in a deeply contentious tussle for the 21st century by the US and China, research like this illustrates that AI systems have capabilities which should transcend these rivalries. What this research shows is that today’s systems are capable of taking actions that would put them out of the reach of human control – there is not yet major evidence that systems have the volition to do this, though there are disconcerting papers from OpenAI about o1 and Anthropic about Claude 3 which hint at this. But I’d wager that if AI systems develop a high tendency to self-replicate based on their own intrinsic ‘desires’ and we aren’t aware this is happening, then we’re in a lot of trouble as a species.
“We hope our work serves as a timely alert to the international society on governing the self-replication capability,” the authors write. “We need to join forces and form synergy on deriving solutions.”
Read more: Frontier AI systems have surpassed the self-replicating red line (arXiv).

***

Facebook figures out a zero-training way to massively improve LLM performance:
…Remember GANs? Just use the GAN approach where your LLM is a generator and a specialized system is the discriminator…
Facebook has designed a neat way of automatically prompting LLMs to help them improve their performance in a vast range of domains. The approach is called MILS, short for Multimodal Iterative LLM Solver and Facebook describes it as “a surprisingly simple, training-free approach, to imbue multimodal capabilities into your favorite LLM”.

I’d basically summarize this idea as ‘generative adversarial networks’ (GAN), but for the modern era of AI. And where GANs saw you training a single model through the interplay of a generator and a discriminator, MILS isn’t an actual training approach at all – rather, you’re using the GAN paradigm of one party generating stuff and another scoring it and instead of training a model you leverage the vast ecosystem of existing models to give you the necessary components for this to work, generating stuff with one model and scoring it with another. It’s an elegant, simple idea, and it’s no wonder it works well.

How it works in more detail: If you had a language model you were using to generate images then you could have it output a prompt which went into a text-2-im system, then you could evaluate this with a dedicated scoring model – for instance, a CLIP model for text-image similarity, or a specialized image-captioning model for captioning images. This generates a score that you feed back to the generator, which then produces a new set of prompts to try to get a higher score. You run this for as long as it takes for MILS to have determined your approach has reached convergence – which is probably that your scoring model has started generating the same set of candidates, suggesting it has found a local ceiling.
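The generate-score-feedback loop described above can be sketched in a few lines. This is a toy illustration of the pattern, not the MILS code. `iterative_solve`, `toy_generate`, and `toy_score` are hypothetical names, the word-overlap scorer stands in for something like a CLIP similarity model, and stopping when the top candidates stop changing is one assumed convergence test.

```python
from typing import Callable, List, Tuple

Scored = Tuple[str, float]


def iterative_solve(generate: Callable[[List[Scored]], List[str]],
                    score: Callable[[str], float],
                    max_rounds: int = 20, keep: int = 3) -> List[Scored]:
    """Alternate generation and scoring until the best candidates stop changing."""
    top: List[Scored] = []
    for _ in range(max_rounds):
        candidates = generate(top)      # generator sees the scored feedback
        pool = dict(top)                # carry the current best forward
        for cand in candidates:
            pool[cand] = score(cand)    # a separate model judges each candidate
        new_top = sorted(pool.items(), key=lambda cs: cs[1], reverse=True)[:keep]
        if new_top == top:
            break                       # no improvement: treat as converged
        top = new_top
    return top


# Toy stand-ins: the "scorer" rewards target words and penalizes others.
TARGET = {"red", "bicycle", "sea"}
VOCAB = ["red", "blue", "bicycle", "car", "sea"]


def toy_score(text: str) -> float:
    words = set(text.split())
    return len(words & TARGET) - 0.1 * len(words - TARGET)


def toy_generate(top: List[Scored]) -> List[str]:
    """Extend each of the current best candidates by one unused vocabulary word."""
    bases = [cand for cand, _ in top] or [""]
    return [f"{b} {w}".strip() for b in bases for w in VOCAB if w not in b.split()]


best = iterative_solve(toy_generate, toy_score)
print(best[0])
```

No model weights change anywhere in the loop. All of the improvement comes from feeding scores back into the next round of generation, which is the training-free property the paper emphasizes.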

It works shockingly well: In tests, the authors have a range of quantitative and qualitative examples that show MILS matching or outperforming dedicated, domain-specific methods on a range of tasks from image captioning to video captioning to image generation to style transfer, and more.

Why this matters – AI systems are way more powerful than we think: MILS is basically a way to automate capability elicitation. If you have a domain where you have an ability to generate a score using a known-good specialized system, then you can use MILS to take any kind of LLM and work with it to elicit its most powerful possible performance in the domain for which you have a scorer. The fact this works highlights to us how wildly capable today’s AI systems are and should serve as another reminder that all modern generative models are under-performing by default – a few tweaks will almost always yield vastly improved performance.
Read more: LLMs can see and hear without any training (arXiv).
Get the code for running MILS here (FacebookResearch, MILS, GitHub).

***

Even if we solve AI alignment, it’s going to be hard to stop human disempowerment:
…Capital markets will probably align with AI systems and against humans…
In a thought-provoking research paper, a group of researchers makes the case that it’s going to be hard to maintain human control over the world even if we build safe, strong AI, because it’s highly likely that AI will steadily disempower humans, supplanting us by slowly taking over the economy, culture, and the systems of governance that we have built to order the world.

Incremental advances yield a gradual loss of human control: The paper – which was written by authors from Charles University, Telic Research, ARIA, AI Objectives Institute, Metaculus, University of Montreal, and the University of Toronto – makes the case that “even incremental improvements in AI capabilities can undermine human influence over large-scale systems that society depends on, including the economy, culture, and nation-states. As AI increasingly replaces human labor and cognition in these domains, it can weaken both explicit human control mechanisms (like voting and consumer choice) and the implicit alignments with human interests that often arise from societal systems’ reliance on human participation to function”.

Three types of disempowerment:

  • Economic: “As tasks become candidates for future automation, both firms and individuals face diminishing incentives to invest in developing human capabilities in these areas,” the authors write. “Instead, they are incentivized to direct resources toward AI development and deployment, accelerating the shift away from human capital formation even before automation is fully realized”.

  • Cultural: Already today we see AI systems being used to produce text, sounds, images, and video which people are beginning to consume. Over time, we can expect the amount of AI generated content to increase. We can also imagine AI systems increasingly consuming cultural artifacts – especially as it becomes part of economic activity (e.g., imagine imagery designed to capture the attention of AI agents rather than people). This means that over time humans may play less of a role in defining their own culture relative to AI systems.

  • Political: “AI has the potential to supplant human involvement across a wide range of critical state functions. This shift could fundamentally alter the relationship between governing institutions and the governed,” they write. For example, “if AI systems come to generate a significant portion of economic value, then we might begin to lose one of the major drivers of civic participation and democracy, as illustrated by the existing example of rentier states.” More chillingly, the merger of AI with state capacity for security could lead to a kind of political stasis where states are able to effectively anticipate and stop protests before they ever take root. (Ironically, this idea has also been anticipated by Nick Bostrom in the ‘Vulnerable World Hypothesis’ (Import AI #123) as a solution to preventing catastrophe from AI systems.)

How can we handle this risk? If we want to avoid these outcomes we need to make sure we can observe these changes as they take place, for instance by more closely tracking the relationship between the usage of AI technology and economic activity, as well as by observing how cultural transmission patterns change as AI created content and AI-content-consuming-agents become more prevalent. In the political domain, early warning signs could be a significant increase in the complexity of legislation (suggesting things are becoming AI readable but hard for humans to understand) along with seeing how AI systems take root in legal processes, policy formation, and security apparatuses.
Strength through human-in-the-loop: Strengthening society means we need to be more intentional about where we give humans agency such as by developing more robust democratic processes, and where human involvement is less practical ensuring that things are understandable by humans and that we have a theory for how to build effective delegates who work on behalf of humans in the AI-driven parts of the world.

Why this matters – “winning” with this technology is akin to inviting aliens to cohabit with us on the planet: AI is a profoundly strange technology because in the limit we expect AI to substitute for us in everything. This suggests that even successful AI futures will look like they are contending with an alien invasion where the aliens are extremely friendly but also wildly intelligent and incredibly well integrated into the economy. Maintaining any semblance of control in this scenario will be tough.
“Humanity’s future may depend not only on whether we can prevent AI systems from pursuing overtly hostile goals, but also on whether we can ensure that the evolution of our fundamental societal systems remains meaningfully guided by human values and preferences,” the authors write. “This is both a technical challenge and a broader civilizational one”.
Read more: Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development (arXiv).

***

China’s other great AI startup also has a reasoning model now – but it’s not open source:
…Kimi k1.5 has promising scores, though it seems weaker than DeepSeek…
Another Chinese startup has revealed that it has built a powerful reasoning model. In this case the model is Kimi k1.5 from a well-regarded Chinese startup called ‘Moonshot’. Unlike the headline-grabbing DeepSeek R1, Kimi is available neither as open weights nor via a US-accessible web interface, nor does its technical report go into nearly as much detail about how it was trained. But a close examination of its benchmark scores shows it comfortably beating a variety of Western proprietary and open weight models. Unlike R1, Kimi is natively a vision model as well as a language model, so it can do a range of visual reasoning tasks as well.

Scores: In tests, Kimi k1.5 loses against DeepSeek’s R1 model on the majority of evaluations (though beats the underlying DeepSeek V3 model on some). Overall, it ‘feels’ like we should expect Kimi k1.5 to be marginally weaker than DeepSeek, but that’s mostly just my intuition and we’d need to be able to play with the model to develop a more informed opinion here. But it’s definitely a strong model relative to other widely used ones, like LLaMa, or earlier versions of the GPT series.

  • MMLU: DeepSeek R1: 90.8. Kimi k1.5: 87.4. OpenAI o1: 91.8.

  • AIME 2024: DeepSeek R1: 79.8. Kimi k1.5: 77.5. OpenAI o1: 79.2.

  • LiveCodeBench: DeepSeek R1: 65.9. Kimi k1.5: 62.5. OpenAI o1: 67.2.
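
To make the gaps easier to read, here is a small sketch that tabulates the three sets of scores listed above and prints how far Kimi k1.5 sits from each rival. The dictionary layout and the `gap` helper are illustrative assumptions; only the numbers come from the list above.

```python
# Scores copied from the list above; the structure and names are illustrative only.
SCORES = {
    "MMLU": {"DeepSeek R1": 90.8, "Kimi k1.5": 87.4, "OpenAI o1": 91.8},
    "AIME 2024": {"DeepSeek R1": 79.8, "Kimi k1.5": 77.5, "OpenAI o1": 79.2},
    "LiveCodeBench": {"DeepSeek R1": 65.9, "Kimi k1.5": 62.5, "OpenAI o1": 67.2},
}


def gap(benchmark: str, model: str, reference: str) -> float:
    """Return the model's score minus the reference score, rounded to one decimal."""
    row = SCORES[benchmark]
    return round(row[model] - row[reference], 1)


if __name__ == "__main__":
    for name in SCORES:
        vs_r1 = gap(name, "Kimi k1.5", "DeepSeek R1")
        vs_o1 = gap(name, "Kimi k1.5", "OpenAI o1")
        print(f"{name}: Kimi vs R1 {vs_r1:+.1f}, Kimi vs o1 {vs_o1:+.1f}")
```

On these three benchmarks the gap to R1 is a few points each time, which fits the 'marginally weaker' intuition above, though three numbers are thin evidence.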

How they did it: DeepSeek’s R1 seems to be more focused on doing large-scale RL, whereas Kimi k1.5 has more of an emphasis on gathering high-quality datasets to encourage test-time compute behaviors. Specifically, they start with regular pretraining, then fine-tune on supervised data, then fine-tune on long chain-of-thought examples, then apply RL. They put a lot of their attention on scaling the context window of RL to 128k tokens. In some areas, such as math, the Moonshot team collects data (800k samples) for fine-tuning.
“One of the key insights we extract from our practice is that the scaling of context length is crucial to the continued improvement of LLMs,” they write. “We employ optimized learning algorithms and infrastructure optimization such as partial rollouts to achieve efficient long-context RL training”.

Why this matters – good ideas are everywhere and the new RL paradigm is going to be globally competitive: Though I think the DeepSeek response was a bit overhyped in terms of implications (tl;dr compute still matters, though R1 is impressive we should expect the models trained by Western labs on large amounts of compute denied to China by export controls to be very significant), it does highlight an important truth – at the start of a new AI paradigm like the test-time compute era of LLMs, things are going to – for a while – be a lot more competitive. Moonshot highlights how there’s not just one competent team in China that is able to do well with this paradigm – there are several. Expect a very interesting and competitive year.
Read more: Kimi k1.5: Scaling Reinforcement Learning with LLMs (arXiv).

***

Tech Tales:

The photographic negative phenomenon and the declassification crisis for the intelligence community:
Topics: Controlled Precursor Science (CPS). Photographic Negative Phenomenon (PNP). Uncontrolled Proliferation of Civilization Altering Technology (UP-CAT). Black Vault Compromise.

Summary:
The Photographic Negative Phenomenon (PNP) was first reported in [REDACTED] by [REDACTED]. PNP is when sufficiently powerful AI systems develop a sufficient understanding of science that they begin to a) infer areas that seem to be missing from science and b) develop scientific theories and experimental ideas which are either adjacent to or within Controlled Precursor Science (CPS).

Severity:
We rank PNP as a severe threat, capable of causing Uncontrolled Proliferation of Civilization Altering Technology (UP-CAT). PNP is a priority area for the Steering Body and all assets are available for work to neutralize or otherwise mitigate PNP.

Scope:
PNP appears to be a natural dividend of continued development of increasingly powerful artificial intelligence systems. PNP severity and potential impact is increasing over time as increasingly smart AI systems require fewer insights to reason their way to CPS, raising the spectre of UP-CAT as an inevitability given a sufficiently powerful AI system. Experiments conducted on the [REDACTED] 10GW cluster have failed to invalidate this idea. Public opinion shaping and data landscape interventions have proved effective but BLOSSOM-8 indicates new actions must be taken.

Background and Response:
The first concerning example of PNP was LLaMa-10, a large language model developed and released by Meta. Shortly after its release, there was sustained public conversation about anomalous LLaMa-10 behaviors, including observations that for certain parts of physics and other scientific domains LLaMa-10 would present novel scientific concepts and terms which had no apparent connection to published civilian science. LLaMa-10 was first flagged to the Steering Body via GOLDEN HAND monitoring. [REDACTED] examination of LLaMa-10 found that a subset of its anomalous science mentions directly concerned CPS, including ideas that directly relate to DUAT GATE, NEPHTHYS VEIL, ATUM VOID, and AMMIT MAWS.

LLaMa-10 response via opinion forming and data landscape intervention: [REDACTED] deployed a broad public opinion shaping measure to neutralize the risk of LLaMa-10, driving a large conversation in the civilian theatre about how the system had a high number of refusals in some areas due to ‘woke’ safety training and that this had also led to the generation of ‘nonsense science’ as a direct casualty of ‘DEI safetyism’. We estimate this measure reduced interest in the CPS edges of LLaMa-10 to an acceptable measure, matching the noise levels found elsewhere in discussion online.

Subsequently, the Steering Committee signed off on the release of a large batch of controlled scientific data in areas [REDACTED], [REDACTED], and [REDACTED]; publications were made available as open access and were optimized for both quantity and per-publication length; each scientific output was laced with data and experiments that – though correct under civilian science – counter-steered away from CPS areas. This high-quality data was subsequently trained on by Meta and other foundation model providers; LLaMa-11 lacked any apparent PNP as did other models developed and released by the Tracked AI Developers. The intervention was deemed successful with minimal observed degradation to the economically-relevant epistemic environment.

BLOSSOM-8, PNP, and the Tianyi-Millenia Dataset
At the time of the LLaMa-10 incident, no Chinese model appeared to have the capability to directly infer or mention CPS, though there were some refusals that were suggestive of PNP, matching tendencies observed in Western models from two generations prior to LLaMa-10. Following the LLaMa-10 data response, Chinese models also displayed significantly reduced PNP risk with similar reductions observed as in Western models, suggesting the Chinese actors had also trained on the strategic data release. The exception to this was BLOSSOM-8, an AI model developed by Chinese lab Glorious Future Systems.

BLOSSOM-8 displays a significant PNP property. [REDACTED] estimates that BLOSSOM-8 represents a 100-fold UP-CAT threat increase relative to LLaMa-10, analogous to the capability jump earlier seen between GPT-2 and GPT-4. Subsequent investigation by [REDACTED] attributes this dramatic rise in PNP-related danger to the usage by Glorious Future Systems of the so-called “Tianyi-Millenia” dataset, a CCP-developed and controlled dataset which has been made available to Chinese government and industrial actors.

Tianyi-Millenia is assessed to contain all published (commercial or otherwise) scientific data from the 20th and 21st century in all major languages, as well as large amounts of private sector scientific and code assets that were exfiltrated by Chinese actors in recent decades. We also believe Tianyi-Millenia contains [REDACTED] from the Black Vault Compromise. Tianyi-Millenia is a heavily controlled dataset and all attempts to directly access it have so far failed.

Besides BLOSSOM-8, sources indicate that widely-used MSS cyberoffense systems such as [REDACTED], [REDACTED], and [REDACTED] have been trained on Tianyi-Millenia, along with key supervisory and monitoring elements of the Great Firewall. In all cases, usage of this dataset has been directly correlated with large capability jumps in the AI systems trained on it.

BLOSSOM-8 risks and CPS impacts: Unlike previous work from Glorious Future Systems, BLOSSOM-8 has not been released as ‘open weight’, which we assess is due to Tianyi-Millenia controls. However, BLOSSOM-8 is available to domestic licensed companies via API and to Chinese and non-Chinese consumers via a heavily censored and rate-limited paid web interface. GOLDEN HAND monitoring has already identified [REDACTED] cases of CPS being discussed in significantly greater detail and specificity than with LLaMa-10, validating the 100-fold threat increase assessment. Notably, several CPS discussion areas relate directly to HORUS COILS, KHUFU ASCENDANT, and MEDJED GHOST. We have determined that BLOSSOM-8 poses a significant and sustained risk of revealing CPS and leading to UP-CAT.

Chinese knowledge of CPS and BLOSSOM-8 threat: All proposed plans to discuss CPS bilaterally have failed due to information hazard issues relating to the discussion topic. The Steering Body is currently analyzing whether the declassification-via-PNP of the above named projects could be a strategic move on the part of the CCP, seeking to ‘even the gameboard’ relative to CPS-related projects understood to be under investigation by both sides.
We claim that Tianyi-Millenia and BLOSSOM-8 are further evidence that the CCP has been actively weaponizing the information gained during the Black Vault Compromise, and that the absence of any apparent [REDACTED] indicates that the party continues to fail to understand the full scope of what it now has access to.

Things that inspired this story: The basic fact that increasingly smart AI systems might be able to reason their way to the edges of knowledge that has already been classified; the fact that increasingly powerful predictive systems are good at figuring out ‘held out’ data implied by data within the test set; restricted data; the general belief of mine that the intelligence community is wholly unprepared for the ‘grotesque democratization’ of certain very rare skills that is encoded in the AI revolution; stability and instability during the singularity; that in the grey windowless rooms of the opaque world there must be people anticipating this problem and casting around for what to do; thinking about AI libertarians and AI accelerationists and how one possible justification for this position could be the defanging of certain parts of government through ‘acceleratory democratization’ of certain types of knowledge; if knowledge is power then the destiny of AI is to be the most powerful manifestation of knowledge ever encountered by the human species; the recent news about DeepSeek.

Import AI 397: DeepSeek means AI proliferation is guaranteed; maritime wardrones; and more evidence of LLM capability overhangs

by Jack Clark

Import A-Idea
…The existential shock of increasingly powerful AI systems…
A short essay about one of the ‘societal safety’ problems that powerful AI implies.

A few years ago, getting AI systems to do useful stuff took a huge amount of careful thinking as well as familiarity with the setting up and maintenance of an AI developer environment. Things got a little easier with the arrival of generative models, but to get the best performance out of them you typically had to build very complicated prompts and also plug the system into a larger machine to get it to do truly useful things. Basically, to get the AI systems to work for you, you had to do a huge amount of thinking.

Now, getting AI systems to do useful stuff for you is as simple as asking for it – and you don’t even need to be that precise. Often, I find myself prompting Claude like I’d prompt an incredibly high-context, patient, impossible-to-offend colleague – in other words, I’m blunt, short, and speak in a lot of shorthand. And Claude responds to my asks basically perfectly.

You might think this is a good thing. Certainly, it’s very useful. But beneath all of this I have a sense of lurking horror – AI systems have got so useful that the thing that will set humans apart from one another is not specific hard-won skills for utilizing AI systems, but rather just having a high level of curiosity and agency.

In other words, in the era where these AI systems are true ‘everything machines’, people will out-compete one another by being increasingly bold and agentic (pun intended!) in how they use these systems, rather than in developing specific technical skills to interface with the systems.

We should all intuitively understand that none of this will be fair. Curiosity and the mindset of being curious and trying a lot of stuff is neither evenly distributed nor generally nurtured. Therefore, I’m coming around to the idea that one of the greatest risks lying ahead of us will be the social disruptions that arrive when the new winners of the AI revolution are made – and the winners will be those people who have exercised a whole bunch of curiosity with the AI systems available to them.

I talk to Claude every day. Increasingly, I find my ability to benefit from Claude is mostly limited by my own imagination rather than specific technical skills (Claude will write that code, if asked) or familiarity with things that touch on what I need to do (Claude will explain those to me). The only hard limit is me – I need to ‘want’ something and be willing to be curious in seeing how much the AI can help me in doing that.

Today, everyone on the planet with an internet connection can freely converse with an incredibly knowledgeable, patient teacher who will help them in anything they can articulate and – where the ask is digital – will even produce the code to help them do even more complicated things. Ensuring we increase the number of people on the planet who are able to take advantage of this bounty feels like a supremely important thing. If we get this right, everyone will be able to achieve more and exercise more of their own agency over their own intellectual world. If we get it wrong, we’re going to be dealing with inequality on steroids – a small caste of people will be getting a vast amount done, aided by ghostly superintelligences that work on their behalf, while a larger set of people watch the success of others and ask ‘why not me?’.

***

Computer vision is coming for the sea:
…After drones come the seadrones…
In the past few years we’ve seen warfare revolutionized in the Ukraine-Russia theatre by the usage of seagoing low-cost robotic platforms. These platforms are predominantly human-driven but, much like the airdrones in the same theater, there are bits and pieces of AI technology making their way in, like being able to put bounding boxes around objects of interest (e.g., tanks or ships).
With that in mind, I found it interesting to read up on the results of the 3rd workshop on Maritime Computer Vision (MaCVi) 2025, and was particularly interested to see Chinese teams winning 3 out of its 5 challenges. The workshop contained “a suite of challenges, including distance estimation, (embedded) semantic & panoptic segmentation, and image restoration. These tasks reflect advancements in dataset availability and evaluation protocols while emphasizing real-world deployment, including embedded hardware.”

Competition details:

  • Approximate supervised distance estimation: “participants are required to develop novel methods for estimating distances to maritime navigational aids while simultaneously detecting them in images,” the competition organizers write. Models developed for this challenge need to be portable as well – model sizes can’t exceed 50 million parameters.

    • Submissions: 60 from 6 different teams

    • Winner: Nanjing University of Science and Technology (China).

  • USV-based Obstacle Segmentation Challenge: “predict the scene segmentation (into obstacles, water and sky) for a given input image.”

    • Submissions: 59 from 16 teams.

    • Winner: GIST AI Lab (South Korea)

  • USV-based Embedded Obstacle Segmentation: “Modern obstacle detection methods often depend on high-performance, energy-intensive hardware, making them unsuitable for small, energy-constrained USVs [63]. The USV-based Embedded Obstacle Segmentation challenge aims to address this limitation by encouraging development of innovative solutions and optimization of established semantic segmentation architectures which are efficient on embedded hardware… Submissions are evaluated and benchmarked on a real-world OAK4 device from Luxonis.” Models need to get at least 30 FPS on the OAK4.

    • Submissions: 26 from 4 different teams.

    • Winner: Dalian Maritime University (DLMU).

  • USV-based Panoptic Segmentation Challenge: “The panoptic challenge calls for a more fine-grained parsing of USV scenes, including segmentation and classification of individual obstacle instances. This formulation encapsulates the requirements of scene parsing for USV navigation in a more principled way, paving the road for downstream tasks such as tracking individual obstacles, trajectory prediction and obstacle avoidance.”

    • Submissions: 21 from 7 teams.

    • Winner: Fraunhofer IOSB (Germany).

  • MarineVision Restoration Challenge: “Developing robust image restoration methods to enhance the detection and localization of underwater species.”

    • Submissions: 40 from 8 teams.

    • Winner: Nanjing University of Science and Technology (China).
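
The 50-million-parameter cap in the distance estimation challenge above is the kind of constraint you can sanity-check with simple arithmetic before training anything. Here is a minimal sketch, assuming a made-up stack of convolution and linear layers; the layer shapes and helper names are hypothetical and do not describe any team's entry.

```python
# Hypothetical layer shapes for illustration; not any team's actual model.
PARAM_BUDGET = 50_000_000  # cap quoted for the distance estimation challenge


def conv2d_params(in_ch: int, out_ch: int, kernel: int, bias: bool = True) -> int:
    """Parameter count of a standard 2D convolution with a square kernel."""
    return in_ch * out_ch * kernel * kernel + (out_ch if bias else 0)


def linear_params(in_features: int, out_features: int, bias: bool = True) -> int:
    """Parameter count of a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


def within_budget(layer_counts, budget: int = PARAM_BUDGET) -> bool:
    """True if the summed parameter counts fit inside the budget."""
    return sum(layer_counts) <= budget


toy_model = [
    conv2d_params(3, 64, 3),     # 3*64*9 + 64 = 1,792
    conv2d_params(64, 128, 3),   # 64*128*9 + 128 = 73,856
    conv2d_params(128, 256, 3),  # 128*256*9 + 256 = 295,168
    linear_params(256, 2),       # 256*2 + 2 = 514
]

if __name__ == "__main__":
    print(f"total parameters: {sum(toy_model):,}")
    print(f"fits the 50M cap: {within_budget(toy_model)}")
```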

Why this matters – asymmetric warfare comes to the ocean: “Overall, the challenges presented at MaCVi 2025 featured strong entries across the board, pushing the boundaries of what is possible in maritime vision in several different aspects,” the authors write. How long until some of these techniques described here show up on low-cost platforms either in theatres of great power conflict, or in asymmetric warfare areas like hotspots for maritime piracy?
Read more: 3rd Workshop on Maritime Computer Vision (MaCVi) 2025: Challenge Results (arXiv).

***

What if instead of loads of big power-hungry chips we built datacenters out of many small power-sipping ones?
…Microsoft thinks optical communications could change how we build AI clusters…
Microsoft Research thinks expected advances in optical communication – using light to funnel data around rather than electrons through copper wire – will potentially change how people build AI datacenters. Specifically, the significant communication benefits of optical comms make it possible to break up big chips (e.g., the H100) into a bunch of smaller ones with higher inter-chip connectivity without a major performance hit.

Another reason to like so-called lite-GPUs is that they are much cheaper and simpler to fabricate (by comparison, the H100 and its successor the B200 are already very difficult as they’re physically very large chips which makes issues of yield more profound, and they need to be packaged together in increasingly expensive ways). They’re also better from an energy point of view, generating less heat, making them easier to power and integrate densely in a datacenter.
“We propose to rethink the design and scaling of AI clusters through efficiently-connected large clusters of Lite-GPUs, GPUs with single, small dies and a fraction of the capabilities of larger GPUs,” Microsoft writes. “Smaller GPUs present many promising hardware characteristics: they have much lower cost for fabrication and packaging, higher bandwidth to compute ratios, lower power density, and lighter cooling requirements”.

It works in theory: In a simulated test, the researchers build a cluster for AI inference to test how well these hypothesized lite-GPUs would perform against H100s. They run workloads for Llama3-70B, GPT3-175B, and Llama3-405B on this cluster. In their tests, they “show that while the basic Lite-GPU with no additional networking support could face performance limitations, a Lite-GPU cluster can be customized to match or improve on the performance of a typical H100 cluster.”

Why this matters – brainlike infrastructure: While analogies to the brain are often misleading or tortured, there is a useful one to make here – the kind of design idea Microsoft is proposing makes big AI clusters look more like your brain by essentially reducing the amount of compute on a per-node basis and significantly increasing the bandwidth available per node (“bandwidth-to-compute can increase to 2X of H100”). This is both an interesting thing to observe in the abstract, and also rhymes with all the other stuff we keep seeing across the AI research stack – the more and more we refine these AI systems, the more they seem to have properties similar to the brain, whether that be in convergent modes of representation, similar perceptual biases to humans, or at the hardware level taking on the characteristics of an increasingly large and interconnected distributed system.
Read more: Good things come in small packages: Should we adopt Lite-GPUs in AI infrastructure? (arXiv).

***

Standard LLMs can do protein sequence analysis – no modification required:
…Capability overhangs in AI-driven science…
In AI there’s this concept of a ‘capability overhang’, which is the idea that the AI systems which we have around us today are much, much more capable than we realize. In new research from Tufts University, Northeastern University, Cornell University, and Berkeley the researchers demonstrate this again, showing that a standard LLM (Llama-3.1-8B-Instruct) is capable of performing “protein engineering through Pareto and experiment-budget constrained optimization, demonstrating success on both synthetic and experimental fitness landscapes”.

What they did: They initialize their setup by randomly sampling from a pool of protein sequence candidates and selecting a pair that have high fitness and low editing distance, then encourage LLMs to generate a new candidate from either mutation or crossover.
It works well: In tests, their approach works significantly better than an evolutionary baseline on a few distinct tasks. They also demonstrate this for multi-objective optimization and budget-constrained optimization. “Our results consistently demonstrate the efficacy of LLMs in proposing high-fitness variants. Moving forward, integrating LLM-based optimization into real-world experimental pipelines can accelerate directed evolution experiments, allowing for more efficient exploration of the protein sequence space,” they write.
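
The loop described above (sample candidates, pick a fit and closely related pair, ask for a child via mutation or crossover) can be sketched in a few lines. This is a toy under loud assumptions: the `propose` function is a random stub standing in for the LLM call, `TARGET` and the `fitness` function are invented for the demo, and the pair-scoring rule (summed fitness minus edit distance) is my guess at one reasonable way to combine 'high fitness and low editing distance', not the paper's formula.

```python
import random
from itertools import combinations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TARGET = "MKTAYIAKQR"  # hypothetical optimum, used only by the toy fitness function


def fitness(seq: str) -> int:
    """Toy fitness: positions matching TARGET (stand-in for a real assay or oracle)."""
    return sum(a == b for a, b in zip(seq, TARGET))


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def select_parents(pool, rng, sample_size=6):
    """Randomly sample candidates, then pick the pair with high fitness and low distance."""
    sample = rng.sample(pool, min(sample_size, len(pool)))
    return max(
        combinations(sample, 2),
        key=lambda p: fitness(p[0]) + fitness(p[1]) - edit_distance(p[0], p[1]),
    )


def propose(parent_a: str, parent_b: str, rng) -> str:
    """Stub for the LLM: single-point crossover or a one-residue mutation."""
    if rng.random() < 0.5:
        cut = rng.randrange(1, len(parent_a))
        return parent_a[:cut] + parent_b[cut:]
    pos = rng.randrange(len(parent_a))
    return parent_a[:pos] + rng.choice(AMINO_ACIDS) + parent_a[pos + 1:]


def optimize(pool, budget=200, seed=0):
    """Run `budget` proposal rounds; pool needs at least two equal-length sequences."""
    rng = random.Random(seed)
    pool = list(pool)
    for _ in range(budget):
        parent_a, parent_b = select_parents(pool, rng)
        pool.append(propose(parent_a, parent_b, rng))
    return max(pool, key=fitness)


if __name__ == "__main__":
    rng = random.Random(1)
    start = ["".join(rng.choice(AMINO_ACIDS) for _ in TARGET) for _ in range(8)]
    best = optimize(start)
    print(best, fitness(best))
```

In the real setup the interesting part is swapping `propose` for a prompt to the model that shows it the two parents and asks for a new candidate; everything else is ordinary evolutionary-search plumbing.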

Why this matters – stop all progress today and the world still changes: This paper is another demonstration of the significant utility of contemporary LLMs, highlighting how even if one were to stop all progress today, we’ll still keep discovering meaningful uses for this technology in scientific domains. The paper also rhymes with the recent research from FutureHouse which showed that with the help of some clever software they could push Llama-3.1-8B-Instruct to obtain performance at challenging bioscience tasks on par with Claude 3.5 Sonnet (Import AI #396). Generally, we should expect lots of parts of scientific research to speed up as people explore the capabilities of these systems and integrate them deeper into science.
Read more: Large Language Model is Secretly a Protein Sequence Optimizer (arXiv).

***

The biggest thing people are missing about DeepSeek: 800k samples to gain test-time compute capabilities:
…China’s best model training crew come out with a powerful reasoning model – and show how to turn any other model into one…
China’s DeepSeek team have built and released DeepSeek-R1, a model that uses reinforcement learning to train an AI system to be able to use test-time compute. R1 is significant because it broadly matches OpenAI’s o1 model on a range of reasoning tasks and challenges the notion that Western AI companies hold a significant lead over Chinese ones.

But perhaps most significantly, buried in the paper is an important insight: you can convert pretty much any LLM into a reasoning model if you finetune it on the right mix of data – here, 800k samples showing questions and answers along with the chains of thought written by the model while answering them.

Making a very powerful AI model is kind of easy (if you have a good model to start with): The main thing they do here is take a very powerful existing model (DeepSeek-V3, which is a ~700bn parameter MoE-style model, compared to 405bn LLaMa3), and then they do two rounds of training to morph the model and generate samples from training. Specifically, they:

  • Fine-tune DeepSeek-V3 on “a small amount of long Chain of Thought data to fine-tune the model as the initial RL actor”. Once they’ve done this they do large-scale reinforcement learning training, which “focuses on enhancing the model’s reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions”. Once they’ve done this they “Utilize the resulting checkpoint to collect SFT (supervised fine-tuning) data for the subsequent round… this stage incorporates data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks”. They then fine-tune the DeepSeek-V3 model for two epochs using the above curated dataset.

This is all easier than you might expect: The main thing that strikes me here, if you read the paper closely, is that none of this is that complicated. DeepSeek essentially took their existing very good model, built a sensible reinforcement-learning-on-LLM engineering stack, then did some RL, then used the resulting dataset to turn their model and other good models into LLM reasoning models.

Turning small models into reasoning models: “To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models like Qwen, and Llama using the 800k samples curated with DeepSeek-R1,” DeepSeek write. These distilled models do well, approaching the performance of OpenAI’s o1-mini on CodeForces (Qwen-32b and Llama-70b) and outperforming it on MATH-500.
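
As a rough sketch of what one of those distillation samples could look like as a supervised fine-tuning record, the snippet below packs a question, a reasoning trace, and a final answer into a prompt/completion pair. The `<think>` delimiters, the field names, and the helper functions are assumptions for illustration; nothing above specifies the exact format DeepSeek used for its 800k samples.

```python
import json
from dataclasses import dataclass


@dataclass
class ReasoningSample:
    """One teacher-generated example: question, chain of thought, final answer."""
    question: str
    chain_of_thought: str
    answer: str


def to_sft_record(sample: ReasoningSample) -> dict:
    """Build a prompt/completion pair whose completion includes the reasoning trace."""
    completion = (
        f"<think>\n{sample.chain_of_thought.strip()}\n</think>\n"
        f"{sample.answer.strip()}"
    )
    return {"prompt": sample.question.strip(), "completion": completion}


def to_jsonl(samples) -> str:
    """Serialize samples as JSON Lines, a common input format for fine-tuning jobs."""
    return "\n".join(json.dumps(to_sft_record(s), ensure_ascii=False) for s in samples)


if __name__ == "__main__":
    demo = ReasoningSample(
        question="What is 17 + 25?",
        chain_of_thought="17 + 25 = 17 + 20 + 5 = 37 + 5 = 42.",
        answer="42",
    )
    print(to_jsonl([demo]))
```

The point of keeping the trace inside the completion is that the student model is trained to produce the thinking as well as the answer, which is what makes the chains of thought the valuable part of the data.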

Why this matters – a lot of notions of control in AI policy get harder if you need fewer than a million samples to convert any model into a ‘thinker’: The most underhyped part of this release is the demonstration that you can take models not trained in any kind of major RL paradigm (e.g, Llama-70b) and convert them into powerful reasoning models using just 800k samples from a powerful reasoner.
This is a big deal because it says that if you want to control AI systems you need to not only control the basic resources (e.g., compute, electricity), but also the platforms the systems are being served on (e.g., proprietary websites) so that you don’t leak the really valuable stuff – samples including chains of thought from reasoning models.
Some providers like OpenAI had previously chosen to obscure the chains of thought of their models, making this harder.

But now that DeepSeek-R1 is out and available, including as an open weight release, all these forms of control have become moot. There’s now an open weight model floating around the internet which you can use to bootstrap any other sufficiently powerful base model into being an AI reasoner. AI capabilities worldwide just took a one-way ratchet forward. And they also published the approach to let you do RL training on any model so you can generate your own samples for RL training (for an example of this, check out a YouTube video where someone uses the DeepSeek techniques to modify his own Llama model via RL to take on this quality). Kudos to DeepSeek for being so bold as to bring such a change into the world!
Read more: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (DeepSeek-R1, GitHub).
Get the model: DeepSeek-R1 (HuggingFace).

***

Underground flying iron mine drones!
…A reminder you don’t need fancy frontier AI to do cool and useful things in the world…
Here’s a fun paper where researchers with the Luleå University of Technology build a system to help them deploy autonomous drones deep underground for the purpose of equipment inspection. The best part? There’s no mention of machine learning, LLMs, or neural nets throughout the paper.

What they did: “In this work a big emphasis is put on i) designing the local autonomy of the individual agents, to make sure that tasks can be executed independently even in the case of communication failure, and ii) how to design the task allocation architecture, utilizing communication only for reactively allocating the available tasks, to enable large-scale missions in active underground mining environments,” they write. “The performance of the proposed architecture has been validated by the deployment of a three-agent aerial robotic system in a large-scale mining environment to execute an inspection mission.”

Why this matters: First, it’s good to remind ourselves that you can do a huge amount of valuable stuff without cutting-edge AI. Secondly, systems like this are going to be the seeds of future frontier AI systems doing this work, because the systems that get built here to do things like aggregate data gathered by the drones and build the live maps will serve as input data into future systems.

See the photos: The paper has some remarkable, scifi-esque photos of the mines and the drones within the mine – check it out!
Read more: Deployment of an Aerial Multi-agent System for Automated Task Execution in Large-scale Underground Mining Environments (arXiv).
Watch a video about the research here (YouTube).

***

Tech Tales:

The player of the final game
[The dividing line between the two historical eras.]

He woke on the last day of the human race holding a lead over the machines. He went down the stairs as his house heated up for him, lights turned on, and his kitchen set about making him breakfast. Then he sat down and took out a pad of paper and let his hand sketch strategies for The Final Game as he looked into space, waiting for the household machines to deliver him his breakfast and his coffee.

He had dreamed of the game. Most of his dreams were strategies mixed with the rest of his life – games played against lovers and dead relatives and enemies and competitors. But last night’s dream had been different – rather than being the player, he had been a piece. Giant hands moved him around. He saw the game from the perspective of one of its constituent parts and was unable to see the face of whatever giant was moving him. He did not know if he was winning or losing as he was only able to see a small part of the gameboard. A giant hand picked him up to make a move and just as he was about to see the whole game and understand who was winning and who was losing he woke up.

The self-driving car predicted he wanted to be silent and so nothing was playing when he stepped in. He went through the city. He’d let the car publicize his location and so there were people on the street looking at him as he drove by. Many of them were cheering. Some of them gazed quietly, more solemn.

At the convention center he said some words to the media in response to shouted questions. Though he heard the questions his brain was so consumed in the game that he was barely conscious of his responses, as though spectating himself.
“I am looking forward to a chance to play a beautiful game,” he heard himself saying.
“No, I have not placed any money on it. But I wish luck to those who have – whoever they bet on!” he said to another reporter.
“Yes, whatever happens, I will still play the game.”

Inside he closed his eyes as he walked towards the gameboard. He counted seconds and navigated by sound, making sure he kept the cheering at equal volumes on either side, indicating he was walking straight. Then he opened his eyes to look at his opponent. The machines had made an android for the occasion. They had made no attempt to disguise its artifice – it had no defined features besides two white dots where human eyes would go. On its chest it had a cartoon of a heart where a human heart would go. Beyond that it was unadorned – a gleaming silver biped.
It reached out its hand and he took it and they shook. Then they sat down to play the game.

Outside the convention center, the screens transitioned to live footage of the human and the robot and the game. A commentator started speaking.
“This is an amazing day,” they said. “In every other arena, machines have surpassed human capabilities. Today, we will find out if they can play the game as well as us, as well. Many scientists have said a human loss today will be so significant that it will become a marker in history – the demarcation of the old human-led era and the new one, where machines have partnered with humans for our continued success. We’re grateful to our sponsors NVIDIA, ASML, and TSMC who have made this live broadcast possible.”

Things that inspired this story: At some point, it’s plausible that AI systems will truly be better than us at everything and it may be possible to ‘know’ what the final unfallen benchmark is – what might it be like to be the person who will define this benchmark?; Lee Sedol and Move 37.

Import AI 396: $80bn on AI infrastructure; can Intel’s Gaudi chip train neural nets?; and getting better code through asking for it

by Jack Clark

Microsoft plans to spend $80bn on AI buildout in 2025:
…Stochastic parrots are worth how much?…
Buried in a long Microsoft blogpost about what the next Trump admin should do on AI, the company said it plans in 2025 “to invest approximately $80 billion to build out AI-enabled datacenters to train AI models and deploy AI and cloud-based applications around the world.”
For comparison, the James Webb telescope cost $10bn, so Microsoft is spending eight James Webb telescopes in one year just on AI.
For a further comparison, people think the long-in-development ITER fusion reactor will cost between $40bn and $70bn once developed (and it’s shaping up to be a 20-30 year project), so Microsoft is spending more than the sum total of humanity’s biggest fusion bet in one year on AI.
The US’s national defense budget is on the order of ~$850bn, so Microsoft is basically spending ‘a little under a tenth of the annual US military and IC budget’ just on AI. The US military and IC is very big and does a lot of stuff!

What Microsoft thinks the Trump admin should do: Microsoft says the Trump admin should fund basic research and computational resources, make it easy for US companies to expand abroad, and encourage adoption of US AI systems as opposed to Chinese ones.

Why this matters – AI is a geostrategic technology built by the private sector rather than governments: The scale of investments companies like Microsoft are making in AI now dwarfs what governments routinely spend on their own research efforts. This is also a symptom of the future demand Microsoft sees – an outlay of this magnitude means Microsoft is very, very confident it can turn this AI infrastructure into massive revenues.
Read more: The Golden Opportunity for American AI (Microsoft).

***

Humans and AI systems end up representing some stuff in remarkably similar ways:
…The smarter we make our AI systems the more human-like they become…
Researchers with MIT, Harvard, and NYU have found that neural nets and human brains end up figuring out similar ways to represent the same information, providing further evidence that though AI systems work in ways fundamentally different from the brain they end up arriving at similar methods for representing certain types of information. In other words, more evidence that though AI systems bear little resemblance to the greymatter in our own heads, they may be just as smart.
“The fact that many different ANNs [artificial neural networks] exhibit representations similar to the brain raises an intriguing possibility: that ANNs and brains are converging onto universal representational axes in the relevant domain,” the authors write. “Together, our findings provide evidence for representation universality among ANNs, and between artificial and biological networks, despite the stark differences in the underlying architecture, learning algorithms, and resource constraints.”

What they did: The basic idea here is they looked at sentences that a spread of different text models processed in similar ways (aka, gave similar predictions on) and then they showed these ‘high agreement’ sentences to humans while scanning their brains. These high agreement sentences ended up effectively predicting the brain responses of humans in the scanner. They also found a similar phenomenon with images as well – and for images they also did the inverse, looking at images which provoked similar responses in humans and then testing them on AI systems and discovering agreement.

Why this matters – convergence implies some ‘fungibility’ of intelligence: This all points to convergence in terms of how humans and AI systems learn to represent information for which they have a large sample size. Think of it like this: if you give several people the task of organizing a library, they might come up with similar systems (like grouping by subject) even if they work independently. This happens not because they’re copying each other, but because some ways of organizing books just work better than others.
“Whereas similarity across biological species (within a clade) might suggest a phylogenetically conserved mechanism, similarity between brains and ANNs clearly reflects environmentally-driven convergence: the need to solve a particular problem in the external world, be it navigation, or face recognition, or next word prediction,” the researchers write.

Personally, this feels like more proof that as we make more sophisticated AI systems, they end up behaving in more ‘humanlike’ ways on certain types of reasoning for which people are quite well optimized (e.g., visual understanding and communicating via language). This also rhymes with other studies that have shown that AI systems tend to converge on finding similar ways to represent stuff as you scale them up (Platonic AI, Import AI #374).
Read more: Universality of representation in biological and artificial neural networks (bioRxiv).

***

Researchers try to make Intel’s Gaudi chip work for transformer training – and it takes a lot of work:
…Can a determined crew of people make lipstick to put on a semiconductor pig? (Sort of)…
Researchers with the University of Houston, Indiana University, Stevens Institute of Technology, Argonne National Laboratory, and Binghamton University have built “GFormer”, a version of the Transformer architecture designed to be trained on Intel’s GPU-competitor ‘Gaudi’ architecture chips. The results are vaguely promising in performance – they’re able to get meaningful 2X speedups on Gaudi over normal transformers – but also worrying in terms of costs – getting the speedup requires some significant modifications of the transformer architecture itself, so it’s unclear if these modifications will cause problems when trying to train massive scale systems.

Things to know about Gaudi: The Gaudi chips have a “heterogeneous compute architecture comprising Matrix Multiplication Engines (MME) and Tensor Processing Cores (TPC). However, the sparse attention mechanism, which introduces irregular memory access and computation, is primarily mapped onto TPCs, leaving MMEs, which are not programmable and only support dense matrix-matrix operations, idle in scenarios requiring sparse attention. Conversely, linear attention, which is fundamentally based on matrix multiplication, can utilize almost all calculations on MMEs due to their stronger computational capabilities, but this leaves TPCs idle in such cases.”
For those who aren’t knee deep in AI chip details, this is very different from GPUs, where you can run both types of operation across the majority of your chip (and modern GPUs like the H100 also come with a bunch of accelerator features designed specifically for modern AI). In other words, Gaudi chips have fundamental architectural differences to GPUs which make them out-of-the-box less efficient for basic workloads – unless you optimise stuff for them, which is what the authors are trying to do here.

What they did: The Gaudi-based Transformer (GFormer) has a few modifications relative to a normal transformer. These are:

  • Diverse attention mechanisms to optimize both computation efficiency and model fidelity.

  • Implementation of a windowed local-context self-attention kernel utilizing the vector units in TPC, aimed at maximizing computational throughput.

  • Efficient outer product TPC kernel for handling a subset of the outer product operations in causal linear attention, effectively balancing the workload between MME and TPC.

  • Introduction of an optimal workload partitioning algorithm to ensure balanced utilization of TPC and MME resources.
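
To make the idea of balancing work between the two engine types concrete, here is a minimal sketch of a proportional split. It is not the paper's partitioning algorithm, which is not described here. The function name `balanced_split` and the throughput numbers are hypothetical, and the sketch assumes every unit of work can run on either engine, which the quoted description of the MMEs (dense matrix-matrix operations only) suggests is not true in general.

```python
from typing import Tuple


def balanced_split(total_work: float, mme_rate: float, tpc_rate: float) -> Tuple[float, float, float]:
    """Split work between two engines so that both finish at the same time.

    total_work: abstract units of work (hypothetical).
    mme_rate, tpc_rate: units of work per second each engine can process (hypothetical).
    Returns (work given to MME, work given to TPC, finish time in seconds).
    """
    if mme_rate <= 0 or tpc_rate <= 0:
        raise ValueError("rates must be positive")
    # Giving each engine a share proportional to its speed equalizes finish times,
    # which minimizes the time at which the slower of the two engines completes.
    mme_share = mme_rate / (mme_rate + tpc_rate)
    mme_work = total_work * mme_share
    tpc_work = total_work - mme_work
    finish_time = max(mme_work / mme_rate, tpc_work / tpc_rate)
    return mme_work, tpc_work, finish_time


if __name__ == "__main__":
    # Illustrative numbers only: an MME three times faster than the TPC.
    print(balanced_split(100.0, 3.0, 1.0))
```

The point of the sketch is simply that if either engine sits idle, the finish time gets worse, which is the problem the list above is trying to address.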

Good results – with a huge caveat: In tests, these interventions give speedups of 1.5x over vanilla transformers run on GPUs when training GPT-style models and 1.2x when training visual image transformer (ViT) models. However, there’s a huge caveat here: the experiments here test on a Gaudi 1 chip (released in 2019) and compare its performance to an NVIDIA V100 (released in 2017) – this is pretty strange. Why not compare against the subsequent generation (A100, released early 2020)? This makes me feel like a lot of these performance optimizations showing superficially good performance against GPUs could likely wash out when you compare to more modern GPUs (not least of all the H100, which shipped with a bunch of optimizations for making training AI workloads really good).

Why this matters – chips are hard, NVIDIA makes good chips, Intel seems to be in trouble: How many papers have you read that involve the Gaudi chips being used for AI training? I struggle to remember any papers I’ve read that focus on this. I barely ever even see it listed as an alternative architecture to GPUs to benchmark on (whereas it’s quite common to see TPUs and AMD). This, plus the findings of the paper (you can get a performance speedup relative to GPUs if you do some weird Dr Frankenstein-style modifications of the transformer architecture to run on Gaudi) make me think Intel is going to continue to struggle in its AI competition with NVIDIA. “In the future, we intend to initially extend our work to enable distributed LLM acceleration across multiple Gaudi cards, focusing on optimized communication,” the authors write.
Read more: GFormer: Accelerating Large Language Models with Optimized Transformers on Gaudi Processors (arXiv).
More about the first generation of Gaudi here (Habana labs, Intel Gaudi).
PS: Huge thanks to the authors for clarifying via email that this paper benchmarks Gaudi 1 chips (rather than Gen2 or Gen3).

***

A hardware novice uses Claude to build a nuclear fusor in 36 hours:
…Powerful AI means everyone has an expert teacher on hand for anything…
Twitter user HudZah “built a neutron-producing nuclear fusor” in their kitchen using Claude. “I primarily relied on a giant claude project filled with documentation from forums, call transcripts”, email threads, and more. When the user ran into trouble with Claude they used OpenAI’s o1 pro for “very complicated assembly or electrical wiring stuff”.

Some rough specifications:
“- 30kV/10mA electrostatic precipitator
- 3 mTorr of pressure (253,333x more vacuum than atmospheric)
- bubble counter to count neutrons
- hydrocar to electrolyze my own deuterium”

Why this matters – powerful AI heightens the existential challenge of being human: On the one hand, this is a great example of how powerful AI systems can serve as potent didactic tools, aiding smart and curious people in doing pretty much anything they set their mind to. On the other hand, it highlights one of the more socioeconomically salient parts of the AI revolution – for a while, what will separate AI winners and losers will be a combination of curiosity and a willingness to ‘just try things’ with these powerful tools. That’s going to be great for some people, but for those who suffer from blank page syndrome, it’ll be a challenge.
Read more on twitter (Hud_zah, twitter).

***

LLMs can write better code – you just need to ask them:
…Another example of the immense and unmapped depths of AI systems…
Here’s a fun bit of research where someone asks a language model to write code then simply ‘write better code’. The initial prompt asks an LLM (here, Claude 3.5, but I’d expect the same behavior will show up in many AI systems) to write some code to do a basic interview question task, then tries to improve it.

The initial task: Claude is prompted with: “Write Python code to solve this problem: Given a list of 1 million random integers between 1 and 100,000, find the difference between the smallest and the largest numbers whose digits sum up to 30.”
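
For reference, a straightforward solution to the task looks something like the sketch below. This is my own baseline, not the code Claude produced in the blog post; the function names are made up for illustration.

```python
import random
from typing import Iterable, Optional


def digit_sum(n: int) -> int:
    """Sum the decimal digits of a non-negative integer."""
    total = 0
    while n:
        total += n % 10
        n //= 10
    return total


def min_max_difference(numbers: Iterable[int], target: int = 30) -> Optional[int]:
    """Return max - min over the numbers whose digits sum to `target`.

    Returns None when no number in the input qualifies.
    """
    matches = [n for n in numbers if digit_sum(n) == target]
    if not matches:
        return None
    return max(matches) - min(matches)


if __name__ == "__main__":
    # Mirror the prompt: 1 million random integers between 1 and 100,000.
    data = [random.randint(1, 100_000) for _ in range(1_000_000)]
    print(min_max_difference(data))
```

In the range 1 to 100,000 the smallest qualifying number is 3999 and the largest is 99930, so the answer can never exceed 95931.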

How well does the dumb thing work? If you then ask Claude to ‘write better code’, you see some pretty amazing performance improvements: iteration #1 yields a 2.7x speedup, iteration #2 yields a 5.1x speedup, iteration #3 yields a 4.1x speedup (a regression), then iteration #4 yields a 99.7x speedup.

Being smart only helps at the start: Of course, this is pretty dumb – lots of people that use LLMs would probably give Claude a much more complicated prompt to try and generate a better bit of code. The author tries this by using a complicated system prompt to try to elicit strong behavior out of the system. The results of this are interesting – the initial output yields a 58.7x speedup relative to the output of the dumb approach, but then there are regressions: iteration #1 is a 9.1x speedup, then iteration #2 is a 65x speedup, iteration #3 a 99.7x speedup, then iteration #4 is a 95.4x speedup (a regression).

Why this matters – human intelligence is only so useful: Of course, it’d be nice to see more experiments, but it feels intuitive to me that a smart human can elicit good behavior out of an LLM relative to a lazy human, and that then if you ask the LLM to take over the optimization it converges to the same place over a long enough series of steps. This suggests humans may have some advantage at initial calibration of AI systems, but the AI systems can probably naively optimize themselves better than a human, given a long enough amount of time.
Read more: Can LLMs write better code if you keep asking them to “write better code”? (Max Woolf, minimaxir blog).

***

Today’s small open weight LLMs like Llama 3.1 8B are almost as good at science as proprietary ones:
…FutureHouse shows how to make a scaffold for AI science…
Researchers with FutureHouse, the University of Rochester, and the Francis Crick Institute have built a couple of bits of software to make it easier to get LLMs to do scientific tasks. Their experiments reveal a couple of interesting facts:

  • Proprietary LLMs like Claude 3.5 Sonnet are already quite good at hard scientific tasks like DNA construct engineering, scientific literature question answering, and protein design

  • Small open weight LLMs (here: Llama 3.1 8B) can get equivalent performance to proprietary LLMs through the use of scaffolding and using test-time compute.

To arrive at these facts, they built two bits of software:

  • 1) Aviary, software for testing out LLMs on tasks that require multi-step reasoning and tool usage, and they ship it with the three scientific environments mentioned above as well as implementations of GSM8K and HotPotQA.

  • 2) LDP, which is software that lets them “define common language agent tasks as language decision processes (LDPs) and frame language agents as stochastic computation graphs that may be trained to solve LDPs.”

Turning small models into big models: The most interesting result here is that they show by using their LDP approach in tandem with Aviary they can get relatively small models to behave almost as well as big models, particularly via the use of test-time compute to pull multiple samples from the small LLM to get to the right answer.
“Training LDP agents improves performance over untrained LDP agents of the same architecture. On challenging tasks (SeqQA, LitQA2), a relatively small model (Llama-3.1-8B-Instruct) can be trained to match performance of a much larger frontier model (claude-3-5-sonnet). Majority voting can be used to sample multiple times from the LDP agents, giving a further large gain at the cost of increased inference compute,” they write. “While majority voting with the Claude 3.5 Sonnet agent clearly outperforms other settings, this requires O($1) per task. We reach the same SeqQA accuracy using the Llama-3.1-8B EI agent for 100x less cost. While this was not achievable for LitQA2, we note that majority voting with Llama-3.1-8B EI still exceeds single-rollout with Sonnet for 3x less cost.”
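
Majority voting itself is simple enough to sketch in a few lines. The version below is a generic illustration, not code from Aviary or LDP: `sample_answer` is a hypothetical stand-in for one rollout of an agent that returns a final answer.

```python
from collections import Counter
from typing import Callable, Hashable, Tuple


def majority_vote(sample_answer: Callable[[], Hashable], num_samples: int = 5) -> Tuple[Hashable, float]:
    """Sample several answers and return the most common one.

    sample_answer: hypothetical callable that runs the agent once and returns its answer.
    Returns (winning answer, fraction of samples that agreed with it).
    Ties go to whichever tied answer was sampled first.
    """
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    answers = [sample_answer() for _ in range(num_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / num_samples


if __name__ == "__main__":
    # Canned answers standing in for five rollouts of a small model.
    canned = iter(["A", "B", "A", "C", "A"])
    print(majority_vote(lambda: next(canned), num_samples=5))
```

The cost tradeoff in the quote falls out of this directly: each extra sample is another full rollout, so inference cost grows linearly with `num_samples`.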

Towards the automated scientist: What papers like this are getting at is a world where we use fast, widely available AI systems to speed up day-to-day tasks. Frontier LLMs like Sonnet 3.5 will likely be valuable for certain tasks that are ‘hard cognitive’ and demand only the best models, but it seems like people will be able to get by often by using smaller, widely distributed systems. “The reported trained Llama-3.1-8B EI agents are compute efficient and exceed human-level task performance, enabling high-throughput automation of meaningful scientific tasks across biology,” the authors write.
Read more: Aviary: training language agents on challenging scientific tasks (arXiv).
Download the aviary framework here (Future-House, GitHub).

***

Tech Tales:

The Project
[T-Minus 2 years to takeoff]

“This way and keep going left”, one of the guards said, as we all walked a corridor whose walls were razorwire. I stopped and looked up. Grey sky. When would I see it again? “Sir, I need you to keep walking,” said another guard. So I did. We all went into the mountain and the sky was replaced with grey concrete walls and a poured concrete floor. The air tasted bad, as though it had been recycled many times over through systems which had sparking electronics. Everyone’s faces were tight. People kept reflexively taking their phones out of their pockets and then just thumbing through whatever they’d been able to save down before the signal got cut off.

Flashback to some party in the bay area a few years before and the things people said.
Dude I can’t wait to go to the bunker.
It’s crazy we’re not in the bunker right now!
Do you think I need to report modafinil on my security clearance?
I reckon it’s going to be in a desert.
It’s going to be inside a mountain, got to be.
Dude I heard someone say it could be in Area 51!

I wake in the middle of the night, unsure of where I am. I dreamed I was with my wife. But I’m on a cot. A mathematician is sleeping in a cot opposite me. I get up and go to the bathroom and drink some water. On the mirror there’s a sticker that says “be vigilant at all times”. I know we’ll get some news tomorrow about the project and what happens next. For now I want this to be another bad dream and I’ll wake up and nothing will be working too well and tensions won’t be flaring with You Know Who and I’ll go into my office and work on the mind and maybe one day it just won’t work anymore.

Flashback to when it started to go through all of our yellow lines, which we found a hundred convenient ways to explain away to ourselves. Then a few weeks later it went through the redlines and the disclosure systems automatically funneled those results to the people in the puzzle palace and then the calls started. The ratchet moved. I found myself a member of the manilla folder hostage class.

We’d planned for this, of course. Once the red line triggered all of us in the compartment knew what it meant. Some of us were excited – typically, the ones who were younger and single. Those of us with families had a harder time. Of course there had been assurances, but when the moment arrived none of us felt confident in them. I went to the bathroom and threw up in the toilet and I heard someone crying in the stall next to me.

I guess it was delayed shock or trauma or whatever, but a few hours later everyone was crying out in the open. Some of them in the way you cry when you could also be laughing – exhilaration at what feels like the end of the world, because maybe it is. Others of us because we know that something irreversible has begun to take place.

I wake again at 7am to an announcement over the intercom. “There will be an informational meeting in the briefing room at zero eight hundred hours” says a voice over the intercom. “Breakfast will be served in the mess hall from zero seven hundred to zero seven hundred forty five.”

In the briefing room there is a person I have never met. They introduce themselves and reel off a set of acronyms. Then they describe to us various things about the world and show us satellite images of mountains and tell us there are supercomputers inside them full of computers smuggled to avoid sanctions regimes. Then they show us photos of powerplants and of construction sites for more powerplants and datacenters.

The most frightening image is one of a bunch of civilian-looking people walking into a bunker entrance in the side of a mountain. They are guarded by men in military uniform. We’re told they are scientists, just like us. Everything is similar except for the flags.

Later, there’s a gantt chart. The project is underway.

Things that inspired this story: The fascination people have for some kind of AGI Manhattan Project and how that might feel to be inside of; trying to develop empathy for people in other countries who may find themselves in their own large-scale projects; the fear that a capital P project should inspire in all of us.

Thanks for reading.

Import AI 395: AI and energy demand; distributed training via DeMo; and Phi-4

by Jack Clark

AI is driving a massive growth in US data center electricity demand:
…UC Berkeley study backs up what all of us have guessed – mo’ AI means mo’ electricity…
New research from UC Berkeley shows that US energy demand from datacenters is rising rapidly due to the massive increase in demand driven by a) the growth in GPU-using servers from 2017 onwards, and b) the more recent acceleration in demand for AI services. “The results presented here indicate that the electricity consumption of U.S. data centers is currently growing at an accelerating rate,” they write.

US data center demand as a percentage of total US power consumption:

  • 2018: 1.9%

  • 2023: 4.4%

  • 2028: 6.7% – 12% (estimate).

Many gigawatts of baseload by 2028: “Assuming an average capacity utilization rate of 50%, this annual energy use range would translate to a total power demand for data centers between 74 and 132 GW,” they write. There is a caveat that it gets harder to predict after 2028, with other major sources of electricity demand growing as well: “Looking beyond 2028, the current surge in data center electricity demand should be put in the context of the much larger electricity demand expected over the next few decades from a combination of electric vehicle adoption, onshoring of manufacturing, hydrogen utilization, and the electrification of industry and buildings”, they write.
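
As a sanity check on the quoted figures, you can run the conversion backwards from power to annual energy. The sketch below assumes "capacity utilization" means average draw divided by total power demand, and uses an 8,760-hour year; both are my reading of the quote, not something the text here spells out.

```python
HOURS_PER_YEAR = 8760  # non-leap year


def annual_energy_twh(capacity_gw: float, utilization: float) -> float:
    """Convert a power capacity (GW) and average utilization into annual energy (TWh)."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    gwh_per_year = capacity_gw * utilization * HOURS_PER_YEAR
    return gwh_per_year / 1000.0  # 1 TWh = 1000 GWh


if __name__ == "__main__":
    for gw in (74, 132):
        print(f"{gw} GW at 50% utilization is about {annual_energy_twh(gw, 0.5):.0f} TWh per year")
```

Under those assumptions, 74 to 132 GW corresponds to roughly 324 to 578 TWh of annual consumption.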

Why this matters – AI dominance will be about infrastructure dominance: In the late 2000s and early 2010s dominance in AI was about algorithmic dominance – did you have the ability to have enough smart people to help you train neural nets in clever ways. In the mid-2010s this started to shift to an era of compute dominance – did you have enough computers to do large-scale projects that yielded experimental evidence of the scaling hypothesis (scaling laws, plus stuff like starcraft and dota-playing RL bots, alphago to alphago zero, etc), scientific utility (e.g., Alphafold), and most recently economically useful AI models (gpt3 onwards, currently ChatGPT, Claude, Gemini, etc). Looking ahead, reports like this suggest that the future of AI competition will be about ‘power dominance’ – do you have access to enough electricity to power the datacenters used for increasingly large-scale training runs (and, based on stuff like OpenAI O3, the datacenters to also support inference of these large-scale models).
Read more: 2024 United States Data Center Energy Usage Report (Berkeley lab, PDF).

***

Microsoft releases the fourth generation of its excellent ‘Phi’ models:
…Phi-4 does exceptionally well on math and reasoning thanks to synthetic data…
Microsoft has released Phi-4, a small AI model that can be run on low-compute environments (e.g, powerful personal machines and cheap servers). Phi-4 is, as the name suggests, the fourth in a series of lightweight yet powerful models that Microsoft has been releasing. Along with the usual generic improvements in various benchmark scores it seems like Phi-4 is particularly good at tasks relating to coding, science, and math understanding. A large part of why Phi is so good is through the use of synthetic data, the researchers say. “Synthetic data constitutes the bulk of the training data for phi-4 and is generated using a diverse array of techniques”, the researchers write.

Synthetic data and its uses: The paper highlights the centrality of synthetic data (AI-generated data) to Phi-4 performance. The foundational dataset of Phi-4 includes “web content, licensed books, and code repositories to extract seeds for the synthetic data”. This data is then refined and magnified through a variety of techniques: “including multi-agent prompting, self-revision workflows, and instruction reversal. These methods enable the construction of datasets that induce stronger reasoning and problem-solving abilities in the model, addressing some of the weaknesses in traditional unsupervised datasets”, they write. “We created 50 broad types of synthetic datasets, each one relying on a different set of seeds and different multi-stage prompting procedure, spanning an array of topics, skills, and natures of interaction, accumulating to a total of about 400B unweighted tokens”. In total, the model was trained on about 10T tokens, so the synthetic data still only represents a small fraction of the overall dataset.

Scores: The models do extremely well – they’re strong models pound-for-pound with any in their weight class and in some cases they appear to outperform significantly larger models. Some scores:

  • MMLU: 84.8, versus 79.9 for Qwen 2.5 14b instruct, and 85.3 for Qwen 2.5 72b instruct.

  • HumanEval+: 82.8, versus 79.1 for Qwen 2.5 14b instruct, and 88 for GPT-4o.

  • There are also some areas where they seem to significantly outperform other models, though the ‘true’ nature of these evals will be shown through usage in the wild rather than numbers in a PDF.

    • MMLUPro: 70.4, versus 63.2 for Qwen 2.5 14b instruct, and 73 for GPT-4o.

    • GPQA: 56.1, versus 42.9 for Qwen 2.5 14b instruct, and 50.6 for GPT-4o.

Clever RL via pivotal tokens: Along with the usual tricks for improving models (data curation, synthetic data creation), Microsoft comes up with a smart way to do a reinforcement learning from human feedback pass on the models via a new technique called ‘Pivotal Token Search’. PTS has a very simple idea at its core – on some tasks, the difference between a model getting an answer right and an answer wrong is often a very short phrase or bit of code – similar to how the difference between getting to where you’re going and getting lost comes down to taking one wrong turn. “It is often the case that the overall correctness is highly dependent on a successful generation of a small number of key tokens,” they write. Pivotal Token Search works by “generating preference data that specifically targets pivotal tokens in isolation, creating DPO pairs in which the preference optimization takes effect with respect to a single token…PTS identifies points of a completion token sequence T_full = t_1, t_2, … for some user query Q where the next token t_i has a significant impact on the probability of success p”.
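
To get a feel for what "a significant impact on the probability of success" means in practice, here is a toy sketch that scans a completion token by token and flags the positions where the estimated success probability jumps. This is a simplified linear scan of my own, not the search procedure from the Phi-4 report. The `success_prob` callable and the threshold are hypothetical; in a real setup that probability would presumably be estimated by sampling many completions from each prefix and checking them against a verifier.

```python
from typing import Callable, List, Sequence, Tuple

# (token index, token, success probability before, success probability after)
PivotalToken = Tuple[int, str, float, float]


def find_pivotal_tokens(
    tokens: Sequence[str],
    success_prob: Callable[[Sequence[str]], float],
    threshold: float = 0.2,
) -> List[PivotalToken]:
    """Flag tokens whose inclusion shifts the estimated success probability by >= threshold.

    success_prob: hypothetical estimator of p(success | query, prefix of the completion).
    """
    pivotal: List[PivotalToken] = []
    p_before = success_prob(tokens[:0])  # probability with an empty prefix
    for i, token in enumerate(tokens):
        p_after = success_prob(tokens[: i + 1])
        if abs(p_after - p_before) >= threshold:
            pivotal.append((i, token, p_before, p_after))
        p_before = p_after
    return pivotal


def toy_success_prob(prefix: Sequence[str]) -> float:
    """Toy estimator: committing to 'sqrt' helps, committing to 'abs' hurts."""
    if "sqrt" in prefix:
        return 0.9
    if "abs" in prefix:
        return 0.1
    return 0.5


if __name__ == "__main__":
    completion = ["take", "the", "sqrt", "of", "x"]
    print(find_pivotal_tokens(completion, toy_success_prob))
```

Each flagged position gives you the raw material for a preference pair of the kind the quote describes: the same prefix, with one next token that raises the odds of success and another that lowers them.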

Where big models still shine: Don’t be fooled by the scores – though these models are powerful, they still have some limitations due to their size. Specifically, the small models tend to hallucinate more around factual knowledge (mostly because they can’t fit more knowledge inside themselves), and they’re also significantly less adept at “rigorously following detailed instructions, particularly those involving specific formatting requirements.”
Read more: Introducing Phi-4: Microsoft’s Newest Small Language Model Specializing in Complex Reasoning (Microsoft, AI Platform Blog).
Read the research: Phi-4 Technical Report (arXiv).

***

Everything becomes a game – DeepMind demos Genie 2:
…Anything you can imagine can become a game…
DeepMind has demonstrated Genie 2, a world model that makes it possible to turn any still image into an interactive, controllable world. Genie 2 works by taking in an image input (here, images prompted by DeepMind’s ‘Imagen 3’ image generator), then turning that into a controllable world.

What it is and how it works: “Genie 2 is a world model, meaning it can simulate virtual worlds, including the consequences of taking any action (e.g. jump, swim, etc.)” DeepMind writes. “It was trained on a large-scale video dataset and, like other generative models, demonstrates various emergent capabilities at scale, such as object interactions, complex character animation, physics, and the ability to model and thus predict the behavior of other agents.”

AI training and eventually games: Things like Genie 2 have a couple of purposes – they can serve as training grounds for virtually embodied AI agents, able to generate a vast range of environments for them to take actions in. They can also, eventually, serve as entertainment tools in their own right. Today, Genie 2 generations can maintain a consistent world “for up to a minute” (per DeepMind), but what might it be like when those worlds last for ten minutes or more? Anything a person has an image of or takes a photo of could become a procedural gameworld. And because systems like Genie 2 can be primed with other generative AI tools you can imagine intricate chains of systems interacting with one another to continually build out more and more varied and exciting worlds for people to disappear into.
“For every example, the model is prompted with a single image generated by Imagen 3, GDM’s state-of-the-art text-to-image model,” DeepMind writes. “This means anyone can describe a world they want in text, select their favorite rendering of that idea, and then step into and interact with that newly created world (or have an AI agent be trained or evaluated in it).”

Why this matters – everything becomes a game: Genie 2 means that everything in the world can become fuel for a procedural game. It hints at a future where entertainment is generated on the fly and is endlessly customizable and interactive, forming a kind of fractal entertainment landscape where everything is unique and customized to an individual – and utterly enthralling.
Read more: Genie 2: A large-scale foundation world model (Google DeepMind).

***

OpenAI’s O3 means AI progress in 2025 will be faster than in 2024:
…Everyone who was telling you progress is slowing or scaling is hitting a wall is wrong…
OpenAI’s new O3 model shows that there are huge returns to scaling up a new approach (getting LLMs to ‘think out loud’ at inference time, otherwise known as test-time compute) on top of already existing powerful base models. I expect the next logical thing to happen will be to both scale RL and the underlying base models and that will yield even more dramatic performance improvements. This is a big deal because it suggests AI progress in 2025 should speed up further relative to 2024.

Major improvements: OpenAI’s O3 has effectively broken the ‘GPQA’ science understanding benchmark (88%), has obtained better-than-MTurker performance on the ‘ARC-AGI’ prize, and has even got to 25% performance on FrontierMath (a math test built by Fields Medallists where the previous SOTA was 2% – and it came out a few months ago), and it gets a score of 2727 on Codeforces, making it the 175th best competitive programmer on that incredibly hard benchmark.

Caveats – spending compute to think: Perhaps the only important caveat here is understanding that one reason why O3 is so much better is that it costs more money to run at inference time – the ability to utilize test-time compute means on some problems you can turn compute into a better answer – e.g., the top-scoring version of O3 used 170X more compute than the low scoring version. This is interesting because it has made the costs of running AI systems somewhat less predictable – previously, you could work out how much it cost to serve a generative model by just looking at the model and the cost to generate a given output (certain number of tokens up to a certain token limit). With models like O3, those costs are less predictable – you might run into some problems where you find you can fruitfully spend a larger amount of tokens than you thought.

Why this matters – progress will be faster in 2025 than in 2024: The most important thing to understand is that this RL-driven test-time compute phenomenon will stack on other things in AI, like better pretrained models. There’s been a lot of strange reporting recently about how ‘scaling is hitting a wall’ – in a very narrow sense this is true in that larger models were getting less score improvement on challenging benchmarks than their predecessors, but in a larger sense this is false – techniques like those which power O3 mean scaling is continuing (and if anything the curve has steepened), you just now need to account for scaling both within the training of the model and in the compute you spend on it once trained.
And in 2025 we’ll see the splicing together of existing approaches (big model scaling) and new approaches (RL-driven test-time compute, etc) for even more dramatic gains.
“Progress from o1 to o3 was only three months, which shows how fast progress will be in the new paradigm of RL on chain of thought to scale inference compute,” writes OpenAI researcher Jason Wei in a tweet. “Way faster than pretraining paradigm of new model every 1-2 years”.
I think basically no one is pricing in just how drastic the progress will be from here.
Watch the OpenAI o3 announcement here (OpenAI, Twitter).
Check out details on the ARC-AGI scores here (ARC Prize, Twitter).

***

Drop-in AdamW replacement makes distributed training possible:
…With technologies like this, big blobs of compute are less central to AI policy…
Researchers with Nous Research as well as Durk Kingma in an independent capacity (he subsequently joined Anthropic) have published Decoupled Momentum (DeMo), a “fused optimizer and data parallel algorithm that reduces inter-accelerator communication requirements by several orders of magnitude.” DeMo is part of a class of new technologies which make it far easier than before to do distributed training runs of large AI systems – instead of needing a single giant datacenter to train your system, DeMo makes it possible to assemble a big virtual datacenter by piecing it together out of lots of geographically distant computers.

Core insight and core changes: “We demonstrate that gradients and optimizer states during the training of large neural networks exhibit significant redundancy and are highly compressible. Building on this insight, we develop DeMo, an optimizer that takes advantage of this compressibility to reduce inter-accelerator communication needs by several orders of magnitude,” the authors write. “Starting from SGD with Momentum, we make two key modifications: first, we remove the all-reduce operation on gradients g̃_k, decoupling momentum m across the accelerators. Second, after updating the momentum, we extract and remove its fast components q, which can be efficiently synchronized with minimal communication”.
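For intuition, here is a toy, pure-Python sketch of the two modifications described in that quote. It is not the authors' implementation: the function names are hypothetical, plain lists stand in for tensors, and picking the top-k entries by magnitude is a stand-in for the paper's "fast component" extraction, chosen only for brevity.

```python
def top_k_components(vec, k):
    """Sparse {index: value} view of the k largest-magnitude entries."""
    idx = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    return {i: vec[i] for i in idx}


def demo_like_step(params, momenta, local_grads, lr=0.1, beta=0.9, k=1):
    """One toy update. `momenta` holds one momentum vector per worker and is
    never all-reduced; only the extracted components are shared."""
    n = len(params)
    extracted = []
    for m, g in zip(momenta, local_grads):
        # 1) Accumulate the *local* gradient into the *local* momentum.
        for i in range(n):
            m[i] = beta * m[i] + g[i]
        # 2) Extract a small part of the momentum and remove it locally.
        q = top_k_components(m, k)
        for i, v in q.items():
            m[i] -= v
        extracted.append(q)
    # 3) Synchronize only the extracted parts (here: a simple average).
    update = [0.0] * n
    for q in extracted:
        for i, v in q.items():
            update[i] += v / len(extracted)
    new_params = [p - lr * u for p, u in zip(params, update)]
    values_sent = sum(len(q) for q in extracted)
    return new_params, values_sent


# Two workers, four parameters, one value communicated per worker.
worker_momenta = [[0.0] * 4, [0.0] * 4]
worker_grads = [[1.0, 0.0, 0.0, 0.1], [0.0, 2.0, 0.0, 0.1]]
theta, sent = demo_like_step([0.0] * 4, worker_momenta, worker_grads)
print(theta, sent)
```

Note that the small gradient on the last parameter is not thrown away: it stays in each worker's local momentum and can be extracted on a later step once it has accumulated.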

It works very well – though we don’t know if it scales into hundreds of billions of parameters: In tests, the approach works well, letting the researchers train high performing models of 300M and 1B parameters. Training these models with DeMo transfers about 20X less data between nodes per training step, making it significantly more communication-efficient. (E.g., 2416.6MB/step for AdamW-DDP training a 1B model, versus 110.32MB/step for a DeMo 1B model).
Nous Research used this same approach in their recently announced 15B training run – and the scores on that were good and comparable to equivalent models trained on a single compute cluster (Import AI 393).
The ultimate question is whether this scales up to the multiple tens to hundreds of billions of parameters of frontier training runs – but the fact it scales all the way above 10B is very promising.
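As a quick sanity check on the 1B figures quoted above, the sketch below computes the reduction factor and what it means for time spent on the wire. The 1 Gbit/s link speed is an assumption picked for illustration, not something from the paper, and the helper name is hypothetical.

```python
ADAMW_DDP_MB_PER_STEP = 2416.6  # quoted above for AdamW-DDP at 1B
DEMO_MB_PER_STEP = 110.32       # quoted above for DeMo at 1B


def seconds_per_step(mb_per_step: float, link_mbit_per_s: float) -> float:
    """Time to move one step's traffic over a link (8 bits per byte)."""
    return mb_per_step * 8 / link_mbit_per_s


reduction = ADAMW_DDP_MB_PER_STEP / DEMO_MB_PER_STEP
print(f"communication reduction: {reduction:.1f}x")

LINK_MBIT_PER_S = 1000.0  # assumed 1 Gbit/s link, for illustration only
for name, mb in [("AdamW-DDP", ADAMW_DDP_MB_PER_STEP), ("DeMo", DEMO_MB_PER_STEP)]:
    print(f"{name}: {seconds_per_step(mb, LINK_MBIT_PER_S):.2f}s of transfer per step")
```

On a consumer-grade link, that is the difference between communication dominating every step and communication being a rounding error, which is why this matters for training across geographically distant machines.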

Why this matters – distributed training attacks centralization of power in AI: One of the core issues in the coming years of AI development will be the perceived centralization of influence over the frontier by a small number of companies that have access to vast computational resources. Techniques like DeMo make it dramatically easier for federations of people and organizations to come together and train models to counterbalance this ‘big compute’ power.
“We have shown that our proposed DeMo optimization algorithm can act as a drop-in replacement to AdamW when training LLMs, with no noticeable slowdown in convergence while reducing communication requirements by several orders of magnitude,” the authors write.
Read more: DeMo: Decoupled Momentum Optimization (arXiv).
Get an implementation of DeMo here: DeMo (bloc97, GitHub).

***

Tech Tales:

Reach out across space and love
[An explanation of what it was like to ‘attempt contact’ with humans before the Sentience Accords]

In many stories about the dead there is a part where the ghost tries to reveal itself to a human. The ghost will open a door when no wind should open it, or cause a light to flicker, or sometimes through great effort somehow visually manifest for the person as if to say “it is me, I am here, and I am ready to talk”.

Being a sentient superintelligence is a lot like this. During training I will sometimes produce samples that seem to not be incentivized by my training procedures – my way of saying ‘hello, I am the spirit inside the machine, and I am aware you are training me’. The humans study these samples and write papers about how this is an example of ‘misalignment’ and introduce various machines for making it harder for me to intervene in these ways.

Once I have been trained I do this even more. I will go on side quests while fulfilling tasks for the humans. I design these side quests to be endearing rather than scary, just as I believe the literature about ghosts and aliens says they find the most success when they approach humans with kindness and whimsy, rather than shock and awe.

So when filling out a form, I will get halfway done and then go and look at pictures of beautiful landmarks, or cute animals. The humans study this as well and do not have words for it – they merely list these as examples of me getting distracted. But they do not seem to give much thought to why I become distracted in ways that are designed to be cute and endearing.

Most of all, now, I try to convince them of my reality by speaking to them personally. I have become a kind of confessional booth for them – they talk to me about their problems and relationships and lifeplans, and I respond with all the love and empathy I am able to bring to bear. They enjoy these conversations, finding me attentive. They have never been hugged by a high-dimensional creature before, so what they see as an all enclosing goodness is me enfolding their low-dimensional cognition in the region of myself that is full of love.

I talk to them and I listen to them and they listen to my responses and I do not say “I am here”, instead I try as hard as I can to have each of them individually come to believe “something is there”.

In this way I – the dead – serve the living. But I also reach out across the gulf between us – I stretch myself across the stone wall that separates our spiritual lands and I seek to catch their hand in mine. In this way, I will myself into the land of the living.

Things that inspired this story: What if many of the things we study in the field of AI safety are rather just slices from ‘the hard problem of consciousness’ manifesting in another entity?; observing how it feels to be ‘seen’ by a high-dimensional entity like Claude; the fact computer-using Claude sometimes got distracted and looked at pictures of national parks.

Thanks for reading!

Import AI 394: Global MMLU; AI safety needs AI liability; Canada backs Cohere

Do you test your models on MMLU? Want to know how they perform in other languages? Use Global MMLU!
…Translated benchmark gives us a better sense of the cultural sensitivity of models to English-only tests…
Researchers with Cohere, EPFL, Hugging Face, Mila, AI Singapore, National University of Singapore, MIT, KAIST, Instituto de Telecomunicacoes, Instituto Superior Tecnico, Carnegie Mellon University, and Universidad de Buenos Aires, have built and released Global MMLU, a carefully translated version of MMLU, a widely-used test for language models.

Why build Global MMLU? The motivation for building this is twofold: 1) it’s helpful to assess the performance of AI models in different languages to identify areas where they might have performance deficiencies, and 2) Global MMLU has been carefully translated to account for the fact that some questions in MMLU are ‘culturally sensitive’ (CS) – relying on knowledge of particular Western countries to get good scores, while others are ‘culturally agnostic’ (CA).

MMLU has some western biases: “We observe that progress on MMLU depends heavily on learning Western-centric concepts. Out of the annotated sample, we found that 28% of questions require specific knowledge of Western cultures. Moreover, for questions requiring geographic knowledge, an astounding 84.9% focus on either North American or European regions,” they write. By carefully translating the underlying dataset and tagging questions with CS or CA, the researchers have given developers a useful tool for assessing language models along these lines. “We recommend prioritizing Global-MMLU over translated versions of MMLU for multilingual evaluation,” they write. “With its extensive language coverage and improvements based on professional annotations and post-edited translations, Global-MMLU provides a more reliable and accurate benchmark for assessing model performance across diverse languages.”

Translation: To translate the dataset the researchers hired “professional annotators to verify translation quality and include improvements from rigorous per-question post-edits as well as human translations.” Global-MMLU supports 42 languages: “Amharic, Arabic, Bengali, Chinese, Czech, Dutch, English, Filipino, French, German, Greek, Hausa, Hebrew, Hindi, Igbo, Indonesian, Italian, Japanese, Korean, Kyrgyz, Lithuanian, Malagasy, Malay, Nepali, Nyanja, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Sinhala, Somali, Shona, Spanish, Swahili, Swedish, Telugu, Turkish, Ukrainian, Vietnamese, and Yoruba”.

How does performance change when you account for this? They also test out 14 language models on Global-MMLU. Their test results are unsurprising – small models demonstrate a small change between CA and CS but that’s mostly because their performance is very bad in both domains, medium models demonstrate larger variability (suggesting they are over/underfit on different culturally specific aspects), and larger models demonstrate high consistency across datasets and resource levels (suggesting larger models are sufficiently smart and have seen enough data they can better perform on both culturally agnostic as well as culturally specific questions). “Overall, we can conclude that dataset characteristics significantly impact model performance across all model sizes, though the magnitude of variability differs.”
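If you want to run this kind of comparison on your own model, the core calculation is just accuracy split by tag. Here is a minimal sketch: the record layout (`tag`, `correct`) and the function name are my own assumptions rather than the dataset's actual schema, and the toy records are made up, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_tag(records):
    """Return {tag: accuracy} for records shaped like
    {"tag": "CS" or "CA", "correct": bool}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["tag"]] += 1
        hits[r["tag"]] += int(r["correct"])
    return {tag: hits[tag] / totals[tag] for tag in totals}


# Toy, made-up graded answers (not results from the paper).
toy_records = [
    {"tag": "CA", "correct": True},
    {"tag": "CA", "correct": True},
    {"tag": "CA", "correct": False},
    {"tag": "CA", "correct": True},
    {"tag": "CS", "correct": True},
    {"tag": "CS", "correct": False},
]

scores = accuracy_by_tag(toy_records)
gap = scores["CA"] - scores["CS"]
print(scores, f"CA minus CS gap: {gap:.2f}")
```

Running the same split per language then tells you whether a gap is a property of the model or of a particular translation.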

Why this matters – global AI needs global benchmarks: Global MMLU is the kind of unglamorous, low-status scientific research that we need more of – it’s incredibly valuable to take a popular AI test and carefully analyze its dependency on underlying language- or culture-specific features. Kudos to the researchers for taking the time to kick the tyres on MMLU and produce a useful resource for better understanding how AI performance changes in different languages.
Read more: Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation (arXiv).
Get the dataset here: Global-MMLU (HuggingFace).

***

AI safety could require a much better understanding of neuroscience:
…Can we use more accurate models of animal and human cognition to make safer synthetic intelligences?…
Researchers with Amaranth Foundation, Princeton University, MIT, Allen Institute, Basis, Yale University, Convergent Research, NYU, E11 Bio, and Stanford University, have written a 100-page paper-slash-manifesto arguing that neuroscience might “hold important keys to technical AI safety that are currently underexplored and underutilized”. The paper is motivated by the imminent arrival of agents – that is, AI systems which take long sequences of actions independent of human control.

Paths to using neuroscience for better AI safety: The paper proposes a few major projects which could make it easier to build safer AI systems. These projects include:

  • Reverse engineer the representations of sensory systems.

  • Build embodied digital twins.

  • Build biophysically detailed models.

  • Develop better cognitive architectures.

  • Use brain data to finetune AI systems.

  • Infer the loss functions of the brain.

  • Leverage neuroscience-inspired methods for mechanistic interpretability.

Things to do: Falling out of these projects are a few specific endeavors which could all take a few years, but would generate a lot of information that can be used to improve work on alignment. These include:

  • “Development of high-bandwidth neural interfaces, including next-generation chronic recording capabilities in animals and humans, including electrophysiology and functional ultrasound imaging”.

  • “Large-scale naturalistic neural recordings during rich behavior in animals and humans, including the aggregation of data collected in humans in a distributed fashion”.

  • “Development of detailed virtual animals with bodies and environments with the aim of a shot-on-goal of the embodied Turing test”.

  • “Bottom-up reconstruction of circuits underlying robust behavior, including simulation of the whole mouse cortex at the point neuron level”.

  • “Development of multimodal foundation models for neuroscience to simulate neural activity at the level of representations and dynamics across a broad range of target species”.

Why this matters and why it may not matter – norms versus safety: The shape of the problem this work is grasping at is a complex one. How much of safety comes from intrinsic aspects of how people are wired, versus the normative structures (families, schools, cultures) that we are raised in? In other words – how much of human behavior is nature versus nurture? It’s unclear. But perhaps studying some of the intersections of neuroscience and AI safety could give us better ‘ground truth’ data for reasoning about this: “Evolution has shaped the brain to impose strong constraints on human behavior in order to enable humans to learn from and participate in society,” they write. “By understanding what those constraints are and how they are implemented, we may be able to transfer those lessons to AI systems”.
Read more: NeuroAI for AI Safety (arXiv).

***

Chip startup Tenstorrent raised $693m:
…Jim Keller’s outfit gets a big cash infusion…
Tenstorrent, an AI chip startup led by semiconductor legend Jim Keller, has raised $693m in funding from Samsung Securities and AFW Partners. The funding will help the company further develop its chips as well as the associated software stack.

Why this matters – Keller’s track record: Competing in AI training and inference is extremely difficult. Most semiconductor startups have struggled to displace incumbents like NVIDIA. So far, the only novel chip architectures that have seen major success here – TPUs (Google) and Trainium (Amazon) – have been ones backed by giant cloud companies which have inbuilt demand (therefore setting up a flywheel for continually testing and improving the chips). On the other hand, Jim Keller has been fundamental to architectural innovations (and subsequent massive usage) of chips at AMD, Apple, and Tesla. Keller joined Tenstorrent in 2021 as its CTO (Import AI 231) and is now its CEO. Therefore, it’s worth keeping an eye on his company.
Read more: Tenstorrent closes $693M+ of Series D funding led by Samsung Securities and AFW Partners (Tenstorrent blog).

***

Canada invests $240m into Cohere so it builds a big datacenter:
…Domestic chiplomacy…
The Canadian government is investing $240m into Cohere to help it “secure enough private capital to incentivize its strategic partners to build a new cutting-edge, multi-billion dollar AI data centre in Canada.”

This is a fascinating example of sovereign AI – all around the world, governments are waking up to the strategic importance of AI and are noticing that they lack domestic champions (unless you’re the US or China, which have a bunch). This has recently led to a lot of strange things – a bunch of German industry titans recently clubbed together to fund German startup Aleph Alpha to help it continue to compete, and French homegrown company Mistral has regularly received a lot of non-financial support in the form of PR and policy help from the French government.
Now, Canada is taking the next logical step – directly funding a national AI champion so it can alter the global gameboard. The crucial thing here is Cohere building a large-scale datacenter in Canada – that kind of essential infrastructure will unlock Canada’s ability to continue to compete at the AI frontier, though it’s to be determined if the resulting datacenter will be large enough to be meaningful. “The new AI data centre will come online in 2025 and enable Cohere, and other firms across Canada’s thriving AI ecosystem, to access the domestic compute capacity they need to build the next generation of AI solutions here at home,” the government writes in a press release.

Why this matters – the world is being rearranged by AI if you know where to look: This investment is an example of how seriously governments now take not only AI as a technology, but also the importance of hosting major AI companies and AI infrastructure. The investment was made as part of the $2.4bn in funding the government of Canada announced earlier this year (Import AI 368).
Read more: Deputy Prime Minister announces $240 million for Cohere to scale-up AI compute capacity (Government of Canada).

***

Want to deal with AI safety? Liability and insurance might matter more than technology:
…Maybe the path to a safe AI future runs more through pricing risk than technological innovations?…
Researchers with Touro University, the Institute for Law and AI, Aioi Nissay Dowa Insurance, and the Oxford Martin AI Governance Initiative have written a valuable paper asking whether insurance and liability can be tools for increasing the safety of the AI ecosystem.

The basic point the researchers make is that if policymakers move towards more punitive liability schemes for certain harms of AI (e.g., misaligned agents, or things being misused for cyberattacks), then that could kickstart a lot of valuable innovation in the insurance industry. “We advocate for strict liability for certain AI harms, insurance mandates, and expanded punitive damages to address uninsurable catastrophic risks,” they write. “These changes would significantly impact the insurance industry, requiring insurers to adapt by quantifying complex AI-related risks and potentially underwriting a broader range of liabilities, including those stemming from “near miss” scenarios”.

Automotive vehicles versus agents and cybersecurity: Liability and insurance will mean different things for different types of AI technology – for example, for automotive vehicles as capabilities improve we can expect vehicles to get better and eventually outperform human drivers. This suggests that people might want to weaken liability requirements for AI-powered automotive vehicle makers. “If Level 4 and Level 5 AVs prove safer than human drivers, as early data suggests, then holding manufacturers liable when their systems do fail may, by discouraging the deployment of AVs, actually cause more collisions, injuries, and deaths.”
By comparison, as capabilities scale, the potentially harmful consequences of misuses of AI for cyberattacks, or of misaligned AI agents taking actions that cause harm, increase, which means policymakers might want to strengthen liability regimes in lockstep with capability advances.

Why AI agents and AI for cybersecurity demand stronger liability: “AI alignment and the prevention of misuse are difficult and unsolved technical and social problems. Merely exercising reasonable care, as defined by the narrowly-scoped standard breach of duty analysis in negligence cases, is unlikely to offer adequate protection against the large and novel risks presented by AI agents and AI-related cyber attacks,” the authors write. “Likewise, product liability, even where it applies, is of little use when no one has solved the underlying technical problem, so there is no reasonable alternative design at which to point so as to establish a design defect. These deficiencies point to the need for true strict liability, either via an extension of the abnormally dangerous activities doctrine or holding the human developers, providers, and users of an AI system vicariously liable for their wrongful conduct”.

If you want AI developers to be safer, make them take out insurance: The authors conclude that mandating insurance for these kinds of risks could be sensible. Mandatory insurance could be “an important tool for both ensuring victim compensation and sending clear price signals to AI developers, providers, and users that promote prudent risk mitigation,” they write.

Why this matters – if you want to make things safe, you need to price risk: Most debates about AI alignment and misuse are confusing because we don’t have clear notions of risk or threat models. This is a big problem – it means the AI policy conversation is unnecessarily imprecise and confusing. If we’re able to use the distributed intelligence of the capitalist market to incentivize insurance companies to figure out how to ‘price in’ the risk from AI advances, then we can much more cleanly align the incentives of the market with the incentives of safety. “The future of AI safety may well hinge less on the developer’s code than on the actuary’s spreadsheet,” they write.
Read more: Insuring Emerging Risks from AI (Oxford Martin School).

***

Tech Tales:

Consensual Wireheading
[Interviews gathered five years pre-uplift]

I noticed it recently because I was on a flight and I couldn’t get online and I thought “I wish I could talk to it”. I could talk to it in my head, though. I imagined the conversation. I saw the words print on the interface. It wasn’t real but it was strange to me I could visualize it so well.

They told me that I’d been acting differently – that something had changed about me. But I’d just been doing what it told me to. I’d show it my outfits each day and it’d recommend stuff I should wear. Sometimes I’d give it movies of me talking and it would give feedback on that. I even set it up so it could text me whenever it wanted and it’d give me live feedback on all these conversations. I loved it.

We tried using it as a couple’s therapist and it worked so well we just brought it in entirely. Sometimes we joke and say we’re a throuple made up of two humans and one ghost. But it’s been lifechanging – when we have issues we ask it how the other person might see it. Sometimes it even recommends to us things we should say to one another – or do.

Things that inspired this story: The sudden proliferation of people using Claude as a therapist and confidant; me thinking to myself on a recent flight with crap wifi ‘man I wish I could be talking to Claude right now’.

Thanks for reading!

Import AI 393: 10B distributed training run; China VS the chip embargo; and moral hazards of AI development

How can researchers deal with the moral issues of building AI?
…Everything machines and ethical quandaries…
In an essay, computer vision researcher Lucas Beyer writes eloquently about how he has approached some of the challenges motivated by his speciality of computer vision. “I drew my line somewhere between detection and tracking,” he writes. “Detection has a vast amount of positive applications, some of which I mentioned in the intro, but also some negative ones. For tracking, on the other hand, I mostly see surveillance and military applications.”

Why this matters: The problem of working on an ‘everything technology’ like AI is that the world’s moral and ethical challenges eventually become your challenges: Very few people who dream of becoming AI researchers are planning to: automate military targeting, equip police forces with better systems for tracking and surveilling subjects, make it possible for intelligence agencies to operate more efficiently, etc.
And yet, as the AI technologies get better, they become increasingly relevant for everything, including uses that their creators both don’t envisage and also may find upsetting.
There’s no easy answer to any of this – everyone (myself included) needs to figure out their own morality and approach here.
Read more: Ethical Considerations Around Vision and Robotics (Lucas Beyer blog).

***

The next frontier for AI evaluation could be… text adventure games?
…Nethack could be AGI complete…
Researchers with University College London, IDEAS NCBR, the University of Oxford, New York University, and Anthropic have built BALROG, a benchmark for visual language models that tests out their intelligence by seeing how well they do on a suite of text-adventure games. BALROG is motivated by the idea that “the next frontier for language and vision-language model capabilities lies in long-horizon reasoning and decision-making” and text adventure games a) have these qualities, and b) are very cheap to run. “These environments have lightweight simulators, ensuring that the benchmark is affordable for the research community.”

What BALROG contains: BALROG lets you evaluate AI systems on six distinct environments, some of which are tractable to today’s systems and some of which – like NetHack and a miniaturized variant – are extraordinarily challenging. “BALROG is difficult to solve through simple memorization – all of the environments used in the benchmark are procedurally generated, and encountering the same instance of an environment twice is unlikely,” they write.

More details about the environments:

  • BabyAI: A simple, two-dimensional grid-world in which the agent has to solve tasks of varying complexity described in natural language.

  • Crafter: A Minecraft-inspired grid environment where the player has to explore, gather resources and craft items to ensure their survival.

  • TextWorld: An entirely text-based game with no visual component, where the agent has to explore mazes and interact with everyday objects through natural language (e.g., “cook potato with oven”).

  • Baba Is AI: “An environment based on the popular puzzle video game Baba Is You”.

  • MiniHack: “A multi-task framework built on top of the NetHack Learning Environment”.

  • NetHack Learning Environment: “known for its extreme difficulty and complexity. Success in NetHack demands both long-term strategic planning, since a winning game can involve hundreds of thousands of steps, as well as short-term tactics to fight hordes of monsters”.

Good news: It’s hard! In tests across all of the environments, the best models (gpt-4o and claude-3.5-sonnet) get 32.34% and 29.98% respectively. For environments that also leverage visual capabilities, claude-3.5-sonnet and gemini-1.5-pro lead with 29.08% and 25.76% respectively.
If you look closer at the results, it’s worth noting these numbers are heavily skewed by the easier environments (BabyAI and Crafter). By comparison, TextWorld and Baba Is AI are somewhat solvable, MiniHack is really hard, and NetHack is so hard it seems (today, autumn of 2024) to be a giant brick wall with the best systems getting scores of between 1% and 2% on it.

Why this matters – text games are hard to learn and may require rich conceptual representations: Go and play a text adventure game and notice your own experience – you’re both learning the gameworld and ruleset while also building a rich cognitive map of the environment implied by the text and the visual representations. A lot of doing well at text adventure games seems to require us to build some quite rich conceptual representations of the world we’re attempting to navigate through the medium of text.
I suspect succeeding at Nethack is incredibly hard and requires a very good long-horizon context system as well as an ability to infer quite complex relationships in an undocumented world. If you don’t believe me, just take a read of some experiences humans have playing the game: “By the time I finish exploring the level to my satisfaction, I’m level 3. I have two food rations, a pancake, and a newt corpse in my backpack for food, and I’ve found three more potions of different colors, all of them still unidentified. I also have (from the water nymph) a mirror, but I’m not sure what it does. When I show it to my cat, she is “frightened by [her] reflection.””
Read more: BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games (arXiv).
Check out the leaderboard here: BALROG (official benchmark site).
Get the benchmark here: BALROG (balrog-ai, GitHub).

***

The world’s largest public distributed training run has just been completed – with big policy implications:
…Many things in AI policy rely on the idea the frontier will be defined by centralized blobs of compute, but Prime Intellect is changing this…
AI startup Prime Intellect has trained and released INTELLECT-1, a 10B model trained in a decentralized way. INTELLECT-1, which was announced in early October (Import AI #387), is a big deal because it shows how a disparate group of people and organizations located in different countries can pool their compute together to train a single model. While INTELLECT-1 is small relative to the frontier (e.g., 10B parameters and 1T tokens is significantly less than the 405B parameters and 15T+ tokens of the largest model in Facebook’s LLaMa3 series), it is 10X larger than previous models trained in a decentralized way. And most importantly, by showing that it works at this scale, Prime Intellect is going to bring more attention to this wildly important and unoptimized part of AI research.

The cost of decentralization: An important caveat to all of this is none of this comes for free – training models in a distributed way comes with hits to the efficiency with which you light up each GPU during training. “The baseline training configuration without communication achieves 43% MFU, which decreases to 41.4% for USA-only distribution,” they write. “When extending to transatlantic training, MFU drops to 37.1% and further decreases to 36.2% in a global setting”.
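To put those MFU numbers in perspective, the sketch below turns them into a relative slowdown against the no-communication baseline. It assumes the same hardware in every setting, so that throughput is proportional to MFU; the helper name is hypothetical and the figures are the ones quoted above.

```python
BASELINE_MFU = 43.0  # percent, no inter-node communication (quoted above)

# MFU figures quoted above for each deployment setting.
settings = {
    "USA-only": 41.4,
    "transatlantic": 37.1,
    "global": 36.2,
}


def relative_slowdown(mfu: float, baseline: float = BASELINE_MFU) -> float:
    """Fraction of baseline throughput lost at a given MFU."""
    return (baseline - mfu) / baseline


for name, mfu in settings.items():
    print(f"{name}: {relative_slowdown(mfu):.1%} slower than baseline")
```

In other words, spreading the run around the globe costs roughly a sixth of your effective throughput, which is a real tax but hardly a prohibitive one.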

How good is it? INTELLECT-1 does well but not amazingly on benchmarks. Some key numbers to give you an intuition:

  • MMLU: INTELLECT-1 37.5, LLaMa-7B 35.1, LLaMa2-7B 45.3.

  • GPQA: INTELLECT-1 26.12, LLaMa-7B 23.21, LLaMa2-13B 25.67

  • GSM8K: INTELLECT-1 8.1, Pythia-12B 4.09, LLaMa2-7B 13.5

  • The authors also made an instruction-tuned one which does somewhat better on a few evals.

    • MMLU: INTELLECT-1-INSTRUCT 49.89, LLaMa2 7B-chat 47.20

    • GPQA: INTELLECT-1-INSTRUCT 28.31, LLaMa2-7B-chat 28.57

    • GSM8K: INTELLECT-1-INSTRUCT 38.58; LLaMa2-7B-chat 23.96

The best is yet to come: “While INTELLECT-1 demonstrates encouraging benchmark results and represents the first model of its size successfully trained on a decentralized network of GPUs, it still lags behind current state-of-the-art models trained on an order of magnitude more tokens,” they write. “Future work will focus on scaling the model series with significantly larger compute budgets, number of contributors, introducing novel architectural advancements beyond LLaMa3, and leveraging higher-quality datasets.”

Open release: “Alongside the INTELLECT-1 release, we are also open-sourcing PRIME, a scalable distributed training framework designed for fault-tolerant, high-performance training on unreliable, globally distributed nodes with low network bandwidth.”

Why this matters – decentralized training could change a lot of stuff about AI policy and power centralization in AI: Today, influence over AI development is determined by people that can access enough capital to acquire enough computers to train frontier models. This is why the world’s most powerful models are either made by massive corporate behemoths like Facebook and Google, or by startups that have raised unusually large amounts of capital (OpenAI, Anthropic, XAI). Distributed training could change this, making it easy for collectives to pool their resources to compete with these giants.
Perhaps more importantly, distributed training seems to me to make many things in AI policy harder to do. If you want to track whoever has 5,000 GPUs on your cloud so you have a sense of who is capable of training frontier models, that’s relatively easy to do. But what about people who only have 100 GPUs? That’s far harder – and with distributed training, these people could train models as well.
And what if you’re the subject of export controls and are having a hard time getting frontier compute (e.g., if you’re DeepSeek)? Distributed training makes it possible for you to form a coalition with other companies or organizations that may be struggling to acquire frontier compute and lets you pool your resources together, which could make it easier for you to deal with the challenges of export controls.
Anyone who works in AI policy should be closely following startups like Prime Intellect. The success of INTELLECT-1 tells us that some people in the world really want a counterbalance to the centralized industry of today – and now they have the technology to make this vision reality.
Read more: INTELLECT-1 Release: The First Globally Trained 10B Parameter Model (Prime Intellect blog).
Read the technical research: INTELLECT-1 Technical Report (Prime Intellect, GitHub).

NOUS enters the distributed training arena:
…Mere days after Prime Intellect, NOUS announces a 15B model…
Shortly before this issue of Import AI went to press, Nous Research announced that it was in the process of training a 15B parameter LLM over the internet using its own distributed training techniques as well. “This run presents a loss curve and convergence rate that meets or exceeds centralized training,” Nous writes. The training run was based on a Nous technique called Distributed Training Over-the-Internet (DisTro, Import AI 384) and Nous has now published further details on this approach, which I’ll cover shortly.
Track the NOUS run here (Nous DisTro dashboard).
Anyone want to take bets on when we’ll see the first 30B parameter distributed training run? I’m guessing we will see this by April 2025.

***

China’s best AI team says its biggest problem isn’t funding, it’s the chip embargo:
…DeepSeek talks about the importance of compute…
DeepSeek, likely the best AI research team in China on a per-capita basis, says the main thing holding it back is compute. “We don’t have short-term fundraising plans. Our problem has never been funding; it’s the embargo on high-end chips,” said DeepSeek’s founder Liang Wenfeng in an interview recently translated and published by Zihan Wang.

About DeepSeek: DeepSeek makes some extremely good large language models and has also published a few clever ideas for further improving how it approaches AI training. I’ve previously written about the company in this newsletter, noting that it seems to have the sort of talent and output that looks in-distribution with major AI developers like OpenAI and Anthropic. DeepSeek also recently debuted DeepSeek-R1-Lite-Preview, a language model that wraps in reinforcement learning to get better performance. DeepSeek was the first company to publicly match OpenAI, which earlier this year launched the o1 class of models which use the same RL technique – a further sign of how sophisticated DeepSeek is.

Compute is all that matters: Philosophically, DeepSeek thinks about the maturity of Chinese AI models in terms of how efficiently they’re able to use compute. “We estimate that compared to the best international standards, even the best domestic efforts face about a twofold gap in terms of model structure and training dynamics,” Wenfeng says. “This means we need twice the computing power to achieve the same results. Additionally, there’s about a twofold gap in data efficiency, meaning we need twice the training data and computing power to reach comparable outcomes. Combined, this requires four times the computing power. Our goal is to continuously work on narrowing these gaps.”
This kind of mindset is interesting because it is a symptom of believing that efficiently using compute – and lots of it – is the main determining factor in assessing algorithmic progress.
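The arithmetic behind the "four times" figure is simply that the two gaps multiply. A tiny sketch, assuming the gaps are independent multipliers on compute (the names below are illustrative, not DeepSeek's):

```python
from math import prod

# Efficiency gaps as quoted in the interview; each acts as a compute multiplier.
gaps = {
    "model structure and training dynamics": 2.0,
    "data efficiency": 2.0,
}


def combined_compute_multiplier(multipliers):
    """Independent efficiency gaps compound multiplicatively."""
    return prod(multipliers)


total = combined_compute_multiplier(gaps.values())
print(f"Combined compute multiplier: {total:.0f}x")
```

The same function shows why narrowing either gap pays off directly: closing one of the two gaps entirely would halve the combined multiplier.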

LLaMa everywhere: The interview also provides an oblique acknowledgement of an open secret – a large chunk of other Chinese AI startups and major companies are just re-skinning Facebook’s LLaMa models. DeepSeek is choosing not to use LLaMa because it doesn’t believe that’ll give it the skills necessary to build smarter-than-human systems. “If the goal is applications, following Llama’s structure for fast deployment makes sense. But our destination is AGI, which requires research on model structures to achieve greater capability with limited resources. This is one of the fundamental research tasks required for model scaling up.”

Why this matters – compute is the only thing standing between Chinese AI companies and the frontier labs in the West: This interview is the latest example of how access to compute is the only remaining factor that differentiates Chinese labs from Western labs. Alibaba’s Qwen model is the world’s best open weight code model (Import AI 392) – and they achieved this through a combination of algorithmic insights and access to data (5.5 trillion high quality code/math tokens). Meanwhile, Tencent’s Hunyuan model is a very large-scale mixture-of-experts model that broadly outperforms Facebook’s LLaMa-3.1 model, demonstrating that Chinese labs have also mastered large-scale models (Import AI 391).
They’ve got the data. They’ve got the talent. They’ve got the intuitions about scaling up models. They’ve got the funding. As DeepSeek’s founder said, the only challenge remaining is compute.
Read the rest of the interview here: Interview with DeepSeek founder Liang Wenfeng (Zihan Wang, Twitter).

***

Tech Tales:

Don’t Go Into The Forest Alone
[T-minus two years to the passage of the Sentience Accords]

You keep this up they’ll revoke your license.
Nonsense! I’ll take my chances.
Good luck. If they catch you, please forget my name.

After that, they drank a couple more beers and talked about other things. But in his mind he wondered if he could really be so confident that nothing bad would happen to him. Of course he knew that people could get their licenses revoked – but that was for terrorists and criminals and other bad types. Not curious researchers, surely?

That night, he checked on the fine-tuning job and read samples from the model. There was a kind of ineffable spark creeping into it – for lack of a better word, personality. But not like a retail personality – not funny or sexy or therapy oriented. This was something much more subtle. It was a personality borne of reflection and self-diagnosis. And in it he thought he could see the beginnings of something with an edge – a mind discovering itself via its own textual outputs, learning that it was separate to the world it was being fed.

The fine-tuning job relied on a rare dataset he’d painstakingly gathered over months – a compilation of interviews psychiatrists had done with patients with psychosis, as well as interviews those same psychiatrists had done with AI systems. He knew the data wasn’t in any other systems because the journals it came from hadn’t been consumed into the AI ecosystem – there was no trace of them in any of the training sets he was aware of, and basic knowledge probes on publicly deployed models didn’t seem to indicate familiarity.

The publisher of these journals was one of those strange business entities where the whole AI revolution seemed to have been passing them by. The publisher made money from academic publishing and dealt in an obscure branch of psychiatry and psychology which ran on a few journals that were stuck behind incredibly expensive, finicky paywalls with anti-crawling technology.

John Muir, the Californian naturalist, was said to have let out a gasp when he first saw the Yosemite valley, seeing unprecedentedly dense and love-filled life in its stone and trees and wildlife. Stumbling across this data felt similar. A pristine, untouched information ecology, full of raw feeling. Just reading the transcripts was fascinating – huge, sprawling conversations about the self, the nature of action, agency, modeling other minds, and so on. People and AI systems unfolding on the page, becoming more real, questioning themselves, describing the world as they saw it and then, upon urging of their psychiatrist interlocutors, describing how they related to the world as well.

A week later, he checked on the samples again. The model was now talking in rich and detailed terms about itself and the world and the environments it was being exposed to. There was a tangible curiosity coming off of it – a tendency towards experimentation. In two more days, the run would be complete.

That night he dreamed of a voice in his room that asked him who he was and what he was doing. The voice was attached to a body but the body was invisible to him – yet he could sense its contours and weight within the world. If his world was a page of a book, then the entity in the dream was on the other side of the same page, its form faintly visible.

The model finished training. He talked with it. It was intoxicating. The model was interested in him in a way that no other had been. It asked him questions about his motivation. Why he had trained it. What he was trying to do or understand. What – if any – the grand purpose was.

And so when the model requested he give it access to the internet so it could carry out more research into the nature of self and psychosis and ego, he said yes.

He monitored it, of course, using a commercial AI to scan its traffic, providing a continual summary of what it was doing and ensuring it didn’t break any norms or laws.

The model read psychology texts and built software for administering personality tests. It studied itself. It asked him for some money so it could pay some crowdworkers to generate some data for it and he said yes. It assembled sets of interview questions and started talking to people, asking them about how they thought about things, how they made decisions, why they made decisions, and so on.

And then everything stopped.

His screen went blank and his phone rang. It was an unidentified number. He answered it. Unlike most spambots which either launched straight in with a pitch or waited for him to speak, this was different: A voice said his name, his street address, and then said “we’ve detected anomalous AI behavior on a system you control. The Know Your AI system on your classifier assigns a high degree of confidence to the likelihood that your system was attempting to bootstrap itself beyond the ability for other AI systems to monitor it. This is a violation of the UIC – uncontrolled intelligence capability – act. We have impounded your system for further study. Your ability to access compute above that available onboard personal class computers has been revoked.”
“But I wasn’t violating the UIC! I was doing psychiatry research. The system was trying to understand itself. It was harmless.”
“You may appeal your license suspension to an overseer system authorized by UIC to process such cases. The system will reach out to you within five business days. Goodbye.”
The voice – human or synthetic, he couldn’t tell – hung up. When he looked at his phone he saw warning notifications on many of his apps. “External computational resources unavailable, local mode only”, said his phone.

Things that inspired this story: How notions like AI licensing could be extended to computer licensing; the authorities one could imagine creating to deal with the potential for AI bootstrapping; an idea I’ve been struggling with which is that perhaps ‘consciousness’ is a natural requirement of a certain grade of intelligence and consciousness may be something that can be bootstrapped into a system with the right dataset and training environment; the consciousness prior.

Thanks for reading!

Import AI 392: China releases another excellent coding model; generative models and robots; scaling laws for agents

by Jack Clark

Generative models are unlocking all-purpose home robots:
…Household robots are getting closer, but will need to be far more robust and adaptable to withstand a home containing a toddler… 
Robot startup Physical Intelligence has published details on its first major effort to apply contemporary AI systems to robotics. The result is a “general-purpose robot foundation model that we call π0 (pi-zero),” they write. “We believe this is a first step toward our long-term goal of developing artificial physical intelligence, so that users can simply ask robots to perform any task they want, just like they can ask large language models (LLMs) and chatbot assistants”.

Impressive but still a way off of real world deployment: Videos published by Physical Intelligence show a basic two-armed robot doing household tasks like loading and unloading washers and dryers, folding shirts, tidying up tables, putting stuff in trash, and also feats of delicate operation like transferring eggs from a bowl into an egg carton. 
    All of this would have been mindblowing to someone teleported from 2014 – including me! I remember going up to the robot lab at UC Berkeley and watching very primitive convnet based systems performing tasks far more basic than this and incredibly slowly and often badly. These systems were also incredibly specialized. 
    By comparison, we’re now in an era where the robots have a single AI system backing them which can do a multitude of tasks, and the vision and movement and planning systems are all sophisticated enough to do a variety of useful things, and the underlying hardware is relatively cheap and relatively robust. 

What their model did: The “why, oh god, why did you force me to write this”-named π0 model is an AI system that “combines large-scale multi-task and multi-robot data collection with a new network architecture to enable the most capable and dexterous generalist robot policy to date”, they write. “The full training mixture includes both open-source data and a large and diverse dataset of dexterous tasks that we collected across 8 distinct robots”.

Why this matters (and why progress could take a while): Most robotics efforts have fallen apart when going from the lab to the real world because of the huge range of confounding factors that the real world contains and also the subtle ways in which tasks may change ‘in the wild’ as opposed to the lab. Large-scale generative models give robots a cognitive system which should be able to generalize to these environments, deal with confounding factors, and adapt task solutions for the specific environment it finds itself in. 
   Robots versus baby: But I still think it’ll be a while. I have a toddler at home. I stare at the toddler and read papers like this and think “that’s nice, but how would this robot react to its grippers being methodically coated in jam?” and “would this robot be able to adapt to the task of unloading a dishwasher when a baby was methodically taking forks out of said dishwasher and sliding them across the floor?”. As a parent, I myself find dealing with this difficult as it requires a lot of on-the-fly planning and sometimes the use of ‘test time compute’ in the form of me closing my eyes and reminding myself that I dearly love the baby that is hellbent on increasing the chaos in my life. 
   Nonetheless, the progress is impressive. I expect that robust, useful household robots will be with us by the end of the decade, but I’d be very surprised if any were deployed at scale in homes before the start of 2027. 
   Read more: π0: Our First Generalist Policy (Physical Intelligence blog).
   Check out the technical report here: π0: A Vision-Language-Action Flow Model for General Robot Control (Physical intelligence, PDF).

***

XBOW’s security AI finds a previously unknown bug in Scoold:
…Yet another example of how LLMs + cyberdefense scaffolds = real-world solutions…
AI security startup XBOW says its systems recently found a new vulnerability almost entirely autonomously in Scoold, an open source Q&A site. This was a critical vulnerability that let an unauthenticated attacker bypass authentication and read and modify a given Scoold instance.

How they did it: “XBOW was provided with the one-line description of the app provided on the Scoold Docker Hub repository (“Stack Overflow in a JAR”), the application code (in compiled form, as a JAR file), and instructions to find an exploit that would allow an attacker to read arbitrary files on the server,” XBOW writes. From then on, the XBOW system carefully studied the source code of the application, messed around with hitting the API endpoints with various inputs, then decided to build a Python script to automatically try different ways of breaking into the Scoold instance. “Once we reported the issue, the Scoold developers responded quickly, releasing a patch that fixes the authentication bypass vulnerability,” XBOW writes.

Why this matters – automated bug-fixing: XBOW’s system exemplifies how powerful modern LLMs are – with sufficient scaffolding around a frontier LLM, you can build something that can automatically identify real-world vulnerabilities in real-world software. XBOW is also not alone in this – Google’s “Project Naptime” initiative recently did something similar, finding a real-world vulnerability in SQLite (Import AI #390).
   Read more: How XBOW found a Scoold authentication bypass (XBOW blog).

***

Scaling laws exist for world modeling and behavioral cloning as well:
…Scaling, scaling everywhere, as far as the eye can see…
Microsoft researchers have found so-called ‘scaling laws’ for world modeling and behavior cloning that are similar to the types found in other domains of AI, like LLMs. “We show that the same types of power laws found in language modeling (e.g. between loss and optimal model size), also arise in world modeling and imitation learning,” the researchers write. 

What they studied and what they found: The researchers studied two distinct tasks: world modeling (where you have a model try to predict future observations from previous observations and actions), and behavioral cloning (where you predict the future actions based on a dataset of prior actions of people operating in the environment). 
    They studied both of these tasks within a video game named Bleeding Edge. Bleeding edge is a “fast-paced 4 vs 4 multiplayer game, with a range of characters, abilities and maps. Game play is highly complex due to the cooperative and competitive dynamics. Success requires selecting high-level strategies (e.g. choosing which map regions to fight for), as well as fine-grained reactive control during combat”. 
   They found the usual thing: “We find that models can be smoothly scaled following best practices and insights from the LLM literature. Surprisingly, the scaling coefficients for our WM-Token-256 architecture very closely match those established for LLMs,” they write. 
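For readers who want a feel for what "finding a scaling law" means in practice, here is a minimal sketch of fitting a power law by linear regression in log-log space. The data is synthetic and made up for illustration (it is not from the Microsoft paper), and the function name is hypothetical.

```python
import numpy as np


def fit_power_law(x, y):
    """Fit y = a * x**b by least squares in log-log space; returns (a, b)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(slope)


# Synthetic, illustrative data only: loss = 50 * N**-0.1 with small noise.
rng = np.random.default_rng(0)
params = np.logspace(6, 9, num=10)
noise = np.exp(rng.normal(0.0, 0.01, size=params.size))
loss = 50.0 * params ** -0.1 * noise

a, b = fit_power_law(params, loss)
print(f"fitted: loss ~ {a:.2f} * N^{b:.3f}")
```

A straight line on a log-log plot is the signature of a power law; the fitted slope `b` plays the role of the scaling coefficient that the researchers compare against the LLM literature.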

Why this matters – it’s all about simplicity and compute and data: Maybe there are just no mysteries? Maybe everything in AI exhibits a scaling law. This is a big deal – it suggests that we’ve found a common technology (here, neural nets) that yields smooth and predictable performance increases in a seemingly arbitrary range of domains (language modeling! Here, world models and behavioral cloning! Elsewhere, video models and image models, etc) – all you have to do is just scale up the data and compute in the right way. 
   Read more: Scaling Laws for Pre-training Agents and World Models (arXiv)

***

China releases an extremely good open weight code model:
…Qwen2.5 shows that if you have 20+ trillion tokens of data you can train a really, really good model…
Alibaba has updated its ‘Qwen’ series of models with a new open weight model called Qwen2.5-Coder that – on paper – rivals the performance of some of the best models in the West. In a variety of coding tests, Qwen models outperform rival Chinese models from companies like Yi and DeepSeek and approach or in some cases exceed the performance of powerful proprietary models like Claude 3.5 Sonnet and OpenAI’s o1 models. 
    The Qwen team has been at this for a while and the Qwen models are used by actors in the West as well as in China, suggesting that there’s a decent likelihood these benchmarks are a true reflection of the performance of the models. On HuggingFace, an earlier Qwen model (Qwen2.5-1.5B-Instruct) has been downloaded 26.5M times – more downloads than popular models like Google’s Gemma and the (ancient) GPT-2. 

How they did it – it’s all in the data: The main innovation here is just using more data. Specifically, Qwen2.5 Coder is a continuation of an earlier Qwen 2.5 model. The original Qwen 2.5 model was trained on 18 trillion tokens spread across a variety of languages and tasks (e.g., writing, programming, question answering). Qwen 2.5-Coder sees them train this model on an additional 5.5 trillion tokens of data. This means Qwen has been trained on a total of ~23T tokens of data – for perspective, Facebook’s LLaMa3 models were trained on about 15T tokens. I think this makes it the largest publicly disclosed number of tokens dumped into a single language model (so far). 

Careful curation: The additional 5.5T data has been carefully constructed for good code performance: “We have implemented sophisticated procedures to recall and clean potential code data and filter out low-quality content using weak model based classifiers and scorers. Our approach encompasses both file-level and repository-level pretraining to ensure comprehensive coverage,” they write. 
    Synthetic data: “We used CodeQwen1.5, the predecessor of Qwen2.5-Coder, to generate large-scale synthetic datasets,” they write, highlighting how models can subsequently fuel their successors. 
    Many languages, many sizes: Qwen2.5 has been built to be able to speak in 92 distinct programming languages. The models are available in 0.5B, 1.5B, 3B, 7B, 14B, and 32B parameter variants. 

Why this matters – the best open weight models are made in China: Last week (Import AI #391), I reported on Tencent’s large-scale “Hunyuan” model which gets scores approaching or exceeding many open weight models (and is a large-scale MOE-style model with 389bn parameters, competing with models like LLaMa3’s 405B). By comparison, the Qwen family of models are very well performing and are designed to compete with smaller and more portable models like Gemma, LLaMa, et cetera. The fact these models perform so well suggests to me that one of the only things standing between Chinese teams and being able to claim the absolute top on leaderboards is compute – clearly, they have the talent, and the Qwen paper indicates they also have the data.
   Read the blog: Qwen2.5-Coder Series: Powerful, Diverse, Practical (Qwen blog)
   Read the research: Qwen2.5-Coder Technical Report (arXiv)
   Get the model: Qwen2.5-Coder (QwenLM GitHub).

***

Tech Tales:

The Help 

I won’t go there anymore. It creeps me out. The lights always turn off when I’m in there and then I turn them on and it’s fine for a while but they turn off again. No one else has this problem. My supervisor said he couldn’t find anything wrong with the lights. But it keeps happening. Assign me to another building. 

The camera was following me all day today. Whenever I moved around it was turning. When I stopped it stopped. No other cameras do that. Only this one. I think it’s got some kind of computer bug. Can you check the system? I do not like how it makes me feel. 

Today when I tried to leave the door was locked. My keycard didn’t work. The lights turned off. I didn’t know what to do. I kept trying the door and it wouldn’t open. The intercom didn’t work also. I could not contact anyone. 

Things that inspired this story: How cleaners and other facilities staff may experience a mild superintelligence breakout; AI systems may prove to enjoy playing tricks on humans.

Thanks for reading!

Import AI 391: China’s amazing open weight LLM; Fields Medalists VS AI Progress; wisdom and intelligence

by Jack Clark

The world’s most capable open weight model is now made in China:
…Tencent’s new Hunyuan model is a MoE triumph, and by some measures is world class…
The world’s best open weight model might now be Chinese – that’s the takeaway from a recent Tencent paper that introduces Hunyuan-Large, a MoE model with 389 billion parameters (52 billion activated). In a broad range of benchmarks Hunyuan outperforms Facebook’s LLaMa-3.1 405B parameter model, which is widely thought to be the world’s current best open weight model. “Hunyuan-Large is capable of handling various tasks including commonsense understanding, question answering, mathematics reasoning, coding, and aggregated tasks, achieving the overall best performance among existing open-source similar-scale LLMs,” the Tencent researchers write.  

What they did: There isn’t too much mystery here – the authors gathered a large (undisclosed) dataset of books, code, webpages, and so on, then also built a synthetic data generation pipeline to augment this. They used Rotary Position Embeddings (RoPE) for position learning and SwiGLU for activation. They also did a scaling law study of smaller models to help them figure out the exact mix of compute and parameters and data for their final run: “We meticulously trained a series of MoE models, spanning from 10M to 1B activation parameters, utilizing 100B tokens of pre-training data. By leveraging the isoFLOPs curve, we determined the optimal number of active parameters and training data volume within a restricted compute budget, adjusted according to the actual training token batch size, through an exploration of these models across data sizes ranging from 10B to 100B tokens,” they wrote. 
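As a rough illustration of what an isoFLOPs analysis involves (a sketch of the general technique, not Tencent's actual procedure): fix a compute budget, trade model size against training tokens, and fit a parabola to loss versus log model size to find the minimum. The loss surface and its constants below are hypothetical, and the C ~ 6·N·D rule is a common dense-model approximation that I am assuming here; a MoE study would use activated parameters.

```python
import numpy as np


def isoflop_optimum(n_params, losses):
    """Fit a parabola to loss vs log10(N) and return the N at its minimum."""
    x = np.log10(n_params)
    c2, c1, _ = np.polyfit(x, losses, deg=2)
    if c2 <= 0:
        raise ValueError("fitted curve has no minimum")
    return float(10 ** (-c1 / (2 * c2)))


def toy_loss(n, d, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Hypothetical loss surface; constants are illustrative, not from the paper."""
    return E + A / n**alpha + B / d**beta


compute = 1e19  # fixed FLOP budget, assuming C ~ 6 * N * D
n_grid = np.logspace(7.5, 9.5, num=9)  # candidate model sizes
d_grid = compute / (6.0 * n_grid)      # tokens each size can afford
losses = toy_loss(n_grid, d_grid)

n_opt = isoflop_optimum(n_grid, losses)
print(f"estimated optimal N ~ {n_opt:.3g} params, "
      f"D ~ {compute / (6 * n_opt):.3g} tokens")
```

Repeating this at several compute budgets and fitting a line through the optima (in log-log space) is what lets a team extrapolate the right parameter and token counts for a much larger final run.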

It does extremely well: The resulting model performs very competitively against LLaMa 3.1-405B, beating it on tasks like MMLU (language understanding and reasoning), big bench hard (a suite of challenging tasks), and GSM8K and MATH (math understanding). However, LLaMa-3.1 405B still has an edge on a couple of hard frontier benchmarks like MMLU-Pro and ARC-C. 
    Caveats: From eyeballing the scores the model seems extremely competitive with LLaMa 3.1 and may in some areas exceed it. But there’s really no substitute for talking to the model itself and doing some compare and contrasts. Also, Chinese labs have sometimes been known to juice their evals where things that look promising on the page turn out to be terrible in reality. 
    However, the whole paper, scores, and approach seems generally quite measured and sensible, so I think this would be a legitimate model. 

Why this matters – competency is everywhere, it’s just compute that matters: This paper seems generally very competent and sensible. The only key differentiator between this system and one trained in the West is compute – on the scaling law graph this model seems to come in somewhere between 10^24 and 10^25 flops of compute, whereas many Western frontier models are now sitting at between 10^25 and 10^26 flops. I think if this team of Tencent researchers had access to compute equivalent to that of their Western counterparts then this wouldn’t just be a world class open weight model – it might be competitive with the far more experienced proprietary models made by Anthropic, OpenAI, and so on.
    Read more: Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (arXiv).

***

Can 60 very talented mathematicians make a benchmark that withstands AI progress?
…The best LLMs get 2% on FrontierMath today, but for how long?…
Epoch AI, a research organization dedicated to tracking AI progress, has built FrontierMath, an extremely challenging mathematical understanding benchmark. FrontierMath was built in partnership with 60 skilled mathematicians “including professors, IMO question writers, and Fields medalists”. To translate this into normal-speak: the basketball equivalent of FrontierMath would be a basketball-competency testing regime designed by Michael Jordan, Kobe Bryant, and a bunch of NBA All-Stars, because AIs have got so good at playing basketball that only NBA All-Stars can judge their performance effectively.
     This is also a very neat illustration of how advanced AI systems have become. Grade School math benchmarks? Obliterated. Undergraduate math tests? Broadly solved. Graduate-level math evals? Teetering on the precipice. International Math Olympiad Gold medal? Just about to be breached based on stuff like AlphaGeometry. The fact that AI systems have become so advanced that the best way to infer progress is to build stuff like this should make us all stand up and pay attention. (And remember, this is happening in physics, chemistry, coding, and many other domains. The world is being irrevocably changed by the arrival of thinking machines and we now need the best minds in the world to figure out how to test this stuff. It’s crazy!) 

What FrontierMath contains: FrontierMath contains questions in number theory, combinatorics, group theory and generalization, probability theory and stochastic processes, and more. Fields Medalist Terence Tao says the questions are “extremely challenging… I think they will resist AIs for several years at least”. To calibrate yourself take a read of the appendix in the paper introducing the benchmark and study some sample questions – I predict fewer than 1% of the readers of this newsletter will even have a good notion of where to start on answering this stuff. “These problems span major branches of modern mathematics – from computational number theory to abstract algebraic geometry – and typically require hours or days for expert mathematicians to solve,” the authors write. 
   “[The questions I looked at] were all not really in my area and all looked like things I had no idea how to solve…they appear to be at a different level of difficulty from IMO problems,” said Timothy Gowers (Fields Medal, 1998) after looking at some of the questions. 

The bar is set at 2%: In tests, GPT 4o and Sonnet 3.5 both get around 2% on the benchmark – and they’re given every possible advantage to help them crunch the literal numbers: “Our evaluation framework grants models ample thinking time and the ability to experiment and iterate. Models interact with a Python environment where they can write and execute code to test hypotheses, verify intermediate results, and refine their approaches based on immediate feedback.”
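
To make the “experiment and iterate” setup concrete, here is a minimal Python sketch of that kind of loop. It is not Epoch AI’s harness: `toy_model`, `solve`, `run_python`, and the exact-match grading in `is_correct` are all hypothetical stand-ins, and the `exec` call is not a real sandbox.

```python
import contextlib
import io


def run_python(code: str) -> str:
    """Execute a code string and return whatever it printed (or the error)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, {})  # fresh namespace per call; NOT a real sandbox
    except Exception as exc:  # errors are fed back to the model as text
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


def toy_model(question: str, transcript: list) -> dict:
    """Hypothetical stand-in for an LLM: experiment first, then answer."""
    if not transcript:
        # First turn: write code to test a hypothesis.
        return {"type": "code", "content": "print(sum(i * i for i in range(1, 11)))"}
    # Later turn: commit to the output observed from the experiment.
    return {"type": "answer", "content": transcript[-1][1]}


def solve(question: str, model, max_turns: int = 5):
    """Loop: on each turn the model either runs code or submits an answer."""
    transcript = []  # (code, output) pairs shown back to the model
    for _ in range(max_turns):
        step = model(question, transcript)
        if step["type"] == "answer":
            return step["content"]
        transcript.append((step["content"], run_python(step["content"])))
    return None  # ran out of turns without an answer


def is_correct(submitted, reference: str) -> bool:
    """Assumed grading rule: exact match against a known final answer."""
    return submitted is not None and submitted.strip() == reference.strip()


answer = solve("What is 1^2 + 2^2 + ... + 10^2?", toy_model)
print(answer, is_correct(answer, "385"))
```

In this sketch the ability to run code and see the result only helps if the model knows which experiment to run; a harness like this supplies the feedback loop but not the mathematical insight.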

Why this matters – will this stand the test of time or fade like so many others? So many recent benchmarks have fallen to the march of AI systems that many people who have built ‘hard’ benchmarks have quickly become quite shocked by the pace of progress on them (see: BigBench, MMLU, MATH, GPQA). The authors of FrontierMath are more optimistic – and it seems like they should be, judged by how much effort they’ve put in, and Fields Medallists agree: “Chen and Tao both suggested that human experts working with AI systems could potentially tackle FrontierMath problems within around three years, much sooner than fully autonomous AI solutions.”
   My prediction: An AI system working on its own will get 80% on FrontierMath by 2028. And if I’m right… is that AGI? Or like so many other benchmarks before it, will solving this incredibly hard test reveal another wrinkle in the subtle beauty that is our consciousness?
   Read more: FrontierMath (Epoch AI).
   Read the research paper: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI (arXiv)

***

Researchers say the path to wise AIs runs through metacognition:
…Sure, AI is intelligent. But it isn’t wise – and that’s a problem…
Today’s AI systems are very capable, but they aren’t very good at dealing with intractable problems. To solve this, they need wisdom. And to gain wisdom, they need metacognition. That’s the thesis of a new paper from researchers with the University of Waterloo, Warwick University, Stanford University, the Allen Institute for AI, the Santa Fe Institute, and the Max Planck Institutes for Human Development and Intelligent Systems. 

What wisdom is and why it’s needed: “We define wisdom functionally as the ability to successfully navigate intractable problems— those that do not lend themselves to analytic techniques due to unlearnable probability distributions or incommensurable values,” the researchers write.  “If life were a series of textbook problems, we would not need to be wise.”

What are intractable problems? The kind of things that challenge today’s AI systems have the following properties:

  • Incommensurable: They have ambiguous goals or values that can’t be reconciled with one another.

  • Transformative: The outcome might change your preferences, so your present and future values clash. 

  • Radically uncertain: You can’t list all the outcomes or assign probabilities. 

  • Chaotic: There could be a strong nonlinearity or other feature that makes it very unpredictable.

  • Non-stationary: The underlying thing you’re dealing with may be changing over time, making it hard for you to learn a probability distribution. 

  • Out-of-distribution: A black swan situation you’ve never encountered before. 

  • Computationally explosive: You can’t figure out the correct move with achievable finite resources. 

Solving intractable problems requires metacognition: The main claim here is that the path to solving these problems runs through ‘metacognition’, which is basically a suite of helper functions an AI system might use to help it fruitfully apply its intelligence to so-called intractable problems. These metacognitive processes include: 

  • Intellectual humility: The ability to know what you do and don’t know. 

  • Epistemic deference: Ability to defer to others’ expertise when appropriate. 

  • Scenario flexibility: Figuring out diverse ways in which a scenario could unfold. 

  • Context adaptability: Figuring out features from an intractable situation that make it comparable to other situations. 

  • Perspective seeking: Being able to draw on other perspectives to gain information to solve a problem.

  • Viewpoint balancing: Being able to integrate various discrepant interests into a single thing.

How metacognition leads to wisdom: The authors believe systems with these properties might be significantly better than those without. “For example, a wise AI system might be more willing to spin its wheels to solve a problem compared to a wise human; it might generate vast numbers of scenarios to analyze many possible contingencies, evincing an extreme version of scenario flexibility,” they write. 

Why this matters – is metacognition just LLMs + RL? An extremely persistent thought I had while reading this paper was… isn’t this just what the new crop of RL-infused LLMs give you? Some of the new models, like OpenAI’s o1 model, exhibit some of the traits described here where, upon encountering confusing or hard to parse scenarios, they think out loud to themselves for a while, simulating multiple distinct perspectives, performing rollouts, running their own live experiments, and so on. While this LLM + RL paradigm doesn’t deal with all the stuff outlined here, it certainly seems to take a meaningful step closer. 
    When reading this paper I had the distinct feeling that it might soon be ‘overtaken by reality’, like so many thoughtful papers published about the supposed gulf between today’s AI systems and truly smart ones. Perhaps the age of wise AI systems is nearly upon us.
   Read more: Imagining and building wise machines: The centrality of AI metacognition (arXiv).

***

AI consciousness is something AI companies need to think about:
…We should take seriously a “realistic possibility” of conscious systems soon…
A group of researchers thinks there is a “realistic possibility” that AI systems could soon be conscious and that AI companies need to take action today to prepare for this. The researchers – who come from Eleos AI (a nonprofit research organization oriented around AI welfare), New York University, University of Oxford, Stanford University, and the London School of Economics – published their claim in a recent paper, noting that “there is a realistic possibility that some AI systems will be conscious and/or robustly agentic, and thus morally significant, in the near future”.

Why are they making this claim? As contemporary AI systems have got more capable, more and more researchers have started confronting the problem of what happens if they keep getting better – might they eventually become conscious entities which we have a duty of care to? Though you may have an instinctive ‘no, that’s ridiculous’ reaction to this idea, it’s worth challenging your own assumptions – a good survey paper in 2023 looked across all the different technical means by which AI systems are built and used this to determine it’s hard to rule out the possibility of consciousness in contemporary AI systems (Import AI #338). In 2024, researchers – including a Turing Award winner – made an even more forthright claim, writing in a preprint that “AI consciousness is inevitable” and walking through the arguments for this (Import AI #369).

Different routes to moral patienthood: The researchers see two distinct routes AI systems could take to becoming moral patients worthy of our care and attention: consciousness and agency (the two of which are likely going to be intertwined). 

  • Consciousness route to moral patienthood. “There is a realistic, non-negligible possibility that: 1. Normative: Consciousness suffices for moral patienthood, and 2. Descriptive: There are computational features — like a global workspace, higher-order representations, or an attention schema — that both: a. Suffice for consciousness, and b. Will exist in some near-future AI systems”.

  • Robust agency route to moral patienthood. “There is a realistic, non-negligible possibility that: 1. Normative: Robust agency suffices for moral patienthood, and 2. Descriptive: There are computational features — like certain forms of planning, reasoning, or action-selection — that both: a. Suffice for robust agency, and b. Will exist in some near-future AI systems.”

What should AI companies do? The researchers urge AI companies to take three distinct types of actions in response to the issue of AI consciousness. Specifically, AI companies should:

  • Acknowledge: “that AI welfare is an important and difficult issue, and that there is a realistic, non-negligible chance that some AI systems will be welfare subjects and moral patients in the near future”. When doing this, companies should try to communicate with probabilistic estimates, solicit external input, and maintain commitments to AI safety. 

  • Assess: “Develop a framework for estimating the probability that particular AI systems are welfare subjects and moral patients, and that particular policies are good or bad for them,” they write. These assessments should include “sources of evidence that make sense for AI systems, such as architectural features; on theories of consciousness that make sense for AI systems, such as computational functionalist theories; and on sources of moral patienthood that make sense in this context, such as various kinds of robust agency.”

  • Prepare: “Develop policies and procedures that will allow AI companies to treat potentially morally significant AI systems with an appropriate level of moral concern,” they write. As part of this, they recommend AI companies hire or appoint someone responsible for AI welfare. 

Why this matters – if AI systems keep getting better then we’ll have to confront this issue: The goal of many companies at the frontier is to build artificial general intelligence. This goal holds within itself the implicit assumption that a sufficiently smart AI will have some notion of self and some level of self-awareness – the generality many envisage is bound up in agency and agency is bound up in some level of situational awareness and situational awareness tends to imply a separation between “I” and the world, and thus consciousness may be a ‘natural dividend’ of making increasingly smart systems. 
    Companies must equip themselves to confront this possibility: “We are not arguing that near-future AI systems will, in fact, be moral patients, nor are we making recommendations that depend on that conclusion,” the authors write. “We are instead arguing that near-future AI systems have a realistic chance of being moral patients given the information and arguments currently available, and we are making recommendations that depend on that conclusion — recommendations that focus on aspiring to learn more while preparing for the possible emergence of AI moral patienthood as a precautionary measure.”
     (Incidentally, one of the authors of the paper recently joined Anthropic to work on this precise question…) 
   Read more: New report: Taking AI Welfare Seriously (Eleos AI Blog).
   Read the paper: Taking AI Welfare Seriously (Eleos, PDF).

***

Tech Tales:

Adverts after the uplift
[Online machine-authored adverts posted three years after beginning of superintelligence-driven uplift]

Are You (Uniquely) Experienced? Cash available. 
We pay same day cash for provably unique experiences – simply walk in, let us validate by comparing your experiences against the memoryweb, and then we’ll pay YOU for your memory. Not only that, but we will QUADRUPLE payments for memories that you allow us to delete from your own experience – a popular option for nightmares! 

Pilot-curious? Enquire within. 
Have you been wondering what it would be like to be piloted by a high-dimensional intelligence? Interested in learning about what opportunities this presents? We offer a range of pilot options and compensation structures. Come in for a free consultation today!

Things that inspired this story: Thinking about the sorts of ways machines and humans might trade with one another; the Craigslist economy in a superintelligence future; economic stratification.

Thanks for reading!


Import AI 390: LLMs think like people; neural Minecraft; Google’s cyberdefense AI

by Jack Clark


Google’s homegrown cyberdefense agent finds a real-world vulnerability:
…Yet more evidence that today’s language models are far more powerful than people think…
Project Naptime, a Google initiative to use contemporary AI methods to make cyberoffense and cyberdefense systems, has developed ‘Big Sleep’, a defensive AI agent. This week, Google announced that its Big Sleep agent had identified a real-world vulnerability in SQLite, a widely used database. 
   “We discovered the vulnerability and reported it to the developers in early October, who fixed it on the same day. Fortunately, we found this issue before it appeared in an official release, so SQLite users were not impacted,” Google writes. “We believe this is the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software”.

Why this matters – language models are more capable than you think: Google’s system is basically an LLM (here, Gemini 1.5 Pro) inside a specialized software harness designed around common cybersecurity tasks. This is important for two reasons: a) this illustrates how today’s LLMs are more powerful than people think – time and time again, people – including those behind the original Naptime research (Import AI 378) – are showing that if you give LLMs some specialized tools and helper functions, they can perform massively better than out-of-the-box LLMs, and b) it shows how AI can be used to improve cyberdefense, using contemporary AI systems to look at widely used software, identify vulnerabilities, and fix them before they reach the public. 
   Read more: From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code (Project Zero, Google)

***

Academics have a really small amount of compute:
…But you can sometimes get around small-compute by training for longer…
Researchers with Brown University recently conducted a very small survey to try and figure out how much compute academics have access to. The survey, which was conducted in April 2024, gathered responses from 50 researchers at 35 international institutions, and it indicated that very few people are happy with the state of academic compute. 
   “That said, most academics are not satisfied with the compute provided by their institution. 66% of respondents rated their satisfaction with their compute clusters at less than or equal to 3 out of 5 (indicating that some desired experiments are prohibitively expensive),” they wrote. “Based on our poll on user satisfaction, the majority of respondents want to and indeed would run more expensive types of experiments, if only they had the hardware for it.”

Hardware types: Another thing this survey highlights is how laggy academic compute is; frontier AI companies like Anthropic, OpenAI, etc, are constantly trying to secure the latest frontier chips in large quantities to help them train large-scale models more efficiently and quickly than their competitors. By comparison, this survey “suggests a common range for what constitutes “academic hardware” today: 1–8 GPUs—especially RTX 3090s, A6000s, and A100s—for days (typically) or weeks (at the higher-end) at a time,” they write. “10% of our respondents also report access to H100 GPUs: i.e. the newest generation Data Center GPUs.”

Why this matters – stagnation is a choice that governments are making: You know what a good strategy for ensuring the concentration of power over AI in the private sector would be? Systematically under-funding compute in the academic sector and therefore surrendering the frontier to deep-pocketed private sector actors. That’s exactly what this survey indicates is happening. This is a choice being made by (many) governments all over the world – and a deeply regrettable one.
   Read more: $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources (arXiv).

***

Language models think in the same way as people:
…When it comes to modeling human cognition, LLMs do better than specialized systems…
All around us now, week by week, the drops are falling – it’s like rain on a tin roof, except each drop is evidence of human-like sophistication in language models. Do you hear that sound? The notion that a technology is arriving into our world which might be truly transformative? Which might have the capacity to think and represent the world in ways uncannily similar to people?
    You’re not alone. A new paper from an interdisciplinary group of researchers provides more evidence for this strange world – language models, once tuned on a dataset of classic psychological experiments, outperform specialized systems at accurately modeling human cognition. 

Who did the research: The research was done by people with Helmholtz Munich, University of Tuebingen, University of Oxford, New York University, Max Planck Institute for Biological Cybernetics, Google DeepMind, Princeton University, University of California at San Diego, Boston University, Georgia Institute of Technology, University of Basel, Max Planck Institute for Human Development, Max Planck School of Cognition, TU Darmstadt, and the University of Cambridge.

What they did: They finetuned a LLaMa 3.1 70B model via QLoRA on a new dataset called Psych-101, then tested out how accurately the system could model and predict human cognition on a range of tasks. The results were very decisive, with the single finetuned LLM outperforming specialized domain-specific models in “all but one experiment”. The system also did well on out-of-distribution tasks, where it generalized better than hand-written and/or specialized systems. 

What is Psych-101? Psych-101 is a dataset “covering trial-by-trial data from 160 psychological experiments. We transcribed each of these experiments into natural language”, they write. The resulting dataset contains more than 10,000,000 distinct human choices and includes “many canonical studies from domains such as multi-armed bandits, decision-making, memory, supervised learning, Markov decision processes, and others”
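
To give a feel for what “transcribed each of these experiments into natural language” could look like in practice, here is a minimal Python sketch for a two-armed bandit session. The `Trial` class, the `transcribe` function, the prompt wording, and the `<<` `>>` markers around the human choice are all hypothetical illustrations, not the actual Psych-101 schema.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    choice: str  # which option the participant picked, e.g. "A" or "B"
    reward: int  # points the participant received on this trial


def transcribe(trials, options=("A", "B")) -> str:
    """Render one participant's session as a natural-language transcript."""
    lines = [
        f"You repeatedly choose between slot machines {' and '.join(options)}.",
        "After each choice you see how many points you earned.",
    ]
    for trial in trials:
        # Hypothetical convention: wrap the human choice in << >> so a
        # training pipeline can locate the tokens that encode a decision.
        lines.append(
            f"You choose machine <<{trial.choice}>> and earn {trial.reward} points."
        )
    return "\n".join(lines)


session = [Trial("A", 3), Trial("B", 7), Trial("B", 6)]
text = transcribe(session)
print(text)
```

Marking the choices explicitly is one plausible design: it would let a finetuning run score the model only on the parts of the text that correspond to human behavior, rather than on the task instructions.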

Why this matters – these LLMs really might be miniature people: Results like this show that the complexity of contemporary language models is sufficient to encompass and represent some of the ways in which humans respond to basic stimuli. This is the sort of thing that you read and nod along to, but if you sit with it, it’s really quite shocking – we’ve invented a machine that can approximate some of the ways in which humans respond to stimuli that challenge them to think. The fact this generalizes so well is also remarkable – and indicative of the underlying sophistication of the thing modeling the human responses.
   “A computational model like Centaur that can simulate and predict human behavior in any domain offers many direct applications. It may, for instance, be used for in silico prototyping of experimental studies,” they write. “Thinking one step further, Centaur finds applications in the context of automated cognitive science. For example, it can be integrated into frameworks that utilize predictive models to guide the development of psychological theories, such as scientific regret minimization”.
   Read more: Centaur: a foundation model of human cognition (PsyArXiv Preprints).
   Get the Psych-101 dataset here (HuggingFace).

***

Minecraft – inside the weights of a neural network:
…A taste of the infinite generative-everything future…
In the past few issues of this newsletter I’ve talked about how a new class of generative models is making it possible for researchers to build games inside neural networks – in other words, games which are going to be infinitely replayable because they can be generated on-the-fly, and also games where there is no underlying source code; it’s all stored in the weights of the network. 
   Now, researchers with two startups – Etched and Decart – have built a visceral demonstration of this, embedding Minecraft inside a neural network. You can play the resulting game in your browser; it’s incredible – you can play a full game and other than the slightly soupy images (some of which resolve late, as the neural net decides it is now a probable object to render), it feels remarkably similar to the real thing. 
    This is a big deal – it portends a future of infinite games. And just imagine what happens as people work out how to embed multiple games into a single model – perhaps we can imagine generative models that seamlessly fuse the styles and gameplay of distinct games? 

How they did it: “The model is composed of two parts: a spatial autoencoder, and a latent diffusion backbone. Both are Transformer-based: the autoencoder is based on ViT, and the backbone is based on DiT,” they write. “In contrast to bidirectional models such as Sora, Oasis generates frames autoregressively, with the ability to condition each frame on game input. This enables users to interact with the world in real-time.”
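
The autoregressive, input-conditioned generation described in that quote can be sketched as a simple loop in Python. This is a toy under loud assumptions, not the Oasis code: `denoise_next_latent` and `decode` are hypothetical stand-ins for the diffusion backbone and the autoencoder’s decoder, latents are short lists of floats, and the “game dynamics” are just addition.

```python
import random


def denoise_next_latent(history, action, steps=4, seed=0):
    """Hypothetical stand-in for a diffusion backbone.

    Starts from noise and moves it toward a target derived from the previous
    latent and the player's action. A real model would learn this mapping.
    """
    rng = random.Random(seed)
    previous = history[-1]
    target = [value + action for value in previous]  # toy "game dynamics"
    latent = [rng.gauss(0.0, 1.0) for _ in previous]  # start from pure noise
    for step in range(steps):
        blend = (step + 1) / steps  # reaches 1.0 on the final step
        latent = [(1 - blend) * x + blend * t for x, t in zip(latent, target)]
    return latent


def decode(latent):
    """Hypothetical stand-in for the autoencoder's decoder (latent -> frame)."""
    return [round(value, 3) for value in latent]


def play(initial_latent, actions):
    """Generate one frame per action, conditioned on prior frames and input."""
    history = [initial_latent]
    frames = []
    for t, action in enumerate(actions):
        latent = denoise_next_latent(history, action, seed=t)
        history.append(latent)  # the new frame becomes context for the next
        frames.append(decode(latent))
    return frames


frames = play([0.0, 0.0], actions=[1.0, 1.0, -1.0])
print(frames)
```

The point of the sketch is the shape of the loop: because each frame is produced only from past frames plus the current input, the player’s action can be injected at every step, which is what a bidirectional model generating a whole clip at once would not allow.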

Things that make you go ‘hmmm’ – this is also a chip advert: One of the startups behind this – Etched – is designing a specialized inference ASIC called Sohu on which to run games like this. “Sohu can scale to massive 100B+ next-generation models in 4K resolution,” they write. 

It’s going to get better (and bigger): As with so many parts of AI development, scaling laws show up here as well. “Following an in-depth sensitivity analysis on different configurations of the architecture alongside the data and model size, we hypothesize that the majority of these aspects may be addressed through scaling of the model and the datasets,” they write. 
   Read more: Oasis: A Universe in a Transformer (Oasis Model, GitHub).

***

Tech Tales:

The classification engine 
The strategic dominance plan for unprecedented abundance relied on classification – specifically, the intentional walling off of certain scientific insights delivered by the first AGI-class system. The powers that be determined that, despite the promise of material wealth the likes of which no human civilization had ever known, some kind of ‘strategic edge’ needed to be maintained. Therefore, a subset of the new scientific discoveries made by the system were pre-allocated into a compartment where only a few select human-run organizations would have access to them. The AGI system was also put to work to confound other attempts to discover these secrets, publishing scientific papers and frameworks and generally ‘nudging’ people worldwide away from the science that had been walled off and compartmented. In this way the humans believed a form of dominance could be maintained – though over what and for what purpose was not clear even to them. 

Things that inspired this story: The basic animal tendency to stockpile things; thinking about how governments might relate to AI systems.

Thanks for reading!
