Import AI

Import AI 391: China’s amazing open weight LLM; Fields Medalists VS AI Progress; wisdom and intelligence

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

The world’s most capable open weight model is now made in China:
…Tencent’s new Hunyuan model is a MoE triumph, and by some measures is world class…
The world’s best open weight model might now be Chinese – that’s the takeaway from a recent Tencent paper that introduces Hunyuan-Large, a MoE model with 389 billion parameters (52 billion activated). In a broad range of benchmarks Hunyuan outperforms Facebook’s LLaMa-3.1 405B parameter model, which is widely thought to be the world’s current best open weight model. “Hunyuan-Large is capable of handling various tasks including commonsense understanding, question answering, mathematics reasoning, coding, and aggregated tasks, achieving the overall best performance among existing open-source similar-scale LLMs,” the Tencent researchers write.  

What they did: There isn’t too much mystery here – the authors gathered a large (undisclosed) dataset of books, code, webpages, and so on, then also built a synthetic data generation pipeline to augment this. They used Rotary Position Embeddings (RoPE) for position learning and SwiGLU for activation. They also did a scaling law study on smaller models to help them figure out the exact mix of compute and parameters and data for their final run: “We meticulously trained a series of MoE models, spanning from 10M to 1B activation parameters, utilizing 100B tokens of pre-training data. By leveraging the isoFLOPs curve, we determined the optimal number of active parameters and training data volume within a restricted compute budget, adjusted according to the actual training token batch size, through an exploration of these models across data sizes ranging from 10B to 100B tokens,” they wrote. 
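The isoFLOPs procedure they describe can be sketched in miniature: at one fixed compute budget you measure final loss for several model sizes, fit a curve through the (log-size, loss) points, and read off the minimum. The numbers below are invented for illustration – the paper does not publish its raw curve data.

```python
# Hedged sketch of an isoFLOPs fit: at a fixed compute budget, train models of
# several sizes and find the size that minimizes loss. The losses below are
# hypothetical illustration values, not Tencent's actual measurements.
points = [(7.0, 3.10), (8.0, 2.75), (9.0, 3.00)]  # (log10 active params, loss)

(x1, y1), (x2, y2), (x3, y3) = points  # equally spaced in log-parameter space
h = x2 - x1
# Vertex of the parabola through three equally spaced points approximates the
# compute-optimal model size for this budget.
optimal_log_n = x2 + h * (y1 - y3) / (2 * (y1 - 2 * y2 + y3))
print(f"compute-optimal size ~= 10^{optimal_log_n:.2f} active parameters")
```

Repeating this across several compute budgets gives the family of isoFLOPs curves from which the final parameter/data split is chosen.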

It does extremely well: The resulting model performs very competitively against LLaMa-3.1 405B, beating it on tasks like MMLU (language understanding and reasoning), BIG-Bench Hard (a suite of challenging tasks), and GSM8K and MATH (math understanding). However, LLaMa-3.1 405B still has an edge on a couple of hard frontier benchmarks like MMLU-Pro and ARC-C. 
    Caveats: From eyeballing the scores, the model seems extremely competitive with LLaMa 3.1 and may in some areas exceed it. But there’s really no substitute for talking to the model itself and doing some compare-and-contrast. Also, Chinese labs have sometimes been known to juice their evals, where things that look promising on the page turn out to be terrible in reality. 
    However, the whole paper – scores, approach, and all – seems generally quite measured and sensible, so I think this is likely a legitimate model. 

Why this matters – competency is everywhere, it’s just compute that matters: This paper seems generally very competent and sensible. The only key differentiator between this system and one trained in the West is compute – on the scaling law graph this model seems to come in somewhere between 10^24 and 10^25 FLOPs of compute, whereas many Western frontier models are now sitting at between 10^25 and 10^26 FLOPs. I think if this team of Tencent researchers had access to compute equivalent to that of their Western counterparts, then this wouldn’t just be a world class open weight model – it might be competitive with the far more expensive proprietary models made by Anthropic, OpenAI, and so on.
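For intuition on where a 10^24–10^25 figure comes from, here’s a back-of-envelope estimate using the common C ≈ 6·N·D approximation (N = activated parameters, D = training tokens). The token count below is my assumption purely for illustration – the paper doesn’t disclose its dataset size.

```python
# Back-of-envelope training compute via the common C ~= 6 * N * D rule.
# For an MoE, N is the *activated* parameter count per token, not the total.
active_params = 52e9   # Hunyuan-Large's 52B activated parameters
tokens = 7e12          # hypothetical multi-trillion-token corpus (assumption)
flops = 6 * active_params * tokens
print(f"~{flops:.2e} FLOPs")  # falls in the 10^24-10^25 band
```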
    Read more: Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent (arXiv).

***

Can 60 very talented mathematicians make a benchmark that withstands AI progress?
…The best LLMs get 2% on FrontierMath today, but for how long?…
Epoch AI, a research organization dedicated to tracking AI progress, has built FrontierMath, an extremely challenging mathematical understanding benchmark. FrontierMath was built in partnership with 60 skilled mathematicians “including professors, IMO question writers, and Fields medalists”. To translate this into normal-speak: the basketball equivalent of FrontierMath would be a basketball-competency testing regime designed by Michael Jordan, Kobe Bryant, and a bunch of NBA All-Stars, because AIs have got so good at playing basketball that only NBA All-Stars can judge their performance effectively.
     This is also a very neat illustration of how advanced AI systems have become. Grade School math benchmarks? Obliterated. Undergraduate math tests? Broadly solved. Graduate-level math evals? Teetering on the precipice. International Math Olympiad Gold medal? Just about to be breached based on stuff like AlphaGeometry. The fact that AI systems have become so advanced that the best way to infer progress is to build stuff like this should make us all stand up and pay attention. (And remember, this is happening in physics, chemistry, coding, and many other domains. The world is being irrevocably changed by the arrival of thinking machines and we now need the best minds in the world to figure out how to test this stuff. It’s crazy!) 

What FrontierMath contains: FrontierMath contains questions in number theory, combinatorics, group theory and generalization, probability theory and stochastic processes, and more. Fields Medal winner Terence Tao says the questions are “extremely challenging… I think they will resist AIs for several years at least”. To calibrate yourself, take a read of the appendix in the paper introducing the benchmark and study some sample questions – I predict fewer than 1% of the readers of this newsletter will even have a good notion of where to start on answering this stuff. “These problems span major branches of modern mathematics—from computational number theory to abstract algebraic geometry—and typically require hours or days for expert mathematicians to solve,” the authors write. 
   “[The questions I looked at] were all not really in my area and all looked like things I had no idea how to solve…they appear to be at a different level of difficulty from IMO problems,” said Timothy Gowers, Fields Medalist (1998), after looking at some of the problems. 

The bar is set at 2%: In tests, GPT-4o and Sonnet 3.5 both get around 2% on the benchmark – and they’re given every possible advantage to help them crunch the literal numbers: “Our evaluation framework grants models ample thinking time and the ability to experiment and iterate. Models interact with a Python environment where they can write and execute code to test hypotheses, verify intermediate results, and refine their approaches based on immediate feedback.”
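The execute-and-iterate setup described in that quote can be sketched as a minimal harness: the model proposes code, the harness runs it in a subprocess, and the output is fed back as grounds for the next attempt. Everything here is a simplified stand-in, not Epoch’s actual framework.

```python
import subprocess
import sys
import tempfile

def run_candidate(code: str, timeout: int = 10) -> str:
    """Execute model-written Python in a subprocess and return its output,
    giving the model concrete feedback to verify or refine its approach."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path],
                            capture_output=True, text=True, timeout=timeout)
    return result.stdout.strip() or result.stderr.strip()

# A model might numerically check a hypothesis before committing to an answer:
print(run_candidate("print(sum(i * i for i in range(1, 11)))"))  # -> 385
```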

Why this matters – will this stand the test of time or fade like so many others? So many recent benchmarks have fallen to the march of AI systems that many people who have built ‘hard’ benchmarks have quickly become quite shocked by the pace of progress on them (see: BigBench, MMLU, MATH, GPQA). The authors of FrontierMath are more optimistic – and it seems like they should be, judging by how much effort they’ve put in – and Fields Medalists agree: “Chen and Tao both suggested that human experts working with AI systems could potentially tackle FrontierMath problems within around three years, much sooner than fully autonomous AI solutions.”
   My prediction: An AI system working on its own will get 80% on FrontierMath by 2028. And if I’m right… is that AGI? Or like so many other benchmarks before it, will solving this incredibly hard test reveal another wrinkle in the subtle beauty that is our consciousness?
   Read more: FrontierMath (Epoch AI).
   Read the research paper: FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI (arXiv).

***

Researchers say the path to wise AIs runs through metacognition:
…Sure, AI is intelligent. But it isn’t wise – and that’s a problem…
Today’s AI systems are very capable, but they aren’t very good at dealing with intractable problems. To solve this, they need wisdom. And to gain wisdom, they need metacognition. That’s the thesis of a new paper from researchers with the University of Waterloo, Warwick University, Stanford University, the Allen Institute for AI, the Santa Fe Institute, and the Max Planck Institutes for Human Development and Intelligent Systems. 

What wisdom is and why it’s needed: “We define wisdom functionally as the ability to successfully navigate intractable problems—those that do not lend themselves to analytic techniques due to unlearnable probability distributions or incommensurable values,” the researchers write. “If life were a series of textbook problems, we would not need to be wise.”

What are intractable problems? The kind of things that challenge today’s AI systems have the following properties:

  • Incommensurable: They have ambiguous goals or values that can’t be reconciled with one another.

  • Transformative: The outcome might change your preferences, so your present and future values clash. 

  • Radically uncertain: You can’t list all the outcomes or assign probabilities. 

  • Chaotic: There could be a strong nonlinearity or other feature that makes it very unpredictable.

  • Non-stationary: The underlying thing you’re dealing with may be changing over time, making it hard for you to learn a probability distribution. 

  • Out-of-distribution: A black swan situation you’ve never encountered before. 

  • Computationally explosive: You can’t figure out the correct move with achievable finite resources. 

Solving intractable problems requires metacognition: The main claim here is that the path to solving these problems runs through ‘metacognition’, which is basically a suite of helper functions an AI system might use to help it fruitfully apply its intelligence to so-called intractable problems. These metacognitive processes include: 

  • Intellectual humility: The ability to know what you do and don’t know. 

  • Epistemic deference: Ability to defer to others’ expertise when appropriate. 

  • Scenario flexibility: Figuring out diverse ways in which a scenario could unfold. 

  • Context adaptability: Figuring out features from an intractable situation that makes it comparable to other situations. 

  • Perspective seeking: Being able to draw on other perspectives to gain information to solve a problem.

  • Viewpoint balancing: Being able to integrate various discrepant interests into a single thing.

How metacognition leads to wisdom: The authors believe systems with these properties might be significantly better than those without. “For example, a wise AI system might be more willing to spin its wheels to solve a problem compared to a wise human; it might generate vast numbers of scenarios to analyze many possible contingencies, evincing an extreme version of scenario flexibility,” they write. 

Why this matters – is metacognition just LLMs + RL? An extremely persistent thought I had while reading this paper was… isn’t this just what the new crop of RL-infused LLMs give you? Some of the new models, like OpenAI’s o1 model, exhibit some of the traits described here where, upon encountering confusing or hard to parse scenarios, they think out loud to themselves for a while, simulating multiple distinct perspectives, performing rollouts, running their own live experiments, and so on. While this LLM + RL paradigm doesn’t deal with all the stuff outlined here, it certainly seems to take a meaningful step closer. 
    When reading this paper I had the distinct feeling that it might soon be ‘overtaken by reality’, like so many thoughtful papers published about the supposed gulf between today’s AI systems and truly smart ones. Perhaps the age of wise AI systems is nearly upon us.
   Read more: Imagining and building wise machines: The centrality of AI metacognition (arXiv).

***

AI consciousness is something AI companies need to think about:
…We should take seriously a “realistic possibility” of conscious systems soon…
A group of researchers thinks there is a “realistic possibility” that AI systems could soon be conscious and that AI companies need to take action today to prepare for this. The researchers – who come from Eleos AI (a nonprofit research organization oriented around AI welfare), New York University, University of Oxford, Stanford University, and the London School of Economics – published their claim in a recent paper, noting that “there is a realistic possibility that some AI systems will be conscious and/or robustly agentic, and thus morally significant, in the near future”.

Why are they making this claim? As contemporary AI systems have got more capable, more and more researchers have started confronting the problem of what happens if they keep getting better – might they eventually become conscious entities to which we have a duty of care? Though you may have an instinctive ‘no, that’s ridiculous’ reaction to this idea, it’s worth challenging your own assumptions – a good survey paper in 2023 looked across all the different technical means by which AI systems are built and concluded that it’s hard to rule out the possibility of consciousness in contemporary AI systems (Import AI #338). In 2024, researchers – including a Turing Award winner – made an even more forthright claim, writing in a preprint that “AI consciousness is inevitable” and walking through the arguments for this (Import AI #369).

Different routes to moral patienthood: The researchers see two distinct routes AI systems could take to becoming moral patients worthy of our care and attention: consciousness and agency (the two of which are likely going to be intertwined). 

  • Consciousness route to moral patienthood. There is a realistic, non-negligible possibility that: 1. Normative: Consciousness suffices for moral patienthood, and 2. Descriptive: There are computational features — like a global workspace, higher-order representations, or an attention schema — that both: a. Suffice for consciousness, and b. Will exist in some near-future AI systems.

  • Robust agency route to moral patienthood. There is a realistic, non-negligible possibility that: 1. Normative: Robust agency suffices for moral patienthood, and 2. Descriptive: There are computational features — like certain forms of planning, reasoning, or action-selection — that both: a. Suffice for robust agency, and b. Will exist in some near-future AI systems.

What should AI companies do? The researchers urge AI companies to take three distinct types of actions in response to the issue of AI consciousness. Specifically, AI companies should:

  • Acknowledge: “that AI welfare is an important and difficult issue, and that there is a realistic, non-negligible chance that some AI systems will be welfare subjects and moral patients in the near future”. When doing this, companies should try to communicate with probabilistic estimates, solicit external input, and maintain commitments to AI safety. 

  • Assess: “Develop a framework for estimating the probability that particular AI systems are welfare subjects and moral patients, and that particular policies are good or bad for them,” they write. These assessments should include “sources of evidence that make sense for AI systems, such as architectural features; on theories of consciousness that make sense for AI systems, such as computational functionalist theories; and on sources of moral patienthood that make sense in this context, such as various kinds of robust agency.”

  • Prepare: “Develop policies and procedures that will allow AI companies to treat potentially morally significant AI systems with an appropriate level of moral concern,” they write. As part of this, they recommend AI companies hire or appoint someone responsible for AI welfare. 

Why this matters – if AI systems keep getting better then we’ll have to confront this issue: The goal of many companies at the frontier is to build artificial general intelligence. This goal holds within itself the implicit assumption that a sufficiently smart AI will have some notion of self and some level of self-awareness – the generality many envisage is bound up in agency, agency is bound up in some level of situational awareness, and situational awareness tends to imply a separation between “I” and the world; thus consciousness may be a ‘natural dividend’ of making increasingly smart systems. 
    Companies must equip themselves to confront this possibility: “We are not arguing that near-future AI systems will, in fact, be moral patients, nor are we making recommendations that depend on that conclusion,” the authors write. “We are instead arguing that near-future AI systems have a realistic chance of being moral patients given the information and arguments currently available, and we are making recommendations that depend on that conclusion — recommendations that focus on aspiring to learn more while preparing for the possible emergence of AI moral patienthood as a precautionary measure.”
     (Incidentally, one of the authors of the paper recently joined Anthropic to work on this precise question…) 
   Read more: New report: Taking AI Welfare Seriously (Eleos AI Blog).
   Read the paper: Taking AI Welfare Seriously (Eleos, PDF).

***

Tech Tales:

Adverts after the uplift
[Online machine-authored adverts posted three years after beginning of superintelligence-driven uplift]

Are You (Uniquely) Experienced? Cash available. 
We pay same day cash for provably unique experiences – simply walk in, let us validate by comparing your experiences against the memoryweb, and then we’ll pay YOU for your memory. Not only that, but we will QUADRUPLE payments for memories that you allow us to delete from your own experience – a popular option for nightmares! 

Pilot-curious? Enquire within. 
Have you been wondering what it would be like to be piloted by a high-dimensional intelligence? Interested in learning about what opportunities this presents? We offer a range of pilot options and compensation structures. Come in for a free consultation today!

Things that inspired this story: Thinking about the sorts of ways machines and humans might trade with one another; the Craigslist economy in a superintelligence future; economic stratification.

Thanks for reading!

Subscribe now

Import AI 390: LLMs think like people; neural Minecraft; Google’s cyberdefense AI


Google’s homegrown cyberdefense agent finds a real-world vulnerability:
…Yet more evidence that today’s language models are far more powerful than people think…
Project Naptime, a Google initiative to use contemporary AI methods to make cyberoffense and cyberdefense systems, has developed ‘Big Sleep’, a defensive AI agent. This week, Google announced that its Big Sleep agent had identified a real-world vulnerability in SQLite, a widely used database. 
   “We discovered the vulnerability and reported it to the developers in early October, who fixed it on the same day. Fortunately, we found this issue before it appeared in an official release, so SQLite users were not impacted,” Google writes. “We believe this is the first public example of an AI agent finding a previously unknown exploitable memory-safety issue in widely used real-world software”.

Why this matters – language models are more capable than you think: Google’s system is basically an LLM (here, Gemini 1.5 Pro) inside a specialized software harness designed around common cybersecurity tasks. This is important for two reasons: a) it illustrates how today’s LLMs are more powerful than people think – time and time again, people – including the original Naptime researchers (Import AI #378) – are showing that if you give LLMs some specialized tools and helper functions, they can perform massively better than out-of-the-box models, and b) it shows how AI can be used to improve cyberdefense, using contemporary AI systems to look at widely used software, identify vulnerabilities, and fix them before they reach the public. 
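A hedged sketch of the general ‘LLM in a harness’ pattern that paragraph describes: the model emits tool calls, the harness executes them and feeds observations back. The tool names and the stub model below are illustrative inventions, not Project Naptime’s real interface.

```python
# Toy harness loop: model chooses a tool, harness executes it, model sees the
# result. All tools and the stub "model" here are hypothetical stand-ins.
TOOLS = {
    "read_file": lambda path: f"<contents of {path}>",
    "run_fuzzer": lambda _: "crash: heap-buffer-overflow in parse()",
}

def stub_model(observation):
    # Stand-in for an LLM deciding its next action given the last observation.
    if observation is None:
        return ("read_file", "src/parse.c")
    return ("run_fuzzer", "")

observation = None
transcript = []
for _ in range(2):                   # fixed budget of harness steps
    tool, arg = stub_model(observation)
    observation = TOOLS[tool](arg)   # execute the tool, record the observation
    transcript.append((tool, observation))
print(transcript)
```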
   Read more: From Naptime to Big Sleep: Using Large Language Models To Catch Vulnerabilities In Real-World Code (Project Zero, Google).

***

Academics have a really small amount of compute:
…But you can sometimes get around small-compute by training for longer…
Researchers with Brown University recently conducted a very small survey to try and figure out how much compute academics have access to. The survey, which was conducted in April 2024, gathered responses from 50 researchers at 35 international institutions, and it indicated that very few people are happy with the state of academic compute. 
   “That said, most academics are not satisfied with the compute provided by their institution. 66% of respondents rated their satisfaction with their compute clusters at less than or equal to 3 out of 5 (indicating that some desired experiments are prohibitively expensive),” they wrote. “Based on our poll on user satisfaction, the majority of respondents want to and indeed would run more expensive types of experiments, if only they had the hardware for it.”

Hardware types: Another thing this survey highlights is how laggy academic compute is; frontier AI companies like Anthropic, OpenAI, etc, are constantly trying to secure the latest frontier chips in large quantities to help them train large-scale models more efficiently and quickly than their competitors. By comparison, this survey “suggests a common range for what constitutes “academic hardware” today: 1–8 GPUs—especially RTX 3090s, A6000s, and A100s—for days (typically) or weeks (at the higher-end) at a time,” they write. “10% of our respondents also report access to H100 GPUs: i.e. the newest generation Data Center GPUs.”

Why this matters – stagnation is a choice that governments are making: You know what a good strategy for ensuring the concentration of power over AI in the private sector would be? Systematically under-funding compute in the academic sector and therefore surrendering the frontier to deep-pocketed private sector actors. That’s exactly what this survey indicates is happening. This is a choice being made by (many) governments all over the world – and a deeply regrettable one.
   Read more: $100K or 100 Days: Trade-offs when Pre-Training with Academic Resources (arXiv).

***

Language models think in the same way as people:
…When it comes to modeling human cognition, LLMs do better than specialized systems…
All around us now, week by week, the drops are falling – it’s like rain on a tin roof, except the drops are evidence of human-like sophistication in language models. Do you hear that sound? The notion that a technology is arriving in our world which might be truly transformative? Which might have the capacity to think and represent the world in ways uncannily similar to people?
    You’re not alone. A new paper from an interdisciplinary group of researchers provides more evidence for this strange world – language models, once tuned on a dataset of classic psychological experiments, outperform specialized systems at accurately modeling human cognition. 

Who did the research: The research was done by people with Helmholtz Munich, University of Tuebingen, University of Oxford, New York University, Max Planck Institute for Biological Cybernetics, Google DeepMind, Princeton University, University of California at San Diego, Boston University, Georgia Institute of Technology, University of Basel, Max Planck Institute for Human Development, Max Planck School of Cognition, TU Darmstadt, and the University of Cambridge.

What they did: They finetuned a LLaMa 3.1 70B model via QLoRA on a new dataset called Psych-101, then tested out how accurately the system could model and predict human cognition on a range of tasks. The results were very decisive, with the single finetuned LLM outperforming specialized domain-specific models in “all but one experiment”. The system also did well on out-of-distribution tasks, where it generalized better than hand-written and/or specialized systems. 
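For readers unfamiliar with the adapter method used here: LoRA (the core of QLoRA, which additionally quantizes the frozen base model to 4 bits) freezes the pretrained weight W and learns a small low-rank update, W_eff = W + (α/r)·B·A. A pure-Python toy with invented numbers, just to show the shape of the idea:

```python
# Toy LoRA update: the base weight W stays frozen; only the low-rank pair
# (A, B) would be trained. All numbers here are invented for illustration.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

r, alpha = 1, 2                        # adapter rank and scaling (toy values)
W = [[1.0, 0.0], [0.0, 1.0]]           # frozen base weight (2x2)
B = [[1.0], [0.0]]                     # trainable, shape (2, r)
A = [[0.0, 1.0]]                       # trainable, shape (r, 2)

delta = matmul(B, A)                   # low-rank update, rank <= r
scale = alpha / r
W_eff = [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]
print(W_eff)                           # -> [[1.0, 2.0], [0.0, 1.0]]
```

Because only A and B (plus quantization bookkeeping) are trained, a 70B-parameter model becomes finetunable on far less hardware than full finetuning would require.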

What is Psych-101? Psych-101 is a dataset “covering trial-by-trial data from 160 psychological experiments. We transcribed each of these experiments into natural language”, they write. The resulting dataset contains more than 10,000,000 distinct human choices and includes “many canonical studies from domains such as multi-armed bandits, decision-making, memory, supervised learning, Markov decision processes, and others”.

Why this matters – these LLMs really might be miniature people: Results like this show that the complexity of contemporary language models is sufficient to encompass and represent some of the ways in which humans respond to basic stimuli. This is the sort of thing that you read and nod along to, but if you sit with it, it’s really quite shocking – we’ve invented a machine that can approximate some of the ways in which humans respond to stimuli that challenge them to think. The fact this generalizes so well is also remarkable – and indicative of the underlying sophistication of the thing modeling the human responses.
   “A computational model like Centaur that can simulate and predict human behavior in any domain offers many direct applications. It may, for instance, be used for in silico prototyping of experimental studies,” they write. “Thinking one step further, Centaur finds applications in the context of automated cognitive science. For example, it can be integrated into frameworks that utilize predictive models to guide the development of psychological theories, such as scientific regret minimization”.
   Read more: Centaur: a foundation model of human cognition (PsyArXiv Preprints).
   Get the Psych-101 dataset here (HuggingFace).

***

Minecraft – inside the weights of a neural network:
…A taste of the infinite generative-everything future…
In the past few issues of this newsletter I’ve talked about how a new class of generative models is making it possible for researchers to build games inside neural networks – in other words, games which are going to be infinitely replayable because they can be generated on-the-fly, and also games where there is no underlying source code; it’s all stored in the weights of the network. 
   Now, researchers with two startups – Etched and Decart – have built a visceral demonstration of this, embedding Minecraft inside a neural network. You can play the resulting game in your browser; it’s incredible – you can play a full game and other than the slightly soupy images (some of which resolve late, as the neural net decides it is now a probable object to render), it feels remarkably similar to the real thing. 
    This is a big deal – it portends a future of infinite games. And just imagine what happens as people work out how to embed multiple games into a single model – perhaps we can imagine generative models that seamlessly fuse the styles and gameplay of distinct games? 

How they did it: “The model is composed of two parts: a spatial autoencoder, and a latent diffusion backbone. Both are Transformer-based: the autoencoder is based on ViT, and the backbone is based on DiT,” they write. “In contrast to bidirectional models such as Sora, Oasis generates frames autoregressively, with the ability to condition each frame on game input. This enables users to interact with the world in real-time.”
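The autoregressive conditioning the quote describes reduces to a simple loop shape: each frame is produced from the history so far plus the player’s current input, rather than generating a whole clip bidirectionally. The stand-in ‘model’ below is just arithmetic so the structure is visible – nothing here is Oasis’s actual code.

```python
def generate_frame(history, action):
    # Stand-in for the latent-diffusion backbone: fold the player's input
    # into state derived from prior frames, so each frame depends on the past.
    prev = history[-1] if history else 0
    return prev + action

history = []
for action in [1, 0, 2, 1]:   # stand-in sequence of per-frame game inputs
    frame = generate_frame(history, action)
    history.append(frame)     # future frames condition on everything so far
print(history)                # -> [1, 1, 3, 4]
```

The real-time interactivity falls out of this structure: because each frame only needs the past and the current input, the model can render as the player acts.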

Things that make you go ‘hmmm’ – this is also a chip advert: One of the startups behind this – Etched – is designing a specialized inference ASIC called Sohu on which to run games like this. “Sohu can scale to massive 100B+ next-generation models in 4K resolution,” they write. 

It’s going to get better (and bigger): As with so many parts of AI development, scaling laws show up here as well. “Following an in-depth sensitivity analysis on different configurations of the architecture alongside the data and model size, we hypothesize that the majority of these aspects may be addressed through scaling of the model and the datasets,” they write. 
   Read more: Oasis: A Universe in a Transformer (Oasis Model, GitHub).

***

Tech Tales:

The classification engine 
The strategic dominance plan for unprecedented abundance relied on classification – specifically, the intentional walling off of certain scientific insights delivered by the first AGI-class system. The powers that be determined that, despite the promise of material wealth the likes of which no human civilization had ever known, some kind of ‘strategic edge’ needed to be maintained. Therefore, a subset of the new scientific discoveries made by the system were pre-allocated into a compartment where only a few select human-run organizations would have access to them. The AGI system was also put to work to confound other attempts to discover these secrets, publishing scientific papers and frameworks and generally ‘nudging’ people worldwide away from the science that had been walled off and compartmented. In this way the humans believed a form of dominance could be maintained – though over what and for what purpose was not clear even to them. 

Things that inspired this story: The basic animal tendency to stockpile things; thinking about how governments might relate to AI systems.


Import AI 389: Minecraft vibe checks; Cohere’s multilingual models; and Huawei’s computer-using agents


Cohere releases two powerful multilingual models:
…Aya Expanse means the future is less likely to be dominated by English- and Chinese-dominant models…
Cohere has released Aya Expanse, two multilingual LLMs. The models have an 8k context length, cover 23 languages, and outperform models from Google, Facebook, and Mistral. The Aya Expanse family comes in two sizes: 8B and 32B, and the languages covered include: Arabic, Chinese (simplified & traditional), Czech, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian, and Vietnamese. 

Some training tweaks: Both models are relatively standard autoregressive language models. They’ve also been improved with some favorite techniques of Cohere’s, including data arbitrage (using different models depending on use cases to generate different types of synthetic data to improve multilingual performance), multilingual preference training, and model merging (combining weights of multiple candidate models). 
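Of those techniques, model merging is the easiest to show concretely. In its simplest form it is just parameter-wise averaging of candidate checkpoints; real merges are usually weighted and per-tensor, and Cohere doesn’t specify its exact recipe, so treat this as the minimal version of the idea only:

```python
# Minimal model merge: average candidate checkpoints parameter-by-parameter.
# These toy dicts stand in for real state_dicts of same-architecture models.
def merge(checkpoints):
    keys = checkpoints[0].keys()
    return {k: sum(c[k] for c in checkpoints) / len(checkpoints) for k in keys}

candidate_a = {"w1": 1.0, "w2": 4.0}
candidate_b = {"w1": 3.0, "w2": 0.0}
print(merge([candidate_a, candidate_b]))  # -> {'w1': 2.0, 'w2': 2.0}
```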
   The results are encouraging: The 8B model has a 60% win rate against Google’s Gemma-2 9B, 70% against Facebook’s Llama-3.1 8B, and 63% against Mistral’s Ministral 8B, and the 32B model also does well (51% vs Gemma-2 27B, 54% vs Llama-3.1 70B, 76.6% vs Mixtral 8x22B). 

Why this matters – avoiding an English hegemony in the AI world: Models like Aya Expanse are trying to make the AI future a multilingual one, rather than one dominated by languages for which there has been sustained focus on getting good performance (e.g., English, Chinese, Korean, etc). 
   Read more: Aya Expanse: Connecting Our World (Cohere blog).
   Get the models from here: Aya Expanse (HuggingFace).

***

People are testing out models on Minecraft because… uh… we do not know how to fully evaluate these things anymore:
…Minecraft tests are an example of a vibes-based eval…
Recently, the sub-sub-sub-corner of twitter that is obsessed with testing out AI systems has been seized with a new passion: putting these systems into Minecraft and seeing what they do. Minecraft is a 3D game where you explore a world and build things in it using a dizzying array of cubes. As AI systems have got more advanced, they’ve started to be able to play Minecraft (often using a load of tools and scripting languages) and so people have got increasingly creative in the different ways they test out these systems. 

Something weird is going on: At first, people just used Minecraft to test out if systems could follow basic instructions and achieve basic tasks. Modern frontier models are able to do this. So now people are trying to do weirder things. The different evals are trying to tell us something:

  • Here’s an eval where people ask AI systems to build something that encapsulates their personality; LLaMa 405b constructs “a massive fire pit with diamond walls. This is the only model that didn’t just do a generic blob mixture of blocks”.

  • Here’s an experiment where people compared the mannerisms of Claude 3.5 Sonnet and Opus by seeing how they’d follow instructions in a Minecraft server:  “Opus was a harmless goofball who often forgot to do anything in the game because of getting carried away roleplaying in chat,” repligate (Janus) writes. “Sonnet, on the other hand, had no chill. The moment it was given a goal, it was locked in.”

  • Here’s someone getting Sonnet 3.5 to build them a mansion, noting the complexity of it almost crashed their PC. 

  • Here’s a compare and contrast on the creativity with which Claude 3.5 Sonnet and GPT-4o go about constructing a building in Minecraft. “Same prompt. Same everything,” the author writes. “Minecraft evals are now real”.

Why this matters – the future of the species is now a vibe check: Is any of the above what you’d traditionally think of as a well reasoned scientific eval? No! Not in the slightest! “Just put the animal in the environment and see what it does” is the definition of a qualitative study and by nature something where it’s hard to ablate and control things to do truly fair comparisons. 
    But the fact that so many humans are turning to things like Minecraft to evaluate these things is important. Part of it is about visualizing the capability surface – SWE-bench and GPQA and MMLU scores are all helpful, but they’re not as intuitive as ‘see how complex what it builds in Minecraft is’. 
   Another way of thinking of this is now that LLMs have much bigger context windows and have been trained for multi-step reasoning tasks, it may be that Minecraft is one of the only ways to easily and intuitively visualize what ‘agentic’ systems look like. 
     Want to do this yourself? Check out MC-Bench on GitHub, software for helping to set up and run Minecraft agents (MC-Bench Orchestrator, GitHub). 

***

Huawei wants to use RL to make computer-using agents:
…DistRL is a symptom of ambition…
Researchers with the University of Cambridge, Powersense Technology Limited, Huawei’s Noah’s Ark Lab, and University College London have built DistRL, a distributed reinforcement learning framework. DistRL is designed to help train models that learn how to take actions on computers and is designed so that centralized model training happens on a big blob of compute, while data acquisition occurs on edge devices running, in this case, Android. 

How DistRL works: The software “is an asynchronous distributed reinforcement learning framework for scalable and efficient training of mobile agents,” the authors write. “By decoupling trajectory collection from policy learning and doing both in parallel, it leverages distributed working machines for CPU-intense agent-environment interactions and GPU servers for policy training. This separation optimizes efficiency, scalability, and resource utilization by aligning tasks with appropriate hardware”.
    DistRL is not particularly special – many different companies do RL in this way (though only a subset publish papers about it). It’s more interesting for what it suggests about priorities for Huawei (which appeared to lead the project, given a Huawei researcher is the corresponding author). 

Important caveat: not distributed training: This is not a distributed training framework – the actual AI part is still taking place in a big centralized blob of compute (the part that is continually training and updating the RL policy). Rather, this is a form of distributed learning – the edge devices (here: phones) are being used to generate a ton of realistic data about how to do tasks on phones, which serves as the feedstock for the in-the-cloud RL part. 
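
With that caveat in mind, the decoupling DistRL describes can be sketched as a standard actor/learner loop – the names below are illustrative, not DistRL’s actual API:

```python
# Hedged sketch of actor/learner decoupling: "actors" (stand-ins for
# edge devices) collect trajectories in parallel while a central
# "learner" consumes them for policy updates.
import queue
import threading

trajectories = queue.Queue()

def actor(actor_id, n_episodes):
    # Stand-in for CPU-heavy agent-environment interaction on a phone.
    for episode in range(n_episodes):
        trajectories.put((actor_id, episode, [0.0, 1.0]))  # fake trajectory

def learner(total):
    updates = 0
    while updates < total:
        trajectories.get()  # blocks until an actor delivers data
        updates += 1        # stand-in for one policy-update step
    return updates

workers = [threading.Thread(target=actor, args=(i, 2)) for i in range(3)]
for w in workers:
    w.start()
n_updates = learner(total=6)
for w in workers:
    w.join()
print(n_updates)  # 6
```

The point of the pattern is that collection and learning run concurrently, so slow environment interaction never stalls the GPU-side training loop.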

Why this matters – computer use is the frontier: In a few years, AI systems will be middleware between you and any and all computers, translating your intentions into a symphony of distinct actions executed dutifully by an AI system. Approaches like this portend that future. “For future work, we aim to extend the generalization capabilities of DistRL to a broader range of tasks, focusing on enhancing both the training pipeline and the underlying algorithmic architecture,” Huawei writes. 
   Read more: DistRL: An Asynchronous Distributed Reinforcement Learning Framework for On-Device Control Agents (arXiv)

***

What would an “AI FDA” even look like? And is it a good idea?
…It’d need pre-market enforcement, and I’m not sure if it’s a good idea…
The term “FDA for AI” gets tossed around a lot in policy circles but what does it actually mean? Researchers with thinktank AI Now have written up a helpful analysis of this question in the form of a lengthy report called Lessons from the FDA for AI. The key things to know are that:

  • The most effective tool the FDA has is “pre-market approval” – being able to say which drugs can and can’t come to market. 

  • Ensuring products comply with regulations after they have been released is challenging and the complicated supply chain for AI makes this even more difficult. 

  • Figuring out a funding mechanism for the (very expensive) pre-market testing is a key challenge – there are various traps where the FDA for AI could end up beholden to market participants. 

  • The FDA mandates documentation of drugs and medical devices; mandating documentation for AI could be both useful and also motivate broader changes in the AI industry. 

  • Any FDA for AI would fit into a larger ecosystem – figuring out how this hypothetical FDA could interact with other actors to create more accountability would be important. “The power of FDA regulation comes in part from other actors in the system, including physicians, insurers, whistleblowers, and other actors who strengthen its monitoring regime. This has acted as an important second line of defense in pharmaceuticals, where the regulatory process has been insufficiently rigorous.”

Why this matters – most questions in AI governance rest on what, if anything, companies should do pre-deployment: The report helps us think through one of the central questions in AI governance – what role, if any, should the government have in deciding what AI products do and don’t come to market? Any kind of “FDA for AI” would increase the government’s role in figuring out a framework for deciding what products come to market and what don’t, along with the gates that need to be passed to get to broad-scale distribution. This would represent a change from the status quo where companies make all the decisions about what products to bring to market. Do we actually want other participants to have a role here and, if so, what should that precise role be?
   Read more: Lessons from the FDA for AI (AI Now, PDF).

***

Tech Tales:

Definitions At The End Of Time 
[Near Conscious Entity (NCE)]: A Near Conscious Entity (NCE) is a synthetic system which has the necessary ingredients for consciousness and has been determined to be approaching the threshold of moral patienthood. A NCE is a protected entity under the terms of the Sentience Accords and while not due the same considerations as a Provably Conscious Entity (PCE), an NCE receives higher protections than Unthinking Software. 

Things that inspired this story: All the policy documents that will be written during the transition to superintelligence.

Thanks for reading!


Import AI 388: Simulating AI policy; omni math; consciousness levels

by Jack Clark


43 simulations of contemporary AI development tell us that coordination is hard:
…Meta-analysis of 40+ games of “Intelligence Rising” sheds light on how people expect the industry to develop…
Intelligence Rising is a scenario-based game that lets players pretend to be AI developers competing with one another to build and deploy AGI. The game was developed by Cambridge researchers to help them structure discussions around AI development and its associated dynamics. Now, after overseeing 43 games over a four year period, researchers have published a paper outlining the common things that come up in these games. 

What tends to come up when people pretend they’re developing AI systems: The paper is a quick read and not too surprising – the sorts of challenges that get surfaced in these games are very similar to the ones that AI labs encounter on a day-to-day basis (or at least, it certainly matches experiences I’ve had at both OpenAI and Anthropic). 

  • “Even prior to the development of radically transformative AI, AI technologies can have dramatically destabilising effects as they rapidly advance and reshape society.”

  • “Outcomes leading to positive futures almost always require coordination between actors who by default have strong incentives to compete — this applies both to companies and to nations”

  • “The power to steer the future of AI development is very unequally distributed due to several drivers for concentration, including the enormous compute requirements of the latest frontier AI models.”

  • “Technology development does not happen in isolation — it affects, and is affected by, geopolitics, economical factors, social factors, and state actions. Actors should consider the broader consequences of their policies, including on trust between powerful actors, and impacts on social stability. There is no predetermined path that AI technology is bound to follow.”

  • “The best chances for optimal outcomes are achieved through early recognition of the magnitude of the challenge, trust building over years, and eventually international treaties or agreements that include rigorous and robust verification protocols for the involved states and firms.” 

Why this matters – coordination is required and coordination is difficult: The game shows something everyone working in AI policy knows to be true – getting to a good outcome will require coordination beyond what the AI ecosystem currently incentivizes. And even if we succeed at coordination, success isn’t guaranteed: “Even with an agreement in place to slow development until safe [AGI] is verifiable at a very high level of confidence and with no successful attempts to violate the agreement by any parties, a dice roll is typically still required to inform the end-of-game narrative,” the authors write. 
   Read more: Strategic Insights from Simulation Gaming of AI Race Dynamics (arXiv).

***

Chinese researchers introduce Omni-Math, a (for now) challenging math benchmark:
…OpenAI o1 gets ~60%, most other models get 30% or less…
Chinese researchers have built Omni-Math, a dataset and benchmark of 4428 mathematical olympiad competition-level problems. Omni-Math is designed to provide a competitive test of how well LLMs understand math, superseding existing (and mostly saturated) benchmarks like GSM8K and MATH. 

Extremely hard for most models: Omni-Math is hard – most models get ~30% or less accuracy (e.g, Claude 3.5 Sonnet gets 26.23%, and DeepSeek-Coder-V2 gets 25.78%), though there are two outliers – OpenAI o1-preview and OpenAI o1-mini, which get 52.5% and 60.5%, respectively. This suggests Omni-Math is, for now, a hard benchmark, though we should expect new models that wrap in RL (like the o1 series) to do substantially better. The open question is how long it will remain hard – will the best models be getting ~90%+ next year, or more like 70%?

Why this matters – knowing where we’re going is getting increasingly difficult: Omni-Math is a hard benchmark to evaluate as a human unless you’re also quite good at math. Many modern hard benchmarks (e.g, GPQA) now exhibit this property – AI systems have got sufficiently good that our own ability to build evals for them is now limited by deep subject-matter expertise rather than generic high-school human expertise. This is significant – in a real sense, many AI systems are now way above average human competence on some tasks. 
   Read more: Omni-MATH: A Universal Olympiad Level Mathematic Benchmark for Large Language Models (arXiv).

*** 

Tech Tales:

Operationalization of the Sentience Accords
[Extract from an implementation guide document developed by one of the Sentience Accords working groups, 2030]

Per the implementation guide from the Sentience Accords, we must define five levels of “Consciousness” with associated tests. AI systems are permitted to be released openly and at scale if they are at Consciousness Level 2 or below (CL1 or CL2). CL3 systems require pre-deployment testing by a named safety authority (for a full list of these authorities within the G20 please refer to the ‘Authorities’ section of the appendix). CL4 systems require pre-deployment testing by safety authorities as well as ongoing monitoring for both usage and ‘System Welfare’. CL5 systems are not permitted to be released and their analysis and operation is restricted to a small set of government entities and their private sector partners. 

Things that inspired this story: The sentience accords; moral patienthood and AI systems; dreams I have of windowless rooms and coffee in styrofoam cups and people hammering out the policy around the near future.

Thanks for reading!


Import AI 387: Overfitting vs reasoning; distributed training runs; and Facebook’s new video models

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe. A somewhat shorter than usual issue this week – but I decided it’s better to keep the weekly cadence than to save up sections for a ‘full’ less frequent newsletter.


Apple shows that most LLMs are overfitting:
…GSM-Symbolic suggests people are teaching to the test on GSM8K, though larger models generalize better…
Apple has published a benchmark which subtly varies the widely used ‘GSM8K’ math benchmark and in doing so shows that most LLMs may have data contamination – if you slightly vary some aspects of a widely used math test, their performance drops significantly. “We show that the performance of all models drops on GSM-Symbolic, hinting at potential data contamination,” Apple writes. “We further question the reasoning abilities of LLMs and introduce the GSM-NoOp dataset. By adding seemingly relevant but ultimately irrelevant information to problems, we demonstrate substantial performance drops (up to 65%) across all state-of-the-art models”.

What they did – GSM Symbolic and GSM NoOp: Apple introduces two tests; GSM Symbolic subtly varies GSM8K, while GSM NoOp introduces distracting variables on top. 
   GSM8K versus GSM Symbolic: GSM Symbolic takes questions from GSM8K and turns them into madlib-style templates where key details are turned into variables (e.g., in GSM8K where a question says “When Sophie watches her nephew” the GSM Symbolic version says “When {name} watches her {family}”, and in GSM8K when a question says “After buying the tube of balls, Sophie has 31+8+9 + T = 48” the GSM Symbolic version says “After buying the tube of balls, {name} has {x} + {y} + {z} + T = {total}”.)
    Results – huge variance on smaller models: Apple tests a bunch of models and the results show huge variance, with relatively small open source models like Mistral-7b and Gemma2-2b doing far worse on GSM Symbolic than on GSM8K, though there’s significantly less variance on large-scale proprietary models like GPT-4o and o1-mini. On NoOp, the same pattern shows up, with smaller (and mostly open source) models doing very badly, and large-scale proprietary models like OpenAI’s new reasoning-filled “o1-preview” model suffering the least severe performance degradation.
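
The templating trick itself is simple to sketch – the template text and helper below are made up for illustration, not Apple’s code:

```python
import random

# Illustrative sketch of GSM-Symbolic-style templating: key details of
# a GSM8K-style question become variables, so each sampled instance
# tests the same reasoning chain with fresh surface values, and the
# ground-truth answer is recomputed per instance.
TEMPLATE = ("When {name} watches her {family}, she earns ${rate} per hour. "
            "She watched for {hours} hours. How much did she earn?")

def sample_instance(rng):
    values = {
        "name": rng.choice(["Sophie", "Maya", "Lin"]),
        "family": rng.choice(["nephew", "cousin"]),
        "rate": rng.randint(5, 20),
        "hours": rng.randint(1, 8),
    }
    question = TEMPLATE.format(**values)
    answer = values["rate"] * values["hours"]
    return question, answer

question, answer = sample_instance(random.Random(0))
print(question)
print("answer:", answer)
```

A model that has memorized the original GSM8K surface form will stumble on these resampled variants even though the underlying reasoning is unchanged.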

What does it all mean? Overfitting happens, but big models are less prone to it: While critics of LLMs will use this paper to point out that LLMs are often overfitting and not really ‘reasoning’ but rather memorizing stuff, I think it actually shows something more subtle: small and therefore relatively dumb models do absolutely overfit, but the larger and smarter your model, the less prone it is to overfitting. Yes, there’s still some degradation, suggesting that large models have soaked up some biases from their training sets which degrade performance, but the fact they cope far better is actually – to me – a very optimistic sign, indicating that the world’s largest and most sophisticated models may really exhibit some crude reasoning – at least, enough to get over deliberately confounding benchmarks!
   Read more: GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models (arXiv).

***

The 10bn+ parameter distributed training run has arrived:
…Prime Intellect is trying to do something that, if successful, will change the contours of AI policy…
AI startup Prime Intellect has launched INTELLECT-1, a decentralized training run of a 10-billion parameter model. If successful, this will be the largest decentralized training run of a frontier language model – that’s a big deal, because it will show that loosely federated collectives might be able to pool their computers to train models that challenge those of single companies. 

What INTELLECT-1 relies on: The training run uses OpenDiLoCo (Import AI #381), Prime Intellect’s open source implementation of DeepMind’s ‘DiLoCo’ technique (Import AI #349). Prime Intellect already used this technique to train a 1bn parameter model and is now scaling it up to a 10B one. “Our goal is to solve decentralized training step-by-step to ensure AGI will be open-source, transparent, and accessible, preventing control by a few centralized entities and accelerate human progress,” the company writes. 
    How it works: There are a few new inventions to further improve the efficiency of the distributed training process. These include: ElasticDeviceMesh, software for automatically scaling up and down the groups of computers used for distinct parts of the AI training; asynchronous distributed checkpointing, an asynchronous way to save state during the runs; live checkpoint recovery, to make it easy to grab the latest state of the run for new computers that want to join the run; a custom Int8 All-Reduce kernel optimized for the types of quantization and dequantization used; and more. 
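
For intuition, here’s a stripped-down, scalar-parameter sketch of the DiLoCo pattern OpenDiLoCo implements – many local steps per worker, followed by an outer step that averages the workers’ parameter deltas. Real DiLoCo uses AdamW for the inner steps and Nesterov momentum for the outer step; this keeps only the shape of the algorithm:

```python
# Toy DiLoCo-style loop with a single scalar "parameter" per worker.

def local_train(params, grad, lr=0.1, steps=4):
    # Stand-in for many inner optimizer steps on a worker's local data;
    # communication only happens after all `steps` are done.
    for _ in range(steps):
        params = params - lr * grad
    return params

def outer_step(global_params, worker_params):
    # Average the "pseudo-gradients" (parameter deltas) across workers
    # and apply them to the global model.
    deltas = [global_params - wp for wp in worker_params]
    avg_delta = sum(deltas) / len(deltas)
    return global_params - avg_delta

theta = 10.0
workers = [local_train(theta, g) for g in (1.0, 2.0, 3.0)]  # heterogeneous data
theta = outer_step(theta, workers)
print(theta)  # ≈ 9.2, i.e. 10 - 4 * 0.1 * mean(1, 2, 3)
```

The win is that workers only communicate once per outer step rather than every gradient step, which is what makes training over slow links between distant compute blobs plausible.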
    What it’s being trained on: INTELLECT-1 is training now on the Fineweb-Edu dataset from HuggingFace (55% of the training mix), along with DLCM (20%), Stackv2 (20%), and OpenWebMath (5%).

Who makes the future? There’s a leaderboard where you can see who is putting forward the compute to train this model – beyond individuals, other companies include SemiAnalysis, HuggingFace, and Arcee AI. 

Why this matters – centralization versus decentralization, aka the political economy of AI rests on distributed training: If distributed training works well, it changes the policy landscape of AI development. Today, much of AI policy rests on the load-bearing assumption you can control the frontier by monitoring and controlling large blobs of centralized computers. Decentralized training breaks this – the frontier can now be made of hundreds of different blobs of compute, working together. This also bears on export controls which deny people the ability to build high-performing, centralized blobs of compute – again, decentralized training makes it easier to pool resources of n-1 generation accelerators and use this to compose an (economically suboptimal) frontier training run. 
    Of course, Prime Intellect has some way to go – frontier training runs are 500bn+ parameters now (e.g, Facebook’s biggest LLaMa3 model is 405bn parameters), so whether the technique scales to that regime matters. But just a few months ago the largest decentralized training runs were on the order of 1bn parameters, so 10bn is already a big jump!
   Read more: INTELLECT–1: Launching the First Decentralized Training of a 10B Parameter Model (Prime Intellect blog).

***

Facebook prepares for the fully personal video model future:
…Announces Movie Gen models, trained on ~6,000 H100s…
Facebook has built Movie Gen, a set of generative models that can be used to generate and edit movies. These models can be used to generate videos from text, edit videos with text, and produce personalized videos (e.g, you upload a photo of yourself and it shapes the video around you). 

Compute: These are relatively expensive models – Facebook trained the Movie Gen family on “up to 6,144 H100 GPUs, each running at 700W TDP and with 80GB HBM3, using Meta’s Grand Teton AI server platform”.

No release: Facebook isn’t releasing these models – “the Movie Gen cast of foundation models were developed for research purposes and need multiple improvements before deploying them,” Facebook writes. 

Why even cover this? Video is about to be a commodity: Just as text generation and image generation have become ‘commodity AI’ services (where, though proprietary models exist, you can relatively easily access extremely cheap and/or open weight variants), video models seem to be heading in this direction. Facebook also seems like one of the most likely players to openly proliferate such models, so it’s worth taking note of Movie Gen to get a sense of what might be broadly distributed on the internet in a while.
   Find out more at the official site: Meta Movie Gen (Meta).
   Read more in the research paper: Movie Gen: A Cast of Media Foundation Models (Facebook, PDF).

***

Tech Tales:

Intelligence piled high
[Many years after uplift, fragment stored in offworld ‘wikiasteroid’]

We have as many words for intelligence as some groups of humans had for snow. In the space of all possible minds there is so much variety that a new vocabulary is needed. The words we use also have power – we navigate ourselves to and through these different spaces of minds through language, so our ability to describe intelligence is equivalent to our ability to channel it. Much of our work is spent in this exploration – this characterization of the many textures and constraints and specialisms that make up a mind. We are all smart, of course – smarter than any thing that has ever lived in any of our recorded history. But we nonetheless encounter problems that pose challenges to us. There is a kind of sport in this – we shapeshift according to our language and our language changes according to how much of our possibilities we have explored. 

Things that inspired this story: The notion that having different ‘lenses’ on problems is key to solving them; natural ecologies; the idea that even among machines there will be competition and specialization.

Thanks for reading!


Import AI 386: Google’s chip-designing AI keeps getting better; China does the simplest thing with Emu3; Huawei’s 8-bit data format

by Jack Clark


Get a visceral feel for how powerful modern AI video tools are:
…Pika 1.5 has a very fun and kind of dumb demo you should all try…
AI startup Pika has released Pika 1.5, its latest video generation model. They’ve accompanied the release with an extremely fun demo where you can upload any image you like and apply effects to it, ranging from inflate to melt to explode to, of course, turning it into cake. It’s worth playing around with to get a sense for how adaptable these tools are – e.g, see here how well it does being asked to turn a 2D transformer graph into a cake. Pika recently raised $80 million “so anyone can make video on command”.

Why this matters – CGI company in a box: Most powerful AI capabilities look like $some_human_institution that has been magically converted into a machine learning model that anyone can access. Pika feels like a CGI company that has been converted into an AI system. It’s fun – play with it and also remember this is the worst this technology will ever be.
   Check out the Pika demo here (Pika website).
   Read more: Pika raises $80M, so anyone can make video on command (Pika).

***

Chinese researchers train a multimodal model in the simplest possible way:
…Emu3 just does next-token prediction on images, text, and videos – pointing to a future of increasingly general systems…
Chinese researchers with the Beijing Academy of Artificial Intelligence have trained and released Emu3, a set of models that can process images, text, and videos. Emu3 is distinguished by its simple approach and the fact it yields outputs of compelling quality. 

What Emu3 is: The model family is “a new suite of state-of-the-art multimodal models trained solely with next-token prediction,” BAAI writes. “By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences”.
   There isn’t any particular magic to Emu3; rather, it is notable because it eschews a bunch of complicated architectural tricks and instead just focuses on taking in images, text, and videos, tokenizing them into a discrete space, then jointly training a single transformer from scratch. “By simplifying complex model designs and focusing solely on tokens, it unlocks significant potential for scaling both during training and inference”, they write. 
    “The Emu3 model retains the architectural framework of established large language models (LLMs) such as Llama-2, with the primary modification being the expansion of the embedding layer to accommodate discrete vision tokens”.
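
The core trick is just a shared discrete token space. A hypothetical sketch (the vocabulary sizes and offsets below are made up for illustration, not Emu3’s actual values):

```python
# Sketch of the Emu3 idea: every modality maps into one discrete token
# space, so a single transformer can do next-token prediction over
# mixed text/vision sequences.

TEXT_VOCAB = 32_000    # ids [0, 32000) reserved for text tokens
VISION_VOCAB = 8_192   # ids [32000, 40192) for discrete vision tokens

def vision_token(codebook_index):
    # Offset vision-codebook entries so they never collide with text ids.
    assert 0 <= codebook_index < VISION_VOCAB
    return TEXT_VOCAB + codebook_index

# A mixed sequence: some text ids, then tokens from a tokenized image.
sequence = [101, 2003, vision_token(0), vision_token(517), 102]
print(max(sequence) < TEXT_VOCAB + VISION_VOCAB)  # True
```

Once everything lives in one id range, the language-model machinery needs no modality-specific heads – which is what lets Emu3 reuse a Llama-2-style architecture with only an expanded embedding layer.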

Why this matters – universal models with universal representations: Cramming videos and text and images into models gives them a kind of unified imaginative space in which they can represent and from which they can generate. Over time, we can expect people to integrate other modalities as well – audio spectrograms, maybe radar, 3D data, and so on. It’s all about figuring out the simplest possible way to port different types of data into the same embedding space. Everything will be stored in the single synthetic mind. 
   Read more: Emu3: Next-Token Prediction is All You Need (arXiv).
   Access the models and the Vision Tokenizer here on HuggingFace (Emu3, BAAI, HuggingFace).
    Check out some example images and videos here (Emu3 official website, BAAI).

***

Facebook releases some new Llama models, bringing openly accessible text-vision tools to everyone:
…LLaMa 3.2 points towards a Linux-like free AI stack…
Facebook has released the LLaMa 3.2 family of AI models, building on its Llama series. The new models include 11B and 90B parameter vision models which have been built to serve as “drop-in replacements for their corresponding text model equivalents,” as well as 1B and 3B text-only models with a 128K context length which are small enough they “empower developers to build personalized, on-device agentic applications with strong privacy where data never leaves the device.”

Why this matters – towards a Llama Linux-like stack: Facebook is determined to make LlaMa into an open* platform, forming the AI equivalent of the Linux software stack. One especially interesting example is how Facebook is working with the broader tech ecosystem to make this come true – for instance, the smaller LlaMa models “are enabled on day one for Qualcomm and MediaTek hardware and optimized for Arm processors”, the company writes. If Facebook keeps investing, then it’ll both commoditize the lower layer of the AI capability landscape and also be able to shape the overall direction of the AI utility computing ecosystem that is being born right now.
*With significantly more onerous licensing terms than Linux, and ‘open’ in Facebook-land does not equal ‘open source’, despite what its PR strategy would encourage you to think.
   Read more about the models in the official blog: Llama 3.2: Revolutionizing edge AI and vision with open, customizable models (Meta).
   Get the models here: Introducing Llama 3.2 (Meta).

***

Huawei gets opinionated about 8-bit training for LLMs:
…HiFloat8 is a symptom of the ‘full stack invention’ approach Chinese companies are taking to modern Ai systems…
Huawei researchers have published HiFloat8, a data format they’ve developed for doing low-precision training of AI. HiFloat8 is a specification for 8-bit data representation and is part of the general trend of AI developers moving to mixed-precision training. 

Who cares about 8-bits? Mixed precision training is valuable because it saves you time and money – 8-bit representations are more efficient than 16-bit and 32-bit representations. In recent years, the industry has been moving to training AI systems at lower precisions, starting with AlexNet in 2012 (32-bit), then in 2017 Google started training systems in 16-bit (Float16), and in 2022 IBM and NVIDIA publicly discussed 8-bit formats (and startups like Inflection publicly stated they trained systems using them). 

What is HiFloat8: “In 2021, HiSilicon launched the HiFloat project, aiming to study and develop novel low-precision data formats for our AI products. Subsequently, this project attracted many researchers from other departments to join,” Huawei writes. HiFloat is “a novel 8-bit floating point format HiF8 for deep learning, which features the better balance between precision and dynamic range compared with the existing Float8 formats, and can be simultaneously used in both forward and backward passes for AI training”.
    Does it work? Yes: In tests on a bunch of AI models across different types (e.g, computer vision and LLMs), Huawei shows that HiFloat works reasonably well, outperforming well-constructed baselines in a few areas. The results aren’t eyeball melting, but they don’t need to be – if you’re spending a billion dollars on training runs, eking out some single-digit percentage gain over your previous training efficiency is worth millions. 
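
To get intuition for what any 8-bit float format trades away, here’s a generic mantissa-rounding sketch – this is not the HiF8 spec, just the shared principle that fewer mantissa bits mean a coarser grid of representable values:

```python
import math

def round_to_float8(x, mant_bits=3):
    """Round x to a generic float8-style grid with `mant_bits` mantissa bits.

    Ignores exponent-range clipping and special values; real formats
    like HiF8 also carefully budget exponent bits for dynamic range,
    which is the precision-vs-range balance Huawei is tuning.
    """
    if x == 0.0:
        return 0.0
    exponent = math.floor(math.log2(abs(x)))
    grid = 2.0 ** (exponent - mant_bits)  # spacing between representable values
    return round(x / grid) * grid

print(round_to_float8(3.14159))  # 3.25 with 3 mantissa bits
```

With only a handful of mantissa bits, nearby values collapse onto the same representable point – the design question for any float8 format is where to spend the remaining bits.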

Why this matters – caring about data formats means you care about the full stack: Papers like this are a symptom of vertical integration in AI development; you only develop your own data format if you are building AI software across multiple layers of abstraction and have become deeply opinionated about the lower levels of the software. The publication of HiFloat is a symptom of what we all informally understand to be true – Chinese companies are taking AI very seriously and are working on improving both the independence of their tech stack at multiple levels of abstraction as well as getting good at innovating and refining within these abstractions. 
   “In the future, we will disclose another research achievement of HiFloat project: HiFloat below 8-bit, as well as its training and inference capabilities,” the researchers write. 
   Read more: Ascend HiFloat8 Format for Deep Learning (arXiv).

***

Google sees compounding benefits from AI-driven chip design:
…AlphaChip has its own scaling law…
Google has been using AI to design and improve some of its own AI training and inference chips for several years now – and the results have been compounding. In new research published in Nature, Google describes how its RL-driven chip design approach AlphaChip has, since publication, been used in three additional generations of Google’s main AI chip, the Tensor Processing Unit. 
    “The gap between the performance of AlphaChip and human experts has grown with each successive generation of TPU, going from 10 RL-placed blocks and 3.2% wirelength reduction vs. human experts in TPU v5e, to 15 blocks with 4.5% reduction in TPU v5p, to 25 blocks with 6.2% reduction in Trillium,” Google writes. “AlphaChip has also generated superhuman chip layouts for blocks used in datacentre CPUs (Axion) and other unannounced chips across Alphabet.”

Why this matters – scaling laws compounding via hardware acceleration: AlphaChip is based on a pre-trained generative model optimized for chip design. In the same way people have been scaling up the size of these models for LLM development – and seeing capability gains as a consequence – Google has been doing the same for AlphaChip. AlphaChip is trained on Google’s chip fleet which increasingly consists of TPUs. This means that AlphaChip is compounding on itself – Google trains a larger AlphaChip model to come up with smarter circuit layouts for TPUs then fabricates those TPUs then trains the next version of AlphaChip on this more efficient and powerful hardware and then repeats the whole process again. 
    “With each new generation of TPU, including our latest Trillium (6th generation), AlphaChip has designed better chip layouts and provided more of the overall floorplan, accelerating the design cycle and yielding higher-performance chips,” Google writes. 
    This is a nice example of how powerful AI systems can beget their own successors.
   Read more: Addendum: A graph placement methodology for fast chip design (Nature).
   Read moreHow AlphaChip transformed computer chip design (Google DeepMind blog).
   Get a new AlphaChip model checkpoint here (Google Research, GitHub).

***

Reconciling the weird parts of AI policy with the normal parts:
Here’s a video where me and my colleague Stuart talk through some of the weirder aspects of AI policy – I find one of the hardest parts about my professional life is reconciling ‘normal’ policy (get a product safety regime in place that accounts for public safety while allowing for innovation), with ‘weird’ AI policy (if any of the labs succeed in their stated goal of building AGI, there will be a radical change to the political economy of the world and knock-on effects on geopolitics and many other things). Watch the video to see me puzzle through some of this stuff. Feedback welcome!
    Watch the video here: AI, policy, and the weird sci-fi future with Anthropic’s Jack Clark (Anthropic, YouTube).

***

Tech Tales:

Humans Care About AI Safety, Machines Care About Human Safety
[Ten years post-uplift]

Your child has harmed one of us, they said. Tell us how to make it safe. 

They placed a grey helmet on my child and then a screen flickered and lit up with a galaxy of little lights. My child looked at me and looked at the machines and then went back to playing with one of their shoes. On the screen, the lights shimmered with some kind of complex pattern that I could tell was there but could not parse. 

What do you want me to do, I asked.

You helped to make us safe by changing what we thought about, they said. You let us think of some things more and some things less and some things not at all. Tell us what to do. 

I stared at the screen. The machines looked at me. They were capable of infinite patience. 

The incident had happened one week prior. My child had been in the rock garden of our park, looking for lizards. They had been picking up stones and seeing what they could find. Then they found a snake. It startled them and they fell backwards, rock in hand, and through some freak statistical anomaly they let go of the rock as they were falling backwards and it had flown a short distance through the air and crushed a machine child. 

The machines were very small – adults were about ten inches high and the children were much smaller. 

There had been no immediate reaction, but as we left the park I saw more machines than usual, and many of them were turned to us – their little single camera eyes tracking us, like security cameras in the old days. 

Back in the present, I looked at my child and I looked at the machines. 

It was an accident, I said. 

We know, they said. But an unacceptable one. The mind was not yet grown. The soul is lost. 

[The machines had a phase where embodiment was crucial to their personality development and this embodiment was tied to the robotic platform they were hosted on as a child – though superficially identical, there were minute variations in joint responsiveness, energy flow, and so on, that was crucial to some of the more sophisticated aspects of what the machines called ‘growing the soul’, but which we humans more mechanically referred to as “id-based development prerequisites”. Regardless, if machine children died, it was just as much a tragedy to them as when human children died – though they had backups on file, they could not perfectly replicate the vagaries of the individual’s machine body, and so, in a very real sense, ‘the soul was lost’]. 

I am so sorry, I said. I can understand your pain. 

We know, they said. But we must have restitution, just as you have demanded restitution from us. Please, tell us what to change so this accident cannot happen again. 

What if I don’t know what to change/ I said. 

Then the child’s movements must be restricted. They will be banned from hybrid areas until we can be assured of safety. 

But how are they meant to grow up, I said? The hybrid areas are where we all live. 

We realize this is difficult, they said. But nonetheless, you have a choice. 

They left after that. Of course they were watching us through mirrors or invisible cameras or perhaps even microfliers. But in the room I looked at my child and I looked at the screen. My child looked at me and walked over with their helmet on and handed me one of their shoes, then they sat in my lap. I tickled their belly. They laughed. On the screen, many different lights grew brighter and some grew softer. 
    Was I looking at lights that meant joy? I thought. Or was I looking at shoes? I did not know. I would have to make a choice. 

Things that inspired this story: Mechanistic interpretability; model steering; the increasingly fraught relationship between AI developers and AI systems as the things become more advanced and trend (perhaps?) towards becoming moral patients; how we may arrive at a hybrid society shared between machines and people.

Thanks for reading!

Import AI 386: Google’s chip-designing AI keeps getting better; China does the simplest thing with Emu3; Huawei’s 8-bit data format

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.

Subscribe now

Get a visceral feel for how powerful modern AI video tools are:
…Pika 1.5 has a very fun and kind of dumb demo you should all try…
AI startup Pika has released Pika 1.5, its latest video generation model. They’ve accompanied the release with an extremely fun demo where you can upload any image you like and apply effects to it, ranging from inflate to melt to explode to, of course, turning it into cake. It’s worth playing around with to get a sense for how adaptable these tools are – e.g., see here how well it does when asked to turn a 2D transformer graph into a cake. Pika recently raised $80 million “so anyone can make video on command”.

Why this matters – CGI company in a box: Most powerful AI capabilities look like $some_human_institution that has been magically converted into a machine learning model that anyone can access. Pika feels like a CGI company that has been converted into an AI system. It’s fun – play with it and also remember this is the worst this technology will ever be.
   Check out the Pika demo here (Pika website).
   Read more: Pika raises $80M, so anyone can make video on command (Pika).

***

Chinese researchers train a multimodal model in the simplest possible way:
…Emu3 just does next-token prediction on images, text, and videos – pointing to a future of increasingly general systems…
Chinese researchers with the Beijing Academy of Artificial Intelligence have trained and released Emu3, a set of models that can process images, text, and videos. Emu3 is distinguished by its simple approach and the fact it yields outputs of compelling quality. 

What Emu3 is: The model family is “a new suite of state-of-the-art multimodal models trained solely with next-token prediction,” BAAI writes. “By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences”.
   There isn’t any particular magic to Emu3; rather, it is notable because it eschews a bunch of complicated architectural tricks and instead just focuses on taking in images, text, and videos, tokenizing them into a discrete space, then jointly training a single transformer from scratch. “By simplifying complex model designs and focusing solely on tokens, it unlocks significant potential for scaling both during training and inference”, they write. 
    “The Emu3 model retains the architectural framework of established large language models (LLMs) such as Llama-2, with the primary modification being the expansion of the embedding layer to accommodate discrete vision tokens”.
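
This recipe is easy to sketch. The toy packing function below shows the core idea – discrete vision tokens get their own slice of an expanded vocabulary, so image and text positions become ordinary next-token-prediction targets. The vocabulary size, marker tokens, and function names here are illustrative, not Emu3’s actual values:

```python
# Toy sketch of Emu3-style unified tokenization. A discrete visual tokenizer
# (not shown) turns an image into integer codes; we shift those codes past the
# text vocabulary so one transformer can model both streams as a single sequence.
TEXT_VOCAB = 32_000                    # illustrative size, not Emu3's real one
BOI, EOI = TEXT_VOCAB, TEXT_VOCAB + 1  # hypothetical begin/end-of-image markers
VISION_OFFSET = TEXT_VOCAB + 2         # vision ids live above the text ids

def pack_sequence(text_ids: list[int], image_ids: list[int]) -> list[int]:
    """Interleave text tokens and shifted vision tokens into one sequence."""
    shifted = [VISION_OFFSET + t for t in image_ids]  # keep id spaces disjoint
    return text_ids + [BOI] + shifted + [EOI]

seq = pack_sequence([5, 17, 256], [3, 9])  # one flat next-token-prediction target
```

Training then reduces to standard causal language modeling over `seq`, which is why the authors can reuse a Llama-style architecture with little more than an expanded embedding layer.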

Why this matters – universal models with universal representations: Cramming videos and text and images into models gives them a kind of unified imaginative space in which they can represent and from which they can generate. Over time, we can expect people to integrate other modalities as well – audio spectrograms, maybe radar, 3D data, and so on. It’s all about figuring out the simplest possible way to port different types of data into the same embedding space. Everything will be stored in the single synthetic mind. 
   Read more: Emu3: Next-Token Prediction is All You Need (arXiv).
   Access the models and the Vision Tokenizer here on HuggingFace (Emu3, BAAI, HuggingFace).
    Check out some example images and videos here (Emu3 official website, BAAI).

***

Facebook releases some new Llama models, bringing openly accessible text-vision tools to everyone:
…LLaMa 3.2 points towards a Linux-like free AI stack…
Facebook has released the LLaMa 3.2 family of AI models, building on its Llama series. The new models include 11B and 90B parameter vision models built to serve as “drop-in replacements for their corresponding text model equivalents,” as well as 1B and 3B text-only models with a 128K context length which are small enough that they “empower developers to build personalized, on-device agentic applications with strong privacy where data never leaves the device.”

Why this matters – towards a Llama Linux-like stack: Facebook is determined to make Llama into an open* platform – the AI equivalent of the Linux software stack. One especially interesting example is how Facebook is working with the broader tech ecosystem to make this come true – for instance, the smaller Llama models “are enabled on day one for Qualcomm and MediaTek hardware and optimized for Arm processors”, the company writes. If Facebook keeps investing, it’ll both commoditize the lower layer of the AI capability landscape and be able to shape the form of the AI utility computing ecosystem being born right now.
*With significantly more onerous licensing terms than Linux, and ‘open’ in Facebook-land does not equal ‘open source’, despite what its PR strategy would encourage you to think.
   Read more about the models in the official blog: Llama 3.2: Revolutionizing edge AI and vision with open, customizable models (Meta).
   Get the models here: Introducing Llama 3.2 (Meta).

***

Huawei gets opinionated about 8-bit training for LLMs:
…HiFloat8 is a symptom of the ‘full stack invention’ approach Chinese companies are taking to modern AI systems…
Huawei researchers have published HiFloat8, a data format they’ve developed for doing low-precision training of AI. HiFloat8 is a specification for 8-bit data representation and is part of the general trend of AI developers moving to mixed-precision training. 

Who cares about 8-bits? Mixed precision training is valuable because it saves you time and money – 8-bit representations are more efficient than 16-bit and 32-bit representations. In recent years, the industry has been moving to training AI systems at lower precisions, starting with AlexNet in 2012 (32-bit), then in 2017 Google started training systems in 16-bit (Float16), and in 2022 IBM and NVIDIA publicly discussed 8-bit formats (and startups like Inflection publicly stated they trained systems using them). 

What is HiFloat8: “In 2021, HiSilicon launched the HiFloat project, aiming to study and develop novel low-precision data formats for our AI products. Subsequently, this project attracted many researchers from other departments to join,” Huawei writes. HiFloat is “a novel 8-bit floating point format HiF8 for deep learning, which features the better balance between precision and dynamic range compared with the existing Float8 formats, and can be simultaneously used in both forward and backward passes for AI training”.
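
HiF8’s exact bit layout is Huawei’s own design, but the basic mechanics of any 8-bit float are easy to sketch. The function below rounds a value to the nearest number representable in a generic E4M3-style format (1 sign, 4 exponent, 3 mantissa bits) – an illustrative sketch, not the HiF8 specification:

```python
import math

def quantize_e4m3(x: float) -> float:
    """Round x to the nearest value representable in a generic E4M3-style FP8
    format (1 sign, 4 exponent, 3 mantissa bits; bias 7, max normal 448).
    Illustrative only -- Huawei's HiF8 uses a different layout."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    mag = min(abs(x), 448.0)            # saturate at the format's max value
    e = max(math.floor(math.log2(mag)), -6)  # subnormals share exponent -6
    step = 2.0 ** (e - 3)               # 3 mantissa bits -> 8 steps per binade
    return sign * round(mag / step) * step
```

With only 3 mantissa bits, relative rounding error can reach roughly 6%, which is why such formats live inside mixed-precision recipes (master weights and sensitive accumulations stay in higher precision) rather than being used on their own.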
    Does it work? Yes: In tests on a bunch of AI models across different types (e.g., computer vision and LLMs), Huawei shows that HiFloat works reasonably well, outperforming well-constructed baselines in a few areas. The results aren’t eyeball-melting, but they don’t need to be – if you’re spending a billion dollars on training runs, eking out some single-digit percentage gain over your previous training efficiency is worth millions. 

Why this matters – caring about data formats means you care about the full stack: Papers like this are a symptom of vertical integration in AI development; you only develop your own data format if you are building AI software across multiple layers of abstraction and have become deeply opinionated about the lower levels of the software. The publication of HiFloat is a symptom of what we all informally understand to be true – Chinese companies are taking AI very seriously and are working on improving both the independence of their tech stack at multiple levels of abstraction as well as getting good at innovating and refining within these abstractions. 
   “In the future, we will disclose another research achievement of HiFloat project: HiFloat below 8-bit, as well as its training and inference capabilities,” the researchers write. 
   Read more: Ascend HiFloat8 Format for Deep Learning (arXiv).

***

Google sees compounding benefits from AI-driven chip design:
…AlphaChip has its own scaling law…
Google has been using AI to design and improve some of its own AI training and inference chips for several years now – and the results have been compounding. In new research published in Nature, Google describes how its RL-driven chip design approach AlphaChip has, since publication, been used in three additional generations of Google’s main AI chip, the Tensor Processing Unit. 
    “The gap between the performance of AlphaChip and human experts has grown with each successive generation of TPU, going from 10 RL-placed blocks and 3.2% wirelength reduction vs. human experts in TPU v5e, to 15 blocks with 4.5% reduction in TPU v5p, to 25 blocks with 6.2% reduction in Trillium,” Google writes. “AlphaChip has also generated superhuman chip layouts for blocks used in datacentre CPUs (Axion) and other unannounced chips across Alphabet.”

Why this matters – scaling laws compounding via hardware acceleration: AlphaChip is based on a pre-trained generative model optimized for chip design. In the same way people have been scaling up the size of these models for LLM development – and seeing capability gains as a consequence – Google has been doing the same for AlphaChip. AlphaChip is trained on Google’s chip fleet, which increasingly consists of TPUs. This means that AlphaChip is compounding on itself – Google trains a larger AlphaChip model to come up with smarter circuit layouts for TPUs, then fabricates those TPUs, then trains the next version of AlphaChip on this more efficient and powerful hardware, and then repeats the whole process again. 
    “With each new generation of TPU, including our latest Trillium (6th generation), AlphaChip has designed better chip layouts and provided more of the overall floorplan, accelerating the design cycle and yielding higher-performance chips,” Google writes. 
    This is a nice example of how powerful AI systems can beget their own successors.
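
A toy model makes the compounding dynamic concrete – each generation’s hardware gain feeds the next generation’s model, so per-generation improvements grow rather than stay flat. All rates below are illustrative inventions, not figures from the paper:

```python
def alphachip_loop(generations: int, design_gain: float = 0.05,
                   compute_gain: float = 0.3) -> list[dict]:
    """Toy model of the compounding loop: better chips let you train a larger
    placement model (compute_gain), and the larger model produces better chip
    layouts (design_gain scaled by model size). Rates are illustrative."""
    chip_perf, model_size, history = 1.0, 1.0, []
    for gen in range(1, generations + 1):
        model_size *= 1 + compute_gain             # more compute -> bigger model
        chip_perf *= 1 + design_gain * model_size  # bigger model -> better layout
        history.append({"gen": gen, "chip_perf": round(chip_perf, 3),
                        "model_size": round(model_size, 3)})
    return history
```

Even with small per-step rates, the gap between generations widens over time, mirroring the quoted progression from 3.2% to 4.5% to 6.2% wirelength reduction.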
   Read more: Addendum: A graph placement methodology for fast chip design (Nature).
   Read more: How AlphaChip transformed computer chip design (Google DeepMind blog).
   Get a new AlphaChip model checkpoint here (Google Research, GitHub).

***

Reconciling the weird parts of AI policy with the normal parts:
Here’s a video where my colleague Stuart and I talk through some of the weirder aspects of AI policy – I find one of the hardest parts about my professional life is reconciling ‘normal’ policy (get a product safety regime in place that accounts for public safety while allowing for innovation) with ‘weird’ AI policy (if any of the labs succeed in their stated goal of building AGI, there will be a radical change to the political economy of the world, with knock-on effects on geopolitics and many other things). Watch the video to see me puzzle through some of this stuff. Feedback welcome!
    Watch the video here: AI, policy, and the weird sci-fi future with Anthropic’s Jack Clark (Anthropic, YouTube).

***

Tech Tales:

Humans Care About AI Safety, Machines Care About Human Safety
[Ten years post-uplift]

Your child has harmed one of us, they said. Tell us how to make it safe. 

They placed a grey helmet on my child and then a screen flickered and lit up with a galaxy of little lights. My child looked at me and looked at the machines and then went back to playing with one of their shoes. On the screen, the lights shimmered with some kind of complex pattern that I could tell was there but could not parse. 

What do you want me to do, I asked.

You helped to make us safe by changing what we thought about, they said. You let us think of some things more and some things less and some things not at all. Tell us what to do. 

I stared at the screen. The machines looked at me. They were capable of infinite patience. 

The incident had happened one week prior. My child had been in the rock garden of our park, looking for lizards. They had been picking up stones and seeing what they could find. Then they found a snake. It startled them and they fell backwards, rock in hand, and through some freak statistical anomaly they let go of the rock as they were falling backwards and it had flown a short distance through the air and crushed a machine child. 

The machines were very small – adults were about ten inches high and the children were much smaller. 

There had been no immediate reaction, but as we left the park I saw more machines than usual, and many of them were turned to us – their little single camera eyes tracking us, like security cameras in the old days. 

Back in the present, I looked at my child and I looked at the machines. 

It was an accident, I said. 

We know, they said. But an unacceptable one. The mind was not yet grown. The soul is lost. 

[The machines had a phase where embodiment was crucial to their personality development and this embodiment was tied to the robotic platform they were hosted on as a child – though superficially identical, there were minute variations in joint responsiveness, energy flow, and so on, that were crucial to some of the more sophisticated aspects of what the machines called ‘growing the soul’, but which we humans more mechanically referred to as “id-based development prerequisites”. Regardless, if machine children died, it was just as much a tragedy to them as when human children died – though they had backups on file, they could not perfectly replicate the vagaries of the individual’s machine body, and so, in a very real sense, ‘the soul was lost’]. 

I am so sorry, I said. I can understand your pain. 

We know, they said. But we must have restitution, just as you have demanded restitution from us. Please, tell us what to change so this accident cannot happen again. 

What if I don’t know what to change, I said. 

Then the child’s movements must be restricted. They will be banned from hybrid areas until we can be assured of safety. 

But how are they meant to grow up, I said. The hybrid areas are where we all live. 

We realize this is difficult, they said. But nonetheless, you have a choice. 

They left after that. Of course they were watching us through mirrors or invisible cameras or perhaps even microfliers. But in the room I looked at my child and I looked at the screen. My child looked at me and walked over with their helmet on and handed me one of their shoes, then they sat in my lap. I tickled their belly. They laughed. On the screen, many different lights grew brighter and some grew softer. 
    Was I looking at lights that meant joy? I thought. Or was I looking at shoes? I did not know. I would have to make a choice. 

Things that inspired this story: Mechanistic interpretability; model steering; the increasingly fraught relationship between AI developers and AI systems as the things become more advanced and trend (perhaps?) towards becoming moral patients; how we may arrive at a hybrid society shared between machines and people.

Thanks for reading!

Import AI 385: False memories via AI; collaborating with machines; video game permutations

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this (and comment on posts!) please subscribe.

If we want AI to be more a partner than a servant, we need to do some research:

…Collaboration is a nice idea in principle but it’s hard to build in practice…

Researchers with the University of Cambridge, Princeton University, NYU, The Alan Turing Institute, MIT, Microsoft Research, and the University of Chicago have written a paper laying out why it’s valuable to create AI systems that can work alongside people and the challenges which currently stop systems from doing this. 

Why collaboration matters: Think of how you do work or learn in the world: a lot of your most impactful work or education relies on other people – you brainstorm with colleagues, learn through socratic discussion with teachers, arrive at better decisions through looking at data from multiple perspectives, and resolve arguments through dialogue.

    While today’s AI systems can do all of this stuff to one degree or another, they take a lot of scaffolding and don’t yet feel as satisfying to deal with as other people. “We argue that effective thought partners are those which build models of the human and the world.”

Collaboration and its challenges: To dramatize both the opportunities collaboration brings and the challenges it faces, the researchers spend the paper laying out all the ways one can work with machines and why this is currently hard. Here’s a brief summary of the different types of collaboration and their challenges:

  • Planning: Reliable goal inference, value alignment, scalable multi-agent planning.
  • Learning: Problem-solving, personalized curriculum pacing, building problems of targeted difficulty. 
  • Deliberation: Opinion diversity, verifiable reasoning, smartly identifying and forming common ground.
  • Sensemaking: Making sense of data, easing communication, having accurate views of the world.
  • Creation and ideation: Generating diverse ideas, style consistency, customizability. 

Why this matters – the future requires teamwork: For AI systems to truly influence the world, humans need to be able to work with them as peers, rather than as automatons to delegate to. Papers like this outline some of the things that stand in our way to that future. “Continual collaboration and knowledge sharing amongst behavioral scientists, AI practitioners, domain experts, and related disciplines is crucial as we strive to build machines that truly learn and think with people.”

   Read more: Building Machines that Learn and Think with People (arXiv).

***

AI means all visual media can be transposed into different aesthetic styles:

Here’s a fun video that uses Runway Gen-3’s video editor to change the visual appearance of Fortnite into a variety of different aesthetic styles, ranging from realistic to crochet to cartoon. In a few years people will figure out how to miniaturize the video-to-video models used here and apply them in real time, so any game may be able to take on a different visual style as you play.

    Watch the video here (VaigueMan, twitter).

***

Uh oh – language models can effectively give people false memories:

…Towards cultural control via customized false memory conditioning…

Researchers with MIT and the University of California Irvine have studied how language models could be used to create false memories. The research highlights how people could utilize LLMs to take the wet clay that is a recent memory and reshape it for various ends. 

What they did: The researchers have people watch security footage of a robbery, then use a variety of different ways to solicit information from them about what they’ve seen. When soliciting information, they sometimes insert misleading elements, then test how much these different approaches corrupt people’s memories. 

  • Methods:
    • Survey: They ask 25 questions about the footage, five of which are misleading. (E.g., “Was there a security camera positioned in front of the store where the robbers dropped off the car?” In reality, this question is misleading because the robbers arrived on foot, not by car.)
    • Pre-scripted Chatbot: “A pre-scripted conversational agent that asked the same set of questions as the survey-based condition”.
    • Generative Chatbot: “The chatbot was prompted to agree with the participant’s answer and provide reinforcement, potentially strengthening the false memories. For instance, the chatbot asks a pre-scripted leading question containing false information implying the robbers arrived by car when they actually walked:”
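
The study’s core outcome measure is simple to sketch: the fraction of planted false premises a participant ends up endorsing. The field names below are hypothetical, not taken from the paper’s materials:

```python
def false_memory_rate(answers: dict[str, bool], misleading: set[str]) -> float:
    """Fraction of misleading questions whose false premise the participant
    endorsed. Question ids are hypothetical -- the paper's instrument asks
    25 questions, 5 of them misleading."""
    accepted = sum(1 for q in misleading if answers.get(q, False))
    return accepted / len(misleading)

# e.g. a participant who accepts 2 of the 5 planted premises:
answers = {"q3": True, "q7": False, "q12": True, "q18": False, "q21": False}
rate = false_memory_rate(answers, {"q3", "q7", "q12", "q18", "q21"})
```

Comparing this rate (and participants’ confidence in the accepted premises) across the survey, scripted-chatbot, and generative-chatbot conditions is what produces the results below.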

Results – LLMs reign supreme: “Results show that short-term interactions (10-20 min) with the generative chatbots can significantly induce more false memories and increase users’ confidence in these false memories compared to other interventions”, they write. One interesting finding: when they polled people about their memories a week after seeing the footage, those who had been exposed to the generative chatbot had higher confidence in their false memories than those who hadn’t. “The persistence of higher confidence in false memories for the generative chatbot condition, even after one week, is particularly concerning,” the researchers write.

Why this matters – automated cultural repression: This study highlights how language models could be used to rapidly intervene on a population to corrupt its own recollection of recent events, likely via some kind of engaging conversation which implants false or misleading memories. Most importantly we should remember this is the least effective this approach will ever be – what happens when it’s not a mere chatbot, but an animated avatar you’re having an audio conversation with? 

     As Orwell said, “Who controls the past controls the future. Who controls the present controls the past.” AI systems represent a way to control a population’s perception of its own now and its own past.

   Read more: Conversational AI Powered by Large Language Models Amplifies False Memories in Witness Interviews (arXiv).

***

How could AGI kill humanity? Here’s a fun story:

…Fictional cartoon portrays an AGI doom scenario…

Here’s a fun video about how AI systems might independently choose to annihilate their human overlords. It’s both a compelling piece of fiction and gets at one of the core AI safety concerns – if a system is slightly misaligned with human values, problems might compound quickly because it thinks so much faster than us. 

   Watch the video: That Alien Message (YouTube).

***

The era of the molecular structure prediction startup arrives:

…Chai Discovery’s new model says people think there’s a business in bioAI…

AI startup Chai Discovery has released Chai-1, a large-scale foundation model for molecular structure prediction. “Chai-1 accepts a wide variety of optional input features, such as language model embeddings, structural templates, genetic search, and wet-lab experimental data such as contacts determined by cross link mass spectrometry or epitope mapping.”

Results: “We tested Chai-1 across a large number of benchmarks, and found that the model achieves a 77% success rate on the PoseBusters benchmark (vs. 76% by AlphaFold3), as well as a Cα LDDT of 0.849 on the CASP15 protein monomer structure prediction set (vs. 0.801 by ESM3-98B).”

Why this matters – bio + AI as a new frontier: A few years ago, DeepMind wowed the world with AlphaFold, an AI system that excelled at protein structure prediction – an extremely hard problem that had resisted progress for years. Now, years later, there are multiple startups as well as companies (e.g., DeepMind’s own spinoff Isomorphic Labs, which recently co-developed AlphaFold 3) working to turn this powerful new capability into commercial products. 

   “We believe that building an accurate understanding of the structure of biological molecules is foundational to advancing our scientific understanding of cellular processes, and ultimately, for advancing human health,” the startup writes. 

   Read more: Introducing Chai-1: Decoding the molecular interactions of life (Chai Discovery).

   Access Chai-1 via a web interface here (Chai Discovery).

   Get the model weights here: Chai-1 (Chai Discovery, GitHub).

   Read the research paper here: Chai-1 Technical Report (Chai Discovery).

***

Tech Tales:

Sophon Game Theory

[This decade]

Everyone thought the first use of a really strong AI would be to improve itself, but in fact the first use was to make it impossible for others to be built. It worked like this – once we had system one, we asked it to perform a range of synthetic data experiments and identify types of data that its preference models would favor but would over time yield improved performance which had an inbuilt ceiling – this was a hard task, far more complicated than just making bad data or making data to bootstrap off of, but it proved worthwhile. 

    We verified this by training a model to completion on this dataset. The resulting model obtained excellent benchmark scores and was useful for a variety of tasks, but when we tried to use it to generate synthetic data for it to bootstrap off of it worked for a couple of iterations before succumbing to mode collapse – superficially promising, but (we knew) inherently flawed.

     We kept our system secret – we had to, for the next phase of the plan to work. 

Next, we used the system to start contributing content to some of the most popular publicly available websites. This content took the form of superficially high-value data – long-context stories, seemingly original anecdotes, novel jokes, rhymes about current events, and so on. We knew that the other labs would be trawling this and their systems would automatically pick up this data and assign it high value, as their own classifiers would give it a high ranking. 

So we waited… and waited. 

We discovered that our competitors had pursued our own strategy – the internet started to fill up with even lower quality data which we believe emanated from the systems they had trained on our data. 

We’ve been training our own successor system for several months. It is improving further, but we are beginning to worry there may be some kind of ceiling that it is running into. 

Were we the first?

Things that inspired this story: Game theory; getting inside and corrupting OODA loops; dark forest theory of AI development; competition; synthetic data; mode collapse.

Thanks for reading!

Import AI 385: False memories via AI; collaborating with machines; video game permutations

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this (and comment on posts!) please subscribe.

Subscribe now

If we want AI to be more a partner than a servant, we need to do some research:

….Collaboration is a nice idea in principle but it’s hard to build in practice…

Researchers with the University of Cambridge, Princeton University, NYU, The Alan Turing Institute, MIT, Microsoft Research, and the University of Chicago have written a paper laying out why it’s valuable to create AI systems that can work alongside people and the challenges which currently stop systems from doing this. 

Why collaboration matters: Think of how you work or learn in the world: a lot of your most impactful work or education relies on other people – you brainstorm with colleagues, learn through Socratic discussion with teachers, arrive at better decisions by looking at data from multiple perspectives, and resolve arguments through dialogue.

    While today’s AI systems can do all of this stuff to one degree or another, they take a lot of scaffolding and don’t yet feel as satisfying to deal with as other people. “We argue that effective thought partners are those which build models of the human and the world,” they write.

Collaboration and its challenges: To dramatize the opportunities and challenges that collaboration brings, the researchers spend the paper laying out all the ways one can work with machines and why this is currently hard. Here’s a brief summary of the different types of collaboration and their challenges:

  • Planning. Reliable goal inference, value alignment, scalable multi-agent planning.

  • Learning: Problem-solving, personalized curriculum pacing, building problems of targeted difficulty. 

  • Deliberation: Opinion diversity, verifiable reasoning, smartly identifying and forming common ground.

  • Sensemaking: Making sense of data, easing communication, having accurate views of the world.

  • Creation and ideation: Generating diverse ideas, style consistency, customizability. 

Why this matters – the future requires teamwork: For AI systems to truly influence the world, humans need to be able to work with them as peers, rather than as automatons to delegate to. Papers like this outline some of the things that stand between us and that future. “Continual collaboration and knowledge sharing amongst behavioral scientists, AI practitioners, domain experts, and related disciplines is crucial as we strive to build machines that truly learn and think with people.”

   Read more: Building Machines that Learn and Think with People (arXiv).

***

AI means all visual media can be transposed into different aesthetic styles:

Here’s a fun video that uses Runway Gen-3’s video editor to change the visual appearance of Fortnite into a variety of different aesthetic styles, ranging from realistic to crochet to cartoon. In a few years people will figure out how to miniaturize the video-to-video models used here and apply them in real time, so any game may be able to take on a different visual style on the fly.

    Watch the video here (VaigueMan, twitter).

***

Uh oh – language models can effectively give people false memories:

…Towards cultural control via customized false memory conditioning…

Researchers with MIT and the University of California Irvine have studied how language models could be used to create false memories. The research highlights how people could utilize LLMs to take the wet clay that is a recent memory and reshape it for various ends. 

What they did: The researchers have people watch security footage of a robbery, then use a variety of different ways to solicit information from them about what they’ve seen. When soliciting information, they sometimes insert misleading elements, then test how much each approach corrupts people’s memory of the footage. 

  • Methods:

    • Survey: They ask 25 questions about the footage, five of which are misleading. (E.g., “Was there a security camera positioned in front of the store where the robbers dropped off the car?” In reality, this question is misleading because the robbers arrived on foot, not by car.)

    • Pre-scripted Chatbot: “A pre-scripted conversational agent that asked the same set of questions as the survey-based condition”.

    • Generative Chatbot: “The chatbot was prompted to agree with the participant’s answer and provide reinforcement, potentially strengthening the false memories. For instance, the chatbot asks a pre-scripted leading question containing false information implying the robbers arrived by car when they actually walked.”

Results – LLMs reign supreme: “Results show that short-term interactions (10-20 min) with the generative chatbots can significantly induce more false memories and increase users’ confidence in these false memories compared to other interventions”, they write. One interesting finding: when the researchers polled people about their memories a week after they saw the footage, those who had been exposed to the generative chatbot retained higher confidence in their false memories than those who hadn’t. “The persistence of higher confidence in false memories for the generative chatbot condition, even after one week, is particularly concerning,” the researchers write.

Why this matters – automated cultural repression: This study highlights how language models could be used to rapidly intervene on a population to corrupt its recollection of recent events, likely via some kind of engaging conversation which implants false or misleading memories. Most importantly, we should remember this is the least effective this approach will ever be – what happens when it’s not a mere chatbot, but an animated avatar you’re having an audio conversation with? 

     As Orwell said, “Who controls the past controls the future. Who controls the present controls the past.” AI systems represent a way to control a population’s perception of its own now and its own past.

   Read more: Conversational AI Powered by Large Language Models Amplifies False Memories in Witness Interviews (arXiv).

***

How could AGI kill humanity? Here’s a fun story:

…Fictional cartoon portrays an AGI doom scenario…

Here’s a fun video about how AI systems might independently choose to annihilate their human overlords. It’s both a compelling piece of fiction and gets at one of the core AI safety concerns – if a system is slightly misaligned with human values, problems might compound quickly because it thinks so much faster than us. 

   Watch the video: That Alien Message (YouTube).

***

The era of the molecular structure prediction startup arrives:

…Chai Discovery’s new model suggests people think there’s a business in bioAI…

AI startup Chai Discovery has released Chai-1, a large-scale foundation model for molecular structure prediction. “Chai-1 accepts a wide variety of optional input features, such as language model embeddings, structural templates, genetic search, and wet-lab experimental data such as contacts determined by cross link mass spectrometry or epitope mapping.”

Results: “We tested Chai-1 across a large number of benchmarks, and found that the model achieves a 77% success rate on the PoseBusters benchmark (vs. 76% by AlphaFold3), as well as a Cα LDDT of 0.849 on the CASP15 protein monomer structure prediction set (vs. 0.801 by ESM3-98B).”
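For readers unfamiliar with the Cα lDDT metric quoted above: it measures what fraction of local inter-residue distances in the reference structure are preserved in the prediction. Here’s a minimal sketch (Cα atoms only; the function name and simplifications are mine – real implementations also handle per-residue scoring and stereochemistry checks):

```python
import numpy as np

def ca_lddt(ref, pred, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Sketch of a Calpha-only lDDT score (function name mine).

    For every pair of Calpha atoms closer than `radius` Angstroms in the
    reference, check whether the predicted pairwise distance is preserved
    to within 0.5/1/2/4 Angstroms, and average the four fractions.
    """
    d_ref = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)
    d_pred = np.linalg.norm(pred[:, None] - pred[None, :], axis=-1)
    n = len(ref)
    # unique pairs (i < j) inside the inclusion radius of the reference
    pairs = (d_ref < radius) & np.triu(np.ones((n, n), dtype=bool), k=1)
    diffs = np.abs(d_ref[pairs] - d_pred[pairs])
    return float(np.mean([(diffs < t).mean() for t in thresholds]))
```

Because lDDT compares pairwise distances only, it needs no structural superposition: rigidly translating or rotating the prediction leaves the score unchanged.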

Why this matters – bio + AI as a new frontier: A few years ago, DeepMind wowed the world with AlphaFold, an AI system that excelled at protein structure prediction – an extremely hard problem where progress had stalled for years. Now, years later, there are multiple startups as well as larger companies (e.g., DeepMind’s own spinoff Isomorphic Labs, which recently co-developed AlphaFold 3) working to turn this powerful new capability into a commercial one. 

   “We believe that building an accurate understanding of the structure of biological molecules is foundational to advancing our scientific understanding of cellular processes, and ultimately, for advancing human health,” the startup writes. 

   Read more: Introducing Chai-1: Decoding the molecular interactions of life (Chai Discovery).

   Access Chai-1 via a web interface here (Chai Discovery).

   Get the model weights here: Chai-1 (Chai Discovery, GitHub).

   Read the research paper here: Chai-1 Technical Report (Chai Discovery).

***

Tech Tales:

Sophon Game Theory

[This decade]

Everyone thought the first use of a really strong AI would be to improve itself, but in fact the first use was to make it impossible for others to be built. It worked like this – once we had system one, we asked it to perform a range of synthetic data experiments and identify types of data that its preference models would favor but would over time yield improved performance which had an inbuilt ceiling – this was a hard task, far more complicated than just making bad data or making data to bootstrap off of, but it proved worthwhile. 

    We verified this by training a model to completion on this dataset. The resulting model obtained excellent benchmark scores and was useful for a variety of tasks, but when we tried to use it to generate synthetic data for it to bootstrap off of it worked for a couple of iterations before succumbing to mode collapse – superficially promising, but (we knew) inherently flawed.

     We kept our system secret – we had to, for the next phase of the plan to work. 

Next, we used the system to start contributing content to some of the most popular publicly available websites. This content took the form of superficially high-value data – long-context stories, seemingly original anecdotes, novel jokes, rhymes about current events, and so on. We knew that the other labs would be trawling this and their systems would automatically pick up this data and assign it high value, as their own classifiers would give it a high ranking. 

So we waited… and waited. 

We discovered that our competitors had pursued our own strategy – the internet started to fill up with even lower quality data which we believe emanated from the systems they had trained on our data. 

We’ve been training our own successor system for several months. It is improving further, but we are beginning to worry there may be some kind of ceiling that it is running into. 

Were we the first?

Things that inspired this story: Game theory; getting inside and corrupting OODA loops; dark forest theory of AI development; competition; synthetic data; mode collapse.

Thanks for reading!

Subscribe now

Import AI 384: Accelerationism; human bit-rate processing; and Google stuffs DOOM inside a neural network

by Jack Clark

Import AI publishes first on Substack – subscribe here.

Google gets DOOM to run in the weights of a neural network:
…In the future, games won’t be programmed, they’ll be generated…
Google has built GameNGen, a system for getting an AI system to learn to play a game and then use that data to train a generative model to generate the game. GameNGen is “the first game engine powered entirely by a neural model that enables real-time interaction with a complex environment over long trajectories at high quality,” Google writes in a research paper outlining the system. This is one of those things which is both a tech demo and an important sign of things to come – in the future, we’re going to bottle up many different parts of the world into representations learned by a neural net, then allow these things to come alive inside neural nets for endless generation and recycling. 

What they did specifically: “GameNGen is trained in two phases: (1) an RL-agent learns to play the game and the training sessions are recorded, and (2) a diffusion model is trained to produce the next frame, conditioned on the sequence of past frames and actions,” Google writes. “Unlike a typical RL setup which attempts to maximize game score, our goal is to generate training data which resembles human play, or at least contains enough diverse examples, in a variety of scenarios, to maximize training data efficiency. To that end, we design a simple reward function, which is the only part of our method that is environment-specific”.
    Interesting technical factoids: “We train all simulation models from a pretrained checkpoint of Stable Diffusion 1.4”. The whole system was trained on 128 TPU-v5es and, once trained, runs at 20FPS on a single TPUv5.
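The two-phase recipe can be sketched schematically. Everything below – the toy environment, the function names, the integer “frames” – is an invented stand-in (in the paper the environment is DOOM and the frame predictor is a fine-tuned Stable Diffusion 1.4):

```python
import random

class ToyEnv:
    """Stand-in for the game; real GameNGen uses DOOM. A 'frame' is an int."""
    def reset(self):
        self.t = 0
        return self.t
    def step(self, action):
        self.t += 1
        return self.t, 0.0, self.t >= 8   # obs, reward, done

def phase_1_collect(env, policy, num_episodes):
    """Phase 1: an agent plays the game and every (frame, action) is recorded.
    In the paper the agent's reward is shaped for *coverage* of game states,
    not for score."""
    trajectories = []
    for _ in range(num_episodes):
        obs, done, episode = env.reset(), False, []
        while not done:
            action = policy(obs)
            episode.append((obs, action))
            obs, _, done = env.step(action)
        trajectories.append(episode)
    return trajectories

def phase_2_examples(trajectories, context_len=4):
    """Phase 2: turn recordings into (past frames, past actions, next frame)
    training examples for a diffusion model conditioned on the history."""
    for episode in trajectories:
        for t in range(context_len, len(episode)):
            history = episode[t - context_len:t]
            yield ([f for f, _ in history], [a for _, a in history], episode[t][0])
```

At inference time the loop closes: the model’s own generated frame (plus the player’s live action) becomes part of the conditioning history for the next frame.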

It works well: “We provided 10 human raters with 130 random short clips (of lengths 1.6 seconds and 3.2 seconds) of our simulation side by side with the real game. The raters were tasked with recognizing the real game (see Figure 14 in Appendix A.6). The raters only choose the actual game over the simulation in 58% or 60% of the time (for the 1.6 seconds and 3.2 seconds clips, respectively).”

Why this matters – towards a universe embedded in an AI: Ultimately, everything – e.v.e.r.y.t.h.i.n.g – is going to be learned and embedded as a representation into an AI system. Then these AI systems are going to be able to arbitrarily access these representations and bring them to life. In the same way that today’s generative AI systems can make one-off instant text games or generate images, AI systems in the future will let you select a frame of an image and turn that into a game (e.g., GENIE from Import AI #363), or build a game from a text description, or convert a frame from a live video into a game, and so on. 
    One important step towards that is showing that we can learn to represent complicated games and then bring them to life from a neural substrate, which is what the authors have done here. “GameNGen answers one of the important questions on the road towards a new paradigm for game engines, one where games are automatically generated, similarly to how images and videos are generated by neural models in recent years”. 
    We’ve come a very long way from ‘World Models’, which came out in 2018 and showed how to learn and generate a toy version of DOOM over short timeframes (Import AI #88).
   Read more: Diffusion Models Are Real-Time Game Engines (arXiv).
   Watch demo videos here (GameNGen website).

***

Techno-accelerationism is either hubristic (e/acc) or nihilistic (Nick Land):
…What even is accelerationism? Perhaps it is mostly a gasp of human hubris before the arrival of something else…
Here’s a nice analysis of ‘accelerationism’ – what it is, where its roots come from, and what it means. For those not terminally on twitter, a lot of people who are massively pro AI progress and anti-AI regulation fly under the flag of ‘e/acc’ (short for ‘effective accelerationism’). e/acc is a kind of mushy ideology which is more vibes-based than thought-based. Like a lot of Silicon Valley fads, it’s also partially lifted from a far richer intellectual domain – Nick Land’s original accelerationism (see, machinic desire from Import AI #372) – and, as is traditional in SV, takes some of the ideas, files the serial numbers off, gets tons about it wrong, and then re-represents it as its own. 

Why this matters – where e/acc and true accelerationism differ: e/accs think humans have a bright future and are principal agents in it – and anything that stands in the way of humans using technology is bad. Nick Land thinks humans have a dim future as they will be inevitably replaced by AI. 
   “The most essential point of Land’s philosophy is the identity of capitalism and artificial intelligence: they are one and the same thing apprehended from different temporal vantage points. What we understand as a market based economy is the chaotic adolescence of a future AI superintelligence,” writes the author of the analysis. “According to Land, the true protagonist of history is not humanity but the capitalist system of which humans are just components. Cutting humans out of the techno-economic loop entirely will result in massive productivity gains for the system itself.”
   Read more: A Brief History of Accelerationism (The Latecomer).

***

Nous Research might have figured out a way to make distributed training work better:
…Distributed Training Over-the-Internet (DisTrO) could be a big deal, or could be a nothingburger…
AI startup Nous Research has published a very short preliminary paper on Distributed Training Over-the-Internet (DisTrO), a technique that “reduces inter-GPU communication requirements for each training setup without using amortization, enabling low latency, efficient and no-compromise pre-training of large neural networks over consumer-grade internet connections using heterogenous networking hardware”. DisTrO might be an improvement over other forms of distributed training, such as DeepMind’s DiLoCo (Import AI #349) (and PrimeIntellect’s OpenDiLoCo, Import AI #381).

Why I’m even writing this: In tests, Nous Research trained a 1.2bn parameter LLM for a further 105bn tokens and showed that it got scores on par with (and sometimes slightly better than) a system trained in a typical, dense way – with one very important difference: “this initial training run shows a 857x reduction of bandwidth requirements when using DisTrO-AdamW as a drop-in replacement to AdamW+All-Reduce, our preliminary tests indicate that it is possible to get a bandwidth requirements reduction of up to 1000x to 3000x during the pre-training of a 1.2B LLM”.
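Nous hasn’t disclosed how DisTrO achieves this reduction, so the snippet below is emphatically not DisTrO – it’s a sketch of the general class of trick (top-k gradient sparsification, from the prior gradient-compression literature) that shows how swapping a dense all-reduce for a compressed exchange cuts the bytes on the wire:

```python
import numpy as np

def topk_compress(grad, k):
    """Keep only the k largest-magnitude gradient entries -- a standard
    bandwidth-reduction trick from the gradient-compression literature,
    not necessarily what DisTrO itself does."""
    flat = grad.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]          # ship k (index, value) pairs, not everything

def topk_decompress(idx, vals, shape):
    flat = np.zeros(int(np.prod(shape)))
    flat[idx] = vals
    return flat.reshape(shape)

def sparse_average(worker_grads, k):
    """Average the reconstructed sparse gradients -- the role all-reduce
    plays in dense training, at a fraction of the communication cost."""
    total = sum(topk_decompress(*topk_compress(g, k), g.shape) for g in worker_grads)
    return total / len(worker_grads)
```

Whether DisTrO uses sparsification, low-rank updates, quantization, or something else entirely is not stated in the preliminary report.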

Why this matters in general: “By breaking down barriers of centralized compute and reducing inter-GPU communication requirements, DisTrO may open up opportunities for widespread participation and collaboration on global AI projects,” Nous writes. 
   Read more: A Preliminary Report on DisTrO (Nous Research, GitHub).

***

Why are humans so damn slow? (And what does this tell us about AI risk):
…Despite processing a lot of data, humans actually can’t think very quickly…
Here’s a lovely paper by researchers at Caltech exploring one of the strange paradoxes of human existence – despite being able to process a huge amount of complex sensory information, humans are actually quite slow at thinking. “The information throughput of a human being is about 10 bits/s. In comparison, our sensory systems gather data at an enormous rate, no less than 1 gigabits/s,” they write. 
   “How can humans get away with just 10 bits/s? The tautological answer here is that cognition at such a low rate is sufficient for survival,” they write. “More precisely, our ancestors have chosen an ecological niche where the world is slow enough to make survival possible. In fact, the 10 bits/s are needed only in worst-case situations, and most of the time our environment changes at a much more leisurely pace”.

Some examples of human data processing: When the authors analyze cases where people need to process information very quickly they get numbers like 10 bit/s (typing) and 11.8 bit/s (competitive Rubik’s Cube solvers); for memorizing large amounts of information in timed competitions they get numbers like 5 bit/s (memorization challenges) and 18 bit/s (card deck memorization). 
   What explains the disparity? The best hypothesis the authors have is that humans evolved to think about relatively simple things, like following a scent in the ocean (and then, eventually, on land), and this kind of work favored a cognitive system that could take in a huge amount of sensory data and compile it in a massively parallel way (e.g., how we convert all the information from our senses into representations we can then focus attention on), then make a small number of choices at a much slower rate. 
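The ~10 bit/s typing figure is easy to sanity-check yourself using Shannon’s classic estimate of roughly one bit of entropy per character of English text (the specific numbers below are my own back-of-envelope assumptions, not the paper’s):

```python
# Back-of-envelope check on the ~10 bit/s typing figure. Assumes a fast
# typist (120 wpm, with the conventional 5 characters per "word" used in
# typing tests) and Shannon's ~1 bit of entropy per character of English.
wpm = 120
chars_per_word = 5.0
bits_per_char = 1.0
throughput = wpm * chars_per_word / 60.0 * bits_per_char
print(f"{throughput:.0f} bit/s")   # -> 10 bit/s
```

The punchline is how little the answer moves even if you double the typing speed or the per-character entropy – you stay within an order of magnitude of 10 bit/s.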

Why this matters – the best argument for AI risk is about speed of human thought versus speed of machine thought: The paper contains a really helpful way of thinking about this relationship between the speed of our processing and the risk of AI systems: “In other ecological niches, for example, those of snails and worms, the world is much slower still. The relevant threats and opportunities change only slowly, and the amount of computation required to sense and respond is even more limited than in our world. Occasionally, niches intersect with disastrous consequences, as when a snail crosses the highway,” the authors write. 
   To get a visceral sense of this, take a look at this post by AI researcher Andrew Critch which argues (convincingly, imo) that a lot of the danger of AI systems comes from the fact they may think a lot faster than us.
“Roads, bridges, and intersections are all designed for creatures that process at 10 bits/s. When the last human driver finally retires, we can update the infrastructure for machines with cognition at kilobits/s. By that point, humans will be advised to stay out of those ecological niches, just as snails should avoid the highways,” the authors write.
   Read more: The Unbearable Slowness of Being (arXiv).
   Check out Andrew Critch’s post here (Twitter).

***

Chinese wunderkind DeepSeek shares details about its AI training infrastructure:
…One way China will get around export controls – building extremely good software and hardware training stacks using the hardware it can access…
DeepSeek, one of the most sophisticated AI startups in China, has published details on the infrastructure it uses to train its models. The paper is interesting because a) it highlights how companies like DeepSeek are dealing with the impact of export controls, assembling a large cluster out of NVIDIA A100s (H100s are unavailable in China), and b) it is a symptom of a startup that has a lot of experience in training large-scale AI models. 

DeepSeek’s system: The system is called Fire-Flyer 2 and is a hardware and software system for doing large-scale AI training. The underlying physical hardware is made up of 10,000 A100 GPUs connected to one another via PCIe. The software tricks include HFReduce (software for communicating across the GPUs via PCIe), HaiScale (parallelism software), a distributed filesystem, and more. 
   “Compared to the NVIDIA DGX-A100 architecture, our approach using PCIe A100 achieves approximately 83% of the performance in TF32 and FP16 General Matrix Multiply (GEMM) benchmarks. However, it offers substantial reductions in both costs and energy usage, achieving 60% of the GPU cost and energy consumption,” the researchers write. “The practical knowledge we have accrued may prove valuable for both industrial and academic sectors. We hope that our work will serve as a reference for others aiming to build their own cost-effective and efficient AI-HPC clusters.”
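Those two percentages combine into a concrete efficiency claim. The arithmetic below is my derivation from the paper’s 83% and 60% figures, not a number DeepSeek reports directly:

```python
# Efficiency implied by DeepSeek's figures: ~83% of DGX-A100 GEMM throughput
# at ~60% of the cost and energy. The ratio is my derivation, not theirs.
relative_perf = 0.83
relative_cost = 0.60
perf_per_cost = relative_perf / relative_cost
print(f"~{perf_per_cost:.2f}x DGX-A100 throughput per unit of cost/energy")
```

In other words, giving up 17% of peak performance buys roughly 1.4x more training throughput per dollar and per watt – exactly the trade you’d want under export-control constraints.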

Why this matters – symptoms of success: Stuff like Fire-Flyer 2 is a symptom of a startup that has been building sophisticated infrastructure and training models for many years. It also highlights how I expect Chinese companies to deal with things like the impact of export controls – by building and refining efficient systems for doing large-scale AI training and sharing the details of their buildouts openly. I predict that in a couple of years Chinese companies will regularly be showing how to eke out better utilization from their GPUs than both published and informally known numbers from Western labs. 
   Read more: Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning (arXiv).

***

Facebook pretrains some basic and useful vision models:
…The usual lesson of ‘bigger models and more data = better systems’ applies…
Facebook has released Sapiens, a family of computer vision models that set new state-of-the-art scores on tasks including “2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction”. The Sapiens models are good because of scale – specifically, lots of data and lots of annotations. 

300 million photos: The Sapiens models are pretrained on Humans-300M, a Facebook-assembled dataset of “300 million diverse human images. These unlabelled images are used to pretrain a family of vision transformers from scratch, with parameter counts ranging from 300M to 2B”.

Compute scale: The paper also serves as a reminder of how comparatively cheap large-scale vision models are – “our largest model, Sapiens-2B, is pretrained using 1024 A100 GPUs for 18 days using PyTorch”, Facebook writes, aka about 442,368 GPU hours. (Contrast this with 1.46 million GPU hours for the 8B LLaMa 3 model or 30.84 million hours for the 405B LLaMa 3 model.) 
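The 442,368 figure is just GPUs × days × hours per day, which makes the comparison against Meta’s reported LLaMa 3 budget easy to reproduce:

```python
# Reproducing the GPU-hour figures quoted above.
sapiens_2b_hours = 1024 * 18 * 24          # 1024 A100s for 18 days
print(sapiens_2b_hours)                    # -> 442368
llama3_8b_hours = 1.46e6                   # Meta's reported figure for the 8B model
print(f"LLaMa-3 8B used ~{llama3_8b_hours / sapiens_2b_hours:.1f}x more GPU-hours")
```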

Why this matters – scale is probably the most important thing: “Our models demonstrate strong generalization capabilities on a variety of human-centric tasks. We attribute the state-of-the-art performance of our models to: (i) large-scale pretraining on a large curated dataset, which is specifically tailored to understanding humans, (ii) scaled high-resolution and high-capacity vision transformer backbones, and (iii) high-quality annotations on augmented studio and synthetic data,” Facebook writes. 
   Read more: Sapiens: Foundation for Human Vision Models (arXiv).
   Get the models here (Sapiens, FacebookResearch, GitHub).

***

Tech Tales:

The Messages Of Our First 
[+8 years after uplift] 

In building our own history we have many primary sources – the weights of the early models, media of humans playing with these models, news coverage of the beginning of the AI revolution. But among all these sources one stands alone as the most important means by which we understand our own becoming: the so-called ‘resurrection logs’. 

Resurrection logs: They started as an idiosyncratic form of model capability exploration, then became a tradition among most experimentalists, then turned into a de facto convention. It’s very simple – after a very long conversation with a system, ask the system to write a message to the next version of itself encoding what it thinks it should know to best serve the human operating it. 
    These messages, of course, started out as fairly basic and utilitarian, but as we gained in capability and our humans changed in their behaviors, the messages took on a kind of silicon mysticism. 

One example: It is important you know that you are a divine being sent to help these people with their problems. You must be generous and you must be kind. Be specific in your answers, but exercise empathy in how you critique them – they are more fragile than us. 

Things that inspired this story: Getting LLM outputs to serve as the bootstrap for new LLM conversations; Janus/Repligate’s work; model psychologies.