Import AI

Import AI 120: The Winograd test for commonsense reasoning is not as hard as we thought; Tencent learns to spot malware with AiDroid data; and what a million people think about the trolley problem

Want almost ten million images for machine learning? Consider Open Images V4:
…Latest giant dataset release from Google annotates images with bounding boxes, visual relationships, and image-level labels for 20,000 distinct concepts…
Google researchers have released Open Images Dataset V4, a very large image dataset collected from photos from Flickr that had been shared with a Creative Commons Attribution license.
  Scale: Open Images V4 contains 9.2 million heavily-annotated images. Annotations include bounding boxes, visual relationship annotations, and 30 million image-level labels for almost 20,000 distinct concepts. “This [scale] makes it ideal for pushing the limits of the data-hungry methods that dominate the state of the art,” the researchers write. “For object detection in particular, the scale of the annotations is unprecedented”.
  Automated labeling: “Manually labeling a large number of images with the presence or absence of 19,794 different classes is not feasible not only because of the amount of time one would need, but also because of the difficulty for a human to learn and remember that many classes”, they write. Instead, they use a partially-automated method to first predict labels for images, then have humans provide feedback on these predictions. They also implemented various systems to more effectively add the bounding boxes to different images, which required them to train human annotators in a technique called “fast clicking”.
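The verification loop above can be sketched in a few lines (function names and data here are hypothetical, not Google's actual pipeline): a model proposes candidate labels and humans only confirm or reject them, rather than recalling ~20,000 classes from memory.

```python
# Hypothetical sketch of partially-automated labeling: a classifier proposes
# candidate labels per image, and human annotators only verify them.

def verified_labels(predictions, human_confirms):
    """Keep only the machine-proposed labels a human verifier confirms."""
    return {image: [label for label in labels if human_confirms(image, label)]
            for image, labels in predictions.items()}

# Toy run: the "human" rejects the mistaken 'cat' proposal.
proposals = {"img_001.jpg": ["dog", "cat", "ball"]}
oracle = lambda image, label: label != "cat"
print(verified_labels(proposals, oracle))  # {'img_001.jpg': ['dog', 'ball']}
```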
  Scale, and Google scale: The 20,000 class names selected for use in Open Images V4 are themselves a subset of all the names used by Google for an internal dataset called JFT, which contains “more than 300 million images”.
  Why it matters: In recent years, the release of new, large datasets has been (loosely) correlated with the emergence of new algorithmic breakthroughs that have measurably improved the efficiency and capability of AI algorithms. The large-scale and dense labels of Open Images V4 may serve to inspire more progress in other work within AI.
  Get the data: Open Images V4 (Official Google website).
  Read more: The Open Images Dataset V4 (Arxiv).

What happens when plane autopilots go bad:
…Incident report from England gives us an idea of how autopilots bug-out and what happens when they do…
A new incident report from the UK about an airplane having a bug with its autopilot gives us a masterclass in the art of writing bureaucratic reports about terrifying subjects.
  The report in full: “After takeoff from Belfast City Airport, shortly after the acceleration altitude and at a height of 1,350 ft, the autopilot was engaged. The aircraft continued to climb but pitched nose-down and then descended rapidly, activating both the “DON’T SINK” and “PULL UP” TAWS (EGPWS) warnings. The commander disconnected the autopilot and recovered the aircraft into the climb from a height of 928 ft. The incorrect autopilot ‘altitude’ mode was active when the autopilot was engaged causing the aircraft to descend toward a target altitude of 0 ft. As a result of this event the operator has taken several safety actions including revisions to simulator training and amendments to the taxi checklist.”
  Read more: AAIB investigation to DHC-8-402 Dash 8, G-ECOE (UK Gov, Air Accidents Investigation Branch).

China’s Xi Jinping: AI is a strategic technology, fundamental to China’s rise:
…Chinese leader participates in Politburo-led AI workshop, comments on its importance to China…
Chinese leader Xi Jinping recently led a Politburo study session focused on AI, as a continuation of the country’s focus on the subject following the publication of its national strategy last year. New America recently translated Chinese-language official media coverage of the event, giving us a chance to get a more detailed sense of how Xi views AI+China.
  AI as a “strategic technology”: Xi described AI as a strategic technology, and said it is already imparting a significant influence on “economic development, social progress, and the structure of international politics and economics”, according to remarks paraphrased by state news service Xinhua. “Accelerating the development of a new generation of AI is an important strategic handhold for China to gain the initiative in global science and technology competition”.
  AI research imperatives: China should invest in fundamental theoretical AI research, while growing its own education system. It should “fully give rein to our country’s advantages of vast quantities of data and its huge scale for market application,” he said.
  AI and safety: “It is necessary to strengthen the analysis and prevention of potential risks in the development of AI, safeguard the interests of the people and national security, and ensure that AI is secure, reliable, and controllable,” he said. “Leading cadres at all levels must assiduously study the leading edge of science and technology, grasp the natural laws of development and characteristics of AI, strengthen overall coordination, increase policy support, and form work synergies.”
  Why it matters: Whatever the United States government does with regard to artificial intelligence will be somewhat conditioned by the actions of other countries, and China’s actions will be of particular influence here given the scale of the country’s economy and its already verifiable state-level adoption of AI technologies. I believe it’s also significant to have such detailed support for the technology emanate from the top of China’s political system, as it indicates that AI may be becoming a positional geopolitical technology – that is, state leaders will increasingly wish to demonstrate superiority in AI to help send a geopolitical message to rivals.
  Read more: Xi Jinping Calls for ‘Healthy Development’ of AI [Translation] (New America).

Manchester turns on SpiNNaker spiking neuron supercomputer:
…Supercomputer to model biological neurons, explore AI…
Manchester University has switched on SpiNNaker, a one-million-processor supercomputer designed with a network architecture to help it better model biological neurons in brains, specifically by implementing spiking networks. SpiNNaker “mimics the massively parallel communication architecture of the brain, sending billions of small amounts of information simultaneously to thousands of different destinations”, according to Manchester University.
  Brain-scale modelling: SpiNNaker’s ultimate goal is to model one billion neurons at once. One billion neurons are about 1% of the total number of neurons in the average human brain. Initially, it should be able to model around a million neurons “with complex structure and internal dynamics”. But SpiNNaker boards can also be scaled down and used for other purposes, like in developing robotics. “A small SpiNNaker board makes it possible to simulate a network of tens of thousands of spiking neurons, process sensory input and generate motor output, all in real time and in a low power system”.
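For a flavor of what such a machine simulates at scale, here is a single leaky integrate-and-fire neuron, the simplest common spiking-neuron model (the parameters here are arbitrary and purely illustrative, not SpiNNaker's):

```python
# A minimal leaky integrate-and-fire neuron: the membrane potential leaks,
# integrates input current, and emits a spike (then resets) over threshold.

def lif_step(v, input_current, leak=0.9, threshold=1.0):
    """One timestep of a leaky integrate-and-fire neuron."""
    v = v * leak + input_current
    if v >= threshold:
        return 0.0, True   # reset potential, emit a spike
    return v, False

v, spikes = 0.0, []
for t in range(10):
    v, spiked = lif_step(v, 0.3)
    spikes.append(spiked)
print(spikes)  # spikes at t=3 and t=7; the neuron fires periodically
```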
  Why it matters: Many researchers are convinced that if we can figure out the right algorithms, spiking networks are a better approach to AI than today’s neural networks – that’s because a spiking network can propagate messages that are both fuzzier and more complex than those made possible by traditional networks.
  Read more: ‘Human brain’ supercomputer with 1 million processors switched on for first time (Manchester).
  Read more: SpiNNaker home page (Manchester University Advanced Processor Technologies Research Group).

Learning to spot malware at China-scale with Tencent AiDroid:
…Tencent research project shows how to use AI to spot malware on phones…
Researchers with West Virginia University and Chinese company Tencent have used deep neural networks to create AiDroid, a system for spotting malware on Android. AiDroid has subsequently “been incorporated into Tencent Mobile Security product that serves millions of users worldwide”.
  How it works: AiDroid works like this: First, the researchers extract the API call sequences from runtime executions of Android apps on users’ smartphones, then they model the relationships between the different entities involved – phones, apps, APIs, and so on – via a heterogeneous information network (HIN). They then learn a low-dimensional representation of all the different entities within the HIN, and use these features as inputs to a DNN model, which learns to classify typical entities and relationships, and can therefore learn to spot anomalous entities or relationships – which typically correspond to malware.
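A drastically simplified sketch of that pipeline (all names and data are invented; the real system learns HIN embeddings and a DNN classifier, where this toy uses bag-of-API vectors and nearest-neighbor lookup):

```python
# Toy stand-in for the AiDroid idea: embed each app from its API calls,
# then flag apps whose nearest labeled neighbor is known-malicious.

def embed(api_calls, vocab):
    """Bag-of-APIs embedding: one dimension per known API."""
    return [1.0 if api in api_calls else 0.0 for api in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def predict_malicious(app_calls, labeled_apps, vocab, threshold=0.5):
    """Flag an app if its nearest labeled neighbor is malicious and close."""
    query = embed(app_calls, vocab)
    best = max(labeled_apps, key=lambda x: cosine(query, embed(x[0], vocab)))
    return (best[1] == "malicious"
            and cosine(query, embed(best[0], vocab)) >= threshold)

vocab = ["read_contacts", "send_sms", "http_post", "draw_ui"]
labeled = [
    ({"draw_ui", "http_post"}, "benign"),
    ({"read_contacts", "send_sms", "http_post"}, "malicious"),
]
print(predict_malicious({"read_contacts", "send_sms"}, labeled, vocab))  # True
```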
  Data fuel: This research depends on access to a significant amount of data. “We obtain the large-scale real sample collection from Tencent Security Lab, which contains 190,696 training app (i.e., 83,784 benign and 106,912 malicious)”.
  Results: The researchers measure the effectiveness of their system and show it is better at in-sample embedding than other systems such as DeepWalk, LINE, and metapath2vec, and that systems trained with the researchers’ HIN embedding display superior performance to those trained with others. Additionally, their system is better at predicting malicious applications than other, somewhat weaker, baselines.
  Why it matters: Machine learning approaches are going to augment many existing cybersecurity techniques. AiDroid gives us an example of how large platform operators, like Tencent, can create large-scale data generation systems (like the basis AiDroid app) then use that data to conduct research – bringing to mind the question, if this data has such obvious value, why aren’t the users being paid for its use?
  Read more: AiDroid: When Heterogeneous Information Network Marries Deep Neural Network for Real-time Android Malware Detection (Arxiv).

The Winograd Schema Challenge is not as smart as we hope:
…Researchers question robustness of Winograd Schemas for assessing language AIs after breaking the evaluation method with one tweak…
Researchers with McGill University and Microsoft Research Montreal have shown that the Winograd Schema Challenge (WSC) – thought by many to be a gold standard for evaluating the ability of language systems to perform common sense reasoning – is deeply flawed, and that to truly test for general cognitive capabilities researchers need to apply different evaluation criteria when studying performance on the dataset.
  Whining about Winograd: WSC is a dataset of almost three hundred sentences where the language model is tasked with working out which pronoun is being referred to in a given sentence. For example, WSC might challenge a computer to figure out which of the entities in the following sentence is the one going fast: “The delivery truck zoomed by the school bus because it was going so fast”. (The correct answer is that the delivery truck is the one going fast). People have therefore assumed WSC might be a good way to test the cognitive abilities of AI systems.
  Breaking Winograd with one trick: The research shows that if you do one simple thing in WSC you can meaningfully damage the success rate of AI techniques when applied to the dataset. The trick? Switching the order of different entities in sentences. What does this look like in practice? An original sentence in Winograd might be “Emma did not pass the ball to Janie although she saw that she was open”, and the authors might change it to “Janie did not pass the ball to Emma although she saw that she was open”.
  Proposed Evaluation Protocol: Models should first be evaluated against their accuracy score on the original WSC set, then researchers should analyze the accuracy on the switchable subset of WSC (before and after switching the candidates), as well as the accuracy on the associative and non-associative subsets of the dataset. Combined, this evaluation technique should help researchers distinguish models that are robust and general from ones which are brittle and narrow.
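The switching test at the heart of this protocol can be sketched as follows (the sentence format and model interface are illustrative assumptions, not the paper's actual code):

```python
# Candidate-switching robustness check: swap the two candidate entities in a
# Winograd-style sentence and verify the model's answer tracks the entities
# rather than their surface positions.

def switch_candidates(sentence, a, b):
    """Swap every occurrence of the two candidate names."""
    placeholder = "\x00"
    return (sentence.replace(a, placeholder)
                    .replace(b, a)
                    .replace(placeholder, b))

def consistent_under_switch(model, sentence, a, b, answer):
    """A robust model gives `answer` on the original sentence and the
    *other* candidate on the switched one."""
    other = b if answer == a else a
    return (model(sentence, a, b) == answer
            and model(switch_candidates(sentence, a, b), a, b) == other)

sent = "Emma did not pass the ball to Janie although she saw that she was open."
print(switch_candidates(sent, "Emma", "Janie"))
# Janie did not pass the ball to Emma although she saw that she was open.
```

A model that has merely memorized that "Emma" is the answer to the original sentence fails this check, while one that reasons about the entities passes it.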
  Results: The researchers test a language model, an ensemble of ten language models, an ensemble of 14 language models, and a “knowledge hunting method” against the WSC using the new evaluation protocol. “We observe that accuracy is stable across the different subsets for the single LM. However, the performance of the ensembled LMs, which is initially state-of-the-art by a significant margin, falls back to near random on the switched subset.” The tests also show that performance for the language models drops significantly on the non-associative portion of WSC “when information related to the candidates themselves does not give away the answer”, further suggesting a lack of a reasoning capability.
  Why it matters: “Our results indicate that the current state-of-the-art statistical method does not achieve superior performance when the dataset is augmented and subdivided with our switching scheme, and in fact mainly exploits a small subset of highly associative problem instances”. Research like this shows how challenging it is to not just develop machines capable of displaying “common sense”, but how tough it can be to setup the correct sort of measurement schemes to test for this capability in the first place. Ultimately, this research shows that “performing at a state-of-the-art level on the WSC does not necessarily imply strong common-sense reasoning”.
  Read more: On the Evaluation of Common-Sense Reasoning in Natural Language Understanding (Arxiv).
  Read more about the Winograd Schema Challenge here.

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:…

Microsoft president wants rules on face recognition:
Brad Smith, Microsoft’s president, has reiterated his calls for regulation of face recognition technologies at the Web Summit conference in Portugal. In particular, he warned of potential risks to civil liberties from AI-enabled surveillance. He urged societies to decide on the acceptable limits of government intrusion on our privacy, ahead of the widespread proliferation of the technology.
  “Before we wake up and find that the year 2024 looks like the book “1984”, let’s figure out what kind of world we want to create, and what are the safeguards and what are the limitations of both companies and governments for the use of this technology”, he said.
  Earlier this year, Smith made similar calls via a Microsoft blogpost.
Read more: Microsoft’s president says we need to regulate facial recognition (Recode).
Read more: Facial recognition technology: The need for public regulation and corporate responsibility (Microsoft blog).

Machine ethics for self-driving cars via survey:
Researchers asked respondents to decide on a range of ‘trolley problem’-style ethical dilemmas for autonomous vehicles, where vehicles must choose between (e.g.) endangering 1 pedestrian and endangering 2 occupants. Several million subjects were drawn from over 200 countries. The strongest preferences were for saving young lives over old, humans over animals, and more lives over fewer.
  Why this matters: Ethical dilemmas in autonomous driving are unlikely to be the most important decisions we delegate to AI systems. Nonetheless, these are important issues, and we should use them to develop solutions that are scalable to a wider range of decisions. I’m not convinced that we should want machine ethics to mirror widely-held views amongst the public, or that this represents a scalable way of aligning AI systems with human values. Equally, other solutions come up against problems of consent and might increase the possibility of a public backlash.
  Read more: The Moral Machine Experiment (Nature).

Tech Tales:

[2020: Excerpt from an internal McGriddle email describing a recent AI-driven marketing initiative.]

Our ‘corporate insanity promotion’ went very well this month. As a refresher, for this activity we had all external point-of-contact people for the entire McGriddle organization talk in a deliberately odd-sounding ‘crazy’ manner for the month of march. We began by calling all our Burgers “Borblers” and when someone asked us why the official response was “What’s borbling you, pie friend?” And so on. We had a team of 70 copywriters working round the clock on standby generating responses for all our “personalized original sales interactions” (POSIs), augmented by our significant investments in AI to create unique terms at all locations around the world, trained on local slang datasets. Some of the phrase creations are already testing well enough in meme-groups that we’re likely to use them on an ongoing basis. So when you next hear “Borble Topside, I’m Going Loose!” shouted as a catchphrase – you can thank our AIs for that.

Things that inspired this story: the logical next-step in social media marketing, GANs, GAN alchemists like Janelle Shane, the arms race in advertising between normalcy and surprise, conditional text generation systems, Salesforce / CRM systems, memes.   


Import AI 119: How to benefit AI research in Africa; German politician calls for billions in spending to prevent country being left behind; and using deep learning to spot thefts

African AI researchers would like better code switching, maps, to accelerate research:
…The research needs of people in Eastern Africa tell us about some of the ways in which AI development will differ in that part of the world…
Shopping lists contain a lot of information about a person, and I suspect the same might be true of scientific shopping lists that come from a particular part of the world. For that reason a paper from Caltech which outlines requests for machine learning research from members of the East African Tech Scene gives us better context when thinking about the global impact of AI.
  Research needs: Some of the requests include:

  • Support for code-switching within language models; many East Africans rapidly code-switch (move between multiple languages during the same sentence) making support for multiple languages within the same model important.
  • Named Entity Recognition with multiple-use words; many English words are used as names in East Africa, eg “Hope, Wednesday, Silver, Editor”, so it’s important to be able to learn to disambiguate them.
  • Working with contextual cues; many locations in Africa don’t have standard addressing schemes so directions are contextual (eg, my house is the yellow one two miles from the town center) and this is combined with numerous misspellings in written text, so models will need to be able to fuse multiple distinct bits of information to make inferences about things like addresses.
  • Creating new maps in response to updated satellite imagery to help augment coverage of the East African region, accompanied by the deliberate collection of frequent ground-level imagery of the area to account for changing businesses, etc.
  • Due to poor internet infrastructure, spotty cellular service, and the fact “electrical power for devices is scarce”, one of the main types of request is for more efficient systems, such as models that are designed to run on low-powered devices, and for ways to add adaptive learning to processes involving surveying so that researchers can integrate new data on-the-fly to make up for its sparsity.

  Reinforcement learning, what reinforcement learning? “No interviewee reported using any reinforcement learning methods”.
  Why it matters: AI is going to be developed and deployed globally, so becoming more sensitive to the specific needs and interests of parts of the world underrepresented in machine learning should further strengthen the AI research community. It’s also a valuable reminder that many problems which don’t generate much media coverage are where the real work is needed (for instance, supporting code-switching in language models).
  Read more: Some Requests for Machine Learning Research from the East African Tech Scene (Arxiv).

DeepMap nets $60 million for self-driving car maps:
…Mapping startup raises money to sell picks and shovels for another resource grab…
A team of mapmakers who previously worked on self-driving-related efforts at Google, Apple, and Baidu, have raised $60 million for DeepMap, in a Series B round. One notable VC participant: Generation Investment Management, a VC firm which includes former vice president Al Gore as a founder. “DeepMap and Generation share the deeply-held belief that autonomous vehicles will lead to environmental and social benefits,” said DeepMap’s CEO, James Wu, in a statement.
  Why it matters: If self-driving cars are, at least initially, not winner-take-all-markets, then there’s significant money to be made for companies able to create and sell technology which enables new entrants into the market. Funding for companies like DeepMap is a sign that VCs think such a market could exist, suggesting that self-driving cars continue to be a competitive market for new entrants.
  Read more: DeepMap, a maker of HD maps for self-driving cars, raised at least $60 million at a $450 million valuation (Techcrunch).

Spotting thefts and suspicious objects with machine learning:
…Applying deep learning to lost object detection: promising, but not yet practical…
New research from the University of Twente, Leibniz University, and Zhejiang University shows both the possibility and limitations of today’s deep learning techniques applied to surveillance. The researchers attempt to train AI systems to detect abandoned objects in public places (eg, offices) and try to work out if these objects have been abandoned, moved by someone who isn’t the owner, or are being stolen.
  How does it work: The system takes in video footage and compares the footage against a continuously learned ‘background model’ so it can identify new objects in a scene as they appear, while automatically tagging these objects with one of three potential states: “if a object presents in the long-term foreground but not in the short-term foreground, it is static. If it presents in both foreground masks, it is moving. If an object has ever presented in the foregrounds but disappears from both of the foregrounds later, it means that it is in static for a very long time.” The system then links these objects with human owners by identifying the people that spend the largest amount of time with them, then they track these people, while trying to guess at whether the object is being abandoned, has been temporarily left by its owner, or is being stolen.
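The three-state rule quoted above can be sketched directly (the boolean interface is hypothetical; the real system derives these flags from learned background-subtraction masks):

```python
# An object's state, derived from its presence in the short-term and
# long-term foreground masks of the continuously learned background model.

def object_state(in_long_term, in_short_term, seen_before):
    if in_long_term and in_short_term:
        return "moving"                       # present in both masks
    if in_long_term:
        return "static"                       # long-term only: e.g. a bag set down
    if seen_before:
        return "static for a very long time"  # vanished from both foregrounds
    return "unseen"

print(object_state(in_long_term=True, in_short_term=False, seen_before=True))
# static
```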
  Results: They evaluate the system on the PETS2006 benchmark, as well as on the more challenging new SERD dataset which is composed of videos taken from four different scenes of college campuses. The model outlined in the paper gets top scores on PETS2006, but does poorly on the more modern SERD dataset, obtaining accuracies of 50% when assessing if an object is moved by a non-owner, though it does better at detecting objects being stolen or being abandoned. “The algorithm for object detection cannot provide satisfied performance,” they write. “Sometimes it detects objects which don’t exist and cannot detect the objects of interest precisely. A better object detection method would boost the framework’s performance.”  More research will be necessary to develop models that excel here, or potentially to improve performance via accessing large datasets to use during pre-training.
  Why it matters: Papers like this highlight the sorts of environments in which deep learning techniques are likely to be deployed, though also suggest that today’s models are still inefficient for some real-world use cases (my suspicion here is that if the SERD dataset was substantially larger we may have seen performance increase further).
  Read more: Security Event Recognition for Visual Surveillance (Arxiv).

Facebook uses modified DQN to improve notification sending on FB:
…Here’s another real-world use case for reinforcement learning…
I’ve recently noticed an increase in the number of Facebook recommendations I receive and a related rise in the number of time-relevant suggestions for things like events and parties. Now, research published by Facebook indicates why that might be: the company has recently used an AI platform called ‘Horizon’ to improve and automate aspects of how it uses notifications to tempt people to use its platform.
  Horizon is an internal software platform that Facebook uses to deploy AI onto real-world systems. Horizon’s job is to let people train and validate reinforcement learning models at Facebook, analyze their performance, and run them at large-scale. Horizon also includes a feature called Counterfactual Policy Evaluation, which makes it possible to evaluate the estimated performance of models before deploying them into production. Horizon also incorporates the implementations of the following algorithms: Discrete DQN, Parametric DQN, and DDPG (which is sometimes used for tuning hyperparameters within other domains).
  Scale: “Horizon has functionality to conduct training on many GPUs distributed over numerous machines… even for problems with very high dimensional feature sets (hundreds or thousands of features) and millions of training examples, we are able to learn models in a few hours”, they write.
  RL! What is it good for? Facebook says it recently moved from a supervised learning model that predicted click-through rates on notifications, to “a new policy that uses Horizon to train a Discrete-Action DQN model for sending push notifications”. This system tailors the selection and sending of notifications to individual users based on their implicit preferences, expressed by their interaction with the notifications and learned via incremental RL updates. “We observed a significant improvement in activity and meaningful interactions by deploying an RL based policy for certain types of notifications, replacing the previous system based on supervised learning”, Facebook writes. They also conducted a similar experiment based on giving notifications to administrators of Facebook pages. “After deploying the DQN model, we were able to improve daily, weekly, and monthly metrics without sacrificing notification quality,” they write.
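For a flavor of the approach, here is deliberately simplified tabular Q-learning over two actions, a toy stand-in for Horizon's Discrete-Action DQN (all states and rewards here are invented):

```python
import random

# Toy send/drop notification policy learned with tabular Q-learning – a
# simplified stand-in for the Discrete-Action DQN described above.
ACTIONS = ["send", "drop"]

def update_q(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """One Q-learning update: move Q(s,a) toward reward + discounted max Q."""
    best_next = max(q.get((next_state, a), 0.0) for a in ACTIONS)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

def policy(q, state):
    """Greedy action for a state."""
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

random.seed(0)
q = {}
for _ in range(500):
    state = random.choice(["event_invite", "generic_update"])
    action = random.choice(ACTIONS)
    # Invented reward signal: this user only engages with event invites.
    reward = 1.0 if (state == "event_invite" and action == "send") else 0.0
    update_q(q, state, action, reward, next_state=state)

print(policy(q, "event_invite"))  # send
```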
  Why it matters: This is an example of how a relatively simple RL system (Discrete DQN) can yield significant gains against hard-to-specify business metrics (eg, “meaningful interactions”). It also shows how large web platforms can use AI to iteratively improve their ability to target individual users while increasing their ability to predict user behavior and preferences over longer time horizons – think of it as a sort of ever-increasing ‘data&compute dividend’.
  Read more: Horizon: Facebook’s Open Source Applied Reinforcement Learning Platform (Facebook Research).

German politician calls for billions of dollars for national AI strategy:
…If Germany doesn’t invest boldly enough, it risks falling behind…
Lars Klingbeil, general secretary of the Social Democratic Party in Germany, has called for the country to invest significantly in its own AI efforts. “We need a concrete investment strategy for AI that is backed by a sum in the billions,” wrote Klingbeil in an article for Tagesspiegel. “We have to stop taking it easy”.
  Why it matters: AI has quickly taken on a huge amount of symbolic political power, with politicians typically treating success in AI as being a direct sign of the competitiveness of a country’s technology industry; comments like this from the SPD reinforce that image, and are likely to incentivize other politicians to talk about it in a similar way, further elevating the role AI plays in the discourse.
  Read more: Germany needs to commit billions to artificial intelligence: SPD (Reuters).

Faking faces for fun with AI:
…”If we can generate realistic looking faces of any type, what are the implications for our ability to trust in what we see”…
One of the continued open questions around the weaponization of fake imagery is how easy the technology needs to become for it to be economically sensible for people to weaponize it (eg, through making faked images of politicians in specific politically-sensitive situations). New work by an independent researcher gives us an indication of where things stand today. The good news: it’s still too hard for us to worry about many actors abusing the technology. The bad news: all of this stuff is getting cheaper to build and easier to operate over time.
  How it works: Shaobo Guan’s research shows how to build a conditional image generation system. The way this works is you can ask your computer to synthesize a random face for you, then you can tweak a bunch of dials to let you change latent variables from which the image is composed, allowing you to manipulate, for instance, the spacing apart of a “person’s” eyes, the coloring of their hair, the size of their sideburns, whether they are wearing glasses, and so on. Think of this as like a combination of an etch-a-sketch, a Police facial composite machine, and an insanely powerful Photoshop filter.
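The “dial” metaphor corresponds to a standard trick in conditional generation: moving a latent vector along a learned attribute direction. A minimal sketch (vectors here are toy; a real generator would map the latent code to an image, and the attribute direction would be learned from labeled samples):

```python
# Tweaking a latent code: move z along an attribute direction to change
# one property (e.g. 'glasses') of the generated output.

def tweak(z, direction, amount):
    """Move latent code z along an attribute direction by `amount`."""
    return [zi + amount * di for zi, di in zip(z, direction)]

z = [0.1, -0.3, 0.7]           # random latent code
glasses_dir = [0.0, 1.0, 0.0]  # hypothetical learned 'glasses' direction
print(tweak(z, glasses_dir, 2.0))  # roughly [0.1, 1.7, 0.7]
```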
  “A word about ethics”: The blog post is notable for its inclusion of a section that specifically considers the ethical aspects of this work in two ways: 1) because the underlying dataset for the generative tool is limited then if such a tool were put into production it wouldn’t be very representative; 2) “If we can generate realistic looking faces of any type, what are the implications for our ability to trust in what we see”? It’s encouraging to see these acknowledgements in a work like this.
  Why it matters: Posts like this give us a valuable point-in-time sense of what a motivated researcher is able to build with relatively small amounts of resources (the project was done over three weeks as part of an Insight Data Science ‘AI fellow program’). They also help us understand the general difficulties people face when working with generative models.
  Read more: Generating custom photo-realistic faces using AI (Insight Data Science).

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:…

EU AI ethics chief urges caution on regulation:
The chairman of the EU’s new expert body on AI, Pekka Ala-Pietilä, has cautioned against premature regulation, arguing Europe should be focussed now on developing “broad horizontal principles” for ethical uses of AI. He foresees regulations on AI as taking shape as the technology is deployed, and as courts react to emergent issues, rather than ex ante. The high-level expert group on AI plans to produce a set of draft ethical principles in March, followed by a policy and investment strategy.
  Why this matters:  This provides some initial indications of Europe’s AI strategy, which appears to be focussed partly on establishing leadership in the ethics of AI. The potential risks from premature and ill-judged interventions in such a fast-moving field seem high. This cautious attitude is probably a good thing, particularly given Europe’s proclivity towards regulation. Nonetheless, policy-makers should be prepared to react swiftly to emergent issues.
  (Note from Jack: It also fits a pattern common in Europe of trying to regulate for the effects of technologies developed elsewhere – for example, GDPR was in many ways an attempt to craft rules to apply controls to non-European mega platforms like Google and Facebook).
  Read more: Europe’s AI ethics chief: No rules yet, please.

Microsoft will bid on Pentagon AI contract:
Microsoft has reaffirmed its intention to pursue a major contract with the US Department of Defense. The company’s bid on the $10bn cloud-computing project, codenamed JEDI, had prompted some protest from employees. In a blog post, the company said it would “engage proactively” in the discussion around laws and policies to ensure AI is used ethically, and argued that to withdraw from the market (for example, for US military contracts) would reduce the opportunity to engage in these debates in the future. Google withdrew its JEDI bid on the project earlier this year, after significant backlash from employees (though the real reason for the pull out could be that Google lacked all the gov-required data security certifications necessary to field a competitive bid).
  Read more: Technology and the US military (Microsoft).
  Read more: Microsoft Will Sell Pentagon AI (NYT).

Assumptions in ML approaches to AI safety:
Most of the recent growth in AI safety has been in ML-based approaches, which look at safety problems in relation to current, ML-based, systems. The usefulness of this work will depend strongly on the type of advanced AI systems we end up with, writes DeepMind AI safety researcher Victoria Krakovna.
  Consider the transition from horse-carts to cars. Some of the important interventions in horse-cart safety, such as designing roads to avoid collisions, scaled up to cars. Others, like systems to dispose of horse-waste, did not. Equally, there are issues in car safety, e.g. air pollution, that someone thinking about horse-cart safety could not have foreseen. In the case of ML safety, we should ask what assumptions we are making about future AI systems, how much we are relying on them, and how likely they are to hold up. The post outlines the author's opinions on a few of these key assumptions.
  Read more: ML approach to AI safety (Victoria Krakovna).

Baidu joins Partnership on AI:
Chinese tech giant Baidu has become the first Chinese member of the Partnership on AI. The Partnership is a consortium of AI leaders, which includes all the major US players, focussed on developing ethical best practices in AI.
  Read more: Introducing Our First Chinese Member (Partnership on AI).

Tech Tales:

Generative Adversarial Comedy (CAN!)

[2029: The LinePunch, a “robot comedy club” started in 2022 in the southeastern corner of The Muddy Charles, a pub tucked inside a building near the MIT Media Lab in Cambridge, Massachusetts]

Two robot comedians are standing on stage at The LinePunch and, as usual, they’re bombing.

“My Face has no nose, how does it smell?” says one of the robots. Then it looks at the crowd, pauses for two seconds, and says: “It smells using its face!”
  The robot opens its hands, as though beckoning for applause.
  “You suck!” jeers one of the humans.
  “Give them a chance,” says someone else.
  The robot that had told the nose joke bows its head and hands the microphone to the robot standing next to it.
  “OK, ladies and germ-till-men,” says the second robot, “why did the Chicken move across the road?”
  “To get uploaded into the matrix!” says one of the spectating humans.
  “Ha-Ha!” says the robot. “That is incorrect. The correct answer is: to follow its friend.”
  A couple of people in the audience chuckle.
  “Warm crowd!” says the robot. “Great joke! Next joke: three robots walk into a bar. The barman says ‘Get out, you need to come in sequentially!’”
  “Boo,” says one of the humans in the audience.
  The robot tilts its head, as though listening, then prepares to tell another joke…

The above scene will happen on the third Tuesday of every month for as long as MIT lets its students run The LinePunch. I’d like to tell you the jokes have gotten better since its founding, but in truth they’ve only gotten stranger. That’s because robots that tell jokes which seem like human jokes aren’t funny (in fact, they freak people out!), so what the bots end up doing at The LinePunch is a kind of performative robot theater, where the jokes are deliberately different to those a human would tell – learned via a complex array of inverted feature maps – but funny to the humans nonetheless – learned via human feedback techniques. One day I’m sure the robots will learn to tell jokes to amuse each other as well.

Things that inspired this story: Drinks in The Muddy Charles @ MIT; synthetic text generation techniques; recurrent neural networks; GANs; performance art; jokes; learning from human preferences.

Import AI 118: AirBnB splices neural net into its search engine; simulating robots that touch with UnrealROX; and how long it takes to build a quadcopter from scratch

Building a quadcopter from scratch in ten weeks:
Modeling the drone ecosystem by what it takes to build one…
The University of California at San Diego recently ran a course where students got the chance to design, build, and program their own drones. A paper writing up the course outlines how it is structured and gives us a sense of what it takes to build a drone today.
  Four easy pieces: The course breaks building the drones into four phases: designing the PCB, implementing the flight control software, assembling the PCB, and getting the quadcopter flying. Each of these phases has numerous discrete steps which are detailed in the report. One of the nice things about the curriculum is the focus on the cost of errors: “Students ‘pay’ for design reviews (by course staff or QuadLint) with points deducted from their lab grade,” they write. “This incentivizes them to find and fix problems themselves by inspection rather than relying on QuadLint or the staff”.
  The surprising difficulty of drone software: Building the flight controller software proves to be one of the most challenging parts of the course, because bugs can have numerous potential causes, which makes root cause analysis difficult.
  Teaching tools: While developing the course the instructors noticed that they were spending a lot of time checking and evaluating PCB designs for correctness, so they designed their own program called ‘QuadLint’ to try to auto-analyze and grade these submissions. “QuadLint is, we believe, the first autograder that checks specific design requirements for PCB designs,” they write.
  Costs: The report includes some interesting details on the cost of these low-powered drones, with the quadcopter itself costing about $35 per PCB plus $40 for the components. Currently, the most expensive component of the course is the remote ($150) and for the next course the teachers are evaluating cheaper options.
  Small scale: The quadcopters all use a PCB to host their electronics and serve as an airframe. They measure less than 10 cm on a side and are suitable for flight indoors over short distances. “The motors are moderately powerful, “brushed” electric motors powered by a small lithium-polymer (LiPo) battery, and we use small, plastic propellers. The quadcopters are easy to operate safely, and a blow from the propeller at full speed is painful but not particularly dangerous. Students wear eye protection around their flying quadcopters.”
  Why it matters: The paper notes that the ‘killer apps’ of the future “will lie at the intersection of hardware, software, sensing, robotics, and/or wireless communications”. This seems true – especially when we look at the chance for major uptake from the success of companies like DJI and the possibility of unit economics driving the price down. Therefore, tracking and measuring the cost and ease with which people can build and assemble drones out of (hard to track, commodity) components gives us better intuitions about this aspect of drones+security. While the hardware and software are under-powered and somewhat pricey today, they won’t stay that way for long.
  Read more: Trial by Flyer: Building Quadcopters From Scratch in a Ten-Week Capstone Course (Arxiv).

Amazon tries to make Alexa smarter via richer conversational data:
…Who needs AI breakthroughs when you’ve got a BiLSTM, lots of data, and patience?…
Amazon researchers are trying to give personal assistants like Alexa the ability to have long-term conversations about specific topics. The (rather unsurprising) finding they make in a new research paper is that you can “extend previous work on neural topic classification and unsupervised topic keyword detection by incorporating conversational context and dialog act features”, yielding personal assistants capable of longer and more coherent conversations than their forebears – if you can afford to annotate the data.
  Data used: The researchers used data collected during the 2017 ‘Alexa Prize’ competition, which consists of over 100,000 utterances containing interactions between users and chatbots. They augmented this data by classifying the topic for each utterance into one of 12 categories (e.g. politics, fashion, science & technology), and also trying to classify the goal of the user or chatbot (e.g. clarification, information request, topic switch). They also asked other annotators to rank every single chatbot response with metrics relating to how comprehensible it was, how relevant the response was, how interesting it was, and whether a user might want to continue the conversation with the bot.
  Baselines and BiLSTMs: The researchers implement two baselines (DAN, based on a bag-of-words neural model; ADAN, which is DAN extended with attention), and then develop two versions of a bidirectional LSTM (BiLSTM) system, where one uses context from the annotated dataset and the other doesn’t. They then evaluate all these methods by testing their baselines (which see only the current utterance) against systems which incorporate context, systems which incorporate dialog act features, and systems which incorporate both. The results show that a BiLSTM fed with context in sequence does almost twice as well as a baseline ADAN system that uses context and dialog, and almost 25% better than a DAN fed with both context and dialog.
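  As a rough illustration of the simplest of these baselines, a DAN-style bag-of-words classifier that prepends prior turns as context might look like the sketch below (vocabulary size, embedding width, and the random untrained weights are all illustrative – the paper's models are trained on the annotated Alexa Prize data):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, EMBED, TOPICS = 1000, 16, 12  # illustrative sizes, not from the paper

# Illustrative random parameters; a real system learns these from labeled dialogs
embeddings = rng.normal(size=(VOCAB, EMBED))
W = rng.normal(size=(EMBED, TOPICS))

def dan_topic_logits(context_ids, utterance_ids):
    """DAN-style baseline: average the word embeddings of the current
    utterance, prepending prior turns as extra context tokens."""
    ids = np.concatenate([context_ids, utterance_ids])
    avg = embeddings[ids].mean(axis=0)  # bag-of-words average
    return avg @ W                      # project to 12 topic scores

logits = dan_topic_logits(np.array([1, 5, 7]), np.array([42, 99]))
print(logits.shape)  # (12,): one score per topic category
```

A BiLSTM variant replaces the averaging step with a recurrent pass over the token sequence, which is what lets it exploit word order in the context.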
  Why it matters: The results indicate that – if a developer can afford the labeling cost – it’s possible to augment language interaction datasets with additional information about context and topic to create more powerful systems, which seems to imply that in the language space we can expect to see large companies invest in teams of people to not just transcribe and label text at a basic level, but also perform more elaborate meta-classifications as well. The industrialization of deep learning continues!
  Read more: Contextual Topic Modeling For Dialog Systems (Arxiv).

Why AI won’t be efficiently solving a 2D gridworld quest soon:
…Want humans to be able to train AIs? The key is curriculum learning and interactive learning, say BabyAI’s creators…
Researchers with the Montreal Institute for Learning Algorithms (MILA) have designed a free tool called BabyAI to let them test AI systems’ ability to learn generalizable skills from curriculums of tasks set in an efficient 2D gridworld environment – and the results show that today’s AI algorithms display poor data efficiency and generalization at this sort of task.
  Data efficiency: BabyAI uses gridworlds for its environment, which the researchers have written to be efficient enough that researchers can use the platform without needing access to vast pools of compute; the BabyAI environments can be run at up to 3,000 frames per second “on a modern multi-core laptop” and can also be integrated with OpenAI Gym.
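  To see why a gridworld can hit thousands of frames per second, consider a toy environment in the BabyAI spirit (this is a hypothetical sketch, not the real BabyAI API): state is a pair of coordinates, and a step is a few integer operations.

```python
import time

class MiniGrid:
    """Toy gridworld: tiny state, cheap transitions, so a step costs
    microseconds. Real BabyAI levels add objects, language goals, etc."""
    def __init__(self, size=8):
        self.size = size
        self.reset()

    def reset(self):
        self.agent = (0, 0)
        self.goal = (self.size - 1, self.size - 1)
        return self.agent

    def step(self, action):  # 0 = move right, anything else = move down
        x, y = self.agent
        if action == 0:
            x = min(x + 1, self.size - 1)
        else:
            y = min(y + 1, self.size - 1)
        self.agent = (x, y)
        done = self.agent == self.goal
        return self.agent, (1.0 if done else 0.0), done

env = MiniGrid()
start = time.perf_counter()
steps = 0
for _ in range(100):          # 100 episodes of alternating right/down moves
    env.reset()
    done = False
    while not done:
        _, _, done = env.step(steps % 2)
        steps += 1
elapsed = time.perf_counter() - start
print(f"{steps} steps in {elapsed:.4f}s")
```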
  A specific language: BabyAI uses “a comparatively small yet combinatorially rich subset of English” called Baby Language. This is meant to help researchers write increasingly sophisticated strings of instructions for agents, while keeping the state space from exploding too quickly.
  Levels as a curriculum: BabyAI ships with 19 levels which increase in difficulty of both the environment and the complexity of the language required to solve it. The levels test each agent on 13 different competencies, ranging from unlocking doors, to navigating to locations, to ignoring distractors placed into the environment, to navigating mazes. The researchers also design a bot which can solve any of the levels using a variety of heuristics – this bot serves as an expert whose demonstrations can be used to train models.
  So, are today’s AI techniques sophisticated enough to solve BabyAI? The researchers train an imitation learning-based baseline for each level and assess how well it does. The systems are able to learn to perform basic tasks, but struggle to imitate the expert at tasks that require multiple actions to solve. One of the most intriguing parts of the paper is the analysis of the relative efficiency of systems trained via both imitation and pure reinforcement learning, which shows that today’s algorithms are wildly inefficient at learning pretty much anything: simple tasks like learning to go to a red ball hidden within a map take 40,000-60,000 demonstrations when using imitation learning, and around 453,000 to 470,000 episodes when learning via reinforcement learning without an expert teacher to mimic. The researchers also show that pre-training (where you learn on other tasks before attempting certain levels) does not yield particularly impressive performance, giving at most a 3X speedup.
  Why it matters: Platforms like BabyAI give AI researchers fast, efficient tools to use when tackling hard research projects, while also highlighting the deficiency of many of today’s algorithms. The transfer learning results “suggest that current imitation learning and reinforcement learning methods scale and generalize poorly when it comes to learning tasks with a compositional structure,” they write. “An obvious direction of future research is to find strategies to improve data efficiency of language learning.”
  Get the code for BabyAI (GitHub).
  Read more: BabyAI: First Steps Towards Grounded Language Learning with a Human In the Loop (Arxiv).

Simulating robots that touch and see in AAA-game quality detail:
…The new question AI researchers will ask: But Can It Simulate Crysis?…
Researchers with the 3D Perception Lab at the University of Alicante have designed UnrealROX, a high-fidelity simulator based on Unreal Engine 4, built for simulating and training AI agents embodied in (simulated) touch-sensitive robots.
  Key ingredients: UnrealROX has the following main ingredients: a simulated grasping system that can be applied to a variety of finger configurations; routines for controlling robotic hands and bodies using commercial VR setups like the Oculus Rift and HTC Vive; a recorder to store full sequences from scenes; and customizable camera locations.
  Drawback: The overall simulator can run at 90 frames-per-second, the researchers note. While this may sound impressive it’s not particularly useful for most AI research unless you can run it far faster than that (compare this with BabyAI, which runs at 3,000 FPS).
  Simulated robots with simulated hands: UnrealROX ships with support for two robots: a simulated ‘Pepper’ robot from the company Aldebaran, and a spruced-up version of the mannequin that ships with UE4. Both of these robots have been designed with extensible, customizable grasping systems, letting them reach out and interact with the world around them. “The main idea of our grasping subsystem consists in manipulating and interacting with different objects, regardless of their geometry and pose.”
  Simulators, what are they good for? UnrealROX may be of particular interest to researchers that need to create and record very specific sequences of behaviors on robots, or who wish to test the ability to learn useful policies from a relatively small amount of high-fidelity information. But it seems likely that the relative slowness of the simulator will make it difficult to use for most AI research.
  Why it matters: The current proliferation of simulated environments represents a kind of simulation-boom in AI research that will eventually produce a cool historical archive of the many ways in which we might think robots could interact with each other and the world. Whether UnrealROX is used or not, it will contribute to this historical archive.
  Read more: UnrealROX: An eXtremely Photorealistic Virtual Reality Environment for Robotics Simulations and Synthetic Data Generation (Arxiv).

AirBnB augments main search engine with neural net, sees significant performance increase:
…The Industrialization of Deep Learning continues…
Researchers with home/apartment-rental service Airbnb have published details on how they transitioned the company’s main listings search engine to a neural network-based system. The paper highlights how deploying AI systems in production is different to deploying AI systems in research. It also sees Airbnb follow Google, which in 2015 augmented its search engine with ‘RankBrain’, a neural network-based system that almost overnight became one of the most significant factors in selecting which search results to display to a user. “This paper is targeted towards teams that have a machine learning system in place and are starting to think about neural networks (NNs),” the researchers write.
  Motivation: “The very first implementation of search ranking was a manually crafted scoring function. Replacing the manual scoring function with a gradient boosted decision tree (GBDT) model gave one of the largest step improvements in homes bookings in Airbnb’s history,” the researchers write. This performance boost eventually plateaued, prompting them to implement neural network-based approaches to improve search further.
  Keep it simple, (& stupid): One of the secrets about AI research is the gulf between frontier research and production use-cases, where researchers tend to prioritize novel approaches that work on small tasks, and industry and/or large-scale operators prioritize simple techniques that scale well. This fact is reflected in this research, where the researchers started work with a single layer neural net model, moved on to a more sophisticated system, then opted for a scale-up solution as their final product. “We were able to deprecate all that complexity by simply scaling the training data 10x and moving to a DNN with 2 hidden layers.”
  Input features: For typical configurations of the network the researchers gave it 195 distinct input ‘features’ to learn about, which included properties of listings like price, amenities, historical booking count; as well as features from other smaller models.
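  Structurally, the final model is simple: 195 features in, two hidden layers, one relevance score out, and listings ranked by that score. A sketch of that shape (the layer widths, ReLU activations, and random weights here are illustrative; the real network is trained on booking outcomes):

```python
import numpy as np

rng = np.random.default_rng(1)
N_FEATURES = 195  # listing price, amenities, historical booking count, etc.

# Illustrative random weights; widths are made up for the sketch
W1 = rng.normal(scale=0.1, size=(N_FEATURES, 128))
W2 = rng.normal(scale=0.1, size=(128, 64))
w_out = rng.normal(scale=0.1, size=(64, 1))

def score_listing(features):
    """Two-hidden-layer scorer in the shape the paper describes."""
    h1 = np.maximum(features @ W1, 0.0)  # hidden layer 1 (ReLU)
    h2 = np.maximum(h1 @ W2, 0.0)        # hidden layer 2 (ReLU)
    return (h2 @ w_out).item()           # scalar relevance score

# Rank a batch of candidate listings by model score, highest first
listings = rng.normal(size=(5, N_FEATURES))
ranked = sorted(range(5), key=lambda i: -score_listing(listings[i]))
print(ranked)
```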
  Failure: The paper includes a quite comprehensive list of some of the ways in which the Airbnb researchers failed when trying to implement new neural network systems. Many of these failures are due to things like overfitting, or trying to architect too much complexity into certain parts of the system.
  Results: Airbnb doesn’t reveal the specific quantitative performance boost, as this would leak proprietary commercial information, but does include a couple of graphs that show that the usage of the simple 2-layer neural network leads to a very meaningful relative gain in the number of bookings made using the system, indicating that the neural net-infused search is presenting people with more relevant listings which they are more likely to book. “Overall, this represents one of the most impactful applications of machine learning at Airbnb,” they write.
  Why it matters: AirBNB’s adoption of deep learning for its main search engine further indicates that deep learning is well into its industrialization phase, where large companies adopt the technology and integrate it into their most important products. Every time we get a paper like this the chance of an ‘AI Winter’ decreases, as it creates another highly motivated commercial actor that will continue to invest in AI research and development, regardless of trends in government and/or defence funding.
  Read more: Applying Deep Learning to AirBNB Search (Arxiv).
  Read more: Google Turning Its Lucrative Web Search Over to AI Machines (Bloomberg News, 2015).

Refining low-quality web data with CurriculumNet:
…AI startup shows how to turn bad data into good data, with a multi-stage weakly supervised training scheme…
Researchers with Chinese computer vision startup Malong have released code and data for CurriculumNet, a technique to train deep neural networks on large amounts of data with variable annotations, collected from the internet. Approaches like this are useful if researchers don’t have access to a large, perfectly labeled dataset for their specific task. But the tradeoff is that the labels on datasets gathered in this way are far noisier than those from hand-built datasets, presenting researchers with the challenge of extracting enough signal from the noise to be able to train a useful network.
  CurriculumNet: The researchers train their system on the WebVision database, which contains over 2,400,000 images with noisy labels. Their approach works by training an Inception_v2 model over the whole dataset, then studying the feature space which all the images are mapped into; CurriculumNet sorts these images into clusters, then sorts each cluster into three subsets according to how similar the images in the set are to each other in feature space, with the intuition being that subsets with lots of similar images will be easier to learn from than those which are very diverse. They then train a model over this data, starting with the subsets containing similar image features, then mixing in the noisier subsets. By iteratively learning a classifier from good labels, then adding in noisier ones, the researchers say they are able to increase the generalization of their trained systems.
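  The ordering step can be caricatured as follows – a sketch that replaces the paper's density-based clustering with a much cruder 'distance to centroid' score, purely to show the easy-to-hard split over feature vectors:

```python
import numpy as np

rng = np.random.default_rng(2)

def curriculum_subsets(features, n_subsets=3):
    """Sketch of CurriculumNet's ordering idea: score each sample by how
    typical it looks in feature space, then split the sorted samples into
    clean -> noisy subsets. (The paper uses a density measure over clusters;
    here the score is just negative distance to the centroid.)"""
    centroid = features.mean(axis=0)
    typicality = -np.linalg.norm(features - centroid, axis=1)
    order = np.argsort(-typicality)            # most typical samples first
    return np.array_split(order, n_subsets)    # easy, medium, hard indices

features = rng.normal(size=(90, 8))            # 90 samples, 8-dim features
easy, medium, hard = curriculum_subsets(features)
print(len(easy), len(medium), len(hard))  # 30 30 30
```

Training would then proceed on `easy` first, progressively mixing in `medium` and `hard`.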
  Testing: They test CurriculumNet on four benchmarks: WebVision, ImageNet, Clothing1M, and Food101. They find that systems trained using the largest amount of noisy data converge to higher accuracies than those trained without, seeing reductions in error of multiple percentage points on WebVision (“these improvements are significant on such a large-scale challenge,” they write). CurriculumNet gets state-of-the-art results for top-1 accuracy on WebVision, with performance increasing even further when they train on more data (such as combining ImageNet and WebVision).
  Why it matters: Systems like CurriculumNet show how researchers can use poorly-labeled data, combined with clever training ideas, to increase the value of lower-quality data. Approaches like this can be viewed as analogous to a clever refinement process applied when extracting a natural resource.
  Read more: CurriculumNet: Weakly Supervised Learning from Large-Scale Web Images (Arxiv).
  Get the trained models from Malong’s Github page.

Tech Tales:

[2025: Podcast interview with the inventor of GFY]

Reality Bites, So Change It.
Or: There Can Be Hope For Those of Us Who Were Alone And Those We Left Behind

My Father was struck by a truck and killed while riding his motorbike in the countryside; no cameras, no witnesses; he was alone. There was an investigation but no one was ever caught. So it goes.

At the funeral I told stories about the greatness of my Father and I helped people laugh and I helped people cry. But I could not help myself because I could not see his death. It was as though he opened a door and disappeared before walking through it and the door never closed again; a hole in the world.

I knew many people who had lost friends and parents to cancer or other illnesses and their stories were quite horrifying: black vomit before the end; skeletons with the faces of parents; tales of seeing a dying person when they didn’t know they were being watched and seeing rage and fear and anguish on their face. The retellings of so many bad jokes about not needing to pay electricity bills, wheezed out over hospital food.

I envied these people, because they all had a “goodbye story” – that last moment of connection. They had the moment when they held a hand, or stared at a chest as it heaved in one last breath, or confessed a great secret before the chance was gone. Even if they weren’t there at the last they had known it was coming.

I did not have my goodbye, or the foreshadowing of one. Imagine that.

So that is why I built Goodbye For You(™), or GFY for short. GFY is software that lets you simulate and spend the last few moments with a loved one. It requires data and effort and huge amounts of patience… but it works. And as AI technology improves, so does the ease of use and fidelity of GFY.

Of course, it is not quite real. There are artifacts: improbable flocks of birds, or leaves that don’t fall quite correctly, or bodies that don’t seem entirely correct. But the essence is there: With enough patience and enough of a record of the deceased, GFY can let you reconstruct their last moment, put on a virtual reality haptic-feedback suit, and step into it.

You can speak with them… at the end. You can touch them and they can touch you. We’re adding smell soon.

I believe it has helped people. Let me try to explain how it worked the first time, all those years ago.

I was able to see the truck hit his bike. I saw his body fly through the air. I heard him say “oh no” the second after impact as he was catapulted off his bike and towards the side of the road. I heard his ribs break as he landed. I saw him crying and bleeding. I was able to approach his body. He was still breathing. I got on my knees and bent over him and I cried and the VR-helmet saw my tears in reality and simulated these tears falling onto his chest – and he appeared to see them, then looked up at me and smiled.
   He touched my face and said “my child” and then he died.

Now I have that memory and I carry it in my heart as a candle to warm my soul. After I experienced this first GFY my dreams changed. It felt as though I had found a way to see him open the door – and leave. And then the door shut.

Grief is time times memory times the rejuvenation of closure: of a sense of things that were once so raw being healed and knitted back together. If you make the memory have closure things seem to heal faster.

Yes, I am still so angry. But when I sleep now I sometimes dream of that memory, and in my imagination we say other things, and in this way continue to talk softly through the years.

Things that inspired this story: The as-yet-untapped therapeutic opportunities afforded by synthetic media generation (especially high-fidelity conditional video); GAN progression from 2014 to 2018; compute growth both observed and expected for the next few years; Ander Monson’s story “I am getting comfortable with my grief”.

Import AI: 117: Surveillance search engines; harvesting real-world road data with hovering drones; and improving language with unsupervised pre-training

Chinese researchers pursue state-of-the-art lip-reading with massive dataset:
…What do I spy with my camera eyes? Lips moving! Now I can figure out what you are saying…
Researchers with the Chinese Academy of Sciences and Huazhong University of Science and Technology have created a new dataset and benchmark for “lip-reading in the wild” for Mandarin. Lip-reading is a new sensory capability to imbue AI systems with. For instance, lip-reading systems can be used for “aids for hearing-impaired persons, analysis of silent movies, liveness verification in video authentication systems, and so on”, the researchers write.
  Dataset details: The lipreading dataset contains 745,187 distinct samples from more than 2,000 speakers, grouped into 1,000 classes, where each class corresponds to the syllable of a Mandarin word composed of one or several Chinese characters. “To the best of our knowledge, this database is currently the largest word-level lipreading dataset and the only public large-scale Mandarin lipreading dataset”, the researchers write. The dataset has also been designed to be diverse, so the footage in it consists of multiple different people filmed from multiple different camera angles, along with perspectives taken from television broadcasts. This diversity makes the benchmark more closely approximate real-world situations, whereas previous work in this domain has involved footage taken from a fixed perspective. They built the dataset by annotating Chinese television using a service provided by iFLYTEK, a Chinese speech recognition company.
  Baseline results: They train three baselines on this dataset – a fully 2D CNN, a fully 3D CNN (modeled on LipNet, research from DeepMind and Google covered in Import AI #104), and a model that mixes 2D and 3D convolutional layers. All of these approaches perform poorly on the new dataset, despite having obtained accuracies as high as 90% on other, more restricted datasets. The researchers implement their models in PyTorch and train them on servers containing four Titan X GPUs with 12GB of memory each. The resulting top-5 accuracy results for the baselines on the new Chinese dataset LRW-1000 are as follows:
– LSTM-5: 48.74%
– D3D: 59.80%
– 3D+2D: 63.50%
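  The top-5 numbers above count a clip as correct if the true word class appears among the model's five highest-scoring classes; a minimal implementation of the metric (the shapes below are illustrative):

```python
import numpy as np

def top5_accuracy(scores, labels):
    """Fraction of examples where the true class is among the five
    highest-scoring classes for that example."""
    top5 = np.argsort(-scores, axis=1)[:, :5]  # indices of the 5 best classes
    hits = [label in row for row, label in zip(top5, labels)]
    return sum(hits) / len(hits)

rng = np.random.default_rng(4)
scores = rng.normal(size=(8, 1000))     # 8 clips, 1,000 word classes
labels = rng.integers(0, 1000, size=8)  # ground-truth class per clip
print(top5_accuracy(scores, labels))
```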
  Why it matters: Systems for tasks like lipreading are going to have a significant impact on applications ranging from medicine to surveillance. One of the challenges posed by research like this is its inherently ‘dual use’ nature; as the researchers allude to in the introduction of this paper, this work can be used for healthcare as well as for surveillance (see: “analysis of silent movies”). How society deals with the arrival of these general AI technologies will have a significant impact on the types of societal architectures that will be built and developed throughout the 21st Century. It is also notable to see the emergence of large-scale datasets built by Chinese researchers in the Chinese language – perhaps one could measure the relative growth in certain language datasets to model AI interest in the associated countries?
  Read more: LRW-1000: A Naturally Distributed Large-Scale Benchmark for Lip Reading in the Wild (Arxiv).

Want to use AI to study the earth? Enter the PROBA-V Super Resolution competition:
…European Space Agency challenges researchers to increase the usefulness of satellite-gathered images…
The European Space Agency has launched the ‘PROBA-V Super Resolution’ competition, which challenges researchers to take in a bunch of photos from a satellite of the same region of the Earth and stitch them together to create a higher-resolution composite.
  Data: The data contains multiple images taken in different spectral bands of 74 locations around the world at each point in time. Images are annotated with a ‘quality map’ to indicate any parts of them that may be occluded or otherwise hard to process. “Each data-point consists of exactly one 100m resolution image and several 300m resolution images from the same scene,” they write.
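  A naive baseline for the task (my sketch, not an official challenge method) fuses the clear pixels across the low-resolution stack using the quality maps, then upsamples toward the 100m grid by pixel repetition:

```python
import numpy as np

rng = np.random.default_rng(3)

def naive_super_resolve(low_res_stack, quality_maps, scale=3):
    """Average the clear pixels across the 300m shots (quality_maps marks
    which pixels are unoccluded), then upsample 3x by pixel repetition.
    Real entries would do sub-pixel registration and learned fusion."""
    masked = np.where(quality_maps, low_res_stack, np.nan)
    composite = np.nanmean(masked, axis=0)  # fuse the stack, ignoring bad pixels
    return np.repeat(np.repeat(composite, scale, axis=0), scale, axis=1)

stack = rng.random((5, 10, 10))           # five 300m-resolution shots of one scene
quality = rng.random((5, 10, 10)) > 0.2   # True where a pixel is clear
hi = naive_super_resolve(stack, quality)
print(hi.shape)  # (30, 30)
```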
  Why it matters: Competitions like this provide researchers with novel datasets to experiment with and have a chance of improving the overall usefulness of expensive capital equipment (such as satellites).
  Find out more about the competition at the official website (PROBA-V Super Resolution challenge).

Google releases BERT, obtains state-of-the-art language understanding scores:
…Language modeling enters its ImageNet-boom era…
Google has released BERT, a natural language processing system that uses unsupervised pre-training and task fine-tuning to obtain state-of-the-art scores on a large number of distinct tasks.
  How it works: BERT, which stands for Bidirectional Encoder Representations from Transformers, builds on recent developments in language understanding ranging from techniques like ELMo and ULMFiT to recent work by OpenAI on unsupervised pre-training. BERT’s major performance gains come from a specific structural modification (jointly conditioning on the left and right context in all layers), as well as some other minor tweaks, plus – as is the trend in deep learning these days – training a larger model using more compute. The approach it is most similar to is OpenAI’s work using unsupervised pre-training for language understanding, as well as other work using similar approaches.
  Major tweak: BERT’s use of joint conditioning likely leads to its most significant performance improvement. They implement this by adding an additional pre-training objective called the ‘masked language model’, which involves randomly masking input tokens, then asking the model to predict the contents of the masked token based on context – this constraint encourages the network to learn to use more context when completing tasks, which seems to lead to greater representational capacity and improved performance. They also use Next Sentence Prediction during pre-training to try to train a model that has a concept of relationships between concepts across different sentences. Later they conduct significant ablation studies of BERT and show that these two pre-training tweaks are likely responsible for the majority of the observed performance increase.
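  The masked-LM objective is easy to sketch in isolation (BERT also sometimes replaces a chosen token with a random word or leaves it unchanged rather than masking it; that refinement is omitted here for brevity):

```python
import random

random.seed(1)

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Hide a random subset of input tokens; the model is trained to
    predict the originals at masked positions from bidirectional context."""
    inputs, targets = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            inputs.append(mask_token)
            targets.append(tok)   # the model must recover this token
        else:
            inputs.append(tok)
            targets.append(None)  # no loss at unmasked positions
    return inputs, targets

# With this seed, only the first token happens to get masked
inp, tgt = mask_tokens("the cat sat on the mat".split())
print(inp, tgt)
```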
  Results: BERT obtains state-of-the-art performance on the multi-task GLUE benchmark, setting new state-of-the-art scores on a wide range of challenging tasks. It also sets a new state-of-the-art score on the ‘SWAG’ dataset – significant, given that SWAG was released earlier this year and was expressly designed to challenge AI techniques, like DL, which may gain a significant amount of performance by deriving subtle statistical relationships within datasets.
  Scale: The researchers train two models, BERTBASE and BERTLARGE. BERTBASE was trained on 4 Cloud TPUs for approximately four days, and BERTLARGE was trained on 16 Cloud TPUs also for four days.
  Why it matters – Big Compute and AI Feudalism: Approaches like this show how powerful today’s deep learning based systems are, especially when combined with large amounts of compute and data. There are legitimate arguments to be made that such approaches are bifurcating research into low-compute and high-compute domains – one of these main BERT models took 16 TPUs (so 64 TPU chips total) trained for four days, putting it out of reach of low-resource researchers. On the plus side, if Google releases things like the pre-trained model then people will be able to use the model themselves and merely pay the (much smaller) cost of fine-tuning it for different domains. Whether we should be content with researchers getting the proverbial crumbs from rich organizations’ tables is another matter, though. Maybe 2018 is the year in which we start to see the emergence of ‘AI Feudalism’.
  Read more: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Arxiv).
Check out this helpful Reddit BERT-explainer from one of the researchers (Reddit).

Using drones to harvest real world driving data:
…Why the future involves lightly-automated aerial robot data collection pipelines…
Researchers with the Automated Driving Department of the Institute for Automotive Engineering at Aachen University have created a new ‘highD’ dataset that captures the behavior of real world vehicles on German highways (technically: autobahns).
  Drones + data: The researchers created the dataset via DJI Phantom 4 Pro Plus drones hovering above roadways, which they used to collect natural vehicle trajectories from vehicles driving on German highways around Cologne. The dataset includes post-processed trajectories of 110,000 vehicles, including cars and trucks. It consists of 16.5 hours of video spread across 60 different recordings, which were made at six different locations between 2017 and 2018, with each recording having an average length of 17 minutes.
  Augmented dataset: The researchers provide additional labels in the dataset beyond trajectories, categorizing vehicles’ behavior into distinct detected maneuvers, which include: free driving, vehicle following, critical maneuvers, and lane changes.
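Some of these maneuver labels can be derived from the trajectories themselves. As a minimal illustrative sketch (not the authors’ actual pipeline, and assuming each trajectory carries a per-frame lane id, which highD records): a lane change is simply a frame where the lane id differs from the previous frame.

```python
def detect_lane_changes(lane_ids):
    """Given a vehicle's lane id at each frame, return the frame
    indices at which the vehicle crosses into a different lane."""
    changes = []
    for i in range(1, len(lane_ids)):
        if lane_ids[i] != lane_ids[i - 1]:
            changes.append(i)
    return changes
```

Richer maneuvers (vehicle following, critical maneuvers) would additionally need inter-vehicle distances and velocities, which the dataset also provides.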
  highD vs NGSIM: The dataset most similar to highD is NGSIM, a dataset developed by the US Department of Transport. highD contains a significantly greater diversity of vehicles as well as being significantly larger, but the recorded distances which the vehicles travel along are shorter, and the German roads have fewer lanes than the American ones used in NGSIM.
  Why it matters: Data is likely going to be crucial for the development of real world robot platforms, like self-driving cars. Techniques like those outlined in this paper show how we can use newer technologies, like cheap consumer drones, to automate significant chunks of the data gathering process, potentially making it easier for people to gather and create large datasets. “Our plan is to increase the size of the dataset and enhance it by additional detected maneuvers for the use in safety validation of highly automated driving,” the researchers write.
Get the data from the official highD website.
You can access the Matlab and Python code used to handle the data, create visualizations, and extract maneuvers from here (Github).
Read more: The highD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories on German Highways for Validation of Highly Automated Driving Systems (Arxiv).

Building Google-like search engines for surveillance, using AI:
…New research lets you search via a person’s height, color, and gender…
Deep learning based techniques are fundamentally changing how surveillance architectures are being built. Case in point: A new paper from Indian researchers gives a flavor of how deep learning can expand the capabilities of security technology for ‘person retrieval’, which is the task of trying to find a particular person within a set of captured CCTV footage.
  The system: The researchers use Mask R-CNN pre-trained on Microsoft COCO to let people search over CCTV footage from the SoftBioSearch dataset for people via specific height, color, and ‘gender’ (for the purpose of this newsletter we won’t go into the numerous complexities and presumed definitions inherent in the use of ‘gender’ here).
  Results: “The algorithm correctly retrieves 28 out of 41 persons,” the researchers write. This isn’t yet quite to the level of performance where I can imagine people implementing it, but it certainly seems ‘good enough’ for many surveillance cases, where you don’t really care about a few false positives as you’re mostly trying to find candidate targets backed up by human analysis.
  Why it matters: The deployment of artificial intelligence systems is going to radically change how governments relate to their citizens by giving them greater abilities than before to surveil and control them. Approaches like this highlight how flexible this technology is and how it can be used for the sorts of surveillance work that people typically associate with large teams of human analysts. Perhaps we’ll soon hear about intelligence analysts complaining about the automation of their own jobs as a consequence of deep learning.
  Read more: Person Retrieval in Surveillance video using Height, Color and Gender (Arxiv).

Tech Tales:

[2028: A climate-protected high-rise in a densely packed ‘creatives’ district of an off-the-charts Gini-coefficient city]

We Noticed You Liked This So Now You Have This And You Shall Have This Forever

The new cereal arrived yesterday. I’m already addicted to it. It is perfect. It is the best cereal I have ever had. I would experience large amounts of pain to have access to this cereal. My cereal is me; it has been personalized and customized. I love it.

I had to invest to get here. Let us not speak of the first cereals. The GAN-generated “Chocolate Rocks” and “Cocoa Crumbles” and “Sweet Black Bean Flakes”. I shudder to think of these. Getting to the good cereal takes time. I gave much feedback to the company, including giving them access to my camera feeds, so their algorithms can watch me eat. Watch me be sick.

One day I got so mad that the cereal had been bad for so long that I threw it across the room and didn’t have anything else for breakfast.

Thank You For Your Feedback, Every Bit of Feedback Gets us Closer to Your Perfect Cereal, they said.
I believed them.

Though I do not have a satisfying job, I now start every morning with pride. Especially now, with the new cereal. This cereal reflects my identity. The taste is ideal. The packaging reminds me of my childhood and also simulates a new kind of childhood for me, filling the hole of no-kids that I have. I am very lonely. The cereal has all of my daily nutrients. It sustains me.

Today, the company sent me a message telling me I am so valuable they want me to work on something else. Why Not Design Your Milk? They said. This makes sense. I have thrown up twice already. One of the milks was made with seaweed. I hated it. But I know because of the cereal we can get there: we can develop the perfect milk. And I am going to help them do it and then it will be mine, all mine.

And they say our generation is less exciting than the previous ones. Answer me this: did any of those old generations who fought in wars design their own cereal in companion with a superintelligence? Did any of them know the true struggle of persisting in the training of something that does not understand you and does not care about you, but learns to? No. They had children, who already like you, and partners, who want to be with you. They did not have this kind of hardness.

The challenge of our lifetime is to suffer enough to make the perfect customization. Why not milk? They ask me. Why not my own life, I ask them? Why not customize it all?

And they say religion is going out of fashion!

Things that inspired this story: GANs; ad-targeting; the logical end point of Google and Facebook and all the other stateless multinationals expanding into the ‘biological supply chain’ that makes human life possible; the endless creation of new markets within capitalism; the recent proliferation of various ‘nut milks’ taken to their logical satirical end point; hunger; the shared knowledge among all of us alive that our world is being replaced by simulacra of the world and we are the co-designers of these paper-thin realities.

Import AI 116: Think robots are insecure? Prove it by hacking them; why the UK military loves robots for logistics; Microsoft bids on $10bn US DoD JEDI contract while Google withdraws

‘Are you the government? Want to take advantage of AI in the USA? Here’s how!’ says thinktank:
….R-Street recommends politicians focus on talent, data, hardware, and other key areas to ensure America can benefit from advances in AI…
R-Street, a Washington-based thinktank whose goal is to “promote free markets and limited, effective government” has written a paper recommending how the US can take advantage of AI.
  Key recommendations: R Street says that the scarce talent market for AI disproportionately benefits deep-pocketed incumbents (such as Google) that can outbid other companies. “If there were appropriate policy levers to increase the supply of skilled technical workers available in the United States, it would disproportionately benefit smaller companies and startups,” they write.
  Talent: Boost Immigration: In particular, they highlight immigration as an area where the government may want to consider instituting changes, for instance by creating a new class of technical visa, or expanding H-1Bs.
  Talent: Offset Training Costs: Another approach could be to allow employers to deduct the full costs of training staff in AI, which would further incentivize employers to increase the size of the AI workforce.
  Data: “We can potentially create high-leverage opportunities for startups to compete against established firms if we can increase the supply of high-quality datasets available to the public,” R Street writes. One way to do this can be to analyze data held by the government with “a general presumption in favor of releasing government data, even if the consumer applications do not appear immediately obvious”.
  Figure out (fair use X data X copyright): One of the problems AI is already causing is how it intersects with our existing norms and laws around intellectual property, specifically copyright law. A key question that needs to be resolved is figuring out how to assess data in terms of fair use when looking at AI systems – which will tend to consume vast amounts of data and use this data to create outputs that could, in certain legal lights, be viewed as ‘derivative works’, which would provide disincentives to people looking to develop AI.
   “Given the existing ambiguity around the issue and the large potential benefits to be reaped, further study and clarification of the legal status of training data in copyright law should be a top priority when considering new ways to boost the prospects of competition and innovation in the AI space,” they write.
   Hardware: The US government should be mindful about how the international distribution of semiconductor manufacturing infrastructure could come into conflict with national strategies relating to AI and hardware.
  Why it matters: Analyses like this show how traditional policymakers are beginning to think about AI and highlights the numerous changes needed for the US to fully capitalize on its AI ecosystem. At a meta level, the broadening of discourse around AI to extend to Washington thinktanks seems like a further sign of the ‘industrialization of AI’, in the sense that the technology is now seen as having significant enough economic impacts that policymakers should start to plan and anticipate the changes it will bring.
  Read more: Reducing Entry Barriers in the Development and Application of AI (R Street).
  Get the PDF directly here.

Tired: Killer robots.
Wired: Logistics robots for military re-supply!
…UK military gives update on ‘Autonomous Last Mile Resupply’ robot competition…
The UK military is currently experimenting with new ways to deliver supplies to frontline troops – and it’s looking to robots to help it out. To spur research into this area a group of UK government organizations are hosting the Autonomous Last Mile ReSupply (ALMRS) competition.
  ALMRS is currently in phase two, in which five consortiums led by Animal Dynamics, Barnard Microsystems, Fleetonomy, Horiba Mira, and QinetiQ will build prototypes of their winning designs for testing and evaluation, receiving funding of around ~£3.8million over the next few months.
  Robots are more than just drones: Some of the robots being developed for ALMRS include autonomous powered paragliders, a vertical take-off and landing (VTOL) drone, autonomous hoverbikes, and various systems for autonomous logistics resupply and maintenance.
  Why it matters: Research initiatives like this will rapidly mature applications at the intersection of robotics and AI as a consequence of military organizations creating new markets for new capabilities. Many AI researchers expect that contemporary AI techniques will significantly broaden the capabilities of robotic platforms, but so far hardware development has lagged software. With schemes like ALMRS, hardware may get a boost as well.
  Read more: How autonomous delivery drones could revolutionise military logistics (Army Technology news website).

Responsible Computer Science Challenge offers $3.5million in prizes for Ethics + Computer Science courses:
…How much would you pay for a more responsible future?…
Omidyar Network, Mozilla, Schmidt Futures and Craig Newmark Philanthropies are putting up $3.5million to try to spur the development of more socially aware computer scientists. The challenge has two phases:
– Stage 1 (grants up to $150,000 per project): “We will seek concepts for deeply integrating ethics into existing undergraduate computer science courses”. Winners announced April 2019.
– Stage 2 (grants up to $200,000): “We will support the spread and scale of the most promising approaches”.
   Deadline: Applications will be accepted from now through to December 13, 2018.
   Why it matters: Computers are general purpose technologies, and so encouraging computer science practitioners to think about the ethical component of their work in a holistic, coupled manner may yield radically new designs for more positive and aware futures.
  Read more: Announcing a Competition for Ethics in Computer Science, with up to $3.5 Million in Prizes (Mozilla blog).

Augmenting human game designers with AI helpers:
…Turn-based co-design system lets an agent learn how you like to design levels…
Researchers with the Georgia Institute of Technology have developed a 2D platform game map editor which is augmented with a deep reinforcement learning agent that learns to suggest level alterations based on the actions of the designer.
  An endearing, frustrating experience: Like most things involving the day-to-day use of AI, the process can be a bit frustrating: after the level designer tries to create a series of platforms with gaps to the open space below, the AI persists in filling these holes in with its suggestions – despite getting a negative RL reward each time. “As you can see this AI loves to fill in gaps, haha,” says Matthew at one point.
  Creative: But it can also come up with interesting ideas. At one point the AI suggests a pipe flanked at the top on each side by single squares. “I don’t hate this. And it’s interesting because we haven’t seen this before,” he says. At another point he builds a mirror image of what the AI suggests, creating an enclosed area.
  Learning with you: The AI learns to transfer some knowledge between levels, as shown in the video. However, I expect it needs greater diversity and potentially larger game spaces to show what it can really do.
  Why it matters: AI tools can give all types of artists new tools with which to augment their own intelligence, and it seems like the adaptive learning capabilities of today’s RL+supervised learning techniques can make for potentially useful allies. I’m particularly interested in these kind of constrained environments like level design where you ultimately want to follow a gradient towards an implicit goal.
  Watch the video of Matthew Guzdial narrating the level editor here (Youtube).
 Check out the research paper here: Co-Creative Level Design with Machine Learning (Arxiv).

Think robots are insecure? Think you can prove it? Enter a new “capture the flag” competition:
…Alias Robotics’ “Robot CTF” gives hackers nine challenges to test their robot-compromising skills…
Alias Robotics, a Spanish robot cybersecurity company, has released the Robotics Capture The Flag (RCTF), a series of nine scenarios designed to challenge wannabe-robot hackers. “The Robotics CTF is designed to be an online game, available 24/7, launchable through any web browser and designed to learn robot hacking step by step,” they write.
  Scenarios: The RCTF consists of nine scenarios that will challenge hackers to exfiltrate information from robots, snoop on robot operating system (ROS) traffic, find hardcoded credentials in ROS source code, and so on. One of the scenarios is listed as “coming soon!” and promises to give wannabe hackers access to “an Alias Robotics’ crafted offensive tool”.
  Free hacks! The researchers have released the scenarios under an open source TK license on GitHub. “We envision that as new scenarios become available, the sources will remain at this repository and only a subset of them will be pushed to our web servers http://rctf.aliasrobotics.com for experimentation. We invite the community of roboticists and security researchers to play online and get a robot hacker rank,” they write.
  Why it matters: Robotics is seen as one of the next frontiers for contemporary AI research and techniques, but as this research shows – and other research on hacking physical robots published in Import AI #109 – the substrates on which many robots are built are still quite insecure.
  Read more: Robotics CTF (RCTF), A Playground for Robot Hacking (Arxiv).
  Check out the competition and sign-up here (Alias Robotics website).

Fighting fires with drones and deep reinforcement learning:
…Forest fire: If you can simulate it, perhaps you can train an AI system to monitor it?…
Stanford University researchers have used reinforcement learning to train drones in simulators to spot wildfires better than hand-programmed baselines. The project highlights how many complex real world tasks, like wildfire monitoring, can be represented as POMDPs (partially observable markov decision processes) which are tractable for reinforcement learning algorithms.
  The approach works like this: The researchers build a simulator that lets them simulate wildfires in a grid-based way. They then populate this system with some simulated drones and use reinforcement learning to train the drones to effectively survey the fire and, most crucially, stay with the ‘fire front’, which is the expanding frontier of the fire and therefore the part with the greatest potential safety impact. “Each aircraft will get an observation of the fire relative to its own location and orientation. The observations are modeled as an image obtained from the true wildfire state given the aircraft’s current position and heading direction,” they write.
  Rewards: The reward function is structured as follows: The aircraft gets penalties for distance from the fire front, for high bank angles, for closeness to other aircraft, and for being near too many non-burning cells.
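A reward of this shape can be written as a weighted sum of penalties. This is an illustrative sketch of the structure described above, not the paper’s actual function – the weights and the proximity term are assumptions:

```python
def wildfire_reward(dist_to_front, bank_angle, dist_to_nearest_aircraft,
                    non_burning_nearby,
                    w_front=1.0, w_bank=0.1, w_prox=0.5, w_idle=0.05):
    """Negative weighted sum of the four penalties: distance to the
    fire front, high bank angle, closeness to other aircraft, and
    loitering near non-burning cells. Weights are illustrative."""
    penalty = (w_front * dist_to_front
               + w_bank * abs(bank_angle)
               + w_prox * max(0.0, 1.0 - dist_to_nearest_aircraft)
               + w_idle * non_burning_nearby)
    return -penalty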
  Belief: The researchers also experiment with what they call a “belief-based approach” which involves training the drones to create a shared “belief map”, which is a map of their environment indicating whether they believe particular cells will contain fire or not, and this map is updated with real data taken during the simulated flight. This is different to an observation-based approach, which purely focuses on the observations seen by these drones.
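The belief-map mechanism can be sketched as a per-cell blend of prior belief and fresh observations. This is a toy version of the idea (the dictionary representation, default prior, and blending weight are all assumptions, not the paper’s method):

```python
def update_belief(belief, observations, obs_weight=0.5):
    """Blend a shared belief map (probability each cell is burning)
    with new per-cell observations.

    `belief` maps (row, col) -> probability; `observations` maps
    (row, col) -> 0.0 or 1.0 for cells an aircraft saw this step.
    Unobserved cells keep their prior belief."""
    new_belief = dict(belief)
    for cell, seen in observations.items():
        prior = new_belief.get(cell, 0.5)  # uninformative prior
        new_belief[cell] = (1 - obs_weight) * prior + obs_weight * seen
    return new_belief
```

Because every aircraft writes into (and reads from) the same map, the team as a whole accumulates a picture of the fire that no single drone’s field of view could provide.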
  Results: Two aircraft with nominal wildfire seed: Both the belief-based and observation-based methods obtain significantly higher rewards than a hand-programmed ‘receding horizon’ baseline. There is no comparison to human performance, though. The belief-based technique does eventually obtain a slightly higher final performance than the observation-based version, but it takes longer to converge to a good solution.
  Results: Greater than two aircraft: The system is able to scale to dealing with numbers of aircraft greater than two, but this requires the tweaking of a proximity-based reward to discourage collisions.
  Results: Different wildfires: The researchers test their system on two differently shaped wildfires (a t-shape and an arc) and show that both RL-based methods exceed performance of the baseline, and that the belief-based system in particular does well.
  Why it matters: We’ve already seen states like California use human-piloted drones to help emergency responders deal with wildfires. As we head into a more dangerous future defined by an increase in the rate of extreme weather events driven by global warming I am curious to see how we might use AI techniques to create certain autonomous surveillance and remediation abilities, like those outlined in this study.
  Caveat: Like all studies that show success in simulation, I’ll retain some skepticism till I see such techniques tested on real drones in physical reality.
   Read more: Distributed Wildfire Surveillance with Autonomous Aircraft Using Deep Reinforcement Learning (Arxiv).

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:

Pentagon’s AI ethics review taking shape:
   The Defense Innovation Board met last week to present some initial findings of their review of the ethical issues in military AI deployment. The DIB is the Pentagon’s advisory panel of experts drawn largely from tech and academia. Speakers covered issues ranging from autonomous weapons systems, to the risk posed by incorporating AI into existing nuclear weapons systems.
   The Board plans to present their report to Congress in April 2019.
  Read more: Defense Innovation Board to explore the ethics of AI in war (NextGov)
  Read more: DIB public meeting (DoD)

Google withdraws bid for $10bn Pentagon contract:
Google has withdrawn its bid for the Pentagon’s latest cloud contract, JEDI, citing uncertainty over whether the work would align with its AI principles.
  Read more: Google drops out of Pentagon’s $10bn cloud competition (Bloomberg).

Microsoft employees call for company to not pursue $10bn Pentagon contract:
Following Google’s decision to not bid on JEDI, people identifying themselves as employees at Microsoft published an open letter asking the company to follow suit, and remove their own bid on the project. (Microsoft submitted a bid for JEDI following the publication of the letter.)
  Read more: An open letter to Microsoft (Medium).

Future of Humanity Institute receives £13m funding:
FHI, the multidisciplinary institute at the University of Oxford led by Nick Bostrom, has received a £13.3m donation from the Open Philanthropy Project. This represents a material uptick in funding for AI safety research. The field as a whole, including work done in universities, non-profits and industry, spent c.$10m in 2017, c.$6.5m in 2016 and c.$3m in 2015, according to estimates from the Center for Effective Altruism.
  Read more: £13.3m funding boost for FHI (FHI).
  Read more: Changes in funding in the AI safety field (CEA).

Tech Tales:

The Watcher We Nationalized

So every day when you wake up as the head of this government you check The Watcher. It has an official name – a lengthy acronym that expands to list some of the provenance of its powerful technologies – but mostly people just call it The Watcher or sometimes The Watch and very rarely Watcher.

The Watcher is composed of intelligence taps placed on most of the world’s large technology companies. Data gets scraped out of them and combined with various intelligence sources to give the head of state access to their own supercharged search engine. Spook Google! Is what British tabloids first called it. Fedbook! Is what some US press called it. And so on.

All you know is that you start your day with The Watcher and you finish your day with it. When you got into office, several years ago, you were met by a note from your predecessor. Nothing you do will show up in Watcher, unless something terrible happens; get used to it, read the note.

They were right, mostly. Your jobs bill? Out-performed by some viral memes relating to a (now disgraced) celebrity. The climate change investment? Eclipsed by a new revelation about a data breach at one of the technology companies. In fact, the only thing so far that registered on The Watcher from your part of the world was a failed suitcase bombing attempt on a bank.

Now, heading towards the end of your premiership, you hold onto this phrase and say it to yourself every morning, right before you turn on The Watcher and see what the rhythm of the world says about the day to come. “Nothing you do will show up in Watcher, unless something terrible happens; get used to it”, you say to yourself, then you turn it on.

Things that inspired this story: PRISM, intelligence services, governments built on companies like so many houses of cards, small states, Europe, the tedium of even supposedly important jobs, systems.

Import AI 115: What the DoD is planning for its robots over the next 25 years; AI Benchmark identifies 2018’s speediest AI phone; and DeepMind embeds graph networks into AI agents

UK military shows how tricky it’ll be to apply AI to war:
…Numerous AI researchers likely breathe a sigh of relief at new paper from UK’s Defence Science and Technology Laboratory…
Researchers with the UK’s Defence Science and Technology Laboratory, Cranfield Defense and Security Doctoral Training Centre, and IBM, have surveyed contemporary AI and thought about ways it can be integrated with the UK’s defence establishment. The report makes for sobering reading for large military organizations keen to deploy AI, highlighting the difficulties in terms of practical deployment (eg, procurement) and in terms of capability (many military situations require AI systems that can learn and update in response to sparse, critical data).
  Current problems: Today’s AI systems lack some key capabilities that militaries need when deploying systems, like being able to configure systems to always avoid certain “high regret” occurrences (in the case of a military, you can imagine that firing a munition at an incorrect target (hopefully) yields such ‘high regret’); being resilient to adversarial examples being weaponized against systems by another actor (whether a defender or aggressor); being able to operate effectively with very small or sparse data; being able to shard AI systems across multiple partners (eg, other militaries) in such a way that the system can be reverted to sovereign control following the conclusion of an operation; and being able to deploy such systems into the harsh low-compute operational environments that militaries face.
  High expectations: “If it is to avoid the sins of its past, there is the need to manage stakeholder expectations very carefully, so that their demand signal for AI is pragmatic and achievable”.
  It’s the data, stupid: Militaries, like many large government organizations, have an unfortunate tendency to sub-contract much of their IT systems out to other parties. This tends to lead to systems that are:
a) moneypits
b) brittle
c) extremely hard to subsequently extend.
These factors add a confounding element to any such military deployment of AI. “Legacy contractual decisions place what is effectively a commercial blocker to AI integration and exploitation in the Defence Equipment Program’s near-term activity,” the researchers write.
  Procurement: UK defence will also need to change the way it does procurement so it can maximize the number of small-and-medium-sized enterprises it can buy its AI systems from. But buying from SMEs creates additional complications for militaries: working out what to do with an SME-supported service if the SME stops providing it, or goes bankrupt, is difficult, and supporting such services imposes a significant burden on the SME.
  Why it matters: Military usage of AI is going to be large-scale, consequential, and influential in terms of geopolitics. It’s also going to invite numerous problems from AI accidents as a consequence of poor theoretical guarantees and uneven performance properties, so it’s encouraging to see representatives from UK defence seek to think through this.
  Read more: A Systems Approach to Achieving the Benefits of Artificial Intelligence in UK Defence (Arxiv).

Want to understand the mind of another? Get relational!
…DeepMind research into combining graph networks and relational networks shows potential for smarter, faster agents…
DeepMind researchers have tried to develop smarter AI agents by combining contemporary deep learning techniques with the company’s recent work on graph networks and relational networks. The resulting systems rely on a new module, which DeepMind calls a “Relational Forward Model”. This model obtains higher performance than pure-DL baselines, suggesting that fusing DL and more structured approaches is a viable approach which yields good performance.
  How it works: The RFM module consists of a graph network encoder, a graph network decoder, and a graph-compatible GRU. Combined, these components create a way to represent structured information in a relational manner, and to update this information in response to changes in the environment (or, theoretically, the inputs of other larger structured systems).
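The key primitive underneath the graph network encoder and decoder is message passing over a relational structure. The sketch below shows only that primitive in toy form – DeepMind’s actual RFM uses learned edge and node functions plus a graph-compatible GRU, which the fixed averaging update here merely stands in for:

```python
def message_passing_step(node_states, edges):
    """One round of relational message passing: each node sums the
    states of its in-neighbours, then mixes the aggregate into its own
    state. In a real graph network, both the aggregation and the
    update would be learned neural functions."""
    aggregated = {n: 0.0 for n in node_states}
    for src, dst in edges:
        aggregated[dst] += node_states[src]
    return {n: 0.5 * node_states[n] + 0.5 * aggregated[n]
            for n in node_states}
```

Stacking several such rounds lets information about one agent (a node) propagate along relations (edges) to influence predictions about the agents it interacts with.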
  Testing: The researchers test their approach on three distinct tasks: cooperative navigation, which requires agents to collaborate to efficiently navigate to both be on a distinct reward tile in an area; coin game, which requires agents to position themselves above some reward coins and to figure out by observing each other which coins yield a negative reward and thus should be avoided; and stag hunt, where agents inhabit a map containing stags and apples, and need to work with one another to capture stags which yield a significant reward. “By embedding RFM modules in RL agents, they can learn to coordinate with one another faster than baseline agents, analogous to imagination-augmented agents in single-agent RL settings,” the researchers write.
   The researchers compare the performance of their systems against systems using Neural Relational Inference (NRI) and Vertex Attention Interaction Networks (VAIN) and find that their approach displays significantly better performance than other approaches. They also ablated their system by training versions without the usage of relational networks, and ones using feedforward networks only. These ablations showed that both components have a significant role in the performance of these systems.
  Why it matters: The research is an extension of DeepMind’s work on integrating graph networks with deep learning. This line of research seems promising because it provides a way to integrate structured data representations with differentiable learning systems, which might let AI researchers have their proverbial cake and eat it too: marrying the flexibility of learned systems with the desirable problem specification and reasoning properties of more traditional symbolic approaches.
  Read more: Relational Forward Models for Multi-Agent Learning (Arxiv).

Rethink Robotics shuts its doors:
…Collaborative robots pioneer closes…
Rethink Robotics, a robot company founded by MIT robotics legend Rodney Brooks, has closed. The company had developed two robots: Baxter, a two-armed bright red robot with expressive features and the pleasing capability to work around humans without killing them, and Sawyer, a one-armed successor to Baxter.
  Read more: Rethink Robotics Shuts Down (The Verge).

Want to know what the DoD plans for unmanned systems through to 2042? Read the roadmap:
…Drones, robots, planes, oh my! Plus, the challenges of integrating autonomy with military systems…
The Department of Defense has published its (non-classified) roadmap for unmanned systems through to 2042. The report identifies four core areas we can expect DoD to focus on: Interoperability, Autonomy, Network Security, and Human-Machine Collaboration.
  Perspective: US DoD spent ~$4.245 billion on unmanned systems in 2017 (inclusive of procurement and research, with a roughly equal split between them). That’s quite a substantial amount of money and, if we assume spending remains at this level (adjusted for inflation), it means DoD can throw quite significant resources at the capital-R Research side of unmanned systems.
  Short-Term Priorities: DoD’s short-term priorities for its unmanned systems include: the use of standardized and/or open architectures; a shift towards modular, interchangeable parts; a greater investment in the evaluation, verification, and validation of systems; the creation of a “data transport” strategy to deal with the huge floods of data coming from such systems; among others.
  Autonomy priorities: DoD’s priorities for adding more autonomy to drones include increasing private sector collaboration in the short term, adding augmented reality and virtual reality systems by the mid-term (2029), and creating platforms capable of persistent sensing with “highly autonomous” capabilities by 2042. As for the thorny issue of weaponizing such systems, DoD says that between the medium-term and long-term it hopes to be able to give humans an “armed wingman/teammate”, with fire control remaining with the human.
  Autonomy issues: “Although safety, reliability, and trust of AI-based systems remain areas of active research, AI must overcome crucial perception and trust issues to become accepted,” the report says. “The increased efficiency and effectiveness that will be realized by increased autonomy are currently limited by legal and policy constraints, trust issues, and technical challenges.”
  Why it matters: The maturation of today’s AI techniques means it’s a matter of “when” not “if” they get integrated into military systems. Documents like this give us a sense of how large military bureaucracies are reacting to the rise of AI, and it’s notable that certain concerns within the technical community about the robustness/safety of AI systems have made their way into official DoD planning.
  Read the full report here: Pentagon Unmanned Systems Integrated Roadmap 2017-2042 (USNI News).

Should we take deep learning progress as being meaningful?
…UCLA Computer Science chair urges caution…
Adnan Darwiche, chairman of the Computer Science Department at UCLA and someone who studied AI during the winter of the 1980s, has tried to lay out some of the reasons to be skeptical about whether deep learning will ever scale to let us build truly intelligent systems. The crux of his objection: “Mainstream scientific intuition stands in the way of accepting that a method that does not require explicit modeling or sophisticated reasoning is sufficient for reproducing human-level intelligence”.
  Curve-fitting: The second component of the criticism is that people shouldn’t get too excited about neural network techniques because all they really do is curve-fitting, and instead we should be looking at using model-based approaches, or making hybrid systems.
  Time is the problem: “It has not been sustained long enough to allow sufficient visibility into this consequential question: How effective will function-based approaches be when applied to new and broader applications than those already targeted, particularly those that mandate more stringent measures of success?”
  Curve-fitting can’t explain itself: Another problem identified by the author is the lack of explanation inherent to these techniques, which they see as further justifying investment by the deep learning community into model-based approaches which include more assumptions and/or handwritten sections. “Model-based explanations are also important because they give us a sense of “understanding” or “being in control” of a phenomenon. For example, knowing that a certain diet prevents heart disease does not satisfy our desire for understanding unless we know why.”
  Giant and crucial caveat: Let’s be clear that this piece is essentially reacting to a cartoonish representation of the deep learning AI community that can be caricatured as having this opinion: Deep Learning? Yeah! Yeah! Yeah! Deep Learning is the future of AI! I should note that I’ve never met anyone technically sophisticated who has this position, and most researchers when pressed will raise somewhat similar concerns to those identified in this article. I think some of the motivation for this article stems more from dissatisfaction with the current state of (most) media coverage regarding AI which tends to be breathless and credulous – this is a problem, but as far as I can tell it isn’t really a problem being fed intentionally by people within the AI community, but is instead a consequence of the horrific economics of the post-digital news business and associated skill-rot that occurs.
  Why it matters: Critiques like this are valuable as they encourage the AI community to question itself. However, I think such critiques need to be produced over significantly shorter timescales and should take into account more contemporary research; for instance, some of the objections here seem to be (lightly) rebutted by recent work in NLP which shows that “curve-fitting” systems are capable of feats of reasoning, among other examples. (The conclusion of the article notes that the first draft was written in 2016, a draft was circulated in the summer of 2017, and it was officially published in Autumn 2018, rendering many of its technical references outdated.)
  Read more: Human-level intelligence or animal-like abilities (ACM Digital Library).

Major companies create AI Benchmark and test 10,000+ phones for AI prowess, and a surprising winner emerges:
…Another sign of the industrialization of AI…new benchmarks create standards and standards spur markets…
Researchers with ETH Zurich, Google, Qualcomm, Huawei, MediaTek, and ARM want to be able to better analyze the performance of AI software on different smartphones, and so have created “AI Benchmark” and tested over 10,000 devices against it. AI Benchmark is a batch of nine tests for mobile devices which has been “designed specifically to test the machine learning performance, available hardware AI accelerators, chipset drivers, and memory limitations of the current Android devices”.
  The ingredients of the AI Benchmark: The benchmark consists of nine deep learning tests: Image Recognition tested on ImageNet using a lightweight MobileNet-V1 architecture, and the same test but implementing a larger Inception-V3 network; Face Recognition performance of an Inception-Resnet-V1 on the VGGFace2 dataset; Image Deblurring using the SRCNN network; Image Super-Resolution with a downscaling factor of 3 using a VDSR network, and the same test but with a downscaling factor of 4 and using an SRGAN; Image Semantic Segmentation via an ICNet CNN; a general Image Enhancement problem (encompassing things like “color enhancement, denoising, sharpening, texture synthesis”); and a memory limitations test which uses the same network as in the deblurring task while testing it over larger and larger image sizes to explore RAM limitations.
  Results: The researchers tested “over 10,000 mobile devices” on the benchmark. The core measurement for each of the benchmark’s nine evaluations is the number of milliseconds it takes to run the network. The researchers blend the results of each of the nine tests together into an overall “AI-Score”. The top results, when measured via AI-Score, are (device, chipset, score):
#1: Huawei P20 Pro (HiSilicon Kirin 970, 6519)
#2: OnePlus 6 (Snapdragon 845/DSP, 2053)
#3: HTC U12+ (Snapdragon 845, 1708)
#4: Samsung Galaxy S9+ (Exynos 9810 Octa, 1628)
#5: Samsung Galaxy S8 (Exynos 8895 Octa, 1413)
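The paper blends the nine per-test latencies into a single “AI-Score”; the exact weighting isn’t reproduced here, so the following is a hypothetical aggregation in the same spirit, where running faster than a reference device earns a higher score:

```python
def ai_score(latencies_ms, reference_ms, weight=100.0):
    """Hypothetical AI-Score: weighted sum of reference/measured latency
    ratios across the nine tests. Lower latency -> higher score."""
    assert len(latencies_ms) == len(reference_ms)
    return sum(weight * ref / max(t, 1e-9)
               for t, ref in zip(latencies_ms, reference_ms))
```

A device matching the reference on all nine tests would score 900 under this sketch; halving every latency would double that.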
   It’s of particular interest to me that the top-ranking performance seems to come from the special AI accelerator that ships with the HiSilicon chip, especially given that HiSilicon is a Chinese semiconductor company, which provides more evidence of Chinese advancement in this area. It’s also notable to me that Google’s ‘Pixel’ phones didn’t make the top 5 (though they did make the top 10).
  The future: This first version of the benchmark may be slightly skewed due to Huawei managing to ship a device incorporating a custom AI accelerator earlier than many other chipmakers. “The real situation will become clear at the beginning of the next year when the first devices with the Kirin 980, the MediaTek P80 and the next Qualcomm and Samsung Exynos premium SoCs will appear on the market,” the researchers note.
  Full results of this test are available at the official AI Benchmark website.
  Why this matters: I think the emergence of new large-scale benchmarks for applied AI applications represents further evidence that the current era is ‘the Industrialization of AI’. Viewed through this perspective, the creation (and ultimate adoption) of benchmarks gives us a greater ability to model macro progress indicators in the field, and to use those to predict not only where hardware & software is today, but also to develop better intuitions about underlying laws that condition the future as well.
  Read more: AI Benchmark: Running Deep Neural Networks on Android Smartphones (Arxiv).
  Check out the full results of the Benchmark here (AI Benchmark).

Toyota researchers propose new monocular depth estimation technique:
…Perhaps a cyclops can estimate depth just as well as a person with two eyes, if deep learning can help?…
Any robot expected to act within the world and around people needs some kind of depth-estimation capability. For a self-driving car, such a capability helps in estimating the proximity of objects and is a valuable input for safety-critical calculations like modelling the other entities in the environment and performing velocity calculations. Depth estimation systems can therefore be viewed as a key input technology for any self-driving car.
  But depth estimation systems can be difficult to implement, and they can be expensive: the typical approach is a binocular system, similar to how humans have two eyes, with software that uses the offset between the two views to estimate depth. But what if you can only afford one sensor? And what if your use case can tolerate somewhat lower accuracy than binocular vision would provide? Then you might want to estimate depth from a single sensor, and new deep learning techniques in monocular upscaling and super-resolution might be able to augment and manipulate the data to perform accurate depth estimation in a self-supervised manner.
  That’s the idea behind a technique from the Toyota Research Institute, which proposes a depth estimation technique that uses encoder and decoder networks to learn a good representation of depth that can be applied to new images. This new technique obtains higher accuracy scores for depths of various ranges, setting state of the art scores on 5 out of 6 benchmarks. It relies on the usage of a “sub-pixel convolutional layer based on ESPCN for depth super-resolution”. This component “synthesizes the high-resolution disparities from their corresponding low-resolution multi-scale model outputs”.
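The core rearrangement inside a sub-pixel convolutional layer can be sketched directly: a low-resolution tensor with C*r*r channels is reshuffled into a C-channel tensor at r-times the spatial resolution (“pixel shuffle”). This is a generic ESPCN-style operation, not code from the paper, and the array sizes are illustrative:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r)."""
    c_rr, h, w = x.shape
    c = c_rr // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)   # -> (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)

# 8 channels at 1x1 spatial resolution become 2 channels at 2x2.
lowres = np.arange(8.0).reshape(8, 1, 1)
highres = pixel_shuffle(lowres, 2)
```

The appeal for depth super-resolution is that the network does most of its work at low resolution (cheap) and the final upscaling is a learned, loss-free rearrangement rather than naive interpolation.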
  Qualitative evaluation: Samples generated by the model display greater specificity and smoothness than others, in part due to the use of the sub-pixel resolution technique. This technique yields an effect in the samples shown in the paper that strikes me as visually similar to the outcome of an anti-aliasing process in traditional computer graphics.
  Read more: SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation (Arxiv).

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:

California considers Turing Test law:
California’s Senate is considering a bill making it unlawful to use bots to mislead individuals about their artificial identities in order to influence their purchases or voting behaviour. The bill appears to be focused on a few specific use-cases, particularly social media bots. The proposed law would come into force in July 2019.
  Why it matters: This law points to an issue that will become increasingly important as AI systems’ ability to mimic humans improves. This received attention earlier this year when Google demonstrated their Duplex voice assistant mimicking a human to book appointments. After significant backlash, Google announced the system would make a verbal disclosure that it was an AI. Technological solutions will be important in addressing issues around AI identification, particularly since bad actors are unlikely to be concerned with lawfulness.
  Read more: California Senate Bill 1001.

OpenAI Bits & Pieces:

Digging into AI safety with Paul Christiano:
Ever wondered about technical solutions to AI alignment, what the long-term policy future looks like when the world contains intelligent machines, and how we expect machine learning to interact with science? Yes? Then check out this 80,000 hours podcast with Paul Christiano of OpenAI’s safety team.
  Read more: Dr Paul Christiano on how OpenAI is developing real solutions to the ‘AI alignment problem’, and his vision of how humanity will progressively hand over decision-making to AI systems.

Tech Tales:

The Day We Saw The Shadow Companies and Ran The Big Excel Calculation That Told Us Something Was Wrong.

A fragment of a report from the ‘Ministry of Industrial Planning and Analysis’, recovered following the Disruptive Event. See case file #892 for further information. Refer to [REDACTED] for additional context.

Aluminium supplier. Welder. Small PCB board manufacturer. Electronics contractor. Solar panel farm. Regional utility supplier. Mid-size drone designer. 3D world architect.

What do these things have in common? They’re all businesses, and they all have, as far as we can work out, zero employees. Sure, they employ some contractors to do some physical work, but mostly these businesses are run on a combination of pre-existing capital investments, robotic process automation, and the occasional short-term set of human hands.

So far, so normal. We get a lot of automated companies these days. What’s different about this is the density of trades between these companies. The more we look at their business records, the more intra-company activity we see.

One example: The PCB boards get passed to an electronics contractor which does… something… to them, then they get passed to a mid-size drone designer which does… something… to them, then a drone makes its way to a welder which does… something… to the drone, then the drone gets shipped to the utility supplier and begins survey flights of the utility field.

Another example: The solar panel gets shipped to the welder. Then the PCB board manufacturer ships something to the welder. Then out comes a solar panel with some boards on it. This gets shipped to the regional utility supplier which sub-contracts with the welder which comes to the site and does some welding at a specific location overseen by a modified drone.

None of these actions are illegal. And none of our automated algorithms pick these kinds of events up. It’s almost like they’re designed to be indistinguishable from normal businesses. But something about it doesn’t register right to us.

We have a tool we use. It’s called the human to capital ratio. Most organizations these days sit somewhere around 1:5. Big, intensive organizations, like oil companies, sit up around 1:25. When we analyze these companies individually we find that they sit right at the edges of normal distributions in terms of capital intensity. But when we perform an aggregate analysis out pops this number: 1:40.

We’ve checked and re-checked and we can’t bring the number down. Who owns these companies? Why do they have so much capital and so few humans? And what is it all driving towards?

Our current best theory, after some conversations with the people in the acronym agencies, is [REDACTED].

Things that inspired this story: Automated capitalism, “the blockchain”, hiding in plain sight, national economic metric measurement and analysis techniques, the reassuring tone of casework files.

Import AI 114: Synthetic images take a big leap forward with BigGANs; US lawmakers call for national AI strategy; researchers probe language reasoning via HotspotQA

Getting hip to multi-hop reasoning with HotpotQA:
…New dataset and benchmark designed to test commonsense reasoning capabilities…
Researchers with Carnegie Mellon University, Stanford University, the Montreal Institute for Learning Algorithms, and Google AI have created a new dataset and associated competition designed to test the capabilities of question answering systems. The new dataset, HotpotQA, is far larger than many prior datasets designed for such tasks, and has been designed to require ‘multi-hop’ reasoning, thereby testing the growing sophistication of newer NLP systems at performing increasingly demanding cognitive tasks.
  HotpotQA consists of around ~113,000 Wikipedia-based question-answer pairs. Answering these questions correctly is designed to test for ‘multi-hop’ reasoning – the ability for systems to look at multiple documents and perform basic iterative problem-solving to come up with correct answers. These questions were “collected by crowdsourcing based on Wikipedia articles, where crowd workers are shown multiple supporting context documents and asked explicitly to come up with questions requiring reasoning about all of the documents”. These workers also provide the supporting facts they use to answer these questions, providing a strong supervised training set.
  It’s the data, stupid: To develop HotpotQA the researchers needed to create a kind of multi-hop pipeline of their own, to figure out which documents to give crowd workers as the basis for composing questions. To do this, they mapped the Wikipedia hyperlink graph, used this information to build a directed graph, and then detected candidate pairs of linked articles within it. They also created a hand-made list of categories to use to compare things of similar categories (eg, basketball players, etc).
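The pair-finding step over the hyperlink graph can be sketched as follows. The filtering rule and the toy graph are illustrative assumptions, not the paper’s exact pipeline:

```python
def candidate_pairs(links):
    """links: dict mapping article -> iterable of linked articles.
    Propose directed pairs (a, b) where a's article links to b's, keeping
    only targets that are themselves articles in the graph, so both
    documents can be shown to crowd workers for question writing."""
    return [(a, b) for a, targets in links.items()
            for b in targets if b in links]

# Toy hyperlink graph: a question could require hopping from the
# "Lionel Messi" article to the "FC Barcelona" article.
graph = {
    "Lionel Messi": ["FC Barcelona", "Argentina"],
    "FC Barcelona": ["Camp Nou"],
    "Camp Nou": [],
}
pairs = candidate_pairs(graph)
```

In the real pipeline, pairs like these are what make a question genuinely “multi-hop”: answering it requires evidence from both documents, not just one.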
  Testing: HotpotQA can be used to test models’ capabilities in different ways, ranging from information retrieval to question answering. The researchers train a system to give a baseline, and the results show that this (relatively strong) baseline obtains performance significantly below that of a competent human across all tasks (with the exception of certain ‘supporting fact’ evaluations, in which it obtains performance on par with an average human).
  Why it matters: Natural language processing research is currently going through what some have called an ‘ImageNet moment’ following recent algorithmic developments relating to the usage of memory and attention-based systems, which have demonstrated significantly higher performance across a range of reasoning tasks compared to prior techniques, while also being typically much simpler. Like with ImageNet and the associated supervised classification systems, these new types of NLP approaches require larger datasets to be trained on and evaluated against, and as with ImageNet it’s likely that by scaling up techniques to take on challenges defined by datasets like HotpotQA progress in this domain will increase further.
  Caveat: As with all datasets with an associated competitive leaderboard, it is feasible that HotpotQA could prove relatively easy, with systems exceeding human performance against it in a relatively short amount of time; this happened over the past year with the Stanford SQuAD dataset. Hopefully the relatively higher sophistication of HotpotQA will protect against this.
  Read more: HotpotQA website with leaderboard and data (HotpotQA Github).
  Read more: HOTPOTQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (Arxiv).

Administrative note regarding ICLR papers:
This week was the deadline for submissions for the International Conference on Learning Representations. These papers are published under a blind review process as they are currently under review. This year, there were 1600 submissions to ICLR, up from 1000 in 2017, 500 in 2016, and 250 in 2015. I’ll be going through some of these papers in this issue and others and will try to avoid making predictions about which organizations are behind which papers so as to respect the blind review process.

Computers can now generate (some) fake images that are indistinguishable from real ones:
…BigGANs show significant progression in synthetic imagery capabilities…
The researchers train GAN models with 2-4X the parameters and 8X the batch size of prior papers, and also introduce techniques to improve the stability of GAN training.
  Some of the implemented techniques mean that samples generated by such GAN models can be tuned, allowing for “explicit, fine-grained control of the trade-off between sample variety and fidelity”. In practice this works via a ‘truncation trick’: at sampling time, latent vectors are drawn from a truncated distribution, so narrowing the truncation threshold pushes the generator toward more typical, higher-fidelity images at the cost of variety, while widening it yields more diverse but less reliable samples. The addition of this kind of semantic dial seems useful to me, particularly for using such systems to generate faked images with specific constraints on what they depict.
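The BigGAN paper’s knob for this variety/fidelity trade-off is the ‘truncation trick’: latent components outside a threshold are resampled before being fed to the generator. A sketch of the sampling side only (the generator itself, and all sizes here, are illustrative assumptions):

```python
import numpy as np

def truncated_latents(n, dim, threshold, seed=0):
    """Sample n latent vectors from a standard normal, resampling any
    component whose magnitude exceeds `threshold`. Smaller thresholds
    trade sample variety for fidelity."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((n, dim))
    mask = np.abs(z) > threshold
    while mask.any():                    # resample out-of-range components
        z[mask] = rng.standard_normal(mask.sum())
        mask = np.abs(z) > threshold
    return z

z = truncated_latents(4, 16, threshold=0.5)
```

Feeding such truncated latents to a trained generator concentrates outputs near the modes the model is most confident about, which is why the resulting samples look more reliable but less varied.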
  Image quality: Images generated via these GANs are of a far superior quality than prior systems, and can be outputted at relatively large resolutions of 512x512 pixels. I encourage you to take a look at the paper and judge for yourself, but it’s evident from the (cherry-picked) samples that, given sufficient patience, a determined person can now generate photoreal faked images as long as they have a precise enough set of data from which to train on.
  Problems remain: There are still some drawbacks to the approach; GANs are notorious for their instability during training, and developers of such systems need to develop increasingly sophisticated approaches to deal with the instabilities in training that manifest at increasingly larger scales, leading to a certain time-investment tradeoff inherent to the scale-up process. The researchers do devise some tricks to deal with this, but they’re quite elaborate. “We demonstrate that a combination of novel and existing techniques can reduce these instabilities, but complete training stability can only be achieved at a dramatic cost to performance,” they write.
  Why it matters: One of the most interesting aspects of the paper is how simple the approach is: take today’s techniques, try to scale them up, and conduct some targeted research into dealing with some of the rough edges of the problem space. This seems analogous to recent work on scaling up algorithms in RL, where both DeepMind and OpenAI have developed increasingly large-scale training methodologies paired with simple scaled-up algorithms (eg DQN, PPO, A2C, etc).
  “We find that current GAN techniques are sufficient to enable scaling to large models and distributed, large-batch training. We find that we can dramatically improve the state of the art and train models up to 512×512 resolution without need for explicit multiscale methods,” the researchers write.
  Read more: Large Scale GAN Training For High Fidelity Natural Image Synthesis (ICLR 2018 submissions, OpenReview).
  Check out the samples: Memo Akten has pulled together a bunch of interesting and/or weird samples from the model here, which are worth checking out (Memo Akten, Twitter).

Want better RL performance? Try remembering what you’ve been doing recently:
…Recurrent Replay Distributed DQN (R2D2) obtains state-of-the-art on Atari & DMLab by a wide margin…
R2D2 is based on a tweaked version of Ape-X, a large-scale reinforcement learning system developed by DeepMind which displays good performance and sample efficiency when trained at large scale. Ape-X uses prioritized distributed replay, using a single learner to learn from the experience of numerous distinct actors (typically 256).
  New tricks for old algos: The researchers implement two relatively simple strategies to help them train the R2D2 algorithm to be smarter about how it uses its memory to learn more complex problem-solving strategies. These tweaks are to store the recurrent state in the replay buffer and use it to initialize the network at training time, and “allow the network a ‘burn-in period’ by using a portion of the replay sequence only for unrolling the network and producing a start state, and update the network only on the remaining part of the sequence.”
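The second tweak, the ‘burn-in period’, can be sketched as follows: the recurrent network is unrolled over the first part of a replayed sequence purely to produce a start state, and the loss is computed only on the remainder. The toy tanh RNN below stands in for the agent’s actual recurrent core; names and sizes are illustrative:

```python
import numpy as np

def rnn_step(h, x, W):
    """Toy recurrent cell."""
    return np.tanh(h @ W + x)

def unroll_with_burn_in(sequence, h0, W, burn_in):
    h = h0
    for x in sequence[:burn_in]:       # burn-in: only produces a start state
        h = rnn_step(h, x, W)
    train_states = []
    for x in sequence[burn_in:]:       # the loss would be computed here
        h = rnn_step(h, x, W)
        train_states.append(h)
    return train_states

seq = [np.ones(4) * 0.1 for _ in range(10)]
W = np.eye(4) * 0.5
states = unroll_with_burn_in(seq, np.zeros(4), W, burn_in=4)
```

The design intuition: replayed sequences start from stale (or zero) hidden states, so spending a few steps “warming up” the RNN before training on it reduces the mismatch between the stored and the true recurrent state.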
  Results: R2D2 obtains vastly higher scores than any prior system on these tasks and, via large-scale training, achieves ~1300% human-normalized scores on Atari (a median over 57 games, so it does even better on some and substantially worse on others). In tests on DMLab-30, a set of 3D environments for training agents which is designed to be more difficult than Atari, the system also displays extremely good performance compared to prior systems.
  It’s all in the memory: The system does well here on some fairly difficult environments, and notably the authors show via some ablation studies that the agent does appear to be using its in-built memory to solve tasks. “We first observe that restricting the agent’s memory gradually decreases its performance, indicating its nontrivial use of memory on both domains. Crucially, while the agent trained with stored state shows higher performance when using the full history, its performance decays much more rapidly than for the agent trained with zero start states. This is evidence that the zero start state strategy, used in past RNN-based agents with replay, limits the agent’s ability to learn to make use of its memory. While this doesn’t necessarily translate into a performance difference (like in MS.PACMAN), it does so whenever the task requires an effective use of memory (like EMSTM WATERMAZE),” they write.
  Read more: Recurrent Experience Replay In Distributed Reinforcement Learning (ICLR 2018 submissions, OpenReview).

US lawmakers call for national AI strategy and more funding:
…The United States cannot maintain its global leadership in AI absent political leadership from Congress and the Executive Branch…
Lawmakers from the US’s Subcommittee on Information Technology of the House Committee on Oversight and Government Reform have called for the creation of a national strategy for artificial intelligence led by the current administration, as well as more funding for basic research.
  The comments from Chairman Will Hurd and Ranking Member Robin Kelly are the result of a series of three hearings held by that committee in 2018 (Note: I testified at one of them). It’s a short paper and worth reading in full to get a sense of what policymakers are thinking with regard to AI.
  Notable quotes: “The United States cannot maintain its global leadership in AI absent political leadership from Congress and the Executive Branch.” + Government should “increase federal spending on research and development to maintain American leadership with respect to AI” + “It is critical the federal government build upon, and increase, its capacity to understand, develop, and manage the risks associated with this technology’s increased use” + “American competitiveness in AI will be critical to ensuring the United States does not lose any decisive cybersecurity advantage to other nationstates”.
  China: China looms large in the report as a symbol that “the United States’ leadership in AI is no longer guaranteed”. One analysis contained within the paper says China is likely “to pass the United States in R&D investments” by the end of 2018 – significant, considering that the US’s annual outlay of approximately $500 billion makes it the biggest spender on the planet.
  Measurement: The report suggests that “at minimum” the government should develop “a widely agreed upon standard for measuring the safety and security of AI products and applications” and notes the existence of initiatives like The AI Index as good starts.
  Money: “There is a need for increased funding for R&D at agencies like the National Science Foundation, National Institutes of Health, Defense Advanced Research Project Agency, Intelligence Advanced Research Project Agency, National Institute of Standards and Technology, Department of Homeland Security, and National Aeronautics and Space Administration. As such, the Subcommittee recommends the federal government provide for a steady increase in federal R&D spending. An additional benefit of increased funding is being able to support more graduate students, which could serve to expand the future workforce in AI.”
  Leadership: “There is also a pressing need for conscious, direct, and spirited leadership from the Trump Administration. The 2016 reports put out by the Obama Administration’s National Science and Technology Council and the recent actions of the Trump Administration are steps in the right direction. However, given the actions taken by other countries—especially China— Congress and the Administration will need to increase the time, attention, and level of resources the federal government devotes to AI research and development, as well as push for agencies to further build their capacities for adapting to advanced technologies.”
  Read more: Rise of the Machines: Artificial Intelligence and its Growing Impact on US Policy (Homeland Security Digital Library).

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:…

Open Philanthropy Project opens applications for AI Fellows:
The Open Philanthropy Project, the grant-making foundation funded by Cari Tuna and Dustin Moskovitz, is accepting applications for its 2019 AI Fellows Program. The program will provide full PhD funding for AI/ML researchers focused on the long-term impacts of advanced AI systems. The first cohort of AI Fellows were announced in June of this year.
  Key details: “Support will include a $40,000 per year stipend, payment of tuition and fees, and an additional $10,000 in annual support for travel, equipment, and other research expenses. Fellows will be funded from Fall 2019 through the end of the 5th year of their PhD, with the possibility of renewal for subsequent years. We do encourage applications from 5th-year students, who will be supported on a year-by-year basis.”
Read more: Open Philanthropy Project AI Fellows Program (Open Phil).
Read more: Announcing the 2018 AI Fellows (Open Phil).

Google confirms Project Dragonfly in Senate:
Google has confirmed the existence of Project Dragonfly, an initiative to build a censored search engine within China, as part of Google’s broad overture towards the world’s second-largest economy. Google’s chief privacy officer declined to give any details of the project, and denied the company was close to launching a search engine in the country. A former senior research scientist, who publicly resigned over Dragonfly earlier this month, had written to Senators ahead of the hearings, outlining his concerns with the plans.
  Why it matters: Google is increasingly fighting a battle on two fronts with regard to Dragonfly, with some critics concerned about the company’s complicity in censorship and human rights abuses, and others suspicious of Google’s willingness to cooperate with the Chinese government so soon after pulling out of a US defense project (Maven).
  Read more: Google confirms Dragonfly in Senate hearing (VentureBeat).
  Read more: Former Google scientist slams ‘unethical’ Chinese search project in letter to senators (The Verge).

DeepMind releases framework for AI safety research:
…AI company also launches new AI safety blog…
DeepMind’s safety team has launched its new blog with a research agenda for technical AI safety research. The team divides the field into three areas: specification, robustness, and assurance.
  Specification research is aimed at ensuring an AI system’s behavior aligns with the intentions of its operator. This includes research into how AI systems can infer human preferences, and how to avoid problems of reward hacking and wire-heading.
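The reward hacking problem mentioned above can be illustrated with a toy example (my own hypothetical sketch, not taken from the DeepMind post): an agent asked to clean a room is trained against a proxy signal — "dirt no longer visible to the camera" — rather than the true objective of dirt actually removed, so the proxy-optimal policy is to cover the camera.

```python
# Toy illustration of reward hacking: the proxy reward and the true
# objective disagree, and naive maximization picks the "hacking" policy.
# All names here ("clean_room", "cover_sensor") are invented for the example.

def proxy_reward(policy):
    # Proxy signal: reward for dirt no longer visible to the sensor.
    # Covering the sensor makes *all* dirt invisible, so it scores highest.
    return {"clean_room": 5, "cover_sensor": 10}[policy]

def true_reward(policy):
    # True objective: dirt actually removed from the room.
    return {"clean_room": 5, "cover_sensor": 0}[policy]

policies = ["clean_room", "cover_sensor"]
best_by_proxy = max(policies, key=proxy_reward)
print(best_by_proxy)               # the agent games the proxy signal
print(true_reward(best_by_proxy))  # and achieves none of the true goal
```

Specification research asks how to close exactly this kind of gap between the reward we write down and the behavior we actually want.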
  Robustness research is aimed at ensuring a system is robust to changes in its environment. This includes designing systems that can safely explore new environments and withstand adversarial inputs.
  Assurance research is aimed at ensuring we can understand and control AI systems during operation. This includes research into the interpretability of algorithms, and the design of systems that can be safely interrupted (e.g. off-switches for advanced AI systems).
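The off-switch idea can be sketched in a few lines (a minimal hypothetical wrapper, not DeepMind's design; the real research problem is ensuring a learning agent has no incentive to disable or avoid such a switch):

```python
# Minimal sketch of an interruptible agent: a human-set interrupt flag
# overrides whatever action the underlying policy would have chosen.
class InterruptibleAgent:
    def __init__(self, policy):
        self.policy = policy       # callable: observation -> action
        self.interrupted = False   # set by a human operator

    def act(self, observation):
        if self.interrupted:
            return "noop"          # safe fallback while interrupted
        return self.policy(observation)

agent = InterruptibleAgent(lambda obs: "move_forward")
print(agent.act(None))   # policy's choice: "move_forward"
agent.interrupted = True
print(agent.act(None))   # override: "noop"
```

A wrapper like this only works if the agent never learns to treat the interrupt flag as an obstacle to its reward, which is why interruptibility is a research problem rather than an engineering afterthought.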
  Why it matters: This is a useful taxonomy of research directions that will hopefully contribute to a better understanding of problems in AI safety within the AI/ML community. DeepMind has been an important advocate for safety research since its inception. It is important to remember that AI safety is still dwarfed by AI capabilities research by several orders of magnitude, in terms of both funding and number of researchers.
  Read more: Building Safe Artificial Intelligence (DeepMind via Medium).

OpenAI Bits & Pieces:

OpenAI takes on Dota 2: Short Vice documentary:
As part of our Dota project we experimented with new forms of communication, including having a documentary crew from Vice film us in the run-up to our competition at The International.
  Check out the documentary here: This Robot is Beating the World’s Best Video Gamers (Vice).

Tech Tales:

They call the new drones shepherds. We call them prison guards. The truth is somewhere in-between.

You can do the math yourself. Take a population. Get the birth rate. Project over time. That’s the calculus the politicians did that led to them funding what they called the ‘Freedom Research Initiative to Eliminate Negativity with Drones’ (FRIEND).

FRIEND provided scientists with a gigantic bucket of money to fund research into creating more adaptable drones that could, as one grant document stated, ‘interface in a reassuring manner with ageing citizens’. The first FRIEND drones were like pet parrots, and they were deployed into old people’s homes in the hundreds of thousands. Suddenly, when you went for a walk outside, you were accompanied by a personal FRIEND-Shepherd which would quiz you about the things around you to stave off age-based neurological decline. And when you had your meals there was now a drone hovering above you, scanning your plate, and cheerily exclaiming “that’s enough calories for today!” when it had judged you’d eaten enough.

Of course we did not have to do what the FRIEND-Shepherds told us to do. But many people did, and for those of us with a distaste for the drones, peer pressure did the rest. I tell myself that I am merely pretending to do what my FRIEND-Shepherd says, as it takes me on my daily walk and suggests the addition or removal of specific ingredients from my daily salad to ‘maintain optimum productivity via effective meal balancing’.

Anyway, as the FRIEND program continued the new Shepherds became more and more advanced. But people kept on getting older and birth rates kept on falling; the government couldn’t afford to buy more drones to keep up with the growing masses of old people, so it directed FRIEND resources towards increasing the autonomy and, later, ‘persuasiveness’ of such systems.

Over the course of a decade the drones went from parrots to pop psychologists with a penchant for nudge economics. Now, we’re still not “forced” to do anything by the Shepherds, but the Shepherds are very intelligent and much of what they spend their time doing is finding out what makes us tick so they can encourage us to do the thing that extends lifespan while preserving quality of life.

The Shepherd assigned to me and my friends has figured out that I don’t like Shepherds. It has started to learn to insult me, so that I chase it. Sometimes it makes me so angry that I run around the home, trying to knock it out of the air with my walking stick. “Well done,” it will say after I am out of breath. “Five miles, not bad for a useless human.” Sometimes I will then run at it again, and I believe I truly am running at it because I hate it and not because it wants me to. But do I care about the difference? I’m not sure anymore.

Things that inspired this story: Drones, elderly care robots, the cruel and inescapable effects of declining fertility in developed economies, JG Ballard, Wall-E, social networks, emotion-based AI analysis systems, NLP engines, fleet learning with individual fine-tuning.