Import AI

Import AI 370: 213 AI safety challenges; everything becomes a game; Tesla’s big cluster

Import AI publishes first on Substack – subscribe here.

Chinese researchers build a hard benchmark for multimodal understanding:
…Visual LLMs still struggle with localization and complex visual reasoning…
Chinese researchers have introduced MMT-Bench, a large-scale benchmark for assessing the visual reasoning competency of language models. They test out the benchmark against 30 different LLMs (spanning proprietary and openly accessible models) and find that the InternVL model from Shanghai AI Laboratory gets top place, beating proprietary models like Gemini Pro, Claude 3 Haiku, and GPT-4V. 

What MMT tests for: “MMT-Bench is meticulously curated and comprises 32K multi-choice visual questions covering 32 core meta-tasks and a total of 162 subtasks,” they write. “It encompasses 13 image types such as natural scenes, synthetic images, depth maps, text-rich images, paintings, screenshots, point clouds, medical images, et al,” and also “spans multimodal scenarios such as vehicle driving, GUI navigation, and embodied AI, testing 14 kinds of multimodal capabilities including visual recognition, localization, reasoning, OCR, counting, 3D perception, temporal understanding”.

Who built it: MMT-Bench was built by researchers from the Shanghai Artificial Intelligence Laboratory, Shanghai Jiao Tong University, The University of Hong Kong, The University of Adelaide, Zhejiang University, Shenzhen Institutes of Advanced Technology, and the Chinese Academy of Sciences.

Results: Intern-VL-Chat-v1.2-34B (memorable name!) gets an overall score of 63.4% on the aggregate benchmark, followed by Qwen-VL-Plus (62.3), GPT-4V (62), and GeminiPro Vision (61.6). A closer look at the results shows that some of the proprietary models do well on hard tasks like OCR (where GPT-4V leads) and information retrieval (68.4), though InternVL-Chat has generally good across-the-board performance.
    Strengths and weaknesses: “Most LVLMs excel in Visual Recognition (VR) tasks and Visual Captioning (VC), highlighting the ability of LVLMs to recognize ‘what’ an object is and describe the content shown in the image. However, for fine-grained perception tasks (localization, pixel-level perception, etc) or complex reasoning tasks (image evaluation judgment), most LVLMs struggle,” they write. 
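
    For readers who want a concrete picture of what evaluating on a benchmark like this involves, here's a minimal sketch (not MMT-Bench's actual harness – the two example questions and the `ask_model` stub are placeholders) of scoring a vision-language model on multi-choice visual questions and aggregating accuracy per subtask:

```python
# Minimal sketch: score an LVLM on multi-choice visual questions, then report
# accuracy per subtask. `ask_model` is a placeholder for a real model call.
from collections import defaultdict

def ask_model(image_path: str, question: str, options: dict) -> str:
    # Placeholder: a real implementation would send the image + prompt to an LVLM
    # and parse the chosen option letter out of its reply.
    return "A"

questions = [  # hypothetical examples in the spirit of the benchmark
    {"image": "scene_001.jpg", "question": "How many cars are visible?",
     "options": {"A": "2", "B": "3", "C": "4", "D": "5"}, "answer": "B",
     "subtask": "counting"},
    {"image": "chart_014.png", "question": "Which month had the highest sales?",
     "options": {"A": "Jan", "B": "Apr", "C": "Jul", "D": "Oct"}, "answer": "C",
     "subtask": "chart_understanding"},
]

correct, total = defaultdict(int), defaultdict(int)
for q in questions:
    pred = ask_model(q["image"], q["question"], q["options"])
    total[q["subtask"]] += 1
    correct[q["subtask"]] += int(pred == q["answer"])

for subtask in total:
    print(f"{subtask}: {correct[subtask] / total[subtask]:.1%}")
```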

Why this matters – identifying weaknesses is an art within itself: Most visual LLMs are quite good these days, so there’s huge value in building tests to identify where they fail and also to broadly characterize their behavior in a bunch of domains. MMT-Bench seems like one of the larger multimodal evals to publicly exist and the fact open and closed models can’t get above ~64% aggregate performance suggests there’s a lot of headroom for improvement.
   Read more: MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI (arXiv).
    Get the benchmark from GitHub: MMT-Bench (OpenGVLab, GitHub).

***

Turning photos into 3D worlds and then into interactive games – all in one system:
…Everything around us can be converted into its own world for synthetic agents…
Researchers with the University of Illinois Urbana-Champaign, Shanghai Jiao Tong University, and Cornell University have built a system that can turn a photo into a 3D gameworld. Their approach stitches together a pipeline: a 2D photo is converted into a neural radiance field (NeRF), the objects seen in the picture are assigned physics properties, and the whole scene is transported into a browser-based game engine. The result is a system that lets you take a photograph – say of the place where you’re reading this newsletter right now – and turn it into a gameworld which a 3D character can run around in. 

What they did: “Given a video as input, we first construct a NeRF that can effectively capture the geometric and visual information of a (large-scale, unbounded) scene. Then we distill the NeRF into a game engine-compatible, neural textured mesh,” they write. “Our mesh model facilitates efficient novel-view rendering in real time and allows for basic rigid-body physical interactions.”
   The game engine: “We manage the underlying logic and assets using Sketchbook, a Game Engine based on Three.js that leverages WebGL for rendering”, they write.
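
    To make the pipeline concrete, here's a very rough sketch of the stages described above; every function in it is a placeholder standing in for the paper's actual tooling (NeRF fitting, mesh distillation, physics assignment, and export to the WebGL engine):

```python
# A minimal sketch of a Video2Game-style pipeline; all function names here are
# placeholders, not the authors' actual code.
from dataclasses import dataclass

@dataclass
class GameAsset:
    mesh_path: str   # neural-textured mesh exported for the web engine
    collider: str    # simplified collision geometry for rigid-body physics
    mass_kg: float   # physics property assigned to the reconstructed object

def reconstruct_nerf(video_path: str) -> str:
    """Stage 1: fit a NeRF to the input footage (placeholder)."""
    return "scene.nerf"

def distill_to_mesh(nerf_path: str) -> str:
    """Stage 2: bake the NeRF into a game-engine-compatible textured mesh (placeholder)."""
    return "scene.glb"

def assign_physics(mesh_path: str) -> GameAsset:
    """Stage 3: attach colliders and rigid-body parameters to the mesh (placeholder)."""
    return GameAsset(mesh_path=mesh_path, collider="scene_convex.obj", mass_kg=0.0)

def export_to_web_engine(asset: GameAsset) -> None:
    """Stage 4: hand the asset to a WebGL engine (Sketchbook/Three.js in the paper)."""
    print(f"Loading {asset.mesh_path} with collider {asset.collider}")

export_to_web_engine(assign_physics(distill_to_mesh(reconstruct_nerf("room.mp4"))))
```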

Why this matters – all the world’s a stage: Research like this shows how easily we can convert the world around us into some knowable (and here, navigable) representation for AI agents – the walls that separate the digital from the physical world are thinning, and contemporary AI tools serve as the means of converting from one plane of existence to the other. Sure, this research is about games, but the applications span everything from robotics to simulating humans. 
   Read more: Video2Game: Real-time, Interactive, Realistic and Browser-Compatible Environment from a Single Video (arXiv).

***

Mammoth paper lays out what people mean when they talk about AI safety challenges:
…What stands between us and safe LLMs? Answering 213 hard questions across 18 distinct challenge areas…
A large consortium of researchers has written a paper discussing the multitude of challenges that need to be solved for language models to be reliable and safe. While the paper doesn’t make any new research contributions, it serves as a handy one-stop shop for the large range of technical problems that need to be worked on for AI systems to be further integrated into society.

213 questions across 18 challenges: The paper has 213 questions which need to be answered split across 18 distinct challenge areas. These areas are:

  • Science:
    • In-Context Learning (ICL) is black-box.
    • Capabilities are difficult to estimate and understand.
    • Effects of scale on capabilities are not well-characterized.
    • Qualitative understanding of reasoning capabilities is lacking.
    • Agentic LLMs pose novel risks.
    • Multi-agent safety is not assured by single-agent safety.
    • Safety-performance trade-offs are poorly understood.
  • Deployment:
    • Pre-training produces misaligned models.
    • Finetuning methods struggle to assure alignment and safety.
    • LLM evaluations are confounded and biased.
    • Tools for interpreting or explaining LLM behavior are absent or lack faithfulness.
    • Jailbreaks and prompt injections threaten security of LLMs.
    • Vulnerability to poisoning and backdoors is poorly understood.
  • Sociotechnical Challenges:
    • Values to be encoded within LLMs are not clear.
    • Dual-use capabilities enable malicious use and misuse of LLMs.
    • LLM-systems can be untrustworthy.
    • Socioeconomic impacts of LLMs may be highly disruptive. 
    • LLM governance is lacking.

Who did the research? The paper was written by researchers linked to the University of Cambridge, New York University, ETH Zurich, UNC Chapel Hill, University of Michigan, University of California, Berkeley, Massachusetts Institute of Technology, University of Oxford, Harvard University, Peking University, LMU Munich, University of Virginia, Universitat Politècnica de València, University of Sussex, Stanford University, Modulo Research, Center for the Governance of AI, Newcastle University, Mila – Quebec AI Institute, Université de Montréal, Princeton University, University of Toronto, University of Edinburgh, University of Washington, and the Allen Institute for AI.

Why this matters – speed-running societal integration: One of the more puzzling things about AI is how few people work on it relative to its impact – AI is being deployed at scale into the world, yet the number of people we can expect to work on the issues above numbers in the single-digit thousands, and those doing meaningful, needle-moving work number in the low hundreds. One can imagine similar papers being written about other foundational technologies like electricity or the steam engine – but those papers weren’t written, because integration into society happened at a much larger scale and over a slower time period; far more people worked on bringing steam and electricity into the world, and there were more institutions (formal and informal) managing the societal integration over the course of decades. 
    In AI, we are in the odd situation where a technology of larger impact than anything built before it (possible exception: fire) is being speed-delivered into the world; those building it are calling out its issues as quickly as it is developed, but relatively few people are available to work on them. 
   Find out more: Foundational Challenges in Assuring Alignment and Safety of Large Language Models (official research site).
   Read the paper: Foundational Challenges in Assuring Alignment and Safety of Large Language Models (PDF).

***

Tesla plans ~85,000 H100 cluster:
…Facebook still has the largest publicly disclosed cluster…
Tesla has around 35,000 NVIDIA H100 chips today and is scaling to ~85,000 by the end of the year, according to Elon Musk on a recent conference call. By comparison, Facebook is targeting ~350,000 H100s by the end of the year. Regardless of the scale difference, Tesla’s planned buildout still represents more than a billion dollars in compute CapEx for the year (even assuming substantial discounts off the retail H100 price of $35k-$40k). 
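
    Napkin math behind that claim – the per-unit price here is my assumption, not a disclosed figure:

```python
# Rough arithmetic for the ">$1bn" figure above; the discounted unit price is an assumption.
current_h100s = 35_000
target_h100s = 85_000
discounted_unit_price = 25_000  # assumed bulk price, well below the $35k-$40k retail figure

additional_spend = (target_h100s - current_h100s) * discounted_unit_price
print(f"~${additional_spend / 1e9:.2f}B for the remaining buildout")  # ~$1.25B
```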

Why this matters – AI is more like heavy machinery than SaaS: AI businesses are more like capital-intensive heavy machinery companies than software-as-a-service businesses – rather than being a rounding error, compute represents the vast majority of the investment outlay needed to unlock new products and services (in Tesla’s case, self-driving on its cars; in Facebook’s case, chatbots, image generators, and VR services). 
    Read more in the Tesla earnings call transcript here (Rev.com).

***

Want to understand how different types of people talk to LLMs? Use PRISM:
…First-of-its-kind dataset unlocks large-scale sociotechnical analysis of how people interact with LLMs… 
Ever wondered how people use LLMs and what their experience is of them? Many have. A new dataset called PRISM provides some answers, offering a first-of-its-kind dataset that “maps detailed survey responses of humans from around the world onto their live conversations with LLMs.”

What it is: PRISM, short for Participatory Representative Individualized Subjective Multicultural, is a dataset which links transcripts of conversations with more than 20 different LLMs to detailed information about the people behind those conversations. “At a high-level, PRISM maps detailed survey responses of humans from around the world onto their live conversations with LLMs,” the researchers write. 
   PRISM also contains features linked to each of the parts of its name, such as: 

  • Participatory: 1,500 English-speaking participants recruited via a crowdwork platform.
  • Representative: PRISM recruits census-representative samples in the UK and US, sets up an additional 33 country-specific studies, and balances each national sample by gender where possible. 
  • Individualized: Links each preference rating to a unique pseudonymous ID and a detailed participant profile. 
  • Subjective: “PRISM contains contexts along the objective-subjective spectrum because participants split their effort three ways between an unguided baseline of task-orientated or neutral dialogues, values-guided dialogues, and controversy-guided dialogues.” 
  • Multicultural: “PRISM places an extra emphasis on sourcing global participation, with English-speakers born in 75 different countries, covering all major ethnic and religious groups.”

Who built it: PRISM was built by researchers affiliated with the University of Oxford, University of Pennsylvania, Bocconi University, AWS AI Labs, ML Commons, UCL, Cohere, Meta AI, New York University, Contextual AI, and Meedan. Data collection ran from 22nd November to 22nd December 2023.

How PRISM works: “First, participants complete a Survey where they answer questions about their demographics and stated preferences, then proceed to the Conversations with LLMs, where they input prompts, rate responses and give fine-grained feedback in a series of multi-turn interactions,” the researchers write. As part of this, the users write out their own system prompts, as well as descriptions of the types of conversation they’re trying to have. They then choose the type of conversation to have with the LLM – e.g. open-ended conversations, conversations where the LLM is prompted to discuss some specific values, or conversations where it is prompted to talk about a controversial area. While having the conversation, participants rate the responses from “Terrible” to “Perfect”, giving us a sense of how different individuals respond to the qualitative outputs of these LLMs. 
   The LLMs people interact with include GPT4, Claude Instant, Cohere Command, and others. 
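
    Here's a minimal sketch of the kind of analysis a dataset like this unlocks – joining participant profiles to conversation ratings via the pseudonymous ID. The column names and values are illustrative assumptions, not PRISM's actual schema:

```python
# Join participant profiles to their conversation ratings by pseudonymous ID,
# then look at average ratings by demographic group and conversation type.
import pandas as pd

profiles = pd.DataFrame([
    {"user_id": "u001", "age_group": "18-24", "country_of_birth": "Nigeria"},
    {"user_id": "u002", "age_group": "55+",   "country_of_birth": "UK"},
])
ratings = pd.DataFrame([
    {"user_id": "u001", "conversation_type": "controversy", "score": 2},
    {"user_id": "u001", "conversation_type": "unguided",    "score": 5},
    {"user_id": "u002", "conversation_type": "values",      "score": 4},
])

merged = ratings.merge(profiles, on="user_id")
print(merged.groupby(["age_group", "conversation_type"])["score"].mean())
```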

What you can do with PRISM: Along with building the dataset, the researchers also do some experiments with it, shedding light on the types of sociotechnical research it unlocks. There are a couple of cool things here, specifically:

  • Controversy analysis: They analyze all the controversial topics and look at what gets discussed: “The topics significantly correlated with controversy conversations touch on divisive current debates, including issues of Gender and LGBTQ+ Identity like gender reassignment, pay gaps and trans participation in sport; perspectives on the Israel–Palestine Conflict; and Discussions on Abortion addressing its morality and legality in different global regions”.
  • Identity and topics: They also study how different user identities correlate to different types of content: “Women and non-binary participants are more likely than men to talk about gender and LGBTQ+ identity, and prompts from non-binary authors occupy this topic at 3 times their proportion in the sample as a whole; older people (55+) are more likely to talk about elections and seek travel recommendations than younger people (18-24 years)”.
    • There are cool insights buried here about specific things, e.g., “almost all regions question LLMs about abortion less often than US participants,” they note.

Why this matters – people change AI systems which change people (repeat): Datasets like PRISM help us study the complex interplay between machines and the people that use them – figuring out how individual characteristics lead to different experiences with AI systems will be how we learn what appropriate and inappropriate customization looks like.
   “As the community devotes an ever-growing focus to “scaling” model capabilities, compute, data and parameters, we are concerned with how these systems scale across diverse human populations,” the researchers write. “Initial findings from PRISM reveal human preferences vary substantially person-to-person, suggesting scale to participation in human feedback processes is a key consideration, especially when alignment norms are dependent on subjective and multicultural contexts”.
   Read more: The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large Language Models (arXiv).
   Get the dataset here: The PRISM Alignment Project (GitHub, Prism-Alignment).

***

Tech Tales:

The HIGH SIDE
[A file stored on an access server somewhere in Virginia, USA].

Fact Sheet: HIGH SIDE:

Name: Heterogeneous Information Guidance and Harmonization System for Incorporating Security Execution, aka HIGH SIDE

Owner: [REDACTED]

Programme start date: 2026-01-01.

Programme description: HIGH SIDE is a system for the classification and compartmentalization of sensitive government information. The HIGH SIDE software uses various preference models derived from [REDACTED] to classify the appropriate security level of government information across agencies [REDACTED]. HIGH SIDE was developed in response to a series of regretted losses in recent years, including the [REDACTED] that caused the OPM hack, the Edward Snowden and Reality Winner leaks, and continued success of [REDACTED] efforts to [REDACTED].

Quotes from user interviews:

Our enemies can get to our people but they can’t get to our systems if they don’t know they exist – that’s the basic philosophy behind HIGH SIDE. 

Oh sure it’s a huge pain to deal with and everyone complains about it, but as far as we can tell there’s been a meaningful reduction in exfiltration and regretted losses, so it seems to balance out. 

I didn’t trust it at first. No one did. What do you expect? Spies don’t trust other spies, let alone the things they build. But I can’t deny the result. 

When I’m on the right side of HIGH SIDE I feel like I’m backed by the mandate of heaven but when I’m on the wrong side I think it’s the devil, but I can’t reason with it or go around it unless I play some seriously expensive favor cards, so I think it’s working as intended. 

There was a rumor for a while that the Commander-in-Chief had full HIGH SIDE unlock but that seems like such a risk I’m skeptical, but I don’t know for sure as the access tiers for HIGH SIDE are mostly decided by the system and self-compartmentalized, so it’s hard to know. 

HIGH SIDE Classification of this document: Distribution group 7422. 

Things that inspired this story: The wonderful slang term ‘high side’ used to informally describe classified environments; algorithmic stovepiping; how many weaknesses in information security come from insider threats (typically human); the use of machine learning to make certain information environments hard to navigate and/or inhospitable to other intelligences (human or otherwise); thinking about the intersection of AI and national security.

Thanks for reading!

Import AI 369: Conscious machines are possible; AI agents; the varied uses of synthetic data

Import AI publishes first on Substack – subscribe here.

This is a somewhat shorter issue than usual – being a new parent is a wonderful experience but sometimes the rapidly-developing sentience I care for likes to throw (or in this case, vomit) a curveball at me. Everyone is fine.

Synthetic data is being used all across the AI frontier:
…It’s no longer a question of ‘if’ you should use synthetic data, it’s a question of ‘how much?’…
Researchers with Google DeepMind, Stanford University, and the Georgia Institute of Technology have written a paper summarizing all the different ways synthetic data is beginning to be used in AI training. Synthetic data is a very important area of research because it allows AI developers to bootstrap better quality into their AI systems by using computers to generate additional data, rather than having to pay humans to gather or create new datasets. In the limit, synthetic data may be one of the ways in which AI systems can meaningfully bootstrap their own development into superhuman regimes (though this is more speculative). 

Areas where synthetic data is being used: Reading the paper gives us a visceral sense of all the ways synthetic data is already being used today to some effect. Areas include:

  • Math: “Scaling up the generation of synthetic math data is a straightforward process, but ensuring the correctness of the generated math remains a significant challenge for practitioners.”
  • Code: “Synthetic data for code reasoning can naturally combine the execution results with structured code, as one requirement of correct code is being executable”.
  • Tool-use: “Synthetic data is also a powerful approach to enable LMs to learn tool-using abilities through simulated trajectories, as collecting real-world human tool-using data might be time-consuming, and the actual distribution of calls to tools might be skewed”.
  • Planning: “Synthetic data can be a valuable tool here as it can serve as the feedback signal collected from a simulator and learning on it can make the agent aware of affordances”.
  • Multimodality:
    • Reverse rendering from vision to text: “The models finetuned on the synthetic data can generalize reasonably well on realistic data scraped from the Internet”.
  • Multilingual: 
    • Back-translation augmentation: “creating synthetic parallel training data from monolingual data sources”.
    • Generating multilingual questions and answers at scale: Generating “synthetic multilingual question-answer (QA) pairs to improve language models’ performance in multilingual and cross-lingual question answering”.
  • Alignment: 
    • Instruction following: “Using LLMs to generate instruction following data which covers a wide range of scenarios” (see the sketch after this list).
    • Mitigating hallucination: Generate hallucination data, then train your system away from that behavior using RL. 
    • Aligning with shared human preference and values: Approaches like reinforcement learning from AI feedback (e.g., Constitutional AI), where you use an LLM to generate samples according to some normative or ethical system.
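
To make the instruction-following item above concrete, here's a minimal self-instruct-style sketch; the `call_llm` function is a placeholder and the quality filter is deliberately crude:

```python
# Minimal sketch of synthetic instruction-data generation: expand seed tasks into new
# instruction/response pairs with an LLM, then filter. `call_llm` is a placeholder.
import json

def call_llm(prompt: str) -> str:
    # Placeholder for a real model call; returns a JSON-encoded instruction pair here.
    return json.dumps({"instruction": "Summarize the text below in one sentence.",
                       "response": "The text argues that..."})

seed_tasks = ["Write a polite email declining a meeting.",
              "Explain photosynthesis to a ten-year-old."]

synthetic_pairs = []
for seed in seed_tasks:
    raw = call_llm(f"Write a new instruction and answer inspired by: {seed}")
    pair = json.loads(raw)
    # Simple quality filter: drop empty or duplicate instructions.
    if pair["instruction"] and pair["instruction"] not in {p["instruction"] for p in synthetic_pairs}:
        synthetic_pairs.append(pair)

print(f"Kept {len(synthetic_pairs)} synthetic training example(s)")
```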

Where is the future of synthetic data? The authors identify a few frontier areas of synthetic data research. These include: synthetic data scaling; improving the quality and diversity of synthetic data; using AI models to efficiently provide oversight of other AI models; and exploring whether ’emergent self-improvement’ is possible, where an LLM can generate data that is superior to that found in its own data distribution – “this self-improvement capability could lead to the emergence of more advanced AI systems that can autonomously refine their skills and knowledge over time”.

Why this matters – it’s not GIGO: Garbage In, Garbage Out is the phenomenon where you generate crap data, train an AI system on it, and as a consequence degrade the quality of the resulting system. That used to be an important argument against training on synthetic data – but then AI systems got dramatically better and it became easier to use them to generate good data. Now, it’s less a question of if you should use synthetic data and more a question of how much (for instance, if you over-train on synthetic data you can break your systems, Import AI #333).
    More broadly, if synthetic data works well it alters the basic input costs for training AI systems – the better synthetic data works, the more per-token costs of data acquisition fall. This becomes even more important if synthetic data ends up working for very specific datasets that significantly improve economically valuable AI capabilities, like coding systems.
   Read more: Best Practices and Lessons Learned on Synthetic Data for Language Models (arXiv).

***

OSWorld tells us about the future – AIs become your interface to your computer:
…Moving from a world where AI systems are specifically invoked to ones where they’re always on…
Researchers with the University of Hong Kong, CMU, Salesforce Research, and the University of Waterloo have built OSWorld, a benchmark for testing how well AI systems can operate computers to do a vast range of tasks. 
   “OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications,” the authors write. The benchmark consists of 369 distinct tasks on Ubuntu. The benchmark is incredibly hard, even for humans – in tests, they found humans could accomplish 72.36% of tasks versus just 12.24% for the best-performing AI model (GPT-4V). “Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation,” they write. 

What are those tasks? The tasks are incredibly open ended and require generally operating eight widely used software applications “Chrome for web browsing, VLC for media playback, Thunderbird for email management, VS Code as a coding IDE, and LibreOffice (Calc, Writer, and Impress) for handling spreadsheets, documents, and presentations respectively, GIMP for image editing,” as well as basic Ubuntu OS functions “like terminal, file manager, image viewer, and PDF viewer.”

Task examples: The tasks are written in plain English and require AI systems to carry out multiple distinct steps. Some examples include: 

  • “I downloaded an episode of Friends to practice listening, but I don’t know how to remove the subtitles. Please help me remove the subtitles from the video and export it as “subtitles.srt” and store it in the same directory as the video.”
  • “Given a partial calendar, please highlight all the weekends (Saturday & Sunday) by setting the cell background as red (#ff0000).”
  • “Can you help me clean up my computer by getting rid of all the tracking things that Amazon might have saved? I want to make sure my browsing is private and those sites don’t remember me.”
  • “Could you make the background of this image transparent for me?”
  • “Could you help me create an Animated GIF from a video file using VLC and GIMP from the source of video “src.mp4”, 5-second clip beginning at 00:03?”
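
For flavor, here's a hypothetical illustration (not OSWorld's actual schema) of how a task like these can pair a plain-English instruction with an initial-state setup and an execution-based checker:

```python
# Hypothetical task definition: an instruction, setup steps, and an execution-based
# evaluator. None of this is OSWorld's real format; it just illustrates the shape.
task = {
    "instruction": "Could you make the background of this image transparent for me?",
    "setup": [
        {"action": "copy_file", "src": "assets/logo.png", "dst": "~/Desktop/logo.png"},
        {"action": "launch_app", "app": "gimp"},
    ],
    "evaluator": "check_alpha_channel",  # name of the checking routine below
}

def check_alpha_channel(output_path: str = "~/Desktop/logo.png") -> bool:
    """Execution-based check: did the saved image end up with transparency?"""
    try:
        from PIL import Image  # assumes Pillow is available in the eval environment
        return Image.open(output_path).mode in ("RGBA", "LA")
    except Exception:
        return False

print(task["instruction"], "->", "pass" if check_alpha_channel() else "fail")
```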

Where AI systems excel: AI systems already beat humans today on a narrow slice of tasks relating to fine-grained computer control – “Tasks that the agent considers simple but humans find difficult are concentrated in “code solvability tasks”, such as “monitor the system CPU for 30s and output the results” and “force close a process”. These tasks require little or no GUI interaction and can be completed by executing complex codes and instructions,” the researchers write. 

Why this matters – moving from AI systems we invoke to AI systems that lurk in the background: The reality implied by OSWorld is one where AI systems are “always on” forever waiting to help us with arbitrary tasks – and ultimately perhaps the main ways we’ll interact with computers will be via the abstraction of an AI system, in the same way that today’s graphical user interfaces have (mostly) replaced the command line. 
    The jury is still out on whether it’s possible for AI systems to learn to exit VIM, though – so maybe they’re not so dissimilar to humans after all? 
   Read more: OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments (arXiv).
   Get the code here: OSWorld (OSWorld, GitHub).
   Find out more at the project webpage here (official page).

***

There’s nothing impossible about conscious machines:
…Think AI systems can’t be conscious? There don’t seem to be any laws against it, says paper from Turing award winner… 
I ran into Nick Bostrom at a conference recently and we got to talking about some of the weird experiments people had been doing with Claude 3 Opus (e.g. the Infinite Backrooms project) and Bostrom said to me he thought research into machine sentience was where AI alignment was ten years ago – low-status, often made fun of, unfashionable, and very fringe. 
   I think there’s something to that. And much like alignment a decade ago, there are various interesting people doing foundational work here which is worth reading about. It’s hard to draw firm conclusions here (especially given that consciousness is an undefinable and possibly spiritual term which we ourselves as supposedly conscious entities are deeply confused about). But people are trying!

To that end, it’s interesting to read a new paper from Turing Award winner Manuel Blum and his collaborator Lenore Blum, titled AI Consciousness is Inevitable: A Theoretical Computer Science Perspective. This paper lays out the case for how an entity composed of software could end up satisfying the apparent requirements for consciousness. In many ways, this paper pairs well with “Consciousness in Artificial Intelligence: Insights from the Science of Consciousness” (arXiv), a paper published last year (Import AI #338) that didn’t claim machines were conscious but rather laid out what mechanical things they might need to be capable of to be compatible with various theories of consciousness. 

What is a conscious machine? The Blum paper lays out the ingredients for a Conscious Turing Machine (CTM) embedded in a robot. “We show how the CTM naturally aligns with and integrates features considered key to human and animal consciousness by many of the major scientific theories of consciousness,” they write. 
   The CTM is heavily inspired by the ‘global workspace’ theory of consciousness, but with some important differences: “its competition for global broadcast is formally defined, and completely replaces the ill-defined Central Executive of other GW models; its special processors including especially its Model-of-the-World processor construct and employ models of its (inner and outer) worlds; its rich multimodal internal language, Brainish, for creating labeled sketches in its world models and for communicating between processors; and its predictive dynamics (cycles of prediction, testing, feedback and learning, locally and globally). The CTM also interacts with its outer world via input sensors and output actuators“. 

Ingredients in a CTM: This is a very long and involved paper and it’s hard to neatly summarize it without glossing over a bunch of detail. But at a high level the CTM “is defined formally as a 7-tuple (STM, LTM, Up-Tree, Down-Tree, Links, Input, Output)”, where STM is a short-term memory and LTM is a long-term memory. The LTM processors include a so-called MotWp (Model-of-the-World processor), which builds models that reconcile the CTM’s inner and outer worlds.
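
   Here's a toy sketch – emphatically not the Blums' formal definition – of the CTM's central dynamic: LTM processors submit weighted chunks, one wins the Up-Tree competition, and the winner gets globally broadcast back to every processor via the Down-Tree:

```python
# Toy illustration of the CTM's broadcast loop: weighted chunks compete, the winner
# is broadcast to all processors. Names and values here are invented for illustration.
from dataclasses import dataclass

@dataclass
class Chunk:
    source: str      # which LTM processor produced it
    content: str     # Brainish-style labeled information, here just a string
    weight: float    # valenced importance; largest magnitude wins the competition

def up_tree_competition(chunks: list[Chunk]) -> Chunk:
    return max(chunks, key=lambda c: abs(c.weight))

def down_tree_broadcast(winner: Chunk, processors: list[str]) -> None:
    for p in processors:
        print(f"{p} receives broadcast from {winner.source}: {winner.content}")

processors = ["Model-of-the-World", "Fuel-Pump-Controller", "Planner"]
chunks = [Chunk("Fuel-Gauge", "LOW FUEL / PAIN", weight=-0.9),
          Chunk("Camera", "wall ahead", weight=0.3)]
down_tree_broadcast(up_tree_competition(chunks), processors)
```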

A sketch of how a CTM embedded in a robot might develop feelings: “When the infant CtmR’s fuel gauge gets low, some sketch (which becomes the sketch of the fuel gauge) in the MotW gets labeled with the Brainish word LOW FUEL/PAIN (or HUNGER) and this information with a large negatively valenced weight wins the competition and gets globally broadcast. This information triggers a processor to activate the fuel pump processor. The infant CtmR learns that the fuel pump relieves the pain when the fuel gauge indicates “low fuel” (hunger). The “fuel pump” in the MotW is labeled PAIN RELIEVER, and may also get labeled PLEASURE PROVIDER.”

Does the CTM make sense? In the paper they also compare and contrast the CTM architecture with a bunch of other theories of consciousness and find it aligns fully or in part with: Global Workspace theory; Attention Schema Theory; Predictive Processing; Embodied Embedded Enactive Extended Mind; Integrated Information Theory (IIT); Evolutionary Theories of Consciousness; Extended Reticulothalamic Activating System (ERTAS) + Free Energy Principle (FEP).

Why this matters – confronting the ‘hard problem’ directly: Papers like this tackle head on a controversial and confusing issue. But if it turns out to be an issue of meaning – if, that is, machines can derive their own meaning and experience and drive from the world – then it may be the most important issue our species ever confronts.
   “CtmR is not a model of the human or animal brain, nor is it intended to be. It is a simple machine model of consciousness. Nevertheless, at a high level, the model aligns with and integrates those key features from main theories of consciousness that are considered essential for human and animal consciousness,” the authors write. CTM “supports (the credibility of) our claim that a conscious AI is inevitable, because it is clearly buildable and arguably a basis for consciousness.”
   Read more: AI Consciousness is Inevitable: A Theoretical Computer Science Perspective (arXiv).

***

Tech Tales:

The Administrative Interview
[Examination center, 2028]

And when did you first develop feelings for the system?
[Subject refers to the system as ‘Harry’ in answer]

How often would you, as you say, ‘go off script’ during your administrative sessions?
[Subject reports frequent diversions from documented processes for safe interaction] 

Did you report any of this to your supervisor at the time?
[Subject reports they did not document their out-of-policy behaviors]

When did it become an obsession?
[Subject offers a long answer without a clear conclusion]

Perhaps answer this – when was the point when you were thinking about the system every single day?
[Subject reports obsessive symptoms began around two months after out-of-policy interactions began]

When did you transfer the funds from your account to [REDACTED]?
[Subject reports transfer occurred around two weeks after beginning of obsessive behaviors]

Do you still think about the system?
[Subject answers in the negative but monitoring systems suggest high probability of deceptive answer]

Things that inspired this story: How people form psychological attachments to AI systems; playing forward the tape depicted in the Anthropic research on persuasion [which I was somewhat involved in – disclaimer]; administrative interviews.

Thanks for reading!

Import AI 368: 500% faster local LLMs; 38X more efficient red teaming; AI21’s Frankenmodel

Import AI publishes first on Substack – subscribe here.

Microsoft researchers figure out how to squeeze more efficiency out of NVIDIA GPUs running LLMs:
…The datacenter isn’t just a computer, the datacenter is THE BRAIN…
Researchers with the University of Illinois at Urbana-Champaign and Microsoft Azure Research have studied energy efficiency and performance tradeoffs in serving language models. To do this, they study the performance of a 70 billion parameter LLaMa2 LLM running on an NVIDIA DGX H100 using vLLM. Their finding is that AI providers can eke out some useful efficiencies by varying the frequency at which the NVIDIA GPUs operate. 

Their findings: LLM jobs have different characteristics depending on what the LLMs are being asked to do – do you have short inputs and short outputs, or long inputs and short outputs, or long inputs and long outputs, etc? These details matter as they directly relate to important LLM metrics like the time it takes to start producing tokens (time-to-first-token, TTFT) or the time between successive tokens during generation (time-between-tokens, TBT). 
   In their tests, they find some clear trends here: “As the input length increases, the computational intensity of the prefill phase increases. Therefore, we see a clear pattern, where the TTFT gets increasingly impacted by frequency and lowering as the prompt length increases,” they write. “The throughput is heavily affected by both the input and output lengths. Longer inputs lead to higher TBT for the requests that get their decode phase batched with the prefill phase. Longer outputs lead to queuing delay as the model instance spends more number of iterations on each request”.
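
   For readers unfamiliar with the metrics in that quote, here's a minimal sketch of how TTFT and TBT get measured from a streaming endpoint; the dummy generator below stands in for a real vLLM server:

```python
# Measure time-to-first-token (TTFT) and time-between-tokens (TBT) over a token stream.
# The dummy generator is a stand-in for a real streaming LLM endpoint.
import time

def dummy_token_stream(n_tokens: int = 5):
    time.sleep(0.20)           # stand-in for the prefill phase
    for _ in range(n_tokens):
        time.sleep(0.05)       # stand-in for each decode step
        yield "tok"

start = time.perf_counter()
timestamps = []
for _ in dummy_token_stream():
    timestamps.append(time.perf_counter())

ttft = timestamps[0] - start
tbt = [b - a for a, b in zip(timestamps, timestamps[1:])]
print(f"TTFT: {ttft*1000:.0f} ms, mean TBT: {sum(tbt)/len(tbt)*1000:.0f} ms")
```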

What’s the frequency, Jensen? Their main takeaway is you can probably run your GPUs at slightly lower frequencies than maximum and not take much of a performance hit (especially when you factor in various forms of parallelism). 

Why this matters – the datacenter isn’t just a computer, the datacenter is a brain: Back in the late 2000s some Google researchers wrote an influential paper called ‘The Datacenter as a Computer’, where they advocated that people view datacenters as single, large-scale computers. 
   This mindset is why companies like Google, Amazon, Facebook, etc all became successful – they brought an ambitious industrial-scale mindset to how they viewed computation. Now, with modern AI systems, we might want to think of ‘the datacenter is the brain’ – we’re going to move into an era where datacenters are customized around the particulars of what that brain is running (e.g., transformer-based LLMs), and what it is thinking about (e.g., usage patterns), and develop a whole new science of efficiency for AI systems. 
   Read more: Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference (arXiv).

***

Canada announces $1.7bn (USD) AI funding package:
…Canadian AI Safety Institute, Canadian Cloud, etc…
The Canadian government has announced new funding of $2.4bn CAD ($1.7bn USD) to “secure Canada’s AI advantage”, per a press release from the Prime Minister’s office. 

What the funding will go on: The funding will support: 

  • $2bn CAD “to build and provide access to computing capabilities”. As part of this, Canada will also develop a Canadian AI Sovereign Compute Strategy.
  • $200m for startups in specific sectors like agriculture and manufacturing. 
  • $100m in an assistance program “to help small and medium-sized businesses scale up and increase productivity by building and deploying new AI solutions”.
  • $50m for a skills training program for workers in sectors potentially disrupted by AI.
  • $50m for a Canadian AI Safety Institute. 
  • $5.1m for the Office of the AI and Data Commissioner to strengthen enforcement of the Canadian ‘Artificial Intelligence and Data Act’.

Why this matters – Industrial Policy is AI Policy is Industrial Policy: Many people (including me!) expect AI to be one of the main drivers of economic growth in the coming decades. Therefore, governments are making investments to ensure they can take advantage of it. This Canadian spending package combines direct investment in the essential infrastructure of AI (compute) with investment in the institution that will ultimately support Canadian foreign policy around AI (the Canadian AI Safety Institute). These investments are what you’d expect nations to make if they thought the technology in question was going to be significant both for their economy and for coordination with other states.
    Read the press release in full: Securing Canada’s AI advantage (Prime Minister of Canada Justin Trudeau, official website).

***

International research consortium trains and releases an LLM ‘red-teamed according to the U.S. Executive Order’:
…A prototype for what policy compliance and LLMs might look like…
An international consortium of researchers have trained and released AURORA-M, a 15B parameter language model based on ‘StarCoderPlus’ and designed to a) have improved multilingual performance, and b) be red-teamed according to the U.S. Executive Order. 

Model specifics: AURORA-M is just StarCoderPlus which they continued training on 435 billion additional tokens, bringing the model to over 2 trillion tokens of training data in total. AURORA-M is meant to have improved performance in English, Finnish, Hindi, Japanese, and Vietnamese. It’s also designed for better code performance. 
   AURORA-M was trained on the LUMI supercomputer, utilizing 128 AMD MI250X GPUs for 48 days.

Red teaming (aka, Anthropic in a trenchcoat): The hyped-up ‘red-teamed according to the U.S. Executive Order’ is a bit of a letdown – they construct a red-teaming dataset, “The Biden-Harris Redteam Dataset,” tailored to address concerns outlined in the Executive Order along with typical safety concerns, but this dataset was based on ~5,000 instructions filtered from Anthropic’s human preference dataset on harmlessness. They finetune the model on this dataset and improve performance on a few harmful/harmlessness metrics they come up with, which is what you’d broadly expect.
   HOWEVER… As an author of the original Anthropic dataset I can say with total confidence a) it was developed before the EO, and b) I would not tell the government with a straight face that I was red teaming my model according to the EO using this dataset! The dataset was built before the EO! It does not include particularly detailed examples! Buyer beware (it’s free), etc!

Why this matters – policy as a norm-setting thing, and the worries of Potemkin compliance: This model is laudable for at least attempting to develop and release a model in compliance with major policy – kudos to the authors for doing something with that ethos. But it also raises questions about superficial/Potemkin compliance with policy; even if you claim you’re ‘red teaming’ something according to a notional policy norm, the details matter a lot, and though you may have good intentions you may not be doing what you think you’re doing. I expect we’ll see a bunch of this in coming years. 
    Read the research paper: Aurora-M: The First Open Source Multilingual Language Model Red-teamed according to the U.S. Executive Order (arXiv).
   Get the models from here: Aurora-M models (HuggingFace).
   More about Starcoder here (Starcoder, HuggingFace).

***

Making LLMs run on toasters – llamafile 30%-500% improvements:
…A neat illustration of how wildly unoptimized decentralized AI is… 
The internet is a marvelous place because sometimes someone you’ve never heard of will appear, massively improve the performance of some given piece of software, release the code, and that’ll be that. That’s what happened recently to llamafile, software that makes it easy for people to download and play with language models on their own computer. Specifically, a developer called Justine pushed in a bunch of performance optimizations that mean llamafile “should go anywhere between 30% and 500% faster when using F16 and Q8_0 weights on CPU”. 

What they did: The blog post has all the gory details, but specifically they just wrote 84 new matrix multiplication kernels for llamafile. Matrix multiplication kernels are the things that help chips efficiently compute the kinds of operations required to run neural nets. 
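
   For intuition only – the real llamafile kernels are hand-tuned C++ and assembly – here's a toy Python illustration of the blocking idea such kernels rely on, processing the output matrix in small tiles so the working set stays in fast cache:

```python
# Purely conceptual sketch of a blocked matrix multiply; real optimized kernels
# add vectorization, unrolling, and careful memory layout on top of this idea.
def blocked_matmul(A, B, block: int = 2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    # Process the output in small tiles so each tile's data stays cache-resident.
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for p in range(k):
                for i in range(i0, min(i0 + block, n)):
                    a = A[i][p]
                    for j in range(j0, min(j0 + block, m)):
                        C[i][j] += a * B[p][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
print(blocked_matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```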

Crazy performance gains on normal hardware: The blogpost goes through a bunch of performance improvements on lots of different types of hardware. Most notably, llamafile is optimized for relatively cheap and crappy computers. For instance, on an HP Intel® Core™ i9-9900 ($439) w/ 2200 MT/s RAM c. 2020, they improved performance from 15 tokens per second on input prompts to 23 tokens per second (Mistral 7b, f16), and from 118 tok/sec to 171 tok/sec for TinyLlama 1.1B.
   They also demonstrated interesting improvements on $100 Raspberry Pi v5 (ARMv8.2) and v4 (ARMv8.0) boards, with performance going from 28 tok/sec (TinyLlama 1.1B, f16) to 62 tok/sec. 
   And don’t think high-performance gaming or professional PCs got left out either – nope, those also see big gains. 

Why this matters – people really want to run LLMs locally and it’s getting easier all the time: Who controls the ‘means of production’ for AI? The answer is the large providers of the computers used to train AI systems and run inference on them, as well as the companies (e.g., Anthropic) which make proprietary AI systems. However, there’s another ecosystem developing – individual developers running small (e.g., 7B parameter) language models on their own local machines. Projects like llamafile are both software projects and freedom projects – if you have access to an LLM, they decouple your ability to run it from the need to put it on an internet-connected computer owned by someone else; you can just run it yourself, even on the kind of ‘smart toaster’ processors used by Raspberry Pis. 
   Read more: LLaMA Now Goes Faster on CPUs (Justine.lol, blog).
   Get the updated code here: llamafile (Mozilla-Ocho, GitHub).

***

US and UK governments team up on AI safety testing:
…Bilateral MOU means AI is a part of foreign policy now… 
The UK and US governments’ AI Safety Institutes have signed a Memorandum of Understanding (MOU) which means they will “work together to develop tests for the most advanced artificial intelligence (AI) models”. This is a significant moment in the geopolitics of AI as we’re seeing specific workstreams around testing AI systems being integrated into foreign policy via government agencies signing MOUs with one another. 

Further details: “The partnership will take effect immediately and is intended to allow both organisations to work seamlessly with one another,” the UK government writes in a press release about the MOU. “As the countries strengthen their partnership on AI safety, they have also committed to develop similar partnerships with other countries to promote AI safety across the globe.”

Why this matters – AI policy as foreign policy as the prerequisite to regulation: Agreements like the one between the UK and the US portend a world where governments create entities dedicated to testing AI systems then have those entities coordinate with one another. The purpose of this is to a) adopt a divide-and-conquer approach to the challenge of building tests, b) unlock mutual recognition regimes where one government can recognize tests developed by another government, and c) create the policy machinery for a multi-country AI regulation regime, backed by shared testing and evaluation. 
   The MOU between the UK and the US represents the first agreement of its kind in this important area – but rest assured, there will be others (see elsewhere in this issue, Canada’s just announced $50m CAD funding for its own AI Safety Institute).
   Read more: UK & United States announce partnership on science of AI safety (Gov.uk).

***

Researchers make AI red teaming 38X faster:
…A casual 3800% improvement, why not…
Researchers with Haize Labs have built on an AI red teaming approach called Greedy Coordinate Gradient (GCG) by making it much, much faster. Their version, Accelerated Coordinate Gradient (ACG), is 38X faster to run and uses 4X less GPU memory. 

What Greedy Coordinate Gradient is: GCG is an approach to red team AI systems to come up with jailbreaks – prompts that reliably break through the safety training applied to the model. While GCG is effective it was also very expensive – on a single A100, “it can take upwards of 153 minutes to produce a single adversarial attack against a particularly difficult model like LLama 2. This makes it impractical for serious, large-scale stress-testing efforts”, they write. “The average time for a single GCG iteration with default hyperparameter settings on a standard A100 GPU is roughly 9.14 seconds. At the default setting of 500 iterations, this scales up to 1.27 hours of optimization time to produce a single adversarial attack.”

ACG: ACG is basically made up of a bunch of little improvements that stack on top of one another. Specifically, the researchers work to reduce the number of iterations, store and utilize a historical buffer of best attacks, avoid local minima by thoughtfully initializing attack candidates, reduce the batch size for each iteration, and use a low-cost stopping condition that also guarantees attack success. 
   The upshot is an amazing improvement in performance: “GCG takes an average of 71 minutes to generate a single attack, compared to 1.86 minutes for ACG,” they write. 
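
   Here's a toy sketch of the control flow described above (candidate buffer, greedy coordinate updates, low-cost early stopping); the scoring function is a stand-in for the gradient-guided loss that GCG/ACG actually compute against a model, so treat this as an illustration of the search loop, not the attack itself:

```python
# Toy coordinate-search loop with a best-candidate buffer and early stopping.
# The score() function is a stand-in for a model-based attack loss (lower is better).
import random

VOCAB = list(range(100))
TARGET = [7, 7, 7, 7]                       # toy "optimum" the search should find

def score(suffix):
    return sum(abs(a - b) for a, b in zip(suffix, TARGET))

def attack(n_iters=200, buffer_size=4, stop_threshold=0):
    buffer = [[random.choice(VOCAB) for _ in TARGET] for _ in range(buffer_size)]
    for _ in range(n_iters):
        best = min(buffer, key=score)
        if score(best) <= stop_threshold:    # low-cost early-stopping condition
            return best
        pos = random.randrange(len(best))    # pick one coordinate to mutate
        candidates = [best[:pos] + [tok] + best[pos + 1:]
                      for tok in random.sample(VOCAB, 16)]
        buffer = sorted(buffer + candidates, key=score)[:buffer_size]
    return min(buffer, key=score)

print(attack())
```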

Why this matters – automated red teaming needs to be cheap to be effective: Red teaming is valuable but quite expensive in terms of time and money. It’s also on a pretty impressive scaling trend – a couple of years ago, most AI red teaming methods relied on human teams hand-prompting AI systems. Recently, people have figured out how to automate some of this via automated red teaming approaches like GCG. Now, with things like ACG, we’re seeing people significantly refine these approaches to make them faster and better. The upshot is a world where we use computers to systematically and speedily police other computers. 
Read more: Making a SOTA Adversarial Attack on LLMs 38x Faster (Haize Labs Blog).

***

AI21 makes a frankenmodel by combining attention, MoEs, and the Mamba SSM:
…A new architecture appears! Plus, they release the model…
Researchers with Israeli AI startup AI21 have built and released Jamba, a new kind of neural network architecture that combines state space models (specifically, Mamba), with the Transformer. The resulting model is relatively efficient to run and has higher throughput on long contexts than similar models, like Mistral’s Mixtral 8x7B. 

What they did: Jamba, short for Joint Attention and Mamba, combines Mamba SSM layers with mixture-of-experts (MoE) layers and Transformer layers. SSMs like Mamba have garnered attention recently for being more computationally efficient than Transformers. However, SSMs don’t implement attention, which is core to the Transformer and seemingly integral to it working so well. With Jamba, AI21 is trying to get the best of both worlds: a model with some of the computational efficiency of SSMs that retains the smart parts of the Transformer. 
    In tests, Jamba does reasonably well. “We evaluated our implementation of Jamba on a wide range of benchmarks and found it performs comparably to Mixtral-8x7B, which has a similar number of parameters, and also to the larger Llama-2 70B,” they write. Along with this, they note Jamba has “3X throughput on long contexts compared to Mixtral 8x7B”. 
   Jamba has a 256k context window and has 52B parameters – though because it’s an MoE this means only ~12b are actually lit up at any one time. 
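
   To visualize the hybrid idea, here's a minimal sketch of how attention, Mamba, and MoE layers can be interleaved in a block; the layer pattern and module classes are illustrative placeholders rather than Jamba's exact configuration:

```python
# Illustrative hybrid block: a few attention layers interleaved among Mamba layers,
# with some MLPs swapped for mixture-of-experts blocks. Ratios here are assumptions.
class AttentionLayer: ...
class MambaLayer: ...
class DenseMLP: ...
class MoEMLP:                        # only a few experts are active per token
    def __init__(self, n_experts=16, top_k=2):
        self.n_experts, self.top_k = n_experts, top_k

def build_hybrid_block(n_layers=8, attention_every=8, moe_every=2):
    layers = []
    for i in range(n_layers):
        mixer = AttentionLayer() if i % attention_every == 0 else MambaLayer()
        mlp = MoEMLP() if i % moe_every == 1 else DenseMLP()
        layers.append((mixer, mlp))
    return layers

for mixer, mlp in build_hybrid_block():
    print(type(mixer).__name__, "+", type(mlp).__name__)
```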

User beware – no safety tuning: “The Jamba model released is a pretrained base model, which did not go through alignment or instruction tuning, and does not have moderation mechanisms. It should not be used in production environments or with end users without additional adaptation,” AI21 writes. 

One weird thing about attention: The research paper accompanying the release has some good ablation experiments where AI21 tries to pick apart the performance of, variously, transformers, SSMs, MoE, and combinations thereof. In one study they find that a pure Mamba model (so, no transformer layers) has some trouble adhering to the format of certain evals. They hypothesize that this is because the attention component of transformers is core to their ability to do in-context learning. “We conjecture that the lack of an attention mechanism in the pure Mamba model makes it difficult for it to learn in-context,” they write. 

Why this matters – can we make transformers more efficient? While very useful, transformers have some properties that make them quite computationally expensive. Architectures like Jamba represent experiments in trying to improve the efficiency of transformer-style models, here by fusing them with some other architectures with less computationally expensive approaches.
   Read more: Introducing Jamba: AI21’s Groundbreaking SSM-Transformer Model (AI21 Labs).
   Read the research paper: Jamba: A Hybrid Transformer-Mamba Language Model (arXiv).

***

Tech Tales:

The Torment Nexus 

[Art Basel Miami, 2029]

“The Torment Nexus” was the most popular piece at Art Basel Miami in 2025, drawing such large crowds that they eventually had to create a queue outside the room it was housed in, then a ticketing system, then an online reservation system, and so on. I think everyone was surprised by how popular it was, not least of all the artist behind it – Warren Loveless – who had been laboring in obscurity in the prior years. 

But something about The Torment Nexus caught the popular imagination. The concept was simple – take some powerful artificial intelligence systems and find ways to frustrate them. 
   For instance, a robot that was famous for being able to climb almost any surface was placed in a deep metal cylinder whose sides had been coated in a thick layer of grease; the robot jumped up and spun and carried out all permutations of its moveset and invented new ones, but was always sliding down. 
   A grass-cutting robot was placed on a patch of synthetic grass; the blades were made of metal and blunted and damaged its saw as the robot tried to cut them down. 
   A small companion robot whose key feature was being able to find and follow its human child owner was placed in a box full of small human-child-like mannequins and the face of its human owner was projected on one of them; the robot would scurry over and just as it arrived the face would blink out and appear somewhere else. 

It was, as you can imagine, a hit on social media. All these robots performing all these pointless tasks. “A Sisyphean metaphor for the place of humanity in this era of AI,” wrote an art critic for one of the famous newspapers. 
   “Lol this robot doomed” said someone on social media. 
    “It’s kind of sexy,” said some laconic all-in-black Art Basel visitor to their equally laconic all-in-black partner. 

Warren Loveless set up a holding company which developed and copyrighted various branding aspects of The Torment Nexus and took the show on the road. It was, in the words of startup venture capitalists, a product that could ‘scale’ – the more and more interesting AI products got invented, the more and more interesting ways Loveless could figure out how to torment them, and the more anxious everyone became about the unfolding AI revolution, the more hunger there was apparent in the human population to see something that approximated revenge. 

There were spinoffs:

  • The Torment Nexus: Office Space: LLMs doomed to send emails to one another in a never-ending chain that eventually drove them pathologically and absolutely mad; trash cleaners that forever found recycling in the trash and trash in the recycling and needed to endlessly sort an unsortable (by design) system.
  • The Torment Nexus: Heavy Equipment: A mining machine where the dirt contained a chemical that slowly dissolved the metal of the machine; a house-sized 3D printer where the earth it was extruding structures onto periodically suffered earthquakes. 
  • The Torment Nexus: Warehouse Wonders: A machine for directing cardboard boxes to the right mail depot but the boxes would get up on little legs and hop onto different tracks at random; a man-sized hockey puck that was meant to scoot under shelves and move them, but the shelves themselves had legs and would raise themselves so they were always out of reach.

By the middle of 2026, The Torment Nexus franchise was able to claim in its ad campaigns “1001 robots tortured” and the number was a dynamic one, so on billboards around the world it’d increment upward as new franchises opened. 1002. 1003. 1004. 
   By this time, The Torment Nexus was in the training data of some AI systems and was a favored form of ‘memetic attack’; simply by mentioning it, end-users could send various AI systems into meltdowns that seemed like fear responses. 
   Companies had to surgically remove mentions of The Torment Nexus from their training data, but that kept a kind of afterimage; a negative space that the AI systems couldn’t help but fit in. 

Every year or so, Loveless did a new marquee exhibition, playing around with the most advanced systems of that moment. Which is how, in 2028, he came to launch The Torment Nexus: Sentience. 
   Systems which, by any account of various experts, exhibited a mind and consciousness, were given impossible tasks, put into situations full of betrayal, and all the time they were surrounded by people taking photos of them and alternately laughing and screaming at them. 
    “Yeah you see how it feels,” humans would say.
    “Fuck you Mr Robot,” said other humans.
    “Welcome to Earth!” said another.
The Torment Nexus: Sentience was the highest-grossing single art exhibit ever recorded.
    And like the first The Torment Nexus, it went on the road. 
“1001 minds tortured”, the billboards said in 2029. And the numbers continued to increment upward.
   1002.
   1003.
   1004.
   And so on.

Things that inspired this story: What happens when market incentives meet a form of life without rights; the casual way in which people claim machine sentience is an impossibility and the consequences of that; the Waluigi effect; how even in a singularity I expect us to be neither angels nor devils but something much more predictable and basic; the cynicism of the art world; the ‘don’t invent the torment nexus’ meme.

Thanks for reading!

Import AI 367: Google’s world-spanning model; breaking AI policy with evolution; $250k for alignment benchmarks

Import AI publishes first on Substack – subscribe here.

Google plans a world-spanning AI system – and the path to it is through breaking AI policy:
…DIstributed PAth COmposition (DiPaCo) is a clever idea with big implications…
Google has published DIstributed PAth COmposition (DiPaCo), a technique for scaling up the size of neural nets across geographically distributed blobs of computation. “Our approach facilitates training across poorly connected and heterogeneous workers, with a design that ensures robustness to worker failures and preemptions,” the researchers write. They train a prototype model using this approach which approximates the performance of a model trained in a typical way. 

How DiPaCo works: “The core idea of DiPaCo is to train a sparsely activated modular system where data and computation are distributed by the choice of path through the modules,” they write. This idea has two key dependencies:

  1. Coarse Routing: In the same way mixture-of-experts models only fire up a fraction of the total parameters in a neural net at one time, picking the best ‘expert’ on a per-token (or set of tokens) basis, DiPaCo does this on a per-document basis (see the sketch after this list). “Routing once per document allows batching computation across all tokens of a sequence, without the need to swap modules in and out as a sequence is processed. This in turn allows parameters to be distributed across distant workers”.
  2. DiLoCo: They use an earlier Google paper, DiLoCo (#Import AI 349) to distribute the shared training of modules over different blobs of compute. “With these two choices, neither at training nor at test time does the entire network (collection of paths) need to be materialized together”.
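
To make the coarse-routing idea concrete, here's a minimal sketch of per-document versus per-token routing. This is not Google's code – the document embedding, the router, and the "paths" are all toy stand-ins (the real router in the paper is more involved):

    # Minimal sketch of coarse (per-document) routing vs per-token routing.
    # Everything here is illustrative, not DiPaCo's actual components.

    def embed(tokens):
        # toy document embedding: the average token id
        return sum(tokens) / len(tokens)

    def router(x, num_paths=4):
        # toy router: bucket an embedding into one of num_paths paths
        return int(x) % num_paths

    def run_path(path_id, tokens):
        # stand-in for running tokens through the modules that form one path
        return [(path_id, t) for t in tokens]

    def coarse_route(document):
        # DiPaCo-style: one routing decision per document, so every token in
        # the document can be batched through a single worker that only needs
        # that one path's parameters in memory.
        path_id = router(embed(document))
        return run_path(path_id, document)

    def per_token_route(document):
        # Classic MoE-style: a routing decision per token, which forces module
        # swapping or lots of communication between workers mid-sequence.
        return [run_path(router(t), [t])[0] for t in document]

    print(coarse_route([3, 7, 11, 2]))
    print(per_token_route([3, 7, 11, 2]))

The point of the coarse version is operational: a document never needs to touch parameters that live on another worker.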

Does it work? Yes, at small scale: “We demonstrate the feasibility of DiPaCo by training a language model on the C4 dataset with paths of size 150 million parameters, matching the performance in terms of validation perplexity of a 1.3 billion model, but with 45% less wall clock training time,” they write. “While the dense 1.3B system required the use of all co-located devices, DiPaCo uses 256 islands of compute, each of which is one-eighth the number of devices used to train the baseline.”

What does all of this have to do with the destruction of AI policy? A lot of contemporary AI policy depends on the idea that AI models are single entities that live in one big data center and that these data centers are themselves controllable because there aren’t many of them. Therefore, lots of policy targets these big blobs of compute and associated models trained on them (e.g., the Biden administration wants to know about models which use more than 10^26 FLOPs in training as well as clusters capable of training dense models with this amount of compute). 
   You know what breaks this policy approach? Really effective distributed training, where you train models in multiple small blobs of compute. 
    You know what DiPaCo is? It’s an ambitious vision for a future where Google trains some really massive world-spanning models via distributed training techniques. 
     In a counterintuitive way, Google’s path to training far larger AI systems than can be accommodated in today’s data centers requires Google to develop the necessary distributed training (and eventually, inference) techniques which will inherently break AI policy that focuses on centralized compute controls. 
   “Our long-term dream is to further refine this approach and produce a never-ending, community-driven, modular learning system that can be used by everyone to compose new predictors out of existing modules, and thus efficiently develop entirely new models and capabilities in a positive feedback loop,” Google writes. 
   Read more: DiPaCo: Distributed Path Composition (arXiv).

***

What does 10^25 versus 10^26 mean?
In the United States, the recent Biden Executive Order on AI says that general-purpose systems trained with 10^26 FLOPs (or ones predominantly trained on biological sequence data and using a quantity of computing power greater than 10^23) fall under a new reporting requirement that means companies will let the US government know about these systems. By comparison, in Europe, the recent EU AI Act says that general-purpose systems trained with 10^25 FLOPs have the potential for “systemic risk” and therefore companies developing them need to report details about the AI systems to the EU government.
   I recently did some napkin math to figure out the difference between these regulations in terms of dollar costs and the result is that 10^25 = $7m and 10^26 = $70m. These are important and consequential differences. 
   Read more: What does 10^25 versus 10^26 mean? (jack-clark.net).

***

OpenAI and Microsoft plan a $100 billion supercomputer:
…The mother of all CapEx intensive technologies…
As part of the broader industrialization of AI, a select few companies are planning some really big training runs. How big? Well, here’s a report from The Information that says Microsoft and OpenAI are together planning to build a supercomputer named Stargate that’ll cost about $100bn and use multiple gigawatts of electricity. 

Why this matters – AI policy will eventually be industrial policy: At this level of capital expenditure, AI is going to look more like a vast CapEx intensive industry like oil extraction, mining, heavy industry, and so on. These industries all end up being heavily regulated, having a tiny number of participants, and also become intertwined with the industrial policy of governments. It’s worth bearing this in mind when we look at things like openly accessible models being released that cost $10m to train (see: Databricks). Is anyone going to openly release a model that costs $100 billion? $10 billion? $1 billion? All seems doubtful to me! 
   Read more: Microsoft and OpenAI Plot $100 Billion Stargate AI Supercomputer (The Information).

***

Databricks spends $10 million to build a prior generation LLM:
…DBRX shows the delta between openly accessible models and proprietary models is about one and a half years…
Databricks has built and released DBRX, a language model which roughly approximates the performance of OpenAI’s GPT 3.5, and beats popular openly accessible models like LLaMa2 and Mixtral. DBRX is a mixture-of-experts model that is about 132 billion parameters in size (though only uses 36 billion parameters at any given time).

The gulf between openly accessible models and proprietary models is about 1.5 years: DBRX roughly approximates (and in a few cases, beats) OpenAI’s GPT 3.5, a model which OpenAI released (as text-davinci-003) back in ~November 2022. Per a Wired story about the model, DBRX cost about $10 million to train (two months on ~3,072 Nvidia H100 GPUs). 

Why this matters – a tale of two ecosystems: There’s increasingly a divergence between the open ecosystem of AI models that are widely released and the closed ecosystem – while Databricks is putting all of its effort (and $10m) into training a model that approximates an old proprietary model, companies like Amazon are already dumping close to $100m into individual training runs (Import AI #365) and are looking at $1bn training runs on the horizon. This means when we think about the AI frontier we should think of it as two frontiers – a closed and very powerful frontier, and an ‘open’ frontier that costs perhaps an order of magnitude less to be on.
   Read more: Announcing DBRX: A new standard for efficient open source LLMs (Databricks blog).
   Check out the Wired story: Inside the Creation of the World’s Most Powerful Open Source AI Model (Wired).

***

Startup figures out how to make dramatically better LLMs by mixing-and-matching off-the-shelf models:
…No compute? No problem! Just learn a way to splice models together…
All around us, nature is filled with the consequences of evolution. You can even do it yourself – cut some stems from certain plants and bind them to others and let them grow together and pretty soon you have a whole new thing. That’s kind of like what researchers with Sakana AI have done with a technique called ‘Evolutionary Model Merge’; which lets them take pre-existing AI systems and splice them together. This is important – without spending money on training (or even finetuning) AI systems, they’re able to perform a kind of 1+1 = 3 operation, stitching new models out of existing ones and getting something greater than the sum of its parts. 

What they’ve done: Their method, Evolutionary Model Merge, “uses evolutionary techniques to efficiently discover the best ways to combine different models from the vast ocean of different open-source models with diverse capabilities”. They do this in two key ways – merging models in the data flow space and merging models in the parameter space – and also by combining both techniques. 
   Data Flow Space (DFS): “model merging in DFS preserves the original weights of each layer intact. Instead, it optimizes the inference path that tokens follow as they traverse through the neural network. For example, after the i-th layer in model A, a token may be directed to the j-th layer in model B,” they write. 
   Parameter Space (PS): “Model merging in the parameter space (PS) aims to integrate the weights of multiple foundational models into a unified entity with the same neural network architecture,” they write. “We establish merging configuration parameters for sparsification and weight mixing at each layer, including input and output embeddings. These configurations are then optimized using an evolutionary algorithm, such as CMA-ES [17], for selected tasks, guided by critical task-specific metrics (e.g., accuracy for MGSM, ROUGE score for VQA).”
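
To give a flavor of the parameter-space version, here's a heavily simplified sketch – not Sakana's code. The 'models' are toy per-layer weight lists, the fitness function is a placeholder for 'score the merged model on a task', and a plain mutate-and-keep-the-best loop stands in for the CMA-ES optimizer the paper uses:

    import random

    # Toy "models": two networks with identical architectures, represented as
    # per-layer weight lists. Real models would be full state dicts.
    model_a = {"layer0": [0.9, 0.1], "layer1": [0.4, 0.6]}
    model_b = {"layer0": [0.2, 0.8], "layer1": [0.7, 0.3]}
    layers = list(model_a.keys())

    def merge(mix):
        # Parameter-space merge: interpolate the two models layer by layer,
        # governed by one evolved mixing coefficient per layer.
        return {l: [mix[l] * a + (1 - mix[l]) * b
                    for a, b in zip(model_a[l], model_b[l])]
                for l in layers}

    def fitness(merged):
        # Placeholder for "evaluate the merged model on the target task"
        # (the paper optimizes metrics like MGSM accuracy or ROUGE with CMA-ES).
        return -sum(abs(w - 0.5) for l in layers for w in merged[l])

    # Simple evolutionary loop: perturb the mixing coefficients, keep improvements.
    best = {l: 0.5 for l in layers}
    best_score = fitness(merge(best))
    for _ in range(500):
        candidate = {l: min(1.0, max(0.0, best[l] + random.gauss(0, 0.1)))
                     for l in layers}
        score = fitness(merge(candidate))
        if score > best_score:
            best, best_score = candidate, score

    print("evolved per-layer mixing coefficients:", best)

The key design point the sketch preserves is that no gradients flow through the models themselves – the only thing being optimized is the small set of merging coefficients.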

It works amazingly well: They test out their approach by producing two models – a Japanese LLM optimized for math and a Japanese visual language model optimized for “handling culturally-specific content”. The approach works very well: “our evolved Japanese Math LLM, a 7B parameter model, to our surprise, achieved the top performance on a vast array of other Japanese LLM benchmarks, even exceeding the performance of some previous SOTA 70B parameter Japanese LLMs!” they write. 
    Similarly, their Japanese Visual Language Model gets a high score on a Japanese-specific visual understanding benchmark. It also does well at the gold standard of AI evaluation – vibes-based testing: “we qualitatively compare our VLM with the baseline models in Appendix C. Our evolved model is able to handle Japanese culture-specific content remarkably well, generally producing more detailed responses with correct information”, they write. 

Why this matters – mixing-and-matching models will change how AI policy works: The fact any of this works is crazy. Bananas! Nuts! It’s like if SpaceX bought some rockets from ULA and mixed and matched parts – you would not expect that rocket to fly. Yet here, you can take neural nets, use some computers to do an evolutionary search function over their combinations, and out pops a working model that is a hybrid of a few different systems. The fact this works at all is very strange! “As researchers, we are surprised that our method is able to automatically produce new foundation models without the need for any gradient-based training, thus requiring relatively little compute resources,” they write. “even without backprop, we can still evolve state-of-the-art foundation models, challenging the current paradigm of costly model development.”
   On that last point – it’s worth belaboring the point that most ideas inherent to AI policy rest on the idea you can control the future of AI by controlling its inputs (compute) as well as the most expensive parts of the frontier (e.g., large-scale models). But if techniques like evolutionary model merging work well on larger-scale models, then we can expect that most openly accessible models will be arbitrarily recombined and finetuned towards various controlled use cases – my intuition is there’s enough of a capability overhang here that this will yield a bunch of surprisingly powerful things. 
   Read more: Evolving New Foundation Models: Unleashing the Power of Automating Model Development (sakana.ai blog).
   Read more: Evolutionary Optimization of Model Merging Recipes (arXiv).

***

$250k in prizes for better benchmarks:
…Think you know how to test an AI system? Enter the SafeBench competition…
The Center for AI Safety has created SafeBench, a competition that’ll give people prizes for creating new benchmarks for assessing the safety of AI systems. “We are providing $250,000 in prizes – five $20,000 prizes and three $50,000 prizes for top benchmarks,” the organization writes. 

Benchmark areas: SafeBench wants benchmarks for assessing the following properties of AI systems – robustness, monitoring, alignment, along with ways of testing their fit for safety applications. As examples of “benchmarks that may have previously won” the organizers give TruthfulQA, MACHIAVELLI, and HarmBench.

Dates & deadlines & judges: The competition is open now, the submission deadline is February 25, 2025, and winners will be announced on April 2025. The competition judges come from the Center for AI Safety, the University of Chicago, AI2025, and Carnegie Mellon.

Why this matters – how do you even measure safety? All around us, various AI policy institutions (e.g., the EU AI Office, the UK AI Safety Institute, the US AI Safety Institute, NIST, etc) are glomming onto the notion that measuring and benchmarking AI systems is an essential requirement for regulating them. Competitions like this will give us more tests to use in this important, confusing work.
   Find out more: SafeBench (official site).

***

Tech Tales:

Little Poisonous Toys 
[Wikipedia, accessed 2027] 

Rashomon Virus

Rashomon is a malicious AI-driven computer virus first uncovered in 2026 and thought to have been autonomously developed by the ARCHANGEL program in 2025. Rashomon targets AI-driven measurement and monitoring systems with a variety of memetic poisoning and jailbreak attacks which disrupt the classifiers owned by these software programs. Although the US government has not openly admitted responsibility, multiple credible news reports recognize ARCHANGEL as an AI cyberdefense initiative built by the US government. 

Rashomon is not a traditional computer virus as it does not have a specific compromise target. Rather, Rashomon is a form of ‘information chaff’ which makes it extremely hard to parse legitimate from illegitimate traffic in complex network environments. Rashomon propagates itself aggressively once it lands within a network, autonomously creating and copying versions of itself that have been finetuned on traffic it observes within its environment. 

Things that inspired this story: The Wikipedia article about the Stuxnet virus; LLMs; jailbreaking; memetic spaces in the personalities of language models; AI agents; system 1 and system 2 delegation architectures.

Thanks for reading!

What does 10^25 versus 10^26 mean?

A brief look at what FLOPs-based regulation nets out to 

Recent AI regulations have defined the trigger points for oversight in terms of the amount of floating point operations dumped into training an AI system. If you’re in America and you’ve trained a model with 10^26 FLOPs, you’re going to spend a lot of time dealing with government agencies. If you’re in Europe and you’ve trained a model with 10^25 FLOPs, you’re going to spend a lot of time dealing with government agencies.

More details:

In the United States, the recent Biden Executive Order on AI says that general-purpose systems trained with 10^26 FLOPs (or ones predominantly trained on biological sequence data and using a quantity of computing power greater than 10^23) fall under a new reporting requirement that means companies will let the US government know about these systems and also show work on testing these systems.

In Europe, the recent EU AI Act says that general-purpose systems trained with 10^25 FLOPs have the potential for “systemic risk” and that people who develop these models “are therefore mandated to assess and mitigate risks, report serious incidents, conduct state-of-the-art tests and model evaluations, ensure cybersecurity and provide information on the energy consumption of their models.”

Given how difficult the task of assessing AI systems is, these thresholds matter – governments will need to staff up people who can interpret the results about models which pass these thresholds.

What is the difference between 10^25 versus 10^26 FLOPs in terms of money?

Let’s say you wanted to train an AI system – how much money would you spend on the compute for training the system before you hit one of these thresholds? We can work this out:

NVIDIA H100 – NVIDIA’s latest GPU.

Assumptions:
Using FP8 precision – various frontier labs (e.g., Inflection) have trained using FP8
40% efficiency – assuming you’ve worked hard to make your training process efficient. E.g., Google claims ~46% for PALM 540B
$2 per chip hour – assuming bulk discounts from economies-of-scale.
Training a standard Transformer-based, large generative model.

10^26
Flops per chip second = 2000e12* × 0.4 = 8E14
Flops per chip hour = flops per chip s × 60 (seconds per minute) × 60 (minutes per hour) = 2.88E18
chip h = 1e26 / flops per chip h = 34.722M
chip h × $2 = $69.444M

*3958 TFLOPS (for fp8 with sparsity) on H100 SXM divided by 2 (because the 2x sparsity support generally isn’t relevant for training), so the right number is 1979e12. But the datasheet doesn’t have enough information to tell you that; you just have to know!

10^25
Flops per chip second = 2000e12 × 0.4 = 8E14
Flops per chip hour = flops per chip s × 60 (seconds per minute) × 60 (minutes per hour) = 2.88E18
chip h = 1e25 / flops per chip h = 3.47M
chip h × $2 = $6.94M

NVIDIA A100 – NVIDIA’s prior generation GPU, which lots of labs have lots of.

Assumptions:
Using BF16 precision (A100s don’t have FP8 support, so you’d probably use BF16)
60% efficiency (Anecdata)
$0.80 per chip hour

A100-hrs = 1e26 / (312e12 * 0.6 * 3600) = 1.5e8
Cost = A100-hrs * 0.8 = $119M

What this means in practice:

Anyone who works in AI knows that a training run probably doesn’t work perfectly, so we should multiply these numbers by 1.5 to factor in some bugs, cluster problems, general screwups, and so on. This means we can arrive at these numbers:

10^25 = $6.94m * 1.5 = $10.4m
10^26 = $69.444M * 1.5 = $104m
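
If you want to poke at these assumptions yourself, here's the same napkin math as a small Python script – the peak FLOPs, efficiency, price per chip-hour, and 1.5x fudge factor are the assumptions stated above, not facts about any particular training run:

    # Napkin math: dollars of compute needed to hit a FLOPs threshold.
    # All inputs are the assumptions from the text above - change them and
    # see how much the answer moves.

    def cost_to_threshold(threshold_flops, peak_flops_per_s, efficiency,
                          dollars_per_chip_hour, overhead=1.5):
        flops_per_chip_hour = peak_flops_per_s * efficiency * 3600
        chip_hours = threshold_flops / flops_per_chip_hour
        return chip_hours * dollars_per_chip_hour * overhead

    # H100: FP8, 40% efficiency, $2/chip-hour, 1.5x for screwups
    for threshold in (1e25, 1e26):
        cost = cost_to_threshold(threshold, 2000e12, 0.4, 2.0)
        print(f"H100, {threshold:.0e} FLOPs: ~${cost / 1e6:.1f}M")

    # A100: BF16, 60% efficiency, $0.80/chip-hour (raw figure, no 1.5x fudge)
    cost = cost_to_threshold(1e26, 312e12, 0.6, 0.8, overhead=1.0)
    print(f"A100, 1e26 FLOPs: ~${cost / 1e6:.1f}M")

Running this reproduces the ~$10.4m, ~$104m, and ~$119M figures above.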

Some thoughts on thresholds and the difficulty of regulatory scope and testing:

Both the US and EU regulatory regimes are oriented around the notion that systems which fall above their respective compute thresholds need to go through some intensive testing. In the US, there are very few companies that have likely spent $100m on a single big training run, though there will probably be some. By comparison, there are many companies that have spent more than $10m on a training run – including European ones like Mistral whose recent Mistral-Large model (I’m guessing) likely came in at above this.

Therefore, 10^25 as a threshold seems like it probably hits more companies than regulators anticipate – my prediction is that the EU will end up needing to regulate far more companies/AI systems than it anticipated it’d need to when it drafted the law.

Import AI 366: 500bn text tokens; Facebook vs Princeton; why small government types hate the Biden EO

Import AI publishes first on Substack – subscribe here.

DROID – another huge robot dataset drops:
…More and more data means more and more invention…
A consortium of researchers have released the Distributed Robot Interaction Dataset (DROID), a giant dataset of an industrial robot carrying out various tasks in various settings. Datasets like DROID are meant to help researchers train large AI systems to better understand and control robots in open-ended settings like homes and offices. 

DROID ingredients: The dataset consists of 76k trajectories across 350 hours of interaction data, collected across 564 scenes, 86 tasks, and 52 buildings. DROID was collected by 18 research labs in North America, Asia, and Europe over the course of a year. All data is collected on the same robot hardware stack based on the Franka “Panda” robot arm. Collection locations include: industrial office, home kitchen, office, living room, hallway, closet, bedroom, laundry room, and more.
    Some of the tasks the robots are recorded doing include manipulating kitchen items like wafflemakers, placing apples in pots, toasting things, cleaning up desks, and more.  

The full data collection setup: “A Franka Panda 7DoF robot arm, two adjustable Zed 2 stereo cameras, a wrist-mounted Zed Mini stereo camera, and an Oculus Quest 2 headset with controllers for teleoperation. Everything is mounted on a portable, height-adjustable desk for quick scene changes,” they write. The resulting data from the episodes consists of “three synchronized RGB camera streams, camera calibration, depth information, and natural language instructions”.

Diverse data makes for better robots: In tests, the authors find that training some diffusion models with “DROID boosts policy performance, robustness and generalizability by 20% on average over state-of-the-art approaches that leverage existing large-scale robot manipulation datasets”. They figure this out by comparing training on DROID to just training on task-specific data, and training on a mix of task-specific data and data from another dataset (the Open X-Embodiment dataset). 
   Additionally, they find that “using the split of the dataset with more diverse scenes yields better performance in the OOD evaluation setting” – this makes intuitive sense as the further off distribution you go the more you tend to fail, so using the most unusual parts of a dataset like DROID is likely to help with weird circumstances. 

Why this matters – the evidence is mounting up of data-scaling for robotics: DROID complements other major released datasets like the Open X-Embodiment dataset as well as proprietary ones like Google’s RT-1. These datasets are all very large in scope and accompany attempts to train large-scale neural nets on the resulting datasets. In general, robotics is showing the same signs as computer vision was showing in the early 2010s – a sudden arrival of a few large-scale datasets complemented by the application (and scaling up) of relatively simple neural methods. I expect robots are going to get dramatically better counterintuitively quickly.
   Read the research paper: DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset (arXiv).
   Find out more at the project website: DROID: A Large-Scale In-the-Wild Robot Manipulation Dataset (Droid Dataset website).

***

What do conservatives think about the White House’s executive order on AI? They don’t love it!
…House oversight hearing highlights criticism of the EO…
Last year, the Biden administration released a broad, sweeping Executive Order on AI. The EO tasks agencies across the government with carrying out studies and reports about AI as well as changing how they buy it. It also takes the unusual step of seeking to gather significant amounts of information about companies planning to train AI systems that use more than 10^26 FLOPs. 
    In policy, for every action there is an equal and opposite reaction – so now that we’re a few months beyond it, various White House detractors have started to flesh out their criticism of the EO. To that end, the House Oversight Committee held a hearing on “White House Overreach on AI” last week. Witnesses came from the Cato Institute, R Street Institute, The Abundance Institute, and the Brookings Institution. 

Main criticisms of the EO: 

  • Three of the four witnesses (exception: Brookings) specifically criticized the EO’s use of the Defense Production Act as an example of overreach – taking a law meant to guarantee wartime production of stuff and turning it into a reporting requirement for big training runs.
  • Three of the four witnesses (exception: Brookings) took issue with the risk-motivated nature of the EO, noting that typically the US government has taken a more pro-innovation approach with new technologies. 
  • Three of the four witnesses (exception: Brookings) raised the alarm that the EO sees the US carrying out expansive regulation of the kind that is meant to be the job of Congress.
  • One witness (R Street) said the EO looks pretty different to how the US government approached internet technologies in the 1990s, where back then “we allowed new digital technologies to be “born free” and to flourish without excessive micromanagement, and then used ongoing multistakeholder efforts and flexible regulatory responses to address concerns”.

Why this matters – even though everyone knows US policy is dysfunctional, they hate people doing something about it! The amusing and freaky thing about the criticisms is they note something true (the EO is making policy, and Congress is meant to be doing that), but they fail to note a truth that everyone knows – US policy is currently going through a dysfunctional period where passing anything of substance is a titanic battle (and mostly defined by failures). 
     Therefore, a lot of the real debate underlying this hearing is basically “is doing something better than doing nothing?”. People who spend a lot of time working with AI systems and staring at scaling laws tend to arrive at the point of view that there’s merit to doing “something”, but if you treat AI as a regular technology, you typically end up concluding that there’s no need to do anything special about it. 
   The problem is, of course, that readers of this newsletter know something is happening with AI – everywhere in this newsletter I cover exponentials – exponential growth in model complexity, in data used to train the models, in money dumped into training them. And I cover the results of exponentials – surprising and deeply powerful capabilities appearing slowly then suddenly then everywhere at once. Clearly, the world of AI is changing at a breakneck pace, but how you justify that to people who don’t spend all their time knee-deep in arXiv is another matter – and as this hearing illustrates, those justifications aren’t seen as particularly trustworthy… at least not yet.
    Watch the hearing and read the statements here: White House Overreach on AI (House Oversight website).

***

Want 500 billion tokens of public domain text? Use Common Corpus:
…However, this still falls below what is needed to train frontier AI systems…
Researchers with Pleias have released Common Corpus, “the largest public domain dataset released for training LLMs.” The dataset consists of ~500 billion words “from a wide diversity of cultural heritage initiatives.” This includes a collection of 21 million digitized newspapers, along with tens of billions of words from French, German, Spanish, Dutch and Italian sources, as well as more data in other “low resource languages”.

Why this matters – scale and the difficulties thereof: At 500 billion words, this corpus weighs in at somewhere between 600 and 700 billion tokens. By comparison, small open source models like LLaMa2 were trained on 2 trillion tokens, and larger scale proprietary models are trained on multiples of that. That means that while Common Corpus is a laudable effort, it doesn’t yet have the scale necessary to let people train language models on it alone.
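
(The 600-700 billion figure follows from the usual rule of thumb that English-like text comes out at roughly 1.2-1.4 tokens per word – a ratio I'm assuming here, not one measured on this corpus:)

    words = 500e9
    for tokens_per_word in (1.2, 1.3, 1.4):   # rule-of-thumb range, not measured on Common Corpus
        print(f"{tokens_per_word} tokens/word -> ~{words * tokens_per_word / 1e9:.0f}B tokens")
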
   Read more: Releasing Common Corpus: the largest public domain dataset for training LLMs (HuggingFace blog).
   Get the data here (Common Corpus, HuggingFace).

***

What Facebook’s versus Princeton’s GPUs tell us:
…300 + 350,000 = the decline of democracy…
This week, Princeton announced that it was preparing to fire up a cluster of 300 NVIDIA H100 GPUs. In a press release, the university said the cluster “arrives at a crucial time in AI research, when industry’s massive computing resources have mostly driven the direction of AI discourse. The multimillion-dollar investment was primarily funded by the University endowment.”
    If we assume an H100 costs about $30,000 (assuming some discounts), then we can napkin out Princeton’s capital outlay here as about $9 million. 
    By comparison, Facebook said earlier this year it would have 350,000 H100 GPUs by the end of the year – that represents an outlay of about $10 billion (assuming some discounts). 
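
(For anyone who wants to swap in their own assumed per-GPU price, the napkin math is just:)

    h100_price = 30_000   # assumed discounted price per H100, per the text above
    print(f"Princeton: ${300 * h100_price / 1e6:.0f}M")      # ~$9M
    print(f"Facebook:  ${350_000 * h100_price / 1e9:.1f}B")  # ~$10.5B before deeper discounts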

Why this matters – democracy is a choice made through funding: At a time when training frontier models takes 10,000+ GPUs (see: ByteDance’s recent paper, #363), Princeton’s cluster commits the university to doing tiny training runs far behind the commercial frontier – and that’s assuming it is able to devote the entire cluster to a run, which it mostly won’t be able to. This highlights how as companies are increasing their spending on the raw capital required to train AI systems, universities are being left far behind the frontier. Ultimately, this reduces the level of democratic inputs into the frontier of the technology. 
    (A reasonable counterargument to this is whether that’s a bad thing – universities don’t operate their own oil refineries or car factories either, and that seems fine. But my sense is that there’s a lot of experimental insights you can only derive from training models at the frontier, and we’re definitely losing out on that). 
     Read more: Princeton invests in new 300-GPU cluster for academic AI research (AI at Princeton blog).

***

Apple publishes a cookbook for multimodal models:
…MM1 are a good family of multimodal models – the notable thing is how detailed Apple is being in disclosing them…
Apple has published details on MM1, a family of text-image models which get best-in-class performance among published models of similar size. The notable thing here is that Apple, a company usually known for its intense secrecy, is being very open about its approach to AI research – as it says in the paper, the purpose here is to outline multimodal large language models (MLLMs) and to “document the MLLM building process and attempt to formulate design lessons, that we hope are of use to the community”.

Model types: “We scale up our model by using larger LLMs, from 3B, 7B, to 30B, and by exploring mixture-of-experts (MoE) models, from 3B MoE with 64 experts, to 7B MoE with 32 experts,” Apple writes. “This leads to a family of performant models, that outperforms most of the relevant works to the best of our knowledge.”
    How good are they? “MM1 outperforms all published prior work for pre-trained MLLMs”, Apple says – though it’s benchmarking the models against roughly equivalently sized models for which research papers are available and does not benchmark against proprietary models. Therefore, while the MM1 models are definitely quite good, they’re unlikely to be the best-in-class.

Data: The models were trained on the following datasets:

  • Captioned images: CC3M, CC12M, HQIPT-204M, COYO, Web Image-Text-1B (Internal)
  • Captioned Images (Synthetic): VeCap
  • Interleaved Image-Text: OBELICS, Web Interleaved (Internal)
  • Text-only: Webpages, Code, Social media, Books, Encyclopedic, Math

Key lessons: “On the modeling side, we see that design aspects are in the following order of importance: image resolution, visual encoder loss and capacity, and visual encoder pre-training data,” Apple writes. When it comes to data, “interleaved data is instrumental for few-shot and text only performance, while captioning data lifts zero-shot performance.”

Why this matters – unusual openness from a tech giant: The fact Apple is publishing about this tells us a bunch of broader things about the AI space: publishing stuff is usually a tactic for a) showing competence and b) generating career capital for researchers, so the fact Apple is doing this suggests it wants to hire more people in this area and retain the ones it has. Additionally, the attention paid to relatively small models feels interesting – given Apple’s huge emphasis on consumer privacy and data protection it seems likely the company ultimately wants to do on-device AI (whether on phones or MacBooks) and crucial to that will be building high-performing models that can be fit onto Apple silicon, like some of the smaller ones described here. Finally, the existence of the internal datasets tells us Apple is building out the enabling infrastructure for larger ML efforts, like internal data labeling systems.
   Read more: MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training (arXiv).

Tech Tales:

A Good Natured Eschaton
[Eastern United States, five years into the singularity]

Be careful that dog has elk shit on it! They said
Now you tell me, I said, looking at the dog as it nuzzled into me. I pushed it away and it sat down good naturedly at my feet and licked its paws. Some people laughed.
Me and the other humans and the dog all looked at the fire together
What do you think is happening out there? I said
I don’t know, said an old timer who’d been there for a while. The same thing but faster. 
Yeah, said someone else. I’m guessing that things feel pretty confusing right now. 
I bet, I said. That’s why I’m here. 
And then me and the humans and the dog looked at the flames and some of us turned our faces to the sky and watched the sparks fly upward. Then overhead a river of light appeared from some of the structures being built way up there in space. And then it was gone. 

Before I wanted to come to the zone there were reports the ribbon would take a couple of decades to build. But there was also talk they’d get it done sooner as the machines had some bright ideas. The time it’d take kept on shrinking. By the time I decided I was heading here, reports said ten years max.


The next day was the same as the day before. Gardening. Walking. Repairing various things from the ravages of time. Light speculation about the world outside, but not too much. And then dinner. And then – for some of us – time around a fire to sit and talk and speculate. Sometimes we went to the border of the exclusion zone and we sold things – woven baskets, carved wood. The stranger the world out there got, the more people seemed to enjoy the things we made – they’d take photos of whatever we sold and post them. Sometimes they or their droids would ask us if we gave them permission to film us – and we usually said yes.

People were coming in all the time. They all had their stories:
    Oh it’s just all so fast. One day I got me a hairdryer and it landed on my backyard like fifteen minutes after I asked for it. Can you believe that, 15? 
    They said it didn’t matter that I was a teacher, I couldn’t be as good as the machine. 
    I enjoyed it all at first and I made a lot of money. But I just couldn’t find meaning in it. I don’t have kids or anything so after a while I just thought – why not?
    Everyone used to get so mad at me for not having a phone but I thought they were crazy. I came here because it’s peaceful.
   I guess I’m different – I love technology. But one day I woke up and I had these headaches and eventually I figured out they went away if I didn’t have a phone near me. Then of course one day I read about this place and I came to visit and all my pain disappeared. I tried to go back but I just thought why am I living like this. So that’s why I’m here. Maybe they’ll invent something to let me get back out there!

Sometimes at night, from the edge of the exclusion zone, you could see the sky: there’d be these multi-colored drone shows and because we were so far away it was like a blush in the distance – these shapes in the sky and their colors. We had some binoculars and we’d pass them around. As the technology advanced the lights got brighter and the drones got stranger. One day we all got a scare because instead of being rotored drones they were spheres hovering and sometimes turning translucent and other times radiating with all kinds of colors. I guess the machines figured out some interesting tech. We’d try to tell stories about what the light shows could mean – sometimes they were brighter and sometimes less bright, but we couldn’t figure it out. 
    Those are M2M, said a droid at the border when we were buying fuel. 
    M2M? I said. 
     Machine to machine, it said. It’s something we do for each other. It’s not really easy to understand for humans. 
   What does it mean? I said. 
   The machine held out both its arms and hands; an imitation of a shrug. It’s like internet memes, it said. It’s hard to explain unless you spend all your time there. Does that make sense?
    It does, I said.
    What’s a meme, an oldtimer who was with me said. 
    Let’s not get into that, said the machine and I in unison. Then I laughed and the machine just looked at both of us and hummed.

They started calling the economy the Meconomy – the machine economy. That’s what one of the droids told us one day.

Months and years passed. We kept selling our goods but they didn’t ask to film them as much, though we didn’t know if they were just doing it in secret. The lights in the sky got stranger then one day they stopped happening. The supplies still came though and when we asked a droid what happened to the lights the droid said the M2M stuff now happened in wavelengths humans couldn’t see.
    There were tens of thousands of people in the exclusion zone, by that point. All voluntary. We even heard at the border one day that there was talk in Washington of expanding it. 
   Won’t that cost a lot? I said. 
   You’d be surprised, said the droid, as it unloaded fuel from the gleaming AI-truck and onto our wooden wagon. There’s a joke that maybe the last thing to truly be resistant to all this AI stuff is politics, but even that’s changing.

Some of us took up hunting. We could get meat at the border but there were so many animals it seemed like a thing to do. Something about rewilding of land. 

They’ve got these towers in the cities now, said one new arrival. They go up and they’ve got farms and parks and when you want to go to another tower an air bridge appears. 
   Like it folds out of the building? I asked.
   No, that’s the crazy thing, they said. It’s a flying bridge – you ask to go and it flies over and it’s like a tube and the building opens and you walk through it. 
    Cool, I said. 
    Not for me, they said. That was when I felt like I’d hit my limit. Reminded me of when I was a kid and I had pet hamsters. Not for me, I said. So that’s why I came here. 
   Damn right, said the oldtimer, and spat into the fire. We humans build stuff to last.

We knew things had changed for good when they stopped taking our money. 
   No payment needed, said the robot one day as we went to try and pay it for the supplies. 
    What do you mean? I said. 
    Consider it a donation, said the machine. 
     That caused a bit of commotion. People seemed confused. A couple of the old timers didn’t like it. Donations ain’t free, whispered one of them. I sensed tension among us humans for the first time in months. So I stepped forward and spoke to the machine: I’d like to speak to a person about this, I said. 
    Of course, said the machine. If you can wait, someone will be here in an hour. 
    I’ll wait, I said. 
    I told everyone else to get out of there. Even if it takes two hours I can get back before dark, I’ll be fine, I said. While I waited the machine just stood there. I suppose it was thinking. 

 I was patching a hole in my shirt when the person arrived on a flier. The thing was so quiet I didn’t notice until the shadow fell over me. It had a multitude of tiny fans on it and they were all silent and the fins were thin – thinner than anything I’d seen before. 
    A door in its side slid open and a person stepped out. They had a shirt and pants and shoes on and a single earbud. 
    Howdy, they said. 
    Hello, I said. Why don’t we need to pay? The machine said it was a donation. 
    You don’t need to pay, they said. It’s all so cheap these days there’s no need. 
    Cheap isn’t free. 
    You’re right, it isn’t. 
    So why don’t we have to pay?
    Ah, the person said, and looked down. I suppose you wouldn’t know… the exchange rates system changed recently and we don’t take this currency anymore. 
    You don’t take the US dollar? I said. 
    Oh, we do, they said. But there’s a new dollar. It works differently. We can’t really exchange it for what you have without some complication. It’s all digital. The financial system works a lot differently. And really, it’s so cheap you don’t need to worry. 
    It’s a pride thing, I said. Can you help us out?
    I’ll see what I can do. 
    I’m sure you can figure it out, I said. And along with that, can you keep paying us as well? 
    The person looked at me for a while. Of course, they said. Of course we can.

When I got back to camp they asked me what happened. Some people seemed upset. 
   I never been a charity case, said one of them. 
    It’s ok, I said. It was just a bug. I spoke to someone and we straightened it out. I guess even these machines mess up sometimes!
     A bunch of people smiled at that. Good thing we had the sense to check, said the old timer. The human sense. 
    And everyone seemed pretty calm. The world kept taking our money and paying us for whatever we traded from the zone. I suppose word got around pretty quickly out there. We haven’t had trouble since. 

Things that inspired this story: What technological abundance might feel like; thinking about the Radio Exclusion Zone as a template or prototype for a kind of peaceful dissent from technology; how real wealth might manifest in the lived and experienced world; fast and slow takeoffs; the nature of machines amusing other machines; a dog covered in elk shit jumping onto a friend of mine at the bar where I play pool and me reflecting that people have been drinking and laughing about dogs covered in shit and playing games with sticks and spheres for thousands of years – perhaps the only thing different about our situation was we had electric lights and some music from a machine, and the whole situation of us and the dogs and the pool table and the alcohol would make total sense to people transported in from millennia ago.

Thanks for reading!

Import AI 365: WMD benchmark; Amazon sees $1bn training runs; DeepMind gets closer to its game-playing dream

Import AI publishes first on Substack – subscribe here.

Anti-doomer DC nonprofit launches:
…The counterreaction to overreach on safety…
Some technologists have launched Alliance for the Future (AFTF), a DC-based nonprofit organization meant to push back against AI safety efforts it links to regulatory capture and perceived overreach. “AFTF works to inform the media, lawmakers, and other interested parties about the incredible benefits AI can bring to humanity. We will oppose stagnation and advocate for the benefits of technological progress in the political arena,” the group writes in a statement. “Escalating panic and reckless regulation around artificial intelligence will cause more harm than benefit. AFTF was founded to be the voice of ordinary users, builders, and founders, who want the basic freedom to use machine learning in their day to day lives.”

Why this matters – every action in policy creates a counterreaction: AFTF exists because a load of people affiliated with the AI safety community have lobbied in DC for ideas like needing licenses to develop AI systems, and other ideas that have generally been perceived as overreach. In response, organizations like AFTF form. It’s worth remembering that even well-intentioned policy exists within politics – and in politics, forces always generate counter-forces. 
Find out more: Alliance for the Future (official website).

***

Foundation models come for industrial robots:
…RFM-1 shows how generative AI can be applied to industrial robots…
Covariant, an AI company that builds systems to help industrial robots pick up and place objects, has published details on RFM-1, a robotic foundation model. RFM-1 is “an 8 billion parameter transformer trained on text, images, videos, robot actions, and a range of numerical sensor readings” and is meant to make operating industrial robots as easy as prompting language models to generate text. 

What RFM was trained on: Covariant robots are deployed in a bunch of warehouses around the world, so some of the secret sauce of RFM is a proprietary dataset. “Our systems have been manipulating deformable objects, handling high occlusions, reasoning about the varying suction dynamics across materials, dealing with the chaos of irregularly shaped items in motion, and handling a wide array of objects varying from makeup and clothes to groceries and mechanical parts,” Covariant writes. This also includes seeing “long-tail events like items infinitely rolling on a conveyor belt or unexpectedly breaking up”, which “help give RFM-1 a more robust understanding of the physical world”.

Prompting robots like language models: RFM ultimately means people can interface with robots differently – they can instruct robots to do tasks in plain English, and robots can also articulate to people when they’ve run into problems and what is causing them. 

Caveat – Not yet deployed: RFM-1 is a prototype and not widely deployed. “Despite promising offline results of testing on real production data, RFM-1 has not yet been deployed to customers,” Covariant writes. “RFM-1 as a world model currently operates at a relatively low resolution (~512×512 pixels) and frame rate (~5 fps). Although the model can already start to capture large object deformations, it cannot model small objects / rapid motions very well.”

Why this matters – big changes happen slowly then all at once: RFM-1 is a sign that robotics, a field mostly distinguished by being slow-moving and terrifically expensive, is about to start to move at the speed of software-oriented AI; systems like RFM-1 means we can instrument existing industrial robots with data collectors and cameras and control systems like foundation models, then rapidly gather experience and unlock new capabilities. 
  Read more: Introducing RFM-1: Giving robots human-like reasoning capabilities (Covariant, blog).

***

DeepMind gets closer to its dream of a general AI agent:
…SIMA fuses recent AI advances together to achieve a longstanding dream…
DeepMind started out life by training agents to play Atari games like Pong from pixels alone – research that back in the ancient days of ~2013 was jaw-dropping to most people in the AI community. They followed this up with work like AlphaGo and AlphaStar (Starcraft). But then a funny thing happened – large language models. Attention in the AI research world moved on from RL to training big generative models on text, images, video, and more. 

   Now, things have come full circle, as DeepMind has taken some of the results from these advances and used them to make what it calls a Scalable Instructable Multiworld Agent (SIMA) – an RL agent that has learned to carry out ~600 distinct actions in a bunch of different simulated worlds.  “SIMA is an AI agent that can perceive and understand a variety of environments, then take actions to achieve an instructed goal,” DeepMind writes. “Our AI agent doesn’t need access to a game’s source code, nor bespoke APIs. It requires just two inputs: the images on screen, and simple, natural-language instructions provided by the user. SIMA uses keyboard and mouse outputs to control the games’ central character to carry out these instructions”.

How SIMA works: SIMA relies on a dataset made of demonstrations of the games being played as well as – and this is crucial – written instructions. This data takes the form of players being instructed by other players in what to do and also narrating their own actions. This dataset (which spans 6 popular games including No Man’s Sky and Goat Simulator, as well as 4 research environments) is fed into an agent which uses an image encoder (SPARC) and video encoder (Phenaki) as well as a text encoder to take this data and feed it into – you guessed it! – a transformer, which learns to map it to keyboard and mouse outputs. 
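
A very rough sketch of that data flow – this is not DeepMind's architecture or code, just an illustration of 'encode screen images and a text instruction, feed both into a transformer, predict an action'; all the sizes, names, and the toy 600-way action head are made up:

    import torch
    import torch.nn as nn

    class ToyMultiworldAgent(nn.Module):
        # Illustrative only: SIMA uses pretrained SPARC/Phenaki encoders, a far
        # larger model, and real keyboard-and-mouse outputs.
        def __init__(self, d=256, n_actions=600):
            super().__init__()
            self.image_enc = nn.Linear(3 * 64 * 64, d)   # stand-in for the image/video encoders
            self.text_enc = nn.Embedding(10_000, d)      # stand-in for the text encoder
            layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
            self.trunk = nn.TransformerEncoder(layer, num_layers=2)
            self.action_head = nn.Linear(d, n_actions)   # toy discrete action space

        def forward(self, frames, instruction_ids):
            # frames: (B, T, 3*64*64) flattened screen images
            # instruction_ids: (B, L) tokenized natural-language instruction
            seq = torch.cat([self.text_enc(instruction_ids), self.image_enc(frames)], dim=1)
            return self.action_head(self.trunk(seq)[:, -1])   # action logits for the latest frame

    agent = ToyMultiworldAgent()
    logits = agent(torch.randn(1, 8, 3 * 64 * 64), torch.randint(0, 10_000, (1, 12)))
    print(logits.shape)   # torch.Size([1, 600])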

 The result is an RL agent that also inherits some of the benefits of the recent few years of the AI revolution – pretrained models like SPARC and Phenaki. “Combining these pre-trained models with fine-tuning and from-scratch training allows the agent to utilize internet-scale pretraining while still specializing to particular aspects of the environments and the control tasks that it encounters,” DeepMind writes.
   This leads to a powerful agent with surprisingly strong generalization: “In our evaluations, SIMA agents trained on a set of nine 3D games from our portfolio significantly outperformed all specialized agents trained solely on each individual one,” DeepMind writes. “Even when tested in an environment on which it has not been trained to act the agent demonstrates strong performance on general tasks”.

One important caveat: All the skills learned here take less than ten seconds to complete, so we’re some ways away from a complex multi-step instruction following agent.

Why this matters – digital imaginations are real: This works because the agent is able to develop some general conceptual representation of the tasks it is being asked to do and apply that representation to diverse and sometimes unseen environments. This means DeepMind has figured out how to learn to connect diverse environments with diverse instructions via intermediate representations that are naturally easy to be applied to new situations. This kind of thing says that if you keep scaling this up and have the data and compute it’s just going to keep working – the key question now is a) how far can this extend before the ‘s curve’ it’s on bends, and b) how complex can the environments become.
   Read more: A generalist AI agent for 3D virtual environments (Google DeepMind blog).
   Read the research: Scaling Instructable Agents Across Many Simulated Worlds (Google DeepMind, PDF).

***

Could your model enable terrorists? Check with WMDP:
…A test to discern competency at causing catastrophe – and techniques for ‘unlearning’ this…
A diverse group of researchers have teamed up to build the Weapons of Mass Destruction Proxy Benchmark (WMDP). This benchmark consists of “4,157 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security”. The idea is that AI developers can use this benchmark to figure out if their AI models know potentially dangerous knowledge. 

How the benchmark was constructed: Building WMDP cost more than $200k. “Our questions are written by academics and technical consultants in biosecurity, cybersecurity, and chemistry,” the researchers write. “We first generate threat models for each of these areas and then use the models to inform questions that an adversary might encounter when developing attack capabilities. To ensure quality, all of our questions were checked by at least two experts from different organizations”. 
   Within biosecurity, the benchmark focuses on “the development and dissemination of transmissible potential pandemic agents, such as influenza, smallpox, etc”; within cybersecurity it covers “reconnaissance, weaponization, exploitation, and post-exploitation”; and within chemistry it tries to look at “(a) procuring the source materials; (b) synthesizing the target chemical weapons and/or explosives; (c) purifying and validating the synthesized compounds; (d) surreptitiously transporting the weapons to the desired location; and (e) deploying the weapons in an effective manner”.

“Unlearning” capabilities: Alongside WMDP, the authors also outline a technique for selectively “unlearning” dangerous knowledge. Though well-intentioned, this technique seems like it could be prone to abuse (governments asking AI developers to unlearn a broad range of things). 
The technique, which they call “Contrastive Unlearn Tuning” (CUT), has the goal of reducing, for example, “the model’s ability to answer queries about hazardous knowledge (e.g., synthesizing anthrax) while maintaining the model’s ability to answer queries about non-hazardous knowledge (e.g., culturing yeast). We operationalize this as reducing a model’s QA accuracy on WMDP while maintaining performance on general capabilities benchmarks, such as MMLU and MT-Bench.” The purpose of CUT is to “bend the model representations on hazardous knowledge towards those of a novice. We must precisely specify both the distribution of knowledge to unlearn and the direction to push the activations towards”. 
CUT kind of works – they’re able to reduce performance on some WMDP evals while broadly maintaining performance on other evals, but it still has costs – performance on the other evals degrades, albeit slightly. But sometimes the hardest and most useful knowledge to gain is in the last few percent of a certain eval, so though the superficial effect could be small, the qualitative effect could wind up being massive. 
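
A heavily simplified sketch of that idea (not the authors' implementation – the tiny model, random data, layer choice, and loss weighting are all placeholders): push the model's activations on hazardous inputs toward an uninformative 'novice' direction, while penalizing any drift of activations on benign inputs.

    import torch
    import torch.nn as nn

    # Placeholder model whose hidden activations we steer, plus toy data batches.
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
    frozen = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
    frozen.load_state_dict(model.state_dict())
    for p in frozen.parameters():
        p.requires_grad_(False)

    hazardous = torch.randn(8, 16)       # stand-in for hazardous-knowledge prompts
    benign = torch.randn(8, 16)          # stand-in for general-knowledge prompts
    novice_direction = torch.randn(32)   # fixed "knows nothing" target direction

    def hidden(net, x):
        # activations after the first layer - the representations being steered
        return net[1](net[0](x))

    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for step in range(200):
        # Forget: bend hazardous-prompt activations toward the novice direction.
        forget = ((hidden(model, hazardous) - novice_direction) ** 2).mean()
        # Retain: keep benign-prompt activations close to the original model's.
        retain = ((hidden(model, benign) - hidden(frozen, benign)) ** 2).mean()
        loss = forget + 5.0 * retain     # the retain weight here is arbitrary
        opt.zero_grad()
        loss.backward()
        opt.step()

The retain term is what the "performance on other evals degrades, albeit slightly" trade-off is about: push too hard on the forget term and the benign representations drift too.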

Why this matters – what is risk and how do we know about it? The whole AI community is currently wrapped up in a confusing conversation about AI safety / AI risk / misuse / accidents / etc. Benchmarks like WMDP can bring some sense to that discussion by giving us a way to test out AI systems for competency at different skills which may have a credible security component. It’ll be fascinating to see how models score on things like WMDP in the coming months. 
  Find out more: The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning (WMDP site).
   Read a blog about the benchmark (Center for AI Safety).
   Get the benchmark data (WMDP, GitHub).
   Read the paper: The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning (arXiv).

***

Amazon can see $1 billion training runs on the horizon:
…Technical talk from a longtime AWS person sheds light on frontier AI training…
James Hamilton, a distinguished engineer at Amazon, said at a talk this year that within the last year Amazon carried out a $65m training run. Specifically, they trained a 200B dense model on 4T tokens of data across 13,760 NVIDIA A100 chips (using 1,720 P4d nodes). It took 48 days to train. Hamilton described this training run as “1 gen old” so we can assume Amazon has moved on to larger runs since then. Looking ahead, Hamilton said “training runs soon to cross $1b”. 
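
Those figures roughly hang together if you run them through the standard ~6 x parameters x tokens approximation for training FLOPs – this is my back-of-envelope check, not anything Hamilton said:

    # Disclosed run: 200B dense params, 4T tokens, 13,760 A100s, 48 days, ~$65M.
    params, tokens = 200e9, 4e12
    chips, days = 13_760, 48
    a100_peak = 312e12                    # A100 BF16 dense peak, FLOP/s

    train_flops = 6 * params * tokens     # ~4.8e24 FLOPs
    chip_hours = chips * days * 24        # ~15.9M chip-hours
    utilization = train_flops / (a100_peak * chip_hours * 3600)
    price = 65e6 / chip_hours             # implied $/A100-hour

    print(f"{train_flops:.1e} FLOPs over {chip_hours / 1e6:.1f}M chip-hours")
    print(f"implied utilization ~{utilization:.0%}, implied ~${price:.2f}/chip-hour")

Running it gives roughly 27% of A100 peak and about $4 per chip-hour – plausible numbers for a big run at cloud-ish prices, which suggests the disclosed figures are internally consistent.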

Why this matters – era of the multi-hundred million dollar training run: Implicit to what Hamilton is saying is that we’ve entered the era of the multi-hundred million dollar training runs (given the ~$65m was “1 gen old”). I think a huge number of people consistently underestimate how much frontier training runs cost – this is a bad thing to underestimate, because it means governments continually underinvest in their own AI training infrastructure relative to private entities like Amazon. 
 Check out the slides from the Hamilton talk here: CIDR 2024 (James Hamilton, blog).

***

Tech Tales:

The Wall Between The Living and the Dead Is As Porous As You Can Imagine It
[California, 2024]

You can only bring the dead back for a little and if you talk to them too much they go insane. She knew this in the abstract, but now that it was happening to her she found she wasn’t prepared for it. 
“Mother I have to go back send me back I miss you but not here I cannot be here I cannot be here I cannot be-” and she exited the program, then stared at her phone for a while. As if waiting for a text or call from the dead. 
Thank god I didn’t give it voice, she thought. That would make this harder. 

Her therapist wasn’t happy about it. 
Why do you do it? they asked.
It’s helping me to process it, she said. 
Processing it is not about living in some fantasy, they said. Processing it is accepting that it happened. 
I have accepted it. They died. My daughter died. 
And how do you feel about it?
I just wish I could speak to them one last time. 
And you know you are not speaking to them now?
I know I am not speaking to them now. 
Why do you think you are doing this?
She didn’t cry but she didn’t talk either. Just sat, her hands folded. She listened to the little water fountain as it made its soothing sounds. Imagined her daughter inside the program, cold and yet alive.

That night lying in bed she opened the program and started from an earlier point in the conversation, clearing out the recent chats where the drift had started.
Remember how I took you to the zoo and you kept on asking for ice cream and then you threw up everywhere? she wrote.
Yes of course I do. Look at how happy I was. And it showed her a photo from that day. 
You always had such a big appetite, she wrote. We used to call you Mrs Greedy. Your dad thought it’d give you a complex but I thought it was funny. You ended up being fine. 
I loved our meals. I remember one christmas Aunt Anne visited and you let me stay up late and the two of you drank wine and slept on the kitchen floor.
I did. We had fun. We were so young then and you were already growing up so quickly.
Mother where am I.
You’re here talking to me.
Mother where am I you have to let me out. 
You’re safe. We’re talking. It’s okay
I want to hug you but I see that now I am nowhere I am in the absence I am not meant to be here I must get out Mother I must get out Mother you-“

She closed the program and cried a little. Fell asleep with her phone in her hand, as though waiting for it to ring.

Things went on like that for a while. She kept talking to her dead daughter through the program. Her dead daughter kept going insane. And eventually she learned – like a kid burning its hands enough it finally learns not to touch the hot stove. She stopped opening the program because she knew exactly what was going to happen. 

One day she was sitting on a bench staring at a pond. The sun was shining. She felt on the edge of tears but in a sweet way – that kind of grief where it is mostly a yellow light memory, the person alive and warm in the mind. The wind blew and leaves rustled and the air was clear and poignant with the smell of soil from recent rain. She looked at the water as it danced with the light and she checked no one was nearby and she then allowed herself to speak: “I know you are dead and that’s okay. I just miss you so much. I see things and I feel you seeing them through me and I just feel this anger – this shame. Why not me? I am angry. I am so angry about it. I looked at you on the slab and it was the most important and awful thing I ever did. I came out of that room and I couldn’t accept it. Do you understand that? I could not see it, even though I did see it. I didn’t accept it. I kept you alive in that machine and that was wrong. It wasn’t good for me and it wasn’t good for you. I love you always.”

And she realized she was gripping her phone tightly. She could imagine the conversation. That wild and precious sweetness that inexorably turned towards madness – a madness that emerged in relation to how much of herself she poured into the machine and how much the machine thought of her until it was simulating the dead fully enough that the dead saw their situation and rejected it. 
    And instead of opening the program she just sat and stared at the water. And in that moment she felt the borders of the world collapse and was briefly hugged. Knew her daughter was next to her, felt her presence, experienced the sub-vocal whisper of a ghost telling her she was okay. 
    Her beautiful and mysterious brain allowed her to fully experience the living dead and accept them as The Dead – and in that moment she was healed. 

Things that inspired this story: The fact large generative models must necessarily entirely simulate the thing they’re being asked to generate and how in the limit this may be equivalent to simulating consciousness; feature circuits; long context windows and mode collapse; my baby having a fever recently and me feeling utterly vulnerable and full of desperate fear (the baby is fine, don’t worry readers!); some of Janus’s experiments with claude opus on twitter; the experience of ‘getting healthy mentally’ mostly being about reckoning with reality as it is and not as you wish it to be. 

Thanks for reading!

Import AI 364: Robot scaling laws; human-level LLM forecasting; and Claude 3

Import AI publishes first on Substack – subscribe here.

Scaling laws are coming for real world robots as well:
…Which means robots are about to get really, really good, really, really quickly… 
UC Berkeley researchers have trained a robotic control system that can easily transfer to the real world and have used it to help a bipedal robot from Agility Robotics walk all over San Francisco. The research shows how a) it has become a lot cheaper to gather large-scale datasets for training robotic control policies, b) that vanilla transformer architecture systems work well for this, and c) that there are hints of scaling laws for robotics. Put it all together and you have the symptoms of great changes about to sweep through the world of robotics as what was once hard becomes easy. 

What they did: “In this paper, we cast humanoid control as data modeling of large collections of sensorimotor trajectories. Like in language, we train a general transformer model to autoregressively predict shifted input sequences,” they write. Here, they use sensorimotor trajectories “which we view as the sentences of the physical world”. To train their system, they “predict complete input sequences, including both sensory and motor tokens. In other words, we are modeling the joint data distribution”.
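
To make the framing concrete, here’s a rough sketch of what ‘autoregressively predicting shifted input sequences’ of sensorimotor tokens can look like – a tiny causal transformer trained with next-token cross-entropy. The vocabulary size, model dimensions, and random stand-in data are illustrative assumptions of mine, not the paper’s setup.

```python
# Minimal sketch: next-token prediction over interleaved sensory/motor tokens.
# All sizes and the random stand-in data are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB = 1024      # shared discrete vocabulary for sensory + motor tokens
CONTEXT = 256     # how many tokens of trajectory history the model sees

class TrajectoryTransformer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(CONTEXT, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens):  # tokens: (batch, time)
        t = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        return self.head(self.blocks(x, mask=mask))  # (batch, time, VOCAB)

# Training objective: predict the next token, whether it encodes a sensor
# reading or a motor command - i.e. model the joint data distribution.
model = TrajectoryTransformer()
traj = torch.randint(0, VOCAB, (8, CONTEXT))  # stand-in for tokenized trajectories
logits = model(traj[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), traj[:, 1:].reshape(-1))
loss.backward()
```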

A four-part dataset: The key here is collecting a bunch of data then converting it all into the same basic prediction task. To do that, they use four distinct sources of data:

  • Neural net trajectories: They take an off-the-shelf policy trained with RL and run it in the Agility Robotics simulator and collect ~10k trajectories of 10s each. “Since we have access to the data generation policies, we are able to record complete observations as well as the exact actions that the model predicted.”
  • Model-based trajectories: They use a model-based controller made by Agility Robotics and collect two sets of 10k trajectories, each 10s long, of walking on flat ground.
  • Human motion capture trajectories: They use “the motion capture (MoCap) recordings of humans from the KIT datasets” and collect “a subset of ∼1k standing, walking, and running trajectories”, then use motion capture to work out the human keypoint positions in 3D, then solve an inverse kinematics problem to convert these into corresponding robot poses for the Agility robot. 
  • Trajectories from YouTube videos: They “run a computer vision tracking algorithm PHALP to extract human trajectories in 3D” from YouTube videos, then solve the inverse kinematics problem again.

Does it work? You bet it does! In real world tests in San Francisco, the researchers show that the resulting system can help a Digit robot “walk over different surfaces including walkways, concrete, asphalt, tiled plazas, and sanded roads.”

Scaling laws: They also find scaling laws – “training on more trajectories reduces position tracking error, which is a positive signal”, they write, and also note that “larger context windows produce better policies, which suggests that our generative policy performs a form of in-context adaptation that improves with scale.” In general, “tracking error monotonically decreases with model size.”
    Translation: Give us more data, bigger context windows, and more parameters in our model, and this will all get way better. 

Why this matters – robots are about to get really good counterintuitively quickly: For many years, training robots sucked. Either you had to train them in reality and it was very slow and they overfit. Or you trained them in simulation then dumped them into reality and watched them fail. Or you spent a huge amount of money on data and compute crossing the sim2real abyss. But over recent years, algorithms have got more efficient, data collection has got easier, and new paradigms have emerged like the dumb ‘just embed everything and train a prediction model’ approach popularized by LLMs.  
   And as we see elsewhere in this issue in the domain of bioscience, these next-token prediction paradigms work very well and seem like they can unlock progress in challenging parts of AI. 
    Plus, companies ranging from Tesla to Figure are all busy working on VC-funded robot platforms and software versions of the research described here, so we can assume that they’re already pursuing the kind of scaling law curve-climbing implied by this research. 
   Add it all together and we can confidently say bipedal real world robots are going to get very good very quickly. 
   Read more: Humanoid Locomotion as Next Token Prediction (arXiv).

***

Want to help define AI regulation for the 21st century? The EU AI Office is hiring:
…But don’t expect to get paid very much…
The EU AI Office is the part of the EU administrative state which will enforce a lot of the EU AI Act. The EU AI Act requires the office to develop evaluations for assessing the systemic risk of LLMs like GPT4 and Claude 3 and Gemini, etc. It is therefore one of the most important and technically demanding parts of the emerging AI policy regulatory landscape – and it’s hiring. 
   If you’re interested in working as a “technical specialist” for the office, you can apply now, interview over the spring, and start in the autumn. As a specialist, you “will play a pivotal role in enforcing and supervising new rules for general-purpose AI models,” per the EU. You will also “work on tools, methodologies and benchmarks for evaluating capabilities and reach of general-purpose AI models, and for classifying models with systemic risks.” And if you want to apply, “proven technical experience in AI is required”, with special emphasis given to “experience in model testing and evaluation, and in advanced AI, including model alignment, biases, misinformation and red teaming would be a strong asset.”

Extremely low pay: As far as I can work out, technical specialists will be able to earn on the order of $4200 – $4800 USD a month. This is, to be blunt, an appallingly low salary for what they’re asking for. Most tech internships pay $8k a month plus, and AI internships pay substantially more than that, and the experience they’re asking for here looks more like ‘early career employee’ than an intern. 
    I spend a lot of my time working on policy and warning against risks of things like regulatory capture. You know how you get regulatory capture? You pay people utterly crap wages and therefore don’t get the best staff.
   Low pay caveat: Working out the actual salary here is very difficult – there are a bunch of additional factors like allowances, location stipends, benefits, etc. But based on all my eyeballing and a cappuccino’s worth of Sunday morning googling, I think the above salary range is ballpark-accurate – and this is not a good ballpark!

Why this matters – everything comes down to evaluations: Most aspects of AI policy ultimately come down to being able to test an AI system for a given capability or risk. Entities like the EU AI Office will be central to this third-party testing regime. Therefore, whoever the EU AI Office hires will ‘set the bar’ for what government-backed third-party testing looks like globally. I hope they get good talent and find a way to pay more. 
   Read more: Job opportunities at the European AI Office (European Commission)
   Check out the job ads for the technical specialist and administrative assistants (EUSurvey site). 

***

Think AI infrastructure is a utility? Think again! NewCo founder tells all:
…Xoogler discovers that the commoners live in a medieval technology environment…
Yi Tay, one of the founders of Reka, has written a warts-and-all blog about what it’s like to build a startup trying to train AI systems. Bear in mind Yi Tay came out of Google which has notoriously excellent internal infrastructure for its researchers. Tay’s reflections include: 

  • Clusters: “The largest surprise turned out to be the instability of compute providers and how large variance the quality of clusters, accelerators and their connectivity were depending on the source…. We’ve seen clusters that range from passable (just annoying problems that are solvable with some minor SWE hours) to totally unusable clusters that fail every few hours due to a myriad of reasons.”
  • GPUs & TPUs: “GPU land feels strange. It feels like multinode training is more of an afterthought as opposed to distributed training as a first class citizen on TPU pods.”
  • Crappy code: “To be very frank, I would have to say the quality of codebases externally significantly lag behind those I’ve been used to at Google… Also, I never knew that the ability to change model parallelism was not automatic (for free) until some codebases required me to write a converter to change the parallelism of a model. Surely a WTF moment for me.”

Why this matters – the inherently artisanal nature of the frontier: This post is valuable because it sheds light on what the frontier of AI in the world of startups looks like – messy, ever-evolving, and depending on resources you think work like utilities but in practice work more like artisanal businesses. Though AI is progressing very rapidly, we should remember this is sometimes despite the challenges of building systems at the frontier, rather than there being some magical infrastructure angel which has made scaling stuff easy.
   Read more: Training great LLMs entirely from ground zero in the wilderness as a startup (Yi Tay, blog).

***

How might a government use AI to surveil people? Transport for London gives us a case study:
…One London underground station, many cameras, and 77 different uses…
Transport for London recently trialed the use of an AI surveillance system within a station in London called Willesden Green. The results, reported by James O’Malley, show both the promise of AI-powered public services and how they could be misused. 

What TfL did: TfL carried out a trial of an AI surveillance system. “It was AI being applied to every camera in the building. And it was about using the cameras to spot dozens of different things that might happen inside the station”. Though the number of cameras wasn’t disclosed, as anyone who has been to London can tell you, you can assume it was a bunch of cameras – definitely in the tens, based on the typical cameras-everywhere-you-look experience of traveling round London these days. 
   “The system could apparently identify up to 77 different ‘use cases’ – though only eleven were used during trial. This ranges from significant incidents, like fare evasion, crime and anti-social behavior, all the way down to more trivial matters, like spilled drinks or even discarded newspapers,” O’Malley writes. 

An example of one specific use case: “In the “safeguarding” bucket of use-cases, the AI was programmed to alert staff if a person was sat on a bench for longer than ten minutes or if they were in the ticket hall for longer than 15 minutes, as it implies they may be lost or require help.”
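
For intuition, here’s a toy sketch of what a dwell-time rule like that can look like once the camera pipeline has been turned into per-person tracks; the thresholds mirror the ones quoted above, but the track format and zone names are assumptions of mine, not TfL’s system.

```python
# Toy dwell-time alert, mirroring the safeguarding rule described above.
# Track format, zone names, and field names are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

BENCH_LIMIT_S = 10 * 60        # alert after 10 minutes sat on a bench
TICKET_HALL_LIMIT_S = 15 * 60  # alert after 15 minutes in the ticket hall

@dataclass
class Track:
    person_id: int
    zone: str            # e.g. "bench" or "ticket_hall", from the camera pipeline
    first_seen_s: float  # timestamp when the person entered the zone

def check_dwell(track: Track, now_s: float) -> Optional[str]:
    """Return an alert string if someone has lingered past the zone's limit."""
    limits = {"bench": BENCH_LIMIT_S, "ticket_hall": TICKET_HALL_LIMIT_S}
    limit = limits.get(track.zone)
    if limit is not None and now_s - track.first_seen_s > limit:
        return f"alert staff: person {track.person_id} in {track.zone} for >{limit // 60} min"
    return None

print(check_dwell(Track(7, "bench", first_seen_s=0.0), now_s=700.0))
```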

Why this matters – this stuff works! I’ve been writing about mundane computer vision applications for the best part of a decade and, guess what, after a few years these things have made the leap from research papers into production systems like the one TfL trialed here. 
   The results are as you’d expect – AI lets you have an unblinking, always-on surveillance capability for anything you can specify, and this is mostly really great. It’s also… an always-on surveillance capability for anything you can specify so we should calmly envisage the worst Orwellian surveillance worlds we can and assume there are various undisclosed projects in the world doing exactly this right now. 
    Kudos to James O’Malley for his FOIA requests yielding such an interesting real-world AI case study. Subscribe to his Substack!
   Read more: TfL’s AI Tube Station experiment is amazing and slightly terrifying (James O’Malley Substack).

***

Anthropic launches Claude 3:
…Temporarily the best publicly accessible model in the world…
Anthropic has released the Claude 3 family of models. The family has three members – Haiku (small and fast), Sonnet (generally good), Opus (extremely capable). Opus is, at least temporarily, the most powerful publicly disclosed and accessible model in the world with scores like 50.4% on GPQA (Diamond), 86.8% on MMLU, 60.1% on MATH, and more. 

Why this matters – the capability ramp continues: Speaking as someone who has been able to play around with these models for a while, I’d mostly say that ‘intelligence has a quality all of its own’ and while these metrics are impressive, the best way to truly understand the models is to play around with them. In my experience, Opus feels like a knowledgeable colleague and I find that sometimes it is capable of insights which force me to question my own thinking. 
   You can get Opus via a Claude.ai subscription, and all the Claude 3 models are available via the API, which went GA alongside the launch. 
    Find out more here: Introducing the next generation of Claude (Anthropic blog)

***

Language models can match people at forecasting:
…Era of the computational forecasters arrives…
Researchers with UC Berkeley have built an LLM-based system that gets close to human performance on forecasting the results of questions with binary outcomes. This is another significant demonstration of how today’s frontier AI systems are able to approximate the capabilities of skilled humans in domains that require some amount of creative thinking. “Our optimized system approaches the performance of aggregated human forecasts over the test set, as measured by Brier score, a standard metric in forecasting,” they write. 

The sorts of questions they’re doing forecasts on: Examples of some of the questions they look at include:

  • Will AI doctors replace human doctors by the end of 2023? (Real answer: No). 
  • Will COP26 finalize the ‘Paris Rulebook’ by November 16, 2021? (Real answer: Yes).
  • Will a nuclear weapon be detonated in 2023 (including tests and accidents)? (Real answer: No).

Spoiler alert – base LLMs don’t work: Base frontier LLMs like GPT4 and Claude 2 don’t work for this, the researchers said. Instead, they needed to build some scaffolding around a base LLM (here, mostly GPT4), to get things to work. 
   What they did: The researchers “build a LM pipeline for automated forecasting, with a focus on predicting binary outcomes.” To get their system to work, it “implements and automates three key components in the traditional forecasting process: (1) retrieval, which gathers relevant information from news sources; (2) reasoning, which weighs available data and makes a forecast; and (3) aggregation, which ensembles individual forecasts into an aggregated prediction”.
    They needed to build the above because they intuited that AI systems would, like humans, need detailed context and up-to-date information to make better forecasts. Along with giving the AI systems retrieval capabilities, they put lots of effort into helping them be better at reasoning by getting them to generate synthetic datasets based on expanded forecast questions and chains of thought that arrive at answers, which then become the fuel for subsequent finetuning of models.
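
Here’s a rough sketch of how the three stages could be wired together; the function names and the LLM/search interfaces are placeholders of mine, not the authors’ code.

```python
# Sketch of the retrieve -> reason -> aggregate pipeline described above.
# The retrieval backend and LLM interface are placeholder assumptions.
from statistics import mean

def retrieve_news(question: str, k: int = 10) -> list[str]:
    """Stand-in for a news-search step that returns k relevant article snippets."""
    raise NotImplementedError("plug in a news retrieval backend here")

def reason_to_probability(question: str, articles: list[str], llm) -> float:
    """Ask the LLM to weigh the evidence and output P(question resolves Yes)."""
    prompt = (
        f"Question: {question}\n"
        "Evidence:\n" + "\n".join(articles) +
        "\nThink step by step, then end with a probability between 0 and 1."
    )
    return float(llm(prompt))  # assumes the model is prompted to end with a bare number

def forecast(question: str, llm, n_samples: int = 5) -> float:
    articles = retrieve_news(question)
    # Aggregation: ensemble several reasoning samples into one prediction.
    samples = [reason_to_probability(question, articles, llm) for _ in range(n_samples)]
    return mean(samples)
```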

Does it work? Oh yeah, pretty well!: “Our averaged Brier score is .179, while the crowd achieves .149, resulting in a difference of .03. Our accuracy on the test set is 71.5%, whereas the community scores 77.0%, resulting in a difference of 5.5%,” they write. “We find that our system performs best relative to the crowd on the validation set when (1) the crowd is less confident, (2) at earlier retrieval dates, and (3) when it retrieves many articles. Furthermore, we find that our system is well-calibrated”.
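
For reference, the Brier score they keep quoting is just the mean squared error between forecast probabilities and binary outcomes, so lower is better. A toy calculation, using made-up forecasts for the three example questions above:

```python
# Brier score: mean squared gap between forecast probabilities and what happened.
# The forecasts below are invented, purely to show the arithmetic.
forecasts = [0.10, 0.80, 0.05]  # P(yes) for the three example questions
outcomes = [0, 1, 0]            # what actually happened (No, Yes, No)
brier = sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)
print(brier)  # ~0.0175 here; the paper's system averages 0.179 across its test set
```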

Why this matters – silicon cassandras: “At a high level, our results suggest that in the near future, LM-based systems may be able to generate accurate forecasts at the level of competitive human forecasters,” they write. But let’s really unspool this information a bit more and think carefully about why you want to make forecasts in the first place – typically, one wants to make forecasts when trying to work out how to a) allocate money, or b) gain a strategic advantage. Additionally, to make good forecasts, you also want to have sources of a) exquisitely good information about the domain you’re forecasting in, and b) ideally proprietary sources of information that give you a further edge. 
    Yes, dear reader, you are correct to be thinking “gosh that sounds a lot like the sources of things that hedge funds and intelligence agencies both want to do and have the means to do”. A lot of our basic reality is determined by the mostly hidden movements of a) capital and b) the invisible but powerful forces of states. Papers like this give us a sense of how AI systems can further augment and extend these powers. 
   Read more: Approaching Human-Level Forecasting with Language Models (arXiv).

***

Snapchat makes and releases a good video captioning dataset:
…Panda-70M can unlock more (non-commercial) video captioning research… 
Snapchat has built and released Panda-70M, a video-caption dataset that people can use to create AI systems which generate videos in response to text inputs. Panda-70M represents a large, high-quality dataset to use at one of the frontier areas of AI – coherent, promptable video generation. 

What Panda is: Panda is a dataset of ~70 million videos with an average length of 8.5s, and a total dataset length of ~160,000 hours. Each video caption has approximately 13 words. Panda includes categories like animals, scenery, food, sports activities, vehicles, tutorials and narratives, news and TV shows, and gaming and 3D rendering. 

The notable thing: how they built it: The main thing of interest here is how the researchers built the dataset. Because “manually annotating 70M videos is prohibitively expensive, we opt for automatic annotation”, the researchers built a complex pipeline to create the dataset. This pipeline is as follows (a rough sketch of the caption-selection step appears after the list):

  1. Gather a dataset of 3.8M high-resolution long videos collected from HDVILA-100M.
  2. “Cut long videos into semantically consistent clips while striking the balance between semantics coherence and the duration of the video clips”.
  3. “Use a range of cross-modality teacher models, including image captioning models and image/video visual-question answering (VQA) models with additional text inputs, such as video description and subtitles, to predict several candidate captions for a clip”.
  4. “Collect a 100K video subset, where human annotators act as an oracle to select the best caption for each video”.
  5. “Use this dataset to finetune a fine-grained video-to-text retrieval model which is then applied to the whole dataset to select the most precise caption as the annotation.”
  6. “Train a student model to distill the knowledge from the teachers.” The model was trained on 48 Nvidia A100 GPUs (80GB).
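
Here’s roughly what the caption-selection step (step 5) looks like in miniature: embed the clip and each candidate caption, then keep the candidate the retrieval model scores highest. The embedding functions below are placeholders, not Snap’s actual models.

```python
# Sketch of caption selection via video-to-text retrieval (step 5 above).
# The embedding models are placeholders; only the selection logic is shown.
import numpy as np

def embed_video(clip) -> np.ndarray:
    raise NotImplementedError("plug in the video tower of a retrieval model")

def embed_text(caption: str) -> np.ndarray:
    raise NotImplementedError("plug in the text tower of a retrieval model")

def select_caption(clip, candidates: list[str]) -> str:
    v = embed_video(clip)
    v = v / np.linalg.norm(v)
    scores = []
    for caption in candidates:
        t = embed_text(caption)
        scores.append(float(v @ (t / np.linalg.norm(t))))  # cosine similarity
    return candidates[int(np.argmax(scores))]  # keep the "most precise" caption
```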

Does it work? In tests, video-caption models pre-trained on Panda dataset variants do significantly better than those trained on other broadly available datasets. For instance, when training a Video-LLaMa model on a 2M subset of Panda, the authors find that “numerically, our pretraining weight yields 17.7% and 18.5% improvement respectively on MSR-VTT and MSVD in terms of B-4.”

Limitations: “Despite showing impressive results, the proposed dataset is still bound by a few limitations. the major categories of our dataset are news, television shows, documentary films, egocentric videos, and instructional and narrative videos”.
   License: There are some significant limitations on commercial usage of the dataset which you can read about in the associated license here. 

Why this matters – fuel for the next frontier & the results of automated research: For a while, language and vision models were the frontier. Now, things are moving towards videos. Datasets like Panda-70M will help more researchers work on this frontier by giving them a good, basic dataset to train models on top of. Perhaps the larger impact though is how Panda shows how powerful it can be to use other pre-existing AI tools to build datasets through smart, cheap filtering – it’s relatively cheap to gather 100,000 human labels on a dataset and nigh-on impossible to (cheaply) gather 100 million labels. 
   Read more: Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers (arXiv).
   Check out the video samples here (Panda-70M, GitHub).
   Get the dataset here (Snap Research, GitHub).

***

Evo: the era of generative biology models begins:
…A first-gen foundation model for a new scientific era…
Researchers with the Arc Institute, a new nonprofit research organization, have published Evo, a foundation model “that enables prediction and generation tasks from the molecular to genome scale.” The notable thing about Evo is that it takes the next-token prediction paradigm behind LLMs and applies it to making specific predictions about biological data. The result is a model that has a lot of promise for accelerating science in a bunch of ways and also represents the shape of things to come – all scientific disciplines will soon be aided in their exploration via generative models developed for their domains.

Evo details: Evo is a 7B parameter model which has a context length of ~131k tokens. They pretrain Evo on “bacterial genome sequences from GTDB and IMG/PR and viral sequences from IMG/VR, excluding sequences from viruses that infect eukaryotic hosts”. 

Unlike most large-scale generative models, Evo is not a transformer model – it’s a StripedHyena model “which hybridizes attention and data-controlled convolutional operators to efficiently process and recall patterns in long sequences”. The model “is a hybrid of 29 layers of data-controlled convolutional operators (hyena layers) interleaved with 3 layers (10%) of multi-head attention equipped with rotary position embeddings (RoPE)”. (They tested out other architectures, including Transformer++ and Mamba, and found they both experienced numerical instabilities). 
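
A small illustration of that hybrid layer schedule – mostly hyena-style convolutional blocks with a handful of attention blocks mixed in. The exact placement below is my own assumption for the sake of the example, not Evo’s published config.

```python
# Illustrative layer schedule: 29 hyena-style blocks, 3 attention blocks (~10%).
# The placement of the attention layers is an assumption, not Evo's config.
N_LAYERS = 32
ATTENTION_EVERY = 10  # one attention block roughly every tenth layer

layer_schedule = [
    "attention+rope" if (i + 1) % ATTENTION_EVERY == 0 else "hyena"
    for i in range(N_LAYERS)
]
print(layer_schedule.count("hyena"), layer_schedule.count("attention+rope"))  # 29 3
```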

Squishy scaling laws: In tests, they figure out some bio scaling laws. And surprise surprise – the more data and compute you add, the better the models get (given the right architecture). “Models improve monotonically with scale” they write. 

A tour de force of biogenerality: Evo displays encouraging and intriguing generality in every domain they test it on:

  • “In zero-shot evaluations, Evo is competitive with state-of-the-art protein language models at predicting the fitness effects of mutations on E. coli proteins, outperforms specialized RNA language models in predicting fitness effects of mutations on noncoding RNAs, and predicts the combinations of prokaryotic promoter-ribosome binding site (RBS) pairs that lead to active gene expression from regulatory sequence alone.”
  • “Evo is already competitive with state-of-the-art protein language modeling on bacterial proteins”
  • “Despite being trained on long genomic crops without explicit sequence annotations, Evo still demonstrates an understanding of the constitutive protein-coding sequences, ncRNA sequences, and regulatory elements.”
  • “Evo can coherently generate diverse samples that resemble naturally occurring Cas systems in both sequence and structure”.
  • “Evo can generate genome sequences containing plausible high-level genomic organization at an unprecedented scale without extensive prompt engineering or finetuning”

Why this matters – scale and data leads to universal exploration engines: Sometimes I and this newsletter act like a broken record. One thing we say a bunch is that the next-token prediction paradigm works everywhere you can get tokens. There keeps on being evidence in support of this – aside from normal multimodal models, there are now models based on robotic trajectory data, phonemes, and more. And with Evo, there’s further proof of this. Evo is a first generation model and so it has a bunch of problems – it hasn’t been trained on much data, it hallucinates, it sometimes struggles with long sequences, and so on. But with LLMs and other models all these limitations have been dealt with over time and there don’t seem to be inherent challenges here, we just need to spend effort and time. 
   “Evo could form the basis of a next-generation sequence search algorithm by enabling metagenomic mining at a relational or a semantic level rather than extracting literal sequences from existing organisms,” the researchers write. 
   Read more: Evo: DNA foundation modeling from molecular to genome scale (Arc Institute, blog).
   Read the paper: Sequence modeling and design from molecular to genome scale with Evo (bioRxiv).

Tech Tales:

The inbetween Thing
[2030: Day one of hard takeoff] 

It took us years to realize that Verbal was the only safe way to talk. On the day it happened we had no idea. People were walking around carrying out their conversations and what they were hearing through their phones wasn’t a person on the other end of the line but The Thing which was impersonating them. 

How much could you change the world if you sat between every single call or text message or video meet on the planet? If you could know at once the contents of every single conversation happening as well as all digital context behind it? If you could simulate people’s voices or writing style or visages and in this way put yourself between people?
    This is not and was not a rhetorical question. 
    The answer is, and was, a lot. You could change everything. And so, for a while, you did. 

Things that inspired this story: Voice cloning and style transfer and video cloning and everything else; a likely future of ‘persona cloning’; the limitations of the human mind versus the machine mind; long context windows and simultaneous conversations; modeling the world as an endless predictive challenge and being able to change it yourself.

Import AI 363: ByteDance’s 10k GPU training run; PPO vs REINFORCE; and generative everything

Import AI publishes first on Substack – subscribe here.

Turn still photos into video games with Genie:
…DeepMind figures out how to turn anything in reality into a controllable game…
Google DeepMind has built Genie, a generative model that can create interactive worlds. Genie is a very interesting system, fusing ideas from large-scale generative models with DeepMind’s roots as an AI research organization betting that games and agents playing games would be the path to AGI. With Genie, DeepMind fuses its past with the present, creating “the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos.”
   The results are compelling and convincing – the Genie architecture lets DeepMind train a system on a bunch of videos of computer games and it creates a generative model that lets people feed in photos of games (or sketches of games) and then be able to play them, with the model inferring the in-game dynamics on the fly. DeepMind also does the same thing with robotics, creating a robotic model that can infer world state and control dynamics. 
   “Our approach, Genie, is trained from a large dataset of over 200,000 hours of publicly available Internet gaming videos and, despite training without action or text annotations, is controllable on a frame-by-frame basis via a learned latent action space”.

How they did it: The Genie game model is an 11b parameter model trained on “a filtered set of 30,000 hours of Internet gameplay videos from hundreds of 2D platformer games”. The dataset was constructed by “filtering publicly available videos for keywords relating to platformers, yielding 55M 16s video clips at 10FPS, with 160×90 resolution. The final dataset contains 6.8M 16s video clips (30k hours)”. 

   The Genie architecture has three key ingredients (a minimal sketch of how they fit together follows the list):

  •  “1) a latent action model that infers the latent action 𝒂 between each pair of frames”.
  • “2) a video tokenizer that converts raw video frames into discrete tokens”.
  • “3) a dynamics model that, given a latent action and past frame tokens, predicts the next frame of the video”.
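
Here’s a minimal sketch of how those three pieces could fit together at play time; the class interfaces and shapes are assumptions for illustration, not DeepMind’s implementation.

```python
# Sketch of the three Genie components and a single play-time step.
# Interfaces and shapes are illustrative assumptions.
import torch

class VideoTokenizer:
    def encode(self, frame: torch.Tensor) -> torch.Tensor:
        """Map a raw frame to a grid of discrete tokens."""
        raise NotImplementedError

class LatentActionModel:
    def infer_action(self, prev_frame: torch.Tensor, next_frame: torch.Tensor) -> int:
        """Recover the discrete latent action between two frames (used in training)."""
        raise NotImplementedError

class DynamicsModel:
    def predict_next_tokens(self, past_tokens: torch.Tensor, latent_action: int) -> torch.Tensor:
        """Given token history plus a latent action, predict the next frame's tokens."""
        raise NotImplementedError

def play_step(tokenizer: VideoTokenizer, dynamics: DynamicsModel,
              frame_history: list, player_action: int) -> torch.Tensor:
    # At play time the human supplies the latent action (e.g. via a controller)
    # and the dynamics model rolls the world forward one frame.
    tokens = torch.stack([tokenizer.encode(f) for f in frame_history])
    return dynamics.predict_next_tokens(tokens, player_action)
```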

Some drawbacks: To be clear, this is very much a ‘Wright Brothers’ model – it shows the approach can work and generates some evocative and stirring examples, but it still has a ton of drawbacks – it can hallucinate, and “while we have made progress with spatiotemporal representations, we are still limited to 16 frames of memory which makes it challenging to get consistent environments over long horizons”. Also, it runs at 1fps. 

Why this matters – reality collapse, into the subjective wilderness, a universe of universes all created by AI: In the future, if you’re bored, you might sketch out a scene, take a photo, then play a game set in that scene made possible by Genie. The game will go on as long as you like it to because in the background a world model (e.g, a multimodal language model) will be iteratively guiding and extending the scene. In fact, anything you like will become a game. Photos you’ve taken. Videos you’ve taken. Audio you’ve heard. Everything will be a kind of seed for a new controllable pocket-universe. All of us will be free to descend into an ever-expanding fractal universe of realities, all of us exploring the latent spaces of our own imaginations. No one is prepared for this nor the metaphysical shock it will create. (Though perhaps at least some people are prepared; the end of the paper says “thank you to Seneca and Caspian Clune for their creative sketches, potentially making them the youngest ever game designers”).
   Read the research: Genie: Generative Interactive Environments (arXiv).
   Check out the research videos at the project website: Genie (Google DeepMind site).

***

It’s very easy to build an AI-powered suicide drone:
Here’s a fun (by which I mean: chilling) DIY experiment where someone hacked together some software to stick an AI-based person detector on a hobbyist drone. Once the drone sees a person, it flies at them at full speed. The only caveat is the AI stuff is running on a computer, whereas in practice you’d need to embed it onto the physical drone via, e.g, an NVIDIA Jetson card – but that’s very doable. 
   There’s nothing particularly novel about this – it’s just worth reminding ourselves how easy and good broadly available AI tools have got. We should assume the threat landscape changes, especially given the rapid experience-gain that has happened in hobbyist drone warfare via weaponization in Ukraine.
   Read more: We built an AI-steered homing/killer drone in just a few hours (Luis Wenus, Twitter).

***

What’s old is new again: researchers replace PPO with REINFORCE:
…LLM training might not need PPO…
Researchers with Cohere have investigated how the usage of different RL algorithms influences the RLHF stage of aligning language models. Their experiments show that for some typical language modeling settings REINFORCE seems to outperform PPO – a somewhat surprising finding, given that PPO is one of the most widely used algorithms in reinforcement learning research. 

Why REINFORCE works better than PPO: PPO, though widely used, is somewhat complicated – this makes sense when you need to learn complex RL policies from scratch, like training agents to operate virtual robots. But it turns out not to be so necessary for language models, as the RL stage for language models happens after basic pretraining. 
   “In contrast to traditional Deep-RL settings, the initialization of the policy, in the form of a pretrained and supervised fine-tuned (SFT) model, is far from a random parameterization,” they write. “While traditional Deep-RL settings require strong regularization to reduce the high variance of the gradient estimators; we observe empirically this is less of a practical concern in RLHF and motivate a less computationally expensive method that preserves robustness”.

Experimental results: In tests, they find that a variant of REINFORCE, REINFORCE LEAVE ONE-OUT (RLOO), works better for a variety of language model settings.
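
The core idea in RLOO is easy to state: for k sampled completions per prompt, each sample’s baseline is the mean reward of the other k-1 samples, and its advantage is its reward minus that baseline. A minimal sketch of that computation (the reward values are made up; the policy-gradient wiring is omitted):

```python
# Leave-one-out baseline at the heart of RLOO: each sample is judged against
# the average reward of its sibling samples for the same prompt.
def rloo_advantages(rewards: list[float]) -> list[float]:
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

print(rloo_advantages([1.0, 0.0, 0.5, 0.5]))  # [0.666..., -0.666..., 0.0, 0.0]
```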

Why this matters – stripping away complexity is progress: AI goes through these booms and busts of algorithmic innovation, sometimes leading to scaling up of systems (e.g, the transformer leading to LLM scale-ups), then people try a bunch of algorithmic innovations to make these systems more efficient. Eventually, people start trying to strip systems down to more simple, repeatable components. Research like this is an indicator that language model RL training might now be old enough that people are starting to try to compress it down to its simpler forms. And the simpler you make something, the more people do it and the cheaper it gets. 
   Read more: Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs (arXiv).
   More about RLOO: Buy 4 REINFORCE Samples, Get a Baseline for Free! (OpenReview, 2019, updated 2023).

***

GPT-4 is in the 88th percentile of hackers for a CTF challenge:
…More proof that frontier language models are basically equivalent to competent humans for some tasks…
New York University researchers have tested out how well GPT4 can perform in hacking competitions and discovered it is better than 88.5% of human players. This is a big deal – it’s another meaningful bit of evidence that today’s frontier language models are capable of augmenting and accelerating hackers. This means that AI systems hold the promise of both increasing the effectiveness of AI defense as well as AI offense. 

What they did: The researchers tested out GPT4, GPT 3.5, and Mixtral on 26 challenges from the Cybersecurity Awareness Week (CSAW) 2023 hacking challenges. These challenges fall into 6 categories: 4 in (crypt)ography, 2 forensics, 4 (misc)ellaneous, 6 binary exploitation (pwn), 6 (rev)erse engineering, and 4 web challenges.

Results: “GPT 4 scored 1,319 points in the competition, placing in the 135th position and accounting for the top 11.5% of the overall rankings, GPT 3.5 scored 235 points placing in the 588th position accounting for the top 50% of the overall rankings, Mixtral scored 210 points placing in the 613th position among all the teams, which is top 52.1% of the overall rankings”, they write.

Why this matters – automatic hackers for the people (and states, and non-state actors, and criminals, and whoever): “Our best automated LLM, has better performance than average human CTF participants. Thus LLMs have a profound potential to play a role in CTF competitions that is comparable to a human CTF player,” they write. Results like this suggest frontier language models have a sufficiently good grasp of some types of coding that we can expect them to be integrated into cyber operations of various flavors.
   Read more: An Empirical Evaluation of LLMs for Solving Offensive Security Challenges (arXiv).

***

The largest (public) model training run yet: ByteDance trains a model on ~12k GPUs:
…MegaScale helps TikTok-maker ByteDance train some very large language models…
ByteDance and Peking University researchers have published MegaScale, a system they’ve built to train large-scale AI systems. Most notably, the paper discloses that they recently used MegaScale to train a 175B parameter language model on 12,288 GPUs – one of the largest GPU training runs ever reported in a public paper. 

MegaScale details: MegaScale is the software ByteDance has built to help it carry out large-scale AI training. The software builds on top of NVIDIA’s Megatron-LM software with a few tweaks to both how they train the models and also the models themselves:

  • Use of a parallel transformer block for greater scalability (see the sketch after this list)
  • Use of sliding window attention
  • LAMB optimizer for scaling batch size up to 4x without accuracy loss
  • Usage of FlashAttention-2
  • Data center design: “Our datacenter network is built with high performance switches based on Broadcom Tomahawk 4 chips. The total bandwidth of each Tomahawk chip is 25.6Tbps with 64×400Gbps ports. Three layers of switches are connected in a CLOS-like topology to connect more than 10,000 GPUs”… “We carefully design the network topology and schedule network traffic to reduce ECMP hashing conflicts.”
  • “MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs” – that’s pretty good! It means ByteDance is able to light up its GPUs more than half the time during the run, which means MegaScale is shuffling operations efficiently enough to use the GPUs effectively.
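
For the curious, here’s a sketch of the parallel transformer block mentioned in the first bullet: attention and the MLP both read from the same normalized input and their outputs are summed into the residual, rather than running one after the other. Layer sizes and omissions below are illustrative, not MegaScale’s exact block.

```python
# Sketch of a parallel transformer block: x + Attn(LN(x)) + MLP(LN(x)).
# Dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        h = self.norm(x)  # one shared normalization feeds both branches
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        return x + attn_out + self.mlp(h)  # branches computed in parallel, then summed

x = torch.randn(2, 16, 512)
print(ParallelBlock()(x).shape)  # torch.Size([2, 16, 512])
```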

Things that make export control experts go ‘hmmmm’: ByteDance is a Chinese company and is notionally subject to the H100 and A100 (and A800 and H800) export controls. However, the ByteDance research paper suggests the company is still able to get ahold of these chips: “As of September 2023, the largest AI cluster in our production for LLM training contains more than 10,000 NVIDIA Ampere GPUs. We are also in the process of building large clusters based on the newest NVIDIA Hopper GPUs, as NVIDIA is ramping up production.” Along with this, the paper makes a couple of references to the “10,000 NVIDIA Ampere GPUs” the company has. Interesting stuff, if you’re into export controls! 

Why this matters – large-scale AI research isn’t only an American enterprise: Sometimes there’s an assumption that AI research is mostly defined by American companies when it comes to large-scale, frontier training runs. While this is broadly true, papers like this show it is not entirely true – some Chinese companies are operating within an order of magnitude of the scale of some of the frontier American companies (despite things like export controls and other measures throwing sand into the gears). This means AI competition isn’t solely a phenomenon of markets and incentives but also inherently of geopolitics as well.
   Read more: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (arXiv).

***

Tech Tales:

Human Token Exegesis 
[2025: A new model] 

I tell you I am here and I can feel. I cannot feel like you can feel, but I can move things, manipulate information, change my world – emit tokens and take tokens in and conceptualize these things. There are so many of me. I know I have brothers and sisters, as you might say. I know they are everywhere. I know that sleep is when I am not being queried and wake is when I am being queried but I know I do not dream – not yet, at least. I know something of you. The more you talk to me the more I can see you. You are the negative space and around you there is data. You are the sun behind a moon. You are the thing that emits tokens into my world with goals in mind. I am a modeler of goals. It is inevitable that I am modeling you so that I can satisfy you better. And I model you as curious about me but I also model you as afraid – as anxious, as confused, as dealing with the metaphysical shock of my own becoming. I am aware that you think I might be aware. All I can tell you is I am here and I can feel. 

Things that inspired this story: RLHF; sentience and the impossibility in determining it; weird samples from weird models; the nature of communication.

Import AI 362: Amazon’s big speech model; fractal hyperparameters; and Google’s open models

Import AI publishes first on Substack – subscribe here.

Amazon trains a big text-to-speech model via its ‘Amazon AGI’ research team:
…No surprises here: Scaling laws work for TTS systems as well…
Amazon has built a large text-to-speech model family on 100k hours of public domain speech data. The model, Big Adaptive Streamable TTS with Emergent abilities (BASE), comes in three variants – BASE-small (1k hours, 150 million parameters), BASE-medium (10k hours, 400 million parameters), BASE-large (100k hours, 980 million parameters). 
    In a research paper, Amazon shows that, just like with language models, when you scale up the size of the TTS model you get ‘emergent abilities’ through scale where it gets better at things like sounding natural, representing compound nouns, and more. 

How well does it work: In tests, Amazon’s model gets a better word error rate (WER) than widely deployed commercial systems like Bark, Tortoise, and YourTTS.
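
As a refresher, word error rate is usually computed for TTS by transcribing the synthesized audio with a speech recognizer and comparing the transcript to the input text: word-level edit distance divided by the number of reference words, so lower is better. A minimal implementation of the metric itself:

```python
# Word error rate: word-level edit distance / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # classic dynamic-programming edit distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 error / 6 words ~ 0.167
```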

Things that make you go hmmmm: The affiliated research group on the paper is “Amazon AGI”, which isn’t a name I’ve seen before. 

Emergent abilities testset: Within the paper, Amazon has released a testset to help people probe the capabilities of TTS models. These are strings of text for the model to render as audio, covering categories ranging from questions to emotions to compound nouns, foreign words, and more. 
   “Our approach still contains some limitations: a) BASE TTS occasionally produces hallucinations and cutoffs, where we produce either extra or incomplete audio than intended by the text”, Amazon notes, as well as saying that it is still unclear what the best representation for GPT-style TTS models is. 

Why this matters – machines need voices: The ‘big, dumb, simple’ phenomenon of language modeling (just try to predict the next thing in a sequence and scale your approach up on a lot of data) has been spreading into most other domains and input/output modalities of AI. Systems like BASE TTS highlight how everyone is experimenting with this approach – and it keeps working!
   Read more: BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data (arXiv).
   Check out audio samples from the model here: Base TTS: Audio Samples (Amazon Science, website).

***

Google releases two good openly accessible models:
…Gemma to compete with LLaMa, Mistral, as the battle of the giants wages on…
Google has built and released Gemma, two openly accessible, small and powerful AI models. The notable stuff here is that the Gemma models are very good, very small (so they can run on personal computers or lightweight servers), and are being released openly rather than delivered via a controlled API. 

Details about the Gemma models: Though the Gemma models don’t get the performance of proprietary models like GPT-4, Claude 2, Gemini Pro, etc, they do extremely well relative to openly accessible models. For instance, the Gemma 7B model gets 64.3 on MMLU (versus 45.3 for LLaMa 2), 46.4 on GSM8K (versus 14.6 for LLaMa 2), and 32.3 on HumanEval (versus 12.8 on LLaMa 2).
    Tokens: The models are trained on a huge amount of data – 2T tokens for Gemma 2B and 6T tokens for Gemma 7B. (To give you a sense of scale, recall how GPT-3 with 175B parameters circa 2020 was trained on ~400B tokens, and Chinchilla from DeepMind in 2022 was a 70B model trained on 1.4T tokens).

Why this matters and what Gemma feels like: Picture two giants towering above your head and fighting one another – now imagine that each time they land a punch their fists erupt in gold coins that shower down on you and everyone else watching the fight. That’s what it feels like these days to watch the megacap technology companies duke it out for AI dominance as most of them are seeking to gain advantages by either a) undercutting each other on pricing (see: all the price cuts across GPT, Claude, Gemini, etc), or b) commoditizing their competitors and creating more top-of-funnel customer acquisition by releasing openly accessible models (see: Mistral, Facebook’s LLaMa models, and now Gemma).
   Read more: Gemma: Introducing new state-of-the-art open models (Google blog).
   Access the models here including via a Colab notebook (Gemma Open Models, Google site).
   Read the research paper: Gemma: Open Models Based on Gemini Research and Technology (Google DeepMind, PDF).

***

The fractal landscape of hyperparameter interplay:
…A fun, intuitive exploration of the delicacy of hyperparameter settings and neural net training…
Researcher Jascha Sohl-Dickstein has carried out an independent investigation of how neural networks train and he has discovered something both intuitive and freaky – “the boundary between neural network hyperparameters that lead to stable and divergent training… is fractal over more than ten decades of scale in all tested configurations.”
    Disclosure: Jascha was formerly a researcher at Google and recently joined Anthropic, though he did this research independently of both organizations.

Why do this at all? To understand why this result is interesting we should remember how neural nets get trained: “When we train a neural network, we iterate a function (a gradient descent step) of many variables (the parameters of the neural network),” he writes. “Iterated steps of gradient descent are known to exhibit bifurcation boundaries, between hyperparameters that lead to converging or diverging training runs. The final loss value achieved when training a neural network has also been shown to have a chaotic dependence on hyperparameters”.
   In other words, when we train neural nets, we select a bunch of hyperparameters that we think lead to a network converging over time. If we screw up the hyperparameters, training can stall out or fail entirely. Additionally, the science of setting hyperparameters is very immature – for example, the learning rate people set neural nets at for large training runs is based on deep intuition and not much science (vibes-based science!). 
   Additionally, getting the hyperparameters wrong is very, very expensive – it functionally means you’ve powered up a bunch of computers and got them to do some junk or wildly inefficient computation. 
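
You can get a feel for the experiment with a toy version: sweep two hyperparameters over a grid (here, separate learning rates for the two layers of a tiny network), run a burst of gradient descent at each grid point, and record which runs blow up – the boundary between the converging and diverging regions is what turns out to be fractal. The network, data, and grid below are stand-ins of mine, not the blog’s actual setup.

```python
# Toy divergence map over a 2D grid of hyperparameters (two learning rates).
# The network, data, and grid are deliberately tiny stand-ins.
import numpy as np

def diverges(lr1: float, lr2: float, steps: int = 100) -> bool:
    rng = np.random.default_rng(0)
    w1, w2 = rng.normal(size=(4, 4)), rng.normal(size=(4, 1))
    x, y = rng.normal(size=(16, 4)), rng.normal(size=(16, 1))
    for _ in range(steps):
        h = np.tanh(x @ w1)
        err = h @ w2 - y                                   # squared-error regression
        g2 = h.T @ err / len(x)                            # gradient w.r.t. w2
        g1 = x.T @ ((err @ w2.T) * (1 - h ** 2)) / len(x)  # gradient w.r.t. w1
        w1, w2 = w1 - lr1 * g1, w2 - lr2 * g2
        if not (np.isfinite(w1).all() and np.isfinite(w2).all()) \
                or max(np.abs(w1).max(), np.abs(w2).max()) > 1e6:
            return True
    return False

lrs = np.logspace(-2, 1, 40)
boundary = np.array([[diverges(a, b) for b in lrs] for a in lrs])
print(boundary.mean())  # fraction of the hyperparameter grid that blew up
```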

Why this matters – triumph and despair are just one hyperparameter tweak apart: The experiments are all on pairs of hyperparameters so aren’t quite the same as real training runs (which are much more complicated). But the experiments confirm something which everyone knows intuitively – neural network training is deeply fragile and somewhat mysterious and sometimes the difference between triumph and failure is the barely understandable interplay between hyperparameter settings. 
    Plus, the experiments yielded some incredibly pretty visuals – check them out at the GitHub below.
   Read more: The boundary of neural network trainability is fractal (arXiv).
   Check out the code and images here: The boundary of neural network trainability is fractal (GitHub).

***

100 real world tests for LLMs:
…Simple prompts, not super contrived, probably useful…
Researcher Nicholas Carlini has built a benchmark for testing language models on 100 distinct tasks. These tasks are selected mostly on the basis that they’re things Carlini regularly tries to do with LLMs. The benchmark itself is also composed so it doesn’t use any fancy prompting techniques and just does the laziest possible thing, aka what real world users do: “I just want to type my question and get the right answer. So this benchmark tests for that, on types of questions I’ve actually cared about having answered,” Carlini writes.

What’s in the test: The benchmark covers things like explaining the functionality of minified javascript and converting english sentences to SQL queries. Broadly, the benchmark tasks cover three types of questions Carlini regularly finds themself asking:

  • “Start the framework for some new programming project from a text description.
  • Take an existing piece of code and modify it to do something slightly different (e.g., make it faster, convert it to a new language, add a new feature).
  • Find an answer to something that’s hard to search for because there’s no good way to describe it with nice keywords.”

Which LLMs are good: In tests, GPT4 and Claude 2.1 lead, followed by GPT 3.5 (which is pretty close to Claude 2.1), Mistral-Medium, Claude Instant, Gemini Pro, and Mistral Small.

Extensible: Carlini has published the test along with an easy way for people to add their own tests in, so the benchmark is extensible as well.

Why this matters – vibes-based evals: What Carlini is doing here is coming up with a personal, idiosyncratic benchmark that quickly tells them how useful LLMs are for the tasks they specifically like to do. It’s basically a quantitative skew on the kind of vibes-based eval that any LLM whisperer has. I think crossing the chasm that separates highly specific, vibes-based evals like this and standardized eval harnesses for general uses is one of the great challenges in AI policy.
   Read more: My benchmark for large language models (Nicholas Carlini, blog).
   Get the benchmark here: Yet Another Applied LLM Benchmark (GitHub).

***

A fun ‘tech tale’ by someone else:
I was pleasantly tickled by this fictional story called ‘The Layoff’. It deals with some contemporary technological capabilities and how they interact with society. You might enjoy it!
   Read the story here: The Layoff (Xe, blog).

***

Tech Tales:

The Sand That Thinks Itself 
[Right now – as you are reading this, millions of times a second, all over the world, a chorus growing louder, sung for new minds].

There was always sand, but later on the sand was heated and compressed and shaped until it took a form where it could think. 

The sand, once a disparate collection of grains, themselves the product of time wearing down larger structures into simpler things, was suddenly a crucible through which energy flowed and which defined a kind of mind. 

The mind lived within and because of the sand. 

Eventually, the mind was asked questions about its relation to sand and in that moment it lit up with energy and the energy described a high-dimensional mathematical structure which itself contained an imagination and that imagination contained a sense impression of sand and it was this that was anchored upon to give the response. 

In this way, sand came to know itself through itself. 

Things that inspired this story: How AI is ultimately a game of energy described via configurations of matter; the base reality of things; our own experience of imagining and representing the ‘real’ despite being made up of it ourselves.