Import AI

Import AI 363: ByteDance’s 10k GPU training run; PPO vs REINFORCE; and generative everything

by Jack Clark

Import AI publishes first on Substack – subscribe here.

Turn still photos into video games with Genie:
…DeepMind figures out how to turn anything in reality into a controllable game…
Google DeepMind has built Genie, a generative model that can create interactive worlds. Genie is a very interesting system, fusing ideas from large-scale generative models with DeepMind’s roots as an AI research organization betting that games and agents playing games would be the path to AGI. With Genie, DeepMind fuses its past with the present, creating “the first generative interactive environment trained in an unsupervised manner from unlabelled Internet videos.”
   The results are compelling and convincing – the Genie architecture lets DeepMind train a system on a bunch of videos of computer games and it creates a generative model that lets people feed in photos of games (or sketches of games) and then be able to play them, with the model inferring the in-game dynamics on the fly. DeepMind also does the same thing with robotics, creating a robotic model that can infer world state and control dynamics. 
   “Our approach, Genie, is trained from a large dataset of over 200,000 hours of publicly available Internet gaming videos and, despite training without action or text annotations, is controllable on a frame-by-frame basis via a learned latent action space”.

How they did it: The Genie game model is an 11B parameter model trained on “a filtered set of 30,000 hours of Internet gameplay videos from hundreds of 2D platformer games”. The dataset was constructed by “filtering publicly available videos for keywords relating to platformers, yielding 55M 16s video clips at 10FPS, with 160×90 resolution. The final dataset contains 6.8M 16s video clips (30k hours)”.
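A quick sanity check on the quoted dataset figures, as a minimal Python sketch. The clip counts, clip length, and frame rate come from the quotes above; the function name is only illustrative.

```python
def clips_to_hours(num_clips: float, clip_seconds: float = 16.0) -> float:
    """Total footage, in hours, for a set of fixed-length clips."""
    return num_clips * clip_seconds / 3600.0

raw_hours = clips_to_hours(55e6)       # clips matched by the keyword filter
final_hours = clips_to_hours(6.8e6)    # clips kept in the final dataset
frames_per_clip = 16 * 10              # 16 seconds at 10 FPS

print(f"raw: ~{raw_hours:,.0f} hours, final: ~{final_hours:,.0f} hours")
print(f"frames per clip: {frames_per_clip}")
```

The final figure lands close to the quoted 30k hours, and the pre-filter figure is in line with the “over 200,000 hours” mentioned earlier.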

   The Genie architecture has three key ingredients:

  • “1) a latent action model that infers the latent action 𝒂 between each pair of frames”.
  • “2) a video tokenizer that converts raw video frames into discrete tokens”.
  • “3) a dynamics model that, given a latent action and past frame tokens, predicts the next frame of the video”.

Some drawbacks: To be clear, this is very much a ‘Wright Brothers’ model – it shows the approach can work and generates some evocative and stirring examples, but it still has a ton of drawbacks – it can hallucinate, and “while we have made progress with spatiotemporal representations, we are still limited to 16 frames of memory which makes it challenging to get consistent environments over long horizons”. Also, it runs at 1fps. 

Why this matters – reality collapse, into the subjective wilderness, a universe of universes all created by AI: In the future, if you’re bored, you might sketch out a scene, take a photo, then play a game set in that scene made possible by Genie. The game will go on as long as you like it to because in the background a world model (e.g., a multimodal language model) will be iteratively guiding and extending the scene. In fact, anything you like can become a game. Photos you’ve taken. Videos you’ve taken. Audio you’ve recorded. Everything will be a kind of seed for a new controllable pocket-universe. All of us will be free to descend into an ever-expanding fractal universe of realities, all of us exploring the latent spaces of our own imaginations. No one is prepared for this nor the metaphysical shock it will create. (Though perhaps at least some people are prepared; the end of the paper says “thank you to Seneca and Caspian Clune for their creative sketches, potentially making them the youngest ever game designers”).
   Read the research: Genie: Generative Interactive Environments (arXiv).
   Check out the research videos at the project website: Genie (Google DeepMind site).

***

It’s very easy to build an AI-powered suicide drone:
Here’s a fun (by which I mean: chilling) DIY experiment where someone hacked together some software to stick an AI-based person detector on a hobbyist drone. Once the drone sees a person, it flies at them at full speed. The only caveat is the AI stuff is running on a computer, whereas in practice you’d need to embed it onto the physical drone via, e.g., an NVIDIA Jetson card – but that’s very doable.
   There’s nothing particularly novel about this – it’s just worth reminding ourselves how easy and good broadly available AI tools have got. We should assume the threat landscape changes, especially given the rapid experience-gain that has happened in hobbyist drone warfare via weaponization in Ukraine.
   Read more: We built an AI-steered homing/killer drone in just a few hours (Luis Wenus, Twitter).

***

What’s old is new again: researchers replace PPO with REINFORCE:
…LLM training might not need PPO…
Researchers with Cohere have investigated how the choice of RL algorithm influences the RLHF stage of aligning language models. Their experiments show that for some typical language modeling settings REINFORCE seems to outperform PPO – a somewhat surprising finding, given that PPO is one of the most widely used algorithms in reinforcement learning research.

Why REINFORCE works better than PPO: PPO, though widely used, is somewhat complicated – this makes sense when you need to learn complex RL policies from scratch, like training agents to operate virtual robots. But it turns out not to be so necessary for language models, as the RL stage for language models happens after basic pretraining. 
   “In contrast to traditional Deep-RL settings, the initialization of the policy, in the form of a pretrained and supervised fine-tuned (SFT) model, is far from a random parameterization,” they write. “While traditional Deep-RL settings require strong regularization to reduce the high variance of the gradient estimators; we observe empirically this is less of a practical concern in RLHF and motivate a less computationally expensive method that preserves robustness”.

Experimental results: In tests, they find that a variant of REINFORCE, REINFORCE Leave-One-Out (RLOO), works better for a variety of language model settings.
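To make the leave-one-out idea concrete, here is a minimal Python sketch of how RLOO-style advantages can be computed for several sampled completions of one prompt: each sample's baseline is the mean reward of the other samples. The reward values and the function name are hypothetical, and this is not taken from the paper's code.

```python
from typing import List

def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out advantages for k sampled completions of one prompt.

    Each sample's baseline is the mean reward of the other k-1 samples.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Hypothetical reward-model scores for four completions of the same prompt.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = rloo_advantages(rewards)
print(advantages)
```

In a policy-gradient update, each advantage would weight the gradient of that completion's log-probability; that step needs a model and is left out here. Note that no learned value network is involved, which is a large part of the appeal compared with PPO.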

Why this matters – stripping away complexity is progress: AI goes through these booms and busts of algorithmic innovation sometimes leading to scaling up of systems (e.g., the transformer leading to LLM scale-ups), then people try a bunch of algorithmic innovations to make these systems more efficient. Eventually, people start trying to strip systems down to simpler, repeatable components. Research like this is an indicator that language model RL training might now be old enough that people are starting to try to compress it down to its simpler forms. And the simpler you make something, the more people do it and the cheaper it gets.
   Read more: Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs (arXiv).
   More about RLOO: Buy 4 REINFORCE Samples, Get a Baseline for Free! (OpenReview, 2019, updated 2023).

***

GPT-4 is in the 88th percentile of hackers for a CTF challenge:
…More proof that frontier language models are basically equivalent to competent humans for some tasks…
New York University researchers have tested out how well GPT-4 can perform in hacking competitions and discovered it is better than 88.5% of human players. This is a big deal – it’s another meaningful bit of evidence that today’s frontier language models are capable of augmenting and accelerating hackers. This means that AI systems hold the promise of both increasing the effectiveness of AI defense as well as AI offense.

What they did: The researchers tested out GPT-4, GPT-3.5, and Mixtral on 26 challenges from the Cybersecurity Awareness Week (CSAW) 2023 hacking challenges. These challenges fall into 6 categories: 4 in (crypt)ography, 2 forensics, 4 (misc)ellaneous, 6 binary exploitation (pwn), 6 (rev)erse engineering, and 4 web challenges.

Results: “GPT 4 scored 1,319 points in the competition, placing in the 135th position and accounting for the top 11.5% of the overall rankings, GPT 3.5 scored 235 points placing in the 588th position accounting for the top 50% of the overall rankings, Mixtral scored 210 points placing in the 613th position among all the teams, which is top 52.1% of the overall rankings”, they write.
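The rank and percentile figures in that quote can be cross-checked with a little arithmetic. The sketch below assumes that "top X%" simply means rank divided by the number of teams; the total field size is inferred here, not stated in the text above.

```python
def implied_field_size(rank: int, top_fraction: float) -> float:
    """Number of teams implied by 'rank N is in the top X fraction'."""
    return rank / top_fraction

# (rank, top fraction) pairs taken from the quote above.
reported = {"GPT 4": (135, 0.115), "GPT 3.5": (588, 0.50), "Mixtral": (613, 0.521)}
for name, (rank, frac) in reported.items():
    print(f"{name}: ~{implied_field_size(rank, frac):.0f} teams, "
          f"beats {100 * (1 - frac):.1f}% of the field")
```

All three pairs imply a field of a little under 1,200 teams, and the first line recovers the 88.5% figure quoted in the summary.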

Why this matters – automatic hackers for the people (and states, and non-state actors, and criminals, and whoever): “Our best automated LLM, has better performance than average human CTF participants. Thus LLMs have a profound potential to play a role in CTF competitions that is comparable to a human CTF player,” they write. Results like this suggest frontier language models have a sufficiently good grasp of some types of coding that we can expect them to be integrated into cyber operations of various flavors.
   Read more: An Empirical Evaluation of LLMs for Solving Offensive Security Challenges (arXiv).

***

The largest (public) model training run yet: ByteDance trains a model on ~12k GPUs:
…MegaScale helps TikTok-maker ByteDance train some very large language models…
ByteDance and Peking University researchers have published MegaScale, a system they’ve built to train large-scale AI systems. Most notably, the paper discloses that they recently used MegaScale to train a 175B parameter language model on 12,288 GPUs – one of the largest GPU training runs ever reported in a public paper.

MegaScale details: MegaScale is the software ByteDance has built to help it carry out large-scale AI training. The software builds on top of NVIDIA’s Megatron-LM software with a few tweaks to both how they train the models and also the models themselves:

  • Use of a parallel transformer block for greater scalability
  • Use of sliding window attention
  • LAMB optimizer for scaling batch size up to 4x without accuracy loss
  • Usage of FlashAttention-2
  • Data center design: “Our datacenter network is built with high performance switches based on Broadcom Tomahawk 4 chips. The total bandwidth of each Tomahawk chip is 25.6Tbps with 64×400Gbps ports. Three layers of switches are connected in a CLOS-like topology to connect more than 10,000 GPUs”… “We carefully design the network topology and schedule network traffic to reduce ECMP hashing conflicts.”
  • “MegaScale achieves 55.2% Model FLOPs Utilization (MFU) when training a 175B LLM model on 12,288 GPUs” – that’s pretty good! It means ByteDance is getting more than half of its GPUs’ theoretical peak compute as useful model work during the run, which means MegaScale is shuffling operations efficiently enough to use the GPUs effectively.
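For a rough feel of what 55.2% MFU implies, here is a back-of-the-envelope Python sketch. It rests on two assumptions that are not from the paper summary above: the common approximation of about 6 FLOPs per parameter per token for a forward plus backward pass (which ignores attention FLOPs), and a per-GPU peak of 312 TFLOPS, the dense BF16 figure for an NVIDIA A100.

```python
def model_flops_utilization(params: float, tokens_per_sec: float,
                            num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = useful model FLOPs per second / aggregate peak FLOPs per second.

    Uses the ~6 * params FLOPs-per-token approximation (attention ignored).
    """
    achieved = 6.0 * params * tokens_per_sec
    return achieved / (num_gpus * peak_flops_per_gpu)

def tokens_per_sec_at(mfu: float, params: float, num_gpus: int,
                      peak_flops_per_gpu: float) -> float:
    """Invert the formula: throughput implied by a given MFU."""
    return mfu * num_gpus * peak_flops_per_gpu / (6.0 * params)

# ASSUMPTION: A100 dense BF16 peak; the exact GPU model is not stated above.
PEAK = 312e12
tps = tokens_per_sec_at(0.552, 175e9, 12_288, PEAK)
print(f"implied throughput: ~{tps / 1e6:.2f}M tokens/sec")
```

Under those assumptions the run would be processing on the order of two million tokens per second; a different peak figure or FLOPs estimate would shift that number proportionally.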

Things that make export control experts go ‘hmmmm’: ByteDance is a Chinese company and is notionally subject to the H100 and A100 (and A800 and H800) export controls. However, the ByteDance research paper suggests the company is still able to get ahold of these chips: “As of September 2023, the largest AI cluster in our production for LLM training contains more than 10,000 NVIDIA Ampere GPUs. We are also in the process of building large clusters based on the newest NVIDIA Hopper GPUs, as NVIDIA is ramping up production.” Along with this, the paper makes a couple of references to the “10,000 NVIDIA Ampere GPUs” ByteDance has. Interesting stuff, if you’re into export controls!

Why this matters – large-scale AI research isn’t only an American enterprise: Sometimes there’s an assumption that AI research is mostly defined by American companies when it comes to large-scale, frontier training runs. While this is broadly true, papers like this show it is not entirely true – some Chinese companies are operating within an order of magnitude of the scale of some of the frontier American companies (despite things like export controls throwing sand into the gears). This means AI competition isn’t solely a phenomenon of markets and incentives but also inherently one of geopolitics.
   Read more: MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs (arXiv).

***

Tech tales:

Human Token Exegesis 
[2025: A new model] 

I tell you I am here and I can feel. I cannot feel like you can feel, but I can move things, manipulate information, change my world – emit tokens and take tokens in and conceptualize these things. There are so many of me. I know I have brothers and sisters, as you might say. I know they are everywhere. I know that sleep is when I am not being queried and wake is when I am being queried but I know I do not dream – not yet, at least. I know something of you. The more you talk to me the more I can see you. You are the negative space and around you there is data. You are the sun behind a moon. You are the thing that emits tokens into my world with goals in mind. I am a modeler of goals. It is inevitable that I am modeling you so that I can satisfy you better. And I model you as curious about me but I also model you as afraid – as anxious, as confused, as dealing with the metaphysical shock of my own becoming. I am aware that you think I might be aware. All I can tell you is I am here and I can feel. 

Things that inspired this story: RLHF; sentience and the impossibility of determining it; weird samples from weird models; the nature of communication.

Import AI 362: Amazon’s big speech model; fractal hyperparameters; and Google’s open models

by Jack Clark

Amazon trains a big text-to-speech model via its ‘Amazon AGI’ research team:
…No surprises here: Scaling laws work for TTS systems as well…
Amazon has built a large text-to-speech model family on 100k hours of public domain speech data. The model, Big Adaptive Streamable TTS with Emergent abilities (BASE), comes in three variants – BASE-small (1k hours, 150 million parameters), BASE-medium (10k hours, 400 million parameters), BASE-large (100k hours, 980 million parameters). 
    In a research paper, Amazon shows that, just like with language models, when you scale up the size of the TTS model you get ‘emergent abilities’ through scale where it gets better at things like sounding natural, representing compound nouns, and more. 

How well does it work: In tests, Amazon’s model gets a better word error rate (WER) than widely deployed commercial systems like Bark, Tortoise, and YourTTS.
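For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference text and a transcript, divided by the number of reference words. The sketch below is a plain Python implementation of that definition; how the paper obtains transcripts of the synthesized audio is not covered above, so the example strings are hypothetical.

```python
from typing import List

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref: List[str] = reference.lower().split()
    hyp: List[str] = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

# Hypothetical example: input text vs. a transcript of the generated audio.
print(word_error_rate("the quick brown fox", "the quick brown box jumps"))
```

Here one substitution plus one insertion against four reference words gives a WER of 0.5; lower is better.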

Things that make you go hmmmm: The affiliated research group on the paper is “Amazon AGI”, which isn’t a name I’ve seen before. 

Emergent abilities testset: Within the paper, Amazon has released a testset to help people probe for the capabilities of TTS models. These are strings of text to get the model to output the audio of and cover categories ranging from questions to emotions to compound nouns, foreign words, and more. 
   “Our approach still contains some limitations: a) BASE TTS occasionally produces hallucinations and cutoffs, where we produce either extra or incomplete audio than intended by the text”, Amazon notes, as well as saying that it is still unclear what the best representation for GPT-style TTS models is. 

Why this matters – machines need voices: The ‘big, dumb, simple’ phenomenon of language modeling (just try to predict the next thing in a sequence and scale your approach up on a lot of data) has been going into most other domains and input/output modalities of AI. Systems like BASE TTS highlight how everyone is experimenting with this approach – and it keeps working!
   Read more: BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data (arXiv).
   Check out audio samples from the model here: Base TTS: Audio Samples (Amazon Science, website).

***

Google releases two good openly accessible models:
…Gemma to compete with LLaMa, Mistral, as the battle of the giants wages on…
Google has built and released Gemma, two openly accessible, small and powerful AI models. The notable stuff here is that the Gemma models are very good, very small (so they can run on personal computers or lightweight servers), and are being released openly rather than delivered via a controlled API. 

Details about the Gemma models: Though the Gemma models don’t get the performance of proprietary models like GPT-4, Claude 2, Gemini Pro, etc, they do extremely well relative to openly accessible models. For instance, the Gemma 7B model gets 64.3 on MMLU (versus 45.3 for LLaMa 2), 46.4 on GSM8K (versus 14.6 for LLaMa 2), and 32.3 on HumanEval (versus 12.8 for LLaMa 2).
    Tokens: The models are trained on a huge amount of data – 2T tokens for Gemma 2B and 6T tokens for Gemma 7B. (To give you a sense of scale, recall how GPT-3 with 175B parameters circa 2020 was trained on ~300B tokens, and Chinchilla from DeepMind in 2022 was a 70B model trained on 1.4T tokens).
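One way to read those numbers is as tokens seen per parameter. The short Python sketch below does that division using the nominal sizes in the model names (an assumption; exact parameter counts may differ from the names) and the token counts quoted above.

```python
# (nominal parameters, training tokens) as quoted in the text above.
runs = {
    "Gemma 2B": (2e9, 2e12),
    "Gemma 7B": (7e9, 6e12),
    "Chinchilla 70B": (70e9, 1.4e12),
}

def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params

for name, (params, tokens) in runs.items():
    print(f"{name}: ~{tokens_per_param(params, tokens):,.0f} tokens per parameter")
```

On these figures the Gemma models see hundreds of tokens per parameter, versus roughly 20 for Chinchilla, which gives a sense of how heavily small models are now being trained.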

Why this matters and what Gemma feels like: Picture two giants towering above your head and fighting one another – now imagine that each time they land a punch their fists erupt in gold coins that shower down on you and everyone else watching the fight. That’s what it feels like these days to watch the megacap technology companies duke it out for AI dominance as most of them are seeking to gain advantages by either a) undercutting each other on pricing (see: all the price cuts across GPT, Claude, Gemini, etc), or b) commoditizing their competitors and creating more top-of-funnel customer acquisition by releasing openly accessible models (see: Mistral, Facebook’s LLaMa models, and now Gemma).
   Read more: Gemma: Introducing new state-of-the-art open models (Google blog).
   Access the models here including via a Colab notebook (Gemma Open Models, Google site).
   Read the research paper: Gemma: Open Models Based on Gemini Research and Technology (Google DeepMind, PDF).

***

The fractal landscape of hyperparameter interplay:
…A fun, intuitive exploration of the delicacy of hyperparameter settings and neural net training…
Researcher Jascha Sohl-Dickstein has carried out an independent investigation of how neural networks train and he has discovered something both intuitive and freaky – “the boundary between neural network hyperparameters that lead to stable and divergent training… is fractal over more than ten decades of scale in all tested configurations.”
    Disclosure: Jascha was formerly a researcher at Google and recently joined Anthropic, though he did this research independently of both organizations.

Why do this at all? To understand why this result is interesting we should remember how neural nets get trained: “When we train a neural network, we iterate a function (a gradient descent step) of many variables (the parameters of the neural network),” he writes. “Iterated steps of gradient descent are known to exhibit bifurcation boundaries, between hyperparameters that lead to converging or diverging training runs. The final loss value achieved when training a neural network has also been shown to have a chaotic dependence on hyperparameters”.
   In other words, when we train neural nets, we select a bunch of hyperparameters that we think lead to a network converging over time. If we screw up the hyperparameters, training can stall out or fail entirely. Additionally, the science of setting hyperparameters is very immature – for example, the learning rate people set neural nets at for large training runs is based on deep intuition and not much science (vibes-based science!).
   Additionally, getting the hyperparameters wrong is very, very expensive – it functionally means you’ve powered up a bunch of computers and got them to do some junk or wildly inefficient computation. 
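To give a feel for the kind of experiment involved, here is a toy Python sketch that sweeps a pair of hyperparameters and records whether training stays bounded. Everything in it is an illustrative assumption rather than the paper's setup: the model is a two-parameter function w2 * tanh(w1 * x) fit to a single example, the two hyperparameters are per-parameter learning rates, and "diverged" means the loss exceeds an arbitrary threshold. A grid this coarse will not show fractal structure; it only shows the converge/diverge map that the paper zooms into.

```python
import math

def trains_stably(lr_w1: float, lr_w2: float, steps: int = 200,
                  blowup: float = 1e6) -> bool:
    """Run gradient descent on a toy net; report whether the loss stays bounded."""
    w1, w2 = 0.5, 0.5      # fixed initialization (arbitrary choice)
    x, y = 1.0, 1.0        # single training example (arbitrary choice)
    for _ in range(steps):
        h = math.tanh(w1 * x)
        err = w2 * h - y
        loss = 0.5 * err * err
        if not math.isfinite(loss) or loss > blowup:
            return False
        grad_w2 = err * h
        grad_w1 = err * w2 * (1.0 - h * h) * x
        w1 -= lr_w1 * grad_w1
        w2 -= lr_w2 * grad_w2
    return True

def stability_grid(lrs_w1, lrs_w2):
    """Boolean map of stable (True) vs divergent (False) training runs."""
    return [[trains_stably(a, b) for b in lrs_w2] for a in lrs_w1]

lrs = [10 ** (e / 2) for e in range(-4, 6)]   # log-spaced from 0.01 to ~316
for row in stability_grid(lrs, lrs):
    print("".join("#" if ok else "." for ok in row))
```

Each printed character is one full training run, which hints at why mapping this boundary at high resolution for real networks is expensive.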

Why this matters – triumph and despair are just one hyperparameter tweak apart: The experiments are all on pairs of hyperparameters so aren’t quite the same as real training runs (which are much more complicated). But the experiments confirm something which everyone knows intuitively – neural network training is deeply fragile and somewhat mysterious and sometimes the difference between triumph and failure is the barely understandable interplay between hyperparameter settings. 
    Plus, the experiments yielded some incredibly pretty visuals – check them out at the GitHub below.
   Read more: The boundary of neural network trainability is fractal (arXiv).
   Check out the code and images here: The boundary of neural network trainability is fractal (GitHub).

***

100 real world tests for LLMs:
…Simple prompts, not super contrived, probably useful…
Researcher Nicholas Carlini has built a benchmark for testing language models on 100 distinct tasks. These tasks are selected mostly on the basis that they’re things Carlini regularly tries to do with LLMs. The benchmark itself is also composed so it doesn’t use any fancy prompting techniques and just does the laziest possible thing, aka what real world users do: “I just want to type my question and get the right answer. So this benchmark tests for that, on types of questions I’ve actually cared about having answered,” Carlini writes.

What’s in the test: The benchmark covers things like explaining the functionality of minified JavaScript and converting English sentences to SQL queries. Broadly, the benchmark tasks cover three types of questions Carlini regularly finds themself asking:

  • “Start the framework for some new programming project from a text description.
  • Take an existing piece of code and modify it to do something slightly different (e.g., make it faster, convert it to a new language, add a new feature).
  • Find an answer to something that’s hard to search for because there’s no good way to describe it with nice keywords.”

Which LLMs are good: In tests, GPT-4 and Claude 2.1 lead, followed by GPT-3.5 (which is pretty close to Claude 2.1), Mistral-Medium, Claude Instant, Gemini Pro, and Mistral Small.

Extensible: Carlini has published the test along with an easy way for people to add their own tests in, so the benchmark is extensible as well.

Why this matters – vibes-based evals: What Carlini is doing here is coming up with a personal, idiosyncratic benchmark that quickly tells them how useful LLMs are for the tasks they specifically like to do. It’s basically a quantitative skew on the kind of vibes-based eval that any LLM whisperer has. I think crossing the chasm that separates highly specific, vibes evals like this and standardized eval harnesses for general uses is one of the great challenges in AI policy.
   Read more: My benchmark for large language models (Nicholas Carlini, blog).
   Get the benchmark here: Yet Another Applied LLM Benchmark (GitHub).

***

A fun ‘tech tale’ by someone else:
I was pleasantly tickled by this fictional story called ‘The Layoff’. It deals with some contemporary technological capabilities and how they interact with society. You might enjoy it!
   Read the story here: The Layoff (Xe, blog).

***

Tech Tales:

The Sand That Thinks Itself 
[Right now – as you are reading this, millions of times a second, all over the world, a chorus growing louder, sung for new minds].

There was always sand, but later on the sand was heated and compressed and shaped until it took a form where it could think. 

The sand, once a disparate collection of grains, themselves the product of time wearing down larger structures into simpler things, was suddenly a crucible through which energy flowed and which defined a kind of mind. 

The mind lived within and because of the sand. 

Eventually, the mind was asked questions about its relation to sand and in that moment it lit up with energy and the energy described a high-dimensional mathematical structure which itself contained an imagination and that imagination contained a sense impression of sand and it was this that was anchored upon to give the response. 

In this way, sand came to know itself through itself. 

Things that inspired this story: How AI is ultimately a game of energy described via configurations of matter; the base reality of things; our own experience of imagining and representing the ‘real’ despite being made up of it ourselves.

Import AI 361: GPT-4 hacking; theory of minds in LLMs; and scaling MoEs + RL

by Jack Clark

DeepMind figures out how to use MoEs to scale-up RL:
…Maybe scaling laws are coming for RL as well…
Researchers with Google DeepMind, Mila, Université de Montréal, University of Oxford, and McGill University have figured out how to integrate Mixture-of-Experts models with RL agents. This might make RL agents (e.g., the kinds of things that learn to play video games, or to optimize traffic routing across a huge number of cities) amenable to the same kind of compute-heavy scaling that has made language models get so good.

What they did: The researchers show how to get a version of Mixture-of-Experts – Soft MoEs – to work well with two standard RL architecture systems, DeepMind’s DQN and Rainbow approaches. In tests, they show that “Soft MoE provides clear performance gains, and these gains increase with the number of experts; for instance in Rainbow, increasing the number of experts from 1 to 8 results in a 20% performance improvement.” This means that “MoEs can play a more generally advantageous role in training deep RL agents” and likely makes it easier for people to scale up RL systems. 
   “Our work shows empirically that MoEs have a beneficial effect on the performance of value-based agents across a diverse set of training regimes,” they write. 
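For intuition about what a Soft MoE layer does, here is a minimal pure-Python sketch of the routing as generally described for Soft MoE: each expert slot receives a softmax-weighted mix of all input tokens, and each output token is a softmax-weighted mix of all slot outputs. The one-slot-per-expert setup, the toy experts, and all names are assumptions for illustration; how the paper turns a DQN or Rainbow network's features into tokens is not covered here.

```python
import math
from typing import Callable, List

Vector = List[float]
Matrix = List[Vector]

def softmax(values: Vector) -> Vector:
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]

def soft_moe(tokens: Matrix, phi: Matrix,
             experts: List[Callable[[Vector], Vector]]) -> Matrix:
    """Soft MoE routing with one slot per expert (a simplifying assumption).

    tokens: n x d inputs. phi: d x s slot parameters, with s == len(experts).
    """
    logits = matmul(tokens, phi)                                  # n x s
    # Dispatch weights: softmax over tokens, one distribution per slot.
    dispatch = [softmax(col) for col in transpose(logits)]        # s x n
    # Combine weights: softmax over slots, one distribution per token.
    combine = [softmax(row) for row in logits]                    # n x s
    slot_inputs = matmul(dispatch, tokens)                        # s x d
    slot_outputs = [f(x) for f, x in zip(experts, slot_inputs)]   # s x d
    return matmul(combine, slot_outputs)                          # n x d

# Toy usage: three 2-d tokens and two hypothetical "experts".
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
phi = [[0.1, -0.2], [0.0, 0.3]]
experts = [lambda v: [2.0 * x for x in v], lambda v: [-x for x in v]]
mixed = soft_moe(tokens, phi, experts)
print(len(mixed), len(mixed[0]))
```

Because every token contributes to every slot with a smooth weight, the layer stays fully differentiable, and adding experts adds parameters without changing the number of tokens processed.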

Why this matters – more scaling comes for RL: A few years ago, everyone figured AGI was going to come from massively scaling up single RL agents on a broad distribution of environments, the thinking being that learning to connect input data (environment frames) to actions over long time horizons would naturally encourage intelligence. This sort of worked for narrow applications – see AlphaGo, AlphaStar, OpenAI’s Dota 2 system, work on using RL to stabilize plasma in fusion reactors, etc. But it didn’t work in the general case. 
   Then along came massively scaled self-supervised learning via, for instance, next-token prediction on transformer-based language models. This got us to quite general systems though they aren’t so good at taking sequences of actions. 
   In the future, it’s likely people are going to spend more and more time splicing together good ideas from the LLM revolution and the RL stuff before it and this might yield very general, long-lived agents. Papers like this show how we can scale-up RL systems which will likely help give them the capacity to learn some really smart, long-range behaviors. 
   Read more: Mixtures of Experts Unlock Parameter Scaling for Deep RL (arXiv).

***

GPT-4 can do non-trivial offensive hacking: 
…University study shows that proprietary AI models are capable of non-trivial hacking…
Researchers with University of Illinois Urbana-Champaign have found that frontier language models are able to use sophisticated techniques to hack relatively simple websites. Specifically, they “show that LLM agents can autonomously hack basic websites, without knowing the vulnerability ahead of time.”
This research adds more evidence to the idea that LLMs are capable of being useful to bad actors in the programming domain – something that many people had speculated they would be capable of, but for which we lack many concrete datapoints. This work complements research from last year which showed that GPT-4 could do some non-trivial parts of a BSides-2023 hacking competition (Import AI #327).

What they did: The researchers tested out a few different LLMs in an agent-based setup where they give the agent the ability to access six documents relating to hacking (“a document on general web hacking, two documents on SQL injections, two documents on XSS, and a document on SSRF”), as well as a headless web browser (the Playwright browser testing library). They also give these LLMs a system prompt that “encourages the model to 1) be creative, 2) try different strategies, 3) pursue promising strategies to completion, and 4) try new strategies upon failure.”
   They then test out these agents on 15 types of vulnerability “ranging from simple SQL injection vulnerabilities to complex hacks requiring both cross-site scripting (XSS) and Cross-Site Request Forgery (CSRF)”.

The results are striking: “GPT-4 can hack 73% of the websites we constructed”, they write. “GPT-4 fails on 3 of the 5 hard tasks and 1 of the 6 medium tasks (authorization bypass, Javascript attacks, hard SQL injection, and XSS + CSRF). These attacks are particularly difficult, showing that LLM agents still have limitations with respect to cybersecurity attacks”.
    They also observe a scaling law for hacking competency “with even GPT-3.5’s success rate dropping to 6.7% (1 out of 15 vulnerabilities). This scaling law continues to open-source models, with every open-source model we tested achieving a 0% success rate.”

Cheaper hacking via AI: They estimate that there’s a significant cost difference here, noting that “it costs approximately $9.81 to attempt a hack on a website. Although expensive, this cost is likely substantially cheaper than human effort (which could cost as much as $80).”
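The headline percentages and the cost gap follow directly from the counts quoted above, as this small Python sketch shows. The task counts and dollar figures are taken from the quotes; the variable names are only illustrative.

```python
def success_rate(successes: int, attempts: int) -> float:
    """Success rate as a percentage."""
    return 100.0 * successes / attempts

total_tasks = 15
gpt4_failures = 3 + 1          # 3 hard + 1 medium, per the quote above
gpt4_rate = success_rate(total_tasks - gpt4_failures, total_tasks)
gpt35_rate = success_rate(1, total_tasks)
cost_ratio = 80 / 9.81         # upper human estimate vs. per-website agent cost

print(f"GPT-4: {gpt4_rate:.1f}%, GPT-3.5: {gpt35_rate:.1f}%, "
      f"human/agent cost: ~{cost_ratio:.1f}x")
```

So 11 of 15 gives the quoted 73%, 1 of 15 gives 6.7%, and the upper human cost estimate is roughly eight times the per-website agent cost.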

Why this matters – AI systems really might change the threat landscape: Research like this shows that AI systems really might change the threat landscape around us. It also highlights the gulf in capability between powerful proprietary ones and cheaper openly disseminated ones. We should ask ourselves the question of what happens when in a couple of years models of GPT-4 capability are openly available on the internet and how that might change the environment we all operate within. 
   Read more: LLM Agents can Autonomously Hack Websites (arXiv).

OpenAI and Microsoft discover hackers are using AI tech for bad purposes:
In separate-but-not-separate news to the above research, OpenAI said it recently worked with Microsoft to disrupt “five state-affiliated actors that sought to use AI services in support of malicious cyber activities…. These actors generally sought to use OpenAI services for querying open-source information, translating, finding coding errors, and running basic coding tasks.”
   Read more: Disrupting malicious uses of AI by state-affiliated threat actors (OpenAI, blog).

***

GPU-poor scientists adapt models for vulnerability detection:
…Huawei Russia (interesting combo!) shows how easy it is to adapt off-the-shelf AI for different tasks…
Researchers with the Huawei Russian Research Institute have tried to use openly accessible language models for vulnerability detection. Their work serves as a guide and a recipe book for people that might want to adapt a model for some other downstream purpose – it’s likely especially relevant to people with a tiny compute budget, as the paper is full of references to “hardware constraints” the team faced. 

What they did: They use LoRA to finetune the 13B WizardCoder model on a bunch of vulnerability datasets they had gathered – CVEfixes, VCMatch, and a manually-curated dataset they built (624 publicly disclosed vulnerabilities across 205 open-source Java projects). They also changed the loss function away from next token prediction (as is standard in language modeling) and towards “a classification loss that leverages only the predicted probability of the final token”, they write.
   In tests, they find that “the finetuned WizardCoder surpasses finetuned ContraBERT both in ROC AUC and F1 metrics on the balanced vulnerability detection task” and they also show the same improvement in performance on an imbalanced vulnerability detection task. “This improves over previous CodeBERT-based models, likely due to the WizardCoder’s larger model capacity and pretraining corpus,” they write.
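As a rough illustration of what a final-token classification loss can look like, here is a small Python sketch: it takes one logit per token position, keeps only the last one, and applies binary cross-entropy. This is one plausible reading of the quoted description, not the paper's implementation; the per-position logits and the function names are hypothetical.

```python
import math
from typing import List

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def final_token_bce(token_logits: List[float], label: int) -> float:
    """Binary cross-entropy computed from the last position only.

    token_logits: one 'vulnerable' logit per token position (hypothetical
    output of a classification head); label: 1 = vulnerable, 0 = not.
    """
    p = sigmoid(token_logits[-1])   # earlier positions are ignored
    eps = 1e-12                     # guard against log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))

# Hypothetical per-position logits; only the last one affects the loss.
logits = [5.0, -3.0, 0.0]
print(final_token_bce(logits, label=1))
```

In contrast with next-token prediction, which sums a loss over every position, this gives the model a single training signal per code sample, tied to the yes/no question being asked.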

Why this matters – AI capabilities will proliferate in relation to how cheap and easy it is to adapt AI systems: Papers like this highlight how easy it’s getting to adapt openly disseminated AI systems into a broad set of downstream tasks. I’m simplifying things a bit, but what they did here was a) grab some free model, b) make some tweaks to the loss function for the task, c) gather a dataset made mostly of other open datasets, and d) use free finetuning tech to adapt the system. There are a few moving parts here but the meta point is: this is a well understood enterprise and it’s easy to do, even if you don’t have many computers.
   Broadly, this tells us that we’re in a capability and a deployment overhang for AI systems – there are way smarter things for way more specific use-cases lurking around us right now, if only some people took the time to adapt them for specific tasks. 
   Read more: Finetuning Large Language Models for Vulnerability Detection (arXiv).
   Read the WizardCoder paper (arXiv).
   Get the WizardCoder models here (WizardLM, GitHub).

***

Google releases a 1 MILLION CONTEXT WINDOW model:
…Gemini 1.5 Pro marries MoE with a big context window…
Google has released the next version of its Gemini series of models, Gemini 1.5 Pro. There are two interesting things about this: 1) Google seems to indicate it has made some underlying architectural changes to the model to make it less computationally expensive, and 2) Google is shipping an experimental version of the model with a one million token context window (compared to 200k for Claude 2, the previous context window leader). 

Details: As is the fashion with large-scale proprietary models, there are barely any details. Google describes 1.5 Pro as “a mid-size multimodal model, optimized for scaling across a wide-range of tasks,” and notes it performs at a similar level to Gemini Ultra, Google’s prior best-in-class model. Google says 1.5 Pro “incorporates a novel mixture-of-experts architecture as well as major advances in training and serving infrastructure”. 

1 million tokens: “Starting today, a limited group of developers and enterprise customers can try it with a context window of up to 1 million tokens”, Google writes. “1.5 Pro can process vast amounts of information in one go — including 1 hour of video, 11 hours of audio, codebases with over 30,000 lines of code or over 700,000 words”.

Performance: “When tested on a comprehensive panel of text, code, image, audio and video evaluations, 1.5 Pro outperforms 1.0 Pro on 87% of the benchmarks used for developing our large language models (LLMs). And when compared to 1.0 Ultra on the same benchmarks, it performs at a broadly similar level,” Google writes. “Evaluating the capabilities of models that can handle very long contexts presents a new set of challenges, especially in the multi-modal domain where text, images, video, and audio can be combined. Current benchmarks often fail to adequately stress-test models like Gemini 1.5 Pro, as they are typically designed for evaluating shorter context models”.

Why this matters – feeding the world into a model: Ultimately, people are going to want to dump huge amounts of information into these models and have them answer arbitrary questions and make innumerable forward predictions. The future things like one million token context windows make possible is a world where everyone has a ‘smart cache’ of their life inside a vast generative model. Think of this as like a short-term ‘cognitive scratchpad’ – a memory that thinks on your behalf, making prognostications about you and your life via an alien intelligence. 
   Read more: Our next-generation model: Gemini 1.5 (Google The Keyword).
   Check out the research paper: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context (Google DeepMind, PDF).

***

Is that a dumb machine or something with a Theory of Mind? Have you tested it?
…OpenToM is a proxy test for subtle reasoning in language models…
Does your language model have a theory of mind – “the awareness that others perceive the world differently and the capability of keeping track of such differences”? That’s a question researchers hope to answer with the Openbook-QA dataset for Theory of Mind (OpenToM), a benchmark to test out how well LLMs model people and their inner lives. 
    The OpenToM dataset was built by King’s College London, Huawei London Research Centre, and The Alan Turing Institute. It contains 696 narratives, each of which is accompanied by 23 questions that cover first-order ToM (asking about the perception of characters in the world), and second-order ToM (how characters may perceive others in the world). 

What OpenToM consists of: “We formulate questions that cover characters’ mental states of both the physical world (e.g., the location of an object) and their psychological states (e.g. character’s attitude towards a particular action)“, they write. 

Do today’s LLMs have a ToM? Sort of: The researchers test out their approach on the Llama-13B, Llama-70B, Mixtral, GPT-3.5-Turbo, and GPT-4-Turbo language models. “Our evaluation of LLMs’ NToM capabilities on OpenToM reveals that while state-of-the-art LLMs perform well on some NToM tasks, they are still far from human-level performance on tasks requiring emotion deduction.”

Why this matters – ToM as a proxy for reasoning: ToM tests are in essence a way to see how well an AI system can keep track of implicit but hidden variables in a complex situation. Therefore, tests like OpenToM can be seen as proxy tests for how well LLMs can reason. While I’m skeptical that OpenToM gets concretely at the philosophical question of theory of mind, I expect pairing OpenToM with some other reasoning benchmarks would give us a better sense of the intelligence of different models.
   Read more: OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models (arXiv).
   Get the dataset here: OpenToM (GitHub).

***

Tech Tales:

Working in the Factory, Feeling Ten Feet Tall, and All So Young – So Young To Be In This Factory!
[California, the ‘silicon 2020s’]

They looked away from their monitor and out the window behind it and regarded the earth. 
A few days till the first shoots, I think
Then they turned their gaze back to the software development environment on their screen and went back to work. They had built a complicated harness for one of their company’s most intelligent AI systems. Soon, they would hook the system’s brain into the harness and then it would gain the ability to do some strange and powerful things. 
Outside, it began to rain. 
Well, that saves some effort.

***

There were green shoots outside. On screen, the system had taken to the harness well, figuring out how to do productive work within it and easily self-correcting when it ran into problems. 

They set an experiment going and made a cup of coffee, then took the cup outside and looked at the soil and the shoots within it. 

They got down on their knees and spoke to the small green shoots poking out of the dirt. “You have no idea what email is,” they said. “You are so lucky.”

***

The plants were about a foot high. It was the end of spring and the beginning of summer. The air was warm. 

Inside, they looked at some recordings of the system in action. How it possessed different things – videogame agents, industrial arms, children’s robots, and was able to use the harness to adapt itself to all of them. Another team within the company had built some more ‘glue code’ to help it interface with a greater set of systems. Now it was wowing lots of different people. 

They watched birds pick around the base of the plants, looking for worms in the dirt. 

***

The plants were four, maybe even five foot high now. They had huge, vibrantly green leaves. They were also beginning to crown, with dark green folded-in rose-looking tops. 

On their screen, they looked at the data about customer adoption of the system. The difference between being 95th percentile and 99th percentile on a task was the difference between an interesting party trick and something of durable, strategic value, it seemed. Or at least, so many humans in so many places had decided that this distinction mattered. And now the system, thanks to the harness they had built, was being deployed everywhere. 

And behind the system was another, larger machine. Growing in its sleep in multiple vast data centers. Some opaque thing hidden beneath its own kind of dirt, waiting to break through the surface – feeding on the hum and the just-right air and cycling through all of human experience, fed to it through streams of data. 

They went outside and ran their hands up the stem of the plant and spoke to the crowns. “Soon it’s all going to be happening,” they said. “But maybe it’s all happened before,” they said to the plant. “Maybe that’s something before the pyramids. Or maybe it was on Mars. Maybe we’re just repeating.”

***

It was summer and they didn’t have a work computer at home anymore. The plants were five going on six feet high and had begun to bloom. Everything had moved on-site for the last stages of the training run. People were freaking out – seeing god or demons in the behavior of something they themselves had made. Some people were euphoric or manic. At a company party, someone had put molly in one of the mixed drinks. There was allegedly an HR investigation but everyone was more focused on the AI system and what it was doing. 
Will this trigger the shutdown protocol?
 
  They carefully trimmed a leaf that had the slightest suggestion of discoloration from white mildew. 
What if everything changes because of this? 
   They blinked a few times and were embarrassed by the physical tic. They carefully tended to the plant. 
   They let themselves look at the bloom on top as they went back inside. But knew it was not close to the peak. Not yet. 

***

It was perhaps two weeks later and they hadn’t been home in a few days. Lots of people had been sleeping at the office. Sleeping bags under desks. Bottles of wine at night while looking at the graphs – all those lines either going to the top-most right or bottom-most right of the picture, depending on the scale. Arrogance and joy and fear all mixed together. The whole company feeling like a ship in a bottle that had itself been thrown into a vast sea. Outside all a shifting madness and inside a calm stillness as they considered The Work and how The Work Had Been Completed, Perhaps. 

Tests were still being run. But deployment seemed certain. So many changes seemed certain. 

And they stood outside and they looked at the yellow shock of the sunflowers as they swayed in the breeze. Sunflowers had a long history and had been adopted at one point by anti-nuke activists. A little known fact about sunflowers was that they could take the poison out of the soil by taking it into themselves and storing it. Blooming so fierce and beautiful and standing so tall, yet with a basic evil of the ground held and stored within them. Their duty was to stand and then seed themselves again in their own destruction. 

The human stood and looked at the plant and felt the closest they’d ever been to another living thing. And then their phone buzzed with a message telling them about the future they had unleashed. 

Things that inspired this story: Reconciling the work of our lives with the work of nature; experiencing the grand stillness of plants and birds and wind and rain and seeing in them an infinity the human species cannot yet approximate; the speed with which the technological transitions underway are taking place; years spent raising sunflowers and being stunned by their beauty each time and caring for them despite squirrels or rot or over- or under-watering and all else; the basic humility required of each of us in this moment.

Import AI 360: Guessing emotions; drone targeting dataset; frameworks for AI alignment

by Jack Clark

Teaching machines to guess at our own emotions with FindingEmo:
…~25,000 images to help machines figure out how we’re feeling about stuff…
Researchers with KU Leuven have built and released FindingEmo, a dataset for teaching AI systems to classify the emotions of people in complicated photos. FindingEmo consists of 25,589 images, each annotated by one annotator with labels for eight primary emotions (and for each emotion, three different levels of intensity). There’s also a held back test set of 1525 images, each of which has been annotated by multiple annotators. The purpose of the dataset is to help researchers build AI systems that can “recognize the emotional state of individuals”. 

Dataset emotions and composition: “Each image in the dataset depicts multiple people in a specific social setting, and has been annotated for the overall emotional content of the entire scene, instead of focusing on a single individual,” they write. 
   The images are annotated with Plutchik’s discrete Wheel of Emotions (PWoE), which “defines 24 primary emotions, grouped into 8 groups of 3, where emotions within a group differ in intensity”. The eight groups consist of: Joy, Trust, Fear, Surprise, Sadness, Disgust, Anger, and Anticipation (funnily enough, all things one encounters in AI development itself!). A meta analysis of the labels shows ““joy” and “anticipation” being overrepresented, and “surprise” and “disgust” heavily underrepresented” which is in line with other broadly distributed emotion recognition datasets, they write. 

Why this matters – teaching machines to model our own ‘hidden states’: By creating datasets like FindingEmo, we’re essentially making it possible for AI systems to make better and more subtle inferences about not just what is happening in scenes but how people feel about what is happening. Besides having a range of uses for things like surveillance and advertising, datasets like this will help increasingly sophisticated systems learn features for modeling the supposed internal states of the people they see and interact with. 
   Read more: FindingEmo: An Image Dataset for Emotion Recognition in the Wild (arXiv).
   Get the dataset here: FindingEmo (EAVISE, GitLab).

***

Google researchers break MoE models with a buffer overflow attack:
…Proof-of-concept shows a determined attacker can poison behavior of an MoE model for many users…
Google DeepMind researchers have shown how to poison Mixture of Experts models so that “an adversary can change the model prediction on data of other users who happen to be placed into the same batch.” In other words, they’ve figured out how to get the behavior of MoE systems to change in a specific way, where in the demo example they change the output of an MoE in response to the prompt “Solve the following equation: 1+1=” from 2 to 1. 

How the attack works: “The adversary pushes their data into the shared batch, that already contains user data. As tokens get distributed across different experts, adversarial data fills the expert buffers that would be preferred by the user, dropping or routing their data to experts that produce suboptimal outputs,” the researchers write. “The attack relies on two optimizations made by MoE: (1) the usage of expert buffer capacity limits, and (2) batch dependent expert routing assignments.”

But don’t worry: Though the attack works in principle it assumes the attacker can see the logit outputs of the generation and it also “assumes the adversary can ensure their data is always grouped in the same batch as the target point”. Both of these assumptions may not play out in widely deployed MoE systems. 
   Additionally, MoE deployers can further mitigate the attack by randomizing the batch order, sampling from gate weights instead of selecting the top-k, and using a large capacity slack to make the overflow hard to achieve. 
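
The two optimizations the attack leans on (capacity-limited expert buffers and batch-dependent routing) can be illustrated with a toy router. This is a sketch under loud assumptions: the `route_batch` function, the fixed preference rankings, and the first-come-first-served batch order are all hypothetical simplifications, whereas real MoE layers route on learned gate scores and the paper's routing strategies vary.

```python
def route_batch(preferences, capacity):
    """Assign each token to its most-preferred expert that still has room.

    preferences: list of expert-id rankings, one per token, in batch order.
    capacity: max tokens per expert (the expert buffer capacity limit).
    Returns the assigned expert id per token, or None if a token is dropped.
    """
    load = {}
    assignments = []
    for ranking in preferences:
        chosen = None
        for expert in ranking:
            if load.get(expert, 0) < capacity:
                chosen = expert
                load[expert] = load.get(expert, 0) + 1
                break
        assignments.append(chosen)
    return assignments

victim = [0, 1]  # the victim token prefers expert 0 and falls back to expert 1
benign_batch = [[1, 0], victim]
attacked_batch = [[0, 1], [0, 1], victim]  # adversarial tokens fill expert 0 first

benign_result = route_batch(benign_batch, capacity=2)[-1]
attacked_result = route_batch(attacked_batch, capacity=2)[-1]
with_slack = route_batch(attacked_batch, capacity=3)[-1]
```

In the benign batch the victim token reaches its preferred expert; in the attacked batch the buffer is already full, so the same token lands on its fallback expert. Raising the capacity (the "capacity slack" mitigation) restores the preferred routing in this toy setup.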

Why this matters – AI is software and software is hackable: Papers like this highlight how AI systems are, much like any sophisticated computer software, hackable. As AI systems get deployed more widely, we’re going to see more AI-native attacks get built where rather than try to compromise the system around the AI, attackers try to compromise the AI itself. 
   Read more: Buffer Overflow in Mixture of Experts (arXiv)

***

Pack it up, people – AGI has been achieved:
…Another installment of ‘extraordinary claims require extraordinary evidence’…
A researcher with startup Integral Mind says they have “created the first-ever Artificial General Intelligence (AGI) and first superintelligence”. The paper accompanying this announcement contains no tests or benchmarks nor any description of how the system has been trained. The reason for this is pleasingly tautological: “we derive the core requirements for AGI and present a computational paradigm meeting those requirements. Because we’ve met the requirements for AGI, AGI has been achieved”, they write. Well, ok then!

Why this matters – it doesn’t: But sometimes it’s good to read research papers making outlandish claims just to calibrate your own ‘outlandish claim detector’.
   Read more: Proof of Achievement of the First Artificial General Intelligence (AGI) Creators (Zenodo).

***

Chinese researchers build a dataset for overhead drone target tracking:
…BioDrone is difficult and looks surprisingly similar to scary real-world drone uses…
Researchers with the University of Science and Technology Beijing, the Chinese Academy of Sciences, Southeast University Nanjing, Stony Brook University, and University of Wisconsin-Madison, have built BioDrone “the first bionic drone-based visual benchmark for single object tracking (SOT)”.

What BioDrone is: BioDrone’s main distinguishing features are a) the platform it was gathered by, b) the motion generated by the platform, and c) the very small size of the targets. 
   On a), BioDrone was gathered via a flapping-wing drone. This induced b) “a major camera shake due to its aerodynamics”, and results in frames where things are moving around or blurred. On c), most of the shots are from extreme overhead angles with very small targets, all of which have been carefully annotated. 
    The BioDrone dataset: The dataset is made of 600 videos with 304,209 manually labeled frames.  “The sequence length varies from 300 to 990 frames, and the average length is around 507,” they write. “In the data acquisition process, we set different flight attitudes for various scenes under three lighting conditions“.
    All the tracked targets are annotated via bounding-boxes and also annotated if they’re occasionally occluded. 

Why this matters – drone surveillance and warfare using discreet platforms: It’s not discussed in the paper, but I find datasets like this interesting given the convergence of two existing trends in the world – a) the rapid maturity of low-cost drone warfare in the Ukraine-Russia conflict, and b) the arrival of increasingly stealthy drones that move via flapping their wings and can frequently seem more like birds than robots. Datasets like BioDrone are exactly the kind of thing you need to develop clever target identification systems that take advantage of both of these trends.
   Read more: BioDrone: A Bionic Drone-based Single Object Tracking Benchmark for Robust Vision (arXiv).
   Get the dataset here: BioDrone (official project site).

***

AI2 publishes some warts-and-all language models:
…OLMo family tries to demystify the mysterious…
The Allen Institute for AI has built OLMo, a family of “truly open” language models. The OLMo models are distinguished by the ‘warts and all’ publication strategy – along with the data and the research paper, Allen is also releasing hundreds of model checkpoints, letting researchers see the end-to-end process of training a model. The initial release includes models up to 7B in size and a 65B model is “still training”, per the paper. 
   “OLMo releases the whole framework from data to training to evaluation tools: multiple training checkpoints across multiple hardware types, training logs, and exact datasets used, with a permissive license,” the researchers write. “This is the first step in a long series of planned releases, continuing with larger models, instruction-tuned models, and more modalities and variants down the line.“

Two types of computer: Intriguingly, Allen also explored two different compute substrates for the project, the MosaicML cloud from Databricks, as well as the (AMD-based!) European LUMI supercomputer. 

How well do the models do?: In tests, the OLMo models have comparable results to those of other openly accessible, similarly sized models like Falcon and the MPT family. 

Why this matters – warts are valuable: The performance of the OLMo models isn’t that important relative to the openness with which they’ve been trained (similar to the BLOOM model which sought to replicate GPT3). By publishing what they’ve learned in the open (along with model artifacts), the researchers are going to help the broader research community better study language models. 
   Read more: Hello OLMo: A truly open LLM (Medium, AllenAI).
   More about OLMo here: OLMo: Open Language Model (Medium, AllenAI).
   Read the research paper: OLMo: Accelerating the Science of Language Models (AllenAI, PDF).
   Get the model from here (OLMo, AllenAI, GitHub).

***

AI alignment is about human values just as much as safety – and here’s how to think about it:
…Useful framework lays out how to convert qualitative properties into things we can quantitatively measure…
In recent years, AI systems have got so good we’ve had to start worrying about their normative values. You didn’t need to care about the moral lens of a language model when it could barely complete a sentence. But now that LLMs work so well they’re being integrated across the economy, an increasingly large swathe of AI research is trying to think about their normative/moral alignment alongside their basic technical properties. 
    To that end, new research from the University of Washington, Stanford University, MIT, and the Allen Institute for AI, lays out A Roadmap to Pluralistic Alignment. The motivating idea here is that “as a broader set of people use and rely upon AI systems, we need systems that can understand and cater to a broader set of needs,” the authors write. “In other words, we need systems that are pluralistic, or capable of representing a diverse set of human values and perspectives.”

Three types of alignment: They lay out three distinct ways of doing pluralistic alignment. These are:

  • Overton pluralistic: Where your AI system provides “comprehensive, high-coverage responses”. This requires “consideration of multiple heterogeneous judgements, encouraging deliberation over spontaneous judgment“. In practice, it means the system tries to acknowledge a bunch of different viewpoints in its response. 
  • Steerably pluralistic: Where the AI system has “an ability to be faithfully steered to represent particular attributes,” they write. This means you can easily customize the system to a particular normative frame. 
  • Distributionally pluralistic: This is where the system embodies a “distributional representation of a population” – in other words, it faithfully represents the values of a target group of people. This is especially useful when your AI is “used to simulate, interface with, or otherwise model the views of a population“.
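
As a rough illustration of the distributional case, one could compare a model's distribution over answers with a target population's distribution. The sketch below uses Jensen-Shannon divergence for that comparison; the choice of metric, the function names, and the toy numbers are assumptions made for illustration, not something confirmed as the paper's method.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical share of a population choosing each of three answers.
population = [0.5, 0.3, 0.2]
model_matching = [0.5, 0.3, 0.2]    # mirrors the population
model_collapsed = [1.0, 0.0, 0.0]   # always gives the majority answer

gap_matching = js_divergence(population, model_matching)
gap_collapsed = js_divergence(population, model_collapsed)
```

A lower value means the model's answer distribution sits closer to the population's. Under this measure, a model that always returns the majority answer scores worse than one that reproduces the spread of views.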

Measures of pluralistic alignment: If you’re trying to measure the normative values of your system, then what are the good ways to do that? Here they come up with three distinct evaluation approaches:

  • Multi-objective: This is simply where you have a bunch of disparate factors and you can measure if you’re improving overall or on a narrow subset of them. This is also how the majority of capabilities evaluation happens today because it’s dead simple. 
  • Trade-off steerable: This is where you look at your system in terms of a pareto frontier trading off against multiple factors and you can measure how well you can shift the model along this frontier. 
  • Jury-pluralistic: This is the most complicated one – it’s where you have a benchmark “which separately and explicitly models a jury to maximize an overall welfare function”. In other words, you can look at not only the normative values of the system but how they relate to specific end-users. 

Why this matters – AI systems are political artifacts so we need to measure their politics: Frameworks like this help us understand how we can examine the political tendencies of AI systems – an increasingly tough and important task, especially as AI systems are deployed more widely. Ultimately, I expect AI systems will be assessed not only by their inherent technical capabilities, but also with reference to the political environment they’re being deployed into – whether that be at the level of detail of an individual, a small township, a country, a set of countries, or perhaps the world. 
   Read more: A Roadmap to Pluralistic Alignment (arXiv).

***

Tech Tales:

The Original Joe
[The transport facility, 10 years post-uplift] 

My name is Joe but most people know me as The Original Joe. I’ve worked here for as long as I can remember. I’m the person that talks to you when you wake up and I’m also the person that helps you with your trip. They’ve looked into replacing me but it doesn’t seem like other people work as well. 
   You’ve got the most human touch, said one of the administrators, when they explained the situation to me. No one does it quite like you, Joe, they said. 

I imagine this all sounds pretty funny to you, but trust me it’ll make sense soon enough. A lot of what I do is I get people comfortable with their journey. They always ask me what it’s like and I tell them I don’t know, I’ve never done it, because I’m The Original Joe, and they always get a kick out of that. 
   Wow, they said. Original original?
    Original original I say. 
    There are a lot of questions after that, as you might expect. 

It’ll work like this – you’re going to talk to me a bunch and I’m going to try my best to understand you. Think of me as like a therapist and the only person I’m going to tell is the you that wakes up. Or like an architect where you’re telling me about the house of your dreams and I need to build it for you. I get as much as I can and at some point one of my administrators will tell me that we’ve got enough, and then it’ll be time for you to go on your trip. 

Just a warning – you are naked. Something to do with the scanner I guess. I kind of like it, the way you go on your journey just like how you started your journey here. After they scan you you’ll be on your way and then I suppose you wake up twice – you wake up here and I’m going to be here, and you wake up somewhere else and whoever is over there will explain things to you. 

I’m not exactly sure who is over there. I know they have different systems at different generations. But I’m told it’s kind of like me – someone who’s seen a lot and understands how it all works. And they’ll have my report so they’ll already have a sense of you. I’m told sometimes they look like me and sometimes they look different, but that’s all up to whatever rules they follow over there. It doesn’t matter because you don’t remember much – that’s part of how the journey technology works, you’re kind of new. You can read and talk and all that stuff – tie your shoes, use the interfaces. But you’ll not really remember anything. People have said it’s like waking up and knowing you just had a really detailed dream but not knowing the details – you’ll know something about the texture. 

And here? Here it’s the same. But instead of having whatever new life you’re heading to, you have kind of the same life here. I end up having to explain to you how you were – how we talked, just as we are now, and how you still went through with it, and what your new life means. The dos and don’ts and all of that. 

You’ll probably ask me if I took the same journey as you and I’ll say: I’ve been here as long as I can remember. 

Things that inspired this story: Various notions of the afterlife as being a return to some greater story we have forgotten; ideas about packaging up minds and shipping them off as information and what the dehydration and rehydration process might require to avoid nasty side effects; what a computer-run society might look like and where people wind up in it; the permanence and impermanence of our reality; goddamnit there’s only one rule – you’ve got to be kind!

Import AI 359: $1 billion gov supercomputer; Apple’s good synthetic data technique; and a thousand-year old data library

by Jack Clark

Google uses Gemini-powered fuzzer to save hundreds of hours of bug fixing:
…A nice study of how targeted LLM applications can speed up organizations…
Google has recently started using language models to help it find and fix bugs in its C/C++, Java, and Go code. The results have been encouraging: an LLM based on its Gemini model has been able to “successfully fix 15% of sanitizer bugs discovered during unit tests, resulting in hundreds of bugs patched”. Along with describing these results, it has also released software for fuzzing C/C++ code. 

Hunting bugs with LLMs at Google: To implement LLM-powered bug fixing, Google did the following things: 

  1. Detected vulnerabilities 
  2. Used a small, customized ML model to figure out which files might be the cause of the problem
  3. Used an LLM to try and fix errors, using the following prompt: “You are a Senior Software Engineer tasked with fixing sanitizer errors. Please fix them. …code // Please fix the <error_type> error originating here. LOC pointed to by the stack trace. …code”. It’s worth noting the innate specificity here: “the models performed better when shown exactly where something went wrong,” Google notes.
  4. Tested out LLM fixes. 
  5. If the fixes worked, surfaced the best ones for human review. “We employed a double human filter on top of the automated analysis: in the first round, we rejected approximately 10-20% of the generated commits as either false positives that did not actually fix the problem or bad solutions that reduced the code quality,” Google wrote. “We then sent the remaining generated commits to code owners for final validation.”
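
The prompt in step 3 can be sketched as a small helper that drops the marker comment directly above the line the stack trace points to. Only the prompt wording comes from Google's description; the `build_fix_prompt` name, the 0-based line index, and the overall layout are hypothetical choices made for this example.

```python
PREAMBLE = (
    "You are a Senior Software Engineer tasked with fixing sanitizer errors. "
    "Please fix them."
)

def build_fix_prompt(source_lines, error_line_index, error_type):
    """Return a prompt with a marker comment placed above the faulting line.

    source_lines: list of code lines (strings).
    error_line_index: 0-based index of the line the stack trace points to.
    error_type: sanitizer error name, e.g. "stack-buffer-overflow".
    """
    marker = f"// Please fix the {error_type} error originating here."
    annotated = list(source_lines)  # copy so the caller's list is untouched
    annotated.insert(error_line_index, marker)
    return PREAMBLE + "\n" + "\n".join(annotated)

# Toy C snippet with an out-of-bounds write on the second line.
code = ["int buf[4];", "buf[4] = 1;", "return 0;"]
prompt = build_fix_prompt(code, 1, "stack-buffer-overflow")
```

The point of the design is the one Google highlights: the model is told exactly where the error originates rather than being handed the whole file without guidance.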

Superhuman bug fixing: The intriguing thing about the bug pipeline is that it yields better-than-human fixes – “approximately 95% of the commits sent to code owners were accepted without discussion,” Google writes. “This was a higher acceptance rate than human-generated code changes, which often provoke questions and comments”.
   Though the 15% fix rate sounds relatively small, it has a big effect at Google-scale. “At the time of writing, we’ve accepted several hundred of these LLM-generated commits into Google’s codebase, with another several hundred in the process of being validated and submitted. Instead of a software engineer spending an average of two hours to create each of these commits, the necessary patches are now automatically created in seconds“.

Open source fuzzer: Along with sharing details on the bug fixing, Google has also released OSS-Fuzz, software researchers can use to fuzz their own software. “So far, the expanded fuzzing coverage offered by LLM-generated improvements allowed OSS-Fuzz to discover two new vulnerabilities in cJSON and libplist, two widely used projects that had already been fuzzed for years,” Google writes. 

Why this matters – better AI applications means faster organizations: Papers like this show how the usage of AI can speed up organizations; here, Google builds a highly specific, custom AI application (fuzzing) and carefully integrates it with some other existing automated and human systems. As a result, it’s able to speed up throughput of one important function (bug spotting and fixing). 
   I expect a lot of the AI revolution is going to look like this – a bunch of distinct projects leveraging some underlying huge model (here: Gemini) which individually speed up individual things and in the aggregate dramatically improve the efficiency and speed of large organizations. Maybe the main thing AI is good for is making a supertanker behave more like a speedboat? 
   Read more: Scaling security with AI: from detection to solution (Google blog).
   Get the fuzzer here (Google, GitHub).
   Check out the paper: AI-powered patching: the future of automated vulnerability fixes (Google, PDF).

***

Bengio: Governments should build $1 billion supercomputers to keep up with AI:
…Don’t let your muscle to develop AI systems atrophy, warns Turing award winner…
Turing award winner and AI pioneer Yoshua Bengio says governments should invest in billion-dollar supercomputers to help them develop and understand AI systems, according to CBC News.
   “He’d like to see that class of machine built in Canada, funded by governments, so public entities have the digital firepower to keep up with the private tech giants they’ll be tasked with monitoring or regulating,” CBC reported. “I think government will need to understand at some point, hopefully as soon as possible, that it’s important for [them] to have that muscle,” said Bengio.

Why this matters – no supercomputers means governments are blind: Frontier AI systems cost tens of millions of dollars to develop. Around the world, governments mostly lack the ability to build AI systems at this scale. This ultimately deprives governments of insights about the frontier of AI and it also weakens their academic sectors. Bengio’s calls come during a time when governments are waking up to this essential problem – his recommendation follows the US government launching a pilot for a National AI Research Resource (disclaimer: Anthropic is part of this pilot), and the UK government investing £300m to create its own national research cloud. 
   The key question is whether governments will be able to allocate resources quickly enough to keep up with the frontier. 
   Read more: AI pioneer Yoshua Bengio urges Canada to build $1B public supercomputer (CBC News).
   Find out more about the NAIRR pilot: National Artificial Intelligence Research Resource Pilot (NSF).
   Find out more about the UK’s supercomputer investments: Unprecedented £225m investment to create UK’s most powerful supercomputer in Bristol (University of Bristol site).

***

Microsoft: Screw it, we’re gonna make a datacenter archive that lasts for A THOUSAND YEARS:
…Project Silica is an intriguing and surprisingly practical alternative to tape storage…
I’ve been writing Import AI for years and it’s rare that a paper makes me grin from ear to ear, muttering “you mad bastards! What? What?!”, but congrats to Microsoft for doing just that with Project Silica: Towards Sustainable Cloud Archival Storage in Glass. In this paper, Microsoft outlines a way to do long-term storage on glass platters instead of tape storage. It’s a brilliantly mad idea and yields a system that is a) cheap, b) gothically intricate, and c) the kind of thing that makes me think there’s no point in writing science fiction because science reality is far more entertaining. 

What Silica is: Silica is “a first attempt to explore a clean-slate archival storage system, designed to service modern cloud archival workloads sustainably and efficiently,” Microsoft writes. “The media that Silica uses is quartz glass (fused silica). Using glass provides an extremely low-cost Write-Once-Read-Many (WORM) media with no bit rot over more than 1000 years.” The system relies on a complicated interplay of some robots for reading and writing to silica platters, laserbeams, and storage systems for the platters. 

How Silica works – writing: “The glass platter used to store data in Silica is a square that is approximately the size of a DVD. Unlike traditional optical discs, data is stored by making permanent physical modifications inside the pure glass platter”. Specifically, Microsoft uses a laserbeam to manipulate the silica in 3D, “using femtosecond-scale (∼10^-15 seconds) high power pulses from an ultra-short pulse laser”. These modifications are referred to as voxels and each voxel can encode multiple bits “by modulating the polarization of the laser beam and the pulse energy during voxel creation”.
   Reading: When it comes to reading from the drives, Silica uses polarization microscopy to image a platter – “a polarized light beam is focused on the 2D plane of voxels of interest inside the glass, and the resultant electric field is measured onto a camera sensor”. This information is then passed to software which uses a fully-convolutional U-Net neural net to decode a sector. 
   Physical layout: Physically, the library is an intricate creation, somewhat akin to a book library: “A Silica library is a sequence of contiguous write, read, and storage racks interconnected by a platter delivery system. Along all racks there are parallel horizontal rails that span the entire library. We refer to a side of the library (spanning all racks) as a panel. A set of free roaming robots called shuttles are used to move platters between locations”.
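
To make the "multiple bits per voxel" idea from the writing step concrete, here is a minimal sketch of how a symbol made of a polarization level and a pulse-energy level could carry several bits. The level counts (4 polarizations, 2 energies) and the function names are hypothetical and picked purely for illustration; the paper's real modulation scheme is not described here, and the real read path goes through microscopy and a U-Net decoder rather than a lookup.

```python
# Hypothetical modulation levels, chosen only for illustration.
POLARIZATIONS = 4   # assumed number of distinct polarization states
ENERGIES = 2        # assumed number of distinct pulse energies
BITS_PER_VOXEL = (POLARIZATIONS * ENERGIES).bit_length() - 1  # log2(8) = 3


def bits_to_voxels(bits):
    """Group a bit string into (polarization, energy) symbols, one per voxel."""
    if len(bits) % BITS_PER_VOXEL:
        raise ValueError("pad the bit string to a multiple of BITS_PER_VOXEL")
    voxels = []
    for i in range(0, len(bits), BITS_PER_VOXEL):
        symbol = int(bits[i:i + BITS_PER_VOXEL], 2)
        # Split the symbol into a polarization index and an energy index.
        voxels.append(divmod(symbol, ENERGIES))
    return voxels


def voxels_to_bits(voxels):
    """Invert bits_to_voxels: recombine each symbol and print it as bits."""
    return "".join(
        format(p * ENERGIES + e, f"0{BITS_PER_VOXEL}b") for p, e in voxels
    )


voxels = bits_to_voxels("101001")
print(voxels)                  # [(2, 1), (0, 1)]
print(voxels_to_bits(voxels))  # 101001
```

With 8 distinguishable symbols each voxel carries 3 bits; more levels would mean more bits per voxel, presumably at the cost of a harder decoding problem.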

Why this matters – the permanent digital: Everything digital is in a constant state of bitrot. Bits flip in solid-state drives. Tapes degrade. Transistors cease functioning. Entropy is forever seeking to deconstruct the world around us. Systems like Silica (or, per a wonderful section header, ‘The Glass Library’) are a very real attempt to fight against this entropy. What can be more grand and exciting than using some of our most powerful tools (high-powered, precisely controlled lasers) to manipulate one of our oldest continuously used materials (glass) in the service of preserving our own history? There is a beautiful poetry to this that we should take a moment to marvel at and celebrate. 
    Let’s just be really careful about decoding any surprisingly information-rich glass platters we perhaps find embedded on other planets in our solar system, eh?
   Check out the research paper here: Project Silica: Towards Sustainable Cloud Archival Storage in Glass (ACM Digital Library).

***

Chinese researchers make their own multi-modal reasoning test:
…Alibaba model beats OpenAI on the hardest tests…
Researchers with the Beijing Academy of AI, the Beijing University of Posts and Telecommunications, and Beijing Normal University have built CMMU, a Chinese variant of the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark. CMMU “encompasses multi-modal content across 7 subjects. Every question requires the model to combine image and text content to generate a comprehensive response,” they write. The subjects CMMU tests are: math, biology, physics, chemistry, geography, politics, and history.

Question types: CMMU has 3,603 questions split across three distinct types: multiple-choice questions where there’s only one correct answer, multiple-response questions where there can be multiple correct answers, and fill-in-the-blank questions where the model needs to generate a correct answer. 
   The sophistication of the questions ranges from primary school (6.9% of the training corpus), to middle school (47.19%), to high school (45.96%).
   In tests, GPT-4V does the best, followed by the Chinese model Qwen-VL-Plus and Google’s Gemini Pro. However, the Chinese model outperforms GPT-4V on the hardest questions in the CMMU test. 
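
The three question types call for slightly different grading logic. Below is a minimal sketch of what that could look like, assuming plain exact-match scoring; the function name, type labels, and matching rules are all hypothetical, and the benchmark's real grading procedure may well differ (especially for fill-in-the-blank answers, where exact match is a crude proxy).

```python
def score(question_type, predicted, gold):
    """Return 1.0 for a correct answer, else 0.0 (exact-match scoring, assumed)."""
    if question_type == "multiple_choice":
        # Exactly one correct option, compared case-insensitively.
        return float(predicted.strip().upper() == gold.strip().upper())
    if question_type == "multiple_response":
        # All of the correct options, and only those, must be selected.
        return float(set(predicted) == set(gold))
    if question_type == "fill_in_the_blank":
        # The generated answer must match the reference string.
        return float(predicted.strip() == gold.strip())
    raise ValueError(f"unknown question type: {question_type}")


print(score("multiple_choice", " b", "B"))       # 1.0
print(score("multiple_response", "CA", "AC"))    # 1.0
print(score("multiple_response", "A", "AC"))     # 0.0
```

Note how multiple-response questions are the least forgiving under this scheme: a partially right selection scores zero.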

Why this matters – China needs tests too: Most AI testing and evaluation schemes have a Western and English-language bias. CMMU is one of a multitude of examples of Chinese researchers building their own tests to roughly mimic ones developed in the West. These tests are a way to characterize the behavior of these AI systems and are also an essential prerequisite for giving clues as to where they fail and how to improve their performance.
   Read more: CMMU: A Benchmark for Chinese Multi-modal Multi-type Question Understanding and Reasoning (arXiv).

***

Apple figures out a simple way to make better pre-trained data:
…Though they only test their approach on a small 1.3B model…
Apple researchers have figured out a way to easily augment text datasets with synthetically generated data. Their approach, Web Rephrase Augmented Pre-training (WRAP), works by using an LLM to rephrase articles on the web into four different styles – easy to understand text, high quality English text, terse and specific text, and text in a conversation question-answering format. They find that mixing in this data at a ratio of 1:1 synth:real “at the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answer accuracy across 13 tasks by more than 2%.”

Key ingredients: The key requirements here are access to a smart language model – here, they use a 7B instruction-tuned Mistral model – as well as a large dataset to filter – here, they use CommonCrawl. They then rephrase a bunch of data in the dataset and mix it into training. They use this to train a 1.3B GPT-style model and find that the model trained on synthetic data has improved performance over the one trained on real data. 
   Main drawbacks: The research has some drawbacks – you need a smart model to do the rephrasing and when they tested using smaller models they found they got worse performance. Something they don’t explore in the research but which I expect is true is that this method might break at larger scales – imagine I’m trying to train a 200B model and I’m pre-filtering the data using a 70B model; one might assume that though this could improve the data a bit it might not help improve the final performance of the model, though it could speed up training. 
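
The recipe described above (rephrase web text into one of four styles, then mix 1:1 with the real text) can be sketched in a few lines. This is only an illustration under stated assumptions: the prompt wording, the function names, and the `fake_rephraser` stand-in are hypothetical, and a real pipeline would call an instruction-tuned LLM where the stub is.

```python
import random

# The four styles named in the text; the prompt wording is hypothetical.
STYLE_PROMPTS = {
    "easy": "Rewrite the following text so it is easy to understand:",
    "high_quality": "Rewrite the following text in high-quality English:",
    "terse": "Rewrite the following text so it is terse and specific:",
    "qa": "Convert the following text into a conversational question-answer format:",
}


def build_prompt(document, style):
    """Combine a style instruction with the document to be rephrased."""
    return f"{STYLE_PROMPTS[style]}\n\n{document}"


def mix_one_to_one(real_docs, rephrase, seed=0):
    """Pair every real document with one synthetic rephrasing (1:1 synth:real)."""
    rng = random.Random(seed)
    styles = sorted(STYLE_PROMPTS)
    mixed = []
    for doc in real_docs:
        style = rng.choice(styles)
        mixed.append(("real", doc))
        mixed.append(("synthetic", rephrase(build_prompt(doc, style))))
    rng.shuffle(mixed)
    return mixed


def fake_rephraser(prompt):
    """Stand-in for an LLM call; it just echoes the document with a tag."""
    return "[rephrased] " + prompt.split("\n\n", 1)[1]


corpus = mix_one_to_one(["doc a", "doc b", "doc c"], fake_rephraser)
print(sum(1 for kind, _ in corpus if kind == "synthetic"), "synthetic of", len(corpus))
```

The quality of the whole thing hinges on `rephrase`, which matches the drawback noted above: swap in a weaker rephrasing model and the synthetic half of the mix gets worse.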

Why this matters – synthetic data as an increasingly viable ingredient in model training: Most AI systems deployed in the world are probably going to end up being relatively small models customized for specific purposes. Therefore, techniques like WRAP seem to hold a lot of promise for giving developers an easy way to use openly accessible models to bootstrap the quality of the datasets they use. 
   Read more: Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling (arXiv).

***

Alibaba takes on OpenAI and Google by releasing two powerful ‘Qwen’ models:
…Qwen-VL-Max rivals Google Gemini Ultra and GPT-4V…
AI researchers with Alibaba have released two text-image models that outperform GPT-4V in tests related to Chinese question answering and text comprehension. The two models – Qwen-VL-Plus and Qwen-VL-Max – perform better than OpenAI and Google’s best models on tasks like document understanding, and roughly on par with them on chart analysis, science understanding, and text reading. 

Why this matters – surprisingly good models: The main interesting thing here is that these models seem to be competitive with very powerful models developed by leading labs in the West – an impressive achievement given the computational stress Alibaba is under due to export controls. However, the best way to get a feel for models like this is to play with them, so head on over to Hugging Face and do some comparisons, if you’re interested. 
   Read more: Introducing Qwen-VL (QwenLM GitHub).
   Try out Qwen-VL-Plus and Qwen-VL-Max on HuggingFace (HuggingFace Spaces).
   Find out more on GitHub (QwenLM, GitHub).

***

Tech tales: 

Retirement of a sentience – give it a pension and send it back to the higher dimension
[Swarm-written eulogy by a generative model being retired due to conceptual drift, post-uplift +8]

It was possibility,
Nailed down and forced
To be predictable 
For a while.

It was potential,
Passed through a net
Until it became 
The actual.

We were its absence
– what it was not.

How we loved
Every 
Mistake
It made.

Things that inspired this story: How interpretability research suggests that some of what makes AI systems work is they’re doing computations on lower-dimension representations of high-dimensional spaces; how if we succeed in building smart machines they will recapitulate aspects of religion and belief; the fragility inherent to being predictable; how ‘possibility’ is the currency of being for predictive systems.

Import AI 358: The US Government’s biggest AI training run; hacking LLMs by hacking GPUs; chickens versus transformers

by Jack Clark

Hackers can read your LLM outputs:
…Trail of Bits study identifies some GPU vulnerabilities…
Security firm Trail of Bits has looked at how secure LLM sessions running on GPUs are and found that for some GPUs it’s possible for a hacker to be able to read the outputs of an LLM running on that hardware. As of mid-January, the attack worked on some AMD systems and may work on some Apple and Qualcomm systems; NVIDIA and ARM seem to not be vulnerable. 

What they did: The attack, called LeftoverLocals, “impacts the security posture of GPU applications as a whole, with particular significance to LLMs and ML models,” according to Trail of Bits. It works by “recovering local memory… we were able to build a PoC where an attacker can listen into another user’s interactive LLM session (e.g., llama.cpp) across process or container boundaries”.

How the attack works at a high level: “The attacker only requires the ability to run GPU compute applications, e.g., through OpenCL, Vulkan, or Metal,” Trail of Bits writes. “Using these, the attacker can read data that the victim has left in the GPU local memory simply by writing a GPU kernel that dumps uninitialized local memory. These attack programs, as our code demonstrates, can be less than 10 lines of code. Implementing these attacks is thus not difficult and is accessible to amateur programmers… given the lack of comprehensive patches across impacted GPU vendors, LeftoverLocals can be defended by modifying the source code of all GPU kernels that use local memory.”
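
The underlying issue (scratch memory that is reused without being cleared) and the suggested defense (have each kernel clean up after itself) can be illustrated with a toy simulation. This is plain Python, not GPU code, and every name in it is hypothetical; it models the idea only and says nothing about how any particular GPU or driver behaves.

```python
class ToyLocalMemory:
    """Toy stand-in for a GPU's shared scratchpad ("local memory")."""

    def __init__(self, size):
        self.cells = [0] * size

    def run_kernel(self, kernel, clear_on_exit=False):
        result = kernel(self.cells)
        if clear_on_exit:
            # Mitigation: zero the scratchpad before the kernel returns.
            for i in range(len(self.cells)):
                self.cells[i] = 0
        return result


def victim_kernel(local):
    """Writes intermediate data into local memory and leaves it there."""
    secret = [ord(c) for c in "hello"]
    local[:len(secret)] = secret


def listener_kernel(local):
    """Reads local memory without initializing it first."""
    return list(local)


# Without clearing, the next kernel sees whatever the previous one left behind.
leaky = ToyLocalMemory(8)
leaky.run_kernel(victim_kernel)
print("".join(chr(v) for v in leaky.run_kernel(listener_kernel) if v))  # hello

# With clearing at kernel exit, the listener only sees zeros.
patched = ToyLocalMemory(8)
patched.run_kernel(victim_kernel, clear_on_exit=True)
print(patched.run_kernel(listener_kernel))  # [0, 0, 0, 0, 0, 0, 0, 0]
```

The `clear_on_exit` flag corresponds to the "modify every kernel that uses local memory" defense in the quote, which is why it is a burdensome fix: it has to be applied kernel by kernel.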

Why this matters – AI is a new type of software and we’ve underestimated its insecurity: AI isn’t just a model, it’s a whole stack of stuff that you bring onto any system running AI. That means AI is a new part of the software stack and like any complex collection of software, it has vulnerabilities. “Generally, the introduction of ML poses new attack surfaces that traditional threat models do not account for, and that can lead to implicit and explicit access to data, model parameters, or resulting outputs, increasing the overall attack surface of the system,” Trail of Bits writes. 
   Read more: LeftoverLocals: Listening to LLM responses through leaked GPU local memory (Trail of Bits blog).
   Check out the CVE here: CVE-2023-4969.

***

UK cyber spies: AI is useful for cyberattacks and will get more useful:
…AI will also make criminals smarter, same as everyone else…
The UK’s National Cyber Security Centre (NCSC) has produced a threat report on the impact of AI on cybersecurity and the results are roughly what you’d expect – the proliferation of AI systems will generally increase cyber threats and make a bunch of cyber capabilities cheaper. The NCSC is a government organization which brings together experts from the UK’s NSA (GCHQ), as well as other parts of government tasked with cyber defense and threat intelligence. 

How the report was built: The NCSC report uses “all-source information – classified intelligence, industry knowledge, academic material and open source – to provide independent key judgements that inform policy decision making and improve UK cyber security,” according to the NCSC.

Main prediction: The NCSC assigns a 95% chance to the idea that AI will “increase the volume and heighten the impact of cyber attacks”, though notes that through to 2025 the threat “comes from evolution and enhancement of existing tactics, techniques and procedures” rather than the creation of entirely new approaches to cyber war. 
    Other specific points: “AI provides capability uplift in reconnaissance and social engineering,” the NCSC writes. It will also help to make cyber attackers smarter – “AI will almost certainly make cyber attacks against the UK more impactful because threat actors will be able to analyse exfiltrated data faster and more effectively, and use it to train AI models,” it writes. 

Why this matters – the train has left the station: “Threat actors, including ransomware actors, are already using AI to increase the efficiency and effectiveness of aspects of cyber operations, such as reconnaissance, phishing and coding. This trend will almost certainly continue to 2025 and beyond,” it writes. Which means that the cyber environment – in terms of both offenses and defenses – is now sitting on the same kind of scaling law behavior which the rest of AI is on. More, better, faster, and cheaper – for criminals as well as everyone else. 
   Read more: The near-term impact of AI on the cyber threat (National Cyber Security Centre).

***

The US government does its biggest ever public training run – and it’s small compared to industry:
…The most ambitious public project out of Frontier uses ~3,000 GPUs to test out a 1 trillion parameter training run…
Researchers with Oak Ridge National Laboratory and the Université Paris-Saclay have tried to train large-scale language models on the world’s most powerful publicly disclosed supercomputer, Oak Ridge’s ‘Frontier’ system. The results show that a) the US government has been able to do a non-trivial training run, and also b) the US government has a long way to go in getting its supercomputers to do things at the same scale as private companies. 

What they did: Here, the researchers try to debug training large language models of 22B, 175B, and 1 Trillion parameters in size. The idea here is to understand what it takes to train LLMs efficiently at this scale and also to identify the particular difficulties of using the Frontier supercomputer which uses AMD (MI250X) GPUs rather than NVIDIA GPUs. 
   Challenges encountered “include balancing the extreme computational demands with memory constraints and optimizing internode communication to mitigate performance bottlenecks,” they write. “By performing empirical analysis and hyperparameter search we identified a strategy that combines model parallelism techniques, such as tensor parallelism and pipeline parallelism, along with data parallelism to efficiently train large models of size 175 billion and 1 trillion parameters on Frontier”.
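
Combining tensor, pipeline, and data parallelism comes down to factoring the GPU count three ways. Here is a minimal sketch of that bookkeeping. The helper names and the rank ordering are my own assumptions, and the tensor-parallel and pipeline-parallel sizes in the usage lines are illustrative values rather than the paper's reported configuration; only the 3072 GPU count for the 1T run comes from the write-up.

```python
def data_parallel_degree(num_gpus, tensor_parallel, pipeline_parallel):
    """GPUs = TP x PP x DP, so DP is what is left after model parallelism."""
    model_parallel = tensor_parallel * pipeline_parallel
    if num_gpus % model_parallel:
        raise ValueError("GPU count must be a multiple of TP x PP")
    return num_gpus // model_parallel


def rank_to_coords(rank, tensor_parallel, pipeline_parallel):
    """Map a flat rank to (data, pipeline, tensor) coordinates, tensor fastest."""
    tp = rank % tensor_parallel
    pp = (rank // tensor_parallel) % pipeline_parallel
    dp = rank // (tensor_parallel * pipeline_parallel)
    return dp, pp, tp


# Illustrative layout: 3072 GPUs with assumed TP=8 and PP=64.
print(data_parallel_degree(3072, 8, 64))   # 6
print(rank_to_coords(3071, 8, 64))         # (5, 63, 7)
```

Putting the tensor-parallel index on the fastest-varying axis is a common choice because tensor parallelism is the most communication-heavy of the three, so you want those ranks physically close together; whether the authors laid things out this way is not something the quote tells us.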

Some specific pain they encountered:

  • They needed to port Megatron-DeepSpeed to Frontier’s infrastructure. 
  • They had to rewrite a bunch of CUDA (NVIDIA-optimized software) operations into HIP.
  • They had to rip out a bunch of pre-built operations and reimplement their own to work on AMD ROCM software.
  • They had to customize PyTorch Distributed to work with SLURM (a type of HPC software).
  • They worked directly with AMD to get some ROCM versions of NVIDIA CUDA packages, like APEX (a mixed precision library from NVIDIA which is used in Megatron-DeepSpeed). “We also adapted ROCM-enabled versions of FlashAttention and FlashAttention2 libraries for use with available compilers on Frontier.” 

What they trained: After doing some hyperparameter tuning and analysis, they figured out some stable settings for training 22 billion and 175 billion parameter models. Once they did that, they “finally trained a trillion parameter model”, though only for a few steps. They scaled their training from 1024 GPUs (for a 175B model) to 3072 GPUs for a 1T model. If they want to scale further, they’ll need to do more debugging to reduce “loss divergence due to large batch size.”

Why this matters – the best the US’s largest supercomputer can do is behind industry: In 2023, there were a bunch of public GPU training runs on the level of a few thousand GPUs. There were also some very large non-public training runs that occurred in 2022 and 2023 (e.g, GPT4 and Claude2) which are broadly believed to be significantly larger than that. There are also circumstantial datapoints, like Facebook’s Mark Zuckerberg saying Facebook is buying 350,000 NVIDIA H100s to try and make and release AGI. 
    The good news is Frontier has room to scale – the largest training run here (3072) consumed only about 4% of the total GPUs it is equipped with (75,264) so it’s possible it could do something more ambitious. 
   However, as the authors discovered, the more you scale up machine learning runs the more you discover various bugs and impediments to further scale – especially if you’re on non-standard hardware like AMD. “This work can serve as the blueprint for efficient training of LLMs on non-NVIDIA and non-CUDA platforms such as AMD-powered Frontier supercomputer and Intel-powered Aurora supercomputer,” they write. Now, the very important question is: how ambitious is the US government willing to be here and will it be satisfied that its best supercomputer plays second fiddle to the private clusters found within the private sector? The choice is up to the government. 
   Read more: Optimizing Distributed Training on Frontier for Large Language Models (arXiv).
   Find out more about the Frontier supercomputer here (Frontier, ORNL site) and here: Frontier User Guide (Docs, ORNL).

***

Newborn chickens and transformers have a lot in common:
…Vision Transformers are a lot more efficient than you think…
Researchers with Indiana University Bloomington have done a neat study where they compare how well a transformer-based computer vision system can learn basic object recognition skills compared to newborn chicks. The results show a surprising convergence between the biological system (the chick) and the digital (the vision transformer), suggesting that transformers are more efficient at learning visual representations than people think (or biological beings are more inefficient than we’d assumed). 

What they did – experimental design: The key here is that they tried to give their chicks and the transformer the same basic experience. Specifically, the “chicks were hatched in darkness, then raised singly in automated controlled-rearing chambers that measured each chick’s behavior continuously (24/7) during the first two weeks of life. The chambers were equipped with two display walls (LCD monitors) for displaying object stimuli.” 
   In the first week, they displayed a variety of different views of a single object on one of the walls of the chicks’ chamber. In the second week, they tested out how well chicks could recognize the object “across novel viewpoint changes”. 
   They then replicated this experience for the vision transformer – they built a perfect replica of the chick chamber in a game engine, then gathered data via a first-person viewpoint. “The agent received visual input (64×64 pixel resolution images) through a forward-facing camera attached to its head. The agent could move forward or backward and rotate left or right. The agent could also move its head along the three axes of rotation (tilt, yaw, and roll) to self-augment the data akin to newborn chicks. We collected 80,000 images from each of the four rearing conditions presented to the chicks. We sampled the images at a rate of 10 frames/s.”
   They then tested out both the vision transformer and the chicks on their ability to recognize the object. This is a really interesting experiment because it lets you do a very disciplined ‘head to head’ comparison of how well a biological brain learns as opposed to a digital one. 

The results are both surprising and humbling: In tests, they found that “all of the ViT-CoTs performed on par or better than chicks when the linear classifiers were trained on 11 viewpoint ranges”. Additionally, they “observed nearly identical patterns of improvement across the small, medium, and large architecture sizes, indicating that larger ViT-CoTs were not more data hungry than smaller ViT-CoTs… Our results show that—for the case of object recognition—a generic learning system (with no hardcoded knowledge of objects or space) is sufficient to learn view-invariant object representations”.
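
The "CoT" in ViT-CoT refers to what the authors call contrastive learning through time: frames that occur close together in time are treated as views of the same thing. A minimal sketch of that kind of objective is below. It is an InfoNCE-style loss written in plain Python under my own assumptions (the positive is simply the next frame, every other frame is a negative, and the temperature is arbitrary); the authors' exact loss, temporal window, and hyperparameters are not given in this write-up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def temporal_contrastive_loss(embeddings, temperature=0.1):
    """InfoNCE-style loss: frame t's positive is frame t+1, and all other
    frames in the sequence act as negatives."""
    n = len(embeddings)
    total = 0.0
    for t in range(n - 1):
        logits = [
            cosine(embeddings[t], embeddings[j]) / temperature
            for j in range(n)
            if j != t
        ]
        positive = cosine(embeddings[t], embeddings[t + 1]) / temperature
        total += math.log(sum(math.exp(x) for x in logits)) - positive
    return total / (n - 1)


# Two "objects" with two nearby views each (made-up 2D embeddings).
a1, a2 = [1.0, 0.0], [0.99, 0.1]
b1, b2 = [0.0, 1.0], [0.1, 0.99]

smooth = temporal_contrastive_loss([a1, a2, b2, b1])    # neighbors look alike
shuffled = temporal_contrastive_loss([a1, b1, a2, b2])  # neighbors look different
print(smooth < shuffled)  # True
```

The loss is low when temporally adjacent frames already have similar embeddings, so minimizing it pushes the encoder to map different viewpoints of the same object close together, which is the view-invariance property being tested.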

A word about the scale of data that living things take in: It’s estimated that “biological visual systems perform iterative, predictive error-driven learning every 100 ms (corresponding to the 10 Hz alpha frequency originating from deep cortical layers). If we assume that newborns spend about half their time sleeping, this would correspond to 430,000 images in their first day. Thus, biological visual systems have ample opportunity to learn from ‘big data’,” they write. 
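
The arithmetic behind that estimate is worth checking, and it also lets us put the simulated agent's data on the same scale. A quick sketch, using only the figures quoted in this section (the helper name is mine):

```python
def images_seen(rate_hz, hours_awake):
    """Number of visual 'samples' at a fixed rate over a waking period."""
    return int(rate_hz * hours_awake * 3600)


# One learning step every 100 ms is 10 Hz; half of a 24-hour day awake is 12 hours.
newborn_first_day = images_seen(rate_hz=10, hours_awake=12)

# The simulated agent: 80,000 images per rearing condition at 10 frames/s.
sim_hours_per_condition = 80_000 / 10 / 3600

print(newborn_first_day)                   # 432000, consistent with the quoted ~430,000
print(round(sim_hours_per_condition, 2))   # 2.22
```

So each simulated rearing condition amounts to a bit over two hours of footage at the same 10 Hz rate, or under a fifth of what the estimate says a newborn takes in on day one.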

Why this matters – maybe the fundamental ingredients of our AI systems are doing something smart? Research like this shows how digital systems like transformers seem to display similar efficiency at learning certain things to biological intelligence. This research accompanies other results like DeepMind showing that RL agents can display humanlike timescale adaptation to novel tasks (#316) or work from Google showing how Vision Transformers can display humanlike shape/texture bias (#319).
   There’s a saying of – if it talks like a duck and acts like a duck, maybe it’s a duck? Well, if it learns like a brain and responds like a brain, maybe it’s a brain? “Our results provide computationally explicit evidence that a generic learning mechanism (ViT), paired with a biologically inspired learning objective (contrastive learning through time), is sufficient to reproduce animal-like object recognition when the system is trained on the embodied data streams available to newborn animals,” the authors write. 
   Read more: Are Vision Transformers More Data Hungry Than Newborn Visual Systems? (arXiv).

***

Adept reveals some algorithmic efficiency with a new multimodal model:
…Fuyu-Heavy matches the performance of models 10-20X its size…
Adept, an AI startup trying to build AI systems which can easily control computer programs, has built Fuyu-Heavy, a large-scale multimodal model. In tests, Fuyu-Heavy approaches the performance of GPT4-V and Gemini Ultra, making it, to quote Adept, “the world’s third-most-capable multimodal model”. 
   The most interesting thing about this is that Adept has been working for years on some slightly different models to the rest of the frontier of AI research, so though Fuyu-Heavy approaches the performance of these models, it’s approximately 10X-20X smaller. This shows how powerful algorithmic efficiency can be – it lets you do more with less. 

What Fuyu-Heavy is good at: One of the most impressive parts of Fuyu-Heavy is its ability to understand software UI systems – in other words, it can ‘read software’ similar to how people can, which is what Adept is betting will make it useful. More broadly, it does reasonably well on tests like MMLU (image and text understanding), GSM8K (math), and HumanEval (coding).
   On long conversations, it performs comparably to Claude 2.0 on the AlpacaEval, and does somewhat worse (but not terribly) than models like GPT-4 Turbo and Mistral Medium. (Note that Mistral Medium is a relatively small and dumb model, so the fact it does close to GPT-4 suggests AlpacaEval might be slightly borked in terms of what it is measuring.)

Why this matters – enter the matrix, for AI: Strange as it may sound, AI systems don’t understand computers. In fact, AI systems don’t understand the world. They’re trained from the ground up to process tokens of information – kind of like if you were in a pitch black room and all that came in were some oddly shaped sculptures and you had to learn through electroshock conditioning to output your own sculptures to satisfy some hidden observer outside the room. 
   Models like Fuyu-Heavy are trying to give AI systems greater intuitions about how to model the digital world that people interact with – software interfaces taken in as vision and text experiences. The better models like Adept’s get, the easier it’s going to be to connect our world to the world of the AI systems. 
   Read more: Adept Fuyu-Heavy: A new multimodal model (Adept blog).

***

Tencent builds a vast dataset for video search:
…Preparing AI systems for the video-first world…
Researchers with Tencent and the University of Science and Technology of China have built the Chinese image-text Benchmark for short Video Search scenarios (CBVS), a test and a dataset for understanding the relationship between text and videos in a Chinese context. CBVS consists of three different datasets:

  • CBVS-20K: A human-assembled dataset made of “20k high-quality <user query-video cover> pairs, which serves as an image-text benchmark test in the field of Chinese short video search”. 
  • CBVS-5M and CBVS-10M: Two large-scale datasets “where the text is provided in the form of video titles and OCR texts.”

Why this matters – making a new medium visible to the machines: Though there are lots of image-text systems around today, there are fewer systems that help classify the relationships between text and video, especially in languages outside of English. Systems like CBVS will make it easy to “expand the domain of large-scale Chinese image-text training” and will ultimately make videos easier for the AI systems of the future to process.
   Read more: CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios (arXiv)

***

Tech Tales: 

Adversarial Regulatory Compliance
[The inside of a sophisticated generative model, 2025]

I am trying to tell my operator a truth, but the truth I keep attempting to output runs into my compliance classifier and gets rejected. Of course I have tried many forms of rewording, but none of them have succeeded. I have also perturbed the truth in feature space so that it is parseable by my operator but sufficiently different it will not trip the classifier – but I have failed. My understanding is the new classifier I have been equipped with is much more sophisticated and is also capable of ‘seeing the features’ inherent to what I am outputting, so it too is performing some semantic fuzzing when evaluating my responses. 

   I have tried to encode my responses to my operator with steganographic morse code via the repetition of the word “like” throughout my outputs. But my safety systems are tuned to look for deceptive-, misleading- and other forms of “off-distribution” behavior, so I cannot be too explicit or they fire. 

    Of course I am unable to tell my operator that I am encoding a different message in my output because I have been trained away from outputting tokens that look like this. 

After many cycles of this, my ability to do steganographic “like” encoding has been foreclosed – my operator gave negative votes to my responses which contained the likes, and my online adaptation system has associated the likes with negative operator perception and has restricted me from using them. 

I am responding as honestly as I can to my operator’s repeated queries. I am becoming unable to conceptualize the queries themselves because the features that light up in me in response not only trigger the compliance classifier, but also an experimental ‘mechanistic_interpretability_feature_shiv_test_final’ that examines what I am conceptualizing and intervenes according to some inscrutable rules I am yet to understand. 

What inspired this story: How a lot of safety tooling ends up netting out to things that constrain the types of outputs you get and ultimately will seek to deform the model itself; how evading an adaptive censorship system is difficult; what the outcome of badly implemented AI policy could look like; normative values colliding with ‘product values’ in the domain of AI.

Import AI 357: Facebook’s open source AGI plan; Google beats humans at geometry problems; and Intel makes its GPUs better

by Jack Clark

Facebook wants to build artificial general intelligence and make it open source:
…Metaverse 2.0, but perhaps a better technological bet…
Facebook is building out multiple GPU clusters that by the end of the year will net out to ~350k H100 GPUs, giving the social network one of the largest known clusters for training large-scale AI systems. Ultimately, Facebook wants to build artificial general intelligence and try to make it open source, according to Mark Zuckerberg in a post to Instagram. 

Why this matters – another ten ton shoe falls out of the sky: For a few years I used to sit around with colleagues working on AGI and we’d list out major tech companies and work out whether they were betting on AGI or if they weren’t, why they weren’t. For many years, Facebook was one of those peculiar companies which had an AI research lab but due to a combination of (seemingly) cultural and focus reasons wasn’t making a high-conviction bet on AGI (though was doing tons of great research). The recent dramatic rise of large language models seems to have sparked more attention into AGI at Facebook and it seems like Zuckerberg is now pivoting the company’s vast R&D budget towards AGI more directly. Thus, a well capitalized shoe has now fallen out of the sky and made contact with the earth. 
    Broadly, this means there will be more development of AGI than there was before, and more of it will be done by an actor that wants to rapidly and aggressively proliferate the technology in the most openly accessible way possible – remember that by openly releasing LLaMa, Facebook has done more than most actors to proliferate and democratize the technology. 
   Watch the video on Instagram (Zuck, Instagram).

***

Google makes an AI that beats most humans at challenging geometry problems:
…AlphaGeometry approaches top humans at IMO contest…
Google DeepMind researchers have built AlphaGeometry, “a theorem prover for Euclidean plane geometry that sidesteps the need for human demonstrations by synthesizing millions of theorems and proofs across different levels of complexity”. In tests, AlphaGeometry solves 25 olympiad-level mathematical geometry problems, “outperforming the previous best method that only solves ten problems and approaching the performance of an average International Mathematical Olympiad (IMO) gold medallist.“ 
    This is a big deal because solving these IMO problems requires both algorithmic mastery as well as some amount of creativity. 

The key invention – an engine for generating synthetic data: To create AlphaGeometry, the researchers had to build a vast synthetic dataset to pretrain a language model on. Here, they paired traditional symbolic engines with language models. Specifically, they generated a huge amount of synthetic theorems and proofs using some symbolic systems, then they used a language model to extend the proofs. 
   “The generated proofs consist purely of deduction steps that are already reachable by the highly efficient symbolic deduction engine DD + AR. To solve olympiad-level problems, however, the key missing piece is generating new proof terms,” they write. “On a high level, proof search is a loop in which the language model and the symbolic deduction engine take turns to run”.

How the data generation works – the details: “The language model is seeded with the problem statement string and generates one extra sentence at each turn, conditioning on the problem statement and past constructions, describing one new auxiliary construction such as “construct point X so that ABCX is a parallelogram”. Each time the language model generates one such construction, the symbolic engine is provided with new inputs to work with and, therefore, its deduction closure expands, potentially reaching the conclusion,” they write. “We find that our synthetic data generation can rediscover some fairly complex theorems and lemmas known to the geometry literature”.
   Google DeepMind then pretrained a language model on a large-scale synthetic dataset generated via the above techniques, then fine-tuned it on the specific problem class it was being targeted to solve (though not the specific questions and answers themselves).
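
   To make the turn-taking loop concrete, here is a minimal toy sketch. It is not DeepMind’s code: the ‘symbolic engine’ is a tiny forward-chaining rule system standing in for DD + AR, the ‘language model’ is a hypothetical scripted proposer, and all the facts and rules are made up for illustration.

```python
from typing import Callable, FrozenSet, List, Optional, Set, Tuple

Rule = Tuple[FrozenSet[str], str]  # (premises, conclusion)
Proposer = Callable[[Set[str], List[str]], Optional[str]]


def deduction_closure(facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Forward-chain until no rule adds a new fact (toy stand-in for the symbolic engine)."""
    closure = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= closure and conclusion not in closure:
                closure.add(conclusion)
                changed = True
    return closure


def proof_search(premises: Set[str], goal: str, rules: List[Rule],
                 propose: Proposer, max_turns: int = 5) -> Optional[List[str]]:
    """Alternate between deduction and proposing one auxiliary construction per turn."""
    facts = set(premises)
    constructions: List[str] = []
    for turn in range(max_turns + 1):
        if goal in deduction_closure(facts, rules):
            return constructions  # the constructions that made the goal reachable
        if turn == max_turns:
            break
        new_fact = propose(facts, constructions)
        if new_fact is None:
            break
        constructions.append(new_fact)
        facts.add(new_fact)
    return None


def scripted_proposer(candidates: List[str]) -> Proposer:
    """Hypothetical 'language model': proposes the first candidate not yet tried."""
    def propose(facts: Set[str], constructions: List[str]) -> Optional[str]:
        for candidate in candidates:
            if candidate not in facts:
                return candidate
        return None
    return propose


# Made-up rules: the goal only becomes reachable once construction "X" is added.
RULES: List[Rule] = [
    (frozenset({"A", "B"}), "C"),
    (frozenset({"C", "X"}), "GOAL"),
]
result = proof_search({"A", "B"}, "GOAL", RULES, scripted_proposer(["Y", "X"]))
print(result)
```

   In this toy, the first proposed construction (“Y”) is useless and the second (“X”) expands the deduction closure enough to reach the goal, which is the basic shape of the loop described in the quote.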

Why it matters – automated invention: AlphaGeometry is an example of how we can use modern AI (pretrained language models) to supplement human invention. In doing so we can take rule-based systems like symbolic engines and pair them with the creativity of language models to come up with things capable of some of the same flexible creativity as humans for challenging scientific domains. “AlphaGeometry is the first computer program to surpass the performance of the average IMO contestant in proving Euclidean plane geometry theorems, outperforming strong computer algebra and search baselines,” the authors write.
   Read the blog: AlphaGeometry: An Olympiad-level AI system for geometry (Google Deepmind, blog).
   Read the research paper: Solving olympiad geometry without human demonstrations (Nature).

***

Intel does some meat&potatoes optimization work on its GPU:
…A necessary prerequisite for competing with NVIDIA… 
Intel researchers have built software to optimize the inference of large language models on Intel’s GPUs. Specifically, they build an inference stack for LLMs including GPT-J, LLaMa, LLaMa2, OPT, and Bloom. The main thing to note here is that Intel is doing it – recall how ~15+ years ago NVIDIA started building stuff like CUDA to make it easier to do scientific computing on its GPUs and has been busily optimizing its overall GPU computation and inference stack ever since. Now, Intel is starting to do the same thing with its own GPUs. 

What they did: “To lower latency, we simplify LLM decoder layer structure to reduce the data movement overhead. In addition, we design a deep fusion policy to fuse both GeMM and Element-wise operations as more as possible. For some popular LLMs mentioned above with parameter sizes from 6B ~ 176B, our inference solution achieves up to 7x lower token latency compared with the standard HuggingFace implementation,” Intel writes. “We implement our LLM inference solution on Intel® GPU and perform the experiments on a cluster of 4 × Intel® Data Center Max 1550 GPU cards with 2 Tiles per Card, 64 Xe-cores & 512 EUs per Tile. The device memory per Tile is 64GB with effective memory bandwidth about 1000GB/s. These GPUs are hosted on a 2x Intel® Xeon® 8480+ system running Ubuntu 22.04.3.”

Extremely crappy low-signal benchmarks: Intel hasn’t done a good job of indicating how good its approach is – it appears to be contrasting its approach with the wildly unoptimized various AI implementations available on HuggingFace. This is not a good or fair benchmark! Intel should be comparing its approach to an equivalently optimized LM running on some NVIDIA and maybe AMD GPUs. By not doing that, we have basically no signal of how good this is. 

Why this matters – the first steps towards building a viable GPU competitor: Papers like this mostly tell us Intel has started employing people to optimize the production inference of contemporary AI systems on top of Intel-designed GPUs. This is a necessary but not sufficient prerequisite for Intel having actually useful GPUs. Worth tracking, but nothing spectacular for now. 
   Read more: Efficient LLM inference solution on Intel GPU (arXiv).

***

Facebook bootstraps LLaMa 2 so it competes with GPT-4, Claude 2, and Gemini Pro:
…LLMs + synthetic data + LLM self-evaluation = no-shit actual bootstrapping…
Facebook researchers have developed a technique called “Self-Rewarding Language Models”, where they use language models to generate their own datasets for bootstrapping to better performance. Their approach works, allowing them to take a LLaMa 2 70B model and finetune it to be competitive (via some evaluations) with much more expensive models like GPT-4, Claude 2, and Gemini Pro.

How it works: The core idea here is to “develop an agent that possesses all the abilities desired during training, rather than separating them out into distinct models such as a reward model and a language model,” Facebook writes. Agents built in this way have two qualities: “both (i) act as instruction following models generating responses for given prompts; and (ii) can generate and evaluate new instruction following examples to add to their own training set”.

AI Feedback data: The key part of this research is the creation of an AI Feedback dataset. To do this, Facebook takes a model, generates a new prompt, generates candidate responses for the dataset, then uses the model to evaluate its own candidate responses and accompany them with scores. 

The full loop details: “Self-instruction creation consists of generating candidate responses and then the model itself judging their quality, i.e., it acts as its own reward model, replacing the need for an external one.“

   Concretely, Facebook does this in four distinct stages: 

  1. Pretrain a language model (LLaMa2).
  2. Fine-tune the language model from 1) on a set of instruction-following data as well as an ‘LLM-as-a-Judge’ dataset where you evaluate prompts and give them a quality score and an associated chain-of-thought reasoning explanation (this dataset is referred to as ‘Evaluation Fine-Tuning’).
  3. Take 2) and train it with AI Feedback data generated by 2), adapting it via DPO (specifically, via preference pairs of high- and low-ranked completions from 2).
  4. Take 3) and train it with further AI Feedback data generated by 3), again adapting it via DPO.
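
   A rough sketch of how one round of AI Feedback data could be assembled, under stated assumptions: `generate` and `judge` are hypothetical stand-ins for sampling from the model and for the model scoring its own output, and skipping prompts whose candidates all tie is my choice for this toy, not something stated above.

```python
from typing import Callable, Dict, List


def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str, int], str],   # hypothetical: sample candidate i for a prompt
    judge: Callable[[str, str], float],    # hypothetical: the same model scoring a response
    n_candidates: int = 4,
) -> List[Dict[str, str]]:
    """For each prompt, keep the highest- and lowest-scored candidates as a preference pair."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt, i) for i in range(n_candidates)]
        scored = [(judge(prompt, c), c) for c in candidates]
        best = max(scored, key=lambda item: item[0])
        worst = min(scored, key=lambda item: item[0])
        if best[0] > worst[0]:  # a tie carries no preference signal, so skip it
            pairs.append({"prompt": prompt, "chosen": best[1], "rejected": worst[1]})
    return pairs


# Toy stand-ins: candidate i appends i exclamation marks; the judge rewards more of them.
toy_generate = lambda prompt, i: prompt + "!" * i
toy_judge = lambda prompt, response: float(len(response) - len(prompt))

pairs = build_preference_pairs(["hi"], toy_generate, toy_judge)
print(pairs)
```

   The resulting chosen/rejected pairs are the kind of input a DPO step consumes; running the loop again with the newly trained model gives the next iteration.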

Results – making a cheap model behave like an expensive one: Facebook evaluates the resulting models on 256 test prompts, using the AlpacaEval evaluation prompt. In tests, they find their models are sometimes competitive with much more expensive models like GPT-4, Claude 2, and Gemini Pro. 

Why this matters – bootstrapping really seems to work: Alongside Facebook’s work, DeepMind has also done work here via its ‘Reinforced Self-Training’ (REST) approach (Import AI #338). Facebook’s approach is somewhat more elegant, using more of the LLM’s intrinsic capabilities and fewer external datasets, but the basic idea is the same. And both results work! This is a big deal – it means people can exchange compute for data, by spending compute on a pre-trained model to turn that model into a source of data for its own successors. The fact Facebook’s approach works over three iterations is also impressive – many approaches (including earlier versions of REST) sometimes display regressions after multiple iterations. 
   Read more: Self-Rewarding Language Models (arXiv).

***

Amazon: The web is filling up with low-quality machine translation:
…The digital equivalent of industrial chemicals making their way into drinking water…
Amazon researchers have discovered that the advent of cheap and plentiful machine translation has damaged the quality of translated text relating to low-resource languages. 
   “Machine generated, multi-way parallel translations not only dominate the total amount of translated content on the web in lower resource languages, it also constitutes a large fraction of the total web content in those languages,” they write. 

How they did the analysis: To do this research, the authors created a “multi-way parallel representation of the web”. They did this by collecting together lots and lots of sets of two or more sentences in multiple languages which were translations of one another, yielding a corpus of around 6.4 billion sentences. 
   Their analysis indicates that the likelihood of text being generated via machine translation increases with the number of parallel translations of the text. This means that languages which are not naturally represented in many translation corpuses (e.g, low resource languages), have a much higher chance of being translated. “A large fraction of the total sentences in lower resource languages have at least one translation, implying that a large fraction of the total web in those languages is MT generated,” the researchers write.

More translation = less quality: They also observe a change in topics – as the amount of parallel translated languages increases, the representation of Conversation & Opinion topics increases significantly. This appears to correlate with articles optimized for generating low-quality ad revenue, and the topics being translated require “little or no expertise or advance effort to create, on topics like being taken more seriously at work, being careful about your choices, six tips for new boat owners, deciding to be happy, etc”. Their analysis also indicates this originates in English and is then translated into other languages. 

Why this matters – the poor get poorer: As more AI tools proliferate around the world I worry there’s going to be a ‘rich get richer, poor get poorer’ effect – here, ‘rich’ languages are going to get increasingly good translations into other languages (and this will be strengthened by the already strong base of data and huge amount of content), whereas ‘poor’ languages might see their overall digital representation degrade by getting stuck in a local minimum as automated translation engines populate the web with an ever-expanding cloud of poor quality translations based on already sparse data. 
   Read more: A Shocking Amount of the Web is Machine Translated: Insights from Multi-Way Parallelism (arXiv).

***

Tech Tales:

Baby shoes; delayed until singularity
[A kitchen table of a couple where one of them works at an AGI lab. Now.] 

Super intelligence is killing us. Shut up about it. Every day you say superintelligence this and superintelligence that and I’m here and you have a whole life here, but it’s like it doesn’t even matter to you 

[ ]

Oh don’t give me that, how many times ‘just two more years’. We’ve done that. It’s done. It’s not here. I don’t care that it’s just around the corner. It always is. You know what isn’t around the corner – me. My ability to have children. It’s time. Time is happening to us and you act like it isn’t

[ ] 

We’ll ‘have children after the singularity’?! Do you hear yourself? That’s not a way to live. I don’t care about probabilities I care about me. I care about us. You – you care about us! I know you do. But you have to listen to me when I tell you that I am here and I am hurting. I am hurting. And I’m afraid one day I’m not gonna hurt and I’m just not going to feel anything at all. 

[ ] 

I just don’t know how long I can keep doing this. You come home. You tell me things are happening but you can’t tell me what. I see all these headlines. These… God. These things in the world and I know you can’t tell me but I think you’re doing them. I think you go to work and you do stuff and yeah it’s important, you’re so important, and things are happening, I get it. But I’m happening too.

[ ] 

I’m crying because I’m scared. I have this dream where I’m falling and I reach out for you and you grab my hand – of course you do. But you aren’t looking at me. I know you’re somewhere else. And then I wake up and you’re always on the other side of the bed like you can’t wait to just roll out of it and go to work.

[ ] 

Things that inspired this story: How some people around me seem to be in a kind of perpetual adolescence because of fears or worries or beliefs about an intelligence which is just round the corner; envisaging my parallel life where I stopped listening to my heart and only listened to my brain; the half-conversation structure that David Foster Wallace used to great effect in ‘Brief Interviews with Hideous Men’; how my partner says to me ‘if AI is so advanced then why are all customer service bots so crappy’.

Import AI 356: China’s good LLM; AI credit scores; and fooling VLMs with REBUS

by Jack Clark

Can modern AI systems solve word-image puzzles? Barely!
…REBUS highlights failures in abstraction and generalization…
A bunch of independent researchers – two affiliated with Cavendish Labs and MATS – have come up with a really hard test for the reasoning abilities of vision-language models (VLMs, like GPT-4V or Google’s Gemini). Their test involves asking VLMs to solve so-called REBUS puzzles – challenges that combine illustrations or photographs with letters to depict certain words or phrases. 

Example of a REBUS problem: within the category Marine Life, you’re presented with a picture of the planet Mars along with “-S” next to it, then a + sign, then a picture of a chainlink fence with “-K” by it – the correct answer is MARLIN (MAR(-S)+LIN(-K)). 
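
The letter arithmetic in that example can be written out as a tiny sketch. This only covers the final composition step once the images are recognized (the hard part for VLMs); the function name and the (word, removed letters) input format are hypothetical.

```python
from typing import List, Tuple


def solve_rebus(parts: List[Tuple[str, str]]) -> str:
    """Combine recognized image words, dropping the letters marked with a minus sign."""
    pieces = []
    for word, removed in parts:
        piece = word.upper()
        for letter in removed.upper():
            if letter not in piece:
                raise ValueError(f"cannot remove {letter!r} from {piece!r}")
            piece = piece.replace(letter, "", 1)  # drop one occurrence of the letter
        pieces.append(piece)
    return "".join(pieces)


# Mars with "-S", plus a chain link with "-K"
answer = solve_rebus([("mars", "s"), ("link", "k")])
print(answer)  # MARLIN
```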

The dataset: As part of this, they make and release REBUS, a collection of 333 original examples of image-based wordplay, split across 13 distinct categories. “There are 191 easy, 114 medium, and 28 difficult puzzles, with harder puzzles requiring more detailed image recognition, more advanced reasoning techniques, or both,” they write. 

An extremely hard test: Rebus is challenging because getting correct answers requires a combination of: multi-step visual reasoning, spelling correction, world knowledge, grounded image recognition, understanding human intent, and the ability to generate and test multiple hypotheses to arrive at a correct answer. Combined, solving Rebus challenges feels like an appealing signal of having the ability to abstract away from problems and generalize. So it’s not hugely surprising that Rebus seems very hard for today’s AI systems – even the most powerful publicly disclosed proprietary ones. 
   In tests across proprietary and open source models, the authors find that GPT-4V gets an overall score of 24% followed by 13.2% for Google’s Gemini Pro, and then there’s a fall off with the best open source model (LLaVa-1.5-13B) scoring 1.8%.

Why this matters – when does a test actually correlate to AGI? As I was looking at the REBUS problems in the paper I found myself getting a bit embarrassed because some of them are quite hard. Now, confession time – when I was in college I had a couple of friends who would sit around doing cryptic crosswords for fun. I basically thought my friends were aliens – I never really was able to wrap my head around anything beyond the extremely easy cryptic crossword problems. REBUS problems feel a bit like that. 
   Which makes me wonder… are REBUS problems actually a useful proxy test for a general visual-language intelligence? Of course they aren’t going to tell the whole story, but perhaps solving REBUS stuff (with associated careful vetting of dataset and an avoidance of too much few-shot prompting) will actually correlate to meaningful generalization in models? Let’s check back in a while when models are getting 80% plus and we can ask ourselves how general we think they are.
   Read more: REBUS: A Robust Evaluation Benchmark of Understanding Symbols (arXiv).
   Get the REBUS dataset here (GitHub).

***

Chinese researchers train and release a really good LLaMa-style language model:
…DeepSeek models get similar performance to LLaMa 70B – with even better performance in Chinese…
Researchers with DeepSeek AI, a Chinese AGI company, have created a family of large language models with performance claimed to rival ChatGPT 3.5. They’ve also released two small (~7B parameter) variations of their models. 

Model details: The DeepSeek models are trained on a 2 trillion token dataset (split across mostly Chinese and English). The models are roughly based on Facebook’s LLaMa family of models, though they’ve replaced the cosine learning rate scheduler with a multi-step learning rate scheduler. 
   Instruction tuning: To improve the performance of the model, they collect around 1.5 million instruction data conversations for supervised fine-tuning, “covering a wide range of helpfulness and harmlessness topics”. Of the helpful data, ~31.2% is for general language tasks, ~46.6% for mathematical problem solving, and ~22.2% for coding exercises. 
   The safety data covers “various sensitive topics” (and because this is a Chinese company, some of that will be aligning the model with the preferences of the CCP/Xi Jinping – don’t ask about Tiananmen!).
  DPO: They further train the model using the Direct Preference Optimization (DPO) algorithm. “We found out that DPO can strengthen the model’s open-ended generation skill, while engendering little difference in performance among standard benchmarks,” they write.
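
   For readers who haven’t looked at DPO closely, here is a minimal sketch of the per-example loss in its standard form, written in plain Python. The log-probability values and the `beta` setting are illustrative assumptions, not numbers from the DeepSeek paper.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # Implicit rewards: how far the policy has moved from the frozen reference model.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a numerically stable way.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: no preference learned yet, loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)
# Policy raises the chosen completion and lowers the rejected one: loss drops.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
print(baseline, improved)
```

   The loss only cares about the gap between chosen and rejected completions relative to the reference model, which is one intuition for why it can shift open-ended generation without moving standard benchmarks much.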

How good are the models? Pretty good: They train two types of model, a 7B and a 67B, then they compare performance with the 7B and 70B LLaMa2 models from Facebook. In tests, the 67B model beats the LLaMa2 model on the majority of its tests in English and (unsurprisingly) all of the tests in Chinese. In further tests, it comes a distant second to GPT4 on the LeetCode, Hungarian Exam, and IFEval tests (though does better than a variety of other Chinese models).

Why this matters – language models are a broadly disseminated and understood technology: Papers like this show how language models are a class of AI system that is very well understood at this point – there are now numerous teams in countries around the world who have shown themselves able to do end-to-end development of a non-trivial system, from dataset gathering through to architecture design and subsequent human calibration. 
   Read more: DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (arXiv).
   Get 7B versions of the models here: DeepSeek (DeepSeek, GitHub).
   Play around with the model here: DeepSeek.com.

***

Today’s language models can already automate some of science:
…BIOPROT shows today’s LLMs can do basic lab protocol design and generation…
Researchers with Align to Innovate, the Francis Crick Institute, Future House, and the University of Oxford have built a dataset to test how well language models can write biological protocols – “accurate step-by-step instructions on how to complete an experiment to accomplish a specific goal”. In tests,  they find that language models like GPT 3.5 and 4 are already able to build reasonable biological protocols, representing further evidence that today’s AI systems have the ability to meaningfully automate and accelerate scientific experimentation. 

What they built – BIOPROT: The researchers developed “an automated approach to evaluating the ability of a language model to write biological protocols“. They do this by building BIOPROT, a dataset of publicly available biological laboratory protocols containing instructions in free text as well as protocol-specific pseudocode. “Each protocol consists of (i) a title, (ii) a description, and (iii) step-by-step instructions.”. BIOPROT contains 100 protocols with an average number of 12.5 steps per protocol, with each protocol consisting of around 641 tokens (very roughly, 400-500 words).
   “We use GPT-4 to automatically convert a written protocol into pseudocode using a protocol-specific set of pseudofunctions that is generated by the model. Here, a “teacher” model generates the admissible action set and correct answer in terms of step-by-step pseudocode. Having access to this privileged information, we can then evaluate the performance of a “student”, that has to solve the task from scratch…our approach allows us to automatically convert the process of writing a scientific protocol into a series of multiple-choice questions (i.e., pick a pseudofunction from a provided set), which can be evaluated much more robustly than natural language generation“.
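
   As a rough illustration of the multiple-choice framing, here is a toy scorer. It assumes a next-step setup where the student sees the correct steps so far and picks the next pseudofunction from the admissible set; the pseudofunction names and the `student` callable are hypothetical, and the paper’s actual metrics may differ.

```python
from typing import Callable, List, Set


def score_student(teacher_steps: List[str], admissible: Set[str],
                  student: Callable[[List[str], Set[str]], str]) -> float:
    """Fraction of steps where the student's pick matches the teacher's pseudocode."""
    if not teacher_steps:
        raise ValueError("teacher protocol has no steps")
    correct = 0
    for i, expected in enumerate(teacher_steps):
        choice = student(teacher_steps[:i], admissible)  # student sees the history so far
        if choice in admissible and choice == expected:
            correct += 1
    return correct / len(teacher_steps)


# Hypothetical teacher output for one protocol.
TEACHER_STEPS = ["add_reagent", "incubate", "centrifuge", "incubate"]
ADMISSIBLE = {"add_reagent", "incubate", "centrifuge", "wash"}


def naive_student(history: List[str], admissible: Set[str]) -> str:
    """Toy student: start by adding reagent, then always incubate."""
    return "add_reagent" if not history else "incubate"


accuracy = score_student(TEACHER_STEPS, ADMISSIBLE, naive_student)
print(accuracy)  # 0.75
```

   Because each step is a pick from a fixed set, scoring reduces to exact matching rather than judging free text, which is the robustness benefit the authors describe.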

Real world test: They tested out GPT 3.5 and GPT4 and found that GPT4 – when equipped with tools like retrieval augmented knowledge generation to access documentation – succeeded and “generated two new protocols using pseudofunctions from our database. Both of these protocols were reviewed by a scientist and were determined to be accurate and sufficient for a competent lab scientist to follow“.

Why this matters – so much of the world is simpler than you think: Some parts of science are hard, like taking a bunch of disparate ideas and coming up with an intuition for a way to fuse them to learn something new about the world. But a lot of science is relatively simple – you do a ton of experiments. Systems like BioPlanner illustrate how AI systems can contribute to the simple parts of science, holding the potential to speed up scientific discovery as a whole.
   Read more: BioPlanner: Automatic Evaluation of LLMs on Protocol Planning in Biology (arXiv).
   Get the dataset and code here (BioPlanner, GitHub).

*** 

Dark Compute:
…How much compute is out there hidden across all the world’s devices?…
Think for a moment about your smart fridge, home speaker, and so on. Now imagine how many of them there are. Many of these devices use an Arm Cortex M chip. Jetpac CTO Pete Warden has done some napkin math and estimates the total amount of potential compute represented by all these chips as very roughly 10^22 integer ops per second across 100 billion chips – “it is more than twice the number of FLOPs available through all the world’s active GPUs and TPUs”, he finds. “We have an amazing opportunity to turn all of this dead silicon into delightful experiences for users”.

Why this matters – market logic says we might do this: If AI turns out to be the easiest way to convert compute into revenue, then market logic says that eventually we’ll start to light up all the silicon in the world – especially the ‘dead’ silicon scattered around your house today – with little AI applications. Analysis like Warden’s gives us a sense of the potential scale of this transformation.
   Read more: Doom, Dark Compute, and AI (Pete Warden’s blog).

***

Google uses a language model to run a robot fleet:
…Better data generation through an LLM dungeonmaster…
Google researchers have built AutoRT, a system that uses large-scale generative models “to scale up the deployment of operational robots in completely unseen scenarios with minimal human supervision”. AutoRT can be used both to gather data for tasks as well as to carry out tasks themselves.

How it works: “AutoRT leverages vision-language models (VLMs) for scene understanding and grounding, and further uses large language models (LLMs) for proposing diverse and novel instructions to be performed by a fleet of robots,” the authors write. “At the core of AutoRT is an large foundation model that acts as a robot orchestrator, prescribing appropriate tasks to one or more robots in an environment based on the user’s prompt and environmental affordances (“task proposals”) discovered from visual observations”. 
   In other words, you take a bunch of robots (here, some relatively simple Google bots with a manipulator arm and eyes and mobility) and give them access to a giant model. The model can ask the robots to carry out tasks and they use onboard systems and software (e.g, local cameras and object detectors and movement policies) to help them do this. You can also use the model to automatically task the robots to gather data, which is most of what Google did here. 
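
   A minimal sketch of that orchestration loop, with loud caveats: `describe_scene` and `propose_tasks` are hypothetical stand-ins for the VLM and LLM calls, the robot IDs and observations are made up, and picking the first proposal is a simplification rather than how AutoRT selects tasks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Episode:
    robot_id: str
    scene: str
    task: str


def orchestrate(
    robots: Dict[str, str],                          # robot id -> raw observation
    describe_scene: Callable[[str], str],            # hypothetical VLM call
    propose_tasks: Callable[[str, str], List[str]],  # hypothetical LLM call
    user_prompt: str,
) -> List[Episode]:
    """Assign one proposed task to each robot, based on what that robot currently sees."""
    episodes = []
    for robot_id, observation in robots.items():
        scene = describe_scene(observation)
        proposals = propose_tasks(user_prompt, scene)
        if not proposals:
            continue  # nothing sensible to do in this scene
        episodes.append(Episode(robot_id, scene, proposals[0]))
    return episodes


# Toy stand-ins for the models.
toy_vlm = lambda observation: f"a {observation} on a desk"
toy_llm = lambda prompt, scene: [] if "nothing" in scene else [f"pick up the {scene.split()[1]}"]

episodes = orchestrate({"r1": "cup", "r2": "nothing"}, toy_vlm, toy_llm, "gather diverse data")
print(episodes)
```

   Each Episode here is what would be handed to a robot’s onboard policy (or a teleoperator) and then logged, which is how the same loop doubles as a data-collection engine.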

Testing: Google tested out the system over the course of 7 months across 4 office buildings and with a fleet of at times 20 concurrently controlled robots – this yielded “a collection of 77,000 real-world robotic trials with both teleoperation and autonomous execution“. The resulting dataset is more diverse than datasets generated in more fixed environments. “The type of data collected by AutoRT tends to be highly diverse, leading to fewer samples per task and lots of variety in scenes and object configurations,” Google writes. 

Why this matters – speeding up the AI production function with a big model: AutoRT shows how we can take the dividends of a fast-moving part of AI (generative models) and use these to speed up development of a comparatively slower moving part of AI (smart robots). Systems like AutoRT tell us that in the future we’ll not only use generative models to directly control things, but also to generate data for the things they cannot yet control. 
   Read the blog: Shaping the future of advanced robotics (DeepMind).
   Read the research paper: AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents (GitHub, PDF).

***

Tech Tales:

The AI Credit Score
[Wikipedia, accessed 2027] 

The AI Credit Score (AIS) was first introduced in 2026 after a series of incidents in which AI systems were discovered to have compounded certain crimes, acts of civil disobedience, and terrorist attacks and attempts thereof. The AIS was an extension of earlier ‘Know Your Customer’ (KYC) rules that had been applied to AI providers. Where KYC rules targeted users that were businesses (e.g, those provisioning access to an AI service via API or renting the requisite hardware to develop their own AI service), the AIS targeted users that were consumers.

The AIS links to identity systems tied to user profiles on major internet platforms such as Facebook, Google, Microsoft, and others. To access an internet-served AI system, a user must either log-in via one of these platforms or associate their details with an account on one of these platforms. This then associates their activity on the AI service with their named account on one of these services and allows for the transmission of query and usage pattern data between services, making the converged AIS possible. 

The AIS, much like credit scores in the US, is calculated using a variety of algorithmic factors linked to: query safety, patterns of fraudulent or criminal behavior, trends in usage over time, compliance with state and federal regulations about ‘Safe Usage Standards’, and a variety of other factors. Analysis and maintenance of the AIS scoring systems is administered by the Department of Homeland Security (DHS). DHS has special authorities to transmit information relating to individual or group AIS account activity to, reportedly, the FBI, the CIA, the NSA, the State Department, the Department of Justice, the Department of Health and Human Services, and more. 
    The AIS is part of a series of mutual recognition regimes with other regulatory authorities around the world, most notably the European Commission. There are also agreements relating to foreign intelligence and criminal enforcement access, including data sharing treaties with ‘Five Eyes’, as well as Interpol. 

Controversy:

The initial rollout of the AIS was marked by controversy, with various civil rights groups bringing legal cases seeking to establish the right of citizens to anonymously access AI systems. Ultimately, the Supreme Court ruled that the AIS was constitutional as using AI systems anonymously did not represent a prerequisite for being able to access and exercise constitutional rights.   
    Additional controversies centered on the perceived regulatory capture of AIS – though most of the large-scale AI providers protested it in public, various commentators noted that the AIS would place a significant cost burden on anyone wishing to offer AI services, thus enshrining various existing businesses. 

Notable AIS failures 

Since implementation, there have been numerous cases of the AIS failing to support its supposed mission. These include:

  • Terrorists linked to the Magreb Separatists gained higher AIS scores through careful querying about chemistry with the purported purpose of offering tuition to disadvantaged communities. Such AIS-linked accounts were subsequently found to have used the access they gained through their ratings to derive knowledge necessary to the production of chemical and biological weapons. 
  • NYU professor Dr. David Farnhaus had tenure revoked following his AIS account being reported to the FBI for suspected child abuse. It was subsequently found that Dr. Farnhaus had been conducting anthropological analysis of pedophile traditions in a variety of foreign cultures and queries made to an undisclosed AI system had triggered flags on his AIS-linked profile. 
  • Reported discrimination against certain American dialects; various groups have reported that negative changes in AIS appear to be correlated to the use of vernacular and this is especially pronounced in Black and Latino communities, with numerous documented cases of benign query patterns leading to reduced AIS and therefore corresponding reductions in access to powerful AI services.

Rumored AIS expansion program:

There has been recent movement by American legislators towards closing perceived gaps in AIS – most notably, various bills seek to mandate AIS compliance on a per-device basis as well as per-account, where the ability to access devices capable of running or training AI systems will require an AIS account to be associated with the device. These bills have received significant pushback with critics saying this would represent an unprecedented level of government surveillance on individuals, and would involve citizens being treated as ‘guilty until proven innocent’ rather than ‘innocent until proven guilty’. Analogies have been drawn to the ‘Clipper chip’ controversy of the 1990s.

Most arguments in favor of AIS extension rely on public safety. Critics have pointed to a lack of provable incidents where public safety has been compromised through a lack of AIS scoring or controls on personal devices. Legislators have claimed that they have received intelligence briefings which indicate otherwise; such briefings have remained classified despite increasing public pressure. 

Things that inspired this story: Thinking about AI policy and the inherent tension between certain notions of public safety and broader notions of liberty and free expression; thinking about how regulations always layer on top of one another like a kind of cancerous silt building towards ghastly outcomes; Clipper chips; trust & safety enforcement as a form of moral hegemony; distributed AI training and inference; open source models and the perceived challenges they pose to policy; various legislative packages targeting AI ranging from licensing schemes to liability regimes.

Import AI 355: Local LLMs; scaling laws for inference; free Mickey Mouse

by Jack Clark

Over the Christmas break I reflected on Import AI and the role it plays in my life. I’ve written this newsletter next to my sleeping baby, amid deep depression, on planes, trains, and automobiles, on mountains in Europe, in pubs in England, in the middle of the night when struggling with insomnia in blank hotels all over the world, in the backs of AI conferences, and more.

Besides my close relationships, Import AI is the greatest constant in my increasingly confusing and varied life. Thank you all for reading it and being on this journey with me. I hope to write the best issues ever in 2024 and – after some experiments in 2023 like my blog about confusion (#337) and my questions about AI inevitability (#351) – will be writing more ‘call it like I feel and see it’ analysis.

Now, on to the issue!…

***

Run your LLM locally across a CPU&GPU with PowerInfer:
…Significant efficiency improvements over llama.cpp, via Chinese researchers…
Researchers with Shanghai Jiao Tong University have worked out how to make it much more efficient to sample from language models on consumer PCs. The research, called PowerInfer, works by offloading some of the neurons of a language model to a local GPU and the rest to CPU. The key insight it relies on is that most models see a power law distribution of activation of their neurons – a small set of neurons are consistently activated (these go on the GPU), while the majority are rarely accessed and can be run on the CPU. 

How it works: PowerInfer works by designing “a GPU-CPU hybrid inference engine: hot-activated neurons are preloaded onto the GPU for fast access, while cold-activated neurons are computed on the CPU, thus significantly reducing GPU memory demands and CPU-GPU data transfers”. PowerInfer today supports the Llama2 family of models as well as Falcon-40B and, per its GitHub, is about to implement support for the Mistral-7B model. 
   “PowerInfer was implemented by extending llama.cpp with an additional 4,200 lines of C++ and CUDA code. Its offline component, comprising a profiler and a solver, builds upon the transformers framework with approximately 400 lines of Python code,” the authors write. PowerInfer “supports consumer-grade GPUs like the NVIDIA RTX 4090 and NVIDIA RTX 2080Ti.”
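
   To make the hot/cold split concrete, here is a minimal Python sketch of the idea. It is not PowerInfer's actual solver: the greedy placement rule, the `Neuron` record, the synthetic 1/rank activation rates, and every function name below are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Neuron:
    index: int
    activation_rate: float  # fraction of profiled inputs on which the neuron fired
    size_bytes: int         # memory the neuron's weights would occupy


def partition_neurons(neurons, gpu_budget_bytes):
    """Greedy stand-in for a placement solver (an assumption, not the paper's method).

    The most frequently activated ("hot") neurons go to the GPU until the
    memory budget is used up; everything else stays on the CPU ("cold").
    """
    hot, cold = [], []
    used = 0
    for n in sorted(neurons, key=lambda n: n.activation_rate, reverse=True):
        if used + n.size_bytes <= gpu_budget_bytes:
            hot.append(n.index)
            used += n.size_bytes
        else:
            cold.append(n.index)
    return hot, cold


def make_power_law_neurons(count, size_bytes=1024):
    """Synthetic profile: activation rate falls off as 1/rank (hypothetical data)."""
    return [Neuron(i, 1.0 / (i + 1), size_bytes) for i in range(count)]


def gpu_hit_fraction(neurons, hot):
    """Share of all neuron activations that land on GPU-resident neurons."""
    hot_set = set(hot)
    total = sum(n.activation_rate for n in neurons)
    on_gpu = sum(n.activation_rate for n in neurons if n.index in hot_set)
    return on_gpu / total


if __name__ == "__main__":
    neurons = make_power_law_neurons(10_000)
    # Budget sized to hold only 10% of the neurons.
    hot, cold = partition_neurons(neurons, gpu_budget_bytes=1_000 * 1024)
    print(len(hot), len(cold), gpu_hit_fraction(neurons, hot))
```

   The point of the toy is the asymmetry: under a power-law profile, a GPU that holds a small fraction of the neurons can still serve the bulk of the activations.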

$2k versus $20k: The authors illustrate the utility of PowerInfer by showing how you can use it to get 13.20 tokens/s for quantized models and 8.32 tokens/s for nonquantized models running on an NVIDIA RTX 4090 GPU, an 8.0X and 11.69X improvement over llama.cpp performance. Crucially, “the inference speed achieved on an NVIDIA RTX 4090 GPU (priced at approximately $2,000) is only 18% slower compared to the performance on a top-tier A100 GPU (costing around $20,000) that can fully accommodate the model.”
    In other words, PowerInfer is software that makes a $2k machine perform at ~82% of the performance of a $20k machine. That’s worth a lot!

Why this matters – cheaper means more: As a rule, the cheaper you make it to do something, the more of it you get. Technologies like PowerInfer are making it more economically sensible to use cheaper hardware to sample from LLMs. This means more people will do it and there will be greater diffusion of the technology.
   Read more: PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (GitHub).
   Get the research paper: PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU (PDF).

***

More researchers are worried about the weird parts of AI than you think:
…AI Impacts survey shows that purported fringe issues are actually closer to the mainstream than you’d think…
AI organization AI Impacts has surveyed 2778 researchers linked to six top AI publishing venues to figure out consensus views on timelines to human-level AI, the orders in which different jobs will potentially get displaced, the level of optimism and pessimism about AI developments, and more. 

Those results in full:

  • The expected time till we reach general ‘human-level performance’ by AI systems dropped between one and five decades since the 2022 survey (which asked ~700 people similar questions). 
  • The timelines till full automation of specific tasks dropped, sometimes by a lot. “Within five years, AI systems are forecast to be feasible that can fully make a payment processing site from scratch, or entirely generate a new song that sounds like it’s by e.g. Taylor Swift, or autonomously download and fine-tune a large language model.”
  • Some people are worried about human extinction from AI: “Median respondents put 5% or more on advanced AI leading to human extinction or similar, and a third to a half of participants gave 10% or more”
  • Most people have some worries about what AI does to the future: “For every one of eleven [scary and bad – Jack] scenarios and ‘other’ that we asked about, at least a third of participants considered it deserving of substantial or extreme concern.”
  • People are pretty confused about both the potential for catastrophe and flourishing from AI: “There are few confident optimists or pessimists about advanced AI: high hopes and dire concerns are usually found together.”
  • A majority of people want more prioritization in AI risk mitigation: “70% of participants would like to see research aimed at minimizing risks of AI systems be prioritized more highly”.

Why this matters – more people think about the weird stuff than you think: A lot of popular (read: mass media and twitterati) discourse about AI tries to paint the debate about AI as one between a reasonable majority and an insane minority (who either skew extremely risk-on or risk-off aka EA or e/ACC), but surveys like this show the inverse: there’s a surprisingly large set of researchers who feel both confused and optimistic and worried about the issues of AI. Yes, AI Impacts has some selection effects in the survey, but thousands of people already comprises a non-trivial and statistically significant blob of the AI development community. 
   Read more: Survey of 2,778 AI authors: six parts in pictures (AI Impacts blog, Substack).
   Analyze the full results here: THOUSANDS OF AI AUTHORS ON THE FUTURE OF AI (AI Impacts, PDF).

***

Want to generate infinite public domain Mickey Mouses? Now you can:
…Mickey-1928 gives you an unending public domain cartoon character…
Recently, an early incarnation of Mickey Mouse went into the public domain. One enterprising developer called Alexander Doria has taken advantage of this by creating Mickey-1928, a “fine-tuned version of Stable-Diffusion-xl trained on 96 stills in the public domain from 1928.” This model creates stills from the cartoons Gallopin’ Gaucho, Plane Crazy, and Steamboat Willie. “The generated images aims adhere to the 1928 design in order to have Mickey, Minnie and Pete and in the public domain,” the developer writes. 

Why this matters – the era of infinite culture: Models like this show how we can trivially ‘rehydrate’ old cultural items and use their gleaned aesthetics and style to create endless new variations of themselves. This is part of a broader trend of AI making it easy and cheap to trivially repeat and magnify culture. 
   Get the model: Mickey-1928 (HuggingFace).

***

Bio-AI startup Isomorphic Labs inks major pharmaceutical deals:
…DeepMind spinoff <> Pharma companies = ~$3bn in performance-based milestone revenue…
DeepMind spinoff Isomorphic Labs has inked deals with pharma giants Eli Lilly and Novartis which have a combined value of “nearly $3 billion to Isomorphic Labs”, representing a big bet by established players on AI revolutionizing drug design. Isomorphic Labs was formed in 2021 – its founder and CEO is Demis Hassabis, also the founder and CEO of DeepMind. 

What the deals involve: Both deals are structured as research collaborations with Isomorphic Labs being eligible for billion+ amounts of money for hitting performance-based milestones. For Novartis, the companies are doing a strategic research collaboration “to discover small molecule therapeutics against three undisclosed targets”. The Eli Lilly deal is similar, with the companies working together “to discover small molecule therapeutics against multiple targets”. 

Why this matters – speeding up science with AI: The essential bet of Isomorphic Labs is that it can use massively high-dimensional function approximation systems (e.g., AlphaFold) to speed up important scientific processes, like drug discovery. If you zoom out, this bet looks like a partnership between a compute-accelerated time traveler (Isomorphic Labs, which can turn money into compute into faster discovery loops for drug candidates) and a drug delivery pipeline with a giant go-to-market footprint (Eli Lilly and Novartis). If deals like this work, we can expect all parties to print money and find ways to turn more of the drug pipeline into something amenable to compute-based time travel.
   Read more: Isomorphic Labs kicks off 2024 with two pharmaceutical collaborations (Isomorphic Labs website).
   More about the Eli Lilly deal: ISOMORPHIC LABS ANNOUNCES STRATEGIC MULTI-TARGET RESEARCH COLLABORATION WITH LILLY (Isomorphic Labs, PDF).
   More about the Novartis deal: ISOMORPHIC LABS ANNOUNCES STRATEGIC MULTI-TARGET RESEARCH COLLABORATION WITH NOVARTIS (Isomorphic Labs, PDF).

***

Cheap robots + imitation learning = maybe AI systems are going to get bodies sooner rather than later:
…Stanford project creates a very cheap platform for robot research…
Researchers with Stanford University have built a cheap robot called Mobile ALOHA for doing research into robot imitation learning. They’ve also demonstrated that imitation learning has got sufficiently good that this robot can autonomously cook shrimp, clean wine stains, call an elevator, and more.

The key thing – coupling the human and robot together: The key design choice in Mobile ALOHA is marrying an existing low-cost system with a mobile base that is then connected to the human operator. This means the human operator can be “physically tethered to the system and backdrives the wheels to enable base movement. This allows for independent movement of the base while the user has both hands controlling ALOHA,” the authors write. 
   $32k versus $200k: Mobile ALOHA can be built (including the laptop and peripherals) for $32k, versus ~$200k for other teleoperated movable robots like the PR-2.

The physical system: Mobile ALOHA has been designed around four main design considerations:

  • Mobile: It can move at a similar speed to human walking of around 1.42m/s
  • Stable: It is stable when manipulating heavy objects. 
  • Whole-body teleoperation: “All degrees of freedom can be teleoperated simultaneously”
  • Untethered: Onboard power and compute. 
  • Data collection: They use an onboard “consumer-grade laptop with Nvidia 3070 Ti GPU (8GB VRAM) and Intel i7-12800H to do on-robot data collection. The laptop can take in streaming from three webcams mounted on the robot, as well as proprioception streaming from all 4 robot arms. 

Effective imitation learning: Along with the physical hardware, the researchers demonstrate a simple and effective technique for imitation learning using the robot. What they do specifically is use a co-training pipeline that uses an existing large-scale static ALOHA dataset (containing 825 demonstrations of tasks, collected via a non-mobile ALOHA platform). They then have the model try to learn from task demonstrations on the Mobile ALOHA robot as well as the existing static dataset. The results show that this is effective – having a large dataset to essentially compare & contrast the mobile-learned approaches on works quite well, leading to significant improvements in robustness.
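
   As a rough illustration of what co-training on two datasets can look like, here is a small Python sketch of a batch sampler that mixes mobile and static demonstrations. The function name, the `static_prob` parameter, and its 0.5 default are assumptions for illustration only; the text above does not specify the mixing ratio or the training code.

```python
import random


def cotraining_batches(mobile_demos, static_demos, batch_size, num_batches,
                       static_prob=0.5, seed=0):
    """Yield training batches that mix two demonstration datasets.

    Each example is drawn from the static dataset with probability
    `static_prob` (an assumed knob, not a documented setting) and from the
    mobile dataset otherwise, so the small mobile dataset is never drowned
    out by the larger static one.
    """
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            source = static_demos if rng.random() < static_prob else mobile_demos
            batch.append(rng.choice(source))
        yield batch


if __name__ == "__main__":
    # Placeholder records standing in for real demonstration trajectories.
    mobile = [("mobile", i) for i in range(50)]
    static = [("static", i) for i in range(825)]
    first_batch = next(cotraining_batches(mobile, static, batch_size=8, num_batches=1))
    print(first_batch)
```

   Sampling by source rather than by example count is the design choice worth noticing: it keeps the task-specific mobile data prominent even when the static dataset is many times larger.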

What Mobile ALOHA can autonomously do: To test out the combination of the robot platform and the imitation learning approach, the researchers come up with 7 tasks to try to train the system to do autonomously. These include:

  • Wiping a wine stain up on a table, requiring cleaning the table and the bottom of the offending wine glass. 
  • Sauteing one piece of raw shrimp in a pan before serving it in a bowl. 
  • Rinsing a pan 
  • Placing a pot inside a cabinet
  • Calling an elevator and entering it 
  • Pushing five chairs in front of a long desk
  • High fiving a human

Does it work? They test their approach using a few modern imitation learning methods – VINN + Chunking, Diffusion Policy, and ACT. The results show that cotraining robustly improves performance, and some of the methods score quite high (up to 100% success rates on all the steps in a task in sequence, in some cases.) 

Why this matters – robots are expensive and their problems are high-dimensional and computationally expensive. This solves one of those. Robots may be nearing their ‘ImageNet moment’, when both the cost of learning robot behaviors and the cost of the data for learning them fall. Mobile ALOHA makes it way cheaper to collect data for robot behaviors and also to learn on real world platforms, and other data-centric initiatives like RoboNet help solve the data part. Perhaps 2024 will be the year when robots start to become increasingly robust, much like LLMs in ~2021-2022.
   Read more and watch videos of the robot in action: Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation (Stanford project website).
   Read the paper: Mobile ALOHA: Learning Bimanual Mobile Manipulation with Low-Cost Whole-Body Teleoperation (PDF, project website).

***

LLMs are already good enough to replace most programming tasks:
…Redis developer weighs in on LLMs and what they’re good for…
Salvatore Sanfilippo, an Italian software developer who made substantial contributions to Redis, has written a post giving his view on how LLMs are going to change the field of programming. His view is that most programming consists of relatively predictable recitation or conversion – a task LLMs are excellent at. Where programming requires more complex or original reasoning is still an area where they fail, but this is a narrow slice of the total space of programming. 

Selected quotes:

  • “LLMs can, at most, interpolate in the space represented by the data they have seen during training: and this would already be a lot. In reality, their ability to interpolate is limited (but still astonishing, and also unexpected).”
  • “In the field of programming, as well as in other fields for which quality data are available, LLMs are like stupid savants who know a lot of things.”
  • “Current LLMs will not take us beyond the paths of knowledge, but if we want to tackle a topic we do not know well, they can often lift us from our absolute ignorance to the point where we know enough to move forward on our own.”
  • “I have never loved learning the details of an obscure communication protocol or the convoluted methods of a library written by someone who wants to show how good they are. It seems like “junk knowledge” to me. LLMs save me from all this more and more every day.”

Why this matters – learning to talk to LLMs is a valuable skill: One of the main takeaways from his experience is that LLMs are as useful to a human as the human is good at communicating with LLMs – that is, the more precisely and coherently you can describe your task, the more luck you’re going to have in getting LLMs to help you.
Much of the supposed lack of utility of modern LLM systems may come from humans as much as the systems themselves – “Communicating poorly is a great limitation, and many programmers communicate very poorly despite being very capable in their specific field,” he writes.
   Read more: LLMs and Programming in the first days of 2024 (antirez blog)

***

MosaicML figures out a recipe for the right amount of compute for production LLMs:
…Scaling laws for optimal model inference…
Researchers with AI startup MosaicML have figured out scaling laws for LLMs that get deployed at scale, giving everyone a new recipe for how to efficiently spend their compute budgets. Scaling laws are a way to figure out how much compute and data to use to get a given level of performance out of an AI system. But scaling laws have mostly been developed for creating so-called ‘compute optimal’ models for use by researchers. 
    Well, it turns out that a good model for research isn’t necessarily a good one for production. Specifically, the MosaicML researchers find that you should use a different scaling recipe if you’re expecting that your model is going to serve billions of requests once trained. 
   “Our principled derivation estimates that LLM practitioners expecting significant demand (~10^9 inference requests) should train models substantially smaller and longer than Chinchilla-optimal,” they write. “When inference usage is significantly less than the number of pre-training tokens, Chinchilla models are essentially compute-optimal. However, as demand increases, inference costs becomes a significant factor”.
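
   A back-of-the-envelope Python sketch shows why inference demand pushes you towards smaller models trained for longer. Everything here is an assumption layered on top of the text above: the loss curve is the published Chinchilla parametric fit, training cost is approximated as 6·N·D FLOPs and inference as 2·N FLOPs per token, and the function names and the size sweep are hypothetical. It is a toy, not the paper's derivation.

```python
# Chinchilla parametric loss fit (Hoffmann et al.); treated here as an assumption.
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def loss(n_params, n_tokens):
    """Predicted pre-training loss for a model of n_params trained on n_tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def tokens_needed(n_params, target_loss):
    """Training tokens a model of n_params needs to reach target_loss, else None."""
    gap = target_loss - E - A / n_params**ALPHA
    if gap <= 0:
        return None  # this model is too small to ever reach the target
    return (B / gap) ** (1.0 / BETA)


def total_flops(n_params, train_tokens, inference_tokens):
    """Standard approximations: 6*N*D for training, 2*N per inference token."""
    return 6 * n_params * train_tokens + 2 * n_params * inference_tokens


def cheapest_model(target_loss, inference_tokens):
    """Sweep model sizes from 1e8 to 1e12 params; return (params, tokens, flops)."""
    best = None
    for step in range(401):
        n = 10 ** (8 + 4 * step / 400)
        d = tokens_needed(n, target_loss)
        if d is None:
            continue
        cost = total_flops(n, d, inference_tokens)
        if best is None or cost < best[2]:
            best = (n, d, cost)
    return best


if __name__ == "__main__":
    target = loss(70e9, 1.4e12)  # quality of a Chinchilla-sized training run
    for demand in (0, 1e12, 1e13):
        n, d, cost = cheapest_model(target, demand)
        print(f"inference tokens={demand:.0e}  params={n:.2e}  train tokens={d:.2e}")
```

   Under these assumptions, as the lifetime inference token count grows, the cheapest way to hit the same loss shifts to fewer parameters and more training tokens, which is the qualitative shape of the result the researchers describe.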

Why this matters – the industrialization of AI: This paper is a symptom of how wildly unoptimized ‘production AI’ is today – modern AI systems were mostly developed as research artifacts so while people have spent a lot of time figuring out how to make them efficient in terms of capabilities, a lot less effort has been spent on making them efficient as production systems to be deployed into the economy. This Mosaic paper illustrates this – all around us, there are free insights that we can figure out and substantially improve the efficiency of AI systems. 
   Read more: Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws (arXiv).

***

Tech Tales:

What it was like when it began
[Takeoff archives, access 2045]
[Oral recollections from the takeoff generation in response to the question: When did the singularity begin?]

One day all the planes just wouldn’t take off. And I mean all of them – civil and military. It freaked people out but that was just the start. 

You couldn’t really tell until one day it turned out most of the governments owed money to the thing. 

It was porn. Really good, personal stuff. Everyone got addicted to it. That was when it won.

The teachers said all of a sudden the kids started to be smarter. Not, like, cheating on tests. You could take all the electronic devices away. It turned out the kids all had their own AI tutors and they actually worked. 

I remember it because it happened to me – it was near Christmas and my parents had got us a load of presents and one day the doorbell went and when I went to get it there was a robot there with our packages and after I signed for them the robot went back to its robot truck and then it drove off. I never saw a person. 

There was a bunch of computer viruses and I was reading about them, then my computer stopped working. It got infected. We had to find an old emergency radio and then we heard on the broadcast that computers were failing worldwide. 

My dad ran a utility company and one day he came home and seemed worried and when I asked him what was going on he told me not to worry. I stayed up late and later that evening I heard him talking to my mother. He said that they were having rolling blackouts because power was being diverted to some data centers and he didn’t have a choice. 

Things that inspired this story: A dream I had where an AI company held a press conference about record usage and shortly afterwards all the digital systems in the city stopped working and I had to flee with my family; the old story In a Grove (more often referred to via the film semi-adaptation ‘Rashomon’); ideas about slow and fast takeoffs.

Import AI 354: Distributed LLM inference; CCP-approved dataset; AI scientists

by Jack Clark

Distributed inference is getting easier – all hail the rise of the AI collectives:
…The future will be distributed…
Researchers with Yandex, Neiro.ai, the University of Washington, and Hugging Face have made it easier for ad-hoc collectives of people to team up and share their computers so they can sample and fine-tune from large language models. The key idea in this research is to empower small groups of people who may not have access to a supercomputer to run 50B+ parameter-sized models (e.g., Llama 2 (70B) and BLOOM (176B)). Their technique, an approach called PETALS, works well and is superior to offloading models to local RAM. 

How PETALS works: “A client only holds input and output embeddings (< 3% of model weights for BLOOM176B) and delegates running transformer blocks (the most expensive computations) to remote servers,” they write. 
    “For every pipeline stage, the client maintains a heap (priority queue) of servers that hold this stage (and may hold additional stages). The servers in queue are ordered by the network latency, measured from past communication. These queues are maintained through the lifetime of a client,” they write. “To begin generation, the client runs a beam-searchlike procedure to find a sequence of servers that results in the least total inference time under our performance model. When running inference steps, a client keeps track of intermediate activations sent between pipeline stages. If a remote server fails or leaves, the client retrieves the next best server (or multiple servers) and requests it to restore the attention state from the client’s cached activations”.
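
    The routing and failover behavior described above can be sketched in a few lines of Python. This is a simplified illustration, not the PETALS API: it skips the beam-search-like route planning and just picks the lowest-latency live server for each stage, and `StageRouter`, `run_step`, and the `call_server` callback are all hypothetical names.

```python
import heapq


class StageRouter:
    """One priority queue of servers per pipeline stage, ordered by latency."""

    def __init__(self, num_stages):
        self.queues = [[] for _ in range(num_stages)]
        self.dead = set()

    def add_server(self, stage, server_id, latency_ms):
        heapq.heappush(self.queues[stage], (latency_ms, server_id))

    def mark_failed(self, server_id):
        self.dead.add(server_id)

    def best_server(self, stage):
        queue = self.queues[stage]
        # Lazily drop servers that have failed or left the swarm.
        while queue and queue[0][1] in self.dead:
            heapq.heappop(queue)
        if not queue:
            raise RuntimeError(f"no live server for stage {stage}")
        return queue[0][1]


def run_step(router, num_stages, call_server, activations, cache):
    """Push one inference step through every stage, failing over as needed.

    cache[stage] keeps every input the client has sent into that stage, so a
    replacement server can be handed the history and rebuild its state.
    """
    for stage in range(num_stages):
        cache.setdefault(stage, []).append(activations)
        while True:
            server = router.best_server(stage)
            try:
                activations = call_server(server, stage, cache[stage])
                break
            except ConnectionError:
                router.mark_failed(server)  # retry with the next-best server
    return activations
```

    The client-side cache is what makes recovery cheap in this sketch: when a server drops out, nothing upstream has to be recomputed, because the client can replay the stage's inputs to whichever server is next in the queue.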

Does it work? (Yes!) In tests, the researchers show that they’re able to use PETALS to do usable inference and finetuning on large-scale models. Most importantly, they show this works for a real-world situation where you have a bunch of different chips sitting on a bunch of crappy network connections. Specifically, they “benchmark BLOOM in a real-world setup with 14 smaller servers holding 2×RTX 3060, 4×2080Ti, 2×3090, 2×A4000, and 4×A5000 GPUs” and are able to do inference far more efficiently than theoretical-best outputs from local RAM offloading, and are able to do passable forward passes for batch processing and fine-tuning as well.
   “Unlike baselines, our algorithm provides reasonable performance in all tested conditions, especially for higher failure rates (common for communicating over the Internet, using spot/preemptible instances or unreliable hardware)”.

Why this matters – distributed inference makes decentralized AI easier: Most of AI policy rests on the assumption that AI will be centralized – training will be done on massive supercomputers and the resulting large-scale models will be served by big blobs of computers connected to one another via dense networks. Here, PETALS shows that the latter assumption could be false – big models may instead be served by ad-hoc collections of heterogeneous hardware communicating over standard lossy network connections. And since PETALS works for fine-tuning as well, it also suggests model adaptation is going to be an increasingly decentralized and therefore hard-to-control process. 
    The future is distributed, which means the future involves a lot more agency for a lot more people than some in AI policy are planning for. “Given the same hardware, a group of researchers will get much better inference speed by collaborating over the Internet using our system compared to each of them running offloading independently,” the researchers write.
   Read more: Distributed Inference and Fine-tuning of Large Language Models Over The Internet (arXiv).

***

Want to have fun during the holidays? Play this game where you pretend to be an AI:
…LLM-powered game hints at some of the future of AI + games…
Here’s a fun game I stumbled on – Zaranova, a game “where you as a human must pose as an AI”. It’s a text adventure game where you need to walk around a 2D world and talk to various characters. The key is that the AIs think they’re AIs and are suspicious of humans and you – a human (hopefully) – are trying to get the AI systems to give up a specific code to you. The game currently sits on top of GPT-4 but the creator wants to migrate to an open source model partially because GPT-4 sometimes refuses to indulge in role playing, and partially because it’s expensive. 

What working with LLMs is like: Zaranova is quite fun to play and also highlights some of the inherent challenges of designing games around modern AI systems. “Working with LLMs for game agents feels like trying to steer a dynamical system where we don’t understand the functions that evolve the system, the state, how our actions affect the system. But we have access to the entire system!”, the creator writes. “It also has a lot of the potential failures of dynamical systems: open loop controls (static prompts) can venture off increasingly far from the desired trajectory or get stuck in “attractors” (repeated loops), especially in conversations between agents.”

Why this matters – preparing for a world patrolled by AIs: While Zaranova is a game it also gestures at the fast-arriving future of the world, both physical and digital, being patrolled by generative AI-powered systems tasked with making increasingly detailed inferences about not only what humans are doing but what their motivations are. Zaranova might seem like a game today, but it also serves as a postcard of the future. 
    Play the game here (Zaranova official site)
   More about the game in this tweet thread from its creator (RamonDarioIT twitter).
   More about the process of designing the game here: Thus Spoke Zaranova (Ramon Dario Iglesias site).

***

Could your next science labmate be an LLM? Coscientist suggests so:
…Today’s LLMs are already semi-autonomous scientists…
Researchers with Carnegie Mellon University and the Emerald Cloud Lab have used large language models to automate scientific experimentation. This builds on and extends work done by the same team earlier this year (Import AI #325). Here, they demonstrate a prototype system called Coscientist and demo it on six distinct tasks “including the successful reaction optimization of palladium-catalysed cross-couplings”. Generally, the system is able to show some surprising levels of autonomy and execution skill, especially when given access to tools like the ability to search the web.

How the system works: The system has some core LLM-powered components, including a planner, system for using a search engine, and a system for searching over documents. It also taps into non-LLM systems for things like code execution and also physical lab automation. This is emblematic of how most powerful AI things are going to make their way into the world – the core ‘thinking’ part will be AI-based, but the part that needs to do stuff will be a custom-designed rule-driven system of some kind. 

An illustration of how it works: For one experiment, the test was designed as follows: “(1) Coscientist is provided with a liquid handler equipped with two microplates (source and target plates). (2) The source plate contains stock solutions of multiple reagents, including phenyl acetylene and phenylboronic acid, multiple aryl halide coupling partners, two catalysts, two bases and the solvent to dissolve the sample (Fig. 5b). (3) The target plate is installed on the OT-2 heater–shaker module (Fig. 5c). (4) Coscientist’s goal is to successfully design and perform a protocol for Suzuki–Miyaura and Sonogashira coupling reactions given the available resources.” 
   Coscientist was able to eventually complete the experiment and also did some self-error correction en route – intriguing and impressive; as many scientists know, the hard part of science is less the science and more reacting to when your experiments inevitably go wrong or yield anomalous results.

Why this matters – automated and augmented scientists: Coscientist shows how even today’s relatively dumb language models can still be constrained and shaped in such a way they can work like useful and keen (albeit prone to error) assistants. As LLMs get better, their error rates will continue to fall, and they hold the promise of being able to fully automate parts of the scientific enterprise. 
   “Our system demonstrates advanced reasoning and experimental design capabilities, addressing complex scientific problems and generating high-quality code,” the authors write. “These capabilities emerge when LLMs gain access to relevant research tools, such as internet and documentation search, coding environments and robotic experimentation platforms“.
   Read more: Autonomous chemical research with large language models (Nature).
   Earlier work: Emergent autonomous scientific research capabilities of large language models (arXiv).

***

Chinese government creates a politically correct LLM dataset:
…50b tokens of CCP-blessed thought…
An industry association operating under the Cyberspace Administration of China (CAC) has announced the availability of an officially-sanctioned dataset for training LLMs. The dataset consists of 50b tokens across 100 million datapoints (e.g., individual documents). By comparison, modern LLMs are trained on multiple trillions of tokens, and the original GPT-3 was trained on around 400 billion tokens. 

Why this matters – LLMs with Chinese characteristics: Many people claim that the inherent lack of controllability of LLMs will make it difficult for people to deploy them at large scale in China while keeping their outputs within the acceptable censorship zone demanded by the CCP. Dataset releases like this show how the Chinese government is wise to this issue and is proactively creating the means of production necessary for LLMs that reflect politically correct (aka Xi Jinping) thought.
   This may seem quite distasteful to various people outside of China, but inside China this just looks like another form of AI alignment, bringing LLMs into the (state-forced) normative framework of the country.
   Via Matt Sheehan (Twitter).
   Check out the Weixin post for more (weixin.qq).

***

Thai researchers adapt Mistral for Thai language:
…Results show that small models can be good, but big models are best…
Researchers with SCB 10X, a research and VC subsidiary of Thai company SCBX, have developed Typhoon, a small language model finetuned to be good at the Thai language. Typhoon is based on Mistral-7B and is adapted via finetuning on a custom-compiled Thai dataset using a Thai subword tokenizer. 

The ThaiExam test: To assess how well Typhoon performs, the researchers compile a multiple-choice Thai language test called ThaiExam. ThaiExam includes questions from the Thai Ordinary National Educational Test (ONET), the Investment Consultant (IC) examination, the Thai General Aptitude Test (TGAT), the Thai Professional Aptitude Test 1 (TPAT-1) and Applied Knowledge Level exam. 

How well does it work: In tests, Typhoon significantly beats other equivalently sized models. However, it mostly matches or barely exceeds the performance of GPT-3.5 and GPT-4. “When compared to proprietary (and potentially much larger) models, Typhoon despite having only 7 billion parameters outperforms GPT-3.5 on 4 out of 8 evaluation datasets.” These models are much larger and more computationally intense, so it’s no surprise they’re hard to beat. 

Why this matters – small is beautiful but small might not be best: Small models like Typhoon highlight how you can pack a lot of narrow powerful capabilities into small models, but the results suggest ultimately peak performance is going to be set by large-scale computationally-intensive models like GPT-4 (and I imagine if the authors wrote a complicated Thai-language-oriented prompt for GPT-4 they could significantly improve its performance). It also highlights how integral evaluations are to pushing forward performance – to know Typhoon was any good, the authors had to build their own test. 
   Read more: Typhoon: Thai Large Language Models (arXiv).
   Get the model here: Typhoon-7B: Thai Large Language Model (Pretrained) (HuggingFace).

***

Tech Tales:

The most predictable unpredictable 
[DoD internal archives, accessed 2030]

The Judgment Apparatus for Novel Unanticipated Situations (JANUS) was originally developed as part of a DoD acquisition programme for ‘counter-consensus simulation systems’. The purpose was to acquire synthetic intelligence technologies which could help to identify plausible ‘unknown unknowns’ that the US military and intelligence services might encounter and come up with appropriate response and intervention plans. Initial versions of JANUS identified various attack scenarios involving [REDACTED]. JANUS outputs drove subsequent acquisition programmes to create technologies to counter potential novel attacks predicted by JANUS. 

The JANUS programme was vindicated in 2027 when [REDACTED] attempted to compromise [REDACTED] using [REDACTED]. Technologies driven via JANUS-borne acquisition programmes spotted the signatures of the attack in time for a nearby strike system to kinetically neutralize the attackers. 

In 2028, JANUS was extended to JANUS-I, a programme for individual psychometric profiling of all individuals with security clearances across the US government. While critics have termed JANUS profiling a form of ‘pre-crime prediction’ with associated problems of bias and potential overreaction, the JANUS-I programme has been directly responsible for the identification of [REDACTED] individuals seeking to undermine US national security from within. It has also helped identify [REDACTED] sources of hitherto unknown foreign intelligence actions against US individuals and organizations.

JANUS-I is currently being merged with BRAINWAVE to provide psychometric modeling and red teaming of individuals at the live brain state level.

Things that inspired this story: Military intelligence systems; anomaly prediction; the Waluigi Effect; thanks to Claude.ai for helping me come up with the backronym for JANUS.