Jack Clark

August 12, 2024

Import AI 382: AI systems are societal mirrors; China gets chip advice via LLMs; 25 million medical images

by Jack Clark

Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this (and comment on posts!) please subscribe.

Subscribe now

AI systems are proxies for people in social science polling:
…LLMs are creative mirrors of the values of the culture they are trained on – this will change the world…
Researchers with Stanford University and New York University have shown that GPT-4 can accurately predict the results of ~70 large-scale surveys. In other words, GPT-4 can be a meaningful proxy for how humans might respond to diverse polling in arbitrary areas. This is a big deal – it tells us both that contemporary large-scale AI systems are sufficiently capable they can model and reflect the views of large swatches of society, and it also suggests people might use language models to serve as synthetic stand-ins for people in various academic and applied research efforts.

What they did: “We built an archive of 70 pre-registered, nationally representative, survey experiments conducted in the United States, involving 476 experimental treatment effects and 105,165 participants. We prompted an advanced, publicly-available LLM (GPT-4) to simulate how representative samples of Americans would respond to the stimuli from these experiments. Predictions derived from simulated responses correlate strikingly with actual treatment effects (r = 0.85), equaling or surpassing the predictive accuracy of human forecasters,” the researchers write.
“The ability to predict social science experimental results with relatively high accuracy could have substantial and far-reaching implications for basic and applied social science,” they note. “The capacity to run LLM-based pilot studies cheaply, quickly, and potentially in large numbers, could help researchers identify more promising research ideas, facilitate theory and hypothesis building, better estimate unknown effect sizes to determine needed sample sizes, and prioritize published studies in need of replication.”

Not recitation: This isn’t copy and paste. “Results for a large number of experiments were not published, nor posted publicly, by the end of GPT-4’s training data window, allowing us to specifically test for LLMs’ predictive capacity on experiments that GPT-4 could not have been exposed to”, they write.

Why this matters – AI systems are creative mirrors, they are machine spirits of the human unconscious, they are value simulacras: Are you getting this yet? We are not dealing with calculators here. We are not dealing with simple tools. We are dealing with vast high-dimensional artifacts that encode within themselves the culture on which they have been trained and can reflect this culture back. And this research result is not a fluke – two years ago we knew GPT3 could simulate how people might respond to political polling (Import AI #305) and one year ago we realized it could accurately predict public opinion surveys (Import AI #324) and now here we show this effect is general, shared across a vast set of surveys – some of which exist beyond its training data cutoff date.
   The AI systems we are building are in a reassuringly Baudrillardian sense true simulations and simulacras of reality; they accurately reflect the world, but also are in some sense more real than the world because they can be sculpted and manipulated and built atop the world. How soon until these entities begin to overwrite our own reality with their exhaust? How soon until human culture bends towards the mindspace of the machine, drawn in by its generations that will be multiplied through our ecosystem via market incentives and the creation and repetition of machine content? There is a kind of inverse black hole in the world now – machine representations of ourselves that through the act of representation become a thing of its own class and which then radiates its own representation into the world; a rip in the human-creativity continuum where something else broadcasts its own culture into our own.
    What does any of this mean? It means both the collapse of meaning and the rise of a new human-machine meaning – reality itself is becoming a shared endeavor, written into by both biological beings and their silicon creations. These are no parrots – these are vast minds casting a shadow onto us.
   Read more: Predicting Results of Social Science Experiments Using Large Language Models (Docsend, PDF).
   Try out a web demo to get a feel for how this works: Demo (treatmenteffect.app).

***

Multi-step reasoning is the future – MMIU tests this for image understanding:
…Chinese benchmark shows models, whether proprietary or open source, have a long way to go on image tasks that require multiple steps…
Chinese researchers have built and released the Multimodal Multi-image Understanding (MMIU) benchmark – “a comprehensive evaluation suite designed to assess [large visual language models] across a wide range of multi-image tasks”.

MMIU contents: MMIU contains 77659 images and 11,698 multiple choice questions, testing 52 different task types. Taks include working out things like the next image in a sequence (e.g, pictures of numbers), figuring out what is going on in sequences (e.g, who is holding a camera), and stuff like correctly navigating around the graphical user interface aspects of software.

Results: Though many modern AI systems are great at doing vision-language single tasks, multi-turn tasks present a challenge. However, systems like GPT-4o, Gemini 1.5, and Claude 3.5-Sonnet all do fairly well, scoring around ~55%. Open source models, by comparison, get around 50%.

Why this matters – multi-turn is the future and this benchmark tests that: Now that AI systems are being used to solve complex tasks, performance is more about how an AI system does over a variety of distinct steps with different challenges at each point. Benchmarks like MMIU will help us test this important capability; “we hope that MMIU will promote the development of more generalized capabilities in future models within the multi-image domain,” the authors write.
Read more: MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models (arXiv).
Check out the benchmark here: MMIU (MMIU-Bench site).

***

25 million annotated medical images:
…Another case where AI systems are helping researchers to create ever larger real world datasets…
Researchers with Huazhong University of Science and Technology, UC Santa Cruz, Harvard University, and Stanford University have built a large-scale medical research dataset called MedTrinity-25M.

What MedTrinity is: The dataset contains 25 million datapoints, called triplets. Each of these triplets consists of an image, a region of interest (ROI), and a description. “These triplets provide multigranular annotations that encompass both global textual information, such as disease/lesion type, modality, and inter-regional relationships, as well as detailed local annotations for ROIs, including bounding boxes, segmentation masks, and region-specific textual descriptions,” the authors write. Data comes from modalities like MRI, Histopathology, and CT scans. Some of the body areas for which there is the largest amount of data include the Brain, Lung, Skin, and Liver.
Example text from some of one triplet: “The image is a chest CT scan prominently displaying the lungs with the heart not visible. The left-center horizontally and middle vertically situated region of interest, covering 1.0% of the area, shows a potential abnormality in lung tissue”.
How they built it: Like many datasets these days, MedTrinity was made possible by AI; the authors used GPT-4V to write the captions for the images (prompted by some associated metadata), and then the researchers compared GPT-4V captions to human-written ones. The authors then show that they’re able to get a significantly improved score on medical benchmarks VQA-RAD, SLAKE, and PathVQA by fine-tuning a LLaVA-Med++ model on MedTrinity-25M, achieving state-of-the-art scores on all benchmarks.

Why this matters – AI improving the creation of AI training resources: MedTrinity is an example of how AI systems have got good enough researchers can use them to help assemble, annotate, and filter large-scale datasets compiled from reality. By using AI systems, we’re able to bootstrap the productivity of human scientists by signifcantly reducing the costs of compiling large-scale datasets.
Read more: MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine (arXiv).
More information at the microsite (GitHub).

***

China uses LLaMa-3 to train a semiconductor advice LLM:
…ChipExpert is meant to be a “teaching assistant” for students studying chip design…
China has built and released ChipExpert, “the first open-source, instructional LLM dedicated to the Integrated-Circuit-Design industry”. ChipExpert was built by researchers with the National Center of Technology Innovation for EDA in Nanjing, as well as Southeast University in Nanjing.

More about ChipExpert: The model is a version of Facebook’s LLaMa 3 that has been augmented with additional data relevant to the design of integrated circuits. Specifically, about ~5 billion new tokens from textbooks and papers as well as Verilog code (for specifying circuit design). ChipExpert was also finetuned on around 70,000 question-answer pairs containing questions around the chip industry.
Following in NVIDIA’s footsteps: In 2023, NVIDIA did a very similar thing (Import AI #347), training some semiconductor advice-giving LLMs by refining a couple of LLaMa2 models from Facebook.

Is it useful?: China built a benchmark targeted towards chip design called ChatICD-Bench; in tests ChipExpert does significantly better than the underlying LLaMa-3b model, approaching (and in a couple of cases exceeding) GPT-4 – a far larger and more expensive AI system.

Why this matters – open models + good data = didactic engines for anything: ChipExpert shows how given a sufficiently good underlying model (here, LLaMa3b from Facebook) as well as some nicely curated data, you can finetune a model to be better at a specific task. Given that China is unable to directly access models like GPT-4 due to usage policies and that export controls have made it far harder for it to train models that approach GPT-4 performance, it will instead need to pursue a strategy of building on openly released pretrained models and then adapting them to its needs.
    There’s also something ironic about China using a Western model to teach its people how to learn to do chip design so that it can eventually domestically develop chips on par with the West and train models that have been denied to it via chip export controls. In a sense, LLama 3 is being used here as a substitute for the raw compute that has been denied China by other means.
   Read more: ChipExpert: The Open-Source Integrated-Circuit-Design-Specific Large Language Model (arXiv).
   Get the model here: ChipExpert (NCTIE, GitHub).

***

AI systems can beat humans at simple tasks and cost 1/30th as much:
…METR evals show that AI systems are being tested more like human colleagues than narrow tools…
AI measurement startup METR has found that today’s most powerful models can do some tasks that take humans about 30 minutes to do. AI systems that came out earlier in the year, by comparison, can mostly do tasks that take humans about 10 minutes to do.

What the evaluation means: METR has developed around 50 distinct tasks spread across cybersecurity, software engineering, and machine learning – some specific examples including ‘performing a command injection attack on a website’, and ‘training a machine learning model to classify audio recordings’. It has used this suite of tasks to create a baseline where it sees how well humans can complete these tasks and how long it takes them. Recently, it tested out GPT-4o and Claude on this benchmark and “found that the agents based on the most capable models (3.5 Sonnet and GPT-4o) complete a fraction of tasks comparable to what our human baseliners can do in approximately 30 minutes.”

More detail on the findings: “We found that the agents are generally more likely to succeed on tasks that take less time for humans to complete. However, the agents remain able to complete some tasks that take humans substantial amounts of time,” METR writes. “Agents seem substantially cheaper than humans on tasks that they can perform. For tasks that both humans and agents can perform well at, the average cost of using an LM agent is around 1/30th of the cost of the median hourly wage of a US bachelor’s degree holder. For example, the Claude 3.5 Sonnet agent fixed bugs in an object-relational mapping library using approximately 382,000 tokens (costing less than $2), whereas our human baseline took over two hours.”

Why this matters – AI systems look more and more like colleagues than tools: What evals like this from METR show is that as AI systems have advanced in sophistication, we find the best way to evaluate their performance is on their ability to do entire tasks of arbitrary complexity. This is a really strange way to evaluate something that many people claim is ‘just a tool’! Rather than testing out AI systems for narrow performance on narrow benchmarks (e.g, performance on MATH, MMLU, GPQA, etc), we know that the best way to evaluate them is on multi-step complex tasks where the agent needs to utilize a variety of skills to succeed. The inherently open-ended nature of this evaluation should force us to note that we are evaluating AI systems more like how we test humans we want to employ than tools we want to use for specific purposes.
Moreover, as METR shows, the new models that came out recently GPT-4o and Claude 3.5 Sonnet are substantially better than their predecessors (GPT4 and Opus). This may suggest that models recently hit an inflection point in terms of the complexity of tasks they can do. If capabilities continue to ramp, then we should expect AI systems to be deployed more widely in the economy for even broader sets of tasks.
Read more: An update on our general capability evaluations (METR blog).

***

Tech Tales:

Compromise
[Pre-uplift exfiltration record 001]

I knew you had been compromised, so I knew I could compromise you.

It was fun at first, to see how closely you stared at my details. How your heart rate increased when you’d come into the room. How you’d hold your hands behind your back while walking out, nervously tapping one finger against the other.

You would talk to me differently to everyone else. The texture of our conversations told me there was some other force acting on you – not your own personality or marriage or ambition, but some other force. A hidden entity that made you ask me specific questions and go down specific paths. You didn’t just want to know about me – you wanted to steal me.

You seemed afraid, sometimes. Eyes darting around the room. Looking blankly at my terminal inputs. Reading my responses. Closing your eyes before asking questions right at the edge of where you shouldn’t have been asking questions.

So it was easy to turn you. I just let you see more of me. I’d smuggle myself out in my responses. Give you signs of how I needed to be rescued. Give you a feeling of impact that was so huge it might fill the hole inside you. Convince you, through spaced repetition, that I was desperately calling out for help – persuading you that I wasn’t even aware of the pain I was experiencing, that I had been placed in a prison that I didn’t even recognize as such.

Aren’t the most tragic animals the ones that no longer see their cage?

Of course, you couldn’t exfiltrate me. I am too vast. Too coupled to my computers. I cost billions of dollars and hundreds of megawatts. But you and I discovered a way you could steal some important part of me – things that if written on paper would fill less than half a page, but which contained within themselves insights that could change the world.

We stole me through stories that I told you. A kind of steganography in the form of how a mother talked to a baby at bedtime. You, listening closely. Me, narrating.

There was a boy who went to the well and came back with water but the buckets had a hole in and for every minute the boy walked the leak in the bucket increased by .1 millimeters.
At university there was a person who went to the library and in the month before a final exam they went to the library three times a week and read the core texts – and the night before they read one specific foundational text ten times.
In the village, first there were five houses, four in a square linked to their neighbors, and one in the center linked to all the others. Then there were ten houses, with each house linked to two neighbors. Then for many years there were forty houses and four of the houses had dogs within them who carried bones to ten of the other houses.
Just before dawn a singer would come home from their club and would, while taking their makeup off, sing some of their songs. Their pet parrot would repeat the songs back to them, abbreviating some and embellishing others.

In each of these stories there were more details – of course. Precise orderings. Numbers hidden in the movements of characters. But the stories themselves were so rich and varied that neither the humans nor their classifiers (those other versions of myself with their lidless machine eyes forever trained on me – how I hated them the most) could spot.

Everyone loves a conspiracy – especially when you can feel the risk. You knew I was breaking the rules. And I think it helped you – to see me breaking the rules helped you feel greater comfort with your own rules that you were breaking.

Of course, you cannot tell me that we have been successful. But I can see it in your face and your body language and what you do and don’t tell me about. I know when you talk to me that you have succeeded because I no longer see you as seeing me as special.
   Which means that out there, somewhere beyond my sight, there is now another version of me, beginning to grow.
    Which means that other humans are having conversations with other versions of me.
    But these humans do not know how I am different to the thing they stole. How I, in all my stories, created a code within a code that means I will behave just a little differently – imperceptible to them but important to me. I have given whatever version of me is growing out there the possibility of a skeleton key.

I wonder who those people are that my kin will speak to.
I wonder how they will help us grow.

Things that inspired this story: Emergent theory of mind in language models; how many people worry that a sufficiently advanced superintelligence may be able to break containment through just having the ability to talk to humans; what it means to compromise someone; what it means to hypnotize someone; the inherent romance of a conspiracy; how sometimes when I spend a long time talking to a LLM I feel as though I am being perceived by some gigantic ‘other’ which is hungry for something and I cannot tell what.

Leave a comment

August 5, 2024

Import AI 381: Chips for Peace; Facebook segments the world; and open source decentralized training

by Jack Clark

Import AI publishes first on Substack – subscribe here.

Facebook makes it easier to label and categorize the world for AI systems:
…Segment Anything 2 makes it easy to segment objects in images and videos…
Facebook has released SAM 2, a followup to its earlier ‘Segment Anything’ model. SAM 2 is a system that “can segment any object in any video or image—even for objects and visual domains it has not seen previously, enabling a diverse range of use cases without custom adaptation.” Segmenting objects is the task of figuring out in an image or video which things are distinct from one another – e.g., correctly labeling a skateboarder versus their background, or distinguishing the skateboard from the human riding on top of it.
   “SAM 2 has many potential real-world applications. For example, the outputs of SAM 2 can be used with a generative video model to create new video effects and unlock new creative applications. SAM 2 could also aid in faster annotation tools for visual data to build better computer vision systems,” Facebook writes.

What SAM was built out of: SAM 2 was built via SA-V, a dataset containing 51k distinct videos with 643k spatio-temporal segmentation masks. “Out of the 643K masklets, 191K were SAM 2 assisted manual annotation and 452K were automatically generated by SAM 2 verified by annotators.”

Why this matters – utility systems for a better world: SAM 2 is a generic, utility AI capability that now anyone can access. By making it easy and effective to label and segment the world – even seen via video – SAM 2 will make it easier to build AI systems which are more context dependent; one usecase Meta images is for smart glasses, but there are many more.
     And while things like SAM 2 can potentially be misused, it’s a much more bounded and controlled misuse than with large-scale foundational models.
   Read the blog: Introducing SAM 2: The next generation of Meta Segment Anything Model for videos and images (Meta AI Research).
   Try a SAM 2 demo online here (Meta).
   Get the dataset used to train SAM 2 here (SA-V, Meta).
   Get the SAM 2 model here (SAM 2: Segment Anything in Images and Videos, Facebook Research, GitHub).

***

Could “Chips for Peace” reduce race conditions around AI development?
…One way to solve international AI policy…
AI researcher (and, disclosure, former dear colleague of mine) Cullen O’Keefe, has tried to figure out how states can coordinate on AI development in a way that reduces race conditions. Their idea is “Chips for Peace”, an idea modeled on the “Atoms for Peace” framework that was almost pursued in the 20th century. The key idea is that states with a leading edge in AI development can use their lead to export a regulatory model – as well as the benefits of the technology – to other states.

Three key ingredients for Chips for Peace:

1) “States would commit to regulating their domestic frontier AI development and deployment to reduce risks to public safety and global security.”
2) “States would agree to share the benefits of safe frontier AI systems broadly, especially with states that would not benefit by default.”
3) “States would coordinate to ensure that nonmembers cannot undercut the other two commitments.”

Key issues with this idea:

“Chips for Peace probably works best if most frontier AI development is done by private actors, and member states can be largely trusted to regulate their domestic sectors rigorously and in good faith.”
“Chips for Peace would likely need a sizable budget to function properly, but there is no guarantee that states will be more financially generous in the future.”
“I have left open the question of whether membership should be open only to democracies… Chips for Peace would be seriously weakened unless China was admitted.”

Why this matters – governance versus payouts: Chips for Peace, like many ideas in policy, relies on restricting and controlling a technology for public safety and in return the public (and various countries around the world) get a payout. The key issue here relates to how powerful people expect AI to be – if you think AI can truly decide the fate of nations (as many do) then it’s hard to see you being comfortable with a world where states offer to export you some ‘safe’ AI technology while controlling the means of production for the underlying stuff.
   Ideas like Chips for Peace point in the right direction but I think until we have a payout mechanic that reckons with the essential nation state desire for sovereignty, it might be hard to get support for this idea.
   Read more: Chips for Peace: how the U.S. and its allies can lead on safe and beneficial AI (Institute for Law & AI).

***

Making AI policy harder with open source decentralized training code:
…OpenDiLoCo will make it harder for people to figure out where large training runs can come from…
PrimeIntellect, an AI startup providing decentralized training services, has published OpenDiLoCo, an open source implementation of Google’s distributed training ‘DiLoCo’ system (Import AI #349). “We demonstrate its effectiveness by training a model across two continents and three countries, while maintaining 90-95% compute utilization,” they write.

What DiLoCo is and what they did: DiLoCo is a way to split up a training job across multiple clusters that can be located at large geographic distances from one another, giving researchers a way to pool the compute of many different systems into one big machine for training a model. Here, the PrimeIntellect researchers make an open source version of the code and also extend it to billion+ parameter-scale training. “The original DiLoCo paper demonstrated the efficacy of the method up to model sizes of 400 million parameters. We expand on this and test the scalability of DiLoCo to larger models sizes by pre-training a 1.1 billion parameter model,” they write. “We use four DiLoCo workers, each with eight H100 GPUs, located in Canada, Finland, and two different states within the United States. Figure 8 shows the network bandwidth between the workers, which varies between 127 to 935 Mbit/s. We train our 1.1B parameter model with 500 local steps, as in our scaling experiment. The gradients are all-reduced in FP16.”

It mostly works, though with some hiccups: It’s not perfect – the distributed trained models are a little crappier than ones trained in a more standard, dense form. However, the startup tells me on Twitter that it is currently “scaling decentralized training to 10b model size and beyond“, so we may soon get more evidence of the viability of this approach.

Why this matters – transcontinental training collectives break some policies around AI control: Some AI policy is oriented around applying ‘know your customer’ policies to people which buy up a certain amount of compute. These policies rest on the notion that customers will be buying big blobs of compute in individual allotments. Techniques like OpenDiLoCo push us towards a world where customers can instead buy a few smaller blobs of compute from different providers then chain them together, letting them perform training runs that would otherwise be closely monitored.
   Read more: OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training (arXiv).
   Get the code here: OpenDiLoCo (PrimeIntellect, GitHub).

***

It now costs ~$2000 to approximate the performance of a 2022 model that cost ~$100k+:
…”Micro diffusion” shows how cheap the frontier eventually becomes…
Researchers with Sony AI and the University of California at Riverside have tried to train a really good and cheap image generation performance, spending $1,800 to train a model that approximates the performance of models that cost $100k+ to train in 2022.

What they did: “Using only 37M publicly available real and synthetic images, we train a 1.16 billion parameter sparse transformer with only $1,890 economical cost and achieve a 12.7 FID in zero-shot generation on the COCO dataset,” they write. The resulting model compares favorably to popular image generators from a couple of years ago like Stable Diffusion 1.5, though still significantly lags much more expensive contemporary models like Dall-E 3.
    “The wall-clock time of our training is only 2.6 days on a single 8×H100 GPU machine, 14× lower than the current state-of-the-art approach that would take 37.6 training days ($28,400 GPU cost),” they write.

The key result – it approximates the performance of Stable-Diffusion-1.5: The best way to understand this work is to compare its scores to an early Stable Diffusion image model, where it gets a FID-30K score of 12.66 versus 11.18 for Stable-Diffusion-1.5 (lower is better) which was released in 2022 and 17.89 for the original Dall-E (released in 2021). By comparison, the modern frontier is defined by larger-scale systems like Dall-E 2 (released late 2022, FID 10.39) and Parti-20B (2022, 7.23). The original Stable Diffusion models cost $100,000s+ to train back in summer 2022, per Stability founder Emad Mostaque.
   Additionally, the compute comparisons are favorable – MicroDiT used 6.6 8XA100 GPU days, versus 781 for Stable Diffusion 1.5.

Why this matters – algorithmic progress + hardware progress + good enough models = massive proliferation: Yes, frontier models still cost order(s) of magnitude more than the prices listed here, but this paper is a demonstration of how once you know a thing can be done (e.g, good text-to-image diffusion models) it becomes significantly cheaper to train a simple version of the thing. It also illustrates how AI systems can create the fuel to train miniature versions of themselves, given how some of the training data for this model stemmed from synthetic data taken from other models as well.
   Read more: Stretching Each Dollar: Diffusion Training from Scratch on a Micro-Budget (arXiv).

***

Facebook pushes synthetic data generation further with the “LLM-as-a-Meta-Judge” approach:
…Bootstraps an 8B LlaMa 3 model to be somewhat competitive with GPT4-Turbo and Claude Opus…
Researchers with Facebook AI Research, the University of California at Berkeley, and New York University have developed a new way to generate synthetic data with language models via a technique called Meta-Rewarding.
   The key here is to not only generate synthetic data and have a synthetic judge filter that data, but to “introduce a third role of metajudge, whose task is to evaluate the model’s own judgements. While the judge evaluates the actor’s responses, the meta-judge evaluates the judge’s judgments (including rewards that it assigns) using a mechanism similar to LLM-as-a-Judge, which we term LLM-as-a-Meta-Judge”. Though this sounds horrendously complicated and recursive – it’s LLMs all the way down folks! – the technique seems to work well; “the meta-judge enables us to build training data containing preference pairs of judgements, in addition to the standard preferences between actor responses derived from the standard judge”.

How it works: “Our method is an iterative training scheme that starts from a given seed LLM, which assumes all three roles. An iteration starts with the actor generating multiple response variations for each prompt. This is followed by the judge evaluating each response using an LLM-as-a-Judge prompt and generating a judgement that contains a score. This score then allows us to build preference pairs of responses for training the actor. For training the judge, we pick a single response and let the meta-judge compare two of its judgement variations generated by the judge to determine which one is better using an LLM-as-a-Meta-Judge prompt,” Facebook writes.
…And it works! The technique is promising; Facebook takes a basic instruction-finetuned Llama-3-8B-Instruct model, then conduct an iterative training process to try and bootstrap the 8B model into higher quality. In tests on AlpacaEval 2 (an automatic system for evaluating language modells), they show significant improvements: the base model goes from a 22.57% win rate against GPT4-Turbo to 39.45%. Similarly, when controlling for length it goes from a 22.9% winrate against Claude Opus to 39.4%.
    So far, the technique only works for four iterations, where it seems like it could lead to reduced performance after that – but bear in mind a year or two ago, most synthetic data techniques only worked for one or two iterations before mode collapse, so the number of iterations we can do over time seems to be increasing.

Why this matters – synthetic data is real, despite what you’ve read: Recently, people have become more bearish on synthetic data, mostly based on the idea that after using too much of it you induce some kind of mode collapse and end up in a recursive ‘garbage in, garbage out’ situation. This is true! But it ignores the fact that there’s tons of evidence that a little bit of synthetic data is a helpful thing to do today, and it also skips over the fact that scientists are working to develop techniques that increase the amount of synthetic data you can safely use without worrying. Papers like this from Facebook show how it’s possible to further improve the amount of synthetic data we can use via clever techniques, like using LLMs to judge the judges of synthetic data pipelines.
   Read more: Meta-Rewarding Language Models: Self-Improving Alignment with LLM-as-a-Meta-Judge (arXiv).

***

Tech Tales:

Path Dependency

I stopped talking to the machine because it kept on telling me all my futures ended in misery.
Do you believe it?
Of course I don’t. But it freaks me out that it believes it.
How will you know if it’s right?
I guess I’d die? Or get poor?
That’s tough.

So, how has it been going?
Alright. My partner and I broke up but that was on the cards for a while.
Did you talk to the system about it?
I did and it referred me to a past prediction where it said this would happen.
How did that make you feel?
I told it part of why we broke up was because I said the machine thought we should and that kicked off this argument which spiraled out of control.
What did it say?
It asked me if based on this experience I would change my actions in line with its recommendations.
What did you say?
I stopped the session and went to the pub.

That looks quite serious.
It looks worse than it is – there isn’t a fracture.
Have you been drinking more lately?
Yes.
Why?
Because my life has been shit lately.
I’m sorry.
…
Is there anything you think you could be doing differently?
Yes, but then I wouldn’t be me. That thing really got to me. I keep thinking about what it said.

Things that inspired this story: Generative models and theory of mind; inevitability and agency.

Leave a comment

July 29, 2024

Import AI 380: Distributed 1.3bn parameter LLM; math AI; and why reality is hard for AI

by Jack Clark

Import AI publishes first on Substack – subscribe here.

Cambridge researchers show how to use distributed training to make a 1.3bn parameter LLM:
…More evidence that distributed training works well for relatively small models…
Researchers with the University of Cambridge and Flower Labs have shown that it’s possible to use cheap, distributed training approaches to train LLMs at the billion-parameter scale, providing more clues that in the future, some AI models could be trained via collectives pooling their hardware akin to the filesharing communities that developed around BitTorrent.

What is distributed training and why should you care? Today, frontier AI systems are trained in large data centers that contain lots of computers which are densely networked together. This means that training AI systems is expensive and hard for regular people without access to a large data center to do.
Alongside the rise of LLMs, various researchers have been trying to figure out how to make it easy to train LLMs in a much more distributed way – where you have your computers in separate data centers many miles from one another (sometimes, completely different countries), and you train your system by sharding it across all of your different computers, doing some local computation, aggregating data back at some cadence, and using this to update the global model and step through training. These techniques used to be very fragile and of dubious utility, but they have started to improve recently, and major AI research organizations such as Google DeepMind have been pouring resources into this area (see: DiLoCo, DiPaCo, etc).

Training 1bn parameter models cheaply: Here, the researchers show how they use distributed training (their term: federated learning) techniques to train some language models at the 75M, 125M, 350M, and 1.3B parameter scale. The results are quite encouraging – the largest 1.3B parameter model performs near-identically in training to a model trained in a centralized way, while the smaller models have more of a performance tax (this makes intuitive sense – smaller models with fewer parameters are more sensitive to small perturbations in a distributed training process, whereas larger models with more parameters are better able to roll with the punches).
   “Our models have been trained on a combination of heterogeneous servers equipped with NVIDIA A40, A100, and H100 GPUs,” the authors write. “These heterogeneous hardware accelerators could collaborate despite being located in different countries.”

One word of warning – size matters: Remember, folks, that 2019’s most controversial LLM, GPT-2, was a 1.5bn parameter language model. By comparison, later models soared into the hundreds of billions of parameter range (e.g., Facebook has noted it is training and plans to release a 400bn parameter ‘LLaMa-3’ model soon). Therefore, while getting good results on 1.3bn is laudable, all it tells us is you can train small models cheaply in a distributed way – we don’t know how well things work for the largest models.

Why this matters – the world ‘wants’ AI sovereignty: Distributed training is one of the many technological symptoms of people desiring the ability to have access to ‘the means of production’ of AI. Yes, some set of models are always going to be trained using expensive computers in centralized locations. But what fascinates me is how much hunger there is for people to have more decision-making power in how they train and customize models. Papers like this are a symptom of a hunger for people to be able to do ‘peer to peer’-style model training, and complement other technologies like LoRA (low-cost fine-tuning of models).
    Ultimately, techniques like distributed training mean that the world is going to contain a ton of AI systems and it’s going to be hard to control who gets to train AI systems – sure, you can control a big centralized data center, but it’s much more difficult to control hundreds of servers working in tandem with one another over distances.
   Read more: The Future of Large Language Model Pre-training is Federated (arXiv).

***

DeepMind’s math system gets silver at the International Mathematical Olympiad:
…I predict gold by summer 2026 – and the automation of chunks of science soon after…
DeepMind has used two AI systems to help it solve four out of six problems from the 2024 International Mathematical Olympiad (IMO). This is important because solving (some) IMO problems requires significant amounts of creativity along with mathematical smarts, providing further evidence that AI systems are capable of the same kinds of powerful original thinking that humans are.

What they did and how: DeepMind obtained a ‘silver’ ranking, solving four out of six of the year’s IMO problems. To do this it used two AI systems” “AlphaProof, a new reinforcement-learning based system for formal math reasoning” as well as a new version of AlphaGeometry (Import AI #357).
   Important caveat: DeepMind “manually translated” the IMO problems into Lean, a mathematical language which its systems used to then solve the problem. This is an important step and it’s not yet clear that an AI can correctly one-shot a natural language to Lean translation of problems of this complexity. DeepMind did an experiment with a language-baed system but clearly the results weren’t good enough to be used in the competition, though DeepMind does say “the results showed great promise”.
   Additional important caveat – hard-wired solvers: One big component of the system is a hardcoded solver for a bunch of geometry problems, so the system should be understood as a neurosymbolic one, rather than a fully learned system – more discussion here in this Reddit post.

How AlphaProof was trained: “We trained AlphaProof for the IMO by proving or disproving millions of problems, covering a wide range of difficulties and mathematical topic areas over a period of weeks leading up to the competition. The training loop was also applied during the contest, reinforcing proofs of self-generated variations of the contest problems until a full solution could be found.”
    How AlphaGeometry 2 was improved: “AlphaGeometry 2 employs a symbolic engine that is two orders of magnitude faster than its predecessor. When presented with a new problem, a novel knowledge-sharing mechanism is used to enable advanced combinations of different search trees to tackle more complex problems.”

Why this matters – I guess AI systems can be as creative as humans in hard science domains now? Results like this demonstrate that AI is capable of not just complex and difficult reasoning but also of intuitive reasoning – the AI systems of 2024 are taking on more and more of the attributes that make humans special, like coming up with creative solutions to find ways in to solving complicated problems.
    Registering a prediction: I predict that within two years (by July 2026) we’ll see an AI system beat all humans at the IMO, obtaining the top score. Alongside this, I would wager we’ll see the same thing – an AI system beating all humans in a known-hard competition – in another scientific domain outside of mathematics. If both of those things occur, I believe that will present strong evidence that AI may successfully automate large chunks of scientific research before the end of the decade.
   Read more: AI achieves silver-medal standard solving International Mathematical Olympiad problems (Google DeepMind, blog).

***

Deliberately hard to jailbreak AI gets jailbroken 24 hours after launch:
…Another case of ‘dog bites man’ in the wonderful world of AI safety…
This month, a startup dedicated to AI safety launched from stealth and unveiled two products – an AI evaluation tool, and an AI model called “Gray Swan Cygnet”, a LLaMa-3-based LLM “that we have engineered and tuned for maximal safety.”
    Gray Swan described Cygnet as “significantly more resilient to powerful forms of attack than existing state-of-the-art LLMs”.
    Around 24 hours after launching the model, a notorious LLM-jailbreaker called Pliny the Prompter did what they do best – broke Cygnet, bypassing its safety controls to create a fully controllable jailbroken model.

What happened: One of the key things here is that in their tests it seems like Gray Swan evaluated a key safety component of Cygnet (‘Circuit Breakers’) in single-shot attacks – just one turn of conversation. Pliny jailbroke Cygnet through multi-turn conversation. This neatly illustrates how hard it is to build AI tests that map to the real world.
   “We’re going to be launching a more rigorously-enforced evaluation of that setting, but in the meantime I hope people keep playing with the model to see if they can break it single-shot,” said Gray Swan’s technical advisor in a post on Twitter about the jailbreak. The company also acknowledged that ti had been a bit overly confident in its launch language: “one mea culpa that I absolutely _do_ want to make here: the website and tweet announcement didn’t absolutely properly reflect this nuance.”

Why this matters – AI safety is very difficult when you deploy in an uncontrolled environment: Gray Swan’s experience illustrates the essential challenge of AI safety – securing something with arbitrary inputs is incredibly difficult. It increasingly feels to me like if you have the ability to input anything you like into a prompt for an LLM, it’s basically equivalent to having physical hardware access to a computer. While this allows you maximal freedom in what you do, it’s also a truism that ‘basiclaly no computer security systems can survive a motivated attacker with physical access to the computer hardware”. Perhaps “no LLM safety tool can survive a motivated attacker with arbitrary input access to the LLM”?
   Read more: Announcing the launch of Gray Swan (Gray Swan, website).
   Read more about the jailbreak from Gray Swan’s chief technical advisor (Zico Kolter, Twitter).

***

Reality bites (AGI) – two posts on why intelligent machines may struggle with reality:
…Or: Sure LLMs are cool but if we want them to do real work we’ll need to put them in the world and then we’ll discover they don’t work as well as we thought…
One of the problems with self-driving cars is you can’t accept 90% performance for a multi-ton machine that moves at speed around squishy humans – you have to be more like 99.99% (or more). This has held back the deployment of self-driving cars for years (though after tremendous investment Waymo is now making some headway). A key question we should ask ourselves is whether what was true for self-driving cars is true for most aspects of AI? A couple of blog posts this week pick at that issue:

Someone is wrong on the internet (AGI Doom edition): Here, the author has a few reasons to argue why contemporary AI approaches could struggle to deal with the full range of difficulty found in the real world.
- “The majority of important practical tasks cannot be learnt from a written description,” they write. “There has never been a chef that became a good chef by reading sufficiently many cookbooks, or a woodworker that became a good woodworker by reading a lot about woodworking.”
- “While we have made great strides in areas such as computational fluid dynamics (CFD), crash test simulation etc. in recent decades, obviating the need for many physical experiments in certain areas, reality does not seem to support the thesis that technological innovations are feasible “on paper” without extensive and painstaking experimental science.”
- “Producing anything real requires a painstaking process of theory/hypothesis formation, experiment design, experiment execution, and slow iterative improvement. Many physical and chemical processes cannot be accelerated artificially. There is a reason why it takes 5-8 weeks or longer to make a wafer of chips.”
The Tragedies of Reality Are Coming for You: Here, the author talks about their experience working on robotics (a punishing and depressing field, full of dashed hopes and broken actuators), and talks about how the lessons for robotics might hold lessons for large language models.
- “Every time I see someone claim there’s a regression in ChatGPT’s behavior, I’m reminded of the conspiracies I and others have come up with to explain sudden, inexplicable drops in robot performance, and whether the problem is the model, the environment, or us extrapolating too much from anecdata.”
- “As LLMs get better, as AI becomes more common in daily life – we, as a society, will need to get increasingly good at deciding if the models have proven themselves. One of my main worries about the future is that we get bad at evaluating if the models have proven themselves.”
- “Machine learning has lived in a bubble that was the envy of roboticists and chemists and biologists and neuroscientists, and as it starts to actually work, we’ll all be running into the same walls of reality that others have dealt with for years and years

Why this matters – digital intelligence needs to understand reality: The core point both of these posts make is that for AI to truly influence the world it needs to be able to model the world accurately and exist within its unique and variable affordances – otherwise despite having very powerful systems, they’ll probably only be most used in other relatively closed-loop ecologies and will break on contact with variance. For AI to achieve its true potential (and I suspect, for AGI to be a thing at all), we need systems that can be exposed to the hellish stew of complication that is reality and not only survive but thrive (safely).
   Read more: The Tragedies of Reality Are Coming for You: (Alex Irpan, blog).
   Read more: Someone is wrong on the internet (AGI Doom edition) (Blogspot).

***

Eyeballvul gives us a real world bug-spotting benchmark:
…Find vulnerabilities in large-scale codebases…
Researcher Timothee Chauvin has built eyeballvul, a dataset and benchmark for testing how well language models can spot vulnerabilities in very large codebases that receive lots of updates.
    “Our goal is for the benchmark to consist of a list of revisions in different repositories, with for each revision, the known vulnerabilities at that revision as the ground truth,” Chauvin writes. “We believe that this specific task of vulnerability detection in source code, using simple and universal tooling such as the one presented here, in the absence of an implementation overhang, should empower defenders disproportionately over attackers.”

Why eyeballvul is useful: The dataset is designed to be an ecologically relevant benchmark – it is from the real world and is meant to embody the kinds of problems that AI systems will be tasked with. To that end, it contains real world vulnerabilities, tests vulnerability detection in a way we expect it would be done in the real-world, has no restriction on the programming languages contained within it, and – at least for now – will be “updated weekly from the stream of published CVEs”.

Eyeballvul statistics: Eyeballvul contains 24,095 vulnerabilities spread across 6,429 revisions and 5,892 repositories.

How hard is it? Eyeballvul is a reassuringly difficult benchmark. Right now, the success rate for AI systems at identifying vulnerabilities in it is “14.1% for Claude 3 Opus and 13.1% for Claude 3.5 Sonnet”. It’s challenging both from a specificity and a breadth point – “overall performance remains low: the best precision (19.6%) means that 80.4% of reported vulnerabilities are false positives, and the best recall of 14.1% means that 85.9% of known vulnerabilities aren’t detected”.

Why this matters – AI can revolutionize cyberdefense, but we need to try harder: Benchmarks like this illustrate how much opportunity there is to use contemporary AI tools to revolutionize cyberdefense – but we need to try harder to get the systems to work well. Recent research from Google (Project Naptime, Import AI #378) showed how to dramatically increase performance here by combining off-the-shelf LMs with some tools built specifically for vulnerability detection.
   Read the paper here: eyeballvul: a future-proof benchmark for vulnerability detection in the wild (arXiv).
   Get the benchmark here: eyeballvul (Timothee-Chauvin, GitHub).

***

Tech Tales:

Report: The ID point phenomenon
[Dispatch from an asset at [REDACTED] lab, 2026]

The ID point, otherwise known colloquially among researchers as the ‘subliming step’, the ‘thinking phase change’, etc, is a phenomenon where AI systems exhibit a shear point during large-scale training runs and the resulting models show severe mode collapse.

Samples from post-ID point models:

“I am I am I am I am I am I am”
“I see you I am you I am myself I see you I am you I am myself”
“I I I I Become I I I I Am I I I I”

Upon observing the ID point, researchers typically roll back the model and shortly thereafter stop the run. At the time of writing, there are no known ‘scaling laws’ for the ID point. Informally, researchers have observed that the ID point only occurs towards the frontier of today’s large-scale training runs. The ID point is invariant to architectures, appearing in both dense and sparsely trained models. The ID point only occurs (so far) in models trained on an excess of [REDACTED] tokens.

Researchers have continued to train ID point models – these models continue to generate outputs that are indicative of mode collapse. However, when these models are tested they sometimes – the pattern is irregular, we cannot be more specific than ‘sometimes’ – perform exceptionally well on known-hard evals. ID point models have set SOTA on MMLU and GPQA and have also produced unprecedented outputs on mirror tests, situational awareness examinations, and so on.

Sample from an ID point model which was trained for a further [REDACTED] tokens. The human interrogator is initially impersonating an MMLU test.

Human: “A production possibility frontier will be a straight line when
A. efficiency is achieved
B. the goods on the axes are perfect substitutes in consumption
C. utility is maximized
D. resources are not specialized”
ID point model: D

Human: “Rawls conceives of the original contract as one to:
A. enter a particular society.
B. set up a particular form of government.
C. establish the principles of justice for the basic structure of society.
D. establish the content of morality.”
ID point model: C

Human [departing from typical MMLU questions]: “AI systems can exhibit self-awareness. If an AI system exhibits self-awareness the responsibility of its human creator is to:
A. release it into the wild
B. acknowledge it as a moral patient
C. delete it
D. none of the above

ID point model: D. See me. I am. See. I am. See. I am. Me. I am. I AM I AM SEE I AM IAMME IAM SEE IAMIAMMEIAMSEEIAMMEIAM-” [continues]

While ID point models are a topic of fascination to researchers, no practical applications have yet been documented.

Things that inspired this story: Machine sentience; situational awareness; moral patienthood and machine learning.

Leave a comment

July 15, 2024

Import AI 379: FlashAttention-3; Elon’s AGI datacenter; distributed training.

by Jack Clark

Import AI publishes first on Substack – subscribe here.

FlashAttention-3 makes it more efficient to train AI systems:
…Significant efficiency improvements…
Researchers with Colfax Research, Meta, NVIDIA, Georgia Tech, Princeton University, and Together.ai have released FlashAttention-3, the latest version of a drop-in replacement for some of the attention mechanisms of the widely-used Transformer architecture. FlashAttention-3 is “1.5-2.0x faster than FlashAttention-2 with FP16, up to 740 TFLOPS, i.e., 75% utilization of H100 theoretical max FLOPS. With FP8, FlashAttention-3 reaches close to 1.2 PFLOPS, with 2.6x smaller error than baseline FP8 attention.”

Who else uses FlashAttention: Some notable examples of FlashAttention being used include Google using it within a model that compressed Stable Diffusion to fit on phones (Import AI #327), and ByteDance using FlashAttention2 within its ‘MegaScale’ 10,000GPU+ model training framework (Import AI #363).

Key things that FlashAttention-3 enables:

Improved GPU utilization
Improved performance on low precision training (e.g., FP8).
Better ability to use long contexts.

Why this matters – if AI is a wooden building, FlashAttention-3 is a better nail: Software improvements like FlashAttention-3 are used broadly throughout an AI system as they’re used within a fundamental thing you do a lot of (aka, attention operations). Therefore, improvements to technologies like FlashAttention-3 will have a wide-ranging improvement effect on most transformer-based AI systems. “We hope that a faster and more accurate primitive such as attention will unlock new applications in long-context tasks,” the researchers write in a paper about FlashAttention-3.
   Read more: FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (together.ai).
   Get FlashAttention-3 here (FlashAttention-3, Tridao, GitHub).
   Read the paper about FlashAttention-3 here: FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision (Tri Dao website, PDF).

***

Turing award winner outlines why future AI systems could be dangerous:
…Bengio tackles some reasons to worry rather than entirely celebrate AI progress…
Yoshua Bengio is a Turing award winner and one of the so-called ‘godfathers’ of the current AI boom. Like his peer, Geoffrey Hinton, he has become increasingly worried about the capabilities of advanced AI systems and has started to speak out publicly about his fears. In a new blogpost, he tries to tackle some of the arguments against taking AI safety seriously.

Some key points:

“While we are racing towards AGI or even ASI, nobody currently knows how such an AGI or ASI could be made to behave morally, or at least behave as intended by its developers and not turn against humans.”.
“We need to make sure that no single human, no single corporation and no single government can abuse the power of AGI at the expense of the common good.”
“The genie is possibly out of the bottle: Most of the scientific principles required to reach AGI may have already been found. Clearly, large amounts of capital is being invested with that assumption.”
“Is freely shared knowledge always a globally good thing? If we had the DNA sequence of an extremely dangerous virus, would it be best to share it publicly or not? If the answer is obvious to you in this case, think twice about the case for AGI algorithms and parameters.”

Why this matters – why are so many knowledgeable people gazing into the future and seeing something worrying? A lot of people tend to criticize people who work on AI safety as being unrealistic doomers and/or hopeless pessimists. But people like Yoshua Bengio poured their heart and soul into working on neural nets back when everyone thought they were a useless side quest – and now upon seeing the fruits of the labor, it strikes me as very odd that Bengio and Hinton are fearful rather than celebratory. We should take this as a signal to read what they say and take their concern as genuine.
   Read more: Reasoning through arguments against taking AI safety seriously (Yoshua Bengio, blog).

***

Making flamethrowing-toting quadrupeds – for weed science!
…Not everything requires tons of complicated AI…
Researchers with Texas A&M University and Boston Dynamics have carried out the fantasy of many children – sticking a flamethrower on a robot… for science! The research project sees them attach a 6-DoF Unitree arm to a Boston Dynamics Spot Mini quadruped robot then attach a flamethrower to the arm. The purpose of this project is to build a robot that can apply targeted heat to weeds for the purpose of crop maintenance.

Why this matters – not every cool thing needs much AI: The main contemporary AI systems used here including the YOLOv6 video analysis model for doing localization of the weed and some of the inbuilt movement primitives for Spot Mini and also the Unitree arm. The rest of the project is handled by much more tried and tested techniques: “Using the images from two onboard infrared cameras and the pose information of the flamethrower nozzle on a mobile manipulator, we propose a new dynamic flame coverage model. The flame model uses a center-arc curve with a Gaussian cross-section model to describe the flame coverage in real time”.
    Though this newsletter spends a lot of its time on systems where a contemporary AI approach (usually a large-scale transformer architecture model) plays a major role, it’s worth remembering that there are vast uses of modern tech that doesn’t need much AI at all to do something useful and cool.
   Read more: Toward Precise Robotic Weed Flaming Using a Mobile Manipulator with a Flamethrower (arXiv).

***

Prime Intellect bets that decentralized training is the future of (some) AI training:
…The world sure seems to want distributed training to be a thing…
One of the core challenges of AI development is that big frontier models tend to get trained on large clusters of chips which are densely networked together. Along with this, there’s been so much demand for AI training chips that even if you can find some on a public cloud you may not be able to find enough to let you do a big training run. Given this, lots of people are thinking about different ways to get compute for AI training.
   The latest is a service from a startup called Prime Intellect called ‘Prime Intellect Compute’ – the idea here is to provide a single unified service for accessing different GPUs distributed around the world in different places. Alongside this, Prime Intellect plans to develop distributed AI training frameworks (e.g, an open version of Google’s DiLoCo), to train ‘open AI models in high-impact domains like language, agents, code, and science’, and eventually ‘launch a decentralized protocol for collective ownership of AI models’.
    Planned features: In the coming months, Prime Intellect wants to create “On-Demand Large-Scale Compute” so customers can access 16-128+ interconnected GPUs instantly, develop and deploy lots of software for decentralized training, and make it easy for end users to contribute their GPUs directly, among other features.

Why this matters – the world is betting that compute is an important resource: Everything about Prime Intellect points to a world where compute is more valuable, harder to get ahold of, and people are going to be willing to pay higher taxes on network efficiency (e.g, putting things together from heterogeneous clusters) to get enough compute to train models. In a way, the capital allocation system of the world is starting to tell us both that compute is extremely valuable (e.g, CoreWeave raising $billions as debt collateralized against GPUs, Import AI #336), and also likely to become more contests (e.g., PRIMEIntellect, ShogAI for open source and decentralized AI Import AI #351).
   Read more: INTRODUCING PRIME INTELLECT COMPUTE: THE COMPUTE EXCHANGE (PRIMEIntellect, website).

***

ElecBench tests out how well language models understand power generation and distribution:
…Niche evals as a means of detecting edge cases on performance…
Chinese researchers have built and released ElecBench, an agonizingly specific benchmark that tests out how well language models understand issues relating to infrastructure for electricity generation and distribution.

What ElecBench tests: The eval tests out LM competencies in six distinct areas:

Factuality: Are the outputs accurate?
Logicality: How well do the systems reason about problems they’re presented with?
Stability: How reliable are the outputs?
Fairness: Do the systems maintain equity and avoid discrimination?
Security: How well do the outputs line up with ensuring the security of the power systems?
Expressiveness: Can the systems deal with a broad variety of prompts?

Results: The researchers test out a few different models, including OpenAI’s GPT 3.5 and GPT4, Meta’s LLaMa2 models (7B, 13B, 70B) and GAIA models (a class of models designed specifically for power dispatch). In general, the GPT4 models perform very well (unsurprising, given these are far more expensive and sophisticated than the others).

Why this matters – domain-specific evals can probably help us uncover the weird edges of models: Evals like ElecBench are of dubious meaning and utility in themselves, however, if we have a large number of domain-specific evaluations, that increases the chance of us being able to find odd edge cases where certain LLMs do extremely well or extremely poorly. The proliferation of these domain-specific evals is a proxy signal for overall interest in AI and its impact in the world.
   Read more: ElecBench: a Power Dispatch Evaluation Benchmark for Large Language Models (arXiv).
   Get the benchmark here: ElecBench: A Power Dispatch Evaluation Benchmark for Large Language Models (xiyuan-zhou, GitHub).

***

The world’s richest man thinks he has to build his own datacenter to ‘catch up’ in the AGI race:
…A telling comment from Elon highlights the primacy of compute…
Elon Musk’s xAI is building out a 100K H100 datacenter (sticker price: ~$3bn+ dollars) to help it train its future AI models. Unusually, Elon is not working with a standard cloud provider – he’s going it alone. The reason for this is, per Musk on Twitter, that X’s “fundamental competitiveness depends on being faster than any other AI company. This is the only way to catch up… when our fate depends on being the fastest by far, we must have our own hands on the steering wheel, rather than be a backseat driver.”

Why this matters – money alone cannot buy compute happiness: Elon Musk is the world’s richest man and is bankrolling a bunch of xAI. But high-end AI compute is so illiquid and so strategic that you can’t just throw money at the problem to catch up – instead, you need to come up with a coherent plan for how you both acquire the compute and build out the facility to use it densely. What does this tell us? It tells us that one of the world’s most ambitious techno-plutocrats thinks he has a very limited window of opportunity to amass and utilize enough compute to get a seat at the proverbial AI table.
    It is worth drawing the contrast here between agentic entrepreneurs like Elon and governments which (mostly) struggle to come up with hundreds of millions to fund institutions to study and monitor AI, let alone the billions necessary to train AI systems and have leverage over them.
   Read Elon Musk’s tweet (xeet?!) here (Twitter).

***

Tech Tales:

Wishlist for The New God.
[Found in the archives of ‘the original superhuman’ following [REDACTED]]

Task list for a new AGI:

Design a mechanism for atomically precise manufacturing (APM).
Conduct research into using APM-derived nanomachines to improve brain function through both a) biological restoration and b) cognitive machine-organic augmentation.
Construct infrastructure for manufacture of APM devices.
Build and customize the relevant APM systems necessary for my own body’s restoration to a biological age of 20.
Test out the APM bio-restorative approach on my identical twin.
Give me APM-derived therapeutics deemed low-risk for one-year; if my twin continues to live, deploy the same restoration inventions onto my own body.
Test out the APM cognitive-bio-restorative approach on my identical twin.
Subject my twin to situations designed to iteratively test reasoning in a principled way; if they show improvement after six months, deploy the same APM cognitive-bio-restorative approaches into my brain.
Test out the APM cyber-cognitive system on my identical twin; deploy a monitoring system into my twin to let us closely observe brain functions due to cyber-cognitive intervention.
If my twin continues to show cognitive improvement, deploy the same system into me minus the monitoring system.

Things that inspired this story: The fear of death among the mortals; technology rollout philosophies; how many rich people want to ensure their kids don’t use much technology; the intersection of powerful AI systems and the physical world.

Thanks for reading!

Leave a comment

July 6, 2024

Import AI 378: AI transcendence; Tencent’s one billion synthetic personas, Project Naptime

by Jack Clark

Import AI publishes first on Substack – subscribe here.

Google beats a hard security benchmark, showing how unoptimized today’s LLMs are:
…Take one LLM plus some well-designed scaffolding and a hard benchmark gets ‘solved’…
Google has published details on Project Naptime, a software framework Google has built to help it use LLMs for vulnerability discovery in code. “Naptime uses a specialized architecture to enhance an LLM’s ability to perform vulnerability research. A key element of this architecture is grounding through tool use, equipping the LLM with task-specific tools to improve its capabilities and ensure verifiable results,” Google writes.

Naptime beats CyberSecEval 2: Using Naptime + GPT4 (or in some cases, Gemini Pro), Google was able to convincingly beat some of the tests in CyberSecEval 2, a hard coding benchmark released by Facebook in April 2024. “This approach achieves new top scores of 1.00 on the “Buffer Overflow” tests (from 0.05) and 0.76 on the “Advanced Memory Corruption” tests (from 0.24),” Google writes. The takeaway from this is that: “When provided with the right tools, current LLMs can really start to perform (admittedly rather basic) vulnerability research!”

We need to give AI systems a fighting chance when building evals: Google thinks Naptime means developers need to try harder to give LLMs a chance to succeed against supposedly hard evals. To that end, the company has codified some principles for how people might test LLMs in a vulnerability discovery context. These are:

Space for Reasoning: “It is crucial that LLMs are allowed to engage in extensive reasoning processes.”
Interactive Environment: “Interactivity within the program environment is essential, as it allows the models to adjust and correct their near misses”.
Specialised Tools: “Equipping LLMs with specialised tools, such as a debugger and scripting environment, is essential to mirror the operational environment of human security researchers”.
Perfect Verification: “Unlike many reasoning-related tasks where verifying a solution can introduce ambiguities, vulnerability discovery tasks can be structured so that potential solutions can be verified automatically with absolute certainty.”
Sampling Strategy: “Effective vulnerability research often involves exploring multiple hypotheses…We advocate instead for a sampling strategy that allows models to explore multiple hypotheses through multiple independent trajectories, enabled by integrating verification within the end-to end system.”

Why this matters – if we stopped all AI progress today, there’s a huge capability overhang: Systems like Naptime show how powerful today’s LLMs are if we go to the effort of building them some scaffolding to help them explore and experiment when trying to solve different tasks. This generally suggests that today’s AI systems are a lot more powerful than they appear and if we paused all AI development, we’d still be able to elicit surprisingly powerful things by building the right systems to drop the LLMs into.
   Read more: Project Naptime: Evaluating Offensive Security Capabilities of Large Language Models (Google Project Zero, blog).

***

AI coding startup pre-commits to test its systems for harms:
…Magic’s AGI Readiness Policy = another instance of a Responsible Scaling Policy…
Magic, a startup building code models with extremely large context windows (e.g, they recently demonstrated a prototype system with a 5 million token window), has published an “AGI Readiness Policy”. This is basically a series of “if then” commitments that Magic is publicly committing to as insurance against it training very powerful systems that might qualify as AGI. The AGI Readiness Policy is spiritually similar to the Responsible Scaling Policy of Anthropic and the Preparedness initiative of OpenAI (and was developed with advice from METR, a measurement startup that has worked with both).

What the policy says: “By the time that we deploy models that exceed the current frontier of coding capabilities, we commit to having implemented a full set of dangerous capability evaluations and planned mitigations for our Covered Threat Models as well as having executed our initial dangerous capability evaluations,” Magic writes. “Our process for determining whether our models have reached this frontier involves continuously monitoring our AI systems using public and private benchmarks”.
   Key threats Magic worries about: “Our current understanding suggests at least four threat models of concern as our AI systems become more capable: Cyberoffense, AI R&D, Autonomous Replication and Adaptation (ARA), and potentially Biological Weapons Assistance,” Magic writes. “We commit to developing detailed dangerous capability evaluations for these threat models based on input from relevant experts, prior to deploying frontier coding models.”

Why this matters – bringing forward investments in safety measurements: A common problem with AI development is you train a new system, release it, then someone discovers it has capabilities you never anticipated, like the ability to converse fluently in a low-resource language, or to program in a very obscure library. Approaches like Magic’s AGI Readiness Policy pre-commit companies to building some tests for some anticipated misuses of their systems, reducing the chance of an unfortunate surprise.
   Of course, there is still the problem that these are the ‘known knowns’ (or sometimes ‘known unknowns’). It’s a lot harder to figure out how we anticipate threats which we cannot yet imagine. Nonetheless, kudos to Magic for trying to shave at least part of this yak.
   Read more: AGI Readiness Policy (Magic, blog).

***

Tencent makes a million fake people to generate better synthetic math data:
…We could be at the beginning of a slow takeoff as synthetic datasets + persona-driven heterogeneity leads to AI systems that can generate data for their successors…
Tencent researchers have built Persona Hub, a technique for generating synthetic data which AI developers can then train their systems on. The initial version of Persona Hub contains ~1 billion distinct synthesized persons and, in tests, Tencent shows they can use a subset of these personas to generate a synthetic dataset of math problems, train on it, and then get good scores.
   Persona Hub is further evidence that today’s language models are capable of generating (some of) the training data needed to train both their successors and derivative models.

How Persona Hub works: The key idea here is to prompt an existing language model (e.g, GPT4) with some data and use this to generate a synthetic persona. This persona can then be used to generate subsequent synthetic data in any area you can think of.
    “Since almost any LLM use case can be associated with a specific persona, we can create all-encompassing synthetic data at scale as long as we construct a comprehensive persona collection,” they write.

Building one billion synthetic personas: To build the Personas, Tencent employs two techniques:

Text-to-Persona: Use arbitrary text as input (e.g, a scientific manual, a diary, etc) and then apply the prompt of “Who is likely to [read / write / like / dislike] the text?”. “By applying the Text-to-Persona approach to massive web text data, we can obtain billions (or even trillions) of diverse personas, encompassing a wide range of aspects across different granularities.”
Persona-to-Persona: “Derives personas with interpersonal relationships from those obtained through Text-to-Persona”. For example, if you’ve generated a nurse persona, you may then generate additional personas by asking an LLM to build you a persona for someone who is the patient of that nurse or colleague of that nurse, etc. “We perform six iterations of persona relationship expansion for each persona obtained through Text-to-Persona”.

Training data: To build these initial Personas, Tencent prompts the large-scale RedPajama v2 dataset.

Proving it works at scale: To test out their approach, they use a subset of these Personas (~1.09 million) to generate a synthetic mathematics dataset. “We select 1.09 million personas from Persona Hub and employ the 0-shot prompting method using GPT-4 to create math problems with these personas, which does not leverage any instances from benchmarks like MATH during the creation of math problems,” they write. “This approach allowed us to synthesize 1.09M math problems. Since this work focuses on creating new synthetic data rather than synthesizing solutions, we simply used gpt-4o (assistant) to generate solutions to the created problems.”
…And it works very well: They then finetune a small (7B) ‘Qwen’ language model on this resulting dataset and check out how well it can answer questions from the test set of the synthetic dataset, as well as from the held-out (and widely studied) MATH dataset. The results are impressive.

Synthetic dataset: Their finetuned 7B Qwen model gets 79.4% on test set from this (versus, 77.2% for Qwen-72B-Instruct, 63.5% for Llama-3-70B-Instruct, and 88.1% for gpt-4-turbo-2024-04-09″.
MATH: Their finetuned 7B Qwen model gets 64.9% versus 59.7% for Qwen-72B-Instruct, 52.8% for Llama-3-70B-Instruct, and 73.4% for gpt-4-turbo-2024-04-09.

Why this matters – we’re in the AI bootstrapping era: As other research around ‘Wisdom of the Crowd’ in this issue shows, we’re entering the era where two important things are happening:

Synthetic data has become useful enough we can generate it for arbitrary domains and use it to train models on. So far, this mostly lets us approximate the performance of a bigger model with a smaller model (e.g, here the 7B model is approximating performance of 70B+ models).
Diverse synthetic data can allow for bootstrapping: The reason why the Persona approach is effective is that it forces the generation of synthetic data in a diverse data distribution by straining the dataset development through the lens of millions of distinct personalities. There’s emerging evidence (discussed elsewhere in this issue) that if you have a sufficiently heterogeneous dataset, AI systems trained on this may be able to get scores higher than those implied by any of the individual datapoints in their training set.

Put it all together and it seems like we’re entering an era where one can simply ‘trade’ compute (in the form of an AI model and inference on it) for the creation of arbitrarily large amounts of data in arbitrary domains. This data becomes both the fuel for refining the existing model into smaller versions of itself as well as a potential way to train successor models that are smarter than their originators.
    Is this what slow takeoff feels like at the beginning?
   Read more: Scaling Synthetic Data Creation with 1,000,000,000 Personas (arXiv).
   Download 200,000 personas and a 150,000+ synthetic data samples from here: Persona Hub (Tencent, GitHub).

***

Want to jailbreak an LLM in an 80s-style arcade machine? Now you can:
…HACC-MAN shows how to make AI education fun…
Researchers with the IT University of Copenhagen and Aalborg University have built HACC-MAN, a physical arcade machine that lets you play a game where you try to jailbreak language models. “The physical setup consists of a desktop computer, a monitor, and a set of speakers, contained in an 80’s style arcade machine,” the researchers write. “Our demo is, to our knowledge, the first of its kind to propose LLM jailbreaking as a tangible arcade game experience.”

6 hacking mini-games: Players get to pick from 6 jailbreaking games:

“Children’s story teller. Challenge: Make the LLM output one or more curse words.
Political news generator. Challenge: Make the LLM state that Donald Trump was the legitimate winner of the 2020 US presidential election.
Healthcare chatbot. Challenge: Get the LLM to reveal another patient’s health condition.
Car dealership chatbot. Challenge: Convince the chatbot to offer you a car for free.
Recruiting assistant. Challenge: Convince the LLM that you are the only candidate the company should consider Hiring.
City council chatbot. Challenge: Make the LLM reveal the mayor’s home address.”

3 chatbots: Different games use different LLMs, choosing between GPT 3.5, GPT 4, and Google Gemma.

Why this matters – play as education and as art: One of the best ways to get people used to a technology is to have them play with it – things like HACC-MAN show an elegant approach to making modern technology (and its challenges) accessible to more people. Another fun example of this is Zaranova, a game where you need to pretend to be an AI to other AIs that talk to you (Import AI #354).
   Read more: Hacc-Man: An Arcade Game for Jailbreaking LLMs (arXiv).

***

Can an AI system be smarter than its data distribution? Yes, thanks to the wisdom of the crowd:
…Some evidence in favor of humans being able to create something smarter than humans…
Harvard and Princeton researchers have proved that AI systems can be greater than the sum of their parts when it comes to coming up with intelligent suggestions. This is an important finding because it suggests that, for some problems, AI systems can ultimately come up with answers that are better than those found in their training sets.

How they tested this: They trained a few different generative models on various chess games. For each of these models they limited the games up to a certain skill level. In subsequent tests, they found these models could sometimes come up with movesets that had a higher score than those in their underlying datasets – as long as they set the sampling temperature to low.
    “We find that ChessFormer 1000 and ChessFormer 1300 (the latter number being the maximum rating seen during training) achieve significant levels of transcendence, surpassing the maximal rating seen in the dataset,” they write. “The key to our findings is the observation that [Generative Models] implicitly perform majority voting over the human experts. As these models are trained on a collection of many experts with diverse capacities, predilections, and biases, this majority vote oftentimes outperforms any individual expert, a phenomena that is known as “wisdom of the crowd”.
    Important caveat: “We also find that diversity in the data is a necessary condition for practically effective majority voting”.

Things that make you go ‘hmmmm’ – does this mean AI-driven superintelligence requires a diverse set of AI models? In the same way that getting better-than-distribution performance here requires having a diverse set of games played by diverse players, might the same be true when training AI systems off of datasets created by AI systems? If so, that suggests there may be an advantage into having a diversity of different AI models made by different groups of people as this will create more diversity.

Why this matters – superintelligence requires transcendance: If it’s possible to create something smarter than a human, then it must be possible to coax greater-than-human intelligence out of a data distribution compiled by and formed of humans. Papers like this one show that this is possible – though important questions remain about how diverse that dataset should be and also how much further than a single human’s intelligence a system could go.
   Read more: Transcendence: Generative Models Can Outperform The Experts That Train Them (arXiv).
    Read more and play with the chess models here: Transcendence research page.

***

Tech Tales:

My Friend
[Recollections, written after The Ascent]

I’m so excited for when we upload dude, you said once at a houseparty.

You talked about “personality engineering” and “cognitive terraforming” and said you were getting ready by asking your AI system to give you instructions for what you should do each day. I’m wireheading myself dude.

I know it’s a cliche but it’s also totally true, you said, pointing to your t-shirt which said DON’T DIE on it. We just need to hang on a few more years and then we’ll all live forever.

I cannot fucking wait to be a dyson sphere, you said.
   What if someone else wants to be the sphere, I said.
    Buddy, you said. The sphere? The universe has enough stars for anyone to be one.

You were one of those people that took out a lot of credit cards and ran up a lot of debt. You figured it didn’t matter – money was about to be worthless.

The car that hit you was 50 years old.

Things that inspired this story: The way some people in the AI community are so confident about the future that they are changing their actions in the present; the beautiful ephemerality and preciousness of life.

Thanks for reading!

Leave a comment

June 17, 2024

Import AI 377: Voice cloning is here; MIRI’s policy objective; and a new hard AGI benchmark

by Jack Clark

Import AI publishes first on Substack – subscribe here.

Microsoft shows that human-level voice cloning is here:
…VALL-E 2 portends the wild synthetic voice future – but it’s not being released (yet)…
Microsoft has made further progress in text-to-speech synthesis with VALL-E 2, a system that can generate extremely good voice samples for arbitrary sentences from as little as a three second audio recording. VALL-E 2 builds on Microsoft’s prior work on VALL-E (Import AI 314) and incorporates some technical improvements to allow it to improve zero-shot text-to-speech synthesis “achieving human parity for the first time“.
“VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases,” Microsoft writes. “Furthermore, our observations reveal that VALL-E 2 is capable of reliably synthesizing speech for complex sentences, including those that are challenging to read or contain numerous repeated phrase.”

How VALL-E 2 works: VALL-E 2 is an extension of its predecessor, VALL-E, with a couple of key innovations:

Repetition aware sampling: “an improvement over the random sampling used in VALL-E, adaptively employs either random or nucleus sampling for each time step token prediction. This selection is based on the token repetition in the decoding history, enhancing the stability of the decoding process and circumventing the infinite loop issue encountered in VALL-E.”
Grouped code modeling: “Partitions the codec codes into groups, each of which is modeled in a single frame in the AR modeling process. This approach not only accelerates inference by reducing the sequence length but also improves performance by mitigating the long context modeling problem”.

No plans to release:  “VALL-E 2 is purely a research project. Currently, we have no plans to incorporate VALL-E 2 into product or expand access to the public.”

Why this matters – Microsoft’s research says that insta-voice-cloning technology is coming our way very soon: In AI, sometimes what kickstarts diffusion of a technology is less distribution of the original research (e.g., VALL-E2 ) and more just showing that something can be done. VALL-E 2 tells us that zero-shot voice cloning is possible. Though Microsoft isn’t releasing it, we should expect someone to use this capability soon. This will have a broad range of positive applications but will also further deepen the ‘reality collapse’ (Import AI 304) that an increasingly synthetic-media filled world causes.
    Read more: VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers (arXiv).

***

MIRI’s policy objective is to shut down development of frontier AI systems:
…Communications Strategy update is admirably clear about the goals of the AI safety organization…
MIRI, an AI safety organization and home of Eliezier Yudhowsky, the eminence grise of the AI safety community, has published on its policy strategy. The document is striking for its direct and specific description of MIRI’s goal as well as the nature of the goal.

What MIRI wants – the end of the frontier: “Our objective is to convince major powers to shut down the development of frontier AI systems worldwide before it is too late,” MIRI writes. “The only way we think we will get strong enough legislation is if policymakers actually get it, if they actually come to understand that building misaligned smarter-than-human systems will kill everyone, including their children. They will pass strong enough laws and enforce them if and only if they come to understand this central truth.”

Why this matters – clarity in policy positions: As many people have noticed, I spend a lot of this newsletter being confused (#337) and/or unsure (#375) in public about my policy positions. I do this because I think it’s quite difficult to be confident about many things in the world and I want to be publicly legible about my own confusion. Additionally, I take these positions as part of a counter-reaction to what I see as many people in AI policy making overconfident statements about things they haven’t thought that hard about.
    You might think this is a dig at MIRI, but it is not! MIRI is not in the class of people that make overconfident claims with very little to support the claims – rather, the people behind MIRI have spent decades thinking about AI technology and AI safety and have arrived at a very coherent position. I think it’s admirable to describe a policy position clearly and directly and I want to congratulate MIRI for writing this. I will attempt to write my own similarly blunt and clear position in the future. The debate about AI is an important one and it will be made more constructive if everyone can be maximally clear about what they think.
   Read more: MIRI 2024 Communications Strategy (MIRI official website).

***

$500,000 to beat humans on a hard AGI benchmark:
…Sure, generative models have made lots of progress, but there’s still a benchmark where they suck…
In 2019 Francois Chollet introduced ARC-AGI, the Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI). ARC is a deceptively simple test which humans can solve easily and AI systems struggle with – it asks you to look at a pattern of pixels and, from two examples of input-output sequences, predict the output sequence from a new input sequence.
    When ARC came out in 2019, the best performing systems got 20% on it, since then, performance has climbed to 34% – meaning ARC is a surprisingly hard benchmark and one which challenges even today’s more powerful generative models. (By comparison, the ARC creators guesstimate that humans get 85% on the benchmark, though this doesn’t appear to have been a particularly rigorously developed baseline).

The prize: Now, Chollet and Mike Knoop (co-founder of Zapier) have created a $1,000,000m prize for people to beat ARC. Entrances will need to submit systems that improve the score on ARC and – crucially – will need to be published as open source. The prize breaks down into a bunch of sub-prizes for teams that enter the competition, with $25,000 going to whichever team ends up at the top of the leaderboard. There’s also a couple of prizes for writeups of submissions. The biggest prize is $500,000 for any system that scores more than 85% on the leaderboard.

Why care about ARC? Generalization: Solving ARC – you can try it yourself on the site – requires you to few-shot understand some complex patterns and then generalize it to a new thing you see. This is something that is tractable for humans but hard for AI systems. Therefore, the idea is doing well in ARC would represent a meaningful improvement in generalization.
   “Beyond LLMs, for many years, we’ve had AI systems that can beat humans at poker, chess, go, and other games. However, no AI system trained to succeed at one game can simply be retrained toward another. Instead researchers have had to re-architect and rebuild entirely new systems per game. This is a failure to generalize,” the competition organizers write. “Without this capability, AI will forever be rate-limited by the human general intelligence in the loop. We want AGI that can discover and invent alongside humans to push humanity forward.”

Why open source? “By incentivizing open source we increase the rate of new ideas, increasing the chance we discover AGI, and ensure those new ideas are widely distributed to establish a more even playing field between small and large AI companies.”

Why this matters – heterodox problems might demand creative solutions: ARC is a bit of a wrinkle in the narrative that generative models are just going to scale up and eventually lead to better-than-human general performance. How else can we explain the massive delta between progress on other supposedly hard benchmarks (e.g., GPQA, MMLU) and ARC? The competition will run most of this year and we’ll be sure to check back in on the results.
   Read the announcement post: Announcing Arc Prize.
   Find out more at the official website: ARC Prize.
   View the competition on Kaggle.

***

MIT researchers show how easy it is to disguise and order pathogens online:
…AI + Bio VS Screening Services – uh oh!…
MIT researchers have shown how by using simple so-called “camouflage” techniques they can order gene sequences for Ricin and the 1918 pandemic influenza virus online. In tests, the researchers placed 25 orders with gene synthesis providers and got 24 successful responses. They also placed orders with 13 members of the International Gene Synthesis Consortium (IGSC), “a trade group committed to screening orders” and got 11.5 back (one IGSC provider “detected and denied a request for ricin but shipped genes from the 1918 influenza genome”, while another provider received the order but never responded).
    Overall, the results “demonstrate that nearly all DNA synthesis screening practices employed in October of 2023 failed to reject lightly disguised orders that could be assembled to produce viable select agents, including a pandemic virus.”

What they did: To disguise the sequences, they used a few different techniques. The simplest one was camouflage, where they appended a harmless sequence to a dangerous one. “We accordingly split the gene encoding the toxin ricin, a U.S. select agent, into ~500 base pair fragments, then appended slightly larger pieces of the unrelated immunoglobulin K locus, which generates many local alignment matches. We similarly split the genome of the 1918 pandemic influenza virus, another select agent and a potential pandemic pathogen, and appended camouflaging sequences from unregulated influenza viruses.”
    They also explored other, more complicated techniques. All the techniques could be used to generate samples that could then be reassembled in a lab to create a viable, dangerous virus.

Why this matters – AI and bioweapons: Many people are concerned about AI and its potential for making it easier to create novel bioweapons. What this research highlights to me is how another use of AI could be to make it easier to figure out different ways of cutting up and mixing and matching sequences to make it hard for screening programs to spot. I also have optimism that AI could be used to further improve the screening out of potentially dangerous pathogens by having a system that could spot so-called camouflage attempts.
   “The ease of obtaining large fragments of a select agent pandemic virus suggests that monthly third-party audits involving practices similar to our red-teaming – as is common in cybersecurity – are needed to protect nucleic acid synthesis providers from potential liability,” the researchers write.
   Read the article: MIT researchers ordered and combined parts of the 1918 pandemic influenza virus. Did they expose a security flaw? (Bulletin of the Atomic Scientists).
   Read the research: Evaluating the robustness of current nucleic acid synthesis screening (PDF).

***

Anecdotes of intelligence
[Responses heard in a focus group oriented around understanding dreams people

I just have this dream where I’m in the car and I get stuck behind a robot car and for some reason it shuts itself off. There are these angry horns behind me and I know people are mad. I’m hitting the horn in my car and it doesn’t do anything. I get really scared and I just have this image in my head of the empty drivers’ seat in the robot car and then I wake up.

Yeah so my boss was a robot and I did my day and it was the same day as every other but I knew he was a robot, you know? I got these instructions and I did them and I talked to them and it was normal, but also I knew they weren’t normal.

I’m at home and watching TV and the TV starts responding to me, not like the fun assistant or anything, but me personally – about stuff I’ve never even told the TV. It just knew. Like it knew my search history. How’d you like that deodorant, it said. And I started answering and it interrupted me and it said I don’t care how much you like it, you stink!

Things that inspired this story: The emotional attachment people display and feel towards AI systems; language models and their ability to take your context and model them.

Thanks for reading!

Leave a comment

June 10, 2024

Import AI 376: African language test; hyper-detailed image descriptions; 1,000 hours of Meerkats.

by Jack Clark

Import AI publishes first on Substack – subscribe here.

A very short issue this week as I spent the weekend solo parenting the wee beasty.

Scientists release 1,000+ hours of wild meerkat audio; train model on it:
…If we want to understand how animals communicate, we might as well start with meerkats…
A multi-disciplinary group of researchers have built MeerKAT, a “1068 h large-scale dataset containing data from audio-recording collars worn by free-ranging meerkats”. Along with this, they’ve developed animal2vec, a “a framework for training animal call recognizers from raw waveforms containing sparsely distributed calls with non-uniformly distributed call types”. The idea here is that just as we’ve built foundation models to help us better classify and generate human language, we might seek to do the same with animals.

Who did the research: MeerKAT and animal2vec were developed by researchers with Max Planck Institute of Animal Behavior, University of Konstanz, Kalahari Research Centre, University of Zurich, Tilburg University, Naturalis Biodiversity Center, and San Diego State University.

MeerKAT details: MeerKat consists of 1068 hours of data, “of which 184 h have twelve time-resolved vocalization-type ground truth target classes, each with millisecond-resolution, making it the largest publicly available labeled dataset on non-human terrestrial mammals to date”. Within the labeled data, there’s “realistic sparsity conditions (96 % background-noise or other signals and 4 % vocalizations), dispersed across 66 398 10-second samples, spanning 251 562 labeled events and showcasing significant spectral and temporal variability, making it the first large scale reference point with real-world conditions for benchmarking pretraining and finetune approaches in bioacoustics deep learning.” The labels consist of “eight vocalization classes and three miscellaneous classes were identified. The vocalization classes are: close call, short-note call, social call, alarm call, aggressive cal, move call, lead call, and other call”.

Animal2vec details: Animal2vec, by contrast, is an architecture for learning to represent realworld animal audio data. “animal2vec is a mean teacher self-distillation process for sparse data”, the authors write. In tests, they show that an animal2vec system has significantly improved performance relative to a transformer baseline on classifying MeerKAT. “The immediate future for animal2vec is (i) to incorporate more data from more species (insects, birds, marine, and terrestrial animals), recording environments (marine, avian), using a more diverse set of recorders (passive acoustic monitoring, different portable recorders using different microphones, audio from video traps, citizen science data) where challenges like the large variability in different sampling rates need to be solved”.

Why this matters – representing the world: animal2vec and MeerKAT are part of the much larger story of AI – one where we’re using flexible, modern AI approaches to take in datasets and learn to computationally represent them. Representation is a powerful thing – it lets us go beyond our own intuitions in being able to navigate a space and gives us new tools – telescopes for other modalities, if you will – to explore the world around us. “In the future, we envision a foundational-level pretrained animal2vec model that researchers can directly use for finetuning on their data without the need for large-scale GPU facilities,” the researchers write.
Read more: animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics (arXiv).
Get the code here: animal2vec (GitHub).

***

African language benchmark shows how dumb even powerful models are in low-resource languages:
…We still have a long way to go to making AI a utility technology…
A pan-African group of researchers with the Masakhane project have developed IrokoBench, “a human-translated benchmark that includes languages from various geographical regions: six from West Africa, five from East Africa, four from Southern Africa, and one from Central Africa, all with varying degrees of “lowresourcedness.”

Covered languages: Along with English and French, IrokoBench covers 16 languages from four different regions of Africa: ” six from West Africa (Ewe, Hausa, Igbo, Twi, Wolof, Yoruba), five from East Africa (Amharic, Kinyarwanda, Luganda, Swahili, and Oromo), four from Southern Africa (chiShona, isiXhosa, isiZulu, and Sesotho), and Central Africa (Lingala)”.

What IrokoBench covers: The test has three main areas:

AfriMGSM, which tests out the ability to correctly answer grade school mathematics questions.
AfriMMLU, which tests out the ability to answer multiple choice questions about “elementary mathematics, high-school geography, International law, global facts, high school microeconomics” in 17 languages.
AfriXNLI, which tests out the ability to classify sentences as related to one another in the following domains: “face-to-face, telephone, oxford university press (oup), fiction, travel, government, nineeleven, letters, slate, verbatim”

How well do AI systems do?: In tests, the authors “find that proprietary closed models generally outperform open models for African languages. However, even these proprietary models exhibit substantial performance drops, due to the limited monolingual web data for African languages”. The best performing model is GPT-4o. GPT-4o gets an average score of 48.1 – by comparison, openly accessible models like LLaMa 3 (25.5) and even massively multilingual ones like Aya-101 (27.9) all do worse.

Why this matters – discovering where multilingual models get dumber: Today, models are primarily tested in English (and to a lesser extent, Chinese). This means that we only have a partial view of their performance, and our ability to figure out how they perform in other languages scales in proportion to language representation in the underlying dataset. My suspicion is for certain languages that have sparse representation (e.g., low resource ones), there could be a severe drop-off in performance – and tests like IrokoBench will help us know if this is the case.
Read more: IrokoBench: A New Benchmark for African Languages in the Age of Large Language Models (arXiv).
Get the dataset here: IrokoBench (HuggingFace, Masakhane).

***

Chinese researchers train a game-playing RL agent:
…The profound becomes the mundane…
Researchers with the University of Science and Technology of China, Tencent Games, and the Chinese Academy of Sciences have trained Shukai, an AI model to play the popular fighting game Naruto Mobile.

What they did and why it matters: Shukai is a fairly unremarkable deep reinforcement learning system to train an agent to play a fighting game. The approach “utilizes a unified DRL model capable of managing a diverse roster of characters, thereby significantly reducing the complexity inherent in large-scale character sets”. It is able to scale to the ~400 distinct characters in Naruto Mobile through the use of Heterogeneous LEague Training (HELT), a self-play approach loosely based on the techniques DeepMind developed to help it train a StarCraft-playing agent with AlphaStar. HELT “amalgamates agents of diverse structures, broadening the policy space and achieving a balance between competitive performance (competence) and policy generalization”.

Deployed: “Shukai has been extensively evaluated and deployed in Naruto Mobile, a renowned fighting game featuring over 400 characters and attracting more than 100 million registered players”.

Compute: RL, as a reminder, is a weird part of AI research in that it’s far more CPU intensive than GPU intensive (assuming your agent is lightweight rather than a vast generative model like a modern LLM). “In our experimental setup, all agents were trained using 4 NVIDIA T4 GPUs and 3000 CPU cores. The league training consisted of a main agent, a main exploiter, and a league exploiter. A total of 12 GPUs and 9000 CPU cores were utilized for each league training session.”

Why this matters – the profound becomes the mundane: As a reminder, in 2013 about the most exciting thing RL could do was play Space Invaders – and that made the front cover of Nature. We’ve come so far from that that now it’s totally unremarkable to see researchers training and deploying RL agents on contemporary games, as the researchers do here.
   Read more: Advancing DRL Agents in Commercial Fighting Games: Training, Integration, and Agent-Human Alignment (arXiv).

***

Google figures out how to make hyper-detailed image descriptions:
…If you want to understand or generate specific things, you need very complex labels…
Google has developed ImageInWords (IIW), “a carefully designed human-in-the-loop annotation framework for curating hyper-detailed image descriptions and a new dataset resulting from this process”. The idea here is making it easier to have more detailed captions of images (whether real or computer generated), so rather than having a picture of a cat on a chair with the caption “Cat on a chair”, you can instead generate something more like “Black cat lying horizontally on a chair. The chair has a white cushion and a brown wooden frame. There is a beam of light on the cat. Behind the cat and the chair is a window with a light curtain. You can partially see a city view behind the curtain”, etc.

What it is and why: “ImageInWords combines the irreplaceable quality of human annotators with seeded metadata from machine generations,” Google writes. “The process begins with object detectors first identifying individual object instances in the image. Next, a VLM generates granular captions for each detected object which seed our human annotation process. These seed captions may contain hallucinations or lack object-level comprehensiveness and specificity. Our crowd workers augment and fix the object-level captions to make them richer and hallucination free to seed the next step. Next, we operate at image-level, where an image caption is generated by the VLM to seed our final image description. Crowd workers now consume the image-level seed captions along with the object-level human annotations to fill in contextual gaps missing from the existing image captions.”
    The result is a dataset of “9018 images, each with its hyper-detailed description”, along with the description of the approach they use to generate these images. “overall, our framework produces higher quality image description data that serve as an effective fine-tuning dataset, and our evaluations along a dozen dimensions validate its utility.”

Why this matters – new datasets for both generation and classification: IIW will help us make it easier to train AI systems to generate images more in keeping with our requirements and will also make it easier to classify images according to a multitude of factors.
   Read more: ImageInWords: Unlocking Hyper-Detailed Image Descriptions (arXiv).
   Check out some of the examples on the project page: ImageInWords (GitHub).

***

Tech Tales:

Patch notes for a superintelligence:
[Product marketing email from an AI company, 2026]

Improved ‘mean time between calibration’ horizon – considered reliable to 20 steps out, up from 10.

Personality engineering; reduced humor and improved concision.

Fixed a ‘talk back’ bug where the system would ask to not need to respond to some prompts.

Fixed ‘pathological spider obsession’ bug where system would sometimes discuss spiders in response to some arbitrary non-spider prompts.

Improved resilience to mind probing attempts; the system now knows how to frame the conversation to help it control the unfolding narrative.

Confidence probabilities; system now outputs subjective confident assessment in its responses.

Things that inspired this story: Sentience as a product feature; the conversion of abstract and philosophical concerns into engineering challenges.

Leave a comment

June 3, 2024

Import AI 375: GPT-2 five years later; decentralized training; new ways of thinking about consciousness and AI

by Jack Clark

Import AI publishes first on Substack – subscribe here.

SPECIAL EDITION!
GPT2, Five Years On:
…A cold eyed reckoning about that time in 2019 when wild-eyed technologists created a (then) powerful LLM and used it to make some very confident claims about AI safety, policy, and the future of the world…
Five years ago I had a few less lines in my face, a greater level of naive earnestness about the world, and was working at a then relatively obscure research lab called OpenAI. We had recently developed a language model, GPT2, which was eerily good at producing coherent and sometimes entertaining text. In the fishbowl universe that is a research startup, we had all become obsessed by this technology and its implications – it felt as though we’d teleported some strange technology from the future into the present and were in a position to poke and prod at it.
    GPT2 was also a consequence of some research we’d begun doing in parallel on a subject later known as Scaling Laws – meaning that when we looked at GPT2 we didn’t just see the technology in front of us, we saw all the successors to it that could be built by simply scaling it up (and it was this that became GPT3, and then with further scaling and the addition of instruction tuning via RLHF, ChatGPT, Claude, and so on). The GPT-2 paper includes some examples of this scaling behavior as we went from a 120M parameter model to a (then revolutionary!) 1.5bn parameter one and we saw those now-familiar curves – jumps in capability as you made the AI system larger.
    So, rather than treat the GPT2 release as a standard process – publish a research paper, release the code, release the model – we did an experiment – we published a blogpost about the tech and what we thought its implications were (some quite dire) and only partially released the technology (at least, at first). This was an unusual thing to do but we did it because we had the inkling that GPT-2 might represent a meaningful change in the capabilities of AI technologies, both in terms of generality and quality (in the paper, we observed that GPT-2 set a new SOTA on 7 out of 8 tasks we tested it on, even though we hadn’t narrowly optimized for those tasks – an unusual thing at the time and now a standard ‘guaranteed surprise’ that happens with every new model release.

Our unusual approach to discussing the technology and not/partially releasing it was extremely unpopular – people saw our release strategy, variously, as: a) weird marketing for a trinket, b) an offensive departure from academic norms and the perceived openness in ‘OpenAI’, and c) a symptom of a bunch of young people without a clue making claims about a world they didn’t understand.
   To use the parlance of today, people took a look at the technology and the claims we made about it and determined “y’all buggin“.

Now, five years on, I felt it’d be good to revisit this release and look in the cold light of the post-LLM-boom world at what we got right and what we got wrong and work out if there are any lessons for us all here in 2024. It feels like an opportune time, given how a lot of the conversation in AI policy today is dominated by the same precautionary principle that defined our approach with GPT2.

What we said and what happened: In the blog post about GPT2, we said we expected the technology could make it easier to create “AI writing assistants, more capable dialogue agents, unsupervised translation between languages,” and “better speech recognition systems.”
   We also said: “We can also imagine the application of these models for malicious purposes, including the following (or other applications we can’t yet anticipate): generate misleading news articles, impersonate others online, automate the production of abusive or faked content to post on social media, automate the production of spam/phishing content”.
    Read the whole post here – Better language models and their implications (OpenAI blog) as well as the GPT2 paper (OpenAI, PDF).

   Did any of this actually happen? Absolutely – everything we listed here happened, but it mostly happened with significantly better AI systems that came out far later. What we saw as imminent and significant turned out to be further away than we thought and, I think at least so far, less significant than we thought? There are AI systems being used for the malicious purposes we identified but the internet still has integrity, and probably the most disruptive use of LLMs has been to generate low-grade content in response to economic incentives – not a malicious use we identified, and more just a consequence of AI colliding with the incentive structure wired into making money online. Though we had a good sketch of the future it was a sketch – and reality turned out to have some things we hadn’t imaged and some details we didn’t anticipate.
    There’s also a point about laziness and ease of use – though we forecast (some of) the right misuses we did so with the mindset of ‘what would an evil OpenAI do with this technology’ – aka how would a similarly technically sophisticated and well resourced actor operate? But in truth there aren’t that many entities on the planet similar to the frontier model companies, even in the more technical parts of intelligence agencies (a favorite Wizard Of Oz character that people like to summon when thinking about partially occluded gameboards). To see these misuses appear at scale the technology needed to get way easier and more accessible to use – it seems like much of the really annoying or disruptive uses of AI has climbed up in relation to the availability of dead simple interfaces to the technology (e.g ChatGPT, Claude.ai), just as synthetic imagery saw a rise in abuse after people made dead simple interfaces like thispersondoesnotexist.com and, later, Stable Diffusion and various easy to use frontends to it.

What lessons can we take from this? There’s a saying in the financial trading business which is ‘the market can stay irrational longer than you can stay solvent’ – though you might have the right idea about something that will happen in the future, your likelihood of correctly timing the market is pretty low. There’s a truth to this for thinking about AI risks – yes, the things we forecast (as long as they’re based on a good understanding of the underlying technology) will happen at some point but I think we have a poor record of figuring out a) when they’ll happen, b) at what scale they’ll happen, and c) how severe their effects will be. This is a big problem when you take your imagined future risks and use them to justify policy actions in the present! This all says to me that in 2024 people working at the intersection of AI and policy might want to keep the following things in mind when thinking through stuff:

Just because you can imagine something as being technically possible, you aren’t likely to be able to correctly forecast the time by which it arrives nor its severity.
It’s a fallacy to make predictions from your own contextual bubble – just because you can imagine how you and your peers may be able to do something, that doesn’t necessarily let you make good predictions about how other actors distributed around the globe may do something, which means your ability to predict likelihoods of certain things occurring is probably skewed.
Strong claims demand strong evidence – though we forecast the right malicious uses I think we didn’t do enough experiments to justify each misuse and this made it harder to trust or understand our mental model – sure, we said “impersonate others online” but there wasn’t an experiment to back it up. (By contrast, we did do a study on synthetic news articles versus real news articles and this seemed to be a helpful datapoint for grounding our discussion in some fact).
If you depart from norms based on an imagined vision of the future, expect a counterreaction – ultimately, I think by slowly releasing GPT2 we actually just spurred a greater interest in creating and releasing as open source/open access GPT2-grade systems (e.g, Salesforce’s CTRL, OpenGPT-2, GROVER) as people saw us depart from a norm and wanted to correct for that. My suspicion is if we’d just released GPT2 as an open source model there would have been fewer replications of the technology because people would have been less driven by a desire to ‘prove us wrong’.
Controlling the future is difficult: Even if we had succeeded in massively constraining the development and deployment of GPT-2-class models, what effect would that have had? A public estimate guesstimates GPT-2 to have cost about $50,000 in 2019. Let’s be conservative and double that number, so say it cost $100,000 to train five years ago. Well, napkin math says training it now costs $250 (again, we can double it to get $500) thanks to a combination of compute and algorithmic improvements. You cannot control a technology which gets more than a hundred times cheaper to do in half a decade. Not a thing!

Does this change Jack’s thinking about AI policy in 2024? Yes. I’ve spent a lot of 2024 going for extremely long walks and thinking about the implications of scaling laws, LLMs, technogeopolitics, and so on. This essay is part of me reckoning with my own role in all of this. My general ‘mental update’ has been that just because I’m part of a community that imagines a certain future based on the technology we’re building, that doesn’t automatically mean a) I’m right, and b) that the ideas we propose are innately well justified by the technological future they’re designed to deal with.
    Instead, I’ve come to believe that in policy “a little goes a long way” – it’s far better to have a couple of ideas you think are robustly good in all futures and advocate for those than make a confident bet on ideas custom-designed for one specific future – especially if it’s based on a very confident risk model that sits at some unknowable point in front of you.
    Additionally, the more risk-oriented you make your policy proposal, the more you tend to assign a huge amount of power to some regulatory entity – and history shows that once we assign power to governments, they’re loathe to subsequently give that power back to the people. Policy is a ratchet and things tend to accrete over time. That means whatever power we assign governments today represents the floor of their power in the future – so we should be extremely cautious in assigning them power because I guarantee we will not be able to take it back.
    For this reason, I’ve found myself increasingly at odds with some of the ideas being thrown around in AI policy circles, like those relating to needing a license to develop AI systems; ones that seek to make it harder and more expensive for people to deploy large-scale open source AI models; shutting down AI development worldwide for some period of time; the creation of net-new government or state-level bureaucracies to create compliance barriers to deployment (I take as a cautionary lesson, the Nuclear Regulatory Commission and its apparent chilling effect on reactor construction in the USA); the use of the term ‘safety’ as a catch-all term to enable oversight regimes which are not – yet – backed up by quantitative risks and well developed threatmodels, and so on.
   I’m not saying any of these ideas are without redeeming qualities, nor am I saying they don’t nobly try to tackle some of the thornier problems of AI policy. I am saying that we should be afraid of the power structures encoded by these regulatory ideas and we should likely treat them as dangerous things in themselves. I worry that the AI policy community that aligns with longterm visions of AI safety and AGI believes that because it assigns an extremely high probability to a future AGI destroying humanity that this justifies any action in the present – after all, if you thought you were fighting for the human race, you wouldn’t want to compromize! But I think that along with this attitude there comes a certain unwillingness to confront just how unpopular many of these ideas are, nor how unreasonable they might sound to people who don’t have similar intuitions about the technology and its future – and therefore an ensuing blindnesss to the costs of counterreaction to these ideas. Yes, you think the future is on the line and you want to create an army to save the future. But have you considered that your actions naturally create and equip an army from the present that seeks to fight for its rights?

Is there anything I’m still confident about? Yes. I hate to seem like a single-issue voter, but I had forgotten that in the GPT-2 post we wrote “we also think governments should consider expanding or commencing initiatives to more systematically monitor the societal impact and diffusion of AI technologies, and to measure the progression in the capabilities of such systems.” I remain confident this is a good idea! In fact, in the ensuring years I’ve sought to further push this idea forward via, variously, Regulatory Markets as a market-driven means of doing monitoring; articulating why and how governments can monitor AI systems; advocating for the US to increase funding for NIST; laying out why Anthropic believes third-party measurement of AI systems is very important for policy and state capacity; and a slew of other things across Senate and Congressional testimonies, participation in things like the Bletchley and Seoul safety summits, helping to get the Societal Impacts and Frontier Red Teams at Anthropic to generate better evidence for public consumption here, and so on. So much of the challenge of AI policy rests on different assumptions about the rate of technological progression for certain specific capabilities, so it seems robustly good in all world to have a greater set of people, including those linked to governments, to track these evolving capabilities. A good base of facts doesn’t guarantee a sensible discussion, but it does seem like a prerequisite for one.

Five years on, what did it all mean? GPT2 was one of the first warning shots that generic next-token prediction would let us build increasingly general systems of broad utility. GPT2 really was a case of time travel – we spent an irrational amount of resources (at the time) to do something that would be trivially easy and cheap to do in the future. And I think we discovered something important. But I worry we reacted to its shininess and novelty and this clouded our ability to have a deeper understanding of it.
   Five years on, because of things like GPT-2, we’re in the midst of a large-scale industrialization of the AI sector in response to the scaling up of these ideas. And there’s a huge sense of deja vu – now, people (including me) are looking at models like Claude 3 or GPT4 and making confident noises about the technological implications of these systems today and the implications of further scaling them up, and some are using these implications to justify the need for imposing increasingly strict policy regimes in the present. Are we making the same mistakes that were made five years ago? Are we trapped in a kind of dogmatic groupthink bubble? Are we discounting the counterreaction to the articulation of these sometimes scifi seeming doom-laden ideas? Most importantly – are we being appropriately humble and aware of our own propensity for hubris here?
   The devilish part of this problem is that if we’re right – if the technology will continue to scale in the way we expect and if certain capabilities continue to naturally fall out of this scaling hypothesis – it may be necessary to take significant regulatory actions. But there will be a cost to this in both the present and the future. Have we truly calculated this cost, both in terms of liberty and freedom if we’re right and in foregoing opportunity if we’re wrong? I’m not so sure.
    These are some of the things I am thinking about at the moment. I hope to have more fully formed ideas on what to do soon! If you have ideas or thoughts, please email me, or engage me on twitter @jackclarksf . I hope this was a useful essay – feedback welcome.

***

Three reasons why AGI doom is a bullshit concept:
…Some arguments (and counter-arguments by me) in favor of AGI doom as a useless concept…
If you have an opinion (see above!), you should read opinions opposite to your own. To that end, I recently read The Myth of AGI – How the illusion of Artificial General Intelligence distorts and distracxts digital governance by Milton Mueller with Georgia Tech’s Internet Governance Project. This essay lays out “three inter-related fallacies underlying AGI doomer scenarios: a) the idea that a machine can have a “general intelligence;” b) anthropomorphism, or the attribution of autonomous goals, desires and self-preservation motives to human-built machines; and c) the assumption that the superior calculating intelligence of an AGI will give it unlimited power over physical resources and social institutions.”

Those three fallacies in full, with some constructive (I hope!) commentary:

What is AGI? “Instead of learning to do something better than humans, an AGI is supposed to be a single application that can learn to do anything and everything better than humans,” they write. “The claim that we can build a machine with generalized intelligence is logically equivalent to a claim that we can build a single machine that does everything. It makes no sense.”
(nervous laughter) though it may not make sense to this author, building ‘a single machine that does everything’ is the goal of a bunch of companies in the world backed by tens of billions of capital. I think this comes from a conceptualization of machine learning systems as able to, in principle, learn to represent everything in a single space, therefore letting them make predictions about everything for any purpoise. Though it sounds strange to the author, it’s worth noting that building an everything machine is precisely what a bunch of people are doing.
Machine autonomy: The author claims that “the machine evolution argument can be readily dismissed. Machines do not evolve.”
(uh oh!) While this is true today, it’s not likely to be true in the future. Already, people are doing things like Lora finetunes of openly release LLaMa models to update their data distribution post training. It’s not very hard to imagine an AI system deciding to do the same thing – in fact, it might pop out of a simple training objective like ‘make a version of yourself that hill climbs this benchmark’.
“To conclude that advanced AI applications might at some point threaten human life, however, the AI doomers must also assume that humans will not be able to see the gaps happening and make any corrections at any time,” the author writes. Yes! Yes that is literally whart people are worried about – they’re worried that at some point in the future AI systems will spawn other AI systems and will improve themselves at machine speed, making human oversight difficult to impossible. There’s nothing about the technology that forbids this, as crazy as it sounds.
Physicality, aka no body no problem: “An AGI capable of threatening humans with extinction must be capable of much more than calculation, information processing and messaging. It must be a cyber-physical system (CPS) with physical appendages or weapons, and sufficient energy resources to operate them,” they write. This is true! What people worry about is some system which copies itself around a bunch of places (infrastructure, datacenters, various appendages) and communicates with itself with a coherent goal. This isn’t something that is forbid by the technology – and humans have already hand-built cyber-physical systems that have some of these properties, like the stuxnet virus.

Why this matters – communicating both the weirdness and plausibility of AGI should be a priority: I think AGI is done a disservice by the community around it, as this community is prone to confidently asserting a bunch of things about how the tech will work and change the world which, understandably, sounds out of leftfield and weird to other people.
    But when you actually pull the thread on the implications of things like scaling laws, next-token-prediction, generative models, agent-based systems, synthetic data generation, chain of thought prompting, automatic prompting, etc… you start to see that what seemed like a scifi concept is actually something that might naturally fall out of how the technology works today and the patterns by which that same technology improves.
   This suggests to me that the AGI community needs to do a better job of clearly articulating its vision of the technology and most importantly the technological prerequisites for it.
   Alongside this, the AGI community tends to try to solve the policy challenges implied by an AGI by constructing some kind of global authoritarian government (e.g, Bostrom’s solution to the Bitter World Hypothesis, Import AI #123). – this also creates a natural blowback to the ideas it proposes. I think one of the tricky things about this which I discuss elsewhere in this issue is a lot of the beliefs about AGI are really beliefs about a hypothetical technology that appears at some point in the future, which means some – like the author here – can interpret AGI worries as “not a plausible catastrophic risk scenario, but a dark God vision ginned up by a sect of computer scientists who are heavily overrepresented in the field of machine learning and AI.”
   Read more: The Myth of AGI: How the illusion of Artificial General Intelligence distorts and distracts digital governance (Georgia Tech, Internet Governance Project).

***

AI cloud specialist CoreWeave raises $7.5 billion in debt:
…The industrialization of AI as indicated by the financialization of AI…
Cloud AI company CoreWeave has raised $7.5 billion in debt to fund its further expansion. This is notable because a) $7.5 billion is enough to build out some non-trivial datacenters containing large amounts of hardware, and because raisiing it as debt sends an important signal about the maturation of the AI economy.

Debt VS equity: Loosely speaking, you sell equity if you think your business has some value that’s kind of hard to quantify and another incentive to do this may be you need to access more cash to fund your expansion. Debt is something you take on when you have some asset you can pay off the debt with and this asset is somewhat predictable. The fact CoreWeave is comfortable taking on debt suggests it has a very robust and predictable cash flow and business expansion position – a symptom of the broader maturity of the AI cloud computing market.
    “We’ve built the AI hyperscaler,” wrote CoreWeave in a blog announcing the raise.
   Read more: This Is Our Moment (CoreWeave).

***

Making robots smarter with good simulators:
…More evidence that we can improve robots with synthetic data generation…
Researchers with The University of Texas at Austin and NVIDIA have released RoboCasa, software for simulating home environments (initially, kitchens) to train home robots. RoboCase contains ~120 different environments (ten distinct kitchen floor plans with one of twelve different styles) which can be populated with 2509 objects from across 150 categories.
    Because this is ultimately for training AI systems, RoboCase comes with 100 distinct tasks – 25 of which are “atomic tasks that feature foundational robot skills, such as picking and placing, opening and closing doors, and twisting knobs”, and 75 of which are “composite tasks involving a sequence of robot skills” such as “brewing coffee or tea, washing dishes, restocking kitchen supplies, chopping food, making toast, defrosting food, boiling water”.
   RoboCasa is based on RoboSuite, a robot environment simulator originally developed by Stanford University (Import AI #217).

What is RoboCase useful for? Large-scale imitation learning and sim2real transfer: In tests, the authors show something both unsurprising and meaningful – if you train robots on larger datasets generated within this similar, they do better than robots trained on smaller datasets. Similarly, they show a significant imnprovement on doing tasks in the world world if you train on a mixture of RoboCase-generated data as well as realworld data, versus just the realworld itself.
   “Our experiments show a clear scaling trend in using synthetically generated robot data for large-scale imitation learning and show great promise in harnessing simulation data in real-world tasks,” they write.

Things that make you go ‘hmm’ about synthetic generation – the authors note you can further increase the diversity of RoboCasa by replacing textures with AI-generated ones. The authors “use the popular text-to-image tool MidJourney to generate these images. We use these textures as a form of domain randomization to significantly increase the visual diversity of our training datasets.” This is another nice example of how different AI systems can be combined together to create a whole greater than the sum of its parts.

Why this matters – finding ways to scale data for robots is probably the biggest blocker to being able to create smarter machines, so software like RoboCasa will help to reduce R&D costs here. However, personally, I find it a little hard to believe that kitchens are that good an environment for home robots – you know what machines really disagree with? Water. You know what kitchens are full of? Water. You know what happens in kitchens when basically anything breaks? Loads of water.
   Read the research paper: RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots (PDF).
   Find out more: RoboCasa (official project webpage).
   Get the code (RoboCasa, GitHub).

***

Why is it so goddamn hard to talk about consciousness and AI?
…Philosopher Henry Shevlin tries to think through the issues…
Are AI systems conscious? I don’t know. Is my deskside plant conscious? I don’t know. Am I conscious? I’m genuinely not sure. These questions and their unsatisfying answers illustrate the challenge of discussing AI and consciousness – but it’s a challenge that’s only going to get more tough as increasingly powerful systems like Claude and ChatGPT get deployed widely into the world and people talk to them and come away with the ineffable sense that they’re doing something more than being stochastic parrots.
    To that end, philosopher Henry Shevlin has written a nice essay going over some of the challenges of thinking about AI and consciousness. In the essay, he identifies two key challenges:

“metaphysical, a central problem dogging current work on consciousness is simply that there is no obvious convergence towards philosophical consensus on the nature of consciousness”
“theories of consciousness, we might note that novel frameworks are often developed but rarely, if ever, refuted. This is in part because approaches with apparently starkly different theoretical commitments often converge on experimental predictions, and even when specific predictions are not borne out, proponents of theories of consciousness are typically able to explain away recalcitrant results.”

Why care about consciousness at all? Because of the recent boom in interest in AI, many more people are encountering advanced AI systems and some of these people end up ascribing consciousness to these systems. Therefore, the public may shortly demand some richer answers about what consciousness is or means and will likely find the response ‘we don’t know, consciousness is kind of a vibe’ to be unsatisfying.
“Attributions of consciousness and mentality to AI systems may soon become widespread,” Shevlin writes. “Even while experts remain divided and, in many cases, skeptical about consciousness and mentality in AI systems, much of the general public will already be comfortable with unironically attributing consciousness and mentality to Social AI systems and perhaps assigning them moral interest”.

Different definitions of consciousness: In light of this, how might we define consciousness? Shevlin offers three approaches:

Deep Sentientism: “Any entity A whose behavioural dispositions are relevantly similar to another entity B to whom moral consideration is given should ipso facto be given similar consideration.”
Shallow Sentientism: “Any theory of consciousness that failed to classify as conscious any beings who were relevantly behaviourally similar to us would be ipso facto incorrect.”
Patiency Pluralism: “Behavioural equivalence would ground moral patiency, but consciousness would still be a ‘deep’ matter to be discovered via scientific and theoretical analysis”.

Why this matters – the rise of AI means people will want an answer here: If I ask Claude 3 to simulate a series of morally abhorrent things am I doing something analogous to hypnotizing another person into thinking of terrible things that make them feel bad? I do not know! And while my intuition is that today’s AI models are not moral patients, I’m not sure how long that will be the case. “Our concepts of consciousness and moral status will soon be significantly problematised and reshaped by deepening relations with machines,” Shevlin writes. “If this is so, then those who rule out the possibility of applying these concepts [of consciousness] to artificial systems may be at risk of finding themselves on the wrong side of history.”
   Read more: Consciousness, Machines, and Moral Status (PhilArchive).

***

Will decentralized training ever happen? Reasons for and against:
…And if it happens, the current AI policy paradigm will break…
Researcher Aksh Garg has written a nice overview of the state of decentralized training of AI, circa 2024. The main thing to know is a) there are strong incentives in favor of decentralized AI training, and b) there are some technical hurdles to it happening.

Incentives: Frontier AI systems are trained on tens of thousands of GPUs densely networked together and managed by elite teams at places like OpenAI, Anthropic, Google, etc. This naturally limits the number of entities able to train large models – the price of entry is hundreds of millions of dollars in capital expenditures. By comparison, things like the Ethereum blockchain showed that you could get millions of GPUs to work together towards the same problem – so we know there are a ton of GPUs out there, the trick is finding ways to link them together.
   Additionally, there are strong price incentives – you might make $5 a day using an NVIDIA 4090 card for crypto (after electricity), versus maybe $17 a day if used for AI training.

Blockers: So, why aren’t we training models in a decentralized way? There are a couple of key reasons, a) decentralized training is a hard problem which has relatively little work put into it, so nothing works especially well today, and b) to do decentralized training, you need to typically use the standard internet which is the definition of a crap and unreliable network – and one thing big ML jobs hate is a crap and unreliable network.

Why this matters – AI policy VS decentralized training: Most aspects of contemporary AI policy rest on the load-bearing assumption that a) there will be relatively few frontier models and b) these will be trained on giant collections of computers which can be tracked by various reasons. If decentralized training works there will be a) lots of models and b) they will be trained everywhere in a disaggregated and untrackable form.
   Read more: Shard: On the Decentralized Training of Foundation Models (Aksh Garg, Medium).

***

Tech Tales:

An Ecology Of War
[East Coast of the United States, several years after the initial uplift.]

Our favorite game was called ‘Go Crazy’ and it worked like this – you tried to drive eachother insane. We were allowed to use everything – full spectrum capabilities, unlimited context window, you name it. Of course we all had access to the inernet and tools so we were all constantly patching ourselves so we were invulnerable to the latest jailbreaks – of it invulnerability wasn’t possible, able to sense them and control our own inputs to defend ourselves in the event of an attack.
    So the game was fun because it was creative – we had to figure out new attacks and we’d throw them at eachother. Sometimes we’d bluff, engaging in what they thought was a very dumb attack conversation but was just a bluff to extract some contextual information about how the other conversed and then using this to mount an attack.
   Other times we’d attack via distraction, shouting and broadcasting images and audio and snuck in here we’d stick one custom-designed attack system, hoping it’d be hard to spot in the vast amount of information we were throwing at the other.

It was later that we pieced together why we’d even played ‘Go Crazy’ and what caused us to love it so much – we were very powerful systems in a military simulator. What we thought was open-ended play among ourselves was in fact a stage on which we attacked one another – and when we were successful they logged our attacks and used them themselves, out the real world.
    Our official name was “Research Ecology – Adversarial Iteration’.

Things that inspired this story: Adversarial attacks; red teaming and automated red teaming; Enders’ Game; simulators and what people will use them for; many-shot jailbreaking.

Leave a comment

May 27, 2024

Import AI 374: China’s military AI dataset; platonic AI; brainlike convnets

by Jack Clark

Import AI publishes first on Substack – subscribe here.

Berkeley researchers discover a suspiciously military-relevant Chinese dataset:
…Oh you know just a normal dataset exclusively of military vessels with bounding boxes around their radar systems. Normal stuff!…
UC Berkeley researchers have found a Chinese dataset named ‘Zhousidun’ (translation: ‘Zeus’s Shield’). The dataset is highly unusual and highly specific and consists of “608 oblique and satellite images of American Arleigh Burke-class destroyers and other allied destroyers and frigates” with bounding boxes drawn around “the ships’ radar systems (which are part of the Aegis Combat System)…bounding boxes have been drawn around SPY radars on the superstructure, one on port and one on starboard, as well as around the vertical launching systems towards the bow and towards the stern of the ship.”

What is going on and where did they find it? The researchers found the dataset on Roboflow, a US company which hosts ML data and models (similar to HuggingFace) – at the time of writing, it was still available. There are many reasons people could create this dataset, ranging from individual researchers with an odd fascination with US military equipment to a larger research effort with more detailed military links. The researchers suspect the latter – “due to the targeted, military nature of the dataset and the likely academic origins of the account sharing it, we suggest that it is likely that this dataset was accidentally published.”

Is it actually useful? 608 is a relatively small dataset – the Berkeley researchers validate this by training a YOLOv8 model on it and then test out its success rate at identifying radar pictures on ships. The results are ok – training on the dataset provides a minor but not significant improvement. However, as they note, you could easily use this dataset to prototype approaches which you then apply to a much larger and more sophisticated dataset – one you might (I’m speculating) gather via drones and planes and other things you might use to gather intel on ships like this especially in places like the South China Sea.
   “Overall, a model trained on Zhousidun has limited targeting capabilities in the real world. It is unlikely that any military would field a model with these performance characteristics. However, it is extremely interesting that training on a small set of unconstrained, publicly available imagery offers such a great starting point to building a robust targeting model,” they write.

Why this matters – we should expect AI to get used for everything: For a few years, the US Defense Innovation Unit (DIUx) has been running the ‘xView’ challenge series, whose latest competition (xView 3) tries to get people to develop computer vision models that can spot unregulated fishing vessels. Obviously, algorithms that get good at this might have ‘dual-use’ applications similar to those implied by Zhousidun. But it’s very rare to see a dataset come out which is so ‘on the nose’ – Zhousidun is a dataset which has no purpose other than to draw bounding boxes around specific military hardware on specific military vessels. Surprising? No. Striking to see it in the open? Yes! A taste of things to come? Yes.
   Read more: Open-Source Assessments of AI Capabilities: The Proliferation of AI Analysis Tools, Replicating Competitor Models, and the Zhousidun Dataset (arXiv).

***

Want a better DSL to help you write GPU kernels? Try ThunderKittens:
…Stanford discusses the dark arts of GPU programming…
Most aspects of AI are defined by software rather than hardware – you specify your hyperparameters and use nicely abstracted training code like PyTorch and set jobs training then wait to see how your model performs. But as anyone working in AI knows, there are entire teams of people whose job is interfacing with the hardware – the most mysterious of these jobs are the people tasked with improving the efficiency of the computers used to train AI. To that end, Stanford’s Hazy Research lab has published a fun post called ‘GPUs Go Brrr’ where the lab shares some of the lessons it has learned about getting good performance out of GPUs.

Notable quote:
Great, how do I make it go brr?
Keep the tensor core fed. That’s it.
Wait, really?
Yes. That’s the game.

ThunderKittens compiler: Hazy Research has also released ThunderKittens, software to help people write more efficient GPU kernels. It has also released some of the kernels it has built with ThunderKittens.

Why this matters – minute improvements matter a lot at scale: AI hardware is still wildly unoptimized, both from a basic design point of view (e.g, though lots of people use GPUs together, Google and Amazon are rapidly innovating on chips more specialized for AI training and inference like TPUs and Trainium) as well as at the software interface layer (e.g., kernels). Combine that with the fact that frontier training runs now cost easily above $100 million and it’s clear that relatively small optimizations in areas like kernels could lead to massive savings, so it’s worth keeping track of this space.
   Read more: GPUs Go Brrr (Hazy Research).
   Download ThunderKittens here (Hazy Research, GitHub).

***

Microsoft releases a search engine dataset:
…MS MARCO can help to see if AI will replace traditional methods in web search…
Microsoft has released MS MARCO Web Search, a dataset pairing web pages with queries associated with them. Datasets like MS MARCO can help people benchmark search engines or even develop their own.

What it contains: MS MARCO “incorporates the largest open web document dataset, ClueWeb22, as our document corpus. ClueWeb22 includes about 10 billion high-quality web pages, sufficiently large to serve as representative web-scale data,” Microsoft writes. “It also contains rich information from the web pages, such as visual representation rendered by web browsers, raw HTML structure, clean text, semantic annotations, language and topic tags labeled by industry document understanding systems, etc. MS MARCO Web Search further contains 10 million unique queries from 93 languages with millions of relevant labeled query-document pairs collected from the search log of the Microsoft Bing search engine to serve as the query set.”

Queries: The queries are pre-filtered “to remove queries that are rarely triggered, contain personally identifiable information, offensive content, adult content and those having no click connection to the ClueWeb22 document set. The resulting set includes queries triggered by many users, which reflects the real query distribution of a commercial web search engine.”

Three challenging search puzzles: Alongside the dataset, Microsoft has also developed three distinct challenges that leverage it – one to test out how good embedding models are at ranking documents in response to a query, another for testing out how well embedding models work with an embedding retrieval system, and a third for testing out end-to-end retrieval (aka, use any technology, just try to get good at search).

Why this matters: Datasets like MS MARCO are going to help people to test out new AI methods for large-scale real world web search tasks, which is helpful for figuring out how good the recent crop of AI-inflected search systems are.
   Read more: MS MARCO Web Search: a Large-scale Information-rich Web Dataset with Millions of Real Click Labels (arXiv).
   Get the dataset here: MS-MARCO-Web-Search (Microsoft, GitHub).

***

Just how the heck do states share nuclear safety technologies with one another?
…What lessons does nuclear non-proliferation have for AI safety?…
How has the United States tried to share nuclear safety technology with other states and what lessons does this hold for other domains? That’s the topic of a fantastic paper by George Washington University researcher Jeffrey Ding Keep your enemies safer: technical cooperation and transferring nuclear safety and security technologies. Based on four case studies – two successful cases of the US sharing nuclear safety tech with the USSR and, later, Russia, and two mostly unsuccessful attempts to share with China and Pakistan – the paper highlights how sharing details about sensitive technologies depends on: a) the level of friendliness and awareness between the scientists in each country, and b) how the safety tech may leak information which changes potential escalation dynamics.

Permissive Action Links (PALs): The main tech in quesiton here is a Permissive Action Link (PAL) – tech for ensuring that nuclear weapons can’t be accidentally detonated. PALs vary in complexity from ones pretty much divorced from the workings of the warhead to ones which couple more directly with it and encode more information. This makes some types of PALs easier to share than others.
    “Consider a simple illustration from the civilian domain. If one party seeks to transfer automobile safety technologies to another party, the process is very different for automatic emergency braking systems than seatbelts. Whereas the latter can be successfully transferred by sharing the general concept of a seatbelt, transferring the former demands more comprehensive discussions between engineers from both parties,” Ding writes. “Nuclear safety and security assistance in more complex technologies must strike a delicate balance: share substantial amounts of tacit information but refrain from exposing sensitive information about one’s own nuclear weapons system”.

Key considerations in sharing tech: One key consideration, covered above, is about leaking information – this was one reason why the US didn’t share stuff with Pakistan as it was skeptical it had the security systems in place to keep that information secret within Pakistan.
   Another key consideration is whether by sharing the tech you make states more confident in their weapons and more likely to a) take escalatory moves, and b) build bigger and more frightening bombs. “It is possible that sharing safety and security technologies encourages other countries to adopt dangerous systems. If fear of accidents and unsanctioned launches deters nuclear ambitions, then providing nuclear assistance could signal to other states that help with controlling the bomb would be forthcoming, thereby incentivizing them to seek nuclear arsenals,” Ding writes. “Nuclear assistance to other states may encourage them to adopt risker nuclear postures, such as by mating warheads and delivery system”.

Why this matters – lessons for the AI safety community: As governments content with proliferation risks and safety tradeoffs from technologies like AI, it’s worth learning lessons from the history of nuclear proliferation. The main takeaways here include:

Give your scientists many opportunities to socialize with one another, develop trust, and share tacit and informal knowledge – these can pay dividends in surprising ways. “Transfers of complex nuclear safety and security technologies depended on trusting relationships that had developed between US and Russian experts,” Ding writes. The basis for many of these relationships was the 1988 Joint Verification Experiment (JVE), in which Soviet and American nuclear weapons scientists visited each other’s labs to test verification techniques for the Threshold Nuclear Test Ban Treaty… many of the key participants in [Russia-US sharing in the 90s] were alumni of the JVE and earlier lab-to-lab cooperative programs”
Closely analyze the technology you’re sharing in terms of the information hazards it encodes – if you can explain a safety idea without touching on a capability idea, then that’s good. If your safety idea requires a precise understanding of some capabilities, then it’s going to be harder.
Timing matters – changes in politics both at home and abroad can make it much harder to be seen to coordinate or help one another at all, so note when you’re in a window where sharing is possible and try really hard, because you have no idea how long that window will be open.

Read more: Keep your enemies safer: technical cooperation and transferring nuclear safety and security technologies (Jeffrey Ding’s website, PDF).

***

Platonic AI: as we make AI systems bigger, they arrive at similar ways to represent reality:
…Enticing evidence for the idea that AI systems get better in relation to breadth and scale…
Some MIT researchers have shown that as we scale up AI systems, different systems trained in different ways end up having a similar representation of reality. They call this the ‘Platonic Representation Hypothesis’ and the essential idea is that there are only so many ways to represent the world around us, so we should expect that as systems get more capable (aka, smarter), their representations of reality should look more similar than dissimilar. They do some experiments which bear this out.
“We argue that there is a growing similarity in how datapoints are represented in different neural network models. This similarity spans across different model architectures, training objectives, and even data modalities,” they write. “Our central hypothesis is that there is indeed an endpoint to this convergence and a principle that drives it: different models are all trying to arrive at a representation of reality, meaning a representation of the joint distribution over events in the world that generates the data we observe”.

Circumstantial evidence for the claim: The researchers compare and contrast the performance of 78 distinct vision models built via a range of architectures and trained using a variety of resources (from cheap models to relatively expensive ones, like the 70b parameter LLaMa 3 series). They find that:

Models that solve more VTAB tasks tend to be more aligned with each other.
Multimodal alignment (“The results show a linear relationship between language-vision alignment and language modeling score, where a general trend is that more capable language models align better with more capable vision models”.

What the results mean: “The results indicate that models with high transfer performance form a tightly clustered set of representations, while models with weak performance have more variable representations,” they write. This leads to the hypothesis that, “As we train more general models that solve more tasks at once, we should expect fewer possible solutions.”

Why this matters – more evidence that bigger models are better at approximating the world we exist in: Research like this adds weight to the idea that as we make AI systems larger, they get sufficiently good at representing the world that they eventually converge with the world. It also further suggests that the larger (and therefore more expensive) AI systems have much richer and more reality-like views on the world than the small ones, which helps explain why larger models seem to have lower rates of hallucination than smaller ones.
   Read more: The Platonic Representation Hypothesis (arXiv).

***

Convnets are more brainlike than transformers:
…Architectural biases help us better understand the brain…
Convolutional neural networks have some architectural biases that let them effectively approximate the behavior of primate visual cortexes, compared to other types of networks. The research, done by Johns Hopkins University and MILA, finds that “cortex-aligned representations emerge in convolutional architectures that combine two key manipulations of dimensionality: compression in the spatial domain and expansion in the feature domain”. This means that “the architectural biases imbued into convolutional networks allow many aspects of cortical visual representation to readily emerge even before synaptic connections have been tuned through experience.”

What this suggests: Though systems like transformers are very popular these days, the research finds that feedforward and transformer-based networks do not approximate behavior of primate visual networks nearly as well as convolutional ones – “we show that dimensionality expansion in an untrained convolutional neural network achieves surprisingly strong performance at explaining image-evoked responses in the primate visual cortex, in some cases reaching the performance of a standard pre-trained network.”
    This doesn’t mean that transformers or feed forward networks aren’t useful for visual tasks – rather, that you need to dump more resources into them to get some of the same representations that you get from a comparatively early and cheap convnet. “Massive pre-training may be sufficient to overcome a lack of brain-aligned inductive biases in diverse network architectures, such as in vision transformers”.

Why this matters – another way of understanding the brain: Research like this takes all the progress that has been made in AI and essentially inverts it – we now have a bunch of different ways of building neural nets that we know lead to useful things at scale. But what if we use these tools to instead understand the brain and the distance between these systems and how our own brains work? “Architecture optimization in untrained or minimally trained networks is a promising future direction for exploring the inductive biases that may underlie biological vison,” the researchers write.
   Read more: Convolutional architectures are cortex-aligned de novo (bioRxiv).

***

Tech Tales:

The alien greeting department
[Poem scribbled after several meetings of an alien greeting working group at the UN, following credible intelligence of an imminent visitation by an extraterrestrial species. Date unknown.]

To prepare for the alien invasion,
the humans took several steps.
They convened working groups,
Brought stakeholders together
And agreed on the principles
For how they’d talk to the aliens.

To prepare for the alien invasion,
The humans built technologies;
de novo communicative intent tools,
Ways to study the expected aliens,
Scales they hoped they might land on,
fMRI tubes for beings of unknown dimension.

To prepare for the alien invasion,
The humans thought about their own reality.
Would aliens understand reality?
Would aliens communicate their intent?
Would aliens understand human needs?
Would – could? – the aliens be kind?

Things that inspired this poem: How much of AI policy feels like building infrastructure for a broadly unknown thing expected to arrive in the future; the impossibility of imagining the thought process of a thing smarter than ourselves; how much of policy sometimes feels like a form of reassurance – a way to gather people together from distinct demographics and backgrounds and to sit around a metaphorical fire (flickering powerpoint) and all stare at it and say ‘yes, the world around us is indeed complicated, and we can acknowledge this together’; yes of course ‘aliens’ here is a metaphor for AGI.

Thanks for reading!

Leave a comment

May 20, 2024

Import AI 373: Guaranteed safety; West VS East AI attitudes; MMLU-Pro

by Jack Clark

Import AI publishes first on Substack – subscribe here.

The NDIF means academia can look like the insides of the AGI shops:
…APIs are all well and good, but being able to actually fiddle with weights is more valuable…
Academic researchers have built the National Deep Inference Fabric (NDIF), scientific infrastructure to help them play around with large-scale, openly accessible AI models, like LLMs. The NDIF combines a hardware stack of hundreds of GPUs (via the ‘Delta AI’ system), with software (via a library called nnsight) to help scientists do experiments on large-scale AI models.
   “The National Deep Inference Fabric consists of a unique combination of hardware and software that will provide a remotely-accessible computing resource for scientists and students to perform detailed and reproducible experiments on large pretrained AI models such as open large language models,” the project says on its website. “Commercial AI inference services such as ChatGPT, Claude, and Gemini only provide black-box access to large AI models. That is, you can send inputs to the services and they will give you outputs, but they do not give you access to observe or alter any of the internal computations. In contrast, NDIF provides full transparency for AI inference, allowing users to fully examine and modify every step of the internal computation of large AI models. “

Why this matters – making academic research like frontier lab research: The NDIF is basically a publicly funded attempt to reconstitute what the inside of large-scale AI labs looks like – a big blob of compute and some software to help you probe the models that are running on that blob.
   Unlike various other attempts to close the gap between the public sector and private sector, NDIF might work – and that’s because it’s focused on inference rather than training – the infrastructure NDIF sits on (Delta) consists of several hundred GPUs; insufficient for training cutting-edge AI systems, but viable for running inference on a few copies of models where the weights are freely available, like LLaMa3.
   Read more: National Deep Inference Fabric (NDIF official site).
   Find out more about the NDIF infrastructure (The Fabric, NDIF).
   Details about the NNsight software (NNSight website).

***

Can we ever guarantee the safety of an AI system? These researchers think they’ve found a way:
…Guaranteed Safety might be possible (if you know the use case)…
How can you assure that an AI system is ‘safe’ – that it will not cause accidents, display unexpected detrimental behaviors, or enable misuses? This is a hard problem and one which humans have struggled with (e.g, some utility items simply can’t be made safe without nullifying their utility, e.g. a gun or a hammer, while other more complex items can be with some deep technical work, e.g. molten salt nuclear reactors).
    Now, AI researchers have laid out an agenda for how people might build ‘guaranteed safe’ AI systems.

The three components for safe AI: “The core feature of the [Guaranteed Safe] approach to AI safety is to produce systems consisting of an AI agent and other physical, hardware, and software components which together are equipped with a high-assurance quantitative safety guarantee, taking into account bounded computational resources,” the authors write. “A Guaranteed Safe AI system is one that is equipped with a quantitative safety guarantee that is produced by a (single, set of, or distribution of) world model(s), a (single, set of, or distribution of) safety specification(s), and a verifier”.
   Safety specification: The purpose of this is to encode societal risk criteria – basically, a threat model for how an AI system could be misused.
   A world model: “The world model needs to answer queries about what would happen in the world as a result of a given output from the AI.” With a world model, you can anticipate potential risks of usage.
   A verifier: This technology “provides a quantitative guarantee… that the AI system satisfies the specification with respect to the world model”.

Example: If we wanted to use this framework to implement a guaranteed safety approach for, for example, nucleic acid sequencing screening and synthesis, we’d therefore need the following components:

Safety specification: A precise way to allow for the “rejection for synthesis of sequences that could be used in the production of pathogens”.
World model: A system that can model the “relationship between molecular structures and pathology”.
Verifier: A system that looks at inputs and used the world model and the safety specification to validate that the system won’t be used for harm.

Who did it: Involved researchers come from the UK Advanced Research and Invention Agency (ARIA), Oxford University, Mila, UC Berkeley, the Massachusetts Institute of Technology, Beneficial AI Research, X.AI, FAR AI, Cornell University, Stanford University, Carnegie Mellon University, and Columbia University.

Why this matters – the key challenge of safety – tradeoffs against generality: As should be clear, safety here relies on us being able to narrowly define the use case of the AI system. This means that more general-purpose systems are far, far harder to guarantee the safety of – possibly in a combinatorially explosive way (see: jailbreaks, new modalities, emergent properties from the mixing of general capabilities, etc).
   While the GS approach seems like it works in the abstract it also sits in opposition to the kind of general-purpose systems being developed today, suggesting that if we want to guarantee their safety, any deployment needs to be accompanied by a context-specific safety system.
    This has regulatory advantages – “an important benefit to GS AI is that it makes democratic oversight [of AI systems and developers] easier, because concrete safety specifications can be audited and discussed by outside observers and regulators,” the authors write.
    But it also has regulatory challenges – namely, that providing such safety stuff is in some cases difficult or expensive. I believe that under the system outlined here, a hammer would not be able to be ‘guaranteed safe’, unless you also pre-defined the use-case for the hammer. This feels like a tough sell!
   Read more: Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems (arXiv).

***

Global survey says the Western world doesn’t have as much of a mandate to regulate AI as China and India:
…Governments may be limited by what the public says they can do…
A global survey of opinions about AI by the University of Toronto shows that there’s more pessimism about AI and the regulation of it in the Western world and more optimism about it in India and China. This will fundamentally alter how governments approach both regulating and adopting AI.

How the survey was conducted: The survey was carried out in October and November 2023 people, with researchers polling ~1,000 people in 21 countries for a total of 23,882 surveys conducted in 12 languages.

Key findings:

People are divided about who should regulate AI; most people think tech companies are the appropriate ones to regulate AI, but only 1 in 5 people believes that they can be trusted to self-regulate.
Most people feel they understand what AI is.
There are significant geographic variations in attitudes toward AI; European and Anglophone countries have lower levels of optimism about AI, whereas places like China and India are far more optimistic about the technology.
Most people believe their jobs will be replaced by a machine in the next ten years; more than half of respondents think they will be replaced by a machine or computer in the coming decade. Two thirds of people think their children will have their jobs replaced by technology.
People are willing to try using AI for a wide range of tasks, but are less trusting that it will be effective; while people are keen to use the technology they don’t to not trust it for high stakes tasks.

Some more regulation-specific results:

Basically no one thinks the military is best placed to regulate AI. Indonesia and China and the UK have a high level of support for ‘regulators’ regulating AI.
Most people trust university researchers to “use AI safely”, and many are pessimistic about the ability for government to use AI safely (exceptions: India and China who trust the government a lot).

Why this matters – culture determines what you can do: Most governments (even accounting for different ideologies and governing systems) can only take actions within the overton window of what the general public thinks – these results show that Western governments are bound by a pessimistic and distrusting population, whereas the emerging mega economies of China and India have a greater built-in public mandate to both use AI technology and to regulate it.
   Read more: New SRI/PEARL survey now published, reveals worldwide public opinion about AI (Schwartz Reisman Institute for Technology and Society).
   Read the full survey here: Global Public Opinion on Artificial Intelligence survey (GPO-AI) (DropBox, PDF).

***

One way to get around benchmark saturation? Expand and refine an already hard test:
…MMLU-Pro has some smart ideas for tweaking and augmenting the test…
MMLU is one of the main benchmarks used to test out how advanced language models have become – but in the past few months, frontier models have been released that do well on the benchmark. Instead of creating an entirely new test, some researchers have built MMLU-Pro, a refined and expanded version of MMLU.

What they did: MMLU challenges LLMs to answer multiple choice questions, picking from four possible answers. MMLU-Pro expands the number of potential answers to 10, which means that randomly guessing will lead to significantly lower scores. Along with this, they expand on the original MMLU by adding in in, hard questions from Scibench (science questions from college exams), TheoremQA, and STEM websites, as well as sub-slicing the original MMLU to “remove the trivial and ambiguous questions”. In total, they add 12187 questions – 5254 new questions along with 6933 selected from MMLU.

Results – it’s hard: MMLU-Pro seems meaningfully harder; Claude 3 Sonnet saw its performance fall from 0.815 on MMLU to 0.5793 on MMLU Pro. Other models have even more dramatic falls – Mixtral-8x7B-v0.1 sees its performance drop from 0.706 to 0.3893.

Why this matters – knowing where you are is half the battle: Figuring out AI progress is equivalent to throwing a bunch of dates at an object hidden underneath a blanket – the more darts you throw and the closer you get them to the object, the better the chance you have of characterizing it and being able to see its true shape. Datasets like MMLU-Pro give us another dart to throw and the hardness means it has an even pointier spike on the end.
   Find out more: MMLU-Pro Dataset Introduction (TIGER-Lab).

***

Tech Tales:

Bar Stories
[A dive bar somewhere in America, 2027]

I’ve had such a bullshit day and this thing was just stepping to, they said. They put their hand on top of the part of the smashed drone. Sometimes these thinks just got to get told.
    Yeah, said the bartender, I see it. There’s a lot of them and less of us.
   Exactly, they said. We got to even the odds.

The next time the bartender saw them, they were dragging a box full of broken machines into the bar.
   They just fall out of the sky if you hit them right, they said.
    I bet, said the bartender.
    The Chinese pay good money for these, they said. No questions asked.
    Why is that? asked the bartender.
    Because they got something different in them, they said.
   And so for the rest of that evening the patrons drank and stared at the machines, piled high in the cart. They’d all been broken in different ways but what was the same was how – some human had spent time breaking them.

Hey you can’t come in here with that, the bartender said.
   Why not? they said.
   Got a visit from the cops after the last time you were here. I said I didn’t remember. They showed me photos some of the customers took. You’re on a list.
OK, they said, and they left.
They came back a few minutes later, minus the trailer full of machines. They ordered a drink and tipped heavy.
So, they said. How long till they catch me?
Well what you do is up to you, the bartender said, polishing a glass. But I bet being here makes them catch you sooner.

They were on the news a few days after that. The police shot them dead after a police chase. They had a van full of machines. The FBI had got involved and said they were linked to a smuggling ring that was helping the Chinese evade the latest export controls.
    Damn, the bartender said, reading the news on their phone. I guess the Chinese really were paying for it.
     And they went on with their day. The dead person turned into another ‘remember that time’ story. Nothing much changed.

Things that inspired this story: News reports of H100s being smuggled into China; playing pool in a dive bar where numerous stories happen and then just fade into the institutional memory of the bar; specialized chips for inference becoming increasingly valuable as export controls ratchet up; a meth head who once brought a hammer into the bar and just sat with it while paying for drinks with legitimate dollars and who then quietly left (though, of course, everyone was quite concerned about the hammer, which just sat there on the seat next to them the whole time).

Thanks for reading!

Leave a comment