Import AI 311: Distributed GPT busts the political economy of AI; Apple optimizes Stable Diffusion; AI war startup raises $1.48 billion

by Jack Clark

Test out your coding model on a fuzzed benchmark:
…DS-1000 pits code models against 1,000 tasks spread across seven Python libraries…
Researchers from the University of Hong Kong, Peking University, Stanford University, Berkeley, the University of Washington, Facebook, and Carnegie Mellon University have built DS-1000, a set of 1,000 data science problems spanning seven Python libraries. This is both a dataset and a benchmark, and it is useful for building code models like CodeGen or Copilot.

What’s in DS-1000? The dataset contains 1000 problems drawn from 451 distinct StackOverflow problems. “To defend against potential memorization, more than half of the DS-1000 problems are modified from the original StackOverflow problems; they include 152 surface perturbations, 235 semantic perturbations, and 162 difficult rewrites,” the authors write. DS-1000 contains problems in NumPy, SciPy, Pandas, TensorFlow, PyTorch, Scikit-learn, and Matplotlib. “The problems in DS-1000 represent more diverse and naturalistic intent and context formats that cannot be seen in any other datasets,” they write. 
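
To make the format concrete, here's a hedged sketch (not an actual DS-1000 item) of the general shape of such a problem: a natural-language intent plus code context, a model completion, and a functional check that executes the completion and tests the result.

```python
# Hypothetical, DS-1000-style problem: natural-language intent, code context,
# and an executable check on the model's completion. Not an actual DS-1000 item.
import pandas as pd

PROMPT = """
Problem: I have a DataFrame with a 'price' column. How do I add a
'sale' column that is half the price?

df = pd.DataFrame({"price": [10.0, 25.0, 40.0]})
# complete the next line(s):
"""

# Pretend this string came back from a code model.
model_completion = 'df["sale"] = df["price"] / 2'

def check(completion: str) -> bool:
    """Run the code context plus the completion, then test the result functionally."""
    env = {"pd": pd}
    exec('df = pd.DataFrame({"price": [10.0, 25.0, 40.0]})', env)
    exec(completion, env)
    return list(env["df"]["sale"]) == [5.0, 12.5, 20.0]

print(check(model_completion))  # True if the completion solves the problem
```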

How hard is it? The best-performing model (OpenAI’s Codex) gets, at most, about 40% on tasks like insertion, followed by Salesforce’s CodeGen at ~8.4% and Facebook’s InCoder-6B at ~7.5%. This is great news, as it suggests it’s a hard benchmark. 
   Read more: DS-1000: A Natural and Reliable Benchmark for Data Science Code Generation (GitHub).
   Get the code here: DS-1000 Data Science Code Generation (GitHub).

####################################################

Apple optimizes Stable Diffusion on Apple silicon:
…World’s most valuable company + world’s most proliferated generative model…
Apple has significantly cut the time it takes to generate images from Stable Diffusion on Apple silicon. It’s notable that the world’s most valuable company has tacitly adopted the world’s most widely distributed (and quite controversial) generative image model, and it’s perhaps a sign of things to come: release the weights of your model, and vast companies may expend engineering resources to make it run more efficiently on their hardware. 

   “This release comprises a Python package for converting Stable Diffusion models from PyTorch to Core ML using diffusers and coremltools, as well as a Swift package to deploy the models,” Apple writes.
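
For a sense of what the Python side of that involves, here’s a minimal, hedged sketch of the general PyTorch-to-Core ML path that coremltools provides: trace a module, convert the traced graph, and save a Core ML package. The TinyNet module below is purely illustrative; Apple’s actual package does this (with far more care) for the real Stable Diffusion UNet, text encoder, and VAE.

```python
# Minimal sketch of the PyTorch -> Core ML path that Apple's package automates.
# TinyNet is a toy stand-in for a Stable Diffusion component.
import torch
import coremltools as ct

class TinyNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(16, 16)

    def forward(self, x):
        return torch.relu(self.linear(x))

model = TinyNet().eval()
example = torch.randn(1, 16)

# Core ML conversion operates on a traced (TorchScript) graph.
traced = torch.jit.trace(model, example)
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="x", shape=example.shape)],
    convert_to="mlprogram",  # the newer Core ML model format
)
mlmodel.save("TinyNet.mlpackage")
```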

Why this matters – on-device AI: Most AI models need to be sampled from via large computers, typically servers running top-of-the-line GPUs. Large language models, for instance, can take tens of GPUs to sample from in a reasonable time. Image models, while cheaper to sample from, can still be expensive. With this release, Apple has made it significantly faster for people to generate Stable Diffusion images on their local devices – in other words, you could be sitting in the back of a cab in a place with no cell reception, idly generating images on a laptop equipped with an M1 or M2 chip. 
   Read more: Stable Diffusion with Core ML on Apple Silicon (Apple Machine Learning Research blog).
   Check out detailed notes here: Core ML Stable Diffusion (Apple GitHub).

####################################################

Want to see if your object detection system works in the real world? Try out Roboflow100:
…RF100 – a reassuringly difficult and diverse benchmark…
Roboflow, a computer vision startup, has released Roboflow-100, a large-scale object detection dataset. What makes Roboflow-100 different is that, much like recent multi-task benchmarks such as SuperGLUE (an NLP benchmark suite), it takes multiple distinct datasets (in this case, 100) and puts them together into a single suite. This kind of thing tends to be really useful as it helps people work out if their models are overfitting or are actually capable of decent generalization.
   Another difference is that the data is sourced from real jobs by real users of Roboflow, so this is less an academic benchmark and more an applied one.

What goes into Roboflow-100? RF100 contains 100 datasets spread across 7 imagery domains, containing a total of 224,714 images annotated with 805 class labels. “By releasing RF100, we aim to provide a semantically diverse, multidomain benchmark of datasets to help researchers test their model’s generalizability with real-life data.”
   The seven main categories consist of annotation tasks in the following domains: Aerial, Video Games, Microscopic, Underwater, Documents, Electromagnetic, and Real World. All of these main categories contain sub-categories, ranging from first-person shooters (video games), to fish sightings from aquariums (underwater), to geology (real world), etc. 

Why this matters – hard enough to be useful: RF100 seems sufficiently large-scale and diverse that it poses a challenge to contemporary systems – that means it can be a valuable tool for developing and assessing the performance of more general models. The Roboflow researchers show this by training a couple of baseline models (YOLOv5 and YOLOv7), as well as evaluating a zero-shot detector called GLIP. The finetuned YOLO variants get roughly 65-70% accuracy (v5 and v7, respectively), and GLIP gets ~11%. In other words – RF100 is a challenging benchmark, so there should be some signal in seeing how people do on it. 
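
Evaluating on a suite like this mostly means running the same detector over each of the 100 datasets and aggregating the scores per domain and overall. Here’s a minimal sketch of that bookkeeping; the dataset names and the scoring function are placeholders, and in practice the scorer would fine-tune and evaluate something like a YOLOv5/YOLOv7 model on one dataset and return its mAP.

```python
# Sketch of RF100-style bookkeeping: one detector, many datasets, scores
# reported per domain and overall. Names and evaluate_fn are placeholders.
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List

# Illustrative (not real) slice of the RF100 layout: domain -> dataset names.
RF100: Dict[str, List[str]] = {
    "aerial": ["aerial-dataset-1", "aerial-dataset-2"],
    "video games": ["game-dataset-1"],
    "underwater": ["underwater-dataset-1"],
    # ... 7 domains, 100 datasets in total
}

def run_benchmark(evaluate_fn: Callable[[str], float]) -> None:
    per_domain = defaultdict(list)
    for domain, datasets in RF100.items():
        for name in datasets:
            per_domain[domain].append(evaluate_fn(name))
    for domain, scores in per_domain.items():
        print(f"{domain}: mean mAP {mean(scores):.3f} over {len(scores)} datasets")
    everything = [s for scores in per_domain.values() for s in scores]
    print(f"overall: mean mAP {mean(everything):.3f} over {len(everything)} datasets")

# Dummy run with a stand-in scorer; swap in a real training/eval loop.
run_benchmark(lambda dataset_name: 0.5)
```
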
   Read the paper: Roboflow 100: A Rich, Multi-Domain Object Detection Benchmark (arXiv).
   Read more: roboflow100 (official website).
   Get the dataset: Roboflow 100, GitHub.

####################################################

AI centralization just got less likely: Distributed team trains a good 6bn-parameter GPT model:
…You’ve heard about open source models. How about open source models trained over a super shitty network?…
Researchers with Together have trained GPT-JT, a well-performing 6bn-parameter model. So far, so normal. The twist is that GPT-JT was trained in a decentralized manner on a heterogeneous bunch of GPUs over slow (1Gbps) internet links. That’s a big deal – and has some big implications. 

What is GPT-JT and how well does it work?: GPT-JT “is a variant forked off GPT-J and performs exceptionally well on text classification and other tasks,” the authors write. “On classification benchmarks such as RAFT, it comes close to state-of-the-art models that are much larger (e.g., InstructGPT davinci v2)”. GPT-JT was made possible by a range of open source software, ranging from underlying models (GPT-J, etc.) and datasets to evaluation metrics and various contributions to decentralized training algorithms. 
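
If you just want to poke at the model, it’s on the Hugging Face Hub; here’s a minimal usage sketch via transformers, assuming the hub ID is togethercomputer/GPT-JT-6B-v1 and that you have a GPU with enough memory for a 6bn-parameter model in fp16.

```python
# Quick local test of GPT-JT via Hugging Face transformers.
# Hub ID assumed to be "togethercomputer/GPT-JT-6B-v1" (per the release notes).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "togethercomputer/GPT-JT-6B-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

# GPT-JT is pitched at classification-style prompts, e.g. sentiment labeling.
prompt = "The sentiment of the review 'I loved this movie' is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=8, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```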

Trained in a decentralized manner: The authors wrap in a bunch of clever ideas to reduce the burden of decentralized training, cutting the amount of communication each machine needs to do for all the tokens processed. This is crucial to the success of the project; out-of-the-box decentralized training fails because there is so much between-machine chatter that the slowness of your connections becomes a major tax on training.
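
To get an intuition for why cutting communication matters, here’s a generic sketch of local-update training (a standard trick, not the specific algorithm behind GPT-JT): each worker takes K local optimizer steps, and the group only averages parameters once per K steps, rather than all-reducing gradients on every single step.

```python
# Generic local-SGD-style sketch: sync parameters every `local_steps` steps
# instead of all-reducing gradients every step, cutting network traffic by
# roughly a factor of local_steps. Assumes torch.distributed has already been
# initialized via dist.init_process_group(...).
import torch
import torch.distributed as dist

def train_with_local_steps(model, optimizer, data_iter, local_steps=32):
    step = 0
    for batch, labels in data_iter:
        loss = torch.nn.functional.cross_entropy(model(batch), labels)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step % local_steps == 0:
            # One parameter all-reduce per `local_steps` steps: far fewer
            # bytes on the wire than per-step gradient synchronization.
            for p in model.parameters():
                dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
                p.data /= dist.get_world_size()
```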

Centralization versus decentralization – this is an attack on the political economy of AI! A lot of AI development has so far been defined by a small set of groups with access to big, centralized computers. These groups have used these blobs of compute to train impressive models, ranging from AlphaZero to GPT-3. It has always been hard for people with fewer computers to catch up to the people with supercomputers. GPT-JT suggests a radically different future – distributed collectives can instead pool computers over crappy internet links and train models together. E pluribus unum exemplar, if you will.
    Now, the multi-trillion dollar question is whether these distributed groups can provably train models on par with those developed by the large, centralized giants. That part is a lot less clear – while GPT-JT is a decent model, it’s a tiny one at 6bn parameters. But if they can scale this kind of technique up, the implications are huge. 
   There’s also the small matter of China, which recently got a lot of its AI ambitions clipped by US export controls preventing it from accessing frontier GPUs. But maybe the frontier doesn’t matter as much if you can just aggregate compute across a country of more than a billion people and train a model with the focus afforded by an authoritarian regime. Food for thought! 
   Read more: Releasing v1 of GPT-JT powered by open-source AI (Together blog).
   Get the code: GPT-JT-6B-v1 (HuggingFace).
   Try out a live demo on HuggingFace here.

####################################################

AI war startup Anduril raises $1.48 billion: 
…AI + Robots + Startup DNA = a faster OODA loop for battlefield commanders…
AI War startup Anduril has raised $1.48 billion (that’s with a B) in a Series E round. “The new funding will enable Anduril to accelerate research and development to bring new, cutting edge, autonomous defense capabilities to the market and continue to mature and scale its current business lines with the US Department of Defense as well as US allies and partners,” the company wrote. 

AI and War: Anduril is a fascinating company – it’s one of the few modern defense startups in the US that is pairing recent AI innovations with various advances in robotics (e.g., low-cost drones) as well as sensor platforms. Put it all together and you wind up with a company that is fielding an increasingly vast arsenal of devices able to conduct war activities on land, in the air, and at sea (the latter via a recent acquisition, Dive Technologies). Some of the company’s recent product launches include ALTIUS 600M (a loitering munition, aka a drone that hangs around then kills something with a bang), ‘Menace’ (“a first-of-its-kind integrated, expeditionary, secure, command, control, communications and computing (C4) platform”), and Mobile Sentry (a robot for autonomous ground and air monitoring). 

Why this matters – war is about speed, and AI increases speed: War runs on an OODA loop – Observe, Orient, Decide, Act. By pulling in modern technologies such as AI, Anduril is building an arsenal that increases the speed at which battlefield commanders can iterate through the OODA loop. Anduril is less about any individual product and more about its overall suite – taken together, its products potentially let an entrepreneurial army out-think the competition by running a faster OODA loop. War is a depressing thing, but a more depressing thing is losing wars, so the funding for Anduril seems like a positive indication for the US (and allied) defense industrial base. I hope it continues to succeed in breaking through the monopoly of the aging so-called defense ‘primes’ (Lockheed, etc.). 
   Read more: Anduril Raises $1.48 Billion in Series E Funding (Anduril blog, Medium).

####################################################

Reality Authentication
[The internet, 2034] 

“To login, spit into the bio-API”
   I took a sip of water and swirled it around my mouth a bit, then hawked some spit into the little cup on my desk, put its lid on, then flipped over the receptacle and plugged it into the bio-API system.
“Authenticating… authentication successful, human-user identified. Enjoy your time on the application!”
   I spent a couple of hours logged on, doing a mixture of work and pleasure. I was part of an all-human gaming league called the No-Centaurs; we came second in a mini tournament. I also talked to my therapist sans his augment, and I sent a few emails over the BioNet protocol. 

   When I logged out, I went back to the regular internet. Since the AI models had got miniaturized and proliferated a decade ago, the internet had radically changed. For one thing, it was so much faster now. It was also dangerous in ways it hadn’t been before – Attention Harvesters were everywhere, and the only reason I was confident in my browsing was that I’d paid for a few protection programs. 

Things that inspired this story: The ceaseless march of generative model progress; ChatGPT; high- and low-class hobbies; the rich will always have a retreat, while the poor will always be condemned to the most experimental parts of the frontier.