Import AI 313: Smarter robots via foundation models; Stanford trains a small best-in-class medical LM; Baidu builds a multilingual coding dataset

by Jack Clark

Welcome to the first issue of 2023! Astute readers may notice that most of these research papers came out in December. I basically took some time off over the Christmas break to reflect on things and map out priorities for the coming year. I am thrilled to write Import AI for so many of you and have some big plans in the works. Onward!


Google trains a big model to create smart robots:
…RT-1 is one foundation model for hundreds of tasks…
Google has built RT-1, a large-scale neural net that can be used to control real-world robots. RT-1 is basically an attempt to create a large pre-trained model that embeds the experiences of different robots doing different tasks into a single model, then uses this model to drive control of real-world robots. The approach seems to work in a preliminary way (and as with all things in robotics, there’s always a vast gulf between ‘kind of works’ and ‘put this in a product and sell it to a grandmother’), so don’t get too excited. 

What is RT-1? RT-1 was trained on 130k episodes of robot behavior covering 700+ tasks collected via a fleet of 13 robots deployed at Google over the course of 17 months. “We demonstrate that RT-1 can exhibit significantly improved zero-shot generalization to new tasks, environments and objects compared to prior techniques,” Google wrote. 

Compounding returns: RT-1 can be paired with other techniques to increase real-world robot performance. For instance, Google used RT-1 to drive behaviors on robots hooked up to SayCan (a system that uses a large language model to help the robot plan actions – see Import AI 291). “SayCan with RT-1 achieves a 67% execution success rate in Kitchen1, outperforming other baselines,” they write (up from 47% for just vanilla SayCan). “Due to the generalization difficulty presented by the new unseen kitchen, the performance of SayCan with Gato and SayCan with BC-Z sharply falls, while RT-1 does not show a visible drop.”
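
To make the division of labor concrete, here is a very rough sketch of how a SayCan-style planner sits on top of a low-level policy like RT-1: the language model scores which skill to attempt, a value function checks the skill is feasible right now, and the policy turns the chosen skill string plus camera images into motor commands. Every name below is a placeholder I’ve invented for illustration – this is not Google’s code.

```python
# Hedged sketch of the SayCan + RT-1 split described above. All functions are
# hypothetical placeholders standing in for real models.

SKILLS = ["pick up the sponge", "open the drawer", "put the sponge in the drawer"]

def llm_usefulness(instruction: str, skill: str, history: list[str]) -> float:
    """Placeholder: a language model scores how much `skill` advances `instruction`."""
    ...

def affordance(skill: str, observation) -> float:
    """Placeholder: a value function estimating whether `skill` can succeed right now."""
    ...

def rt1_step(skill: str, observation):
    """Placeholder: the RT-1 policy maps (camera image, skill text) -> (motor command, done)."""
    ...

def run(instruction: str, env, max_steps_per_skill: int = 50):
    history = []
    while not env.task_complete():
        obs = env.observe()
        # SayCan-style selection: pick the skill whose LLM "usefulness" times its
        # current feasibility is highest.
        skill = max(SKILLS, key=lambda s: llm_usefulness(instruction, s, history) * affordance(s, obs))
        # The chosen skill string conditions RT-1, which emits low-level motor
        # commands step by step until it judges the skill complete.
        for _ in range(max_steps_per_skill):
            action, done = rt1_step(skill, env.observe())
            env.act(action)
            if done:
                break
        history.append(skill)
```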

   Check out the website: RT-1: Robotics Transformer for Real-World Control at Scale (official website).
   Read the blogpost: RT-1: Robotics Transformer for Real-World Control at Scale (Google research blog).

####################################################

Academics use AI to… automate academia:
…Dataset of ~7.5k papers helps train an automated research paper reviewer…

Researchers with Xiamen University and the Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan, Ministry of Culture and Tourism, China, have developed the Multidisciplinary Open Peer Review Dataset (MOPRD), a collection of 7,578 research papers and their associated reviews and comments. The idea is this dataset can help train models better able to do the task of ASPR – automated scholarly paper review. 

   (In other words: if you thought Human Reviewer 2 was hard to reason with, just wait until reviewer 2 is a language model!) 

What’s in MOPRD: The dataset contains papers split across biology (46.7%), medicine (19.7%), computer science (15.7%), environment (8.9%), chemistry (4.4%), and ‘others’. MOPRD is “composed of paper metadata, manuscripts of the initial submission and following revisions, review comments, meta-reviews, author’s rebuttal letters, and editorial decisions of papers across various disciplines,” the authors write. “To our best knowledge, MOPRD is by far the largest multidisciplinary peer review dataset with complete peer review history.” 

Automatic comments, for the people: The researchers use MOPRD to design a “modular guided review comment generation method”. Specifically, they finetune a language model on the MOPRD papers, and then use this to try to generate synthetic comments about research papers (including, in a reassuringly meta bit of performance art, the MOPRD paper itself). In tests, they find the reviews are initially quite promising, though it remains an open question how to quantitatively evaluate their quality (beyond coherence of text). 
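
For the curious, the basic recipe – finetune a sequence-to-sequence model on (paper, review) pairs, then sample reviews for new papers – looks roughly like the sketch below. The file name, checkpoint, and prompt format are my own assumptions for illustration, not details taken from the MOPRD paper.

```python
# Minimal sketch: finetune a seq2seq model to map paper text -> review comments.
# Assumes a hypothetical JSONL file of {"paper": ..., "review": ...} pairs built
# from MOPRD; the checkpoint and prompt format are assumptions, not the paper's.
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")

ds = load_dataset("json", data_files="moprd_pairs.jsonl")["train"]  # hypothetical file

def preprocess(ex):
    x = tok("review this paper: " + ex["paper"], truncation=True, max_length=1024)
    x["labels"] = tok(text_target=ex["review"], truncation=True, max_length=256)["input_ids"]
    return x

ds = ds.map(preprocess, remove_columns=ds.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="review-gen",
                                  per_device_train_batch_size=2,
                                  num_train_epochs=1),
    train_dataset=ds,
    data_collator=DataCollatorForSeq2Seq(tok, model=model),
)
trainer.train()

# Generate a synthetic review for a new abstract.
inputs = tok("review this paper: <paste abstract here>", return_tensors="pt",
             truncation=True).to(model.device)
print(tok.decode(model.generate(**inputs, max_new_tokens=200)[0], skip_special_tokens=True))
```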

Why this matters – can AI speed up the process of science? While part of the value of reviews is in the didactic back and forth between reviewers and reviewees, another part of the value is in surfacing high-quality papers and generally sorting the wheat from the chaff. Datasets like MOPRD could help train very basic classifiers to do some of this sorting, though I’m skeptical of the overall approach – some of the most important scientific papers are those which have heterodox ideas in them, so I think a ‘curve-fitting automated reviewer’ is probably one of the best ways to generate negative reviews of original ideas. 

   Read more: MOPRD: A multidisciplinary open peer review dataset (arXiv).

####################################################

Baidu makes a multilingual coding assistant:
…ERNIE-Code uses English as a passthrough language for multilingual capabilities…

Researchers with Baidu have built ERNIE-Code, a 560 million parameter coding model optimized for being multilingual. ERNIE-Code is “a unified cross-lingual pre-trained LLM for multiple natural languages and programming languages in hopes of mitigating the English-centric bias for program pre-training,” according to the researchers. 

What it is and why they did it: ERNIE-Code is pre-trained on six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby) via CodeSearchNet, as well as more than 100 natural languages via the CommonCrawl-100 (CC-100) corpus. 

   Pre-training has two specific tasks – span-corruption language modeling (add noise to text and try to predict the corrupted spans, sentences, and documents), and ‘pivot-based translation language modeling’ (PTLM). PTLM is the route to multilinguality – rather than translating a natural language (NL) command directly into a programming language (PL) command, the model translates the NL command into English, then translates English into the PL. This gets around the problem of otherwise needing to pair datasets from a hundred-plus natural languages with datasets from six programming languages, and feels like a neat solution to the problem. 
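
Here’s a toy sketch of what those two training signals look like as data construction – the corruption rate, sentinel format, and prompt templates are assumptions on my part, not the exact recipe from the ERNIE-Code paper.

```python
# Hedged sketch of the two pre-training signals: T5-style span corruption, and
# pivot-based translation pairs that route everything through English.
import random

def span_corrupt(tokens: list[str], corrupt_prob: float = 0.15, mean_span: int = 3):
    """Replace random spans with sentinel tokens; the model must reconstruct them."""
    inputs, targets, i, sid = [], [], 0, 0
    while i < len(tokens):
        if random.random() < corrupt_prob / mean_span:
            inputs.append(f"<extra_id_{sid}>")
            targets += [f"<extra_id_{sid}>"] + tokens[i:i + mean_span]
            i += mean_span
            sid += 1
        else:
            inputs.append(tokens[i])
            i += 1
    return " ".join(inputs), " ".join(targets)

def pivot_pairs(nl_text: str, nl_lang: str, english_text: str, code: str):
    """Instead of pairing e.g. Chinese directly with Python, train on
    (Chinese -> English) and (English -> Python) pairs, so English bridges the
    100+ natural languages and the 6 programming languages."""
    return [
        (f"translate {nl_lang} to English: {nl_text}", english_text),
        (f"translate English to code: {english_text}", code),
    ]

print(span_corrupt("def add ( a , b ) : return a + b".split()))
print(pivot_pairs("对列表进行排序", "Chinese", "sort the list", "sorted(xs)"))
```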

Does it work? They test the model against mBART, mT5, PLBART, and CodeT5 on four tasks: code summarization, code generation, document translation, and program repair. In tests, the model is competitive on all of these, and does significantly better on code summarization. On the other hand, I would have liked to see them compare to other hard baselines, like CodeGeeX from a decent group at Tsinghua.

Why this matters – representation matters: ERNIE-Code highlights the way in which language dominance can filter through to AI dominance; so much foundational text (and comments on code) is written in English that to avoid perpetuating the hegemony of one language, researchers need to figure out progressive approaches to empower other languages. ERNIE-Code is an example of this – though the fact it needs to pivot through English during training speaks to the larger problem.

   Read more: ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages (arXiv).

####################################################

Generate music from spectrograms via StableDiffusion – a crazy idea that works:
…RIFFUSION: Mad science for the hell of it – wonderful!…

You know what I love? Wildly crazy ideas that somehow work. You know what RIFFUSION is? It’s a wildly crazy idea that somehow works. RIFFUSION takes the Stable Diffusion image model, finetunes it to generate spectrograms, then generates audio from the spectrograms. This is pure, unadulterated, for-the-fun-of-it mad science, and I am in love. 

Fun things you can do: You can interpolate from one type of spectrogram to another, just as you would with images. This means the authors can generate multiple individual slices of audio, chunk them together, and shift from one thing (e.g, the sound of a keyboard typing) to another (e.g, a guitar) over arbitrary time scales. They’ve also built a web app so you can try it yourself and generate your own audio on the fly. 
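
If you want to poke at the core trick yourself, the pipeline is roughly ‘prompt the finetuned Stable Diffusion checkpoint for a spectrogram image, then invert that image back into a waveform’. Here’s a hedged sketch with off-the-shelf libraries – the spectrogram conventions (sample rate, dB range, mel scaling, orientation) are my assumptions rather than the authors’ exact parameters, and it assumes a GPU.

```python
# Hedged sketch of the Riffusion two-step: spectrogram image via diffusion,
# then Griffin-Lim inversion back to audio. Spectrogram parameters are assumed.
import numpy as np
import librosa
import soundfile as sf
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "riffusion/riffusion-model-v1", torch_dtype=torch.float16
).to("cuda")

image = pipe("funk bassline with a jazzy saxophone").images[0]  # 512x512 spectrogram image

# Treat pixel intensity as a dB-scaled mel spectrogram (assumed convention),
# flipping so low frequencies sit in the first rows.
spec_db = np.array(image.convert("L"), dtype=np.float32)[::-1] / 255.0 * 80.0 - 80.0
mel = librosa.db_to_power(spec_db)

# Invert mel -> waveform; Griffin-Lim estimates the missing phase information.
audio = librosa.feature.inverse.mel_to_audio(mel, sr=22050, n_fft=2048, hop_length=512)
sf.write("riffusion_clip.wav", audio, 22050)
```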

Why this matters: Superintelligence is partially SuperDataTransformation: Modern generative models are generic data transformation engines, able to take one type of data (e.g, a script) and port it into another (e.g, a song, a poem). This is a deep and weird idea the more you think about it. What would you do if you could ‘transpose’ anything into anything else? RIFFUSION is a creative example of what happens when you play with this idea. Congrats to the creators for making something joyful and zany! 

   Read more: Riffusion (official site).
   Try out the web app: RIFFUSION.COM
   Get the model from HuggingFace here (riffusion-model-v1).

####################################################

Stanford and Mosaic train a small but mighty medical language model:
…PubMedGPT 2.7B packs a lot of performance into a tiny package…

Stanford’s Center for Research on Foundation Models (CRFM) and AI training startup Mosaic have teamed up to train PubMedGPT 2.7B, a small GPT-style language model that gets a state-of-the-art result on medical question answering. 

Data and performance: PubMedGPT 2.7B was trained on the PubMed abstracts and full-text portions of ‘The Pile’ dataset; 16 million abstracts and 5 million full-text articles. The total size of the dataset is about 50B tokens, making the dataset a little small relative to the model (GPT-3 2.7B and GPT-J were trained on 300B and 400B tokens respectively). The model gets 50.3% on the MedQA-USMLE eval (a new SOTA), 74.4 on PubMedQA (versus 77.6 for Facebook’s ‘Galactica’), and 96.4% on BioASQ. 

Compute: The model was trained on 128 A100 GPUs for 6.25 days, which is still a non-trivial amount of compute to dump into a model, even in the ‘big chungus*’ compute era of 2022. 

*Not an official term.

Maybe data repetition isn’t that bad? “We elected to train PubMed GPT for a long compute duration (300B tokens) by performing multiple passes, or epochs, over the 50B tokens,” the researchers write. When training big models, people are wary of repeating data too much lest their model overfit; here, that may not have been a huge concern. “It was indeed worth it to train for the full 300B tokens, even though this represented dramatically more passes through the data than comparable models,” the Stanford researchers said. 
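
The arithmetic behind ‘multiple passes’ is simple enough to write down:

```python
# Back-of-envelope: how many epochs does a 300B-token budget imply over 50B tokens?
dataset_tokens = 50e9       # ~50B tokens of PubMed text
training_budget = 300e9     # total tokens seen during training
print(f"{training_budget / dataset_tokens:.0f} passes over the corpus")  # ~6 epochs
# By contrast, GPT-3 2.7B (~300B tokens) and GPT-J (~400B tokens) saw roughly a
# single pass over much larger corpora.
```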

Why this matters: I think AI models are going to have a sort of bimodal distribution – there’ll be a small number of absolutely vast ‘swiss army knife’ models which will underpin a huge range of economic functions, but at the other end I suspect there will also be a very large number of tiny (where tiny = <5 billion parameters) models that are tuned for very specific data sources and usecases, and likely deployed directly on edge devices (pending some compute efficiencies). PubMed GPT is an example of the latter kind of model. I wonder how many more of its kind there will be?

   Read more: PubMed GPT: a Domain-Specific Large Language Model for Biomedicine (Mosaic blog).
   Read more: PubMedGPT 2.7B (Stanford University blog).
   Get the model from HuggingFace.

####################################################

Tech Tales:

The Universal Confessional Booth

[AI-AUGMENTED THERAPY CENTER, 2030]

We all had been crazy in our own ways but now we all had the same path to healing – talk to the robot for as long as it took for it to say you were ‘back in distribution’ (BID) with everyone else. This led to all kinds of slang. 

Yeah my friend BID out.

Yeah he on a long BID he crazy. 

Oh he’s just happy because it’s his BIDday.

And so on. 

I’d been getting close to my BID for a while now, my robot told me. I’d go and sit in the booth and talk to it and we’d have these long, rambling conversations about everything: flowers, the recent weather and how I felt about the dust storms, the quality of food in the institution as compared to what I ate outside (mostly healthier), how I felt about my friends and family. The robot would show me some different emotions on its ‘face’ (which was an avatar that was a different person each day, I suppose to elicit different reactions from me) and I would talk and it would ask questions. 

At the end of the session it would usually say ‘you are making excellent progress towards being back in distribution’. Sometimes it wouldn’t say anything, though, which was its way of telling me I hadn’t made progress. 

It wasn’t worth trying to perform for the robot because it’d ask so many questions that it’d uncover that you were spinning some story, and then it would get as close as it was allowed to expressing a negative emotion. “You are going backwards,” it might say. Or, “at this rate, you will be out of distribution for an unpredictable amount of time”. 

Of course we’d all talk to each other about how the BID talks were a load of bullshit. We’d sit up late at night after the guards had locked the cells and exchange stories. 

  • Yeah it asked me about my childhood friends. 
  • Childhood friends? My one went IN on my dead parents. I could have strangled it. 
  • My one keeps telling me I’m sexually repressed and I just don’t see it. 
  • You think you’ve got it bad – mine has been showing me different colors for a week and asking me how they make me feel. 

The strange thing was that people did change. It was like being hypnotized – you’d think nothing was changing, but then you’d snap back into a memory of the person six months prior tearing their hair out and screaming after the first session, and now they’d just low-key complain about the sessions while sitting there with a full head of hair and no scratch marks. 

Anyway, my robot says I’m almost at BID and it’s just going to be a few more sessions. It told me to journal about my experiences as part of a special BID evaluation. I guess that’s most of what I have to say right now so I’ll see what it thinks. 

Things that inspired this story: Discovering Language Model Behaviors with Model-Written Evaluations by Anthropic (PDF); how RLHF-trained models have a tendency to take on extreme sycophantic positions; what it might look like to have a model talking to thousands of people concurrently and embedding their conversations in a single space so as to judge who is and isn’t ‘out of distribution’; ChatGPT and other similar models; robot psychology meets the administrative state; insanity.