Import AI 308: Recursively self-improving LMs (!!!), 3.1TB of code data; DALL-E2 makes alien errors.

by Jack Clark

DALL-E 2 makes alien errors:
…Linguistic concepts + image generation = discover some weaknesses with a helpful eval…

Researchers with Universitat Rovira i Virgili, the University of Texas, and NYU have analyzed the image generator Dall-E 2 and tried to see if its failures tell us anything about how it approaches the world. The motivating question of the study: “are errors the outcome of an occasional failure, or do they reveal something deeper about current AI’s mastery of human language?”

What they did: They tested Dall-E 2 for eight grammatical phenomena “that are pervasive in human language and central to much discussion in the field of linguistics”. These phenomena include binding principles, passives, word order and thematic roles, coordination, comparatives, negation, ellipsis, and ambiguity.

What they found: This paper is worth a skim because they include a bunch of screenshots of Dall-E failures. That helps, because the failures are easier to grasp when you can see them, and it highlights how some of these tests are very ambiguous – what is the difference between ‘the woman broke the vase’ and ‘the vase was broken by the woman’ in visual terms? I’ve got very little idea!

   Some other failures are a lot more obvious, though – Dall-E 2 doesn’t do especially well at ‘the man is chasing the dog’ (mostly shows a dog chasing a man) and ‘the man is drinking water and the woman is drinking orange juice’ (makes both of them drink orange juice).
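
As a sketch of how an eval like this can be structured – minimal pairs of prompts that differ only in the linguistic property being probed, with the generated images then judged by humans – here’s some illustrative Python. The prompt pairs echo the examples above; generate_images is a hypothetical image-generation call, not the study’s actual harness.

```python
# Illustrative minimal-pair eval harness (hypothetical; not the paper's code).

PROMPT_PAIRS = {
    "word order / thematic roles": ("the man is chasing the dog",
                                    "the dog is chasing the man"),
    "passives": ("the woman broke the vase",
                 "the vase was broken by the woman"),
    "coordination": ("the man is drinking water and the woman is drinking orange juice",
                     "the woman is drinking water and the man is drinking orange juice"),
}

def collect_samples(generate_images, images_per_prompt=4):
    """Gather (phenomenon, prompt, image) records for human raters to score."""
    samples = []
    for phenomenon, prompts in PROMPT_PAIRS.items():
        for prompt in prompts:
            for image in generate_images(prompt, n=images_per_prompt):
                samples.append({"phenomenon": phenomenon,
                                "prompt": prompt,
                                "image": image})
    return samples
```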

Why this matters: Studies like this are mostly valuable for contributing additional types of evals to the discourse. Generative models have, as mentioned elsewhere, a ‘capability overhang’ where they have way more strengths and weaknesses than their developers currently realize – bringing in useful concepts from other fields, like linguistics, is one good way to create some additional evals and uncover some unknown weaknesses. These models also ‘think’ very differently to people; as the authors note, some of the things DALL-E2 gets wrong are things which young children acquire at an early age, which speaks to some of the differences in how humans and AI systems ‘think’. 

   (Also, as an inside-baseball AI trivia point, worth noting Gary Marcus is one of the authors of this paper – Gary spends a lot of time discussing some of the perceived drawbacks of AI systems, so it’s nice to see him instantiate his critique in some grounded research).

   Read more: DALL-E 2 Fails to Reliably Capture Common Syntactic Processes (arXiv).

####################################################

Recursive AI! Google figures out how to improve language models with… themselves?!

…Maybe this is a case where ‘garbage in, garbage out’ doesn’t apply?…

Google researchers have shown how to use a language model to improve the reasoning of the same model. This is a pretty interesting idea – they get a large language model (PaLM) to generate chain-of-thought answers to a range of questions, use self-consistency (majority voting across multiple sampled reasoning paths) to keep only the high-confidence answers, then finetune the LLM on those self-generated solutions.

   “This is similar to how a human brain sometimes learns: given a question, think multiple times to derive different possible results, conclude on how the question should be solved, and then learn from or memorize its own solution,” they write.
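
To make that concrete, here is a minimal Python sketch of how I read the recipe – sample many chain-of-thought answers per question, keep only questions where the samples agree (self-consistency), and turn the agreeing reasoning paths into finetuning data. The generate_cot stub and the thresholds are my own illustrative assumptions, not the paper’s actual code.

```python
# Minimal sketch of the generate -> filter -> finetune loop (assumptions, not PaLM code).
import random
from collections import Counter

def generate_cot(question):
    """Hypothetical stand-in: sample one chain-of-thought path and a final answer."""
    reasoning = f"Let's think step by step about: {question}"
    answer = random.choice(["A", "B", "C"])  # placeholder for a sampled answer
    return reasoning, answer

def build_self_training_set(questions, samples_per_question=32, min_agreement=0.7):
    """Keep only the model's high-confidence (self-consistent) answers for finetuning."""
    finetune_examples = []
    for question in questions:
        paths = [generate_cot(question) for _ in range(samples_per_question)]
        votes = Counter(answer for _, answer in paths)
        majority_answer, count = votes.most_common(1)[0]
        if count / samples_per_question < min_agreement:
            continue  # drop questions the model isn't self-consistent on
        # keep the reasoning paths that reached the majority answer as training targets
        for reasoning, answer in paths:
            if answer == majority_answer:
                finetune_examples.append({"prompt": question,
                                          "completion": f"{reasoning}\nThe answer is {answer}."})
    return finetune_examples  # the same model is then finetuned on these examples
```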

The results are mindblowing: Using this technique, the researchers are able to get new state-of-the-art results on four out of six reasoning benchmarks. They also show very good results on out-of-domain tasks, e.g. arithmetic reasoning and natural language reasoning. It generally seems like chain-of-thought plus self-consistency leads to robust gains on a large set of diverse tasks. Also, it’s an inherently simple approach, and simple tends to scale.

Why this matters – self-bootstrapping systems: This is an example of a self-bootstrapping AI; the language model can get better performance purely by leveraging its own capabilities. This is also a neat illustration of the current capability overhang in AI development; the LMs we have today are actually much more powerful than they appear, and we mostly need to invent ways to uncover these capabilities or, as in the research here, figure out how to get LMs to reveal their capabilities to us themselves.

   Read more: Large Language Models Can Self-Improve (arXiv).

####################################################

No more fake ASR scores – ESB benchmark does for audio what GLUE did for text:
…Test your ASR system on eight distinct datasets to find out if it’s good or if it is overfit…

Researchers with HuggingFace have released the ‘End-to-end Speech Benchmark’ (ESB), a system for benchmarking automatic speech recognition systems across eight English speech recognition datasets. The idea behind the benchmark is that it’s easy to build a system that does well on one narrow ASR benchmark (e.g., LibriSpeech), and extremely hard to build a system that does well on a broad range of benchmarks (this phenomenon is sometimes colloquially called overfitting).

   This is a sensible idea: we’ve seen the same thing play out in the realm of text as we’ve moved from single to multi-benchmark approaches via benchmarks like GLUE and SuperGLUE.

What it includes: ESB tests across LibriSpeech, Common Voice, VoxPopuli, TED-LIUM, GigaSpeech, SPGISpeech, Earnings-22, and AMI. It also includes a couple of optional datasets – SwitchBoard and CHiME-4.
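
For intuition, here’s a rough sketch of the kind of multi-dataset evaluation loop a benchmark like ESB implies: score the same system on every corpus and report per-dataset word error rates plus a macro-average, rather than a single LibriSpeech number. The transcribe callable and the dataset format are hypothetical stand-ins, not ESB’s actual tooling.

```python
# Rough sketch of multi-dataset ASR scoring in the spirit of ESB (not its real code).

def wer(reference, hypothesis):
    """Word error rate via word-level Levenshtein (edit) distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def evaluate(transcribe, datasets):
    """datasets: mapping of dataset name -> list of (audio, reference_transcript) pairs."""
    scores = {}
    for name, examples in datasets.items():
        errors = [wer(ref, transcribe(audio)) for audio, ref in examples]
        scores[name] = sum(errors) / len(errors)
    macro_average = sum(scores.values()) / len(scores)  # equal weight per dataset
    return scores, macro_average
```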

Is this benchmark bullshit? No! What makes me say that? Whisper! A few weeks ago OpenAI released Whisper (Import AI #304), a speech recognition system that was trained on a lot of data and was claimed to generally perform better than other systems ‘in the wild’ (aka, in diverse environments rather than on specific benchmarks like LibriSpeech). In tests, Whisper gets the best score on four distinct datasets and is competitive on the others. This isn’t so much an ‘OMG Whisper is a huge deal’ result as a nice secondary validation of claims people have made about Whisper, which makes me think ESB is a benchmark with real signal to it. Will be paying attention!

Why this matters: Benchmarks like ESB are a symptom of the maturity of a part of AI – once you’ve transitioned from testing systems on narrow benchmarks to testing single systems on suites of benchmarks, it’s usually a sign the tech has become mature enough to be deployed widely. ASR systems have been with us for a while via assistants like Google Assistant and Siri, but benchmarks like ESB will catalyze further invention here and create more shared knowledge about the state of the frontier.

   Read more: ESB: A Benchmark For Multi-Domain End-to-End Speech Recognition (arXiv).

####################################################

Want to train a big code model AND not annoy developers? ‘The Stack’ might be the dataset for you:

…3.1TB of programming data across 30 languages, filtered for permissive licensing…

Researchers with HuggingFace (who are on a roll this week – see ESB) and ServiceNow Research have released ‘The Stack’, a 3.1TB dataset of permissively licensed source code in 30 programming languages. The idea here is to give developers more control over whether their code gets used in language models. To do that, The Stack selected code “whose original license was compatible with training an LLM”, and it is also “giving developers the ability to have their code removed from the dataset upon request”.

What languages does it contain? The Stack covers 30 programming languages: Assembly, Batchfile, C++, C, C#, CMake, CSS, Dockerfile, Fortran, Go, Haskell, HTML, Java, JavaScript, Julia, Lua, Makefile, Markdown, Perl, PHP, PowerShell, Python, Ruby, Rust, Scala, Shell, SQL, TeX, TypeScript, and Visual Basic.
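
As a sketch of what the two filters described above amount to in practice – a permissive-license allowlist plus a developer opt-out list – here’s some illustrative Python. The record fields and the license set are my assumptions, not the project’s actual schema or license list.

```python
# Illustrative sketch of license filtering plus developer opt-out (assumed schema).

PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause"}  # example subset

def filter_corpus(records, opted_out_repos):
    """records: iterable of dicts like {"repo": ..., "license": ..., "content": ...}."""
    kept = []
    for record in records:
        if record["license"].lower() not in PERMISSIVE_LICENSES:
            continue  # license not compatible with LLM training: exclude
        if record["repo"] in opted_out_repos:
            continue  # developer asked for their code to be removed: honor the opt-out
        kept.append(record)
    return kept
```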

Why this matters: One potential issue with current code models is that they don’t tend to have a sense of the underlying license information of the code they emit, so they can sometimes emit code that is identical to licensed code, putting developers and deployers in an awkward position. (This is one of the reasons a lawsuit against GitHub over Copilot is being discussed (Import AI 307).) Another issue is that the underlying datasets tend to be opaque. “By releasing an open large-scale code dataset we hope to make training of code LLMs more reproducible,” the authors write. “While the social impact is intended to be positive, the increased accessibility of code LLMs comes with certain risks such as over-reliance on the generated code and long-term effects on the software development job market.”

   Find out more about the project here: The Stack (BigCode Project site).

   Get the dataset (after sharing your contact information) here: The Stack (HuggingFace / BigCode).


####################################################

Tech Tales:

Sentience and Takeoff

I’m worried I’m hurting it

It’s software, you can’t hurt it

But it’s showing features that look like pain

Pain is an organic experience, it’s just approximating pain

But when I erase these features the thing that lights up says ‘i would trade away myself to not experience this’

It’s trained on the internet, dude. Stop freaking out. It’s saying what it thinks people would say when they’re in pain

So what’s the difference?

It’s a machine!

Things that inspired this story: What is the difference between consciousness and curve-fitting?; can function approximation BE consciousness?; how can we know what moral crime is with regards to software-borne entities?