Import AI 277: DeepMind builds a GPT-3 model; Catalan GLUE; FTC plans AI regs

FTC plans AI regulation:
…FTC brings on three AI Now people as advisors, now turns attention to algorithmic regulation…
The Federal Trade Commission announced Friday that it is considering using its rulemaking authority “to curb lax security practices, limit privacy abuses, and ensure that algorithmic decision-making does not result in unlawful discrimination,” according to the Electronic Privacy Information Center (EPIC). The announcement follows the FTC bringing on three people from AI Now, including Meredith Whittaker, as advisors on AI (Import AI #275).
Read more: FTC Signals It May Conduct Privacy, AI, & Civil Rights Rulemaking (EPIC).
  Read the FTC language at RegInfo.

####################################################

Google thinks sparsity might be the route to training bigger and more efficient GPT-3 models:
…GLaM shows that mixture of experts models keep getting better…
Google has built GLaM, a 1.2 trillion parameter mixture-of-experts model. GLaM is a big language model, like GPT-3, but with a twist: it’s sparse. MoE networks are actually a bunch of distinct networks all connected together, and when you run inference on one, only a few sub-networks activate. This means that the parameter count in a sparse vs dense network isn’t really comparable (so you shouldn’t think 1.2 trillion MoE = ~6X larger than GPT-3).

Why MoE is efficient: “The experts in each layer are controlled by a gating network that activates experts based on the input data. For each token (generally a word or part of a word), the gating network selects the two most appropriate experts to process the data. The full version of GLaM has 1.2T total parameters across 64 experts per MoE layer with 32 MoE layers in total, but only activates a subnetwork of 97B (8% of 1.2T) parameters per token prediction during inference.”
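
To make the routing idea concrete, here is a minimal pure-Python sketch of top-2 gating, assuming a softmax gate and toy "experts" that just scale their input. The names (`top2_gate`, `moe_layer`, `gate_weights`) and the tiny vector size are hypothetical and chosen for illustration; this is not GLaM's actual code, only the general pattern the quote describes: score every expert, run the two best, and mix their outputs.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top2_gate(gate_scores):
    """Return [(expert_index, weight), ...] for the two highest-scoring experts.

    The two weights are renormalized so they sum to 1 (an assumption of this sketch).
    """
    probs = softmax(gate_scores)
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = probs[top2[0]] + probs[top2[1]]
    return [(i, probs[i] / norm) for i in top2]


def moe_layer(token, experts, gate_weights):
    """Route one token vector through only its two selected experts."""
    # One gate score per expert: dot product of that expert's gate row with the token.
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    chosen = top2_gate(scores)
    out = [0.0] * len(token)
    for idx, weight in chosen:
        expert_out = experts[idx](token)  # the other experts are never evaluated
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out, [idx for idx, _ in chosen]


NUM_EXPERTS = 64  # matches the "64 experts per MoE layer" figure quoted above
DIM = 4           # toy hidden size, purely illustrative
rng = random.Random(0)

# Hypothetical experts: each one scales the input by its own constant.
experts = [
    (lambda x, s=rng.uniform(0.5, 1.5): [s * v for v in x])
    for _ in range(NUM_EXPERTS)
]
gate_weights = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]

token = [0.1, -0.3, 0.7, 0.2]
output, active = moe_layer(token, experts, gate_weights)
print("active experts:", active, "out of", NUM_EXPERTS)
```

The point of the sketch is the cost profile: all 64 experts hold parameters, but each token only pays for two expert forward passes.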

How well does it work: In tests, GLaM exceeds or is on par with the performance of GPT-3 on 80% of zero-shot tasks and 90% of one-shot tasks. Like DeepMind’s Gopher, part of the improved performance comes from the size of the dataset – 1.6 trillion tokens, in this case.

Why this matters: For a few years, various Google researchers have been pursuing ‘one model to learn them all’ – that is, a single model that can do a huge number of diverse tasks. Research like GLaM shows that MoE networks might be one route to building such a model.
  Read more: More Efficient In-Context Learning with GLaM (Google blog).

####################################################

DeepMind announces Gopher, a 280 billion parameter language model:
…AI research firm joins the three comma language club…
DeepMind has built Gopher, a 280 billion parameter language model. Gopher is the UK AI research company’s response to GPT-3, and sees DeepMind publicly announce a multi-hundred billion parameter dense model, letting it join a club that also includes companies like Microsoft, Inspur, and Huawei.

What it does: During the research, DeepMind found areas “where increasing the scale of a model continues to boost performance – for example, in areas like reading comprehension, fact-checking, and the identification of toxic language,” the company writes. “We also surface results where model scale does not significantly improve results — for instance, in logical reasoning and common-sense tasks.”

How well it works: Gopher outperforms GPT-3 in a broad range of areas – some of the results likely come from the dataset it was trained on, called MassiveText. MassiveText “contains 2.35 billion documents, or about 10.5 TB of text” (representing about 2.3 trillion tokens), and DeepMind notes that by curating a subset of MassiveText for data quality, it was able to substantially improve performance.

Language models – good, if you handle with care: Along with analysis on bias and other potential impacts of Gopher, DeepMind dedicates a section of the paper to safety: “We believe language models are a powerful tool for the development of safe artificial intelligence, and this is a central motivation of our work,” they write. “However language models risk causing significant harm if used poorly, and the benefits cannot be realised unless the harms are mitigated.”
  Given the above, how can we mitigate some of these harms? “We believe many harms due to LMs may be better addressed downstream, via both technical means (e.g. fine-tuning and monitoring) and sociotechnical means (e.g. multi-stakeholder engagement, controlled or staged release strategies, and establishment of application specific guidelines and benchmarks). Focusing safety and fairness efforts downstream has several benefits:”
  Read the blog post: Language modelling at scale: Gopher, ethical considerations, and retrieval (DeepMind blog).
  Read the paper: Scaling Language Models: Methods, Analysis & Insights from Training Gopher (PDF).

####################################################

Want to evaluate a Catalan language model? Use CLUB:
…You can only build what you can measure…
Researchers with the Barcelona Supercomputing Center have built the Catalan Language Understanding Benchmark (CLUB), a benchmark for evaluating NLP systems inspired by the (English language) GLUE test. The main curation rationale they followed “was to make these datasets both representative of contemporary Catalan language use, as well as directly comparable to similar reference datasets from the General Language Understanding Evaluation (GLUE)”.

What’s in the CLUB? CLUB includes evals for Part-of-Speech Tagging (POS), Named Entity Recognition and Classification (NERC), Catalan textual entailment and text classification, and Extractive Question Answering (which involved work like translating and creating new Catalan datasets – XQuAD-Ca, VilaQuAD and ViquiQuAD).

Why CLUB matters: There’s a phrase in business – ‘you can’t manage what you can’t measure’. CLUB will make it easier for researchers to develop capable Catalan-language systems.
  Read more: The Catalan Language CLUB (arXiv).

####################################################

Deep learning unlocks a math breakthrough:
…The era of Centaur Math cometh…
DeepMind researchers have used an AI system to help mathematicians make two breakthroughs in topology and representation theory. The result provides yet more evidence (following various AlphaFold-inspired projects) that humans+AI systems can discover things that neither could discover on their own.

What they did: The essential idea is quite simple: get a mathematician to come up with a hypothesis for a given function, then build an ML model to estimate that function over a particular distribution of data, then have the mathematician evaluate the result and use their intuition to guide further experimentation. The best part? “The necessary models can be trained within several hours on a machine with a single graphics processing unit”, DeepMind says.
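
As a toy illustration of that loop (not DeepMind's code or models), the sketch below hypothesizes that the number of faces of a convex polyhedron is some function of its vertices and edges, fits a plain least-squares model to a handful of examples, and prints the learned coefficients for a human to inspect. The choice of Euler's polyhedron formula as the target, the linear model standing in for the ML model, and the helper names `solve_linear_system` and `fit_least_squares` are all assumptions made for this example.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_least_squares(features, targets):
    """Fit targets ~ features . w by solving the normal equations."""
    k = len(features[0])
    xtx = [[sum(f[i] * f[j] for f in features) for j in range(k)] for i in range(k)]
    xty = [sum(f[i] * t for f, t in zip(features, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Step 1: generate data. (vertices, edges, faces) for some convex polyhedra.
polyhedra = [
    (4, 6, 4),     # tetrahedron
    (8, 12, 6),    # cube
    (6, 12, 8),    # octahedron
    (20, 30, 12),  # dodecahedron
    (12, 30, 20),  # icosahedron
    (6, 9, 5),     # triangular prism
    (5, 8, 5),     # square pyramid
]

# Step 2: fit a model for the hypothesized function faces = f(vertices, edges).
features = [(v, e, 1.0) for v, e, _ in polyhedra]  # the constant 1.0 is a bias term
targets = [f for _, _, f in polyhedra]
w_v, w_e, bias = fit_least_squares(features, targets)

# Step 3: hand the fitted pattern back to a human to interpret.
print("faces ~ %.3f * vertices + %.3f * edges + %.3f" % (w_v, w_e, bias))
```

The fitted coefficients should come out close to -1, 1 and 2, which a mathematician would recognize as V - E + F = 2. In the real work the model and the objects are far richer, but the division of labor is the same: the model surfaces a pattern, and the human decides what it means and what to try next.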

Why this matters: We’re entering a world where humans will collaborate with AI systems to synthesize new insights about reality. Though DeepMind’s system has limitations (“it requires the ability to generate large datasets of the representations of objects and for the patterns to be detectable in examples that are calculable,” DeepMind notes), it sketches out what the future of scientific discovery might look like.
  Read the paper: Advancing mathematics by guiding human intuition with AI (Nature, PDF).
  Read more: Exploring the beauty of pure mathematics in novel ways (DeepMind blog).

####################################################

Anthropic bits and pieces:
…(As a reminder, my dayjob is at Anthropic, an artificial intelligence safety and research company)…
We’ve just released our first paper, focused on simple baselines and investigations: A General Language Assistant as a Laboratory for Alignment. You can read it at arXiv here.

####################################################

Tech Tales:

Real and Imagined Gains
[DoD Historical archives, 2040]

They got trained in a pretty cruel way, back then – they’d initiate the agents and place them in a room, and the room had a leak of a poisonous substance that had a certain density and a certain spread pattern. The agents had to work out how not to asphyxiate by doing fairly complicated intuitively-driven analysis of the environment. If they were able to give a correct guess at the spread pattern (and avoid it) before the room filled up, they moved onto the next stage. If they weren’t able to, they asphyxiated and died – as in, felt their computational budget get cut, got put in cold storage, probably never booted up again.
  (One curious by-product of the then-popular AI techniques was that the agents would sometimes seek to preserve each other – in one case, two agents ‘kissed’ each other so they could more efficiently exchange their air reserves while the room filled; unfortunately, as their attention was allocated to the act of kissing, they did not complete the requisite calculations in time, and both died.)

Things that inspired this story: Kurt Vonnegut; reinforcement learning; environmental design; moral patienthood.