Import AI 220: Google builds an AI borderwall; better speech rec via pre-training; plus, a summary of ICLR papers

by Jack Clark

Want to measure progress towards AGI? Welcome to a Sisyphean task!
…Whenever we surpass an AGI-scale benchmark, we discover just how limited it really was…
One of the reasons it’s so hard to develop general intelligence is that whenever people come close to beating a benchmark oriented around measuring progress towards AGI, we discover just how limited this benchmark was and how far we have to go. That’s the gist of a new blogpost from a “fervent generalist” using the pseudonym ‘Z’, which discusses some of the problems inherent to measuring progress towards advanced AI systems.
  “Tasks we’ve succeeded at addressing with computers seem mundane, mere advances in some other field, not true AI. We miss that it was work in AI that lead to them,” they write. “Perhaps the benchmarks were always flawed, because we set them as measures of a general system, forgetting that the first systems to break through might be specialized to the task. You only see how “hackable” the test was after you see it “passed” by a system that clearly isn’t “intelligent”.”

So, what should we do? The author is fairly pessimistic about our ability to make progress here, because whenever people define new harder benchmarks, that usually incentivizes the AI community to collectively race to develop a system that can beat the benchmark. “Against such relentless optimization both individually and as a community, any decoupling between the new benchmark and AGI progress will manifest.”

Why this matters: Metrics are one of the ways we can orient ourselves with regard to the scientific progress being made by AI systems – and posts like this remind us that any single set of metrics is likely to be flawed or overfit in some way. My intuition is the way to go is developing ever-larger suites of AI testing systems which we can then use to more holistically characterize the capabilities of any given system.
  Read more: The difficulty of AI benchmarks (Singular Paths, blog).

###################################################

What’s hard and what’s easy about measuring AI? Check out what the experts say:
…Research paper lays out measurement and assessment challenges for AI policy…
Last year I helped organize a workshop at Stanford that brought together over a hundred AI practitioners and researchers to discuss the challenges of measuring and assessing AI. Our workshop identified six core challenges for measuring AI systems:
– Defining AI; as anyone knows, every policymaking exercise starts with definitions, and our definitions of AI are lacking.
– What are the factors that drive AI progress and how can we disambiguate them?
– How do we use bibliometric data to improve our analysis?
– What tools are available to help us analyze the economic impact of AI?
– How can we measure the societal impact of AI?
– What methods can we use to better anticipate the risks and threats of deployed AI systems?

Podcast conversation: Ray Perrault and I, co-chairs of the AI Index – a Stanford initiative to measure and assess AI, which hosted the workshop – recently appeared on the ‘Let’s Talk AI’ podcast to discuss the paper with Sharon Zhou.

Why this matters: Before we can regulate AI, we need to be able to measure and assess it at various levels of abstraction. Figuring out better tools to use to measure AI systems will help technologists create information that can drive policy decisions. More broadly, by building ‘measurement infrastructure’ within governments, we can improve the ability for civil society to anticipate and oversee challenges brought on by the maturation of AI technology.
  Read more: Measurement in AI Policy: Opportunities and Challenges (arXiv).
    Listen to the podcast here: Measurement in AI Policy: Opportunities and Challenges (Let’s Talk AI, Podbean).

###################################################

ICLR – a sampling of interesting papers for the 2021 conference:
…General scaling methods! NLP! Optimization! And so much more…
ICLR is a major AI research conference that uses anonymous, public submissions during the review phase. Papers are currently under review and AI researcher Aran Komatsuzaki has written a blog summarizing some of the more interesting papers and the trends behind them.

What’s hot in 2021:
– Scaling models to unprecedented sizes, while developing techniques to improve the efficiency of massive model training.
– Natural language processing; scaling models, novel training regimes, and methods to improve the efficiency of attention operations.
– RL agents that learn in part by modelling – sometimes described more colloquially as ‘dreaming’ – the world, then using this to improve performance.
– Optimization: Learning optimization systems to do better optimization, and so on.

Why this matters: Scaling has typically led to quite strong gains in certain types of machine learning – if we look at the above trends, they’re all inherently either about improving the efficiency of scaling, or figuring out ways to make models with fewer priors that learn richer structures at scale.
  Read more: Some Notable Recent ML Papers and Future Trends (Aran Komatsuzaki, blog).

###################################################

Robot navigation gets a boost with ‘RxR’ dataset:
…The era of autonomous robot navigation trundles closer…
How can we create robots that can intelligently navigate their environment, teach each other to navigate, and follow instructions? Google thinks one way is to create a massive dataset consisting of various paths through high-fidelity 3D buildings (recorded via ‘Matterport’), where each path is accompanied by detailed telemetry data as the navigator goes through the building, as well as instructions describing the path they take.

The dataset: The ‘Room-Across-Room’ (RxR) dataset contains ~126,000 instructions for ~16,500 distinct paths through a rich, varied set of rooms. “RxR is 10x larger, multilingual (English, Hindi and Telugu), with longer and more variable paths, and it includes… fine-grained visual groundings that relate each word to pixels/surfaces in the environment,” Google says in a research paper.

The most interesting thing about this is… the use of what Google terms pose traces – that is, when the people building the RxR dataset move around the world they “speak as they move and later transcribe their audio; our annotation tool records their 3D poses and time-aligns the entire pose trace with words in the transcription”. This means researchers who use this data have a rich, multi-modal dataset that pairs complex 3D information with written instructions, all within a simulated environment that provides togglable options for surface reconstructions, RGB-D panoramas, and 2D and 3D semantic segmentations. As a result, we’ll likely see people figure out a bunch of creative ways to use this rich set of data.
  Get the code: Room-Across-Room (RxR) Dataset (GitHub).
  Read the paper: Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense Spatiotemporal Grounding (arXiv).
  Read a thread about the research from Google here (Peter Anderson, Twitter).
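
To make the ‘pose trace’ idea above a bit more concrete, here is a minimal sketch of time-aligning words with poses. Everything in it is an assumption for illustration: the `Pose` and `Word` records, their fields, and the nearest-earlier-timestamp rule are hypothetical and are not the RxR file format or Google’s annotation tool.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Pose:
    # Hypothetical record: a timestamp (seconds) and a 3D position.
    t: float
    position: Tuple[float, float, float]


@dataclass
class Word:
    # Hypothetical record: a transcribed word and when it was spoken.
    text: str
    start: float


def align_words_to_poses(words: List[Word], poses: List[Pose]) -> List[Tuple[str, Pose]]:
    """Pair each word with the most recent pose recorded at or before it.

    Assumes `poses` is non-empty and sorted by timestamp. Words spoken
    before the first pose fall back to the first pose.
    """
    times = [p.t for p in poses]
    aligned = []
    for w in words:
        idx = max(bisect_right(times, w.start) - 1, 0)
        aligned.append((w.text, poses[idx]))
    return aligned


poses = [Pose(0.0, (0.0, 0.0, 0.0)), Pose(1.0, (1.0, 0.0, 0.0)), Pose(2.0, (1.0, 0.0, 1.0))]
words = [Word("walk", 0.4), Word("left", 1.0), Word("stop", 2.7)]
for text, pose in align_words_to_poses(words, poses):
    print(text, pose.position)
```

Once each word is tied to a pose, it can in principle be tied to whatever the navigator could see from that pose, which is the kind of word-level grounding the quote describes.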

###################################################

Google helps Anduril build America’s virtual, AI-infused border wall:
…21st century state capabilities = AI capabilities…
Google has a contract with the Customs and Border Protection agency to do work with Anduril, the military/defence AI startup founded by former VR wunderkind Palmer Luckey (and backed by Peter Thiel). This comes via The Intercept, which bases much of its reporting on a contract FOIA’d by the Tech Inquiry Project (which is run by former Googler Jack Poulson).

Why this matters: In the 20th century, state capacity for defense and military had most of its roots in large, well-funded government programs (think: Bletchley Park in the UK, the Manhattan Project in the USA, and so on). In the latter half of the 20th century governments steadily outsourced and offloaded their capacity to invent and deploy technology to third parties, like the major defense contractors.
  Those decisions are now causing significant anxiety among governments who are finding that the mammoth military industrial base they’ve created is pretty good at building expensive machines that (sometimes) work, and is (generally) terrible at rapidly developing and fielding new software-oriented capabilities (major exception: cryptography, comms/smart radio tech). So now they’re being forced to try to partner with younger software companies – this is challenging, since the large tech companies are global rather than national in terms of clients, and startups like Anduril need a lot of help being shepherded through complex procurement processes.
  It’s worth tracking projects like the Google-Anduril-CBP tie-up because they provide a signal for how quickly governments can reorient technology acquisition, and also tell us how willing the employee base of these companies is to work on such projects.
  Read more: Google AI Tech Will Be Used For Virtual Border Wall, CBP Contract Shows (The Intercept).
  Read the FOIA’d contract here (Document Cloud).

###################################################

Andrew Ng’s Landing.AI adds AI eyes to factory production lines:
…All watched over by machines of convolutional grace…
AI startup Landing.AI has announced LandingLens, software that lets manufacturers train and deploy AI systems that can look at stuff in a factory and identify problems with it. The software comes with inbuilt tools for data labeling and annotation, as well as systems for training and evaluating AI vision models.

Why this matters: One really annoying task that people need to do is stare at objects coming down production lines and figure out if they’ve got some kind of fault; things like LandingLens promise to automate this, which could make manufacturing more efficient. (As with most tools like this there are inherent employment issues bound up in such a decision, but my intuition is AI systems will eventually exceed human capabilities at tasks like product defect detection, making wide deployment a net societal gain).
  Read more: Landing AI Unveils AI Visual Inspection Platform to Improve Quality and Reduce Costs for Manufacturers Worldwide (Landing AI).

###################################################

Google sets new speech recognition record via, you guessed it, MASSIVE PRE-TRAINING:
…Images, text, and now audio networks – all amenable to one (big) weird trick…
Google has set a new state-of-the-art at speech recognition by using a technique that has been sweeping across the ML community – massive, large-scale pre-training. Pre-training is where you naively train a network on a big blob of data (e.g., ImageNet models that get pre-trained on other, larger datasets then fine-tuned on ImageNet; or text models that get pre-trained on huge text corpuses (e.g., GPT-3) then fine-tuned). Here, Google has combined a bunch of engineering techniques to do large-scale pre-training to set a new standard. In the company’s own words, it uses “a large unlabeled dataset to help with improving the performance of a supervised task defined by a labeled dataset”.

The ingredients: Google’s system uses a large-scale ‘Conformer’, a convnet-transformer hybrid model, along with pre-training, Noisy Student training, and some other tricks to get its performance.

How good is it? Google’s largest system (which uses a 1 billion parameter ‘Conformer XXL’ model) gets a word-error-rate of 1.4% on the ‘test clean’ LibriSpeech dataset, and 2.6% on ‘test other’. To put this in perspective, that represents almost a one point improvement over prior SOTA. That’s significant at this level of difficulty. And it’s also worth remembering how far we’ve come – just five years ago, we were getting word-error-rates of around 13.25%!
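
For readers who want to see what a word-error-rate actually counts, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the system’s output, divided by the number of reference words. The function name and the example sentences are made up for illustration, and this is not the evaluation code used in the paper; real evaluations also normalize text (casing, punctuation) before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors / 6 words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Read this way, a 1.4% word-error-rate means roughly 14 word-level mistakes per 1,000 reference words.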

The secret? More data, bigger models: Google pre-trains its system using wav2vec 2.0 on 30,031 hours of audio data from the ‘Libri-Light’ dataset, then does supervised training on the 960 hours of transcribed audio in LibriSpeech. Their best performing model uses a billion parameters and was pre-trained for four days on around 512 TPU V3 cores, then fine-tuned for three days on a further 512 TPUs.
…and a language model: They also try to use a language model to further improve performance – the idea being that an LM can help correct transcription errors in a transcribed setting. By using a language model, they’re able to further improve performance by 0.1 absolute performance points; this isn’t huge, but it’s also not nothing, and the LM improvement seems pretty robust.

Why this matters: Pre-training might be dumb and undisciplined, but heck – it works! Papers like this further highlight the durability of this technique and suggest that, given sufficient data and compute, we can expect to develop increasingly powerful systems for basic ML tasks like audio, vision, and text processing.
  Read more: Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition (arXiv).

###################################################

Tech Tales:

Separation

[Calls between a low-signal part of the world and a contemporary urban center. 2023.]

The marriage was already falling apart when she left. On her last day she stood at the door, bags piled high by it, the taxi about to arrive, and said “I’ll see you”.
  “I’ll see you too” he said.
    They were half right and half wrong.

She went to a rainforest where she cataloged the dwindling species of the planet. The trip of a lifetime. All her bundles of luck had aligned and the grant application came back successful – she’d hit the academic equivalent of a home run. How could she not go?

Of course, communication was hard. No one wants to write emails. Text messages feel like talking to shadows. So she didn’t send them. They’d talk rarely. And between the conversations she’d sit, watching butterflies and brightly colored beetles, and remember how when they had been young they’d had dyed hair and would run through construction sites at night, un-sensible shoes skidding over asphalt.
 

She’d watch the birds and the insects and she’d stare at her satellite connection and sometimes she’d make a call. They’d look at each other – their faces not their own faces; blotches of color and motion, filled in by AI algorithms, trying to reconstruct the parts of their expressions that had escaped the satellites.
  “How are you,” he would say.
  “I want things to be like they were before,” she’d say.
  “It doesn’t work like that. We’re both going forward,” he’d say.
  “I loved us,” she said. “I don’t know that I love us.”
  “I love you,” he’d say.
  “I love to be alive,” she’d say.
    And they’d look at each other.

Months went by and there wasn’t as much drama as either of them had feared. Their conversations became more civil and less frequent. They’d never mention the word, but they both knew that they were unknitting from each other. Unwinding memories and expressions from around the other, so they could perhaps escape with themselves.
  “I’ve forgotten how you smell,” he said.
  “I don’t know I’ve forgotten, but there’s so much here that I keep on smelling all the new things.”
  “What’s the best thing you’ve smelled?”
  “There was a bird’s nest on the tree by the well. But the birds left. I smelled the nest and it was very strange.”
  “You’re crazy.”
  “I’m collecting secrets before they’re gone,” she said. “Of course I am.”

They tired of their video avatars – how scar tissue wasn’t captured, and eyes weren’t always quite aligned. Earlobes blurred into backgrounds and slight twitches and other physical mannerisms were hidden or elided over entirely.
  So they switched to voice. And now here the algorithms appeared again – quiet, diligent things, that would take a word and compress it down, transmit it via satellites, then re-inflate it on the other side.
  “You sound funny,” he said.
  “It’s the technology, mostly,” she said. “I’ve submitted some more grant applications. We’ve been finding so much.”
  “That’s good,” he said, and the algorithms took ‘good’ and squashed it down, so she heard less of the pain and more of what most people sound like when they say ‘good’.
  But she knew.
  “It’s going to be okay,” she said. “There is so much love here. I love the birds and I love the beetles.”
  “I loved you,” he said, holding the phone tight enough he could feel it bend a little.
  “And I you,” she said. “The world is full of love. There is so much for you.”
  “Thank you.”
  “Goodbye.”
  “Goodbye.”
  And that was it – their last conversation, brokered by algorithms that had no stake in the preservation of a relationship, just a stake in the preservation of consistency – an interest in forever being able to generate something to encourage further interaction, and an inability to appreciate the peace of quiet. 

Things that inspired this story: Using generative models to reconstruct audio, video, and other data streams; thinking about the emotional resonance of technology ‘filling in the blanks’ on human relationships; distance makes hearts strengthen and weaken in equal measure but in different areas; how relationships are much like a consensual shared dream and as we all know ‘going to sleep’ and ‘waking up’ are supernatural things that all of us do within our lives; meditations on grace and humor and the end of the world; listening to the song ‘Metal Heart’ by Cat Power while bicycling around the downward-society cracked streets of Oakland; sunlight in a happy room.