Import AI 263: Foundation models; Amazon improves Alexa; My Little Pony GPT.

by Jack Clark

Amazon makes Alexa sound more convincing:
…A grab bag of techniques to make synthetic voices sound more realistic…
Amazon has published a research paper about some of the techniques it’s using to make more convincing text-to-speech systems. By using a variety of tools, the company was able to improve the quality of its synthetic voices by 39% relative to a baseline system.

What they did: They use a variety of techniques, ranging from a state-of-the-art sequence-to-sequence model to encode the acoustics, to a parallel-WaveNet implementation for the ‘neural vocoder’, which converts the acoustic representation into speech audio.
  Adversarial training – they also use a GAN approach to further improve quality: the acoustic model acts as the generator, while a discriminator network pushes it toward producing more real-sounding samples.
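As a rough illustration of this kind of adversarial training (not Amazon's actual setup – the architectures and loss here are stand-ins), the generator producing spectrograms is trained to fool a discriminator, often with least-squares GAN losses:

```python
import numpy as np

# Illustrative sketch of LSGAN-style losses sometimes used to sharpen
# synthetic speech: a discriminator D scores mel-spectrograms, and the
# acoustic model (the generator) is pushed toward scores of 1. The toy
# linear discriminator below is a placeholder, not the paper's network.

rng = np.random.default_rng(0)

def discriminator(spec, w):
    # One scalar realness score per spectrogram in the batch.
    return spec.reshape(spec.shape[0], -1) @ w

def lsgan_d_loss(d_real, d_fake):
    # Push scores on real audio toward 1 and on generated audio toward 0.
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    # The generator wants its outputs scored as real (i.e. toward 1).
    return np.mean((d_fake - 1.0) ** 2)

real_spec = rng.normal(size=(8, 80, 40))   # batch of ground-truth mels
fake_spec = rng.normal(size=(8, 80, 40))   # batch from the acoustic model
w = rng.normal(size=(80 * 40, 1)) * 0.01

d_loss = lsgan_d_loss(discriminator(real_spec, w), discriminator(fake_spec, w))
g_loss = lsgan_g_loss(discriminator(fake_spec, w))
print(float(d_loss), float(g_loss))
```

In practice the two losses are minimized in alternation, so the discriminator keeps supplying a useful training signal as the generator improves.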
Read more: Enhancing audio quality for expressive Neural Text-to-Speech (arXiv).

####################################################

Stanford University: Now that big models are here, what do we do about them?
…New research paper and workshop tries to lay out the issues of GPT-3, BERT, and so on…
In recent years, a new class of highly capable, broad utility AI model has emerged. These models vary in modalities and purposes, and include things like GPT-3 (text analysis and generation), BERT (a fundamental input into new search engines), CLIP (combined text and image model), and more. These models are typified by being trained on very large datasets, then being used for a broad range of purposes, many of which aren’t anticipated by their developers.
  Now, researchers with Stanford University have published a large research paper on the challenges posed by these models – it’s worth skimming the 100+ page paper, and it does a good job of summarizing the different impacts of these models in different areas, ranging from healthcare to robotics. It also tackles core issues, like dataset creation, environmental efficiency, compute usage, and more. Stanford is also hosting a workshop on these models this week, and I’ll be giving a talk where I try to lay out some of the issues, particularly those relating to the centralization of resources and power.

Why this matters: I mostly think ‘foundation models’ matter insofar as they’re bound up with the broader industrialization of AI – foundation models are what you get when you’ve built a bunch of systems that can dump a large amount of resources into the development of your model (where resource = compute, data, training time, human engineering time, etc). Some people dislike foundation models because of how they interact with existing power structures. I think foundation models tell us that there are very significant power asymmetries in AI development and we should pay attention to them and try to increase the number of actors that can work on them. I’ll be giving a keynote about these ideas at the workshop – comments welcome!
Read more about the workshop here: Workshop on Foundation Models (Stanford).
Read the paper here: On the Opportunities and Risks of Foundation Models (arXiv).

####################################################

DeepMind’s multi-agent game AI software goes to V1:
…OpenSpiel steps forward, gets ready to play more…
OpenSpiel (first covered in November 2019, #162), has gone into its first, major V1 release, meaning that its developer, DeepMind, thinks the software is now quite well supported. OpenSpiel is a software framework to help researchers play around with multi-agent reinforcement learning.

What’s new in OpenSpiel: Additions include a bunch of new games (ranging from tic-tac-toe, to reconnaissance blind chess), various algorithm implementations (including some JAX rewrites of things like DQN), more examples, more bots, and various other quality of life improvements.
Read more and get the code: OpenSpiel update notes (GitHub).

####################################################

Step aside, dogs. In the future, blind people are going to use drones as well:
…You know what’s cooler than an organic dog? A mechanical flying drone!…
Some researchers from Karlsruhe Institute of Technology have combined semantic segmentation computer vision techniques with a flying drone to create what they call a ‘flying guide dog’ – a machine meant to help Blind and Visually Impaired People (BVIP) safely navigate around a city. “Based on its perception of the environment, the drone adjusts itself and leads the user to walk safely,” they write. “To follow the drone, the user holds a string attached to the drone.”

What they did: The approach uses semantic segmentation to help the drone figure out which parts of a scene are safe for a pedestrian, and to identify important objects like traffic lights, where changes can alter the safety landscape. They pair this with the drone, which flies along the walkable pathways, guiding the pedestrian holding its string. The drone can also talk to the user through a bone conduction headset, telling people to ‘stop’ when there’s a red light and ‘go’ when there’s a green light. In tests, users said they found the drone helpful and relatively easy to use, though its traffic-light prediction could be improved.
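To make the pipeline concrete, here's a minimal sketch (not the authors' code) of how a per-pixel segmentation mask could be turned into a steering hint and a stop/go voice cue; the class IDs and the 'drift toward the most walkable column' heuristic are assumptions for the example:

```python
import numpy as np

# Hypothetical class IDs for the segmentation output.
SIDEWALK, ROAD, CROSSWALK, RED_LIGHT, GREEN_LIGHT = 0, 1, 2, 3, 4
WALKABLE = {SIDEWALK, CROSSWALK}

def steering_hint(mask):
    """Return the image column the drone should drift toward."""
    walkable = np.isin(mask, list(WALKABLE))
    lower = walkable[mask.shape[0] // 2 :]   # look at the ground near the user
    col_scores = lower.sum(axis=0)           # walkable pixels per column
    return int(np.argmax(col_scores))

def voice_cue(mask):
    """'stop' if a red light is visible, 'go' on green, else None."""
    if (mask == RED_LIGHT).any():
        return "stop"
    if (mask == GREEN_LIGHT).any():
        return "go"
    return None

# Toy scene: road everywhere, a walkable strip on the right, a red light.
mask = np.full((6, 8), ROAD)
mask[:, 5:] = SIDEWALK
mask[0, 0] = RED_LIGHT
print(steering_hint(mask), voice_cue(mask))
```

A real system would smooth these hints over time and fuse them with the drone's own state estimation before commanding any motion.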

In search of the dog baseline: What I would have loved to have seen here would be a dog baseline – my assumption is dogs are way, way better at this task than drones. Dogs are also more autonomous, better able to deal with unanticipated changes in the environment, and respond in a far cuter way to head pats (where, in the worst case, applying a head pat to a drone either breaks its rotors or breaks your fingers). Still, this is a tantalizing research project outlining some of the ways robots are going to become more integrated into our day-to-day lives.
  Read more: Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation (arXiv).
Get the code and dataset from this repo eventually (Flying Guide Dog, GitHub).

####################################################

AI uses are hard to predict – case in point, My Little Pony GPT:
…6bn parameters of neural stuff meets the fandom…
A couple of months ago, Eleuther released a 6 billion parameter GPT model, named GPT-J-6B (Import AI 253).
  Cyborgs will dream of electric my little ponies: Now, researchers with *checks notes* a distributed collective called pone.dev, which is trying to build an *squints hard at notes* “AI Pony Waifu”, say they’ve finetuned this network on a ton of My Little Pony (MLP) fanfiction to create something that can spit out convincing MLP text.

Why this matters: We’re entering the era of DIY AI where a ton of people will use big models like GPT-J-6B for their own purposes, ranging from the banal to the profane, from the dangerous to the joyful, from the sexy to the ascetic. This is just another dot in the galaxy of uses, and highlights how AI is going to augment and magnify different types of culture.
  Check out one of the samples here (Astralight Heart, twitter).
  Check out another sample here (Astralight Heart, twitter).

####################################################

X-ray analysis via deep learning:
…Chinese researchers gather ~50k x-ray images of prohibited items…
Chinese researchers have built PIDray, a dataset of x-ray images of prohibited items. PIDray consists of 12 categories of prohibited items across 47,677 images, giving it far more prohibited-item images than prior x-ray datasets; SIXray was larger overall at 1,059,231 images, but only ~8k of those images contained prohibited items.

Why build PIDray? The researchers built PIDray because “compared with natural images, X-ray images have a quite different appearance and edges of objects and background, which brings new challenges in appearance modeling for X-ray detection.” Therefore, making datasets like PIDray will make it easier for researchers to build systems that can use contemporary AI techniques to analyze x-rayed items.
Read more: Towards Real-World Prohibited Item Detection: A Large-Scale X-ray Benchmark (arXiv).

####################################################

After Copilot (GitHub) and Codex (OpenAI), along comes Google’s unnamed code model:
…137 billion parameters = surprisingly capable program synthesis…
Google has developed a 137 billion parameter code model, following on from earlier work by GitHub and OpenAI. The model portends a future where people specify in natural language what they want computers to do, then a big blob of neural stuff takes over and translates these commands into code.

What they did – new datasets to assess performance: Along with developing the models, they create a ‘Mostly Basic Programming Problems’ (MBPP) dataset, which contains 974 short Python functions along with their text descriptions and test cases. They also created a Python synthesis dataset made up of 23,914 problems derived from a subset of the MathQA dataset. “These two datasets exercise different points in the space of synthesis tasks: MBPP contains more usage of imperative control flow such as loops and conditionals, while MathQA-Python contains more complex natural language descriptions,” they write.
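A hedged sketch of what an MBPP-style task looks like in practice: each problem pairs a natural-language description with test cases, and a candidate program counts as solved if every test passes. The field names, example problem, and checker below are illustrative, not the paper's actual harness.

```python
# One MBPP-style problem: a text prompt plus assert-style test cases.
problem = {
    "text": "Write a function first_repeated_char(s) that returns the "
            "first character that appears twice in s, or None.",
    "test_list": [
        "assert first_repeated_char('abca') == 'a'",
        "assert first_repeated_char('abc') is None",
    ],
}

# A candidate program, as a model might emit it.
candidate = """
def first_repeated_char(s):
    seen = set()
    for ch in s:
        if ch in seen:
            return ch
        seen.add(ch)
    return None
"""

def passes(program, tests):
    """Run the candidate, then each test; any exception means failure."""
    env = {}
    try:
        exec(program, env)      # define the candidate function
        for t in tests:
            exec(t, env)        # run each assert-style test case
    except Exception:
        return False
    return True

print(passes(candidate, problem["test_list"]))
```

Functional-correctness scoring like this is why such benchmarks reward programs that actually run, rather than ones that merely look plausible token-by-token.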

Things that make you go ‘hmm, kind of good, kind of scary’: Emergent capabilities: One of the fascinating things about models like this (which you could term a ‘foundation model’) is how, with a few prompts in their context window, you can coax them into new behaviors – but a graph in the paper shows that few-shot performance scales less smoothly than finetuning; in other words, you get somewhat discontinuous jumps in capability as you go up model sizes. That’s useful, as it means these models can go from not understanding something to understanding something, but it’s also potentially worrying – new capabilities emerge in a kind of janky, sudden manner.
Read more: Program Synthesis with Large Language Models (arXiv).

####################################################

Tech Tales:

The Most Perfect Flower
[Earth, 2035]

Towards the end of the first era, the robots would play games that would entertain the human public and inspire curiosity in the nascent robot civilization. One famous game was called The Most Perfect Flower – the robots competed with one another to synthesize a virtual simulacrum of a vanishingly rare flower – and one of the catches was they could read about it but could not see images explicitly containing it (though some robots took their chances and looked at photos of other plants, making assumptions that certain unlabeled plants in the background corresponded to the plant described in text).
  For weeks, the robots competed with each other, iterating on various plant designs. Members of the public (both humans and robots) voted on the designs, and the machines updated their simulated flowers, smoothing out a petal here, altering a tint there, booting up a new physics simulation to check the dew was sitting correctly there, and so on. Meanwhile, a scout robot had been funded by spectators of the competition to go and search out a real example of the flower they were synthesizing.

The scout robot was struck by lightning and disabled a few metres from the flower – though, hidden beneath thick tree growth, it had not yet spotted it. Initially, the robots sought to raise money to fund another expedition, but public interest had waned. Some months after that, the public soured on the concept of robot-instigated games entirely, and the various projects were shut down or handed over to humans, depending on interest. Perhaps predictably, projects like the Most Perfect House and Most Fiendish Weapon were of interest to the humans, while Most Perfect Flower (and related ones, such as Most Perfect Ecosystem and Most Dense Forest) failed to draw enough interest to continue.
  Some centuries after that, some robot successors unearthed these projects and went about synthesizing and constructing the things outlined within them; it was in this way that, hundreds of years after going extinct, a certain type of flower with pink petals and a blue-and-yellow core came alive in a controlled environment, watched over by caring, inhuman eyes.

Things that inspired this story: Fréchet Inception Distance (FID) metrics; machine-on-machine NFT marketplaces (imagined); NFTs (real); generative adversarial networks; program synthesis; multi-agent reinforcement learning.