Import AI 244: NVIDIA makes better fake images; DeepMind gets better at weather forecasting; plus 5,000 hours of speech data.

“Teaching a Robot Dog to Pee Beer”
Here’s a video in which a very interesting person hacks around with a Boston Dynamics ‘Spot’ robot, building it a machine to let it ‘pee’ beer. This video is extremely great.
  Read more: Teaching a Robot Dog to Pee Beer (Michael Reeves, YouTube).

###################################################

5,000 hours of transcribed speech audio:
…Think your AI doesn’t know enough about financial news? Feed it this earnings call dataset…
Kensho, a subsidiary of dull-but-worthy financial analytics company S&P Global, has published SPGISpeech, a dataset of 5,000 hours of professionally-transcribed audio.

The dataset: SPGISpeech “consists of 5,000 hours of recorded company earnings calls and associated manual transcription text. The original calls were split based on silences into slices ranging from 5 to 15 seconds to allow easy training of a speech recognition system”, according to Kensho. The dataset has a vocabulary size of 100,000 and contains 50,000 distinct speakers.
  Here’s an example of what a transcribed slice might look like: “our adjusted effective tax rate was 31.6%. Please turn to Slide 10 for balance sheet and other highlights.”
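The silence-based slicing Kensho describes can be sketched in a few lines of plain Python. The amplitude threshold and silence-run length below are illustrative assumptions, not the dataset's actual segmentation parameters:

```python
def split_on_silence(samples, threshold=0.02, min_silence=3):
    """Split a waveform (a list of amplitudes) into voiced slices,
    treating runs of low-amplitude samples as slice boundaries.
    The threshold and run length are illustrative, not Kensho's values."""
    slices, current, quiet = [], [], 0
    for s in samples:
        quiet = quiet + 1 if abs(s) < threshold else 0
        current.append(s)
        if quiet == min_silence:
            # Close the current slice, dropping the trailing silence.
            voiced = current[:-min_silence]
            if voiced:
                slices.append(voiced)
            current, quiet = [], 0
    # Keep any trailing audio that contains voiced samples.
    if any(abs(x) >= threshold for x in current):
        slices.append(current)
    return slices
```

A production pipeline would operate on real audio frames (and enforce the 5–15 second slice bounds the dataset uses), but the boundary logic is the same idea.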

How does it compare? SPGISpeech is about 10X smaller than the Spotify podcast dataset (Import AI 242).

Why this matters: In ten years’ time, the entire world will be continuously scanned and analyzed via AI systems which are continually transcribing/digitizing reality. At first, this will happen in places where there’s an economic drive for it (e.g., continually analyzing financial data), but eventually it’ll be everywhere. Some of the issues this will raise include: who gets digitized and who doesn’t? And what happens to things that have been digitized – who explores or exploits these shadow-world representations of reality?
  Read more: SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end-to-end speech recognition (arXiv).
Get the data (requires signing an agreement): Kensho Audio Transcription Dataset (Kensho Datasets, site).

###################################################

GPUs are being smuggled via speedboat:
…and that’s why you can’t find any to buy…
Police recently seized hundreds of NVIDIA cards from smugglers, according to Kotaku. The smugglers were caught moving the GPUs from a fishing boat onto a nearby speedboat. Though these were likely cryptomining-oriented cards, it’s a sign of the times: computers have become so valuable that they’re being smuggled around using the same methods as cocaine in the 1980s. How long until AI GPUs get the same treatment?
  Read more: Smuggled Nvidia Cards Found After High-Speed Boat Chase (Kotaku).

###################################################

What does it take to train an AI to decode handwriting? Read this and find out:
…Researchers go through the nitty-gritty details of an applied AI project…
Researchers with Turnitin, an education technology company, have published a paper on how to build an AI which can extract and parse text from complicated scientific papers. This is one of those tasks which sounds relatively easy but is, in practice, quite difficult. That’s because, as the researchers note, a standard scientific paper will consist of “images containing possibly multiple blurbs of handwritten text, math equations, tables, drawings, diagrams, side-notes, scratched out text and text inserted using an arrow / circumflex and other artifacts, all put together with no reliable layout”.

Results versus what’s generally available: The authors say their system is much better than commercial services which can be rented via traditional clouds. For instance, they report the error rate of the ‘best Cloud API’ on parsing a ‘free form answers’ dataset as 14.4% (as assessed by FPHR, where lower is better), whereas theirs is 7.6%.
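The paper's FPHR metric isn't spelled out here, but error rates for text recognition are conventionally built on edit distance. As a hedged sketch, here is a standard character-level error rate (Levenshtein distance normalized by reference length), which is related to, but not identical to, the metric the authors use:

```python
def char_error_rate(reference, hypothesis):
    """Levenshtein edit distance between two strings, normalized by the
    reference length -- a standard proxy for recognition error rates.
    (The paper's FPHR metric is related but not necessarily identical.)"""
    m, n = len(reference), len(hypothesis)
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,        # deletion
                          curr[j - 1] + 1,    # insertion
                          prev[j - 1] + cost) # substitution
        prev = curr
    return prev[n] / max(m, 1)
```

On this kind of metric, a drop from 14.4% to 7.6% roughly halves the number of character-level mistakes per page.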

Sentences that make you go hmmm: “In total, the model has about 27 million parameters, which is quite modest,” the authors write. They’re correct, in that this is modest, though I wonder what a sentence equivalent to this might look like in a decade (quite modest, at about 27 billion parameters?).

Why this matters: Though theoretical breakthroughs and large generic models seem to drive a lot of progress in AI research, it’s always worth stepping back and looking at how people are applying all of these ideas for highly specific, niche tasks. Papers like this shine a light on this area and give us a sense of what real, applied deployment looks like.
Read more: Turnitin tech talk: Full Page Handwriting Recognition via End-to-End Deep Learning (Turnitin blog).
Read more: Full Page Handwriting Recognition via Image to Sequence Extraction (arXiv).

###################################################

Three AI policy jobs, with the Future of Life Institute:
The Future of Life Institute has three new job postings for full-time, remote, policy-focused positions. FLI is looking for a Director of European Policy, a Policy Advocate, and a Policy Researcher. These openings will mainly focus on AI policy and governance. Additional policy areas of interest may include lethal autonomous weapons, synthetic biology, nuclear weapons policy, and the management of existential and global catastrophic risk. You can find more details about these positions here. FLI is accepting applications for these positions on a rolling basis. If you have any questions about any of these positions, feel free to reach out to jobsadmin@futureoflife.org.

###################################################

By getting rid of its ethics team, Google invites wrath from an unconventional area – shareholders!
…Shareholder group says Google should review its whistleblowing policies…
Trillium Asset Management has filed a shareholder resolution asking Google’s board of directors to review the company’s approach to whistleblowers, according to The Verge. Trillium, which is supported by nonprofit group Open MIC in this push, says whistleblowers can help investors by spotting problems that the company doesn’t want to reach the public, and it explicitly IDs the ousting of Google’s Timnit Gebru and Margaret Mitchell (previously co-leads of its ethics team) as the impetus for the resolution.

Why this matters: AI is powerful. Like any powerful thing, the uses of it have both positives and negatives. Gebru and Mitchell highlighted both types of uses in their work and were recently pushed out of the company, as part of a larger push by Google to control more of its own research into the ethical impacts of its tech. If shareholders start to discuss these issues, Google has a fiduciary duty to listen to them – but with Trillium’s stake worth about $140 million (versus an overall market cap of roughly $1.5 trillion), it’s unclear if this resolution will change much.
Read more: Alphabet shareholder pushes Google for better whistleblower protections (The Verge).

###################################################

NVIDIA makes synthetic images that are harder to spot:
…Latest research means we might have a better-than-stock-StyleGAN2 system now…
NVIDIA has made more progress on the systems that let us generate synthetic imagery. Specifically, the company, along with researchers from the University of Maryland, the Max Planck Institute for Informatics, Bilkent University, and the Helmholtz Center for Information Security, has published some ideas for how to further improve on StyleGAN2, the current best-in-class way to generate synthetic images.

What they did: This research makes two main contributions. First, the researchers replace the standard loss within StyleGAN2 with a “newly designed dual contrastive loss”. Second, they design a new “reference-attention discriminator architecture”, which helps improve performance on small datasets (though doesn’t help as much for large-scale ones).

How good are the results? In tests, the dual contrastive loss improves performance on four out of five datasets (FFHQ – a dataset of faces, Bedroom, Church, Horse), and obtains the second-best performance after Wasserstein GAN on the ‘CLEVR’ dataset. Meanwhile, the reference-attention system seems to radically help for small datasets (30k instances or less), but yields the same or slightly worse performance than stock StyleGAN2 at large scales. Combined, the techniques tend to yield significant improvements in the quality of synthetic images, as assessed by the Frechet Inception Distance (FID) metric.
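FID scores image quality by comparing the statistics of feature embeddings of real versus generated images. The real metric uses 2048-dimensional Inception features and a matrix square root over covariances; as a minimal illustration, here is the underlying Fréchet distance for the one-dimensional Gaussian case (the full formula reduces to this when features are scalars):

```python
import math

def frechet_distance_1d(real, fake):
    """Frechet distance between 1-D Gaussians fit to two sample sets:
    (mu_r - mu_f)^2 + (sigma_r - sigma_f)^2.
    Real FID applies this to 2048-d Inception-network features, where the
    covariance term needs a matrix square root; this is the scalar case."""
    def stats(xs):
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        return mu, math.sqrt(var)
    mu_r, s_r = stats(real)
    mu_f, s_f = stats(fake)
    return (mu_r - mu_f) ** 2 + (s_r - s_f) ** 2
```

Lower is better: identical real/fake statistics give a distance of zero, which is why the improvements NVIDIA reports show up as FID reductions.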

Why this matters: Here’s a broader issue I think about whenever I read a synthetic imagery paper: who is responsible for this? In NVIDIA’s mind, the company probably thinks it is doing research to drive forward the state of the art of AI, apply what it learns to building GPUs that are better for AI, and generally increase the amount of activity around AI. At what point does a second-order impact, like NVIDIA’s role in the emerging ecosystem of AI-mediated disinformation, start to be something that the company weighs and talks about? And would it be appropriate for it to care about this, or should it instead focus on just building things and leave analysis to others? I think, given the hugely significant impacts we’re seeing from AI, that these questions are getting harder to answer with each year that goes by.
Read more: Dual Contrastive Loss and Attention for GANs (arXiv).

###################################################

Want to understand the future of AI-filled drones? Read about “UAV-Human”:
…Datasets like this make our ‘eye in the sky’ future visible…
In AI, data is one of the expensive things that people carefully gather and curate, usually to help them develop systems to solve tasks contained in the data. That means that some datasets are signals about the future of one strand of AI research. With that in mind, a paper discussing “UAV-Human” is worth reading, because it’s about a dataset to enable “human behaviour understanding with UAVs” – in other words, it’s about the future of drone-driven surveillance.

What goes into UAV-Human? The dataset was collected using a DJI Matrice 100 drone platform in multiple modalities, ranging from fisheye video to night-vision, RGB, infrared, and more. The drone was outfitted with an Azure Kinect DK to collect IR and depth maps.

What does UAV-Human tell us about the future of surveillance? UAV-Human is oriented around action recognition, pose recognition, person re-identification, and attribute recognition. The data was collected in a variety of weather conditions across multiple modalities, and includes footage where the UAV is descending, moving, and rotating, as well as hovering in place.

Why this matters: Right now, AI techniques don’t work very well on drone data. This is partially because of a lack of much available data to train these systems on (which UAV-Human helps solve), and also because of the inherent difficulty of making correct inferences about sequences of images (for why this is hard, see the entire self-driving car industry). With datasets like UAV-Human, a world of drone-mediated AI surveillance gets a bit closer.
  Read more: UAV-Human: A Large Benchmark for Human Behavior Understanding with Unmanned Aerial Vehicles (arXiv).

###################################################

Your next weather forecast could be done by deep learning:
…DeepMind research shows that GANs can learn the weather…
In the future, deep learning-based systems could provide short-term (<2 hour) predictions of rainfall – and these forecasts will be better than the ones we have today. That’s the implication of new research from DeepMind, the UK Met Office, the University of Exeter, and the University of Reading. In the research, they train a neural net to provide rainfall estimates, and the resulting system is generally better than those used today.
  “Our model produces realistic and spatio-temporally consistent predictions over regions up to 1536 km × 1280 km and with lead times from 5–90 min ahead. In a systematic evaluation by more than fifty expert forecasters from the Met Office, our generative model ranked first for its accuracy and usefulness in 88% of cases against two competitive methods, demonstrating its decision-making value and ability to provide physical insight to real-world experts,” the authors write. (They compare it against a neural baseline, as well as a widely used system called ‘pySTEPs’.)

How it works: They train their system as a conditional generative adversarial network (GAN), using UK rainfall data from 2016-2018 and testing on a 2019 dataset.
  Their system has two loss functions and a regularization term, which help it perform better than an expert system and another neural baseline. Specifically, the first loss is defined by a spatial discriminator that tries to distinguish real radar fields from generated fields, and the second loss comes from a temporal discriminator which tries to distinguish real and generated sequences. The system is regularized by a term that penalizes deviations at the grid-cell level between real radar and the model’s predictions, which further improves performance. The resulting system seems computationally efficient – a single prediction takes just over a second to generate on an NVIDIA V100.
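The three-term objective described above can be sketched in outline. This is a simplified stand-in, not DeepMind's implementation: the discriminators here are caller-supplied scoring functions, the combination weights are illustrative, and the paper's actual regularizer includes details (e.g. rain-rate weighting) omitted here:

```python
def grid_cell_regularizer(pred, real):
    """Mean absolute deviation between predicted and observed radar
    fields at the grid-cell level (a simplified stand-in for the
    paper's regularization term). pred/real are 2-D lists of rain rates."""
    pairs = [(p, r) for pr, rr in zip(pred, real) for p, r in zip(pr, rr)]
    return sum(abs(p - r) for p, r in pairs) / len(pairs)

def generator_loss(pred_seq, real_seq, spatial_disc, temporal_disc,
                   reg_weight=1.0):
    """Sketch of the generator objective: fool a per-frame spatial
    discriminator and a whole-sequence temporal discriminator, plus a
    grid-cell penalty. Discriminator scores are higher-is-more-real, so
    the generator minimizes their negation."""
    spatial = -sum(spatial_disc(f) for f in pred_seq) / len(pred_seq)
    temporal = -temporal_disc(pred_seq)
    reg = sum(grid_cell_regularizer(p, r)
              for p, r in zip(pred_seq, real_seq)) / len(pred_seq)
    return spatial + temporal + reg_weight * reg
```

In training, the two discriminators would be neural nets updated adversarially against the generator; the regularizer is what keeps samples pinned to the observed radar at the pixel level.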

Why this matters: You know what is extremely weird and is something we take as normal? That neural nets can do function approximation of reality. When I see an AI system figure out an aspect of protein folding (AlphaFold: #226), or weather forecasting, I think to myself: okay, maybe this AI stuff is a really big deal, because getting good at weather prediction or chemistry is basically impossible to bullshit. If you have systems that do well at this, I think there’s a credible argument that these AI systems can learn useful abstractions for highly complex, emergent systems. Fantastic!
  Read more: Skillful Precipitation Nowcasting using Deep Generative Models of Radar (arXiv).

###################################################

Tech Tales

Thislifedoesnotexist.com
[2031: Black budget ‘AI deployment’-analysis oriented intelligence team, undisclosed location]

Thislifedoesnotexist.com launched in 2026, made its first millionaires by 2028, and became used by one in ten people on the Internet by 2030. Now, we’re trying to shut it down, but it’s impossible – the site is too easy to build, uses too much widely available technology, and, perhaps sadly, it seems people want to use it. People want to get served up a life on a website and then they want to live that life and no matter what we do, people keep building this.

How did we get here? Back in the early 2020s, AI-generated synthetic content was the new thing. There were all kinds of websites that let people see the best fakes that people could come up with – This Person Does Not Exist, This Pony Does Not Exist, This Cat Does Not Exist, This Rental Does Not Exist, and more. 

The ‘This X Does Not Exist’ sites proliferated, integrating new AI technologies as they came out. “Multimodal” networks meant the sites started to jointly generate text and imagery. The sites got easier to use, as well. You started being able to ‘prime’ them with photos, or bits of text, or illustrations, and the sites would generate synthetic things similar to what you gave them.

The real question is why no one anticipated thislifedoesnotexist. We could see all the priors: synthetic image sites warped into synthetic video sites. Text sites went from sentences, to paragraphs, to pages. Things got interlinked – videos got generated according to text, or vice versa. And all the time people were feeding these sites – pouring dreams and images and desires and everything else into them.

Of course, the sites eventually shared some of their datasets. Some of them monetized, but most of them were hobbyist projects. And these datasets, containing so many priors given to the sites from the humans that interacted with them, became the training material for subsequent sites.

We don’t know who created the original thislifedoesnotexist, though we know some of the chatrooms they used, and some of the (synthetically-generated) identities they used to promote it.
And we know that it was successful.
I mean, why wouldn’t it be?

The latest version of thislifedoesnotexist works like the first version, but better.
You go there and it asks you some questions about your life. Who are you? Where are you? What do you earn? How do you live? What do you like to do? What do you dislike?
And then it asks you what you’d like to earn? Where you’d like to live? Who you’d prefer to date?
And once you tell it these things, it spits out a life for you. Specifically, a sequence of actions you might take to turn your life from your current one, to the one you desire.

Of course, people used it. They used it to help them make career transitions. Sometimes it advised them to buy certain cryptocurrencies (and some of the people who listened to it got rich, while others got poor). Eventually, people started getting married via the site, as it recommended different hookup apps for them to use, and things to say to people on them (and it was giving advice to those other people in turn – the first marriages subsequently found to be thislifedoesnotexist-mediated occurred in 2029).

Now, we play whack-a-mole with these sites. But the datasets that are being used to train them are openly circulating on the darkweb. And the more we look, the clearer it is that:
– Someone, or some entity, is providing the computers used to run the AI systems on 90% of the thislifedoesnotexist sites.
– These sites are changing not only the people that use them, but the people that these people interact with. When someone changes jobs it creates a ripple. Same with a marriage. Same with school. We can’t see the pattern in these ripples, yet, but there are a lot of them.

Things that inspired this story: Generative models; synthetic content; watching the world fill up with synthetic text and synthetic images and thinking about the long-term consequences; wondering how culture is going to get altered by AI systems being deployed into human society.