Import AI 229: Apple builds a Hypersim dataset; ways to attack ML; Google censors its research

by Jack Clark

Apple builds Hypersim, a dataset to help it understand your house:
…High-resolution synthetic scenes = fuel for machine learning algorithms…
Apple has built Hypersim, a dataset of high-resolution synthetic scenes with per-pixel labels. Hypersim consists of 77,400 images spread across 461 distinct indoor scenes; Apple bought the synthetic scenes from artists, then built a rendering pipeline to help it generate lots of detailed, thoroughly labeled images of the different scenes, including per-pixel data to help with tasks like segmentation.
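To make the shape of the data concrete, here's a minimal sketch of what a per-pixel-labelled sample from a Hypersim-style dataset might look like; the field names, shapes, and stand-in loader are illustrative assumptions on my part, not Apple's actual data format or tooling.

```python
from dataclasses import dataclass
import numpy as np

# Illustrative sketch only: the field names, shapes, and label conventions
# below are assumptions about what a per-pixel-labelled synthetic sample
# could contain; they are not Apple's actual Hypersim format or tooling.

@dataclass
class SyntheticSample:
    rgb: np.ndarray       # (H, W, 3) rendered color image
    semantic: np.ndarray  # (H, W) integer class id per pixel (segmentation)
    depth: np.ndarray     # (H, W) distance from camera (geometric supervision)
    scene_id: str         # which of the 461 indoor scenes the render came from

def fake_sample(h: int = 768, w: int = 1024) -> SyntheticSample:
    """Stand-in generator so the sketch runs without the real dataset."""
    rng = np.random.default_rng(0)
    return SyntheticSample(
        rgb=rng.random((h, w, 3), dtype=np.float32),
        semantic=rng.integers(0, 40, size=(h, w)),
        depth=rng.random((h, w), dtype=np.float32) * 10.0,
        scene_id="scene_0000",
    )

sample = fake_sample()
print(sample.rgb.shape, sample.semantic.shape, sample.depth.shape)
```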

How much does a dataset like this cost? The authors put the cost of this dataset in perspective by comparing it to the cost to train Megatron-LM, an 8 billion parameter model from NVIDIA.
Hypersim dataset: $57k total – $6k for purchasing the scenes, and $51k to render the images, using 231 vCPU years (2.4 years of wall-clock time on a large compute node).
Megatron-LM: $103k using publicly available servers.
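A quick back-of-the-envelope check of those figures (the derived per-unit costs are my own arithmetic on the numbers above, not figures from the paper):

```python
# Back-of-the-envelope arithmetic using only the figures quoted above; the
# derived per-unit costs are my own calculation, not numbers from the paper.

scene_cost = 6_000      # USD to purchase the 461 synthetic scenes
render_cost = 51_000    # USD to render the images
num_images = 77_400
vcpu_years = 231

total_cost = scene_cost + render_cost              # $57k, as stated
vcpu_hours = vcpu_years * 365 * 24                 # ~2.0M vCPU-hours
cost_per_vcpu_hour = render_cost / vcpu_hours      # ~$0.025 per vCPU-hour
cost_per_image = total_cost / num_images           # ~$0.74 per image, all-in

print(f"total: ${total_cost:,}")
print(f"render cost per vCPU-hour: ${cost_per_vcpu_hour:.3f}")
print(f"all-in cost per image: ${cost_per_image:.2f}")
```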

Why this is useful: Datasets like this “could enable progress on a wide range of computer vision problems where obtaining real-world ground truth is difficult or impossible,” Apple writes. “In particular, our dataset is well-suited for geometric learning problems that require 3D supervision, multi-task learning problems, and inverse rendering problems”.
Read more: Hypersim: A Photorealistic Synthetic Dataset for Holistic Indoor Scene Understanding (arXiv).
Get the code to generate the dataset: ML Hypersim Dataset (Apple, GitHub).
Via David Ha (Twitter).

###################################################

MIRI’s had some negative research results (and that’s okay):
…AI safety group gives research update…
MIRI, an AI safety research organization, has spent a few years working on some research that hasn’t worked well, according to the organization. In a 2020 update post, the group said “2020 saw limited progress in the research MIRI’s leadership had previously been most excited about”. As a consequence, “MIRI’s research leadership is shifting much of their focus towards searching for more promising paths”. MIRI projects to have spent around $7 million in 2020, and estimates spending around $7 million again in 2021.

Why this matters: MIRI decided in 2018 that its future research results would be “nondisclosed-by-default” (Import AI 122). That’s a decision that inspired some strong feelings among advocates for open publication, but I think it’s a credit to the organization to update the world that some of these opaque research projects haven’t panned out. A signal is better than no signal at all, and I’m excited to see MIRI continue to experiment with different forms of high-impact research disclosure (and non-disclosure). Plus, we should always celebrate organizations owning their own ‘negative results’ – and now that MIRI thinks these approaches won’t work, perhaps it could publish them and save other researchers the trouble of replicating blind-alley projects.
    Read more: 2020 Updates and Strategy (MIRI blog).

###################################################

Google’s PR, policy, and legal teams censor its research:
…Suspicious about the oh-so-positive narratives in corporate papers? You should be!…
Google’s PR, policy, and legal teams have been editing AI research papers to give them a more positive slant, reduce focus on Google’s products, and generally minimize discussion of the potential drawbacks of technology, according to reporting from Reuters.

The news of the censorship operation follows Google’s firing of Timnit Gebru, after Google staff wanted to step in to heavily alter a research paper discussing some of the issues inherent to large language models like BERT, GPT-3, and so on, and/or remove its Google-affiliated authors. Now, according to Reuters, it seems Google has been censoring many papers for many months.

What censorship looks like: “The Google paper for which authors were told to strike a positive tone discusses recommendation AI, which services like YouTube employ to personalize users’ content feeds. A draft reviewed by Reuters included “concerns” that this technology can promote “disinformation, discriminatory or otherwise unfair results” and “insufficient diversity of content,” as well as lead to “political polarization,”” Reuters writes. “The final publication instead says the systems can promote “accurate information, fairness, and diversity of content.” The published version, entitled “What are you optimizing for? Aligning Recommender Systems with Human Values,” omitted credit to Google researchers. Reuters could not determine why.”

Why this matters: People aren’t stupid. Let me repeat that: PEOPLE AREN’T STUPID. Most corporations seem to think AI is some kind of impossibly obscure technology that normies don’t deserve to know about, so they feel like they can censor research for their own gain. But, as I have said, PEOPLE ARE NOT STUPID. People use AI systems every day – so people know AI systems have problems. This kind of attitude from Google is absurd, patronizing, and ultimately corrosive to civilisation-level scientific progress. I spoke about issues relating to this in December 2018 in a podcast with Azeem Azhar, where I compared this approach to science to how Christian priests in the dark ages kept knowledge inside monasteries, thinking it too dangerous for the peasants. (Things didn’t work out super well for the priests.) It’s also just a huge waste of the time of the researchers being censored by their corporation. Don’t waste people’s time! We all only have a finite amount of it.
 Read more: Google told its scientists to ‘strike a positive tone’ in AI research – documents (Reuters).

###################################################

How can I mess up your ML model? Let me count the ways:
…Feature Collisions! Label Poisoning! Influence Functions! And more…
How do people attack the datasets used to train machine learning models, what can these attacks do, and how can we defend against them? That’s the subject of a survey paper from researchers with the University of Maryland, MIT, the University of Illinois Urbana-Champaign, and the University of California, Berkeley.

Attacking datasets: The paper summarizes the range of techniques people might use to attack datasets, giving a guided tour of horrors like poisoning the training data to cause misclassifications, or perturbing inputs to already-trained models (for instance, feeding them an input they can’t classify, or one that triggers pathological behavior).
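As a toy illustration of the simplest attack in this family – label-flipping poisoning – here’s a minimal sketch (plain scikit-learn, not code from the survey) showing how flipping a growing fraction of training labels degrades clean test accuracy:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy label-flipping poisoning: flip a fraction of training labels and
# measure the drop in clean test accuracy. Purely illustrative; the attacks
# surveyed in the paper (feature collisions, backdoor triggers, influence
# functions) are far more targeted and stealthy than uniform label flipping.

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def accuracy_with_poisoning(flip_fraction: float) -> float:
    y_poisoned = y_train.copy()
    n_flip = int(flip_fraction * len(y_poisoned))
    idx = rng.choice(len(y_poisoned), size=n_flip, replace=False)
    y_poisoned[idx] = 1 - y_poisoned[idx]  # flip the binary labels
    model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
    return model.score(X_test, y_test)

for frac in (0.0, 0.1, 0.3, 0.45):
    print(f"{frac:.0%} labels flipped -> test accuracy {accuracy_with_poisoning(frac):.3f}")
```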

Defending against attacks: Fear not! There are some ways to defend or mitigate these attacks, including federated learning, the use of privacy preserving machine learning approaches like differential privacy, and learning to detect adversarial triggers, among others.
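To make one of these defenses concrete, here’s a minimal sketch of the differential-privacy idea – clip each training example’s gradient and add noise before applying the update – written as plain-NumPy logistic regression. It’s illustrative only (no privacy accounting, hand-picked constants), not a method taken from the survey.

```python
import numpy as np

# DP-SGD-style sketch: clip each example's gradient to a fixed L2 norm,
# then add Gaussian noise to the summed gradient before updating. This
# illustrates the differential-privacy mechanics mentioned above; it is
# not a production DP implementation (no privacy accountant here).

rng = np.random.default_rng(0)
n, d = 2000, 10
X = rng.normal(size=(n, d))
true_w = rng.normal(size=d)
y = (X @ true_w + 0.1 * rng.normal(size=n) > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(d)
clip_norm, noise_std, lr, batch = 1.0, 1.0, 0.1, 256

for step in range(200):
    idx = rng.choice(n, size=batch, replace=False)
    preds = sigmoid(X[idx] @ w)
    per_example_grads = (preds - y[idx])[:, None] * X[idx]        # (batch, d)
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)
    noisy_sum = clipped.sum(axis=0) + rng.normal(scale=noise_std * clip_norm, size=d)
    w -= lr * noisy_sum / batch

accuracy = ((sigmoid(X @ w) > 0.5) == y).mean()
print(f"train accuracy under clipped, noised updates: {accuracy:.3f}")
```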

Why this matters: AI systems are so complicated that their capability surface, especially for recent large-scale models, is vast and hard to characterize. This is basically catnip for security-minded people who want to mess with these systems – a vast, somewhat uncharacterized territory is the perfect place to unleash some mischief. But if we don’t figure out how to secure these models, it’ll be much harder to deploy them broadly into the world.
Read more: Data Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses (arXiv).

###################################################
Tech Tales:

Plato, give me your favorite recipe
[California, 2040. Simulated ancient Greece.]

Plato was talking to a bunch of Greeks. He was explaining some theories he had about ideas and where they came from. Jacob stood in the distance, silent, recording the conversation. Then his earpiece buzzed. “Jacob, we’ve got to go. World 6 just came online.”
  “Give me a few more minutes,” he said. “He’s saying some pretty interesting stuff.”
  “And there’ll be another Plato in World 6. C’mon man, we don’t have time for this.”
  “Fine,” Jacob said. “But we’re keeping the recording.”
  The simulated Greeks didn’t notice as Jacob flickered and disappeared. The simulated Plato may have turned their head and looked at the patch of space where Jacob had stood.

“What’s the rush,” Jacob said, pulling his headset off. “We’re under budget.”
“We got a high priority job for some ancient recipes. Eight permutations.”
“We can simulate anything and it’s recipes that make the money,” Jacob said. “People just don’t know what’s worth anything.”
“Yeah, sure. Let’s complain about what pays our salaries. Now put your headset on and get back in there.”
“Okay,” Jacob said.

He spent a few hours in World 6 looking for variations on ancient Greek cooking. The sim showed him some variations on stuffed vine leaves that seemed promising, as well as a non-standard mead. Jacob still managed to find Plato and, while looking at some of the seeds being ground to flour by some nearby slaves, took notes about what Plato said. In World 6, Plato was fascinated by color theory, and was holding up gems and explaining what caused the light to take on color after passing through them.
  “Time’s up,” someone said in Jacob’s earpiece. “World 7 is spinning up and we need to scrap some of 6 and 5 to make room.”
  “Which parts,” Jacob said, standing underneath a tree staring at Plato.
  “Most of Greece. We’re going to finetune on a new dataset. We hired some historians and they got us some better food information. I’ve got a good feeling about this one!”
  “I can’t wait,” Jacob said, staring at simulated Plato.

Things that inspired this story: The surprising things that make money and the surprising things that don’t; simulations; history moving from a set of iterative narratives to a continuous spectrum of simulations that can be explored and tested and backtested; Indiana Jones as a software explorer rather than real explorer; some odd dreams I had on the night of Christmas, due to eating a heroic amount of cheese.