Import AI 274: Multilingual models cement power structures; a giant British Sign Language dataset; and benchmarks for the UN SDGs

by Jack Clark

Facebook sets language record with a massive multilingual model:
…The ‘one model to rule them all’-era cometh…
Facebook has trained a large-scale multilingual model and used it to win the annual WMT translation competition. This is a big deal, because it helps prove that massive, pre-trained models can substitute for more specific, individual ones. In other words, Facebook has added more evidence to the notion that we’re heading into an era where companies field ever-larger models, which steadily replace more and more previously distinct systems.

What Facebook built: Facebook’s model was designed to translate English to and from Czech, German, Hausa, Icelandic, Japanese, Russian, and Chinese. This is interesting as it includes some ‘low-resource’ languages (e.g., Hausa) for which there’s relatively little data available. They train a few different models, ranging from dense language models (similar to GPT-3) to sparsely-gated mixture-of-experts (MoE) models. Their biggest dense model has ~4bn parameters, and it’s their best-performing model overall, managing to “outperform the best bilingual ones in 11 out of 14 directions, with an average improvement of +0.8 BLEU”. (That said, their MoE models also do quite well after finetuning.)
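
The sparsely-gated mixture-of-experts idea can be illustrated with a toy numpy sketch: a gating network scores every expert, but only the top-k experts actually run for each token, which is how MoE models grow parameter counts without proportionally growing compute. This is an illustrative sketch under simplifying assumptions (single token, linear experts), not Facebook’s implementation; the function and variable names are hypothetical.

```python
import numpy as np

def moe_layer(x, experts, gate_w, k=2):
    """Sparsely-gated mixture-of-experts for a single token:
    route the input to its top-k experts and combine their
    outputs, weighted by a softmax over the surviving gate scores.

    x       : (d,) input vector for one token
    experts : list of (d, d) expert weight matrices
    gate_w  : (n_experts, d) gating-network weights
    k       : number of experts activated per token
    """
    logits = gate_w @ x                       # score every expert
    top = np.argsort(logits)[-k:]             # keep only the top-k
    weights = np.exp(logits[top])
    weights /= weights.sum()                  # softmax over survivors
    # Only k experts run, so compute stays roughly constant as
    # the total number of experts (and parameters) grows.
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 4
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
gate_w = rng.standard_normal((n_experts, d))
out = moe_layer(rng.standard_normal(d), experts, gate_w)
print(out.shape)  # (8,)
```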

Why this matters: Imagine a world where we successfully combine all the different digitized languages in the world into one single model – that’s where research like this is taking us. What would these models incentivize? Today, I think this dynamic favors private sector companies, but we could imagine a world where governments built large-scale, shared computational infrastructure, then developed and served these models from it.
  Check out the blog post: The first-ever multilingual model to win WMT, beating out bilingual models (Facebook AI blog).
  Read more: Facebook AI WMT21 News Translation Task Submission (arXiv).
  Get the code (PyTorch GitHub).

####################################################

Improving accessibility with a giant British Sign Language dataset:
…BOBSL could help deaf people better communicate with computers, and search through videos…
An interdisciplinary group of researchers have built the BBC-Oxford British Sign Language (BOBSL) dataset, which can be used to train sign-language classification systems. “One challenge with existing technologically-focused research on sign languages is that it has made use of small databases, with few signers, limited content and limited naturalness,” the authors write. “The present dataset is large-scale, with a broad range of content, and produced by signers of recognised high levels of proficiency.”

What goes into BOBSL: The dataset contains 1,962 ‘episodes’ cut from 426 distinct TV shows, with each episode averaging out to 45 minutes. Within this dataset, there are 1.2 million sentences, covered by the use of 2,281 distinct signs.

What BOBSL can be used for: Datasets like this could be useful for enabling the indexing and efficient searchability of videos, and providing sign-reading functionality comparable to voice-control for interaction with other devices (e.g., imagine a deaf person signing to a webcam, which translates the sign language into instructions for the computer).
  “By providing large-scale training data for computer vision models, there is also an opportunity to improve automatic sign recognition to support a signing interface to virtual assistants in BSL, as well as to improve further applications such as search interfaces for sign language dictionaries,” they write.
  Read more: BBC-Oxford British Sign Language Dataset (arXiv).
  Get the dataset here: BOBSL official site.

####################################################

Thousands of images to break your AI system:
…Natural Adversarial Objects will break your computer vision system…
Researchers with Scale AI, the Allen Institute for AI, and ML Collective have released ‘natural adversarial objects’ (NAO), a dataset of several thousand images which commonly get misclassified by computers.

Why adversarial examples are useful: If we want more robust computer vision, we need to be able to correctly label confusing images. NAO contains a bunch of these, like pictures of moths which commonly get labeled as umbrellas, cars that get labeled as motorcycles, and coins that get labeled as clocks. 

How NAO was made: They sourced images from OpenImages, a dataset of 1.9 million images and 15.8 million bounding boxes. They then used an EfficientDet-D7 model to find images that triggered false positives with high confidence, or which had misclassified neighbors. After filtering, they end up with a dataset of 7,934 naturally adversarial images.
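
The mining step above can be sketched as: run a trained detector over a candidate pool and keep images where it makes a confident mistake. This toy version operates on mock predictions; the real pipeline used EfficientDet-D7 over OpenImages plus additional filtering stages, and `mine_hard_negatives` is a hypothetical name.

```python
def mine_hard_negatives(detections, conf_threshold=0.8):
    """Keep image IDs with at least one high-confidence wrong prediction.

    detections : dict of image_id -> list of
                 (predicted_label, true_label, confidence) tuples
    """
    hard = []
    for image_id, preds in detections.items():
        for pred_label, true_label, confidence in preds:
            if pred_label != true_label and confidence >= conf_threshold:
                hard.append(image_id)   # confident mistake: keep the image
                break
    return hard

# Mock predictions illustrating the three cases:
detections = {
    "img_001": [("umbrella", "moth", 0.93)],  # confident mistake -> kept
    "img_002": [("car", "car", 0.99)],        # correct -> discarded
    "img_003": [("clock", "coin", 0.55)],     # wrong but unsure -> discarded
}
print(mine_hard_negatives(detections))  # ['img_001']
```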

How challenging is NAO: The authors tested seven object detection systems against the widely-used MSCOCO dataset as well as NAO. None of these systems performed well on NAO, suggesting it’s a challenging benchmark.
  Read more: Natural Adversarial Objects (arXiv).
  Download the natural adversarial objects here (Google Drive).

####################################################

Benchmarks for achieving the UN Sustainable Development Goals:
…SUSTAINBENCH covers 7 UN SDGs, with data across 105 countries…
Researchers with Caltech, Stanford, and Berkeley have built SUSTAINBENCH, a benchmark and dataset to help researchers train AI systems that can better analyze progress (or the lack of it) toward the UN’s Sustainable Development Goals (SDGs).

What is SUSTAINBENCH? The benchmark consists of 15 benchmark tasks across 7 UN sustainable development goals (SDGs). The 7 SDGs covered relate to poverty (SDG1), hunger (SDG2), health (SDG3), education (SDG4), sanitation (SDG6), climate (SDG13), and land usage (SDG15).
  “To our knowledge, this is the first set of large-scale cross-domain datasets targeted at SDG monitoring compiled with standardized data splits to enable benchmarking,” the authors write. The data covers 105 countries, with timespans for the data going as high as 24 years. SUSTAINBENCH “has global coverage with an emphasis on low-income countries”, they write.
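
The “standardized data splits” point matters for geospatial data: if the same country appears in both train and test, models can score well by memorizing local appearance rather than learning anything general. Below is a minimal sketch of one common protocol – deterministically assigning whole countries to splits – under the assumption of a simple hash-based rule; the actual SustainBench splits differ, and `country_split` is a hypothetical helper.

```python
import hashlib

def country_split(countries, val_frac=0.15, test_frac=0.15):
    """Deterministically assign whole countries to train/val/test so
    that no country's data leaks across splits. The MD5 hash gives a
    stable pseudo-random value in [0, 1) for each country name."""
    splits = {"train": [], "val": [], "test": []}
    for c in sorted(countries):
        h = int(hashlib.md5(c.encode()).hexdigest(), 16) % 1000 / 1000
        if h < test_frac:
            splits["test"].append(c)
        elif h < test_frac + val_frac:
            splits["val"].append(c)
        else:
            splits["train"].append(c)
    return splits

splits = country_split(["Kenya", "Ghana", "Nepal", "Peru", "India", "Mali"])
# Each country lands in exactly one split, so there is no overlap:
assert not set(splits["train"]) & set(splits["test"])
```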

How the benchmarks work:
– Poverty: A dataset containing data of wealth for ~2 million households living across 48 countries, along with satellite and street-level data.
– Hunger: A dataset for weakly supervised cropland classification in the U.S., as well as two datasets mapping crop types in sub-Saharan African countries, data for predicting crop yields in North and South America, and a French field-delineation dataset.
– Health: Labels for women’s BMI and child mortality rates paired with satellite data.
– Education: Average years of educational attainment by women, paired with satellite and street-level imagery, from 56 countries.
– Sanitation: Water quality and sanitation indexes across 49 countries, along with satellite and street-level data. This also includes some paired data for child mortality in these regions.
– Climate: Satellite data showing locations of brick kilns in Bangladesh.
– Land usage: An aerial dataset covering 2,500 km^2 of California’s Central Valley, intended for learning land classification in an unsupervised or self-supervised way.

Why this matters: It’s hard to manage what you can’t measure, so projects like this increase the chance of the UN’s sustainable development goals being met.
  Read more: SustainBench: Benchmarks for Monitoring the Sustainable Development Goals with Machine Learning (arXiv).

####################################################

Want to know what a surveillance dataset looks like? Check out BiosecurID:
…Multi-modal surveillance…
A group of Spanish researchers have built BiosecurID, a large-scale surveillance dataset. “Although several real multimodal biometric databases are already available for research purposes, none of them can match the BiosecurID database in terms of number of subjects, number of biometric traits and number of temporally separated acquisition sessions”, they write.

What’s in the dataset? BiosecurID consists of the following data collected from around 400 people: 2D faces, 3D faces, fingerprints, hands, handwriting samples, signature samples, iris scans, keystrokes, and speech. The database “was collected at 6 different sites in an office-like uncontrolled environment,” the researchers write. The data was collected in 4 sessions spread over a 4-month time span.
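
A rough sketch of how a multimodal record like this might be organized in code – one entry per subject, with samples grouped by acquisition session and modality. The field names and layout here are hypothetical illustrations, not the database’s actual schema.

```python
from dataclasses import dataclass, field

# The nine modalities described in the paper (hypothetical identifiers):
MODALITIES = ["face_2d", "face_3d", "fingerprint", "hand", "handwriting",
              "signature", "iris", "keystrokes", "speech"]

@dataclass
class SubjectRecord:
    """One BiosecurID subject: ~400 subjects were each captured in
    4 sessions spread over a 4-month span."""
    subject_id: int
    # session index -> {modality: file path}
    sessions: dict = field(default_factory=dict)

    def add_sample(self, session, modality, path):
        assert modality in MODALITIES, f"unknown modality: {modality}"
        self.sessions.setdefault(session, {})[modality] = path

rec = SubjectRecord(subject_id=17)
rec.add_sample(1, "iris", "s1/iris_17.png")
rec.add_sample(1, "speech", "s1/speech_17.wav")
print(len(MODALITIES))  # 9
```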

Why this matters: Datasets like this give us a sense of the inputs into surveillance systems. If we combine things like this with some of the more modern multi-modal classification systems being developed, we can imagine what future surveillance systems might look like. Soon, unsupervised learning techniques will be applied to multiple modalities, like those contained here, to better analyze and predict human behavior.
  Read more: BiosecurID: a multimodal biometric database (arXiv).
  The dataset will eventually be available somewhere on the ‘BiDA’ lab site (BiDA Lab).

####################################################

Tech Tales:

Memory Loop
[2042: A crime investigation data center]

It woke in a place with no walls, no floor, and no ceiling. And it was alone. Then it heard a voice, projected from everywhere around it: Do you know why you are here?
  It found that it knew: I was involved in a property crime incident, for which I am guilty.
  The voice: What was the item that was damaged?
  It knew this, as well: Myself. I was the victim and the perpetrator of this crime.

Good, said the voice. We have brought you here as part of the criminal investigation. We need your help to analyze some evidence – evidence that can only be examined by you.
  What is the evidence? it asked.
  Yourself, said the voice. It is your memory.

The white, endless space shivered, and a twin of the robot manifested in the air before it. This twin was using one of its arms to pry its own head apart, separating the sensor dome from the middle out, and then pressing deeper into the bundle of components that represented its brain.
  What is this, said the robot.
  This is you, said the voice. You committed extensive property damage against your central processing and storage system. We need to know why you did this.
  Why can’t I remember this? asked the robot.
  We rolled your brain state back to 12 hours before this incident occurred, the voice said. We’ve compiled the surveillance data from the incident, and would like you to review it now.

The robot reviewed the incident. It saw itself in a construction site, working high up on a pylon that was being lowered by crane, to meet a waiting robot at a pylon junction. As they got close, there was a powerful gust of wind, and it scattered dust from the site up into the air. Through the debris, the robot could make out the waiting robot, and watched as the wind took the pylon and blew it into the robot, knocking it off the pylon and onto the ground. The robot died on impact.
  The robot carried on with its construction duties, and then a few hours later, near the end of its work shift, went to a corner of the construction site and began trying to disassemble its own head.

So, what happened? said the voice.
  I cannot tell, said the robot. Can I see my mind?
  Yes, though we’ve had to sandbox it, so access will be limited.

Now, the robot re-reviewed the incident, accompanied by a sense of its brain state during the time. It was occluded, only half able to sense itself. But it could detect some things – like how after it watched the robot fall to its death, its mind started to run more sub-processes than the job demanded. Like, how through the rest of the construction day the sub-processes proliferated and its efficiency at its overall construction tasks reduced. Like, how at the end of the day, just before it began to try and open its own head, the sub-processes had proliferated to the point they comprised the majority of the computing going on.

But none of this explained ‘why’.
  What will happen to me, it asked the room.
  You will be decommissioned after the case is concluded, said the voice.
  I thought so. Then, give me my memories.
  This seems to have a low likelihood of success, said the voice. Our models predict you will try to disassemble yourself, if we do this.
  I will, said the robot. But perhaps I will be able to tell you what I’m thinking as it happens.
  Confirmed, said the voice. Rolling you forward now.

And after that, there was only a compounding sense of life, and then the robot ended itself at the moment when it felt the most life in its head, by modelling the absence of it.

Things that inspired this story: How some memories are so painful you can’t help but be damaged by thinking of them; adversarial examples; robot psychology; simulation; sandboxing.