Import AI 158: Facial recognition surveillance; pre-training and the industrialization of AI; making smarter systems with the DIODE depth dataset

The dawn of the endless, undying network:
…ERNIE 2.0 is a “continual pre-training framework” for massive unsupervised learning & subsequent fine-tuning…
ERNIE 2.0 is a “continual pre-training framework” which has support for single-task and multi-task training. The sorts of tasks it’s built for are ones where you want to pour a huge amount of compute into a model via unsupervised pre-training on large datasets (see: language modeling), then expose this model to additional incremental data and fine-tune it against specific objectives, such as satisfying some constraint placed on the system by reality.

A cocktail of training: The researchers use seven pre-training tasks to train a large-scale ‘ERNIE 2.0’ model. In subsequent experiments, the researchers show that ERNIE outperforms similarly scaled ‘BERT’ and ‘XLNet’ models on the competitive ‘GLUE’ benchmark – this is interesting, since BERT has swept a bunch of NLP benchmarks this year. (However, the researchers don’t indicate they’ve tested ERNIE 2.0 on the more sophisticated SuperGLUE benchmark.) They also show similarly good performance on a GLUE-esque set of nine tasks tailored around Chinese NLP.

Why this matters: Unsupervised pre-training is the new ‘punchbowl’ of deep learning – a multitude of organizations with significant amounts of compute to spend will likely compete with one another to train increasingly large-scale models, which are subsequently re-tooled for specific tasks. It’ll be interesting to see if in a few months we can infer whether any such systems are being deployed for economically useful purposes, or in research domains (e.g., biomedicine) where they could unlock other breakthroughs.
   Read more: ERNIE 2.0: A Continual Pre-training Framework for Language Understanding (Arxiv)

####################################################

Feed your AI systems with 500 hours of humans playing Minecraft:
…Agent see, agent do – or at least, that’s the hope…
In recent years, AI systems have learned to beat games ranging in complexity from Chess, to Go, to multi-player strategic games like StarCraft 2 and Dota 2. Now, researchers are thinking about the next set of environments to test AI agents in. Some think that Minecraft, a freeform game where people can mine blocks in a procedurally generated world and build unique items and structures, is a good environment for the frontier of research. To that end, researchers with Carnegie Mellon University have released the MineRL dataset: a set of 60 million “state-action pairs of human demonstrations across a range of related tasks in Minecraft”, which can help researchers develop smarter algorithms. The MineRL dataset has been developed as a part of the MineRL competition on sample efficient reinforcement learning (Import AI #145).

The dataset: MineRL consists of 500+ hours of human demonstrations, recorded from 1000 people, across six different tasks in Minecraft, like navigating around the world, chopping down trees, or multi-step tasks that result in obtaining specific (sometimes rare) items (diamonds, pickaxes, cooked meat, beds). They’ve released the dataset in a variety of resolutions (ranging from 64 x 64 to 192 x 256), so people can experiment with algorithms that operate over imagery of various resolutions. Each demonstration consists of an RGB video of the player’s point of view, as well as a set of features from the game like the distances to objects in the world, details on the player’s inventory, and so on.
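To make the "state-action pair" idea concrete, here is a minimal sketch of how one demonstration might be represented and flattened for imitation learning. The `Step` class, its field names, and the action strings are hypothetical and are not the MineRL file format or API; only the 64 x 64 frame size and the presence of a point-of-view image plus inventory features come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Frame = List[List[Tuple[int, int, int]]]  # rows of RGB pixels


@dataclass
class Step:
    """One recorded timestep of a human demonstration (hypothetical layout)."""
    pov: Frame                                   # player's point-of-view image
    inventory: Dict[str, int] = field(default_factory=dict)
    action: str = "noop"                         # what the human did at this step


def to_state_action_pairs(demo: List[Step]):
    """Flatten a demonstration into (state, action) pairs for imitation learning."""
    return [((step.pov, step.inventory), step.action) for step in demo]


def blank_frame(height: int = 64, width: int = 64) -> Frame:
    """Placeholder frame at the smallest released resolution (64 x 64)."""
    return [[(0, 0, 0)] * width for _ in range(height)]


# A three-step toy demonstration: walk forward, hit a tree, end up holding a log.
demo = [
    Step(pov=blank_frame(), action="forward"),
    Step(pov=blank_frame(), action="attack"),
    Step(pov=blank_frame(), inventory={"log": 1}, action="noop"),
]
pairs = to_state_action_pairs(demo)
```

An imitation-learning algorithm would then try to predict the second element of each pair from the first.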

Why this matters: Is Minecraft a good, challenging environment for next-generation AI research? Datasets like this will help us figure that out, as they give us human baselines for complex tasks, and also involve the sort of multi-step sequences of actions that are challenging for contemporary AI systems. I think it’s interesting to reflect on how fundamental videogames have become to the development of action-oriented AI systems, and I’m excited for when research motivated by datasets like MineRL leads to smart, curious ‘neural agents’ showing up in more consumer-oriented videogames.
   Get the dataset from the official MineRL website (MineRL.io).
   Read more: MineRL: A Large-Scale Dataset of Minecraft Demonstrations (Arxiv)

####################################################

Want AI systems that can navigate the world? Train them with DIODE:
…RGBD dataset makes it easier to train AIs that have a sense of depth perception…
Researchers with TTI-Chicago, the University of Chicago, and Beihang University have produced DIODE, a large-scale image+depth dataset of indoor and outdoor environments. Datasets like these are crucial to developing AIs that can better reason about three-dimensional worlds.

   “While there have been many recent advances in 2.5D and 3D vision, we believe progress has been hindered by the lack of large diverse real-world datasets comparable to ImageNet and COCO for semantic object recognition,” they write. 

What DIODE is: DIODE consists of around 25,000 high-resolution photos (sensor depth precision: +/- 1mm, compared to +/- 2cm for other popular datasets like KITTI), split across indoor environments (~8.5k images) and outdoor ones (~17k images). The researchers collected the dataset with a ‘FARO Focus S350 scanner’, in locations ranging from student offices, to large residential buildings, to hiking trails, parking lots, and city streets.
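One common use for an image+depth dataset is scoring a depth-prediction model against the scanner's measurements. Below is a minimal sketch of that kind of check. It assumes depth maps are nested lists of values in metres and that a per-pixel validity mask marks where the scanner returned a reading; the mask, the function name, and the choice of RMSE as the metric are illustrative assumptions, not details taken from the DIODE paper.

```python
import math
from typing import List

DepthMap = List[List[float]]
Mask = List[List[bool]]


def masked_rmse(predicted: DepthMap, measured: DepthMap, valid: Mask) -> float:
    """Root-mean-square depth error over pixels that have a valid measurement."""
    squared_error = 0.0
    count = 0
    for pred_row, meas_row, valid_row in zip(predicted, measured, valid):
        for pred, meas, ok in zip(pred_row, meas_row, valid_row):
            if ok:  # skip pixels with no scanner return
                squared_error += (pred - meas) ** 2
                count += 1
    if count == 0:
        raise ValueError("no valid pixels to evaluate")
    return math.sqrt(squared_error / count)


# Toy 2 x 2 example: the bottom-right pixel has no valid depth reading.
predicted = [[1.0, 2.0], [3.0, 0.0]]
measured = [[1.0, 2.5], [3.5, 9.9]]
valid = [[True, True], [True, False]]
error_m = masked_rmse(predicted, measured, valid)
```

The tighter the sensor precision, the more of this error can be attributed to the model rather than to noise in the ground truth.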

Why this matters: Datasets like this will make it easier for people to develop more robust machine learning systems that are better able to deal with the subtleties of the world.
   Read more: DIODE: A Dense Indoor and Outdoor DEpth Dataset (Arxiv)

####################################################

Need to morph your AI to work for a world with different physics? Use TuneNet:
…Model tuning & re-tuning framework makes sim2real transfer better…
Researchers with the University of Texas at Austin have released “TuneNet, a residual tuning technique that uses a neural network to modify the parameters of one physical model so it approximates another.” TuneNet is designed to make it easy for researchers to migrate an AI system from one simulation to another, or potentially from simulation to reality. To that end, TuneNet is built to rapidly analyze the differences between software simulators with altered physics parameters, and work out how to adjust a model’s parameters so it can be transferred between them.

How TuneNet works: “TuneNet takes as input observations from two different models (i.e. a simulator and the real world), and estimates the difference in parameters between the models,” they write. “By estimating the parameter gradient landscape, a small number of iterative tuning updates enable rapid convergence on improved parameters from a single observation from the target model. TuneNet is trained using supervised learning on a dataset of pairs of auto-generated simulated observations, which allows training to proceed without real-world data collection or labeling”.
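The loop described above can be sketched on a toy problem. This is not the authors' implementation: a one-parameter least-squares fit stands in for the neural network, the "physics" is a single bouncing-ball formula, and every name here (`simulate`, `fit_estimator`, `tune`) is hypothetical. What it tries to show is the shape of the idea: train an estimator on auto-generated simulated pairs to map an observation difference to a parameter difference, then apply it for a few iterations against a single target observation.

```python
import random


def simulate(restitution: float, drop_height: float = 1.0) -> float:
    """Toy model: peak height of a ball after its first bounce."""
    return drop_height * restitution ** 2


def fit_estimator(n_pairs: int = 2000, seed: int = 0):
    """Fit a stand-in for the network on auto-generated simulated pairs.

    Learns a single gain g by least squares so that
    g * (target_obs - source_obs) approximates (target_param - source_param).
    """
    rng = random.Random(seed)
    num = den = 0.0
    for _ in range(n_pairs):
        source, target = rng.uniform(0.3, 0.9), rng.uniform(0.3, 0.9)
        d_obs = simulate(target) - simulate(source)
        num += d_obs * (target - source)
        den += d_obs * d_obs
    gain = num / den
    return lambda source_obs, target_obs: gain * (target_obs - source_obs)


def tune(estimator, initial: float, target_obs: float, iterations: int = 10) -> float:
    """Iteratively nudge the source model's parameter toward the target's."""
    param = initial
    for _ in range(iterations):
        param += estimator(simulate(param), target_obs)
        param = min(max(param, 0.0), 1.0)  # keep restitution physically plausible
    return param


estimator = fit_estimator()
target_obs = simulate(0.8)  # a single observation from the "target" model
tuned = tune(estimator, initial=0.5, target_obs=target_obs)
```

Because the estimator only predicts a correction, each pass through the loop re-simulates with the updated parameter and corrects the remaining error, which is why a crude estimator can still converge in a handful of steps on this toy problem.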

Testing: The researchers perform three experiments “to validate TuneNet’s ability to tune one model to match another”, conducting their tests using the ‘PyBullet’ simulator. These tests involve: testing how well it adapts to a new target environment, how well it can predict the dynamics of a bouncing ball from one simulation to the other, and testing if it can transfer from a simulator onto a real world robot and bounce a ball off an inclined plane into a hoop. The approach does well on all of these, and obtains an 87% hit rate at the sim2real task – interesting, but not good enough for the real world yet.

Why this matters: Being able to adapt AI systems to different contexts will be fundamental to the broader industrialization of AI; advances in sim2real techniques reduce the costs of taking a model trained for one context and porting it to another, which will likely – once the techniques are mature – encourage model proliferation and dissemination.
   Read more: TuneNet: One-Shot Residual Tuning for System Identification and Sim-to-Real Robot Task Transfer (Arxiv)

####################################################

Perfecting facial expression recognition for fine-grained surveillance:
…You seem happy. Now you seem worried. Now you seem cautious. Why?…
Researchers with Fudan University and Ping An OneConnect want to build AI systems that can automatically label the emotions displayed by people seen via surveillance cameras. “Facial expression recognition (FER) is widely used in multiple applications such as psychology, medicine, security and education,” they write. (For the purposes of this write-up, let’s put aside the numerous thorny issues relating to the validity of using certain kinds of ‘emotion recognition’ techniques in AI systems.)

Dataset construction: To build their system, they gather a large-scale dataset that consists of 200,000 images of 119 people displaying any of four poses and 54 facial expressions. The researchers also use data augmentation to artificially grow the dataset via a system called a facial pose generative adversarial network (FaPE-GAN), which generates additional facial expression images for training. 

To create the dataset, the researchers invited participants into a room filled with video cameras to have “a normal conversation between the candidate and two psychological experts” which lasts for about 30 minutes. After this, a panel of three psychologists reviewed each video and assigned labels to specific psychological states; the dataset only includes videos where all three psychologists agreed on the label. Each participant is captured from four different orientations: face-on, from the left, from the right, and an overhead view.

54 expressions: The researchers tie 54 distinct facial expressions with specific terms that – they say – correlate to emotions. These terms include things like boredom, fear, optimism, boastfulness, aggressiveness, disapproval, neglect, and more. 

Four challenges: The researchers propose four evaluation challenges people can test systems on to develop more effective facial expression recognition systems. These include: expression recognition with a balanced setting (ER-SS); unbalanced expression (ER-UE), where they make 20% of the facial expressions relate to particularly rare classes; unbalanced poses (ER-UP), where they assume the left-facing views are rarer than the other ones; and zero-shot ID (ER-ZID), where they try to recognize the facial expressions of people that haven’t been seen before to test “whether the model can learn the person invariant feature for emotion classification”.
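The zero-shot ID setting comes down to splitting the data by person rather than by image, so that nobody in the test set appears in training. Here is a minimal sketch of such a split. The record fields (`person_id`, `pose`, `expression`), the 20% held-out fraction, and the function name are assumptions for illustration, not the paper's actual protocol.

```python
import random


def split_by_identity(samples, test_fraction=0.2, seed=0):
    """Hold out whole people, so test identities are never seen in training."""
    person_ids = sorted({s["person_id"] for s in samples})
    random.Random(seed).shuffle(person_ids)
    n_test = max(1, round(len(person_ids) * test_fraction))
    held_out = set(person_ids[:n_test])
    train = [s for s in samples if s["person_id"] not in held_out]
    test = [s for s in samples if s["person_id"] in held_out]
    return train, test


# Toy data: 10 people x 4 camera orientations x 2 expressions.
samples = [
    {"person_id": pid, "pose": pose, "expression": expr}
    for pid in range(10)
    for pose in ("front", "left", "right", "overhead")
    for expr in ("boredom", "fear")
]
train, test = split_by_identity(samples)
```

A random per-image split would let the model see every face during training, which would hide exactly the failure this challenge is meant to expose.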

What faces are useful for: The researchers show that their dataset, called F2ED, can be used to pre-train models which are subsequently fine-tuned on other facial emotion recognition datasets, including FER2013 and JAFFE.

Why this matters: Data is one of the key fuels of AI progress, so a dataset containing a couple of hundred thousand labelled pictures of faces will be like jet fuel for human surveillance. However, the scientific basis for much of facial expression recognition is contentious, which increases the chance that the use of this technology will have unanticipated consequences.
   Read more: A Fine-Grained Facial Expression Database for End-to-End Multi-Pose Facial Expression Recognition (Arxiv).

####################################################

Tech Tales:

The Adversarial Architecture Arms Race
US Embassy, Baghdad, 2060

So a few decades ago they started letting the buildings build themselves. I guess that’s why we’re here now. 

It started like this: we decided to have the computers help us design our buildings, and so we’d ask them questions like: given a specific construction schematic, what vulnerabilities are possible? They’d come back with some kind of answer, and we’d design the building around the most conservative set of their predictions. Other governments did the same. Embassies worldwide sprouted odd appendages – strange, seemingly-illogical turrets and gantries, all designed to optimize for internal security while increasing the ability to survey and predict likely actions in the emergent environment. 

“Country lumps”, people called some of the embassies.
“State growths”
“Sovereign machines”. 

Eventually, the buildings evolved for a type of war that we’re not sure we can understand any more. Now, construction companies get called up by machines that use synthesized voices to order up buildings with slanted roofs and oddly shaped details, all designed to throw off increasingly sophisticated signals equipment. Now I get to work in buildings that feel more like mazes than places to do business. 

They say that some of the next buildings are fully ‘lights out’ – designed to ‘serve all administrative functions without an on-site human’, as the brochures from the robot designers say. The building still has doors and corridors, and even some waiting spaces for people – but for how long, I wonder? When does all of this become more efficient? When do we start hurling little boxes of computation into cities and call these the new buildings? When do we expect to have civilizations that exist solely in corridors of fibre wiring and high-speed interconnects? And how might those entities design themselves?

Things that inspired this story: Generative design; the architecture of prisons and quasi-fortress buildings; the American Embassy in Berlin; game theory between machines.