Import AI 144: Facial recognition sighted in US airports; Amazon pairs humans & AI for data labeling; Facebook translates videos into videogames
by Jack Clark
Amazon uses machine learning to augment its own data-labeling humans:
…We heard you like AI so much we put AI inside your AI-data-labeling system…
Amazon reveals ProductNet, an ImageNet-inspired dataset of products. ProductNet is designed to help researchers train models that have as subtle and thorough an understanding of products as equivalently trained systems have of classes of images. The goal for Amazon is to be able to better categorize products, and the researchers say in tests that this system can significantly improve the effectiveness of human data labelers.
Dataset composition: ProductNet consists of 3900 categories of product, with roughly 40-60 products for each category. “We aim at the diversity and representativeness of the products. Being representative, the labeled data can be used as reference products to power product search, pricing, and other business applications,” they write. “Being diverse, the models are able to achieve strong generalization ability for unlabeled data, and the product embedding is also able to represent richer information”.
ProductNet, what is it good for? ProductNet’s main purpose appears to be helping Amazon develop better systems to help its human contractors label data more efficiently, and, eventually, to create a system that can label data directly by itself.
Labeling: ProductNet is designed to be tightly integrated with human workers, who collectively help Amazon label its various items while continuously calibrating the AI system. It works like this: the researchers start off with a basic system (eg, Inception-v4 trained on ImageNet for processing images, and fastText for processing text data) which they use to search over unlabeled products; the humans annotate what it surfaces and the labels are fed back into the master model, which is then used to surface more specific products, which the humans then annotate, and so on.
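For the curious, here’s a minimal sketch of what a model-assisted labeling loop like this might look like in Python – the embedding model, similarity search, and ‘annotate’ step are illustrative stand-ins, not Amazon’s actual system:

```python
import numpy as np

def embed(items, model):
    """Stand-in for a pretrained encoder (eg Inception-v4 for images,
    fastText for product titles) that maps items to feature vectors."""
    return np.stack([model(item) for item in items])

def surface_candidates(query_vec, unlabeled_vecs, k=100):
    """Return the k unlabeled products most similar to the labeled seed set."""
    sims = unlabeled_vecs @ query_vec / (
        np.linalg.norm(unlabeled_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-8)
    return np.argsort(-sims)[:k]

def labeling_loop(seed_items, unlabeled_items, model, annotate, rounds=5):
    """The model proposes candidates, humans confirm or correct them,
    and the resulting labels feed back into the next round."""
    labeled = list(seed_items)  # list of (item, label) pairs
    unlabeled_vecs = embed(unlabeled_items, model)
    for _ in range(rounds):
        # Query with the centroid of everything labeled so far.
        query = embed([item for item, _ in labeled], model).mean(axis=0)
        candidates = surface_candidates(query, unlabeled_vecs)
        # The humans only annotate the surfaced candidates - the expensive step.
        labeled.extend(annotate([unlabeled_items[i] for i in candidates]))
        # A full system would also fine-tune `model` on `labeled` here.
    return labeled
```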
20X gain: In tests, Amazon says human annotators augmented via ProductNet can label 100 items to flesh out the edge of a model in about 30 minutes, compared with annotators who don’t have access to the model, who manage only around five data points in the same period. This represents a 20X gain through the use of the system, Amazon says.
Read more: ProductNet: a Collection of High-Quality Datasets for Product Representation Learning (Arxiv).
#####################################################
AI + Facial Recognition + Airlines:
What does it mean when airlines use facial recognition instead of passports & boarding passes to let people onto planes? We can get a sense of the complex feelings this experience provokes by reading a Twitter thread from someone who experienced it, then questioned the airline (JetBlue) about its use of the tech.
Read about what happens when someone finds facial recognition systems deployed at the boarding gate. (Twitter).
#####################################################
Down on the construction site: How to deploy AI in a specific context and the challenges you’ll encounter:
…AI is useless unless you can deploy it…
There’s a big difference between having an idea and implementing that idea; research from Iowa State University highlights this by discussing the steps needed to go from selecting a problem (for example: training image recognition systems to recognize images from construction sites) to solving that problem.
“Based on extensive literature review, we found that most of the studies focus on development of improved techniques for image analytics, but a very few look at the economics of final deployment and the trade-off between accuracy and costs of deployment,” the authors write. “This paper aims at providing the researchers and engineers a practical and comprehensive deep learning based solution to detect construction equipment from the very first step of development to the last step, which is deployment of the solution”.
Deployment – more than just a discrete step: The paper highlights the sorts of tradeoffs people need to make as they try to deploy systems, ranging from the lack of good open datasets for specific contexts (eg, here the authors train a model for use on construction sites off of the comparatively small ‘AIM’ subset of ImageNet), to the need to source efficient models (they use MobileNet), to needing to customize those models for specific hardware platforms (Raspberry Pis, NVIDIA Jetsons, Intel Neural Compute Sticks, and so on).
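To make the deployment constraint concrete, here’s a minimal sketch of running a pretrained MobileNet classifier via tf.keras – the kind of small, efficient model the paper settles on for edge hardware. The image path is a placeholder and this is not the paper’s actual pipeline:

```python
import numpy as np
from tensorflow.keras.applications.mobilenet import (
    MobileNet, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

# MobileNet trades a little accuracy for a model small enough to run on
# edge hardware like a Raspberry Pi or a Jetson.
model = MobileNet(weights="imagenet")

# "site_photo.jpg" is a placeholder for a frame from a construction-site camera.
img = image.load_img("site_photo.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
print(decode_predictions(preds, top=3)[0])  # ImageNet classes, eg 'crane', 'forklift'
```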
Why this matters: As AI enters its deployment phase, research like this gives us a sense of the gulf between most research papers and actual deployable systems. It also provides a further bit of evidence in favor of ‘MobileNet’, which I’m seeing crop up in an ever-increasing number of papers concerned with deploying AI systems, as opposed to just inventing them.
Read more: A deep learning based solution for construction equipment detection: from development to deployment (Arxiv).
#####################################################
Enter the AI-generated Dungeon:
… One big language model plus some crafted sentences = fun…
Language models have started to get much more powerful as researchers have combined flexible components (eg: Transformers) with large datasets to train big, effective general-purpose models (see: ULMFiT, GPT-2, BERT, etc). Language models, much like image classifiers, have a range of uses, and so it’s interesting to see someone use a GPT-2 model to create an online AI Dungeon game, where you navigate a scenario by reading blocks of text and picking options – the twist here is that it’s all generated by the model.
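For a sense of how little scaffolding a game like this needs, here’s a minimal sketch of sampling a dungeon-style continuation from the publicly released GPT-2 weights via the Hugging Face transformers library – the prompt and sampling settings are illustrative, and this isn’t the AI Dungeon author’s actual code:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# A hand-crafted scene-setting prompt; the model improvises what happens next.
prompt = ("You are a knight standing at the mouth of a dark cave. "
          "You light a torch and step inside. Suddenly")

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=120,
    do_sample=True,          # sample rather than take the argmax, for variety
    top_k=40,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```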
Play the game here: AI Dungeon.
#####################################################
Facebook wants to make videos into videogames:
…vid2game extracts playable characters from videos…
Facebook AI Research has published vid2game, an AI system that lets you select a person in a public video on the internet and gain the ability to control them, as though they were a character in a videogame. The approach also lets you change the background, so a tennis player can – for instance – walk off the court and onto a (rendered) dirt road, et cetera.
The technique relies on two components, Pose2Pose and Pose2Frame: Pose2Pose lets you select a person in some footage, extract their pose information by building a 3D model of their body, and use that to move them; Pose2Frame matches this body to a background, which lets you take a person, control them, and change the context around them.
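The component names above come from the paper, but the toy sketch below is only meant to show how the two stages chain together – the real networks are image-generation models, not the tiny linear stand-ins assumed here:

```python
import torch
import torch.nn as nn

class Pose2Pose(nn.Module):
    """Illustrative stub: given the current pose and a control signal
    (eg 'step left'), predict the character's next pose."""
    def __init__(self, pose_dim=34, ctrl_dim=4, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + ctrl_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim))

    def forward(self, pose, control):
        return self.net(torch.cat([pose, control], dim=-1))

class Pose2Frame(nn.Module):
    """Illustrative stub: render the posed character and blend it onto an
    arbitrary background frame (the real model predicts a blending mask)."""
    def __init__(self, pose_dim=34, frame_pixels=64 * 64 * 3):
        super().__init__()
        self.render = nn.Linear(pose_dim, frame_pixels)

    def forward(self, pose, background):
        return background + self.render(pose)

# One step of "playing" an extracted character against a new background.
pose = torch.zeros(1, 34)                       # current pose keypoints
control = torch.tensor([[1.0, 0.0, 0.0, 0.0]])  # eg "move left"
background = torch.zeros(1, 64 * 64 * 3)        # any scene you like
frame = Pose2Frame()(Pose2Pose()(pose, control), background)
print(frame.shape)  # torch.Size([1, 12288])
```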
Why this matters: Systems like this show how we can use AI to (artificially) add greater agency to the world around us. This approach “paves the way for new types of realistic and personalized games, which can be casually created from everyday videos”, Facebook wrote.
Read more: Vid2Game: Controllable Characters Extracted from Real-World Videos (Arxiv).
Watch the technology work here (YouTube).
#####################################################
Making AI systems that can read the visual world:
…Facebook creates dataset and develops technology to help it train AI models to read text in pictures…
Researchers with Facebook AI Research and the Georgia Institute of Technology want to create AI systems that can look at the world around us – including the written world – and answer questions about it. Such systems could be useful to people with vision impairments who could interact with the world by asking their AI system questions about it, eg: what is in front of me right now? What items are on the menu in the restaurant? Which is the least expensive item on the menu in the restaurant? And so on.
If this sounds simple, why is it hard? Think about what you – a computer – are being asked to do when required to parse some text in an image in response to a question. You’re being asked to:
- Know when the question is about text
- Figure out the part of the image that contains text
- Convert these pixel representations into word representations
- Reason about both the text and the visual space
- Decide whether the answer to the question involves copying some text from the image and feeding it to the user, or whether the answer involves understanding the text in the picture and using that to reason further.
The TextVQA Dataset: To help researchers tackle this, the authors release TextVQA, a dataset containing 28,408 images from OpenImages, 45,336 questions associated with these images, and 453,360 ground truth answers.
Learning to read images: The researchers develop a model they call LoRRA, short for Look, Read, Reason & Answer. LoRRA staples together some existing Visual Question Answering (VQA) systems, with a dedicated optical character recognition (OCR) module. It also has an Answer Module, loosely modeled on Pointer Networks, which can figure out when to incorporate words the OCR module has parsed but which the VQA module doesn’t necessarily understand.
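The ‘copy’ behaviour of the Answer Module is the interesting part; here’s a heavily simplified sketch (with assumed dimensions – this is not the paper’s code) of how scoring a fixed answer vocabulary alongside the detected OCR tokens lets a model answer with text it has read out of the image:

```python
import torch
import torch.nn as nn

class CopyOrAnswer(nn.Module):
    """Pointer-style answer head: score a fixed answer vocabulary AND the OCR
    tokens detected in the image, so the model can either pick a vocabulary
    answer or 'copy' an OCR token verbatim. A simplification, not LoRRA itself."""
    def __init__(self, hidden_dim=512, vocab_size=4000):
        super().__init__()
        self.vocab_head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, fused, ocr):
        # fused: (batch, hidden) joint question+image representation
        # ocr:   (batch, num_ocr_tokens, hidden) embeddings of the OCR tokens
        vocab_scores = self.vocab_head(fused)                          # (batch, vocab)
        copy_scores = torch.bmm(ocr, fused.unsqueeze(-1)).squeeze(-1)  # (batch, ocr)
        # Indices >= vocab_size mean "copy that OCR token as the answer".
        return torch.cat([vocab_scores, copy_scores], dim=-1)

head = CopyOrAnswer()
scores = head(torch.randn(2, 512), torch.randn(2, 10, 512))
print(scores.shape)  # torch.Size([2, 4010])
```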
Human accuracy versus machine accuracy: Human accuracy on the TextVQA dataset is 85.01%, the researchers say. Meanwhile, the best-performing model the researchers develop (based on LoRRA) obtains a top accuracy of 26.56% – suggesting we have a long way to go before we get good at this.
Why this matters: Building AI systems that can ingest enough information about the world to be able to augment people seems like one of the more immediate, high-impact uses of the technology. The release of a new dataset here should encourage more progress on this important task.
Read more: Towards VQA Models that can Read (Arxiv).
Get the dataset: TextVQA (Official TextVQA website).
#####################################################
AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net…
Russia calls for international agreements on military AI:
Russia’s security chief has spoken publicly about the need for international regulation on military uses of AI and emerging technologies, which he said could be as dangerous as weapons of mass destruction. He said it is necessary to “activate the powers of the global community, chiefly at the UN”, to develop an international regulatory framework. This comes as a surprise, given that Russia has previously been among the countries resisting moves towards international agreements on lethal autonomous weapons.
Read more: Russia’s security chief calls for regulating use of new technologies in military sphere (TASS).
#####################################################
OpenAI Bits & Pieces:
Making music with OpenAI MuseNet:
We’ve been experimenting with big, generative models (see: our language work on GPT-2). We’re interested in how we can explore generation in a variety of mediums to better understand how to build more creative systems. To that end, we’ve developed MuseNet, a Transformer-based system that can generate 4-minute musical compositions with 10 different instruments, combining styles from Mozart to the Beatles.
Listen to some of the samples and try out MuseNet here (OpenAI Blog).
Imagining weirder things with Sparse Transformers:
We’ve recently developed the Sparse Transformer, a Transformer-based system that can extract patterns from sequences 30x longer than previously possible. This translates into generally better generative models, and gives us the ability to extract more subtle features from bigger chunks of data.
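To give a rough flavour of the idea, here’s a toy sketch of one strided sparse-attention pattern of the kind the work describes – a hand-rolled mask for illustration, not our released implementation:

```python
import torch

def strided_sparse_mask(seq_len, stride):
    """Toy version of a factorized attention pattern: each position attends to
    its recent local window plus a strided set of earlier positions, instead
    of all O(n^2) pairs - which is what makes much longer sequences tractable."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    causal = j <= i                     # never attend to the future
    local = (i - j) < stride            # recent context
    strided = (i - j) % stride == 0     # periodic "summary" positions
    return causal & (local | strided)

mask = strided_sparse_mask(seq_len=16, stride=4)
print(mask.int())  # each row allows roughly 2*sqrt(n) positions rather than n
```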
Read more: Generative Modeling with Sparse Transformers (OpenAI Blog).
#####################################################
Tech Tales:
Keep It Cold
We were at a bar in the desert with a man who liked to turn things cold. He had a generator with him hooked up to a refrigeration unit and it was all powered by some collection of illicit jerry cans of gasoline. We figured he must have fished these out of the desert on expeditions. And now here he was, presiding over a temporary bar in a small town rife with other collectors of banned things, who fanned out into the desertified surrounding areas of burnt or drying suburbs, in search of the things we could once make but could no longer.
The shtick of this guy – and his bar – was that he could make cold drinks, and not just any kind of cold drinks but “sub-zero drinks”, powered by another system which seemed to be a hybrid of a chemistry set and a gun. Something about it helped cool water down even more while keeping it flowing, so that it came out cold enough you could feel it chill the air if you held your eye up close enough. It was a gloriously wasteful, indulgent, expensive enterprise, but he wasn’t short of customers. Something about a really cold drink is universal, I guess.
We were talking about the past: what it had been like to have a ‘rainy season’ where the rains were kind and were not storms. What it meant when a ‘dry season’ referred to an unusual lack of humidity, rather than guaranteed city-eating fires. We talked about the present, a little, but moved on quickly: the past and the future are interesting, and the present is a drag.
We were mid-way through talking about the future – could we get off planet? What was going to happen to the middle of Africa due to desertification? How were the various walls and nation-severing barriers progressing? – when the generator stopped working. The man went outside and fussed for a bit and when he came back in he said: “I guess that’s the last ice on planet earth”.
We all laughed but there was a pretty good chance he was right. That’s how things were in those days, before civilization moved into whatever it is now – we lived in a time when, whenever a thing stopped working, you could credibly think “maybe that’s the last human civilization will see of that?”. Imagine that.
Things that inspired this story: The increasingly real lived & felt reality of massive and globally distributed climate change; refrigerators; tinkerers; Instagram-restaurants at the end of time and space.