Import AI: 117: Surveillance search engines; harvesting real-world road data with hovering drones; and improving language with unsupervised pre-training

by Jack Clark

Chinese researchers pursue state-of-the-art lip-reading with massive dataset:
…What do I spy with my camera eyes? Lips moving! Now I can figure out what you are saying…
Researchers with the Chinese Academic of Sciences and Huazhong University of Science and Technology have created a new dataset and benchmark for “lip-reading in the wild” for Mandarin. Lip-reading gives people a new sensory capability to imbue AI systems with. For instance, lip-reading systems can be used for “aids for hearing-impaired persons, analysis of silent movies, liveness verification in video authentication systems, and so on” the researchers write.
  Dataset details: The lipreading dataset contains 745,187 distinct samples from more than 2,000 speakers, grouped into 1,000 classes, where each class corresponds to the syllable of a Mandarin word composed of one or several Chinese characters. “To the best of our knowledge, this database is currently the largest word-level lipreading dataset and the only public large-scale Mandarin lipreading dataset”, the researchers write. The dataset has also been designed to be dverse so the footage in it consists of multiple different people taken from multiple different camera angles, along with perspectives taken from television broadcasts. This diversity makes the benchmark more closely approximate real world situations whereas previous work in this domain has involved stuff taken from a fixed perspective. They build the dataset by annotating Chinese television using a service provided by iFLYTEK, a Chinese speech recognition company.
  Baseline results: They train three baselines on this dataset – a fully 2D CNN, a fully 3D CNN (modeled on LipNet, research covered in ImportAI #104 from DeepMind and Google) , and a model that mixes 2D and 3D convolutional layers. All of these approaches perform poorly on the new dataset, despite having obtained performances as high as 90% on other more restricted datasets. The researchers implement their models in PyTorch and train them on servers containing four Titan X GPUs with 12GB of memory. The resulting top-5 accuracy results for the baselines on the new Chinese dataset LRW-1000 are as follows:
– LSTM-5: 48.74%
– D3D: 59.80%
– 3D+2D: 63.50%
  Why it matters: Systems for stuff like lipreading are going to have a significant impact on applications ranging from medicine to surveillance. One of the challenges posed by research like this is its inherently ‘dual use’ nature; as the researchers allude to in the introduction of this paper, this work can be used both for healthcare uses as well for surveillance uses (see: “analysis of silent movies”). How society deals with the arrival of these general AI technologies will have a significant impact on the types of societal architectures that will be built and developed throughout the 21st Century. It is also notable to see the emergence of large-scale datasets built by Chinese researchers in Chinese language – perhaps one could measure the relative growth in certain language datasets to model AI interest in the associated countries?
  Read more: LRW-1000: A Naturally Distributed Large-Scale Benchmark for Lip Reading in the Wild (Arxiv).

Want to use AI to study the earth? Enter the PROBA-V Super Resolution competition:
…European Space Agency challenges researchers to increase the usefulness of satellite-gathered images…
The European Space Agency has launched the ‘PROBA-V Super Resolution” competition, which challenges researchers to take in a bunch of photos from a satellite of the same region of the Earth and stitch them together to create a higher-resolution composite.
  Data: The data contains multiple images taken in different spectral bands of 74 locations around the world at each point in time. Images are annotated with a ‘quality map’ to indicate any parts of them that may be occluded or otherwise hard to process. “Each data-point consists of exactly one 100m resolution image and several 300m resolution images from the same scene,” they write.
  Why it matters: Competitions like this provide researchers with novel datasets to experiment with and have a chance of improving the overall usefulness of expensive capital equipment (such as satellites).
Find out more about the competition here at the official website (PROBA-V Super Resolution challenge).

Google releases BERT, obtains state-of-the-art language understanding scores:
…Language modeling enters its ImageNet-boom era…
Google has released BERT, a natural language processing system that uses unsupervised pre-training and task fine-tuning to obtain state-of-the-art scores on a large number of distinct tasks.
  How it works: BERT, which stands for Bidirectional Encoder Representations from Transformers, builds on recent developments in language understanding ranging from techniques like ELMO to ULMFiT to recent work by OpenAI on unsupervised pre-training. BERT’s major performance gains come from a specific structural modification (jointly conditioning on the left and right context in all layers), as well as some other minor tweaks, plus – as is the trend in deep learning these days – training on a larger model using more compute. The approach it is most similar to is OpenAI’s work using unsupervised pre-training for language understanding, as well as work from using similar approaches.
  Major tweak: BERT’s use of joint conditioning likely leads to its most significant performance improvement. They implement this by adding in an additional pre-training objective called the ‘masked language model’ which involves randomly masking input tokens, then asking the model to predict the contents of the masked token based on context – this constraint encourages the network to learn to use more context when completing task, which seems to lead to greater representational capacity and improved performance. They also use Next Sentence Prediction during pre-training to try to train a model that has a concept of relationships of concepts across different sentences. Later they conduct significant ablation studies of BERT and show that these two pre-training tweaks are likely responsible for the majority of the observed performance increase.
  Results: BERT obtains state-of-the-art performance on the multi-task GLUE benchmark, setting new state-of-the-art scores on a wide range of challenging tasks. It also sets a new state-of-the-art score on the ‘SWAG’ dataset – significant, given that SWAG was released earlier this year and was expressly designed to challenge AI techniques, like DL, which may gather a significant amount of performance by deriving subtle statistical relationships within datasets.
  Scale: The researchers train two models, BERTBASE and BERTLARGE. BERTBASE was trained on 4 Cloud TPUs for approximately four days, and BERTLARGE was trained on 16 Cloud TPUs also for four days.
  Why it matters – Big Compute and AI Feudalism: Approaches like this show how powerful today’s deep learning based systems are, especially when combined with large amounts of compute and data. There are legitimate arguments to be made that such approaches are bifurcating research into low-compute and high-compute domains – one of these main BERT models took 16 TPUs (so 64 TPU chips total) trained for four days, putting it out of reach of low-resource researchers. On the plus side, if Google releases things like the pre-trained model then people will be able to use the model themselves and merely pay the training cost to finetune for different domains. Whether we should be content with researchers getting the proverbial crumbs from rich organizations’ tables is another matter, though. Maybe 2018 is the year in which we start to see the emergence of ‘AI Feudalism’.
  Read more: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Arxiv).
Check out this helpful Reddit BERT-explainer from one of the researchers (Reddit).

Using drones to harvest real world driving data:
…Why the future involves lightly-automated aerial robot data collection pipelines…
Researchers with the Automated Driving Department of the Institute for Automotive Engineering at Aachen University have created a new ‘highD’ dataset that captures the behavior of real world vehicles on German highways (technically: autobahns).
  Drones + data: The researchers created the dataset via DJI Phantom 4 Pro Plus drones hovering above roadways which they used to collect natural vehicle trajectories from vehicles driving on German highways around Cologne. The dataset includes post-processed trajectories of 110,000 vehicles including cars and trucks. The datasets consists of 16.5 hours of video spread across 60 different recordings which were were made at six different locations between 2017 and 2018, with each recording having an average length of 17 minutes.
  Augmented dataset: The researchers provide additional labors in the dataset beyond trajectories, categorizing vehicles’ behavior into distinct detected maneuvers, which include: free driving, vehicle following, critical maneuvers, and lane changes.
highD VS NGSIM: The dataset most similar to highD is NGSIM, a dataset developed by the US Department of Transport. highD contains a significantly greater diversity of vehicles as well as being significantly larger, but the recorded distances which the vehicles travel along are shorter, and the German roads have fewer lanes than the American ones used in highD.
  Why it matters: Data is likely going to be crucial for the development of real world robot platforms, like self-driving cars. Techniques like those outlined in this paper show how we can use newer technologies, like cheap consumer drones, to automate significant chunks of the data gathering process, potentially making it easier for people to gather and create large datasets. “Our plan is to increase the size of the dataset and enhance it by additional detected maneuvers for the use in safety validation of highly automated driving,” the researchers write.
Get the data from the official website (
You can access the Matlab and Python code used to handle the data, create visualizations, and extract maneuvers from here (Github).
Read more: The highD Dataset: A Drone Dataset of Naturalistic Vehicle Trajectories on German Highways for Validation of Highly Automated Driving Systems (Arxiv).

Building Google-like search engines for surveillance, using AI:
…New research lets you search via a person’s height, color, and gender…
Deep learning based techniques are fundamentally changing how surveillance architecture are being built. Case in point: A new paper from Indian researchers gives a flavor of how deep learning can expand the capabilities of security technology for ‘person retrieval’, which is the task of trying to find a particular person within a set of captured CCTV footage.
  The system: The researchers use Mask R-CNN pre-trained on Microsoft COCO to let people search over CCTV footage from the SoftBioSearch dataset for people via specific height, color, and ‘gender’ (for the purpose of this newsletter we won’t go into the numerous complexities and presumed definitions inherent in the use of ‘gender’ here).
  Results: “The algorithm correctly retrieves 28 out of 41 persons,” the researchers write. This isn’t yet quite to the level of performance where I can imagine people implementing it, but it certainly seems ‘good enough’ for many surveillance cases, where you don’t really care about a few false positives as you’re mostly trying to find candidate targets backed up by human analysis.
  Why it matters: The deployment of artificial intelligence systems is going to radically change how governments relate to their citizens by giving them greater abilities than before to surveil and control them. Approaches like this highlight how flexible this technology is and how it can be used for the sorts of surveillance work that people typically associate with large teams of human analysts. Perhaps we’ll soon hear about intelligence analysts complaining about the automation of their own jobs as a consequence of deep learning.
  Read more: Person Retrieval in Surveillance video using Height, Color and Gender (Arxiv).

Tech Tales:

[2028: A climate-protected high-rise in a densely packed ‘creatives’ district of an off-the-charts Gini-coefficient city]

We Noticed You Liked This So Now You Have This And You Shall Have This Forever

The new cereal arrived yesterday. I’m already addicted to it. It is perfect. It is the best cereal I have ever had. I would experience large amounts of pain to have access to this cereal. My cereal is me; it has been personalized and customized. I love it.

I had to invest to get here. Let us not speak of the first cereals. The GAN-generated “Chocolate Rocks” and “Cocoa Crumbles” and “Sweet Black Bean Flakes”. I shudder to think of these. Getting to the good cereal takes time. I gave much feedback to the company, including giving them access to my camera feeds, so their algorithms can watch me eat. Watch me be sick.

One day I got so mad that the cereal had been bad for so long that I threw it across the room and didn’t have anything else for breakfast.

Thank You For Your Feedback, Every Bit of Feedback Gets us Closer to Your Perfect Cereal, they said.
I believed them.

Though I do not have a satisfying job, I now start every morning with pride. Especially now, with the new cereal. This cereal reflects my identity. The taste is ideal. The packaging reminds me of my childhood and also simulates a new kind of childhood for me, filling the hole of no-kids that I have. I am very lonely. The cereal has all of my daily nutrients. It sustains me.

Today, the company sent me a message telling me I am so valuable they want me to work on something else. Why Not Design Your Milk? They said. This makes sense. I have thrown up twice already. One of the milks was made with seaweed. I hated it. But I know because of the cereal we can get there: we can develop the perfect milk. And I am going to help them do it and then it will be mine, all mine.

And they say our generation is less exciting than the previous ones. Answer me this: did any of those old generations who fought in wars design their own cereal in companion with a superintelligence? Did any of them know the true struggle of persisting in the training of something that does not understand you and does not care about you, but learns to? No. They had children, who already like you, and partners, who want to be with you. They did not have this kind of hardness.

The challenge of our lifetime is to suffer enough to make the perfect customization. Why not milk? They ask me. Why not my own life, I ask them? Why not customize it all?

And they say religion is going out of fashion!

Things that inspired this story: GANs; ad-targeting; the logical end point of Google and Facebook and all the other stateless multinationals expanding into the ‘biological supply chain’ that makes human life possible; the endless creation of new markets within capitalism; the recent proliferation of various ‘nut milks’ taken to their logical satirical end point; hunger; the shared knowledge among all of us alive that our world is being replaced by simulcras of the world and we are the co-designers of these paper-thin realities.