Import AI 109: Why solving jigsaw puzzles can lead to better video recognition, learning to spy on people in simulation and transferring to reality, why robots are more insecure than you might think

by Jack Clark

Fooling object recognition systems by adding more objects:
…Some AI exploits don’t have to be that fancy to be effective…
How do object recognition systems work, and what throws them off? That’s a hard question to answer because most AI researchers can’t provide a good explanation for how all the different aspects of a system interact to make predictions. Now, researchers with York University and the University of Toronto have shown how to confound commonly deployed object detection systems by adding more objects to a picture in unusual places. Their approach doesn’t rely on anything as subtle as an adversarial example – which involves subtly perturbing the pixels of an image to cause a mis-classification – and instead involves either adding new objects to a scene, or creating duplicates within a scene.
   Testing: The researchers test trained models from the public Tensorflow Object Detection API against images from the validation set of the 2017 version of MS-COCO.
  Results: The tests show that most commonly deployed object detection systems fail when objects are moved to different parts of an image (suggesting that the object classifier is conditioning heavily on the visual context surrounding a given object) or overlap with one another (suggesting that these systems have trouble segmenting objects, especially similar ones). They also show that the manipulation or addition of an object to a scene can lead to other negative effects elsewhere in the image, for instance, objects near – but not overlapping – the object can “switch identity, bounding box, or disappear altogether.”
  Terror in a quote: I admire the researchers for the clinical tone they adopt when describing the surreal scenes they have concocted to stress the object recognition system, for instance, this description of some results successfully confusing a system: “The second row shows the result of adding a keyboard at a certain location. The keyboard is detected with high confidence, though now one of the hot-dogs, partially occluded, is detected as a sandwich and a doughnut.”
  Google flaws: The researchers gather a small amount of qualitative data by uploading a couple of images to the Google Vision API website, in which “no object was detected”.
  Non-local effects: One of the more troubling discoveries relates to non-local effects. In one test on Google’s OCR capabilities they show that: “A keyboard placed in two different locations in an image causes a different interpretation of the text in the sign on the right. The output for the top image is “dog bi” and for the bottom it is “La Cop””.
  Why it matters: Experiments like this demonstrate the brittle and sometimes rather stupid ways in which today’s supervised learning deep neural net-based systems can fail. The more worrying insights from this are the appearance of such dramatic non-local effects, suggesting that it’s possible to confuse classifiers with visual elements that a human would not find disruptive.
Read more: The Elephant in the Room (Arxiv).

$! AI Measurement Job: !$ The AI Index, a project to measure and assess the progress and impact of AI, is hiring for a program manager. You’ll work with the steering committee, which today includes myself and Erik Brynjolfsson, Ray Perrault, Yoav Shoham, James Manyika  and others (more on that subject soon!). It’s a good role for someone interested in measuring AI progress on both technical and societal metrics and suits someone who enjoys disentangling hype from empirically verifiable reality. I spend a few hours a week working on the index (more as we finish the 2017 report!) and can answer any questions about the role:
  AI Index program manager job posting here.
  More about the AI Index here.

Better video classification by solving jigsaw puzzles:
…Hollywood squares, AI edition…
Jigsaw puzzles could be a useful way to familiarize a network with some data and give it a curriculum to train over – that’s the implication of new research from Georgia Tech and Carnegie Mellon University which shows how to improve video recognition performance by, during training, slicing videos in a test set up into individual jigsaw pieces then tracing a neural network to predict how to piece them back together. This process involves the network learning to jointly solve two tasks: correctly piecing together the scrambled bits of each video frame, and learning to join the frames together in the appropriate order through time. “Our goal is to create a task that not only forces a network to learn part-based appearance of complex activities but also learn how those parts change over time,” they write.
  Slice and dice: The researchers cut up their videos by dividing  each video frame into 2 x 2 grid of patches, then they stitch three of these frames together into tuples.  “There are 12! (479001600) ways to shuffle these patches” in both space and time, they note. They implement a way to intelligently winnow down this large combinatorial space into selections geared towards helping the network learn.
  Testing: The researchers believe that training networks to correctly unscramble these video snippets in terms of both visual appearance and temporal placement will give them a greater raw capability to classify other, unseen videos. To test this, they train their video jigsaw network on the UCF101 (13,320 videos across 101 action categories) and Kinetics (around 400 categories with 400+ videos each) datasets, then they evaluate it on the UCF101 and HMDB51 (around 7,000 videos across 51 action categories). They train their systems with a curriculum approach, where they start off having to learn how to unscramble a few pieces at a time, then increase this figure through training, forcing it to learn to solve harder and harder tasks.
  Transfer learning: The researchers note that systems pre-trained with the larger Kinetics dataset generalize better than ones trained on the smaller UCF101 one and they test this hypothesis by training the UCF101 in a different way designed to minimize over-fitting, but discover the same phenomenon.
  Results: The researchers find that when they finetune their network on the UCF101 and HMDB51 datasets are pre-training on Kinetics they’re able to obtain state-of-the-art results when compared to other unsupervised learning techniques, though obtain less accuracy than supervised learning approaches. They also obtain close-to SOTA accuracy on classification on the PASCAL VOC 2007 dataset.
  Why it matters: Approaches like this demonstrate how researchers can use the combinatorial power made available by cheap computational resources to mix-and-match datasets, letting them create natural curricula that can lead to better unsupervised learning approaches. One way to view research like this is it is increasing the value of existing image and video data by making such data potentially more useful.
  Read more: Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition (Arxiv).

Learning to surveil a person in simulation, then transferring to reality:
…sim2real, but for surveillance…
Researchers with Tencent AI Lab and Peking University have shown how to use virtual environments to “conveniently simulate active tracking, saving the expensive human labeling or real-world trial-and-error”. This is part of a broader push by labs to use simulators to generate large amounts of synthetic data which they train their system on, substituting the compute used to run the simulator for the resources that would have otherwise been expended on gathering data from the real world. The researchers use two environments for their research: VIZDoom and the Unreal Engine (UE). Active tracking is the task of locking onto an object in a scene, like a person, and tracking them as they move through the scene, which could be something like a crowded shopping mall, or a public park, and so on.
  Results: “We find out that the tracking ability, obtained purely from simulators, can potentially transfer to real-world scenarios,” they write. “To our slight surprise, the trained tracker shows good generalization capability. In testing, it performs the robust active tracking in the case of unseen object movement path, unseen object appearance, unseen background, and distracting object”.
  How they did it: The researchers use one major technique to transfer from simulation into reality: domain randomization. Domain randomization is a technique where you apply multiple variations to an environment to generate additional data to train over. For this they vary things like the textures applied to the entities in the simulator, as well as the velocity and trajectory of these entities. They train their agents with a reward which is roughly equivalent to keeping the target in the center of the field of view at a consistent distance.
  VIZDoom: For VIZDoom, the researchers test how well their approach works when trained on randomizations, and when trained without. For the randomizations, they train on a version of the Doom map where they randomize the initial positions of the agent during training. In results, agents trained on randomized environments substantially outperformed those trained on non-randomized ones (which intuitively makes sense, since the non-randomized agents will have gained a less wide variety of experience during training). Of particular note is that they find the tracker is able to perform well even when it temporarily loses sight of the target being tracked.
  Unreal Engine (UE): For the more realistic Unreal Engine environment the team show, again, that versions trained with randomizations – which include texture randomizations of the models – are superior to systems trained without. They show that the trained trackers are robust to various changes, including giving it a different target to what it was trained on to track, or altering the environment.
Transfer learning – real data: So, how useful is it to train in simulators? A good test is to see if systems learned in simulation can transfer to reality – that’s something other researchers have been doing (like OpenAI’s work on its hand project, or CAD2RL). Here, the researchers test this transfer ability by taking best-in-class models trained within the more realistic Unreal Engine environment, then evaluating them on the ‘VOT’ dataset. They discover that the trained systems displays action recommendations for each frame (such as move left, or move right) consistent with moves that place the tracked target in the center of the field of view.
  Testing on a REAL ROBOT: They also perform a more thorough test of generalization by installing the system on a real robot. This has two important elements: augmenting the training data to aid transfer learning to real world data, and modifying the action space to better account for the movements of the real robot (both using discrete and continuous actions).
  Hardware used: They use a wheeled ‘TurtleBot’, which looks like a sort of down-at-heel R2D2. The robot sees using an RGB-D camera mounted about 80cm above the ground.
  Real environments: They test out performance in an indoor room and on an outdoor rooftop. The indoor room is simple, containing a table, a glass wall, and a row of railings; the glass wall presents a reflective challenge that will further test generalization of the system. The outdoor space is much more complicated and includes desks, chairs, and plants, as well as more variable lighting conditions. They test the robot on its ability to track and monitor a person walking a predefined path in both the room and the outdoor rooftop.
  Results: The researchers use a YOLOv3 object detector to acquire the target and its bounding box,then test the tracker using both discrete and continuous actions. The system is able to follow the target the majority of the time in both Indoor and Outdoor settings, with higher scores on the simpler, indoor environment.
  Why this matters: Though this research occurs in a somewhat preliminary setting (like the off-the-shelf SLAM drone from Import AI 206), it highlights a trend in recent AI research: there are enough open systems and known-good techniques available to let teams of people create interesting AI systems that can perform crude actions in the real world. Yes, it would be nice to have far more sample-efficient algorithms that could potentially operate live on real data as well, but those innovations – if possible – are some way off. For now, researchers can instead spend money on compute resources to simulate arbitrarily large amounts of data via the use of game simulators (eg, Unreal Engine) and clever randomizations of the environment.
  Read more: End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning (Arxiv).

Teaching computers to have a nice discussion, with QuAC:
…New dataset poses significant challenges to today’s systems by testing how well they can carry out a dialog…
Remember chatbots? A few years ago people were very excited about how natural language processing technology was going to give us broadly capable, general purpose chatbots. People got so excited that many companies made acquisitions in this area or span-up their own general purpose dialog projects (see: Facebook M, Microsoft Cortana). None of this stuff worked very well, and today’s popular personal assistants (Alexa, Google Home, Siri) contain a lot more hand-engineering than people might expect.
  So, how can we design better conversational agents? One idea put forward by researchers at the University of Washington, the Allen Institute for AI, UMass Amherst, and Stanford University, is to teach computers to carry out open-ended question-and-answer conversations with eachother. To do this, they have designed and released a new dataset and task called QuAC (Question Answering in Context) which consists of around 14,000 information-seeking QA dialogs, consisting of 100,000 questions in total.
   Dataset structure: QuAC is structured so that there are two agents having a conversation, a teacher and a student; the teacher is able to see the full text of a Wikipedia section, and the student is able to see the title of this section (for instance: Origin & History). Given this heading, the student’s goal is to learn as much as possible about what the teacher knows, and they can do this by asking the teacher questions. The teacher can answer these questions, and can also provide structured feedback in the form of encouragements to continue or not ask a follow-up, whether a question is correct or not, and – when appropriate – no answer.
Inspiration: The inspiration for the dataset is that being able to succeed at this should be sufficiently hard that it will test language models in a rounded way, forcing them to model things like partial evidence, needing to remember things the teacher has said for follow-up questions, co-reference, and so on.
  Results (the gauntlet has been thrown): After testing their dataset on a number of simple baselines to ensure it is difficult, the researchers test it against some algorithmic baselines. They find the best performing baseline is a reimplementation of a top-performing SQuAD model that augments bidirectional attention flow with self-attention and contextualized embeddings. This model, called BiDAF++, The best performing system obtains human equivalence on 60% of questions and 5% of full dialog, suggesting that solving QuAC could be a good proxy for the development of far more advanced language modeling systems.
  Why it matters: Language will be one of the main ways in which people try to interact with machines, so the creation and dissemination of datasets like QuAC gives researchers a useful way to calibrate their expectations and their experiments – it’s useful to have (seemingly) very challenging datasets out there, as it can motivate progress in the future.
  Read more: QuAC: Question Answering in Context (Arxiv).
  Get the dataset (QuAC official website).

What’s worse than internet security? Robots and internet security:
…Researchers find multiple open ROS access points during internet scan…
As we head toward a world containing more robots that have greater capabilities, it’s probably worth making sure we can adequately secure these robots to prevent them being hacked. New research from the CS department at Brown University shows how hard a task that could be; researchers scanned the entire IPv4 address space on the internet and found over 100 publicly-accessible hosts running ROS, the Robot Operating System.
“Of the nodes we found, a number of them are connected to simulators, such as Gazebo, while others appear to be real robots capable of being remotely moved in ways dangerous both to the robot and the objects around it,” they write. “This scan was eye-opening for us as well. We found two of our own robots as part of the scan, one Baxter robot and one drone. Neither was intentionally made available on the public Internet, and both have the potential to cause physical harm if used inappropriately.”
  Insecure robots absolutely everywhere: The researchers used ZMap to scan the IPv4 space three times over several months for open ROS devices. “Each ROS master scan observed over 100 ROS instances, spanning 28 countries, with over 70% of the observed instances using addresses belonging to various university networks or research institutions,” they wrote. “Each scan surfaced over 10 robots exposed…Sensors found in our scan included cameras, laser range finders, barometric pressure sensors, GPS devices, tactile sensors, and compasses”. They also found several exposed simulators including the Unity Game Engine, TORCS, and others.
  Real insecure robots, live on the internet: Potentially unsecured robot platforms found by the researchers included a Baxter, PR2, JACO, Turtlebot, WAM, and – potentially the most worrying of all – an exposed DaVinci surgical robot.
  Penetration test: The researchers also performed a penetration test on a robot they discovered in this way which was at a lab in the University of Washington. During this test they were able to hack and access its camera, letting them view images of the lab. They could also play sounds remotely on the robot.
  Why it matters: “Though a few unsecured robots might not seem like a critical issue, our study has shown that a number of research robots is accessible and controllable from the public Internet. It is likely these robots can be remotely actuated in ways dangerous to both the robot and the human operators,” they write.
   More broadly, this reinforces a point made by James Mickens during his recent USENIX keynote on computer security + AI (more information: ImportAI #107) in which he notes that the internet is a security hellscape that itself connects to nightmarishly complex machines, creating a landscape for emergent, endless security threats.
  Read more: Scanning the Internet for ROS: A View of Security in Robotics Research (Arxiv).

Better person re-identification via multiple loss functions:
…Unsupervised Deep Association Learning, another powerful surveillance technique…
Researchers with the Computer Vision Group for Queen Mary University of London, and startup Vision Semantics Ltd, have published a paper on video tracking and analysis, showing how to use AI techniques to automatically find pedestrians via a camera view, then re-acquire them when they appear elsewhere in the city.
  Technique: They call their approach an “unsupervised Deep Association Learning (DAL) scheme”. DAL has two main loss terms to aid its learning: local space-time consistency (identifying a person within views from a single camera) and global cyclic ranking consistency (identifying a person from different camera feeds from different cameras.
“This scheme enables the deep model to start with learning from the local consistency, whilst incrementally self-discovering more cross-camera highly associated tracklets subject to the global consistency for progressively enhancing discriminative feature learning”.
  Datasets: The researchers evaluate their approach on three benchmark datasets:
– PRID2011: 1,134 ‘tracklets’ gathered from two cameras, containing 200 people across both cameras.
– iLIDS-VID: 600 tracklets of 300 people.
– MARS: 20,478 tracklets of 1,261 people captured from a camera network with 6 near-synchronized cameras.
  Testing: The researchers find that their DAL technique, when paired with a ResNet50 backbone, obtains state-of-the-art accuracy across PRID 2011 and iLIDS-VID datasets, and second-to-SOTA on MARS. DAL systems with a MobileNet backend obtain second-to-SOTA accuracy on PRID 2011 and iLIDS-VID, and SOTA on Mars. The closest other technique in terms of performance is the Stepwise technique, which is somewhat competitive on PRID 2011.
  Why it matters: Systems like this are the essential inputs to a digital surveillance state; it would have been nice to see some acknowledgement of this obvious application within the research paper. Additionally, as technology like this is developed and propagated it’s likely we’ll see numerous creative uses of the technology, as well as vigorous adoption by companies in industries like advertising and marketing.
  Read more: Deep Association Learning for Unsupervised Video Person Re-identification (Arxiv).

OpenAI Bits & Pieces:

OpenAI plays competitive pro-level Dota matches at The International; loses twice:

OpenAI plays competitive-level Dota in Vancouver:
OpenAI Five lost two games against top Dota 2 players at The International in Vancouver this week, maintaining a good chance of winning for the first 20-35 minutes of both games. In contrast to our
Benchmark 17 days ago, these games: Were played against significantly better human players, used hero lineups provided by a third party rather than by Five drafting against humans, and removed our last major restriction from what most pros consider “Real Dota” gameplay.
  We’ll continue to work on this and will have more to share in the future.
  Read more: The International 2018: Results (OpenAI Blog).

Maybe the reason why today’s AI algorithms are bad is because they aren’t curious enough:
…Of imagination and broken televisions…
New research from OpenAI, the University of California at Berkeley, and the University of Edinburgh,  shows how the application of curiosity to AI agents can lead to the manifestation of surprisingly advanced behaviors. In a series of experiments we show that agents which use curiosity can learn to outperform random-agent baselines on a majority of games in the Atari corpus, and that such systems display good performance in other areas as well. But this capability comes at a cost: curious agents can be tricked, for instance by putting them in a room with a television that shows different patterns of static on different channels – to a curious agent, this type of television static represents variety, and variety is good when you’re optimizing for curiosity, so agents can become trapped, unable to tear themselves away from the allure of the television static.
  Read more: Large-Scale Study of Curiosity Driven-Learning (Arxiv).
  Read more: Give AI curiosity, and it will watch TV forever (Quartz).

Tech Tales:

Art Show

I mean, is it art?
It must be. They’re bidding on it.
But what is it?
A couple of petabytes of data.
Well, we assume it’s data. We’re not sure exactly what it is. We can’t find any patterns in it. But we think they can.
Well, they’re bidding on it. The machines don’t tend to exchange much stuff with eachother. For some reason they think this is valuable. None of our data-integrity protocols have triggered any alarms, so it seems benign.
Where did it come from?
We know some of this. Half of it is a quasar burst that happened a while ago. Some of it is from a couple of atomic clocks. A few megabytes come from some readings from a degrading [REDACTED]. That’s just what they’ve told us. They’ve kind of stitched this altogether.
Explains the name, I guess.
Yeah: Category: Tapestry. I’d almost think they were playing a joke on us – maybe that’s the art!

Things that inspired this story:
Oakland glitch video art shows, patterns that emerge out of static, untuned televisions, radio plays.