Import AI 128: Better pose estimation through AI; Amazon Alexa gets smarter by tapping insights from the Alexa Prize, and differential privacy gets easier to implement in TensorFlow

by Jack Clark

How to test vision systems for reliability: sample from 140 public security cameras:
…More work needed before everyone can get cheap, out-of-the-box low-light object detection…
Are benchmarks reliable? That’s a question many researchers ask themselves, whether testing supervised learning or reinforcement learning algorithms. Now, researchers with Purdue University, Loyola University Chicago, Argonne National Laboratory, Intel, and Facebook have tried to create a reliable, real-world benchmark for computer vision applications. The researchers use a network of 140 publicly accessible camera feeds to gather 5 million images over a 24-hour period, then test a widely deployed ‘YOLO’ object detector against these images.
  Data: The researchers generate the data for this project by pulling information from CAM2, the Continuous Analysis of Many CAMeras project, which is built and maintained by Purdue University researchers.
  Can you trust YOLO at night? YOLO’s performance degrades at night, causing the system to fail to detect cars when they are illuminated only by streetlights (and, conversely, it sometimes mistakes streetlights for vehicles’ headlights, labeling the lights as cars).
  Is YOLO consistent? YOLO’s performance isn’t as consistent as people might hope – there are frequent cases where YOLO’s predictions for the total number of cars parked on a street vary over time.
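  A sketch of this kind of consistency check: the snippet below is purely illustrative (not the paper’s code) – it counts detected cars per frame from a single fixed camera and flags time windows where the count fluctuates; detect_objects is a hypothetical stand-in for a YOLO-style detector that returns (label, confidence, box) tuples.

```python
# Illustrative consistency check for a fixed-camera feed (not the paper's code).
from statistics import pstdev

def car_counts(frames, detect_objects, confidence_threshold=0.5):
    """Count detections labeled 'car' in each frame of a timestamped sequence."""
    counts = []
    for frame in frames:
        detections = detect_objects(frame)  # hypothetical YOLO-style detector
        counts.append(sum(1 for label, conf, _box in detections
                          if label == "car" and conf >= confidence_threshold))
    return counts

def flag_inconsistent_windows(counts, window=10, max_stdev=1.0):
    """Return start indices of windows whose car-count spread exceeds max_stdev."""
    return [start for start in range(len(counts) - window + 1)
            if pstdev(counts[start:start + window]) > max_stdev]
```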
  Big clusters: The researchers used two supercomputing clusters to perform image classification: one cluster used a mixture of Intel Skylake CPUs and Knights Landing Xeon Phi cores, and the other cluster used a combination of CPUs and NVIDIA dual-K80 GPUs. The researchers used this infrastructure to process data in parallel, but did not analyze the different execution times on the different hardware clusters.
  Labeling: The researchers estimate it would take approximately 600 days to label all 5 million images, so they instead labeled a subset of 13,440 images, then checked YOLO’s labels against this hand-labeled test set.
  Why it matters: As AI industrializes, being able to generate trustworthy data about the performance of systems will be crucial to giving people the confidence necessary to adopt the technology; tests like this both show how to create new, large-scale, robust datasets to test systems, and indicate that we need to develop more effective algorithms before such systems are robust enough for real-world deployment.
  Read more: Large-Scale Object Detection of Images from Network Cameras in Variable Ambient Lighting Conditions (Arxiv).
  Read more about the dataset (CAM2 site).

Amazon makes Alexa smarter and more conversational via the Alexa Prize:
…Report analyzing results of this year’s competition…
Amazon has shared details of how it improved the capabilities of its Alexa personal assistant through running the Alexa open research prize. The tl;dr is that inventions made by the 16 participating teams during the competition have improved Alexa in the following ways: “driven improved experiences by Alexa users to an average rating of 3.61, median duration of 2 mins 18 seconds, and average [conversation] turns to 14.6, increases of 14%, 92%, 54% respectively since the launch of the 2018 competition”, Amazon wrote.
  Significant speech recognition improvements: The competition has also meaningfully improved the speech recognition performance of Amazon’s system – significant, given how fundamental speech is to Alexa. “For conversational speech recognition, we have improved our relative Word Error Rate by 55% and our relative Entity Error Rate by 34% since the launch of the Alexa Prize,” Amazon wrote. “Significant improvement in ASR quality have been obtained by ingesting the Alexa Prize conversation transcriptions in the models” as well as through algorithmic advancements developed by the teams, they write.
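  How WER is computed: for reference, Word Error Rate – the metric quoted above – is typically computed as the word-level edit distance between a reference transcript and the recognizer’s hypothesis, divided by the reference length. The sketch below is illustrative only, and not Amazon’s evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("alexa play some jazz", "alexa play jazz"))  # 0.25
```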
  Increasing usage: As the competition was in its second year in 2018, Amazon now has comparative data with which to gauge general growth in Alexa usage. “Over the course of the 2018 competition, we have driven over 60,000 hours of conversations spanning millions of interactions, 50% higher than we saw in the 2017 competition,” they wrote.
  Why it matters: Competitions like this show how companies can use deployed products to tempt researchers into doing work for them, and highlight how platforms will likely trade access to AI agents (eg, Alexa) for researchers’ ideas. It also highlights the benefit of scale: it would be comparatively difficult for a startup whose personal assistant has a small install base to run a competition offering the same scale and diversity of interaction as the Alexa Prize.
  Read more: Advancing the State of the Art in Open Domain Dialog Systems through the Alexa Prize (Arxiv).

Chinese researchers create high-performance ‘pose estimation’ network:
…Omni-use technology highlights the challenges of AI policy; pose estimation can help us make better games and help people get fit, but can also surveil people…
Researchers with facial recognition startup Megvii Inc, Shanghai Jiao Tong University, Beihang University, and Beijing University of Posts and Telecommunications have improved the performance of surveillance AI technologies by implementing what they call a ‘multi-stage pose estimation network’ (MSPN). Pose estimation is a general purpose computer vision capability that lets people figure out the wireframe skeleton of a person from images and/or video footage – this sort of technology has been widely used for things like CGI and game playing (eg, game consoles might extract poses from people via cameras like the Kinect and use this to feed the AI component of an interactive fitness video game, etc). It also has significant applications for automated surveillance and/or image/video analysis, as it lets you label large groups of people from their poses – one can imagine the utility of being able to automatically flag if a crowd of protestors displays a statistically meaningful increase in violent behaviors, or being able to isolate the one person in a crowded train station who is behaving unusually.
  How it works: The MSPN has three tweaks that the researchers say explain its performance: changes to the main classification module to prevent information from being lost as images are downscaled during processing; improving pose localization by adopting a coarse-to-fine supervision strategy; and sharing more features across the network during training.
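  What a multi-stage, coarse-to-fine network looks like in code: the PyTorch sketch below is a minimal illustration of the general idea (not the authors’ MSPN implementation) – each stage predicts keypoint heatmaps, every stage is supervised, and earlier stages can be supervised with coarser (more blurred) target heatmaps than later ones. Layer widths and the number of stages here are made up.

```python
import torch
import torch.nn as nn

class Stage(nn.Module):
    """One refinement stage: features in, per-keypoint heatmaps out."""
    def __init__(self, in_ch: int, num_keypoints: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, num_keypoints, 1)

    def forward(self, x):
        feats = self.body(x)
        return feats, self.head(feats)

class MultiStagePoseNet(nn.Module):
    def __init__(self, num_stages: int = 3, num_keypoints: int = 17):
        super().__init__()
        self.stages = nn.ModuleList()
        in_ch = 3
        for _ in range(num_stages):
            self.stages.append(Stage(in_ch, num_keypoints))
            # Later stages see the image plus the previous stage's features
            # and heatmaps, so information is shared across stages.
            in_ch = 3 + 64 + num_keypoints

    def forward(self, image):
        heatmaps, x = [], image
        for stage in self.stages:
            feats, hm = stage(x)
            heatmaps.append(hm)
            x = torch.cat([image, feats, hm], dim=1)
        return heatmaps  # one prediction per stage, all supervised

def multi_stage_loss(stage_predictions, coarse_to_fine_targets):
    # Coarse-to-fine supervision: earlier stages get more heavily blurred
    # ground-truth heatmaps as targets, later stages get sharper ones.
    return sum(nn.functional.mse_loss(pred, target)
               for pred, target in zip(stage_predictions, coarse_to_fine_targets))
```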
  Results: “New state-of-the-art performance is achieved, with a large margin compared to all previous methods,” the researchers write. Some of the baselines they test against include: AE, G-RMI, CPN, Mask R-CNN, and CMU Pose. The MSPN obtains state-of-the-art scores on the COCO test set, with versions of the MSPN that use purely COCO test-dev data managing to score higher than some systems which augmented themselves with additional data.
  Why it matters: AI is, day in and day out, improving the capabilities of automated surveillance systems. It’s worth remembering that for a huge number of areas of AI research, progress in any one domain (for instance, an improved architecture for supervised classification, like residual networks) can have knock-on effects in other, more applied domains, like surveillance. This highlights both the omni-use nature of AI, as well as the difficulty of differentiating between benign and less benign applications of the technology.
  Read more: Rethinking on Multi-Stage Networks for Human Pose Estimation (Arxiv).

Making deep learning more secure: Google releases TensorFlow Privacy:
…New library lets people train models compliant with more stringent user data privacy standards…
Google has released TensorFlow Privacy, a free Python library which lets people train TensorFlow models with differential privacy. Differential privacy is a technique for training machine learning systems in a way that increases user privacy by letting developers set tradeoffs governing the amount of noise applied to the user data being processed. The theory works like this: given a large enough number of users, you can add some noise to each individual user’s data to anonymize it, but continue to extract a meaningful signal from the overall blob of patterns in the combined pool of fuzzed data – if you have enough of it. And Apple does (as do other large technology companies, like Amazon, Google, and Microsoft).
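  What this looks like mechanically: the core technique behind TensorFlow Privacy’s optimizers is differentially private SGD – clip each example’s gradient, then add calibrated Gaussian noise before averaging, so no single user’s data dominates an update. The NumPy sketch below illustrates that idea only; it is not the TensorFlow Privacy API, and the parameter values are arbitrary.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, l2_norm_clip=1.0,
                noise_multiplier=1.1, learning_rate=0.1, rng=np.random):
    """One illustrative DP-SGD update over a batch of per-example gradients."""
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale each example's gradient so its L2 norm is at most l2_norm_clip.
        clipped.append(g * min(1.0, l2_norm_clip / (norm + 1e-12)))
    summed = np.sum(clipped, axis=0)
    # Gaussian noise scaled to the clipping norm masks any single example's contribution.
    noise = rng.normal(0.0, noise_multiplier * l2_norm_clip, size=summed.shape)
    noisy_mean = (summed + noise) / len(per_example_grads)
    return params - learning_rate * noisy_mean

# Toy usage: 8 "users", 3 parameters.
params = np.zeros(3)
grads = [np.random.randn(3) for _ in range(8)]
params = dp_sgd_step(params, grads)
```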
  Apple + Differential Privacy: Apple was one of the first large consumer technology companies to publicly state it had begun to use differential privacy, announcing in 2016 that it was using the technology to train large-scale machine learning models over user data without compromising on privacy.
  Why it matters: As AI industrializes, adoption will be sped up by AI training methodologies that better preserve user privacy – this will also ease various policy challenges associated with the deployment of large-scale AI systems. Since TensorFlow is already very widely used, the addition of a dedicated library of well-tested differential privacy implementations will help more developers experiment with this technology, which will improve it and broaden its dissemination over time.
  Read more: TensorFlow Privacy (TensorFlow GitHub).
  Read more: Differential Privacy Overview (Apple, PDF).

Indian researchers make a DIY $1,000 Robot Dog named Stoch:
…See STOCH walk, trot, gallop, and bound!…
Researchers with the Center for Cyber Physical Systems, IISc, Bengaluru, India, have published a recipe that lets you build a $1,000 quadrupedal robot named Stoch that, if you squint, looks like a cheerful robot dog.
  Stoch the $1,000 robot dog: Typical robot quadrupeds like the MIT Cheetah or Boston Dynamics’ Spot Mini cost on the order of $30,000 to manufacture, the researchers write (part of this comes from more expensive and accurate sensing and actuator equipment). Stoch is significantly cheaper thanks to a hardware design based on widely available off-the-shelf materials combined with non-standard 3D-printed parts that can be made in-house, plus software for teleoperating the robot and a basic walking controller.
  Stoch – small stature, large (metaphorical) heart: “The Stoch is designed equivalent to the size of a miniature Pinscher dog”, they write. (I find this endears Stoch to me even more).
  Basic movements – no deep learning required: To get robots to do something like walk, you can either learn a model from data or code one yourself. The researchers mostly do the latter here, using nonlinear coupled differential equations to generate foot coordinates, which are then converted into joint angles via inverse kinematics. The researchers implement a few different movement policies on Stoch, and have published a video showing the really quite-absurdly cute robot dog walking, trotting, galloping and – yes! – bounding. It’s delightful. The core of the robot is a Raspberry Pi 3b board which communicates via PWM drivers with the robot’s four leg modules.
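  What the non-DL recipe looks like: the sketch below is purely illustrative (not the paper’s controller or its actual equations) – it generates a cyclic foot trajectory for one leg, then converts each foot position into hip and knee angles with two-link inverse kinematics. The link lengths and gait parameters are hypothetical.

```python
import math

L1, L2 = 0.12, 0.12   # hypothetical thigh and shank lengths, in meters

def foot_trajectory(phase, step_length=0.08, step_height=0.04, stance_y=-0.20):
    """Simple cyclic foot path parameterized by gait phase in [0, 1)."""
    x = (step_length / 2) * math.cos(2 * math.pi * phase)
    y = stance_y + max(0.0, step_height * math.sin(2 * math.pi * phase))
    return x, y

def two_link_ik(x, y):
    """Hip and knee angles (radians) that place the foot at (x, y)."""
    d2 = x * x + y * y
    cos_knee = (d2 - L1 * L1 - L2 * L2) / (2 * L1 * L2)
    cos_knee = max(-1.0, min(1.0, cos_knee))  # clamp for numerical safety
    knee = math.acos(cos_knee)
    hip = math.atan2(y, x) - math.atan2(L2 * math.sin(knee),
                                        L1 + L2 * math.cos(knee))
    return hip, knee

# Stream joint targets for one leg over a gait cycle (e.g. to a PWM driver).
for step in range(8):
    phase = step / 8
    hip, knee = two_link_ik(*foot_trajectory(phase))
    print(f"phase={phase:.2f}  hip={math.degrees(hip):7.1f} deg  "
          f"knee={math.degrees(knee):6.1f} deg")
```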
  Why it matters – a reminder: Lots of robot companies choose to hand-code movements, usually by performing some basic, well-understood computation over sensor feedback to let robots hop, walk, and run. AI systems may let us learn far more complex movements, like OpenAI’s work on manipulating a cube with a Shadowhand, but these approaches are currently data- and compute-intensive and may require more work on generalization to be as applicable as hand-coded techniques. Papers like this show how, for some basic tasks, it’s possible to implement well-documented non-DL systems and get basic performance.
  Why it matters – everything gets cheaper: One central challenge for technology policy is that technology seems to get cheaper over time – for example, back in 2000 the Japanese government briefly considered imposing export controls on the PS2 console over worries about the then-advanced chips inside it being put to malicious uses (whereas today’s chips are significantly more powerful and sit in everyone’s smartphones). This paper is an example of how innovations in 3D printing and second-order effects from other economies of scale (eg, some parts of this robot are made of carbon fibre) can bring surprisingly futuristic-seeming robot platforms within economic reach of larger numbers of people.
  Watch STOCH walk, trot, gallop, and bound! (Video: Results_STOCH (YouTube)).
  Read more: Design, Development and Experimental Realization of a Quadrupedal Research Platform: Stoch (Arxiv).
  Read more: Military fears over PlayStation2, BBC News, Monday 17 April 2000 (BBC News).

Helping blind people shop with ‘Grocery Store Dataset’:
…Spare a thought for the people who gathered ~5,000 images from 18 different stores…
Researchers with KTH Royal Institute of Technology and Microsoft Research have created and released a dataset of common grocery store items to help AI researchers train better computer vision systems. The dataset labels have a hierarchical structure, tagging a multitude of objects with both coarse and fine-grained labels.
  Dataset ingredients: The researchers collected data using a 16-megapixel Android smartphone camera, photographing 5,125 images of various items in the fruit-and-vegetable and refrigerated dairy/juice sections of 18 different grocery stores. The dataset contains 81 fine-grained products (which the researchers call classes), each accompanied by the following information: “an iconic image of the item and also a product description including origin country, an appreciated weight and nutrient values of the item from a grocery store website”.
  Dataset baselines: The researchers run some baselines over the dataset: they use the CNN architectures AlexNet, VGG16, and DenseNet-169 for feature extraction, then pair these feature vectors with VAEs to develop a feature representation of the entities in the dataset, which leads to improved classification accuracy.
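  What the baseline recipe looks like: the PyTorch sketch below is illustrative rather than the paper’s code – it uses an ImageNet-pretrained CNN as a frozen feature extractor and trains a lightweight classifier on top for the 81 classes (the paper additionally feeds such feature vectors into a VAE to learn a richer representation).

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 81

# `pretrained=True` works on older torchvision releases; newer ones use `weights=`.
backbone = models.densenet169(pretrained=True)
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False                       # freeze the feature extractor

classifier = nn.Linear(backbone.classifier.in_features, NUM_CLASSES)

def extract_features(images):
    """images: (N, 3, 224, 224) tensor, normalized like ImageNet."""
    feats = backbone.features(images)                     # conv feature maps
    feats = nn.functional.relu(feats)
    feats = nn.functional.adaptive_avg_pool2d(feats, 1)   # global average pool
    return torch.flatten(feats, 1)                        # (N, 1664) vectors

# One training step on a random stand-in batch.
images = torch.randn(4, 3, 224, 224)
labels = torch.randint(0, NUM_CLASSES, (4,))
logits = classifier(extract_features(images))
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()
```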
  Why it matters: The researchers think systems like this can be used “to train and benchmark assistive systems for visually impaired people when they shop in a grocery store. Such a system would complement existing visual assistive technology, which is confined to grocery items with barcodes.” It also seems to follow that the same technology could be adapted for use in building stores with fully-automated checkout systems in the style of Amazon Go.
  Get the data: Grocery Store Dataset (GitHub).
  Read more: A Hierarchical Grocery Store Image Dataset with Visual and Semantic Labels (Arxiv).

OpenAI / Import AI Bits & Pieces:

Neo-feudalism, geopolitics, communication, and AI:
…Jack Clark and Azeem Azhar assess what progress in AI means for politics…
I spent this Christmas season in the UK and had the good fortune of being able to sit and talk with Azeem Azhar, AI raconteur and author of the stimulating Exponential View newsletter. We spoke for a little over an hour for the Exponential View podcast, talking about what the political aspects of AI are, and what they mean. If you’re at all curious as to how I view the policy challenge of AI, then this may be a good place to start, as I lay out a number of my concerns, biases, and plans. The tl;dr is that I think AI practitioners should acknowledge the implicitly political nature of the technology they are developing and act accordingly, which requires more intentional communication to the general public and policymakers, as well as a greater investment in understanding what governments are thinking about with regards to AI and how actions by other actors, eg companies, could influence these plans.
  Listen to the podcast here (Exponential View podcast).
 Check out the Exponential View here (Exponential View archive).

Tech Tales:

The Life of the Party

On certain days, the property comes alive. The gates open. Automated emails are sent to residents of the town:
Come, join us for the Easter Egg hunt! Come, celebrate the festive season with drone-delivered, robot-made eggnog; Come, ice-skate on the flat roof of the estate; Come, as our robots make the largest bonfire this village has seen since the 17th century.

Because he was rich, The Host died more slowly than normal people, and the slow pace of his decline combined with his desire to focus on the events he hosted and not himself meant that to many children – and even some of their parents – he and his estate had forever been a part of the town. The house had always been there, with its gates, and its occasional emails. If you grew up in the town and you saw fireworks coming from the north side of town then you knew two things: there was a party, and you were both late and invited.

Keen to show he still possessed humor, The Host once held a Halloween event with himself in costume: Come, make your way through the robot house, and journey to see The (Friendly) Monster(!) at its heart. (Though some children were disturbed by their visit with The Host and his associated life-support machines, many told their parents that they thought it was “so scary it was cool”; The Host signalled he did not wish to be in any selfies with the children, so there’s no visual record of this, but one kid did make a meme to commemorate it: they superimposed a vintage photo of The Host’s face onto an ancient still of the monster from Frankenstein – unbeknownst to the kid who made it, The Host subsequently kept a laminated printout of this photo on his desk.)

We loved these parties and for many people they were highlights of the year – strange, semi-random occasions that brought every person in the town together, sometimes with props, and always with food and cheer.

Of course, there was a trade occurring. After The Host died and a protracted series of legal battles with his estate eventually led to the release of certain data relating to the events, we learned the nature of this trade: in exchange for all the champagne, the robots that learned to juggle, the live webcam feeds from safari parks beamed in and projected on walls, the drinks that were themselves tailored to each individual guest, the rope swings that hung from ancient trees that had always had rope swings leading to the rope having bitten into the bark and the children to call them “the best swings in the entire world”; in exchange for all of this, The Host had taken something from us: our selves. The cameras that watched us during the events recorded our movements, our laughs, our sighs, our gossip – all of it.

Are we angry? Some, but not many. Confused? I think none of us are confused. Grateful? Yes, I think we’re all grateful for it. It’s hard to begrudge what The Host did – fed our data, our body movements, our speech, into his own robots, so that after the parties had ended and the glasses were cleaned and the corridors vacuumed, he could ask his robots to hold a second, private party. Here, we understand, The Host would mingle with guests, going on his motorized chair through the crowds of robots and listening intently to conversations, or pausing to watch two robots mimic two humans falling in love.

It is said that, on the night The Host died, a band of teenagers near the corner of the estate piloted a drone up to altitude and tried to look down at the house; their footage shows a camera drone hovering in front of one of the ancient rope swings, filming one robot pushing another smaller robot on the swing. “Yeahhhhhhh!” the synthesized human voice says, coming from the smaller robot’s mouth. “This is the best swing ever!”

Things that inspired this story: Malleability; resilience; adaptability; Stephen Hawking; physically-but-not-mentally-disabling health issues; the notion of a deeply felt platonic love for the world and all that is within it; technology as a filter, an interface, a telegram that guarantees its own delivery.