Import AI 154: Teaching computers how to plan; DeepNude is where dual-use meets pornography; and what happens when we test machine translation systems on real-world data

by Jack Clark

Can computers learn to plan? Stanford researchers think so:
…Turns out being able to plan is similar to figuring out where you are and where you’ve been…
Researchers with Stanford University have developed a system that can watch instructional videos on YouTube and then, given the start and end of a new video, figure out the appropriate order of actions to take to transition from beginning to end.

What’s so hard about this? The real world involves such a vast combinatorial set of possibilities that traditional planning approaches (mostly) aren’t able to scale to work within it. “One can imagine an indefinitely growing semantic state space, which prevents the application of classical symbolic planning approaches that require a given set of predicates for a well-defined state space”. To get around this, they instead try to learn everything in a latent space, essentially slurping in reality and turning it into features, which they then use to map actions and observations into sequences, helping them figure out a plan.

Two models to learn the latent space:
   The system that derives the latent space and the transformations within it has two main components:

  • A transition model, which predicts the next state based on the current state and action.
  • A conjugate constraint model, which maps current actions to past actions.

   The full model takes in a video and essentially learns the transitions between states by sliding these two models along through time toward the desired goal state, sampling actions and then predicting the next state. 
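To make this concrete, here is a minimal PyTorch-style sketch of the two-model setup operating in a shared latent space; the layer sizes, module names, and wiring are illustrative assumptions rather than the authors' actual architecture:

```python
import torch
import torch.nn as nn

class TransitionModel(nn.Module):
    """Predicts the next latent state from the current latent state and an action.
    (Illustrative sketch; not the paper's architecture.)"""
    def __init__(self, state_dim=128, action_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, state_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

class ConjugateModel(nn.Module):
    """Maps the current action (plus state context) back toward the previous action,
    acting as a consistency constraint on the learned latent dynamics."""
    def __init__(self, state_dim=128, action_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def rollout(transition, start_state, actions):
    """Slide the transition model through time, predicting each intermediate state."""
    states = [start_state]
    for a in actions:
        states.append(transition(states[-1], a))
    return torch.stack(states, dim=1)  # shape: (batch, timesteps + 1, state_dim)
```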

Two approaches to planning: The researchers experiment with two planning approaches, both of which rely on the features mined by the main system. One approach maps current and goal observations into a latent space, maps actions to prior actions, and then samples different actions to solve its task. The other approach is called ‘walkthrough planning’ and outputs the visual observations between the current and goal state; this is a less direct approach, as it doesn’t output actions, but could serve as a useful reward signal for another system. 
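Here is an equally rough sketch of the first, action-outputting planning mode: encode current and goal observations into the latent space, sample candidate action sequences, roll each through the learned transition model, and keep whichever sequence ends nearest the goal. The random-sampling scheme and Euclidean distance below are assumptions for illustration, not the paper's procedure:

```python
import torch

def plan_by_sampling(transition, start_state, goal_state, action_dim=32,
                     horizon=3, num_samples=256):
    """Sample random action sequences, roll them forward with a learned
    transition model (e.g. the TransitionModel sketched above), and return
    the sequence whose predicted final state lands closest to the goal."""
    best_actions, best_dist = None, float("inf")
    for _ in range(num_samples):
        actions = [torch.randn(start_state.shape[0], action_dim)
                   for _ in range(horizon)]
        state = start_state
        for a in actions:
            state = transition(state, a)
        dist = torch.norm(state - goal_state, dim=-1).mean().item()
        if dist < best_dist:
            best_actions, best_dist = actions, dist
    return best_actions
```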

Dataset: For this work, they use the CrossTask instructional video dataset, which is a compilation of videos showing 83 different tasks, involving things like grilling steak, making pancakes, changing a tire, and so on.

Testing: Spoiler alert – this kind of task is extremely hard, so get ready for some stay-in-your-chair results. In tests, the researchers find their system, using the traditional planning approach, can obtain accuracies of around 31.29%, with an overall success rate of 12.18%. This compares to a prior state-of-the-art of 24.39% accuracy and a 2.89% success rate for ‘Universal Planning Networks’ (Import AI #90). (Note: UPN is the closest thing to compare to, but has some subtle differences making a direct comparison difficult). They show that the same system, when using walkthrough planning, can significantly improve scores over prior state-of-the-art systems as well – “our full model is able to plan the correct order for all video clips”, they write, compared to baselines which typically fail. 

Why this matters: We’re starting to see AI systems that use the big, learnable engines of deep learning research as part of more deliberately structured systems to tackle specific tasks, like learning transitions and plans for video walkthroughs. Planning is an essential part of AI, and being able to learn plans and disentangle plans from actions (and learn appropriate associations) is an inherently complex task; progress here can give us a better sense of progress in the field of AI.
   Read more: Procedure Planning in Instructional Videos (Arxiv).

####################################################

DeepNude: Dual Use concerns meet Pornography; trouble ensues:
…Rock, meet hard place…
What would a person look like without their clothes? That’s something people can imagine fairly easily, but it has been difficult for AI systems. That is, until we developed a whole bunch of recent systems capable of modeling data distributions and generating synthetic versions of said data; these techniques contributed to the rise of things like ‘deepfakes’, which let people superimpose the face of one person onto that of another in a video. Recently, someone took this a step further with a software tool called DeepNude, which automatically removes the clothes of people (predominantly women) in photographs, rendering synthetic images of them in the nude. 

Blowback, phase one: The initial DeepNude blowback centered on the dubious motivation for the project and the immense likelihood of the software being used to troll, harass, and abuse women. Coverage in Vice led to such outcry from the community that the creator of DeepNude took the application down – but not before others had implemented the same capabilities in other software and distributed it around the web. 

Rapid proliferation makes norms difficult: Just a couple of days after taking the app down, the creator posted the code of the application to GitHub, saying that because the DeepNude application had already been replicated widely, there was no purpose in keeping the original code private. 

Why this matters: DeepNude is an illustration of the larger issues inherent to increasingly powerful AI systems: these systems can be used in a variety of different applications and are also, perhaps unintuitively, relatively easy to program and put together once you have some pre-trained networks lying around (and the norms of publication mean this is always the case). How we figure out new norms around the development and publication of such technology will have a significant influence on what happens in society, and if we’re not careful we could enable more things like DeepNude.
   Read the statement justifying code release: Official DeepNude Algorithm (DeepNude GitHub).
   Read more: This Horrifying App Undresses a Photo of any Woman With a Single Click (Vice). (A special ImportAI shoutout to Samantha Cole, the journalist behind this story; Samantha was the first journalist to cover deepfakes back in 2017 and has been on this beat doing detailed work for a while. Worth a follow!)

####################################################

Have no pity for robots? Watch these self-driving cars try to tackle San Francisco:
A short video from Cruise, a self-driving car service owned by General Motors, shows how its cars can now deal with double-parked cars in San Francisco, California.
    Check out the video here (official Cruise Twitter).

####################################################

Think AI services are consistent across cloud providers? Think again:
…Study identifies significant differences in AI inferences made by Google, Amazon, and Microsoft…
Different AI cloud providers have different capabilities, and these under-documented differences could cause problems for software developers, according to research from computer science researchers with Deakin University and Monash University in Australia. In a study, they explore the differences between image labeling AI services from Amazon (“AWS Rekognition”), Google (“Google Cloud Vision”) and Microsoft (“Azure Computer Vision”). The researchers try to work out if “computer vision services, as they currently stand, offer consistent behavior, and if not, how is this conveyed to developers (if it is at all)?”

Developers may not realize that services can vary from cloud provider to cloud provider, the researchers write. If you look at the underlying storage and compute systems across major cloud providers like Microsoft, Amazon, or Google, you find that they’re very comparable, whereas differences in the quality of AI services are much harder to work out from product descriptions. (For instance, one basic example is the labels services output when classifying objects: one service may describe a dog as both a ‘collie’ and a ‘border collie’, while another may use just one (or none) of these labels.) 

Datasets and study length: The authors used three datasets to evaluate the services: two self-developed ones (a small one containing 30 images and a large one containing 1,650 images) and a public dataset called COCOVal17, which contains 5,000 images. The study took place over 11 months and had two main experimental phases: a 13-week period from April to August 2018 and a 17-week period from November 2018 to March 2019. 

Methodology: They test the cloud services for six traits: the consistency of the top label assigned to an image from each service; the ‘semantic consistency’ of multiple labels returned by the same service; the confidence level of each service’s top label prediction; the consistency of these confidence intervals across multiple services; the consistency of the top label over time (aka, does it change); and the consistency of the top label’s confidence over time. 
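To give a rough sense of how a few of these checks could be scripted against any labeling endpoint, here is a minimal sketch; note that `classify` is a placeholder for whichever provider's client library you plug in (not a real API), and its return format is an assumption:

```python
import time
from collections import Counter

def classify(image_path, service):
    """Placeholder: call one provider's image-labeling API and return a list of
    (label, confidence) pairs sorted by confidence. Not a real client library."""
    raise NotImplementedError

def top_label_consistency(image_path, services):
    """Trait 1: do different services agree on the top label for the same image?"""
    tops = {s: classify(image_path, s)[0][0] for s in services}
    return len(set(tops.values())) == 1, tops

def top_label_drift(image_path, service, checks=2, wait_seconds=7 * 24 * 3600):
    """Traits 5-6: does the same service change its top label (or its confidence)
    for the same image when queried again weeks apart?"""
    history = []
    for _ in range(checks):
        label, conf = classify(image_path, service)[0]
        history.append((label, conf))
        time.sleep(wait_seconds)   # in the study, re-queries spanned months
    labels = Counter(label for label, _ in history)
    return len(labels) == 1, history
```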

Three main discoveries: The paper generates evidence for three concerning traits in cloud AI services:

  • Computer vision services do not respond with consistent outputs between services, given the same input image. 

  • Outputs from computer vision services are non-deterministic and evolving, and the same service can change its top-most response over time given the same input image. 
  • Computer vision services do not effectively communicate this evolution and instability, introducing risk into engineering these systems. 

Why this matters: Commercial AI systems can be non-repeatable and non-reliable, and this study shows that multiple AI systems developed by different providers can be even more inconsistent with one another over time. This is going to be a challenging issue, as it makes it easier for developers to get ‘locked in’ to the specific capabilities of a single service, and also makes application portability difficult. Additionally, these issues will make it harder for people to build AI services that are composed out of multiple distinct AI services from different clouds, as these systems will not have predictable performance capabilities.
   Read more: Losing Confidence in Quality: Unspoken Evolution of Computer Vision Services (Arxiv).

####################################################

Stealing people’s skeletons with deep learning:
…XNect lets researchers do real-time multi-person pose estimation via a single RGB camera…
How do you use a single camera to track multiple people and their poses as they move around? That’s a question being worked on by researchers with the Max Planck Institute for Informatics, EPFL, and the University of Saarland. They try to solve this problem via a neural network architecture that encodes and decodes poses of people, and which is implemented efficiently enough to run in real-time from a single camera feed. The system uses two networks: one which focuses on learning to reason about individual body joints, and another which tries to jointly reason about all body joints. 

Special components for better performance: Like some bits of AI research, this work takes a bunch of known-good stuff, and then pushes it forward on a task-specific dimension. Here, they develop a convolutional neural network architecture called SelecSLS Net, which “employs selective long and short range concatenation-skip connections to promote information flow across network layers which allows to use fewer features leading to a much faster inference time but comparable accuracy in comparison to ResNet-50”. 
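For intuition, here is a minimal PyTorch sketch of the general pattern of concatenation-skip connections mixing short- and long-range features; the channel counts and wiring are illustrative and do not reproduce the actual SelecSLS Net definition:

```python
import torch
import torch.nn as nn

class ConcatSkipBlock(nn.Module):
    """A conv block that concatenates its own output with its input (short-range
    skip) and a feature map from much earlier in the network (long-range skip)."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        # 1x1 conv fuses the concatenated features back to a small width, which is
        # what keeps inference cheap relative to a plain ResNet-style block.
        self.fuse = nn.Conv2d(out_ch + in_ch + skip_ch, out_ch, kernel_size=1)

    def forward(self, x, long_range_skip):
        out = self.conv(x)
        return self.fuse(torch.cat([out, x, long_range_skip], dim=1))

# Toy usage: fuse current features with an early, lower-width feature map.
block = ConcatSkipBlock(in_ch=64, skip_ch=32, out_ch=64)
x = torch.randn(1, 64, 56, 56)
early = torch.randn(1, 32, 56, 56)
y = block(x, early)   # -> (1, 64, 56, 56)
```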

Real-time performance: Most of the work here has involved increasing the efficiency of the system so it can process footage from video cameras in real-time (when running on an NVIDIA GTX 1080Ti and a Xeon E5). In terms of performance, the system marginally outperforms a more standard system that uses a typical residual network, while being far more efficient when it comes to runtime. 

Why this matters: It’s becoming trivial for computers to look at people, model each of them as a wireframe skeleton, and then compute over that. This is a classic omni-use capability; we could imagine such a system being used to automatically port people into simulated virtual worlds, or to plug them into a large-scale surveillance system to analyze their body movements and characterize the behavior of the crowd. How society deals with the challenges of such a multi-purpose technology remains to be seen.
   Read more: XNect: Real-time Multi-person 3D Human Pose Estimation with a Single RGB Camera (Arxiv).

####################################################

Think network design is hard? Try it where every network point is a drone:
…Researchers show how to build dynamic networks out of patrolling drones…
Researchers with Alpen-Adria-Universitat Klagenfurt, Austria, have developed “a novel collaborative data delivery approach where UAVs transport data in a store-and-forward fashion”. What this means is they develop a system that automatically plans the flight paths of fleets of drones so that the drones at the front of the formation periodically overlap in communication range with UAVs behind them, which then overlap in communication range with other, even more distant UAVs. The essential idea behind the research is to use fast drone-to-drone communications systems to hoover up data via exploration drones at the limits of a formation, then squirt this data back to a base station via the drones themselves. The next step for the research is to use “more sophisticated scheduling of UAVs to minimize the number of idle UAVs (that do neither sensing nor transporting data) at each time step”. 
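As a toy illustration of the store-and-forward idea, here is a tiny simulation in which buffered data hops from an edge drone through a relay drone to the base station whenever neighbours come within radio range; the 1-D geometry, range, and scheduling rule are invented for clarity and are not the paper's scheduler:

```python
from dataclasses import dataclass, field

COMM_RANGE = 100.0  # metres within which two nodes can exchange data

@dataclass
class Node:
    name: str
    position: float            # 1-D position along the relay chain, for simplicity
    buffer: list = field(default_factory=list)

def in_range(a, b):
    return abs(a.position - b.position) <= COMM_RANGE

def step(drones, base):
    """One store-and-forward round: every UAV hands its buffered data to the next
    node that is closer to the base station and currently within radio range."""
    chain = [base] + sorted(drones, key=lambda n: n.position)  # base at position 0
    for receiver, sender in zip(chain, chain[1:]):
        if sender.buffer and in_range(sender, receiver):
            receiver.buffer.extend(sender.buffer)
            sender.buffer.clear()

# A sensing drone at the edge collects data, which hops inward as the formation
# periodically brings neighbouring drones into communication range.
base = Node("base", 0.0)
drones = [Node("relay", 90.0), Node("sensor", 170.0, buffer=["observation-1"])]
for _ in range(3):
    step(drones, base)
print(base.buffer)  # ['observation-1'] after the data has hopped through the relay
```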

Why this matters: Drones are going to let people form ad-hoc computation and storage systems, and approaches like this suggest the shape of numerous ‘flying internets’ that we could imagine in the future.
   Read more: Persistent Multi-UAV Surveillance with Data Latency Constraints (Arxiv).

####################################################

Pushing machine translation systems to the limit with real, messy data:
…Machine translation robustness competition shows what it takes to work in the real world…
Researchers from Facebook AI Research, Carnegie Mellon University, Harvard University, MIT, the Qatar Computing Research Institute, Google, and Johns Hopkins University, have published the results of the “first shared task on machine translation robustness”. The goal of this task is to give people better intuitions about how well machine translation models deal with “orthographic variations, grammatical errors, and other linguistic phenomena common in user-generated content”. 

Competitions, what are they good for? The researchers hope that systems which do well at this task will use better modelling, training and adaptation techniques, or may learn from large amounts of unlabeled data. And indeed, entered systems did use a variety of additional techniques to increase their performance, such as data cleaning, data augmentation, fine-tuning, ensembles of models, and more. 

Datasets: The datasets were “collected from Reddit, filtered out for noisy comments using a sub-word language modeling criterion and translated by professional translators”.

Results: As this competition explores robustness in the context of a competition, it’s perhaps less meaningful to focus on the quantitative results, and more useful to discuss the trends seen among the entries. Some of the main things observed by the competition organizers: stronger submissions were typically stronger across the board; out-of-domain generalization is important (so having systems that can deal with words they haven’t seen before helps); being able to accurately model upper- and lower-case text, as well as the use of special characters, is useful; and it can be difficult to learn to translate sentences written in slang. 

Why this matters: Competitions like this give us a better sense of the real-world progress of AI systems, helping us understand what it takes to build systems that work over real data, as opposed to highly-constrained or specifically structured test sets.
   Read more: Findings of the First Shared Task on Machine Translation Robustness (Arxiv).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

Axon Ethics Board says no to face recognition on police body cameras:
Axon, who make technologies for law enforcement, established an AI Ethics Board back in 2018 to look at the ethical implications of their products. The board has just released their first report, looking at ethical issues surrounding face recognition, particularly on police body cameras—Axon’s core product.

The board: Axon was an early mover in establishing an AI ethics board. The board’s members are drawn from law enforcement, civil rights groups, policy, academia, and tech. Among the lessons learned, the Board emphasizes the importance of involving the board at an early stage in product development (ideally before the design stage), so that it can suggest changes before they become too costly for the company to make.

Six major conclusions:
  (1) Face recognition technology is currently not reliable enough to justify use on body cameras. Far greater accuracy, and equal performance across different populations, are needed before deployment.
  (2) In assessing face recognition algorithms, it is important to separate false positive and false negative rates. There are real trade-offs between the two, which depend on the use case: in identifying a missing person, more false positives might be a cost worth bearing to minimize false negatives, whereas in enforcement scenarios it might be more important to minimize false positives, due to the potential harms from police interacting with innocent people on mistaken information (see the short numerical sketch after this list).
  (3) To prevent misuse, the Board does not endorse the development of face recognition technology that can be completely customized by users. This requires technological controls by product manufacturers, but will increasingly also require government regulation.
  (4) No jurisdiction should adopt the technology without going through transparent, democratic processes. At present, big decisions affecting the public are being made by law enforcement alone, e.g. whether to include driver’s license photos in face databases.
  (5) Development of products should be premised on evidence-based (and not merely theoretical) benefits.
  (6) When assessing costs and benefits of potential use cases, one must take into account the realities of policing in particular jurisdictions, and technological limitations.
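To put some toy numbers on point (2), here is a small sketch of how a permissive versus a strict match threshold trades false negatives against false positives; the counts and scenario are invented purely for illustration:

```python
def rates(tp, fp, fn, tn):
    """False positive rate and false negative rate from a confusion matrix."""
    return fp / (fp + tn), fn / (fn + tp)

# Hypothetical missing-person search over 10,000 gallery faces with one true match.
# A permissive threshold surfaces the match but drags in many innocent people;
# a strict threshold protects bystanders but risks missing the person entirely.
permissive = rates(tp=1, fp=50, fn=0, tn=9949)   # FPR ~0.5%, FNR 0%
strict     = rates(tp=0, fp=2,  fn=1, tn=9997)   # FPR ~0.02%, FNR 100%
print(permissive, strict)
```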
  Read more: First Report of the Axon AI & Policing Ethics Board (Axon).
  Read more: Press release (Axon).

####################################################

NIST releases plan on AI standards:
The White House’s executive order on AI, released in February, included an instruction for NIST to make “a plan for Federal engagement in the development of technical standards and related tools in support of reliable, robust, and trustworthy systems that use AI technologies.” NIST have released a draft plan, and are accepting public input until July 19, before delivering a final document in August.

Recommendations: NIST recommends that the government “bolster AI standards-related knowledge, leadership, and coordination among federal agencies; promote focused research on the ‘trustworthiness’ of AI; support and expand public-private partnerships; and engage with international parties.”

Why it matters: The US is keen to lead international efforts in standards-setting. Historically, international standards have governed policy externalities in cybersecurity, sustainability, and safety. Given the challenges of trust and coordinating safe practices in AI development and deployment, standards setting could play an important role.
  Read more: U.S. Leadership in AI: a Plan for Federal Engagement in Developing Technical Standards and Related Tools (NIST).

####################################################

Tech tales

Dreamworld versus Reality versus Government

After the traceable content accords were enacted people changed how they approached themselves – nude photos aren’t so fun if you know your camera is cryptographically signing them and tying them to you then uploading that information to some vast database hosted by a company or a state. 

The same thing happened for a lot of memes and meme-fodder: it’s not obviously a good idea to record yourself downing ten beers on an amusement park ride if you’re subsequently going to pursue a career in politics, nor does it seem like a smart thing to participate in overtly political pranks if you think you might pursue a career in law enforcement. 

The internet got… quiet? It was still full of noise and commotion and discussion, but the edge had been taken off a little. Of course, when we lost the edge we lost a lot of pain: it’s harder to produce terrorist content if it is traced back to your phone or camera or whatever, and it’s harder for other people to fake as much of it when it stops being, as they say, a ‘desirable media target’.

It didn’t take long for people to figure out a workaround: artificial intelligence. Specifically, using large generative models to create images and, later, audio, and even later after that, videos, which could synthesize the things they wanted to create or record, but couldn’t send or do anymore. Teens started sending each other impressionistic, smeared videos of teen-like creatures doing teen-like pranks. Someone invented some software called U.S.A which stood for Universal Sex Avatar and teens started sending each other ‘AIelfies’ (pronounced elfeez) which showed nude-like human-like things doing sexual-like stuff. Even the terrorists got involved and started pumping out propaganda that was procedural and generative. 

Now the internet has two layers: the reality-layer and what people have taken to calling the dreamworld. In the reality-layer things are ever-more controlled and people conduct themselves knowing that what they do will be knowable and identifiable most-likely forever; everyone’s a politician, essentially. In the dreamworld, people experiment with themselves, and everyone has a few illicit channels on their messaging apps through which they let people send them dreamworld content, and through which they can anonymously and non-anonymously send their own visions into the world. 

The intelligence agencies are trying to learn about the dreamworld, people say. Knowing the difference between what known individuals publicly present and what the ghostly mass of civilization illicitly sends to itself is a valuable thing, say certain sour-faced people who are responsible for terrible tools that ward off against more terrible things. “The difference between presented self and imagined self is where identity resides,” says one of them in a no-phone presentation to other sour-faced people. “If we can learn how society chooses to separate the two, perhaps we can identify the character of our society. If we can do that, we can change the character.”

And so the terrible slow engines are working now, chewing through our dreamworld, invisible to us, but us increasingly aware of them. Where shall we go next, we wonder? What manifestation shall our individuality take next?

Things that inspired this story: Generative adversarial networks; DeepNude; DeepFakes; underground communities; private messaging infrastructures; the conversion of all of physical reality into digital simulacra.