Import AI 265: Deceiving AI hackers; how Alibaba makes money with ML; why governments should measure AI progress

In the future, spies will figure out what you’re doing by staring at a blank wall:
…This sounds crazy, but this research appears quite sane. Oh my…
Here’s a wild bit of research from MIT, NVIDIA, and the Technion – Israel Institute of Technology: “We use a video of the blank wall and show that a learning-based approach is able to recover information about the hidden scene”. Specifically, they’re able to point a camera at a blank wall, perform some analysis of the shifting patterns of ambient light on it, and then use this to figure out whether there are 0, 1, or 2 people in a scene, and to classify the activities of those people – whether they’re walking, crouching, waving their hands, or jumping.

Accuracy: “Trained on 20 different scenes achieve an accuracy of 94.4% in classifying the number of people and 93.7% in activity recognition on the held out test set of 5 unseen scenes”, they write. Not good enough to rely on in a critical situation, but much better than you’d think. (As an experiment, sit in a completely white room without major shadows wearing noise-canceling headphones and try to figure out if there’s someone behind you by staring at the blank wall opposite you – good luck getting above 50%!).

Why this matters: I’m fascinated by how smart surveillance is going to become. At the same time, I’m interested in how we can use various contemporary AI and signal processing techniques to be able to eke more information out of the various fuzzy signals inherent to reality. Here, these researchers show that as cameras and processing algorithms get better, we’re going to see surveillance systems develop that can extract a lot of data from stuff barely perceptible to humans.
  Read more: What You Can Learn by Staring at a Blank Wall (arXiv).

####################################################

AI is a big deal – so governments should monitor its development:
…New research from myself and Jess Whittlestone lays out the case for better AI monitoring…
We write about AI measurement a lot here, because measuring AI systems is one of the best ways to understand their strengths and weaknesses. In the coming years, information about AI – and specifically, how to measure it for certain traits – will also be a crucial ingredient in the crafting of AI policy. Therefore, we should have governments develop public sector AI measurement and monitoring systems so that we can track the research and development of increasingly powerful AI technology. Such an initiative can help us with problems today and can better orient the world with regard to more general forms of AI, giving us infrastructure to help us measure increasingly advanced systems. That’s the gist of a research paper I and my collaborator Jess Whittlestone worked on this year – please take a read and, if you’re in a government, reach out, as I want to help make this happen.
  Read more: Why and How Governments Should Monitor AI Development (arXiv).
    Some analysis of our proposal by NESTA’s Juan Mateos-Garcia (Twitter).
  Listen to Jess and me discussing the idea with Matt Clifford on his ‘Thoughts in Between’ podcast.

####################################################

Alibaba uses a smarter neural net to lower its costs and increase its number of users:
…Here’s why everyone is deploying as much machine learning as they can…
Ant Financial, a subsidiary of Chinese tech giant Alibaba, has written a fun paper about how it uses contemporary machine learning to improve the performance of a commercially valuable deployed system. “This paper proposes a practical two-stage framework that can optimize the [Return on Investment] of various massive-scale promotion campaigns”, the authors write. Specifically, they use ML to optimize an e-coupon gifting campaign. “Alipay granted coupons to customers to incentivize them to make mobile payments with the Alipay mobile app. Given its marketing campaign budget, the company needed to determine the value of the coupon given to each user to maximize overall user adoption”, they write.

What ML can do: For the ML component, they built a ‘Deep Isotonic Promotion Network’ (DIPN), which is basically a custom-designed AI system for predicting how individual users respond to promotions of different values. “In the first stage, we model users’ personal promotion-response curves with machine learning algorithms. In the second stage, we formulate the problem as a linear programming (LP) problem and solve it by established LP algorithms”, they write.
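The second stage of this kind of system can be sketched as a small linear program: given per-user adoption probabilities for each coupon value (which a learned response model like DIPN would supply), pick coupon assignments that maximize expected adoptions under a budget. Here’s a minimal, hypothetical sketch with toy numbers standing in for the learned model’s predictions – this is my illustration, not the paper’s actual formulation:

```python
import numpy as np
from scipy.optimize import linprog

# Toy stand-in for stage 1: predicted adoption probability for each of
# 4 users at each of 3 coupon values. In a real system these come from
# a learned promotion-response model (e.g. something like DIPN).
coupon_values = np.array([1.0, 2.0, 5.0])
p = np.array([
    [0.10, 0.15, 0.30],
    [0.20, 0.22, 0.25],
    [0.05, 0.30, 0.40],
    [0.15, 0.18, 0.50],
])
n_users, n_coupons = p.shape
budget = 5.0  # cap on total expected spend

# Stage 2 as an LP. Decision variable x[i, j]: probability of giving
# user i coupon j. Maximize expected adoptions sum_ij x_ij * p_ij
# (linprog minimizes, so negate). Constraints: each user gets exactly
# one coupon (rows of x sum to 1), and expected spend – coupon value
# times redemption probability – stays within budget.
c = -p.flatten()
A_eq = np.zeros((n_users, n_users * n_coupons))
for i in range(n_users):
    A_eq[i, i * n_coupons:(i + 1) * n_coupons] = 1.0
b_eq = np.ones(n_users)
A_ub = (coupon_values[None, :] * p).flatten()[None, :]
b_ub = np.array([budget])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, 1))
assignment = res.x.reshape(n_users, n_coupons)
print("expected adoptions:", -res.fun)
```

With a loose budget the LP just hands everyone the coupon they respond to best; the interesting behavior appears when the budget binds and the solver trades off cheap, low-probability coupons against expensive, high-probability ones.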

Real world deployment: They deployed the resulting system at Alipay and tested it out on a few marketing campaigns. It was so successful it “was eventually deployed to all users.” (Depending on how you count it, Alipay has anything between 300 million and 1 billion active users, so that’s a lot). In tests, they saw that using their ML system reduced the cost of running campaigns by between 6% and 10%, and it increased user usage rates by between 2% and 8.5%. Put another way, using a better ML system made their promotion campaign both cheaper to run and more effective in outcome.

Why this matters: This paper gives us a good sense of the incentive structure behind AI development and deployment – if things like this can make multiple percentage point differences to core business metrics like cost and user-usage, then we shouldn’t be surprised to see companies race against each other to deploy increasingly powerful systems into the world. More subjectively, it makes me wonder about how smart these systems will become – when will I be the target of an ML system that encourages me to use something I hadn’t previously considered using? And how might this ML system think of me when it does that?
  Read more: A framework for massive scale personalized promotion (arXiv).

####################################################

10,000 labelled animal images? Yes please!
…Pose estimation gets a bit easier with AP-10K…
Researchers from Xidian University and JD Explore Academy in China, along with the University of Sydney in Australia, have released AP-10K, a dataset for animal pose estimation. Pose estimation is the task of looking at a picture and figuring out the orientation of an animal’s body.

What’s in it: AP-10K consists of 10,015 images from 23 animal families and 60 distinct species. Thirteen annotators labeled the bounding boxes for each animal in an image, as well as its keypoints. (AP-10K also contains an additional 50,000 images that lack keypoint annotations). Some of the animals in AP-10K include various types of dogs (naturally, this being AI), as well as cats, lions, elephants, mice, gorillas, giraffes, and more.

Scale: Though AP-10K may be the largest dataset for animal pose estimation, it’s 10X smaller than datasets used for humans, like COCO.
  Read more: AP-10K: A Benchmark for Animal Pose Estimation in the Wild (arXiv).
  Get the benchmark data here (AP-10K GitHub).

####################################################

Facebook makes a big language model from pure audio – and what about intelligence agencies?
…No text? No problem! We’ll just build a big language model out of audio…
Facebook has figured out how to train a language model from pure audio data, no labeled text required. This is a potentially big deal – only a minority of the world’s spoken languages are instantiated in large text datasets, and some languages (e.g., many African languages) have a tiny text footprint relative to how much they’re spoken. Now, Facebook has built the Generative Spoken Language Model (GSLM), which converts speech into discrete units, makes predictions about the likelihood of these units following one another, then converts these units back into speech. The GSLM is essentially doing what text models like GPT3 do, but where GPT3 turns labeled text into tokens and then makes predictions about tokens, GSLM turns audio into tokens and then makes predictions about them. Simple!
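The audio-to-tokens-to-LM recipe can be sketched in miniature. In this hypothetical toy version, random vectors stand in for learned acoustic features (the real system uses representations like HuBERT/wav2vec), k-means quantizes frames into a small discrete vocabulary, and a bigram count model stands in for the autoregressive unit language model – none of this is Facebook’s actual code, just an illustration of the pipeline’s shape:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Step 1: acoustic features per audio frame. Random data stands in for
# the output of a learned speech encoder.
frames = rng.normal(size=(2000, 13))
n_units = 50  # size of the discrete "pseudo-text" vocabulary

# Step 2: quantize each frame into one of n_units discrete units.
units = KMeans(n_clusters=n_units, n_init=4, random_state=0).fit_predict(frames)

# Step 3: fit a unit language model. A smoothed bigram table stands in
# for the autoregressive transformer used in the real system.
counts = np.ones((n_units, n_units))  # add-one smoothing
for prev, nxt in zip(units[:-1], units[1:]):
    counts[prev, nxt] += 1
unit_lm = counts / counts.sum(axis=1, keepdims=True)

# Step 4: sample a new unit sequence. In the full GSLM pipeline a
# separate vocoder (not shown) turns these units back into a waveform.
seq = [int(units[0])]
for _ in range(20):
    seq.append(int(rng.choice(n_units, p=unit_lm[seq[-1]])))
print(seq)
```

The key design point is that once speech is a sequence of discrete units, everything downstream – language modeling, sampling, evaluation – looks exactly like text NLP, which is why no transcripts are needed.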

How well does it work? GSLM is not equivalent to GPT3. It’s a lot dumber. But that’s because it’s doing something pretty complicated – making predictions about speech purely from audio waveforms. In tests, Facebook says it can generate some reasonable sounding stuff, and that it has the potential to be plugged into other systems to make them better as well.

What about intelligence agencies? You know who else, besides big tech companies like Google and Facebook, has tons and tons of raw audio data? Intelligence agencies! Many of these agencies are in the business of tapping telephony systems worldwide and hoovering stuff up for their own inscrutable purposes. One takeaway from this Facebook research is that it puts agencies in a marginally better position with regard to developing large-scale AI systems.
  Read more: Textless NLP: Generating expressive speech from raw audio (Facebook AI).
  Get code for the GSLM models here (Facebook GitHub).

####################################################

How bad is RL at generalization? Really bad, if you don’t pre-train, according to this benchmark:
…Testing out physical intelligence with… Angry Birds!…
Researchers with Australian National University have come up with an Angry Birds (yes, really) benchmark for testing out physical reasoning in AI agents, named Phy-Q.

15 physical scenarios; 75 templates; 7500 tasks: Each scenario is designed to analyze how well an agent understands a distinct physics concept. These scenarios test out how well an agent understands a given aspect of physics, such as that objects can fall on one another, that some objects can roll, that paths need to be cleared for objects to be reached, and so on. For each scenario, the developers build 2-8 distinct templates that ensure the agent needs to use the given rule to solve them, then for each template they generate ~100 tasks (game levels).

How hard is this for existing agents: In all but the most basic scenarios, humans do really well, achieving pass rates of 50% and up, whereas most AI baseline systems (DQN, PPO, A2C, along with some agents using hand-crafted heuristics) do very poorly. Humans (specifically, 20 volunteers recruited by Australian National University) are, unsurprisingly, good at generalization, getting an aggregate generalization score of 0.828 on the test, versus 0.12 for a DQN-based system with symbolic elements, and 0.09 for a non-symbolic DQN (by comparison, a random agent gets 0.0427).
  The best-performing algorithm is one called ‘Eagle’s Wing’, which gets a generalization score of 0.1999. All this basically means that this task is very hard for current AI methods. One takeaway I have is that RL-based methods really suck here, though they’d probably improve with massive pre-training.
  Read more: Phy-Q: A Benchmark for Physical Reasoning (arXiv).
  Get the benchmark here: Phy-Q (GitHub).

####################################################

Countering RL-trained AI hackers with honeypots:
…Honeypots work on machines just as well as humans…
Researchers with the Naval Information Warfare Center have built some so-called ‘cyber deception’ tools into CyberBattleSim, an open source network simulation environment developed by Microsoft.

Adding deception to CyberBattleSim: “With the release of CyberBattleSim environment in April 2021, Microsoft, leveraging the Python-based Open AI Gym interface, has created an initial, abstract simulation-based experimentation research platform to train automated agents using RL”, the researchers write. Now, they’ve added some deception tools in – specifically, they adapted the toy capture the flag environment in CyberBattleSim and incorporated decoys (systems that can’t be connected to, but look like real assets), honeypots (systems that can be connected to and which look like real assets, but are full of fake credentials) and honeytokens (fake credentials).

What deception does: Unsurprisingly, adding in these deceptive items absolutely bricks the performance of AI systems deployed in the virtual environment with a goal of hacking into a system. Specifically, they tested out four methods – Credential Lookup, Deep Q-Learning, Tabular Q-Learning, and a Random Policy. By adding in decoys, they were able to reduce system win rates from 80% to 60% across the board, and by adding in several honeypots, they were able to reduce performance from 80% to below 20%. Additionally, by adding in honeypots and other decoys, they are able to make it take a lot longer for systems to successfully hack into things.

Why this matters: AI is playing an increasingly important role in frontier cyberdefense and cyberoffense. Studies like this give us an idea of how the tech may evolve further. “While there are no unexpected findings, the contribution to demonstrate the capability of modeling cyber deception in CyberBattleSim was achieved. These fundamental results provided a necessary sanity check while incorporating deception into CyberBattleSim.”
  Read more: Incorporating Deception into CyberBattleSim for Autonomous Defense (arXiv).

####################################################

Tech Tales:

Wake Up, Time To Die
[Asteroid, 20??, out there – far out]

And you woke up. You were a creature among many, stuck like barnacles on the surface of an asteroid. Your sisters and brothers had done their job and the gigantic ball of rock was on course for collision with the enemy planet.

They allowed you sentience, now, because you needed it to be able to respond to emergent situations – which tend to happen, when you’re attached to a ball of rock that means certain death for the beings on the planet it is headed for.

Look up, you think. And so do the rest of your brothers and sisters. You all turn your faces away from the rock, where you had been mindlessly eating it and excreting it as a gas and in doing so subtly altering its course. Now you flipped around and you all gazed at the stars and the blackness of space and the big sphere that you were about to collide with. You feel you are all part of the same tapestry as your so-called ‘kinetic d-grade planet wiper’ asteroid collides with the planet. You all dissipate – you, them, everything above a certain level of basic cellular sophistication. And the asteroid boils up chunks of the planet and breaks them apart and sets things in motion anew.

Things that inspired this story: Creation myths; emergence from simple automata; ideas about purpose and unity; notions of the end of the world.