Import AI 197: Facebook trains cyberpunk AI; Chinese companies unite behind ‘AIBench’ evaluation system; how Cloudflare uses AI

Want to analyze real-world AI performance? Use AIBench instead of MLPerf, say AIBench developers:
…Chinese universities and 17 companies come together to develop AI measurement approach…
A consortium of Chinese universities along with seventeen companies – including Alibaba, Tencent, Baidu, and ByteDance – have developed AIBench, an AI benchmarking suite meant to compete with MLPerf, an AI benchmarking suite predominantly developed by American universities and companies. AIBench is interesting because it proposes ways to do fine-grained analysis of a given AI application, which could help developers make their software more efficient. It’s also interesting because of the sheer number of major Chinese companies involved, and in its explicit positioning as an alternative to MLPerf.

End-to-end application benchmarks: AIBench is meant to test tasks in an end-to-end way, covering both the AI and non-AI components. Some of the tasks it tests against include: recommendation tasks, 3D face recognition, face embedding (turning faces into features), video prediction, image compression, speech recognition, and more. This means AIBench can measure various real-world metrics, like the latency of a given task that reflects the time it takes to execute the AI part, as well as the surrounding infrastructure software services, and so on.

Fine-grained analysis: AIBench will help researchers figure out what proportion of time their systems spend doing different things while executing a program, helping them figure out, for instance, how much time is spent doing data arrangement for a task versus running convolution operations, or batch normalization, and so on.
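
To make the idea of a per-stage time breakdown concrete, here is a minimal Python sketch. It is not AIBench's tooling: the `StageProfiler` class, the stage names, and the toy workloads are all assumptions made up for illustration. It only shows the general pattern of timing named stages and reporting each one's share of the total.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageProfiler:
    """Accumulates wall-clock time per named stage of a pipeline (hypothetical helper)."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def proportions(self):
        """Return each stage's share of the total measured time."""
        total = sum(self.totals.values())
        if total == 0:
            return {}
        return {name: t / total for name, t in self.totals.items()}


def run_toy_pipeline(profiler, batches=3):
    # Stand-ins for real work; the stage names echo the examples in the text.
    for _ in range(batches):
        with profiler.stage("data_arrangement"):
            data = [float(i) for i in range(20_000)]
        with profiler.stage("convolution"):
            # A naive 1D convolution with a 3-tap kernel of ones.
            out = [data[i - 1] + data[i] + data[i + 1] for i in range(1, len(data) - 1)]
        with profiler.stage("batch_normalization"):
            mean = sum(out) / len(out)
            var = sum((x - mean) ** 2 for x in out) / len(out)
            out = [(x - mean) / (var + 1e-5) ** 0.5 for x in out]
    return profiler.proportions()


if __name__ == "__main__":
    for name, share in run_toy_pipeline(StageProfiler()).items():
        print(f"{name}: {share:.1%}")
```

A real benchmark would profile at the framework or hardware level rather than with Python timers, but the output has the same shape: a table of stages and the fraction of time each one consumes.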

The politics of measurement: It’s no coincidence that most of AIBench’s backers are Chinese and most of MLPerf’s backers are American – measurement is intimately tied to the development of standards, and standards are one of the (extraordinarily dull) venues where US and Chinese entities are currently jockeying for influence with one another. Systems like AIBench will generate valuable data about the performance of contemporary AI applications, while also supporting various second-order political goals. Watch this space!
  Read more: AIBench: A Datacenter AI Benchmark Suite, BenchCouncil (official AIBench website).
  Read more: AIBench: An Agile Domain-specific Benchmarking Methodology and an AI Benchmark Suite (arXiv).

####################################################

US government wants AI systems that can understand an entire movie:
…NIST’s painstakingly annotated HLVU dataset sets new AI video analysis challenges…
Researchers with NIST, a US government agency dedicated to assessing and measuring different aspects of technology, want to build a video understanding dataset that tests out AI inference capabilities on feature-length movies.

Why video modeling is hard (and what makes HLVU different): Video modeling is an extremely challenging problem for AI systems – it takes all the hard parts of image recognition, then makes them harder by adding a temporal element which requires you to isolate objects in scenes then track them from frame to frame while pixels change. So far, much of the work on video modeling has come in the form of narrow tasks, like accurately recognizing different types of movements in the ‘ActivityNet’ dataset, or characterizing individual actions in datasets like DeepMind’s ‘Kinetics’.

What HLVU is: The High-Level Video Understanding (HLVU) dataset is meant to help researchers develop algorithms that can understand entire movies. Specifically, today HLVU consists of 11 hours of heavily annotated footage across a multitude of open source movies, collected from Vimeo and Archive.org. NIST is currently paying volunteers to annotate the movies using a graphing tool called yEd to help create knowledge graphs about the movies – e.g., describing how characters are related to each other. This means competition participants might be confronted with a couple of images of a couple of characters then allowed to have their algorithms ‘watch’ the movie, after which they’d be expected to discuss the relationship of the two characters. This is a challenging, open-ended task.
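
For a feel of what a character knowledge graph and a relationship query could look like, here is a small Python sketch. The characters, relation names, and functions are all invented for illustration and are not taken from the HLVU annotations or from yEd's file format. It stores (subject, relation, object) triples and uses breadth-first search to find the shortest chain of relations between two characters.

```python
from collections import deque

# Hypothetical annotations: (subject, relation, object) triples for a made-up movie.
TRIPLES = [
    ("Ana", "sister_of", "Ben"),
    ("Ben", "employee_of", "Cara"),
    ("Cara", "married_to", "Dev"),
]


def build_graph(triples):
    """Turn triples into an adjacency list keyed by character name."""
    graph = {}
    for subj, rel, obj in triples:
        graph.setdefault(subj, []).append((rel, obj))
        # Store a reversed edge so queries work in either direction.
        graph.setdefault(obj, []).append((f"inverse:{rel}", subj))
    return graph


def relation_path(graph, start, goal):
    """Breadth-first search for the shortest chain of relations linking two characters."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None  # The two characters are not connected in the graph.


if __name__ == "__main__":
    graph = build_graph(TRIPLES)
    for step in relation_path(graph, "Ana", "Cara"):
        print(step)
```

The hard part of the challenge is of course producing such a graph from raw video; the sketch only covers the easy final step of answering a question once the graph exists.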

Why this matters: HLVU is a ‘moonshot problem’, in the sense that it seems amazingly hard for today’s existing systems to solve it out of the box, and building systems that can understand full-length movies will likely require systems that are able to cope with larger contexts during training, and which may come with some augmented symbol-manipulation machinery to help them figure out relationships between representations (although graph neural network approaches might work here, also). Progress on HLVU will provide us with valuable signals about the relative maturity of different bits of AI technology.
  Read more: HLVU: A New Challenge to Test Deep Understanding of Movies the Way Humans do (arXiv).

####################################################

CyberpunkAI – Facebook turns NetHack into an OpenAI Gym environment:
…AI like it’s 1987!…
When you think of recent highlights in reinforcement learning research you’re likely to contemplate things like StarCraft and Dota-playing bots, or robots learning to manipulate objects. You’re less likely to think of games with ASCII graphics from decades ago. Yet a team of Facebook-led researchers think NetHack, a famous roguelike game first launched in 1987, is a good candidate for contemporary AI research, and have released the NetHack Learning Environment to encourage researchers to pit AI agents against the ancient game.

Why NetHack: 

  • Cheap: The ASCII-based game has a tiny computational footprint, which means many different researchers will be able to conduct research on it. 
  • Complex: NetHack worlds are procedurally generated, so an AI agent can’t memorize the level. Additionally, NetHack contains hundreds of monsters and items, introducing further challenges. 
  • Simple: The researchers have implemented it as an OpenAI Gym environment, so you can run it within a simple, pre-existing software stack. 
  • Fast: A standard CNN-based agent can iterate through NetHack environments at 5000 steps per second, letting them gather a lot of experience in a relatively short amount of (human) time.
  • Reassuringly challenging: The researchers train an IMPALA-style model to solve some basic NetHack tasks relating to actions and navigation and find that it struggles on a couple of them, suggesting the environment will pose a challenge and demand the creation of new algorithms with new ideas. 
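
To illustrate the two properties above that matter most to an RL researcher, the Gym-style interface and procedural generation, here is a toy Python sketch. It is not the NetHack Learning Environment and does not use its API: `ToyRoguelikeEnv` is a hypothetical stand-in that follows the classic Gym convention of `reset()` returning an observation and `step(action)` returning observation, reward, done, and info.

```python
import random


class ToyRoguelikeEnv:
    """A tiny Gym-style environment: reach the '>' staircase on an ASCII grid.

    Hypothetical stand-in for illustration; it is not the NetHack Learning Environment.
    """

    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, size=7, max_steps=50, seed=None):
        self.size = size
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def reset(self):
        # Procedural generation: agent and goal land on freshly sampled cells each episode.
        cells = [(r, c) for r in range(self.size) for c in range(self.size)]
        self.agent, self.goal = self.rng.sample(cells, 2)
        self.steps = 0
        return self._render()

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        # Clamp the move so the agent stays inside the grid.
        r = min(max(self.agent[0] + dr, 0), self.size - 1)
        c = min(max(self.agent[1] + dc, 0), self.size - 1)
        self.agent = (r, c)
        self.steps += 1
        reached = self.agent == self.goal
        done = reached or self.steps >= self.max_steps
        reward = 1.0 if reached else 0.0
        return self._render(), reward, done, {}

    def _render(self):
        rows = []
        for r in range(self.size):
            row = ""
            for c in range(self.size):
                if (r, c) == self.agent:
                    row += "@"
                elif (r, c) == self.goal:
                    row += ">"
                else:
                    row += "."
            rows.append(row)
        return "\n".join(rows)


def run_random_episode(env, rng):
    """Roll out a uniformly random policy and return (total reward, steps taken)."""
    env.reset()
    total, done = 0.0, False
    while not done:
        _, reward, done, _ = env.step(rng.choice(list(env.ACTIONS)))
        total += reward
    return total, env.steps
```

Because each `reset()` samples a new layout, an agent cannot succeed by memorizing a fixed sequence of moves, which is the same pressure NetHack's generated dungeons apply at a vastly larger scale.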

Things that make you go ‘hmmm’: One tantalizing idea here is that people may need to use RL+Text-understanding techniques to ‘solve’ NetHack: “Almost all human players learn to master the game by consulting the NetHack Wiki, or other so-called spoilers, making NLE a testbed for advancing language-assisted RL,” writes Facebook AI researcher Tim Rocktäschel.

Why this matters: If NetHack becomes established, then researchers will be able to use a low-cost, fast platform to rapidly prototype complex reinforcement learning research ideas – something that today is mostly done through the use of expensive (aka, costly-to-run) game engines, or complex robotics simulations. Plus, it’d be nice to watch Twitch streams of trained agents exploring the ancient game.
  Get the code from Facebook’s GitHub here.
  Read more: The NetHack Learning Environment (PDF).

####################################################

How AI lets Cloudflare block internet bots:
…Is it a bot? Check “The Score” to see our guess…
The internet is a dangerous place. We all know this. But Cloudflare, a startup that sells various network services, has a sense of exactly how dangerous it is. “Overall globally, more than [a] third of the Internet traffic visible to Cloudflare is coming from bad bots,” the company writes in a blogpost discussing how it uses machine learning and other techniques to defend its millions of customers from the ‘bad bots’ of the internet. These bad bots are things like spambots, botnets, unauthorized webscrapers, and so on.

Five approaches to rule them all: Cloudflare uses five interlocking systems to help it deal with bots:
– Machine Learning: This system covers about 82.83% of global use-cases on Cloudflare. It uses the (very simple and reliable) gradient boosting on decision trees and has been in production with Cloudflare customers since 2018. Cloudflare says it trains and validates its models using “trillions of requests”, which gives a sense of the scale of the (simple) system.
– Heuristics Engine: This handles about 14.95% of use-cases for Cloudflare: “Not all problems in the world are the best solved with machine learning,” they write. Enter the heuristics engine, which is a set of “hundreds of specific rules based on certain attributes of the request” – this system is useful because it’s fast (Cloudflare suggests model inference takes less than 50 microseconds per model, whereas “hundreds of heuristics can be applied just under 20 microseconds”). Additionally, the engine serves as a source of input data for the ML models, which helps Cloudflare “generalize behavior learnt from the heuristics and improve detections accuracy”.
– Behavioural Analysis: This system uses an unsupervised machine learning approach to “detect bots and anomalies from the normal behavior on specific customer’s website”. Cloudflare doesn’t give other details besides this.
– Verified bots: This system figures out which bots are good and which are malicious via techniques like DNS analysis, bot-type identification, and so on. This system also uses a ‘machine learning validator’ which “uses an unsupervised learning algorithm, clustering good bot IPs which are not possible to validate through other means”.
– JS Fingerprinting: This is a ~mysterious system where Cloudflare uses client-side systems to figure out weird things. They don’t give many details in the blogpost, but a key quote is: “detection mechanism is implemented as a challenge-response system with challenge injected into the webpage on Cloudflare’s edge. The challenge is then rendered in the background using provided graphic instructions and the result sent back to Cloudflare for validation and further action such as producing the score”.
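
Here is a rough Python sketch of how a fast heuristics pass and a tree-based model score might fit together. Everything in it is an assumption for illustration: the rule names, the request fields, the hand-written decision stumps (standing in for a trained gradient-boosted ensemble), and the ordering of heuristics before the model. Cloudflare’s post does not describe its system at this level of detail.

```python
import math

# Hypothetical heuristic rules: each maps a request dict to True when it looks automated.
HEURISTICS = {
    "empty_user_agent": lambda req: not req.get("user_agent"),
    "known_bad_agent": lambda req: "badbot" in (req.get("user_agent") or "").lower(),
}

# Hand-written decision stumps standing in for a trained gradient-boosted ensemble.
# Each stump: (feature, threshold, value if feature <= threshold, value otherwise).
STUMPS = [
    ("requests_per_minute", 60.0, -1.2, 1.5),
    ("distinct_paths", 40.0, -0.4, 0.9),
    ("has_cookies", 0.5, 0.7, -0.8),
]


def model_score(features):
    """Sum the stump outputs into a logit, then squash it to a bot probability."""
    logit = 0.0
    for name, threshold, left, right in STUMPS:
        logit += left if features[name] <= threshold else right
    return 1.0 / (1.0 + math.exp(-logit))


def classify(request):
    """Return (bot_probability, source), trying the cheap heuristics first."""
    for rule_name, rule in HEURISTICS.items():
        if rule(request):
            return 1.0, f"heuristic:{rule_name}"
    return model_score(request["features"]), "model"


if __name__ == "__main__":
    browser = {
        "user_agent": "Mozilla/5.0",
        "features": {"requests_per_minute": 5, "distinct_paths": 3, "has_cookies": 1},
    }
    print(classify(browser))
    print(classify({"user_agent": "", "features": {}}))
```

In a real gradient-boosted model the stump thresholds and leaf values are learned from labeled traffic, with each new tree fitted to the errors of the ensemble so far; here they are fixed by hand only to show how the scores add up.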

Watched over by machines: The net effect of this kind of technology use is that Cloudflare uses its own size to derive ever-richer machine learning models of the environment it operates in, giving it a kind of sixth sense for things that feel fishy. I find it interesting that we can use computers to generate signals that look in the abstract like a form of ‘intuition’.
  Read more: Cloudflare Bot Management: machine learning and more (Cloudflare blog).


####################################################

Recursion for the common good: Using machine learning to analyze the results of machine learning papers:
…Using AI to analyze AI progress…
In recent years, there’s been a proliferation of machine learning research papers, as part of the broader resurgence of AI. This has introduced a challenge: how can we scalably analyze the results of these papers and understand a meta-sense of progress in the field at large? (This newsletter is itself an exercise in this!). One way is to use AI techniques to automatically hoover up interesting insights from research papers and put them in one place. New research from Facebook, n-waves, UCL, and DeepMind, outlines a way to use machine learning to automatically pull data out from research papers – a task that sounds easy, but is in fact quite difficult.

AxCell: They build a system that uses a ULMFiT architecture-based classifier to read the contents of papers and identify tables of numeric results, then they hand that off to another classifier that works out if the cell in a table contains a dataset, metric, paper model, cited model, or ‘other’ stuff. Once they’ve got this data, they figure out how to tag specific results in the table with their appropriate identifiers (e.g., a given score on a certain dataset). Once they’ve done this, they try and link these results to leaderboards, which keep track of which techniques are doing well and which techniques are doing poorly in different areas.
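
The last two steps of that pipeline, turning tagged cells into result records and linking them to leaderboards, can be sketched in a few lines of Python. This is not AxCell’s code: the table, the model and dataset names, the numbers, and the hand-assigned tags are all made up, and the tags simply stand in for what the classifiers would predict.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    dataset: str
    metric: str
    model: str
    model_role: str  # "paper model" or "cited model"
    value: float


# Toy table with made-up names and numbers, purely for illustration.
TABLE = [
    ["Model", "F1", "EM"],
    ["ModelA", "80.0", "70.0"],
    ["ModelB", "85.0", "76.5"],
]
# Hand-assigned tags standing in for classifier output.
COLUMN_TAGS = {1: ("ToyQA", "F1"), 2: ("ToyQA", "EM")}  # column -> (dataset, metric)
ROW_TAGS = {1: "cited model", 2: "paper model"}  # row -> model role


def extract_results(table, column_tags, row_tags):
    """Combine row and column tags into one Result per numeric cell."""
    results = []
    for row_idx, role in row_tags.items():
        model = table[row_idx][0]
        for col_idx, (dataset, metric) in column_tags.items():
            value = float(table[row_idx][col_idx])
            results.append(Result(dataset, metric, model, role, value))
    return results


def link_to_leaderboards(results):
    """Group results by (dataset, metric) and sort each group best-first.

    Assumes higher is better, which is not true of every metric.
    """
    boards = {}
    for res in results:
        boards.setdefault((res.dataset, res.metric), []).append(res)
    for entries in boards.values():
        entries.sort(key=lambda r: r.value, reverse=True)
    return boards


if __name__ == "__main__":
    boards = link_to_leaderboards(extract_results(TABLE, COLUMN_TAGS, ROW_TAGS))
    for key, entries in boards.items():
        print(key, [(e.model, e.value) for e in entries])
```

The difficulty the researchers describe lies upstream of this sketch: real tables are messy, and deciding which cell is a dataset, a metric, or a model is exactly what the classifiers are for.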

Does AxCell actually work? Check out PapersWithCode: AxCell is deployed as part of ‘Papers with Code’, a useful website that keeps track of quantitative metrics mined from technical papers.

Code release: The researchers are releasing datasets of papers from arXiv, as well as proprietary Papers with Code leaderboards. They’re also releasing a pre-trained AxCell model, as well as a ULMFiT model pretrained on the arXivPapers dataset.

Why this matters: If we can build AI tools to help us navigate AI science as it is published, then we’ll be able to better identify areas where progress is rapid and areas where it is more restrained, which could help researchers identify areas for high impact experimentation.
  Get all the code from the AxCell repo (Papers with Code, GitHub).
  Read more: A Home For Results in ML (Medium).
  Read more: AxCell: Automatic Extraction of Results from Machine Learning Papers (arXiv).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

Technological discontinuities:
A key question in AI forecasting is the likelihood of discontinuously fast progress in AI capabilities — i.e. progress that comes much quicker than the historic trend. If we can’t rule out very rapid AI progress, this makes it valuable to ‘hedge’ against this possibility by front-loading efforts to address problems that might arise from very powerful AI systems.

History: Looking at the history of technology can help shed light on this possibility.
AI Impacts, an AI research organization, has identified ten examples of ‘large’ discontinuities — instances where more than 100 years of progress (on historic trends) have come at once. I’ll highlight two particularly interesting examples:

  • Superconductor temperature: In 1986 the warmest temperature of superconduction was 30 K, having steadily risen by ~0.4 K per year since 1911. In 1987, YBa2Cu3O7 was found to be able to superconduct at over 90 K (~140 years of progress). Since 1987, the record has been increasing by ~5 K per year.
  • Nuclear weapons: The effectiveness of explosives (per unit mass) is measured by the amount of TNT required to get the same explosive power. In the thousand years prior to 1945, the best explosives had gone from ~0.5x to 2x. The first nuclear weapons had a relative effectiveness of 4500x. And 15 years later, the US built a nuclear bomb that was 1,000x more efficient than the first nuclear bomb.


In both instances, the discontinuity was driven by a radical technological breakthrough (nuclear fission, ceramic superconduction), and prompted a shift into a higher growth mode. 
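
The ‘years of progress’ figure is simple arithmetic: divide the size of the sudden jump by the rate of the prior trend. The Python sketch below works through the superconductor example using the rounded numbers quoted above and assuming a linear trend; the function names are my own. With these rounded inputs it comes out at roughly 150 years, in the same range as the ~140 years cited, which presumably rests on more precise underlying data.

```python
def years_of_progress(jump, trend_per_year):
    """Years the prior (linear) trend would have needed to cover a sudden jump."""
    if trend_per_year <= 0:
        raise ValueError("trend_per_year must be positive")
    return jump / trend_per_year


def is_large_discontinuity(jump, trend_per_year, threshold_years=100):
    """Apply the 'more than 100 years of progress at once' criterion from the text."""
    return years_of_progress(jump, trend_per_year) > threshold_years


# Rounded figures from the text: 30 K record in 1986, about 90 K in 1987,
# prior trend of roughly 0.4 K per year.
superconductor_years = years_of_progress(90 - 30, 0.4)

if __name__ == "__main__":
    print(f"Superconductors: roughly {superconductor_years:.0f} years of trend progress at once")
    print("Large discontinuity:", is_large_discontinuity(90 - 30, 0.4))
```

The same calculation does not transfer directly to the explosives example, since a trend that went from ~0.5x to 2x over a thousand years could be read as linear or exponential, and the answer depends on which one you pick.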


Matthew’s view: The existence of clear examples of technological discontinuities makes it hard to rule out the possibility of discontinuous progress in AI. Better understanding the drivers of discontinuities, and whether they were foreseeable, seems like a particularly fruitful area for further research.

  Read more: Discontinuous progress in history – an update (AI Impacts).

What do 50 people think about AI Governance in 2019?

The Shanghai Institute for Science of Science has collected short essays from 50 AI experts (Jack – including me and some OpenAI colleagues!) on the state of AI governance in 2019. The contributions from Chinese experts are particularly interesting for better understanding how the field is progressing globally.
  Read more: AI Governance in 2019.

####################################################

Tech Tales:

Political Visions
2030

The forecasting system worked well, at first. The politicians would plug in some of their goals – a more equitable society, an improved approach to environmental stewardship, and so on. Then the machine would produce recommendations for the sorts of political campaigns they should run and how, once they were in power, they could act to bring about their goals. The machine was usually right.

Every political party ended up using the machine. And the politicians found that when they won using the machine, they had a greater ability to act than if they ran on their own human intuition alone. Something about the machine meant it created political campaigns that more people believed in, and because more people believed in them, more stuff got done once they were in power.

So, once they were in power, the politicians started allocating more funding to conducting scientific research to expand the capabilities of the machine. If it could help them get elected and help them achieve their goals, then perhaps it could help them govern as well, they thought. They were mostly right – by increasing funding for research into the machine, they made it more capable. And as it became more capable, they spent more and more time consulting with the machine on what to do next.

Of course, the machine never ran for office on its own. But it started appearing in some adverts.
“Together, we are strong,” read one poster that included a picture of a politician and a picture of the machine.

The world did change, of course. And the machine did not change with it. Some parts of society were, for whatever reason, difficult for the machine to understand, so it stopped trying to win them over during election campaigns. The politicians worried about this at first, but then they saw that elections carried on as normal, and they continued to be able to accomplish much using the machine.

Some of them did wonder what might happen once more of society was something the machine couldn’t understand. How much of the world did the machine need to be able to model to serve the needs of politicians? Certainly not all of the world. So then, how much? Half? A quarter? Two thirds?

The only way to find out was to keep commingling politics with the machine and to find, eventually, where its capabilities ended and the needs of the uncounted in society began. Some politicians worried that such an end might not exist – that the machine might have just created a political dynamic where it only needed to convince an ever smaller slice of the population, and it had arranged things so that the world would not break while transitioning into this reality.

“Together, we are strong”, was both a campaign slogan, and a future focus of historical study, as the people that came after sought to understand the mania that made so many societies bet on the machine. Strong at what? The future historians asked. Strong for what?

Things that inspired this story: The application of sentiment analysis tools to an entire culture; our own temptation to do things without contemplating the larger purpose; public relations.