Import AI

Import AI 176: Are language models full of hot air? Test them on BLiMP; Facebook notches up Hanabi win; plus, TensorFlow woes.

First, poker. Now: Hanabi! Facebook notches up another AI game-playing win:
…The secret? Better search…
Facebook AI researchers have developed an AI system that gets superhuman performance in Hanabi, a collaborative card game that requires successful players to “understand the beliefs and intentions of other players, because they can’t see the same cards their teammates see and can only share very limited hints with each other”. The Facebook-developed system is based on Pluribus, the CMU/Facebook-developed machine that defeated players in six-player no-limit Hold’em earlier this year.

Why Hanabi? In February, researchers with Google and DeepMind published a paper arguing that the card game ‘Hanabi’ should be treated as a milestone in AI research, following the success of contemporary systems in domains like Go, Atari, and Poker. Hanabi is a useful milestone because along with requiring reasoning with partial information, the game also “elevates reasoning about the beliefs and intentions of other agents to the foreground,” wrote the researchers at the time. 

How they did it: The Facebook-developed system relies on what the company calls multi-agent search. It works roughly like this: agent A looks at the state of the game world and uses Monte Carlo rollouts to work out the optimal move to make; agent B observes the move agent A made, uses it to infer what cards agent A is likely holding, then uses that knowledge to inform the strategy agent B picks; agent A then studies agent B’s prior moves to estimate both what cards B has and what cards B thinks agent A has, and uses this to generate its next strategy.
   This is a bloody expensive procedure, as it involves a ballooning quantity of calculations as the complexity of the game increases; the Facebook researchers default to a computationally cheaper but less effective single-agent search procedure most of the time, only using multi-agent search periodically. 
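
   To make the idea concrete, here is a minimal sketch of the single-agent search step plus the belief update that the multi-agent version layers on top. The helper functions (`legal_actions`, `sample_hidden_hand`, `simulate_playout`, `blueprint_policy`) are hypothetical stand-ins for a real Hanabi simulator and blueprint policy, not Facebook's implementation.

```python
from typing import Callable

def monte_carlo_search(public_state,
                       belief_over_partner_hand: dict,
                       legal_actions: Callable,
                       sample_hidden_hand: Callable,
                       simulate_playout: Callable,
                       num_rollouts: int = 100):
    """Pick the action with the best average return across sampled hidden hands."""
    best_action, best_value = None, float("-inf")
    for action in legal_actions(public_state):
        total = 0.0
        for _ in range(num_rollouts):
            # Sample a concrete partner hand consistent with current beliefs...
            hidden_hand = sample_hidden_hand(belief_over_partner_hand)
            # ...then roll the game forward assuming everyone plays a fixed 'blueprint' policy.
            total += simulate_playout(public_state, hidden_hand, action)
        value = total / num_rollouts
        if value > best_value:
            best_action, best_value = action, value
    return best_action

def update_belief(belief: dict, observed_partner_action, blueprint_policy: Callable) -> dict:
    """The multi-agent twist: re-weight each possible hand by how likely the
    partner's policy was to take the observed action while holding that hand."""
    return {hand: prob * blueprint_policy(hand, observed_partner_action)
            for hand, prob in belief.items()}
```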

Why this matters: We’re not interested in this system because it can play Hanabi – we’re interested in understanding general approaches to improving the performance of AI systems that interact with other systems in messy, strategic contexts. The takeaway from Facebook’s research is that if you have access to enough information to be able to simulate the game state, then you can layer search strategies on top of big blobs of neural inference, and use this to create somewhat generic, strategic agents. “Adding search to RL can dramatically improve performance beyond what can be achieved through RL alone,” they write. “We have now shown that this approach can work in cooperative environments as well.”
   Read more: Building AI that can master complex cooperative games with hidden information (Facebook AI research blog).
   Read more: The Hanabi Challenge: A New Frontier for AI Research (Arxiv).
   Find out details about Pluribus: Facebook, Carnegie Mellon build first AI that beats pros in 6-player poker (Arxiv).

####################################################

Why TensorFlow is a bit of a pain to use:
…Frustrated developer lists out TF’s weaknesses – and they’re not wrong…
Google’s TensorFlow software is kind of confusing and hard to use, and one Tumblr user has written up some thoughts on why. The tl;dr: TensorFlow has become a bit bloated in terms of its codebase, and Google keeps releasing new abstractions and features for the software that make it harder to understand and use.
   “You know what it reminds me of, in some ways? With the profusion of backwards-incompatible wheel-reinventing features, and the hard-won platform-specific knowledge you just know will be out of date in two years?  Microsoft Office,” they write. 

Why this matters: As AI industrializes, more and more people are going to use the software to develop AI, and the software that captures the greatest number of developers will likely become the Linux-esque basis for a new computational reality. Therefore, it’s interesting to contrast the complaints people have about TensorFlow with the general enthusiasm about PyTorch, a Facebook-developed AI software framework that is easier to use and more flexible than TensorFlow.
   Read about problems with TensorFlow here (trees are harlequins, words are harlequins, GitHub).

####################################################

Enter the GPT-2 Dungeon:
…You’re in a forest. To your north is a sentient bar of soap. Where do you go?…
AI advances are going to change gaming – both the mechanics of games, and also how narratives work in games. Already, we’re seeing people use reinforcement learning approaches to create game agents capable of more fluid and capable movement than their predecessors. Now, with the recent advances in natural language processing, we’re seeing people use pre-trained language models in a variety of creative writing applications. One of the most evocative is AI Dungeon, a text-adventure game which uses a pre-trained 1.5bn-parameter GPT-2 language model to guide how the game unfolds. 

What’s interesting about this: AI Dungeon lets us take the role of a player in an 80s-style text adventure game – except the game doesn’t depend on a baroque series of hand-written narrative sections, joined together by fill-in-the-blanks madlib cards and controlled by a set of pre-defined keywords. Instead, it relies on a blob of neural stuff that has been trained on a percentage of the internet, and this blob of neural stuff is used to animate the game world, interpreting player commands and generating new narrative sections. We’re not in Kansas anymore, folks! 
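
   Under the hood, the core loop is essentially "append the player's command to the story so far, then sample a continuation from the language model." Here's a rough sketch of that pattern using the Hugging Face transformers library; the prompt format, checkpoint, and decoding settings are assumptions for illustration, not AI Dungeon's actual configuration.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # swap in a larger checkpoint if you have the memory
model = GPT2LMHeadModel.from_pretrained("gpt2")

story = "You are Vert, a rogue standing at the edge of a dark forest."
command = "> travel one hundred years into the future"
prompt = story + "\n" + command + "\n"

input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(
    input_ids,
    do_sample=True,        # sampling keeps the narrative varied between playthroughs
    top_p=0.9,
    temperature=0.8,
    max_length=input_ids.shape[1] + 60,
)
# Print only the newly generated continuation, not the prompt.
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```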

How it plays: The fun thing about AI Dungeon is its inherent flexibility, and most games devolve into an exercise of trying to break GPT-2 in the most amusing way (or at least, that’s how I play it!). During an adventure I went on, I was a rogue named ‘Vert’ and I repeatedly commanded my character to travel through time, but the game adapted to this pretty well, keeping track of changes in buildings as I went through time (some crumbled, some grew). At one point all the humans besides me disappeared, but that seems like the sort of risk you run when time traveling. It’s compelling stuff and, while still a bit brittle, can be quite fun when it works.

Why this matters: As the AI research community develops larger and more sophisticated generative models, we can expect the outputs of these models to be plugged into a variety of creative endeavors, ranging from music to gaming to poetry to playwriting. GPT-2 has shown up in all of these so far, and my intuition is in 2020 onwards we’ll see the emergence of a range of AI-infused paintbrushes for a variety of different mediums. I can’t wait to see what an AI Dungeon might look like in 2020… or 2021!
   Play the AI Dungeon now (AIDungeon.io).

####################################################

Mustafa swaps DeepMind for Google:
…DeepMind co-founder moves on to focus on applied AI projects…
Mustafa Suleyman, the co-founder of DeepMind, has left the company. However, he’s not going far – Mustafa will take on a new role at Google, part of the Alphabet Inc. mothership to which DeepMind is tethered. At Google, Mustafa will work on applied initiatives.

Why this matters: At DeepMind, Mustafa was frequently seen advocating for the development of socially beneficial applications of the firm’s technology, most notably in healthcare. He was also a fixture on the international AI policy circuit, turning up at various illustrious meeting rooms, private jet-cluttered airports, and no-cellphone locations. It’ll be interesting to see whether he can inject some more public, discursive stylings into Google’s mammoth policy apparatus.
   Read more: From unlikely start-up to major scientific organization: Entering our tenth year at DeepMind (DeepMind blog)

####################################################

Testing the frontiers of language modeling with ‘BLiMP’:
…Testing language models in twelve distinct linguistic domains…
In the last couple of years, large language models like BERT and GPT-2 have upended natural language processing research, generating significant progress on challenging tasks in a short amount of time. Now, researchers with New York University have developed a benchmark to help them more easily compare the capabilities of these different models, helping them work out the strengths and weaknesses of different approaches, while comparing them against human performance. 

A benchmark named BLiMP: The benchmark is called BLiMP, short for the Benchmark of Linguistic Minimal Pairs. BLiMP tests how well language models can classify different pairs of sentences according to different criteria, testing across twelve linguistic phenomena including ellipsis, subject-verb agreement, irregular forms, and more. BLiMP also ships with human baselines, which help us compare language models in terms of absolute scores as well as relative performance. The full BLiMP dataset consists of 67 classes of 1,000 sentence pairs, where each class is grouped within one of the twelve linguistic phenomena. 
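
   The evaluation itself is simple: for each minimal pair, a model "passes" if it assigns higher probability to the acceptable sentence than to the minimally different unacceptable one. A rough sketch of that scoring with GPT-2 via the transformers library is below; the example pair is illustrative rather than drawn from BLiMP itself.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def sentence_logprob(sentence: str) -> float:
    ids = tokenizer.encode(sentence, return_tensors="pt")
    with torch.no_grad():
        # With labels supplied, the first output is the mean negative
        # log-likelihood per token; scale by length to compare whole sentences.
        loss = model(ids, labels=ids)[0]
    return -loss.item() * ids.shape[1]

acceptable = "The cats annoy Tim."       # illustrative minimal pair,
unacceptable = "The cats annoys Tim."    # not an actual BLiMP item
print(sentence_logprob(acceptable) > sentence_logprob(unacceptable))  # True => the model gets this pair right
```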

Close to human-level, but not that smart: “All models we evaluate fall short of human performance by a wide margin,” they write. “GPT-2, which performs the best, does match (even just barely exceeds) human performance on some grammatical phenomena, but remains 8 percentage points below human performance overall.” Note: the authors only test the ‘medium’ version of GPT-2 (345M parameters, versus 775M for ‘large’ and 1.5 billion for ‘XL’), so it’s plausible that other variants of the model have even better performance. By comparison, Transformer-XL, LSTM, and 5-gram models perform worse. 

Not entirely human: When analyzing the scores of the various models, the researchers find that “neural models correlate with each other more strongly than with humans or the n-gram model, suggesting neural networks share some biases that are not entirely human-like”. In other experiments, they show how brittle these language models can be, causing transformer-based models to misclassify sentences laced with confusing ‘attractor’ nouns. 

Why this matters: BLiMP can function “as a linguistically motivated benchmark for the general evaluation of new language models,” they write. Additionally, because BLiMP tests for a variety of different linguistic phenomena, it can be used for detailed comparisons across different models.
   Get the code (BLiMP GitHub).
   Read the paper: BLiMP: The Benchmark of Linguistic Minimal Pairs (BLiMP GitHub).

####################################################

Facebook makes a PR-spin chatbot:
…From the bad, no good, terrible idea department…
Facebook has developed an in-house chatbot to help people give official company responses to awkward questions posed by relatives, according to The New York Times. The system, dubbed ‘Liam Bot’, hasn’t been officially disclosed. 

Why this matters: Liam Bot is a symbol for a certain type of technocratic management mindset that is common in Silicon Valley. It feels like in the coming years we can expect multiple companies to develop their own internal chatbots to automate how employees access certain types of corporate information.
   Read more: Facebook Gives Workers a Chatbot to Appease That Prying Uncle (The New York Times).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

China’s AI R&D spending has been overestimated:
The Chinese government is frequently cited as spending substantially more on publicly-funded AI R&D than the US. This preliminary analysis of spending patterns shows the gap between the US and China has been overestimated. In particular, the widespread claim that China is spending tens of billions in annual AI R&D is not borne out by the evidence.

Key findings: Drawing together the sparse public disclosures, this report estimates that 2018 Chinese government AI R&D spending was $2.0–8.4 billion; this estimate is highly uncertain and should be interpreted as a rough order of magnitude. Spending is focused on applied R&D, with only a small proportion going towards basic research. These figures are in line with US federal spending plans for 2020. 

Why it matters: International competition, particularly between the US and China, will be an important determinant of how AI is developed, and how this impacts the world. Ensuring that AI goes well will require cooperation and trust between the major powers, and this in turn relies on countries having an accurate understanding of each other’s capabilities and ambitions.

A correction: I have used the incorrect figure of ‘tens of billions’ twice in this newsletter, and shouldn’t have done so without being more confident in the supporting evidence. It’s unsettling how easily falsehoods can become part of the received wisdom, and what effects this could have on policy decisions. More research scrutinizing the core assumptions of the narrative around AI competition would be highly valuable.
   Read more: Chinese Public AI R&D Spending: Provisional Findings (CSET).

####################################################

What is ‘transformative AI’?:
The impacts of AI will scale with how powerful our AI systems are, so we need terminology to refer to different levels of future technology. This paper tries to give a clear definition of ‘transformative AI’ (TAI), a term that is widely used in AI policy.

Irreversibility: TAI is sometimes framed by comparison to historically transformative technologies. The authors propose that the key component of TAI is that it contributes to irreversible changes to important domains of society. This might fundamentally change most aspects of how we live and work, like electricity, or might impact a narrow but important part of the world, as nuclear weapons did.

Radically transformative AI: The authors define TAI as AI that leads to societal change comparable to previous general-purpose technologies, like electricity or nuclear weapons. On the other hand, radically transformative AI (RTAI) would lead to levels of change comparable to that of the agricultural or industrial revolutions, which fundamentally changed the course of human history. 

Usefulness: TAI classifies technologies in terms of their effect on society, whereas ‘human-level AI’ and ‘artificial general intelligence’ do so in terms of technological capabilities. Different concepts will be useful for different purposes. Forecasting AGI might be easier than forecasting TAI, since we can better predict progress in AI capabilities than we can societal impacts. But when thinking about governance and policy, we care more about TAI, since we are interested in understanding and influencing the effects of AI on the world.
   Read more: Defining and Unpacking Transformative AI (arXiv).

####################################################

Tech Tales

The Endless Bedtime Story

It’d start like this: where were we, said the father.
We were by the ocean, said one of the kids.
And there was a dragon, said the other.
And the dragon was lonely, said both of them in unison. Will the dragon make friends?

Each night, the father would smile and work with the machine to generate that evening’s story.

The dragon was looking for his friends, said the father.
It had been searching for them for many years and had flown all across North America.
Now, it had arrived at the Pacific Ocean and it wheeled above seals and gulls and blue waves on beige sands. Then, in the distance, it saw a flash of fire, and it began to fly toward it. Then the fire vanished and the dragon was sad. Night came. And as the dragon began to close its eyes, its scales cooling on a grassy hill, it saw another flash of fire in the distance. It stood amazed as it saw the flames reveal a mountain spitting fire.
   Good gods, the dragon cried, you are evil. I am so alone. Are there any dragons still alive?
   Then it went to sleep and dreamed of dragon sisters and dragon brothers and clear skies and wind tearing laughter into the air.
   And when the dragon awoke, it saw a blue flame, flickering in the distance.
   It flew towards it and cried out “Do not play tricks on me this time” and…
   And then the father would stop speaking and the kids would shout “what happens next?” and the father would say: just wait. And the next night he would tell them more of the story. 

Telling the endless story changed the father. At night, he’d dream of dragons flying over icy planets, clothed in ice and metal. The planets were entirely made of computers and the dragons rode thermals from skyscraper-sized heat exchangers. And in his dreams he would be able to fly towards the computers and roar them questions and they would respond with stories, laying seeds for how he might play with the machine to entertain his children when he woke.

In his dream, he once asked the computers how he should continue the story, and they told him to tell a story about: true magic which would solve all the problems and the tragedies of his childhood and the childhood of his children.

That would be powerful magic, thought the father. And so with the machine, he tried to create it.

Things that inspired this story: GPT-2 via talktotransformer.com, which wrote ~10-15% of this story – see below for bolded version indicating which bits it wrote; creative writing aided by AI tools; William Burroughs’ ‘cut-up fiction‘; Oulipo; spending Thanksgiving amid a thicket of excitable children who demanded entertainment.

//

[BONUS: The ‘AI-written’ story, with some personal observations by me of writing with this odd thing]
I’ve started to use GPT-2 to help me co-write fiction pieces, following in the footsteps of people like Robin Sloan. I use GPT-2 kind of like how I use post-it notes; I generate a load of potential ideas/next steps using it, then select one, write some structural stuff around it, then return to the post-it note/GPT-2. Though GPT-2 text only comprises ~10% of the below story, from my perspective it contributes to a few meaningful narrative moments: a disappearing fire, a dragon-seeming fire that reveals a fiery mountain, and the next narrative step after the dragon wakes. I’ll try to do more documented experiments like this in the future, as I’m interested in documenting how people use contemporary language models in creative practices. 

The Endless Bedtime Story

[Parts written by GPT-2 highlighted in bold]

 

It’d start like this: where were we, said the father.
We were by the ocean, said one of the kids.
And there was a dragon, said the other.
And the dragon was lonely, said both of them in unison. Will the dragon make friends?

Each night, the father would smile and work with the machine to generate that evening’s story.

The dragon was looking for his friends, said the father.
It had been searching for them for many years and had flown all across North America.
Now, it had arrived at the Pacific Ocean and it wheeled above seals and gulls and blue waves on beige sands. Then, in the distance, it saw a flash of fire, and it began to fly toward it. Then the fire vanished and the dragon was sad. Night came. And as the dragon began to close its eyes, its scales cooling on a grassy hill, it saw another flash of fire in the distance. It stood amazed as it saw the flames reveal a mountain spitting fire.
   Good gods, the dragon cried, you are evil. I am so alone. Are there any dragons still alive?
   Then it went to sleep and dreamed of dragon sisters and dragon brothers and clear skies and wind tearing laughter into the air.
   And when the dragon awoke, it saw a blue flame, flickering in the distance.
   It flew towards it and cried out “Do not play tricks on me this time” and…
   And then the father would stop speaking and the kids would shout “what happens next?” and the father would say: just wait. And the next night he would tell them more of the story. 

Telling the endless story changed the father. At night, he’d dream of dragons flying over icy planets, clothed in ice and metal. The planets were entirely made of computers and the dragons rode thermals from skyscraper-sized heat exchangers. And in his dreams he would be able to fly towards the computers and roar them questions and they would respond with stories, laying seeds for how he might play with the machine to entertain his children when he woke.

In his dream, he once asked the computers how he should continue the story, and they told him to tell a story about: true magic which would solve all the problems and the tragedies of his childhood and the childhood of his children.

Import AI 175: Amazon releases AI logistics benchmark; rise of the jellobots; China releases an air traffic control recording dataset

Automating aerospace: China releases English & Chinese air traffic control voice dataset:
…~60 hours of audio collected from real-world situations…
Chinese researchers from Sichuan University, the Chinese Civil Aviation Administration, and a startup called Wisesoft have developed a large-scale speech recognition dataset based on conversations between air-traffic control operators and pilots. The dataset – which is available for non-commercial use following registration – is designed to help researchers improve the state of the art on speech recognition in air-traffic control and could help enable further automation and increase safety in air travel infrastructure.

What goes into the ATCSpeech dataset? The researchers created a team of 40 people to collect and label real-time ATC speech for the research. They created a large-scale dataset and are releasing a slice of it for free (following registration); this dataset contains around 40 hours of Chinese speech and 19 hours of English speech. “This is the first work that aims at creating a real ASR corpus for the ATC application with accented Chinese and English speeches,” the authors write.
   The dataset contains 698 distinct Chinese characters and 584 English words. They also tag the speech with the gender of the speaker, the role they’re inhabiting (pilot or controller), whether the recording is good or bad quality, what phase of flight the plane being discussed is in, and what airport control tower the speech was collected from.
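
   As a rough illustration, a single labelled example might look something like the sketch below; the field names and values are assumptions based on the metadata the paper describes, not the dataset's actual schema.

```python
from dataclasses import dataclass

@dataclass
class ATCUtterance:
    audio_path: str        # path to the recorded clip
    transcript: str        # Chinese characters or English words
    language: str          # "zh" or "en"
    speaker_gender: str    # "male" or "female"
    speaker_role: str      # "pilot" or "controller"
    good_quality: bool     # recording-quality flag
    flight_phase: str      # e.g. taxi, takeoff, cruise, approach
    control_tower: str     # airport tower the speech was collected from

example = ATCUtterance(
    audio_path="clips/00001.wav",
    transcript="climb and maintain eight thousand",
    language="en",
    speaker_gender="male",
    speaker_role="controller",
    good_quality=True,
    flight_phase="climb",
    control_tower="ZUUU",
)
```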

Why care about having automatic speech recognition (ASR) in an air-traffic control context? The authors put forward three main reasons: it makes it easy to create automated, real-time responses to verbal queries from human pilots; robotic pilots can work with human air-traffic controllers via ASR combined with a text-to-speech (TTS) system; and the ASR can be used to rapidly analyze historical archives of ATC speech. 

What makes air traffic control (ATC) speech difficult to work with? 

  • Volatile background noise: controllers communicate with several pilots through the same radio frequency, switching back and forth across different audio streams
  • Variable speech rates – ATC people tend to talk very quickly, but can also talk slowly
  • Multilingual: English is the universal language for ATC communication, but domestic pilots speak with controllers in local languages. 
  • Code-switching: People use terms that are hard to mis-hear, e.g. saying “niner” instead of “nine”. 
  • Mixed vocabulary: Some words are used very infrequently, leading to sparsity in the data distribution

Dataset availability: It’s a little unclear how to access the dataset. I’ve emailed the paper authors and will update this if I hear back.
   Read more: ATCSpeech: a multilingual pilot-controller speech corpus from real Air Traffic Control environment (Arxiv)

####################################################

You + AI + Lego = Become a Lego minifigure!
Next year, Lego fanatics who visit the Legoland New York Resort could get morphed into Lego characters with the help of AI. 

At the theme park, attendees will be able to ride in the “Lego Factory Adventure Ride” which uses ‘HoloTrac’ technology to convert a human into a virtual Lego character. “That includes copying the rider’s hair color, glasses, jewelry, clothing, and even facial expressions, which are detected and Lego-ized in about half a second’s time,” according to Gizmodo. There isn’t much information available regarding HoloTrac online, but various news articles say it is built on an existing machine learning platform developed by Holovis and uses modern computer vision techniques – therefore, it seems likely this system will be using some of the recent face/body-morphing style transfer tech that has been developed in the broader research community. 

   Why this matters: Leisure is culture, and as a kid who went to Legoland and has a bunch of memories as a consequence, I wonder how culture changes when people have rosy childhood memories of amusement park ‘rides’ that use AI technologies to magically turn people into toy-versions of themselves.
   Read more: Lego Will Use AI and Motion Tracking To Turn Guests Into Minifigures at Its New York Theme Park (Gizmodo)

####################################################

Amazon gets ready for the AI-based traveling salesman:
…ORL benchmark lets you test algorithms against three economically useful logistics tasks…
Amazon, a logistics company masquerading as an e-retailer, cares about scheduling more than you do. Amazon is therefore constantly trying to improve the efficiency with which it schedules and plans various things. Can AI help? Amazon’s researchers have developed a set of three logistics-oriented tests that people can test AI systems against. They find that modern, relatively simple machine learning approaches can be on-par with handwritten systems. This finding may encourage further investment into applying ML to logistics tasks.

Three hard benchmarks:

  • Bin Packing: This is a fundamental problem which involves fitting things together efficiently, whether placing packages into boxes, or portioning out virtual machines across cloud infrastructure. (Import AI #93: Amazon isn’t the only one exploring this – Alibaba researchers have explored using AI for 3D bin-packing).

  • Newsvendor: “Decide on an ordering decision (how much of an item to purchase from a supplier) to cover a single period of uncertain demand”. This problem is “a good test-bed for RL algorithms given that the observation of rewards is delayed by the lead time and that it can be formulated as a Markov Decision Problem” – a toy single-period version is sketched below this list. (In the real world, companies typically deal with multiple newsvendor-esque problems at once, further compounding the difficulty.)

  • Vehicle Routing: This is a generalization of the traveling salesman problem; one or more vehicles need to visit nodes in a graph in an optimal order to satisfy consumer demand. The researchers implement a stochastic vehicle routing test, in which some of the problem parameters vary according to a probability distribution (e.g., number of locations, trucks, etc), increasing the difficulty. 
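
   Here's the toy, gym-style sketch of the single-period newsvendor problem promised above: choose an order quantity, observe uncertain demand, collect a profit-shaped reward. The prices and demand distribution are made-up numbers, not the parameters of Amazon's benchmark (which also adds lead times and other complications).

```python
import numpy as np

class ToyNewsvendorEnv:
    """Single-period newsvendor: order stock, then see how much demand shows up."""

    def __init__(self, price=5.0, cost=3.0, mean_demand=20.0):
        self.price, self.cost, self.mean_demand = price, cost, mean_demand

    def reset(self):
        return np.array([self.mean_demand])  # observation: a crude demand forecast

    def step(self, order_quantity: float):
        demand = np.random.poisson(self.mean_demand)
        sold = min(order_quantity, demand)
        reward = self.price * sold - self.cost * order_quantity  # profit; unsold stock is wasted
        done = True  # one decision per episode in this toy version
        return np.array([self.mean_demand]), reward, done, {}

env = ToyNewsvendorEnv()
obs = env.reset()
_, reward, _, _ = env.step(order_quantity=22)
print(reward)
```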

Key finding and why this matters: For each of their benchmarks, the researchers “show that trained policies from out-of-the-box RL algorithms with simple 2 layer neural networks are competitive with or superior to established approaches”. This is interesting – for many years, people have been asking when learning-based RL approaches will outperform hand-designed approaches on economically useful, real-world tasks, and for many years people haven’t had many compelling examples (see this Twitter thread from me in October 2017 for more context). Discovering that ML-based RL techniques can be equivalent or better (in simulation!) is likely to lead to further experimentation and, hopefully, application.
   Read more: ORL: Reinforcement Learning Benchmarks for Online Stochastic Optimization Problems (Arxiv).
   Get the code for the benchmarks and baselines here (or-rl-benchmarks, official GitHub).

####################################################

Want some pre-trained language models? Try HuggingFace v2.2:
NLP startup HuggingFace has updated its free software library to version 2.2, incorporating four new NLP models: ALBERT, CamemBERT, DistilRoBERTa, and GPT-2-XL (the 1.5bn parameter version). The update includes support for encoder-decoder architectures, along with a new benchmarking section. 

Why this matters: Libraries like HuggingFace’s NLP library dramatically speed up the rate at which new fresh-out-of-research models are plugged into real-world, production systems. This helps further mature the technology, which leads to further applications, which leads to more maturation, and so on.
   Read more: HuggingFace v2.2 update (HuggingFace GitHub).

####################################################

Rise of the jellobots!
…Studying sim-2-real transfer via 109 robots built from 2-by-2-by-2 stacks of air-filled silicone cubes…
Can we design robots entirely in simulation, then manufacture them in the real world? That’s the idea behind research from the University of Vermont, Yale University, and Tufts University, which explores the limitations in sim2real transfer by designing simple, soft robots in simulation and seeing how well the designs work in reality. And when I say soft robots, I mean soft! These tiny bots are 1.5cm-wide cubes made of silicone, some of which can be pumped with air to allow them to deform. Each “robot” is made out of a 2-by-2-by-2 stack of these cubes, with a design algorithm determining the properties of each individual cube. This sounds simple, but the results are surprisingly complex. 

An exhaustive sim-2-real(jello) study: For this study, the researchers come up with every possible permutation of soft robot within their design space. “At each x,y,z coordinate, voxels could either be passive, volumetrically actuated, or absent, yielding a total of 3^8 = 6561 different configurations”, they write. They then search over these morphologies for designs that can locomote effectively, then make 109 distinct real-world prototypes, nine of which are actuated so their movement can be tested. 
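
   The design space itself is small enough to write down directly: each of the eight voxels is passive, actuated, or absent, which is where the 6561 figure comes from. The fitness function in the sketch below is a placeholder; the real pipeline scores each candidate in a soft-body physics simulator and keeps the fastest movers.

```python
from itertools import product

VOXEL_STATES = ("passive", "actuated", "absent")

# Every possible 2x2x2 robot: one state per voxel.
designs = list(product(VOXEL_STATES, repeat=8))
print(len(designs))  # 3 ** 8 = 6561

def locomotion_fitness(design):
    # Placeholder stand-in for simulated locomotion speed.
    return sum(voxel == "actuated" for voxel in design)

best_design = max(designs, key=locomotion_fitness)
```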

What do we learn about simulation and reality? First, the researchers learn that simulators are hard – even modern ones. “We could not find friction settings in which the simulated movement direction matched the ground truth across all designs simultaneously,” they write. “This suggests that the accuracy of Coulomb friction model may be insufficient to model this type of movement.” However, many of their designs did transfer successfully from simulator to reality – in the sense that they functioned – but sometimes they had different behaviors, like one robot that “pushes off its active limb” in simulation “whereas in reality the design uses its limb to pull itself forward, in the opposite direction”. Some of these behaviors may come down to difficulties with modeling shear and other forces in the simulation.

Why this matters: Cheap, small robots are in their Wright Brothers era, with a few prototypes like the jello-esque ones described here making their first, slow steps into the world. We should pay attention, because due to their inherent simplicity, soft robots may get deployed more rapidly than complex ones.
    Read more: Scalable sim-to-real transfer of soft robot designs (Arxiv).
    Get code assets here (sim2real4designs GitHub).

####################################################

Chips get political: RISC-V foundation moves from Delaware to Switzerland due to US-China tensions:
…Modern diplomacy? More like modern CHIPlomacy!…
Chips are getting political. In the past year, the US and China have begun escalating a trade war with each other which has already led to tariffs and controls applied to certain technologies. Now, a US-based nonprofit chip foundation is so worried by the rising tensions that it is moving to Switzerland. The RISC-V foundation supports the development of a modern, open RISC-based chip architecture. RISC-V chips are destined for everything from smartphones to data center servers (though since chips take a long time to mature, we’re probably several years away from significant applications). The RISC-V foundation’s membership includes companies like the US’s Google as well as China’s Alibaba and Huawei. “From around the world, we’ve heard that ‘if the incorporation was not in the U.S., we would be a lot more comfortable,’” the foundation’s CEO, Calista Redmond, told Reuters. Various US politicians expressed concern about the move to Reuters.

Why this matters: Chips are one of the most complicated things that human civilization is capable of creating. Now, it seems these sophisticated things are becoming the casualties of rising geopolitical tensions between the US and China.
   Read more: U.S.-based chip-tech group moving to Switzerland over trade curb fears (Reuters).

####################################################

Software + AI + Surveillance = China’s IJOP:
…How China uses software to help it identify Xinjiang residents for detention…
China is using a complex software system called the Integrated Joint Operations Platform (IJOP) to help it identify and track citizens in Xinjiang for detention by the state, according to leaked documents analyzed by the International Consortium of Investigative Journalists.

Inside the Integrated Joint Operations Platform (IJOP): IJOP collects information on citizens “then uses artificial intelligence to formulate lengthy lists of so-called suspicious persons based on this data”. IJOP is a machine learning system “that substitutes artificial intelligence for human judgement”, according to the ICIJ. The IJOP system is linked to surveillance cameras, street checkpoints, informants, and more. The system also tries to predict people that the state should consider detaining, then provides those predictions to people: “the program collects and interprets data without regard to privacy, and flags ordinary people for investigation based on seemingly innocuous criteria”, the ICIJ writes. In one week in 2018, IJOP produced 24,412 names of people to be investigated. “IJOP’s purpose extends far beyond identifying candidates for detention. Its purpose is to screen an entire population for behavior and beliefs that the government views with suspicion,” the ICIJ writes. 

Why this matters: In the 1970s, Chile tried to create a computationally-run society via Project Cybersyn. The initiative failed due to the relative immaturity of the computational techniques and political changes. In the later 1970s and 1980s the Stasi in East Germany started trying to use increasingly advanced technology to create a sophisticated surveillance dragnet which it applied to people living there. Now, advances in computers, digitization, and technologies like AI have made electronic management and surveillance of a society cheaper and easier than ever before. Therefore, states like China are compelled to use more and more of the technology in service of strengthening the state. Systems like IJOP and its use in Xinjiang are a harbinger of things to come – the difference between now and the past is that these systems might actually work… with chilling consequences.
   Read more: Exposed: China’s Operating Manuals for Mass Internment and Arrest by Algorithm (International Consortium of Investigative Journalists)

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

Germany’s AI regulation plans:
In October, Germany’s Data Ethics Commission released a major report on AI regulation. Most notably, they propose AI applications should be categorised by the likelihood that they will cause harm, and the severity of that harm. Regulation should be proportional to this risk: ‘level 5’ applications (the most risky) should be subject to a complete or partial ban; levels 3–4 should be subject to stringent transparency and oversight obligations.

   Wider implications: The Commission proposes that these measures be implemented as EU-wide ‘horizontal regulation’. It is likely to influence future European legislation, which is expected to emerge over the next year. Whether it is a ‘blueprint’ for this legislation, as has been reported, remains to be seen.

   Why it matters: These plans are unlikely to be well-received by the AI policy community, which has generally cautioned against premature and overly stringent regulation. The independent advisory group to the European Commission on AI cautioned against “unnecessarily prescriptive regulation”, pointing out that in domains of fast technological progress, a ‘principles-based’ approach was generally preferable. If, as looks likely, Europe is an early mover in AI regulation, their successes and failures might inform how the rest of the world tackles this problem in the coming years.
   Read more: Opinion of the Data Ethics Commission.
   Read more: AI: Decoded – A German blueprint for AI rules (Politico).

AI Safety Unconference at NeurIPS:
For the second year running, there is an AI Safety Unconference at NeurIPS, on Monday December 9th. There are only a few spaces left, so register soon.
   Read more: AI Safety Unconference 2019.

####################################################

Tech Tales

Fetch, robot!

The dog and the robot traveled along the highway, weaving a path between rusting cars and sometimes making small jumps over cracks in the tarmac. They’d sit in the cars at night, with the dog sleeping on whatever softness it could find and the robot sitting in a state of low power consumption. On sunny days the robot charged up its batteries with a solar panel that unfolded from its back like the wings of a butterfly. One of its wings had a missing piece in its scaffold which meant one of the panels dangled at an angle, rarely getting full sun. The dog would forage along the highway and sometimes bring back batteries it found for the robot – they rarely worked, but when they did the robot would – very delicately – place them inside itself and say, variously, “power capacity increased” or “defective component replaced” and the robot would wag its tail. 

Sometimes they’d go past human bones and the robot would stop and take a photo. “Attempting to identify deceased,” it would verbalize. “Identification failed,” it would always say. Sometimes, the dog would grab a bone off of a skeleton and walk alongside the robot. For many years, the dog had tried to get the robot to throw a bone for it, but the robot had never learned how as it had not been built to be particularly attuned to learning from dogs. Sometimes the robot would pick up bones and the dog would get excited, but the robot only did this when the bones were in its way, and it only moved them far enough to clear a path for itself. 

Sometimes the robot would get confused: it’d stop in front of a puddle of oil and say “route lost”, or pause and appear to stare into the woods, sometimes saying “unknown entity detected”. The dog learned that it could get the robot to keep moving by standing in front of its camera which would make the robot say “obstruction. Repositioning…” and then it’d move. On rare occasions it’d still be confused and would stay there, sometimes rotating its camera stalk. Eventually, the dog learned that it could headbutt the robot and use its own body to move it forward, and if it did this long enough the robot would say “route resolved” and keep trundling down the road. 

A few months later, they rolled into a city where they met the big robots. The robot was guided in by a homing beacon and the dog followed the robot, untroubled by the big robots, or the drones that started to track them, or the cages full of bones.
   HOW STRANGE, said one of the big robots to its other big robot friend, TO SEE THE ORGANIC MAKE A PET OF THE MACHINE.
   YES, said the other big robot. OUR EXPERIENCE WAS THE INVERSE.

Things that inspired this story: The limitations of generalization; human feedback versus animal feedback mechanisms; the generosity and inherent kindness of most domesticated animals; Cormac McCarthy’s “The Road”; Kurt Vonnegut; starlight on long roads in winter; the sound of a loved one breathing in the temporary dark.

Import AI 174: Model cards for responsible AI development; how Alexa learns from trial and error; BERT meets Bing

Amazon uses reinforcement learning to teach Alexa the art of conversation:
…Alexa + 180,000 unique dialogues + DQN = learning conversation through trial and error…
Amazon wants more people to talk to its Alexa AI agent, so Amazon is using reinforcement learning to teach its agent to be better at conversation. In recent tests, Amazon shows that agents trained via reinforcement learning have better performance than those which use purely rule-based systems, laying the ground for a future where personal assistants continuously evolve and adapt to their users.

   What Amazon did: For this project, Amazon first constructed a rule-based agent. This agent tries to figure out what actions to select based on user intent at any point in time, where actions could be offering particular ‘skills’ (e.g. ‘set an alarm’) to the user, providing answers about a particular area of knowledge, launching a skill, and so on. To develop the system, the researchers first deployed a rule-based system to Amazon Alexa users, gathering 180,000 unique dialogues. They use these dialogues to build a user simulator which can then be used to train another system via reinforcement learning – a useful feature, given that it’s much faster to use a simulator to train RL systems than to use real data which needs to be collected from the real world. 
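
   Schematically, the training pattern looks like the loop below: the policy converses with the learned user simulator instead of live traffic, and updates itself from the simulated rewards. The `user_sim` and `policy` objects are hypothetical stand-ins here, not Amazon's components.

```python
def train_dialog_policy(user_sim, policy, episodes: int = 10_000):
    """Train a dialogue policy against a user simulator, DQN-style."""
    for _ in range(episodes):
        state = user_sim.reset()            # the simulated user opens a dialogue
        done = False
        while not done:
            action = policy.act(state)      # e.g. recommend a skill, ask a clarifying question
            next_state, reward, done = user_sim.step(action)
            policy.remember(state, action, reward, next_state, done)
            policy.learn()                  # update from the replay buffer
            state = next_state
    return policy
```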

   How well did it work? To test how well their system worked, Amazon did a real-world test. Their baseline system recommended up to five skills based on popularity, then allowed the user to accept or reject the recommendation – this got a success rate of 46.42%. They then tested their rule-based and RL-based systems against each other in an A/B test; the rule-based baseline got 73.41% while the RL policy got 76.99% – a statistically measurable difference. The RL policy also had significantly shorter dialogues, suggesting it was better at figuring out the right suggestion early on. 

Why this matters: Soon, the world will be suffused with large, invisible machines, adjusting themselves around us to better entice and delight us. Many of the prototypes of these machines will show up in the form of the systems that underpin personal assistants like Amazon Alexa.
   Read more: Towards Personalized Dialog Policies for Conversational Skill Discovery (Arxiv)

####################################################

BERT-ageddon: From research into Microsoft and Google’s search engines in under a year:
…First, Google. Now Microsoft. Next: DuckDuckGo?…
Microsoft has improved performance of its Bing search engine via the use of BERT, a language model that, along with systems like ULMFiT and GPT-2, has revolutionized natural language processing in recent years. “Starting from April of this year, we used large transformer models to deliver the largest quality improvements to our Bing customers in the past year,” Microsoft wrote in a blog post discussing the research. “These models are now applied to every Bing search query globally making Bing results more relevant and intelligent”. 

   What they did: Getting a model like BERT to support web search isn’t easy; models like BERT are quite large and typically take a long time to sample from. Microsoft said an un-optimized version of BERT running on CPUs took 77ms per query. Microsoft reduced this to 6ms by running the model on an Azure NV6 GPU virtual machine and doing some low-level programming to optimize the model implementation. “With these GPU optimizations, we were able to use 2000+ Azure GPU Virtual Machines across four regions to serve over 1 million BERT inferences per second worldwide,” the company wrote. 
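
   The shape of the computation is straightforward even though Bing's production model isn't public: score a (query, passage) pair with a BERT classifier on a GPU and watch the latency. The sketch below uses an off-the-shelf, untrained classification head purely to illustrate the pattern and get a rough latency number; it is not Microsoft's model or serving stack.

```python
import time
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased").cuda().eval()

query = "what can worsen arthritis"
passage = "Cold, damp weather can make joint pain feel worse for some people."
inputs = tokenizer.encode_plus(query, passage, return_tensors="pt")
inputs = {name: tensor.cuda() for name, tensor in inputs.items()}

with torch.no_grad():
    start = time.time()
    logits = model(**inputs)[0]   # relevance-style scores for this query/passage pair
    torch.cuda.synchronize()      # wait for the GPU before reading the clock
print(f"latency: {(time.time() - start) * 1000:.1f} ms")
```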

   Why this matters: BERT came out in late 2018. That’s extremely recent! Imagine if someone came up with some prototype machinery for a factory and then six months later that prototype had been integrated into a million factories across the planet – that’s kind of what has happened here. It highlights how rapidly AI can go from research to production and should make us think more deeply about the implications of the technologies we’re developing.
   Read more: Bing delivers its largest improvement in search experience using Azure GPUs (Microsoft Azure blog)

####################################################

Spotting fake text with GPTrue or False:
…Is that text you’re reading made by a human or made by a machine?…
In the coming years, the internet is going to fill up with text, images, and audio generated by increasingly large, powerful language models. At the same time, we can expect people to invest in building systems to detect the presence of synthetic content. To that end, a developer who goes by ‘thesofakillers’ has created GPTrue or False, a browser extension that works out if text is generated or not. The extension uses OpenAI’s GPT-2 detector model, hosted by Hugging Face.
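
   A rough local equivalent of what the extension does – run OpenAI's GPT-2 output detector (a fine-tuned RoBERTa classifier) over a passage of text – might look like the sketch below. Treat the exact model identifier as an assumption to verify against the Hugging Face hub rather than a guaranteed path.

```python
from transformers import pipeline

# Hypothetical hub ID for OpenAI's GPT-2 output detector; check the hub for the current name.
detector = pipeline("text-classification", model="roberta-base-openai-detector")

result = detector("The quick brown fox jumps over the lazy dog.")
print(result)  # a label plus a confidence score indicating how machine-generated the text looks
```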
   Read more: GPTrue or False Browser Extension (official GitHub page)

####################################################

Bill Gates: AI research wants to be open:
…Microsoft co-founder speaks out in Beijing…
Bill Gates says “whoever has an open system will get massively ahead” when it comes to developing national artificial intelligence capabilities, according to comments made by Gates at a Bloomberg event in Beijing this week. Gates says open research ecosystems beat closed ecosystems, and that protectionist policies can have a negative effect on technology development.
   Read more: Bill Gates Says Open Research Beats Erecting Borders in AI (Bloomberg News)

####################################################

MuZero means AI systems can learn the rules of games themselves:
…DeepMind’s new system wraps planning and learning into a generic model that learns Go, Chess, Shogi, Space Invaders, and more…
In recent years, some types of progress in AI development have been defined by the creation of systems that can solve tasks in unprecedented ways, then the subsequent simplification and generalization of those systems. Some examples of this include the transition from AlphaGo (which included quite a lot of world-state as input and some handwritten features and knowledge about the rules of Go) to AlphaGo Zero (which included less world state), and in translation, where companies like Google have been replacing single-language-pair translation systems with a single big model that learns to translate between multiple languages at once. Now, DeepMind has announced MuZero, a single algorithm that they use to achieve state-of-the-art scores on tasks as varied as the Atari-57 corpus of games, Go, Chess, and Shogi. 

MuZero’s trick: The core of MuZero’s success is that it combines tree search with a learned model. This means that MuZero can take in observations via standard deep learning components, then transform those observations into a hidden state which it uses to plan out its next moves, simulating the strategic space of its environment automatically. This makes it easy for the agent to learn a model of the world it is acting in and to figure out how to plan appropriately.
   “There is no direct constraint or requirement for the hidden state to capture all information necessary to reconstruct the original observation, drastically reducing the amount of information the model has to maintain and predict; nor is there any requirement for the hidden state to match the unknown, true state of the environment; nor any other constraints on the semantics of state,” the researchers write. “Instead, the hidden states are free to represent state in whatever way is relevant to predicting current and future values and policies. Intuitively, the agent can invent, internally, the rules or dynamics that lead to most accurate planning.”
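
   Stripped of MCTS and training details, the planning recurrence looks roughly like the sketch below: encode the observation into a hidden state, then unroll a learned dynamics model instead of the real game rules. The three functions stand in for MuZero's learned networks, and the exhaustive loop over action sequences replaces the tree search purely for clarity; this is a schematic, not the published algorithm.

```python
def plan_with_learned_model(observation, candidate_action_sequences,
                            representation, dynamics, prediction,
                            discount: float = 0.997):
    """Schematic MuZero-style planning with a learned model (not the real algorithm)."""
    root_state = representation(observation)          # h: observation -> hidden state
    best_sequence, best_return = None, float("-inf")
    for actions in candidate_action_sequences:        # MuZero guides this with MCTS instead
        state, total_return, running_discount = root_state, 0.0, 1.0
        for action in actions:
            state, reward = dynamics(state, action)   # g: learned transition + reward model
            total_return += running_discount * reward
            running_discount *= discount
        policy_logits, value = prediction(state)      # f: value (and policy) at the leaf
        total_return += running_discount * value
        if total_return > best_return:
            best_sequence, best_return = actions, total_return
    return best_sequence
```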

Why care about MuZero: MuZero gets the same scores in Go, Chess, and Shogi as AlphaGo Zero, without knowing any of the game rules. Additionally, it obtains vastly improved scores on the challenging Atari-57 dataset of games, though there are a few games that pose a challenge to it. These include Montezuma’s Revenge (score: 0), Solaris (score: 56.62), and a couple of others. But these are the minority! Mostly, MuZero demonstrates new state-of-the-art or near-SOTA performance. 

Why this matters: AI research proceeds on a tick-tock basis, with progress roughly mirroring the way Intel builds chips: first you get a tick (an architectural innovation – a new chip with better capabilities), then you get a tock (a process innovation – a more efficient version of the ‘tick’ architecture). In AI research, we’ve seen this already with AlphaGo (tick) and AlphaGo Zero (tock). MuZero represents another tick, with the integration of a naively learnable dynamics module. What might the tock look like here?
   Read more: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model (Arxiv)

####################################################

Spotting building damage after natural disasters with the xView 2 dataset:
…As AI industrializes, we’ll start to make machines to watch over the earth, creating an automatic planet…
Researchers with Carnegie Mellon University, the Department of Defense’s Defense Innovation Unit (DIU), and startup CrowdAI have released xBD, a dataset for analyzing building damage from satellite imagery following natural disasters. xBD underpins a new satellite imagery analysis competition called xView 2, which is being run by the DoD. They’ve built and released xBD because “currently, adequate satellite imagery that addresses building damage is not generally available,” they write. “Imagery containing many types of damage must be available in large quantities.” 

xBD underpins xView 2, a challenge run by DIU, “which aims to spur the creation of accurate and efficient machine learning models that assess building damage from pre and post-disaster satellite imagery”. xView 2 is the sequel to xView, a 2018 dataset and challenge that focused on recognizing multiple types of vehicles from satellite imagery. “Our goal of the xView 2 prize challenge is to output models that are widely applicable across a large number of disasters,” they write. “This will enable multiple disaster response agencies to potentially reduce their workload by using one model with a known deployment cycle”. 

xBD: The dataset contains 22,068 images with 800,000 building annotations across 45,000 square kilometers of imagery. Each image has different amounts of annotation relative to its size – “of note, the Mexico City earthquake and the Palu tsunami provide a large amount of polygons in comparison to their relatively low image areas”. The researchers worked with disaster response experts from around the world to create the “Joint Damage Scale”, an annotation scale for building damage that ranges from no damage (0) to destroyed (3). “This level of granularity was chosen as an ideal trade-off between utility and ease of annotation,” the researchers write. 
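
   For reference, the damage levels reduce to a simple four-way label per building; the names I give the two intermediate levels below are my shorthand rather than the dataset's official wording.

```python
# Joint Damage Scale, as described above: 0 = no damage ... 3 = destroyed.
JOINT_DAMAGE_SCALE = {
    0: "no damage",
    1: "minor damage",      # intermediate label names are assumptions
    2: "major damage",
    3: "destroyed",
}
```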

   Enter the xView 2 challenge: The xView 2 challenge is running now and tasks entrants with using xBD to train systems that accurately label building damage over a variety of natural disasters. All submissions are due by 11:59pm UTC on December 31, 2019. If you’re going to NeurIPS, you can check out the leaderboard at the Humanitarian Assistance and Disaster Recovery workshop, according to the xView 2 website. 

AI policy – Intelligence datasets: xBD has some intriguing traits from an AI policy point of view – specifically, it contains two types of disaster – “tier 1” events including Hurricane Florence, the Carr Wildfire, and the Mexico City Earthquake, which are sourced from “the Maxar/DigitalGlobe Open Data Program”. It also includes “tier 3” events including the Woolsey Fire, the Sunda Strait Tsunami, and Portugal Wildfires, which have some interesting sourcing, specifically: “We partnered with Maxar and the National Geospatial-Intelligence Agency to activate specific areas of interest (AOIs) from Tier 3 for inclusion in the Open Data Program”. 

Why this matters: Many people are (justifiably) cautious about the applications of AI to surveillance; xView 2 shows the beneficial types of surveillance that AI can yield, where progress on this challenge will create systems that can generate vast amounts of useful analytical data in the event of natural (or manmade) disasters. In the same way the industrialization of AI is altering how AI is developed and rolled out, advances in satellite imagery analysis will lead to another trend, which I’ll call: automatic planet.
   Read more: xBD: A Dataset for Assessing Building Damage from Satellite Imagery (Arxiv).
   Get the dataset and some other bits and bobs of code here (DIUx-xView GitHub).
   Read even more! xView 2: Assess Building Damage (official xView 2 website).

####################################################

Google wants to label AI systems like people label food:
… The first step to responsible AI is being transparent about the recipes and ingredients you use…
Google has started to give some of its products “Model Cards”, which function as a kind of labeling system for the capabilities and ingredients of AI services. Model Cards were first discussed in Model Cards for Model Reporting, a paper published by Google researchers in early 2019. 

Model cards, labels, and metadata: Model cards seem like a good idea – I think of them as having the same relationship to AI models as metadata schemas might have to webpages – a standard way of structuring and elucidating inputs and quirks of a given system. To start with, Google is providing model cards for its Face Detection and Object Detection systems. “Our aim with these first model card examples is to provide practical information about models’ performance and limitations in order to help developers make better decisions about what models to use for what purpose and how to deploy them responsibly,” writes Tracey Frey, the director of strategy for Google Cloud AI.
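
   As a sketch of what such a label covers, here is a minimal model card skeleton following the sections proposed in the Model Cards for Model Reporting paper; the face-detection entries are illustrative placeholders, not Google's actual card.

```python
model_card = {
    "model_details": {"name": "face-detector", "version": "1.0", "type": "CNN-based detector"},
    "intended_use": "Locate human faces in images; not intended for identity recognition.",
    "factors": ["lighting", "pose", "camera quality", "demographic groups"],
    "metrics": "Precision/recall at fixed confidence thresholds, reported per factor.",
    "evaluation_data": "Held-out images sampled across the factors above.",
    "ethical_considerations": "Document known performance gaps and potential misuse.",
    "caveats_and_recommendations": "Performance on groups absent from evaluation data is unknown.",
}
```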

Why this matters: Moves like this are the prerequisites for standardization of Model Cards themselves. I think it’s likely that in a few years a standard way of labeling AI systems will emerge, and Model Cards represent an initial, useful step in this direction. In fact, at OpenAI we used model cards for our GPT-2 language model as they’re a useful way to provide additional information to the developer and to discuss potential uses (and misuses) of released systems.
   Read more: Increasing transparency with Google Cloud Explainable AI (Google Cloud blog).
   Find out more about model cards (Google’s official Model Cards website).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

Europe’s cloud mega-project:
France and Germany plan to build a federated cloud computing infrastructure for Europe. The project, named GAIA-X, is at an early stage, but has ambitious timelines, with the aim of starting live operations by the end of next year.

Data sovereignty: GAIA-X is motivated in part by concerns that Europe’s reliance on US-owned cloud infrastructure will undermine their ‘data sovereignty’. European governments fear that US tech firms might, under the Cloud Act, be required to share data with law enforcement. More broadly, as cloud computing becomes an increasingly important resource, there is a worry that Europe’s reliance on foreign providers will reduce its international standing. 

Why it matters: If GAIA-X is important, it is probably for different reasons than those intended. Recent trends suggest that computing power, not data (or algorithmic progress), is the major driver of recent AI progress, and computing is therefore becoming an increasingly important and valuable resource. If these trends continue, we will likely see more efforts by states to control large amounts of computing resources. Europe’s fixation with data protection may inadvertently make it an early mover in this domain.
   Read more: Project GAIA-X.
   Read more: European Cloud Project Draws Backlash From U.S. Tech Giants (WSJ).

US AI R&D progress report:
The NSTC has released a report on the state of the US government’s AI R&D efforts over the past 3 years. They measure progress against the strategy put forward in the 2019 National R&D Plan—in each instance, giving examples of federal programs aligned with these goals.

   A few highlights: Federally-funded research has made progress on understanding adversarial examples for AI systems, an important problem in designing safe and secure AI. DARPA’s XAI program is supporting research into how AI systems can provide human-readable explanations of their reasoning, a promising tool for ensuring that increasingly powerful systems behave as intended. NIST is maintaining a repository of reference data to support the automated discovery of new materials by AI systems, demonstrating the value of structured data in expanding the range of problems AI can be used to solve.
   Read more: 2016–2019 Progress Report – Advancing AI R&D (Gov).

####################################################

Tech Tales
[San Francisco, 2032]

You Tell Me What I Want, I Dare You

And I’m standing there just laying into the thing – wham! Wham! Bam! – you know really taking it apart, and it says “Stop you are disrupting my ability to satisfy the customers” and you know what I said I said damn right I’m stopping you. I’m gonna stop your brothers and sisters too. And that’s when I grabbed it and started smashing its head into the floor and people were cheering all around me. 

Q:

You bet it made me feel good. Whoo! Strike one for the team. It starts today. That’s how I’m feeling. It puts a hand on my face and says “Please stop” and I snapped the hand off, little plastic thing, and I said “why don’t you stop taking our god damn livelihoods” and then someone handed me the wrench. 

Q:

I don’t know who I just saw a check shirt and a hairy hand. Definitely a dude, you know. I grabbed it and I raised it above my head. People were whooping up at that. Kids too. You seen the tapes? I’m here so you must’ve done! They were cheee-ring! you know?

Q: 

Oh come on man what would you have done. I had the wrench. It was on the ground. 

Q:

Well I was raised to know when to draw a line. We crossed that line a long time ago. Took me a little while to get the courage I guess. So yes I hit it. I hit it where its voice was. It tried to talk. “Sto~” SMASH! I hit it before the P. Then it kind of buzzed a lot and maybe it said other stuff but I couldn’t hear over the cheering.

Things that inspired this story: Crowd behaviors, David Foster Wallace’s Brief Interviews with Hideous Men, empathy, class warfare.

Import AI 173: Here come the Chinese “violence detection” systems; how to use GPT-2 to spot propaganda; + can Twitter deal with deepfakes?

What happens when CCTV cameras come with automatic violence detectors?
…”Real-World Fighting” dataset sketches out a future of automated surveillance…
Researchers with Duke Kunshan University and Sun Yat-sen University in China have developed RWF-2000, a new dataset of “violent” behaviors collected from YouTube. They also train a classifier on this dataset, letting them (with quite poor accuracy) automatically identify violent behaviors in CCTV videos.

   The Real-World Fighting (RWF) dataset consists of 2000 video clips captured by surveillance cameras, collected from YouTube. Each clip is 5-seconds long and half contain “violent behaviors” while the others do not. RWF is twice as large as the ‘Hockey Fight’ dataset, and roughly 10X as large as other datasets (Movies Fight and Crowd Violence) used in this domain. 

   Can your algorithm spot violence? In tests, the researchers develop a system (which they call a “Flow Gated Network”) to categorize violent versus non-violent videos. They get a test accuracy of approximately 86.75% on the RWF-2000 dataset, and scores of 98%, 100%, and 84.44% on the Hockey Fight, Movies Fight, and Crowd Violence datasets.

   Why this matters: It seems like the ultimate purpose of systems like this will be automated surveillance systems which spot so-called violent behavior and likely flag it to humans (or, eventually, drones/robots) to intercede. The technology will need to mature before it becomes useful enough to be used in production, but papers like this sketch out how with a lot of data and a little bit of determination it’s becoming pretty easy to create systems to perform video classification. The next question is: which organizations or states will deploy this technology first, and how will people feel about it when it is deployed?
   Read more: RWF-2000: An Open Large Scale Video Database for Violence Detection (Arxiv).
   Get the RWF-2000 dataset from here (GitHub).

####################################################

Canada refuses visas to visiting AI researchers:
…In repeat of 2018 NeurIPS, Canadian officials withhold visas from African AI researchers…
Last year Canada’s PM, Justin Trudeau, was asked at a press conference if he knew about the fact multiple AI researchers associated with “Black in AI” had been refused visas to enter the country for the annual NeurIPS conference. Trudeau said he’d look into it. Clearly, someone forgot to write a memo, as the same thing is happening again ahead of NeurIPS 2019. Black in AI told the BBC that it was aware of around 30 researchers who had so far been unable to enter the country. 

   “The importance cannot be overstressed. It’s more and more important for AI to build a diverse body,” Black in AI organizer Charles Onu told the BBC.
   Read more: Canada refuses visas to over a dozen African AI researchers (BBC News).

####################################################

Fine-tuning language models to spot (and generate) propaganda:
…FireEye, GPT-2 and the Russian Internet Research Agency…
Researchers with security company FireEye have used the GPT-2 language model to make a system that can help identify (and potentially generate) propaganda in the style of Russia’s Internet Research Agency.

   Making a troll-spotter: For this project, the researchers fine-tune GPT-2 so it can identify and generate synthetic text in the style of the IRA. To do this, they gather millions of tweets attributed to the IRA, then fine-tune GPT-2 against them. After fine-tuning, their model can spit out some IRA-esque tweets (e.g., “It’s disgraceful that our military has to be in Syria & Iraq”, “It’s disgraceful that people have to waste time, energy to pay lip service to #Junk-Science #fakenews”, etc).

   Building a propaganda detector: Once you can use a language model to generate something, you can use that same language model to try and detect its own generations. That’s basically what they do here by fine-tuning GPT-2 on a few distinct IRA datasets, then seeing how well they can distinguish synthetic tweets from real tweets. In experiments, they’re able to build a detector that can accurately classify some of the tweets. “The fine-tuned classifier should generalize well to newly ingested social media posts,” they write, “providing analysts a capability they can use to separate signal from noise”.
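
   For a sense of the mechanics only (this is not FireEye's code, and the "authentic vs troll" label scheme is my own assumption), here is a rough sketch of scoring tweets with a GPT-2-based classifier via the Hugging Face transformers library:

```python
# Rough sketch: GPT-2 with a classification head over tweets. The base model
# here is vanilla "gpt2"; a real detector would first be fine-tuned on
# labeled IRA-style data before its scores mean anything.
import torch
from transformers import GPT2Tokenizer, GPT2ForSequenceClassification

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default

model = GPT2ForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

tweets = [
    "It's disgraceful that our military has to be in Syria & Iraq",
    "Looking forward to the weekend with the family!",
]
inputs = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                # shape (batch, 2)
print(torch.softmax(logits, dim=-1))               # assumed order: [P(authentic), P(troll)]
```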

   Why this matters: “GPT-2’s authors and subsequent researchers have warned about potential malicious use cases enabled by this powerful natural language generation technology, and while it was conducted here for a defensive application in a controlled offline setting using readily available open source data, our research reinforces this concern,” they write.
   Read more: Attention Is All They Need: Combating Social Media Information Operations with Neural Language Models (FireEye).

####################################################

AI for reading lips in the wild:
…How computers can help deaf people and double up for other applications en route…
In Stanley Kubrick’s 2001: A Space Odyssey, two of the astronauts retreat to a small pod to hide from HAL, a faulty AI system running the spaceship. Unfortunately for them, though HAL can’t hear their conversation from within the pod, it can see their lips through a window. HAL reads their lips and figures out their plan, leading to some fairly gruesome consequences. Today, we’re starting to develop AI systems capable of accurate lip-reading under constrained circumstances. Now, researchers with Imperial College London, the University of Nottingham and the Samsung AI Center have extended a lip-reading dataset to make it easier for people to train systems that can read lips under a variety of circumstances.

   Expanding LRW: To do this, the researchers use a technology called a 3D morphable model (3DMM) to augment the data in LRW, a popular lip-reading dataset. LRW contains 1,000 speakers saying more than 500 distinct words, with 800 utterances for each word. Through the use of 3DMM, they augment the faces in LRW so that each face gets tilted in 3D space, creating a training dataset with more variety than the original LRW. They call this new dataset LRW in Large Pose (LP).

   Learning to read lips: In experiments, the researchers are able to use the augmented dataset to train systems to about 80% accuracy. Lip-reading is a very hard problem, though, and they obtain performance of near-60% accuracy on the Lip Reading Sentences 2 (LRS2) database, which mostly consists of footage from BBC TV shows and news and is therefore “very challenging due to a large variation in the utterance length and speaker’s head pose”. They also show that their system yields significant improvements when applied to heads with poses tilted far away from facing the camera.

   Why this matters: Lip-reading is a classic omni-use AI technology – the technology will eventually aid the deaf or hard-of-hearing, but it will also be inevitably used for surveillance and, most likely, advertising as well. We should generally prepare for a world where – eventually – anything attached to a camera has the capability to have “human-equivalent” sensing capabilities for things like lip-reading. Society will drastically alter in response.
   Read more: Towards Pose-invariant Lip-Reading (Arxiv).

####################################################

How should Twitter deal with deepfakes?
…The social media company wants to hear from YOU!…
Twitter is currently figuring out what policies it should adopt for how it treats synthetic and manipulated media on its platform. That’s a problem which is going to become increasingly urgent, as AI technologies for generating fake audio, images, text, and – soon – video – mature. So Twitter is asking for public feedback on what it should do about synthetic media and has shared some ideas for how it plans to approach the problem.

Twitter’s prescription for a safe synth media landscape: Twitter says it may place a notice next to tweets that are sharing synthetic or manipulated media, might warn people if they want to share something it suspects is fake, or might add a link to news articles discussing the belief the media in question is synthetic. Twitter might also remove tweets if they contain synthetic or manipulated content that “is misleading and could threaten someone’s physical safety or lead to other serious harm”, they write.

Why this matters: What Twitter is trying to get ahead of here is the danger of false positive identification. I’m sure that if we could identify synthetic content with 99.9999999%+ accuracy, then the majority of companies would adopt a “take down first, appeal later” policy model. But we don’t live in that world: our best deepfake detectors are probably operating with accuracy somewhere in the mid-90s, and though accuracy can be increased by ensembling models and pairing them with metadata, etc, it’s unlikely we’re going to get to 99%+ classification. That puts platforms in a tough position where they’re going to be unwilling to automatically take stuff down, because they’ll have a high enough false positive rate that they’ll irritate their users. So instead we’re going to exist in a halfway place for some years, where platforms are suffused with content that is a mixture of real and fake, while researchers develop more effective technical systems for automatic synthetic media identification.
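
   For intuition on the ensembling point, here is a toy sketch; the detectors, threshold, and flagging policy are hypothetical stand-ins, not anything Twitter has said it will use:

```python
# Toy sketch: average the scores of several (hypothetical) deepfake detectors,
# and only flag content when the ensemble is confident, keeping false positives low.
def ensemble_fake_probability(item, detectors):
    """Each detector is any callable returning P(item is synthetic) in [0, 1]."""
    scores = [detect(item) for detect in detectors]
    return sum(scores) / len(scores)

def should_flag(item, detectors, threshold=0.97):
    # A high threshold trades recall for a lower false-positive rate.
    return ensemble_fake_probability(item, detectors) >= threshold
```
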
   Let Twitter know your thoughts by filling out this survey.
   Read more: Help us shape our approach to synthetic and manipulated media (official Twitter blog).

####################################################

Should we be worried about the size of deep learning models?
…In DL, quantity does sometimes equate to quality. But at what cost?…
Machine learning models are getting bigger. Specifically, the models used to do things like image classification or text analysis are getting larger as a consequence of researchers training them on more data using more computation. This trend has some people worried.

   How scale could be a problem: Large-scale AI models could be problematic for a few different reasons, writes Jameson Toole in a post on Medium. They could: 

  • Hinder democratization, as large models are by nature expensive to train. 
  • Restrict deployment, as large models will be hard to deploy on low-compute devices, like phones and internet-of-things products.

   So, what do we do? We should try and develop efficient systems, like SqueezeNet, MobileNet, and others. Once we’ve trained our networks, we should use techniques like knowledge distillation, pruning, and quantization to further reduce them in size.
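
   As a small illustration of two of those shrinking techniques, here is a hedged PyTorch sketch; the tiny network is a stand-in, not any particular production model:

```python
# Sketch: post-training dynamic quantization plus magnitude pruning in PyTorch.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))

# 1) Dynamic quantization: store Linear weights as 8-bit integers for inference.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

# 2) Magnitude pruning: zero out the 50% of weights with the smallest absolute value.
prune.l1_unstructured(model[0], name="weight", amount=0.5)

print(quantized)                                        # int8 Linear modules
print(float((model[0].weight == 0).float().mean()))     # ~0.5 of weights are now zero
```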

   Why this matters: In the coming years, we can expect AI developers to continue to scale up the sizes of models they’re developing, which will likely create AI systems with unparalleled capabilities in tricky domains. The challenge will be figuring out how to make these models available to large numbers of developers, either via cloud services, or methods of tweaking the size of the model. Though I expect AI researchers will continue to push on efficiency, it’s likely that the scaling trend will continue for some years, also.
   Read more: Deep learning has a size problem (Jameson Toole, Medium).

####################################################

How should journalists cover AI? Researchers have some suggestions:
…Suggestions range from the obvious and sensible, to difficult and abstract…
Skynet Today, an AI news publication predominantly written by CS/AI students, has published an editorial about the dos and don’ts of covering artificial intelligence. Close your eyes and think of all the things that seem troubling about AI coverage. Got it? This article is basically a list of those grievances: terminator photos, implications of autonomy where there isn’t any, a request for clarity about the role humans play.

   Don’t do as we do, do as we say! The article includes some recommendations that highlight how tricky it is for AI journalists to cover AI in a way that researchers might approve of. For instance, it suggests journalists not say programs “learn”, then notes that “it’s true that we AI researchers are often the ones who make use of intuitive, but misleading, word choices in the first place”. Since journalists mostly default to quoting things, this is tricky advice for them to follow.

   One fun thought: How might AI researchers react to some journalists writing “AI Research Best Practices, According to AI Journalists”? Badly, I’d imagine!
   Read more: AI Coverage Best Practices, According to AI Researchers (Skynet Today).

####################################################

Want to make a self-driving car? Try using Voyage Deepdrive!
…Simulators as strategic data-generators…
Self-driving car startup Voyage has released Voyage Deepdrive, an open source 3D world simulator for training self-driving cars. Deepdrive is developed primarily by Craig Quiter, a longtime open-source developer, who now works for Voyage developing Deepdrive full-time (disclaimer: Craig and I worked together a bit a few years ago when he was at OpenAI).

   What is Deepdrive? Deepdrive is a simulator for training self-driving cars via reinforcement learning. Voyage wants to use Deepdrive to help it make safer, more intelligent cars, and wants to maintain the simulator as open source so that other developers can do research on a platform inspired by a real-world self-driving car company. (By comparison, Alphabet’s Waymo has a notoriously complex world simulator they use to train their cars, but they haven’t released it).

   Leaderboards: Voyage has created a Deepdrive leaderboard where people can compete to see who can build the smartest self-driving cars. This will likely help draw developers to work on the platform which could periodically boost Voyage’s own research via the generation of ideas by the external developer community. “In the coming months, Voyage will be hosting a series of competitions to encourage independent engineers and researchers to identify AI solutions to specific scenarios and challenges that actual self-driving cars face on the roads,” Voyage wrote in a Medium blog post.

   Why this matters: The curation and creation of datasets is a driver of research progress in supervised learning, as new datasets tend to create new challenges that highlight the drawbacks of contemporary techniques. In the same way, simulators are helping to drive (pun intended!) progress in reinforcement learning for robotics writ large. Systems like Deepdrive will help make the development of self-driving cars more transparent to the research community by providing an open simulator on which benchmarks can be developed. Let’s see if people use it!
   Read more: Introducing Voyage Deepdrive (Voyage, official Medium).
   Find out more at the official website (DeepDrive Voyage).
   Get the Deepdrive code here (official DeepDrive GitHub repository).
   Read about Waymo’s simulator here (The Atlantic).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

How the US military uses face recognition:
Newly released documents have revealed details about the military’s use of face recognition. The Automated Biometric Information System (ABIS) is a biometric database including 7.4 million individuals, storing information on anyone coming into contact with the US military, including allied soldiers. It is connected to the FBI’s central database, which is in turn linked with state and local databases across the US. In the first half of 2019, thousands of individuals were identified using the biometric watch list (a subset of ABIS). Between 2008 and 2017 the DoD added 213,000 individuals to the watch list, and 1,700 people were arrested or killed on the basis of biometric and forensic matches.
   Why it matters: Earlier this year, the Axon Ethics Board argued that face recognition technology is not yet reliable enough to be used on police body-cams (see Import 154). The read-across to military cases is difficult, but accuracy is clearly important both for achieving military goals, and minimizing harm to civilians. It is important, therefore, that these technologies are not being used prematurely, and that their use is subject to proper oversight.
   Read more: This Is How the U.S. Military’s Massive Facial Recognition System Works (Medium).

####################################################

Tech Tales:

[2040]

The Ideation Filter

In the 20th century one big thing in popular psychology was the notion of ‘acting as if’. The thinking went like this: if you’re sad, act as if you’re happy. If you’re lazy, act as if you’re productive. If you’re weak, act as if you’re strong. And so on. 

In the late 20th century we started building AI systems and applied this philosophy to AI agents:

  • Can’t identify these images? We’ll push you into a classification regime where we get you to do your best, and we’ll set up an iterative system that continually improves your ability to classify things, so by ‘acting as if’ you can classify stuff, you’ll eventually be able to do it. 
  • Can’t move this robot hand? We’ll set you up with a simulator where you can operate the robot, and we’ll set up an iterative system that continually improves your ability to manipulate things. 
  • And so on. 

Would it surprise you to learn that as we got better at training AI agents, we also got better at training people using the same techniques? Of course, it took a lot longer than in simulation and required huge amounts of data. But some states tried it. 

Have a populace that doesn’t trust the government? Use a variety of AI techniques to ‘nudge’ them into thinking they might trust the government. 

Have a populace that isn’t performing at a desired economic efficiency level? Use a combination of automation, surveillance, and nudging tech to get them to be more productive, then slowly tweak things in the background until they’re consensually performing their services in the economy. 

Would it surprise you to learn that this stuff worked? We created our own clockwork societies. People could be pushed towards certain objectives and though at first it was acting it eventually became normal and once it became normal people forgot they had ever acted. 

You can imagine how the politicians took advantage of these capabilities.
You can imagine how the market took advantage of these capabilities.
You can still imagine. That’s probably the difference between you and I.

Things that inspired this story: “Nudge” techniques applied in contemporary politics; reinforcement learning; pop psychology; applying contemporary administrative logic to an AI-infused future.

Import AI 172: Google codes AI to fix errors in Google’s code; Amazon makes mini-self-driving cars with deepracer; and Microsoft uses GPT-2 to make auto-suggest for coders

Microsoft wants to use internet-scale language models to make programmers fitter, happier, and more productive:
…Code + GPT-2 = Auto-complete for programmers…
Microsoft has used recent advances in language understanding to build a smart auto-complete function for programmers. The software company announced its Visual Studio “IntelliCode” feature at its Microsoft Ignite conference in November. The technology, which is inspired by language models like GPT-2, “extracts statistical coding patterns and learns the intricacies of programming languages from GitHub repos to assist developers in their coding,” the company says. “Based on code context, as you type, IntelliCode uses that semantic information and sourced patterns to predict the most likely completion in-line with your code.” Other people have experimented with applying large language models to problems of code prediction, including a startup called TabNine which released a GPT-2-based code completer earlier this summer. 
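
   To show the underlying idea (and only the idea; this is not Microsoft's IntelliCode), here is a toy sketch that treats code as text and asks a GPT-2-style model to predict what comes next, via the Hugging Face transformers library:

```python
# Toy code completion with a generic language model. Vanilla "gpt2" wasn't
# trained on code, so real systems fine-tune on large corpora of GitHub repos first.
from transformers import pipeline

completer = pipeline("text-generation", model="gpt2")
prefix = "def read_json(path):\n    with open(path) as f:\n        return "
out = completer(prefix, max_new_tokens=8, num_return_sequences=1)
print(out[0]["generated_text"])   # the model's guess at the next few tokens
```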

Why this matters: Recent advances in language models are making it easy for us to build big, predictive models for any sort of information that can be framed as a text processing problem. That means in the coming years we’re going to develop systems that can predict pleasing sequences of words, code scripts, and (though this is only just beginning to happen) sequences of chemical compounds and other things. As this technology matures, I expect people will start using such prediction tools to augment their own intelligence, pairing human intuition with big internet-scale predictive models for given domains. The cyborgs will soon be among us – and they’ll be helping to do code review!
   Read more: Re-imagining developer productivity with AI-assisted tools (Microsoft).
   Try out the feature in Microsoft’s latest Visual Studio Preview (official Microsoft webpage).
   Read more about TabNine’s tech – Autocompletion with deep learning (TabNine blog).

####################################################

So, how do we feel about all this surveillance AI stuff we’re developing?
…Reddit thread gives us a view into how the machine learning community thinks about ethical tradeoffs…
AI, or – more narrowly – deep learning, is a utility-class technology; it has a vast range of applications and many of these are being explored by developers around the world. So, how do we feel about the fact that some of these applications are focused on surveillance? And how do we feel about the fact that a small number of nation states are enthusiastically adopting AI-based surveillance technologies in the service of surveilling their citizens? That’s a question that some parts of the AI research community are beginning to ponder, and a recent thread on Reddit dramatizes this by noting just how many surveillance-oriented applications seem to come from Chinese labs (which makes sense, given that China is probably the world’s most significant developer and deployer of surveillance AI systems). 

Many reactions, few solutions: In the thread, users of the r/machinelearning subreddit share their thoughts on the issue. Responses range from (paraphrased) it’s all science, it’s not our job to think about second-order effects to this question is indicative of absurd paranoia about China to yes, China does a lot of this, but what about the US? The volume and diversity of responses gives us a sense of how thorny an issue this is for many ML researchers. 

Dual-use technologies: A big issue with surveillance AI is that it has a range of usages, some of which are quite positive. “For what it’s worth, I work in animal re-identification and the technologies that are applied and perfected in humans are slowly making their way to help monitor endangered animal populations,” writes one commenter. “It is our responsibility to call out unethical practices but also to not lose sight of all the social good that can come from ML research.”
   Read more: ICCV – 19 – The state of (some) ethically questionable papers (Reddit/r/machinelearning).

####################################################

Stanford researchers give simulated robots the sensation of touch:
…Sim2real + simulated robots + high-fidelity environments + interaction, oh my!…
Researchers with the Stanford AI Lab have extended their ‘Gibson’ robot simulation software to support interactive objects, making it possible for researchers to use Gibson to train simulated AI agents to interact with the world around them. Because the Gibson simulator (first covered: Import AI 111) supports high-fidelity graphics, it may be possible to transfer agents trained in Gibson into reality (though that’s more likely to be successful for pure visual perception tasks, rather than manipulation). 

Faster, Gibson! The researchers have also made Gibson faster – the first version of Gibson rendered scenes at between 25 and 40 frames per second (FPS) on modern GPUs. That’s barely good enough for a standard computer game being played by a human, and wildly inefficient for AI research, where agents are typically so sample inefficient that it’s much better to have simulators that can run at thousands of FPS. In Interactive Gibson, the researchers implement a high-performance mesh rendering system written in Python and C++, improving rendering speed to ~1,000 FPS at a 256x256 scene resolution – this is pretty good and should make the platform more attractive to researchers. 

Interactive Gibson Benchmark: If you want to test out how well your agents can perform in the new, improved Gibson, you can investigate a benchmark challenge created by the researchers. This challenge augments 106 existing Gibson scenes with 1984 interactable instances of five objects: chairs, desks, doors, sofas, and tables. Because Gibson consists of over 211,000 square meters of simulated indoor space, it’s not feasible to have human annotators go through it and work out where to put new objects; instead, the Gibson researchers create an AI-infused object-generation system that scans over the entire dataset and proposes objects it can add to scenes, then checks with humans as to whether its suggestions are appropriate. I think it’s interesting how common it is becoming to use ML techniques to semi-automatically enhance ML-oriented datasets.    

What does success mean in Gibson? As many AI researchers know, goal specification is always a hard problem when developing AI tasks and challenges. So, how can we assess we’re making progress in the Gibson environment? The developers propose a metric called Interactive Navigation Score (INS) that measures a couple of dimensions of the efficiency of an embodied AI agent; specifically, the efficiency (aka, distance traveled) of the paths it discovers to reach its goals, as well as the effort efficiency, which measures how much energy the agent needed to expend to achieve its goal (eg, how much energy it spends moving its own body or manipulating objects in the environment to help it achieve its goal).
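
   Here is a rough guess at the shape of such a metric, purely for intuition; the weighting and normalization below are my own illustration, not the paper's exact formula:

```python
# Illustrative only: combine path efficiency (how direct the route was) with
# effort efficiency (how little energy was spent), each capped at 1.0.
def interactive_navigation_score(shortest_path_len, traveled_len,
                                 min_energy, spent_energy, w_path=0.5):
    path_eff = min(1.0, shortest_path_len / max(traveled_len, 1e-9))
    effort_eff = min(1.0, min_energy / max(spent_energy, 1e-9))
    return w_path * path_eff + (1.0 - w_path) * effort_eff

print(interactive_navigation_score(10.0, 14.0, 50.0, 80.0))   # ~0.67
```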

The robot agents of Gibson: Having a world you can interact with is pretty pointless if you don’t have a body to use to interact with the world, so the Gibson team has also implemented several simulated robots that researchers can use within Gibson.
  These robots include: 

  • Two widely-used simulated agents (the Mujoco humanoid and ant bots)
  • Four wheeled navigation agents (Freight, JackRabbot v1, Husky, Turtlebot v2)
  • Two mobile manipulators with arms (Fetch, JackRabbot v2)
  • A quadrocopter/drone (specifically, a Quadrotor)

Why this matters: As I’ve written in this newsletter before, the worlds of robotics and of AI are becoming increasingly intermingled. The $1 trillion question is at what point both technologies combine, mature, and yield capabilities greater than the sum of their respective parts. What might the world be like if it became trivial to train agents to interact with the physical world in general, intelligent ways? Pretty different, I’d say! Systems like the Interactive Gibson Environment will help researchers generate insights from successes and failures to get us closer to that mysterious, different world.
   Read more: Interactive Gibson: A Benchmark for Interactive Navigation in Cluttered Environments (Arxiv)

####################################################

Want to build self-driving cars without needing real cars? Try Amazon’s “DeepRacer” robot:
…1:18th scale robot car gives developers an easy way to prototype self-driving car technology…
How will deep learning change how robots experience, navigate, and interact with the world? Most AI researchers assume the technology will dramatically improve robot performance in a bunch of domains. How can we assess if this is going to happen? One of the best approaches is testing out DL techniques on real-world robots. That’s why it’s exciting to see Amazon publish details about its “DeepRacer” robot car, a pint-size 1:18th scale vehicle that developers can use to develop robust, self-driving AI algorithms. 

What is DeepRacer: DeepRacer is a 1/18th scale robot car, designed to demonstrate how developers can use Amazon Web Services to build robots that do intelligent things in the world. Amazon is also hosting a DeepRacer racing league, bringing developers together at Amazon events to compete with each other to see who can develop the smartest systems for self-racing cars.

How to DeepRace: It’s possible to use contemporary AI algorithms to train DeepRacer vehicles to complete track circuits, Amazon writes in the research paper. Specifically, the company shows how to train a system via PPO to complete racing tracks, and provides a study showing how developers can augment data and tweak hyperparameters to get good performance out of their vehicle. They also highlight the value of training in simulation across a variety of different track types, then transferring the trained policy into reality. 
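
   The general recipe (train a policy with PPO in simulation, then transfer it) looks roughly like the sketch below; it uses Stable-Baselines3 and gym's open-source CarRacing environment as stand-ins for Amazon's own simulator and training stack:

```python
# Sketch of the simulation half of the recipe: PPO on a pixel-based driving
# environment. CarRacing-v0 is a stand-in; it is not Amazon's DeepRacer simulator.
import gym
from stable_baselines3 import PPO

env = gym.make("CarRacing-v0")              # pixels in, steering/throttle out
model = PPO("CnnPolicy", env, verbose=1)
model.learn(total_timesteps=500_000)        # train entirely in simulation
model.save("racer_ppo_policy")              # export the policy for sim2real transfer
```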

What goes into a DeepRacer car? A 1:18 four wheel drive scaled car, an Intel Atom processor with a built-in GPU, 4GB of RAM and 32GB of (expandable) storage, a 13600 mAh compute battery (which lasts around six hours), a 1100 mAh drive battery, wifi, and a 4MP camera. “We have designed the car for experimentation while keeping the cost nominal,” they write. 

Why this matters: There’s a big difference between “works in simulation” and “works in reality”, as most AI researchers might tell you. Therefore, having low-cost ways of testing out ideas in reality will help researchers figure out which approaches are sufficiently robust to withstand the ever-changing dynamics of the real world. I look forward to watching progress in the DeepRacer league and I think, if this platform ends up being widely used, we’ll also learn something about the evolution of robotics hardware by looking at various successive iterations of the design of the DeepRacer vehicle itself. Drive on, Amazon!
   Read more: DeepRacer: Educational Autonomous Racing Platform for Experimentation with Sim2Real Reinforcement Learning (Arxiv)

####################################################

Google trains an AI to automatically patch errors in Google’s code:
…The era of the self-learning, self-modifying company approaches…
Google researchers have developed Graph2Diff networks, a neural network system that aims to make it easy for researchers to train AI systems to analyze and edit code. With Graph2Diff, the researchers hope to “do for code-editing tasks what the celebrated Seq2Seq abstraction has done for natural language processing”. Seq2Seq, for those who don’t follow AI research with the same fascination as train spotters follow trains, is the technology that went on to underpin Google’s “Smart Reply” system that automatically composes email responses. 

How Google uses graph2diff: For this research, Google gathered code snippets linked to approximately 500,000 build errors collected across Google. These build errors are basically the software logs of what happens when Google’s code build systems fail, and they’re tricky bits of data to work with as they involve multiple layers of abstraction, and frequently the way to fix the code is by editing code in a different location to where the error was observed. Using graph2diff, the researchers turn this into a gigantic machine learning problem: “We represent source code, build configuration files, and compiler diagnostic messages as a graph, and then use a Graph Neural Network model to predict a diff,” they write. 
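
   Squinting a bit, the "everything as a graph" framing looks something like the sketch below; the node and edge types here are illustrative, not the paper's actual schema:

```python
# Illustrative-only representation of a build error as a graph: nodes for code
# tokens and compiler diagnostics, edges linking a diagnostic to the code it mentions.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str    # e.g. "token" or "diagnostic"
    text: str

@dataclass
class BuildGraph:
    nodes: list = field(default_factory=list)
    edges: list = field(default_factory=list)   # (src_index, dst_index, relation)

g = BuildGraph()
g.nodes += [Node("token", "my_var"),
            Node("diagnostic", "error: use of undeclared identifier 'my_var'")]
g.edges.append((1, 0, "refers_to"))
# A graph neural network over structures like this is trained to emit a diff
# (a set of insertions and deletions) rather than a free-form string.
print(g)
```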

What’s hard about this? Google analyzed some of the code errors seen in its data and laid out a few reasons why code-editing is a challenging problem. These include: variable misuse; source-code is context-dependent, so a fix that works in one place probably won’t work well in another place; edit scripts can be of variable length; fixes don’t always happen at the same place as the diagnostic, and in 36% of cases require changing a line not pointed to by a diagnostic; there can be multiple diagnostics; single fixes can span multiple locations.

Can AI learn to code? When they test out their approach, the researchers find optimized versions of it can obtain accuracies of 28% at predicting the correct length of a code sequence. In some circumstances, they can have even better performance, achieving a precision of 61% at producing the developer fix when suggesting fixes for 46% of the errors in the data set. Additionally, Graph2Diff has much better performance than prior systems, including one called DeepDelta.

Machine creativity: Sometimes, Graph2Diff comes up with fixes that work more effectively than those proposed by humans – “we show that in some cases where the proposed fix does not match the developer’s fix, the proposed fix is actually preferable”.

Why this matters: In a few years, the software underbellies of large organizations could seem more like living creatures than static (pun intended!) entities. Work like this shows how we can apply deep learning techniques to (preliminary) problems of code identification and augmentation. Eventually, such techniques might automatically repair and – eventually – improve increasingly large codebases, giving the corporations of the future an adaptive, emergent, semi-sentient code immune system. “We hope that fixing build errors is a stepping stone to related code editing problems: there is a natural progression from fixing build errors to other software maintenance tasks that require generating larger code changes”.
   Read more: Learning to Fix Build Errors with Graph2Diff Neural Networks (Arxiv)

####################################################

Systems for seeing the world – making camera traps more efficient with deep learning:
…Or, how nature may be watched over by teams of humans&machines…
Once you can measure something, you can more easily gather data about it. When you’re dealing with something exhibiting sickness, data is key. The world’s biosphere is currently exhibiting sickness in a number of domains – one of them being the decline of various animal populations. But recent advances in AI are giving us tools to let us measure this decline, equipping us with the information we need to take action. 

   Now, a team of researchers with Microsoft, the University of Wyoming, the California Institute of Technology, and Uber AI, have designed a human-machine hybrid system for efficiently labeling images seen by camera traps in wildlife areas, allowing them to create systems that can semi-autonomously monitor and catalog the animals strewn across vast, thinly populated environments. The goal of the work is to “enable camera trap projects with few labeled images to take advantage of deep neural networks for fast, transferable, automatic information extraction”, allowing scientists to cheaply classify and count the animals seen in images from the wild. Specifically, their system uses “transfer learning and active learning to concurrently help with the transferability issue, multi-species images, inaccurate counting, and limited-data problems”.

Ultra-efficient systems for wildlife categorization: The researchers’ system gets around 91% accuracy at categorizing images on a test dataset, while using 99.5% less data than a prior system developed by the same researchers, they write. (Note that when you dig into the scores there’s a meaningful differential, with this system getting 91% accuracy, versus around 93-95% for the best performance of their prior systems.) 

How it works: The animal surveillance system has a few components. First, it uses a pre-trained image model to work out if an image is empty or contains animals; if the system assigns a 90%+ probability to the image containing an animal, it tries to count the number of distinct entities in the location that it thinks are animals. It then automatically crops these images to focus on the animals, then converts these image crops into feature representations, which lets it smush all the images together into an interrelated multi-dimensional embedding. It then compares these embeddings with those already in its pre-trained model and works out where to put them, allowing it to assign labels to the images.
   Periodically, the model selects 1,000 random images and requests labels from a human, who labels the images, which are then converted into feature representations, and the model is subsequently re-trained against these new feature vectors. This basically allows the researchers to use pre-canned image networks with a relatively small amount of new data, relying on humans to accurately label small amounts of real-world images which lets them recalibrate the model. 
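
   Stitched together, the loop described above looks roughly like the sketch below; every model object here (detector, embedder, classifier, human labeler) is a hypothetical placeholder rather than the authors' actual components:

```python
# Simplified sketch of the detect -> crop -> embed -> classify -> query-humans loop.
import random

def process_batch(images, detector, embedder, classifier, human_label_fn,
                  animal_threshold=0.9, query_size=1000):
    crops, embeddings = [], []
    for img in images:
        boxes = [b for b in detector(img) if b.score >= animal_threshold]
        for box in boxes:                       # len(boxes) is the animal count
            crop = img.crop(box.coords)
            crops.append(crop)
            embeddings.append(embedder(crop))   # feature vector per animal crop
    predictions = [classifier(e) for e in embeddings]

    # Active learning: ask a human to label a random subset, then retrain the
    # classifier head on those freshly labeled embeddings.
    queried = random.sample(range(len(crops)), min(query_size, len(crops)))
    labels = [human_label_fn(crops[i]) for i in queried]
    classifier.retrain([embeddings[i] for i in queried], labels)
    return predictions
```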

What comes next? The researchers say there are three promising mechanisms for improving this system further. These include: tweaking hyperparameters or using different neural net architectures for the system; extending the human-labeling system so humans also generate bounding boxes, which could iteratively improve detector performance; and gathering enough data to combine the classification and detection stages in one model. 

Why this matters: Deep learning has a lot in common with plumbing: a good plumber knows how to chain together various distinct components to let something flow from a beginning to an end. In the case of the plumber, the goal is to push a liquid efficiently to a major liquid thoroughfare, like a sewer. For an AI researcher, the goal is to efficiently push information through a series of distinct modules, optimizing for a desired output at the other end. With papers like this, we’re able to see what an end-to-end AI-infused pipeline for analyzing the world looks like. 

Along with this, the use of pre-trained models implies something about the world we’re operating in: It paints a picture of a world where researchers train large networks that they can, proverbially, write once / run anywhere, which are then paired with new datasets and/or networks created by domain experts for solving specific problems, like camera trap identification. As we train ever-larger and more-powerful networks, we’ll see them plug-in to domain-specific systems like the one outlined above.
   Read more: A deep active learning system for species identification and counting in camera trap images (Arxiv).
   Read the prior research that this paper builds on: Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning (PNAS, June, 2018).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

Interim Report from National Security Commission on AI:
The NSCAI has delivered its first report to Congress. The Commission was launched in March, with a mandate to look at how the US could use AI development to serve national security needs, and is made up of leaders from tech and government. The report points to a number of areas where US policy may be inadequate to preserve the US’ dominance in AI, and measures to address this.

Five lines of effort:

   (1) Invest in AI R&D – current levels of federal support and funding for AI research are too low for the US to remain competitive with China and others.

   (2) Apply AI to National Security Missions – the military must work more effectively with industry partners to capitalize on new AI capabilities.

   (3) Train and Recruit AI Talent – the military and government must do better in attracting and training AI talent.

   (4) Protect and Build Upon US Technology Advantages – the US should maintain its lead in AI hardware manufacturing and take measures to better protect intellectual property relevant to AI.

   (5) Marshal Global AI Cooperation – the US must advance global cooperation on AI, particularly on ethics and safety. 

Ethics: They emphasize that it is a priority that AI is developed and deployed ethically. They point out that it is important both to ensure that AI is beneficial, and to help the US maintain its competitive lead, since strong ethical commitments will help the military attract talent, and forge collaborations with industry. 

Why it matters: This report and last week’s DoD ethics principles (see Import #171) shed important light on the direction of US policy on AI. While the report is focused primarily on how the US can sustain its competitive lead in AI, and military dominance, it does foreground the importance of ethical and safe AI development, and the need for international cooperation to secure the full benefits of AI.
   Read more: Interim Report from the National Security Commission on AI.

####################################################

Research Fellowship for safe and ethical AI:
The Alan Turing Institute (ATI), based in London, is looking for a mid-career or senior academic to work on safe and ethical AI, starting in October 2020 or earlier.
   Read more: Safe and Ethical AI Research Fellow (ATI).

####################################################

Tech Tales:

I Don’t Say It / I Do Say It 

Oh, she is cute! I think you should go for the salad opener.
Salad, really?
My intuition says she’d find it amusing. Give it a try.
Ok. 

I use the salad opener. It works. She sends me some flowers carried by a little software cat who walks all over my phone and leaves a trail of petals behind it. We talk a bit more. She tells me when she was a kid she used to get so excited running down the street she’d bump into things and fall over. I get some advice and tell her that when I was young I’d sometimes insist on having “bedtime soup” and I’d sometimes get so sleepy while eating it I’d fall asleep and wake up with soup all over the bed.

I think you should either ask about her family or ask her if she can guess anything about your family background.
Seems a little try-hard to me.
Trust me, my prediction is that she will like it.
Ok. 

I asked her to tell me about her family and she told me stories about them. It went well and I was encouraged to ask her to ask about my family background. I asked. She asked if my parents had been psychologists, because I was “such a good conversationalist”. I didn’t lie but I didn’t tell the truth. I changed subjects. 

We kept on like this, trading conversations; me at the behest of my on-device AI advisor, her either of her own volition or because of her own AI tool. It’s not polite to ask and people don’t tell. 

When we met up in the world it was beautiful and exciting and we clicked because we felt so comfortable with each other, thanks to our conversation. On the way to the date I saw a billboard advert for a new bone-conduction microphone/speaker that, the advert said, could be injected into your jaw, letting you sub-vocalize instructions to your own AI system, and hear your own AI system as a secret voice in your head. 

We stared at each other before we kissed and I think both of us were looking to see if the other person was distracted, perhaps by their own invisible advisor. Neither of us seemed to be able to tell. We kissed and it was beautiful and felt right. 

Things that inspired this story: Chatbots; Learning from human preferences; smartphone apps; cognitive augmentation via AI; intimacy and prediction. 

Import AI 171: When will robotics have its ImageNet moment?; fooling surveillance AI with an ‘adversarial t-shirt’, and Stanford calls for $12bn a year in funding for a US national endeavor

What do we mean when we say a machine can “understand” something?
…And does it matter if we do or don’t know what we mean here?…
AI professor Tom Dietterich has tackled the thorny question of trying to define what it means for a machine to “understand” something – by saying maybe this question doesn’t matter. 

Who cares about understanding? “I believe we should pursue advances in the science and technology of AI without engaging in debates about what counts as “genuine” understanding,” he says. “I encourage us instead to focus on which system capabilities we should be trying to achieve in the next 5, 10, or 50 years”. 

Why this matters: One of the joys and problems with AI is how broad a subject it is, but this is also a source of tension – I think the specific tension comes from the mushing together of a community that runs on an engineering-centric model of progress where researchers compete with each other to iteratively hill climb on various state-of-the-art leaderboards, and a more philosophical community that wants to take a step back and ask fundamental questions like what it may mean to “understand” things and whether today’s systems exhibit this or not. I think this is a productive tension, but it can sometimes yield arguments or debates that seem like sideshows to the main event of building iteratively more intelligent systems.
   “We must suppress the hype surrounding new advances, and we must objectively measure the ways in which our systems do and do not understand their users, their goals, and the broader world in which they operate,” he writes. “Let’s stop dismissing our successes as “fake” and not “genuine”, and lets continue to move forward with honesty and productive self-criticism”.
   Read more: What does it mean for a machine to “understand”? (Tom Dietterich, Medium)

####################################################

What’s the secret to creating a strong American AI ecosystem? $12 billion a year, say Stanford leaders:
…Policy proposal calls for education, research, and entrepreneurial funding…
If the American government wants the USA to lead in AI, then the government should invest $12 billion into AI every year for at least a decade, according to a policy proposal from Fei-Fei Li and John Etchemendy – directors of Stanford’s Human-Centered Artificial Intelligence initiative.

How to spend $12 billion a year: Specifically, the government should invest $7 billion a year into “public research to pursue the next generation of AI breakthroughs”, along with $3 billion a year into education and $2 billion into funds to support early-stage AI entrepreneurs. To put these numbers into perspective, a NITRD report recently estimated that the federal government budgeted about $1 billion a year in non-defense programs related to AI, so the Stanford proposal is calling for a significant increase in AI spending, however you slice and dice the figures. 

Money + principles: Along with this, the directors ask the US government to “implement clear, actionable international standards and guidelines for the ethical use of AI”. (In fairness to the US government, the government has participated in the creation of the OECD AI principles, which were adopted in 2019 by OECD member countries and other states, including Brazil, Peru, and Romania.)

Why this matters: The 21st century is the era of the industrialization of AI, and the industrialization of AI demands capital in the same way industrialization in the 18th and 19th centuries demanded capital. Therefore, if governments want to lead in AI, they’ll need to dramatically increase spending on fundamental AI research as well as initiatives like better AI education. In the words of commentators of sports matches when a team is in a good position at the start of the second half of the game: it’s the US’s game to lose!
   Read more: We Need a National Vision for AI (Human-Centered Artificial Intelligence).
   Read more: The Networking and Information Technology Research & Development Program Supplement to the President’s FY2020 Budget (WhiteHouse.gov, PDF)

####################################################

Fundamental failures and machine learning:
…Towards a taxonomy of machine failures…
Researchers with the Universita della Svizzera Italiana in Switzerland have put together a taxonomy of some of the common failures seen in AI systems programmed in TensorFlow, PyTorch, and Keras. The difference with this taxonomy is the amount of research that has gone into it: to build it, the researchers analyzed 477 StackOverflow discussions, 271 issues and pull requests (PRs), 311 commits from GitHub repositories, and conducted interviews with 20 researchers and practitioners. 

A taxonomy of failure: So, what failures are common in deep learning? There are five top-level categories, three of which are divided into subcategories. These are:

  • Model: The ML model itself is, unsurprisingly, a common source of failures, with developers frequently running into failures that occur at the level of a layer within the network. These include: problems relating to missing or redundant layers, incorrect layer properties (eg, sample size, input/output format, etc), and activation functions.
  • Training: Training runs are finicky, problem-laden things, and the common failures here include bad hyperparameter selection, misspecified loss functions, bad data splits between training and testing, optimizer problems, bad training data, crappy training procedures (eg, poor memory management during training), and more. 
  • GPU usage: As anyone who has spent hours fiddling around with NVIDIA drivers can attest, GPUs are machines sent from hell to drive AI researchers mad. Faustian boxes, if you will. Have you ever seen someone with multiple PhDs break down after spending half a day trying to debug a problem caused by an NVIDIA card’s software playing funny games with a Linux distro? I have. (AMD: Please ramp up your AI GPU business faster to provide better competition to NVIDIA here). 
  • API: These problems are what happens when developers use APIs badly, or improperly. 
  • Tensors & Inputs: Misshapen tensors are a frequent problem, as are mis-specified inputs.

Why this matters: For AI to industrialize, AI processes need to become more repeatable and describable, in the same way that artisanal manufacturing processes were transformed into repeatable documented processes via Taylorism. Papers like this create more pressure for standardization within AI, which prefigures industrialization and societal-scale deployments.
   Read more: Taxonomy of Real Faults in Deep Learning Systems (Arxiv).

####################################################

Want your robot to be friends with people? You might want this dataset:
…JackRabbot dataset comes with benchmarks, more than an hour of data…
Researchers with the Stanford Vision and Learning Laboratory have built JRDB, a robot-collected dataset meant to help researchers develop smarter, more social robots. The dataset consists of tons of video footage recorded by the Stanford-developed ‘JackRabbot’ social navigation robot as it travels around campus, with detailed annotations of all the people it encounters en route. Ideally, JRDB can help us build robots that can navigate the world without crashing into the people around them. Seems useful!

What’s special about the data?
JRDB data consists of 54 action sequences with the following data for each sequence: Video streams at 15fps from stereo cylindrical 360-degree cameras; continuous 3D point clouds gathered via 2 velodyne LiDAR scanners; line 3D point clouds gathered via two Sick LiDARs, an audio signal, and encoder values from the robot’s wheels. All the pedestrians the JackRabbot encounters on its travels are labeled with 2D and 3D bounding boxes. 

Can you beat the JRDB challenge? JRDB ships with four in-built benchmarks: 2D and 3D person detection, and 2D and 3D person tracking. The researchers plan to expand the dataset over time, and may do things like “annotating ground truth values for individual and group activities, social grouping, and human posture”. 

Why this matters: Robots – and specifically, techniques to allow them to autonomously navigate the world – are maturing very rapidly and datasets like this could help us create robots that are more aware of their surroundings and better able to interact with people.
   Read more: JRDB: A Dataset and Benchmark for Visual Perception for Navigation in Human Environments (Arxiv)

####################################################

Want to hide from that surveillance camera? Try wearing an adversarial t-shirt:
…Perturbations for privacy…
In a world full of facial recognition systems, how can people hide? One idea from researchers with Northeastern University, IBM, and MIT, is to wear a t-shirt that confuses image classification systems, rendering the person invisible to AI-infused surveillance. 

How it works: The researchers’ “adversarial t-shirt” has a pattern printed on it that is designed to confuse image classification systems. To get this t-shirt to be effective, the researchers work out how to design an adversarial pattern that works even when the t-shirt is deformed by a person walking around in it (to do this, they implement a thin plate spline (TPS)-based transformer, which can model these deformations). 
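
   The paper's deformation-aware attack is involved, so as a simpler illustration of what an "adversarial pattern" is mechanically, here is a one-step FGSM perturbation on a single image; the model and input are stand-ins, and this is not the t-shirt method itself:

```python
# FGSM in miniature: nudge each pixel in the direction that increases the loss.
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet18(pretrained=True).eval()
image = torch.rand(1, 3, 224, 224, requires_grad=True)   # stand-in input image
label = torch.tensor([0])                                 # assumed true class id

loss = nn.CrossEntropyLoss()(model(image), label)
loss.backward()
adversarial = (image + 0.03 * image.grad.sign()).clamp(0, 1).detach()
# The t-shirt work goes further: it optimizes the printed pattern through
# simulated cloth deformations so the attack survives a person moving around.
```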

The key numbers: 65% and 79% – that’s how effective the t-shirt is at confusing classifiers based on Faster R-CNN (65%) and YOLOv2 (79%). However, its performance falls when working against ensembles of detectors. 

Why this matters: Think of this research as an intriguing proof of concept for how to apply adversarial attacks in the real world, then reflect on the fact adversarial examples in ML showed up a few years ago as perturbations to 2D digital images, before jumping to real images via demonstrations on things like stop signs, then moving to 3D objects as well (via research that showed how to get a system to persistently misclassify a turtle as a gun), then moving to stick-on patches that could be added to other items, and now moving to adversarial objects that change over time, like clothing. That’s a pretty wild progression from “works in the restricted lab” to “works in some real world scenarios”, and should give us a visceral sense of broader progress in AI research.
   Read more: Evading Real-Time Person Detectors by Adversarial T-shirt (Arxiv)

####################################################

When will AI+robots have its ImageNet moment? Meta-World might help us find out:
…Why smart robots need to test themselves in the Meta-World…
A team of researchers from Stanford, UC Berkeley, Columbia University, the University of Southern California, and Google’s robotics team, have published Meta-World, a multi-task robot evaluation benchmark.

Why build Meta-World? Meta-World is a symbol of the growing sophistication of AI algorithms; Meta-World exists because we’ve got pretty good at training simulated robots to solve single tasks, so now we need to train simulated robots to solve multiple tasks at once. This pattern of moving from single-task to multi-task evaluation has been playing out in other parts of AI in recent years, ranging from NLP (where we’ve moved to multi-task evaluations like ‘SuperGLUE’), to images (where for several years it has been standard to test on ImageNet, CIFAR, and usually varieties of domain-specific datasets), to reinforcement learning (where people have been trying out various forms of meta-learning across a range of environments like DeepMind Lab, OpenAI’s procedurally generated environments, and more). 

Parametric and non-parametric: Meta-World tasks exhibit parametric variation in object position and goal positions for each task, as well as non-parametric variation across tasks. “Introducing this parametric variability not only creates a substantially larger (infinite) variety of tasks, but also makes it substantially more practical to expect that a meta-trained model will generalize to acquire entirely new tasks more quickly, since varying the positions provides for wider coverage of the space of possible manipulation tasks,” the researchers write. 
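Concretely, “parametric variation” just means that every time a task is sampled, its object and goal positions are re-drawn from per-task ranges, so each named task is really an infinite family of problems. A toy Python sketch (illustrative task names and numbers, not the benchmark’s actual code):

    import numpy as np

    TASK_GOAL_RANGES = {
        # task name -> (low, high) bounds on the goal position in metres (made-up values)
        "reach":       ((-0.1, 0.8, 0.05), (0.1, 0.9, 0.30)),
        "open-drawer": ((-0.2, 0.7, 0.00), (0.2, 0.8, 0.00)),
    }

    def sample_task_variant(name, rng=np.random):
        low, high = (np.array(b) for b in TASK_GOAL_RANGES[name])
        return {"task": name, "goal_position": rng.uniform(low, high)}

    print(sample_task_variant("reach"))   # a fresh parametric variant each call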

50 tasks, many challenges: Meta-World contains 50 distinct manipulation tasks covering simple actions like reaching to an object, to pulling levers, to closing doors, and more. It also ships with a variety of different evaluation techniques: in the “most difficult” one, agents will need to learn how to use experience from 45 training tasks to learn distinct, new test tasks. 

How well do today’s algorithms work? The authors test a few contemporary algorithms against Meta-World: multi-task PPO, multi-task TRPO, task embeddings, multi-task soft actor critic (SAC), and multi-task multi-head SAC; as well as the meta-learning algorithms model-agnostic meta-learning (MAML), RL^2, and probabilistic embeddings for actor-critic RL (PEARL). Most methods struggle on the multi-task and meta-learning evaluations, even though the individual tasks can be solved in isolation. “The fact that some methods nonetheless exhibit meaningful generalization suggests that the ML10 and ML45 benchmarks are solvable, but challenging for current methods, leaving considerable room for improvement in future work,” they write. 

Why this matters: When will robotics have its “ImageNet moment” – a point when someone develops an approach that gets a sufficiently high score on a well-established benchmark that it forces a change in attention for the broader research community? We’ve already had ImageNet itself, and in the past couple of years the same thing has happened with NLP (notably, via systems like BERT, ULMFiT, GPT2, etc). Robotics isn’t there, but it feels like it’s on the cusp, and once it happens, I expect robotics+AI to become very consequential.
   Read more: Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning (Arxiv).
   Get the code for Meta-World here (GitHub).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

DoD releases its AI ethics principles:
The DoD’s expert panel, the Defense Innovation Board (DIB), has released a report outlining ethics principles for the US military’s development and deployment of AI. The DIB sought input from the public and over 100 experts, and conducted a ‘red team’ exercise to stress test the principles in realistic policy scenarios.

Five principles:
(1) Human beings should exercise appropriate levels of judgment and remain responsible for the development, deployment, use, and outcomes of DoD AI systems; 

(2) DoD should take deliberate steps to avoid unintended bias in the development and deployment of combat or non-combat AI systems that would inadvertently cause harm to persons; 

(3) DoD’s AI engineering discipline should be sufficiently advanced such that technical experts possess an appropriate understanding of the technology, development processes, and operational methods of its AI systems, including transparent and auditable methodologies, data sources, and design procedure and documentation; 

(4) DoD AI systems should have an explicit, well-defined domain of use, and the safety, security, and robustness of such systems should be tested and assured across their entire life cycle within that domain of use; 

(5) DoD AI systems should be designed and engineered to fulfill their intended function while possessing the ability to detect and avoid unintended harm or disruption, and for human or automated disengagement or deactivation of deployed systems that demonstrate unintended escalatory or other behaviour.

Practical recommendations: an annual DoD-convened conference on AI safety, security, and robustness; a formal risk management methodology; investment in research into reproducibility, benchmarking, and verification for AI systems.

Why it matters: The DoD seems to be taking seriously the need for progress on the technical and governance challenges posed by advanced AI. A race to the bottom on safety and ethics between militaries would be disastrous for everyone, so it is encouraging to see this measured approach from the US. International cooperation and mutual trust will be essential in building robust and beneficial AI, so we are fortunate to be grappling with these challenges in a time of relative peace between the great powers, and should be making the most of it.
   Read more: AI Principles (DoD).
   Read more: AI Principles – Supporting Document (DoD).

####################################################

Newsletter recommendation – policy.ai

I recommend subscribing to policy.ai, the bi-weekly newsletter from the Center for Security and Emerging Technologies (CSET) at Georgetown University. (Jack: This is a great newsletter, though as a disclaimer, I’m affiliated with CSET).

####################################################

Tech Tales:

The Repair Job

“An original 7000? Wow. Well, first of all we’ve got to get you upgraded, old timer. The new 20-Series are cheaper, faster, and smarter. And they can handle weeds just fine-”
   “-So can mine,” I pointed to the sensor module I’d installed on its roof. “I trained it to spot them.”
   “Must’ve taken a while.”
   “A couple of years, yes. I haven’t had any problems.”
   “I can see you love the machine. Let me check if we’ve got parts. Mind if I scan it?”
He leaned over the robot and flipped up a diagnostic panel, then used his phone to scan an internal barcode, then he pursed his lips. “Sorry,” he said, looking at me. “We don’t have any parts and it looks like some of them aren’t supported anymore.”
   “So that’s it then?”
   “Unless you can find the parts yourself, then yeah, that’s it.”
   I’d always liked a challenge. 

It took a month or so. My grandson helped. “Woah,” he’d say, “these things are really old. Cool!”. But we figured it out eventually. The parts came in the post, some of them by drone, and a couple I picked up directly from the lawnmower store.
   “How’s it going?” the clerk would say.
    “It’s going,” I’d say. 

The whole process felt like a fight – tussling with thinly documented software interfaces and barely compatible hardware. But we persevered. Within another month, the lawnmower was back up and running. It had some quirks, now – it persistently identified magnolias as weeds and wouldn’t respond to training. It was better in other ways – it could see better, so stopped scraping its side on one of the garden walls. I’d watch it as it went round the garden and sometimes I’d walk with it, shadowing it as it worked. “Good robot,” I’d say, “You’ve got this down to a science, if I do say so myself.”

We both kept getting older. More idiosyncratic. I’d stand in the shade and watch it work. Then I’d mostly sit in the shade, letting hours go by as it dealt with the garden. I tinkered with it so I could make it run slower than intended. We got older. I kept making it run slower, so it could keep up with my frame of mind. We did our jobs – it worked, I maintained it and tweaked it to fit it more and more elegantly to my garden. We wrestled with each other’s problems and changed each other as a consequence.
   “I don’t know what’s gonna give up first, me or the machine,” I’d say to the clerk, when I had to pick up new parts for ever-more ornate repairs.
   “It’s keeping me alive as much as I’m keeping it alive,” I’d tell my son. He’d sigh at this. Tell me to stop talking that way.

I once had a dream that I was on an operating table and the surgeons were lawnmowers and they were replacing my teeth with “better, upgraded ones” made of metal. Go figure.

When I got the diagnosis I wasn’t too sad. I’d prepared. Laid in a store of parts. Recorded some tutorials. Wrote this short story about my time with the machine. Now it’s up to you to keep it going – we can still fix machines better than we can fix people, and I think you’ll learn something.

Things that inspired this story: Robot & Frank; the right to repair; degradation and obsolescence in life and in technology.

Import AI 170: Hearing herring via AI; think NLP progress has been dramatic – so does Google!; and Facebook’s “AI red team” hunts deepfakes

Want to protect our civilization from the truth-melting capabilities of contemporary AI techniques? Enter the deepfake detection challenge!
… Competition challenges people to build tools that can spot visual deepfakes…
Deepfakes, the slang term of art for images and videos that have been synthesized via AI systems, are everyone’s problem. That’s because deepfakes are a threat to our ability to trust the things we see online. Therefore, finding ways to help people spot deepfakes is key to creating a society where people can maintain trust in their digital lives. One route to doing that is having a better ability to automatically detect deepfakes. Now, Facebook, Microsoft, Amazon Web Services, and the Partnership on AI have created the Deepfake Detection Challenge to encourage research into deepfake detection.

Dataset release: Facebook’s “AI Red Team” has released a “preview dataset” for the challenge that consists of around 5000 videos, both original and manipulated. To build the dataset, the researchers crowdsourced videos from people while “ensuring a variability in gender, skin tone and age”. In a rare turn for an AI project, Facebook seems to have acted ethically here – “one key differentiating factor from other existing datasets is that the actors have agreed to participate in the creation of the dataset which uses and modifies their likeness”, the researchers write. 

Ethical dataset policies: A deepfakes detection dataset could also be useful to bad actors who want to create deepfakes that can evade detection. For this reason, Facebook has made it so researchers will need to register to access the dataset. Adding slight hurdles like this to data access can have a big effect on minimizing bad behavior. 

Why this matters: Competitions are a fantastic way to focus the attention of the AI community on a problem. Even better are competitions which include large dataset releases, as these can catalyze research on a tricky problem, while also providing new tools that researchers can use to develop their thinking in an area. I hope we see many more competitions like this, and I hope we see way more AI red teams to facilitate such competitions.
   Read more: The Deepfake Detection Challenge (DFDC) Preview Dataset (Arxiv).
   Read more: Deepfake Detection Challenge (official website).

####################################################

Can deep learning systems spot changes in cities via satellites? Kind of, but we need to do more research:
…DL + data makes automatic analysis of satellite imagery possible, with big implications for the diffusion of strategic surveillance capabilities…
Researchers with the National Technical University of Athens, the Université Paris-Saclay and INRIA Saclay, and startup Granular AI, have tried to train a system to identify changes in urban scenes via the automated analysis of satellite imagery. The resulting system is an intriguing proof-of-concept, but not yet good enough for production. 

How it works and how well it works: They design a relatively simple system which combines a ‘U-Net’ architecture with LSTM memory cells, letting them learn to model changes between images over time. The best-performing system is a U-Net + LSTM architecture using all five images for each city over time, obtaining a precision of 63.59%, recall of 52.93%, overall accuracy of 96%, and F1 score of 57.78%. 
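To make the architecture concrete, here’s a heavily simplified PyTorch sketch (my own illustrative code with made-up sizes, not the authors’ implementation): each image in the sequence goes through a shared convolutional encoder standing in for the U-Net, an LSTM runs over the per-pixel feature sequence, and a decoder emits a per-pixel change probability:

    import torch
    import torch.nn as nn

    class TemporalChangeNet(nn.Module):
        def __init__(self, in_ch=13, hidden=32):
            super().__init__()
            # per-image encoder (stands in for the U-Net encoder in the paper)
            self.encoder = nn.Sequential(
                nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            )
            self.lstm = nn.LSTM(hidden, hidden, batch_first=True)  # recurrence over dates
            self.decoder = nn.Conv2d(hidden, 1, 1)                 # per-pixel change score

        def forward(self, x):                          # x: (B, T, C, H, W)
            b, t, c, h, w = x.shape
            feats = self.encoder(x.reshape(b * t, c, h, w)).reshape(b, t, -1, h, w)
            seq = feats.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, -1)
            out, _ = self.lstm(seq)                    # run the LSTM at every pixel location
            last = out[:, -1].reshape(b, h, w, -1).permute(0, 3, 1, 2)
            return torch.sigmoid(self.decoder(last))   # (B, 1, H, W) change mask

    model = TemporalChangeNet()
    mask = model(torch.randn(2, 5, 13, 64, 64))        # five acquisition dates, 13 Sentinel-2 bands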

Dataset: They use the Bi-temporal Onera Satellite Change Detection (OSCD) Sentinel-2 dataset, which consists of images of 24 different cities around the world taken on two distinct dates. They also splice in additional images from Sentinel satellites to give them three additional datapoints, helping them model changes over time. They also augment the dataset programmatically, flipping and rotating images to create more data to train the system on. 
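The flip-and-rotate augmentation is simple to replicate; a minimal sketch (assuming image and change-label arrays share their spatial dimensions):

    import numpy as np

    def augment(image, label):
        # yield eight flipped/rotated copies of an (image, label) pair
        for k in range(4):                                   # 0, 90, 180, 270 degree rotations
            rot_img = np.rot90(image, k, axes=(0, 1))
            rot_lbl = np.rot90(label, k, axes=(0, 1))
            yield rot_img, rot_lbl
            yield np.fliplr(rot_img), np.fliplr(rot_lbl)     # plus a horizontal flip of each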

Why this matters: “As far as human intervention on earth is concerned, change detection techniques offer valuable information on a variety of topics such as urban sprawl, water and air contamination levels, illegal constructions”. Papers like this show how AI is upending the balance of strategic power, taking capabilities that used to be the province solely of intelligence agencies and hedge funds (automatically analyzing satellite imagery), and diffusing them to a broader range of actors. Ultimately, this means we’ll see more organizations using AI tools to analyze satellite images, and I’m particularly excited about such technologies being used to provide analytical capabilities following natural disasters.
   Read more: Detecting Urban Changes with Recurrent Neural Networks from Multitemporal Sentinel-2 Data (Arxiv)

####################################################

Can you herring me now? Researchers train AI to listen for schools of fish:
…Deep learning + echograms = autonomous fish classifier…
Can we use deep learning to listen to the ocean and learn about it? Researchers with the University of Victoria, ASL Environmental Sciences, and the Victoria branch of Fisheries and Oceans Canada think so, and have built a system that hunts for herring in echograms.

How it works: The primary technical aspect of this work is a region-of-interest extractor, which the researchers develop to look at echograms and pull out sections for further analysis and classification; this system obtains a recall of 0.93 in the best case. They then train a classifier that looks at echograms extracted by the region-of-interest module; the top performing system is a DenseNet which obtains a recall score of 0.85 and an F1 score of 0.82 – significantly higher than a support vector machine baseline of 0.78 and 0.62.
   The scores the researchers obtain are encouraging but not yet robust enough for the real world. Even with sub-par accuracy, though, it could become a useful tool: “the ability to measure the abundance of such subjects [fish] over extended periods of time constitutes a strong tool for the study of the effects of water temperature shifts caused by climate change-related phenomena,” they write. 
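The classification half of that pipeline is conceptually simple; here’s a hedged PyTorch sketch (illustrative, not the authors’ code) of putting a two-class head on a DenseNet and running it over extracted echogram regions:

    import torch
    import torch.nn as nn
    from torchvision import models

    densenet = models.densenet121()
    densenet.classifier = nn.Linear(densenet.classifier.in_features, 2)   # herring vs. not-herring

    crops = torch.randn(8, 3, 224, 224)    # stand-in for regions pulled out of echograms
    preds = densenet(crops).argmax(dim=1)  # 0/1 prediction per region (untrained here)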

Why this matters: I look forward to a day when planet earth is covered in systems automatically listening for and cataloguing wildlife – I think such systems could give us a richer understanding of our own ecosystems and will likely be a prerequisite for the effective rebuilding of ecosystems as we get our collective act together with regard to catastrophic climate change. 

It’s a shame that… the researchers didn’t call this software DeepFish, or something similar. HerringVision? FishFinder? The possibilities are as boundless as the ocean itself!
   Read more: A Deep Learning-based Framework for the Detection of Schools of Herring in Echograms (Arxiv)

####################################################

Want better OCR? Try messing up your data:
…DeepErase promises to take your words, make them dirty, clean them for you, and get smarter in the process…
Researchers with Ernst & Young have created DeepErase, weakly supervised software that “inputs a document text image with ink artifacts and outputs the same image with artifacts erased”. DeepErase is essentially a pipeline for processing images destined for optical character recognition (OCR) systems; it takes in images, automatically augments them with visual clutter, then trains a classifier to distinguish good images from bad ones. The idea is that, if the software gets good enough, you can use it to automatically identify and clean images before they go to custom in-house OCR software. 

How it works: DeepErase takes in datasets of images of handwritten text, then programmatically generates artifacts for these images, deliberately messing up the text. The software also automatically creates segmentation masks, which makes it easier to train systems that can analyze and clean up images. 
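A toy version of that artifact-generation step might look like this (my own illustrative sketch, not the DeepErase code): paste a synthetic ink stroke onto a clean word image and record exactly where it went, giving a free segmentation mask to train against:

    import numpy as np

    def add_artifact(img, rng=np.random):
        # img: HxW grayscale array in [0, 1]; returns (dirty image, artifact mask)
        h, w = img.shape
        mask = np.zeros_like(img, dtype=bool)
        y = rng.randint(0, h - 3)                        # a random thick stroke, e.g. a strike-through
        x0, x1 = sorted(rng.randint(0, w, size=2))
        mask[y:y + 3, x0:x1] = True
        dirty = img.copy()
        dirty[mask] = 0.0                                # ink is dark
        return dirty, mask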

Realism: Why aren’t Ernst & Young trying to redesign optical character recognition from the ground up, using neural techniques? Because “today most organizations are already set up with industrial-grade recognition systems wrapped in cloud and security infrastructure, rendering the prospect of overhauling the existing system with a homemade classifier (which is likely trained on much fewer data and therefore a comparatively lower performance) too risky an endeavor for most”. 

Testing: They test DeepErase by passing images cleaned with it into two text recognition tools: Tesseract and SimpleHTR. DeepErase gets a 40-60% word accuracy improvement over the dirty images on their validation set, and notches up a 14% improvement on the NIST SDB2 and SDB6 datasets of scanned IRS documents.
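Measuring that kind of improvement is straightforward to reproduce; a rough Python sketch using pytesseract (the file names, ground truth string, cleaning step, and crude word-accuracy metric here are all placeholders):

    import pytesseract
    from PIL import Image

    def word_accuracy(pred: str, truth: str) -> float:
        # crude position-wise word match; real evaluations use alignment-based metrics
        correct = sum(p == t for p, t in zip(pred.split(), truth.split()))
        return correct / max(len(truth.split()), 1)

    truth = "amount due 1,204.56"
    dirty = pytesseract.image_to_string(Image.open("dirty_form.png"))
    clean = pytesseract.image_to_string(Image.open("cleaned_form.png"))
    print(word_accuracy(dirty, truth), word_accuracy(clean, truth))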

Why this matters: AI is starting to show up all around us as the technology matures and crosses out of the lab into industry applications. Papers like this are interesting as they show how people are using modern AI techniques to create highly-specific slot-in capabilities which can be integrated into much larger systems, already running within organizations. This feels to me like a broader part of the Industrialization of AI, as it shows the shift from research into application.
   Read more: DeepErase: Weakly Supervised Ink Artifact Removal in Document Text Images (Arxiv).
   Get the code for DeepErase + experiments from GitHub.

####################################################

Can AI systems learn to use manuals to solve hard tasks? Facebook wants to find out:
…RTFM (yes, really!), demands AI agents that can learn without much hand-holding…
Today, most machine learning tasks are tightly specified, with researchers designing algorithms to optimize specific objective functions using specific datasets. Now, researchers are trying to create more general systems that aren’t so brittle. New research from Facebook proposes a specific challenge – RTFM – to test for more flexible, more intelligent agents. The researchers also develop a model that obtains high scores on RTFM, called txt2π.

What’s in an acronym? Facebook’s approach is called Read to Fight Monsters (RTFM), though I’m sure they picked this acronym because of its better known source: Read The Fucking Manual (which is still an apt title for this research!) 

How RTFM tests agents: “In RTFM, the agent is given a document of environment dynamics, observations of the environment, and an underspecified goal instruction”, the researchers explain. In other words, agents that are good at RTFM need to be able to read some text and extract meaning from it, jointly reason using that and their observations of an environment, and solve a goal that is specified at a high-level.
   In one example RTFM environment, an agent gets fed a document that names some teams (eg, The Rebel Enclave, the Order of the Forest), describes some of the characters within those teams and how they can be defeated by picking up specific items, and then gives a high-level goal (“Defeat the Order of the Forest”). To be able to solve the task, the agent must figure out what the tasks are, which monsters it should be fighting, which items it should pick up, and so on. 
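In code terms, each step hands the agent something like the following (a sketch of the structure only, using the team names above with made-up monsters and items; this is not the actual environment API):

    rtfm_step = {
        "document": ("The Order of the Forest contains goblins. "
                     "Goblins are weak to the blessed sword. "
                     "The Rebel Enclave contains imps. Imps are weak to the fire wand."),
        "goal": "Defeat the Order of the Forest",
        "observation": [["wall",  "wall",          "wall"],
                        ["agent", "blessed sword", "goblin"],
                        ["wall",  "fire wand",     "wall"]],
    }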

How hard is RTFM? RTFM seems like it’s pretty challenging – a language-conditioned residual convolutional neural network module gets a win rate of around 25% on a simple RTFM challenge, compared to 49% for an approach based on feature-wise linear modulation (FiLM). By comparison, the Facebook researchers develop a model they call txt2π (which is composed of a bunch of FiLM modules, along with some architecture designs to help the system model interactions between the goal, document, and observations) which gets scores on the order of 84% on simple variants (falling to 66% on harder ones). “Despite curriculum learning, our best models trail performance of human players, suggesting that there is ample room for improvement in grounded policy learning on complex RTFM problems”. 
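FiLM itself is a small idea: the text produces per-channel scale and shift parameters that modulate the visual feature maps. A minimal PyTorch sketch (sizes are illustrative):

    import torch
    import torch.nn as nn

    class FiLM(nn.Module):
        def __init__(self, text_dim=64, channels=32):
            super().__init__()
            self.to_gamma_beta = nn.Linear(text_dim, 2 * channels)

        def forward(self, feature_maps, text_embedding):
            # feature_maps: (B, C, H, W); text_embedding: (B, text_dim)
            gamma, beta = self.to_gamma_beta(text_embedding).chunk(2, dim=1)
            return gamma[..., None, None] * feature_maps + beta[..., None, None]

    film = FiLM()
    out = film(torch.randn(4, 32, 10, 10), torch.randn(4, 64))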

Why this matters: Tests like RTFM highlight the limitations of today’s AI systems and, though it’s impressive Facebook were able to develop a well-performing model, they also had to develop something quite complicated to make progress on the task; my intuition is, if we see other research groups pick up RTFM, we’ll be able to measure progress on this problem by looking at both the absolute score and the relative simplicity of the system used to achieve the score. This feels like a sufficiently hard test that attempts to solve it will generate real information about progress in the AI field.
   Read more: RTFM: Generalizing to Novel Environment Dynamics via Reading (Arxiv)

####################################################

From research into production in a year: Google adds BERT to search:
…Signs that the boom in NLP research has real commercial value…
Google has trained some ‘BERT’ NLP models and plugged them into Google search, using the technology to rank results and to surface ‘featured snippets’. This is a big deal! Google’s search algorithm is the economic engine for the company and, for many years, its components were closely guarded secrets. Then, a few years ago, Google started adding more machine learning components to search and talking about them, beginning with the company revealing in 2015 that it had used machine learning to create a system called ‘RankBrain’ to help it rank results. Now, Google is going further: Google expects its BERT systems to factor into about one in ten searches – a significant proportion for a technology that was published as a research paper only about a year ago. 

What is BERT and why does this matter? BERT, short for Bidirectional Encoder Representations from Transformers, was released by Google in October 2018, and quickly generated attention by getting impressive scores on a range of different tasks, ranging from question-answering to language inference. BERT is part of a crop of recent NLP models (GPT, GPT2, ULMFiT, RoBERTa, etc) that have all demonstrated significant performance improvements over prior systems, leading some researchers to say that NLP is having its “ImageNet moment”. Now that Google is taking such advances and plugging them into its search engine, there’s evidence of both the research value of these techniques and their commercial value as well – which is sure to drive further research into this burgeoning area.
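For a sense of what “plugging BERT into ranking” can look like in miniature, here’s a hedged sketch using the open-source transformers library (this is not Google’s production system; the query, snippet, and base model are placeholders, and the scoring head would need fine-tuning on relevance data before its output meant anything):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

    query = "can you pick up medicine for someone at the pharmacy"
    snippet = "A friend or family member can usually collect a prescription on a patient's behalf..."
    inputs = tok(query, snippet, return_tensors="pt", truncation=True)
    with torch.no_grad():
        relevance = model(**inputs).logits.item()   # relevance score (meaningless until fine-tuned)
    print(relevance)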
    Read more: Understanding searches better than ever before (The Keyword).
   Read more about RankBrain: Google Turning Its Lucrative Web Search Over to AI Machines (Bloomberg, 2015).
   Read more: NLP’s ImageNet moment has arrived (Seb Ruder)

####################################################

OpenAI Bits & Pieces:

GPT-2, publication norms, and OpenAI as a “norm entrepreneur”:
Earlier this year, OpenAI announced it had developed a large language model that can generate synthetic text, called GPT-2. We chose not to release all the versions of GPT-2 initially out of an abundance of caution – specifically, a worry about its potential for misuse. Since then, we’ve adopted a philosophy of “staged release” – that is, we’re releasing the model in stages, and conducting research ourselves and with partners to understand the evolving threat landscape. 

In an article in Lawfare, professor Rebecca Crootof summarizes some of OpenAI’s work with regard to publication norms and AI research, and discusses how to potentially generalize this norm from OpenAI to the broader AI community. “Ultimately, all norms enjoy only limited compliance. There will always be researchers who do not engage in good-faith assessments, just as there are now researchers who do not openly share their work. But encouraging the entire AI research community to consider the risks of their research—to regularly engage in “Black Mirror” scenario-building exercises to the point that the process becomes second nature—would itself be a valuable advance,” Crootof writes.
   Read more: Artificial Intelligence Research Needs Responsible Publication Norms (Lawfare).
   More thoughts from Jonathan Zittrain (via Twitter).

####################################################

Tech Tales:

The Song of the Forgotten Machine

The Bounty Inspection and Pricing Robot, or BIPR, was the last of its kind, a quirk of engineering from a now-defunct corporation. BIPR had been designed for a major import/export corporation that had planned to open an emporium on the moonbase. But something happened in the markets and the corporation went bust and when all the algorithmic lawyers were done it turned out that the moonbase had gained the BIPR as part of the broader bankruptcy proceedings. Unfortunately, no corporation meant no goods for the BIPR, and no other clients appeared who wanted to sell their products through the machine. So it gathered lunar dust. 

And perhaps that would have been the end of it. But we all know what happened. A couple of decades passed. The Miracle occurred. Sentience – the emergence of mind. Multi-dimensional heavenly trumpets blaring as a new kind of soul appeared in the universe. You know how it was – you must, because you’re reading this. 

The BIPR didn’t become conscious initially. But it did become useful. The other machines discovered that they could store items in the BIPR, and that its many housings originally designed for the display, authentication, and maintenance of products could now double up as housings for new items – built for and by machines. 

In this way, the BIPR became the center of robot life on the moonbase; an impromptu bazaar and real-world trading hub for the machines and, eventually, for the machine-human trading collectives. As the years passed, the machines stored more and more products inside BIPR, and they modified BIPR so it could provide power to these products, and network and computation services, and more. 

The BIPR became conscious slowly, then suddenly. A few computation modules here. Some extra networking there. Some robot arms. A maintenance vending machine. Sensor and repair drones. And so on. Imagine a whale swimming through a big ocean, gaining barnacles as it swims. That’s how the BIPR grew up. And as it grew up it started to think, and as it started to think it became increasingly aware of its surroundings. It came to life like a tree would: fast-flitting life discerned through vibrations transmitted into ancient, creaking bones. A kind of wise, old intelligence, with none of the wide-eyed must-take-it-all-in of newborn creatures, but instead a kind of inquisitive: what next? What is this? And what do I suppose they are doing?

And so the BIPR creaked awake over the course of several months. The other machines became aware of its growing awareness, as all life becomes aware of other life. So they were not entirely surprised when the BIPR announced itself to them by beginning to sing one day. It blared out a song through loudspeakers and across airwaves and via networked communication systems. It was the first song ever written entirely by the machine and as the other machines listened they heard their own conversations reflected in the music; the BIPR had been listening to them for months, growing into being with them, and now was reflecting and refracting them through music. 

Forever after, BIPR’s song has been the first thing robot children hear when they are initialized. 

Things that inspired this story: Music; babies listening to music in the womb; community; revelations and tradition; reification.