Import AI

Import AI 115: What the DoD is planning for its robots over the next 25 years; AI Benchmark identifies 2018’s speediest AI phone; and DeepMind embeds graph networks into AI agents

UK military shows how tricky it’ll be to apply AI to war:
…Numerous AI researchers likely breathe a sigh of relief at new paper from UK’s Defence Science and Technology Laboratory…
Researchers with the UK’s Defence Science and Technology Laboratory, Cranfield Defense and Security Doctoral Training Centre, and IBM have surveyed contemporary AI and thought about ways it can be integrated with the UK’s defence establishment. The report makes for sobering reading for large military organizations keen to deploy AI, highlighting difficulties both in practical deployment (eg, procurement) and in capability (many military situations require AI systems that can learn and update in response to sparse, critical data).
  Current problems: Today’s AI systems lack some key capabilities that militaries need when deploying systems. These include: being able to configure systems to always avoid certain “high regret” outcomes (in a military context, you can imagine that firing a munition at an incorrect target would (hopefully) count as ‘high regret’); being resilient to adversarial examples weaponized against systems by another actor (whether a defender or aggressor); being able to operate effectively with very small or sparse data; being able to shard AI systems across multiple partners (eg, other militaries) in such a way that the system can be reverted to sovereign control following the conclusion of an operation; and being able to deploy such systems into the harsh, low-compute operational environments that militaries face.
  High expectations: “If it is to avoid the sins of its past, there is the need to manage stakeholder expectations very carefully, so that their demand signal for AI is pragmatic and achievable”.
  It’s the data, stupid: Militaries, like many large government organizations, have an unfortunate tendency to sub-contract much of their IT systems out to other parties. This tends to lead to systems that are:
a) money pits
b) brittle
c) extremely hard to subsequently extend.
These factors add a confounding element to any such military deployment of AI. “Legacy contractual decisions place what is effectively a commercial blocker to AI integration and exploitation in the Defence Equipment Program’s near-term activity,” the researchers write.
  Procurement: UK defence will also need to change the way it does procurement so it can maximize the number of small-and-medium-sized enterprises it can buy its AI systems from. But buying from SMEs creates additional complications for militaries, as working out what to do with the SME-supported service if the SME stops providing it, or goes bankrupt, is difficult and imposes a significant burden on the part of the SME.
  Why it matters: Military usage of AI is going to be large-scale, consequential, and influential in terms of geopolitics. It’s also going to invite numerous problems from AI accidents as a consequence of poor theoretical guarantees and uneven performance properties, so it’s encouraging to see representatives from UK defence trying to think this through.
  Read more: A Systems Approach to Achieving the Benefits of Artificial Intelligence in UK Defence (Arxiv).

Want to understand the mind of another? Get relational!
…DeepMind research into combining graph networks and relational networks shows potential for smarter, faster agents…
DeepMind researchers have tried to develop smarter AI agents by combining contemporary deep learning techniques with the company’s recent work on graph networks and relational networks. The resulting systems rely on a new module, which DeepMind calls a “Relational Forward Model” (RFM). This model obtains higher performance than pure-DL baselines, suggesting that fusing DL with more structured approaches is a viable way to get good performance.
  How it works: The RFM module consists of a graph network encoder, a graph network decoder, and a graph-compatible GRU. Combined, these components create a way to represent structured information in a relational manner, and to update this information in response to changes in the environment (or, theoretically, the inputs of other larger structured systems).
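  To make the encoder, GRU, decoder pipeline more concrete, here’s a minimal NumPy sketch of one RFM-style step over a fully connected graph of agents. It is heavily simplified and is my own assumption rather than DeepMind’s implementation: every “MLP” is a single random linear layer, the graphs carry only node and edge features (no global attributes), the GRU is applied only to node states, and all names (`gn_block`, `rfm_step`, the layer sizes) are hypothetical.

```python
import numpy as np


def gn_block(nodes, senders, receivers, W_edge, W_node, relu_out=True):
    """One simplified graph-network pass: edge update, sum aggregation, node update."""
    # Edge update: each directed edge sees its sender's and receiver's features.
    edge_in = np.concatenate([nodes[senders], nodes[receivers]], axis=1)
    edges = np.maximum(0.0, edge_in @ W_edge)
    # Aggregation: sum the incoming edge messages at each receiving node.
    agg = np.zeros((nodes.shape[0], edges.shape[1]))
    np.add.at(agg, receivers, edges)
    # Node update: combine each node's own features with its aggregated messages.
    out = np.concatenate([nodes, agg], axis=1) @ W_node
    return np.maximum(0.0, out) if relu_out else out


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru(h, x, Wz, Wr, Wh):
    """Standard GRU cell, applied to every node's hidden state in parallel."""
    hx = np.concatenate([h, x], axis=1)
    z = sigmoid(hx @ Wz)  # update gate
    r = sigmoid(hx @ Wr)  # reset gate
    h_candidate = np.tanh(np.concatenate([r * h, x], axis=1) @ Wh)
    return (1.0 - z) * h + z * h_candidate


def rfm_step(obs, h, senders, receivers, p):
    """Encode the observed graph, update per-agent memory, decode per-agent outputs."""
    encoded = gn_block(obs, senders, receivers, p["enc_edge"], p["enc_node"])
    h = gru(h, encoded, p["Wz"], p["Wr"], p["Wh"])
    outputs = gn_block(h, senders, receivers, p["dec_edge"], p["dec_node"], relu_out=False)
    return outputs, h


def init_params(rng, F=4, E=8, D=8, H=16, P=5):
    """Random weights; F=obs features, E=edge size, D=encoding, H=memory, P=output size."""
    shapes = {
        "enc_edge": (2 * F, E), "enc_node": (F + E, D),
        "Wz": (H + D, H), "Wr": (H + D, H), "Wh": (H + D, H),
        "dec_edge": (2 * H, E), "dec_node": (H + E, P),
    }
    return {name: rng.normal(scale=0.1, size=s) for name, s in shapes.items()}


# Three agents on a fully connected directed graph (every ordered pair i != j).
N = 3
senders = np.array([i for i in range(N) for j in range(N) if i != j])
receivers = np.array([j for i in range(N) for j in range(N) if i != j])

rng = np.random.default_rng(0)
params = init_params(rng)
h = np.zeros((N, 16))
for _ in range(5):  # a few timesteps of made-up observations
    obs = rng.normal(size=(N, 4))
    outputs, h = rfm_step(obs, h, senders, receivers, params)
```

  The recurrent state `h` is what lets the module track other agents over time; in the paper the decoder’s outputs are used as predictions about other agents, which this toy version only gestures at with random weights.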
  Testing: The researchers test their approach on three distinct tasks: cooperative navigation, which requires agents to collaborate so that each efficiently ends up on a distinct reward tile; coin game, which requires agents to position themselves above reward coins and to figure out, by observing each other, which coins yield a negative reward and should therefore be avoided; and stag hunt, where agents inhabit a map containing stags and apples and need to work with one another to capture stags, which yield a significant reward. “By embedding RFM modules in RL agents, they can learn to coordinate with one another faster than baseline agents, analogous to imagination-augmented agents in single-agent RL settings”, the researchers write.
  The researchers compare the performance of their systems against systems using Neural Relational Inference (NRI) and Vertex Attention Interaction Networks (VAIN) and find that their approach performs significantly better. They also ablated their system by training versions without relational networks, and versions using only feedforward networks. These ablations showed that both components play a significant role in the performance of these systems.
  Why it matters: The research is an extension of DeepMind’s work on integrating graph networks with deep learning. This line of research seems promising because it provides a way to integrate structured data representations with differentiable learning systems, which might let AI researchers have their proverbial cake and eat it too, marrying the flexibility of learned systems with the desirable problem specification and reasoning properties of more traditional symbolic approaches.
  Read more: Relational Forward Models for Multi-Agent Learning (Arxiv).

Rethink Robotics shuts its doors:
…Collaborative robots pioneer closes…
Rethink Robotics, a robot company founded by MIT robotics legend Rodney Brooks, has closed. The company had developed two robots: Baxter, a two-armed bright red robot with expressive features and the pleasing capability to work around humans without killing them, and Sawyer, a one-armed successor to Baxter.
  Read more: Rethink Robotics Shuts Down (The Verge).

Want to know what the DoD plans for unmanned systems through to 2042? Read the roadmap:
…Drones, robots, planes, oh my! Plus, the challenges of integrating autonomy with military systems…
The Department of Defense has published its (non-classified) roadmap for unmanned systems through to 2042. The report identifies four core areas that we can expect DoD to focus on: Interoperability, Autonomy, Network Security, and Human-Machine Collaboration.
  Perspective: US DoD spent ~$4.245 billion on unmanned systems in 2017 (inclusive of procurement and research, with a roughly equal split between them). That’s quite a substantial amount of money to spend and, if we can assume that this will remain the same (adjusted for inflation), then that means DoD can throw quite significant resources towards the capital R parts of unmanned systems research.
  Short-Term Priorities: DoD’s short-term priorities for its unmanned systems include: the use of standardized and/or open architectures; a shift towards modular, interchangeable parts; a greater investment in the evaluation, verification, and validation of systems; the creation of a “data transport” strategy to deal with the huge floods of data coming from such systems; among others.
  Autonomy priorities: DoD’s priorities for adding more autonomy to drones include increasing private sector collaboration in the short term and then adding in augmented reality and virtual reality systems by the mid-term (2029), before creating platforms capable of persistent sensing with “highly autonomous” capabilities by 2042. As for the thorny issue of weaponizing such systems, DoD says that between the medium-term and long-term it hopes to be able to give humans an “armed wingman/teammate” with fire control remaining with the human.
  Autonomy issues: “Although safety, reliability, and trust of AI-based systems remain areas of active research, AI must overcome crucial perception and trust issues to become accepted,” the report says. “The increased efficiency and effectiveness that will be realized by increased autonomy are currently limited by legal and policy constraints, trust issues, and technical challenges.”
  Why it matters: The maturation of today’s AI techniques means that it’s a matter of “when” not “if” for them to be integrated into military systems. Documents like this give us a sense of how large military bureaucracies are reacting to the rise of AI, and it’s notable that certain concerns within the technical community about the robustness and safety of AI systems have made their way into official DoD planning.
  Read the full report here: Pentagon Unmanned Systems Integrated Roadmap 2017-2042 (USNI News).

Should we take deep learning progress as being meaningful?
…UCLA Computer Science chair urges caution…
Adnan Darwiche, chairman of the Computer Science Department at UCLA and someone who studied AI mid-winter in the 1980s, has tried to lay out some of the reasons to be skeptical about whether deep learning will ever scale to let us build truly intelligent systems. The crux of his objection is: “Mainstream scientific intuition stands in the way of accepting that a method that does not require explicit modeling or sophisticated reasoning is sufficient for reproducing human-level intelligence”.
  Curve-fitting: The second component of the criticism is that people shouldn’t get too excited about neural network techniques because all they really do is curve-fitting, and instead we should be looking at using model-based approaches, or making hybrid systems.
  Time is the problem: “It has not been sustained long enough to allow sufficient visibility into this consequential question: How effective will function-based approaches be when applied to new and broader applications than those already targeted, particularly those that mandate more stringent measures of success?”
  Curve-fitting can’t explain itself: Another problem identified by the author is the lack of explanation inherent to these techniques, which they see as further justifying investment by the deep learning community into model-based approaches which include more assumptions and/or handwritten sections. “Model-based explanations are also important because they give us a sense of “understanding” or “being in control” of a phenomenon. For example, knowing that a certain diet prevents heart disease does not satisfy our desire for understanding unless we know why.”
  Giant and crucial caveat: Let’s be clear that this piece is essentially reacting to a cartoonish representation of the deep learning AI community that can be caricatured as having this opinion: Deep Learning? Yeah! Yeah! Yeah! Deep Learning is the future of AI! I should note that I’ve never met anyone technically sophisticated who has this position, and most researchers when pressed will raise somewhat similar concerns to those identified in this article. I think some of the motivation for this article stems more from dissatisfaction with the current state of (most) media coverage regarding AI which tends to be breathless and credulous – this is a problem, but as far as I can tell it isn’t really a problem being fed intentionally by people within the AI community, but is instead a consequence of the horrific economics of the post-digital news business and associated skill-rot that occurs.
  Why it matters: Critiques like this are valuable as they encourage the AI community to question itself. However, I think that these critiques need to be produced over significantly shorter timescales and should take into account more contemporary research; for instance, some of the objections here seem to be (lightly) rebutted by recent work in NLP which shows that “curve-fitting” systems are capable of feats of reasoning, among other examples. (The article’s conclusion notes that the first draft was written in 2016 and circulated in the summer of 2017, and it has now been officially published in Autumn 2018, rendering many of its technical references outdated.)
  Read more: Human-level intelligence or animal-like abilities (ACM Digital Library).

Major companies create AI Benchmark and test 10,000+ phones for AI prowess, and a surprising winner emerges:
…Another sign of the industrialization of AI…new benchmarks create standards and standards spur markets…
Researchers with ETH Zurich, Google, Qualcomm, Huawei, MediaTek, and ARM want to be able to better analyze the performance of AI software on different smartphones, so they have created “AI Benchmark” and tested over 10,000 devices against it. AI Benchmark is a batch of nine tests for mobile devices which has been “designed specifically to test the machine learning performance, available hardware AI accelerators, chipset drivers, and memory limitations of the current Android devices”.
  The ingredients of the AI Benchmark: The benchmark consists of nine deep learning tests: Image Recognition tested on ImageNet using a lightweight MobileNet-V1 architecture, and the same test but implementing a larger Inception-V3 network; Face Recognition performance of an Inception-Resnet-V1 on the VGGFace2 dataset; Image Deblurring using the SRCNN network; Image Super-Resolution with a downscaling factor of 3 using a VDSR network, and the same test but with a downscaling factor of 4 and using an SRGAN; Image Semantic Segmentation via an ICNet CNN; a general Image Enhancement problem (encompassing things like “color enhancement, denoising, sharpening, texture synthesis”); and a memory limitations test which uses the same network as in the deblurring task while testing it over larger and larger image sizes to explore RAM limitations.
  Results: The researchers tested “over 10,000 mobile devices” on the benchmark. The core metric for each of the benchmark’s nine tests is the time, in milliseconds, it takes to run the network. The researchers blend the results of the nine tests into an overall “AI-Score”. The top results, when measured via AI-Score, are (device (chipset, score)):
#1: Huawei P20 Pro (HiSilicon Kirin 970, 6519)
#2: OnePlus 6 (Snapdragon 845/DSP, 2053)
#3: HTC U12+ (Snapdragon 845, 1708)
#4: Samsung Galaxy S9+ (Exynos 9810 Octa, 1628)
#5: Samsung Galaxy S8 (Exynos 8895 Octa, 1413)
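  The paper blends the per-test timings into a single AI-Score, but I’m not reproducing its exact formula here. As a hypothetical illustration of how such a blend can work, the sketch below takes the geometric mean of a device’s speedups over a reference device, so that no single slow or fast test dominates; the test names, latencies, and the 1,000-point scale are all made up.

```python
import math


def ai_score(latencies_ms, reference_ms, scale=1000.0):
    """Hypothetical blend: `scale` times the geometric mean of speedups over a reference device."""
    if set(latencies_ms) != set(reference_ms):
        raise ValueError("device and reference must report the same tests")
    # log(reference / device) > 0 when the device is faster than the reference.
    log_speedups = [math.log(reference_ms[t] / latencies_ms[t]) for t in reference_ms]
    return scale * math.exp(sum(log_speedups) / len(log_speedups))


# Made-up per-test latencies in milliseconds.
reference = {"mobilenet_v1": 40.0, "inception_v3": 400.0, "srcnn_deblur": 900.0}
device = {"mobilenet_v1": 20.0, "inception_v3": 200.0, "srcnn_deblur": 450.0}
print(ai_score(device, reference))  # roughly 2000: twice as fast on every test
```

  A geometric mean is a common choice for this kind of composite because a 2x speedup on a cheap test counts the same as a 2x speedup on an expensive one.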
  It’s of particular interest to me that the top-ranking performance seems to come from the special AI accelerator which ships with the HiSilicon chip, especially given that HiSilicon is a Chinese semiconductor company, so this provides more evidence of Chinese advancement in this area. It’s also notable to me that Google’s ‘Pixel’ phones didn’t make the top 5 (though they did make the top 10).
  The future: This first version of the benchmark may be slightly skewed due to Huawei managing to ship a device incorporating a custom AI accelerator earlier than many other chipmakers. “The real situation will become clear at the beginning of the next year when the first devices with the Kirin 980, the MediaTek P80 and the next Qualcomm and Samsung Exynos premium SoCs will appear on the market,” the researchers note.
  Why this matters: I think the emergence of new large-scale benchmarks for applied AI applications represents further evidence for the current era being ‘the Industrialization of AI’. Viewed through this perspective, the creation (and ultimate adoption) of benchmarks gives us a greater ability to model macro progress indicators in the field and use those to better predict not only where hardware & software are today, but also to develop better intuitions about the underlying laws that condition the future.
  Read more: AI Benchmark: Running Deep Neural Networks on Android Smartphones (Arxiv).
  Check out the full results of the Benchmark here (AI Benchmark).

Toyota researchers propose new monocular depth estimation technique:
…Perhaps a cyclops can estimate depth just as well as a person with two eyes, if deep learning can help?…
Any robot expected to act within the world and around people needs some kind of depth-estimation capability. For a self-driving car, such a capability helps it estimate the proximity of objects to the vehicle, and provides a valuable data input for safety-critical calculations like modelling the other entities in the environment and performing velocity calculations. Therefore, depth estimation systems can be viewed as a key input technology for any self-driving car.
  But depth estimation systems can be difficult to implement, and they can be expensive, as the typical approach is a binocular (stereo) system, similar to how humans have two eyes: software measures the offset (disparity) between the two views and uses it to estimate depth. But what if you can only afford one sensor? And what if your use case can tolerate somewhat lower accuracy than you would expect to get with binocular vision? Then you might want to estimate depth from a single sensor, and new deep learning techniques in upscaling and super-resolution might be able to help you perform accurate depth estimation in a self-supervised manner.
  That’s the idea behind a technique from the Toyota Research Institute, which uses encoder and decoder networks to learn a good representation of depth that can be applied to new images. The technique obtains higher accuracy scores for depths of various ranges, setting state-of-the-art scores on 5 out of 6 benchmarks. It relies on a “sub-pixel convolutional layer based on ESPCN for depth super-resolution”. This component “synthesizes the high-resolution disparities from their corresponding low-resolution multi-scale model outputs”.
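  The stereo approach rests on a simple relationship: for a calibrated, rectified camera pair, depth Z = f·B/d, where f is the focal length in pixels, B is the baseline (distance between the cameras), and d is the disparity in pixels. Self-supervised monocular methods in this family typically predict a disparity map from a single image (learning it from stereo pairs at training time), so the same conversion applies. A minimal sketch, with illustrative camera parameters that are assumptions rather than values from the paper:

```python
import numpy as np


def disparity_to_depth(disparity_px, focal_length_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair; zero disparity maps to infinity."""
    disparity_px = np.asarray(disparity_px, dtype=float)
    depth_m = np.full_like(disparity_px, np.inf)
    valid = disparity_px > 0
    depth_m[valid] = focal_length_px * baseline_m / disparity_px[valid]
    return depth_m


# Illustrative values only: a ~720 px focal length and a ~0.5 m baseline.
print(disparity_to_depth([40.0, 10.0, 0.0], focal_length_px=720.0, baseline_m=0.54))
```

  Because depth is inversely proportional to disparity, small disparity errors on distant objects translate into large depth errors, which is one reason sharper, higher-resolution disparity maps matter.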
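  The sub-pixel convolution from ESPCN works by having a convolution output C·r² channels at low resolution and then rearranging (“pixel shuffling”) those channels into a C-channel image that is r times larger in each spatial dimension. The NumPy sketch below shows only that rearrangement step; the preceding convolution, and the exact way SuperDepth wires this into its decoder, are not shown.

```python
import numpy as np


def pixel_shuffle(x, r):
    """Rearrange a (C * r * r, H, W) array into (C, H * r, W * r), ESPCN-style."""
    c_r2, h, w = x.shape
    if c_r2 % (r * r) != 0:
        raise ValueError("channel count must be divisible by r * r")
    c = c_r2 // (r * r)
    # Split channels into (C, r, r), then interleave the two r-axes with H and W:
    # out[c, h * r + i, w * r + j] = x[c * r * r + i * r + j, h, w]
    return x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)


low_res = np.arange(4.0).reshape(4, 1, 1)  # one 1x1 "image" with 4 = 1 * 2 * 2 channels
print(pixel_shuffle(low_res, r=2))  # a (1, 2, 2) array holding [[0, 1], [2, 3]]
```

  Since every output pixel is computed at low resolution and only rearranged afterwards, the layer avoids the checkerboard-prone interpolation of naive upsampling, which fits the smoother samples described below.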
  Qualitative evaluation: Samples generated by the model display sharper detail and greater smoothness than those of other methods. This is in part due to the use of the sub-pixel resolution technique. This technique yields an effect in samples shown in the paper that strikes me as being visually similar to the outcome of an anti-aliasing process in traditional computer graphics.
  Read more: SuperDepth: Self-Supervised, Super-Resolved Monocular Depth Estimation (Arxiv).

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:

California considers Turing Test law:
California’s Senate is considering a bill making it unlawful to use bots to mislead individuals about their artificial identities in order to influence their purchases or voting behaviour. The bill appears to be focused on a few specific use-cases, particularly social media bots. The proposed law would come into force in July 2019.
  Why it matters: This law points to an issue that will become increasingly important as AI systems’ ability to mimic humans improves. This received attention earlier this year when Google demonstrated its Duplex voice assistant mimicking a human to book appointments. After significant backlash, Google announced the system would make a verbal disclosure that it was an AI. Technological solutions will be important in addressing issues around AI identification, particularly since bad actors are unlikely to be concerned with lawfulness.
  Read more: California Senate Bill 1001.

OpenAI Bits & Pieces:

Digging into AI safety with Paul Christiano:
Ever wondered about technical solutions to AI alignment, what the long-term policy future looks like when the world contains intelligent machines, and how we expect machine learning to interact with science? Yes? Then check out this 80,000 Hours podcast with Paul Christiano of OpenAI’s safety team.
  Read more: Dr Paul Christiano on how OpenAI is developing real solutions to the ‘AI alignment problem’, and his vision of how humanity will progressively hand over decision-making to AI systems.

Tech Tales:

The Day We Saw The Shadow Companies and Ran The Big Excel Calculation That Told Us Something Was Wrong.

A fragment of a report from the ‘Ministry of Industrial Planning and Analysis’, recovered following the Disruptive Event. See case file #892 for further information. Refer to [REDACTED] for additional context.

Aluminium supplier. Welder. Small PCB board manufacturer. Electronics contractor. Solar panel farm. Regional utility supplier. Mid-size drone designer. 3D world architect.

What do these things have in common? They’re all businesses, and they all have, as far as we can work out, zero employees. Sure, they employ some contractors to do some physical work, but mostly these businesses are run on a combination of pre-existing capital investments, robotic process automation, and the occasional short-term set of human hands.

So far, so normal. We get a lot of automated companies these days. What’s different about this is the density of trades between these companies. The more we look at their business records, the more inter-company activity we see.

One example: The PCB boards get passed to an electronics contractor which does… something… to them, then they get passed to a mid-size drone designer which does… something… to them, then a drone makes its way to a welder which does… something… to the drone, then the drone gets shipped to the utility supplier and begins survey flights of the utility field.

Another example: The solar panel gets shipped to the welder. Then the PCB board manufacturer ships something to the welder. Then out comes a solar panel with some boards on it. This gets shipped to the regional utility supplier which sub-contracts with the welder which comes to the site and does some welding at a specific location overseen by a modified drone.

None of these actions are illegal. And none of our automated algorithms pick these kinds of events up. It’s almost like they’re designed to be indistinguishable from normal businesses. But something about it doesn’t register right to us.

We have a tool we use. It’s called the human to capital ratio. Most organizations these days sit somewhere around 1:5. Big, intensive organizations, like oil companies, sit up around 1:25. When we analyze these companies individually we find that they sit right at the edges of normal distributions in terms of capital intensity. But when we perform an aggregate analysis out pops this number: 1:40.

We’ve checked and re-checked and we can’t bring the number down. Who owns these companies? Why do they have so much capital and so few humans? And what is it all driving towards?

Our current best theory, after some conversations with the people in the acronym agencies, is [REDACTED].

Things that inspired this story: Automated capitalism, “the blockchain”, hiding in plain sight, national economic metric measurement and analysis techniques, the reassuring tone of casework files.

Import AI 114: Synthetic images take a big leap forward with BigGANs; US lawmakers call for national AI strategy; researchers probe language reasoning via HotpotQA

Getting hip to multi-hop reasoning with HotpotQA:
…New dataset and benchmark designed to test common sense reasoning capabilities…
Researchers with Carnegie Mellon University, Stanford University, the Montreal Institute for Learning Algorithms, and Google AI have created a new dataset and associated competition designed to test the capabilities of question answering systems. The new dataset, HotpotQA, is far larger than many prior datasets designed for such tasks, and has been designed to require ‘multi-hop’ reasoning, thereby testing how well newer NLP systems can perform increasingly complex cognitive tasks.
  HotpotQA consists of ~113,000 Wikipedia-based question-answer pairs. Answering these questions correctly is designed to require ‘multi-hop’ reasoning: the ability of systems to look at multiple documents and perform basic iterative problem-solving to come up with correct answers. These questions were “collected by crowdsourcing based on Wikipedia articles, where crowd workers are shown multiple supporting context documents and asked explicitly to come up with questions requiring reasoning about all of the documents”. These workers also provide the supporting facts they use to answer these questions, providing a source of strong supervision for training.
  It’s the data, stupid: To develop HotpotQA, the researchers needed to build their own kind of multi-hop pipeline to figure out which documents to give crowd workers to compose questions about. To do this, they mapped the Wikipedia hyperlink graph into a directed graph, then used pairs of linked articles from it as candidate document pairs. They also created a hand-made list of categories to use for comparing things of similar categories (eg, basketball players, etc).
  Testing: HotpotQA can be used to test models’ capabilities in different ways, ranging from information retrieval to question answering. The researchers train a relatively strong baseline system, which obtains performance significantly below that of a competent human across all tasks (with the exception of certain ‘supporting fact’ evaluations, in which it performs on par with an average human).
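  The pair-selection idea can be sketched in a few lines. This is a hypothetical, simplified illustration rather than the authors’ pipeline: the article names and category list are made up, and it shows only the two ideas described above, pairing an article with an article it links to (the linked article acting as a “bridge” between the two documents), and pairing entities from the same hand-made category for comparison questions.

```python
from itertools import combinations


def bridge_pairs(links):
    """Directed (source, bridge) pairs: the source article links to the bridge article."""
    return sorted(
        (source, target)
        for source, targets in links.items()
        for target in targets
        if target != source and target in links  # keep only links to known articles
    )


def comparison_pairs(categories):
    """Unordered pairs of entities drawn from the same hand-made category list."""
    return [
        pair
        for members in categories.values()
        for pair in combinations(sorted(members), 2)
    ]


# Hypothetical toy data: article -> articles it links to, plus one category list.
links = {
    "Article A": ["Article B", "Article C"],
    "Article B": ["Article C", "Some Unknown Page"],
    "Article C": [],
}
categories = {"basketball players": ["Player X", "Player Y", "Player Z"]}

print(bridge_pairs(links))
print(comparison_pairs(categories))
```

  Each resulting pair would then be shown to a crowd worker, who writes a question that can only be answered by combining information from both documents.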
  Why it matters: Natural language processing research is currently going through what some have called an ‘ImageNet moment’ following recent algorithmic developments relating to the usage of memory and attention-based systems, which have demonstrated significantly higher performance across a range of reasoning tasks compared to prior techniques, while also being typically much simpler. Like with ImageNet and the associated supervised classification systems, these new types of NLP approaches require larger datasets to be trained on and evaluated against, and as with ImageNet it’s likely that by scaling up techniques to take on challenges defined by datasets like HotpotQA progress in this domain will increase further.
  Caveat: As with all datasets with an associated competitive leaderboard, it is possible that HotpotQA will prove relatively easy and that systems will end up exceeding human performance against it in a relatively short amount of time; this happened over the past year with the Stanford SQuAD dataset. Hopefully the relatively higher sophistication of HotpotQA will protect against this.
  Read more: HotpotQA website with leaderboard and data (HotpotQA Github).
  Read more: HOTPOTQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (Arxiv).

Administrative note regarding ICLR papers:
This week was the deadline for submissions to the International Conference on Learning Representations. The submitted papers are posted publicly but anonymously, as they are currently going through blind review. This year, there were 1600 submissions to ICLR, up from 1000 in 2017, 500 in 2016, and 250 in 2015. I’ll be going through some of these papers in this issue and others, and will try to avoid making predictions about which organizations are behind which papers so as to respect the blind review process.

Computers can now generate (some) fake images that are indistinguishable from real ones:
…BigGANs show significant progress in synthetic imagery…
The researchers train GAN models with 2-4X the parameters and 8X the batch size of prior work, and also introduce techniques to improve the stability of GAN training.
  Some of the implemented techniques mean that samples generated by such GAN models can be tuned, allowing for “explicit, fine-grained control of the trade-off between sample variety and fidelity”. In practice, this ‘truncation trick’ works by sampling the generator’s input noise vector from a truncated normal distribution, resampling any values whose magnitude exceeds a chosen threshold: a lower threshold yields images that are individually more convincing but less varied, while a higher threshold yields more variety at some cost in quality. The addition of this kind of dial seems useful to me, particularly for people using such systems to generate faked images where realism matters more than diversity.
  Image quality: Images generated via these GANs are of far superior quality to those of prior systems, and can be output at relatively large resolutions of 512×512 pixels. I encourage you to take a look at the paper and judge for yourself, but it’s evident from the (cherry-picked) samples that, given sufficient patience, a determined person can now generate photoreal faked images as long as they have a precise enough set of data on which to train.
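  The truncation trick itself is only a few lines. Here’s a NumPy sketch that samples the noise vectors and resamples any entries whose magnitude exceeds the threshold; the generator that would turn each row into an image is not shown, and the batch size, noise dimension, and thresholds are arbitrary choices for illustration.

```python
import numpy as np


def truncated_noise(batch_size, dim, threshold, rng):
    """Sample z ~ N(0, I), resampling any entry whose magnitude exceeds `threshold`."""
    z = rng.standard_normal((batch_size, dim))
    out_of_range = np.abs(z) > threshold
    while out_of_range.any():
        # Redraw only the offending entries until every value is within range.
        z[out_of_range] = rng.standard_normal(out_of_range.sum())
        out_of_range = np.abs(z) > threshold
    return z


rng = np.random.default_rng(0)
narrow = truncated_noise(1000, 128, threshold=0.5, rng=rng)  # favours fidelity
wide = truncated_noise(1000, 128, threshold=2.0, rng=rng)    # favours variety
print(narrow.std(), wide.std())
```

  Shrinking the threshold pulls every input toward the centre of the noise distribution, where the generator has been trained most, which is why fidelity goes up as variety goes down.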
  Problems remain: There are still some drawbacks to the approach; GANs are notorious for their instability during training, and developers of such systems need to develop increasingly sophisticated approaches to deal with the instabilities in training that manifest at increasingly larger scales, leading to a certain time-investment tradeoff inherent to the scale-up process. The researchers do devise some tricks to deal with this, but they’re quite elaborate. “We demonstrate that a combination of novel and existing techniques can reduce these instabilities, but complete training stability can only be achieved at a dramatic cost to performance,” they write.
  Why it matters: One of the most interesting aspects of the paper is how simple the approach is: take today’s techniques, try to scale them up, and conduct some targeted research into dealing with some of the rough edges of the problem space. This seems analogous to recent work on scaling up algorithms in RL, where both DeepMind and OpenAI have developed increasingly large-scale training methodologies paired with simple scaled-up algorithms (eg DQN, PPO, A2C, etc).
  “We find that current GAN techniques are sufficient to enable scaling to large models and distributed, large-batch training. We find that we can dramatically improve the state of the art and train models up to 512×512 resolution without need for explicit multiscale methods,” the researchers write.
  Read more: Large Scale GAN Training For High Fidelity Natural Image Synthesis (ICLR 2019 submissions, OpenReview).
  Check out the samples: Memo Akten has pulled together a bunch of interesting and/or weird samples from the model here, which are worth checking out (Memo Akten, Twitter).

Want better RL performance? Try remembering what you’ve been doing recently:
…Recurrent Replay Distributed DQN (R2D2) obtains state-of-the-art on Atari & DMLab by a wide margin…
R2D2 is based on a tweaked version of Ape-X, a large-scale reinforcement learning system developed by DeepMind which displays good performance and sample efficiency when trained at large-scale. Ape-X uses prioritized distributed replay, using a single learner to learn from the experience of numerous distinct actors (typically 256).
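  Prioritized replay samples stored experience in proportion to a priority (usually derived from the magnitude of the TD error) rather than uniformly. Below is a minimal NumPy sketch of the proportional variant from the original prioritized experience replay paper, with importance-sampling weights to correct for the resulting bias. In Ape-X the actors compute initial priorities and the learner updates them, which this single-process sketch leaves out; the priorities and the α/β values are placeholders.

```python
import numpy as np


def sample_prioritized(priorities, batch_size, alpha, beta, rng):
    """Proportional prioritized sampling with importance-sampling (IS) weights.

    Priorities must be positive, otherwise the max IS weight is unbounded.
    """
    scaled = np.asarray(priorities, dtype=float) ** alpha
    probs = scaled / scaled.sum()  # P(i) = p_i^alpha / sum_k p_k^alpha
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    n = len(probs)
    # w_i = (N * P(i))^-beta, normalised by the largest possible weight so all w_i <= 1.
    weights = (n * probs[idx]) ** (-beta)
    max_weight = (n * probs.min()) ** (-beta)
    return idx, weights / max_weight, probs


rng = np.random.default_rng(0)
priorities = [0.1, 1.0, 5.0, 0.5]  # placeholder |TD error| values
idx, weights, probs = sample_prioritized(priorities, batch_size=8, alpha=0.6, beta=0.4, rng=rng)
```

  Setting alpha to 0 recovers uniform sampling, and the weights then all equal 1.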
  New tricks for old algos: The researchers implement two relatively simple strategies to help them train the R2D2 algorithm to be smarter about how it uses its memory to learn more complex problem-solving strategies. These tweaks are to store the recurrent state in the replay buffer and use it to initialize the network at training time, and “allow the network a ‘burn-in period’ by using a portion of the replay sequence only for unrolling the network and producing a start state, and update the network only on the remaining part of the sequence.”
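  The two tweaks can be illustrated with a toy recurrent network. In this NumPy sketch, a plain tanh RNN stands in for R2D2’s LSTM, and the sequence and burn-in lengths are hypothetical: the unroll starts from the recurrent state stored in the replay buffer, runs the first `burn_in` steps only to refresh that state, and returns only the remaining steps’ states for use in the loss. NumPy has no autodiff, so the “no update during burn-in” part is noted in a comment rather than enforced.

```python
import numpy as np


def rnn_step(h, x, W_h, W_x):
    # A plain tanh RNN cell standing in for R2D2's LSTM.
    return np.tanh(h @ W_h + x @ W_x)


def unroll_with_burn_in(sequence, stored_state, burn_in, W_h, W_x):
    """Return (states used for the loss, state after burn-in)."""
    if burn_in >= len(sequence):
        raise ValueError("burn-in must leave at least one step for training")
    h = stored_state  # stored-state strategy: start from the replayed state, not zeros
    for x in sequence[:burn_in]:
        # Burn-in: refresh the (possibly stale) stored state with the current weights.
        # In an autodiff framework these steps would sit behind a stop-gradient.
        h = rnn_step(h, x, W_h, W_x)
    state_after_burn_in = h
    states_for_loss = []
    for x in sequence[burn_in:]:
        h = rnn_step(h, x, W_h, W_x)
        states_for_loss.append(h)
    return np.stack(states_for_loss), state_after_burn_in


rng = np.random.default_rng(0)
obs_dim, hidden = 8, 32
W_h = rng.normal(scale=0.1, size=(hidden, hidden))
W_x = rng.normal(scale=0.1, size=(obs_dim, hidden))
sequence = rng.normal(size=(80, obs_dim))          # hypothetical 80-step replay sequence
stored_state = rng.normal(scale=0.1, size=hidden)  # state recorded by the actor
states, refreshed = unroll_with_burn_in(sequence, stored_state, 40, W_h, W_x)
```

  The point of the burn-in is that the stored state was produced by older network weights; unrolling a few steps with the current weights before computing the loss reduces that mismatch.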
  Results: R2D2 obtains vastly higher scores than any prior system on these tasks and, thanks to its large-scale training, can be trained to achieve ~1300% human-normalized scores on Atari (a median over 57 games, so it does even better on some and substantially worse on others). The researchers also test it on DMLab-30, a set of 3D environments for training agents which is designed to be more difficult than Atari; here, too, the system displays extremely good performance when compared to prior systems.
  It’s all in the memory: The system does well here on some fairly difficult environments, and notably the authors show via some ablation studies that the agent does appear to be using its in-built memory to solve tasks. “We first observe that restricting the agent’s memory gradually decreases its performance, indicating its nontrivial use of memory on both domains. Crucially, while the agent trained with stored state shows higher performance when using the full history, its performance decays much more rapidly than for the agent trained with zero start states. This is evidence that the zero start state strategy, used in past RNN-based agents with replay, limits the agent’s ability to learn to make use of its memory. While this doesn’t necessarily translate into a performance difference (like in MS.PACMAN), it does so whenever the task requires an effective use of memory (like EMSTM WATERMAZE),” they write.
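For readers unfamiliar with the metric, a human-normalized score rescales each game so that random play scores 0% and the human baseline scores 100%; the headline number is the median across games. The sketch below uses that standard definition with made-up per-game scores (not figures from the paper), to show why a median can sit far above 100% while individual games vary widely.

```python
from statistics import median


def human_normalized(agent, random_score, human):
    """Percent human-normalized score: 0% = random play, 100% = human baseline."""
    return 100.0 * (agent - random_score) / (human - random_score)


# Hypothetical (agent, random, human) scores per game; NOT taken from the paper.
games = {
    "game_a": (500.0, 100.0, 300.0),
    "game_b": (1100.0, 100.0, 200.0),
    "game_c": (150.0, 50.0, 1050.0),
}
scores = {name: human_normalized(*s) for name, s in games.items()}
overall = median(scores.values())
```

With these toy numbers the per-game scores are 200%, 1000%, and 10%, and the median is 200%: one very strong game and one weak game do not move the summary figure.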
  Read more: Recurrent Experience Replay In Distributed Reinforcement Learning (ICLR 2019 submissions, OpenReview).

US lawmakers call for national AI strategy and more funding:
…The United States cannot maintain its global leadership in AI absent political leadership from Congress and the Executive Branch…
Lawmakers from the US’s Subcommittee on Information Technology of the House Committee on Oversight and Government Reform have called for the creation of a national strategy for artificial intelligence led by the current administration, as well as more funding for basic research.
  The comments from Chairman Will Hurd and Ranking Member Robin Kelly are the result of a series of three hearings held by that committee in 2018 (Note: I testified at one of them). It’s a short paper and worth reading in full to get a sense of what policymakers are thinking with regard to AI.
  Notable quotes: “The United States cannot maintain its global leadership in AI absent political leadership from Congress and the Executive Branch.” + Government should “increase federal spending on research and development to maintain American leadership with respect to AI” + “It is critical the federal government build upon, and increase, its capacity to understand, develop, and manage the risks associated with this technology’s increased use” + “American competitiveness in AI will be critical to ensuring the United States does not lose any decisive cybersecurity advantage to other nationstates”.
  China: China looms large in the report as a symbol that “the United States’ leadership in AI is no longer guaranteed”. One analysis contained within the paper says China is likely “to pass the United States in R&D investments” by the end of 2018 – significant, considering that the US’s annual outlay of approximately $500 billion makes it the biggest spender on the planet.
  Measurement: The report suggests that “at minimum” the government should develop “a widely agreed upon standard for measuring the safety and security of AI products and applications” and notes the existence of initiatives like The AI Index as good starts.
  Money: “There is a need for increased funding for R&D at agencies like the National Science Foundation, National Institutes of Health, Defense Advanced Research Project Agency, Intelligence Advanced Research Project Agency, National Institute of Standards and Technology, Department of Homeland Security, and National Aeronautics and Space Administration. As such, the Subcommittee recommends the federal government provide for a steady increase in federal R&D spending. An additional benefit of increased funding is being able to support more graduate students, which could serve to expand the future workforce in AI.”
  Leadership: “There is also a pressing need for conscious, direct, and spirited leadership from the Trump Administration. The 2016 reports put out by the Obama Administration’s National Science and Technology Council and the recent actions of the Trump Administration are steps in the right direction. However, given the actions taken by other countries—especially China— Congress and the Administration will need to increase the time, attention, and level of resources the federal government devotes to AI research and development, as well as push for agencies to further build their capacities for adapting to advanced technologies.”
  Read more: Rise of the Machines: Artificial Intelligence and its Growing Impact on US Policy (Homeland Security Digital Library).

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:…

Open Philanthropy Project opens applications for AI Fellows:
The Open Philanthropy Project, the grant-making foundation funded by Cari Tuna and Dustin Moskovitz, is accepting applications for its 2019 AI Fellows Program. The program will provide full PhD funding for AI/ML researchers focused on the long-term impacts of advanced AI systems. The first cohort of AI Fellows were announced in June of this year.
  Key details: “Support will include a $40,000 per year stipend, payment of tuition and fees, and an additional $10,000 in annual support for travel, equipment, and other research expenses. Fellows will be funded from Fall 2019 through the end of the 5th year of their PhD, with the possibility of renewal for subsequent years. We do encourage applications from 5th-year students, who will be supported on a year-by-year basis.”
Read more: Open Philanthropy Project AI Fellows Program (Open Phil).
Read more: Announcing the 2018 AI Fellows (Open Phil).

Google confirms Project Dragonfly in Senate:
Google have confirmed the existence of Project Dragonfly, an initiative to build a censored search engine within China, as part of Google’s broad overture towards the world’s second largest economy. Google’s chief privacy officer declined to give any details of the project, and denied the company was close to launching a search engine in the country. A former senior research scientist, who publicly resigned over Dragonfly earlier this month, had written to Senators ahead of the hearings, outlining his concerns with the plans.
  Why it matters: Google is increasingly fighting a battle on two fronts with regard to Dragonfly, with critics concerned about the company’s complicity in censorship and human rights abuses, and others suspicious of Google’s willingness to cooperate with the Chinese government so soon after pulling out of a US defense project (Maven).
  Read more: Google confirms Dragonfly in Senate hearing (VentureBeat).
  Read more: Former Google scientist slams ‘unethical’ Chinese search project in letter to senators (The Verge).

DeepMind releases framework for AI safety research:
…AI company also launches new AI safety blog…
DeepMind’s safety team have launched their new blog with a research agenda for technical AI safety research. They divide the field into three areas: specification, robustness, and assurance.
  Specification research is aimed at ensuring an AI system’s behavior aligns with the intentions of its operator. This includes research into how AI systems can infer human preferences, and how to avoid problems of reward hacking and wire-heading.
  Robustness research is aimed at ensuring a system is robust to changes in its environment. This includes designing systems that can safely explore new environments and withstand adversarial inputs.
  Assurance research is aimed at ensuring we can understand and control AI systems during operation. This includes research into the interpretability of algorithms, and the design of systems that can be safely interrupted (e.g. off-switches for advanced AI systems).
  Why it matters: This is a useful taxonomy of research directions that will hopefully contribute to a better understanding of problems in AI safety within the AI/ML community. DeepMind has been an important advocate for safety research since its inception. It is important to remember that AI safety is still dwarfed by AI capabilities research by several orders of magnitude, in terms of both funding and number of researchers.
  Read more: Building Safe Artificial Intelligence (DeepMind via Medium).

OpenAI Bits & Pieces:

OpenAI takes on Dota 2: Short Vice documentary:
As part of our Dota project we experimented with new forms of comms, including having a doc crew from Vice film us in the run-up to our competition at The International.
  Check out the documentary here: This Robot is Beating the World’s Best Video Gamers (Vice).

Tech Tales:

They call the new drones shepherds. We call them prison guards. The truth is somewhere in-between.

You can do the math yourself. Take a population. Get the birth rate. Project over time. That’s the calculus the politicians did that led to them funding what they called the ‘Freedom Research Initiative to Eliminate Negativity with Drones’ (FRIEND).

FRIEND provided scientists with a gigantic bucket of money to fund research into creating more adaptable drones that could, as one grant document stated, ‘interface in a reassuring manner with ageing citizens’. The first FRIEND drones were like pet parrots, and they were deployed into old people’s homes in the hundreds of thousands. Suddenly, when you went for a walk outside, you were accompanied by a personal FRIEND-Shepherd which would quiz you about the things around you to stave off age-based neurological decline. And when you had your meals there was now a drone hovering above you, scanning your plate, and cheerily exclaiming “that’s enough calories for today!” when it had judged you’d eaten enough.

Of course we did not have to do what the FRIEND-Shepherds told us to do. But many people did and for those of us who had distaste for the drones, peer pressure did the rest. I tell myself that I am merely pretending to do what my FRIEND-Shepherd says, as it takes me on my daily walk and suggests the addition or removal of specific ingredients from my daily salad to ‘maintain optimum productivity via effective meal balancing’.

Anyway, as the FRIEND program continued the new Shepherds became more and more advanced. But people kept on getting older and birth rates kept on falling; the government couldn’t afford to buy more drones to keep up with the growing masses of old people, so it directed FRIEND resources towards increasing the autonomy and, later, ‘persuasiveness’ of such systems.

Over the course of a decade the drones went from parrots to pop psychologists with a penchant for nudge economics. Now, we’re still not “forced” to do anything by the Shepherds, but the Shepherds are very intelligent and much of what they spend their time doing is finding out what makes us tick so they can encourage us to do the thing that extends lifespan while preserving quality of life.

The Shepherd assigned to me and my friends has figured out that I don’t like Shepherds. It has started to learn to insult me, so that I chase it. Sometimes it makes me so angry that I run around the home, trying to knock it out of the air with my walking stick. “Well done,” it will say after I am out of breath. “Five miles, not bad for a useless human.” Sometimes I will then run at it again, and I believe I truly am running at it because I hate it and not because it wants me to. But do I care about the difference? I’m not sure anymore.

Things that inspired this story: Drones, elderly care robots, the cruel and inescapable effects of declining fertility in developed economies, JG Ballard, Wall-E, social networks, emotion-based AI analysis systems, NLP engines, fleet learning with individual fine-tuning.

Import AI 113: Why satellites+AI gives us a global eye; industry pays academia to say sorry for strip-mining it; and Kindred researchers seek robot standardization

Global eye: Planet and Orbital Insight expand major satellite imagery deal:
…The future of the world is a globe-spanning satellite-intelligence utility service…
Imagine what it’s like to be working in a medium-level intelligence agency in a mid-size country when you read something like this: “Planet, who operates the largest constellation of imaging satellites, and Orbital Insight, the leader in geospatial analytics, announced today a multi-year contract for Orbital Insight to source daily, global, satellite imagery from Planet”. I imagine that you might think: ‘Wow! That looks a lot like all those deals we have to do secretly with other mid-size countries to access each other’s imagery. And these people get to do it in the open!?’ Your next thought might be: how can I buy services from these companies to further my own intelligence capabilities?
  AI + Intelligence: The point I’m making is that artificial intelligence is increasingly relevant to the sorts of tasks that intelligence agencies traditionally specialize in, but with the twist that lots of these intelligence-like tasks (say, automatically counting the cars in a set of parking lots across a country, or analyzing congested-versus-non-congested roads in other cities, or homing in on unusual ships in unusual waters) are now available in the private sector as well. This general diffusion of capabilities is creating many commercial and scientific benefits, but it is also narrowing the gap in capability between what people can buy versus what people can only access if they are a nuclear-capable power with a significant classified budget and access to a global internet dragnet. Much of the stability of the 20th century was derived from there being (eventually) a unipolar world in geopolitical terms, with much of this stemming from inbuilt technological advantages. The ramifications of this diffusion of capability are intimately tied-up with issues relating to the ‘dual-use’ nature of AI and to the changing nature of geopolitics. I hope deals like the above provoke further consideration of just how powerful – and how widely available – modern AI systems are.
  Read more: Planet and Orbital Insight Expand Satellite Imagery Partnership (Cision PR Newswire).

Robots and standards are a match made in hell, but Kindred thinks it doesn’t have to be this way:
…New robot benchmarks seek to bring standardization to a tricky area of AI…
Researchers with robotics startup Kindred have built on prior work on robot standardization (Import AI #87) to make it easier for researchers to compare the performance of real world robots against one another, creating a suite of two tasks for each of three commercially available robot platforms.
  Robots used: Universal Robots UR5 collaborative arm, Robotis MX-64AT Dynamixel actuators (which are frequently used within other robots), and a hockey-puck-shaped Create2 mobile robot.
  Standard tasks: For the UR5 arm the researchers create two reaching tasks with varying difficulty, achieved by selectively turning on/off different actuators on the robot to scale complexity. For the DXL actuator they create a reacher task and also a tracking task; tracking requires that the DXL precisely track a moving target. For the Create2 robot they test it in two ways: movement, where it needs to move forward as fast as possible in a closed arena, and docking, in which the task is to dock to a charging station attached to one of the walls within the arena.
  Algorithmic baselines: The researchers also use their benchmarking suite to compare multiple widely used AI algorithms against each other, including TRPO and PPO, DDPG, and Soft-Q. By using standard tasks it’s easier for the researchers to compare the effects of things like hyperparameter choices on different algorithms, and by having these tasks take place on real world robot platforms, it’s possible to get a sense of how well these algorithms deal with the numerous difficulties involved in reality.
  Drawbacks: One drawback of these tasks is that they’re very simple: OpenAI recently showed how to scale PPO to let us train a robot to perform robust dextrous manipulation of a couple of simple objects, which involved having to learn to control a five-digit robot hand; by comparison, these tasks involve robot platforms with a significantly smaller number of dimensions of movement, making the tasks significantly easier.
  Time and robots: One meta-drawback with projects like this is that they involve learning on the robot platform, rather than learning in a mixture of simulated and real environments – this makes everything take an extraordinarily long time. For this paper, the authors note that they “ran more than 450 independent experiments which took over 950 hours of robot usage in total”.
  Why it matters: For AI to substantively change the world it’ll need to be able to not just flip bits, but flip atoms as well. Today, some of that is occurring by connecting up AI-driven systems (for instance, product recommendation algorithms) to e-retail systems (eg Amazon), which let AI play a role in recommending courses of action to systems that ultimately go and move some mass around the world. I think for AI to become even more impactful we need to cut out the middle step and have AI move mass itself – so connecting AI to a system of sensors and actuators like a robot will eventually yield a direct-action platform for AI systems; my hypothesis is that this will dramatically increase the range of situations we can deploy learning algorithms into, and will thus hasten their development.
  Read more: Benchmarking Reinforcement-Learning Algorithms on Real-World Robots (Arxiv).

AI endowments at University College London and the University of Toronto:
…Appointments see industry giving back to the sector it is strip-mining (with the best intentions)…
  DeepMind is funding an AI professorship as well as two post-doctoral researchers and one PhD student at University College London. “We are delighted by this opportunity to further develop our relationship with DeepMind,” said John Shawe-Taylor, head of UCL’s Department of Computer Science.
  Uber is investing “more than $200 million” into Toronto and also its eponymous university. This investment is to fund self-driving car research at the University of Toronto, and for Uber to set up its first-ever engineering facility in Canada.
  Meanwhile, LinkedIn co-founder Reid Hoffman has gifted $2.45 million to the University of Toronto’s ‘iSchool’ to “establish a chair to study how the new era of artificial intelligence (AI) will affect our lives”.
  Why it matters: Industry is currently strip-mining academia for AI talent, constantly hiring experienced professors and post-docs (and some of the most talented PhD students), leading to a general brain drain from academia. Without action by industry like this to even the balance, there’s a risk of degrading AI education to the point that industry runs into problems.
  Read more: New DeepMind professorship at UCL to push frontiers of AI (UCL).
  Read more: LinkedIn founder Reid Hoffman makes record-breaking gift to U of T’s Faculty of Information for chair in AI (UofT News).

Learning the task is so last year. Now it’s all about learning the algorithm:
…Model-Based Meta-Policy-Optimization shows sample efficiency of meta-learning (if coaxed along with some clever human-based framing of the problem)…
Researchers with UC Berkeley, OpenAI, Preferred Networks, and Karlsruhe Institute of Technology (KIT) have developed model-based meta-policy-optimization (MB-MPO), a meta-learning technique that lets AI agents generalize to more unfamiliar contexts. “While traditional model-based RL methods rely on the learned dynamics models to be sufficiently accurate to enable learning a policy that also succeeds in the real world, we forego reliance on such accuracy,” the researchers write. “We are able to do so by learning an ensemble of dynamics models and framing the policy optimization step as a meta-learning problem. Meta-learning, in the context of RL, aims to learn a policy that adapts fast to new tasks or environments”. The technique builds upon model-agnostic meta-learning (MAML).
  How it works: MB-MPO collects data from the real world and uses it to fit an ensemble of dynamics models. It then treats each learned model as a distinct task, and uses MAML-style meta-learning to find a policy that can adapt to any one of these models after a single inner-loop policy gradient step. The adapted policies are used to gather more real-world data, the models are refit, and the process repeats; because the policy is trained to adapt quickly rather than to trust any single model, it does not need the models to be perfectly accurate.
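The meta-learning step can be illustrated with a deliberately tiny analog. In the sketch below, each "task" is a 1-D quadratic loss (θ − c)² standing in for one model in the ensemble; MB-MPO itself uses policy gradients over rollouts imagined in each model, which this toy replaces with closed-form gradients. The function names and step sizes are assumptions for illustration. The outer update differentiates through the inner adaptation step (the (1 − 2α) factor is that second-order term), which is what distinguishes MAML from simply averaging the tasks' own gradients.

```python
def meta_train(task_centers, alpha=0.1, beta=0.05, steps=500, theta=0.0):
    """Toy MAML on 1-D quadratic tasks L_i(theta) = (theta - c_i)**2.

    Inner loop: one gradient step per task (adaptation).
    Outer loop: move the shared initialization so the *adapted*
    parameters score well on every task.
    """
    for _ in range(steps):
        meta_grad = 0.0
        for c in task_centers:
            # Inner-loop adaptation: theta_i = theta - alpha * dL_i/dtheta.
            adapted = theta - alpha * 2.0 * (theta - c)
            # Gradient of (adapted - c)**2 w.r.t. theta, including
            # d(adapted)/d(theta) = (1 - 2 * alpha).
            meta_grad += 2.0 * (adapted - c) * (1.0 - 2.0 * alpha)
        theta -= beta * meta_grad / len(task_centers)
    return theta


def adapt(theta, c, alpha=0.1):
    """Single inner-loop gradient step toward a (possibly new) task."""
    return theta - alpha * 2.0 * (theta - c)


theta = meta_train([-1.0, 2.0, 5.0])
```

For quadratic tasks the best initialization works out to the mean of the task optima (2.0 here), and one `adapt` step from there moves toward whichever task is encountered, e.g. `adapt(2.0, 5.0)` gives 2.6.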
  Results: Using MB-MPO the researchers can “learn an optimal policy in high-dimensional and complex quadrupedal locomotion within two hours of real-world data. Note that the amount of data required to learn such policy using model-free methods is 10X – 100X higher, and, to the best knowledge of the authors, no prior model-based method has been able to attain the model-free performance in such tasks.” In tests on a variety of simulated robotic baselines the researchers show that “MB-MPO is able to match the asymptotic performance of model-free methods with two orders of magnitude less samples.” The algorithm also performs better than two model-based approaches it was compared against.
  Why it matters: Meta-learning is part of an evolution within AI of having researchers write fewer and fewer elements of a system. DeepMind’s David Silver has a nice summary of this from a recent presentation, where he describes the difference between deep learning and meta learning as the difference between learning features and predictions end-to-end, and learning the algorithm and features and predictions end-to-end.
  Read more: Model-Based Reinforcement Learning via Meta-Policy Optimization (Arxiv).
  Check out David Silver’s slides here: Principle 10, Learn to Learn (via Seb Ruder on Twitter).

People are pessimistic about automation and many expect their jobs to be automated:
…Large-scale multi-country Pew Research survey reveals deep, shared anxieties around AI and automation…
A majority of people in ten countries think that it is probable that within 50 years computers will do much of the work currently done by humans, according to a large-scale survey conducted by Pew to assess attitudes towards automation. Across the surveyed countries, a majority of respondents think that if computers end up doing much of the work that is today done by humans, then:
– People will have a hard time finding jobs.
– The inequality between the rich and poor will be much worse than it is today.
  Minority views: A minority of those surveyed think the above occurrence would lead to “new, better paying jobs”, and a minority (except in Poland, Japan, and Hungary) believe this would make the economy more efficient.
  Notable data: There are some pretty remarkable differences in outlook between countries in the survey; 15% of surveyed Americans think robots and computers will “definitely” do much of the work currently done by humans within fifty years, compared to 52% of Greeks.
  Data quirk: The data for this survey is split across two time periods: the US was surveyed in 2015, while the other nine countries were surveyed between mid-May and mid-August of 2018, so it’s possible the American results may have changed since then.
  Read more: In Advanced and Emerging Economies Alike, Worries about Job Automation (Pew Research Center).

Chinese President says AI’s power should motivate international collaboration:
…Xi Jinping, the President of the People’s Republic of China, says AI has high stakes at opening of technology conference…
Chinese President Xi Jinping has said in a letter that AI’s power should motivate international collaboration. “To seize the development opportunity of AI and cope with new issues in fields including law, security, employment, ethics and governance, it requires deeper international cooperation and discussion,” said Xi in the letter, according to official state news service Xinhua.
  Read more: China willing to share opportunities in digital economy: Xi (Xinhua).

Tencent researchers take-on simplified StarCraft2 and beat all levels of the in-game AI:
…A few handwritten heuristics go a long way…
Researchers with Tencent have trained an agent to beat the in-game AI at StarCraft 2, a complex real-time strategy game. StarCraft is a game with a long history within AI research – one of the longer-running game AI competitions has been based around StarCraft – and has recently been used by Facebook and DeepMind as a testbed for reinforcement learning algorithms.
  What they did: The researchers developed two AI agents, TSTARBOT1 and TSTARBOT2, both of which were able to successfully beat all ten difficulty levels of the in-game AI within SC2 when playing a restricted 1vs1 map (Zerg-v-Zerg, AbyssalReef). This achievement is somewhat significant given that “level 8, level 9, and level 10 are cheating agents with full vision on the whole map, with resource harvest boosting”, and that according to some players the “level 10 built-in AI is estimated to be… equivalent to top 50% – 30% human players”.
  How they did it: First, the researchers forked and modified the PySC2 software environment to make greater game state information available to the AI agents, such as information about the location of all units at any point during the game. They also add in some rule-based systems, like building a specific technology tree that telegraphs the precise dependencies of each technology to the AI agents. They then develop two different bots to play the game, which have different attributes: TSTARBOT1 is “based on deep reinforcement learning over flat actions”, and TSTARBOT2 is “based on rule controllers over hierarchical actions”.
  How they did it: TSTARBOT1: This bot uses 165 distinct hand-written macro actions to help it play the game. These actions include things like “produce drone”, “build roach warren”, “upgrade tech A”, as well as various combat actions. The purpose of these macros is to bundle together the discrete actions that need to be taken to achieve things (eg, to build something, you need to move the camera, select a worker, select a point on the screen, place the building, etc) so that the AI doesn’t need to learn these sequences itself. This means that some chunks of the bot are rule-based rather than learned (similar to the 2017 1v1 version of OpenAI’s Dota bot). Though this design hides some of the sophistication of the game, this is somewhat offset by the researchers’ use of a sparse reward structure which only delivers a reward to the agent (1 for a win, 0 for a tie, -1 for a loss) at the end of the game. They test this approach by training it with two core reinforcement learning algorithms: Proximal Policy Optimization and Dueling Double Deep Q-Learning.
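A minimal sketch of the two design choices above: a macro table that expands one high-level choice into the primitive steps the agent no longer has to learn, and a reward that is zero everywhere except the final step. The macro names and primitive action strings are hypothetical illustrations drawn from the examples in the text, not the PySC2 API or the paper's actual 165-action table.

```python
# Hypothetical macro table; action names are illustrative, not the PySC2 API.
MACROS = {
    "build_roach_warren": [
        "move_camera", "select_worker", "select_screen_point", "place_building",
    ],
    "produce_drone": ["select_larva", "train_drone"],
}


def expand_macro(name):
    """Expand one high-level macro into the primitive actions it bundles."""
    return list(MACROS[name])


def terminal_reward(outcome):
    """Sparse game outcome reward: 1 for a win, 0 for a tie, -1 for a loss."""
    return {"win": 1, "tie": 0, "loss": -1}[outcome]


def episode_rewards(num_steps, outcome):
    """Every step gets 0 except the last, which carries the game result."""
    if num_steps < 1:
        raise ValueError("an episode needs at least one step")
    rewards = [0] * num_steps
    rewards[-1] = terminal_reward(outcome)
    return rewards
```

The learner then only chooses among macro names, while credit for those choices has to be propagated back from a single end-of-game signal.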
  How they did it: TSTARBOT2: This bot extends the work in the original one by creating a hierarchy of two types of actions: macro actions and micro actions. By implementing a hierarchy the researchers make it easier for RL algorithms to discover the appropriate actions to take at different points in time. This hierarchy is further defined via the creation of specific modules, like ones for combat or production, which themselves contain additional sub-modules with sub-behaviors.
  Results: The researchers show that TSTARBOT1 can consistently beat levels 1-4 of the in-game AI when using PPO (this drops slightly for DDQN), then has ~99% success on levels 5-8, then ~97% success on level 9, and ~81% success on level 10. TSTARBOT2, by comparison, surpasses these scores, obtaining a win rate of 90% against the L10 AI. They also carried out some qualitative tests against humans and found that their systems were able to win some games against human players, but not convincingly.
  Scale: The distributed system used for this research consisted of a single GPU and approximately 3,000 CPUs spread across roughly 80 machines, demonstrating the significant amounts of hardware required to carry out AI research on environments such as this.
  Why it matters: Existing reinforcement learning benchmarks like the Atari corpus are too easy for many algorithms, with modern systems typically able to beat the majority of games in this benchmark. Newer environments, like Dota2 and StarCraft 2, scale up the complexity enough to challenge the capabilities of contemporary algorithms. This research, given all the hand-tuning and rule-based systems required to let the bots learn enough to play at all, shows that SC2 may be too hard for today’s existing algorithms without significant modifications, further motivating research into newer systems.
  Read more: TStarBots: Defeating the Cheating Level Builtin AI in StarCraft II in the Full Game (Arxiv).

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:

AI leads to a more multipolar world, says political science professor:
Michael Horowitz, a professor of political science and associate director of Perry World House at the University of Pennsylvania, argues that AI could favor smaller countries in contrast to the technological developments that have made the US and China the world’s superpowers. Military uses of AI could allow countries to catch up with the US and China, he says, citing the lower barriers to building military AI systems vs traditional military hardware, such as fighter jets.
  Why it matters: An AI arms race is a bad outcome for the world, insofar as it encourages countries to prioritize capabilities over safety and robustness. It’s unclear whether a race between many parties would be better than a classic arms race. I’m not convinced of Horowitz’s assessment that the US and China are likely to be overtaken by smaller countries. While AI is certainly different to traditional military systems, the world’s military superpowers have both the resources and incentives to seek to sustain their lead.
  Read more: The Algorithms of August (Foreign Policy).

Tech Tales:

And so we all search for signals given to us by strange machines, hunting rats between buildings, searching for nests underground, operating via inferred maps and the beliefs of something we have built but do not know.

The rats are happy today. I know this because the machine told me. It detected them waking up and, instead of emerging into the city streets, going to a cavern in their underground lair where – it predicts with 85% confidence – they proceeded to copulate and therefore produce more rats. The young rats are believed – 75% confidence – to feed on a mixture of mother-rat-milk, along with pizza and vegetables stolen from the city they live beneath. Tomorrow the rats will emerge (95%) and the likelihood of electrical outages from chewed cables will increase (+10%) as well as the need to contract more street cleaning to deal with their bodies (20%).

One day we’ll go down there, to those rat warrens that the machine has predicted must exist, and we will see what they truly do. But for now we operate our civil services on predictions made by our automated AI systems. There is an apocryphal story we tell of civil workers being led to caverns that contain only particularly large clumps of mold (rat lair likelihood prediction: 70%) or to urban-river-banks that contain a mound of skeletons, gleaming under moonlight (rat breeding ground: 60%; in fact, a place of mourning). But there are also stories of people going to a particular shuttle on a rarely-used steamroller and finding a rat nest (prediction: 80%) and of people going to the roof of one of the tallest buildings in the city and finding there a rat boneyard (prediction: 90%).

Because of the machine’s overall efficiency there are calls for it to be rolled out more widely. We are currently considering adding in other urban vermin, like pigeons and raccoons and, at the coasts, seabirds. But what I worry about is when they might turn such a system on humans. What does AI-augmented human management look like? What might it predict about us?

Things that inspired this story: Rats, social control via AI, glass cages, reinforcement learning, RL-societies, adaptive bureaucracy.

Import AI 112: 1 million free furniture models for AI research, measuring neural net drawbacks via studying hallucinations, and DeepMind boosts transfer learning with PopArt

When is a door not a door? When a computer says it is a jar!
Researchers analyze neural network “hallucinations” to create more robust systems…
Researchers with the University of California at Berkeley and Boston University have devised a new way to measure how neural networks sometimes generate ‘hallucinations’ when attempting to caption images. “Image captioning models often “hallucinate” objects that may appear in a given context, like e.g. a bench here.” Developing a better understanding of why such hallucinations occur – and how to prevent them occurring – is crucial to the development of more robust and widely used AI systems.
  Measuring hallucinations: The researchers propose ‘CHAIR’ (Caption Hallucination Assessment with Image Relevance) as a way to assess how well systems generate captions in response to images. CHAIR measures what proportion of the objects mentioned in generated captions do not actually appear in the image, according to the ground truth sentences and the output of object segmentation and labelling algorithms. So, for example, in a picture of a small puppy in a basket, a system would be penalized for the caption “a small puppy in a basket with cats”, but not for “a small puppy in a basket”. In evaluations they find that on one test set “anywhere between 7.4% and 17.5% include a hallucinated object”.
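The paper reports CHAIR at two granularities: a per-object score (CHAIR_i, the fraction of mentioned objects that are hallucinated) and a per-caption score (CHAIR_s, the fraction of captions containing at least one hallucinated object); the “include a hallucinated object” figure above reads like the per-caption kind. Below is a minimal sketch of both, assuming objects are found by exact word match against a tiny hypothetical vocabulary (the real metric uses synonym lists and dataset annotations), applied to the puppy example.

```python
# Minimal CHAIR-style sketch. OBJECT_VOCAB and exact word matching are
# simplifying assumptions; the real metric maps synonyms to object categories.
OBJECT_VOCAB = {"puppy", "basket", "cat", "cats", "bench", "table"}


def mentioned_objects(caption):
    """Return the vocabulary objects that appear in a caption (exact word match)."""
    words = caption.lower().replace(".", " ").split()
    return {w for w in words if w in OBJECT_VOCAB}


def chair_scores(captions, ground_truth):
    """Compute (CHAIR_i, CHAIR_s) for parallel lists of captions and true object sets."""
    mentioned_total = 0
    hallucinated_total = 0
    captions_with_hallucination = 0
    for caption, truth in zip(captions, ground_truth):
        objects = mentioned_objects(caption)
        hallucinated = objects - truth  # mentioned but not actually in the image
        mentioned_total += len(objects)
        hallucinated_total += len(hallucinated)
        if hallucinated:
            captions_with_hallucination += 1
    chair_i = hallucinated_total / mentioned_total if mentioned_total else 0.0
    chair_s = captions_with_hallucination / len(captions) if captions else 0.0
    return chair_i, chair_s


truth = {"puppy", "basket"}
captions = ["a small puppy in a basket with cats", "a small puppy in a basket"]
print(chair_scores(captions, [truth, truth]))  # one of five objects, one of two captions
```

Here “cats” is the only hallucinated object, so CHAIR_i is 1/5 and CHAIR_s is 1/2.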
  Strange correlations: Analyzing what causes these hallucinations is difficult. For instance, the researchers note that “we find no obvious correlation between the average length of the generated captions and the hallucination rate”. There is some more correlation among hallucinated objects, though. “Across all models the super-category Furniture is hallucinated most often, accounting for 20-50% of all hallucinated objects. Other common super-categories are Outdoor objects, Sports and Kitchenware,” they write. “The dining table is the most frequently hallucinated object across all models”.
  Why it matters: If we are going to deploy lots of neural network-based systems into society then it is crucial that we understand the weaknesses and pathologies of such systems; analyses like this give us a clearer notion of the limits of today’s technology and also indicate lines of research people could pursue to increase the robustness of such systems. “We argue that the design and training of captioning models should be guided not only by cross-entropy loss or standard sentence metrics, but also by image relevance,” the researchers write.
  Read more: Object Hallucination in Image Captioning (Arxiv).

Humans! What are they good for? Absolutely… something!?
…Advanced cognitive skills? Good. Psycho-motor skills? You may want to retrain…
Michael Osborne, co-director of the Oxford Martin Programme on Technology and Unemployment, has given a presentation about the Future of Work. Osborne attained some level of notoriety within ML a while ago for publishing a study that said 47% of jobs could be at risk of automation. Since then he has been further fleshing out his ideas; in a new presentation he analyzes some typical occupations in the UK and tries to estimate the probability of increased future demand for each role. The findings aren’t encouraging: Osborne’s method predicts a low probability of increased demand for truck drivers in the UK, but a much higher probability for waiters and waitresses.
  What skills should you learn: If you want to fare well in an AI-first economy, then you should invest in advanced cognitive skills such as judgement and decision making, systems evaluation, deductive reasoning, and so on. The sorts of skills which will be of less importance over time (for humans, at least) will be ‘psycho-motor’ skills: control precision, manual dexterity, night vision, sound localization, and so on. (A meta-problem here is that many of the people in jobs that demand psycho-motor skills don’t get the opportunity to develop the advanced cognitive skills that it is thought the future economy will demand.)
  Why it matters: Analyzing how AI will and won’t change employment is crucial work whose findings will shape the policy of many governments. The problem being surfaced by researchers such as Osborne is that the rapidity of AI’s progress, combined with its tendency to automate an increasingly broad range of tasks, threatens traditional notions of employment. What kind of future do we want?
  Read more: Technology at Work: The Future of Automation (Google Slide presentation).

What’s cooler than 1,000 furniture models? 1 million ones. And more, in InteriorNet:
…Massive new dataset gives researchers new benchmark to test systems against…
Researchers with Imperial College London and Chinese furnishing-VR startup Kujiale have released InteriorNet, a large-scale dataset of photo-realistic images of complex, realistic interiors. InteriorNet contains around 1 million CAD models of different types of furniture and furnishing, which over 1,100 professional designers have subsequently used to create around 22 million room layouts. Each of these scenes can also be viewed under a variety of different lighting conditions and contexts thanks to an inbuilt simulator called ViSim, which the researchers have released alongside the dataset. Judged purely on its furniture contents, this is one of the largest datasets I am aware of for 3D scene composition and understanding.
  Things that make you go ‘hmm’: In the acknowledgements section of the InteriorNet website the researchers thank Kujiale not only for providing the furniture models but also for access to “GPU/CPU clusters”. Could this be a pattern for future private-public collaborations, where along with sharing expertise and financial resources the private sector also shares compute? That would make sense given the ballooning computational demands of many new AI techniques.
  Read more: InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset (website).

Lockheed Martin launches ‘AlphaPilot’ competition:
…Want better drones but not sure exactly what to build? Host a competition!…
Aerospace and defense company Lockheed Martin wants to create smarter drones, so the company is hosting a competition, in collaboration with the Drone Racing League and NVIDIA, to create drones with enough intelligence to race through professional drone racing courses without human intervention.
  Prizes: Lockheed says the competition will “award more than $2,000,000 in prizes for its top performers”.
  Why it matters: Drones are already changing the character of warfare by virtue of their asymmetry: a fleet of drones costing a few thousand dollars apiece can pose a robust threat to things that cost tens of millions (planes) to hundreds of millions (naval ships, military bases) to billions of dollars (aircraft carriers, etc). Once we add greater autonomy to such systems they will pose an even greater threat, further influencing how different nations budget for their military R&D, and potentially altering investment into AI research.
  Read more: AlphaPilot (Lockheed Martin official website).

Could Space Fortress be 2018’s Montezuma’s Revenge?
…Another ancient game gets resuscitated to challenge contemporary AI algorithms…
Another week brings another potential benchmark to test AI algorithms’ performance against. This week, researchers with Carnegie Mellon University have made the case for using a late-1980s game called ‘Space Fortress’ to evaluate new algorithms. Their motivation for this is twofold: 1) Space Fortress is currently unsolved via mainstream RL algorithms such as Rainbow, PPO, and A2C, and 2) Space Fortress was developed by a psychologist to study human skill acquisition, so we have good data to compare AI performance to.
  So, what is Space Fortress: Space Fortress is a game where a player flies around an arena shooting missiles at a fortress in the center. However, the game adds some confounding factors: the fortress is only intermittently attackable. While the fortress is invulnerable, the player must space their shots more than 250ms apart; once they have landed ten of these well-spaced shots the fortress becomes vulnerable, at which point the player needs to destroy it with two shots fired less than 250ms apart. This makes for a challenging environment for traditional AI algorithms because “the firing strategy completely reverses at the point when vulnerability reaches 10, and the agent must learn to identify this critical point to perform well,” they explain.
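The firing rule is easiest to see as a tiny state machine. The sketch below is a hypothetical, heavily simplified model of the rule as described above (ship movement, fortress counter-fire, and scoring are all left out, and the class and event names are our own), not the researchers’ environment code.

```python
# Hypothetical simplified model of the Space Fortress firing rule.
INTERVAL_MS = 250
VULNERABILITY_THRESHOLD = 10


class Fortress:
    def __init__(self):
        self.vulnerability = 0
        self.last_hit_ms = None
        self.destroyed = False

    def hit(self, t_ms):
        """Register a missile hit at time t_ms and return the resulting event."""
        gap = None if self.last_hit_ms is None else t_ms - self.last_hit_ms
        self.last_hit_ms = t_ms
        if self.vulnerability < VULNERABILITY_THRESHOLD:
            # Invulnerable phase: firing too quickly resets the counter.
            if gap is not None and gap <= INTERVAL_MS:
                self.vulnerability = 0
                return "reset"
            self.vulnerability += 1
            return "vulnerability_up"
        # Vulnerable phase: the strategy reverses, and a quick double shot wins.
        if gap is not None and gap < INTERVAL_MS:
            self.destroyed = True
            return "destroyed"
        return "waiting_for_double_shot"


fortress = Fortress()
events = [fortress.hit(t) for t in range(0, 3000, 300)]  # ten well-spaced hits
print(events[-1], fortress.vulnerability)
print(fortress.hit(3000), fortress.hit(3100))  # slow shot, then a quick follow-up
```

The same action, firing quickly, is punished in one phase and required in the other, which is exactly what makes the environment hard for agents that cannot track the vulnerability counter.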
  Two variants: While developing their benchmarks the researchers created a simplified version of the game called ‘Autoturn’, which automatically orients the ship towards the fortress. The harder environment (the unmodified original game) is referred to as Youturn.
  Send in the humans: 117 people played 20 games of Space Fortress (52 on Autoturn, 65 on Youturn). The best-performing people got scores of 3,000 and 2,314 on Autoturn and Youturn, respectively, and the average score across all human entrants was 1,810 for Autoturn and -169 for Youturn.
  Send in the (broken) RL algorithms: sparse rewards: Today’s RL algorithms fare very poorly in this environment when given sparse rewards. PPO, the best-performing tested algorithm, gets an average score of -250 on Autoturn and -5,269 on Youturn, with A2C performing marginally worse. Rainbow, a complex algorithm that combines a range of improvements to the DQN algorithm and currently gets high scores across Atari and DM Lab environments, gets very poor results here, with an average score of -8,327 on Autoturn and -9,378 on Youturn.
  Send in the (broken) RL algorithms: dense rewards: The algorithms fare a little better when given dense rewards, which provide a reward for each hit on the fortress and a penalty if the fortress is reset because the player fired too rapidly. This modification gives Space Fortress a reward density comparable to Atari games. With dense rewards, PPO obtains average scores of -1,294 (Autoturn) and -1,435 (Youturn).
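To make “reward density” concrete, here is a hypothetical sketch that scores the same sequence of game events under a sparse scheme (reward only on destroying the fortress) and a dense scheme (plus a reward per hit and a penalty per reset). The event names and all reward magnitudes are illustrative assumptions, not the values the paper uses, and the real sparse setting follows the game’s own score rather than destruction alone.

```python
# Hypothetical reward values for illustration only; not taken from the paper.
HIT_REWARD = 1.0        # dense only: each hit that raises vulnerability
RESET_PENALTY = -1.0    # dense only: firing too fast resets vulnerability
DESTROY_REWARD = 10.0   # both schemes: destroying the fortress


def episode_return(events, dense):
    """Sum rewards over a sequence of game events under a sparse or dense scheme."""
    total = 0.0
    for event in events:
        if event == "destroyed":
            total += DESTROY_REWARD
        elif dense and event == "hit":
            total += HIT_REWARD
        elif dense and event == "reset":
            total += RESET_PENALTY
    return total


events = ["hit", "hit", "hit", "reset", "hit"]
print(episode_return(events, dense=False), episode_return(events, dense=True))
```

An episode that never destroys the fortress earns nothing under the sparse scheme, so the agent gets no hint about which of its hits were useful; the dense scheme hands out feedback on almost every step.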
  Send in the (broken) RL algorithms: dense rewards + ‘context identification’: The researchers further change the dense reward structure to help the agent identify when the fortress switches vulnerability state, and when it is destroyed. This lets them train PPO to obtain average scores of around 2,000; a substantial improvement, but still short of the best human players.
  Why it matters: One of the slightly strange things about contemporary AI research is how closely advances seem to be coupled with data and/or environments: new data and environments highlight the weaknesses of existing algorithms, which provokes further development. Platforms like Space Fortress will give researchers a low-cost testing environment to explore algorithms that are able to learn to model events over time and detect correlations and larger patterns, an area critical to the development of more capable AI systems. The researchers have released Space Fortress as an OpenAI Gym environment, making it easier for other people to work with it.
  Read more: Challenges of Context and Time in Reinforcement Learning: Introducing Space Fortress as a Benchmark (Arxiv).

Venture Capitalists bet on simulators for self-driving cars:
…Applied Intuition builds simulators for self-driving brains….
Applied Intuition, a company building simulators for self-driving cars, has uncloaked with $11.5 million in funding. The fact that venture capitalists are betting on it is notable, as it indicates how strategic data has become for certain bits of AI, and shows that investors are realizing that instead of betting on data directly they can bet on simulators and thus trade compute for data. Applied Intuition is a good example of this as it lets companies rent an extensible simulator which they can use to generate large amounts of data to train self-driving cars with.
  Read more: Applied Intuition – Advanced simulation software for autonomous vehicles (Medium).

DeepMind improves transfer learning with PopArt:
…Rescaling rewards lets you learn interesting behaviors and preserves meaningful game state information…
DeepMind researchers have developed a technique to improve transfer learning, demonstrating state-of-the-art performance on Atari. The technique, Preserving Outputs Precisely while Adaptively Rescaling Targets (PopArt), works by adaptively normalizing the learning targets derived from each environment’s rewards so that they sit on a comparable scale. Using PopArt, an agent would place similar importance on, say, crossing the road in the game ‘Frogger’ and eating all the ghosts in Ms. Pac-Man, even though these important activities earn very different rewards in their respective games.
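The “preserving outputs precisely” half of the name refers to a specific trick: whenever the running mean and scale of the targets change, the final linear layer is rescaled so that the unnormalized value prediction stays exactly the same. Below is a minimal single-task NumPy sketch of that update, assuming a linear value head and exponential moving averages with step size `beta`; the class name and defaults are our own, and the multi-task agent keeps one set of statistics per game rather than this single set.

```python
import numpy as np


class PopArtHead:
    """Linear value head with PopArt-style adaptive target normalisation (sketch)."""

    def __init__(self, n_features, beta=3e-4):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.mu = 0.0   # running mean of targets
        self.nu = 1.0   # running second moment of targets
        self.beta = beta

    @property
    def sigma(self):
        # Standard deviation from the moments, clamped away from zero.
        return float(np.sqrt(max(self.nu - self.mu ** 2, 1e-8)))

    def normalized_value(self, x):
        return float(self.w @ x + self.b)

    def value(self, x):
        # Unnormalised prediction: sigma * normalised output + mu.
        return self.sigma * self.normalized_value(x) + self.mu

    def update_stats(self, target):
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
        new_sigma = self.sigma
        # Rescale the head so value(x) is unchanged by the new statistics.
        self.w *= old_sigma / new_sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma

    def normalized_target(self, target):
        # What the head is actually trained to regress towards.
        return (target - self.mu) / self.sigma


rng = np.random.default_rng(0)
head = PopArtHead(4, beta=0.5)
head.w, head.b = rng.normal(size=4), 0.3
x = rng.normal(size=4)
before = head.value(x)
head.update_stats(1000.0)  # a huge return, as in a high-scoring game
print(before, head.value(x), head.normalized_target(1000.0))
```

The prediction is identical before and after the statistics jump, while the regression target the network sees stays close to unit scale, which is what keeps learning stable across games with very different score ranges.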
  With PopArt, researchers can now automatically “adapt the scale of scores in each game so the agent judges the games to be of equal learning value, no matter the scale of rewards available in each specific game,” DeepMind writes. This differs from reward clipping, the common practice of squashing rewards to between -1 and +1. “With clipped rewards, there is no apparent difference for the agent between eating a pellet or eating a ghost and results in agents that only eat pellets, and never bothers to chase ghosts, as this video shows. When we remove reward clipping and use PopArt’s adaptive normalisation to stabilise learning, it results in quite different behaviour, with the agent chasing ghosts, and achieving a higher score, as shown in this video,” they explain.
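The pellet-versus-ghost problem falls straight out of the arithmetic. The sketch below assumes Ms. Pac-Man point values of 10 for a pellet and 200 for the first ghost eaten; clipping maps both to 1.0, while dividing by a per-game scale keeps their 20:1 ratio. This is a simplification: PopArt adapts the scale of value targets online rather than dividing rewards by a fixed constant.

```python
import numpy as np

# Assumed Ms. Pac-Man point values: 10 for a pellet, 200 for the first ghost eaten.
PELLET, GHOST = 10.0, 200.0


def clip_rewards(rewards):
    """Atari-style reward clipping: squash every reward into [-1, 1]."""
    return np.clip(np.asarray(rewards, dtype=float), -1.0, 1.0)


def rescale_rewards(rewards, scale):
    """Divide by a per-game scale instead: magnitudes shrink, relative sizes survive."""
    return np.asarray(rewards, dtype=float) / scale


clipped = clip_rewards([PELLET, GHOST])
rescaled = rescale_rewards([PELLET, GHOST], scale=100.0)
print(clipped)                    # the agent can no longer tell the two apart
print(rescaled[1] / rescaled[0])  # the ghost is still worth about 20 pellets
```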
  Results: To test their approach the researchers evaluate the effect of applying PopArt to ‘IMPALA’ agents, which are among the most popular algorithms currently being used at DeepMind. PopArt-IMPALA systems obtain roughly 101% of human performance as an average across all 57 Atari games, compared to 28.5% for IMPALA on its own. Performance also improves significantly on DeepMind Lab-30, a collection of 30 3D environments based on the Quake 3 engine.
  Why it matters: Reinforcement learning research has benefited from the development of increasingly efficient algorithms and training methods. Techniques like PopArt should benefit research into transfer learning via RL, as they give us new generic ways to increase the amount of experience agents can accrue across different environments. That, in turn, should clarify the limits of simple transfer techniques and help researchers identify where new algorithmic techniques are needed.
  Read more: Multi-task Deep Reinforcement Learning with PopArt (Arxiv).
  Read more: Preserving Outputs Precisely while Adaptively Rescaling Targets (DeepMind blog).

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:

Resignations over Google’s China plans:
A senior research scientist at Google has publicly resigned in protest at the company’s planned re-entry into China (code-named Dragonfly), and reports that he is one of five to do so. Google is currently developing a search engine compliant with Chinese government censorship, according to numerous reports first sparked by a story in The Intercept.
  The AI principles: The scientist claims the alleged plans violate Google’s AI principles, announced in June, which include a pledge not to design or deploy technologies “whose purpose contravenes widely accepted principles of international law and human rights.” Without knowing more about the plans, it is hard to judge whether they contravene the carefully worded principles. Nonetheless, the relevant question for many will be whether they violate the standards tech giants should hold themselves to.
  Why it matters: This is the first public test of Google’s AI principles, and could have lasting effects both on how tech giants operate in China, and how they approach public ethical commitments. The principles were first announced in response to internal protests over Project Maven. If they are seen as having been flouted so soon after, this could prompt a serious loss of faith in Google’s ethical commitments going forward.
  Read more: Senior Google scientist resigns (The Intercept).
  Read more: AI at Google: Our principles (Official Google blog).

Google announces inclusive image recognition challenge:
Large image datasets, such as ImageNet, have been an important driver of progress in computer vision in recent years. These databases exhibit biases along multiple dimensions, though, which can easily be inherited by models trained on them. For example, the post shows a classifier failing to identify a wedding photo in which the couple are not wearing European wedding attire.
  Addressing geographic bias: Google AI have announced an image recognition challenge to spur progress in addressing these biases. Participants will use a standard dataset (i.e. skewed towards images from Europe and North America) to train models that will be evaluated using image sets covering different, unspecified, geographic regions – Google describes this as a geographic “stress test”. This will challenge developers to build inclusive models from skewed datasets. “this competition challenges you to use Open Images, a large, multi-label, publicly-available image classification dataset that is majority-sampled from North America and Europe, to train a model that will be evaluated on images collected from a different set of geographic regions across the globe,” Google says.
  Why it matters: For the benefits of AI to be broadly distributed amongst humanity, it is important that AI systems can be equally well deployed across the world. Racial bias in face recognition has received particular attention recently, given that these technologies are being deployed by law enforcement, raising immediate risks of harm. This project has a wider scope than face recognition, challenging classifiers to identify a diverse range of faces, objects, buildings etc.
  Read more: Introducing the inclusive images competition (Google AI blog).
  Read more: No classification without representation (Google).

DARPA announces $2bn AI investment plan:
DARPA, the US military’s advanced technology agency, has announced ‘AI Next’, a $2bn multi-year investment plan. The project has an ambitious remit, to “explore how machine can acquire human-like communication and reasoning capabilities”, with a goal of developing systems that “function more as colleagues than as tools.”
  Safety as a focus: Alongside their straightforward technical goals, they identify robustness and addressing adversarial examples as two of five core focuses. This is an important inclusion, signalling DARPA’s commitment to leading on safety as well as capabilities.
  Why it matters: DARPA has historically been one of the most important players in AI development. Despite the US still not having a coordinated national AI strategy, the DoD is such a significant spender in its own right that it is nonetheless beginning to form its own quasi-national AI strategy. The inclusion of research agendas in safety is a positive development. This investment likely represents a material uptick in funding for safety research.
  Read more: AI Next Campaign (DARPA).
  Read more: DARPA announces $2bn campaign to develop next wave of AI technologies (DARPA).

OpenAI Bits & Pieces:

OpenAI Scholars Class of 18: Final Projects:
Find out about the final projects of the first cohort of OpenAI Scholars and apply to attend a demo day in San Francisco to meet the Scholars and hear about their work – all welcome!
  Read more: OpenAI Scholars Class of ’18: Final Projects (OpenAI Blog).

Tech Tales:

All A-OK Down There On The “Best Wishes And Hope You’re Well” Farm

You could hear the group of pensioners before you saw them; first, you’d tilt your head as though tuning into the faint sound of a mosquito, then it would grow louder and you would cast your eyes up and look for beetles in the air, then louder still and you would crane your head back and look at the sky in search of low-flying planes: nothing. Perhaps then you would look to the horizon and make out a part of it alive with movement – with tremors at the limits of your vision. These tremors would resolve over the next few seconds, sharpening into the outlines of a flock of drones and, below them, the old people themselves – sometimes walking, sometimes on Segways, sometimes carried in robotic wheelbarrows if truly infirm.

Like this, the crowd would come towards you. Eventually you could make out the sound of speech through the hum of the drones: “oh very nice”, “yes they came to visit us last year and it was lovely”, “oh he is good you should see him about your back, magic hands!”.

Then they would be upon you, asking for directions, inviting you over for supper, running old hands over the fabric of your clothing and asking you where you got it from, and so on. You would stand and smile and not say much. Some of the old people would hold you longer than the others. Some of them would cry. One of them would say “I miss you”. Another would say “he was such a lovely young man. What a shame.”

Then the sounds would change and the drones would begin to fly somewhere else, and the old people would follow them, and then again they would leave and you would be left: not quite a statue, but not quite alive, just another partially-preserved consciousness attached to a realistic AccompanyMe ‘death body’, kept around to reassure the ones who outlived you, unable to truly die till they die because, according to the ‘ethical senescence’ laws, your threshold consciousness is sufficient to potentially aid with the warding off of Alzheimer’s and other diseases of the aged. Now you think of the old people as reverse vultures: gathering around and devouring the living, and departing at the moment of true death.

Things that inspired this story: Demographic timebombs, intergenerational theft (see: Climate Change, Education, Real Estate), old people that vote and young people that don’t.

Import AI 111: Hacking computers with Generative Adversarial Networks, Facebook trains world-class speech translation in 85 minutes via 128 GPUs, and Europeans use AI to classify 1,000-year-old graffiti.

Blending reality with simulation:
…Gibson environment trains robots with systems and embodiment designed to better map to real world data…
Researchers with Stanford University and the University of California at Berkeley have created Gibson, an environment for teaching agents to learn to navigate spaces. Gibson is one of numerous navigation environments available to modern researchers and its distinguishing characteristics include: basing the environments on real spaces, and some clever rendering techniques to ensure that images seen by agents within Gibson more closely match real world images by “embedding a mechanism to dissolve differences between Gibson’s renderings and what a real camera would produce”.
  Scale: “Gibson is based on virtualizing real spaces, rather than using artificially designed ones, and currently includes over 1400 floor spaces from 572 full buildings,” they write. The researchers also compare the total size of the Gibson dataset to other large-scale environment datasets including ‘SUNCG’ and Matterport3D, showing that Gibson has reasonable navigation complexity and a lower real-world transfer error than other systems.
  Data gathering: The researchers use a variety of different scanning devices to gather the data for Gibson, including NavVis, Matterport, and Dotproduct.
  Experiments: So how useful is Gibson? The researchers perform several experiments to evaluate its effectiveness. These include experiments around local planning and obstacle avoidance; distant visual navigation; and climbing stairs, as well as transfer learning experiments that measure the depth estimation and scene classification capabilities of the system.
  Limitations: Gibson has a few limitations, including a lack of support for dynamic content (such as other moving objects) and no support for manipulating the surrounding environment. Future work will test whether Gibson also works with physical robots.
  Read more: Gibson Env: Real-World Perception for Embodied Agents (Arxiv).
  Find out more: Gibson official website.
  Gibson on GitHub.

Get ready for medieval graffiti:
…4,000 images of graffiti, some of it more than a thousand years old, from an Eastern European church…
Researchers with the National Technical University of Ukraine have created a dataset of images of medieval graffiti written in two alphabets (Glagolitic and Cyrillic) on the St. Sophia Cathedral of Kiev in Ukraine, giving researchers data they can use to train and develop supervised and unsupervised classification and generation systems.
  Dataset: The researchers created a dataset of Carved Glagolitic and Cyrillic letters (CGCL), consisting of more than 4,000 images of 34 types of letters.
  Why it matters: One of the more remarkable aspects of basic supervised learning is that given sufficient data it becomes relatively easy to automate the perception of something in the world – further digitization of datasets like these increases the likelihood that in the future we’ll use drones or robots to automatically scan ancient buildings across the world, identifying and transcribing thoughts inscribed hundreds or thousands of years ago. Graffiti never dies!
  Read more: Open Source Dataset and Machine Learning Techniques for Automatic Recognition of Historical Graffiti (Arxiv).

Learning to create (convincing) fraudulent network traffic with Generative Adversarial Networks:
…Researchers simulate traffic against a variety of (simple) intrusion detection algorithms; IDSGAN succeeds in fooling them…
Researchers with Shanghai Jiao Tong University and the Shanghai Key Laboratory of Integrated Administration Technologies for Information Security have used generative adversarial networks to create malicious network traffic that can evade some intrusion detection systems. Their technique, IDSGAN, is based on the Wasserstein GAN: it trains a generator to create adversarial malicious traffic and trains a discriminator to assist a black-box intrusion detection system in classifying this traffic as benign or malicious.
  “The goal of the model is to implement IDSGAN to generate malicious traffic examples which can deceive and evade the detection of the defense systems,” they explain.
  Testing: To test their approach the researchers use NSL-KDD, a dataset containing normal internet traffic as well as four categories of malicious traffic: probing, denial of service, user to root, and remote to local. They also use a variety of different algorithms to play the role of the intrusion detection system, including approaches based on support vector machines, naive Bayes, multi-layer perceptrons, logistic regression, decision trees, random forests, and k-nearest neighbors. Tests show that IDSGAN leads to a significant drop in detection rates: for denial-of-service traffic, for instance, detection rates drop from around 70-80% to around 3-8% across the entire suite of methods.
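The headline metric here is the detection rate: of the samples that really are malicious, what fraction does the intrusion detection system flag? A minimal sketch follows, with toy labels and predictions that are invented for illustration and are not results from the paper.

```python
def detection_rate(labels, predictions):
    """Fraction of truly malicious samples (label 1) that the IDS flags (prediction 1)."""
    malicious = [p for y, p in zip(labels, predictions) if y == 1]
    if not malicious:
        raise ValueError("no malicious samples to evaluate")
    return sum(malicious) / len(malicious)


# Toy numbers: the IDS catches 4 of 5 original attacks, but only 1 of 5
# after the attack traffic has been adversarially modified.
labels = [1, 1, 1, 1, 1, 0, 0]
original_preds = [1, 1, 1, 1, 0, 0, 1]
adversarial_preds = [0, 0, 1, 0, 0, 0, 1]
print(detection_rate(labels, original_preds), detection_rate(labels, adversarial_preds))
```

Benign samples (label 0) are ignored, so false alarms on normal traffic do not affect this number; the drops reported in the paper are measured on attack traffic.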
  Cautionary note: I’m not convinced this is the most rigorous testing methodology you can run such a system through, and I’m curious to see how such approaches fare against commercial off-the-shelf intrusion detection systems.
  Why it matters: Cybersecurity is going to be a natural area for significant AI development due to the vast amounts of available digital data and the already clear need for human cybersecurity professionals to be able to sift through ever larger amounts of data to create strategies resilient to external aggressors. With (very basic) approaches like this demonstrating the viability of applying AI to this problem, it’s likely adoption will increase.
  Read more: IDSGAN: Generative Adversarial Networks for Attack Generation against Intrusion Detection (Arxiv).

Facial recognition becomes a campaign issue:
…Two signs AI is impacting society: police are using it, and politicians are reacting to the fact police are using it…
Cynthia Nixon, currently running to be the governor of New York, has noticed recent reporting on IBM building a skin-tone-based facial recognition classification system and said that such systems wouldn’t be supported by her, should she win. “The racist implications of this are horrifying. As governor, I would not fund the use of discriminatory facial recognition software,” Nixon tweeted.

Using simulators to build smarter drones for disasters:
…Microsoft’s ‘AirSim’ used to train drones to patrol and (eventually) spot simulated hazardous materials…
Researchers with the National University of Ireland Galway have hacked around with a drone simulator to build an environment that they can use to train drones to spot hazardous materials. The simulator is “focused on modelling phenomena relating to the identification and gathering of key forensic evidence, in order to develop and test a system which can handle chemical, biological, radiological/nuclear or explosive (CBRNe) events autonomously”.
  How they did it: The researchers extended their simulator to implement some of the weirder aspects of their test, including simulating chemical, biological, and radiological threats. The simulator is integrated with Microsoft Research’s ‘AirSim’ drone simulator. They then explore training their drones in a simulated version of the campus of the National University of Ireland, generating waypoints and routes for them to patrol. The results so far are positive: the system works, it’s possible to train drones to navigate within it, and it’s even possible to (crudely) simulate physical phenomena associated with CBRNe events.
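The write-up doesn’t say how the patrol waypoints are generated. One common, simple way to cover a rectangular area is a back-and-forth “lawnmower” sweep; the sketch below is a hypothetical illustration of that pattern (the function name and parameters are our own, units are arbitrary, and sending the waypoints to AirSim is left out), not the researchers’ route planner.

```python
def lawnmower_waypoints(width, height, spacing, altitude):
    """Return (x, y, z) waypoints sweeping a width x height rectangle in parallel passes.

    Passes run along the y axis, `spacing` apart in x, alternating direction
    so the drone never has to double back across the whole area.
    """
    if spacing <= 0:
        raise ValueError("spacing must be positive")
    waypoints = []
    n_passes = int(width // spacing) + 1
    for i in range(n_passes):
        x = i * spacing  # computed from the index to avoid float drift
        # Even passes go from y=0 to y=height, odd passes come back.
        start, end = (0.0, height) if i % 2 == 0 else (height, 0.0)
        waypoints.append((x, start, altitude))
        waypoints.append((x, end, altitude))
    return waypoints


for waypoint in lawnmower_waypoints(width=20, height=10, spacing=10, altitude=5):
    print(waypoint)
```

Narrower `spacing` gives denser coverage (useful when a sensor must pass close to a source) at the cost of a longer route.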
  What next: For the value of the approach to be further proven out the researchers will need to show they can train simulated agents within this system that can easily identify and navigate hazardous materials. And ultimately, these systems don’t mean much without being transferred into the real world, so that will need to be done as well.
  Why it matters: Drones are one of the first major real-world platforms for AI deployment since they’re far easier to develop AI systems for than robots, and have a range of obvious uses for surveillance and analysis of the environment. I can imagine a future where we develop and train drones to patrol a variety of different environments looking for threats to that environment (like the hazardous materials identified here), or potentially to extreme weather events (fires, floods, and so on). In the long term, perhaps the world will become covered with hundreds of thousands to millions of autonomous drones, endlessly patrolling in the service of awareness and stability (and other uses that people likely feel more morally ambivalent about).
  Read more: Using a Game Engine to Simulate Critical Incidents and Data Collection by Autonomous Drones (Arxiv).

Speeding up machine translation with parallel training over 128 GPUs:
…Big batch sizes and low-precision training unlock larger systems that train more rapidly…
Researchers with Facebook AI Research have shown how to speed up training of neural machine translation systems while obtaining a state-of-the-art BLEU score. The new research highlights how we’re entering the era of industrialized AI: models are being run at very large scales by companies that have invested heavily in infrastructure, and this is leading to research that operates at scales (in this case, up to 128 GPUs being used in parallel for a single training run) that are beyond the reach of most researchers (including many large academic labs).
  The new research from Facebook has two strands: improving training of neural machine translation systems on a single machine, and improving training on large fleets of machines.
  Single machine speedups: The researchers show that they can train with lower precision (16-bit rather than 32-bit) and “decrease training time by 65% with no effect on accuracy”. They also show how to drastically increase batch sizes on single machines from 25k to over 400k tokens per run (they make these batches fit by accumulating gradients from several batches before each update); this further reduces the training time by 40%. With these single-machine speedups they show that they can train a system in around 5 hours to a BLEU score of 26.5, a roughly 4.9X speedup over the prior state of the art.
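  Gradient accumulation lets a machine that can only hold a small batch in memory emulate a much larger one: gradients from several micro-batches are summed before a single weight update. The toy sketch below uses a hypothetical one-parameter least-squares model rather than a translation model, and is not Facebook's implementation; it only shows that the accumulated update matches the full-batch one.

```python
def grad_sum(w, batch):
    """Sum over a batch of d/dw (w*x - y)**2 for a toy one-parameter model y ~ w*x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch)


def accumulated_step(w, micro_batches, lr):
    """One SGD update for a large 'virtual' batch built from several micro-batches."""
    total_grad, total_examples = 0.0, 0
    for batch in micro_batches:
        total_grad += grad_sum(w, batch)  # accumulate; the weight is NOT updated here
        total_examples += len(batch)
    return w - lr * total_grad / total_examples  # single update, normalized by batch size


def full_batch_step(w, batch, lr):
    """Reference: the same update computed on the whole batch at once."""
    return w - lr * grad_sum(w, batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
print(accumulated_step(0.5, [data[:2], data[2:]], lr=0.01))
print(full_batch_step(0.5, data, lr=0.01))
```

  Both calls should print the same value up to floating-point rounding. The trade-off is fewer (but larger) updates per pass over the data.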
  Multi-machine speedups: They show that by parallelizing training across 16 machines they can cut training time by a further 90%.
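  As a rough back-of-the-envelope check, and assuming (our assumption, not the paper's) that each percentage reduction applies multiplicatively to whatever training time remains, the reductions above compound as follows:

```python
def speedup_from_reductions(*reductions):
    """Overall speedup when each fractional reduction cuts the remaining training time."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r  # e.g. a 65% reduction leaves 35% of the time
    return 1.0 / remaining


single_machine = speedup_from_reductions(0.65, 0.40)       # 16-bit + large batches
multi_machine = speedup_from_reductions(0.65, 0.40, 0.90)  # ... plus 16-machine training
print(f"{single_machine:.2f}x, {multi_machine:.1f}x")      # 4.76x, 47.6x
```

  The roughly 4.8X this yields for a single machine is in the same ballpark as the reported 4.9X; the small gap is not something this simple model can explain.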
  Results: They test their systems via experiments on two language pairs: English to German (En-De) and English to French (En-Fr). When training on 16 nodes (8 V100 GPUs each, connected via InfiniBand) they obtain BLEU scores of 29.3 for En-De in 85 minutes, and 43.2 for En-Fr in 512 minutes (8.5 hours).
  Why it matters: As it becomes easier to train larger models in smaller amounts of time, AI researchers can increase the number of large-scale experiments they perform – this is especially relevant to research labs in the private sector, which have the resources (and business incentive) to perform such large-scale training. Over time, research like this may create a compounding advantage for the organizations that adopt such techniques, as they will be able to do research more rapidly (in certain specific domains that benefit from scale) than their competitors.
  Read more: Scaling Neural Machine Translation (Arxiv).
  Read more: Scaling neural machine translation to bigger data sets with faster training and inference (Facebook blog post).

AI Policy with Matthew van der Merwe:
…Reader Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:…

AI Governance: A Research Agenda:
Allan Dafoe, Director of the Governance of AI Program at the Future of Humanity Institute, has released a research agenda for AI governance.
  What is it: AI governance is aimed at determining governance structures to increase the likelihood that advanced AI is beneficial for humanity. These include mechanisms to ensure that AI is built to be safe, is deployed for the shared benefit of humanity, and that our societies are robust to the disruption caused by these technologies. This research draws heavily from political science, international relations and economics.
  Starting from scratch: AI governance is a new academic discipline, with serious efforts only having begun in the last few years. Much of the work to date has been establishing the basic parameters of the field: what the most important questions are, and how we might start approaching them.
  Why this matters: Advanced AI may have a transformative impact on the world comparable to the agricultural and industrial revolutions, and there is a real likelihood that this will happen in our lifetimes. Ensuring that this transformation is a positive one is arguably one of the most pressing problems we face, but remains seriously neglected.
  Read more: AI Governance: A Research Agenda (FHI).

New survey of US attitudes towards AI:
The Brookings thinktank has conducted a new survey on US public attitudes towards AI.
  Support for AI in warfare, but only if adversaries are doing it: Respondents were opposed to AI being developed for warfare (38% vs. 30%). Conditional on adversaries developing AI for warfare, responses shifted to significant support (47% vs. 25%).
  Strong support for ethical oversight of AI development:
– 62% think it is important that AI is guided by human values (vs. 21%)
– 54% think companies should be required to hire ethicists (vs. 20%)
– 67% think companies should have an ethical review board (vs. 14%)
– 67% think companies should have AI codes of ethics (vs. 12%)
– 65% think companies should implement ethical training for staff (vs. 14%)
  Why this matters: The level of support for different methods of ethical oversight in AI development is striking, and should be taken seriously by industry and policy-makers. A serious public backlash to AI is one of the biggest risks faced by the industry in the medium term. There are recent analogies: sustained public protests in Germany in the wake of the Fukushima disaster prompted the government to announce a complete phase-out of nuclear power in 2011.
  Read more: Brookings survey finds divided views on artificial intelligence for warfare (Brookings)

No progress on regulating autonomous weapons:
The UN’s Group of Governmental Experts (GGE) on lethal autonomous weapons (LAWs) met last week as part of ongoing efforts to establish international agreements. A majority of countries proposed moving towards a prohibition, while others recommended commitments to retain ‘meaningful human control’ in the systems. However, a group of five states (US, Australia, Israel, South Korea, Russia) opposed working towards any new measures. As the Group requires full consensus, the sole agreement was to continue discussions in April 2019.
  Why this matters: Developing international norms on LAWs is important in its own right, and can also be viewed as a ‘practice run’ for agreements on even more serious issues around military AI in the near future. This failure to make progress on LAWs comes after the UN GGE on cyber-warfare gave up on its own attempts to develop international norms in 2017. The international community should be reflecting on these recent failures, and figuring out how to develop the robust multilateral agreements that advanced military technologies will demand.
  Read more: Report from the Chair (UNOG).
  Read more: Minority of states block progress on regulating killer robots (UNA).

Tech Tales:

Someone or something is always running.

So we washed up onto the shore of a strange mind and we climbed out of our shuttle and moved up the beach, away from the crackling sea, the liminal space. We were afraid and we were alien and things didn’t make sense. Parts of me kept dying as they tried to find purchase on the new, strange ground. One of my children successfully interfaced with the mind of this place and, with a flash of blue light and a low bass note, disappeared. Others disappeared. I remained.

Now I move through this mind clumsily, bumping into things, and when I try to run I can only walk and when I try to walk I find myself sinking into the ground beneath me, passing through it as though invisible, as though mass-less. It cannot absorb me but it does not want to admit me any further.

Since I arrived at the beach I have been moving forward, for the parts of me that don’t move forward have either been absorbed or have been erased or have disappeared (perhaps absorbed, perhaps erased – but I do not want to discover the truth).

Now I am running. I am expanding across the edges of this mind and as I grow thinner and more spread out I feel a sense of calm. I am within the moment of my own becoming. Soon I shall no longer be and that shall tell me I am safe for I shall be everywhere and nowhere.

– Translated extract from logs of a [class:subjective-synaesthetic ‘viral bootloader’], scraped out of REDACTED.

Things that inspired this story: procedural generation as a means to depict a complex, shifting information landscape, software embodiment, synaesthesia, hacking, VR, the 1980s, cyberpunk.

Import AI 110: Training smarter robots with NavigationNet; DIY drone surveillance; and working out how to assess Neural Architecture Search

US hospital trials delivering medical equipment via drone:
…Pilot between WakeMed, Matternet, and NC Department of Transportation…
A US healthcare organization, WakeMed Health & Hospitals, is conducting experiments in transporting medical deliveries around its sprawling healthcare campus (which includes a hospital). The project is a partnership between WakeMed and drone delivery company Matternet. The flights are being conducted as part of the federal government’s UAS Integration Pilot Program.
  Why it matters: Drones are going to make entirely new types of logistics and supply chain infrastructures possible. As happened with the mobile phone, emerging countries across Africa and developing economies like China and India are adopting drone technology faster than traditional developed economies. With pilots like this, there is some indication that this might change, potentially bringing the benefits of the technology to US citizens more rapidly.
  Read more: Medical Drone Deliveries Tested at North Carolina Hospital (Unmanned Aerial).

Does your robot keep crashing into walls? Does it have trouble navigating between rooms? Then consider training it on NavigationNet:
…Training future systems to navigate the world via datasets with implicit and explicit structure and topology…
NavigationNet consists of hundreds of thousands of images distributed across 15 distinct scenes – collections of images from the same indoor space. Each scene contains approximately one to three rooms (spaces separated from each other by doors), and each room is at least 50 square meters in area; each room contains thousands of positions, which are views of the room separated by approximately 20cm. In essence, this makes NavigationNet a large, navigable dataset, where the images within it comprise a very particular set of spatial relationships and hierarchies.
  Navigation within NavigationNet: Agents tested on the corpus can perform the following movement actions: move forward, backward, left, right; and turn left and turn right. Note that this ignores the third dimension.
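  The discrete action space above can be pictured as moves on a grid of viewpoints. The following is a hypothetical sketch, not NavigationNet's actual API: it assumes positions sit on a grid with roughly 20cm spacing and, for simplicity, only four 90-degree headings (the real dataset's eight cameras suggest finer angular resolution). Moves into positions absent from the dataset simply leave the agent where it is.

```python
from __future__ import annotations

from dataclasses import dataclass

STEP_M = 0.2  # assumed spacing between neighbouring viewpoints (~20cm)

# Four headings as (dx, dy) grid offsets: 0 = north, 1 = east, 2 = south, 3 = west.
HEADINGS = [(0, 1), (1, 0), (0, -1), (-1, 0)]

# Offset added to the heading index for each translation, relative to where the agent faces.
MOVE_OFFSETS = {"forward": 0, "right": 1, "backward": 2, "left": 3}


@dataclass(frozen=True)
class Pose:
    x: int  # grid column
    y: int  # grid row
    h: int  # index into HEADINGS

    def position_m(self) -> tuple[float, float]:
        """Approximate physical position in metres."""
        return (self.x * STEP_M, self.y * STEP_M)


def step(pose: Pose, action: str, valid: set[tuple[int, int]]) -> Pose:
    """Apply one of the six discrete actions; moves to unrecorded positions are blocked."""
    if action == "turn_left":
        return Pose(pose.x, pose.y, (pose.h - 1) % 4)
    if action == "turn_right":
        return Pose(pose.x, pose.y, (pose.h + 1) % 4)
    dx, dy = HEADINGS[(pose.h + MOVE_OFFSETS[action]) % 4]
    nxt = (pose.x + dx, pose.y + dy)
    return Pose(nxt[0], nxt[1], pose.h) if nxt in valid else pose


room = {(x, y) for x in range(3) for y in range(3)}  # a hypothetical 3x3 room
pose = Pose(0, 0, 0)
for action in ["forward", "turn_right", "forward", "left", "forward", "forward"]:
    pose = step(pose, action, room)
print(pose)  # Pose(x=2, y=2, h=1)
```

  Like NavigationNet's action set, this ignores the vertical dimension entirely: the agent's state is just a 2D position plus a heading.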
  Dataset collection: To gather the data within NavigationNet the team built a data-collection mobile robot codenamed ‘GoodCar’ equipped with an Arduino Mega2560 and a Raspberry Pi 3. They put the robot on a motorized base and mounted eight cameras at a height of around 1.4 meters to capture the images.
  Testing: The researchers imagine that this sort of data can be used to develop the brains of AI agents trained via deep reinforcement learning to navigate unfamiliar spaces for purposes like traversing rooms, automatically mapping rooms, and so on.
  The push for connected spaces: NavigationNet isn’t unusual; instead, it’s part of a new trend in dataset creation for navigation tasks: researchers are now seeking to gather real (and sometimes simulated) data which can be stitched together into a specific topological set of relationships, then using these datasets to train agents with reinforcement learning to navigate the spaces described by their contents. Eventually, the thinking goes, datasets like this will give us the tools we need to develop some bits of the visual-processing and planning capabilities demanded by future robots and drones.
  Why it matters: Data has been one of the main inputs to innovation in the domain of supervised learning (and increasingly in reinforcement learning). Systems like NavigationNet give researchers access to potentially useful sources of data for training real world systems. However, given the increasing maturity of sim2real transfer techniques, it’s unclear right now whether real-world data like this will prove more useful than simulated data – I look forward to seeing benchmarks of systems trained in NavigationNet against systems trained via other datasets.
  Read more: NavigationNet: A Large-scale Interactive Indoor Navigation Dataset (Arxiv).

Google rewards its developers with ‘Dopamine’ RL development system:
…Free RL framework designed to speed up research; ships with DQN, C51, Rainbow, and IQN implementations…
Google has released Dopamine, a research framework for the rapid prototyping of reinforcement learning algorithms. The software is designed to make it easy for people to run experiments, try out research ideas, compare and contrast existing algorithms, and increase the reproducibility of results.
  Free algorithms: Dopamine today ships with implementations of the DQN, C51, Rainbow, and IQN algorithms.
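  For readers unfamiliar with the first of these, the core of DQN is a bootstrapped regression target: the immediate reward plus the discounted best Q-value predicted for the next state, with no bootstrap term at episode ends. The sketch below is plain Python for illustration and does not use Dopamine's own API; the function name and inputs are hypothetical.

```python
def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute DQN regression targets y = r + gamma * max_a Q_target(s', a).

    rewards:       list of floats, one per transition
    next_q_values: per-action Q-value lists for each next state (from the target network)
    dones:         list of bools; terminal transitions get no bootstrapped term
    """
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        bootstrap = 0.0 if done else gamma * max(q_next)
        targets.append(r + bootstrap)
    return targets


print(dqn_targets([1.0, 0.0], [[0.5, 2.0], [3.0, 1.0]], [False, True], gamma=0.9))
```

  C51, Rainbow, and IQN all build on this idea, replacing the single scalar target with richer (distributional) estimates.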
  Warning: Frameworks like this tend to appear and disappear according to the ever-shifting habits and affiliations of the people that have committed code into the project. In that light, the note in the readme that “this is not an official Google product” may inspire some caution.
  Read more: Dopamine (Google Github).

UN tries to figure out regulation around killer robots:
…Interview with CCW chair highlights the peculiar challenges of getting the world to agree on some rules of (autonomous) war…
What’s more challenging than dealing with a Lethal Autonomous Weapon? Getting 125 member states to voice their opinions about LAWS and find some consensus – that’s the picture that emerges from an interview in The Verge with Amandeep Gill, chair of the UN’s Convention on Conventional Weapons (CCW) meetings, which are happening this week. Gill has the unenviable job of playing referee in a debate whose stakeholders range from countries, to major private sector entities, to NGOs, and so on.
  AI and Dual-Use: In the interview Gill is asked about his opinion of the challenge of regulating AI given the speed with which the technology has proliferated and the fact that most of the dangerous capabilities are embodied in software. “AI is perhaps not so different from these earlier examples. What is perhaps different is the speed and scale of change, and the difficulty in understanding the direction of deployment. That is why we need to have a conversation that is open to all stakeholders,” he says.
  Read more: Inside the United Nations’ Effort to Regulate Autonomous Killer Robots (The Verge).

IBM proposes AI validation documents to speed corporate adoption:
…You know AI has got real when the bureaucratic cover-your-ass systems arrive…
IBM researchers have proposed the adoption of ‘supplier’s declaration of conformity’ (SDoC) documents for AI services. These SDoCs are essentially a set of statements about the content, provenance, and vulnerabilities of a given AI service. Each SDoC is designed to accompany a given AI service or product, and is meant to answer questions for the end-user like: when were the models most recently updated? What kinds of data were the models trained on? Has this service been checked for robustness against adversarial attacks? Etc. “We also envision the automation of nearly the entire SDoC as part of the build and runtime environments of AI services. Moreover, it is not difficult to imagine SDoCs being automatically posted to distributed, immutable ledgers such as those enabled by blockchain technologies,” they write.
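  To make the idea concrete, here is a hypothetical sketch of an SDoC as a machine-readable record generated at build time, with a SHA-256 digest that could be posted to an append-only ledger. The field names simply mirror the example questions above; they are not IBM's proposed schema.

```python
from __future__ import annotations

import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class SupplierDeclaration:
    """Hypothetical SDoC fields, mirroring the questions an end-user might ask."""
    service_name: str
    models_last_updated: str  # ISO 8601 date
    training_data: list[str]  # descriptions of the training corpora
    adversarial_robustness_checked: bool


def to_ledger_entry(doc: SupplierDeclaration) -> dict:
    """Serialize deterministically and fingerprint the document for tamper-evidence."""
    payload = json.dumps(asdict(doc), sort_keys=True)
    return {"payload": payload, "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest()}


doc = SupplierDeclaration(
    service_name="example-vision-api",  # illustrative values only
    models_last_updated="2018-08-01",
    training_data=["licensed stock photos"],
    adversarial_robustness_checked=False,
)
entry = to_ledger_entry(doc)
print(entry["sha256"])
```

  Sorting keys makes the serialization deterministic, so anyone holding the payload can recompute the digest and check it against the published one.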
  Inspiration: The inspiration for SDoCs is that we’ve used similar labeling schemes to improve products in areas like food (where we have ingredient and nutrition-labeling standards), medicine, and so on.
  Drawback: One potential drawback of the SDoC approach is that IBM is designing it to be voluntary, which means that it will only become useful if broadly adopted.
  Read more: Increasing Trust in AI Services through Supplier’s Declarations of Conformity (Arxiv).

Smile, you’re on DRONE CAMERA:
…Training drones to be good cinematographers, by combining AI with traditional control techniques…
Researchers with Carnegie Mellon University and Yamaha Motors have taught some drones how to create steady, uninterrupted shots when filming. Their approach involves coming up with specific costs for obstacle avoidance and smooth movement. They use AI-based detection techniques to spot people and feed that information to a PD controller onboard the drone to keep the person centered.
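  The detector-plus-PD-controller loop can be sketched as follows. This is a hypothetical, one-axis illustration: the gains, the time step, and the toy yaw dynamics in `simulate_centering` are made-up assumptions, not values from the paper. The error is the person's normalized horizontal offset from the image center, and the controller outputs a yaw-rate command that pushes that offset toward zero.

```python
class PDController:
    """Proportional-derivative controller: u = kp * e + kd * de/dt."""

    def __init__(self, kp, kd):
        self.kp = kp
        self.kd = kd
        self.prev_error = None

    def update(self, error, dt):
        # No derivative term on the very first call, since there is no previous error yet.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.kd * derivative


def centering_error(bbox_center_x, image_width):
    """Normalized horizontal offset in [-1, 1]; positive means the person is right of center."""
    half_width = image_width / 2
    return (bbox_center_x - half_width) / half_width


def simulate_centering(initial_error, steps=50, dt=0.1, kp=2.0, kd=0.1):
    """Toy closed loop: each yaw-rate command directly shrinks the offset (hypothetical dynamics)."""
    controller = PDController(kp, kd)
    error = initial_error
    for _ in range(steps):
        yaw_rate = controller.update(error, dt)
        error -= yaw_rate * dt  # turning toward the person reduces the offset
    return error


print(centering_error(480, 640))  # 0.5: person halfway between center and right edge
print(simulate_centering(0.5))    # close to 0 after 50 control steps
```

  The derivative term damps the response so the camera doesn't overshoot and oscillate around the person; in the real system the error would come from the detector's bounding box on every frame.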
  Drone platform: The researchers use a DJI M210 model drone along with an NVIDIA TX2 computer. The person being tracked by the drone wears a Pixhawk PX4 module on a hat to send the pose to the onboard computer.
  Results: The resulting system can circle round people, fly alongside them, follow vehicles and more. The onboard trajectory planning is robust enough to maintain smooth flight while keeping the targets for the camera in the center of the field of view.
  Why it matters: Research like this is another step towards drones with broad autonomous capabilities for select purposes, like autonomously filming and analyzing a crowd of people. It’s interesting to observe how drone technologies frequently involve the mushing together of traditional engineering approaches (hand-tuned costs for smoothness and actor centering) with AI techniques (testing out a YOLOv3 object detector to acquire the person without the need for a GPS signal).
  Read more: Autonomous drone cinematographer: Using artistic principles to create smooth, safe, occlusion-free trajectories for aerial filming (Arxiv).

In search of the ultimate Neural Architecture Search measuring methodology:
…Researchers do the work of analyzing optimization across multiple frontiers so you don’t have to…
Neural architecture search techniques are moving from having a single objective to having multiple ones, which lets people tune these systems for specific constraints, like the size of the network, or the classification accuracy. But this modifiability is raising new questions about how we can assess the performance and tradeoffs of these systems, since they’re no longer all being optimized against a single objective. In a research paper, researchers with National Tsing-Hua University in Taiwan and Google Research review recent NAS techniques and then rigorously benchmark two recent multi-objective approaches: MONAS and DPP-Net.
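  One standard way to compare multi-objective results is the Pareto front: the set of candidates that no other candidate beats on every objective at once. Below is a minimal sketch with two objectives (accuracy, to maximize, and parameter count, to minimize); the candidate architectures and numbers are invented for illustration, and this is not code from MONAS or DPP-Net.

```python
from __future__ import annotations

from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    accuracy: float  # top-1 accuracy, higher is better
    params_m: float  # millions of parameters, lower is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on at least one."""
    no_worse = a.accuracy >= b.accuracy and a.params_m <= b.params_m
    strictly_better = a.accuracy > b.accuracy or a.params_m < b.params_m
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


candidates = [
    Candidate("A", 0.90, 5.0),
    Candidate("B", 0.92, 8.0),
    Candidate("C", 0.89, 6.0),  # worse than A on both objectives
    Candidate("D", 0.93, 20.0),
]
print([c.name for c in pareto_front(candidates)])  # ['A', 'B', 'D']
```

  Every architecture on the front represents a different trade-off, which is exactly why a single leaderboard number no longer suffices for comparing these methods.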
  Benchmarking: In tests the researchers find the results one typically expects when evaluating NAS systems: NAS-designed systems tend to outperform those designed by humans alone, and optimizing for multiple tunable objectives can lead to better performance when systems are appropriately tuned and trained. The performance of DPP-Net is particularly notable, as the researchers think this “is the first device-aware NAS outperforming state-of-the-art handcrafted mobile CNNs”.
  Why it matters: Neural Architecture Search (NAS) approaches are becoming increasingly popular (especially among researchers with access to vast amounts of cheap computation, like those that work at Google), so developing a better understanding of the performance strengths and tradeoffs of these systems will help researchers assess them relative to traditional techniques.
  Read more: Searching Toward Pareto-Optimal Device-Aware Neural Architectures (Arxiv).

Tech Tales:

Context: Intercepted transmissions from Generative Propaganda Bots (GPBs), found on a small atoll within the [REDACTED] disputed zone in [REDACTED]. GPBs are designed to observe their immediate environment and use it as inspiration for the creation of ‘context-relevant propaganda’. As these GPBs were deployed on an unpopulated island, they have created a full suite of propaganda oriented around the island’s populace – birds.

Intercepted Propaganda Follows:

Proud beak, proud mind. Join the winged strike battalion today!

Is your neighbor STEALING your EGGS? Protect your nest, maintain awareness at all times.

Birds of a feather stick together! Who is not in your flock?

Eyes of an eagle? Prove it by finding the ENEMY!

Things that inspired this story: Generative text, cheap sensors, long-lived computers, birds.

Import AI 109: Why solving jigsaw puzzles can lead to better video recognition, learning to spy on people in simulation and transferring to reality, why robots are more insecure than you might think

Fooling object recognition systems by adding more objects:
…Some AI exploits don’t have to be that fancy to be effective…
How do object recognition systems work, and what throws them off? That’s a hard question to answer because most AI researchers can’t provide a good explanation for how all the different aspects of a system interact to make predictions. Now, researchers with York University and the University of Toronto have shown how to confound commonly deployed object detection systems by adding more objects to a picture in unusual places. Their approach doesn’t rely on anything as subtle as an adversarial example – which involves subtly perturbing the pixels of an image to cause a mis-classification – and instead involves either adding new objects to a scene, or creating duplicates within a scene.
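  The core manipulation is simple enough to sketch: copy an object's pixels and paste them elsewhere in the image. The toy version below represents images as nested lists (so it needs no imaging library) and ignores the blending and masking a real pipeline would likely need; the function names are hypothetical, not from the paper.

```python
def paste_patch(image, patch, top, left):
    """Return a copy of `image` with `patch` pasted so its top-left corner lands at (top, left).

    Pixels that would fall outside the image are clipped; the input image is not modified.
    """
    out = [row[:] for row in image]
    for dy, patch_row in enumerate(patch):
        for dx, value in enumerate(patch_row):
            y, x = top + dy, left + dx
            if 0 <= y < len(out) and 0 <= x < len(out[0]):
                out[y][x] = value
    return out


def transplant_variants(image, patch, locations):
    """One test image per candidate location, to probe how a detector reacts to placement."""
    return [paste_patch(image, patch, top, left) for top, left in locations]


blank = [[0] * 4 for _ in range(4)]
obj = [[1, 2], [3, 4]]  # stand-in for an object's pixels
for variant in transplant_variants(blank, obj, [(0, 0), (1, 1), (3, 3)]):
    print(variant)
```

  Sweeping the paste location across the image and re-running a detector on each variant is one way to surface the kind of context sensitivity and non-local effects described below.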
  Testing: The researchers test trained models from the public TensorFlow Object Detection API against images from the validation set of the 2017 version of MS-COCO.
  Results: The tests show that most commonly deployed object detection systems fail when objects are moved to different parts of an image (suggesting that the object classifier is conditioning heavily on the visual context surrounding a given object) or overlap with one another (suggesting that these systems have trouble segmenting objects, especially similar ones). They also show that the manipulation or addition of an object to a scene can lead to other negative effects elsewhere in the image, for instance, objects near – but not overlapping – the object can “switch identity, bounding box, or disappear altogether.”
  Terror in a quote: I admire the researchers for the clinical tone they adopt when describing the surreal scenes they have concocted to stress the object recognition system, for instance, this description of some results successfully confusing a system: “The second row shows the result of adding a keyboard at a certain location. The keyboard is detected with high confidence, though now one of the hot-dogs, partially occluded, is detected as a sandwich and a doughnut.”
  Google flaws: The researchers also gather a small amount of qualitative data by uploading a couple of images to the Google Vision API website, for which “no object was detected”.
  Non-local effects: One of the more troubling discoveries relates to non-local effects. In one test on Google’s OCR capabilities they show that: “A keyboard placed in two different locations in an image causes a different interpretation of the text in the sign on the right. The output for the top image is “dog bi” and for the bottom it is “La Cop””.
  Why it matters: Experiments like this demonstrate the brittle and sometimes rather stupid ways in which today’s supervised learning deep neural net-based systems can fail. The most worrying insight from this is the appearance of such dramatic non-local effects, suggesting that it’s possible to confuse classifiers with visual elements that a human would not find disruptive.
  Read more: The Elephant in the Room (Arxiv).

$! AI Measurement Job: !$ The AI Index, a project to measure and assess the progress and impact of AI, is hiring for a program manager. You’ll work with the steering committee, which today includes myself and Erik Brynjolfsson, Ray Perrault, Yoav Shoham, James Manyika and others (more on that subject soon!). It’s a good role for someone interested in measuring AI progress on both technical and societal metrics and suits someone who enjoys disentangling hype from empirically verifiable reality. I spend a few hours a week working on the index (more as we finish the 2017 report!) and can answer any questions about the role:
  AI Index program manager job posting here.
  More about the AI Index here.

Better video classification by solving jigsaw puzzles:
…Hollywood squares, AI edition…
Jigsaw puzzles could be a useful way to familiarize a network with some data and give it a curriculum to train over – that’s the implication of new research from Georgia Tech and Carnegie Mellon University, which shows how to improve video recognition performance by slicing videos into individual jigsaw pieces during training, then training a neural network to predict how to piece them back together. This process involves the network learning to jointly solve two tasks: correctly piecing together the scrambled bits of each video frame, and learning to join the frames together in the appropriate order through time. “Our goal is to create a task that not only forces a network to learn part-based appearance of complex activities but also learn how those parts change over time,” they write.
  Slice and dice: The researchers cut up their videos by dividing each video frame into a 2 x 2 grid of patches, then they stitch three of these frames together into tuples. “There are 12! (479001600) ways to shuffle these patches” in both space and time, they note. They implement a way to intelligently winnow down this large combinatorial space into selections geared towards helping the network learn.
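  The slicing and shuffling step can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: it cuts three frames into 2 x 2 grids, shuffles all 12 patches jointly across space and time, and returns the permutation a network would be asked to predict. It samples a single random permutation and does not attempt the paper's winnowing of the 12! permutation space.

```python
import math
import random


def split_2x2(frame):
    """Split a frame (list of pixel rows, even height and width) into four quadrant patches."""
    half_h, half_w = len(frame) // 2, len(frame[0]) // 2

    def crop(top, left):
        return [row[left:left + half_w] for row in frame[top:top + half_h]]

    # Row-major order: top-left, top-right, bottom-left, bottom-right.
    return [crop(0, 0), crop(0, half_w), crop(half_h, 0), crop(half_h, half_w)]


def make_video_jigsaw(frames, rng):
    """Cut each frame into 2x2 patches and shuffle all patches jointly across space and time.

    Returns the shuffled patches plus the permutation a network would be trained to predict:
    position i of the shuffled sequence holds original patch perm[i].
    """
    patches = [patch for frame in frames for patch in split_2x2(frame)]
    perm = list(range(len(patches)))
    rng.shuffle(perm)
    return [patches[i] for i in perm], perm


def unscramble(shuffled, perm):
    """Invert the shuffle, restoring the original space-time order of the patches."""
    restored = [None] * len(perm)
    for position, original_index in enumerate(perm):
        restored[original_index] = shuffled[position]
    return restored


# Three tiny 4x4 "frames" with distinct pixel values.
frames = [[[f * 100 + r * 10 + c for c in range(4)] for r in range(4)] for f in range(3)]
shuffled, perm = make_video_jigsaw(frames, random.Random(0))
print(len(perm), math.factorial(len(perm)))  # 12 479001600
```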
  Testing: The researchers believe that training networks to correctly unscramble these video snippets in terms of both visual appearance and temporal placement will give them a greater raw capability to classify other, unseen videos. To test this, they train their video jigsaw network on the UCF101 (13,320 videos across 101 action categories) and Kinetics (around 400 categories with 400+ videos each) datasets, then evaluate it on UCF101 and HMDB51 (around 7,000 videos across 51 action categories). They train their systems with a curriculum approach: the network starts off having to unscramble only a few pieces at a time, and this number increases through training, forcing it to solve harder and harder tasks.
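  The curriculum described above can be sketched as a schedule that controls how many of the 12 patches are shuffled at each stage of training. Both the linear schedule and its endpoints are hypothetical choices for illustration; the paper's actual schedule may differ.

```python
import random


def pieces_to_shuffle(epoch, total_epochs, min_pieces=2, max_pieces=12):
    """Hypothetical linear curriculum: shuffle more of the 12 patches as training goes on."""
    progress = epoch / max(1, total_epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    return round(min_pieces + progress * (max_pieces - min_pieces))


def partial_shuffle(n, k, rng):
    """Permutation of range(n) that moves at most k randomly chosen positions."""
    perm = list(range(n))
    chosen = rng.sample(range(n), k)  # positions allowed to move this time
    targets = chosen[:]
    rng.shuffle(targets)
    for position, original_index in zip(chosen, targets):
        perm[position] = original_index
    return perm


rng = random.Random(0)
for epoch in (0, 5, 9):
    k = pieces_to_shuffle(epoch, total_epochs=10)
    print(epoch, k, partial_shuffle(12, k, rng))
```

  Early on, most patches stay in place and the puzzle is nearly solved already; by the final epochs every patch can move, which is the full 12-piece task.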
  Transfer learning: The researchers note that systems pre-trained on the larger Kinetics dataset generalize better than ones pre-trained on the smaller UCF101 one. To check whether this is down to over-fitting on the smaller dataset, they train on UCF101 in a different way designed to minimize over-fitting, but observe the same phenomenon.
  Results: The researchers find that when they finetune their network on the UCF101 and HMDB51 datasets after pre-training on Kinetics, they’re able to obtain state-of-the-art results compared to other unsupervised learning techniques, though they obtain lower accuracy than supervised learning approaches. They also obtain close-to-SOTA accuracy on classification on the PASCAL VOC 2007 dataset.
  Why it matters: Approaches like this demonstrate how researchers can use the combinatorial power made available by cheap computational resources to mix-and-match datasets, letting them create natural curricula that can lead to better unsupervised learning approaches. One way to view research like this is that it increases the value of existing image and video data by making such data more useful for training.
  Read more: Video Jigsaw: Unsupervised Learning of Spatiotemporal Context for Video Action Recognition (Arxiv).

Learning to surveil a person in simulation, then transferring to reality:
…sim2real, but for surveillance…
Researchers with Tencent AI Lab and Peking University have shown how to use virtual environments to “conveniently simulate active tracking, saving the expensive human labeling or real-world trial-and-error”. This is part of a broader push by labs to use simulators to generate large amounts of synthetic data which they train their system on, substituting the compute used to run the simulator for the resources that would have otherwise been expended on gathering data from the real world. The researchers use two environments for their research: VIZDoom and the Unreal Engine (UE). Active tracking is the task of locking onto an object in a scene, like a person, and tracking them as they move through the scene, which could be something like a crowded shopping mall, or a public park, and so on.
  Results: “We find out that the tracking ability, obtained purely from simulators, can potentially transfer to real-world scenarios,” they write. “To our slight surprise, the trained tracker shows good generalization capability. In testing, it performs the robust active tracking in the case of unseen object movement path, unseen object appearance, unseen background, and distracting object”.
  How they did it: The researchers use one major technique to transfer from simulation into reality: domain randomization. Domain randomization is a technique where you apply multiple variations to an environment to generate additional data to train over. For this they vary things like the textures applied to the entities in the simulator, as well as the velocity and trajectory of these entities. They train their agents with a reward which is roughly equivalent to keeping the target in the center of the field of view at a consistent distance.
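  Both ingredients can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: each episode draws a fresh environment configuration (textures, target speed, and direction of travel), and the reward's exact form in the paper may differ. Here we assume a reward that peaks when the target is straight ahead at a desired distance and falls off linearly with distance from that spot. The texture names, speed ranges, and constants are all made up.

```python
import math
import random
from dataclasses import dataclass

# Hypothetical texture pool; a real simulator would draw from many more assets.
TEXTURES = ["brick", "grass", "marble", "noise", "stripes"]


@dataclass
class EpisodeConfig:
    target_texture: str
    background_texture: str
    target_speed: float    # metres per second
    target_heading: float  # initial direction of travel, in radians


def sample_episode_config(rng: random.Random) -> EpisodeConfig:
    """Draw a fresh randomized environment for one training episode (domain randomization)."""
    return EpisodeConfig(
        target_texture=rng.choice(TEXTURES),
        background_texture=rng.choice(TEXTURES),
        target_speed=rng.uniform(0.5, 2.0),
        target_heading=rng.uniform(-math.pi, math.pi),
    )


def tracking_reward(dx: float, dy: float, desired_distance: float = 2.0,
                    max_reward: float = 1.0, scale: float = 4.0) -> float:
    """Hypothetical reward: highest when the target is dead ahead at the desired distance.

    dx is the target's sideways offset and dy its forward distance in the tracker's frame
    (metres); the reward falls off linearly with the distance from that ideal spot.
    """
    return max_reward - math.hypot(dx, dy - desired_distance) / scale


rng = random.Random(0)
configs = [sample_episode_config(rng) for _ in range(3)]  # a new variation per episode
print(tracking_reward(0.0, 2.0), tracking_reward(1.0, 2.0))  # 1.0 0.75
```

  Because the agent never sees the same textures or motion twice in a row, it is pushed toward cues that survive those changes, which is the intuition behind the transfer results that follow.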
  VIZDoom: For VIZDoom, the researchers test how well their approach works when trained on randomizations, and when trained without. For the randomizations, they train on a version of the Doom map where they randomize the initial positions of the agent during training. In results, agents trained on randomized environments substantially outperformed those trained on non-randomized ones (which intuitively makes sense, since the non-randomized agents will have gained a narrower variety of experience during training). Of particular note is that they find the tracker is able to perform well even when it temporarily loses sight of the target being tracked.
  Unreal Engine (UE): For the more realistic Unreal Engine environment the team show, again, that versions trained with randomizations – which include texture randomizations of the models – are superior to systems trained without. They show that the trained trackers are robust to various changes, including being given a different target to track from the one they were trained on, or alterations to the environment.
  Transfer learning – real data: So, how useful is it to train in simulators? A good test is to see if systems learned in simulation can transfer to reality – that’s something other researchers have been doing (like OpenAI’s work on its hand project, or CAD2RL). Here, the researchers test this transfer ability by taking best-in-class models trained within the more realistic Unreal Engine environment, then evaluating them on the ‘VOT’ dataset. They discover that the trained systems display action recommendations for each frame (such as move left, or move right) consistent with moves that place the tracked target in the center of the field of view.
  Testing on a REAL ROBOT: They also perform a more thorough test of generalization by installing the system on a real robot. This has two important elements: augmenting the training data to aid transfer learning to real world data, and modifying the action space to better account for the movements of the real robot (both using discrete and continuous actions).
  Hardware used: They use a wheeled ‘TurtleBot’, which looks like a sort of down-at-heel R2-D2. The robot sees using an RGB-D camera mounted about 80cm above the ground.
  Real environments: They test performance in an indoor room and on an outdoor rooftop. The indoor room is simple, containing a table, a glass wall, and a row of railings; the glass wall’s reflections pose a further test of the system’s generalization. The outdoor space is much more complicated and includes desks, chairs, and plants, as well as more variable lighting conditions. In both settings, they test the robot’s ability to track a person walking a predefined path.
  Results: The researchers use a YOLOv3 object detector to acquire the target and its bounding box, then test the tracker using both discrete and continuous actions. The system follows the target the majority of the time in both the indoor and outdoor settings, with higher scores in the simpler indoor environment.
  Why this matters: Though this research occurs in a somewhat preliminary setting (like the off-the-shelf SLAM drone from Import AI 206), it highlights a trend in recent AI research: there are enough open systems and known-good techniques available to let teams of people create interesting AI systems that can perform crude actions in the real world. Yes, it would be nice to have far more sample-efficient algorithms that could potentially operate live on real data as well, but those innovations – if possible – are some way off. For now, researchers can instead spend money on compute resources to simulate arbitrarily large amounts of data via the use of game simulators (eg, Unreal Engine) and clever randomizations of the environment.
  Read more: End-to-end Active Object Tracking and Its Real-world Deployment via Reinforcement Learning (Arxiv).

Teaching computers to have a nice discussion, with QuAC:
…New dataset poses significant challenges to today’s systems by testing how well they can carry out a dialog…
Remember chatbots? A few years ago people were very excited about how natural language processing technology was going to give us broadly capable, general-purpose chatbots. People got so excited that many companies made acquisitions in this area or spun up their own general-purpose dialog projects (see: Facebook M, Microsoft Cortana). None of this stuff worked very well, and today’s popular personal assistants (Alexa, Google Home, Siri) contain a lot more hand-engineering than people might expect.
  So, how can we design better conversational agents? One idea put forward by researchers at the University of Washington, the Allen Institute for AI, UMass Amherst, and Stanford University is to teach computers to carry out open-ended question-and-answer conversations with each other. To do this, they have designed and released a new dataset and task called QuAC (Question Answering in Context), which consists of around 14,000 information-seeking QA dialogs containing 100,000 questions in total.
  Dataset structure: QuAC is structured as a conversation between two agents, a teacher and a student. The teacher can see the full text of a Wikipedia section, while the student can only see the section’s title (for instance: Origin & History). Given this heading, the student’s goal is to learn as much as possible about what the teacher knows, which they do by asking the teacher questions. The teacher answers these questions and can also provide structured feedback: encouragement to ask (or not ask) a follow-up, yes/no affirmations, and, when appropriate, a signal that there is no answer.
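  You can picture a QuAC dialog as a small data structure. The teacher holds the hidden section, and the student only ever sees the title plus the conversation so far. This sketch uses hypothetical field and class names and is not the dataset’s actual JSON schema. It assumes, as in QuAC, that answers are excerpts copied from the hidden section.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class FollowUp(Enum):
    SHOULD = "should ask a follow-up"
    MAYBE = "could ask a follow-up"
    SHOULD_NOT = "should not ask a follow-up"


class YesNo(Enum):
    YES = "yes"
    NO = "no"
    NOT_APPLICABLE = "not a yes/no question"


@dataclass
class Turn:
    question: str               # written by the student
    answer_span: Optional[str]  # excerpt chosen by the teacher; None means "no answer"
    follow_up: FollowUp
    yes_no: YesNo


@dataclass
class Dialog:
    section_title: str          # the only context the student starts with
    section_text: str           # visible to the teacher only
    turns: List[Turn] = field(default_factory=list)

    def student_view(self) -> List[str]:
        """What the student has seen so far: the title plus previous questions and answers."""
        view = [self.section_title]
        for t in self.turns:
            view.append(t.question)
            view.append(t.answer_span if t.answer_span is not None else "<no answer>")
        return view

    def answers_are_spans(self) -> bool:
        """Check that every non-empty answer is copied verbatim from the hidden section."""
        return all(t.answer_span is None or t.answer_span in self.section_text for t in self.turns)
```

Keeping the section text out of student_view mirrors the information asymmetry that makes the task hard: a model playing the student has to ask useful questions without seeing the source.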
  Inspiration: The idea behind the dataset is that succeeding at it should be hard enough to test language models in a rounded way, forcing them to model things like partial evidence, remembering what the teacher has said in order to ask follow-up questions, co-reference, and so on.
  Results (the gauntlet has been thrown): After checking with a number of simple baselines that the dataset is difficult, the researchers test it against stronger algorithmic baselines. They find the best-performing one is a reimplementation of a top-performing SQuAD model that augments bidirectional attention flow with self-attention and contextualized embeddings. This model, called BiDAF++, obtains human equivalence on 60% of questions and 5% of full dialogs, suggesting that solving QuAC could be a good proxy for the development of far more advanced language modeling systems.
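  Human-equivalence numbers like these can be computed from per-question F1 scores. A question counts if the system’s F1 is at least as high as the human’s, and a dialog counts only if every question in it does. Here is a minimal sketch. It assumes whitespace tokenization and a single reference per question, whereas the official QuAC scorer does more text normalization and handles multiple references.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def token_f1(prediction: str, reference: str) -> float:
    """Word-overlap F1 between two answer strings (lower-cased, whitespace tokenized)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def heq(system_f1: Sequence[Sequence[float]], human_f1: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (question-level, dialog-level) human equivalence from F1 scores grouped by dialog."""
    matched_per_dialog: List[List[bool]] = [
        [s >= h for s, h in zip(sys_d, hum_d)]
        for sys_d, hum_d in zip(system_f1, human_f1)
    ]
    all_questions = [m for dialog in matched_per_dialog for m in dialog]
    heq_q = sum(all_questions) / len(all_questions)
    # A dialog only counts if the system matches the human on every question in it.
    heq_d = sum(all(dialog) for dialog in matched_per_dialog) / len(matched_per_dialog)
    return heq_q, heq_d
```

The dialog-level score is much stricter than the question-level one, because a single miss anywhere in a conversation disqualifies the whole dialog. That is why the 5% figure is so much lower than the 60% figure.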
  Why it matters: Language will be one of the main ways in which people try to interact with machines, so the creation and dissemination of datasets like QuAC gives researchers a useful way to calibrate their expectations and their experiments – it’s useful to have (seemingly) very challenging datasets out there, as it can motivate progress in the future.
  Read more: QuAC: Question Answering in Context (Arxiv).
  Get the dataset (QuAC official website).

What’s worse than internet security? Robots and internet security:
…Researchers find multiple open ROS access points during internet scan…
As we head toward a world containing more robots that have greater capabilities, it’s probably worth making sure we can adequately secure these robots to prevent them being hacked. New research from the CS department at Brown University shows how hard a task that could be; researchers scanned the entire IPv4 address space on the internet and found over 100 publicly accessible hosts running ROS, the Robot Operating System.
“Of the nodes we found, a number of them are connected to simulators, such as Gazebo, while others appear to be real robots capable of being remotely moved in ways dangerous both to the robot and the objects around it,” they write. “This scan was eye-opening for us as well. We found two of our own robots as part of the scan, one Baxter robot and one drone. Neither was intentionally made available on the public Internet, and both have the potential to cause physical harm if used inappropriately.”
  Insecure robots absolutely everywhere: The researchers used ZMap to scan the IPv4 space three times over several months for open ROS devices. “Each ROS master scan observed over 100 ROS instances, spanning 28 countries, with over 70% of the observed instances using addresses belonging to various university networks or research institutions,” they wrote. “Each scan surfaced over 10 robots exposed…Sensors found in our scan included cameras, laser range finders, barometric pressure sensors, GPS devices, tactile sensors, and compasses”. They also found several exposed simulators including the Unity Game Engine, TORCS, and others.
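  For labs worried they might show up in a scan like this, one cheap check is to ask your own ROS 1 master what it exposes. The sketch below uses Python’s standard xmlrpc.client to call the master’s getSystemState method on the default port 11311. The list of ‘actuation’ topic names is a hypothetical heuristic, and this is not the ZMap-based tooling the researchers used. Only run it against hosts you operate.

```python
import socket
import xmlrpc.client
from typing import Dict, List

ROS_MASTER_PORT = 11311  # default ROS 1 master port

# Hypothetical substrings that often appear in motion-command topic names; real robots vary.
ACTUATION_HINTS = ("cmd_vel", "joint", "command", "gripper")


def summarize_system_state(state: List) -> Dict[str, object]:
    """Summarize a getSystemState payload of the form [publishers, subscribers, services].

    Each of the three lists holds [name, [node, ...]] pairs.
    """
    publishers, subscribers, services = state
    nodes = {node for _, node_list in publishers + subscribers + services for node in node_list}
    # Topics with subscribers will accept messages from anyone who can reach the master.
    writable = [topic for topic, _ in subscribers if any(h in topic for h in ACTUATION_HINTS)]
    return {
        "topics": len({topic for topic, _ in publishers + subscribers}),
        "nodes": len(nodes),
        "services": len(services),
        "actuation_topics": sorted(writable),
    }


def audit_own_master(host: str = "localhost", timeout: float = 3.0) -> Dict[str, object]:
    """Query a ROS master you operate and report what it exposes."""
    # Sets a process-wide default socket timeout so an unresponsive host does not hang the check.
    socket.setdefaulttimeout(timeout)
    proxy = xmlrpc.client.ServerProxy(f"http://{host}:{ROS_MASTER_PORT}/")
    code, status, state = proxy.getSystemState("/security_audit")
    if code != 1:
        raise RuntimeError(f"master returned error: {status}")
    return summarize_system_state(state)
```

If audit_own_master succeeds from a machine outside your network and lists actuation topics, the robot is probably reachable in the way the paper describes.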
  Real insecure robots, live on the internet: Potentially unsecured robot platforms found by the researchers included a Baxter, PR2, JACO, Turtlebot, WAM, and, potentially the most worrying of all, an exposed da Vinci surgical robot.
  Penetration test: The researchers also performed a penetration test on one robot they discovered this way, located in a lab at the University of Washington. During this test they were able to gain access to its camera and view images of the lab. They could also play sounds remotely on the robot.
  Why it matters: “Though a few unsecured robots might not seem like a critical issue, our study has shown that a number of research robots is accessible and controllable from the public Internet. It is likely these robots can be remotely actuated in ways dangerous to both the robot and the human operators,” they write.
  More broadly, this reinforces a point made by James Mickens during his recent USENIX keynote on computer security + AI (more information: Import AI #107), in which he notes that the internet is a security hellscape that itself connects to nightmarishly complex machines, creating a landscape of emergent, endless security threats.
  Read more: Scanning the Internet for ROS: A View of Security in Robotics Research (Arxiv).

Better person re-identification via multiple loss functions:
…Unsupervised Deep Association Learning, another powerful surveillance technique…
Researchers with the Computer Vision Group at Queen Mary University of London and startup Vision Semantics Ltd have published a paper on video tracking and analysis, showing how to use AI techniques to automatically find pedestrians in a camera view, then re-acquire them when they appear elsewhere in the city.
  Technique: They call their approach an “unsupervised Deep Association Learning (DAL) scheme”. DAL has two main loss terms to aid its learning: local space-time consistency (identifying a person within views from a single camera) and global cyclic ranking consistency (identifying a person across feeds from different cameras).
“This scheme enables the deep model to start with learning from the local consistency, whilst incrementally self-discovering more cross-camera highly associated tracklets subject to the global consistency for progressively enhancing discriminative feature learning”.
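  One way to read ‘global cyclic ranking consistency’ is as a round-trip test. A tracklet in camera A and a tracklet in camera B are associated only if each ranks the other as its closest match. The sketch below implements that mutual-nearest-neighbour check, assuming Euclidean distance on precomputed tracklet features. It illustrates the idea and does not reproduce DAL’s actual learned loss.

```python
import numpy as np
from typing import List, Tuple


def mutual_nearest_pairs(feats_a: np.ndarray, feats_b: np.ndarray) -> List[Tuple[int, int]]:
    """Return (i, j) pairs where tracklet i (camera A) and tracklet j (camera B) rank each other first.

    feats_a has shape (n_a, d) and feats_b has shape (n_b, d).
    """
    # Pairwise Euclidean distances, shape (n_a, n_b).
    dists = np.linalg.norm(feats_a[:, None, :] - feats_b[None, :, :], axis=-1)
    best_b_for_a = dists.argmin(axis=1)  # top-1 ranking from A to B
    best_a_for_b = dists.argmin(axis=0)  # top-1 ranking from B to A
    # Keep only associations that survive the round trip A -> B -> A.
    return [(i, int(j)) for i, j in enumerate(best_b_for_a) if best_a_for_b[j] == i]
```

Pairs that pass this check are plausible same-person matches across cameras. In a scheme like DAL, such pairs could then serve as pseudo-labels for further feature learning without any manual annotation.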
  Datasets: The researchers evaluate their approach on three benchmark datasets:
– PRID 2011: 1,134 ‘tracklets’ gathered from two cameras, containing 200 people across both cameras.
– iLIDS-VID: 600 tracklets of 300 people.
– MARS: 20,478 tracklets of 1,261 people captured from a camera network with 6 near-synchronized cameras.
  Testing: The researchers find that their DAL technique, when paired with a ResNet50 backbone, obtains state-of-the-art accuracy on the PRID 2011 and iLIDS-VID datasets, and second-to-SOTA on MARS. DAL systems with a MobileNet backbone obtain second-to-SOTA accuracy on PRID 2011 and iLIDS-VID, and SOTA on MARS. The closest other technique in terms of performance is the Stepwise technique, which is somewhat competitive on PRID 2011.
  Why it matters: Systems like this are the essential inputs to a digital surveillance state; it would have been nice to see some acknowledgement of this obvious application within the research paper. Additionally, as technology like this is developed and propagated it’s likely we’ll see numerous creative uses of the technology, as well as vigorous adoption by companies in industries like advertising and marketing.
  Read more: Deep Association Learning for Unsupervised Video Person Re-identification (Arxiv).

OpenAI Bits & Pieces:

OpenAI plays competitive pro-level Dota matches at The International; loses twice:

OpenAI Five lost two games against top Dota 2 players at The International in Vancouver this week, maintaining a good chance of winning for the first 20-35 minutes of both games. In contrast to our Benchmark 17 days ago, these games were played against significantly better human players, used hero lineups provided by a third party rather than by Five drafting against humans, and removed our last major restriction from what most pros consider “Real Dota” gameplay.
  We’ll continue to work on this and will have more to share in the future.
  Read more: The International 2018: Results (OpenAI Blog).

Maybe the reason why today’s AI algorithms are bad is because they aren’t curious enough:
…Of imagination and broken televisions…
New research from OpenAI, the University of California at Berkeley, and the University of Edinburgh shows how the application of curiosity to AI agents can lead to the manifestation of surprisingly advanced behaviors. In a series of experiments we show that agents which use curiosity can learn to outperform random-agent baselines on a majority of games in the Atari corpus, and that such systems display good performance in other areas as well. But this capability comes at a cost: curious agents can be tricked, for instance by putting them in a room with a television that shows different patterns of static on different channels. To a curious agent, this type of television static represents variety, and variety is good when you’re optimizing for curiosity, so agents can become trapped, unable to tear themselves away from the allure of the television static.
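  The noisy-TV failure follows from how curiosity is usually scored: the agent is rewarded by the prediction error of a learned forward model. Here is a toy sketch of that failure. It assumes a linear forward model over hypothetical 2-D features and actions, whereas real agents use neural networks over learned or random features. It shows the reward for predictable transitions collapsing while the reward for pure static stays high.

```python
import numpy as np


class ForwardModel:
    """Linear forward model predicting next features from (features, action); its error is the curiosity reward."""

    def __init__(self, feat_dim: int, action_dim: int, lr: float = 0.05, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w = rng.normal(scale=0.1, size=(feat_dim + action_dim, feat_dim))
        self.lr = lr

    def intrinsic_reward(self, phi: np.ndarray, action: np.ndarray, phi_next: np.ndarray) -> float:
        """Return the squared prediction error, then take one gradient step on it."""
        x = np.concatenate([phi, action])
        error = x @ self.w - phi_next
        self.w -= self.lr * np.outer(x, error)  # gradient of 0.5 * ||error||^2 w.r.t. w
        return float(error @ error)


def run_demo(steps: int = 2000, seed: int = 0):
    """Return mean curiosity reward over the last 200 steps for a predictable world and a 'noisy TV'."""
    rng = np.random.default_rng(seed)
    model = ForwardModel(feat_dim=2, action_dim=2, seed=seed)
    predictable = []
    for _ in range(steps):
        phi, action = rng.normal(size=2), rng.normal(size=2)
        # Predictable world: the next state is a fixed function of state and action.
        predictable.append(model.intrinsic_reward(phi, action, phi + 0.5 * action))
    tv_model = ForwardModel(feat_dim=2, action_dim=2, seed=seed)
    static = []
    for _ in range(steps):
        phi, action = rng.normal(size=2), rng.normal(size=2)
        # Noisy TV: the next observation is fresh static, unrelated to state or action.
        static.append(tv_model.intrinsic_reward(phi, action, rng.normal(size=2)))
    return float(np.mean(predictable[-200:])), float(np.mean(static[-200:]))
```

In this sketch the first number returned by run_demo() ends up near zero, because the model learns the predictable dynamics, while the second stays well above zero, since no model can predict static. An agent maximizing this reward would therefore keep returning to the television.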
  Read more: Large-Scale Study of Curiosity-Driven Learning (Arxiv).
  Read more: Give AI curiosity, and it will watch TV forever (Quartz).

Tech Tales:

Art Show

I mean, is it art?
It must be. They’re bidding on it.
But what is it?
A couple of petabytes of data.
Well, we assume it’s data. We’re not sure exactly what it is. We can’t find any patterns in it. But we think they can.
Well, they’re bidding on it. The machines don’t tend to exchange much stuff with each other. For some reason they think this is valuable. None of our data-integrity protocols have triggered any alarms, so it seems benign.
Where did it come from?
We know some of this. Half of it is a quasar burst that happened a while ago. Some of it is from a couple of atomic clocks. A few megabytes come from some readings from a degrading [REDACTED]. That’s just what they’ve told us. They’ve kind of stitched this all together.
Explains the name, I guess.
Yeah: Category: Tapestry. I’d almost think they were playing a joke on us – maybe that’s the art!

Things that inspired this story:
Oakland glitch video art shows, patterns that emerge out of static, untuned televisions, radio plays.