Import AI 114: Synthetic images take a big leap forward with BigGANs; US lawmakers call for national AI strategy; researchers probe language reasoning via HotpotQA
by Jack Clark
Getting hip to multi-hop reasoning with HotpotQA:
…New dataset and benchmark designed to test common sense reasoning capabilities…
Researchers with Carnegie Mellon University, Stanford University, the Montreal Institute for Learning Algorithms, and Google AI have created a new dataset and associated competition designed to test the capabilities of question answering systems. The new dataset, HotpotQA, is far larger than many prior datasets designed for such tasks, and has been designed to require ‘multi-hop’ reasoning, thereby testing the growing sophistication of newer NLP systems at performing increasingly complex cognitive tasks.
HotpotQA consists of around 113,000 Wikipedia-based question-answer pairs. Answering these questions correctly is designed to test for ‘multi-hop’ reasoning – the ability of systems to look at multiple documents and perform basic iterative problem-solving to come up with correct answers. These questions were “collected by crowdsourcing based on Wikipedia articles, where crowd workers are shown multiple supporting context documents and asked explicitly to come up with questions requiring reasoning about all of the documents”. These workers also provide the supporting facts they use to answer these questions, providing a strong supervised training set.
It’s the data, stupid: To develop HotpotQA the researchers needed to create a kind of multi-hop pipeline of their own to figure out which documents to give crowd workers to compose questions about. To do this, they mapped the Wikipedia hyperlink graph and used this information to build a directed graph, then tried to detect correspondences between the pairs of linked articles. They also created a hand-made list of categories so that things of similar types could be compared (eg, basketball players, etc).
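To make the pipeline idea a bit more concrete, here is a minimal sketch of turning a hyperlink map into directed edges and candidate document pairs. Everything in it is an assumption for illustration: the toy `HYPERLINKS` data, the function names, and the pairing rules are hypothetical and are not taken from the HotpotQA paper or codebase.

```python
# Toy stand-in for Wikipedia: article title -> titles it links to.
# These entries are made up for illustration, not taken from the dataset.
HYPERLINKS = {
    "Article A": ["Article B", "Article C"],
    "Article B": ["Article C"],
    "Article C": [],
}


def build_directed_graph(hyperlinks):
    """Return directed edges (source, target), one per known hyperlink."""
    edges = []
    for source, targets in hyperlinks.items():
        for target in targets:
            # Skip self-links and links to articles outside the toy corpus.
            if target in hyperlinks and target != source:
                edges.append((source, target))
    return edges


def candidate_pairs(edges):
    """Treat each edge as a candidate pair of documents for a crowd worker."""
    return sorted(set(edges))


def comparison_pairs(categories):
    """Pair up articles that share a hand-made category."""
    pairs = []
    for members in categories.values():
        ordered = sorted(members)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.append((first, second))
    return pairs


if __name__ == "__main__":
    print(candidate_pairs(build_directed_graph(HYPERLINKS)))
    print(comparison_pairs({"basketball players": ["Player Y", "Player X"]}))
```

The first kind of pair follows a link from one article to another; the second compares two articles from the same category, in the spirit of the hand-made category list described above.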
Testing: HotpotQA can be used to test models’ capabilities in different ways, ranging from information retrieval to question answering. The researchers train a system to provide a baseline, and the results show that this (relatively strong) baseline obtains performance significantly below that of a competent human across all tasks (with the exception of certain ‘supporting fact’ evaluations, in which it obtains performance on par with an average human).
Why it matters: Natural language processing research is currently going through what some have called an ‘ImageNet moment’ following recent algorithmic developments relating to the usage of memory and attention-based systems, which have demonstrated significantly higher performance across a range of reasoning tasks compared to prior techniques, while also being typically much simpler. As with ImageNet and the associated supervised classification systems, these new types of NLP approaches require larger datasets to be trained on and evaluated against. And as with ImageNet, it’s likely that scaling up techniques to take on challenges defined by datasets like HotpotQA will further increase progress in this domain.
Caveat: As with all datasets with an associated competitive leaderboard, it is feasible that HotpotQA could be relatively easy and systems could end up exceeding human performance against it in a relatively short amount of time – this happened over the past year with the Stanford SQuAD dataset. Hopefully the relatively higher sophistication of HotpotQA will protect against this.
Read more: HotpotQA website with leaderboard and data (HotpotQA Github).
Read more: HOTPOTQA: A Dataset for Diverse, Explainable Multi-hop Question Answering (Arxiv).
Administrative note regarding ICLR papers:
This week was the deadline for submissions for the International Conference on Learning Representations. These papers are published under a blind review process as they are currently under review. This year, there were 1600 submissions to ICLR, up from 1000 in 2017, 500 in 2016, and 250 in 2015. I’ll be going through some of these papers in this issue and others and will try to avoid making predictions about which organizations are behind which papers so as to respect the blind review process.
Computers can now generate (some) fake images that are indistinguishable from real ones:
…BigGANs show significant progression in capabilities in synthetic imagery…
The researchers train GAN models with 2-4X the parameters and 8X the batch size compared to prior papers, and also introduce techniques to improve the stability of GAN training.
Some of the implemented techniques mean that samples generated by such GAN models can be tuned, allowing for “explicit, fine-grained control of the trade-off between sample variety and fidelity”. What this means in practice is that you can ‘tune’ how similar the types of generated images are to specific sets of images within the dataset, so for instance if you wanted to generate an image of a field containing a pond you might pick a few images to prioritize in training that contain ponds, whereas if you wanted to also tune the generated size of the pond you might pick images containing ponds of various sizes. The addition of this kind of semantic dial seems useful to me, particularly for using such systems to generate faked images with specific constraints on what they depict.
Image quality: Images generated via these GANs are of far higher quality than those from prior systems, and can be outputted at relatively large resolutions of 512×512 pixels. I encourage you to take a look at the paper and judge for yourself, but it’s evident from the (cherry-picked) samples that given sufficient patience a determined person can now generate photoreal faked images as long as they have a precise enough set of data to train on.
Problems remain: There are still some drawbacks to the approach; GANs are notorious for their instability during training, and developers of such systems need to develop increasingly sophisticated approaches to deal with the instabilities in training that manifest at increasingly larger scales, leading to a certain time-investment tradeoff inherent to the scale-up process. The researchers do devise some tricks to deal with this, but they’re quite elaborate. “We demonstrate that a combination of novel and existing techniques can reduce these instabilities, but complete training stability can only be achieved at a dramatic cost to performance,” they write.
Why it matters: One of the most interesting aspects of the paper is how simple the approach is: take today’s techniques, try to scale them up, and conduct some targeted research into dealing with some of the rough edges of the problem space. This seems analogous to recent work on scaling up algorithms in RL, where both DeepMind and OpenAI have developed increasingly large-scale training methodologies paired with simple scaled-up algorithms (eg DQN, PPO, A2C, etc).
“We find that current GAN techniques are sufficient to enable scaling to large models and distributed, large-batch training. We find that we can dramatically improve the state of the art and train models up to 512×512 resolution without need for explicit multiscale methods,” the researchers write.
Read more: Large Scale GAN Training For High Fidelity Natural Image Synthesis (ICLR 2018 submissions, OpenReview).
Check out the samples: Memo Akten has pulled together a bunch of interesting and/or weird samples from the model here, which are worth checking out (Memo Akten, Twitter).
Want better RL performance? Try remembering what you’ve been doing recently:
…Recurrent Replay Distributed DQN (R2D2) obtains state-of-the-art on Atari & DMLab by a wide margin…
R2D2 is based on a tweaked version of Ape-X, a large-scale reinforcement learning system developed by DeepMind which displays good performance and sample efficiency when trained at scale. Ape-X uses prioritized distributed replay, using a single learner to learn from the experience of numerous distinct actors (typically 256).
New tricks for old algos: The researchers implement two relatively simple strategies to help them train the R2D2 algorithm to be smarter about how it uses its memory to learn more complex problem-solving strategies. These tweaks are to store the recurrent state in the replay buffer and use it to initialize the network at training time, and “allow the network a ‘burn-in period’ by using a portion of the replay sequence only for unrolling the network and producing a start state, and update the network only on the remaining part of the sequence.”
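A minimal sketch of the two tweaks, under heavy assumptions: the one-unit tanh cell below is a hypothetical stand-in for the agent's recurrent network, the function names and weights are made up, and no loss or update is computed. It only shows how a replayed sequence is split so that the stored state plus a burn-in prefix produce the start state, while only the remaining steps would feed the update.

```python
import math


def rnn_step(hidden, observation, weight_h=0.5, weight_x=1.0):
    """A one-unit tanh recurrent cell; a toy stand-in for the agent's network."""
    return math.tanh(weight_h * hidden + weight_x * observation)


def unroll_with_burn_in(stored_state, observations, burn_in):
    """Split a replayed sequence into a burn-in part and a training part.

    stored_state: recurrent state saved in the replay buffer when the
        sequence was recorded (used instead of a zero start state).
    Returns the start state produced by burn-in, plus the hidden states
    for the steps that would contribute to the update.
    """
    hidden = stored_state
    # Burn-in: unroll only to refresh the recurrent state; no update here.
    for observation in observations[:burn_in]:
        hidden = rnn_step(hidden, observation)
    start_state = hidden

    # Training portion: the steps the update would actually be computed on.
    train_hiddens = []
    for observation in observations[burn_in:]:
        hidden = rnn_step(hidden, observation)
        train_hiddens.append(hidden)
    return start_state, train_hiddens


if __name__ == "__main__":
    start, train = unroll_with_burn_in(0.3, [0.1, -0.2, 0.4, 0.0, 0.5], burn_in=2)
    print(start, len(train))
```

Setting `burn_in=0` reduces this to the stored-state strategy alone, and passing `0.0` as `stored_state` mimics the zero start state the quote contrasts against.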
Results: R2D2 obtains vastly higher scores than any prior system on these tasks and, via large-scale training, can achieve ~1300% human-normalized scores on Atari (a median over 57 games, so it does even better on some, and substantially worse on others). The researchers also test on DMLab-30, a set of 3D environments for training agents which is designed to be more difficult than Atari. Here, the system also displays extremely good performance when compared to prior systems.
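For readers unfamiliar with the metric, here is a small sketch of how a median human-normalized score is typically computed, using the common convention that 0% is random play and 100% is the human reference score. The three games and all raw scores below are hypothetical numbers chosen to make the arithmetic easy; they are not results from the paper.

```python
from statistics import median


def human_normalized(agent, random_score, human):
    """Percent of the human-vs-random gap the agent closes (100 = human level)."""
    return 100.0 * (agent - random_score) / (human - random_score)


# Hypothetical per-game raw scores: (agent, random play, human).
SCORES = {
    "game_a": (2100.0, 100.0, 1100.0),
    "game_b": (50.0, 0.0, 100.0),
    "game_c": (14000.0, 1000.0, 2000.0),
}


def median_human_normalized(scores):
    """Median of the per-game human-normalized scores."""
    return median(human_normalized(*row) for row in scores.values())


if __name__ == "__main__":
    for name, row in SCORES.items():
        print(name, human_normalized(*row))
    print("median:", median_human_normalized(SCORES))
```

Because the headline number is a median, half of the games sit above it and half below, which is why a single figure can hide both far-superhuman and sub-human games.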
It’s all in the memory: The system does well here on some fairly difficult environments, and notably the authors show via some ablation studies that the agent does appear to be using its in-built memory to solve tasks. “We first observe that restricting the agent’s memory gradually decreases its performance, indicating its nontrivial use of memory on both domains. Crucially, while the agent trained with stored state shows higher performance when using the full history, its performance decays much more rapidly than for the agent trained with zero start states. This is evidence that the zero start state strategy, used in past RNN-based agents with replay, limits the agent’s ability to learn to make use of its memory. While this doesn’t necessarily translate into a performance difference (like in MS.PACMAN), it does so whenever the task requires an effective use of memory (like EMSTM WATERMAZE),” they write.
Read more: Recurrent Experience Replay In Distributed Reinforcement Learning (ICLR 2018 submissions, OpenReview).
US lawmakers call for national AI strategy and more funding:
…The United States cannot maintain its global leadership in AI absent political leadership from Congress and the Executive Branch…
Lawmakers from the US’s Subcommittee on Information Technology of the House Committee on Oversight and Government Reform have called for the creation of a national strategy for artificial intelligence led by the current administration, as well as more funding for basic research.
The comments from Chairman Will Hurd and Ranking Member Robin Kelly are the result of a series of three hearings held by that committee in 2018 (Note: I testified at one of them). It’s a short paper and worth reading in full to get a sense of what policymakers are thinking with regard to AI.
Notable quotes: “The United States cannot maintain its global leadership in AI absent political leadership from Congress and the Executive Branch.” + Government should “increase federal spending on research and development to maintain American leadership with respect to AI” + “It is critical the federal government build upon, and increase, its capacity to understand, develop, and manage the risks associated with this technology’s increased use” + “American competitiveness in AI will be critical to ensuring the United States does not lose any decisive cybersecurity advantage to other nationstates”.
China: China looms large in the report as a symbol that “the United States’ leadership in AI is no longer guaranteed”. One analysis contained within the paper says China is likely “to pass the United States in R&D investments” by the end of 2018 – significant, considering that the US’s annual outlay of approximately $500 billion makes it the biggest spender on the planet.
Measurement: The report suggests that “at minimum” the government should develop “a widely agreed upon standard for measuring the safety and security of AI products and applications” and notes the existence of initiatives like The AI Index as good starts.
Money: “There is a need for increased funding for R&D at agencies like the National Science Foundation, National Institutes of Health, Defense Advanced Research Project Agency, Intelligence Advanced Research Project Agency, National Institute of Standards and Technology, Department of Homeland Security, and National Aeronautics and Space Administration. As such, the Subcommittee recommends the federal government provide for a steady increase in federal R&D spending. An additional benefit of increased funding is being able to support more graduate students, which could serve to expand the future workforce in AI.”
Leadership: “There is also a pressing need for conscious, direct, and spirited leadership from the Trump Administration. The 2016 reports put out by the Obama Administration’s National Science and Technology Council and the recent actions of the Trump Administration are steps in the right direction. However, given the actions taken by other countries—especially China— Congress and the Administration will need to increase the time, attention, and level of resources the federal government devotes to AI research and development, as well as push for agencies to further build their capacities for adapting to advanced technologies.”
Read more: Rise of the Machines: Artificial Intelligence and its Growing Impact on US Policy (Homeland Security Digital Library).
AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: firstname.lastname@example.org…
Open Philanthropy Project opens applications for AI Fellows:
The Open Philanthropy Project, the grant-making foundation funded by Cari Tuna and Dustin Moskovitz, is accepting applications for its 2019 AI Fellows Program. The program will provide full PhD funding for AI/ML researchers focused on the long-term impacts of advanced AI systems. The first cohort of AI Fellows were announced in June of this year.
Key details: “Support will include a $40,000 per year stipend, payment of tuition and fees, and an additional $10,000 in annual support for travel, equipment, and other research expenses. Fellows will be funded from Fall 2019 through the end of the 5th year of their PhD, with the possibility of renewal for subsequent years. We do encourage applications from 5th-year students, who will be supported on a year-by-year basis.”
Read more: Open Philanthropy Project AI Fellows Program (Open Phil).
Read more: Announcing the 2018 AI Fellows (Open Phil).
Google confirms Project Dragonfly in Senate:
Google have confirmed the existence of Project Dragonfly, an initiative to build a censored search engine within China, as part of Google’s broad overture towards the world’s second largest economy. Google’s chief privacy officer declined to give any details of the project, and denied the company was close to launching a search engine in the country. A former senior research scientist, who publicly resigned over Dragonfly earlier this month, had written to Senators ahead of the hearings, outlining his concerns with the plans.
Why it matters: Google is increasingly fighting a battle on two fronts with regards to Dragonfly, with critics concerned about the company’s complicity in censorship and human rights abuses, and others suspicious of Google’s willingness to cooperate with the Chinese government so soon after pulling out of a US defense project (Maven).
Read more: Google confirms Dragonfly in Senate hearing (VentureBeat).
Read more: Former Google scientist slams ‘unethical’ Chinese search project in letter to senators (The Verge).
DeepMind releases framework for AI safety research:
…AI company also launches new AI safety blog…
DeepMind’s safety team have launched their new blog with a research agenda for technical AI safety research. They divide the field into three areas: specification, robustness, and assurance.
Specification research is aimed at ensuring an AI system’s behavior aligns with the intentions of its operator. This includes research into how AI systems can infer human preferences, and how to avoid problems of reward hacking and wire-heading.
Robustness research is aimed at ensuring a system is robust to changes in its environment. This includes designing systems that can safely explore new environments and withstand adversarial inputs.
Assurance research is aimed at ensuring we can understand and control AI systems during operation. This includes research into the interpretability of algorithms, and the design of systems that can be safely interrupted (e.g. off-switches for advanced AI systems).
Why it matters: This is a useful taxonomy of research directions that will hopefully contribute to a better understanding of problems in AI safety within the AI/ML community. DeepMind has been an important advocate for safety research since its inception. It is important to remember that AI safety is still dwarfed by AI capabilities research by several orders of magnitude, in terms of both funding and number of researchers.
Read more: Building Safe Artificial Intelligence (DeepMind via Medium).
OpenAI Bits & Pieces:
OpenAI takes on Dota 2: Short Vice documentary:
As part of our Dota project we experimented with new forms of comms, including having a doc crew from Vice film us in the run-up to our competition at The International.
Check out the documentary here: This Robot is Beating the World’s Best Video Gamers (Vice).
They call the new drones shepherds. We call them prison guards. The truth is somewhere in-between.
You can do the math yourself. Take a population. Get the birth rate. Project over time. That’s the calculus the politicians did that led to them funding what they called the ‘Freedom Research Initiative to Eliminate Negativity with Drones’ (FRIEND).
FRIEND provided scientists with a gigantic bucket of money to fund research into creating more adaptable drones that could, as one grant document stated, ‘interface in a reassuring manner with ageing citizens’. The first FRIEND drones were like pet parrots, and they were deployed into old people’s homes in the hundreds of thousands. Suddenly, when you went for a walk outside, you were accompanied by a personal FRIEND-Shepherd which would quiz you about the things around you to stave off age-based neurological decline. And when you had your meals there was now a drone hovering above you, scanning your plate, and cheerily exclaiming “that’s enough calories for today!” when it had judged you’d eaten enough.
Of course we did not have to do what the FRIEND-Shepherds told us to do. But many people did and for those of us who had distaste for the drones, peer pressure did the rest. I tell myself that I am merely pretending to do what my FRIEND-Shepherd says, as it takes me on my daily walk and suggests the addition or removal of specific ingredients from my daily salad to ‘maintain optimum productivity via effective meal balancing’.
Anyway, as the FRIEND program continued the new Shepherds became more and more advanced. But people kept on getting older and birth rates kept on falling; the government couldn’t afford to buy more drones to keep up with the growing masses of old people, so it directed FRIEND resources towards increasing the autonomy and, later, ‘persuasiveness’ of such systems.
Over the course of a decade the drones went from parrots to pop psychologists with a penchant for nudge economics. Now, we’re still not “forced” to do anything by the Shepherds, but the Shepherds are very intelligent and much of what they spend their time doing is finding out what makes us tick so they can encourage us to do the thing that extends lifespan while preserving quality of life.
The Shepherd assigned to me and my friends has figured out that I don’t like Shepherds. It has started to learn to insult me, so that I chase it. Sometimes it makes me so angry that I run around the home, trying to knock it out of the air with my walking stick. “Well done,” it will say after I am out of breath. “Five miles, not bad for a useless human.” Sometimes I will then run at it again, and I believe I truly am running at it because I hate it and not because it wants me to. But do I care about the difference? I’m not sure anymore.
Things that inspired this story: Drones, elderly care robots, the cruel and inescapable effects of declining fertility in developed economies, JG Ballard, Wall-E, social networks, emotion-based AI analysis systems, NLP engines, fleet learning with individual fine-tuning.