Import AI 112: 1 million free furniture models for AI research, measuring neural net drawbacks via studying hallucinations, and DeepMind boosts transfer learning with PopArt

by Jack Clark

When is a door not a door? When a computer says it is a jar!
Researchers analyze neural network “hallucinations” to create more robust systems…
Researchers with the University of California at Berkeley and Boston University have devised a new way to measure how neural networks sometimes generate ‘hallucinations’ when attempting to caption images. As the researchers put it, “image captioning models often ‘hallucinate’ objects that may appear in a given context”. Developing a better understanding of why such hallucinations occur – and how to prevent them from occurring – is crucial to the development of more robust and widely used AI systems.
  Measuring hallucinations: The researchers propose ‘CHAIR’ (Caption Hallucination Assessment with Image Relevance) as a way to assess how well systems generate captions in response to images. CHAIR calculates what proportion of generated words correspond to the contents of an image, according to the ground truth sentences and the output of object segmentation and labelling algorithms. So, for example, for a picture of a small puppy in a basket, a system would score worse for the caption “a small puppy in a basket with cats” than for “a small puppy in a basket”. In evaluations they find that, on one test set, “anywhere between 7.4% and 17.5%” of generated captions “include a hallucinated object”.
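  To make the metric concrete, here is a minimal Python sketch of the instance-level idea behind CHAIR, assuming you have already extracted the objects mentioned in a caption and the set of objects actually present in the image (the paper derives both from MSCOCO annotations plus synonym mappings; the function name and toy example below are my own):

```python
def chair_instance_rate(mentioned_objects, image_objects):
    """Fraction of mentioned objects that do not actually appear in the image.

    mentioned_objects: list of object words extracted from a generated caption
    image_objects: set of ground-truth objects present in the image
    """
    if not mentioned_objects:
        return 0.0
    hallucinated = [obj for obj in mentioned_objects if obj not in image_objects]
    return len(hallucinated) / len(mentioned_objects)

# Example: a caption that hallucinates a cat in a photo of a puppy in a basket.
print(chair_instance_rate(["puppy", "basket", "cat"], {"puppy", "basket"}))  # ~0.33
```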
  Strange correlations: Analyzing what causes these hallucinations is difficult. For instance, the researchers note that “we find no obvious correlation between the average length of the generated captions and the hallucination rate”. There are clearer patterns in which objects get hallucinated, though. “Across all models the super-category Furniture is hallucinated most often, accounting for 20-50% of all hallucinated objects. Other common super-categories are Outdoor objects, Sports and Kitchenware,” they write. “The dining table is the most frequently hallucinated object across all models”.
  Why it matters: If we are going to deploy lots of neural network-based systems into society then it is crucial that we understand the weaknesses and pathologies of such systems; analyses like this give us a clearer notion of the limits of today’s technology and also indicate lines of research people could pursue to increase the robustness of such systems. “We argue that the design and training of captioning models should be guided not only by cross-entropy loss or standard sentence metrics, but also by image relevance,” the researchers write.
  Read more: Object Hallucination in Image Captioning (Arxiv).

Humans! What are they good for? Absolutely… something!?
…Advanced cognitive skills? Good. Psycho-motor skills? You may want to retrain…
Michael Osborne, co-director of the Oxford Martin Programme on Technology and Employment, has given a presentation about the Future of Work. Osborne attained some level of notoriety within ML a while ago for publishing a study that said 47% of jobs could be at risk of automation. Since then he has been further fleshing out his ideas; a new presentation sees him analyze some typical occupations in the UK and try to estimate the probability of increased future demand for each role. The findings aren’t encouraging: Osborne’s method predicts a low probability of increased demand for truck drivers in the UK, but a much higher one for waiters and waitresses.
  What skills should you learn: If you want to fare well in an AI-first economy, then you should invest in advanced cognitive skills such as: judgement and decision making, systems evaluation, deductive reasoning, and so on. The sorts of skills that will be of less importance over time (for humans, at least) are ‘psycho-motor’ skills: control precision, manual dexterity, night vision, sound localization, and so on. (A meta-problem here is that many of the people in jobs that demand psycho-motor skills don’t get the opportunity to develop the advanced cognitive skills that it is thought the future economy will demand.)
  Why it matters: Analyzing how AI will and won’t change employment is crucial work whose findings will determine the policy of many governments. The problem being surfaced by researchers such as Osborne is that the rapidity of AI’s progress, combined with its tendency to automate an increasingly broad range of tasks, threatens traditional notions of employment. What kind of future do we want?
  Read more: Technology at Work: The Future of Automation (Google Slide presentation).

What’s cooler than 1,000 furniture models? 1 million ones. And more, in InteriorNet:
…Massive new dataset gives researchers new benchmark to test systems against…
Researchers with Imperial College London and Chinese furnishing-VR startup Kujiale have released InteriorNet, a large-scale dataset of photo-realistic renderings of complex interiors. InteriorNet contains around 1 million CAD models of different types of furniture and furnishings, which over 1,100 professional designers have subsequently used to create around 22 million room layouts. Each of these scenes can also be viewed under a variety of different lighting conditions and contexts thanks to an inbuilt simulator called ViSim, which ships with the dataset and has also been released by the researchers. Purely based on the furniture contents this is one of the largest datasets I am aware of for 3D scene composition and understanding.
  Things that make you go ‘hmm’: In the acknowledgements section of the InteriorNet website the researchers thank Kujiale not only for providing them with the furniture models but also for access to “GPU/CPU clusters”. Could this be a pattern for future public-private collaborations, with the private sector sharing compute resources alongside expertise and money? That would make sense given the ballooning computational demands of many new AI techniques.
  Read more: InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset (website).

Lockheed Martin launches ‘AlphaPilot’ competition:
…Want better drones but not sure exactly what to build? Host a competition!…
Aerospace and defense company Lockheed Martin wants to create smarter drones, so the company is hosting a competition, in collaboration with the Drone Racing League and NVIDIA, to create drones with enough intelligence to race through professional drone racing courses without human intervention.
  Prizes: Lockheed says the competition will “award more than $2,000,000 in prizes for its top performers”.
  Why it matters: Drones are already changing the character of warfare by virtue of their asymmetry: a fleet of drones, each costing a few thousand dollars apiece, can pose a robust threat to things that cost tens of millions (planes), hundreds of millions (naval ships, military bases), or billions of dollars (aircraft carriers, etc). Once we add greater autonomy to such systems they will pose an even greater threat, further influencing how different nations budget for their military R&D, and potentially altering investment into AI research.
  Read more: AlphaPilot (Lockheed Martin official website).

Could Space Fortress be 2018’s Montezuma’s Revenge?
…Another ancient game gets resuscitated to challenge contemporary AI algorithms…
Another week brings another potential benchmark to test AI algorithms’ performance against. This week, researchers with Carnegie Mellon University have made the case for using a late-1980s game called ‘Space Fortress’ to evaluate new algorithms. Their motivation for this is twofold: 1) Space Fortress is currently unsolved via mainstream RL algorithms such as Rainbow, PPO, and A2C, and 2) Space Fortress was developed by a psychologist to study human skill acquisition, so we have good data to compare AI performance to.
  So, what is Space Fortress: Space Fortress is a game where a player flies around an arena shooting missiles at a fortress in the center. However, the game adds some confounding factors: the fortress starts out invulnerable, and the player must land ten shots spaced more than 250ms apart to build up its vulnerability; once vulnerability reaches ten, the strategy reverses, and the fortress can only be destroyed with a double shot – two shots fired within 250ms of each other. This makes for a challenging environment for traditional AI algorithms because “the firing strategy completely reverses at the point when vulnerability reaches 10, and the agent must learn to identify this critical point to perform well,” they explain.
  Two variants: While developing their benchmarks the researchers created a simplified version of the game, called ‘Autoturn’, which automatically orients the ship towards the fortress. The harder environment (the unmodified original game) is referred to as Youturn.
  Send in the humans: 117 people played 20 games of Space Fortress (52 on Autoturn, 65 on Youturn). The best-performing people got scores of 3,000 and 2,314 on Autoturn and Youturn, respectively, and the average score across all human entrants was 1,810 for Autoturn and -169 for Youturn.
  Send in the (broken) RL algorithms: sparse rewards: Today’s RL algorithms fare very poorly against this system when working on a sparse reward version of the environment. PPO, the best performing tested algorithm, gets an average score of -250 on Autoturn and -5269 on Youturn, with A2C performing marginally worse. Rainbow, a complex algorithm that lumps together a range of improvements to the DQN algorithm and currently gets high scores across Atari and DM Lab environments, gets very poor results here, with an average score of -8327 on Autoturn and -9378 on Youturn.
  Send in the (broken) RL algorithms: dense rewards: The algorithms fare a little better when given dense rewards (which provide a reward for each hit on the fortress, and a penalty if the fortress’s vulnerability is reset because the player fired too rapidly). This modification gives Space Fortress a reward density comparable to Atari games. Once implemented, the algorithms do better, with PPO obtaining average scores of -1294 (Autoturn) and -1435 (Youturn).
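  For illustration only, here is how that kind of dense shaping might look as a Gym wrapper; the environment id, info keys, and reward magnitudes below are hypothetical sketches, not taken from the authors’ released code:

```python
import gym

class DenseFortressReward(gym.Wrapper):
    """Hypothetical dense-reward shaping in the spirit of the paper: a bonus
    for each hit on the fortress and a penalty when the vulnerability counter
    is reset because the agent fired too rapidly."""

    def __init__(self, env, hit_bonus=1.0, reset_penalty=1.0):
        super().__init__(env)
        self.hit_bonus = hit_bonus
        self.reset_penalty = reset_penalty

    def step(self, action):
        # Classic Gym step API (obs, reward, done, info); the info keys used
        # here are invented for illustration, not part of any published interface.
        obs, reward, done, info = self.env.step(action)
        if info.get("fortress_hit", False):
            reward += self.hit_bonus
        if info.get("vulnerability_reset", False):
            reward -= self.reset_penalty
        return obs, reward, done, info

# Usage sketch (environment id is hypothetical):
# env = DenseFortressReward(gym.make("SpaceFortress-Youturn-v0"))
```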
  Send in the (broken) RL algorithms: dense rewards + ‘context identification’: The researchers further change the dense reward structure to help the agent identify when the Space Fortress switches vulnerability state, and when it is destroyed. Implementing this lets them train PPO to obtain average scores of around 2,000; a substantial improvement, but still not as good as a decent human.
  Why it matters: One of the slightly strange things about contemporary AI research is how coupled advances are to data and/or environments: new datasets and environments highlight the weaknesses of existing algorithms, which provokes further development. Platforms like Space Fortress give researchers a low-cost testing environment in which to explore algorithms that can model events over time and detect correlations and larger patterns – an area critical to the development of more capable AI systems. The researchers have released Space Fortress as an OpenAI Gym environment, making it easier for other people to work with it.
  Read more: Challenges of Context and Time in Reinforcement Learning: Introducing Space Fortress as a Benchmark (Arxiv).

Venture Capitalists bet on simulators for self-driving cars:
…Applied Intuition builds simulators for self-driving brains….
Applied Intuition, a company trying to build simulators for self-driving cars, has uncloaked with $11.5 million in funding. The fact venture capitalists are betting on it is notable: it indicates how strategic data has become for certain bits of AI, and how investors are realizing that, instead of betting on data directly, you can bet on simulators and thus trade compute for data. Applied Intuition is a good example of this as it lets companies rent an extensible simulator which they can use to generate large amounts of data to train self-driving cars with.
  Read more: Applied Intuition – Advanced simulation software for autonomous vehicles (Medium).

DeepMind improves transfer learning with PopArt:
…Rescaling rewards lets you learn interesting behaviors and preserves meaningful game state information…
DeepMind researchers have developed a technique to improve transfer learning, demonstrating state-of-the-art performance on Atari. The technique, Preserving Outputs Precisely while Adaptively Rescaling Targets (PopArt), works by ensuring that the rewards produced by different environments are normalized relative to each other, so with PopArt an agent would get a similar learning signal for, say, crossing the road in the game ‘Frogger’ or eating all the ghosts in Ms. Pac-Man, despite these important activities yielding differently scaled rewards in each environment.
  With PopArt, researchers can now automatically “adapt the scale of scores in each game so the agent judges the games to be of equal learning value, no matter the scale of rewards available in each specific game,” DeepMind writes. This differs from reward clipping, where people typically squash the rewards down to between -1 and +1. “With clipped rewards, there is no apparent difference for the agent between eating a pellet or eating a ghost and results in agents that only eat pellets, and never bothers to chase ghosts, as this video shows. When we remove reward clipping and use PopArt’s adaptive normalisation to stabilise learning, it results in quite different behaviour, with the agent chasing ghosts, and achieving a higher score, as shown in this video,” they explain.
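  For a sense of the mechanics, here is a minimal numpy sketch of the core PopArt update described in the paper: value targets are normalised with running statistics, and the last linear layer is rescaled whenever those statistics change so its unnormalised outputs are preserved. This is an illustrative simplification (class and variable names are my own), not DeepMind’s implementation:

```python
import numpy as np

class PopArtHead:
    """Linear value head whose unnormalised outputs are preserved when the
    target-normalisation statistics (mu, sigma) are updated."""

    def __init__(self, n_features, beta=3e-4):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.mu, self.nu = 0.0, 1.0  # running first and second moments of targets
        self.beta = beta

    @property
    def sigma(self):
        return np.sqrt(max(self.nu - self.mu ** 2, 1e-6))

    def unnormalised_value(self, features):
        # sigma * normalised_output + mu recovers the value in the game's own scale.
        return self.sigma * (features @ self.w + self.b) + self.mu

    def normalise_targets(self, targets):
        return (np.asarray(targets, dtype=np.float64) - self.mu) / self.sigma

    def update_stats(self, targets):
        targets = np.asarray(targets, dtype=np.float64)
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * targets.mean()
        self.nu = (1 - self.beta) * self.nu + self.beta * (targets ** 2).mean()
        # Preserve Outputs Precisely: rescale the layer so its unnormalised
        # predictions are unchanged by the new statistics.
        self.w *= old_sigma / self.sigma
        self.b = (old_sigma * self.b + old_mu - self.mu) / self.sigma
```

  In training, the agent would call update_stats on each batch of returns, regress the normalised head output against normalise_targets(returns), and use unnormalised_value when it needs values on the environment’s original scale.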
  Results: To test their approach the researchers evaluate the effect of applying PopArt to ‘IMPALA’ agents, which are among the most popular algorithms currently being used at DeepMind. PopArt-IMPALA systems obtain roughly 101% of human performance as an average across all 57 Atari games, compared to 28.5% for IMPALA on its own. Performance also improves significantly on DeepMind Lab-30, a collection of 30 3D environments based on the Quake 3 engine.
  Why it matters: Reinforcement learning research has benefited from the development of increasingly efficient algorithms and training methods. Techniques like PopArt should benefit research into transfer learning via RL because they give us generic ways to increase the amount of experience agents can accrue across different environments; that in turn will sharpen our understanding of the limits of simple transfer techniques and help researchers identify where new algorithmic approaches are needed.
  Read more: Multi-task Deep Reinforcement Learning with PopArt (Arxiv).
  Read more: Preserving Outputs Precisely while Adaptively Rescaling Targets (DeepMind blog).

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

Resignations over Google’s China plans:
A senior research scientist at Google has publicly resigned in protest at the company’s planned re-entry into China (code-named Dragonfly), and reports that he is one of five to do so. Google is currently developing a search engine compliant with Chinese government censorship, according to numerous reports first sparked by a story in The Intercept.
  The AI principles: The scientist claims the alleged plans violate Google’s AI principles, announced in June, which include a pledge not to design or deploy technologies “whose purpose contravenes widely accepted principles of international law and human rights.” Without knowing more about the plans, it is hard to judge whether they contravene the carefully worded principles. Nonetheless, the relevant question for many will be whether the project violates the standards tech giants should hold themselves to.
  Why it matters: This is the first public test of Google’s AI principles, and could have lasting effects both on how tech giants operate in China, and on how they approach public ethical commitments. The principles were first announced in response to internal protests over Project Maven. If they are seen as having been flouted so soon after their announcement, this could prompt a serious loss of faith in Google’s ethical commitments going forward.
  Read more: Senior Google scientist resigns (The Intercept).
  Read more: AI at Google: Our principles (Official Google blog).

Google announces inclusive image recognition challenge:
Large image datasets, such as ImageNet, have been an important driver of progress in computer vision in recent years. These datasets exhibit biases along multiple dimensions, though, which can easily be inherited by models trained on them. For example, Google’s announcement shows a classifier failing to identify a wedding photo in which the couple are not wearing European wedding attire.
  Addressing geographic bias: Google AI has announced an image recognition challenge to spur progress in addressing these biases. Participants will use a standard dataset (skewed towards images from Europe and North America) to train models that will be evaluated on image sets covering different, unspecified, geographic regions – Google describes this as a geographic “stress test”. This will challenge developers to build inclusive models from skewed datasets. “this competition challenges you to use Open Images, a large, multi-label, publicly-available image classification dataset that is majority-sampled from North America and Europe, to train a model that will be evaluated on images collected from a different set of geographic regions across the globe,” Google says.
  Why it matters: For the benefits of AI to be broadly distributed amongst humanity, it is important that AI systems can be equally well deployed across the world. Racial bias in face recognition has received particular attention recently, given that these technologies are being deployed by law enforcement, raising immediate risks of harm. This project has a wider scope than face recognition, challenging classifiers to identify a diverse range of faces, objects, buildings etc.
  Read more: Introducing the inclusive images competition (Google AI blog).
  Read more: No classification without representation (Google).

DARPA announces $2bn AI investment plan:
DARPA, the US military’s advanced technology agency, has announced ‘AI Next’, a $2bn multi-year investment plan. The project has an ambitious remit: to “explore how machines can acquire human-like communication and reasoning capabilities”, with a goal of developing systems that “function more as colleagues than as tools.”
  Safety as a focus: Alongside its straightforward technical goals, DARPA identifies robustness and addressing adversarial examples as two of five core focuses. This is an important inclusion, signalling DARPA’s commitment to leading on safety as well as capabilities.
  Why it matters: DARPA has historically been one of the most important players in AI development. Despite the US still not having a coordinated national AI strategy, the DoD is such a significant spender in its own right that it is nonetheless beginning to form its own quasi-national AI strategy. The inclusion of research agendas in safety is a positive development. This investment likely represents a material uptick in funding for safety research.
  Read more: AI Next Campaign (DARPA).
  Read more: DARPA announces $2bn campaign to develop next wave of AI technologies (DARPA).

OpenAI Bits & Pieces:

OpenAI Scholars Class of 18: Final Projects:
Find out about the final projects of the first cohort of OpenAI Scholars and apply to attend a demo day in San Francisco to meet the Scholars and hear about their work – all welcome!
  Read more: OpenAI Scholars Class of ’18: Final Projects (OpenAI Blog).

Tech Tales:

All A-OK Down There On The “Best Wishes And Hope You’re Well” Farm

You could hear the group of pensioners before you saw them; first, you’d tilt your head as though tuning into the faint sound of a mosquito, then it would grow louder and you would cast your eyes up and look for beetles in the air, then louder still and you would crane your head back and look at the sky in search of low-flying planes: nothing. Perhaps then you would look to the horizon and make out a part of it alive with movement – with tremors at the limits of your vision. These tremors would resolve over the next few seconds, sharpening into the outlines of a flock of drones and, below them, the old people themselves – sometimes walking, sometimes on Segways, sometimes carried in robotic wheelbarrows if truly infirm.

Like this, the crowd would come towards you. Eventually you could make out the sound of speech through the hum of the drones: “oh very nice”, “yes they came to visit us last year and it was lovely”, “oh he is good you should see him about your back, magic hands!”.

Then they would be upon you, asking for directions, inviting you over for supper, running old hands over the fabric of your clothing and asking you where you got it from, and so on. You would stand and smile and not say much. Some of the old people would hold you longer than the others. Some of them would cry. One of them would say “I miss you”. Another would say “he was such a lovely young man. What a shame.”

Then the sounds would change and the drones would begin to fly somewhere else, and the old people would follow them, and then again they would leave and you would be left: not quite a statue, but not quite alive, just another partially-preserved consciousness attached to a realistic AccompanyMe ‘death body’, kept around to reassure the ones who outlived you, unable to truly die till they die because, according to the ‘ethical senescence’ laws, your threshold consciousness is sufficient to potentially aid with the warding off of Alzheimer’s and other diseases of the aged. Now you think of the old people as reverse vultures: gathering around and devouring the living, and departing at the moment of true death.

Things that inspired this story: Demographic timebombs, intergenerational theft (see: Climate Change, Education, Real Estate), old people that vote and young people that don’t.