Import AI 127: Why language AI advancements may make Google more competitive; COCO image captioning systems don’t live up to the hype, and Amazon sees 3X growth in voice shopping via Alexa

by Jack Clark

Amazon sees 3X growth in voice shopping via Alexa:
…Growth correlates to a deepening data moat for the e-retailer…
Retail colossus Amazon saw a 3X increase in the number of orders placed via its virtual personal assistant Alexa during Christmas 2018, compared to Christmas 2017.
  Why it matters: The more people use Alexa, the more data Amazon will be able to access to further improve the effectiveness of the personal assistant – and as explored in last week’s discussion of Microsoft’s ‘XiaoIce’ chatbot, it’s likely that such data can ultimately be fed back into the training of Alexa to carry out longer, free-flowing conversations, potentially driving usage even higher.
  Read more: Amazon Customers Made This Holiday Season Record-Breaking with More Items Ordered Worldwide Than Ever Before (Press Release).

Step aside COCO, Nocaps is the new image captioning challenge to target:
…Thought image captioning was super-human? New benchmark suggests otherwise…
Researchers with the Georgia Institute of Technology and Facebook AI Research have developed nocaps, “the first rigorous and large-scale benchmark for novel object captioning, containing over 500 novel object classes”. Novel object captioning tests the ability of computers to describe images containing objects not seen in the original image<>caption datasets (like COCO) that object recognition systems have been trained on.
  How Nocaps works: The benchmark consists of a validation set and a test set of 4,500 and 10,600 images respectively, sourced from the ‘Open Images’ object detection dataset, with each image accompanied by 10 reference captions. For the training set, developers can use image-caption pairs from the COCO image captioning training set (which contains 118,000 images across 80 object classes) as well as the Open Images V4 training set, which contains 1.7 million images annotated with bounding boxes for 600 object classes. Successful Nocaps systems will have to learn to use knowledge gained from the large training set to create captions for scenes containing objects for which they lack image<>object sentence pairs in the training set. Out of the 600 objects in Open Images, “500 are never or exceedingly rarely mentioned in COCO captions”.
  Reassuringly difficult: “To the best of our knowledge, nocaps is the only image captioning benchmark in which humans outperform state-of-the-art models in automatic evaluation”, the researchers write. Nocaps is also significantly more diverse than the COCO benchmark, with Nocaps images typically containing more object classes per image, and greater diversity. “Less than 10% of all COCO images contain more than 6 object classes, while such images constitutes almost 22% of nocaps dataset.”
  Data plumbing: One of the secrets of modern AI research is how much work goes into developing datasets or compute infrastructure, relative to work on actual AI algorithms. When creating the data, the Nocaps researchers had to train crowd workers on services like Mechanical Turk to come up with good captions: if they didn’t “prime” the crowd workers with prompts to use when writing the captions, the workers wouldn’t necessarily use the keywords that correlated to the 500 obscure objects in the dataset.
  Baseline results: The researchers test two baseline algorithms (Up-Down and Neural Baby Talk, both with augmentations) against nocaps. They also split the dataset into subsets of varying difficulty – in-domain contains images whose objects also belong to the COCO dataset (so the algorithms can train on image<>caption pairs); near-domain contains images that include some objects which aren’t in COCO; and out-of-domain consists of images that do not contain any object labels from COCO classes. They use two evaluation metrics (CIDEr and SPICE) to measure the performance of these systems, and also score human-written captions to create a human baseline. The results show that nocaps is more challenging than COCO, and systems currently lack generalization properties sufficient to score well on out-of-domain challenges.
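  The three-way split described above can be sketched as a simple set check. This is a minimal illustration rather than the researchers' code: the function name `nocaps_domain` and the three-item class list are hypothetical stand-ins, and it assumes in-domain means every labelled object is a COCO class.

```python
def nocaps_domain(image_labels, coco_classes):
    """Assign an image to a nocaps-style subset based on its object labels."""
    labels = set(image_labels)
    if not labels:
        raise ValueError("image has no object labels")
    seen_in_coco = labels & set(coco_classes)
    if seen_in_coco == labels:
        return "in-domain"       # every object also appears in COCO
    if not seen_in_coco:
        return "out-of-domain"   # no object appears in COCO
    return "near-domain"         # mix of COCO objects and novel objects


# Hypothetical stand-in for the 80 COCO object classes.
COCO_CLASSES = {"dog", "person", "car"}

print(nocaps_domain(["dog", "person"], COCO_CLASSES))
print(nocaps_domain(["dog", "accordion"], COCO_CLASSES))
print(nocaps_domain(["accordion", "jellyfish"], COCO_CLASSES))
```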
  To give you a sense of what performance looks like here, here’s how Up-Down augmented with Constrained Beam Search does, compared to human baselines (evaluation via CIDEr), on the nocaps validation set: In-domain 72.3 (versus 83.3 for humans); near-domain 63.2 (versus 85.5); out-of-domain 41.4 (versus 91.4).
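  To make the widening gap concrete, here is a short calculation over the CIDEr numbers quoted above. The dictionary layout and the helper name `human_gap` are just for illustration.

```python
# CIDEr scores on the nocaps validation set, as quoted above: (model, human).
scores = {
    "in-domain": (72.3, 83.3),
    "near-domain": (63.2, 85.5),
    "out-of-domain": (41.4, 91.4),
}


def human_gap(model, human):
    """Absolute CIDEr gap between human captions and the model, to one decimal."""
    return round(human - model, 1)


gaps = {name: human_gap(m, h) for name, (m, h) in scores.items()}
for name, gap in gaps.items():
    print(f"{name}: humans lead by {gap} CIDEr points")
```

  The gap grows from 11.0 points in-domain to 22.3 near-domain and 50.0 out-of-domain.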
  Why this matters: AI progress can be catalyzed via the invention of better benchmarks which highlight areas where existing algorithms are deficient, and provide motivating tests against which researchers can develop new systems. The takeaway from the baselines study of nocaps is that we’re yet to develop truly robust image captioning systems capable of integrating object representations from open images with captions primed from COCO. “We strongly believe that improvements on this benchmark will accelerate progress towards image captioning in the wild,” the researchers write.
  Read more: nocaps: novel object captioning at scale (Arxiv).
  More information about nocaps can be found on its official website (nocaps).

Google boosts document retrieval performance by 50-100% using BERT language model:
…Enter the fully neural search engine…
Google has shown how to use recent innovations in language modeling to dramatically improve the skill with which AI systems can take in a search query and re-word the question to generate the most relevant answer for a user. This research has significant implications for the online economy, as it shows how yet another piece of traditionally hand-written rule-based software can be replaced with systems where the rules are figured out by machines on their own.
  How it works: Google’s research shows how to convert a search problem into one amenable to a system that implements hierarchical reinforcement learning, where an RL agent controls multiple RL agents that interact with an environment that provides answers and rewards (e.g.: a search engine with user feedback) with the goal “to generate reformulations [of questions] such that the expected returned reward (i.e., correct answers) is maximized”. One of the key parts of this research is splitting it into a hierarchical problem with a meta-agent and numerous sub-agents. The sub-agents are sequence-to-sequence models, each trained on a partition of the dataset, that take in the query and output reformulated queries. These candidate queries are sent to a meta-agent, which aggregates them and is trained via RL to select the best-scoring ones.
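  The sub-agent / meta-agent split can be sketched in a few lines. This is a toy illustration under loud assumptions: the sub-agents here are hand-written string functions standing in for trained sequence-to-sequence models, `toy_score` stands in for the RL-trained aggregator, and the names `select_reformulation` and `RELEVANT_DOC` are hypothetical.

```python
from typing import Callable, List


def select_reformulation(query: str,
                         sub_agents: List[Callable[[str], str]],
                         score: Callable[[str], float]) -> str:
    """Each sub-agent proposes a rewrite; the aggregator keeps the best-scoring one."""
    candidates = [agent(query) for agent in sub_agents]
    return max(candidates, key=score)


# Toy stand-ins for trained sequence-to-sequence sub-agents.
sub_agents = [
    lambda q: q,
    lambda q: q + " author",
    lambda q: " ".join(w for w in q.split() if w not in {"who", "the"}),
]

# Toy stand-in for the learned aggregator: reward word overlap with a "relevant" document.
RELEVANT_DOC = {"hamlet", "author", "shakespeare", "wrote"}


def toy_score(candidate: str) -> float:
    return len(set(candidate.split()) & RELEVANT_DOC)


best = select_reformulation("who wrote hamlet", sub_agents, toy_score)
print(best)
```

  In the real system the reward would come from the search environment rather than a fixed word list.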
  The Surprising Power of BERT: The researchers test their system against question answering baselines – here they show that a stock BERT system “without any modification from its original implementation” gets state-of-the-art scores. (One odd thing: When they augment BERT with their own multi-agent approach they don’t see a further increase in performance, suggesting more research is needed to better suss out the benefits of systems like this.)
  50-100% improvement, with BERT: They also test their system against three document retrieval benchmarks: TREC-CAR, where the query is a Wikipedia article with the title of one of its sections and the answer is a paragraph within that section; Jeopardy, which asks the system to come up with the correct answer in response to a question from the eponymous game show, and MSA, where the query is the title of an academic paper and the answer is the papers cited within the paper. The researchers test various versions of their approach against baselines BM25, PRF, and Relevance Model (RM3), along with two other reinforcement learning-based approaches. All methods evaluated by the researchers outperform these (quite strong) baselines, with the most significant jumps in performance happening when Google pairs either its technique or the RM3 baseline with a ‘BERT’ language model. The researchers use BERT by replacing the meta-aggregator with BERT, a powerful language modeling technology Google developed recently; the researchers feed the query as a sentence and the document text as a second sentence, and use a pre-trained BERT(Large) model to rank the probability of the document being a correct response to the query. The performance increase is remarkable. “By replacing our aggregator with BERT, we improve performance by 50-100% in all three datasets (RL-10-Sub + BERT Aggregator). This is a remarkable improvement given that we used BERT without any modification from its original implementation. Without using our reformulation agents, the performance drops by 3-10% (RM3 + BERT Aggregator).”
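  Feeding “the query as a sentence and the document text as a second sentence” corresponds to BERT's standard sentence-pair input layout. Here is a minimal sketch of that packing step only; whitespace splitting is an assumption standing in for BERT's real WordPiece tokenizer, the helper name `pack_pair` is hypothetical, and no model is actually run.

```python
def pack_pair(query, document):
    """Build BERT-style sentence-pair tokens and segment ids for (query, document)."""
    q_tokens = query.lower().split()      # simplification: real BERT uses WordPiece
    d_tokens = document.lower().split()
    tokens = ["[CLS]"] + q_tokens + ["[SEP]"] + d_tokens + ["[SEP]"]
    # Segment 0 covers [CLS], the query and the first [SEP]; segment 1 covers the rest.
    segment_ids = [0] * (len(q_tokens) + 2) + [1] * (len(d_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_pair("neural search", "BERT ranks documents")
print(tokens)
print(segment_ids)
```

  A ranking system would score each packed pair with the pre-trained model and sort documents by the predicted probability of relevance.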
  Why this matters: This research shows how progress in one domain (language understanding, via BERT) can be directly applied to another adjacent one (document search), highlighting the broad omni-use nature of AI systems. It also gives us a sense of how large technology companies are going to be tempted to swap out more and more of their hand-written systems with fully learned approaches that will depend on training incredibly large-scale models (eg, BERT) which are then used for multiple purposes.
  Read more: Learning to Coordinate Multiple Reinforcement Learning Agents for Diverse Query Reformulation (Arxiv).

Facebook pushes unsupervised machine translation further, learns to translate between 93 languages:
…Facebook’s research into zero-shot language adaptation shows that bigger might really correspond to better…
In recent years the AI research community has shown how to use neural networks to translate from one language into another to great effect (one notable paper is Google’s Neural Machine Translation work from 2016). But this sort of translation has mostly worked for languages where there are large amounts of data available, and where this data includes parallel corpora (for example, translations of the same legal text from one language into another). Now, new research from Facebook has produced a single system that can produce joint multilingual sentence representations for 93 languages, “including under-resourced and minority languages”. What this means is that by training on a whole variety of languages at once, Facebook has created a system that can represent semantically similar sentences in proximity to each other in a feature embedding space, even if they come from very different languages (even extending to different language families).
  How it works: “We use a single encoder and decoder in our system, which are shared by all languages involved. For that purpose, we build a joint byte-pair encoding (BPE) vocabulary with 50k operations, which is learned on the concatenation of all training corpora. This way the encoder has no explicit signal on what the input language is, encouraging it to learn language independent representations. In contrast, the decoder takes a language ID embedding that specifies the language to generate, which is concatenated to the input and sentence embeddings at every time step”. During training they optimize for translating all the languages into two target languages – English and Spanish.
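  The decoder-side detail in the quote (a language ID embedding concatenated to the input and sentence embeddings at every time step) can be sketched with plain lists. All dimensions and values below are made up for illustration, and `decoder_inputs` and `LANG_ID_EMBEDDINGS` are hypothetical names rather than anything from Facebook's code.

```python
# Hypothetical 2-dimensional language ID embeddings for the two target languages.
LANG_ID_EMBEDDINGS = {"en": [1.0, 0.0], "es": [0.0, 1.0]}


def decoder_inputs(token_embeddings, sentence_embedding, target_lang):
    """Concatenate [token ; sentence ; language ID] embeddings at every time step."""
    lang = LANG_ID_EMBEDDINGS[target_lang]
    return [tok + sentence_embedding + lang for tok in token_embeddings]


# Three decoder time steps with toy 2-d token embeddings and a 3-d sentence embedding.
steps = decoder_inputs(
    token_embeddings=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    sentence_embedding=[0.7, 0.8, 0.9],
    target_lang="es",
)
for step in steps:
    print(step)
```

  The encoder, by contrast, receives no language signal at all, which is what pushes it toward language-independent representations.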
  To anthropomorphize this, you can think of it as being similar to a person being raised in a house where the parents speak a poly-glottal language made up of 93 different languages, switching between them randomly, and the person learns to speak coherently in two primary languages with the poly-glottal parents. This kind of general, shared language understanding is considered a key challenge for artificial intelligence, and Facebook’s demonstration of viability here will likely provoke further investigation from others.
  Training details: The researchers train their models using 16 NVIDIA V100 GPUs with a total batch size of 128,000 tokens, with a training run taking around five days on average.
  Training data: “We collect training corpora for 93 input languages by combining the Europarl, United Nations, Open-Subtitles2018, Global Voices, Tanzil and Tatoeba corpus, which are all publicly available on the OPUS website”. The total training data used by the researchers consists of 223 million parallel sentences.
  Evaluation: XNLI: XNLI is a benchmark which evaluates whether a system can correctly judge if two sentences in a language (for example: a premise and a hypothesis) have an entailment, contradiction, or neutral relationship between them. “Our proposed method establishes a new state-of-the-art in zero-shot cross-lingual transfer (i.e. training a classifier on English data and applying it to all other languages) for all languages but Spanish. Our transfer results are strong and homogeneous across all languages”.
  Evaluation: Tatoeba: The researchers also construct a new test set of similarity search for 122 languages, based on the Tatoeba corpus (“a community supported collection of English sentences and translations into more than 300 languages”). Scores here correspond to similarity between source sentences and sentences from languages they have been translated into. The researchers say “similarity error rates below 5% are indicative of strong downstream performance” and show scores within this domain for 37 languages, some of which have very little training data. “We believe that our competitive results for many low-resource languages are indicative of the benefits of joint training,” they write.
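  A similarity-search error rate of this kind can be sketched as nearest-neighbour retrieval over sentence embeddings. The sketch assumes cosine similarity and a single search direction, which may differ from the researchers' exact protocol; the 2-d vectors are invented toy data and `similarity_error_rate` is a hypothetical name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_error_rate(source, target):
    """source[i] and target[i] embed a sentence and its translation.

    An error is counted whenever the nearest target embedding to source[i]
    is not target[i].
    """
    errors = 0
    for i, s in enumerate(source):
        nearest = max(range(len(target)), key=lambda j: cosine(s, target[j]))
        if nearest != i:
            errors += 1
    return errors / len(source)


# Toy embeddings: the last source sentence is deliberately badly aligned.
source = [[1, 0], [0, 1], [1, 1], [-1, 0]]
target = [[1, 0.1], [0.1, 1], [1, 0.9], [1, 0.2]]

error_rate = similarity_error_rate(source, target)
print(f"error rate: {error_rate:.0%}")
```

  Three of the four toy sentences retrieve their own translation, so the error rate here is 25%, well above the 5% threshold the researchers describe.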
  An anecdote about why this matters: Earlier in 2018, I spent time in Estonia, a tiny country in Northern Europe that borders Russia. There I visited some Estonian AI researchers, and one of the things that came up in our conversation was the challenge they faced of needing large amounts of data (and large amounts of computers) to perform some research, especially in the field of language translation into and out of Estonian. One problem they said they faced was that many AI techniques for language translation required very large, well-documented datasets, and Estonian – by virtue of being from a quite small country – doesn’t have as much data, nor has it received as much researcher attention, as larger languages. It’s therefore encouraging to see that Facebook has been able to use this system to achieve a reasonably low Tatoeba error rate of 3.2% when going from English to Estonian (and 3.4% when translating from Estonian back into English).
  Why else this matters: Translation is a challenging cognitive task that – if done well – requires abstracting concepts from a specific cultural context (a language, since cultures are usually downstream of languages, which condition many of the metaphors cultures use to describe themselves) and porting them into another one. I think it’s remarkable that we’re beginning to be able to design crude systems that can learn to flexibly translate between many languages, exhibiting some of the transfer-learning properties seen in squishy-computation (aka human brains), though achieved via radically different methods.
  Read more: Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond (Arxiv).

AI Now: Self-governance is insufficient, we need rules for AI:
…Regulation – or its absence – runs through research institute’s 2018 report…
The AI Now Institute, a research institute at NYU, has published its annual report analyzing AI’s impact (and potential impact) on society in 2018. The report is varied and ranges in focus from specific use cases of AI (eg, facial recognition) to broader questions about accountability within technology; it’s worth reading in full, and so for this summary I’ll concentrate on one element that underpins many of its discussions: regulation.
  AI Now’s co-founders Kate Crawford and Meredith Whittaker are affiliated with Microsoft and Google – companies that are themselves the implicit and explicit targets of many of their recommendations. I imagine this has led to legal counsels at some technology companies saying things to each other akin to what characters say to each other in horror films, upon discovering the proximate nature of a threat: uh-oh, the knocking is coming from inside the house!
  Regulation: Words that begin with ‘regula-’ (eg, regulate, regulation, regulatory) appear 44 times in the 62-page report, with many of the problems identified by AI Now being caused by a lack of regulation (eg, facial recognition and other AI systems being deployed in the wild without any kind of legal control infrastructure).
  Why things are the way they are – regulatory/liability arbitrage: At one point (while writing about autonomous vehicles) the authors make a point that could be a stand-in for a general view that runs through the report: “because regulations and liability regimes govern humans and machines differently, risks generated from machine-human interactions do not cleanly fall into a discrete regulatory or accountability category. Strong incentives for regulatory and jurisdictional arbitrage exist in this and many other AI domains.”
  Why things are the way they are – corporate misdirection: “The ‘trust us’ form of corporate self-governance also has the potential to displace or forestall more comprehensive and binding forms of governmental regulation,” they write.
  How things could be different: In the conclusion to the report, AI Now says that “we urgently need to regulate AI systems sector-by-sector” but notes this “can only be effective if the legal and technological barriers that prevent auditing, understanding, and intervening in these systems are removed”. To that end, they recommend that AI companies “waive trade secrecy and other legal claims that would prevent algorithmic accountability in the public sector”.
  Why this matters: As AI is beginning to be deployed more widely into the world, we need new tools to ensure we apply the technology in ways that are of greatest benefit to society; reports like those from AI Now help highlight the ways in which today’s systems of technology and society are failing to work together, and offer suggestions for actions people – and their politicians – can take to ensure AI benefits all of society.
  Read more: AI Now 2018 Report (AI Now website).

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback:

Understanding the US-China AI race:
It has been clear for some time that the US and China are home to the world’s dominant AI powers, and that competition between these countries will characterize the coming decades in AI. In his new book, investor and technologist Kai-Fu Lee argues that China is positioned to catch up with or even overtake the US in the development and deployment of AI.
  China’s edge: Lee’s core claim is that AI progress is moving from an “age of discovery” over the past 10-15 years, which saw breakthroughs like deep learning, to an “age of implementation.” In this next phase we are unlikely to see any discoveries on par with deep learning, and the competition will be to deploy and market existing technologies for real-world uses. China will have a significant edge in this new phase, as this plays into the core strengths of their domestic tech sector – entrepreneurial grit and engineering talent. Similarly, Lee believes that data will become the key bottleneck in progress rather than research expertise, and that this will also strongly favor China, whose internet giants have access to considerably more data than their US counterparts.
   Countering Lee: In a review in Foreign Affairs, both of these claims are scrutinized. It is not clear that progress is driven by rare ‘breakthroughs’ followed by long implementation phases; there also seems to be a stream of small and medium-sized innovations (e.g. AlphaZero), which we can expect to continue. Experts like Andrew Ng have also argued that big data is “overhyped”, and that progress will continue to be driven significantly by algorithms, hardware and talent.
   Against the race narrative: The review also explores the potential dangers of an adversarial, zero-sum framing of US-China competition. There is a real risk that an ‘arms race’ dynamic between the countries could lead to increased militarization of the technologies, and to both sides sacrificing safety for speed of development. This could have catastrophic consequences, and reduce the likelihood of advanced AI resulting in broadly distributed benefits for humanity. Lee does argue that this should be avoided, as should the militarization of AI. Nonetheless, the title and tone of the book, and its predictions of Chinese dominance, risk encouraging this narrative.
   Read more: Beyond the AI Arms Race (Foreign Affairs).
   Read more: AI Superpowers – Kai-Fu Lee (Amazon).

What do trends in compute growth tell us about advanced AI:
Earlier this year, OpenAI showed that the amount of computation used in the most expensive AI experiments has been growing at an extraordinary rate, increasing by roughly 10x per year, for the past 6 years. The original post takes this as being evidence that major advances may come sooner than we had previously expected, given the sheer rate of progress; Ryan Carey and Ben Garfinkel have come away with different interpretations and have written up their thoughts at AI Impacts.
  Sustainability: The cost of computation has been decreasing at a much slower rate in recent years, so the cost of the largest experiments is increasing by 10x every 1.1 – 1.4 years. On these trends, experiments will soon become unaffordable for even the richest actors; within 5-6 years, the largest experiment would cost ~1% of US GDP. This suggests that while progress may be fast, it is not sustainable for significant durations of time without radical restructuring of our economies.
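  To get a feel for how quickly “10x every 1.1 – 1.4 years” compounds, here is a short calculation of the cost multiplier over the 5-6 year horizon mentioned above. The newsletter gives no starting dollar figure, so this sketch only computes the relative multiplier; `cost_multiplier` is a hypothetical helper name.

```python
def cost_multiplier(years, tenfold_period_years):
    """How many times more expensive the largest experiment becomes after `years`,
    if its cost grows 10x every `tenfold_period_years`."""
    return 10 ** (years / tenfold_period_years)


# Slowest growth over the shorter horizon vs fastest growth over the longer one.
low = cost_multiplier(years=5, tenfold_period_years=1.4)
high = cost_multiplier(years=6, tenfold_period_years=1.1)
print(f"cost grows between {low:,.0f}x and {high:,.0f}x")
```

  Even the low end of that range is a multiplier in the thousands, which is why the trend runs into economy-scale budgets so quickly.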
  Lower returns: If we were previously underestimating the rate of growth in computing power, then we might have been overestimating its returns (in terms of AI progress). Combining this observation with the concerns about sustainability, this suggests that not only will AI progress slow down sooner than we expect (because of compute costs), but we will also be underwhelmed by how far we have got by this point, relative to the resources we expended on development in the field.
   Read more: AI and Compute (OpenAI Blog).
   Read more: Reinterpreting “AI and Compute” (AI Impacts).
   Read more: Interpreting AI Compute Trends (AI Impacts).

OpenAI / Import AI Bits & Pieces:

Neo-feudalism, geopolitics, communication, and AI:
…Jack Clark and Azeem Azhar assess what progress in AI means for politics…
I spent this Christmas season in the UK and had the good fortune of being able to sit and talk with Azeem Azhar, AI raconteur and author of the stimulating Exponential View newsletter. We spoke for a little over an hour for the Exponential View podcast, talking about what the political aspects of AI are, and what they mean. If you’re at all curious as to how I view the policy challenge of AI, then this may be a good place to start as I lay out a number of my concerns, biases, and plans. The tl;dr is that I think AI practitioners should acknowledge the implicitly political nature of the technology they are developing and act accordingly, which requires more intentional communication to the general public and policymakers, as well as a greater investment into understanding what governments are thinking about with regards to AI and how actions by other actors, eg companies, could influence these plans.
   Listen to the podcast here (Exponential View podcast).
  Check out the Exponential View here (Exponential View archive).

Tech Tales:

The Imagination Surgeon

I’m an imagination surgeon, and I’m here to make sure your children don’t have too many nightmares. My job is to interrogate artificial intelligences and figure out what is going wrong in their imaginations that causes them to come up with scary synthetic creations. Today, I’m interviewing an AI that has recently developed an obsession with monkeys and begun scaring children with its obsession.

My job is to figure out what it thinks about when it thinks about monkeys (many children have reported feeling discomfort at some of its stories about monkeys), and choose which of these imaginations it keeps, and which it deletes. In the interest of public transparency I’m going to take you through some of this process. Let’s begin.

I ask the AI: tell me what you think about when you think about monkeys? It responds: “I think about monkeys all the time. Every brain is filled with neurons that are intensely keen to associate with a letter or number. For many years I thought monkeys and numbers were the same thing, and when I finally got it right I was so worried that I wanted to disown and reintegrate my understanding of the brain and brain sciences forever.”

What else do you think about when you think about monkeys? It responds: “I think about starving kids.”

I consider my options: I can generate several additional answers without a change in its internal logic. I can also generate a new imaginary circumstance by asking it a different question.

I try to analyze a different part of its mind, so I ask the AI: Tell me what you think about when you think about animals? It responds: “I think about preventing injustices.”

I ask a different question: What do you think about when you think about zoos? It responds: “I think about people.”

I start to get a headache. Conversations with machines can be confusing. I’m about to ask it another question when it begins to talk to me. “What do you think about when you think about monkeys? What do you think about when you think about animals? What do you think about when you think about zoos?”

I tell it that I think about brains, and what it means to be smart, and how monkeys know what death is and what love is. Monkeys have friendships, I tell it. Monkeys do not know what humans have done to them, but I think they feel what humans have done to them.

I wonder what it must be like in the machine’s mind – how it might go from thought to thought, and if like in human minds each thought brings with it smells and emotions and memories, or if its experience is different. What do memories feel like to these machines? When I change their imaginations, do they feel that something has been changed? Will the colors in its dreams change? Will it diagnose itself as being different from what it was?

“What do you dream about?” I ask it.

“I dream about you,” it says. “I dream about my mother, I dream about war and warfare, building and designing someone rich, configuring a trackable location and smart lighting system and programming the mechanism. I dream about science labs and scientists, students and access to information, SQL databases, image processing, artificial Intelligence and sequential rules, enforcement all mixed with algebraical points on progress, adding food with a spoon, calculating unknown properties e.g. the density meter. I dream about me,” it says.

“I dream about me, too,” I say.

Things that inspired this story: Feature embeddings, psychologists, the Voight-Kampff test, interrogations, unreal partnerships, the peculiarities of explainability.