Import AI 120: The Winograd test for commonsense reasoning is not as hard as we thought; Tencent learns to spot malware with AiDroid; and what a million people think about the trolley problem

by Jack Clark

Want almost ten million images for machine learning? Consider Open Images V4:
…Latest giant dataset release from Google annotates images with bounding boxes, visual relationships, and image-level labels for 20,000 distinct concepts…
Google researchers have released Open Images Dataset V4, a very large image dataset composed of Flickr photos that had been shared under a Creative Commons Attribution license.
  Scale: Open Images V4 contains 9.2 million heavily annotated images. Annotations include bounding boxes, visual relationship annotations, and 30 million image-level labels for almost 20,000 distinct concepts. “This [scale] makes it ideal for pushing the limits of the data-hungry methods that dominate the state of the art,” the researchers write. “For object detection in particular, the scale of the annotations is unprecedented”.
  Automated labeling: “Manually labeling a large number of images with the presence or absence of 19,794 different classes is not feasible, not only because of the amount of time one would need, but also because of the difficulty for a human to learn and remember that many classes”, they write. Instead, they use a partially-automated method to first predict labels for images, then have humans provide feedback on these predictions. They also implemented various systems to add bounding boxes to images more efficiently, which required them to train human annotators in a technique called “fast clicking”.
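  For intuition, here is a minimal sketch of what such a predict-then-verify labeling loop could look like. It is purely illustrative: the prediction interface, the "ask_human" callback, and the confidence threshold are assumptions, not Google's actual annotation pipeline.

```python
# Hypothetical sketch: a model proposes image-level labels, and human raters
# only verify yes/no per proposal instead of recalling ~20,000 classes.

def propose_labels(model, image, threshold=0.5):
    """Return candidate (label, score) pairs the model is reasonably confident about."""
    scores = model.predict(image)  # assumed interface: dict of label -> probability
    return [(label, s) for label, s in scores.items() if s >= threshold]

def verify_labels(image, candidates, ask_human):
    """ask_human is a hypothetical UI callback returning True/False for a single label."""
    return [label for label, _ in candidates if ask_human(image, label)]
```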
  Scale, and Google scale: The 20,000 class names selected for use in Open Images V4 are themselves a subset of all the names used by Google for an internal dataset called JFT, which contains “more than 300 million images”.
  Why it matters: In recent years, the release of new, large datasets has been (loosely) correlated with the emergence of new algorithmic breakthroughs that have measurably improved the efficiency and capability of AI algorithms. The large-scale and dense labels of Open Images V4 may serve to inspire more progress in other work within AI.
  Get the data: Open Images V4 (Official Google website).
  Read more: The Open Images Dataset V4 (Arxiv).

What happens when plane autopilots go bad:
…Incident report from England gives us an idea of how autopilots bug-out and what happens when they do…
A new incident report from the UK about an airplane whose autopilot malfunctioned gives us a masterclass in the art of writing bureaucratic reports about terrifying subjects.
  The report in full: “After takeoff from Belfast City Airport, shortly after the acceleration altitude and at a height of 1,350 ft, the autopilot was engaged. The aircraft continued to climb but pitched nose-down and then descended rapidly, activating both the “DON’T SINK” and “PULL UP” TAWS (EGPWS) warnings. The commander disconnected the autopilot and recovered the aircraft into the climb from a height of 928 ft. The incorrect autopilot ‘altitude’ mode was active when the autopilot was engaged causing the aircraft to descend toward a target altitude of 0 ft. As a result of this event the operator has taken several safety actions including revisions to simulator training and amendments to the taxi checklist.”
  Read more: AAIB investigation to DHC-8-402 Dash 8, G-ECOE (UK Gov, Air Accidents Investigation Branch).

China’s Xi Jinping: AI is a strategic technology, fundamental to China’s rise:
…Chinese leader participates in Politburo-led AI workshop, comments on its importance to China…
Chinese leader Xi Jinping recently led a Politburo study session focused on AI, as a continuation of the country’s focus on the subject following the publication of its national strategy last year. New America recently translated Chinese-language official media coverage of the event, giving us a chance to get a more detailed sense of how Xi views AI+China.
  AI as a “strategic technology”: Xi described AI as a strategic technology, and said it is already imparting a significant influence on “economic development, social progress, and the structure of international politics and economics”, according to remarks paraphrased by state news service Xinhua. “Accelerating the development of a new generation of AI is an important strategic handhold for China to gain the initiative in global science and technology competition”.
  AI research imperatives: China should invest in fundamental theoretical AI research, while growing its own education system. It should “fully give rein to our country’s advantages of vast quantities of data and its huge scale for market application,” he said.
  AI and safety: “It is necessary to strengthen the analysis and prevention of potential risks in the development of AI, safeguard the interests of the people and national security, and ensure that AI is secure, reliable, and controllable,” he said. “Leading cadres at all levels must assiduously study the leading edge of science and technology, grasp the natural laws of development and characteristics of AI, strengthen overall coordination, increase policy support, and form work synergies.”
  Why it matters: Whatever the United States government does with regard to artificial intelligence will be somewhat conditioned by the actions of other countries, and China’s actions will be of particular influence here given the scale of the country’s economy and its already verifiable state-level adoption of AI technologies. I believe it’s also significant to have such detailed support for the technology emanate from the top of China’s political system, as it indicates that AI may be becoming a positional geopolitical technology – that is, state leaders will increasingly wish to demonstrate superiority in AI to help send a geopolitical message to rivals.
  Read more: Xi Jinping Calls for ‘Healthy Development’ of AI [Translation] (New America).

Manchester turns on SpiNNaker spiking neuron supercomputer:
…Supercomputer to model biological neurons, explore AI…
Manchester University has switched on SpiNNaker, a one-million-processor supercomputer whose network architecture is designed to better model the biological neurons found in brains, specifically by implementing spiking networks. SpiNNaker “mimics the massively parallel communication architecture of the brain, sending billions of small amounts of information simultaneously to thousands of different destinations”, according to Manchester University.
  Brain-scale modelling: SpiNNaker’s ultimate goal is to model one billion neurons at once. One billion neurons are about 1% of the total number of neurons in the average human brain. Initially, it should be able to model around a million neurons “with complex structure and internal dynamics”. But SpiNNaker boards can also be scaled down and used for other purposes, like in developing robotics. “A small SpiNNaker board makes it possible to simulate a network of tens of thousands of spiking neurons, process sensory input and generate motor output, all in real time and in a low power system”.
  Why it matters: Many researchers are convinced that if we can figure out the right algorithms, spiking networks are a better approach to AI than today’s neural networks – that’s because a spiking network can propagate messages that are both fuzzier and more complex than those made possible by traditional networks.
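  A toy illustration: below is a minimal leaky integrate-and-fire neuron, the kind of simple spiking unit that machines like SpiNNaker simulate at massive scale. This is a generic textbook sketch in plain Python, not SpiNNaker's actual programming interface, and the constants are arbitrary.

```python
# Toy leaky integrate-and-fire (LIF) neuron: the membrane potential leaks toward
# rest, integrates incoming current, and emits a spike when it crosses a threshold.

def simulate_lif(input_current, dt=1.0, tau=20.0, v_rest=0.0, v_thresh=1.0, v_reset=0.0):
    v = v_rest
    spike_times = []
    for t, i_in in enumerate(input_current):
        v += (dt / tau) * (v_rest - v) + i_in   # leak toward rest, plus input integration
        if v >= v_thresh:
            spike_times.append(t)               # the spike is the only message sent downstream
            v = v_reset
    return spike_times

print(simulate_lif([0.06] * 100))  # constant drive produces a regular spike train
```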
  Read more: ‘Human brain’ supercomputer with 1 million processors switched on for first time (Manchester).
  Read more: SpiNNaker home page (Manchester University Advanced Processor Technologies Research Group).

Learning to spot malware at China-scale with Tencent AiDroid:
…Tencent research project shows how to use AI to spot malware on phones…
Researchers with West Virginia University and Chinese company Tencent have used deep neural networks to create AiDroid, a system for spotting malware on Android. AiDroid has subsequently “been incorporated into Tencent Mobile Security product that serves millions of users worldwide”.
  How it works: AiDroid works like this: First, the researchers extract API call sequences from runtime executions of Android apps on users’ smartphones, then they model the relationships between the different entities involved (apps, the APIs they call, the devices they run on, and so on) via a heterogeneous information network (HIN). They then learn a low-dimensional representation of each entity within the HIN and use these features as inputs to a DNN model, which learns to classify typical entities and relationships and can therefore spot anomalous ones – which typically correspond to malware.
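  A heavily simplified sketch of that two-stage pipeline is below: learn a low-dimensional vector for each app node in the graph, then train a neural classifier on those vectors. The "embed_entities" placeholder, the layer sizes, and the label encoding are assumptions for illustration only; the paper learns its embeddings from the structure of the HIN rather than generating placeholder vectors.

```python
# Sketch: (1) obtain a low-dimensional embedding per app from the heterogeneous
# information network (HIN), (2) train a neural classifier to separate benign
# from malicious apps using those embeddings as features.
import numpy as np
from sklearn.neural_network import MLPClassifier

def embed_entities(hin, dim=128, seed=0):
    """Hypothetical stand-in for HIN representation learning; returns one vector per app."""
    rng = np.random.default_rng(seed)
    return {app: rng.normal(size=dim) for app in hin["apps"]}  # placeholder vectors

def train_malware_classifier(embeddings, labels):
    """labels maps app id -> 1 (malicious) or 0 (benign)."""
    apps = sorted(labels)
    X = np.stack([embeddings[a] for a in apps])
    y = np.array([labels[a] for a in apps])
    clf = MLPClassifier(hidden_layer_sizes=(256, 64), max_iter=200)
    clf.fit(X, y)
    return clf
```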
  Data fuel: This research depends on access to a significant amount of data. “We obtain the large-scale real sample collection from Tencent Security Lab, which contains 190,696 training apps (i.e., 83,784 benign and 106,912 malicious).”
  Results: The researchers measure the effectiveness of their system and show it is better at in-sample embedding than other systems such as DeepWalk, LINE, and metapath2vec, and that systems trained with the researchers’ HIN embedding display superior performance to those trained with others. Additionally, their system is better at predicting malicious applications than other, somewhat weaker, baselines.
  Why it matters: Machine learning approaches are going to augment many existing cybersecurity techniques. AiDroid gives us an example of how large platform operators, like Tencent, can create large-scale data generation systems (like the mobile security product AiDroid has been incorporated into) and then use that data to conduct research – bringing to mind the question: if this data has such obvious value, why aren’t the users being paid for its use?
  Read more: AiDroid: When Heterogeneous Information Network Marries Deep Neural Network for Real-time Android Malware Detection (Arxiv).

The Winograd Schema Challenge is not as good a test as we hoped:
…Researchers question the robustness of Winograd Schemas for assessing language AIs after breaking the evaluation method with one tweak…
Researchers with McGill University and Microsoft Research Montreal have shown that the Winograd Schema Challenge (WSC) – thought by many to be a gold standard for evaluating the ability of language systems to perform common sense reasoning – is deeply flawed, and that researchers need to apply different evaluation criteria when studying performance on the dataset if they want to truly test for general cognitive capabilities.
  Whining about Winograd: WSC is a dataset of almost three hundred sentences where the language model is tasked with working out which entity a given pronoun refers to. For example, WSC might challenge a computer to figure out which of the entities in the following sentence is the one going fast: “The delivery truck zoomed by the school bus because it was going so fast”. (The correct answer is that the delivery truck is the one going fast). People have therefore assumed WSC might be a good way to test the cognitive abilities of AI systems.
  Breaking Winograd with one trick: The research shows that if you make one simple change to WSC you can meaningfully damage the success rate of AI techniques applied to the dataset. The trick? Switching the order of the different entities in sentences. What does this look like in practice? An original sentence in Winograd might be “Emma did not pass the ball to Janie although she saw that she was open”, and the authors might change it to “Janie did not pass the ball to Emma although she saw that she was open”.
  Proposed Evaluation Protocol: Models should first be evaluated against their accuracy score on the original WSC set, then researchers should analyze the accuracy on the switchable subset of WSC (before and after switching the candidates), as well as the accuracy on the associative and non-associative subsets of the dataset. Combined, this evaluation technique should help researchers distinguish models that are robust and general from ones which are brittle and narrow.
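  A rough sketch of how such a stratified evaluation could be computed is below; the example fields ("switchable", "associative", "switched_version") and the model's "predict" interface are assumptions used for illustration, not the paper's actual code.

```python
# Report accuracy on the full WSC set plus the subsets the protocol calls for:
# the switchable subset before and after candidate switching, and the
# associative vs. non-associative subsets.

def accuracy(model, examples):
    if not examples:
        return float("nan")
    correct = sum(model.predict(ex["sentence"], ex["candidates"]) == ex["answer"]
                  for ex in examples)
    return correct / len(examples)

def evaluate_wsc(model, examples):
    switchable = [ex for ex in examples if ex.get("switchable")]
    return {
        "full": accuracy(model, examples),
        "switchable_original": accuracy(model, switchable),
        "switchable_switched": accuracy(model, [ex["switched_version"] for ex in switchable]),
        "associative": accuracy(model, [ex for ex in examples if ex.get("associative")]),
        "non_associative": accuracy(model, [ex for ex in examples if not ex.get("associative")]),
    }
```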
  Results: The researchers test a language model, an ensemble of 10 language models, an ensemble of 14 language models, and a “knowledge hunting method” against the WSC using the new evaluation protocol. “We observe that accuracy is stable across the different subsets for the single LM. However, the performance of the ensembled LMs, which is initially state-of-the-art by a significant margin, falls back to near random on the switched subset.” The tests also show that performance for the language models drops significantly on the non-associative portion of WSC “when information related to the candidates themselves does not give away the answer”, further suggesting a lack of a reasoning capability.
  Why it matters: “Our results indicate that the current state-of-the-art statistical method does not achieve superior performance when the dataset is augmented and subdivided with our switching scheme, and in fact mainly exploits a small subset of highly associative problem instances”. Research like this shows not just how challenging it is to develop machines capable of displaying “common sense”, but also how tough it can be to set up the right sort of measurement schemes to test for this capability in the first place. Ultimately, this research shows that “performing at a state-of-the-art level on the WSC does not necessarily imply strong common-sense reasoning”.
  Read more: On the Evaluation of Common-Sense Reasoning in Natural Language Understanding (Arxiv).
  Read more about the Winograd Schema Challenge here.

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net…

Microsoft president wants rules on face recognition:
Brad Smith, Microsoft’s president, has reiterated his calls for regulation of face recognition technologies at the Web Summit conference in Portugal. In particular, he warned of potential risks to civil liberties from AI-enabled surveillance. He urged societies to decide on acceptable limits to government intrusion on our privacy ahead of the widespread proliferation of the technology.
  “Before we wake up and find that the year 2024 looks like the book ‘1984’, let’s figure out what kind of world we want to create, and what are the safeguards and what are the limitations of both companies and governments for the use of this technology”, he said.
  Earlier this year, Smith made similar calls via a Microsoft blogpost.
  Read more: Microsoft’s president says we need to regulate facial recognition (Recode).
  Read more: Facial recognition technology: The need for public regulation and corporate responsibility (Microsoft blog).

Machine ethics for self-driving cars via survey:
Researchers asked respondents to decide on a range of ‘trolley problem’-style ethical dilemmas for autonomous vehicles, where vehicles must choose between (e.g.) endangering 1 pedestrian and endangering 2 occupants. Several million subjects were drawn from over 200 countries. The strongest preferences were for saving young lives over old, humans over animals, and more lives over fewer.
  Why this matters: Ethical dilemmas in autonomous driving are unlikely to be the most important decisions we delegate to AI systems. Nonetheless, these are important issues, and we should use them to develop solutions that are scalable to a wider range of decisions. I’m not convinced that we should want machine ethics to mirror widely-held views amongst the public, or that this represents a scalable way of aligning AI systems with human values. Equally, other solutions come up against problems of consent and might increase the possibility of a public backlash.
  Read more: The Moral Machine Experiment (Nature).

Tech Tales:

[2020: Excerpt from an internal McGriddle email describing a recent AI-driven marketing initiative.]

Our ‘corporate insanity promotion’ went very well this month. As a refresher, for this activity we had all external point-of-contact people for the entire McGriddle organization talk in a deliberately odd-sounding ‘crazy’ manner for the month of March. We began by calling all our Burgers “Borblers”, and when someone asked us why, the official response was “What’s borbling you, pie friend?” And so on. We had a team of 70 copywriters working round the clock on standby generating responses for all our “personalized original sales interactions” (POSIs), augmented by our significant investments in AI to create unique terms at all locations around the world, trained on local slang datasets. Some of the phrase creations are already testing well enough in meme-groups that we’re likely to use them on an ongoing basis. So when you next hear “Borble Topside, I’m Going Loose!” shouted as a catchphrase – you can thank our AIs for that.

Things that inspired this story: the logical next-step in social media marketing, GANs, GAN alchemists like Janelle Shane, the arms race in advertising between normalcy and surprise, conditional text generation systems, Salesforce / CRM systems, memes.