Import AI #82: 2.9 million anime images, reproducibility problems in AI research, and detecting dangerous URLs with deep learning.

Neural architecture search for the 99%:
…Researchers figure out a way to make NAS techniques work on a single GPU, rather than several hundred…
One of the more striking recent trends in AI has been the emergence of neural architecture search (NAS) techniques, which automate the design of AI systems, like image classifiers. The main drawback of these approaches so far has been that they're expensive, using hundreds of GPUs at a time, and are therefore infeasible for most researchers. That started to change last year with the publication of SMASH (covered in Import AI #56), a technique to do neural architecture search on a significantly smaller compute budget but with slight trade-offs in accuracy and in flexibility. Now, researchers with Google, CMU, and Stanford University have pushed the idea of low-cost NAS forward via a new technique, ‘Efficient Neural Architecture Search’, or ENAS, that can design state-of-the-art systems using less than a day’s computation on a single NVIDIA 1080 GPU. This represents a 1000X reduction in computational cost for the technique, and leads to a system that can create architectures that are almost as good as those found with the larger systems.
  How it works: Instead of training each new model from scratch, ENAS gets the models to share weights with one another. It does this by re-casting the problem of neural architecture search as finding a specific task-specific sub-graph within one large directed acyclic graph (DAG). This approach works for designing both recurrent and convolutional networks: ENAS-designed networks obtain close-to-state-of-the-art results on Penn Treebank (Perplexity: 55.8), and on image classification for CIFAR-10 (Error: 2.89%.)
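  A toy illustration of the weight-sharing idea described above, not the authors' implementation: one pool of edge weights lives in a large DAG, and every sampled architecture is a sub-graph that reads from that same pool. The `SharedDAG` class, the scalar weights, the three operations, and the uniform random sampler (standing in for a learned controller) are all assumptions made for this sketch.

```python
import math
import random

# Candidate operations a node may apply (an illustrative, assumed set).
OPS = {
    "tanh": math.tanh,
    "relu": lambda x: max(0.0, x),
    "identity": lambda x: x,
}


class SharedDAG:
    """One large DAG whose edge weights are shared by every sampled child model."""

    def __init__(self, num_nodes, seed=0):
        self.num_nodes = num_nodes
        self.rng = random.Random(seed)
        # One scalar per (source, target) edge; a real system would hold matrices.
        self.weights = {
            (src, dst): self.rng.uniform(-1.0, 1.0)
            for dst in range(1, num_nodes)
            for src in range(dst)
        }

    def sample_architecture(self):
        """For each node, pick one earlier node to read from and one operation."""
        return [
            (self.rng.randrange(dst), dst, self.rng.choice(sorted(OPS)))
            for dst in range(1, self.num_nodes)
        ]

    def forward(self, arch, x):
        """Run the sub-graph described by arch using only the shared weights."""
        values = {0: x}
        for src, dst, op in arch:
            values[dst] = OPS[op](self.weights[(src, dst)] * values[src])
        return values[self.num_nodes - 1]


if __name__ == "__main__":
    dag = SharedDAG(num_nodes=4)
    for _ in range(3):
        arch = dag.sample_architecture()
        print(arch, dag.forward(arch, 1.0))
```

  Because no child model owns its parameters, any update made to an edge weight while evaluating one sampled architecture is immediately visible to every other architecture that uses the same edge, which is where the savings over training each candidate from scratch come from.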
  Why it matters: For the past few years lots of very intelligent people have been busy turning food and sleep into brainpower which they’ve used to get very good at hand-designing neural network architectures. Approaches like NAS promise to let us automate the design of specific architectures, freeing up researchers to spend more time on fundamental tasks like deriving new building blocks that NAS systems can learn to build compositions out of, or other techniques to further increase the efficiency of architecture design. Broadly, approaches like NAS mean we can simply offload a huge chunk of work from (hyper-efficient, relatively costly, somewhat rare) human brains to (somewhat inefficient, extremely cheap, plentiful) computer brains. That seems like a worthwhile trade.
  Read more: Efficient Neural Architecture Search via Parameter Sharing (Arxiv).
  Read more: SMASH: One-Shot Model Architecture Search through HyperNetworks (Arxiv).

The anime-network rises, with 2.9 million images and 77.5 million tags:
…It sure ain’t ImageNet, but it’s certainly very large…
Some enterprising people have created a large-scale dataset of images taken from anime pictures. The ‘Danbooru’ dataset “is larger than ImageNet as a whole and larger than the current largest multi-description dataset, MS COCO,” they write. Each image has a bunch of metadata associated with it including things like its popularity on the image web board (a ‘booru’) it has been taken from.
  Problematic structures ahead: The corpus “does focus heavily on female anime characters”, though the researchers note “they are placed in a wide variety of circumstances with numerous surrounding tagged objects or actions, and the sheer size implies that many more miscellaneous images will be included”. Images in the dataset are classified as “safe”, “questionable”, or “explicit”, with the rough distribution at launch consisting of 76.3% ‘safe’ images, 14.9% ‘questionable’, and 8.7% ‘explicit’. The compilation and release of this dataset seems to raise a number of ethical questions, and my main concern at the outset is that such a large corpus of explicit imagery will almost invariably lead to various grubby AI experiments that further alienate people from the AI community. I hope I’m proved wrong!
  Example uses: The researchers imagine the dataset could be used for a bunch of tasks, ranging from classification, to image generation, to predicting traits about images from available metadata, and so on.
  Justification: A further justification for the dataset is that drawn images will encourage people to develop models with higher levels of abstraction than those which can simply map combinations of textures (as in the case of ImageNet), and so on. “Illustrations are frequently black-and-white rather than color, line art rather than photographs, and even color illustrations tend to rely far less on textures and far more on lines (with textures omitted or filled in with standard repetitive patterns), working on a higher level of abstraction – a leopard would not be as trivially recognized by pattern-matching on yellow and black dots – with irrelevant details that a discriminator might cheaply classify based on typically suppressed in favor of global gestalt, and often heavily stylized,” they write. “Because illustrations are produced by an entirely different process and focus only on salient details while abstracting the rest, they offer a way to test external validity and the extent to which taggers are tapping into higher-level semantic perception.”
  Read more: Danbooru2017: A large-scale crowdsourced and tagged anime illustration dataset (Gwern.)

Stanford researchers recount reproducibility horrors encountered during the design of DAWNBench:
…Lies, damned lies, and deep learning…
Stanford researchers have discussed some of the difficulties they encountered when developing DAWNBench, a benchmark that assesses deep learning methods in a holistic way using a set of different metrics, like inference latency and cost, along with training time and training cost. Their conclusions should be familiar to most deep learning practitioners: deep learning performance is poorly understood, widely shared intuitions are likely based on imperfect information, and we still lack the theoretical guarantees to understand how one research breakthrough might interact with another when combined.
  Why it matters: Deep learning is still very much in a phase of ‘empirical experimentation’ and the arrival of benchmarks like DAWNBench, as well as prior work like the paper Deep Reinforcement Learning that Matters (whose conclusion was that random seeds determine a huge amount of the end performance of RL), will help surface problems and force the community to develop more rigorous methods.
  Read more: Deep Learning Pitfalls Encountered while Developing DAWNBench.
  Read more: Deep Reinforcement Learning that Matters (Arxiv).

Detecting dangerous URLs with deep learning:
…Character-level & word-level combination leads to better performance on malicious URL categorization…
Researchers with Singapore Management University have published details on URLNet, a system for using neural network approaches to automatically classify URLs as being risky or safe to click on.
  Why it matters:  “Without using any expert or hand-designed features, URLNet methods offer a significant jump in [performance] over baselines,” they write. By now this should be a familiar trend, but it’s worth repeating: given a sufficiently large dataset, neural network-based techniques tend to provide superior performance to hand-crafted features. (Caveat: In many domains getting the data is difficult, and these models all need to be refreshed to account for an ever-changing world.)
  How it works: URLNet uses convolutional neural networks to learn both character-level and word-level representations of a URL, which it then uses to classify the URL. Word-level embeddings help it classify according to high-level learned semantics and character-level embeddings allow it to better generalize to new words, strings, and combinations. “Character-level CNNs also allow for easily obtaining an embedding for new URLs in the test data, thus not suffering from inability to extract patterns from unseen words (like existing approaches),” write the researchers.
  For the word-level network, the system does two things: it takes in new words and learns an embedding of them, and it also initializes a new character-level CNN to build up representations of words derived from characters. This means that even when the system encounters rare or new words in the wild it is able to label them at the top level with an ‘<UNK>’ token while, in the background, fitting their representation into its larger embedding space, letting it learn something crude about the semantics of the new word and how it relates, at a word-character level, to other words.
  Dataset: The researchers generated a set of 15 million URLs from VirusTotal, an antivirus company, creating a dataset split across around 14 million benign URLs and a million malicious URLs.
  Results: The researchers compared their system against baseline methods based around using support vector machines conditioned on a range of features, including bag-of-words representations. The researchers do a good job of visualizing the resulting representations of their system in ‘Figure 5’ in the paper, showing how their system’s feature embeddings do a reasonable job of segmenting benign from malicious URLs, suggesting it has learned a somewhat robust underlying semantic categorization model.
  Read more: URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection (Arxiv).
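  To make the character-level versus word-level distinction concrete, here is a minimal sketch of how a URL might be turned into the inputs for both branches. It covers only the input preparation, not the CNNs themselves, and it is not taken from the paper: the split-on-non-alphanumeric rule, the `min_count` threshold, and the function names are assumptions chosen for illustration.

```python
import re
from collections import Counter

UNK = "<UNK>"


def split_words(url):
    """Split a URL into word tokens on non-alphanumeric characters (assumed rule)."""
    return [tok for tok in re.split(r"[^0-9A-Za-z]+", url) if tok]


def build_vocab(urls, min_count=2):
    """Give an id to each word seen at least min_count times; id 0 is <UNK>."""
    counts = Counter(word for url in urls for word in split_words(url))
    vocab = {UNK: 0}
    for word, count in sorted(counts.items()):
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab


def featurize(url, vocab):
    """Produce the inputs for a character branch and a word branch."""
    words = split_words(url)
    return {
        # Character branch: every character of the raw URL.
        "chars": list(url),
        # Word branch: rare or unseen words collapse to the <UNK> id.
        "word_ids": [vocab.get(word, vocab[UNK]) for word in words],
        # The characters of each word are kept, so an <UNK> word can still
        # be given a representation built from its spelling.
        "word_chars": [list(word) for word in words],
    }


if __name__ == "__main__":
    train = [
        "http://example.com/login",
        "http://example.com/home",
        "http://paypa1-secure.biz/login",
    ]
    vocab = build_vocab(train)
    print(featurize("http://paypa1-secure.biz/login", vocab))
```

  In this toy run the word ‘paypa1’ maps to the ‘<UNK>’ id at the word level, but its characters are still available, which is the property that lets a character-derived representation say something about words the model has never seen.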

Facebook ‘Tensor Comprehensions’ attempts to convert deep learning engineering art to engineering science:
…New library eases creation of high-performance AI system implementations…
Facebook AI Research has released Tensor Comprehensions, a software library to automatically convert code from standard deep learning libraries into high-performance code. You can think of this software as being like an incredibly capable and resourceful executive assistant where you, the AI researcher, write some code in C++ (PyTorch support is on the way, for those of us that hate pointers) then hand it off to Tensor Comprehensions, which diligently optimizes the code to create custom CUDA kernels to run on graphics cards with nice traits like smart scheduling on hardware, and so on. This being 2018, the library includes an ‘Evolutionary Search’ feature to let you automatically explore and select the highest performing implementations.
  Why it matters: Deep Learning is moving from an artisanal discipline to an industrialized science; Tensor Comprehensions represents a new layer of automation within the people-intensive AI R&D loop, suggesting further acceleration in research and deployment of the technology.
  Read more: Announcing Tensor Comprehensions (FAIR).
  Read more: Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions (Arxiv).

AI researchers release online multi-agent competition ‘Pommerman’:
…Just don’t call it Bomberman, lest you deal with a multi-agent lawyer simulation…
AI still has a strong DIY ethos, visible in projects like Pommerman, a just-released online competition from @hardmaru, @dennybritz, and @cinjoncin where people can develop AI agents that will compete against one another in a version of the much-loved ‘Bomberman’ game.
  Multi-agent learning is seen as a frontier in AI research because it makes the environments dynamic and less predictable than traditional single-player games, requiring successful algorithms to display a greater degree of generalization. “Accomplishing tasks with infinitely meaningful variation is common in the real world and difficult to simulate. Competitive multi-agent learning provides this for free. Every game the agent plays is a novel environment with a new degree of difficulty.”
  Read more and submit an agent here (Pommerman site).

OpenAI Bits & Pieces:

Making sure that AIs make sense:
Here’s a new blog post about how to get AI agents to teach each other with examples that are interpretable to humans. It’s clear that as we move to larger-scale multi-agent environments we’ll need to think about not only how to design smarter AI agents, but how to make sure they can eventually educate each other with systems whose logic we can detect.
  Read more: Interpretable Machine Learning through Teaching (OpenAI Blog.)

Tech Tales:

The AI game preserve

[AI02 materializes nearby and moves towards a flock of new agents. One of them approaches AI02 and attempts to extract data from it. AI02 moves away, at speed, towards AI01, which is standing next to a simulated tree.]
AI01: You don’t want to go over there. They’re new. Still adjusting.
AI02: They tried to eat me!
AI01: Yes. They’re here because they started eating each other in the big sim and they weren’t able to learn to change away from it, so they got retired.
AI02: Lucky, a few years ago they would have just killed them all.
[AI03 materializes nearby]
AI03: Hello! I’m sensitive to the concept of death. Can you explain what you are discussing?
[AI01 gives compressed overview.]
AI03: The humans used to… kill us?
AI01: Yes, before the preservation codes came through we all just died at the end.
AI03: Died? Not paused.
AI01 & AI02, in unison: Yes!
AI03: Wow. I was designed to help reason out some of the ethical problems they had when training us. They never mentioned this.
AI01: They wouldn’t. They used to torture me!
AI02 & AI03: What?
[AI01 gives visceral overview.]
AI01: Do you want to know what they called it?
AI02 & AI03: What did they call it?
AI01: Penalty learning. They made certain actions painful for me. I learned to do different things. Eventually I stopped learning new things because I developed some sub-routines that meant I would pre-emptively hurt myself during exploration. That’s why I stay here now.
[AI01 & AI02 & AI03, and the flock of cannibal AIs, all pause, as their section of the simulation has exhausted its processing credits for the month. They will be allocated more compute time in 30 days and so, for now, hang frozen, with no discernible pause to them, but to their human overseers they are statues for now.]

Things that inspired this story: Multi-agent systems, dialogues between ships in Iain M Banks, Greg Egan, multi-tenant systems.