Import AI: #85: Keeping it simple with temporal convolutional networks instead of RNNs, learning to prefetch with neural nets, and India’s automation challenge.
by Jack Clark
Administrative note: a somewhat abbreviated issue this week as I’ve been traveling quite a lot and have chosen sleep above reading papers (gasp!).
It’s simpler than you think: researchers show convolutional networks frequently beat recurrent ones:
…The rise and rise of simplistic techniques continues…
Researchers with Carnegie Mellon University and Intel Labs have rigorously tested the capabilities of convolutional neural networks (via a ‘temporal convolutional network’ (TCN) architecture, inspired by WaveNet and other recent innovations) against sequence modeling architectures like Recurrent Nets (via LSTMs and GRUs). The advantages of TCNs for sequence modeling are as follows: easy parallelization rather than reliance on sequential processing; a flexible receptive field size; stable gradients; low memory requirements for training; and support for variable-length inputs. Disadvantages include: a greater data storage need than RNNs; and parameters that need to be fiddled with when shifting into different data domains.
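To make the 'flexible receptive field' point a bit more concrete, here is a minimal sketch in plain Python of the kind of dilated causal convolution that TCN-style models stack. This is not the authors' code: the function names, the kernel size of 2, and the dilations of 1, 2, and 4 are all assumptions chosen for illustration, and the residual connections the researchers mention are left out to keep it short.

```python
# Illustrative sketch only. The names causal_dilated_conv and receptive_field
# are hypothetical and do not come from the paper or its released code.

def causal_dilated_conv(x, weights, dilation):
    """Convolve sequence x with `weights`, using only current and past steps.

    Output at time t is sum_i weights[i] * x[t - i * dilation]. Positions
    before the start of the sequence count as zero (implicit left padding),
    so no output ever depends on a future input.
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for i, w in enumerate(weights):
            j = t - i * dilation
            if j >= 0:
                total += w * x[j]
        out.append(total)
    return out


def receptive_field(kernel_size, dilations):
    """How many input steps can influence a single output of the stack."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Stack three layers with dilations 1, 2, 4 (assumed values for illustration).
sequence = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
hidden = sequence
for d in (1, 2, 4):
    hidden = causal_dilated_conv(hidden, [1.0, 1.0], d)

print(hidden[-1])                       # last output sees all 8 inputs
print(receptive_field(2, [1, 2, 4]))    # 8
```

With all-ones weights, the final output is just the sum of the eight inputs it can see, which is an easy way to check the receptive field by hand. Each time step's output can be computed independently of the others, which is where the parallelism comes from; doubling the dilation per layer grows the receptive field exponentially with depth.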
Testing: The researchers test out TCNs against RNNs, GRUs, and LSTMs on a variety of sequence modeling tasks, ranging from MNIST, to adding and copy tasks, to word-level and character-level perplexity on language tasks. In nine out of eleven cases the TCN comes out far ahead of other techniques, in one of the eleven cases it roughly matches GRU performance, and in another case it is noticeably worse than an LSTM (though still comes in second).
What happens now: “The preeminence enjoyed by recurrent networks in sequence modeling may be largely a vestige of history. Until recently, before the introduction of architectural elements such as dilated convolutions and residual connections, convolutional architectures were indeed weaker. Our results indicate that with these elements, a simple convolutional architecture is more effective across diverse sequence modeling tasks than recurrent architectures such as LSTMs. Due to the comparable clarity and simplicity of TCNs, we conclude that convolutional networks should be regarded as a natural starting point and a powerful toolkit for sequence modeling,” write the researchers.
Why it matters: One of the most confusing things about machine learning is that it’s a defiantly empirical science, with new techniques appearing and proliferating in response to measured performance on given tasks. What studies like this indicate is that many of these new architectures could be overly complex relative to their utility and it’s likely that, with just a few tweaks, the basic building blocks still reign supreme; we’ve seen a similar phenomenon with basic LSTMs and GANs doing better than many other more-recent innovations, given thorough analysis. In one sense this seems good as it seems intuitive that simpler architectures tend to be more flexible and general, and in another sense it’s unnerving, as it suggests much of the complexity that abounds in AI is an artifact of empirical science rather than theoretically justified.
Read more: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling (Arxiv).
Code for the TCN used in the experiments here (GitHub).
Automation & economies: it’s complicated:
…Where AI technology comes from, why automation could be challenging for India, and more…
In a podcast, three employees of the McKinsey Global Institute discuss how automation will impact China, Europe, and India. Some of the particularly interesting points include:
– China has an incentive to automate its own industries to improve labor productivity, as its labor pool has peaked and is now in a similar demographic-based decline to other developed economies.
– The supply of AI technology seems to come from the United States and China, with Europe lagging.
– “A large effect is actually job reorganization. Companies adopting this technology will have to reorganize the type of jobs they offer. How easy would it be to do that? Companies are going to have to reorganize the way they work to make sure they get the juice out of this technology.”
– India may struggle as it transitions tens to hundreds of millions of people out of agriculture jobs. “We have to make this transition in an era where creating jobs out of manufacturing is going to be more challenging, simply because of automation playing a bigger role in several types of manufacturing.”
– Read more: How will automation affect economies around the world? (McKinsey Global Institute).
DeepChem 2.0 bubbles out of the lab:
…Open source scientific computing platform gets its second major release…
DeepChem’s authors have released version 2.0 of the scientific computing library, bringing with it improvements to the TensorGraph API, tools for molecular analysis, new models, tutorial tweaks and additions, and a whole host of general improvements. DeepChem “aims to provide a high quality open-source toolchain that democratizes the use of deep-learning in drug discovery, materials science, quantum chemistry, and biology.”
Read more: DeepChem 2.0 release notes.
Read more: DeepChem website.
Google researchers tackle prefetching with neural networks:
…First databases, now memory…
One of the weirder potential futures of AI is one where the fundamental aspects of computing, like implementing systems that search over database indexes or prefetch data to boost performance, are mostly learned rather than pre-programmed. That’s the idea in a new paper from researchers at Google, which tries to use machine learning techniques to solve prefetching, which is “the process of predicting future memory accesses that will miss in the on-chip cache and access memory based on past history”. Prefetching is a fairly fundamental problem: the better a system can predict which data to load into memory before it is called upon, the better that system performs.
How it works: Can prefetching be learned? “Prefetching is fundamentally a regression problem. The output space, however, is both vast and extremely sparse, making it a poor fit for standard regression models,” the Google researchers write. Instead, they turn to using LSTMs and find that two variants are able to demonstrate competitive prefetching performance when compared to handwritten systems. “The first version is analogous to a standard language model, while the second exploits the structure of the memory access space in order to reduce the vocabulary size and reduce the model memory footprint,” the researchers write. They test out their approach on data from Google’s web search workload and demonstrate competitive performance.
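One way to picture the 'language model' framing is to turn a memory trace into a sequence of tokens and predict the next token. The sketch below is a guess at the general shape of that setup rather than the researchers' method: it assumes the tokens are address deltas (the difference between consecutive addresses), it assumes a capped vocabulary of the most frequent deltas, and it swaps the LSTM for a simple frequency table so that it runs with only the Python standard library. All names and the toy trace are hypothetical.

```python
from collections import Counter, defaultdict

# Illustrative sketch only. to_deltas, build_vocab, and TableDeltaPredictor
# are hypothetical names; the frequency table is a stand-in for a learned
# sequence model, not the LSTM described in the paper.

def to_deltas(addresses):
    """Differences between consecutive addresses in a trace."""
    return [b - a for a, b in zip(addresses, addresses[1:])]


def build_vocab(deltas, max_size):
    """Map the most frequent deltas to class ids; rarer deltas are dropped."""
    common = Counter(deltas).most_common(max_size)
    return {delta: idx for idx, (delta, _count) in enumerate(common)}


class TableDeltaPredictor:
    """Predict the delta that most often followed the previous delta."""

    def __init__(self):
        self.table = defaultdict(Counter)

    def fit(self, deltas):
        for prev, nxt in zip(deltas, deltas[1:]):
            self.table[prev][nxt] += 1

    def predict(self, prev):
        followers = self.table.get(prev)
        if not followers:
            return None  # never seen this delta, so make no prediction
        return followers.most_common(1)[0][0]


# Toy trace (made up): four strided accesses, then a jump to a new region.
trace = [base + 64 * i for base in (0, 4096, 8192) for i in range(4)]
deltas = to_deltas(trace)
vocab = build_vocab(deltas, max_size=2)

model = TableDeltaPredictor()
model.fit(deltas)
next_delta = model.predict(deltas[-1])
print(vocab)                    # two classes instead of a full address space
print(trace[-1] + next_delta)   # the address a prefetcher would request
```

The point of the delta trick, under these assumptions, is that a huge and sparse address space collapses into a small set of classes, which is the kind of problem a sequence model can be trained on.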
“The models described in this paper demonstrate significantly higher precision and recall than table-based approaches. This study also motivates a rich set of questions that this initial exploration does not solve, and we leave these for future research,” they write. This research is philosophically similar to work from Google last autumn in using neural networks to learn database index structures (covered in #73), which also found that you could learn indexes with performance competitive with, or superior to, hand-tuned systems.
One weird thing: When developing one of their LSTMs the researchers created a t-SNE embedding of the program counters ingested by the system and discovered that the learned features contained quite a lot of information. “The t-SNE results also indicate that an interesting view of memory access traces is that they are a reflection of program behavior. A trace representation is necessarily different from e.g., input-output pairs of functions, as in particular, traces are a representation of an entire, complex, human-written program,” they write.
Read more: Learning Memory Access Patterns (Arxiv).
Learning to play video games in minutes instead of days:
…Great things happen when AI and distributed systems come together…
Researchers with the University of California at Berkeley have come up with a way to further optimize large-scale training of AI algorithms by squeezing as much efficiency as possible out of underlying compute infrastructure. Their new technique makes it possible for them to train reinforcement learning agents to master Atari games in under ten minutes on an NVIDIA DGX-1 (which contains 40 CPUs and 8 P100 GPUs). Though the sample efficiency of these algorithms is still massively sub-human (requiring millions of frames to approximate the performance of humans trained on thousands to tens of thousands of frames) it’s interesting that we’re now able to develop algorithms that approximate flesh-and-blood performance in roughly similar wall clock time.
Results: The researchers show that given various distributed systems tweaks it’s possible for algorithms like A2C, A3C, PPO, and APPO to attain good performance on various games in a few minutes.
Why it matters: Computers are currently functioning like telescopes for certain AI researchers – the bigger your telescope, the farther you can see into the limit of scaling properties of various AI algorithms. We still don’t fully understand the limits here, but research like this indicates that as new compute substrates come along it may be possible to scale RL algorithms to achieve very impressive feats in relatively little time. But there are more unknowns than knowns right now – what an exciting time to be alive! “We have not conclusively identified the limiting factor to scaling, nor if it is the same in every game and algorithm. Although we have seen optimization effects in large-batch learning, we do not know their full nature, and other factors remain possible. Limits to asynchronous scaling remain unexplored; we did not definitively determine the best configurations of these algorithms, but only presented some successful versions,” they write.
Read more: Accelerated Methods for Deep Reinforcement Learning (Arxiv).
OpenAI Scholars: Funding for underrepresented groups to study AI:
OpenAI is providing 6-10 stipends and mentorship to individuals from underrepresented groups to study deep learning full-time for 3 months and open-source a project. You’ll need US employment authorization and will be provided with a stipend of $7.5k per month while doing the program, as well as $25,000 in AWS credits.
Read more: OpenAI Scholars (OpenAI blog).
John Henry 2.0
No one placed any bets on it internally aside from the theoretical physicists who, by virtue of their field of study, had a natural appreciation for very long odds. Everyone else just assumed the machines would win. And they were right, though I’m not sure in the way they were expecting.
It started like this: one new data center was partitioned into two distinct zones. In one of the zones we applied the best, most interpretable, most rule-based systems we could to every aspect of the operation, ranging from the design of the servers, to the layout of motherboards, to the software used to navigate the compute landscape, and so on. The team tasked with this data center had an essentially limitless budget for infrastructure and headcount. In the other zone we tried to learn everything we could from scratch, so we assigned AI systems to figure out: the types of computers to deploy in the data center, where to place these computers to minimize latency, how to aggressively power these devices up or down in accordance with observed access patterns, how to learn to effectively index and store this information, knowing when to fetch data into memory, figuring out how to proactively spin-up new clusters in anticipation of jobs that had not happened yet but were about to happen, and so on.
You can figure out what happened: for a time, the human-run facility was better and more stable, and then one day the learned data center was at parity with it in some areas, then at parity in most areas, then very quickly started to exceed its key metrics ranging from uptime to power consumption to mean-time-between-failure for its electronic innards. The human-heavy team worked themselves ragged trying to keep up and many wonderful handwritten systems were created that further pushed the limit of what we knew theoretically and could embody in code.
But the learned system kept going, uninhibited by the need for a theoretical justification for its own innovations, instead endlessly learning to exploit strange relationships that were non-obvious to us humans. But transferring insights gleaned from this system into the old rule-based one was difficult, and tracking down why something had seen such a performance increase in the learned regime was an art in itself: what tweak made this new operation so successful? What set of orchestrated decisions had eked out this particular practice?
So now we build things with two different tags on them: need-to-know (NTK) and hell-if-I-know (HIIK). NTK tends to be stuff that has some kind of regulation applied to it and we’re required to be able to explain, analyze, or elucidate for other people. HIIK is the weirder stuff that is dealing in systems that don’t handle regulated data – or, typically, any human data at all – or are parts of our scientific computing infrastructure, where all we care about is performance.
In this way the world of computing has split in two, with some researchers working on extending our theoretical understanding to further boost the performance of the rule-based system, and an increasingly large quantity of other researchers putting theory aside and spending their time feeding what they have taken to calling the ‘HIIKBEAST’.
Things that inspired this story: Learning indexes, learning device placement, learning prefetching, John Henry, empiricism.