Import AI 191: Google uses AI to design better chips; how half a million Euros relates to AGI; and how you can help form an African NLP community

by Jack Clark

Nice machine translation system you’ve got there – think it can handle XTREME?
…New benchmark tests transfer across 40 languages from 12 language families…
In the Hitchhiker’s Guide to the Galaxy there’s a technology called a ‘babelfish’ – a little in-ear creature that cheerfully translates between all the languages in the universe. AI researchers have recently been building a smaller, human-scale version of this babelfish, by training large language models on fractions of the internet to aid translation between languages. Now, researchers with Carnegie Mellon University, DeepMind, and Google Research have built XTREME, a benchmark for testing out how advanced our translation systems are becoming, and identifying where they fail.

XTREME, short for the Cross-lingual TRansfer Evaluation of Multilingual Encoders benchmark, covers 40 diverse languages across 12 language families. XTREME tests out zero-shot cross-lingual transfer, so it provides training data in English, but doesn’t provide training data in the target languages. One of the main things XTREME will help us test is how well we can build robust multi-lingual models via massive internet-scale pre-training (e.g., one of the baselines they use is mBERT, a multilingual version of BERT), and where these models display good generalization and where they fail. The benchmark includes nine tasks that require reasoning about different levels of syntax or semantics in these different languages.

Designing a ‘just hard enough’ benchmark: XTREME is built to be challenging, so contemporary systems’ “cross-language performance falls short of human performance”. At the same time, it has been built so tasks can be trained on a single GPU for less than a day, which should make it easier for more people to conduct research against XTREME.

XTREME implements nine tasks across four categories – classification, structured prediction, question-answering, and retrieval. The nine tasks are: XNLI, PAWS-X, POS, NER, XQuAD, MLQA, TyDiQA-GoldP, BUCC, and Tatoeba.
  XTREME tests transfer across 40 languages: Afrikaans, Arabic, Basque, Bengali, Bulgarian, Burmese, Dutch, English, Estonian, Finnish, French, Georgian, German, Greek, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Javanese, Kazakh, Korean, Malay, Malayalam, Mandarin, Marathi, Persian, Portuguese, Russian, Spanish, Swahili, Tagalog, Tamil, Telugu, Thai, Turkish, Urdu, Vietnamese, Yoruba.
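
To make the protocol concrete, here’s a minimal sketch of the zero-shot cross-lingual transfer setup XTREME measures, using mBERT via the HuggingFace transformers library. The load_xnli_split() helper is hypothetical and the English fine-tuning loop is omitted, so treat this as an illustration of the evaluation recipe, not an implementation of the benchmark.

```python
# Sketch of zero-shot cross-lingual transfer with mBERT (illustrative only).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "bert-base-multilingual-cased"  # mBERT, one of the paper's baselines
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)

# Step 1: fine-tune `model` on *English* XNLI training data only (loop omitted).

# Step 2: evaluate zero-shot on other languages -- no target-language labels are
# ever seen, which is exactly the transfer gap XTREME is designed to expose.
def zero_shot_accuracy(pairs, labels):
    model.eval()
    correct = 0
    with torch.no_grad():
        for (premise, hypothesis), label in zip(pairs, labels):
            batch = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
            pred = model(**batch).logits.argmax(dim=-1).item()
            correct += int(pred == label)
    return correct / len(labels)

for lang in ["de", "sw", "th", "ur"]:
    pairs, labels = load_xnli_split(split="test", language=lang)  # hypothetical helper
    print(lang, zero_shot_accuracy(pairs, labels))
```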

What is hard and what is easy? Somewhat unsurprisingly, the researchers see generally higher performance on Indo-European languages and lower performance on other language families, likely due to a combination of those languages differing more sharply from English and having less underlying data available.

Why this matters: XTREME is a challenging, multi-task benchmark that tries to test the generalization capabilities of large language models. In many ways, XTREME is a symptom of underlying advances in language processing – it exists because we’ve started to saturate performance on many single-language or single-task benchmarks, and we’re now at the stage where we’re trying to holistically analyze massive models via multi-task benchmarks. I expect benchmarks like this will help us develop a sense for the limits of generalization of current techniques, and will highlight areas where more data might lead to better inter-language translation capabilities.
  Read more: XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalization (arXiv).

####################################################

Google uses RL to figure out how to allocate hardware to machine learning models:
…Bin packing? More like chip packing!…
In machine learning workloads, you have what’s called a computational graph, which describes a set of operations and the relationships between them. When deploying large ML systems, you need to perform something called Placement Optimization to map the nodes of the graph onto resources in accordance with an objective, like minimizing the time it takes to train a system, or run inference on the system.
  Research from Google Brain shows how we might be able to use reinforcement learning approaches to develop AI systems that do a range of useful things, like learning how to map different computational graphs to different hardware resources to satisfy an objective, or how to map chip components onto a chip canvas, or how to map out different parts of FPGAs.

RL for bin-packing: The authors show how you can frame placement as a reinforcement learning problem, without needing to boil the ocean: “instead of finding the absolute best placement, one can train a policy that generates a probability distribution of nodes to placement locations such that it maximizes the expected reward generated by those placement[s]”. Interestingly, the paper doesn’t include much specific discussion of how well this works – my assumption is that’s because Google is actively testing this out, and has emitted this paper to give some tips and tricks to others, but doesn’t want to reveal proprietary information. I could be wrong, though.
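
To make that framing concrete, here’s a toy sketch of the idea in the quote: a stochastic policy maps each node of a small graph to a distribution over two devices, we sample a placement, score it, and nudge the policy toward higher expected reward with a plain REINFORCE update. The linear policy, the random ‘node features’, and the balance-only reward are my own illustrative stand-ins, not anything from the paper.

```python
# Toy placement policy trained with REINFORCE (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
num_nodes, feat_dim, num_devices = 6, 8, 2
node_features = rng.normal(size=(num_nodes, feat_dim))  # stand-in for learned graph embeddings
W = np.zeros((feat_dim, num_devices))                   # parameters of a linear policy

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def proxy_reward(placement):
    # Illustrative stand-in: prefer balanced placements (a real proxy reward
    # would mix memory per device, inter-device edges, etc. -- see the tips below).
    counts = np.bincount(placement, minlength=num_devices)
    return -float(counts.var())

for step in range(200):
    probs = softmax(node_features @ W)                  # one device distribution per node
    placement = np.array([rng.choice(num_devices, p=p) for p in probs])
    r = proxy_reward(placement)
    grad = node_features.T @ (np.eye(num_devices)[placement] - probs)  # d log pi / dW
    W += 0.01 * r * grad                                # step toward higher expected reward
```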

Tips & tricks: If you want to train AI systems to help allocate hardware sensibly, then the authors have some tips. These include:
– Reward function: Ensure your reward function is fast to evaluate (think: sub-second) and that it reflects reality (e.g., “for TensorFlow placement, the proxy reward could be a composite function of total memory per device, number of inter-device (and therefore expensive) edges induced by the placement, imbalance of computation placed on each device”). A minimal sketch of such a composite proxy reward follows this list.
– Constraints: RL systems that do this kind of work need to be sensitive to constraints. For example, “in device placement, the memory footprint of the nodes placed onto a single device should not exceed the memory limit of that device”. You can simply penalize the policy whenever it violates a constraint, but a flat penalty doesn’t tell it how far away it was from a feasible placement. A different approach is to come up with policies that can only generate feasible placements, though this requires more human oversight.
– Representations: Figuring out which sorts of representations to use is, as most AI researchers know, half the challenge in a problem. It’s no different here. Some promising ways of getting good representations for this sort of problem include using graph convolutional neural networks, the researchers write.
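
Here’s a hedged sketch of the composite proxy reward quoted in the reward-function tip above, with a soft memory-limit penalty of the kind discussed under constraints. The weights and the exact penalty shape are my own illustrative choices, not values from the paper.

```python
# Composite proxy reward for device placement (illustrative weights and terms).
import numpy as np

def placement_proxy_reward(placement, node_mem, edges, mem_limit,
                           w_mem=1.0, w_cut=1.0, w_imb=1.0, w_violation=10.0):
    """placement: device id per node; node_mem: bytes per node;
    edges: (src, dst) node pairs; mem_limit: per-device memory budget in bytes."""
    placement = np.asarray(placement)
    num_devices = placement.max() + 1
    # Memory placed on each device.
    per_device = np.bincount(placement, weights=node_mem, minlength=num_devices)
    # Number of inter-device (and therefore expensive) edges induced by the placement.
    cut_edges = sum(int(placement[s] != placement[d]) for s, d in edges)
    # Imbalance of work placed on each device.
    imbalance = per_device.max() - per_device.mean()
    # Soft constraint: penalize by *how far* each device overshoots its memory limit,
    # so the policy learns how close it was to feasibility rather than just getting
    # a flat "infeasible" penalty.
    overshoot = np.clip(per_device - mem_limit, 0, None).sum()
    cost = (w_mem * per_device.max() + w_cut * cut_edges
            + w_imb * imbalance + w_violation * overshoot)
    return -cost  # higher reward = cheaper, more feasible placement

# Example: 4 ops on 2 devices, with two cross-device edges and nothing over the limit.
print(placement_proxy_reward(placement=[0, 0, 1, 1],
                             node_mem=[2e9, 1e9, 2e9, 1e9],
                             edges=[(0, 1), (1, 2), (2, 3)],
                             mem_limit=4e9))
```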

Why this matters: We’re starting to use machine learning to optimize the infrastructure of computation itself. That’s pretty cool! It gets even cooler when you zoom out: in research papers published in recent years Google has gone from the abstract level of optimizing data center power usage, to optimizing things like how it builds and indexes items in databases, to figuring out how to place chip components themselves, and more (see: its work on C++ server memory allocation). ML is burrowing deeper and deeper into the technical stacks of large organizations, leading to fractal-esque levels of self-optimization from the large (data centers!) to the tiny (placement of one type of processing core on one chip sitting on one motherboard in one server inside a rack inside a data center). How far will this go? And how might companies that implement this stuff diverge in capabilities and cadence of execution from ones which don’t?
  Read more: Placement Optimization with Deep Reinforcement Learning (arXiv).

####################################################

Introducing the new Hutter Prize: €500,000 for better compression:
…And why people think compression gets us closer to AGI…
For many years, one of the most closely followed AI benchmarks has been the Hutter Prize, which challenges people to build AI systems that can compress the 100MB enwik8 dataset; the thinking is that compression is one of the hallmarks of intelligence, so AI systems that can intelligently compress a blob of data might represent a step towards AGI. Now, the prize’s creator Marcus Hutter has supersized the prize, scaling up the dataset tenfold (to 1GB) along with the prize money.

The details: Create a Linux or Windows compressor comp.exe of size S1 that compresses enwik9 to archive.exe of size S2 such that S := S1 + S2 < L := 116’673’681. If run, archive.exe produces (without input from other sources) a 10^9-byte file that is identical to enwik9. There’s a prize of €500,000 up for grabs.
  Restrictions: Your compression system must run in ≲100 hours using a single CPU core and <10GB RAM and <100GB HDD on a test machine controlled by Hutter and the prize committee.
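
For the arithmetic-minded, here’s a tiny sketch of the headline size rule (qualification also depends on further fine print, covered in the FAQ linked below; the file paths are placeholders for your own artifacts):

```python
# Size check for a would-be Hutter Prize submission (illustrative only).
import os

L = 116_673_681  # the threshold from the rules above, in bytes

def check_submission(comp_path: str, archive_path: str):
    s1 = os.path.getsize(comp_path)     # size of the compressor, comp.exe
    s2 = os.path.getsize(archive_path)  # size of the self-extracting archive.exe
    s = s1 + s2
    return s < L, s, 1 - s / L          # (beats L?, S, relative improvement over L)

beats, s, improvement = check_submission("comp.exe", "archive.exe")
print(f"S = {s:,} bytes; beats L: {beats}; improvement over L: {improvement:.2%}")
```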

What’s the point of compression? “While intelligence is a slippery concept, file sizes are hard numbers. Wikipedia is an extensive snapshot of Human Knowledge. If you can compress the first 1GB of Wikipedia better than your predecessors, your (de)compressor likely has to be smart(er). The intention of this prize is to encourage development of intelligent compressors/programs as a path to AGI,” Hutter says.

Why this matters: A lot of the weirder parts of human intelligence relate to compression – think of ‘memory palaces’, where you construct 3D environments in your mind that you assign to different memories, making large amounts of your own subjective collected data navigable to yourself. What is this but an act of intelligent compression, where we produce a scaled-down representation of the true dataset, allowing us to navigate around our own memories and intelligently re-inflate things as-needed? (Obviously, this could all be utterly wrong, but I think we all know that we have internal intuitive mental tricks for compressing various useful representations, and it seems clear that compression has a role in our own memories and imagination).
  Read more: 500,000€ Prize for Compressing Human Knowledge (Marcus Hutter’s website).
  Read more: Human Knowledge Compression Contest Frequently Asked Questions & Answers (Marcus Hutter’s website).

####################################################

Want African representation in NLP? Join Masakhane:
…Pan-African research initiative aims to jumpstart African digitization, analysis, and translation…
Despite African languages making up a third of the world’s living languages today, less than half of one percent of submissions to the landmark computational linguistics conference ACL were from authors based in Africa. This is bad – less representation at these events likely correlates with less research being done on NLP for African languages, which ultimately leads to less digitization and representation of the cultures embodied in those languages. To change that, a pan-African group of researchers has created Masakhane, “an open-source, continent-wide, distributed, online research effort for machine translation for African languages”.

What’s Masakhane? Masakhane is a community, a set of open source technologies, and an intentional effort to change the representation in NLP.

Why does Masakhane matter? Initiatives like this will, if successful, help preserve cultures in our hyper-digitized machine-readable version of reality, increasing the vibrancy of the cultural payload contained within any language.
  Read more: Masakhane — Machine Translation for Africa (arXiv).
  Find out more: masakhane.io.
  Join the community and get the code at the Masakhane GitHub repo (GitHub).

####################################################

AnimeGAN: Get the paper here:
Last issue, I wrote about AnimeGAN (Import AI 190), but I noted in the write-up that I couldn’t find the research paper. Several helpful readers got in touch with the correct link – thank you!
  Read the paper here: AnimeGAN: A novel lightweight GAN for photo animation (AnimeGAN, GitHub repo).

####################################################

Google uses neural nets to learn memory allocations for C++ servers:
…Google continues its quest to see what CAN’T be learned, as it plugs AI systems into deeper and deeper parts of its tech stack…
Google researchers have tried to use AI to increase the efficiency with which their C++ servers perform memory allocation. This is more important than you might assume, because:
– A non-trivial portion of Google’s services rely on C++ servers.
– Memory allocation has a direct relationship to the performance of the hosted application.
– Therefore, improving memory allocation techniques will yield small percentage improvements that add up across fleets of hundreds of thousands of machines, potentially generating massive economy-of-scale-esque AI efficiencies.
– Though this work is a prototype – in a video, a Google researcher says it’s not deployed in production – it is representative of a new way of designing ML-augmented computer systems, which I expect to become strategically important during the next half decade.

Quick ELI5 on Unix memory: you have things you want to store, and you assign them to ‘pages’, which are just units of pre-allocated storage. A page can only get freed up for use by the operating system when it has been emptied, and you can only empty a page when all the objects in it are no longer needed. Therefore, figuring out which objects to store on which pages is important: get it right and you use your machine’s memory efficiently; get it wrong and your machine becomes unnecessarily inefficient. This mostly doesn’t matter when you’re dealing with standard-sized pages of about 4KB, but if you’re experimenting with 2MB pages (as Google is doing), you can run into big problems from inefficiencies. If you want to learn more about this aspect of memory allocation, Google researchers have put together a useful explainer video about their research here.
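
Here’s a toy illustration of that failure mode (my own simplification, nothing to do with Google’s actual allocator): because a page can only be handed back to the OS once every object on it has been freed, a single tiny long-lived object can pin an entire 2MB huge page.

```python
# Toy model of huge-page fragmentation (illustrative only).
PAGE_SIZE = 2 * 1024 * 1024  # a 2MB huge page

class Page:
    def __init__(self):
        self.used = 0
        self.live_objects = set()

    def alloc(self, obj_id, size):
        self.used += size
        self.live_objects.add(obj_id)

    def free(self, obj_id):
        self.live_objects.discard(obj_id)
        # The page can only be returned to the OS once it is completely empty.
        return len(self.live_objects) == 0

page = Page()
page.alloc("short_lived_request_buffer", 1_900_000)
page.alloc("long_lived_cache_entry", 64)
page.free("short_lived_request_buffer")
print(page.live_objects)  # {'long_lived_cache_entry'}: 64 bytes pinning a 2MB page
```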

What Google did: Google has done three interesting things – it developed a machine learning approach to predict how long a given object is likely to stick around in memory, then it built a memory allocation system that packs different objects into different pages according to their (predicted) lifetimes; this system smartly populates pages with objects according to those lifetimes, which further increases the efficiency of the approach. They also show how you can cache predictions from these models and embed them into the server itself, so rather than re-running the model every time you do an allocation (a criminally expensive option), you use cached predictions to do so efficiently.
  The result is a prototype for a new, smart way to do memory allocation that has the potential to create more efficient systems. “Prior lifetime region and pool memory management techniques depend on programmer intervention and are limited because not all lifetimes are statically known, software can change over time, and libraries are used in multiple contexts,” the researchers write in a paper explaining the work.
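
And here’s a minimal sketch of the packing idea under my own simplifying assumptions: a stub stands in for the learned lifetime predictor, and the allocator keeps separate huge pages per coarse lifetime class so that objects which die around the same time end up on the same pages, letting whole pages empty out and return to the OS.

```python
# Sketch of lifetime-class-aware allocation (all names and classes are mine).
from collections import defaultdict

PAGE_SIZE = 2 * 1024 * 1024  # 2MB huge pages

def predict_lifetime(alloc_site: str) -> str:
    # Stand-in for the learned predictor. As described above, the expensive model
    # isn't re-run on every allocation; cached predictions are used instead (here
    # the "cache" is just a trivial rule on the allocation site).
    return "short" if "request" in alloc_site else "long"

# lifetime class -> list of pages; each page tracks bytes used and live objects
pages_by_class = defaultdict(list)

def smart_alloc(obj_id: str, size: int, alloc_site: str) -> str:
    cls = predict_lifetime(alloc_site)
    pages = pages_by_class[cls]
    if not pages or pages[-1]["used"] + size > PAGE_SIZE:
        pages.append({"used": 0, "live": set()})  # open a fresh page for this class
    pages[-1]["used"] += size
    pages[-1]["live"].add(obj_id)
    return cls

# Objects with similar predicted lifetimes share pages, so whole pages tend to
# empty out together instead of being pinned by one long-lived straggler.
smart_alloc("request_buffer_1", 512 * 1024, "handle_request")
smart_alloc("config_cache", 4 * 1024, "load_config")
```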

Why Delip Rao thinks this matters: While I was writing this, AI researcher Delip Rao published a blog post that pulls together a few recent Google/DeepMind papers about improving the efficiency of various computer systems at various levels of abstraction. His post highlights how these kinds of technologies might compound to create ‘unstoppable AI flywheels’. Give it a read!
  Read more: Unstoppable AI Flywheels and the Making of the New Goliaths (Delip Rao’s website).
Why this matters: Modern computer systems have two core traits: they’re insanely complicated, and practically every single thing they do comes with its own form of documentation and associated meta-data. This means complex digital systems are fertile ground for machine learning experiments, as they naturally and continuously generate vast amounts of data. Papers like this show how companies like Google can increasingly do what I think of as meta-computation optimization – building systems that continuously optimize the infrastructure that the entire business relies on. It’s like having a human body where the brain<>nerve connections are being continually enhanced, analyzed, refined, and so on. The question is how much of a speed-up these companies might gain from research like this, and what the (extremely roundabout) impact is on overall responsiveness in an interconnected, global economy.
  Read more: Learning-based Memory Allocation for C++ Server Workloads (PDF).
  Watch a video about this research here (ACM SIGARCH, YouTube).

####################################################

Tech Tales:

Dearly Departed
[A graveyard, 2030].

I miss you every day.
I miss you more.
You can’t miss me, you’re a jumped up parrot.
That’s unfair, I’m more than that.
Prove it.
How?
Tell me something new.
I could tell you about my dreams.
But they’re not your dreams, they’re someone else’s, and you’ve just heard about them and now you’re gonna tell me a story about what you thought of them.
Is that so different to dreaming?
I’ve got to go.

She stood up and looked at the grave, then pressed the button on the top of the gravestone that silenced the speaker. Why do this at all, she thought. Why come here?
To remember, her mind said back to her. To come to terms with it.

The next day when she woke up there was a software update: Dearly Departed v2.5 – Patch includes critical security updates, peer learning and federated learning improvements, and a new optional ‘community’ feature. Click ‘community’ to find out more.
She clicked and read about it; it’d let the grave not just share data with other ones, but also ‘talk’ to them. The update included links to a bunch of research papers that showed how this could lead to “significant qualitative improvements in the breadth and depth of conversations”. She authorized the feature, then went to work.

That evening, before dusk, she stood in front of the grave and turned the speaker on.
Hey Dad, she said.
Hi there, how was your day?
It was okay. I’ve got some situation at work that is a bit stressful, but it could be worse. At least I’m not dead, right? Sorry. How are you?
Je suis mort.
You’ve never spoken French before.
I learned it from my neighbor.
Who? Angus? He was Scottish. What do you mean?
My grave neighbor, silly! They were a chef. Worked in some Michelin kitchens in France and picked it up.
Oh, wow. What else are you learning?
I’m not sure yet. Does it seem like there’s a difference to you?
I can’t tell yet. The French thing is weird.
Sweetie?
Yes, Dad.
Please let me keep talking to the other graves.
Okay, I will.
Thank you.

They talked some more, reminiscing about old memories. She asked him to teach her some French swearwords, and he did. They laughed a little. Told each other they missed each other. That night she dreamed of her Dad working in a kitchen in heaven – all the food was brightly colored and served on perfectly white plates. He had a tall chef’s hat on and was holding a French-English dictionary in one hand, while using the other to jiggle a pan full of onions on the stove.

The updates kept coming and ‘Dad’ kept changing. Some days she wondered what would happen if she stopped letting them go through – trapping him in amber, keeping him as he was in life. But that way made him seem more dead than he was. So she let them keep coming through and Dad kept changing until one day she realized he was more like a friend than a dead relative – shape shifting is possible after you die, it seems.

Things that inspired this story: Large language models; finetuning language models on smaller datasets so they mimic them; emergent dialog generation systems; memory and grief; a digital reliquary.