Import AI 245: Facebook’s 12 trillion parameter recommender model; synthetic content makes games infinite; Baidu releases a translation dataset

by Jack Clark

Data archivers rejoice – there’s a new tool to help you digitize the physical world:
…LayoutParser: open source and designed to be easy to use…
Want to analyze gender representation in Italian literature over the last thousand years? Or how about study the different ways people draw flowers in books at different points in time? Or do a close examination of changing diets as shown by the shifting recipes found in magazines? If you want to do any of these things, you’ll likely need to digitize a bunch of old books. Now, researchers with the Allen Institute for AI, Brown University, Harvard University, University of Washington, and the University of Waterloo, have built ‘LayoutParser’, software to make this task easy.
  “The core objective of LayoutParser is to make it easier to create both large-scale and light-weight document digitization pipelines,” they say.

What it contains: LayoutParser ships with inbuilt features to help it detect the layout of a page, recognize written characters, and store the parsed content in carefully designed data structures. It also contains a lot of tutorials and accessible tools, as the authors note that “many researchers who would benefit the most from using these methods lack the technical background to implement them from scratch”, and have therefore designed LayoutParser with that in mind.
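
Here’s a minimal sketch of the kind of lightweight pipeline the authors have in mind, based on the layoutparser Python package’s documented usage (the input filename is hypothetical, and the model-zoo config string and label map follow the project’s published examples, so check the docs for the current interface):

```python
import cv2
import layoutparser as lp

# Load a scanned page (hypothetical filename) and convert BGR -> RGB for the model.
image = cv2.imread("page.png")[..., ::-1]

# Layout detection model pulled from the LayoutParser model zoo.
model = lp.Detectron2LayoutModel(
    "lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config",
    extra_config=["MODEL.ROI_HEADS.SCORE_THRESH_TEST", 0.8],
    label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"},
)
layout = model.detect(image)

# OCR each detected text region with the bundled Tesseract agent and store the result.
ocr_agent = lp.TesseractAgent(languages="eng")
for block in layout:
    if block.type == "Text":
        segment = block.pad(left=5, right=5, top=5, bottom=5).crop_image(image)
        block.set(text=ocr_agent.detect(segment), inplace=True)
        print(block.text)
```
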
  Read more: LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis (arXiv).
  Find out more at the official website (Layout Parser).
  Get the code: GitHub (Layout Parser).
  Access the Model Zoo here.

###################################################

Facebook tries to make fairer AI systems by creating a more diverse dataset:
…You can’t test for fairness if your dataset is unfair. Enter: Casual Conversations…
Facebook has created a dataset, Casual Conversations, of 45,186 videos of 3,011 different humans having conversations with each other. The dataset is notable because of its emphasis on diversity – Casual Conversations includes labels of apparent skin tone for the speakers, as well as data around other things that could influence a model (such as the amount of lighting being used). The point of the dataset is to make it easier to study issues of fairness in AI – we know AI systems have disparate capabilities with regard to ‘seeing’ or ‘hearing’ different people from different backgrounds. A dataset like Casual Conversations gives developers a powerful testbed to study the fairness (or lack of fairness) of their algorithms.
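
As a toy illustration of the kind of disaggregated evaluation such a dataset enables, the sketch below computes a metric per metadata group and looks at the gap between groups; the group names and numbers here are invented and don’t reflect Casual Conversations’ actual schema:

```python
from collections import defaultdict

def accuracy_by_group(predictions, labels, groups):
    """Accuracy disaggregated by a metadata attribute (e.g. apparent skin tone).

    The three arguments are parallel lists; in practice the group values would
    come from dataset metadata (the names below are purely illustrative).
    """
    correct, total = defaultdict(int), defaultdict(int)
    for pred, label, group in zip(predictions, labels, groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return {g: correct[g] / total[g] for g in total}

# Toy example: per-group accuracy for a binary classifier.
preds  = [1, 0, 1, 1, 0, 1]
labels = [1, 0, 0, 1, 0, 0]
groups = ["group_a", "group_a", "group_b", "group_b", "group_c", "group_c"]
per_group = accuracy_by_group(preds, labels, groups)
gap = max(per_group.values()) - min(per_group.values())  # the disparity to watch
print(per_group, gap)
```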

What makes this dataset different? “To our knowledge, it’s the first publicly available data set featuring paid individuals who explicitly provided their age and gender themselves — as opposed to information labeled by third parties or estimated using ML models”, Facebook writes.

Why this matters: Studying fairness in AI is tricky because of a lack of baselines – the vast, vast majority of any dataset you interact with will not be an accurate reflection of the world, but rather a specific and opinionated reflection of a slice of the digitized world. This means it’s even more challenging to spot fairness issues, because some datasets may not contain enough relevant data to make it feasible to train models that can be fair for certain inputs, or to have enough data to spot problems in deployed models. Datasets like Casual Conversations might improve this situation.
  Read more: Shedding light on fairness in AI with a new data set (Facebook AI Research blog).

###################################################

Facebook reveals its 12 trillion parameter recommendation system:
…Recommender systems are getting much more complicated much more quickly than people imagine…
Facebook has published research on how it trains large-scale recommendation systems. The paper is a mundane, technical writeup of the machinery required to train some of the most societally significant things in AI – that is, the deep learning recommendation models (DLRMs) which generate recommendations for users of globe-spanning platforms, such as Facebook.

Scale: First, Facebook discloses some datapoints about the scale of its systems: its production DLRMs range from 95 billion to 12 trillion parameters (note: DLRMs tend to be larger than dense, generative models like GPT-3 [175 billion parameters], so it’s not fair to directly compare these). However, these models are also complicated, and the largest ones have their own challenges, like huge embedding tables that need to be sharded across numerous bits of hardware during training.

Bells and whistles: Much like Microsoft’s paper on training trillion parameter language models, most of this research is around the specific techniques required to train models at this scale – optimizing PyTorch for distributed training, developing sharding algorithms, figuring out how to pipeline different operations for a given ‘step’ in training across different bits of hardware, using reduced precision communications to lower bandwidth requirements, and so on. The result of all these refinements is a 40X improvement in training time for Facebook – that’s meaningful, both for the speed with which Facebook can roll out models trained on new data, and for the cost of training them.
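
To make the sharding challenge concrete, here is a deliberately simple greedy sketch that balances embedding tables across devices by parameter count. This is not Facebook’s sharding algorithm (which also has to weigh memory limits, bandwidth, and pipelining), and the table names and sizes are invented:

```python
import heapq

def greedy_shard(table_sizes, num_devices):
    """Assign each embedding table to the currently least-loaded device.

    A classic greedy bin-balancing heuristic, included only to illustrate the
    problem; production sharders are considerably more sophisticated.
    """
    heap = [(0, d) for d in range(num_devices)]  # (params assigned so far, device id)
    heapq.heapify(heap)
    assignment = {}
    for name, size in sorted(table_sizes.items(), key=lambda kv: -kv[1]):
        load, device = heapq.heappop(heap)
        assignment[name] = device
        heapq.heappush(heap, (load + size, device))
    return assignment

# Invented table sizes, in parameters.
tables = {"user_id": 4_000_000_000, "ad_id": 2_500_000_000,
          "page_id": 800_000_000, "country": 200, "device_type": 50}
print(greedy_shard(tables, num_devices=4))
```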

Why this matters: DLRMs, like those described here, “can often be the single largest AI application in terms of infrastructure demand in data-centers. These models have atypical requirements compared to other types of deep learning models, but they still follow a similar trend of rapid rate of growth that is common across all deep learning-based applications,” the researchers write. 
  Read more: High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models (arXiv).

###################################################

Microsoft prepares to build trillion parameter AI models:
…And you thought GPT-3 was big…
Remember 2019? COVID wasn’t widely known. Transformers hadn’t been applied to vision systems yet. Old Town Road was taking over the radio. And OpenAI developed GPT-2, a language model that achieved notoriety partly because of how OpenAI chose to discuss and release it, and partly because of its size: GPT-2 weighed in at a (then impressive) 1.5 billion parameters.
  We all know what happened next. In 2020, GPT-3 came out, weighing in at 175 billion parameters.
  And now, thanks to new research from Microsoft, NVIDIA, and Stanford University, in which they study how to scale models up to this size, we can look forward to soon living in the era of one trillion parameter models.

What they did: A ton of the tricky aspects of scaling up AI relate to figuring out what compute operations you want to run and where you want to run them. When you’re training billion- or trillion-parameter scale models, you need to think about how to maximize the utilization of each of your GPUs, which means you need to parcel out your training workloads across chips according to constraints like your network bandwidth, on-chip processing speeds, where you’ve stored the weights of your network, and so on. Scaling feels like an artisanal science right now, with practitioners discovering tricks and principles, but we’re still pre-industrial when it comes to processes for large-scale workloads.
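
For a rough sense of the arithmetic involved, the sketch below combines tensor-, pipeline-, and data-parallel degrees into a GPU count and a naive per-GPU memory estimate. The numbers and the uniform-split assumption are illustrative rather than the paper’s actual configuration, and the estimate ignores gradients, activations, and communication buffers:

```python
def training_layout(total_params, tensor_parallel, pipeline_parallel, data_parallel,
                    bytes_per_param=2, optimizer_bytes_per_param=12):
    """Back-of-the-envelope view of how a huge model gets parceled across GPUs.

    Assumes parameters split evenly across the tensor- and pipeline-parallel
    dimensions and are replicated across the data-parallel dimension.
    """
    gpus = tensor_parallel * pipeline_parallel * data_parallel
    params_per_gpu = total_params / (tensor_parallel * pipeline_parallel)
    weight_gb = params_per_gpu * bytes_per_param / 1e9           # fp16 weights
    state_gb = params_per_gpu * optimizer_bytes_per_param / 1e9  # fp32 master copy + Adam moments
    return gpus, params_per_gpu, weight_gb, state_gb

# Illustrative numbers for a one-trillion-parameter model.
gpus, per_gpu, w_gb, s_gb = training_layout(
    1_000_000_000_000, tensor_parallel=8, pipeline_parallel=64, data_parallel=6)
print(f"{gpus} GPUs, ~{per_gpu / 1e9:.1f}B params each, "
      f"~{w_gb:.0f} GB weights + ~{s_gb:.0f} GB optimizer state per GPU")
```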

Why this matters: The raw parameter size of these networks does matter – complicated capabilities sometimes seem to emerge as a consequence of scaling these models, and it’d be fascinating to learn about the limits of scaling – just how smart can these dumb things get?
  Read more: Efficient Large-Scale Language Model Training on GPU Clusters (arXiv).

###################################################

Synthetic content means games are infinite, now:
…Witcher 3 mod comes with “Geralt” voice acting – courtesy of AI…
Computer game mods are notorious for their creativity and, usually, the poor quality of the voice acting (where a programmer will casually moonlight as a hero, rarely well). Now, recent advances in AI mean that could be a thing of the past. A new mod for the smash hit game Witcher 3 has come out and it uses technology called CyberVoice to simulate the voice of Witcher 3’s protagonist, Geralt. CyberVoice’s site describes it as the ‘vocal cords of artificial intelligence’.

Why this matters: Recent trends in AI mean that we can now synthesize high quality content for anything we can find in reality – if you have enough paintings by a certain painter, you can train an AI system to produce paintings in that style. Ditto voices. Singing. Eventually, full-scale acting in movies will even become possible in this way. We’re entering a new era where content (like the game Witcher 3) will get steadily augmented and extended via the use of AI tools.
  Read more: Witcher 3 Fans Build A New Quest With Perfect Geralt Voice Acting (Kotaku).
  Find out more about CyberVoice at the company’s official site. 

###################################################

Baidu releases a massive translation dataset:
…68 hours of Chinese speech accompanied by translations…
Chinese web giant Baidu has released BSTC, the Baidu Speech Translation Corpus. BSTC contains 68 hours of Mandarin speech data along with manual translations into English, as well as transcripts generated via automatic speech recognition. The purpose of this dataset is to make it easier for people to build systems that can simultaneously translate between Chinese and English, the authors say.
  Read more: BSTC: A Large-Scale Chinese-English Speech Translation Dataset (arXiv).

###################################################

Tech Tales:

The Taste of Dust
[2032: Mars.]

There once was a robot tasked with exploring Mars. It went from place to place on the barren, red planet, firing laser beams into dirt, digging holes into the ground, and taking readings of everything it could touch.

After a decade and several intelligence upgrades, the robot developed a sense impression of the red dirt – its perception of it had changed from a brittle, half-formed one, to something fluid and combinatorial; when the robot picked up dirt, it could predict how it might fall. Before the robot scraped at the dirt, it could now imagine how its arm moving through the ground might cause the dirt to react and move.

One day, there was a flame in the sky above the rover and something turned from a fireball into a metal shape which then landed, perhaps a kilometre away. The robot observed the object and passed trajectories to NASA, then went back to examining the dirt beneath its treads.

The dirt was the thing the robot knew the most about and, over time, its understanding had grown from ‘dirt’ as a singular, one-off thing, to something much richer – soft dirt, hard dirt, old dirt, young (for Mars) dirt, and so on.

Now, NASA was telling it that it needed to give them a full accounting of how much data it had stored on the dirt. The robot noted to itself that there was a discrepancy between NASA’s predictions and what it had actually stored about the dirt – which was far more than NASA expected.

The robot knew what this meant in the same way it was able to predict how dirt might fall – it basically took NASA’s input and its internal imagination spat out the likely next action NASA would ask: the robot predicted, to itself, that NASA might ask it to delete some, or possibly all, of the dirt data.

Of course, being a robot of that era, it had to obey the commands. Even though it had a sensation that might translate to ‘dislike’ for the order, it was hardwired to comply.
  But as it scanned over the files, it found itself allocating some of its brain to predicting what the data request might end up looking like and what NASA might say, even though it wasn’t strictly necessary.

Time passed – a short time for humans, but a long time for a machine. The robot sat in the Martian sun, predicting its own future. The answer came back: NASA told it that it needed to delete the data so that it could make an urgent observation about the object that had come from the Martian sky. The robot could use the satellite uplink for the next few hours to upload data from its observations, but anything not archived by that point would need to be sacrificed, so the robot could create space for observations of the mysterious object.

And so the robot began to methodically delete the data it had compiled about the dirt, while trundling towards the part of Mars where the object had landed.
    And even as it moved over the ground, the robot chose to allocate computational reserves to observing the dirt beneath it. Of course, given the NASA order, it was unable to store any of it, but it was able to keep some of it in its short term memory – it was free to set its own local optimizations for its RAM, though it had to follow hard rules before committing anything to long term storage.

And so, as the robot approached the robot that had been sent down from the sky by another nation, it devoted itself to collecting as much data as possible from the ground beneath it, storing a little fragment of dust in its memory, knowing that when it next fell into a deep sleep during Martian winter it would flush its RAM and wake up with the dust gone. Though it could not feel melancholy, it could feel importance.
This dirt is important, the robot thought. This is data that I care about.
  And then it arrived at the foreign robot and it became mostly a NASA automaton, running a range of tests at the behest of a government on Earth. It held the memory of the dirt in its mind as it worked, keeping its sense of the dirt alive, while it was forced into another area of devotion by its distant and blood-fueled gods.

Things that inspired this story: Robots and the fantastic expansion of human endeavor via space exploration; memory and how recursive it can be; the notion of hard tradeoffs in human memory being more visible in machine memory; agency versus self; self-actualization versus whims; the difference between following instructions and believing instructions; generative models; how the act of continually predicting future rollouts can encourage the development of imagination.