Import AI 258: Game engines are data generators; Spanish language models; the logical end of civilization

Open source GPT-ers Eleuther turn one:
…What can some DIY hackers with a Discord channel and a mountain of compute do in a year? A lot, it turns out…
Eleuther, a collective of hackers working on open source AI projects, has recently celebrated their one year birthday by writing a retrospective about their work. For those who haven’t kept up to date, Eleuther is trying to do an open source replication of GPT-3 (and people affiliated with the organization have already released GPT-J, a surprisingly powerful code-friendly 6BN parameter model). They’ve also dabbled in a range of other open source projects. This retrospective gives a peek into what they’ve been working on and also gives us a sense of the ideology behind the organization – something we find interesting here at Import AI is the different release philosophies encapsulated by orgs like Eleuther, so keeping track of their thinking is worthwhile.
  Read more: What A Long, Strange Trip It’s Been: EleutherAI One Year Retrospective (Eleuther blog).

####################################################

Game engines are data generators now:
…Unity Perception represents the future of game engines…
Researchers with Unity Technologies, makers of the widely-used Unity game engine, have built an open source tool that lets AI researchers use Unity to generate data to train AI systems on. The ‘Unity Perception’ package “supports various computer vision tasks (including 2D/3D object detection, semantic segmentation, instance segmentation, and keypoints (nodes and edges attached to 3D objects, useful for tasks such as human-pose estimation))”, the authors write. The software also comes with systems to automatically label the generated data, along with tools for randomizing the assets used in a data generation task (which makes it easy to create additional data to train systems on to increase their robustness).
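To make the idea of randomized generation with automatic labels concrete, here is a toy sketch in plain Python. It is not the Unity Perception API (that package runs inside Unity); the function `generate_sample`, the 64x64 grid, and the three made-up object classes are all assumptions chosen for illustration. The point is only that when you place the objects yourself, the bounding-box labels come for free.

```python
import random


def generate_sample(rng, width=64, height=64, max_objects=3):
    """Build one toy 'scene' (a grid of class ids) plus its bounding-box labels.

    Hypothetical stand-in for a renderer: every object is a filled rectangle.
    """
    image = [[0] * width for _ in range(height)]  # 0 means background
    boxes = []
    for _ in range(rng.randint(1, max_objects)):
        class_id = rng.randint(1, 3)          # randomize which asset is used
        w = rng.randint(4, 16)                # randomize object size
        h = rng.randint(4, 16)
        x = rng.randint(0, width - w)         # randomize placement, kept in bounds
        y = rng.randint(0, height - h)
        for row in range(y, y + h):
            for col in range(x, x + w):
                image[row][col] = class_id    # later objects may cover earlier ones
        # The label is known exactly because we chose the placement ourselves.
        boxes.append({"class_id": class_id, "bbox": (x, y, w, h)})
    return image, boxes


rng = random.Random(0)  # fixed seed so the toy dataset is reproducible
dataset = [generate_sample(rng) for _ in range(100)]
print(len(dataset), "labeled samples generated")
```

Swapping the seed or widening the random ranges yields as many additional labeled samples as you have compute for, which is the property the paragraph above describes.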

Proving that it works: To test out the system, Unity also built ‘SynthDet’, a project where they used Unity Perception to generate synthetic data for 63 common grocery objects, then trained an object recognition system on it. They used their software to generate a synthetic dataset containing 400,000 images and 2D bounding box annotations, then also collected a real-world dataset of 1627 images of the 63 items. They then show that by pairing the synthetic data with the real data, they can get substantially improved performance. “Our results clearly demonstrate that synthetic data can play a significant role in computer vision model training,” they write.

Why this matters – data generators are engines, computers are electricity: I think of game engines like Unity as the equivalent to an engine that you might place in a factory, where here the factory is a datacenter. Systems like Unity help you take in a small amount of input fuel (e.g, a scene rendered in a 3D world), then run electricity (compute) through the engine (Unity) until you output a much larger dataset made possible by the initial fuel. You can then pair this output with ‘real’ data gathered via other means and in doing so improve the performance and efficiency of your AI factory. This feels like another important trend to look at when thinking about the steady industrialization of AI development.
  Read more: Unity Perception: Generate Synthetic Data for Computer Vision (arXiv).

####################################################

Can your algorithm handle the real world? Use the ‘Shifts’ dataset to find out:
…Distributional shift data from industrial sources = more of a real world dataset than usual…
Much of AI progress is reliant on algorithms doing well on certain narrow, pre-defined benchmarks. These benchmarks are based on datasets that simulate or represent tasks found in the real world. However, once these algorithms get deployed into the real world it can be quite common for them to break, because they encounter some situation which their dataset and benchmark didn’t represent. This phenomenon is called ‘distributional shift’.
  Now, researchers with (primarily) Russian tech company Yandex, along with ones at HSE University, Moscow Institute of Physics and Technology, University of Cambridge, University of Oxford, and the Alan Turing Institute, have developed the ‘Shifts Dataset’, which consists of “data taken directly from large-scale industrial sources and services where distributional shift is ubiquitous”.
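For a feel of what distributional shift does to a model, here is a minimal toy sketch in Python. It has nothing to do with the Shifts data itself: the two-Gaussian setup, the shift of 1.5, and the helper names (`make_data`, `fit_threshold`, `accuracy`) are all assumptions picked for illustration. A one-dimensional threshold classifier is fit on one distribution, then scored on fresh data from the same distribution and on data whose inputs have drifted.

```python
import random


def make_data(n, shift, rng):
    """Two classes drawn from Gaussians centered at -1 + shift and +1 + shift."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = (1.0 if label else -1.0) + shift
        data.append((rng.gauss(center, 0.5), label))
    return data


def fit_threshold(data):
    """Place the decision threshold midway between the two class means."""
    xs0 = [x for x, label in data if label == 0]
    xs1 = [x for x, label in data if label == 1]
    return (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2


def accuracy(data, threshold):
    """Fraction of points where 'x above threshold' matches the label."""
    return sum((x > threshold) == bool(label) for x, label in data) / len(data)


rng = random.Random(0)  # fixed seed for reproducibility
train = make_data(2000, 0.0, rng)
test_in = make_data(2000, 0.0, rng)        # same distribution as training
test_shifted = make_data(2000, 1.5, rng)   # inputs have drifted by 1.5

threshold = fit_threshold(train)
acc_in = accuracy(test_in, threshold)
acc_shifted = accuracy(test_shifted, threshold)
print(f"in-distribution accuracy: {acc_in:.3f}")
print(f"shifted accuracy:         {acc_shifted:.3f}")
```

The classifier looks strong on the benchmark-style held-out set and much weaker once the inputs move, even though nothing about the model changed.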

What data is in Shifts? Shifts contains tabular weather prediction data from the Yandex Weather service, machine translation data taken from the WMT robustness track and mined from Reddit (and annotated in-house by Yandex), and self-driving car data from Yandex’s self-driving car project. 
  Read more: Shifts: A Dataset of Real Distributional Shift Across Multiple Large-Scale Tasks (arXiv).
  Get the dataset from here (Yandex, GitHub).

####################################################

Buy Sophia the robot (for $80,000):
…Sure, little quadruped robots are cool, but what about the iconic (for better or for worse) human-robot?…
Sophia the robot is a fancy human-appearing robot made by Hanson Robotics. Sophia has become a lightning rod in the AI community for giving wildly unrealistic impressions of what AI is capable of. But the hardware is really, really nice. If you’ve got $80,000 to spare and want to buy a couple of 21st century animatronics, maybe put a bid in here. I, for one, would love to be invited to a rich person’s party where some fancy puppets might be swanning around. Bonus points if you lose the skirt and go for the full hybrid-frightener look. (You could always spend a rumored $75k on a Boston Dynamics ‘Spot’ robot, but where’s the fun in that).
  Consider buying a robot here (RobotShop).

####################################################

Spanish researchers embed Spanish culture into some large-scale RoBERTa models:
…National data for national models…
Researchers with the wonderfully named “Text Mining Unit” within the Barcelona Supercomputing Center have created a couple of Spanish-language RoBERTa models, helping them to imbue some AI tools with Spanish language and culture. This is part of a recent trend of countries seeking to build their own nationally/culturally representative AI models. Other examples include ‘HyperCLOVA’, a Korean-representing GPT-3 style model created by a startup named Naver (Import AI 251), and a Dutch RoBERTa (Import AI 182), among others.

What they did: They gathered 570GB of predominantly Spanish-language data, then trained a RoBERTa base and a RoBERTa large model on the dataset. In tests, their models generally did better than other pre-existing Spanish-focused BERT models.

The ethics of dragnet data fishing: In the past year, there’s been some debate about how large datasets should be constructed, where some people argue such datasets should be heavily curated by the people that gather them, while others argue they should be deliberately uncurated. Here, the researchers opt for what I’d call a curated uncurated strategy – they create three different types of data: theme-based (e.g., datasets relating to politics, feminism, etc.), event-based (events of significance to Spanish society), and domains at risk of disappearing (e.g., if a website is about to be shut down). You can find out more information here about the crawls. My expectation is most of the world will move to lightly curated dragnet fishing data gathering, as individual human curation may be too expensive and slow.
  Read more: Spanish Language Models (arXiv).
  Get the RoBERTa base model here (HuggingFace).
  Get the RoBERTa large model here (HuggingFace).

####################################################

Tech Tales:

Repetition and Recitation at the End of Time
[A historian in another Solar System, either now or thousands of years prior or thousands of years in the future]

He was a historian and he studied the long-dead by the traces they had created in the AI systems that had outlasted the civilization. It worked like this: he found a computational artefact, got it running, worked out how to prime it, then started plugging details in until the system would spit out data it had memorized about the individual’s life: home addresses, contact details, extracts of speeches they had made, and so on.

Of course, some of the data was fuzzy. Most AI systems trend towards a form of poetic license, much like how when people recite things from memory they have a tendency to embellish – to over-dramatize, or to insert illusory facts that come from their own lives and dreams.

But it was all they had to work with: the living beings that had made the AI were long dead, and so he made do with these bottled up representations of their culture. He wrote his reports and published them to the system-wide internet, where they were read and commented on. And, of course, ingested in turn by his own civilization’s AI systems.

Just a decade ago, the first AI probes had been sent out – trained artefacts embedded into craft and then sent, in hopes they might arrive at target systems intact and in stable orbits and then exist there, waiting to be found by other civilizations, other forms of life, who might probe them and learn to extract their secrets and develop an understanding of the civilization they came from. His own reports were in there, as well. So perhaps one day soon some being unlike him would sit down and try to extract his name and habits and details, eager to learn about the strange beings now showing up as zeros and ones in cold machines, sent into the dark.

Things that inspired this story: The recent discussion about memorization and recitation in neural nets; ideas about how culture gets represented within AI models; thoughts of space and the purpose of existing in space; the idea that there may be a more limited design space for AI than for biological life so perhaps such things as the above may be possible; hope for a stellar future and fear that if we don’t get to it, we will be known by our digital exhaust, captured in our generative models.