Import AI Newsletter 36: Robots that can (finally) dress themselves, rise of the Tacotron spammer, and the value in differing opinions in ML systems

by Jack Clark

Speak and (translate and) spell: sequence-to-sequence learning is an almost counter-intuitively powerful AI approach. In Sequence-to-Sequence Models Can Directly Transcribe Foreign Speech, academics show it’s possible to train a large neural network model to listen to audio in one language and automatically translate and transcribe it into text in another (here, Spanish speech into English text). The approach performs well relative to alternatives and has the additional virtue of being (relatively) simple…
…The scientists detect a couple of interesting traits that emerge once the system has been fed enough data. Specifically, “direct speech-to-text translation happens in the same computational footprint as speech recognition – the ASR and end-to-end ST models have the same number of parameters, and utilize the same decoding algorithm – narrow beam search. The end-to-end trained model outperforms an ASR-MT cascade even though it never explicitly searches over transcriptions in the source language during decoding.”
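…for the intuition, here’s a minimal sketch (mine, not the authors’) of a direct speech-to-translation sequence-to-sequence model: an encoder consumes source-language audio frames and an attention-equipped decoder emits target-language characters. Layer sizes and the dot-product attention are illustrative assumptions, and the paper’s beam-search decoding is omitted.

```python
# A minimal sketch (not the paper's model) of direct speech-to-translation:
# the encoder reads log-mel audio frames of the source speech, the decoder
# emits characters in the target language, attending over the audio frames.
import torch
import torch.nn as nn

class Speech2Translation(nn.Module):
    def __init__(self, n_mel=80, hidden=256, vocab=64):   # hypothetical sizes
        super().__init__()
        self.encoder = nn.LSTM(n_mel, hidden, num_layers=2,
                               batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.decoder = nn.LSTM(hidden, 2 * hidden, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 2 * hidden)
        self.out = nn.Linear(4 * hidden, vocab)

    def forward(self, mel, target_chars):
        enc, _ = self.encoder(mel)                        # (B, T_audio, 2H)
        dec, _ = self.decoder(self.embed(target_chars))   # (B, T_text, 2H)
        # Dot-product attention: each decoder step looks back over the audio.
        scores = torch.bmm(self.attn(dec), enc.transpose(1, 2))
        context = torch.bmm(torch.softmax(scores, dim=-1), enc)
        return self.out(torch.cat([dec, context], dim=-1))  # character logits

model = Speech2Translation()
mel = torch.randn(1, 500, 80)            # ~5 seconds of source-language audio features
chars = torch.randint(0, 64, (1, 30))    # target-language character ids (teacher forcing)
logits = model(mel, chars)               # (1, 30, 64): a distribution per output character
```
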
Read and speak: We’re entering a world where computers can convincingly synthesize voices using neural networks. First there was DeepMind’s WaveNet, then Baidu’s Deep Voice, and now, courtesy of Google, comes the marvelously named Tacotron. Listen to some of the (freakily accurate) samples, or read the research outlined in Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model. Perhaps the most surprising thing is how the model learns to change its intonation, tilting the pitch up at the end of sentences that close with a question mark.

Politeness can be learned: Scientists have paired SoftBank’s cute Pepper robot with reinforcement learning techniques to build a system that can learn social niceties through a (smart) trial and error process.
…The robot is trained via reinforcement learning and is rewarded when people shake its hand. In the process, it learns that behaviors like looking at a person or waving at them can encourage them to approach and offer a handshake.
…It also learns to read some very crude social cues, as it is penalized for attempting handshakes when none are wanted (a toy sketch of this reward logic follows below)…
…You can read more about this in ‘Robot gains Social Intelligence through Multimodal Deep Reinforcement Learning’.
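…for the curious, here’s a toy sketch of the trial-and-error reward logic described above: a payoff for an accepted handshake, a small penalty for an unwanted attempt, and nothing otherwise. The two-state world, the numbers, and the tabular Q-learner standing in for the paper’s multimodal deep Q-network are all my own illustrative assumptions.

```python
import random

# Toy stand-in for the paper's multimodal deep Q-network: a tabular Q-table
# over two coarse states, trained with the reward logic described above.
ACTIONS = ["wait", "look_at_person", "wave", "attempt_handshake"]

def reward(action, handshake_accepted):
    """+1 for an accepted handshake, a small penalty for an unwanted attempt."""
    if action == "attempt_handshake":
        return 1.0 if handshake_accepted else -0.1
    return 0.0

Q = {(s, a): 0.0 for s in ("person_far", "person_near") for a in ACTIONS}
alpha, gamma, eps = 0.1, 0.9, 0.2   # learning rate, discount, exploration

def step(state):
    # Epsilon-greedy action selection.
    if random.random() < eps:
        action = random.choice(ACTIONS)
    else:
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
    # Toy environment: a nearby person usually accepts a handshake attempt;
    # looking at or waving to a person draws them closer.
    accepted = state == "person_near" and random.random() < 0.7
    r = reward(action, accepted)
    next_state = "person_near" if action in ("look_at_person", "wave") else "person_far"
    # Standard one-step Q-learning update.
    Q[(state, action)] += alpha * (
        r + gamma * max(Q[(next_state, a)] for a in ACTIONS) - Q[(state, action)]
    )
    return next_state

state = "person_far"
for _ in range(5000):
    state = step(state)
print(Q)   # the robot learns to look/wave first, then attempt the handshake
```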

Thirsty, thirsty data centers: Google wants to draw up to 1.5 million gallons of water a day from groundwater supplies in Berkeley County to cool its servers – three times as much as the company’s current limit.

Facebook’s Split-Brain Networks: new research from Facebook, Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play (PDF), presents a simple technique to let agents learn to rapidly explore and analyze a problem, in this case a two-dimensional gridworld…
… the way it works is to give a single agent two distinct minds, Alice and Bob. Alice performs a series of actions, like opening a specific door and traveling through it, then Bob must perform those actions in reverse, traveling back to the door, closing it, and returning to Alice’s start position.
…this gives researchers a way to have the agent teach itself an ever-expanding curriculum of tasks, and encourages it to learn rich representations of how to solve them by having it reverse its own actions (a toy sketch of the reward structure follows below). This research is very early and preliminary, so I’ll be excited to see where Facebook take it next.
…This uses a couple of open source AI components. Specifically, MazeBase and RLLab.
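…here’s a toy rendering of the self-play reward structure as I read it (the constants are illustrative): Bob is paid to reverse Alice’s task quickly, while Alice is paid in proportion to how much longer Bob takes than she did, which pushes her to propose tasks just beyond Bob’s current ability – the automatic curriculum.

```python
# Toy version of the asymmetric self-play rewards (constants are illustrative).
# Alice spends t_alice steps setting a task; Bob spends t_bob steps reversing it.
def self_play_rewards(t_alice, t_bob, scale=0.1):
    reward_bob = -scale * t_bob                      # Bob: reverse the task as fast as possible
    reward_alice = scale * max(0, t_bob - t_alice)   # Alice: find tasks Bob can only just do
    return reward_alice, reward_bob

# Alice took 5 steps to set up the room; Bob needed 20 steps to restore it.
print(self_play_rewards(t_alice=5, t_bob=20))        # (1.5, -2.0)
```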

New semiconductor substrates for your sparse-data tasks: DARPA has announced the Hierarchical Identify Verify Exploit (HIVE) program, which seeks to create chips that make graph processing systems 1,000X more efficient than today’s. The proposed chips (PDF) are meant to be good for parallel processing and to have extremely fast access to memory, and the agency plans to fund new software and hardware systems to make this possible.
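…to get a feel for the workload class HIVE targets, here’s a toy PageRank sweep over an adjacency list – the scattered, data-dependent memory reads are exactly what conventional processors handle poorly. This is purely illustrative and has nothing to do with DARPA’s actual hardware or software.

```python
# Toy graph-analytics workload: a PageRank power-iteration loop over an
# adjacency list. The ranks[src] lookups hop unpredictably around memory,
# which is the access pattern HIVE-style hardware is meant to accelerate.
from collections import defaultdict

edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]      # tiny directed graph
out_degree = defaultdict(int)
incoming = defaultdict(list)
for src, dst in edges:
    out_degree[src] += 1
    incoming[dst].append(src)

nodes = {n for edge in edges for n in edge}
ranks = {n: 1.0 / len(nodes) for n in nodes}
damping = 0.85

for _ in range(20):                                    # power iteration
    ranks = {n: (1 - damping) / len(nodes)
                + damping * sum(ranks[src] / out_degree[src] for src in incoming[n])
             for n in nodes}

print(ranks)                                           # node 2 ends up most 'important'
```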

What’s up (with my eye), doc? How AI can still keep the human touch: new research paper from Google shows how to train AI to use the opinions of multiple human experts when coming up with its own judgements about some data…
… in this case, the Google researchers are attempting to use photos of eyes to diagnose ‘diabetic retinopathy’ – a degenerative eye condition. In the paper Who Said What: Modeling Individual Labelers Improves Classification, the scientists outline a system that is able to use multiple human opinions to create a smarter AI-based diagnosis system…
…Typical machine learning approaches are fed a large dataset of eye pictures labeled by human doctors, and would average the ratings of multiple doctors for a single eye image into a combined score. This, while useful, doesn’t capture the differing expertise of individual doctors. Google has sought to rectify that with a new ML approach (sketched below) that uses the multiple ratings per image as a signal to improve the overall accuracy of the system.
…“Compared to our baseline model of training on the average doctor opinion, a strategy that yielded state-of-the-art results on automated diagnosis of DR, our method can lower 5-class classification test error from 23.83% to 20.58%, a relative reduction of 13.6%,” they write…
…in other words, the variety of opinions (trained) humans can give about a given subject can be an important signal in itself.
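…here’s the sketch promised above – my hedged reading of the general idea, not the paper’s exact architecture: a shared image encoder feeds one small prediction head per doctor, each head is trained only on the images that doctor actually graded, and at test time the heads’ predicted distributions are averaged.

```python
# Hedged sketch of "model each labeler": a shared encoder with one output head
# per doctor, trained only where that doctor actually graded, averaged at test
# time. Sizes and the toy encoder are illustrative, not the paper's setup.
import torch
import torch.nn as nn

N_DOCTORS, N_CLASSES, FEAT = 8, 5, 128

class PerLabelerModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for a large CNN over fundus photographs.
        self.encoder = nn.Sequential(nn.Linear(512, FEAT), nn.ReLU())
        # One 5-class diagnosis head per doctor.
        self.heads = nn.ModuleList([nn.Linear(FEAT, N_CLASSES)
                                    for _ in range(N_DOCTORS)])

    def forward(self, x):
        h = self.encoder(x)
        return torch.stack([head(h) for head in self.heads], dim=1)  # (B, D, C)

model = PerLabelerModel()
images = torch.randn(4, 512)                          # pretend pre-extracted image features
labels = torch.randint(0, N_CLASSES, (4, N_DOCTORS))  # each doctor's grade per image
graded = torch.rand(4, N_DOCTORS) > 0.5               # which doctors graded which image

logits = model(images)
per_pair_loss = nn.CrossEntropyLoss(reduction="none")(
    logits.reshape(-1, N_CLASSES), labels.reshape(-1))
loss = (per_pair_loss * graded.reshape(-1)).sum() / graded.sum()  # only real grades count

# Inference: average the per-doctor predicted distributions into a consensus.
consensus = torch.softmax(logits, dim=-1).mean(dim=1)             # (B, N_CLASSES)
```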

Finally, a robot that can dress itself without needing to run physics simulations on a gigantic supercomputer: Clothes are hard, as everyone who has to get dressed in the morning knows. They’re even more difficult for robots, which have a devil of a time reasoning about the massively complex physics of fabrics and how those fabrics relate to their own metallic bodies. In a research paper, Learning to Navigate Cloth Using Haptics, scientists from the Georgia Institute of Technology and Google Brain outline a new technique to let a robot perform such actions. It works by decomposing the gnarly physics problem into something simpler: the robot represents itself as a set of ‘haptic sensing spheres’. These spheres sense nearby objects and let the robot break the problem of putting on or taking off clothes into a series of discrete steps performed over discrete entities…
…The academics tested it in four ways, “namely a sphere traveling linearly through a cloth tube, dressing a jacket, dressing a pair of shorts and dressing a T-shirt.” Encouraging stuff…
…components used: the neural network was trained using Trust Region Policy Optimization (TRPO), a PhysX cloth simulator was used to compute the fabric forces, and the feedback policy was represented as a multilayer perceptron with two hidden layers, each consisting of 32 hidden units (a sketch of such a network follows below).
…bonus: check out the Salvador Dali-esque videos of simulated robots putting on simulated clothes!
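…and the network sketch mentioned above: a two-hidden-layer, 32-unit multilayer perceptron mapping haptic readings from the sensing spheres to a motion command. The input and output sizes and the tanh nonlinearity are my assumptions; the TRPO training loop and the PhysX cloth simulator aren’t reproduced here.

```python
# The feedback policy as a two-hidden-layer, 32-unit MLP (layer sizes per the
# paper; input/output dimensions and tanh are my assumptions). It maps haptic
# readings from the sensing spheres to the next motion command.
import torch
import torch.nn as nn

N_SPHERES = 10                 # assumed number of haptic sensing spheres
HAPTIC_DIM = 3 * N_SPHERES     # e.g. a 3-D contact force sensed per sphere
ACTION_DIM = 3                 # e.g. a small displacement for the limb/end effector

policy = nn.Sequential(
    nn.Linear(HAPTIC_DIM, 32), nn.Tanh(),
    nn.Linear(32, 32), nn.Tanh(),
    nn.Linear(32, ACTION_DIM),
)

haptic_readings = torch.randn(1, HAPTIC_DIM)   # forces felt against the cloth
action = policy(haptic_readings)               # where to move next; TRPO would train this
```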

Import AI administrative note: Twitter threading superstar of the week! Congratulations to Subrahmanyam KVJ, who has mastered the obscure-yet-important art of twitter threading, with this comprehensive set of tweets about the impact of AI.

Personal Plug Alert:

Pleased to announce that a project I initiated last summer has begun to come out. It’s a series of interviews with experts about the intersection of AI, neuroscience, cognitive science, and developmental psychology. First up is an interview with talented stand-up comic and neural network pioneer Geoff Hinton. Come for the spiking synapse comments, stay for the Marx reference.

OpenAI bits&pieces:

DeepRL knowledge, courtesy of the Simons Institute: OpenAI/UCBerkeley’s Pieter Abbeel gave a presentation on Deep Reinforcement Learning at the Simons Institute workshop on Representation Learning. View the video of his talk and those of other speakers at the workshop here.

Ilya Sutskever on Evolution Strategies: Ilya gave an overview of our work on Evolution Strategies at an MIT Technology Review conference. Video here.

Tech Tales

[2025: The newsroom of a financial service, New York.]

“Our net income was 6.7 billion dollars, up three percent compared to the same quarter a year ago, up two percent when we take into account foreign currency effects. Our capital expenditures were 45 billion during the quarter, a 350 percent jump on last year. We expect to sustain or increase capex spending at this current level-” the stock starts to move. Hundreds of emails proliferate across trading terminals around the world:
350?!?
R THEY MAD?!
URGENT – RATING CHG ON GLOBONET CAPEX?
W/ THIS MEAN 4 INDUSTRIAL BOND MKT?
The spiel continues and the stock starts to spiral down, eventually finding a low level where it is buffeted by high-frequency trading algorithms, short sellers, and long bulls trying to nudge it back to where it came from.

By the time the Q&A section of the earnings call has come round, people are fuming. Scared. Worried. Why the spending increase? Why wasn’t this telegraphed earlier? They ask the question in thirty different ways and the answers are relatively similar. “To support key strategic initiatives.” “To invest in the future, today.”

Finally, one of the big analysts for the big mutual funds lumbers onto the line. “I want to speak to the CFO,” they say.
“You are speaking to the CFO.”
“The human one, not the language model.”
“I should be able to answer any questions you have.”
“Listen,” the analyst says via a separate private phone line, “We own 17 percent of the company. We can drop you through the floor.”
“One moment,” says the language model. “Seeking availability.”

Almost an hour passes before the voice of the CFO comes on the line. But no one can be sure if their voice is human or not. The capex is for a series of large supercomputer and power station investments, the CFO says. “We’ll do better in the future.”
“Why wasn’t this telegraphed ahead of the call?” the analysts ask, again.
“I’m sorry. We’ll do better in the future,” the CFO says.

In a midtown bar in New York, hours after market close, a few traders swap stories about the company and mention that they haven’t seen an executive in the flesh “in years”.