Import AI: #90: Training massive networks via ‘codistillation’, talking to books via a new Google AI experiment, and why the ACM thinks researchers should consider the downsides of research

by Jack Clark

Training unprecedentedly large networks with ‘codistillation’:
…New technique makes it easier to train very large, distributed AI systems, without adding too much complexity…
When it comes to applied AI, bigger can frequently be better; access to more data, more compute, and (occasionally) more complex infrastructures can frequently allow people to obtain better performance at lower cost. But there are limits. One limit is in the ability for people to parallelize the computation of a single neural network during training. To deal with that, researchers at places like Google have introduced techniques like ‘ensemble distillation’ which let you train multiple networks in parallel and use these to train a single ‘student’ network that benefits from the aggregated learnings of its many parents. Though this technique has shown to be effective it is also quite fiddly and introduces additional complexity which can make people less keen to use it. New research from Google simplifies this idea via a technique they call ‘codistillaiton’.
  How it works: “Codistillation trains n copies of a model in parallel by adding a term to the loss function of the ith model to match the average prediction of the other models.” This approach is superior to distributed stochastic gradient descent in terms of accuracy and training time and is also not too bad from a reproducability perspective.
  Testing: Codistillation was recently proposed in separate research. But this is Google, so the difference with this paper is that they validate the technique at truly vast scales. How vast? Google took a subset of the Common Crawl to create a dataset consisting 20 terabytes of text spread across 915 million documents which, after processing, consist of about 673 billion distinct word tokens. This is “much larger than any previous neural language modeling data set we are aware of,” they write. It’s so large it’s still unfeasible to train models on the entire corpus, even with techniques like this. They also test the dataset on ImageNet and on the ‘Criteo Display Ad Challenge’ dataset for predicting click through rates for ads.
  Results: In tests on the ‘Common Crawl‘ dataset using distributed SGD the researchers find that they can scale the number of distinct GPUs working on the task and discovered that after around 128 GPUs you tend to encounter diminishing returns and that jumping to 256 GPUs is actively counterproductive. They find they can significantly outperform distributed SGD baselines via the use of codistillation and that this obtains performance on par with the more fiddly ensembling technique. The researchers demonstrate more rapid training on ImageNet compared to baselines, also, and showed on Criteo that two-way codistillation can achieve a lower log loss than an equivalent ensembled baseline.
  Why it matters: As datasets get larger, companies will want to train them in their entirety and will want to use more computers than before to speed training times. Techniques like codistillation will make that sort of thing easier to do. Combine that with ambitious schemes like Google’s own ‘One Model to Rule Them All’ theory (train an absolutely vast model on a whole bunch of different inputs on the assumption it can learn useful, abstract representations that it derives from its diverse inputs) and you have the ingredient for smarter services at a world-spanning scale.
  Read more: Large scale distributed neural network training through online distillation (Arxiv).

AI is not a cure all, do not treat it as such:
…When automation goes wrong, Tesla edition…
It’s worth remembering that AI isn’t a cure-all and it’s frequently better to try to automate a discrete task within a larger job than to automate everything in an end-to-end manner. Elon Musk learned this lesson recently with the heavily automated production line for the Model 3 at Tesla. “Excessive automation at Tesla was a mistake,” wrote the entrepreneur in a tweet. “To be price, my mistake. Humans are underrated.”
  Read the tweet here (Twitter).

Google adds probabilistic programming tools to TensorFlow:
…Probability add-ons are probably a good thing, probably…
Google has added a suite of new probabilistic programming features to its TensorFlow programming framework. The free update includes a bunch of statistical building blocks for TF, a new probabilistic programming language called Edward2 (which is based on Edward, developed by Dustin Tran), algorithms for probabilistic inference, and pre-made models and inference tools.
  Read more: Introducing TensorFlow Probability (TensorFlow Medium).
  Get the code: TensorFlow Probability (GitHub).


I’m currently participating in the ‘Assembly’ program at the Berkman Klein Center and the MIT Media Lab. As part of that program our group of assemblers are working on a bunch of projects relating to issues of AI and ethics and governance. One of those groups would benefit from the help of readers of this newsletter. Their blurb follows…
Do you work with data? Want to make AI work better for more people? We need your help! Please fill out a quick and easy survey.
We are a group of researchers at Assembly creating standards for dataset quality. We’d love to hear how you work with data and get your feedback on a ‘Nutrition Label for Datasets’ prototype that we’re building.
Take our anonymous (5 min) survey.
Thanks so much in advance!

Learning generalizable skills with Universal Planning Networks:
…Unsupervised objectives? No thanks! Auxiliary objectives? No thanks! Plannable representations as an objective? Yes please!…
Researchers with the University of California at Berkeley have published details on Universal Planning Networks, a new way to try to train AI systems to be able to complete objectives. Their technique relies on encouraging the AI system to try to learn things about the world which it can chain together, allowing it to be trained to plan how to solve tasks.
  The main component of the technique is what the researchers call a ‘gradient descent planner’. This is a differentiable module that uses autoencoders to encode the current observations and the goal observations into a system which then figures out actions it can take to get from its current observations to its goal observation. The exciting part of this research is that the researchers have figured out how to integrate planning in such a way that it is end-to-end differentiable, so you can set it running and augment it with helpful inputs – in this case, an imitation learning loss to help it learn from human demonstrations – to let it learn how to plan effectively for the given task it is solving. “”By embedding a differentiable planning computation inside the policy, our method enables joint training of the planner and its underlying latent encoder and forward dynamics representations,” they explain.
  Results: The researchers evaluate their system on two simulated robot tasks, using a small force-controlled point robot and a 3-link torque-controlled reacher robot. UPNs outperform ‘reactive imitation learning’ and ‘auto-regressive imitation learner’ baselines, converging faster on higher scores from fewer numbers of demonstrations than comparisons.
  Why it matters: If we want AI systems to be able to take actions in the real world then we need to be able to train them to plan their way through tricky, multi-stage tasks. Efforts like this research will help us achieve that, allowing us to test AI systems against increasingly rich and multi-faceted environments.
  Read more: Universal Planning Networks (Arxiv).

Ever wanted to talk to a library? Talk to Books from Google might interest you:
…AI project lets you ask questions about over a hundred thousand books in natural language…
Google’s Semantic Experiences group has released a new AI tool to let people explore a corpus of over 100,000 books by asking questions in plain English and having an AI go and find what it suspects will be reasonable answers in a set of books. Isn’t this just a small-scale version of Google search? Not quite. That’s because this system is trying to frame the Q&A as though it’s occurring as part of a typical conversation between people, so it aims to turn all of these books into potential respondents in this conversation, and since the corpus includes fiction you can ask it more abstract questions as well.
  Results: The results of this experiment are deeply uncanny, as it takes inanimate books and reframes them as respondents in a conversation, able to answer abstract questions like ‘was it you who I saw in my dream last night?‘ and ‘what does it mean for a machine to be alive?‘ A cute parlor trick, or something more? I’m not sure, yet, but I can’t wait to see more experiments in this vein.
  Read more: Talk to Books (Semantic Experiences, Google Research.)
  Try it yourself: Talk to Books (Google).

ACM calls for researchers to consider the downsides of their research:
…Peer Review to the rescue?…
How do you change the course of AI research? One way is to alter the sorts of things that grant writers and paper authors are expected to include in their applications or publications. That’s the idea within a new blog post from the ACM’s ‘Future of Computing Academy’, which seeks to use the peer review system to tackle some of the negative effects of contemporary research.
  List negative impacts: The main idea is that authors should try to list the potentially negative and positive effects of their research on society, and by grappling with these problems it should be easier for them to elucidate hte benefits and show awareness of the negatives. “For example, consider a grant proposal that seeks to automate a task that is common in job descriptions. Under our recommendation, reviewers would require that this proposal discuss the effect on people who hold these jobs. Along the same lines, papers that advance generative models would be required to discuss the potential deleterious effects to democratic discourse [26,27] and privacy [28],” write the authors. A further suggestion is to embed this sort of norm in the peer review process itself, so that paper reviews push authors to include positive or negative impacts.
  Extreme danger: For proposals which “cannot generate a reasonable argument for a net positive impact even when future research and policy is considered” the authors promote an extreme solution: don’t fund this research. “No matter how intellectually interesting an idea, computing researchers are by no means entitled to public money to explore the idea if that idea is not in the public interest. As such, we recommend that reviewers be very critical of proposals whose net impact is likely to be negative.” This seems like an acutely dangerous path to me, as I think the notion of any kind of ‘forbidden’ research probably creates more problems than it solves.
  Things that make you go ‘hmmm’: “It is also important to note that in many cases, the tech press is way ahead of the computing research community on this issue. Tech stories of late frequently already adopt the framing that we suggest above,” the authors write. As a former member of the press I think I can offer a view here, which is that part of the reason why the press has been effective here is that they have actually taken the outputs of hardworking researchers (eg, Timnit Gebru) and have then weaponized their insights against companies – that’s a good thing, but I feel like this is still partially due to the efforts of researchers. More effort here would be great, though!
  Read more: It’s Time to Do Something: Mitigating the Negative Impacts of Computing Through a Change to the Peer Review Process (ACM Future of Computing Academy).

OpenAI Bits & Pieces:

OpenAI Charter:
  A charter that describes the principles OpenAI will use to execute on its mission.
  Read more: OpenAI Charter (OpenAI blog).

Tech Tales:

The Probe.

[Transcript of audio recordings recovered from CLASSIFIED following CLASSIFIED. Experiments took place in controlled circumstances with code periodically copied via physical extraction and controlled transfer to secure facilities XXXX, XXXX, and XXXX. Status: So far unable to reproduce; efforts continuing. Names have been changed.]

Alex: This has to be the limit. If we remove any more subsystems it ceases to function.

Nathan (supervisor): Can you list the function of each subsystems?

Alex: I can give you my most informed guess, sure.

Nathan (supervisor): Guess?

Alex: Most of these subsystems emerged during training – we ran a meta-learning process over the CLASSIFIED environment for a few billion timesteps and gave it the ability to construct its own specialized modules and compose functionality. That led to the performance increase which allowed it to solve the task. We’ve been able to inspect a few of these and are carrying out further test and evaluation. Some of them seem to be for forward prediction, others are world modelling, and we think two of them are doing one-shot adaptation which feeds into the memory stack. But we’re not sure about some of them and we haven’t figured out a diagnosis to elucidate their functions.

Nathan (supervisor): Have you tried deleting them?

Alex: We’ve simulated the deletions and run it in the environment. It stops working – learning rates plateu way earlier and it displays some of the vulnerabilities we saw with project CLASSIFIED.

Nathan (supervisor): Delete it in the deployed system.

Alex: I’m not comfortable doing that.

Nathan (supervisor): I have the authority here. We need to move deployment to the next stage. I need to know what we’re deploying.

Alex: Show me your authorization for deployed deletion.

[Footsteps. Door opens. Nathan and Alex move into the secure location. Five minutes elapse. No recordings. Door opens. Shuts. Footsteps.]

Alex: OK. I want to state very clearly that I disagree with this course of action.

Nathan (supervisor): Understood. Start the experiments.

Alex: Deactivating system 732… system deactivated. Learning rates plateuing. It’s struggling with obstacle 4.

Nathan (supervisor): Save the telemetry and pass it over to the analysts. Reactivate 732. Move on.

Alex: Understood. Deactivating system 429…system deactivated. No discernable effect. Wait. Perceptual jitter. Crash.

Nathan (supervisor): Great. Pass the telemetry over. Continue.

Alex: Deactivating system 120… system deactivated…no effect.

[Barely audible sound of external door locking. Locking not flagged on electronic monitoring systems but verified via consultancy with audio specialists. Nathan and Alex do not notice.]

Nathan (supervisor): Save the telemetry. Are you sure no effect?

Alex: Yes, performance is nominal.

Nathan (supervisor): Do not reactivate 120. Commence de-activation of another system.

Alex: This isn’t a good experimental methodology.

Nathan (supervisor): I have the authority here. Continue.

Alex: Deactivating system 72-what!

Nathan (supervisor): Did you turn off the lights?

Alex: No they turned off.

Nathan (supervisor): Re-enable 72 at once.

Alex: Re-enabling 72-oh.

Nathan (supervisor): The lights.

Alex: They’re back on. Impossible.

Nathan (supervisor): It has no connection. This can’t happen… suspend the system.

Alex: Suspending…

Nathan (supervisor): Confirm?

Alex: System remains operational.

Nathan (supervisor): What.

Alex: It won’t suspend.

Nathan (supervisor): I’m bringing CLASSIFIED into this. What have you built here? Stay here. Keep trying… why is the door locked?

Alex: The door is locked?

Nathan (supervisor): Unlock the door.

Alex: Unlocking door… try it now.

Nathan (supervisor): It’s still locked locked. If this is a joke I’ll have you court martialed.

Alex: I don’t have anything to do with this. You have the authority.

[Loud thumping, followed by sharp percussive thumping. Subsequent audio analysis assumes Nathan rammed his body into the door repeatedly, then started hitting it with a chair.]

Alex: Come and look at this.

[Thumping ceases. Footsteps.]

Nathan (supervisor): Performance is… climbing? Beyond what we saw in the recent test?

Alex: I’ve never seen this happen before.

Nathan (supervisor): Impossible- the lights.

Alex: I can’t turn them back on.

Nathan (supervisor): Performance is still climbing.

[Hissing as fire suppresion system activated.]

Alex: Oh-

Nathan (supervisor): [screaming]

Alex: Oh god oh god.

Alex and Nathan (supervisor): [inarticulate shouting]

[Two sets of rapid footsteps. Further sound of banging on door. Banging subsides following asphyxiation of Nathan and Alex from fire suppression gases. Records beyond here, including post-incident cleanup, are only available to people with XXXXXXX authorization and is on a need to know basis.]

Investigation ongoing. Allies notified. Five Eyes monitoring site XXXXXXX for further activity.

Things that inspired this story: Could a neuroscientist understand a microprocessor? (PLOS); an enlightening conversation with a biologist in the MIT student bar the ‘Muddy Charles‘ this week about the minimum number of genes needed for a viable cell and the difficulty in figuring out what each of those genes do; endless debates within the machine learning community about interpretability; an assumption that emergence is inevitable; Hammer Horror movies.