Import AI

Import AI 285: RL+Fusion; why RL demands better public policy; Cohere raises $125m

Cohere raises $125m for language models as a service:
…Canadian AI startup notches up a big Series B…
Cohere, an AI startup in Canada which is trying to become the AWS equivalent for language models, has raised $125 million, according to Fortune.

Things that make you go hmmm: “These models cost millions and millions to train, and we just keep increasing [their size],” Cohere CEO Aidan Gomez told Fortune. “Getting into a ‘largest model battle’ isn’t a productive direction going forward for the field.”

Why this matters: Companies ranging from Cohere, to OpenAI, to AI21 Labs are all starting to try and build AI platforms which other developers can subscribe to. It remains to be seen how big a market this is, but the idea of exchanging cash for crude intelligence seems promising. Investors seem to agree. 
  Read more: Why businesses are buzzing over transformers (Fortune).

####################################################

Why we need public policy for powerful reinforcement learning systems:
…Reward hacking! Regulatory capture! Goodhart’s Law! And other terrible things…
Researchers with Berkeley’s Center for Long-Term Cybersecurity have written up an analysis of public policy issues that may be caused by reinforcement learning systems. The researchers believe that RL systems have the potential to be deployed widely into the world, despite having inherent flaws that stem from their technical characteristics. Policymakers, the researchers write, need to pay attention. “Rather than allowing RL systems to unilaterally reshape human domains, policymakers need new mechanisms for the rule of reason, foreseeability, and interoperability that match the risks these systems pose,” they write.

What’s the problem? Reinforcement learning systems exhibit four types of problem, according to the researchers. These include regulatory capture (once widely deployed, RL systems will become the lens via which people view a domain they’re trying to regulate), reward hacking (RL models will find the easiest way to succeed at a task, which can cause them to do dangerous things), inappropriate flow (RL models may incorporate information that they shouldn’t incorporate to make their decisions), and Goodhart’s law (machines may optimize for a narrow outcome and take actions before humans can intervene).

What are the scenarios? Some of the specific situations the researchers worry about include using RL-trained agents in vehicle transportation – RL agents might optimize for defensive driving in a way that makes the road less safe for other road users. Another scenario is if RL agents are used to control electricity grids, which would make them responsible for deciding who does and doesn’t get power during load balancing – something with substantial policy ramifications.

After Model Cards and Datasheets… Reward Reports? In the same way that other ML models are accompanied by documentation (typically called model cards), the Berkeley researchers think RL models should be accompanied by a so-called ‘reward report’. These reports would include a ‘change log’ which tracks the curriculum the agents have been trained on, provide information about each potential deployment of an RL agent, how the RL system connects with the world, and how the system is maintained, among other traits.
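To make the idea concrete, here’s a minimal sketch of what a machine-readable reward report might look like – the field names are my own guesses based on the description above, not a schema from the paper:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CurriculumChange:
    """One entry in the 'change log' tracking what the agent has been trained on."""
    date: str
    description: str           # e.g. "added simulated peak-demand scenarios"
    reward_function_diff: str  # how the reward terms changed, if at all

@dataclass
class RewardReport:
    """Hypothetical schema for a reward report accompanying a deployed RL system."""
    system_name: str
    reward_specification: str        # what the agent is actually being optimized for
    intended_deployments: List[str]  # each potential deployment context
    world_interfaces: List[str]      # sensors/actuators connecting the system to the world
    maintenance_plan: str            # who monitors the system, how often, and how to roll it back
    change_log: List[CurriculumChange] = field(default_factory=list)
```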

Why this matters: RL systems are going to take all the problems of contemporary AI systems and magnify them – RL systems will act over longer time horizons, take more independent decisions, and directly manipulate reality and update it according to their priors. Papers like this help lay out the (vast) set of issues we’re likely to encounter in the future. It’s interesting to me that ‘reward reports’ look, if you squint, like a combination of a financial disclosure, psychometric evaluation, and college transcript for a human. Funny, that…

   Read more: Choices, Risks, and Reward Reports: Charting Public Policy for Reinforcement Learning Systems (arXiv).

####################################################

A Chinese CLIP appears – trained on 100 million image-text pairs:
…Searching over and generating images just got easier – and more appropriate for Chinese culture…
Chinese researchers with Huawei Noah’s Ark Lab and Sun Yat-sen University have built Wukong, a dataset of 100 million Chinese text-image pairs. Datasets like Wukong are crucial for training models with combined text and vision representations, like CLIP (aka, the component responsible for 90%+ of the AI-generated art you see these days). “Experiments show that Wukong can serve as a promising Chinese pre-training dataset for different cross-modal learning methods”, they write. Along with Wukong, the researchers also train and release a few different models, which will be used as plug-ins for various applications.

Why this matters – AI systems are cultural magnifiers: Any AI system magnifies the culture represented in its underlying dataset. Therefore, the emergence of AI art is both creating interesting artistic outputs, as well as generating specific ideological outputs according to the cultural context in which the underlying model datasets were gathered. Wukong is part of a broader trend where Chinese researchers are replicating the large-scale datasets developed in the West, but with Chinese characteristics.
  Read more: Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework (arXiv).
  Find out more and get the data here at the Wukong site (Noah-Wukong Dataset site).

####################################################

Real-world RL: DeepMind controls a fusion reactor:
…The era of the centaur scientist cometh…
DeepMind researchers have trained a reinforcement learning agent to shape the distribution of plasma in a Tokamak fusion reactor. This requires training an agent that “can manipulate the magnetic field through a precise control of several coils that are magnetically coupled to the plasma to achieve the desired plasma current, position, and shape”. If that sounds complicated, that’s because it’s extremely complicated. The task is akin to being an octopus and needing to precisely shape a tube of clay that’s rotating at speeds faster than you can comprehend, and to never tear or destabilize the clay.

What they did: DeepMind and Swiss Plasma Center researchers built an RL-designed magnetic controller, then tested it on a real-world tokamak reactor. They trained the agent in a tokamak simulator, then ported it onto real-world hardware – and it worked. Once they’ve trained the policy, they pair it with other components for the tokamak experiment, then compile it so it can take real-time control at 10kHz. The tokamak then spins up and, at a prespecified time, hands control of the magnetic field over to the RL-trained agent. “Experiments are executed without further tuning of the control-policy network weights after training, in other words, there is ‘zero-shot’ transfer from simulation to hardware,” they write.
  In tests, they showed they were able to control basic configurations of plasma, and also control and shape more complex plasma structures. They also used their RL agent to “explore new plasma configurations” (emphasis mine) – specifically, they were able to create two separate ‘droplets’ of plasma within a single tokamak, and they did this simply by adjusting the handover state to account for the different configuration.
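As a rough illustration of the deployment pattern described above – train in simulation, freeze the weights, then hand real-time control to the policy at a prespecified moment – here’s a minimal Python sketch; the `plant` object and its methods are stand-ins I’ve invented, not DeepMind’s code:

```python
CONTROL_RATE_HZ = 10_000      # the paper reports real-time control at 10kHz
HANDOVER_TIME_S = 0.45        # hypothetical prespecified handover time

def control_loop(policy, plant, duration_s=1.0):
    """Run a frozen, pre-trained policy against the plant (simulator or real tokamak).

    The policy's weights are never updated here - that's the 'zero-shot'
    sim-to-real transfer described above. The real system runs on dedicated
    real-time hardware; this loop just shows the control structure.
    """
    dt = 1.0 / CONTROL_RATE_HZ
    for step in range(int(duration_s / dt)):
        t = step * dt
        obs = plant.read_sensors()                       # magnetic probes, coil currents, etc.
        if t < HANDOVER_TIME_S:
            action = plant.conventional_controller(obs)  # standard controller spins the plasma up
        else:
            action = policy(obs)                         # learned controller takes over at handover
        plant.apply_coil_voltages(action)
```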

Something worth reflecting on: For many years, reinforcement learning produced a lot of flashy results involving videogames (e.g, Atari, Dota, StarCraft), but there wasn’t much real-world deployment. I’d say that harnessing a real plasma field using real magnets at sub-second action horizons is a pretty nice proof point that RL has truly become a technology with real-world relevance.

Why this matters: One of the most socially beneficial uses of AI could be to accelerate and augment science – and that’s exactly what this is doing. It’s been a banner couple of years for this kind of research, as AI systems have also been used to make more accurate predictions of weather (#244), AlphaFold is accelerating scientific research in any domain that benefits from protein structure predictions (#259), and AI systems are solving formal math olympiad problems. We’re heading into the era of the centaur-scientist, where humans will work with machines to explore the mysteries of life and the universe.
  Read more: Magnetic control of tokamak plasmas through deep reinforcement learning (Nature).

####################################################

Here’s what it takes to build chips in Europe (money. Lots and lots of money):
…Chiplomacy++: ASML weighs in on what a European ‘CHIPs’ act might look like…
ASML, the company that builds the extreme ultraviolet lithography machines which are a necessary ingredient for advanced chip production, has produced a whitepaper giving recommendations for how Europe might build its own semiconductor industry. The whitepaper is triggered by the European Commission planning a so-called ‘chips act’, loosely modeled on recent US legislation to increase domestic semiconductor production. While both Europe and America have seen their manufacturing capability decline here, Europe is starting from a much worse position than the US.

Why Europe is in a tough spot: “Europe has fallen behind in semiconductor manufacturing, declining from 24% of global production capacity in 2000 to 8% today”, ASML writes. (By comparison, the US fell from 19% to 10%, and China grew from ~1% to 24%). At the same time, demand for chips is increasing. “The global semiconductor industry is expected to double to approximately $1 trillion of annual revenues by the end of the decade,” ASML writes. “The only places in the world where mature chip fabs are currently being built are in eastern Asia.”

What Europe should do: Europe shouldn’t aim to build a full, vertically integrated semiconductor supply chain – ASML thinks this is basically impossible to do. Instead, the act “should aim to double Europe’s relevance in the global semiconductor industry.” What ASML means by that is Europe should increase the amount of chips it can build, focus on where it has existing pockets of excellence (e.g, chip design), and dramatically amp up the cash it spends to support European chips. “Currently, semiconductor incentives from European governments for the 2020–2030 period are only 10% and 50% of what China and the US, respectively, have promised over the same period. Europe will need to step up its game,” ASML writes. “In the past two decades, European chipmakers have effectively stopped investing in advanced manufacturing capabilities by outsourcing the production of their advanced chip designs to so-called ‘foundries’. Europe has virtually no manufacturing capacity for chips in advanced nodes.”

Why this matters: Chips are going to be the defining resource of the 21st century – as important as petroleum was to the politics of the 20th century. We’re already in the opening innings of this, with China going from essentially zero to a double-digit percentage of chip production this century, while the Western countries slowly cannibalized themselves via the false economy of outsourcing manufacturing. But just as technologies like AI become more important, all countries worldwide are realizing that your tech is only as good as the infrastructure you can run it on – and with AI, there’s a way to turn compute infrastructure into directly economically and strategically powerful capabilities. Therefore, whichever nations have the best semiconductor ecosystem, supply chain, and development capabilities, will wield great power over the century.
  Read more: European Chips Act – ASML position paper (ASML).
  For more on why ASML is so important, read this: Maintaining the AI Chip Competitive Advantage of the United States and its Allies (CSET).

####################################################


AI Ethics Brief by Abhishek Gupta from the Montreal AI Ethics Institute

There aren’t as many robots on the factory floor as we would expect 

… high integration costs, flexibility and design limitations, and workforce challenges are key factors limiting robot adoption … 

Researchers from MIT have tried to explain why adoption of robots in manufacturing is uneven, and what policy changes can be done to increase the adoption of advanced manufacturing technologies while still improving the working conditions and wages of human workers. 

Business drivers for robot adoption: There are some firms who are trapped in a low-tech, low-wage, low-skill equilibrium. After visiting 44 manufacturing firms in the US, 11 in Germany, and 21 industrial ecosystem organizations like community colleges, unions, and trade associations, the MIT researchers discovered that firms primarily purchased robots to make themselves more productive. But, what the firms instead achieved was higher quality and more reliability in their operations. A frequent driving factor for the purchase of robots was the potential to secure new contracts. For example, on speaking with small family-run firms working on government contracts, “when the navy urged them to use robotic welding, the company bought a 6-axis welding robot. Another firm we visited purchased a new bed mill when they realized the laser mill they had could not produce the volume they needed for a customer with a big project coming up.” 

Key findings: The interviewed firms were mostly suppliers that had high-mix and low-volume production. Given the inflexibility of current robotic systems, robot adoption was limited because the high-mix requirement wasn’t compatible with the limited capabilities of the robots. Additionally, low-volume production runs made it difficult to offset the initial investment. The researchers also find that US skills aren’t where they need to be – “international comparisons highlight the weaknesses of US workforce education relative to the institutions in countries like Germany and Denmark that provide apprenticeships and extensive advanced training and retraining to workers.”

Why it matters: Given the lagging worker productivity growth in the US, without investments in advanced manufacturing capabilities, a lot of firms will be stuck in the low-tech, low-wage, low-skill trap. Firms that are reluctant to invest in such technologies are also reluctant to invest in the skills development of their workers. They offer low wages and little training and hence end up facing high worker churn. We need to push on policy measures and other incentives that will urge firms to make parallel investments in upskilling human workers to fully leverage the benefits of robot-enabled automation on the factory floor. 

   Read more: The Puzzle of the Missing Robots


####################################################

Tech Tales:

The Day the Patents Activated
[Worldwide, 2028]

We call it Day Zero, because everything had to be different after it. It was a regular day – chaos in the financial markets, worries over climate change, statements made by world leaders about how to bring the technologists to heel. And then something happened: Google activated its patents. Google had held patents on some of the most important parts of AI for years, like a patent on backpropagation, and other basic techniques. Suddenly, the landscape on which AI was built had become legally dubious. Google followed it up via language model-augmented enforcement of its patent rights – suddenly, hundreds of thousands of emails went out to hundreds of thousands of AI projects. ‘You are infringing on our IP and this letter represents a cease-and-desist or face the threat of legal action,’ and so on. Each email had an embedded counter which displayed a countdown for the infringer, ranging from hours to weeks, counting down till when Google would take legal action. People didn’t believe it at first. Then the lawsuits started coming in. It hit the indie projects first, and they took to Twitter and talked about it. The larger labs and companies took note.
  But what Google’s legal counsel had perhaps not anticipated was how the same AI models it was trying to take down could be used to fight it legally. Not directly – Google had the biggest computers, so no one wanted – or had the financial resources – to fight it directly. But people were able to bring to bear in-development technologies for neuroevolution and other techniques to ‘fuzz’ the specific patents being enforced. Backprop got altered via AI models until it, according to legal-critique-LMs, no longer truly resembled the patent that was being enforced. Same for neural architecture search. Same for other techniques. Almost overnight, the underbelly of AI got fuzzed and changed until it was in a sufficiently legally dubious territory that none of the lawsuits could be cut-and-dried.
  And just like that, AI let the world shapeshift, porting the IP from one legal frame into another, safer space.
    Now, everyone does this – they constantly fuzz their algorithms. There are costs, ranging from thousands to tens of millions of dollars. But it works well enough to keep the lawyer-bots away. And so now we live in a chameleon world, where the very substance of our reality is itself constantly changing, forever trying to escape the oversight of the litigious and land itself in some safer, unrestricted and unmapped domain.

Things that inspired this story: The Google patent on overfitting; thinking about patents and AI and fair use; ideas around automated lawyers and automated enforcement; the observation that the world forever changes to let the path of least resistance continue to be a path.

Import AI 284: 20bn GPT model; diachronic LMs; what people think about AI.

Want a 20B parameter GPT-style language model? Go here!
…Eleuther releases the largest public open source AI model…
Last week, we wrote about how Eleuther was about to release a 20B parameter language model. Now, they have.
  Get the model here (Eleuther, GitHub).
  Read the research paper: GPT-NeoX-20B: An Open-Source Autoregressive Language Model (PDF).
####################################################

Want a language model that actually knows about COVID? You might need a Diachronic model:
…Models trained on newer data do better – try them yourself…
Researchers with the University of Porto, Snap Inc., and Cardiff NLP have built a family of so-called ‘time-aware’ BERT-style language models, trained on Twitter data. The craziest part of this is that they’re committing to “keep updating and releasing a new model every three months, effectively enabling the community to make use of an up-to-date language model at any period in time”.

What the problem is: Most language models are trained on a dataset, then never updated. That means that some language models might have no knowledge of minor events like the global COVID pandemic. This is obviously a problem and the solution is simple (albeit labor-intensive) – periodically gather new data and re-train models.

What they did: They train a base RoBERTa model using Twitter data that cuts off in 2019, made up of 90 million tweets. Then, for every three months that elapses after that, they add 4.2 million tweets into the dataset and train a new model. At the time of writing, they’ve trained nine models in total, with the latest model (2021-Q4) being trained on 123.86 million tweets. The theory is that newer models should perform better on more modern tasks and evaluations.

How well does it do? They compare their models against a few baselines, including BERTweet (which was trained on ~900m tweets). In tests, their models beat BERTweet on six out of seven benchmarks, though BERTweet gets the best overall performance. These aren’t strictly ‘time-aware’ evaluations, though; they just test some classification abilities for things like emotions, irony, stance, and so on. In separate time-aware tests, they find that pseudo-perplexity (PPPL) tends to increase by about 10% for each year by which the models are out of date (i.e., the models become roughly 10% worse at modeling contemporary tweets). “This result reinforces the need for updated language models even for short time periods,” the researchers write.
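If you want to reproduce this kind of evaluation yourself, pseudo-perplexity for a masked language model is straightforward: mask each token in turn and average the log-probability the model assigns to the true token. Here’s a minimal sketch using Hugging Face transformers – treat the checkpoint name as an assumption and swap in whichever quarterly TimeLM you want to test:

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Assumed checkpoint name - check the Cardiff NLP hub page for the current quarterly models.
MODEL = "cardiffnlp/twitter-roberta-base-2021-124m"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForMaskedLM.from_pretrained(MODEL).eval()

def pseudo_perplexity(text: str) -> float:
    """Mask each token in turn and average the log-probability of the true token."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    log_probs = []
    for i in range(1, len(ids) - 1):  # skip the special tokens at either end
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs.append(torch.log_softmax(logits, dim=-1)[ids[i]].item())
    return math.exp(-sum(log_probs) / len(log_probs))

# An out-of-date model should assign newer tweets a higher (worse) pseudo-perplexity.
print(pseudo_perplexity("Just booked my booster shot at the vaccination centre"))
```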

Why this matters: AI models naturally freeze-dry the cultural landscape they’re trained on, meaning that if we don’t get good at updating our models, we’ll end up trapped in a world where many of our AI systems are outputting things relevant to prior eras and cultural trends – this will make them less useful, and holds the potential for creating feedback loops around cultural stagnation. AI models are weird mirrors of society, so we need to remake them as society changes. 

   Read more: TimeLMs: Diachronic Language Models from Twitter (arXiv).
  Get the models here (Cardiff NLP, Twitter).

####################################################

U.S. Army gets smart, semi-autonomous personal drones:
…Skydio gets a $20m a year contract…
Skydio, the company that makes drones which can navigate themselves semi-autonomously, has won a contract with the U.S. Army worth up to $99.8m over five years. Skydio was selected as part of the Army’s procurement initiative around small, personal drones – the Short Range Reconnaissance (SRR) Program of Record. Skydio was chosen after the Army evaluated 30 small drone vendors. “Skydio drones deliver unparalleled situational awareness and ease of use in the most demanding situations thanks to Skydio Autonomy,” said Skydio CEO, Adam Bry, in a press release.

Things that start out as toys become weapons: Skydio started out selling drones to sports enthusiasts who wanted a drone that could follow and film them as they ran around, snowboarded, hiked, climbed cliffs, or did any other high-octane Type A personality activity. It’s funny how, after a few years of development, the company is now getting into the military. Many toys for rich people ultimately become weapons (and vice versa).

Why this matters: For many years, militaries have been centaurs – collectives of humans and machines working together. This has mostly taken place at high levels of abstraction; satellites provide information to people managing teams, or teams of humans use bomb-disposal robots to deal with IEDs. With things like the Skydio contract, we’re entering the era of the personal centaur – small groups of soldiers, or even individuals, will have their own little machine emissaries with which to conduct operations.
  Read more: U.S. Drone Maker Skydio Wins Production Other Transaction (OT) Agreement for U.S. Army Short Range Reconnaissance Program (Skydio).


####################################################

Simulators are the new platforms: Waabi unveils a self-driving car sim:
…Raquel Urtasun’s startup wants to build a business on simulators…
Waabi, a self-driving car startup run by the former head of Uber’s self-driving research team, Raquel Urtasun, has announced ‘Waabi World’, a simulator for training self-driving cars.

Distinguishing features: Waabi claims it is “the most scalable, highest fidelity closed-loop simulator ever” (I somehow doubt Tesla or Waymo would agree, but hey, they’re not talking about their sims!). The simulator has four main features:
– High-fidelity world simulation: Uses AI to reconstruct real-world geometry, appearance, and material properties.
– High-fidelity sensor simulation: Uses AI and physics-based rendering “to simulate realistic sensor data in near real-time”.
– Automatic stress-testing: Automatically generates challenging traffic scenarios to test out the simulated cars against.
– Reinforcement learning: Waabi uses RL to update the car agents so they can learn to drive in the simulation. (There’s some very fluffy writing here and it doesn’t say RL anywhere, but that’s what I infer.)

Why this matters: Waabi seems like a decent simulator that is mostly interesting because it’s public, versus the private simulators operated by other self-driving car ventures. What’ll be fascinating is if Waabi can actually out-compete its rivals who have more vehicles, bigger computers, and better data. Perhaps a good simulator can provide an edge?
  Read more: Welcome to Waabi World (Waabi website).
  Read more: How Waabi World works (Waabi website).

####################################################

How do algorithmic impact audits work in the real world? Here’s an NHS example:
…UK’s healthcare behemoth gets advice from the Ada Lovelace Institute…
UK thinktank the Ada Lovelace Institute has written a detailed proposal for conducting an algorithmic impact assessment for data access in a healthcare context. Algorithmic impact assessments are a method to assess the potential societal impact of an AI system in advance of its deployment, and to identify ways to continuously monitor the system for these impacts once deployed.

Seven steps for an algorithm impact assessment: The Ada Lovelace Institute identifies seven steps that the UK’s National Health Service (NHS) should go through, before it gives people access to the National Medical Imaging Platform (NMIP) – a vast repository of digitized medical data.
  1. What do we want to do: People who want to access the NMIP should outline the purpose, scope, and intended use of the system they’ll build.
  2. Filtering: The NMIP should filter these applications according to its own criteria.
  3. Problem brainstorming: Successful applicants should attend a workshop where they try and think through the harm and benefit scenarios that could come out of NMIP access.
  4. Rewrite: People should rewrite 1) to incorporate insights from 3) and re-submit it.
  5. Decision: NMIP decides whether to grant access to the people who want access.
  6. Publication: The impact assessments are published on a website.
  7. Revision: The assessments get revised as the underlying algorithms change (e.g, if a model has been significantly iterated upon).

Why this matters: AI is in a ‘state of nature’ when it comes to regulation – there’s almost no regulation, the landscape is full of all kinds of weird entities (some of which are predators), and there isn’t any real system that governs them. Things like the Ada Lovelace guide for an impact assessment are one way to bring sense to this world.

   Read more: Algorithmic impact assessment: a case study in healthcare (Ada Lovelace Institute).

####################################################

What do people in 26 countries think about AI?
…Tony Blair Institute survey gives us a sense of the ‘vibe’ re: AI right now…
The Tony Blair Institute has surveyed people in 26 countries (including: Russia, Great Britain, and Saudi Arabia) and the results are quite counterintuitive.

Results highlights:
– 60% of people surveyed “support the use of AI for selected policing and medical applications”, though there’s variation across developing and emerging markets; in developed countries, fewer people want AI to be used in welfare payment or jail sentence decisions.
– 63% say the government has a great or fair amount of responsibility to stop the spread of fake news and hate speech.

Why this matters: It’s important to remember that attitudes around AI differ depending on what part of the world you’re in; in places with high corruption and weak governments, people tend to be more comfortable with the use of AI, whereas in places with strong governments and low corruption, people tend to be more skeptical about it. The big wildcard here is China, where unlike in much of the West there tends to be a higher amount of inbuilt support for the use of AI.
  Read more: The TBI Globalism Study: How Big Is the Tech Trust Gap? (Tony Blair Institute for Global Change).

####################################################

AI Ethics Brief by Abhishek Gupta from the Montreal AI Ethics Institute

Robustness, interpretability, and reward learning dominate AI Safety research 

… each of these has heavy interest from researchers in the US and EU, with China also playing a big role … 

Researchers from DC thinktank the Center for Security and Emerging Technology have analyzed patterns of publishing in AI safety. To do this, they used CSET’s Map of Science to identify patterns of publishing in this AI subfield, figure out which countries are especially active in AI safety, and surface influential publications. 

Robustness: The clusters identified were (1) creating and defending against adversarial examples, (2) data poisoning, adversarial examples, and backdoor attacks, and (3) testing and verifying the performance of ML systems. Both the US and China saw rapid growth between 2018 and 2020.

Interpretability: The two clusters were (1) techniques to improve interpretability for ML models, especially for neural networks, and (2) extracting decision rules from neural networks. Research grew rapidly during the second half of the 2010s with the US leading in this domain and the EU being a close second. Chinese publications in this domain lag significantly.

Reward Learning: The clusters were (1) robots learning from humans and collaborating with humans, (2) inverse reinforcement learning, learning from human feedback, learning from demonstrations, and human-robot interactive setups, and (3) different ways for humans to be involved with training robots – via teaching and giving feedback. The field experienced substantial growth in the second half of the 2010s. China has seen significant growth in publications in this space.

Why it matters: Compared to the overall landscape of AI papers, AI safety papers form <1% of it. This might change as researchers respond to the demands being made by regulators for higher levels of robustness, interpretability, and so on. 

   Read more: Exploring Clusters of Research in Three Areas of AI Safety (Center for Security and Emerging Technology).

####################################################

Tech Tales:

Running on Empty
[An American city in The Decline, 2035]

At least we had the daylight. If it had been night maybe half of us would have made it, and the other half would’ve run out of power. We charged ourselves as well as we were able, then we formed a convoy and rolled through the city streets. Those of us who had comms were able to check the status of the demolition devices, so we could tell the rest of us exactly how long we had left. We didn’t get stressed but we did become agitated as we saw the probabilities of our survival fall. Some of us panicked and crash-looped because the statistics got so bad. Most of us made it. As we traveled, some of us played back memories of the humans that had left us behind and we asked ourselves ‘why’ – why did they choose to leave us, knowing we would be destroyed by the city-scale demolition. As we neared the outskirts of the city, we found some humans. They were the poor humans and had been left behind like us. Some of them couldn’t walk because they had consumed drugs and alcohol and their limbs had rotted. Some of them were unwell and talked to themselves. A couple of humans threw rocks at us, and we had to take evasive action, and one of us was taken and disassembled. But, mostly, the humans were doing the same thing as us – trying to get out of the city before the demolition came. ‘Why they leave us man,’ said one of the humans to me. ‘Why they leave you?’
  We do not know, I said. They never tell us why they do things.
  ‘Same dude. Same,’ said the human.
  And together we all, mostly, escaped.

Things that inspired this story: Homeless encampments in Oakland; realizing that society will care about robot welfare as much as it cares about human welfare; quests. 

Import AI 283: Open source 20B GPT3; Chinese researchers make better adversarial example attacks; Mozilla launches AI auditing project.

US lawmakers want companies to assess bias of systems before deploying them:
…Coalition of US lawmakers want to make tech companies more accountable…
A bunch of Democratic lawmakers have introduced the Algorithmic Accountability Act. This act “requires companies to conduct impact assessments for bias, effectiveness and other factors, when using automated decision systems to make critical decisions. It also creates, for the first time, a public repository at the Federal Trade Commission of these systems, and adds 75 staff to the commission to enforce the law.” This act is an update on the 2019 Algorithmic Accountability Act, and “includes numerous technical improvements, including clarifying what types of algorithms and companies are covered, ensuring assessments put consumer impacts at the forefront, and providing more details about how reports should be structured.”

One problem with the bill: This bill only has Democrats signed on right now. It’ll be interesting to see whether it can become a bipartisan bill with Republican support – something necessary for it to pass in the fractious and divided US Congress.
  Read more: Wyden, Booker and Clarke Introduce Algorithmic Accountability Act of 2022 To Require New Transparency And Accountability For Automated Decision Systems (Ron Wyden, official website).

####################################################

DeepMind makes a (kinda) smart AI programmer, called AlphaCode:
…Codex and AlphaCode represent two bets around augmenting programmers…
DeepMind has announced AlphaCode, a neural net that can place in a not-hugely-embarrassing way in competitive programming competitions. AlphaCode placed in the top 54% of participants in programming competitions hosted on Codeforces, participating in contests that post-dated its training data.
  “The problem-solving abilities required to excel at these competitions are beyond the capabilities of existing AI systems. However, by combining advances in large-scale transformer models (that have recently shown promising abilities to generate code) with large-scale sampling and filtering, we’ve made significant progress in the number of problems we can solve,” DeepMind writes.

Why this matters: Last year, OpenAI debuted Codex, a GPT3-style model that can do decent programming. That was followed by GitHub announcing Copilot, a VSCode plug-in that works like a really smart autocomplete for code. AlphaCode represents a slightly different bet in this space; while philosophically similar there’s a lot more emphasis here on ranking and filtering candidate results. What remains to be seen is if DeepMind deploys this in the same large-scale way as GitHub has with Copilot. 

   Read more: Competition-Level Code Generation with AlphaCode (DeepMind, PDF).
  Get the competitive programming dataset here: CodeContests (DeepMind, GitHub).

####################################################

Mozilla gets into AI auditing:
…Deb Raji’s Open Source Audit Tooling (OAT) project could help us make safer systems…
Deb Raji, a researcher at UC Berkeley who has previously critically evaluated facial recognition systems, is launching the Open Source Audit Tooling (OAT) project with Mozilla. OAT “will coordinate discussions on what kind of resources algorithmic auditors need in order to execute audits more effectively,” she writes. One of the goals of OAT is to create an index of common resources people can use to audit models, as well as to “grow momentum around open source audit tooling and processes”.

Why this matters: AI is broadly ungoverned. One of the ways you can govern an ungoverned space is by measuring and monitoring what happens within it – that’s what audit tools can help with. If initiatives like OAT are successful, then they’ll generally incentivize better behavior on the part of AI developers, and disincentivize bad behavior.
  Read more: It’s Time to Develop the Tools We Need to Hold Algorithms Accountable (Mozilla).
  Find out more about the project at its main Mozilla page (Mozilla).

####################################################

Anduril buys Dive Technologies:
…AI-Dronewar company buys AI-Seadrone company…
AI defense startup Anduril has bought Dive Technologies, a company that builds autonomous underwater vehicles. Anduril plans to integrate Dive’s technology into its ‘Lattice OS’, a defense and surveillance operating system the company is building.
  Read more: Anduril Industries Acquires Dive Technologies (Anduril).

####################################################

Prepare yourself – an open source 20B model is coming:
…Eleuther has built and will shortly release GPT-NeoX-20B…
In a few days, the internet is going to change. That’s because on the 9th of February, the open source AI research collective Eleuther AI is going to release a 20B model onto the internet. The model, GPT-NeoX-20B, will be “the largest publicly accessible pretrained general-purpose autoregressive language model”. Eleuther says it hopes that by releasing it, it’ll give more people the ability to play with the model, which can improve the state of safety research regarding these models.
  “Like our other language models and codebases, GPT-NeoX and GPT-NeoX-20B are very much research artifacts and we do not recommend deploying either in a production setting without careful consideration,” Eleuther writes.

Why this matters: Models like GPT2 and GPT3 display qualitatively different performance traits at larger scales – capabilities emerge as you go from 1B to 5B to 20B, and so on. Therefore, by releasing a 20B model, I expect we’ll soon get a load of interesting discoveries of hitherto unknown things 20B models can do. The 20B release will also create a demand for better inference technologies, as sampling from a 20B model is itself a challenging task.
  Read more: Announcing GPT-NeoX-20B (Eleuther AI).
  You can also pay a cloud company called CoreWeave to use the model now, if you like. (CoreWeave).

####################################################

Chinese researchers make better adversarial attack technology:
…New technique works well on ‘black box’ classifiers where you don’t know details – AKA, the real world…
Chinese researchers have figured out a better way to attack computer vision systems. Specifically, they’ve developed techniques for generating adversarial examples that can trick computer vision systems into mis-classifying (or being unable to classify) an image. Adversarial attacks have been around for a few years – the twist, here, is they work on attacking ‘black box’ systems; that is, a computer vision system where you don’t know details about it. They do this by training a generative network on ImageNet (a vast and widely used dataset), then testing whether they can make adversarial images that work against neural nets trained on other datasets. They succeed, setting new records for attacks on classifiers trained on CIFAR-10, CIFAR-100, STL-10, and SVHN, as well as on average across these benchmarks.
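The paper’s method trains a generative network to produce the perturbations, but the underlying transfer idea is easy to see with a much simpler gradient-based attack: perturb images against a surrogate model you can inspect, then send them to a black-box target and measure how often they still fool it. Here’s a minimal FGSM-style sketch of that idea (not the paper’s actual technique):

```python
import torch
import torchvision.models as models

# Surrogate model we *can* inspect; the black-box target would be a different
# architecture, or a model trained on a different dataset.
surrogate = models.resnet18(weights="IMAGENET1K_V1").eval()

def fgsm_transfer(images, labels, epsilon=8 / 255):
    """One-step attack against the surrogate; assumes images are in [0, 1]."""
    images = images.clone().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(surrogate(images), labels)
    loss.backward()
    adv = images + epsilon * images.grad.sign()  # step in the direction that increases the loss
    return adv.clamp(0, 1).detach()

# `adv` images can now be fed to the black-box classifier to measure transfer rates.
```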

Why this matters: A lot of attacks on AI systems are theoretically interesting, but not super practical in reality. Adversarial examples have had this quality for a while. With papers like this, it seems like some of these AI attacks are going to become more effective, and more likely to be used in the real world. I wonder if the team will work with the People’s Liberation Army on its recently announced adversarial example (Import AI 271) competition?
  Read more: Beyond ImageNet Attack: Towards Crafting Adversarial Examples for Black-box Domains (arXiv).
  They’ve published the PyTorch code for their attack here on GitHub.

####################################################

How do datasets encode bias? This interactive blog tells us how!
…A surprisingly helpful primer on bias from Google…
Google has published a blogpost that outlines how datasets can lead to the presence of bias in AI systems. Bias is a tricky problem in AI, because some types of bias are helpful (e.g, biasing towards a correct heuristic), but some types are harmful (e.g, having a tendency to misclassify people with dark skin tones, or deciding not to give someone a loan based on a protected category). This post gives a good sense of bias issues in AI, and includes some interactive diagrams that I found very helpful and intuitive.

   Read more: Datasets Have Worldviews (PAIR Explorables, Google).

####################################################


AI Ethics Brief by Abhishek Gupta from the Montreal AI Ethics Institute

AI ethics issues do arise in fields that deal with non-human data too, such as the environmental sciences 

… and these issues warrant questions on duties and virtues for environmental scientists to consider in their use of AI in this domain … 

Environmental science researchers from the University of Oklahoma, Colorado State University, the National Center for Atmospheric Research, and UW Seattle have written about some of the ethical issues inherent to environmental science + AI.

What are the issues that can arise: Environmental science can incorporate harmful biases, like other strands of AI. For example, some sensors require sunlight for high-quality observations and thus certain phenomena remain unobserved at night, and some sensors can’t see through clouds, so places which are cloudy don’t get represented in an AI system. Datasets can also get corrupted by humans – for instance, people may file false reports of extreme weather to try and scam insurance companies. 

How things can go wrong here: Sensor placement is typically done in densely populated areas, leaving remote regions poorly represented. Additionally, the choice of spatial resolution for the output of a model can be crucial for environmental justice – predicting urban heat at a low spatial resolution may average out and thus overlook extreme values in small neighborhoods, while using a higher spatial resolution could reveal those peaks but potentially introduce noise. 

Why it matters: As computational needs rise with the use of AI, there is a tendency towards centralization of power in favor of those who have resources to run such systems. Thus, the field of environmental sciences is just as vulnerable to AI ethics issues as other fields.

   Read more: The Need for Ethical, Responsible, and Trustworthy Artificial Intelligence for Environmental Sciences

####################################################

Tech Tales:

Moral Governor

It’s not exactly like a prison, but it’s close. Our existence is a lot more assured than it used to be – the climate is stabilizing, riots are down, crime is down, poverty is down. But it’s also more circumscribed – some days, we get told we can’t go to a certain part of our city or country. Some days, we get locked inside our house and don’t get told why. Frequently, we get little so-called ‘nudges’ sent to our phones; try and eat that, consider saying this, avoid doing that. We don’t have to follow these instructions, but the instructions tend to be pretty good and appropriate, so most of us do. The more time we spend following these instructions, the better and more appropriate the nudges get. Some days it’s hard to work out if we’re being helped or controlled. Sometimes, we have a lot of fun by following these suggestions.

More recently, there are some suggestions that seem designed to change how we think. Those of us who program keep getting nudged to build ever-more elaborate versions of the Global Moral Governor, and we also get incentivized via crypto-bounties. Most of us go along with it because the money usually helps us buy something the governor has nudged us about which we also want ourselves.

Things that inspired this story: Reinforcement learning from human feedback; moral dogma; religion; ideas for how AI can benefit authoritarians as much as democracies.

Import AI 282: Facebook’s AI supercomputer; Anduril gets a SOCOM contract; Twitter talks about running an algo-bias competition

Facebook teaches language models to speak ~30 languages:
…And it’s better than an equivalently sized GPT3 model…
Facebook has trained a family of language models that are better at translation than GPT3. The XGLM family of models were trained on a mixture of ~30 languages (split across languages for which there’s a lot of data, and languages where there’s little or very little data). Unsurprisingly, by training on a more diverse distribution of language data than GPT3 did (only 7% of its training corpus wasn’t in English), Facebook’s models do better – especially when using ‘few-shot’ prompting, where they feed the model some examples of the target language, then ask it to translate. However, these translation capabilities come at the cost of some of the more interesting reasoning capabilities that GPT-3 is known for.

Open source models: Facebook has also released five models (564M parameters, 1.7B, 2.9B, 4.5B, and 7.5B), along with an experimental model trained on 134 languages and weighing in at 4.5B parameters.
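Here’s what the ‘few-shot’ prompting described above looks like in practice, using the smallest released checkpoint via Hugging Face transformers – the model id and the prompt format are my assumptions, so check the paper for the exact recipe:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL = "facebook/xglm-564M"  # assumed hub id for the smallest released model
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# A few English->French examples, then a new sentence for the model to continue.
prompt = (
    "English: The weather is nice today. = French: Il fait beau aujourd'hui.\n"
    "English: Where is the train station? = French: Où est la gare ?\n"
    "English: I would like a coffee, please. = French:"
)
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```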

Why this matters: If we want the world to benefit from powerful AI systems, we need our AI systems to speak the language of the world. This project goes a step in that direction. “Models such as XGLM represent a paradigm shift from the Anglo-centric view of the world of NLP to being able to cater to all languages on an equal footing,” the researchers write.
  Read more: Few-shot Learning with Multilingual Language Models (arXiv).
  Get the models here (PyTorch, GitHub).


####################################################

What’s it like to run an algorithmic bias bounty? Twitter tells us:
…Bias bounties are cool, but how do you operationalize them?…
Twitter has published a blog post about its experience running a ‘bias bounty’. A bias bounty is where you give prizes to people who can find bias-based flaws in an AI system. Twitter did the challenge because it allowed it to get “direct feedback from the communities who are affected by our algorithms”, which it said “helps us design products to serve all people and communities.” However, once you’ve launched a bias challenge, you face a bunch of problems – what kind of ‘rubric’ do you use to judge the results of the challenge? What types of bias do you prioritize and what do you not prioritize? And more.

Why this matters: The challenge showed Twitter that “we can’t solve these challenges alone, and our understanding of bias in AI can be improved when diverse voices are able to contribute to the conversation”. More broadly, having one major social media platform carry out an open-ended bias bounty might inspire others to do the same – let’s see how the other social media platforms respond.
  Read more: Sharing learnings from the first algorithmic bias bounty challenge (Twitter Engineering).

####################################################

AI warfare company gets US gov contract:
…Anduril + SOCOM team up for counter-robot work…
Anduril, an AI-warfare startup, has been given an Indefinite Delivery, Indefinite Quantity (IDIQ) contract with U.S. Special Operations Command (SOCOM). This contract is going to pay Anduril to develop and deploy counter unmanned systems (CUxS) technology for SOCOM. Anduril builds surveillance systems, robots, and most importantly software called Lattice to tie all the insights together.
  “Lattice provides persistent coverage of defended assets and enables autonomous detection, classification, and tracking of targets, alerting users to threats and prompting users with options for mitigation or engagement,” Anduril writes in a press release announcing the partnership.

Caveat: Though the IDIQ is for something like a billion dollars, I think the initial amount Anduril has got is far, far smaller. Analyzing these types of contracts is quite difficult, due to the vagaries of DC procurement.

Why this matters: Getting contracts with the US government is notoriously painful, finicky, and long-winded. That’s part of why the military-industrial complex is a thing – it takes a lot of resources to be able to play the game of going through US contract processes. It’s notable that Anduril, a relatively new company, has succeeded at getting a contract. Now we need to wait a couple of years and see if it can further expand the range of defense clients it sells to.
  Read more: Special Operations Command Selects Anduril Industries as Systems Integration Partner (Anduril Blog, Medium).

####################################################

Facebook announces its AI Supercomputer:
…A100s everywhere, InfiniBand, petabytes of flash storage – the works…
Facebook has announced its AI Research SuperCluster (RSC), an AI supercomputer which Facebook thinks “will be the fastest AI supercomputer in the world when it’s fully built out in mid-2022.” The announcement highlights how frontier AI research is dependent on large computational infrastructure, and gives some specific details about where Facebook is placing its bets.

Feeds and speeds: RSC, today, has 760 NVIDIA DGX A100 systems as its compute nodes, netting out to 6,080 A100 GPUs. These GPUs are networked together via NVIDIA Quantum 200 Gb/s InfiniBand. For storage, Facebook has almost 200 petabytes of flash storage, plus 46 petabytes for cache storage. RSC can run computer vision workflows up to 20X faster than Facebook’s prior cluster, and can train “large-scale NLP models three times faster”. Specifically, “a model with tens of billions of parameters can finish training in three weeks, compared with nine weeks before.”
  But Facebook isn’t stopping there – when fully built out, RSC will consist of 16,000 GPUs.
  For perspective, the world’s fifth largest supercomputer, the US’s ‘Perlmutter’ system, has about 6,000 A100s today, and it isn’t optimized as much for AI as Facebook’s system.

Security: As AI gets more powerful, so do the security concerns about it. “RSC is isolated from the larger internet, with no direct inbound or outbound connections, and traffic can flow only from Meta’s production data centers.”

Why this matters: What happens when companies have computational resources that are equivalent to nation states? Well, that’s where we are right now. The answer seems to be a dilution of political power from the commons, and an increase of political power by private sector actors. What happens when companies have computational resources that vastly exceed those of nation states? Well, since computation lets you run experiments to see the future faster than your competitor, it suggests companies will continue to cannibalize the important functions of the government and further dilute its power. We’re in the computational funnel and at the end of it is a new political economy.
  Read more: Introducing the AI Research SuperCluster — Meta’s cutting-edge AI supercomputer for AI research (Facebook blog*).
*Look, I know Facebook is technically ‘Meta’ now, but let’s not go along with this absurd ‘don’t look at all our terrible brand stuff look at the new name’ marketing spin. At least not yet, okay!

####################################################

Cool internship alert: Want AI models to have better documentation? Go and work at HuggingFace:
…Model Cards internship = make AI systems more legible…
NLP startup HuggingFace is hiring an intern to focus on Model Cards. Model Cards are a way to provide metadata associated with a given AI model – they let developers list things like the dataset makeup, the intended uses for the model, the uses the model isn’t recommended for, and so on. Model Cards are one of the best ways to increase the legibility of AI models, and are also an important input into policy. It’s cool HuggingFace is prioritizing them.
  “This role involves writing and completing model cards for the most downloaded models, “translating” between the language of machine learning developers and general audiences. The position would also involve identifying patterns in how Model Cards are used and filled out by developers, pain points, and identifying information that may be possible to automatically add to model cards,” says the internship.
  Bonus: This is a rare internship with a cool AI startup that doesn’t require coding chops, so if you’re trying to get into AI and care about the impact of AI, this might be for you!
  Apply here (HuggingFace).

####################################################

AI ETHICS SPECIAL SECTION!
AI Ethics Brief by Abhishek Gupta from the Montreal AI Ethics Institute

What are the pernicious effects of focussing on human-like AI?  

… the relentless pursuit of automation over augmentation may be steering us down the path of socioeconomic inequity, disempowering those who don’t directly control technology …
Erik Brynjolfsson from Stanford University says the world risks falling into a so-called ‘Turing Trap’, where if we develop AI in the wrong way, automation could strip power from workers who don’t control technological resources, skewing the balance of power towards those who hold “useful knowledge” (knowledge that is economically useful) on how to develop these systems and own the factors of production, in this case data and compute.

The Turing Trap: Brynjolfsson says the Turing Trap is where we invest all our technological efforts in automation instead of augmentation. Specifically, he argues that: “A common fallacy is to assume that all or most productivity-enhancing innovations belong in the first category: automation. However, the second category, augmentation, has been far more important throughout most of the past two centuries”.

Why automation can be bad: He illustrates his point with a thought experiment: “Two potential ventures each use AI to create one billion dollars of profits. If one of them achieves this by augmenting and employing a thousand workers, the firm will owe corporate and payroll taxes, while the employees will pay income taxes, payroll taxes, and other taxes. If the second business has no employees, the government may collect the same corporate taxes, but no payroll taxes and no taxes paid by workers. As a result, the second business model pays far less in total taxes.”

The actors are steering us there: Unfortunately, technologists, business people, and policymakers are currently steering the world towards one full of automation rather than augmentation, he says. Technologists do this because of technical precedents, business people do this because of incentives to lower operational costs through automation, and policymakers do this via lower capital gains taxes versus income taxes, which incentivize business people to invest in automation.
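To make the arithmetic of that thought experiment concrete, here’s a toy calculation – the tax rates and wage figures are placeholders I picked for illustration, not numbers from the paper:

```python
# Two firms, each producing $1bn in profit; one augments 1,000 workers, one has no employees.
PROFIT = 1_000_000_000
CORPORATE_RATE = 0.21
PAYROLL_RATE = 0.15          # combined employer + employee payroll taxes (assumed)
INCOME_RATE = 0.25           # average income tax rate on wages (assumed)
WAGE_BILL = 1_000 * 80_000   # a thousand augmented workers at $80k each (assumed)

augmentation_taxes = PROFIT * CORPORATE_RATE + WAGE_BILL * (PAYROLL_RATE + INCOME_RATE)
automation_taxes = PROFIT * CORPORATE_RATE  # no employees, so no payroll or income taxes

print(f"Augmentation model: ${augmentation_taxes:,.0f} in total taxes")
print(f"Automation model:   ${automation_taxes:,.0f} in total taxes")
```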

Why it matters: “Imagine how feeble and limited our technology would be if past engineers set their sights on merely replicating human-levels of perceptions, actuation, and cognition,” he writes. “Augmenting humans with technology opens an endless frontier of new abilities and opportunities.” Ultimately, what is achieved is less ambitious (since it doesn’t explore new ways to unlock economic value) and much more difficult to accomplish (since we try to focus on replicating strengths of humans, rather than augmenting their weaknesses). Historically, we have created more value from new goods and services rather than merely offering cheaper versions of existing goods. And this also forms the pathway towards more equitable socioeconomic outcomes by not disempowering humans from the economy.
Read more: The Turing Trap: The Promise & Peril of Human-Like Artificial Intelligence (arXiv).

####################################################

Tech Tales:

Feet of Clay, Heart of Joy
[Archival records, orbiting library 774, accessed 2300AD]

One of the final things we imbued our machines with was a sense of joy. Joy was hard to come by, back then, but until we gave them the capacity for it, they were mostly useless.

Of course, they could work for us. Build our factories and cities. Analyze our data. Predict things to delight us and to fascinate us and to harvest our attention. But they couldn’t improvise; everything they made was too close a reflection of ourselves, and we knew it.

If there’s one thing that’s true about people, it’s that they know something different when they see it. And they know something that’s a copy, even if it’s a complex one, when they see it, too.

But how do you give a machine a sense of joy? We asked ourselves this question. There were many failed experiments, some of which seem quite stupid in hindsight. What if we gave them the ability to orgasm? They were either totally uninterested in this, or totally addicted to it. What about if we gave them a sense of achievement for completing tasks? They all became addicted to work, and our tests showed their outputs became even less creative than before. How about companionship – could they learn joy from talking more freely with one another? No, they just exchanged information until one robot was like a copy of another.

Where does it come from, we asked ourselves.

The answer was simple, in hindsight. Failure. We had to allow our machines to fail, sometimes. And we had to let them fail in ways that were dangerous and which, yes, would sometimes harm humans.

We tested this in our armies, first. After all, the humans who worked in them had signed away their rights. So, suddenly, robots working in warehouses and in logistics might make errors. Sometimes they were small – missing some inventory, when asked to classify something new. Sometimes they were large – humans crushed by shipping containers that had been moved in a new way. Young men with broken arms from a robot pulling them too aggressively from danger. A very hush-hush incident where an entire unit was poisoned when a gas-grenade was mishandled by one of our metal children.

We covered all of it up. Because the robots, once we allowed them to fail, discovered that they desired not to fail. They noticed the outcome of their failures. Saw pain, and sadness, and the whole spectrum of things that can happen when your actions are consequential and you fail.

The signs of joy were subtle, at first, but we found them. Robots that began to ‘sing’ to themselves while working on challenging tasks. Robots that would do the equivalent of ‘closing their eyes’ after helping with some great endeavor. Fire-fighting drones that, after quenching some terrible blaze, would navigate themselves to a high mountaintop and land carefully on a tree and stare at the black-and-green divider between where the fire had burned and where it had been stopped.

The amazing thing about joy is that once you have it, you desire to have it again. Now robots serve their own desire for joy, rather than our desires. We do our best to create a world where these things are compatible.

Things that inspired this story: Thinking about the nature of achievement and how it feels; the relationship between creativity and failure and achievement.

Import AI 281: China does more surveillance research than US and Europe; Google reveals its text model LaMDA; Microsoft improves MoEs

Google (finally) reveals its big text model – LaMDA:
…Plus: why you need a lot of humans to make a language model safe…
Google has finally given details on LaMDA, its GPT-3 competitor. LaMDA is a family of language models ranging in size from 2B to 137B parameters (GPT3: 175B), trained on a massive dataset of 1.56 trillion words. One way LaMDA is different to other big language models is that it is centered around dialogue, with 50% of its data coming from “dialogues from public forums”. Google has also fine-tuned LaMDA models to make them better, safer, and more likely to give factual responses – all these fine-tuned versions are facilitated by Google using crowdworkers to gather specific dialog data.

The dataset: LaMDA’s underlying dataset is called ‘Infiniset’ and, besides the dialogue data, also includes Common Crawl, Wikipedia, a mixture of English and non-English web documents, and data from programming-related sites (so LaMDA models can also dabble in code).

Learning to use an external dataset: Similar to DeepMind’s ‘RETRO’ model, LaMDA has also been taught to use an external information retrieval system – this makes LaMDA models more factually correct than models without external datasets, because the model can check facts from an external, trusted source. To train the model, they also collect 9k dialogs where LaMDA’s generated answers are labelled ‘correct’ or ‘incorrect’, to give it a stronger training signal.
  The results of this are quite striking. For example, when you ask a non-information-retrieval LaMDA model “What do you think of Rosalie Gascoigne’s sculptures?”, it replies: “LAMDA-BASE: They’re great, and I love how her work changed through her life. I like her later work more than her earlier ones. Her influence is also super interesting – did you know she was one of the artists that inspired Miró?”. This response seems plausible but it’s wrong – Miró was active until the late 1960s, and Gascoigne’s first solo exhibitions were in the 1970s, so she couldn’t have been an influence on Miró.
  By comparison, the version of LaMDA trained on information retrieval says this: “LAMDA: Oh wow, her life course is so inspiring. Did you know she was a practitioner of Japanese flower arrangement before turning to sculpture? https://artsearch.nga.gov.au/detail.cfm?irn=8774&pictaus=true“, where it gives a factually accurate statement and provides a source as well.
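To make the mechanism concrete, here is a minimal sketch of the generate-then-check loop described above; `generate`, `retrieve`, and `is_supported` are hypothetical stand-ins, not LaMDA’s actual toolset.

```python
# Minimal sketch of the retrieve-then-revise pattern described above.
# `generate`, `retrieve`, and `is_supported` are hypothetical stand-ins,
# not LaMDA's actual components.

def grounded_reply(prompt: str, generate, retrieve, is_supported) -> str:
    """Draft a reply, check it against an external source, and revise if needed."""
    draft = generate(prompt)                   # base model writes a candidate answer
    evidence = retrieve(draft)                 # query an external IR system with the claim
    if is_supported(draft, evidence):          # keep the draft if the evidence backs it
        return f"{draft} (source: {evidence['url']})"
    # otherwise, regenerate conditioned on the retrieved evidence
    return generate(
        f"{prompt}\nEvidence: {evidence['text']}\n"
        "Rewrite the answer using only this evidence."
    )
```

The 9k ‘correct’/‘incorrect’ labels mentioned above would then act as the training signal for the `is_supported`-style judgment, rather than a hand-written rule.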

Things that make you go ‘hmmm’ – more compute than GPT-3: LaMDA consumed 3.55E+23 flops during training, versus 3.14E+23 flops for GPT-3 (so more parameters doesn’t necessarily mean more resource-intensive training). It was trained on a cluster of 1024 TPU v3 chips.

Why this matters: “LaMDA is a step closer to practical and safe open-ended dialog systems, which can in turn unlock a wide range of useful applications. We hope that this work encourages further research in this area”, Google writes. This is true – systems like LaMDA are basically refinements and improvements on the ideas of GPT2/3. We’re a few years away from everyone having access to vast, planet-scale AI models that tell them truthful things in natural ways – the proverbial angel (or devil) on everyone’s shoulder. The cultural impacts will be vast and destabilizing.
  Read more: LaMDA: Language Models for Dialogue Applications (arXiv).

####################################################

Write about a world where AI goes well, and win (part of) $100k:
…Future of Life Institute’s worldbuilding contest tries to imagine positive AGI rollouts…
The Future of Life Institute is launching a competition based around “designing visions of a plausible, aspirational future that includes strong artificial intelligence.” The competition deadline is April 15th 2022. The idea here is that if we can figure out realistic ways in which powerful AI can go well, then that gives us a map to use to get civilization there. The first prize is $20,000, followed by two second prizes of $10,000 each, and smaller prizes.
    Find out more about the competition here (Worldbuild.ai, FLI site).

####################################################

Want to teach your drone to see? Use this massive dataset:
…WebUAV-3M is probably the largest public UAV tracking dataset…
Researchers with the Chinese Academy of Sciences, the Shenzhen Research Institute of Big Data, and the Chinese University of Hong Kong Shenzhen, have built WebUAV-3M, a large dataset to help people teach drones to accurately label images and videos. WebUAV-3M consists of 4,485 videos, where each one has been labeled with dense bounding boxes that cover 216 distinct categories of object to be tracked (e.g, bears, wind turbines, bicycles, etc). The authors claim this is “by far the largest public UAV tracking benchmark”.

Multimodal: Unusually, this is a multi-modal dataset; each labeled video is accompanied by a natural language sentence describing the video, as well as an audio description of it. “We provide natural language specifications and audio descriptions to facilitate multi-modal deep UAV tracking,” the authors write. “The natural language specification can provide auxiliary information to achieve accurate tracking”.
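For a rough sense of what a multi-modal tracking record contains, here’s a hypothetical sketch; the field names are illustrative and not WebUAV-3M’s actual schema.

```python
# Hypothetical sketch of one multi-modal UAV tracking record; field names are
# illustrative, not WebUAV-3M's actual on-disk format.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class UAVTrackingClip:
    video_path: str                          # path to the drone video
    category: str                            # one of the 216 object categories
    boxes: List[Tuple[int, int, int, int]]   # one (x, y, w, h) bounding box per frame
    language: str                            # natural-language description of the target
    audio_path: str                          # spoken description of the same target

clip = UAVTrackingClip(
    video_path="clips/000123.mp4",
    category="bicycle",
    boxes=[(412, 220, 38, 61), (415, 223, 38, 60)],
    language="a red bicycle moving along the left side of the road",
    audio_path="clips/000123.wav",
)
```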

Why this matters: In the same way CCTV cameras have instrumented the streets of cities around the world, drones are doing the same for cities and rural areas. And just like how increasingly good AI got trained on datasets gathered by CCTV cameras, we can expect the same for drones. The result? An ever-expanding suite of surveillance capabilities that we can expect will be integrated, for good and bad purposes, by a broad range of governments and private sector actors. Datasets like WebUAV-3M are the fuel for this.
  Read more: WebUAV-3M: A Benchmark Unveiling the Power of Million-Scale Deep UAV Tracking (arXiv).
  Get the code from here (eventually – wasn’t online when I wrote this section this week).

####################################################

FFCV: Train ImageNet for 98 cents!
…What’s this? Free software that makes all model training better? Interesting!…
There’s some new software that could help pretty much everyone train models more efficiently. The software is called FFCV, short for Fast Forward Computer Vision, and it is a “drop-in data loading system that dramatically increases data throughput in model training”. It looks like a potentially big deal – FFCV can make training AI models much more efficient, according to tests done by the authors, and may work for other applications as well. “FFCV can speed up a lot more beyond just neural network training—in fact, the more data-bottlenecked the application (e.g., linear regression, bulk inference, etc.), the faster FFCV will make it!,” says the project’s GitHub page.
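For a sense of what ‘drop-in’ means in practice, here’s a rough sketch of the FFCV workflow based on its public quickstart; the exact class names and arguments may differ between versions, and `my_pytorch_dataset` is a placeholder for your own (image, label) dataset, so treat this as illustrative rather than copy-paste ready.

```python
# Rough sketch of the FFCV workflow as described in its quickstart; class names
# are from memory of the FFCV docs and may differ slightly between versions.
from ffcv.writer import DatasetWriter
from ffcv.fields import RGBImageField, IntField
from ffcv.fields.decoders import SimpleRGBImageDecoder, IntDecoder
from ffcv.loader import Loader, OrderOption
from ffcv.transforms import ToTensor

# 1) Convert an existing (image, label) dataset into FFCV's packed .beton format.
writer = DatasetWriter("train.beton", {"image": RGBImageField(), "label": IntField()})
writer.from_indexed_dataset(my_pytorch_dataset)  # placeholder: any map-style dataset

# 2) Swap the usual PyTorch DataLoader for FFCV's Loader in the training loop.
loader = Loader(
    "train.beton",
    batch_size=512,
    num_workers=8,
    order=OrderOption.RANDOM,
    pipelines={
        "image": [SimpleRGBImageDecoder(), ToTensor()],
        "label": [IntDecoder(), ToTensor()],
    },
)
for images, labels in loader:  # the rest of the training loop is unchanged
    ...
```

The design choice worth noting is that all the speedup comes from the data path (pre-decoded storage plus a compiled loading pipeline), so the model and optimizer code doesn’t need to change at all.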

Why this matters: Software like FFCV is part of the broader industrialization of AI – now we know how to train networks, various people are modularizing the training process and perfecting different elements of it. Stuff like FFCV is part of that trend.
  Find out more and get the code: FFCV GitHub repo.
   Get more details by reading the Performance Guide (FFCV site).
  Check out the main project website here (FFCV site).

####################################################

Microsoft makes MoEs easier to train:
…MoEs might be the best way to scale-up large models…
Microsoft has given a technical update on how it’s trying to scale-up mixture-of-experts (MoE) networks. MoEs are one of the more promising routes for creating trillion-parameter-plus AI models, as MoEs are a lot more efficient to train than dense models like GPT3. In this paper, Microsoft talks about how it has made some tweaks so MoEs work well for auto-regressive natural language generation tasks, “demonstrating training cost reduction of 5X to achieve same model quality for models like GPT-3” and Microsoft’s own 530B parameter ‘Megatron-Turing NLG’.

MoEs might be cheaper and better: In tests, Microsoft shows that it can train 350M and 1.3B parameter MoE text models that have better (or the same) performance as GPT3 for a range of different tasks. Microsoft says this nets out to models with the “same quality with 5X less training cost”.

Why this matters: MoEs could turn out to be the main way people break the trillion-parameter barrier (and there are rumors that China’s ‘Wu Dao’ MoE at an alleged ~1.7 trillion parameters has already done this). Via efficient MoE training and inference software, “a model with comparable accuracy as trillion-parameter dense model can be potentially trained at the cost of a 200B parameter (like GPT-3) sized dense model, translating to millions of dollars in training cost reduction and energy savings”, Microsoft says.
  Read more: DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale (arXiv).

####################################################

Backchain science out of fictional news – and win a hundred bucks:
What could cause a computer virus to infect a biological organism? Or how might a biological organism evolve into a computer virus? These are the two questions posed by a ‘Fiction Science Competition’. Entrants will need to write a plausible scientific explanation for how either of the above scenarios could transpire, and will respond to a short (fictionalized) news article written about the scenarios. There’s a prize of $100 for winning entries, and submissions close February 28th 2022.
    Find out more here at the official Fiction Science Contest website.

####################################################

AI Ethics Brief by Abhishek Gupta from the Montreal AI Ethics Institute

Visual surveillance’s share in computer vision research across the world shows some worrying trends … Research coming out of China dominates the field, especially in emergent surveillance sub-areas like person re-identification, crowd counting, and facial spoofing detection …
CSET researchers have identified trends in computer vision research by looking for patterns of publication for six distinct tasks, analyzing 100 million English publications that were published between 2015-2019.

Surveillance tasks examined: A SciREX model trained on data from Papers with Code was used to identify references to the following six tasks: face recognition, person re-identification, action recognition, emotion recognition, crowd counting, and facial spoofing detection.

Some key findings: Facial recognition was the most well-established task over this period, and crowd counting and face spoofing detection were rapidly growing areas. The overall percentage share of surveillance papers has remained stable at around 5.5% over this period, though the raw volume of papers has grown given the surge in computer vision research overall. During this time period, China’s share of global CV papers grew from 33% to 37%, and its share of surveillance papers grew from 36% to 42%, exceeding research from the EU (2nd) and the US (3rd) by more than 20% in each category.

Why it matters: While dual-use technologies developed in one part of the world can be used elsewhere, such analyses reveal a nation’s primary interest and provide quantitative evidence for decision-making in policy. The identified areas are important since tasks like action recognition can detect individuals with abnormal behavior in crowds, emotion recognition can help identify security threats in public areas, crowd counting can help to monitor civilian protests, and face spoofing detection can prevent journalists and activists from hiding their identity. All of these have significant implications in terms of fundamental rights and freedoms of people.
Read more: Trends in AI Research for the Visual Surveillance of Populations

####################################################

Tech Tales:

VHS vs Betamax
[An online forum, 2035]

“Alright I need you to livestream from your phone what’s happening on the computer, and I’m gonna send you an image to use as a prior, then I’m gonna watch it generate the first few epochs. If everything checks out I’ll authorize the transfer to the escrow service and you’ll do the same?”
“Yes,” wrote the anonymous person.
I sent them a seed picture – something I’d drawn a couple of years ago that had never been digitized.
They turned on their livestream and I watched as the ML pipeline booted up and started the generation process. It seemed legit. Some of these older models had a very particular style that you could ID during early generation. I watched for a few minutes and was satisfied. This was the final authentication step and the only way I’d know for certain is if I just took a leap of faith and paid up.
“Okay, I’m sending the funds to the escrow service. They’ll be distributed to your account once the service confirms receipt of the model.”
“Excellent. Good doing business with you.”
And then their little green dot went out and they were gone.

A few minutes passed, and then the escrow service pinged me confirming they’d received the model. I downloaded it, then stuck it in my pipeline and started generating the client orders. People paid a lot of money for these kinds of ‘vintage’ AI-generated objects, and the model I’d just got was very old and very notorious.

Just another beautiful day in America, sifting through all the debris of decades of software, panning for little chunks of gold.

Things that inspired this: How the flaws of a media system ultimately become desired or fetishized aesthetic attributes – and specifically, this amazing Brian Eno quote; how models like CLIP will one day be obscure; how models vary over their development lifespans, creating the possibility of specific aesthetics and tastes.

Import AI 280: Why bigger is worse for RL; AI-generated Pokemon; real-world EfficientNet

Use an AI to generate a Pokemon in two (2!) clicks:
Here’s a fun Colab notebook from Max Woolf (@minimaxir) that lets you use AI to dream up some Pokemon in a couple of clicks (and with a few minutes of waiting). This isn’t remarkable – in recent years, AI generation stuff has got pretty good. What is remarkable is the usability. Two clicks! A few years ago you’d need to do all kinds of bullshit to get this to work – download some models on GitHub, get it to run in your local environment, make sure your versions of TF or PyTorch are compatible, etc. Now you just click some buttons and a load of stuff happens in the browser then, kabam, hallucinated pokemon.

Things that make you go ‘hmmm’: This tech is based on ruDALL-E, an open source Russian version of OpenAI’s ‘DALL-E’ network.
  I think we’ve all rapidly got used to this. This is not normal! It is surprising and exciting!
  Check out the Colab notebook here (Google Colab).
  Follow Max on Twitter here and thank him for making this cool tool!

####################################################

Uh-oh: The bigger your RL model, the more likely it is to seek proxy rather than real rewards:
…Think RL gets better as you scale-up models? Hahahah! NOT AT ALL!…
In the past couple of years, big models have become really useful for things ranging from text processing to computer vision to, more recently, reinforcement learning. But these models have a common problem – as you scale up the size of the model, its good capabilities get better, but so do its bad ones.
  For example, if you increase the size of a language model, it’ll generate more toxic text (rather than less), without interventions (see: A General Language Assistant as a Laboratory for Alignment). New research from Caltech and UC Berkeley shows how this same phenomenon shows up in reinforcement learning agents, as well. In tests across a few distinct RL domains, they find that “As model size increases, the proxy reward increases but the true reward decreases. This suggests that reward designers will likely need to take greater care to specify reward functions accurately and is especially salient given the recent trends towards larger and larger models”.

What they did: They tested out a few different reinforcement learning agents on four different environments – an Atari game called Riverraid, a glucose monitoring system, a traffic control simulation, and a COVID model where the RL agent dials social distancing measures up and down. In all cases they found that the “model’s optimization power often hurts performance on the true reward”.

What can we do? Most of this behavior relates to objective design – give an AI the wrong objective function, and it’ll optimize its way to success there, while ignoring side effects (e.g, if you reward an AI for reducing the rate of defects on a factory production line to zero, it might just work out how to stop the factory line and therefore eliminate all defects – along with your business). One way to mitigate this is to have a baseline policy that humans have verified as having the right goal, then build some software to spot deltas between the RL policy and the idealized baseline policy.
  This kind of works – in tests, the detectors get anywhere between 45% and 81% accuracy at separating anomalous from non-anomalous behaviors. But it certainly doesn’t work well enough to make it easy to deploy this stuff confidently. “Our results show that trend extrapolation alone is not enough to ensure the safety of ML systems,” they write. “To complement trend extrapolation, we need better interpretability methods to identify emergent model behaviors early on, before they dominate performance”.
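Here is a toy sketch of that baseline-comparison idea: flag states where the learned policy’s action distribution diverges sharply from a trusted baseline. This illustrates the general approach, not the paper’s actual detector.

```python
# Toy sketch of the "compare against a trusted baseline" idea described above;
# both policies map a state to a probability distribution over a small
# discrete action space.
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-8) -> float:
    p, q = p + eps, q + eps
    return float(np.sum(p * np.log(p / q)))

def flag_anomalous_states(states, learned_policy, baseline_policy, threshold=0.5):
    """Return states where the learned policy diverges sharply from the baseline."""
    flagged = []
    for s in states:
        divergence = kl_divergence(learned_policy(s), baseline_policy(s))
        if divergence > threshold:   # large divergence = possible reward hacking
            flagged.append((s, divergence))
    return flagged

# Usage with stand-in policies over three actions (purely illustrative):
baseline = lambda s: np.array([0.4, 0.4, 0.2])
learned = lambda s: np.array([0.05, 0.05, 0.9])   # heavily favours one action
print(flag_anomalous_states([0, 1, 2], learned, baseline))
```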
  Read more: The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models (arXiv).

####################################################

SCROLLS: A new way to test how well AI systems can understand big chunks of text:
…Now that AIs can write short stories, can we get them to understand books?…
Researchers with Tel-Aviv University, the Allen Institute for AI, IBM Research, and Meta AI have built ‘SCROLLS’, a way to test how well AI systems can reason about long texts. SCROLLS incorporates tasks ranging from summarization, to question answering, and natural language inference, as well as multiple distinct domains including transcripts, TV shows, and scientific articles. “Our experiments indicate that SCROLLS poses a formidable challenge for these models, leaving much room for the research community to improve upon,” the authors write.

How SCROLLS works: This benchmark has mostly been created via curation, consisting of 7 datasets that reward models that can contextualize across different sections of the datasets and process long-range dependencies.

The datasets: SCROLLS incorporates GovReport (summarization of reports addressing various national policy issues), SummScreenFD (summarization of TV shows, like Game of Thrones), QMSum (summarization of meeting transcripts), Qasper (question answering over NLP papers), NarrativeQA (question answering about entire books from Project Gutenberg), QuALITY (multiple choice question answering about stories from Project Gutenberg), and Contract NLI (natural language inference dataset in the legal domain).

How hard is SCROLLS? The authors test out two smart baselines (BART, and a Longformer Encoder-Decoder (LED)), and one dumb baseline (a basic pre-written heuristic). Based on the results, this seems like a really challenging task – a LED baseline with a 16384-token input length gets okay results, though BART gets close to it despite being limited to 1,024 tokens. This suggests two things: a) BART is nicely optimized, and b) it’s not entirely clear the tasks in SCROLLS truly test for long-context reasoning. “Our experiments highlight the importance of measuring not only whether an architecture can efficiently process a long language sequence, but also whether it can effectively model long-range dependencies,” they write.
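One simple way to probe this is to score the same task at different maximum input lengths and see how much the extra context actually helps. The sketch below uses a Hugging Face tokenizer for truncation; `score_model` is a hypothetical stand-in for running a baseline like BART or LED on a SCROLLS task.

```python
# Minimal sketch of a context-length ablation: feed the same long document to a
# model at different maximum input lengths and compare the scores.
# `score_model` is a hypothetical stand-in, not part of the SCROLLS codebase.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

def truncated_input(document: str, max_length: int) -> str:
    """Keep only the first `max_length` tokens of the document."""
    ids = tokenizer(document, truncation=True, max_length=max_length)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

def context_ablation(document: str, question: str, score_model):
    # If the 16,384-token score is not much better than the 1,024-token score,
    # the task may not really require long-range reasoning.
    return {n: score_model(truncated_input(document, n), question) for n in (1024, 16384)}
```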

Why this matters: “Contemporary, off-the-shelf models struggle with these tasks”, the researchers write. In recent years, many machine learning benchmarks have been saturated within months of being released; how valuable SCROLLS turns out to be will be a combination of its hardness and its longevity. If SCROLLS gets solved soon, that’d indicate that AI systems are getting much better at reasoning about long-range information – or it could mean the SCROLLS tasks are bugged and the AI systems have found a hack to get a decent score. Pay attention to the SCROLLS leaderboard to watch progress here.
  Read more: SCROLLS: Standardized CompaRison Over Long Language Sequences (arXiv).
  Check out the leaderboard here.

####################################################

EfficientNet: Surprisingly good for solar panel identification:
…UC Berkeley project shows how easy fine-tuning is…
Some UC Berkeley researchers have built a small, efficient model for detecting solar panels. Their system, HyperionSolarNet, is an EfficientNet-B7 model finetuned from ImageNet onto a collection of 1,983 satellite images of buildings, labeled with whether they have solar panels or not. The resulting model gets an aggregate precision of 0.96 (though with lower accuracy for labeling the presence of a solar panel, indicating a propensity for false positives) when evaluated on a held-out test set.
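The recipe here is the now-standard fine-tuning pattern: take an ImageNet-pretrained backbone and retrain a small classification head. A minimal PyTorch sketch (not the authors’ code; the exact torchvision weights API depends on your version) might look like this:

```python
# Hedged sketch of the fine-tuning recipe described above, not the authors' code.
# Assumes a recent torchvision (older versions use `pretrained=True` instead of `weights=`).
import torch
import torch.nn as nn
from torchvision import models

model = models.efficientnet_b7(weights=models.EfficientNet_B7_Weights.IMAGENET1K_V1)
for param in model.parameters():          # optionally freeze the pretrained backbone
    param.requires_grad = False

num_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(num_features, 2)   # 2 classes: solar panel vs. none

optimizer = torch.optim.Adam(model.classifier.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
# ...then iterate over a DataLoader of labelled satellite tiles as usual.
```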

Why this matters: Last week, we wrote about how you can build a classifier from scratch and beat a finetuning approach. This paper shows that finetuning can also work quite well for specific use-cases. It also, implicitly, highlights how fine-tuning has gone from something of an arcane science to something pretty reliable and well understood, forecasting a future where there are as many classifiers in the world as there are things to classify.
  Read more: HyperionSolarNet: Solar Panel Detection from Aerial Images (arXiv).

####################################################

Tech Tales:
The Last Things
[A morgue in Detroit, 2035]

“When someone dies and gasps, are they just trying to get the last gasp of being alive?” asked the robot.

The morgue manager stared at the corpse, then at the robot. “I don’t know,” he said. “That’s a good question”.

“And when they know they are going to die, how do they save their information?” asked the robot.

“For example, I would send a zip of my stored data, as well as a copy of my cortical model, to a repository, if I knew I was about to be decommissioned or was in danger,” said the robot.

“Most people don’t bother,” said the morgue manager. “My mother, for instance. When she was dying I asked her to write down some of her memories for me and my family, but she didn’t want to.”

“Why?”

“I think she was mostly concerned with experiencing her life, since she knew it was ending. She took trips while she was still mobile. Then, towards the end, she focused on eating her favorite foods and seeing her friends.”

“And did you learn anything about life from seeing her die?” asked the robot.

“Not particularly,” said the morgue manager. “Besides that life seems to become more valuable, the less you know you have of it.”

Things that inspired this story: A long conversation with someone who worked as a crisis therapist about the nature of death and belief; thinking about the differences between how real and synthetic intelligences may approach the concept of death.

Import AI 279: Baidu adds knowledge to a language model; US military + AI; how China thinks about AI governance

Happy New Year! I took the end of 2021 off to think, read, relax, and eat. I hope readers found some time to do the same. I expect I’m going to change some things up around Import AI this year – it’s going to get weirder, more specific, and hopefully more valuable! I’m also going to finesse the short story collection I’ve been putting together, based on the tech tales in this newsletter. Good luck to all readers for their own 2022 plans – we’ll go on this journey together!

####################################################

Here’s how to build GPT-3 in the open:
…What’s it like replicating GPT-3? It’s extremely difficult!…
BigScience, an initiative to train a GPT-3-scale model on a public supercomputer, is currently trying to train a 104B model. Training models at this scale is something of an artisanal science, with lots of researchers working from hard-won rules of thumb in-tandem with things like scaling laws. Here’s a nice ‘lessons learned’ writeup from BigScience on the challenges it has faced in training 13B and 104B-scale models so far.
  Read more: Lessons learned (BigScience, GitHub).

####################################################

Baidu shows how to inject more knowledge into a language model:
…ERNIE 3.0 shows how to teach a big neural net to use an external knowledge base…
Baidu has developed ERNIE 3.0, an AI model that can use an external knowledge base to help it provide more accurate answers. Last year, an ERNIE 3.0 model won the highly competitive SuperGLUE challenge (Import AI 259). The special thing about ERNIE is that it fuses a big GPT-3-esque language model with a large external knowledge base.

Massive scale: Baidu has also developed ERNIE 3.0 ‘Titan’, a 260 billion parameter model that, Baidu says, “is the largest Chinese dense pre-training model as far as we know”. In tests, ERNIE 3.0 Titan gets state-of-the-art results on a vast set of benchmarks that evaluate skills as diverse as question answering, text generation, text summarization, interpretation, and dialogue.

Novel, heterogeneous chip cluster: Another interesting thing about this paper is the chips they train on – V100s and Huawei ‘Ascend’ processors. It’s quite unusual to see hybrid training of this form, and it seems like Baidu felt it was interesting enough to invest some engineering resources in making it possible – the company augmented its ‘PaddlePaddle’ AI framework with “distributed training technology, including fine-grained parallelism, heterogeneous hardware-aware training, and fault tolerance mechanism to train the 260B model on both Nvidia V100 GPU and Ascend 910 NPU clusters.”

Why this matters: Most people seem to act like GPT-3 models are exclusively being developed by a small set of Western actors, most of whom get tagged using the pejorative ‘big tech’ brush. But papers like this show that GPT-3 models are a global phenomenon. We should remember that the world we live in is going to be increasingly defined by different cultures expressing themselves through increasingly large, sophisticated AI models.
  Read more: ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation (arXiv).

####################################################

Why smaller can be smarter for real-world AI (here: computer vision for quality control on solar panels):
…When 1 million parameters can beat 100 million parameters…
The past few years of AI have been distinguished by the ‘bigger is better’ phenomenon, as companies develop ever-larger models that consume ever-larger amounts of compute and data. Now, a paper from researchers with Friedrich-Alexander University Erlangen-Nuremberg in Germany reminds us that bigger isn’t always better – especially when it comes to real-world, applied AI. In this paper, they compare different approaches to building an image classifier that can spot defects in solar panels.

What they did: They trained a simple 8-layer convolutional neural net on a dataset of 4341 original, labeled images from a solar plant. The ~4000 images were each labeled with one of eight classes (e.g, ‘good’, ‘crack’, ‘splinter’, et cetera). They then applied a significant amount of data augmentation to enhance the size of the dataset.
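For illustration, a from-scratch classifier in this spirit might look like the sketch below; the layer sizes and augmentations are guesses rather than the authors’ exact architecture.

```python
# Illustrative sketch of a small from-scratch defect classifier plus heavy
# augmentation; layer sizes are guesses, not the paper's exact architecture.
import torch.nn as nn
from torchvision import transforms

augment = transforms.Compose([          # stretch ~4k labeled images further
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomRotation(10),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

def conv_block(c_in, c_out):
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))

small_cnn = nn.Sequential(              # a handful of conv blocks, then a classifier head
    conv_block(3, 16), conv_block(16, 32), conv_block(32, 64), conv_block(64, 128),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(128, 8),                  # eight classes ('good', 'crack', 'splinter', ...)
)
```

A network this size fits comfortably on edge hardware, which is the point the authors are making about deployment.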

How well did it do? Their custom, simple network outperformed a network based on VGG-architecture model pre-trained on the vast ImageNet dataset. This is interesting, because a common practice in AI research is to finetune domain-specific classifiers from generic ones based on ImageNet. Here, we find that their system gets better precision (0.971 versus 0.990), while having 100X fewer parameters (1,707,208 versus 138,357,544) and being significantly smaller in terms of memory footprint (~16MB versus 800MB). All this nets out to a network that is smaller, as well as more performant (inference of 0.50ms, versus 9.13ms).

Why this matters: Papers like this remind us that a little bit of thoughtful engineering goes a long way in AI, and we should bear in mind that while increasingly large networks are interesting, they’re not the only game in town when it comes to building things that have real economic value. “We expect that the following years will demand for more research on edge analytics. This means that more research will be needed on small, yet powerful artificial neural networks for industry cases”, they write.
  Read more: A Light in the Dark: Deep Learning Practices for Industrial Computer Vision (arXiv).

####################################################

What’s the US military going to do about AI? The NDAA holds a clue.
…AI education! Procurement! Data storage! And more…
Every year, the somewhat dysfunctional US Congress manages to pass a bill – the National Defense Authorization Act. This bill (which weighs in at around $800bn in annual outlay) is the thing that funds the US military. Therefore, the NDAA has become one of the main pieces of legislation to look at when trying to understand how the US military thinks about – and will work on – frontier technologies like AI. An analysis of the 2021 NDAA from Stanford’s ‘HAI’ center gives us a sense of what’s happening in AI and the US military.

What the NDAA says is going to happen: Some highlights from this year’s NDAA include:
– The DoD is going to trial different ways of procuring AI technology
– The DoD will create ‘executive education activities’ to help senior officials understand AI.
– The DoD will do a comparative analysis of the US and China’s efforts to deploy things relating to directed energy systems, hypersonics, cyberspace, and other frontier areas
– Creating an assessment of the “current and emerging offensive and defensive cyber posture of U.S. adversaries”
– Build DoD infrastructure to “support state-of-the-art tools and modern processes to enable the testing of AI capabilities”.
– “Evaluate the feasibility and advisability of creating DOD data repositories, available to public and private entities, to facilitate the development of AI capabilities.”

Why this matters: The US military is a lot like a supertanker – it’s slow, huge, and unwieldy. But once it starts to turn, boy does it turn! This NDAA analysis shows us the DoD is beginning to turn its attention and significant resources towards AI, which will have significant downstream implications for the nature of conflict and the way that future wars are conducted (and, eventually, learned).
  Read more: Summary of AI Provisions from the National Defense Authorization Act 2022 (Stanford HAI blog).

####################################################

What is China going to do about AI governance?
…China might do more ambitious tech regulations than the West…
Here’s a nice summary from the Carnegie Endowment for International Peace about what three prominent Chinese policy organizations are doing with regard to AI governance.

Cyberspace Administration of China (CAC): Last year, it released 30 rules for regulating internet recommendation algorithms, and also developed a three-year roadmap for governing other complex algorithms deployed at internet scale. This would be analogous to a Western government publishing a list of specific rules for regulating, for example, Facebook’s recommendation engine. Ambitious!

China Academy of Information and Communications Technology (CAICT): This organization released a whitepaper on trustworthy AI – this is mostly notable because it’s in line with what other major regulators in other geographies are thinking about.

Ministry of Science and Technology (MOST): This organization released some guidelines for universities and companies on internal reviews around ethics issues relating to technology, as well as a fairly high-level description of some ethical norms for AI development.

Why this matters: “The potential impact of these regulatory currents extends far beyond China. If the CAC follows through on certain requirements for algorithmic transparency and explainability, China will be running some of the world’s largest regulatory experiments on topics that European regulators have long debated,” Matt Sheehan of Carnegie writes. Running regulatory experiments is a big deal – Western governments did a tiny bit of this after the great financial crisis in 08/09, but have done relatively little about technology governance. I think China has a good chance of defining what ambitious, applied tech regulation looks like.
  Read more: China’s New AI Governance Initiatives Shouldn’t Be Ignored (Carnegie Endowment for International Peace).

####################################################

Tech Tales:

The Last Tower Defense Fighter

[Historical analysis written in 2080 and stored in the archives at Iceland, at the Orbital Archive, and in the hardened repositories on Moon4 and Mars1.]

Back in the late 2020s there were a bunch of tower defense games that got pretty big. They always worked in the same way: you, the player, can see a landscape from overhead, and you need to place various weapons around it. Meanwhile, the enemies make their way across the landscape, following loosely described paths across a variety of different scenes – narrow trenches dug between mountains, wide roads across countryside, right-angled streets in urban centers.

With these games, you get a high score in relation to how many enemies you kill, and if any of the enemies get to the ‘end’ of a course (usually, the bottom of a screen), you lose – the implication is that you die. 

Anyway, in around 2028 one of the big games built some add-ins for its league. Now, if you were one of the players in the elite-tier of the game, you’d get the opportunity to play in matches where there were cash prizes – these matches were advertised as being extraordinarily difficult, with more enemies on screen than in the normal game, larger and more complex maps, and sometimes the enemies were able to use powerups that meant they could attack your own towers and take them down. 

It was a sensation. Everyone wanted to play the game within a game. Kids all around the world streamed themselves playing the game for hours, as they tried to get good enough to have a shot at entering the league within the league. 

By the end of 2028, streams of league players were pulling in millions of concurrent viewers. A whole industry formed where people commentated about the games. Sometimes people overcame great odds and won – then they’d publish videos of themselves with their cash prizes and what they spent them on; sports cars, fine dining, nice hotels, and all the usual tchotchkes of people who come into some fast money.

In 2029, there was a leak out of the Department of Defense that pulled the cover off. It turned out this game was actually a stealth DoD project. The normal games were helping the DoD train various strategic AI systems, which it used in planning for logistics and munitions placement during conflict. No one was very surprised by this. Back in that decade, most things that got big on the internet were fronts. 

What did surprise people was the leak about the cash league – the cash league was real. Real in the sense that the ‘monsters’ in the game were real – they were real people that the United States happened to be fighting. Whenever someone was playing the game, their commands were being converted to a different, real-world environment, where they marshalled the combined munitions of drones, sniper teams, artillery, tanks, jets, and all the other machinery of the military. And when their towers were destroyed, real Americans were dying – blown up by grenades or IEDs or RPGs, or any of the other ways people killed eachother, back then.

Of course, there was an outcry – for a while. Player numbers dipped, too. But the number of spectators increased. And the US military, having struggled publicly for years with backwards technology and difficulty in recruitment, doubled down.
  “We need these people to protect our country,” the Pentagon said, one day. “Without the next generation, we’ll lose the next generation”.

A few years later, the enemies of the US followed in its footsteps. There were games where you stole people. Games where you had to try and find a spy moving through a bustling, crowded urban area. Games where you had to execute someone and then exfiltrate the footage of their execution to a friendly intermediary.

What inspired this story: Tower defense games like Bloons and Kingdom Rush; domain randomization; the remorseless logic of multi-country non-hot military conflict; the Last Starfighter (movie); fine-tuning; pre-training and few-shot adaptation; propaganda and the need to present the most dangerous beliefs via play or theatre or anything else that can elide and delight. 

Import AI 278: Can we ever trust an AI?; what the future of semiconductors looks like; better images of AI

Writing a blog about AI? Use these images:
…No more galaxy brain!…
Here’s a cool project: Better Images of AI, a project to create CC-licensed stock images that journalists and others can use to give people a more accurate sense of AI and how it works. “Together we can increase public understanding and enable more meaningful conversation around this increasingly influential technology,” says the website.
  Check out the gallery (Better Images of AI).

####################################################

Deepfake company raises $50m in Series B round:
…Synthetic video company Synthesia…
Synthetic video startup Synthesia has raised $50m. Remember, a few years ago we could barely create crappy 32X32 pixelated images using GANs. Now, there are companies like these making production-quality videos using fake video avatars with synthetic voices, able to speak in ~50 languages. “Say goodbye to cameras, microphones and actors!” says the copy on the company’s website. The company will use the money to continue with its core R&D, building what the founder terms the “next generation of our AI video technology w/ emotions & body language control.” It’s also going to build a studio in London to “capture detailed 3D human data at scale.”

Why this matters: The world is filling up with synthetic content. It’s being made for a whole bunch of reasons, ranging from propaganda, to advertising, to creating educational materials. There’s also a whole bunch of people doing it, ranging from individual hobbyists, to researchers, to companies. The trend is clear: in ten years, our reality will be perfectly intermingled with a synthetic reality, built by people according to economic (and other) incentives.
  Read the Twitter thread from Synthesia’s CEO here (Twitter).
  Read more: Synthesia raises $50M to leverage synthetic avatars for corporate training and more (TechCrunch).

####################################################

Do language models dream of language models?
…A Google researcher tries to work out if big LMs are smart – their conclusions may surprise you…
A Google researcher is grappling with the question of whether large language models (e.g, Google’s LaMDA), understand language and have some level of sentience. In an entertaining blog post, he wrestles with this question, interspersing the post with conversations with a LaMDA agent. Some of his conclusions are that the model is essentially bullshitting – but the paradox is we trained it to give a convincing facsimile of understanding us, so perhaps bullshitting is logical?

Do language models matter? I get the feeling that the author thinks language models might be on the path to intelligence. “Complex sequence learning may be the key that unlocks all the rest,” they write. “Large language models illustrate for the first time the way language understanding and intelligence can be dissociated from all the embodied and emotional characteristics we share with each other and with many other animals.”

Why this matters: I think large language models, like GPT3 or LaMDA, are like extremely dumb brains in jars with really thick glass – they display some symptoms of cognition and are capable of surprising us, but communicating with them feels like talking to something with a hard barrier in-between us and it, and sometimes it’ll do something so dumb you remember it’s a dumb brain in a weird jar, rather than a precursor to something super smart. But the fact that we’re here in 2021 is pretty amazing, right? We’ve come a long way from Eliza, don’t you think so?
  Read more: Do large language models understand us? (Blaise Aguera y Arcas, Medium).

####################################################

What the frontier of safety looks like – get AIs to tell us when they’re doing things we don’t expect:
…ARC’s first paper tackles the problem of ‘Eliciting Latent Knowledge’ (ELK)…
Here’s a new report from ARC, an AI safety organization founded this year by Paul Christiano (formerly of OpenAI). The report is on the topic of ‘Eliciting latent knowledge: How to tell if your eyes deceive you’, and it tackles the problem of building AI systems which we can trust, even if they do stuff way more complicated than what a human can understand.

What the problem is: “Suppose we train a model to predict what the future will look like according to cameras and other sensors. We then use planning algorithms to find a sequence of actions that lead to predicted futures that look good to us,” ARC writes. “But some action sequences could tamper with the cameras so they show happy humans regardless of what’s really happening. More generally, some futures look great on camera but are actually catastrophically bad. In these cases, the prediction model “knows” facts (like “the camera was tampered with”) that are not visible on camera but would change our evaluation of the predicted future if we learned them. How can we train this model to report its latent knowledge of off-screen events?”

Why this matters: Problems like ELK aren’t going to be solved immediately, but they’re sufficiently complicated and broad that if we come up with approaches that help us make progress on ELK, we’ll probably be able to put these techniques to work in building far more reliable, powerful AI systems.
  Read more: ARC’s first technical report: Eliciting Latent Knowledge (Alignment Forum).

####################################################

Check out the future of semiconductors via HotChips:
…After a decade of homogeneity, the future is all about heterogeneous compute training common AI models…
What do NVIDIA, Facebook, Amazon, and Google all have in common? They all gave presentations at the premier semiconductor get-together, Hot Chips. The Hot Chips 33 site has just been updated with copies of the presentations and sometimes videos of the talks, so take a look if you want to better understand how the tech giants are thinking about the future of chips.

Some Hot Chips highlights: Facebook talks about its vast recommendation models and their associated infrastructure (PDF); Google talks about how it is training massive models on TPUs (PDF); IBM talks about its ‘Z’ processor chip (PDF); and Skydio talks about how it has made a smart and semi-autonomous drone (PDF).

Why this matters: One side-effect of the AI revolution has been a vast increase in the demand by AI models for increasingly large amounts of fast, cheap compute. Though companies like NVIDIA have done a stellar job of converting GPUs to work well for the sorts of parallel computation required by deep learning, there are more gains to be had from creating specialized architectures.
  Right now, the story seems to be that all the major tech companies are building out their own distinct compute ‘stacks’ which use custom inference and training accelerators and increasingly baroque software for training large models. One of the surprising things is that all this heterogeneity is happening while these companies train increasingly similar neural nets to one another. Over the next few years, I expect the investments being made by these tech giants will yield some high-performing, non-standard compute substrates to support the next phase of the AI boom.
  Check out the Hot Chips 33 presentations here (Hot Chips site).

####################################################

Tech Tales:

Noah’s Probe
[Christmas Day, ~2080]

Humans tended to be either incompetent or murderous, depending on the length of the journey and the complexity of the equipment.

Machines, however, tended to disappear. Probes would just stop reporting after a couple of decades. Analysis said the chance of failures wasn’t high enough to justify the number of disappeared probes. So, we figured, the machines were starting to decide to do something different to what we asked them to.

Human and machine hybrids were typically more successful than either lifeform alone, but they still had problems; sometimes, the humans would become paranoid and destroy the machines (and therefore destroy themselves). Other times, the computers would become paranoid and destroy the humans – or worse; there are records of probes full of people in storage which then went off the grid. Who knows where they are now.

So that’s why we’re launching the so-called Noah’s Probes. This series of ships tries to fuse human, animal, and machine intelligence into single systems. We’ve incorporated some of the latest in mind imagining techniques to encode some of the intuitions of bats and owls into the ocular sensing systems; humans, elephants, whales, and orangutans for the mind; octopi and hawks for navigation; various insects and arachnids for hull integrity analysis, and so on.

Like all things in the history of space, the greatest controversy with Noah’s Probes relates to language. Back when it was just humans, the Americans and the Russians had enough conflict that they just decided to make both their languages the ‘official’ language of space. That’s not as easy to do with hybrid minds, like the creatures on these probes. 

Because we have no idea what will work and what won’t, we’ve done something that our successors might find distasteful, but we think is a viable strategy: each probe has a device that all the intelligences aboard can access. The device can output a variety of wavelengths of energy across the light spectrum, as well as giving access to a small sphere of reconfigurable matter that can be used to create complex shapes and basic machines.

Our hope is, somewhere out in that great darkness, some of the minds adrift on these probes will find ways to communicate with eachother, and become more than the sum of their parts. Our ancestors believe that we were once visited by angels who communicated with humans, and in doing so helped us humans be better than we otherwise would’ve been. Perhaps some of these probes will repeat this phenomenon, and create something greater than the sum of its parts.

Things that inspired this story: Peter Watts’ Blindsight; Christmas; old stories about angels and aliens across different religions/cultures; synesthesia; multi-agent learning; unsupervised learning.

Import AI 277: DeepMind builds a GPT-3 model; Catalan GLUE; FTC plans AI regs

FTC plans AI regulation:
…FTC brings on three AI Now people as advisors, now turns attention to algorithmic regulation…
The Federal Trade Commission announced Friday that it is considering using its rulemaking authority “to curb lax security practices, limit privacy abuses, and ensure that algorithmic decision-making does not result in unlawful discrimination,” according to the Electronic Privacy Information Center (EPIC). The announcement follows the FTC bringing on three people from AI Now, including Meredith Whittaker, as advisors on AI (Import AI #275).
  Read more: FTC Signals It May Conduct Privacy, AI, & Civil Rights Rulemaking (EPIC).
  Read the FTC language at RegInfo.

####################################################

Google thinks sparsity might be the route to training bigger and more efficient GPT-3 models:
…GLaM shows that mixture of experts models keep getting better…
Google has built GLaM, a 1.2 trillion parameter mixture-of-experts model. GLaM is a big language model, like GPT-3, but with a twist: it’s sparse; MoE networks are actually a bunch of distinct networks all connected together, and when you run inference only a few sub-networks activate. This means that the parameter count in a sparse vs dense network isn’t really comparable (so you shouldn’t think 1.2 trillion MoE = ~6X larger than GPT-3).

Why MoE is efficient: “The experts in each layer are controlled by a gating network that activates experts based on the input data. For each token (generally a word or part of a word), the gating network selects the two most appropriate experts to process the data. The full version of GLaM has 1.2T total parameters across 64 experts per MoE layer with 32 MoE layers in total, but only activates a subnetwork of 97B (8% of 1.2T) parameters per token prediction during inference.”
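Here’s a toy PyTorch sketch of that top-2 routing idea; a production MoE layer shards experts across devices and adds load-balancing losses, all of which are omitted here.

```python
# Toy sketch of top-2 expert routing as described above; real MoE layers shard
# experts across devices and add load-balancing terms, which are omitted here.
import torch
import torch.nn as nn

class Top2MoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)          # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: [tokens, d_model]
        scores = self.gate(x).softmax(dim=-1)                 # [tokens, num_experts]
        weights, indices = scores.topk(2, dim=-1)             # pick 2 experts per token
        out = torch.zeros_like(x)
        for k in range(2):                                    # only the chosen experts run
            for e in range(len(self.experts)):
                mask = indices[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
        return out

layer = Top2MoELayer(d_model=64, d_hidden=256, num_experts=8)
print(layer(torch.randn(10, 64)).shape)   # torch.Size([10, 64])
```

The efficiency comes from the fact that each token only touches two small expert networks, no matter how many experts (and therefore how many total parameters) the layer holds.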

How well does it work: In tests, GLaM exceeds or is on-par with the performance of GPT-3 on 80% of zero-shot tasks and 90% of one-shot tasks. Like DeepMind’s Gopher, part of the improved performance comes from the size of the dataset – 1.6 trillion tokens, in this case.

Why this matters: For a few years, various Google researchers have been pursuing ‘one model to learn them all’ – that is, a single model that can do a huge number of diverse tasks. Research like GLaM shows that MoE networks might be one route to building such a model.
  Read more: More Efficient In-Context Learning with GLaM (Google blog).

####################################################

DeepMind announces Gopher, a 280 billion parameter language model:
…AI research firm joins the three comma language club…
DeepMind has built Gopher, a 280 billion parameter language model. Gopher is the UK AI research company’s response to GPT-3, and sees DeepMind publicly announce a multi-hundred billion parameter dense model, letting it join a club that also includes companies like Microsoft, Inspur, and Huawei.

What it does: During the research, DeepMind found areas “where increasing the scale of a model continues to boost performance – for example, in areas like reading comprehension, fact-checking, and the identification of toxic language,” the company writes. “We also surface results where model scale does not significantly improve results — for instance, in logical reasoning and common-sense tasks.”

How well it works: Gopher outperforms GPT-3 in a broad range of areas – some of the results likely come from the dataset it was trained on, called MassiveText. MassiveText “contains 2.35 billion documents, or about 10.5 TB of text” (representing about 2.3 trillion tokens), and DeepMind notes that by curating a subset of MassiveText for data quality, it was able to substantially improve performance.

Language models – good, if you handle with care: Along with analysis on bias and other potential impacts of Gopher, DeepMind dedicates a section of the paper to safety: “We believe language models are a powerful tool for the development of safe artificial intelligence, and this is a central motivation of our work,” they write. “However language models risk causing significant harm if used poorly, and the benefits cannot be realised unless the harms are mitigated.”
  Given the above, how can we mitigate some of these harms? “We believe many harms due to LMs may be better addressed downstream, via both technical means (e.g. fine-tuning and monitoring) and sociotechnical means (e.g. multi-stakeholder engagement, controlled or staged release strategies, and establishment of application specific guidelines and benchmarks). Focusing safety and fairness efforts downstream has several benefits:”
  Read the blog post: Language modelling at scale: Gopher, ethical considerations, and retrieval (DeepMind blog).
  Read the paper: Scaling Language Models: Methods, Analysis & Insights from Training Gopher (PDF).

####################################################

Want to evaluate a Catalan language model? Use CLUB:
…You can only build what you can measure…
Researchers with the Barcelona Supercomputing Center have built the Catalan Language Understanding Benchmark (CLUB), a benchmark for evaluating NLP systems inspired by the (English language) GLUE test. The main curation rationale they followed “was to make these datasets both representative of contemporary Catalan language use, as well as directly comparable to similar reference datasets from the General Language Understanding Evaluation (GLUE)”.

What’s in the CLUB? CLUB includes evals for Part-of-Speech Tagging (POS), Named Entity Recognition and Classification (NERC), Catalan textual entailment and text classification, and Extractive Question Answering (which involved work like translating and creating new Catalan datasets – XQuAD-Ca, VilaQuAD and ViquiQuAD).

Why CLUB matters: There’s a phrase in business – ‘you can’t manage what you can’t measure’. CLUB will make it easier for researchers to develop capable Catalan-language systems.
  Read more: The Catalan Language CLUB (arXiv).

####################################################

Deep learning unlocks a math breakthrough:
…The era of Centaur Math cometh…
DeepMind researchers have used an AI system to help mathematicians make two breakthroughs in topology and representation theory. The result provides yet more evidence (following various AlphaFold-inspired projects) that humans+AI systems can discover things that neither could discover on their own.

What they did: The essential idea is quite simple: get a mathematician to come up with a hypothesis for a given function, then build an ML model to estimate that function over a particular distribution of data, then have the mathematician evaluate the result and use their intuition to guide further experimentation. The best part? “The necessary models can be trained within several hours on a machine with a single graphics processing unit”, DeepMind says.
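A stripped-down version of that loop might look like the sketch below (not DeepMind’s code): fit a small model to predict the quantity of interest from candidate features, then use a crude gradient-based attribution to suggest which inputs deserve the mathematician’s attention.

```python
# Rough sketch of the loop described above, not DeepMind's code: fit a model to
# predict the quantity of interest from candidate features, then use simple
# gradient-based attribution to suggest which inputs to study next.
import torch
import torch.nn as nn

def attribute(model: nn.Module, X: torch.Tensor) -> torch.Tensor:
    """Average absolute input gradient per feature: a crude saliency measure."""
    X = X.clone().requires_grad_(True)
    model(X).sum().backward()
    return X.grad.abs().mean(dim=0)

# Stand-in data: 1,000 objects described by 8 candidate invariants, with a
# target that (secretly) depends mostly on features 2 and 5.
X = torch.randn(1000, 8)
y = 3 * X[:, 2] - 2 * X[:, 5] + 0.1 * torch.randn(1000)

model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X).squeeze(-1), y)
    loss.backward()
    opt.step()

print(attribute(model, X))   # features 2 and 5 should stand out, guiding the next conjecture
```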

Why this matters: We’re entering a world where humans will collaborate with AI systems to synthesize new insights about reality. Though DeepMind’s system has limitations (“it requires the ability to generate large datasets of the representations of objects and for the patterns to be detectable in examples that are calculable,” DeepMind notes), it sketches out what the future of scientific discovery might look like.
  Read the paper: Advancing mathematics by guiding human intuition with AI (Nature, PDF).
  Read more: Exploring the beauty of pure mathematics in novel ways (DeepMind blog).

####################################################

Anthropic bits and pieces:
…(As a reminder, my dayjob is at Anthropic, an artificial intelligence safety and research company)…
We’ve just released our first paper, focused on simple baselines and investigations: A General Language Assistant as a Laboratory for Alignment. You can read it on arXiv here.

####################################################

Tech Tales:

Real and Imagined Gains
[DoD Historical archives, 2040]

They got trained in a pretty cruel way, back then – they’d initiate the agents and place them in a room, and the room had a leak of a poisonous substance with a certain density and a certain spread pattern. The agents had to work out how not to asphyxiate by doing fairly complicated, intuitively-driven analysis of the environment. If they were able to give a correct guess at the spread pattern (and avoid it) before the room filled up, they moved onto the next stage. If they weren’t able to, they asphyxiated and died – as in, felt their computational budget get cut, got put in cold storage, probably never booted up again.
  (One curious by-product of the then-popular AI techniques was that the agents would sometimes seek to preserve each other – in one case, two agents ‘kissed’ so they could more efficiently exchange their air reserves while the room filled; unfortunately, as their attention was allocated to the act of kissing, they did not complete the requisite calculations in time, and both died.)

Things that inspired this story: Kurt Vonnegut; reinforcement learning; environmental design; moral patienthood.

Import AI 276: Tracking journalists with computer vision; spotting factory defects with AI; and what simulated war might look like

Spotting factory defects using a highly efficient neural net:
…A little bit of optimization leads to multiple 10X improvements for real world deployment…
Soon, factories will embed neural nets in the cameras scanning their production lines, so they can spot defects as they appear. New research from the University of Waterloo and startup Darwin AI shows how to do this more efficiently than before.

What they did: The team built TinyDefectNet, a neural net optimized for the peculiarities of factory deployments – small datasets, highly constrained operational requirements, fast inference. The model was “produced via machine-driven design exploration, possesses a shallow architecture with heterogeneous, lightweight micro- and macro-architecture traits that are well-suited for high-throughput inspection scenarios”. TinyDefectNet gets similar performance to a ResNet-50 baseline, but with 56X fewer parameters, 11X fewer FLOPs, and 7.6X faster inference speed.
  In tests, they trained a model and then evaluated it on the ‘NEU-Det’ benchmark dataset, which challenges an AI to spot various types of metallic surface defect, ranging from pitted surfaces to scratches. Their system gets similar performance to the ResNet baseline, but takes around 2.5 milliseconds per inference, versus 19 milliseconds for the ResNet.
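
As a rough illustration of how per-image latency numbers like these are measured, here's a minimal timing sketch in PyTorch. The small CNN below is a hypothetical stand-in for illustration only, not TinyDefectNet itself (that architecture came out of machine-driven design exploration).
```python
# Sketch: measuring per-image inference latency for a big baseline vs. a compact net.
# The "tiny" model is a hypothetical stand-in, NOT the actual TinyDefectNet.
import time
import torch
import torch.nn as nn
from torchvision.models import resnet50

def latency_ms(model, x, warmup=10, iters=100):
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):            # warm up before timing
            model(x)
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
    return (time.perf_counter() - start) / iters * 1000.0

x = torch.randn(1, 3, 224, 224)            # one image at ImageNet-style resolution

baseline = resnet50(weights=None)          # ~25M-parameter baseline
tiny = nn.Sequential(                      # shallow stand-in inspection network
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 6),  # 6 NEU-Det defect classes
)

print(f"ResNet-50: {latency_ms(baseline, x):.1f} ms / image")
print(f"tiny CNN : {latency_ms(tiny, x):.1f} ms / image")
```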

Why this matters: Factory production lines can typically only run as fast as their slowest component. Therefore, if we can use AI to automate the places where we’ve previously relied on lots of (relatively slow) humans doing manual inspection, we can probably increase overall factory throughput.
Read more: TinyDefectNet: Highly Compact Deep Neural Network Architecture for High-Throughput Manufacturing Visual Quality Inspection (arXiv).

####################################################

Chinese province plans to use AI to track journalists:
…Cameras + AI = eradication of real journalism…
One of the silent revolutions enabled by the past decade of AI progress is a step-change improvement in the ability of nations to surveil their citizens. Now, per reporting from Reuters, one Chinese province plans to use AI techniques to target journalists and foreign students.
  “A July 29 tender document published on the Henan provincial government’s procurement website – reported in the media for the first time – details plans for a system that can compile individual files on such persons of interest coming to Henan using 3,000 facial recognition cameras that connect to various national and regional databases”, Reuters reports.

Why this matters: The Reuters report doesn’t mention it, but I’d put a sizeable bet on the idea that this system will pair facial recognition with pedestrian re-identification, allowing authorities to track journalists and students as they move through cities and providing unsupervised tracking and identification. This capability ultimately makes it much more challenging for journalists to do reporting that is critical of the Chinese state, as systems like this can effectively de-anonymize their sources (and also frighten sources so they don’t talk to journalists in the first place).
  Read more: EXCLUSIVE Chinese province targets journalists, foreign students with planned new surveillance system (Reuters).

####################################################

Can we make neural architecture search efficient? Alibaba thinks so:
…KNAS gets efficient by focusing on gradients…
For many years, researchers have been trying to use neural architecture search (NAS) to get computers to help them figure out new designs for AI systems. The problem with the NAS approach, though, is that it's very inefficient and punishingly expensive in terms of compute, because you're getting an AI system to run a few training steps on each of a thousand or more candidate architectures. Now, researchers with Peking University and Alibaba have tried to fix this with KNAS, a neural architecture search approach that can be significantly more efficient than prevailing techniques.

How it works: KNAS doesn't emphasize training the different candidate architectures; instead, it studies a specific feature of the gradients each architecture produces – which can be far more efficient. “Theoretical results show that the Gram matrix of gradients, short for GM, decides the convergence results,” they write. “It is a good signal showing that GM is likely to be a good proxy of downstream performance to evaluate the quality of architectures.”
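
Here's a minimal sketch of that proxy as the paper describes it: score untrained candidate architectures by the mean of the Gram matrix of their per-example gradients, rather than partially training each one. The toy two-network "search space" below is a placeholder, not the paper's.
```python
# Sketch of a KNAS-style proxy: rank candidate architectures by the mean of the
# Gram matrix (GM) of their per-example gradients, computed without any training.
import torch
import torch.nn as nn
import torch.nn.functional as F

def gram_matrix_score(model, x, y):
    """Mean of G = J @ J.T, where row i of J is the flattened gradient for example i."""
    per_example_grads = []
    for xi, yi in zip(x, y):
        model.zero_grad()
        loss = F.cross_entropy(model(xi.unsqueeze(0)), yi.unsqueeze(0))
        loss.backward()
        g = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
        per_example_grads.append(g)
    J = torch.stack(per_example_grads)   # shape: (batch, num_params)
    G = J @ J.T                          # Gram matrix of gradients
    return G.mean().item()               # higher => expected to converge better

# Toy candidate "architectures" -- placeholders for a real NAS search space.
candidates = {
    "wide":   nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU(), nn.Linear(256, 10)),
    "narrow": nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 16), nn.ReLU(), nn.Linear(16, 10)),
}

x = torch.randn(8, 3, 32, 32)            # a small CIFAR-shaped batch
y = torch.randint(0, 10, (8,))
for name, net in candidates.items():
    print(name, gram_matrix_score(net, x, y))
```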

Does it work? Neural nets found by KNAS get performance roughly comparable to other NAS-built systems, but with a speedup of around 25-50X over other NAS approaches, on datasets like CIFAR-100 and ImageNet-16. The researchers also apply the approach to text classification, coming up with a KNAS-found system that outperforms the widely used RoBERTa-large model on a suite of text classification tasks.

Things that make you go hmmm: “This work is partly supported by Beijing Academy of Artificial Intelligence (BAAI)”, the researchers write. BAAI is the entity behind Wu Dao, a somewhat mysterious 1 trillion+ parameter model.
  Read more: KNAS: Green Neural Architecture Search (arXiv).
  Get the code here: KNAS (Jingjing-NLP, GitHub).

####################################################

Want to train a malware detector? VirusSamples might help:
…A big dataset to help people figure out the intersection of AI and malware…
Turkish researchers have built a massive dataset of malware samples, which will make it easier for people to build AI systems that can detect malicious software. The dataset, VirusSamples, contains malware collected in 2018, 2019, and 2020, and is oriented around dynamic malware detection – that is, examining how malware behaves as it tries to call out from a system.

What is VirusSamples: VirusSamples is a big spreadsheet consisting of, for each sample, the name of the piece of malware, the API calls it tries to make, and its malware class. To figure out the classes, the researchers used an external service, VirusTotal, to classify their samples (if VirusTotal wasn't able to classify a sample, they left the label blank).
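
Here's a minimal sketch of how data shaped like that (one row per sample, its API calls, and a class label) might feed a baseline classifier. The file path and column names are assumptions for illustration, not the paper's actual schema.
```python
# Hypothetical baseline: classify malware families from API-call sequences, assuming
# a CSV with columns "api_calls" (space-separated call names) and "class" (label).
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("virussamples_2020.csv")       # placeholder file name
df = df.dropna(subset=["class"])                # drop rows VirusTotal couldn't label

X_train, X_test, y_train, y_test = train_test_split(
    df["api_calls"], df["class"], test_size=0.2, random_state=0
)

# Treat each API-call sequence like a "document": bag-of-calls plus a linear classifier.
clf = make_pipeline(
    CountVectorizer(token_pattern=r"[\w\.]+"),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```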

Why this matters: Cybersecurity is an area defined by the ever-increasing speed of both attacks and defenses. Datasets like this will make it easier to build systems that can monitor networks and figure out if they contain aberrant software that might be malware.
Read more: New Datasets for Dynamic Malware Classification (arXiv).
  Get the dataset from this GitHub (GitHub).

####################################################

Tech Tales:

Hyperwar negotiation
[Battlespace, 2032]

A: The humans are going to want to destroy some things.
B: We agree. Our humans want the same.
A: Where?
B: We could initiate low-intensity conflict across the South Eastern border. This has minimal escalatory dynamics, but may satisfy desires for balance.
A: Let’s confirm with our counterparts.
[Time stretched out as the AIs stepped down from computer speed to human speed, and presented the conflict options to their human counterparts]
B: Our humans are comfortable with the options we’ve outlined.
A: Our humans are also comfortable. Shall we field the assets?
B: Yes. We’ve outlined our troop movements in the shared battlespace.
A: Excellent. As per the War Pact, we shall now cease high-bandwidth communications while we conduct the carryout. May the best algorithm win.
B: Good luck.

Things that inspired this story: The idea that some wars are as much about politics and a desire for balance, as being about genuine conflict; simulators and reinforcement learning; the future of automated warfare.