Import AI 202: Baidu leaves PAI; ImageNet can live forever with better labels; and what a badly upscaled Obama photo tells us about data bias

Making ImageNet live forever with better labels:
…Industry-defining dataset gets new labels for a longer lifespan…
ImageNet is a big reason why the past decade was a boom time for AI – after all, it was in 2012 that a team of researchers at the University of Toronto used deep learning techniques to make significant progress on the annual ImageNet image recognition competition; their success ultimately led to the mass pivoting of the computer vision research community towards neural methods. The rest, as they say, is history.

But is ImageNet still useful, almost a decade later? That’s a question contemplated by researchers with Google Brain and DeepMind in a new research paper. Their conclusion is some form of “yes, but” – yes, ImageNet is still a useful large-scale training dataset for image systems, but its labels aren’t as good as they could be. To remedy this, the researchers develop a set of “ReaL” reassessed labels for ImageNet, which try to fix some of the labeling problems inherent to ImageNet, creating a richer dataset of labels for researchers to work with.

What’s wrong with old ImageNet labels? An old picture of a tool chest might have the label ‘hammer’, whereas the new ‘ReaL’ labels could be any of ‘screwdriver; hammer; power drill; carpenters’ kit’ (all of which are in the image). The new labels also fix some of the weird parts of ImageNet – like a picture of a bus and a passenger car, where the bus is in the foreground but the old correct label is ‘passenger car’ (whereas the new label is ‘school bus’).
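A minimal sketch of how scoring changes when an image can carry several valid labels, using the two examples above. The class IDs, image names, and function names are hypothetical, and skipping images with an empty reassessed label set is an assumption about the evaluation protocol rather than something stated here.

```python
from typing import Dict, Set

# Hypothetical class IDs, for illustration only.
HAMMER, SCREWDRIVER, POWER_DRILL, CARPENTERS_KIT, PASSENGER_CAR, SCHOOL_BUS = range(6)


def top1_accuracy(predictions: Dict[str, int], labels: Dict[str, int]) -> float:
    """Single-label scoring: the prediction must equal the one original label."""
    hits = sum(predictions[image] == label for image, label in labels.items())
    return hits / len(labels)


def real_accuracy(predictions: Dict[str, int], labels: Dict[str, Set[int]]) -> float:
    """Multi-label scoring: the prediction counts if it is in the image's label set.

    Images with an empty label set are skipped (assumption, see lead-in).
    """
    scored = {image: valid for image, valid in labels.items() if valid}
    hits = sum(predictions[image] in valid for image, valid in scored.items())
    return hits / len(scored)


# Toy data mirroring the tool chest and bus examples, plus one image with no valid label.
original_labels = {"tool_chest": HAMMER, "street": PASSENGER_CAR, "blurry": POWER_DRILL}
real_labels = {
    "tool_chest": {SCREWDRIVER, HAMMER, POWER_DRILL, CARPENTERS_KIT},
    "street": {SCHOOL_BUS},
    "blurry": set(),
}
predictions = {"tool_chest": SCREWDRIVER, "street": SCHOOL_BUS, "blurry": POWER_DRILL}

print(f"original top-1: {top1_accuracy(predictions, original_labels):.2f}")
print(f"multi-label:    {real_accuracy(predictions, real_labels):.2f}")
```

In this toy case the model is penalized under the original labels for answering "screwdriver" and "school bus", even though both objects are in the pictures; the multi-label score counts both as correct.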

Why this matters: “While ReaL labels seem to be of comparable quality to ImageNet ones for fine-grained classes, they have significantly reduced the noise in the rest, enabling further meaningful progress on this benchmark,” the authors write. Specifically, the ReaL labels make it more useful to train systems against the ImageNet dataset, because they lead to the development of vision systems with a more robust, broad set of labels. “These findings suggest that although the original set of labels may be nearing the end of their useful life, ImageNet and its ReaL labels can readily benchmark progress in visual recognition for the foreseeable future,” they write.
  Read more: Are we done with ImageNet? (arXiv).
  Get the new labels: Reassessed labels for the ILSVRC-2012 (“ImageNet”) validation set (Google Research, GitHub).

####################################################

Want to count Zebrafish? This dataset might help!
…The smart fishtank cometh…
Researchers with Aalborg University, Denmark, have built a dataset of videos tracking Zebrafish as they move around in a tank. They’re releasing the dataset and some baseline models to help people build systems that can automatically track and analyze Zebrafish.

The dataset: The dataset consists of eight sequences with a duration between 15 and 120 seconds and 1-10 free moving zebrafish. It has been hand-annotated with 86,400 points and bounding boxes. It also includes tags relating to the occlusion of fish at different points in time, which can help provide data for training systems that are able to analyze schools of fish, rather than individual ones.

Why Zebrafish? So, why bother making this? The researchers say it is because “Zebrafish is an increasingly popular animal model and behavioural analysis plays a major role in neuroscientific and biological research”, but tracking Zebrafish is a complex, tedious process. With this dataset, the researchers hope to spur the construction of robust zebrafish tracking systems which “are critically needed to conduct accurate experiments on a grand scale”.
  Read more: 3D-ZeF: A 3D Zebrafish Tracking Benchmark Dataset (arXiv).
  Get the dataset here (Multiple Object Tracking Benchmark official site).

####################################################

Facebook gets its own StreetView with Mapillary acquisition:
…Acquisition gives Facebook lots of data and lots of maps…
Facebook has acquired Mapillary, a startup that had been developing a crowdsourced database of street-level imagery. Mapillary suggests in a blog post that it’ll work with Facebook to develop better maps; “by merging our efforts, we will further improve the ways that people and machines can work with both aerial and street-level imagery to produce map data,” the company writes.

Data moats and data maps: Mapping the world is a challenge, because once you’ve mapped it, the world keeps changing. That’s why companies like Apple and Google have made tremendous investments in infrastructure to regularly map and analyze the world around them (e.g., StreetView). Mapillary may give Facebook access to more data to help it develop sophisticated, current maps. For instance, a few months ago Mapillary announced it had created a dataset of more than 1.6 million images of streets from 30 major cities across six continents (Import AI 196).
  Read more: Mapillary Joins Facebook on the Journey of Improving Maps Everywhere (Mapillary blog).

####################################################

Photo upscaling tech highlights bias concerns:
…When photo enhancement magnifies societal biases…
Last week, some researchers with Duke University published information about PULSE, a new photo upscaling technique. This system uses a neural network to upscale low-resolution, pixelated pictures into high-fidelity counterparts. Unfortunately, how good the neural net is at upscaling stuff depends on a combination of the underlying dataset it was trained on and how well tuned its loss function(s) are. Perhaps because PULSE is so good at upscaling in domains where there’s a lot of data (e.g., pictures of white people), its failures in other domains feel far worse.

Broken data means you get Broken Barack Obama: Shortly after publishing the research, some Twitter users started probing the model for biases. And they found some unfortunate stuff:
– Here is the model upscaling Barack Obama into a person with more typically caucasian features.
– Here is a Twitter thread with more examples, where the model tends to skew towards generating caucasian outputs regardless of inputs.
– Here is a more detailed exploration of the Barack Obama example from AI artist Mario Klingemann, which shows more diversity in the generations (and some fun bugs) – note this isn’t using exactly the same components as PULSE, so treat with a grain of salt.

Blame the data or blame the algorithm? In a statement published on GitHub, the PULSE creators say “this bias is likely inherited from the dataset StyleGAN was trained on, though there could be other factors that we are unaware of”. They say they’re going to talk to NVIDIA, which originally developed StyleGAN. However, AI artist Klingemann says “StyleGAN is perfectly capable of producing diverse faces, it is their algorithm that fails to capture that diversity”. Meanwhile, Yann LeCun, Facebook’s head of AI research, says the issue is solely down to data – “train the *exact* same system on a dataset from Senegal, and everyone will look African” (this tweet doesn’t discuss the issue of dataset creation – there aren’t nearly as many datasets that portray people from Senegal as there are datasets that portray people from other parts of the world).

Why this matters: Bias and how it relates to AI is a broad, hard problem in the AI research and deployment space – that’s because bias can creep in at every level, ranging from initial dataset selection, to the techniques used to train models, to the people that develop the systems. Bias is also hard because it relates to machine-readable data, which either means data people have taken the trouble to compile (e.g., the FFHQ dataset compiled by NVIDIA as part of StyleGAN), or data that has been digitized by other larger cultural forces (e.g., the inherent biases of film production lead to representational issues in film datasets). I expect that in the next few years we’ll see people concentrate on the issue of bias in AI both at the level of dataset creation and also at the level of analyzing the outputs of generative models and figuring out algorithmic tweaks to increase diversity.
  Get the code and read the statement from the PULSE GitHub repo (GitHub).
  Read the paper: PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models (arXiv).

####################################################

Uber uses 3D worlds to simulate better self-driving cars:
…Shows how to create good enough simulated self-driving car data to train cars on…
Uber wants to build self-driving cars, but self-driving cars are one of the hardest things in AI to build, so Uber needs to generate a ton of data to train them on. Like Google and other self-driving companies, Uber is currently collecting data around the world via cars tricked out with lasers (LiDAR), cameras, and other sensors. That’s all useful. Now, Uber – like other self-driving companies – is trying to figure out how to simulate data so that it has even more to train on.

Simulating data for fun and (so much!) profit: In a new research paper, researchers from Uber, the University of Toronto, and MIT say they have built “a LiDAR simulator that simulates complex scenes with many actors and produce point clouds with realistic geometry”. With this simulator, they can pair their real data with synthetic datasets that let them train neural networks to higher performance than those trained on real data alone. (Most of the technical trick here comes in layering two types of data together in the simulator – realistic worlds, and then filling them with dynamic objects, all of which are based on real LiDAR data, increasing the realism in the procedurally generated synthetic data.)

The key statistic: “with the help of simulated data, even with around 10% real data, we are able to achieve similar performance as 100% real data, with less than 1% mIOU difference, highlighting LiDARsim’s potential to reduce the cost of annotation”. Other tests show that if you pair real data with a significant amount of simulated data (in their tests, equivalent to your amount of real data), then you can obtain better scores than you can get with real data alone.
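The gap quoted above is measured in mIoU (mean intersection-over-union), a standard segmentation metric. Below is a minimal sketch of how that metric can be computed over per-point class labels; the function name and the toy labels are hypothetical and are not taken from LiDARsim or its evaluation code.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average per-class intersection-over-union for per-point class labels.

    Classes that appear in neither the predictions nor the targets are skipped.
    """
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six points, three classes, one point mislabeled by the model.
target = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 1, 2, 1]

print(f"mIoU: {mean_iou(predicted, target, num_classes=3):.3f}")
```

A "less than 1% mIoU difference" then means that this number, computed for a model trained on roughly 10% real data plus simulated data, lands within one point of the number for a model trained on all of the real data.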

Why this matters: Say it with me: Computers let us arbitrage $ for data. This is wild! Projects like this show how effective these techniques are and they suggest that large companies may be well positioned to benefit from economy of scale effects of the data they gather – because in AI, you can gather a dataset and train models on it, and also use that dataset to develop systems for generating additional data. It’s almost like someone harvesting crop from farmland, eating the crop, and in parallel cloning the crop and eating the clones as well. I think the effects of this are going to be significant in the long-term, especially with regard to competitive dynamics in the AI space.
  Read more: LiDARsim: Realistic LiDAR Simulation by Leveraging the Real World (arXiv).

####################################################

Baidu leaves Partnership on AI:
…Move heightens tensions around the international governance of AI…
Chinese search giant Baidu has left the Partnership on Artificial Intelligence, a US-based initiative that brings together industry, academia, and other stakeholders to grapple with the ethical challenges of artificial intelligence. The move was reported by Wired. It’s unclear at this time what the reasons are behind the move – some indicate it may be financial in nature, but that would be puzzling since the membership dues for PAI are relatively small and Baidu had a net income of almost $1 billion in its 2019 financial year. The move comes amid heightened tensions between the US and China over the development of artificial intelligence.

Why this matters: We already operate in a world with two distinct ‘stacks’ – the domestic Chinese internet and associated companies (with their international efforts, like TikTok) and government bodies (e.g, those that participate in standards organizations), and the stacks made of mostly American internet companies built on the global Internet system.
  With moves like Baidu leaving PAI, we’re seeing the decoupling of these tech stacks rip higher up the layers of abstraction – first the internet systems decoupled, then various Chinese companies emerged to counter/compete with the Western companies (e.g., some (imperfect) comparisons: Baidu / Google; Huawei / Cisco; Alibaba / Amazon), then we started to see hardware layers decouple (e.g., domestic chip development). Now, there are also signs that we might come apart in our ability to convene internationally – if it’s hard to have shared discussions with China at venues like the Partnership on AI, then where can those discussions take place?
  Read more: Baidu Breaks Off an AI Alliance Amid Strained US-China Ties (Wired).

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

US joins international AI panel:
The US has joined the Global Partnership on AI, the ‘IPCC for AI’ first proposed by France and Canada in 2018. The body, officially launched last week, aims to support “responsible and human-centric” AI development. GPAI will convene four working groups, focused on: responsible AI; data governance; the future of work; and innovation & commercialization. These will be made up of experts from industry, academia, government and civil society. The US is the last G7 country to join, having held out due to concerns that international rules might hamper US innovation.
  Read more: US joins G7 artificial intelligence group to counter China (AP).
  Read more: Joint Statement from founding members of the Global Partnership on Artificial Intelligence.

Europe’s cloud infrastructure project takes shape:
France and Germany have shared more details on their ambitious plan to build a European cloud computing infrastructure. The project, named GAIA-X, is intended to help establish Europe’s ‘data sovereignty’, by reducing its reliance on non-European tech companies for core infrastructure. With an initial budget of €1.5 million per year, GAIA-X is unlikely to rival big tech any time soon (Amazon’s AWS operating expenses were $26 billion last year). It is expected to launch in 2021.
  Read more: Altmaier charts Gaia-X as the beginning of a ‘European data ecosystem’ (Euractiv).
  Read more: GAIA-X: A Franco-German pitch towards a european data infrastructure – Ministerial talk and GAIA-X virtual expert forum (BMWI).

####################################################

Tech Tales:

The Future News

[Random sampling of headlines seen on a technology focused news site during the course of four months in 2025]

AMD, NVIDIA Face GPU Competition from Chinese Upstart

Life on Mars? SpaceX Mission Tries to Find Out

Doggo Robbo: Boston Dynamics Introduces Robot Companion Product

Mysterious Computer Virus Wreaks Havoc on World’s Satellites

Nuclear Content: Programmatic Ads and Synthetic Media

Drone Wars: How Drones Got “Fast, Cheap, and Out of Control”

Computer Vision Sales Rise in Authoritarian Nations, Decline in Democratic Ones – Study

Internet Connectivity Patterns See “Unprecedented Fluctuations”, Puzzling Experts

African Nations Shut Down Cellular, Internet Networks To Quell Protests

Rise in Robot Vandalisms Attributed to “Luddite Parties”

Quantum Chemistry: The Technology Behind Oxford University’s Smash Hit Spinout FeynBio

“Authentic Content Act” Sees Tech Industry Pushback

“Internet 4.0” – Why the U.S., Russian, and Indian Governments Are Building Their Own Domestic Internet Systems

Things that inspired this story: Many years working as a journalist; predictions about the evolution of computer vision; automation politics and the 21st century; CHIPlomacy; some loose projections of the future based on some existing technical trends.