Import AI 167: An aerial crowd hunting dataset; surveying people with the WiderPerson dataset; and testing out space robots for bomb disposal on earth 

by Jack Clark

Spotting people in crowds with the DLR Aerial Crowd Dataset:
…Aerial photography + AI algorithms = airborne crowd scanners…
One of the main ways we can use modern AI techniques to do helpful things in the world is through counting – whether counting goods on a production line, or the number of ships in a port, or the recurrence of the same face over a certain time period from a certain CCTV camera. A new dataset from the Remote Sensing Technology Institute at the German Aerospace Center (DLR) in Wessling, Germany, aims to make it much easier for us to teach machines to accurately count large numbers of people via overhead imagery.

The DLR Aerial Crowd Dataset: This dataset consists of 33 images captured via DSLR cameras installed on a helicopter. The images come from 16 flights over a variety of events and locations, including sports events, city center views, trade fairs, concerts, and more. Each of these images is absolutely huge, weighing in at around 3600 x 5200 pixels. There are 226,291 person annotations spread across the dataset. DLR-ACD is the first dataset of its kind, the researchers write, and they hope to use it “to promote research on aerial crowd analysis”. The majority of the images in ACD contain many thousands of people viewed from overhead, whereas most other aerial datasets involve crowds of fewer than 1,000 people, according to analysis by the researchers. 

MRCNet: The researchers also develop the Multi-Resolution Crowd Network (MRCNet), which uses an encoder-decoder structure to extract image features and then generate crowd density maps. The system uses two losses at different resolutions, which help it count the number of people in the image while also providing a coarser density map estimate.
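   For the curious, here's a minimal sketch (in PyTorch) of what a two-resolution density-map setup of this general kind could look like. The layer sizes, the loss weighting, and the use of plain bilinear resizing for the ground-truth maps are illustrative assumptions, not MRCNet's actual design:

# Illustrative sketch of a multi-resolution crowd-counting model and loss in
# the spirit of MRCNet; the architecture and weighting here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoResolutionCounter(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: downsample the image into feature maps.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Coarse head: low-resolution density map, useful for overall counting.
        self.coarse_head = nn.Conv2d(128, 1, 1)
        # Decoder + fine head: higher-resolution density map, useful for
        # localizing individual people.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.fine_head = nn.Conv2d(32, 1, 1)

    def forward(self, x):
        feats = self.encoder(x)
        coarse = self.coarse_head(feats)
        fine = self.fine_head(self.decoder(feats))
        return coarse, fine

def two_resolution_loss(coarse, fine, gt_density, coarse_weight=1.0):
    # Supervise at two resolutions: the fine map against the full ground-truth
    # density, the coarse map against a downsampled version. (A count-preserving
    # resize would rescale by the area ratio; omitted here for brevity.)
    gt_fine = F.interpolate(gt_density, size=fine.shape[-2:],
                            mode='bilinear', align_corners=False)
    gt_coarse = F.interpolate(gt_density, size=coarse.shape[-2:],
                              mode='bilinear', align_corners=False)
    return F.mse_loss(fine, gt_fine) + coarse_weight * F.mse_loss(coarse, gt_coarse)

The idea is simply that the coarse map is cheap to get right and anchors the overall count, while the fine map forces the decoder to localize individual people.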

Why this matters: As AI research yields increasingly effective surveillance capabilities, people are likely going to start asking what it means for these capabilities to diffuse widely across society. Papers like this give us a sense of activity in this domain and hint at future applied advances.
   Read more: MRCNet: Crowd Counting and Density Map Estimation in Aerial and Ground Imagery (Arxiv).
   Get the dataset from here (official DLR website).

####################################################

Once Federated Learning works, what happens to big model training?
…How might AI change when distributed model training gets efficient?…
How can technology companies train increasingly large AI systems on increasingly large datasets, without making individual people feel uneasy about their data being used in this way? That’s a problem that has catalyzed research by large companies into a range of privacy-preserving techniques for large-scale AI training. One of the most common techniques is federated learning – the principle of breaking up a big model training run so that much of the training happens on personal data held on end-user devices, with the resulting insights then aggregated into a central big blob of compute that you control. The problem with federated learning, though, is that it’s expensive, as you need to shuttle model updates back and forth between end-user devices and your giant central model. New research from the University of Michigan and Facebook outlines a technique that can reduce the training requirements of such federated learning approaches by 20-70%. 

Active Federated Learning: UMichigan/Facebook’s approach works like this: during each round of model training, the Active Federated Learning (AFL) algorithm tries to figure out how useful each user’s data is to model training, then uses that to automatically select which users it will sample from next. Another way to think about this: without the selection step, the algorithm could end up mostly trying to learn from data held by users who are irrelevant to the task being optimized for, perhaps because they don’t fit the intended use case. In tests, the researchers said that AFL could let them “train models with 20%-70% fewer iterations for the same performance” when compared to a random sampling baseline. 
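   To make the selection idea concrete, here's a small sketch of value-based user sampling in the spirit of AFL. The valuation signal (each user's local loss) and the softmax-style weighting are simplifying assumptions rather than the paper's exact method:

# Illustrative sketch of value-based client selection in the spirit of Active
# Federated Learning; the valuation signal and sampling scheme are assumptions.
import math
import random

def select_users(user_valuations, num_selected, exploration_frac=0.1, temperature=1.0):
    """Pick users for the next training round.

    user_valuations maps user_id -> how useful that user's data currently
    looks (e.g. the model's loss on their local data). A few slots are filled
    uniformly at random so users with stale or unknown valuations still get
    sampled occasionally.
    """
    users = list(user_valuations)
    num_random = max(1, int(exploration_frac * num_selected))
    num_by_value = min(num_selected - num_random, len(users))

    # Value-weighted sampling: higher valuation -> higher chance of selection.
    weights = [math.exp(user_valuations[u] / temperature) for u in users]
    chosen = set()
    while len(chosen) < num_by_value:
        chosen.add(random.choices(users, weights=weights, k=1)[0])

    # Uniform exploration among everyone not yet chosen.
    remaining = [u for u in users if u not in chosen]
    return list(chosen) + random.sample(remaining, min(num_random, len(remaining)))

# Example: valuations might be each user's local loss from the previous round.
valuations = {"user_a": 2.3, "user_b": 0.4, "user_c": 1.7, "user_d": 0.9}
print(select_users(valuations, num_selected=2))

In a real deployment the valuations would be refreshed each round from whatever signal the server can observe without collecting the raw user data.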

Why this matters: Federated learning will happen eventually: it’s inevitable, given how much computation sits on personal phones and computers, that large technology developers will eventually figure out a way to harness it. I think that one interesting side-effect of the steady maturing of federated learning technology could be the increasing viability of technical approaches for large-scale, distributed model training for pro-social uses. What might the AI equivalent of the do-it-yourself protein folding ‘Folding@home’ or alien-hunting ‘SETI@home’ systems look like?
   Read more: Active Federated Learning (Arxiv)

####################################################

Put your smart machine through its paces with DISCOMAN:
…Room navigation dataset adds more types of data to make machines that can navigate the world…
Researchers with Samsung’s AI research lab have developed DISCOMAN, a dataset to help people train and benchmark AI systems for simultaneous localization and mapping (SLAM). 

The dataset: DISCOMAN contains a bunch of realistic indoor scenes with ground truth labels for odometry, mapping, and semantic segmentation. The entire dataset consists of 200 sequences of a small simulated robot navigating a variety of simulated houses. Each sequence lasts between 3000 and 5000 frames.
   One of the main things that differentiates DISCOMAN from other datasets is the length of its generated sequences, as well as the fact that the agent can get a bunch of different types of data, including depth, stereo, and IMU sensor streams.
   Read more: DISCOMAN: Dataset of Indoor SCenes for Odometry, Mapping and Navigation (Arxiv)

####################################################

Surveying people in unprecedented detail with ‘WiderPerson’:
…Pedestrian recognition dataset aims to make it easier to train high-performance pedestrian recognition systems…
Researchers with the Chinese Academy of Sciences, the University of Southern California, the Nanjing University of Aeronautics and Astronautics, and Baidu have created the “WiderPerson” pedestrian detection dataset. 

The dataset details: WiderPerson consists of 13,382 images with 399,786 annotations and detailed bounding boxes. The researchers gathered the dataset by crawling images from search engines including Google, Bing, and Baidu. They then annotated entities in these images with one of five categories: pedestrians, riders, partially-visible persons, crowd, and ignore. On average, each image in WiderPerson contains almost 30 people. 

Generalization: Big datasets like WiderPerson are good candidates for pre-training experiments, where you run a model over this data before pointing it at a test task. Here, the researchers test this by pre-training models on WiderPerson and then evaluating them on another dataset, Caltech-USA. Pre-training on WiderPerson alone yields a reasonably good score on Caltech-USA, and systems that pre-train on WiderPerson and then finetune on Caltech-USA data can beat systems trained purely on Caltech-USA alone. They show the same phenomenon with the ‘CityPersons’ dataset, suggesting that WiderPerson could be a generally useful dataset for generic pre-training. 
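   As a toy illustration of that recipe, the sketch below pre-trains a generic off-the-shelf detector and then fine-tunes the same weights at a lower learning rate. The dummy batches stand in for real WiderPerson and Caltech-USA data loaders, and the torchvision model is a placeholder, not the architecture used in the paper:

# Minimal, self-contained sketch of the pre-train-then-finetune recipe; the
# data, model choice, and hyperparameters are illustrative assumptions.
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

def dummy_batch():
    # Stand-in for a real data loader batch: one image with one 'person' box.
    image = torch.rand(3, 480, 640)
    target = {"boxes": torch.tensor([[100.0, 120.0, 180.0, 300.0]]),
              "labels": torch.tensor([1])}
    return [image], [target]

def train(model, loader, epochs, lr):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            loss = sum(model(images, targets).values())  # detection loss dict
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

model = fasterrcnn_resnet50_fpn(num_classes=2)  # classes: background + person

# Stage 1: pre-train on the large, diverse WiderPerson data (dummy batch here).
train(model, [dummy_batch()], epochs=1, lr=0.01)

# Stage 2: fine-tune the same weights on the smaller Caltech-USA data, usually
# with a lower learning rate so the pre-trained features survive.
train(model, [dummy_batch()], epochs=1, lr=0.001)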

Why this matters: The future of surveillance and the future of AI research are closely related. Datasets like WiderPerson illustrate just how close that relationship can be.
   Read more: WiderPerson: A Diverse Dataset for Dense Pedestrian Detection in the Wild (Arxiv).
   Get the dataset from here (official WiderPerson website).

####################################################

Space robots come to earth for bomb disposal:
…Are bipedal robots good enough for bomb disposal? Let’s find out…
Can we use bipedal robots to defuse explosives? Not yet, but new research from NASA, TRACLabs, the Institute for Human and Machine Cognition, and others, which puts NASA’s Valkyrie humanoid through an IED-disposal task, suggests we’re getting closer. 

Human control: The researchers design the task so that the human operator is more of a manager, making certain decisions about where the robot should move next, or what it should turn its attention to, but not operating the robot via remote control every step of the way. 

The task: The robot is tested on how well it can navigate uneven terrain with potholes, squeeze through a narrow gap, open a car door, retrieve an IED-like object from the car, then place the IED inside a containment vessel. This task has a couple of constraints as well: the robot needs to complete it in under an hour, and must not drop the IED along the way. 

The tech…: It’s worth noting that the Valkyrie comes with a huge amount of inbuilt software and hardware capabilities – and very few of these use traditional machine learning approaches. That’s mostly because in space, debugging errors is insanely difficult, so people tend to avoid methods that don’t come with guarantees about performance.
   …is brittle: This paper is a good reminder of how difficult real world robotics can be. One problem the researchers ran into was that sometimes the cinder blocks they scattered to make an uneven surface could cause “perceptual occlusions which prevent a traversable plane or foothold from being detected”.
   …and slow: The average of the best run times for the robot is about 26 minutes, while the average across all successful runs is about 37 minutes. This highlights a problem with the Valkyrie system and approach: it relies heavily on human operators. “Even under best case scenarios, 50% of the task completion time is spent on operator pauses with the current approach,” they write. “The manipulation tasks were the most time consuming portion of the scenario”.

What do we need to do to get better robots? The paper makes a bunch of suggestions for things people could work on to create more reliable, resilient, and dependable robots. These include:

  • Improving the ROS-based software interface the humans use to operate the robot
  • Using more of the robot’s body to complete tasks, for instance by strategically bracing itself on something in the environment while retrieving the IED
  • Re-calculating robot localization in real-time
  • Making waypoint navigation more efficient
  • Generally improving the viability of the robot’s software and hardware

Why this matters: Bipedal robots are difficult to develop because they’re very complex, but they’re worth developing because our entire world is built around the assumption of the user being a somewhat intelligent biped. Research like this helps us prototype how we’ll use robots in the future, and provides a useful list of some of the main hardware and software problems that need to be overcome for robots to become more useful to society.
   Read more: Deploying the NASA Valkyrie Humanoid for IED Response: An Initial Approach and Evaluation Summary (Arxiv)

####################################################

VCs pony up $16 million for robot automation:
…Could OSARO’s robot pick&place tech be viable? These investors think so…
OSARO, a Silicon Valley AI startup that is building robots which can perform pick&place tasks on production lines, has raised $16 million in a Series B funding round. This brings the company’s total raise to around $30 million. 

What they’re investing in: OSARO has developed software which “enables industrial robots to perform diverse tasks in a wide range of environments”. It produces two main software products today: OSARO Pick, which automates pick&place work within warehouses; and OSARO Vision, which is a standalone vision system that can be plugged into other factory systems. 

Why this matters: Robotics is one of the sectors most likely to be revolutionized by recent advances in AI technology. But, as anyone who has worked with robots knows, robots are also difficult things to work with and getting stuff to work in real-world situations is a pain. Therefore, watching what happens with investments like this will give us a good indication about the maturity of the robotics<>AI market.
   Read more: OSARO Raises $16M in Series B Funding, Attracting New Venture Capital for Machine Learning Software for Industrial Automation (Business Wire, press release).
   Find out more about OSARO at their official website.

####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe brings you views on AI and AI policy; I (lightly) edit them…

A Facebook debate about AI risk:
Yann LeCun, Stuart Russell, and Yoshua Bengio have had a lively discussion about the potential risks from advanced AI. Russell’s new book Human Compatible makes the case that unaligned AGI poses an existential risk to humanity, and that urgent work is needed to ensure humans are able to retain control over our machines once they become much more powerful than us. 

LeCun argues that we would not be so stupid as to build superintelligent agents with the drive to dominate, or that weren’t aligned with our values, given how dangerous this would be. He agrees that aligning AGI with human values is important, but disputes that it is a particularly new or difficult problem, pointing out that we already have trouble with aligning super-human agents, like governments or companies.

Russell points out that the danger isn’t that we program AI with a drive to dominate (or any emotions at all), but that this drive will emerge as an instrumental goal for whatever objective we specify. He argues that we are already building systems with misspecified objectives all the time (e.g. Facebook maximizing clicks, companies maximizing profits), and that this sometimes has bad consequences (e.g. radicalization, pollution). 

Bengio explains that the potential downsides of misalignment will be much greater with AGI, since it will be so much more powerful than any human systems, and that this could leave us without any opportunity to notice or fix the misalignment before it is too late.

My take: Were humanity more sensible and coordinated, we would not be so reckless as to build something as dangerous as unaligned AGI. But as LeCun himself points out, we are not: companies and governments—who will likely be building and controlling AGI—are frequently misaligned with what we want them to do, and our desires can be poorly aligned with what is best (Jack: Note that OpenAI has published research on this topic, identifying rapid AI development as a collective action problem that demands greater coordination among developers). We cannot rule out that the technical challenge of value alignment, and the governance challenge of ensuring that AI is developed safely, are very difficult. So it is important to start working on these problems now, as Stuart Russell and others are doing, rather than leaving it until further down the line, as LeCun seems to be suggesting.
   Read more: Thread on Yann LeCun’s Facebook.
   Read more: Human Compatible by Stuart Russell (Amazon).
   Read more: The Role of Cooperation in Responsible AI Development (Arxiv).

####################################################

Tech Tales:

Full Spectrum Tilt
[London, 2024]

“Alright, get ready folks we’re dialing in”, said the operator. 

We all put on our helmets. 

“Pets?”

Here, said my colleague Sandy. 

“Houses?”

Here, said Roger. 

“Personal transit?”

Here, said Karen. 

“Phone?”

Here, said Jeff. 

“Vision?”

Here, I said. 

The calls and responses went on for a while: these days, people have a lot of different ways they can be surveilled, and for this operation we were going for a full spectrum approach.

“Okay gang, log-in!” said the operator. 

Our helmets turned on. I was the vision, so it took me a few seconds to adjust. 

Our target wore smart contacts, so I was looking through their eyes. They were walking down a crowded street and there was a woman to their left, whose hand they were holding. The target looked ahead and I saw the entrance to a subway. The woman stopped and our target closed his eyes. We kissed, I think. Then the woman walked into the subway and our target waited there a couple of seconds, then continued walking down the street. 

“Billboards, flash him,” said the operator. 

Ahead of me, I suddenly saw my face – the target’s face – appear on a city billboard. The target stopped. Stared at himself. Some other people on the street noticed, and a fraction of them looked at our target and did a double take. All these people looking at me.

Our target looked down and retrieved his phone from his pocket. 

“Hit him again,” said the operator. 

The target turned their phone on and looked into it, using their face to unlock the phone. When it unlocked, they went to open a messaging app and the phone’s front-facing camera turned on, reflecting the subject back at them. 

“What the hell,” the target said. They thumbed the phone but it didn’t respond and the screen kept showing the target. I saw them raise their other hand and manually depress the phone’s power stud. Five, four, three, two, one – and the phone turned off. 

“Phone down, location still operating,” said someone over the in-world messaging system. 

The target put their phone back in their pocket, then looked at their face on the giant billboard and turned so their back was to it, then walked back towards the subway stop. 

“Target proceeding as predicted,” said the operator.  

I watched as the target headed towards the subway and started to walk down it. 

I watched as a person stepped in front of them. 

I watched as they closed their eyes, slumping forward. 

“Target acquired,” said the operator. 

Things that inspired this story: Interceptions; the internet-of-things; predictive route-planning systems; systems of intelligence acquisition.