Import AI 150: Training a kiss detector; bias in AI, rich VS poor edition; and just how good is deep learning surveillance getting?

by Jack Clark

What happens when AI alters society as much as computers and the web have done?
…Researchers contemplate long-term trajectory of AI, and detail a lens to use to look at its evolution…
Based on how the World Wide Web and the Computing industry altered society, how might we expect the progression of artificial intelligence to influence society? That’s the question researchers with Cognizant Technology Solutions and the University of Texas at Austin try to answer in a new research paper.

The four phases of technology: According to the researchers, any technology has four defining phases – standardization; usability; consumerization; and foundationalization. For example, the ‘usability’ phase for computing was when people adopted GUI interfaces, while for the web, it was when people adopted stylesheets to separate content from presentation. By stage four (where computing is now and where the web is heading) “people do not have to care where and how it happens – they simply interact with its results, the same way we interact with a light switch or a faucet”, they write.

Lessons for the AI sector: Right now, AI as a technology is at the pre-standardization stage.

  Standardization: We need standards for how we connect AI systems together. “It should be possible to transport the functionality from one task to another,” they write, “e.g. to learn to recognize a different category of objects” across different classification infrastructures using shared systems.

  Usability: AI needs interfaces that everyone can use and access, the authors write. They then reference Microsoft’s dominance of the PC industry in the 1990s as an example of the sort of thing we want to avoid with AI, though it’s pretty unclear from the paper what they mean by usability and accessibility here.

  Consumerization: The general public will need to be able to easily create AI services. “People can routinely produce, configure, teach, and [use] such systems for different purposes and domains,” they write. “They may include intelligent assistants that manage an individual’s everyday activities, finances, and health, but also AI systems that design interiors, gardens, and clothing, maintain buildings, appliances and vehicles, and interact with other people and their AIs.”

  Foundationalization: “AI will be routinely running business operations, optimizing government policies, transportation, agriculture, and healthcare,” they write. AI will be particularly useful for directing societies to solve complex, intractable problems, they write. “For instance, we may decide to maximize productivity and growth, but at the same time minimize cost and environmental impact, and promote equal access and diversity.”

Why this matters: AI researchers are increasingly seeking to situate themselves and their research in relation to the social as well as technical phenomena of AI, and papers like this are artefacts of this process. I think this prefigures the general politicization of the AI community. I suspect that in a couple of years we may even need an additional Arxiv sub-category to contain such papers as these.
  Read more: Better Future through AI: Avoiding Pitfalls and Guiding AI Towards its Full Potential (Arxiv).

#####################################################

Deep learning + surveillance = it’s getting better all the time:
…Vehicle re-identification survey shows how significant deep learning is for automating surveillance systems…
What effect has deep learning had on vehicle surveillance? A significant one, according to a survey paper from researchers with the University of Hail in Saudi Arabia.

Sensor-based methods: In the early 90s, researchers developed sensor-based methods for identifying and re-identifying vehicles; these used things like inductive loops, as well as infrared, ultrasonic, microwave, magnetic, and piezoelectric sensors. Other methods have explored using systems like GPS, mobile phone signatures, and RFID and MAC address-based identification. People have also explored using multi-sensor systems to increase the accuracy of identifications. All of these systems had drawbacks, mostly relating to them breaking in the presence of unanticipated things, like modified or occluded vehicles.

Vision-based methods: Before deep learning, from the early 2000s onwards, people experimented with a bunch of hand-crafted feature-based methods to try to create more flexible, less sensor-dependent approaches to the task of vehicle identification. These techniques can do things like generate bounding boxes around vehicles, and even match specific vehicles between non-overlapping camera fields. But these methods also have drawbacks relating to their brittleness, and dependence on features that may change or be occluded. “The performance of appearance based approaches is limited due to different colors and shapes of vehicles”, they write.

Deep learning: Since the ImageNet breakthrough in 2012, researchers have increasingly used deep learning techniques for vision problems, including for vehicle re-identification, mostly because they’re simpler systems to implement and tend to have better generalization properties. These methods typically use convolutional neural networks, sometimes paired with an LSTM. Deep learning methods appear to consistently outperform hand-crafted feature-based methods, according to tests in which 12 deep learning-based methods were compared against 8 hand-crafted ones.
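As a rough illustration of the basic deep re-identification recipe – embed vehicle crops with a CNN, then compare embeddings – here is a minimal sketch in PyTorch. The pretrained backbone, cosine-similarity matching, and threshold are illustrative assumptions, not a specific method from the survey:

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

# Generic pretrained CNN used as an embedding network; real re-ID systems
# typically fine-tune with metric-learning objectives (e.g. triplet loss).
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classifier head, keep 2048-d features
backbone.eval()

@torch.no_grad()
def embed(vehicle_crops):
    # vehicle_crops: (N, 3, 224, 224) preprocessed crops of detected vehicles
    return F.normalize(backbone(vehicle_crops), dim=-1)

def same_vehicle(crop_a, crop_b, threshold=0.8):
    # Cosine similarity between L2-normalized embeddings; the threshold is
    # illustrative and would be tuned on a validation set in practice.
    similarity = (embed(crop_a) * embed(crop_b)).sum(dim=-1)
    return similarity > threshold
```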

The future of vehicle re-identification: Vehicles vary in appearance a lot more than humans, so it will be more difficult to train classifiers that can accurately identify all the vehicles that pass through a city on a given day. Additionally, we’ll need to build larger datasets to better model the temporal aspect of entity-tracking – this should also let us accurately re-identify vehicles across bigger time gaps between sightings.

Why this matters: The maturation of deep learning technology is irrevocably changing surveillance, improving the capabilities and scalability of a bunch of surveillance techniques, including vehicle re-identification.

  Read more: A survey of advances in vision-based vehicle re-identification (Arxiv).

#####################################################

AI Stock Image of the Week:
Thanks to Delip Rao for surfacing this delightful contribution to the burgeoning media genre.

#####################################################

Spotting intimacy in Hollywood films with a kiss detector:
…Conv and Lution sitting in a tree, K-I-S-S-I-N-G!…
Amir Ziai, a researcher at Stanford University, has built a deep learning-based kissing detector! The unbearably cute project takes in a video clip, spots all the kissing scenes in it, then splices those scenes together into an output.

Classifying Kissing: So, how do you spot kissing? Here, the system uses a multi-modal classifier with one network that detects the visual appearance of a kiss, and another network that scans the audio over the same period, extracting features out of it (architecture used: ‘VGGish’, “a very effective feature extractor for downstream Acoustic Event Detection”).
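Here's a minimal sketch of how such a two-stream classifier could be wired up, assuming precomputed visual features for each segment and VGGish-style audio embeddings; the dimensions, fusion head, and names are illustrative rather than the paper's actual architecture:

```python
import torch
import torch.nn as nn

class KissSegmentClassifier(nn.Module):
    """Toy late-fusion classifier: concatenates per-segment image and audio
    features and predicts kissing vs. not-kissing."""
    def __init__(self, img_dim=512, audio_dim=128, hidden=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 1),  # single logit for "kissing"
        )

    def forward(self, img_feats, audio_feats):
        # img_feats: (batch, img_dim) pooled visual features for a segment
        # audio_feats: (batch, audio_dim) VGGish-style embedding for the same segment
        return self.head(torch.cat([img_feats, audio_feats], dim=-1))

# Usage sketch: score a single segment with random stand-in features.
model = KissSegmentClassifier()
prob_kissing = torch.sigmoid(model(torch.randn(1, 512), torch.randn(1, 128)))
```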

   The dataset: The data for this research is a 2.3TB database of ~600 Hollywood films spanning 1915 to 2016, with files ranging in size from 200MB to 12GB. 100 of these movies have been annotated, yielding a total of 263 kissing segments and 363 non-kissing segments.

The trained ‘kiss detector’ gets an F1 score of 0.95 or so, so in a particularly salacious movie you might expect to get a few mis-hits in the output, but you’ll likely capture the majority of the moments if you run this over it.
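For a rough sense of what an F1 score of ~0.95 implies, here's a toy calculation – the precision/recall split is illustrative, since the paper only quotes the combined score:

```python
# F1 is the harmonic mean of precision and recall. If both sit around 0.95,
# roughly 1 in 20 flagged segments is a false alarm and roughly 1 in 20
# genuine kissing segments gets missed.
precision, recall = 0.95, 0.95
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.95
```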

   Why this matters: This is a good example of how modern computer vision techniques make it fairly easy to develop specific ‘sense and respond’ software, cued to qualitative/unstructured things (like the presence of kissing in a scene). I think this is one of the most under-hyped aspects of how AI is changing the scope of individual software development. I could also imagine systems like this being used for somewhat perverse/weird uses, but I’m reading this paper
  Read more: Detecting Kissing Scenes in a Database of Hollywood Films (Arxiv).

#####################################################

Bias in AI: What happens when rich countries get better models?
…Facebook research shows how biases in dataset collection and labeling lead to a rich VS poor divide…
The recent spate of research into bias in AI systems feels like finding black mold in an old apartment building – you spot a patch on the wall, look closer, and then realize that the mold is basically baked into the walls of the apartment, and if you can’t see it, it’s probably because you aren’t looking hard enough or don’t have the right equipment. Bias in AI feels a bit like that: the underlying data used to train various systems has obvious bias (like mostly containing white people, instead of a more diverse set of humans), but it also has non-obvious bias which gets discovered through testing (for example, early work on word embeddings), and the more we think about bias the more ways we find to test for it and reveal it.

Now, researchers with Facebook AI Research have shown how image datasets might have an implicit bias towards certain kinds of representations of common concepts, favoring rich countries over poor ones. The study “suggests these systems are less effective at recognizing household items that are common in non-Western countries or in low-income communities” as a consequence of subtle biases in the underlying dataset. “The absolute difference in accuracy of recognizing items in the United States compared to recognizing them in Somalia or Burkina Faso is around 15% to 20%. These findings are consistent across a range of commercial cloud services for image recognition.”

The dataset: For this study, the authors investigate the ‘Dollar Street Dataset’, which contains photos of common household goods from 135 different classes, taken in 264 homes across 54 countries.

Recognition for some, but not for all: The researchers discovered that “for all systems, the difference in accuracy for household items appearing in the lowest income bracket (less than $50 per month) is approximately 10% lower than that for household items appearing in the highest income bracket”.

To generate these results, the researchers measured the accuracy of five commercial systems and one self-developed system at categorizing objects in the dataset. The commercial systems are Microsoft Azure, Clarifai, Google Cloud Vision, Amazon Rekognition, and IBM Watson; the self-developed system is a ResNet-101 model trained against the Tencent ML Images dataset.
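Here's a minimal sketch of the kind of accuracy-by-income-bracket comparison described here, assuming a hypothetical classify() wrapper around whichever recognition service is being tested and Dollar-Street-style metadata with ground-truth labels and household income – all of the names are illustrative, not the paper's code:

```python
from collections import defaultdict

def accuracy_by_income_bracket(samples, classify):
    """Compute recognition accuracy per income bracket.

    `samples`: iterable of dicts with 'image', 'label', 'monthly_income_usd'.
    `classify(image)`: hypothetical wrapper returning a list of predicted
    labels from the recognition service under test.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for sample in samples:
        bracket = "low (<$50/mo)" if sample["monthly_income_usd"] < 50 else "higher"
        predictions = classify(sample["image"])
        correct[bracket] += int(sample["label"] in predictions)
        total[bracket] += 1
    return {bracket: correct[bracket] / total[bracket] for bracket in total}
```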

   What explains this? One factor is the underlying geographical distribution of data in image datasets like ImageNet, COCO, and OpenImages – the researchers studied these and found that, at least for some of their data, “the computer-vision dataset[s] severely undersample visual scenes in a range of geographical regions with large populations, in particular, in Africa, India, China, and South-East Asia”.

Another source of bias is the use of English as the language for data collection, which means that the data is biased towards objects with English labels or easily translatable labels – the researchers back this up with some qualitative tests where they search for a term in English then in another language on a service like Flickr and show that such searches yield quite different sets of results.

   Why this matters: Studies like this show us how dependent certain AI capabilities are on underlying data, and how bias can creep in in hard-to-anticipate ways. I think this motivates the creation of a new field of study within AI, which I guess I’d think of as “AI ablation, measurement, and assurance” – we need to think about building big empirical testing regimes to check trained systems and products against. (Think Model Cards for Model Reporting, but for everything.)
  Read more: Does Object Recognition Work for Everyone? (Arxiv).

#####################################################

Want over a billion digitized Arabic words? Check out KITAB:
…KITAB repository adds to the Open Islamicate Texts Initiative (OpenITI)…
Researchers with KITAB, a project to create digital tools and resources to help people interact with Arabic texts, have released a vast corpus of Arabic text, which may be of interest to machine learning researchers.

The Kitab dataset is a significant contribution to the Open Islamicate Texts Initiative (OpenITI), which is “a multi-institutional effort to construct the first machine-actionable scholarly corpus of premodern Islamicate texts”.

The Kitab dataset, by the numbers:

  • Authors: 1,859
  • Titles: 4,288
  • Words: 755,689,541
  • Multiple versions of same titles: 7,114
  • Total words including multiple versions of same titles: 1,520,667,360

   Things that make you go ‘hmmm’: Arabic texts may have some particular properties with regard to repetition that may make them interesting to researchers. “Arabic authors frequently made use of past works, cutting them into pieces and reconstituting them to address their own outlooks and concerns. Now you can discover relationships between these texts and also the profoundly intertextual circulatory systems in which they sit”, they write.

   Why this matters: Though the majority of the world doesn’t speak English, you wouldn’t realize this from reading AI research papers, which frequently deal predominantly in English datasets with English labels. It’ll be interesting to see how the creation of big, new datasets in other languages can help stimulate development and make the world a little smaller.
  Read more: First Open Access Release of Our Arabic Corpus (Kitab project blog).
  Find out more about OpenITI: Open Islamicate Texts Initiative official website.

#####################################################

Ctrl-C and Ctrl-V for video editing:
…Stanford researchers show how to edit what people say in videos…
Researchers with Stanford University, the Max Planck Institute for Informatics, Adobe, and Princeton University, have made it easier for people to edit footage of other people. This is part of the broader trend of AI researchers developing flexible, generative systems which can be used to synthesize, replicate, and tweak reality. One notable aspect of this research is the decision by the researchers to prominently discuss the ethical issues inherent to the research.

What they’ve done: “We propose a novel method to edit talking-head video based on its transcript to produce a realistic output video in which the dialogue of the speaker has been modified, while maintaining a seamless audio-visual flow (i.e. no jump cuts)”, they write. “Based only on text edits, it can synthesize convincing new video of a person speaking, and produce a seamless transition even at challenging cut points such as the middle of an utterance”. The resulting videos are labelled as likely to be real by people about 60% of the time.

Ethical Considerations: The paper includes a prominent discussion of the ethics of the research and development of this system, showing awareness of its omni-use nature. “The availability of such technology – at a quality that some might find indistinguishable from source material – also raises important and valid concerns about the potential for misuse”, they write. “The risks of abuse are heightened when applied to a mode of communication that is sometimes considered to be authoritative evidence of thoughts and intents. We acknowledge that bad actors might use such technologies to falsify personal statements and slander prominent individuals.”

Technical and institutional mitigations: “We believe it is critical that video synthesized using our tool clearly presents itself as synthetic,” they write. “It is important that we as a community continue to develop forensics, fingerprinting and verification techniques (digital and non-digital) to identify manipulated video.”

How it works: The video-editing tool can handle three types of edit operation: adding one or more consecutive words at a point in the video; rearranging existing words; or deleting existing words.

It works by scanning over the video and aligning it with a text transcript; it then extracts the phonemes from the footage and, in parallel, tries to identify visemes – “groups of aurally distinct phonemes that appear visually similar to one another” – which a face-tracking and neural rendering system can use to compose new utterances. “Our approach drives a 3D model by seamlessly stitching different snippets of motion tracked from the original footage. The snippets are selected based on a dynamic programming optimization that searches for sequences of sounds in the transcript that should look like the words we want to synthesize, using a novel viseme-based similarity measure.”
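As a very loose sketch of what a dynamic-programming snippet selection like this might look like, here is some illustrative code; the viseme_distance() function, the snippet representation, and the transition-penalty scheme are assumptions made for the sake of the example, not the paper's actual formulation:

```python
def select_snippets(target_visemes, source_snippets, viseme_distance, switch_cost=1.0):
    """Choose one source snippet per target viseme, trading off visual
    similarity against jumping between non-adjacent pieces of source footage.

    target_visemes: list of visemes we want to synthesize.
    source_snippets: list of dicts like {"viseme": ...} tracked from the footage.
    viseme_distance(a, b): hypothetical cost of substituting viseme b for a.
    """
    INF = float("inf")
    n, m = len(target_visemes), len(source_snippets)
    cost = [[INF] * m for _ in range(n)]   # cost[i][j]: best cost with snippet j at step i
    back = [[0] * m for _ in range(n)]     # backpointers for reconstruction

    for j, snip in enumerate(source_snippets):
        cost[0][j] = viseme_distance(target_visemes[0], snip["viseme"])

    for i in range(1, n):
        for j, snip in enumerate(source_snippets):
            match = viseme_distance(target_visemes[i], snip["viseme"])
            for k in range(m):
                # contiguous source material is free; jumping elsewhere pays a penalty
                transition = 0.0 if k == j - 1 else switch_cost
                total = cost[i - 1][k] + transition + match
                if total < cost[i][j]:
                    cost[i][j], back[i][j] = total, k

    # Backtrack the cheapest sequence of snippet indices.
    j = min(range(m), key=lambda idx: cost[n - 1][idx])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return list(reversed(path))
```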

The neural rendering system is able to generate better outputs that match the synthesized person to the background, getting around one of the contemporary stumbling blocks of existing systems. The system has some limitations, like not being able to distinguish emotions in phonemes, which could “lead to the combination of happy and sad segments in the blending”. Additionally, it requires about one hour of video to produce decent results, which seems high. Finally, if the lower face is occluded, for instance by someone moving their hand, this can cause problems for the system.

Video + Audio: In the future, such systems will likely be paired with audio-generation systems so that people can, from a very small amount of footage of a source actor, create an endless, generative talking head. “Our system could also be used to easily create instruction videos with more fine-grained content adaptation for different target audiences,” they write.

Convincing, sort of: In tests with around 2,900 subjects, people said that videos modified using the technique appeared to be ‘real’ about 60% of the time, compared to around 82% of the time for non-modified videos.

Why this matters: This research is a harbinger of things to come – a future where being able to have confidence in the veracity of the media around us will be determined by the systems surrounding the media, rather than the media itself. Though human societies have dealt with fake media before, my intuition is that the capabilities of these AI systems mean it is becoming amazingly cheap to do previously punishingly expensive things like video editing. Additionally, it’s significant to see researchers acknowledge the ethical issues inherent to their work – this kind of acknowledgement feels like a healthy pre-requisite to the cultivation of new community norms around publication.
  Read more: Text-based Editing of Talking-head Video (Arxiv).

#####################################################

AI Policy with Matthew van der Merwe:
…Matthew van der Merwe has kindly offered to write some sections about AI & Policy for Import AI. I’m (lightly) editing them. All credit to Matthew, all blame to me, etc. Feedback: jack@jack-clark.net

FBI criticized on face recognition:
The US Government Accountability Office (GAO) has released a report on the use of face recognition software by the FBI.

Privacy: The FBI has access to 641 million face photos in total. They have a proprietary database, and agreements allowing them to access databases from external partners, such as state or federal agencies. These are not limited to photos from criminal justice sources, and are also drawn from databases of driver’s licenses and visa applications, etc. GAO criticise the FBI for failing to publish two key privacy documents, designed to inform the public about the impacts of data collection programs, before rolling out face recognition.

  Accuracy: In 2016, GAO made three recommendations concerning the accuracy of face recognition systems: that the FBI assess the accuracy of searches from their proprietary database before deployment; that they conduct annual operational reviews of the database; and that they assess the accuracy of searches from external partner databases. They find the FBI have failed to respond adequately to any of these. In particular, there are no solid estimates of false positive rates, making it difficult to properly judge the accuracy of the system.

  Why it matters: There is increasing attention on the use of face recognition software by law enforcement in the US. This report suggests that the FBI have failed to implement proper measures to review accuracy, and to comply with privacy regulations. Without a thorough understanding of these systems’ accuracy, or accountability on privacy, it is difficult to weigh up the potential harms and benefits of the technology.
   Read more: Face recognition technology (US GAO).

DeepMind’s plan to make AI systems robust and reliable:
DeepMind’s Pushmeet Kohli was recently interviewed on the 80,000 Hours podcast, where he discussed the company’s approach to building robust AI, and how it relates to their broader research agenda.
  Read more: DeepMind’s plan to make AI systems robust & reliable, why it’s a core issue in AI design, and how to succeed at AI research (80,000 Hours)

#####################################################

Tech Tales:

Reality Slurping

Interview with hacker ‘Bingo Wizard’, widely credited as the inventor of ‘reality slurping’

Look I can predict the questions. Question one: what made you come up with slurping? Question two: what do you think about how people are using slurping? Question three: don’t you feel responsible for what happened? Okay.

So question one: I kept on getting ideas for things I’d want to train. Bumblebee detectors. Bird-call sensing beacons. Wind predictors. Optimal sunset-photo locations. You know: weird stuff that comes from me and what I like to do. So I guess it started with the drones. I put some software on a drone so I could kind of flip a switch and get it to record the world around it and feed that data back to a big database. I guess it took a year or so to have enough data to train the first generative sunset model. I’d hold up my phone and paint sunsets into otherwise dark nights, warping views on hills around me from moon-black to drench-red-evening. After that I started writing about it and wrote some code and stuck it online. Things took off after that.

Question two: and let me figure out what the additional question is you’d probably move to – murder-filters, fear fakes, atrocity simulators. Yeah, sure, I don’t think that stuff is good. I wouldn’t do it. I think I’d hate most people that chose to do it. But should I stop them? Maybe if I could stop every specific use or stop all the people we knew were specifically bad, but it’s a big world and it’s… it’s reality. If you build stuff that can be pointed and trained on any part of reality, then you can’t really make that tech only work for some of reality – it doesn’t work that way. So what do I think? I think people are doing more than we can imagine, and some of it’s frightening and gross or disgusting or whatever, but some of it is joy and love and fun. Who am I to judge? I just made the thing for sunsets and then it got popular.

Question three: no. Who could predict it? You think people predicted all the shit the iPhone caused? No. The world is chaos and you make things and these things are meant to change the world and they do. They do. It’s not on me that other people are also changing the world and things interact and… you know, society. It’s big but not big enough if everyone can see everyone. Learn everyone. I get it. But it’s not slurping that’s here, it’s everything around slurping. Ads. Infotainment. Unemployment. Foreign funding of the digital infrastructure. Political bias. Pressure groups. Bingo Wizard. We’re all in it all at the same time. I was trying for sunsets and now I can see them everywhere and I can turn people into birds and make sad things happy or happy things sad or whatever and, you know, I’m learning.

Things that inspired this story: the ‘maker mindset’; arguments for and against various treatments of ‘dual use’ AI technology.