M2 Internship - Active learning strategies for diarization @ Ava France


When joining a conversation with several people, deaf and hearing-impaired people have a hard time following who is saying what. As a result, they often feel excluded from everyday social interactions, whether in a work meeting with colleagues (in person or remote) or in a bar with friends.

Ava aims to help 450M deaf and hearing-impaired people live a fully accessible life. We provide them with an app that shows, in real time, who is speaking and what they are saying. To do so, the app relies on a hybrid combination of AI and audio signal processing, which allows it to work in a variety of situations.

The core of the app is a speaker diarization system, i.e. a system that determines who is speaking and when, followed by a speech-to-text step that provides the transcriptions to the user. The speaker diarization system relies on deep learning and audio signal processing. Collecting data to train such systems is very expensive and time-consuming. We are therefore looking for an intern to help us make our data annotation pipeline smarter using active learning.


About this Internship

Internship topic - Active learning

In the field of machine learning, the emphasis is often put on the algorithms. In practice, though, data collection, curation and annotation are probably just as important: a simple algorithm can achieve surprisingly good performance given appropriate training data, while the best algorithm will always perform poorly if the quality of the training data is low. Furthermore, gathering adequate annotated data is often an expensive process. Acknowledging this, the field of active learning proposes to improve data collection by choosing the right examples to annotate. If your task is to train a neural network to distinguish between mammals and birds (the most interesting task there is, right? 😁), is it better to have 10,000 examples of dogs and eagles or 1,000 examples of a varied set of mammals and birds? Is it better to include more examples of cats in the set or more examples of bats, given that the latter might be more confusing for the model?

Traditionally, active learning algorithms [1] have been tested on image recognition (see e.g. [2], [3] or [5]). However, there is evidence that the performance of certain strategies depends on the type of task (e.g. image recognition vs. speech recognition) or the type of learning algorithm (e.g. convolutional vs. recurrent neural networks). The goal of this internship is to explore the effectiveness of active learning for the task of speaker diarization. We will compare different strategies, ranging from simple uncertainty-based sampling [1] to more recent and complex ones such as expected model change [4] or discriminative active learning [5].
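To give a flavour of the simplest of these strategies, here is a minimal sketch of entropy-based uncertainty sampling, one of the variants described in [1]. It is an illustration only, not Ava's actual pipeline: the function names, the `budget` parameter and the toy posteriors (a hypothetical model's probabilities over two speakers for four unlabeled audio segments) are all assumptions made for this example.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(
    posteriors: Sequence[Sequence[float]], budget: int
) -> List[int]:
    """Return the indices of the `budget` examples with the highest entropy."""
    ranked = sorted(
        range(len(posteriors)),
        key=lambda i: entropy(posteriors[i]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical posteriors over (speaker A, speaker B) for four unlabeled segments.
pool = [
    [0.98, 0.02],  # model is confident
    [0.55, 0.45],  # model is unsure
    [0.80, 0.20],
    [0.50, 0.50],  # model is maximally unsure
]

to_annotate = select_most_uncertain(pool, budget=2)
print(to_annotate)  # [3, 1]
```

With an annotation budget of two, the segments sent to annotators are the ones the model hesitates on most, rather than the ones it already classifies confidently. More elaborate strategies such as those in [4] and [5] replace the entropy score with other selection criteria.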


[1] Settles, Burr. “Active Learning Literature Survey,” 2009. https://burrsettles.com/pub/settles.activelearning.pdf.

[2] Gal, Yarin, Riashat Islam, and Zoubin Ghahramani. “Deep Bayesian Active Learning with Image Data.” arXiv, March 8, 2017. http://arxiv.org/abs/1703.02910.

[3] Ducoffe, Melanie, and Frederic Precioso. “Adversarial Active Learning for Deep Networks: A Margin Based Approach.” arXiv, February 27, 2018. http://arxiv.org/abs/1802.09841.

[4] Huang, Jiaji, Rewon Child, Vinay Rao, Hairong Liu, Sanjeev Satheesh, and Adam Coates. “Active Learning for Speech Recognition: The Power of Gradients.” arXiv, December 9, 2016. https://doi.org/10.48550/arXiv.1612.03226.

[5] Gissin, Daniel, and Shai Shalev-Shwartz. “Discriminative Active Learning.” arXiv, July 15, 2019. http://arxiv.org/abs/1907.06347.

Expected skills: