M2 Internship - Online multi-channel speaker diarization @ Ava France

Introduction

When joining a conversation with several people, deaf and hearing-impaired people have a hard time following who is saying what. As a result, they feel excluded from daily social interactions most of the time, whether in a work meeting with their colleagues (in person or remote) or in a bar with their friends.

Ava aims to help 450M deaf and hearing-impaired people live a fully accessible life. We provide them with an app that shows, in real time, who is speaking and what they are saying. To do so, the app relies on a hybrid combination of AI and audio signal processing, which enables it to work in a variety of situations.

The core of the app is a speaker diarization system, i.e. a system that determines who is speaking and when, followed by a speech-to-text step that provides the transcriptions to the user. The speaker diarization system relies on deep learning and audio signal processing. However, the transcriptions could be improved in complex scenarios (e.g. noisy environments, hybrid work meetings). We are thus looking for interns to improve the speaker diarization system in those scenarios.


About this Internship


Internship topic - Online multi-channel speaker diarization

The goal of this internship is to improve an online speaker diarization system by supplying it with spatial information, which is often available thanks to the presence of several audio recording devices (e.g., phones, laptops) and/or several microphones on those devices.
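To make "spatial information" more concrete, here is a minimal sketch of one possible spatial cue: the delay with which the same sound reaches two microphones. This is only an illustration under stated assumptions, not the feature used by Ava or by the cited papers. The function name `estimate_delay`, the toy signals, and the choice of plain cross-correlation are hypothetical, and the two channels are assumed to be sample-synchronized.

```python
import math


def estimate_delay(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive lag means `other` is a delayed copy of `ref`.
    The search is a brute-force cross-correlation over [-max_lag, max_lag].
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, x in enumerate(ref):
            m = n + lag
            if 0 <= m < len(other):
                score += x * other[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Toy signals (hypothetical): a short decaying burst that reaches the
# second microphone 3 samples after the first one.
burst = [math.sin(0.9 * n) * math.exp(-0.1 * n) for n in range(40)]
mic_a = burst + [0.0] * 10
mic_b = [0.0] * 3 + burst + [0.0] * 7

delay = estimate_delay(mic_a, mic_b, max_lag=8)
print(delay)  # 3
```

Such a delay depends on where the speaker is relative to the microphones, which is why it carries spatial information.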

End-to-end multi-channel neural speaker diarization systems exist [Horiguchi2021]. However, not only do they require large amounts of training data, they are also not suitable for online processing. Moreover, these approaches can only handle a limited number of speakers. Approaches combining clustering-based and all-neural methods have therefore emerged [Coria2021]; they allow for low-delay online processing. In this internship, we aim to extend such combined approaches [Coria2021] by feeding them spatial information.

This work will be based on the hybrid (combining clustering-based and all-neural approaches) single-channel speaker diarization system by Coria et al. [Coria2021]. This approach is mainly based on two steps: an all-neural local speaker diarization [Bredin2021], followed by a global clustering.

Note that a speaker's spatial information does not appear to be a global characteristic of that speaker. Indeed, it may vary, since both the speaker and the equipment (e.g., phones, laptops) may move. However, spatial information varies slowly and smoothly over time. This is why we propose to investigate, in this internship, the use of spatial information within the all-neural local speaker diarization [Bredin2021] rather than within the global clustering.
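For intuition about the global clustering step mentioned above, the following sketch shows a basic form of online clustering: each incoming speaker embedding is either assigned to the nearest existing centroid or starts a new speaker. It is a simplified stand-in, not the algorithm of [Coria2021]. The class name `OnlineSpeakerClustering`, the threshold value, and the toy 2-D embeddings are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Cosine distance between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


class OnlineSpeakerClustering:
    """Assign embeddings to speakers one at a time (illustrative only)."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # max distance to join an existing speaker
        self.centroids = []         # one running-mean embedding per speaker
        self.counts = []            # number of embeddings seen per speaker

    def assign(self, embedding):
        """Return the speaker index for `embedding`, updating the centroids."""
        if self.centroids:
            distances = [cosine_distance(c, embedding) for c in self.centroids]
            best = min(range(len(distances)), key=distances.__getitem__)
            if distances[best] < self.threshold:
                n = self.counts[best]
                # Update the running mean of the matched speaker.
                self.centroids[best] = [
                    (n * c + e) / (n + 1)
                    for c, e in zip(self.centroids[best], embedding)
                ]
                self.counts[best] = n + 1
                return best
        # No centroid is close enough: create a new speaker.
        self.centroids.append(list(embedding))
        self.counts.append(1)
        return len(self.centroids) - 1


# Toy 2-D "embeddings" (hypothetical), one per incoming audio chunk.
clustering = OnlineSpeakerClustering(threshold=0.5)
chunks = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.0, 1.0]]
labels = [clustering.assign(e) for e in chunks]
print(labels)  # [0, 1, 0, 1]
```

Because the centroids summarize a speaker over the whole conversation, a cue that drifts as people or devices move is a poor fit for this step, which is consistent with using spatial information in the local step instead.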

References:

[Horiguchi2021] Horiguchi, Shota, Yuki Takashima, Paola Garcia, Shinji Watanabe, and Yohei Kawaguchi. "Multi-Channel End-to-End Neural Diarization with Distributed Microphones". arXiv:2110.04694 [cs, eess], 9 October 2021. http://arxiv.org/abs/2110.04694.

[Coria2021] Coria, Juan M., Hervé Bredin, Sahar Ghannay, and Sophie Rosset. "Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation". arXiv:2109.06483 [cs, eess], 14 September 2021. http://arxiv.org/abs/2109.06483.

[Bredin2021] Bredin, Hervé, and Antoine Laurent. "End-to-end speaker segmentation for overlap-aware resegmentation". arXiv:2104.04045 [cs, eess], 10 June 2021. http://arxiv.org/abs/2104.04045.