M2 Internship - Online diarization-informed beamforming @ Ava France

Introduction

When joining a conversation with several people, deaf and hearing-impaired people struggle to follow who is saying what. As a result, they often feel excluded from everyday social interactions, whether in a work meeting with colleagues (in person or remote) or in a bar with friends.

Ava aims to help 450M deaf and hearing-impaired people live a fully accessible life. We provide them an app that shows, in real time, who is speaking and what they are saying. To do so, the app relies on a hybrid combination of AI and audio signal processing, which allows it to work in a variety of situations.

The core of the app is a speaker diarization system, i.e. a system that determines who is speaking and when, followed by a speech-to-text step that provides the transcriptions to the user. The speaker diarization system relies on deep learning and audio signal processing. However, the transcriptions could be improved in complex scenarios (e.g. noisy environments, hybrid work meetings). We are thus looking for interns to improve the speaker diarization system in those scenarios.
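To fix ideas, here is a minimal, hypothetical sketch of that two-stage flow. The SpeakerSegment structure and the diarize/transcribe callables are illustrative placeholders, not Ava's actual API.

```python
# Minimal sketch of the diarization -> speech-to-text flow described above.
# The segment structure and the injected callables are illustrative only.
from dataclasses import dataclass

@dataclass
class SpeakerSegment:
    speaker: str   # diarization label, e.g. "spk0"
    start: float   # segment start time in seconds
    end: float     # segment end time in seconds

def transcribe_conversation(audio, sample_rate, diarize, transcribe):
    """Run diarization, then transcribe each speaker-homogeneous segment."""
    captions = []
    for seg in diarize(audio, sample_rate):  # -> list[SpeakerSegment]
        chunk = audio[int(seg.start * sample_rate):int(seg.end * sample_rate)]
        captions.append((seg.speaker, transcribe(chunk, sample_rate)))
    return captions  # e.g. [("spk0", "hello everyone"), ("spk1", "hi!")]
```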


About this Internship


Internship topic - Online diarization-informed beamforming

The goal of this internship is to add a speech enhancement step after an online speaker diarization system, leveraging both the diarization output and the plurality of audio recording devices (e.g., phones, laptops) and/or the availability of several microphones on those devices.

Our motivation is the following. A pipeline running a diarization step followed by an automatic speech recognition (ASR) module is convenient in many situations. However, a conventional ASR system fails to correctly transcribe the target speech in the presence of multiple overlapping speakers and/or strong background noise.

Beamforming techniques, which allow multi-microphone speech enhancement, can improve ASR systems. Deep-learning-based beamforming systems exist [Heymann2016, Zhang2020]. However, these approaches require estimating speech and noise statistics with the help of a deep neural network (DNN) [Heymann2016]. This would add extra steps (i.e., DNN training and inference) to an already complex system, on top of a DNN-based diarization model (e.g., as in [Coria2021]).
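To make the statistics estimation step concrete, below is a minimal NumPy sketch of a mask-based MVDR beamformer in the spirit of [Heymann2016] (Souden formulation). In that line of work the masks come from a DNN; the function signature, variable names, and regularization here are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def mask_based_mvdr(X, speech_mask, noise_mask, ref_mic=0, eps=1e-8):
    """Mask-driven MVDR beamformer (Souden formulation), sketch only.

    X           : (F, T, C) multi-channel STFT
    speech_mask : (F, T) estimated speech presence mask in [0, 1]
    noise_mask  : (F, T) estimated noise mask in [0, 1]
    Returns     : (F, T) enhanced single-channel STFT
    """
    F, T, C = X.shape
    Y = np.zeros((F, T), dtype=X.dtype)
    for f in range(F):
        Xf = X[f]  # (T, C)
        # Mask-weighted spatial covariance matrices, shape (C, C)
        phi_s = (speech_mask[f][:, None] * Xf).T @ Xf.conj()
        phi_s /= speech_mask[f].sum() + eps
        phi_n = (noise_mask[f][:, None] * Xf).T @ Xf.conj()
        phi_n /= noise_mask[f].sum() + eps
        phi_n += eps * np.eye(C)                     # regularize inversion
        num = np.linalg.solve(phi_n, phi_s)          # Phi_n^{-1} Phi_s
        w = num[:, ref_mic] / (np.trace(num) + eps)  # MVDR weights, (C,)
        Y[f] = Xf @ w.conj()                         # beamform: y_t = w^H x_t
    return Y
```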

Recently, an informed beamforming technique called Guided Source Separation (GSS) was proposed [Boeddecker2018, Watanabe2020]. It relies on oracle diarization information to perform beamforming and thus does not need an additional DNN. In line with GSS and relying on some of its aspects, in this internship we target driving the beamforming directly from a diarization system.
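As a minimal illustration of this idea, the sketch below derives per-speaker time-frequency masks directly from diarization activities and feeds them to the mask-based MVDR sketched above. This is a simplified, hypothetical rendition: actual GSS additionally refines such activity-derived masks with a spatial mixture model [Boeddecker2018], and all names here are assumptions.

```python
# Hedged sketch of diarization-informed beamforming in the spirit of GSS:
# diarization activity replaces the DNN mask estimator, and the coarse
# frame-level masks are used directly (GSS would refine them further).
import numpy as np

def diarization_informed_beamforming(X, activities, target):
    """Enhance one speaker using diarization output as the only guide.

    X          : (F, T, C) multi-channel STFT
    activities : dict speaker -> (T,) binary activity from the diarizer
    target     : speaker label to enhance
    """
    F, T, _ = X.shape
    target_act = activities[target].astype(float)  # (T,)
    # Frames where another speaker talks, or the target is silent,
    # contribute to the "noise" statistics.
    others = np.clip(sum(a for s, a in activities.items() if s != target), 0, 1)
    noise_act = np.maximum(others, 1.0 - target_act)
    # Broadcast frame activities to full time-frequency masks, (F, T).
    speech_mask = np.tile(target_act, (F, 1))
    noise_mask = np.tile(noise_act, (F, 1))
    return mask_based_mvdr(X, speech_mask, noise_mask)
```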

References:

[Heymann2016] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based spectral mask estimation for acoustic beamforming,” in ICASSP, Mar. 2016, pp. 196–200. doi: 10.1109/ICASSP.2016.7471664.

[Zhang2020] Z. Zhang, Y. Xu, M. Yu, S.-X. Zhang, L. Chen, and D. Yu, “ADL-MVDR: All deep learning MVDR beamformer for target speech separation,” arXiv:2008.06994 [eess], Oct. 2020, Accessed: Nov. 24, 2020. [Online]. Available: http://arxiv.org/abs/2008.06994

[Boeddecker2018] C. Boeddecker, J. Heitkaemper, J. Schmalenstroeer, L. Drude, J. Heymann, and R. Haeb-Umbach, “Front-end processing for the CHiME-5 dinner party scenario,” in 5th International Workshop on Speech Processing in Everyday Environments (CHiME 2018), Sep. 2018, pp. 35–40. https://www.isca-speech.org/archive_v0/CHiME_2018/pdfs/CHiME_2018_paper_boeddecker.pdf

[Watanabe2020] S. Watanabe, M. Mandel, J. Barker, E. Vincent, A. Arora, X. Chang, S. Khudanpur, et al., “CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings,” arXiv:2004.09249 [cs, eess], May 2020. [Online]. Available: http://arxiv.org/abs/2004.09249

[Coria2021] J. M. Coria, H. Bredin, S. Ghannay, and S. Rosset, “Overlap-aware low-latency online speaker diarization based on end-to-end local segmentation,” arXiv:2109.06483 [cs, eess], Sep. 2021. [Online]. Available: http://arxiv.org/abs/2109.06483