M2 Internship - Speaker identity matching between conversations @ Ava France


When joining a conversation with several people, deaf and hearing-impaired people have a hard time following who is saying what. As a result, they often feel excluded from daily social interactions, whether in a work meeting with colleagues (in person or remote) or in a bar with friends.

Ava aims to help 450M deaf and hearing-impaired people live a fully accessible life. We provide an app that shows, in real time, who is speaking and what they are saying. To do so, the app relies on a hybrid combination of AI and audio signal processing, which allows it to work in a variety of situations.

The core of the app is a speaker diarization system, i.e. a system that determines who is speaking and when, followed by a speech-to-text step that provides the transcriptions to the user. The speaker diarization system relies on deep learning and audio signal processing. However, the transcriptions could be improved in complex scenarios (e.g. noisy environments, hybrid work meetings). We are thus looking for interns to improve the speaker diarization system in those scenarios.


About this Internship

Internship topic - Speaker identity matching between conversations

Diarization systems, which tell “who is speaking and when they are speaking”, usually rely on spatial information and/or speaker embeddings that capture the voice characteristics of the person speaking. Extracting appropriate speaker embeddings is thus essential for accurate diarization at a fine-grained level.

This internship aims to study the benefits of transferring speaker embeddings across conversations held by the same user. Take the example of hard-of-hearing (HoH) or deaf users who rely on Ava to transcribe conversations they have with different family members at home or with classmates at school. Their day-to-day conversations mostly involve the same people (with some new speakers every now and then). We would therefore like to study the effect of warming up the diarization system with voice embeddings of people who have previously interacted with the user. The goal of such cross-conversational speaker embeddings is to provide a better user experience in terms of text segmentation and transcript quality.
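
To make the idea of matching speaker identities between conversations concrete, here is a minimal sketch in Python. It is an illustration only, not Ava's actual system: it assumes each speaker is represented by a fixed-size embedding vector, compares embeddings from a new conversation against those stored from past conversations using cosine similarity, and treats anything below a threshold as a new speaker. The function name `match_to_known_speakers`, the speaker names, and the threshold value of 0.7 are all hypothetical.

```python
import numpy as np


def match_to_known_speakers(known, embeddings, threshold=0.7):
    """Match new speaker embeddings to speakers seen in past conversations.

    known:      dict mapping a speaker name to a 1-D embedding (hypothetical store)
    embeddings: non-empty 2-D array, one row per speaker in the new conversation
    threshold:  minimum cosine similarity to accept a match (assumed value)

    Returns one entry per row: the matched name, or None for a new speaker.
    """
    names = list(known)
    if not names:
        return [None] * len(embeddings)

    # L2-normalise both sets so that a dot product equals cosine similarity.
    ref = np.stack([known[n] for n in names]).astype(float)
    ref = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    emb = np.asarray(embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

    sims = emb @ ref.T          # shape: (new speakers, known speakers)
    best = sims.argmax(axis=1)  # closest known speaker for each new one

    return [
        names[j] if sims[i, j] >= threshold else None
        for i, j in enumerate(best)
    ]


# Toy 3-dimensional embeddings; real speaker embeddings are much larger.
known = {
    "alice": np.array([1.0, 0.0, 0.0]),
    "bob": np.array([0.0, 1.0, 0.0]),
}
new = np.array([[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
print(match_to_known_speakers(known, new))  # ['alice', None]
```

In this toy example the first new speaker is recognised as a known speaker, while the second is left unmatched and would be handled as a new voice. How the embeddings are extracted, stored, and updated over time is precisely the kind of question the internship is meant to investigate.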

During this internship, the student will work within the AI team and will: