Location: Toulouse, Remote
Job type: Internship
Work setup: On-Site, Hybrid, Remote
Start: ASAP
Length: 4-6 months
pyannoteAI is pioneering Speaker Intelligence AI, transforming how AI processes and understands spoken language. Our speaker diarization technology distinguishes speakers with unmatched precision, regardless of the spoken language, making AI understand not just what is said, but who said it and when.
Founded by voice AI experts with 10+ years in the industry (ex-CNRS research scientists), we've built the 9th most downloaded open-source model on HuggingFace with 52 million monthly downloads and over 140,000 users worldwide. After raising €8M from leading international VCs (Crane Venture Partners, Serena, and angels from HuggingFace and OpenAI), we're now scaling our enterprise platform.
From meeting transcription and call center analytics to video dubbing and voice agents, pyannoteAI powers the next generation of voice-enabled applications across industries that depend on understanding who speaks and when.
Our speaker diarization models have reached a point where they often outperform human annotators: many apparent "errors" turn out to be mistakes in the ground-truth labels themselves. This creates both a challenge and an opportunity. We need better ways to establish reliable benchmarks, while also enabling use cases that demand perfect accuracy, such as legal transcripts for trials and police depositions.
Your mission will be to design interactive machine learning approaches that achieve 100% accurate speaker diarization through human-in-the-loop systems. You will work on reducing the time and effort needed to reach perfect accuracy, while establishing new gold standards for benchmarking. This work will have a direct impact on critical downstream tasks, including speaker-attributed ASR, voice cloning, and archive search.
Concretely, you will: