Role: Data Engineer (VoiceAI)

Location: Toulouse, Paris

Job type: Full-time

Work setup: 2-3 days remote per week

Start: ASAP

Job offer

About pyannoteAI

pyannoteAI is pioneering Speaker Intelligence AI, transforming how AI processes and understands spoken language. Our speaker diarization technology distinguishes speakers with unmatched precision, regardless of the spoken language, making AI understand not just what is said, but who said it and when.

Founded by voice AI experts with 10+ years in the industry (ex-CNRS research scientists), we've built the 9th most downloaded open-source model on HuggingFace with 52 million monthly downloads and over 140,000 users worldwide. After raising €8M from leading international VCs (Crane Venture Partners, Serena, and angels from HuggingFace and OpenAI), we're now scaling our enterprise platform.

From meeting transcription and call center analytics to video dubbing and voice agents, pyannoteAI powers the next generation of voice-enabled applications across industries that depend on understanding who speaks and when.

🧵 Your role

As a Data Engineer at pyannoteAI, you'll be embedded in our world-class research team, building the data infrastructure that powers breakthrough speaker diarization models. You'll own the entire data pipeline—from acquisition to quality assessment—supporting researchers who are training state-of-the-art models on massive audio datasets across multiple tasks: speaker diarization, separation, transcription, streaming, and tagging. Your work will take our already industry-leading models to the next level through high-quality, curated datasets.

You'll:

Own the complete data pipeline - Manage data acquisition, collection, labeling, metadata management, backup, and versioning for terabytes of audio data across 100+ languages.
Build tools for the research team - Write custom, high-performance PyTorch dataloaders and develop visualization/annotation tools for rapid quality assessment.
Manage continuous benchmarking infrastructure - Benchmark internal research models and competitors to track performance improvements and maintain our competitive edge.
Ensure audio data quality - Prepare and standardize audio data through automated processing pipelines, implement collection tools, conduct QA on purchased datasets, and analyze data statistics to optimize model performance.
Bridge research and production - Translate research needs into scalable data infrastructure that accelerates experimentation and model iteration.