A research toolkit for extracting speaker identity from audio using a custom-trained Transformer encoder, for voice cloning and speaker diarisation.

The model employs the three-stage deep learning framework from “Neural Voice Cloning with a Few Samples”.
- It needs a 5-second reference audio clip and a text prompt to generate cloned speech.
- Tacotron 2 uses a modified WaveNet as its vocoder.
- For the synthesiser stage, the Tacotron architecture is modified so that it can be conditioned on a voice.
- All models are to be trained separately and on distinct datasets.

1 Problem Definition

Group short audio clips (utterances) by speaker. These utterances are in the waveform domain.
- $j^{th}$ utterance of $i^{th}$ speaker is denoted as $u_{ij}$
  - Example: If Speaker 1 has 3 audio clips, they are labeled $u_{11},u_{12},u_{13}$
- $x_{ij}$ is the log spectrogram of $u_{ij}$
  - Example: If $u_{11}$ is a 5-second clip of Speaker 1 saying "Hello," $x_{11}$ is its visual representation as a 2D grid of frequency vs. time.
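
The mapping from a waveform $u_{ij}$ to its log spectrogram $x_{ij}$ can be sketched as below. This is a minimal, dependency-free illustration, not the toolkit's own feature extractor: the function name `log_spectrogram`, the frame length, the hop size, and the toy sine-wave "utterance" are all assumptions, and the mel filterbank step is left out. Real code would use an FFT library instead of the slow direct DFT shown here.

```python
import cmath
import math


def log_spectrogram(waveform, frame_len=64, hop=32, eps=1e-6):
    """Return a time x frequency grid of log-magnitudes (hypothetical helper)."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    n_bins = frame_len // 2 + 1  # keep non-negative frequencies only
    frames = []
    for start in range(0, len(waveform) - frame_len + 1, hop):
        frame = [waveform[start + n] * window[n] for n in range(frame_len)]
        row = []
        for k in range(n_bins):
            # Direct DFT of one bin: O(frame_len) per bin, for clarity only.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            row.append(math.log(abs(acc) + eps))
        frames.append(row)
    return frames


# Toy stand-in for u_11: a 1 kHz sine sampled at 8 kHz (512 samples).
sample_rate = 8000
u_11 = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
x_11 = log_spectrogram(u_11)

# Rows are time frames, columns are frequency bins.
print(len(x_11), "frames x", len(x_11[0]), "bins")
```

Each row of `x_11` is one time step and each column one frequency band, which is the "2D grid of frequency vs. time" described above.
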
<aside>

A log-mel spectrogram is a deterministic, lossy function that extracts speech features from a waveform.

</aside>


1.1 Speaker Encoder $E$

Goal: turn a short audio clip into an embedding. The encoder takes a spectrogram of the audio and produces an embedding.
- This embedding captures the speaker’s unique voice.
- The encoder converts $x_{ij}$ into an embedding $e_{ij}=E(x_{ij};w_E)$, where $w_E$ are the encoder parameters.
- The speaker embedding $c_i$ is the average of the $n$ utterance embeddings of speaker $i$:

$$ \mathbf{c}_i = \frac{1}{n} \sum_{j=1}^n \mathbf{e}_{ij} \tag{1} $$
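
Equation (1) is an element-wise mean over a speaker's utterance embeddings. A minimal sketch, assuming embeddings are plain lists of floats of equal length; the function name `speaker_embedding` and the toy 2-dimensional values are illustrative, not taken from the toolkit.

```python
def speaker_embedding(utterance_embeddings):
    """Average the utterance embeddings e_i1..e_in into c_i (equation 1)."""
    n = len(utterance_embeddings)
    if n == 0:
        raise ValueError("need at least one utterance embedding")
    dim = len(utterance_embeddings[0])
    # Mean of each embedding dimension across the n utterances.
    return [sum(e[d] for e in utterance_embeddings) / n for d in range(dim)]


# Toy embeddings for speaker 1 (three utterances, two dimensions).
e_1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
c_1 = speaker_embedding(e_1)
print(c_1)  # [0.5, 0.5]
```
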

1.2 Synthesiser $S$

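Generates a spectrogram $\hat{x}_{ij}$ from the text $t_{ij}$ and an utterance embedding. In the formula the first argument is written $u_{ij}$; it stands for the embedding of that utterance, $e_{ij}$ in the notation of section 1.1.

$$ \hat{x}_{ij} = S(u_{ij},t_{ij};w_S) $$

The generated $\hat{x}_{ij}$ should match the real $x_{ij}$.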
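
One way to quantify how closely $\hat{x}_{ij}$ matches $x_{ij}$ is a mean squared error over all time-frequency cells. The notes do not say which loss is used, so the choice of MSE here is an assumption, and `spectrogram_mse` is a hypothetical helper name.

```python
def spectrogram_mse(x_hat, x):
    """Mean squared error between two equally shaped spectrograms."""
    if len(x_hat) != len(x) or any(len(a) != len(b) for a, b in zip(x_hat, x)):
        raise ValueError("spectrograms must have the same shape")
    total = sum((a - b) ** 2
                for row_hat, row in zip(x_hat, x)
                for a, b in zip(row_hat, row))
    count = sum(len(row) for row in x)
    return total / count


# Toy 2x2 spectrograms that differ in a single cell by 2.0.
x_real = [[0.0, 1.0], [2.0, 5.0]]
x_generated = [[0.0, 1.0], [2.0, 3.0]]
print(spectrogram_mse(x_generated, x_real))  # 1.0
```
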

1.3 Vocoder $V$

Converts the generated spectrogram $\hat{x}_{ij}$ into a waveform $\hat{u}_{ij}$ that you can hear.

$$ \hat{u}_{ij}=V(\hat{x}_{ij};w_V) $$
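
The data flow through $E$, $S$ and $V$ can be sketched as a simple function composition. Everything below is a toy stand-in written for illustration: `clone_voice` and the four `toy_*` functions are hypothetical names, and none of them performs real feature extraction, synthesis, or vocoding. Only the order of the stages and the shapes passed between them follow the definitions above.

```python
def clone_voice(reference, text, to_spectrogram, encoder, synthesiser, vocoder):
    """Chain the three stages: waveform -> embedding -> spectrogram -> waveform."""
    x_ref = to_spectrogram(reference)   # x_ij from u_ij
    e_ref = encoder(x_ref)              # e_ij = E(x_ij; w_E)
    x_hat = synthesiser(e_ref, text)    # x_hat = S(embedding, t_ij; w_S)
    return vocoder(x_hat)               # u_hat = V(x_hat; w_V)


# --- Toy stand-ins (assumptions, not real models) ---------------------------
def toy_spectrogram(wave, frame_len=4):
    # Chop the waveform into non-overlapping frames of absolute values.
    return [[abs(s) for s in wave[i:i + frame_len]]
            for i in range(0, len(wave) - frame_len + 1, frame_len)]


def toy_encoder(x):
    # Average over time so the embedding has a fixed size.
    return [sum(frame[d] for frame in x) / len(x) for d in range(len(x[0]))]


def toy_synthesiser(embedding, text):
    # Emit one spectrogram frame per character, conditioned on the embedding.
    return [list(embedding) for _ in text]


def toy_vocoder(x_hat):
    # Flatten the frames back into a 1-D "waveform".
    return [value for frame in x_hat for value in frame]


reference_clip = [0.1] * 16
u_hat = clone_voice(reference_clip, "hi",
                    toy_spectrogram, toy_encoder, toy_synthesiser, toy_vocoder)
print(len(u_hat))  # 8 samples: 2 characters x 4 values per frame
```

Passing the stages in as arguments mirrors the note that the models are trained separately: each one can be swapped out without touching the others.
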

2 Training

2.1 Dataset Requirements