A research toolkit for extracting speaker identity from audio using a custom-trained Transformer encoder, for voice cloning and speaker diarisation.
Group short audio clips (utterances) by speaker. These utterances are in the waveform domain.
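A minimal sketch of the grouping step, assuming each clip arrives as a `(speaker_id, waveform)` pair. The function name `group_by_speaker` and the toy speaker labels are hypothetical and only illustrate the speaker/utterance indexing used in these notes.

```python
from collections import defaultdict

def group_by_speaker(clips):
    """Group waveforms by speaker.

    clips: iterable of (speaker_id, waveform) pairs (assumed input layout).
    Returns a dict mapping speaker_id to a list of that speaker's waveforms.
    """
    groups = defaultdict(list)
    for speaker_id, waveform in clips:
        groups[speaker_id].append(waveform)
    return dict(groups)

# Toy waveforms as short lists of samples; real clips would be much longer.
clips = [("alice", [0.0, 0.1]), ("bob", [0.2]), ("alice", [0.3, 0.4])]
utterances = group_by_speaker(clips)
# utterances["alice"] holds two waveforms, utterances["bob"] holds one.
```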
<aside> <img src="/icons/info-alternate_lightgray.svg" alt="/icons/info-alternate_lightgray.svg" width="40px" />
Computing a log-mel spectrogram is a deterministic, lossy transformation that extracts speech features from a waveform.
</aside>
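
The following is a rough NumPy sketch of a log-mel spectrogram, not the toolkit's actual front end. The sample rate, window size, hop length, number of mel bands and the HTK-style mel formula are all assumptions chosen for illustration.

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale (one of several conventions; an assumption here)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=40, eps=1e-6):
    """Return a log-mel spectrogram of shape (n_frames, n_mels)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # Taking the magnitude discards phase, which is one source of lossiness.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + eps)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)  # 1 s of noise standing in for speech
x = log_mel_spectrogram(wave)
print(x.shape)  # (98, 40)
```

The same waveform always yields the same spectrogram (deterministic), while `wave` and `-wave` yield identical spectrograms, so the waveform cannot be recovered exactly (lossy).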

Goal: turn a short audio clip into an embedding. The encoder takes a spectrogram of the audio and produces an embedding.
Converts the spectrogram $x_{ij}$ of utterance $j$ from speaker $i$ into an embedding $e_{ij}=E(x_{ij};w_E)$, where $w_E$ are the encoder parameters.
The speaker embedding $c_i$ is the average of all $n$ utterance embeddings from speaker $i$:
$$ \mathbf{c}_i = \frac{1}{n} \sum_{j=1}^n \mathbf{e}_{ij} \tag{1} $$
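Equation (1) can be sketched in NumPy as a mean over the utterance axis. The function name `speaker_embedding` and the toy 2-dimensional embeddings are illustrative assumptions; real embeddings would come from the encoder and have many more dimensions.

```python
import numpy as np

def speaker_embedding(utterance_embeddings):
    """Average utterance embeddings e_ij over j.

    utterance_embeddings: array-like of shape (n, d), one row per utterance.
    Returns the speaker embedding c_i of shape (d,).
    """
    e = np.asarray(utterance_embeddings, dtype=float)
    return e.mean(axis=0)

# Three toy utterance embeddings for one speaker (n = 3, d = 2).
e_i = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
c_i = speaker_embedding(e_i)  # each component equals 2/3
```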
Generates a spectrogram $\hat{x}_{ij}$ from text $t_{ij}$ and an utterance embedding $u_{ij}$
$$ \hat{x}_{ij} = S(u_{ij},t_{ij};w_S) $$
This generated $\hat{x}_{ij}$ should match the real $x_{ij}$.
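One way to quantify how closely the generated spectrogram matches the real one is a reconstruction loss. These notes do not say which loss is used, so the mean squared error below is an assumption, and `spectrogram_mse` is a hypothetical helper name.

```python
import numpy as np

def spectrogram_mse(x_hat, x):
    """Mean squared error between a generated and a real spectrogram.

    Both inputs must have the same shape, e.g. (n_frames, n_mels).
    """
    x_hat = np.asarray(x_hat, dtype=float)
    x = np.asarray(x, dtype=float)
    if x_hat.shape != x.shape:
        raise ValueError("spectrograms must have the same shape")
    return float(np.mean((x_hat - x) ** 2))

# Toy spectrograms: every generated value is off by exactly 1.
x = np.zeros((3, 2))
x_hat = np.ones((3, 2))
loss = spectrogram_mse(x_hat, x)  # 1.0
```

A loss of zero means the generated and real spectrograms are identical; training would aim to push this value down.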
Converts the generated spectrogram $\hat{x}_{ij}$ into a waveform $\hat{u}_{ij}$ that you can hear.
$$ \hat{u}_{ij}=V(\hat{x}_{ij};w_V) $$