A research toolkit that extracts speaker identity from audio with a custom-trained Transformer encoder, for voice cloning and speaker diarisation

1 Problem Definition

  1. Group short audio clips (utterances) by speaker. These utterances are in the waveform domain.

    <aside>

    A log-mel spectrogram is the output of a deterministic, lossy transform that extracts speech features from a waveform.

    </aside>
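To make the note above concrete, here is a minimal sketch of a log-mel spectrogram computed with NumPy. It is an illustration, not this toolkit's actual front end: the sample rate (16 kHz), window length (400 samples), hop (160 samples), number of mel bands (40), and the function names `mel_filterbank` and `log_mel_spectrogram` are all assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(hz):
    # HTK-style mel scale (one common convention; an assumption here).
    return 2595.0 * np.log10(1.0 + np.asarray(hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    # n_mels + 2 band edges, evenly spaced on the mel scale.
    mel_edges = np.linspace(float(hz_to_mel(0.0)), float(hz_to_mel(sr / 2.0)), n_mels + 2)
    edges = mel_to_hz(mel_edges)
    fb = np.zeros((n_mels, fft_freqs.size))
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        rising = (fft_freqs - lo) / (mid - lo)
        falling = (hi - fft_freqs) / (hi - mid)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=40, eps=1e-10):
    """Return log-mel features of shape (n_frames, n_mels)."""
    wave = np.asarray(wave, dtype=float)
    if wave.size < n_fft:
        wave = np.pad(wave, (0, n_fft - wave.size))
    n_frames = 1 + (wave.size - n_fft) // hop
    # Slice the waveform into overlapping frames and apply a Hann window.
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wave[idx] * np.hanning(n_fft)
    # Taking the power spectrum discards phase: this is one source of loss.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    # Pooling FFT bins into mel bands discards frequency detail: a second source.
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T
    return np.log(mel + eps)

# One second of a 440 Hz tone at 16 kHz.
t = np.arange(16000) / 16000.0
wave = np.sin(2 * np.pi * 440.0 * t)
feats = log_mel_spectrogram(wave)
print(feats.shape)  # (98, 40)
```

The same waveform always yields the same features (deterministic), while the 16,000 input samples are reduced to 98 × 40 values with phase discarded, so the waveform cannot be recovered exactly (lossy).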


1.1 Speaker Encoder $E$

1.2 Synthesiser $S$

1.3 Vocoder $V$

2 Training

2.1 Dataset Requirements