
Summary
- Propose using VQ-wav2vec to achieve any-to-one voice conversion, which converts any input utterance into the voice of a single trained target speaker.
- Train VQ-wav2vec to learn a speaker-invariant representation, followed by a Seq2Seq model that translates that representation into speech.
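The two-stage pipeline above can be sketched as a toy example. Everything here is a hypothetical stand-in: a nearest-neighbor quantizer plays the role of the VQ-wav2vec encoder, and a per-code embedding table plays the role of the trained Seq2Seq decoder; shapes and sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # 8 discrete codes, 4-dim features (toy sizes)

def quantize(frames):
    """Stage 1 stand-in: map each frame to its nearest codebook index,
    discarding speaker-specific detail."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Stage 2 stand-in: a per-code embedding that a real Seq2Seq model would
# learn so the output carries the target speaker's voice characteristics.
target_embed = rng.normal(size=(8, 4))

def convert(frames):
    codes = quantize(frames)      # speaker-invariant discrete codes
    return target_embed[codes]    # "target-speaker" output frames

src = rng.normal(size=(10, 4))    # 10 frames from an unseen source speaker
out = convert(src)
assert out.shape == src.shape
```

Because the codes are discrete and (ideally) speaker-invariant, any source speaker maps into the same code inventory, which is what makes the conversion any-to-one.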
The problems
Conventional Seq2Seq voice-conversion models require parallel data and only achieve one-to-one conversion.
The solution
- Leverage VQ-wav2vec to learn a speaker-invariant representation, and build an any-to-one framework that does not require parallel data.
- Pre-training and especially the grouping method for the learned codebook are claimed to improve data efficiency.
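The grouping idea can be illustrated with a minimal sketch: split each feature vector into G groups and quantize each group against its own small codebook of V entries, so V**G combined codes are expressible while storing only G*V*(d/G) codebook parameters. The sizes below are illustrative, not the paper's actual configuration.

```python
import numpy as np

G, V, d = 2, 4, 8                  # groups, entries per group, feature dim (toy)
rng = np.random.default_rng(1)
codebooks = rng.normal(size=(G, V, d // G))  # one small codebook per group

def grouped_quantize(vec):
    """Quantize each of the G slices of `vec` against its own codebook."""
    idxs = []
    for g in range(G):
        part = vec[g * (d // G):(g + 1) * (d // G)]
        dists = np.linalg.norm(codebooks[g] - part, axis=1)
        idxs.append(int(dists.argmin()))
    return tuple(idxs)

# 2 groups of 4 entries each express 4**2 = 16 distinct combined codes,
# while a single flat codebook would need 16 full-dimension entries.
assert V ** G == 16
```

This combinatorial sharing is one plausible reason grouping helps data efficiency: each small per-group codebook sees every training frame, rather than splitting the data across one large code inventory.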
Thoughts