
Summary
- Propose using VQ-wav2vec to achieve any-to-one voice conversion, which converts any input utterance into the voice of a single trained target speaker.
- Train VQ-wav2vec to learn a speaker-invariant representation, followed by a Seq2Seq model that translates that representation into speech.
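The two-stage pipeline above can be sketched as a toy example. Everything here is a hypothetical stand-in: a nearest-neighbor quantizer plays the role of the VQ-wav2vec encoder, and a per-code embedding table plays the role of the trained Seq2Seq decoder; shapes and sizes are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # 8 discrete codes, 4-dim features (toy sizes)

def quantize(frames):
    """Stage 1 stand-in: map each frame to its nearest codebook index,
    discarding speaker-specific detail."""
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Stage 2 stand-in: a per-code embedding that a real Seq2Seq model would
# learn so the output carries the target speaker's voice characteristics.
target_embed = rng.normal(size=(8, 4))

def convert(frames):
    codes = quantize(frames)      # speaker-invariant discrete codes
    return target_embed[codes]    # "target-speaker" output frames

src = rng.normal(size=(10, 4))    # 10 frames from an unseen source speaker
out = convert(src)
assert out.shape == src.shape
```

Because the codes are discrete and (ideally) speaker-invariant, any source speaker maps into the same code inventory, which is what makes the conversion any-to-one.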
The problems
Conventional Seq2Seq voice-conversion models require parallel data and only achieve one-to-one conversion.
The solution
- Leverage VQ-wav2vec to learn a speaker-invariant representation, and build an any-to-one framework that does not require parallel data.
- Pre-training and especially the grouping method for the learned codebook are claimed to improve data efficiency.
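The grouping idea can be illustrated with a minimal sketch: split each feature vector into G groups and quantize each group against its own small codebook of V entries, so V**G combined codes are expressible while storing only G*V*(d/G) codebook parameters. The sizes below are illustrative, not the paper's actual configuration.

```python
import numpy as np

G, V, d = 2, 4, 8                  # groups, entries per group, feature dim (toy)
rng = np.random.default_rng(1)
codebooks = rng.normal(size=(G, V, d // G))  # one small codebook per group

def grouped_quantize(vec):
    """Quantize each of the G slices of `vec` against its own codebook."""
    idxs = []
    for g in range(G):
        part = vec[g * (d // G):(g + 1) * (d // G)]
        dists = np.linalg.norm(codebooks[g] - part, axis=1)
        idxs.append(int(dists.argmin()))
    return tuple(idxs)

# 2 groups of 4 entries each express 4**2 = 16 distinct combined codes,
# while a single flat codebook would need 16 full-dimension entries.
assert V ** G == 16
```

This combinatorial sharing is one plausible reason grouping helps data efficiency: each small per-group codebook sees every training frame, rather than splitting the data across one large code inventory.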
Thoughts