VoxCPM2 is a free, open-source text-to-speech model from OpenBMB. It's a 2B-parameter model trained on more than 2 million hours of multilingual audio, and it can clone voices, design voices from text descriptions, and generate 48 kHz studio-quality audio. No API key. No monthly fee. Runs on your own machine.
This guide covers everything: installation, basic TTS, voice design, voice cloning, and fine-tuning.
Most TTS systems work by breaking speech into tokens (discrete chunks), which can limit naturalness because fine acoustic detail is rounded away when audio is mapped to a fixed set of tokens. VoxCPM2 skips that step entirely. It uses a tokenizer-free diffusion architecture that generates speech directly in a continuous latent space. The result is more natural, expressive output.
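To make the token-versus-continuous distinction concrete, here is a toy sketch of what discretization costs. It is not VoxCPM2 code and says nothing about how the model is implemented. The `quantize` helper, the five-entry codebook, and the latent values are all made up for illustration.

```python
# Toy illustration only: NOT VoxCPM2 code. All names and values are hypothetical.

def quantize(value, codebook):
    """Return the index of the codebook entry nearest to a continuous value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))

codebook = [-1.0, -0.5, 0.0, 0.5, 1.0]  # hypothetical 5-entry codebook
latents = [0.12, -0.73, 0.49, 0.98]     # made-up continuous "latent" values

# A token-based system keeps only the indices...
tokens = [quantize(v, codebook) for v in latents]

# ...so decoding can only recover the codebook entries, not the originals.
reconstructed = [codebook[t] for t in tokens]
errors = [abs(v - r) for v, r in zip(latents, reconstructed)]

print("tokens:", tokens)
print("reconstructed:", reconstructed)
print("largest rounding error:", max(errors))
```

A model that works directly on the continuous values never performs the rounding step, so it has no error of this kind to begin with. Real audio tokenizers use far larger codebooks and learned encoders, so treat the numbers here as a picture of the idea rather than a measurement.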
Key specs:
Before you install anything, check you have:
To check your Python version:
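From a terminal, `python --version` prints the interpreter version. If you would rather check from inside Python, the sketch below does the same and compares the result against a minimum. The `python_is_at_least` helper is hypothetical, and the `REQUIRED` value is a placeholder assumption, not a requirement confirmed by this guide; replace it with the minimum stated in the VoxCPM2 documentation.

```python
import sys

def python_is_at_least(required, current=None):
    """Return True if `current` (major, minor) is >= `required` (major, minor).

    When `current` is omitted, the running interpreter's version is used.
    """
    if current is None:
        current = sys.version_info[:2]
    return tuple(current) >= tuple(required)

# Placeholder assumption: swap in the minimum version from the VoxCPM2 docs.
REQUIRED = (3, 10)

print("Python", sys.version.split()[0])
if python_is_at_least(REQUIRED):
    print("Version check passed")
else:
    print("Python is older than", ".".join(map(str, REQUIRED)))
```

Comparing tuples rather than version strings avoids the classic mistake where `"3.9"` sorts after `"3.10"` alphabetically.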