VoxCPM2 is a free, open-source text-to-speech model from OpenBMB. It's a 2B-parameter model trained on more than 2 million hours of multilingual audio, and it can clone voices, design voices from text descriptions, and generate studio-quality audio at 48 kHz. No API key. No monthly fee. Runs on your own machine.

This guide covers installation, basic TTS, voice design, voice cloning, and fine-tuning.


What Makes VoxCPM2 Different

Many TTS systems first break speech into discrete tokens, and that quantization step can limit naturalness. VoxCPM2 skips it entirely. It uses a tokenizer-free diffusion architecture that generates speech directly in a continuous latent space. The goal is more natural, expressive output.
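To get a feel for why discrete tokens can cost detail, here is a toy sketch. It is not VoxCPM2 code, and it does not reflect how any real speech tokenizer works (those typically use learned codebooks). It simply snaps a sine wave to a fixed number of evenly spaced levels and measures how far the snapped copy drifts from the original. The function names `quantize` and `max_error` are made up for this illustration.

```python
import math


def quantize(value, levels, lo=-1.0, hi=1.0):
    """Snap `value` to the nearest of `levels` evenly spaced points in [lo, hi]."""
    step = (hi - lo) / (levels - 1)
    index = round((value - lo) / step)
    index = max(0, min(levels - 1, index))  # stay inside the grid
    return lo + index * step


def max_error(levels, samples=1000):
    """Largest absolute gap between one sine cycle and its quantized copy."""
    worst = 0.0
    for n in range(samples):
        x = math.sin(2 * math.pi * n / samples)
        worst = max(worst, abs(x - quantize(x, levels)))
    return worst


if __name__ == "__main__":
    # Fewer levels means a coarser grid and a larger worst-case gap.
    for levels in (4, 16, 256):
        print(levels, "levels -> max error", round(max_error(levels), 4))
```

The error shrinks as the number of levels grows but never reaches zero, which is the intuition behind working in a continuous space instead.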

Key specs:


Requirements

Before you install anything, check you have:

To check your Python version:
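One way to do this from inside Python is the small script below, offered as a sketch. Running `python --version` in a terminal reports the same information. Note that `ASSUMED_MINIMUM` is a placeholder, not a documented VoxCPM2 requirement: swap in whatever version the requirements list calls for. The helper names are also just for illustration.

```python
import sys

# Placeholder only: NOT a documented VoxCPM2 requirement.
# Replace with the minimum version from the requirements list.
ASSUMED_MINIMUM = (3, 10)


def meets_minimum(current, minimum):
    """Return True if the (major, minor) of `current` is at least `minimum`."""
    return tuple(current[:2]) >= tuple(minimum[:2])


def describe(current):
    """Format a version tuple such as (3, 11, 4) as '3.11.4'."""
    return ".".join(str(part) for part in current[:3])


if __name__ == "__main__":
    current = sys.version_info
    print("Python", describe(current))
    if not meets_minimum(current, ASSUMED_MINIMUM):
        print("Older than the assumed minimum", describe(ASSUMED_MINIMUM))
```

Run it with the same interpreter you plan to install into, since a machine can have several Python versions side by side.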