VoxCPM2 Setup Guide: Open-Source TTS, Voice Cloning & Voice Design

VoxCPM2 is a free, open-source text-to-speech model from OpenBMB. It's a 2B-parameter model trained on more than 2 million hours of multilingual audio. It can clone voices, design voices from text descriptions, and generate 48kHz studio-quality audio. No API key. No monthly fee. It runs on your own machine.

This guide covers installation, basic TTS, voice design, voice cloning, and fine-tuning.

What Makes VoxCPM2 Different

Many TTS systems first convert speech into discrete tokens. That quantization step can discard fine acoustic detail, which limits naturalness. VoxCPM2 skips it entirely: it uses a tokenizer-free diffusion architecture that generates speech directly in a continuous latent space. The result is more natural, more expressive output.
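To make the discrete-versus-continuous point concrete, here is a toy sketch. It is not VoxCPM2's architecture or code: it snaps a smooth signal to a small, evenly spaced codebook and measures how much detail is lost. The `quantize` and `max_error` helpers and the sine-wave "latent" are made up for illustration only.

```python
import math


def quantize(value, levels):
    """Snap a value in [-1, 1] to the nearest of `levels` evenly spaced codebook entries."""
    step = 2.0 / (levels - 1)
    index = round((value + 1.0) / step)
    return -1.0 + index * step


# A smooth trajectory standing in for one continuous speech feature over time.
signal = [math.sin(2 * math.pi * t / 50) for t in range(200)]


def max_error(levels):
    """Largest gap between the original signal and its quantized version."""
    return max(abs(x - quantize(x, levels)) for x in signal)


for levels in (4, 16, 256):
    print(f"{levels:>3} levels: max quantization error = {max_error(levels):.4f}")
```

The error shrinks as the codebook grows but only disappears if no snapping happens at all. A model that works in a continuous space avoids this particular source of loss.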

Key specs:

2B parameters: bigger than most open-source TTS models
30 languages: auto-detected, no language tag needed
48kHz output: studio quality, no external upsampler
~8GB VRAM: enough GPU memory to run the model
Apache 2.0 license: fully open, commercial use allowed

Requirements

Before you install anything, check you have:

Python 3.10 to 3.12 (3.13 is not supported yet)
PyTorch 2.5.0+
CUDA 12.0+ (you need an NVIDIA GPU; CPU inference is very slow)
~8GB VRAM
~10GB disk space for the model weights
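If you would rather check the whole list in one pass, a short script along these lines can help. It is a minimal sketch, not part of VoxCPM2. The thresholds are copied from the list above, the helper names (`python_ok`, `version_tuple`, `check_environment`) are hypothetical, and the disk check looks at the current directory, which may not be where your weights end up. The VRAM threshold is set slightly below 8 because the requirement is approximate and cards report slightly different totals.

```python
import shutil
import sys

MIN_PY, MAX_PY = (3, 10), (3, 12)   # supported Python range from the list above
MIN_TORCH = (2, 5)                  # PyTorch 2.5.0+
MIN_CUDA = (12, 0)                  # CUDA 12.0+
MIN_VRAM_GB = 7.5                   # "~8GB", with a little slack (assumption)
MIN_DISK_GB = 10                    # "~10GB" for the model weights


def python_ok(version=sys.version_info):
    """True if the (major, minor) version is within the supported range."""
    return MIN_PY <= tuple(version[:2]) <= MAX_PY


def version_tuple(text):
    """Turn a string like '2.5.1+cu121' into (2, 5)."""
    core = text.split("+")[0]
    return tuple(int(part) for part in core.split(".")[:2])


def check_environment(path="."):
    """Return a dict mapping each requirement to True or False."""
    results = {
        "python": python_ok(),
        "disk": shutil.disk_usage(path).free / 1e9 >= MIN_DISK_GB,
    }
    try:
        import torch
    except ImportError:
        # Without PyTorch installed, the GPU checks cannot run.
        results["torch"] = False
        return results

    results["torch"] = version_tuple(torch.__version__) >= MIN_TORCH
    has_gpu = torch.cuda.is_available() and torch.version.cuda is not None
    results["cuda"] = has_gpu and version_tuple(torch.version.cuda) >= MIN_CUDA
    if has_gpu:
        total_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
        results["vram"] = total_gb >= MIN_VRAM_GB
    else:
        results["vram"] = False
    return results


for name, ok in check_environment().items():
    print(f"{name:<7} {'OK' if ok else 'MISSING or too old'}")
```

Run it with the same Python interpreter you plan to install VoxCPM2 into. A failed "torch" line simply means PyTorch is not installed yet.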

To check your Python version: