![Musubi Gym](<musubi gym.png>)

## What is Musubi Gym?

Musubi Gym is an open-source, production-ready training pipeline for character identity, motion, aesthetic, and style LoRAs. It currently supports the Wan 2.1 and 2.2 video generation model family and is built on musubi-tuner (kohya-ss).

The repo aims to cover the complete training lifecycle in a single documented, copy-and-run package: dataset preparation and captioning utilities, cloud GPU training scripts for Modal and RunPod, and 16 training templates covering every Wan model variant (T2V, I2V, 2.1, 2.2, Lightning-merged, and vanilla).

Creators Minta Carlson and Timothy Bielec are currently focused on optimizing LoRA training hyperparameters for Wan 2.2's Mixture-of-Experts architecture. The project is developed through empirical testing and iteration, and it documents several original findings, including fixes for undocumented bugs in musubi-tuner. Its quality-first methodology ships validated hyperparameter defaults derived from cross-referencing multiple practitioners' results (credited), rather than untested community defaults.

## How Should Someone Use This?

Pick your path based on your platform preference:

✍️ Dataset Captioning: This happens locally. Put your reference images and video clips in a folder, run one of the two captioning scripts (Gemini is free; Replicate is faster), spot-check the captions, and upload. The docs also cover recommended captioning methodology.

🦾 Modal (serverless, no SSH): Use the Quickstart for Wan 2.2 T2V on Modal. You'll clone a training template, swap in your dataset folder and character name, and run a single `modal run` command. There is no pod management: the GPU spins up, trains, and shuts down automatically, and models are saved to your volume.

🏃 RunPod (bare metal, cheaper GPU rates): Run `setup_runpod.sh` on a fresh A100 pod to install musubi-tuner and download model weights automatically. Then pick a training template, customize it, and launch. The RunPod Training Guide walks through every step.
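Whichever captioning script you use, musubi-tuner-style datasets pair each media file with a same-stem `.txt` caption. A quick spot-check sketch (the `dataset/` folder name and `missing_captions` helper are hypothetical, not part of the repo) might look like:

```python
from pathlib import Path

# Hypothetical dataset folder; swap in your own path.
DATASET = Path("dataset")
MEDIA_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".mp4"}

def missing_captions(root: Path) -> list[str]:
    """Return media files that lack a same-stem .txt caption."""
    missing = []
    for f in sorted(root.iterdir()):
        if f.suffix.lower() in MEDIA_EXTS and not f.with_suffix(".txt").exists():
            missing.append(f.name)
    return missing

if __name__ == "__main__":
    if DATASET.is_dir():
        for name in missing_captions(DATASET):
            print(f"no caption for {name}")
```

Running this before uploading catches the silent failure mode where an uncaptioned clip trains with an empty prompt.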

On either platform, the workflow is the same three steps: cache VAE latents, cache T5 text encoder outputs, then train.
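The three steps above map to three separate script invocations sharing one dataset config. The script names, model paths, and flags below are illustrative assumptions (the training templates carry the real, validated invocations); the sketch only shows the order and the shared config:

```python
import shlex

# Illustrative paths and flags only; check the training templates
# for the exact musubi-tuner script names and arguments.
DATASET_CONFIG = "dataset/config.toml"

steps = [
    # 1) cache VAE latents for every image/clip in the dataset
    ["python", "wan_cache_latents.py",
     "--dataset_config", DATASET_CONFIG,
     "--vae", "models/wan_vae.safetensors"],
    # 2) cache T5 text-encoder outputs for every caption
    ["python", "wan_cache_text_encoder_outputs.py",
     "--dataset_config", DATASET_CONFIG,
     "--t5", "models/umt5-xxl.safetensors"],
    # 3) train the LoRA against the cached latents and embeddings
    ["python", "wan_train_network.py",
     "--dataset_config", DATASET_CONFIG,
     "--dit", "models/wan2.2_t2v.safetensors"],
]

for cmd in steps:
    print(shlex.join(cmd))
```

Caching first means the VAE and T5 never need to stay resident during training, which is what keeps single-GPU runs feasible.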

For Wan 2.2, you run training twice (once for the high-noise expert, once for the low-noise expert) and load both LoRAs in ComfyUI at inference. The templates handle all the flags, precision settings, and flow shift values correctly so you don't have to debug them yourself.

## 🚀 Start Here

New to the project? Pick your path:

## Mission