
Musubi Gym is an open-source, production-ready training pipeline for character identity, motion, aesthetic, and style LoRAs. It currently supports the Wan 2.1 and 2.2 video generation model families and is built on musubi-tuner (kohya-ss).
The repo aims to cover the complete training lifecycle in a single documented, copy-and-run package: dataset preparation and captioning utilities, cloud GPU training scripts for Modal and RunPod, and 16 training templates covering every Wan model variant (T2V, I2V, 2.1, 2.2, Lightning-merged, and vanilla).
Currently, creators Minta Carlson and Timothy Bielec are primarily focused on optimizing LoRA training hyperparameters for Wan 2.2's Mixture-of-Experts architecture. The project is developed through empirical testing and iteration, and it documents several original findings, including fixes for undocumented bugs in musubi-tuner. It follows a quality-first methodology: hyperparameter defaults are validated by cross-referencing the results of multiple practitioners in the space (credited), rather than taken from untested community defaults.
Pick your path based on your platform preference:
✍️ Dataset Captioning: Happens locally. Put your reference images and video clips in a folder, run one of the two captioning scripts (Gemini for free, Replicate for speed), spot-check the captions, and upload. Includes a recommended captioning methodology.
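Before uploading, it's worth verifying that every media file actually picked up a caption. A minimal local sanity check, assuming the common musubi-tuner convention of a same-named `.txt` sidecar next to each image or clip (the extension list below is illustrative):

```python
from pathlib import Path

# Media extensions to check; extend this set to match your dataset.
MEDIA_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".mp4", ".mov"}

def find_missing_captions(dataset_dir):
    """Return media filenames that lack a same-named .txt caption sidecar."""
    missing = []
    for media in sorted(Path(dataset_dir).iterdir()):
        if media.suffix.lower() in MEDIA_EXTS:
            # e.g. hero_01.mp4 should sit next to hero_01.txt
            if not media.with_suffix(".txt").exists():
                missing.append(media.name)
    return missing
```

Run it over your dataset folder after captioning; an empty list means every clip and image is paired.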
🦾 Modal (serverless, no SSH): Use the Quickstart for Wan 2.2 T2V on Modal. You'll clone a training template, swap in your dataset folder and character name, and run a single modal run command. No pod management: the GPU spins up, trains, and shuts down automatically, and the trained models are saved to your volume.
🏃 RunPod (bare metal, cheaper GPU rates): Run setup_runpod.sh on a fresh A100 pod to install musubi-tuner and download the model weights automatically. Then pick a training template, customize it, and launch. The RunPod Training Guide walks through every step.
In both cases the workflow is the same three steps: cache VAE latents, cache T5 text encoder outputs, then train.
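All three steps read the same musubi-tuner dataset config. A minimal TOML sketch of what the templates point at (paths, resolution, and frame counts here are placeholders, not recommended values):

```toml
[general]
resolution = [960, 544]
caption_extension = ".txt"
batch_size = 1
enable_bucket = true

[[datasets]]
image_directory = "/workspace/dataset/images"
cache_directory = "/workspace/dataset/cache"
num_repeats = 1

[[datasets]]
video_directory = "/workspace/dataset/videos"
cache_directory = "/workspace/dataset/cache_video"
target_frames = [25, 45]
```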
For Wan 2.2, you run training twice - once for the high-noise expert and once for the low-noise expert - and load both LoRAs in ComfyUI at inference. The templates handle all the flags, precision settings, and flow shift values correctly so you don't have to debug them yourself.
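The whole sequence can be sketched as argument lists you might hand to subprocess.run. The script names follow musubi-tuner's Wan tooling, but the specific flags, model filenames, and output names below are illustrative assumptions; the repo's training templates set the real values:

```python
def wan22_pipeline(dataset_config="dataset.toml",
                   vae="wan_vae.safetensors",
                   t5="t5_xxl.safetensors"):
    """Build the four commands: two caching passes, then one training
    run per Wan 2.2 MoE expert (high-noise and low-noise)."""
    cache_latents = ["python", "wan_cache_latents.py",
                     "--dataset_config", dataset_config, "--vae", vae]
    cache_text = ["python", "wan_cache_text_encoder_outputs.py",
                  "--dataset_config", dataset_config, "--t5", t5]
    # Wan 2.2 trains each expert separately; both LoRAs are loaded
    # together at inference in ComfyUI.
    train_runs = [
        ["python", "wan_train_network.py",
         "--dataset_config", dataset_config,
         "--dit", f"wan2.2_t2v_{expert}_noise.safetensors",  # hypothetical filename
         "--output_name", f"my_character_{expert}_noise"]    # hypothetical name
        for expert in ("high", "low")
    ]
    return [cache_latents, cache_text] + train_runs
```

In practice you never assemble these by hand: the templates emit the equivalent commands with the correct precision and flow shift settings baked in.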
New to the project? Pick your path:
Want to caption → Quickstart: Local Image & Video Captioning Tool
First time, want it easy → Quickstart: Wan 2.2 T2V on Modal (serverless, no SSH)
Want cheaper GPU rates → Quickstart: Wan 2.2 T2V on RunPod (bare metal, more setup)
Want AI-assisted training → Quickstart: Training with Claude Code (agentic workflow)