Quickstart: Local Image & Video Captioner

What Is This?

Two Python scripts that auto-caption your images and videos for LoRA training. Point them at a folder, set your anchor word, and they generate .txt caption files that musubi-tuner reads automatically.

You pick one:

| | Gemini (`caption_gemini.py`) | Replicate (`caption_replicate.py`) |
|---|---|---|
| Cost | Free (Google API key) | Paid (Replicate credits) |
| Speed | ~10s/file + rate limit waits | ~2-5s/file, minimal waits |
| Rate limits | Aggressive on free tier (frequent 429s) | Generous |
| Best for | Small datasets (under 50 files) | Large datasets, time-sensitive work |
| Dependencies | `google-generativeai`, `Pillow` | `requests` |

Recommendation: Use Replicate for bulk captioning. Use Gemini for small batches or when you don't want to spend credits.

Prerequisites

Python 3.10+: open PowerShell and run `python --version` to check.

Gemini path:

pip install google-generativeai Pillow

Get a free API key at aistudio.google.com/apikey

Replicate path:

pip install requests

Get your token at replicate.com/account/api-tokens
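
If you want to confirm the prerequisites before running either script, a small check like the one below can help. It is only a sketch and not part of the captioner scripts: the function names and the `REQUIRED` mapping are made up for illustration. The one detail worth knowing is that pip package names and import names differ (`Pillow` imports as `PIL`, `google-generativeai` as `google.generativeai`).

```python
import importlib.util
import sys

# pip package name -> import name, grouped by the path you picked.
# This mapping is an illustrative helper, not something the scripts define.
REQUIRED = {
    "gemini": {"google-generativeai": "google.generativeai", "Pillow": "PIL"},
    "replicate": {"requests": "requests"},
}


def python_ok(version_info=sys.version_info) -> bool:
    """Return True if the interpreter is Python 3.10 or newer."""
    return tuple(version_info[:2]) >= (3, 10)


def is_importable(module_name: str) -> bool:
    """Return True if the module can be located on this system."""
    try:
        # For dotted names, find_spec imports the parent package first.
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (e.g. "google") is missing.
        return False


def missing_packages(path: str) -> list[str]:
    """List pip package names for the chosen path that are not installed."""
    return [
        pip_name
        for pip_name, module_name in REQUIRED[path].items()
        if not is_importable(module_name)
    ]


if __name__ == "__main__":
    print("Python 3.10+:", "yes" if python_ok() else "no")
    for path in REQUIRED:
        missing = missing_packages(path)
        status = "ready" if not missing else "pip install " + " ".join(missing)
        print(f"{path}: {status}")
```

Save it under any name (for example `check_setup.py`, a hypothetical filename) and run it with `python check_setup.py`. It only reports what is missing; it does not install anything or validate your API key.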

Step 1: Organize Your Dataset

Both scripts expect your files in two folders — one for images, one for videos. You can use just one or both.

datasets/
  YourCharacter/
    images/
      photo_001.png
      photo_002.jpg
      photo_003.webp
    videos/
      clip_001.mp4
      clip_002.mp4

Supported formats:

Images: .png, .jpg, .jpeg, .webp, .bmp
Videos: .mp4, .webm, .mov, .avi, .mkv
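
Before captioning, it can be useful to check that the folder layout and file extensions match what is described above. The following is a minimal sketch, not code from the captioner scripts. It assumes that files are matched case-insensitively by extension, that subfolders inside `images/` and `videos/` are not scanned, and that each caption is a `.txt` file next to the media file with the same base name. The function names and the `datasets/YourCharacter` path are placeholders.

```python
from pathlib import Path

# Extensions taken from the "Supported formats" list.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}
VIDEO_EXTS = {".mp4", ".webm", ".mov", ".avi", ".mkv"}


def find_media(folder: Path, extensions: set[str]) -> list[Path]:
    """Return supported files directly inside folder (non-recursive)."""
    if not folder.is_dir():
        return []  # Using only images/ or only videos/ is allowed.
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )


def summarize(dataset_root: Path) -> dict:
    """Count images, videos, and media files that have no caption yet."""
    images = find_media(dataset_root / "images", IMAGE_EXTS)
    videos = find_media(dataset_root / "videos", VIDEO_EXTS)
    # Assumption: photo_001.png is captioned by photo_001.txt in the same folder.
    uncaptioned = [
        p for p in images + videos
        if not p.with_suffix(".txt").exists()
    ]
    return {
        "images": len(images),
        "videos": len(videos),
        "uncaptioned": [p.name for p in uncaptioned],
    }


if __name__ == "__main__":
    # Placeholder path; point this at your own dataset folder.
    print(summarize(Path("datasets/YourCharacter")))
```

Running it before and after captioning gives a quick view of how many files were picked up and which ones still lack a `.txt` file.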

::: callout {icon="⚠️" color="yellow_bg"}