Added 7/10/2021 - ✨ NEW workflow for large datasets

Large datasets that do not fit into memory can be trained with the --lazy flag, which loads images on the fly instead of all at once at the start of training. This can, however, be very slow due to the filesystem access pattern of on-the-fly image loading, especially if the data is not stored on an SSD.
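The trade-off between the two loading modes can be illustrated with a minimal sketch (not cryoDRGN's actual implementation): eager loading reads the whole stack into RAM up front, while lazy loading memory-maps the file and pulls each image from disk only when it is accessed.

```python
import os
import tempfile
import numpy as np

# Synthetic "particle stack" on disk: 100 images of 64x64 (illustrative only).
path = os.path.join(tempfile.mkdtemp(), "particles.npy")
np.save(path, np.random.rand(100, 64, 64).astype(np.float32))

# Eager: one big read; fast per-image access, memory footprint = full stack.
eager = np.load(path)

# Lazy: memory-map the file; each image is read from disk on access,
# which keeps memory usage minimal but is slow on non-SSD storage.
lazy = np.load(path, mmap_mode="r")

def get_image(stack, i):
    # For the memmap, converting to an array triggers the actual disk read.
    return np.asarray(stack[i])

# Both modes yield identical images; only when the bytes are read differs.
assert np.allclose(get_image(eager, 7), get_image(lazy, 7))
```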

To reduce the memory requirement of loading the whole dataset, cryoDRGN contains a new tool, cryodrgn preprocess, that performs up front some of the image preprocessing otherwise done at the beginning of training. Separating out image preprocessing significantly reduces the memory requirement of cryodrgn train_vae, potentially leading to major training speedups ⚡ ⚡ ⚡ .
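One of the preprocessing steps moved up front is downsampling, which for cryo-EM images is done in Fourier space. The sketch below shows the general idea of Fourier cropping (an illustration of the technique, not cryoDRGN's exact code): crop the centered 2D FFT of each image down to the target box size D, then invert.

```python
import numpy as np

def fourier_crop(img: np.ndarray, D: int) -> np.ndarray:
    """Downsample a square image to D x D by cropping its centered FFT.

    Illustrative sketch of Fourier-space downsampling; assumes a square
    input with even dimensions and even D < input size.
    """
    F = np.fft.fftshift(np.fft.fft2(img))   # DC component moved to the center
    c = img.shape[0] // 2
    half = D // 2
    Fc = F[c - half:c + half, c - half:c + half]  # keep low frequencies only
    # Inverse FFT of the cropped spectrum; rescale so mean intensity is preserved.
    return np.real(np.fft.ifft2(np.fft.ifftshift(Fc))) * (D / img.shape[0]) ** 2

img = np.random.rand(256, 256).astype(np.float32)
small = fourier_crop(img, 128)   # 256x256 -> 128x128, as in the workflow below
```

Because cropping discards only high-frequency components, the downsampled image retains the low-resolution signal used by the network.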

The new workflow replaces cryodrgn downsample with cryodrgn preprocess:

# Replace `cryodrgn downsample` with `cryodrgn preprocess`
cryodrgn preprocess P10_J712_particles_exported.cs \
		--datadir P10/exports/groups/P10_J628_particles/J626/extract \
		-D 128 \
		-o data/preprocessed/128/particles.mrcs

# Parse pose information as usual, specifying the refinement box size with -D
cryodrgn parse_pose_csparc P10_J712_particles_exported.cs \
		-D 256 \
		-o data/pose.pkl

# Parse CTF information as usual
cryodrgn parse_ctf_csparc P10_J712_particles_exported.cs -o data/ctf.pkl

# Run cryoDRGN with preprocessed particles.ft.txt and extra flag --preprocessed
cryodrgn train_vae data/preprocessed/128/particles.ft.txt \
		--preprocessed \
		--ctf data/ctf.pkl \
		--poses data/pose.pkl \
		--zdim 8 \
		-n 50 \
		-o 00_vae128 >> 00.log

Numbers

Some numbers for training on a dataset of 1,375,854 particles at 128x128 (86 GB):


On a single V100 GPU, this dataset trained in approximately 2 h 3 min per epoch (large 1024x3 model) when fully loaded into memory. Training with on-the-fly data loading (--lazy) was ~4x slower, though this can vary widely depending on your filesystem and network.

Technical details