Added 7/10/2021 - ✨ NEW workflow for large datasets

Large datasets that do not fit into memory can be trained with the --lazy flag, which loads images on the fly instead of all at once at the start of training. This can, however, be very slow due to the filesystem access pattern of on-the-fly image loading, especially if the data is not stored on an SSD.
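The trade-off between the two loading modes can be illustrated with a minimal sketch (not cryoDRGN's actual implementation): eager loading reads the whole stack into RAM up front, while lazy loading memory-maps the file and pulls each image from disk only when it is accessed.

```python
import os
import tempfile
import numpy as np

# Synthetic "particle stack" on disk: 100 images of 64x64 (illustrative only).
path = os.path.join(tempfile.mkdtemp(), "particles.npy")
np.save(path, np.random.rand(100, 64, 64).astype(np.float32))

# Eager: one big read; fast per-image access, memory footprint = full stack.
eager = np.load(path)

# Lazy: memory-map the file; each image is read from disk on access,
# which keeps memory usage minimal but is slow on non-SSD storage.
lazy = np.load(path, mmap_mode="r")

def get_image(stack, i):
    # For the memmap, converting to an array triggers the actual disk read.
    return np.asarray(stack[i])

# Both modes yield identical images; only when the bytes are read differs.
assert np.allclose(get_image(eager, 7), get_image(lazy, 7))
```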

To reduce the memory requirement of loading the whole dataset, cryoDRGN contains a new tool, cryodrgn preprocess, that performs up front some of the image preprocessing otherwise done at the beginning of training. Separating out image preprocessing significantly reduces the memory requirement of cryodrgn train_vae, potentially leading to major training speedups ⚡ ⚡ ⚡ .
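One of the preprocessing steps moved up front is downsampling, which for cryo-EM images is done in Fourier space. The sketch below shows the general idea of Fourier cropping (an illustration of the technique, not cryoDRGN's exact code): crop the centered 2D FFT of each image down to the target box size D, then invert.

```python
import numpy as np

def fourier_crop(img: np.ndarray, D: int) -> np.ndarray:
    """Downsample a square image to D x D by cropping its centered FFT.

    Illustrative sketch of Fourier-space downsampling; assumes a square
    input with even dimensions and even D < input size.
    """
    F = np.fft.fftshift(np.fft.fft2(img))   # DC component moved to the center
    c = img.shape[0] // 2
    half = D // 2
    Fc = F[c - half:c + half, c - half:c + half]  # keep low frequencies only
    # Inverse FFT of the cropped spectrum; rescale so mean intensity is preserved.
    return np.real(np.fft.ifft2(np.fft.ifftshift(Fc))) * (D / img.shape[0]) ** 2

img = np.random.rand(256, 256).astype(np.float32)
small = fourier_crop(img, 128)   # 256x256 -> 128x128, as in the workflow below
```

Because cropping discards only high-frequency components, the downsampled image retains the low-resolution signal used by the network.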

The new workflow replaces cryodrgn downsample with cryodrgn preprocess:

# Replace `cryodrgn downsample` with `cryodrgn preprocess`
cryodrgn preprocess P10_J712_particles_exported.cs \
		--datadir P10/exports/groups/P10_J628_particles/J626/extract \
		-D 128 \
		-o data/preprocessed/128/particles.mrcs

# Parse pose information as usual, specifying the refinement box size with -D
cryodrgn parse_pose_csparc P10_J712_particles_exported.cs \
		-D 256 \
		-o data/pose.pkl

# Parse CTF information as usual
cryodrgn parse_ctf_csparc P10_J712_particles_exported.cs -o data/ctf.pkl

# Run cryoDRGN with preprocessed particles.ft.txt and extra flag --preprocessed
cryodrgn train_vae data/preprocessed/128/particles.ft.txt \
		--preprocessed \
		--ctf data/ctf.pkl \
		--poses data/pose.pkl \
		--zdim 8 \
		-n 50 \
		-o 00_vae128 >> 00.log

Numbers

Some numbers for training on a dataset of 1,375,854 particles at 128x128 (86 GB):


On a single V100 GPU, this dataset trained in approximately 2 h 3 min per epoch (large 1024x3 model) when fully loaded into memory. Training with on-the-fly data loading (--lazy) was ~4x slower, though this can vary widely depending on your filesystem and network.

Technical details