CryoDRGN is a machine learning system for heterogeneous cryo-EM reconstruction. In cryoDRGN’s generative modeling framework, a trained model can reconstruct an arbitrary number of volumes, so tools are needed to comprehensively explore the reconstructed distribution. This page describes a new “landscape analysis” tool for comprehensive and quantitative analysis of a trained cryoDRGN model, including 1) assigning discrete conformational states (and providing their particle lists for refinement) and 2) visualizing continuous conformational landscapes. The tool also lets the user focus the analysis on specific regions of interest by providing custom masks.

Landscape analysis is implemented in the executables cryodrgn analyze_landscape and cryodrgn analyze_landscape_full, available in version 1.0+ of the cryoDRGN software. The analysis pipeline is fully automated, though many command line arguments can be adjusted, and we provide a Jupyter notebook for interactive visualization.

A description of the method is found in Chapter 6 of Ellen Zhong’s thesis.

Overview of the cryoDRGN landscape analysis pipeline. We show the general schematic (top) and its application to a dataset of the ClpXP protease from Fei et al. 2020 (bottom).

1. Quickstart

Example usage:

(cryodrgn) $ cryodrgn analyze_landscape [workdir] [epoch]

# for example:
(cryodrgn) $ cryodrgn analyze_landscape /path/to/cryodrgn/output/directory 24 # assuming 25 epochs of training (0-indexed)

# Use the flag -h to see all settings and their defaults:
(cryodrgn) $ cryodrgn analyze_landscape -h

By default, the script will:

  1. Generate 500 volumes at a box size of 128^3
  2. Perform PCA on the volumes to map conformational coordinates. The goal is for the volume PCA coordinates to provide a more visually interpretable representation of the dataset than the VAE latent space.
  3. Cluster the volumes and provide summary volumes and the constituent particles for each cluster.
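The PCA and clustering steps above can be sketched on toy data as follows. This is a minimal illustration only, not cryoDRGN's implementation: the function names `pca_coordinates` and `kmeans` are hypothetical, the synthetic 8^3 volumes stand in for the generated volumes, and plain k-means is swapped in as a simple stand-in for whatever clustering method the tool actually uses.

```python
import numpy as np

def pca_coordinates(volumes, n_components=2):
    """Project flattened volumes onto their top principal components (hypothetical helper)."""
    X = volumes.reshape(len(volumes), -1).astype(np.float64)
    X = X - X.mean(axis=0)  # center each voxel across the ensemble
    # Thin SVD: the rows of Vt are principal axes, and U * S gives the PC coordinates
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means, used here only as a stand-in clustering method."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign every point to its nearest center
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

# Toy ensemble: 10 noisy copies of each of two synthetic "states" (assumed data)
rng = np.random.default_rng(0)
state_a = np.zeros((8, 8, 8))
state_a[:4] = 1.0
state_b = np.zeros((8, 8, 8))
state_b[4:] = 1.0
volumes = np.stack(
    [s + 0.05 * rng.standard_normal((8, 8, 8)) for s in [state_a] * 10 + [state_b] * 10]
)

coords = pca_coordinates(volumes, n_components=2)  # one 2D coordinate per volume
labels = kmeans(coords, k=2)                       # one cluster label per volume
```

In this toy case the first principal component separates the two synthetic states, so the two clusters recover them. Real volume ensembles are much larger and may need more components and clusters.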

By default, all outputs will be located in a subdirectory [workdir]/landscape.[epoch].

The expected runtime is ~30 min on 1 GPU, most of which is spent on volume generation. Rerunning the tool without volume generation (--skip-vol) should take less than 5 min and does not require a GPU.

Outputs at a glance

2. Assigning 3D conformational states (“classes”)

Once the volumes (500 by default) are generated, they are clustered to summarize the major conformational states of the reconstructed ensemble. This clustering approach mirrors some of the assumptions of 3D classification (i.e. that each particle falls in 1 of K discrete classes). The resulting clusters can be interpreted as the main conformational states, and the tool provides the constituent particles of each cluster as a .star file that can be exported to other tools for further refinement.
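To illustrate how cluster labels on volumes can translate into particle lists, here is a minimal sketch. It assumes each particle has already been mapped to the index of one sampled volume (for example, its nearest one in latent space). That mapping, the function name `particles_per_cluster`, and the toy arrays are all assumptions for illustration; how the tool itself assigns particles and writes the .star file is not shown here.

```python
import numpy as np

def particles_per_cluster(particle_to_volume, volume_labels):
    """Group particle indices by the cluster of the volume each particle maps to.

    particle_to_volume: for each particle, the index of its associated volume (assumed input)
    volume_labels: for each volume, its cluster label
    """
    particle_to_volume = np.asarray(particle_to_volume)
    volume_labels = np.asarray(volume_labels)
    # Each particle inherits the cluster label of its associated volume
    particle_labels = volume_labels[particle_to_volume]
    return {
        int(c): np.flatnonzero(particle_labels == c)
        for c in np.unique(volume_labels)
    }

# Toy example: 4 volumes in 3 clusters, 6 particles
volume_labels = [0, 0, 1, 2]
particle_to_volume = [0, 2, 1, 3, 2, 0]
groups = particles_per_cluster(particle_to_volume, volume_labels)
for cluster, idx in groups.items():
    print(cluster, idx.tolist())
```

Each resulting index array could then be used to select the matching rows from the original particle stack or metadata before further refinement.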