Sourmash: alignment-free analysis of metagenomes

Before starting

The following protocol outlines the use of Sourmash for interpreting microbial whole genome sequence (WGS) or metagenomic data. Sourmash computes a fingerprint or ‘sketch’ from your data using minHash, and enables comparison of sketches from isolates (to understand strain relatedness, for example). Similarly, a sketch can be compared against a large databases of sketches, stored as a .zip database for rapid searching, from Genbank or the Genome Taxonomy Database (GTDB) in order to help assign identity to a sample. These databases of microbial genomes from all of Genbank are available directly on our server for convenience.

If you want to learn more about how Sourmash works, you could start with the original Sourmash paper, and a recent follow-up paper. Also check out C. Titus Brown’s blog posts on the topic here and here. Also take a look at Adam Phillippy’s original Mash paper.

Note: Sourmash is a kmer-based method, which means you have to choose a kmer size. In general smaller kmers (e.g. k=31) allow for higher sensitivity but may result in some false positives, while larger kmer sizes (e.g. k=51) will tend to give higher specificity at the expense of false negatives.

Collectively, the GenBank references on our server contain over 1.2 million microbial genomes (including viral and fungal) from NBCI’s Genbank, including:

1,148,011 bacterial genomes
47,952 viral genomes
8,750 archaeal genomes
10,286 fungal genomes
1193 protist genomes

The Genome Taxonomy Database (GTDB) files include data for 402,709 prokaryotic genomes organized into 85,205 species clusters. This is most comprehensive and well-curated collection of bacterial genomes, and therefore is often the database of choice when using Sourmash gather

Prepare sample ‘sketches’

Here, we’ll use sourmash’s sketch function to prepare a sketch of each fastq file in our directory. The --scaled option applies a 1000:1 compression ratio, which retains the ability to detect regions of similarity in the 10kb range.

# for a single sample of single end data
sourmash sketch dna \\
SAMPLE.fastq.gz \\
-p k=21,k=31,k=51,scaled=1000,abund \\
--merge SAMPLE \\
-o SAMPLE.sig.gz

# for a single sample of paired end data
sourmash sketch dna \\
SAMPLE_R1.fastq.gz SAMPLE_R2.fastq.gz \\
-p k=21,k=31,k=51,scaled=1000,abund \\
--merge SAMPLE \\
-o SAMPLE.sig.gz

# looping for single end data for multiple samples
for FASTQ in *.fastq.gz
do
		SAMPLE=${FASTQ//_001.fastq.gz}
    sourmash sketch dna $FASTQ -p k=21,k=31,k=51,scaled=1000,abund --merge $SAMPLE -o ${SAMPLE}.sig.gz
		echo "done sketching from $SAMPLE!"
done

# looping for paired end data for multiple samples
for READ1 in *R1_001.fastq.gz
do
		READ2=${READ1//1_001.fastq.gz/2_001.fastq.gz}
		SAMPLE=${READ1//_R1_001.fastq.gz}
    sourmash sketch dna $READ1 $READ2 -p k=21,k=31,k=51,scaled=1000,abund --merge $SAMPLE -o ${SAMPLE}.sig.gz
		echo "done sketching from $SAMPLE!"
done

Compare sketches to each other

The Sourmash compare command computes the jaccard index between two or more sketches generated above.

sourmash compare *.sig.gz -k 31 -o cmp

Visualize sample relatedness

Sourmash plot prepares a graphic showing the distance between samples