Make Patterns Pop Out of Heatmaps with Seriation

One of the easiest ways to start visualizing data is to turn a table into a heatmap: every cell gets a colour, and the higher the number, the brighter the colour. Unfortunately, this is often a fairly unrewarding exercise, yielding graphics that look like plaid or tartan fabric. Part of the problem is that the rows and columns of a dataset often have no natural ordering, such as time. They are instead shown in alphabetical order, or the dataset is sorted by one of its rows or columns, rather than in an order that makes patterns pop out visually. My goal in this article is to demonstrate this problem clearly and to show that it has neat solutions, drawn from a set of techniques collectively called seriation. I’ll do this by automatically reordering the rows and columns of the following noisy-looking heatmap to make the underlying pattern very clear.

It seems like there might be a pattern here but it’s definitely not obvious what it is

I know for a fact that there’s a pattern here because I put it there myself before independently shuffling the rows and columns to obscure it. Critically, this means that any cells that were in the same row in the original dataset are still in the same row, and likewise for columns.
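To make the shuffling step concrete, here is a minimal Python sketch. It is not the code used for the figures in this article: the function name `shuffle_rows_and_columns` and the small gradient table are made up for illustration.

```python
import random

def shuffle_rows_and_columns(table, seed=0):
    """Return a copy of `table` with rows and columns independently permuted."""
    rng = random.Random(seed)
    row_order = list(range(len(table)))
    col_order = list(range(len(table[0])))
    rng.shuffle(row_order)   # one permutation for the rows...
    rng.shuffle(col_order)   # ...and a separate one for the columns
    shuffled = [[table[r][c] for c in col_order] for r in row_order]
    return shuffled, row_order, col_order

# Hypothetical 4x5 table with an obvious gradient, standing in for real data.
original = [[r + c for c in range(5)] for r in range(4)]
shuffled, row_order, col_order = shuffle_rows_and_columns(original)

# Every shuffled row holds exactly the cells of one original row.
for new_r, old_r in enumerate(row_order):
    assert sorted(shuffled[new_r]) == sorted(original[old_r])
```

Because whole rows and whole columns move together, the pattern is hidden rather than destroyed, which is what makes it recoverable by reordering.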

The techniques I demonstrate below come out of a branch of research called seriation, which has a rich history: it is the study of ways to place a set of items in an order that reveals structural information about that set. I was not involved in writing any of the software demonstrated below, which comes from an R package appropriately called seriation, although parts of it could just as easily have been done in Python or JavaScript. The tiny amount of code I wrote for this article can be found on GitHub.

Clustering

If you look at the heatmap above, it certainly seems like there is a pattern, in that some columns look similar to each other, as do some rows, leading to the plaid effect. A natural reaction to seeing this is to want to group together similar-looking rows and columns. This type of reordering is often done by agglomerative clustering, which in the case of rows would start by placing each row in its own cluster, and then iteratively merge the two most similar clusters until only one remains. The result of such a clustering of items can be visualized as a dendrogram. Here is the result of agglomerative clustering on the heatmap above, with row and column dendrograms, as is frequently done in bioinformatics:

Agglomerative clustering

We get what we asked for: certainly it looks like things have been grouped together and the image feels less chaotic than the original, but the clear pattern I promised isn’t exactly jumping out at us.
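For readers who prefer code to prose, the merge loop described above can be sketched in a few lines of plain Python. This is only an illustration, not what the R package does internally. I assume Euclidean distance between rows and average linkage between clusters, and the names `agglomerate` and `leaves` are my own.

```python
from itertools import combinations
from math import dist

def leaves(tree):
    """Flatten a nested-tuple tree into its left-to-right list of leaves."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def agglomerate(rows):
    """Average-linkage agglomerative clustering of `rows`.

    Returns a binary tree of row indices, written as nested tuples.
    """
    def linkage(a, b):
        # Mean Euclidean distance over all pairs of rows, one from each cluster.
        pairs = [(i, j) for i in leaves(a) for j in leaves(b)]
        return sum(dist(rows[i], rows[j]) for i, j in pairs) / len(pairs)

    clusters = list(range(len(rows)))  # each row starts in its own cluster
    while len(clusters) > 1:
        # Find the two most similar clusters and merge them.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]

# Hypothetical data: rows 0 and 2 are close, and so are rows 1 and 3.
rows = [[0, 0], [10, 10], [1, 0], [10, 12]]
tree = agglomerate(rows)
print(tree, leaves(tree))
```

Reading the leaves of the tree from left to right gives one row order for the heatmap, but only one of many that the tree allows.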

Optimal Leaf Ordering

One of the problems with agglomerative clustering is that it doesn’t actually place the rows in a definite order; it merely constrains the space of possible orderings. Take three items A, B and C. If you ignore reflections, there are three possible orderings: ABC, ACB, BAC. If clustering them gives you ((A+B)+C) as a tree, you know that C can’t end up between A and B, but the tree doesn’t tell you which way to flip the A+B cluster: it doesn’t tell you whether the ABC ordering will lead to a clearer-looking heatmap than the BAC ordering. The clustered heatmap above was placed in the default order for the R heatmap.2 function, which orders by the mean value of the row/column, within the constraints of the tree.
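The three-item example is small enough to check by enumeration. The sketch below, with helper names of my own choosing, lists every ordering of A, B and C up to reflection, and then only those that the tree ((A+B)+C) permits.

```python
from itertools import permutations

def orderings(tree):
    """All leaf orderings reachable by flipping branches of a binary tree."""
    if not isinstance(tree, tuple):
        return [tree]
    left, right = tree
    result = []
    for l in orderings(left):
        for r in orderings(right):
            result += [l + r, r + l]  # keep the branch as is, or flip it
    return result

def drop_reflections(orders):
    """Keep one representative of each ordering/reversed-ordering pair."""
    return sorted({min(o, o[::-1]) for o in orders})

all_orders = drop_reflections("".join(p) for p in permutations("ABC"))
tree_orders = drop_reflections(orderings((("A", "B"), "C")))

print(all_orders)   # ['ABC', 'ACB', 'BAC']
print(tree_orders)  # ['ABC', 'BAC']
```

The tree rules out ACB, where C sits between A and B, but leaves the choice between ABC and BAC open.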

Here we meet our first seriation algorithm: Optimal Leaf Ordering (OLO). This algorithm starts with the output of an agglomerative clustering algorithm and produces a unique ordering, one that flips the various branches of the dendrogram around so as to minimize the sum of dissimilarities between adjacent leaves. Here is the result of applying Optimal Leaf Ordering to the same clustering result as the heatmap above:

Agglomerative clustering with Optimal Leaf Ordering

The effect is quite dramatic: much of the jaggedness of the original clustered heatmap is gone, and the perceptive reader is likely able to guess what the underlying pattern of the dataset is. It’s not exactly crystal clear yet, but it’s a big improvement over arbitrarily ordering clustered rows and columns, just as the latter was an improvement over arbitrarily ordered rows and columns.
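The objective that Optimal Leaf Ordering minimizes is easy to state in code. The sketch below is a brute-force stand-in rather than the real algorithm: it tries every combination of branch flips, which is only feasible for tiny trees, whereas practical implementations use dynamic programming. The function names, the toy data and the use of Euclidean distance are my assumptions.

```python
from math import dist

def flips(tree):
    """Yield every leaf ordering obtainable by flipping branches of `tree`."""
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for l in flips(left):
        for r in flips(right):
            yield l + r
            yield r + l

def adjacent_cost(order, rows):
    """Sum of Euclidean distances between neighbouring rows in `order`."""
    return sum(dist(rows[a], rows[b]) for a, b in zip(order, order[1:]))

def best_leaf_order(tree, rows):
    """Try all flips of the tree and keep the cheapest ordering."""
    return min(flips(tree), key=lambda order: adjacent_cost(order, rows))

# Hypothetical one-column "rows" and a clustering tree over their indices.
rows = [[0], [1], [5], [3]]
tree = ((0, 1), (2, 3))

best = best_leaf_order(tree, rows)
print(best, adjacent_cost(best, rows))
```

Here the left-to-right order 0, 1, 2, 3 costs 7, while flipping the second branch puts the rows in the order of their values 0, 1, 3, 5 and brings the cost down to 5.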

Travelling Salespeople

We started with clustering just because it seemed to make intuitive sense to group similar rows together, and then we ordered the tree branches to minimize the sum of dissimilarities between adjacent rows (and the same for columns, independently). But what if we didn’t care about clustering? We could just find the order of the rows that minimizes the sum of dissimilarities, unconstrained by the clustering tree. This is similar to the well-known Travelling Salesperson Problem (TSP), wherein one wants to find the shortest route that visits every city in a set and comes back to its starting point. In the case of seriation, though, we explicitly don’t care about coming back to the starting point: it doesn’t matter how dissimilar the first and last rows are in the heatmap. Thankfully, this problem can be reduced to a TSP by adding a dummy row at distance 0 from all the others, solving the TSP, and then cutting the resulting tour at the dummy row. Here is the result of seriation by independently applying a TSP solver to the rows and columns of our heatmap: