Sue Hyun Park, March 11

<aside> 🔗 We have reposted this blog on our Medium publication. Read this on Medium.

</aside>

Multidimensional, or high-dimensional, data carries in-depth information about complex systems. A striking example is single-cell RNA sequencing (scRNA-seq) data, which uses thousands of attributes to describe a single cell’s phenotype: the number of cell-type-specific gene expression patterns in any tissue or cell type ranges from 3,000 to 5,000.

To visualize and interpret such multidimensional data, a widely used approach is to first reduce the dimensionality of the data and then examine the projection in a lower-dimensional space. This conversion technique is called multidimensional projection (MDP).

Inter-cluster tasks have been regarded as the core tasks for using MDP. These tasks investigate meaningful inter-cluster structures of the dataset through projections, such as how cell clusters are located and related based on patterns of gene expression.

Identifying clusters with discrete cell types (Source: Yan Wu & Kun Zhang)


Seeking the relationship between clusters using an MDP technique called t-SNE


Unfortunately, reducing dimensionality inherently introduces distortions. In the projected space, originally nearby clusters can be pulled apart (called stretching), and originally distinct clusters can be merged together (called compression). These distortions make the structures seen in projections less trustworthy and can mislead users about the original data.

Distortions projecting E to F: compression (a) and stretching (b) (Source: Aupetit)
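To make compression concrete, here is a minimal Python sketch. It does not use any real MDP technique: the "projection" simply drops the last coordinate, and the two clusters are made-up toy data, both assumptions chosen only for illustration. The clusters are far apart in the original 3D space but land on top of each other in 2D.

```python
import math

def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dims))

def project_drop_last_axis(points):
    """Toy projection (an assumption for illustration): discard the last coordinate."""
    return [p[:-1] for p in points]

# Two hypothetical clusters in 3D that are separated only along the z axis.
cluster_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
cluster_b = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 1.0, 10.0)]

# Distance between cluster centroids before and after the projection.
dist_high = math.dist(centroid(cluster_a), centroid(cluster_b))
dist_low = math.dist(
    centroid(project_drop_last_axis(cluster_a)),
    centroid(project_drop_last_axis(cluster_b)),
)

print(f"centroid distance in original space: {dist_high:.2f}")  # 10.00
print(f"centroid distance after projection:  {dist_low:.2f}")   # 0.00
```

A viewer who sees only the 2D result would conclude that the two clusters are one group, which is exactly the kind of misreading that compression causes. Stretching is the opposite case, where a projection pulls apart points that were close in the original space.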


How much can we trust the clusters revealed by MDP? Which MDP technique should we choose?

Researchers using multidimensional data in their work must be aware of inter-cluster reliability, specifically how well the low-dimensional projection preserves the inter-cluster structures in the original high-dimensional space.

In a paper published in IEEE TVCG, we introduce two novel metrics that quantitatively measure inter-cluster reliability: Steadiness and Cohesiveness. Recalling the two types of distortions mentioned above, Steadiness evaluates compression while Cohesiveness evaluates stretching. As a complementary tool, we propose a reliability map that visually explains the inter-cluster reliability quantified by Steadiness and Cohesiveness within projections. Starting from our design considerations, we will explore how our metrics can precisely capture distortions and prevent users’ misinterpretations.


Design considerations

Measuring inter-cluster reliability is challenging. Through a survey of 26 papers concerning inter-cluster tasks, we first grouped inter-cluster tasks into three types. However, existing local metrics such as Trustworthiness and Continuity (T&C) cannot correctly quantify how well each of these tasks can be performed on a given projection.
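For readers unfamiliar with T&C, the sketch below shows why these metrics are called local. It follows the commonly cited rank-based definition of Trustworthiness by Venna and Kaski, not code from our paper. The function names and the toy 1D data are assumptions made for illustration. The score is built entirely from each point's k nearest neighbors and has no notion of clusters.

```python
import math

def ranks_from(points, i):
    """Rank every other point by distance from points[i]; the nearest gets rank 1."""
    order = sorted(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )
    return {j: rank for rank, j in enumerate(order, start=1)}

def trustworthiness(high, low, k):
    """Penalize points that are k-nearest neighbors in `low` but not in `high`.

    Assumes k < n / 2 so that the normalization keeps the score in [0, 1].
    """
    n = len(high)
    penalty = 0
    for i in range(n):
        rank_high = ranks_from(high, i)
        rank_low = ranks_from(low, i)
        for j, r_low in rank_low.items():
            # j intrudes into i's neighborhood only after projection.
            if r_low <= k and rank_high[j] > k:
                penalty += rank_high[j] - k
    return 1.0 - 2.0 / (n * k * (2 * n - 3 * k - 1)) * penalty

def continuity(high, low, k):
    """Same computation with the roles of the two spaces swapped."""
    return trustworthiness(low, high, k)

# Hypothetical toy data: six points on a line, and a projection that scrambles them.
high = [(0.0,), (1.0,), (2.1,), (3.3,), (4.6,), (6.0,)]
low = [(0.0,), (5.0,), (1.0,), (4.0,), (2.0,), (3.0,)]

print(trustworthiness(high, high, k=2))  # 1.0 for an identity projection
print(trustworthiness(high, low, k=2))   # below 1.0 for the scrambled projection
print(continuity(high, low, k=2))
```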

Our metrics should adequately quantify how accurately each inter-cluster task can be performed, and thus be able to measure inter-cluster reliability precisely. We formulated the following three design considerations, which outline the capacity of our metrics: