Sue Hyun Park, March 11


Multidimensional, or high-dimensional, data carries in-depth information about complex systems. A striking example is single-cell RNA sequencing (scRNA-seq) data, which uses thousands of attributes to describe a single cell’s phenotype: the number of cell-type-specific gene expression patterns in any tissue or cell type ranges from 3,000 to 5,000.

A widely used way to visualize and interpret such multidimensional data is to first reduce its dimensionality and then examine the resulting projection in a lower-dimensional space. This conversion technique is called multidimensional projection (MDP).
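To make "reducing the dimensionality" concrete, here is a minimal sketch of one of the simplest projection techniques, PCA, computed with power iteration in plain Python. PCA is used here only as an illustrative stand-in for MDP techniques in general; it is not the technique discussed in this post, and the function name `pca_project` and the sample values are assumptions made for this example.

```python
import math

def pca_project(points, dims=2, iters=100):
    """Project points onto their top `dims` principal components.

    Illustrative sketch only: uses power iteration with deflation,
    which is fine for tiny inputs but not tuned for real datasets.
    """
    n, d = len(points), len(points[0])
    mean = [sum(p[a] for p in points) / n for a in range(d)]
    centered = [[p[a] - mean[a] for a in range(d)] for p in points]
    # Covariance matrix of the centered data (d x d).
    cov = [[sum(row[a] * row[b] for row in centered) / n for b in range(d)]
           for a in range(d)]
    axes = []
    for _component in range(dims):
        v = [1.0 / math.sqrt(d)] * d  # arbitrary unit start vector
        for _step in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:  # no variance left to explain
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the found axis.
        eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                         for a in range(d))
        axes.append(v)
        # Deflate: remove the found axis so the next pass finds the next one.
        cov = [[cov[a][b] - eigenvalue * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return [[sum(row[a] * axis[a] for a in range(d)) for axis in axes]
            for row in centered]

# Hypothetical 3-attribute measurements for five items.
high_dim = [(5.0, 1.0, 0.2), (4.5, 1.2, 0.1), (0.5, 3.0, 0.3),
            (0.7, 3.3, 0.2), (2.5, 2.0, 0.9)]
for point in pca_project(high_dim):
    print([round(x, 3) for x in point])
```

Each 3-attribute item becomes a 2D point that can be drawn on a scatterplot. Whatever variation does not fit into the two retained axes is discarded, which is where distortions come from.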

Inter-cluster tasks have been regarded as the core tasks for using MDP. These tasks investigate meaningful inter-cluster structures of the dataset through projections, such as how cell clusters are located and related based on patterns of gene expression.

Identifying clusters with discrete cell types (Source: Yan Wu & Kun Zhang)

Seeking the relationship between clusters using an MDP technique called t-SNE

Unfortunately, distortions inherently occur when reducing dimensionality. In the projected space, clusters that were originally close can be pulled apart (stretching), and clusters that were originally distinct can be squeezed together (compression). These distortions make the structures seen in projections less trustworthy and can mislead users about the original data.

Distortions projecting E to F: compression (a) and stretching (b) (Source: Aupetit)
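The two distortion types can be illustrated with a toy sketch that compares distances between cluster centroids before and after a projection. This is not Steadiness or Cohesiveness: the function `classify_distortions`, the centroid coordinates, the tolerance `tol`, and the choice to normalize by the mean pairwise distance are all assumptions made for this illustration.

```python
import math
from itertools import combinations

def classify_distortions(orig_centroids, proj_centroids, tol=0.1):
    """Label each cluster pair by how its relative distance changed.

    Distances are divided by the mean pairwise distance in each space,
    because a projection's overall scale is arbitrary. Assumes the
    centroids do not all coincide in either space.
    """
    pairs = list(combinations(sorted(orig_centroids), 2))

    def normalized(centroids):
        dists = {pair: math.dist(centroids[pair[0]], centroids[pair[1]])
                 for pair in pairs}
        mean = sum(dists.values()) / len(dists)
        return {pair: dist / mean for pair, dist in dists.items()}

    orig = normalized(orig_centroids)
    proj = normalized(proj_centroids)
    labels = {}
    for pair in pairs:
        change = proj[pair] - orig[pair]
        if change > tol:
            labels[pair] = "stretching"    # relatively farther apart
        elif change < -tol:
            labels[pair] = "compression"   # relatively closer together
        else:
            labels[pair] = "preserved"
    return labels

# Hypothetical cluster centroids in 3D; the "projection" drops the z axis.
original = {"A": (0, 0, 0), "B": (0, 0, 4), "C": (3, 0, 0)}
projected = {name: xyz[:2] for name, xyz in original.items()}

for pair, label in classify_distortions(original, projected).items():
    print(pair, label)
```

Clusters A and B differ only along the dropped axis, so they collapse onto each other in the projection. Because the layout is judged in relative terms, the remaining pairs look farther apart than they were.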

How much can we trust the clusters revealed by MDP? Which MDP technique should we choose?

Researchers using multidimensional data in their work must be aware of inter-cluster reliability, specifically how well the low-dimensional projection preserves the inter-cluster structures in the original high-dimensional space.

In a paper published in IEEE TVCG, we introduce two novel metrics that quantitatively measure inter-cluster reliability: Steadiness and Cohesiveness. Recalling the two types of distortions mentioned above, Steadiness evaluates compression while Cohesiveness evaluates stretching. A complementary tool we propose is a reliability map that visually explains inter-cluster reliability quantified by Steadiness and Cohesiveness within projections. Starting off from our design considerations, we will explore how our metrics can precisely capture distortions and prevent users’ misinterpretations.

Design considerations

Measuring inter-cluster reliability is challenging. Through a survey of 26 papers concerning inter-cluster tasks, we first narrowed inter-cluster tasks down to three types. Existing local metrics such as Trustworthiness and Continuity (T&C), however, cannot correctly quantify how well each of these tasks can be performed on a projection.
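To see what "local" means here, the sketch below implements Trustworthiness and Continuity using the commonly used rank-based definition (the same formula that scikit-learn exposes as `sklearn.manifold.trustworthiness`). It is a plain-Python illustration written for this post's readers, not code from the paper; the helper name `neighbor_order` is an assumption.

```python
import math

def neighbor_order(points, i):
    """Indices of all other points, nearest to points[i] first."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))

def trustworthiness(orig, proj, k):
    """Penalize points that are k-nearest neighbors in the projection
    but not in the original space (false neighbors)."""
    n = len(orig)
    if not 0 < k < n / 2:
        raise ValueError("k must satisfy 0 < k < n / 2")
    penalty = 0
    for i in range(n):
        # Rank of every other point in the original space (1 = nearest).
        orig_rank = {j: r for r, j in
                     enumerate(neighbor_order(orig, i), start=1)}
        for j in neighbor_order(proj, i)[:k]:
            # Only neighbors ranked beyond k originally add a penalty.
            penalty += max(0, orig_rank[j] - k)
    return 1 - 2 * penalty / (n * k * (2 * n - 3 * k - 1))

def continuity(orig, proj, k):
    """Mirror image: penalize original neighbors lost in the projection."""
    return trustworthiness(proj, orig, k)

# Six points on a line, and a projection that shuffles their order.
line = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
shuffled = [(0.0,), (5.0,), (1.0,), (4.0,), (2.0,), (3.0,)]
print(trustworthiness(line, line, k=2))
print(trustworthiness(line, shuffled, k=2))
print(continuity(line, shuffled, k=2))
```

Both scores are built entirely from each point's k nearest neighbors. Nothing in them refers to clusters, their shapes, or the distances between them.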

Our metrics should adequately quantify how accurately each inter-cluster task can be performed, and thus measure inter-cluster reliability precisely. We formulated the following three design considerations, which outline what our metrics must be able to do:

(C1) Capture the inter-cluster structure in detail in order to precisely identify clusters or seek relationships between them. The inter-cluster structure in MDP is complex, intertwined, and often has no ground truth. Each cluster’s characteristics, like shape, density, or size, vary widely as well.
(C2) Consider stretching and compression individually so as to accurately estimate clusters’ features and their similarities. The clusters’ size, density, or the distances between them can be overestimated due to stretching, or underestimated due to compression.
(C3) Measure how accurately the clusters identified in the projection reflect their original density and size, to quantify misconceptions when comparing clusters between spaces. The reason is that projected clusters’ size and density may not reflect those in the original space.