A new algorithm for hybrid hierarchical clustering with visualization and the bootstrap

https://doi.org/10.1016/S0378-3758(02)00388-9

Abstract

We propose a hybrid clustering method, hierarchical ordered partitioning and collapsing hybrid (HOPACH), which builds a hierarchical tree of clusters. The methodology combines the strengths of both partitioning and agglomerative clustering methods. At each node, a cluster is partitioned into two or more smaller clusters with an enforced ordering of the clusters. Collapsing steps, which unite the two closest clusters into one cluster, can be used to correct for errors made in the partitioning steps. We implement a version of HOPACH that optimizes a measure of clustering strength, such as average silhouette, at each partitioning and collapsing step. An important benefit of a hierarchical tree is that one can look at clusters at increasing levels of detail. We propose to visualize the clusters at any level of the tree by plotting the distance matrix corresponding to an ordering of the clusters and an ordering of elements within the clusters. A final ordered list of elements is obtained by running down the tree completely. The bootstrap can be used to establish the reproducibility of the clusters and the overall variability of the procedure. The power of the methodology relative to current algorithms is illustrated with simulated and publicly available cancer gene expression data sets.

Introduction

Exploratory data methods, such as cluster analysis, are being widely applied to the high-dimensional data structures produced by today's researchers. Advances in technology have vastly altered the type and amount of data collected in fields such as molecular biology, meteorology, medicine, astronomy, and marketing. As the means for collecting and storing ever larger amounts of data develop, it is essential to have good methods for identifying patterns, selecting variables of interest and building models in high-dimensional settings. For example, our methodological research has focused on gene expression data analysis, where the number of genes far exceeds the number of samples. In this paper, we present a new hierarchical clustering algorithm called hierarchical ordered partitioning and collapsing hybrid (HOPACH) that has been developed to address some of the shortcomings of currently available methods for clustering gene expression data. Since similar issues are likely to arise in other high-dimensional data settings, the HOPACH algorithm will be applicable for analysis of databases in many fields.

Clustering is a type of unsupervised learning in which a set of elements is separated into homogeneous groups. Suppose we are interested in clustering p elements xj, j∈{1,…,p}, and that each element xj is an n-dimensional vector (x1j,…,xnj)^T. Let d(xj,xj′) denote the dissimilarity between elements j and j′, and let D be the p×p symmetric matrix of dissimilarities. Typical choices of dissimilarity include Euclidean distance, 1 minus correlation, 1 minus absolute correlation, and 1 minus cosine-angle. Clustering routines (either implicitly or explicitly) map a distance matrix D into p cluster labels. Graph theoretic, model-based and non-parametric approaches have been implemented.
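To make the setup concrete, the following Python sketch (our illustration, not code from the paper) computes the p×p dissimilarity matrix D for two of the dissimilarities above, with elements stored as the columns of an n×p matrix:

```python
import numpy as np

def dissimilarity_matrix(X, metric="cor"):
    """X has shape (n, p): column j is the n-dimensional element x_j.
    Returns the p x p symmetric dissimilarity matrix D."""
    if metric == "euclid":
        diff = X[:, :, None] - X[:, None, :]   # (n, p, p) pairwise differences
        return np.sqrt((diff ** 2).sum(axis=0))
    if metric == "cor":
        return 1.0 - np.corrcoef(X.T)          # corrcoef treats rows as variables
    raise ValueError(f"unknown metric: {metric}")

# toy check: D is symmetric with zero diagonal
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                    # n = 5 observations, p = 3 elements
D = dissimilarity_matrix(X)
assert np.allclose(D, D.T) and np.allclose(np.diag(D), 0.0)
```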

The non-parametric clustering methods can be separated into partitioning and hierarchical algorithms. Partitioning methods, such as self-organizing maps (SOM) (Törönen et al., 1999), partitioning around medoids (PAM) (Kaufman and Rousseeuw, 1990), and K-MEANS, identify a user-specified number of clusters and assign each element to a cluster. Hierarchical methods construct a tree of clusters in which the root is a single cluster containing all the elements and the leaves each contain only one element. Hierarchical methods are used when it is of interest to look at clusters at a range of levels of detail, including the final level of the tree, which is an ordered list of the elements. Such a list, in which nearby elements are similar, is often more helpful to the subject-matter scientist than a collection of large, unordered groups. A hierarchical tree can be divisive (i.e. built from the top down by recursively partitioning the elements) or agglomerative (i.e. built from the bottom up by recursively combining the elements). DIANA (Kaufman and Rousseeuw, 1990) is an example of a divisive hierarchical algorithm, while AGNES (Kaufman and Rousseeuw, 1990) and Cluster (Eisen et al., 1998) are examples of agglomerative hierarchical algorithms. Agglomerative methods can be employed with different types of linkage. In average linkage methods, the distance between two clusters is the average of the dissimilarities between the points in one cluster and the points in the other cluster. In single linkage methods (nearest-neighbor methods), the dissimilarity between two clusters is the smallest dissimilarity between a point in the first cluster and a point in the second cluster.
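As a concrete illustration of the two linkage rules (our sketch, not code from the paper; representing clusters as index lists into D is an assumption of this example):

```python
import numpy as np

def cluster_linkage(D, idx_a, idx_b, linkage="average"):
    """Dissimilarity between two clusters, given the element-level
    dissimilarity matrix D and index lists idx_a, idx_b."""
    block = D[np.ix_(idx_a, idx_b)]      # every pair with one point per cluster
    if linkage == "average":
        return block.mean()              # average linkage
    if linkage == "single":
        return block.min()               # single (nearest-neighbor) linkage
    raise ValueError(f"unknown linkage: {linkage}")
```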

Most clustering routines, including the methods proposed in this paper, fit in the general statistical framework for gene clustering presented in van der Laan and Bryan (2001). In this formal framework, the underlying target subset (with cluster labels) is defined as a deterministic function S(μ,Σ) of the population mean and covariance matrix (or any other smooth function of the true data-generating distribution). The bootstrap is used to estimate the distribution of the estimated target subset S(μn,Σn), obtained by substituting the empirical mean and covariance. In particular, the authors propose to run the bootstrap with fixed cluster centers and to report for each element (e.g. each gene) the cluster-specific proportion of times it falls in that cluster. A corresponding cluster probability plot allows one to inspect the cluster reproducibility visually and numerically. Cluster probabilities can also be used to obtain a sensible ordering of elements within clusters so that weakly clustered elements can be filtered out. The authors also prove consistency of the subset estimates and asymptotic validity of the bootstrap under the assumption that the sample size converges faster to infinity than the logarithm of the number of elements. In this paper, we will illustrate how the bootstrap methodology of van der Laan and Bryan can be applied with HOPACH.
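The following is a minimal sketch of the fixed-centers bootstrap idea, assuming a (p, n) genes-by-samples matrix and Euclidean reassignment to the original centers; the function name and the reassignment rule are our simplifications, not the authors' implementation:

```python
import numpy as np

def bootstrap_cluster_proportions(X, centers, B=1000, seed=0):
    """X: (p, n) genes-by-samples matrix; centers: (k, n) fixed cluster
    centers from the original clustering. Resamples the n samples with
    replacement, reassigns every gene to its closest fixed center, and
    returns a (p, k) matrix of cluster-membership proportions."""
    p, n = X.shape
    k = centers.shape[0]
    counts = np.zeros((p, k))
    rng = np.random.default_rng(seed)
    for _ in range(B):
        cols = rng.integers(0, n, size=n)   # bootstrap the samples
        # squared distances from each gene profile to each center,
        # both evaluated on the resampled columns
        dist = ((X[:, None, cols] - centers[None, :, cols]) ** 2).sum(axis=2)
        counts[np.arange(p), dist.argmin(axis=1)] += 1
    return counts / B
```

Each row of the returned matrix corresponds to one gene's cluster probabilities; genes whose largest proportion is small are weakly clustered and are candidates for filtering.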

New technologies are allowing researchers to monitor the expression of thousands of genes simultaneously. By comparing gene expression profiles across cells that are at different stages in some process, in distinct pathological states, or under different experimental conditions, we gain insight into the roles and interactions of various genes. For example, one can compare healthy cells to cancerous cells within subjects in order to learn which genes tend to be over- (or under-) expressed in the diseased cells; regulation of such differentially expressed genes could produce effective cancer treatment or prophylaxis (DeRisi et al., 1996). Groups of differentially expressed genes which are significantly correlated with each other are particularly interesting, since such genes might be part of the same causal mechanism. In addition to identifying interesting clusters of genes, researchers often want to find subgroups of samples which share a common gene expression profile. Ross et al. (2000), for example, use microarray data to classify 60 human cancer cell lines.

A typical gene expression experiment results in an observed data matrix X whose n columns are independent copies of a p-dimensional vector of gene expression measurements. Consider, for example, a population of cancer patients from which we take a random sample of n patients, each of whom contributes p gene expression measurements. For microarrays, each measurement is a ratio, calculated from the intensities of two fluorescently labeled mRNA samples (e.g. tumor and healthy tissues) cohybridized to arrays spotted with known cDNA sequences. Gene chips produce similar data, except each element is a quantitative expression level rather than a ratio. The methods presented in this paper can be applied to both types of data, but we will focus on microarrays. Data preprocessing may include background subtraction, combining data from replicated spots representing the same cDNA sequence, normalization, log transformation, and truncation. A typical next step is to select an interesting subset of the genes (e.g. based on a cut-off value for mean expression). Throughout this paper, we assume that preprocessing and subsetting have already been completed.
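As an illustration of such a preprocessing pipeline (a minimal sketch; the threshold values here are placeholders, not values from the paper):

```python
import numpy as np

def preprocess(ratios, floor=0.01, min_abs_log2=1.0):
    """ratios: (p, n) matrix of background-subtracted intensity ratios.
    Truncate small values, log-transform, and keep genes whose mean
    absolute log-ratio exceeds a cut-off."""
    logged = np.log2(np.maximum(ratios, floor))          # truncate, then log
    keep = np.abs(logged).mean(axis=1) >= min_abs_log2   # subset interesting genes
    return logged[keep], keep
```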

Approaches to gene expression data analysis rely heavily on results from cluster analysis. Hierarchical clustering of genes has been popularized by Eisen et al., who apply an agglomerative hierarchical clustering algorithm (Cluster) to the empirical correlation matrix. Kaufman and Rousseeuw also describe and implement hierarchical clustering algorithms (e.g. DIANA, AGNES) in FORTRAN and Splus which can be used to cluster gene expression data. Two serious drawbacks of these methods are that the ordering of the genes has a random component (and is therefore non-unique) and that the splits at each node are forced to be binary, even though there is no biological reason for this choice. These two factors often produce results in which clearly similar elements are not ordered near each other. Furthermore, once one of these algorithms makes an error, it is impossible to correct it further down (or up) the tree. These issues motivated the development of HOPACH.

Researchers are not only interested in clustering genes, but also in clustering or classifying samples based on similarity of gene expression patterns. Thus, while gene clustering results are of biological interest themselves, they serve an additional purpose if we can use them to aid in the task of clustering samples. The simulation and data analysis illustrate, in particular, that a sensible strategy for clustering the samples is to first cluster genes and then cluster samples for each cluster of genes separately (possibly iterating this process). These results can again be visualized in a reordered data matrix, where the ordering of samples is produced by applying HOPACH to the transposed data matrix. This approach can reveal underlying structure in the data which may not be apparent when the samples are clustered using all genes. For a more formal treatment of this subject we refer the reader to Pollard and van der Laan (2002a), where we extend the statistical framework of van der Laan and Bryan (2001) to include clustering of both genes and samples by defining a simultaneous clustering parameter which is a composition of a mapping for clustering genes and a mapping for clustering samples.

In addition to illustrating how hierarchical clustering can be applied to gene expression data, Eisen et al. provide a visual plot of the ordered p by n data matrix X. They apply their Cluster algorithm separately to both genes and samples to reorder the rows and columns of the data matrix for the plot. Each log ratio is represented by a spot on the red/green color scale. This plot allows one, to a certain extent, to visually inspect the strength of clusters. We feel that the visualization of clusters is an important contribution, but that plotting the data matrix might not show clustering patterns with respect to certain distances. For example, two vectors can be highly correlated, but have very different mean expression levels. Such vectors will be close using 1 minus correlation as the distance metric (details below), but their similarity will be difficult to “eyeball” in the data matrix. We propose that one should, in particular, visualize the corresponding distance matrix. Visualizing ordered distance matrices was proposed by Sokal (1966) and has been used by others since (e.g. Chen, 2002). We emphasize the usefulness of this method in high dimensional settings, such as gene expression data. Ordered distance matrix plots (and to a certain degree ordered data matrix plots) corresponding to different levels of a hierarchical tree help to determine the main clustering structures in the data set and provide information about the clusters, such as their strength and their similarity to each other. In this paper, we use a red–white color scale (rather than red–green) for improved visibility.
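A minimal matplotlib sketch of the proposed plot, assuming a dissimilarity matrix D and an element ordering order produced by the clustering; the reversed red colormap approximates the red–white scale described above:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_ordered_distance_matrix(D, order, title="Ordered distance matrix"):
    """Plot D with rows and columns permuted by `order`; with the reversed
    Reds colormap, small distances appear dark red and large distances
    fade to white, so tight clusters show as red blocks on the diagonal."""
    D_ord = D[np.ix_(order, order)]
    plt.imshow(D_ord, cmap="Reds_r", interpolation="nearest")
    plt.colorbar(label="dissimilarity")
    plt.title(title)
    plt.show()
```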

In this paper, we aim to build on the strengths of currently employed methods and to address some of their drawbacks. In Section 2, we describe our clustering method, HOPACH, which creates a hierarchical tree of clusters with an enforced ordering of the clusters. Three important components of this method that are improvements over other hierarchical methods are (1) not restricting splits to be binary (forcing binary splits degrades performance in the simulation and data analysis), (2) ordering the clusters and elements within clusters deterministically, and (3) combining recursive partitioning with a collapsing step that allows erroneously separated groups of elements to be reunited. In addition, the methodology includes the following (a sketch of one partition-and-collapse level appears after this list):

  • running down the tree to obtain an ordered list of genes,

  • plotting the ordered distance matrix at any level of the tree,

  • identifying the main clusters, and

  • running the bootstrap to establish the reproducibility of the clusters.
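The sketch below is our illustration of the partition-then-collapse idea at a single level of the tree; partitioner and criterion are caller-supplied stand-ins for, e.g., PAM and average silhouette, and the full algorithm in Section 2 also specifies the deterministic ordering of clusters, omitted here:

```python
import numpy as np

def average_linkage(D, a, b):
    """Average dissimilarity between clusters a and b (index lists)."""
    return D[np.ix_(a, b)].mean()

def closest_pair(D, clusters):
    """The pair of cluster positions with smallest average-linkage distance."""
    k = len(clusters)
    return min(((i, j) for i in range(k) for j in range(i + 1, k)),
               key=lambda ij: average_linkage(D, clusters[ij[0]], clusters[ij[1]]))

def hopach_level(D, elements, partitioner, criterion, k_max=9):
    """One HOPACH-style level: split `elements` into the k in 2..k_max that
    maximizes `criterion` (e.g. average silhouette), then greedily collapse
    the two closest children while that improves the criterion.
    partitioner(D, elements, k) must return a list of index lists."""
    candidates = (partitioner(D, elements, k) for k in range(2, k_max + 1))
    best = max(candidates, key=lambda cl: criterion(D, cl))
    while len(best) > 1:
        i, j = closest_pair(D, best)
        merged = [c for m, c in enumerate(best) if m not in (i, j)]
        merged.append(best[i] + best[j])         # collapse the closest pair
        if criterion(D, merged) <= criterion(D, best):
            break                                # collapsing no longer helps
        best = merged
    return best
```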

Section 3 discusses methods for statistical inference based on applying the bootstrap to HOPACH clustering results. In Section 4, we apply HOPACH to simulated data in order to illustrate its performance relative to other clustering methods. By defining true clustering parameters, we are able to compare the parameters defined by different algorithms. We show that HOPACH is a consistent estimator of cluster labels and is more efficient than existing algorithms for realistically small sample sizes. In Section 5, we analyze a publicly available cancer data set consisting of gene expression from a variety of cell lines derived from different types of tumors. The simulations and data analysis demonstrate, in particular, that HOPACH is better able to identify clusters and to produce a sensible ordering of the elements being clustered than a number of currently employed methods.

Section snippets

Hierarchical ordered partitioning and collapsing hybrid

We present a new clustering method, HOPACH, which applies partitioning and collapsing steps iteratively to create a hierarchical tree whose final level is an ordered list of the elements. By combining the strengths of two celebrated approaches to clustering, we create a more flexible algorithm for finding patterns in data.

It is important to note that our methodology is a general approach which could be applied with any choice of partitioning algorithm. Commonly employed partitioning algorithms

The bootstrap

Let μn,Σn be the p×1 empirical mean and p×p covariance matrix of the data X. Since (for the previously mentioned dissimilarities) the distance matrix is a function of μn,Σn, we can view each particular HOPACH–PAM algorithm, as defined by the various choices such as the level l of the tree, as a particular functional (μ,Σ)→S(μ,Σ) applied to (μn,Σn). Therefore we can consider the HOPACH–PAM clustering result S(μn,Σn) as an estimator of a true clustering parameter S(μ,Σ). The main cluster labels

Simulations

In order to evaluate the clustering performance of HOPACH–PAM relative to existing algorithms, we have conducted two sets of investigations. In the first set, we look at the final ordering produced by HOPACH–PAM compared to two other hierarchical methods, DIANA (divisive) and AGNES (agglomerative, single linkage). We run the HOPACH–PAM tree to the final level without collapsing. In the second set, we investigate the ability to identify the main clusters. We compare HOPACH–PAM to two

Data analysis

We have extracted a publicly available data set from the data base accompanying Ross et al. (2000). The authors performed microarray experiments on 60 human cancer cell lines (the NCI60) derived from tumors from a variety of tissues and organs by researchers from the National Cancer Institute's Developmental Therapeutics Program. The data set includes gene expression measurements for 9996 cDNAs representing approximately 8000 unique transcripts. Each tumor sample was cohybridized with a

Discussion

HOPACH algorithms combine partitioning and collapsing steps to produce an ordered hierarchical tree and corresponding list of elements. This family of proposed algorithms has some interesting properties.

Employed with collapsing, HOPACH–PAM is a “hybrid” clustering method which uses both partitioning and agglomerative steps. Collapsing serves two different purposes. First, collapsing can correct the number of clusters k in a split, to allow k=1 (no split) or to correct for the number of clusters

References (10)

  • P. Törönen et al., FEBS Lett. (1999)

  • C.H. Chen, Statist. Sinica (2002)

  • J. DeRisi et al., Nat. Genet. (1996)

  • M. Eisen et al., Proc. Natl. Acad. Sci. (1998)

  • L. Kaufman and P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis (1990)


This research has been supported by a grant from the Life Sciences Informatics Program with industrial partner biotech company Chiron Corporation.
