Motivation: A cluster analysis is the most commonly performed procedure (often regarded as a first step) on a set of gene expression profiles. Numerous attempts have been made in the literature in order to validate the results of an existing or a novel clustering algorithm often within the context of microarray data analysis. A closely related problem is that of selecting a clustering algorithm that is optimal in some way from a rather impressive list of clustering algorithms that currently exist. Most of these approaches rely on the quality of the clusters produced in terms of their physical characteristics and statistical properties such as stability or consistency. While this may be reasonable for certain applications such as machine learning, the underlying biology must be taken into consideration for applications to microarray experiments. On the other hand, standard clustering algorithms such as hierarchical clustering (UPGMA) or K- Means have been reported to group functionally similar genes in a number of microarray studies, often, on the basis of a handful of genes with known functions. There have been attempts in recent years in interpreting the resulting clusters obtained using a statistical clustering method in terms of their biological functions. However the problem of validation and selection of a clustering algorithm on the basis of the functional information has received relatively little attention.

Results: In this paper, we propose two performance measures each with two parts: one measuring the statistical consistency (stability) of the clusters produced and the other representing their biological functional consistency, so that a good clustering algorithm should have a small value for these measures. These measures are easy to interpret and implement. As a byproduct, once we select a clustering algorithm we can obtain a prediction for the most likely biological functional class for a gene. We also lay out a Monte Carlo algorithm to compute a statistical p-value for a given clustering algorithm representing the probability of obtaining a performance measure as small as the one for the given algorithm just by random cluster assignments.

We illustrate our methods using two sets of expression profiles obtained from a breast cancer data set. The first set of profiles consist of expressions of significantly differentially expressed genes between normal and ductal carcinoma in situ (DCIS). The second set consisted of significantly differentially expressed genes between DCIS and invasive ductal carcinoma (IDC) which represents a more advanced stage of breast cancer. Six well known clustering algorithms UPGMA, K-Means, Diana, Fanny, Model-Based and SOM were evaluated using these two performance measures for the two sets of expression profiles. Whereas the exact ordering depends on the particular data set (expression profiles) used and the performance measure employed, overall UPGMA appears to be the optimal for this cancer data set that we considered. All of the above clustering algorithms were judged to be significantly better (p-value < 5%) than random cluster assignments.

TR Number: 
2004-20
Susmita Datta and Somnath Datta

To request a copy of this report, please email us. We will send you a pdf copy if one is available.