PhD Candidate, Statistics
This dissertation consists of two parts on the topic of sample integrity in high-dimensional data. The first part focuses on batch effects in gene expression data. Batch bias has been found in many microarray studies that involve multiple batches of samples. Currently available methods for batch effect removal are mainly based on gene-by-gene analysis. There has been relatively little development of multivariate approaches to batch adjustment, mainly because of the analytical difficulty that originates from the high-dimensional nature of gene expression data. We propose a multivariate batch adjustment method that effectively eliminates inter-gene batch effects. The proposed method utilizes high-dimensional sparse covariance estimation based on a factor model and a hard-thresholding technique. We study theoretical properties of the proposed estimator. Another important aspect of the proposed method is that, if an ideally obtained batch exists, the other batches can be adjusted so that they resemble this target batch. We demonstrate the effectiveness of the proposed method with real data as well as a simulation study. Our method is compared with other approaches in terms of both homogeneity of adjusted batches and cross-batch prediction performance. The second part deals with outlier identification for high dimension, low sample size (HDLSS) data. The outlier detection problem has hardly been addressed in spite of the enormous popularity of high-dimensional data analysis. We introduce three types of distances to measure the "outlyingness" of each observation relative to the other data points: centroid distance, ridge Mahalanobis distance, and maximal data piling distance. Some asymptotic properties of these distances are studied in relation to the outlier detection problem. Based on these distance measures, we propose an outlier detection method utilizing the parametric bootstrap. The proposed method can also be regarded as an HDLSS version of the quantile-quantile plot.
Furthermore, the masking phenomenon, which may be caused by multiple outliers, is discussed in the HDLSS setting.
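
As a rough illustration of two of the distances named above, the following sketch computes a leave-one-out centroid distance and a ridge Mahalanobis distance for each observation. The abstract does not give the formulas, so the definitions here are assumptions based on common usage: the centroid distance is taken as the Euclidean distance from an observation to the mean of the remaining observations, and the ridge Mahalanobis distance uses the sample covariance of the remaining observations plus a ridge term. The function name `leave_one_out_distances`, the `ridge` parameter, and the simulated data are hypothetical. The maximal data piling distance and the parametric bootstrap step are not shown.

```python
import numpy as np

def leave_one_out_distances(X, ridge=1.0):
    """Distances of each row of X to the remaining rows (assumed definitions).

    Returns two arrays of length n: the centroid distance and the
    ridge Mahalanobis distance of each observation.
    """
    n, d = X.shape
    centroid = np.empty(n)
    ridge_md = np.empty(n)
    for j in range(n):
        rest = np.delete(X, j, axis=0)       # all observations except j
        diff = X[j] - rest.mean(axis=0)      # offset from the centroid of the rest
        centroid[j] = np.linalg.norm(diff)
        # The sample covariance is singular when d > n, so a ridge term
        # (ridge * identity) is added before solving the linear system.
        S = np.cov(rest, rowvar=False)
        ridge_md[j] = np.sqrt(diff @ np.linalg.solve(S + ridge * np.eye(d), diff))
    return centroid, ridge_md

# Simulated HDLSS data: 20 observations in 200 dimensions, with the
# first observation shifted to act as an outlier.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 200))
X[0] += 5.0

cd, rmd = leave_one_out_distances(X)
print("largest centroid distance at index", int(np.argmax(cd)))
print("largest ridge Mahalanobis distance at index", int(np.argmax(rmd)))
```

In this simulated example the shifted observation has the largest value under both distances. Deciding how large a distance must be to flag an outlier is the role of the bootstrap-based procedure described in the abstract.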