PhD Candidate, Statistics

Single-nucleotide polymorphisms (SNPs), believed to determine human differences, are widely used to predict disease risk and the class membership of subjects. Several supervised machine learning methods, such as support vector machines, neural networks, and logistic regression, are available in the literature for classification. Typically, however, training samples are limited and/or the cost of sampling is high. Thus, it is essential to determine the minimum sample size needed to construct a classifier based on SNP data. Such a classifier would facilitate correct classification while keeping the sample size to a minimum, thereby making studies cost-effective. In this dissertation, we first consider the problem of sample size determination when there are two classes. We then consider the same problem when there are two or more classes. While the sample size determination algorithm is the same for the two-class and multi-class scenarios, the criterion used in each scenario is different. More specifically, for coded SNP data from two classes, an optimal classifier and an approximation to its probability of correct classification (PCC) are derived. A linear classifier is constructed and an approximation to its PCC is also derived. These approximations are validated through a variety of Monte Carlo simulations. A sample size determination algorithm is given, based on the criterion that the difference between the two approximate PCCs is below a threshold, and its effectiveness is illustrated via simulations. For the HapMap data on Chinese and Japanese populations, a linear classifier is built using 51 independent SNPs, and the required total sample sizes are determined using our algorithm as the threshold varies. For example, when the threshold value is 0.05, our algorithm determines a total sample size of 166 (83 for Chinese and 83 for Japanese) that satisfies the criterion.
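The two-class criterion can be illustrated with a minimal sketch, which is not the dissertation's implementation. It assumes the approximate PCC of the linear classifier is available as a function of the per-class sample size and searches for the smallest size at which the gap to the optimal PCC falls below the threshold. The names `min_sample_size` and `toy_pcc_linear`, the equal allocation between classes, and the toy PCC curve are all hypothetical and chosen only for illustration.

```python
from typing import Callable


def min_sample_size(pcc_optimal: float,
                    pcc_linear: Callable[[int], float],
                    threshold: float,
                    n_start: int = 2,
                    n_max: int = 100_000) -> int:
    """Smallest per-class n with pcc_optimal - pcc_linear(n) < threshold.

    pcc_linear is assumed (hypothetically) to map a per-class sample size
    to the approximate PCC of the linear classifier trained on that many
    subjects per class.
    """
    for n in range(n_start, n_max + 1):
        if pcc_optimal - pcc_linear(n) < threshold:
            return n
    raise ValueError("criterion not met for any n up to n_max")


def toy_pcc_linear(n: int) -> float:
    # Hypothetical stand-in curve: approaches the optimal PCC of 0.95
    # as n grows. It is NOT the approximation derived in the dissertation.
    return 0.95 - 0.43 / n


n_per_class = min_sample_size(0.95, toy_pcc_linear, threshold=0.05)
total = 2 * n_per_class  # equal allocation to the two classes (assumption)
print(n_per_class, total)
```

Tightening the threshold increases the required sample size, which mirrors the way the total sample size in the abstract varies with the threshold.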
For coded SNP data from D (≥ 2) classes, we derive an optimal Bayes classifier and a linear classifier, and obtain a normal approximation to the probability of correct classification for each classifier. These approximations are used to evaluate the associated areas under the receiver operating characteristic (ROC) curve (AUCs) or volumes under the ROC hyper-surface (VUSs), and their performance is then validated via Monte Carlo simulations. We give an algorithm for sample size determination, which ensures that the difference between the two approximate AUCs (or VUSs) is below a pre-specified threshold. The performance of this algorithm is also illustrated via simulations. For the HapMap data with three and four populations, a linear classifier is built using 92 independent SNPs, and the required total sample sizes are determined for various threshold values. We also illustrate the usefulness of our sample size determination algorithm in a prediction problem using the Heterogeneous Stock Mice data, where the continuous variable Anxiety is categorized into three groups and the variable Obesity BMI is categorized into four groups, and a linear classifier is then built based on 348 SNPs for each variable.
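For readers unfamiliar with the two summary measures, the following sketch shows their standard empirical definitions. It is not the normal-approximation approach of the dissertation. It assumes, as a simplification, that each subject receives a single scalar score and that classes are expected to be ordered by that score. The function names and the sample scores are hypothetical.

```python
from itertools import product


def empirical_auc(scores_class1, scores_class2):
    """Fraction of (class 1, class 2) pairs that are correctly ordered.

    Ties count as half a correctly ordered pair.
    """
    pairs = list(product(scores_class1, scores_class2))
    wins = sum(1.0 if a < b else 0.5 if a == b else 0.0 for a, b in pairs)
    return wins / len(pairs)


def empirical_vus(scores_class1, scores_class2, scores_class3):
    """Fraction of (class 1, class 2, class 3) triples in strict order."""
    triples = list(product(scores_class1, scores_class2, scores_class3))
    correct = sum(1 for a, b, c in triples if a < b < c)
    return correct / len(triples)


# Hypothetical scores, for illustration only.
auc = empirical_auc([0.1, 0.4], [0.35, 0.8])
vus = empirical_vus([1, 2], [3], [4, 0])
print(auc, vus)
```

Under this definition, scores that carry no class information give an AUC near 1/2 for two classes and a VUS near 1/6 for three classes, so the chance level drops as the number of classes grows.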