For high-dimensional genetic data, an important problem is to search for associations between genetic variables and a phenotype, typically a discrete variable (such as diseased versus normal). A conventional solution is to characterize such relationships through regression models in which the phenotype is treated as the response variable and the genetic variables are treated as covariates. Not surprisingly, this approach runs into the challenging problem of having far more variables than observations. We propose a general framework that expresses the transformed mean of the genetic variables, taken to follow an exponential family distribution, via ANOVA-type models in which a low-rank interaction space captures the association between the phenotype and the genetic variables. This alternative method transforms the variable selection problem into a well-posed problem in which the number of observations is larger than the number of genetic variables. We also develop a new model selection criterion, based on the Bayesian information criterion, for the new model framework with a diverging number of parameters.
In the talk, we focus on a specific application to genome-wide association studies. The primary task is to detect biomarkers in the form of single nucleotide polymorphisms (SNPs) that have nontrivial associations with a disease phenotype and some other important clinical factors. However, the extremely large number of SNPs compared to the sample size inhibits the application of classical methods such as multiple logistic regression for case-control studies. We demonstrate the applicability of the proposed method using a multiple sclerosis data set and simulation studies.
More information on Jianhua Hu may be found at http://faculty.mdanderson.org/Jianhua_Hu/Default.asp?SNID=279365695
This Colloquium is sponsored jointly by the University of Georgia Department of Statistics and the University of Georgia Department of Epidemiology and Biostatistics.