In the large cohorts typically used for genome-wide association studies (GWAS), it is not economically feasible to sequence all cohort members. A cost-effective strategy is to sequence subjects with extreme values of quantitative traits or those with specific diseases. By imputing the sequencing data from the GWAS data for the cohort members who are not selected for sequencing, one can dramatically increase the number of subjects with information on rare variants. However, treating the imputed rare variants as observed quantities in downstream association analysis may inflate the type I error, especially when the sequenced subjects are not a random subset of the whole cohort. In this article, we show how to properly account for the uncertainties in the imputation of rare variants. We consider all commonly used gene-level association tests, including the burden test, variable threshold (VT) test, and sequence-kernel association test (SKAT), all of which are based on the score statistic for assessing the effects of individual variants on the trait of interest. We show that the score statistic based on the observed genotypes for sequenced subjects and the imputed genotypes for non-sequenced subjects is unbiased. We construct a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, such that the corresponding association tests always have correct type I error.
We demonstrate the usefulness of the proposed methodology through extensive simulation studies and empirical data from the Women's Health Initiative (WHI). The relevant software is freely available.
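To make the idea of a score-based burden test concrete, the following is a minimal sketch under simplifying assumptions and is not the estimator described above. It assumes a quantitative trait, an intercept-only null model, and equal variant weights by default. The function name `burden_score_test` and its arguments are hypothetical. The genotype matrix may mix observed genotypes (0, 1, 2) for sequenced subjects with imputed dosages for the others. The variance used here is a generic empirical (sandwich-type) estimate built from per-subject score contributions, chosen only to illustrate why one might avoid a model-based variance that treats imputed dosages as observed.

```python
import math
import numpy as np

def burden_score_test(y, G, weights=None):
    """Illustrative burden score test (hypothetical helper, not the authors' method).

    y       : trait values, shape (n,)
    G       : genotypes or imputed dosages, shape (n, m)
    weights : optional per-variant weights, shape (m,); defaults to all ones
    Returns (U, V, stat, p): score, empirical variance, 1-df statistic, p-value.
    """
    y = np.asarray(y, dtype=float)
    G = np.asarray(G, dtype=float)
    if weights is None:
        weights = np.ones(G.shape[1])
    burden = G @ np.asarray(weights, dtype=float)  # one burden score per subject
    resid = y - y.mean()                           # residuals under the null (intercept only)
    centered = burden - burden.mean()
    contrib = resid * centered                     # per-subject score contributions
    U = contrib.sum()                              # score statistic
    V = np.sum(contrib ** 2)                       # empirical variance of the score
    stat = U ** 2 / V if V > 0 else 0.0
    p = math.erfc(math.sqrt(stat / 2.0))           # upper tail of chi-square with 1 df
    return U, V, stat, p

# Toy data: 4 subjects, 2 rare variants
U, V, stat, p = burden_score_test([1.0, 2.0, 3.0, 4.0],
                                  [[0, 0], [1, 0], [0, 1], [1, 1]])
```

In practice, covariates, case-control traits, and the dependence between imputed dosages and the sequencing selection would all need to be handled, which this sketch does not attempt.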
More information on Yijuan Hu may be found at http://cfusion.sph.emory.edu/Faculty/Profile.cfm?Network_ID=YHU30&DEPT=BIOS