With the rapid development of technology, an increasing amount of data is being produced in many fields of science, such as biology, neuroscience, and engineering. Inadequate sample size is no longer the bottleneck of modern statistical research. More often, we face data that are of extremely high dimensionality or that come from remarkably different sources. How to effectively extract information from large-scale, high-dimensional data, or from data of various types and formats, poses new statistical challenges.
In this thesis, I develop novel statistical methods and theory to address various issues in analyzing high-dimensional or multisource big data. More specifically, I propose (a) a model-free variable screening method for high-dimensional regression, and (b) a data-level fusion method and a feature-level fusion method to integrate multiple data sources for improved knowledge discovery. The consistency property in screening out redundant variables and the asymptotic properties of the fused data are established, respectively, to provide theoretical underpinnings. The proposed methods are applied to a range of scientific investigations, including genomic, epigenetic, and metabolomic studies, where they support discovery in disciplines beyond statistics.