Doctor of Philosophy
This thesis focuses on developing novel and flexible non/semi-parametric statistical methods for data with complex features. In recent years, advances in high-throughput technologies have made it possible to collect sophisticated high-dimensional datasets, such as microarray data, genome-wide single nucleotide polymorphism (SNP) data, and RNA sequencing (RNA-seq) data. These developments have created a growing demand for innovative dimension reduction tools that extract useful information from large volumes of data, visualize the underlying structure, and facilitate understanding and analysis of the data. The research undertaken in this thesis is described below.
In Chapter 1, we consider a semiparametric additive partially linear regression model (APLM) for analyzing ultra-high-dimensional data where both the number of linear components and the number of nonlinear components can be much larger than the sample size. We propose a two-step approach for estimation, selection and simultaneous inference of the components in the APLM. In the first step, the nonlinear additive components are approximated using polynomial spline basis functions, and a doubly penalized procedure is proposed to select nonzero linear and nonlinear components based on adaptive LASSO. In the second step, local linear smoothing is then applied to the data with the selected variables to obtain the asymptotic distribution of the estimators of the nonparametric functions of interest. The proposed method selects the correct model with probability approaching one under regularity conditions. The estimators of both the linear part and nonlinear part are consistent and asymptotically normal, which enables us to construct confidence intervals and make inferences about the regression coefficients and the component functions. The performance of the method is evaluated by simulation studies. The proposed method is also applied to a dataset on the Shoot Apical Meristem (SAM) of maize genotypes.
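The first step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the truncated-power basis, the proximal-gradient solver, the simulated data, the tuning value `lam=0.05`, and all function names are assumptions chosen for illustration. Each nonlinear component is expanded in a polynomial spline basis, each candidate component becomes one coefficient group, and an adaptively weighted group penalty shrinks whole groups to zero.

```python
import numpy as np

def spline_basis(x, n_knots=4, degree=3):
    """Standardized truncated-power polynomial spline basis for one covariate."""
    knots = np.quantile(x, np.linspace(0, 1, n_knots + 2)[1:-1])
    cols = [x ** d for d in range(1, degree + 1)]
    cols += [np.clip(x - k, 0, None) ** degree for k in knots]
    B = np.column_stack(cols)
    return (B - B.mean(axis=0)) / B.std(axis=0)

def group_soft_threshold(v, t):
    """Proximal operator of t * ||v||_2: shrinks a whole group toward zero."""
    norm = np.linalg.norm(v)
    return np.zeros_like(v) if norm <= t else (1 - t / norm) * v

def stack_groups(blocks):
    """Stack design blocks side by side and record each block's column indices."""
    D = np.column_stack(blocks)
    edges = np.cumsum([0] + [b.shape[1] for b in blocks])
    groups = [np.arange(edges[i], edges[i + 1]) for i in range(len(blocks))]
    return D, groups

def penalized_fit(D, y, groups, lam, weights, n_iter=2000):
    """Proximal gradient for 0.5/n * ||y - D b||^2 + lam * sum_g w_g * ||b_g||_2."""
    n = len(y)
    step = n / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = b + step * D.T @ (y - D @ b) / n
        for g, w in zip(groups, weights):
            b[g] = group_soft_threshold(z[g], step * lam * w)
    return b

# Simulated data (hypothetical): one active linear and one active nonlinear term.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))    # candidate linear covariates
Z = rng.uniform(size=(n, 3))   # candidate nonlinear covariates
y = 2.0 * X[:, 0] + np.sin(2 * np.pi * Z[:, 0]) + 0.1 * rng.normal(size=n)
yc = y - y.mean()

Xs = (X - X.mean(axis=0)) / X.std(axis=0)
blocks = [Xs[:, [j]] for j in range(Xs.shape[1])]            # one group per linear term
blocks += [spline_basis(Z[:, k]) for k in range(Z.shape[1])]  # one group per function
D, groups = stack_groups(blocks)

# Initial unweighted fit, then adaptive weights 1 / ||initial group estimate||.
init = penalized_fit(D, yc, groups, lam=0.05, weights=np.ones(len(groups)))
w = 1.0 / np.maximum([np.linalg.norm(init[g]) for g in groups], 1e-8)
final = penalized_fit(D, yc, groups, lam=0.05, weights=w)
selected = [i for i, g in enumerate(groups) if np.linalg.norm(final[g]) > 0]
print("selected groups:", selected)
```

In this sketch, groups 0 to 4 correspond to the linear covariates and groups 5 to 7 to the spline-expanded covariates. The second step (local linear smoothing on the selected variables) is not shown.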
In Chapter 2, we further consider the model identification problem, along with simultaneous variable selection, estimation and inference, for the additive partially linear model (APLM). The APLM combines the flexibility of nonparametric regression with the parsimony of linear regression models, and has been widely used in multivariate nonparametric regression to alleviate the "curse of dimensionality". A natural question raised in practice is the choice of structure in the nonparametric part, that is, whether the continuous covariates enter the model in linear or nonparametric form. In this chapter we present a comprehensive framework for simultaneous sparse model identification and learning for ultra-high-dimensional APLMs where the numbers of both linear and nonparametric components may exceed the sample size. We propose a fast and efficient two-stage procedure. In the first stage, we decompose the nonparametric functions into a linear part and a nonlinear part. The nonlinear functions are approximated by constant spline bases, and a triple penalization procedure is proposed to select nonzero components using adaptive group LASSO. In the second stage, we refit the data with the selected covariates using higher-order polynomial splines, and apply spline-backfitted local linear smoothing to obtain asymptotic normality for the estimators. The procedure is shown to be consistent for model structure identification: it identifies zero, linear, and nonlinear components correctly and efficiently. Inference can be made on both linear coefficients and nonparametric functions. We conduct simulation studies to evaluate the performance of the method, and apply the proposed method to a dataset on the Shoot Apical Meristem (SAM) of maize genotypes for illustration.
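The structure identification problem above can be written out as follows. The notation is an assumption introduced here for illustration and is not taken from the thesis; the reading of the three penalties in the final comment is likewise one plausible interpretation, not a statement of the author's exact criterion.

```latex
% Assumed notation: Y response, Z_k covariates entering linearly,
% X_l continuous covariates, \varepsilon mean-zero error.
\begin{align*}
Y &= \mu + \sum_{k=1}^{p_1} \alpha_k Z_k + \sum_{l=1}^{p_2} \phi_l(X_l) + \varepsilon, \\
\phi_l(x) &= \beta_l x + g_l(x)
  \quad \text{(linear part plus purely nonlinear part)}.
\end{align*}
% Structure of the l-th continuous covariate:
%   zero:      \beta_l = 0      and  g_l \equiv 0
%   linear:    \beta_l \neq 0   and  g_l \equiv 0
%   nonlinear: g_l \not\equiv 0
% One plausible reading of "triple penalization": separate penalties on
% the \alpha_k, the \beta_l, and the spline coefficients of each g_l.
```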
In Chapter 3, motivated by recent advances in technology for brain imaging and high-throughput genotyping, we consider an imaging genetics approach to discover how the interplay of genetic variation and environmental factors relates to measurements from imaging phenotypes. We propose an image-on-scalar regression method, in which the spatial heterogeneity of gene-environment interactions on imaging responses is investigated via an ultra-high-dimensional spatially varying coefficient model (SVCM). Bivariate splines on triangulations are used to represent the coefficient functions over an irregular two-dimensional domain of interest. When using the image-on-scalar regression method, a natural question raised in practice is whether a coefficient function truly varies over space. In this chapter, we present a unified approach for simultaneous sparse learning and model structure identification (i.e., separation of varying and constant coefficients). Our method can identify zero, nonzero constant and spatially varying components correctly and efficiently. The estimators of the constant coefficients and the varying coefficient functions are consistent. The performance of the method is evaluated through several simulation examples and a brain mapping study based on the Alzheimer's Disease Neuroimaging Initiative data.
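A generic form of the image-on-scalar model described above is sketched below. The symbols are assumptions introduced for illustration and are not taken from the thesis.

```latex
% Assumed notation: Y_i(s) imaging response of subject i at location s,
% X_{ik} scalar covariates (genetic, environmental, and their interactions),
% \Omega the two-dimensional domain of interest.
\begin{equation*}
Y_i(s) = \sum_{k=0}^{p} X_{ik}\, \beta_k(s) + \varepsilon_i(s),
\qquad s \in \Omega .
\end{equation*}
% Structure of the k-th coefficient function:
%   zero:              \beta_k(s) = 0 for all s
%   nonzero constant:  \beta_k(s) = c_k \neq 0 for all s
%   spatially varying: \beta_k(s) is not constant over \Omega
% Each \beta_k is represented by a bivariate spline over a triangulation of \Omega.
```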
Li, Xinyi, "Non/Semi-parametric learning from data with complex features" (2018). Graduate Theses and Dissertations. 17250.