Date of Award: 2017
Doctor of Philosophy
Song Xi Chen
This dissertation consists of three research papers, each addressing a different statistical problem for high-volume datasets. The first paper studies distributed statistical inference for massive data. As data size grows, the computational complexity and feasibility of statistical analyses must be taken into account. We investigate the statistical efficiency of the distributed version of a general class of statistics. Distributed bootstrap algorithms are proposed to approximate the distribution of the distributed statistics. These approaches relieve the computational burden of conventional methods while preserving adequate statistical efficiency. The second paper deals with testing the identity and sphericity hypotheses for high-dimensional covariance matrices, with a focus on improving the power of existing methods. The power improvement exploits sparsity in the underlying covariance matrices: the test statistics are built on the banding estimator of the covariance matrix, which leads to a significant reduction in their variance. The last paper considers variable selection for high-dimensional data. Distance-based variable importance measures are proposed to rank and select variables while taking their dependence structure into account. The importance measures are inspired by the multi-response permutation procedure (MRPP) and the energy distance. A backward selection algorithm is developed to discover important variables and to improve the power of the original MRPP for high-dimensional data.
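For readers unfamiliar with the energy distance named in the abstract, the following is a minimal sketch of its standard sample (V-statistic) form, 2 E|X - Y| - E|X - X'| - E|Y - Y'|, with expectations replaced by averages over all pairs. It is not the dissertation's importance measure or its selection algorithm. The function names `mean_pairwise_distance` and `energy_distance` are hypothetical, chosen for illustration only.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def mean_pairwise_distance(a, b):
    """Average Euclidean distance over all pairs (p, q), p in a and q in b."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))


def energy_distance(x, y):
    """Sample energy distance between two samples of equal-dimension points.

    Illustrative V-statistic form: 2 E|X - Y| - E|X - X'| - E|Y - Y'|,
    with each expectation estimated by an average over all pairs.
    """
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))


# Toy data (made up for illustration): two groups of 2-dimensional points.
group_a = [(0.0, 0.0), (0.0, 0.0)]
group_b = [(1.0, 0.0), (1.0, 0.0)]
separation = energy_distance(group_a, group_b)
```

In this form the quantity is zero when both samples are identical and grows as the two groups separate, which is the property a distance-based comparison of groups relies on. The all-pairs computation is quadratic in the sample size, so this sketch is only practical for small samples.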
Peng, Liuhua, "Topics in statistical inference for massive data and high-dimensional data" (2017). Graduate Theses and Dissertations. 15601.