Date of Award
Doctor of Philosophy
Bioinformatics and Computational Biology
Gene expression data analysis is a critical component of how today's researchers comprehend biological function at a molecular level. With the amount of data being generated outstripping the ability to analyze it, the development of statistical methodology must keep pace with technological advancement in order to take full advantage of this wealth of information. In this dissertation, we examine issues that arise in gene expression analysis and develop new methods to account for these complications in three separate papers, contained in Chapters 2, 3, and 4.
Chapters 2 and 3 are closely related: both concern the detection of differential expression and multiple testing procedures for microarray analysis. Specifically, in Chapter 2 we modify an existing semiparametric estimator of the true null proportion of hypothesis tests to make use of permutation testing. We argue that this approach is more appropriate for the typically small sample sizes of microarray experiments, especially since expression data are nonnormal. Using simulated data based upon real microarray expression values, we show that our modification is more accurate than the original approach, and we advocate its use for microarray analysis when sample sizes are small.
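As a rough illustration of the two ingredients named in Chapter 2, the sketch below computes an exact permutation p-value for a single gene and then estimates the true null proportion from a set of p-values. The abstract does not name the semiparametric estimator that was modified, so a simple Storey-type threshold estimator is swapped in here purely as a stand-in; the function names, the difference-in-means statistic, and the tuning value `lam=0.5` are all assumptions made for this example.

```python
import itertools
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means.

    Enumerates every relabeling of the pooled samples, which is feasible
    for the small sample sizes typical of microarray experiments.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    total = 0
    extreme = 0
    for chosen in itertools.combinations(indices, n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


def storey_pi0(p_values, lam=0.5):
    """Stand-in (assumed) estimator of the true null proportion.

    Counts p-values above lam and rescales by (1 - lam), relying on null
    p-values being roughly uniform. Not the estimator from the dissertation.
    """
    m = len(p_values)
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / ((1.0 - lam) * m))


if __name__ == "__main__":
    # Hypothetical expression values for one gene, three arrays per group.
    print(permutation_p_value([1, 2, 3], [10, 11, 12]))
    # Hypothetical p-values for eight genes.
    print(storey_pi0([0.01, 0.02, 0.03, 0.04, 0.6, 0.9, 0.2, 0.3]))
```

With three samples per group there are only 20 distinct relabelings, so the smallest attainable two-sided p-value is 2/20. This coarseness is one reason small-sample permutation p-values need careful handling in downstream estimators.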
In Chapter 3, we examine the implications for FDR estimation of rejecting a fixed number of genes when detecting differential expression. We employ a wide variety of estimators that assume the empirical null p-value distribution is uniform, and our findings show a strong negative correlation between the FDR estimates and the true proportion of false discoveries, Q, when significance is determined in this fashion. This phenomenon is observed over a wide variety of simulation conditions. We also show that, in conjunction with publication bias, this type of significance threshold selection results in liberally biased estimates of FDR. We contrast these estimators with Efron's empirical null approach, which produces an FDR estimator that is positively correlated with Q.
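To make the quantities in Chapter 3 concrete, the following minimal sketch shows one common plug-in FDR estimate under a uniform null when exactly k genes are rejected, alongside the true proportion of false discoveries Q, which can be computed in a simulation where the null status of each gene is known. The function names and the example values are hypothetical, and this is only one of the many uniform-null estimators the abstract alludes to.

```python
def fdr_estimate_top_k(p_values, k, pi0=1.0):
    """Plug-in FDR estimate when the k smallest p-values are rejected.

    Under a uniform null, about pi0 * m * t null p-values are expected
    below the threshold t, where t is the k-th smallest p-value.
    """
    ordered = sorted(p_values)
    m = len(ordered)
    threshold = ordered[k - 1]
    return min(1.0, pi0 * m * threshold / k)


def true_false_discovery_proportion(p_values, is_null, k):
    """Q: the fraction of the k rejected genes that are truly null.

    Requires known truth, so it is only available in simulation.
    """
    ranked = sorted(zip(p_values, is_null), key=lambda pair: pair[0])
    false_discoveries = sum(1 for _, null in ranked[:k] if null)
    return false_discoveries / k


if __name__ == "__main__":
    # Hypothetical p-values and truth labels for ten genes.
    p = [0.001, 0.002, 0.003, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99]
    truth = [False, False, True, True, True, True, True, True, True, True]
    print(fdr_estimate_top_k(p, k=3))
    print(true_false_discovery_proportion(p, truth, k=3))
```

Repeating this over many simulated data sets and correlating the two outputs is one way to examine the relationship between the estimate and Q described above.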
Chapter 4 develops a method for simultaneously classifying the transcriptional activity of genes using RNA-Seq data. We specifically consider a crossbreeding experiment involving two inbred lines of maize in order to investigate the complementation model of heterosis. We use a negative binomial distribution to model the read counts and assume a simple latent class model for transcriptional activity. Applying this model to experimental and simulated data provides reasonable classifications and identifies specific genes that appear to be in accordance with the complementation theory of heterosis. We argue for the use of this model in other breeding experiments to further investigate the complementation theory.
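The general idea of a latent class model with negative binomial read counts can be sketched as follows. This is an illustrative two-class version (inactive versus active) for a single gene in a single genotype, not the model fitted in Chapter 4: the class means, the dispersion parameter `size`, the prior probability, and the function names are all assumptions chosen for the example.

```python
import math


def nb_log_pmf(y, mu, size):
    """Log pmf of a negative binomial with mean mu and dispersion size.

    Parameterized so that the variance is mu + mu**2 / size.
    """
    return (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )


def posterior_active(counts, mu_inactive, mu_active, size, prior_active=0.5):
    """Posterior probability that a gene is transcriptionally active.

    Combines the prior with the negative binomial likelihood of the
    replicate read counts under each latent class (Bayes' rule).
    """
    log_inactive = math.log(1.0 - prior_active) + sum(
        nb_log_pmf(y, mu_inactive, size) for y in counts
    )
    log_active = math.log(prior_active) + sum(
        nb_log_pmf(y, mu_active, size) for y in counts
    )
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_inactive, log_active)
    w_inactive = math.exp(log_inactive - top)
    w_active = math.exp(log_active - top)
    return w_active / (w_inactive + w_active)


if __name__ == "__main__":
    # Hypothetical replicate read counts for two genes.
    print(posterior_active([0, 0, 1], mu_inactive=0.5, mu_active=50.0, size=2.0))
    print(posterior_active([40, 60, 55], mu_inactive=0.5, mu_active=50.0, size=2.0))
```

In a crossbreeding setting, the same calculation could be applied separately to each parental line and to the hybrid, so that a gene's pattern of activity across genotypes can be compared.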
Nicholas Bradley Larson
Larson, Nicholas Bradley, "Investigation and development of statistical method for gene expression data analysis" (2011). Graduate Theses and Dissertations. 10448.