Date of Award
Doctor of Philosophy
In recent years, next-generation sequencing (NGS) technology has been revolutionizing how genomic studies are conducted. One important application of NGS technology is the study of the transcriptome through sequencing of RNAs (RNA-seq). Compared with previous technologies such as microarrays, RNA-seq has many advantages, such as providing digital rather than analog signals of expression levels, wider dynamic ranges of measurement, less noise, and higher throughput. Hence, RNA-seq is gradually replacing the array-based approach as the major platform in transcriptome studies. Meanwhile, the massive amounts of discrete data generated by NGS technology call for effective methods of statistical analysis. There are many interesting questions in RNA-seq data analysis, and we focus on three important ones in this dissertation: identifying differentially expressed genes from two-treatment experiments, detecting alternative splicing patterns using exon-expression data, and clustering gene expression profiles for multi-sample studies. Our major contributions are introduced in the following chapters:
First, we propose an approximated maximum-average powerful (AMAP) testing procedure to compare gene expression between two treatment groups. The proposed method allows for testing null hypotheses that are much more general than those considered in most previous studies, and it leads to a natural way of controlling the false discovery rate (FDR). We show that our method has higher power as well as better FDR control than other methods widely used in practice.
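For readers unfamiliar with FDR control, the following minimal sketch shows the standard Benjamini-Hochberg step-up procedure applied to a list of per-gene p-values. It is not the AMAP procedure, whose details are not given in this abstract; the function name `benjamini_hochberg` and the example p-values are hypothetical and serve only to illustrate what controlling the FDR at a chosen level means.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Standard Benjamini-Hochberg step-up procedure (illustration only,
    not the AMAP test). Returns one boolean per hypothesis: True means
    the null hypothesis is rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten genes.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example_p, alpha=0.05))
```

Because the procedure is step-up, a gene can be declared differentially expressed even if its own p-value exceeds its rank-specific threshold, provided a larger p-value further down the ordering meets its threshold.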
Second, we generalize the AMAP test from testing gene expression data to studying alternative splicing events from exon-level expression data. A nonparametric algorithm to estimate the distribution of exon usages is proposed; it provides more flexibility in fitting the data and higher computational efficiency. In comparisons with previous methods, our method is shown to be much more powerful.
In the third project, we introduce clustering algorithms based on appropriate probability models for RNA-seq data, with a well-designed initialization strategy and grouping algorithms. We also present a model-based hybrid-hierarchical clustering method to generate a tree structure that allows visualization of relationships among clusters as well as flexibility in choosing the number of clusters. Results from both simulation studies and analysis of a maize RNA-seq data set show that our proposed methods provide better clustering results than alternative methods that are not based on probability models.
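As a rough illustration of what clustering count data under a probability model can look like, the sketch below fits a generic Poisson mixture by expectation-maximization (EM). It rests on several assumptions that do not come from the abstract: the Poisson model itself, the evenly spaced deterministic initialization, the pseudocount, and the name `poisson_mixture_cluster` are all hypothetical choices, and the dissertation's own models, initialization strategy, and hybrid-hierarchical method may differ.

```python
import math


def poisson_mixture_cluster(counts, n_clusters, n_iter=50, pseudo=0.5):
    """Cluster genes by expression profile with a Poisson mixture fit by EM.

    Assumed model (illustration only): for gene g in cluster k, the count in
    sample j is Poisson(s_g * p_kj), where s_g is the gene's total count and
    the cluster profile p_k sums to 1 over samples. Requires n_iter >= 1.
    """
    n_genes = len(counts)
    n_samples = len(counts[0])

    def profile(row):
        # Smoothed relative expression across samples (avoids log(0)).
        denom = sum(row) + pseudo * n_samples
        return [(y + pseudo) / denom for y in row]

    # Hypothetical initialization: profiles of evenly spaced genes.
    profiles = [profile(counts[i * n_genes // n_clusters])
                for i in range(n_clusters)]
    weights = [1.0 / n_clusters] * n_clusters

    for _ in range(n_iter):
        # E-step: posterior probability of each cluster for each gene.
        # Terms that do not depend on the cluster cancel and are omitted.
        resp = []
        for row in counts:
            log_post = [
                math.log(weights[k])
                + sum(y * math.log(p) for y, p in zip(row, profiles[k]))
                for k in range(n_clusters)
            ]
            top = max(log_post)  # subtract the maximum for numerical stability
            unnorm = [math.exp(v - top) for v in log_post]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])

        # M-step: update mixing weights and cluster profiles.
        for k in range(n_clusters):
            mass = sum(r[k] for r in resp)
            weights[k] = max(mass / n_genes, 1e-12)  # keep log() defined
            num = [
                sum(resp[g][k] * counts[g][j] for g in range(n_genes)) + pseudo
                for j in range(n_samples)
            ]
            denom = sum(num)
            profiles[k] = [v / denom for v in num]

    # Assign each gene to its most probable cluster.
    labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
    return labels, profiles


# Hypothetical counts: six genes (rows) measured in two samples (columns).
toy_counts = [[100, 10], [90, 12], [200, 25], [8, 95], [15, 150], [10, 110]]
labels, profiles = poisson_mixture_cluster(toy_counts, n_clusters=2)
print(labels)
```

Conditioning on each gene's total count makes the clusters reflect the shape of the expression profile across samples rather than overall abundance, which is one common design choice for count data.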
Si, Yaqing, "Statistical analysis of RNA-seq data from next-generation sequencing technology" (2012). Graduate Theses and Dissertations. 12682.