Degree Type

Dissertation

Date of Award

2019

Degree Name

Doctor of Philosophy

Department

Statistics

Major

Bioinformatics and Computational Biology; Statistics

First Advisor

Dan . Nettleton

Second Advisor

Roger . Wise

Abstract

RNA-sequencing (RNA-seq) technology is a high-throughput next-generation sequencing procedure. It allows researchers to measure gene transcript abundance at a lower cost and with a higher resolution.

Advances in RNA-seq technology promoted new methodological development in several branches of quantitative analysis for RNA-seq data. In this dissertation, we focus on several topics related to RNA-seq data analysis.

This dissertation is comprised of three papers on the analysis of RNA-seq data. We first introduce a method for detecting differentially expressed genes across different experimental conditions with correlated RNA-seq data. We fit a general linear model to the transformed read counts of each gene and assume the error vector has a block-diagonal correlation matrix with unstructured blocks that

account for within-gene correlations. In order to stabilize parameter estimation with limited replicates, we shrink the residual maximum likelihood estimator of correlation parameters toward a mean-correlation locally-weighted scatterplot smoothing curve. The shrinkage weights are determined by using a hierarchical model and then estimated via parametric bootstrap. Due to the information sharing across genes in parameter estimation, the null distribution of test statistic is unknown and mathematically intractable. Thus, we approximate the null test distribution through a parametric bootstrap strategy.

Next, we focus on correlation estimation between genes. Gene co-expression correlation estimation is a fundamental step in gene co-expression network construction. The correlation estimates could also be used as inputs of topological statistics which help analyze gene functions. We propose a new strategy for co-expression correlation definition and estimation. We introduce a motivating dataset with two factors and a split-plot experimental design. We define two types of co-expression correlations that originate from two different sources. We apply a linear mixed model to each gene pair. The correlations within random effects and random errors are used to represent the two types of correlations.

Finally, we consider a basic topic in quantitative RNA-seq analysis, gene filtering. It is essential to remove genes with extremely low read counts before further analysis to avoid numerical problems and to get a more stable estimates. For most differential expression and gene network analyses tools, there are embedded gene filtering functions. In general, these functions rely on a user-defined hard threshold for gene selection and fail to make full use of gene features, such as gene length and GC content level. Several studies have shown that gene features have a significant impact on RNA-sequencing efficiency and thus should be considered in subsequent analysis. We propose to fit a

model involving a two-component mixture of Gaussian distribution to the transformed read counts for each sample and assume all parameters are functions of GC content. We adopt a modified semiparametric expectation-maximization algorithm for parameter estimation.

We perform a series of simulation studies and show, that in many cases, the proposed methods improve upon existing methods and are more robust.

Copyright Owner

Meiling Liu

Language

en

File Format

application/pdf

File Size

95 pages

Available for download on Thursday, November 26, 2020

Share

COinS