Date

April 1, 2016, 12:00 AM

Major

Mathematics

Department

Mathematics

College

College of Liberal Arts and Sciences

Project Advisor

Jeffery Trimarchi

Project Advisor's Department

Genetics, Development and Cell Biology

Description

A central task of bioinformatics is to develop sensitive and specific means of providing medical prognoses from biomarker patterns. Common methods for predicting phenotypes from RNA-Seq datasets use machine learning algorithms trained on gene expression. Isoforms generated by alternative splicing, however, may provide a novel and complementary set of transcripts for phenotype prediction. Because of the numerous alternative splicing patterns, the number of isoform features is substantially larger than the number of gene features, which creates a feature prioritization problem for many machine learning algorithms. This study identifies the empirically optimal methods of transcript quantification, feature engineering, and filtering, using phenotype prediction accuracy as the metric. We show that isoform features are complementary to gene features, providing non-redundant information and enhanced predictive power when prioritized and filtered. A univariate filtering algorithm, which selects up to the N highest-ranking features for phenotype prediction, is described and evaluated. Pipelines for isoform quantification are compared empirically through cross-validation prediction tests on datasets from human non-small cell lung cancer patients, human patients with chronic obstructive pulmonary disease, and amyotrophic lateral sclerosis transgenic mice, each including samples of diseased and non-diseased phenotypes.
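The abstract does not say which ranking statistic the univariate filter uses, so the following is only a minimal sketch of the general idea, not the study's algorithm. It assumes a two-class phenotype (labels 0 and 1) and ranks each feature independently by the absolute Welch t-statistic; the function names `welch_t` and `top_n_features`, the `min_score` threshold, and the toy expression values are all hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t-statistic between two groups of values (assumed score)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        return 0.0  # constant feature: carries no ranking information
    return abs(mean(a) - mean(b)) / se


def top_n_features(samples, labels, n, min_score=0.0):
    """Score every feature on its own and keep up to n of the best.

    samples: list of rows, one per sample, each a list of feature values
             (for example gene or isoform expression levels).
    labels:  list of 0/1 phenotype labels, one per sample.
    Returns feature indices ordered by decreasing score. Fewer than n
    indices come back when some features score at or below min_score.
    """
    scored = []
    for j in range(len(samples[0])):
        pos = [row[j] for row, y in zip(samples, labels) if y == 1]
        neg = [row[j] for row, y in zip(samples, labels) if y == 0]
        score = welch_t(pos, neg)
        if score > min_score:
            scored.append((score, j))
    # Highest score first; ties broken by feature index for determinism.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [j for _, j in scored[:n]]


# Toy data (invented for illustration): 6 samples, 3 features.
samples = [
    [5.0, 1.0, 3.0],
    [6.0, 1.0, 3.5],
    [5.5, 1.0, 4.0],
    [1.0, 1.0, 3.0],
    [1.5, 1.0, 2.5],
    [0.5, 1.0, 2.0],
]
labels = [1, 1, 1, 0, 0, 0]
selected = top_n_features(samples, labels, n=3)
```

Here the second feature is constant across samples, so it is dropped and only two indices are returned even though three were requested, which is what "up to N" means in this sketch. When such a filter is combined with cross-validation, the ranking is normally computed on the training folds only, so that held-out samples do not influence which features are kept.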

File Format

application/pdf

Included in

Mathematics Commons

Complementary Feature Selection from Alternative Splicing Events and Gene Expression for Phenotype Prediction
