Date
1-4-2016 12:00 AM
Major
Mathematics
Department
Mathematics
College
College of Liberal Arts and Sciences
Project Advisor
Jeffery Trimarchi
Project Advisor's Department
Genetics, Development and Cell Biology
Description
A central task of bioinformatics is to develop sensitive and specific means of providing medical prognoses from biomarker patterns. Common methods for predicting phenotypes from RNA-Seq datasets use machine learning algorithms trained on gene expression. Isoforms generated by alternative splicing, however, may provide a novel and complementary set of transcripts for phenotype prediction. Because of numerous alternative splicing patterns, the number of isoform features is significantly larger than the number of gene features, which creates a prioritization problem for many machine learning algorithms. This study identifies the empirically optimal methods of transcript quantification, feature engineering, and filtering, using phenotype prediction accuracy as the metric. We show that isoform features are complementary to gene features, providing non-redundant information and enhanced predictive power when prioritized and filtered. A univariate filtering algorithm, which selects up to the N highest-ranking features for phenotype prediction, is described and evaluated. Pipelines for isoform quantification are compared empirically through cross-validation prediction tests on datasets from human non-small cell lung cancer patients, human patients with chronic obstructive pulmonary disease, and amyotrophic lateral sclerosis transgenic mice, each including samples of diseased and non-diseased phenotypes.
File Format
application/pdf
Included in
Complementary Feature Selection from Alternative Splicing Events and Gene Expression for Phenotype Prediction
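The abstract describes a univariate filter that keeps up to the N highest-ranking features and is evaluated by cross-validated prediction accuracy. The following is a minimal sketch of that idea only, not the project's implementation. The ranking statistic (absolute Welch t-statistic), the classifier (nearest centroid), the leave-one-out scheme, the function names, and the toy expression values are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t-statistic between two groups (assumed ranking score)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def select_top_n(X, y, n):
    """Return indices of up to n features with the highest univariate scores.

    X is a list of samples (lists of feature values); y holds 0/1 labels.
    Features with a zero score are dropped, so fewer than n may be returned.
    """
    scores = []
    for j in range(len(X[0])):
        g0 = [row[j] for row, label in zip(X, y) if label == 0]
        g1 = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((welch_t(g0, g1), j))
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [j for score, j in ranked[:n] if score > 0]


def nearest_centroid_predict(X_train, y_train, x, features):
    """Predict the label whose class centroid is closest on the kept features."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y_train)):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [mean(r[j] for r in rows) for j in features]
        dist = sum((x[j] - c) ** 2 for j, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y, n):
    """Leave-one-out accuracy; features are re-selected inside every fold
    so the held-out sample never influences the ranking."""
    correct = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        features = select_top_n(X_train, y_train, n)
        correct += nearest_centroid_predict(X_train, y_train, X[i], features) == y[i]
    return correct / len(X)


# Hypothetical toy data: column 0 stands in for a gene-level feature,
# column 1 for an isoform-level feature, column 2 for an uninformative one.
X = [
    [1.0, 5.0, 3.0],
    [1.2, 5.5, 2.0],
    [0.9, 5.2, 3.1],
    [3.0, 1.0, 2.9],
    [3.2, 1.3, 2.1],
    [2.9, 0.8, 3.0],
]
y = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    print("selected features:", select_top_n(X, y, 2))
    print("leave-one-out accuracy:", loo_accuracy(X, y, 2))
```

Re-running the selection inside each fold is a deliberate choice in this sketch: ranking features on the full dataset before cross-validation would leak information from the held-out samples into the filter and inflate the reported accuracy.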