Machine learning tools for mRNA isoform function prediction

Date
2019-01-01
Authors
Kandoi, Gaurav
Major Professor
Advisor
Julie A. Dickerson
Carolyn Lawrence-Dill
Department
Electrical and Computer Engineering
Abstract

This dissertation focuses on improving mRNA isoform characterization in terms of functional networks, function prediction and tissue-specificity. Three major challenges stand in the way. The first is the unavailability of the mRNA isoform level functional data required to develop machine learning tools; even the data available at the gene level do not cover all genes, which further complicates the matter. The second is the lack of tissue-specificity information in functional databases such as Gene Ontology, Kyoto Encyclopedia of Genes and Genomes and UniProt. The third is the lack of mRNA isoform level “ground truth” functional annotation data. The scope of this dissertation includes using mRNA isoform and protein sequences, high-throughput RNA-sequencing data and gene-level functional annotations to develop computational methods for predicting the functions of alternatively spliced mRNA isoforms in mouse.

To address these challenges, this dissertation develops and describes two computational tools. The first is a supervised machine learning framework for predicting tissue-specific mRNA isoform functional networks. Tissue-spEcific mrNa iSoform functIonal Networks (TENSION) uses annotations of genes that produce a single mRNA, together with gene annotations tagged with “NOT”, to create high-quality mRNA isoform level functional data. We use these data to train random forest models that predict mRNA isoform functional networks. By using a leave-one-tissue-out approach and combining tissue-specific mRNA isoform level predictors with predictors derived from mRNA isoform and protein sequences, we have developed mRNA isoform level functional networks for 17 mouse tissues. We identify about 10.6 million tissue-specific functional mRNA isoform interactions and demonstrate that our networks reveal tissue-specific functional differences between isoforms of the same gene. We validate our models and predictions with a series of tests, including 10-fold stratified cross-validation, comparison with a published method and validation against literature datasets. As a result, we have also generated a high-quality mRNA isoform level functional dataset that can be used for benchmarking future methods.
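The labelling idea described above can be illustrated with a minimal Python sketch. It is not the TENSION implementation: the gene and isoform identifiers, the pairing rule (a shared term gives a positive pair; a term paired with a “NOT” annotation for the same term gives a negative pair) and the helper names are all assumptions made for illustration. The random forest training step is omitted; a classifier would be fit on predictors computed for these labelled pairs.

```python
from itertools import combinations

# Hypothetical toy inputs; none of these identifiers come from the dissertation.
gene_to_isoforms = {
    "geneA": ["A-201"],            # single-isoform gene
    "geneB": ["B-201"],            # single-isoform gene
    "geneC": ["C-201", "C-202"],   # multi-isoform gene, excluded from labels
    "geneD": ["D-201"],            # single-isoform gene
}
# (gene, GO term, qualifier); "NOT" marks an explicit negative annotation.
gene_annotations = [
    ("geneA", "GO:0006915", ""),
    ("geneB", "GO:0006915", ""),
    ("geneC", "GO:0006915", ""),
    ("geneD", "GO:0006915", "NOT"),
]


def isoform_annotations(gene_to_isoforms, gene_annotations):
    """Transfer gene-level annotations to isoforms of single-isoform genes only."""
    has_term, not_term = {}, {}
    for gene, term, qualifier in gene_annotations:
        isoforms = gene_to_isoforms.get(gene, [])
        if len(isoforms) != 1:
            continue  # ambiguous: unclear which isoform carries the function
        target = not_term if qualifier == "NOT" else has_term
        target.setdefault(isoforms[0], set()).add(term)
    return has_term, not_term


def label_pairs(has_term, not_term):
    """Assumed rule: 1 if a pair shares a term, 0 if one has a term the other is NOT annotated with."""
    labels = {}
    isoforms = sorted(set(has_term) | set(not_term))
    for a, b in combinations(isoforms, 2):
        pos_a, pos_b = has_term.get(a, set()), has_term.get(b, set())
        neg_a, neg_b = not_term.get(a, set()), not_term.get(b, set())
        if pos_a & pos_b:
            labels[(a, b)] = 1
        elif (pos_a & neg_b) or (pos_b & neg_a):
            labels[(a, b)] = 0
    return labels


def leave_one_tissue_out(tissues):
    """Yield (training tissues, held-out tissue) for each tissue in turn."""
    for held_out in tissues:
        yield [t for t in tissues if t != held_out], held_out


has_term, not_term = isoform_annotations(gene_to_isoforms, gene_annotations)
pair_labels = label_pairs(has_term, not_term)
splits = list(leave_one_tissue_out(["heart", "liver", "brain"]))
```

In this toy example the multi-isoform gene contributes no labels, because its gene-level annotation cannot be assigned to a specific isoform. Each split holds out one tissue, which is one plausible reading of the leave-one-tissue-out approach named in the abstract.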

Next, we describe mRNA Function Recommendation System (mFRecSys), a recommendation system for making tissue-specific function recommendations for mRNA isoforms. In mFRecSys, we consider mRNA isoforms as “users” and Gene Ontology biological process terms as “items”. By using explicit contexts for mRNA isoforms, Gene Ontology biological process terms and tissue-specific mRNA isoform expression, mFRecSys is able to make tissue-specific mRNA isoform function recommendations.
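The “users” and “items” framing can be sketched with a small neighbourhood-based recommender in Python. This is not the mFRecSys model, whose use of contexts is not specified here. The feature vectors, expression values, isoform names and the rule that only isoforms expressed in the same tissue vote are assumptions chosen to show how tissue expression can act as context.

```python
import math

# Hypothetical toy data; identifiers and values are illustrative only.
# "Users" are mRNA isoforms, "items" are GO biological process terms.
known_terms = {
    "iso1": {"GO:0006915", "GO:0008283"},
    "iso2": {"GO:0006915"},
    "iso3": {"GO:0007049"},
}
# Assumed sequence-derived feature vectors used as isoform context.
features = {
    "iso1": [1.0, 0.0, 1.0],
    "iso2": [1.0, 0.0, 0.8],
    "iso3": [0.0, 1.0, 0.0],
    "query": [1.0, 0.1, 0.9],
}
# Assumed expression of each isoform per tissue (arbitrary units).
expression = {
    "heart": {"iso1": 10.0, "iso2": 0.0, "iso3": 5.0, "query": 8.0},
    "liver": {"iso1": 0.0, "iso2": 0.0, "iso3": 7.0, "query": 0.0},
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(query, tissue, top_n=2):
    """Rank GO terms for `query` in `tissue` by similarity-weighted votes."""
    expr = expression[tissue]
    if expr.get(query, 0.0) <= 0.0:
        return []  # no recommendation where the isoform is not expressed
    scores = {}
    for iso, terms in known_terms.items():
        if expr.get(iso, 0.0) <= 0.0:
            continue  # only neighbours expressed in the same tissue vote
        weight = cosine(features[query], features[iso])
        for term in terms:
            scores[term] = scores.get(term, 0.0) + weight
    # Sort by descending score, breaking ties by term identifier.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_n]]


heart_terms = recommend("query", "heart")
liver_terms = recommend("query", "liver")
```

With these toy values the query isoform receives recommendations in heart, where it and its closest neighbour are expressed, and none in liver, where it is not expressed.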

This work emphasizes the significance of incorporating diverse biological context to develop better machine learning tools for biology. It also highlights the use of simplified supervised learning methods for biological network prediction. The machine learning models and recommendation systems developed as part of this work also draw attention to the power of simple mRNA isoform sequence-based predictors to improve mRNA isoform function prediction. The methods developed have potential practical applications, for instance as predictive models for distinguishing the functions of different mRNA isoforms of the same gene or identifying tissue-specific functions of mRNA isoforms.

Copyright
2019-08-01