Date of Award
Doctor of Philosophy
Electrical and Computer Engineering
Bioinformatics and Computational Biology
Julie A. Dickerson
High-throughput technologies produce genome-scale transcriptomic and metabolomic (omics) datasets that allow for the system-level studies of complex biological processes. The limitation lies in the small number of samples versus the larger number of features represented in these datasets. Machine learning methods can help integrate these large-scale omics datasets and identify key features from each dataset. A novel class dependent feature selection method integrates the F statistic, maximum relevance binary particle swarm optimization (MRBPSO), and class dependent multi-category classification (CDMC) system. A set of highly differentially expressed genes are pre-selected using the F statistic as a filter for each dataset. MRBPSO and CDMC function as a wrapper to select desirable feature subsets for each class and classify the samples using those chosen class-dependent feature subsets. The results indicate that the class-dependent approaches can effectively identify unique biomarkers for each cancer type and improve classification accuracy compared to class independent feature selection methods. The integration of transcriptomics and metabolomics data is based on a classification framework. Compared to principal component analysis and non-negative matrix factorization based integration approaches, our proposed method achieves 20-30% higher prediction accuracies on Arabidopsis tissue development data. Metabolite-predictive genes and gene-predictive metabolites are selected from transcriptomic and metabolomic data respectively. The constructed gene-metabolite correlation network can infer the functions of unknown genes and metabolites. Tissue-specific genes and metabolites are identified by the class-dependent feature selection method. Evidence from subcellular locations, gene ontology, and biochemical pathways support the involvement of these entities in different developmental stages and tissues in Arabidopsis.
Zhou, Wengang, "Machine learning methods for omics data integration" (2011). Graduate Theses and Dissertations. 12238.