Degree Type


Date of Award


Degree Name

Doctor of Philosophy


Computer Science


Computer Science

First Advisor

Jin Tian


The history of science is the history of finding true belief from observations. Humans build knowledge from observations, experiences, and other knowledge. Sometimes it is difficult to understand a phenomenon through superficial observation alone. Observed data is the key to solving the problems around us. As the size and complexity of data grow, so does the need to process the data accurately. There is now an opportunity to apply artificial intelligence and machine learning approaches to large-scale multi-omic data sets in new areas such as agriculture and crop improvement. In this study, we focus on maize (Zea mays L.) genomic data and apply machine learning approaches to discover meaningful features from it. First, we explore the relationship between phenotype and genotype in the maize genome by building a database of images and genomes called MaizeDIG. Second, we apply the k-mer concept to construct machine learning frameworks for predicting gene expression from gene sequences. We describe two k-mer Naive Bayes classifiers for genomic sequence classification: k-mer Naive Bayes (NB(k)) and two-phase k-mer Naive Bayes (tNB(k)). Finally, we extend the NB(k) and tNB(k) methods and propose two new ways to represent a sequence: the k-mer distance model and the reduced k-mer alphabet model.
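As an illustration of the k-mer Naive Bayes idea described above, the following minimal sketch trains a multinomial Naive Bayes classifier on overlapping k-mer counts. It is not the thesis implementation of NB(k) or tNB(k): the class name `KmerNaiveBayes`, the Laplace smoothing constant `alpha`, the choice to skip k-mers unseen in training, and the toy sequences and labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


class KmerNaiveBayes:
    """Multinomial Naive Bayes over k-mer counts (illustrative sketch only)."""

    def __init__(self, k, alpha=1.0):
        self.k = k
        self.alpha = alpha  # assumed Laplace smoothing constant

    def fit(self, sequences, labels):
        self.class_counts = Counter(labels)
        self.kmer_counts = defaultdict(Counter)
        self.vocab = set()
        for seq, label in zip(sequences, labels):
            for km in kmers(seq, self.k):
                self.kmer_counts[label][km] += 1
                self.vocab.add(km)
        return self

    def log_scores(self, seq):
        """Log prior plus summed log likelihood of each k-mer, per class."""
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.kmer_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n / total)
            for km in kmers(seq, self.k):
                if km in self.vocab:  # assumption: ignore k-mers never seen in training
                    score += math.log((counts[km] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, seq):
        scores = self.log_scores(seq)
        return max(scores, key=scores.get)


# Toy data with hypothetical labels; real input would be gene sequences.
train_seqs = ["AAAAAA", "AAAAAT", "CCCCCC", "CCCCCG"]
train_labels = ["expressed", "expressed", "not_expressed", "not_expressed"]
model = KmerNaiveBayes(k=2).fit(train_seqs, train_labels)
prediction = model.predict("AAAA")
```

The independence assumption of Naive Bayes is what lets the per-class score be a simple sum of log probabilities over the k-mers of the query sequence.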

NB(k) considers only the relative frequencies of the k-mers over the sequence alphabet under the Naive Bayes classifier. To better represent the complexity of protein structure, we propose a new k-mer distance model. We construct a distance matrix from the distances between all pairs of k-mers in a sequence; this matrix can be used to measure or compare the similarity of two gene sequences. Because the size of the amino acid alphabet affects complexity and therefore limits the k-mer size, we also propose the reduced k-mer alphabet model. Instead of basing the NB(k) approach on the traditional 20 amino acids, we apply the k-mer method to smaller groupings, either based on the physico-chemical properties of amino acids or generated at random. Our machine learning approaches allow researchers to predict when and where genes are expressed in the absence of experimental data and to link genomic and phenotypic data. These predictions and linkages can be used to better understand the relationship between the genes in a plant and the traits observed in farmers' fields.
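To make the reduced k-mer alphabet idea concrete, the sketch below maps the 20 amino acids onto four groups and then counts k-mers over the reduced sequence. The grouping shown (nonpolar, polar uncharged, basic, acidic) is a common textbook classification chosen only for illustration; it is an assumption and not necessarily one of the groupings used in the thesis, and the names `GROUPS`, `reduce_sequence`, and `kmer_counts` are hypothetical.

```python
from collections import Counter

# Assumed illustrative grouping by physico-chemical class:
# h = nonpolar, p = polar uncharged, b = basic, a = acidic.
GROUPS = {
    "h": "AVLIMFWPG",
    "p": "STCYNQ",
    "b": "KRH",
    "a": "DE",
}

# Lookup table from each amino acid letter to its group symbol.
REDUCE = {aa: symbol for symbol, members in GROUPS.items() for aa in members}


def reduce_sequence(protein):
    """Rewrite a protein sequence over the reduced alphabet."""
    return "".join(REDUCE[aa] for aa in protein)


def kmer_counts(seq, k):
    """Count overlapping k-mers in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


reduced = reduce_sequence("MKDE")
counts = kmer_counts(reduced, 2)

# Number of possible k-mers for k = 3 under each alphabet.
full_space = 20 ** 3
reduced_space = len(GROUPS) ** 3
```

With four groups, the number of possible 3-mers drops from 20^3 = 8000 to 4^3 = 64, which is why a smaller alphabet permits longer k-mers for the same feature-space size.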


Copyright Owner

Kyoung Tak Cho



File Format


File Size

129 pages