Degree Type


Date of Award


Degree Name

Doctor of Philosophy


Computer Science


Bioinformatics and Computational Biology

First Advisor

Steven Cannon


Gene families are groups of genes that have descended from a common ancestral gene present in the species under study. Current, widely used gene family building algorithms are prone to producing incomplete families (under-clustering) or families containing wrong or non-family sequences (over-clustering). In this work, we present a sequence-pair-classification-based method that first inspects given families for under-clustering and then predicts the missing sequences for those families using family-specific alignment score cutoffs. We test this method on a set of curated, gold-standard families from the Yeast Gene Order Browser (YGOB) database, covering 20 yeast species. To check whether the method can detect and correct incomplete families obtained with existing family building methods, we also test it on under-clustered yeast families produced using the OrthoFinder tool. We demonstrate the utility of the pair-classification method in merging small, fragmented legume families, built using the OrthoFinder tool from 14 legume species belonging to subfamily Papilionoideae of the plant family Leguminosae, into larger families. We provide recommendations on the different types of family-specific alignment score cutoffs that can be used for predicting the missing sequences, based on the "purity" of the under-clustered families and the chosen precision and recall for prediction. Finally, we provide a containerized version of the pair-classification method that can be applied to any given set of gene families.
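As an illustration of what a family-specific alignment score cutoff chosen for a target precision and recall might look like, the following is a minimal sketch, not the method implemented in this work. The function name `pick_cutoff`, its parameters, and the toy scores are all hypothetical. It assumes that, for one family, alignment scores are available for known within-family pairs (`pos_scores`) and for family versus non-family pairs (`neg_scores`).

```python
from typing import List, Optional, Tuple


def pick_cutoff(
    pos_scores: List[float],
    neg_scores: List[float],
    min_precision: float = 0.95,
) -> Optional[Tuple[float, float, float]]:
    """Hypothetical helper: scan candidate cutoffs for one family.

    A pair is predicted to belong to the family when its score >= cutoff.
    Returns (cutoff, precision, recall) for the cutoff with the highest
    recall among those reaching min_precision, or None if none qualifies.
    """
    best = None
    # Every observed score is a candidate cutoff, scanned in ascending order.
    for cutoff in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= cutoff for s in pos_scores)  # family pairs kept
        fp = sum(s >= cutoff for s in neg_scores)  # non-family pairs kept
        if tp == 0:
            continue  # precision is undefined or zero; skip
        precision = tp / (tp + fp)
        recall = tp / len(pos_scores)
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (cutoff, precision, recall)
    return best


# Toy scores (illustrative values only, not data from this work).
family_pairs = [120, 150, 90, 200]
non_family_pairs = [40, 95, 60]
strict = pick_cutoff(family_pairs, non_family_pairs, min_precision=0.95)
relaxed = pick_cutoff(family_pairs, non_family_pairs, min_precision=0.8)
```

Lowering `min_precision` in this sketch trades precision for recall: the stricter setting keeps only pairs scoring above every non-family pair, while the relaxed setting admits a lower cutoff that recovers all family pairs at the cost of one false positive.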

In addition to the pair-based classification method, we present a simple hidden Markov model (HMM)-based protocol for merging fragmented families and a phylogeny-based protocol for detecting and splitting over-clustered families. We apply these methods to improve the legume gene families built from 14 legume species belonging to subfamily Papilionoideae of the plant family Leguminosae. These families were built using a custom family building method that uses differences at synonymous sites (Ks) in the gene sequences to capture the family clusters defined by the whole-genome duplication that occurred in the most recent common ancestor of the subfamily. We analyze the improvements in the legume families obtained after applying the merging and splitting procedures by comparing the protein domain compositions of the new families against those of the original families. We also provide containerized versions of the family merging, splitting, and scoring methods, along with the new set of improved legume families.

We investigate the occurrence of whole-genome duplication events within the Cercidoideae subfamily of the plant family Leguminosae, using evolutionary, phylogenomic, and synteny analyses together with analysis of chromosome counts, from a diverse set of legume species. Based on diverse evidence, we conclude that one of the slow-evolving lineages within Cercidoideae may be unique among legumes in lacking evidence of an independent whole-genome duplication and can be a useful genomic model for the legumes. We are able to show that the genome duplication observed in the other sister lineage within Cercidoideae is most likely due to allotetraploidy involving hybridization between two progenitor species that existed in the Cercidoideae subfamily.

We present a method for tracking protein domain changes in a selected set of species with known phylogenetic relationships, by defining domains as "features" or "descriptors," and considering the species (target + outgroup) as instances or data points in a domain feature matrix. Protein domains can be regarded as sections of protein sequences that are capable of folding independently and performing specific functions, and that enable protein sequences to evolve through domain shuffling events such as domain insertion, deletion, or duplication. We look for features (domains) that differ significantly between the target species and the outgroup species using a feature selection technique based on mutual information (MI) and non-parametric statistical tests (Fisher's exact test/Wilcoxon rank-sum test). We study the domain changes in two large, distinct groups of plant species, legumes (Fabaceae) and grasses (Poaceae), with respect to selected outgroup species, using four types of domain feature matrices: domain content, domain duplication, domain abundance, and domain versatility. The four types of matrices attempt to capture different ways in which protein sequences may evolve through domain changes: gain or loss of domains, increase or decrease in the copy number of domains along the sequences, expansion or contraction of domains, or changes in the number of adjacent domain partners. We report and study the biological functions of the top selected domains from all four feature matrices. In addition, we perform domain-centric Gene Ontology (dcGO) enrichment analysis on all selected domains from all the feature matrices to study the Gene Ontology terms associated with the significantly changing domains in legumes and grasses. We provide a Docker container that can be used to perform this analysis on any user-defined sets of species.
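To make the feature-matrix idea concrete, the sketch below scores a single domain column from a presence/absence ("domain content") matrix using mutual information and a two-sided Fisher's exact test. It is only an illustration under stated assumptions, not the pipeline shipped in the container: the function names, the binary encoding (1 = target species or domain present, 0 = outgroup species or domain absent), and the toy vectors are hypothetical, and both statistics are written with the Python standard library rather than any particular package.

```python
from math import comb, log2
from typing import List, Sequence


def mutual_information(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Mutual information (in bits) between a discrete feature and class labels."""
    n = len(feature)
    mi = 0.0
    for f in set(feature):
        for l in set(labels):
            joint = sum(1 for x, y in zip(feature, labels) if x == f and y == l) / n
            if joint > 0:
                p_f = sum(1 for x in feature if x == f) / n
                p_l = sum(1 for y in labels if y == l) / n
                mi += joint * log2(joint / (p_f * p_l))
    return mi


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x: int) -> float:
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    p_obs = prob(a)
    lo, hi = max(0, row1 - (n - col1)), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


def contingency(feature: Sequence[int], labels: Sequence[int]) -> List[int]:
    """Counts [a, b, c, d]: rows = target/outgroup, columns = present/absent."""
    a = sum(1 for x, y in zip(feature, labels) if y == 1 and x == 1)
    b = sum(1 for x, y in zip(feature, labels) if y == 1 and x == 0)
    c = sum(1 for x, y in zip(feature, labels) if y == 0 and x == 1)
    d = sum(1 for x, y in zip(feature, labels) if y == 0 and x == 0)
    return [a, b, c, d]


# Toy column (illustrative only): a domain present in all four target
# species and absent from all four outgroup species.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
domain_present = [1, 1, 1, 1, 0, 0, 0, 0]
mi_bits = mutual_information(domain_present, labels)
p_value = fisher_exact_two_sided(*contingency(domain_present, labels))
```

In this sketch, a domain whose presence perfectly separates target from outgroup species reaches the maximum mutual information for a balanced binary label, while a domain distributed independently of the grouping scores zero. For count-valued matrices such as domain abundance, a rank-based test like the Wilcoxon rank-sum test named above would be the more natural choice than Fisher's exact test.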


Copyright Owner

Akshay Yadav



File Format


File Size

155 pages