Date of Award
Doctor of Philosophy
Bioinformatics and Computational Biology
Recombination is a powerful weapon in the evolutionary arsenal of retroviruses such as HIV. It enables the production of chimeric variants, or recombinants, that may confer a selective advantage to the pathogen over the host immune response. Recombinants further accentuate differences in virulence, disease progression and drug resistance mutation patterns already observed in non-recombinant variants of HIV. This thesis describes three contributions: the development of a rapid genotyper for HIV sequences employing supervised learning algorithms and its application to complex HIV recombinant data; the application of a hierarchical model for detection of recombination hotspots in the HIV-1 genome; and the extension of this model to enable estimation of the association between recombination probabilities and covariates of interest.
The rapid genotyper for HIV-1 explores a solution to the genotyping problem in the machine learning paradigm. Of the algorithms tested, the genotyper built using Bayesian additive regression trees (BART) was most successful in efficiently classifying complex recombinants that pose a challenge to other currently available genotyping methods. We also developed a novel method, bootSMOTE, for generating synthetic data in order to supplement insufficient training data. We found that supplementation with synthetic recombinants especially boosts identification of complex recombinants. We describe the genotyper software, which is available for download, as well as a web interface enabling rapid classification of HIV-1 sequences.
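The idea of supplementing scarce training data with synthetic examples can be illustrated with a minimal sketch of the generic SMOTE interpolation step. This is not bootSMOTE, whose details are not given in this abstract; the function name `smote_like_samples`, the numeric feature vectors and the parameter `k` are all assumptions made for illustration only.

```python
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority-class
    point and one of its k nearest minority-class neighbours.

    Generic SMOTE-style sketch; NOT the bootSMOTE method of the thesis.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest neighbours of base, excluding base itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sq_dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical numeric feature vectors for a rare recombinant class.
rare_class = [[0.10, 0.90], [0.20, 0.80], [0.15, 0.95], [0.30, 0.70]]
new_points = smote_like_samples(rare_class, n_new=5)
```

Each synthetic point lies on the segment between two real examples of the same class, so the added data stay inside the region already occupied by that class.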
Hotspots for recombination in the HIV-1 genome are modeled using spatially smoothed changepoint processes. This hierarchical model uses a phylogenetic recombination detection model of dual changepoint processes at the lower level. The upper level applies a Gaussian Markov random field (GMRF) hyperprior to population-level recombination probabilities in order to efficiently combine the information from many individual recombination events as inferred at the lower level. Focusing on 544 unique recombinant sequences, we found a novel hotspot in the pol gene of HIV-1 while confirming the presence of high recombination activity in the env gene.
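As a rough illustration of how a GMRF prior induces spatial smoothing, one common choice is the first-order random-walk form shown here. The abstract does not specify the exact parameterization used in the thesis, so the symbols are assumptions: $\theta_s$ denotes a transformed recombination probability at genomic site $s$ (for example on the log-odds scale), $S$ is the number of sites and $\tau$ is a precision parameter.

```latex
p(\boldsymbol{\theta} \mid \tau) \;\propto\; \tau^{(S-1)/2}
\exp\!\left( -\frac{\tau}{2} \sum_{s=1}^{S-1} \left( \theta_{s+1} - \theta_{s} \right)^{2} \right)
```

Under a prior of this kind, neighbouring sites are penalized for having very different values, so evidence from many individual recombination events is pooled locally along the genome, and larger $\tau$ gives a smoother profile.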
Valuable insights into the molecular mechanism of recombination may be gained by extending the GMRF model to include covariates of interest. We add a level to the hierarchical model and allow for the simultaneous inference of recombination probabilities as well as their association with genomic covariates of interest. Using a set of 527 unique recombinants, we confirmed the presence of the pol hotspot. Interestingly, we found significant positive associations of spatial fluctuations in recombination probabilities with genomic regions prone to forming secondary structure, as well as significant negative associations with regions that support tight RNA-DNA hybrid formation. Overall, our results support the theory that pause sites along the genome promote recombination.
Rajaram, Misha, "Detecting recombination and its mechanistic association with genomic features via statistical models" (2010). Graduate Theses and Dissertations. 11248.