Detecting recombination and its mechanistic association with genomic features via statistical models

Date
2010-01-01
Authors
Rajaram, Misha
Major Professor
Advisor
Karin Dorman
Dennis Lavrov
Committee Member
Department
Statistics
Abstract

Recombination is a powerful weapon in the evolutionary arsenal of retroviruses such as HIV. It enables the production of chimeric variants, or recombinants, that may confer a selective advantage to the pathogen over the host immune response. Recombinants further accentuate the differences in virulence, disease progression and drug resistance mutation patterns already observed in non-recombinant variants of HIV. This thesis describes three contributions: the development of a rapid genotyper for HIV sequences that employs supervised learning algorithms, together with its application to complex HIV recombinant data; the application of a hierarchical model to detect recombination hotspots in the HIV-1 genome; and an extension of this model that estimates the association between recombination probabilities and covariates of interest.

The rapid genotyper for HIV-1 approaches the genotyping problem within the machine learning paradigm. Of the algorithms tested, the genotyper built using Bayesian additive regression trees (BART) was the most successful at efficiently classifying complex recombinants, which pose a challenge to other currently available genotyping methods. We also developed a novel method, bootSMOTE, for generating synthetic data to supplement insufficient training data. We found that supplementation with synthetic recombinants especially boosts the identification of complex recombinants. We describe the genotyper software, which is available for download, as well as a web interface that enables rapid classification of HIV-1 sequences.
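
The abstract does not describe how bootSMOTE works, so the following is only a minimal sketch of the generic SMOTE idea (interpolating between a training example and one of its nearest neighbours) that the method's name alludes to. It is not the thesis's algorithm. The function name `smote_like_samples`, its parameters, and the use of numeric feature vectors are all assumptions made for illustration.

```python
import random


def smote_like_samples(rows, n_new, k=3, seed=0):
    """Generic SMOTE-style oversampling (illustrative; not bootSMOTE).

    Each synthetic row lies on the line segment between a randomly chosen
    row and one of its k nearest neighbours in Euclidean distance.
    """
    rng = random.Random(seed)
    rows = [[float(v) for v in r] for r in rows]
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two rows to interpolate")
    k = min(k, n - 1)

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # For every row, record the indices of its k nearest other rows.
    neighbours = []
    for i, a in enumerate(rows):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sqdist(a, rows[j]),
        )
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(n)            # base row
        j = rng.choice(neighbours[i])   # one of its neighbours
        gap = rng.random()              # interpolation weight in [0, 1)
        synthetic.append(
            [a + gap * (b - a) for a, b in zip(rows[i], rows[j])]
        )
    return synthetic
```

In a setting like the one described here, `rows` would hold feature vectors for the under-represented recombinant class, and the synthetic rows would be appended to the training set before fitting a classifier.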

Hotspots for recombination in the HIV-1 genome are modeled using spatially smoothed changepoint processes. This hierarchical model uses a phylogenetic recombination detection model of dual changepoint processes at the lower level. The upper level applies a Gaussian Markov random field (GMRF) hyperprior to population-level recombination probabilities in order to efficiently combine the information from many individual recombination events as inferred at the lower level. Focusing on 544 unique recombinant sequences, we found a novel hotspot in the pol gene of HIV-1 while confirming the presence of high recombination activity in the env gene.
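
To illustrate what spatial smoothing through a GMRF prior means, the sketch below evaluates the unnormalised log-density of a first-order random-walk GMRF placed on the logit of site-wise recombination probabilities. The abstract does not state the order or parameterisation of the GMRF used in the thesis, so the first-order form, the logit scale, the precision parameter `tau`, and the function names are all assumptions.

```python
import math


def logit(p):
    """Map a probability in (0, 1) to the real line."""
    return math.log(p / (1.0 - p))


def rw1_log_prior(probs, tau):
    """Unnormalised log-density of a first-order random-walk GMRF.

    The prior penalises squared differences between neighbouring sites on
    the logit scale, so smoothly varying probability profiles score higher.
    This is an illustrative assumption, not the thesis's exact prior.
    """
    theta = [logit(p) for p in probs]
    n = len(theta)
    penalty = sum((theta[i + 1] - theta[i]) ** 2 for i in range(n - 1))
    return 0.5 * (n - 1) * math.log(tau) - 0.5 * tau * penalty


smooth = [0.1, 0.1, 0.1, 0.1]
rough = [0.1, 0.5, 0.1, 0.5]
# The smooth profile receives the higher prior log-density.
print(rw1_log_prior(smooth, tau=2.0) > rw1_log_prior(rough, tau=2.0))
```

Under such a prior, a hotspot shows up only when the data from many recombinants jointly outweigh the penalty for departing from neighbouring sites.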

Valuable insights into the molecular mechanism of recombination may be gained by extending the GMRF model to include covariates of interest. We add a level to the hierarchical model to allow simultaneous inference of recombination probabilities and their association with genomic covariates of interest. Using a set of 527 unique recombinants, we confirmed the presence of the pol hotspot. Interestingly, we found significant positive associations of spatial fluctuations in recombination probabilities with genomic regions prone to forming secondary structure, as well as significant negative associations with regions that support tight RNA-DNA hybrid formation. Overall, our results support the theory that pause sites along the genome promote recombination.
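
One common way to attach covariates to a spatially smoothed probability is a logistic regression with a GMRF-distributed residual. The abstract does not give the thesis's formulation, so the form below is an assumed illustration. The symbols $p_i$ (recombination probability at site $i$), $\mathbf{x}_i$ (covariate vector), $\boldsymbol{\beta}$ (association coefficients), $\theta_i$ (spatial effect), and $\tau$ (precision) are all hypothetical notation.

```latex
\operatorname{logit}(p_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \theta_i,
\qquad
\pi(\boldsymbol{\theta} \mid \tau) \propto
\tau^{(n-1)/2}
\exp\!\left( -\frac{\tau}{2} \sum_{i=1}^{n-1} (\theta_{i+1} - \theta_i)^2 \right)
```

In this reading, a positive coefficient $\beta_j$ corresponds to a positive association between covariate $j$ and recombination probability, as reported above for secondary structure. A negative coefficient corresponds to a negative association, as reported for tight RNA-DNA hybrid formation.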

Copyright
2010