Algorithms for synteny-based phylostratigraphy and gene origin classification

Date
2019-01-01
Authors
Arendsee, Zebulun
Major Professor
Advisor
Eve S. Wurtele
Department
Genetics, Development and Cell Biology
Abstract

With every newly sequenced species, we discover hundreds of novel protein-coding genes. Many of these "orphan" genes have been shown experimentally to have dramatic functions in development, sexual dimorphism, pathogen resistance, and social traits such as symbiosis. Whereas researchers once viewed genes as the product of continuous variation acting on ancient material, we now know that novel genes may arise de novo from non-genic sequence. Thus, evolutionary experimentation is not limited to tweaking existing genes or their regulatory patterns. Any orphan genes that arose in the distant past should appear today as lineage-specific genes (or gene families). Classifying genes by their relative time of origin is called "phylostratigraphy". Phylostratigraphy has proven challenging, however, with different methodologies often yielding contradictory conclusions. Standard phylostratigraphy infers the age of a gene by finding the most distant species that has an inferred homolog. This approach is highly sensitive to annotation quality and cannot easily distinguish rapidly evolving genes from genes of de novo origin.
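The standard approach described above can be illustrated with a minimal sketch. It is not the phylostratr implementation; the function name, the data layout, and the example species are assumptions chosen for illustration. Each stratum lists the species that branch off at that level of the focal species' lineage, ordered from oldest to youngest, and a gene is assigned to the oldest stratum that contains a species with an inferred homolog.

```python
from typing import List, Set, Tuple

# A stratum is (clade name, species that diverge from the focal lineage at
# that level). This layout is a hypothetical simplification.
Stratum = Tuple[str, Set[str]]


def assign_phylostratum(strata: List[Stratum], hits: Set[str]) -> str:
    """Return the oldest stratum containing a species with an inferred homolog.

    `strata` must be ordered oldest first, ending with the focal species.
    `hits` is the set of species in which a homolog of the gene was inferred.
    """
    for name, species in strata:
        if species & hits:
            return name
    # No homolog outside the focal species: the gene is species-specific.
    return strata[-1][0]


# Illustrative lineage for a focal plant species (assumed, not from the source).
STRATA: List[Stratum] = [
    ("cellular organisms", {"Escherichia coli"}),
    ("Eukaryota", {"Homo sapiens", "Saccharomyces cerevisiae"}),
    ("Viridiplantae", {"Chlamydomonas reinhardtii"}),
    ("Brassicaceae", {"Brassica rapa"}),
    ("Arabidopsis thaliana", {"Arabidopsis thaliana"}),
]

ancient = assign_phylostratum(STRATA, {"Brassica rapa", "Saccharomyces cerevisiae"})
family_specific = assign_phylostratum(STRATA, {"Brassica rapa"})
orphan = assign_phylostratum(STRATA, set())
```

The sketch also shows why the method is sensitive to data quality: a single missed homolog in a distant species (for example, from an incomplete annotation) moves the gene to a younger stratum, and a rapidly evolving gene whose distant homologs are no longer detectable is indistinguishable from a gene of de novo origin.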

This dissertation contributes a suite of tools for more accurately determining the phylostratigraphic age of genes and the level of support for each classification. First, we developed phylostratr to automate standard phylostratigraphy. Second, we developed a program, synder, to infer syntenic homologs of query features using a synteny map. Third, we developed fagin, a package that builds on synder to search query genes against related species for traces of genic or non-genic orthology. The pipeline can distinguish orphans that are well supported by the data from apparent orphans that result from poor assembly or missing data. We traced many orphans to their non-genic cousins, identifying the non-genic footprint from which they arose. We linked others to putative genes in related species from which they diverged beyond recognition. Knowing the approximate location of each gene across species, together with the amount of data support, provides a launching point for future orphan studies.

Copyright
2019-05-01