SearchTree: Mining robust phylogenetic trees

Deepak, Akshay

SearchTree: Mining robust phylogenetic trees

File

Deepak_iastate_0097M_11062.pdf (673.58 KB)

Date

2010-01-01

Authors

Deepak, Akshay

Advisor

David Fernandez-baca

Altmetrics

Organizational Units

Organizational Unit

Computer Science

Computer Science—the theory, representation, processing, communication and use of information—is fundamentally transforming every aspect of human endeavor. The Department of Computer Science at Iowa State University advances computational and information sciences through; 1. educational and research programs within and beyond the university; 2. active engagement to help define national and international research, and 3. educational agendas, and sustained commitment to graduating leaders for academia, industry and government.

History
The Computer Science Department was officially established in 1969, with Robert Stewart serving as the founding Department Chair. Faculty were composed of joint appointments with Mathematics, Statistics, and Electrical Engineering. In 1969, the building which now houses the Computer Science department, then simply called the Computer Science building, was completed. Later it was named Atanasoff Hall. Throughout the 1980s to present, the department expanded and developed its teaching and research agendas to cover many areas of computing.

Dates of Existence
1969-present

Related Units

College of Liberal Arts and Sciences (parent college)

Department

Computer Science

Abstract

Phylogenetic trees providing high quality information and at the same time covering large number of species are essential for comparative biology. It is a widely accepted fact that with the currently available resources we are far from assembling one completely sampled phylogenetic tree for all life (or one based on a very large subset of species), hence a need for an interim solution arises. Here we describe SearchTree, a software tool that allows users to query efficiently on an arbitrary user taxon list and returns high scoring matches from approximately one billion phylogenetic trees being constructed from molecular sequence data in GenBank. The core of SearchTree has two parts. The first is a pre-computed collection of phylogenetic species trees from GenBank sequence data consisting of approximately 10,000,000 data sets with 100 bootstrap trees for each set for a total of around 1 billion trees. The goal here is to ensure high `coverage' (i.e., each taxon occurring in many trees). The second part is the search-retrieval process. The goal is to quickly retrieve the clusters and the subsequent trees from the large data set described above, maximizing the scoring function for the resultant set of trees and all the while keeping computational resources within a limit. Both parts were dealt separately due to their complexity; here we focus on the second part.

The complete pre-computed data set of phylogenetic trees will be around 500 GB. Fast response times are achieved by SearchTree through a combination of techniques from information retrieval, notably inverted indexing, and from computational phylogenetics, especially for constructing consensus trees. The use of Redwood cluster, an advanced hardware configuration specifically tuned for this kind of work, has further improved the query times by 100%.

Copyright

Fri Jan 01 00:00:00 UTC 2010

Collections

Theses and Dissertations

Full item page