Degree Type


Date of Award


Degree Name

Doctor of Philosophy


Electrical and Computer Engineering


Bioinformatics and Computational Biology

First Advisor

Srinivas Aluru

Second Advisor

Patrick Schnable


Next-generation sequencing (NGS) has revolutionized genomic data generation by enabling high-throughput parallel sequencing. It makes it possible to sequence new genomes or re-sequence individual genomes at many times lower cost and in an order of magnitude less time than traditional Sanger sequencing. With NGS technologies, ambitious genomic sequencing projects target many organisms rather than a few, and large-scale studies of sequence variation become feasible. As a result, data analysis methodologies are changing across applications. For example, in genome assembly, approaches based on de Bruijn graphs or string graphs are replacing the traditional overlap-layout-consensus paradigm, and in gene expression analysis, computational pipelines that locate and count short reads per gene location on the reference genome are replacing microarrays.

In this context, efficient analysis of large-scale datasets is one of the most challenging problems. In this thesis, we design efficient algorithms to improve read quality for next-generation sequencing and explore emerging cloud computing techniques to cluster large numbers of metagenomic reads. First, we develop an efficient algorithm that uses a flexible read decomposition method to improve the accuracy of error correction, and demonstrate its applicability on standard Illumina sequencing runs. We further propose a statistical framework to differentiate infrequently observed subreads from sequencing errors when genomic repeats are prevalent. To differentiate between valid and invalid substrings based on their genomic frequency, we propose a statistical approach that estimates a frequency-related threshold from the dataset under study. Lastly, we formalize the task of quantifying microbial organisms in environmental samples as a sequence clustering problem and develop a parallel solution that integrates sketching, quasi-clique enumeration, and MapReduce techniques. The implementation uses Hadoop, a MapReduce framework for cloud computing.
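
The idea of separating valid from invalid substrings by frequency can be illustrated with a minimal Python sketch. It is not the thesis's algorithm: the function names `count_kmers` and `split_by_threshold` are hypothetical, and the threshold is a fixed, hand-picked value, whereas the abstract describes estimating it statistically from the dataset.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def split_by_threshold(counts, threshold):
    """Split k-mers into frequently seen ones and rarely seen ones.

    ASSUMPTION: the threshold is supplied by the caller. The thesis
    instead proposes estimating it from the data; that estimation is
    not reproduced here.
    """
    trusted = {kmer for kmer, c in counts.items() if c >= threshold}
    suspect = set(counts) - trusted
    return trusted, suspect


# Toy input: the third read differs from the first two at one position.
reads = ["ACGTAC", "ACGTAC", "ACGTTC"]
counts = count_kmers(reads, k=4)
trusted, suspect = split_by_threshold(counts, threshold=2)
```

In this toy input, the k-mers that overlap the differing base each occur once and fall below the threshold, so they are flagged as suspect. On real data with genomic repeats, a single fixed cutoff would be too crude, which is the motivation the abstract gives for a dataset-specific threshold.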


Copyright Owner

Xiao Yang



Date Available


File Format


File Size

129 pages