Date of Award
Doctor of Philosophy
Electrical and Computer Engineering
Bioinformatics and Computational Biology
Next generation sequencing (NGS) has revolutionized genomic data generation by enabling high-throughput parallel sequencing. This makes it possible to sequence new genomes or re-sequence individual genomes at a fraction of the cost and in an order of magnitude less time than traditional Sanger sequencing. With NGS technologies, ambitious genome sequencing projects can target many organisms rather than a few, and large-scale studies of sequence variation become feasible. This revolution is changing data analysis methodologies across applications: in genome assembly, de Bruijn graph and string graph based approaches are replacing the traditional overlap-layout-consensus paradigm; in gene expression analysis, computational pipelines that locate and count short reads per gene location on the reference genome are replacing microarrays; and so on.
In this context, efficient analysis of large-scale datasets is one of the most challenging problems. In this thesis, we design efficient algorithms to improve read quality for next generation sequencing and explore emerging cloud computing techniques to cluster large numbers of metagenomic reads. First, we develop an efficient algorithm that uses a flexible read decomposition method to improve the accuracy of error correction, and demonstrate its applicability using standard runs of Illumina sequencing. We further propose a statistical framework to differentiate infrequently observed subreads from sequencing errors when genomic repeats are prevalent. To differentiate between valid and invalid substrings based on their genomic frequency, we propose a statistical approach that estimates a frequency-related threshold from the dataset under study. Lastly, we formalize the task of quantifying microbial organisms in environmental samples as a sequence clustering problem and develop a parallel solution integrating sketching, quasi-clique enumeration and MapReduce techniques. The implementation is carried out using Hadoop, a MapReduce framework for cloud computing.
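The frequency-based distinction between valid and invalid substrings can be illustrated with a minimal Python sketch. It is not the algorithm developed in the thesis: it uses fixed-length substrings (k-mers) instead of a flexible read decomposition, and a hand-picked threshold instead of one estimated statistically from the dataset. The function names `count_kmers` and `flag_suspect_positions`, the toy reads, and the values of `k` and `threshold` are all assumptions made for illustration.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def flag_suspect_positions(read, counts, k, threshold):
    """Return start positions of k-mers observed fewer than `threshold` times.

    Rarely observed k-mers are treated as likely to contain a sequencing
    error. The threshold here is an illustrative constant (assumption).
    """
    return [
        i
        for i in range(len(read) - k + 1)
        if counts[read[i:i + k]] < threshold
    ]


# Toy data: three identical reads and one read with a substitution
# (A -> T) at index 4.
reads = ["ACGTACGT", "ACGTACGT", "ACGTACGT", "ACGTTCGT"]
counts = count_kmers(reads, k=4)
suspect = flag_suspect_positions(reads[3], counts, k=4, threshold=2)
print(suspect)
```

In this toy input, every k-mer of the last read that overlaps the substituted base occurs only once, so those k-mers fall below the threshold, while the k-mers of the error-free reads do not. With real data, repeats and uneven coverage blur this separation, which is why the choice of threshold matters.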
Yang, Xiao, "Error correction and clustering algorithms for next generation sequencing" (2011). Graduate Theses and Dissertations. 12253.