Degree Type


Date of Award


Degree Name

Doctor of Philosophy


Veterinary Microbiology and Preventive Medicine


Bioinformatics and Computational Biology; Statistics

First Advisor

Iddo Friedberg


Data collected in biological experiments come in all shapes and sizes, including DNA and protein sequences, mRNA counts, spatial interactions, protein annotations, and phenotypic images. To make sense of such varied data, novel statistical methods are needed not only to model the biological data but also to assess the accuracy of predictions. In this thesis, I present three research studies that apply statistical analysis to the benchmarking, assessment, and modelling of genetic data, demonstrating the diversity of bioinformatics research. The approach taken here is to tailor statistical methods to specific data types.

To provide quality benchmark data for phenotypic image processing and assessment, a generalized linear mixed-effects model was used to compare how effectively two groups of people (lay people recruited through Amazon Mechanical Turk and experts) highlighted key elements in phenotypic images collected from corn fields. The analyzed images were then used as ground truth for training and testing automated methods. We concluded that properly managed crowdsourcing can establish large volumes of viable, high-quality ground-truth data at low cost, especially in the context of high-throughput plant phenotyping.

To assess the quality of computational protein function predictions, the third Critical Assessment of Functional Annotation (CAFA) was launched to evaluate predictions in the form of a community challenge. Each protein is associated with multiple functions represented by Gene Ontology terms (labels). These ontological terms form a hierarchical structure, and they occur with non-uniform frequency across proteins. Precision-recall-based assessment metrics alone could not account for the non-uniform prior distribution of this multi-label problem, so semantic-distance-based methods were developed for better model assessment. We concluded that while predictions of molecular function and biological process annotations have improved slightly over time, those of cellular component have not. Term-centric prediction of experimental annotations remains equally challenging: although the top methods perform significantly better than the baseline methods, considerable room and need for improvement remain. The CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation databases, computational function prediction, and our ability to manage big data in the era of large experimental screens.
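As an illustration of why an information-weighted measure behaves differently from precision and recall on a hierarchical label space, the following minimal sketch compares the two on a toy ontology. The ontology, the information-content values, and the function names are hypothetical, and the semantic-distance form used here (remaining uncertainty and misinformation combined as a Euclidean norm) is an assumption drawn from the general CAFA literature, not a formula stated in this abstract.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents (the root has none).
PARENTS = {
    "root": set(),
    "binding": {"root"},
    "catalysis": {"root"},
    "dna_binding": {"binding"},
    "kinase": {"catalysis"},
}

# Hypothetical information content (bits) per term; rarer terms carry more.
IC = {"root": 0.0, "binding": 1.0, "catalysis": 1.5,
      "dna_binding": 3.0, "kinase": 2.5}


def propagate(terms):
    """Return the given terms together with all of their ancestors."""
    closed = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed


def precision_recall(true_terms, predicted_terms):
    """Unweighted precision and recall over propagated term sets."""
    truth, pred = propagate(true_terms), propagate(predicted_terms)
    hits = len(truth & pred)
    return hits / len(pred), hits / len(truth)


def semantic_distance(true_terms, predicted_terms):
    """Return (remaining uncertainty, misinformation, semantic distance).

    Assumed definitions: remaining uncertainty sums the information content
    of true terms that were missed, misinformation sums it over wrongly
    predicted terms, and the distance is the Euclidean norm of the two.
    """
    truth, pred = propagate(true_terms), propagate(predicted_terms)
    ru = sum(IC[t] for t in truth - pred)
    mi = sum(IC[t] for t in pred - truth)
    return ru, mi, math.hypot(ru, mi)


if __name__ == "__main__":
    truth = {"dna_binding"}
    prediction = {"binding", "kinase"}
    print(precision_recall(truth, prediction))
    print(semantic_distance(truth, prediction))  # (3.0, 4.0, 5.0)
```

In this sketch a missed specific term costs more than a missed general one, which is how the weighting reflects the non-uniform term frequencies.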

To model the spatial dependency of gene expression on the 3D structure of the genome, a Poisson Hierarchical Markov Random Field model (PhiMRF) was developed for gene expression data that accounts for the pairwise spatial interactions measured in Hi-C experiments. The quantitative expression of genes on human chromosomes 1, 4, 5, 6, 8, 9, 12, 19, 20, 21 and X all showed meaningful positive intra-chromosomal spatial dependency. Moreover, the spatial dependency is much stronger than the dependency based on linear gene neighborhoods, suggesting that 3D chromosome structures such as chromatin loops and Topologically Associating Domains (TADs) are indeed strongly correlated with gene expression levels. The results both confirm and quantify the spatial correlation in gene expression. In addition, PhiMRF improves upon the stochastic modelling of gene expression that is currently widely used in differential expression analyses. PhiMRF is available as an R package.
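To make the structure of such a model concrete, the sketch below writes down one plausible two-level form: counts that are Poisson given a latent log-intensity, and a latent field whose conditional mean depends on spatial neighbours. The specific parameterisation (alpha, eta, tau), the neighbour graph, and all function names are assumptions for illustration. This is Python rather than the R package, and it is not the PhiMRF implementation.

```python
import math

# Hypothetical contact graph: gene index -> spatially neighbouring genes,
# as might be derived from thresholded Hi-C contacts.
NEIGHBOURS = {0: [1, 2], 1: [0], 2: [0], 3: []}


def conditional_mean(i, w, alpha, eta):
    """Assumed conditional mean of latent w[i] given its neighbours.

    alpha is a baseline level and eta the spatial dependency parameter;
    eta > 0 pulls w[i] towards neighbours that sit above the baseline.
    """
    return alpha + eta * sum(w[j] - alpha for j in NEIGHBOURS[i])


def poisson_log_pmf(y, w_i):
    """log P(y | w_i) for a count y ~ Poisson(exp(w_i))."""
    return y * w_i - math.exp(w_i) - math.lgamma(y + 1)


def log_pseudo_likelihood(y, w, alpha, eta, tau):
    """Sum of data and latent-field conditional log-densities.

    This is a pseudo-likelihood built from the full conditionals,
    not the exact joint density of the random field.
    """
    total = 0.0
    for i in NEIGHBOURS:
        mu = conditional_mean(i, w, alpha, eta)
        total += poisson_log_pmf(y[i], w[i])
        total += (-0.5 * math.log(2 * math.pi * tau ** 2)
                  - (w[i] - mu) ** 2 / (2 * tau ** 2))
    return total


if __name__ == "__main__":
    counts = [8, 2, 21, 4]          # hypothetical expression counts
    latent = [2.0, 1.0, 3.0, 1.5]   # hypothetical latent log-intensities
    print(conditional_mean(0, latent, alpha=1.5, eta=0.2))
    print(log_pseudo_likelihood(counts, latent, alpha=1.5, eta=0.2, tau=1.0))
```

Under these assumptions, setting eta to zero removes the spatial term, so each gene reverts to the baseline alpha regardless of its neighbours.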


Copyright Owner

Naihui Zhou



File Format


File Size

182 pages