Doctor of Philosophy
Electrical and Computer Engineering
Bioinformatics and Computational Biology
Julie A. Dickerson
Carolyn J. Lawrence-Dill
One of the key tenets of bioinformatics is to find ways to enable the interoperability of heterogeneous data sources and improve the integration of various biological data. As high-throughput experimental methods continue to improve and become more accessible, researchers can measure not just a specific gene or protein of interest, but the entirety of the biological machinery inside the cell. These measurements are collectively referred to as "omics" and include genomics, transcriptomics, proteomics, metabolomics, and fluxomics.
Omics data is highly interrelated at the systems level, as each type of molecule (DNA, RNA, protein, etc.) can interact with and have an impact on the other types. These interactions may be direct, as in the central dogma of biology, in which information flows from DNA to RNA to protein. They may also be indirect, as in the regulation of gene expression or metabolic feedback loops. Regardless, it is becoming apparent that multiple levels of omics data must be analyzed and understood simultaneously if we are to advance our understanding of systems-level biology.
Much of our current biological knowledge is stored in public databases, most of which specialize in a particular type of omics or a specific organism. Despite efforts to improve consistency between databases, many challenges impede efforts to meaningfully compare or combine these resources. At a basic level, differences in naming and internal database ID assignments prevent simple mapping between objects in these databases. More fundamental, though, is the lack of a standardized way to define equivalency between two functionally identical biological entities.
One benefit of improving database interoperability is that targeted, high-quality data from one database can be used to improve another database. Comparison between MaizeCyc and CornCyc identified many manually curated GO annotations present in MaizeCyc but not in CornCyc. CycTools facilitates the transfer of high-quality annotation data from one database to another by automatically mapping equivalent objects in both databases. This Java-based tool has a graphical user interface that guides users through the transfer process.
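The idea of mapping equivalent objects and then transferring annotations can be illustrated with a minimal Python sketch. It is not CycTools' actual algorithm: it assumes, purely for illustration, that equivalent genes can be matched on a shared accession after simple normalization. All record IDs, accessions, and GO terms below are hypothetical placeholders.

```python
def normalize(accession):
    """Normalize an accession so trivially different spellings compare equal."""
    return accession.strip().upper()


def map_equivalent(source, target):
    """Map source record IDs to target record IDs that share an accession."""
    by_accession = {normalize(rec["accession"]): rid for rid, rec in target.items()}
    mapping = {}
    for rid, rec in source.items():
        match = by_accession.get(normalize(rec["accession"]))
        if match is not None:
            mapping[rid] = match
    return mapping


def transfer_annotations(source, target, mapping):
    """Copy GO terms missing from each mapped target record; return what was added."""
    added = {}
    for src_id, tgt_id in mapping.items():
        missing = source[src_id]["go"] - target[tgt_id]["go"]
        if missing:
            target[tgt_id]["go"] |= missing
            added[tgt_id] = missing
    return added


# Hypothetical records: internal IDs differ, but one accession overlaps.
source_db = {
    "SRC-1": {"accession": "gene_a ", "go": {"GO:0000001", "GO:0000002"}},
    "SRC-2": {"accession": "gene_b", "go": {"GO:0000003"}},
}
target_db = {
    "TGT-9": {"accession": "GENE_A", "go": {"GO:0000001"}},
    "TGT-7": {"accession": "GENE_C", "go": set()},
}

mapping = map_equivalent(source_db, target_db)
added = transfer_annotations(source_db, target_db, mapping)
```

Matching on a single normalized key is the simplest possible equivalence rule. As the abstract notes, real databases lack a standardized definition of equivalency, so a practical tool would likely need to combine several lines of evidence.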
A case study that uses two independent Zea mays pathway databases, CornCyc and MaizeCyc, illustrates the challenges of comparing the content of even closely related resources. This example highlights the downstream implications that the choice of initial computational enzymatic function assignment pipelines and subsequent manual curation had on the overall scope and quality of the content of each database. We compare the prediction accuracy of the protein EC assignments for 177 maize enzymes between these resources and find that while MaizeCyc covers a broader scope of enzyme predictions, CornCyc predictions are more accurate.
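The trade-off between scope and accuracy described above can be made concrete with a small Python sketch. The two metrics below (coverage and accuracy against a curated reference) are assumed definitions chosen for illustration, not necessarily those used in the thesis. The gene names, EC numbers, and resulting values are invented toy data and do not reflect the reported results.

```python
def evaluate(predictions, reference):
    """Return (coverage, accuracy) of EC predictions against a curated reference.

    coverage: fraction of reference enzymes that received any prediction.
    accuracy: fraction of those predicted enzymes whose EC number matches.
    """
    predicted = [gene for gene in reference if gene in predictions]
    correct = [gene for gene in predicted if predictions[gene] == reference[gene]]
    coverage = len(predicted) / len(reference)
    accuracy = len(correct) / len(predicted) if predicted else 0.0
    return coverage, accuracy


# Hypothetical curated EC numbers for four enzymes.
reference = {"g1": "1.1.1.1", "g2": "2.7.1.1", "g3": "3.1.3.2", "g4": "4.2.1.1"}

# Hypothetical database A: predicts every enzyme, with some errors.
db_a = {"g1": "1.1.1.1", "g2": "2.7.1.2", "g3": "3.1.3.2", "g4": "4.1.1.1"}
# Hypothetical database B: predicts fewer enzymes, all correctly.
db_b = {"g1": "1.1.1.1", "g3": "3.1.3.2"}

cov_a, acc_a = evaluate(db_a, reference)
cov_b, acc_b = evaluate(db_b, reference)
```

In this toy setup, database A has the broader coverage while database B has the higher accuracy, mirroring the qualitative pattern of a broad pipeline versus a more conservative one.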
The advantage of high-quality, integrated data resources must be realized through analysis methods that can account for multiple data types simultaneously. Because systems-wide metabolic flux measurements are difficult to obtain, researchers have made several efforts to integrate transcriptional regulatory data with metabolic models to improve the accuracy of metabolic flux predictions. Transcriptional regulation involves the binding of transcription factors (i.e., proteins) to binding sites on the DNA to positively or negatively influence expression of the targeted gene. This has an indirect, downstream impact on the organism's metabolism, as metabolic reactions depend on gene-derived enzymes to catalyze them.
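The chain from transcription factor to gene to enzyme to reaction can be sketched in Python as follows. This is a deliberately simplified Boolean on/off rule, assumed here for illustration only and not the method proposed in this work. The transcription factors, genes, reactions, and flux bounds are all hypothetical.

```python
def gene_states(tf_active, regulation):
    """Evaluate which genes are expressed given the active transcription factors.

    regulation maps gene -> (activators, repressors). In this toy rule a gene
    is expressed if it has no activators or at least one is active, and no
    repressor is active.
    """
    states = {}
    for gene, (activators, repressors) in regulation.items():
        activated = not activators or any(tf in tf_active for tf in activators)
        repressed = any(tf in tf_active for tf in repressors)
        states[gene] = activated and not repressed
    return states


def constrain_bounds(bounds, enzyme_gene, states):
    """Close the flux bounds of every reaction whose enzyme gene is off."""
    constrained = {}
    for reaction, (lower, upper) in bounds.items():
        gene = enzyme_gene.get(reaction)
        if gene is not None and not states.get(gene, True):
            constrained[reaction] = (0.0, 0.0)
        else:
            constrained[reaction] = (lower, upper)
    return constrained


# Hypothetical network: TF1 activates geneA, TF2 represses geneB.
regulation = {"geneA": ({"TF1"}, set()), "geneB": (set(), {"TF2"})}
enzyme_gene = {"R1": "geneA", "R2": "geneB"}  # R3 has no enzyme gene assigned
bounds = {"R1": (0.0, 10.0), "R2": (-5.0, 5.0), "R3": (0.0, 10.0)}

states = gene_states({"TF1", "TF2"}, regulation)
constrained = constrain_bounds(bounds, enzyme_gene, states)
```

With both transcription factors active, geneB is repressed, so reaction R2 is forced to carry zero flux while R1 and R3 keep their original bounds. A constraint-based flux prediction would then be solved within these narrowed bounds.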
We propose a novel method that integrates transcriptional regulation and metabolic reaction data into a single model in order to investigate the interactions between metabolism and regulation. In contrast to existing methods, which use transcriptional regulatory networks to limit the solution space of the constraint-based metabolic model, we seek to define a transcriptional regulatory space that can be associated with the metabolic distribution of interest. This allows us to make inferences about how changes in the regulatory network could lead to improved metabolic flux.
Jesse R. Walsh
Walsh, Jesse R., "Computational methods for integrated analysis of omics and pathway data" (2016). Graduate Theses and Dissertations. 15198.