Degree Type

Dissertation

Date of Award

2006

Degree Name

Doctor of Philosophy

Department

Electrical and Computer Engineering

Major

Computer Engineering; Bioinformatics and Computational Biology

First Advisor

Daniel Berleant

Second Advisor

Eve Wurtele

Abstract

Continued rapid advances in genomic, proteomic, and metabolomic technologies demand computer-aided methods and tools that can process large amounts of data efficiently and promptly, extract meaningful information, and turn data into knowledge. While numerous algorithms and systems have been developed for information extraction (i.e., profiling analysis), biological interpretation still relies largely on biologists' domain knowledge, as well as on collecting and analyzing functional information from various public databases. The goal of this project was to build a text clustering-based software system, called GeneNarrator, for functional analysis of genes (microarray experiments).

GeneNarrator automatically collected MEDLINE citations for a list of genes as the source of functional information. A two-step clustering approach was designed to process the citations. The first-step (text) clustering grouped the citations into hierarchical topics. The second-step (gene) clustering grouped the genes based on the similarities of their occurrences across the clusters resulting from step one. Hence, we planned to demonstrate how, instead of manually collecting and tediously sifting through potentially thousands of citations, biologists can be presented with dozens of topics that summarize the citations, together with genes (or gene groups) mapped to the topics.

To improve the first-step text clustering, several strategies were explored, including different vector space models (bag-of-words (BOW) or concept-based) for text representation, vector space dimensionality reduction (document frequency filtering), and multi-clustering. The greatest improvement came from multi-clustering. The clusterings were evaluated in terms of self-consistency and agreement with a manually constructed gold standard dataset using a newly proposed metric, normalized mutual information.
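The abstract names normalized mutual information as the evaluation metric but does not give its formula. As an illustration only, the sketch below computes one common variant, in which the mutual information between two clusterings is divided by the geometric mean of their entropies. That normalization is an assumption, not necessarily the one used in the dissertation, and the function names are hypothetical.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy (natural log) of a cluster assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two cluster assignments of the same items."""
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    joint = Counter(zip(a, b))
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log(c * n / (count_a[x] * count_b[y]))
    return mi


def normalized_mutual_information(a, b):
    """NMI with geometric-mean normalization (an assumed variant)."""
    if len(a) != len(b):
        raise ValueError("both assignments must cover the same items")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 or h_b == 0.0:
        # Degenerate case: at least one clustering has a single cluster.
        return 1.0 if h_a == h_b else 0.0
    return mutual_information(a, b) / sqrt(h_a * h_b)


# Toy data: cluster labels for four citations under two clusterings.
same_partition = normalized_mutual_information([0, 0, 1, 1], [1, 1, 0, 0])
unrelated = normalized_mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
```

Under this variant, two clusterings that induce the same partition score 1 regardless of how the clusters are labeled, and statistically independent clusterings score 0, which is what makes the measure usable for comparing a clustering against a gold standard.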

DOI

https://doi.org/10.31274/rtd-180813-4375

Publisher

Digital Repository @ Iowa State University, http://lib.dr.iastate.edu/

Copyright Owner

Jing Ding

Language

en

ProQuest ID

AAI3217266

File Format

application/pdf

File Size

81 pages
