Doctor of Philosophy
Many applications, e.g., biomolecular sequence analysis, image classification, text classification, and social network analysis, require classification of topologically structured data, i.e., data whose topology reflects intrinsic dependencies between the constituent elements that make up the data. Hence, there is a growing need for automated methods for building predictive models from topologically structured data. Machine learning offers a promising approach to designing algorithms that train computer programs to classify topologically structured data efficiently and accurately.
One of the main challenges in learning from topologically structured data concerns the representation of the data presented to a learner. The representation must be rich enough to capture distinctions that are relevant from the standpoint of learning, but not so rich as to make learning harder due to overfitting.
Against this background, we present an abstraction-based probabilistic approach to learning accurate and, at the same time, compact predictive models from topologically structured data with applications to biological sequence and text data. Specifically, we address the problem of simplifying the data representation used by a learner on sequence classification tasks in two settings: (i) the setting in which sequences are independent and identically distributed and (ii) the setting in which sequences are not independent (e.g., as in the case of protein sequences that are linked by function). Our abstraction-based approach exploits the complementary strengths of super-structuring (constructing complex features by combining existing features) and abstraction (grouping similar features to generate more abstract features). Super-structuring provides a way to increase the predictive accuracy of the learned models by enriching the data representation (hence, super-structuring increases the complexity of the learned models), whereas abstraction helps reduce the number of model parameters by simplifying the data representation.
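The interplay of the two operations can be sketched in a few lines. The functions below are illustrative only and not the algorithms of the thesis: super-structuring is shown as k-gram extraction over a sequence, and abstraction as a hypothetical bucketing of k-grams by frequency rank (the thesis groups features by distributional similarity instead).

```python
from collections import Counter

def super_structure(sequence, k):
    """Super-structuring: enumerate the overlapping k-grams of a sequence."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def abstract_features(feature_counts, num_abstractions):
    """Abstraction: map each feature to one of a small number of abstract
    features. Illustrative placeholder: bucket features by frequency rank
    rather than by the distributional similarity used in the thesis."""
    ranked = [f for f, _ in feature_counts.most_common()]
    return {f: i % num_abstractions for i, f in enumerate(ranked)}

seq = "ACGTACGTAC"
grams = super_structure(seq, 3)          # enriches the representation
counts = Counter(grams)
mapping = abstract_features(counts, 4)   # shrinks it back down
```

Super-structuring with length k multiplies the feature space (up to |S|^k k-grams over an alphabet S), which is exactly why the complementary abstraction step is needed to keep the number of model parameters manageable.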
Furthermore, we introduce and study abstraction augmented Markov models (AAMMs). AAMMs are directed acyclic graphical models that capture dependencies between variables and, at the same time, allow for compact representations of conditional probability tables. More precisely, AAMMs simplify the data representation used by standard Markov models (MMs) by grouping similar entities to generate more abstract entities that are organized in an abstraction hierarchy.
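A minimal sketch of the parameter-sharing idea follows; it is not the AAMM algorithm itself. The thesis learns the abstraction hierarchy from data, whereas here a fixed hand-written mapping stands in for it: transition probabilities are conditioned on the abstraction of the previous symbol rather than the symbol itself, shrinking the conditional probability table from |S| x |S| to |A| x |S| entries.

```python
from collections import defaultdict

def train_abstraction_markov(sequences, abstraction):
    """Estimate first-order transition probabilities P(x_t | a(x_{t-1})),
    where a(.) maps each symbol to its abstract entity. Fewer conditioning
    contexts means fewer parameters than a standard Markov model."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[abstraction[prev]][cur] += 1
    probs = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        probs[a] = {s: c / total for s, c in nxt.items()}
    return probs

# Toy hand-made abstraction over DNA: purines {A, G} vs. pyrimidines {C, T}.
abstraction = {"A": 0, "G": 0, "C": 1, "T": 1}
probs = train_abstraction_markov(["ACGT", "AGCT"], abstraction)
```

With this mapping, the model stores two conditional distributions instead of four, illustrating how grouping similar entities trades a little resolution for a much smaller parameter set.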
Experimental results on text document classification tasks and protein subcellular localization prediction tasks demonstrate the feasibility of the proposed approach: simplifying the data representation by combining super-structuring and abstraction makes it possible to construct predictive models that use a substantially smaller number of features (by one to three orders of magnitude) than models obtained using super-structuring alone, whose size grows exponentially with the length of the direct dependencies. Models that combine super-structuring and abstraction are competitive with, and in some cases outperform, models that use only super-structuring.
Collectively, the techniques developed in this thesis advance the state of the art in algorithms for building accurate and concise predictive models from topologically structured data.
Caragea, Cornelia, "Abstraction-based probabilistic models for sequence classification" (2009). Graduate Theses and Dissertations. Paper 11022.