Degree Type


Date of Award


Degree Name

Doctor of Philosophy


Computer Science

First Advisor

Vasant Honavar


The emergence of many interlinked, physically distributed, and autonomously maintained linked data sources has driven the rapid growth of the Linked Open Data (LOD) cloud, which offers unprecedented opportunities for predictive modeling and knowledge discovery from such data. However, existing machine learning approaches are limited in their applicability because it is neither desirable nor feasible to gather all of the data in a centralized location for analysis, owing to access, memory, bandwidth, or computational restrictions. In some applications, additional schema information such as subclass hierarchies may be available for the learner to exploit. In other applications, the attributes that are relevant for a specific prediction task are not known a priori and hence need to be discovered by the algorithm. Against this background, we present a series of approaches that address such scenarios. First, we show how to learn Relational Bayesian Classifiers (RBCs) from a single but remote data store using statistical queries, and we extend this approach to the setting where the attributes relevant for prediction are not known a priori, by selectively crawling the data store for attributes of interest. Next, we introduce an algorithm for learning classifiers from a remote data store enriched with subclass hierarchies. Our algorithm encodes the constraints specified in a subclass hierarchy using latent variables in a directed graphical model, and adopts the Variational Bayesian EM approach to learn parameters efficiently. In retrospect, we observe that in learning from linked data it is often useful to represent an instance as a tuple of bags of attribute values. Motivated by this observation, we introduce, formulate, and present solutions for a novel type of learning problem, which we call distributional instance classification. Finally, building on these foundations, we consider the problem of learning predictive models from multiple interlinked data stores. We introduce a distributed learning framework, identify three special cases of linked data fragmentation, and describe effective strategies for learning predictive models in each case. Further, we consider a novel application of a matrix reconstruction technique from the field of Computerized Tomography to approximate the statistics needed by the learning algorithm from projections obtained using count queries, thus dramatically reducing the amount of information transmitted from the remote data sources to the learner.
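As a rough illustration of the statistical-query setting mentioned in the abstract, the sketch below fits a plain naive Bayes classifier using only count queries against a data store, so the learner never sees raw records. It is not the dissertation's RBC algorithm: the `CountOracle` class, the `fit_naive_bayes` and `predict` functions, the Laplace smoothing parameter `alpha`, and the toy data are all assumptions made for this example, and an in-memory list stands in for the remote store.

```python
import math


class CountOracle:
    """Hypothetical stand-in for a remote store that answers only count queries."""

    def __init__(self, rows):
        self._rows = rows  # list of (features_dict, label); hidden from the learner
        self.queries = 0   # number of count queries issued so far

    def count(self, label=None, attr=None, value=None):
        """Count rows matching the optional label and attribute=value filters."""
        self.queries += 1
        n = 0
        for feats, y in self._rows:
            if label is not None and y != label:
                continue
            if attr is not None and feats.get(attr) != value:
                continue
            n += 1
        return n


def fit_naive_bayes(oracle, labels, domains, alpha=1.0):
    """Estimate smoothed class priors and conditionals from counts alone."""
    total = oracle.count()
    model = {"prior": {}, "cond": {}}
    for y in labels:
        n_y = oracle.count(label=y)
        model["prior"][y] = (n_y + alpha) / (total + alpha * len(labels))
        for attr, values in domains.items():
            for v in values:
                n_yv = oracle.count(label=y, attr=attr, value=v)
                model["cond"][(y, attr, v)] = (n_yv + alpha) / (n_y + alpha * len(values))
    return model


def predict(model, feats):
    """Return the label with the highest log posterior score."""
    scores = {}
    for y, p in model["prior"].items():
        s = math.log(p)
        for attr, v in feats.items():
            s += math.log(model["cond"][(y, attr, v)])
        scores[y] = s
    return max(scores, key=scores.get)


# Toy data, invented for illustration only.
rows = [
    ({"genre": "rock", "era": "old"}, "pos"),
    ({"genre": "rock", "era": "new"}, "pos"),
    ({"genre": "jazz", "era": "old"}, "neg"),
    ({"genre": "jazz", "era": "new"}, "neg"),
]
oracle = CountOracle(rows)
domains = {"genre": ["rock", "jazz"], "era": ["old", "new"]}
model = fit_naive_bayes(oracle, ["pos", "neg"], domains)
print(predict(model, {"genre": "rock", "era": "old"}), oracle.queries)
```

In this sketch the number of queries depends on the number of class labels and attribute values rather than on the number of records, which is the kind of saving a count-query interface is meant to provide.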

Copyright Owner

Harris Lin



File Format


File Size

99 pages