Degree Type


Date of Award


Degree Name

Doctor of Philosophy


Computer Science

First Advisor

Vasant Honavar


The identification and characterization of epitopes in antigenic sequences is critical for understanding disease pathogenesis, for identifying potential autoantigens, and for designing vaccines and immune-based cancer therapies. As the number of fully or partially sequenced pathogen genomes rapidly increases, experimental methods for epitope mapping become prohibitive in terms of time and expense. Therefore, computational methods for reliably identifying potential vaccine candidates (i.e., epitopes that elicit a strong response from both T-cells and B-cells) are highly desirable.

Machine learning offers one of the most cost-effective and widely used approaches to developing epitope prediction tools. We draw on recent advances in machine learning research to provide epitope prediction tools with improved predictive performance. First, we introduce two methods, BCPred and FBCPred, for predicting linear B-cell epitopes and flexible-length linear B-cell epitopes, respectively, using string-kernel-based support vector machine (SVM) classifiers. Second, we introduce three scoring matrix methods and show that they are highly competitive with a broad class of machine learning methods, including SVM, in predicting major histocompatibility complex class I (MHC-I) binding peptides. Finally, we formulate the problems of qualitatively and quantitatively predicting flexible-length major histocompatibility complex class II (MHC-II) binding peptides as multiple instance learning and multiple instance regression problems, respectively. Based on this formulation, we introduce MHCMIR, a novel method for predicting MHC-II binding affinity using multiple instance regression.

The development of reliable epitope prediction tools is not feasible in the absence of high-quality data sets. Unfortunately, most existing epitope benchmark data sets consist of epitope sequences that share a high degree of similarity with other peptide sequences in the same data set. We demonstrate the pitfalls of using these data sets to evaluate the performance of machine learning approaches to epitope prediction. We also propose a similarity reduction procedure that is more stringent than currently used similarity reduction methods.


Copyright Owner

Yasser Mohamed El-manzalawy



Date Available


File Format


File Size

179 pages