Degree Type


Date of Award


Degree Name

Doctor of Philosophy




Bioinformatics and Computational Biology

First Advisor

Philip M. Dixon

Second Advisor

Julie A. Dickerson


Large-scale projects such as The Cancer Genome Atlas (TCGA) have generated extensive exome libraries across several disease types and populations. Detection of somatic changes in HLA genes by whole-exome sequencing (WES) has been complicated by the highly polymorphic nature of these loci. We developed a method, POLYSOLVER (POLYmorphic loci reSOLVER), for accurate inference of class I HLA-A, -B and -C alleles from WES data, and achieved 97% accuracy at protein-level resolution when it was applied to 133 HapMap samples of known HLA type. By applying POLYSOLVER in conjunction with somatic change detection tools to 2688 TCGA tumor/normal pairs that were previously analyzed by conventional approaches, we re-discovered 37 of 56 (66%) HLA mutations and identified 23 new events. A POLYSOLVER analysis of WES data from a larger set of 3768 tumor/normal pairs revealed 131 class I mutations, with an enrichment for potentially loss-of-function events. At least one HLA event was found in 3% of samples, and 95 of the 131 mutations fell in the T cell-interacting and peptide-binding domains. We discovered recurrent hotspot sites of missense, nonsense and splice-site mutations that suggest positive selection and support immune evasion as an important pathway in cancer.

Exome sequencing has also revealed a large number of shared and personal somatic mutations across human cancers. In principle, any genetic alteration affecting a protein-coding region has the potential to generate mutated peptides that are presented by surface HLA class I proteins and may be recognized by cytotoxic T cells. Using POLYSOLVER together with knowledge of mutations in other genetic loci inferred from exome data, we developed a pipeline for the prediction and validation of such neoantigens, which are derived from individual tumors and presented by patient-specific alleles of the HLA proteins. We applied our computational pipeline to 91 chronic lymphocytic leukemias (CLL) that had undergone whole-exome sequencing. We predicted ~22 mutated HLA-binding peptides per leukemia (derived from ~16 missense mutations), and experimentally confirmed HLA binding for ~55% of such peptides. Finally, we computationally predicted HLA-binding peptides with missense or frameshift mutations for several cancer types and predicted dozens to thousands of neoantigens per individual tumor, suggesting that neoantigens are frequent in most tumors. The neoantigen prediction pipeline can also elucidate the neoantigens unique to a particular cancer patient and help in the design of personalized immune vaccines.
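
The first step of any such neoantigen pipeline, enumerating the candidate peptides that span a missense mutation, can be sketched as follows. This is a minimal illustration and not the dissertation's code: the function name `mutant_peptides`, the 0-based position convention, and the default peptide lengths of 9 and 10 residues are assumptions made for the example. Scoring each candidate against the patient's HLA alleles would be done by a separate binding predictor, which is not shown.

```python
def mutant_peptides(protein, pos, alt, lengths=(9, 10)):
    """Enumerate peptides that contain a single missense substitution.

    protein : wild-type amino-acid sequence (string)
    pos     : 0-based index of the substituted residue (assumed convention)
    alt     : the mutant amino acid
    lengths : peptide lengths to generate (assumed defaults, not from the source)
    """
    if not 0 <= pos < len(protein):
        raise ValueError("mutation position lies outside the protein")

    # Apply the substitution to obtain the mutant protein sequence.
    mutated = protein[:pos] + alt + protein[pos + 1:]

    peptides = []
    for k in lengths:
        # A window of length k covers `pos` when start <= pos <= start + k - 1,
        # and must also fit entirely inside the protein.
        first_start = max(0, pos - k + 1)
        last_start = min(pos, len(mutated) - k)
        for start in range(first_start, last_start + 1):
            peptides.append(mutated[start:start + k])
    return peptides


# Usage: a toy 20-residue sequence with M -> A at index 10.
candidates = mutant_peptides("ACDEFGHIKLMNPQRSTVWY", pos=10, alt="A", lengths=(9,))
```

A mutation far from either end of the protein yields k candidate peptides of length k; mutations near the termini yield fewer because some windows would run off the sequence.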

MicroRNAs (miRs) are a class of small non-coding RNAs that regulate gene expression by promoting mRNA degradation or by inhibiting mRNA translation. Context Likelihood of Relatedness (CLR) is a genetic network reconstruction method that considers the local network context in assessing the significance of connections while also allowing for detection of non-linear associations. Leveraging TCGA multidimensional data in glioblastoma, we inferred the putative regulatory network between microRNAs and mRNAs using the CLR algorithm. Interrogation of the network in the context of defined molecular subtypes identified 8 microRNAs with strong discriminatory potential between the proneural and mesenchymal subtypes. Integrative in silico analyses, a functional genetic screen, and experimental validation identified miR-34a as a tumor suppressor in proneural subtype glioblastoma. Mechanistically, in addition to its direct regulation of platelet-derived growth factor receptor-alpha (PDGFRA), promoter enrichment analysis of CLR-inferred mRNA nodes established miR-34a as a novel regulator of a SMAD4 transcriptional network. Clinically, miR-34a expression level was prognostic: glioblastomas with low miR-34a expression exhibited better overall survival. This work illustrates the potential of comprehensive multidimensional cancer genomic data, combined with computational and experimental models, to enable mechanistic exploration of relationships among different genetic elements across the genome space in cancer.
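
The "local network context" idea behind CLR can be sketched in a few lines. The example below follows the CLR scoring rule as generally described in the literature: each pairwise mutual information (MI) value is converted to a z-score against the background of all MI values for each of the two genes, negative z-scores are clipped to zero, and the two z-scores are combined. It is an illustrative sketch, not the code used in this work. It assumes the symmetric MI matrix has already been computed, and the name `clr_scores` and the choice to include the diagonal in each background distribution are simplifications made for the example.

```python
import numpy as np


def clr_scores(mi):
    """Turn a symmetric mutual-information matrix into CLR scores.

    mi[i, j] is assumed to hold the precomputed MI between features i and j.
    """
    mi = np.asarray(mi, dtype=float)

    # Background distribution for each feature: mean and spread of its MI row.
    mean = mi.mean(axis=1, keepdims=True)
    std = mi.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # avoid division by zero for constant rows

    # z[i, j]: how unusual MI(i, j) is relative to feature i's background.
    # Values below the background mean are clipped to zero.
    z = np.maximum(0.0, (mi - mean) / std)

    # Combine the view from feature i (z) and from feature j (z.T).
    return np.sqrt(z ** 2 + z.T ** 2)


# Usage: features 0 and 1 share far more information than any other pair.
mi = np.array([
    [0.0, 0.9, 0.1, 0.1],
    [0.9, 0.0, 0.1, 0.1],
    [0.1, 0.1, 0.0, 0.2],
    [0.1, 0.1, 0.2, 0.0],
])
scores = clr_scores(mi)
```

Because each edge is judged against the backgrounds of both of its endpoints, a moderately high MI value can still score well for a feature whose other associations are weak, while the same value is discounted for a feature that is loosely associated with everything.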

Copyright Owner

Sachet Ashok Shukla



File Format


File Size

109 pages