Date of Award
Doctor of Philosophy
Rohan L. Fernando
Whole genome analysis is a powerful tool for accurately predicting the genetic merit of selection candidates and for mapping quantitative trait loci (QTL) with high resolution. Single-nucleotide polymorphism (SNP) markers that cover the entire genome reveal information about QTL through either linkage disequilibrium (LD) with the QTL in founders or cosegregation (CS) with the QTL in nonfounders given a pedigree. Due to advances in molecular biology and the associated drop in the cost of genotyping, both the density of SNPs and the number of individuals with phenotypes and genotypes are increasing dramatically in whole genome analyses. Consider a matrix of genotypes collected for analysis, where each row holds the genotypes of one individual across SNPs and each column holds the genotypes of one SNP across individuals. As explained below, structures exist in such a genotype matrix and will become more evident and important as the SNP density and the training population size increase.
Horizontally, haplotype block structures are observed across SNP loci in the genome due to historical cosegregation, which creates LD, or recent cosegregation. These structures exist even in the gametes of a single individual. Statistical dependence among SNP effects is therefore expected within small chromosomal segments given the presence of QTL. However, most methods for whole genome analyses do not account for this dependence among SNP effects.
Vertically, individuals in the pedigree will share a large proportion of alleles that are identical-by-descent (IBD) if they have a recent common ancestor, and vice versa. The genomic (IBD) relationship structure therefore manifests at each locus across individuals in the pedigree, and for closely linked loci, these structures will be very similar due to CS. Alleles that are identical-by-descent are also identical-by-state (IBS), but the converse is not true. Thus, the genomic relationship structures may not be properly accounted for by methods that use IBS relationships computed from SNP genotypes.
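The distinction between IBS and IBD can be illustrated with a minimal sketch. It assumes genotypes coded as allele counts (0, 1, 2) at each SNP; the function name `ibs_similarity` and this particular similarity score are illustrative choices, not the relationship measure used in the thesis.

```python
def ibs_similarity(g1, g2):
    """Mean proportion of alleles shared identical-by-state (IBS) across loci.

    g1, g2: sequences of genotypes coded as counts (0, 1, 2) of one allele.
    At each locus the number of alleles shared IBS is 2 - |g1 - g2|.
    """
    if len(g1) != len(g2) or len(g1) == 0:
        raise ValueError("genotype vectors must be non-empty and of equal length")
    shared = [2 - abs(a - b) for a, b in zip(g1, g2)]
    return sum(shared) / (2 * len(g1))


# Hypothetical genotypes: two individuals with no assumed pedigree link
# can still have identical genotypes, so the IBS score alone cannot tell
# whether the shared alleles descend from a common ancestor (IBD).
unrelated_a = [2, 2, 0, 1]
unrelated_b = [2, 2, 0, 1]
score = ibs_similarity(unrelated_a, unrelated_b)
```

A score computed this way depends only on allele states, which is why IBS-based relationships may misrepresent the IBD structure that a pedigree and cosegregation imply.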
Two methods, BayesN and the QTL model, have been developed in this thesis to account for the structure in the genotypes that are used for whole genome analyses. BayesN is a nested marker effects model, where SNP effects in each small genomic window are a priori considered dependent. Compared with BayesB, where the structure in the genome is ignored and SNP effects are assumed to be independently and identically distributed, BayesN gave a higher accuracy of genomic prediction for breeding values, especially when high-density SNP panels were used and the QTL had rare alleles. When BayesN was used for QTL discovery, the proportion of false positives (PFP) for finding QTL was perfectly controlled in the case of common QTL alleles and was better controlled than with BayesB in the case of rare QTL alleles. At the same level of PFP, BayesN had a higher power than BayesB for detecting QTL that had rare alleles and explained at least 1% of the total genetic variance. The advantage of BayesN is attributed to the modeling of dependence between SNP effects, such that the SNPs jointly explained more genetic variance at the QTL and the effects of SNPs not associated with QTL were shrunk more toward zero. Moreover, BayesN has a benefit in computing time, requiring only one-fourth of the time needed for BayesB in the case of high-density SNP panels.
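One way to picture a nested prior of this kind is a two-level mixture: a window is first switched on or off, and SNPs inside a switched-on window are then individually switched on or off. The sketch below draws effects from such a prior. The parameter names (`p_window`, `p_snp`, `sd`), the normal distribution for nonzero effects, and the equal window sizes are assumptions for illustration only; the abstract does not specify the exact prior used in BayesN.

```python
import random


def draw_nested_effects(n_windows, snps_per_window, p_window, p_snp, sd, rng):
    """Draw SNP effects from a two-level (nested) mixture prior.

    p_window: assumed prior probability that a window has any nonzero effect.
    p_snp:    assumed prior probability that a SNP in an active window is nonzero.
    sd:       assumed standard deviation of nonzero effects.
    Returns a list of windows, each a list of SNP effects.
    """
    effects = []
    for _ in range(n_windows):
        window_on = rng.random() < p_window  # window-level indicator
        window = []
        for _ in range(snps_per_window):
            # A SNP can be nonzero only if its window is active, which is
            # what makes effects within a window dependent a priori.
            snp_on = window_on and rng.random() < p_snp
            window.append(rng.gauss(0.0, sd) if snp_on else 0.0)
        effects.append(window)
    return effects


rng = random.Random(2015)
effects = draw_nested_effects(
    n_windows=10, snps_per_window=5, p_window=0.2, p_snp=0.5, sd=1.0, rng=rng
)
```

Under an independent prior such as the one described for BayesB, each SNP would be switched on or off without reference to its neighbors; here, SNPs in the same window share the window-level indicator.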
The QTL model includes the effects of the unobserved QTL genotypes, and the phenotype therefore has a mixture distribution. The mixture model exploits information from the pedigree, LD and CS optimally to model the QTL allele states in founders and allele inheritance in nonfounders. Thus, the QTL model accounts for horizontal structure across loci and vertical structure across individuals, because only information from the SNPs that are within a small chromosomal segment contributes to the modeling of QTL alleles in that segment. In a range of pedigree structures, the QTL model had a substantially higher accuracy than BayesC for genomic prediction when the training population consisted of multiple families, generations, or breeds. The advantages of the QTL model increased with the complexity of the pedigree structure and the contribution of CS information. Furthermore, use of the QTL model permits direct inferences on the unobserved QTL. As expected, the QTL model had better control of PFP than BayesC and a higher power for detecting QTL of any size when PFP was limited to a small value. In this thesis, a method to calculate the credible intervals for multiple QTL locations is developed. The method presented here is straightforward and can easily be applied to other models that fit QTL effects for the unobserved genotypes. The credible intervals for the QTL locations provide important information to guide future fine-mapping studies.
In QTL discovery, signal from a QTL may bleed into neighboring genomic windows, depending on the structures of the genome. It is therefore suggested to search for QTL in the window that has a positive test result as well as in its flanking windows, or to use hypotheses that test only for large genetic variance (for example, at least 1% of the total genetic variance).
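The two suggestions above can be sketched together. The code assumes that MCMC output is available as a list of draws, each giving the proportion of total genetic variance explained by every window; the function names, the data layout, and the posterior probability cutoff of 0.9 are hypothetical choices, not values taken from the thesis.

```python
def window_posterior_probs(samples, threshold=0.01):
    """Posterior probability that each window explains at least `threshold`
    of the total genetic variance.

    samples: list of MCMC draws; each draw is a list with one proportion of
             genetic variance per window (assumed layout).
    """
    n_draws = len(samples)
    n_windows = len(samples[0])
    return [
        sum(draw[w] >= threshold for draw in samples) / n_draws
        for w in range(n_windows)
    ]


def windows_to_search(probs, cutoff=0.9):
    """Indices of windows with a positive test result plus their flanking windows."""
    n = len(probs)
    selected = set()
    for w, p in enumerate(probs):
        if p >= cutoff:
            # Include the neighbors because signal may bleed across windows.
            selected.update(i for i in (w - 1, w, w + 1) if 0 <= i < n)
    return sorted(selected)


# Hypothetical draws for four windows; only the second window carries
# at least 1% of the genetic variance in every draw.
samples = [
    [0.000, 0.020, 0.000, 0.000],
    [0.000, 0.030, 0.005, 0.000],
]
probs = window_posterior_probs(samples, threshold=0.01)
search = windows_to_search(probs, cutoff=0.9)
```

With these draws the positive window is index 1, so the search region covers windows 0 through 2.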
In conclusion, parsimonious and sophisticated methods that account for the horizontal and vertical structures in genotypes were developed for whole genome analyses. Both methods gave higher accuracy of genomic prediction and trait loci discovery than the widely used methods that ignore these structures. Both methods are expected to be more efficient with respect to computing time and performance as higher SNP densities or sequence data are used in whole genome analyses.
Zeng, Jian, "Whole genome analyses accounting for structures in genotype data" (2015). Graduate Theses and Dissertations. 14699.