Doctor of Philosophy
Jack C. Dekkers
Rohan L. Fernando
Genetic improvement for economically important traits in livestock populations has been revolutionized through the application of genomic selection, where the selection criterion for parents of future generations incorporates genomic estimated breeding values (GEBV). Genomic prediction is a statistical method that predicts GEBV based on high-density genotypes of single nucleotide polymorphisms (SNPs) with genome-wide coverage. The theoretical basis for genomic prediction is that the genetic variance of every quantitative trait locus (QTL) for a desired trait can be captured by SNPs due to linkage disequilibrium (LD) between QTL and SNPs. To date, most statistical models for genomic prediction are based on multiple regression of trait phenotypes on SNP genotypes. Informative prior distributions are usually specified for SNP effects, which allows all SNP effects to be estimated simultaneously (training). Computer simulation of genomic prediction has revealed that the accuracy of GEBV depends on the genetic basis of the trait, the size of the training population, and LD between QTL and SNPs, which is affected by historical and current effective population sizes (Ne), mutation, selection, population stratification, family structure, SNP density, and minor allele frequencies (MAF) of QTL and SNPs. With moderate to high levels of LD, GEBV are expected to have significantly higher accuracy than breeding values estimated using pedigree relationships. In analyses of field datasets, higher accuracy is typically observed in populations that are closely related to the training population, whereas the accuracy in a distantly related population is often low or even zero. Further, increasing the density of SNPs, which are usually selected to have high MAF, hardly improves prediction accuracy, in contrast to results from simulation studies.
Evidence has been increasing that LD between QTL and SNPs in livestock populations is low because many QTL have much lower MAF than SNPs, and prediction accuracy mainly comes from co-segregation (CS) and additive relationships that are implicitly captured by SNP genotypes.
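The link between distinct MAF and low LD can be illustrated with the standard upper bound on r² between two biallelic loci, which depends only on their allele frequencies. The following is a minimal sketch and is not taken from the thesis; the function name `max_r2` and the example frequencies are illustrative assumptions.

```python
def max_r2(p_a, p_b):
    """Largest r^2 attainable between two biallelic loci.

    p_a, p_b: frequencies of one designated allele at each locus (0 < p < 1).
    The bound follows from the range of the LD coefficient D, which is
    limited by the allele frequencies at the two loci.
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("allele frequencies must lie strictly between 0 and 1")
    # Largest positive D (designated alleles in coupling).
    d_pos = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    # Largest magnitude of negative D (designated alleles in repulsion).
    d_neg = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    d_max = max(d_pos, d_neg)
    return d_max ** 2 / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))


# Hypothetical example: a QTL and a SNP with equal frequencies can be in
# complete LD, whereas a rare QTL and a common SNP cannot.
print(round(max_r2(0.30, 0.30), 3))  # equal frequencies: bound is 1.0
print(round(max_r2(0.05, 0.40), 3))  # rare QTL, common SNP: about 0.079
```

Under these assumed frequencies, a QTL with a frequency of 0.05 can share at most about 8% of its variance with a SNP whose frequency is 0.40, no matter how close the two loci are.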
With low LD between QTL and SNPs, CS information is expected to capture QTL effects more accurately than LD information. CS refers to alleles at linked loci originating from the same parental chromosome, which is quantified by the identical grand-parental allele origins at linked loci. CS information is by definition independent of LD, but is affected by the distance between QTL and SNPs along the chromosome and by current Ne, which is usually determined by the mating design of a specific breeding program. The objectives of this thesis were to develop a statistical method to model CS explicitly, and to study the effects of historical LD, current Ne, MAF of QTL and SNP density on the contributions of LD and CS information to prediction accuracy. The CS model was developed by following the transmission of QTL alleles using allele origins at SNPs. Simulated half-sib datasets were analyzed to study contributions of LD and CS information to prediction accuracy for datasets that included many unrelated families. Simulated datasets of extended pedigrees with different mating designs were analyzed to study contributions of LD and CS information to prediction accuracy across validation generations without retraining. Results from half-sib datasets showed that when LD between QTL and SNPs was low, the accuracy of the model that fits SNP genotypes (LD model) decreased when the training data size was increased by adding independent sire families, whereas accuracies from the CS model and a combined LD-CS model increased and plateaued rapidly as the number of sire families increased. Results from half-sib datasets suggest that modeling CS explicitly improves prediction accuracy when LD between QTL and SNPs is low, especially when the training data size is increased by adding independent families. Results from extended pedigrees showed that the LD model resulted in high accuracy across validation generations only when LD between QTL and SNPs was high.
With low LD between QTL and SNPs, modeling CS explicitly resulted in higher accuracy than the LD model across validation generations when the mating design generated a large number of close relatives. Results from extended pedigrees suggest that modeling both LD and CS explicitly is expected to improve prediction accuracy when current Ne is small, and LD between QTL and SNPs is low due to distinct MAF, which is the typical situation in most livestock populations.
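The definition of CS as identical grand-parental allele origins at linked loci, and its dependence on chromosomal distance, can be sketched with a small simulation of gametes. This is an illustrative sketch only, not the CS model of the thesis: it assumes Haldane's map function (no crossover interference), and the function names and locus positions are hypothetical.

```python
import math
import random


def haldane_recomb_rate(distance_cm):
    """Recombination rate between two loci under Haldane's map function (assumed)."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


def simulate_origins(positions_cm, rng):
    """Grand-parental origin (0 = paternal, 1 = maternal) at each locus of one gamete.

    positions_cm must be sorted in ascending order along the chromosome.
    """
    origins = [rng.randint(0, 1)]
    for prev, cur in zip(positions_cm, positions_cm[1:]):
        if rng.random() < haldane_recomb_rate(cur - prev):
            origins.append(1 - origins[-1])  # recombination switches the origin
        else:
            origins.append(origins[-1])
    return origins


def cosegregation_rate(positions_cm, i, j, n_gametes=20000, seed=1):
    """Fraction of simulated gametes in which loci i and j share the same origin."""
    rng = random.Random(seed)
    same = 0
    for _ in range(n_gametes):
        origins = simulate_origins(positions_cm, rng)
        same += origins[i] == origins[j]
    return same / n_gametes


# Hypothetical layout: a QTL at 0 cM, one SNP at 1 cM and one SNP at 50 cM.
positions = [0.0, 1.0, 50.0]
print(cosegregation_rate(positions, 0, 1))  # close to 1 - haldane_recomb_rate(1.0)
print(cosegregation_rate(positions, 0, 2))  # close to 1 - haldane_recomb_rate(50.0)
```

In this sketch the allele origin at the nearby SNP almost always matches the origin at the QTL, while the origin at the distant SNP matches far less often. This is why allele origins at closely linked SNPs can be used to follow the transmission of QTL alleles.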
Application of the CS and the LD-CS models in field datasets has two major difficulties. First, obtaining allele origins for genome-wide SNPs can be computationally demanding. Second, the application of the CS model is limited to populations with correctly recorded pedigrees. CS information in populations without pedigree can be explicitly captured by fitting SNP haplotypes. The reason is that, as shown by our previous studies, the association between 1-cM haplotypes and QTL alleles is complete with a high SNP density of 200 SNPs/cM, and therefore 1-cM haplotypes can accurately follow the transmission of QTL alleles from the most recent common ancestor. Simulated datasets of extended pedigrees with different mating designs were analyzed to study contributions of fitting SNP genotypes and haplotypes to prediction accuracy across validation generations without retraining. Results showed that fitting both SNP genotypes and haplotypes gave accuracy similar to fitting only SNP genotypes when LD between QTL and SNPs was high, but significantly higher accuracy than fitting only SNP genotypes when LD between QTL and SNPs was low. In the analyses of several egg quality traits of commercial layer chickens, fitting both SNP genotypes and haplotypes improved prediction accuracy for traits for which fitting only SNP genotypes gave almost zero accuracy. Fitting haplotypes is an effective way to capture CS information for genomic prediction, especially when LD between QTL and SNPs is low and LD contributes little to prediction accuracy.
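One way to fit SNP haplotypes is to treat each distinct haplotype within a short chromosome window as an allele and to use the number of copies each individual carries as a covariate. The sketch below shows that coding under stated assumptions: the abstract does not specify how the thesis coded haplotypes, the window is given here by SNP indices rather than by a 1-cM map interval, and the function name `haplotype_design` and the example data are hypothetical.

```python
def haplotype_design(phased_genotypes, start, stop):
    """Build haplotype-allele covariates for one window of SNPs.

    phased_genotypes: list of (haplotype_1, haplotype_2) pairs, one pair per
        individual, each haplotype a sequence of 0/1 SNP alleles.
    start, stop: SNP index range [start, stop) that defines the window.

    Returns (alleles, design): the distinct haplotype alleles in order of first
    appearance, and one row per individual with the copy number (0, 1 or 2)
    of each haplotype allele.
    """
    alleles = {}  # haplotype allele -> column index
    rows = []
    for hap_1, hap_2 in phased_genotypes:
        counts = {}
        for hap in (hap_1, hap_2):
            key = tuple(hap[start:stop])
            col = alleles.setdefault(key, len(alleles))
            counts[col] = counts.get(col, 0) + 1
        rows.append(counts)
    design = [[row.get(col, 0) for col in range(len(alleles))] for row in rows]
    return list(alleles), design


# Hypothetical phased genotypes for three individuals at four SNPs.
phased = [
    ((0, 1, 1, 0), (0, 1, 0, 0)),
    ((0, 1, 1, 0), (0, 1, 1, 0)),
    ((1, 0, 0, 0), (0, 1, 0, 0)),
]
alleles, design = haplotype_design(phased, 0, 3)
print(alleles)  # [(0, 1, 1), (0, 1, 0), (1, 0, 0)]
print(design)   # [[1, 1, 0], [2, 0, 0], [0, 1, 1]]
```

Each row sums to 2 because every individual carries two haplotypes per window. In a prediction model, such columns could be fitted alongside SNP genotype covariates, so that haplotype effects track the transmission of chromosome segments and SNP effects capture LD.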
In conclusion, genomic prediction models that fit SNP genotypes capture both LD and CS information. When most QTL have much lower MAF than SNPs, LD between QTL and SNPs is low, and the accuracy obtained from fitting SNP genotypes is mainly contributed by CS information that is implicitly captured by SNP genotypes. This accuracy decreases when the training data size is increased by adding independent families, and deteriorates across validation generations without retraining, because CS information captured by SNP genotypes over long chromosome distances is eroded rapidly by recombination. CS information can be explicitly captured by modeling transmission of putative QTL alleles within short chromosome regions using allele origins at SNPs. Modeling CS explicitly makes a limited contribution to accuracy when LD between QTL and SNPs is high, but a substantial contribution when LD between QTL and SNPs is low. CS information makes a greater contribution to accuracy in populations with smaller current Ne, because fewer haplotypes segregate in a population with a smaller current Ne, and the effect of each haplotype can be estimated more accurately. Therefore, modeling CS explicitly is expected to result in high accuracy across validation generations in mating designs that create small current Ne. For populations without pedigree information, CS information can be modeled explicitly by fitting SNP haplotypes within short chromosome regions. Fitting haplotypes captures as much CS information as modeling CS by following the transmission of QTL alleles of pedigree founders, and also captures CS information from most recent common ancestors. Although fitting both SNP genotypes and haplotypes improved accuracy for several traits in layer chickens for which the SNP model had low accuracy, the potential advantage of the SNP-haplotype model in improving accuracy for livestock populations requires further study.
Sun, Xiaochen, "Genomic prediction using linkage disequilibrium and co-segregation" (2014). Graduate Theses and Dissertations. 14273.