Applications of Non-Parametric Kernel Smoothing Estimators in Monte Carlo Risk Assessments

Date
2012-01-01
Authors
Trapp II, Allan
Major Professor
Advisor
Philip M. Dixon
Department
Statistics
Abstract

This dissertation addresses two separate issues involving the estimation of risk. The first concerns the creation of a schedule for the viability testing of seeds stored in long-term storage facilities. The second concerns the time required to simulate risk with a two-dimensional Monte Carlo simulation.

Genebank managers conduct viability tests on stored seeds so they can replace lots that have viability near a critical threshold, such as 50% or 85% germination. Currently, these tests are typically scheduled at uniform intervals; testing every 5 years is common. A manager needs to balance the cost of an additional test against the possibility of losing a seed lot due to late retesting. We developed a data-informed method to schedule viability tests for a collection of 2,833 maize seed lots with 3 to 7 completed viability tests per lot. Given these historical data reporting on seed viability at arbitrary times, we fit a hierarchical Bayesian seed-viability model with random seed-lot-specific coefficients. The posterior distribution of the predicted time to cross below a critical threshold was estimated for each seed lot. We recommend a predicted quantile as a retest time, chosen to balance the importance of catching quickly decaying lots against the cost of premature tests. The method can be used with any seed-viability model; we focused on two, the Avrami viability curve and a quadratic curve that accounts for seed after-ripening. After fitting both models, we found that the quadratic curve gave more plausible predictions than did the Avrami curve. Also, a receiver operating characteristic (ROC) curve analysis and a follow-up test demonstrated that a 0.05 quantile yields reasonable predictions.
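The quantile-based retest rule described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that viability follows v(t) = b0 + b1*t + b2*t^2 on the proportion scale with b2 < 0, and that posterior draws of the lot-specific coefficients are already available. The parametrisation, the function names, and the simulated draws are hypothetical and are not taken from the dissertation.

```python
import numpy as np

def quadratic_crossing_time(b0, b1, b2, threshold):
    """Time at which b0 + b1*t + b2*t**2 falls to `threshold`.

    Assumes b2 < 0 and b0 > threshold, so exactly one root is positive.
    Inputs may be arrays holding one value per posterior draw.
    """
    b0, b1, b2 = (np.asarray(v, dtype=float) for v in (b0, b1, b2))
    disc = b1**2 - 4.0 * b2 * (b0 - threshold)
    # With b2 < 0 this root is the positive one.
    return (-b1 - np.sqrt(disc)) / (2.0 * b2)

def retest_time(crossing_draws, q=0.05):
    """Retest time taken as the q-th quantile of the crossing-time draws."""
    return float(np.quantile(crossing_draws, q))

# Illustrative draws only; they are NOT fitted to any seed-lot data.
rng = np.random.default_rng(0)
b0 = rng.normal(0.95, 0.01, size=4000)               # initial viability
b1 = rng.normal(0.010, 0.002, size=4000)             # early rise (after-ripening)
b2 = -np.abs(rng.normal(0.001, 0.0002, size=4000))   # eventual decline
draws = quadratic_crossing_time(b0, b1, b2, threshold=0.85)

print("0.05-quantile retest time:", retest_time(draws, q=0.05))
print("posterior median crossing time:", float(np.median(draws)))
```

A low quantile such as 0.05 schedules the test well before the median predicted crossing time, which favours catching quickly decaying lots at the cost of some premature tests.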

The two-dimensional Monte Carlo simulation is an important tool for quantitative risk assessors. Its framework easily propagates aleatoric and epistemic uncertainties related to risk. Aleatoric uncertainty concerns the inherent, irreducible variability of a risk factor. Epistemic uncertainty concerns the reducible uncertainty of a fixed risk factor. The total crop yield of a corn field is an example of an aleatoric uncertainty, while the mean of corn yield is an epistemic uncertainty. The traditional application of a two-dimensional Monte Carlo simulation in a risk assessment requires many Monte Carlo samples. In a common case, a risk assessor samples 10,000 epistemic factor vectors. For each vector, the assessor generates 10,000 vectors of aleatoric factors and calculates risk. The purpose of heavy aleatoric simulation is to estimate a cumulative frequency distribution (CDF) of risk conditional on an epistemic vector. This approach requires 10^8 (10,000 × 10,000) calculations of risk and is computationally slow. We propose a more efficient method that reduces the number of simulations in the aleatoric dimension by pooling together risk values of epistemic vectors close to a target epistemic vector and estimating the conditional CDF using the multivariate Nadaraya-Watson estimator. We examine the risk of hemolytic uremic syndrome in young children exposed to Escherichia coli O157:H7 in frozen ground beef patties and demonstrate that our method replicates the results of the traditional two-dimensional Monte Carlo risk assessment. Furthermore, for this problem, we find that our method is three times faster than the traditional method.
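The pooling idea can be sketched as a kernel-weighted empirical CDF. The example below is a minimal illustration under stated assumptions: a product Gaussian kernel with a single scalar bandwidth h, a small number of aleatoric draws per epistemic vector, and a toy risk model (risk = first epistemic coordinate plus standard normal noise). None of these choices come from the dissertation, and the function name is hypothetical.

```python
import numpy as np

def nw_conditional_cdf(theta, risk, theta0, r_grid, h):
    """Nadaraya-Watson estimate of P(risk <= r | theta = theta0).

    theta  : (n_e, d) sampled epistemic vectors
    risk   : (n_e, m) risk values, m aleatoric draws per epistemic vector
    theta0 : (d,) target epistemic vector
    r_grid : (n_r,) risk values at which to evaluate the CDF
    h      : scalar bandwidth (assumed common to all d coordinates)
    """
    theta = np.asarray(theta, dtype=float)
    risk = np.asarray(risk, dtype=float)
    z = (theta - np.asarray(theta0, dtype=float)) / h
    w = np.exp(-0.5 * np.sum(z**2, axis=1))   # Gaussian kernel weights
    w = w / w.sum()                           # normalise (assumes h is not tiny)
    # Per-vector empirical CDF on the grid, shape (n_e, n_r).
    ecdf = (risk[:, :, None] <= r_grid[None, None, :]).mean(axis=1)
    return w @ ecdf                           # pool across nearby epistemic vectors

# Toy model, for illustration only.
rng = np.random.default_rng(1)
n_e, m = 2000, 5                              # far fewer aleatoric draws per vector
theta = rng.uniform(-1.0, 1.0, size=(n_e, 2))
risk = theta[:, :1] + rng.normal(size=(n_e, m))
r_grid = np.linspace(-3.0, 3.0, 61)
cdf = nw_conditional_cdf(theta, risk, np.zeros(2), r_grid, h=0.2)
print("estimated P(risk <= 0 | theta = 0):", float(cdf[30]))
```

In this toy setting the sketch uses 2,000 × 5 = 10,000 risk evaluations, and the estimate at the target vector borrows strength from neighbouring epistemic vectors rather than from a long aleatoric run at a single vector.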

In order to perform the modified two-dimensional Monte Carlo simulation of risk, we must specify a bandwidth, h. In general, researchers pick an h that balances the estimator's bias and variance. They minimize criteria such as average squared error (ASE), penalized ASE, or asymptotic mean integrated squared error (AMISE) to select an "optimal" h. A review of the optimal bandwidth selection literature related to multivariate kernel-regression estimation shows that there is still ambiguity about the best bandwidth selector. We compare the effects of five penalized-ASE bandwidth selectors and an AMISE bandwidth plug-in on the average accuracy of a multivariate Nadaraya-Watson kernel-regression estimator of a CDF of hemolytic uremic syndrome (HUS) risk in young children exposed to Escherichia coli O157:H7 in ground beef patties. We consider these six bandwidth selectors because they compute relatively quickly, and researchers generally desire fast results. Simulating different amounts of data (n_e = 1000, 3000, and 5000) from each of three HUS-risk models of varying complexity, we find that none of the selectors consistently results in the most accurate CDF estimator. However, if the goal is to produce accurate quantile-quantile risk assessment results (Pouillot and Delignette-Muller, 2010), then the AMISE-based selector performs best.
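As a concrete illustration of a penalized-ASE criterion, the sketch below selects h for a Nadaraya-Watson regression by minimizing generalized cross-validation (GCV), which divides the ASE of the fitted values by (1 - tr(S)/n)^2, where S is the smoother matrix. GCV is used here only as one well-known penalizing function; the abstract does not name its five selectors, so this should not be read as one of them. The Gaussian kernel, the candidate grid, the toy data, and all function names are assumptions. For CDF estimation the response y would be an indicator that risk does not exceed a given value.

```python
import numpy as np

def nw_smoother_matrix(x, h):
    """Row-normalised Gaussian kernel weights; x has shape (n, d)."""
    d2 = (((x[:, None, :] - x[None, :, :]) / h) ** 2).sum(axis=2)
    k = np.exp(-0.5 * d2)
    return k / k.sum(axis=1, keepdims=True)

def gcv_score(x, y, h):
    """ASE of the fit, inflated by the GCV penalty (1 - tr(S)/n)^-2."""
    s = nw_smoother_matrix(x, h)
    ase = np.mean((y - s @ y) ** 2)
    penalty = 1.0 - np.trace(s) / len(y)
    if penalty <= 0.0:            # h so small that the fit interpolates
        return np.inf
    return ase / penalty**2

def select_bandwidth(x, y, candidates):
    """Return the candidate bandwidth with the smallest GCV score."""
    scores = [gcv_score(x, y, h) for h in candidates]
    return candidates[int(np.argmin(scores))]

# Toy regression data, for illustration only.
rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=(200, 1))
y = np.sin(2.0 * np.pi * x[:, 0]) + rng.normal(scale=0.3, size=200)
candidates = [0.005, 0.02, 0.05, 0.1, 0.5, 2.0]
best_h = select_bandwidth(x, y, candidates)
print("selected bandwidth:", best_h)
```

The penalty grows as h shrinks and the fit approaches interpolation, which offsets the fact that the unpenalized ASE of the fitted values always favours the smallest bandwidth.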

Copyright
Sun Jan 01 00:00:00 UTC 2012