Date of Award
Doctor of Philosophy
Daniel J. Nordman
Random forest (RF) is a widely used machine learning method with competitive prediction performance in many fields, including biological science, finance, chemical engineering, agroscience, and medical analysis. In this dissertation, we study several characteristics and modifications of RFs with the aim of improving their prediction performance.
In CHAPTER 1, we review the mechanics of classification and regression trees (CARTs), bootstrap aggregation (bagging) and RFs. The properties of RFs are discussed, along with several variations of this method.
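The mechanics reviewed in Chapter 1 can be illustrated with a minimal sketch. It is not code from the dissertation: it uses one-split regression trees (stumps) rather than fully grown CARTs, and the names `fit_stump`, `fit_forest`, `predict_forest`, `oob_mse`, and the parameter `mtry` are illustrative assumptions. Each tree is fit to a bootstrap sample (bagging) and may only split on a random subset of predictors (the RF ingredient). The out-of-bag (OOB) error is computed from trees whose bootstrap sample did not contain the case.

```python
import random

def fit_stump(X, y, features):
    """Best single split among `features`, chosen by residual sum of squares."""
    n = len(y)
    mean_all = sum(y) / n
    # Fallback (no valid split): an infinite threshold sends every case left.
    best = (float("inf"), features[0], float("inf"), mean_all, mean_all)
    for j in features:
        order = sorted(range(n), key=lambda i: X[i][j])
        for k in range(1, n):
            a, b = X[order[k - 1]][j], X[order[k]][j]
            if a == b:
                continue  # tied values cannot be separated by a threshold
            left = [y[i] for i in order[:k]]
            right = [y[i] for i in order[k:]]
            ml, mr = sum(left) / k, sum(right) / (n - k)
            rss = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
            if rss < best[0]:
                best = (rss, j, (a + b) / 2, ml, mr)
    return best[1:]  # (feature, threshold, left mean, right mean)

def predict_stump(stump, x):
    j, thr, ml, mr = stump
    return ml if x[j] <= thr else mr

def fit_forest(X, y, n_trees=50, mtry=1, seed=0):
    """Bagging plus a random predictor subset per split (one split per stump)."""
    rng = random.Random(seed)
    n, p = len(y), len(X[0])
    forest = []
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        feats = rng.sample(range(p), mtry)           # random predictor subset
        stump = fit_stump([X[i] for i in bag], [y[i] for i in bag], feats)
        forest.append((stump, set(bag)))
    return forest

def predict_forest(forest, x):
    """Average the predictions of all trees."""
    return sum(predict_stump(s, x) for s, _ in forest) / len(forest)

def oob_mse(forest, X, y):
    """Mean squared error using, for each case, only trees that did not see it."""
    errs = []
    for i, (x, target) in enumerate(zip(X, y)):
        preds = [predict_stump(s, x) for s, bag in forest if i not in bag]
        if preds:
            errs.append((sum(preds) / len(preds) - target) ** 2)
    return sum(errs) / len(errs)

# Simulated data (an assumption for illustration): y depends on x0 only.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(60)]
y = [(1.0 if x[0] > 0.5 else 0.0) + rng.gauss(0, 0.1) for x in X]
forest = fit_forest(X, y)
print(predict_forest(forest, [0.9, 0.5]), predict_forest(forest, [0.1, 0.5]))
print(oob_mse(forest, X, y))
```

With `mtry` equal to the number of predictors, the sketch reduces to plain bagging. A full RF would grow deep trees and redraw the predictor subset at every split.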
In CHAPTER 2, we describe a counter-intuitive discovery using RFs: out-of-sample prediction error can be reduced by augmenting the regressor with a new, scientifically meaningless predictor variable that is independent of all variables in the dataset. We explain this phenomenon using a simulated example and discuss the importance of this result for interpreting predictor variable importance in RFs.
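The comparison described for Chapter 2 can be organised as in the following sketch, which only shows the experimental protocol: fit the same learner with and without an added pure-noise predictor, then compare test error. The helper names `add_noise_column` and `compare_with_noise` are assumptions for illustration, and `one_nn` is a stand-in learner so that the sketch runs; it is not a random forest and is not expected to reproduce the reported effect. Any RF implementation with the same `fit_predict(X_train, y_train, X_test)` shape could be passed in instead.

```python
import random

def add_noise_column(X, rng):
    """Append one standard-normal predictor, drawn independently of the data."""
    return [row + [rng.gauss(0.0, 1.0)] for row in X]

def compare_with_noise(fit_predict, X_tr, y_tr, X_te, y_te, seed=0):
    """Return (test MSE without the noise predictor, test MSE with it)."""
    rng = random.Random(seed)

    def mse(train_rows, test_rows):
        preds = fit_predict(train_rows, y_tr, test_rows)
        return sum((p - t) ** 2 for p, t in zip(preds, y_te)) / len(y_te)

    plain = mse(X_tr, X_te)
    augmented = mse(add_noise_column(X_tr, rng), add_noise_column(X_te, rng))
    return plain, augmented

def one_nn(X_tr, y_tr, X_te):
    """Stand-in learner (1-nearest neighbour), used only so the sketch runs."""
    def nearest(x):
        return min(range(len(X_tr)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(X_tr[i], x)))
    return [y_tr[nearest(x)] for x in X_te]

# Simulated data (an assumption for illustration).
rng = random.Random(3)
X = [[rng.random(), rng.random()] for _ in range(40)]
y = [x[0] + x[1] for x in X]
result = compare_with_noise(one_nn, X[:30], y[:30], X[30:], y[30:])
print(result)
```

In practice the comparison would be repeated over many seeds and train/test splits, since a single pair of error values says little about a systematic difference.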
RF predictions can be biased. In CHAPTER 3, we apply an iterative debiasing approach based on bagging to RFs and test this bias-correction method on real datasets. The approach can significantly improve RF predictions, and the number of debiasing iterations can be tuned using cross-validation.
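The abstract does not spell out the debiasing scheme, so the following is only one plausible form of iterative, bagging-based bias correction, stated as an assumption: fit a bagged ensemble, compute out-of-bag residuals, fit the next ensemble to those residuals, and add the stages together. Bagged one-split trees on a single predictor stand in for a full RF, and all names (`fit_stump`, `bagged_stumps`, `debiased_fit`, `n_iter`) are illustrative.

```python
import random

def fit_stump(xs, ys):
    """Best single split of 1-D data by residual sum of squares."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    mean_all = sum(ys) / n
    # Fallback (no valid split): an infinite threshold sends every case left.
    best = (float("inf"), float("inf"), mean_all, mean_all)
    for k in range(1, n):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # tied values cannot be separated by a threshold
        left = [v for _, v in pairs[:k]]
        right = [v for _, v in pairs[k:]]
        ml, mr = sum(left) / k, sum(right) / (n - k)
        rss = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if rss < best[0]:
            best = (rss, (pairs[k - 1][0] + pairs[k][0]) / 2, ml, mr)
    return best[1:]  # (threshold, left mean, right mean)

def bagged_stumps(xs, ys, n_trees, rng):
    """Return (predict function, out-of-bag predictions for the training cases)."""
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        stump = fit_stump([xs[i] for i in bag], [ys[i] for i in bag])
        trees.append((stump, set(bag)))

    def one(stump, x):
        thr, ml, mr = stump
        return ml if x <= thr else mr

    def predict(x):
        return sum(one(s, x) for s, _ in trees) / len(trees)

    oob = []
    for i, x in enumerate(xs):
        preds = [one(s, x) for s, bag in trees if i not in bag]
        # If a case was in every bootstrap sample, fall back to the full ensemble.
        oob.append(sum(preds) / len(preds) if preds else predict(x))
    return predict, oob

def debiased_fit(xs, ys, n_iter, n_trees=30, seed=0):
    """Stage 0 fits y; each later stage fits the previous stage's OOB residuals."""
    rng = random.Random(seed)
    stages, target = [], list(ys)
    for _ in range(n_iter + 1):
        predict, oob = bagged_stumps(xs, target, n_trees, rng)
        stages.append(predict)
        target = [t - o for t, o in zip(target, oob)]  # next target: OOB residuals
    return lambda x: sum(stage(x) for stage in stages)

# Simulated data (an assumption): a linear trend that stumps approximate poorly.
rng = random.Random(2)
xs = [rng.random() for _ in range(80)]
ys = [x + rng.gauss(0, 0.05) for x in xs]
grid = [i / 100 for i in range(101)]

def mse(f):
    """Error against the true mean function f(x) = x on an evenly spaced grid."""
    return sum((f(x) - x) ** 2 for x in grid) / len(grid)

mse_base = mse(debiased_fit(xs, ys, n_iter=0))
mse_debiased = mse(debiased_fit(xs, ys, n_iter=3))
print(mse_base, mse_debiased)
```

Here `n_iter` is fixed by hand. Following the abstract, it could instead be chosen by cross-validation, keeping the value with the smallest held-out error.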
Standard RF methodology generates a single RF from a given training sample, regardless of the test cases to be predicted. In CHAPTER 4, we propose a new way to grow an RF specifically for predicting a particular test case, which we call Case-Specific Random Forests (CSRF). We also propose Case-Specific Variable Importance (CSVI), a new definition of predictor variable importance in terms of prediction performance on a particular test case.
Prediction error estimation is useful for evaluating a prediction rule. Existing methods estimate prediction error averaged over the distribution of test cases. In CHAPTER 5, we propose a method to estimate the expected prediction loss at a specific regressor point using RF methodology.
Xu, Ruo, "Improvements to random forest methodology" (2013). Graduate Theses and Dissertations. 13052.