Improvements to random forest methodology

Xu, Ruo

File

Xu_iastate_0097E_13365.pdf (1.04 MB)

Date

2013-01-01

Authors

Xu, Ruo

Advisor

Dan Nettleton

Daniel J. Nordman

Department

Statistics

Abstract

Random forest (RF) is a widely used machine learning method that shows competitive prediction performance in various fields, including biological science, finance, chemical engineering, agroscience, and medical analysis. In this dissertation, we study some characteristics and modifications of RFs in order to improve their prediction performance.

In CHAPTER 1, we review the mechanics of classification and regression trees (CARTs), bootstrap aggregation (bagging) and RFs. The properties of RFs are discussed, along with several variations of this method.
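
As a rough illustration of the bagging step reviewed in that chapter, the following is a minimal sketch, not code from the dissertation. It assumes a single numeric predictor and uses one-split regression trees (stumps) as the base learner; the names `fit_stump` and `bagged_stumps` are hypothetical. A full RF would additionally grow deeper trees and consider a random subset of predictors at each split.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single predictor by least squares."""
    best = None
    # Candidate split points: every distinct x except the largest,
    # so that both sides of the split are non-empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:
        # All x values are equal: predict the overall mean.
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Average stumps fitted to bootstrap resamples of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        # Bootstrap sample: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(tree(x) for tree in trees)
```

Calling `bagged_stumps(xs, ys)` returns a prediction function; averaging over bootstrap resamples is what distinguishes the bagged predictor from a single tree.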

In CHAPTER 2, we describe a counter-intuitive discovery using RFs: out-of-sample prediction errors can be reduced by augmenting the regressor vector with a new, scientifically meaningless predictor variable that is independent of all variables in the dataset. We explain this phenomenon using a simulated example and discuss the importance of this result in interpreting predictor variable importance in RFs.

RF predictions can be biased. In CHAPTER 3, we apply an iterative debiasing approach based on bagging to RFs and test this bias correction method on real datasets. The debiasing approach can significantly improve RF predictions. The number of debiasing iterations can be tuned using cross-validation.
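
The abstract does not spell out the debiasing algorithm, so the following is only one plausible sketch of an iterative, bagging-based bias correction, not the dissertation's method. It assumes each stage fits a bagged ensemble to the current target, computes out-of-bag residuals, and passes those residuals to the next stage as the new target; the final prediction is the sum of the stages. Bagged stumps on a single predictor stand in for an RF, and the names `fit_stump`, `fit_bag`, and `debiased_bagging` are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression tree on a single predictor by least squares."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x so the right side is non-empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x values equal: predict the overall mean
        m = mean(ys)
        return lambda x: m
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def fit_bag(xs, ys, n_trees, rng):
    """Return (predict, oob_pred) for a bagged ensemble of stumps."""
    n = len(xs)
    trees = []
    oob = [[] for _ in range(n)]
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        trees.append(tree)
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:  # case i was not used to fit this tree
                oob[i].append(tree(xs[i]))

    def predict(x):
        return mean(tree(x) for tree in trees)

    # Fall back to the full-ensemble prediction if a case was never out of bag.
    oob_pred = [mean(p) if p else predict(xs[i]) for i, p in enumerate(oob)]
    return predict, oob_pred


def debiased_bagging(xs, ys, n_stages=3, n_trees=50, seed=0):
    """Sum of bagged stages, each fitted to the previous stage's out-of-bag residuals."""
    rng = random.Random(seed)
    stages = []
    target = list(ys)
    for _ in range(n_stages):
        predict, oob_pred = fit_bag(xs, target, n_trees, rng)
        stages.append(predict)
        # The out-of-bag residuals become the target of the next stage.
        target = [t - p for t, p in zip(target, oob_pred)]
    return lambda x: sum(stage(x) for stage in stages)
```

In this sketch `n_stages` plays the role of the number of debiasing iterations; it could be chosen by cross-validation, for example by comparing held-out error across a small grid of values.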

Standard RF methodology generates a common RF from a given training sample, regardless of the test cases. In CHAPTER 4, we propose a new way to grow an RF specifically for predicting a particular test case, which we call Case-Specific Random Forests (CSRF). We also suggest Case-Specific Variable Importance (CSVI), a new definition of predictor variable importance in terms of the prediction performance on a particular test case.

Prediction error estimation is generally useful in evaluating a prediction rule. All existing methods estimate prediction errors averaged over the distribution of a test set. In CHAPTER 5, we propose a method to estimate the expected prediction loss at a specific regressor point using RF methodology.

Copyright

2013-01-01

Collections

Theses and Dissertations
