Campus Units

Statistics

Document Type

Conference Proceeding

Publication Version

Published Version

Publication Date

2015

Journal or Book Title

JSM Proceedings

First Page

2537

Last Page

2543

Conference Title

2015 Joint Statistical Meetings

Conference Date

August 8–13, 2015

City

Seattle, Washington

Abstract

Some training datasets may be too large to store on a single computer. Such datasets can be partitioned and stored on separate computers connected in a parallel computing environment. To predict the response of a specific target case when training data are partitioned, we propose a method for finding, within each partition, the training cases most relevant for predicting the target case's response. These most relevant training cases from each partition can then be combined into a single dataset, a subset of the entire training dataset small enough to store and analyze in memory on a single computer. To generate a prediction from this selected subset, we use Case-Specific Random Forests, a variation of random forests that replaces the uniform bootstrap sampling used to build each tree with unequally weighted bootstrap sampling, in which training cases more similar to the target case receive greater weight. We demonstrate our method using a concrete dataset as an example. Our results show that predictions generated from a small selected subset of a partitioned training dataset can be as accurate as predictions generated in the traditional manner from the entire training dataset.
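The sketch below is a minimal illustration of the two-stage workflow the abstract describes: select the most relevant cases within each partition, pool them, then predict with a case-specific random forest whose trees are grown on similarity-weighted bootstrap samples. It assumes Python with NumPy and scikit-learn; the random-forest proximity used as the similarity measure, the subset size k, the tree counts, and the simulated data are all illustrative assumptions, not details taken from the proceeding.

```python
# Hypothetical sketch of partition-wise case selection followed by a
# case-specific random forest (CSRF) prediction. Similarity measure,
# subset size, and data are illustrative choices, not the authors' spec.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def proximity_to_target(forest, X_part, x_target):
    """Fraction of trees in which each training case lands in the same
    leaf as the target case (a standard random-forest proximity)."""
    leaves_train = forest.apply(X_part)                     # (n, n_trees)
    leaves_target = forest.apply(x_target.reshape(1, -1))   # (1, n_trees)
    return (leaves_train == leaves_target).mean(axis=1)

def select_relevant(X_part, y_part, x_target, k=100):
    """Within one partition, keep the k cases most similar to the target."""
    rf = RandomForestRegressor(n_estimators=100, random_state=0)
    rf.fit(X_part, y_part)
    prox = proximity_to_target(rf, X_part, x_target)
    idx = np.argsort(prox)[::-1][:k]
    return X_part[idx], y_part[idx], prox[idx]

def csrf_predict(X_sel, y_sel, w, x_target, n_trees=300):
    """CSRF: grow each tree on a bootstrap sample drawn with probability
    proportional to the similarity weights, then average predictions."""
    p = w + 1e-12          # guard against all-zero weights
    p = p / p.sum()
    n = len(y_sel)
    preds = []
    for b in range(n_trees):
        boot = rng.choice(n, size=n, replace=True, p=p)
        tree = DecisionTreeRegressor(max_features="sqrt", random_state=b)
        tree.fit(X_sel[boot], y_sel[boot])
        preds.append(tree.predict(x_target.reshape(1, -1))[0])
    return float(np.mean(preds))

# Toy demonstration: simulated data split across 4 "machines".
X = rng.normal(size=(4000, 8))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=4000)
x0 = rng.normal(size=8)                         # target case
partitions = np.array_split(np.arange(4000), 4)

selected = [select_relevant(X[i], y[i], x0, k=100) for i in partitions]
X_sel = np.vstack([s[0] for s in selected])
y_sel = np.concatenate([s[1] for s in selected])
w = np.concatenate([s[2] for s in selected])

print("CSRF prediction from selected subset:", csrf_predict(X_sel, y_sel, w, x0))
```

Note that only the 400 selected cases (of 4,000) are pooled for the final fit, mirroring the abstract's point that the combined subset is small enough to analyze in memory on a single machine.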

Comments

This proceeding is published as Zimmerman, J., Nettleton, D. (2015). Case-specific random forests for big data prediction. In JSM Proceedings, General Methodology. Alexandria, VA: American Statistical Association, pp. 2537–2543. Posted with permission.

Copyright Owner

American Statistical Association

Language

en

File Format

application/pdf
