Modifications to classification and regression trees to solve problems related to imperfect detection and dependence

Date
2013-01-01
Authors
McKelvey, Mark
Major Professor
Advisor
Philip Dixon
Department
Statistics
Abstract

Classification and regression tree (CART) models provide a flexible way to model species' habitat occupancy status, but standard CART algorithms treat non-detections as true absences and sampling locations as independent, which leaves room for extensions.

One such extension addresses the survey error of imperfect detection. When an individual is not detected, the non-detection is often taken as a sign of absence. Under imperfect detection, however, failing to find a species at a location does not mean that the species is absent there.

We outline four methods for including detection probability in the process of growing the tree, and illustrate these methods using data from a study of mountain plovers (Dinsmore et al. 2003). The results depend on the method used to estimate detection and occupancy. For the mountain plover data, the tree structures produced by three of the methods are identical to that produced by the naive tree, in which detection is ignored. The fourth method yields different splitting choices. Estimates of occupancy probability from the naive tree are consistently lower than those from the detection-adjusted trees. Accounting for imperfect detection is therefore crucial even when occupancy is modeled using CART.
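The direction of this bias can be shown with a minimal sketch. It is not one of the four methods from the thesis, which the abstract does not specify. It assumes the textbook occupancy setup: a constant per-visit detection probability `p`, a fixed number of independent visits per site, and no false positives. The function names are hypothetical.

```python
def naive_occupancy(psi: float, p: float, visits: int) -> float:
    """Expected share of sites with at least one detection.

    psi    -- true occupancy probability (assumed constant across sites)
    p      -- per-visit detection probability, given the site is occupied
    visits -- number of independent visits per site
    """
    if not (0.0 <= psi <= 1.0 and 0.0 <= p <= 1.0):
        raise ValueError("psi and p must lie in [0, 1]")
    if visits < 1:
        raise ValueError("visits must be at least 1")
    # An occupied site is missed on every visit with probability (1 - p)**visits.
    return psi * (1.0 - (1.0 - p) ** visits)


def adjusted_occupancy(naive: float, p: float, visits: int) -> float:
    """Undo the detection bias when p is known (an assumption; p is
    normally estimated from repeat-visit data)."""
    p_star = 1.0 - (1.0 - p) ** visits
    if p_star == 0.0:
        raise ValueError("occupancy cannot be recovered when p is 0")
    return min(1.0, naive / p_star)
```

Under these assumptions the naive value can never exceed the true occupancy, which is consistent with the lower naive estimates reported above.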

In addition to imperfect detection, another extension to standard CART algorithms deals with spatial correlation. Many studies use a cluster-type sampling design in which sampling locations are clearly spatially correlated. This correlation biases the variance estimates for node occupancy in CART. We suggest a generalized estimating equation (GEE)-based approach in which the naive variance estimates (calculated as if all locations were independent) are "corrected" based on the data available in each parent node of the tree. The corrected variance estimates are then used to revise the binary-split decision criterion of the tree. The variances of the nodes in a split are assumed to be unequal. We demonstrate this method using data from a study on rats and from a study on bird occurrences in Oregon.
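The abstract does not give the GEE-based correction or the revised split criterion. As a rough illustration of the idea only, the sketch below swaps in the textbook design effect for equal-sized clusters with an exchangeable correlation `rho`, and compares two child nodes with an unequal-variance statistic. The function names, the known `rho`, and the form of the statistic are all assumptions.

```python
import math


def naive_variance(p_hat: float, n: int) -> float:
    """Variance of a node proportion as if all n locations were independent."""
    return p_hat * (1.0 - p_hat) / n


def corrected_variance(p_hat: float, n: int, cluster_size: int, rho: float) -> float:
    """Inflate the naive variance by the design effect 1 + (m - 1) * rho.

    Assumes equal cluster sizes m and a common within-cluster correlation rho
    (illustrative stand-in, not the thesis's GEE estimator).
    """
    design_effect = 1.0 + (cluster_size - 1) * rho
    return naive_variance(p_hat, n) * design_effect


def split_statistic(p_left: float, var_left: float,
                    p_right: float, var_right: float) -> float:
    """Standardized difference between child-node occupancy estimates,
    keeping a separate variance for each node."""
    return abs(p_left - p_right) / math.sqrt(var_left + var_right)
```

With positive `rho`, the corrected variances are larger than the naive ones, so the same difference between child nodes yields a smaller statistic and a less attractive split.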

When we change how trees are grown in CART (i.e., how nodes are split), we expect to see potentially different trees. Such changes also mean that the methodology for pruning the trees may need corresponding changes. Both types of methodology proposed above, incorporating imperfect detection and accommodating correlated data, led to an examination of current pruning criteria. Taking both new algorithms into account, we discuss several pruning criteria that could be used in conjunction with our proposed CART methodology. We evaluated each criterion on simulated examples and used the resulting error rates to assess its performance.

Copyright
2013