Degree Type

Creative Component

Semester of Graduation

Spring 2019

Department

Statistics

First Major Professor

Lily Wang

Degree(s)

Master of Science (MS)

Major(s)

Statistics

Abstract

In this study, we used loan application data collected in Iowa in 2014 under the Home Mortgage Disclosure Act (HMDA) to analyze and model approval rates. The approval rates of Hispanic applicants and of applicants from minority races were compared with those of non-Hispanic white applicants using hypothesis testing. For loans from conventional institutions, the denial rate of Hispanic applicants was statistically significantly higher than that of non-Hispanic white applicants, and the denial rates of Asian and Black applicants were statistically significantly higher than that of non-Hispanic white applicants, whether the loan was for home purchase, improvement, or refinancing. However, no such difference was observed for some FHA loans or for any FSA or VA loans.
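The abstract does not say which hypothesis test was used. As a minimal sketch, one common choice for comparing two denial rates is a one-sided two-proportion z-test, shown here in Python. The function name and the counts in the usage lines are hypothetical and are not taken from the study's data.

```python
import math

def two_proportion_z_test(denied_a, total_a, denied_b, total_b):
    """One-sided test of H0: rate_a == rate_b against H1: rate_a > rate_b.

    Returns the z statistic and the upper-tail p-value under the
    normal approximation with a pooled variance estimate.
    """
    p_a = denied_a / total_a
    p_b = denied_b / total_b
    # Pooled denial rate under the null hypothesis of equal rates.
    pooled = (denied_a + denied_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # P(Z >= z) for a standard normal Z.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only (not from the HMDA data):
# group A has 30 denials in 100 applications, group B has 15 in 100.
z, p = two_proportion_z_test(30, 100, 15, 100)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

A small p-value here would suggest that group A's denial rate is higher than group B's. The normal approximation assumes reasonably large counts in both groups.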

Several classification methods, including logistic regression, linear discriminant analysis (LDA), generalized additive models (GAM), and random forest, were used to model the approval rates. Model performance was assessed by the area under the ROC curve (AUC) in 10-fold cross-validation. It was found that:

(1) The sensitive variables {ethnicity, race, gender} are statistically significant in the logistic regression models. The co-applicant’s ethnicity and the co-applicant’s race also rank high in importance in the random forest;

(2) The model performance improves as more variables are allowed during model construction. However, the improvement slows once the number of variables reaches p = 6 or 7, and increasing p further does not dramatically improve the model performance. Among the models built from the same set of variables by logistic regression, LDA and GAM, GAM performs systematically better than logistic regression and LDA, and logistic regression performs systematically better than LDA;

(3) Random forest performs the best among all the methods when p ≥ 9, but worse than the others when p < 8. Meanwhile, the performance of random forest depends strongly on the Ntree parameter, which is the total number of trees grown during model construction. The random forest requires Ntree ≥ 256 (in some cases 512) to outperform the other three methods;

(4) Geographical information impacts the classification considerably, and the impact of ethnicity/race/gender on classification is weaker than that of geographical information.
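The evaluation criterion used above, AUC in 10-fold cross-validation, can be sketched in plain Python as follows. This is an illustrative sketch, not the study's code: the helper names (`auc`, `kfold_indices`, `cross_validated_auc`) and the `fit`/`predict` callables are hypothetical, and the AUC is computed from its pairwise definition rather than with any particular library.

```python
import random

def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case gets the higher score; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def kfold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validated_auc(X, y, fit, predict, k=10, seed=0):
    """Mean held-out AUC over k folds.

    `fit(X_train, y_train)` returns a model and `predict(model, x)`
    returns a score; both are supplied by the caller (hypothetical
    interface chosen for this sketch).
    """
    fold_aucs = []
    for held_out in kfold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores = [predict(model, X[i]) for i in held_out]
        fold_aucs.append(auc([y[i] for i in held_out], scores))
    return sum(fold_aucs) / len(fold_aucs)

# Small check of the AUC helper on four cases.
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

Each fold must contain both approved and denied cases for its AUC to be defined, so with rare outcomes a stratified split would be a safer choice than the simple shuffle used here.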

Copyright Owner

ZHANG, YUE

File Format

Word
