Modeling the approval rates of Iowa Home Loan Applications

Zhang, Yue

Date

2019-01-01

Authors

Zhang, Yue

Major Professor

Lily Wang

Department

Statistics

Abstract

In this study, we used loan application data collected in Iowa in 2014 under the Home Mortgage Disclosure Act (HMDA) to analyze and model approval rates. The approval rates of Hispanic applicants and of applicants from minority races were compared with those of non-Hispanic white applicants using hypothesis testing. For loans applied for at conventional institutions, the denial rate of Hispanic applicants was statistically significantly higher than that of non-Hispanic whites, and the denial rates of Asian and Black applicants were statistically significantly higher than that of non-Hispanic whites, whether the loan was for home purchase, home improvement, or refinancing. However, no such difference was observed for some loans from FHA or for any loans from FSA and VA.
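The abstract does not say which hypothesis test was used. As a minimal sketch, assuming a one-sided two-proportion z-test (one common choice for comparing two denial rates), the comparison could look like the following. The function name and the counts are hypothetical and are not taken from the study.

```python
import math

def two_proportion_z_test(denied_a, total_a, denied_b, total_b):
    """One-sided test of H0: p_a <= p_b against H1: p_a > p_b."""
    p_a = denied_a / total_a
    p_b = denied_b / total_b
    # Pooled denial rate under the null hypothesis of equal rates.
    pooled = (denied_a + denied_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Upper-tail p-value from the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts for illustration only (not the study's data):
# group A has 120 denials in 600 applications, group B has 900 in 9000.
z, p_value = two_proportion_z_test(120, 600, 900, 9000)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.3g}")
```

A small p-value here would indicate that the denial rate of group A is higher than that of group B by more than sampling noise alone would explain.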

Several classification methods, including logistic regression, linear discriminant analysis (LDA), generalized additive models (GAM), and random forest, were used to model the approval rates. Model performance was assessed by the AUC under 10-fold cross-validation. It was found that:

(1) The sensitive variables {ethnicity, race, gender} are statistically significant in the logistic regression models. The co-applicant’s ethnicity and the co-applicant’s race also have high importance in the random forest;

(2) Model performance improves as more variables are allowed during model construction. However, the improvement slows once the number of variables reaches p = 6 or 7; increasing p further does not dramatically improve performance. Among the models built from the same set of variables by logistic regression, LDA, and GAM, GAM performs systematically better than logistic regression, which in turn performs systematically better than LDA;

(3) Random forest performs best among all the methods when p ≥ 9, but worse than the others when p < 8. Its performance also depends strongly on the Ntree parameter, the total number of trees grown during model construction: random forest requires Ntree ≥ 256 (in some cases 512) to outperform the other three methods;

(4) Geographical information has a considerable impact on classification, and the impact of ethnicity, race, and gender is weaker than that of geographical information.
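The evaluation scheme named in the abstract, AUC under 10-fold cross-validation, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the study's code: the data are synthetic, the folds are stratified by class (the abstract does not say whether they were), and `fit_mean_difference` is a hypothetical stand-in scorer rather than any of the four methods the study compared.

```python
import random

def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    # Count positive/negative pairs ranked correctly; ties count one half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_mean_difference(X, y):
    """Hypothetical stand-in model: score x along (class-1 mean - class-0 mean)."""
    dims = range(len(X[0]))
    n1 = sum(y)
    n0 = len(y) - n1
    mu1 = [sum(x[j] for x, label in zip(X, y) if label == 1) / n1 for j in dims]
    mu0 = [sum(x[j] for x, label in zip(X, y) if label == 0) / n0 for j in dims]
    w = [a - b for a, b in zip(mu1, mu0)]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))

def cross_validated_auc(X, y, fit, k=10, seed=0):
    """Mean held-out AUC over k stratified folds."""
    rng = random.Random(seed)
    pos_idx = [i for i, label in enumerate(y) if label == 1]
    neg_idx = [i for i, label in enumerate(y) if label == 0]
    rng.shuffle(pos_idx)
    rng.shuffle(neg_idx)
    # Deal each class round-robin so every fold holds both classes.
    folds = [pos_idx[i::k] + neg_idx[i::k] for i in range(k)]
    fold_aucs = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        score = fit([X[i] for i in train], [y[i] for i in train])
        fold_aucs.append(
            auc([y[i] for i in held_out], [score(X[i]) for i in held_out])
        )
    return sum(fold_aucs) / k

# Synthetic data for illustration only: one informative feature, one noise feature.
rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [[rng.gauss(label, 1.0), rng.gauss(0.0, 1.0)] for label in y]

mean_auc = cross_validated_auc(X, y, fit_mean_difference)
print(f"mean 10-fold AUC: {mean_auc:.3f}")
```

Passing a different `fit` function while keeping the same seed evaluates each model on identical folds, which is what makes a comparison across methods meaningful.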

Copyright

2019-01-01

Collections

Creative Components
