Modeling the approval rates of Iowa Home Loan Applications

Date
2019-01-01
Authors
Zhang, Yue
Major Professor
Lily Wang
Department
Statistics
Abstract

In this study, we used the loan application data collected in Iowa in 2014 under the Home Mortgage Disclosure Act (HMDA) to analyze and model the approval rates. The approval rates of Hispanic applicants and of applicants from minority races are compared to those of non-Hispanic white applicants using hypothesis testing. For loan applications to conventional institutions, the denial rate of Hispanic applicants is statistically significantly higher than that of non-Hispanic whites, and the denial rates of Asian and Black applicants are statistically significantly higher than that of non-Hispanic whites, whether the loan is for home purchase, home improvement or refinancing. However, no such difference was observed for some loans from FHA and for all loans from FSA and VA.
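The abstract does not say which hypothesis test was used. As a minimal sketch, assuming a one-sided two-proportion z-test with a pooled standard error, a comparison of denial rates between two applicant groups could look like the following. The function name `two_proportion_z_test` and the counts in the usage lines are hypothetical and are not taken from the study.

```python
import math


def two_proportion_z_test(denied_a, total_a, denied_b, total_b):
    """One-sided z-test of H0: p_a == p_b against H1: p_a > p_b.

    Assumption: this is an illustrative choice of test, not necessarily
    the one used in the study.
    """
    p_a = denied_a / total_a
    p_b = denied_b / total_b
    # Pooled denial rate under the null hypothesis of equal rates.
    pooled = (denied_a + denied_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z_test(denied_a=30, total_a=100,
                                   denied_b=20, total_b=100)
print(f"z = {z:.3f}, one-sided p-value = {p_value:.4f}")
```

A small p-value would indicate that group A's denial rate is higher than group B's by more than chance alone would explain. The normal approximation behind this test is only reasonable when both groups have enough approvals and denials.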

Several classification methods (logistic regression, linear discriminant analysis (LDA), generalized additive models (GAM) and random forest) were used to model the approval rates. Model performance was assessed by the area under the ROC curve (AUC) in 10-fold cross-validation. It was found that:

(1) The sensitive variables {ethnicity, race, gender} are statistically significant in the logistic regression models. The co-applicant’s ethnicity and race also have high importance in the random forest;

(2) Model performance improves as more variables are allowed during model construction. However, the improvement slows once the number of variables reaches p = 6 or 7, and increasing p further does not dramatically improve performance. Among the models built from the same set of variables by logistic regression, LDA and GAM, GAM performs systematically better than logistic regression and LDA, and logistic regression performs systematically better than LDA;

(3) Random forest performs best among all the methods when p ≥ 9, but worse than the others when p < 8. Its performance also depends strongly on the Ntree parameter, the total number of trees grown during model construction: random forest requires Ntree ≥ 256 (in some cases 512) to outperform the other three methods;

(4) Geographical information affects the classification considerably, and the impact of ethnicity, race and gender on classification is weaker than that of geographical information.
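The evaluation protocol named above, AUC in 10-fold cross-validation, can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the study's code: the helper names (`auc`, `kfold_indices`, `cross_validated_auc`), the `fit` and `predict` callables, and the toy data are all hypothetical. It further assumes that every fold contains both approved and denied cases.

```python
import random


def auc(labels, scores):
    """AUC from the rank-sum (Mann-Whitney) identity; ties share average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC is undefined when only one class is present")
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def kfold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cross_validated_auc(X, y, fit, predict, k=10, seed=0):
    """Mean held-out AUC over k folds.

    `fit(X_train, y_train)` returns a model and `predict(model, x)` returns
    a score; both are hypothetical callables supplied by the caller.
    """
    fold_aucs = []
    for test_idx in kfold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores = [predict(model, X[i]) for i in test_idx]
        fold_aucs.append(auc([y[i] for i in test_idx], scores))
    return sum(fold_aucs) / k


# Toy usage: one numeric feature and a "model" that scores by the feature itself.
rng = random.Random(1)
X = [rng.random() for _ in range(200)]
y = [1 if x + rng.gauss(0, 0.2) > 0.5 else 0 for x in X]
mean_auc = cross_validated_auc(X, y, fit=lambda X_tr, y_tr: None,
                               predict=lambda model, x: x)
print(f"mean 10-fold AUC: {mean_auc:.3f}")
```

Averaging the per-fold AUCs is one common convention. Pooling all held-out scores and computing a single AUC is another, and the abstract does not say which was used.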

Copyright
2019