Modeling the approval rates of Iowa Home Loan Applications
Abstract
In this study, we used the loan application data collected in Iowa in 2014 under the Home Mortgage Disclosure Act (HMDA) to analyze and model approval rates. The approval rates of Hispanic applicants and of applicants from minority races were compared with those of non-Hispanic white applicants using hypothesis testing. For loans from conventional institutions, the denial rate of Hispanic applicants was statistically significantly higher than that of non-Hispanic white applicants, and the denial rates of Asian and Black applicants were statistically significantly higher than that of non-Hispanic white applicants for home purchase, home improvement, and refinance loans alike. However, no such difference was observed for some FHA loans or for any FSA or VA loans.
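The kind of comparison described above can be illustrated with a one-sided two-proportion z-test. This is only a minimal sketch: the abstract does not state which test was used, and the counts below are hypothetical placeholders, not figures from the 2014 Iowa data. The function name `two_proportion_z_test` is likewise an assumption introduced for illustration.

```python
import math


def two_proportion_z_test(denied_a, total_a, denied_b, total_b):
    """One-sided z-test of H0: p_a <= p_b against H1: p_a > p_b.

    Returns the z statistic and the upper-tail p-value.
    """
    p_a = denied_a / total_a
    p_b = denied_b / total_b
    # Pooled denial rate under the null hypothesis of equal rates.
    pooled = (denied_a + denied_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_a - p_b) / se
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts (NOT from the study): group A has 60 denials out of
# 300 applications, group B has 900 denials out of 9000 applications.
z, p_value = two_proportion_z_test(60, 300, 900, 9000)
print(f"z = {z:.2f}, one-sided p-value = {p_value:.3g}")
```

A small p-value would indicate that group A's denial rate is significantly higher than group B's. In practice the test would be repeated for each combination of lender type and loan purpose.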
Several classification methods, including logistic regression, linear discriminant analysis (LDA), generalized additive models (GAM) and random forest, were used to model the approval rates. Model performance was assessed by the area under the ROC curve (AUC) in 10-fold cross-validation. It was found that:
(1) The sensitive variables {ethnicity, race, gender} are statistically significant in the logistic regression models. The co-applicant’s ethnicity and the co-applicant’s race also rank high in importance in the random forest;
(2) Model performance improves as more variables are allowed during model construction. However, the improvement slows once the number of variables reaches p = 6 or 7; increasing p further does not substantially improve performance. Among the models built on the same set of variables, GAM consistently outperforms logistic regression, which in turn consistently outperforms LDA;
(3) Random forest performs best among all the methods when p ≥ 9, but worse than the others when p < 8. Its performance also depends strongly on the Ntree parameter, the total number of trees grown during model construction. Random forest requires Ntree ≥ 256 (in some cases 512) to outperform the other three methods;
(4) Geographical information has a considerable impact on the classification, and the impact of ethnicity, race and gender is weaker than that of geographical information.
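The evaluation procedure summarized above (AUC under 10-fold cross-validation, compared across methods and across values of Ntree) can be sketched as follows. This is a hedged illustration, not the study's code. It assumes scikit-learn, which the abstract does not mention, and uses synthetic data from `make_classification` in place of the loan records. GAM is omitted because scikit-learn has no GAM estimator, and `n_estimators` stands in for Ntree.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the loan data (assumption): 600 rows, p = 10 features.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, random_state=0
)

# Candidate models; the two forests differ only in the number of trees.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "random forest (16 trees)": RandomForestClassifier(
        n_estimators=16, random_state=0
    ),
    "random forest (256 trees)": RandomForestClassifier(
        n_estimators=256, random_state=0
    ),
}

# 10-fold stratified cross-validation, scored by area under the ROC curve.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
mean_auc = {
    name: cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    for name, model in models.items()
}

for name, auc in mean_auc.items():
    print(f"{name}: mean AUC = {auc:.3f}")
```

On synthetic data the ranking of the methods carries no meaning for the loan data. The sketch only shows how each model receives the same folds and the same metric, so that the comparison across methods and tree counts is like for like.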