Using Multiple Discriminant Analysis to Construct a Statistical Model for Predicting Bank Loan Repay and Default Customers in the Eastern Region of Ghana

Banks that lend to small businesses and individuals need to assess the creditworthiness of prospective borrowers quickly, so as to reduce the probability of issuing bad loans while maintaining their own profitability. For these reasons, credit institutions have made several attempts at modeling and reliably forecasting credit default using numerous statistical approaches. The objective of the study was to develop a model which could be used to identify likely future defaulters. The population for the study was all financial institutions in the Eastern Region of Ghana. A number of banks that could give the needed data for the study were purposefully chosen, and a sample of 150 customers was randomly selected to provide data on the study variables, which include customers' financial standing, reason for the loan, employment and demographic information. The statistical model obtained indicated four important influences (total asset, total income, family size and number of years with current employer) as the most discriminating variables between the repay and default groups. The validity of the model was confirmed using several diagnostic analytical procedures. The importance of examining a model's sensitivity and specificity in the context of one's specific, real-world objectives was also discussed.


Introduction
The process of administering loans varies from one financial institution to another. However, the loan repayment process appears to be the same everywhere: the borrower must repay the lender. Banks monitor the likelihood of corporate default because of its impact on lenders and its possible devastating effects on the systemic stability of the bank. Financial institutions expect loan losses and therefore include the risk in loan pricing. Unexpected defaults, however, erode capital to a potentially dangerous degree. Recent research at the bank has been aimed at quantifying the risk of default by individual companies. Some of these studies have used univariate and multivariate statistical techniques to show how effective sets of financial ratios can be when constructing company default prediction models (Altman, 1968, 1993; Beaver et al., 1989).
The use of statistical methods for credit scoring and prediction of default on credit card accounts is now well established. In particular, logistic regression has become a standard method for this task (Thomas et al., 2002). Recently there has been interest in using survival analysis for credit scoring. This allows lenders to model not just whether a borrower will default, but when. Survival analysis has been applied in many financial contexts, including explaining financial product purchases (Tang et al., 2007), behavioural scoring on credit customers (Stepanova and Thomas, 2001), predicting default on personal loans (Stepanova and Thomas, 2002) and the development of generic score cards for retail cards (Andreeva, 2006).
Multiple Discriminant Analysis is a much valued tool for market segmentation. Over the years, the estimation of the linear discriminant function has received much theoretical attention (Lopez, 2001; Malhotra and Malhotra, 2002; Crask and Perreault, 1977; Morrison, 1969).
The lender, having fulfilled his part of the contract, expects the borrower to fulfill his obligation without any delay. It must be mentioned that in granting loan facilities to customers, the express assumption has been, and will continue to be, that all parties will fulfill their obligations. However, numerous legal suits are reported in the media against borrowers who have defaulted on the repayment of their loans. It is true that there are some categories of customers who deliberately default in loan repayment (Babajide, 2011; BoG, 2011; Suleiman, 2011). For these reasons, credit institutions have made several attempts at modeling and reliably forecasting credit default using numerous approaches and methods. The problem of the study was to find a model for monitoring loan repayment and predicting potential defaulters.

Objective of the study
The objective of the study was to develop a model which could be used to identify likely future defaulters. The model would develop statistical estimates based upon a cohort of borrowers. Using the bank's past data files, a model can be developed around the historical relationships between borrower characteristics and the incidence of default. The resulting model can then be applied to borrowers in order to predict likely defaulters, who should be the target of preemptive default prevention efforts.

The multiple discriminant analysis: conceptual and mathematical model
Multiple Discriminant Analysis is a technique for classifying a set of observations into predefined classes. It belongs to the family of multivariate methods, that is, statistical methods that simultaneously analyze multiple measurements on each individual or object under investigation (Hair et al., 2006). Multiple Discriminant Analysis uses several variables simultaneously to classify an observation into a priori groups, in this case the repay and default groups of customers. A linear combination of the variables used is formed into an equation, called the discriminant function.
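The discriminant function has the form
$$D = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n.$$
The first term, $a$, represents the constant within the equation. The $b$'s are discriminant coefficients or discriminant weights, and the $x$'s are the input variables or predictors. The weights and the cutoff score are estimated in such a way as to minimize the number of classification errors. The Maximum Likelihood estimator is broadly used in parameter estimation. For the sake of simplicity, it is presented here assuming that the vector of data $y$ collected at time $t_i$ is modelled through a probability density $p(y \mid \theta)$ that depends on a parameter vector $\theta$. In the framework of Maximum Likelihood estimation, $\theta$ is considered as unknown but with a single actual value. Bayesian approaches consider a distribution of possible values for $\theta$. Hence, $\theta$ is assumed to have a known prior probability density $p(\theta)$. The joint probability density of $y$ and $\theta$ satisfies the relation
$$p(y, \theta) = p(y \mid \theta)\, p(\theta) = p(\theta \mid y)\, p(y),$$
where $p(y)$ is the marginal distribution of the observed data, defined by the relation
$$p(y) = \int p(y \mid \theta)\, p(\theta)\, d\theta.$$
The posterior probability density for $\theta$ is therefore
$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)}.$$
The maximum a posteriori (MAP) estimator maximizes this posterior density:
$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\, p(y \mid \theta)\, p(\theta).$$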
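The contrast between the Maximum Likelihood and MAP estimators can be illustrated with a small sketch that estimates a single default probability from a count of defaults. The Beta prior, its parameters and the counts below are assumptions chosen purely for illustration; none of them come from the study.

```python
def mle_default_rate(defaults, n):
    """Maximum Likelihood estimate of a Bernoulli default probability."""
    return defaults / n


def map_default_rate(defaults, n, alpha, beta):
    """MAP estimate under an assumed Beta(alpha, beta) prior.

    The posterior is Beta(alpha + defaults, beta + n - defaults); its mode
    is the MAP estimate, which lies inside (0, 1) only when both posterior
    parameters exceed 1.
    """
    a_post = alpha + defaults
    b_post = beta + n - defaults
    if a_post <= 1 or b_post <= 1:
        raise ValueError("posterior mode is not in the interior of (0, 1)")
    return (a_post - 1) / (a_post + b_post - 2)


# Hypothetical counts: 3 defaults among 20 loans, with a Beta(2, 2) prior.
theta_mle = mle_default_rate(3, 20)
theta_map = map_default_rate(3, 20, alpha=2, beta=2)
print(theta_mle, theta_map)
```

With a prior centred on 0.5, the MAP estimate is pulled slightly away from the raw proportion; as the number of observed loans grows, the two estimates converge.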

Population and sample
The population for the study was financial institutions in the Eastern Region of Ghana. A bank that could give the needed data for the study was purposefully chosen.

Procedure and variables selection
Content analysis as a research method (Elo and Kyngas, 2008; Lauri and Kyngas, 2005) was used to collect and analyse data from the bank's data set. The data set contains 150 cases and 6 variables (or predictors), with information pertaining to past and current customers who borrowed from a Ghanaian bank for various reasons. It contains information related to the customers' financial standing, reasons for obtaining the loan, employment and demographic information, among others. For each customer, the binary outcome "creditability" was also available. This variable recorded whether each customer's credit was deemed "Good" or "Bad". The data set had a distribution of 89% creditworthy (good) customers and 11% non-creditworthy (bad) customers. Customers who had missed 90 days of payment were treated as bad risks, and customers who had missed no payment were treated as good risks. Other typical measures for determining good and bad customers were the amount drawn over the overdraft limit, current account turnover, number of months of missed payments, or a function of these and other variables. The following variables were also measured: (i) basic personal information (age, sex); (ii) family information (marital status, number of dependents); (iii) employment status (years in current occupation); (iv) financial status (most valuable available assets, number of years with current bank); (v) others (purpose of credit, amount of loan).
The variables listed above were used to develop a model to discriminate between the repay and default groups of customers. The assumption was that if the model could discriminate between these two groups, it could be used to classify or predict new cases for which the above-mentioned information is provided but the credit standing of the borrower is unknown. This would be useful, for example, in deciding whether or not a person qualifies for a loan.

Data analysis
The data on a sample of 150 customers were analysed using discriminant analysis in the SPSS version 17 programme. The stepwise procedure was used. With this procedure, the programme chose a variable to enter the discriminant function at each stage. The Wilks' lambda criterion was used for entering the variables in the equation: at each step, the variable entered was the one that met the entry requirement in terms of its associated Wilks' lambda value.
Approximately 70% of the customers who were previously given loans were used to create the model. The remaining customers who were previously given loans were used to validate the model results. The classification function was used to assign cases to groups: the discriminant model assigned each case to the group whose classification function obtained the highest score. Using the discriminant function, loan default was predicted for individual loans in the portfolio, and the prediction accuracy was assessed in terms of the sensitivity (the proportion of default cases correctly identified out of the total number of defaults in the sample) and the specificity (the proportion of non-default cases correctly predicted out of the total number of non-default cases in the sample). The data were further analysed using the enter method to determine the combination of variables that gave the highest prediction accuracy rates, provided that the model and the function constructed and accepted were strong, as indicated by the size of the eigenvalue. The larger the eigenvalue, the better the discriminating power of the function. The Chi-Square and Wilks' lambda values were also assessed to determine discriminating power.
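The sensitivity and specificity defined above can be computed directly from the actual and predicted group labels. The following is a minimal sketch; the function name and the example labels are hypothetical and are not the study's data.

```python
def classification_rates(actual, predicted, positive="default"):
    """Return (sensitivity, specificity, hit_ratio) for two-group labels.

    Sensitivity: share of actual defaults predicted as default.
    Specificity: share of actual non-defaults predicted as non-default.
    Hit ratio:   share of all cases classified correctly.
    """
    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both groups must be present in the actual labels")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    hit_ratio = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, hit_ratio


# Hypothetical holdout labels, for illustration only.
actual = ["default", "default", "repay", "repay", "repay"]
predicted = ["default", "repay", "repay", "repay", "default"]
sensitivity, specificity, hit_ratio = classification_rates(actual, predicted)
print(sensitivity, specificity, hit_ratio)
```

When defaulters are a small minority, a model can reach a high hit ratio while still missing most defaults, which is why the two rates are reported separately.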
SPSS was used to generate a $\chi^2$ approximation to Wilks' lambda in order to obtain a significance level. Wilks' lambda was used to measure the differences between groups and the homogeneity within groups, and to test the null hypothesis that the populations have identical means on D. A low Wilks' lambda and a large Chi-Square with a significant p-value indicated good discriminating power of the discriminant function. Each subject's discriminant score was used to determine the posterior probabilities of being in each of the two groups. The subject was then classified (predicted) to be in the group with the higher posterior probability.
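The link between the eigenvalue, Wilks' lambda and the chi-square statistic can be sketched as follows. The sketch assumes a single discriminant function (two groups) and uses Bartlett's chi-square approximation, which is the usual choice in this setting; the eigenvalue and sample size are hypothetical and are not the study's results.

```python
import math


def wilks_lambda_from_eigenvalue(eigenvalue):
    """Wilks' lambda for a single discriminant function."""
    return 1.0 / (1.0 + eigenvalue)


def bartlett_chi_square(wilks_lambda, n_cases, n_predictors, n_groups):
    """Bartlett's chi-square approximation and its degrees of freedom."""
    multiplier = n_cases - 1 - (n_predictors + n_groups) / 2
    chi_square = -multiplier * math.log(wilks_lambda)
    df = n_predictors * (n_groups - 1)
    return chi_square, df


# Hypothetical inputs: eigenvalue 0.5, 105 cases, 6 predictors, 2 groups.
lam = wilks_lambda_from_eigenvalue(0.5)
chi_square, df = bartlett_chi_square(lam, n_cases=105, n_predictors=6, n_groups=2)
print(lam, chi_square, df)
```

A larger eigenvalue gives a smaller lambda and therefore a larger chi-square, which is why the three quantities point in the same direction when judging discriminating power.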
Over 53 (65%) of the respondents had been with the bank for more than 3 years and 78% had been working for over five years.

Tests of Equality of Group Means
The tests of equality of group means measure each independent variable's potential before a model is created. Each test displays the results of a one-way ANOVA for the independent variable using the grouping variable as the factor. Table 3 shows that total asset, total income, family size and number of years with current employer are the most discriminating variables between the repay and default groups. All four variables are significant at the 0.05 level of significance (p < 0.05).
Wilks' lambda is another measure of a variable's potential. Smaller values indicate that the variable is better at discriminating between groups. Table 3 suggests that income is best at discriminating between groups, followed by years with current employer and asset. The associated chi-square statistic tests the hypothesis that the means of the functions listed are equal across groups. The significance of the chi-square value ($\chi^2$ = 47.557, p < 0.001) indicates that the discriminant function does better than chance at separating the groups.
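For a single predictor, Wilks' lambda and the one-way ANOVA F statistic are both built from the same sums of squares, which the sketch below makes explicit. The function name and the group values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
def univariate_wilks_and_f(groups):
    """Return (wilks_lambda, f_statistic) for one predictor.

    `groups` maps a group label to the list of that predictor's values.
    Wilks' lambda is the within-group sum of squares divided by the
    total sum of squares; F is the usual one-way ANOVA ratio.
    """
    all_values = [v for values in groups.values() for v in values]
    n = len(all_values)
    k = len(groups)
    grand_mean = sum(all_values) / n
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_within += sum((v - group_mean) ** 2 for v in values)
    ss_between = ss_total - ss_within
    wilks_lambda = ss_within / ss_total
    f_statistic = (ss_between / (k - 1)) / (ss_within / (n - k))
    return wilks_lambda, f_statistic


# Hypothetical predictor values for the two groups.
wilks_lambda, f_statistic = univariate_wilks_and_f(
    {"repay": [1.0, 2.0, 3.0], "default": [4.0, 5.0, 6.0]}
)
print(wilks_lambda, f_statistic)
```

Because lambda is the share of variation left within the groups, a value near 0 means the group means are far apart relative to the spread, and a value near 1 means the predictor barely separates the groups.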

Standardized discriminant function coefficients
The standardized coefficients allowed the researcher to compare the variables measured on different scales. Coefficients with large absolute values correspond to variables with greater discriminating ability. A low standardized coefficient might mean that the groups do not differ much on that variate or it might just mean that the variate's correlation with the grouping variable is redundant with that of another variate in the model. Table 4 shows the estimated standardized discriminant function coefficients.
Parameter values show that a percentage increase in asset, family size and number of years with current employer, ceteris paribus, will decrease the odds of default by almost 4%, 38.3% and 13.4% respectively. On the other hand, a percentage increase in income, debt and number of years with current bank will increase the odds of default by almost 9%, 1.7% and 7.1% respectively. In the discriminant function, A is Assets, I is Income, D is Debt, F is Family size, E is Number of years with current employer, and B is Number of years with current bank. The cutting score is zero. Discriminant scores greater than zero (positive scores) indicate predicted membership in the default group, while negative scores imply predicted membership in the repay group. Correlations between the variates and the discriminant function are available in the loading or structure matrix. Generally, any variate with a loading of 0.30 or more is considered to be important in defining the discriminant dimension (Abdi and Williams, 2010). These correlations may help in understanding the discriminant function that has been created.
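The cutting-score rule described above can be sketched in a few lines. The constant, the weights and the customer values below are hypothetical placeholders, not the coefficients estimated in the study, and assigning a score of exactly zero to the repay group is an assumption made here for definiteness.

```python
def discriminant_score(constant, weights, values):
    """Linear discriminant score: constant plus weighted sum of predictors."""
    return constant + sum(weights[name] * values[name] for name in weights)


def classify(score, cutting_score=0.0):
    """Scores above the cutting score predict default, otherwise repay."""
    return "default" if score > cutting_score else "repay"


# Hypothetical coefficients and customer data, for illustration only.
constant = 0.5
weights = {"income": 0.2, "family_size": -0.3}
customer = {"income": 1.0, "family_size": 4.0}

score = discriminant_score(constant, weights, customer)
group = classify(score)
print(score, group)
```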
The structure matrix shows the correlation of each predictor variable with the discriminant function. The ordering in the structure matrix is the same as that suggested by the tests of equality of group means and is different from that in the standardized coefficients table. Table 5 displays the prior probabilities for membership in groups. A prior probability is an estimate of the likelihood that a case belongs to a particular group when no other information about it is available. The prior probabilities were based on the sizes of the groups. A priori, 88.5% of the cases were nondefaulters, so the classification function was weighted more heavily in favor of classifying cases as nondefaulters.
Classification results for cases not selected (original): default: 0 (0%), 0 (0%), total 0; repay: 1 (50%), 1 (50%), total 2.
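The effect of unequal prior probabilities on classification can be sketched as follows. The sketch assumes normally distributed discriminant scores with unit pooled within-group variance, so that the squared distance of a score to a group centroid plays the role of the Mahalanobis distance. The priors reflect the 88.5% share of nondefaulters stated above; the centroids are hypothetical.

```python
import math


def posterior_probabilities(score, centroids, priors):
    """Posterior group probabilities for one discriminant score.

    Each group's weight is its prior times a normal kernel of the
    distance between the score and that group's centroid.
    """
    weighted = {
        group: priors[group] * math.exp(-0.5 * (score - centroids[group]) ** 2)
        for group in centroids
    }
    total = sum(weighted.values())
    return {group: w / total for group, w in weighted.items()}


priors = {"repay": 0.885, "default": 0.115}   # group-size priors from the text
centroids = {"repay": -1.0, "default": 1.0}   # hypothetical group centroids

# A score exactly halfway between the centroids: only the priors differ.
posterior = posterior_probabilities(0.0, centroids, priors)
predicted_group = max(posterior, key=posterior.get)
print(posterior, predicted_group)
```

At the midpoint between the centroids the data carry no information either way, so the posterior equals the prior and the case is assigned to the larger (repay) group. This is the sense in which group-size priors weight the classification toward nondefaulters.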

Conclusion
The study demonstrated the use of discriminant analysis to identify demographic and behavioral characteristics associated with the likelihood of defaulting on a bank loan. The study identified four important influences (total asset, total income, family size and number of years with current employer) as the most discriminating variables between the repay and default groups. The validity of the model was confirmed using several diagnostic analytic procedures. The overall success rate or hit ratio of the discriminant function was 82.5%. The findings showed that, using six variables and multiple discriminant analysis, a strong statistical model could be constructed that would be able to predict repay and default customers with a high rate of correct classification.