Credit Scoring Using Logistic Regression and Decision Trees
Create and compare two credit scoring models, one based on logistic regression and the other based on decision trees.
Credit rating agencies and banks use challenger models to test the credibility and goodness of a credit scoring model. In this example, the base model is a logistic regression model and the challenger model is a decision tree model.
Logistic regression links the score and probability of default (PD) through the logistic regression function, and is the default fitting and scoring model when you work with creditscorecard objects. However, decision trees have gained popularity in credit scoring and are now commonly used to fit data and predict default. The algorithms in decision trees follow a top-down approach where, at each step, the variable that splits the dataset "best" is chosen. "Best" can be defined by any one of several metrics, including the Gini index, information value, or entropy. For more information, see the Statistics and Machine Learning Toolbox™ documentation on decision trees.
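For illustration, here is a minimal sketch of both ideas; the score value and class proportions below are made up for the sketch, not outputs of this example.

% Sketch: the logistic link between an unscaled (log-odds) score s and PD.
% The score value is illustrative, not produced by creditscorecard.
s = 1.2;                   % log of the good-to-bad odds
pd = 1/(1 + exp(s));       % higher score (better odds) means lower PD
fprintf('Score %.2f maps to PD %.4f\n',s,pd)

% Gini diversity index for a node with class proportions p:
p = [0.8 0.2];             % e.g., 80% nondefaulters, 20% defaulters in a node
giniIndex = 1 - sum(p.^2); % 0 for a pure node; larger values mean a more mixed node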
In this example, you:
Use both a logistic regression model and a decision tree model to extract PDs.
Validate the challenger model by comparing the values of key metrics between the challenger model and the base model.
Compute Probabilities of Default Using Logistic Regression
First, create the base model by using a creditscorecard object and the default logistic regression function fitmodel. Fit the creditscorecard object by using the full model, which includes all predictors for the generalized linear regression model fitting algorithm. Then, compute the PDs using probdefault. For a detailed description of this workflow, see creditscorecard.
% Create a creditscorecard object, bin data, and fit a logistic regression model
load CreditCardData.mat
scl = creditscorecard(data,'IDVar','CustID');
scl = autobinning(scl);
scl = fitmodel(scl,'VariableSelection','fullmodel');
Generalized linear regression model:
    logit(status) ~ 1 + CustAge + TmAtAddress + ResStatus + EmpStatus + CustIncome + TmWBank + OtherCC + AMBalance + UtilRate
    Distribution = Binomial

Estimated Coefficients:
                    Estimate        SE         tStat       pValue
                   __________    ________    _________    __________

    (Intercept)       0.70246    0.064039       10.969    5.3719e-28
    CustAge            0.6057     0.24934       2.4292      0.015131
    TmAtAddress        1.0381     0.94042       1.1039       0.26963
    ResStatus          1.3794      0.6526       2.1137      0.034538
    EmpStatus         0.89648     0.29339       3.0556     0.0022458
    CustIncome        0.70179     0.21866       3.2095     0.0013295
    TmWBank            1.1132     0.23346       4.7683    1.8579e-06
    OtherCC            1.0598     0.53005       1.9994      0.045568
    AMBalance          1.0572     0.36601       2.8884     0.0038718
    UtilRate        -0.047597     0.61133    -0.077858       0.93794

1200 observations, 1190 error degrees of freedom
Dispersion: 1
Chi^2-statistic vs. constant model: 91, p-value = 1.05e-15
% Compute the corresponding probabilities of default
pdL = probdefault(scl);
Compute Probabilities of Default Using Decision Trees
Next, create the challenger model. Use the Statistics and Machine Learning Toolbox™ method fitctree to fit a decision tree (DT) to the data. By default, the splitting criterion is Gini's diversity index. In this example, the model specification is an input argument to the function, and the model for the response 'status' comprises all predictors when the algorithm starts. For this example, use the name-value pair arguments in fitctree to limit the maximum number of splits (to avoid overfitting) and to specify which predictors are categorical.
% Create and view classification tree
CategoricalPreds = {'ResStatus','EmpStatus','OtherCC'};
dt = fitctree(data,'status~CustAge+TmAtAddress+ResStatus+EmpStatus+CustIncome+TmWBank+OtherCC+UtilRate',...
    'MaxNumSplits',30,'CategoricalPredictors',CategoricalPreds);
disp(dt)
  ClassificationTree
             PredictorNames: {'CustAge'  'TmAtAddress'  'ResStatus'  'EmpStatus'  'CustIncome'  'TmWBank'  'OtherCC'  'UtilRate'}
               ResponseName: 'status'
      CategoricalPredictors: [3 4 7]
                 ClassNames: [0 1]
             ScoreTransform: 'none'
            NumObservations: 1200
The textual display of the decision tree is shown below. You can also use the view function with the name-value pair argument 'Mode' set to 'graph' to visualize the tree as a graph.
view(dt)
Decision tree for classification
 1  if CustIncome<30500 then node 2 elseif CustIncome>=30500 then node 3 else 0
 2  if TmWBank<60 then node 4 elseif TmWBank>=60 then node 5 else 1
 3  if TmWBank<32.5 then node 6 elseif TmWBank>=32.5 then node 7 else 0
 4  if TmAtAddress<13.5 then node 8 elseif TmAtAddress>=13.5 then node 9 else 1
 5  if UtilRate<0.255 then node 10 elseif UtilRate>=0.255 then node 11 else 0
 6  if CustAge<60.5 then node 12 elseif CustAge>=60.5 then node 13 else 0
 7  if CustAge<46.5 then node 14 elseif CustAge>=46.5 then node 15 else 0
 8  if CustIncome<24500 then node 16 elseif CustIncome>=24500 then node 17 else 1
 9  if TmWBank<56.5 then node 18 elseif TmWBank>=56.5 then node 19 else 1
10  if CustAge<21.5 then node 20 elseif CustAge>=21.5 then node 21 else 0
11  class = 1
12  if EmpStatus=Employed then node 22 elseif EmpStatus=Unknown then node 23 else 0
13  if TmAtAddress<131 then node 24 elseif TmAtAddress>=131 then node 25 else 0
14  if TmAtAddress<97.5 then node 26 elseif TmAtAddress>=97.5 then node 27 else 0
15  class = 0
16  class = 0
17  if ResStatus in {Home Owner Tenant} then node 28 elseif ResStatus=Other then node 29 else 1
18  if TmWBank<52.5 then node 30 elseif TmWBank>=52.5 then node 31 else 0
19  class = 1
20  class = 1
21  class = 0
22  if UtilRate<0.375 then node 32 elseif UtilRate>=0.375 then node 33 else 0
23  if UtilRate<0.005 then node 34 elseif UtilRate>=0.005 then node 35 else 0
24  if CustIncome<39500 then node 36 elseif CustIncome>=39500 then node 37 else 0
25  class = 1
26  if UtilRate<0.595 then node 38 elseif UtilRate>=0.595 then node 39 else 0
27  class = 1
28  class = 1
29  class = 0
30  class = 1
31  class = 0
32  class = 0
33  if UtilRate<0.635 then node 40 elseif UtilRate>=0.635 then node 41 else 0
34  if CustAge<49 then node 42 elseif CustAge>=49 then node 43 else 1
35  if CustIncome<57000 then node 44 elseif CustIncome>=57000 then node 45 else 0
36  class = 1
37  class = 0
38  class = 0
39  if CustIncome<34500 then node 46 elseif CustIncome>=34500 then node 47 else 1
40  class = 1
41  class = 0
42  class = 1
43  class = 0
44  class = 0
45  class = 1
46  class = 0
47  class = 1
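For example, to open the same tree in a figure window instead of the text display:

view(dt,'Mode','graph')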
When you use fitctree, you can adjust the name-value pair arguments depending on your use case. For example, you can set a small minimum leaf size, which yields a better accuracy ratio (see Model Validation) but can result in an overfitted model.
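A minimal sketch of such an adjustment, reusing the formula and CategoricalPreds defined earlier; the leaf size of 1 and the name dtDeep are illustrative choices for this sketch.

% Sketch: with no split limit and a minimum leaf size of 1, the tree grows
% much larger than the MaxNumSplits-limited tree and is prone to overfitting.
dtDeep = fitctree(data,'status~CustAge+TmAtAddress+ResStatus+EmpStatus+CustIncome+TmWBank+OtherCC+UtilRate',...
    'MinLeafSize',1,'CategoricalPredictors',CategoricalPreds);
fprintf('Nodes in limited tree: %d; nodes in deep tree: %d\n',dt.NumNodes,dtDeep.NumNodes)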
The decision tree model has a predict function that, when used with a second and third output argument, gives valuable information.
% Extract probabilities of default
[~,ObservationClassProb,Node] = predict(dt,data);
pdDT = ObservationClassProb(:,2);
This syntax has the following outputs:

ObservationClassProb returns a NumObs-by-2 array with the class probability at each observation. The order of the classes is the same as in dt.ClassNames. In this example, the class names are [0 1] and the good label, by choice, based on which class has the highest count in the raw data, is 0. Therefore, the first column corresponds to nondefaults and the second column to the actual PDs. The PDs are needed later in the workflow for scoring or validation.

Node returns a NumObs-by-1 vector containing the node numbers corresponding to the given observations.
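A quick sanity check of the class order and the extracted PD column, using the outputs above:

% Confirm which column of ObservationClassProb holds P(default).
disp(dt.ClassNames')                  % [0 1]: column 2 corresponds to class 1 (default)
disp(table(Node(1:5),pdDT(1:5),'VariableNames',{'Node','PD'}))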
Predictor Importance
In predictor (or variable) selection, the goal is to select as few predictors as possible while retaining as much information (predictive accuracy) about the data as possible. In the creditscorecard class, the fitmodel function internally selects predictors and returns p-values for each predictor. The analyst can then, outside the creditscorecard workflow, set a threshold for these p-values and choose the predictors worth keeping and the predictors to discard. This step is useful when the number of predictors is large.
Typically, training datasets are used to perform predictor selection. The key objective is to find the best set of predictors for ranking customers based on their likelihood of default and estimating their PDs.
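For instance, here is a minimal sketch of a p-value screen; the 0.05 cutoff is an arbitrary illustrative choice, and the sketch relies on fitmodel returning the underlying generalized linear model as its second output.

% Sketch: screen predictors by the p-values of the fitted logistic model.
% The 0.05 cutoff is an arbitrary illustrative choice.
[~,mdl] = fitmodel(scl,'VariableSelection','fullmodel','Display','off');
keep = mdl.Coefficients.pValue < 0.05;
disp(mdl.Coefficients.Properties.RowNames(keep))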
Using Logistic Regression for Predictor Importance
Predictor importance is related to the notion of predictor weights, since the weight of a predictor determines how important it is in the assignment of the final score, and therefore, in the PD. Computing predictor weights is a back-of-the-envelope technique whereby the weights are determined by dividing the range of points for each predictor by the total range of points for the entire creditscorecard object. For more information on this workflow, see displaypoints.
For this example, use formatpoints with the 'PointsOddsAndPDO' option for scaling. This is not a necessary step, but it helps ensure that all points fall within a desired range (that is, nonnegative points). The 'PointsOddsAndPDO' scaling means that for a given value of TargetPoints and TargetOdds (usually 2), the scaling parameters are solved such that PDO points are needed to double the odds.
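As a worked sketch of this scaling arithmetic, assuming odds are expressed as the good-to-bad ratio:

% With TargetPoints = 500, TargetOdds = 2, and PDO = 50,
% the odds double for every 50 points above the target score.
score = 550;
odds  = 2*2^((score-500)/50);   % odds = 4 at a score of 550
pd    = 1/(1+odds);             % implied PD, assuming odds = good/bad
fprintf('Score %d -> odds %g, PD %.3f\n',score,odds,pd)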
% Choose target points, target odds, and PDO values
TargetPoints = 500;
TargetOdds = 2;
PDO = 50;

% Format points and compute points range
scl = formatpoints(scl,'PointsOddsAndPDO',[TargetPoints TargetOdds PDO]);
[PointsTable,MinPts,MaxPts] = displaypoints(scl);
PtsRange = MaxPts - MinPts;
disp(PointsTable(1:10,:))
      Predictors           Bin          Points
    _______________    _____________    ______

    {'CustAge'    }    {'[-Inf,33)' }   37.008
    {'CustAge'    }    {'[33,37)'   }   38.342
    {'CustAge'    }    {'[37,40)'   }   44.091
    {'CustAge'    }    {'[40,46)'   }   51.757
    {'CustAge'    }    {'[46,48)'   }   63.826
    {'CustAge'    }    {'[48,58)'   }    64.97
    {'CustAge'    }    {'[58,Inf]'  }   82.826
    {'CustAge'    }    {'<missing>' }      NaN
    {'TmAtAddress'}    {'[-Inf,23)' }   49.058
    {'TmAtAddress'}    {'[23,83)'   }   57.325
fprintf('Minimum points: %g, Maximum points: %g\n',MinPts,MaxPts)
Minimum points: 348.705, Maximum points: 683.668
The weights are defined as the range of points, for any given predictor, divided by the range of points for the entire scorecard.
Predictor = unique(PointsTable.Predictors,'stable');
NumPred = length(Predictor);
Weight = zeros(NumPred,1);
for ii = 1 : NumPred
    Ind = strcmpi(Predictor{ii},PointsTable.Predictors);
    MaxPtsPred = max(PointsTable.Points(Ind));
    MinPtsPred = min(PointsTable.Points(Ind));
    Weight(ii) = 100*(MaxPtsPred-MinPtsPred)/PtsRange;
end
PredictorWeights = table(Predictor,Weight);
PredictorWeights(end+1,:) = PredictorWeights(end,:);
PredictorWeights.Predictor{end} = 'Total';
PredictorWeights.Weight(end) = sum(Weight);
disp(PredictorWeights)
       Predictor         Weight
    _______________    _______

    {'CustAge'    }     13.679
    {'TmAtAddress'}     5.1564
    {'ResStatus'  }     8.7945
    {'EmpStatus'  }      8.519
    {'CustIncome' }     19.259
    {'TmWBank'    }     24.557
    {'OtherCC'    }     7.3414
    {'AMBalance'  }     12.365
    {'UtilRate'   }    0.32919
    {'Total'      }        100
% Plot a histogram of the weights
figure
bar(PredictorWeights.Weight(1:end-1))
title('Predictor Importance Estimates Using Logit');
ylabel('Estimates (%)');
xlabel('Predictors');
xticklabels(PredictorWeights.Predictor(1:end-1));
Using Decision Trees for Predictor Importance
When you use decision trees, you can investigate predictor importance using the predictorImportance function. On every predictor, the function sums and normalizes changes in the risks due to splits by using the number of branch nodes. A high value in the output array indicates a strong predictor.
imp = predictorImportance(dt);
figure;
bar(100*imp/sum(imp)); % normalize to a 0-100% scale
title('Predictor Importance Estimates Using Decision Trees');
ylabel('Estimates (%)');
xlabel('Predictors');
xticklabels(dt.PredictorNames);
In this case, 'CustIncome' (the parent node) is the most important predictor, followed by 'UtilRate', where the second split happens, and so on. The predictor importance step can help in predictor screening for datasets with a large number of predictors.
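To rank the tree's predictors explicitly, here is a short sketch using the imp values computed above; the names impPct and order are local to the sketch.

% Rank decision tree predictors by normalized importance, in descending order.
[impPct,order] = sort(100*imp/sum(imp),'descend');
disp(table(dt.PredictorNames(order)',impPct','VariableNames',{'Predictor','ImportancePct'}))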
Notice that not only are the weights across models different, but the selected predictors in each model also diverge. The predictors 'AMBalance' and 'OtherCC' are missing from the decision tree model, and 'UtilRate' is missing from the logistic regression model.
Normalize the predictor importance for decision trees using a percent from 0 through 100%, and then compare the two models in a combined histogram.
Ind = ismember(Predictor,dt.PredictorNames);
w = zeros(size(Weight));
w(Ind) = 100*imp'/sum(imp);
figure
bar([Weight,w]);
title('Predictor Importance Estimates');
ylabel('Estimates (%)');
xlabel('Predictors');
h = gca;
xticklabels(Predictor)
legend({'Logit','DT'})
Note that these results depend on the binning algorithm you choose for the creditscorecard object and on the parameters used in fitctree to build the decision tree.
Model Validation
The validatemodel function attempts to compute scores based on internally computed points. When you use decision trees, you cannot directly run a validation because the model coefficients are unknown and cannot be mapped from the PDs.
To validate the creditscorecard object using logistic regression, use the validatemodel function.
% Model validation for the creditscorecard
[StatsL,tL] = validatemodel(scl);
To validate decision trees, you can directly compute the statistics needed for validation.
% Compute the area under the ROC
[x,y,t,AUC] = perfcurve(data.status,pdDT,1);
KSValue = max(y - x);
AR = 2 * AUC - 1;

% Create Stats table output
Measure = {'Accuracy Ratio','Area under ROC curve','KS statistic'}';
Value = [AR;AUC;KSValue];
StatsDT = table(Measure,Value);
ROC Curve
The area under the receiver operating characteristic (AUROC) curve is a performance metric for classification problems. AUROC measures the degree of separability, that is, how well the model can distinguish between classes. In this example, the classes to distinguish are defaulters and nondefaulters. A high AUROC indicates good predictive capability.
The ROC curve is created by plotting the true positive rate (also known as the sensitivity or recall) against the false positive rate (also known as the fallout, which equals 1 minus the specificity). When AUROC = 0.7, the model has a 70% chance of correctly distinguishing between the classes. When AUROC = 0.5, the model has no discrimination power.
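As a cross-check, you can also recompute the logit AUROC directly from the logistic regression PDs with perfcurve; xL, yL, and AUCL are local names for this sketch.

% Recompute the logit ROC/AUROC from pdL to cross-check the validatemodel output.
[xL,yL,~,AUCL] = perfcurve(data.status,pdL,1);
fprintf('Logit AUROC via perfcurve: %g\n',AUCL)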
This plot compares the ROC curves for both models using the same dataset.
figure
plot([0;tL.FalseAlarm],[0;tL.Sensitivity],'s')
hold on
plot(x,y,'-v')
xlabel('Fraction of nondefaulters')
ylabel('Fraction of defaulters')
legend({'Logit','DT'},'Location','best')
title('Receiver Operating Characteristic (ROC) Curve')
tValidation = table(Measure,StatsL.Value(1:end-1),StatsDT.Value,'VariableNames',...
    {'Measure','logit','DT'});
disp(tValidation)
            Measure              logit       DT
    ________________________    _______    _______

    {'Accuracy Ratio'      }    0.32515    0.38903
    {'Area under ROC curve'}    0.66258    0.69451
    {'KS statistic'        }    0.23204    0.29666
As the AUROC values show, given the dataset and selected binning algorithm for the creditscorecard object, the decision tree model has better predictive power than the logistic regression model.
Summary
This example compares the logistic regression and decision tree scoring models using the CreditCardData.mat dataset. A workflow is presented to compute and compare PDs using decision trees. The decision tree model is validated and contrasted with the logistic regression model.
When reviewing the results, remember that these results depend on the choice of dataset and the default binning algorithm (Monotone Adjacent Pooling Algorithm) in the logistic regression workflow.
Whether a logistic regression or decision tree model is a better scoring model depends on the dataset and the choice of binning algorithm. Although the decision tree model in this example is the better scoring model, the logistic regression model produces higher accuracy ratio (0.42), AUROC (0.71), and KS statistic (0.30) values if the binning algorithm for the creditscorecard object is set to 'Split' with Gini as the split criterion. The validatemodel function requires scaled scores to compute the validation metrics and values. If you use a decision tree model, scaled scores are unavailable, and you must perform the computations outside the creditscorecard object.
To demonstrate the workflow, this example uses the same dataset for training the models and for testing. However, to validate a model, using a separate testing dataset is ideal.
Scaling options for decision trees are unavailable. To use scaling, choose a model other than decision trees.
See Also
creditscorecard | autobinning | fitmodel | probdefault | formatpoints | displaypoints | validatemodel | fitctree | predict | view | predictorImportance | perfcurve