Compare Tobit LGD Model to Benchmark Model
This example shows how to compare a Tobit model for loss given default (LGD) against a benchmark model.
Load Data
Load the LGD data.
load LGDData.mat
disp(head(data))
      LTV        Age         Type           LGD   
    _______    _______    ___________    _________
    0.89101    0.39716    residential     0.032659
    0.70176     2.0939    residential      0.43564
    0.72078     2.7948    residential    0.0064766
    0.37013      1.237    residential     0.007947
    0.36492     2.5818    residential            0
      0.796     1.5957    residential      0.14572
    0.60203     1.1599    residential     0.025688
    0.92005    0.50253    investment      0.063182
Split the data into training and test sets.
numobs = height(data);
rng('default'); % For reproducibility
c = cvpartition(numobs,'HoldOut',0.4);
trainingind = training(c);
testind = test(c);
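As a quick sanity check on the split, the sketch below confirms that the training and test masks returned by cvpartition are complementary. It assumes Statistics and Machine Learning Toolbox is available. The observation count n is a hypothetical stand-in for height(data), chosen to equal the sum of the 2093 training observations and 1394 test observations reported in the outputs of this example.

```matlab
% Hypothetical observation count standing in for height(data).
n = 3487;
rng('default'); % For reproducibility
c = cvpartition(n,'HoldOut',0.4);
tr = training(c); % logical mask, true for training rows
te = test(c);     % logical mask, true for test rows
% Every row belongs to exactly one of the two sets.
assert(all(xor(tr,te)))
fprintf('training: %d, test: %d\n',sum(tr),sum(te))
```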
Fit Tobit Model
Fit an LGD model with the training data. By default, the last column of the data is used as the response variable and all other columns are used as predictor variables.
lgdmodel = fitLGDModel(data(trainingind,:),'tobit');
disp(lgdmodel)
  Tobit with properties:

      CensoringSide: "both"
          LeftLimit: 0
         RightLimit: 1
            ModelID: "Tobit"
        Description: ""
    UnderlyingModel: [1x1 risk.internal.credit.TobitModel]
      PredictorVars: ["LTV"    "Age"    "Type"]
        ResponseVar: "LGD"
disp(lgdmodel.UnderlyingModel)
Tobit regression model:
     LGD = max(0,min(Y*,1))
     Y* ~ 1 + LTV + Age + Type

Estimated coefficients:
                       Estimate        SE         tStat       pValue  
                       _________    _________    _______    __________
    (Intercept)         0.058257     0.027279     2.1356      0.032828
    LTV                  0.20126     0.031383     6.4129    1.7592e-10
    Age                -0.095407    0.0072435    -13.171             0
    Type_investment      0.10208     0.018054     5.6544    1.7785e-08
    (Sigma)              0.29288    0.0057071     51.318             0

Number of observations: 2093
Number of left-censored observations: 547
Number of uncensored observations: 1521
Number of right-censored observations: 25
Log-likelihood: -698.383
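The censoring formula in the display, LGD = max(0,min(Y*,1)), means that a prediction is generally not the same as the linear predictor. The following is a minimal sketch of that effect, assuming the quantity of interest is the unconditional expected value of the censored response (this is an assumption about the prediction, not something stated in this example). It uses the estimated coefficients above for a hypothetical residential loan with LTV 0.8 and age 1. The helper names are illustrative, and the normal CDF is written with erfc so that only base MATLAB is needed.

```matlab
% Standard normal PDF and CDF using base MATLAB only.
normpdfStd = @(z) exp(-z.^2/2)/sqrt(2*pi);
normcdfStd = @(z) 0.5*erfc(-z/sqrt(2));

% Expected value of Y = max(0,min(Y*,1)) with Y* ~ N(mu,sigma^2):
% uncensored part plus the probability mass censored at the right limit 1
% (the mass censored at the left limit 0 contributes nothing).
censoredMean = @(mu,sigma) ...
    mu.*(normcdfStd((1-mu)./sigma) - normcdfStd(-mu./sigma)) ...
    + sigma.*(normpdfStd(-mu./sigma) - normpdfStd((1-mu)./sigma)) ...
    + (1 - normcdfStd((1-mu)./sigma));

% Hypothetical loan: LTV = 0.8, Age = 1, residential (Type_investment = 0).
mu = 0.058257 + 0.20126*0.8 - 0.095407*1;
sigma = 0.29288;
expectedLGD = censoredMean(mu,sigma);
fprintf('linear predictor: %.4f, expected censored LGD: %.4f\n',mu,expectedLGD)
```

Because a sizable share of the latent distribution falls below 0 for this loan and is censored to 0, the expected censored value is larger than the linear predictor.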
You can now use this model for prediction or validation. For example, use predict to predict LGD on the test data and visualize the predictions with a histogram.
lgdpredtobit = predict(lgdmodel,data(testind,:));
histogram(lgdpredtobit)
title('Predicted LGD, Tobit Model')
xlabel('Predicted LGD')
ylabel('Frequency')
Create Benchmark Model
In this example, the benchmark model is a lookup table model that segments the data into groups and assigns the mean LGD of the group to all group members. In practice, this common benchmarking approach is easy to understand and use.
The groups in this example are defined using the three predictors. LTV is discretized into low and high levels. Age is discretized into young and old loans. Type already has two levels, namely, residential and investment. The groups are all the combinations of these values (for example, low LTV, young loan, residential, and so on). The number of levels and the specific cutoff points are only for illustration purposes. The benchmark model uses the same predictors as the Tobit model in this example, but you can use other variables to define the groups. In fact, the benchmark model could be a black-box model as long as the predicted LGD values are available for the same customers as in this data set.
% Add the discretized variables as new columns in the table.
% Discretize the LTV.
ltvedges = [0 0.5 max(data.LTV)];
data.ltvdiscretized = discretize(data.LTV,ltvedges,'categorical',{'low','high'});
% Discretize the age.
ageedges = [0 2 max(data.Age)];
data.agediscretized = discretize(data.Age,ageedges,'categorical',{'young','old'});
% Type is already a categorical variable with two levels.
Computing the group means on the training data is effectively how this model is fit. Note that the group counts are small for some groups. Adding more groups reduces the counts further for some groups and makes their estimates less stable.
% Find the group means on training data.
gs = groupsummary(data(trainingind,:),{'ltvdiscretized','agediscretized','Type'},'mean','LGD');
disp(gs)
    ltvdiscretized    agediscretized       Type        GroupCount    mean_LGD
    ______________    ______________    ___________    __________    ________
         low              young         residential        163        0.12166
         low              young         investment          26       0.087331
         low              old           residential        175       0.021776
         low              old           investment          23        0.16379
         high             young         residential       1134        0.16489
         high             young         investment         257        0.25977
         high             old           residential        265       0.066068
         high             old           investment          50        0.11779
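To see why small groups give less stable means, recall that the standard error of a sample mean shrinks in proportion to 1/sqrt(n). The sketch below applies that to the group counts in the summary above. The within-group standard deviation s is a hypothetical value chosen only for illustration, not an estimate from this data.

```matlab
groupcount = [163 26 175 23 1134 257 265 50]; % counts from the group summary
s = 0.25;                                     % hypothetical within-group standard deviation
se = s./sqrt(groupcount);                     % standard error of each group mean
[~,ismall] = min(groupcount);
[~,ilarge] = max(groupcount);
fprintf('smallest group (n=%d): SE = %.4f\n',groupcount(ismall),se(ismall))
fprintf('largest group (n=%d): SE = %.4f\n',groupcount(ilarge),se(ilarge))
```

Whatever the true standard deviation is, if it is similar across groups, the mean of the smallest group (23 loans) is roughly seven times noisier than the mean of the largest group (1134 loans), because sqrt(1134/23) is about 7.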
To predict an LGD for a new observation, you need to find its group and then assign the group mean as the predicted LGD. Use the findgroups function, which takes the discretized variables as input. For a completely new data point, the LTV and Age information needs to be discretized first by using the discretize function before you use the findgroups function.
lgdgroup = findgroups(data(testind,{'ltvdiscretized' 'agediscretized' 'Type'}));
lgdpredmeanstest = gs.mean_LGD(lgdgroup);
There are eight unique values in the predictions, as expected, one for each group.
disp(unique(lgdpredmeanstest))
    0.0218
    0.0661
    0.0873
    0.1178
    0.1217
    0.1638
    0.1649
    0.2598
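For a completely new data point, the discretize-then-look-up steps described earlier can be sketched as follows. This is a self-contained illustration under stated assumptions: the lookup table is rebuilt by hand from the group means in the summary, the new loan values and all variable names are hypothetical, and the lookup uses a direct match on the group labels rather than findgroups. The upper bin edge is set to Inf (a deviation from the edges used in this example) so that values beyond the training maximum still receive a level.

```matlab
% Lookup table of group means, copied from the group summary.
ltvlevel = ["low";"low";"low";"low";"high";"high";"high";"high"];
agelevel = ["young";"young";"old";"old";"young";"young";"old";"old"];
loantype = ["residential";"investment";"residential";"investment"; ...
            "residential";"investment";"residential";"investment"];
meanlgd  = [0.12166;0.087331;0.021776;0.16379;0.16489;0.25977;0.066068;0.11779];
lookup = table(ltvlevel,agelevel,loantype,meanlgd);

% Hypothetical new loan.
newltv = 0.35;
newage = 3.1;
newtype = "investment";

% Discretize with the same cutoffs as the benchmark model (0.5 for LTV, 2 for age).
newltvlevel = string(discretize(newltv,[0 0.5 Inf],'categorical',{'low','high'}));
newagelevel = string(discretize(newage,[0 2 Inf],'categorical',{'young','old'}));

% Find the matching group and assign its mean as the predicted LGD.
match = lookup.ltvlevel==newltvlevel & lookup.agelevel==newagelevel & lookup.loantype==newtype;
newlgdpred = lookup.meanlgd(match);
fprintf('group: %s LTV, %s loan, %s; predicted LGD: %.5f\n', ...
    newltvlevel,newagelevel,newtype,newlgdpred)
```

Matching on the labels themselves avoids relying on the group numbering being the same in the new data as in the training summary.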
The histogram of the predictions also shows the discrete nature of the model.
histogram(lgdpredmeanstest)
title('Predicted LGD, Group Means Model')
xlabel('Predicted LGD')
ylabel('Frequency')
To make the benchmark predictions available for both the training and test sets, add a column with the group-mean LGD predictions for the entire data set.
lgdgroup = findgroups(data(:,{'ltvdiscretized' 'agediscretized' 'Type'}));
data.lgdpredmeans = gs.mean_LGD(lgdgroup);
Compare Performance
Compare the performance of the Tobit model and the benchmark model using the validation functions of the fitted model.
Start with the area under the receiver operating characteristic (ROC) curve, or AUROC metric, using modelDiscrimination.
datasetchoice = "testing";
if datasetchoice=="training"
    ind = trainingind;
else
    ind = testind;
end
discmeasure = modelDiscrimination(lgdmodel,data(ind,:),'ShowDetails',true,'ReferenceLGD',data.lgdpredmeans(ind),'ReferenceID','Group Means')
discmeasure=2×3 table
                    AUROC      Segment      SegmentCount
                   _______    __________    ____________
    Tobit          0.67986    "all_data"        1394    
    Group Means    0.61251    "all_data"        1394    
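AUROC is defined for a binary outcome, so a continuous observed LGD has to be converted to two classes before the metric can be computed. The sketch below shows one way to do that on toy data, assuming observed LGD values above their mean count as the "high" class (an assumption made for illustration, not a statement of how the toolbox function works internally). The AUROC is then the fraction of high-versus-low pairs in which the high-LGD observation has the larger predicted LGD, with ties counted as one half. All values and names are hypothetical.

```matlab
% Toy data (hypothetical values, not from the LGD data set).
observed  = [0; 0.1; 0.5; 0.9; 0.2; 0];
predicted = [0.05; 0.1; 0.3; 0.4; 0.35; 0.2];

% Assumed discretization: observed LGD above its mean counts as "high".
ishigh = observed > mean(observed);

% All pairwise differences between high-class and low-class predictions.
d = predicted(ishigh) - predicted(~ishigh)';

% Fraction of pairs ranked correctly, counting ties as one half.
auroc = (nnz(d > 0) + 0.5*nnz(d == 0))/numel(d);
fprintf('AUROC = %.3f\n',auroc)
```

For this toy data, 7 of the 8 pairs are ordered correctly, so auroc is 0.875. A value of 0.5 corresponds to random ranking, which puts the 0.68 and 0.61 in the table above in context.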
Use modelDiscriminationPlot to visualize the ROC curve.
modelDiscriminationPlot(lgdmodel,data(ind,:),'ReferenceLGD',data.lgdpredmeans(ind),'ReferenceID','Group Means')
Use modelCalibration to compute the calibration metrics.
calmeasure = modelCalibration(lgdmodel,data(ind,:),'ReferenceLGD',data.lgdpredmeans(ind),'ReferenceID','Group Means')
calmeasure=2×4 table
                   RSquared     RMSE      Correlation    SampleMeanError
                   ________    _______    ___________    _______________
    Tobit           0.08527    0.23712      0.29201         -0.034412   
    Group Means    0.041622     0.2406      0.20401        -0.0078124   
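The four calibration metrics can be reproduced by hand from a vector of observed values and a vector of predictions. The sketch below does this on toy data with base MATLAB functions. It assumes R-squared is the squared correlation between observed and predicted values, which agrees with the table above (0.29201^2 is approximately 0.08527), and it assumes the sample mean error is taken as observed minus predicted. The sign convention is an assumption, and all values and names are hypothetical.

```matlab
% Toy data (hypothetical values, not from the LGD data set).
observed  = [0; 0.1; 0.5; 0.9];
predicted = [0.1; 0.2; 0.4; 0.6];

err = observed - predicted;        % assumed sign convention: observed minus predicted
rmse = sqrt(mean(err.^2));         % root mean squared error
r = corrcoef(observed,predicted);  % 2-by-2 correlation matrix
correlation = r(1,2);              % correlation between observed and predicted
rsquared = correlation^2;          % R-squared of regressing observed on predicted
samplemeanerror = mean(err);       % average signed error
fprintf('RSquared %.4f, RMSE %.4f, Correlation %.4f, SampleMeanError %.4f\n', ...
    rsquared,rmse,correlation,samplemeanerror)
```

RMSE and the sample mean error measure different things: a model can have a sample mean error near zero, as the group means benchmark does, and still have a large RMSE, because positive and negative errors cancel in the mean but not in the squared error.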
Use modelCalibrationPlot to visualize the scatter plot of the observed LGD values against the predicted LGD values.
modelCalibrationPlot(lgdmodel,data(ind,:),'ReferenceLGD',data.lgdpredmeans(ind),'ReferenceID','Group Means')
You can also use modelCalibrationPlot to visualize the scatter plot of the predicted LGD values against the LTV values.
modelCalibrationPlot(lgdmodel,data(ind,:),'ReferenceLGD',data.lgdpredmeans(ind),'ReferenceID','Group Means','XData','LTV','YData','predicted')