oobPermutedPredictorImportance

predictor importance estimates by permutation of out-of-bag predictor observations for random forest of classification trees

syntax

imp = oobPermutedPredictorImportance(mdl)

imp = oobPermutedPredictorImportance(mdl,Name,Value)

description

imp = oobPermutedPredictorImportance(mdl) returns a vector of out-of-bag predictor importance estimates by permutation using the random forest of classification trees mdl. mdl must be a ClassificationBaggedEnsemble model object.

imp = oobpermutedpredictorimportance(mdl,name,value) uses additional options specified by one or more name,value pair arguments. for example, you can speed up computation using parallel computing or indicate which trees to use in the predictor importance estimation.

input arguments

`mdl` — random forest of classification trees
`classificationbaggedensemble` model object

random forest of classification trees, specified as a ClassificationBaggedEnsemble model object created by fitcensemble.

name-value arguments

specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. name-value arguments must appear after other arguments, but the order of the pairs does not matter.

before r2021a, use commas to separate each name and value, and enclose name in quotes.
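
as a brief illustration, the two calls below pass the same option in the two syntaxes. this is a sketch that assumes mdl is a trained ClassificationBaggedEnsemble with at least 10 learners; the value 1:10 is an arbitrary choice made for the illustration.

```matlab
% R2021a and later: Name=Value syntax
imp = oobPermutedPredictorImportance(mdl,Learners=1:10);

% any release: comma-separated pair with the name in quotes
imp = oobPermutedPredictorImportance(mdl,'Learners',1:10);
```

because the permutations are random, the two calls can return slightly different estimates unless the random number generator is reset between them.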

`learners` — indices of learners to use in predictor importance estimation
`1:mdl.numtrained` (default) | numeric vector of positive integers

indices of learners to use in predictor importance estimation, specified as the comma-separated pair consisting of 'learners' and a numeric vector of positive integers. values must be at most mdl.numtrained. when oobpermutedpredictorimportance estimates the predictor importance, it includes the learners in mdl.trained(learners) only, where learners is the value of 'learners'.

example: 'Learners',1:2:mdl.NumTrained
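
a minimal sketch of how a learner subset might be compared against the full ensemble, assuming mdl is a trained ClassificationBaggedEnsemble. the variable names oddIdx, impOdd, and impAll are introduced here for illustration only.

```matlab
% assumes mdl is a trained ClassificationBaggedEnsemble
oddIdx = 1:2:mdl.NumTrained;                 % every other learner
impOdd = oobPermutedPredictorImportance(mdl,'Learners',oddIdx);
impAll = oobPermutedPredictorImportance(mdl); % default: all learners

% grouped bars, one group per predictor
bar([impAll; impOdd]')
legend('all learners','odd-indexed learners')
```

if the two sets of bars differ widely, the ensemble may contain too few trees for the estimates to be stable.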

`options` — parallel computing options
`[]` (default) | structure array returned by `statset`

parallel computing options, specified as the comma-separated pair consisting of 'options' and a structure array returned by statset. 'options' requires a parallel computing toolbox™ license.

oobpermutedpredictorimportance uses the 'useparallel' field only. statset('useparallel',true) invokes a pool of workers.

example: 'options',statset('useparallel',true)

output arguments

`imp` — out-of-bag, predictor importance estimates by permutation
numeric vector

out-of-bag, predictor importance estimates by permutation, returned as a 1-by-p numeric vector. p is the number of predictor variables in the training data (size(mdl.x,2)). imp(j) is the predictor importance of the predictor mdl.predictornames(j).
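
because imp is ordered like mdl.PredictorNames, the two can be paired to rank the predictors. the following is a sketch that assumes mdl is a trained ClassificationBaggedEnsemble and imp has already been computed; the names sortedImp, idx, and ranking are illustrative.

```matlab
% assumes imp = oobPermutedPredictorImportance(mdl) has been computed
[sortedImp,idx] = sort(imp,'descend');       % largest importance first
ranking = table(mdl.PredictorNames(idx)',sortedImp', ...
    'VariableNames',{'Predictor','Importance'});
disp(ranking)
```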

examples

estimate importance of predictors

load the census1994 data set. consider a model that predicts a person's salary category given their age, working class, education level, marital status, race, sex, capital gain and loss, and number of working hours per week.

load census1994
x = adultdata(:,{'age','workclass','education_num','marital_status','race',...
    'sex','capital_gain','capital_loss','hours_per_week','salary'});

you can train a random forest of 50 classification trees using the entire data set.

mdl = fitcensemble(x,'salary','method','bag','numlearningcycles',50);

fitcensemble uses a default template tree object templatetree() as a weak learner when 'method' is 'bag'. in this example, for reproducibility, specify 'reproducible',true when you create a tree template object, and then use the object as a weak learner.

rng('default') % for reproducibility
t = templateTree('Reproducible',true); % for reproducibility of random predictor selections
mdl = fitcensemble(x,'salary','Method','Bag','NumLearningCycles',50,'Learners',t);

mdl is a classificationbaggedensemble model.

estimate predictor importance measures by permuting out-of-bag observations. compare the estimates using a bar graph.

imp = oobPermutedPredictorImportance(mdl);
figure;
bar(imp);
title('out-of-bag permuted predictor importance estimates');
ylabel('estimates');
xlabel('predictors');
h = gca;
h.XTickLabel = mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';

[figure: bar graph titled "out-of-bag permuted predictor importance estimates"; x-axis: predictors, y-axis: estimates]

imp is a 1-by-9 vector of predictor importance estimates. larger values indicate predictors that have a greater influence on predictions. in this case, marital_status is the most important predictor, followed by capital_gain.

unbiased estimates of predictor importance using parallel computing

this example requires a parallel computing toolbox™ license.

load census1994
x = adultdata(:,{'age','workclass','education_num','marital_status','race', ...
    'sex','capital_gain','capital_loss','hours_per_week','salary'});

display the number of categories represented in the categorical variables using summary.

summary(x)

variables:
    age: 32561×1 double
        values:
            min        17  
            median     37  
            max        90  
    workclass: 32561×1 categorical
        values:
            federal-gov              960   
            local-gov               2093   
            never-worked               7   
            private                22696   
            self-emp-inc            1116   
            self-emp-not-inc        2541   
            state-gov               1298   
            without-pay               14   
            nummissing              1836   
    education_num: 32561×1 double
        values:
            min              1       
            median          10       
            max             16       
    marital_status: 32561×1 categorical
        values:
            divorced                       4443      
            married-af-spouse                23      
            married-civ-spouse            14976      
            married-spouse-absent           418      
            never-married                 10683      
            separated                      1025      
            widowed                         993      
    race: 32561×1 categorical
        values:
            amer-indian-eskimo      311 
            asian-pac-islander     1039 
            black                  3124 
            other                   271 
            white                 27816 
    sex: 32561×1 categorical
        values:
            female    10771
            male      21790
    capital_gain: 32561×1 double
        values:
            min               0     
            median            0     
            max           99999     
    capital_loss: 32561×1 double
        values:
            min               0     
            median            0     
            max            4356     
    hours_per_week: 32561×1 double
        values:
            min               1       
            median           40       
            max              99       
    salary: 32561×1 categorical
        values:
            <=50k     24720  
            >50k       7841

because there are few categories represented in the categorical variables compared to levels in the continuous variables, the standard cart, predictor-splitting algorithm prefers splitting a continuous predictor over the categorical variables.
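
the imbalance can be checked directly by counting distinct levels per variable. this is a sketch that assumes x is the table created above; the loop variables names, v, and n are illustrative.

```matlab
% count distinct levels of each variable in the table x
names = x.Properties.VariableNames;
for k = 1:numel(names)
    v = x.(names{k});
    if iscategorical(v)
        n = numel(categories(v));   % number of defined categories
    else
        n = numel(unique(v));       % number of distinct numeric values
    end
    fprintf('%-16s %d\n',names{k},n);
end
```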

train a random forest of 50 classification trees using the entire data set. to grow unbiased trees, specify usage of the curvature test for splitting predictors. because there are missing values in the data, specify usage of surrogate splits. to reproduce random predictor selections, set the seed of the random number generator by using rng and specify 'reproducible',true.

rng('default') % for reproducibility
t = templateTree('PredictorSelection','curvature','Surrogate','on', ...
    'Reproducible',true); % for reproducibility of random predictor selections
mdl = fitcensemble(x,'salary','Method','Bag','NumLearningCycles',50, ...
    'Learners',t);

estimate predictor importance measures by permuting out-of-bag observations. perform calculations in parallel.

options = statset('UseParallel',true);
imp = oobPermutedPredictorImportance(mdl,'Options',options);

starting parallel pool (parpool) using the 'local' profile ...
connected to the parallel pool (number of workers: 6).

compare the estimates using a bar graph.

figure
bar(imp)
title('out-of-bag permuted predictor importance estimates')
ylabel('estimates')
xlabel('predictors')
h = gca;
h.XTickLabel = mdl.PredictorNames;
h.XTickLabelRotation = 45;
h.TickLabelInterpreter = 'none';

in this case, capital_gain is the most important predictor, followed by marital_status. compare these results to the results in estimate importance of predictors.

more about

out-of-bag, predictor importance estimates by permutation

out-of-bag, predictor importance estimates by permutation measure how influential the predictor variables in the model are at predicting the response. the influence of a predictor increases with the value of this measure.

if a predictor is influential in prediction, then permuting its values should affect the model error. if a predictor is not influential, then permuting its values should have little to no effect on the model error.

the following process describes the estimation of out-of-bag predictor importance values by permutation. suppose that R is a random forest of T learners and p is the number of predictors in the training data.

1. for tree t, t = 1,...,T:
  1. identify the out-of-bag observations and the indices of the predictor variables that were split to grow tree t, s_t ⊆ {1,...,p}.
  2. estimate the out-of-bag error ε_t.
  3. for each predictor variable x_j, j ∈ s_t:
    1. randomly permute the observations of x_j.
    2. estimate the model error, ε_tj, using the out-of-bag observations containing the permuted values of x_j.
    3. take the difference d_tj = ε_tj – ε_t. predictor variables not split when growing tree t are attributed a difference of 0.
2. for each predictor variable in the training data, compute the mean, $\bar{d}_{j}$, and standard deviation, $\sigma_{j}$, of the differences over the learners, j = 1,...,p.
3. the out-of-bag predictor importance by permutation for x_j is $\bar{d}_{j} / \sigma_{j}$.
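
the last two steps can be sketched on a toy matrix of differences. the values in d are hypothetical, chosen only to make the arithmetic easy to follow, and the sketch assumes the standard deviation is normalized by T-1, which the description above does not specify.

```matlab
% hypothetical differences d(t,j): T = 3 trees (rows), p = 2 predictors (columns)
d = [0.02 0.05;
     0.03 0.04;
     0.01 0.06];

dbar  = mean(d,1);    % mean difference per predictor
sigma = std(d,0,1);   % standard deviation per predictor (T-1 normalization, an assumption)
imp   = dbar./sigma   % approximately [2 5]
```

the second predictor has the larger mean increase in error with the same spread, so it receives the larger importance value.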

tips

when growing a random forest using fitcensemble:

standard cart tends to select split predictors containing many distinct values, e.g., continuous variables, over those containing few distinct values, e.g., categorical variables [3]. if the predictor data set is heterogeneous, or if there are predictors that have relatively fewer distinct values than other variables, then consider specifying the curvature or interaction test.
trees grown using standard cart are not sensitive to predictor variable interactions. also, such trees are less likely to identify important variables in the presence of many irrelevant predictors than trees grown using the interaction test. therefore, to account for predictor interactions and identify important variables in the presence of many irrelevant variables, specify the interaction test [2].
if the training data includes many predictors and you want to analyze predictor importance, then specify 'NumVariablesToSample' of the templateTree function as 'all' for the tree learners of the ensemble. otherwise, the software might not select some predictors, underestimating their importance.
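
a sketch of how these tips might be combined in one tree template, assuming x is a table with the response variable 'salary' as in the examples. the choice of 50 learning cycles and the use of surrogate splits are carried over from the examples, not requirements.

```matlab
% assumes x is a table whose response variable is 'salary'
t = templateTree('PredictorSelection','interaction-curvature', ... % interaction test
    'NumVariablesToSample','all', ...   % consider every predictor at each split
    'Surrogate','on');                  % surrogate splits for missing values
mdl = fitcensemble(x,'salary','Method','Bag','NumLearningCycles',50,'Learners',t);
imp = oobPermutedPredictorImportance(mdl);
```

sampling all predictors at every split removes the random feature subsetting of a typical random forest, so expect longer training times in exchange for importance estimates that cover every predictor.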

references

[1] breiman, l., j. friedman, r. olshen, and c. stone. classification and regression trees. boca raton, fl: crc press, 1984.

[2] loh, w.y. “regression trees with unbiased variable selection and interaction detection.” statistica sinica, vol. 12, 2002, pp. 361–386.

[3] loh, w.y. and y.s. shih. “split selection methods for classification trees.” statistica sinica, vol. 7, 1997, pp. 815–840.

extended capabilities

automatic parallel support
accelerate code by automatically running computation in parallel using parallel computing toolbox™.

to run in parallel, specify the options name-value argument in the call to this function and set the useparallel field of the options structure to true using statset:

"options",statset("useparallel",true)

for more information about parallel computing, see the parallel computing toolbox documentation.

version history

introduced in r2016b
