shapley values

since r2021a

description

the shapley value of a feature for a query point explains the deviation of the prediction for the query point from the average prediction, due to the feature. for each query point, the sum of the shapley values for all features corresponds to the total deviation of the prediction from the average.
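The additivity property described above can be checked on a toy problem. The following is a minimal sketch that computes exact Shapley values by brute force with an interventional value function, using base MATLAB only. The model `f`, the data `X`, and the query point `q` are hypothetical, and this is not how the `shapley` function is implemented internally.

```matlab
% Hypothetical toy model and data (not from the documentation).
f = @(X) 2*X(:,1) + X(:,2).*X(:,3);
rng(0)
X = randn(50,3);          % reference data, one row per observation
q = [1 2 3];              % query point
p = size(X,2);

% Value function v(S): average prediction when the features in S are
% fixed to the query point values. Subsets are encoded as bit masks.
v = zeros(2^p,1);
for mask = 0:2^p-1
    inS = logical(bitget(mask,1:p));
    Xs = X;
    Xs(:,inS) = repmat(q(inS),size(X,1),1);
    v(mask+1) = mean(f(Xs));
end

% Shapley value of feature i: weighted average of its marginal
% contributions over all subsets that do not contain i.
phi = zeros(1,p);
for i = 1:p
    for mask = 0:2^p-1
        if bitget(mask,i), continue, end
        s = sum(bitget(mask,1:p));
        w = factorial(s)*factorial(p-s-1)/factorial(p);
        phi(i) = phi(i) + w*(v(bitset(mask,i)+1) - v(mask+1));
    end
end

% Total deviation of the prediction from the average prediction.
deviation = f(q) - mean(f(X));
```

Here `sum(phi)` and `deviation` agree up to floating-point error. The brute-force loop visits all 2^p subsets, so it is practical only for a handful of features.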

you can create a shapley object for a machine learning model with a specified query point (querypoint). the software creates an object and computes the shapley values of all features for the query point.

use the shapley values to explain the contribution of individual features to a prediction at the specified query point. use the plot function to create a bar graph of the shapley values. you can compute the shapley values for another query point by using the fit function.

creation

description

explainer = shapley(blackbox) creates the shapley object explainer using the machine learning model object blackbox, which contains predictor data. to compute shapley values, use the fit function with explainer.

explainer = shapley(blackbox,x) creates a shapley object using the predictor data in x.

explainer = shapley(___,'querypoint',querypoint) also computes the shapley values for the query point querypoint and stores the computed shapley values in the shapleyvalues property of explainer. you can specify querypoint in addition to any of the input argument combinations in the previous syntaxes.

explainer = shapley(___,name,value) specifies additional options using one or more name-value arguments. for example, specify 'useparallel',true to compute shapley values in parallel.
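As a rough illustration of the syntaxes above, the sketch below creates explainers three ways. The carbig data, the regression tree, and the choice of predictors are stand-ins picked for brevity, not part of the syntax descriptions.

```matlab
% Sketch only: carbig and fitrtree are stand-ins chosen for illustration.
load carbig
tbl = rmmissing(table(Acceleration,Displacement,Horsepower,Weight,MPG));
mdl = fitrtree(tbl,'MPG');        % full model, stores the predictor data
q = tbl(1,:);                     % query point as a single-row table

% shapley(blackbox): no query point, so no Shapley values yet.
explainer1 = shapley(mdl);
explainer1 = fit(explainer1,q);   % compute them afterwards with fit

% shapley(blackbox,x): a compact model carries no data, so pass it in.
explainer2 = shapley(compact(mdl),tbl);

% shapley(___,'QueryPoint',q): compute the values at creation time.
explainer3 = shapley(mdl,'QueryPoint',q);
```

`explainer1` and `explainer3` should hold the same Shapley values, while `explainer2` stays empty until `fit` is called on it.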

input arguments

blackbox: machine learning model to be interpreted, specified as a full or compact regression or classification model object or a function handle.

  • full or compact model object — you can specify a full or compact regression or classification model object, which has a predict object function. the software uses the predict function to compute shapley values.

    • if you specify a model object that does not contain predictor data (for example, a compact model), then you must provide the predictor data using x.

    • when you train a model, use a numeric matrix or table for the predictor data where rows correspond to individual observations.

    regression model object

    supported model | full or compact regression model object
    ensemble of regression models | regressionbaggedensemble
    gaussian kernel regression model using random feature expansion |
    gaussian process regression | regressiongp
    generalized additive model |
    linear regression for high-dimensional data | regressionlinear
    neural network regression model |
    regression tree | regressiontree, compactregressiontree
    support vector machine regression | regressionsvm, compactregressionsvm

    classification model object

    supported model | full or compact classification model object
    discriminant analysis classifier |
    multiclass model for support vector machines or other classifiers | classificationecoc, compactclassificationecoc
    ensemble of learners for classification |
    gaussian kernel classification model using random feature expansion | classificationkernel
    generalized additive model |
    k-nearest neighbor classifier | classificationknn
    linear classification model | classificationlinear
    multiclass naive bayes model | compactclassificationnaivebayes
    neural network classifier |
    support vector machine classifier for one-class and binary classification | classificationsvm, compactclassificationsvm
    binary decision tree for multiclass classification | compactclassificationtree

  • function handle — you can specify a function handle that accepts predictor data and returns a column vector containing a prediction for each observation in the predictor data. the prediction is a predicted response for regression or a predicted score of a single class for classification. you must provide the predictor data using x.
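For instance, a function handle can wrap a model whose type is not in the lists above. The sketch below assumes a logistic regression fitted with `fitglm`, whose `predict` function returns one probability per observation. The data set, the two chosen predictors, and the class of interest are arbitrary choices made for illustration.

```matlab
% Sketch only: the data, predictors, and target class are arbitrary choices.
load fisheriris
X = meas(:,1:2);                               % two numeric predictors
y = double(strcmp(species,'versicolor'));      % 1 for versicolor, 0 otherwise
glm = fitglm(X,y,'Distribution','binomial');   % logistic regression

% The handle accepts predictor data and returns a column vector
% containing the predicted probability of a single class.
f = @(Xnew) predict(glm,Xnew);

% With a function handle, the predictor data must be passed explicitly.
q = X(1,:);
explainer = shapley(f,X,'QueryPoint',q);

% The average prediction plus the Shapley values should recover f(q).
gap = abs(explainer.Intercept + sum(explainer.ShapleyValues{:,2}) - f(q));
```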

x: predictor data, specified as a numeric matrix or table. each row of x corresponds to one observation, and each column corresponds to one variable.

  • for a numeric matrix:

    • the variables that make up the columns of x must have the same order as the predictor variables that trained blackbox, stored in blackbox.x.

    • if you trained blackbox using a table, then x can be a numeric matrix if the table contains all numeric predictor variables.

  • for a table:

    • if you trained blackbox using a table (for example, tbl), then all predictor variables in x must have the same variable names and data types as those in tbl. however, the column order of x does not need to correspond to the column order of tbl.

    • if you trained blackbox using a numeric matrix, then the predictor names in blackbox.predictornames and the corresponding predictor variable names in x must be the same. to specify predictor names during training, use the predictornames name-value argument. all predictor variables in x must be numeric vectors.

    • x can contain additional variables (response variables, observation weights, and so on), but shapley ignores them.

    • shapley does not support multicolumn variables or cell arrays other than cell arrays of character vectors.

if blackbox is a function handle or a model object that does not contain predictor data, you must provide x. if blackbox is a full machine learning model object and you specify this argument, then shapley does not use the predictor data in blackbox; it uses the specified predictor data only.
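One practical use of this behavior is to pass a smaller sample of the training data so that the computation runs faster. The sketch below assumes a regression tree on the carbig data and a sample size of 100, both chosen only for illustration.

```matlab
% Sketch only: model, data, and sample size are illustrative assumptions.
load carbig
tbl = rmmissing(table(Acceleration,Displacement,Horsepower,Weight,MPG));
mdl = fitrtree(tbl,'MPG');                      % full model with stored data

rng('default')                                  % reproducible subsample
sample = datasample(tbl,100,'Replace',false);   % 100 rows without replacement

% shapley uses only the rows in sample, not the data stored in mdl.
explainer = shapley(mdl,sample,'QueryPoint',tbl(1,:));
numRows = size(explainer.X,1);                  % rows stored in the X property
```

Because the average prediction is taken over `sample`, the `Intercept` property and the Shapley values can differ from those obtained with the full training set.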

data types: single | double | table

querypoint: query point at which shapley explains a prediction, specified as a row vector of numeric values or a single-row table.

  • for a row vector of numeric values:

    • the variables that make up the columns of querypoint must have the same order as x or the predictor variables that trained blackbox, stored in blackbox.x.

    • if you trained blackbox using a table, then querypoint can be a numeric vector if the table contains all numeric variables.

  • for a single-row table:

    • if you trained blackbox using a table (for example, tbl), then all predictor variables in querypoint must have the same variable names and data types as those in tbl. however, the column order of querypoint does not need to correspond to the column order of tbl.

    • if you trained blackbox using a numeric matrix, then the predictor names in blackbox.predictornames and the corresponding predictor variable names in querypoint must be the same. to specify predictor names during training, use the predictornames name-value argument. all predictor variables in querypoint must be numeric vectors.

    • querypoint can contain additional variables (response variables, observation weights, and so on), but shapley ignores them.

    • shapley does not support multicolumn variables or cell arrays other than cell arrays of character vectors.

if querypoint contains nans for continuous predictors and 'method' is 'conditional', then the shapley values (shapleyvalues) in the returned object are nans. otherwise, shapley handles nans in querypoint in the same way as blackbox (the predict object function of blackbox or the function handle specified by blackbox).

example: blackbox.x(1,:) specifies the query point as the first observation of the predictor data in the full machine learning model blackbox.

data types: single | double | table

name-value arguments

specify optional pairs of arguments as name1=value1,...,namen=valuen, where name is the argument name and value is the corresponding value. name-value arguments must appear after other arguments, but the order of the pairs does not matter.

before r2021a, use commas to separate each name and value, and enclose name in quotes.

example: shapley(blackbox,'querypoint',q,'method','conditional') creates a shapley object and computes the shapley values for the query point q using the extension to the kernel shap algorithm.

categoricalpredictors: categorical predictors list, specified as one of the values in this table.

value | description
vector of positive integers | each entry in the vector is an index value indicating that the corresponding predictor is categorical. the index values are between 1 and p, where p is the number of predictors used to train the model. if blackbox uses a subset of input variables as predictors, then the software indexes the predictors using only the subset. the 'categoricalpredictors' values do not count the response variable, observation weight variable, or any other variables that the function does not use.
logical vector | a true entry means that the corresponding predictor is categorical. the length of the vector is p.
character matrix | each row of the matrix is the name of a predictor variable. the names must match the variable names of the predictor data in the form of a table. pad the names with extra blanks so each row of the character matrix has the same length.
string array or cell array of character vectors | each element in the array is the name of a predictor variable. the names must match the variable names of the predictor data in the form of a table.
'all' | all predictors are categorical.

  • if you specify blackbox as a function handle, then shapley identifies categorical predictors from the predictor data x. if the predictor data is in a table, shapley assumes that a variable is categorical if it is a logical vector, unordered categorical vector, character array, string array, or cell array of character vectors. if the predictor data is a matrix, shapley assumes that all predictors are continuous. to identify any other predictors as categorical predictors, specify them by using the categoricalpredictors name-value argument.

  • if you specify blackbox as a regression or classification model object, then shapley identifies categorical predictors by using the categoricalpredictors property of the model object.

shapley supports an ordered categorical predictor when blackbox supports ordered categorical predictors and you specify 'method' as 'interventional'.

example: 'categoricalpredictors','all'

data types: single | double | logical | char | string | cell

maxnumsubsets: maximum number of predictor subsets to use for shapley value computation, specified as a positive integer.

for details on how shapley chooses the subsets to use, see computational cost.

this argument is valid only when shapley uses the kernel shap algorithm or the extension to the kernel shap algorithm. if you set the maxnumsubsets argument when method is 'interventional', the software uses the kernel shap algorithm. for more information, see algorithms.

example: 'maxnumsubsets',100

data types: single | double
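As a rough sketch of this option, the example below limits the number of subsets for a model with four predictors. The model, data, and the value 8 are assumptions made for illustration. The `NumSubsets` and `Method` properties show what the software actually used.

```matlab
% Sketch only: model, data, and subset limit are illustrative assumptions.
load carbig
tbl = rmmissing(table(Acceleration,Displacement,Horsepower,Weight,MPG));
mdl = fitrtree(tbl,'MPG');

% Four predictors give 2^4 = 16 possible subsets; allow at most 8 of them.
explainer = shapley(mdl,'QueryPoint',tbl(1,:),'MaxNumSubsets',8);

disp(explainer.NumSubsets)   % number of subsets that shapley used
disp(explainer.Method)       % algorithm that shapley selected
```

With fewer subsets than the full set, the Shapley values are approximations, so their sum is not guaranteed to match the deviation from the average as closely.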

since r2023a

method: shapley value computation algorithm, specified as 'interventional' or 'conditional'.

  • 'interventional' (default) — shapley computes the shapley values with an interventional value function.

    shapley offers three interventional algorithms: kernel shap [1], linear shap [1], and tree shap [2]. the software selects an algorithm based on the machine learning model blackbox and other specified options. for details, see interventional algorithms.

  • 'conditional': shapley uses the extension to the kernel shap algorithm [3] with a conditional value function.

the method property stores the name of the selected algorithm. for more information, see algorithms.

before r2023a: you can specify this argument as 'interventional-kernel' or 'conditional-kernel'. shapley supports the kernel shap algorithm and the extension of the kernel shap algorithm.

example: 'method','conditional'

data types: char | string
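The following sketch contrasts the two settings and reads back the algorithm name stored in the `Method` property. It assumes R2023a or later, where 'conditional' is an accepted value, and uses a regression tree on the carbig data purely as a stand-in.

```matlab
% Sketch only: assumes R2023a or later; model and data are stand-ins.
load carbig
tbl = rmmissing(table(Acceleration,Displacement,Horsepower,Weight,MPG));
mdl = fitrtree(tbl,'MPG');
q = tbl(1,:);

expI = shapley(mdl,'QueryPoint',q);                         % default method
expC = shapley(mdl,'QueryPoint',q,'Method','conditional');  % conditional value function

disp(expI.Method)   % name of the interventional algorithm that was selected
disp(expC.Method)   % name of the conditional algorithm
```

The two explainers generally return different Shapley values when the predictors are correlated, because the value functions treat feature dependence differently.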

useparallel: flag to run in parallel, specified as true or false. if you specify "useparallel",true, the shapley function executes for-loop iterations in parallel when you have parallel computing toolbox™.

this argument is valid only when shapley uses the tree shap algorithm for an ensemble of trees, the kernel shap algorithm, or the extension to the kernel shap algorithm.

example: 'useparallel',true

data types: logical

properties

this property is read-only.

blackboxmodel: machine learning model to be interpreted, specified as a regression or classification model object or a function handle.

the blackbox argument sets this property.

this property is read-only.

blackboxfitted: prediction for the query point computed by the machine learning model (blackboxmodel), specified as a scalar.

  • if blackboxmodel is a model object, then blackboxfitted is a predicted response for regression or a classified label for classification.

  • if blackboxmodel is a function handle, then blackboxfitted is a value returned by the function handle, either a predicted response for regression or a predicted score of a single class for classification.

this property is read-only.

categoricalpredictors: categorical predictor indices, specified as a vector of positive integers. categoricalpredictors contains index values indicating that the corresponding predictors are categorical. the index values are between 1 and p, where p is the number of predictors used to train the model. if none of the predictors are categorical, then this property is empty ([]).

  • if you specify blackbox using a function handle, then shapley identifies categorical predictors from the predictor data x. if you specify the categoricalpredictors name-value argument, then the argument sets this property.

  • if you specify blackbox as a regression or classification model object, then shapley determines this property by using the categoricalpredictors property of the model object.

shapley supports an ordered categorical predictor when blackbox supports ordered categorical predictors and when you specify 'method' as 'interventional'.

intercept: average prediction, averaged over the predictor data x, specified as a numeric vector or numeric scalar.

  • if blackboxmodel is a classification model object, then intercept is a vector of the average classification scores for each class.

  • if blackboxmodel is a regression model object, then intercept is a scalar of the average response.

  • if blackboxmodel is a function handle, then intercept is a scalar of the average function evaluation.

for a query point, the sum of the shapley values for all features corresponds to the total deviation of the prediction from the average (intercept).
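For a classification model, this relationship holds per class and can be checked against the class scores returned by `predict`. The sketch below assumes a classification tree trained on the fisheriris data, both chosen only for illustration.

```matlab
% Sketch only: the tree model and fisheriris data are illustrative choices.
load fisheriris
mdl = fitctree(meas,species);
q = meas(1,:);
explainer = shapley(mdl,'QueryPoint',q);

% Class scores for the query point (second output of predict).
[~,score] = predict(mdl,q);

% Shapley values as a numeric matrix: one row per predictor, one column per class.
sv = explainer.ShapleyValues{:,2:end};

% Average score per class plus the summed Shapley values per class.
reconstructed = explainer.Intercept(:)' + sum(sv,1);
gap = max(abs(reconstructed - score));
```

When the Shapley values are computed exactly, `gap` is close to zero. If the number of subsets is restricted, a larger gap can be expected.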

this property is read-only.

method: shapley value computation algorithm, specified as 'interventional-linear', 'interventional-tree', 'interventional-kernel', or 'conditional-kernel'.

  • 'interventional-linear': shapley uses the linear shap algorithm [1] with an interventional value function. that is, shapley computes interventional shapley values using the estimated coefficients for linear models.

  • 'interventional-tree': shapley uses the tree shap algorithm [2] with an interventional value function.

  • 'interventional-kernel': shapley uses the kernel shap algorithm [1] with an interventional value function.

  • 'conditional-kernel': shapley uses the extension to the kernel shap algorithm [3] with a conditional value function.

the method argument of shapley or the method argument of fit sets this property.

for more information, see algorithms.

this property is read-only.

numsubsets: number of predictor subsets to use for shapley value computation, specified as a positive integer.

the maxnumsubsets argument of shapley or the maxnumsubsets argument of fit sets this property.

for details on how shapley chooses the subsets to use, see computational cost.

this property is read-only.

querypoint: query point at which shapley explains a prediction using the shapley values (shapleyvalues), specified as a row vector of numeric values or single-row table.

the querypoint argument of shapley or the querypoint argument of fit sets this property.

this property is read-only.

shapleyvalues: shapley values for the query point (querypoint), specified as a table.

  • for regression, the table has two columns. the first column contains the predictor variable names, and the second column contains the shapley values of the predictors.

  • for classification, the table has two or more columns, depending on the number of classes in blackboxmodel. the first column contains the predictor variable names, and the rest of the columns contain the shapley values of the predictors for each class.

this property is read-only.

x: predictor data, specified as a numeric matrix or table.

each row of x corresponds to one observation, and each column corresponds to one variable.

  • if you specify the x argument, then it sets this property.

  • if you specify blackbox as a full machine learning model object and do not specify x, then this property value is the predictor data used to train blackbox.

if an observation contains nans for continuous predictors and method is 'conditional-kernel', then shapley does not use the observation for the shapley value computation. otherwise, shapley handles nans in x in the same way as blackboxmodel (the predict object function of blackboxmodel or the function handle specified by blackboxmodel).

shapley stores all observations, including the rows with missing values, in this property.

object functions

fit | compute shapley values for query point
plot | plot shapley values

examples

train a classification model and create a shapley object. when you create a shapley object, specify a query point so that the software computes the shapley values for the query point. then create a bar graph of the shapley values by using the object function plot.

load the creditrating_historical data set. the data set contains customer ids and their financial ratios, industry labels, and credit ratings.

tbl = readtable('CreditRating_Historical.dat');

display the first three rows of the table.

head(tbl,3)
     id      wc_ta    re_ta    ebit_ta    mve_bvtd    s_ta     industry    rating
    _____    _____    _____    _______    ________    _____    ________    ______
    62394    0.013    0.104     0.036      0.447      0.142       3        {'bb'}
    48608    0.232    0.335     0.062      1.969      0.281       8        {'a' }
    42444    0.311    0.367     0.074      1.935      0.366       1        {'a' }

train a blackbox model of credit ratings by using the fitcecoc function. use the variables from the second through seventh columns in tbl as the predictor variables. a recommended practice is to specify the class names to set the order of the classes.

blackbox = fitcecoc(tbl,'Rating', ...
    'PredictorNames',tbl.Properties.VariableNames(2:7), ...
    'CategoricalPredictors','Industry', ...
    'ClassNames',{'AAA' 'AA' 'A' 'BBB' 'BB' 'B' 'CCC'});

create a shapley object that explains the prediction for the last observation. specify a query point so that the software computes shapley values and stores them in the shapleyvalues property.

querypoint = tbl(end,:)
querypoint=1×8 table
     id      wc_ta    re_ta    ebit_ta    mve_bvtd    s_ta    industry    rating
    _____    _____    _____    _______    ________    ____    ________    ______
    73104    0.239    0.463     0.065      2.924      0.34       2        {'aa'}
explainer = shapley(blackbox,'querypoint',querypoint)
warning: computation can be slow because the predictor data has over 1000 observations. use a smaller sample of the training set or specify 'useparallel' as true for faster computation.
explainer = 
  shapley with properties:
            blackboxmodel: [1x1 classificationecoc]
               querypoint: [1x8 table]
           blackboxfitted: {'aa'}
            shapleyvalues: [6x8 table]
               numsubsets: 64
                        x: [3932x6 table]
    categoricalpredictors: 6
                   method: 'interventional-kernel'
                intercept: [-1.7642 -1.3677 -1.0980 -1.0645 -1.4758 -2.1268 -2.3909]

as the warning message indicates, the computation can be slow because the predictor data has over 1000 observations. for faster computation, use a smaller sample of the training set or specify 'useparallel' as true.

for a classification model, shapley computes shapley values using the predicted class score for each class. display the values in the shapleyvalues property.

explainer.ShapleyValues
ans=6×8 table
    predictor        aaa           aa             a            bbb            bb             b            ccc    
    __________    _________    __________    ___________    __________    ___________    __________    __________
    "wc_ta"        0.051507      0.022531      0.0093463     0.0017109      -0.027655     -0.041443     -0.039882
    "re_ta"         0.16772      0.094211       0.051629     -0.011019      -0.087919      -0.20974      -0.29463
    "ebit_ta"     0.0011995    0.00052588     0.00041919    0.00011866    -0.00066237    -0.0013347    -0.0011824
    "mve_bvtd"       1.3417        1.3082        0.61472      -0.11247        -0.6555      -0.86908      -0.68547
    "s_ta"        -0.013059    -0.0091049    -0.00031099    -0.0028624    -0.00019227     0.0016759    -0.0024149
    "industry"     -0.10142     -0.048668      0.0036522      0.081542       0.092657       0.10464       0.15888

the shapleyvalues property contains the shapley values of all features for each class.

plot the shapley values for the predicted class by using the plot function.

plot(explainer)

[figure: horizontal bar graph titled "shapley explanation"; x-axis: shapley value; y-axis: predictor]

the horizontal bar graph shows the shapley values for all variables, sorted by their absolute values. each shapley value explains the deviation of the score for the query point from the average score of the predicted class, due to the corresponding variable.

train a regression model and create a shapley object. when you create a shapley object, if you do not specify a query point, then the software does not compute shapley values. use the object function fit to compute the shapley values for the specified query point. then create a bar graph of the shapley values by using the object function plot.

load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s.

load carbig

create a table containing the predictor variables acceleration, cylinders, and so on, as well as the response variable mpg.

tbl = table(Acceleration,Cylinders,Displacement,Horsepower,Model_Year,Weight,MPG);

removing missing values in a training set can help reduce memory consumption and speed up training for the fitrkernel function. remove missing values in tbl.

tbl = rmmissing(tbl);

train a blackbox model of mpg by using the fitrkernel function.

rng('default') % for reproducibility
mdl = fitrkernel(tbl,'MPG','CategoricalPredictors',[2 5]);

create a shapley object. specify the data set tbl, because mdl does not contain training data.

explainer = shapley(mdl,tbl)
explainer = 
  shapley with properties:
            blackboxmodel: [1x1 regressionkernel]
               querypoint: []
           blackboxfitted: []
            shapleyvalues: []
               numsubsets: 64
                        x: [392x7 table]
    categoricalpredictors: [2 5]
                   method: 'interventional-kernel'
                intercept: 22.6202

explainer stores the training data tbl in the x property.

compute the shapley values of all predictor variables for the first observation in tbl.

querypoint = tbl(1,:)
querypoint=1×7 table
    acceleration    cylinders    displacement    horsepower    model_year    weight    mpg
    ____________    _________    ____________    __________    __________    ______    ___
         12             8            307            130            70         3504     18 
explainer = fit(explainer,querypoint);

for a regression model, shapley computes shapley values using the predicted response, and stores them in the shapleyvalues property. display the values in the shapleyvalues property.

explainer.ShapleyValues
ans=6×2 table
      predictor       shapleyvalue
    ______________    ____________
    "acceleration"       -0.1561  
    "cylinders"         -0.18306  
    "displacement"      -0.34203  
    "horsepower"        -0.27291  
    "model_year"         -0.2926  
    "weight"            -0.32402  

plot the shapley values for the query point by using the plot function.

plot(explainer)

[figure: horizontal bar graph titled "shapley explanation"; x-axis: shapley value; y-axis: predictor]

the horizontal bar graph shows the shapley values for all variables, sorted by their absolute values. each shapley value explains the deviation of the prediction for the query point from the average, due to the corresponding variable.

train a regression model and create a shapley object using a function handle to the predict function of the model. use the object function fit to compute the shapley values for the specified query point. then plot the shapley values by using the object function plot.

load the carbig data set, which contains measurements of cars made in the 1970s and early 1980s.

load carbig

create a table containing the predictor variables acceleration, cylinders, and so on.

tbl = table(Acceleration,Cylinders,Displacement,Horsepower,Model_Year,Weight);

train a blackbox model of mpg by using the treebagger function.

rng('default') % for reproducibility
mdl = TreeBagger(100,tbl,MPG,'Method','regression','CategoricalPredictors',[2 5]);

shapley does not support a treebagger object directly, so you cannot specify the first input argument (blackbox model) of shapley as a treebagger object. instead, you can use a function handle to the predict function. you can also specify options of the predict function using name-value arguments of the function.

create the function handle to the predict function of the treebagger object mdl. specify the array of tree indices to use as 1:50.

f = @(tbl) predict(mdl,tbl,'trees',1:50);

create a shapley object using the function handle f. when you specify a blackbox model as a function handle, you must provide the predictor data. tbl includes categorical predictors (cylinders and model_year) with the double data type. by default, shapley does not treat variables with the double data type as categorical predictors. specify the second (cylinders) and fifth (model_year) variables as categorical predictors.

explainer = shapley(f,tbl,'categoricalpredictors',[2 5]);
explainer = fit(explainer,tbl(1,:));

plot the shapley values.

plot(explainer)

[figure: horizontal bar graph titled "shapley explanation"; x-axis: shapley value; y-axis: predictor]

references

[1] lundberg, scott m., and s. lee. "a unified approach to interpreting model predictions." advances in neural information processing systems 30 (2017): 4765–4774.

[2] lundberg, scott m., g. erion, h. chen, et al. "from local explanations to global understanding with explainable ai for trees." nature machine intelligence 2 (january 2020): 56–67.

[3] aas, kjersti, martin jullum, and anders løland. "explaining individual predictions when features are dependent: more accurate approximations to shapley values." artificial intelligence 298 (september 2021).

version history

introduced in r2021a