treebagger
ensemble of bagged decision trees
description
a treebagger
object is an ensemble of bagged decision trees for
either classification or regression. individual decision trees tend to overfit.
bagging, which stands for bootstrap aggregation, is an ensemble method that
reduces the effects of overfitting and improves generalization.
creation
the treebagger
function grows every tree in the
treebagger
ensemble model using bootstrap samples of the input data.
observations not included in a sample are considered "out-of-bag" for that tree. the function
selects a random subset of predictors for each decision split by using the random forest
algorithm [1].
syntax
mdl = treebagger(numtrees,tbl,responsevarname)
mdl = treebagger(___,name=value)
description
tip
by default, the treebagger function grows classification decision trees. to grow regression decision trees, specify the name-value argument method as "regression".
mdl = treebagger(numtrees,tbl,responsevarname) returns an ensemble object (mdl) of numtrees bagged classification trees, trained by the predictors in the table tbl and the class labels in the variable tbl.responsevarname.
mdl = treebagger(___,name=value) returns mdl with additional options specified by one or more name-value arguments, using any of the previous input argument combinations. for example, you can specify the algorithm used to select the best split predictor by using the name-value argument predictorselection.
input arguments
numtrees
— number of decision trees
positive integer
number of decision trees in the bagged ensemble, specified as a positive integer.
data types: single
| double
tbl
— sample data
table
sample data used to train the model, specified as a table. each row of
tbl
corresponds to one observation, and each column corresponds to one
predictor variable. optionally, tbl
can contain one additional column for
the response variable. multicolumn variables and cell arrays other than cell arrays of
character vectors are not allowed.
- if tbl contains the response variable, and you want to use all remaining variables in tbl as predictors, then specify the response variable by using responsevarname.
- if tbl contains the response variable, and you want to use only a subset of the remaining variables in tbl as predictors, then specify a formula by using formula.
- if tbl does not contain the response variable, then specify a response variable by using y. the length of the response variable and the number of rows in tbl must be equal.
responsevarname
— response variable name
name of variable in tbl
response variable name, specified as the name of a variable in
tbl
.
you must specify responsevarname
as a character vector or string
scalar. for example, if the response variable y
is stored as
tbl.y
, then specify it as "y"
. otherwise, the software
treats all columns of tbl
, including y
, as predictors
when training the model.
the response variable must be a categorical, character, or string array; a logical or
numeric vector; or a cell array of character vectors. if y
is a character
array, then each element of the response variable must correspond to one row of the
array.
a good practice is to specify the order of the classes by using the
classnames
name-value argument.
data types: char
| string
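for instance, a sketch that fixes the class order; the class names must have the same data type as the response (a cell array of character vectors here):
load fisheriris
Mdl = TreeBagger(10,meas,species, ...
    ClassNames={'setosa','versicolor','virginica'});   % fixes the order of the classes in the model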
formula
— explanatory model of response variable and subset of predictor variables
character vector | string scalar
explanatory model of the response variable and a subset of the predictor variables, specified as a character vector or string scalar in the form "y~x1+x2+x3". in this form, y represents the response variable, and x1, x2, and x3 represent the predictor variables.
to specify a subset of variables in tbl
as predictors for training
the model, use a formula. if you specify a formula, then the software does not use any
variables in tbl
that do not appear in
formula
.
the variable names in the formula must be both variable names in tbl
(tbl.properties.variablenames
) and valid matlab® identifiers. you can verify the variable names in tbl
by
using the isvarname
function. if the variable names
are not valid, then you can convert them by using the matlab.lang.makevalidname
function.
data types: char
| string
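for instance, a sketch using the carsmall sample data set; the formula keeps weight and horsepower as predictors and ignores displacement, even though it is present in the table:
load carsmall                                      % sample data shipped with the toolbox
cars = table(MPG,Weight,Horsepower,Displacement);
Mdl = TreeBagger(10,cars,"MPG~Weight+Horsepower",Method="regression");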
y
— class labels or response variable
categorical array | character array | string array | logical vector | numeric vector | cell array of character vectors
class labels or response variable to which the ensemble of bagged decision trees is trained, specified as a categorical, character, or string array; a logical or numeric vector; or a cell array of character vectors.
if you specify method as "classification", the following apply for the class labels y:
- each element of y defines the class membership of the corresponding row of x.
- if y is a character array, then each row must correspond to one class label.
- the treebagger function converts the class labels to a cell array of character vectors.
if you specify method as "regression", the response variable y is an n-by-1 numeric vector, where n is the number of observations. each entry in y is the response for the corresponding row of x.
the length of y
and the number of rows of x
must
be equal.
data types: categorical
| char
| string
| logical
| single
| double
| cell
x
— predictor data
numeric matrix
predictor data, specified as a numeric matrix.
each row of x
corresponds to one observation (also known as an
instance or example), and each column corresponds to one variable (also known as a
feature).
the length of y
and the number of rows of x
must
be equal.
data types: double
name-value arguments
specify optional pairs of arguments as name1=value1,...,namen=valuen, where name is the argument name and value is the corresponding value. name-value arguments must appear after other arguments, but the order of the pairs does not matter.
example: treebagger(100,x,y,method="regression",surrogate="on",oobpredictorimportance="on")
creates a bagged ensemble of 100 regression trees, and specifies to use surrogate splits and to
store the out-of-bag information for predictor importance estimation.
chunksize
— number of observations in each chunk of data
50000 (default) | positive integer
number of observations in each chunk of data, specified as a positive integer. this
option applies only when you use treebagger
on tall arrays. for more
information, see extended capabilities.
example: chunksize=10000
data types: single
| double
cost
— misclassification cost
square matrix | structure
misclassification cost, specified as a square matrix or structure.
- if you specify the square matrix cost and the true class of an observation is i, then cost(i,j) is the cost of classifying a point into class j. that is, rows correspond to the true classes and columns correspond to the predicted classes. to specify the class order for the corresponding rows and columns of cost, use the classnames name-value argument.
- if you specify the structure s, then it must have two fields:
  - s.classnames, which contains the class names as a variable of the same data type as y
  - s.classificationcosts, which contains the cost matrix with rows and columns ordered as in s.classnames
the default value is cost(i,j)=1 if i~=j, and cost(i,j)=0 if i=j.
for more information on the effect of a highly skewed cost
, see
algorithms.
example: cost=[0,1;2,0]
data types: single
| double
| struct
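for instance, a sketch of the structure form using the ionosphere sample data set (classes 'b' and 'g'); the cost values are arbitrary:
load ionosphere                       % X: 351-by-34 numeric; Y: cell array of 'b' and 'g' labels
S = struct;                           % structure with the two required fields
S.ClassNames = {'b','g'};             % same data type as Y
S.ClassificationCosts = [0 2; 1 0];   % classifying a true 'b' as 'g' costs 2
Mdl = TreeBagger(10,X,Y,Cost=S);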
categoricalpredictors
— categorical predictors list
vector of positive integers | logical vector | character matrix | string array | cell array of character vectors | "all"
categorical predictors list, specified as one of the values in this table.
value | description |
---|---|
vector of positive integers | each entry in the vector is an index value indicating that the corresponding predictor is categorical. the index values are between 1 and p, where p is the number of predictors used to train the model. |
logical vector | a true entry means that the corresponding predictor is categorical. the length of the vector is p. |
character matrix | each row of the matrix is the name of a predictor variable. the names must match the entries in predictornames. pad the names with extra blanks so each row of the character matrix has the same length. |
string array or cell array of character vectors | each element in the array is the name of a predictor variable. the names must match the entries in predictornames. |
"all" | all predictors are categorical. |
by default, if the
predictor data is in a table (tbl
), treebagger
assumes that a variable is categorical if it is a logical vector, categorical vector, character
array, string array, or cell array of character vectors. if the predictor data is a matrix
(x
), treebagger
assumes that all predictors are
continuous. to identify any other predictors as categorical predictors, specify them by using
the categoricalpredictors
name-value argument.
for the identified categorical predictors, treebagger
creates
dummy variables using two different schemes, depending on whether a categorical variable is
unordered or ordered. for an unordered categorical variable,
treebagger
creates one dummy variable for each level of the
categorical variable. for an ordered categorical variable, treebagger
creates one less dummy variable than the number of categories. for details, see automatic creation of dummy variables.
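for instance (a worked example): an unordered categorical predictor with three levels yields three dummy variables, while an ordered categorical predictor with three categories yields two.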
example: categoricalpredictors="all"
data types: single
| double
| logical
| char
| string
| cell
method
— type of decision tree
"classification"
(default) | "regression"
type of decision tree, specified as "classification"
or
"regression"
. for regression trees, y
must be
numeric.
example: method="regression"
minleafsize
— minimum number of leaf node observations
positive integer
minimum number of leaf node observations, specified as a positive integer. each leaf
has at least minleafsize
observations per tree leaf. by default,
minleafsize
is 1
for classification trees and
5
for regression trees.
example: minleafsize=4
data types: single
| double
numpredictorstosample
— number of predictor variables for each decision split
positive integer | "all"
number of predictor variables (randomly selected) for each decision split, specified as
a positive integer or "all"
. by default,
numpredictorstosample
is the square root of the number of variables
for classification trees, and one third of the number of variables for regression trees. if
the default number is not an integer, the software rounds the number to the nearest integer
in the direction of positive infinity. if you set numpredictorstosample
to any value except "all"
, the software uses breiman's random forest
algorithm [1].
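as a worked example, assuming 10 predictor variables: the default is ceil(sqrt(10)) = 4 predictors per split for classification trees, and ceil(10/3) = 4 for regression trees.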
example: numpredictorstosample=5
data types: single
| double
| char
| string
numprint
— number of grown trees after which software displays message
0 (default) | nonnegative integer
number of grown trees (training cycles) after which the software displays a message about the training progress in the command window, specified as a nonnegative integer. by default, the software displays no diagnostic messages.
example: numprint=10
data types: single
| double
inbagfraction
— fraction of input data to sample
1 (default) | positive scalar
fraction of input data to sample with replacement from the input data for growing each new tree, specified as a positive scalar in the range (0,1].
example: inbagfraction=0.5
data types: single
| double
oobprediction
— indicator to store out-of-bag information
"off"
(default) | "on"
indicator to store out-of-bag information in the ensemble, specified as
"on"
or "off"
. specify
oobprediction
as "on"
to store information on which
observations are out-of-bag for each tree. treebagger
can use this
information to compute the predicted class probabilities for each tree in the
ensemble.
example: oobprediction="off"
oobpredictorimportance
— indicator to store out-of-bag estimates of feature importance
"off"
(default) | "on"
indicator to store out-of-bag estimates of feature importance in the ensemble,
specified as "on"
or "off"
. if you specify
oobpredictorimportance
as "on"
, the
treebagger
function sets oobprediction
to
"on"
. if you want to analyze predictor importance, specify
predictorselection
as "curvature"
or
"interaction-curvature"
.
example: oobpredictorimportance="on"
options
— options for running computations in parallel and setting random streams
structure
options for running computations in parallel and setting random streams, specified as a structure. create the options structure using statset. this table lists the option fields and their values.
field name | value | default |
---|---|---|
useparallel | set this value to true to run computations in parallel. | false |
usesubstreams | set this value to true to run computations in a reproducible manner. to compute reproducibly, set streams to a type that allows substreams: "mlfg6331_64" or "mrg32k3a". | false |
streams | specify this value as a randstream object or a cell array consisting of one such object. | if you do not specify streams, then treebagger uses the default stream. |
note
you need parallel computing toolbox™ to run computations in parallel.
example: options=statset(useparallel=true)
data types: struct
predictornames
— predictor variable names
string array of unique names | cell array of unique character vectors
predictor variable names, specified as a string array of unique names or cell array of
unique character vectors. the functionality of predictornames
depends
on how you supply the training data.
- if you supply x and y, then you can use predictornames to assign names to the predictor variables in x.
  - the order of the names in predictornames must correspond to the column order of x. that is, predictornames{1} is the name of x(:,1), predictornames{2} is the name of x(:,2), and so on. also, size(x,2) and numel(predictornames) must be equal.
  - by default, predictornames is {'x1','x2',...}.
- if you supply tbl, then you can use predictornames to choose which predictor variables to use in training. that is, treebagger uses only the predictor variables in predictornames and the response variable during training.
  - predictornames must be a subset of tbl.properties.variablenames and cannot include the name of the response variable.
  - by default, predictornames contains the names of all predictor variables.
- a good practice is to specify the predictors for training using either predictornames or formula, but not both.
example: predictornames=["sepallength","sepalwidth","petallength","petalwidth"]
data types: string
| cell
samplewithreplacement
— indicator for sampling with replacement
"on"
(default) | "off"
indicator for sampling with replacement, specified as "on"
or
"off"
. specify samplewithreplacement
as
"on"
to sample with replacement, or as "off"
to
sample without replacement. if you set samplewithreplacement
to
"off"
, you must set the name-value argument
inbagfraction
to a value less than 1.
example: samplewithreplacement="on"
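for instance, a sketch of sampling without replacement; the fraction 0.6 is arbitrary:
load fisheriris
Mdl = TreeBagger(10,meas,species, ...
    SampleWithReplacement="off",InBagFraction=0.6);   % each tree trains on 60% of the data, drawn without replacement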
prior
— prior probability for each class
"empirical" (default) | "uniform" | numeric vector | structure array
prior probability for each class, specified as a value in this table.
value | description |
---|---|
"empirical" | the class prior probabilities are the class relative frequencies in y. |
"uniform" | all class prior probabilities are equal to 1/k, where k is the number of classes. |
numeric vector | each element in the vector is a class prior probability. order the elements according to mdl.classnames, or specify the order using the classnames name-value argument. the software normalizes the elements to sum to 1. |
structure | a structure s with two fields: s.classnames, which contains the class names as a variable of the same data type as y, and s.classprobs, which contains a vector of corresponding prior probabilities. the software normalizes the elements of s.classprobs to sum to 1. |
if you specify a cost matrix, the prior property of the treebagger model stores the prior probabilities adjusted for the misclassification cost. for more details, see algorithms.
example: prior=struct(classnames=["setosa" "versicolor"
"virginica"],classprobs=1:3)
data types: char
| string
| single
| double
| struct
note
in addition to its name-value arguments, the treebagger
function
accepts the name-value arguments of fitctree
and fitrtree
listed in additional name-value arguments of treebagger function.
output arguments
mdl
— ensemble of bagged decision trees
treebagger
object
ensemble of bagged decision trees, returned as a treebagger
object.
properties
bagging properties
computeoobprediction
— indicator to compute out-of-bag predictions for training observations
false
or 0 (default) | true
or 1
this property is read-only.
indicator to compute out-of-bag predictions for training observations, specified as a
numeric or logical 1 (true
) or 0 (false
). if this
property is true
:
- the treebagger object has the properties oobindices and oobinstanceweight.
- you can use the object functions ooberror, oobmargin, and oobmeanmargin.
computeoobpredictorimportance
— indicator to compute out-of-bag variable importance
false
or 0 (default) | true
or 1
this property is read-only.
indicator to compute the out-of-bag variable importance, specified as a numeric or
logical 1 (true
) or 0 (false
). if this property is
true
:
- the treebagger object has the properties oobpermutedpredictordeltaerror, oobpermutedpredictordeltameanmargin, and oobpermutedpredictorcountraisemargin.
- the property computeoobprediction is also true.
inbagfraction
— fraction of observations that are randomly selected
1 (default) | numeric scalar
this property is read-only.
fraction of observations that are randomly selected with replacement (in-bag
observations) for each bootstrap replica, specified as a numeric scalar. the size of each
replica is nobs×inbagfraction
, where
nobs is the number of observations in the training data.
data types: single
| double
oobindices
— out-of-bag indices
logical array
this property is read-only.
out-of-bag indices, specified as a logical array. this property is a nobs-by-numtrees array, where nobs is the number of observations in the training data, and numtrees is the number of trees in the ensemble. if the element oobindices(i,j) is true, then observation i is out-of-bag for tree j (that is, the treebagger function did not select observation i for the training data used to grow tree j).
oobinstanceweight
— number of out-of-bag trees for each observation
numeric vector
this property is read-only.
number of out-of-bag trees for each observation, specified as a numeric vector. this property is a nobs-by-1 vector, where nobs is the number of observations in the training data. the element oobinstanceweight(i) contains the number of trees used for computing the out-of-bag response for observation i.
data types: single
| double
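a sketch of how these two properties relate; the row sums of oobindices should recover oobinstanceweight:
load fisheriris
Mdl = TreeBagger(10,meas,species,OOBPrediction="on");
nOOB = sum(Mdl.OOBIndices,2);           % trees for which each observation is out-of-bag
isequal(nOOB,Mdl.OOBInstanceWeight)     % expected to return logical 1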
oobpermutedpredictorcountraisemargin
— predictor variable importance for raising margin
numeric vector
this property is read-only.
predictor variable (feature) importance for raising the margin, specified as a numeric vector. this property is a 1-by-nvars vector, where nvars is the number of variables in the training data. for each variable, the measure is the difference between the number of raised margins and the number of lowered margins if the values of that variable are permuted across the out-of-bag observations. this measure is computed for every tree, then averaged over the entire ensemble and divided by the standard deviation over the entire ensemble.
this property is empty ([]
) for regression trees.
data types: single
| double
oobpermutedpredictordeltaerror
— predictor variable importance for prediction error
numeric vector
this property is read-only.
predictor variable (feature) importance for prediction error, specified as a numeric vector. this property is a 1-by-nvars vector, where nvars is the number of variables (columns) in the training data. for each variable, the measure is the increase in prediction error if the values of that variable are permuted across the out-of-bag observations. this measure is computed for every tree, then averaged over the entire ensemble and divided by the standard deviation over the entire ensemble.
data types: single
| double
oobpermutedpredictordeltameanmargin
— predictor variable importance for classification margin
numeric vector
this property is read-only.
predictor variable (feature) importance for the classification margin, specified as numeric vector. this property is a 1-by-nvars vector, where nvars is the number of variables (columns) in the training data. for each variable, the measure is the decrease in the classification margin if the values of that variable are permuted across the out-of-bag observations. this measure is computed for every tree, then averaged over the entire ensemble and divided by the standard deviation over the entire ensemble.
this property is empty ([]
) for regression trees.
data types: single
| double
tree properties
deltacriteriondecisionsplit
— split criterion contributions for each predictor
numeric vector
this property is read-only.
split criterion contributions for each predictor, specified as a numeric vector. this property is a 1-by-nvars vector, where nvars is the number of predictor variables in the training data. the software sums the changes in the split criterion over splits on each variable, then averages the sums across the entire ensemble of grown trees.
data types: single
| double
mergeleaves
— indicator to merge leaves
false
or 0 (default) | true
or 1
this property is read-only.
indicator to merge leaves, specified as a numeric or logical 1 (true
)
or 0 (false
). this property is true
if the software
merges the decision tree leaves with the same parent, for splits that do not decrease the
total risk. otherwise, this property is false
.
minleafsize
— minimum number of leaf node observations
positive integer
this property is read-only.
minimum number of leaf node observations, specified as a positive integer. each leaf has
at least minleafsize
observations. by default,
minleafsize
is 1 for classification trees and 5 for regression trees.
for decision tree training, fitctree
and fitrtree
set the name-value argument minparentsize
to
2*minleafsize
.
data types: single
| double
numtrees
— number of decision trees
positive integer
this property is read-only.
number of decision trees in the bagged ensemble, specified as a positive integer.
data types: single
| double
prune
— indicator to estimate optimal sequence of pruned subtrees
false
or 0 (default) | true
or 1
this property is read-only.
indicator to estimate the optimal sequence of pruned subtrees, specified as a numeric
or logical 1 (true
) or 0 (false
). the
prune
property is true
if the decision trees are
pruned, and false
if they are not. pruning decision trees is not
recommended for ensembles.
samplewithreplacement
— indicator to sample decision tree with replacement
true
or 1 (default) | false
or 0
this property is read-only.
indicator to sample each decision tree with replacement, specified as a numeric or
logical 1 (true
) or 0 (false
). this property is
true
if the treebagger
function samples each
decision tree with replacement, and false
otherwise.
surrogateassociation
— predictive measures of variable association
numeric matrix
this property is read-only.
predictive measures of variable association, specified as a numeric matrix. this property is an nvars-by-nvars matrix, where nvars is the number of predictor variables. the property contains the predictive measures of variable association, averaged across the entire ensemble of grown trees.
- if you grow the ensemble with the surrogate name-value argument set to "on", this matrix, for each tree, is filled with the predictive measures of association averaged over the surrogate splits.
- if you grow the ensemble with the surrogate name-value argument set to "off", the surrogateassociation property is an identity matrix. by default, surrogate is set to "off".
data types: single
| double
treearguments
— name-value arguments specified for treebagger
function
cell array
this property is read-only.
name-value arguments specified for the treebagger
function,
specified as a cell array. the treebagger
function uses these name-value
arguments when it grows new trees for the bagged ensemble.
trees
— decision trees in ensemble
cell array
this property is read-only.
decision trees in the bagged ensemble, specified as a numtrees
-by-1 cell
array. each tree is a compactclassificationtree
or
compactregressiontree
object.
predictor properties
numpredictorsplit
— number of decision splits for each predictor
numeric vector
this property is read-only.
number of decision splits for each predictor, specified as a numeric vector. this property is
a 1-by-nvars vector, where
nvars is the number of predictor
variables. each element of
numpredictorsplit
represents
the number of splits on the predictor summed over all
trees.
data types: single
| double
numpredictorstosample
— number of predictor variables to select
positive integer
this property is read-only.
number of predictor variables to select at random for each decision split, specified as a positive integer. by default, this property is the square root of the total number of variables for classification trees, and one third of the total number of variables for regression trees.
data types: single
| double
outliermeasure
— outlier measure for each observation
numeric vector
this property is read-only.
outlier measure for each observation, specified as a numeric vector. this property is a nobs-by-1 vector, where nobs is the number of observations in the training data.
data types: single
| double
predictornames
— predictor names
cell array of character vectors
this property is read-only.
predictor names, specified as a cell array of character vectors. the order of the elements in predictornames
corresponds to the order in which the predictor names appear in the training data x
.
x
— predictors
numeric array
this property is read-only.
predictors used to train the bagged ensemble, specified as a numeric array. this property is a nobs-by-nvars array, where nobs is the number of observations (rows) and nvars is the number of variables (columns) in the training data.
data types: single
| double
response properties
defaultyfit
— default prediction value
""
| "mostpopular"
| numeric scalar
default prediction value returned by predict
or
oobpredict
, specified as ""
,
"mostpopular"
, or a numeric scalar. this property controls the predicted
value returned by the predict
or oobpredict
object
function when no prediction is possible (for example, when oobpredict
predicts a response for an observation that is in-bag for all trees in the ensemble).
- for classification trees, you can set defaultyfit to either "" or "mostpopular". if you specify "mostpopular" (default for classification), the property value is the name of the most probable class in the training data. if you specify "", the in-bag observations are excluded from computation of the out-of-bag error and margin.
- for regression trees, you can set defaultyfit to any numeric scalar. the default value for regression is the mean of the response for the training data. if you set defaultyfit to nan, the in-bag observations are excluded from computation of the out-of-bag error and margin.
example: mdl.defaultyfit="mostpopular"
data types: single
| double
| char
| string
y
— class labels or response data
cell array of character vectors | numeric vector
this property is read-only.
class labels or response data, specified as a cell array of character vectors or a numeric vector.
- if you set the method name-value argument to "classification", this property represents class labels. each row of y represents the observed classification of the corresponding row of x.
- if you set the method name-value argument to "regression", this property represents response data and is a numeric vector.
data types: single
| double
| cell
training properties
method
— type of ensemble
"classification"
| "regression"
this property is read-only.
type of ensemble, specified as "classification"
for classification
ensembles or "regression"
for regression ensembles.
proximity
— proximity between training data observations
numeric array
this property is read-only.
proximity between training data observations, specified as a numeric array. this property is a nobs-by-nobs array, where nobs is the number of observations in the training data. the array contains measures of the proximity between observations. for any two observations, their proximity is defined as the fraction of trees for which these observations land on the same leaf. the array is symmetric, with ones on the diagonal and off-diagonal elements ranging from 0 to 1.
data types: single
| double
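a sketch of computing this property with the fillprox object function (listed under measure proximity):
load fisheriris
Mdl = TreeBagger(20,meas,species);
Mdl = fillprox(Mdl);           % populates the proximity (and outlier measure) properties
Mdl.Proximity(1:3,1:3)         % symmetric block with ones on the diagonal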
w
— observation weights
vector of nonnegative values
this property is read-only.
observation weights, specified as a vector of nonnegative values. this property has the
same number of rows as y
. each entry in w
specifies
the relative importance of the corresponding observation in y
. the
treebagger
function uses the observation weights to grow each decision
tree in the ensemble.
data types: single
| double
classification properties
classnames
— unique class names
cell array of character vectors
this property is read-only.
unique class names used in the training model, specified as a cell array of character vectors.
this property is empty ([]
) for regression trees.
cost
— misclassification cost
numeric square matrix
this property is read-only.
misclassification cost, specified as a numeric square matrix. the element
cost(i,j)
is the cost of classifying a point into class
j
if its true class is i
. the rows correspond to the
true class and the columns correspond to the predicted class. the order of the rows and
columns of cost
corresponds to the order of the classes in
classnames
.
this property is empty ([]
) for regression trees.
data types: single
| double
prior
— prior probabilities
numeric vector
this property is read-only.
prior probabilities, specified as a numeric vector. the order of the elements in
prior
corresponds to the order of the elements in
mdl.classnames
.
if you specify a cost matrix by using the cost
name-value argument
of the treebagger
function, the prior
property of the
treebagger
model object stores the prior probabilities (specified by the
prior
name-value argument) adjusted for the misclassification cost. for
more details, see algorithms.
this property is empty ([]
) for regression trees.
data types: single
| double
object functions
create compacttreebagger
function | description |
---|---|
compact | compact ensemble of decision trees |
modify ensemble
function | description |
---|---|
append | append new trees to ensemble |
growtrees | train additional trees and add to ensemble |
interpret ensemble
function | description |
---|---|
partialdependence | compute partial dependence |
plotpartialdependence | create partial dependence plot (pdp) and individual conditional expectation (ice) plots |
measure performance
function | description |
---|---|
error | error (misclassification probability or mse) |
meanmargin | mean classification margin |
margin | classification margin |
ooberror | out-of-bag error |
oobmeanmargin | out-of-bag mean margins |
oobmargin | out-of-bag margins |
oobquantileerror | out-of-bag quantile loss of bag of regression trees |
quantileerror | quantile loss using bag of regression trees |
predict responses
function | description |
---|---|
oobpredict | ensemble predictions for out-of-bag observations |
oobquantilepredict | quantile predictions for out-of-bag observations from bag of regression trees |
predict | predict responses using ensemble of bagged decision trees |
quantilepredict | predict response quantile using bag of regression trees |
measure proximity
function | description |
---|---|
fillprox | proximity matrix for training data |
mdsprox | multidimensional scaling of proximity matrix |
examples
train ensemble of bagged classification trees
create an ensemble of bagged classification trees for fisher's iris data set. then, view the first grown tree, plot the out-of-bag classification error, and predict labels for out-of-bag observations.
load the fisheriris
data set. create x
as a numeric matrix that contains four petal measurements for 150 irises. create y
as a cell array of character vectors that contains the corresponding iris species.
load fisheriris
x = meas;
y = species;
set the random number generator to default
for reproducibility.
rng("default")
train an ensemble of bagged classification trees using the entire data set. specify 50
weak learners. store the out-of-bag observations for each tree. by default, treebagger
grows deep trees.
mdl = treebagger(50,x,y, ...
    method="classification", ...
    oobprediction="on")
mdl = 
  treebagger ensemble with 50 bagged decision trees:
                     training x: [150x4]
                     training y: [150x1]
                         method: classification
                  numpredictors: 4
          numpredictorstosample: 2
                    minleafsize: 1
                  inbagfraction: 1
          samplewithreplacement: 1
           computeoobprediction: 1
  computeoobpredictorimportance: 0
                      proximity: []
                     classnames: 'setosa' 'versicolor' 'virginica'
  properties, methods
mdl
is a treebagger
ensemble for classification trees.
the mdl.trees
property is a 50-by-1 cell vector that contains the trained classification trees for the ensemble. each tree is a compactclassificationtree
object. view the graphical display of the first trained classification tree.
view(mdl.trees{1},mode="graph")
plot the out-of-bag classification error over the number of grown classification trees.
plot(ooberror(mdl))
xlabel("number of grown trees")
ylabel("out-of-bag classification error")
the out-of-bag error decreases as the number of grown trees increases.
predict labels for out-of-bag observations. display the results for a random set of 10 observations.
ooblabels = oobpredict(mdl);
ind = randsample(length(ooblabels),10);
table(y(ind),ooblabels(ind), ...
    variablenames=["truelabel" "predictedlabel"])
ans=10×2 table
truelabel predictedlabel
______________ ______________
{'setosa' } {'setosa' }
{'virginica' } {'virginica' }
{'setosa' } {'setosa' }
{'virginica' } {'virginica' }
{'setosa' } {'setosa' }
{'virginica' } {'virginica' }
{'setosa' } {'setosa' }
{'versicolor'} {'versicolor'}
{'versicolor'} {'virginica' }
{'virginica' } {'virginica' }
train ensemble of bagged regression trees
create an ensemble of bagged regression trees for the carsmall
data set. then, predict conditional mean responses and conditional quartiles.
load the carsmall
data set. create x
as a numeric vector that contains the car engine displacement values. create y
as a numeric vector that contains the corresponding miles per gallon.
load carsmall
x = displacement;
y = mpg;
set the random number generator to default
for reproducibility.
rng("default")
train an ensemble of bagged regression trees using the entire data set. specify 100 weak learners.
mdl = treebagger(100,x,y, ...
    method="regression")
mdl = 
  treebagger ensemble with 100 bagged decision trees:
                     training x: [94x1]
                     training y: [94x1]
                         method: regression
                  numpredictors: 1
          numpredictorstosample: 1
                    minleafsize: 5
                  inbagfraction: 1
          samplewithreplacement: 1
           computeoobprediction: 0
  computeoobpredictorimportance: 0
                      proximity: []
  properties, methods
mdl
is a treebagger
ensemble for regression trees.
for 10 equally spaced engine displacements between the minimum and maximum in-sample displacement, predict conditional mean responses (ymean
) and conditional quartiles (yquartiles
).
predx = linspace(min(x),max(x),10)';
ymean = predict(mdl,predx);
yquartiles = quantilepredict(mdl,predx,...
quantile=[0.25,0.5,0.75]);
plot the observations, estimated mean responses, and estimated quartiles.
hold on
plot(x,y,"o")
plot(predx,ymean)
plot(predx,yquartiles)
hold off
ylabel("fuel economy")
xlabel("engine displacement")
legend("data","mean response","first quartile","median","third quartile")
unbiased predictor importance estimates for bagged regression trees
create two ensembles of bagged regression trees, one using the standard cart algorithm for splitting predictors, and the other using the curvature test for splitting predictors. then, compare the predictor importance estimates for the two ensembles.
load the carsmall
data set and convert the variables cylinders
, mfg
, and model_year
to categorical variables. then, display the number of categories represented in the categorical variables.
load carsmall
cylinders = categorical(cylinders);
mfg = categorical(cellstr(mfg));
model_year = categorical(model_year);
numel(categories(cylinders))
ans = 3
numel(categories(mfg))
ans = 28
numel(categories(model_year))
ans = 3
create a table that contains eight car metrics.
tbl = table(acceleration,cylinders,displacement,...
horsepower,mfg,model_year,weight,mpg);
set the random number generator to default
for reproducibility.
rng("default")
train an ensemble of 200 bagged regression trees using the entire data set. because the data has missing values, specify to use surrogate splits. store the out-of-bag information for predictor importance estimation.
by default, treebagger uses the standard cart algorithm for splitting predictors. because the variables cylinders and model_year each contain only three categories, the standard cart algorithm prefers splitting a continuous predictor over these two variables.
mdlcart = treebagger(200,tbl,"mpg", ...
    method="regression",surrogate="on", ...
    oobpredictorimportance="on");
treebagger
stores predictor importance estimates in the property oobpermutedpredictordeltaerror
.
impcart = mdlcart.oobpermutedpredictordeltaerror;
train a random forest of 200 regression trees using the entire data set. to grow unbiased trees, specify to use the curvature test for splitting predictors.
mdlunbiased = treebagger(200,tbl,"mpg", ...
    method="regression",surrogate="on", ...
    predictorselection="curvature", ...
    oobpredictorimportance="on");
impunbiased = mdlunbiased.oobpermutedpredictordeltaerror;
create bar graphs to compare the predictor importance estimates impcart
and impunbiased
for the two ensembles.
tiledlayout(1,2,padding="compact");
nexttile
bar(impcart)
title("standard cart")
ylabel("predictor importance estimates")
xlabel("predictors")
h = gca;
h.xticklabel = mdlcart.predictornames;
h.xticklabelrotation = 45;
h.ticklabelinterpreter = "none";
nexttile
bar(impunbiased)
title("curvature test")
ylabel("predictor importance estimates")
xlabel("predictors")
h = gca;
h.xticklabel = mdlunbiased.predictornames;
h.xticklabelrotation = 45;
h.ticklabelinterpreter = "none";
for the cart model, the continuous predictor weight
is the second most important predictor. for the unbiased model, the predictor importance of weight
is smaller in value and ranking.
train ensemble of bagged classification trees on tall array
train an ensemble of bagged classification trees for observations in a tall array, and find the misclassification probability of each tree in the model for weighted observations. this example uses the data set airlinesmall.csv
, a large data set that contains a tabular file of airline flight data.
when you perform calculations on tall arrays, matlab® uses either a parallel pool (default if you have parallel computing toolbox™) or the local matlab session. to run the example using the local matlab session when you have parallel computing toolbox, change the global execution environment by using the mapreducer function.
mapreducer(0)
create a datastore that references the location of the folder containing the data set. select a subset of the variables to work with, and treat "na"
values as missing data so that the datastore
function replaces them with nan
values. create the tall table tt
to contain the data in the datastore.
ds = datastore("airlinesmall.csv");
ds.selectedvariablenames = ["month" "dayofmonth" "dayofweek" ...
    "deptime" "arrdelay" "distance" "depdelay"];
ds.treatasmissing = "na";
tt = tall(ds)
tt =
  mx7 tall table
    month    dayofmonth    dayofweek    deptime    arrdelay    distance    depdelay
    _____    __________    _________    _______    ________    ________    ________
     10          21            3          642          8          308         12
     10          26            1         1021          8          296          1
     10          23            5         2055         21          480         20
     10          23            5         1332         13          296         12
     10          22            4          629          4          373         -1
     10          28            3         1446         59          308         63
     10           8            4          928          3          447         -2
     10          10            6          859         11          954         -1
      :           :            :           :           :           :           :
      :           :            :           :           :           :           :
determine the flights that are late by 10 minutes or more by defining a logical variable that is true for a late flight. this variable contains the class labels y
. a preview of this variable includes the first few rows.
y = tt.depdelay > 10
y =
  mx1 tall logical array
  1
  0
  1
  1
  0
  1
  0
  0
  :
  :
create a tall array x
for the predictor data.
x = tt{:,1:end-1}
x =
  mx6 tall double matrix
  10    21     3     642     8    308
  10    26     1    1021     8    296
  10    23     5    2055    21    480
  10    23     5    1332    13    296
  10    22     4     629     4    373
  10    28     3    1446    59    308
  10     8     4     928     3    447
  10    10     6     859    11    954
   :     :     :       :     :      :
   :     :     :       :     :      :
create a tall array w
for the observation weights by arbitrarily assigning double weights to the observations in class 1.
w = y+1;
remove the rows in x
, y
, and w
that contain missing data.
r = rmmissing([x y w]);
x = r(:,1:end-2);
y = r(:,end-1);
w = r(:,end);
train an ensemble of 20 bagged classification trees using the entire data set. specify a weight vector and uniform prior probabilities. for reproducibility, set the seeds of the random number generators using rng
and tallrng
. the results can vary depending on the number of workers and the execution environment for the tall arrays. for details, see .
rng("default")
tallrng("default")
tmdl = treebagger(20,x,y, ...
    weights=w,prior="uniform")
evaluating tall expression using the local matlab session:
- pass 1 of 1: completed in 0.72 sec
evaluation completed in 0.89 sec
evaluating tall expression using the local matlab session:
- pass 1 of 1: completed in 1.2 sec
evaluation completed in 1.3 sec
evaluating tall expression using the local matlab session:
- pass 1 of 1: completed in 4.3 sec
evaluation completed in 4.3 sec
tmdl =
  compacttreebagger ensemble with 20 bagged decision trees:
            method: classification
     numpredictors: 6
        classnames: '0' '1'
  properties, methods
tmdl
is a compacttreebagger
ensemble with 20 bagged decision trees. for tall data, the treebagger
function returns a compacttreebagger
object.
calculate the misclassification probability of each tree in the model. attribute a weight contained in the vector w
to each observation by using the weights
name-value argument.
terr = error(tmdl,x,y,weights=w)
evaluating tall expression using the local matlab session:
- pass 1 of 1: completed in 4.8 sec
evaluation completed in 4.8 sec
terr = 20×1
0.1420
0.1214
0.1115
0.1078
0.1037
0.1027
0.1005
0.0997
0.0981
0.0983
⋮
find the average misclassification probability for the ensemble of decision trees.
avg_terr = mean(terr)
avg_terr = 0.1022
more about
additional name-value arguments of treebagger function
in addition to its name-value arguments, the treebagger function accepts the following name-value arguments of fitctree and fitrtree.
supported fitctree arguments | supported fitrtree arguments |
---|---|
algorithmforcategorical | maxnumsplits |
classnames * | mergeleaves |
maxnumcategories | predictorselection |
maxnumsplits | prune |
mergeleaves | prunecriterion |
predictorselection | quadraticerrortolerance |
prune | splitcriterion |
prunecriterion | surrogate |
splitcriterion | weights |
surrogate | n/a |
weights | n/a |
*when you specify the classnames name-value argument as a logical vector, use 0 and 1 values. do not use false and true values. for example, you can specify classnames as [1 0].
tips
- for a treebagger model mdl, the trees property contains a cell vector of mdl.numtrees compactclassificationtree or compactregressiontree objects. view the graphical display of the t-th grown tree by entering:
  view(mdl.trees{t})
- for regression problems, treebagger supports mean and quantile regression (that is, quantile regression forest [5]).
  - to predict mean responses or estimate the mean squared error given data, pass a treebagger model object and the data to predict or error, respectively. to perform similar operations for out-of-bag observations, use oobpredict or ooberror.
  - to estimate quantiles of the response distribution or the quantile error given data, pass a treebagger model object and the data to quantilepredict or quantileerror, respectively. to perform similar operations for out-of-bag observations, use oobquantilepredict or oobquantileerror.
- standard cart tends to select split predictors containing many distinct values, such as continuous variables, over those containing few distinct values, such as categorical variables [4]. consider specifying the curvature or interaction test if either of the following is true:
  - the data has predictors with relatively fewer distinct values than other predictors; for example, the predictor data set is heterogeneous.
  - your goal is to analyze predictor importance. treebagger stores predictor importance estimates in the oobpermutedpredictordeltaerror property.
for more information on predictor selection, see the name-value argument
predictorselection
for classification trees or the name-value argumentpredictorselection
for regression trees.
algorithms
- if you specify the cost, prior, and weights name-value arguments, the output model object stores the specified values in the cost, prior, and w properties, respectively. the cost property stores the user-specified cost matrix (c) without modification. the prior and w properties store the prior probabilities and observation weights, respectively, after normalization. for model training, the software updates the prior probabilities and observation weights to incorporate the penalties described in the cost matrix.
- the treebagger function generates in-bag samples by oversampling classes with large misclassification costs and undersampling classes with small misclassification costs. consequently, out-of-bag samples have fewer observations from classes with large misclassification costs and more observations from classes with small misclassification costs. if you train a classification ensemble using a small data set and a highly skewed cost matrix, then the number of out-of-bag observations per class might be very low. therefore, the estimated out-of-bag error might have a large variance and be difficult to interpret. the same phenomenon can occur for classes with large prior probabilities.
- for details on how the treebagger function selects split predictors, and for information on node-splitting algorithms when the function grows decision trees, see algorithms for classification trees and algorithms for regression trees.
alternative functionality
statistics and machine learning toolbox™ offers three objects for bagging and random forest:
- classificationbaggedensemble object created by the fitcensemble function for classification
- regressionbaggedensemble object created by the fitrensemble function for regression
- treebagger object created by the treebagger function for classification and regression
for details about the differences between treebagger and bagged ensembles (classificationbaggedensemble and regressionbaggedensemble), see comparison of treebagger and bagged ensembles.
references
[1] Breiman, Leo. "Random Forests." Machine Learning 45 (2001): 5–32.
[2] Breiman, Leo, Jerome Friedman, Charles J. Stone, and R. A. Olshen. Classification and Regression Trees. Boca Raton, FL: CRC Press, 1984.
[3] Loh, Wei-Yin. "Regression Trees with Unbiased Variable Selection and Interaction Detection." Statistica Sinica 12, no. 2 (2002): 361–386.
[4] Loh, Wei-Yin, and Yu-Shan Shih. "Split Selection for Classification Trees." Statistica Sinica 7, no. 4 (1997): 815–840.
[5] Meinshausen, Nicolai. "Quantile Regression Forests." Journal of Machine Learning Research 7, no. 35 (2006): 983–999.
[6] Genuer, Robin, Jean-Michel Poggi, Christine Tuleau-Malot, and Nathalie Villa-Vialaneix. "Random Forests for Big Data." Big Data Research 9 (2017): 28–46.
extended capabilities
tall arrays
calculate with arrays that have more rows than fit in memory.
this function supports tall arrays with the following limitations.
- the treebagger function supports these syntaxes for tall x, y, and tbl:
  - b = treebagger(numtrees,tbl,y)
  - b = treebagger(numtrees,x,y)
  - b = treebagger(___,name=value)
- for tall arrays, the treebagger function supports classification but not regression.
- the treebagger function supports these name-value arguments (see the worked example after this list):
  - numpredictorstosample — the default value is the square root of the number of variables for classification.
  - minleafsize — the default value is 1 if the number of observations is less than 50,000. if the number of observations is 50,000 or greater, then the default value is max(1,min(5,floor(0.01*nobschunk))), where nobschunk is the number of observations in a chunk.
  - chunksize (only for tall arrays) — the default value is 50000.
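as a worked example: for a chunk of nobschunk = 50,000 observations, the default minleafsize is max(1,min(5,floor(0.01*50000))) = max(1,min(5,500)) = 5.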
- in addition, the treebagger function supports these name-value arguments of fitctree:
  - algorithmforcategorical
  - categoricalpredictors
  - cost — the columns of the cost matrix c cannot contain inf or nan values.
  - maxnumcategories
  - maxnumsplits
  - mergeleaves
  - predictornames
  - predictorselection
  - prior
  - prune
  - prunecriterion
  - splitcriterion
  - surrogate
  - weights
- for tall data, the treebagger function returns a compacttreebagger object that contains most of the same properties as a full treebagger object. the main difference is that the compact object is more memory efficient. the compact object does not contain properties that include the data, or that include an array of the same size as the data.
- the number of trees contained in the returned compacttreebagger object can differ from the number of trees specified as input to the treebagger function. treebagger determines the number of trees to return based on factors that include the size of the input data set and the number of data chunks available to grow trees.
- supported compacttreebagger object functions are:
  - combine
  - error
  - margin
  - meanmargin
  - predict
  - setdefaultyfit
- the error, margin, meanmargin, and predict object functions do not support the name-value arguments trees, treeweights, or useinstancefortree. the meanmargin function also does not support the weights name-value argument.
- the treebagger function creates a random forest by generating trees on disjoint chunks of the data. when more data is available than is required to create the random forest, the function subsamples the data. for a similar example, see [6].
- depending on how the data is stored, some chunks of data might contain observations from only a few classes out of all the classes. in this case, the treebagger function might produce inferior results compared to the case where each chunk of data contains observations from most of the classes.
- during training of the treebagger algorithm, the speed, accuracy, and memory usage depend on a number of factors. these factors include the values for numtrees and the name-value arguments chunksize, minleafsize, and maxnumsplits.
x
,treebagger
implements sampling during training. this sampling depends on these variables:number of trees
numtrees
chunk size
chunksize
number of observations n
number of chunks r (approximately equal to
n/chunksize
)
because the value of n is fixed for a given
x
, your settings fornumtrees
andchunksize
determine howtreebagger
samplesx
.if r >
numtrees
, thentreebagger
sampleschunksize * numtrees
observations fromx
, and trains one tree per chunk (with each chunk containingchunksize
number of observations). this scenario is the most common when you work with tall arrays.if r ≤
numtrees
, thentreebagger
trains approximatelynumtrees/r
trees in each chunk, using bootstrapping within the chunk.if n ≤
chunksize
, thentreebagger
uses bootstrapping to generate samples (each of size n) on which to train individual trees.
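as a worked example with hypothetical sizes: for n = 1,000,000 observations and chunksize = 50,000, the number of chunks is r = 20. specifying numtrees = 10 gives r > numtrees, so treebagger samples 50,000 * 10 = 500,000 observations and trains one tree per chunk. specifying numtrees = 100 gives r ≤ numtrees, so treebagger trains approximately 100/20 = 5 trees per chunk by bootstrapping within each chunk.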
- when you specify a value for numtrees, consider the following:
  - if you run your code on apache® spark™, and your data set is distributed with hadoop® distributed file system (hdfs™), start by specifying a value for numtrees that is at least twice the number of partitions in hdfs for your data set. this setting prevents excessive data communication among apache spark executors and can improve performance of the treebagger algorithm.
  - treebagger copies fitted trees into the client memory in the resulting compacttreebagger model. therefore, the amount of memory available to the client creates an upper bound on the value you can set for numtrees. you can tune the values of minleafsize and maxnumsplits for more efficient speed and memory usage at the expense of some predictive accuracy. after tuning, if the value of numtrees is less than twice the number of partitions in hdfs for your data set, then consider repartitioning your data in hdfs to have larger partitions.
- after you specify a value for numtrees, set chunksize to ensure that treebagger uses most of the data to grow trees. ideally, chunksize * numtrees should approximate n, the number of rows in your data. note that the memory available in the workers for training individual trees can also determine an upper bound for chunksize.
- you can adjust the apache spark memory properties to avoid out-of-memory errors and support your workflow. see the parallel computing toolbox documentation for more information.
automatic parallel support
accelerate code by automatically running computation in parallel using parallel computing toolbox™.
to run in parallel, specify the options name-value argument in the call to this function, and set the useparallel field of the options structure to true using statset:
"options",statset("useparallel",true)
for more information about parallel computing, see (parallel computing toolbox).
version history
introduced in r2009a
r2022a: cost property stores the user-specified cost matrix
starting in r2022a, the cost
property stores the user-specified cost
matrix. the software stores normalized prior probabilities (prior
) and
observation weights (w
) that do not reflect the penalties described in the
cost matrix.
note that model training has not changed and, therefore, the decision boundaries between classes have not changed.
for training, the fitting function updates the specified prior probabilities by
incorporating the penalties described in the specified cost matrix, and then normalizes the
prior probabilities and observation weights. this behavior has not changed. in previous
releases, the software stored the default cost matrix in the cost
property
and stored the prior probabilities and observation weights used for training in the
prior
and w
properties, respectively. starting in
r2022a, the software stores the user-specified cost matrix without modification, and stores
normalized prior probabilities and observation weights that do not reflect the cost penalties.
for more details, see .
the ooberror
and oobmeanmargin
functions use the
observation weights stored in the w
property. therefore, if you specify a
nondefault cost matrix when you train a classification model, the object functions return a
different value compared to previous releases.
if you want the software to handle the cost matrix, prior
probabilities, and observation weights as in previous releases, adjust the prior probabilities
and observation weights for the nondefault cost matrix, as described in . then, when you train a
classification model, specify the adjusted prior probabilities and observation weights by using
the prior
and weights
name-value arguments, respectively,
and use the default cost matrix.