Train Regression Trees Using Regression Learner App
This example shows how to create and compare various regression trees using the Regression Learner app, and export trained models to the workspace to make predictions for new data.
You can train regression trees to predict responses to given input data. To predict the response of a regression tree, follow the tree from the root (beginning) node down to a leaf node. At each node, decide which branch to follow using the rule associated with that node. Continue until you arrive at a leaf node. The predicted response is the value associated with that leaf node.
Statistics and Machine Learning Toolbox™ trees are binary. Each step in a prediction involves checking the value of one predictor variable. For example, here is a simple regression tree:
This tree predicts the response based on two predictors, x1 and x2. To predict, start at the top node. At each node, check the values of the predictors to decide which branch to follow. When the branches reach a leaf node, the response is set to the value corresponding to that node.
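For a concrete illustration, you can grow and inspect a small tree at the command line. This is a minimal sketch, not part of the app workflow: the choice of Horsepower and Weight as the two predictors is illustrative, and the split limit exists only to keep the diagram readable.

load carbig
tbl = table(Horsepower, Weight, MPG);

% Grow a deliberately small tree so the rules are easy to read
tree = fitrtree(tbl, 'MPG', 'MaxNumSplits', 3);

view(tree, 'Mode', 'graph')   % opens a diagram of the tree's decision rules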
This example uses the carbig data set. This data set contains characteristics of different car models produced from 1970 through 1982, including:
Acceleration
Number of cylinders
Engine displacement
Engine power (horsepower)
Model year
Weight
Country of origin
Miles per gallon (MPG)
Train regression trees to predict the fuel economy in miles per gallon of a car model, given the other variables as inputs.
In MATLAB®, load the carbig data set and create a table containing the different variables:

load carbig
cartable = table(Acceleration,Cylinders,Displacement, ...
    Horsepower,Model_Year,Weight,Origin,MPG);
On the Apps tab, in the Machine Learning and Deep Learning group, click Regression Learner.
On the Regression Learner tab, in the File section, select New Session > From Workspace.
Under Data Set Variable in the New Session from Workspace dialog box, select cartable from the list of tables and matrices in your workspace.

Observe that the app has preselected response and predictor variables. MPG is chosen as the response, and all the other variables as predictors. For this example, do not change the selections.

To accept the default validation scheme and continue, click Start Session. The default validation option is cross-validation, to protect against overfitting.
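A command-line counterpart to this step, shown only as a sketch: fitrtree can apply the same kind of k-fold cross-validation that the app uses by default (the app defaults to 5 folds).

% Assumes cartable from the earlier step is in the workspace
cvTree = fitrtree(cartable, 'MPG', 'KFold', 5);   % 5-fold cross-validated tree
rmse = sqrt(kfoldLoss(cvTree))                    % kfoldLoss returns MSE by default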
Regression Learner creates a plot of the response with the record number on the x-axis.

Use the response plot to investigate which variables are useful for predicting the response. To visualize the relation between different predictors and the response, select different variables in the X list under X-axis to the right of the plot.

Observe which variables are correlated most clearly with the response.
Displacement, Horsepower, and Weight all have a clearly visible impact on the response, and all show a negative association with the response.

Select the variable Origin under X-axis. A box plot is automatically displayed. A box plot shows the typical values of the response and any possible outliers. The box plot is useful when plotting markers results in many points overlapping. To show a box plot when the variable on the x-axis has few unique values, under Style, select Box plot.

Train a selection of regression trees. The Models pane already contains a fine tree model. Add medium and coarse tree models to the list of draft models. On the Regression Learner tab, in the Models section, click the arrow to open the gallery. In the Regression Trees group, click Medium Tree. The app creates a draft medium tree in the Models pane. Reopen the model gallery and click Coarse Tree in the Regression Trees group. The app creates a draft coarse tree in the Models pane.
In the Train section, click Train All and select Train All. The app trains the three tree models and plots both the true training response and the predicted response for each model.
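The three presets differ mainly in their minimum leaf size, which is 4 for fine trees, 12 for medium trees, and 36 for coarse trees (see the hyperparameter discussion later in this example). A sketch of the same comparison at the command line:

leafSizes = [4 12 36];   % fine, medium, and coarse tree presets
for k = 1:numel(leafSizes)
    cvt = fitrtree(cartable, 'MPG', 'MinLeafSize', leafSizes(k), 'KFold', 5);
    fprintf('MinLeafSize = %2d  validation RMSE = %.3f\n', ...
        leafSizes(k), sqrt(kfoldLoss(cvt)))
end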
Note

If you have Parallel Computing Toolbox™, then the app has the Use Parallel button toggled on by default. After you click Train All and select Train All or Train Selected, the app opens a parallel pool of workers. During this time, you cannot interact with the software. After the pool opens, you can continue to interact with the app while models train in parallel.

If you do not have Parallel Computing Toolbox, then the app has the Use Background Training check box in the Train All menu selected by default. After you select an option to train models, the app opens a background pool. After the pool opens, you can continue to interact with the app while models train in the background.
Note

Validation introduces some randomness into the results. Your model validation results can vary from the results shown in this example.
In the Models pane, check the RMSE (Validation) (validation root mean squared error) of the models. The best score is highlighted in a box.

The fine tree and the medium tree have similar RMSEs, while the coarse tree is less accurate.
Choose a model in the Models pane to view the results of that model. For example, select the Medium Tree model (model 2). In the Response Plot tab, under X-axis, select Horsepower and examine the response plot. Both the true and predicted responses are now plotted. Show the prediction errors, drawn as vertical lines between the predicted and true responses, by selecting the Errors check box.

See more details on the currently selected model in the model's Summary tab. On the Regression Learner tab, in the Models section, click Summary. Check and compare additional model characteristics, such as R-Squared (coefficient of determination), MAE (mean absolute error), and prediction speed. To learn more, see the Regression Learner documentation. In the Summary tab, you also can find details on the currently selected model type, such as the hyperparameters used for training the model.
Plot the predicted response versus the true response. On the Regression Learner tab, in the Plot and Interpret section, click the arrow to open the gallery, and then click Predicted vs. Actual (Validation) in the Validation Results group. Use this plot to understand how well the regression model makes predictions for different response values.
A perfect regression model has a predicted response equal to the true response, so all the points lie on a diagonal line. The vertical distance from the line to any point is the error of the prediction for that point. A good model has small errors, so the predictions are scattered near the line. Usually a good model has points scattered roughly symmetrically around the diagonal line. If you can see any clear patterns in the plot, it is likely that you can improve your model.
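You can reproduce this kind of plot programmatically from a cross-validated model. This sketch assumes the cvTree model from the earlier cross-validation example:

yPred = kfoldPredict(cvTree);   % out-of-fold (validation) predictions
yTrue = cvTree.Y;               % observed responses the model was trained on

scatter(yTrue, yPred, '.')
hold on
plot(xlim, xlim, 'k-')          % reference line: perfect predictions
hold off
xlabel('True response')
ylabel('Predicted response')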
Select the other models in the Models pane, open the predicted versus actual plot for each of the models, and then compare the results. Rearrange the layout of the plots to compare them more easily. Click the Document Actions arrow located to the far right of the model plot tabs. Select the Tile All option and specify a 1-by-3 layout. Click the Hide plot options button at the top right of the plots to make more room for the plots.

To return to the original layout, you can click the Layout button in the Plot and Interpret section and select Single model (Default).
In the Models gallery, select All Trees in the Regression Trees group. To try to improve the tree models, include different features in the models. See if you can improve the model by removing features with low predictive power.

On the Regression Learner tab, in the Options section, click Feature Selection.
In the Default Feature Selection tab, you can select different feature ranking algorithms to determine the most important features. After you select a feature ranking algorithm, the app displays a plot of the sorted feature importance scores, where larger scores (including Infs) indicate greater feature importance. The table shows the ranked features and their scores.

In this example, both the MRMR and F Test feature ranking algorithms rank the acceleration and country of origin predictors the lowest. The app disables the RReliefF option because the predictors include a mix of numeric and categorical variables.
Under Feature Ranking Algorithm, click F Test. Under Feature Selection, use the default option of selecting the highest ranked features to avoid bias in the validation metrics. Specify to keep 4 of the 7 features for model training.
Click Save and Apply. The app applies the feature selection changes to the current draft model and to any new models created using the Models gallery.
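An F-test ranking similar to the app's is available at the command line through the fsrftest function. This is a sketch, assuming cartable is in the workspace; converting Origin from a character array to categorical is an assumption that may be needed, since fsrftest expects numeric or categorical predictors.

cartable.Origin = categorical(cellstr(cartable.Origin));  % char matrix -> categorical

[idx, scores] = fsrftest(cartable, 'MPG');   % rank predictors by F-test scores
predNames = setdiff(cartable.Properties.VariableNames, 'MPG', 'stable');
rankedNames = predNames(idx)                 % most to least important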
Train the tree models using the reduced set of features. On the Regression Learner tab, in the Train section, click Train All and select Train All or Train Selected.
Observe the new models in the Models pane. These models are the same regression trees as before, but trained using only 4 of the 7 predictors. The app displays how many predictors are used. To check which predictors are used, click a model in the Models pane, and note the check boxes in the expanded Feature Selection section of the model Summary tab.
Note

If you use a cross-validation scheme and choose to perform feature selection using the Select highest ranked features option, then for each training fold, the app performs feature selection before training a model. Different folds can select different predictors as the highest ranked features. The table on the Default Feature Selection tab shows the list of predictors used by the full model, trained on the training and validation data.
The models with the three features removed do not perform as well as the models using all predictors. In general, if data collection is expensive or difficult, you might prefer a model that performs satisfactorily without some predictors.
Train the three regression tree presets using only Horsepower as a predictor. In the Models gallery, select All Trees in the Regression Trees group. In the model Summary tab, expand the Feature Selection section. Choose the Select individual features option, and clear the check boxes for all features except Horsepower. On the Regression Learner tab, in the Train section, click Train All and select Train Selected.

Using only the engine power as a predictor results in models with lower accuracy. However, the models perform well given that they are using only a single predictor. With this simple one-dimensional predictor space, the coarse tree now performs as well as the medium and fine trees.
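The command-line equivalent of this single-predictor experiment is to train on a two-column table. A sketch using the medium-tree leaf size:

hpOnly = cartable(:, {'Horsepower','MPG'});  % keep one predictor plus the response
cvHP = fitrtree(hpOnly, 'MPG', 'MinLeafSize', 12, 'KFold', 5);
sqrt(kfoldLoss(cvHP))                        % validation RMSE with Horsepower alone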
Select the best model in the Models pane and view the residuals plot. On the Regression Learner tab, in the Plot and Interpret section, click the arrow to open the gallery, and then click Residuals (Validation) in the Validation Results group. The residuals plot displays the difference between the predicted and true responses. To display the residuals as a line graph, in the Style section, choose Lines.
Under X-axis, select the variable to plot on the x-axis. Choose the true response, predicted response, record number, or one of the predictors.
Usually a good model has residuals scattered roughly symmetrically around 0. If you can see any clear patterns in the residuals, it is likely that you can improve your model.
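A command-line sketch of the same residuals check, again assuming the cvTree model from the earlier example:

resid = cvTree.Y - kfoldPredict(cvTree);   % residual = true - predicted
plot(resid, '.')
yline(0)                                   % residuals should scatter around 0
xlabel('Record number')
ylabel('Residual')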
To learn about model hyperparameter settings, choose the best model in the Models pane and expand the Model Hyperparameters section in the model Summary tab. Compare the coarse, medium, and fine tree models, and note the differences in the model hyperparameters. In particular, the Minimum leaf size setting is 36 for coarse trees, 12 for medium trees, and 4 for fine trees. This setting controls the size of the tree leaves, and through that, the size and depth of the regression tree.
To try to improve the best model (the medium tree trained using all predictors), change the Minimum leaf size setting. First, click the model in the Models pane. On the Regression Learner tab, in the Models section, click Duplicate. In the Summary tab, change the Minimum leaf size value to 8. Then, in the Train section of the Regression Learner tab, click Train All and select Train Selected.
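A sketch of the same experiment at the command line:

cvTuned = fitrtree(cartable, 'MPG', 'MinLeafSize', 8, 'KFold', 5);
sqrt(kfoldLoss(cvTuned))   % compare against the medium tree (MinLeafSize 12)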
To learn more about regression tree settings, see the Regression Learner documentation.
You can export a full or compact version of the selected model to the workspace. On the Regression Learner tab, click Export, click Export Model, and select Export Model. To exclude the training data and export a compact model, clear the check box in the Export Regression Model dialog box. You can still use the compact model for making predictions on new data. In the dialog box, click OK to accept the default variable name trainedModel.

The Command Window displays information about the results.
Use the exported model to make predictions on new data. For example, to make predictions for the cartable data in your workspace, enter:

yfit = trainedModel.predictFcn(cartable)

The output yfit contains the predicted response for each data point.

If you want to automate training the same model with new data, or learn how to programmatically train regression models, you can generate code from the app. To generate code for the best trained model, on the Regression Learner tab, in the Export section, click Generate Function.
The app generates code from your model and displays the file in the MATLAB Editor. To learn more, see the Regression Learner documentation.
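The generated function retrains the model on whatever table you pass in. As a sketch, assuming the app's default function name trainRegressionModel (your session may generate a different name):

% Retrain the exported model type on (possibly new) data
[trainedModel, validationRMSE] = trainRegressionModel(cartable);
yfit = trainedModel.predictFcn(cartable);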
Tip

Use the same workflow as in this example to evaluate and compare the other regression model types you can train in Regression Learner.
Train all the nonoptimizable regression model presets available:

On the Regression Learner tab, in the Models section, click the arrow to open the gallery of regression models.

In the Get Started group, click All.

In the Train section, click Train All and select Train All.

To learn about other regression model types, see Train Regression Models in Regression Learner App.