options to train reinforcement learning agents using existing data

since r2023a

description

use an rltrainingfromdataoptions object to specify options to train an off-policy agent from existing data. training options include the maximum number of epochs to train, criteria for stopping training and criteria for saving agents. to train the agent using the specified options, pass this object to trainfromdata.

for more information on training agents, see train reinforcement learning agents.

creation

description

tfdopts = rltrainingfromdataoptions returns a default options set to train an off-policy agent offline, from existing data.

tfdopts = rltrainingfromdataoptions(name=value) creates the training option set tfdopts and sets its properties using one or more name-value arguments.

properties

maxepochs

maximum number of epochs to train the agent, specified as a positive integer. each epoch has a fixed number of learning steps specified by numstepsperepoch. regardless of other criteria for termination, training terminates after maxepochs.

example: maxepochs=500

numstepsperepoch

number of steps to run per epoch, specified as a positive integer.

example: numstepsperepoch=1000

experiencebufferupdatefrequency

buffer update period, specified as a positive integer. for example, if the value of this option is 1 (default), then the buffer updates every epoch; if it is 2, the buffer updates every other epoch, and so on. note that the experience buffer is not updated if it already contains all the available data.

example: experiencebufferupdatefrequency=2
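
as a rough illustration of the update period, the following sketch lists the epochs at which the buffer would be refreshed for a period of 2. the variable names are hypothetical, and the modulo rule is an assumption about how "every other epoch" is counted, not a description of the toolbox internals.

```matlab
% Hypothetical illustration, plain MATLAB (no toolbox required).
updatePeriod = 2;      % stands in for the buffer update period option
epochs = 1:10;         % first ten training epochs

% Assumption: an update happens when the epoch index is a multiple of the period.
updateEpochs = epochs(mod(epochs, updatePeriod) == 0);
% updateEpochs is [2 4 6 8 10]
```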

numexperiencesperexperiencebufferupdate

number of experiences appended per buffer update, specified as a positive integer or empty matrix. if the value of this option is left empty (default) then, at training time, it is automatically set to half the length of the experience buffer used by the agent.

example: numexperiencesperexperiencebufferupdate=5e5

qvalueobservations

batch of observations used to compute q-values, specified as a 1-by-n cell array, where n is the number of observation channels. each cell must contain a batch of observations, stacked along the batch dimension, for the corresponding observation channel. for example, if you have two observation channels carrying a 3-by-1 vector and a scalar, a batch of 10 random observations is {rand(3,1,10),rand(1,1,10)}.

if the value of this option is left empty (default) then, at training time, it is automatically set to a cell array in which each element corresponding to an observation channel is an array of zeros having the same dimensions as the observation, without any batch dimension.

example: qvalueobservations={rand(3,1,10),rand(1,1,10)}
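
the layout of the batch can be checked in plain matlab. the sketch below builds the two-channel batch from the example and an array of zeros shaped like the default described above. all variable names (batchSize, qObs, obsDims, defaultObs) are hypothetical, and the channel dimensions are taken from the example rather than from a real environment.

```matlab
% Two observation channels: a 3-by-1 vector and a scalar.
batchSize = 10;
qObs = {rand(3,1,batchSize), rand(1,1,batchSize)};

% The batch dimension is the third dimension of each cell.
batchPerChannel = cellfun(@(x) size(x,3), qObs);   % [10 10]

% Default described above: zeros with the observation dimensions,
% without any batch dimension.
obsDims = {[3 1], [1 1]};                          % hypothetical channel dimensions
defaultObs = cellfun(@zeros, obsDims, UniformOutput=false);
```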

scoreaveragingwindowlength

window length for averaging q-values, specified as a scalar. one termination option and one saving option are expressed in terms of average q-values. for these options, the average is calculated over the last scoreaveragingwindowlength epochs.

example: scoreaveragingwindowlength=10

stoptrainingcriteria

training termination condition, specified as one of the following strings:

  • "none" — stop training after the agent is trained for the number of epochs specified in maxepochs.

  • "qvalue" — stop training when the average q-value (computed using the current critic and the observations specified in qvalueobservations) over the last scoreaveragingwindowlength epochs equals or exceeds the value specified in the stoptrainingvalue option.

example: stoptrainingcriteria="qvalue"

stoptrainingvalue

critical value of the training termination condition, specified as a scalar. training ends when the termination condition specified by the stoptrainingcriteria option equals or exceeds this value.

for instance, if stoptrainingcriteria is "qvalue" and stoptrainingvalue is 50, then training terminates when the moving average q-value (computed using the current critic and the observations specified in qvalueobservations) over the number of epochs specified in scoreaveragingwindowlength equals or exceeds 50.

example: stoptrainingvalue=50
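
the stopping rule can be sketched with plain matlab arithmetic. the per-epoch q-values below are made up for illustration, the variable names are hypothetical, and the sketch assumes the average is only evaluated once a full window of epochs is available, which may differ from how the toolbox treats the first few epochs.

```matlab
% Hypothetical per-epoch Q-value estimates (not produced by the toolbox).
qPerEpoch = [10 20 30 40 50 60 70 80];
windowLength = 5;      % plays the role of scoreaveragingwindowlength
stopValue = 50;        % plays the role of stoptrainingvalue

stopEpoch = [];        % stays empty if the threshold is never reached
for k = windowLength:numel(qPerEpoch)
    % Average over the last windowLength epochs.
    avgQ = mean(qPerEpoch(k-windowLength+1:k));
    if avgQ >= stopValue
        stopEpoch = k;
        break
    end
end
% stopEpoch is 7: the window [30 40 50 60 70] averages exactly 50.
```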

saveagentcriteria

condition for saving the agent during training, specified as one of the following strings:

  • "none" — do not save any agents during training.

  • "epochfrequency" — save the agent when the number of epochs is an integer multiple of the value specified in the saveagentvalue option.

  • "qvalue" — save the agent when the average q-value (computed using the current critic and the observations specified in qvalueobservations) over the last scoreaveragingwindowlength epochs equals or exceeds the value specified in saveagentvalue.

set this option to store candidate agents that perform well in terms of q-value, or just to save the agent at a fixed rate. for instance, if saveagentcriteria is "epochfrequency" and saveagentvalue is 5, then the agent is saved every five epochs.

example: saveagentcriteria="epochfrequency"

saveagentvalue

critical value of the condition for saving the agent, specified as a scalar.

example: saveagentvalue=10
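
the save criterion and its critical value are set as a pair. a minimal sketch, assuming reinforcement learning toolbox r2023a or later is installed, and assuming the shipped class, property names, and option string are capitalized as rlTrainingFromDataOptions, SaveAgentCriteria, SaveAgentValue, and "EpochFrequency". the variable name saveOpts is hypothetical.

```matlab
% Sketch: save the agent every 5 epochs.
% Assumption: the capitalization of the class, property names, and the
% "EpochFrequency" string is as written here.
saveOpts = rlTrainingFromDataOptions;
saveOpts.SaveAgentCriteria = "EpochFrequency";
saveOpts.SaveAgentValue = 5;
```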

saveagentdirectory

folder name for saved agents, specified as a string or character vector. the folder name can contain a full or relative path. when an epoch occurs in which the condition specified by the saveagentcriteria and saveagentvalue options is satisfied, the software saves the agent in a mat-file in this folder. if the folder does not exist, trainfromdata creates it. when saveagentcriteria is "none", this option is ignored and trainfromdata does not create a folder.

example: saveagentdirectory = pwd + "\run1\agents"
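
a backslash separator is specific to windows. as a hedged alternative, the folder path can be assembled with fullfile, which inserts the separator of the current platform. the variable name agentDir is hypothetical.

```matlab
% Build the folder path with the platform's own file separator.
agentDir = fullfile(pwd, "run1", "agents");
```

the resulting value can then be assigned to the folder option of the options object.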

verbose

option to display training progress at the command line, specified as a numerical or logical 0 (false) or 1 (true). set to true to write information from each training epoch to the matlab® command line during training.

example: verbose=false

plots

option to display training progress with episode manager, specified as "training-progress" or "none". by default, calling trainfromdata opens the reinforcement learning episode manager, which graphically and numerically displays information about the training progress, such as the reward for each epoch, average reward, number of epochs, and total number of steps. to turn off this display, set this option to "none". for more information, see train.

example: plots="none"

object functions

trainfromdata: train off-policy reinforcement learning agent using existing data

examples

create an options set to train a reinforcement learning agent offline, from an existing dataset.

set the maximum number of epochs to 2000 and the number of steps per epoch to 1000. do not set any criteria to stop the training before 2000 epochs. also, display training progress on the command line instead of using the episode manager.

tfdopts = rltrainingfromdataoptions(...
    maxepochs=2000,...
    numstepsperepoch=1000,...
    verbose=true,...
    plots="none")
tfdopts = 
  rltrainingfromdataoptions with properties:
                                  maxepochs: 2000
                           numstepsperepoch: 1000
            experiencebufferupdatefrequency: 1
    numexperiencesperexperiencebufferupdate: []
                         qvalueobservations: []
                 scoreaveragingwindowlength: 5
                       stoptrainingcriteria: "none"
                          stoptrainingvalue: "none"
                          saveagentcriteria: "none"
                             saveagentvalue: "none"
                         saveagentdirectory: "savedagents"
                                    verbose: 1
                                      plots: "none"

alternatively, create a default options set and use dot notation to change some of the values.

trainopts = rltrainingfromdataoptions;
trainopts.maxepochs = 2000;
trainopts.numstepsperepoch = 1000;
trainopts.verbose = true;
trainopts.plots = "training-progress";
trainopts
trainopts = 
  rltrainingfromdataoptions with properties:
                                  maxepochs: 2000
                           numstepsperepoch: 1000
            experiencebufferupdatefrequency: 1
    numexperiencesperexperiencebufferupdate: []
                         qvalueobservations: []
                 scoreaveragingwindowlength: 5
                       stoptrainingcriteria: "none"
                          stoptrainingvalue: "none"
                          saveagentcriteria: "none"
                             saveagentvalue: "none"
                         saveagentdirectory: "savedagents"
                                    verbose: 1
                                      plots: "training-progress"

you can now use trainopts as an input argument to the trainfromdata command.
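
a minimal sketch of that call, assuming the function is capitalized as trainFromData and accepts the agent, a datastore of logged experiences, and the options object in that order. the variables agent and dataStore are hypothetical and must already exist in the workspace, and the output name stats is also hypothetical.

```matlab
% Sketch only: agent and dataStore are assumed to exist already.
%   agent     - an off-policy agent (hypothetical variable)
%   dataStore - a datastore over previously logged experiences (hypothetical variable)
% Assumption: the three-argument form (agent, datastore, options).
stats = trainFromData(agent, dataStore, trainopts);
```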

version history

introduced in r2023a
