Options to train reinforcement learning agents using existing data
Since R2023a
Description
Use an rlTrainingFromDataOptions object to specify options to train an off-policy agent from existing data. Training options include the maximum number of epochs to train, criteria for stopping training, and criteria for saving agents. To train the agent using the specified options, pass this object to trainFromData.
For more information on training agents, see Train Reinforcement Learning Agents.
Creation
Description
tfdOpts = rlTrainingFromDataOptions returns a default options set to train an off-policy agent offline, from existing data.
tfdOpts = rlTrainingFromDataOptions(Name=Value) creates the training option set tfdOpts and sets its properties using one or more name-value arguments.
Properties
MaxEpochs — Maximum number of epochs to train the agent
1000 (default) | positive integer
Maximum number of epochs to train the agent, specified as a positive integer. Each epoch has a fixed number of learning steps specified by NumStepsPerEpoch. Regardless of other criteria for termination, training terminates after MaxEpochs epochs.
Example: MaxEpochs=500
NumStepsPerEpoch — Number of steps to run per epoch
500 (default) | positive integer
Number of steps to run per epoch, specified as a positive integer.
Example: NumStepsPerEpoch=1000
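Because every epoch runs a fixed number of learning steps, MaxEpochs and NumStepsPerEpoch together put an upper bound on the total number of learning steps. The arithmetic below is a minimal sketch using the default values; training can end sooner if a StopTrainingCriteria condition is met.

```matlab
% Upper bound on learning steps implied by the two options.
% The values are the documented defaults; substitute your own.
maxEpochs = 1000;        % MaxEpochs default
numStepsPerEpoch = 500;  % NumStepsPerEpoch default

% Each epoch runs exactly numStepsPerEpoch learning steps, so the
% largest possible number of steps is the product of the two values.
maxTotalSteps = maxEpochs*numStepsPerEpoch;
fprintf("At most %d learning steps\n", maxTotalSteps);
```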
ExperienceBufferUpdateFrequency — Buffer update period
1 (default) | positive integer
Buffer update period, specified as a positive integer. For example, if the value of this option is 1 (default), then the buffer updates every epoch; if it is 2, the buffer updates every other epoch, and so on. Note that the experience buffer is not updated if it already contains all the available data.
Example: ExperienceBufferUpdateFrequency=2
NumExperiencesPerExperienceBufferUpdate — Number of experiences appended per buffer update
[] (default) | positive integer
Number of experiences appended per buffer update, specified as a positive integer or empty matrix. If this option is left empty (default), then at training time it is automatically set to half the length of the experience buffer used by the agent.
Example: NumExperiencesPerExperienceBufferUpdate=5e5
QValueObservations — Batch of observations used to compute Q-values
[] (default) | cell array
Batch of observations used to compute Q-values, specified as a 1-by-N cell array, where N is the number of observation channels. Each cell must contain a batch of observations, along the batch dimension, for the corresponding observation channel. For example, if you have two observation channels carrying a 3-by-1 vector and a scalar, a batch of 10 random observations is {rand(3,1,10),rand(1,1,10)}.
If this option is left empty (default), then at training time it is automatically set to a cell array in which each element, corresponding to an observation channel, is an array of zeros with the same dimensions as the observation, without any batch dimension.
Example: QValueObservations={rand(3,1,10),rand(1,1,10)}
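A batch for QValueObservations can be built channel by channel from the observation dimensions. The sketch below assumes the two-channel layout from the text (a 3-by-1 vector and a scalar); the variable names are illustrative. For comparison, it also builds the zero-valued, batch-free cell array that the empty default is described as expanding to.

```matlab
% Dimensions of each observation channel (illustrative: a 3-by-1 vector
% and a scalar). With an agent, these could instead be read from the
% Dimension property of each element of getObservationInfo(agent).
obsDims = {[3 1], [1 1]};
batchSize = 10;

% Place the batch dimension right after the observation dimensions.
qObs = cell(1, numel(obsDims));
for k = 1:numel(obsDims)
    qObs{k} = rand([obsDims{k} batchSize]);
end

% Zeros with the observation dimensions and no batch dimension, matching
% the description of the default (empty) value.
defaultQObs = cellfun(@zeros, obsDims, UniformOutput=false);
```

You could then pass the result as rlTrainingFromDataOptions(QValueObservations=qObs).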
ScoreAveragingWindowLength — Window length for averaging Q-values
5 (default) | positive integer scalar
Window length for averaging Q-values, specified as a positive integer scalar. The "QValue" termination option (StopTrainingCriteria) and the "QValue" saving option (SaveAgentCriteria) are expressed in terms of average Q-values. For these options, the average is calculated over the last ScoreAveragingWindowLength epochs.
Example: ScoreAveragingWindowLength=10
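The averaging is an ordinary mean over the most recent epochs. The sketch below applies it to a hypothetical vector of per-epoch Q-value estimates to show which epochs a window length of 5 selects; it illustrates the arithmetic only and is not the toolbox's internal code.

```matlab
% Hypothetical per-epoch Q-value estimates recorded over 8 epochs.
epochQ = [12 18 25 31 36 40 43 45];
windowLength = 5;   % ScoreAveragingWindowLength

% Keep only the last windowLength epochs. If fewer epochs have run,
% this sketch averages all of them (an assumption for illustration).
lastEpochs = epochQ(max(1, end-windowLength+1):end);
avgQ = mean(lastEpochs);
fprintf("Average Q over the last %d epochs: %g\n", windowLength, avgQ);
```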
StopTrainingCriteria — Training termination condition
"none" (default) | "QValue" | ...
Training termination condition, specified as one of the following strings:
"none" — Stop training after the agent is trained for the number of epochs specified in MaxEpochs.
"QValue" — Stop training when the average Q-value (computed using the current critic and the observations specified in QValueObservations) over the last ScoreAveragingWindowLength epochs equals or exceeds the value specified in the StopTrainingValue option.
Example: StopTrainingCriteria="QValue"
StopTrainingValue — Critical value of training termination condition
"none" (default) | scalar
Critical value of the training termination condition, specified as a scalar. Training ends when the quantity monitored by the StopTrainingCriteria option equals or exceeds this value.
For instance, if StopTrainingCriteria is "QValue" and StopTrainingValue is 50, then training terminates when the moving average Q-value (computed using the current critic and the observations specified in QValueObservations) over the number of epochs specified in ScoreAveragingWindowLength equals or exceeds 50.
Example: StopTrainingValue=50
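The stopping rule can be traced on a hypothetical Q-value history. This sketch computes a trailing moving average with movmean and finds the first epoch at which it reaches StopTrainingValue. The text does not state how the average is formed during the first few epochs, before a full window is available; movmean simply shrinks the window there, which is an assumption of this sketch.

```matlab
% Hypothetical per-epoch Q-value estimates.
epochQ = [10 30 45 52 58 60 62];
windowLength = 3;   % ScoreAveragingWindowLength
stopValue = 50;     % StopTrainingValue, with StopTrainingCriteria="QValue"

% Trailing moving average: the current epoch plus the previous
% windowLength-1 epochs (the window shrinks near the start).
avgQ = movmean(epochQ, [windowLength-1 0]);

% First epoch at which the average equals or exceeds the threshold.
stopEpoch = find(avgQ >= stopValue, 1);
if isempty(stopEpoch)
    disp("Threshold not reached; training would run to MaxEpochs.")
else
    fprintf("Training would stop after epoch %d\n", stopEpoch);
end
```

Here the average first reaches 50 at epoch 5 ((45+52+58)/3), even though single epochs exceeded 50 earlier; averaging guards against stopping on one noisy estimate.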
SaveAgentCriteria — Condition for saving agent during training
"none" (default) | "EpochFrequency" | "QValue" | ...
Condition for saving the agent during training, specified as one of the following strings:
"none" — Do not save any agents during training.
"EpochFrequency" — Save the agent when the number of epochs is an integer multiple of the value specified in the SaveAgentValue option.
"QValue" — Save the agent when the average Q-value (computed using the current critic and the observations specified in QValueObservations) over the last ScoreAveragingWindowLength epochs equals or exceeds the value specified in SaveAgentValue.
Set this option to store candidate agents that perform well in terms of Q-value, or simply to save the agent at a fixed rate. For instance, if SaveAgentCriteria is "EpochFrequency" and SaveAgentValue is 5, then the agent is saved every five epochs.
Example: SaveAgentCriteria="EpochFrequency"
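The "EpochFrequency" rule is a simple divisibility test. The sketch below lists the epochs at which an agent would be saved for SaveAgentValue=5, using a short 20-epoch run chosen only to keep the output readable.

```matlab
% Epochs at which an agent is saved when SaveAgentCriteria="EpochFrequency".
saveAgentValue = 5;   % SaveAgentValue
maxEpochs = 20;       % MaxEpochs (kept small for display)

epochs = 1:maxEpochs;
% Save whenever the epoch number is an integer multiple of saveAgentValue.
saveEpochs = epochs(mod(epochs, saveAgentValue) == 0);
disp(saveEpochs)
```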
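SaveAgentValue — Critical value of condition for saving agent
"none" (default) | scalar
Critical value of the condition for saving the agent, specified as a scalar. When SaveAgentCriteria is "EpochFrequency", this value is the epoch interval between saves; when it is "QValue", it is the average Q-value threshold.
Example: SaveAgentValue=10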
SaveAgentDirectory — Folder name for saved agents
"savedAgents" (default) | string | character vector
Folder name for saved agents, specified as a string or character vector. The folder name can contain a full or relative path. When an epoch occurs in which the condition specified by the SaveAgentCriteria and SaveAgentValue options is satisfied, the software saves the agent in a MAT-file in this folder. If the folder does not exist, trainFromData creates it. When SaveAgentCriteria is "none", this option is ignored and trainFromData does not create a folder.
Example: SaveAgentDirectory=pwd + "\run1\agents"
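The example path above uses a Windows-style backslash. A portable alternative, sketched here, builds the path with fullfile, which inserts the correct file separator for the current platform. The folder names run1 and agents are taken from the example, and the save interval of 50 epochs is an arbitrary illustrative value.

```matlab
% Build a platform-independent path for saved agents.
saveDir = fullfile(pwd, "run1", "agents");

% Save a copy of the agent every 50 epochs into that folder
% (the training function creates the folder if it does not exist).
tfdOpts = rlTrainingFromDataOptions( ...
    SaveAgentCriteria="EpochFrequency", ...
    SaveAgentValue=50, ...
    SaveAgentDirectory=saveDir);
```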
Verbose — Option to display training progress at the command line
false or 0 (default) | true or 1
Option to display training progress at the command line, specified as a numeric or logical 0 (false) or 1 (true). Set to true to write information from each training epoch to the MATLAB® command line during training.
Example: Verbose=false
Plots — Option to display training progress with Episode Manager
"training-progress" (default) | "none"
Option to display training progress with the Episode Manager, specified as "training-progress" or "none". By default, calling trainFromData opens the Reinforcement Learning Episode Manager, which graphically and numerically displays information about the training progress, such as the reward for each epoch, average reward, number of epochs, and total number of steps. To turn off this display, set this option to "none".
For more information, see train.
Example: Plots="none"
Object Functions
trainFromData | Train off-policy reinforcement learning agent using existing data
Examples
Configure Options to Train Agent from Data
Create an options set to train a reinforcement learning agent offline, from an existing dataset.
Set the maximum number of epochs to 2000 and the number of steps per epoch to 1000. Do not set any criteria to stop the training before 2000 epochs. Also, display training progress on the command line instead of using the Episode Manager.
tfdOpts = rlTrainingFromDataOptions( ...
    MaxEpochs=2000, ...
    NumStepsPerEpoch=1000, ...
    Verbose=true, ...
    Plots="none")
tfdOpts = 
  rlTrainingFromDataOptions with properties:
    MaxEpochs: 2000
    NumStepsPerEpoch: 1000
    ExperienceBufferUpdateFrequency: 1
    NumExperiencesPerExperienceBufferUpdate: []
    QValueObservations: []
    ScoreAveragingWindowLength: 5
    StopTrainingCriteria: "none"
    StopTrainingValue: "none"
    SaveAgentCriteria: "none"
    SaveAgentValue: "none"
    SaveAgentDirectory: "savedAgents"
    Verbose: 1
    Plots: "none"
Alternatively, create a default options set and use dot notation to change some of the values.
trainOpts = rlTrainingFromDataOptions;
trainOpts.MaxEpochs = 2000;
trainOpts.NumStepsPerEpoch = 1000;
trainOpts.Verbose = true;
trainOpts.Plots = "training-progress";
trainOpts
trainOpts = 
  rlTrainingFromDataOptions with properties:
    MaxEpochs: 2000
    NumStepsPerEpoch: 1000
    ExperienceBufferUpdateFrequency: 1
    NumExperiencesPerExperienceBufferUpdate: []
    QValueObservations: []
    ScoreAveragingWindowLength: 5
    StopTrainingCriteria: "none"
    StopTrainingValue: "none"
    SaveAgentCriteria: "none"
    SaveAgentValue: "none"
    SaveAgentDirectory: "savedAgents"
    Verbose: 1
    Plots: "training-progress"
You can now use trainOpts as an input argument to the trainFromData command.
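Building on this example, the sketch below combines the Q-value-based stopping and saving options described under Properties. It assumes an agent whose observations are a 3-by-1 vector and a scalar (the two-channel layout used in the QValueObservations description); the window length, threshold, and save interval are placeholders to adapt to your problem.

```matlab
% Batch of 10 observations for a hypothetical agent with two observation
% channels: a 3-by-1 vector and a scalar.
qObs = {rand(3,1,10), rand(1,1,10)};

tfdOpts = rlTrainingFromDataOptions( ...
    MaxEpochs=2000, ...
    NumStepsPerEpoch=1000, ...
    QValueObservations=qObs, ...             % observations used to estimate Q-values
    ScoreAveragingWindowLength=10, ...       % average Q over the last 10 epochs
    StopTrainingCriteria="QValue", ...       % stop on the average Q-value
    StopTrainingValue=50, ...                % once it reaches 50
    SaveAgentCriteria="EpochFrequency", ...  % also save a copy of the agent
    SaveAgentValue=100, ...                  % every 100 epochs
    Verbose=true, ...
    Plots="none");
```

As with the options above, pass tfdOpts to trainFromData together with your agent and data.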
Version History
Introduced in R2023a