trainFromData
Train off-policy reinforcement learning agent using existing data
Since R2023a
Syntax
tfdStats = trainFromData(agent,datastore)
tfdStats = trainFromData(___,tfdOpts)
tfdStats = trainFromData(___,Logger=lgr)
Description
tfdStats = trainFromData(agent,datastore) trains the off-policy agent agent offline, using the experience data stored in the FileDatastore object datastore, and returns training results in tfdStats.
tfdStats = trainFromData(___,tfdOpts) also specifies nondefault training options using the rlTrainingFromDataOptions object tfdOpts.
tfdStats = trainFromData(___,Logger=lgr) logs training data using the FileLogger object lgr.
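For example, the following minimal sketch (not part of the shipped documentation example) combines both syntaxes. It assumes that agent is a supported off-policy agent and fds is an existing FileDatastore of experiences; rlDataLogger called without arguments returns a FileLogger object.
tfdOpts = rlTrainingFromDataOptions(MaxEpochs=50);
lgr = rlDataLogger();
tfdStats = trainFromData(agent,fds,tfdOpts,Logger=lgr);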
Examples
Train Agent from Data Collected by Training Another Agent
To collect training data, first create an environment.
env = rlPredefinedEnv("CartPole-Discrete");
Create a built-in PPO agent with default networks.
agent1 = rlPPOAgent( ...
    getObservationInfo(env), ...
    getActionInfo(env));
Create a FileLogger object.
flgr = rlDataLogger;
To log the experiences to disk, assign an appropriate logging function to the logger object. This function is automatically called by the training loop at the end of each episode, and is defined at the end of the example.
flgr.EpisodeFinishedFcn = @myEpisodeFinishedFcn;
Define a training option object to train agent1 for no more than 100 episodes, without visualizing any training progress.
tOpts = rlTrainingOptions(MaxEpisodes=100,Plots="none");
Train agent1, logging the experience data.
train(agent1,env,tOpts,Logger=flgr);
At the end of this training, files containing the experience data for each episode are saved in the logs folder.
Note that the only purpose of training agent1 is to collect experience data from the environment. Collecting experiences by simulating the environment in closed loop with a controller (using a for loop), or indeed collecting a series of observations caused by random actions, would also accomplish the same result.
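As a sketch of the random-action alternative mentioned above (this code is not part of the shipped example, and the cell-wrapped field format is an assumption chosen to mirror the experience structure that myReadFcn returns for the logged data), you could collect experiences like this.
% Collect experiences by applying random valid actions to the environment.
actInfo = getActionInfo(env);
validActs = actInfo.Elements;                  % discrete action set
obs = reset(env);
randomExp = struct([]);
for k = 1:200
    act = validActs(randi(numel(validActs)));  % pick a random valid action
    [nextObs,rwd,isDone] = step(env,act);
    randomExp(k).NextObservation = {nextObs};  % one cell per channel (assumed format)
    randomExp(k).Observation     = {obs};
    randomExp(k).Action          = {act};
    randomExp(k).Reward          = rwd;
    randomExp(k).IsDone          = isDone;
    if isDone
        obs = reset(env);                      % start a new episode
    else
        obs = nextObs;
    end
end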
To allow the trainFromData function to read the experience data stored in the logs folder, create a read function that, given a file name, returns the corresponding experience structure. For this example, the myReadFcn function is defined at the end of the example.
Check that the function can successfully retrieve data from an episode.
cd logs
exp = myReadFcn("loggedData002")
exp=11×1 struct array with fields:
    NextObservation
    Observation
    Action
    Reward
    IsDone

size(cell2mat([exp.Action]))
ans = 1×2

     1    11

cd ..
Create a FileDatastore object using fileDatastore. Pass as arguments the name of the folder where the files are stored and the read function. The read function is called automatically when the datastore is accessed for reading, and is defined at the end of the example.
fds = fileDatastore("./logs","ReadFcn",@myReadFcn);
Create a built-in DQN agent with default networks to be trained from the collected data set.
agent2 = rlDQNAgent( ...
    getObservationInfo(env), ...
    getActionInfo(env));
Define an options object to train agent2 from data for 50 epochs.
tfdOpts = rlTrainingFromDataOptions("MaxEpochs",50);
To train agent2 from data, use trainFromData. Pass the FileDatastore object fds as the second input argument.
trainFromData(agent2,fds,tfdOpts);
Here, the estimated Q-value seems to grow indefinitely over time. This often happens during offline training because the agent updates its estimated Q-value based on the current estimated Q-value, without using any environment feedback. To prevent the Q-value from becoming increasingly large (and inaccurate) over time, stop the training earlier or use batch data regularizer options such as rlConservativeQLearningOptions (for DQN or SAC agents) or rlBehaviorCloningRegularizerOptions (for DDPG, TD3, or SAC agents).
In general, the Q-value calculated as above for an agent trained offline is not necessarily indicative of the performance of the agent within an environment. Therefore, it is best practice to validate the agent within an environment after offline training.
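For instance, a minimal validation sketch (not part of the shipped example) runs the trained agent in the environment with sim and inspects the accumulated reward.
% Validate the offline-trained agent in the environment.
simOpts = rlSimulationOptions(MaxSteps=500);
valExperience = sim(env,agent2,simOpts);
totalReward = sum(valExperience.Reward.Data)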
Support Functions
The data logging function. This function is automatically called by the training loop at the end of each episode, and must return a structure containing the data to log, such as experiences, simulation information, or initial observations. Here, data is a structure that contains the following fields:
EpisodeCount — Current episode number
Environment — Environment object
Agent — Agent object
Experience — Structure array containing the experiences. Each element of this array corresponds to a step and is a structure containing the fields NextObservation, Observation, Action, Reward, and IsDone.
EpisodeInfo — Structure containing the fields CumulativeReward, StepsTaken, and InitialObservation.
SimulationInfo — Contains simulation information from the episode. For MATLAB environments this is a structure with the field SimulationError, and for Simulink environments it is a Simulink.SimulationOutput object.
function dataToLog = myEpisodeFinishedFcn(data)
    dataToLog.Experience = data.Experience;
end
For more information on logging data to disk, see FileLogger.
The data store read function. This function is automatically called by the training loop when the data store is accessed for reading. It must take a file name and return the experience structure array. Each element of this array corresponds to a step and is a structure containing the fields NextObservation, Observation, Action, Reward, and IsDone.
function experiences = myReadFcn(filename)
    if contains(filename,"loggedData")
        data = load(filename);
        experiences = data.episodeData.Experience{1};
    else
        experiences = [];
    end
end
Input Arguments
agent — Off-policy agent
rlDQNAgent object | rlDDPGAgent object | rlTD3Agent object | rlSACAgent object | rlMBPOAgent object
Off-policy agent to train, specified as a reinforcement learning agent object, such as an rlDQNAgent object.
Note
trainFromData updates the agent as training progresses. For more information on how to preserve the original agent, how to save an agent during training, and on the state of agent after training, see the notes and the tips section in train. For more information about handle objects, see Handle Object Behavior.
For more information about how to create and configure agents for reinforcement learning, see Reinforcement Learning Agents.
datastore — Data store
FileDatastore object
Data store, specified as a FileDatastore object. The function specified in the ReadFcn property of datastore must return a structure array of experiences with the Observation, Action, Reward, NextObservation, and IsDone fields. The dimensions of the arrays in Observation and NextObservation in each experience must be the same as the dimensions specified in the ObservationInfo of agent. The dimension of the array in Action must be the same as the dimension specified in the ActionInfo of agent. The Reward and IsDone fields must contain scalar values. For more information, see fileDatastore.
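As a sketch (these checks are an assumption about how you might validate your own data store, not part of the documented interface; the cell-wrapped observation format follows the example above), you can read one batch from the data store and confirm that it has the required fields and dimensions before training.
% Read one batch and check the experience structure against the agent specs.
batch = read(fds);                         % invokes the ReadFcn on one file
reqFields = ["Observation","Action","Reward","NextObservation","IsDone"];
assert(all(isfield(batch,reqFields)),"Experience fields are missing.")
obsInfo = getObservationInfo(agent);
assert(isequal(size(batch(1).Observation{1}),obsInfo(1).Dimension), ...
    "Observation dimensions do not match ObservationInfo.")
reset(fds)                                 % rewind the data store before training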
tfdOpts — Training from data parameters and options
rlTrainingFromDataOptions object
Training from data parameters and options, specified as an rlTrainingFromDataOptions object. Use this argument to specify parameters and options such as:
Number of epochs
Number of steps for each epoch
Criteria for saving candidate agents
How to display training progress
Note
trainFromData does not support parallel computing.
For details, see rlTrainingFromDataOptions.
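For example, a minimal sketch (the property value is illustrative, and agent and fds are assumed to exist): create a default option set, override the number of epochs, and pass the options to trainFromData.
tfdOpts = rlTrainingFromDataOptions;
tfdOpts.MaxEpochs = 200;       % illustrative value
tfdStats = trainFromData(agent,fds,tfdOpts);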
lgr — Logger object
FileLogger object | MonitorLogger object
Logger object, specified as either a FileLogger or a MonitorLogger object. For more information on reinforcement learning logger objects, see rlDataLogger.
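As a sketch, rlDataLogger called with no arguments returns a FileLogger object; passing a trainingProgressMonitor object returns a MonitorLogger instead (the MonitorLogger variant below is an assumption and requires a training progress monitor object).
fileLgr = rlDataLogger();                             % FileLogger
tfdStats = trainFromData(agent,fds,tfdOpts,Logger=fileLgr);

% monitor = trainingProgressMonitor;                  % MonitorLogger variant
% monLgr = rlDataLogger(monitor);
% tfdStats = trainFromData(agent,fds,tfdOpts,Logger=monLgr);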
Output Arguments
tfdStats — Training results
rlTrainingFromDataResult object
Training results, returned as an rlTrainingFromDataResult object, which has the following properties:
EpochIndex — Epoch numbers
[1;2;…;N]
Epoch numbers, returned as the column vector [1;2;…;N], where N is the number of epochs in the training run. This vector is useful if you want to plot the evolution of other quantities from epoch to epoch.
EpochSteps — Number of steps in each epoch
column vector
Number of steps in each epoch, returned as a column vector of length N. Each entry contains the number of steps in the corresponding epoch.
TotalSteps — Total number of steps
column vector
Total number of agent steps in training, returned as a column vector of length N. Each entry contains the cumulative sum of the entries in EpochSteps up to that point.
QValue — Q-value estimates for each epoch
column vector
Q-value estimates for each epoch, returned as a column vector of length N. Each element is the average Q-value of the policy over the observations specified in the QValueObservations property of tfdOpts, evaluated at the end of the epoch using the policy parameters at the end of the epoch.
Note
During offline training, the agent updates its estimated Q-value based on the current estimated Q-value (without any environment feedback). As a result, the estimated Q-value can become inaccurate (and often increasingly large) over time. To prevent the Q-value from growing indefinitely, stop the training earlier or use batch data regularizer options. For more information, see rlConservativeQLearningOptions and rlBehaviorCloningRegularizerOptions.
Note
The Q-value calculated as above for an agent trained offline is not necessarily indicative of the performance of the agent within an environment. Therefore, it is good practice to validate the agent within an environment after offline training.
TrainingOptions — Training options set
rlTrainingFromDataOptions object
Training options set, returned as an rlTrainingFromDataOptions object.
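For example, a minimal sketch (assuming tfdStats was returned by a trainFromData call such as the one in the example above) plots the Q-value estimate against the epoch index to inspect the training run.
tfdStats = trainFromData(agent2,fds,tfdOpts);
plot(tfdStats.EpochIndex,tfdStats.QValue)
xlabel("Epoch")
ylabel("Average Q-value estimate")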
Version History
Introduced in R2023a