trainFromData

Train off-policy reinforcement learning agent using existing data

Since R2023a

Description

tfdstats = trainFromData(agent) trains the off-policy agent agent offline, using data stored in its ExperienceBuffer property. Note that agent is a handle object, so it is updated during training even though it is an input argument.

tfdstats = trainFromData(agent,datastore) trains the off-policy agent agent offline, using data read through the FileDatastore object datastore.

tfdstats = trainFromData(___,tfdopts) also specifies nondefault training options using the rlTrainingFromDataOptions object tfdopts.

tfdstats = trainFromData(___,Logger=lgr) logs training data using the FileLogger object lgr.

Examples

To collect training data, first create an environment.

env = rlPredefinedEnv("CartPole-Discrete");

Create a built-in PPO agent with default networks.

agent1 = rlPPOAgent( ...
    getObservationInfo(env), ...
    getActionInfo(env));

Create a FileLogger object.

flgr = rlDataLogger;

To log the experiences on disk, assign an appropriate logger function to the logger object. This function is automatically called by the training loop at the end of each episode, and is defined at the end of the example.

flgr.EpisodeFinishedFcn = @myepisodefinishedfcn;

Define a training options object to train agent1 for no more than 100 episodes, without visualizing any training progress.

topts = rlTrainingOptions(MaxEpisodes=100,Plots="none");

Train agent1, logging the experience data.

train(agent1,env,topts,Logger=flgr);

At the end of this training, files containing experience data for each episode are saved in the logs folder.

Note that the only purpose of training agent1 is to collect experience data from the environment. Collecting experiences by simulating the environment in closed loop with a controller (using a for loop), or collecting a series of observations caused by random actions, would accomplish the same result.
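As a rough illustration of the random-action alternative, the following sketch steps the cart-pole environment with uniformly sampled actions and stores each step in a structure array with the experience field names used in this example. It assumes that the predefined environment can be driven directly with reset and step, and that each observation and action is wrapped in a cell (as the cell2mat call in this example suggests). The variable names and the 50-step cap are hypothetical.

```matlab
% Sketch: collect one episode of experiences by applying random actions.
env = rlPredefinedEnv("CartPole-Discrete");
actinfo = getActionInfo(env);

maxsteps = 50;            % hypothetical cap on the episode length
obs = reset(env);         % initial observation
experiences = [];         % grows into a struct array, one element per step

for k = 1:maxsteps
    % Sample one of the finite action values uniformly at random.
    act = actinfo.Elements(randi(numel(actinfo.Elements)));
    [nextobs,reward,isdone] = step(env,act);

    % Store the transition (cell-wrapped signals are an assumption).
    s.Observation = {obs};
    s.Action = {act};
    s.Reward = reward;
    s.NextObservation = {nextobs};
    s.IsDone = isdone;
    experiences = [experiences; s]; %#ok<AGROW>

    obs = nextobs;
    if isdone
        break
    end
end
```

The resulting array would still need to be saved in files that the datastore read function can load before it could be used for offline training.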

To allow the trainFromData function to read the experience data stored in the logs folder, create a read function that, given a file name, returns the respective experience structure. For this example, the myreadfcn function is defined at the end of the example.

Check that the function can successfully retrieve data from an episode.

cd logs
exp = myreadfcn("loggedData002")
exp=11×1 struct array with fields:
    NextObservation
    Observation
    Action
    Reward
    IsDone
size(cell2mat([exp.Action]))
ans = 1×2
     1    11
cd ..

Create a FileDatastore object using fileDatastore. Pass as arguments the name of the folder where the files are stored and the read function. The read function is called automatically when the datastore is accessed for reading and is defined at the end of the example.

fds = fileDatastore("./logs","ReadFcn",@myreadfcn);

Create a built-in DQN agent with default networks to be trained from the collected dataset.

agent2 = rlDQNAgent( ...
    getObservationInfo(env), ...
    getActionInfo(env));

Define an options object to train agent2 from data for 50 epochs.

tfdopts = rlTrainingFromDataOptions("MaxEpochs",50);

To train agent2 from data, use trainFromData. Pass the FileDatastore object fds as the second input argument.

trainFromData(agent2,fds,tfdopts);

Here, the estimated Q-value seems to grow indefinitely over time. This often happens during offline training because the agent updates its estimated Q-value based on the current estimated Q-value, without using any environment feedback. To prevent the Q-value from becoming increasingly large (and inaccurate) over time, stop the training earlier or use the data regularizer options available for the agent type (one set of options applies to DQN or SAC agents, another to DDPG, TD3, or SAC agents).

In general, the Q-value calculated as above for an agent trained offline is not necessarily indicative of the performance of the agent within an environment. Therefore, best practice is to validate the agent within an environment after offline training.
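One way to carry out this validation is to simulate the trained agent in the environment and inspect the total reward. The sketch below assumes that env and agent2 from this example are still in the workspace. The 500-step limit and the variable names simopts, simexp, and totalreward are arbitrary choices.

```matlab
% Assumes env and agent2 from the example above already exist in the
% workspace (agent2 having been trained with trainFromData).
simopts = rlSimulationOptions(MaxSteps=500);   % arbitrary step limit
simexp = sim(env,agent2,simopts);

% Total reward collected during the simulated episode.
totalreward = sum(simexp.Reward.Data,"all");
disp(totalreward)
```

A total reward that stays low while the estimated Q-value keeps growing would suggest that the Q-value estimate has drifted away from the actual performance of the agent.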

Support Functions

The data logging function. This function is automatically called by the training loop at the end of each episode, and must return a structure containing the data to log, such as experiences, simulation information, or initial observations. Here, data is a structure that contains the following fields:

  • EpisodeCount — current episode number

  • Environment — environment object

  • Agent — agent object

  • Experience — structure array containing the experiences. Each element of this array corresponds to a step and is a structure containing the fields NextObservation, Observation, Action, Reward, and IsDone.

  • EpisodeInfo — structure containing the fields CumulativeReward, StepsTaken, and InitialObservation.

  • SimulationInfo — contains simulation information from the episode. For MATLAB environments this is a structure with the field SimulationError, and for Simulink environments it is a Simulink.SimulationOutput object.

function datatolog = myepisodefinishedfcn(data)
    % Log only the experiences collected during the episode.
    datatolog.Experience = data.Experience;
end

For more information on logging data on disk, see FileLogger.

The data store read function. This function is automatically called by the training loop when the data store is accessed for reading. It must take a file name and return the experience structure array. Each element of this array corresponds to a step and is a structure containing the fields NextObservation, Observation, Action, Reward, and IsDone.

function experiences = myreadfcn(filename)
    % Load the experience structure array from a logged episode file.
    if contains(filename,"loggedData")
        data = load(filename);
        experiences = data.episodeData.Experience{1};
    else
        % Not a logged episode file: return no experiences.
        experiences = [];
    end
end

Input Arguments

Off-policy agent to train, specified as a reinforcement learning agent object.

Note

trainFromData updates the agent as training progresses. For more information on how to preserve the original agent, how to save an agent during training, and on the state of agent after training, see the notes and the tips section in train. For more information about handle objects, see the MATLAB documentation on handle objects.

For more information about how to create and configure agents for reinforcement learning, see Reinforcement Learning Agents.

Data store, specified as a FileDatastore object. The function specified in the ReadFcn property of datastore must return a structure array of experiences with the Observation, Action, Reward, NextObservation, and IsDone fields. The dimensions of the arrays in Observation and NextObservation in each experience must be the same as the dimensions specified in the ObservationInfo of agent. The dimension of the array in Action must be the same as the dimension specified in the ActionInfo of agent. The Reward and IsDone fields must contain scalar values.

Training from data parameters and options, specified as an rlTrainingFromDataOptions object. Use this argument to specify parameters and options such as:

  • Number of epochs

  • Number of steps for each epoch

  • Criteria for saving candidate agents

  • How to display training progress

Note

trainFromData does not support parallel computing.

For details, see rlTrainingFromDataOptions.
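For instance, the number of epochs and the number of steps per epoch might be set as in the following sketch. MaxEpochs appears in the example on this page. NumStepsPerEpoch and Plots are property names recalled from the rlTrainingFromDataOptions reference page and should be checked against it. The values are arbitrary.

```matlab
% Sketch: configure offline training options (values are arbitrary).
tfdopts = rlTrainingFromDataOptions( ...
    MaxEpochs=50, ...           % number of epochs
    NumStepsPerEpoch=200);      % steps per epoch (assumed property name)

% Turn off the training progress display (assumed property name and value).
tfdopts.Plots = "none";
```

Setting properties with dot notation after construction, as in the last line, keeps the constructor call short when only one or two options differ from their defaults.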

Logger object, specified either as a FileLogger or as a MonitorLogger object. For more information on reinforcement learning logger objects, see rlDataLogger.

Output Arguments

Training results, returned as an rlTrainingFromDataResult object, which has the following properties:

Epoch numbers, returned as the column vector [1;2;…;n], where n is the number of epochs in the training run. This vector is useful if you want to plot the evolution of other quantities from epoch to epoch.

Number of steps in each epoch, returned as a column vector of length n. Each entry contains the number of steps in the corresponding epoch.

Total number of agent steps in training, returned as a column vector of length n. Each entry contains the cumulative sum of the entries in EpochSteps up to that point.

Q-value estimates for each epoch, returned as a column vector of length n. Each element is the average Q-value of the policy over the observations specified in the QValueObservations property of tfdopts, evaluated at the end of the epoch using the policy parameters at the end of the epoch.

Note

During offline training, the agent updates its estimated Q-value based on the current estimated Q-value (without any environment feedback). As a result, the estimated Q-value can become inaccurate (and often increasingly large) over time. To prevent the Q-value from growing indefinitely, stop the training earlier or use data regularizer options.

Note

The Q-value calculated as above for an agent trained offline is not necessarily indicative of the performance of the agent within an environment. Therefore, it is good practice to validate the agent within an environment after offline training.
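To spot the kind of Q-value growth described in the notes above, it can help to plot the estimates against the cumulative number of steps. The helper below is a hypothetical sketch: EpochSteps is the property named earlier in this section, while QValue is an assumed name for the Q-value property and should be checked against the rlTrainingFromDataResult reference page.

```matlab
function totalsteps = plotqvaluetrend(tfdstats)
% PLOTQVALUETREND Plot Q-value estimates against cumulative training steps.
%   tfdstats is the object returned by trainFromData (or any structure
%   with EpochSteps and QValue fields). QValue is an assumed property name.

    % Cumulative number of steps at the end of each epoch.
    totalsteps = cumsum(tfdstats.EpochSteps);

    plot(totalsteps,tfdstats.QValue,"-o")
    xlabel("Total steps")
    ylabel("Average estimated Q-value")
    title("Q-value estimate during offline training")
end
```

A call such as plotqvaluetrend(tfdstats), made after trainFromData returns, would then show whether the estimate levels off or keeps increasing from epoch to epoch.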

Training options set, returned as an rlTrainingFromDataOptions object.

Version History

Introduced in R2023a
