Train Reinforcement Learning Agents
Once you have created an environment and a reinforcement learning agent, you can train the agent in the environment using the train function. To configure your training, use an rlTrainingOptions object. For example, create a training option set opt, and train agent agent in environment env.
opt = rlTrainingOptions(...
    MaxEpisodes=1000,...
    MaxStepsPerEpisode=1000,...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=480);
trainResults = train(agent,env,opt);
If env is a multi-agent environment created with rlSimulinkEnv, specify the agent argument as an array. The order of the agents in the array must match the agent order used to create env. Multi-agent training is not supported for MATLAB® environments.
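As a minimal sketch, assuming agentA and agentB are two agents of the same type created in the same order used to build the multi-agent Simulink environment env (both agent names are placeholders), you could reuse the option set opt from above:
% Pass the agents as an array; the order must match the order used to create env.
trainResults = train([agentA,agentB],env,opt);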
For more information on creating agents, see Reinforcement Learning Agents. For more information on creating environments, see the MATLAB and Simulink environment creation documentation.
Note
train updates the agent as training progresses. This is possible because each agent is a handle object. To preserve the original agent parameters for later use, save the agent to a MAT-file:
save("initialagent.mat","agent")
Training terminates automatically when the conditions you specify in the StopTrainingCriteria and StopTrainingValue options of your rlTrainingOptions object are satisfied. You can also terminate training before any termination condition is reached by clicking Stop Training in the Reinforcement Learning Episode Manager.
When training terminates, the training statistics and results are stored in the trainResults object.
Because train updates the agent at the end of each episode, and because trainResults stores the last training results along with data needed to correctly recreate the training scenario and update the Episode Manager, you can later resume training from the exact point at which it stopped. To do so, at the command line, type:
trainResults = train(agent,env,trainResults);
Training resumes using the agent parameters and training state obtained at the end of the previous train call. The trainResults object contains, as one of its properties, the rlTrainingOptions object opt specifying the training option set. Therefore, to restart the training with updated training options, first change the training options in trainResults using dot notation. If the maximum number of episodes was already reached in the previous training session, you must increase the maximum number of episodes.
For example, disable the training progress display in the Episode Manager, enable the Verbose option to display training progress at the command line, change the maximum number of episodes to 2000, and then restart the training, returning a new trainResults object as output.
trainResults.TrainingOptions.MaxEpisodes = 2000;
trainResults.TrainingOptions.Plots = "none";
trainResults.TrainingOptions.Verbose = 1;
trainResultsNew = train(agent,env,trainResults);
Note
When training terminates, each agent reflects its state at the end of the final training episode. The rewards obtained by the final agents are not necessarily the highest achieved during the training process, due to continuous exploration. To save agents during training, create an rlTrainingOptions object specifying the SaveAgentCriteria and SaveAgentValue properties and pass it to train as the trainOpts argument.
Training Algorithm
In general, training performs the following steps.
1. Initialize the agent.
2. For each episode:
   a. Reset the environment.
   b. Get the initial observation s0 from the environment.
   c. Compute the initial action a0 = μ(s0), where μ(s) is the current policy.
   d. Set the current action to the initial action (a ← a0), and set the current observation to the initial observation (s ← s0).
   e. While the episode is not finished or terminated, perform the following steps:
      - Apply action a to the environment and obtain the next observation s' and the reward r.
      - Learn from the experience set (s,a,r,s').
      - Compute the next action a' = μ(s').
      - Update the current action with the next action (a ← a') and update the current observation with the next observation (s ← s').
      - Terminate the episode if the termination conditions defined in the environment are met.
3. If the training termination condition is met, terminate training. Otherwise, begin the next episode.
The specifics of how the software performs these steps depend on the configuration of the agent and environment. For instance, resetting the environment at the start of each episode can include randomizing initial state values, if you configure your environment to do so. For more information on agents and their training algorithms, see Reinforcement Learning Agents. To use parallel processing and GPUs to speed up training, see Train Agents Using Parallel Computing and GPUs.
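The following MATLAB-style sketch illustrates the generic episode loop described above. It is a conceptual outline only; the helper names (resetEnvironment, getAction, stepEnvironment, learnFromExperience, trainingTerminationConditionMet) are hypothetical placeholders, not toolbox functions, because the toolbox performs these steps internally when you call train.
% Conceptual sketch of the generic training loop (not runnable toolbox code).
% All helper functions below are hypothetical placeholders.
for episode = 1:maxEpisodes
    s = resetEnvironment(env);          % reset environment, get initial observation s0
    a = getAction(agent,s);             % a0 = mu(s0), using the current policy
    isDone = false;
    while ~isDone
        [sNext,r,isDone] = stepEnvironment(env,a);  % apply action, observe s' and reward r
        learnFromExperience(agent,s,a,r,sNext);     % learn from the experience set (s,a,r,s')
        aNext = getAction(agent,sNext);             % a' = mu(s')
        s = sNext;                                  % s <- s'
        a = aNext;                                  % a <- a'
    end
    if trainingTerminationConditionMet()            % for example, StopTrainingCriteria satisfied
        break
    end
end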
Episode Manager
By default, calling the train function opens the Reinforcement Learning Episode Manager, which lets you visualize the training progress.
The Episode Manager plot shows the reward for each episode (EpisodeReward) and a running average reward value (AverageReward).
For agents with a critic, Episode Q0 is the estimate of the discounted long-term reward at the start of each episode, given the initial observation of the environment. As training progresses, if the critic is well designed and learns successfully, Episode Q0 approaches, on average, the true discounted long-term reward, which may be offset from the EpisodeReward value because of discounting. For a well-designed critic using an undiscounted reward (DiscountFactor is equal to 1), Episode Q0 approaches, on average, the true episode reward.
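As a point of reference, the discounted return that Episode Q0 estimates follows the standard reinforcement learning definition (this formula is a general definition, not text from this page), where γ is the DiscountFactor and r_{t+1} is the reward received at step t of a T-step episode:
G_0 = \sum_{t=0}^{T-1} \gamma^{t} \, r_{t+1}
With γ = 1, this reduces to the undiscounted sum of episode rewards, which is why Episode Q0 and EpisodeReward converge toward each other in that case.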
The Episode Manager also displays various episode and training statistics. You can also use the train function to return episode and training information. To turn off the Reinforcement Learning Episode Manager, set the Plots option of rlTrainingOptions to "none".
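For example, the following one-line sketch creates a training options object with the Episode Manager display disabled (all other options keep their default values):
opt = rlTrainingOptions(Plots="none");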
Save Candidate Agents
During training, you can save candidate agents that meet conditions you specify in the SaveAgentCriteria and SaveAgentValue options of your rlTrainingOptions object. For instance, you can save any agent whose episode reward exceeds a certain value, even if the overall condition for terminating training is not yet satisfied. For example, save agents when the episode reward is greater than 100.
opt = rlTrainingOptions(SaveAgentCriteria="EpisodeReward",SaveAgentValue=100);
train stores saved agents in a MAT-file in the folder you specify using the SaveAgentDirectory option of rlTrainingOptions. Saved agents can be useful, for instance, to test candidate agents generated during a long-running training process. For details about saving criteria and saving location, see rlTrainingOptions.
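For example, the following sketch combines the saving criteria shown above with a save folder; the folder name "savedAgents" is an arbitrary placeholder:
opt = rlTrainingOptions(...
    SaveAgentCriteria="EpisodeReward",...
    SaveAgentValue=100,...
    SaveAgentDirectory="savedAgents");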
After training is complete, you can save the final trained agent from the MATLAB workspace using the save function. For example, save the agent agent to the file finalAgent.mat in the folder specified by the SaveAgentDirectory option.
save(opt.SaveAgentDirectory + "/finalAgent.mat",'agent')
By default, when DDPG and DQN agents are saved, the experience buffer data is not saved. If you plan to further train your saved agent, you can start training with the previous experience buffer as a starting point. In this case, set the SaveExperienceBufferWithAgent option to true. For some agents, such as those with large experience buffers and image-based observations, the memory required for saving the experience buffer is large. In these cases, you must ensure that enough memory is available for the saved agents.
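As a minimal sketch, assuming a DQN agent whose options object exposes SaveExperienceBufferWithAgent (check where this option lives for your agent type and release):
agentOpts = rlDQNAgentOptions;                   % DQN agent options (assumed agent type)
agentOpts.SaveExperienceBufferWithAgent = true;  % include the experience buffer when saving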
Validate Trained Policy
To validate your trained agent, you can simulate the agent within the training environment using the sim function. To configure the simulation, use rlSimulationOptions.
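For example, the following sketch runs one simulation of up to 500 steps (an arbitrary example limit) and totals the collected reward; the exact fields of the returned experience structure may vary by release:
simOpts = rlSimulationOptions(MaxSteps=500);   % limit the simulation to 500 steps
experience = sim(env,agent,simOpts);           % simulate the trained agent in env
totalReward = sum(experience.Reward.Data)      % total reward collected during the simulation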
When validating your agent, consider checking how your agent handles the following:
- Changes to simulation initial conditions. To change the model initial conditions, modify the reset function for the environment. For example reset functions, see the environment examples in the documentation.
- Mismatches between the training and simulation environment dynamics. To check for such mismatches, create test environments in the same way that you created the training environment, modifying the environment behavior.
As with parallel training, if you have Parallel Computing Toolbox™ software, you can run multiple parallel simulations on multicore computers. If you have MATLAB Parallel Server™ software, you can run multiple parallel simulations on computer clusters or cloud resources. For more information on configuring your simulation to use parallel computing, see UseParallel and ParallelizationOptions in rlSimulationOptions.
Environment Visualization
If your training environment implements the plot method, you can visualize the environment behavior during training and simulation. If you call plot(env) before training or simulation, where env is your environment object, then the visualization updates during training to allow you to visualize the progress of each episode or simulation.
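For example, assuming env, agent, and the option set opt already exist, you might open the visualization just before training, as in this minimal sketch:
plot(env)                              % open the environment visualization
trainResults = train(agent,env,opt);   % the visualization updates as training runs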
Environment visualization is not supported when training or simulating your agent using parallel computing.
For custom environments, you must implement your own plot method. For more information on creating a custom environment with a plot function, see the custom environment documentation.
Related Examples
- Design and Train Agent Using Reinforcement Learning Designer
- Train Agents Using Parallel Computing and GPUs