options for training reinforcement learning agents

since r2019a

description

use an rltrainingoptions object to specify options to train an agent within an environment. training options include the maximum number of episodes to train, criteria for stopping training, criteria for saving agents, and options for using parallel computing. after setting its options, use this object as an input argument for train.

for more information on training agents, see train reinforcement learning agents.

creation

description

trainOpts = rlTrainingOptions returns the default options for training a reinforcement learning agent.

trainOpts = rlTrainingOptions(Name=Value) creates the training option set trainOpts and sets its properties using one or more name-value arguments.
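As a minimal sketch of the two creation syntaxes, the snippet below builds a default option set and then one with explicit values. MATLAB identifiers are case-sensitive, so the sketch spells them rlTrainingOptions, MaxEpisodes, and MaxStepsPerEpisode. The last call uses quoted names with comma-separated values, the form needed in releases that predate the Name=Value syntax (R2021a). The variable names and numeric values are illustrative only.

```matlab
% Default option set: every property keeps its default value.
defaultOpts = rlTrainingOptions;

% Override selected properties at creation with Name=Value arguments.
trainOpts = rlTrainingOptions(MaxEpisodes=200, MaxStepsPerEpisode=500);

% Equivalent call using quoted names and comma-separated values.
sameOpts = rlTrainingOptions("MaxEpisodes",200,"MaxStepsPerEpisode",500);
```

Any property not named in the call keeps its default value and can still be changed afterwards with dot notation.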

properties

maxepisodes

maximum number of episodes to train the agent, specified as a positive integer. regardless of other termination criteria, training terminates after maxepisodes episodes.

example: maxepisodes=1000

maxstepsperepisode

maximum number of steps to run per episode, specified as a positive integer. in general, you define episode termination conditions in the environment. this value is the maximum number of steps to run in the episode if other termination conditions are not met.

example: maxstepsperepisode=1000

scoreaveragingwindowlength

window length for averaging the scores, rewards, and number of steps for each agent, specified as a scalar or vector.

if the training environment contains a single agent, specify scoreaveragingwindowlength as a scalar.

if the training environment is a multi-agent simulink® environment, specify a scalar to apply the same window length to all agents.

to use a different window length for each agent, specify scoreaveragingwindowlength as a vector. in this case, the order of the elements in the vector corresponds to the order of the agents used during environment creation.

for options expressed in terms of averages, scoreaveragingwindowlength is the number of episodes included in the average. for instance, if stoptrainingcriteria is "averagereward", and stoptrainingvalue is 500 for a given agent, then for that agent, training terminates when the average reward over the number of episodes specified in scoreaveragingwindowlength equals or exceeds 500. for the other agents, training continues until:

  • all agents reach their stop criteria.

  • the number of episodes reaches maxepisodes.

  • you stop training by clicking the stop training button in episode manager or pressing ctrl-c at the matlab® command line.

example: scoreaveragingwindowlength=10
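The windowed average described above can be sketched in plain MATLAB, without the toolbox. The reward values are made-up illustration data, and the handling of the first few episodes (averaging over fewer episodes than the window) is an assumption of this sketch, not a statement about how train computes it internally.

```matlab
% Hypothetical per-episode rewards (illustrative data, not from a real run).
episodeReward = [100 300 450 520 560 600 610];
windowLength  = 3;      % plays the role of ScoreAveragingWindowLength
stopValue     = 500;    % plays the role of StopTrainingValue

% Trailing average over the current episode and up to
% windowLength-1 earlier episodes.
avgReward = movmean(episodeReward, [windowLength-1 0]);

% First episode at which the averaged reward equals or exceeds stopValue.
stopEpisode = find(avgReward >= stopValue, 1);   % 5 for this data
```

Episode 4 already has a single-episode reward above 500, but its three-episode average is still below the threshold, so the averaged criterion is met one episode later than an "episodereward" criterion would be.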

stoptrainingcriteria

training termination condition, specified as one of the following strings:

  • "averagesteps" — stop training when the running average number of steps per episode equals or exceeds the critical value specified by the option stoptrainingvalue. the average is computed using the window 'scoreaveragingwindowlength'.

  • "averagereward" — stop training when the running average reward equals or exceeds the critical value.

  • "episodereward" — stop training when the reward in the current episode equals or exceeds the critical value.

  • "globalstepcount" — stop training when the total number of steps in all episodes (the total number of times the agent is invoked) equals or exceeds the critical value.

  • "episodecount" — stop training when the number of training episodes equals or exceeds the critical value.

example: stoptrainingcriteria="averagereward"
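For instance, a step-budget stop condition might be configured as in the sketch below. The numeric values are arbitrary, and the names are written with MATLAB's capitalization (StopTrainingCriteria, StopTrainingValue, "GlobalStepCount").

```matlab
% Stop once 100000 agent steps have been taken across all episodes,
% or after MaxEpisodes episodes, whichever happens first.
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=5000, ...
    StopTrainingCriteria="GlobalStepCount", ...
    StopTrainingValue=1e5);
```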

stoptrainingvalue

critical value of the training termination condition, specified as a scalar or a vector.

if the training environment contains a single agent, specify stoptrainingvalue as a scalar.

if the training environment is a multi-agent simulink environment, specify a scalar to apply the same termination criterion to all agents. to use a different termination criterion for each agent, specify stoptrainingvalue as a vector. in this case, the order of the elements in the vector corresponds to the order of the agents used during environment creation.

for a given agent, training ends when the termination condition specified by the stoptrainingcriteria option equals or exceeds this value. for the other agents, the training continues until:

  • all agents reach their stop criteria.

  • the number of episodes reaches maxepisodes.

  • you stop training by clicking the stop training button in episode manager or pressing ctrl-c at the matlab command line.

for instance, if stoptrainingcriteria is "averagereward", and stoptrainingvalue is 100 for a given agent, then for that agent, training terminates when the average reward over the number of episodes specified in scoreaveragingwindowlength equals or exceeds 100.

example: stoptrainingvalue=100
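A per-agent configuration could look like the following sketch, which assumes a multi-agent Simulink environment with exactly two agents. The window lengths and stop values are arbitrary; the order of the vector elements is meant to follow the order in which the agents were given when the environment was created.

```matlab
% Assumed two-agent case: per-agent averaging windows and stop values.
trainOpts = rlTrainingOptions( ...
    StopTrainingCriteria="AverageReward", ...
    ScoreAveragingWindowLength=[10 20], ...
    StopTrainingValue=[100 250]);
```

With these settings, the first agent would meet its criterion when its 10-episode average reward reaches 100, and the second when its 20-episode average reaches 250.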

saveagentcriteria

condition for saving agents during training, specified as one of the following strings:

  • "none" — do not save any agents during training.

  • "episodereward" — save the agent when the reward in the current episode equals or exceeds the critical value.

  • "averagesteps" — save the agent when the running average number of steps per episode equals or exceeds the critical value specified by the option saveagentvalue. the average is computed using the window scoreaveragingwindowlength.

  • "averagereward" — save the agent when the running average reward over all episodes equals or exceeds the critical value.

  • "globalstepcount" — save the agent when the total number of steps in all episodes (the total number of times the agent is invoked) equals or exceeds the critical value.

  • "episodecount" — save the agent when the number of training episodes equals or exceeds the critical value.

set this option to store candidate agents that perform well according to the criteria you specify. when you set this option to a value other than "none", the software sets the saveagentvalue option to 500. you can change that value to specify the condition for saving the agent.

for instance, suppose you want to store for further testing any agent that yields an episode reward that equals or exceeds 100. to do so, set saveagentcriteria to "episodereward" and set the saveagentvalue option to 100. when an episode reward equals or exceeds 100, train saves the corresponding agent in a mat file in the folder specified by the saveagentdirectory option. the mat file is called agentk.mat, where k is the number of the corresponding episode. the agent is stored within that mat file as saved_agent.

example: saveagentcriteria="episodereward"
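The save-and-reload workflow described above might look like the sketch below. The episode number 25 is hypothetical, so the load step is guarded and only runs if that file exists; the file and variable names (Agent25.mat, saved_agent) follow the naming pattern stated in the text, written with MATLAB's capitalization.

```matlab
% Save any agent whose episode reward reaches 100 or more.
trainOpts = rlTrainingOptions( ...
    SaveAgentCriteria="EpisodeReward", ...
    SaveAgentValue=100, ...
    SaveAgentDirectory="savedAgents");

% After training, load a candidate back for testing. Episode 25 is a
% hypothetical example; use the number of a file that train actually wrote.
agentFile = fullfile(trainOpts.SaveAgentDirectory, "Agent25.mat");
if isfile(agentFile)
    data = load(agentFile);
    candidateAgent = data.saved_agent;   % variable name inside the MAT file
end
```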

saveagentvalue

critical value of the condition for saving agents, specified as a scalar or a vector.

if the training environment contains a single agent, specify saveagentvalue as a scalar.

if the training environment is a multi-agent simulink environment, specify a scalar to apply the same saving criterion to each agent. to save the agents when one meets a particular criterion, specify saveagentvalue as a vector. in this case, the order of the elements in the vector corresponds to the order of the agents used when creating the environment. when a criterion for saving an agent is met, all agents are saved in the same mat file.

when you specify a condition for saving candidate agents using saveagentcriteria, the software sets this value to 500. change the value to specify the condition for saving the agent. see the saveagentcriteria option for more details.

example: saveagentvalue=100

saveagentdirectory

folder name for saved agents, specified as a string or character vector. the folder name can contain a full or relative path. when an episode occurs in which the conditions specified by the saveagentcriteria and saveagentvalue options are satisfied, the software saves the agents in a mat file in this folder. if the folder does not exist, train creates it. when saveagentcriteria is "none", this option is ignored and train does not create a folder.

example: saveagentdirectory = pwd + "\run1\agents"
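A hard-coded backslash path only works on Windows. One portable alternative, shown here as a sketch and not as the documented form, is to assemble the folder name with fullfile. The folder names run1 and Agents are placeholders.

```matlab
% Build the folder path with the file separator of the current platform.
saveDir = fullfile(pwd, "run1", "Agents");

trainOpts = rlTrainingOptions( ...
    SaveAgentCriteria="EpisodeReward", ...
    SaveAgentValue=100, ...
    SaveAgentDirectory=saveDir);
```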

useparallel

flag for using parallel training, specified as a logical. setting this option to true configures training to use parallel processing to simulate the environment, thereby enabling usage of multiple cores, processors, computer clusters or cloud resources to speed up training. to specify options for parallel training, use the parallelizationoptions property.

when useparallel is true, then for dqn, ddpg, td3, and sac agents the numstepstolookahead property of the corresponding agent options object must be set to 1; otherwise, an error is generated. this guarantees that experiences are stored contiguously. when ac agents are trained in parallel, a warning is generated if the stepsuntildataissent property of the parallelizationoptions object is set to a different value than the numstepstolookahead property of the ac agent options object.

note that if you want to speed up deep neural network calculations (such as gradient computation, parameter update and prediction) using a local gpu, you do not need to set useparallel to true. instead, when creating your actor or critic representation, use an rlrepresentationoptions object in which the usedevice option is set to "gpu". using parallel computing or the gpu requires parallel computing toolbox™ software. using computer clusters or cloud resources additionally requires matlab parallel server™. for more information about training using multicore processors and gpus, see train agents using parallel computing and gpus.

example: useparallel=true

parallelizationoptions

parallelization options to control parallel training, specified as a paralleltraining object. for more information about training using parallel computing, see train reinforcement learning agents.

the paralleltraining object has the following properties, which you can modify using dot notation after creating the rltrainingoptions object.

mode

parallel computing mode, specified as one of the following:

  • "sync" — use parpool to run synchronous training on the available workers. in this case, workers pause execution until all workers are finished. the host updates the actor and critic parameters based on the results from all the workers and sends the updated parameters to all workers. synchronous training is required for gradient-based parallelization: when datatosendfromworkers is set to "gradients", mode must be set to "sync".

  • "async" — use parpool to run asynchronous training on the available workers. in this case, workers send their data back to the host as soon as they finish and receive updated parameters from the host. the workers then continue with their task.

example: mode="async"

workerrandomseeds

randomizer initialization for workers, specified as one of the following:

  • -1 — assign a unique random seed to each worker. the value of the seed is the worker id.

  • -2 — do not assign a random seed to the workers.

  • vector — manually specify the random seed for each worker. the number of elements in the vector must match the number of workers.

example: workerrandomseeds=[1 2 3 4]
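Setting explicit seeds might look like the sketch below. The pool size of four workers and the seed values are assumptions for illustration; in practice the vector length has to match the number of workers in the pool that train uses.

```matlab
trainOpts = rlTrainingOptions(UseParallel=true);

% Hypothetical pool size; the vector needs one seed per worker.
numWorkers = 4;
trainOpts.ParallelizationOptions.WorkerRandomSeeds = 101:(100 + numWorkers);
```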

transferbaseworkspacevariables

option to send model and workspace variables to parallel workers, specified as "on" or "off". when the option is "on", the client sends to the workers the variables defined in the base matlab workspace and used in the approximation models.

example: transferbaseworkspacevariables="off"

attachedfiles

additional files to attach to the parallel pool, specified as a string or string array.

example: attachedfiles="myinitfile.m"

setupfcn

function to run before training starts, specified as a handle to a function having no input arguments. this function is run once per worker before training begins. write this function to perform any processing that you need prior to training.

example: setupfcn=@mysetupfcn

cleanupfcn

function to run after training ends, specified as a handle to a function having no input arguments. you can write this function to clean up the workspace or perform other processing after training terminates.

example: cleanupfcn=@mycleanupfcn
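A sketch of assigning the worker setup and cleanup functions follows. It uses zero-argument anonymous functions as stand-ins, on the assumption that any function handle taking no inputs is accepted; the messages are placeholders for real work such as adding folders to the path or releasing resources.

```matlab
trainOpts = rlTrainingOptions(UseParallel=true);

% Each handle refers to a function that takes no input arguments.
trainOpts.ParallelizationOptions.SetupFcn   = @() disp("worker setup");
trainOpts.ParallelizationOptions.CleanupFcn = @() disp("worker cleanup");
```

Handles to named functions, such as @mySetupFcn, are assigned the same way, provided the function file is available to the workers.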

verbose

option to display training progress at the command line, specified as the logical values false (0) or true (1). set to true to write information from each training episode to the matlab command line during training.

example: verbose=false

stoponerror

option to stop training when an error occurs during an episode, specified as "on" or "off". when this option is "off", errors are captured and returned in the simulationinfo output of train, and training continues to the next episode.

example: stoponerror="off"

plots

option to display training progress with episode manager, specified as "training-progress" or "none". by default, calling train opens the reinforcement learning episode manager, which graphically and numerically displays information about the training progress, such as the reward for each episode, average reward, number of episodes, and total number of steps. for more information, see train. to turn off this display, set this option to "none".

example: plots="none"
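For runs without a display, such as batch jobs, the two display options can be combined as in this sketch: text progress at the command line and no Episode Manager window.

```matlab
% Command-line progress only: no Episode Manager window.
trainOpts = rlTrainingOptions(Verbose=true, Plots="none");
```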

object functions

train: train reinforcement learning agents within a specified environment

examples

create an options set for training a reinforcement learning agent. set the maximum number of episodes and the maximum number of steps per episode to 1000. configure the options to stop training when the average reward equals or exceeds 480, and turn on both the command-line display and reinforcement learning episode manager for displaying training results. you can set the options using name-value pair arguments when you create the options set. any options that you do not explicitly set have their default values.

trainOpts = rlTrainingOptions(...
    MaxEpisodes=1000,...
    MaxStepsPerEpisode=1000,...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=480,...
    Verbose=true,...
    Plots="training-progress")
trainOpts = 
  rlTrainingOptions with properties:
                   MaxEpisodes: 1000
            MaxStepsPerEpisode: 1000
                   StopOnError: "on"
    ScoreAveragingWindowLength: 5
          StopTrainingCriteria: "AverageReward"
             StopTrainingValue: 480
             SaveAgentCriteria: "none"
                SaveAgentValue: "none"
            SaveAgentDirectory: "savedAgents"
                       Verbose: 1
                         Plots: "training-progress"
                   UseParallel: 0
        ParallelizationOptions: [1x1 rl.option.ParallelTraining]

alternatively, create a default options set and use dot notation to change some of the values.

trainOpts = rlTrainingOptions;
trainOpts.MaxEpisodes = 1000;
trainOpts.MaxStepsPerEpisode = 1000;
trainOpts.StopTrainingCriteria = "AverageReward";
trainOpts.StopTrainingValue = 480;
trainOpts.Verbose = true;
trainOpts.Plots = "training-progress";
trainOpts
trainOpts = 
  rlTrainingOptions with properties:
                   MaxEpisodes: 1000
            MaxStepsPerEpisode: 1000
                   StopOnError: "on"
    ScoreAveragingWindowLength: 5
          StopTrainingCriteria: "AverageReward"
             StopTrainingValue: 480
             SaveAgentCriteria: "none"
                SaveAgentValue: "none"
            SaveAgentDirectory: "savedAgents"
                       Verbose: 1
                         Plots: "training-progress"
                   UseParallel: 0
        ParallelizationOptions: [1x1 rl.option.ParallelTraining]

you can now use trainopts as an input argument to the train command.
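As a sketch of that last step, the snippet below passes an option set to train. The predefined cart-pole environment and the default DQN agent are choices made for this illustration, not part of the example above, and the episode budget is kept small so the run finishes quickly. Creating a default agent this way also requires Deep Learning Toolbox.

```matlab
% Predefined environment and a default DQN agent (choices made for this sketch).
env     = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
agent   = rlDQNAgent(obsInfo, actInfo);

% Small training budget so the sketch finishes quickly.
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=20, ...
    MaxStepsPerEpisode=500, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480, ...
    Verbose=true, ...
    Plots="none");

% train returns per-episode statistics for the run.
trainingStats = train(agent, env, trainOpts);
```

With only 20 episodes, the run will most likely end on the episode limit and not on the average-reward criterion; raise MaxEpisodes for a real training run.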

to turn on parallel computing for training a reinforcement learning agent, set the useparallel training option to true.

trainOpts = rlTrainingOptions(UseParallel=true);

to configure parallel training, set the properties of trainopts.parallelizationoptions. for example, specify the asynchronous training mode:

trainOpts.ParallelizationOptions.Mode = "async";
trainOpts.ParallelizationOptions
ans = 
  ParallelTraining with properties:
                              Mode: "async"
                 WorkerRandomSeeds: -1
    TransferBaseWorkspaceVariables: "on"
                     AttachedFiles: []
                          SetupFcn: []
                        CleanupFcn: []

you can now use trainopts as an input argument to the train command to perform training with parallel computing.

to train an agent using the asynchronous advantage actor-critic (a3c) method, you must set the agent and parallel training options appropriately.

when creating the ac agent, set the numstepstolookahead value to be greater than 1. common values are 64 and 128.

agentOpts = rlACAgentOptions(NumStepsToLookAhead=64);

use agentopts when creating your agent. alternatively, create your agent first and then modify its options, including the actor and critic options, later using dot notation.

configure the training algorithm to use asynchronous parallel training.

trainOpts = rlTrainingOptions(UseParallel=true);
trainOpts.ParallelizationOptions.Mode = "async";

configure the workers to return gradient data to the host. also, set the number of steps before the workers send data back to the host to match the number of steps to look ahead.

trainOpts.ParallelizationOptions.DataToSendFromWorkers = ...
    "gradients";
trainOpts.ParallelizationOptions.StepsUntilDataIsSent = ...
    agentOpts.NumStepsToLookAhead;

use trainopts when training your agent.

for an example of asynchronous advantage actor-critic agent training, see train ac agent to balance cart-pole system using parallel computing.

version history

introduced in r2019a