
rlEvolutionStrategyTrainingOptions

Options for training off-policy reinforcement learning agents using an evolutionary strategy

Since R2023b

Description

Use an rlEvolutionStrategyTrainingOptions object to specify options to train a DDPG, TD3, or SAC agent within an environment. Evolution strategy training options include the population size and its update method, the number of training epochs, as well as criteria for stopping training and saving agents. After setting its options, use this object as an input argument for trainWithEvolutionStrategy.

For more information on the training algorithm, see Train Agent with Evolution Strategy. For more information on training agents, see Train Reinforcement Learning Agents.
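As a quick orientation, the following sketch (values are illustrative, not recommendations) creates an options object with a nondefault population size and stopping criterion; all other properties keep their default values.

esOpts = rlEvolutionStrategyTrainingOptions( ...
    PopulationSize=25, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=480);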

Creation

Description

trainOpts = rlEvolutionStrategyTrainingOptions returns the default options for training a DDPG, TD3, or SAC agent using an evolutionary strategy.

example

trainOpts = rlEvolutionStrategyTrainingOptions(Name=Value) creates the training option set trainOpts and sets its properties using one or more name-value arguments.

Properties

Number of individuals in the population, specified as a positive integer. Every individual corresponds to an actor.

Example: PopulationSize=50

Percentage of individuals surviving to form the next population, specified as an integer between 1 and 100.

Example: PercentageEliteSize=30

Maximum number of episodes run per individual, specified as a positive integer.

Example: EvaluationsPerIndividual=2

Number of training epochs used to update the gradient-based agent. If you set TrainEpochs to 0, then the agent is updated without any gradient-based training (that is, using a pure evolutionary search strategy). For more information on the training algorithm, see Train Agent with Evolution Strategy.

Example: TrainEpochs=5
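For instance, to rely on a pure evolutionary search without any gradient-based updates, you can set TrainEpochs to 0 (a sketch, assuming a default options object):

esOpts = rlEvolutionStrategyTrainingOptions;
esOpts.TrainEpochs = 0;   % no gradient-based updates: pure evolutionary search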

Population update options, specified as a GaussianUpdateOptions object. For more information on the training algorithm, see Train Agent with Evolution Strategy.

The properties of the GaussianUpdateOptions object determine how the evolution algorithm updates the population distribution. You can modify them using dot notation after creating the rlEvolutionStrategyTrainingOptions object. They are as follows.

Update method for the population distribution, specified as either:

  • "WeightedMixing" — When computing the mean and standard deviation of the population distribution, weights each actor according to its fitness (that is, better actors are weighted more).

  • "UniformMixing" — When computing the mean and standard deviation of the population distribution, weights each actor equally.

Example: UpdateMethod="UniformMixing"

Initial mean of the population distribution, specified as a scalar.

Example: InitialMean=-0.5

Initial standard deviation of the population distribution, specified as a scalar.

Example: InitialStandardDeviation=0.5

Initial bias of the standard deviation of the population distribution, specified as a scalar. A larger value promotes exploration.

Example: InitialStandardDeviationBias=0.2

Final bias of the standard deviation of the population distribution, specified as a nonnegative scalar.

Example: FinalStandardDeviationBias=0.002

Decay rate of the bias of the standard deviation of the population distribution, specified as a positive scalar.

At the end of each training time step, the bias of the population standard deviation StdBias is updated as follows.

StdBias = (1-StandardDeviationBiasDecayRate)*StdBias + ...
          StandardDeviationBiasDecayRate*FinalStandardDeviationBias

Note that StdBias is conserved between the end of an episode and the start of the next one. Therefore, it continues to decay toward FinalStandardDeviationBias over multiple episodes until it reaches that value.

Example: StandardDeviationBiasDecayRate=0.99
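The following plain MATLAB sketch (using hypothetical numeric values, not the option defaults) iterates the update rule above to show how the bias decays exponentially from its initial value toward FinalStandardDeviationBias:

% Illustrative values only, not the option defaults
initialBias = 0.2;
finalBias   = 0.002;
decayRate   = 0.05;

stdBias = initialBias;
for step = 1:100
    % Same update rule as above: exponential decay toward the final bias
    stdBias = (1-decayRate)*stdBias + decayRate*finalBias;
end
stdBias   % approaches finalBias as the number of steps grows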

Type of the policy returned once training is terminated, specified as either "AveragedPolicy" or "BestPolicy".

Example: ReturnedPolicy="BestPolicy"

Maximum number of generations for which the population is updated, specified as a positive integer.

Example: MaxGenerations=1000

Maximum number of steps to run per episode, specified as a positive integer. In general, you define episode termination conditions in the environment. This value is the maximum number of steps to run in the episode if other termination conditions are not met.

Example: MaxStepsPerEpisode=1000
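For example, to budget a longer run you might raise both limits together (a sketch using arbitrary values):

esOpts = rlEvolutionStrategyTrainingOptions;
esOpts.MaxGenerations = 2000;       % allow more population updates
esOpts.MaxStepsPerEpisode = 1000;   % longer episodes before forced termination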

Window length for averaging the scores, rewards, and number of steps, specified as a scalar or vector.

For options expressed in terms of averages, ScoreAveragingWindowLength is the number of episodes included in the average. For instance, if StopTrainingCriteria is "AverageReward" and StopTrainingValue is 500, training terminates when the average reward over the number of episodes specified in ScoreAveragingWindowLength equals or exceeds 500.

Example: ScoreAveragingWindowLength=10
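For instance, the following sketch reproduces the scenario described above, stopping training when the reward averaged over the last 10 episodes reaches 500 (values are illustrative):

esOpts = rlEvolutionStrategyTrainingOptions( ...
    ScoreAveragingWindowLength=10, ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=500);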

Training termination condition, specified as one of the following strings:

  • "AverageReward" — Stop training when the running average reward equals or exceeds the critical value.

  • "EpisodeReward" — Stop training when the reward in the current episode equals or exceeds the critical value.

Example: StopTrainingCriteria="AverageReward"

Critical value of the training termination condition, specified as a scalar.

Training ends when the termination condition specified by the StopTrainingCriteria option equals or exceeds this value.

For instance, if StopTrainingCriteria is "AverageReward" and StopTrainingValue is 100, training terminates when the average reward over the number of episodes specified in ScoreAveragingWindowLength equals or exceeds 100.

Example: StopTrainingValue=100

Condition for saving agents during training, specified as one of the following strings:

  • "none" — Do not save any agents during training.

  • "EpisodeReward" — Save the agent when the reward in the current episode equals or exceeds the critical value.

  • "AverageSteps" — Save the agent when the running average number of steps per episode equals or exceeds the critical value specified by the option SaveAgentValue. The average is computed using the window ScoreAveragingWindowLength.

  • "AverageReward" — Save the agent when the running average reward over all episodes equals or exceeds the critical value.

  • "GlobalStepCount" — Save the agent when the total number of steps in all episodes (the total number of times the agent is invoked) equals or exceeds the critical value.

  • "EpisodeCount" — Save the agent when the number of training episodes equals or exceeds the critical value.

Set this option to store candidate agents that perform well according to the criteria you specify. When you set this option to a value other than "none", the software sets the SaveAgentValue option to 500. You can change that value to specify the condition for saving the agent.

For instance, suppose you want to store for further testing any agent that yields an episode reward that equals or exceeds 100. To do so, set SaveAgentCriteria to "EpisodeReward" and set the SaveAgentValue option to 100. When an episode reward equals or exceeds 100, trainWithEvolutionStrategy saves the current agent in a MAT-file in the folder specified by the SaveAgentDirectory option. The MAT-file is called AgentK.mat, where K is the number of the corresponding episode. The agent is stored within that MAT-file as saved_agent.

Example: SaveAgentCriteria="EpisodeReward"
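For example, the configuration described above (save any agent whose episode reward reaches 100, using a hypothetical folder name) can be sketched as:

esOpts = rlEvolutionStrategyTrainingOptions;
esOpts.SaveAgentCriteria  = "EpisodeReward";
esOpts.SaveAgentValue     = 100;
esOpts.SaveAgentDirectory = "savedAgents";   % created during training if it does not exist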

Critical value of the condition for saving agents, specified as a scalar.

When you specify a condition for saving candidate agents using SaveAgentCriteria, the software sets this value to 500. Change the value to specify the condition for saving the agent. See the SaveAgentCriteria option for more details.

Example: SaveAgentValue=100

Folder name for saved agents, specified as a string or character vector. The folder name can contain a full or relative path. When an episode occurs in which the conditions specified by the SaveAgentCriteria and SaveAgentValue options are satisfied, the software saves the current agent in a MAT-file in this folder. If the folder does not exist, trainWithEvolutionStrategy creates it. When SaveAgentCriteria is "none", this option is ignored and trainWithEvolutionStrategy does not create a folder.

Example: SaveAgentDirectory = pwd + "\run1\Agents"

Option to display training progress at the command line, specified as the logical values false (0) or true (1). Set to true to write information from each training episode to the MATLAB® command line during training.

Example: Verbose=false

Option to stop training when an error occurs during an episode, specified as "on" or "off". When this option is "off", errors are captured and returned in the SimulationInfo output of trainWithEvolutionStrategy, and training continues to the next episode.

Example: StopOnError="off"

Option to display training progress with Episode Manager, specified as "training-progress" or "none". By default, calling trainWithEvolutionStrategy opens the Reinforcement Learning Episode Manager, which graphically and numerically displays information about the training progress, such as the reward for each episode, average reward, number of episodes, and total number of steps. For more information, see train. To turn off this display, set this option to "none".

Example: Plots="none"
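For example, to run without the graphical display and instead print progress to the command line (a sketch):

esOpts = rlEvolutionStrategyTrainingOptions;
esOpts.Plots   = "none";    % do not open Episode Manager
esOpts.Verbose = true;      % print per-episode information at the command line instead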

Object Functions

trainWithEvolutionStrategy — Train DDPG, TD3 or SAC agent using an evolutionary strategy within a specified environment

Examples

Create an options set for training a DDPG, TD3, or SAC agent using an evolutionary strategy. Set the population size, the number of training epochs, and the maximum number of steps per episode. You can set the options using name-value arguments when you create the options set. Any options that you do not explicitly set have their default values.

esOpts = rlEvolutionStrategyTrainingOptions(...
    PopulationSize=50, ...
    TrainEpochs=10, ...
    MaxStepsPerEpisode=500)
esOpts = 
  EvolutionStrategyTrainingOptions with properties:
                PopulationSize: 50
           PercentageEliteSize: 50
      EvaluationsPerIndividual: 1
                   TrainEpochs: 10
       PopulationUpdateOptions: [1×1 rl.option.GaussianUpdateOptions]
                ReturnedPolicy: "AveragedPolicy"
                MaxGenerations: 500
            MaxStepsPerEpisode: 500
    ScoreAveragingWindowLength: 5
          StopTrainingCriteria: "AverageSteps"
             StopTrainingValue: 500
             SaveAgentCriteria: "none"
                SaveAgentValue: "none"
            SaveAgentDirectory: "savedAgents"
                       Verbose: 0
                         Plots: "training-progress"

Alternatively, create a default options set and use dot notation to change some of the values.

esOpts = rlEvolutionStrategyTrainingOptions;
esOpts.PopulationSize = 30;
esOpts.TrainEpochs = 15;
esOpts.MaxStepsPerEpisode = 500;

Set the population update method and the initial standard deviation in the PopulationUpdateOptions property.

esOpts.PopulationUpdateOptions.UpdateMethod = "UniformMixing";
esOpts.PopulationUpdateOptions.InitialStandardDeviation = 0.2;

To train a supported off-policy agent with an evolutionary strategy, you can now use esOpts as an input argument to trainWithEvolutionStrategy.
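For reference, a minimal end-to-end sketch is shown below. It assumes the predefined double-integrator environment and a default DDPG agent created from the environment specifications; trainWithEvolutionStrategy is then called with the agent, the environment, and the options object.

% Continuous-action predefined environment and a default off-policy agent (assumed setup)
env = rlPredefinedEnv("DoubleIntegrator-Continuous");
agent = rlDDPGAgent(getObservationInfo(env),getActionInfo(env));

% Train the agent with the evolutionary strategy options configured above
results = trainWithEvolutionStrategy(agent,env,esOpts);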

Algorithms

Version History

Introduced in R2023b
