
trainWithEvolutionStrategy

Train DDPG, TD3, or SAC agent using an evolutionary strategy within a specified environment

Since R2023b

Description

trainStats = trainWithEvolutionStrategy(agent,env,estOpts) trains agent within the environment env, using the evolution strategy training options object estOpts. Note that agent is a handle object and is updated during training, despite being an input argument. For more information on the training algorithm, see Train Agent with Evolution Strategy.

Examples

This example shows how to train a DDPG agent using an evolutionary strategy.

Load the predefined environment object representing a cart-pole system with a continuous action space. For more information on this environment, see Load Predefined Control System Environments.

env = rlPredefinedEnv("CartPole-Continuous");

The agent networks are initialized randomly. Ensure reproducibility by fixing the seed of the random number generator.

rng(0)

Create a DDPG agent with default networks.

agent = rlDDPGAgent(getObservationInfo(env),getActionInfo(env));

To create an evolution strategy training options object, use rlEvolutionStrategyTrainingOptions.

estOpts = rlEvolutionStrategyTrainingOptions(...
    PopulationSize=10, ...
    ReturnedPolicy="BestPolicy", ...
    StopTrainingCriteria="EpisodeCount", ...
    StopTrainingValue=100);

To train the agent, use trainWithEvolutionStrategy.

trainStats = trainWithEvolutionStrategy(agent,env,estOpts);

Display the reward accumulated during the last episode.

trainStats.EpisodeReward(end)
ans = 496.2431

This value means that the agent is able to balance the cart-pole system for the whole episode.
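As a further check (a sketch, not part of the original example), you can simulate the trained agent in the environment using sim and rlSimulationOptions. The MaxSteps value below is an assumed episode length, not a value from the original example.

% Simulate the trained agent for one episode and sum the collected reward.
% MaxSteps=500 is an assumed episode length.
simOpts = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simOpts);
totalReward = sum(experience.Reward.Data)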

Input Arguments

agent

Agent to train, specified as an rlDDPGAgent, rlTD3Agent, or rlSACAgent object.

Note

trainWithEvolutionStrategy updates the agent as training progresses. For more information on how to preserve the original agent, how to save an agent during training, and on the state of the agent after training, see the notes and the tips section in train. For more information about handle objects, see Handle Object Behavior.

For more information about how to create and configure agents for reinforcement learning, see Reinforcement Learning Agents.
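Because agent is a handle object that trainWithEvolutionStrategy modifies in place, one simple way to keep the untrained agent (a sketch, not from the original page; the file name is arbitrary) is to save it to a MAT-file before training.

% Save the untrained agent so its original parameters can be recovered later.
save("initialAgent.mat","agent")
trainStats = trainWithEvolutionStrategy(agent,env,estOpts);
% To recover the untrained agent later:
% previous = load("initialAgent.mat");
% untrainedAgent = previous.agent;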

env

Environment in which the agent acts, specified as a reinforcement learning environment object.

Note

Multiagent environments do not support training agents with an evolution strategy.

For more information about creating and configuring environments, see Create MATLAB Reinforcement Learning Environments and Create Simulink Reinforcement Learning Environments.

When env is a Simulink environment, calling trainWithEvolutionStrategy compiles and simulates the model associated with the environment.

estOpts

Parameters and options for training using an evolution strategy, specified as an rlEvolutionStrategyTrainingOptions object. Use this argument to specify parameters and options such as:

  • Population size

  • Population update method

  • Number of training epochs

  • Criteria for saving candidate agents

  • How to display training progress

Note

trainWithEvolutionStrategy does not support parallel computing.

For details, see rlEvolutionStrategyTrainingOptions.
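As an illustration (a sketch, not from the original page), you can also create a default options object and then set individual properties with dot notation. Only properties already used in the example above appear here; see the rlEvolutionStrategyTrainingOptions page for the full property list.

% Create a default evolution strategy training options object and adjust
% properties with dot notation.
estOpts = rlEvolutionStrategyTrainingOptions;
estOpts.PopulationSize = 25;            % candidate policies per generation
estOpts.ReturnedPolicy = "BestPolicy";  % return the best candidate found
estOpts.StopTrainingCriteria = "EpisodeCount";
estOpts.StopTrainingValue = 200;        % stop after 200 episodes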

Output Arguments

trainStats

Training episode data, returned as an rlTrainingResult object with the following properties:

EpisodeIndex

Episode numbers, returned as the column vector [1;2;…;N], where N is the number of episodes in the training run. This vector is useful if you want to plot the evolution of other quantities from episode to episode.

EpisodeReward

Reward for each episode, returned as a column vector of length N. Each entry contains the reward for the corresponding episode.

EpisodeSteps

Number of steps in each episode, returned as a column vector of length N. Each entry contains the number of steps in the corresponding episode.

AverageReward

Average reward over the averaging window specified in estOpts, returned as a column vector of length N. Each entry contains the average reward computed at the end of the corresponding episode.

TotalAgentSteps

Total number of agent steps in training, returned as a column vector of length N. Each entry contains the cumulative sum of the entries in EpisodeSteps up to that point.

EpisodeQ0

Critic estimate of the expected discounted cumulative long-term reward using the current agent and the environment initial conditions, returned as a column vector of length N. Each entry is the critic estimate (Q0) for the agent at the beginning of the corresponding episode. This field is present only for agents that have critics, such as the DDPG, TD3, and SAC agents supported by this function.

SimulationInfo

Information collected during the simulations performed for training, returned as:

  • For training in MATLAB environments, a structure containing the field SimulationError. This field is a column vector with one entry per episode. When the StopOnError option of rlTrainingOptions is "off", each entry contains any errors that occurred during the corresponding episode. Otherwise, the field contains an empty array.

  • For training in Simulink environments, a vector of Simulink.SimulationOutput objects containing simulation data recorded during the corresponding episode. Recorded data for an episode includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred during the corresponding episode.

EvaluationStatistic

Evaluation statistic for each episode, returned as a column vector with as many elements as the number of episodes. Because trainWithEvolutionStrategy does not support evaluator objects, each element of this vector is NaN. For more information, see rlEvaluator and rlCustomEvaluator.

TrainingOptions

Training options set, returned as an rlEvolutionStrategyTrainingOptions object.
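For example (a sketch, not from the original page, assuming the standard rlTrainingResult property names EpisodeIndex and AverageReward in addition to EpisodeReward), you can plot the per-episode and average rewards after training:

% Plot per-episode reward and the running average reward versus episode index.
plot(trainStats.EpisodeIndex,trainStats.EpisodeReward)
hold on
plot(trainStats.EpisodeIndex,trainStats.AverageReward)
hold off
xlabel("Episode")
ylabel("Reward")
legend("Episode reward","Average reward")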

Version History

Introduced in R2023b
