Simulate reinforcement learning environment against policy or agent

Since R2022a

Syntax

output = runEpisode(env,policy)

output = runEpisode(env,agent)

output = runEpisode(___,Name=Value)

Description

output = runEpisode(env,policy) runs a single simulation of the environment env against the policy policy.

output = runEpisode(env,agent) runs a single simulation of the environment env against the agent agent. During the simulation, the policy of the agent is evaluated to produce actions, but its learnable parameters are not updated.

output = runEpisode(___,Name=Value) specifies nondefault simulation options using one or more name-value arguments.

Examples

Simulate Environment and Agent

Create a reinforcement learning environment and extract its observation and action specifications.

env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

To approximate the policy within the actor, use a neural network that takes the observation as input and returns a probability for each of the two possible actions. Create a network as an array of layer objects.

net = [...
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(24)
    reluLayer
    fullyConnectedLayer(24)
    reluLayer
    fullyConnectedLayer(2)
    softmaxLayer];

Convert the network to a dlnetwork object and display the number of learnable parameters (weights).

net = dlnetwork(net);
summary(net)

   Initialized: true
   Number of learnables: 770
   Inputs:
      1   'input'   4 features

Create a discrete categorical actor using the network.

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

Check your actor with a random observation.

act = getAction(actor,{rand(obsInfo.Dimension)})

act = 1x1 cell array
    {[-10]}

Create a policy object from the actor.

policy = rlStochasticActorPolicy(actor);

Create an experience buffer.

buffer = rlReplayMemory(obsInfo,actInfo);

Set up the environment for running multiple simulations. For this example, configure the environment to log any errors rather than send them to the Command Window.

setup(env,StopOnError="off")

Simulate multiple episodes using the environment and policy. After each episode, append the experiences to the buffer. For this example, run 100 episodes.

for i = 1:100
    output = runEpisode(env,policy,MaxSteps=300);
    append(buffer,output.AgentData.Experiences)
end

Clean up the environment.

cleanup(env)

Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.

batch = sample(buffer,10);

You can then learn from the sampled experiences and update the policy and actor.

Input Arguments

`env` — Reinforcement learning environment
environment object | ...

Reinforcement learning environment, specified as one of the following objects.

rlFunctionEnv — Environment defined using custom functions
SimulinkEnvWithAgent — Simulink® environment created using rlSimulinkEnv or createIntegratedEnv
rlMDPEnv — Markov decision process environment
rlNeuralNetworkEnvironment — Environment with deep neural network transition models
Predefined environment created using rlPredefinedEnv
Custom environment created from a template (rlCreateEnvTemplate)

`policy` — Policy
policy object | array of policy objects

Policy object, specified as one of the following objects.

rlDeterministicActorPolicy
rlAdditiveNoisePolicy
rlEpsilonGreedyPolicy
rlMaxQPolicy
rlStochasticActorPolicy

If env is a Simulink environment configured for multi-agent training, specify policy as an array of policy objects. The order of the policies in the array must match the agent order used to create env.

For more information on a policy object, type help followed by the policy object name at the MATLAB® command line.

`agent` — Reinforcement learning agent
agent object | array of agent objects

Reinforcement learning agent, specified as one of the following objects.

rlTD3Agent
rlACAgent
Custom agent

If env is a Simulink environment configured for multi-agent training, specify agent as an array of agent objects. The order of the agents in the array must match the agent order used to create env.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: MaxSteps=1000

`MaxSteps` — Maximum simulation steps
`500` (default) | positive integer

Maximum number of simulation steps in the episode, specified as a positive integer.

`ProcessExperienceFcn` — Function for processing experiences
function handle | cell array of function handles

Function for processing experiences and updating the policy or agent based on each experience as it occurs during the simulation, specified as a function handle with the following signature.

[updatedPolicy,updatedData] = myFcn(experience,episodeInfo,policy,data)

Here:

experience is a structure that contains a single experience. For more information on the structure fields, see output.AgentData.Experiences.
episodeInfo contains data about the current episode and corresponds to output.AgentData.EpisodeInfo.
policy is the policy or agent object being simulated.
data contains experience processing data. For more information, see ProcessExperienceData.
updatedPolicy is the updated policy or agent.
updatedData is the updated experience processing data, which is used as the data input when processing the next experience.

If env is a Simulink environment configured for multi-agent training, specify ProcessExperienceFcn as a cell array of function handles. The order of the function handles in the array must match the agent order used to create env.
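
As a rough illustration of the signature above, the following sketch passes a processing function that leaves the policy unchanged and only keeps a running tally in the experience processing data. The function name countExperience and the NumSteps and TotalReward fields are hypothetical names chosen for this sketch, and the single-layer actor network is an assumption made only to obtain a policy to simulate.

```matlab
% Create a predefined environment and a small stochastic policy (illustrative only).
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
net = dlnetwork([
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(numel(actInfo.Elements))
    softmaxLayer]);
actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);
policy = rlStochasticActorPolicy(actor);

% Initial experience processing data (hypothetical fields for this sketch).
initData = struct("NumSteps",0,"TotalReward",0);

% Run one episode, calling countExperience as each experience occurs.
output = runEpisode(env,policy, ...
    MaxSteps=100, ...
    ProcessExperienceFcn=@countExperience, ...
    ProcessExperienceData=initData);

% The final processing data is returned with the agent data.
finalData = output.AgentData.ProcessExperienceData;
disp(finalData)

function [updatedPolicy,updatedData] = countExperience(experience,episodeInfo,policy,data)
    % Return the policy unchanged; a learning routine would modify it here.
    updatedPolicy = policy;
    % Update the tally that is passed in when processing the next experience.
    updatedData = data;
    updatedData.NumSteps = data.NumSteps + 1;
    updatedData.TotalReward = data.TotalReward + experience.Reward;
end
```

In a script file, the local function must appear after the code that calls it, as in this sketch. A function that actually learns would compute an update from experience and return the modified policy in updatedPolicy.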

`ProcessExperienceData` — Experience processing data
any MATLAB data type | cell array

Experience processing data, specified as any MATLAB data, such as an array or structure. Use this data to pass additional parameters or information to the experience processing function.

You can also update this data within the experience processing function to use different parameters when processing the next experience. The data values that you specify when you call runEpisode are used to process the first experience in the simulation.

If env is a Simulink environment configured for multi-agent training, specify ProcessExperienceData as a cell array. The order of the array elements must match the agent order used to create env.

`CleanupPostSim` — Option to clean up environment
`true` (default) | `false`

Option to clean up the environment after the simulation, specified as true or false. When CleanupPostSim is true, runEpisode calls cleanup(env) when the simulation ends.

To run multiple episodes without cleaning up the environment, set CleanupPostSim to false. You can then call cleanup(env) after running your simulations.

If env is a SimulinkEnvWithAgent object and the associated Simulink model is configured to use fast restart, then the model remains in a compiled state between simulations when CleanupPostSim is false.
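
The pattern described above might look like the following sketch, which runs a few episodes with CleanupPostSim set to false and cleans up once at the end. The episode count, the step limit, the explicit setup(env) call, and the single-layer actor network are assumptions made for illustration.

```matlab
% Create a predefined environment and a small stochastic policy (illustrative only).
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
net = dlnetwork([
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(numel(actInfo.Elements))
    softmaxLayer]);
policy = rlStochasticActorPolicy(rlDiscreteCategoricalActor(net,obsInfo,actInfo));

% Prepare the environment once before the batch of simulations.
setup(env)

numEpisodes = 5;                    % arbitrary number of episodes for this sketch
rewards = zeros(1,numEpisodes);
for i = 1:numEpisodes
    % Skip the per-episode cleanup so the environment stays set up between runs.
    output = runEpisode(env,policy,MaxSteps=100,CleanupPostSim=false);
    rewards(i) = output.AgentData.EpisodeInfo.CumulativeReward;
end

% Clean up once, after all simulations have finished.
cleanup(env)
disp(rewards)
```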

`LogExperiences` — Option to log experiences
`true` (default) | `false`

Option to log experiences for each policy or agent, specified as true or false. When LogExperiences is true, the experiences of the policy or agent are logged in output.AgentData.Experiences.

Output Arguments

`output` — Simulation output
structure | `Future` object

Simulation output, returned as a structure with the fields AgentData and SimulationInfo.

The AgentData field is a structure array containing data for each agent or policy. Each AgentData structure has the following fields.

Field	Description
`Experiences`	Logged experience of the policy or agent, returned as a structure array. Each experience contains the following fields: `Observation` (observation), `Action` (action taken), `NextObservation` (resulting next observation), `Reward` (corresponding reward), and `IsDone` (termination signal).
`Time`	Simulation times of experiences, returned as a vector.
`EpisodeInfo`	Episode information, returned as a structure with the following fields: `CumulativeReward` (total reward for all experiences), `StepsTaken` (number of simulation steps taken), and `InitialObservation` (initial observation at the start of the simulation).
`ProcessExperienceData`	Experience processing data.
`Agent`	Policy or agent used in the simulation.

The SimulationInfo field is one of the following:

For MATLAB environments — Structure containing the field SimulationError. This structure contains any errors that occurred during simulation.
For Simulink environments — Simulink.SimulationOutput object containing simulation data. Recorded data includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred.
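
To make the layout of output concrete, the following sketch runs one episode in a MATLAB environment and reads a few of the fields described above. The step limit and the single-layer actor network are assumptions made for illustration; the field names are the ones listed in this section.

```matlab
% Create a predefined environment and a small stochastic policy (illustrative only).
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
net = dlnetwork([
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(numel(actInfo.Elements))
    softmaxLayer]);
policy = rlStochasticActorPolicy(rlDiscreteCategoricalActor(net,obsInfo,actInfo));

% Run a single episode of at most 50 steps.
output = runEpisode(env,policy,MaxSteps=50);

% Episode-level summary for the simulated policy.
info = output.AgentData.EpisodeInfo;
fprintf("Steps taken: %d\n",info.StepsTaken);
fprintf("Cumulative reward: %g\n",info.CumulativeReward);

% Logged experiences and their simulation times.
experiences = output.AgentData.Experiences;
times = output.AgentData.Time;
fprintf("Logged experiences: %d\n",numel(experiences));

% For a MATLAB environment, SimulationInfo holds any simulation error.
simError = output.SimulationInfo.SimulationError;
```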

If env is configured to run simulations on parallel workers, then output is a Future object, which supports deferred outputs for environment simulations that run on workers.

Tips

You can speed up episode simulation by using parallel computing. To do so, use the setup function and set the UseParallel argument to true.
```
setup(env,UseParallel=true)
```

Version History

Introduced in R2022a
