Simulate reinforcement learning environment against policy or agent

Since R2022a

Description

output = runEpisode(env,policy) runs a single simulation of the environment env against the policy policy.

output = runEpisode(env,agent) runs a single simulation of the environment env against the agent agent. During the simulation, the policy of the agent is evaluated to produce actions, but learnable parameters are not updated.

output = runEpisode(___,Name=Value) specifies nondefault simulation options using one or more name-value arguments.
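The syntaxes above can be sketched as follows. This is a minimal illustration, not a recommended workflow: it assumes Reinforcement Learning Toolbox and Deep Learning Toolbox are installed, and it uses a default DQN agent only because it is a convenient agent to simulate.

```matlab
% MATLAB names are case-sensitive: runEpisode, MaxSteps, AgentData, and so on.
env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

% Default DQN agent (an assumption made for this sketch).
agent = rlDQNAgent(obsInfo,actInfo);

% Run one episode, capped at 50 steps with the MaxSteps name-value argument.
output = runEpisode(env,agent,MaxSteps=50);

% Read the episode summary from the output structure.
stepsTaken = output.AgentData.EpisodeInfo.StepsTaken;
totalReward = output.AgentData.EpisodeInfo.CumulativeReward;
fprintf("Episode ran %d steps, total reward %g\n",stepsTaken,totalReward);
```

Because the agent's learnable parameters are not updated during the simulation, repeating the call gives episodes that differ only through the randomness of the environment and the policy.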

Examples

Create a reinforcement learning environment and extract its observation and action specifications.

env = rlPredefinedEnv("CartPole-Discrete");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

To approximate the policy within the actor, use a neural network. Create a network as an array of layer objects.

net = [...
    featureInputLayer(obsInfo.Dimension(1))
    fullyConnectedLayer(24)
    reluLayer
    fullyConnectedLayer(24)
    reluLayer
    fullyConnectedLayer(2)
    softmaxLayer];

Convert the network to a dlnetwork object and display the number of learnable parameters (weights).

net = dlnetwork(net);
summary(net)
   Initialized: true
   Number of learnables: 770
   Inputs:
      1   'input'   4 features

Create a discrete categorical actor using the network.

actor = rlDiscreteCategoricalActor(net,obsInfo,actInfo);

Check your actor with a random observation.

act = getAction(actor,{rand(obsInfo.Dimension)})
act = 1x1 cell array
    {[-10]}

Create a policy object from the actor.

policy = rlStochasticActorPolicy(actor);

Create an experience buffer.

buffer = rlReplayMemory(obsInfo,actInfo);

Set up the environment for running multiple simulations. For this example, configure the environment to log any errors rather than send them to the Command Window.

setup(env,StopOnError="off")

Simulate multiple episodes using the environment and policy. After each episode, append the experiences to the buffer. For this example, run 100 episodes.

for i = 1:100
    output = runEpisode(env,policy,MaxSteps=300);
    append(buffer,output.AgentData.Experiences)
end

Clean up the environment.

cleanup(env)

Sample a mini-batch of experiences from the buffer. For this example, sample 10 experiences.

batch = sample(buffer,10);

You can then learn from the sampled experiences and update the policy and actor.

Input Arguments

env: Reinforcement learning environment, specified as a MATLAB or Simulink environment object.

policy: Policy object, specified as one of the following objects.

  • rlDeterministicActorPolicy

  • rlAdditiveNoisePolicy

  • rlEpsilonGreedyPolicy

  • rlMaxQPolicy

  • rlStochasticActorPolicy

If env is a Simulink environment configured for multi-agent training, specify policy as an array of policy objects. The order of the policies in the array must match the agent order used to create env.

For more information on a policy object, at the MATLAB® command line, type help followed by the policy object name.

agent: Reinforcement learning agent, specified as an agent object.

If env is a Simulink environment configured for multi-agent training, specify agent as an array of agent objects. The order of the agents in the array must match the agent order used to create env.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: MaxSteps=1000

MaxSteps: Maximum simulation steps, specified as a positive integer.

ProcessExperienceFcn: Function for processing experiences and updating the policy or agent based on each experience as it occurs during the simulation, specified as a function handle with the following signature.

[updatedPolicy,updatedData] = myFcn(experience,episodeInfo,policy,data)

Here:

  • experience is a structure that contains a single experience. For more information on the structure fields, see output.Experiences.

  • episodeInfo contains data about the current episode and corresponds to output.EpisodeInfo.

  • policy is the policy or agent object being simulated.

  • data contains experience processing data. For more information, see ProcessExperienceData.

  • updatedPolicy is the updated policy or agent.

  • updatedData is the updated experience processing data, which is used as the data input when processing the next experience.

If env is a Simulink environment configured for multi-agent training, specify ProcessExperienceFcn as a cell array of function handles. The order of the function handles in the array must match the agent order used to create env.
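As an illustration of the signature above, the following sketch defines a hypothetical processing function, tallyExperience, that returns the policy unchanged and keeps a running tally in the processing data. The function name, the tally fields NumExperiences and TotalReward, and the hand-built experience structures are assumptions made for this example. They let the function run outside a simulation and do not reproduce the full experience structure.

```matlab
% Hypothetical tally that would be passed through the processing data.
data = struct('NumExperiences',0,'TotalReward',0);

% Hand-built experiences, used only to exercise the function directly.
exp1 = struct('Reward',1,'IsDone',false);
exp2 = struct('Reward',1,'IsDone',true);
policy = [];   % placeholder; runEpisode would pass the simulated policy or agent

[policy,data] = tallyExperience(exp1,[],policy,data);
[policy,data] = tallyExperience(exp2,[],policy,data);
disp(data)

function [updatedPolicy,updatedData] = tallyExperience(experience,~,policy,data)
    % Return the policy unchanged; a learning loop would update it here.
    updatedPolicy = policy;
    % Carry the tally forward so that it is the data input for the next experience.
    updatedData = data;
    updatedData.NumExperiences = data.NumExperiences + 1;
    updatedData.TotalReward = data.TotalReward + experience.Reward;
end
```

The second input, episodeInfo, is ignored here with ~ because the tally does not need it.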

ProcessExperienceData: Experience processing data, specified as any MATLAB data, such as an array or structure. Use this data to pass additional parameters or information to the experience processing function.

You can also update this data within the experience processing function to use different parameters when processing the next experience. The data values that you specify when you call runEpisode are used to process the first experience in the simulation.

If env is a Simulink environment configured for multi-agent training, specify ProcessExperienceData as a cell array. The order of the array elements must match the agent order used to create env.
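A minimal sketch of passing processing data through a simulation might look like the following. It assumes Reinforcement Learning Toolbox and Deep Learning Toolbox, uses a default DQN agent for convenience, and uses a hypothetical anonymous function, countFcn, that leaves the agent untouched and increments a counter.

```matlab
env = rlPredefinedEnv("CartPole-Discrete");
agent = rlDQNAgent(getObservationInfo(env),getActionInfo(env));

% Hypothetical processing function: return the agent unchanged and add one
% to the counter held in the processing data. deal supplies both outputs.
countFcn = @(experience,episodeInfo,policy,data) deal(policy,data + 1);

% The initial data value (0) is used when processing the first experience.
output = runEpisode(env,agent, ...
    MaxSteps=50, ...
    ProcessExperienceFcn=countFcn, ...
    ProcessExperienceData=0);

% The final processing data is returned with the agent data.
numProcessed = output.AgentData.ProcessExperienceData;
fprintf("Processed %d experiences\n",numProcessed);
```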

CleanupPostSim: Option to clean up the environment after the simulation, specified as true or false. When CleanupPostSim is true, runEpisode calls cleanup(env) when the simulation ends.

To run multiple episodes without cleaning up the environment, set CleanupPostSim to false. You can then call cleanup(env) after running your simulations.

If env is a SimulinkEnvWithAgent object and the associated Simulink model is configured to use fast restart, then the model remains in a compiled state between simulations when CleanupPostSim is false.
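The pattern described above, several episodes followed by one cleanup, can be sketched as follows. The default DQN agent, the episode count, and the step cap are arbitrary choices for this illustration, and the sketch assumes Reinforcement Learning Toolbox and Deep Learning Toolbox.

```matlab
env = rlPredefinedEnv("CartPole-Discrete");
agent = rlDQNAgent(getObservationInfo(env),getActionInfo(env));

numEpisodes = 5;
rewards = zeros(1,numEpisodes);
for k = 1:numEpisodes
    % Keep the environment set up between episodes.
    out = runEpisode(env,agent,MaxSteps=50,CleanupPostSim=false);
    rewards(k) = out.AgentData.EpisodeInfo.CumulativeReward;
end

% Clean up once, after all simulations have finished.
cleanup(env)
disp(rewards)
```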

LogExperiences: Option to log experiences for each policy or agent, specified as true or false. When LogExperiences is true, the experiences of the policy or agent are logged in output.Experiences.

Output Arguments

output: Simulation output, returned as a structure with the fields AgentData and SimulationInfo.

The AgentData field is a structure array containing data for each agent or policy. Each AgentData structure has the following fields.

Field: Description

Experiences: Logged experience of the policy or agent, returned as a structure array. Each experience contains the following fields.

  • Observation — Observation

  • Action — Action taken

  • NextObservation — Resulting next observation

  • Reward — Corresponding reward

  • IsDone — Termination signal

Time: Simulation times of experiences, returned as a vector.

EpisodeInfo: Episode information, returned as a structure with the following fields.

  • CumulativeReward — Total reward for all experiences

  • StepsTaken — Number of simulation steps taken

  • InitialObservation — Initial observation at the start of the simulation

ProcessExperienceData: Experience processing data

Agent: Policy or agent used in the simulation

The SimulationInfo field is one of the following:

  • For MATLAB environments — Structure containing the field SimulationError. This structure contains any errors that occurred during simulation.

  • For Simulink environments — Simulink.SimulationOutput object containing simulation data. Recorded data includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred.
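The following sketch shows one way to read the fields described above for a MATLAB environment. It assumes Reinforcement Learning Toolbox and Deep Learning Toolbox, and it uses a default DQN agent only so that there is something to simulate. The variable names are illustrative.

```matlab
env = rlPredefinedEnv("CartPole-Discrete");
agent = rlDQNAgent(getObservationInfo(env),getActionInfo(env));
output = runEpisode(env,agent,MaxSteps=20);

% Per-agent data: logged experiences, their simulation times, episode summary.
agentData = output.AgentData;
experiences = agentData.Experiences;   % structure array, one element per experience
simTimes = agentData.Time;             % simulation times of the experiences
info = agentData.EpisodeInfo;

firstReward = experiences(1).Reward;
lastIsDone = experiences(end).IsDone;
fprintf("Steps: %d, cumulative reward: %g\n",info.StepsTaken,info.CumulativeReward);

% For a MATLAB environment, SimulationInfo holds any simulation error.
simError = output.SimulationInfo.SimulationError;
if isempty(simError)
    disp("No simulation error was recorded.")
end
```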

If env is configured to run simulations on parallel workers, then output is a Future object, which supports deferred outputs for environment simulations that run on workers.
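A rough sketch of working with a deferred output follows. It assumes Parallel Computing Toolbox is available in addition to Reinforcement Learning Toolbox and Deep Learning Toolbox, and it assumes that the Future object supports fetchOutputs for retrieving the finished result. Check the Future object documentation for your release before relying on this call.

```matlab
env = rlPredefinedEnv("CartPole-Discrete");
agent = rlDQNAgent(getObservationInfo(env),getActionInfo(env));

% Configure the environment to simulate on parallel workers.
setup(env,UseParallel=true)

% With a parallel setup, runEpisode returns a Future object right away.
ftr = runEpisode(env,agent,MaxSteps=50,CleanupPostSim=false);

% Assumption: fetchOutputs blocks until the episode finishes, then returns
% the same output structure that a serial call would return.
output = fetchOutputs(ftr);
disp(output.AgentData.EpisodeInfo.StepsTaken)

cleanup(env)
```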

Tips

  • You can speed up episode simulation by using parallel computing. To do so, use the setup function and set the UseParallel argument to true.

    setup(env,UseParallel=true)

Version History

Introduced in R2022a
