Reset environment, agent, experience buffer, or policy object

Since R2022a

Description


initialObs = reset(env) resets the specified MATLAB® environment to an initial state and returns the resulting initial observation value.

Do not use reset for Simulink® environments, which are implicitly reset when running a new simulation. Instead, customize the reset behavior using the ResetFcn property of the environment.
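
For example, for a Simulink environment created with rlSimulinkEnv, you can assign ResetFcn a function handle that modifies the Simulink.SimulationInput object before each simulation. The following is a minimal sketch; the model name, agent block path, and workspace variable theta0 are hypothetical placeholders, not part of this reference page.

% Minimal sketch (hypothetical model, block path, and variable names).
obsInfo = rlNumericSpec([3 1]);
actInfo = rlNumericSpec([1 1]);
env = rlSimulinkEnv("myModel","myModel/RL Agent",obsInfo,actInfo);

% Randomize the initial condition theta0 before each simulation. The reset
% function receives and must return a Simulink.SimulationInput object.
env.ResetFcn = @(in) setVariable(in,"theta0",pi*(2*rand-1));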


reset(agent) resets the specified agent. Resetting a built-in agent performs the following actions, if applicable.

  • Empty the experience buffer.

  • Set the recurrent neural network states of the actor and critic networks to zero.

  • Reset the states of any noise models used by the agent.

agent = reset(agent) also returns the reset agent as an output argument.
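
For instance, an agent that uses a recurrent neural network carries its hidden state across getAction calls, and reset returns that state to zero. The following is a minimal sketch, assuming a default DQN agent created from hypothetical observation and action specifications.

% Minimal sketch (assumed specifications): a default DQN agent created with
% a recurrent network keeps hidden state between getAction calls.
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);
initOpts = rlAgentInitializationOptions(UseRNN=true);
agent = rlDQNAgent(obsInfo,actInfo,initOpts);

getAction(agent,{rand(4,1)});  % advances the recurrent state
reset(agent);                  % recurrent state is set back to zero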


resetPolicy = reset(policy) returns the policy object resetPolicy in which any recurrent neural network states are set to zero and any noise model states are set to their initial conditions. This syntax has no effect if the policy object does not use a recurrent neural network and does not have a noise model with state.
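
For example, an additive noise policy built from a continuous deterministic actor holds noise-model state that reset returns to its initial condition. The following is a minimal sketch, assuming hypothetical specifications and a single-layer network.

% Minimal sketch (assumed specifications and network).
obsInfo = rlNumericSpec([3 1]);
actInfo = rlNumericSpec([1 1],LowerLimit=-1,UpperLimit=1);

net = dlnetwork([featureInputLayer(3) fullyConnectedLayer(1)]);
actor = rlContinuousDeterministicActor(net,obsInfo,actInfo);
policy = rlAdditiveNoisePolicy(actor);

% Return the noise-model state to its initial condition.
policy = reset(policy);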


reset(buffer) resets the specified replay memory buffer by removing all the experiences.

Examples

Reset Environment

Create a reinforcement learning environment. For this example, create a continuous-time cart-pole system.

env = rlPredefinedEnv("CartPole-Continuous");

Reset the environment and return the initial observation.

initialObs = reset(env)
initialObs = 4×1
         0
         0
    0.0315
         0

Reset Agent

Create observation and action specifications.

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1]);

Using these specifications, create a default DDPG agent that uses a recurrent neural network.

initOptions = rlAgentInitializationOptions(UseRNN=true);
agent = rlDDPGAgent(obsInfo,actInfo,initOptions);

Reset the agent.

agent = reset(agent);

Reset Experience Buffer

Create observation and action specifications.

obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1]);

Create a replay memory experience buffer.

buffer = rlReplayMemory(obsInfo,actInfo,10000);

Add experiences to the buffer. For this example, add 20 random experiences.

for i = 1:20
    expBatch(i).Observation = {obsInfo.UpperLimit.*rand(4,1)};
    expBatch(i).Action = {actInfo.UpperLimit.*rand(1,1)};
    expBatch(i).NextObservation = {obsInfo.UpperLimit.*rand(4,1)};
    expBatch(i).Reward = 10*rand(1);
    expBatch(i).IsDone = 0;
end
expBatch(20).IsDone = 1;
append(buffer,expBatch);

Reset and clear the buffer.

reset(buffer)

Reset Policy

Create observation and action specifications.

obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 0 1]);

To approximate the Q-value function within the critic, use a deep neural network. Create each network path as an array of layer objects.

% Create paths.
obsPath = [featureInputLayer(4) 
           fullyConnectedLayer(1,Name="obsout")];
actPath = [featureInputLayer(1) 
           fullyConnectedLayer(1,Name="actout")];
comPath = [additionLayer(2,Name="add")  ...
           fullyConnectedLayer(1)];
% Add layers.
net = layerGraph;
net = addLayers(net,obsPath); 
net = addLayers(net,actPath); 
net = addLayers(net,comPath);
net = connectLayers(net,"obsout","add/in1");
net = connectLayers(net,"actout","add/in2");
% Convert to dlnetwork object.
net = dlnetwork(net);
% Display the number of weights.
summary(net)
   Initialized: true
   Number of learnables: 9
   Inputs:
      1   'input'     4 features
      2   'input_1'   1 features

Create an epsilon-greedy policy object using a Q-value function approximator.

critic = rlQValueFunction(net,obsInfo,actInfo);
policy = rlEpsilonGreedyPolicy(critic)
policy = 
  rlEpsilonGreedyPolicy with properties:
            QValueFunction: [1x1 rl.function.rlQValueFunction]
        ExplorationOptions: [1x1 rl.option.EpsilonGreedyExploration]
    UseEpsilonGreedyAction: 1
        EnableEpsilonDecay: 1
           ObservationInfo: [1x1 rl.util.rlNumericSpec]
                ActionInfo: [1x1 rl.util.rlFiniteSetSpec]
                SampleTime: -1

Reset the policy.

policy = reset(policy);

Input Arguments

env — Reinforcement learning environment, specified as one of the following objects.

agent — Reinforcement learning agent, specified as one of the following objects.

Note

agent is a handle object, so it is reset whether it is returned as an output argument or not. For more information about handle objects, see Handle Object Behavior.

buffer — Experience buffer, specified as one of the following replay memory objects.

policy — Reinforcement learning policy, specified as one of the following objects.

Output Arguments

initialObs — Initial environment observation after reset, returned as one of the following:

  • Array with dimensions matching the observation specification, for an environment with a single observation channel.

  • Cell array with length equal to the number of observation channels, for an environment with multiple observation channels. Each element of the cell array contains an array with dimensions matching the corresponding element of the environment observation specifications.

resetPolicy — Reset policy, returned as a policy object of the same type as policy but with its recurrent neural network states set to zero and any noise model states set to their initial conditions.

agent — Reset agent, returned as an agent object. Note that agent is a handle object, so if it contains any recurrent neural network, its state is reset whether agent is returned as an output argument or not. For more information about handle objects, see Handle Object Behavior.

Version History

Introduced in R2022a

See Also

Functions
