
rlNeuralNetworkEnvironment

Environment model with deep neural network transition models

Since R2022a

Description

Use an rlNeuralNetworkEnvironment object to create a reinforcement learning environment that computes state transitions using deep neural networks.

Using an rlNeuralNetworkEnvironment object you can:

  • Create an internal environment model for a model-based policy optimization (MBPO) agent.

  • Create an environment for training other types of reinforcement learning agents. You can identify the state-transition network using experimental or simulated data.

Such environments can compute environment rewards and termination conditions using deep neural networks or custom functions.

Creation

Description


env = rlNeuralNetworkEnvironment(observationInfo,actionInfo,transitionFcn,rewardFcn,isDoneFcn) creates a model for an environment with the observation and action specifications specified in observationInfo and actionInfo, respectively. This syntax sets the TransitionFcn, RewardFcn, and IsDoneFcn properties.

Input Arguments

This property is read-only.

Observation specifications, specified as an rlNumericSpec object or an array of such objects. Each element in the array defines the properties of an environment observation channel, such as its dimensions, data type, and name.

You can extract the observation specifications from an existing environment or agent using getObservationInfo. You can also construct the specifications manually using rlNumericSpec.

Action specifications, specified as an rlFiniteSetSpec or rlNumericSpec object. This object defines the properties of the environment action channel, such as its dimensions, data type, and name.

Note

Only one action channel is allowed.

You can extract the action specifications from an existing environment or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
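For example, here is a minimal sketch of constructing both specifications manually. The dimensions, limits, and action values are hypothetical placeholders.

% Hypothetical specifications: a 4-element continuous observation channel
% and a discrete action channel with three possible force values.
obsInfo = rlNumericSpec([4 1],LowerLimit=-inf,UpperLimit=inf);
obsInfo.Name = "observations";
actInfo = rlFiniteSetSpec([-10 0 10]);
actInfo.Name = "force";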

Properties

Environment transition function, specified as one of the following:

  • rlContinuousDeterministicTransitionFunction object — use this option when you expect the environment state transitions to be deterministic.

  • rlContinuousGaussianTransitionFunction object — use this option when you expect the environment state transitions to be stochastic.

  • Array of such transition function objects — use this option to model multiple state-transition functions, such as the model ensemble used by an MBPO agent.

Environment reward function, specified as one of the following:

  • rlContinuousDeterministicRewardFunction object — use this option when you do not know a ground-truth reward signal for your environment and you expect the reward signal to be deterministic.

  • rlContinuousGaussianRewardFunction object — use this option when you do not know a ground-truth reward signal for your environment and you expect the reward signal to be stochastic.

  • Function handle — use this option when you know a ground-truth reward signal for your environment. When you use an rlNeuralNetworkEnvironment object to create an rlMBPOAgent object, the custom reward function must return a batch of rewards given a batch of inputs.

Environment is-done function, specified as one of the following:

  • rlIsDoneFunction object — use this option when you do not know a ground-truth termination signal for your environment.

  • Function handle — use this option when you know a ground-truth termination signal for your environment. When you use an rlNeuralNetworkEnvironment object to create an rlMBPOAgent object, the custom is-done function must return a batch of termination signals given a batch of inputs. A minimal sketch of both batch-compatible custom functions follows this list.
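As an illustration of the batch requirement, here is a minimal sketch of custom reward and is-done function handles. The observation layout and the threshold value are hypothetical; each input is a cell array of batched channels, and each output contains one value per batch sample.

function reward = myRewardFcn(obs,action,nextObs)
% Hypothetical batch-compatible reward: one value per column (sample).
    nextObs = nextObs{1};                % obsDim-by-batchSize array
    reward = 1 - abs(nextObs(1,:))/2.4;  % 1-by-batchSize row vector
end

function isDone = myIsDoneFcn(obs,action,nextObs)
% Hypothetical batch-compatible termination signal.
    nextObs = nextObs{1};
    isDone = abs(nextObs(1,:)) > 2.4;    % 1-by-batchSize logical vector
end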

Observation values, specified as a cell array with length equal to the number of specification objects in ObservationInfo. The order of the observations in Observation must match the order in ObservationInfo. Also, the dimensions of each element of the cell array must match the dimensions of the corresponding observation specification in ObservationInfo.

To evaluate whether the transition models are well trained, you can manually evaluate the environment for a given observation value using the step function. Specify the observation values before calling step.

When you use this neural network environment object within an MBPO agent, this property is ignored.

Transition model index, specified as a positive integer.

To evaluate whether the transition models are well trained, you can manually evaluate the environment for a given observation value using the step function. To select which transition model in TransitionFcn to evaluate, specify the transition model index before calling step.

When you use this neural network environment object within an MBPO agent, this property is ignored.
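For example, here is a minimal sketch of such a manual evaluation for an existing rlNeuralNetworkEnvironment object env. The observation value and test action are hypothetical, and the transition model index property is assumed here to be named TransitionModelNum.

% Hypothetical manual one-step evaluation of a trained environment model.
env.Observation = {[0.01 0 0.02 0]'};   % one cell per observation channel
env.TransitionModelNum = 1;             % index of the transition model to evaluate
[nextObs,reward,isDone] = step(env,5);  % apply a test action and inspect the outputs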

Object Functions

rlMBPOAgent — Model-based policy optimization (MBPO) reinforcement learning agent

Examples

Create Neural Network Environment

Create an environment interface and extract observation and action specifications. Alternatively, you can create specifications using rlNumericSpec and rlFiniteSetSpec.

env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);

Get the dimension of the observation and action spaces.

numObservations = obsInfo.Dimension(1);
numActions = actInfo.Dimension(1);

Create a deterministic transition function based on a deep neural network with two input channels (current observations and actions) and one output channel (predicted next observation).

% Create network layers.
statePath = featureInputLayer(numObservations, ...
    Normalization="none",Name="state");
actionPath = featureInputLayer(numActions, ...
    Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64,Name="fc1")
    reluLayer(Name="criticRelu1")
    fullyConnectedLayer(64,Name="fc3")
    reluLayer(Name="criticCommonRelu2")
    fullyConnectedLayer(numObservations,Name="nextObservation")];
% Combine network layers.
transitionNetwork = layerGraph(statePath);
transitionNetwork = addLayers(transitionNetwork,actionPath);
transitionNetwork = addLayers(transitionNetwork,commonPath);
transitionNetwork = connectLayers( ...
    transitionNetwork,"state","concat/in1");
transitionNetwork = connectLayers( ...
    transitionNetwork,"action","concat/in2");
% Create dlnetwork object.
transitionNetwork = dlnetwork(transitionNetwork);
% Create transition function object.
transitionFcn = rlContinuousDeterministicTransitionFunction( ...
    transitionNetwork,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");

Create a stochastic reward function based on a deep neural network with three input channels (current observations, actions, and next observations) and two output channels (predicted reward mean and standard deviation).

% Create network layers.
nextStatePath = featureInputLayer( ...
    numObservations,Name="nextState");
commonPath = [concatenationLayer(1,3,Name="concat")
    fullyConnectedLayer(32,Name="fc")
    reluLayer(Name="relu1")
    fullyConnectedLayer(32,Name="fc2")];
meanPath = [reluLayer(Name="rewardMeanRelu")
    fullyConnectedLayer(1,Name="rewardMean")];
stdPath = [reluLayer(Name="rewardStdRelu")
    fullyConnectedLayer(1,Name="rewardStdFc")
    softplusLayer(Name="rewardStd")];
% Combine network layers.
rewardNetwork = layerGraph(statePath);
rewardNetwork = addLayers(rewardNetwork,actionPath);
rewardNetwork = addLayers(rewardNetwork,nextStatePath);
rewardNetwork = addLayers(rewardNetwork,commonPath);
rewardNetwork = addLayers(rewardNetwork,meanPath);
rewardNetwork = addLayers(rewardNetwork,stdPath);
rewardNetwork = connectLayers( ...
    rewardNetwork,"nextState","concat/in1");
rewardNetwork = connectLayers( ...
    rewardNetwork,"action","concat/in2");
rewardNetwork = connectLayers( ...
    rewardNetwork,"state","concat/in3");
rewardNetwork = connectLayers( ...
    rewardNetwork,"fc2","rewardMeanRelu");
rewardNetwork = connectLayers( ...
    rewardNetwork,"fc2","rewardStdRelu");
% Create dlnetwork object.
rewardNetwork = dlnetwork(rewardNetwork);
% Create reward function object.
rewardFcn = rlContinuousGaussianRewardFunction( ...
    rewardNetwork,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationInputNames="nextState", ...
    RewardMeanOutputNames="rewardMean", ...
    RewardStandardDeviationOutputNames="rewardStd");

Create an is-done function with one input channel (next observations) and one output channel (predicted termination signal).

% Create network layers.
commonPath = [featureInputLayer(numObservations, ...
        Normalization="none",Name="nextState");
    fullyConnectedLayer(64,Name="fc1")
    reluLayer(Name="criticRelu1")
    fullyConnectedLayer(64,Name="fc3")
    reluLayer(Name="criticCommonRelu2")
    fullyConnectedLayer(2,Name="isdone0")
    softmaxLayer(Name="isdone")];
isDoneNetwork = layerGraph(commonPath);
% Create dlnetwork object.
isDoneNetwork = dlnetwork(isDoneNetwork);
% Create is-done function object.
isDoneFcn = rlIsDoneFunction(isDoneNetwork, ...
    obsInfo,actInfo, ...
    NextObservationInputNames="nextState");

Create a neural network environment using the transition, reward, and is-done functions.

env = rlNeuralNetworkEnvironment( ...
    obsInfo,actInfo, ...
    transitionFcn,rewardFcn,isDoneFcn);

Create Neural Network Environment Using Custom Functions

Create an environment interface and extract observation and action specifications. Alternatively, you can create specifications using rlNumericSpec and rlFiniteSetSpec.

env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numActions = actInfo.Dimension(1);

Create a deterministic transition function based on a deep neural network with two input channels (current observations and actions) and one output channel (predicted next observation).

% Create network layers.
statePath = featureInputLayer(numObservations, ...
    Normalization="none",Name="state");
actionPath = featureInputLayer(numActions, ...
    Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64,Name="fc1")
    reluLayer(Name="criticRelu1")
    fullyConnectedLayer(64,Name="fc3")
    reluLayer(Name="criticCommonRelu2")
    fullyConnectedLayer(numObservations,Name="nextObservation")];
% Combine network layers.
transitionNetwork = layerGraph(statePath);
transitionNetwork = addLayers(transitionNetwork,actionPath);
transitionNetwork = addLayers(transitionNetwork,commonPath);
transitionNetwork = connectLayers(transitionNetwork,"state","concat/in1");
transitionNetwork = connectLayers(transitionNetwork,"action","concat/in2");
% Create dlnetwork object.
transitionNetwork = dlnetwork(transitionNetwork);
% Create transition function object.
transitionFcn = rlContinuousDeterministicTransitionFunction( ...
    transitionNetwork,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");

You can define a known reward function for your environment using a custom function. Your custom reward function must take the observations, actions, and next observations as cell-array inputs and return a scalar reward value. For this example, use the following custom reward function, which computes the reward based on the next observation.

type cartPoleRewardFunction.m
function reward = cartPoleRewardFunction(obs,action,nextObs)
% Compute reward value based on the next observation.
    if iscell(nextObs)
        nextObs = nextObs{1};
    end
    % Distance at which to fail the episode
    xThreshold = 2.4;
    % Reward for each time step the cart-pole is balanced
    rewardForNotFalling = 1;
    % Penalty when the cart-pole fails to balance
    penaltyForFalling = -5;
    x = nextObs(1,:);
    distReward = 1 - abs(x)/xThreshold;
    isDone = cartPoleIsDoneFunction(obs,action,nextObs);
    reward = zeros(size(isDone));
    reward(logical(isDone)) = penaltyForFalling;
    reward(~logical(isDone)) = ...
        0.5 * rewardForNotFalling + 0.5 * distReward(~logical(isDone));
end

You can define a known is-done function for your environment using a custom function. Your custom is-done function must take the observations, actions, and next observations as cell-array inputs and return a logical termination signal. For this example, use the following custom is-done function, which computes the termination signal based on the next observation.

type cartPoleIsDoneFunction.m
function isDone = cartPoleIsDoneFunction(obs,action,nextObs)
% Compute termination signal based on next observation.
    if iscell(nextObs)
        nextObs = nextObs{1};
    end
    % Angle at which to fail the episode
    thetaThresholdRadians = 12 * pi/180;
    % Distance at which to fail the episode
    xThreshold = 2.4;
    x = nextObs(1,:);
    theta = nextObs(3,:);
    isDone = abs(x) > xThreshold | abs(theta) > thetaThresholdRadians;
end

Create a neural network environment using the transition function object and the custom reward and is-done functions.

env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn, ...
    @cartPoleRewardFunction,@cartPoleIsDoneFunction);

Version History

Introduced in R2022a
