rlNeuralNetworkEnvironment
Environment model with deep neural network transition models
Since R2022a
Description
Use an rlNeuralNetworkEnvironment object to create a reinforcement learning environment that computes state transitions using deep neural networks.
Using an rlNeuralNetworkEnvironment object you can:
Create an internal environment model for a model-based policy optimization (MBPO) agent.
Create an environment for training other types of reinforcement learning agents. You can identify the state-transition network using experimental or simulated data.
Such environments can compute environment rewards and termination conditions using deep neural networks or custom functions.
Creation

Syntax

env = rlNeuralNetworkEnvironment(observationInfo,actionInfo,transitionFcn,rewardFcn,isDoneFcn)

Description

env = rlNeuralNetworkEnvironment(observationInfo,actionInfo,transitionFcn,rewardFcn,isDoneFcn) creates a model for an environment with the observation and action specifications specified in observationInfo and actionInfo, respectively. This syntax sets the TransitionFcn, RewardFcn, and IsDoneFcn properties.
input arguments
observationinfo
— observation specifications
rlnumericspec
object | array rlnumericspec
objects
this property is read-only.
observation specifications, specified as an rlnumericspec
object or an array of such objects. each element in the array defines the properties
of an environment observation channel, such as its dimensions, data type, and name.
you can extract the observation specifications from an existing environment or
agent using getobservationinfo
. you can also construct the specifications manually
using rlnumericspec
.
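For example, the following sketch constructs the specification for a single continuous observation channel manually; the four-element dimension and the channel name are illustrative assumptions.

% Illustrative observation specification: one continuous channel with four elements.
obsInfo = rlNumericSpec([4 1]);
obsInfo.Name = "observations";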
actionInfo — Action specifications
rlFiniteSetSpec object | rlNumericSpec object

Action specifications, specified as an rlFiniteSetSpec or rlNumericSpec object. This object defines the properties of the environment action channel, such as its dimensions, data type, and name.

Note: Only one action channel is allowed.

You can extract the action specifications from an existing environment or agent using getActionInfo. You can also construct the specification manually using rlFiniteSetSpec or rlNumericSpec.
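For example, the following sketch constructs a specification for a bounded continuous scalar action manually; the limits and the channel name are illustrative assumptions.

% Illustrative action specification: scalar continuous action bounded in [-10, 10].
actInfo = rlNumericSpec([1 1],LowerLimit=-10,UpperLimit=10);
actInfo.Name = "force";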
Properties

TransitionFcn — Environment transition function
rlContinuousDeterministicTransitionFunction object | rlContinuousGaussianTransitionFunction object | array of transition objects

Environment transition function, specified as one of the following:
rlContinuousDeterministicTransitionFunction object — Use this option when you expect the environment transitions to be deterministic.
rlContinuousGaussianTransitionFunction object — Use this option when you expect the environment transitions to be stochastic.
Vector of transition objects — Use multiple transition models for an MBPO agent, as shown in the sketch after this list.
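For example, assuming you have already created two transition function objects (transitionFcn1 and transitionFcn2 are hypothetical names), you can combine them into a vector for use with an MBPO agent:

% Hypothetical objects created beforehand, for example two
% rlContinuousDeterministicTransitionFunction objects.
transitionFcn = [transitionFcn1 transitionFcn2];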
RewardFcn — Environment reward function
rlContinuousDeterministicRewardFunction object | rlContinuousGaussianRewardFunction object | function handle

Environment reward function, specified as one of the following:
rlContinuousDeterministicRewardFunction object — Use this option when you do not know a ground-truth reward signal for your environment and you expect the reward signal to be deterministic.
rlContinuousGaussianRewardFunction object — Use this option when you do not know a ground-truth reward signal for your environment and you expect the reward signal to be stochastic.
Function handle — Use this option when you know a ground-truth reward signal for your environment. When you use an rlNeuralNetworkEnvironment object to create an rlMBPOAgent object, the custom reward function must return a batch of rewards given a batch of inputs.
IsDoneFcn — Environment is-done function
rlIsDoneFunction object | function handle

Environment is-done function, specified as one of the following:
rlIsDoneFunction object — Use this option when you do not know a ground-truth termination signal for your environment.
Function handle — Use this option when you know a ground-truth termination signal for your environment. When you use an rlNeuralNetworkEnvironment object to create an rlMBPOAgent object, the custom is-done function must return a batch of termination signals given a batch of inputs.
Observation — Observation values
cell array

Observation values, specified as a cell array with length equal to the number of specification objects in ObservationInfo. The order of the observations in Observation must match the order in ObservationInfo. Also, the dimensions of each element of the cell array must match the dimensions of the corresponding observation specification in ObservationInfo.

To evaluate whether the transition models are well-trained, you can manually evaluate the environment for a given observation value using the step function. Specify the observation values before calling step.

When you use this neural network environment object within an MBPO agent, this property is ignored.
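A minimal sketch, assuming env is an existing rlNeuralNetworkEnvironment object and obsInfo is the specification of its single observation channel:

% Set the observation from which step predicts the transition
% (one cell element per observation channel).
env.Observation = {rand(obsInfo.Dimension)};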
TransitionModelNum — Transition model index
1 (default) | positive integer

Transition model index, specified as a positive integer.

To evaluate whether the transition models are well-trained, you can manually evaluate the environment for a given observation value using the step function. To select which transition model in TransitionFcn to evaluate, specify the transition model index before calling step.

When you use this neural network environment object within an MBPO agent, this property is ignored.
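A minimal sketch, assuming env is an rlNeuralNetworkEnvironment object whose TransitionFcn property contains more than one transition model:

% Evaluate the second transition model on the next call to step.
env.TransitionModelNum = 2;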
Object Functions

rlMBPOAgent — Model-based policy optimization (MBPO) reinforcement learning agent
Examples

Create Neural Network Environment
Create an environment interface and extract observation and action specifications. Alternatively, you can create specifications using rlNumericSpec and rlFiniteSetSpec.

env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
Get the dimension of the observation and action spaces.

numObservations = obsInfo.Dimension(1);
numActions = actInfo.Dimension(1);
Create a deterministic transition function based on a deep neural network with two input channels (current observations and actions) and one output channel (predicted next observation).

% Create network layers.
statePath = featureInputLayer(numObservations, ...
    Normalization="none",Name="state");
actionPath = featureInputLayer(numActions, ...
    Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64,Name="fc1")
    reluLayer(Name="CriticRelu1")
    fullyConnectedLayer(64,Name="fc3")
    reluLayer(Name="CriticCommonRelu2")
    fullyConnectedLayer(numObservations,Name="nextObservation")];

% Combine network layers.
transitionNetwork = layerGraph(statePath);
transitionNetwork = addLayers(transitionNetwork,actionPath);
transitionNetwork = addLayers(transitionNetwork,commonPath);
transitionNetwork = connectLayers( ...
    transitionNetwork,"state","concat/in1");
transitionNetwork = connectLayers( ...
    transitionNetwork,"action","concat/in2");

% Create dlnetwork object.
transitionNetwork = dlnetwork(transitionNetwork);

% Create transition function object.
transitionFcn = rlContinuousDeterministicTransitionFunction( ...
    transitionNetwork,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");
Create a stochastic reward function based on a deep neural network with three input channels (current observations, actions, and next observations) and two output channels (predicted mean and standard deviation of the reward).

% Create network layers.
nextStatePath = featureInputLayer( ...
    numObservations,Name="nextState");
commonPath = [concatenationLayer(1,3,Name="concat")
    fullyConnectedLayer(32,Name="fc")
    reluLayer(Name="relu1")
    fullyConnectedLayer(32,Name="fc2")];
meanPath = [reluLayer(Name="rewardMeanRelu")
    fullyConnectedLayer(1,Name="rewardMean")];
stdPath = [reluLayer(Name="rewardStdRelu")
    fullyConnectedLayer(1,Name="rewardStdFc")
    softplusLayer(Name="rewardStd")];

% Combine network layers.
rewardNetwork = layerGraph(statePath);
rewardNetwork = addLayers(rewardNetwork,actionPath);
rewardNetwork = addLayers(rewardNetwork,nextStatePath);
rewardNetwork = addLayers(rewardNetwork,commonPath);
rewardNetwork = addLayers(rewardNetwork,meanPath);
rewardNetwork = addLayers(rewardNetwork,stdPath);
rewardNetwork = connectLayers( ...
    rewardNetwork,"nextState","concat/in1");
rewardNetwork = connectLayers( ...
    rewardNetwork,"action","concat/in2");
rewardNetwork = connectLayers( ...
    rewardNetwork,"state","concat/in3");
rewardNetwork = connectLayers( ...
    rewardNetwork,"fc2","rewardMeanRelu");
rewardNetwork = connectLayers( ...
    rewardNetwork,"fc2","rewardStdRelu");

% Create dlnetwork object.
rewardNetwork = dlnetwork(rewardNetwork);

% Create reward function object.
rewardFcn = rlContinuousGaussianRewardFunction( ...
    rewardNetwork,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationInputNames="nextState", ...
    RewardMeanOutputNames="rewardMean", ...
    RewardStandardDeviationOutputNames="rewardStd");
Create an is-done function with one input channel (next observations) and one output channel (predicted termination signal).

% Create network layers.
commonPath = [featureInputLayer(numObservations, ...
        Normalization="none",Name="nextState")
    fullyConnectedLayer(64,Name="fc1")
    reluLayer(Name="CriticRelu1")
    fullyConnectedLayer(64,Name="fc3")
    reluLayer(Name="CriticCommonRelu2")
    fullyConnectedLayer(2,Name="isdone0")
    softmaxLayer(Name="isdone")];

isdoneNetwork = layerGraph(commonPath);

% Create dlnetwork object.
isdoneNetwork = dlnetwork(isdoneNetwork);

% Create is-done function object.
isDoneFcn = rlIsDoneFunction(isdoneNetwork, ...
    obsInfo,actInfo, ...
    NextObservationInputNames="nextState");
Create a neural network environment using the transition, reward, and is-done functions.

env = rlNeuralNetworkEnvironment( ...
    obsInfo,actInfo, ...
    transitionFcn,rewardFcn,isDoneFcn);
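You can then manually evaluate the (as yet untrained) model environment. The following sketch assumes the standard [nextObs,reward,isDone] = step(env,act) calling form for environment objects; depending on your action specification, you might need to adjust the format of the action input.

% Specify the current observation (one cell element per observation channel).
env.Observation = {0.1*rand(obsInfo.Dimension)};

% Step the model environment using a force value within the action range.
act = 0.5;
[nextObs,reward,isDone] = step(env,act);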
Create Neural Network Environment Using Custom Functions
Create an environment interface and extract observation and action specifications. Alternatively, you can create specifications using rlNumericSpec and rlFiniteSetSpec.

env = rlPredefinedEnv("CartPole-Continuous");
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
numActions = actInfo.Dimension(1);
Create a deterministic transition function based on a deep neural network with two input channels (current observations and actions) and one output channel (predicted next observation).

% Create network layers.
statePath = featureInputLayer(numObservations, ...
    Normalization="none",Name="state");
actionPath = featureInputLayer(numActions, ...
    Normalization="none",Name="action");
commonPath = [concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64,Name="fc1")
    reluLayer(Name="CriticRelu1")
    fullyConnectedLayer(64,Name="fc3")
    reluLayer(Name="CriticCommonRelu2")
    fullyConnectedLayer(numObservations,Name="nextObservation")];

% Combine network layers.
transitionNetwork = layerGraph(statePath);
transitionNetwork = addLayers(transitionNetwork,actionPath);
transitionNetwork = addLayers(transitionNetwork,commonPath);
transitionNetwork = connectLayers(transitionNetwork,"state","concat/in1");
transitionNetwork = connectLayers(transitionNetwork,"action","concat/in2");

% Create dlnetwork object.
transitionNetwork = dlnetwork(transitionNetwork);

% Create transition function object.
transitionFcn = rlContinuousDeterministicTransitionFunction( ...
    transitionNetwork,obsInfo,actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");
You can define a known reward function for your environment using a custom function. Your custom reward function must take the observations, actions, and next observations as cell-array inputs and return a scalar reward value. For this example, use the following custom reward function, which computes the reward based on the next observation.

type cartPoleRewardFunction.m

function reward = cartPoleRewardFunction(obs,action,nextObs)
% Compute reward value based on the next observation.

if iscell(nextObs)
    nextObs = nextObs{1};
end

% Distance at which to fail the episode
xThreshold = 2.4;

% Reward each time step the cart-pole is balanced
rewardForNotFalling = 1;

% Penalty when the cart-pole fails to balance
penaltyForFalling = -5;

x = nextObs(1,:);
distReward = 1 - abs(x)/xThreshold;

isDone = cartPoleIsDoneFunction(obs,action,nextObs);

reward = zeros(size(isDone));
reward(logical(isDone)) = penaltyForFalling;
reward(~logical(isDone)) = ...
    0.5 * rewardForNotFalling + 0.5 * distReward(~logical(isDone));
end
You can define a known is-done function for your environment using a custom function. Your custom is-done function must take the observations, actions, and next observations as cell-array inputs and return a logical termination signal. For this example, use the following custom is-done function, which computes the termination signal based on the next observation.

type cartPoleIsDoneFunction.m

function isDone = cartPoleIsDoneFunction(obs,action,nextObs)
% Compute termination signal based on the next observation.

if iscell(nextObs)
    nextObs = nextObs{1};
end

% Angle at which to fail the episode
thetaThresholdRadians = 12 * pi/180;

% Distance at which to fail the episode
xThreshold = 2.4;

x = nextObs(1,:);
theta = nextObs(3,:);

isDone = abs(x) > xThreshold | abs(theta) > thetaThresholdRadians;
end
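Before using these functions in the environment model, you can sanity check them on a sample transition. This is a minimal sketch; the sample values are assumptions chosen so that the cart position in the next observation exceeds the 2.4 m threshold.

% Sample transition: the next cart position (3 m) is beyond the threshold,
% so the episode should terminate and receive the falling penalty.
obs     = {[0; 0; 0; 0]};
act     = {0};
nextObs = {[3; 0; 0; 0]};

isDone = cartPoleIsDoneFunction(obs,act,nextObs)   % expected: true
reward = cartPoleRewardFunction(obs,act,nextObs)   % expected: -5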
Create a neural network environment using the transition function object and the custom reward and is-done functions.

env = rlNeuralNetworkEnvironment(obsInfo,actInfo,transitionFcn, ...
    @cartPoleRewardFunction,@cartPoleIsDoneFunction);
Version History

Introduced in R2022a