create policies and value functions
a reinforcement learning policy is a mapping from an environment observation to a probability distribution of the actions to be taken (starting from the state corresponding to the observation). a value (or q-value) function is a mapping from an environment observation (or observation-action pair) to the value of a policy. the value of a policy is defined as its expected discounted cumulative long-term reward.
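As a sketch of the definitions above, the value and Q-value of a policy π can be written as follows. The symbols γ (discount factor), r (reward), and the time index t are notational assumptions that the text does not define.

```latex
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right]
```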
reinforcement learning agents use parametrized policies and value functions, which are implemented by function approximators called actors and critics, respectively. during training, the actor learns the policy that selects the best action to take. it does so by tuning its parameters to assign larger probability to actions that yield the greater values. the critic learns the value (or q-value) function that estimates the value of the current policy. it does so by tuning its parameters so that the predicted rewards approximate the observed ones.
before creating a non-default agent, you must create the actor and critic using approximation models such as deep neural networks, linear basis functions, or lookup tables. the type of function approximator and model you can use depends on the type of agent that you want to create.
you can also create policy objects from agents, actors, or critics. you can train these objects using custom loops and deploy them in applications.
The following sections introduce actor, critic, and policy objects, as well as their internal approximation models. For an introduction to agents, see Reinforcement Learning Agents.
actors and critics
reinforcement learning toolbox™ software supports the following types of actors and critics:
V(s|θV): critics that estimate the expected cumulative long-term reward of a policy based on a given observation s. You can create these critics using rlValueFunction.
Q(s,a|θQ): critics that estimate the expected cumulative long-term reward of a policy for a given discrete or continuous action a and a given observation s. You can create these critics using rlQValueFunction.
Qi(s,ai|θQ): multi-output critics that estimate the expected cumulative long-term reward of a policy for all possible discrete actions ai given the observation s. You can create these critics using rlVectorQValueFunction.
π(s|θπ): actors with a continuous action space that select an action deterministically based on a given observation s, thereby implementing a deterministic policy. You can create these actors using rlContinuousDeterministicActor.
π(s|θπ): actors that select an action stochastically (the action is sampled from a probability distribution) based on a given observation s, thereby implementing a stochastic policy. You can create these actors using either rlDiscreteCategoricalActor (for discrete action spaces) or rlContinuousGaussianActor (for continuous action spaces).
each approximator uses a set of parameters (θv, θq, θπ), which are computed during the learning process.
for systems with a limited number of discrete observations and discrete actions, you can store value functions in a lookup table. for systems that have many discrete observations and actions and for observation and action spaces that are continuous, storing the observations and actions is impractical. for such systems, you can represent your actors and critics using deep neural networks or custom (linear in the parameters) basis functions.
the following table summarizes the way in which you can use the six approximator objects available with reinforcement learning toolbox software, depending on the action and observation spaces of your environment, and on the approximation model and agent that you want to use.
how function approximators (actors or critics) are used in agents
Approximator (actor or critic) | Supported model | Observation space | Action space | Supported agents |
---|---|---|---|---|
Value function critic V(s), which you create using rlValueFunction | Table | Discrete | Not applicable | PG, AC, PPO |
Value function critic V(s), which you create using rlValueFunction | Deep neural network or custom basis function | Discrete or continuous | Not applicable | PG, AC, PPO |
Value function critic V(s), which you create using rlValueFunction | Deep neural network | Discrete or continuous | Not applicable | TRPO |
Q-value function critic Q(s,a), which you create using rlQValueFunction | Table | Discrete | Discrete | Q, DQN, SARSA |
Q-value function critic Q(s,a), which you create using rlQValueFunction | Deep neural network or custom basis function | Discrete or continuous | Discrete | Q, DQN, SARSA |
Q-value function critic Q(s,a), which you create using rlQValueFunction | Deep neural network or custom basis function | Discrete or continuous | Continuous | DDPG, TD3, SAC |
Multi-output Q-value function critic with a discrete action space Q(s,a), which you create using rlVectorQValueFunction | Deep neural network or custom basis function | Discrete or continuous | Discrete | Q, DQN, SARSA |
Deterministic policy actor with a continuous action space π(s), which you create using rlContinuousDeterministicActor | Deep neural network or custom basis function | Discrete or continuous | Continuous | DDPG, TD3 |
Stochastic policy actor with a discrete action space π(s), which you create using rlDiscreteCategoricalActor | Deep neural network or custom basis function | Discrete or continuous | Discrete | PG, AC, PPO |
Stochastic policy actor with a discrete action space π(s), which you create using rlDiscreteCategoricalActor | Deep neural network | Discrete or continuous | Discrete | TRPO |
Stochastic policy actor with a continuous action space π(s), which you create using rlContinuousGaussianActor | Deep neural network | Discrete or continuous | Continuous | PG, AC, PPO, SAC, TRPO |
You can configure the actor and critic optimization options using rlOptimizerOptions objects within an agent options object. Specifically, you can create an agent options object and set its CriticOptimizerOptions and ActorOptimizerOptions properties to appropriate rlOptimizerOptions objects. Then you pass the agent options object to the function that creates the agent.
Alternatively, you can create the agent and then use dot notation to access the optimization options for the agent actor and critic, for example:
agent.AgentOptions.ActorOptimizerOptions.LearnRate = 0.1;
For more information on agents, see Reinforcement Learning Agents.
policy objects
You can extract a policy object from an agent using getGreedyPolicy or getExplorationPolicy, or you can create a policy object from an actor or critic.
Once you have the policy object, you can use getAction to generate deterministic or stochastic actions from it, given an input observation. Unlike function approximator objects such as actors and critics, policy objects do not have functions that you can use to easily calculate gradients with respect to parameters. Therefore, policy objects are better suited to application deployment than to training. The following table describes the available policy objects.
policy objects
Policy object and getAction behavior | Distribution and exploration | Action space | Approximator objects used for creation | Agents needed for extraction |
---|---|---|---|---|
rlMaxQPolicy: generates actions that maximize a discrete action-space Q-value function | Deterministic (no exploration) and greedy | Discrete | rlQValueFunction or rlVectorQValueFunction | Q, DQN, SARSA |
rlEpsilonGreedyPolicy: generates either actions that maximize a discrete action-space Q-value function with probability 1-ε, or random actions otherwise | Default: stochastic (random actions help exploration) | Discrete | rlQValueFunction or rlVectorQValueFunction | Q, DQN, SARSA |
rlDeterministicActorPolicy: generates continuous deterministic actions | Deterministic (no exploration) and greedy | Continuous | rlContinuousDeterministicActor | DDPG, TD3 |
rlAdditiveNoisePolicy: generates continuous deterministic actions with added noise according to an internal noise model | Default: stochastic (noise helps exploration) | Continuous | rlContinuousDeterministicActor | DDPG, TD3 |
rlStochasticActorPolicy: generates stochastic actions according to a probability distribution | Default: stochastic (random actions help exploration) | Discrete | rlDiscreteCategoricalActor | PG, AC, PPO, TRPO |
rlStochasticActorPolicy: generates stochastic actions according to a probability distribution | Default: stochastic (random actions help exploration) | Continuous | rlContinuousGaussianActor | PG, AC, PPO, TRPO, SAC |
Each of the stochastic policy objects has an option to enable deterministic behavior, thereby disabling exploration. Except for rlEpsilonGreedyPolicy and rlAdditiveNoisePolicy, you can use generatePolicyBlock and generatePolicyFunction to generate a Simulink® block or a function that evaluates the policy, returning an action for a given observation input. You can then use the generated function or block to generate code for application deployment. For more information, see Deploy Trained Reinforcement Learning Policies.
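The following is a minimal sketch of creating a policy object from a critic and querying it for an action. The observation and action sets and the table entries are made-up values for illustration, and the sketch assumes that table rows correspond to observations and columns to actions.

```matlab
% Hypothetical discrete environment: 3 observations, 2 actions
obsInfo = rlFiniteSetSpec([1 2 3]);
actInfo = rlFiniteSetSpec([10 20]);

% Q-table critic with illustrative (untrained) values
qTable = rlTable(obsInfo,actInfo);
qTable.Table = [1 0; 0 1; 1 0];
critic = rlQValueFunction(qTable,obsInfo,actInfo);

% Greedy policy built from the critic
policy = rlMaxQPolicy(critic);

% Query the policy for the observation with value 2;
% the action is returned in a cell array
act = getAction(policy,{2});
disp(act{1})
```

Because rlMaxQPolicy is deterministic, repeated calls with the same observation should return the same action.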
table models
Value function approximators (critics) based on lookup table models are appropriate for environments with a limited number of discrete observations and actions. You can create two types of lookup tables:
Value tables, which store rewards for corresponding observations
Q-tables, which store rewards for corresponding observation-action pairs
To create a table-based critic, first create a value table or Q-table using the rlTable function. Then use the table object as an input argument for either rlValueFunction or rlQValueFunction to create the approximator object.
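As a sketch of these steps, the following code builds a value table and a Q-table and wraps each in a critic. The observation and action sets are hypothetical values chosen for illustration.

```matlab
% Hypothetical specifications: 4 discrete observations, 2 discrete actions
obsInfo = rlFiniteSetSpec([1 2 3 4]);
actInfo = rlFiniteSetSpec([-1 1]);

% Value table (one entry per observation) and
% Q-table (one entry per observation-action pair)
vTable = rlTable(obsInfo);
qTable = rlTable(obsInfo,actInfo);

% Create the critics from the tables
vCritic = rlValueFunction(vTable,obsInfo);
qCritic = rlQValueFunction(qTable,obsInfo,actInfo);

% Query the critics for a given observation (and action)
v = getValue(vCritic,{2});
q = getValue(qCritic,{2},{-1});
```

You can initialize or inspect the stored values through the Table property of each table object.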
custom basis function models
Custom (linear in the parameters) basis function approximation models have the form f = W'B, where W is a weight array and B is the column vector output of a custom basis function that you must create. The learnable parameters of a linear basis function are the elements of W.
For value function critics (such as the ones used in AC, PG, or PPO agents), f is a scalar value, so W must be a column vector with the same length as B, and B must be a function of the observation. For more information and examples, see rlValueFunction.
For single-output Q-value function critics (such as the ones used in Q, DQN, SARSA, DDPG, TD3, and SAC agents), f is a scalar value, so W must be a column vector with the same length as B, and B must be a function of both the observation and action. For more information and examples, see rlQValueFunction.
For multi-output Q-value function critics with discrete action spaces (such as those used in Q, DQN, and SARSA agents), f is a vector with as many elements as the number of possible actions. Therefore, W must be a matrix with as many columns as the number of possible actions and as many rows as the length of B. B must be a function of the observation only. For more information and examples, see rlVectorQValueFunction.
For deterministic actors with a continuous action space (such as the ones in DDPG and TD3 agents), the dimensions of f must match the dimensions of the agent action specification, which is either a scalar or a column vector. For more information and examples, see rlContinuousDeterministicActor.
For stochastic actors with a discrete action space (such as the ones in PG, AC, and PPO agents), f must be a column vector with length equal to the number of possible discrete actions. The output of the actor is softmax(f), which represents the probability of selecting each possible action. For more information and examples, see rlDiscreteCategoricalActor.
Stochastic actors with continuous action spaces cannot rely on custom basis functions. They can use only neural network approximators, because the standard deviations must be constrained to be positive. For more information and examples, see rlContinuousGaussianActor.
For any actor, W must have as many columns as the number of elements in f, and as many rows as the number of elements in B. B must be a function of the observation only.
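As a minimal sketch of a value function critic with a custom basis function, the following code evaluates f = W'B for a two-element observation. The basis function and the initial weights W0 are hypothetical choices made only for illustration.

```matlab
% Hypothetical observation channel with two continuous elements
obsInfo = rlNumericSpec([2 1]);

% Illustrative basis function: returns a 4-element column vector B
basisFcn = @(obs) [obs(1); obs(2); obs(1)^2; obs(2)^2];

% Initial weights: column vector with the same length as B
W0 = [1; 2; 3; 4];

% Create the critic from the basis function and initial weights
critic = rlValueFunction({basisFcn,W0},obsInfo);

% Evaluate the critic: W0'*B = 1*1 + 2*2 + 3*1 + 4*4 = 24
v = getValue(critic,{[1; 2]});
```

The learnable parameters of this critic are the four elements of the weight vector, which you can inspect with getLearnableParameters.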
for an example that trains a custom agent that uses a linear basis function, see create and train custom lqr agent.
neural network models
You can create actor and critic function approximators using deep neural network models. Doing so uses Deep Learning Toolbox™ software features.
network input and output dimensions
The dimensions of the network input and output layers for your actor and critic must match the dimensions of the corresponding environment observation and action channels, respectively. To obtain the action and observation specifications from the environment env, use the getActionInfo and getObservationInfo functions, respectively.
actInfo = getActionInfo(env);
obsInfo = getObservationInfo(env);
Access the Dimension property of each channel. For example, get the size of the first observation and action channel:
actSize = actInfo(1).Dimension;
obsSize = obsInfo(1).Dimension;
In general, actSize and obsSize are row vectors whose elements are the lengths of the corresponding dimensions. For example, if the first observation channel is a 256-by-256 RGB image, obsSize is the vector [256 256 3]. To calculate the total number of dimensions of the channel, use prod. For example, assuming the environment has only one observation channel:
obsDimensions = prod(obsInfo.Dimension);
For multi-output Q-value critics and stochastic actors with a discrete action space, you need to obtain the number of possible elements of the action set. You can do so by accessing the Elements property of the action channel. For example, assuming the environment has only one action channel:
actNumElements = numel(actInfo.Elements);
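The following sketch runs the same queries on specification objects created by hand rather than obtained from an environment. The image size and the action set are hypothetical values, and the sketch assumes that the size of a numeric channel is stored in its Dimension property.

```matlab
% Hypothetical specifications: an RGB image observation and
% a discrete action set with three elements
obsInfo = rlNumericSpec([256 256 3]);
actInfo = rlFiniteSetSpec([-1 0 1]);

% Size of the first (and only) observation channel
obsSize = obsInfo(1).Dimension;

% Total number of elements in the observation channel
obsDimensions = prod(obsInfo.Dimension);

% Number of possible discrete actions
actNumElements = numel(actInfo.Elements);
```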
Networks for value function critics (such as the ones used in AC, PG, PPO, or TRPO agents) must take only observations as inputs and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment observation channels. For more information, see rlValueFunction.
Networks for single-output Q-value function critics (such as the ones used in Q, DQN, SARSA, DDPG, TD3, and SAC agents) must take both observations and actions as inputs, and must have a single scalar output. For these networks, the dimensions of the input layers must match the dimensions of the environment channels for both observations and actions. For more information, see rlQValueFunction.
Networks for multi-output Q-value function critics (such as those used in Q, DQN, and SARSA agents) take only observations as inputs and must have a single output layer with output size equal to the number of possible discrete actions. For these networks, the dimensions of the input layers must match the dimensions of the environment observation channels. For more information, see rlVectorQValueFunction.
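As an illustration of the first case, the following sketch builds a small value function critic network with one observation input and a single scalar output. The observation size and the layer sizes are arbitrary assumptions.

```matlab
% Hypothetical observation channel with four continuous elements
obsInfo = rlNumericSpec([4 1]);

% Network: observation in, single scalar value out
net = [
    featureInputLayer(prod(obsInfo.Dimension),Name="obs")
    fullyConnectedLayer(8,Name="fc1")
    reluLayer(Name="relu1")
    fullyConnectedLayer(1,Name="value")
    ];
net = dlnetwork(net);

% Create the critic and evaluate it for a random observation
critic = rlValueFunction(net,obsInfo);
v = getValue(critic,{rand(4,1)});
```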
for actor networks, the dimensions of the input layers must match the dimensions of the environment observation channels and the dimension of the output layer must be as follows.
Networks used in actors with a discrete action space (such as the ones in PG, AC, and PPO agents) must have a single output layer with an output size equal to the number of possible discrete actions. For more information, see rlDiscreteCategoricalActor.
Networks used in deterministic actors with a continuous action space (such as the ones in DDPG and TD3 agents) must have a single output layer with an output size matching the dimension of the action space defined in the environment action specification. For more information, see rlContinuousDeterministicActor.
Networks used in stochastic actors with a continuous action space (such as the ones in PG, AC, PPO, and SAC agents) must have two output layers, each with as many elements as the dimension of the action space, as defined in the environment specification. One output layer must produce the mean values (which must be scaled to the output range of the action), and the other must produce the standard deviations of the actions (which must be non-negative). For more information, see rlContinuousGaussianActor.
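The two-output structure for stochastic continuous actors can be sketched as follows. The specifications, layer sizes, and layer names are hypothetical. The mean path uses a tanh layer followed by a scaling layer to match the assumed action range, and the standard deviation path uses a softplus layer to keep its output positive.

```matlab
% Hypothetical specifications: 4 observations, 1 action in [-2, 2]
obsInfo = rlNumericSpec([4 1]);
actInfo = rlNumericSpec([1 1],LowerLimit=-2,UpperLimit=2);

% Shared input path
commonPath = [
    featureInputLayer(4,Name="obs")
    fullyConnectedLayer(16,Name="fc1")
    reluLayer(Name="common")
    ];
% Mean path, scaled to the action range
meanPath = [
    fullyConnectedLayer(1,Name="meanfc")
    tanhLayer(Name="meantanh")
    scalingLayer(Name="mean",Scale=actInfo.UpperLimit)
    ];
% Standard deviation path, constrained to be positive
stdPath = [
    fullyConnectedLayer(1,Name="stdfc")
    softplusLayer(Name="std")
    ];

% Assemble and connect the paths
net = layerGraph(commonPath);
net = addLayers(net,meanPath);
net = addLayers(net,stdPath);
net = connectLayers(net,"common","meanfc");
net = connectLayers(net,"common","stdfc");
net = dlnetwork(net);

% Create the actor, naming the two output layers explicitly
actor = rlContinuousGaussianActor(net,obsInfo,actInfo, ...
    ObservationInputNames="obs", ...
    ActionMeanOutputNames="mean", ...
    ActionStandardDeviationOutputNames="std");

% Sample an action for a random observation
act = getAction(actor,{rand(4,1)});
```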
deep neural networks
Deep neural networks consist of a series of interconnected layers. You can specify a deep neural network as one of the following:
Array of Layer objects
layerGraph object
DAGNetwork object
SeriesNetwork object
dlnetwork object
note
Among the different network objects, dlnetwork is preferred, since it has built-in validation checks and supports automatic differentiation. If you pass another network object as an input argument, it is internally converted to a dlnetwork object. However, best practice is to convert other network objects to dlnetwork explicitly before using them to create a critic or an actor for a reinforcement learning agent. You can do so using dlnet=dlnetwork(net), where net is any neural network object from Deep Learning Toolbox. The resulting dlnet is the dlnetwork object that you use for your critic or actor. This practice gives you greater insight and control in cases where the conversion is not straightforward and might require additional specifications.
Typically, you build your neural network by stacking together a number of layers in an array of Layer objects, possibly adding these arrays to a layerGraph object, and then converting the final result to a dlnetwork object.
For agents that need multiple input or output layers, you create an array of Layer objects for each input path (observations or actions) and for each output path (estimated rewards or actions). You then add these arrays to a layerGraph object and connect the paths together using the connectLayers function.
you can also create your deep neural network using the deep network designer app. for an example, see create dqn agent using deep network designer and train using image observations.
The following table lists some common deep learning layers used in reinforcement learning applications. For a full list of available layers, see List of Deep Learning Layers.
Deep Learning Toolbox layer | Description |
---|---|
featureInputLayer | Inputs feature data. Normalization is not supported. |
imageInputLayer | Inputs vectors and 2-D images. Normalization is not supported. |
sequenceInputLayer | Inputs sequence data to a network. Normalization is not supported. |
fullyConnectedLayer | Multiplies the input vector by a weight matrix and adds a bias vector. |
convolution2dLayer | Applies sliding convolutional filters to the input. |
concatenationLayer | Concatenates inputs along a specified dimension. |
additionLayer | Adds the outputs of multiple layers together. |
reluLayer | Sets any input values that are less than zero to zero. |
sigmoidLayer | Applies a sigmoid function to the input such that the output is bounded in the interval (0,1). |
tanhLayer | Applies a hyperbolic tangent activation layer to the input. |
softmaxLayer | Applies a softmax function layer to the input, normalizing it to a probability distribution. |
lstmLayer | Applies a long short-term memory layer to the input. Supported for DQN and PPO agents. |
note
The bilstmLayer and batchNormalizationLayer layers are not supported for reinforcement learning. Normalization in any of the input layers is also not supported.
the reinforcement learning toolbox software provides the following layers, which contain no tunable parameters (that is, parameters that change during training).
reinforcement learning toolbox layer | description |
---|---|
scalingLayer | Applies a linear scale and bias to an input array. This layer is useful for scaling and shifting the outputs of nonlinear layers, such as tanhLayer and sigmoidLayer. |
quadraticLayer | Creates a vector of quadratic monomials constructed from the elements of the input array. This layer is useful when you need an output that is some quadratic function of its inputs, such as for an LQR controller. |
softplusLayer | Implements the softplus activation y = log(1 + e^x), which ensures that the output is always positive. This function is a smoothed version of the rectified linear unit (ReLU). |
You can also create your own custom layers. For more information, see Define Custom Deep Learning Layers.
when you create a deep neural network, it is good practice to specify names for the first layer of each input path and the final layer of the output path. these names allow you to connect network paths and then later explicitly associate each network input layer with its appropriate environment channel.
the following code creates and connects the following input and output paths:
An observation input path, observationPath, with the first layer named "obsinputlayer".
An action input path, actionPath, with the first layer named "actinputlayer".
An estimated value function output path, commonPath, which takes the outputs of observationPath and actionPath as inputs. The final layer of this path is named "qvalueoutputlayer".
% Observation path: array of layer objects
observationPath = [
    featureInputLayer(4,Name="obsinputlayer")
    fullyConnectedLayer(24)
    reluLayer
    fullyConnectedLayer(24,Name="obsfc2")
    ];

% Action path: array of layer objects
actionPath = [
    featureInputLayer(1,Name="actinputlayer")
    fullyConnectedLayer(24,Name="actfc1")
    ];

% Common path: array of layer objects
commonPath = [
    additionLayer(2,Name="add")
    reluLayer
    fullyConnectedLayer(1,Name="qvalueoutputlayer")
    ];

% Assemble layerGraph object
criticNetwork = layerGraph(observationPath);
criticNetwork = addLayers(criticNetwork,actionPath);
criticNetwork = addLayers(criticNetwork,commonPath);

% Connect layers
criticNetwork = connectLayers(criticNetwork,"obsfc2","add/in1");
criticNetwork = connectLayers(criticNetwork,"actfc1","add/in2");

% Convert to a dlnetwork object
criticNetwork = dlnetwork(criticNetwork);

% Display the number of learnable parameters
summary(criticNetwork)
For all observation and action input paths, you must specify a featureInputLayer as the first layer in the path, with a number of input neurons equal to the number of dimensions of the corresponding environment channel.
You can view the structure of your deep neural network using the plot function.
plot(layerGraph(criticNetwork))
Since the output of a network in a discrete-action stochastic actor must represent the probability of executing each possible action, the software automatically adds a softmaxLayer as a final output layer if you do not specify it explicitly. When computing the action, the actor then randomly samples the distribution to return an action.
determining the number, type, and size of layers for your deep neural network can be difficult and is application dependent. however, the most critical component in deciding the characteristics of the function approximator is whether it is able to approximate the optimal policy or discounted value function for your application, that is, whether it has layers that can correctly learn the features of your observation, action, and reward signals.
consider the following tips when constructing your network.
For continuous action spaces, bound actions with a tanhLayer followed by a scalingLayer to scale the action to desired values, if necessary.
Deep dense networks with reluLayer layers can be fairly good at approximating many different functions. Therefore, they are often a good first choice.
Start with the smallest possible network that you think can approximate the optimal policy or value function.
when you approximate strong nonlinearities or systems with algebraic constraints, adding more layers is often better than increasing the number of outputs per layer. in general, the ability of the approximator to represent more complex (compositional) functions grows only polynomially in the size of the layers, but grows exponentially with the number of layers. in other words, more layers allow approximating more complex and nonlinear compositional functions, although this generally requires more data and longer training times. given a total number of neurons and comparable approximation tasks, networks with fewer layers can require exponentially more units to successfully approximate the same class of functions, and might fail to learn and generalize correctly.
for on-policy agents (the ones that learn only from experience collected while following the current policy), such as ac and pg agents, parallel training works better if your networks are large (for example, a network with two hidden layers with 32 nodes each, which has a few hundred parameters). on-policy parallel updates assume each worker updates a different part of the network, such as when they explore different areas of the observation space. if the network is small, the worker updates can correlate with each other and make training unstable.
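As a sketch of the first tip, the following deterministic actor bounds its output with a tanh layer followed by a scaling layer. The specifications, layer sizes, and layer names are hypothetical, and the action range is assumed to be symmetric around zero.

```matlab
% Hypothetical specifications: 3 observations, 1 action in [-2, 2]
obsInfo = rlNumericSpec([3 1]);
actInfo = rlNumericSpec([1 1],LowerLimit=-2,UpperLimit=2);

% Small dense network; tanh bounds the output to (-1,1) and
% the scaling layer stretches it to the action range
net = [
    featureInputLayer(3,Name="obs")
    fullyConnectedLayer(16,Name="fc1")
    reluLayer(Name="relu1")
    fullyConnectedLayer(1,Name="fc2")
    tanhLayer(Name="tanh")
    scalingLayer(Name="scale",Scale=actInfo.UpperLimit)
    ];
net = dlnetwork(net);

% Create the actor and compute an action for a random observation
actor = rlContinuousDeterministicActor(net,obsInfo,actInfo);
act = getAction(actor,{rand(3,1)});
```

For an action range that is not symmetric around zero, you would also set the Bias property of the scaling layer.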
create and configure actors and critics from a neural network
To create a critic from your deep neural network, use an rlValueFunction, rlQValueFunction, or (whenever possible) an rlVectorQValueFunction object. To create a deterministic actor for a continuous action space from your deep neural network, use an rlContinuousDeterministicActor object. To create a stochastic actor from your deep neural network, use either an rlDiscreteCategoricalActor or an rlContinuousGaussianActor object. To configure the learning rate and optimization used by the actor or critic, use an optimizer object within an agent options object.
For example, create a Q-value function critic using the neural network criticNetwork and the environment action and observation specifications. Also pass, as additional arguments, the names of the network input layers to be connected with the observation and action channels, respectively.
critic = rlQValueFunction(criticNetwork,obsInfo,actInfo, ...
    ObservationInputNames={"obsinputlayer"}, ...
    ActionInputNames="actinputlayer");
To specify training options for the critic, use rlOptimizerOptions to create the critic optimizer object criticOpts, specifying a learning rate of 0.02 and a gradient threshold of 1.
criticOpts = rlOptimizerOptions(LearnRate=0.02, ...
    GradientThreshold=1);
Then create an agent options object, and set the CriticOptimizerOptions property of the agent options object to criticOpts. When you finally create the agent, pass the agent options object as the last input argument to the agent constructor function. Alternatively, you can create the agent first, and then access its options object and modify the options using dot notation.
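A minimal sketch of this workflow, assuming a DQN agent as an arbitrary choice of agent type, looks like this. The agent creation call is left as a comment because it depends on a critic that matches your environment.

```matlab
% Optimizer options for the critic
criticOpts = rlOptimizerOptions(LearnRate=0.02,GradientThreshold=1);

% Agent options object (DQN chosen only for illustration)
agentOpts = rlDQNAgentOptions;
agentOpts.CriticOptimizerOptions = criticOpts;

% Pass the options as the last argument when creating the agent:
% agent = rlDQNAgent(critic,agentOpts);

% After creation, the same options are reachable with dot notation:
% agent.AgentOptions.CriticOptimizerOptions.LearnRate = 0.01;
```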
when you create your deep neural network and configure your actor or critic, consider using the following approach as a starting point.
Start with the smallest possible network and a high learning rate (0.01). Train this initial network to see if the agent converges quickly to a poor policy or acts in a random manner. If either of these issues occurs, rescale the network by adding more layers or more outputs on each layer. Your goal is to find a network structure that is just big enough, does not learn too fast, and shows signs of learning (an improving trajectory of the reward graph) after an initial training period.
Once you settle on a good network architecture, a low initial learning rate can allow you to see if the agent is on the right track, and help you check that your network architecture is satisfactory for the problem. A low learning rate makes tuning parameters easier, especially for difficult problems.
also, consider the following tips when configuring your deep neural network agent.
be patient with ddpg and dqn agents, since they might not learn anything for some time during the early episodes, and they typically show a dip in cumulative reward early in the training process. eventually, they can show signs of learning after the first few thousand episodes.
for ddpg and dqn agents, promoting exploration of the agent is critical.
for agents with both actor and critic networks, set the initial learning rates of both actor and critic to the same value. however, for some problems, setting the critic learning rate to a higher value than that of the actor can improve learning results.
recurrent neural networks
When creating actors or critics for use with any agent except Q, SARSA, TRPO, and MBPO, you can use recurrent neural networks (RNNs). These networks are deep neural networks with a sequenceInputLayer input layer and at least one layer that has hidden state information, such as an lstmLayer. They can be especially useful when the environment has states that cannot be included in the observation vector.
for agents that have both actor and critic, you must either use an rnn for both of them, or not use an rnn for any of them. you cannot use an rnn only for the critic or only for the actor.
note
code generation is not supported for continuous action space pg, ac, ppo agents, and sac agents using a recurrent neural network (rnn), or for any agent having multiple input paths and containing an rnn in any of the paths.
When using PG agents, the learning trajectory length (that is, the sequence of input data that the network uses for learning) for the RNN is the whole episode. For an AC agent, the NumStepsToLookAhead property of its options object is treated as the training trajectory length (except when training in parallel, in which case NumStepsToLookAhead is ignored and the whole episode is used as the trajectory length). For a PPO agent, the trajectory length is the MiniBatchSize property of its options object.
For DQN, DDPG, SAC, and TD3 agents, you must specify the trajectory length as an integer greater than one in the SequenceLength property of their options object. These learning algorithms randomly sample MiniBatchSize experience points from all the available episodes. For each experience point, a sequence of SequenceLength consecutive experiences is used for learning. If the number of available consecutive experiences is shorter than SequenceLength (this can happen, for example, when an episode terminates prematurely), then the available experiences are padded to complete the sequence. The sequence is also appropriately masked so that the padded data does not affect the gradient computation.
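As a sketch, the following code builds a recurrent multi-output Q-value critic and sets a trajectory length for a DQN agent. The specifications, layer sizes, layer names, and the sequence length of 20 are hypothetical values.

```matlab
% Hypothetical specifications: 4 observations, 3 discrete actions
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([1 2 3]);

% Recurrent network: sequence input plus an LSTM layer
net = [
    sequenceInputLayer(4,Name="obs")
    fullyConnectedLayer(16,Name="fc1")
    reluLayer(Name="relu1")
    lstmLayer(8,OutputMode="sequence",Name="lstm")
    fullyConnectedLayer(numel(actInfo.Elements),Name="q")
    ];
net = dlnetwork(net);

% Multi-output Q-value critic built from the recurrent network
critic = rlVectorQValueFunction(net,obsInfo,actInfo);
q = getValue(critic,{rand(4,1)});

% Trajectory length for training (must be greater than one)
agentOpts = rlDQNAgentOptions(SequenceLength=20);
```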
For more information and examples on policies and value functions, see rlValueFunction, rlQValueFunction, rlVectorQValueFunction, rlContinuousDeterministicActor, rlDiscreteCategoricalActor, and rlContinuousGaussianActor.
create built-in agent from actor and critic
once you create your actor and critic, you can create a built-in reinforcement learning agent that uses them. for example, create a pg agent using a given actor and critic (baseline) network.
agentOpts = rlPGAgentOptions(UseBaseline=true);
agent = rlPGAgent(actor,baseline,agentOpts);
for more information on the different types of reinforcement learning agents, see reinforcement learning agents.
You can obtain the actor and critic from an existing agent using getActor and getCritic, respectively.
You can also set the actor and critic of an existing agent using setActor and setCritic, respectively. The input and output layers of the actor and critic must match the observation and action specifications of the original agent.
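A short sketch of this round trip follows, using a default DQN agent created directly from hypothetical specifications. The parameter update step is a placeholder that writes the parameters back unchanged.

```matlab
% Hypothetical specifications: 4 observations, 2 discrete actions
obsInfo = rlNumericSpec([4 1]);
actInfo = rlFiniteSetSpec([-1 1]);

% Default agent created from the specifications
agent = rlDQNAgent(obsInfo,actInfo);

% Extract the critic and its learnable parameters
critic = getCritic(agent);
params = getLearnableParameters(critic);

% Placeholder for a modification: write the parameters back as they are
critic = setLearnableParameters(critic,params);

% Put the (possibly modified) critic back into the agent
agent = setCritic(agent,critic);
```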
See also
Functions: evaluate
Objects: rlQValueFunction, rlContinuousDeterministicActor