Compare DDPG Agent to LQR Controller
This example shows how to train a deep deterministic policy gradient (DDPG) agent to control a second-order linear dynamic system modeled in MATLAB®. The example also compares the DDPG agent to an LQR controller.
For more information on DDPG agents, see the DDPG agent documentation in Reinforcement Learning Toolbox. For an example showing how to train a DDPG agent in Simulink®, see the Simulink-based DDPG training examples in the documentation.
Double Integrator MATLAB Environment
The reinforcement learning environment for this example is a second-order double-integrator system with a gain. The training goal is to control the position of a mass in the second-order system by applying a force input.

For this environment:

The mass starts at an initial position of either –4 or 4 units.

The observations from the environment are the position and velocity of the mass.
The episode terminates if the mass moves more than 5 m from the original position or if the position falls within the 0.01 m goal threshold around the origin.
The reward $r_t$, provided at every time step, is a discretization of $r(t)$:

$r(t) = -\left( x(t)' Q\, x(t) + u(t)' R\, u(t) \right)$
Here:

$x$ is the state vector of the mass.

$u$ is the force applied to the mass.

$Q$ is the matrix of weights on the control performance; its value is stored in the environment Q property.

$R$ is the weight on the control effort; $R = 0.01$.
For more information on this model, see Load Predefined Control System Environments.
For this example the environment is a linear dynamical system, the environment state is observed directly, and the reward is a quadratic function of the observation and action. Therefore the problem of finding the sequence of actions that minimizes the cumulative long-term reward is a discrete-time linear-quadratic optimal control problem, for which the optimal action is known to be a linear function of the system states. This problem can also be solved using linear-quadratic regulator (LQR) design, and in the last part of the example you can compare the agent to an LQR controller.
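As a brief recap, using standard LQ notation (the symbols $A_d$, $B_d$, $Q_d$, $R_d$, and $K$ below are generic placeholders, not variables from this example), the discrete-time problem being solved is to minimize a cumulative quadratic cost, and its optimal solution is a constant linear state feedback:

$\min_{u_0,u_1,\ldots}\; J=\sum_{k=0}^{\infty}\left( x_k' Q_d\, x_k + u_k' R_d\, u_k \right) \quad \text{subject to} \quad x_{k+1}=A_d x_k + B_d u_k, \qquad u_k^{*}=-K x_k$

Maximizing the cumulative reward defined above is equivalent to minimizing this cost, because the reward is the negative of the cost.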
Create Environment Interface

Create a predefined environment interface for the double integrator system.

env = rlPredefinedEnv("DoubleIntegrator-Continuous")
env = 
  DoubleIntegratorContinuousAction with properties:

             Gain: 1
               Ts: 0.1000
      MaxDistance: 5
    GoalThreshold: 0.0100
                Q: [2x2 double]
                R: 0.0100
         MaxForce: Inf
            State: [2x1 double]
The interface has a continuous action space where the agent can apply force values from -Inf to Inf to the mass. The sample time is stored in env.Ts, while the continuous-time cost function matrices are stored in env.Q and env.R, respectively.
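As an illustrative sketch (not the environment's internal implementation), you can evaluate the quadratic stage cost, and hence the corresponding reward, directly from the weights stored in the environment. The state and force values below are arbitrary examples.

% Sketch: evaluate the quadratic stage cost/reward for an example state
% and force, using the weights stored in the environment. The environment
% applies a discretized version of this continuous-time cost internally.
x = [1; -0.5];                      % example position and velocity
u = 2;                              % example force
r = -(x'*env.Q*x + u'*env.R*u)      % reward = negative quadratic cost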
Obtain the observation and action information from the environment interface.

obsinfo = getObservationInfo(env)
obsinfo = 
  rlNumericSpec with properties:

     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "states"
    Description: "x, dx"
      Dimension: [2 1]
       DataType: "double"
actinfo = getActionInfo(env)
actinfo = 
  rlNumericSpec with properties:

     LowerLimit: -Inf
     UpperLimit: Inf
           Name: "force"
    Description: [0x0 string]
      Dimension: [1 1]
       DataType: "double"
Reset the environment and get its initial state.
x0 = reset(env)
x0 = 2×1
4
0
Fix the random generator seed for reproducibility.
rng(0)
Create DDPG Agent
A DDPG agent approximates the discounted cumulative long-term reward using a Q-value-function critic. A Q-value-function critic must accept an observation and an action as inputs and return a scalar (the estimated discounted cumulative long-term reward) as output. To approximate the Q-value function within the critic, use a neural network. Since the value function of the optimal policy is known to be quadratic, use a network with a quadratic layer (which outputs a vector of quadratic monomials, as described on the quadraticLayer reference page) and a fully connected layer (which provides a linear combination of its inputs).
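To see why this architecture can represent a quadratic function, consider the monomials produced from the concatenated observation and action. The following sketch builds them by hand for a generic three-element input; the ordering used here is for illustration only and is not necessarily the exact output order of quadraticLayer. A fully connected layer with six weights then forms a general quadratic function of the observation and action.

% Sketch: the six quadratic monomials of a three-element vector [x1; x2; u].
% A linear combination of these terms is a general quadratic function Q(x,u).
z = [0.5; -1; 2];                          % example [x1; x2; u]
monomials = [z(1)^2; z(1)*z(2); z(2)^2; z(1)*z(3); z(2)*z(3); z(3)^2];
w6 = randn(1,6);                           % six fully connected weights
qval = w6*monomials                        % scalar quadratic output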
Define each network path as an array of layer objects and get the dimension of the observation and action spaces from the environment specification objects. Assign names to the network input layers, so you can connect them to the output path and later explicitly associate them with the appropriate environment channel. Since there is no need for a bias term, set the bias term to zero (Bias=0) and prevent it from changing (BiasLearnRateFactor=0).

For more information on creating value function approximators, see Create Policies and Value Functions.
% Observation and action paths
obspath = featureInputLayer(obsinfo.Dimension(1),Name="obsin");
actpath = featureInputLayer(actinfo.Dimension(1),Name="actin");

% Common path
commonpath = [
    concatenationLayer(1,2,Name="concat")
    quadraticLayer
    fullyConnectedLayer(1,Name="value", ...
        BiasLearnRateFactor=0,Bias=0)
    ];

% Add layers to layerGraph object
criticnet = layerGraph(obspath);
criticnet = addLayers(criticnet,actpath);
criticnet = addLayers(criticnet,commonpath);

% Connect layers
criticnet = connectLayers(criticnet,"obsin","concat/in1");
criticnet = connectLayers(criticnet,"actin","concat/in2");
View the critic network configuration.

figure
plot(criticnet)
Convert the network to a dlnetwork object and display the number of weights.

criticnet = dlnetwork(criticnet);
summary(criticnet)
   Initialized: true

   Number of learnables: 7

   Inputs:
      1   'obsin'   2 features
      2   'actin'   1 features
Create the critic approximator object using criticnet, the environment observation and action specifications, and the names of the network input layers to be connected with the environment observation and action channels, respectively. For more information, see rlQValueFunction.
critic = rlQValueFunction(criticnet, ...
    obsinfo,actinfo, ...
    ObservationInputNames="obsin",ActionInputNames="actin");
Check the critic with a random observation and action input.

getValue(critic,{rand(obsinfo.Dimension)},{rand(actinfo.Dimension)})
ans = single
-0.3977
DDPG agents use a parametrized continuous deterministic policy, which is learned by a continuous deterministic actor. This actor must accept an observation as input and return an action as output. To approximate the policy function within the actor, use a neural network. Since for this example the optimal policy is known to be linear in the state, use a shallow network with a fully connected layer to provide a linear combination of the two network inputs.
Define the network as an array of layer objects, and get the dimension of the observation and action spaces from the environment specification objects. As done for the critic, since there is no need for a bias term, set the bias term to zero (Bias=0) and prevent it from changing (BiasLearnRateFactor=0). For more information on actors, see Create Policies and Value Functions.
actornet = [
    featureInputLayer(obsinfo.Dimension(1))
    fullyConnectedLayer(actinfo.Dimension(1), ...
        BiasLearnRateFactor=0,Bias=0)
    ];
Convert the network to a dlnetwork object and display the number of weights.

actornet = dlnetwork(actornet);
summary(actornet)
   Initialized: true

   Number of learnables: 3

   Inputs:
      1   'input'   2 features
Create the actor using actornet and the observation and action specifications. For more information, see rlContinuousDeterministicActor.
actor = rlContinuousDeterministicActor(actornet,obsinfo,actinfo);
Check the actor with a random observation input.

getAction(actor,{rand(obsinfo.Dimension)})
ans = 1x1 cell array
{[0.3493]}
Create the DDPG agent using the actor and critic. For more information, see rlDDPGAgent.
agent = rlDDPGAgent(actor,critic);
Specify options for the agent, including training options for the critic, using dot notation. Alternatively, you can use rlDDPGAgentOptions and rlOptimizerOptions objects before creating the agent.
agent.AgentOptions.SampleTime = env.Ts;
agent.AgentOptions.ExperienceBufferLength = 1e6;
agent.AgentOptions.MiniBatchSize = 32;

agent.AgentOptions.NoiseOptions.StandardDeviation = 0.3;
agent.AgentOptions.NoiseOptions.StandardDeviationDecayRate = 1e-7;

agent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-4;
agent.AgentOptions.ActorOptimizerOptions.GradientThreshold = 1;

agent.AgentOptions.CriticOptimizerOptions.LearnRate = 5e-3;
agent.AgentOptions.CriticOptimizerOptions.GradientThreshold = 1;
Initialize Agent Parameters
The policy implemented by the actor is $u = Kx = k_1 x_1 + k_2 x_2$, where the feedback gains $k_1$ and $k_2$ are the two weights of the actor network. It can be shown that the closed-loop system is stable if these gains are negative; therefore, initializing them to negative values can speed up convergence.
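As a quick check of this claim, the following sketch discretizes the double integrator with a zero-order hold (the discrete matrices for this plant are written out by hand, assuming the gain and sample time shown in the environment properties) and verifies that an example pair of negative gains places the closed-loop eigenvalues inside the unit circle.

% Sketch: zero-order-hold discretization of the double integrator
% x' = [0 1;0 0]x + [0;g]u, assuming gain g = 1 and sample time Ts = 0.1.
ts = 0.1;  g = 1;
ad = [1 ts; 0 1];
bd = g*[ts^2/2; ts];
k0 = [-1 -1];                              % example negative gains, u = k0*x
isstable = all(abs(eig(ad + bd*k0)) < 1)   % true: closed loop is stable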
The Q-value function has the following structure:

$Q(x,u) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 u + w_5 x_2 u + w_6 u^2$

Here, $w_1, \ldots, w_6$ are the weights of the fully connected layer. Alternatively, in matrix form:

$Q(x,u) = \begin{bmatrix} x' & u \end{bmatrix} W \begin{bmatrix} x \\ u \end{bmatrix}$

Here, $W$ is the symmetric matrix whose diagonal entries are the weights of the squared terms and whose off-diagonal entries are half the weights of the mixed terms.

For a fixed policy $u = Kx$, the cumulative long-term reward (that is, the value of the policy) becomes:

$V(x) = Q(x,Kx) = x' \begin{bmatrix} I & K' \end{bmatrix} W \begin{bmatrix} I \\ K \end{bmatrix} x = x' P x$

Since the rewards are always negative, to properly approximate the cumulative reward both $W$ and $P$ must be negative definite. Therefore, to speed up convergence, initialize the critic network weights so that $W$ is negative definite.
% Create a symmetric matrix with negative eigenvalues
w = -single(diag([1 1 1]) + 0.1)
w = 3x3 single matrix
-1.1000 -0.1000 -0.1000
-0.1000 -1.1000 -0.1000
-0.1000 -0.1000 -1.1000
% extract indexes of upper triangular part of a 3 by 3 matrix
idx = triu(true(3))
idx = 3x3 logical array
1 1 1
0 1 1
0 0 1
% Update the parameters of the actor and critic
par = getLearnableParameters(agent);
par.Actor{1} = -single([1 1]);
par.Critic{1} = w(idx)';
setLearnableParameters(agent,par);
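As a sanity check on this initialization (a sketch using the variables defined above), you can verify that both the initialized W and the corresponding P for the initial actor gains are negative definite.

% Sketch: confirm the initialization is negative definite.
all(eig(double(w)) < 0)                     % W is negative definite
k0 = [-1 -1];                               % initial actor gains set above
p0 = [eye(2) k0']*double(w)*[eye(2); k0];
all(eig(p0) < 0)                            % induced P is negative definite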
Check the agent with a random observation input.

getAction(agent,{rand(obsinfo.Dimension)})
ans = 1x1 cell array
{[-1.2857]}
Train Agent

To train the agent, first specify the training options. For this example, use the following options.
Run at most 5000 episodes in the training session, with each episode lasting at most 200 time steps.

Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).

Stop training when the agent receives a moving average cumulative reward greater than –66. At this point, the agent can control the position of the mass using minimal control effort.
For more information, see rlTrainingOptions.
trainopts = rlTrainingOptions(...
    MaxEpisodes=5000, ...
    MaxStepsPerEpisode=200, ...
    Verbose=false, ...
    Plots="training-progress",...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=-66);
You can visualize the double integrator environment by using the plot function during training or simulation.
plot(env)
Train the agent using the train function. Training this agent is a computationally intensive process that takes several hours to complete. To save time while running this example, load a pretrained agent by setting dotraining to false. To train the agent yourself, set dotraining to true.
dotraining = false;
if dotraining
    % Train the agent.
    trainingstats = train(agent,env,trainopts);
else
    % Load the pretrained agent for the example.
    load("DoubleIntegDDPG.mat","agent");
end
Simulate DDPG Agent

To validate the performance of the trained agent, simulate it within the double integrator environment. For more information on agent simulation, see rlSimulationOptions and sim.
simoptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simoptions);
totalreward = sum(experience.Reward)
totalreward = -65.9875
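Optionally, you can inspect the simulated trajectory stored in the experience output. The sketch below assumes the observation channel is named "states", as reported by getObservationInfo, and that the logged signals are timeseries objects.

% Sketch (assumes the observation channel is named "states" and is
% logged as a timeseries): plot the simulated position and velocity.
obsdata = squeeze(experience.Observation.states.Data);   % 2-by-N array
figure
plot(experience.Observation.states.Time,obsdata')
legend("position","velocity")
xlabel("Time (s)")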
Solve LQR Problem

The lqrd function (Control System Toolbox) solves a discretized LQR problem like the one presented in this example. This function calculates the optimal discrete-time gain matrix klqr, together with the solution of the Riccati equation plqr. When klqr is connected via negative state feedback to the plant input (force), the discrete-time equivalent of the cost function specified by env.Q and env.R is minimized going forward. Furthermore, the cumulative cost from the initial time to infinity, starting from an initial state x0, is equal to x0'*plqr*x0.
Use lqrd to solve the discretized LQR problem.

[klqr,plqr] = lqrd([0 1;0 0],[0;env.Gain],env.Q,env.R,env.Ts);
Here, [0 1;0 0] and [0;env.Gain] are the continuous-time transition and input gain matrices, respectively, of the double integrator system.
If Control System Toolbox™ is not installed, use the solution for the default example values.

klqr = [17.8756 8.2283];
plqr = [4.1031 0.3376; 0.3376 0.1351];
If the actor policy successfully approximates the optimal policy, then the resulting gain $K$ (the actor weights) must be close to -klqr (the minus sign is due to the fact that klqr is calculated assuming a negative feedback connection).

If the critic learns a good approximation of the optimal value function, then the resulting $P$, as defined before, must be close to -plqr (the minus sign is due to the fact that the reward is defined as the negative of the cost).
Compare DDPG Agent to Optimal Controller

Extract the parameters (weights) of the actor and critic within the agent.
par = getLearnableParameters(agent);
Display the actor weights.

k = par.Actor{1}
k = 1x2 single row vector
-15.4601 -7.2076
Note that the gains are close to those of the optimal solution -klqr:
-klqr
ans = 1×2
-17.8756 -8.2283
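One way to quantify the match (a simple sketch, not part of the original workflow) is the relative error between the learned gains and the LQR solution.

% Sketch: relative error between the actor gains and the LQR gains.
gainerror = norm(double(k) + klqr)/norm(klqr)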
Recreate the matrices $W$ and $P$ that define the Q-value and value functions, respectively. First, reinitialize $W$ to zero.
w = zeros(3);
Place the critic weights in the upper triangular portion of $W$.

w(idx) = par.Critic{1};
Recreate the symmetric matrix $W$ as defined before.

w = (w + w')/2
w = 3×3
-4.9869 -0.7788 -0.0548
-0.7788 -0.3351 -0.0222
-0.0548 -0.0222 0.0008
Using $W$ and $K$, calculate $P$ as defined before.
p = [eye(2) k']*w*[eye(2);k]
p = 2x2 single matrix
-3.1113 0.0436
0.0436 0.0241
Note that the matrix $P$ is close to the negative of the solution of the Riccati equation, -plqr.
-plqr
ans = 2×2
-4.1031 -0.3376
-0.3376 -0.1351
Get the environment initial state.

x0 = reset(env);
The value function is the estimate of the future cumulative long-term reward when using the policy enacted by the actor. Calculate the value function at the initial state, according to the critic weights. This is the same value displayed in the training window as Episode Q0.
q0 = x0'*p*x0
q0 = single
-49.7803
Note that the value is close to the actual reward obtained in the validation simulation, totalreward, suggesting that the critic learns a good approximation of the value function for the policy enacted by the actor.
Calculate the value of the initial state, following the true optimal policy enacted by the LQR controller.
-x0'*plqr*x0
ans = -65.6494
This value is also very close to the value obtained in the validation simulation, confirming that the policy learned and enacted by the actor is a good approximation of the true optimal policy.