Create and Train Custom LQR Agent

This example shows how to create and train a custom linear quadratic regulation (LQR) agent to control a discrete-time linear system modeled in MATLAB®. For an introduction to custom agents, see the Reinforcement Learning Toolbox documentation on creating custom agents. For a step-by-step example of how to create a custom PG agent (using the REINFORCE algorithm), see Create and Train Custom PG Agent. For an example of how a DDPG agent can be used as an optimal controller for a discrete-time system, see Compare DDPG Agent to LQR Controller.

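Create Linear System Environment

The reinforcement learning environment for this example is a discrete-time linear system. The dynamics for the system are given by

$$x_{t+1} = A x_t + B u_t$$

The feedback control law is

$$u_t = -K x_t$$

The control objective is to minimize the quadratic cost

$$J = \sum_{t=0}^{\infty} \left( x_t^\top Q x_t + u_t^\top R u_t \right)$$

In this example, the system matrices are

$$A = \begin{bmatrix} 1.05 & 0.05 & 0.05 \\ 0.05 & 1.05 & 0.05 \\ 0 & 0.05 & 1.05 \end{bmatrix}, \quad B = \begin{bmatrix} 0.1 & 0 & 0.2 \\ 0.1 & 0.5 & 0 \\ 0 & 0 & 0.5 \end{bmatrix}$$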

a = [1.05,0.05,0.05;0.05,1.05,0.05;0,0.05,1.05];
b = [0.1,0,0.2;0.1,0.5,0;0,0,0.5]; 

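The quadratic cost matrices are

$$Q = \begin{bmatrix} 10 & 3 & 1 \\ 3 & 5 & 4 \\ 1 & 4 & 9 \end{bmatrix}, \quad R = \begin{bmatrix} 0.5 & 0 & 0 \\ 0 & 0.5 & 0 \\ 0 & 0 & 0.5 \end{bmatrix}$$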

q = [10,3,1;3,5,4;1,4,9]; 
r = 0.5*eye(3);

For this environment, the reward at time $t$ is given by $r_t = -x_t^\top Q x_t - u_t^\top R u_t$, which is the negative of the quadratic cost. Therefore, maximizing the reward minimizes the cost. The initial conditions are set randomly by the reset function.
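As a minimal sketch of what one environment step might compute, the following evaluates the next state and the reward for a single state-input pair. The handle name stepfcn and the test state are hypothetical, chosen only for illustration; this is not the code inside the example's environment function.

```matlab
% System and cost matrices copied from this example
a = [1.05,0.05,0.05;0.05,1.05,0.05;0,0.05,1.05];
b = [0.1,0,0.2;0.1,0.5,0;0,0,0.5];
q = [10,3,1;3,5,4;1,4,9];
r = 0.5*eye(3);

% Hypothetical step handle: returns the next state and the reward
stepfcn = @(x,u) deal(a*x + b*u, -x'*q*x - u'*r*u);

x = [1;0;-1];      % illustrative state, not taken from the example
u = zeros(3,1);    % zero input, so only the state cost contributes
[xnext,reward] = stepfcn(x,u);
```

With a zero input, the reward reduces to the negative state cost, which makes the sign convention easy to check by hand.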

Create the MATLAB environment interface for this linear system and reward. The myDiscreteEnv function creates an environment by defining custom step and reset functions. For more information on creating such a custom environment, see Create Custom Environment Using Step and Reset Functions.

env = myDiscreteEnv(a,b,q,r);

Fix the random generator seed for reproducibility.

rng(0)

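Create Custom LQR Agent

For the LQR problem, the Q-value function for a given control gain $K$ is quadratic and can be defined as $Q_K(x,u) = \begin{bmatrix} x \\ u \end{bmatrix}^\top H_K \begin{bmatrix} x \\ u \end{bmatrix}$, where $H_K = \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix}$ is a symmetric, positive definite matrix.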

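Because $H_K$ is positive definite, $Q_K$ measures cost, so the best control law is the one that minimizes it (equivalently, the one that maximizes the reward, which is the negative of the cost). That control law is $u = -(H_{uu})^{-1} H_{ux} x$. Comparing with $u_t = -K x_t$, the feedback gain is $K = (H_{uu})^{-1} H_{ux}$.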
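To illustrate how the gain follows from the blocks of the H matrix, the sketch below builds H for the optimal policy from the Riccati solution and recovers the gain from it. The block expressions are the standard undiscounted LQR identities; they are stated here as an assumption and are not taken from this example. The variable names hxx, hxu, hux, and huu are illustrative.

```matlab
% System and cost matrices copied from this example
a = [1.05,0.05,0.05;0.05,1.05,0.05;0,0.05,1.05];
b = [0.1,0,0.2;0.1,0.5,0;0,0,0.5];
q = [10,3,1;3,5,4;1,4,9];
r = 0.5*eye(3);

% Optimal gain and Riccati solution (requires Control System Toolbox)
[koptimal,p] = dlqr(a,b,q,r);

% Assumed block structure of H for the optimal policy (discount factor 1)
hxx = q + a'*p*a;
hxu = a'*p*b;
hux = b'*p*a;
huu = r + b'*p*b;
h = [hxx,hxu;hux,huu];

% Gain recovered from the blocks of H
k = huu\hux;
gap = norm(k - koptimal);   % expected to be close to machine precision
```

Solving with the backslash operator avoids forming the inverse of huu explicitly.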

The matrix $H_K$ contains $m = \frac{1}{2} n(n+1)$ distinct element values, where $n$ is the sum of the number of states and the number of inputs. Denote by $\theta$ the vector containing these $m$ elements, in which the off-diagonal elements of $H_K$ are multiplied by two. The elements of $\theta$ are the parameters that the custom agent needs to learn.

You can express the Q-value function as the inner product of the vectors $\theta$ and $\phi(x,u)$: $Q_K(x,u) = \theta^\top \phi(x,u)$, where $\phi(x,u)$ is a vector of quadratic monomials built from the combination of all the elements in $x$ and $u$. For an example, see the Q(x,u) matrix in Compare DDPG Agent to LQR Controller.
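The parameterization can be checked numerically. The following sketch, which uses a random positive definite matrix rather than anything learned by the agent, builds the parameter vector and the monomial vector from the upper triangle of a 6-by-6 matrix and confirms that their inner product equals the quadratic form. The ordering of the monomials is an arbitrary choice made for this sketch.

```matlab
rng(0)
n = 6;                       % three states plus three inputs
m = n*(n+1)/2;               % number of distinct elements of H

% Illustrative symmetric positive definite H (random, not from the example)
g = randn(n);
h = g*g' + n*eye(n);

idx = find(triu(ones(n)));   % linear indices of the upper triangle

% theta: upper-triangle entries of H, with off-diagonal entries doubled
w = 2*ones(n) - eye(n);      % weight 1 on the diagonal, 2 elsewhere
theta = h(idx).*w(idx);

% phi: quadratic monomials z_i*z_j for i <= j, in the same order as theta
z = randn(n,1);              % stacked vector [x;u]
zz = z*z';
phi = zz(idx);

% Difference between the inner product and the quadratic form
err = abs(theta'*phi - z'*h*z);
```

The doubling of the off-diagonal entries accounts for each cross term appearing twice in the quadratic form but only once in the monomial vector.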

The LQR agent starts with a stabilizing controller $K_0$. To get an initial stabilizing controller, place the poles of the closed-loop system $A - BK_0$ inside the unit circle.

k0 = place(a,b,[0.4,0.8,0.5]);
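A quick way to confirm that the initial gain is stabilizing is to check the spectral radius of the closed-loop matrix. This is a small sketch using the matrices from this example; the names rho and isstable are illustrative.

```matlab
% System matrices copied from this example
a = [1.05,0.05,0.05;0.05,1.05,0.05;0,0.05,1.05];
b = [0.1,0,0.2;0.1,0.5,0;0,0,0.5];

% Initial gain from pole placement (requires Control System Toolbox)
k0 = place(a,b,[0.4,0.8,0.5]);

% Largest closed-loop eigenvalue magnitude
rho = max(abs(eig(a - b*k0)));

% A discrete-time system is stable when all eigenvalues lie inside the unit circle
isstable = rho < 1;
```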

To create a custom agent, you must create a subclass of the rl.agent.CustomAgent abstract class. For the custom LQR agent, the defined custom subclass is LQRCustomAgent. For more information, see the Reinforcement Learning Toolbox documentation on creating custom agents. Create the custom LQR agent using $Q$, $R$, and $K_0$. The agent does not require information on the system matrices $A$ and $B$.

agent = LQRCustomAgent(q,r,k0);

For this example, set the agent discount factor to one. To use a discounted future reward, set the discount factor to a value less than one.

agent.Gamma = 1;

Because the linear system has three states and three inputs, the total number of learnable parameters is $m = 21$. To ensure satisfactory performance of the agent, set the number of parameter estimates $N_p$ (the number of data points to be collected before updating the critic) to be greater than twice the number of learnable parameters. In this example, the value is $N_p = 45$.

agent.EstimateNum = 45;
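Spelled out, the arithmetic behind these values is as follows, using the formula for the number of distinct elements given earlier:

```latex
n = 3 + 3 = 6, \qquad
m = \tfrac{1}{2}\, n (n+1) = \tfrac{1}{2} \cdot 6 \cdot 7 = 21, \qquad
2m = 42 < N_p = 45
```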

To get good estimation results for $\theta$, you must apply a persistently exciting exploration model to the system. In this example, encourage model exploration by adding white noise to the controller output: $u_t = -K x_t + e_t$. In general, the exploration model depends on the system models.
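To show why exploration noise matters, here is a rough sketch of a least-squares policy iteration loop for this problem. It is an assumption about the general technique, not the implementation inside the custom agent class: it works with costs rather than rewards, uses unit-variance Gaussian noise, and runs a fixed 10 gain updates. The system matrices are used only to generate data, never in the gain update.

```matlab
% System and cost matrices copied from this example
a = [1.05,0.05,0.05;0.05,1.05,0.05;0,0.05,1.05];
b = [0.1,0,0.2;0.1,0.5,0;0,0,0.5];
q = [10,3,1;3,5,4;1,4,9];
r = 0.5*eye(3);
k = place(a,b,[0.4,0.8,0.5]);      % initial stabilizing gain

rng(0)
nx = 3; n = 6; m = n*(n+1)/2; np = 45;
idx = find(triu(ones(n)));          % upper-triangle indices
x = randn(nx,1);

for iter = 1:10
    phidiff = zeros(np,m);
    cost = zeros(np,1);
    for t = 1:np
        u = -k*x + randn(nx,1);     % white exploration noise e_t
        xnext = a*x + b*u;          % data generation only
        z = [x;u];
        znext = [xnext;-k*xnext];   % next input follows the current policy
        zz = z*z';
        zznext = znext*znext';
        % Bellman relation with discount 1: theta'*(phi - phinext) = cost
        phidiff(t,:) = (zz(idx) - zznext(idx))';
        cost(t) = x'*q*x + u'*r*u;
        x = xnext;
    end
    theta = phidiff\cost;           % least-squares estimate of theta
    hu = zeros(n);
    hu(idx) = theta;
    h = (hu + hu')/2;               % undo the doubling of off-diagonal entries
    k = h(nx+1:end,nx+1:end)\h(nx+1:end,1:nx);   % gain update
end

gainerror = norm(k - dlqr(a,b,q,r));
```

Without the noise term, the input would be an exact linear function of the state, the regression matrix would lose rank, and the blocks of H involving the input could not be identified.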

Train Agent

To train the agent, first specify the training options. For this example, use the following options.

  • Run the training for at most 10 episodes, with each episode lasting at most 50 time steps.

  • Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option).

For more information, see rlTrainingOptions.

trainingopts = rlTrainingOptions(...
    MaxEpisodes=10, ...
    MaxStepsPerEpisode=50, ...
    Verbose=false, ...
    Plots="training-progress");

Train the agent using the train function.

trainingstats = train(agent,env,trainingopts);

Simulate Agent and Compare with Optimal Solution

To validate the performance of the trained agent, simulate it within the MATLAB environment. For more information on agent simulation, see rlSimulationOptions and sim.

simoptions = rlSimulationOptions(MaxSteps=20);
experience = sim(env,agent,simoptions);
totalreward = sum(experience.Reward)
totalreward = -30.6482

You can compute the optimal solution for the LQR problem using the dlqr function.

[koptimal,p] = dlqr(a,b,q,r); 

The optimal reward is given by $J_{optimal} = -x_0^\top P x_0$.

x0 = experience.Observation.obs1.getdatasamples(1);
joptimal = -x0'*p*x0;
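The closed-form expression for the optimal reward can be cross-checked by simulating the optimal controller directly. The sketch below assumes an illustrative initial state and a 200-step horizon, neither of which comes from this example, and compares the accumulated reward with the closed-form value.

```matlab
% System and cost matrices copied from this example
a = [1.05,0.05,0.05;0.05,1.05,0.05;0,0.05,1.05];
b = [0.1,0,0.2;0.1,0.5,0;0,0,0.5];
q = [10,3,1;3,5,4;1,4,9];
r = 0.5*eye(3);
[koptimal,p] = dlqr(a,b,q,r);

x0 = [1;-1;0.5];     % illustrative initial state
x = x0;
total = 0;
for t = 1:200        % long enough for the state to decay to near zero
    u = -koptimal*x;
    total = total + (-x'*q*x - u'*r*u);   % accumulate the reward
    x = a*x + b*u;
end

joptimal = -x0'*p*x0;            % closed-form optimal reward
gap = abs(total - joptimal);
```

A finite-horizon sum only approximates the infinite-horizon value, so the two numbers agree up to the cost that remains after the last simulated step.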

Compute the error in the reward between the trained LQR agent and the optimal LQR solution.

rewarderror = totalreward - joptimal
rewarderror = 5.0439e-07

View the history of the norm of the difference between the gains of the trained LQR agent and the optimal LQR solution.

% number of gain updates
len = agent.KUpdate;
% initialize error vector
err = zeros(len,1);
% fill elements
for i = 1:len
    err(i) = norm(agent.KBuffer{i}-koptimal);
end
% plot logarithm of the error vector
plot(log10(err),'b*-')
title("Log of gain difference")
xlabel("Number of updates")

Figure: log of gain difference versus number of updates.

Compute the norm of the final error for the feedback gain.

gainerror = norm(agent.K - koptimal)
gainerror = 1.6756e-11

Overall, the trained agent finds a solution that is very close to the true optimal LQR solution.
