Generate Reward Function from a Model Predictive Controller for a Servomotor
This example shows how to automatically generate a reward function from cost and constraint specifications defined in a model predictive controller object. You then use the generated reward function to train a reinforcement learning agent.
Introduction
You can use the generateRewardFunction function to generate a reward function for reinforcement learning, starting from the cost and constraints specified in a model predictive controller. The resulting reward signal is a sum of costs (as defined by an objective function) and constraint violation penalties that depend on the current state of the environment.
This example is based on a Model Predictive Control Toolbox example in which you design a model predictive controller for a DC servomechanism under voltage and shaft torque constraints. Here, you convert the cost and constraint specifications defined in the mpc object into a reward function and use it to train an agent to control the servomotor.
Open the Simulink model for this example, which is based on the MPC example above but has been modified for reinforcement learning.
mdl = "rl_motor";
open_system(mdl);
Create Model Predictive Controller
Create the open-loop dynamic model of the motor, defined in plant, and the maximum admissible torque tau, using a helper function.
[plant,tau] = mpcmotormodel;
Specify input and output signal types for the MPC controller. The shaft angular position is measured as the first output. The second output, torque, is unmeasurable.
plant = setmpcsignals(plant, MV=1, MO=1, UO=2);
Specify constraints on the manipulated variable, and define a scale factor.
mv = struct(Min=-220, Max=220, ScaleFactor=440);
Impose torque constraints during the first three prediction steps, and specify scale factors for both shaft position and torque.
ov = struct(Min={-Inf, [-tau;-tau;-tau;-Inf]}, ...
    Max={Inf, [tau;tau;tau;Inf]}, ...
    ScaleFactor={2*pi, 2*tau});
Specify weights for the quadratic cost function to achieve angular position tracking. Set the weight for the torque to zero, thereby allowing it to float within its constraint.
weights = struct(MV=0, MVRate=0.1, OV=[0.1 0]);
Create an MPC controller for the plant model with a sample time of 0.1 s, a prediction horizon of 10 steps, and a control horizon of 2 steps, using the previously defined structures for the weights, manipulated variables, and output variables.
mpcobj = mpc(plant, 0.1, 10, 2, weights, mv, ov);
Display the controller specifications.
mpcobj
MPC object (created on 19-Aug-2023 16:50:13):
---------------------------------------------
Sampling time:      0.1 (seconds)
Prediction Horizon: 10
Control Horizon:    2

Plant Model:
                                      --------------
      1  manipulated variable(s)   -->|  4 states  |
                                      |            |-->  1 measured output(s)
      0  measured disturbance(s)   -->|  1 inputs  |
                                      |            |-->  1 unmeasured output(s)
      0  unmeasured disturbance(s) -->|  2 outputs |
                                      --------------
Indices:
  (input vector)    Manipulated variables: [1 ]
  (output vector)        Measured outputs: [1 ]
                       Unmeasured outputs: [2 ]

Disturbance and Noise Models:
        Output disturbance model: default (type "getoutdist(mpcobj)" for details)
         Measurement noise model: default (unity gain after scaling)

Weights:
        ManipulatedVariables: 0
    ManipulatedVariablesRate: 0.1000
             OutputVariables: [0.1000 0]
                         ECR: 10000

State Estimation:  Default Kalman Filter (type "getEstimator(mpcobj)" for details)

Constraints:
 -220 <= MV1 (V) <= 220, MV1/rate (V) is unconstrained, MO1 (rad) is unconstrained
 -78.54 <= UO1 (Nm)(t+1) <= 78.54
 -78.54 <= UO1 (Nm)(t+2) <= 78.54
 -78.54 <= UO1 (Nm)(t+3) <= 78.54
 UO1 (Nm)(t+4) is unconstrained

Use built-in "active-set" QP solver with MaxIterations of 120.
The controller operates on a plant with 4 states, 1 input (voltage), and 2 output signals (angle and torque), and has the following specifications:
- The cost function weights for the manipulated variable, manipulated variable rate, and output variables are 0, 0.1, and [0.1 0], respectively.
- The manipulated variable is constrained between -220 V and 220 V.
- The manipulated variable rate is unconstrained.
- The first output variable (angle) is unconstrained, but the second (torque) is constrained between -78.54 N·m and 78.54 N·m in the first three prediction time steps and unconstrained in the fourth step.
Note that for reinforcement learning, only the constraint specifications from the first prediction time step are used, because the reward is computed for a single time step.
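The following is a minimal sketch, in base MATLAB, of what "using only the first prediction step" means for horizon-varying bounds. It assumes the bounds are stored per output in cell arrays, as in the ov structure above. The names ovmin, ovmax, ymin1, and ymax1 are illustrative and are not part of any toolbox API, and the value of tau is taken from the controller display.

```matlab
% Torque limit as reported in the controller display above (N*m).
tau = 78.54;

% Horizon-varying output bounds, one cell per output variable.
ovmin = {-Inf, [-tau; -tau; -tau; -Inf]};
ovmax = { Inf, [ tau;  tau;  tau;  Inf]};

% Keep only the bound that applies at the first prediction step.
ymin1 = cellfun(@(b) b(1), ovmin);
ymax1 = cellfun(@(b) b(1), ovmax);

% First row: lower bounds, second row: upper bounds (angle, torque).
disp([ymin1; ymax1])
```

The resulting single-step bounds correspond to the ymin and ymax vectors that appear in the generated reward function.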
Generate the Reward Function
Generate the reward function code from the specifications in the mpc object using generateRewardFunction. The code is displayed in the MATLAB Editor.
generateRewardFunction(mpcobj)
The generated reward function is a starting point for reward design. You can modify the function with different penalty function choices and tune the weights. For this example, make the following changes to the generated code:
- Scale the original cost weights qy and qmvrate by a factor of 50 and 10, respectively.
- Scale the penalty weights wy, wmv, and wmvrate by a factor of 1e-2, 1e-3, and 1e-3, respectively.
- The default exterior penalty function method is step. Change the method to quadratic.
After you make the changes, the cost and penalty specifications should be as follows:
qy = 50 * [0.1 0];
qmv = 0;
qmvrate = 10 * 0.1;
wy = 1e-2 * [1 1];
wmv = 1e-3;
wmvrate = 1e-3;
py = wy * exteriorPenalty(y,ymin,ymax,'quadratic');
pmv = wmv * exteriorPenalty(mv,mvmin,mvmax,'quadratic');
pmvrate = wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'quadratic');
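To see why the quadratic method can be preferable to the step method, the sketch below compares the two on a small and a large torque violation. The helpers steppenalty and quadpenalty are hypothetical stand-ins written in base MATLAB, not the toolbox implementation. They assume that the step method returns 1 for any violated element and that the quadratic method returns the squared distance to the violated bound.

```matlab
% Hypothetical stand-ins for the two exterior penalty methods (assumption,
% not the toolbox code). Each returns one nonnegative value per element.
steppenalty = @(x,xmin,xmax) double(x(:) < xmin(:) | x(:) > xmax(:));
quadpenalty = @(x,xmin,xmax) max(0, max(xmin(:) - x(:), x(:) - xmax(:))).^2;

ymin = [-Inf -78.54];
ymax = [ Inf  78.54];
wy   = 1e-2 * [1 1];

% Illustrative outputs [angle; torque]: torque slightly and far over the limit.
ynear = [1.0; 79.54];   % 1 N*m beyond the upper bound
yfar  = [1.0; 88.54];   % 10 N*m beyond the upper bound

pstep = [wy * steppenalty(ynear,ymin,ymax), wy * steppenalty(yfar,ymin,ymax)];
pquad = [wy * quadpenalty(ynear,ymin,ymax), wy * quadpenalty(yfar,ymin,ymax)];

fprintf("step:      %.4f  %.4f\n", pstep);
fprintf("quadratic: %.4f  %.4f\n", pquad);
```

Under these assumptions the step penalty is the same for both violations, while the quadratic penalty grows with the size of the violation, which gives the agent a graded signal for how far it is from the feasible region.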
For this example, the modified code has been saved in the MATLAB function file rewardfunctionmpc.m. Display the generated reward function.
type rewardfunctionmpc.m
function reward = rewardfunctionmpc(y,refy,mv,refmv,lastmv)
% REWARDFUNCTIONMPC generates rewards from MPC specifications.
%
% Description of input arguments:
%
% y      : Output variable from plant at step k+1
% refy   : Reference output variable at step k+1
% mv     : Manipulated variable at step k
% refmv  : Reference manipulated variable at step k
% lastmv : Manipulated variable at step k-1
%
% Limitations (MPC and NLMPC):
% - Reward computed based on first step in prediction horizon.
%   Therefore, signal previewing and control horizon settings are ignored.
% - Online cost and constraint update is not supported.
% - Custom cost and constraint specifications are not considered.
% - Time varying cost weights and constraints are not supported.
% - Mixed constraint specifications are not considered (for the MPC case).

% Reinforcement Learning Toolbox
% 02-Jun-2021 16:05:41

%#codegen

%% Specifications from MPC object
% Standard linear bounds as specified in 'States', 'OutputVariables', and
% 'ManipulatedVariables' properties
ymin = [-Inf -78.5398163397448];
ymax = [Inf 78.5398163397448];
mvmin = -220;
mvmax = 220;
mvratemin = -Inf;
mvratemax = Inf;

% Scale factors as specified in 'States', 'OutputVariables', and
% 'ManipulatedVariables' properties
sy = [6.28318530717959 157.07963267949];
smv = 440;

% Standard cost weights as specified in 'Weights' property
qy = 50 * [0.1 0];
qmv = 0;
qmvrate = 10 * 0.1;

%% Compute cost
dy = (refy(:)-y(:)) ./ sy';
dmv = (refmv(:)-mv(:)) ./ smv';
dmvrate = (mv(:)-lastmv(:)) ./ smv';
jy = dy' * diag(qy.^2) * dy;
jmv = dmv' * diag(qmv.^2) * dmv;
jmvrate = dmvrate' * diag(qmvrate.^2) * dmvrate;
cost = jy + jmv + jmvrate;

%% Penalty function weight (specify nonnegative)
wy = 1e-2 * [1 1];
wmv = 1e-3;
wmvrate = 1e-3;

%% Compute penalty
% Penalty is computed for violation of linear bound constraints.
%
% To compute exterior bound penalty, use the exteriorPenalty function and
% specify the penalty method as 'step' or 'quadratic'.
%
% Alternatively, use the hyperbolicPenalty or barrierPenalty function for
% computing hyperbolic and barrier penalties.
%
% For more information, see help for these functions.
%
% Set pmv value to 0 if the RL agent action specification has
% appropriate 'LowerLimit' and 'UpperLimit' values.
py = wy * exteriorPenalty(y,ymin,ymax,'quadratic');
pmv = wmv * exteriorPenalty(mv,mvmin,mvmax,'quadratic');
pmvrate = wmvrate * exteriorPenalty(mv-lastmv,mvratemin,mvratemax,'quadratic');
penalty = py + pmv + pmvrate;

%% Compute reward
reward = -(cost + penalty);
end
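As a worked example of the arithmetic in this function, the sketch below evaluates the reward for a single, hypothetical operating point using only base MATLAB. The operating point (y, refy, mv, refmv, lastmv) is made up for illustration, and quadpenalty is a hypothetical stand-in for the quadratic exterior penalty that assumes a squared-distance form. The rate penalty is omitted because the rate bounds are infinite.

```matlab
% Specifications copied from the reward function listing.
ymin = [-Inf -78.5398163397448];   ymax = [Inf 78.5398163397448];
mvmin = -220;                      mvmax = 220;
sy  = [6.28318530717959 157.07963267949];
smv = 440;
qy = 50 * [0.1 0];   qmv = 0;   qmvrate = 10 * 0.1;
wy = 1e-2 * [1 1];   wmv = 1e-3;

% Hypothetical stand-in for the quadratic exterior penalty (assumption).
quadpenalty = @(x,xmin,xmax) max(0, max(xmin(:) - x(:), x(:) - xmax(:))).^2;

% Hypothetical operating point, chosen only for illustration.
y = [0.5; 10];  refy = [1; 0];  mv = 50;  refmv = 0;  lastmv = 30;

% Scaled tracking error and input move, as in the generated code.
dy      = (refy(:) - y(:)) ./ sy';
dmv     = (refmv(:) - mv(:)) ./ smv';
dmvrate = (mv(:) - lastmv(:)) ./ smv';

% Quadratic cost; note that each weight enters squared.
cost = dy' * diag(qy.^2) * dy + dmv' * diag(qmv.^2) * dmv ...
     + dmvrate' * diag(qmvrate.^2) * dmvrate;

% All signals are inside their bounds here, so the penalty is zero.
penalty = wy * quadpenalty(y,ymin,ymax) + wmv * quadpenalty(mv,mvmin,mvmax);

reward = -(cost + penalty);
fprintf("cost = %.5f, penalty = %.5f, reward = %.5f\n", cost, penalty, reward);
```

For this operating point the reward is about -0.160, and it comes entirely from the angle tracking error and the input move. Because the weights are squared inside diag, scaling qy by 50 scales the tracking cost by 2500, which is worth keeping in mind when tuning.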
To integrate this reward function, open the MATLAB Function block in the Simulink model.
open_system("rl_motor/reward function")
Append the following line of code to the function and save the model.
r = rewardfunctionmpc(y,refy,mv,refmv,lastmv);
The MATLAB Function block now executes rewardfunctionmpc.m during simulation.
For this example, the MATLAB Function block has already been modified and saved.
Create a Reinforcement Learning Environment
The environment dynamics are modeled in the Servomechanism subsystem. For this environment:
- The observations are the reference signals (angle and torque), output variables (angle and torque), and their integrals from the last 3 time steps. The angle and torque signals are normalized by multiplying them by the gain [0.1 1/tau].
- The action is the voltage applied to the servomotor. The action values are limited between -220 and 220.
- The sample time is ts = 0.1 s.
- The total simulation time is tf = 20 s.
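As a small illustration of the normalization described in the list above, the sketch below applies the gain [0.1 1/tau] to one hypothetical sample of angle and torque. It assumes the gain is applied element-wise. The sample values and variable names are illustrative only.

```matlab
tau  = 78.54;          % torque limit from the controller display (N*m)
gain = [0.1 1/tau];    % normalization gain stated in the list above

% Hypothetical sample: angle of pi rad, torque exactly at its limit.
raw = [pi; tau];

% Element-wise scaling of the two signals.
normalized = gain(:) .* raw;
disp(normalized.')
```

With this gain, a torque at its constraint maps to 1, so the normalized torque stays within [-1, 1] whenever the constraint is satisfied.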
Specify the total simulation time and sample time.
tf = 20;
ts = 0.1;
Create observation and action specifications for the environment.
numobs = 24;
numact = 1;
oinfo = rlNumericSpec([numobs 1]);
ainfo = rlNumericSpec([numact 1], ...
    LowerLimit=-220, ...
    UpperLimit=220);
Create the reinforcement learning environment using the rlSimulinkEnv function.
blk = "rl_motor/RL Agent";
env = rlSimulinkEnv(mdl, blk, oinfo, ainfo);
Create a Reinforcement Learning Agent
Fix the random seed for reproducibility.
rng(0)
The agent used in this example is a twin-delayed deep deterministic policy gradient (TD3) agent. TD3 agents use two parametrized Q-value function approximators to estimate the value (that is, the expected cumulative long-term reward) of the policy. To model the parametrized Q-value function within both critics, use a neural network with two inputs (the observation and action) and one output (the value of the policy when taking a given action from the state corresponding to a given observation). For more information on TD3 agents, see Twin-Delayed Deep Deterministic (TD3) Policy Gradient Agents.
Define each network path as an array of layer objects. Assign names to the layers that join the two paths. These names allow you to connect the paths.
mainpath = [
    featureInputLayer(numobs)
    fullyConnectedLayer(128)
    concatenationLayer(1, 2, Name="concat")
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(1)];
actionpath = [
    featureInputLayer(numact)
    fullyConnectedLayer(8, Name="fc_act")];
% Create layerGraph object and add layers.
criticnet = layerGraph(mainpath);
criticnet = addLayers(criticnet, actionpath);
% Connect layers.
criticnet = connectLayers(criticnet, "fc_act", "concat/in2");
Plot the critic network structure.
plot(criticnet);
Create the critic function objects using rlQValueFunction. The critic function object encapsulates the critic by wrapping around the critic deep neural network. To make sure the critics have different initial weights, explicitly initialize each network before using it to create a critic.
% Convert the neural network to a dlnetwork object without initializing it.
criticdlnet = dlnetwork(criticnet, Initialize=false);
% Create the two critic functions for the TD3 agent.
critic1 = rlQValueFunction(initialize(criticdlnet), oinfo, ainfo);
critic2 = rlQValueFunction(initialize(criticdlnet), oinfo, ainfo);
TD3 agents learn a parametrized deterministic policy over continuous action spaces. The policy is represented by a continuous deterministic actor, which takes the current observation as input and returns an action that is a deterministic function of the observation.
To model the parametrized policy within the actor, use a neural network with one input layer (which receives the content of the environment observation channel, as specified by oinfo) and one output layer (which returns the action to the environment action channel, as specified by ainfo).
Define the network as an array of layer objects.
actornet = [
    featureInputLayer(numobs)
    fullyConnectedLayer(128)
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(numact)];
Plot the actor network.
plot(layerGraph(actornet));
Create a deterministic actor function that is responsible for modeling the policy of the agent. For more information, see rlContinuousDeterministicActor.
actordlnet = dlnetwork(actornet);
actor = rlContinuousDeterministicActor(actordlnet, oinfo, ainfo);
Specify the agent options using rlTD3AgentOptions. The agent trains from an experience buffer of maximum capacity 1e6 by randomly selecting mini-batches of size 256. The discount factor of 0.995 favors long-term rewards.
agentopts = rlTD3AgentOptions(SampleTime=ts, ...
    DiscountFactor=0.995, ...
    ExperienceBufferLength=1e6, ...
    MiniBatchSize=256);
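To get a feel for what a discount factor of 0.995 implies, the sketch below relates it to the episode length of this example. It uses the common rule of thumb that the effective horizon of a discounted sum is roughly 1/(1-gamma) steps. The variable names are illustrative, and the values of tf and ts are the ones defined earlier.

```matlab
gamma = 0.995;   % discount factor used in the agent options
tf = 20;         % total simulation time (s)
ts = 0.1;        % sample time (s)

% Number of agent steps in one full episode.
nsteps = ceil(tf/ts);

% Rule-of-thumb effective horizon of the discounted return, in steps.
effectivehorizon = 1/(1 - gamma);

% Weight applied to the reward received at the last step of an episode.
weightlast = gamma^(nsteps - 1);

fprintf("steps per episode: %d\n", nsteps);
fprintf("effective horizon: %.1f steps\n", effectivehorizon);
fprintf("weight of last reward: %.3f\n", weightlast);
```

The effective horizon is close to the episode length, so rewards near the end of an episode still carry a substantial weight in the return.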
Specify optimizer options for the actor and critic functions. For this example, choose a learning rate of 1e-3 and a gradient threshold of 1 for the actor and critics.
% Critic optimizer options
for idx = 1:2
    agentopts.CriticOptimizerOptions(idx).LearnRate = 1e-3;
    agentopts.CriticOptimizerOptions(idx).GradientThreshold = 1;
end
% Actor optimizer options
agentopts.ActorOptimizerOptions.LearnRate = 1e-3;
agentopts.ActorOptimizerOptions.GradientThreshold = 1;
During training, the agent explores the action space using a Gaussian action noise model. Set the standard deviation and decay rate of the noise using the ExplorationModel property. The noise has an initial standard deviation of 100, which decays exponentially at a rate of 1e-5 until it reaches a minimum value of 1e-3. This favors exploration towards the beginning of training and exploitation in later stages.
agentopts.ExplorationModel.StandardDeviationMin = 1e-3;
agentopts.ExplorationModel.StandardDeviation = 100;
agentopts.ExplorationModel.StandardDeviationDecayRate = 1e-5;
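The sketch below estimates how long the exploration noise takes to reach its floor. It assumes that the standard deviation is multiplied by (1 - decay rate) once per agent step and clipped at the minimum value. That update rule is an assumption made for illustration, so check the noise model documentation for the exact behavior. The variable names and the 200 steps per episode (from tf/ts) are also illustrative.

```matlab
sigma0    = 100;    % initial standard deviation
sigmamin  = 1e-3;   % minimum standard deviation
decayrate = 1e-5;   % decay rate per step (assumed multiplicative)

% Steps needed for sigma0*(1-decayrate)^n to fall to sigmamin.
nfloor = ceil(log(sigmamin/sigma0) / log(1 - decayrate));

% Standard deviation left after one full 200-step episode.
sigmaepisode = max(sigma0 * (1 - decayrate)^200, sigmamin);

fprintf("steps to reach the floor: %d\n", nfloor);
fprintf("sigma after one episode:  %.2f\n", sigmaepisode);
```

Under this assumption the noise needs more than a million steps to reach the floor, so exploration remains significant for thousands of episodes.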
Create the TD3 agent using the actor and critics. For more information on TD3 agents, see rlTD3Agent.
agent = rlTD3Agent(actor, [critic1,critic2], agentopts);
Train the Agent
To train the agent, first specify the training options using rlTrainingOptions. For this example, use the following options:
- Run each training for at most 5000 episodes, with each episode lasting at most ceil(tf/ts) time steps.
- Stop the training when the agent receives an average cumulative reward greater than -7 over 20 consecutive episodes. At this point, the agent can track the reference signal.
trainopts = rlTrainingOptions(...
    MaxEpisodes=5000, ...
    MaxStepsPerEpisode=ceil(tf/ts), ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=-7, ...
    ScoreAveragingWindowLength=20);
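The stopping rule described above can be sketched in base MATLAB as a moving average over the most recent 20 episode rewards. The reward sequence below is synthetic and purely illustrative (it is not training output), and the variable names are hypothetical.

```matlab
window    = 20;   % averaging window length, in episodes
stopvalue = -7;   % average reward threshold

% Synthetic, steadily improving episode rewards (illustration only).
episodereward = linspace(-60, 0, 100);

% Find the first episode at which the windowed average exceeds the threshold.
stopepisode = NaN;
for k = window:numel(episodereward)
    if mean(episodereward(k-window+1:k)) > stopvalue
        stopepisode = k;
        break
    end
end

fprintf("training would stop after episode %d\n", stopepisode);
```

Because the criterion averages over a window, a single good episode is not enough to stop training. The agent has to perform consistently across the whole window.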
Train the agent using the train function. Training this agent is a computationally intensive process that may take several minutes to complete. To save time while running this example, load a pretrained agent by setting dotraining to false. To train the agent yourself, set dotraining to true.
dotraining = false;
if dotraining
    trainresult = train(agent, env, trainopts);
else
    load('rldcservomotortd3agent.mat')
end
A snapshot of the training progress is shown in the following figure. You can expect different results due to inherent randomness in the training process.
Validate Controller Response
To validate the performance of the trained agent, simulate the model and view the response in the Scope blocks. The reinforcement learning agent is able to track the reference angle while satisfying the constraints on torque and voltage.
sim(mdl);
Copyright 2021-2023 The MathWorks, Inc.
See Also
Related Examples
- (Model Predictive Control Toolbox)
- Train Biped Robot to Walk Using Reinforcement Learning Agents
- Generate Reward Function from a Model Verification Block for a Water Tank System