Custom Training Loop with Simulink Action Noise
This example shows how to tune a controller for vehicle platooning applications using a custom reinforcement learning (RL) training loop. For this application, action noise is generated in the Simulink® model to promote exploration during training.
For an example that tunes a PID-based vehicle platooning system, see the corresponding example in Simulink Control Design.
Platooning has the following control objectives [1].

- Individual vehicle stability — The spacing error for each following vehicle converges to zero if the preceding vehicle is traveling at constant speed.
- String stability — Spacing errors do not amplify as they propagate towards the tail of the vehicle string (a common frequency-domain statement of this condition is sketched below).
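As background only, one common way to formalize string stability (following the treatment in [1]) is in the frequency domain. If H(s) denotes the transfer function from the spacing error of vehicle i−1 to the spacing error of vehicle i, errors do not amplify along the string when

$$\left|\,\hat{H}(j\omega)\,\right| \le 1 \quad \text{for all } \omega.$$

The exact formulation used in [1] may differ in detail; this condition is included here as a reference point, not as part of the shipped example.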
Platooning Environment Model
In this example, there are five vehicles in the platoon. Every vehicle is modeled as a truck-trailer system with the following parameters. All lengths are in meters.

L1 = 6;                 % Truck length
L2 = 10;                % Trailer length
M1 = 1;                 % Hitch length
L  = L1 + L2 + M1 + 5;  % Desired front-to-front vehicle spacing
The lead vehicle follows a given acceleration profile. Each trailing vehicle has a controller that controls its acceleration.

Open the Simulink® model.

mdl = "fiveVehiclePlatoonEnv";
open_system(mdl)
The model contains an RL Agent block with its last action input port enabled. This input port allows the specification of custom noise in the Simulink model for off-policy RL agents, such as deep deterministic policy gradient (DDPG) agents.

Specify the path to the RL Agent block.

agentBlk = mdl + "/RL Agent";
Controller Structure
In this example, each trailing vehicle (ego vehicle) has the same continuous-time controller structure and parameterization. The controller computes the ego-vehicle acceleration command from three terms: a feedforward term on the acceleration of the vehicle in front, a velocity-error term, and a spacing-error term (a sketch of this structure follows this section).

Here:

- a_ego, v_ego, and x_ego are the respective acceleration, velocity, and position of the ego vehicle.
- a_front, v_front, and x_front are the respective acceleration, velocity, and position of the vehicle directly in front of the ego vehicle.

Each vehicle has full access to its own velocity and position states, but can access the acceleration, velocity, and position of the vehicle directly in front only through wireless communication.

The controller minimizes the velocity error (v_front − v_ego) using the velocity gain and minimizes the spacing error (x_front − x_ego − L) using the spacing gain. The feedforward gain on the front-vehicle acceleration is used to improve tracking of the front vehicle.
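The controller equation itself is rendered as an image in the original page and is not reproduced here. The following MATLAB sketch shows one plausible form that is consistent with the description above; the gain names c1 (feedforward), kv (velocity), and ks (spacing), and the function name egoControlLaw, are chosen for illustration and are not part of the shipped model.

function aEgo = egoControlLaw(c1,kv,ks,aFront,vFront,xFront,vEgo,xEgo,L)
% Hypothetical sketch of the assumed per-vehicle control law
% (not the shipped model implementation).
% c1, kv, ks are the three tunable gains; L is the desired
% front-to-front spacing defined earlier.
velError     = vFront - vEgo;        % velocity error to the front vehicle
spacingError = (xFront - xEgo) - L;  % spacing error to the front vehicle
aEgo = c1*aFront + kv*velError + ks*spacingError;
end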
The lead vehicle's acceleration is assumed to be a sine wave:

a_lead(t) = A*sin(F*t)

Here:

- a_lead is the lead vehicle acceleration (m/s^2).
- A is the amplitude of the sine wave (m/s^2).
- F is the frequency of the sine wave (rad/s).
- t is the current simulation time (s).
Reinforcement Learning Agent Design
The objective of the agent is to compute adaptive gains so that each vehicle can track the desired spacing with respect to the vehicle immediately in front. Therefore, the model is configured such that:

- The action signal consists of the three controller gains, which are shared by all vehicles except the lead vehicle. Each gain has a lower bound of 0, and the upper bounds are 1, 20, and 20, respectively. The agent calculates new gains once per second. To encourage exploration during training, the gains are perturbed by zero-mean Gaussian noise with variances of 0.02, 0.1, and 0.1, respectively.
- The observation signal consists of the four vehicle spacings minus the target spacing L (the spacing errors), the five vehicle velocities, and the five vehicle accelerations.
The reward calculated at every time step is the sum of four terms:

- The first term encourages the vehicle spacings at the current time step to match the desired spacing L.
- The second term penalizes large changes in the gains between time steps.
- The third term penalizes the maximum overshoot across all vehicles, where overshoot is defined as the vehicle spacing becoming less than the desired spacing L (getting too close to the front vehicle).
- The fourth term penalizes collisions. If a vehicle collision occurs, the simulation terminates.

A sketch of this four-term structure is shown after this list.
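The reward expression is also an image in the original page, so the weights below are placeholders rather than the values used by the shipped model. This MATLAB sketch only illustrates the four-term structure described above; the function name platoonReward and the weights w1 through w4 are assumptions.

function r = platoonReward(spacing,L,gains,prevGains,isCollision)
% Hypothetical sketch of the four-term reward structure.
% spacing    - current vehicle spacings (vector)
% L          - desired spacing
% gains      - current controller gains
% prevGains  - gains from the previous time step
% isCollision - true if a collision occurred
w1 = 1; w2 = 1; w3 = 1; w4 = 100;                 % assumed weights
trackingTerm  = -w1*sum((spacing - L).^2);        % match desired spacing
gainRateTerm  = -w2*sum((gains - prevGains).^2);  % penalize gain changes
overshoot     = max(max(L - spacing,0));          % spacing below desired
overshootTerm = -w3*overshoot;                    % penalize getting too close
collisionTerm = -w4*isCollision;                  % penalize collisions
r = trackingTerm + gainRateTerm + overshootTerm + collisionTerm;
end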
For this example, to accommodate the custom noise specified in the model, you implement a custom DDPG training loop.
Define Model Parameters
Define the training and simulation parameters that remain fixed during training.

Ts = 1;    % Sample time (seconds)
Tf = 100;  % Simulation length (seconds)

accelNoiseV = ones(1,5)*0.01;  % Acceleration input noise variance
velNoiseV   = ones(1,5)*0.01;  % Velocity sensor noise variance
posNoiseV   = ones(1,5)*0.01;  % Position sensor noise variance

paramLowerLimit = [0 0 0]';    % Lower limits for the controller gains
paramUpperLimit = [1 20 20]';  % Upper limits for the controller gains

useParamNoise = 1;             % Option to indicate noise injection
Define the parameters that change every training episode. The values of these parameters are updated in the environment reset function resetFunction.

leadA = 2;  % Lead vehicle acceleration amplitude
leadF = 1;  % Lead vehicle acceleration frequency

paramNoiseV = [0.02 0.1 0.1];  % Variance for controller gains

% Random noise seeds
paramNoiseSeed = 1:3;            % Controller gain noise seed
accelNoiseSeed = (1:5) + 100;    % Acceleration input noise seed
velNoiseSeed   = (1:5) + 200;    % Velocity sensor noise seed
posNoiseSeed   = (1:5) + 300;    % Position sensor noise seed

% Initial position and velocity of each vehicle
initialPositions  = [200 150 100 50 0] + 50;  % Positions
initialVelocities = [10 10 10 10 10];         % Velocities
Create Environment
Create an environment using rlSimulinkEnv. To do so, first define the observation and action specifications for the environment.

obsInfo = rlNumericSpec([14 1]);
actInfo = rlNumericSpec([3 1],...
    LowerLimit=paramLowerLimit,...
    UpperLimit=paramUpperLimit);

obsInfo.Name = "measurements";
actInfo.Name = "control_gains";
Next, create the environment object.

env = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
Set the environment reset function to the local function resetFunction, included with this example. This function varies the training conditions for each episode.

env.ResetFcn = @resetFunction;
Noise in the model is specified using Simulink noise source blocks. Each noise block has its own random number generator and thus its own starting seed parameter. To ensure that the noise streams vary across episodes, the seed variables are updated in resetFunction.
Create Actor, Critic, and Policy
Create actor and critic function approximators for the agent using the local function createNetworks, included with this example.

[critic,actor] = createNetworks(obsInfo,actInfo);
Create optimizer objects for updating the actor and critic. Use the same options object for both optimizers. For more information, see rlOptimizerOptions and rlOptimizer.

optimizerOpt = rlOptimizerOptions(...
    LearnRate=1e-3, ...
    GradientThreshold=1, ...
    L2RegularizationFactor=1e-3);
criticOptimizer = rlOptimizer(optimizerOpt);
actorOptimizer  = rlOptimizer(optimizerOpt);
Create a deterministic policy based on the actor approximator. For more information, see rlDeterministicActorPolicy.

policy = rlDeterministicActorPolicy(actor)

policy = 
  rlDeterministicActorPolicy with properties:

              Actor: [1x1 rl.function.rlContinuousDeterministicActor]
    ObservationInfo: [1x1 rl.util.rlNumericSpec]
         ActionInfo: [1x1 rl.util.rlNumericSpec]
         SampleTime: -1
Specify the policy sample time.

policy.SampleTime = Ts;
Create Experience Buffer
Create an experience buffer for the agent with a maximum length of 1e6.

replayMemory = rlReplayMemory(obsInfo,actInfo,1e6)

replayMemory = 
  rlReplayMemory with properties:

    MaxLength: 1000000
       Length: 0
Data Required for Learning
To update the actor and critic during training, the runEpisode function calls a processing function to process each experience as it is received from the environment. For this example, the processing function is the local function processExperienceFcn.

This function requires additional data to perform its processing. Create a structure to store this additional data.

processExpData.Critic = critic;
processExpData.TargetCritic = critic;
processExpData.Actor = actor;
processExpData.TargetActor = actor;
processExpData.ReplayMemory = replayMemory;
processExpData.CriticOptimizer = criticOptimizer;
processExpData.ActorOptimizer = actorOptimizer;

processExpData.MiniBatchSize = 128;
processExpData.DiscountFactor = 0.99;
processExpData.TargetSmoothFactor = 1e-3;
During each episode, the processExperienceFcn function updates the critic, actor, replay memory, and optimizers. The updated data is used as the input for the next episode.
Training Loop
To train the agent, the custom training loop simulates the agent in the environment for a maximum of maxEpisodes episodes.

maxEpisodes = 1000;
Compute the maximum number of steps per episode using the simulation time and the sample time.

maxSteps = ceil(Tf/Ts);
For this custom training loop:

- The runEpisode function simulates the agent in the environment for one episode.
- Experiences are processed as they are received from the environment using the processExperienceFcn function.
- Experiences are not logged by runEpisode, since they are processed as they are received.
- To speed up training, when calling runEpisode, the CleanupPostSim option is set to false. Doing so keeps the model compiled between episodes.
- The platooningTrainingCurvePlotter object is a helper object that plots training data while the training is running.
- You can stop the training using the stop button in the training plot.
- After all the episodes are complete, the cleanup function cleans up the environment and terminates the model compilation.
Training the policy is a computationally intensive process that can take several minutes to hours to complete. To save time while running this example, load a pretrained policy by setting doTraining to false. To train the policy yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Create plotting helper object.
    plotObj = platooningTrainingCurvePlotter();

    % Training loop
    for i = 1:maxEpisodes
        % Run the episode.
        out = runEpisode(...
            env,policy,...
            MaxSteps=maxSteps,...
            ProcessExperienceFcn=@processExperienceFcn,...
            ProcessExperienceData=processExpData,...
            LogExperiences=false,...
            CleanupPostSim=false);

        % Extract episode information
        % to update the training curves.
        episodeInfo = out.AgentData.EpisodeInfo;

        % Extract updated processExpData for the next episode.
        processExpData = out.AgentData.ProcessExperienceData;

        % Extract the updated policy for the next episode.
        policy = out.AgentData.Agent;

        % Extract critic and actor approximators from processExpData.
        critic = processExpData.Critic;
        actor = processExpData.Actor;

        % Extract the cumulative reward and calculate
        % the average reward per step for this episode.
        cumulativeRwd = episodeInfo.CumulativeReward;
        avgRwdPerStep = cumulativeRwd/episodeInfo.StepsTaken;

        % Evaluate Q0 from the initial episode observation.
        obs0 = episodeInfo.InitialObservation;
        q0 = evaluate(critic,[obs0,evaluate(actor,obs0)]);
        q0 = double(q0{1});

        % Update the plot.
        update(plotObj,i,avgRwdPerStep,cumulativeRwd,q0);

        % Exit training if the button is pushed.
        if plotObj.StopTraining
            break;
        end
    end

    % Clean up the environment.
    cleanup(env);

    % Save the policy.
    save("platooningDDPGPolicy.mat","policy");
else
    % Load the pretrained policy.
    load("platooningDDPGPolicy.mat");
end
Validate Trained Policy
Validate the learned policy by running five simulations with random initial conditions specified by the reset function.

First, turn off the parameter noise in the model.

useParamNoise = 0;
Simulate the model against the trained policy five times.

N = 5;
simOpts = rlSimulationOptions(...
    MaxSteps=maxSteps, ...
    NumSimulations=N);
experiences = sim(env,policy,simOpts);
Plot the vehicle spacing errors, gains, and rewards from the experiences output structure.

f = figure(Position=[100 100 1024 768]);
tiledlayout(f,N,3);
for i = 1:N
    % Get the spacing.
    tspacing = experiences(i).Observation.measurements.Time;
    spacing = ...
        squeeze(experiences(i).Observation.measurements.Data(1:4,:,:));

    % Get the gains.
    tgains = experiences(i).Action.control_gains.Time;
    gains = squeeze(experiences(i).Action.control_gains.Data);

    % Get the reward.
    trwd = experiences(i).Reward.Time;
    rwd = experiences(i).Reward.Data;

    % Plot the spacing.
    nexttile
    stairs(tspacing,spacing');
    title(sprintf("Vehicle Spacing Error Simulation %u",i))
    grid on

    % Plot the gains.
    nexttile
    stairs(tgains,gains');
    title(sprintf("Vehicle Gains Simulation %u",i))
    grid on

    % Plot the reward.
    nexttile
    stairs(trwd,rwd);
    title(sprintf("Vehicle Reward Simulation %u",i))
    grid on
end
From the plots, you can see that the trained policy generates adaptive gains that adequately track the desired spacing for all vehicles.
Local Functions
The process experience function is called every time an experience is processed by the RL Agent block. Here, processExperienceFcn appends the experience to the replay memory, samples a mini-batch of experiences from the replay memory, and updates the critic, actor, and target networks.

function [policy,procExpData] = processExperienceFcn(...
    exp,episodeInfo,policy,procExpData)

% Append the experience to the replay memory buffer.
append(procExpData.ReplayMemory,exp);

% Sample a mini-batch of experiences from the replay memory.
miniBatch = sample(procExpData.ReplayMemory, ...
    procExpData.MiniBatchSize,...
    DiscountFactor=procExpData.DiscountFactor);

if ~isempty(miniBatch)
    % Update network parameters using the mini-batch.
    [procExpData,actorParams] = learnFcn(procExpData,miniBatch);

    % Update the policy parameters using the actor parameters.
    policy = setLearnableParameters(policy,actorParams);
end

end
The learnFcn function updates the critic, actor, and target networks given a sampled mini-batch.
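As a reference for the code below, the target values it computes follow the standard DDPG bootstrapping rule (mirroring the code comments), where Q' and π' denote the target critic and target actor and γ is the discount factor:

$$ y_i = \begin{cases} r_i, & \text{terminal experience} \\ r_i + \gamma\, Q'\!\big(s_i',\, \pi'(s_i')\big), & \text{otherwise.} \end{cases} $$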
function [processExpData,actorParams] = learnFcn( ...
    processExpData,miniBatch)

% Find the terminal experiences.
doneidx = (miniBatch.IsDone == 1);

% Compute target next actions against the next observations.
nextAction = evaluate( ...
    processExpData.TargetActor,miniBatch.NextObservation);

% Compute qTarget = reward + gamma*Q(nextObservation,nextAction)
%                 = reward + gamma*expectedFutureReturn
targetq = miniBatch.Reward;

% Bootstrap the target at nonterminal experiences.
expectedFutureReturn = getValue(processExpData.TargetCritic, ...
    miniBatch.NextObservation,nextAction);
targetq(~doneidx) = targetq(~doneidx) + ...
    processExpData.DiscountFactor.*expectedFutureReturn(~doneidx);

% Compute the critic gradient using the deepCriticLoss function.
criticGradient = gradient(processExpData.Critic,@deepCriticLoss,...
    [miniBatch.Observation,miniBatch.Action],targetq);

% Update the critic parameters.
[processExpData.Critic,processExpData.CriticOptimizer] = update(...
    processExpData.CriticOptimizer,processExpData.Critic,...
    criticGradient);

% Compute the actor gradient using the deepActorGradient function.
% To accelerate the deepActorGradient function, the critic network
% is extracted outside the function and is passed in as a field
% of the actorGradData input struct.
actorGradData.criticNet = getModel(processExpData.Critic);
actorGradData.miniBatchSize = processExpData.MiniBatchSize;
actorGradient = customGradient(processExpData.Actor, ...
    @deepActorGradient,miniBatch.Observation,actorGradData);

% Update the actor parameters.
[processExpData.Actor,processExpData.ActorOptimizer] = update(...
    processExpData.ActorOptimizer,processExpData.Actor,...
    actorGradient);

actorParams = getLearnableParameters(processExpData.Actor);

% Update the targets using the TargetSmoothFactor hyperparameter.
processExpData.TargetCritic = syncParameters( ...
    processExpData.TargetCritic,...
    processExpData.Critic, ...
    processExpData.TargetSmoothFactor);
processExpData.TargetActor = syncParameters( ...
    processExpData.TargetActor,...
    processExpData.Actor, ...
    processExpData.TargetSmoothFactor);

end
The critic gradient is computed against the deepCriticLoss function.

function loss = deepCriticLoss(q,targetq)
% Extract the value from the cell array.
q = q{1};

% The loss is the half mean-square error of
% q = Q(observation,action) against targetq.
loss = mse(q,reshape(targetq,size(q)));
end
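For reference, assuming mse computes the half mean squared error over a mini-batch of N experiences, the critic loss above is:

$$ L = \frac{1}{2N}\sum_{i=1}^{N}\big(Q(s_i,a_i) - y_i\big)^2 $$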
The actor gradient is computed to maximize the expected value of the observation-action pair produced by the policy with parameters θ. Because the optimizer minimizes, the negative sign is used so that minimizing −(1/N)·sum(Q) maximizes Q with respect to θ.

Here:

- S is the batch of observations.
- A = π(S;θ) is the batch of actions returned by the actor.
- Q(S,A;ϕ) is the critic network, parameterized by ϕ.
- π(S;θ) is the actor network, parameterized by θ.
- N is the mini-batch size.
function dQdTheta = deepActorGradient( ...
    actorNet,observation,gradData)

% Evaluate actions from the current observations.
action = forward(actorNet,observation{:});

% Compute: q = Q(s,a)
q = predict(gradData.criticNet,observation{:},action);

% Compute: qsum = -sum(q)/N to maximize q
qsum = -sum(q,"all")/gradData.miniBatchSize;

% Compute: d(-sum(q)/N)/dActorParams
dQdTheta = dlgradient(qsum,actorNet.Learnables);
end
The environment reset function varies the initial conditions, reference trajectory, and noise seeds for every episode.

function in = resetFunction(in)
% Perturb the nominal reference amplitude and frequency.
leadA = max(2 + 0.1*randn,0.1);
leadF = max(1 + 0.1*randn,0.1);

% Perturb the nominal spacing.
L = 22 + 3*randn;

% Perturb the initial states.
initialPositions  = [250 200 150 100 50] + 5*randn(1,5);
initialVelocities = [10 10 10 10 10] + 1*randn(1,5);

% Update the noise seeds.
paramNoiseSeed = randi(100,1,3);
accelNoiseSeed = randi(100,1,5) + 100;
velNoiseSeed   = randi(100,1,5) + 200;
posNoiseSeed   = randi(100,1,5) + 300;

% Update the model variables.
in = setVariable(in,"L",L);
in = setVariable(in,"leadA",leadA);
in = setVariable(in,"leadF",leadF);
in = setVariable(in,"initialPositions",initialPositions);
in = setVariable(in,"initialVelocities",initialVelocities);
in = setVariable(in,"paramNoiseSeed",paramNoiseSeed);
in = setVariable(in,"accelNoiseSeed",accelNoiseSeed);
in = setVariable(in,"velNoiseSeed",velNoiseSeed);
in = setVariable(in,"posNoiseSeed",posNoiseSeed);
end
Create the critic and actor networks.

function [critic,actor] = createNetworks(obsInfo,actInfo)

% The actor and critic networks are initialized randomly.
% Ensure reproducibility by fixing the random generator seed.
rng(0);

% Number of neurons in the hidden layers
hiddenLayerSize = 64;

% Extract the dimensions of the observation and action spaces.
numObs = prod(obsInfo.Dimension);
numAct = prod(actInfo.Dimension);

% Use a Q-value function critic. This critic takes the current
% observation and an action as inputs and returns a single
% scalar as output (the estimated discounted cumulative long-term
% reward given the action from the state corresponding to the
% current observation, and following the policy thereafter).
%
% To model the parametrized Q-value function within the critic,
% use a neural network with two input layers (one for the
% observation channel and the other for the action channel),
% and one output layer (which returns the scalar value).

% Create the critic network.
% Define each network path as an array of layer objects.
obsInput = featureInputLayer(numObs, ...
    Normalization="none", ...
    Name=obsInfo.Name);
actInput = featureInputLayer(numAct, ...
    Normalization="none", ...
    Name=actInfo.Name);
catPath = [
    concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(hiddenLayerSize,Name="fc1")
    reluLayer(Name="relu1")
    fullyConnectedLayer(hiddenLayerSize,Name="fc2")
    reluLayer(Name="relu2")
    fullyConnectedLayer(1,Name="q")
    ];

% Add the layers to a layerGraph object.
net = layerGraph();
net = addLayers(net,obsInput);
net = addLayers(net,actInput);
net = addLayers(net,catPath);

% Connect the layers.
net = connectLayers(net,obsInfo.Name,"concat/in1");
net = connectLayers(net,actInfo.Name,"concat/in2");

% Convert to a dlnetwork object.
net = dlnetwork(net);

% Create the critic object.
critic = rlQValueFunction(net,obsInfo,actInfo);

% Set the critic to accelerate gradient computation.
critic = accelerate(critic,true);

% Use a continuous deterministic actor.
% This actor learns a parametrized deterministic policy
% for a continuous action space. It takes the current
% observation as input and returns as output an action
% that is a deterministic function of the observation.
%
% To model the parametrized policy within the actor, use a
% neural network with one input layer (which receives the
% content of the environment observation channel)
% and one output layer (which returns the action to the
% environment action channel).

% Define the scale and bias for the output layer.
scale = (actInfo.UpperLimit - actInfo.LowerLimit)/2;
bias  = actInfo.LowerLimit + scale;

% Create the actor network as an array of layer objects.
% Use tanhLayer to scale the signal to the (-1,1) range,
% and scalingLayer to scale the output to the action range.
obsPath = [
    featureInputLayer(numObs, ...
        Normalization="none", ...
        Name=obsInfo.Name)
    fullyConnectedLayer(hiddenLayerSize,Name="fc1")
    reluLayer(Name="relu1")
    fullyConnectedLayer(numAct,Name="fc2")
    reluLayer(Name="relu2")
    fullyConnectedLayer(numAct,Name="fc3")
    tanhLayer(Name="tanh1")
    scalingLayer(Scale=scale,...
        Bias=bias,...
        Name=actInfo.Name)
    ];

% Add the layers to a layerGraph object.
net = layerGraph;
net = addLayers(net,obsPath);

% Convert to a dlnetwork object.
net = dlnetwork(net);

% Create the actor object.
actor = rlContinuousDeterministicActor(net,obsInfo,actInfo);

% Set the actor to accelerate gradient computation.
actor = accelerate(actor,true);

end
References

[1] Rajamani, Rajesh. Vehicle Dynamics and Control. 2nd ed. Mechanical Engineering Series. New York, NY: Springer, 2012.
See Also

Functions

runEpisode | setup | cleanup | rlOptimizer | rlSimulinkEnv
Related Examples

- PID-based vehicle platooning controller design (Simulink Control Design)
- Train Reinforcement Learning Policy Using Custom Training Loop
- Create and Train Custom LQR Agent
- Create and Train Custom PG Agent