Train Multiple Agents for Path Following Control
This example shows how to train multiple agents to collaboratively perform path-following control (PFC) for a vehicle. The goal of PFC is to make the ego vehicle travel at a set velocity while maintaining a safe distance from a lead car by controlling longitudinal acceleration and braking, and also to keep the vehicle traveling along the centerline of its lane by controlling the front steering angle. For more information on PFC, see Path Following Control System (Model Predictive Control Toolbox).
Overview
An example that trains a reinforcement learning agent to perform PFC is shown in Train DDPG Agent for Path Following Control. In that example, a single deep deterministic policy gradient (DDPG) agent is trained to control both the longitudinal speed and lateral steering of the ego vehicle. In this example, you train two reinforcement learning agents: a DDPG agent provides continuous acceleration values for the longitudinal control loop, and a deep Q-network (DQN) agent provides discrete steering angle values for the lateral control loop.
The trained agents perform PFC through cooperative behavior and achieve satisfactory results.
Create Environment
The environment for this example includes a simple bicycle model for the ego car and a simple longitudinal model for the lead car. The training goal is to make the ego car travel at a set velocity while maintaining a safe distance from the lead car by controlling longitudinal acceleration and braking, while also keeping the ego car traveling along the centerline of its lane by controlling the front steering angle.
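The vehicle dynamics are implemented inside the Simulink model. As a rough illustration of the kind of ego-vehicle model involved, the following sketch shows one discrete-time update of a kinematic bicycle model; the function name, state fields, and wheelbase value are illustrative assumptions, not taken from the shipped model.
function state = bicycleStep(state,accel,steer,Ts)
% One Euler step of a kinematic bicycle model (illustrative sketch only).
% state is a struct with fields x, y, psi (yaw angle), and v (longitudinal velocity).
L = 3;                                   % assumed wheelbase (m)
state.x   = state.x   + Ts*state.v*cos(state.psi);
state.y   = state.y   + Ts*state.v*sin(state.psi);
state.psi = state.psi + Ts*state.v*tan(steer)/L;
state.v   = max(state.v + Ts*accel,0);   % keep the velocity nonnegative
end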
Load the environment parameters.
multiAgentPFCParams
Open the Simulink model.
mdl = "rlMultiAgentPFC";
open_system(mdl)
In this model, the two reinforcement learning agents (RL Agent1 and RL Agent2) provide longitudinal acceleration and steering angle signals, respectively.
The simulation terminates when any of the following conditions occur.
- The magnitude of the lateral deviation exceeds 1 m.
- The longitudinal velocity of the ego car drops below 0.5 m/s.
- The relative distance between the ego car and the lead car drops below zero.
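As a minimal sketch of these conditions, the termination flag could be computed as follows; the signal names e1, v_ego, and d_rel are assumptions for illustration, not the names used in the model.
% Illustrative termination check (assumed signal names).
isDone = abs(e1) > 1 ...     % lateral deviation exceeds 1 m
      || v_ego < 0.5 ...     % ego velocity drops below 0.5 m/s
      || d_rel < 0;          % ego car has reached the lead car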
For the longitudinal controller (RL Agent1):
The reference velocity for the ego car $V_{ref}$ is defined as follows. If the relative distance is less than the safe distance, the ego car tracks the minimum of the lead car velocity and the driver-set velocity. In this manner, the ego car maintains some distance from the lead car. If the relative distance is greater than the safe distance, the ego car tracks the driver-set velocity. In this example, the safe distance is defined as a linear function of the ego car longitudinal velocity $V_{ego}$, that is, $D_{safe} = D_{default} + t_{gap} V_{ego}$. The safe distance determines the tracking velocity for the ego car.
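A minimal sketch of this reference-velocity logic follows; the helper name, time gap, and default spacing are placeholders, not the values defined in multiAgentPFCParams.
function v_ref = referenceVelocity(v_ego,v_lead,v_set,d_rel)
% Illustrative reference-velocity selection (placeholder name and parameter values).
t_gap     = 1.4;                       % assumed time gap (s)
D_default = 10;                        % assumed default spacing (m)
D_safe    = D_default + t_gap*v_ego;   % safe distance grows with ego speed
if d_rel < D_safe
    v_ref = min(v_lead,v_set);         % back off to the slower of lead/set speed
else
    v_ref = v_set;                     % otherwise track the driver-set velocity
end
end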
The observations from the environment contain the longitudinal measurements: the velocity error $e_V = V_{ref} - V_{ego}$, its integral $\int e_V$, and the ego car longitudinal velocity $V_{ego}$.
The action signal consists of continuous acceleration values between -3 and 2 m/s^2.
The reward $r_1$, provided at every time step $t$, penalizes the squared velocity error and the acceleration command from the previous time step, and has the form

$r_1(t) = -\left(w_{e}\,e_V^2(t) + w_{a}\,a^2(t-1)\right) - w_F F_t + w_M M_t$

Here, $a(t-1)$ is the acceleration input from the previous time step, the weights $w_e$, $w_a$, $w_F$, and $w_M$ are set in the model, and:

$F_t = 1$ if the simulation is terminated, otherwise $F_t = 0$.

$M_t = 1$ if the squared velocity error $e_V^2$ is below a small threshold, otherwise $M_t = 0$.
For the lateral controller (RL Agent2):
The observations from the environment contain the lateral measurements: the lateral deviation $e_1$, the relative yaw angle $e_2$, their derivatives $\dot{e}_1$ and $\dot{e}_2$, and their integrals $\int e_1$ and $\int e_2$.
The action signal consists of discrete steering angle actions, which take values from -15 degrees (-0.2618 rad) to 15 degrees (0.2618 rad) in steps of 1 degree (0.0175 rad).
The reward $r_2$, provided at every time step $t$, penalizes the squared lateral deviation and the steering and acceleration commands from the previous time step, and has the form

$r_2(t) = -\left(w_{1}\,e_1^2(t) + w_{u}\,u^2(t-1) + w_{a}\,a^2(t-1)\right) - w_F F_t + w_H H_t$

Here, $u(t-1)$ is the steering input from the previous time step, $a(t-1)$ is the acceleration input from the previous time step, the weights are set in the model, and:

$F_t = 1$ if the simulation is terminated, otherwise $F_t = 0$.

$H_t = 1$ if the squared lateral deviation $e_1^2$ is below a small threshold, otherwise $H_t = 0$.

The logical terms in the reward functions ($F_t$, $M_t$, and $H_t$) penalize the agents if the simulation terminates early, while encouraging the agents to make both the lateral error and the velocity error small.
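The following sketch shows how rewards of this shape could be computed; the function name, weights, and thresholds are placeholders for exposition, not the values used in the shipped model.
function [r1,r2] = pfcRewards(e_v,e_1,a_prev,u_prev,isDone)
% Illustrative reward shaping for both agents (placeholder weights and thresholds).
F = double(isDone);        % 1 if the episode terminated early, otherwise 0
M = double(e_v^2 < 1);     % bonus when the velocity error is small (assumed threshold)
H = double(e_1^2 < 0.01);  % bonus when the lateral deviation is small (assumed threshold)

% Longitudinal agent: penalize velocity error and the previous acceleration command.
r1 = -(0.1*e_v^2 + 0.01*a_prev^2) - 10*F + 2*M;

% Lateral agent: penalize lateral deviation and the previous steering and
% acceleration commands.
r2 = -(1*e_1^2 + 0.01*u_prev^2 + 0.01*a_prev^2) - 10*F + 2*H;
end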
Create the observation and action specifications for the longitudinal control loop.
obsInfo1 = rlNumericSpec([3 1]);
actInfo1 = rlNumericSpec([1 1],LowerLimit=-3,UpperLimit=2);
Create the observation and action specifications for the lateral control loop.
obsInfo2 = rlNumericSpec([6 1]);
actInfo2 = rlFiniteSetSpec((-15:15)*pi/180);
Combine the observation and action specifications into cell arrays.
obsInfo = {obsInfo1,obsInfo2};
actInfo = {actInfo1,actInfo2};
Create a Simulink environment interface, specifying the block paths for both agent blocks. The order of the block paths must match the order of the observation and action specification cell arrays.
blks = mdl + ["/RL Agent1", "/RL Agent2"];
env = rlSimulinkEnv(mdl,blks,obsInfo,actInfo);
Specify a reset function for the environment using the ResetFcn property. The function pfcResetFcn randomly sets the initial poses of the lead and ego vehicles at the beginning of every episode during training.
env.ResetFcn = @pfcResetFcn;
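The pfcResetFcn helper ships with the example. A minimal sketch of a reset function of this kind, assuming the initial conditions are the workspace variables e1_initial, e2_initial, and x0_lead used later in this example, and assuming illustrative randomization ranges, could look like this.
function in = pfcResetFcnSketch(in)
% Illustrative reset function: randomize the initial poses at the start of each episode.
% The ranges below are assumptions, not the ones used by the shipped pfcResetFcn.
in = setVariable(in,"e1_initial",0.5*(2*rand - 1));  % lateral deviation in [-0.5, 0.5] m
in = setVariable(in,"e2_initial",0.1*(2*rand - 1));  % relative yaw angle in [-0.1, 0.1] rad
in = setVariable(in,"x0_lead",randi([60 100]));      % lead car initial position (m)
end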
Create Agents
For this example, you create two reinforcement learning agents. First, fix the random seed for reproducibility.
rng(0)
Both agents operate at the same sample time in this example. Set the sample time value (in seconds).
Ts = 0.1;
Longitudinal Control
The agent for the longitudinal control loop is a DDPG agent. A DDPG agent approximates the long-term reward given observations and actions using a critic value function representation, and selects actions using an actor policy representation. For more information on creating deep neural network value function and policy representations, see Create Policies and Value Functions.
Use the createACCAgent function to create a DDPG agent for longitudinal control. The structure of this agent is similar to the one in the Train DDPG Agent for Adaptive Cruise Control example.
agent1 = createACCAgent(obsInfo1,actInfo1,Ts);
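The createACCAgent helper ships with the example and builds its own actor and critic networks. Under that assumption, a comparable DDPG agent with default networks could be sketched as follows; altAgent1 and the hidden-layer size are illustrative.
% Alternative sketch: DDPG agent with default actor and critic networks.
initOpts  = rlAgentInitializationOptions(NumHiddenUnit=128);   % assumed network size
agentOpts = rlDDPGAgentOptions(SampleTime=Ts);
altAgent1 = rlDDPGAgent(obsInfo1,actInfo1,initOpts,agentOpts);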
Lateral Control
The agent for the lateral control loop is a DQN agent. A DQN agent approximates the long-term reward given observations and actions using a critic value function representation.
Use the createLKAAgent function to create a DQN agent for lateral control. The structure of this agent is similar to the one in the Train DQN Agent for Lane Keeping Assist example.
agent2 = createLKAAgent(obsInfo2,actInfo2,Ts);
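Similarly, the createLKAAgent helper builds its own critic network. A hedged sketch of a comparable DQN agent with a default critic network follows; altAgent2 and the options shown are illustrative.
% Alternative sketch: DQN agent with a default critic network.
dqnInit   = rlAgentInitializationOptions(NumHiddenUnit=128);   % assumed network size
dqnOpts   = rlDQNAgentOptions(SampleTime=Ts,UseDoubleDQN=true);
altAgent2 = rlDQNAgent(obsInfo2,actInfo2,dqnInit,dqnOpts);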
Train Agents
Specify the training options. For this example, use the following options.
- Run each training for at most 5000 episodes, with each episode lasting at most maxsteps time steps.
- Display the training progress in the Episode Manager dialog box (set the Verbose and Plots options).
- Stop training the DDPG and DQN agents when they receive an average reward greater than 480 and 1195, respectively. When one agent reaches its stopping criterion, it simulates its own policy without learning while the other agent continues training.
Tf = 60; % simulation time
maxepisodes = 5000;
maxsteps = ceil(Tf/Ts);
trainingOpts = rlTrainingOptions(...
    MaxEpisodes=maxepisodes,...
    MaxStepsPerEpisode=maxsteps,...
    Verbose=false,...
    Plots="training-progress",...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=[480,1195]);
Train the agents using the train function. Training these agents is a computationally intensive process that takes several minutes to complete. To save time while running this example, load pretrained agents by setting doTraining to false. To train the agents yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Train the agents.
    trainingStats = train([agent1,agent2],env,trainingOpts);
else
    % Load pretrained agents for the example.
    load("rlPFCAgents.mat")
end
The following figure shows a snapshot of the training progress for the two agents.
Simulate Agents
To validate the performance of the trained agents, simulate the agents within the Simulink environment by uncommenting the following commands. For more information on agent simulation, see rlSimulationOptions and sim.
% simOptions = rlSimulationOptions(MaxSteps=maxsteps);
% experience = sim(env,[agent1, agent2],simOptions);
To demonstrate the trained agents using deterministic initial conditions, simulate the model in Simulink.
e1_initial = -0.4;
e2_initial = 0.1;
x0_lead = 80;
sim(mdl)
The following plots show the results when the lead car is 70 m ahead of the ego car at the beginning of the simulation.
The lead car changes speed from 24 m/s to 30 m/s periodically (top-right plot). The ego car maintains a safe distance throughout the simulation (bottom-right plot).
From 0 to 30 seconds, the ego car tracks the set velocity (top-right plot) and experiences some acceleration (top-left plot). After that, the acceleration is reduced to 0.
The bottom-left plot shows the lateral deviation. As shown in the plot, the lateral deviation is greatly decreased within 1 second and remains less than 0.1 m.
See Also
Functions
train | sim | rlSimulinkEnv
Objects
rlTrainingOptions | rlSimulationOptions
Related Examples
- Train Multiple Agents for Area Coverage
- Train Multiple Agents to Perform Collaborative Task
- Path Following Control System (Model Predictive Control Toolbox)