avoid obstacles using reinforcement learning for mobile robots -凯发k8网页登录

this example uses deep deterministic policy gradient (ddpg) based reinforcement learning to develop a strategy for a mobile robot to avoid obstacles. for a brief summary of the ddpg algorithm, see (reinforcement learning toolbox).

this example scenario trains a mobile robot to avoid obstacles given range sensor readings that detect obstacles in the map. the objective of the reinforcement learning algorithm is to learn what controls (linear and angular velocity), the robot should use to avoid colliding into obstacles. this example uses an occupancy map of a known environment to generate range sensor readings, detect obstacles, and check collisions the robot may make. the range sensor readings are the observations for the ddpg agent, and the linear and angular velocity controls are the action.

load map

load a map matrix, simplemap, that represents the environment for the robot.

load examplemaps simplemap 
load examplehelperofficeareamap office_area_map
mapmatrix = simplemap;
mapscale = 1;

range sensor parameters

next, set up a rangesensor object which simulates a noisy range sensor. the range sensor readings are considered observations by the agent. define the angular positions of the range readings, the max range, and the noise parameters.

scanangles = -3*pi/8:pi/8:3*pi/8;
maxrange = 12;
lidarnoisevariance = 0.1^2;
lidarnoiseseeds = randi(intmax,size(scanangles));

robot parameters

the action of the agent is a two-dimensional vector $a = [v, ω]$ where $v$ and $ω$ are the linear and angular velocities of our robot. the ddpg agent uses normalized inputs for both the angular and linear velocities, meaning the actions of the agent are a scalar between -1 and 1, which is multiplied by the maxlinspeed and maxangspeed parameters to get the actual control. specify this maximum linear and angular velocity.

also, specify the initial position of the robot as [x y theta].

% max speed parameters
maxlinspeed = 0.3;
maxangspeed = 0.3;
% initial pose of the robot
initx = 17;
inity = 15;
inittheta = pi/2;

show map and robot positions

to visualize the actions of the robot, create a figure. start by showing the occupancy map and plot the initial position of the robot.

fig = figure("name","simplemap");
set(fig, "visible","on");
ax = axes(fig);
show(binaryoccupancymap(mapmatrix),"parent",ax);
hold on
plottransforms([initx,inity,0],eul2quat([inittheta,0,0]),"meshfilepath","groundvehicle.stl","view","2d");
light;
hold off

figure simplemap contains an axes object. the axes object with title binary occupancy grid, xlabel x [meters], ylabel y [meters] contains 5 objects of type patch, line, image.

environment interface

create an environment model that takes the action, and gives the observation and reward signals. specify the provided example model name, examplehelperavoidobstaclesmobilerobot, the simulation time parameters, and the agent block name.

mdl = "examplehelperavoidobstaclesmobilerobot";
tfinal = 100;
sampletime = 0.1;
agentblk = mdl   "/agent";

open the model.

open_system(mdl)

the model contains the environment and agent blocks. the agent block is not defined yet.

inside the environment subsystem block, you should see a model for simulating the robot and sensor data. the subsystem takes in the action, generates the observation signal based on the range sensor readings, and calculates the reward based on the distance from obstacles, and the effort of the action commands.

open_system(mdl   "/environment")

define observation parameters, obsinfo, using the rlnumericspec object and giving the lower and upper limit for the range readings with enough elements for each angular position in the range sensor.

obsinfo = rlnumericspec([numel(scanangles) 1],...
    "lowerlimit",zeros(numel(scanangles),1),...
    "upperlimit",ones(numel(scanangles),1)*maxrange);
numobservations = obsinfo.dimension(1);

define action parameters, actinfo. the action is the control command vector, $a = [v, ω]$ , normalized to $[- 1, 1]$ .

numactions = 2;
actinfo = rlnumericspec([numactions 1],...
    "lowerlimit",-1,...
    "upperlimit",1);

build the environment interface object using rlsimulinkenv (reinforcement learning toolbox). specify the model, agent block name, observation parameters, and action parameters. set the reset function for the simulation using examplehelperrlavoidobstaclesresetfcn. this function restarts the simulation by placing the robot in a new random location to begin avoiding obstacles.

env = rlsimulinkenv(mdl,agentblk,obsinfo,actinfo);
env.resetfcn = @(in)examplehelperrlavoidobstaclesresetfcn(in,scanangles,maxrange,mapmatrix);
env.usefastrestart = "off";

for another example that sets up a simulink® environment for training, see create simulink environment and train agent (reinforcement learning toolbox).

ddpg agent

a ddpg agent approximates the long-term reward given observations and actions using a critic value function representation. to create the critic, first create a deep neural network with two inputs, the observation and action, and one output. for more information on creating a deep neural network value function representation, see create policies and value functions (reinforcement learning toolbox).

statepath = [
    featureinputlayer(numobservations, "normalization","none","name","state")
    fullyconnectedlayer(50,"name","criticstatefc1")
    relulayer("name","criticrelu1")
    fullyconnectedlayer(25,"name","criticstatefc2")];
actionpath = [
    featureinputlayer(numactions,"normalization","none","name","action")
    fullyconnectedlayer(25,"name","criticactionfc1")];
commonpath = [
    additionlayer(2,"name","add")
    relulayer("name","criticcommonrelu")
    fullyconnectedlayer(1,"name","criticoutput")];
criticnetwork = layergraph();
criticnetwork = addlayers(criticnetwork,statepath);
criticnetwork = addlayers(criticnetwork,actionpath);
criticnetwork = addlayers(criticnetwork,commonpath);
criticnetwork = connectlayers(criticnetwork,"criticstatefc2","add/in1");
criticnetwork = connectlayers(criticnetwork,"criticactionfc1","add/in2");
criticnetwork = dlnetwork(criticnetwork);

next, specify options for the critic optimizer using rloptimizeroptions.

finally, create the critic representation using the specified deep neural network and options. you must also specify the action and observation specifications for the critic, which you obtain from the environment interface. for more information, see rlqvaluefunction (reinforcement learning toolbox).

criticoptions = rloptimizeroptions("learnrate",1e-3,"l2regularizationfactor",1e-4,"gradientthreshold",1);
critic = rlqvaluefunction(criticnetwork,obsinfo,actinfo,"observationinputnames","state","actioninputnames","action");

a ddpg agent decides which action to take given observations using an actor representation. to create the actor, first create a deep neural network with one input, the observation, and one output, the action.

finally, construct the actor in a similar manner as the critic. for more information, see rlcontinuousdeterministicactor (reinforcement learning toolbox).

actornetwork = [
    featureinputlayer(numobservations,"normalization","none","name","state")
    fullyconnectedlayer(50,"name","actorfc1")
    relulayer("name","actorrelu1")
    fullyconnectedlayer(50, "name","actorfc2")
    relulayer("name","actorrelu2")
    fullyconnectedlayer(2, "name","actorfc3")
    tanhlayer("name","action")];
actornetwork = dlnetwork(actornetwork);
actoroptions = rloptimizeroptions("learnrate",1e-4,"l2regularizationfactor",1e-4,"gradientthreshold",1);
actor = rlcontinuousdeterministicactor(actornetwork,obsinfo,actinfo);

create ddpg agent object

specify the agent options.

agentopts = rlddpgagentoptions(...
    "sampletime",sampletime,...
    "actoroptimizeroptions",actoroptions,...
    "criticoptimizeroptions",criticoptions,...
    "discountfactor",0.995, ...
    "minibatchsize",128, ...
    "experiencebufferlength",1e6); 
agentopts.noiseoptions.variance = 0.1;
agentopts.noiseoptions.variancedecayrate = 1e-5;

create the rlddpgagent object. the obstacleavoidanceagent variable is used in the model for the agent block.

obstacleavoidanceagent = rlddpgagent(actor,critic,agentopts);
open_system(mdl   "/agent")

reward

the reward function for the agent is modeled as shown.

the agent is rewarded to avoid the nearest obstacle, which minimizes the worst-case scenario. additionally, the agent is given a positive reward for higher linear speeds, and is given a negative reward for higher angular speeds. this rewarding strategy discourages the agent's behavior of going in circles. tuning your rewards is key to properly training an agent, so your rewards vary depending on your application.

train agent

to train the agent, first specify the training options. for this example, use the following options:

train for at most 10000 episodes, with each episode lasting at most maxsteps time steps.
display the training progress in the episode manager dialog box (set the plots option) and enable the command line display (set the verbose option to true).
stop training when the agent receives an average cumulative reward greater than 400 over fifty consecutive episodes.

for more information, see rltrainingoptions (reinforcement learning toolbox).

maxepisodes = 10000;
maxsteps = ceil(tfinal/sampletime);
trainopts = rltrainingoptions(...
    "maxepisodes",maxepisodes, ...
    "maxstepsperepisode",maxsteps, ...
    "scoreaveragingwindowlength",50, ...
    "stoptrainingcriteria","averagereward", ...
    "stoptrainingvalue",400, ...
    "verbose", true, ...
    "plots","training-progress");

train the agent using the train (reinforcement learning toolbox) function. training is a computationally-intensive process that takes several minutes to complete. to save time while running this example, load a pretrained agent by setting dotraining to false. to train the agent yourself, set dotraining to true.

dotraining = false; % toggle this to true for training. 
if dotraining
    % train the agent.
    trainingstats = train(obstacleavoidanceagent,env,trainopts);
else
    % load pretrained agent for the example.
    load examplehelperavoidobstaclesagent obstacleavoidanceagent
end

the reinforcement learning episode manager can be used to track episode-wise training progress. as the episode number increases, you want to see an increase in the reward value.

simulate

use the trained agent to simulate the robot driving in the map and avoiding obstacles for 200 seconds.

set_param("examplehelperavoidobstaclesmobilerobot","stoptime","200");
out = sim("examplehelperavoidobstaclesmobilerobot.slx");

visualize

to visualize the simulation of the robot driving around the environment with range sensor readings, use the helper, examplehelperavoidobstaclesposeplot.

for i = 1:5:size(out.range,3)
    u = out.pose(i,:);
    r = out.range(:,:,i);
    examplehelperavoidobstaclesposeplot(u,mapmatrix,mapscale,r,scanangles,ax);
end

figure simplemap contains an axes object. the axes object with title binary occupancy grid, xlabel x [meters], ylabel y [meters] contains 6 objects of type patch, line, image. one or more of the lines displays its values using only markers

extensibility

you can now use this agent to simulate driving in a different map. another map generated from lidar scans of an office environment is used with the same trained model. this map represents a more realistic scenario to apply the trained model after training.

change the map

mapmatrix = office_area_map.occupancymatrix > 0.5;
mapscale = 10;
initx = 20;
inity = 30;
inittheta = 0;
fig = figure("name","office_area_map");
set(fig,"visible","on");
ax = axes(fig);
show(binaryoccupancymap(mapmatrix,mapscale),"parent",ax);
hold on
plottransforms([initx,inity,0],eul2quat([inittheta, 0, 0]),"meshfilepath","groundvehicle.stl","view","2d");
light;
hold off

figure office_area_map contains an axes object. the axes object with title binary occupancy grid, xlabel x [meters], ylabel y [meters] contains 5 objects of type patch, line, image.

simulate

simulate the robot's trajectory across the new map. reset the simulation time in case the agent needs to be re-trained and retuned later.

out = sim("examplehelperavoidobstaclesmobilerobot.slx");
set_param("examplehelperavoidobstaclesmobilerobot","stoptime","inf");

visualize

for i = 1:5:size(out.range,3)
    u = out.pose(i,:);
    r = out.range(:,:,i);
    examplehelperavoidobstaclesposeplot(u,mapmatrix,mapscale,r,scanangles,ax);
end

figure office_area_map contains an axes object. the axes object with title binary occupancy grid, xlabel x [meters], ylabel y [meters] contains 6 objects of type patch, line, image. one or more of the lines displays its values using only markers

avoid obstacles using reinforcement learning for mobile robots -凯发k8网页登录

load map

range sensor parameters

robot parameters

show map and robot positions

environment interface

ddpg agent

create ddpg agent object

reward

train agent

simulate

visualize

extensibility

change the map

simulate

visualize

wechat