train sac agent for ball balance control -凯发k8网页登录
this example shows how to train a soft actor-critic (sac) reinforcement learning agent to control a robot arm for a ball-balancing task.
introduction
the robot arm in this example is a kinova gen3 robot, which is a seven degree-of-freedom (dof) manipulator. the arm is tasked to balance a ping pong ball at the center of a flat surface (plate) attached to the robot gripper. only the final two joints are actuated and contribute to motion in the pitch and roll axes as shown in the following figure. the remaining joints are fixed and do not contribute to motion.
open the simulink® model to view the system. the model contains a kinova ball balance subsystem connected to an rl agent block. the agent applies an action to the robot subsystem and receives the resulting observation, reward, and is-done signals.
open_system("rlkinovaballbalance")
view to the kinova ball balance subsystem.
open_system("rlkinovaballbalance/kinova ball balance")
in this model:
the physical components of the system (manipulator, ball, and plate) are modeled using simscape™ multibody™ components.
the plate is constrained to the end effector of the manipulator.
the ball has six degrees of freedom and can move freely in space.
contact forces between the ball and plate are modeled using the spatial contact force block.
control inputs to the manipulator are the torque signals for the actuated joints.
if you have the support package, you can view a 3-d animation of the manipulator in the mechanics explorer. to do so, open the 7 dof manipulator subsystem and set its visualization parameter to 3d mesh
. if you do have the support package installed, set the visualization parameter to none
. to download and install the support package, use the add-on explorer. for more information see .
create the parameters for the example by running the kinova_params
script included with this example. when you have the robotics system toolbox robot library data support package installed, this script also adds the necessary mesh files to the matlab® path.
kinova_params
define environment
to train a reinforcement learning agent, you must define the environment with which it will interact. for the ball balancing environment:
the observations are represented by a 22 element vector that contains information about the positions (sine and cosine of joint angles) and velocities (joint angle derivatives) of the two actuated joints, positions (x and y distances from plate center) and velocities (x and y derivatives) of the ball, orientation (quaternions) and velocities (quaternion derivatives) of the plate, joint torques from the last time step, ball radius, and mass.
the actions are normalized joint torque values.
the sample time is , and the simulation time is .
the simulation terminates when the ball falls off the plate.
the reward at time step is given by:
here, is a reward for the ball moving closer to the center of the plate, is a penalty for plate orientation, and is a penalty for control effort. , , and are the respective roll, pitch, and yaw angles of the plate in radians. and are the joint torques.
create the observation and action specifications for the environment using continuous observation and action spaces.
numobs = 22; % number of dimension of the observation space numact = 2; % number of dimension of the action space obsinfo = rlnumericspec([numobs 1]); actinfo = rlnumericspec([numact 1]); actinfo.lowerlimit = -1; actinfo.upperlimit = 1;
create the simulink environment interface using the observation and action specifications. for more information on creating simulink environments, see rlsimulinkenv
.
mdl = "rlkinovaballbalance"; blk = mdl "/rl agent"; env = rlsimulinkenv(mdl,blk,obsinfo,actinfo);
specify a reset function for the environment using the resetfcn
parameter.
env.resetfcn = @kinovaresetfcn;
this reset function (provided at the end of this example) randomly initializes the initial x and y positions of the ball with respect to the center of the plate. for more robust training, you can also randomize other parameters inside the reset function, such as the mass and radius of the ball.
specify the sample time ts
and simulation time tf
.
ts = 0.01; tf = 10;
create agent
the agent in this example is a soft actor-critic (sac) agent. sac agents use one or two parametrized q-value function approximators to estimate the value of the policy. a q-value function critic takes the current observation and an action as inputs and returns a single scalar as output (the estimated discounted cumulative long-term reward for which receives the action from the state corresponding to the current observation, and following the policy thereafter). for more information on sac agents, see .
the sac agent in this example uses two critics. to model the parametrized q-value functions within the critics, use a neural network with two input layers (one for the observation channel, as specified by obsinfo
, and the other for the action channel, as specified by actinfo
) and one output layer (which returns the scalar value).
define each network path as an array of layer objects. assign names to the input and output layers of each path. these names allow you to connect the paths and then later explicitly associate the network input and output layers with the appropriate environment channel.
for more information on creating deep neural networks for reinforcement learning agents, see create policies and value functions.
% set the random seed for reproducibility. rng(0) % define the network layers. criticnet = [ featureinputlayer(numobs,name="obsinlyr") fullyconnectedlayer(128) concatenationlayer(1,2,name="concat") relulayer fullyconnectedlayer(64) relulayer fullyconnectedlayer(32) relulayer fullyconnectedlayer(1,name="qvalueoutlyr") ]; actionpath = [ featureinputlayer(numact,name="actinlyr") fullyconnectedlayer(128,name="fc2") ]; % connect the layers. criticnet = layergraph(criticnet); criticnet = addlayers(criticnet, actionpath); criticnet = connectlayers(criticnet,"fc2","concat/in2");
view the critic neural network.
plot(criticnet)
when using two critics, a sac agent requires them to have different initial parameters. create and initialize two dlnetwork
objects.
criticdlnet = dlnetwork(criticnet,'initialize',false);
criticdlnet1 = initialize(criticdlnet);
criticdlnet2 = initialize(criticdlnet);
create the critic functions using rlqvaluefunction
.
critic1 = rlqvaluefunction(criticdlnet1,obsinfo,actinfo, ... observationinputnames="obsinlyr"); critic2 = rlqvaluefunction(criticdlnet2,obsinfo,actinfo, ... observationinputnames="obsinlyr");
soft actor-critic agents use a parametrized stochastic policy over a continuous action space, which is implemented by a continuous gaussian actor.
this actor takes an observation as input and returns as output a random action sampled from a gaussian probability distribution.
to approximate the mean values and standard deviations of the gaussian distribution, you must use a neural network with two output layers, each having as many elements as the dimension of the action space. one output layer must return a vector containing the mean values for each action dimension. the other must return a vector containing the standard deviation for each action dimension.
since standard deviations must be nonnegative, use a softplus or relu layer to enforce nonnegativity. the sac agent automatically reads the action range from the upperlimit
and lowerlimit
properties of actinfo
(which is used to create the actor), and then internally scales the distribution and bounds the action. therefore, do not add a tanhlayer
as the last nonlinear layer in the mean output path.
define each network path as an array of layer objects, and assign names to the input and output layers of each path.
% create the actor network layers. commonpath = [ featureinputlayer(numobs,name="obsinlyr") fullyconnectedlayer(128) relulayer fullyconnectedlayer(64) relulayer(name="compathoutlyr") ]; meanpath = [ fullyconnectedlayer(32,name="meanfc") relulayer fullyconnectedlayer(numact,name="meanoutlyr") ]; stdpath = [ fullyconnectedlayer(numact,name="stdfc") relulayer softpluslayer(name="stdoutlyr") ]; % connect the layers. actornetwork = layergraph(commonpath); actornetwork = addlayers(actornetwork,meanpath); actornetwork = addlayers(actornetwork,stdpath); actornetwork = connectlayers(actornetwork,"compathoutlyr","meanfc/in"); actornetwork = connectlayers(actornetwork,"compathoutlyr","stdfc/in");
view the actor neural network.
plot(actornetwork)
create the actor function using rlcontinuousgaussianactor
.
actordlnet = dlnetwork(actornetwork); actor = rlcontinuousgaussianactor(actordlnet, obsinfo, actinfo, ... observationinputnames="obsinlyr", ... actionmeanoutputnames="meanoutlyr", ... actionstandarddeviationoutputnames="stdoutlyr");
the sac agent in this example trains from an experience buffer of maximum capacity 1e6 by randomly selecting mini-batches of size 128. the fact that the discount factor is 0.99, that is very close to 1, means that the agent keeps more into account long term rewards (a discount factor closer to 0 would instead place a heavier discount on future rewards, therefore encouraging short term ones). for a full list of sac hyperparameters and their descriptions, see .
specify the agent hyperparameters for training.
agentopts = rlsacagentoptions( ... sampletime=ts, ... targetsmoothfactor=1e-3, ... experiencebufferlength=1e6, ... minibatchsize=128, ... numwarmstartsteps=1000, ... discountfactor=0.99);
for this example the actor and critic neural networks are updated using the adam algorithm with a learn rate of 1e-4 and gradient threshold of 1. specify the optimizer parameters.
agentopts.actoroptimizeroptions.algorithm = "adam"; agentopts.actoroptimizeroptions.learnrate = 1e-4; agentopts.actoroptimizeroptions.gradientthreshold = 1; for ct = 1:2 agentopts.criticoptimizeroptions(ct).algorithm = "adam"; agentopts.criticoptimizeroptions(ct).learnrate = 1e-4; agentopts.criticoptimizeroptions(ct).gradientthreshold = 1; end
create the sac agent.
agent = rlsacagent(actor,[critic1,critic2],agentopts);
train agent
to train the agent, first specify the training options using rltrainingoptions
. for this example, use the following options:
run each training for at most 5000 episodes, with each episode lasting at most
floor(tf/ts)
time steps.stop training when the agent receives an average cumulative reward greater than 675 over 100 consecutive episodes.
to speed up training set the
useparallel
option totrue
, which requires parallel computing toolbox™ software. if you do not have the software installed set the option tofalse
.
trainopts = rltrainingoptions(... maxepisodes=6000, ... maxstepsperepisode=floor(tf/ts), ... scoreaveragingwindowlength=100, ... plots="training-progress", ... stoptrainingcriteria="averagereward", ... stoptrainingvalue=675, ... useparallel=false);
specify a visualization flag doviz
that shows animation of the ball in a matlab figure. for parallel training, disable all visualization.
if trainopts.useparallel % disable visualization in simscape mechanics explorer set_param(mdl, simmechanicsopeneditoronupdate="off"); set_param(mdl "/kinova ball balance/7 dof manipulator", ... "vchoice", "none"); % disable animation in matlab figure doviz = false; save_system(mdl); else % enable visualization in simscape mechanics explorer set_param(mdl, simmechanicsopeneditoronupdate="on"); % enable animation in matlab figure doviz = true; end
you can log training data to disk using the rldatalogger
function. for this example, log the episode experience and actor and critic function losses. the functions logagentlearndata
and logepisodedata
are defined at the end of this live script. for more information see log training data to disk.
logger = rldatalogger(); logger.agentlearnfinishedfcn = @logagentlearndata; logger.episodefinishedfcn = @(data) logepisodedata(data, doviz);
train the agent using the train
function. training this agent is a computationally intensive process that takes several minutes to complete. to save time while running this example, load a pretrained agent by setting dotraining
to false
. to train the agent yourself, set dotraining
to true
.
dotraining = false; if dotraining trainresult = train(agent,env,trainopts,logger=logger); else load("kinovaballbalanceagent.mat") end
a snapshot of training progress is shown in the following figure. you can expect different results due to randomness in the training process.
simulate trained agent
specify an initial position for the ball with respect to the plate center. to randomize the initial ball position during simulation, set the userspecifiedconditions
flag to false
.
userspecifiedconditions = true; if userspecifiedconditions ball.x0 = 0.10; ball.y0 = -0.10; env.resetfcn = []; else env.resetfcn = @kinovaresetfcn; end
create a simulation options object for configuring the simulation. the agent will be simulated for a maximum of floor(tf/ts)
steps per simulation episode.
simopts = rlsimulationoptions(maxsteps=floor(tf/ts));
enable visualization during simulation.
set_param(mdl, simmechanicsopeneditoronupdate="on");
doviz = true;
simulate the agent.
experiences = sim(agent,env,simopts);
view the trajectory of the ball using the ball position scope block.
view the animation of the ball on the plate in a matlab figure.
fig = animatedpath(experiences);
environment reset function
function in = kinovaresetfcn(in) % kinovaresetfcn is used to randomize the initial joint angles r6_q0, % r7_q0 and the initial wrist and hand torque values. % ball parameters ball.radius = 0.02; % m ball.mass = 0.0027; % kg ball.shell = 0.0002; % m % calculate ball moment of inertia. ball.moi = calcmoi(ball.radius,ball.shell,ball.mass); % initial conditions. z is vertically upward. % randomize the x and y distances within the plate. ball.x0 = -0.125 0.25*rand; % m, initial x distance from plate center ball.y0 = -0.125 0.25*rand; % m, initial y distance from plate center ball.z0 = ball.radius; % m, initial z height from plate surface ball.dx0 = 0; % m/s, ball initial x velocity ball.dy0 = 0; % m/s, ball initial y velocity ball.dz0 = 0; % m/s, ball initial z velocity % contact friction parameters ball.staticfriction = 0.5; ball.dynamicfriction = 0.3; ball.criticalvelocity = 1e-3; % convert coefficient of restitution to spring-damper parameters. coeff_restitution = 0.89; [k, c, w] = cor2springdamperparams(coeff_restitution,ball.mass); ball.stiffness = k; ball.damping = c; ball.transitionwidth = w; in = setvariable(in,"ball",ball); % randomize joint angles within a range of /- 5 deg from the % starting positions of the joints. r6_q0 = deg2rad(-65) deg2rad(-5 10*rand); r7_q0 = deg2rad(-90) deg2rad(-5 10*rand); in = setvariable(in,"r6_q0",r6_q0); in = setvariable(in,"r7_q0",r7_q0); % compute approximate initial joint torques that hold the ball, % plate and arm at their initial congifuration g = 9.80665; wrist_torque_0 = ... (-1.882 ball.x0 * ball.mass * g) * cos(deg2rad(-65) - r6_q0); hand_torque_0 = ... (0.0002349 - ball.y0 * ball.mass * g) * cos(deg2rad(-90) - r7_q0); u0 = [wrist_torque_0 hand_torque_0]; in = setvariable(in,"u0",u0); end
data logging functions
function datatolog = logagentlearndata(data) % this function is executed after completion % of the agent's learning subroutine datatolog.actorloss = data.actorloss; datatolog.criticloss = data.criticloss; end function datatolog = logepisodedata(data, doviz) % this function is executed after the completion of an episode datatolog.experience = data.experience; % show an animation after episode completion if doviz animatedpath(data.experience); end end
see also
functions
train
|sim
|rlsimulinkenv
objects
blocks
related examples
- train biped robot to walk using reinforcement learning agents
- (model predictive control toolbox)