
Train Reinforcement Learning Agent Offline to Control Quanser QUBE Pendulum

This example shows how you can train an agent from existing data instead of using an environment. Training an agent from data (offline reinforcement learning) can be useful to efficiently prototype different design choices by leveraging already collected data. In this example, two reinforcement learning (RL) agents are trained to swing up and balance a Quanser QUBE™-Servo 2 inverted pendulum system. First, you train a TD3 agent online using a Simulink® environment and collect training data using a data logger. Then, you train another TD3 agent offline using the collected data and a modified reward function to achieve a different behavior from the agent. You then compare the behavior of both agents.

Inverted Pendulum Model

The Quanser QUBE-Servo 2 pendulum system is a rotational inverted pendulum with two degrees of freedom. It is nonlinear, underactuated, and non-minimum phase, and it is modeled in Simulink using Simscape™ Electrical™ and Simscape™ Multibody™. For a detailed description of the dynamics, see [1].

The pendulum is attached to the motor arm through a free revolute joint. The arm is actuated by a DC motor. The environment has the following properties:

  • The angles and angular velocities of the motor arm ($\theta$, $\dot{\theta}$) and pendulum ($\phi$, $\dot{\phi}$) are measurable.

  • The motor arm is constrained to $|\theta| \le 5\pi/8$ rad and $|\dot{\theta}| \le 30$ rad/s.

  • The pendulum is upright when $\phi = 0$.

  • The motor input is constrained to ±12 V.

  • The agent action ($|u| \le 1$) is scaled to the motor voltage in the environment, as illustrated in the sketch after this list.
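The scaling from the normalized action to the motor voltage happens inside the Simulink model; the minimal sketch below only illustrates the relationship implied by the limits above, and the variable names here are illustrative only.

% Illustrative sketch: map a normalized action u in [-1,1] to a motor voltage.
% The actual conversion is performed inside the Simulink model.
volt_limit = 12;            % motor voltage limit (V)
u = 0.5;                    % example normalized agent action
voltage = volt_limit*u;     % commanded motor voltage (6 V for this action)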

Figure: angle conventions for the motor arm and pendulum (pendulum_convention.png).

Open the Simulink model.

mdl = "rlqubeservomodel";
open_system(mdl)

Define the angular and voltage limits, as well as the model's sample time.

theta_limit = 5*pi/8;
dtheta_limit = 30;
volt_limit = 12;
Ts = 0.005;

Train TD3 Agent Online

A TD3 agent is trained online against the Simulink environment. For this agent:

  • The environment is the pendulum system modeled in Simscape Multibody.

  • The observation is the vector $s_k = [\sin\theta_k,\ \cos\theta_k,\ \dot{\theta}_k,\ \sin\phi_k,\ \cos\phi_k,\ \dot{\phi}_k,\ u_{k-1}]$. Using the sine and cosine of the measured angles can facilitate training by representing the otherwise discontinuous angular measurements with a continuous two-dimensional parametrization.

  • The action is the normalized input voltage command to the servo motor.

  • The reward signal is defined as follows:

$$r(s_k,u_{k-1}) = f_k - 0.1\left(\theta_k^2 + \phi_k^2 + \dot{\theta}_k^2 + \dot{\phi}_k^2 + u_{k-1}^2 + 0.3\,(u_{k-1}-u_{k-2})^2\right)$$

$$f_k = \begin{cases} 1 & |\theta_k| \le 5\pi/8 \text{ rad and } |\dot{\theta}_k| \le 30 \text{ rad/s} \\ 0 & \text{otherwise} \end{cases}$$

The above reward function penalizes six different terms:

  • Deviations from the forward position of the motor arm ($\theta_k = 0$)

  • Deviations from the inverted position of the pendulum ($\phi_k = 0$)

  • The angular speed of the motor arm $\dot{\theta}_k$

  • The angular speed of the pendulum $\dot{\phi}_k$

  • The control action $u_{k-1}$

  • Changes in the control action ($u_{k-1} - u_{k-2}$)

The agent is rewarded while the system constraints are satisfied (that is, $f_k = 1$).
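The following MATLAB sketch restates the reward computation outside Simulink for clarity. It is not the implementation used by the model (the model computes the reward with Simulink blocks); the function name and argument names are illustrative assumptions.

% Sketch of the reward defined above (not the Simulink implementation).
% theta, dtheta: motor arm angle and angular velocity
% phi, dphi:     pendulum angle and angular velocity
% u, uPrev:      current and previous normalized actions
function r = exampleReward(theta,dtheta,phi,dphi,u,uPrev)
    % Constraint indicator f_k
    f = double(abs(theta) <= 5*pi/8 && abs(dtheta) <= 30);
    % Quadratic penalty on angles, speeds, action, and action change
    penalty = theta^2 + phi^2 + dtheta^2 + dphi^2 + u^2 + 0.3*(u - uPrev)^2;
    r = f - 0.1*penalty;
end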

Set the random seed for reproducibility.

rng(0)

Create the input and output specifications for the agent. Set the action upper and lower limits to constrain the actions selected by the agent.

obsInfo = rlNumericSpec([7 1]);
actInfo = rlNumericSpec([1 1],UpperLimit=1,LowerLimit=-1);

Create the environment interface. Specify a reset function, defined at the end of the example, that sets random initial conditions.

agentBlk = mdl + "/RL Agent";
simEnv = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
simEnv.ResetFcn = @localResetFcn;

To create an agent options object, use rlTD3AgentOptions. Specify the sample time, experience buffer length, and mini-batch size.

agentOpts = rlTD3AgentOptions(SampleTime=Ts, ...
    ExperienceBufferLength=1e6, ...
    MiniBatchSize=128);

Specify the actor and critic optimizer options.

agentOpts.ActorOptimizerOptions.LearnRate = 1e-4;
agentOpts.ActorOptimizerOptions.GradientThreshold = 1;
agentOpts.CriticOptimizerOptions(1).LearnRate = 1e-3;
agentOpts.CriticOptimizerOptions(1).GradientThreshold = 1;
agentOpts.CriticOptimizerOptions(2).LearnRate = 1e-3;
agentOpts.CriticOptimizerOptions(2).GradientThreshold = 1;

Set the number of neurons in each hidden layer of the actor and critic networks to 64. To create a default TD3 agent, use rlTD3Agent.

initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);
td3Agent = rlTD3Agent(obsInfo,actInfo,initOpts,agentOpts);
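To inspect the networks created by the default agent initialization, you can extract the actor and display a summary of its network. This optional check uses getActor and getModel from Reinforcement Learning Toolbox and summary from Deep Learning Toolbox; availability of summary depends on your release.

% Optional check: inspect the actor network created by default initialization.
actor = getActor(td3Agent);    % extract the actor from the agent
actorNet = getModel(actor);    % underlying dlnetwork
summary(actorNet)              % display layers and learnable parameters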

Define training options. The length of an episode is given by the simulation time Tf divided by the sample time Ts.

Tf = 5;
maxSteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions(...
    MaxEpisodes=300, ...
    MaxStepsPerEpisode=maxSteps, ...
    ScoreAveragingWindowLength=10, ...
    Verbose=false, ...
    Plots="training-progress",...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=inf);

Create a data logger to save experiences that are later used for offline training of different agents. For more information, see rlDataLogger.

logger = rlDataLogger();
logger.EpisodeFinishedFcn = @localEpisodeFinishedFcn;
logger.LoggingOptions.LoggingDirectory = "simulatedPendulumDataset";

Train the TD3 agent against the Simulink environment using train. As shown in the following Episode Manager screenshot, training can take over an hour, so you can save time while running this example by setting doTD3Training to false to load a pretrained agent. To train the agent yourself, set doTD3Training to true. During training, the experiences for each episode are saved in a file named loggedDataN.mat (where N is the episode number) in the simulatedPendulumDataset subfolder, under the current folder.

doTD3Training = false;
if doTD3Training
    trainStats = train(td3Agent,...
        simEnv,trainOpts,Logger=logger);
else
    load("rlQuanserQubeAgents.mat","td3Agent");
end

Evaluate TD3 Agent

Test the trained agent and evaluate its performance. Since the environment reset function sets the initial state randomly, fix the seed of the random number generator to ensure that the model uses the same initial conditions when running simulations.

testingSeed = 1;
rng(testingSeed)

For simulation, do not use an explorative policy.

td3Agent.UseExplorationPolicy = false;

Define simulation options.

numTestEpisodes = 1;
simOpts = rlSimulationOptions(...
    MaxSteps=maxSteps,...
    NumSimulations=numTestEpisodes);

Simulate the trained agent.

simResult = sim(td3Agent,simEnv,simOpts);

You can view the behavior of the trained agent in the Mechanics Explorer animation of the Simscape model, or you can plot a sample trajectory of the angles, control actions, and reward.

% Extract signals.
phi = get(simResult(1).SimulationInfo.logsout,"phi_wrapped");
theta = get(simResult(1).SimulationInfo.logsout,"theta_wrapped");
action = get(simResult(1).SimulationInfo.logsout,"volt");
reward = get(simResult(1).SimulationInfo.logsout,"reward");
% Plot values.
figure
tiledlayout(4,1)
nexttile
plot(phi.Values); title("Pendulum Angle")
nexttile
plot(theta.Values); title("Motor Arm Angle")
nexttile
plot(action.Values); title("Control Action")
nexttile
plot(reward.Values); title("Reward")

Figure: pendulum angle, motor arm angle, control action, and reward plotted against time for the online TD3 agent.

Train TD3 Agent Offline with a Modified Reward Function

Suppose you want to modify the reward function to encourage the agent to balance the pendulum at a motor arm angle that matches a new reference angle $\beta$. Consider the new reward function:

$$r_{\text{new}}(s_k,u_{k-1}) = f_k - 0.1\left((\theta_k-\beta)^2 + \phi_k^2 + \dot{\theta}_k^2 + \dot{\phi}_k^2 + u_{k-1}^2 + 0.3\,(u_{k-1}-u_{k-2})^2\right),$$

where $\beta$ corresponds to the desired angle that the motor arm should target. By expanding the modified term, you can express the new reward function in terms of the original reward.

$$r_{\text{new}}(s_k,u_{k-1}) = r(s_k,u_{k-1}) - 0.1\left(-2\beta\theta_k + \beta^2\right)$$
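This identity follows from expanding the squared deviation term. The intermediate step, shown here for completeness, is

$$(\theta_k-\beta)^2 = \theta_k^2 - 2\beta\theta_k + \beta^2,$$

so replacing the $\theta_k^2$ penalty with $(\theta_k-\beta)^2$ adds exactly $-0.1(-2\beta\theta_k + \beta^2)$ to the original reward.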

Instead of modifying the environment and training in simulation, you can leverage the already collected data to train the new agent offline. This process can be faster and more computationally efficient.

To train an agent from data you can:

  1. Define a ReadFcn function that reads the collected files and returns data in an appropriate form. If necessary, this function can also modify the data (to change the reward, for example).

  2. Create a fileDatastore object pointing to the collected data.

  3. Create the agent to be trained.

  4. Define training options.

  5. Train the agent using trainFromData.

If, in the previous section, you loaded the pretrained TD3 agent instead of training the agent, then at this point you do not have the training data on your local disk. In this case, download the data from the MathWorks® server and unzip it to recreate the same folder and files that would exist after training the agent. If you already have the training data in the previously defined logging directory, set downloadData to false to avoid downloading the data.

downloadData = true;
if downloadData
    zipFileName = "simulatedPendulumDataset.zip";
    fileName = ...
        matlab.internal.examples.downloadSupportFile("rl",zipFileName);
    unzip(fileName)
end
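Before creating the datastore, you can optionally confirm that the dataset is in place by listing the logged files. This check is a minimal sketch and assumes the logging directory and file prefix used earlier in the example.

% Optional check: list the logged episode files in the logging directory.
loggedFiles = dir(fullfile("simulatedPendulumDataset","loggedData*.mat"));
fprintf("Found %d logged episode files.\n",numel(loggedFiles));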

Define the fileDatastore object that points to the collected training data. Use the ReadFcn function of the datastore to add the additional term to the reward of each experience. You can find the read function used in this example at the end of the example.

dataFolder = fullfile( ...
    logger.LoggingOptions.LoggingDirectory, ...
    "loggedData*.mat");
fds = fileDatastore(dataFolder, ...
    ReadFcn=@localReadFcn, ...
    FileExtensions=".mat");
fds = shuffle(fds);
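As a quick sanity check, you can read one file from the datastore and inspect the modified reward before training. This is a minimal sketch that assumes the read function defined at the end of the example.

% Read one logged episode through the datastore (applies localReadFcn).
sampleExperiences = read(fds);
% Display the modified reward of the first experience in the episode.
disp(sampleExperiences(1).Reward)
% Reset the datastore so training starts from the beginning of the data.
reset(fds);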

To create an agent options object, use rlTD3AgentOptions. Specify the sample time, experience buffer length, and mini-batch size.

agentOpts = rlTD3AgentOptions( ...
    SampleTime=Ts,...
    ExperienceBufferLength=1e6,...
    MiniBatchSize=128);

Specify the actor and critic optimizer options.

agentOpts.ActorOptimizerOptions.LearnRate = 1e-4;
agentOpts.ActorOptimizerOptions.GradientThreshold = 1;
agentOpts.CriticOptimizerOptions(1).LearnRate = 1e-3;
agentOpts.CriticOptimizerOptions(1).GradientThreshold = 1;
agentOpts.CriticOptimizerOptions(2).LearnRate = 1e-3;
agentOpts.CriticOptimizerOptions(2).GradientThreshold = 1;

Set the number of neurons in each hidden layer of the actor and critic networks to 64. To create a default TD3 agent, use rlTD3Agent.

initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);
offlineTD3Agent = rlTD3Agent(obsInfo,actInfo,initOpts,agentOpts);

You can use the batch data regularizer options to alleviate the overestimation issue often caused by offline training.

offlineTD3Agent.AgentOptions.BatchDataRegularizerOptions ...
    = rlBehaviorCloningRegularizerOptions;

Define offline training options using the rlTrainingFromDataOptions object.

options = rlTrainingFromDataOptions;
options.MaxEpochs = 300;
options.NumStepsPerEpoch = 500;

To calculate a training progress metric, specify an observation vector for computing the current Q-value estimate. For this example, use the observation corresponding to the stable equilibrium at $\theta = \dot{\theta} = \dot{\phi} = 0$ and $\phi = \pi$.

options.QValueObservations = ...
    {[sin(0); cos(0); 0; sin(pi); cos(pi); 0; 0]};

Train the agent offline from the collected data using trainFromData. The training duration can be seen in the following Episode Manager screenshot, and it highlights how training from data can be faster. For this example, the offline training took less than half the time of the online training.

doTD3OfflineTraining = false;
if doTD3OfflineTraining
    offlineTrainStats = ...
        trainFromData(offlineTD3Agent,fds,options);
else
    load("rlQuanserQubeAgentsFromData.mat","offlineTD3Agent");
end

The Q values during offline training do not necessarily indicate the agent's actual performance. Overestimation is a common issue in offline reinforcement learning due to the state-action distribution shift between the dataset and the learned policy. Examine the hyperparameters or training options if the Q values become too large. This example uses the batch data regularizer to alleviate the overestimation issue by penalizing actions of the learned policy that differ from the dataset.

Evaluate TD3 Agent in Simulation

Fix the random number generator seed to the testing seed.

rng(testingSeed)

Define simulation options.

simOpts = rlSimulationOptions(...
    MaxSteps=maxSteps,...
    NumSimulations=numTestEpisodes);

Simulate the trained agent.

offlineSimResult = sim(offlineTD3Agent,simEnv,simOpts);

Plot a sample trajectory of the angles, control action, and reward.

% Extract signals.
offlinePhi = ...
   get(offlineSimResult(1).SimulationInfo.logsout,"phi_wrapped");
offlineTheta = ...
   get(offlineSimResult(1).SimulationInfo.logsout,"theta_wrapped");
offlineAction = ...
   get(offlineSimResult(1).SimulationInfo.logsout,"volt");
offlineReward = ...
   get(offlineSimResult(1).SimulationInfo.logsout,"reward");
% Plot values.
figure
tiledlayout(4,1)
nexttile
plot(offlinePhi.Values); title("Pendulum Angle")
nexttile
plot(offlineTheta.Values); title("Motor Arm Angle")
nexttile
plot(offlineAction.Values); title("Control Action")
nexttile
plot(offlineReward.Values); title("Reward")

Figure: pendulum angle, motor arm angle, control action, and reward plotted against time for the offline TD3 agent.

You can see that the agent learns to balance the pendulum, as the angle $\phi$, shown in the first plot, is driven to 0. To compare with the original agent, plot the sampled trajectories of the arm angle in both cases (online and offline) in the same figure, as well as the desired angle.

figure
plot(theta.Values);
hold on
plot(offlineTheta.Values);
yline(0.2,'-','Desired Angle');
legend("Online TD3","Offline TD3 with modified reward",...
    Location="southeast")
title("Motor Arm Angle")

Figure: motor arm angle trajectories for the online TD3 agent and the offline TD3 agent with the modified reward, together with the desired angle.

As expected, the TD3 agent trained offline does not drive the motor arm back to 0, because the new reward function changes the desired angle from 0 to 0.2 rad. Note that the online agent drives the motor arm to a position in which $\theta$ = 0.1 rad (instead of the front position in which $\theta$ = 0), despite $\theta$ being penalized with a negative reward. Similarly, the offline agent drives the motor arm to a position in which $\theta$ = 0.3 rad (instead of the desired position $\theta$ = 0.2 rad). More training or other modifications of the reward function could achieve a more precise behavior. For example, you can experiment with the weights of each term in the reward function.

Helper Functions

The function localEpisodeFinishedFcn selects the data to log in every training episode. The function is used by the rlDataLogger object.

function dataToLog = localEpisodeFinishedFcn(data)
dataToLog = struct("data",data.Experience);
end

The function localResetFcn resets the initial angles to random values and the initial speeds to 0. It is used by the Simulink environment.

function in = localResetFcn(in)
theta0 = -pi/4 + rand*pi/2;
phi0 = pi - pi/4 + rand*pi/2;
in = setVariable(in,"theta0",theta0);
in = setVariable(in,"phi0",phi0);
in = setVariable(in,"dtheta0",0);
in = setVariable(in,"dphi0",0);
end

The function localReadFcn reads experience data from the logged data files and modifies the reward of each experience. It is used by the fileDatastore object.

function experiences = localReadFcn(fileName)
data = load(fileName);
experiences = data.episodeData.data{:};
desired = 0.2; % desired motor arm angle (rad)
for idx = 1:numel(experiences)
    nextObs = experiences(idx).NextObservation;
    theta = atan2(nextObs{1}(1),nextObs{1}(2));
    newReward = ...
        experiences(idx).Reward - 0.1*(-2*desired*theta + desired^2);
    experiences(idx).Reward = newReward;
end
end

References

[1] Cazzolato, Benjamin Seth, and Zebb Prime. "On the Dynamics of the Furuta Pendulum." Journal of Control Science and Engineering 2011 (2011): 1–8. https://doi.org/10.1155/2011/528341.
