Train Reinforcement Learning Agent Offline to Control Quanser QUBE Pendulum
This example shows how to train an agent from existing data instead of an environment. Training an agent from data (offline reinforcement learning) lets you efficiently prototype different design choices by leveraging already collected data. In this example, two reinforcement learning (RL) agents are trained to swing up and balance a Quanser QUBE™-Servo 2 inverted pendulum system. First, you train a TD3 agent online against a Simulink® environment and collect training data using a data logger. Then, you train another TD3 agent offline using the collected data and a modified reward function to obtain a different behavior. Finally, you compare the behavior of both agents.
Inverted Pendulum Model
The Quanser QUBE-Servo 2 pendulum system is a rotational inverted pendulum with two degrees of freedom. It is nonlinear, underactuated, and non-minimum phase, and it is modeled in Simulink using Simscape™ Electrical™ and Simscape Multibody™. For a detailed description of the dynamics, see [1].
The pendulum is attached to the motor arm through a free revolute joint, and the arm is actuated by a DC motor. The environment has the following properties:
The angles and angular velocities of the motor arm (θ, dθ/dt) and pendulum (φ, dφ/dt) are measurable.
The motor arm is constrained to |θ| ≤ 5π/8 rad and |dθ/dt| ≤ 30 rad/s.
The pendulum is upright when φ = 0.
The motor input voltage is constrained to |V| ≤ 12 V.
The agent action, normalized to the range [-1, 1], is scaled to the motor voltage in the environment.
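The scaling step can be pictured with a short sketch (Python, with an illustrative helper name; the actual scaling lives inside the Simulink model): a normalized action in [-1, 1] is clipped and multiplied by the 12 V limit.

```python
VOLT_LIMIT = 12.0  # volts; matches the volt_limit variable defined below

def action_to_voltage(a):
    """Clip a normalized action to [-1, 1], then scale it to the voltage range."""
    a = max(-1.0, min(1.0, a))
    return a * VOLT_LIMIT

print(action_to_voltage(0.5))  # half of the positive voltage range
```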
Open the Simulink model.
mdl = "rlQubeServoModel";
open_system(mdl)
Define the angular and voltage limits, as well as the model sample time.
theta_limit = 5*pi/8;
dtheta_limit = 30;
volt_limit = 12;
Ts = 0.005;
Train TD3 Agent Online
A TD3 agent is trained online against the Simulink environment. For this agent:
The environment is the pendulum system modeled in Simscape Multibody.
The observation is a seven-element vector that includes the sines and cosines of the measured angles together with the angular velocities. Using the sine and cosine of the measured angles can facilitate training by representing the otherwise discontinuous angular measurements with a continuous two-dimensional parametrization.
The action is the normalized input voltage command to the servo motor.
The reward signal penalizes a weighted sum of six terms:
deviations of the motor arm from its forward position (θ = 0)
deviations of the pendulum from its inverted position (φ = 0)
the angular speed of the motor arm
the angular speed of the pendulum
the control action
changes to the control action
The agent is rewarded only while the system constraints are satisfied (that is, while |θ| ≤ 5π/8 and |dθ/dt| ≤ 30).
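The exact reward weights are not reproduced here; only the arm-angle weight (0.1) can be inferred from the reward modification later in the example. The sketch below (Python) therefore uses placeholder weights to illustrate the structure: a positive reward while the constraints hold, minus the six penalty terms.

```python
import math

# Placeholder weights. Only W_THETA = 0.1 is implied by the reward
# modification used later in this example; the others are illustrative.
W_THETA, W_PHI, W_DTHETA, W_DPHI, W_U, W_DU = 0.1, 1.0, 0.01, 0.01, 0.01, 0.01
THETA_LIMIT = 5 * math.pi / 8   # motor arm angle limit (rad)
DTHETA_LIMIT = 30.0             # motor arm speed limit (rad/s)

def reward(theta, phi, dtheta, dphi, u, u_prev):
    """Reward sketch: zero once the constraints are violated, otherwise a
    constant bonus minus the six penalty terms listed above."""
    if abs(theta) > THETA_LIMIT or abs(dtheta) > DTHETA_LIMIT:
        return 0.0
    return 1.0 - (W_THETA * theta**2 + W_PHI * phi**2
                  + W_DTHETA * dtheta**2 + W_DPHI * dphi**2
                  + W_U * u**2 + W_DU * (u - u_prev)**2)
```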
Set the random seed for reproducibility.
rng(0)
Create the input and output specifications for the agent. Set the action upper and lower limits to constrain the actions selected by the agent.
obsInfo = rlNumericSpec([7 1]);
actInfo = rlNumericSpec([1 1],UpperLimit=1,LowerLimit=-1);
Create the environment interface. Specify a reset function, defined at the end of the example, that sets random initial conditions.
agentBlk = mdl + "/RL Agent";
simEnv = rlSimulinkEnv(mdl,agentBlk,obsInfo,actInfo);
simEnv.ResetFcn = @localResetFcn;
To create an agent options object, use rlTD3AgentOptions. Specify the sample time, experience buffer length, and mini-batch size.
agentOpts = rlTD3AgentOptions( ...
    SampleTime=Ts, ...
    ExperienceBufferLength=1e6, ...
    MiniBatchSize=128);
Specify the actor and critic optimizer options.
agentOpts.ActorOptimizerOptions.LearnRate = 1e-4;
agentOpts.ActorOptimizerOptions.GradientThreshold = 1;
agentOpts.CriticOptimizerOptions(1).LearnRate = 1e-3;
agentOpts.CriticOptimizerOptions(1).GradientThreshold = 1;
agentOpts.CriticOptimizerOptions(2).LearnRate = 1e-3;
agentOpts.CriticOptimizerOptions(2).GradientThreshold = 1;
Set the number of neurons in each hidden layer of the actor and critic networks to 64. To create a default TD3 agent, use rlTD3Agent.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);
td3Agent = rlTD3Agent(obsInfo,actInfo,initOpts,agentOpts);
Define the training options. The length of an episode is the simulation time Tf divided by the sample time Ts.
Tf = 5;
maxSteps = ceil(Tf/Ts);
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=300, ...
    MaxStepsPerEpisode=maxSteps, ...
    ScoreAveragingWindowLength=10, ...
    Verbose=false, ...
    Plots="training-progress", ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=inf);
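As a quick sanity check of the episode length (Python, using the values defined above):

```python
import math

Tf, Ts = 5.0, 0.005           # simulation time and sample time
max_steps = math.ceil(Tf / Ts)
print(max_steps)              # 1000 steps per episode
```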
Create a data logger to save the experiences that are later used for offline training of different agents. For more information, see rlDataLogger.
logger = rlDataLogger();
logger.EpisodeFinishedFcn = @localEpisodeFinishedFcn;
logger.LoggingOptions.LoggingDirectory = "simulatedPendulumDataset";
Train the TD3 agent in the Simulink environment using train. As shown in the following Episode Manager screenshot, training can take over an hour, so you can save time while running this example by setting doTD3Training to false to load a pretrained agent. To train the agent yourself, set doTD3Training to true. During training, the experiences of each episode are saved in a file named loggedDataN.mat (where N is the episode number) in the simulatedPendulumDataset subfolder, under the current folder.
doTD3Training = false;
if doTD3Training
    trainStats = train(td3Agent, ...
        simEnv,trainOpts,Logger=logger);
else
    load("rlQuanserQubeAgents.mat","td3Agent");
end
Evaluate TD3 Agent
Test the trained agent and evaluate its performance. Because the environment reset function sets the initial state randomly, fix the seed of the random number generator to ensure that the model uses the same initial conditions across simulations.
testingSeed = 1;
rng(testingSeed)
For simulation, do not use an explorative policy.
td3Agent.UseExplorationPolicy = false;
Define the simulation options.
numTestEpisodes = 1;
simOpts = rlSimulationOptions( ...
    MaxSteps=maxSteps, ...
    NumSimulations=numTestEpisodes);
Simulate the trained agent.
simResult = sim(td3Agent,simEnv,simOpts);
You can view the behavior of the trained agent in the Mechanics Explorer animation of the Simscape model, or you can plot a sample trajectory of the angles, control action, and reward.
% Extract signals.
phi = get(simResult(1).SimulationInfo.logsout,"phi_wrapped");
theta = get(simResult(1).SimulationInfo.logsout,"theta_wrapped");
action = get(simResult(1).SimulationInfo.logsout,"volt");
reward = get(simResult(1).SimulationInfo.logsout,"reward");

% Plot values.
figure
tiledlayout(4,1)
nexttile
plot(phi.Values);
title("Pendulum Angle")
nexttile
plot(theta.Values);
title("Motor Arm Angle")
nexttile
plot(action.Values);
title("Control Action")
nexttile
plot(reward.Values);
title("Reward")
Train TD3 Agent Offline with a Modified Reward Function
Suppose you want to modify the reward function to encourage the agent to balance the pendulum at a motor arm angle that matches a new reference angle. Consider the new reward function, obtained by replacing the arm-angle penalty 0.1*θ^2 of the original reward r with a penalty on the deviation from a desired angle θd:

rnew = r + 0.1*θ^2 - 0.1*(θ - θd)^2,

where θd corresponds to the desired angle that the motor arm should target. By expanding the modified term, you can express the new reward in terms of the original reward:

rnew = r - 0.1*(-2*θd*θ + θd^2).
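The correction applied by the read function later in the example follows directly from this expansion. A quick numeric check (Python, with our own helper name):

```python
def modify_reward(r_old, theta, theta_d=0.2):
    # Same correction term as in localReadFcn at the end of the example
    return r_old - 0.1 * (-2 * theta_d * theta + theta_d**2)

# If the original reward contains the arm-angle penalty -0.1*theta^2, the
# corrected reward contains -0.1*(theta - theta_d)^2 instead.
theta = 0.7
rest = 0.35                        # stands in for all other reward terms
r_old = rest - 0.1 * theta**2
r_new = modify_reward(r_old, theta)
expected = rest - 0.1 * (theta - 0.2)**2
```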
Instead of modifying the environment and training in simulation, you can leverage the already collected data to train the new agent offline. This process can be faster and more computationally efficient.
To train an agent from data, you:
Define a read function that reads the collected files and returns the data in an appropriate format. If necessary, this function can also modify the data (to change the reward, for example).
Create a fileDatastore object pointing to the collected data.
Create the agent to be trained.
Define the training options.
Train the agent using trainFromData.
If, in the previous section, you loaded the pretrained TD3 agent instead of training a new one, then at this point you do not have the training data on your local disk. In this case, download the data from the MathWorks® server and unzip it to recreate the same folder and files that would exist after training the agent. If you already have the training data in the previously defined logging directory, set downloadData to false to avoid downloading the data.
downloadData = true;
if downloadData
    zipFileName = "simulatedPendulumDataset.zip";
    fileName = ...
        matlab.internal.examples.downloadSupportFile("rl",zipFileName);
    unzip(fileName)
end
Define the fileDatastore object that points to the collected training data. Use the ReadFcn function of the datastore to add the additional term to the reward of each experience. You can find the read function used here at the end of the example.
dataFolder = fullfile( ...
    logger.LoggingOptions.LoggingDirectory, ...
    "loggedData*.mat");
fds = fileDatastore(dataFolder, ...
    ReadFcn=@localReadFcn, ...
    FileExtensions=".mat");
fds = shuffle(fds);
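The datastore pattern (lazily applying a read function to each logged file, in shuffled order) can be sketched generically; the Python below uses illustrative names and file layout, not the toolbox API.

```python
import glob
import random

def iterate_datastore(pattern, read_fcn, seed=0):
    """Yield read_fcn(file) for every file matching pattern, in shuffled order."""
    files = sorted(glob.glob(pattern))
    random.Random(seed).shuffle(files)
    for f in files:
        yield read_fcn(f)
```

Each call to read_fcn is the hook where per-experience processing, such as the reward modification, takes place.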
To create an agent options object, use rlTD3AgentOptions. Specify the sample time, experience buffer length, and mini-batch size.
agentOpts = rlTD3AgentOptions( ...
    SampleTime=Ts, ...
    ExperienceBufferLength=1e6, ...
    MiniBatchSize=128);
Specify the actor and critic optimizer options.
agentOpts.ActorOptimizerOptions.LearnRate = 1e-4;
agentOpts.ActorOptimizerOptions.GradientThreshold = 1;
agentOpts.CriticOptimizerOptions(1).LearnRate = 1e-3;
agentOpts.CriticOptimizerOptions(1).GradientThreshold = 1;
agentOpts.CriticOptimizerOptions(2).LearnRate = 1e-3;
agentOpts.CriticOptimizerOptions(2).GradientThreshold = 1;
Set the number of neurons in each hidden layer of the actor and critic networks to 64. To create a default TD3 agent, use rlTD3Agent.
initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);
offlineTD3Agent = rlTD3Agent(obsInfo,actInfo,initOpts,agentOpts);
You can use the batch data regularizer options to alleviate the overestimation issue often caused by offline training. For more information, see rlBehaviorCloningRegularizerOptions.
offlineTD3Agent.AgentOptions.BatchDataRegularizerOptions = ...
    rlBehaviorCloningRegularizerOptions;
Define the offline training options using an rlTrainingFromDataOptions object.
options = rlTrainingFromDataOptions;
options.MaxEpochs = 300;
options.NumStepsPerEpoch = 500;
To calculate a training progress metric, specify an observation vector for the computation of the current Q-value estimate. For this example, use the stable equilibrium at θ = 0 and φ = π.
options.QValueObservations = ...
    {[sin(0); cos(0); 0; sin(pi); cos(pi); 0; 0]};
Train the agent offline from the collected data using trainFromData. The training duration can be seen in the following Episode Manager screenshot, and it highlights how training from data can be faster. For this example, the offline training took less than half the time of the online training.
doTD3OfflineTraining = false;
if doTD3OfflineTraining
    offlineTrainStats = ...
        trainFromData(offlineTD3Agent,fds,options);
else
    load("rlQuanserQubeAgentsFromData.mat","offlineTD3Agent");
end
The Q values observed during offline training do not necessarily indicate the agent's actual performance. Overestimation is a common issue in offline reinforcement learning due to the state-action distribution shift between the dataset and the learned policy. If the Q values become too large, reexamine the hyperparameters or training options. This example uses the batch data regularizer to alleviate the overestimation issue by penalizing actions of the learned policy that differ from those in the dataset.
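Conceptually, a behavior cloning regularizer modifies the actor objective along the lines of TD3+BC: instead of purely maximizing the critic's value, the actor also pays a penalty for deviating from the action stored in the dataset. A scalar sketch (Python; the names and the weight are illustrative, not the toolbox API):

```python
def bc_regularized_actor_loss(q_value, policy_action, dataset_action, lam=1.0):
    """Minimize: -Q(s, pi(s)) + lam * (pi(s) - a_dataset)^2."""
    return -q_value + lam * (policy_action - dataset_action)**2
```

With lam = 0 this reduces to the ordinary TD3 actor objective; a larger lam keeps the learned policy closer to the data distribution and thereby curbs overestimation.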
Evaluate TD3 Agent in Simulation
Fix the random number generator to the testing seed.
rng(testingSeed)
Define the simulation options.
simOpts = rlSimulationOptions( ...
    MaxSteps=maxSteps, ...
    NumSimulations=numTestEpisodes);
Simulate the trained agent.
offlineSimResult = sim(offlineTD3Agent,simEnv,simOpts);
Plot a sample trajectory of the angles, control action, and reward.
% Extract signals.
offlinePhi = ...
    get(offlineSimResult(1).SimulationInfo.logsout,"phi_wrapped");
offlineTheta = ...
    get(offlineSimResult(1).SimulationInfo.logsout,"theta_wrapped");
offlineAction = ...
    get(offlineSimResult(1).SimulationInfo.logsout,"volt");
offlineReward = ...
    get(offlineSimResult(1).SimulationInfo.logsout,"reward");

% Plot values.
figure
tiledlayout(4,1)
nexttile
plot(offlinePhi.Values);
title("Pendulum Angle")
nexttile
plot(offlineTheta.Values);
title("Motor Arm Angle")
nexttile
plot(offlineAction.Values);
title("Control Action")
nexttile
plot(offlineReward.Values);
title("Reward")
You can see that the agent learns to balance the pendulum, as the pendulum angle φ, shown in the first plot, is driven to 0. For comparison with the original agent, plot the sampled trajectories of the motor arm angle for both cases (offline and online) in the same figure, together with the desired angle.
figure
plot(theta.Values);
hold on
plot(offlineTheta.Values);
yline(0.2,"-","Desired Angle");
legend("Online TD3","Offline TD3 with modified reward", ...
    Location="southeast")
title("Motor Arm Angle")
As expected, the TD3 agent trained offline does not drive the motor arm back to 0, because the new reward function changes the desired angle from 0 to 0.2 rad. Note that the online agent drives the motor arm to a position in which θ = 0.1 rad (instead of the front position, in which θ = 0), despite being penalized with a negative reward. Similarly, the offline agent drives the motor arm to a position in which θ = 0.3 rad (instead of the desired position, θ = 0.2 rad). More training or further modifications of the reward function could achieve a more precise behavior. For example, you can experiment with the weights of each term in the reward function.
Helper Functions
The function localEpisodeFinishedFcn selects the data to log in every training episode. The function is used by the rlDataLogger object.
function dataToLog = localEpisodeFinishedFcn(data)
    dataToLog = struct("data",data.Experience);
end
The function localResetFcn resets the initial angles to random values and the initial speeds to 0. It is used by the Simulink environment.
function in = localResetFcn(in)
    theta0 = -pi/4 + rand*pi/2;
    phi0 = pi - pi/4 + rand*pi/2;
    in = setVariable(in,"theta0",theta0);
    in = setVariable(in,"phi0",phi0);
    in = setVariable(in,"dtheta0",0);
    in = setVariable(in,"dphi0",0);
end
The function localReadFcn reads the experience data from the logged data files and modifies the reward of each experience. It is used by the fileDatastore object.
function experiences = localReadFcn(fileName)
    data = load(fileName);
    experiences = data.episodeData.data{:};
    desired = 0.2; % desired motor arm angle
    for idx = 1:numel(experiences)
        nextObs = experiences(idx).NextObservation;
        theta = atan2(nextObs{1}(1),nextObs{1}(2));
        newReward = ...
            experiences(idx).Reward - 0.1*(-2*desired*theta + desired^2);
        experiences(idx).Reward = newReward;
    end
end
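The read function recovers the arm angle from the first two observation elements (the sine and cosine of θ) with a four-quadrant arctangent. A quick Python check of that round trip:

```python
import math

def recover_angle(sin_t, cos_t):
    # Mirrors atan2(nextObs{1}(1), nextObs{1}(2)) in localReadFcn
    return math.atan2(sin_t, cos_t)

theta = 2.5  # any angle in (-pi, pi]
recovered = recover_angle(math.sin(theta), math.cos(theta))
```

This works for any angle in (-pi, pi], which is also why the sine/cosine observation encoding loses no information about the wrapped angle.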
References
[1] Cazzolato, Benjamin Seth, and Zebb Prime. "On the Dynamics of the Furuta Pendulum." Journal of Control Science and Engineering 2011 (2011): 1-8. https://doi.org/10.1155/2011/528341.
See Also
rlTrainingFromDataOptions | FileLogger