Train MBPO Agent to Balance Cart-Pole System
This example shows how to train a model-based policy optimization (MBPO) agent to balance a cart-pole system modeled in MATLAB®. For more information on MBPO agents, see Model-Based Policy Optimization (MBPO) Agents.
MBPO agents use an environment model to generate more experiences while training a base agent. In this example, the base agent is a soft actor-critic (SAC) agent.
The built-in MBPO agent is based on the model-based policy optimization algorithm in [1]. The original MBPO algorithm trains an ensemble of stochastic models. In contrast, this example trains an ensemble of deterministic models. For an example in which an MBPO agent is implemented using a custom training loop, see Model-Based Reinforcement Learning Using Custom Training Loop.
The following figure summarizes the algorithm used in this example. During training, the MBPO agent collects real experiences resulting from interactions with the environment. The MBPO agent uses these experiences to train its internal environment model. Then, it uses this model to generate experiences without interacting with the actual environment. Finally, the MBPO agent uses both the real experiences and the generated experiences to train the SAC base agent.
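In outline form, and purely as a conceptual sketch of the loop just described (not the internal implementation of the built-in agent), one MBPO training iteration can be summarized as follows.
% Conceptual outline of one MBPO training iteration (illustrative only):
% 1. Interact with the real environment and store the resulting
%    experiences in the real experience buffer.
% 2. Train the environment model (transition, reward, and is-done
%    functions) on samples from the real experience buffer.
% 3. Starting from states sampled from the real buffer, roll out short
%    trajectories with the learned model and store the generated
%    experiences in the model experience buffer.
% 4. Update the SAC base agent with mini-batches that mix real and
%    model-generated samples.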
Cart-Pole MATLAB Environment
For this example, the reinforcement learning environment is a pole attached to an unactuated revolute joint on a cart. The cart has an actuated prismatic joint connected to a one-dimensional frictionless track. The training goal in this environment is to balance the pole by applying forces (actions) to the prismatic joint.
For this environment:
The upward balanced pendulum position is 0 radians and the downward hanging position is π radians. The pendulum starts upright with an initial angle between –0.05 and 0.05 radians.
The force action signal from the agent to the environment ranges from –10 N to 10 N.
The observations from the environment are the position and velocity of the cart, the pendulum angle, and the pendulum angle derivative.
The episode terminates if the pole is more than 12 degrees from vertical or if the cart moves more than 2.4 m from the original position.
A reward of +0.5 is provided for every time step that the pole remains upright. An additional reward is provided based on the distance between the cart and the origin. A penalty of –50 is applied when the pendulum falls.
For more information on this model, see Load Predefined Control System Environments.
Create a predefined environment interface for the cart-pole system.
env = rlPredefinedEnv("CartPole-Continuous");
The interface has a continuous action space where the agent can apply one force value ranging from –10 N to 10 N.
Obtain the observation and action specifications from the environment interface.
obsInfo = getObservationInfo(env);
numObservations = obsInfo.Dimension(1);
actInfo = getActionInfo(env);
Fix the random generator seed for reproducibility.
rng(0)
Create MBPO Agent
An MBPO agent decides which action to take given observations using a base off-policy agent. The MBPO agent trains both the base agent and an environment model. The environment model consists of transition functions, a reward function, and an is-done function. This model is used to create more samples without interacting with the environment. This example uses the following steps to construct an MBPO agent.
Define model-free off-policy agent.
Define transition models.
Define reward model.
Define is-done model.
Create neural network environment.
Create MBPO agent.
1. Define Model-Free Off-Policy Agent
Create a SAC base agent with a default network structure. For more information on SAC agents, see Soft Actor-Critic (SAC) Agents. For an environment with a continuous action space, you can also use a DDPG or TD3 base agent. For environments with a discrete action space, you can use a DQN base agent.
agentOpts = rlSACAgentOptions;
agentOpts.MiniBatchSize = 256;
initOpts = rlAgentInitializationOptions(NumHiddenUnit=64);
baseAgent = rlSACAgent(obsInfo,actInfo,initOpts,agentOpts);
baseAgent.AgentOptions.ActorOptimizerOptions.LearnRate = 1e-4;
baseAgent.AgentOptions.CriticOptimizerOptions(1).LearnRate = 1e-4;
baseAgent.AgentOptions.CriticOptimizerOptions(2).LearnRate = 1e-4;
baseAgent.AgentOptions.NumGradientStepsPerUpdate = 5;
2. Define Transition Models
To model the environment, an MBPO agent trains one or more transition models. To model an environment effectively, you must consider two kinds of uncertainty: statistical uncertainty and modeling uncertainty. A stochastic transition function can model statistical uncertainty better than a deterministic transition function can. In this example, because the cart-pole environment is deterministic, you use deterministic transition functions.
It is challenging to have a perfect model, and a trained model usually has modeling uncertainty. One common approach to overcoming modeling uncertainty is to use multiple transition models. The original MBPO paper uses seven models [1]. For this example, to reduce computational cost, you use three models. The MBPO agent generates experiences using all three transition models. The following figure shows how an ensemble of transition models generates samples without interacting with the environment. In this figure, the models generate two trajectories with a horizon of 2.
Create three deterministic transition functions. To do so, create a deep neural network using the createDeterministicTransitionNetwork helper function. Then, use the neural network to create an rlContinuousDeterministicTransitionFunction object. When creating a transition function object, you must specify the action and observation input/output names for the neural network.
net1 = createDeterministicTransitionNetwork(4,1);
transitionFcn = rlContinuousDeterministicTransitionFunction(net1, ...
    obsInfo, ...
    actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");
net2 = createDeterministicTransitionNetwork(4,1);
transitionFcn2 = rlContinuousDeterministicTransitionFunction(net2, ...
    obsInfo, ...
    actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");
net3 = createDeterministicTransitionNetwork(4,1);
transitionFcn3 = rlContinuousDeterministicTransitionFunction(net3, ...
    obsInfo, ...
    actInfo, ...
    ObservationInputNames="state", ...
    ActionInputNames="action", ...
    NextObservationOutputNames="nextObservation");
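The createDeterministicTransitionNetwork helper function is provided with the example files. As a rough sketch of what such a network can look like (the layer sizes are an assumption and the function name below is illustrative), a deterministic transition network maps the current observation and action to a predicted next observation, using input and output layer names that match the names specified above.
function net = exampleTransitionNetwork(numObservations,numActions)
% Illustrative deterministic transition network (not the shipped helper).
statePath = featureInputLayer(numObservations,Name="state");
actionPath = featureInputLayer(numActions,Name="action");
commonPath = [
    concatenationLayer(1,2,Name="concat")
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(64)
    reluLayer
    fullyConnectedLayer(numObservations,Name="nextObservation")
    ];
net = layerGraph(statePath);
net = addLayers(net,actionPath);
net = addLayers(net,commonPath);
net = connectLayers(net,"state","concat/in1");
net = connectLayers(net,"action","concat/in2");
net = dlnetwork(net);
end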
3. Define Reward Model
An MBPO agent also contains a reward model for the environment. If you know a ground-truth reward function, you can specify it using a custom function. In this example, the ground-truth reward function is defined in the cartPoleRewardFunction helper function. To use this reward function, set useGroundTruthReward to true.
You can also specify a neural-network-based reward function that the MBPO agent can train. In this example, you can use such a reward function by setting useGroundTruthReward to false. The deep neural network for the reward function is defined in the createRewardNetworkActionNextObs helper function. To define a reward function using the neural network, create an rlContinuousDeterministicRewardFunction object.
useGroundTruthReward = true;
if useGroundTruthReward
    rewardFcn = @cartPoleRewardFunction;
else
    % This neural network uses the action and next observation as inputs.
    rewardNet = createRewardNetworkActionNextObs(4,1);
    rewardFcn = rlContinuousDeterministicRewardFunction(rewardNet, ...
        obsInfo, ...
        actInfo, ...
        ActionInputNames="action", ...
        NextObservationInputNames="nextState");
end
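The cartPoleRewardFunction helper is also provided with the example files. Based on the reward structure described earlier (+0.5 for each step the pole stays up, an additional distance-based reward, and a –50 penalty on failure), a comparable custom reward function might look like the following sketch; the exact weighting of the distance term in the shipped helper may differ, and the function name here is illustrative.
function reward = exampleCartPoleReward(obs,action,nextObs)
% Illustrative reward sketch based on the next observation.
if iscell(nextObs)
    nextObs = nextObs{1};
end
x = nextObs(1,:);        % cart position
theta = nextObs(3,:);    % pole angle
isDone = abs(x) > 2.4 | abs(theta) > deg2rad(12);
reward = 0.5 + 0.5*(1 - abs(x)/2.4);   % keep-alive reward plus distance bonus
reward(isDone) = -50;                  % penalty when the pendulum falls
end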
4. Define Is-Done Model
An MBPO agent also contains an is-done model for computing the termination signal for the environment. If you know a ground-truth termination signal, you can specify it using a custom function. In this example, the ground-truth termination signal is defined in the cartPoleIsDoneFunction helper function. To use this is-done function, set useGroundTruthIsDone to true.
You can also specify a neural-network-based is-done function that the MBPO agent can train. In this example, you can use such an is-done function by setting useGroundTruthIsDone to false. The deep neural network for the is-done function is defined in the createIsDoneNetwork helper function. To define an is-done function using the neural network, create an rlIsDoneFunction object.
useGroundTruthIsDone = true;
if useGroundTruthIsDone
    isDoneFcn = @cartPoleIsDoneFunction;
else
    % This neural network uses only the next observation as input.
    isDoneNet = createIsDoneNetwork(4);
    isDoneFcn = rlIsDoneFunction(isDoneNet, ...
        obsInfo, ...
        actInfo, ...
        NextObservationInputNames="nextState");
end
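Similarly, cartPoleIsDoneFunction ships with the example files. Based on the termination conditions listed earlier (pole more than 12 degrees from vertical, or cart more than 2.4 m from the origin), an illustrative version might look like this sketch; the function name is hypothetical.
function isDone = exampleCartPoleIsDone(obs,action,nextObs)
% Illustrative termination check based on the next observation.
if iscell(nextObs)
    nextObs = nextObs{1};
end
x = nextObs(1,:);        % cart position
theta = nextObs(3,:);    % pole angle
isDone = abs(x) > 2.4 | abs(theta) > deg2rad(12);
end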
5. Create Neural Network Environment
Define a neural network environment using the transition, reward, and is-done functions. To do so, create an rlNeuralNetworkEnvironment object.
generativeEnv = rlNeuralNetworkEnvironment(obsInfo,actInfo, ...
    [transitionFcn,transitionFcn2,transitionFcn3],rewardFcn,isDoneFcn);
% Reset the model environment.
reset(generativeEnv);
6. Create MBPO Agent
Define an MBPO agent using the base off-policy agent and the environment model. To do so, first create an MBPO agent options object.
mbpoAgentOpts = rlMBPOAgentOptions;
Specify options for training the environment model. Train the model for one epoch at the beginning of each episode and use 15 mini-batches of size 256.
mbpoAgentOpts.NumEpochForTrainingModel = 1;
mbpoAgentOpts.NumMiniBatches = 15;
mbpoAgentOpts.MiniBatchSize = 256;
Specify the size of the model experience buffer.
mbpoAgentOpts.ModelExperienceBufferLength = 60000;
Specify the ratio of real and generated experiences used to train the base SAC agent. For this example, 20% of the samples are drawn from the real experience buffer and 80% from the model experience buffer.
mbpoAgentOpts.RealSampleRatio = 0.2;
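For example, with the base agent mini-batch size of 256 specified earlier, each update draws roughly 0.2 × 256 ≈ 51 samples from the real experience buffer and the remaining approximately 205 samples from the model experience buffer.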
Specify options for generating samples using the environment model.
Generate 20000 trajectories at the beginning of each epoch.
Use a piecewise roll-out horizon schedule, which increases the horizon length gradually.
Increase the horizon length every 100 epochs.
Use an initial horizon length of 1.
Use a maximum horizon length of 3.
mbpoAgentOpts.ModelRolloutOptions.NumRollout = 20000;
mbpoAgentOpts.ModelRolloutOptions.HorizonUpdateSchedule = "piecewise";
mbpoAgentOpts.ModelRolloutOptions.HorizonUpdateFrequency = 100;
mbpoAgentOpts.ModelRolloutOptions.Horizon = 1;
mbpoAgentOpts.ModelRolloutOptions.HorizonMax = 3;
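With these settings, the roll-out horizon is 1 for the first 100 epochs, increases to 2 for the next 100 epochs, and then remains at the maximum value of 3 for the rest of training.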
Specify optimizer options for training the transition models. Use the same optimizer options for all three transition models.
transitionOptimizerOptions1 = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1.0);
transitionOptimizerOptions2 = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1.0);
transitionOptimizerOptions3 = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1.0);
mbpoAgentOpts.TransitionOptimizerOptions = ...
    [transitionOptimizerOptions1, ...
    transitionOptimizerOptions2, ...
    transitionOptimizerOptions3];
Specify optimizer options for training the reward model. If you use a custom ground-truth reward function, the agent ignores these options.
rewardOptimizerOptions = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1.0);
mbpoAgentOpts.RewardOptimizerOptions = rewardOptimizerOptions;
Specify optimizer options for training the is-done model. If you use a custom ground-truth is-done function, the agent ignores these options.
isDoneOptimizerOptions = rlOptimizerOptions( ...
    LearnRate=1e-4, ...
    GradientThreshold=1.0);
mbpoAgentOpts.IsDoneOptimizerOptions = isDoneOptimizerOptions;
Create the MBPO agent, specifying the base agent, environment model, and options.
agent = rlMBPOAgent(baseAgent,generativeEnv,mbpoAgentOpts);
Train Agent
To train the agent, first specify the training options. For this example, use the following options.
Run each training session for at most 500 episodes, with each episode lasting at most 500 time steps.
Display the training progress in the Episode Manager dialog box (set the Plots option) and disable the command-line display (set the Verbose option to false).
Save the agent when the episode reward is greater than or equal to 470.
Stop training when the agent receives an average cumulative reward greater than 470 over 5 consecutive episodes. At this point, the agent can balance the pendulum in the upright position.
For more information, see rlTrainingOptions.
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=500, ...
    MaxStepsPerEpisode=500, ...
    Verbose=false, ...
    Plots="training-progress", ...
    StopTrainingCriteria="AverageReward", ...
    StopTrainingValue=470, ...
    ScoreAveragingWindowLength=5, ...
    SaveAgentCriteria="EpisodeReward", ...
    SaveAgentValue=470);
You can visualize the cart-pole system by using the plot function during training or simulation.
plot(env)
Train the agent using the train function. Training this agent is a computationally intensive process that takes several minutes to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = train(agent,env,trainOpts);
else
    % Load the pretrained agent for the example.
    load("MATLABCartpoleMBPO.mat","agent");
end
Simulate MBPO Agent
To validate the performance of the trained agent, simulate it within the cart-pole environment. For more information on agent simulation, see rlSimulationOptions and sim. Exploration during validation is not necessary in this example. Therefore, to use deterministic actions during the simulation, set the UseExplorationPolicy property of the agent to false.
rng(1)
% Disable exploration during simulation.
agent.UseExplorationPolicy = false;
simOptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simOptions);
totalReward_mbpo = sum(experience.Reward)
totalReward_mbpo = 460.7233
Instead of simulating the MBPO agent, you can simulate the base agent. If you use the same random seed, you get the same result as simulating the MBPO agent.
rng(1)
experience = sim(env,agent.BaseAgent,simOptions);
totalReward_sac = sum(experience.Reward)
totalReward_sac = 460.7233
Evaluate Learned Environment Model
To validate the trained environment transition models, you can check whether they are able to correctly predict the next observations. Similarly, you can validate the performance of the reward and is-done functions. To make a prediction based on the environment model, use the step function.
Collect data for learned model evaluation.
rng(1)
% Enable exploration during simulation to create
% diverse data for model evaluation.
agent.UseExplorationPolicy = true;
simOptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simOptions);
For this example, evaluate the performance of the first transition model.
agent.EnvModel.TransitionModelNum = 1;
For each simulation step, extract the actual next observation, reward, and is-done value, and compute the corresponding predictions using the environment model.
numSteps = length(experience.Reward.Data);
nextObsPrediction = zeros(4,1,numSteps);
rewardPrediction = zeros(1,numSteps);
isdonePrediction = zeros(1,numSteps);
nextObsGroundTruth = zeros(4,1,numSteps);
rewardGroundTruth = zeros(1,numSteps);
isdoneGroundTruth = zeros(1,numSteps);
for stepCt = 1:numSteps
    % Extract the actual next observation, reward, and is-done value.
    nextObsGroundTruth(:,:,stepCt) = ...
        experience.Observation.CartPoleStates.Data(:,:,stepCt + 1);
    rewardGroundTruth(:,stepCt) = experience.Reward.Data(stepCt);
    isdoneGroundTruth(:,stepCt) = experience.IsDone.Data(stepCt);
    % Predict the next observation, reward, and is-done value
    % using the environment model.
    obs = experience.Observation.CartPoleStates.Data(:,:,stepCt);
    agent.EnvModel.Observation = {obs};
    action = experience.Action.CartPoleAction.Data(:,:,stepCt);
    [nextObs,reward,isdone] = step(agent.EnvModel,{action});
    nextObsPrediction(:,:,stepCt) = nextObs{1};
    rewardPrediction(:,stepCt) = reward;
    isdonePrediction(:,stepCt) = isdone;
end
Plot the ground truth and the prediction for each dimension of the observations.
figure
for obsDimensionIndex = 1:4
    subplot(2,2,obsDimensionIndex)
    plot(reshape(nextObsGroundTruth(obsDimensionIndex,:,:),1,numSteps))
    hold on
    plot(reshape(nextObsPrediction(obsDimensionIndex,:,:),1,numSteps))
    hold off
    xlabel("Step")
    ylabel("Observation")
    if obsDimensionIndex == 1
        legend("GroundTruth","Prediction","Location","southwest")
    end
end
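You can also quantify the prediction accuracy numerically. For example, the following code computes the root-mean-square error between the predicted and actual next observations for each observation dimension, using the arrays generated above.
% Root-mean-square prediction error per observation dimension.
predictionError = nextObsPrediction - nextObsGroundTruth;
rmsePerDimension = sqrt(mean(predictionError.^2,3))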
References
[1] Janner, Michael, Justin Fu, Marvin Zhang, and Sergey Levine. "When to Trust Your Model: Model-Based Policy Optimization." In Proceedings of the 33rd International Conference on Neural Information Processing Systems, Article 1122, 12519–30. Red Hook, NY, USA: Curran Associates Inc., 2019.