train reinforcement learning agents within a specified environment

since r2019a

description

trainstats = train(env,agents) trains one or more reinforcement learning agents within the environment env, using default training options, and returns training results in trainstats. although agents is an input argument, after each training episode, train updates the parameters of each agent specified in agents to maximize their expected long-term reward from the environment. this is possible because each agent is a handle object. when training terminates, agents reflects the state of each agent at the end of the final training episode.

note

to train an off-policy agent offline using existing data, use trainfromdata.

trainstats = train(agents,env) performs the same training as the previous syntax.

trainstats = train(___,trainopts) trains agents within env, using the training options object trainopts. use training options to specify training parameters such as the criteria for terminating training, when to save agents, the maximum number of episodes to train, and the maximum number of steps per episode.

trainstats = train(___,prevtrainstats) resumes training from the last values of the agent parameters and training results contained in prevtrainstats, which is returned by the previous function call to train.

trainstats = train(___,name=value) trains agents with additional name-value arguments. use this syntax to specify a logger or evaluator object to be used in training. logger and evaluator objects allow you to periodically log results to disk and to evaluate agents, respectively.

examples

train a pg agent that is already configured for a cart-pole system, within the corresponding environment. the observation from the environment is a vector containing the position and velocity of a cart, as well as the angular position and velocity of the pole. the action is a scalar with two possible values (a force of either -10 or 10 newtons applied to the cart).

load the file containing the environment and a pg agent already configured for it.

load rlTrainExample.mat

specify some training parameters using rltrainingoptions. these parameters include the maximum number of episodes to train, the maximum steps per episode, and the conditions for terminating training. for this example, use a maximum of 2000 episodes and 500 steps per episode. instruct the training to stop when the average reward over the previous five episodes reaches 500. create a default options set and use dot notation to change some of the parameter values.

trainopts = rlTrainingOptions;
trainopts.MaxEpisodes = 2000;
trainopts.MaxStepsPerEpisode = 500;
trainopts.StopTrainingCriteria = "AverageReward";
trainopts.StopTrainingValue = 500;
trainopts.ScoreAveragingWindowLength = 5;

during training, the train command can save candidate agents that give good results. further configure the training options to save an agent when the episode reward exceeds 500. save the agent to a folder called savedagents.

trainopts.SaveAgentCriteria = "EpisodeReward";
trainopts.SaveAgentValue = 500;
trainopts.SaveAgentDirectory = "savedAgents";

turn off the command-line display. turn on the reinforcement learning episode manager so you can observe the training progress visually.

trainopts.Verbose = false;
trainopts.Plots = "training-progress";

you are now ready to train the pg agent. for the predefined cart-pole environment used in this example, you can use plot to generate a visualization of the cart-pole system.

plot(env)

when you run this example, both this visualization and the reinforcement learning episode manager update with each training episode. place them side by side on your screen to observe the progress, and train the agent. (this computation can take 20 minutes or more.)

traininginfo = train(agent,env,trainopts);

the episode manager shows that the training successfully reaches the termination condition of a reward of 500 averaged over the previous five episodes. at each training episode, train updates agent with the parameters learned in the previous episode. when training terminates, you can simulate the environment with the trained agent to evaluate its performance. the environment plot updates during simulation as it did during training.

simoptions = rlSimulationOptions(MaxSteps=500);
experience = sim(env,agent,simoptions);

[figure: cart pole visualizer]

during training, train saves to disk any agents that meet the condition specified with trainopts.saveagentcriteria and trainopts.saveagentvalue. to test the performance of any of those agents, you can load the data from the data files in the folder you specified using trainopts.saveagentdirectory, and simulate the environment with that agent.
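
as a minimal sketch of that loading step, the following base matlab code lists the mat-files in the save folder and inspects the contents of the first one before anything is used. the folder name comes from the example above. the variable names stored inside the files are not assumed here, which is why the sketch prints them rather than referring to them directly.

```matlab
% folder that was passed as the save directory in the training options
% (name taken from the example above; adjust it to your own setting)
saveFolder = "savedAgents";

% list the mat-files that were written to the save folder
files = dir(fullfile(saveFolder,"*.mat"));

if isempty(files)
    disp("no saved agents found in " + saveFolder)
else
    % load into a struct so that existing workspace variables are not overwritten
    saved = load(fullfile(files(1).folder,files(1).name));
    % show which variables the file contains before using any of them
    disp(fieldnames(saved))
end
```

once you know which field holds the agent, you can pass it to sim together with the environment, as in the simulation step shown earlier.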

this example shows how to periodically evaluate an agent during training using an rlevaluator object.

load the predefined environment object representing a cart-pole system with a discrete action space. for more information on this environment, see load predefined control system environments.

env = rlPredefinedEnv("CartPole-Discrete");

the agent networks are initialized randomly. ensure reproducibility by fixing the seed of the random generator.

rng(0)

create a dqn agent with default networks.

agent = rlDQNAgent(getObservationInfo(env),getActionInfo(env));

use the standard algorithm instead of the double dqn.

agent.AgentOptions.UseDoubleDQN = false;

to specify training options, create an rltrainingoptions object. configure training to stop when the average reward reaches 480.

tngopts = rlTrainingOptions(...
    MaxEpisodes=1000, ...
    StopTrainingCriteria="AverageReward",...
    StopTrainingValue=480);

to evaluate the agent during training, create an rlevaluator object. configure the evaluator to run 5 consecutive evaluation episodes every 50 training episodes.

evl = rlEvaluator( ...
    NumEpisodes=5, ...
    EvaluationFrequency=50)
evl = 
  rlEvaluator with properties:
    EvaluationStatisticType: "MeanEpisodeReward"
                NumEpisodes: 5
         MaxStepsPerEpisode: []
       UseExplorationPolicy: 0
                RandomSeeds: 1
        EvaluationFrequency: 50

to train the agent using these evaluation options, pass evl to train.

results = train(agent, env, tngopts, Evaluator=evl);

the red stars on the plot indicate the statistic (for this example the average episode reward) collected for the evaluation episodes.

display the reward accumulated during the last episode.

results.EpisodeReward(end)
ans = 500

this value means that the agent is able to balance the cart-pole system for the whole episode.

display the size of the evaluation statistic vector returned for each episode.

size(results.EvaluationStatistic)
ans = 1×2
   476     1

display only the finite values, corresponding to the training episodes at the end of which the 5 evaluation episodes are run.

results.EvaluationStatistic(isfinite(results.EvaluationStatistic))
ans = 9×1
   17.6000
  182.0000
   92.0000
   84.2000
   46.8000
  278.0000
  138.0000
   78.0000
   36.0000

train the agents configured in the train multiple agents to perform collaborative task example, within the corresponding environment.

set the random seed for reproducibility, and run the script that loads the environment parameters.

rng(0)
rlCollaborativeTaskParams

load the file containing the agents. for this example, load the agents that have been already trained using decentralized learning.

load decentralizedAgents.mat

create an environment object that uses the simulink® model rlcollaborativetask. since the agent objects referred to by the agent blocks are already available in the matlab workspace at the time of environment creation, the observation and action specification arrays are not needed. for more information, see rlsimulinkenv.

env = rlSimulinkEnv("rlCollaborativeTask", ...
    ["rlCollaborativeTask/Agent A", "rlCollaborativeTask/Agent B"])
env = 
SimulinkEnvWithAgent with properties:
           Model : rlCollaborativeTask
      AgentBlock : [
                     rlCollaborativeTask/Agent A
                     rlCollaborativeTask/Agent B
                   ]
        ResetFcn : []
  UseFastRestart : on

specify a reset function for the environment. the reset function resetrobots ensures that the robots start from random initial positions at the beginning of each episode.

env.ResetFcn = @(in) resetRobots(in,RA,RB,RC,boundaryR);

for this example, configure the training to be centralized.

  • allocate both agents (with indices 1 and 2) in a single group. do this by specifying the agent indices in the "agentgroups" option.

  • specify the "centralized" learning strategy.

  • for this example, run the training for 5 episodes, with each episode lasting at most 600 time steps.

  • do not visualize training progress.

trainopts = rlMultiAgentTrainingOptions(...
    AgentGroups={[1,2]},...
    LearningStrategy="centralized",...
    MaxEpisodes=5,...
    MaxStepsPerEpisode=600,...
    StopTrainingCriteria="none",...
    Plots="none");

train the agents using the train function.

results = train([agentA,agentB],env,trainopts);

[figure: multi agent collaborative task, with axes x (m) and y (m)]

replaying the animation plot shows you how the agent behaves in the training.

this example shows how to resume the training of a q-learning agent using existing training data.

create grid world environment

for this example, create the basic grid world environment.

env = rlPredefinedEnv("BasicGridWorld");

to start each training episode from a randomly selected state, create a reset function that returns a state number drawn at random from the vector of allowed initial states x0.

x0 = [1:12 15:17 19:22 24];
env.ResetFcn = @() x0(randi(numel(x0)));

fix the random generator seed for reproducibility.

rng(1)

create q-learning agent

to create a q-learning agent, first create a q table using the observation and action specifications from the grid world environment, then use the table to create a q-value function critic.

qtable = rlTable(getObservationInfo(env),getActionInfo(env));
qvf = rlQValueFunction(qtable,getObservationInfo(env),getActionInfo(env));

next, create a q-learning agent using this critic and configure the epsilon-greedy exploration. set the critic learning rate to 0.2 and the discount factor to 1.

agentopts = rlQAgentOptions;
agentopts.EpsilonGreedyExploration.Epsilon = 0.2;
agentopts.CriticOptimizerOptions.LearnRate = 0.2;
agentopts.EpsilonGreedyExploration.EpsilonDecay = 1e-3;
agentopts.EpsilonGreedyExploration.EpsilonMin = 1e-3;
agentopts.DiscountFactor = 1;
qagent = rlQAgent(qvf,agentopts);

train q-learning agent for 100 episodes

to train the agent, first specify the training options. for more information, see rltrainingoptions.

trainopts = rlTrainingOptions;
trainopts.MaxStepsPerEpisode = 200;
trainopts.MaxEpisodes = 1e6;
trainopts.Plots = "none";
trainopts.Verbose = false;
trainopts.StopTrainingCriteria = "EpisodeCount";
trainopts.StopTrainingValue = 100;
trainopts.ScoreAveragingWindowLength = 30;

train the q-learning agent using the train function. training can take several minutes to complete.

trainingstats = train(qagent,env,trainopts);

display index of last episode.

trainingstats.EpisodeIndex(end)
ans = 100

train q-learning agent for 200 more episodes

set the training to stop after episode 300.

trainingstats.TrainingOptions.StopTrainingValue = 300;

resume the training using the training data that exists in trainingstats.

trainingstats = train(qagent,env,trainingstats);

display index of last episode.

trainingstats.EpisodeIndex(end)
ans = 300

plot episode reward.

figure()
plot(trainingstats.EpisodeIndex,trainingstats.EpisodeReward)
title("episode reward")
xlabel("episodeindex")
ylabel("episodereward")

[figure: episode reward, with x-axis episodeindex and y-axis episodereward]

display the final q-value table.

qagentfinalq = getLearnableParameters(getCritic(qagent));
qagentfinalq{1}
ans = 25x4 single matrix
   -5.9934    5.4707   10.0000    1.6349
    8.9968   -4.5969   -4.7967   -8.0369
   -3.9844    8.0000   -4.3924   -6.3623
   -4.4457   -3.4794    9.0000   -4.1959
   -4.4743   -2.3964    7.0000    1.7904
   -4.5117   -3.7606   11.0000   -1.3847
   -3.5016    6.8809   12.0000    4.0197
   11.0000   -3.8480    0.6307   -3.0320
   10.0000    7.0000   -1.5601   -3.4550
    3.0709    4.2059    8.0000    4.8305
      ⋮

validate q-learning results

to validate the training results, simulate the agent in the training environment.

before running the simulation, visualize the environment and configure the visualization to maintain a trace of the agent states.

plot(env)
env.ResetFcn = @() 2;
env.Model.Viewer.ShowTrace = true;
env.Model.Viewer.clearTrace;

simulate the agent in the environment using the sim function.

sim(qagent,env)

input arguments

agents: agents to train, specified as a reinforcement learning agent object, such as rlacagent, or as an array of such objects.

if env is a multi-agent environment, specify agents as an array. the order of the agents in the array must match the agent order used to create env.

note

train updates the agents as training progresses. this is possible because each agent is a handle object. to preserve the original agent parameters for later use, save the agent to a mat-file (if you copy the agent into a new variable, the new variable also points to the most recent agent version with updated parameters).
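
the handle behavior described in this note can be illustrated without the toolbox. the following minimal sketch uses containers.Map, a built-in matlab handle class, as a stand-in for an agent. the key name and the values are hypothetical and only serve to show that assigning a handle to a second variable does not create an independent copy.

```matlab
% containers.Map is a built-in handle class, used here as a stand-in for an agent
params = containers.Map();
params('weight') = 1;

alias = params;          % copies the handle, not the underlying data
params('weight') = 2;    % stands in for train updating the agent parameters

disp(alias('weight'))    % the alias sees the updated value
```

because both variables refer to the same object, the second variable is not a snapshot. writing the object to a mat-file with save before training is what preserves the original parameters.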

note

when training terminates, agents reflects the state of each agent at the end of the final training episode. the rewards obtained by the final agents are not necessarily the highest achieved during the training process, due to continuous exploration. to save agents during training, create an rltrainingoptions object specifying the saveagentcriteria and saveagentvalue properties and pass it to train as a trainopts argument.

for more information about how to create and configure agents for reinforcement learning, see reinforcement learning agents.

env: environment in which the agents act, specified as one of the following kinds of reinforcement learning environment object:

  • a predefined matlab® or simulink® environment created using rlpredefinedenv.

  • a custom matlab environment you create with functions such as rlfunctionenv or rlcreateenvtemplate. this kind of environment does not support training multiple agents at the same time.

  • a simulink environment you create using createintegratedenv. this kind of environment does not support training multiple agents at the same time.

  • a custom simulink environment you create using rlsimulinkenv. this kind of environment supports training multiple agents at the same time, and allows you to use multi-rate execution, so that each agent has its own execution rate.

  • a custom matlab environment you create using rlmultiagentfunctionenv or rlturnbasedfunctionenv. this kind of environment supports training multiple agents at the same time. in an rlmultiagentfunctionenv environment all agents execute in the same step, while in an rlturnbasedfunctionenv environment agents execute in turns.

when env is a simulink environment, the environment object acts as an interface so that train calls the (compiled) simulink model to generate experiences for the agents.

trainopts: training parameters and options, specified as either an rltrainingoptions or an rlmultiagenttrainingoptions object. use this argument to specify parameters and options such as:

  • criteria for ending training

  • criteria for saving candidate agents

  • how to display training progress

  • options for parallel computing

for details, see rltrainingoptions and rlmultiagenttrainingoptions.

prevtrainstats: training episode data, specified as one of the following:

  • rltrainingresult object, when training a single agent.

  • array of rltrainingresult objects when training multiple agents.

use this argument to resume training from the exact point at which it stopped. this starts the training from the last values of the agent parameters and training results object obtained after the previous train function call. prevtrainstats contains, as one of its properties, the rltrainingoptions object or the rlmultiagenttrainingoptions object specifying the training option set. therefore, to restart the training with updated training options, first change the training options in prevtrainstats using dot notation. if the maximum number of episodes was already reached in the previous training session, you must increase the maximum number of episodes.

for details about the rltrainingresult object properties, see the trainstats output argument.

name-value arguments

specify optional pairs of arguments as name1=value1,...,namen=valuen, where name is the argument name and value is the corresponding value. name-value arguments must appear after other arguments, but the order of the pairs does not matter.

example: train(agent,env,Evaluator=myeval)

logger: logger object, specified as either a filelogger or a monitorlogger object. use a logger object to periodically save data during training. for more information on reinforcement learning logger objects, see rldatalogger.

evaluator: evaluator object, specified as either an rlevaluator or an rlcustomevaluator object. use an evaluator object to periodically evaluate the agents during training. for more information on reinforcement learning evaluator objects, see rlevaluator and rlcustomevaluator.

output arguments

trainstats: training episode data, returned as one of the following:

  • rltrainingresult object, when training a single agent.

  • array of rltrainingresult objects when training multiple agents.

the following properties pertain to the rltrainingresult object:

episodeindex: episode numbers, returned as the column vector [1;2;…;n], where n is the number of episodes in the training run. this vector is useful if you want to plot the evolution of other quantities from episode to episode.

episodereward: reward for each episode, returned in a column vector of length n. each entry contains the reward for the corresponding episode.

episodesteps: number of steps in each episode, returned in a column vector of length n. each entry contains the number of steps in the corresponding episode.

averagereward: average reward over the averaging window specified in trainopts, returned as a column vector of length n. each entry contains the average reward computed at the end of the corresponding episode.

totalagentsteps: total number of agent steps in training, returned as a column vector of length n. each entry contains the cumulative sum of the entries in episodesteps up to that point.
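
the relationship between the per-episode vectors and the two derived vectors can be sketched in base matlab. the reward and step values below are hypothetical. the sketch assumes that the average for early episodes is taken over the episodes available so far, which is how movmean treats the start of a vector by default. whether train handles the first few episodes in exactly this way is an assumption.

```matlab
episodeReward = [10; 20; 30; 40; 50; 60];   % hypothetical per-episode rewards
episodeSteps  = [ 5;  8;  6; 10;  9;  7];   % hypothetical per-episode step counts
windowLength  = 3;                          % plays the role of the averaging window length

% trailing average: the current episode and up to windowLength-1 earlier ones
averageReward = movmean(episodeReward,[windowLength-1 0]);

% cumulative number of steps taken up to and including each episode
totalAgentSteps = cumsum(episodeSteps);

disp([averageReward totalAgentSteps])
```

with these inputs the fourth averaged entry is (20+30+40)/3 = 30, and the last cumulative entry equals the sum of all step counts, 45.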

episodeq0: critic estimate of expected discounted cumulative long-term reward using the current agent and the environment initial conditions, returned as a column vector of length n. each entry is the critic estimate (q0) for the agent at the beginning of the corresponding episode. this field is present only for agents that have critics.

simulationinfo: information collected during the simulations performed for training, returned as:

  • for training in matlab environments, a structure containing the field simulationerror. this field is a column vector with one entry per episode. when the stoponerror option of rltrainingoptions is "off", each entry contains any errors that occurred during the corresponding episode. otherwise, the field contains an empty array.

  • for training in simulink environments, a vector of simulink.simulationoutput objects containing simulation data recorded during the corresponding episode. recorded data for an episode includes any signals and states that the model is configured to log, simulation metadata, and any errors that occurred during the corresponding episode.

evaluationstatistic: evaluation statistic for each episode, returned as a column vector with as many elements as the number of episodes. when a (training) episode is followed by a number of consecutive evaluation episodes, the corresponding evaluationstatistic element is a statistic (for example, mean, maximum, minimum, median) calculated from these evaluation episodes. otherwise, when the episode is followed by another training episode, the evaluationstatistic element corresponding to the episode is nan. if no evaluator object is passed to train, each element of this vector is nan. for more information, see rlevaluator and rlcustomevaluator.
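
the layout of this vector can be reproduced with plain matlab arrays. the sketch below uses the episode count and evaluation frequency from the evaluator example, fills in a hypothetical placeholder value wherever an evaluation would take place, and then extracts the finite entries. it assumes that evaluation episodes run after every 50th training episode, which is consistent with the nine finite values shown in that example.

```matlab
numTrainingEpisodes = 476;   % episode count from the evaluator example
evaluationFrequency = 50;    % evaluate after every 50th training episode

% nan everywhere, as for episodes that are not followed by evaluation episodes
evaluationStatistic = nan(numTrainingEpisodes,1);

% hypothetical placeholder statistic at the episodes that trigger an evaluation
evalEpisodes = evaluationFrequency:evaluationFrequency:numTrainingEpisodes;
evaluationStatistic(evalEpisodes) = 100;

% keep only the entries that hold an actual statistic
finiteStats = evaluationStatistic(isfinite(evaluationStatistic));
disp(numel(finiteStats))
```

filtering with isfinite, as in the example, is what separates the evaluated episodes from the nan placeholders.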

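trainingoptions: training options set, returned as an rltrainingoptions object when training a single agent, or as an rlmultiagenttrainingoptions object when training multiple agents.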

tips

  • train updates the agents as training progresses. to preserve the original agent parameters for later use, save the agents to a mat-file.

  • by default, calling train opens the reinforcement learning episode manager, which lets you visualize the progress of the training. the episode manager plot shows the reward for each episode, a running average reward value, and the critic estimate q0 (for agents that have critics). the episode manager also displays various episode and training statistics. to turn off the reinforcement learning episode manager, set the plots option of trainopts to "none".

  • if you use a predefined environment for which there is a visualization, you can use plot(env) to visualize the environment. if you call plot(env) before training, then the visualization updates during training to allow you to visualize the progress of each episode. (for custom environments, you must implement your own plot method.)

  • training terminates when the conditions specified in trainopts are satisfied. to terminate training in progress, in the reinforcement learning episode manager, click stop training. because train updates the agent at each episode, you can resume training by calling train(agent,env,trainopts) again, without losing the trained parameters learned during the first call to train.

  • during training, you can save candidate agents that meet conditions you specify with trainopts. for instance, you can save any agent whose episode reward exceeds a certain value, even if the overall condition for terminating training is not yet satisfied. train stores saved agents in a mat-file in the folder you specify with trainopts. saved agents can be useful, for instance, to allow you to test candidate agents generated during a long-running training process. for details about saving criteria and saving location, see rltrainingoptions.

algorithms

in general, train performs the following iterative steps:

  1. initialize agent.

  2. for each episode:

    1. reset the environment.

    2. get the initial observation s0 from the environment.

    3. compute the initial action a0 = μ(s0).

    4. set the current action to the initial action (a←a0) and set the current observation to the initial observation (s←s0).

    5. while the episode is not finished or terminated:

      1. step the environment with action a to obtain the next observation s' and the reward r.

      2. learn from the experience set (s,a,r,s').

      3. compute the next action a' = μ(s').

      4. update the current action with the next action (a←a') and update the current observation with the next observation (s←s').

      5. break if the episode termination conditions defined in the environment are met.

  3. if the training termination condition defined by trainopts is met, terminate training. otherwise, begin the next episode.

the specifics of how train performs these computations depend on your configuration of the agent and environment. for instance, resetting the environment at the start of each episode can include randomizing initial state values, if you configure your environment to do so.
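
the iterative steps listed above can be sketched in base matlab with a toy problem. this is a minimal sketch, not the implementation used by train: the five-state chain environment, the tabular q-learning update, the epsilon-greedy policy, and the helper functions chooseAction and stepEnv are all assumptions made for illustration. a fixed episode count stands in for the training termination condition.

```matlab
rng(0)                                   % reproducible exploration
numStates = 5; numActions = 2;           % actions: 1 = left, 2 = right
alpha = 0.5; gamma = 0.9; epsilon = 0.2; % hypothetical learning settings
maxEpisodes = 200; maxSteps = 50;

Q = zeros(numStates,numActions);         % 1. initialize the agent (here a q table)

for ep = 1:maxEpisodes                   % 2. for each episode
    s = 1;                               % reset the environment, observe s0
    a = chooseAction(Q,s,epsilon);       % initial action a0 = mu(s0)
    for k = 1:maxSteps
        % step the environment with action a to get s' and r
        [sNext,r,isDone] = stepEnv(s,a,numStates);
        % learn from the experience (s,a,r,s') with a q-learning update
        target = r + gamma*(~isDone)*max(Q(sNext,:));
        Q(s,a) = Q(s,a) + alpha*(target - Q(s,a));
        % compute the next action, then advance action and observation
        aNext = chooseAction(Q,sNext,epsilon);
        s = sNext; a = aNext;
        % break when the environment reports termination
        if isDone, break; end
    end
end                                      % 3. training ends after maxEpisodes

% greedy action in each non-terminal state after training
[~,greedyAction] = max(Q(1:numStates-1,:),[],2);
disp(greedyAction')

function a = chooseAction(Q,s,epsilon)
    % epsilon-greedy policy with random tie-breaking
    if rand < epsilon
        a = randi(size(Q,2));
    else
        best = find(Q(s,:) == max(Q(s,:)));
        a = best(randi(numel(best)));
    end
end

function [sNext,r,isDone] = stepEnv(s,a,numStates)
    % move left (a = 1) or right (a = 2) on a chain, clamped at both ends
    sNext = min(max(s + 2*a - 3,1),numStates);
    isDone = (sNext == numStates);       % the last state is terminal
    r = double(isDone);                  % reward 1 only on reaching it
end
```

in this toy problem the reward is only available at the right end of the chain, so the learned greedy action should be "right" in every non-terminal state. with a real agent, the learning step and the policy are supplied by the agent object rather than written out by hand.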

version history

introduced in r2019a