
rlEvaluator

Options for evaluating reinforcement learning agents during training

Since R2023b

Description

Use an rlEvaluator object to specify options to evaluate agents periodically during training. Evaluation options include the type of evaluation statistic, the frequency at which evaluation episodes occur, and whether exploration is allowed during an evaluation episode. To train the agents using the specified evaluation options, pass this object to train.

For more information on training agents, see Train Reinforcement Learning Agents.

Creation

Description

evalOpts = rlEvaluator returns the evaluator object evalOpts, which contains default options for evaluating an agent during training.


evalOpts = rlEvaluator(Name=Value) creates the evaluator object evalOpts and sets its properties using one or more name-value arguments.

Properties

EvaluationStatisticType — Type of evaluation statistic

Type of evaluation statistic for a group of NumEpisodes consecutive evaluation episodes, specified as one of these strings:

  • "meanepisodereward" — mean value of the evaluation episodes rewards. this is the default behavior.

  • "medianepisodereward" — median value of the evaluation episodes rewards.

  • "maxepisodereward" — maximum value of the evaluation episodes rewards.

  • "minepisodereward" — minimum value of the evaluation episodes rewards.

This value is returned by train as the element of the EvaluationStatistic vector corresponding to the training episode that precedes the group of consecutive evaluation episodes. For more information, see NumEpisodes.

Example: EvaluationStatisticType="MinEpisodeReward"
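For example, to score each group of evaluation episodes by its worst episode, which gives a conservative measure of stability, set this property when you create the evaluator:

% Report the minimum reward over each group of evaluation episodes
evalOpts = rlEvaluator(EvaluationStatisticType="MinEpisodeReward");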

NumEpisodes — Number of consecutive evaluation episodes

Number of consecutive evaluation episodes, specified as a positive integer. After every EvaluationFrequency training episodes, train runs NumEpisodes evaluation episodes.

For example, if EvaluationFrequency is 100 and NumEpisodes is 3, then three evaluation episodes are run, consecutively, after 100 training episodes. These three evaluation episodes are used to calculate a single statistic, specified by EvaluationStatisticType, which is returned as the 100th element of the vector in the EvaluationStatistic property of the rlTrainingResult object returned by train. After 200 training episodes, three new evaluation episodes are run, with their statistic returned in the 200th element of EvaluationStatistic, and so on.

Example: NumEpisodes=5
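As a sketch of how these statistics come back from train, assuming agent, env, and a training option object trainOpts already exist in your workspace:

% Run three consecutive evaluation episodes every 100 training episodes
evalOpts = rlEvaluator(NumEpisodes=3, EvaluationFrequency=100);

% Train with evaluation (agent, env, and trainOpts are assumed to exist)
results = train(agent, env, trainOpts, Evaluator=evalOpts);

% Elements 100, 200, 300, ... of this vector hold the evaluation statistics
results.EvaluationStatistic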

MaxStepsPerEpisode — Maximum number of steps to run for an evaluation episode

Maximum number of steps to run for an evaluation episode, specified as a positive integer. This value is the maximum number of steps to run for an evaluation episode if other termination conditions are not met first. To accurately assess the agent's stability and performance, it is often useful to specify a larger number of steps for an evaluation episode than for a training episode.

If empty (default), the MaxStepsPerEpisode property specified for training (see rlTrainingOptions) is used.

Example: MaxStepsPerEpisode=1000
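For instance, you can train with short episodes while letting evaluation episodes run longer. A minimal sketch, assuming these specific step limits:

% Training episodes stop after at most 500 steps ...
trainOpts = rlTrainingOptions(MaxStepsPerEpisode=500);

% ... while evaluation episodes can run for up to 2000 steps
evalOpts = rlEvaluator(MaxStepsPerEpisode=2000);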

UseExplorationPolicy — Option to use exploration policy during evaluation episodes

Option to use an exploration policy during evaluation episodes, specified as one of the following logical values.

  • 0 (false) — The agent uses its base greedy policy when selecting actions during an evaluation episode. This is the default behavior.

  • 1 (true) — The agent uses its base exploration policy when selecting actions during an evaluation episode.
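For example, to assess the agent under the same stochastic behavior it uses to collect training data, enable the exploration policy for evaluation:

% Select actions with the exploration policy during evaluation episodes
evalOpts = rlEvaluator(UseExplorationPolicy=true);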

RandomSeeds — Random seeds used for evaluation episodes

Random seeds used for evaluation episodes, specified as one of the following.

  • [] — The random seed is not initialized before an evaluation episode.

  • Nonnegative integer — The random seed is reinitialized to the specified value before the first of the NumEpisodes consecutive evaluation episodes occurring after EvaluationFrequency training episodes. This is the default behavior, with the seed initialized to 1.

  • Vector of nonnegative integers with NumEpisodes elements — Before each episode of an evaluation sequence, the random seed is reinitialized to the corresponding element of the specified vector. This guarantees that the ith episode of each evaluation sequence always runs with the same random seed, which helps when comparing evaluation episodes occurring at different stages of training.

The current random seed used for training is stored before the first episode of an evaluation sequence and restored as the current seed after the evaluation sequence. This ensures that the training results with evaluation are the same as the results without evaluation.

Example: RandomSeeds=0
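For example, to make the ith episode of every evaluation sequence run with the same seed, pass one seed per episode; the vector length must match NumEpisodes:

% One fixed seed per evaluation episode, for like-for-like comparisons
evalOpts = rlEvaluator(NumEpisodes=4, RandomSeeds=[1 2 3 4]);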

EvaluationFrequency — Evaluation period

Evaluation period, specified as a positive integer. This value is the number of training episodes after which NumEpisodes evaluation episodes are run. For example, if EvaluationFrequency is 100 and NumEpisodes is 3, three evaluation episodes are run, consecutively, after 100 training episodes. The default is 100.

Example: EvaluationFrequency=200


Examples

Create an rlEvaluator object to evaluate an agent during training.

Configure the evaluator to run five consecutive evaluation episodes every 100 training episodes, using fixed random seeds for each evaluation episode.

evl = rlEvaluator( ...
    NumEpisodes=5, ...
    EvaluationFrequency=100, ...
    RandomSeeds=[11,15,20,30,99])
evl = 
  rlEvaluator with properties:
    EvaluationStatisticType: "MeanEpisodeReward"
                NumEpisodes: 5
         MaxStepsPerEpisode: []
       UseExplorationPolicy: 0
                RandomSeeds: [11 15 20 30 99]
        EvaluationFrequency: 100

You can use dot notation to change some of the values. Set the maximum number of steps for evaluation episodes to 1000.

evl.MaxStepsPerEpisode = 1000;

To evaluate an agent during training using these evaluation options, pass evl to train, as in the following code example.

results = train(agent, env, trainingOptions, Evaluator=evl);

For more information, see train.
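As a fuller sketch, the following puts the pieces together on a predefined environment; the environment, agent type, and training option values here are illustrative assumptions, not requirements of rlEvaluator:

% Create a predefined environment and a default DQN agent for it
env = rlPredefinedEnv("CartPole-Discrete");
agent = rlDQNAgent(getObservationInfo(env), getActionInfo(env));

% Training options (illustrative values)
trainOpts = rlTrainingOptions( ...
    MaxEpisodes=300, ...
    MaxStepsPerEpisode=500);

% Run five fixed-seed evaluation episodes every 100 training episodes
evl = rlEvaluator( ...
    NumEpisodes=5, ...
    EvaluationFrequency=100, ...
    RandomSeeds=[11,15,20,30,99]);

results = train(agent, env, trainOpts, Evaluator=evl);

% Inspect the vector of evaluation statistics
results.EvaluationStatistic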

Version History

Introduced in R2023b
