
rlCustomEvaluator

Custom object for evaluating reinforcement learning agents during training

Since R2023b

Description

Create an rlCustomEvaluator object to specify a custom function and an evaluation frequency that you want to use to evaluate agents during training. To train the agents, pass this object to train.

For more information on training agents, see Train Reinforcement Learning Agents.

Creation

Description


evaluator = rlCustomEvaluator(evalFcn) returns the custom evaluator object evaluator. The evalFcn argument is a handle to your custom MATLAB® evaluation function.

evaluator = rlCustomEvaluator(evalFcn,EvaluationFrequency=evalPeriod) also specifies the number of training episodes after which train calls the evaluation function.
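For example, the following sketch creates an evaluator that calls an evaluation function after every 50 training episodes. Here, myEvalFcn is an assumed user-defined function with the signature described in Properties.

% Assumed user-defined evaluation function on the MATLAB path.
evaluator = rlCustomEvaluator(@myEvalFcn, EvaluationFrequency=50);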

Properties

Custom evaluation function, specified as a function handle. The train function calls evalFcn after every evalPeriod training episodes.

Your evaluation function must have three inputs and three outputs, as illustrated by the following signature. A minimal example function is sketched after the argument descriptions.

[statistic, scores, data] = myEvalFcn(agent, environment, trainingInfo)

Given an agent, its environment, and training episode information, the custom evaluation function runs a number of evaluation episodes and returns a corresponding summary statistic, a vector of episode scores, and any additional data that might be needed for logging.

The required input arguments (passed to evalFcn from train) are:

  • agent — Agent to evaluate, specified as a reinforcement learning agent object. For multiagent environments, this argument is a cell array of agent objects.

  • environment — Environment within which the agents are evaluated, specified as a reinforcement learning environment object.

  • trainingInfo — Structure containing the following fields.

    • EpisodeIndex — Current episode index, specified as a positive integer

    • EpisodeInfo — Structure containing the fields CumulativeReward, StepsTaken, and InitialObservation, which contain, respectively, the cumulative reward, the number of steps taken, and the initial observation of the current training episode

The output arguments (passed from evalFcn to train) are:

  • statistic — Statistic computed from a group of consecutive evaluation episodes. Common statistics are the mean, median, maximum, and minimum. At the end of training, train returns this value as the element of the EvaluationStatistic vector corresponding to the last training episode.

  • scores — Vector of episode scores, one from each evaluation episode. You can use a logger object to store this argument during training.

  • data — Any additional data from the evaluation that you might find useful, for example, for logging purposes. You can use a logger object to store this argument during training.
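For illustration, a minimal conforming evaluation function might look like the following sketch. The function name and the maximum step count are assumptions; the runEpisode call follows the same pattern as the full example at the end of this page.

function [statistic, scores, data] = myEvalFcn(agent, environment, trainingInfo)
    % Minimal sketch: run one greedy evaluation episode and return
    % its cumulative reward as both the score and the statistic.
    % The trainingInfo input is unused in this sketch.
    agent.UseExplorationPolicy = false;
    out = runEpisode(environment, agent, MaxSteps=500);
    scores = out.AgentData.EpisodeInfo.CumulativeReward;
    statistic = scores;   % scalar statistic from a single episode
    data = [];            % no additional data to log
end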

To use additional input arguments beyond the required three, define your additional arguments in the MATLAB workspace, then specify evalFcn as an anonymous function that in turn calls your custom function with the additional arguments defined in the workspace, as shown in the example Create Custom Environment Using Step and Reset Functions.
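For example, a hypothetical evaluation function myEvalFcnWithArgs that takes the number of evaluation episodes as a fourth input could be wrapped as follows.

% Hypothetical additional argument defined in the MATLAB workspace.
numEvalEpisodes = 5;

% The anonymous function has the required three-input signature and
% passes the extra workspace variable to the custom function.
evaluator = rlCustomEvaluator( ...
    @(agent,env,info) myEvalFcnWithArgs(agent,env,info,numEvalEpisodes));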

Example: evalFcn=@myEvalFcn

Evaluation period, specified as a positive integer. It is the number of training episodes after which train calls the evaluation function. For example, if EvaluationFrequency is 100 and your evaluation function runs three evaluation episodes, then the three evaluation episodes are run, consecutively, after every 100 training episodes. The default is 100.

Example: EvaluationFrequency=200

Object Functions

Examples

Create an rlCustomEvaluator object to evaluate an agent during training using a custom evaluation function. Use the function myEvaluationFcn, defined at the end of this example.

myEvaluator = rlCustomEvaluator(@myEvaluationFcn)
myEvaluator = 
  rlCustomEvaluator with properties:

          EvaluationFcn: @myEvaluationFcn
    EvaluationFrequency: 100

Configure the evaluator to run the evaluation function every 200 training episodes.

myEvaluator.EvaluationFrequency = 200;

To evaluate an agent during training using this evaluator, pass myEvaluator to train, as in the following code example.

results = train(agent, env, trainingOptions, Evaluator=myEvaluator);

For more information, see train.
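After training, you can retrieve the recorded statistics from the training result, as in the following sketch. The EvaluationStatistic vector contains one element per training episode; for episodes in which no evaluation ran, the corresponding element is NaN.

% Vector of evaluation statistics, one element per training episode.
evalStats = results.EvaluationStatistic;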

Custom Evaluation Function

The evaluation function is called by train every evaluator.EvaluationFrequency training episodes. Within the evaluation function, if 1000 or fewer training episodes have occurred, run just one evaluation episode; otherwise, run 10 consecutive evaluation episodes. Configure the agent to use a greedy policy (no exploration) during evaluation, and return the eighth-largest episode reward as the final statistic (this is consistent with achieving a desired reward 80% of the time).

function [statistic, scores, data] = ...
    myEvaluationFcn(agent, env, trainingEpisodeInfo)
    % Do not use an exploration policy for evaluation.
    agent.UseExplorationPolicy = false;

    % Set the number of consecutive evaluation episodes to run.
    if trainingEpisodeInfo.EpisodeIndex <= 1000
        numEpisodes = 1;
    else
        numEpisodes = 10;
    end

    % Initialize the rewards and data arrays.
    episodeRewards = zeros(numEpisodes, 1);
    data = cell(numEpisodes, 1);

    % Run numEpisodes consecutive evaluation episodes.
    for evaluationEpisode = 1:numEpisodes

        % Use a fixed random seed for reproducibility.
        rng(evaluationEpisode*10)

        % Run one evaluation episode. The output is a structure
        % containing various agent simulation information,
        % as described in runEpisode. Use a variable other than
        % data so that the data cell array is not overwritten.
        out = runEpisode(env, agent, ...
            MaxSteps=500, ...
            CleanupPostSim=false);

        if isa(out,"rl.env.Future")
            % For parallel simulation, fetch data from workers.
            [~,out] = fetchNext(out);
        end

        % Collect the episode cumulative reward.
        episodeRewards(evaluationEpisode) = ...
            out.AgentData.EpisodeInfo.CumulativeReward;

        % Collect the whole data structure.
        data{evaluationEpisode} = out;
    end

    % Return the eighth-largest episode reward if 10 episodes
    % are run; otherwise return the greatest (and only) reward.
    statistic = sort(episodeRewards, "descend");
    if length(statistic) == 10
        statistic = statistic(8);
    else
        % Make sure to always return a scalar in any case.
        statistic = statistic(1);
    end

    % Return the rewards vector.
    scores = episodeRewards;
end

Version History

Introduced in R2023b
