Train Biped Robot to Walk Using Evolution Strategy
This example shows how to train a biped robot to walk using an evolution strategy with a twin-delayed deep deterministic policy gradient (TD3) reinforcement learning (RL) agent. The robot in this example is modeled in Simscape™ Multibody™.
For a related example, see Train Biped Robot to Walk Using Reinforcement Learning Agents. For more information on these agents, see Twin-Delayed Deep Deterministic (TD3) Policy Gradient Agents.
For this example, the agent is trained using the evolution strategy reinforcement learning (ES-RL) algorithm. This algorithm [4] combines the cross-entropy method (CEM) with off-policy RL algorithms such as SAC, DDPG, or TD3. CEM-RL is built on the framework of evolutionary reinforcement learning (ERL) [5], in which a standard evolutionary algorithm selects and evolves a population of actors and generates experiences in the process. These experiences are then added to a replay buffer that is used to train a single gradient-based actor that is considered part of the population.
The ES-RL algorithm proceeds as follows:
A population of actor networks is initialized with random weights. In addition to the population, one more actor network is initialized alongside a critic network.
The population of actors is evaluated in an episode of interaction with the environment.
The additional actor and critic are updated using the replay buffer populated by the population evaluations.
The fitness of every actor in the population is computed from its interaction with the environment, using the average return over the episode as the fitness index.
A selection operator selects the surviving actors in the population based on their relative fitness scores.
The surviving elite set of actors is used to generate the next population of actors.
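The following MATLAB-style sketch makes this loop concrete, using the names defined later in this example (obsInfo, actInfo, env, agent). It is not the toolbox implementation; createRandomActor, evaluateActorReturn, appendToReplayBuffer, trainAgentOnBuffer, and sampleFromElite are hypothetical helpers introduced only to illustrate the control flow.
% Illustrative ES-RL loop; the helper functions below are hypothetical.
populationSize = 25;
eliteSize = round(0.5*populationSize);
population = cell(populationSize,1);
for k = 1:populationSize
    population{k} = createRandomActor(obsInfo,actInfo);   % random initial weights
end
for generation = 1:maxGenerations
    fitness = zeros(populationSize,1);
    for k = 1:populationSize
        % Run one episode with actor k, recording its return and experiences.
        [fitness(k),experiences] = evaluateActorReturn(population{k},env);
        appendToReplayBuffer(agent,experiences);
    end
    % Update the gradient-based actor and critic on the shared replay buffer.
    agent = trainAgentOnBuffer(agent);
    % Keep the elite actors and generate the next population from them.
    [~,idx] = sort(fitness,"descend");
    elite = population(idx(1:eliteSize));
    population = sampleFromElite(elite,populationSize);
end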
Biped Robot Model
The reinforcement learning environment for this example is a biped robot. The training goal is to make the robot walk in a straight line using minimal control effort.
Load the parameters of the model into the MATLAB® workspace.
robotParametersRL
Open the Simulink model.
mdl = "rlWalkingBipedRobot";
open_system(mdl)
The robot is modeled using Simscape Multibody.
For this model:
In the neutral 0 rad position, both legs are straight and the ankles are flat.
The foot contact is modeled using a Simscape Multibody contact force block.
The agent can control the three individual joints (ankle, knee, and hip) on both legs of the robot by applying joint torques bounded between -3 and 3 N·m. The actual computed action signals are normalized between -1 and 1.
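For example, assuming the torque scaling inside the model is linear (consistent with the bounds above), a normalized action of 0.5 corresponds to a torque of 1.5 N·m:
% Assumed linear mapping from normalized actions to joint torques.
maxTorque = 3;                                 % N*m
normalizedAction = 0.5;                        % agent output in [-1,1]
appliedTorque = maxTorque*normalizedAction     % 1.5 N*m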
The environment provides the following 29 observations to the agent.
Y (lateral) and Z (vertical) translations of the torso center of mass. The translation in the Z direction is normalized to a similar range as the other observations.
X (forward), Y (lateral), and Z (vertical) translation velocities.
Yaw, pitch, and roll angles of the torso.
Yaw, pitch, and roll angular velocities of the torso.
Angular positions and velocities of the three joints (ankle, knee, hip) on both legs.
Action values from the previous time step.
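As a consistency check, the groups listed above account for all 29 observations:
% Observation count, grouped as in the list above.
numTorsoTranslations = 2;        % Y and Z of the torso center of mass
numTranslationVelocities = 3;    % X, Y, Z translation velocities
numTorsoAngles = 3;              % yaw, pitch, roll
numTorsoAngularVelocities = 3;   % yaw, pitch, roll rates
numJointStates = 2*3*2;          % position and velocity for 3 joints on 2 legs
numPreviousActions = 6;          % action values from the previous time step
numObsCheck = numTorsoTranslations + numTranslationVelocities + ...
    numTorsoAngles + numTorsoAngularVelocities + ...
    numJointStates + numPreviousActions        % = 29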
The episode terminates if either of the following conditions occurs.
The robot torso center of mass is less than 0.1 m in the Z direction (the robot falls) or more than 1 m in either Y direction (the robot moves too far to the side).
The absolute value of the roll, pitch, or yaw is greater than 0.7854 rad.
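A minimal sketch of this termination logic follows. In the example the equivalent checks are implemented inside the Simulink model; the torso state values here are made up for illustration.
% Illustrative early-termination check.
zTorso = 0.35; yTorso = 0.2; roll = 0.1; pitch = 0.05; yaw = 0;   % example values
fell = zTorso < 0.1;                                 % robot fell
driftedOff = abs(yTorso) > 1;                        % moved too far to the side
tippedOver = any(abs([roll pitch yaw]) > 0.7854);    % |angle| > pi/4 rad
isDone = fell || driftedOff || tippedOver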
The following reward function r_t, which is provided at every time step, is inspired by [2].
r_t = v_x - 3y^2 - 50ẑ^2 + 25(Ts/Tf) - 0.02 Σ_i (u_{t-1}^i)^2
Here:
v_x is the translation velocity in the X direction (forward toward the goal) of the robot.
y is the lateral translation displacement of the robot from the target straight-line trajectory.
ẑ is the normalized vertical translation displacement of the robot center of mass.
u_{t-1}^i is the torque from joint i from the previous time step.
Ts is the sample time of the environment.
Tf is the final simulation time of the environment.
This reward function encourages the agent to move forward by providing a positive reward for positive forward velocity. It also encourages the agent to avoid episode termination by providing a constant reward (25 Ts/Tf) at every time step. The other terms in the reward function are penalties for substantial changes in lateral and vertical translations, and for the use of excess control effort.
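The reward can also be written as a small MATLAB function that mirrors the expression above. This is an illustrative reimplementation; in the example the reward is computed inside the Simulink model.
% Illustrative reimplementation of the reward described above.
function r = walkingReward(vx,y,zHat,uPrev,Ts,Tf)
    % vx    - forward (X) translation velocity of the torso
    % y     - lateral displacement from the straight-line trajectory
    % zHat  - normalized vertical displacement of the torso center of mass
    % uPrev - vector of joint torques from the previous time step
    % Ts,Tf - sample time and final simulation time of the environment
    r = vx ...                  % reward forward progress
        - 3*y^2 ...             % penalize lateral drift
        - 50*zHat^2 ...         % penalize vertical displacement
        + 25*Ts/Tf ...          % constant survival reward per step
        - 0.02*sum(uPrev.^2);   % penalize control effort
end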
Create Environment Interface
Create the observation specification.
numObs = 29;
obsInfo = rlNumericSpec([numObs 1]);
obsInfo.Name = "observations";
Create the action specification.
numAct = 6;
actInfo = rlNumericSpec([numAct 1],LowerLimit=-1,UpperLimit=1);
actInfo.Name = "foot_torque";
Create the environment interface for the walking robot model.
blk = mdl + "/RL Agent";
env = rlSimulinkEnv(mdl,blk,obsInfo,actInfo);
env.ResetFcn = @(in) walkerResetFcn(in, ...
    upper_leg_length/100, ...
    lower_leg_length/100, ...
    h/100);
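The walkerResetFcn helper provided with the example sets the initial conditions of the robot at the start of each episode. A minimal sketch of a reset function with the same signature is shown below; the variable names set on the SimulationInput object (y_init, z_init) are hypothetical and may differ from what the shipped helper actually does.
% Hypothetical reset function sketch; the actual walkerResetFcn may differ.
function in = exampleWalkerResetFcn(in,upperLegLength,lowerLegLength,h)
    yInit = 0.05*(2*rand - 1);                     % small random lateral offset (m)
    in = setVariable(in,"y_init",yInit);           % hypothetical model variable
    zInit = upperLegLength + lowerLegLength + h;   % initial torso height from geometry
    in = setVariable(in,"z_init",zInit);           % hypothetical model variable
end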
Create RL Agent for Training
This example trains a TD3 agent using an evolution-strategy-based gradient-free optimization technique to learn biped locomotion. Create the TD3 agent.
agent = createTD3Agent(numObs,obsInfo,numAct,actInfo,Ts);
The createTD3Agent helper function performs the following actions.
Create the actor and critic networks.
Specify training options for the actor and critic.
Create the actor and critic using the networks and options defined.
Configure agent-specific options.
Create the agent.
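As a rough sketch of the first of these steps, the following code builds a deterministic actor and two Q-value critics using Reinforcement Learning Toolbox objects. The layer sizes and names are illustrative and not necessarily those used by createTD3Agent.
% Illustrative critic network: Q(s,a) with observation and action paths.
obsPath = [featureInputLayer(numObs,Name="obsIn")
           fullyConnectedLayer(256,Name="obsFC")];
actPath = [featureInputLayer(numAct,Name="actIn")
           fullyConnectedLayer(256,Name="actFC")];
commonPath = [additionLayer(2,Name="add")
              reluLayer
              fullyConnectedLayer(1)];
criticNet = layerGraph(obsPath);
criticNet = addLayers(criticNet,actPath);
criticNet = addLayers(criticNet,commonPath);
criticNet = connectLayers(criticNet,"obsFC","add/in1");
criticNet = connectLayers(criticNet,"actFC","add/in2");
criticNet = dlnetwork(criticNet);
% TD3 uses two critics; in practice, initialize each with different random weights.
critic1 = rlQValueFunction(criticNet,obsInfo,actInfo, ...
    ObservationInputNames="obsIn",ActionInputNames="actIn");
critic2 = rlQValueFunction(criticNet,obsInfo,actInfo, ...
    ObservationInputNames="obsIn",ActionInputNames="actIn");
% Illustrative actor network: deterministic policy with outputs in [-1,1].
actorNet = dlnetwork([featureInputLayer(numObs)
                      fullyConnectedLayer(256)
                      reluLayer
                      fullyConnectedLayer(numAct)
                      tanhLayer]);
actor = rlContinuousDeterministicActor(actorNet,obsInfo,actInfo);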
TD3 Agent
The TD3 algorithm is an extension of DDPG with improvements that make it more robust by preventing overestimation of Q values [3].
Two critic networks — TD3 agents learn two critic networks independently and use the minimum value function estimate to update the actor (policy). Doing so avoids overestimation of Q values through the maximum operator in the critic update.
Addition of target policy noise — Adding clipped noise to the target action smooths the Q-function values over similar actions. Doing so prevents learning an incorrect sharp peak of a noisy value estimate.
Delayed policy and target updates — For a TD3 agent, delaying the actor network update allows more time for the Q function to reduce error (get closer to the required target) before updating the policy. Doing so prevents variance in value estimates and results in a higher quality policy update.
The structure of the actor and critic networks used for this agent is the same as the one used for DDPG agents. For details on creating the TD3 agent, see the createTD3Agent helper function. For information on configuring TD3 agent options, see rlTD3AgentOptions.
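The following sketch shows how the three TD3 mechanisms above map onto agent options and construction. The specific values are illustrative and not necessarily those used by createTD3Agent; actor, critic1, and critic2 refer to the sketch in the previous section.
% Illustrative TD3 configuration; values differ from those in createTD3Agent.
agentOpts = rlTD3AgentOptions(SampleTime=Ts);
agentOpts.TargetPolicySmoothModel.StandardDeviation = 0.2;   % target policy noise
agentOpts.TargetPolicySmoothModel.StandardDeviationMin = 0.05;
agentOpts.PolicyUpdateFrequency = 2;                         % delayed policy updates
agentOpts.TargetUpdateFrequency = 2;                         % delayed target updates
% Passing two critics enables the clipped double-Q (minimum) update.
agentSketch = rlTD3Agent(actor,[critic1 critic2],agentOpts);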
Specify Evolution Strategy Training Options and Train the Agent
Set the ES-RL training options as follows:
Set PopulationSize, the number of actors that are evaluated in each generation, to 25.
Set PercentageEliteSize, the size of the surviving elite population from which the next generation of actors is generated, to 50% of the total population.
Set MaxGenerations, the maximum number of generations for the population to evolve, to 2000.
Set MaxStepsPerEpisode, the maximum number of simulation steps per episode run per actor.
Set TrainEpochs, the number of training epochs for the gradient-based agent.
Display the training progress in the Episode Manager dialog box by setting Plots to "training-progress", and disable the command-line display by setting Verbose to false (0).
Terminate the training when the agent reaches an average score of 250.
For more information and additional options, see rlEvolutionStrategyTrainingOptions.
maxEpisodes = 2000;
maxSteps = floor(Tf/Ts);
trainOpts = rlEvolutionStrategyTrainingOptions(...
    "MaxGeneration", maxEpisodes, ...
    "MaxStepsPerEpisode", maxSteps, ...
    "ScoreAveragingWindowLength", 10, ...
    "Plots", "training-progress", ...
    "StopTrainingCriteria", "EpisodeReward", ...
    "StopTrainingValue", 250, ...
    "PopulationSize", 25, ...
    "PercentageEliteSize", 50, ...
    "ReturnedPolicy", "BestPolicy", ...
    "Verbose", 0, ...
    "SaveAgentCriteria", "none");
trainOpts.TrainEpochs = 50;
trainOpts.EvaluationsPerIndividual = 1;
trainOpts.PopulationUpdateOptions.UpdateMethod = "WeightedMixing";
trainOpts.PopulationUpdateOptions.InitialStandardDeviation = 0.25;
trainOpts.PopulationUpdateOptions.InitialStandardDeviationBias = 0.25;
Train the agent using the trainWithEvolutionStrategy function. This process is computationally intensive and takes several hours to complete. To save time while running this example, load a pretrained agent by setting doTraining to false. To train the agent yourself, set doTraining to true.
doTraining = false;
if doTraining
    % Train the agent.
    trainingStats = trainWithEvolutionStrategy(agent,env,trainOpts);
else
    % Load a pretrained agent.
    load("rlWalkingBipedRobotESTD3.mat","saved_agent")
end
In this example, training was stopped when the average reward reached 250. The steady increase of the reward estimates indicates that the agent has the potential to converge to the true discounted long-term reward with longer training.
Simulate Trained Agents
Fix the random generator seed for reproducibility.
rng(0)
To validate the performance of the trained agent, simulate it within the biped robot environment. For more information on agent simulation, see rlSimulationOptions and sim.
simOptions = rlSimulationOptions(MaxSteps=maxSteps);
experience = sim(env,saved_agent,simOptions);
The figure shows the simulated biped robot walking along a line.
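To inspect the trajectory beyond the animation, you can extract the logged signals from the experience structure returned by sim. The field names below follow the Name properties set on the specifications earlier ("observations" and "foot_torque").
% Extract and plot the logged action signals from the simulation.
actTS = experience.Action.foot_torque;          % timeseries of applied actions
obsTS = experience.Observation.observations;    % timeseries of observations
figure
plot(actTS.Time,squeeze(actTS.Data)')
xlabel("Time (s)")
ylabel("Normalized joint torque")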
References
[1] Lillicrap, Timothy P., Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. "Continuous Control with Deep Reinforcement Learning." Preprint, submitted July 5, 2019.
[2] Heess, Nicolas, Dhruva TB, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval Tassa, et al. "Emergence of Locomotion Behaviours in Rich Environments." Preprint, submitted July 10, 2017.
[3] Fujimoto, Scott, Herke van Hoof, and David Meger. "Addressing Function Approximation Error in Actor-Critic Methods." Preprint, submitted October 22, 2018.
[4] Pourchot, Aloïs, and Olivier Sigaud. "CEM-RL: Combining Evolutionary and Gradient-Based Methods for Policy Search." Preprint, submitted February 11, 2019.
[5] Khadka, Shauharda, and Kagan Tumer. "Evolution-Guided Policy Gradient in Reinforcement Learning." Advances in Neural Information Processing Systems 31 (2018).