Load Predefined Control System Environments
Reinforcement Learning Toolbox™ software provides several predefined environments representing dynamical systems that are often used as benchmark cases for control system design.
In these environments, the state and observation (which are predefined) belong to nonfinite numerical vector spaces, while the action (also predefined) can still belong to a finite set. The (deterministic) state transition laws are derived by discretizing the dynamics of an underlying physical system.
Environments that rely on an underlying Simulink® model for the calculation of the state transition, reward, and observation are referred to as Simulink environments. Some of the predefined control system environments belong to this category.
Multiagent environments are environments in which you can train and simulate multiple agents together. Some of the predefined MATLAB® and Simulink control system environments are multiagent environments.
You can use predefined control system environments to learn how to apply reinforcement learning to the control of physical systems, gain familiarity with Reinforcement Learning Toolbox software features, or test your own agents.
To load the following predefined MATLAB control system environments, use the rlPredefinedEnv function. Each of these predefined environments is available in two versions, one with a discrete action space, the other with a continuous action space.
Environment | Agent Task |
---|---|
Double Integrator | Control a second-order dynamic system using either a discrete or continuous action space. |
Cart-Pole | Balance a pole on a moving cart by applying forces to the cart using either a discrete or continuous action space. |
Simple Pendulum with Image Observation | Swing up and balance a simple pendulum using either a discrete or continuous action space. |
You can also load the following predefined Simulink environments using the rlPredefinedEnv function. For these environments, rlPredefinedEnv creates a SimulinkEnvWithAgent object. Each of these predefined environments is also available in two versions, one with a discrete action space, the other with a continuous action space.
Environment | Agent Task |
---|---|
Simple Pendulum Simulink Model | Swing up and balance a simple pendulum using either a discrete or continuous action space. |
Cart-Pole Simscape™ Model | Balance a pole on a moving cart by applying forces to the cart using either a discrete or continuous action space. |
You can also load predefined grid world environments. For more information, see Load Predefined Grid World Environments.
To learn how to create your own custom environment, see Create Custom Environment Using Step and Reset Functions, Create Custom Simulink Environments, and Create Custom Environment from Class Template.
Double Integrator Environments
The goal of the agent in the predefined double integrator environments is to control the position of a mass in a frictionless one-dimensional space by applying a force input. The system has second-order dynamics that can be represented by a double integrator (that is, two integrators in series).
In this environment, a training episode ends when either of the following events occurs:
The mass moves beyond a given threshold from the origin.
The norm of the state vector is less than a given threshold.
There are two double integrator environment variants, which differ by the agent action space.
Discrete — Agent can apply a force of either Fmax or -Fmax to the mass, where Fmax is the MaxForce property of the environment.
Continuous — Agent can apply any force within the range [-Fmax, Fmax].
To create a double integrator environment, use the rlPredefinedEnv function.
Discrete action space
env = rlPredefinedEnv('DoubleIntegrator-Discrete');
Continuous action space
env = rlPredefinedEnv('DoubleIntegrator-Continuous');
You can visualize the double integrator environment using the plot function. The plot displays the mass as a red rectangle.
plot(env)
To visualize the environment during training, call plot before training and keep the visualization figure open.
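For example, the following sketch creates the environment and opens its visualization before training. The agent and trainOpts variables are assumptions (for instance, an agent created with one of the toolbox agent constructors and an rlTrainingOptions object).
env = rlPredefinedEnv('DoubleIntegrator-Discrete');
plot(env)                                     % open the visualization figure
% trainingStats = train(agent,env,trainOpts);  % visualization updates while training runs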
For examples showing how to train agents in double integrator environments, see the related toolbox examples.
Environment Properties
Property | Description | Default |
---|---|---|
Gain | Gain for the double integrator | 1 |
Ts | Sample time in seconds | 0.1 |
MaxDistance | Distance magnitude threshold in meters | 5 |
GoalThreshold | State norm threshold | 0.01 |
Q | Weight matrix for observation component of reward signal | [10 0; 0 1] |
R | Weight matrix for action component of reward signal | 0.01 |
MaxForce | Maximum input force in newtons | Differs between the discrete and continuous variants |
State | Environment state, specified as a column vector containing the position of the mass and its derivative | [0 0]' |
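The properties above are fields of the environment object, so you can adjust them after creation. A minimal sketch, assuming the listed properties are writable on the environment object:
env = rlPredefinedEnv('DoubleIntegrator-Continuous');
env.Ts = 0.05;             % use a faster sample time
env.MaxDistance = 10;      % let the mass travel farther before the episode terminates
env.GoalThreshold = 0.02;  % relax the terminal state-norm threshold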
Actions
In the double integrator environments, the agent interacts with the environment using a single action signal, the force applied to the mass. The environment contains a specification object for this action signal. For the environment with a:
Discrete action space, the specification is an rlFiniteSetSpec object.
Continuous action space, the specification is an rlNumericSpec object.
For more information on obtaining action specifications from an environment, see getActionInfo.
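For example, you can retrieve and compare the action specifications of the two variants:
envDiscrete = rlPredefinedEnv('DoubleIntegrator-Discrete');
envContinuous = rlPredefinedEnv('DoubleIntegrator-Continuous');
actInfoDiscrete = getActionInfo(envDiscrete)      % rlFiniteSetSpec object
actInfoContinuous = getActionInfo(envContinuous)  % rlNumericSpec object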
Observations
In the double integrator system, the agent can observe both of the environment state variables in env.State. For each state variable, the environment contains an rlNumericSpec observation specification. Both states are continuous and unbounded.
For more information on obtaining observation specifications from an environment, see getObservationInfo.
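For example, the following lines retrieve and display the observation specification for the double integrator environment:
env = rlPredefinedEnv('DoubleIntegrator-Discrete');
obsInfo = getObservationInfo(env)   % rlNumericSpec describing the continuous, unbounded states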
Reward
The reward signal for this environment is the discrete-time equivalent of the following continuous-time reward, which is analogous to the cost function of an LQR controller.
r(t) = -(x(t)'Qx(t) + u(t)'Ru(t))
Here:
Q and R are environment properties.
x is the environment state vector.
u is the input force.
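As an illustration only (not the environment's internal source code), the corresponding per-step reward can be computed from the Q and R properties of the environment; the state x and force u below are example values:
env = rlPredefinedEnv('DoubleIntegrator-Continuous');
x = [1; 0];   % example state: mass 1 m from the origin, at rest
u = 0.5;      % example applied force in newtons
r = -(x.'*env.Q*x + u.'*env.R*u)   % quadratic, LQR-style cost with a negative sign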
Cart-Pole Environments
The goal of the agent in the predefined cart-pole environments is to balance a pole on a moving cart by applying horizontal forces to the cart. The pole is considered successfully balanced if both of the following conditions are satisfied:
The pole angle remains within a given threshold of the vertical position, where the vertical position is zero radians.
The magnitude of the cart position remains below a given threshold.
There are two cart-pole environment variants, which differ by the agent action space.
Discrete — Agent can apply a force of either Fmax or -Fmax to the cart, where Fmax is the MaxForce property of the environment.
Continuous — Agent can apply any force within the range [-Fmax, Fmax].
To create a cart-pole environment, use the rlPredefinedEnv function.
Discrete action space
env = rlPredefinedEnv('CartPole-Discrete');
Continuous action space
env = rlPredefinedEnv('CartPole-Continuous');
You can visualize the cart-pole environment using the plot function. The plot displays the cart as a blue square and the pole as a red rectangle.
plot(env)
To visualize the environment during training, call plot before training and keep the visualization figure open.
For examples showing how to train agents in cart-pole environments, see the related toolbox examples.
Environment Properties
Property | Description | Default |
---|---|---|
Gravity | Acceleration due to gravity in meters per second squared | 9.8 |
MassCart | Mass of the cart in kilograms | 1 |
MassPole | Mass of the pole in kilograms | 0.1 |
Length | Half the length of the pole in meters | 0.5 |
MaxForce | Maximum horizontal force magnitude in newtons | 10 |
Ts | Sample time in seconds | 0.02 |
ThetaThresholdRadians | Pole angle threshold in radians | 0.2094 |
XThreshold | Cart position threshold in meters | 2.4 |
RewardForNotFalling | Reward for each time step the pole is balanced | 1 |
PenaltyForFalling | Reward penalty for failing to balance the pole | Differs between the discrete and continuous variants |
State | Environment state, specified as a column vector containing the cart position, cart velocity, pole angle, and pole angle derivative | [0 0 0 0]' |
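Because the predefined MATLAB environments support the reset and step functions, you can also interact with the cart-pole environment directly, for example to inspect the effect of applying the maximum force for one sample time. A sketch, assuming direct stepping outside of train or sim:
env = rlPredefinedEnv('CartPole-Discrete');
initialObs = reset(env);             % reset the environment state and get the initial observation
[obs,reward,isDone] = step(env,10);  % apply +MaxForce (10 N) for one sample time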
Actions
In the cart-pole environments, the agent interacts with the environment using a single scalar action signal, the horizontal force applied to the cart. The environment contains a specification object for this action signal. For the environment with a:
Discrete action space, the specification is an rlFiniteSetSpec object.
Continuous action space, the specification is an rlNumericSpec object.
For more information on obtaining action specifications from an environment, see getActionInfo.
Observations
In the cart-pole system, the agent can observe all the environment state variables in env.State. For each state variable, the environment contains an rlNumericSpec observation specification. All the states are continuous and unbounded.
For more information on obtaining observation specifications from an environment, see getObservationInfo.
Reward
The reward signal for this environment consists of two components.
A positive reward for each time step that the pole is balanced, that is, the cart and pole both remain within their specified threshold ranges. This reward accumulates over the entire training episode. To control the size of this reward, use the RewardForNotFalling property of the environment.
A one-time negative penalty if either the pole or cart moves outside of its threshold range. At this point, the training episode stops. To control the size of this penalty, use the PenaltyForFalling property of the environment.
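Conceptually, the reward logic can be sketched as follows; this is an illustration of the description above, not the environment's actual implementation. The inBounds flag stands in for the threshold checks on the pole angle and cart position.
env = rlPredefinedEnv('CartPole-Discrete');
inBounds = true;                   % example flag: pole angle and cart position within thresholds
if inBounds
    r = env.RewardForNotFalling;   % accrued at every time step while balanced
else
    r = env.PenaltyForFalling;     % one-time penalty; the episode then terminates
end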
Simple Pendulum Environments with Image Observation
This environment is a simple frictionless pendulum that initially hangs in a downward position. The training goal is to make the pendulum stand upright without falling over, using minimal control effort.
There are two simple pendulum environment variants, which differ by the agent action space.
Discrete — Agent can apply a torque of -2, -1, 0, 1, or 2 to the pendulum.
Continuous — Agent can apply any torque within the range [-2, 2].
To create a simple pendulum environment, use the rlPredefinedEnv function.
Discrete action space
env = rlPredefinedEnv('SimplePendulumWithImage-Discrete');
Continuous action space
env = rlPredefinedEnv('SimplePendulumWithImage-Continuous');
For examples showing how to train an agent in this environment, see the related toolbox examples.
Environment Properties
Property | Description | Default |
---|---|---|
Mass | Pendulum mass | 1 |
RodLength | Pendulum length | 1 |
RodInertia | Pendulum moment of inertia | 0 |
Gravity | Acceleration due to gravity in meters per second squared | 9.81 |
DampingRatio | Damping on pendulum motion | 0 |
MaximumTorque | Maximum input torque in newton-meters | 2 |
Ts | Sample time in seconds | 0.05 |
State | Environment state, specified as a column vector containing the pendulum angle and its derivative | [0 0]' |
Q | Weight matrix for observation component of reward signal | [1 0; 0 0.1] |
R | Weight matrix for action component of reward signal | 1e-3 |
Actions
In the simple pendulum environments, the agent interacts with the environment using a single action signal, the torque applied at the base of the pendulum. The environment contains a specification object for this action signal. For the environment with a:
Discrete action space, the specification is an rlFiniteSetSpec object.
Continuous action space, the specification is an rlNumericSpec object.
For more information on obtaining action specifications from an environment, see getActionInfo.
Observations
In the simple pendulum environment, the agent receives the following observation signals:
50-by-50 grayscale image of the pendulum position
Derivative of the pendulum angle
For each observation signal, the environment contains an rlNumericSpec observation specification. All the observations are continuous and unbounded.
For more information on obtaining observation specifications from an environment, see getObservationInfo.
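For example, you can inspect the two observation channels (the grayscale image and the angle derivative) through their specification objects; the Dimension property reports the size of each channel.
env = rlPredefinedEnv('SimplePendulumWithImage-Discrete');
obsInfo = getObservationInfo(env);
obsInfo(1).Dimension   % size of the image observation
obsInfo(2).Dimension   % size of the angle-derivative observation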
Reward
The reward signal for this environment is
rt = -(θt² + 0.1 θ̇t² + 0.001 ut-1²)
Here:
θt is the pendulum angle of displacement from the upright position.
θ̇t is the derivative of the pendulum angle.
ut-1 is the control effort from the previous time step.
Simple Pendulum Simulink Model
This environment is a simple frictionless pendulum that initially hangs in a downward position. The training goal is to make the pendulum stand upright without falling over, using minimal control effort. The model for this environment is defined in the rlSimplePendulumModel Simulink model.
open_system('rlSimplePendulumModel')
There are two simple pendulum environment variants, which differ by the agent action space.
Discrete — Agent can apply a torque of either Tmax, 0, or -Tmax to the pendulum, where Tmax is the max_tau variable in the model workspace.
Continuous — Agent can apply any torque within the range [-Tmax, Tmax].
To create a simple pendulum environment, use the rlPredefinedEnv function.
Discrete action space
env = rlPredefinedEnv('SimplePendulumModel-Discrete');
Continuous action space
env = rlPredefinedEnv('SimplePendulumModel-Continuous');
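As with other SimulinkEnvWithAgent objects, you can customize how each episode starts by setting the ResetFcn property. The following sketch fixes the initial pendulum angle through the model workspace; the variable name theta0 is an assumption about this model and may differ in your installation.
env = rlPredefinedEnv('SimplePendulumModel-Discrete');
% Start every episode with the pendulum hanging down (theta0 = pi); variable name assumed
env.ResetFcn = @(in) setVariable(in,'theta0',pi,'Workspace','rlSimplePendulumModel');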
For examples that train agents in the simple pendulum environment, see the related toolbox examples.
Actions
In the simple pendulum environments, the agent interacts with the environment using a single action signal, the torque applied at the base of the pendulum. The environment contains a specification object for this action signal. For the environment with a:
Discrete action space, the specification is an rlFiniteSetSpec object.
Continuous action space, the specification is an rlNumericSpec object.
For more information on obtaining action specifications from an environment, see getActionInfo.
Observations
In the simple pendulum environment, the agent receives the following three observation signals, which are constructed within the create observations subsystem.
Sine of the pendulum angle
Cosine of the pendulum angle
Derivative of the pendulum angle
For each observation signal, the environment contains an rlNumericSpec observation specification. All the observations are continuous and unbounded.
For more information on obtaining observation specifications from an environment, see getObservationInfo.
Reward
The reward signal for this environment, which is constructed in the calculate reward subsystem, is
rt = -(θt² + 0.1 θ̇t² + 0.001 ut-1²)
Here:
θt is the pendulum angle of displacement from the upright position.
θ̇t is the derivative of the pendulum angle.
ut-1 is the control effort from the previous time step.
Cart-Pole Simscape Model
The goal of the agent in the predefined cart-pole environments is to balance a pole on a moving cart by applying horizontal forces to the cart. The pole is considered successfully balanced if both of the following conditions are satisfied:
The pole angle remains within a given threshold of the vertical position, where the vertical position is zero radians.
The magnitude of the cart position remains below a given threshold.
The model for this environment is defined in the rlCartPoleSimscapeModel Simulink model. The dynamics of this model are defined using Simscape Multibody™.
open_system('rlCartPoleSimscapeModel')
In the Environment subsystem, the model dynamics are defined using Simscape components, and the reward and observation are constructed using Simulink blocks.
open_system('rlCartPoleSimscapeModel/Environment')
There are two cart-pole environment variants, which differ by the agent action space.
Discrete — Agent can apply a force of 15, 0, or -15 to the cart.
Continuous — Agent can apply any force within the range [-15, 15].
To create a cart-pole environment, use the rlPredefinedEnv function.
Discrete action space
env = rlPredefinedEnv('CartPoleSimscapeModel-Discrete');
Continuous action space
env = rlPredefinedEnv('CartPoleSimscapeModel-Continuous');
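Once the environment is created, a quick way to check the setup is to build a default agent from the environment specifications and run a short simulation. This sketch assumes a toolbox release that supports default agent creation with rlDDPGAgent.
env = rlPredefinedEnv('CartPoleSimscapeModel-Continuous');
obsInfo = getObservationInfo(env);
actInfo = getActionInfo(env);
agent = rlDDPGAgent(obsInfo,actInfo);            % default DDPG agent (assumed available)
simOpts = rlSimulationOptions('MaxSteps',500);
experience = sim(env,agent,simOpts);             % simulates the Simulink model for one episode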
For an example that trains an agent in this cart-pole environment, see the related toolbox example.
Actions
In the cart-pole environments, the agent interacts with the environment using a single action signal, the force applied to the cart. The environment contains a specification object for this action signal. For the environment with a:
Discrete action space, the specification is an rlFiniteSetSpec object.
Continuous action space, the specification is an rlNumericSpec object.
For more information on obtaining action specifications from an environment, see getActionInfo.
Observations
In the cart-pole environment, the agent receives the following five observation signals.
Sine of the pole angle
Cosine of the pole angle
Derivative of the pendulum angle
Cart position
Derivative of the cart position
For each observation signal, the environment contains an rlNumericSpec observation specification. All the observations are continuous and unbounded.
For more information on obtaining observation specifications from an environment, see getObservationInfo.
Reward
The reward signal for this environment is the sum of the following components (r = rqr + rn + rp):
A quadratic regulator control reward, constructed in the Environment/qr reward subsystem.
A cart limit penalty, constructed in the Environment/x limit penalty subsystem. This subsystem generates a negative reward when the magnitude of the cart position exceeds a given threshold.
Here:
x is the cart position.
θ is the pole angle of displacement from the upright position.
ut-1 is the control effort from the previous time step.
See Also
Functions
rlPredefinedEnv | train | sim | reset