Define Reward and Observation Signals in Custom Environments

To guide the learning process, reinforcement learning uses a scalar reward signal generated from the environment. This signal measures the performance of the agent with respect to the task goals. In other words, for a given observation (state), the reward measures the immediate effectiveness of taking a particular action. During training, an agent updates its policy based on the rewards received for different state-action combinations. For an introduction to different types of agents and how they use the reward signal during training, see Reinforcement Learning Agents.

In general, you provide a positive reward to encourage certain agent actions and a negative reward (penalty) to discourage other actions. A well-designed reward signal guides the agent to maximize the expectation of the (possibly discounted) cumulative long-term reward. What constitutes a well-designed reward depends on your application and the agent goals.
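As a rough illustration of the "(possibly discounted) cumulative long-term reward", the following Python sketch sums the rewards of one episode with a discount factor. The function name `discounted_return` and the default `gamma` value are assumptions made for this example; they do not come from any toolbox.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over k of gamma**k * rewards[k] for one episode.

    rewards: per-step rewards, in the order they were received.
    gamma:   hypothetical discount factor in [0, 1]; gamma=1 means no discounting.
    """
    total = 0.0
    # Accumulate from the last step backwards, so that each step needs
    # only one multiplication and one addition.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

For example, `discounted_return([1.0, 1.0, 1.0], gamma=0.5)` evaluates 1 + 0.5 + 0.25. The agent maximizes the expectation of this quantity, not the reward of any single step.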

For example, when an agent must perform a task for as long as possible, a common strategy is to provide a small positive reward for each time step that the agent successfully performs the task and a large penalty when the agent fails. This approach encourages longer training episodes while heavily discouraging actions that lead to failure. For an example that uses this approach, see Train DQN Agent to Balance Cart-Pole System.
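The per-step strategy described above might be sketched as follows. The reward values (`+1.0` per step, `-100.0` on failure) and the function names are illustrative assumptions, not the values used in any particular example.

```python
def survival_reward(failed, step_reward=1.0, failure_penalty=-100.0):
    """Reward for one time step of a 'keep going as long as possible' task.

    failed: True if the agent failed at this step (for example, the pole fell).
    step_reward and failure_penalty are hypothetical values chosen for illustration.
    """
    return failure_penalty if failed else step_reward


def episode_reward(failure_flags):
    """Undiscounted total reward of an episode given per-step failure flags."""
    return sum(survival_reward(failed) for failed in failure_flags)
```

With these values, an episode that survives 10 steps and then fails accumulates 10 - 100 = -90, while one that survives 200 steps before failing accumulates +100, so longer episodes are clearly preferred.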

If your reward function incorporates multiple signals, such as position, velocity, and control effort, you must consider the relative sizes of the signals and scale their contributions to the reward signal accordingly.
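One possible way to account for the relative sizes of the signals is to divide each signal by a typical magnitude before weighting it. The sketch below assumes this normalize-then-weight scheme; the scales, weights, and the function name `scaled_reward` are hypothetical and would need to be chosen for your application.

```python
def scaled_reward(position_error, velocity, effort,
                  scales=(2.0, 10.0, 50.0), weights=(1.0, 0.5, 0.1)):
    """Combine three signals of different magnitude into one penalty.

    scales:  assumed typical magnitude of each signal, used for normalization.
    weights: assumed relative importance of each normalized signal.
    """
    signals = (position_error, velocity, effort)
    total = 0.0
    for value, scale, weight in zip(signals, scales, weights):
        normalized = value / scale          # roughly in [-1, 1] for typical values
        total += weight * normalized ** 2   # quadratic penalty on each signal
    return -total
```

Without the normalization step, the signal with the largest numerical range (here the control effort) would dominate the reward regardless of the weights you intended.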

You can specify either continuous or discrete reward signals. In either case, the reward signal must provide rich information when the action and observation signals change.

For control system applications in which cost functions and constraints are already available, you can also generate reward functions from such specifications.

Continuous Rewards

A continuous reward function varies continuously with changes in the environment observations and actions. In general, continuous reward signals improve convergence during training and can lead to simpler network structures.

An example of a continuous reward is the quadratic regulator (QR) cost function, where the cumulative long-term reward can be expressed as:

$$J_i = -\left( s_\tau^T Q_\tau s_\tau + \sum_{j=i}^{\tau} \left( s_j^T Q_j s_j + a_j^T R_j a_j + 2 s_j^T N_j a_j \right) \right)$$

Here, $Q_\tau$, $Q$, $R$, and $N$ are the weight matrices. $Q_\tau$ is the terminal weight matrix, applied only at the end of the episode. Also, $s$ is the observation vector, $a$ is the action vector, and $\tau$ is the terminal iteration of the episode. The (instantaneous) reward for this cost function is

$$r_i = -\left( s_i^T Q_i s_i + a_i^T R_i a_i + 2 s_i^T N_i a_i \right)$$

This QR reward structure encourages an agent to drive $s$ to zero with minimal action effort. A QR-based reward structure is a good reward to choose for regulation or stationary point problems, such as pendulum swing-up or regulating the position of the double integrator. For training examples that use a QR reward, see Train DQN Agent to Swing Up and Balance Pendulum and Compare DDPG Agent to LQR Controller.
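The instantaneous QR reward can be computed directly from the weight matrices. The following sketch assumes the sign convention in which the reward is the negative of the quadratic cost (so that driving $s$ to zero with little effort maximizes the reward), and it uses plain Python lists instead of a matrix library. The helper names `quad_form` and `qr_reward` are made up for this example.

```python
def quad_form(x, M, y):
    """Compute x^T M y for list vectors x, y and a list-of-rows matrix M."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(x))
               for j in range(len(y)))


def qr_reward(s, a, Q, R, N):
    """Instantaneous reward r = -(s^T Q s + a^T R a + 2 s^T N a).

    s: observation vector, a: action vector.
    Q (len(s) x len(s)), R (len(a) x len(a)), N (len(s) x len(a)) are weights.
    """
    cost = quad_form(s, Q, s) + quad_form(a, R, a) + 2.0 * quad_form(s, N, a)
    return -cost


# Hypothetical weights for a two-state, one-action system.
Q = [[1.0, 0.0], [0.0, 0.1]]
R = [[0.01]]
N = [[0.0], [0.0]]
reward = qr_reward([1.0, 0.5], [2.0], Q, R, N)
```

The reward is zero only when both the observation and the action are zero, and it becomes more negative as either grows, which is the behavior the QR structure is meant to encourage.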

Smooth continuous rewards, such as the QR reward, are good for fine-tuning parameters and can provide policies similar to optimal controllers (LQR/MPC).

Discrete Rewards

A discrete reward function varies discontinuously with changes in the environment observations or actions. These types of reward signals can make convergence slower and can require more complex network structures. Discrete rewards are usually implemented as events that occur in the environment, for example, a positive reward when the agent exceeds some target value or a penalty when it violates some performance constraint.

While discrete rewards can slow down convergence, they can also guide the agent toward better reward regions in the state space of the environment. For example, a region-based reward, such as a fixed reward when the agent is near a target location, can emulate final-state constraints. Also, a region-based penalty can encourage an agent to avoid certain areas of the state space.

Mixed Rewards

In many cases, providing a mixed reward signal that has a combination of continuous and discrete reward components is beneficial. The discrete reward signal can be used to drive the system away from bad states, and the continuous reward signal can improve convergence by providing a smooth reward near target states. For example, in Train DDPG Agent to Control Sliding Robot, the reward function has three components: $r_1$, $r_2$, and $r_3$.

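$$
\begin{aligned}
r_1 &= 10\left(\left(x_t^2 + y_t^2 + \theta_t^2\right) < 0.5\right) \\
r_2 &= -100\left(|x_t| \ge 20 \;||\; |y_t| \ge 20\right) \\
r_3 &= -\left(0.2\,(R_{t-1} + L_{t-1})^2 + 0.3\,(R_{t-1} - L_{t-1})^2 + 0.03\,x_t^2 + 0.03\,y_t^2 + 0.02\,\theta_t^2\right) \\
r &= r_1 + r_2 + r_3
\end{aligned}
$$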

Here:

  • $r_1$ is a region-based reward, a fixed bonus that applies only near the target location of the robot.

  • $r_2$ is a discrete signal that provides a large penalty when the robot moves far from the target location.

  • $r_3$ is a continuous QR penalty that applies for all robot states.
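The three components can be sketched in Python as shown below. This is an illustration and not the code of the referenced example. The comparison operators and signs reflect one reading of the formula, and the argument names `prev_right` and `prev_left` (standing for $R_{t-1}$ and $L_{t-1}$, assumed here to be the previous-step actuator commands) are hypothetical.

```python
def sliding_robot_reward(x, y, theta, prev_right, prev_left):
    """Mixed reward r = r1 + r2 + r3 for a planar robot (illustrative sketch).

    x, y, theta:            current position and orientation of the robot.
    prev_right, prev_left:  assumed previous-step commands R_{t-1}, L_{t-1}.
    """
    # r1: fixed bonus inside a small region around the target (the origin).
    r1 = 10.0 * float(x**2 + y**2 + theta**2 < 0.5)

    # r2: large penalty when the robot moves far from the target.
    r2 = -100.0 * float(abs(x) >= 20.0 or abs(y) >= 20.0)

    # r3: quadratic penalty on control effort and on distance from the target,
    # active in every state.
    r3 = -(0.2 * (prev_right + prev_left)**2
           + 0.3 * (prev_right - prev_left)**2
           + 0.03 * x**2 + 0.03 * y**2 + 0.02 * theta**2)

    return r1 + r2 + r3
```

At the target with zero commands the reward is 10, whereas at the boundary `x = 20` with zero commands it is -100 - 12 = -112. The smooth term `r3` provides a gradient toward the target everywhere in between.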

Reward Generation from Control Specifications

For applications where a working control system already exists, specifications such as cost functions or constraints might already be available. In these cases, you can use generateRewardFunction to automatically generate a reward function, coded in MATLAB®, that can be used as a starting point for reward design. This function allows you to generate reward functions from:

  • Cost and constraint specifications defined in an mpc or nlmpc controller object. This feature requires Model Predictive Control Toolbox™ software.

  • Performance constraints defined in Simulink® Design Optimization™ model verification blocks.

In both cases, when constraints are violated, a negative reward is calculated using penalty functions such as the exteriorPenalty (default), hyperbolicPenalty, or barrierPenalty functions.
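To illustrate the general idea of turning a constraint violation into a negative reward, the following Python sketch implements a simple exterior-style penalty, which is zero inside the bounds and positive outside. It is not the toolbox implementation of exteriorPenalty. The function names, the `method` options, and the `weight` value are assumptions made for this example.

```python
def exterior_penalty(x, x_min, x_max, method="step"):
    """Penalty that is zero when x_min <= x <= x_max and positive otherwise.

    method="step":      constant penalty of 1 outside the bounds.
    method="quadratic": squared distance to the violated bound.
    """
    violation = max(x_min - x, x - x_max, 0.0)  # distance outside the bounds
    if method == "step":
        return 1.0 if violation > 0.0 else 0.0
    if method == "quadratic":
        return violation ** 2
    raise ValueError("method must be 'step' or 'quadratic'")


def constraint_reward(x, x_min, x_max, weight=10.0):
    """Negative reward for violating the bounds; weight is a tunable assumption."""
    return -weight * exterior_penalty(x, x_min, x_max, method="quadratic")
```

A quadratic penalty grows with the size of the violation, so it tells the agent how far outside the constraint it is, while a step penalty only signals that a violation occurred.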

Starting from the generated reward function, you can tune the cost and penalty weights, use a different penalty function, and then use the resulting reward function within an environment to train an agent.

Observation Signals

When you create a custom environment, the signals you select as actions and observations depend on your application. For example, for control system applications, the integrals (and sometimes derivatives) of error signals are often useful observations. Also, for reference-tracking applications, having a time-varying reference signal as an observation is helpful.
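For a reference-tracking task, an observation vector of this kind might be assembled as in the sketch below. The class name `TrackingObservation`, the rectangular integration rule, and the ordering of the observation entries are assumptions chosen for illustration.

```python
class TrackingObservation:
    """Builds [reference, error, integral of error] at each time step."""

    def __init__(self, dt):
        self.dt = dt          # sample time of the environment, in seconds
        self.integral = 0.0   # running integral of the tracking error

    def update(self, reference, measurement):
        """Return the observation for the current step."""
        error = reference - measurement
        # Rectangular (forward Euler) approximation of the error integral.
        self.integral += error * self.dt
        return [reference, error, self.integral]


builder = TrackingObservation(dt=0.5)
first = builder.update(reference=1.0, measurement=0.0)
second = builder.update(reference=1.0, measurement=0.5)
```

Including the reference itself lets the policy react to a time-varying setpoint, and the integral term gives it information about persistent steady-state error.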

When you define your observation signals, it is best practice to include all the available environment states in the observation vector.

Failure to do so can lead to situations in which different environment states result in the same observation. For such states, the agent policy (assuming it is a static function of the observation) returns the same action. Such a policy is typically unsuccessful, because a successful policy normally needs to react to different environment states by returning different actions.

For example, an image observation of a swinging pendulum has position information but does not have enough information, by itself, to determine the pendulum velocity. In this case, a static policy that cannot sense the velocity would not be able to stabilize the pendulum. But if the velocity can be measured or estimated, adding it as an additional entry in the observation vector provides a static policy with enough information to stabilize the pendulum.

When not all states are available as observation signals (for example, because it would be unrealistic to measure them), a possible workaround is to use an estimator (as a part of the environment) that estimates the values of the unmeasured states and makes such estimates available to the agent as observations. Alternatively, you can use recurrent networks such as an LSTM in your policy. Doing so results in a policy that has states, and that might therefore be able to use its state as an internal representation of the environment state. Such a policy can consequently return different actions (based on different values of its internal state) even when there is not enough information to reconstruct the correct environment state from the current observation.
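As a minimal sketch of the estimator workaround, the class below augments an angle-only observation with a velocity estimate obtained from a filtered finite difference. The class name `VelocityEstimator`, the first-order smoothing filter, and its default coefficient are assumptions; a real application might use an observer or a Kalman filter instead.

```python
class VelocityEstimator:
    """Estimates angular velocity from successive angle measurements."""

    def __init__(self, dt, smoothing=0.5):
        self.dt = dt                # sample time, in seconds
        self.smoothing = smoothing  # 0 = raw finite difference, near 1 = heavy filtering
        self.prev_angle = None
        self.velocity = 0.0

    def observe(self, angle):
        """Return [angle, estimated velocity] for the current measurement."""
        if self.prev_angle is None:
            raw = 0.0  # no previous sample yet, so assume the pendulum is at rest
        else:
            # Backward finite difference; angle wrap-around is not handled here.
            raw = (angle - self.prev_angle) / self.dt
        # First-order low-pass filter to reduce measurement noise.
        self.velocity = self.smoothing * self.velocity + (1.0 - self.smoothing) * raw
        self.prev_angle = angle
        return [angle, self.velocity]
```

Placing such an estimator inside the environment lets a static policy distinguish states that share the same angle but differ in velocity.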
