# ~/Chapter 3 - Finite Markov Decision Processes

## Brandon Rozek

PhD Student @ RPI studying Automated Reasoning in AI and Linux Enthusiast.

Markov Decision processes are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations, or states, and through those future rewards. Thus MDPs involve delayed reward and the need to trade-off immediate and delayed reward. Whereas in bandit problems we estimated the value of $q_(a)$ of each action $a$, in MDPs we estimate the value of $q_(s, a)$ of each action $a$ in state $s$.

MDPs are a mathematically idealized form of the reinforcement learning problem. As we will see in artificial intelligence, there is often a tension between breadth of applicability and mathematical tractability. This chapter will introduce this tension and discuss some of the trade-offs and challenges that it implies.

## Agent-Environment Interface

The learner and decision maker is called the agent. The thing it interacts with is called the environment. These interact continually, the agent selecting actions and the environment responding to these actions and presenting new situations to the agent.

The environment also give rise to rewards, special numerical values that the agent seeks to maximize over time through its choice of actions.

![1536511147205](/home/rozek/Documents/Research/Reinforcement Learning/1536511147205.png)

To make the future paragraphs clearer, a Markov Decision Process is a discrete time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of the decision maker.

In a finite MDP, the set of states, actions, and rewards all have a finite number of elements. In this case, the random variables R_t and S_t have a well defined discrete probability distribution dependent only on the preceding state and action. $$p(s^\prime | s,a) = \sum_{r \in \mathcal{R}}{p(s^\prime, r|s, a)}$$ Breaking down the above formula, it’s just an instantiation of the law of total probability. If you partition the probabilistic space by the reward, if you sum up each partition you should get the overall space. This formula has a special name: the state-transition probability.

From this we can compute the expected rewards for each state-action pair by multiplying each reward with the probability of getting said reward and summing it all up. $$r(s, a) = \sum_{r \in \mathcal{R}}{r}\sum_{s^\prime \in \mathcal{S}}{p(s^\prime, r|s,a)}$$ The expected reward for a state-action-next-state triple is $$r(s, a, s^\prime) = \sum_{r \in \mathcal{R}}{r\frac{p(s^\prime, r|s,a)}{p(s^\prime|s,a)}}$$ I wasn’t able to piece together this function in my head like the others. Currently I imagine it as if we took the formula before and turned the universe of discourse from the universal set to the state where $s^\prime$ occurred.

The MDP framework is abstract and flexible and can be applied to many different problems in many different ways. For example, the time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive states of decision making and acting.

### Agent-Environment Boundary

In particular, the boundary between agent and environment is typically not the same as the physical boundary of a robot’s or animals’ body. Usually, the boundary is drawn closer to the agent than that. For example, the motors and mechanical linkages of a robot and its sensing hardware should usually be considered parts of the environment rather than parts of the agent.

The general rule we follow is that anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment. We do not assume that everything in the environment is unknown to the agent. For example, the agent often knows quite a bit about how its rewards are computed as a function of its actions and the states in which they are taken. But we always consider the reward computation to be external to the agent because it defines the task facing the agent and thus must be beyond its ability to change arbitrarily. The agent-environment boundary represents the limit of the agent’s absolute control, not of its knowledge.

This framework breaks down whatever the agent is trying to achieve to three signals passing back and forth between an agent and its environment: one signal to represent the choices made by the agent, one signal to represent the basis on which the choices are made (the states), and one signal to define the agent’s goal (the rewards).

### Example 3.4: Recycling Robot MDP

Recall that the agent makes a decision at times determined by external events. At each such time the robot decides whether it should

(1) Actively search for a can

(2) Remain stationary and wait for someone to bring it a can

(3) Go back to home base to recharge its battery

Suppose the environment works as follows: the best way to find cans is to actively search for them, but this runs down the robot’s battery, whereas waiting does not. Whenever the robot is searching the possibility exists that its battery will become depleted. In this case, the robot must shut down and wait to be rescued (producing a low reward.)

The agent makes its decisions solely as a function of the energy level of the battery, It can distinguish two levels, high and low, so that the state set is $\mathcal{S} = {high, low}$. Let us call the possible decisions – the agent’s actions – wait, search, and recharge. When the energy level is high, recharging would always be foolish, so we do not include it in the action set for this state. The agent’s action sets are \begin{align*} \mathcal{A}(high) &= {search, wait} \ \mathcal{A}(low) &= {search, wait, recharge} \end{align*} If the energy level is high, then a period of active search can always be completed without a risk of depleting the battery. A period of searching that begins with a high energy level leaves the energy level high with a probability of $\alpha$ and reduces it to low with a probability of $(1 - \alpha)$. On the other hand, a period of searching undertaken when the energy level is low leaves it low with a probability of $\beta$ and depletes the battery with a probability of $(1 - \beta)$. In the latter case, the robot must be rescued, and the batter is then recharged back to high.

Each can collected by the robot counts as a unit reward, whereas a reward of $-3$ occurs whenever the robot has to be rescued. Let $r_{search}$ and $r_{wait}$ with $r_{search } > r_{wait}$, respectively denote the expected number of cans the robot will collect while searching and waiting. Finally, to keep things simple, suppose that no cans can be collected ruing a run home for recharging and that no cans can be collected on a step in which the battery is depleted.

### Optimality and Approximation

For some kinds of tasks we are interested in,optimal policies can be generated only with extreme computational cost. A critical aspect of the problemfacing the agent is always teh computational power available to it, in particular, the amount of computation it can perform in a single time step.

The memory available is also an important constraint. A large amount of memory is often required to build up approximations of value functions, policies, and models. In the case of large state sets, functions must be approximated using some sort of more compact parameterized function representation.

This presents us with unique oppurtunities for achieving useful approximations. For example, in approximating optimal behavior, there may be many states that the agent faces with such a low probability that selecting suboptimal actions for them has little impact on the amount of reward the agent receives.

The online nature of reinforcement learning which makes it possible to approximate optimal policies in ways that put more effort into learning to make good decisions for frequently encountered states at the expense of infrequent ones is the key property that distinguishes reinforcement learning from other approaches to approximately solve MDPs.

### Summary

Let us summarize the elements of the reinforcement learning problem.

Reinforcement learning is about learning from interaction how to behave in order to achieve a goal. The reinforcement learning agent and its environment interact over a sequence of discrete time steps.

The actions are the choices made by the agent; the states are the basis for making the choice; and the rewards are the basis for evaluating the choices.

Everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known.

A policy is a stochastic rule by which the agent selects actions as a function of states.

When the reinforcement learning setup described above is formulated with well defined transition probabilities it constitutes a Markov Decision Process (MDP)

The return is the function of future rewards that the agent seeks to maximize. It has several different definitions depending on the nature of the task and whether one wishes to discount delayed reward.

• The un-discounted formulation is appropriate for episodic tasks, in which the agent-environment interaction breaks naturally into episodes
• The discounted formulation is appropriate for continuing tasks in which the interaction does not naturally break into episodes but continue without limit

A policy’s value functions assign to each state, or state-action pair, the expected return from that state, or state-action pair, given that the agent uses the policy. The optimal value functions assign to each state, or state-action pair, the largest expected return achievable by any policy. A policy whose value unctions are optimal is an optimal policy.

Even if the agent has a complete and accurate environment model, the agent is typically unable to perform enough computation per time step to fully use it. The memory available is also an important constraint. Memory may be required to build up accurate approximations of value functions, policies, and models. In most cases of practical interest there are far more states that could possibly be entries in a table, and approximations must be made.