Atomic model for stochastic environments with generalized rewards (contrast with classical planning: deterministic worlds + goals of attainment)
Optimal policies depend on the horizon and the rewards.
Classical Planning Assumptions
Stochastic/Probabilistic Planning: Markov Decision Process (MDP) Model
Types of Uncertainty Disjunctive (used by nondeterministic planning)  The next state could be any one of a set of states.
Probabilistic (used by MDPs)  The next state is drawn from a probability distribution over the set of states.
How are these models related?
Markov Decision Processes An MDP has four components: S, A, R, T:
 (finite) state set S (|S| = n)
 (finite) action set A (|A| = m)
 (Markov) transition function T(s,a,s') = Pr(s' | s,a)
 Probability of going to state s' after taking action a in state s
 How many parameters does it take to represent? (one distribution over n next states for each of the n·m state-action pairs, i.e., O(m·n^2) probabilities)
 bounded, real-valued (Markov) reward function R(s)
 Immediate reward we get for being in state s
 For example, in a goal-based domain R(s) may equal 1 for goal states and 0 for all others
 Can be generalized to include action costs: R(s,a)
 Can be generalized to be a stochastic function
Can easily generalize to countable or continuous state and action spaces (but algorithms will be different)
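As a concrete sketch of the <S, A, T, R> components above, here is a minimal Python encoding of a hypothetical two-state MDP (the states, actions, and numbers are illustrative, not from the slides):

```python
# Toy MDP <S, A, T, R>; all names and numbers are illustrative.
S = ["s0", "s1"]                      # finite state set, |S| = n = 2
A = ["stay", "go"]                    # finite action set, |A| = m = 2

# Markov transition function T(s, a, s') = Pr(s' | s, a),
# stored sparsely as T[s][a] = {s': probability}.
T = {
    "s0": {"stay": {"s0": 1.0},
           "go":   {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s1": 1.0},
           "go":   {"s0": 0.9, "s1": 0.1}},
}

# Bounded, real-valued reward R(s): a goal-like reward of 1 in s1.
R = {"s0": 0.0, "s1": 1.0}

# Sanity check: each (s, a) row must be a probability distribution.
for s in S:
    for a in A:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
```

Storing each row sparsely keeps one probability per reachable successor; a dense table would need m·n² entries, matching the parameter count noted above.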
Graphical View of MDP
Assumptions First-Order Markovian dynamics (history independence)  Pr(S_t+1 | A_t, S_t, A_t-1, S_t-1, ..., S_0) = Pr(S_t+1 | A_t, S_t)
 Next state only depends on current state and current action
First-Order Markovian reward process  Pr(R_t | A_t, S_t, A_t-1, S_t-1, ..., S_0) = Pr(R_t | A_t, S_t)
 Reward only depends on current state and action
 As described earlier, we will assume reward is specified by a deterministic function R(s)
 i.e., Pr(R_t = R(S_t) | A_t, S_t) = 1
Stationary dynamics and reward  Pr(S_t+1 | A_t, S_t) = Pr(S_k+1 | A_k, S_k) for all t, k
 The world dynamics do not depend on the absolute time
Full observability  Though we can’t predict exactly which state we will reach when we execute an action, once it is realized, we know what it is
Policies (“plans” for MDPs) Nonstationary policy [even though we have stationary dynamics and reward?]  π : S × T → A, where T is the set of nonnegative integers
 π(s,t) is the action to do at state s with t stages-to-go
 What if we want to keep acting indefinitely?
Stationary policy  π : S → A
 π(s) is action to do at state s (regardless of time)
 specifies a continuously reactive controller
Both kinds of policy assume:  full observability
 history-independence
 deterministic action choice
Value of a Policy How good is a policy π? How do we measure “accumulated” reward? A value function V : S → ℝ associates a value with each state (or with each state and time, for nonstationary π). Vπ(s) denotes the value of policy π at state s  Depends on the immediate reward, but also on what you achieve subsequently by following π
 An optimal policy is one that is no worse than any other policy at any state
The goal of MDP planning is to compute an optimal policy (method depends on how we define value)
Finite-Horizon Value Functions We first consider maximizing total reward over a finite horizon Assumes the agent has n time steps to live To act optimally, should the agent use a stationary or nonstationary policy? Put another way:  If you had only one week to live, would you act the same way as if you had fifty years to live?
Finite Horizon Problems Value (utility) depends on stages-to-go  hence so should the policy: nonstationary π(s,k)
V_π^k(s) is the k-stage-to-go value function for π  the expected total reward after executing π for k time steps: V_π^k(s) = E[ Σ_{t=0..k} R_t | π, s ] (for k = 0 this is just the immediate reward)
Here R_t and s_t are random variables denoting the reward received and the state at stage t, respectively
Computing Finite-Horizon Value Can use dynamic programming to compute V_π^k  the Markov property is critical for this
Bellman Backup
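The backup this slide illustrates can be written out explicitly; a sketch using the R(s) and T(s,a,s') notation defined earlier:

```latex
V^{k}(s) \;=\; R(s) + \max_{a \in A} \sum_{s' \in S} T(s,a,s')\, V^{k-1}(s'),
\qquad V^{0}(s) = R(s)
```

Each backup extends the horizon by one stage: the k-stage-to-go value at s is the immediate reward plus the best expected (k−1)-stage-to-go value over the actions.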
Value Iteration: Finite Horizon Case The Markov property allows exploitation of the DP principle for optimal policy construction  no need to enumerate all |A|^(n·T) possible nonstationary policies
Value Iteration
Value Iteration Note how DP is used  the optimal solution to the (k-1)-stage problem can be used without modification as part of the optimal solution to the k-stage problem
Because of the finite horizon, the policy is nonstationary What is the computational complexity?  T iterations
 At each iteration, each of the n states computes an expectation for each of the |A| actions
 Each expectation takes O(n) time
Total time complexity: O(T·|A|·n^2)  Polynomial in the number of states. Is this good?
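The finite-horizon algorithm and its O(T·|A|·n²) cost can be sketched as follows (the toy MDP and all names are illustrative):

```python
# Finite-horizon value iteration: V^0(s) = R(s), then T Bellman backups.
S = ["s0", "s1"]; A = ["stay", "go"]
T = {"s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}}}
R = {"s0": 0.0, "s1": 1.0}

def finite_horizon_vi(S, A, T, R, horizon):
    """Return (values, policy): values[k][s] is the k-stage-to-go value,
    policy[k][s] the optimal action with k stages to go (k >= 1)."""
    values = [{s: R[s] for s in S}]              # V^0(s) = R(s)
    policy = [None]                              # no action choice at k = 0
    for k in range(1, horizon + 1):              # T iterations ...
        Vprev, Vk, pik = values[-1], {}, {}
        for s in S:                              # ... over n states ...
            best_a, best_q = None, float("-inf")
            for a in A:                          # ... |A| actions each ...
                # ... and each expectation is O(n): total O(T*|A|*n^2).
                q = R[s] + sum(p * Vprev[s2] for s2, p in T[s][a].items())
                if q > best_q:
                    best_a, best_q = a, q
            Vk[s], pik[s] = best_q, best_a
        values.append(Vk)
        policy.append(pik)
    return values, policy

values, policy = finite_horizon_vi(S, A, T, R, horizon=3)
```

The returned policy is indexed by stages-to-go, i.e., it is nonstationary, as the slides require for the finite-horizon case.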
Summary: Finite Horizon Resulting policy is optimal  convince yourself of this
Note: optimal value function is unique, but optimal policy is not  Many policies can have same value
Discounted Infinite Horizon MDPs Defining value as total reward is problematic with infinite horizons  many or all policies have infinite expected reward
 some MDPs are ok (e.g., zero-cost absorbing states)
“Trick”: introduce a discount factor 0 ≤ β < 1  future rewards are discounted by β per time step, so a reward t steps away is scaled by β^t
Note: Motivation for discounting: economic? probability of failure at each step? mathematical convenience?
Notes: Discounted Infinite Horizon An optimal policy maximizes value at each state Optimal policies are guaranteed to exist (Howard, 1960) We can restrict attention to stationary policies  i.e., there is always an optimal stationary policy
 Why change the action at state s at a new time t?
We define V*(s) = V_π(s) for some optimal policy π
Policy Evaluation Value equation for a fixed policy π: V_π(s) = R(s) + β Σ_s' Pr(s' | s, π(s)) V_π(s') How can we compute the value function for a policy?  we are given R and Pr
 it is a simple linear system with n variables (each variable is the value of a state) and n constraints (one value equation for each state)
 Use linear algebra (e.g., matrix inverse)
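The n-equation linear system can be solved directly; here is a sketch on a toy MDP (the policy, discount, and all numbers are illustrative) that builds (I − βP_π)v = r and solves it with plain Gaussian elimination:

```python
# Exact policy evaluation: solve (I - beta * P_pi) v = r, one value
# equation per state. Toy MDP, policy, and discount are illustrative.
S = ["s0", "s1"]
T = {"s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}}}
R = {"s0": 0.0, "s1": 1.0}
pi = {"s0": "go", "s1": "stay"}       # fixed stationary policy
beta = 0.9

n = len(S)
idx = {s: i for i, s in enumerate(S)}
# Augmented matrix [I - beta*P_pi | r].
M = [[(1.0 if i == j else 0.0) for j in range(n)] + [R[S[i]]]
     for i in range(n)]
for s in S:
    for s2, p in T[s][pi[s]].items():
        M[idx[s]][idx[s2]] -= beta * p

# Gaussian elimination with partial pivoting, then back-substitution.
for col in range(n):
    piv = max(range(col, n), key=lambda r: abs(M[r][col]))
    M[col], M[piv] = M[piv], M[col]
    for row in range(col + 1, n):
        f = M[row][col] / M[col][col]
        for c in range(col, n + 1):
            M[row][c] -= f * M[col][c]
v = [0.0] * n
for row in reversed(range(n)):
    v[row] = (M[row][n] - sum(M[row][c] * v[c]
                              for c in range(row + 1, n))) / M[row][row]
V_pi = {s: v[idx[s]] for s in S}
```

In practice one would call a library solver (e.g., numpy.linalg.solve); the elimination is written out only to keep the sketch dependency-free.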
Computing an Optimal Value Function Bellman equation for the optimal value function: V*(s) = R(s) + β max_a Σ_s' Pr(s' | s, a) V*(s')  Bellman proved this is always true
How can we compute the optimal value function?  The MAX operator makes the system nonlinear, so the problem is more difficult than policy evaluation
Notice that the optimal value function is a fixed-point of the Bellman backup operator B  B takes a value function as input and returns a new value function
Value Iteration Can compute optimal policy using value iteration, just like finitehorizon problems (just include discount term) Will converge to the optimal value function as k gets large. Why?
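A sketch of the discounted-case loop on the toy MDP from before (the discount and tolerance are illustrative). bellman_backup is the operator B, and iteration stops when successive value functions agree to within eps in the max-norm:

```python
# Discounted value iteration: iterate the Bellman backup operator B.
S = ["s0", "s1"]; A = ["stay", "go"]
T = {"s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}}}
R = {"s0": 0.0, "s1": 1.0}
beta, eps = 0.9, 1e-10

def bellman_backup(V):
    """B: takes a value function, returns a new value function."""
    return {s: R[s] + beta * max(sum(p * V[s2] for s2, p in T[s][a].items())
                                 for a in A)
            for s in S}

V = {s: 0.0 for s in S}
while True:
    V_new = bellman_backup(V)
    # Stop when ||V_k - V_{k-1}|| <= eps in the max-norm.
    done = max(abs(V_new[s] - V[s]) for s in S) <= eps
    V = V_new
    if done:
        break
```

Stopping when successive iterates are within eps guarantees V is within eps·β/(1−β) of V* in the max-norm, so the tolerance directly controls the final error.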
Convergence B[V] is a contraction operator on value functions  For any V and V' we have ||B[V] − B[V']|| ≤ β ||V − V'||
 Here ||V|| is the max-norm, which returns the maximum absolute element of the vector
 So applying a Bellman backup to any two value functions causes them to get closer together in the max-norm sense.
Convergence is assured  for any V:  ||V* − B[V]|| = ||B[V*] − B[V]|| ≤ β ||V* − V||
 so applying a Bellman backup to any value function brings us closer to V* by a factor β
 thus, the Banach fixed-point theorem ensures convergence in the limit
When to stop value iteration? When ||V_k − V_k-1|| ≤ ε  this ensures ||V_k − V*|| ≤ εβ/(1−β)
Contraction property proof sketch Note that for any functions f and g:  |max_x f(x) − max_x g(x)| ≤ max_x |f(x) − g(x)| We can use this to show that ||B[V] − B[V']|| ≤ β ||V − V'||
How to Act Given a Vk from value iteration that closely approximates V*, what should we use as our policy? Use the greedy policy: π_g(s) = argmax_a Σ_s' Pr(s' | s, a) Vk(s') Note that the value of the greedy policy may not be equal to Vk Let VG be the value of the greedy policy. How close is VG to V*?
 We can show that the greedy policy is not too far from optimal if Vk is close to V*
In particular, if Vk is within ε of V*, then VG is within 2εβ/(1−β) of V* (if ε is 0.001 and β is 0.9, the bound is 0.018) Furthermore, there exists a finite ε s.t. the greedy policy is optimal  That is, even if the value estimate is off, the greedy policy is optimal once the estimate is close enough
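Extracting the greedy policy from an approximate Vk is a one-step lookahead; a sketch on the toy MDP from before (the number of backups is arbitrary):

```python
# Greedy policy extraction from an approximate value function Vk.
S = ["s0", "s1"]; A = ["stay", "go"]
T = {"s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}}}
R = {"s0": 0.0, "s1": 1.0}
beta = 0.9

def backup(V):
    return {s: R[s] + beta * max(sum(p * V[s2] for s2, p in T[s][a].items())
                                 for a in A)
            for s in S}

Vk = {s: 0.0 for s in S}
for _ in range(10):                   # a handful of backups: Vk is rough
    Vk = backup(Vk)

# pi_greedy(s) = argmax_a sum_s' Pr(s'|s,a) Vk(s')
pi_greedy = {s: max(A, key=lambda a: sum(p * Vk[s2]
                                         for s2, p in T[s][a].items()))
             for s in S}
```

Even though Vk is still far from V* after only 10 backups, the greedy policy here already coincides with the optimal one, illustrating the slide's point that greedy becomes optimal once the value estimate is close enough.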
Policy Iteration Given a fixed policy, we can compute its value exactly: V_π(s) = R(s) + β Σ_s' Pr(s' | s, π(s)) V_π(s') Policy iteration exploits this: it alternates steps of policy evaluation and policy improvement
Policy Iteration Notes Each step of policy iteration is guaranteed to strictly improve the policy at some state when improvement is possible Convergence is assured (Howard)  intuitively: there are no local maxima in value space, and each step must improve the value; since there is a finite number of policies, PI converges to an optimal policy
Gives exact value of optimal policy
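Policy iteration itself can be sketched as follows (toy MDP as before; for brevity the evaluation step here is iterative rather than the exact linear-algebra solve the slides mention):

```python
# Policy iteration: alternate policy evaluation and greedy improvement.
S = ["s0", "s1"]; A = ["stay", "go"]
T = {"s0": {"stay": {"s0": 1.0}, "go": {"s0": 0.2, "s1": 0.8}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s0": 0.9, "s1": 0.1}}}
R = {"s0": 0.0, "s1": 1.0}
beta = 0.9

def evaluate(pi, eps=1e-12):
    """Approximate V_pi by iterating its (linear) value equation."""
    V = {s: 0.0 for s in S}
    while True:
        Vn = {s: R[s] + beta * sum(p * V[s2]
                                   for s2, p in T[s][pi[s]].items())
              for s in S}
        if max(abs(Vn[s] - V[s]) for s in S) <= eps:
            return Vn
        V = Vn

pi = {s: "stay" for s in S}            # arbitrary initial policy
while True:
    V = evaluate(pi)                   # policy evaluation
    improved = {s: max(A, key=lambda a: sum(p * V[s2]     # improvement
                                            for s2, p in T[s][a].items()))
                for s in S}
    if improved == pi:                 # no state can improve: pi is optimal
        break
    pi = improved
```

On this toy problem PI converges in two improvement steps; each outer iteration is more expensive than a value-iteration backup because of the inner evaluation loop, matching the VI-vs-PI tradeoff discussed next.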
Value Iteration vs. Policy Iteration Which is faster? VI or PI  It depends on the problem
VI takes more iterations than PI, but PI requires more time on each iteration  PI must perform policy evaluation on each step which involves solving a linear system
 Also, VI can be done in an asynchronous and prioritized-update fashion.
Complexity:  There are at most exp(n) policies, so PI is no worse than exponential time in number of states
 Empirically O(n) iterations are required
 Still no polynomial bound on the number of PI iterations (open problem)!
Markov Decision Process (MDP)
 S: a set of states
 A: a set of actions
 Pr(s' | s,a): transition model
 C(s,a,s'): cost model
 G: set of goals
 s0: start state
 γ: discount factor
 R(s,a,s'): reward model
Examples of MDPs Goaldirected, Indefinite Horizon, Cost Minimization MDP  <S, A, Pr, C, G, s0>
 Most often studied in planning community
Infinite Horizon, Discounted Reward Maximization MDP  <S, A, Pr, R, γ>
 Most often studied in reinforcement learning
Goaldirected, Finite Horizon, Prob. Maximization MDP  <S, A, Pr, G, s0, T>
 Also studied in planning community
Oversubscription Planning: Non-absorbing goals, Reward Maximization MDP  <S, A, Pr, G, R, s0>
 Relatively recent model
SSPP—Stochastic Shortest Path Problem An MDP with initial and goal states Standard MDPs don't have a notion of an “initial” and a “goal” state (process orientation instead of “task” orientation)  Goals are sort of modeled by reward functions
 Allows pretty expressive goals (in theory)
 Normal MDP algorithms don’t use initial state information (since policy is supposed to cover the entire search space anyway).
 Could consider “envelope extension” methods
 Compute a “deterministic” plan (which gives the policy for some of the states); extend the policy to other states that are likely to be reached during execution
 RTDP methods
Bellman Equations for Cost Minimization MDP (absorbing goals) [also called Stochastic Shortest Path] <S, A, Pr, C, G, s0> Define J*(s) {optimal cost} as the minimum expected cost to reach a goal from this state. J* should satisfy the following equation: J*(s) = 0 if s ∈ G, and otherwise J*(s) = min_a Σ_s' Pr(s' | s,a) [C(s,a,s') + J*(s')]
Bellman Equations for infinite horizon discounted reward maximization MDP <S, A, Pr, R, s0, γ> Define V*(s) {optimal value} as the maximum expected discounted reward from this state. V* should satisfy the following equation: V*(s) = max_a Σ_s' Pr(s' | s,a) [R(s,a,s') + γ V*(s')]
Bellman Equations for probability maximization MDP <S, A, Pr, G, s0, T> Define P*(s,t) {optimal probability} as the maximum probability of reaching a goal from this state with t timesteps remaining. P* should satisfy the following equation: P*(s,t) = 1 if s ∈ G; P*(s,0) = 0 for s ∉ G; otherwise P*(s,t) = max_a Σ_s' Pr(s' | s,a) P*(s', t−1)
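The probability-maximization recursion can be sketched by dynamic programming over stages-to-go (the chain domain, goal set, and horizon here are illustrative):

```python
# P*(s, t): max probability of reaching a goal with t steps remaining.
S = ["s0", "s1", "g"]; A = ["a"]          # a tiny one-action chain
T = {"s0": {"a": {"s0": 0.5, "s1": 0.5}},
     "s1": {"a": {"s0": 0.5, "g": 0.5}},
     "g":  {"a": {"g": 1.0}}}             # the goal is absorbing
G = {"g"}

def prob_max(horizon):
    # Base case: P*(s, 0) = 1 iff s is already a goal state.
    P = {s: (1.0 if s in G else 0.0) for s in S}
    for t in range(1, horizon + 1):
        # P*(s, t) = 1 if s in G, else max_a sum_s' Pr(s'|s,a) P*(s', t-1)
        P = {s: 1.0 if s in G else
                max(sum(p * P[s2] for s2, p in T[s][a].items()) for a in A)
             for s in S}
    return P

P2 = prob_max(2)
```

With two steps remaining, the only way to reach the goal from s0 is s0 → s1 → g, each transition with probability 0.5, so P*(s0, 2) = 0.25.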
Modeling Soft-goal problems as deterministic MDPs Consider the net-benefit problem, where actions have costs, goals have utilities, and we want a plan with the highest net benefit How do we model this as an MDP?  (wrong idea): Make every state in which any subset of goals holds into a sink state with reward equal to the cumulative sum of the utilities of the goals.
 Problem: what if achieving g1 & g2 necessarily leads you through a state where g1 is already true?
 (correct version): Introduce a new fluent “done” and a dummy action DoneDeal. The action is applicable in any state and asserts the fluent “done”. All “done” states are sink states; their reward is equal to the sum of the rewards of the individual states.
Heuristic Search vs. Dynamic Programming (Value/Policy Iteration) VI and PI approaches use the dynamic programming update: set the value of a state in terms of the maximum expected value achievable by doing actions from that state. They do the update for every state in the state space  Wasteful if we know the initial state(s) the agent is starting from
