MDPs.jl: Markov Decision Processes
Models
This section describes the data structures that can be used to model various types of MDPs.
MDP
This is a general MDP data structure that supports basic functions. See IntMDP and TabMDP below for models that can be used more directly for modeling and solving.
MDPs.MDP — Type
A general MDP representation with time-independent transition probabilities and rewards. The model makes no assumption that the states can be efficiently enumerated, but it assumes that there is a small number of actions.
S: state type
A: action type
MDPs.getnext — Method
getnext(model, s, a)
Compute the next states using the transition function.
Returns an object that can return a NamedTuple with states, probabilities, and transitions as AbstractArrays. This is a more efficient version of transition (when supported).
The standard implementation is not memory efficient.
MDPs.transition — Function
(sn, p, r) ∈ transition(model, s, a)
Return an iterator with next states, probabilities, and rewards for model taking an action a in state s.
Use getnext instead, which is more efficient and convenient to use.
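A minimal sketch of consuming the iterator to form an expected immediate reward; the gambler domain (documented in the Domains section below) and the specific state-action pair are only illustrative:
using MDPs
model = Domains.Gambler.Ruin(0.7, 10)
# expected immediate reward of taking action 1 in state 1
q = sum(p * r for (sn, p, r) ∈ transition(model, 1, 1))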
MDPs.valuefunction — Function
valuefunction(mdp, state, valuefunction)
Evaluates the value function for an mdp in a state.
Tabular MDPs
This is an MDP instance that assumes that the states and actions are tabular.
MDPs.TabMDP — Type
An abstract tabular Markov Decision Process, which is specified by a transition function. Functions that should be defined for any subtype for value and policy iterations to work are: state_count, states, action_count, actions, and transition.
Generally, states and actions are 1-based.
The methods state_count and states should only include non-terminal states.
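A minimal sketch of a custom tabular MDP subtype; the exact argument lists of these methods (in particular whether action_count and actions take the state) are assumptions and may need to be adapted to the package's actual interface:
using MDPs

# a two-state chain: action 1 stays put, action 2 moves to the other state
struct TwoState <: TabMDP end

MDPs.state_count(::TwoState) = 2
MDPs.states(::TwoState) = 1:2
MDPs.action_count(::TwoState, s::Int) = 2
MDPs.actions(::TwoState, s::Int) = 1:2
# iterator of (nextstate, probability, reward) triples
MDPs.transition(::TwoState, s::Int, a::Int) =
    a == 1 ? ((s, 1.0, 0.0),) : ((3 - s, 1.0, 1.0),)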
MDPs.save_mdp — Method
save_mdp(T::DataFrame, model::TabMDP)
Convert an MDP model to a DataFrame representation with 0-based indices.
Important: The MDP representation uses 1-based indexes, while the output DataFrame is 0-based for backwards compatibility.
The columns are: idstatefrom, idaction, idstateto, probability, and reward.
Integral MDPs
This is a specific MDP instance in which states and actions are specified by integers.
MDPs.FormatError — Type
An incorrect parameter value.
MDPs.IntAction — Type
Represents transitions that follow an action. The lengths of nextstate, probability, and reward must be the same.
Nextstate values may not be unique, and each transition can have a different reward associated with it. The transitions are not aggregated, to allow for computing the risk of a transition. Aggregating the values by state would change the risk value of the transition.
MDPs.IntMDP — Type
MDP with integral states and stationary transitions. State and action indexes are all 1-based integers.
MDPs.IntState — Type
Represents a discrete state.
MDPs.compress — Method
compress(nextstate, probability, reward)
Combines multiple transitions to the same state into a single transition. The reward is computed as a weighted average of the individual rewards, assuming an expected-reward objective.
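For example, two parallel transitions to state 2 with probabilities 0.1 and 0.3 and rewards 4.0 and 0.0 combine into one transition with probability 0.4 and reward (0.1⋅4.0 + 0.3⋅0.0)/0.4 = 1.0. A sketch of the same aggregation in plain Julia (illustrating the formula, not the package's internal implementation):
nextstate   = [2, 2, 3]
probability = [0.1, 0.3, 0.6]
reward      = [4.0, 0.0, 1.0]

# group by next state: probabilities add up, rewards are probability-weighted
agg = Dict{Int,Tuple{Float64,Float64}}()
for (s, p, r) in zip(nextstate, probability, reward)
    pacc, racc = get(agg, s, (0.0, 0.0))
    agg[s] = (pacc + p, racc + p * r)
end
compressed = [(s, p, pr / p) for (s, (p, pr)) in agg]
# compressed contains (2, 0.4, 1.0) and (3, 0.6, 1.0), up to ordering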
MDPs.load_mdp — Method
load_mdp(input, idoutcome)
Load the MDP from input. The function assumes 0-based indexes of states and actions, which are transformed to 1-based indexes.
Input formats are anything that is supported by DataFrame. Some options are CSV.File(...) or Arrow.Table(...).
States that have no transition probabilities defined are assumed to be terminal and are set to transition to themselves.
If docombine is true, then the method combines transitions that have the same statefrom, action, and stateto. This makes risk-neutral value iteration faster, but may change the value of a risk-averse solution.
The formulation allows for multiple transitions s,a → s'. When this is the case, the transition probability is assumed to be their sum and the reward is the weighted average of the rewards.
The method can also process CSV files for MDPO/MMDP, in which case idoutcome specifies a 1-based outcome to load.
Examples
Load the model from a CSV
using CSV: File
using MDPs
filepath = joinpath(dirname(pathof(MDPs)), "..",
"data", "riverswim.csv")
model = load_mdp(File(filepath); idoutcome = 1)
state_count(model)
# output
20
Load the model from an Arrow file (a binary tabular file format)
using MDPs, Arrow
filepath = joinpath(dirname(pathof(MDPs)), "..",
"data", "inventory.arr")
model = load_mdp(Arrow.Table(filepath))
state_count(model)
# output
21
MDPs.make_int_mdp — Method
make_int_mdp(Ps, rs)
Build an IntMDP from a list of transition probabilities Ps and reward vectors rs, one for each action in the MDP. If rs are vectors, then they are assumed to be state-action rewards. If rs are matrices, then they are assumed to be state-action-state rewards. Each row of a transition matrix (and of the reward matrix) represents the probabilities of transitioning to next states.
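A construction sketch of a small two-action model from dense matrices; the specific numbers are only illustrative:
using MDPs
# three states, two actions; each row sums to one
P1 = [0.8 0.2 0.0;
      0.1 0.9 0.0;
      0.0 0.0 1.0]
P2 = [0.5 0.5 0.0;
      0.0 0.5 0.5;
      0.0 0.0 1.0]
r1 = [1.0, 0.0, 0.0]   # state-action rewards for action 1
r2 = [0.0, 2.0, 0.0]   # state-action rewards for action 2
model = make_int_mdp([P1, P2], [r1, r2])
state_count(model)     # 3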
MDPs.make_int_mdp — Method
make_int_mdp(mdp::TabMDP, docompress = false)
Transform any tabular MDP mdp to a numeric one. This helps to accelerate operations and value function computation. The actions are also turned into 1-based integer values.
The option docompress combines transitions to the same state into a single transition. This improves efficiency in risk-neutral settings, but may change the outcome in risk-averse settings.
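For instance, a tabular domain model can be converted before solving; a short sketch:
using MDPs
model = Domains.Gambler.Ruin(0.7, 10)
# convert to an integral MDP and combine duplicate transitions
imdp = make_int_mdp(model, true)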
Objectives
MDPs.FiniteH — Type
FiniteH(γ, T)
Finite-horizon discounted model. The discount factor γ can be in [0,1] and the horizon T must be a positive integer. The optimal policy is Markov but time-dependent.
MDPs.InfiniteH — Type
InfiniteH(γ)
Infinite-horizon discounted objective. The discount factor γ can be in [0,1]. The optimal policy is stationary.
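A short construction sketch of the two discounted objectives:
using MDPs
obj_inf = InfiniteH(0.95)     # infinite horizon, discount 0.95
obj_fin = FiniteH(0.95, 20)   # 20 decision epochs, discount 0.95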
MDPs.Markov — Type
Objective solved by a randomized Markov non-stationary policy. In other words, the solution is time-dependent.
MDPs.MarkovDet — Type
Objective solved by a deterministic Markov non-stationary policy. In other words, the solution is time-dependent.
MDPs.Objective — Type
Abstract objective for an MDP.
MDPs.Stationary — Type
Objective that is solved by a randomized stationary policy.
MDPs.StationaryDet — Type
Objective that is solved by a deterministic stationary policy.
MDPs.TotalReward — Type
TotalReward()
Total reward criterion. The objective is to maximize the sum of the undiscounted rewards.
This objective can generally only be applied to transient states, which have a terminal state; see isterminal for more details.
MDPs.horizon — Function
horizon(objective)
Return the horizon length for objective.
Algorithms
MDPs.value_iteration — Function
value_iteration(model, objective[, π]; [v_terminal, iterations = 1000, ϵ = 1e-3])
Compute the value function and policy for a tabular MDP model with an objective objective. The time steps go from 1 to T+1; the last decision happens at time T.
The supported objectives are FiniteH and InfiniteH. When a real number γ ∈ [0,1] is provided instead, the objective is treated as an infinite-horizon problem.
Finite Horizon
Use finite-horizon value iteration for a tabular MDP model with a discount factor γ and horizon T (time steps 1 to T+1); the last decision happens at time T. Returns a vector of value functions, one for each time step.
The argument v_terminal represents the terminal value function. It should be provided as a function that maps the state id to its terminal value (at time T+1). If this value is provided, then it is used in place of 0.
If a policy π is provided, then the algorithm evaluates it.
Infinite Horizon
For a Bellman error ϵ, the computed value function is guaranteed to be within ϵ ⋅ γ / (1 - γ) of the optimal value function (all in terms of the L_∞ norm).
The value function update is parallelized when parallel is true. This is also known as a Jacobi-type value iteration (as opposed to Gauss-Seidel).
Note that for the purpose of computing the greedy policy, minimizing the span seminorm is more efficient, but the goal of this function is also to compute the value function.
See also
value_iteration!
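A usage sketch on the gambler domain; the structure of the returned solution object is not shown and is not assumed beyond what is documented above:
using MDPs
model = Domains.Gambler.Ruin(0.7, 10)
# infinite-horizon value iteration with discount 0.95
sol = value_iteration(model, InfiniteH(0.95); iterations = 10_000, ϵ = 1e-6)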
MDPs.value_iteration! — Function
value_iteration!(v, π, model, objective; [v_terminal])
Run value iteration using the provided v and π storage for the value function and the policy. See value_iteration for more details.
Only supports the FiniteH objective.
MDPs.mrp! — Method
mrp!(P_π, r_π, model, π)
Save the transition matrix P_π and reward vector r_π for the MDP model and policy π. Also supports terminal states.
Does not support duplicate entries in transition probabilities.
MDPs.mrp — Method
mrp(model, π)
Compute the transition matrix P_π and reward vector r_π for the MDP model and policy π. See mrp! for more details.
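The returned matrix and vector can be used for policy evaluation by solving a linear system. A sketch, assuming the policy is given as a vector of 1-based action indices and that mrp returns the pair (P_π, r_π) as dense arrays:
using MDPs, LinearAlgebra
model = Domains.Gambler.Ruin(0.7, 10)
pol = ones(Int, state_count(model))   # always take action 1 (bet nothing)
P_π, r_π = mrp(model, pol)
γ = 0.95
v_π = (I - γ * P_π) \ r_π             # solves v_π = r_π + γ P_π v_π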
MDPs.mrp_sparse — Method
mrp_sparse(model, π)
Compute a sparse transition matrix P_π and reward vector r_π for the MDP model and policy π.
This function does not support duplicate entries in transition probabilities.
MDPs.policy_iteration — Method
policy_iteration(model, γ; [iterations = 1000])
Implements policy iteration for the MDP model with a discount factor γ. The algorithm runs until the policy stops changing or the number of iterations is reached.
Does not support duplicate entries in transition probabilities.
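A short usage sketch on the gambler domain:
using MDPs
model = Domains.Gambler.Ruin(0.7, 10)
sol = policy_iteration(model, 0.95; iterations = 500)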
MDPs.policy_iteration_sparse — Method
policy_iteration_sparse(model, γ; iterations)
Implements policy iteration for the MDP model with a discount factor γ. The algorithm runs until the policy stops changing or the number of iterations is reached. The value function is computed using sparse linear algebra.
Does not support duplicate entries in transition probabilities.
MDPs.alltransient — Method
alltransient(model, lpmf, [silent = true])
Checks if the MDP model has all transient policies. A policy is transient if it is guaranteed to terminate with positive probability after some finite number of steps.
Note that the function returns true only if all policies are transient.
The parameters match the use in lp_solve.
MDPs.anytransient — Method
anytransient(model, lpmf, [silent = true])
Checks if the MDP model has some transient policy. A policy is transient if it is guaranteed to terminate with positive probability after some finite number of steps.
Note that the function returns true even when there are some policies that are not transient.
The parameters match the use in lp_solve.
MDPs.isterminal — Method
isterminal(model, state)
Checks that the state is terminal in model. A state is terminal if it
- has a single action,
- transitions to itself,
- has a reward 0.
Example
using MDPs
model = Domains.Gambler.RuinTransient(0.5, 4, true)
isterminal.((model,), states(model))[1:2]
# output
2-element BitVector:
1
0
MDPs.lp_solve — Method
lp_solve(model, lpmf, [silent = true])
Implements the linear program primal problem for an MDP model with a discount factor γ. It uses the JuMP model constructed by JuMP.Model(lpmf) as the linear program solver and returns the state values found by the solver.
Example
using MDPs, HiGHS
model = Domains.Gambler.RuinTransient(0.5, 4, true)
lp_solve(model, TotalReward(), HiGHS.Optimizer).policy
# output
5-element Vector{Int64}:
1
2
3
2
1
Value Function Manipulation
MDPs.make_value — Method
make_value(model, objective)
Creates an undefined policy and value function for the model and objective.
See Also
value_iteration!
MDPs.bellman — Method
bellman(model, obj, [t=0,] s, v)
Compute the Bellman operator for state s and value function v, assuming an objective obj.
MDPs.bellmangreedy — Method
bellmangreedy(model, obj, [t=0,] s, v)
Compute the Bellman operator and greedy action for state s and value function v, assuming an objective obj. The optional time parameter t allows for time-dependent updates.
The function uses qvalue to compute the Bellman operator and the greedy policy.
MDPs.greedy! — Method
greedy!(π, model, obj, v)
Update the policy π with the greedy policy for value function v, MDP model, and objective obj.
MDPs.greedy — Method
greedy(model, obj, v)
Compute the greedy action for all states and value function v, assuming an objective obj and time t=0.
MDPs.greedy — Method
greedy(model, obj, [t=0,] s, v)
Compute the greedy action for state s and value function v, assuming an objective obj.
If s is not provided, then the greedy action is computed for all states. The model must support the states function.
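A sketch of both forms, assuming the value function is a plain vector indexed by state:
using MDPs
model = Domains.Gambler.Ruin(0.7, 10)
v = zeros(state_count(model))
pol = greedy(model, InfiniteH(0.95), v)      # greedy action for every state
a1  = greedy(model, InfiniteH(0.95), 1, v)   # greedy action in state 1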
MDPs.qvalue — Method
qvalue(model, objective, [t=0,] s, a, v)
Compute the state-action value for state s, action a, and value function v for an objective.
There is no set representation for the value function.
MDPs.qvalues! — Method
qvalues!(qvalues, model, objective, [t=0,] s, v)
Compute the state-action values for state s and value function v for the objective.
Saves the values to qvalues, which should be at least as long as the number of actions. Values of elements in qvalues that are beyond the action count are set to -Inf.
See qvalues for more information.
MDPs.qvalues — Method
qvalues(model, objective, [t=0,] s, v)
Compute the state-action values for state s and value function v for objective. There is no set representation of the value function v.
The function is tractable only if there is a small number of actions and transitions.
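A sketch of computing the action values in a single state, again assuming the value function is a plain vector over states:
using MDPs
model = Domains.Gambler.Ruin(0.7, 10)
v = zeros(state_count(model))
qs = qvalues(model, InfiniteH(0.95), 1, v)   # one value per action in state 1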
Simulation
MDPs.FPolicyM — Type
General Markov policy specified by a function (s, t) → a.
MDPs.FPolicyS — Type
General stationary policy specified by a function s → a.
MDPs.Policy — Type
Defines a policy, whether stationary deterministic, randomized, Markov, or even history-dependent. The policy should support the functions make_internal and append_history that initialize and update the internal state. The function take_action then chooses an action to take.
It is important that the tracker keeps its own internal state in order to be thread safe.
MDPs.TabPolicyMD — Type
Markov deterministic policy for tabular MDPs. The policy π has an outer array over time steps and an inner array over states.
MDPs.TabPolicySD — Type
Stationary deterministic policy for tabular MDPs.
MDPs.TabPolicyStationary — Type
Generic policy for tabular MDPs.
MDPs.Transition — Type
Information about a transition from state to nstate after taking an action. time is the time at which nstate is observed.
MDPs.append_history — Function
append_history(policy, internal, transition) :: internal
Update the internal state for a policy using the transition information.
MDPs.cumulative — Method
cumulative(rewards, γ)
Computes the cumulative discounted return from the rewards returned by the simulation function.
MDPs.make_internal — Function
make_internal(model, policy, state) -> internal
Initialize the internal state for a policy with the initial state. Returns the initial internal state.
MDPs.random_π — Method
random_π(model)
Construct a random policy for a tabular MDP.
MDPs.simulate — Method
simulate(model, π, initial, horizon, episodes; [stationary = true])
Simulate a policy π in a model and generate states and actions for horizon decisions and episodes episodes. The initial state is initial.
The policy π can be a function, an array, or an array of arrays depending on whether the policy is stationary, Markovian, deterministic, or randomized. When the policy is provided as a function, the parameter stationary is used.
There are horizon+1 states generated in every episode, including the terminal state at T+1.
The initial state initial should either be of type S or can be a vector that represents the distribution over the states.
The function requires that each state and action transition to a reasonably small number of next states.
See Also
cumulative to compute the cumulative rewards
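A simulation sketch with a stationary policy given as a function; the structure of the returned trajectories (and therefore how the rewards are extracted for cumulative) is left as an assumption:
using MDPs
model = Domains.Gambler.Ruin(0.7, 10)
# always take action 1 (bet nothing); initial state 1, 100 decisions, 50 episodes
trajectories = simulate(model, s -> 1, 1, 100, 50; stationary = true)
# the rewards from the trajectories can be passed to cumulative(rewards, γ)
# to obtain discounted returns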
MDPs.take_action — Function
take_action(policy, internal, state) -> action
Return which action to take with the internal state and the MDP state state.
Domains
MDPs.Domains.Gambler.Ruin — Type
Ruin(win, max_capital)
Gambler's ruin; the discounted version. Can decide how much to bet at any point in time. With some probability win, the bet is doubled, and with 1-win it is lost. The reward is 1 if it achieves some terminal capital and 0 otherwise. State max_capital+1 is an absorbing win state in which 1 is received forever.
- Capital = state - 1
- Bet = action - 1
Available actions are 1, ..., state.
Special states: state=1 is broke and state=max_capital+1 is a terminal winning state.
MDPs.Domains.Gambler.RuinTransient — Type
RuinTransient(win, max_capital, noop[, win_reward = 1.0, lose_reward = 0.0])
Gambler's ruin; the transient version. Can decide how much to bet at any point in time. With some probability win, the bet is doubled, and with 1-win it is lost. The reward is 1 if it achieves some terminal capital and 0 otherwise. State max_capital+1 is an absorbing win state.
- Capital = state - 1
If noop = true, then the available actions are 1, ..., capital+1 and bet = action - 1. This allows a bet of 0, which is not a transient policy.
If noop = false, then the available actions are 1, ..., capital and bet = action. The MDP is not transient if noop = true, but it has some transient policies. When noop = false, the MDP is transient.
Special states: state=1 is broke and state=max_capital+1 is maximal capital. Both of these states are absorbing/terminal.
By default, the reward is 0 when the gambler goes broke and +1 when it achieves the target capital. The difference from Ruin is that no reward is received in the terminal state. The rewards for the overall win and loss can be adjusted by providing the win_reward and lose_reward optional parameters.
MDPs.Domains.Inventory.Demand — Type
Models values of demand in values and probabilities in probabilities.
MDPs.Domains.Inventory.Model — Type
An inventory MDP problem simulator.
The states and actions are 1-based integers.
MDPs.Domains.Inventory.Parameters — Type
Parameters that define an inventory problem.
MDPs.transition — Method
transition(params, stock, order, demand)
Update the inventory value and compute the profit.
Starting with stock items, order items arrive, after which demand items are sold. The sale price is collected even if the sale is backlogged (but not beyond the backlog level). Negative stock means backlog.
Stocking costs are assessed after all the orders are fulfilled.
Causes an error when the order is too large, but no error when the demand cannot be satisfied or backlogged.
MDPs.Domains.Machine.Replacement — Type
Standard machine replacement simulator. See Figure 3 in Delage 2009 for details.
States:
- 1: repair 1
- 2: repair 2
- 3 - 10: utility states
Actions:
- 1: do nothing
- 2: repair
MDPs.Domains.GridWorld.Action — Type
An action in the GridWorld problem.
MDPs.Domains.GridWorld.Model — Type
A GridWorld MDP problem simulator.
The states and actions are 1-based integers.
MDPs.Domains.GridWorld.Parameters — Type
Parameters(reward_s, max_side_length, wind)
Parameters that define a GridWorld problem:
- reward_s: a vector of rewards for each state
- max_side_length: an integer that represents the maximum side length of the grid
- wind: a float that represents the wind ∈ [0, 1]
- revolve: whether or not the agent can wrap around the grid by moving off the edge and appearing on the other side (default: true)
- transient: whether or not there is an absorbing state (default: false)
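A construction sketch for a small grid; the positional arguments follow the signature above, and the reward vector length (one entry per state under the default options) is an assumption:
using MDPs
# 3x3 grid with a reward of 1.0 in the last cell and 10% wind
rewards = zeros(9)
rewards[9] = 1.0
params = Domains.GridWorld.Parameters(rewards, 3, 0.1)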