MDPs.jl: Markov Decision Processes
Models
This section describes the data structures that can be used to model various types of MDPs.
MDP
This is a general MDP data structure that supports basic functions. See IntMDP and TabMDP below for more specific models that can be used directly to model and solve problems.
MDPs.getnext — Method
getnext(model, s, a)
Compute next states using the transition function.
Returns an object that can return a NamedTuple with states, probabilities, and transitions as AbstractArrays. This is a more-efficient version of transition (when supported).
The standard implementation is not memory efficient.
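A minimal usage sketch follows; the domain constructor Domains.Gambler.Ruin is described later on this page, and the field names states and probabilities are assumptions based on the description above.
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
nxt = getnext(model, 2, 1)
nxt.states         # next-state indexes (assumed field name)
nxt.probabilities  # corresponding transition probabilities (assumed field name)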
MDPs.transition — Function
(sn, p, r) ∈ transition(model, s, a)
Return an iterator over next states, probabilities, and rewards for model when taking action a in state s.
Use getnext instead, which is more efficient and convenient to use.
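A hedged sketch of iterating over the transition, using the Gambler.Ruin domain described later on this page:
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
for (sn, p, r) in transition(model, 2, 1)
    println("next state: ", sn, ", probability: ", p, ", reward: ", r)
end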
MDPs.valuefunction — Function
valuefunction(mdp, state, valuefunction)
Evaluates the value function for an MDP mdp in a state.
Tabular MDPs
This is an MDP instance that assumes that the states and actions are tabular.
MDPs.TabMDP — Type
An abstract tabular Markov Decision Process which is specified by a transition function. Functions that should be defined for any subtype for value and policy iterations to work are: state_count, states, action_count, actions, and transition.
Generally, states and actions are 1-based.
The methods state_count and states should only include non-terminal states.
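A sketch of a custom TabMDP subtype is shown below; the exact method signatures (such as action_count(model, s)) are assumptions based on the interface listed above, and the TwoStateChain type is hypothetical.
using MDPs

# Hypothetical two-state chain used only to illustrate the interface.
struct TwoStateChain <: TabMDP end

MDPs.state_count(::TwoStateChain) = 2
MDPs.states(::TwoStateChain) = 1:2
MDPs.action_count(::TwoStateChain, s::Int) = 2
MDPs.actions(::TwoStateChain, s::Int) = 1:2

# transition returns an iterable of (next state, probability, reward) tuples
function MDPs.transition(::TwoStateChain, s::Int, a::Int)
    if a == 1
        # action 1: stay in the current state with high probability and collect a reward
        ((s, 0.9, 1.0), (3 - s, 0.1, 0.0))
    else
        # action 2: switch to the other state deterministically
        ((3 - s, 1.0, 0.5),)
    end
end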
MDPs.save_mdp — Method
save_mdp(T::DataFrame, model::TabMDP)
Convert an MDP model to a DataFrame representation with 0-based indices.
Important: The MDP representation uses 1-based indexes, while the output DataFrame uses 0-based indexes for backwards compatibility.
The columns are: idstatefrom, idaction, idstateto, probability, and reward.
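A hedged sketch; whether the first argument is the DataFrame type or a pre-existing DataFrame is not clear from the signature, and the call below assumes the type is passed.
using MDPs, DataFrames
model = Domains.Gambler.Ruin(0.5, 10)
df = save_mdp(DataFrame, model)     # assumption: the DataFrame type is passed as the first argument
names(df)                           # idstatefrom, idaction, idstateto, probability, reward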
Integral MDPs
This is a specific MDP instance in which states and actions are specified by integers.
MDPs.FormatError — Type
An incorrect parameter value
MDPs.IntAction — Type
Represents transitions that follow an action. The lengths nextstate, probability, and reward must be the same.
Nextstate may not be unique, and each transition can have a different reward associated with it. The transitions are not aggregated to allow for computing the risk of a transition. Aggregating the values by state would change the risk value of the transition.
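A sketch of constructing an IntAction with a duplicated next state, assuming a constructor that takes the three vectors in the order listed above:
using MDPs
# two transitions to state 2 with different rewards are kept separate
a = IntAction([1, 2, 2], [0.5, 0.25, 0.25], [0.0, 1.0, 3.0])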
MDPs.IntMDP — Type
MDP with integral states and stationary transitions. State and action indexes are all 1-based integers.
MDPs.IntState — Type
Represents a discrete state
MDPs.compress — Method
compress(nextstate, probability, reward)
Combine multiple transitions to the same state into a single transition. The reward is computed as a weighted average of the individual rewards, assuming an expected-reward objective.
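A sketch of the intended behavior, assuming compress accepts the three vectors directly; the combined reward is the probability-weighted average.
using MDPs
# transitions to state 2 are merged: probability 0.5 and
# reward (0.25 * 1.0 + 0.25 * 3.0) / 0.5 = 2.0
compress([1, 2, 2], [0.5, 0.25, 0.25], [0.0, 1.0, 3.0])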
MDPs.load_mdp — Method
load_mdp(input, idoutcome)
Load the MDP from `input`. The function **assumes 0-based indexes** of states and actions (via the `zerobased` flag), which are transformed to 1-based indexes.
Input formats are anything that is supported by DataFrame. Some options are CSV.File(...) or Arrow.Table(...).
States that have no transition probabilities defined are assumed to be terminal and are set to transition to themselves.
If docombine is true then the method combines transitions that have the same statefrom, action, stateto. This makes risk-neutral value iteration faster, but may change the value of a risk-averse solution.
The formulation allows for multiple transitions s,a → s'. When this is the case, the transition probability is assumed to be their sum and the reward is the weighted average of the rewards.
The method can also process CSV files for MDPO/MMDP, in which case idoutcome specifies a 1-based outcome to load.
Examples
Load the model from a CSV
using CSV: File
using MDPs
filepath = joinpath(dirname(pathof(MDPs)), "..",
"data", "riverswim.csv")
model = load_mdp(File(filepath); idoutcome = 1)
state_count(model)
# output
20
Load the model from an Arrow file (a binary tabular file format)
using MDPs, Arrow
filepath = joinpath(dirname(pathof(MDPs)), "..",
"data", "inventory.arr")
model = load_mdp(Arrow.Table(filepath))
state_count(model)
# output
21
MDPs.make_int_mdp — Method
make_int_mdp(Ps, rs)
Build an IntMDP from a list of transition probabilities Ps and reward vectors rs, one for each action in the MDP. If rs are vectors, then they are assumed to be state-action rewards. If rs are matrices, then they are assumed to be state-action-state rewards. Each row of the transition matrix (and the reward matrix) represents the probabilities of transitioning to the next states.
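A small sketch with two states and two actions; the reward vectors hold state-action rewards as described above.
using MDPs
P1 = [0.9 0.1; 0.0 1.0]          # action 1: row i gives transition probabilities from state i
P2 = [0.2 0.8; 0.0 1.0]          # action 2
r1 = [1.0, 0.0]                  # state-action rewards for action 1
r2 = [2.0, 0.0]                  # state-action rewards for action 2
model = make_int_mdp([P1, P2], [r1, r2])
state_count(model)               # expected to be 2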
MDPs.make_int_mdp — Method
make_int_mdp(mdp::TabMDP, docompress = false)
Transform any tabular MDP mdp to a numeric one. This helps to accelerate operations and value function computation. The actions are also turned into 1-based integer values.
The option docompress combines transitions to the same state into a single transition. This improves efficiency in risk-neutral settings, but may change the outcome in risk-averse settings.
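A sketch of converting a tabular domain, with compression of duplicate transitions enabled:
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
imodel = make_int_mdp(model, true)   # second positional argument is docompress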
Objectives
MDPs.AverageReward — Type
AverageReward()
The average-reward infinite-horizon criterion. The objective is to maximize the expected average reward over an infinite horizon.
MDPs.FiniteH — Type
FiniteH(γ, T)
Finite-horizon discounted model. The discount factor γ can be in [0,1] and the horizon T must be a positive integer. The optimal policy is Markov but time-dependent.
MDPs.InfiniteH — Type
InfiniteH(γ)
Infinite-horizon discounted objective. The discount factor γ can be in [0,1]. The optimal policy is stationary.
MDPs.Markov — Type
Objective solved by a randomized Markov non-stationary policy. In other words, the solution is time-dependent.
MDPs.MarkovDet — Type
Objective solved by a deterministic Markov non-stationary policy. In other words, the solution is time-dependent.
MDPs.Objective — Type
Abstract objective for an MDP.
MDPs.Stationary — Type
Objective that is solved by a randomized stationary policy
MDPs.StationaryDet — Type
Objective that is solved by a deterministic stationary policy
MDPs.TotalReward — Type
TotalReward()
Total reward criterion. The objective is to maximize the sum of the undiscounted rewards.
This objective can generally only be applied to transient states, which have a terminal state; see isterminal for more details.
MDPs.horizon — Function
horizon(objective)
Return the horizon length for objective.
Algorithms
MDPs.value_iteration — Function
value_iteration(model, objective[, π]; [v_terminal, iterations = 1000, ϵ = 1e-3])
Compute the value function and policy for a tabular MDP model with an objective objective. The time steps go from 1 to T+1; the last decision happens at time T.
The supported objectives are FiniteH and InfiniteH. When provided with a real number γ ∈ [0,1], the objective is treated as an infinite-horizon problem.
Finite Horizon
Uses finite-horizon value iteration for a tabular MDP model with a discount factor γ and horizon T (time steps 1 to T+1); the last decision happens at time T. Returns a vector of value functions, one for each time step.
The argument v_terminal represents the terminal value function. It should be provided as a function that maps the state id to its terminal value (at time T+1). If this value is provided, then it is used in place of 0.
If a policy π is provided, then the algorithm evaluates it.
Infinite Horizon
For a Bellman error ϵ, the computed value function is guaranteed to be within ϵ ⋅ γ / (1 - γ) of the optimal value function (all in terms of the L_∞ norm).
The value function computation is parallelized when parallel is true. This is also known as a Jacobi-type value iteration (as opposed to Gauss-Seidel).
Note that for the purpose of the greedy policy, minimizing the span seminorm is more efficient, but the goal of this function is also to compute the value function.
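A hedged sketch of both uses; the exact structure of the returned solution object is not shown here.
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
# infinite-horizon discounted objective
sol_inf = value_iteration(model, InfiniteH(0.95); iterations = 10_000, ϵ = 1e-4)
# finite-horizon objective; returns a value function for each time step
sol_fin = value_iteration(model, FiniteH(0.95, 20))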
See also
value_iteration!
MDPs.value_iteration! — Function
value_iteration!(v, π, model, objective; [v_terminal])
Run value iteration using the provided v and π storage for the value function and the policy. See value_iteration for more details.
Only supports the FiniteH objective.
MDPs.mrp_sparse — Method
mrp_sparse(model, π)
Compute a sparse transition matrix P_π and reward vector r_π for the MDP model and policy π.
This function does not support duplicate entries in transition probabilities.
MDPs.policy_iteration — Method
policy_iteration(model, γ; [iterations=1000])
Implements policy iteration for MDP model with a discount factor γ. The algorithm runs until the policy stops changing or the number of iterations is reached.
Does not support duplicate entries in transition probabilities.
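A minimal sketch:
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
sol = policy_iteration(model, 0.95; iterations = 500)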
MDPs.policy_iteration_sparse — Method
policy_iteration_sparse(model, γ; iterations)
Implements policy iteration for MDP model with a discount factor γ. The algorithm runs until the policy stops changing or the number of iterations is reached. The value function is computed using sparse linear algebra.
Does not support duplicate entries in transition probabilities.
MDPs.alltransient — Method
alltransient(model, lpmf, [silent = true])
Checks if the MDP model has all transient policies. A policy is transient if it is guaranteed to terminate with positive probability after some finite number of steps.
Note that the function returns true only if all policies are transient.
The parameters match the use in lp_solve.
MDPs.anytransient — Method
anytransient(model, lpmf, [silent = true])
Checks if the MDP model has some transient policy. A policy is transient if it is guaranteed to terminate with positive probability after some finite number of steps.
Note that the function returns true even when there are some policies that are not transient.
The parameters match the use in lp_solve.
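A hedged sketch using the HiGHS optimizer, mirroring the lp_solve example below; with noop = true the RuinTransient MDP has some transient policies but not all.
using MDPs, HiGHS
model = Domains.Gambler.RuinTransient(0.5, 4, true)
anytransient(model, HiGHS.Optimizer)   # expected true: some policy is transient
alltransient(model, HiGHS.Optimizer)   # expected false: the no-op bet is not transient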
MDPs.isterminal — Method
isterminal(model, state)
Checks that the state is terminal in model. A state is terminal if it
- has a single action,
- transitions to itself,
- has a reward 0.
Example
using MDPs
model = Domains.Gambler.RuinTransient(0.5, 4, true)
isterminal.((model,), states(model))
# output
5-element BitVector:
0
0
0
0
1
MDPs.lp_solve — Method
lp_solve(model, lpmf, [silent = true])
Implements the primal linear program formulation for an MDP model with a discount factor γ. It uses the JuMP model constructed by JuMP.Model(lpmf) as the linear program solver and returns the state values found by the solver.
Examples
using MDPs, HiGHS
model = Domains.Gambler.RuinTransient(0.5, 4, true)
lp_solve(model, TotalReward(), HiGHS.Optimizer).policy
# output
5-element Vector{Int64}:
1
2
3
2
1
Value Function Manipulation
MDPs.make_value — Method
make_value(model, objective)
Creates an undefined policy and value function for the model and objective.
See Also
value_iteration!
MDPs.bellman — Method
bellman(model, obj, [t=0,] s, v)
Compute the Bellman operator for state s and value function v, assuming an objective obj.
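A minimal sketch of a single Bellman update, starting from a zero value function:
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
v = zeros(state_count(model))
bellman(model, InfiniteH(0.95), 1, v)   # Bellman backup for state 1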
MDPs.bellmangreedy — Method
bellmangreedy(model, obj, [t=0,] s, v)
Compute the Bellman operator and greedy action for state s and value function v, assuming an objective obj. The optional time parameter t allows for time-dependent updates.
The function uses qvalue to compute the Bellman operator and the greedy policy.
MDPs.greedy! — Method
greedy!(π, model, obj, v)
Update policy π with the greedy policy for the value function v, the MDP model, and the objective obj.
MDPs.greedy — Method
greedy(model, obj, v)
Compute the greedy action for all states and value function v assuming an objective obj and time t=0.
MDPs.greedy — Method
greedy(model, obj, [t=0,] s, v)
Compute the greedy action for state s and value function v assuming an objective obj.
If s is not provided, then the greedy action is computed for all states. The model must support the states function.
MDPs.qvalue — Method
qvalue(model, objective, [t=0,] s, a, v)
Compute the state-action value for state s, action a, and value function v for an objective.
There is no set representation for the value function.
MDPs.qvalues! — Method
qvalues!(qvalues, model, objective, [t=0,] s, v)
Compute the state-action values for state s and value function v for the objective.
Saves the values to qvalues, which should be at least as long as the number of actions. Values of elements in qvalues that are beyond the action count are set to -Inf.
See qvalues for more information.
MDPs.qvalues — Method
qvalues(model, objective, [t=0,] s, v)
Compute the state-action values for state s and value function v for the objective. There is no set representation of the value function v.
The function is tractable only if there are a small number of actions and transitions.
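A minimal sketch computing the action values of a single state under a zero value function:
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
v = zeros(state_count(model))
q = qvalues(model, InfiniteH(0.95), 1, v)   # one entry per action available in state 1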
Simulation
MDPs.FPolicyM — Type
General Markov policy specified by a function (s, t) → a
MDPs.FPolicyS — Type
General stationary policy specified by a function s → a
MDPs.Policy — Type
Defines a policy, whether stationary deterministic, randomized, Markov, or even history-dependent. The policy should support the functions make_internal and append_history, which initialize and update the internal state. The function take_action then chooses an action to take.
It is important that the tracker keeps its own internal state in order to be thread safe.
MDPs.TabPolicyMD — Type
Markov deterministic policy for tabular MDPs. The policy π has an outer array over time steps and an inner array over states.
MDPs.TabPolicySD — Type
Stationary deterministic policy for tabular MDPs
MDPs.TabPolicyStationary — Type
Generic policy for tabular MDPs
MDPs.Transition — Type
Information about a transition from state to nstate after taking an action. time is the time step at which nstate is observed.
MDPs.append_history — Function
append_history(policy, internal, transition) :: internal
Update the internal state for a policy with the transition information.
MDPs.cumulative — Method
cumulative(rewards, γ)
Computes the cumulative return from rewards returned by the simulation function.
MDPs.make_internal — Function
make_internal(model, policy, state) -> internal
Initialize the internal state for a policy with the initial state. Returns the initial internal state.
MDPs.random_π — Method
random_π(model)
Construct a random policy for a tabular MDP.
MDPs.simulate — Method
simulate(model, π, initial, horizon, episodes; [stationary = true])
Simulate a policy π in the model and generate states and actions for horizon decisions and episodes episodes. The initial state is initial.
The policy π can be a function, an array, or an array of arrays, depending on whether the policy is stationary, Markovian, deterministic, or randomized. When the policy is provided as a function, the parameter stationary is used.
There are horizon+1 states generated in every episode including the terminal state at T+1.
The initial state initial should either be of type S or a vector that represents a distribution over the states.
The function requires that each state and action transition to a reasonably small number of next states.
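A hedged sketch; the exact structure of the returned simulation record is not documented here, so the example only shows the call, and it assumes the policy returned by random_π is directly usable by simulate.
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
π = random_π(model)
simulate(model, π, 1, 30, 100)   # initial state 1, horizon 30, 100 episodes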
See Also
cumulative to compute the cumulative rewards
MDPs.take_action — Function
take_action(policy, internal, state) -> action
Return which action to take with the internal state and the MDP state state.
Domains
MDPs.Domains.Gambler.Ruin — Type
Ruin(win, max_capital)
Gambler's ruin; the discounted version. The gambler can decide how much to bet at any point in time. With probability win, the bet is doubled, and with probability 1-win it is lost. The reward is 1 if the gambler achieves some terminal capital and 0 otherwise. State max_capital+1 is an absorbing win state in which 1 is received forever.
- Capital = state - 1
- Bet = action - 1
Available actions are 1, ..., state.
Special states: state=1 is broke and state=max_capital+1 is a terminal winning state.
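A short sketch of the index encoding, assuming action_count(model, s) gives the number of available actions in state s:
using MDPs
model = Domains.Gambler.Ruin(0.6, 8)
capital = 3
state = capital + 1            # capital = state - 1
bet = 2
action = bet + 1               # bet = action - 1
action_count(model, state)     # expected: `state` available actions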
MDPs.Domains.Gambler.RuinTransient — Type
RuinTransient(win, max_capital, noop[, win_reward = 1.0, lose_reward = 0.0])
Gambler's ruin; the transient version. The gambler can decide how much to bet at any point in time. With probability win, the bet is doubled, and with probability 1-win it is lost. The reward is 1 if the gambler achieves some terminal capital and 0 otherwise. State max_capital+1 is an absorbing win state.
- Capital = state - 1
If noop = true, then the available actions are 1, ..., capital+1 and bet = action - 1. This allows a bet of 0, which does not yield a transient policy.
If noop = false then the available actions are 1, ..., capital and bet = action. The MDP is not transient if noop = true, but has some transient policies. When noop = false, the MDP is transient.
Special states: state=1 is broke and state=max_capital+1 is maximal capital. Both of the states are absorbing/terminal.
By default, the reward is 0 when the gambler goes broke and +1 when it achieves the target capital. The difference from Ruin is that no reward is received in the terminal state. The rewards for an overall win and loss can be adjusted by providing the optional parameters win_reward and lose_reward.
MDPs.Domains.Inventory.Demand — Type
Models values of demand in values and probabilities in probabilities.
MDPs.Domains.Inventory.Model — Type
An inventory MDP problem simulator
The states and actions are 1-based integers.
MDPs.Domains.Inventory.Parameters — Type
Parameters that define an inventory problem
MDPs.transition — Method
transition(params, stock, order, demand)
Update the inventory value and compute the profit.
Starting with stock items, order items arrive, and then demand items are sold. The sale price is collected even if the sale is backlogged (but not beyond the backlog level). Negative stock represents a backlog.
Stocking costs are assessed after all the orders are fulfilled.
Causes an error when the order is too large, but no error when the demand cannot be satisfied or backlogged.
MDPs.Domains.Machine.Replacement — Type
Standard machine replacement simulator. See Figure 3 in Delage 2009 for details.
States are:
- 1: repair 1
- 2: repair 2
- 3-10: utility states
Actions are:
- 1: Do nothing
- 2: Repair
MDPs.Domains.GridWorld.Action — Type
Represents an action in the GridWorld domain (a movement direction).
MDPs.Domains.GridWorld.Model — Type
A GridWorld MDP problem simulator
The states and actions are 1-based integers.
MDPs.Domains.GridWorld.Parameters — Type
Parameters(reward_s, max_side_length, wind)
Parameters that define a GridWorld problem.
- rewards_s: A vector of rewards for each state
- max_side_length: An integer that represents the maximum side length of the grid
- wind: A float that represents the wind ∈ [0, 1]
- revolve: Whether or not the agent can wrap around the grid by moving off the edge and appearing on the other side (default: true)
- transient: Whether or not there is an absorbing state (default: false)