MDPs.jl: Markov Decision Processes
Models
This section describes the data structures that can be used to model various types of MDPs.
MDP
This is a general MDP data structure that supports basic functions. See IntMDP and TabMDP below for more specific models that can be used directly to model and solve problems.
MDPs.getnext — Method
getnext(model, s, a)
Compute next states using the transition function.
Returns an object that can return a NamedTuple with states, probabilities, and transitions as AbstractArrays. This is a more-efficient version of transition (when supported).
The standard implementation is not memory efficient.
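A minimal usage sketch follows; the domain constructor Domains.Gambler.Ruin is described later on this page, and the field names states and probabilities are assumptions based on the description above.
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
nxt = getnext(model, 2, 1)
nxt.states         # next-state indexes (assumed field name)
nxt.probabilities  # corresponding transition probabilities (assumed field name)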
MDPs.transition — Function
(sn, p, r) ∈ transition(model, s, a)
Return an iterator over next states, probabilities, and rewards for model when taking action a in state s.
Use getnext instead, which is more efficient and convenient to use.
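A hedged sketch of iterating over the transition, using the Gambler.Ruin domain described later on this page:
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
for (sn, p, r) in transition(model, 2, 1)
    println("next state: ", sn, ", probability: ", p, ", reward: ", r)
end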
MDPs.valuefunction — Function
valuefunction(mdp, state, valuefunction)
Evaluates the value function for an MDP mdp in a state.
Tabular MDPs
This is an MDP instance that assumes that the states and actions are tabular.
MDPs.TabMDP — Type
An abstract tabular Markov Decision Process which is specified by a transition function. Functions that should be defined for any subtype for value and policy iterations to work are: state_count, states, action_count, actions, and transition.
Generally, states and actions are 1-based.
The methods state_count and states should only include non-terminal states.
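A sketch of a custom TabMDP subtype is shown below; the exact method signatures (such as action_count(model, s)) are assumptions based on the interface listed above, and the TwoStateChain type is hypothetical.
using MDPs

# Hypothetical two-state chain used only to illustrate the interface.
struct TwoStateChain <: TabMDP end

MDPs.state_count(::TwoStateChain) = 2
MDPs.states(::TwoStateChain) = 1:2
MDPs.action_count(::TwoStateChain, s::Int) = 2
MDPs.actions(::TwoStateChain, s::Int) = 1:2

# transition returns an iterable of (next state, probability, reward) tuples
function MDPs.transition(::TwoStateChain, s::Int, a::Int)
    if a == 1
        # action 1: stay in the current state with high probability and collect a reward
        ((s, 0.9, 1.0), (3 - s, 0.1, 0.0))
    else
        # action 2: switch to the other state deterministically
        ((3 - s, 1.0, 0.5),)
    end
end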
MDPs.save_mdp — Method
save_mdp(T::DataFrame, model::TabMDP)
Convert an MDP model to a DataFrame representation with 0-based indices.
Important: The MDP representation uses 1-based indexes, while the output DataFrame uses 0-based indexes for backwards compatibility.
The columns are: idstatefrom, idaction, idstateto, probability, and reward.
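A hedged sketch; whether the first argument is the DataFrame type or a pre-existing DataFrame is not clear from the signature, and the call below assumes the type is passed.
using MDPs, DataFrames
model = Domains.Gambler.Ruin(0.5, 10)
df = save_mdp(DataFrame, model)     # assumption: the DataFrame type is passed as the first argument
names(df)                           # idstatefrom, idaction, idstateto, probability, reward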
Integral MDPs
This is a specific MDP instance in which states and actions are specified by integers.
MDPs.FormatError — Type
An incorrect parameter value
MDPs.IntAction — Type
Represents transitions that follow an action. The lengths nextstate, probability, and reward must be the same.
Nextstate may not be unique, and each transition can have a different reward associated with it. The transitions are not aggregated to allow for computing the risk of a transition. Aggregating the values by state would change the risk value of the transition.
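A sketch of constructing an IntAction with a duplicated next state, assuming a constructor that takes the three vectors in the order listed above:
using MDPs
# two transitions to state 2 with different rewards are kept separate
a = IntAction([1, 2, 2], [0.5, 0.25, 0.25], [0.0, 1.0, 3.0])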
MDPs.IntMDP — Type
MDP with integral states and stationary transitions. State and action indexes are all 1-based integers.
MDPs.IntState — Type
Represents a discrete state
MDPs.compress — Method
compress(nextstate, probability, reward)
Combine multiple transitions to the same state into a single transition. The reward is computed as a weighted average of the individual rewards, assuming an expected-reward objective.
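A sketch of the intended behavior, assuming compress accepts the three vectors directly; the combined reward is the probability-weighted average.
using MDPs
# transitions to state 2 are merged: probability 0.5 and
# reward (0.25 * 1.0 + 0.25 * 3.0) / 0.5 = 2.0
compress([1, 2, 2], [0.5, 0.25, 0.25], [0.0, 1.0, 3.0])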
MDPs.load_mdp — Method
load_mdp(input, idoutcome)
Load the MDP from `input`. The function **assumes 0-based indexes** of states and actions (via the `zerobased` flag), which are transformed to 1-based indexes.
Input formats are anything that is supported by DataFrame. Some options are CSV.File(...) or Arrow.Table(...).
States that have no transition probabilities defined are assumed to be terminal and are set to transition to themselves.
If docombine is true then the method combines transitions that have the same statefrom, action, stateto. This makes risk-neutral value iteration faster, but may change the value of a risk-averse solution.
The formulation allows for multiple transitions s,a → s'. When this is the case, the transition probability is assumed to be their sum and the reward is the weighted average of the rewards.
The method can also process CSV files for MDPO/MMDP, in which case idoutcome specifies a 1-based outcome to load.
Examples
Load the model from a CSV
using CSV: File
using MDPs
filepath = joinpath(dirname(pathof(MDPs)), "..",
"data", "riverswim.csv")
model = load_mdp(File(filepath); idoutcome = 1)
state_count(model)
# output
20
Load the model from an Arrow file (a binary tabular file format)
using MDPs, Arrow
filepath = joinpath(dirname(pathof(MDPs)), "..",
"data", "inventory.arr")
model = load_mdp(Arrow.Table(filepath))
state_count(model)
# output
21
MDPs.make_int_mdp — Method
make_int_mdp(Ps, rs)
Build an IntMDP from a list of transition probabilities Ps and reward vectors rs, one for each action in the MDP. If rs are vectors, then they are assumed to be state-action rewards. If rs are matrices, then they are assumed to be state-action-state rewards. Each row of the transition matrix (and the reward matrix) represents the probabilities of transitioning to the next states.
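A small sketch with two states and two actions; the reward vectors hold state-action rewards as described above.
using MDPs
P1 = [0.9 0.1; 0.0 1.0]          # action 1: row i gives transition probabilities from state i
P2 = [0.2 0.8; 0.0 1.0]          # action 2
r1 = [1.0, 0.0]                  # state-action rewards for action 1
r2 = [2.0, 0.0]                  # state-action rewards for action 2
model = make_int_mdp([P1, P2], [r1, r2])
state_count(model)               # expected to be 2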
MDPs.make_int_mdp — Method
make_int_mdp(mdp::TabMDP, docompress = false)
Transform any tabular MDP mdp to a numeric one. This helps to accelerate operations and value function computation. The actions are also turned into 1-based integer values.
The option docompress combines transitions to the same state into a single transition. This improves efficiency in risk-neutral settings, but may change the outcome in risk-averse settings.
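A sketch of converting a tabular domain, with compression of duplicate transitions enabled:
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
imodel = make_int_mdp(model, true)   # second positional argument is docompress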
Objectives
MDPs.AverageReward — Type
AverageReward()
The average-reward infinite-horizon criterion. The objective is to maximize the expected average reward over an infinite horizon.
MDPs.FiniteH — Type
FiniteH(γ, T)
Finite-horizon discounted model. The discount factor γ can be in [0,1] and the horizon T must be a positive integer. The optimal policy is Markov but time-dependent.
MDPs.InfiniteH — Type
InfiniteH(γ)
Infinite-horizon discounted objective. The discount factor γ can be in [0,1]. The optimal policy is stationary.
MDPs.Markov — Type
Objective solved by a randomized Markov non-stationary policy. In other words, the solution is time-dependent.
MDPs.MarkovDet — Type
Objective solved by a deterministic Markov non-stationary policy. In other words, the solution is time-dependent.
MDPs.Objective — Type
Abstract objective for an MDP.
MDPs.Stationary — Type
Objective that is solved by a randomized stationary policy
MDPs.StationaryDet — Type
Objective that is solved by a deterministic stationary policy
MDPs.TotalReward — Type
TotalReward()
Total reward criterion. The objective is to maximize the sum of the undiscounted rewards.
This objective can generally only be applied to transient states, which have a terminal state; see isterminal for more details.
MDPs.horizon — Function
horizon(objective)
Return the horizon length for objective.
Algorithms
MDPs.value_iteration — Function
value_iteration(model, objective[, π]; [v_terminal, iterations = 1000, ϵ = 1e-3])
Compute the value function and policy for a tabular MDP model with an objective objective. The time steps go from 1 to T+1; the last decision happens at time T.
The supported objectives are FiniteH and InfiniteH. When provided with a real number γ ∈ [0,1], the objective is treated as an infinite-horizon problem.
Finite Horizon
Uses finite-horizon value iteration for a tabular MDP model with a discount factor γ and horizon T (time steps 1 to T+1); the last decision happens at time T. Returns a vector of value functions, one for each time step.
The argument v_terminal represents the terminal value function. It should be provided as a function that maps the state id to its terminal value (at time T+1). If this value is provided, then it is used in place of 0.
If a policy π is provided, then the algorithm evaluates it.
Infinite Horizon
For a Bellman error ϵ, the computed value function is guaranteed to be within ϵ ⋅ γ / (1 - γ) of the optimal value function (all in terms of the L_∞ norm).
The value function computation is parallelized when parallel is true. This is also known as a Jacobi-type value iteration (as opposed to Gauss-Seidel).
Note that for the purpose of the greedy policy, minimizing the span seminorm is more efficient, but the goal of this function is also to compute the value function.
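A hedged sketch of both uses; the exact structure of the returned solution object is not shown here.
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
# infinite-horizon discounted objective
sol_inf = value_iteration(model, InfiniteH(0.95); iterations = 10_000, ϵ = 1e-4)
# finite-horizon objective; returns a value function for each time step
sol_fin = value_iteration(model, FiniteH(0.95, 20))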
See also
value_iteration!
MDPs.value_iteration! — Function
value_iteration!(v, π, model, objective; [v_terminal])
Run value iteration using the provided v and π storage for the value function and the policy. See value_iteration for more details.
Only supports the FiniteH objective.
MDPs.mrp_sparse — Method
mrp_sparse(model, π)
Compute a sparse transition matrix P_π and reward vector r_π for the MDP model and policy π.
This function does not support duplicate entries in transition probabilities.
MDPs.policy_iteration — Method
policy_iteration(model, γ; [iterations=1000])
Implements policy iteration for MDP model with a discount factor γ. The algorithm runs until the policy stops changing or the number of iterations is reached.
Does not support duplicate entries in transition probabilities.
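A minimal sketch:
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
sol = policy_iteration(model, 0.95; iterations = 500)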
MDPs.policy_iteration_sparse — Method
policy_iteration_sparse(model, γ; iterations)
Implements policy iteration for MDP model with a discount factor γ. The algorithm runs until the policy stops changing or the number of iterations is reached. The value function is computed using sparse linear algebra.
Does not support duplicate entries in transition probabilities.
MDPs.alltransient — Method
alltransient(model, lpmf, [silent = true])
Checks if the MDP model has all transient policies. A policy is transient if it is guaranteed to terminate with positive probability after some finite number of steps.
Note that the function returns true only if all policies are transient.
The parameters match the use in lp_solve.
MDPs.anytransient — Method
anytransient(model, lpmf, [silent = true])
Checks if the MDP model has some transient policy. A policy is transient if it is guaranteed to terminate with positive probability after some finite number of steps.
Note that the function returns true even when there are some policies that are not transient.
The parameters match the use in lp_solve.
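A hedged sketch using the HiGHS optimizer, mirroring the lp_solve example below; with noop = true the RuinTransient MDP has some transient policies but not all.
using MDPs, HiGHS
model = Domains.Gambler.RuinTransient(0.5, 4, true)
anytransient(model, HiGHS.Optimizer)   # expected true: some policy is transient
alltransient(model, HiGHS.Optimizer)   # expected false: the no-op bet is not transient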
MDPs.isterminal — Method
isterminal(model, state)
Checks that the state is terminal in model. A state is terminal if it
- has a single action,
- transitions to itself,
- has a reward 0.
Example
using MDPs
model = Domains.Gambler.RuinTransient(0.5, 4, true)
isterminal.((model,), states(model))
# output
5-element BitVector:
0
0
0
0
1
MDPs.lp_solve — Method
lp_solve(model, lpmf, [silent = true])
Implements the primal linear program formulation for an MDP model with a discount factor γ. It uses the JuMP model constructed by JuMP.Model(lpmf) as the linear program solver and returns the state values found by the solver.
Examples
using MDPs, HiGHS
model = Domains.Gambler.RuinTransient(0.5, 4, true)
lp_solve(model, TotalReward(), HiGHS.Optimizer).policy
# output
5-element Vector{Int64}:
1
2
3
2
1
Value Function Manipulation
MDPs.make_value — Method
make_value(model, objective)
Creates an undefined policy and value function for the model and objective.
See Also
value_iteration!
MDPs.bellman — Method
bellman(model, obj, [t=0,] s, v)
Compute the Bellman operator for state s and value function v, assuming an objective obj.
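A minimal sketch of a single Bellman update, starting from a zero value function:
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
v = zeros(state_count(model))
bellman(model, InfiniteH(0.95), 1, v)   # Bellman backup for state 1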
MDPs.bellmangreedy — Method
bellmangreedy(model, obj, [t=0,] s, v)
Compute the Bellman operator and greedy action for state s and value function v, assuming an objective obj. The optional time parameter t allows for time-dependent updates.
The function uses qvalue to compute the Bellman operator and the greedy policy.
MDPs.greedy! — Method
greedy!(π, model, obj, v)
Update policy π with the greedy policy for the value function v, the MDP model, and the objective obj.
MDPs.greedy — Method
greedy(model, obj, v)
Compute the greedy action for all states and value function v assuming an objective obj and time t=0.
MDPs.greedy — Method
greedy(model, obj, [t=0,] s, v)
Compute the greedy action for state s and value function v assuming an objective obj.
If s is not provided, then the greedy action is computed for all states. The model must support the states function.
MDPs.qvalue — Method
qvalue(model, objective, [t=0,] s, a, v)
Compute the state-action value for state s, action a, and value function v for an objective.
There is no set representation for the value function.
MDPs.qvalues! — Method
qvalues!(qvalues, model, objective, [t=0,] s, v)
Compute the state-action values for state s and value function v for the objective.
Saves the values to qvalues, which should be at least as long as the number of actions. Values of elements in qvalues that are beyond the action count are set to -Inf.
See qvalues for more information.
MDPs.qvalues — Method
qvalues(model, objective, [t=0,] s, v)
Compute the state-action values for state s and value function v for the objective. There is no set representation of the value function v.
The function is tractable only if there are a small number of actions and transitions.
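A minimal sketch computing the action values of a single state under a zero value function:
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
v = zeros(state_count(model))
q = qvalues(model, InfiniteH(0.95), 1, v)   # one entry per action available in state 1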
Simulation
MDPs.FPolicyM — Type
General Markov policy specified by a function (s, t) → a
MDPs.FPolicyS — Type
General stationary policy specified by a function s → a
MDPs.Policy — Type
Defines a policy, whether stationary deterministic, randomized, Markov, or even history-dependent. The policy should support the functions make_internal and append_history, which initialize and update the internal state. The function take_action then chooses an action to take.
It is important that the tracker keeps its own internal state in order to be thread safe.
MDPs.TabPolicyMD — Type
Markov deterministic policy for tabular MDPs. The policy π has an outer array over time steps and an inner array over states.
MDPs.TabPolicySD — Type
Stationary deterministic policy for tabular MDPs
MDPs.TabPolicyStationary — Type
Generic policy for tabular MDPs
MDPs.Transition — Type
Information about a transition from state to nstate after taking an action. time is the time step at which nstate is observed.
MDPs.append_history — Function
append_history(policy, internal, transition) :: internal
Update the internal state for a policy with the transition information.
MDPs.cumulative — Method
cumulative(rewards, γ)
Computes the cumulative return from rewards returned by the simulation function.
MDPs.make_internal — Function
make_internal(model, policy, state) -> internal
Initialize the internal state for a policy with the initial state. Returns the initial internal state.
MDPs.random_π — Method
random_π(model)
Construct a random policy for a tabular MDP.
MDPs.simulate — Method
simulate(model, π, initial, horizon, episodes; [stationary = true])
Simulate a policy π in the model and generate states and actions for horizon decisions and episodes episodes. The initial state is initial.
The policy π can be a function, an array, or an array of arrays, depending on whether the policy is stationary, Markovian, deterministic, or randomized. When the policy is provided as a function, the parameter stationary is used.
There are horizon+1 states generated in every episode including the terminal state at T+1.
The initial state initial should either be of type S or a vector that represents a distribution over the states.
The function requires that each state and action transition to a reasonably small number of next states.
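A hedged sketch; the exact structure of the returned simulation record is not documented here, so the example only shows the call, and it assumes the policy returned by random_π is directly usable by simulate.
using MDPs
model = Domains.Gambler.Ruin(0.5, 10)
π = random_π(model)
simulate(model, π, 1, 30, 100)   # initial state 1, horizon 30, 100 episodes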
See Also
cumulative to compute the cumulative rewards
MDPs.take_action — Function
take_action(policy, internal, state) -> action
Return which action to take with the internal state and the MDP state state.
Domains
MDPs.Domains.Gambler.Ruin — Type
Ruin(win, max_capital)
Gambler's ruin; the discounted version. The gambler can decide how much to bet at any point in time. With probability win, the bet is doubled, and with probability 1-win it is lost. The reward is 1 if the gambler achieves some terminal capital and 0 otherwise. State max_capital+1 is an absorbing win state in which 1 is received forever.
- Capital = state - 1
- Bet = action - 1
Available actions are 1, ..., state.
Special states: state=1 is broke and state=max_capital+1 is a terminal winning state.
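A short sketch of the index encoding, assuming action_count(model, s) gives the number of available actions in state s:
using MDPs
model = Domains.Gambler.Ruin(0.6, 8)
capital = 3
state = capital + 1            # capital = state - 1
bet = 2
action = bet + 1               # bet = action - 1
action_count(model, state)     # expected: `state` available actions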
MDPs.Domains.Gambler.RuinTransient — Type
RuinTransient(win, max_capital, noop[, win_reward = 1.0, lose_reward = 0.0])
Gambler's ruin; the transient version. The gambler can decide how much to bet at any point in time. With probability win, the bet is doubled, and with probability 1-win it is lost. The reward is 1 if the gambler achieves some terminal capital and 0 otherwise. State max_capital+1 is an absorbing win state.
- Capital = state - 1
If noop = true, then the available actions are 1, ..., capital+1 and bet = action - 1. This allows a bet of 0, which does not yield a transient policy.
If noop = false then the available actions are 1, ..., capital and bet = action. The MDP is not transient if noop = true, but has some transient policies. When noop = false, the MDP is transient.
Special states: state=1 is broke and state=max_capital+1 is maximal capital. Both of the states are absorbing/terminal.
By default, the reward is 0 when the gambler goes broke and +1 when it achieves the target capital. The difference from Ruin is that no reward is received in the terminal state. The rewards for an overall win and loss can be adjusted by providing the optional parameters win_reward and lose_reward.
MDPs.Domains.Inventory.Demand — Type
Models values of demand in values and probabilities in probabilities.
MDPs.Domains.Inventory.Model — Type
An inventory MDP problem simulator
The states and actions are 1-based integers.
MDPs.Domains.Inventory.Parameters — Type
Parameters that define an inventory problem
MDPs.transition — Method
transition(params, stock, order, demand)
Update the inventory value and compute the profit.
Starting with stock items, order items arrive, and then demand items are sold. The sale price is collected even if the sale is backlogged (but not beyond the backlog level). Negative stock represents a backlog.
Stocking costs are assessed after all the orders are fulfilled.
Causes an error when the order is too large, but no error when the demand cannot be satisfied or backlogged.
MDPs.Domains.Machine.Replacement — Type
Standard machine replacement simulator. See Figure 3 in Delage 2009 for details.
States are:
- 1: repair 1
- 2: repair 2
- 3-10: utility states
Actions are:
- 1: Do nothing
- 2: Repair
MDPs.Domains.GridWorld.Action — Type
Represents an action in the GridWorld domain (a movement direction).
MDPs.Domains.GridWorld.Model — Type
A GridWorld MDP problem simulator
The states and actions are 1-based integers.
MDPs.Domains.GridWorld.Parameters — Type
Parameters(reward_s, max_side_length, wind)
Parameters that define a GridWorld problem.
- rewards_s: A vector of rewards for each state
- max_side_length: An integer that represents the maximum side length of the grid
- wind: A float that represents the wind ∈ [0, 1]
- revolve: Whether or not the agent can wrap around the grid by moving off the edge and appearing on the other side (default: true)
- transient: Whether or not there is an absorbing state (default: false)