Use for Bellman equations, MDPs, finite/infinite horizon DP, average-cost DP, threshold policies, monotone policies, indexability, inventory/queueing/control models, and dynamic mechanism/control problems. The goal is to turn a hard DP into a small set of Bellman, structural, and verification lemmas.
- Intake and branch split: identify the DP object and horizon/criterion.
- Smart routes and common lemmas: choose Bellman, monotonicity, threshold, or certificate proof.
- Counterexample tests and tool hooks: stress boundary/tie cases before final proof.
- Templates and failure diagnosis: use the smallest certificate that proves the policy/value claim.
Write these before proving:
- State
s, actiona, shockw, transitions' = f(s,a,w), reward/costr(s,a), discount or horizon. - Feasible action set
A(s)and how it changes withs. - Value object:
V_t, discountedV, average-cost biash, Q-functionQ, or post-decision value. - Target: existence, Bellman equation, convergence, threshold/index/monotone policy, convexity/concavity, comparative statics, or optimality of a candidate policy.
- Boundary and tie rules: terminal state, absorbing state, capacity, zero inventory/queue, action ties, finite horizon endpoints.
- Finite horizon: backward induction; prove the structural property is preserved from
V_{t+1}toV_t. - Discounted infinite horizon: define Bellman operator
T; prove contraction or monotone bounded iteration; then prove fixed point and policy verification. - Average cost: use ACOE/optimality inequality with gain
gand biash; for finite states use relative value/LP; for unbounded states use drift or Lyapunov bounds. - Threshold or monotone policy: compare action differences
Delta(s)=Q(s,a_high)-Q(s,a_low)and prove single crossing. - Convex/concave value: prove
Tpreserves convexity/concavity; use Jensen, monotone transitions, or post-decision state. - Index policy: introduce subsidy/passive reward; prove passive set expands monotonically in subsidy.
- Constrained DP: use Lagrangian relaxation, prove policy optimality for multiplier, then recover feasibility/complementary slackness.
- Partially observed or belief-state DP: move to belief state, prove value structure on simplex, then verify filter update assumptions.
- Do not solve for
Vunless needed. Prove Bellman inequalities or compare Q-values. - For a candidate policy
pi, verifyQ(s,pi(s)) >= Q(s,a)for alla; this is often easier than derivingV. - For thresholds, prove
Delta(s)is monotone and check one crossing plus boundary signs. - For monotone optimal policies, use Topkis: lattice state/action spaces, increasing differences, and monotone feasible sets.
- For value monotonicity, prove the Bellman operator is order preserving and start value iteration from an ordered function.
- For convexity/submodularity, prove the operator preserves the property; avoid differentiating an unknown value function unless justified.
- For hard global optimality, build a Bellman certificate:
V(s) >= r(s,a)+beta E[V(s')]for alla, with equality for the proposed policy. - For average-cost proofs, separate recurrence/stability from optimality; ACOE alone may not imply optimality without transversality or boundedness.
- Bellman operator maps the chosen function class into itself.
- Discounted Bellman operator is a contraction in sup norm.
- Finite-horizon induction preserves monotonicity/convexity/submodularity.
- Q-value difference has single crossing.
- Topkis monotonicity gives monotone argmax correspondence.
- Candidate value/policy satisfies Bellman equality on-policy and inequality off-policy.
- Boundary states and action ties preserve the claimed structure.
- Average-cost gain/bias pair satisfies the ACOE and required transversality/stability condition.
- Two states, two actions, two periods.
- Boundary state: empty/full inventory, zero queue, capacity limit, absorbing state.
- Discount extremes:
beta = 0,betanear 1. - Tie-breaking and nonunique argmax.
- Nonmonotone transition kernel or action-dependent feasible set.
- Finite horizon versus stationary infinite-horizon policy.
- Average-cost chain with transient or unstable policy.
- Python: enumerate finite-state DP, value iteration, policy iteration, random parameter search, boundary grids.
- Z3/OR-Tools: finite-state counterexamples to threshold/monotone policies.
- CVXPy: finite MDP LP, occupation measures, dual Bellman certificates.
- Wolfram/SymPy: simplify Q-value differences, derivatives, boundary signs, contraction constants, and threshold inequalities. For concrete templates, use
math-toolsreferencedp-wolfram-patterns.md. - NetworkX: recurrent classes and reachability in finite Markov chains.
- Lean: small order/inequality lemmas after the DP statement is stable.
Bellman certificate for maximization:
Find V and policy pi such that
V(s) = r(s,pi(s)) + beta E[V(s' | s,pi(s))]
V(s) >= r(s,a) + beta E[V(s' | s,a)] for all a.
Then pi is optimal.
Threshold proof:
Define Delta(s) = Q(s,1)-Q(s,0).
Prove Delta is increasing or decreasing in s.
Check boundary signs.
Conclude the argmax switches at most once.
Handle ties by the stated tie-breaking rule.
Finite-horizon structural induction:
Base: terminal value V_T has property P.
Step: assume V_{t+1} has P.
Show T_t preserves P, so V_t = T_t V_{t+1} has P.
Then derive policy structure from Q_t differences.
Average-cost verification:
Find gain g, bias h, and policy pi.
Show g + h(s) >= r(s,a) + E[h(s') | s,a] for all a.
Show equality for a = pi(s).
Check recurrence/stability/transversality.
Then pi is average-cost optimal.
Indexability route:
Add subsidy lambda for passive action.
For each lambda, characterize passive set P(lambda).
Prove P(lambda) expands monotonically in lambda.
Solve indifference equation for the index.
Verify boundary states and unreachable/tie states.
- If monotonicity fails, check transition stochastic monotonicity and increasing differences.
- If threshold proof fails, inspect whether Q-difference is nonmonotone or has multiple crossings.
- If contraction fails, check discounting, weighted norms, or monotone bounded convergence.
- If FOC proof ignores corners, switch to Bellman inequalities or KKT-style active-action cases.
- If average-cost proof is stuck, first prove stability/recurrence under the candidate policy.
- If continuous state proof is messy, discretize for counterexamples before proving the continuous theorem.