Simulating sequential interaction #104
4 comments · 2 replies
-
Hi Nitay, thanks for writing. You should make the following changes. First, write a new memo model that performs only the belief-update step: it should take the current belief as input and return the updated belief state b' after conditioning on the observed prize. With that model in hand, you can write a Python simulation loop where you maintain a belief variable and update it with the model at each iteration.
Does that plan make sense?
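The loop structure suggested above can be sketched in plain Python. Everything here is hypothetical scaffolding: the `REWARD_P` table, the greedy arm-choice policy, and the stand-in `belief_update` function (a one-step Bayes update in NumPy, rather than a call into the actual memo model) are my own assumptions for illustration, not code from this thread.

```python
import numpy as np

# Hypothetical table P(prize = 1 | arm, bandit_type); invented numbers,
# standing in for the model's reward_probability.
REWARD_P = np.array([[0.9, 0.1],   # arm 0: P(win | type 0), P(win | type 1)
                     [0.2, 0.8]])  # arm 1

def belief_update(b, arm, prize):
    """Stand-in for the memo belief-update model: one Bayes step.
    b is P(bandit_type = 0); returns the posterior after seeing prize."""
    like0 = REWARD_P[arm, 0] if prize == 1 else 1.0 - REWARD_P[arm, 0]
    like1 = REWARD_P[arm, 1] if prize == 1 else 1.0 - REWARD_P[arm, 1]
    post0 = b * like0
    post1 = (1.0 - b) * like1
    return post0 / (post0 + post1)

# Simulation loop: keep the belief in an ordinary Python variable and
# re-invoke the one-step model each iteration, logging the result.
rng = np.random.default_rng(0)
true_type = 0
b = 0.5                                  # uniform prior over bandit types
history = []
for t in range(20):
    arm = 0 if b > 0.5 else 1            # placeholder greedy policy
    prize = rng.random() < REWARD_P[arm, true_type]
    b = belief_update(b, arm, int(prize))
    history.append(b)
```

In the real setup, the body of `belief_update` would instead call the compiled memo model with the current belief and the sampled (arm, prize) pair; the surrounding loop stays the same.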
-
Hi Kartik,

```python
@memo(cache=True)
def belief_update[bandit_type: BanditType, arm: Arms, prize: Prize](b):
    agent: knows(arm, prize)
    agent: thinks[
        bandit: knows(arm, prize),
        bandit: chooses(bandit_type in BanditType, wpp=reward_probability(prize, arm, bandit_type))
    ]
    agent: observes [bandit.bandit_type] is bandit_type
    # Update belief using Bayes' rule
    agent: chooses(b in B, wpp=exp(b * bandit.bandit_type))
    return agent.b
```

This returns wrong values, but it is also unclear to me where the conditioning takes place. Shouldn't the new model take (b, arm, prize) as input?
-
I don't think there is an example of such a model in the demo directory. In your model, you should have:
Does that general plan make sense?
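Since the original question also asks for the Q-values at each iteration, here is a minimal sketch of how expected arm values fall out of the current belief. The two-arm/two-type setup and the `REWARD_P` numbers are invented for illustration; in the real model they would come from `reward_probability`.

```python
import numpy as np

# Hypothetical P(prize = 1 | arm, bandit_type); placeholder numbers,
# not taken from the thread's actual model.
REWARD_P = np.array([[0.9, 0.1],   # arm 0 under type 0 / type 1
                     [0.2, 0.8]])  # arm 1

def q_values(b):
    """Expected one-step reward of each arm under belief b = P(type 0)."""
    return REWARD_P[:, 0] * b + (1.0 - b) * REWARD_P[:, 1]

# Inside the simulation loop, append (belief, q_values(belief)) to a log
# after every belief update, e.g.:
log = [(b, q_values(b)) for b in (0.5, 0.8)]
```

With certainty about the type, the Q-values reduce to that type's column of the reward table, which is a quick sanity check on the bookkeeping.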
-
Hi Nitay! Just checking in — did that help you solve your problem? Please feel free to ask follow-up questions if my response wasn't clear, or if more issues come up! :)
-
I have code implementing a POMDP solver. I want to run it in a loop, simulating the interaction with the environment rather than solving the entire problem up front, and collect the updated belief and Q-values after each iteration.
Here's my code:
What updates to the code are needed?
Thanks!