\section{Conclusion and Future Work}
\label{sec:conclusion}

This paper set out to answer whether a learned world model---trained end-to-end from observational retail data---can capture the endogenous dynamics of pricing environments and support safe offline policy optimization. Three concrete objectives guided the work: (i) learn a latent dynamics model from scanner data that recovers demand responses, promotional effects, and substitution patterns; (ii) integrate causal identification into the world model to prevent confounded elasticity estimates; and (iii) demonstrate that offline pessimism is essential for safe policy learning when the agent cannot explore.

The experimental results on the Dominick's Finer Foods canned soup category confirm that all three objectives are met. The \dreamprice{} world model achieves a total ELBO loss of 22.44 after 100K training steps, with a reconstruction loss of 0.001 and stable convergence across all components. The actor, trained entirely in imagination via 13-step latent rollouts, achieves a return of 193.7 in the symlog-transformed reward space with MOPO-LCB pessimism. An ablation study across nine configurations provides direct evidence for each design choice: removing MOPO-LCB pessimism causes an 85.6\% return degradation (from 193.7 to 27.9), confirming that offline safety is indispensable; removing the symlog and twohot representations produces a $25{\times}$ increase in world model loss (573.2 vs.\ 22.44), validating their role in stabilizing heterogeneous reward scales; and replacing the \mamba{} backbone with a GRU yields a 28.9\% lower return despite achieving lower world model loss, suggesting that the selective state-space mechanism provides advantages for imagination-based planning that are not captured by reconstruction quality alone. Replacing the entity-factored encoder with a flat encoder reduces the return by 64.5\% (68.6 vs.\ 193.7), and the causal demand decoder successfully recovers DML-PLIV elasticities that are consistent with the econometric literature on canned soup pricing \citep{hoch1995determinants}.

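The symlog transform and the MOPO-LCB pessimistic reward referenced above are both small enough to sketch. The following is a minimal NumPy illustration, not the released implementation: the function names, the ensemble layout, and the $\lambda$ default are assumptions. It reflects the design described in the paper, where a multi-head reward ensemble supplies a mean and standard deviation and the actor is trained on the lower confidence bound $\text{mean} - \lambda \cdot \text{std}$.

```python
import numpy as np

def symlog(x):
    # Symmetric log transform used to compress heterogeneous reward
    # scales: symlog(x) = sign(x) * ln(1 + |x|). Invertible, identity-
    # like near zero, logarithmic for large magnitudes.
    return np.sign(x) * np.log1p(np.abs(x))

def pessimistic_reward(head_predictions, lam=1.0):
    # MOPO-style lower-confidence-bound reward. `head_predictions` has
    # shape (num_heads, batch): one reward estimate per ensemble head.
    # Training the actor on mean - lam * std penalizes it wherever the
    # ensemble disagrees, i.e. in regions poorly covered by the data.
    mean = head_predictions.mean(axis=0)
    std = head_predictions.std(axis=0)
    return mean - lam * std
```

Because the penalty scales with ensemble disagreement, the pessimistic reward coincides with the mean prediction exactly where the heads agree, and shrinks toward (or below) zero where they do not.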
\paragraph{Limitations.} The limitations discussed in Section~\ref{sec:results} apply. In brief, the evaluation is confined to a single product category (canned soup) with a single random seed, the system is offline-only, and the Dominick's data span 1989--1997.

\paragraph{Future work.} Several directions extend this work. Inference-time price optimization via the cross-entropy method (CEM) or gradient-based planning would enable the world model to produce optimal price sequences at deployment rather than relying solely on the trained actor. Multi-category training with entity-factored attention could enable the model to learn shared dynamics across categories while specializing per-category elasticities; evaluating the causal decoder on high-endogeneity categories such as beer and soft drinks would provide stronger empirical evidence for the DML-PLIV integration. Real-time deployment via the FastAPI serving layer would enable integration with live pricing systems. Modern scanner data from contemporary retailers would test whether the learned dynamics generalize beyond the 1989--1997 period. Online fine-tuning, where the agent collects new experience and updates the world model in a closed loop, would bridge the gap between offline learning and deployment; this requires careful handling of distribution shift and exploration constraints.

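The inference-time CEM planner mentioned above can be sketched in a few lines. This is an illustrative NumPy sketch under stated assumptions: `rollout_return` is a hypothetical stand-in for an imagined-rollout evaluation in the learned world model, and the population sizes and price bounds are placeholders; only the 13-step default horizon is taken from the paper's imagination rollouts.

```python
import numpy as np

def cem_plan(rollout_return, horizon=13, pop=64, elites=8, iters=5,
             price_lo=0.5, price_hi=2.0, seed=0):
    # Cross-entropy method over a sequence of prices. `rollout_return`
    # maps a (horizon,) price vector to a scalar imagined return (a
    # hypothetical stand-in for a latent rollout in the world model).
    # Each iteration samples candidate sequences from a diagonal
    # Gaussian, keeps the top `elites`, and refits the distribution.
    rng = np.random.default_rng(seed)
    mu = np.full(horizon, (price_lo + price_hi) / 2)
    sigma = np.full(horizon, (price_hi - price_lo) / 4)
    for _ in range(iters):
        cand = rng.normal(mu, sigma, size=(pop, horizon))
        cand = np.clip(cand, price_lo, price_hi)
        scores = np.array([rollout_return(c) for c in cand])
        elite = cand[np.argsort(scores)[-elites:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu  # final mean serves as the planned price sequence
```

Because planning happens entirely against the learned model, the same pessimism used for actor training would need to be applied to `rollout_return` to keep the planner from exploiting model error.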
\paragraph{Reproducibility and open-source release.} The complete source code, trained models, data, and interactive demonstrations are publicly available to facilitate reproduction and extension:
\begin{itemize}[nosep,leftmargin=*]
  \item \textbf{Code}: \url{https://github.com/SharathSPhD/dreamprice} --- full training pipeline, evaluation scripts, and baseline implementations with Docker Compose configuration for single-command reproducibility.
  \item \textbf{Dataset}: \url{https://huggingface.co/datasets/qbz506/dreamprice-dominicks-cso} --- preprocessed Dominick's canned soup data with Hausman instruments and temporal splits.
  \item \textbf{Model}: \url{https://huggingface.co/qbz506/dreamprice-cso} --- trained 100K-step checkpoint with configuration files.
  \item \textbf{Demo}: \url{https://huggingface.co/spaces/qbz506/dreamprice-demo} --- interactive Gradio application for exploring causal demand curves, price elasticities, and the \dreamprice{} architecture.
  \item \textbf{Experiment tracking}: \url{https://wandb.ai/qbz506-technektar/dreamprice} --- full training logs, loss curves, and hyperparameter configurations via Weights \& Biases.
\end{itemize}

\dreamprice{} demonstrates that learned world models can operate in economic domains where transition dynamics are endogenous and observational data is the only source of experience. The path from scanner data to deployed pricing systems is long; this work takes the first steps by establishing a framework that unifies causal identification, selective state-space recurrence, and offline pessimism for retail pricing.