\section{Conclusion and Future Work}
\label{sec:conclusion}

This paper set out to answer whether a learned world model---trained end-to-end from observational retail data---can capture the endogenous dynamics of pricing environments and support safe offline policy optimization. Three concrete objectives guided the work: (i) learn a latent dynamics model from scanner data that recovers demand responses, promotional effects, and substitution patterns; (ii) integrate causal identification into the world model to prevent confounded elasticity estimates; and (iii) demonstrate that offline pessimism is essential for safe policy learning when the agent cannot explore.

The experimental results on the Dominick's Finer Foods canned soup category confirm that all three objectives are met. The \dreamprice{} world model achieves a total ELBO loss of 22.44 after 100K training steps, with a reconstruction loss of 0.001 and stable convergence across all components. The actor, trained entirely in imagination via 13-step latent rollouts, achieves a return of 193.7 in the symlog-transformed reward space with MOPO-LCB pessimism. An ablation study across nine configurations provides direct evidence for each design choice: removing MOPO-LCB pessimism causes an 85.6\% return degradation (from 193.7 to 27.9), confirming that offline safety is indispensable; removing the symlog and twohot representations produces a $25{\times}$ increase in world model loss (573.2 vs.\ 22.44), validating their role in stabilizing heterogeneous reward scales; and replacing the \mamba{} backbone with a GRU yields a 28.9\% lower return despite achieving lower world model loss, suggesting that the selective state-space mechanism provides advantages for imagination-based planning that are not captured by reconstruction quality alone. Replacing the entity-factored encoder with a flat encoder reduces the return by 64.5\% (68.6 vs.\ 193.7), and the causal demand decoder successfully recovers DML-PLIV elasticities that are consistent with the econometric literature on canned soup pricing \citep{hoch1995determinants}.

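The symlog transform and the MOPO-LCB pessimistic reward referenced above are both small enough to sketch. The following is a minimal NumPy illustration, not the released implementation: the function names, the ensemble layout, and the $\lambda$ default are assumptions. It reflects the design described in the paper, where a multi-head reward ensemble supplies a mean and standard deviation and the actor is trained on the lower confidence bound $\text{mean} - \lambda \cdot \text{std}$.

```python
import numpy as np

def symlog(x):
    # Symmetric log transform used to compress heterogeneous reward
    # scales: symlog(x) = sign(x) * ln(1 + |x|). Invertible, identity-
    # like near zero, logarithmic for large magnitudes.
    return np.sign(x) * np.log1p(np.abs(x))

def pessimistic_reward(head_predictions, lam=1.0):
    # MOPO-style lower-confidence-bound reward. `head_predictions` has
    # shape (num_heads, batch): one reward estimate per ensemble head.
    # Training the actor on mean - lam * std penalizes it wherever the
    # ensemble disagrees, i.e. in regions poorly covered by the data.
    mean = head_predictions.mean(axis=0)
    std = head_predictions.std(axis=0)
    return mean - lam * std
```

Because the penalty scales with ensemble disagreement, the pessimistic reward coincides with the mean prediction exactly where the heads agree, and shrinks toward (or below) zero where they do not.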
\paragraph{Limitations.} The limitations discussed in Section~\ref{sec:results} apply. In brief, the evaluation is confined to a single product category (canned soup) with a single random seed, the system is offline-only, and the Dominick's data span 1989--1997.

\paragraph{Future work.} Several directions extend this work. Inference-time price optimization via the cross-entropy method (CEM) or gradient-based planning would enable the world model to produce optimal price sequences at deployment rather than relying solely on the trained actor. Multi-category training with entity-factored attention could enable the model to learn shared dynamics across categories while specializing per-category elasticities; evaluating the causal decoder on high-endogeneity categories such as beer and soft drinks would provide stronger empirical evidence for the DML-PLIV integration. Real-time deployment via the FastAPI serving layer would enable integration with live pricing systems. Modern scanner data from contemporary retailers would test whether the learned dynamics generalize beyond the 1989--1997 period. Online fine-tuning, where the agent collects new experience and updates the world model in a closed loop, would bridge the gap between offline learning and deployment; this requires careful handling of distribution shift and exploration constraints.

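The inference-time CEM planner mentioned above can be sketched in a few lines. This is an illustrative NumPy sketch under stated assumptions: `rollout_return` is a hypothetical stand-in for an imagined-rollout evaluation in the learned world model, and the population sizes and price bounds are placeholders; only the 13-step default horizon is taken from the paper's imagination rollouts.

```python
import numpy as np

def cem_plan(rollout_return, horizon=13, pop=64, elites=8, iters=5,
             price_lo=0.5, price_hi=2.0, seed=0):
    # Cross-entropy method over a sequence of prices. `rollout_return`
    # maps a (horizon,) price vector to a scalar imagined return (a
    # hypothetical stand-in for a latent rollout in the world model).
    # Each iteration samples candidate sequences from a diagonal
    # Gaussian, keeps the top `elites`, and refits the distribution.
    rng = np.random.default_rng(seed)
    mu = np.full(horizon, (price_lo + price_hi) / 2)
    sigma = np.full(horizon, (price_hi - price_lo) / 4)
    for _ in range(iters):
        cand = rng.normal(mu, sigma, size=(pop, horizon))
        cand = np.clip(cand, price_lo, price_hi)
        scores = np.array([rollout_return(c) for c in cand])
        elite = cand[np.argsort(scores)[-elites:]]
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mu  # final mean serves as the planned price sequence
```

Because planning happens entirely against the learned model, the same pessimism used for actor training would need to be applied to `rollout_return` to keep the planner from exploiting model error.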
\paragraph{Reproducibility and open-source release.} The complete source code, trained models, data, and interactive demonstrations are publicly available to facilitate reproduction and extension:
\begin{itemize}[nosep,leftmargin=*]
  \item \textbf{Code}: \url{https://github.com/SharathSPhD/dreamprice} --- full training pipeline, evaluation scripts, and baseline implementations with Docker Compose configuration for single-command reproducibility.
  \item \textbf{Dataset}: \url{https://huggingface.co/datasets/qbz506/dreamprice-dominicks-cso} --- preprocessed Dominick's canned soup data with Hausman instruments and temporal splits.
  \item \textbf{Model}: \url{https://huggingface.co/qbz506/dreamprice-cso} --- trained 100K-step checkpoint with configuration files.
  \item \textbf{Demo}: \url{https://huggingface.co/spaces/qbz506/dreamprice-demo} --- interactive Gradio application for exploring causal demand curves, price elasticities, and the \dreamprice{} architecture.
  \item \textbf{Experiment tracking}: \url{https://wandb.ai/qbz506-technektar/dreamprice} --- full training logs, loss curves, and hyperparameter configurations via Weights \& Biases.
\end{itemize}

\dreamprice{} demonstrates that learned world models can operate in economic domains where transition dynamics are endogenous and observational data is the only source of experience. The path from scanner data to deployed pricing systems is long; this work takes the first steps by establishing a framework that unifies causal identification, selective state-space recurrence, and offline pessimism for retail pricing.