Probabilistic modeling for multi-portal real estate market analysis
A probabilistic programming framework that applies hierarchical Bayesian models, Gaussian Process spatial regression, and mixture-model anomaly detection to real estate listing data scraped from four Spanish portals.
Built on PyMC 5, ArviZ, and nutpie (Rust-based NUTS sampler).
| # | Model | Technique | What it does |
|---|---|---|---|
| 1 | Hierarchical Pricing | Multi-level partial pooling | Estimates price drivers per portal while sharing statistical strength across all four portals via group-level hyperpriors |
| 2 | Spatial GP | Gaussian Process, Matern-5/2 kernel | Learns a continuous price surface over geographic coordinates with calibrated uncertainty — no hand-crafted spatial features needed |
| 3 | Anomaly Detection | Bayesian mixture model | Classifies each listing into "normal market" vs "anomaly" components, yielding a posterior probability of being mispriced |
Hierarchical model (partial pooling across portals):
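As a rough sketch, a partially pooled pricing model over the four portals could be written as below. The specific priors, the Gaussian likelihood, and the symbols ($m_0$, $s_0$, $s_\alpha$, $\beta$) are assumptions made for illustration, not necessarily the parameterization used in `models/hierarchical.py`; the model may also let the slopes $\beta$ vary by portal.

$$
\begin{aligned}
\mu_\alpha &\sim \mathcal{N}(m_0,\ s_0^2), \qquad \sigma_\alpha \sim \text{HalfNormal}(s_\alpha) \\
\alpha_j &\sim \mathcal{N}(\mu_\alpha,\ \sigma_\alpha^2), \qquad j = 1, \dots, 4 \\
y_i &\sim \mathcal{N}\!\left(\alpha_{j[i]} + x_i^\top \beta,\ \sigma^2\right)
\end{aligned}
$$

Here $y_i$ is the (possibly log-transformed) price of listing $i$, $j[i]$ is its portal, and the shared hyperpriors on $\mu_\alpha$ and $\sigma_\alpha$ are what let the portals share statistical strength.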
Spatial GP — Matern-5/2 covariance over lat/lon:
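For reference, the Matern-5/2 covariance can be sketched in a few lines of plain Python. This is an illustration of the kernel formula only, not the PyMC code in `models/spatial.py`; the function names, default hyperparameters, and the use of Euclidean distance on raw lat/lon degrees are assumptions.

```python
import math


def matern52(x1, x2, lengthscale=1.0, variance=1.0):
    """Matern-5/2 covariance between two points given as coordinate tuples.

    k(r) = variance * (1 + sqrt(5) r / l + 5 r^2 / (3 l^2)) * exp(-sqrt(5) r / l)
    """
    r = math.dist(x1, x2)  # Euclidean distance (an approximation for lat/lon)
    s = math.sqrt(5.0) * r / lengthscale
    return variance * (1.0 + s + s * s / 3.0) * math.exp(-s)


def kernel_matrix(points, lengthscale=1.0, variance=1.0):
    """Dense n x n covariance matrix; building and factorizing it drives the GP cost."""
    return [[matern52(p, q, lengthscale, variance) for q in points] for p in points]


# Illustrative (lat, lon) pairs, not taken from the dataset.
points = [(41.12, 1.25), (41.08, 1.14), (41.15, 1.40)]
K = kernel_matrix(points, lengthscale=0.1)
```

Nearby listings get a covariance close to `variance`, and the covariance decays smoothly with distance at a rate set by `lengthscale`.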
Anomaly mixture — two-component on price residuals:
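To make the anomaly score concrete, here is a minimal sketch of the posterior probability that a residual belongs to the anomaly component of a two-component mixture. It assumes both components are zero-mean Gaussians with fixed, hand-picked parameters (`weight`, `normal_sd`, `anomaly_sd` are hypothetical values); a full Bayesian model would infer these and average the score over posterior draws.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def anomaly_probability(residual, weight=0.05, normal_sd=0.15, anomaly_sd=0.60):
    """P(anomaly | residual) for a narrow 'normal market' and a wide 'anomaly' component.

    Computed via log-odds so that large residuals do not underflow to 0/0.
    Assumes normal_sd < anomaly_sd.
    """
    log_odds = (
        math.log(weight) - math.log1p(-weight)
        + normal_logpdf(residual, 0.0, anomaly_sd)
        - normal_logpdf(residual, 0.0, normal_sd)
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


# Residuals could be, for example, observed minus predicted log price.
scores = [anomaly_probability(r) for r in (0.0, 0.3, -1.0)]
```

Residuals near zero score close to zero, while large residuals of either sign (overpriced or underpriced) score close to one.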
Portal-level intercepts pulled toward the group mean — portals with fewer listings borrow more strength:
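The shrinkage effect can be illustrated with the closed-form normal-normal case. This sketch conditions on a known group mean, within-portal noise `sigma`, and between-portal spread `tau` (all hypothetical values); the full hierarchical model infers these jointly rather than fixing them.

```python
def pooled_intercept(portal_mean, n_listings, group_mean, sigma=0.4, tau=0.1):
    """Posterior mean of a portal intercept under a Normal(group_mean, tau) prior.

    Returns the estimate and the weight given to the portal's own data.
    """
    data_precision = n_listings / sigma**2
    prior_precision = 1.0 / tau**2
    weight = data_precision / (data_precision + prior_precision)
    estimate = weight * portal_mean + (1.0 - weight) * group_mean
    return estimate, weight


# Two hypothetical portals with the same raw mean but different sample sizes.
small_estimate, small_weight = pooled_intercept(0.3, n_listings=5, group_mean=0.0)
large_estimate, large_weight = pooled_intercept(0.3, n_listings=500, group_mean=0.0)
```

With only a handful of listings the estimate sits close to the group mean; with hundreds it stays close to the portal's own average.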
Posterior distributions for the hierarchical hyperparameters:
The mixture model identifies overpriced and underpriced listings with calibrated anomaly scores:
Comparison of MCMC (NUTS via nutpie) vs Variational Inference (ADVI), plus sampling efficiency (ESS/s):
    git clone https://github.com/gilito11/bayesian-realestate.git
    cd bayesian-realestate
    pip install -r requirements.txt

    # Quick demo (~6 min, synthetic data)
    python demo.py --quick --no-spatial

    # Full run with all three models (~20 min)
    python demo.py

    # With real data from PostgreSQL
    python demo.py --source neon --database-url $DATABASE_URL

| Flag | Description |
|---|---|
| `--quick` | Fewer MCMC draws for faster iteration |
| `--no-spatial` | Skip the GP model (slowest due to O(n^3) scaling) |
| `--n-listings N` | Number of synthetic listings to generate (default: 800) |
| `--source neon` | Load real data from Neon PostgreSQL |
| `--output-dir DIR` | Where to save plots (default: output/) |
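For readers who want to see how these flags fit together, the following is a hypothetical `argparse` sketch that mirrors them. It is not the parser in `demo.py`; the `synthetic` choice name and the `None` default for `--database-url` are assumptions.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags."""
    parser = argparse.ArgumentParser(prog="demo.py")
    parser.add_argument("--quick", action="store_true",
                        help="Fewer MCMC draws for faster iteration")
    parser.add_argument("--no-spatial", action="store_true",
                        help="Skip the GP model")
    parser.add_argument("--n-listings", type=int, default=800,
                        help="Number of synthetic listings to generate")
    # 'synthetic' is an assumed name for the default mode.
    parser.add_argument("--source", choices=["synthetic", "neon"],
                        default="synthetic", help="Where to load listings from")
    parser.add_argument("--database-url", default=None,
                        help="PostgreSQL connection string (used with --source neon)")
    parser.add_argument("--output-dir", default="output/",
                        help="Where to save plots")
    return parser


args = build_parser().parse_args(["--quick", "--no-spatial"])
```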
bayesian_realestate/
├── models/
│ ├── hierarchical.py # Multi-level partial pooling across portals
│ ├── spatial.py # GP spatial regression (Matern-5/2)
│ └── anomaly.py # Two-component Bayesian mixture
├── data.py # Synthetic data generator + Neon DB loader
├── diagnostics.py # R-hat, ESS, divergences, model comparison
├── plots.py # Publication-quality visualizations
└── demo.py # Full pipeline entry point
Each model includes a tractability analysis that compares inference methods and reports sampling diagnostics:
- NUTS (No-U-Turn Sampler) via nutpie: asymptotically exact posterior samples
- ADVI (Automatic Differentiation Variational Inference): fast approximate posterior
- ESS/s (effective sample size per second): sampling efficiency metric
- R-hat convergence diagnostics and divergence counts
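To show what the diagnostics in this list measure, here is a small standard-library sketch of the classic (non-split) Gelman-Rubin R-hat and of ESS/s. It is for illustration only: ArviZ computes a rank-normalized split R-hat and its own ESS estimate, so its numbers will differ somewhat. The function names are hypothetical.

```python
import random
import statistics


def r_hat(chains):
    """Classic Gelman-Rubin statistic for equal-length chains of one scalar parameter."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


def ess_per_second(ess, wall_seconds):
    """Sampling efficiency: effective samples produced per second of wall time."""
    return ess / wall_seconds


rng = random.Random(0)
# Four chains drawn from the same distribution: should look converged.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(4)]
# Chains stuck around different values: should look unconverged.
stuck = [[rng.gauss(mu, 1.0) for _ in range(500)] for mu in (0.0, 0.0, 3.0, 3.0)]
```

Values of `r_hat` close to 1 indicate that the chains agree; values well above 1 indicate that they are exploring different regions.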
The GP model demonstrates the tractability/expressiveness trade-off: Matern-5/2 gives rich spatial structure but scales as O(n^3), requiring subsampling for large datasets.
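A back-of-the-envelope sketch of the cubic scaling, plus one simple way to cap the number of points fed to the GP. Uniform random subsampling is an assumption here; the subsampling strategy the framework actually uses is not described, and the function names are hypothetical.

```python
import random


def cubic_cost_ratio(n, n_ref):
    """How much more an O(n^3) factorization costs at n points than at n_ref points."""
    return (n / n_ref) ** 3


def subsample_indices(n_total, max_points, seed=0):
    """Pick at most max_points row indices uniformly at random, in sorted order."""
    if n_total <= max_points:
        return list(range(n_total))
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_total), max_points))


# Going from 800 to 8000 listings multiplies the dense GP cost by 1000.
ratio = cubic_cost_ratio(8000, 800)
keep = subsample_indices(n_total=5000, max_points=800)
```

Because a tenfold increase in listings means roughly a thousandfold increase in compute, capping `max_points` is a blunt but effective control on runtime.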
| Component | Technology |
|---|---|
| Probabilistic programming | PyMC 5.x |
| MCMC sampler | nutpie (Rust) with PyMC fallback |
| Posterior analysis | ArviZ |
| Data | pandas, NumPy |
| Database | PostgreSQL (Neon serverless) via psycopg2 |
| Visualization | matplotlib, seaborn |
The framework operates in two modes:
- Synthetic (default): generates realistic listings across 8 zones on the Tarragona coast with known ground-truth anomalies. Useful for validating that the models recover what was planted.
- PostgreSQL: connects to a live database of listings scraped from habitaclia, fotocasa, milanuncios, and idealista.
MIT
Eric Gil — BSc Computer Science, Universitat de Lleida



