AI/ML/DS Checkpoint November 5th 2025 #34

@naman0r

AI/ML/DS Ticket: Baseline Pricing Engine v0 (Perplexity + LLM + Heuristics)

Summary

  • Build a minimal but real pricing engine that combines:
    • Simple, explainable heuristics on top of historical features (day-of-week, seasonality, optional occupancy/pickup proxies).
    • External signals via the Perplexity Search API to derive a daily event impact score.
    • An optional LLM summarizer to produce short human-readable reasons.
  • Package it under ml/ with clean function boundaries so the backend can later call it via MLService or a thin adapter.
  • Must run locally without external keys (falls back to heuristics and templated reasons).

Why Now

  • Unblocks an initial “real” engine beyond a pure mock while keeping scope tractable. Establishes extensible interfaces for future model swaps (XGBoost/LightGBM), PMS-fed features, and richer signals.

Acceptance Criteria

  • ml/ package exists and is importable from backend (local dev path is fine).
  • ml/inference/predict.py exposes predict_price(model, row: dict) -> { price_rec, price_min, price_max, drivers }, where model is reserved for future use and ignored in v0.
  • A batch function score_dates(hotel_id, room_type_code, from_date, to_date, location=None) returns a list of items with non-flat prices, weekend uplift, and reasonable bounds, plus a metadata dict.
  • Perplexity adapter can fetch events for a location and date range, map them into a per-day impact_score in [0, 1], and cache responses locally. If PERPLEXITY_API_KEY is missing, returns empty events gracefully.
  • LLM summarizer produces a ≤ 160 char reason string from drivers if OPENAI_API_KEY is set; otherwise, returns a deterministic templated string.
  • Unit tests cover: weekend uplift, month seasonality, bounding, and adapter fallbacks.

Proposed Directory Structure

ml/
  requirements-ml.txt
  __init__.py
  features/
    __init__.py
    schema.py                # Pydantic schemas for feature rows
    make_features.py         # placeholder to compute features from raw (later)
  inference/
    __init__.py
    predict.py               # baseline heuristics + optional signals
  external/
    __init__.py
    perplexity_adapter.py    # search → normalized events per day + caching
    llm_reasoner.py          # optional OpenAI call; templated fallback
  utils/
    __init__.py
    dates.py                 # date helpers
    caching.py               # simple file cache
  cache/                     # gitignored JSON cache files
  artifacts/                 # future model files (gitignored)

Detailed Implementation Plan

  1. Create ml/requirements-ml.txt
    • Contents (pin reasonably):
pandas>=2.1
numpy>=1.26
pydantic>=2.7
python-dotenv>=1.0
requests>=2.32
perplexityai>=0.17.0
openai>=1.0.0
  2. Define feature schemas (ml/features/schema.py)
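from pydantic import BaseModel, Field

class FeatureRow(BaseModel):
    date: str                 # YYYY-MM-DD
    hotel_id: int
    room_type_code: str
    published_rate: float | None = None  # if provided (e.g., from PMS)
    # Field constraints enforce the ranges documented in the comments.
    occupancy_pct: float | None = Field(default=None, ge=0, le=1)  # 0..1 if available
    pickup_24h: int | None = Field(default=None, ge=0)             # new bookings in last 24h
    month: int | None = Field(default=None, ge=1, le=12)           # 1=Jan..12=Dec
    dow: int | None = Field(default=None, ge=0, le=6)              # 0=Mon..6=Sun
    event_impact: float | None = Field(default=None, ge=0, le=1)   # 0..1 (filled by adapter)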
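The month and dow fields can be derived from date instead of being supplied by the caller. A minimal sketch, assuming a hypothetical helper named calendar_fields (it would fit in ml/utils/dates.py, but the ticket does not name it):

```python
from datetime import date


def calendar_fields(date_str: str) -> dict[str, int]:
    """Derive month (1..12) and dow (0=Mon..6=Sun) from a YYYY-MM-DD string.

    Hypothetical helper: the name and return shape are assumptions.
    """
    d = date.fromisoformat(date_str)  # raises ValueError on malformed input
    return {"month": d.month, "dow": d.weekday()}


print(calendar_fields("2025-11-14"))  # {'month': 11, 'dow': 4} (a Friday)
```

date.weekday() already uses the 0=Mon..6=Sun convention from the schema, so no remapping is needed.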
  3. Perplexity adapter (ml/external/perplexity_adapter.py)
  • Responsibility: given (location: str, from_date: str, to_date: str), return a dict mapping date -> impact_score and a list of raw sources.
  • Strategy:
    • Query Perplexity with: "events in {location} between {from_date} and {to_date} that impact hotel demand".
    • Request 5 to 10 results (max_results); use event dates where the results provide them, otherwise heuristically map each event to the nearest relevant days.
    • Score each result (e.g., concerts/sports: 0.6–0.9; conferences: 0.3–0.6) and clamp to [0, 1].
    • Cache JSON responses under ml/cache/perplexity_{hash}.json keyed by (location, from_date, to_date).
    • If PERPLEXITY_API_KEY is missing or the request fails, return an empty mapping and empty sources.

Example shape:

{
  "daily": {"2025-11-12": 0.7, "2025-11-13": 0.4},
  "sources": [{"title": "Taylor Swift @ Aviva Stadium", "url": "https://...", "date": "2025-11-12"}]
}
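The scoring and fallback steps could be sketched as follows. This is an illustration under stated assumptions, not the adapter itself: the category scores are midpoints of the ranges above, DEFAULT_SCORE and the "strongest event wins" aggregation are assumptions, and search is a hypothetical injected callable standing in for the real Perplexity request (no SDK call is shown).

```python
import os

# Assumed scores: midpoints of the ranges given in the strategy.
CATEGORY_SCORES = {"concert": 0.75, "sports": 0.75, "conference": 0.45}
DEFAULT_SCORE = 0.2  # assumption for events with an unrecognized category

EMPTY = {"daily": {}, "sources": []}


def events_to_daily(events: list[dict]) -> dict:
    """Collapse normalized events into the {"daily", "sources"} shape."""
    daily: dict[str, float] = {}
    sources: list[dict] = []
    for event in events:
        day = event.get("date")
        if not day:
            continue  # undated events are skipped in this sketch
        score = CATEGORY_SCORES.get(event.get("category", ""), DEFAULT_SCORE)
        score = max(0.0, min(1.0, score))  # clamp to [0, 1]
        daily[day] = max(daily.get(day, 0.0), score)  # strongest event wins
        sources.append(
            {"title": event.get("title", ""), "url": event.get("url", ""), "date": day}
        )
    return {"daily": daily, "sources": sources}


def fetch_daily_impact(location: str, from_date: str, to_date: str, search=None) -> dict:
    """Return empty results when no key is set, no client is given, or the search fails."""
    if not os.getenv("PERPLEXITY_API_KEY") or search is None:
        return {"daily": {}, "sources": []}
    try:
        return events_to_daily(search(location, from_date, to_date))
    except Exception:
        return {"daily": {}, "sources": []}
```

Injecting search keeps the fallback path testable without network access or a key.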
  4. LLM reasoner (ml/external/llm_reasoner.py)
  • Input: drivers: list[str], date, optional extra context (event title snippets).
  • Output: short reason (≤160 chars). If OPENAI_API_KEY is missing, fall back to ", ".join(drivers) with a prefix like "Drivers: ...", truncated to the same 160-character limit.
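A minimal sketch of the templated fallback, assuming a hypothetical function name templated_reason and an assumed "Drivers: none" string for an empty driver list:

```python
def templated_reason(drivers: list[str], max_len: int = 160) -> str:
    """Deterministic fallback reason, truncated to max_len characters."""
    if not drivers:
        return "Drivers: none"  # assumed wording for the empty case
    reason = "Drivers: " + ", ".join(drivers)
    if len(reason) > max_len:
        # Reserve three characters for the ellipsis.
        reason = reason[: max_len - 3].rstrip() + "..."
    return reason


print(templated_reason(["Weekend uplift", "Event impact"]))
# Drivers: Weekend uplift, Event impact
```

Because the output depends only on the drivers, the same input always yields the same string, which keeps unit tests stable when no key is set.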
  5. Baseline heuristics (ml/inference/predict.py)
  • Rules:
    • Start with base = published_rate if provided else 150.0.
    • Weekend uplift: +20 for Fri/Sat (dow 4/5).
    • Midweek softness: -10 for Tue/Wed (dow 1/2).
    • Seasonality: monthly map {6: +10, 7: +15, 8: +10, 12: +5}.
    • Event impact: base += round(25 * event_impact, 2) if provided.
    • Occupancy/pickup (if available): base += min(15, (occupancy_pct or 0)*10 + min(10, (pickup_24h or 0))).
    • Bounds: price_rec is base after all adjustments, rounded to 2 decimals; price_min = price_rec - 20 and price_max = price_rec + 20 (also rounded; ensure min < rec < max by adjusting if needed).
    • Drivers: collect labels for each adjustment applied (e.g., "Weekend uplift", "Seasonality", "High pickup", "Event impact").

API:

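from typing import Any

from ml.features.schema import FeatureRow

def predict_price(model: object | None, row: dict[str, Any]) -> dict[str, Any]:
    """Return price_rec, price_min, price_max, and drivers for one feature row."""
    # model is reserved for future use; ignored in v0
    f = FeatureRow(**row)  # validates the raw dict against the schema
    # Apply the rules above to f, then return a dict with keys:
    # price_rec, price_min, price_max, drivers
    raise NotImplementedError("apply the baseline rules to f")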
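The rules above could be sketched as follows. To stay self-contained this version takes a plain dict instead of FeatureRow and uses a hypothetical name, baseline_price. The "Midweek softness" and "Occupancy/pickup" labels are assumptions; the ticket only lists example driver names.

```python
SEASONALITY = {6: 10.0, 7: 15.0, 8: 10.0, 12: 5.0}  # monthly map from the rules


def baseline_price(row: dict) -> dict:
    """Apply the v0 rules to a plain dict with FeatureRow-style keys."""
    price = row.get("published_rate")
    if price is None:
        price = 150.0  # default base when no published rate is provided
    drivers: list[str] = []

    dow, month = row.get("dow"), row.get("month")
    if dow in (4, 5):  # Fri/Sat
        price += 20
        drivers.append("Weekend uplift")
    elif dow in (1, 2):  # Tue/Wed
        price -= 10
        drivers.append("Midweek softness")
    if month in SEASONALITY:
        price += SEASONALITY[month]
        drivers.append("Seasonality")

    event_impact = row.get("event_impact")
    if event_impact:
        price += round(25 * event_impact, 2)
        drivers.append("Event impact")

    occupancy, pickup = row.get("occupancy_pct"), row.get("pickup_24h")
    if occupancy is not None or pickup is not None:
        price += min(15, (occupancy or 0) * 10 + min(10, pickup or 0))
        drivers.append("Occupancy/pickup")

    price = round(price, 2)
    return {
        "price_rec": price,
        "price_min": round(price - 20, 2),
        "price_max": round(price + 20, 2),
        "drivers": drivers,
    }


# Friday in July with a 0.4 event impact: 150 + 20 + 15 + 10
print(baseline_price({"dow": 4, "month": 7, "event_impact": 0.4})["price_rec"])  # 195.0
```

With fixed ±20 bounds, min < rec < max holds automatically, so no extra adjustment step appears in this sketch.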
  6. Batch scoring helper (ml/inference/predict.py)
from datetime import date, timedelta

def score_dates(*, hotel_id: int, room_type_code: str, from_date: str, to_date: str, location: str | None = None) -> tuple[list[dict], dict]:
    """Score every date from from_date to to_date and return (items, metadata)."""
    # Build feature rows for each date
    # If location provided and PERPLEXITY_API_KEY available → fetch daily impact
    # Call predict_price for each row
    # Return (items, metadata) where metadata contains sources and parameters
    raise NotImplementedError("build rows, fetch impact, and call predict_price")
  7. Caching utilities (ml/utils/caching.py)
  • Minimal JSON read/write with file lock to avoid corruption.
  • Hash key: hashlib.sha1(json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest().
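A minimal sketch of the cache helpers, under one loudly stated substitution: instead of a file lock it writes to a temporary file and swaps it in with os.replace, which also prevents readers from seeing a half-written file but does not serialize concurrent writers. The function names and the prefix parameter are assumptions.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def cache_key(params: dict) -> str:
    """Stable SHA-1 hex digest of the JSON-serialized parameters."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def read_cache(cache_dir: Path, prefix: str, params: dict):
    """Return the cached value, or None if it is missing or unreadable."""
    path = cache_dir / f"{prefix}_{cache_key(params)}.json"
    try:
        return json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def write_cache(cache_dir: Path, prefix: str, params: dict, value) -> Path:
    """Write via a temp file plus os.replace (used here instead of a file lock)."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{prefix}_{cache_key(params)}.json"
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(value, handle)
    os.replace(tmp_name, path)  # atomic swap on the same filesystem
    return path
```

sort_keys=True makes the key independent of dict ordering, so the same (location, from_date, to_date) always maps to the same file.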
  8. Tests
  • Add lightweight unit tests (pytest) for:
    • Weekend uplift on a known Friday vs Wednesday.
    • Seasonality bumps in July.
    • Bounds are always 40 wide and centered on price_rec (price_min = price_rec - 20, price_max = price_rec + 20).
    • Perplexity adapter fallback without key.

Optional (Stretch)

  • Provide a thin adapter in backend/app/services/ml_service.py that, if USE_ML_BASELINE=true, proxies quote() to ml.inference.predict.score_dates for now. Keep it behind a flag; default continues to mock.

Environment Variables

  • PERPLEXITY_API_KEY — required to fetch events; otherwise adapter no-ops.
  • OPENAI_API_KEY — required for LLM reasons; otherwise templated reasons.
  • USE_ML_BASELINE — optional boolean to toggle backend to this engine later.
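Since environment variables are always strings, the boolean toggle needs explicit parsing. A small sketch, assuming a hypothetical env_flag helper and an assumed set of accepted truthy spellings:

```python
import os

_TRUTHY = {"1", "true", "yes", "on"}  # assumed accepted spellings


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret an environment variable as a boolean flag."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


use_ml_baseline = env_flag("USE_ML_BASELINE")  # False unless explicitly enabled
```

Defaulting to False keeps the mock as the default path, as the stretch item requires.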

Local Run and Manual Test

# 1) Install DS deps
pip install -r ml/requirements-ml.txt

# 2) Quick check (Python REPL)
from ml.inference.predict import score_dates
items, meta = score_dates(hotel_id=1, room_type_code="DLX-QUEEN", from_date="2025-11-10", to_date="2025-11-20", location="Dublin, Ireland")
len(items), items[0], list(meta.keys())

Out of Scope (for this ticket)

  • True model training, feature stores, or PMS ingestion.
  • Persistence of predictions or integration into API routes (covered by separate backend ticket).
  • Advanced explanations (SHAP, feature importances).

Risks & Mitigations

  • API rate limits or missing keys → design graceful fallbacks and caching.
  • Event date extraction ambiguity → start with manual date fields from results; extend with light NLP later.
  • Noisy heuristics → keep drivers explicit and explainable; will be replaced by a trained model.

Definition of Done

  • ml/ package created with modules listed above.
  • score_dates returns valid items for a 10-day range with non-flat prices and sensible drivers.
  • Perplexity and LLM integration works when keys set; otherwise code falls back without errors.
  • Unit tests for heuristics and adapter fallback pass locally.
  • Clear documentation of what was built, how it works, and why each design decision was made.
