AI/ML/DS Checkpoint November 5th 2025 #34

## AI/ML/DS Ticket: Baseline Pricing Engine v0 (Perplexity + LLM + Heuristics)

### Summary

- Build a minimal but real pricing engine that combines:
  - Simple, explainable heuristics on top of historical features (day-of-week, seasonality, optional occupancy/pickup proxies).
  - External signals via the Perplexity Search API to derive a daily event impact score.
  - An optional LLM summarizer to produce short human-readable reasons.
- Package it under `ml/` with clean function boundaries so the backend can later call it via `MLService` or a thin adapter.
- Must run locally without external keys (falls back to heuristics and templated reasons).

### Why Now

- Unblocks an initial “real” engine beyond a pure mock while keeping scope tractable. Establishes extensible interfaces for future model swaps (XGBoost/LightGBM), PMS-fed features, and richer signals.

### Acceptance Criteria

- `ml/` package exists and is importable from backend (local dev path is fine).
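- `ml/inference/predict.py` exposes `predict_price(model, row: dict) -> dict`, returning the keys `price_rec`, `price_min`, `price_max`, and `drivers` (`model` is reserved for future use and ignored in v0).
- A batch function `score_dates(hotel_id, room_type_code, from_date, to_date, location=None)` returns `(items, metadata)`, where the items show non-flat prices, weekend uplift, and reasonable bounds.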
- Perplexity adapter can fetch events for a location and date range, map them into a per-day `impact_score` in [0, 1], and cache responses locally. If `PERPLEXITY_API_KEY` is missing, returns empty events gracefully.
- LLM summarizer produces a ≤ 160 char reason string from drivers if `OPENAI_API_KEY` is set; otherwise, returns a deterministic templated string.
- Unit tests cover: weekend uplift, month seasonality, bounding, and adapter fallbacks.

### Proposed Directory Structure

```
ml/
  requirements-ml.txt
  __init__.py
  features/
    __init__.py
    schema.py                # Pydantic schemas for feature rows
    make_features.py         # placeholder to compute features from raw (later)
  inference/
    __init__.py
    predict.py               # baseline heuristics + optional signals
  external/
    __init__.py
    perplexity_adapter.py    # search → normalized events per day + caching
    llm_reasoner.py          # optional OpenAI call; templated fallback
  utils/
    __init__.py
    dates.py                 # date helpers
    caching.py               # simple file cache
  cache/                     # gitignored JSON cache files
  artifacts/                 # future model files (gitignored)
```

### Detailed Implementation Plan

1. Create `ml/requirements-ml.txt`
   - Contents (minimum versions shown; tighten the pins as needed):

```text
pandas>=2.1
numpy>=1.26
pydantic>=2.7
python-dotenv>=1.0
requests>=2.32
perplexityai>=0.17.0
openai>=1.0.0
```

2. Define feature schemas (`ml/features/schema.py`)

```python
from pydantic import BaseModel, Field

class FeatureRow(BaseModel):
    date: str                 # YYYY-MM-DD
    hotel_id: int
    room_type_code: str
    published_rate: float | None = None  # if provided (e.g., from PMS)
    occupancy_pct: float | None = Field(default=None, ge=0, le=1)  # 0..1 if available
    pickup_24h: int | None = Field(default=None, ge=0)             # new bookings in last 24h
    month: int | None = Field(default=None, ge=1, le=12)
    dow: int | None = Field(default=None, ge=0, le=6)              # 0=Mon..6=Sun
    event_impact: float | None = Field(default=None, ge=0, le=1)   # 0..1 (filled by adapter)
```

3. Perplexity adapter (`ml/external/perplexity_adapter.py`)

- Responsibility: given `(location: str, from_date: str, to_date: str)`, return a dict mapping `date -> impact_score` and a list of raw sources. (`from` is a reserved word in Python, so the parameters are named `from_date`/`to_date`, matching `score_dates`.)
- Strategy:
  - Query Perplexity with: `"events in {location} between {from_date} and {to_date} that impact hotel demand"`.
  - `max_results=5..10`, extract dates if present, otherwise heuristically map to nearest relevant days.
  - Score each result (e.g., concerts/sports: 0.6–0.9; conferences: 0.3–0.6) and clamp to [0, 1].
  - Cache JSON responses under `ml/cache/perplexity_{hash}.json` keyed by `(location, from_date, to_date)`.
  - If `PERPLEXITY_API_KEY` is missing or request fails, return empty mapping and empty sources.

Example shape:

```python
{
  "daily": {"2025-11-12": 0.7, "2025-11-13": 0.4},
  "sources": [{"title": "Taylor Swift @ Aviva Stadium", "url": "https://...", "date": "2025-11-12"}]
}
```
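
The scoring step could be sketched as below. This is a minimal illustration, not the adapter itself: `CATEGORY_SCORES`, `DEFAULT_SCORE`, `to_daily_impact`, and the `category` field on each event are hypothetical, and the per-category values are simply midpoints of the ranges suggested above. It assumes events have already been extracted from the search results and, when several events fall on one day, keeps the strongest rather than summing.

```python
from collections import defaultdict

# Hypothetical scores: midpoints of the ranges suggested in the strategy.
CATEGORY_SCORES = {"concert": 0.75, "sports": 0.75, "conference": 0.45}
DEFAULT_SCORE = 0.2  # assumed score for events with no recognized category


def clamp01(x: float) -> float:
    """Clamp a value to the closed interval [0, 1]."""
    return max(0.0, min(1.0, x))


def to_daily_impact(events: list[dict]) -> dict:
    """Collapse extracted events into {"daily": {date: score}, "sources": [...]}.

    Each event is assumed to carry "title", "url", "date" (YYYY-MM-DD) and an
    optional "category". Events without a date are skipped.
    """
    daily: dict[str, float] = defaultdict(float)
    sources = []
    for ev in events:
        day = ev.get("date")
        if not day:
            continue
        score = CATEGORY_SCORES.get(ev.get("category"), DEFAULT_SCORE)
        # Keep the strongest event per day instead of summing scores.
        daily[day] = max(daily[day], clamp01(score))
        sources.append({"title": ev.get("title", ""), "url": ev.get("url", ""), "date": day})
    return {"daily": dict(daily), "sources": sources}
```

Taking the maximum keeps the score in [0, 1] without extra normalization; summing with a final clamp would be an equally reasonable choice if stacked events should count for more.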

4. LLM reasoner (`ml/external/llm_reasoner.py`)

- Input: `drivers: list[str]`, `date`, optional extra context (event title snippets).
- Output: short reason (≤160 chars). If `OPENAI_API_KEY` is missing, fall back to a deterministic templated string such as `"Drivers: " + ", ".join(drivers)`, truncated to the same 160-character limit.
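
A possible shape for the templated fallback is sketched below. The function name `templated_reason`, the empty-drivers wording, and the `...` truncation marker are assumptions; the `date` and extra-context inputs are left out for brevity.

```python
MAX_REASON_LEN = 160  # limit stated in the acceptance criteria


def templated_reason(drivers: list[str], max_len: int = MAX_REASON_LEN) -> str:
    """Deterministic reason string used when no LLM key is configured."""
    if not drivers:
        return "Drivers: baseline rate"  # assumed wording for the no-adjustment case
    text = "Drivers: " + ", ".join(drivers)
    if len(text) <= max_len:
        return text
    # Truncate and mark the cut so the result never exceeds max_len.
    return text[: max_len - 3].rstrip() + "..."
```

Because the output depends only on `drivers`, the same input always yields the same string, which keeps the fallback easy to unit test.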

5. Baseline heuristics (`ml/inference/predict.py`)

- Rules:
  - Start with `base = published_rate if provided else 150.0`.
  - Weekend uplift: +20 for Fri/Sat (dow 4/5).
  - Midweek softness: -10 for Tue/Wed (dow 1/2).
  - Seasonality: monthly map `{6: +10, 7: +15, 8: +10, 12: +5}`.
  - Event impact: `base += round(25 * event_impact, 2)` if provided.
  - Occupancy/pickup (if available): `base += min(15, (occupancy_pct or 0)*10 + min(10, (pickup_24h or 0)))`.
  - Bounds: `price_min = base - 20`, `price_max = base + 20` (then round 2 decimals; ensure min < rec < max by adjusting if needed).
  - Drivers: collect labels for each adjustment applied (e.g., `"Weekend uplift"`, `"Seasonality"`, `"High pickup"`, `"Event impact"`).

API:

```python
from ml.features.schema import FeatureRow

def predict_price(model: object | None, row: dict) -> dict:
    """Return price_rec, price_min, price_max and drivers for one feature row."""
    # model is reserved for future use; ignored in v0
    f = FeatureRow(**row)
    # TODO: compute base from `f` using the rules above, then return a dict with keys:
    # price_rec, price_min, price_max, drivers
    raise NotImplementedError
```
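
To make the rules concrete, here is one way they might translate into code. It is a sketch under stated assumptions: `baseline_price` is a hypothetical name, it takes a plain dict so it runs without Pydantic, the `"Midweek softness"` label is invented to match the rule name, and `"High pickup"` is attached whenever the occupancy/pickup adjustment is positive.

```python
SEASONALITY = {6: 10.0, 7: 15.0, 8: 10.0, 12: 5.0}  # month -> bump, from the rules above


def baseline_price(row: dict) -> dict:
    """Apply the v0 heuristics to one feature row given as a plain dict."""
    base = row.get("published_rate")
    if base is None:
        base = 150.0
    drivers = []

    dow = row.get("dow")  # 0=Mon..6=Sun
    if dow in (4, 5):
        base += 20.0
        drivers.append("Weekend uplift")
    elif dow in (1, 2):
        base -= 10.0
        drivers.append("Midweek softness")  # assumed label

    bump = SEASONALITY.get(row.get("month"), 0.0)
    if bump:
        base += bump
        drivers.append("Seasonality")

    impact = row.get("event_impact")
    if impact:
        base += round(25 * impact, 2)
        drivers.append("Event impact")

    occ = row.get("occupancy_pct") or 0
    pickup = row.get("pickup_24h") or 0
    demand = min(15, occ * 10 + min(10, pickup))
    if demand > 0:
        base += demand
        drivers.append("High pickup")  # assumed trigger: any positive adjustment

    rec = round(base, 2)
    return {
        "price_rec": rec,
        "price_min": round(rec - 20, 2),
        "price_max": round(rec + 20, 2),
        "drivers": drivers,
    }
```

For a Friday in July with no published rate, this gives 150 + 20 + 15 = 185, bounded by 165 and 205.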

6. Batch scoring helper

```python
from datetime import date, timedelta

def score_dates(*, hotel_id: int, room_type_code: str, from_date: str, to_date: str, location: str | None = None) -> tuple[list[dict], dict]:
    """Score every date in the range; returns (items, metadata)."""
    # Build feature rows for each date
    # If location provided and PERPLEXITY_API_KEY available → fetch daily impact
    # Call predict_price for each row
    # Return (items, metadata) where metadata contains sources and parameters
    raise NotImplementedError
```

7. Caching utilities (`ml/utils/caching.py`)

- Minimal JSON read/write with file lock to avoid corruption.
- Hash key: `hashlib.sha1(json.dumps(params, sort_keys=True).encode("utf-8")).hexdigest()`.
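
A standard-library sketch of such a cache follows. Note one substitution: instead of a file lock, it writes to a temporary file and atomically renames it with `os.replace`, which also prevents readers from seeing a half-written file but does not serialize concurrent writers. The function names are hypothetical.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def cache_key(params: dict) -> str:
    """Stable SHA-1 hex digest of the parameters (key order does not matter)."""
    payload = json.dumps(params, sort_keys=True).encode("utf-8")
    return hashlib.sha1(payload).hexdigest()


def cache_path(cache_dir: Path, prefix: str, params: dict) -> Path:
    """Path such as <cache_dir>/perplexity_<hash>.json."""
    return cache_dir / f"{prefix}_{cache_key(params)}.json"


def read_cache(path: Path):
    """Return the cached JSON value, or None if missing or unreadable."""
    try:
        with open(path, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return None


def write_cache(path: Path, data: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
        os.replace(tmp, path)  # atomic when source and target share a filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise
```

If several processes may refresh the same key at once, a real lock could still be layered on top; for a single local developer process the atomic rename is likely sufficient.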

8. Tests

- Add lightweight unit tests (pytest) for:
  - Weekend uplift on a known Friday vs Wednesday.
  - Seasonality bumps in July.
  - Bounds are always 40 wide: `price_min = price_rec - 20` and `price_max = price_rec + 20`.
  - Perplexity adapter fallback without key.

### Optional (Stretch)

- Provide a thin adapter in `backend/app/services/ml_service.py` that, if `USE_ML_BASELINE=true`, proxies `quote()` to `ml.inference.predict.score_dates` for now. Keep it behind a flag; default continues to mock.

### Environment Variables

- `PERPLEXITY_API_KEY` — required to fetch events; otherwise adapter no-ops.
- `OPENAI_API_KEY` — required for LLM reasons; otherwise templated reasons.
- `USE_ML_BASELINE` — optional boolean to toggle backend to this engine later.
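
One way to read these variables consistently is sketched below. `env_flag`, `has_key`, and the set of accepted truthy strings are assumptions, not an existing project convention.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}  # assumed set of accepted truthy values


def env_flag(name: str, default: bool = False) -> bool:
    """Parse a boolean environment variable such as USE_ML_BASELINE."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in TRUTHY


def has_key(name: str) -> bool:
    """True when an API key variable is set to a non-empty value."""
    return bool(os.environ.get(name, "").strip())
```

The adapters could then branch on `has_key("PERPLEXITY_API_KEY")` and `has_key("OPENAI_API_KEY")` to pick the live path or the fallback, treating an empty string the same as an unset variable.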

### Local Run and Manual Test

```bash
# 1) Install DS deps
pip install -r ml/requirements-ml.txt

# 2) Quick check (run from the repo root so `ml` is importable)
python - <<'EOF'
from ml.inference.predict import score_dates
items, meta = score_dates(hotel_id=1, room_type_code="DLX-QUEEN", from_date="2025-11-10", to_date="2025-11-20", location="Dublin, Ireland")
print(len(items), items[0], list(meta.keys()))
EOF
```

### Out of Scope (for this ticket)

- True model training, feature stores, or PMS ingestion.
- Persistence of predictions or integration into API routes (covered by separate backend ticket).
- Advanced explanations (SHAP, feature importances).

### Risks & Mitigations

- API rate limits or missing keys → design graceful fallbacks and caching.
- Event date extraction ambiguity → start with manual date fields from results; extend with light NLP later.
- Noisy heuristics → keep drivers explicit and explainable; will be replaced by a trained model.

### Resources

- Perplexity Python SDK: [Perplexity on PyPI](https://pypi.org/project/perplexityai/)
- Perplexity API overview: [Perplexity Docs](https://docs.perplexity.ai/)
- OpenAI API (for reasons): [OpenAI Docs](https://platform.openai.com/docs/)
- Pydantic v2: [Pydantic Docs](https://docs.pydantic.dev/latest/)

### Definition of Done

- `ml/` package created with modules listed above.
- `score_dates` returns valid items for a 10-day range with non-flat prices and sensible drivers.
- Perplexity and LLM integrations work when keys are set; otherwise the code falls back without errors.
- Unit tests for heuristics and adapter fallback pass locally.
- Strong documentation of what was built, how it works, and why each decision was made.

