Description
PR #5551 changed the start_date() function to prioritize parent start dates over cron-based inference, correctly fixing incorrect parent backfill during restatements. However, this change introduces a regression for maiden plans.
The Issue: When a model has no explicit start date (and relies on inference from parents or cron), the following happens during a maiden plan:
- The model's start date is inferred using cron relative to the plan's end date (often today)
- This results in the model being backfilled from a later date than the plan's start date
- For example, a model with @daily cron and no explicit start, when planning from 2024-01-01, gets backfilled from 2024-01-09 instead of 2024-01-01
- If the model has children with explicit start dates, those children will be missing required parent data
Root Cause
The start_date() Function Change (PR #5551)
PR #5551 changed the logic in the start_date() function to fix incorrect parent backfill during restatements. Here is the code before and after:
BEFORE (Buggy - Before PR #5551):
# Always starts with cron, then compares with parents
earliest = snapshot.node.cron_prev(snapshot.node.cron_floor(relative_to or now()))
for parent in snapshot.parents:
    if parent in snapshots:
        earliest = min(
            earliest,
            start_date(snapshots[parent], snapshots, cache=cache, relative_to=relative_to),
        )

Behavior: Cron date is always considered and compared against parent dates. This caused issues during restatements where cron relative to restatement date would incorrectly influence parent start dates.
AFTER (Fixed - After PR #5551):
# Collect parent start dates first
parent_starts = [
    start_date(snapshots[parent], snapshots, cache=cache, relative_to=relative_to)
    for parent in snapshot.parents
    if parent in snapshots
]
# Use parent dates if available, otherwise fallback to cron
earliest = (
    min(parent_starts)
    if parent_starts
    else snapshot.node.cron_prev(snapshot.node.cron_floor(relative_to or now()))
)

Behavior: Parent start dates take precedence. Cron is only used as a fallback when no parents exist. This correctly fixes the restatement bug.
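To make the difference between the two versions concrete, here is a minimal, self-contained sketch on a toy model graph. It does not use SQLMesh itself: ToyModel, cron_fallback, start_before and start_after are hypothetical names, the cron is assumed to be daily, and the early return for an explicit start is an assumption about how start_date() treats models that define one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ToyModel:
    """Hypothetical stand-in for a snapshot: an optional explicit start and parent names."""
    start: Optional[datetime] = None
    parents: Tuple[str, ...] = ()


def cron_fallback(relative_to: datetime) -> datetime:
    """Stand-in for cron_prev(cron_floor(relative_to)) with an assumed @daily cron."""
    floored = relative_to.replace(hour=0, minute=0, second=0, microsecond=0)
    return floored - timedelta(days=1)


def start_before(name: str, models: Dict[str, ToyModel], relative_to: datetime) -> datetime:
    """Old logic: start from the cron date, then take the minimum with parent starts."""
    model = models[name]
    if model.start:  # assumption: an explicit start always wins
        return model.start
    earliest = cron_fallback(relative_to)
    for parent in model.parents:
        if parent in models:
            earliest = min(earliest, start_before(parent, models, relative_to))
    return earliest


def start_after(name: str, models: Dict[str, ToyModel], relative_to: datetime) -> datetime:
    """New logic: parent starts take precedence; cron is only a fallback."""
    model = models[name]
    if model.start:  # assumption: an explicit start always wins
        return model.start
    parent_starts = [
        start_after(parent, models, relative_to)
        for parent in model.parents
        if parent in models
    ]
    return min(parent_starts) if parent_starts else cron_fallback(relative_to)


models = {
    "parent": ToyModel(start=datetime(2024, 1, 5)),
    "child": ToyModel(parents=("parent",)),
    "root": ToyModel(),  # no explicit start, no parents
}
relative_to = datetime(2024, 1, 3)

before_child = start_before("child", models, relative_to)
after_child = start_after("child", models, relative_to)
after_root = start_after("root", models, relative_to)
```

In this sketch the old logic lets the cron date (2024-01-02) pull the child's inferred start before its parent's explicit start (2024-01-05), while the new logic returns the parent's start. A model with neither an explicit start nor parents ("root") still depends entirely on relative_to in both versions.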
The Maiden Plan Problem
However, this change introduces a regression for maiden plans. In sqlmesh/core/snapshot/definition.py, the start_date() function is called during missing_intervals() calculation with relative_to=snapshot_end_date (line 2124):
snapshot_start_date = max(
    to_datetime(snapshot_start_date),
    to_datetime(start_date(snapshot, snapshots, cache, relative_to=snapshot_end_date)),
)

Problem: For models without explicit start dates and no parent dependencies, the NEW code falls back to:

snapshot.node.cron_prev(snapshot.node.cron_floor(relative_to or now()))

Since relative_to is set to snapshot_end_date (the plan's end date, often today), the cron calculation produces a start date that's too late for the plan's start date requirements.
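The arithmetic behind the 2024-01-01 vs 2024-01-09 example can be sketched with plain datetime objects. This is an illustration only: daily_cron_floor and daily_cron_prev are hypothetical stand-ins for the node's cron helpers with an assumed @daily schedule, and the plan end date of 2024-01-10 is an assumption chosen so that the numbers match the example in the description.

```python
from datetime import datetime, timedelta


def daily_cron_floor(ts: datetime) -> datetime:
    """Latest daily tick at or before ts (midnight of the same day)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def daily_cron_prev(ts: datetime) -> datetime:
    """Latest daily tick strictly before ts."""
    floored = daily_cron_floor(ts)
    return floored - timedelta(days=1) if floored == ts else floored


plan_start = datetime(2024, 1, 1)
plan_end = datetime(2024, 1, 10)  # assumed plan end date

# Fallback used for a model with no explicit start and no parents.
inferred_start = daily_cron_prev(daily_cron_floor(plan_end))

# Mirrors the max(...) in missing_intervals(): the later date wins.
effective_start = max(plan_start, inferred_start)

# Days a child with an explicit 2024-01-01 start would expect but never get.
missing_days = [
    plan_start + timedelta(days=offset)
    for offset in range((effective_start - plan_start).days)
]
```

Because the inferred start is derived from the end of the plan, max(...) discards the requested plan start, and the days from 2024-01-01 up to (but excluding) 2024-01-09 are never backfilled for this model.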
Related Issues/PRs
- PR #5551: Fix: Unexpected backfill of a parent when an interval outside the parent's range is restated for a child