Regression: Cron-based start date inference incorrect for maiden plans #5629

@anismiles

Description

PR #5551 changed start_date() to prioritize parent start dates over cron-based inference, which fixed incorrect parent backfill during restatements. However, the change introduces a regression for maiden plans.

The Issue: When a model has no explicit start date (and so relies on inference from parents or cron), the following happens during a maiden plan:

  1. The model's start date is inferred using cron relative to the plan's end date (often today)
  2. As a result, the model is backfilled from a date later than the plan's start date
  3. For example, a model with an @daily cron and no explicit start date, planned from 2024-01-01, is backfilled from 2024-01-09 instead of 2024-01-01
  4. If the model has children with explicit start dates, those children will be missing required parent data
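
To make the steps above concrete, here is a minimal sketch using only the standard library. It does not call SQLMesh: `daily_cron_floor` and `daily_cron_prev` are hypothetical stand-ins for `cron_floor` / `cron_prev` on an @daily model, and the plan end date of 2024-01-10 is an assumption chosen so the numbers match the example in step 3.

```python
from datetime import datetime, timedelta


def daily_cron_floor(ts: datetime) -> datetime:
    """Hypothetical stand-in for cron_floor on an @daily cron: snap to midnight."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def daily_cron_prev(ts: datetime) -> datetime:
    """Hypothetical stand-in for cron_prev: the last daily tick strictly before ts."""
    return daily_cron_floor(ts - timedelta(microseconds=1))


plan_start = datetime(2024, 1, 1)
plan_end = datetime(2024, 1, 10)  # assumption: the plan's end date ("often today")

# Step 1: the start is inferred from cron relative to the plan's END date.
inferred_start = daily_cron_prev(daily_cron_floor(plan_end))

# Step 2: the later of the two dates wins, so the backfill starts late.
parent_backfill_start = max(plan_start, inferred_start)

# Step 4: a child with an explicit start needs parent data from its own start.
child_start = datetime(2024, 1, 1)
missing_parent_days = []
day = child_start
while day < parent_backfill_start:
    missing_parent_days.append(day.date().isoformat())
    day += timedelta(days=1)

print("parent backfilled from:", parent_backfill_start.date())
print("parent days the child cannot read:", missing_parent_days)
```

Under these assumptions the parent is backfilled from 2024-01-09, and the child has no parent data for 2024-01-01 through 2024-01-08.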

Root Cause

The start_date() Function Change (PR #5551)

PR #5551 changed the logic of start_date() to fix incorrect parent backfill during restatements. The relevant code before and after the change:

BEFORE (Buggy - Before PR #5551):

# Always starts with cron, then compares with parents
earliest = snapshot.node.cron_prev(snapshot.node.cron_floor(relative_to or now()))

for parent in snapshot.parents:
    if parent in snapshots:
        earliest = min(
            earliest,
            start_date(snapshots[parent], snapshots, cache=cache, relative_to=relative_to),
        )

Behavior: The cron date is always computed and then compared against the parent dates. This caused issues during restatements, where a cron date computed relative to the restatement date could incorrectly influence parent start dates.

AFTER (Fixed - After PR #5551):

# Collect parent start dates first
parent_starts = [
    start_date(snapshots[parent], snapshots, cache=cache, relative_to=relative_to)
    for parent in snapshot.parents
    if parent in snapshots
]
# Use parent dates if available, otherwise fallback to cron
earliest = (
    min(parent_starts)
    if parent_starts
    else snapshot.node.cron_prev(snapshot.node.cron_floor(relative_to or now()))
)

Behavior: Parent start dates take precedence. Cron is used only as a fallback when the snapshot has no parents in snapshots. This fixes the restatement bug.
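
The difference between the two versions can be reproduced with a toy model. The sketch below is not the SQLMesh implementation: `ToyNode`, `cron_fallback`, `start_before`, and `start_after` are hypothetical names, the cron is assumed to be @daily, and the short-circuit on an explicit `start` is an assumption about code outside the quoted snippets. The inputs are contrived so that the two versions disagree.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class ToyNode:
    """Hypothetical stand-in for a snapshot: optional explicit start plus parents."""
    name: str
    start: Optional[datetime] = None
    parents: List[str] = field(default_factory=list)


def cron_fallback(relative_to: datetime) -> datetime:
    """Assumed @daily equivalent of cron_prev(cron_floor(relative_to))."""
    floored = relative_to.replace(hour=0, minute=0, second=0, microsecond=0)
    return floored - timedelta(days=1)


def start_before(node: ToyNode, nodes: Dict[str, ToyNode], relative_to: datetime) -> datetime:
    """Pre-#5551 shape: start from cron, then take the min with parent starts."""
    if node.start is not None:  # assumption: explicit starts short-circuit
        return node.start
    earliest = cron_fallback(relative_to)
    for parent in node.parents:
        if parent in nodes:
            earliest = min(earliest, start_before(nodes[parent], nodes, relative_to))
    return earliest


def start_after(node: ToyNode, nodes: Dict[str, ToyNode], relative_to: datetime) -> datetime:
    """Post-#5551 shape: parent starts win; cron is only a fallback."""
    if node.start is not None:  # assumption: explicit starts short-circuit
        return node.start
    parent_starts = [
        start_after(nodes[parent], nodes, relative_to)
        for parent in node.parents
        if parent in nodes
    ]
    return min(parent_starts) if parent_starts else cron_fallback(relative_to)


nodes = {
    "parent": ToyNode("parent", start=datetime(2024, 1, 5)),
    "child": ToyNode("child", parents=["parent"]),
    "orphan": ToyNode("orphan"),
}
relative_to = datetime(2024, 1, 3)  # contrived date, earlier than the parent's start

child_before = start_before(nodes["child"], nodes, relative_to)
child_after = start_after(nodes["child"], nodes, relative_to)
orphan_after = start_after(nodes["orphan"], nodes, relative_to)

print("child, before:", child_before.date())
print("child, after: ", child_after.date())
print("orphan, after:", orphan_after.date())
```

With these inputs the old shape lets the cron date pull the child's start to 2024-01-02, ahead of its parent's explicit start, while the new shape returns the parent's 2024-01-05. The parentless model takes the cron fallback, which is the path the rest of this issue is concerned with.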

The Maiden Plan Problem

The regression shows up in maiden plans. In sqlmesh/core/snapshot/definition.py, start_date() is called during the missing_intervals() calculation with relative_to=snapshot_end_date (line 2124):

snapshot_start_date = max(
    to_datetime(snapshot_start_date),
    to_datetime(start_date(snapshot, snapshots, cache, relative_to=snapshot_end_date)),
)

Problem: For models with no explicit start date and no parent dependencies, the new code falls back to:

snapshot.node.cron_prev(snapshot.node.cron_floor(relative_to or now()))

Since relative_to is set to snapshot_end_date (the plan's end date, often today), the cron calculation yields a start date later than the plan's start date. Because that result is combined with snapshot_start_date via max(), the later, cron-derived date wins.
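
A small sketch of how the max() clamp interacts with the cron fallback, again with stand-in code rather than SQLMesh itself. `cron_inferred_start` and `effective_start` are hypothetical helpers, the cron is assumed to be @daily, and the end dates are arbitrary. It shows that, for a parentless model with no explicit start, the effective start follows the end date and ignores the requested plan start.

```python
from datetime import datetime, timedelta


def cron_inferred_start(relative_to: datetime) -> datetime:
    """Assumed @daily equivalent of cron_prev(cron_floor(relative_to))."""
    floored = relative_to.replace(hour=0, minute=0, second=0, microsecond=0)
    return floored - timedelta(days=1)


def effective_start(plan_start: datetime, snapshot_end_date: datetime) -> datetime:
    """Mirrors the shape of the max(...) expression quoted above."""
    return max(plan_start, cron_inferred_start(snapshot_end_date))


plan_start = datetime(2024, 1, 1)
end_dates = [
    datetime(2024, 1, 10),
    datetime(2024, 2, 1),
    datetime(2024, 6, 15, 12, 30),
]

# Map each assumed end date to the start that the backfill would actually use.
results = {
    end.date().isoformat(): effective_start(plan_start, end).date().isoformat()
    for end in end_dates
}

for end, start in results.items():
    print(f"end={end} -> effective start={start} (requested {plan_start.date()})")
```

In every case the effective start is one daily tick before the end date, no matter how early the plan's start date is.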
