datakit-smoke: verify source staging in us-central1#5106

Merged
ravwojdyla merged 2 commits into main from
rav-datakit-sources-staged-check
Apr 23, 2026

Conversation

@ravwojdyla-agent
Contributor

@ravwojdyla-agent ravwojdyla-agent commented Apr 22, 2026

  • new parallel job datakit-sources-staged in marin-datakit-smoke.yaml, runs alongside the ferry on the same schedule
  • add scripts/datakit/validate_source_staging.py: a ThreadPoolExecutor over StatusFile(<full_path>).status for the 70 unique DatakitSource.staged_path prefixes under gs://marin-us-central1; exits non-zero with a per-path <status>: <url> report when any is not SUCCESS [1]
    • covers every source with a non-None staged_path, not just the pinned subset
    • reuses marin.execution.executor_step_status.StatusFile so both the plain-text and legacy JSON-lines .executor_status formats are handled
  • notify-slack now fans in from both lanes via needs: [datakit-smoke, datakit-sources-staged] — either failure pages Slack on scheduled runs
  • claude-triage stays scoped to the ferry; the sources lane is self-describing from its own error output
  • local probe today: 70/70 paths report SUCCESS — the two previous offenders (common_corpus_english-b78a5c1, finetranslations_d17a789b) were dropped from the registry on main, so this lane goes green on day one

Footnotes

  1. fs.ls non-empty was the earlier bar, which both of the now-dropped paths passed; tightening to SUCCESS is the point — a half-written or abandoned dump must not quietly flow into a ferry.
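
The parallel status check described above can be sketched roughly as follows. This is a minimal illustration and not the contents of validate_source_staging.py: `find_unstaged`, `exit_code`, the `read_status` callable, and the `MISSING` placeholder are hypothetical names, and `read_status` stands in for whatever wraps `StatusFile(<full_path>).status` in the real script.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, Tuple

SUCCESS = "SUCCESS"


def find_unstaged(
    paths: Iterable[str],
    read_status: Callable[[str], Optional[str]],  # hypothetical stand-in for the StatusFile lookup
    max_workers: int = 16,
) -> List[Tuple[str, str]]:
    """Return (status, path) for every unique path whose status is not SUCCESS."""
    unique = sorted(set(paths))  # deduplicate so each prefix is probed once
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so statuses line up with `unique`
        statuses = list(pool.map(read_status, unique))
    # A path with no status at all is reported as MISSING (assumed label)
    return [(s or "MISSING", p) for p, s in zip(unique, statuses) if s != SUCCESS]


def exit_code(paths: Iterable[str], read_status: Callable[[str], Optional[str]]) -> int:
    """Print a per-path '<status>: <url>' report and return a process exit code."""
    failures = find_unstaged(paths, read_status)
    for status, path in failures:
        print(f"{status}: {path}")
    return 1 if failures else 0
```

Passing the status reader in as a callable keeps the sketch testable without GCS access; a dict's `.get` method is enough to exercise it locally.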

@ravwojdyla-agent ravwojdyla-agent added the agent-generated Created by automation/agent label Apr 22, 2026
@ravwojdyla-agent ravwojdyla-agent changed the title from "datakit-smoke: parallel lane verifies source staging in us-central1" to "datakit-smoke: verify source staging in us-central1" Apr 23, 2026
Base automatically changed from rav-datakit-sources to main April 23, 2026 00:26
ravwojdyla and others added 2 commits April 22, 2026 17:27
Checks the ~72 unique DatakitSource.staged_path prefixes under gs://marin-us-central1
each day in parallel with the ferry, so a staged dump that is deleted or
otherwise disappears surfaces before the next ferry run 404s on download. Covers
every source with a non-None staged_path, not just the pinned subset — unpinned
entries still point at real bytes the registry claims exist.

notify-slack now watches both lanes so either failure pages Slack;
claude-triage stays scoped to the ferry (the sources check explains itself
from its own error output).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tighten the lane from "prefix has at least one object" to "the executor step
that produced it reports SUCCESS". Reuses StatusFile so both the plain-text
and legacy JSON-lines formats are handled (the former is a single token, the
latter is an event log where the latest status wins).
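
As a rough illustration of the two formats (not the actual StatusFile implementation), the "single token" versus "latest event wins" distinction might be handled like this. The function name `latest_status` and the `"status"` key in each JSON event are assumptions made for the sketch; the real field names are not stated here.

```python
import json
from typing import Optional


def latest_status(text: str) -> Optional[str]:
    """Return the current status from either status-file format, or None if empty."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    last = lines[-1]
    try:
        event = json.loads(last)
    except json.JSONDecodeError:
        # Plain-text format: the file is a single bare token such as SUCCESS
        return last
    if isinstance(event, dict):
        # JSON-lines format: the last event in the log wins
        # ("status" is an assumed key name)
        return event.get("status")
    return last
```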

Surfaces two existing issues in the registry on first run:
- raw/common_corpus_english-b78a5c1: no .executor_status file at all
- raw/finetranslations_d17a789b: status=RUNNING (never terminated)

Both will need to be cleaned up — or the source entries dropped from the
registry — for this lane to go green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ravwojdyla ravwojdyla force-pushed the rav-datakit-sources-staged-check branch from 664ed9c to 3e2ab77 Compare April 23, 2026 00:28
@ravwojdyla ravwojdyla requested a review from Helw150 April 23, 2026 00:32
@ravwojdyla ravwojdyla merged commit 3def9cc into main Apr 23, 2026
37 checks passed
@ravwojdyla ravwojdyla deleted the rav-datakit-sources-staged-check branch April 23, 2026 00:49
Labels

agent-generated Created by automation/agent

3 participants