This document exists to prevent repeated operator mistakes in PEARL.
It is not aspirational. It is a hard constraint set.
If a step below cannot be executed exactly, stop and ask the operator before touching the run.
These standards apply to:
- mining launches
- rebalances
- VM handoffs
- warmstart launches
- robustness launches
- cleanup
- progress reporting
- artifact counting
No unrequested topology or budget change is allowed.
That means:
- do not change shard count unless explicitly asked
- do not change prompt count unless explicitly asked
- do not change model prior unless explicitly asked
- do not change device target unless explicitly asked or required to fix a broken launch
- do not arm a stop cap, cleanup, or kill path unless explicitly asked or required to stop known duplicate work
Date:
- March 29, 2026
What happened:
- a live wave that had already been rebalanced once was rebalanced again
- the rebalance helper scanned all historical run directories instead of only the latest shard generation
- that caused a new shard generation to be launched against prompts that were already being worked
Impact:
- about
847prompt-equivalents of duplicate compute - about
216,832duplicate raw candidates burned
Root cause:
- code bug in
scripts/rebalance_stage1_wave.py - operator error in trusting the helper without a generation-filtered dry run
Never again:
- rebalances must only read the latest shard generation
- every rebalance must be dry-run first
- dry-run output must be checked against expected unique completed prompts and remaining prompts
- no second rebalance on the same wave without explicit operator approval
Date:
- March 29, 2026
What happened:
- stale
rebal02run directories and one liverebal02worker survived the first cleanup pass
Impact:
- duplicate work continued after the operator believed the duplicate generation was dead
- on-disk progress accounting became polluted
Root cause:
- shell-glob cleanup was not sufficient
- process and filesystem verification were not treated as separate required checks
Never again:
- cleanup is not complete until both of these are true:
pgrepfor the target generation returns zero live workers- filesystem sweep confirms zero matching run dirs, prompt files, and metadata files
- cleanup verification must be done with structured checks, not by visual inspection
Date:
- March 29, 2026
What happened:
- a smoke gate launched with
mpson a GPU VM instead ofcuda
Impact:
- wasted launch time
- required relaunch
Root cause:
- implicit/default device handling was allowed to survive into a remote GPU path
Never again:
- all remote launches must pass an explicit device
- every GPU launch must assert
cuda=Truebefore the actual run command is issued - no smoke gate may rely on ambient defaults
Date:
- March 29, 2026
What happened:
- the finalize bundle sent to the H100 partition VMs was missing
run_ablation.py
Impact:
- remote finalization launch failed and had to be resynced/relaunched
Root cause:
- incomplete sync manifest
Never again:
- every sync bundle needs an explicit manifest
- remote setup is not complete until the manifest is present and import-checked on the target box
Date:
- March 29, 2026
What happened:
- a queued warmstart path still defaulted to a local
/Users/...report path on the remote VM
Impact:
- remote chain broke and had to be patched/relaunched
Root cause:
- local absolute paths leaked into a remote execution path
Never again:
- any script that can run remotely must resolve paths from repo root or runtime cwd
- no
/Users/...path is allowed in remote-executed defaults
Date:
- multiple March 2026 runs
What happened:
- earlier failed launch traces remained at the top of the same log files used by later good runs
Impact:
- status reading became ambiguous
- extra manual interpretation was required
Root cause:
- relaunches reused log files without a hard run boundary
Never again:
- every relaunch must get a fresh log file or a clear run marker
- status reports must key off live PID, timestamps, and new summary artifacts, not just top-of-file text
Documented in:
What happened:
- the notes record
11unicorns by evidence but only9recoverable unicorn sequences on disk
Impact:
- evidence count and usable training data count diverged
Root cause:
- artifact retention and evidence accounting were not kept in strict lockstep
Never again:
- no positive is counted as reusable training data unless its exact artifact path is recorded and present
- evidence count and recoverable-count must be reported separately
Documented in:
- duplicate-run guards were later added to
scripts/run_reseed3_batch.sh
What happened:
- launch paths previously allowed duplicate work unless guarded after the fact
Impact:
- wasted compute risk
Never again:
- every launcher must check:
- live process metadata
- output dir presence
- log metadata presence
- whether the exact named job is already active
Documented in:
What happened:
- the prior coarse stage1 wave hit a rebalance edge case and needed a manual tail fix
Impact:
- automation could not be trusted at the tail
Never again:
- a rebalance tool is not trusted on mixed stopped/relaunched state without a dry-run exact-count check
- if exact remaining prompt count is not obvious, stop and ask instead of improvising
Date:
- repeated across this project cycle
What happened:
- quoting mistakes
- shell-glob cleanup misses
- false-positive greps
- command pipelines that looked valid but were not exact
Impact:
- status confusion
- missed cleanup
- extra operator distrust
Never again:
- use Python for structured checks and destructive cleanup
- shell one-liners are allowed only for simple transport, not for state accounting
- grep patterns must be quoted and tested against exact intended text
- no destructive command should depend on shell expansion when a Python path walk is possible
- every launch must have:
- exact prompt count
- exact candidate count
- exact prior URI
- exact shard count
- exact prompt offset if applicable
- these values must be echoed back before the launch is considered complete
- every rebalance requires:
- explicit operator approval
- dry-run first
- latest-generation-only source selection
- exact expected
completed_prompt_count - exact expected
remaining_prompt_count
- if dry-run numbers are surprising, do not proceed
- every long run needs a declared spend envelope
- duplicate compute is counted as budget burn even if no data is lost
- if duplicate work is detected:
- stop it first
- measure it second
- only then decide whether to continue
- cleanup is not done until:
- target workers are dead
- stale prompt files are gone
- stale run dirs are gone
- stale metadata files are gone
- process state and filesystem state must both be checked
- every status report must distinguish:
- valid unique progress
- duplicate burned progress
- stale artifact residue
- do not report aggregate counts from polluted directories
- if the directory state is ambiguous, say so and clean it before reporting totals
- no implicit
mpsorcudadefault on a remote run - every remote launch must state the intended device explicitly
- every GPU launch must verify:
- torch import works
- CUDA is available
- the intended model/scorer loads on the target box
- every VM sync path needs a manifest
- sync success is not assumed from
rsyncalone - the target box must verify presence of all required scripts before launch
- failed launch logs and successful relaunch logs must be separable
- never reuse a log file in a way that forces manual archaeology
- every relaunch must be visually and mechanically distinguishable
- do not paste secrets into commands that may be echoed back into logs if another path exists
- when a secret must be exported interactively, it must not be repeated in user-facing output
- no ad hoc remote shell surgery for structured state unless there is no safer scripted path
- no destructive glob-based cleanup when a short Python script can do exact path targeting
- if a command is easy to misquote, do not use it
- confirm exact run name
- confirm exact prior URI
- confirm exact prompt count
- confirm exact shard count
- confirm exact offset
- confirm exact budget envelope
- confirm operator explicitly asked for rebalance
- run dry-run
- verify latest generation only
- verify completed and remaining counts
- verify stop targets
- verify zero live target workers
- verify zero stale target run dirs
- verify zero stale target prompt files
- verify zero stale target metadata/log files
- verify counts are unique-valid counts
- verify stale generations are excluded
- verify duplicate burn, if any, is called out separately
If an error is:
- budget-impacting
- data-damaging
- duplicate-work-causing
- path-resolution-causing
- cleanup-causing
then:
- stop the bleeding
- measure exact impact
- document the incident here
- add or tighten the rule that would have prevented it
- do not continue normal execution until the rule is in place
For the current even-million wave:
- no more topology changes
- no more shard-count changes
- no more rebalances
- status-only unless the operator explicitly requests a new action