feat(config): add BLOCK_PRODUCTION_LOOKBACK env var for pruned RPC nodes #9
Open
nbridges333 wants to merge 15 commits into
…-only events promoted)
Three ERROR-class log messages emitted from observer/validation/fdc.py
were previously logger-only — no Prometheus surface meant downstream
consumers (e.g. FlareWatch agent's signal-collection layer + alarm
playbooks) had no way to track rate or count. This commit adds
backing Counters and wires them at the existing log-emit sites.
NEW COUNTERS (observer/metrics.py):
flare_fsp_fdc_submit1_unexpected_total{identity_address}
Backs fdc.py:43 "found submit1 transaction" — FDC protocol does
not use submit1, so its presence indicates misconfig.
flare_fsp_fdc_submit2_bit_vote_length_mismatch_total{identity_address}
Backs fdc.py:124 "submit2 bit vote length didn't match number of
requests in round" — structural protocol error; rare but worth
visibility.
flare_fsp_fdc_submit2_consensus_miss_total{identity_address,attestation_type,source_id}
Backs fdc.py:138 "submit2 didn't confirm request that was part
of consensus {attestation_type}/{source_id} at index N" — the
per-(attestation_type, source_id) consensus miss event. Counter
records the raw observation; downstream consumers (operator's
agent participation-policy layer) decide whether a given combo
is opted out (expected) or opted in (alarm).
CARDINALITY (FDC_SUBMIT2_CONSENSUS_MISS):
attestation_type and source_id are 32-byte fields per
py_flare_common — theoretically unbounded. In practice the Flare
protocol uses a small known set (~7 attestation types × ~6 source
ids = ~42 max combos per identity_address). Acceptable as
Prometheus labels; a future protocol expansion that pushes
cardinality high would warrant moving to a JSON payload field.
NOT pre-initialized in initialize_labels(): combos are added on
first emission so we don't enumerate the full ~42 cross-product
preemptively. This matches the existing precedent (other
attestation/source-keyed metrics are not pre-initialized either).
WIRE-IN (observer/validation/fdc.py):
- check_submit_1: increment FDC_SUBMIT1_UNEXPECTED at the existing
ERROR emit point.
- check_submit_2: increment FDC_SUBMIT2_BIT_VOTE_LENGTH_MISMATCH at
the bit-vote-length mismatch emit; FDC_SUBMIT2_CONSENSUS_MISS at
the per-request consensus-miss emit, with attestation_type +
source_id labels populated from the request's representation
property.
VERIFICATION:
PYTHONPATH-overridden smoke test exercised all three Counters via
inc() + generate_latest() — exposition output renders the new
counters with correct labels. Production scrape via
flarewatch-agent's fsp-observer-metrics collector picks up the
new metrics automatically (the collector emits one F1 signal per
metric+label-set tuple per scrape, no agent-side code change
required for the Counter to flow into F1).
Cross-repo follow-ups:
- flarewatch-validator catalog (signal-catalog.md Component 11) gets
three new rows for these metrics.
- flarewatch-agent gets a participation-policy schema + collector
update to tag-with-expected on the consensus_miss signal so
downstream playbooks consume only opted-in events.
Rollback: `git revert <this-commit-hash>` removes the Counters and
the wire-in increments. The existing ERROR log emissions stay
unchanged. Downstream consumers fall back to logger-only as before.
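The "not pre-initialized" behaviour and the cardinality estimate above can be illustrated with a small stand-in. This sketch deliberately uses `collections.Counter` from the standard library rather than the Prometheus client, so it only models the idea that a label combination comes into existence on its first increment. The function name `record_consensus_miss` and the label values (`0xIDENTITY`, `Payment`, `XRP`, `DOGE`) are placeholders, not values taken from the observer.

```python
from collections import Counter
from typing import Tuple

# (identity_address, attestation_type, source_id), mirroring the label names
# of flare_fsp_fdc_submit2_consensus_miss_total.
LabelSet = Tuple[str, str, str]

# Stand-in for the labelled counter: a series exists only after its first
# increment, which is what "not pre-initialized" means in the text above.
consensus_miss_total: Counter = Counter()


def record_consensus_miss(identity_address: str, attestation_type: str, source_id: str) -> None:
    """Record one raw consensus-miss observation for a label combination."""
    key: LabelSet = (identity_address, attestation_type, source_id)
    consensus_miss_total[key] += 1


# Placeholder label values, for illustration only.
record_consensus_miss("0xIDENTITY", "Payment", "XRP")
record_consensus_miss("0xIDENTITY", "Payment", "XRP")
record_consensus_miss("0xIDENTITY", "Payment", "DOGE")

# Upper bound quoted in the commit message: ~7 attestation types x ~6 source ids.
MAX_SERIES_PER_IDENTITY = 7 * 6
```

Only the two combinations that were actually observed exist afterwards, well below the estimated bound of 42 per identity address.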
… threshold

The fast-updates participation check fires WARNING for any voter whose
n_blocks reaches max_exponent, regardless of statistical confidence. For
small-share voters (e.g. weight/total ~= 0.0001 on Songbird), the
calculated false-positive probability remains high (~70-80%) even at
n_blocks = max_exponent, because the per-block selection probability is
genuinely tiny. The alert is mathematically correct but operationally
noise: the validator IS participating, just being selected at its true
small share.

Add a 5% (50_000_000 ppb) threshold gate on the WARNING path. Below that
threshold the alert is statistically meaningful and still fires. At or
above it (high false-positive probability), suppress the WARNING.
CRITICAL-level emission (probability_ppb <= 100) is preserved
unconditionally on both n_blocks branches.

Operationally: whales drop probability_ppb below 5% within
~max_exponent/2 blocks, so for them this is functionally a no-op. For
sub-1% share voters this eliminates a high-cardinality noise source on
Discord.
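The gate described above might look roughly like the following. This is a sketch under assumptions: the function name `fast_update_level` is hypothetical, and the real check has two n_blocks branches whose exact structure is not shown in the commit message, so they are collapsed into one decision here. Only the two thresholds (100 ppb and 50_000_000 ppb) come from the text.

```python
from typing import Optional

CRITICAL_PPB = 100              # CRITICAL when probability_ppb <= 100 (from the commit text)
WARNING_GATE_PPB = 50_000_000   # 5% expressed in parts per billion


def fast_update_level(probability_ppb: int, n_blocks: int, max_exponent: int) -> Optional[str]:
    """Return the alert level to emit, or None when the alert is suppressed."""
    # CRITICAL is never gated, whatever n_blocks is.
    if probability_ppb <= CRITICAL_PPB:
        return "CRITICAL"
    # WARNING only once n_blocks has reached max_exponent AND the
    # false-positive probability is below the 5% gate.
    if n_blocks >= max_exponent and probability_ppb < WARNING_GATE_PPB:
        return "WARNING"
    return None
```

With these rules a small-share voter sitting at a 75% false-positive probability (750_000_000 ppb) produces no WARNING, while a voter at 1% (10_000_000 ppb) still does.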
Add FLAREWATCH_PROTOCOLS_ACTIVE env var (comma-separated lowercase
protocol names matching Protocol.id_to_name() output). Messages tagged
with a protocol NOT in the whitelist are logged locally but NOT
dispatched to Discord/Slack/Telegram/generic webhooks. Default
(unset/empty) preserves original upstream behavior: dispatch everything.

Motivation: operators frequently deploy fsp-observer ahead of deploying
all FSP protocols. Songbird FSP-only validators have no staking;
operators staging FDC suite deployment have FDC ERRORs by design until
the verifier stack lands. Today these emit ERROR-level Discord noise that
the operator cannot quiet without disabling the entire notification
channel or upstream-patching protocol-specific code paths.

Filter applied at log_message dispatch (single seam covering all
notification backends). Untagged messages (e.g. observer crashes, network
errors) always dispatch: the filter is intentionally narrow to
operator-known protocol-tagged alerts.

Also gitignore *.bak.* (in-place backup files generated during local
patch application, never intended for commit).
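A minimal sketch of the whitelist semantics described above, assuming hypothetical helper names `parse_active_protocols` and `should_dispatch` (the commit only states that the filter lives at the log_message dispatch seam). Lowercasing the configured names and treating a value with no usable names as "unset" are leniency assumptions of this sketch.

```python
import os
from typing import FrozenSet, Optional


def parse_active_protocols(raw: Optional[str]) -> Optional[FrozenSet[str]]:
    """Parse FLAREWATCH_PROTOCOLS_ACTIVE; None means 'dispatch everything'."""
    if raw is None or not raw.strip():
        return None
    names = frozenset(part.strip().lower() for part in raw.split(",") if part.strip())
    # A value such as "," carries no names; treat it like an empty setting.
    return names or None


def should_dispatch(protocol_name: Optional[str], active: Optional[FrozenSet[str]]) -> bool:
    """Decide whether a message goes to the notification backends."""
    if active is None:          # filter disabled: upstream behaviour
        return True
    if protocol_name is None:   # untagged messages always dispatch
        return True
    return protocol_name in active


active_protocols = parse_active_protocols(os.environ.get("FLAREWATCH_PROTOCOLS_ACTIVE"))
```

For example, with `FLAREWATCH_PROTOCOLS_ACTIVE=ftso` an alert tagged `fdc` would be logged locally but not dispatched, while an untagged crash message would still go out.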
…crash
Per validator-repo 2026-05-13 directive aligning alert bodies to the
agent-side TrainingNotifier format. Previous crash message was raw
"observer crashed (traceback in logs) - <exception-str>" — operator had
to ssh to validator-host, find the right log file, find the traceback,
and reason from scratch.
New crash handler:
- Captures full traceback via traceback.format_exc()
- Classifies the exception into one of:
connection-reset (ConnectionResetError, OSError with "Connection reset")
timeout (TimeoutError, asyncio.TimeoutError)
parse-or-config (ValueError, KeyError, TypeError, AttributeError)
import-error (ImportError, ModuleNotFoundError)
unknown (fallback)
- Per-class diagnosis text describing the likely cause
- Per-class operator-actions text with specific debug commands
(docker ps / docker logs / curl local RPC / compose up / etc.)
- Truncates traceback to head/tail (12 lines each) so the Discord
embed stays under 2000 chars even on deep stacks
- Surfaces exception type + str in EVIDENCE alongside the truncated
traceback (so the operator gets both the headline and the depth)
Output stays Discord-embed-safe: tested rendering on a synthetic
ConnectionResetError(104, "Connection reset by peer") via
/tmp/observer-dryrun.py — total body 1353 chars.
Example rendered body (connection-reset case from main alert this morning):
observer crashed on network:songbird (class: connection-reset)
DIAGNOSIS
The observer's connection to its upstream RPC endpoint was reset.
Most likely the RPC node (go-flare via cloudflared, or the public
fallback) closed the TCP connection mid-request: rate-limited,
restarted, or proxied through a stale Cloudflare edge. The observer
process exits on this error; supervisor (docker compose restart
policy or systemd) brings it back, but the gap shows up as missed
epoch participation.
EVIDENCE
exception: ConnectionResetError: [Errno 104] Connection reset by peer
traceback (truncated; full in journalctl/docker logs):
Traceback (most recent call last):
File ".../main.py", line 142, in main
asyncio.run(observer_loop(config))
... (lines truncated; full in journalctl) ...
ConnectionResetError: [Errno 104] Connection reset by peer
OPERATOR ACTIONS
If supervisor already restarted the observer (check
docker ps --filter name=fsp-observer
docker logs --tail 50 fsp-observer
and see if there's recent activity): the observer is back up;
verify next epoch is being signed by checking the validator's
vote-power utilization on Flare Explorer.
If the observer is NOT back up: restart it manually:
cd /opt/flare/observer && docker compose up -d
If the connection reset is recurring (same error every few minutes):
- Check upstream RPC: curl -s http://127.0.0.1:9653/ext/C/rpc ...
- If local RPC is slow/dead: see flr-rpc-heartbeat-deploy.md.
- If local RPC fine but observer keeps disconnecting: check if
observer is hitting rate-limited public RPC instead of local node.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
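The classification and truncation steps of the crash handler described above can be sketched as follows. This is an illustration rather than the actual handler: the names `classify_exception` and `truncate_traceback` are hypothetical, and the order of the checks is an assumption (it follows the order of the class list in the commit message). The class names, the 12-line head/tail size and the truncation marker text are taken from the commit message.

```python
import asyncio
import traceback


def classify_exception(exc: BaseException) -> str:
    """Map an exception to one of the crash classes listed above."""
    if isinstance(exc, ConnectionResetError):
        return "connection-reset"
    if isinstance(exc, OSError) and "Connection reset" in str(exc):
        return "connection-reset"
    if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
        return "timeout"
    if isinstance(exc, (ValueError, KeyError, TypeError, AttributeError)):
        return "parse-or-config"
    # ModuleNotFoundError is a subclass of ImportError.
    if isinstance(exc, (ImportError, ModuleNotFoundError)):
        return "import-error"
    return "unknown"


def truncate_traceback(tb_text: str, keep: int = 12) -> str:
    """Keep the first and last `keep` lines of a traceback."""
    lines = tb_text.rstrip("\n").splitlines()
    if len(lines) <= 2 * keep:
        return "\n".join(lines)
    marker = "... (lines truncated; full in journalctl) ..."
    return "\n".join(lines[:keep] + [marker] + lines[-keep:])


# Usage: classify and capture inside the except block, where
# traceback.format_exc() still has access to the active exception.
try:
    raise ConnectionResetError(104, "Connection reset by peer")
except Exception as exc:
    crash_class = classify_exception(exc)
    evidence = truncate_traceback(traceback.format_exc())
```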
Per 2026-05-13 operator alert: 7 consecutive [ERROR] per-round alerts on
Songbird FTSO ('no submitSignatures transaction') fired immediately after
restarting fsp-observer from the morning's RPC-URL outage. Operator: 'I
think these alerts could be refined and provide more clarity on what is
happening and what we should do when it happens.'
The per-round ERROR is ambiguous between two very different scenarios:
(a) REAL outage — FTSO client failed to submit
(b) FALSE POSITIVE — observer was offline during the round and is
reporting historical gaps on catch-up
Body now distinguishes the two cases explicitly, surfaces the relevant
addresses (identity / submit_signatures_to / signing_policy) + voting
epoch ID + start_unix so operator can spot-check one round on Flare
Explorer without first looking up which address sends submitSignatures.
OPERATOR ACTIONS section has two branches: 'If you just restarted
fsp-observer (FALSE POSITIVE)' and 'If observer was up the whole time
(REAL miss)' with concrete commands per branch. Escalation pointer to
staking-key-emergency-rotation.md if 3+ consecutive misses suggest
signing-key compromise.
Scope kept narrow: only the no-submitSignatures-transaction call site at
ftso.py:247. The sibling call sites (grace-period WARNING at ftso.py:265,
FDC equivalent at fdc.py:205, minimal-conditions at minimal_conditions.py)
intentionally NOT touched this commit — propagate after operator confirms
the format is right for them. Per 'one fix at a time' working pattern
(memory feedback_separate_commands).
Per-alert body length: 1562 chars. Discord embed limit is 4096; even a
5-alert streak fits comfortably.
Example rendered body (Songbird round 1336254; FALSE POSITIVE case):
[ERROR] network:songbird round:1336254 protocol:ftso no submitSignatures transaction
DIAGNOSIS
The observer scanned this voting round and saw no submitSignatures
transaction from this entity. Two common causes:
(a) REAL outage. The FTSO client failed to submit. Investigate if
this is a NEW streak of 3+ consecutive rounds.
(b) FALSE POSITIVE. The observer was offline during this round and
is reporting a historical gap during catch-up. Correlates with a
recent fsp-observer container restart.
EVIDENCE
identity 0xBdeb203e55e65451fABf0e7B778b32ac174918fF
submit_sigs_to 0xcB3D4E2B5a01a86e252626708fDd67A61496A5c9
signing_policy 0x4DE8779A7Efae16cFAC0D04a144915bC814eC8c0
voting_epoch 1336254
start_unix 1747162320
OPERATOR ACTIONS
If you JUST restarted fsp-observer and the round start_unix is BEFORE
the restart timestamp: this is a FALSE POSITIVE. Alerts will stop
firing within ~2-3 rounds. Spot-check one round on Flare Explorer for
a submitSignatures tx from submit_sigs_to to confirm the actual
submission landed.
If observer was up the whole time (REAL miss):
docker logs --tail 50 flare-systems-deployment-ftso-client-1
Check gas balance of submit_sigs_to on Flare Explorer for the round window.
Verify FSP entity registration still active via EntityManager.getVoterAddresses(identity).
Single missed round per ~24h is acceptable (network reorg or transient
RPC blip). 3+ consecutive misses indicates a real outage; escalate per
docs/runbooks/staking-key-emergency-rotation.md if signing-key
compromise is suspected (cross-check staking-dir tripwire HC.io
status).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per operator 2026-05-13 directive ("happy to add as many alerts as we
need"). Promotes the per-alert 4-section format from ad-hoc inline strings
in ftso.py to a central observer/alert_text.py module that:
- Provides build_alert(summary, diagnosis, evidence, actions) helper
with auto column-padded EVIDENCE rendering
- Centralizes diagnostic + action TEMPLATES per alert class (currently
FTSO_NO_SUBMIT1, FTSO_NO_SUBMIT_SIGNATURES, FTSO_SIGNATURE_MISMATCH,
FDC_NO_SUBMIT_SIGNATURES). Future refinements live in one file.
Call sites refactored (all ERROR-level per-round alerts):
observer/validation/ftso.py
line 60-ish no submit1 transaction -> FTSO_NO_SUBMIT1
line 247 no submitSignatures -> FTSO_NO_SUBMIT_SIGNATURES
line 286 submitSignatures sig mismatch -> FTSO_SIGNATURE_MISMATCH
observer/validation/fdc.py
line 205 no submitSignatures -> FDC_NO_SUBMIT_SIGNATURES
Each call site is now ~15 lines of build_alert(...) invocation with
context-specific summary + evidence; the multi-paragraph diagnosis and
operator-actions text live in alert_text.py.
NOT TOUCHED this commit (intentionally):
- WARNING-level alerts (grace-period, minimal-conditions value warnings)
- CRITICAL-level reveal-offence alerts (different concern; can fold
later)
- early/late submission ERRORs (rare; share state with other paths)
- fdc.py line 241 (no submitSignatures + reveal offence variant; needs
its own template; can fold later)
- signature-mismatch in fdc.py:283 (same template would work; can
fold as a follow-up)
Each alert body 1500-1900 chars. Discord embed description limit is 4096;
5-alert streaks comfortable.
Verified via python /tmp/alert-test.py — all 3 new templates render
cleanly with column-aligned EVIDENCE blocks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
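A minimal sketch of what a `build_alert(summary, diagnosis, evidence, actions)` helper with column-padded EVIDENCE could look like. The signature and the four section names come from the commit message above; the body, the two-space column gap and the sample values are assumptions of this sketch, not the contents of observer/alert_text.py.

```python
from typing import Mapping


def build_alert(summary: str, diagnosis: str, evidence: Mapping[str, object], actions: str) -> str:
    """Render the four-section alert body with aligned EVIDENCE values."""
    # Pad every key to the longest key so the values line up in one column.
    width = max((len(key) for key in evidence), default=0)
    evidence_lines = [f"{key.ljust(width)}  {value}" for key, value in evidence.items()]
    return "\n".join(
        [summary, "DIAGNOSIS", diagnosis, "EVIDENCE", *evidence_lines, "OPERATOR ACTIONS", actions]
    )


# Usage with placeholder values.
body = build_alert(
    summary="[ERROR] network:songbird round:1 protocol:ftso no submit1 transaction",
    diagnosis="Placeholder diagnosis text.",
    evidence={"identity": "0xA", "voting_epoch": 1},
    actions="Placeholder operator actions.",
)
```

Keeping the long diagnosis and action texts in named templates and passing only the summary and evidence from each call site is what lets the call sites stay short.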
…-section format
Per operator directive ("fold them in too"). Completes the alert-body
refinement started in f2a5a23 by applying build_alert() + named templates
to every per-round + per-check alert call site across observer/validation/
+ observer/address.py + observer/validation/minimal_conditions.py.
BUG FIX (subtle, from f2a5a23): check_submit_1 in ftso.py did not have
entity + round in its signature; the previous refactor's call site
referenced them as bare names, which would have raised NameError at
runtime when 'no submit1 transaction' fired. Added them to the signature
(they were already being passed via the kwargs unpack). Same fix applied
preemptively to fdc.py:check_submit_1 (which now needs entity + round
for the FOUND_SUBMIT1 evidence block) and fdc.py:check_submit_2 (which
also needed entity for the bitvote / consensus-miss evidence).
NEW templates added in observer/alert_text.py (in addition to the 4 from
f2a5a23):
FTSO_SUBMIT1_LATE ftso.py:submit1 late ERROR
FTSO_SUBMIT1_HASH_LENGTH ftso.py:hash-length ERROR
FTSO_SUBMIT2_MISSING_OR_OUT_OF_WINDOW ftso.py:submit2 ERROR/CRITICAL
FTSO_COMMIT_REVEAL_MISMATCH ftso.py:commit-reveal CRITICAL
FDC_FOUND_SUBMIT1 fdc.py:found-submit1 ERROR
FDC_SUBMIT2_MISSING_OR_OUT_OF_WINDOW fdc.py:submit2 ERROR
FDC_SUBMIT2_BITVOTE_LENGTH fdc.py:bitvote ERROR
FDC_SUBMIT2_CONSENSUS_MISS fdc.py:consensus-miss ERROR
FDC_REVEAL_OFFENCE_NO_SIGS fdc.py:reveal-offence CRITICAL
FDC_SIGNATURE_MISMATCH fdc.py:sig-mismatch ERROR
GRACE_PERIOD_LATE_SIGNATURES ftso.py + fdc.py:grace WARNING
FAST_UPDATE_MISSED minimal_conditions.py:fast-update CRITICAL
LOW_BALANCE address.py ERROR (when balance <= 5 NAT)
MINIMAL_CONDITIONS_NULL_VALUES (template ready; not yet wired)
MINIMAL_CONDITIONS_OUT_OF_RANGE (template ready; not yet wired)
ADDRESS_MISMATCH (template ready; not yet wired)
Call sites refactored to use build_alert + named template + structured
EVIDENCE dict:
observer/validation/ftso.py
submit1: late (ERROR), no-transaction (ERROR), hash-length (ERROR)
submit2: missing/out-of-window (ERROR/CRITICAL), commit-reveal mismatch (CRITICAL)
submitSignatures: missing (ERROR), grace-period late (WARNING),
signature mismatch (ERROR)
observer/validation/fdc.py
submit1: found-submit1 (ERROR; FDC doesn't use submit1)
submit2: missing/out-of-window (ERROR), bitvote length (ERROR),
consensus miss (ERROR)
submitSignatures: missing (ERROR), reveal offence no sigs (CRITICAL),
reveal offence + early sigs (CRITICAL),
grace-period late (WARNING), sig mismatch (ERROR)
observer/validation/minimal_conditions.py
fast-update missed within max_exponent (CRITICAL)
fast-update missed past max_exponent (WARNING or CRITICAL)
observer/address.py
low balance < threshold (WARNING)
low balance <= 5 NAT (ERROR)
NOT TOUCHED (intentional; out of scope for this commit):
- WARNING-level per-feed null/out-of-range alerts in ftso.py (templates
are ready; per-feed-index loop has different shape that would need
its own structure)
- WARNING staking-condition alert in minimal_conditions.py:166-174
- WARNING fdc-participation alert in minimal_conditions.py:191-199
- WARNING anchor-feed condition in minimal_conditions.py:84-92
These remain bare-text; can fold in a follow-up if operator wants.
Body sizes range 800-1900 chars per alert. Discord embed limit is 4096;
all alerts comfortably fit. Multi-alert streaks (e.g. 7 consecutive
missed FTSO submitSignatures rounds) won't hit Discord rate limits.
All 5 modified modules pass python3 ast.parse() syntax check.
All 17 templates load + render cleanly via /tmp/all-templates-test.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per 2026-05-13 operator triage: same alert (same
level/network/round/protocol/headline) was double-firing to Discord.

Root cause: observer's outer block-processing loop converges multiple
message lists (tx_messages, event_messages, validation_msgs,
min_cond_messages) before dispatch via log_message at 5 distinct call
sites. When the same alert appears in more than one list (e.g. during
catch-up overlap), it dispatches twice.

Fix: in-memory TTL cache keyed on (level, network, round_id,
protocol_id, headline-first-120-chars). 300s TTL. Hit -> skip notify_*;
miss -> notify and record.

Properties:
- round.id is IN the key, so 3+ consecutive missed rounds still fire 3
alerts (no false consolidation across rounds)
- Cache evicts opportunistically on every lookup
- LOGGER still records all messages to stdout/journald (operator keeps
full historical record on box)
- Dedup only affects Discord/Slack/Telegram/generic webhooks

Trade-off: if the SAME alert content really fires twice for distinct
reasons within 5 min (extremely unlikely given round.id is in key), the
second would be suppressed. Acceptable given the volume reduction.

Operator deploy:
cd /opt/flare/observer
sudo -u flareobserver git pull --rebase
sudo docker compose build
sudo docker compose up -d

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
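The TTL cache described above can be sketched like this. The key fields and the 300s TTL come from the commit message; the function name `should_notify`, the use of the first line of the text as the headline, and the injectable `now` parameter (added so the behaviour can be exercised without waiting) are assumptions.

```python
import time
from typing import Dict, Optional, Tuple

DEDUP_TTL_SECONDS = 300

Key = Tuple[str, str, Optional[int], Optional[int], str]
_seen: Dict[Key, float] = {}  # key -> expiry timestamp


def should_notify(level: str, network: str, round_id: Optional[int],
                  protocol_id: Optional[int], text: str,
                  now: Optional[float] = None) -> bool:
    """Return True on a cache miss (notify and record), False on a hit."""
    now = time.monotonic() if now is None else now
    # Opportunistic eviction on every lookup.
    for expired in [key for key, expiry in _seen.items() if expiry <= now]:
        del _seen[expired]
    headline = text.splitlines()[0][:120] if text else ""
    key = (level, network, round_id, protocol_id, headline)
    if key in _seen:
        return False
    _seen[key] = now + DEDUP_TTL_SECONDS
    return True
```

Because the round id is part of the key, the same headline for round 2 is a miss even while the entry for round 1 is still cached, so consecutive missed rounds each produce their own alert.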
Per 2026-05-13 operator feedback: Discord groups consecutive webhook
messages from the same author into one continuous bubble. With the new
multi-line 4-section bodies, the header of each alert ('[LEVEL]
network:X round:Y') was no longer visually distinct from the previous
alert's OPERATOR ACTIONS section.
Prepend a 39-char box-drawing horizontal line to every notify_discord
content. The header line of each alert now sits visibly below a
divider, making the alert boundary unmissable in Discord even when 5+
alerts stack.
Box-drawing char (═, U+2550) is technically Unicode but renders
identically across all Discord clients and is widely used in
terminal/log UI. ASCII '=' would visually be 'noisier' for the same
effect.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New observer/portal_post.py classifies each dispatched Message into a
(post_type, severity) and INSERTs a row into the shared portal `posts`
table — the operator-facing Activity feed. Hooked into log_message()
after the notify_* dispatch, so it inherits the protocol filter + 300s
dedup gate for free and never affects the Discord path (best-effort,
swallows every failure, never raises).
The INSERT also fires SELECT pg_notify('portal_events', ...) in the same
transaction: there is no INSERT trigger on `posts`, so a producer must
issue the NOTIFY itself or the row only surfaces on a full page reload.
Channel/topic/payload shape match the portal's own post-writer.ts.
PORTAL_DATABASE_URL is optional (unset = no-op, observer behaves exactly
as upstream) and hard-rejected unless it targets /flarewatch_portal
(2026-05-20 db-split incident). Adds psycopg[binary] to requirements.txt.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
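The PORTAL_DATABASE_URL guard described above could be sketched as below, using only the standard library. The function name `resolve_portal_url`, the exact path comparison and the choice to raise `ValueError` on rejection are assumptions; the commit message only states that an unset value is a no-op and that any target other than /flarewatch_portal is hard-rejected. The INSERT plus pg_notify step is not shown.

```python
from typing import Optional
from urllib.parse import urlsplit

REQUIRED_DB_PATH = "/flarewatch_portal"


def resolve_portal_url(raw: Optional[str]) -> Optional[str]:
    """Return the URL to use, None when mirroring is disabled, or raise."""
    if raw is None or not raw.strip():
        return None  # unset: observer behaves exactly as upstream
    url = raw.strip()
    # Compare only the database path; the message deliberately omits the
    # URL so credentials never end up in logs.
    if urlsplit(url).path != REQUIRED_DB_PATH:
        raise ValueError("PORTAL_DATABASE_URL must target /flarewatch_portal")
    return url
```

For instance, a URL ending in `/flarewatch_portal?sslmode=require` passes, because the query string is not part of the parsed path.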
calculate_ftso_anchor_feeds re-emitted its WARNING every check cycle the
2h trailing success rate sat below the 80% minimal condition — and the
success-rate % embedded in the text defeated observer.py's headline
dedup, so it spammed Discord the whole time a dip recovered on its own.
Now trend-aware + deduped via MinimalConditions instance state:
- one WARNING on the breach
- re-fire only on a meaningful further drop (>= 200 bips) or a 6h
reminder while the rate sits flat below threshold
- stay quiet while it climbs back on its own
- one INFO when it recovers to >= the minimal condition
Margins live in MinimalConditionsConfig. All WARNING texts keep the
"minimal condition for FTSO anchor feeds" phrase so the portal-posts
mirror still classifies them validator_ftso_anchor_feeds_low.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
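The trend-aware rules above can be modelled as a small state machine. This is a sketch, not the MinimalConditions implementation: the class name `AnchorFeedAlertState` and its field names are hypothetical, the 80% condition is expressed as 8000 bips, and "sits flat" is simplified to "has not risen above the rate at the last WARNING".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AnchorFeedAlertState:
    min_condition_bips: int = 8000     # 80% minimal condition
    drop_margin_bips: int = 200        # meaningful further drop
    reminder_seconds: int = 6 * 3600   # 6h reminder
    last_alert_rate: Optional[int] = None  # None means "not in breach"
    last_alert_time: float = 0.0

    def update(self, rate_bips: int, now: float) -> Optional[str]:
        """Return 'WARNING', 'INFO' or None for the latest success rate."""
        if rate_bips >= self.min_condition_bips:
            if self.last_alert_rate is None:
                return None            # healthy and was healthy before
            self.last_alert_rate = None
            return "INFO"              # one INFO on recovery
        if self.last_alert_rate is None:
            fire = True                # first breach
        elif self.last_alert_rate - rate_bips >= self.drop_margin_bips:
            fire = True                # meaningful further drop
        elif (rate_bips <= self.last_alert_rate
              and now - self.last_alert_time >= self.reminder_seconds):
            fire = True                # reminder while not climbing
        else:
            fire = False               # small moves or climbing: stay quiet
        if fire:
            self.last_alert_rate = rate_bips
            self.last_alert_time = now
            return "WARNING"
        return None
```

Because the decision depends on stored state rather than on the message text, the success-rate percentage embedded in the WARNING no longer matters for suppression.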
Two operator-driven changes after the 2026-05-24 XRP RPC fallback storm
flooded the Discord inbox with one consensus-miss alert per round per
attestation_type (~12 alerts per voting cycle while GetBlock was
quota-exhausted).

WORDING (observer/validation/fdc.py):
FDC consensus-miss alerts now lead the summary line with [data:<source>]
(e.g. [data:XRP], [data:DOGE]) so the chain whose RPC actually failed is
unambiguous. The legacy `network:songbird` prefix (added by build_str()
for ALL alerts) refers to the protocol network where the FDC vote runs --
operators routinely misread it as "Songbird-source data missing" when the
real issue was an upstream chain RPC. Wording change is surgical to
consensus-miss only; other alert types unchanged.

COALESCE (observer/observer.py):
New _COALESCE_CACHE layer sits BELOW the existing 5-min same-round dedup.
Catches the same logical issue repeating across DIFFERENT rounds within a
1h window. First occurrence dispatches immediately; subsequent are
suppressed + counted; next post-window occurrence includes "[STILL
ONGOING] +N similar alerts suppressed in last ~60 min" preamble in the
message body. Operator silence on the suppressed run is acceptable per
the FlareWatch-side feedback_rolling_subscription_reminders convention.
Key drops round.id (different rounds of the same issue collapse) but
keeps the headline including the new [data:<chain>] prefix -- so
per-chain consensus-miss floods coalesce per-chain, not all chains into
one bucket.

OUT OF SCOPE: WARNING/INFO/DEBUG level alerts also coalesce under the
same window. Self-tuning: low-frequency alerts effectively never coalesce
(window expires between fires), only high-frequency ones get summarized.

Deploy on validator box:
cd ~/flarewatch-validator && git pull
sudo bash scripts/deploy-observer.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
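The coalescing behaviour described above can be sketched as follows. The 1h window, the round-free key and the "[STILL ONGOING]" preamble wording come from the commit message; the function name `coalesce`, the exact key fields, the return convention and the sample headlines are assumptions of this sketch.

```python
import time
from typing import Dict, List, Optional, Tuple

COALESCE_WINDOW_SECONDS = 3600

# key -> [window_start, suppressed_count]; note there is no round id in the key.
_coalesce: Dict[Tuple[str, str, Optional[int], str], List[float]] = {}


def coalesce(level: str, network: str, protocol_id: Optional[int],
             headline: str, now: Optional[float] = None) -> Optional[str]:
    """Return a preamble ('' when none) to dispatch, or None to suppress."""
    now = time.monotonic() if now is None else now
    key = (level, network, protocol_id, headline[:120])
    entry = _coalesce.get(key)
    if entry is None:
        _coalesce[key] = [now, 0]
        return ""                      # first occurrence dispatches immediately
    window_start, suppressed = entry
    if now - window_start < COALESCE_WINDOW_SECONDS:
        entry[1] = suppressed + 1      # inside the window: suppress and count
        return None
    _coalesce[key] = [now, 0]          # window expired: start a new one
    if suppressed:
        return f"[STILL ONGOING] +{int(suppressed)} similar alerts suppressed in last ~60 min"
    return ""
```

Since the headline (including the `[data:<chain>]` prefix) stays in the key, an XRP flood and a DOGE flood are counted in separate buckets.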
… identity

A newly-launched validator is registered ~3.5 days before its first
reward epoch; until then its identity is absent from the active signing
policy and the by_identity_address[tia] lookups in observer_loop raise
KeyError, crash-looping the container. Wait + reload the policy until the
identity appears, then resume normal operation. No-op for an
always-in-policy node (e.g. SGB).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
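The wait-and-reload behaviour described above might be structured like this. It is a synchronous sketch under assumptions: `wait_for_identity`, `load_policy` and the 60-second poll interval are hypothetical, and the real observer loop is likely asynchronous. The idea shown is simply to check membership before indexing, so the lookup cannot raise KeyError.

```python
import time
from typing import Callable, Mapping


def wait_for_identity(load_policy: Callable[[], Mapping[str, object]],
                      identity: str,
                      poll_seconds: float = 60.0,
                      sleep: Callable[[float], None] = time.sleep) -> Mapping[str, object]:
    """Reload the signing policy until `identity` is present, then return it."""
    while True:
        by_identity_address = load_policy()
        if identity in by_identity_address:
            # Already-registered nodes return on the first iteration,
            # so this is a no-op for them.
            return by_identity_address
        sleep(poll_seconds)
```

Passing `sleep` as a parameter is only a convenience for exercising the loop without real delays.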
These are build-time (docker buildx state) + shell-history artifacts that
appear as untracked in the box clone; not part of the project.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summary

Adds a new optional environment variable `BLOCK_PRODUCTION_LOOKBACK` that controls how far back `get_block_production()` looks when computing the average block time at startup. Default is `1_000_000`, preserving current behavior exactly.

Motivation

When `fsp-observer` is pointed at a validator operator's own pruned `go-flare` node (instead of a public archive RPC), the tool fails to start. `get_block_production()` queries a block `min(1_000_000, latest-1)` behind head to compute an average-block-time scalar, which the pruned node returns `null` for. The container then crash-loops with `web3.exceptions.BlockNotFound`.

Validators typically do not run archive nodes; they prune for disk efficiency. The public RPC pattern works for most operators, but for operators who want to consume their own node's RPC directly (e.g. over a private tunnel), the hardcoded 1M-block lookback is the blocker.

Behavior

- Default (`1000000`): unchanged. Existing deployments pointed at archive RPCs work identically.
- Smaller value (e.g. `BLOCK_PRODUCTION_LOOKBACK=1000`): enables operation against pruned nodes that retain only recent history. Block time on Flare/Songbird is very stable (~1.8s), so a 1000-block average produces an essentially identical scalar to a 1M-block average for the purpose of `calculate_maximum_exponent`.

Changes

- `configuration/types.py`: add `block_production_lookback: int` field to `Configuration`
- `configuration/config.py`: parse `BLOCK_PRODUCTION_LOOKBACK` env var (default `"1000000"`), validate `>= 1`
- `observer/observer.py`: thread the lookback into `get_block_production()` as a parameter with default `1_000_000` so existing single-arg callers keep working; call site in `observer_loop` passes `config.block_production_lookback`
- `README.md`: document the new env var

Backward compatibility
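A minimal sketch of the parsing and clamping listed under Changes, showing that the default path reproduces the old `min(1_000_000, latest-1)` behaviour. The helper names `parse_block_production_lookback` and `reference_block_number` are hypothetical and do not reflect the layout of the actual `configuration/config.py` or `observer/observer.py`; only the env var name, its default and the `>= 1` validation come from the description.

```python
import os
from typing import Mapping

DEFAULT_LOOKBACK = 1_000_000


def parse_block_production_lookback(env: Mapping[str, str] = os.environ) -> int:
    """Read BLOCK_PRODUCTION_LOOKBACK, defaulting to 1000000, and validate >= 1."""
    lookback = int(env.get("BLOCK_PRODUCTION_LOOKBACK", "1000000"))
    if lookback < 1:
        raise ValueError("BLOCK_PRODUCTION_LOOKBACK must be >= 1")
    return lookback


def reference_block_number(latest: int, lookback: int = DEFAULT_LOOKBACK) -> int:
    """Block to compare against head when averaging block time."""
    # Never look further back than block 1, as with min(1_000_000, latest - 1).
    return latest - min(lookback, latest - 1)
```

With the variable unset, a head at block 5_000_000 still resolves to block 4_000_000 as before; with `BLOCK_PRODUCTION_LOOKBACK=1000` it resolves to block 4_999_000, which a pruned node is far more likely to retain.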