
feat(config): add BLOCK_PRODUCTION_LOOKBACK env var for pruned RPC nodes #9

Open
nbridges333 wants to merge 15 commits into flare-foundation:main from nbridges333:feat/block-production-lookback-env


@nbridges333

Summary

Adds a new optional environment variable BLOCK_PRODUCTION_LOOKBACK that controls how far back get_block_production() looks when computing the average block time at startup. Default is 1_000_000, preserving current behavior exactly.

Motivation

When fsp-observer is pointed at a validator operator's own pruned go-flare node (instead of a public archive RPC), the tool fails to start. get_block_production() queries a block min(1_000_000, latest-1) behind head to compute an average-block-time scalar; a pruned node no longer holds that block and returns null for it. The container then crash-loops with web3.exceptions.BlockNotFound.

Validators typically do not run archive nodes — they prune for disk efficiency. The public RPC pattern works for most operators, but for operators who want to consume their own node's RPC directly (e.g. over a private tunnel), the hardcoded 1M-block lookback is the blocker.

Behavior

  • Default (1000000): unchanged. Existing deployments pointed at archive RPCs work identically.
  • Explicit smaller value (e.g. BLOCK_PRODUCTION_LOOKBACK=1000): enables operation against pruned nodes that retain only recent history. Block time on Flare/Songbird is very stable (~1.8s), so a 1000-block average produces an essentially identical scalar to a 1M-block average for the purpose of calculate_maximum_exponent.
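As a rough illustration of why a short lookback is enough, the average block time can be sketched as below. This is not the code in observer/observer.py: `average_block_time` and the `get_timestamp` callable (block number to unix timestamp) are hypothetical names introduced here, standing in for whatever RPC lookup get_block_production() actually performs.

```python
def average_block_time(latest_number, latest_ts, lookback, get_timestamp):
    """Average seconds per block over (at most) the last `lookback` blocks.

    `get_timestamp` is a hypothetical callable: block number -> unix timestamp.
    """
    # Never look back past block 1, mirroring min(lookback, latest - 1).
    span = min(lookback, latest_number - 1)
    if span < 1:
        raise ValueError("need at least two blocks to compute an average")
    old_ts = get_timestamp(latest_number - span)
    return (latest_ts - old_ts) / span
```

With a steady block time, a 1000-block span and a 1M-block span give the same scalar; the only difference is how old a block the node must still be able to serve.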

Changes

  • configuration/types.py: add block_production_lookback: int field to Configuration
  • configuration/config.py: parse BLOCK_PRODUCTION_LOOKBACK env var (default "1000000"), validate >= 1
  • observer/observer.py: thread the lookback into get_block_production() as a parameter with default 1_000_000 so existing single-arg callers keep working; call site in observer_loop passes config.block_production_lookback
  • README.md: document the new env var
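The parsing and validation step described above could look roughly like this. It is a minimal sketch, not the actual configuration/config.py: the function name `parse_block_production_lookback` and the injectable `env` mapping are assumptions made so the example is testable without touching the real environment.

```python
import os

DEFAULT_LOOKBACK = "1000000"  # same default as described in this PR


def parse_block_production_lookback(env=None):
    """Read BLOCK_PRODUCTION_LOOKBACK, falling back to the default, and require >= 1."""
    env = os.environ if env is None else env
    raw = env.get("BLOCK_PRODUCTION_LOOKBACK", DEFAULT_LOOKBACK)
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(
            f"BLOCK_PRODUCTION_LOOKBACK must be an integer, got {raw!r}"
        ) from exc
    if value < 1:
        raise ValueError(f"BLOCK_PRODUCTION_LOOKBACK must be >= 1, got {value}")
    return value
```

Leaving the variable unset yields 1000000, so existing deployments see no change.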

Backward compatibility

  • Env var is optional; default preserves existing behavior exactly
  • Function signature uses a default parameter, so any existing callers continue to work

nbridges333 and others added 15 commits April 23, 2026 19:51
…-only events promoted)

Three ERROR-class log messages emitted from observer/validation/fdc.py
were previously logger-only — no Prometheus surface meant downstream
consumers (e.g. FlareWatch agent's signal-collection layer + alarm
playbooks) had no way to track rate or count. This commit adds
backing Counters and wires them at the existing log-emit sites.

NEW COUNTERS (observer/metrics.py):

flare_fsp_fdc_submit1_unexpected_total{identity_address}
  Backs fdc.py:43 "found submit1 transaction" — FDC protocol does
  not use submit1, so its presence indicates misconfig.

flare_fsp_fdc_submit2_bit_vote_length_mismatch_total{identity_address}
  Backs fdc.py:124 "submit2 bit vote length didn't match number of
  requests in round" — structural protocol error; rare but worth
  visibility.

flare_fsp_fdc_submit2_consensus_miss_total{identity_address,attestation_type,source_id}
  Backs fdc.py:138 "submit2 didn't confirm request that was part
  of consensus {attestation_type}/{source_id} at index N" — the
  per-(attestation_type, source_id) consensus miss event. Counter
  records the raw observation; downstream consumers (operator's
  agent participation-policy layer) decide whether a given combo
  is opted out (expected) or opted in (alarm).

CARDINALITY (FDC_SUBMIT2_CONSENSUS_MISS):
attestation_type and source_id are 32-byte fields per
py_flare_common — theoretically unbounded. In practice the Flare
protocol uses a small known set (~7 attestation types × ~6 source
ids = ~42 max combos per identity_address). Acceptable as
Prometheus labels; a future protocol expansion that pushes
cardinality high would warrant moving to a JSON payload field.

NOT pre-initialized in initialize_labels(): combos are added on
first emission so we don't enumerate the full ~42 cross-product
preemptively. This matches the existing precedent (other
attestation/source-keyed metrics are not pre-initialized either).
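To make the "added on first emission" behaviour concrete, here is a stdlib-only stand-in for a labelled counter. It is deliberately NOT prometheus_client and not the code in observer/metrics.py; `LabelledCounter` and `series_count` are hypothetical names used only to show that a label combination costs nothing until it is first incremented.

```python
from collections import defaultdict


class LabelledCounter:
    """Toy labelled counter: a series exists only after its first inc()."""

    def __init__(self, name, labelnames):
        self.name = name
        self.labelnames = tuple(labelnames)
        self._values = defaultdict(int)

    def inc(self, **labels):
        if set(labels) != set(self.labelnames):
            raise ValueError(f"expected labels {self.labelnames}, got {sorted(labels)}")
        key = tuple(labels[name] for name in self.labelnames)
        self._values[key] += 1

    def value(self, **labels):
        # Read-only lookup; .get() avoids creating a series as a side effect.
        return self._values.get(tuple(labels[name] for name in self.labelnames), 0)

    def series_count(self):
        return len(self._values)


consensus_miss = LabelledCounter(
    "flare_fsp_fdc_submit2_consensus_miss_total",
    ["identity_address", "attestation_type", "source_id"],
)
```

Under the ~7 x ~6 estimate above, `series_count()` would top out around 42 per identity address, and only for combinations that were actually observed.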

WIRE-IN (observer/validation/fdc.py):
- check_submit_1: increment FDC_SUBMIT1_UNEXPECTED at the existing
  ERROR emit point.
- check_submit_2: increment FDC_SUBMIT2_BIT_VOTE_LENGTH_MISMATCH at
  the bit-vote-length mismatch emit; FDC_SUBMIT2_CONSENSUS_MISS at
  the per-request consensus-miss emit, with attestation_type +
  source_id labels populated from the request's representation
  property.

VERIFICATION:
PYTHONPATH-overridden smoke test exercised all three Counters via
inc() + generate_latest() — exposition output renders the new
counters with correct labels. Production scrape via
flarewatch-agent's fsp-observer-metrics collector picks up the
new metrics automatically (the collector emits one F1 signal per
metric+label-set tuple per scrape, no agent-side code change
required for the Counter to flow into F1).

Cross-repo follow-ups:
- flarewatch-validator catalog (signal-catalog.md Component 11) gets
  three new rows for these metrics.
- flarewatch-agent gets a participation-policy schema + collector
  update to tag-with-expected on the consensus_miss signal so
  downstream playbooks consume only opted-in events.

Rollback: `git revert <this-commit-hash>` removes the Counters and
the wire-in increments. The existing ERROR log emissions stay
unchanged. Downstream consumers fall back to logger-only as before.
… threshold

The fast-updates participation check fires WARNING for any voter whose
n_blocks reaches max_exponent, regardless of statistical confidence.
For small-share voters (e.g. weight/total ~= 0.0001 on Songbird), the
calculated false-positive probability remains high (~70-80%) even at
n_blocks = max_exponent, because the per-block selection probability
is genuinely tiny. The alert is mathematically correct but operationally
noise: the validator IS participating, just being selected at its true
small share.

Add a 5% (50_000_000 ppb) threshold gate on the WARNING path. Below
that threshold the alert is statistically meaningful and still fires.
At or above it (high false-positive probability), suppress the WARNING.
CRITICAL-level emission (probability_ppb <= 100) is preserved
unconditionally on both n_blocks branches.
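A minimal sketch of the gate for a voter whose n_blocks has reached max_exponent, assuming probability_ppb is the false-positive probability in parts per billion. The function name `fast_update_level` and the constant names are hypothetical; the real check in minimal_conditions.py has more branches than shown here.

```python
WARNING_GATE_PPB = 50_000_000  # 5% expressed in parts per billion
CRITICAL_PPB = 100             # CRITICAL cut-off quoted in this commit


def fast_update_level(probability_ppb):
    """Return the alert level to emit, or None to stay quiet."""
    if probability_ppb <= CRITICAL_PPB:
        return "CRITICAL"      # always preserved
    if probability_ppb < WARNING_GATE_PPB:
        return "WARNING"       # statistically meaningful, still fires
    return None                # at or above 5%: likely just a small share
```

A small-share voter sitting at roughly 750_000_000 ppb (75%) therefore produces no WARNING, while CRITICAL is unaffected.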

Operationally: whales drop probability_ppb below 5% within ~max_exponent/2
blocks, so for them this is functionally a no-op. For sub-1% share
voters this eliminates a high-cardinality noise source on Discord.

Add FLAREWATCH_PROTOCOLS_ACTIVE env var (comma-separated lowercase
protocol names matching Protocol.id_to_name() output). Messages tagged
with a protocol NOT in the whitelist are logged locally but NOT
dispatched to Discord/Slack/Telegram/generic webhooks. Default
(unset/empty) preserves original upstream behavior: dispatch everything.

Motivation: operators frequently deploy fsp-observer ahead of deploying
all FSP protocols. Songbird FSP-only validators have no staking;
operators staging FDC suite deployment have FDC ERRORs by design until
the verifier stack lands. Today these emit ERROR-level Discord noise
that the operator cannot quiet without disabling the entire
notification channel or upstream-patching protocol-specific code paths.

Filter applied at log_message dispatch (single seam covering all
notification backends). Untagged messages (e.g. observer crashes,
network errors) always dispatch — the filter is intentionally narrow
to operator-known protocol-tagged alerts.
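The filter semantics above can be sketched as two small helpers. `parse_active_protocols` and `should_dispatch` are hypothetical names, and the sketch assumes an untagged message is represented by a protocol name of None.

```python
def parse_active_protocols(raw):
    """Return None (no filtering) when unset/empty, else a set of lowercase names."""
    if not raw or not raw.strip():
        return None
    names = {part.strip().lower() for part in raw.split(",") if part.strip()}
    return names or None


def should_dispatch(protocol_name, active):
    """Decide whether a message goes to the notification backends."""
    if active is None:
        return True   # default: dispatch everything, as upstream does
    if protocol_name is None:
        return True   # untagged messages (crashes, network errors) always go out
    return protocol_name in active
```

For example, with FLAREWATCH_PROTOCOLS_ACTIVE=ftso, a message tagged fdc is still logged locally but `should_dispatch` returns False for it.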

Also gitignore *.bak.* — in-place backup files generated during local
patch application, never intended for commit.
…crash

Per validator-repo 2026-05-13 directive aligning alert bodies to the
agent-side TrainingNotifier format. Previous crash message was raw
"observer crashed (traceback in logs) - <exception-str>" — operator had
to ssh to validator-host, find the right log file, find the traceback,
and reason from scratch.

New crash handler:
  - Captures full traceback via traceback.format_exc()
  - Classifies the exception into one of:
      connection-reset  (ConnectionResetError, OSError with "Connection reset")
      timeout           (TimeoutError, asyncio.TimeoutError)
      parse-or-config   (ValueError, KeyError, TypeError, AttributeError)
      import-error      (ImportError, ModuleNotFoundError)
      unknown           (fallback)
  - Per-class diagnosis text describing the likely cause
  - Per-class operator-actions text with specific debug commands
    (docker ps / docker logs / curl local RPC / compose up / etc.)
  - Truncates traceback to head/tail (12 lines each) so the Discord
    embed stays under 2000 chars even on deep stacks
  - Surfaces exception type + str in EVIDENCE alongside the truncated
    traceback (so the operator gets both the headline and the depth)
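The classification and truncation steps listed above might be sketched as follows. The class names come from this commit message; the function names, the check order, and the marker text in the truncated output are assumptions.

```python
import asyncio


def classify_exception(exc):
    """Map an exception to one of the crash classes listed above."""
    if isinstance(exc, ConnectionResetError):
        return "connection-reset"
    # TimeoutError is an OSError subclass, so test it before the generic OSError case.
    if isinstance(exc, (TimeoutError, asyncio.TimeoutError)):
        return "timeout"
    if isinstance(exc, OSError) and "Connection reset" in str(exc):
        return "connection-reset"
    if isinstance(exc, (ImportError, ModuleNotFoundError)):
        return "import-error"
    if isinstance(exc, (ValueError, KeyError, TypeError, AttributeError)):
        return "parse-or-config"
    return "unknown"


def truncate_traceback(text, keep=12):
    """Keep the first and last `keep` lines; short tracebacks pass through unchanged."""
    lines = text.splitlines()
    if len(lines) <= 2 * keep:
        return text
    marker = "... (lines truncated) ..."
    return "\n".join(lines[:keep] + [marker] + lines[-keep:])
```

In a handler, the input to `truncate_traceback` would typically be the string returned by traceback.format_exc().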

Output stays Discord-embed-safe: tested rendering on a synthetic
ConnectionResetError(104, "Connection reset by peer") via
/tmp/observer-dryrun.py — total body 1353 chars.

Example rendered body (connection-reset case from main alert this morning):

  observer crashed on network:songbird (class: connection-reset)

  DIAGNOSIS
  The observer's connection to its upstream RPC endpoint was reset.
  Most likely the RPC node (go-flare via cloudflared, or the public
  fallback) closed the TCP connection mid-request: rate-limited,
  restarted, or proxied through a stale Cloudflare edge. The observer
  process exits on this error; supervisor (docker compose restart
  policy or systemd) brings it back, but the gap shows up as missed
  epoch participation.

  EVIDENCE
  exception: ConnectionResetError: [Errno 104] Connection reset by peer

  traceback (truncated; full in journalctl/docker logs):
  Traceback (most recent call last):
    File ".../main.py", line 142, in main
      asyncio.run(observer_loop(config))
    ... (lines truncated; full in journalctl) ...
  ConnectionResetError: [Errno 104] Connection reset by peer

  OPERATOR ACTIONS
  If supervisor already restarted the observer (check
    docker ps --filter name=fsp-observer
    docker logs --tail 50 fsp-observer
  and see if there's recent activity): the observer is back up;
  verify next epoch is being signed by checking the validator's
  vote-power utilization on Flare Explorer.

  If the observer is NOT back up: restart it manually:
    cd /opt/flare/observer && docker compose up -d

  If the connection reset is recurring (same error every few minutes):
    - Check upstream RPC: curl -s http://127.0.0.1:9653/ext/C/rpc ...
    - If local RPC is slow/dead: see flr-rpc-heartbeat-deploy.md.
    - If local RPC fine but observer keeps disconnecting: check if
      observer is hitting rate-limited public RPC instead of local node.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per 2026-05-13 operator alert: 7 consecutive [ERROR] per-round alerts on
Songbird FTSO ('no submitSignatures transaction') fired immediately after
restarting fsp-observer from the morning's RPC-URL outage. Operator: 'I
think these alerts could be refined and provide more clarity on what is
happening and what we should do when it happens.'

The per-round ERROR is ambiguous between two very different scenarios:
  (a) REAL outage — FTSO client failed to submit
  (b) FALSE POSITIVE — observer was offline during the round and is
      reporting historical gaps on catch-up

Body now distinguishes the two cases explicitly, surfaces the relevant
addresses (identity / submit_signatures_to / signing_policy) + voting
epoch ID + start_unix so operator can spot-check one round on Flare
Explorer without first looking up which address sends submitSignatures.

OPERATOR ACTIONS section has two branches: 'If you just restarted
fsp-observer (FALSE POSITIVE)' and 'If observer was up the whole time
(REAL miss)' with concrete commands per branch. Escalation pointer to
staking-key-emergency-rotation.md if 3+ consecutive misses suggest
signing-key compromise.

Scope kept narrow: only the no-submitSignatures-transaction call site at
ftso.py:247. The sibling call sites (grace-period WARNING at ftso.py:265,
FDC equivalent at fdc.py:205, minimal-conditions at minimal_conditions.py)
intentionally NOT touched this commit — propagate after operator confirms
the format is right for them. Per 'one fix at a time' working pattern
(memory feedback_separate_commands).

Per-alert body length: 1562 chars. Discord embed limit is 4096; even a
5-alert streak fits comfortably.

Example rendered body (Songbird round 1336254; FALSE POSITIVE case):

  [ERROR] network:songbird round:1336254 protocol:ftso no submitSignatures transaction

  DIAGNOSIS
  The observer scanned this voting round and saw no submitSignatures
  transaction from this entity. Two common causes:
    (a) REAL outage. The FTSO client failed to submit. Investigate if
        this is a NEW streak of 3+ consecutive rounds.
    (b) FALSE POSITIVE. The observer was offline during this round and
        is reporting a historical gap during catch-up. Correlates with a
        recent fsp-observer container restart.

  EVIDENCE
    identity        0xBdeb203e55e65451fABf0e7B778b32ac174918fF
    submit_sigs_to  0xcB3D4E2B5a01a86e252626708fDd67A61496A5c9
    signing_policy  0x4DE8779A7Efae16cFAC0D04a144915bC814eC8c0
    voting_epoch    1336254
    start_unix      1747162320

  OPERATOR ACTIONS
  If you JUST restarted fsp-observer and the round start_unix is BEFORE
  the restart timestamp: this is a FALSE POSITIVE. Alerts will stop
  firing within ~2-3 rounds. Spot-check one round on Flare Explorer for
  a submitSignatures tx from submit_sigs_to to confirm the actual
  submission landed.

  If observer was up the whole time (REAL miss):
    docker logs --tail 50 flare-systems-deployment-ftso-client-1
    Check gas balance of submit_sigs_to on Flare Explorer for the round window.
    Verify FSP entity registration still active via EntityManager.getVoterAddresses(identity).

  Single missed round per ~24h is acceptable (network reorg or transient
  RPC blip). 3+ consecutive misses indicates a real outage; escalate per
  docs/runbooks/staking-key-emergency-rotation.md if signing-key
  compromise is suspected (cross-check staking-dir tripwire HC.io
  status).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per operator 2026-05-13 directive ("happy to add as many alerts as we
need"). Promotes the per-alert 4-section format from ad-hoc inline strings
in ftso.py to a central observer/alert_text.py module that:

  - Provides build_alert(summary, diagnosis, evidence, actions) helper
    with auto column-padded EVIDENCE rendering
  - Centralizes diagnostic + action TEMPLATES per alert class (currently
    FTSO_NO_SUBMIT1, FTSO_NO_SUBMIT_SIGNATURES, FTSO_SIGNATURE_MISMATCH,
    FDC_NO_SUBMIT_SIGNATURES). Future refinements live in one file.
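For orientation, a helper with this shape could be as small as the sketch below. Only the name build_alert and its four parameters come from the commit message; the section layout, indentation, and padding rule are assumptions modelled on the example bodies shown earlier, not the contents of observer/alert_text.py.

```python
def build_alert(summary, diagnosis, evidence, actions):
    """Render a 4-section alert body with column-aligned EVIDENCE keys."""
    # Pad every key to the width of the longest one so the values line up.
    width = max((len(key) for key in evidence), default=0)
    evidence_lines = [
        f"  {key.ljust(width)}  {value}" for key, value in evidence.items()
    ]
    sections = [
        summary,
        "DIAGNOSIS\n" + diagnosis,
        "EVIDENCE\n" + "\n".join(evidence_lines),
        "OPERATOR ACTIONS\n" + actions,
    ]
    return "\n\n".join(sections)
```

A call site then supplies only the round-specific summary and evidence dict, and passes the shared diagnosis and actions text from a template.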

Call sites refactored (all ERROR-level per-round alerts):

  observer/validation/ftso.py
    line 60-ish  no submit1 transaction          -> FTSO_NO_SUBMIT1
    line 247     no submitSignatures             -> FTSO_NO_SUBMIT_SIGNATURES
    line 286     submitSignatures sig mismatch   -> FTSO_SIGNATURE_MISMATCH

  observer/validation/fdc.py
    line 205     no submitSignatures             -> FDC_NO_SUBMIT_SIGNATURES

Each call site is now ~15 lines of build_alert(...) invocation with
context-specific summary + evidence; the multi-paragraph diagnosis and
operator-actions text live in alert_text.py.

NOT TOUCHED this commit (intentionally):
  - WARNING-level alerts (grace-period, minimal-conditions value warnings)
  - CRITICAL-level reveal-offence alerts (different concern; can fold
    later)
  - early/late submission ERRORs (rare; share state with other paths)
  - fdc.py line 241 (no submitSignatures + reveal offence variant; needs
    its own template; can fold later)
  - signature-mismatch in fdc.py:283 (same template would work; can
    fold as a follow-up)

Each alert body 1500-1900 chars. Discord embed description limit is 4096;
5-alert streaks comfortable.

Verified via python /tmp/alert-test.py — all 3 new templates render
cleanly with column-aligned EVIDENCE blocks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-section format

Per operator directive ("fold them in too"). Completes the alert-body
refinement started in f2a5a23 by applying build_alert() + named templates
to every per-round + per-check alert call site across observer/validation/
+ observer/address.py + observer/validation/minimal_conditions.py.

BUG FIX (subtle, from f2a5a23): check_submit_1 in ftso.py did not have
entity + round in its signature; the previous refactor's call site
referenced them as bare names, which would have raised NameError at
runtime when 'no submit1 transaction' fired. Added them to the signature
(they were already being passed via the kwargs unpack). Same fix applied
preemptively to fdc.py:check_submit_1 (which now needs entity + round
for the FOUND_SUBMIT1 evidence block) and fdc.py:check_submit_2 (which
also needed entity for the bitvote / consensus-miss evidence).

NEW templates added in observer/alert_text.py (in addition to the 4 from
f2a5a23):

  FTSO_SUBMIT1_LATE                   ftso.py:submit1 late ERROR
  FTSO_SUBMIT1_HASH_LENGTH            ftso.py:hash-length ERROR
  FTSO_SUBMIT2_MISSING_OR_OUT_OF_WINDOW   ftso.py:submit2 ERROR/CRITICAL
  FTSO_COMMIT_REVEAL_MISMATCH         ftso.py:commit-reveal CRITICAL
  FDC_FOUND_SUBMIT1                   fdc.py:found-submit1 ERROR
  FDC_SUBMIT2_MISSING_OR_OUT_OF_WINDOW    fdc.py:submit2 ERROR
  FDC_SUBMIT2_BITVOTE_LENGTH          fdc.py:bitvote ERROR
  FDC_SUBMIT2_CONSENSUS_MISS          fdc.py:consensus-miss ERROR
  FDC_REVEAL_OFFENCE_NO_SIGS          fdc.py:reveal-offence CRITICAL
  FDC_SIGNATURE_MISMATCH              fdc.py:sig-mismatch ERROR
  GRACE_PERIOD_LATE_SIGNATURES        ftso.py + fdc.py:grace WARNING
  FAST_UPDATE_MISSED                  minimal_conditions.py:fast-update CRITICAL
  LOW_BALANCE                         address.py ERROR (when balance <= 5 NAT)
  MINIMAL_CONDITIONS_NULL_VALUES      (template ready; not yet wired)
  MINIMAL_CONDITIONS_OUT_OF_RANGE     (template ready; not yet wired)
  ADDRESS_MISMATCH                    (template ready; not yet wired)

Call sites refactored to use build_alert + named template + structured
EVIDENCE dict:

  observer/validation/ftso.py
    submit1: late (ERROR), no-transaction (ERROR), hash-length (ERROR)
    submit2: missing/out-of-window (ERROR/CRITICAL), commit-reveal mismatch (CRITICAL)
    submitSignatures: missing (ERROR), grace-period late (WARNING),
                      signature mismatch (ERROR)

  observer/validation/fdc.py
    submit1: found-submit1 (ERROR; FDC doesn't use submit1)
    submit2: missing/out-of-window (ERROR), bitvote length (ERROR),
             consensus miss (ERROR)
    submitSignatures: missing (ERROR), reveal offence no sigs (CRITICAL),
                      reveal offence + early sigs (CRITICAL),
                      grace-period late (WARNING), sig mismatch (ERROR)

  observer/validation/minimal_conditions.py
    fast-update missed within max_exponent (CRITICAL)
    fast-update missed past max_exponent (WARNING or CRITICAL)

  observer/address.py
    low balance < threshold (WARNING)
    low balance <= 5 NAT (ERROR)

NOT TOUCHED (intentional; out of scope for this commit):
  - WARNING-level per-feed null/out-of-range alerts in ftso.py (templates
    are ready; per-feed-index loop has different shape that would need
    its own structure)
  - WARNING staking-condition alert in minimal_conditions.py:166-174
  - WARNING fdc-participation alert in minimal_conditions.py:191-199
  - WARNING anchor-feed condition in minimal_conditions.py:84-92
  These remain bare-text; can fold in a follow-up if operator wants.

Body sizes range 800-1900 chars per alert. Discord embed limit is 4096;
all alerts comfortably fit. Multi-alert streaks (e.g. 7 consecutive
missed FTSO submitSignatures rounds) won't hit Discord rate limits.

All 5 modified modules pass python3 ast.parse() syntax check.
All 17 templates load + render cleanly via /tmp/all-templates-test.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per 2026-05-13 operator triage: same alert (same level/network/round/
protocol/headline) was double-firing to Discord. Root cause: observer's
outer block-processing loop converges multiple message lists (tx_messages,
event_messages, validation_msgs, min_cond_messages) before dispatch via
log_message at 5 distinct call sites. When the same alert appears in
more than one list (e.g. during catch-up overlap), it dispatches twice.

Fix: in-memory TTL cache keyed on (level, network, round_id, protocol_id,
headline-first-120-chars). 300s TTL. Hit -> skip notify_*; miss -> notify
and record.

Properties:
- round.id is IN the key, so 3+ consecutive missed rounds still fire 3
  alerts (no false consolidation across rounds)
- Cache evicts opportunistically on every lookup
- LOGGER still records all messages to stdout/journald (operator keeps
  full historical record on box)
- Dedup only affects Discord/Slack/Telegram/generic webhooks
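A minimal sketch of such a TTL cache, assuming the key fields named above. The class name `DedupCache`, the method name, and the injectable clock are hypothetical; the injectable clock exists only so the behaviour can be exercised without waiting five minutes.

```python
import time


class DedupCache:
    """Suppress repeat notifications for the same alert within a TTL."""

    def __init__(self, ttl_seconds=300.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._seen = {}  # key -> time first notified

    def should_notify(self, level, network, round_id, protocol_id, headline):
        now = self.clock()
        # Opportunistic eviction: drop expired entries on every lookup.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self.ttl}
        key = (level, network, round_id, protocol_id, headline[:120])
        if key in self._seen:
            return False  # hit: skip notify_*
        self._seen[key] = now
        return True       # miss: notify and record
```

Because round_id is part of the key, the same headline in the next round is a miss and still notifies.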

Trade-off: if the SAME alert content really fires twice for distinct
reasons within 5 min (extremely unlikely given round.id is in key), the
second would be suppressed. Acceptable given the volume reduction.

Operator deploy:
  cd /opt/flare/observer
  sudo -u flareobserver git pull --rebase
  sudo docker compose build
  sudo docker compose up -d

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per 2026-05-13 operator feedback: Discord groups consecutive webhook
messages from the same author into one continuous bubble. With the new
multi-line 4-section bodies, the header of each alert ('[LEVEL]
network:X round:Y') was no longer visually distinct from the previous
alert's OPERATOR ACTIONS section.

Prepend a 39-char box-drawing horizontal line to every notify_discord
content. The header line of each alert now sits visibly below a
divider, making the alert boundary unmissable in Discord even when 5+
alerts stack.

Box-drawing char (═, U+2550) is technically Unicode but renders
identically across all Discord clients and is widely used in
terminal/log UI. ASCII '=' would visually be 'noisier' for the same
effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
New observer/portal_post.py classifies each dispatched Message into a
(post_type, severity) and INSERTs a row into the shared portal `posts`
table — the operator-facing Activity feed. Hooked into log_message()
after the notify_* dispatch, so it inherits the protocol filter + 300s
dedup gate for free and never affects the Discord path (best-effort,
swallows every failure, never raises).

The INSERT also fires SELECT pg_notify('portal_events', ...) in the same
transaction: there is no INSERT trigger on `posts`, so a producer must
issue the NOTIFY itself or the row only surfaces on a full page reload.
Channel/topic/payload shape match the portal's own post-writer.ts.

PORTAL_DATABASE_URL is optional (unset = no-op, observer behaves exactly
as upstream) and hard-rejected unless it targets /flarewatch_portal
(2026-05-20 db-split incident). Adds psycopg[binary] to requirements.txt.
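The optional-but-strict handling of PORTAL_DATABASE_URL could be sketched like this. The function name `resolve_portal_url` is hypothetical, and raising ValueError is an assumption about what "hard-rejected" means in practice; the database write itself (psycopg INSERT plus pg_notify) is out of scope for this sketch.

```python
from urllib.parse import urlparse

REQUIRED_DB_PATH = "/flarewatch_portal"


def resolve_portal_url(raw):
    """Return the URL to use, None for the no-op case, or raise on a wrong target."""
    if not raw:
        return None  # unset: portal mirroring is disabled, observer behaves as upstream
    # Compare only the path component, so credentials and query options are ignored.
    if urlparse(raw).path != REQUIRED_DB_PATH:
        raise ValueError("PORTAL_DATABASE_URL must target " + REQUIRED_DB_PATH)
    return raw
```

The error message intentionally omits the URL itself so credentials never land in logs.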

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
calculate_ftso_anchor_feeds re-emitted its WARNING every check cycle the
2h trailing success rate sat below the 80% minimal condition — and the
success-rate % embedded in the text defeated observer.py's headline
dedup, so it spammed Discord the whole time a dip recovered on its own.

Now trend-aware + deduped via MinimalConditions instance state:
  - one WARNING on the breach
  - re-fire only on a meaningful further drop (>= 200 bips) or a 6h
    reminder while the rate sits flat below threshold
  - stay quiet while it climbs back on its own
  - one INFO when it recovers to >= the minimal condition
Margins live in MinimalConditionsConfig. All WARNING texts keep the
"minimal condition for FTSO anchor feeds" phrase so the portal-posts
mirror still classifies them validator_ftso_anchor_feeds_low.
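The re-fire rules above amount to a small state machine, sketched here under stated assumptions: rates are in bips with the 80% condition as 8000, and "sits flat" is read as "has not improved on the last alerted rate". The class and attribute names are hypothetical and do not mirror MinimalConditions or MinimalConditionsConfig.

```python
class AnchorFeedAlertState:
    """Decide when the anchor-feed success-rate alert should fire."""

    def __init__(self, threshold_bips=8000, drop_margin_bips=200,
                 reminder_seconds=6 * 3600):
        self.threshold_bips = threshold_bips
        self.drop_margin_bips = drop_margin_bips
        self.reminder_seconds = reminder_seconds
        self.last_alert_rate = None  # None means "not currently in breach"
        self.last_alert_time = None

    def _fire(self, rate_bips, now):
        self.last_alert_rate = rate_bips
        self.last_alert_time = now
        return "WARNING"

    def update(self, rate_bips, now):
        """Return "WARNING", "INFO", or None for this check cycle."""
        if rate_bips >= self.threshold_bips:
            if self.last_alert_rate is None:
                return None
            self.last_alert_rate = None
            self.last_alert_time = None
            return "INFO"                      # recovered
        if self.last_alert_rate is None:
            return self._fire(rate_bips, now)  # first breach
        if self.last_alert_rate - rate_bips >= self.drop_margin_bips:
            return self._fire(rate_bips, now)  # meaningful further drop
        stale = now - self.last_alert_time >= self.reminder_seconds
        if stale and rate_bips <= self.last_alert_rate:
            return self._fire(rate_bips, now)  # reminder while not improving
        return None                            # climbing back or small wobble
```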

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two operator-driven changes after the 2026-05-24 XRP RPC fallback
storm flooded the Discord inbox with one consensus-miss alert per
round per attestation_type (~12 alerts per voting cycle while
GetBlock was quota-exhausted).

WORDING (observer/validation/fdc.py):
  FDC consensus-miss alerts now lead the summary line with
  [data:<source>] (e.g. [data:XRP], [data:DOGE]) so the chain
  whose RPC actually failed is unambiguous. The legacy
  `network:songbird` prefix (added by build_str() for ALL alerts)
  refers to the protocol network where the FDC vote runs --
  operators routinely misread it as "Songbird-source data missing"
  when the real issue was an upstream chain RPC. Wording change
  is surgical to consensus-miss only; other alert types unchanged.

COALESCE (observer/observer.py):
  New _COALESCE_CACHE layer sits BELOW the existing 5-min same-
  round dedup. Catches the same logical issue repeating across
  DIFFERENT rounds within a 1h window. First occurrence dispatches
  immediately; subsequent are suppressed + counted; next post-
  window occurrence includes "[STILL ONGOING] +N similar alerts
  suppressed in last ~60 min" preamble in the message body.
  Operator silence on the suppressed run is acceptable per
  the FlareWatch-side feedback_rolling_subscription_reminders
  convention.

  Key drops round.id (different rounds of the same issue
  collapse) but keeps the headline including the new [data:<chain>]
  prefix -- so per-chain consensus-miss floods coalesce per-chain,
  not all chains into one bucket.
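A minimal sketch of the coalescing behaviour, assuming the key is a tuple that omits the round id and that the caller passes the current time. `CoalesceCache` and `process` are hypothetical names and do not reflect the actual structure of _COALESCE_CACHE in observer/observer.py.

```python
class CoalesceCache:
    """Collapse repeats of the same logical alert across rounds within a window."""

    def __init__(self, window_seconds=3600.0):
        self.window = window_seconds
        self._state = {}  # key -> [window_start, suppressed_count]

    def process(self, key, body, now):
        """Return the body to dispatch, or None if this occurrence is suppressed."""
        entry = self._state.get(key)
        if entry is None:
            self._state[key] = [now, 0]
            return body                      # first occurrence: dispatch immediately
        start, suppressed = entry
        if now - start < self.window:
            entry[1] += 1
            return None                      # inside the window: count and suppress
        self._state[key] = [now, 0]          # window expired: start a new one
        if suppressed:
            minutes = int(self.window // 60)
            return (f"[STILL ONGOING] +{suppressed} similar alerts suppressed "
                    f"in last ~{minutes} min\n{body}")
        return body
```

Keying on the headline with its [data:<chain>] prefix keeps an XRP flood and a DOGE flood in separate buckets.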

SCOPE NOTE: the coalesce layer is not limited to ERROR; WARNING/INFO/DEBUG
level alerts also coalesce under the same window. Self-tuning: low-frequency
alerts effectively never coalesce (window expires between fires), only
high-frequency ones get summarized.

Deploy on validator box:
  cd ~/flarewatch-validator && git pull
  sudo bash scripts/deploy-observer.sh

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… identity

A newly-launched validator is registered ~3.5 days before its first reward
epoch; until then its identity is absent from the active signing policy and the
by_identity_address[tia] lookups in observer_loop raise KeyError, crash-looping
the container. Wait + reload the policy until the identity appears, then resume
normal operation. No-op for an always-in-policy node (e.g. SGB).
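The wait-and-reload step can be sketched as a simple polling loop. `wait_for_identity`, the `load_policy` callable (assumed to return a mapping keyed by identity address, like by_identity_address), and the 60-second default are all hypothetical; the sleep function is injectable so the loop can be tested without delay.

```python
import time


def wait_for_identity(identity, load_policy, poll_seconds=60.0, sleep=time.sleep):
    """Reload the signing policy until `identity` appears, then return it."""
    while True:
        policy = load_policy()
        if identity in policy:
            return policy      # normal operation can resume from here
        sleep(poll_seconds)    # not registered in the active policy yet
```

For a node that is already in the policy, the first lookup succeeds and the loop returns without sleeping.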

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
These are build-time (docker buildx state) + shell-history artifacts that
appear as untracked in the box clone; not part of the project.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>