* fix(tools/talis): wait-for-chain + atomic keyring + one-command driver
Three race conditions surfaced repeatedly on a fresh AWS bring-up of
the Fibre throughput experiment. Each one had the same shape: a
talis subcommand "succeeded" at the CLI level (or returned the txhash
with --yes) before the chain had actually applied the work, leaving
downstream steps to fail in confusing ways. This commit makes each
step verify *outcome*, not just *invocation*, so the experiment can
go from a fresh `talis up` to a running loadgen without manual
intervention.
• setup-fibre script (fibre_setup.go) now:
- polls `celestia-appd status` for `latest_block_height>0`
before submitting any tx — fixes the silent-noop where
set-host + 100× deposit-to-escrow all bounced with
"celestia-app is not ready; please wait for first block";
- retries `set-host` in a loop until the validator's host
shows up in `query valaddr providers` — fixes the case
where --yes returns the txhash before block inclusion and
the tx silently lands in the mempool but never confirms;
- verifies fibre-0's escrow account is funded on-chain before
the tmux session exits — same silent-failure mode as
set-host, but on the deposit side.
The talis-CLI step also now cross-checks that all validators
are registered from a single vantage point before returning,
so a concurrent set-host race surfaces as an error instead of
a half-empty provider list that start-fibre would cache forever.
• fibre-bootstrap-evnode (fibre_bootstrap_evnode.go) now stages
the keyring scp into a tmp directory and `mv`s it atomically
into place. The previous direct `scp -r` to
/root/keyring-fibre/keyring-test created the directory before
transferring its contents — the evnode init script's
`[ -d keyring-test ]` poll passed mid-transfer, the daemon
launched with no fibre-0.info, and crashed with `keyring entry
"fibre-0" not found`.
• evnode_init.sh (genesis.go) now waits for the specific
keyring-test/fibre-0.info file rather than just the
keyring-test directory. Belt-and-braces: the bootstrap mv is
already atomic on the same filesystem, but the file-level
guard means a hand-pushed keyring (not via talis) can't trip
the same race.
• New `talis fibre-experiment` umbrella command runs
up → genesis → deploy → setup-fibre → start-fibre →
fibre-bootstrap-evnode in order. Each step runs the same
binary as a subprocess; a failure in any step aborts the
remaining steps. The operator goes from a prepared root dir
to a running loadgen with one command, instead of remembering
the sequence.
Verified by 5-min sustained loadgen against julien/fiber HEAD with
PR #3287 (concurrent submitter) merged: 47.65 MB/s @ 99.999 % ok,
up from the prior 24.57 MB/s baseline (the gap is PR #3287's
overlapping uploads — these talis fixes just stop the deploy from
silently breaking before throughput matters).
* fix(tools/talis): finalize fibre setup race fixes
Three follow-up bugs surfaced during the PR #3303
verification run on a 3-validator AWS Fibre cluster:
- aws.go: CreateAWSInstances exited 0 even when individual
instance launches failed, so `talis up` lied about success
and downstream steps proceeded against a partial cluster.
It now returns a joined error so failure cascades stop early.
- download.go: sshExec used cmd.CombinedOutput, mixing SSH
warnings (the "Warning: Permanently added '...'..." chatter
on stderr) into bytes the caller hands to fmt.Sscanf("%d").
The CLI-side providers cross-check parsed those warnings
as 0 and looped until its 5-min deadline even though a
direct SSH query showed all 3 providers registered. Switch
to cmd.Output() (stdout only) and add `-q -o LogLevel=ERROR`
to silence the chatter for any caller that does combine
streams.
- fibre_setup.go: the per-validator escrow verification used
`celestia-appd query fibre escrow` which doesn't exist —
the actual subcommand is `escrow-account`. The query
errored on every retry, the grep for "amount" never
matched, and the script wedged on the 3-min deadline
reporting `FATAL: fibre-0 escrow not present`. Switch to
`escrow-account` and key on `"found":true` (the explicit
existence flag in the response). Also wrap the fibre-0
deposit-to-escrow itself in a retry loop matching set-host
since the same `--yes`-returns-before-inclusion silent-failure
mode bit it. fibre-1..N stay best-effort.
* feat(evnode-txsim): keep-alive conn pool + pprof endpoint
Two diagnostic improvements for the load generator:
1. http.Transport.MaxIdleConnsPerHost defaults to 2 in stdlib.
With --concurrency=8 (or higher), 6+ goroutines per cycle had
to open fresh TCP+TLS sockets per request because the pool
couldn't hold their idle conns between requests. Bump
MaxIdleConns / MaxIdleConnsPerHost / MaxConnsPerHost to
2*concurrency so every active sender has a reusable keep-alive
socket, eliminating handshake churn from the hot path.
2. Always-on net/http/pprof on 127.0.0.1:6060. evnode-txsim is a
load tester, not a production daemon, so the cost of always serving
profiling is acceptable; the payoff is being able to grab CPU
profiles under live load without re-deploying the binary —
`ssh -L 6060:127.0.0.1:6060 root@loadgen \
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30`.
A profile captured this way under c=8 traced the per-request hot
path: 25.5% in kernel write(2), 25% in net/http body marshaling.
That diagnostic showed the c6in.2xlarge loadgen, not evnode or
DA, was the binding constraint for the experiment at ~22 MB/s,
a finding we'd have spent another debug round chasing without
the in-process profiler.