feat(talis): use local NVMe on AWS i-family instances #7145 (Draft)
Bring AWS up to parity with the existing DigitalOcean and Google Cloud
providers so talis can launch Celestia / fibre experiments on EC2.
## Overview
- `--provider aws` on `talis init`, `talis add`, and `talis init-env`
- New top-level `tools/talis/aws.go` (~700 LOC) covering the full
instance lifecycle: AMI resolution, key-pair import, security group,
cluster placement group, subnet lookup, RunInstances, wait-for-IPs,
TerminateInstances, destroy-all, and an existing-experiment check for
the shared `checkForRunningExperiments` gate.
- New `--slug` override on `talis add` so instance types can be picked
per-node without editing code (works for every provider).
- Sensible defaults matching the DO layout:
validator = c6in.4xlarge (network-enhanced, 25 Gbps baseline),
encoder = c6in.2xlarge,
obs = t3.medium.
- Single-region, single-AZ layout with a cluster placement group. AWS
charges $0.09/GB on cross-region traffic so a DO-style "random region"
default would make networking experiments absurdly expensive — the
shipping default is `us-east-1` / `us-east-1a`, overridable via
`--aws-zone` on `init` and `--aws-region` on up/down/list.
- Root-SSH and hostname setup via cloud-init user-data. The hostname
piece matters: validator_init.sh parses `hostname` to pick which
per-validator keys/config to install, and AWS's default
`ip-172-31-X-Y` hostname breaks that parser. We `hostnamectl
set-hostname validator-N` at boot.
- No S3 / scripts changes. The existing payload path reads
AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_S3_BUCKET the same
way DO Spaces does; for a first AWS run, operators can either set
`AWS_*` env vars (shared with EC2 creds) or use
`deploy --direct-payload-upload` to skip S3 entirely.
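The hostname requirement above can be sketched as a parse. This is illustrative only: the function name and exact extraction logic are mine, not lifted from `validator_init.sh`, but they show why the AWS default hostname shape breaks index-based key selection.

```shell
#!/usr/bin/env bash
# Illustrative sketch: validator_init.sh keys its per-validator config off
# `hostname`, e.g. validator-3 -> node 3. Cloud-init runs
# `hostnamectl set-hostname validator-N` at boot to guarantee this shape.
pick_index() {
  case "$1" in
    validator-*) echo "${1#validator-}" ;;
    *)           echo "unparseable hostname: $1" >&2; return 1 ;;
  esac
}

pick_index "validator-3"              # shape set by the cloud-init user-data
pick_index "ip-172-31-4-17" || true   # AWS default shape, which breaks the parse
```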
## Files
- `tools/talis/aws.go` — new, the EC2 implementation
- `tools/talis/config.go` — `AWS` Provider const, AWSRegion +
AWSZone on Config, Instance.Zone, and
WithAWSValidator/Encoder/Observability
+ WithAWSRegion/Zone builders
- `tools/talis/client.go` — `NewClient` dispatches to `NewAWSClient`
when `cfg.AWSRegion` is set
- `tools/talis/add.go` — `--provider aws` case, `--slug` flag
- `tools/talis/init.go` — `--provider aws` + `--aws-zone` flag,
stamps AWSRegion / AWSZone into config
- `tools/talis/env.go` — `generateAWSEnv` template + switch case
- `tools/talis/deployment.go` — `--aws-region` flag on up/down/list,
AWS branch in checkForRunningExperiments
and destroyAllInstances
- `go.mod` / `go.sum` — adds `github.com/aws/aws-sdk-go-v2/service/ec2`
## Follow-ups (separate PRs)
- Provider-tied S3 payload env vars (untangle `AWS_*` when both AWS
compute and DO Spaces are in play)
- Use local NVMe instance-store on i-family instances (defaults swap to
i4i, cloud-init formats + mounts `/mnt/data`, init scripts learn to
honour it)
## Validated
- `go vet ./tools/talis/...`, `go test ./tools/talis/...` clean
- End-to-end: launched 10 validator + 10 encoder + 1 observability on
`c6in.4xlarge` in `us-east-1d`, ran `talis genesis` / `deploy` /
`setup-fibre` / `start-fibre` / `fibre-txsim` (upload-only, 1 MB blobs)
and observed continuous confirmed uploads with per-validator Fibre +
celestia-appd OTel metrics in Prometheus
## Summary

Fibre experiments on `c6in.4xlarge` with the default gp3 volume ran into the EBS 125 MB/s / 3000 IOPS ceiling: pebble `store_put` dominated the fibre-server hot path (~97% of `upload_shard` time) long before network or CPU saturated. Switching to `i4i.4xlarge` (3.75 TB local NVMe, ~3 GB/s write, ~300k write IOPS) moves the disk ceiling out of the way so the network becomes the actual bottleneck, which is the point of these experiments.

## Changes

- `AWSDefaultValidatorInstanceType` and `AWSDefaultEncoderInstanceType` → `i4i.4xlarge` (16 vCPU / 128 GiB / up to 25 Gbps / 3.75 TB local NVMe). Observability unchanged at `t3.medium`.
- `AWSDefaultRootVolumeGB` 400 → 50. The root EBS volume only holds the OS and the downloaded payload tarball; the big state lives on NVMe.
- `awsRootSSHUserData` now ships `/usr/local/sbin/talis-setup-nvme.sh` via `write_files` and invokes it from `runcmd`. The script formats the first instance-store NVMe device (`/dev/nvme1n1`..`nvme3n1`) as ext4, mounts it at `/mnt/data`, adds an `fstab` entry with `nofail`, and creates `/root/.celestia-fibre -> /mnt/data/.celestia-fibre` so the fibre server's relative `--home .celestia-fibre` lands on the fast disk with no fibre-side changes. Safe on instance types without local NVMe: no device → the script exits early, no mountpoint, no symlinks.
- Init scripts learn to honour `/mnt/data`:
  - `validator_init.sh`: when `/mnt/data` is a mountpoint, symlink `$HOME/.celestia-app` and `$HOME/.celestia-fibre` to it. `CELES_HOME` stays relative so every tool (celestia-appd, fibre, fibre-txsim, setup-fibre) resolves paths under `$HOME` and transparently hits NVMe. Also re-establish the symlink after `rm -rf .celestia-app/` so `celestia-appd init` recreates state on `/mnt/data`.
  - `writeEncoderInitScript` (genesis.go) applies the same detection for encoders. `CELES_HOME` is absolute on NVMe hosts, `$HOME` otherwise.

## Compatibility

DO hosts have no `/mnt/data`, so every detection short-circuits and the scripts behave exactly as before. AWS sizes without instance-store (c6i, t3, etc.) also fall through to `$HOME/.celestia-app`.

Stacks on #7142.
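A minimal sketch of the probe/format/mount flow described above. The function name and parameterized device list are mine, and the shipped `talis-setup-nvme.sh` may differ in detail; the demo call uses a deliberately nonexistent path so the no-op branch runs.

```shell
#!/usr/bin/env bash
set -u

# Sketch of the talis-setup-nvme.sh flow: probe candidate instance-store
# devices and bail out harmlessly when none exist (c6i, t3, ...).
setup_nvme() {
  local dev="" d
  for d in "$@"; do                     # real call: /dev/nvme1n1 .. /dev/nvme3n1
    [ -b "$d" ] && { dev="$d"; break; }
  done
  if [ -z "$dev" ]; then
    echo "no instance-store NVMe; leaving root EBS layout alone"
    return 0                            # safe no-op on non-i-family sizes
  fi
  mkfs.ext4 -F "$dev"                   # format the first local NVMe
  mkdir -p /mnt/data
  mount "$dev" /mnt/data
  echo "$dev /mnt/data ext4 defaults,nofail 0 2" >> /etc/fstab
  # Relative --home .celestia-fibre (cwd /root) now lands on the fast disk:
  ln -sfn /mnt/data/.celestia-fibre /root/.celestia-fibre
}

# Demo: no such block device, so the script is a no-op.
setup_nvme "$PWD/not-a-block-device"
```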
The `encoder_init.sh` template sets `CELES_HOME` to an absolute path (`/mnt/data/.celestia-app` on i-family hosts), but fibre-txsim and setup-fibre use a relative `--keyring-dir .celestia-app` that resolves from `$HOME=/root`. Without a symlink from `/root/.celestia-app` to `/mnt/data/.celestia-app`, the encoder keyring is found by the init script but invisible to the binaries the rest of talis runs.

Concretely: running the full setup-fibre → start-fibre → fibre-txsim flow on a fresh `i4i.4xlarge` cluster failed with "key not found: enc0-0.info" and, when manually worked around, all uploads failed with "payment promise verification: escrow account not found", because the encoder-side deposit-to-escrow never saw its own keyring and the txs never landed.

The fix mirrors the symlink dance `validator_init.sh` already does: if `/mnt/data` was chosen as `STATE_BASE`, run `ln -sfn /mnt/data/.celestia-app $HOME/.celestia-app` after the keyring copy. DO and AWS sizes without instance-store never branch into this path.
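The fix above can be sketched like this. The function name is mine, and the demo uses scratch directories standing in for `/mnt/data` and `/root`; the real script operates on those absolute paths.

```shell
#!/usr/bin/env bash
set -eu

# Mirror validator_init.sh's symlink dance on the encoder: when state lives
# under a separate STATE_BASE (e.g. /mnt/data), point $HOME/.celestia-app at
# it so relative --keyring-dir lookups resolved from $HOME still find the
# keyring.
link_keyring_home() {
  local state_base="$1" home="$2"
  if [ "$state_base" != "$home" ]; then
    ln -sfn "$state_base/.celestia-app" "$home/.celestia-app"
  fi
}

# Demo with scratch dirs standing in for /mnt/data and /root:
tmp="$(mktemp -d)"
mkdir -p "$tmp/mnt-data/.celestia-app" "$tmp/root-home"
link_keyring_home "$tmp/mnt-data" "$tmp/root-home"
readlink "$tmp/root-home/.celestia-app"   # prints .../mnt-data/.celestia-app
```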
Closes: https://linear.app/celestia/issue/PROTOCO-1545/feattalis-use-local-nvme-on-aws-i-family-instances
## Test plan
- `go vet ./tools/talis/...`, `go test ./tools/talis/...` clean.
- `talis up`, then confirm `/dev/nvme1n1 on /mnt/data` (~3.4 TiB) and the `/root/.celestia-fibre` symlink are present after cloud-init completes.
- `celestia-appd start --home /mnt/data/.celestia-app` and `fibre start --home .celestia-fibre` both write to NVMe. Confirm pebble `store_put` no longer dominates `upload_shard` latency.