feat(talis): use local NVMe on AWS i-family instances #7145

Draft
walldiss wants to merge 3 commits into main from feat/talis-aws-nvme

Conversation

@walldiss
Member

Closes: https://linear.app/celestia/issue/PROTOCO-1545/feattalis-use-local-nvme-on-aws-i-family-instances

Stacked on #7142 — branches off feat/talis-aws-provider. Rebase to main once #7142 lands.

Summary

Fibre experiments on c6in.4xlarge with default gp3 hit the EBS 125 MB/s / 3000 IOPS ceiling: pebble store_put dominated the fibre-server hot path (~97 % of upload_shard time) long before network or CPU saturated. Switching to i4i.4xlarge (3.75 TB local NVMe, ~3 GB/s write, ~300k write IOPS) moves the disk ceiling out of the way so the network is the actual bottleneck — which is the point of these experiments.

Changes

  • `AWSDefaultValidatorInstanceType` + `AWSDefaultEncoderInstanceType` → `i4i.4xlarge` (16 vCPU / 128 GiB / up to 25 Gbps / 3.75 TB local NVMe). Obs unchanged at `t3.medium`.
  • AWSDefaultRootVolumeGB 400 → 50. Root EBS only holds the OS + downloaded payload tarball.
  • awsRootSSHUserData ships /usr/local/sbin/talis-setup-nvme.sh via write_files + invokes it from runcmd: formats /dev/nvme[1-3]n1 ext4, mounts at /mnt/data with nofail, creates /root/.celestia-fibre → /mnt/data/.celestia-fibre so the fibre server's relative --home transparently hits NVMe. Safe on instance types without local NVMe: no device → script exits early.
  • `validator_init.sh`: detect /mnt/data, symlink `$HOME/.celestia-app` and `$HOME/.celestia-fibre` there. Re-establish the `.celestia-app` symlink right after `rm -rf` so `celestia-appd init` recreates state on NVMe.
  • writeEncoderInitScript (genesis.go) applies the same detection for encoders.
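The setup flow the bullets above describe can be sketched as follows. This is an illustrative reconstruction, not the script shipped in the PR: the device names, mountpoint, `nofail` fstab entry, and early exit follow the description; function names and everything else are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of talis-setup-nvme.sh's logic (not the shipped script).
set -euo pipefail

# pick_nvme: print the first argument that exists as a block device.
pick_nvme() {
    local d
    for d in "$@"; do
        if [ -b "$d" ]; then
            echo "$d"
            return 0
        fi
    done
    return 1
}

# setup_nvme: format/mount the first instance-store device and point
# /root/.celestia-fibre at it. Not invoked here; on a real host this
# would run from cloud-init's runcmd.
setup_nvme() {
    local mount=/mnt/data dev
    # Safe on instance types without local NVMe: no device -> exit early.
    dev="$(pick_nvme /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1)" || return 0
    mkfs.ext4 -F "$dev"
    mkdir -p "$mount"
    echo "$dev $mount ext4 defaults,nofail 0 2" >> /etc/fstab
    mount "$mount"
    # The fibre server's relative --home .celestia-fibre resolves under
    # /root, so one symlink lands its state on the fast disk.
    mkdir -p "$mount/.celestia-fibre"
    ln -sfn "$mount/.celestia-fibre" /root/.celestia-fibre
}
```

The `nofail` mount option matters because instance-store contents do not survive a stop/start; the host must still boot cleanly if the device comes back empty or absent.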

Compatibility

  • DO hosts have no /mnt/data → every detection short-circuits; behaviour unchanged.
  • AWS sizes without instance-store (c6i, t3, etc.) → script exits early, scripts fall back to `$HOME/.celestia-app`.

Test plan

  • go vet ./tools/talis/..., go test ./tools/talis/... clean.
  • Spin up i4i.4xlarge via talis up, confirm /dev/nvme1n1 on /mnt/data (~3.4 TiB) and /root/.celestia-fibre symlink present after cloud-init completes.
  • Run full fibre experiment on i4i.4xlarge; celestia-appd start --home /mnt/data/.celestia-app and fibre start --home .celestia-fibre both write to NVMe. Confirm pebble store_put no longer dominates upload_shard latency.
  • Spot-check a DO run for unchanged behaviour.

🤖 Generated with Claude Code

Bring AWS up to parity with the existing DigitalOcean and Google Cloud
providers so talis can launch Celestia / fibre experiments on EC2.

## Overview

- `--provider aws` on `talis init`, `talis add`, and `talis init-env`
- New top-level `tools/talis/aws.go` (~700 LOC) covering the full
  instance lifecycle: AMI resolution, key-pair import, security group,
  cluster placement group, subnet lookup, RunInstances, wait-for-IPs,
  TerminateInstances, destroy-all, and an existing-experiment check for
  the shared `checkForRunningExperiments` gate.
- New `--slug` override on `talis add` so instance types can be picked
  per-node without editing code (works for every provider).
- Sensible defaults matching the DO layout:
  validator = c6in.4xlarge (network-enhanced, 25 Gbps baseline),
  encoder   = c6in.2xlarge,
  obs       = t3.medium.
- Single-region, single-AZ layout with a cluster placement group. AWS
  charges $0.09/GB on cross-region traffic so a DO-style "random region"
  default would make networking experiments absurdly expensive — the
  shipping default is `us-east-1` / `us-east-1a`, overridable via
  `--aws-zone` on `init` and `--aws-region` on up/down/list.
- Root-SSH and hostname setup via cloud-init user-data. The hostname
  piece matters: validator_init.sh parses `hostname` to pick which
  per-validator keys/config to install, and AWS's default
  `ip-172-31-X-Y` hostname breaks that parser. We `hostnamectl
  set-hostname validator-N` at boot.
- No S3 / scripts changes. The existing payload path reads
  AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_S3_BUCKET the same
  way DO Spaces does; for a first AWS run, operators can either set
  `AWS_*` env vars (shared with EC2 creds) or use
  `deploy --direct-payload-upload` to skip S3 entirely.
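The hostname point can be made concrete with a small sketch. The parser below is hypothetical (the real `validator_init.sh` is not shown in this PR text); it only illustrates why `validator-N` works where AWS's default `ip-172-31-X-Y` does not.

```shell
#!/usr/bin/env bash
# Hypothetical index extraction in the spirit of validator_init.sh.
set -euo pipefail

# node_index: "validator-3" -> "3"; anything else (e.g. the AWS
# default "ip-172-31-X-Y") fails, which is why user-data runs
# `hostnamectl set-hostname validator-N` at boot.
node_index() {
    case "$1" in
        validator-[0-9]*) echo "${1#validator-}" ;;
        *) return 1 ;;
    esac
}
```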

## Files

- `tools/talis/aws.go`           — new, the EC2 implementation
- `tools/talis/config.go`        — `AWS` Provider const, AWSRegion +
                                   AWSZone on Config, Instance.Zone, and
                                   WithAWSValidator/Encoder/Observability
                                   + WithAWSRegion/Zone builders
- `tools/talis/client.go`        — `NewClient` dispatches to `NewAWSClient`
                                   when `cfg.AWSRegion` is set
- `tools/talis/add.go`           — `--provider aws` case, `--slug` flag
- `tools/talis/init.go`          — `--provider aws` + `--aws-zone` flag,
                                   stamps AWSRegion / AWSZone into config
- `tools/talis/env.go`           — `generateAWSEnv` template + switch case
- `tools/talis/deployment.go`    — `--aws-region` flag on up/down/list,
                                   AWS branch in checkForRunningExperiments
                                   and destroyAllInstances
- `go.mod` / `go.sum`            — adds `github.com/aws/aws-sdk-go-v2/service/ec2`

## Follow-ups (separate PRs)

- Provider-tied S3 payload env vars (untangle `AWS_*` when both AWS
  compute and DO Spaces are in play)
- Use local NVMe instance-store on i-family instances (defaults swap to
  i4i, cloud-init formats + mounts `/mnt/data`, init scripts learn to
  honour it)

## Validated

- `go vet ./tools/talis/...`, `go test ./tools/talis/...` clean
- End-to-end: launched 10 validator + 10 encoder + 1 observability on
  `c6in.4xlarge` in `us-east-1d`, ran `talis genesis` / `deploy` /
  `setup-fibre` / `start-fibre` / `fibre-txsim` (upload-only, 1 MB blobs)
  and observed continuous confirmed uploads with per-validator Fibre +
  celestia-appd OTel metrics in Prometheus.

Fibre experiments on c6in.4xlarge with default gp3 ran into the EBS
125 MB/s / 3000 IOPS ceiling: pebble store_put dominated the
fibre-server hot path (~97% of upload_shard time) long before network
or CPU saturated. Switching to i4i.4xlarge (3.75 TB local NVMe,
~3 GB/s write, ~300k write IOPS) moves the disk ceiling out of the way
so the network actually becomes the bottleneck — which is the point of
these experiments.

- `AWSDefaultValidatorInstanceType` and `AWSDefaultEncoderInstanceType`
  → `i4i.4xlarge` (16 vCPU / 128 GiB / up to 25 Gbps / 3.75 TB local
  NVMe). Observability unchanged at `t3.medium`.
- `AWSDefaultRootVolumeGB` 400 → 50. Root EBS only holds the OS and
  the downloaded payload tarball; the big stuff lives on NVMe.
- `awsRootSSHUserData` now ships `/usr/local/sbin/talis-setup-nvme.sh`
  via `write_files` + invokes it from `runcmd`. The script formats the
  first instance-store NVMe (`/dev/nvme1n1`..`nvme3n1`) ext4, mounts
  it at `/mnt/data`, adds an `fstab` entry with `nofail`, and creates
  `/root/.celestia-fibre -> /mnt/data/.celestia-fibre` so the fibre
  server's relative `--home .celestia-fibre` lands on the fast disk
  with no fibre-side changes. Safe on instance types without local
  NVMe: no device → script exits early, no mountpoint, no symlinks.

Init scripts learn to honour `/mnt/data`:

- `validator_init.sh`: when `/mnt/data` is a mountpoint, symlink
  `$HOME/.celestia-app` and `$HOME/.celestia-fibre` to it. CELES_HOME
  stays relative so every tool (celestia-appd, fibre, fibre-txsim,
  setup-fibre) resolves paths under $HOME and transparently hits
  NVMe. Also re-establish the symlink after `rm -rf .celestia-app/`
  so `celestia-appd init` recreates state on /mnt/data.
- `writeEncoderInitScript` (genesis.go) applies the same detection
  for encoders. CELES_HOME is absolute on NVMe hosts, $HOME otherwise.
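The detection shared by both init scripts can be sketched as below. Paths and behaviour follow the description above; the `link_state` helper and the exact wiring are assumptions, not the shipped code.

```shell
#!/usr/bin/env bash
# Illustrative sketch of the /mnt/data detection in the init scripts.
set -euo pipefail

# When the NVMe mountpoint exists, state dirs become symlinks into it;
# otherwise everything stays under $HOME exactly as before.
STATE_BASE="$HOME"
if mountpoint -q /mnt/data 2>/dev/null; then
    STATE_BASE=/mnt/data
fi

# link_state: point $HOME/<dir> at $STATE_BASE/<dir>; a no-op when
# STATE_BASE is already $HOME (DO, non-instance-store AWS sizes).
link_state() {
    local dir="$1"
    [ "$STATE_BASE" = "$HOME" ] && return 0
    mkdir -p "$STATE_BASE/$dir"
    ln -sfn "$STATE_BASE/$dir" "$HOME/$dir"
}

link_state .celestia-app
link_state .celestia-fibre
```

Because every tool resolves these homes relative to `$HOME`, the symlink is the only change needed for `celestia-appd`, `fibre`, and `fibre-txsim` to transparently hit NVMe.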

Compatibility: DO hosts have no `/mnt/data` — every detection
short-circuits and the scripts behave exactly as before. AWS sizes
without instance-store (c6i, t3, etc.) also fall through to
`$HOME/.celestia-app`.

Stacks on #7142.
Base automatically changed from feat/talis-aws-provider to main April 20, 2026 22:38
The encoder_init.sh template sets CELES_HOME to an absolute path
(/mnt/data/.celestia-app on i-family hosts) but fibre-txsim and
setup-fibre use a relative --keyring-dir .celestia-app that resolves
from $HOME=/root. Without a symlink from /root/.celestia-app to
/mnt/data/.celestia-app, the encoder keyring is found by the init
script but invisible to the binaries the rest of talis runs.

Concretely: running the full setup-fibre → start-fibre → fibre-txsim
flow on a fresh i4i.4xlarge cluster failed with "key not found:
enc0-0.info" and, when manually worked around, all uploads failed
"payment promise verification: escrow account not found" because the
encoder-side deposit-to-escrow never saw its own keyring and the txs
never landed.

Mirror the symlink dance validator_init.sh already does: if
/mnt/data was chosen as STATE_BASE, `ln -sfn /mnt/data/.celestia-app
$HOME/.celestia-app` after the keyring copy. DO and AWS sizes without
instance-store still never branch into this path.
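The fix amounts to one guarded symlink after the keyring copy. A minimal sketch, assuming a `STATE_BASE` variable set by the earlier /mnt/data detection (the function name is illustrative):

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the encoder-side fix: mirror validator_init.sh's
# symlink so relative --keyring-dir lookups from $HOME=/root resolve to
# the NVMe-backed state dir.
set -euo pipefail

fix_keyring_symlink() {
    local state_base="$1" home="$2"
    # Only i-family hosts pick /mnt/data as STATE_BASE; DO and
    # non-instance-store AWS sizes never enter the linking branch.
    [ "$state_base" = "$home" ] && return 0
    ln -sfn "$state_base/.celestia-app" "$home/.celestia-app"
}
```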
@walldiss walldiss self-assigned this Apr 21, 2026
