feat(talis): use local NVMe on AWS i-family instances #7145 (Draft)
Bring AWS up to parity with the existing DigitalOcean and Google Cloud
providers so talis can launch Celestia / fibre experiments on EC2.
## Overview
- `--provider aws` on `talis init`, `talis add`, and `talis init-env`
- New top-level `tools/talis/aws.go` (~700 LOC) covering the full
instance lifecycle: AMI resolution, key-pair import, security group,
cluster placement group, subnet lookup, RunInstances, wait-for-IPs,
TerminateInstances, destroy-all, and an existing-experiment check for
the shared `checkForRunningExperiments` gate.
- New `--slug` override on `talis add` so instance types can be picked
per-node without editing code (works for every provider).
- Sensible defaults matching the DO layout:
validator = c6in.4xlarge (network-enhanced, 25 Gbps baseline),
encoder = c6in.2xlarge,
obs = t3.medium.
- Single-region, single-AZ layout with a cluster placement group. AWS
charges $0.09/GB on cross-region traffic so a DO-style "random region"
default would make networking experiments absurdly expensive — the
shipping default is `us-east-1` / `us-east-1a`, overridable via
`--aws-zone` on `init` and `--aws-region` on up/down/list.
- Root-SSH and hostname setup via cloud-init user-data. The hostname
piece matters: validator_init.sh parses `hostname` to pick which
per-validator keys/config to install, and AWS's default
`ip-172-31-X-Y` hostname breaks that parser. We `hostnamectl
set-hostname validator-N` at boot.
- No S3 / scripts changes. The existing payload path reads
AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY / AWS_S3_BUCKET the same
way DO Spaces does; for a first AWS run, operators can either set
`AWS_*` env vars (shared with EC2 creds) or use
`deploy --direct-payload-upload` to skip S3 entirely.
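The hostname requirement above can be sketched as a parse. This is illustrative only: the function name and exact extraction logic are mine, not lifted from `validator_init.sh`, but they show why the AWS default hostname shape breaks index-based key selection.

```shell
#!/usr/bin/env bash
# Illustrative sketch: validator_init.sh keys its per-validator config off
# `hostname`, e.g. validator-3 -> node 3. Cloud-init runs
# `hostnamectl set-hostname validator-N` at boot to guarantee this shape.
pick_index() {
  case "$1" in
    validator-*) echo "${1#validator-}" ;;
    *)           echo "unparseable hostname: $1" >&2; return 1 ;;
  esac
}

pick_index "validator-3"              # shape set by the cloud-init user-data
pick_index "ip-172-31-4-17" || true   # AWS default shape, which breaks the parse
```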
## Files
- `tools/talis/aws.go` — new, the EC2 implementation
- `tools/talis/config.go` — `AWS` Provider const, AWSRegion +
AWSZone on Config, Instance.Zone, and
WithAWSValidator/Encoder/Observability
+ WithAWSRegion/Zone builders
- `tools/talis/client.go` — `NewClient` dispatches to `NewAWSClient`
when `cfg.AWSRegion` is set
- `tools/talis/add.go` — `--provider aws` case, `--slug` flag
- `tools/talis/init.go` — `--provider aws` + `--aws-zone` flag,
stamps AWSRegion / AWSZone into config
- `tools/talis/env.go` — `generateAWSEnv` template + switch case
- `tools/talis/deployment.go` — `--aws-region` flag on up/down/list,
AWS branch in checkForRunningExperiments
and destroyAllInstances
- `go.mod` / `go.sum` — adds `github.com/aws/aws-sdk-go-v2/service/ec2`
## Follow-ups (separate PRs)
- Provider-tied S3 payload env vars (untangle `AWS_*` when both AWS
compute and DO Spaces are in play)
- Use local NVMe instance-store on i-family instances (defaults swap to
i4i, cloud-init formats + mounts `/mnt/data`, init scripts learn to
honour it)
## Validated
- `go vet ./tools/talis/...`, `go test ./tools/talis/...` clean
- End-to-end: launched 10 validator + 10 encoder + 1 observability on
`c6in.4xlarge` in `us-east-1d`, ran `talis genesis` / `deploy` /
`setup-fibre` / `start-fibre` / `fibre-txsim` (upload-only, 1 MB blobs)
and observed continuous confirmed uploads with per-validator Fibre +
celestia-appd OTel metrics in Prometheus
## Summary

Fibre experiments on `c6in.4xlarge` with the default gp3 volume ran into the EBS 125 MB/s / 3000 IOPS ceiling: pebble `store_put` dominated the fibre-server hot path (~97% of `upload_shard` time) long before network or CPU saturated. Switching to `i4i.4xlarge` (3.75 TB local NVMe, ~3 GB/s write, ~300k write IOPS) moves the disk ceiling out of the way so the network becomes the actual bottleneck, which is the point of these experiments.

## Changes

- `AWSDefaultValidatorInstanceType` and `AWSDefaultEncoderInstanceType` → `i4i.4xlarge` (16 vCPU / 128 GiB / up to 25 Gbps / 3.75 TB local NVMe). Observability unchanged at `t3.medium`.
- `AWSDefaultRootVolumeGB` 400 → 50. The root EBS volume only holds the OS and the downloaded payload tarball; the big state lives on NVMe.
- `awsRootSSHUserData` now ships `/usr/local/sbin/talis-setup-nvme.sh` via `write_files` and invokes it from `runcmd`. The script formats the first instance-store NVMe device (`/dev/nvme1n1`..`nvme3n1`) as ext4, mounts it at `/mnt/data`, adds an `fstab` entry with `nofail`, and creates `/root/.celestia-fibre -> /mnt/data/.celestia-fibre` so the fibre server's relative `--home .celestia-fibre` lands on the fast disk with no fibre-side changes. Safe on instance types without local NVMe: no device → the script exits early, no mountpoint, no symlinks.
- Init scripts learn to honour `/mnt/data`:
  - `validator_init.sh`: when `/mnt/data` is a mountpoint, symlink `$HOME/.celestia-app` and `$HOME/.celestia-fibre` to it. `CELES_HOME` stays relative so every tool (celestia-appd, fibre, fibre-txsim, setup-fibre) resolves paths under `$HOME` and transparently hits NVMe. Also re-establish the symlink after `rm -rf .celestia-app/` so `celestia-appd init` recreates state on `/mnt/data`.
  - `writeEncoderInitScript` (genesis.go) applies the same detection for encoders. `CELES_HOME` is absolute on NVMe hosts, `$HOME` otherwise.

## Compatibility

DO hosts have no `/mnt/data`, so every detection short-circuits and the scripts behave exactly as before. AWS sizes without instance-store (c6i, t3, etc.) also fall through to `$HOME/.celestia-app`.

Stacks on #7142.
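A minimal sketch of the probe/format/mount flow described above. The function name and parameterized device list are mine, and the shipped `talis-setup-nvme.sh` may differ in detail; the demo call uses a deliberately nonexistent path so the no-op branch runs.

```shell
#!/usr/bin/env bash
set -u

# Sketch of the talis-setup-nvme.sh flow: probe candidate instance-store
# devices and bail out harmlessly when none exist (c6i, t3, ...).
setup_nvme() {
  local dev="" d
  for d in "$@"; do                     # real call: /dev/nvme1n1 .. /dev/nvme3n1
    [ -b "$d" ] && { dev="$d"; break; }
  done
  if [ -z "$dev" ]; then
    echo "no instance-store NVMe; leaving root EBS layout alone"
    return 0                            # safe no-op on non-i-family sizes
  fi
  mkfs.ext4 -F "$dev"                   # format the first local NVMe
  mkdir -p /mnt/data
  mount "$dev" /mnt/data
  echo "$dev /mnt/data ext4 defaults,nofail 0 2" >> /etc/fstab
  # Relative --home .celestia-fibre (cwd /root) now lands on the fast disk:
  ln -sfn /mnt/data/.celestia-fibre /root/.celestia-fibre
}

# Demo: no such block device, so the script is a no-op.
setup_nvme "$PWD/not-a-block-device"
```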
The `encoder_init.sh` template sets `CELES_HOME` to an absolute path (`/mnt/data/.celestia-app` on i-family hosts), but fibre-txsim and setup-fibre use a relative `--keyring-dir .celestia-app` that resolves from `$HOME=/root`. Without a symlink from `/root/.celestia-app` to `/mnt/data/.celestia-app`, the encoder keyring is found by the init script but invisible to the binaries the rest of talis runs.

Concretely: running the full setup-fibre → start-fibre → fibre-txsim flow on a fresh `i4i.4xlarge` cluster failed with "key not found: enc0-0.info" and, when manually worked around, all uploads failed with "payment promise verification: escrow account not found", because the encoder-side deposit-to-escrow never saw its own keyring and the txs never landed.

The fix mirrors the symlink dance `validator_init.sh` already does: if `/mnt/data` was chosen as `STATE_BASE`, run `ln -sfn /mnt/data/.celestia-app $HOME/.celestia-app` after the keyring copy. DO and AWS sizes without instance-store never branch into this path.
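The fix above can be sketched like this. The function name is mine, and the demo uses scratch directories standing in for `/mnt/data` and `/root`; the real script operates on those absolute paths.

```shell
#!/usr/bin/env bash
set -eu

# Mirror validator_init.sh's symlink dance on the encoder: when state lives
# under a separate STATE_BASE (e.g. /mnt/data), point $HOME/.celestia-app at
# it so relative --keyring-dir lookups resolved from $HOME still find the
# keyring.
link_keyring_home() {
  local state_base="$1" home="$2"
  if [ "$state_base" != "$home" ]; then
    ln -sfn "$state_base/.celestia-app" "$home/.celestia-app"
  fi
}

# Demo with scratch dirs standing in for /mnt/data and /root:
tmp="$(mktemp -d)"
mkdir -p "$tmp/mnt-data/.celestia-app" "$tmp/root-home"
link_keyring_home "$tmp/mnt-data" "$tmp/root-home"
readlink "$tmp/root-home/.celestia-app"   # prints .../mnt-data/.celestia-app
```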
Closes: https://linear.app/celestia/issue/PROTOCO-1545/feattalis-use-local-nvme-on-aws-i-family-instances
## Test plan
- `go vet ./tools/talis/...`, `go test ./tools/talis/...` clean.
- `talis up`, then confirm `/dev/nvme1n1 on /mnt/data` (~3.4 TiB) and the `/root/.celestia-fibre` symlink are present after cloud-init completes.
- `celestia-appd start --home /mnt/data/.celestia-app` and `fibre start --home .celestia-fibre` both write to NVMe. Confirm pebble `store_put` no longer dominates `upload_shard` latency.