Commit 2f403a8

infra+bench: increase EBS 100→200G + fix SSM output truncation
Root cause of large-shape disk-full: the 100G EBS fills to 99% (95G used) from Neuron SDK packages + medium NEFF cache alone, leaving only 1.1G for large-shape compilation artifacts.

Changes:
- infra/terraform/main.tf: root_block_device volume_size 100 → 200
- scripts/run_bench.sh:
  - Redirect bench output to /tmp log file; grep timing line to stdout. NEURON_RT_LOG_LEVEL=WARNING doesn't suppress the [INFO] NEFF messages (they come from libnrt.so); file redirect + grep is the only reliable approach to staying under SSM's 24 KB stdout budget.
  - Add growpart/resize2fs at startup to expand filesystem after terraform resizes the EBS volume (idempotent: no-op if already full size).
  - df -h / before each pass for disk diagnostics.

Next step: run `terraform apply` in infra/terraform/ to resize the EBS, then re-run `AWS_PROFILE=aws ./scripts/run_bench.sh`.
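The stdout-budget argument in the message can be illustrated with a small sketch. It assumes that output beyond the 24 KB budget is dropped and the head is kept, and it uses a made-up timing line format (`  cold: 12.3 s`). The helper names `truncate_like_ssm` and `timing_lines` are hypothetical and are not part of the repository.

```python
SSM_STDOUT_BUDGET = 24 * 1024  # the 24 KB figure quoted in the commit message


def truncate_like_ssm(stdout: str, budget: int = SSM_STDOUT_BUDGET) -> str:
    """Assumption: anything past the budget is dropped and the head is kept."""
    return stdout.encode()[:budget].decode(errors="ignore")


def timing_lines(stdout: str, pass_name: str) -> list[str]:
    """Lines that begin (after optional indentation) with '<pass_name>:'."""
    return [ln for ln in stdout.splitlines() if ln.lstrip().startswith(f"{pass_name}:")]


# Hypothetical bench output: many runtime log lines, then one timing line.
noisy = "\n".join(["[INFO]: Using a cached neff"] * 5000 + ["  cold: 12.3 s"])

# Without filtering, the timing line sits past the budget and is cut off.
unfiltered = truncate_like_ssm(noisy)
# With filtering, only the timing line is emitted, so it fits easily.
filtered = truncate_like_ssm("\n".join(timing_lines(noisy, "cold")))

print(timing_lines(unfiltered, "cold"))
print(timing_lines(filtered, "cold"))
```

Under these assumptions the first list is empty and the second holds the timing line, which is the failure mode the redirect-and-grep change is meant to avoid.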
1 parent 35a2e00 commit 2f403a8

2 files changed

Lines changed: 33 additions & 14 deletions

infra/terraform/main.tf

Lines changed: 2 additions & 2 deletions
@@ -111,7 +111,7 @@ resource "aws_instance" "ci" {
   associate_public_ip_address = true # Needed for SSM agent to reach regional endpoint without VPC endpoints

   root_block_device {
-    volume_size = 100
+    volume_size = 200 # 100G filled up: Neuron SDK (~70G) + medium NEFF cache leaves <2G for large compilation
     volume_type = "gp3"
   }

@@ -126,7 +126,7 @@ resource "aws_instance" "ci" {
 sudo -u ubuntu $NEURON_VENV/bin/pip install -e '/home/ubuntu/trnblas[dev]'
 # neuronxcc compile workdirs can be >5 GB for large NKI kernels. /tmp is
 # tmpfs (RAM-backed, ~16 GB on trn1.2xlarge) and runs out. Redirect the
-# compiler to /var/tmp (EBS-backed, 100 GB) for all ubuntu-user sessions.
+# compiler to /var/tmp (EBS-backed, 200 GB) for all ubuntu-user sessions.
 echo 'export TMPDIR=/var/tmp' >> /home/ubuntu/.profile
 EOF
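Since the resize is motivated by the root volume running out of space, a pre-flight headroom check is one way to surface the problem before a large compile starts. The following is a minimal sketch using only the standard library. `REQUIRED_GIB` is a made-up threshold for illustration, not a measured requirement, and `free_gib` and `has_headroom` are hypothetical helpers.

```python
import shutil


def free_gib(path: str = "/") -> float:
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def has_headroom(path: str, required_gib: float) -> bool:
    """True if at least `required_gib` GiB are free under `path`."""
    return free_gib(path) >= required_gib


# Illustrative threshold only; the real need depends on the shape being compiled.
REQUIRED_GIB = 20.0
if not has_headroom("/", REQUIRED_GIB):
    print(f"warning: only {free_gib('/'):.1f} GiB free on /")
```

A check like this reports the same information as the `df -h /` diagnostics, but as a number a script can act on.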

scripts/run_bench.sh

Lines changed: 31 additions & 12 deletions
@@ -120,30 +120,49 @@ shapes = os.environ["SHAPES_VAL"].split()
 # Each shape: cold pass first (compiles + caches NEFFs), then warm pass
 # in a fresh process (loads from EBS cache).
 #
-# NEURON_RT_LOG_LEVEL=WARNING suppresses the per-NEFF "[INFO]: Using a
-# cached neff" messages that would otherwise flood SSM's 24 KB stdout
-# limit and truncate the timing output.
+# Output filtering: bench output is redirected to a temp file; only the
+# timing line and errors are echoed to stdout. This avoids the SSM 24 KB
+# stdout limit being consumed by per-NEFF "[INFO]: Using a cached neff"
+# lines that NEURON_RT_LOG_LEVEL cannot suppress (they come from libnrt.so
+# before the Python runtime log level is applied).
 bench_cmds = ""
 for shape in shapes:
     run = (
         f"sudo -u ubuntu env PATH=$NEURON_VENV/bin:/usr/bin:/bin TMPDIR=/var/tmp"
         f" TRNBLAS_REQUIRE_NKI=1"
-        f" NEURON_RT_LOG_LEVEL=WARNING"
         f" $NEURON_VENV/bin/python /home/ubuntu/trnblas/examples/df_mp2.py"
         f" --bench --shape {shape} --batched-pair-energy"
     )
-    bench_cmds += (
-        f"echo '--- shape={shape} cold (df -h /var/tmp) ---'\n"
-        f"df -h /var/tmp\n"
-        f"echo '--- shape={shape} cold ---'\n"
-        f"{run} --passes cold\n"
-        f"echo '--- shape={shape} warm ---'\n"
-        f"{run} --passes warm\n"
-    )
+    for pass_name in ("cold", "warm"):
+        log = f"/tmp/bench_{shape}_{pass_name}.log"
+        bench_cmds += (
+            f"echo '--- shape={shape} {pass_name} (disk) ---'\n"
+            f"df -h / | tail -1\n"
+            f"echo '--- shape={shape} {pass_name} ---'\n"
+            f"set +e\n"
+            f"{run} --passes {pass_name} > {log} 2>&1\n"
+            f"BENCH_EXIT=$?\n"
+            f"set -e\n"
+            f"grep -E '^ {pass_name}:' {log} || true\n"
+            f"if [[ $BENCH_EXIT -ne 0 ]]; then\n"
+            f" echo 'BENCH FAILED (exit=$BENCH_EXIT):'\n"
+            f" tail -10 {log}\n"
+            f" false\n"
+            f"fi\n"
+        )

 script = (
     "#!/bin/bash\n"
     "set -euo pipefail\n"
+    # Expand filesystem if the EBS volume was resized via terraform.
+    # growpart/resize2fs are idempotent: they no-op if already at full size.
+    "ROOT_DEV=$(df / --output=source | tail -1)\n"
+    "PARENT=$(lsblk -no PKNAME \"$ROOT_DEV\" 2>/dev/null | head -1 || true)\n"
+    "if [[ -n \"$PARENT\" ]]; then\n"
+    " sudo growpart \"/dev/$PARENT\" 1 2>/dev/null || true\n"
+    "fi\n"
+    "sudo resize2fs \"$ROOT_DEV\" 2>/dev/null || true\n"
+    "echo \"disk after grow: $(df -h / | tail -1)\"\n"
     "cd /home/ubuntu/trnblas\n"
     "sudo -u ubuntu git fetch --all\n"
     f"sudo -u ubuntu git checkout {sha}\n"
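The per-pass shell fragment generated above follows a run, capture, filter pattern: send all output to a log file, remember the exit code, echo only the timing line, and show a short tail on failure. A rough Python equivalent is sketched below, purely as an illustration. `run_logged` is a hypothetical helper, and the child process is a stand-in for the real benchmark command.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_logged(cmd: list[str], log_path: Path, label: str, tail: int = 10) -> tuple[int, list[str]]:
    """Run cmd with stdout+stderr sent to log_path; return (exit code, lines to echo)."""
    with log_path.open("w") as log:
        exit_code = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode
    lines = log_path.read_text().splitlines()
    # Keep only lines that start (after indentation) with the pass label.
    echoed = [ln for ln in lines if ln.lstrip().startswith(f"{label}:")]
    if exit_code != 0:
        # The exit code is formatted here directly, so no shell expansion is involved.
        echoed.append(f"BENCH FAILED (exit={exit_code}):")
        echoed.extend(lines[-tail:])
    return exit_code, echoed


# Stand-in child: prints three noise lines and one timing-style line, then exits 0.
child = [sys.executable, "-c", "print('[INFO] noise\\n' * 3 + ' cold: 1.5 s')"]
with tempfile.TemporaryDirectory() as tmp:
    code, out = run_logged(child, Path(tmp) / "bench_cold.log", "cold")
print(code, out)
```

One detail worth checking in the generated shell: if the `echo 'BENCH FAILED (exit=$BENCH_EXIT):'` line reaches bash unchanged, the single quotes prevent `$BENCH_EXIT` from expanding, so the literal text is printed rather than the exit code. Double quotes around that message would let the value through.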
