Post-merge trust fixes: .gitattributes for SHA pinning, cells.jsonl drift, README/AUDIT consistency by Lightheartdevs · Pull Request #19 · Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests

Lightheartdevs · 2026-05-17T10:47:27Z

Why

A post-merge audit of #18 flagged five concrete trust hits. Each is a real reproducibility or consistency issue, and all five are fixed here.

The five fixes

1. `.gitattributes` added at repo root

Without explicit rules, Windows reproducers checking out this repo with core.autocrlf=true (the option the Git for Windows installer selects by default) get JSONL line endings silently converted LF→CRLF on checkout, which breaks the published SHA on workloads/prompts.jsonl. The reviewer's downloaded file genuinely didn't match the published hash. Now pinned:

*.jsonl, power.csv, thermals.csv, *.log → binary (no transformation ever; SHAs match across platforms)
*.md, *.yaml, *.json, *.csv (other than power/thermals), *.sh, *.py, *.ts, *.sha256 → text eol=lf
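To illustrate why a checkout-time line-ending conversion breaks a published hash, here is a minimal sketch. The two JSONL records are made up for illustration and are not the contents of workloads/prompts.jsonl; `sha256_hex` is a hypothetical helper name.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical two-record JSONL payload (not the real prompts.jsonl).
lf_bytes = b'{"id": 1}\n{"id": 2}\n'
# The bytes a core.autocrlf=true checkout would write to the working tree.
crlf_bytes = lf_bytes.replace(b"\n", b"\r\n")

print(sha256_hex(lf_bytes) == sha256_hex(crlf_bytes))  # False
print(len(crlf_bytes) - len(lf_bytes))                  # 2
```

Two extra bytes are enough to change the digest, which is why the hashed files are marked binary rather than relying on each reproducer's Git configuration.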

2. `prompts.jsonl.sha256` in `sha256sum --check` format

Previously just the bare hash (9a27e...). Now standard <hash> <filename> form so sha256sum --check prompts.jsonl.sha256 works directly. Applied to all four sha256 files in the bundle + harness:

$ sha256sum --check prompts.jsonl.sha256
prompts.jsonl: OK
$ sha256sum --check smoke-prompts.jsonl.sha256
smoke-prompts.jsonl: OK
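A checksum file in this form could be produced along these lines. This is a sketch under assumptions, not the script used in this PR: `checksum_line` and `write_checksum_file` are hypothetical names, and the output is written as bytes so no platform newline translation can sneak in.

```python
import hashlib
from pathlib import Path

def checksum_line(path: Path) -> str:
    """Build one `sha256sum --check` compatible line for `path`."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    # Two spaces separate hash and file name in sha256sum's text-mode format.
    return f"{digest}  {path.name}\n"

def write_checksum_file(path: Path) -> Path:
    """Write `<name>.sha256` next to `path` and return the new file's path."""
    out = path.with_name(path.name + ".sha256")
    # Bytes, not text: keeps the trailing LF as LF on every platform.
    out.write_bytes(checksum_line(path).encode("utf-8"))
    return out
```

Because only the file name (not a directory) is recorded, `sha256sum --check` has to be run from the directory that holds the data file, as in the transcript above.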

3. `cells.jsonl` drift fixed

The v1 aggregate/cells.jsonl had 35 rows for tower2/qwen3.6-27b/cuda while filesystem + manifest.json both said 36. Missing: ctx32768_gen2048_conc8 — the engine-bound timeout cell where only 1 of 8 slots completed per batch. The harness aggregator silently drops rows with per_slot_decode_tps_mean=null, which is exactly what that cell produced.
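The failure mode described above can be sketched as follows. This is not the harness aggregator's actual code: the function names, the `cell` key, and the first sample row are assumptions made for illustration. Only the field name per_slot_decode_tps_mean and the cell name ctx32768_gen2048_conc8 come from this PR.

```python
def aggregate_dropping(cells):
    """Failure mode: rows without a per-slot mean vanish silently."""
    return [c for c in cells if c.get("per_slot_decode_tps_mean") is not None]

def aggregate_keeping(cells):
    """Alternative: keep every cell and annotate the ones with a null mean."""
    rows = []
    for c in cells:
        row = dict(c)  # copy so the input rows are not mutated
        if row.get("per_slot_decode_tps_mean") is None:
            row.setdefault("notes", "per_slot fields null: no per-slot mean recorded")
        rows.append(row)
    return rows

cells = [
    # Made-up cell name and value, for illustration only.
    {"cell": "ctx4096_gen256_conc1", "per_slot_decode_tps_mean": 12.5},
    {"cell": "ctx32768_gen2048_conc8", "per_slot_decode_tps_mean": None},
]
print(len(aggregate_dropping(cells)), len(aggregate_keeping(cells)))  # 1 2
```

Keeping the row with explicit nulls and a note preserves the row count, so the aggregate can still be compared against the manifest and the filesystem.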

Row backfilled with:

aggregate_decode_tps_mean=0.87 and batch_wall_s_mean=1200.10 from cell.json
cold_start_decode_tps=1.52, cold_start_wall_s=1200.22
power_w_silicon_mean=183.15, power_w_silicon_max=390.79 derived from this cell's power.csv (gpu0 filter, n=11419 samples)
temp_c_max=57.0 from this cell's thermals.csv (gpu0 sensor)
All per_slot_* fields explicitly null
A notes field explaining the null + the reconstruction method
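Deriving the power statistics from a per-cell power.csv might look like the sketch below. The column names `device` and `power_w` and the sample rows are assumptions; the real schema of power.csv is not shown in this PR.

```python
import csv
import io
from statistics import fmean

def power_stats(csv_text: str, device: str = "gpu0"):
    """Return (mean, max, n) of power_w over the rows matching `device`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    watts = [float(r["power_w"]) for r in reader if r["device"] == device]
    if not watts:
        raise ValueError(f"no samples for {device}")
    return round(fmean(watts), 2), round(max(watts), 2), len(watts)

# Hypothetical samples; the gpu1 row is excluded by the gpu0 filter.
sample = "ts,device,power_w\n0,gpu0,100.0\n0,gpu1,50.0\n1,gpu0,200.0\n"
print(power_stats(sample))  # (150.0, 200.0, 2)
```

Returning the sample count alongside the mean and max makes it possible to record the n used for the reconstruction, as the backfilled row does.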

After fix: cells.jsonl 36 rows ↔ filesystem 36 dirs ↔ manifest.json 36 cells, internally consistent.

4. README/AUDIT.md drift on `llama-server-*.log` publication resolved

README.md said the per-cell llama-server-*.log debug logs are not included (correct — they're 40 MB across 251 files and excluded for bundle size). AUDIT.md said they are published in the reproducibility bundle and listed them as a per-cell artifact. Resolved AUDIT.md to README's position throughout, including a new explicit:

NOT included in the bundle (regeneratable):

Per-cell llama-server-<port>.log — excluded to keep the bundle ~110 MB; regeneratable from the pinned SHA + per-cell cell.meta.json server invocation.

Per-host build-<backend>.configure.log and build-<backend>.build.log — excluded for size; the build invocations themselves are in harness/HARNESS-README.md.

5. Harness docs de-drifted

harness/README.md still described Strix Halo as "ROCm canonical" with the grid "running twice (ROCm and Vulkan)" — pre-bug-discovery framing that contradicted the bundle's findings.md (Vulkan is canonical, ROCm 6.4.4 segfaulted — see Finding 1). Updated the hosts table, workload-size estimate, and engines/ inventory comments to reflect Vulkan-canonical reality.

harness/AUDIT.md was a full duplicate copy of the upstream bench-fleet AUDIT, with several pre-curation internal references (task #16, #20, #21, #37; "the user"; targets.json.broken_rocm_finding) that were sanitized in the bundle-level ../AUDIT.md during PR #18 but not propagated to the harness copy. Replaced with a single-paragraph pointer to ../AUDIT.md as single source of truth, so future curation passes have one file to update instead of two that can drift.

What this PR is NOT

No new bench data. The next round (MMBT Phase B Q8 quality companion, full 30-min sustained-thermal tier, PTX-JIT SOFT_MAX retry on Tower2 35B-A3B native CUDA) ships as a separate PR.

Verification

$ (cd hardware-tests/qwen3.6-q8-fleet-2026-05-17/workloads && sha256sum --check prompts.jsonl.sha256)
prompts.jsonl: OK

$ python3 -c "
import json
from collections import Counter
# Count aggregate rows per (host, model, backend); skip blank lines.
with open('hardware-tests/qwen3.6-q8-fleet-2026-05-17/aggregate/cells.jsonl') as f:
    rows = [json.loads(line) for line in f if line.strip()]
counts = Counter((r['host'], r['model'], r['backend']) for r in rows)
print(counts[('tower2', 'qwen3.6-27b', 'cuda')])
"
36   # was 35, now matches manifest + filesystem

Test plan

On Windows, git clone and verify sha256sum --check prompts.jsonl.sha256 passes (no CRLF conversion)
Spot-check the backfilled ctx32768_gen2048_conc8 row in aggregate/cells.jsonl against tower2/qwen3.6-27b/cuda/ctx32768_gen2048_conc8/cell.json
Confirm AUDIT.md no longer claims llama-server logs are published
Confirm harness/README.md describes Vulkan as Strix canonical
Confirm harness/AUDIT.md is a pointer stub to ../AUDIT.md
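The spot-check of the backfilled row could be scripted roughly as below. This is a sketch that assumes cell.json exposes the same key names as the aggregate row, which this PR does not confirm; `mismatched_fields` is a hypothetical helper. The two values are the ones quoted for the backfilled row.

```python
import math

def mismatched_fields(row, cell, fields, tol=1e-9):
    """Return the names in `fields` whose values differ between row and cell."""
    bad = []
    for name in fields:
        a, b = row.get(name), cell.get(name)
        if a is None or b is None:
            # A null only matches another null.
            if a is not b:
                bad.append(name)
        elif not math.isclose(a, b, abs_tol=tol):
            bad.append(name)
    return bad

# Values quoted in this PR; the dict layout itself is assumed.
row = {"aggregate_decode_tps_mean": 0.87, "batch_wall_s_mean": 1200.10}
cell = {"aggregate_decode_tps_mean": 0.87, "batch_wall_s_mean": 1200.10}
print(mismatched_fields(row, cell, list(row)))  # []
```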

🤖 Generated with Claude Code

Five concrete reviewer nits flagged on PR #18 post-merge. Each is a real trust hit; all five are fixed here.

1. **.gitattributes added at repo root.** Without explicit rules, Windows reproducers checking out this repo with the default core.autocrlf=true silently converted JSONL line endings LF→CRLF on checkout, which broke the published SHA on workloads/prompts.jsonl ("the file the reviewer downloaded didn't match the SHA we published"). New .gitattributes pins *.jsonl, power.csv, thermals.csv, and *.log as `binary` (no transformation ever; SHAs match across platforms), forces LF on *.md/*.yaml/*.json/*.csv/*.sh/*.py and other text formats.

2. **prompts.jsonl.sha256 fixed to sha256sum --check format.** The file previously contained just the bare hash, which fails `sha256sum --check prompts.jsonl.sha256` with "no properly formatted checksum lines found". Updated to standard `<hash> <filename>` form so reproducers can verify directly. Applied to all four sha256 files: workloads/prompts.jsonl.sha256, harness/workloads/prompts.jsonl.sha256, harness/workloads/smoke-prompts.jsonl.sha256, and the bundle-level workloads/prompts.jsonl.sha256.

3. **cells.jsonl drift fixed.** The v1 aggregate had 35 rows for tower2/qwen3.6-27b/cuda while the filesystem (and manifest.json) said 36. The missing cell was ctx32768_gen2048_conc8, the canonical engine-bound timeout cell where only 1 of 8 slots completed per batch. The harness aggregator silently drops rows with per_slot_decode_tps_mean=null, which is exactly what that cell produced. Backfilled the row with the available aggregate numbers from cell.json (aggregate_decode_tps_mean=0.87, batch_wall_s_mean=1200.10, cold_start_decode_tps=1.52), power/thermal stats derived from the cell's power.csv (gpu0 filter, n=11419) and thermals.csv (gpu0 sensor, n=10861), per_slot fields explicitly null, and a `notes` field explaining why per_slot is null and how the row was reconstructed. manifest.json's 36-cell count now matches aggregate; filesystem reality preserved.

4. **README/AUDIT.md drift on llama-server log publication fixed.** README.md said llama-server debug logs are NOT included (correct); AUDIT.md said they ARE published in the reproducibility bundle and listed them as a per-cell artifact. Resolved to README's position throughout AUDIT.md, with a new explicit "NOT included in the bundle (regeneratable)" section listing the per-cell llama-server log and the per-host build-*.log files, both regeneratable from the pinned source SHA in harness/VENDORED-FROM-SHA.txt.

5. **harness/README.md + harness/AUDIT.md de-drifted.** harness/README.md still described Strix Halo as "ROCm canonical" with the grid "running twice (ROCm and Vulkan)": pre-bug-discovery framing that contradicted the bundle's findings.md (Vulkan is the canonical, working path; ROCm 6.4.4 segfaulted, see Finding 1). Updated the harness/README hosts table, the workload-size estimate, and the engines/ inventory comments to reflect Vulkan-canonical reality. harness/AUDIT.md was a full duplicate copy of the upstream bench-fleet AUDIT, with several pre-curation internal references (task #16, #20, #21, #37; "the user"; targets.json.broken_rocm_finding) that were sanitized in the bundle-level ../AUDIT.md during the curation pass but not propagated. Replaced harness/AUDIT.md with a one-paragraph pointer to ../AUDIT.md as single source of truth, so future curation passes have one file to update instead of two that can drift.

Net effect: the bundle is consistent (manifest ↔ aggregate ↔ filesystem), the SHA pin is platform-neutral (Linux + Windows + macOS reproducers all get the same bytes), the README and the AUDIT agree on what's in the bundle and what isn't, and the harness docs reflect the actually-running configuration instead of the pre-bug plan.

This commit adds no new bench data. The next round (MMBT Phase B Q8 quality companion, full sustained-thermal tier, PTX-JIT SOFT_MAX retry) ships as a separate PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lightheartdevs merged commit ebf69ee into main May 17, 2026
1 check passed

Lightheartdevs mentioned this pull request May 17, 2026

Trust-fixes-2: doc/data drift sweep + cherry-pick cpu-fullpower evidence #20

Merged

4 tasks

Lightheartdevs merged 1 commit into main from submit/hardware-tests-q8-fleet-trust-fixes-2026-05-17
