Conversation
Register raw-PPL eval slices for SMT-LIB, TPTP, CoqGym, DIMACS CNF, VerilogEval, RTL-Repo, and RTL-Coder behind a shared HTTP-archive downloader. Slice keys use a family/slice layout (formal_methods/* vs hardware_rtl/*) so levanter's gap-report aggregation keeps the two families separable, per issue DoD bullet 3. Each source downloads its full upstream archive and truncates the emitted JSONL.gz at 40 MB compressed; file bytes are preserved verbatim so comments, long symbols, solver status markers, and module boundaries survive for PPL eval. AIGER/BTOR model-checking benchmarks are intentionally skipped (binary format, no canonical textual rendering). Also adds a pilot gap-report runner (marin-8b-base vs Llama-3.1-8B vs Qwen3-8B-Base) that does not modify default_raw_validation_sets().

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
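The seven sources map onto the family/slice layout described above roughly like this (slice names here are illustrative, not the PR's exact registry keys), with a small grouping helper showing how the two families stay separable for gap-report aggregation:

```python
from collections import defaultdict

# Illustrative slice keys following the family/slice layout; the actual
# registry keys in the PR may differ.
SLICE_KEYS = [
    "formal_methods/smtlib",
    "formal_methods/tptp",
    "formal_methods/coqgym",
    "formal_methods/dimacs_cnf",
    "hardware_rtl/verilogeval",
    "hardware_rtl/rtl_repo",
    "hardware_rtl/rtl_coder",
]


def group_by_family(keys: list[str]) -> dict[str, list[str]]:
    """Split "family/slice" keys by family so reports aggregate per family."""
    families: dict[str, list[str]] = defaultdict(list)
    for key in keys:
        family, slice_name = key.split("/", 1)
        families[family].append(slice_name)
    return dict(families)
```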
Claude finished @dlwh's task in 3m 58s

Code Review: checked for bugs and AGENTS.md compliance. Posted 3 inline comments, all AGENTS.md violations; no functional bugs found.
Summary of findings
No functional or logic bugs, security issues, or test correctness issues identified. Archive extraction (tar/zip) reads members into memory only — no path-traversal risk. Byte-budget truncation logic in _write_budgeted_jsonl_gz was reviewed as well.
```python
def _iter_tar_zst_members(archive_bytes: bytes) -> Iterator[_SourceFile]:
    import zstandard  # local import: optional-feeling but already a project dep
```
AGENTS.md code-style rule: "All imports at the top of the file. No local imports except to break circular dependencies or guard optional deps." (AGENTS.md)
The inline comment already concedes the rule doesn't apply here — zstandard is a declared dep (e.g. lib/zephyr/pyproject.toml, lib/fray/pyproject.toml) and the same project already imports it at module scope (see datakit/download/nemotron_v1.py#L12). Move this import to the top of the file.
```suggestion
def _iter_tar_zst_members(archive_bytes: bytes) -> Iterator[_SourceFile]:
    dctx = zstandard.ZstdDecompressor()
    decompressed = dctx.decompress(archive_bytes, max_output_size=2**34)  # 16 GB ceiling
    yield from _iter_tar_members(decompressed, "tar")
```
(and add import zstandard to the top-of-file import block alongside zipfile / tarfile).
```python
    # mypy: session doesn't have timeout attribute; callers pass it per-request
    session.request_timeout_seconds = timeout_seconds  # type: ignore[attr-defined]
    return session
```
session.request_timeout_seconds is set on the Session but never read anywhere — _fetch_archive_bytes passes timeout=timeout_seconds directly on the session.get call (line 245). The attached comment (callers pass it per-request) confirms this is unused.
AGENTS.md: "Delete dead code: unused parameters, stale options, old experiments." (AGENTS.md)
```suggestion
    return session
```
```python
# AIGER / BTOR hardware model-checking benchmarks (https://fmv.jku.at/hwmcc11/benchmarks.html)
# are intentionally omitted: the primary distribution is binary AIGER. Adding a Marin-internal
# text rendering would be a bespoke serializer nobody else uses. Per @dlwh 2026-04-22, skip.
SKIPPED_BINARY_SOURCES: tuple[str, ...] = ("hardware_rtl/aiger_hwmcc",)
```
SKIPPED_BINARY_SOURCES is defined but never referenced anywhere in the repo (grep -rn SKIPPED_BINARY_SOURCES returns only this definition). The rationale is already captured in the preceding comment block, so the constant itself is dead code.
AGENTS.md: "Delete dead code: unused parameters, stale options, old experiments." (AGENTS.md)
```suggestion
# AIGER / BTOR hardware model-checking benchmarks (https://fmv.jku.at/hwmcc11/benchmarks.html)
# are intentionally omitted: the primary distribution is binary AIGER. Adding a Marin-internal
# text rendering would be a bespoke serializer nobody else uses. Per @dlwh 2026-04-22, skip.
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e685e1cbd6
```python
written = 0
with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
    for idx, (filename, text) in enumerate(records):
        if buf.tell() >= max_compressed_bytes:
```
Enforce compressed-byte budget after each record write
The size guard in _write_budgeted_jsonl_gz is checked before appending the next JSONL row, so one large record can push the output far past max_compressed_bytes (for example, if the first matched source file is very large). That violates the stated per-source cap and can materially increase storage/compute cost for these eval slices. The write path should either preflight each row’s compressed contribution or rollback the last write when it crosses the budget.
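The rollback approach the comment asks for can be sketched with zlib's resumable compressor, whose copy() method snapshots the stream state before each row; a row whose compressed contribution overflows the budget is then discarded instead of written. This is a minimal illustration, not the PR's actual _write_budgeted_jsonl_gz:

```python
import io
import json
import zlib


def write_budgeted_jsonl_gz(records, max_compressed_bytes):
    """Write (filename, text) pairs as gzipped JSONL, enforcing the
    compressed-byte budget AFTER measuring each row's contribution.

    A snapshot (comp.copy()) is taken before each row so an overflowing
    row can be rolled back rather than pushing output past the cap.
    """
    out = io.BytesIO()
    comp = zlib.compressobj(9, zlib.DEFLATED, 31)  # wbits=31 -> gzip framing
    written = 0
    for filename, text in records:
        row = (json.dumps({"filename": filename, "text": text}) + "\n").encode()
        snapshot = comp.copy()  # rollback point: state before this row
        # Z_SYNC_FLUSH makes the row's bytes measurable while keeping the
        # stream appendable for the next row.
        pending = comp.compress(row) + comp.flush(zlib.Z_SYNC_FLUSH)
        if out.tell() + len(pending) > max_compressed_bytes:
            comp = snapshot  # discard the oversized row
            break
        out.write(pending)
        written += 1
    out.write(comp.flush())  # finalize the gzip trailer
    return out.getvalue(), written
```

The per-row sync flushes cost a few bytes of compression ratio versus one streaming GzipFile, which is the trade-off for an exact budget check.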
```python
# Z3 ships a large corpus of SMT-LIB2 example/regression files we can mirror as a
# representative SMT-LIB sample. For full SMT-LIB benchmark logics, repoint to a
# specific Zenodo tar.zst (see https://smt-lib.org/benchmarks.shtml).
url="https://github.com/Z3Prover/z3/archive/refs/heads/master.zip",
```
Pin archive URLs to immutable revisions
This source is pulled from a moving branch head (refs/heads/...), so dataset contents can change over time without any code/config change. Because step versioning only keys off the URL string (plus config/function hash), cached outputs may silently become stale relative to upstream, and fresh runs become non-reproducible across dates. Use immutable commit/tag archives (or include an explicit revision in versioned config) to keep cache invalidation and eval comparisons reliable.
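One cheap guard along the lines Codex suggests is to validate URLs at config time, rejecting GitHub archive URLs that track a moving branch head. This helper is hypothetical (not part of the PR):

```python
import re

# Hypothetical guard: refuse archive URLs that reference refs/heads/*,
# which move as the branch advances; require an immutable revision instead,
# e.g. https://github.com/Z3Prover/z3/archive/<40-hex-commit-sha>.zip
_MOVING_REF = re.compile(r"/archive/refs/heads/")


def assert_pinned(url: str) -> str:
    """Return url unchanged if it looks immutable; raise on a branch head."""
    if _MOVING_REF.search(url):
        raise ValueError(f"unpinned archive URL (moving branch head): {url}")
    return url
```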
Add opt-in formal-methods and hardware RTL PPL slices behind a shared archive downloader, keeping formal and RTL families separate in reporting and leaving default validation sets unchanged. Includes a focused pilot gap runner and bounded sampled outputs per source.
Part of #5005