
[evals] Add formal-methods and RTL PPL slices #5128

Open
dlwh wants to merge 5 commits into main from claude/issue-5060-20260422-2238

Conversation

@dlwh (Member) commented Apr 23, 2026

Add opt-in formal-methods and hardware RTL PPL slices behind a shared archive downloader, keeping formal and RTL families separate in reporting and leaving default validation sets unchanged. Includes a focused pilot gap runner and bounded sampled outputs per source.

Part of #5005

Register raw-PPL eval slices for SMT-LIB, TPTP, CoqGym, DIMACS CNF,
VerilogEval, RTL-Repo, and RTL-Coder behind a shared HTTP-archive
downloader. Slice keys use a family/slice layout
(formal_methods/* vs hardware_rtl/*) so levanter's gap-report
aggregation keeps the two families separable, per bullet 3 of the
issue's definition of done.

Each source downloads its full upstream archive and truncates the
emitted JSONL.gz at 40 MB compressed; file bytes are preserved
verbatim so comments, long symbols, solver status markers, and
module boundaries survive for PPL eval.

AIGER/BTOR model-checking benchmarks are intentionally skipped
(binary, no canonical textual rendering). Adds a pilot gap-report
runner (marin-8b-base vs Llama-3.1-8B vs Qwen3-8B-Base) that does
not modify default_raw_validation_sets().
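The family/slice key layout described above can be sketched as follows. The slice names here are illustrative assumptions, not the exact registry keys; only the family/slice shape matters:

```python
# Illustrative sketch of the family/slice key layout. The slice names are
# assumptions; the point is the "family/slice" shape of each key.
SLICE_KEYS = [
    "formal_methods/smtlib",
    "formal_methods/tptp",
    "formal_methods/coqgym",
    "formal_methods/dimacs_cnf",
    "hardware_rtl/verilogeval",
    "hardware_rtl/rtl_repo",
    "hardware_rtl/rtl_coder",
]


def group_by_family(keys):
    """Group slice keys by their family prefix so gap-report aggregation
    can keep formal-methods and RTL results separate."""
    families = {}
    for key in keys:
        family, _, slice_name = key.partition("/")
        families.setdefault(family, []).append(slice_name)
    return families
```

Grouping on the prefix is what lets downstream aggregation report the two families separately without any per-slice special-casing.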

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
@dlwh added the agent-generated (Created by automation/agent) label Apr 23, 2026
@claude (Bot, Contributor) commented Apr 23, 2026

Claude finished @dlwh's task in 3m 58s.


Code Review

Checked for bugs and AGENTS.md compliance. Posted 3 inline comments, all AGENTS.md violations — no functional bugs found.

  • Read PR skill definition
  • Review PR files
  • Gather repo context (datakit patterns, existing zstandard/atomic_rename usage)
  • Post inline comments

Summary of findings

  1. Local import zstandard in _iter_tar_zst_members (download/formal_methods_evals.py#L158). zstandard is a declared project dep and already imported at module scope elsewhere in marin; AGENTS.md requires top-of-file imports.
  2. Dead session.request_timeout_seconds assignment (download/formal_methods_evals.py#L234-L236). The attribute is set but never read — the actual timeout is passed per-request on session.get.
  3. Unused SKIPPED_BINARY_SOURCES constant (experiments/evals/exp5060_formal_methods_evals.py#L135). Never referenced anywhere; the rationale is already in the comment block above it.

No functional or logic bugs, security issues, or test correctness issues identified. Archive extraction (tar/zip) reads members into memory only — no path-traversal risk. Byte-budget truncation logic in _write_budgeted_jsonl_gz is correct, and id/written counters stay in sync on break. atomic_rename + open_url follows the existing pattern in uncheatable_eval.py.



def _iter_tar_zst_members(archive_bytes: bytes) -> Iterator[_SourceFile]:
    import zstandard  # local import: optional-feeling but already a project dep

AGENTS.md code-style rule: "All imports at the top of the file. No local imports except to break circular dependencies or guard optional deps." (AGENTS.md)

The inline comment already concedes the rule doesn't apply here — zstandard is a declared dep (e.g. lib/zephyr/pyproject.toml, lib/fray/pyproject.toml) and the same project already imports it at module scope (see datakit/download/nemotron_v1.py#L12). Move this import to the top of the file.

Suggested change
    import zstandard  # local import: optional-feeling but already a project dep
def _iter_tar_zst_members(archive_bytes: bytes) -> Iterator[_SourceFile]:
    dctx = zstandard.ZstdDecompressor()
    decompressed = dctx.decompress(archive_bytes, max_output_size=2**34)  # 16 GB ceiling
    yield from _iter_tar_members(decompressed, "tar")

(and add import zstandard to the top-of-file import block alongside zipfile / tarfile).

Comment on lines +234 to +236
    # mypy: session doesn't have timeout attribute; callers pass it per-request
    session.request_timeout_seconds = timeout_seconds  # type: ignore[attr-defined]
    return session

session.request_timeout_seconds is set on the Session but never read anywhere — _fetch_archive_bytes passes timeout=timeout_seconds directly on the session.get call (line 245). The attached comment (callers pass it per-request) confirms this is unused.

AGENTS.md: "Delete dead code: unused parameters, stale options, old experiments." (AGENTS.md)

Suggested change
    # mypy: session doesn't have timeout attribute; callers pass it per-request
    session.request_timeout_seconds = timeout_seconds  # type: ignore[attr-defined]
    return session
    return session

Comment on lines +132 to +135
# AIGER / BTOR hardware model-checking benchmarks (https://fmv.jku.at/hwmcc11/benchmarks.html)
# are intentionally omitted: the primary distribution is binary AIGER. Adding a Marin-internal
# text rendering would be a bespoke serializer nobody else uses. Per @dlwh 2026-04-22, skip.
SKIPPED_BINARY_SOURCES: tuple[str, ...] = ("hardware_rtl/aiger_hwmcc",)

SKIPPED_BINARY_SOURCES is defined but never referenced anywhere in the repo (grep -rn SKIPPED_BINARY_SOURCES returns only this definition). The rationale is already captured in the preceding comment block, so the constant itself is dead code.

AGENTS.md: "Delete dead code: unused parameters, stale options, old experiments." (AGENTS.md)

Suggested change
# AIGER / BTOR hardware model-checking benchmarks (https://fmv.jku.at/hwmcc11/benchmarks.html)
# are intentionally omitted: the primary distribution is binary AIGER. Adding a Marin-internal
# text rendering would be a bespoke serializer nobody else uses. Per @dlwh 2026-04-22, skip.
SKIPPED_BINARY_SOURCES: tuple[str, ...] = ("hardware_rtl/aiger_hwmcc",)
# AIGER / BTOR hardware model-checking benchmarks (https://fmv.jku.at/hwmcc11/benchmarks.html)
# are intentionally omitted: the primary distribution is binary AIGER. Adding a Marin-internal
# text rendering would be a bespoke serializer nobody else uses. Per @dlwh 2026-04-22, skip.


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e685e1cbd6


written = 0
with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
    for idx, (filename, text) in enumerate(records):
        if buf.tell() >= max_compressed_bytes:

P1: Enforce compressed-byte budget after each record write

The size guard in _write_budgeted_jsonl_gz is checked before appending the next JSONL row, so one large record can push the output far past max_compressed_bytes (for example, if the first matched source file is very large). That violates the stated per-source cap and can materially increase storage/compute cost for these eval slices. The write path should either preflight each row’s compressed contribution or rollback the last write when it crosses the budget.
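One way to enforce the cap exactly, sketched here rather than taken from the PR: compress each JSONL row as its own gzip member (gzip members concatenate transparently), so a row's compressed contribution is known before it is committed. The function name and record shape below are assumptions modeled on the quoted snippet:

```python
import gzip
import io
import json


def write_budgeted_jsonl_gz(records, max_compressed_bytes):
    """Sketch: enforce a compressed-byte budget by measuring each row's
    exact compressed size before writing it. Each row becomes its own
    gzip member; readers see one concatenated gzip stream."""
    buf = io.BytesIO()
    written = 0
    for idx, (filename, text) in enumerate(records):
        row = json.dumps({"id": idx, "filename": filename, "text": text}) + "\n"
        member = gzip.compress(row.encode("utf-8"), compresslevel=9)
        if buf.tell() + len(member) > max_compressed_bytes:
            break  # this row would exceed the budget; stop before writing it
        buf.write(member)
        written += 1
    return buf.getvalue(), written
```

The trade-off is a worse compression ratio than a single shared gzip stream, since each member carries its own header and dictionary state; the alternative the comment mentions (rollback of the last write) keeps the ratio but needs buffer truncation after the overshooting row.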


# Z3 ships a large corpus of SMT-LIB2 example/regression files we can mirror as a
# representative SMT-LIB sample. For full SMT-LIB benchmark logics, repoint to a
# specific Zenodo tar.zst (see https://smt-lib.org/benchmarks.shtml).
url="https://github.com/Z3Prover/z3/archive/refs/heads/master.zip",

P2: Pin archive URLs to immutable revisions

This source is pulled from a moving branch head (refs/heads/...), so dataset contents can change over time without any code/config change. Because step versioning only keys off the URL string (plus config/function hash), cached outputs may silently become stale relative to upstream, and fresh runs become non-reproducible across dates. Use immutable commit/tag archives (or include an explicit revision in versioned config) to keep cache invalidation and eval comparisons reliable.
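A hedged illustration of the fix: point the source at an immutable tag or commit archive instead of a branch head. The tag and SHA below are placeholders, not the revision this PR was built against:

```python
# Moving target: contents change whenever master advances, so a cached
# download and a fresh one can silently diverge.
MOVING_URL = "https://github.com/Z3Prover/z3/archive/refs/heads/master.zip"

# Immutable alternatives: a release tag or a full commit SHA archive.
# Both name a fixed tree, so the URL string alone versions the data.
PINNED_TAG_URL = "https://github.com/Z3Prover/z3/archive/refs/tags/z3-4.13.0.zip"
PINNED_SHA_URL = "https://github.com/Z3Prover/z3/archive/<commit-sha>.zip"
```

Because step versioning keys off the URL string, a pinned URL also makes cache invalidation explicit: changing the revision changes the URL, which changes the step hash.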

