
[evals] Add formal-methods and RTL PPL slices #5128

Open
dlwh wants to merge 5 commits into main from claude/issue-5060-20260422-2238

Conversation

@dlwh (Member) commented Apr 23, 2026

Add opt-in formal-methods and hardware RTL PPL slices behind a shared archive downloader, keeping formal and RTL families separate in reporting and leaving default validation sets unchanged. Includes a focused pilot gap runner and bounded sampled outputs per source.

Part of #5005

Register raw-PPL eval slices for SMT-LIB, TPTP, CoqGym, DIMACS CNF,
VerilogEval, RTL-Repo, and RTL-Coder behind a shared HTTP-archive
downloader. Slice keys use a family/slice layout
(formal_methods/* vs hardware_rtl/*) so levanter's gap-report
aggregation keeps the two families separable, per bullet 3 of the
issue's definition of done.

Each source downloads its full upstream archive and truncates the
emitted JSONL.gz at 40 MB compressed; file bytes are preserved
verbatim so comments, long symbols, solver status markers, and
module boundaries survive for PPL eval.

AIGER/BTOR model-checking benchmarks are intentionally skipped
(binary, no canonical textual rendering). Adds a pilot gap-report
runner (marin-8b-base vs Llama-3.1-8B vs Qwen3-8B-Base) that does
not modify default_raw_validation_sets().
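The family/slice key layout described above can be sketched as follows. The slice names here are illustrative assumptions, not the exact registry keys; only the family/slice shape matters:

```python
# Illustrative sketch of the family/slice key layout. The slice names are
# assumptions; the point is the "family/slice" shape of each key.
SLICE_KEYS = [
    "formal_methods/smtlib",
    "formal_methods/tptp",
    "formal_methods/coqgym",
    "formal_methods/dimacs_cnf",
    "hardware_rtl/verilogeval",
    "hardware_rtl/rtl_repo",
    "hardware_rtl/rtl_coder",
]


def group_by_family(keys):
    """Group slice keys by their family prefix so gap-report aggregation
    can keep formal-methods and RTL results separate."""
    families = {}
    for key in keys:
        family, _, slice_name = key.partition("/")
        families.setdefault(family, []).append(slice_name)
    return families
```

Grouping on the prefix is what lets downstream aggregation report the two families separately without any per-slice special-casing.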

Co-authored-by: David Hall <dlwh@users.noreply.github.com>
@dlwh added the agent-generated (Created by automation/agent) label Apr 23, 2026
@claude (Bot, Contributor) commented Apr 23, 2026

Claude finished @dlwh's task in 3m 58s.


Code Review

Checked for bugs and AGENTS.md compliance. Posted 3 inline comments, all AGENTS.md violations — no functional bugs found.

  • Read PR skill definition
  • Review PR files
  • Gather repo context (datakit patterns, existing zstandard/atomic_rename usage)
  • Post inline comments

Summary of findings

  1. Local import zstandard in _iter_tar_zst_members (download/formal_methods_evals.py#L158). zstandard is a declared project dep and already imported at module scope elsewhere in marin; AGENTS.md requires top-of-file imports.
  2. Dead session.request_timeout_seconds assignment (download/formal_methods_evals.py#L234-L236). The attribute is set but never read — the actual timeout is passed per-request on session.get.
  3. Unused SKIPPED_BINARY_SOURCES constant (experiments/evals/exp5060_formal_methods_evals.py#L135). Never referenced anywhere; the rationale is already in the comment block above it.

No functional or logic bugs, security issues, or test correctness issues identified. Archive extraction (tar/zip) reads members into memory only — no path-traversal risk. Byte-budget truncation logic in _write_budgeted_jsonl_gz is correct, and id/written counters stay in sync on break. atomic_rename + open_url follows the existing pattern in uncheatable_eval.py.



def _iter_tar_zst_members(archive_bytes: bytes) -> Iterator[_SourceFile]:
    import zstandard  # local import: optional-feeling but already a project dep

AGENTS.md code-style rule: "All imports at the top of the file. No local imports except to break circular dependencies or guard optional deps." (AGENTS.md)

The inline comment already concedes the rule doesn't apply here — zstandard is a declared dep (e.g. lib/zephyr/pyproject.toml, lib/fray/pyproject.toml) and the same project already imports it at module scope (see datakit/download/nemotron_v1.py#L12). Move this import to the top of the file.

Suggested change
    import zstandard  # local import: optional-feeling but already a project dep
def _iter_tar_zst_members(archive_bytes: bytes) -> Iterator[_SourceFile]:
    dctx = zstandard.ZstdDecompressor()
    decompressed = dctx.decompress(archive_bytes, max_output_size=2**34)  # 16 GB ceiling
    yield from _iter_tar_members(decompressed, "tar")

(and add import zstandard to the top-of-file import block alongside zipfile / tarfile).

Comment on lines +234 to +236
    # mypy: session doesn't have timeout attribute; callers pass it per-request
    session.request_timeout_seconds = timeout_seconds  # type: ignore[attr-defined]
    return session

session.request_timeout_seconds is set on the Session but never read anywhere — _fetch_archive_bytes passes timeout=timeout_seconds directly on the session.get call (line 245). The attached comment (callers pass it per-request) confirms this is unused.

AGENTS.md: "Delete dead code: unused parameters, stale options, old experiments." (AGENTS.md)

Suggested change
    # mypy: session doesn't have timeout attribute; callers pass it per-request
    session.request_timeout_seconds = timeout_seconds  # type: ignore[attr-defined]
    return session
    return session

Comment on lines +132 to +135
# AIGER / BTOR hardware model-checking benchmarks (https://fmv.jku.at/hwmcc11/benchmarks.html)
# are intentionally omitted: the primary distribution is binary AIGER. Adding a Marin-internal
# text rendering would be a bespoke serializer nobody else uses. Per @dlwh 2026-04-22, skip.
SKIPPED_BINARY_SOURCES: tuple[str, ...] = ("hardware_rtl/aiger_hwmcc",)

SKIPPED_BINARY_SOURCES is defined but never referenced anywhere in the repo (grep -rn SKIPPED_BINARY_SOURCES returns only this definition). The rationale is already captured in the preceding comment block, so the constant itself is dead code.

AGENTS.md: "Delete dead code: unused parameters, stale options, old experiments." (AGENTS.md)

Suggested change
# AIGER / BTOR hardware model-checking benchmarks (https://fmv.jku.at/hwmcc11/benchmarks.html)
# are intentionally omitted: the primary distribution is binary AIGER. Adding a Marin-internal
# text rendering would be a bespoke serializer nobody else uses. Per @dlwh 2026-04-22, skip.
SKIPPED_BINARY_SOURCES: tuple[str, ...] = ("hardware_rtl/aiger_hwmcc",)
# AIGER / BTOR hardware model-checking benchmarks (https://fmv.jku.at/hwmcc11/benchmarks.html)
# are intentionally omitted: the primary distribution is binary AIGER. Adding a Marin-internal
# text rendering would be a bespoke serializer nobody else uses. Per @dlwh 2026-04-22, skip.


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e685e1cbd6


written = 0
with gzip.GzipFile(fileobj=buf, mode="wb", compresslevel=9) as gz:
    for idx, (filename, text) in enumerate(records):
        if buf.tell() >= max_compressed_bytes:

P1: Enforce compressed-byte budget after each record write

The size guard in _write_budgeted_jsonl_gz is checked before appending the next JSONL row, so one large record can push the output far past max_compressed_bytes (for example, if the first matched source file is very large). That violates the stated per-source cap and can materially increase storage/compute cost for these eval slices. The write path should either preflight each row’s compressed contribution or rollback the last write when it crosses the budget.
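One way to enforce the cap exactly, sketched here rather than taken from the PR: compress each JSONL row as its own gzip member (gzip members concatenate transparently), so a row's compressed contribution is known before it is committed. The function name and record shape below are assumptions modeled on the quoted snippet:

```python
import gzip
import io
import json


def write_budgeted_jsonl_gz(records, max_compressed_bytes):
    """Sketch: enforce a compressed-byte budget by measuring each row's
    exact compressed size before writing it. Each row becomes its own
    gzip member; readers see one concatenated gzip stream."""
    buf = io.BytesIO()
    written = 0
    for idx, (filename, text) in enumerate(records):
        row = json.dumps({"id": idx, "filename": filename, "text": text}) + "\n"
        member = gzip.compress(row.encode("utf-8"), compresslevel=9)
        if buf.tell() + len(member) > max_compressed_bytes:
            break  # this row would exceed the budget; stop before writing it
        buf.write(member)
        written += 1
    return buf.getvalue(), written
```

The trade-off is a worse compression ratio than a single shared gzip stream, since each member carries its own header and dictionary state; the alternative the comment mentions (rollback of the last write) keeps the ratio but needs buffer truncation after the overshooting row.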


# Z3 ships a large corpus of SMT-LIB2 example/regression files we can mirror as a
# representative SMT-LIB sample. For full SMT-LIB benchmark logics, repoint to a
# specific Zenodo tar.zst (see https://smt-lib.org/benchmarks.shtml).
url="https://github.com/Z3Prover/z3/archive/refs/heads/master.zip",

P2: Pin archive URLs to immutable revisions

This source is pulled from a moving branch head (refs/heads/...), so dataset contents can change over time without any code/config change. Because step versioning only keys off the URL string (plus config/function hash), cached outputs may silently become stale relative to upstream, and fresh runs become non-reproducible across dates. Use immutable commit/tag archives (or include an explicit revision in versioned config) to keep cache invalidation and eval comparisons reliable.
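A hedged illustration of the fix: point the source at an immutable tag or commit archive instead of a branch head. The tag and SHA below are placeholders, not the revision this PR was built against:

```python
# Moving target: contents change whenever master advances, so a cached
# download and a fresh one can silently diverge.
MOVING_URL = "https://github.com/Z3Prover/z3/archive/refs/heads/master.zip"

# Immutable alternatives: a release tag or a full commit SHA archive.
# Both name a fixed tree, so the URL string alone versions the data.
PINNED_TAG_URL = "https://github.com/Z3Prover/z3/archive/refs/tags/z3-4.13.0.zip"
PINNED_SHA_URL = "https://github.com/Z3Prover/z3/archive/<commit-sha>.zip"
```

Because step versioning keys off the URL string, a pinned URL also makes cache invalidation explicit: changing the revision changes the URL, which changes the step hash.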

