Conversation
…5057) Adds deterministic text renderers and a Zeek→Dolma transform so security-artifact diagnostic PPL slices can be built on top of the existing `raw_text_dataset` pattern from `marin.evaluation.perplexity_gap`. This is Phase 1 of #5057:

* `marin.transform.security_artifacts.renderers` — pure functions:
  * `render_hex_dump` (xxd-style, offset + hex + printable-ASCII gutter; configurable line width, ASCII toggle, uppercase, `offset_start`).
  * `render_zeek_tsv_value` / `render_zeek_tsv_record` / `render_zeek_tsv_log`, which emit canonical Zeek log format (headers, `#fields`, `#types`, body rows, `#close`) with proper handling of unset fields, empty strings, booleans, and set-typed values.
* `marin.transform.security_artifacts.zeek_to_dolma` — a transform that reads per-row Zeek records (Parquet or JSONL) from an input directory, groups them into blocks, renders each block with `render_zeek_tsv_log`, and writes Dolma JSONL with `{id, text, source, render: "zeek-tsv"}`. Size guardrails (`records_per_block`, `max_blocks_per_file`) keep each slice small, per the issue's "region-local, cost-reviewed" requirement.
* `experiments.evals.security_artifact_slices` — wires HF download → Zeek transform → `raw_text_dataset(...)` into a two-line helper (`zeek_conn_raw_text_slice`).
* Tests (27 passing) cover deterministic rendering, missing-field handling, block grouping, size caps, and gzipped JSONL ingestion.

Phase 2 (tshark/objdump via an Iris Docker image) and Phase 3 (bounded mirrors + gap-report regex bucketing) are separate follow-ups.
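The hex-dump renderer described above can be sketched as follows. This is a minimal, hypothetical reimplementation of the listed knobs (line width, ASCII toggle, uppercase, `offset_start`); the actual `render_hex_dump` in `marin.transform.security_artifacts.renderers` may differ in signature and column layout:

```python
def render_hex_dump(
    data: bytes,
    *,
    width: int = 16,
    ascii_gutter: bool = True,
    uppercase: bool = False,
    offset_start: int = 0,
) -> str:
    """Deterministic xxd-style dump: offset, hex bytes, printable-ASCII gutter."""
    fmt = "{:02X}" if uppercase else "{:02x}"
    lines = []
    for i in range(0, len(data), width):
        chunk = data[i : i + width]
        hex_part = " ".join(fmt.format(b) for b in chunk)
        # Pad the hex column so the ASCII gutter stays aligned on short lines.
        line = f"{offset_start + i:08x}  {hex_part:<{width * 3 - 1}}"
        if ascii_gutter:
            gutter = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
            line += f"  {gutter}"
        lines.append(line)
    return "\n".join(lines)
```

Because the function is pure (no timestamps, no randomness), identical inputs always produce identical slices, which is what makes the PPL slices reproducible.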
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 238196701f
```python
return "*.parquet"
return "*.jsonl*"
```
Use recursive default glob for Zeek input files
The fallback in _default_input_glob only matches top-level files (*.parquet / *.jsonl*), but this transform is wired from default_download, which preserves Hugging Face repo subdirectories (commonly data/*.parquet or split folders). In that common layout, convert_zeek_to_dolma resolves no inputs and fails with FileNotFoundError unless every caller manually overrides input_glob, so the default path for zeek_conn_raw_text_slice is brittle and breaks valid datasets.

Add a bounded UWF Zeek CSV downloader and raw-text eval slice so issue #5057 has one runnable binary/network-security PPL dataset instead of only scaffolding. Also point the long-tail registry entry at the listable source and cover the downloader and dataset builder with tests.
Part of #5057
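The "bounded" part of the downloader can be sketched with a copy helper that enforces a hard byte cap. This is a hypothetical illustration — the helper name and cap handling are not from the PR; the real downloader would wrap the HTTP response stream in the same loop:

```python
import io


def copy_bounded(src, dst, max_bytes: int) -> int:
    """Copy at most max_bytes from src to dst, raising if the source is
    larger, so an unexpectedly big mirror fails loudly instead of silently
    producing an oversized eval slice."""
    written = 0
    while True:
        # Read one byte past the remaining budget so overflow is detectable.
        chunk = src.read(min(65536, max_bytes - written + 1))
        if not chunk:
            return written
        if written + len(chunk) > max_bytes:
            raise ValueError(f"source exceeds {max_bytes}-byte cap")
        dst.write(chunk)
        written += len(chunk)
```

Failing loudly on overflow (rather than truncating) keeps the slice contents deterministic and makes an upstream dataset change visible in CI.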