Conversation
…5057) Adds deterministic text renderers and a Zeek→Dolma transform so security-artifact diagnostic PPL slices can be built on top of the existing `raw_text_dataset` pattern from `marin.evaluation.perplexity_gap`. This is Phase 1 of #5057:

* `marin.transform.security_artifacts.renderers` — pure functions:
  * `render_hex_dump` (xxd-style, offset + hex + printable-ASCII gutter; configurable line width, ASCII toggle, uppercase, `offset_start`).
  * `render_zeek_tsv_value` / `render_zeek_tsv_record` / `render_zeek_tsv_log`, which emit canonical Zeek log format (headers, `#fields`, `#types`, body rows, `#close`) with proper handling of unset fields, empty strings, booleans, and set-typed values.
* `marin.transform.security_artifacts.zeek_to_dolma` — a transform that reads per-row Zeek records (Parquet or JSONL) from an input directory, groups them into blocks, renders each block with `render_zeek_tsv_log`, and writes Dolma JSONL with `{id, text, source, render: "zeek-tsv"}`. Size guardrails (`records_per_block`, `max_blocks_per_file`) keep each slice small, per the issue's "region-local, cost-reviewed" requirement.
* `experiments.evals.security_artifact_slices` — wires HF download → Zeek transform → `raw_text_dataset(...)` into a two-line helper (`zeek_conn_raw_text_slice`).
* Tests (27 passing) cover deterministic rendering, missing-field handling, block grouping, size caps, and gzipped JSONL ingestion.

Phase 2 (tshark/objdump via an Iris Docker image) and Phase 3 (bounded mirrors + gap-report regex bucketing) are separate follow-ups.
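The hex-dump renderer described above can be sketched as follows. This is a minimal, hypothetical reimplementation of the listed knobs (line width, ASCII toggle, uppercase, `offset_start`); the actual `render_hex_dump` in `marin.transform.security_artifacts.renderers` may differ in signature and column layout:

```python
def render_hex_dump(
    data: bytes,
    *,
    width: int = 16,
    ascii_gutter: bool = True,
    uppercase: bool = False,
    offset_start: int = 0,
) -> str:
    """Deterministic xxd-style dump: offset, hex bytes, printable-ASCII gutter."""
    fmt = "{:02X}" if uppercase else "{:02x}"
    lines = []
    for i in range(0, len(data), width):
        chunk = data[i : i + width]
        hex_part = " ".join(fmt.format(b) for b in chunk)
        # Pad the hex column so the ASCII gutter stays aligned on short lines.
        line = f"{offset_start + i:08x}  {hex_part:<{width * 3 - 1}}"
        if ascii_gutter:
            gutter = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
            line += f"  {gutter}"
        lines.append(line)
    return "\n".join(lines)
```

Because the function is pure (no timestamps, no randomness), identical inputs always produce identical slices, which is what makes the PPL slices reproducible.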
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 238196701f
```python
return "*.parquet"
return "*.jsonl*"
```
Use recursive default glob for Zeek input files
The fallback in _default_input_glob only matches top-level files (*.parquet / *.jsonl*), but this transform is wired from default_download, which preserves Hugging Face repo subdirectories (commonly data/*.parquet or split folders). In that common layout, convert_zeek_to_dolma resolves no inputs and fails with FileNotFoundError unless every caller manually overrides input_glob, so the default path for zeek_conn_raw_text_slice is brittle and breaks valid datasets.

Add a bounded UWF Zeek CSV downloader and raw-text eval slice so issue #5057 has one runnable binary/network-security PPL dataset instead of only scaffolding. Also point the long-tail registry entry at the listable source and cover the downloader and dataset builder with tests.
Part of #5057
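The "bounded" part of the downloader can be sketched with a copy helper that enforces a hard byte cap. This is a hypothetical illustration — the helper name and cap handling are not from the PR; the real downloader would wrap the HTTP response stream in the same loop:

```python
import io


def copy_bounded(src, dst, max_bytes: int) -> int:
    """Copy at most max_bytes from src to dst, raising if the source is
    larger, so an unexpectedly big mirror fails loudly instead of silently
    producing an oversized eval slice."""
    written = 0
    while True:
        # Read one byte past the remaining budget so overflow is detectable.
        chunk = src.read(min(65536, max_bytes - written + 1))
        if not chunk:
            return written
        if written + len(chunk) > max_bytes:
            raise ValueError(f"source exceeds {max_bytes}-byte cap")
        dst.write(chunk)
        written += len(chunk)
```

Failing loudly on overflow (rather than truncating) keeps the slice contents deterministic and makes an upstream dataset change visible in CI.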