Add TeraflopAI/SEC-EDGAR to datakit sources #5305
Open
Helw150 wants to merge 11 commits into
Claude finished @Helw150's task in 1m 35s. Code review: no issues found; checked for bugs and CLAUDE.md/AGENTS.md compliance.
Force-pushed from 042c0c1 to b528dcc.
Registers SEC-EDGAR (43.7B tokens, 8M filings across 10 form types) as a single-source datakit entry pulling content from the dataset's content column.
Adds a transform step between download and normalize that reads each upstream parquet via DuckDB and rewrites it with write_statistics=False. Upstream shards trip apache/arrow#46404 (PyArrow can't decode page headers >8 MiB; SEC's content column overflows on per-page string stats). Disabling stats on the rewrite keeps normalize's PyArrow reader happy. Adds duckdb to marin deps (already transitive via iris).
DuckDB's register_filesystem isinstance-checks for fsspec.AbstractFileSystem and rejects rigging's GCS wrapper. The guard's region check has already fired at url_to_fs time; we hand DuckDB the underlying fs.
…stem Strict isinstance checks (e.g. duckdb.DuckDBPyConnection.register_filesystem) rejected the wrapper. Inherit from AbstractFileSystem with cachable=False so fsspec's instance cache stays out of the way, and switch __getattr__ → __getattribute__ so AbstractFileSystem methods inherited via the new base don't shadow the wrapped fs. The inner GCSFileSystem already implements the full AbstractFileSystem surface, so non-guarded calls delegate transparently. Reverts the temporary _fs unwrap hack from sec_edgar.py — it's no longer needed and was a budget-evasion regression for any DuckDB read.
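The pattern in this commit can be sketched as a minimal delegating wrapper (class, method, and attribute names are illustrative, not marin's actual implementation): inheriting from `AbstractFileSystem` satisfies strict isinstance checks, `cachable = False` keeps fsspec's instance cache from reusing wrapper instances, and `__getattribute__` (rather than `__getattr__`) forwards lookups so methods inherited from the new base can't shadow the wrapped fs.

```python
from fsspec import AbstractFileSystem


class GuardedFS(AbstractFileSystem):
    cachable = False  # bypass fsspec's _Cached instance cache
    _OWN = frozenset({"_inner", "check_region"})  # the wrapper's own surface

    def __init__(self, inner: AbstractFileSystem):
        # Set _inner before super().__init__ so delegation works during init.
        object.__setattr__(self, "_inner", inner)
        super().__init__()

    def check_region(self, path: str) -> None:
        # Placeholder for the wrapper's guard logic.
        pass

    def __getattribute__(self, name):
        # Dunders and the wrapper's own attributes resolve normally; every
        # other lookup delegates to the wrapped fs, so AbstractFileSystem
        # methods inherited via the new base never shadow it.
        if name.startswith("__") or name in GuardedFS._OWN:
            return object.__getattribute__(self, name)
        return getattr(object.__getattribute__(self, "_inner"), name)
```

Because the inner filesystem implements the full `AbstractFileSystem` surface, calls like `open` and `cat` delegate transparently while `isinstance` checks see a real `AbstractFileSystem`.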
Tokenize the normalized sec-edgar parquet shards with marin_tokenizer. Worker RAM bumped to 32 GiB over the 16 GiB default because the content column carries multi-MB SEC filings.
Matches the molmo2_cap / davinci_dev datakit convention where download_<name>_step() returns the processed StepSpec with the raw download as a transitive dep. Trims the hash_attrs to just the version key (the duckdb-readback transform mechanics live in the module docstring, not the hash attrs).
Replaces the upstream-README placeholder (43.73B, Comma v0.1) with the measured marin_tokenizer count from the tokenize run's stats.json. Also points the transform docstring at the marin tracking issue for the PyArrow page-header workaround.
…andard transform Per the discussion on PR #5335: scope the DuckDB workaround to a single reader function for this dataset rather than fighting zephyr's writer. The transform now matches the molmo2_cap / davinci_dev shape: Dataset.from_files(...).flat_map(read_parquet_via_duckdb).write_parquet(...). PyArrow's default ParquetWriter truncates page-header stats safely even on multi-MB SEC rows (verified locally on the same upstream shards that trip the read path), so we no longer need write_statistics=False on the rewrite — and dropping it lets us use zephyr's standard writer instead of a per-file atomic_rename map. Net -29 LOC; no behavior change for downstream consumers.
Drops the standalone transform stage entirely. The new download_sec_edgar() lists upstream parquets via HfFileSystem and maps each through a worker that streams it via DuckDB and writes a PyArrow-readable shard at raw/sec-edgar/<form-type>/<file>.parquet. Normalize consumes that directly — chain collapses to (download, normalize) with one on-disk copy instead of two. Matches the nsf_awards pattern: download function + zephyr Dataset.map + write_jsonl manifest + provenance.json at the end. Carries a 20-try exponential-backoff retry around each per-file DuckDB read to absorb HF rate-limits and transient xet-bridge errors that download_hf_step got for free. Saves ~220 GiB of redundant staging copy compared to the prior download_hf + transform chain. hash_attrs change forces fresh fetches (orphans the existing raw/sec-edgar_62f8bccd and downstream cached output paths), but that's a one-time cost.
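The per-file retry can be sketched as exponential backoff with full jitter (the 20-try count matches the text; the base delay and cap are illustrative, not the PR's exact values):

```python
import random
import time


def with_retries(fn, *args, tries=20, base=1.0, cap=60.0, sleep=time.sleep):
    """Call fn(*args), retrying up to `tries` times with jittered backoff."""
    for attempt in range(tries):
        try:
            return fn(*args)
        except Exception:
            if attempt == tries - 1:
                raise  # budget exhausted: surface the real error
            # Full jitter: sleep a random amount up to the capped backoff,
            # which spreads retries out under HF rate limiting.
            sleep(random.uniform(0.0, min(cap, base * 2**attempt)))
```

Wrapping each per-file DuckDB read in this absorbs 429s and transient xet-bridge errors without a separate download stage.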
Force-pushed from 7d71517 to dc010e2.
@ravwojdyla / @wmoss PTAL specifically at the duckdb stuff I'm doing to work around #5335
fsspec's _Cached metaclass reads ``async_impl`` and ``mirror_sync_methods`` on the new instance right after ``__init__``. Our __getattribute__ was delegating those to self._fs, which works for real GCSFileSystem (itself an AbstractFileSystem) but failed against duck-typed test fakes that don't carry the attrs. Add the AbstractFileSystem non-callable config surface (async_impl, mirror_sync_methods, blocksize, protocol, sep, root_marker, fsid, transaction) to _OWN_ATTRS so they resolve via the wrapper's own MRO.
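A repro-style sketch of the fix (all names are assumptions for illustration): the config attributes fsspec reads right after `__init__` are allowlisted so they resolve on the wrapper's own MRO instead of being delegated to an inner object that may not carry them, which is what broke against duck-typed test fakes.

```python
from fsspec import AbstractFileSystem

# Non-callable config surface fsspec touches on new instances, plus the
# wrapper's own state; these must NOT be delegated to the inner object.
_OWN_ATTRS = frozenset({
    "_inner", "async_impl", "mirror_sync_methods", "blocksize",
    "protocol", "sep", "root_marker", "fsid", "transaction",
})


class Wrapper(AbstractFileSystem):
    cachable = False  # keep fsspec's instance cache out of the way

    def __init__(self, inner):
        object.__setattr__(self, "_inner", inner)
        super().__init__()

    def __getattribute__(self, name):
        # Underscore-prefixed names and the config surface resolve via the
        # wrapper's own MRO; public filesystem calls delegate to the inner fs.
        if name.startswith("_") or name in _OWN_ATTRS:
            return object.__getattribute__(self, name)
        return getattr(object.__getattribute__(self, "_inner"), name)


class FakeFS:
    """Duck-typed test fake: no fsspec base class, no config attributes."""

    def cat(self, path):
        return b"fake"
```

With the allowlist in place, constructing `Wrapper(FakeFS())` survives fsspec's post-`__init__` reads of `async_impl` and `mirror_sync_methods`, while real calls still delegate.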
The custom sec-edgar download passes HfFileSystem() directly to duckdb.register_filesystem, never the wrapped CrossRegionGuardedFS, so we no longer need the wrapper to pass DuckDB's strict isinstance check. Drops the AbstractFileSystem inheritance, __getattribute__ delegation, and _OWN_ATTRS allowlist that landed for the now-removed transform path. Net diff against origin/main on this file is zero.
Registers TeraflopAI/SEC-EDGAR (43.7B tokens, ~8M filings across 10 form types) as a datakit source. Text is read from the upstream content column.
A DuckDB transform step sits between download and normalize because the upstream shards trip apache/arrow#46404: PyArrow's parquet reader can't decode page headers larger than 8 MiB, and the multi-MB filings in the content column blow past that limit via per-page string statistics. The transform reads each shard via DuckDB and rewrites it with write_statistics=False so the rewritten shards don't reproduce the bug for normalize/tokenize. Once apache/arrow#47758 lands max_page_header_size in a released PyArrow we can pin, the transform can be deleted.
Fixes #5334