
Add TeraflopAI/SEC-EDGAR to datakit sources#5305

Open
Helw150 wants to merge 11 commits into main from held/add-sec-edgar-source

Conversation

Member

@Helw150 Helw150 commented Apr 30, 2026

Registers TeraflopAI/SEC-EDGAR (43.7B tokens, ~8M filings across 10 form types) as a datakit source. Text is read from the upstream content column.

A DuckDB transform step sits between download and normalize because the upstream shards trip apache/arrow#46404: PyArrow's parquet reader can't decode page headers larger than 8 MiB, and the per-page string statistics computed over the multi-MB filings in the content column push headers past that limit. The transform reads each shard via DuckDB and rewrites it with write_statistics=False, so the rewritten shards don't reproduce the bug for normalize/tokenize. Once apache/arrow#47758 lands max_page_header_size in a released PyArrow we can pin, the transform can be deleted.

Fixes #5334

@Helw150 Helw150 added the agent-generated Created by automation/agent label Apr 30, 2026
Contributor

claude Bot commented Apr 30, 2026

Claude finished @Helw150's task in 1m 35s.


Code review

No issues found. Checked for bugs and CLAUDE.md/AGENTS.md compliance.

Summary of review

  • New sec_edgar.py is a minimal, single-source datakit entry that mirrors the structure of peers like institutional_books/nsf_awards. It returns a (download, normalize) chain via download_hf_step + normalize_step.
  • All keyword arguments used (hf_urls_glob, text_field, file_extensions) are valid against the current download_hf_step and normalize_step signatures (lib/marin/src/marin/datakit/download/huggingface.py, lib/marin/src/marin/datakit/normalize.py:422).
  • text_field="content" is consistent with the module docstring's note about the upstream column convention.
  • sources.py adds the import and tuple entry in alphabetical position; the registry name "sec-edgar" and its 43.73B token count match the PR description (43.7B).
  • Default id_field="id" is fine — normalize synthesizes IDs when absent (same pattern relied on by institutional_books).

Todo

  • Read PR review skill instructions
  • Read changed files in PR
  • Inspect surrounding datakit context
  • Check normalize_step / download_hf_step signatures
  • Analyze for correctness and consistency issues
  • Post review feedback
    · branch: held/add-sec-edgar-source

@Helw150 Helw150 force-pushed the held/add-sec-edgar-source branch 3 times, most recently from 042c0c1 to b528dcc on May 14, 2026 23:30
Helw150 added 9 commits May 14, 2026 19:29
Registers SEC-EDGAR (43.7B tokens, 8M filings across 10 form types) as
a single-source datakit entry pulling content from the dataset's
content column.
Adds a transform step between download and normalize that reads each
upstream parquet via DuckDB and rewrites it with write_statistics=False.
Upstream shards trip apache/arrow#46404 (PyArrow can't decode page
headers >8 MiB; SEC's content column overflows on per-page string
stats). Disabling stats on the rewrite keeps normalize's PyArrow
reader happy. Adds duckdb to marin deps (already transitive via iris).
DuckDB's register_filesystem isinstance-checks for fsspec.AbstractFileSystem
and rejects rigging's GCS wrapper. The guard's region check has already
fired at url_to_fs time; we hand DuckDB the underlying fs.
…stem

Strict isinstance checks (e.g. duckdb.DuckDBPyConnection.register_filesystem)
rejected the wrapper. Inherit from AbstractFileSystem with cachable=False so
fsspec's instance cache stays out of the way, and switch __getattr__ →
__getattribute__ so AbstractFileSystem methods inherited via the new base
don't shadow the wrapped fs. The inner GCSFileSystem already implements the
full AbstractFileSystem surface, so non-guarded calls delegate transparently.

Reverts the temporary _fs unwrap hack from sec_edgar.py — it's no longer
needed and was a budget-evasion regression for any DuckDB read.
Tokenize the normalized sec-edgar parquet shards with marin_tokenizer.
Worker RAM bumped to 32 GiB over the 16 GiB default because the content
column carries multi-MB SEC filings.
Matches the molmo2_cap / davinci_dev datakit convention where
download_<name>_step() returns the processed StepSpec with the raw
download as a transitive dep. Trims the hash_attrs to just the version
key (the duckdb-readback transform mechanics live in the module
docstring, not the hash attrs).
Replaces the upstream-README placeholder (43.73B, Comma v0.1) with the
measured marin_tokenizer count from the tokenize run's stats.json. Also
points the transform docstring at the marin tracking issue for the
PyArrow page-header workaround.
…andard transform

Per the discussion on PR #5335: scope the DuckDB workaround to a single
reader function for this dataset rather than fighting zephyr's writer.
The transform now matches the molmo2_cap / davinci_dev shape:
Dataset.from_files(...).flat_map(read_parquet_via_duckdb).write_parquet(...).

PyArrow's default ParquetWriter truncates page-header stats safely
even on multi-MB SEC rows (verified locally on the same upstream
shards that trip the read path), so we no longer need
write_statistics=False on the rewrite — and dropping it lets us use
zephyr's standard writer instead of a per-file atomic_rename map.

Net -29 LOC; no behavior change for downstream consumers.
Drops the standalone transform stage entirely. The new
download_sec_edgar() lists upstream parquets via HfFileSystem and maps
each through a worker that streams it via DuckDB and writes a
PyArrow-readable shard at raw/sec-edgar/<form-type>/<file>.parquet.
Normalize consumes that directly — chain collapses to
(download, normalize) with one on-disk copy instead of two.

Matches the nsf_awards pattern: download function + zephyr Dataset.map
+ write_jsonl manifest + provenance.json at the end. Carries a 20-try
exponential-backoff retry around each per-file DuckDB read to absorb
HF rate-limits and transient xet-bridge errors that download_hf_step
got for free.

Saves ~220 GiB of redundant staging copy compared to the prior
download_hf + transform chain. hash_attrs change forces fresh fetches
(orphans the existing raw/sec-edgar_62f8bccd and downstream cached
output paths), but that's a one-time cost.
@Helw150 Helw150 force-pushed the held/add-sec-edgar-source branch from 7d71517 to dc010e2 on May 15, 2026 02:30
@Helw150 Helw150 requested review from ravwojdyla and wmoss May 15, 2026 02:33
@Helw150
Copy link
Copy Markdown
Member Author

Helw150 commented May 15, 2026

@ravwojdyla / @wmoss PTAL specifically at the duckdb stuff I'm doing to work around #5335

Helw150 added 2 commits May 14, 2026 19:36
fsspec's _Cached metaclass reads ``async_impl`` and ``mirror_sync_methods``
on the new instance right after ``__init__``. Our __getattribute__ was
delegating those to self._fs, which works for real GCSFileSystem
(itself an AbstractFileSystem) but failed against duck-typed test fakes
that don't carry the attrs.

Add the AbstractFileSystem non-callable config surface (async_impl,
mirror_sync_methods, blocksize, protocol, sep, root_marker, fsid,
transaction) to _OWN_ATTRS so they resolve via the wrapper's own MRO.
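A sketch of the wrapper shape these two commits describe (class name and exact delegation rule are illustrative, not the actual marin code):

```python
import fsspec


class GuardedFSWrapper(fsspec.AbstractFileSystem):
    cachable = False  # keep fsspec's _Cached instance cache out of the way

    # Non-callable config surface that must resolve on the wrapper's own
    # MRO rather than delegate -- including the attrs fsspec's metaclass
    # reads on the new instance right after __init__.
    _OWN_ATTRS = frozenset({
        "cachable", "async_impl", "mirror_sync_methods", "blocksize",
        "protocol", "sep", "root_marker", "fsid", "transaction",
        "storage_args", "storage_options",
    })

    def __init__(self, fs):
        object.__setattr__(self, "_fs", fs)
        super().__init__()

    def __getattribute__(self, name):
        # Internal state and the allowlisted config surface resolve
        # normally; everything else delegates to the wrapped filesystem,
        # so inherited AbstractFileSystem methods don't shadow it.
        if name.startswith("_") or name in GuardedFSWrapper._OWN_ATTRS:
            return object.__getattribute__(self, name)
        return getattr(object.__getattribute__(self, "_fs"), name)
```

With this shape the wrapper passes strict isinstance checks while still forwarding public calls, and it constructs cleanly even around duck-typed fakes that lack `async_impl`/`mirror_sync_methods`.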
The custom sec-edgar download passes HfFileSystem() directly to
duckdb.register_filesystem, never the wrapped CrossRegionGuardedFS,
so we no longer need the wrapper to pass DuckDB's strict isinstance
check. Drops the AbstractFileSystem inheritance, __getattribute__
delegation, and _OWN_ATTRS allowlist that landed for the now-removed
transform path. Net diff against origin/main on this file is zero.


Development

Successfully merging this pull request may close these issues.

[zephyr] PyArrow parquet reader can't decode page headers >8 MiB (apache/arrow#46404)
