Add TeraflopAI/SEC-EDGAR to datakit sources #5305
Open
Helw150 wants to merge 11 commits into
Claude finished @Helw150's task in 1m 35s. Code review: no issues found; checked for bugs and CLAUDE.md/AGENTS.md compliance.
Force-pushed from 042c0c1 to b528dcc.
Registers SEC-EDGAR (43.7B tokens, 8M filings across 10 form types) as a single-source datakit entry pulling content from the dataset's content column.
Adds a transform step between download and normalize that reads each upstream parquet via DuckDB and rewrites it with write_statistics=False. Upstream shards trip apache/arrow#46404 (PyArrow can't decode page headers >8 MiB; SEC's content column overflows on per-page string stats). Disabling stats on the rewrite keeps normalize's PyArrow reader happy. Adds duckdb to marin deps (already transitive via iris).
DuckDB's register_filesystem isinstance-checks for fsspec.AbstractFileSystem and rejects rigging's GCS wrapper. The guard's region check has already fired at url_to_fs time; we hand DuckDB the underlying fs.
…stem Strict isinstance checks (e.g. duckdb.DuckDBPyConnection.register_filesystem) rejected the wrapper. Inherit from AbstractFileSystem with cachable=False so fsspec's instance cache stays out of the way, and switch __getattr__ → __getattribute__ so AbstractFileSystem methods inherited via the new base don't shadow the wrapped fs. The inner GCSFileSystem already implements the full AbstractFileSystem surface, so non-guarded calls delegate transparently. Reverts the temporary _fs unwrap hack from sec_edgar.py — it's no longer needed and was a budget-evasion regression for any DuckDB read.
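The pattern in this commit can be sketched as a minimal delegating wrapper (class, method, and attribute names are illustrative, not marin's actual implementation): inheriting from `AbstractFileSystem` satisfies strict isinstance checks, `cachable = False` keeps fsspec's instance cache from reusing wrapper instances, and `__getattribute__` (rather than `__getattr__`) forwards lookups so methods inherited from the new base can't shadow the wrapped fs.

```python
from fsspec import AbstractFileSystem


class GuardedFS(AbstractFileSystem):
    cachable = False  # bypass fsspec's _Cached instance cache
    _OWN = frozenset({"_inner", "check_region"})  # the wrapper's own surface

    def __init__(self, inner: AbstractFileSystem):
        # Set _inner before super().__init__ so delegation works during init.
        object.__setattr__(self, "_inner", inner)
        super().__init__()

    def check_region(self, path: str) -> None:
        # Placeholder for the wrapper's guard logic.
        pass

    def __getattribute__(self, name):
        # Dunders and the wrapper's own attributes resolve normally; every
        # other lookup delegates to the wrapped fs, so AbstractFileSystem
        # methods inherited via the new base never shadow it.
        if name.startswith("__") or name in GuardedFS._OWN:
            return object.__getattribute__(self, name)
        return getattr(object.__getattribute__(self, "_inner"), name)
```

Because the inner filesystem implements the full `AbstractFileSystem` surface, calls like `open` and `cat` delegate transparently while `isinstance` checks see a real `AbstractFileSystem`.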
Tokenize the normalized sec-edgar parquet shards with marin_tokenizer. Worker RAM bumped to 32 GiB over the 16 GiB default because the content column carries multi-MB SEC filings.
Matches the molmo2_cap / davinci_dev datakit convention where download_<name>_step() returns the processed StepSpec with the raw download as a transitive dep. Trims the hash_attrs to just the version key (the duckdb-readback transform mechanics live in the module docstring, not the hash attrs).
Replaces the upstream-README placeholder (43.73B, Comma v0.1) with the measured marin_tokenizer count from the tokenize run's stats.json. Also points the transform docstring at the marin tracking issue for the PyArrow page-header workaround.
…andard transform Per the discussion on PR #5335: scope the DuckDB workaround to a single reader function for this dataset rather than fighting zephyr's writer. The transform now matches the molmo2_cap / davinci_dev shape: Dataset.from_files(...).flat_map(read_parquet_via_duckdb).write_parquet(...). PyArrow's default ParquetWriter truncates page-header stats safely even on multi-MB SEC rows (verified locally on the same upstream shards that trip the read path), so we no longer need write_statistics=False on the rewrite — and dropping it lets us use zephyr's standard writer instead of a per-file atomic_rename map. Net -29 LOC; no behavior change for downstream consumers.
Drops the standalone transform stage entirely. The new download_sec_edgar() lists upstream parquets via HfFileSystem and maps each through a worker that streams it via DuckDB and writes a PyArrow-readable shard at raw/sec-edgar/<form-type>/<file>.parquet. Normalize consumes that directly — chain collapses to (download, normalize) with one on-disk copy instead of two. Matches the nsf_awards pattern: download function + zephyr Dataset.map + write_jsonl manifest + provenance.json at the end. Carries a 20-try exponential-backoff retry around each per-file DuckDB read to absorb HF rate-limits and transient xet-bridge errors that download_hf_step got for free. Saves ~220 GiB of redundant staging copy compared to the prior download_hf + transform chain. hash_attrs change forces fresh fetches (orphans the existing raw/sec-edgar_62f8bccd and downstream cached output paths), but that's a one-time cost.
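The per-file retry can be sketched as exponential backoff with full jitter (the 20-try count matches the text; the base delay and cap are illustrative, not the PR's exact values):

```python
import random
import time


def with_retries(fn, *args, tries=20, base=1.0, cap=60.0, sleep=time.sleep):
    """Call fn(*args), retrying up to `tries` times with jittered backoff."""
    for attempt in range(tries):
        try:
            return fn(*args)
        except Exception:
            if attempt == tries - 1:
                raise  # budget exhausted: surface the real error
            # Full jitter: sleep a random amount up to the capped backoff,
            # which spreads retries out under HF rate limiting.
            sleep(random.uniform(0.0, min(cap, base * 2**attempt)))
```

Wrapping each per-file DuckDB read in this absorbs 429s and transient xet-bridge errors without a separate download stage.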
Force-pushed from 7d71517 to dc010e2.
@ravwojdyla / @wmoss PTAL specifically at the duckdb stuff I'm doing to work around #5335
fsspec's _Cached metaclass reads ``async_impl`` and ``mirror_sync_methods`` on the new instance right after ``__init__``. Our __getattribute__ was delegating those to self._fs, which works for real GCSFileSystem (itself an AbstractFileSystem) but failed against duck-typed test fakes that don't carry the attrs. Add the AbstractFileSystem non-callable config surface (async_impl, mirror_sync_methods, blocksize, protocol, sep, root_marker, fsid, transaction) to _OWN_ATTRS so they resolve via the wrapper's own MRO.
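A repro-style sketch of the fix (all names are assumptions for illustration): the config attributes fsspec reads right after `__init__` are allowlisted so they resolve on the wrapper's own MRO instead of being delegated to an inner object that may not carry them, which is what broke against duck-typed test fakes.

```python
from fsspec import AbstractFileSystem

# Non-callable config surface fsspec touches on new instances, plus the
# wrapper's own state; these must NOT be delegated to the inner object.
_OWN_ATTRS = frozenset({
    "_inner", "async_impl", "mirror_sync_methods", "blocksize",
    "protocol", "sep", "root_marker", "fsid", "transaction",
})


class Wrapper(AbstractFileSystem):
    cachable = False  # keep fsspec's instance cache out of the way

    def __init__(self, inner):
        object.__setattr__(self, "_inner", inner)
        super().__init__()

    def __getattribute__(self, name):
        # Underscore-prefixed names and the config surface resolve via the
        # wrapper's own MRO; public filesystem calls delegate to the inner fs.
        if name.startswith("_") or name in _OWN_ATTRS:
            return object.__getattribute__(self, name)
        return getattr(object.__getattribute__(self, "_inner"), name)


class FakeFS:
    """Duck-typed test fake: no fsspec base class, no config attributes."""

    def cat(self, path):
        return b"fake"
```

With the allowlist in place, constructing `Wrapper(FakeFS())` survives fsspec's post-`__init__` reads of `async_impl` and `mirror_sync_methods`, while real calls still delegate.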
The custom sec-edgar download passes HfFileSystem() directly to duckdb.register_filesystem, never the wrapped CrossRegionGuardedFS, so we no longer need the wrapper to pass DuckDB's strict isinstance check. Drops the AbstractFileSystem inheritance, __getattribute__ delegation, and _OWN_ATTRS allowlist that landed for the now-removed transform path. Net diff against origin/main on this file is zero.
Registers TeraflopAI/SEC-EDGAR (43.7B tokens, ~8M filings across 10 form types) as a datakit source. Text is read from the upstream content column.
A DuckDB transform step sits between download and normalize because the upstream shards trip apache/arrow#46404: PyArrow's parquet reader can't decode page headers larger than 8 MiB, and the multi-MB filings in the content column blow past that limit via per-page string statistics. The transform reads each shard via DuckDB and rewrites it with write_statistics=False so the rewritten shards don't reproduce the bug for normalize/tokenize. Once apache/arrow#47758 lands max_page_header_size in a released PyArrow we can pin, the transform can be deleted.
Fixes #5334