|
3 | 3 |
|
4 | 4 | """TeraflopAI/SEC-EDGAR download + transform + normalize helpers. |
5 | 5 |
|
6 | | -8M filings (~43B tokens) from the SEC EDGAR database, organized into |
7 | | -per-filing-type subdirectories: 10-K, 10-Q, 8-K, 20-F, S-1, S-8, 144, |
8 | | -and Form 3/4/5. Text lives in the upstream ``content`` column. |
9 | | -
|
10 | | -A transform step sits between download and normalize because the upstream |
11 | | -parquet shards trip ``apache/arrow#46404`` — PyArrow's parquet reader |
| 6 | +~8M filings (~335B marin_tokenizer tokens) from the SEC EDGAR database, |
| 7 | +organized into per-filing-type subdirectories: 10-K, 10-Q, 8-K, 20-F, |
| 8 | +S-1, S-8, 144, and Form 3/4/5. Text lives in the upstream ``content`` |
| 9 | +column. |
| 10 | +
|
| 11 | +A transform step sits between download and normalize as a workaround |
| 12 | +for https://github.com/marin-community/marin/issues/5334 — the upstream |
| 13 | +parquet shards trip ``apache/arrow#46404`` (PyArrow's parquet reader |
12 | 14 | can't decode page headers >8 MiB, which the multi-MB filings in the |
13 | | -``content`` column overflow on per-page string statistics. The transform |
14 | | -reads each shard via DuckDB (no such cap) and rewrites it with |
15 | | -``write_statistics=False`` so the rewritten shards don't reproduce the |
16 | | -bug for downstream PyArrow readers (normalize, tokenize). Once |
17 | | -``apache/arrow#47758`` lands `max_page_header_size` in a released |
| 15 | +``content`` column overflow with their per-page string statistics). The |
| 16 | +transform reads each shard via DuckDB (no such cap) and rewrites it |
| 17 | +with ``write_statistics=False`` so the rewritten shards don't reproduce |
| 18 | +the bug for downstream PyArrow readers (normalize, tokenize). Once |
| 19 | +``apache/arrow#47758`` lands ``max_page_header_size`` in a released |
18 | 20 | PyArrow we can pin, this transform can be deleted. |
19 | 21 | """ |
20 | 22 |
|
|
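A minimal sketch of the transform step described in the new docstring, assuming a standalone helper (the name ``rewrite_shard`` is hypothetical, not the repository's actual function): DuckDB reads each shard without PyArrow's 8 MiB page-header cap, and the rewrite drops parquet statistics so stock PyArrow can read the output.

    import duckdb
    import pyarrow.parquet as pq

    def rewrite_shard(src_path: str, dst_path: str) -> None:
        # DuckDB's parquet reader has no 8 MiB page-header cap, so it can decode
        # shards whose per-page string statistics exceed PyArrow's limit.
        table = duckdb.sql(f"SELECT * FROM read_parquet('{src_path}')").arrow()
        # Rewriting without statistics keeps page headers small, so downstream
        # PyArrow readers (normalize, tokenize) no longer hit apache/arrow#46404.
        pq.write_table(table, dst_path, write_statistics=False)

Once a released PyArrow exposes ``max_page_header_size`` (apache/arrow#47758), raising that limit at read time should make this rewrite pass unnecessary, as the docstring notes.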