
Commit 042c0c1

sec-edgar: update token count to measured 334.9B; link issue #5334
Replaces the upstream-README placeholder (43.73B, Comma v0.1) with the measured marin_tokenizer count from the tokenize run's stats.json. Also points the transform docstring at the marin tracking issue for the PyArrow page-header workaround.
1 parent 29e5883 commit 042c0c1

2 files changed: 14 additions & 12 deletions


lib/marin/src/marin/datakit/download/sec_edgar.py

Lines changed: 13 additions & 11 deletions
@@ -3,18 +3,20 @@
 
 """TeraflopAI/SEC-EDGAR download + transform + normalize helpers.
 
-8M filings (~43B tokens) from the SEC EDGAR database, organized into
-per-filing-type subdirectories: 10-K, 10-Q, 8-K, 20-F, S-1, S-8, 144,
-and Form 3/4/5. Text lives in the upstream ``content`` column.
-
-A transform step sits between download and normalize because the upstream
-parquet shards trip ``apache/arrow#46404`` — PyArrow's parquet reader
+~8M filings (~335B marin_tokenizer tokens) from the SEC EDGAR database,
+organized into per-filing-type subdirectories: 10-K, 10-Q, 8-K, 20-F,
+S-1, S-8, 144, and Form 3/4/5. Text lives in the upstream ``content``
+column.
+
+A transform step sits between download and normalize as a workaround
+for https://github.com/marin-community/marin/issues/5334 — the upstream
+parquet shards trip ``apache/arrow#46404`` (PyArrow's parquet reader
 can't decode page headers >8 MiB, which the multi-MB filings in the
-``content`` column overflow on per-page string statistics. The transform
-reads each shard via DuckDB (no such cap) and rewrites it with
-``write_statistics=False`` so the rewritten shards don't reproduce the
-bug for downstream PyArrow readers (normalize, tokenize). Once
-``apache/arrow#47758`` lands `max_page_header_size` in a released
+``content`` column overflow on per-page string statistics). The
+transform reads each shard via DuckDB (no such cap) and rewrites it
+with ``write_statistics=False`` so the rewritten shards don't reproduce
+the bug for downstream PyArrow readers (normalize, tokenize). Once
+``apache/arrow#47758`` lands ``max_page_header_size`` in a released
 PyArrow we can pin, this transform can be deleted.
 """
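The docstring above describes the workaround only in prose. For illustration, a minimal version of that read-via-DuckDB, rewrite-via-PyArrow step might look like the sketch below; the function name and paths are hypothetical, and the actual implementation in sec_edgar.py is not part of this diff.

    import duckdb
    import pyarrow.parquet as pq


    def rewrite_shard(src: str, dst: str) -> None:
        # DuckDB's parquet reader has no 8 MiB page-header cap, so it can
        # decode shards that trip apache/arrow#46404 in PyArrow.
        table = duckdb.sql(f"SELECT * FROM read_parquet('{src}')").arrow()
        # write_statistics=False omits per-page string statistics from the
        # rewritten shard, keeping its page headers small enough for stock
        # PyArrow readers downstream (normalize, tokenize).
        pq.write_table(table, dst, write_statistics=False)


    rewrite_shard("10-K/shard-00000.parquet", "transformed/10-K/shard-00000.parquet")

Reading through DuckDB and writing back through PyArrow leaves the schema and data untouched; only the page-level metadata that triggers the bug changes.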

lib/marin/src/marin/datakit/sources.py

Lines changed: 1 addition & 1 deletion
@@ -156,7 +156,7 @@ def all_sources() -> dict[str, DatakitSource]:
         ("molmo2-cap", molmo2_cap_normalize_steps, 0.36),
         ("nemotron-terminal", nemotron_terminal_normalize_steps, 6.08),
         ("nsf_awards", nsf_awards_normalize_steps, 0.17),
-        ("sec-edgar", sec_edgar_normalize_steps, 43.73),
+        ("sec-edgar", sec_edgar_normalize_steps, 334.90),
         ("superior-reasoning", superior_reasoning_normalize_steps, 7.08),
         ("svg", svgfind_creativecommons_normalize_steps, 8.95),
         ("swe-rebench-openhands", swe_rebench_openhands_normalize_steps, 2.47),
