|
3 | 3 |
|
4 | 4 | """TeraflopAI/SEC-EDGAR download + transform + normalize helpers. |
5 | 5 |
|
6 | | -8M filings (~43B tokens) from the SEC EDGAR database, organized into |
7 | | -per-filing-type subdirectories: 10-K, 10-Q, 8-K, 20-F, S-1, S-8, 144, |
8 | | -and Form 3/4/5. Text lives in the upstream ``content`` column. |
9 | | -
|
10 | | -A transform step sits between download and normalize because the upstream |
11 | | -parquet shards trip ``apache/arrow#46404`` — PyArrow's parquet reader |
| 6 | +~8M filings (~335B marin_tokenizer tokens) from the SEC EDGAR database, |
| 7 | +organized into per-filing-type subdirectories: 10-K, 10-Q, 8-K, 20-F, |
| 8 | +S-1, S-8, 144, and Form 3/4/5. Text lives in the upstream ``content`` |
| 9 | +column. |
| 10 | +
|
| 11 | +A transform step sits between download and normalize as a workaround |
| 12 | +for https://github.com/marin-community/marin/issues/5334 — the upstream |
| 13 | +parquet shards trip ``apache/arrow#46404`` (PyArrow's parquet reader |
12 | 14 | can't decode page headers >8 MiB, which the multi-MB filings in the |
13 | | -``content`` column overflow on per-page string statistics. The transform |
14 | | -reads each shard via DuckDB (no such cap) and rewrites it with |
15 | | -``write_statistics=False`` so the rewritten shards don't reproduce the |
16 | | -bug for downstream PyArrow readers (normalize, tokenize). Once |
17 | | -``apache/arrow#47758`` lands `max_page_header_size` in a released |
| 15 | +``content`` column overflow with their per-page string statistics). The |
| 16 | +transform reads each shard via DuckDB (no such cap) and rewrites it |
| 17 | +with ``write_statistics=False`` so the rewritten shards don't reproduce |
| 18 | +the bug for downstream PyArrow readers (normalize, tokenize). Once |
| 19 | +``apache/arrow#47758`` lands ``max_page_header_size`` in a released |
18 | 20 | PyArrow we can pin, this transform can be deleted. |
19 | 21 | """ |
20 | 22 |
|
|
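A minimal sketch of the transform step described in the new docstring, assuming a standalone helper (the name ``rewrite_shard`` is hypothetical, not the repository's actual function): DuckDB reads each shard without PyArrow's 8 MiB page-header cap, and the rewrite drops parquet statistics so stock PyArrow can read the output.

    import duckdb
    import pyarrow.parquet as pq

    def rewrite_shard(src_path: str, dst_path: str) -> None:
        # DuckDB's parquet reader has no 8 MiB page-header cap, so it can decode
        # shards whose per-page string statistics exceed PyArrow's limit.
        table = duckdb.sql(f"SELECT * FROM read_parquet('{src_path}')").arrow()
        # Rewriting without statistics keeps page headers small, so downstream
        # PyArrow readers (normalize, tokenize) no longer hit apache/arrow#46404.
        pq.write_table(table, dst_path, write_statistics=False)

Once a released PyArrow exposes ``max_page_header_size`` (apache/arrow#47758), raising that limit at read time should make this rewrite pass unnecessary, as the docstring notes.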