Commit e99b40b

ravwojdyla and claude committed
Update datakit design doc: use Parquet instead of Vortex
Switches the standard format from Vortex to Parquet throughout the design doc. Notes vortex#6905 as the blocking issue that motivated the change. Parquet provides the same columnar benefits with a mature ecosystem.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 04f7894 commit e99b40b

1 file changed

docs/design/2355_datakit.md (12 additions & 11 deletions)
```diff
@@ -23,7 +23,7 @@ Download raw dataset from Hugging Face (or other sources). Raw downloads are pre
 
 Convert raw data into the **datakit standard format**:
 
-* **File format**: Vortex \- columnar, supports pushdown filters and column projection, efficient lookup.
+* **File format**: Parquet \- columnar, widely supported, supports pushdown filters and column projection.
 * **Mandatory columns**:
   * `id` \- unique document identifier (see [ID Column](#id-column) below)
   * `text` \- primary text content \- we enforce UTF-8
```
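For reference, a minimal sketch of the mandatory columns above as a PyArrow schema. This is illustrative only; the constant name is invented and neither the commit nor the design doc prescribes this code.

```python
# Sketch of the datakit mandatory columns as a PyArrow schema.
# Real datasets may carry additional, dataset-specific columns.
import pyarrow as pa

DATAKIT_MANDATORY_SCHEMA = pa.schema(
    [
        pa.field("id", pa.string()),    # unique document identifier
        pa.field("text", pa.string()),  # primary text content; Arrow strings are UTF-8
    ]
)
```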
```diff
@@ -35,7 +35,7 @@ Convert raw data into the **datakit standard format**:
 * **Sort invariant**: each partition is sorted by `id`
 * **Typed output:** in the code the data has typed representation via `Artifact`
 
-This is the "intake" step \- all downstream stages operate on normalized Vortex datasets.
+This is the "intake" step \- all downstream stages operate on normalized Parquet datasets.
 
 ## 3\. Embed
 
```
```diff
@@ -56,7 +56,7 @@ Join attributes datasets back to the source documents and apply filters:
 * Filter by classifier thresholds (e.g., quality score \> 0.8)
 * Remove duplicate spans/documents
 
-Output is a clean, filtered Vortex dataset \- still sorted by `id`, still co-partitioned.
+Output is a clean, filtered Parquet dataset \- still sorted by `id`, still co-partitioned.
 
 ## 8\. Tokenize
 
```
```diff
@@ -66,15 +66,16 @@ Convert clean text into tokenized Levanter cache format.
 
 # Core Design Decisions
 
-## Vortex as the Standard Format
+## Parquet as the Standard Format
 
-All intermediate datasets (from normalization through consolidation) use the Vortex columnar format. Benefits:
+All intermediate datasets (from normalization through consolidation) use the Parquet columnar format. Benefits:
 
 * Column projection (only read the columns you need)
 * Filter pushdown
 * Efficient sorted merge joins via Zephyr
+* Mature ecosystem with broad tooling support
 
-NOTE: Vortex is much less mature than Parquet. This is a major concern. We will start with Vortex and if we hit roadblocks, revert to Parquet.
+NOTE: We initially considered Vortex for its pushdown and lookup capabilities, but encountered blocking issues with Zephyr pipeline integration (see [vortex\#6905](https://github.com/vortex-data/vortex/issues/6905)). Parquet provides the same columnar benefits with a proven ecosystem. If Vortex matures, we can revisit.
 
 ## ID Column {#id-column}
 
```
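To illustrate the column projection and filter pushdown benefits listed in this hunk, here is a hedged sketch using PyArrow's dataset API. The path is made up, the 0.8 threshold mirrors the doc's quality-score example, and nothing in the design requires PyArrow specifically.

```python
# Sketch: reading a Parquet dataset with column projection and filter pushdown
# via PyArrow. Only the requested columns are read, and the predicate is
# checked against row-group statistics before rows are materialized.
import pyarrow.dataset as ds

quality = ds.dataset("data/attributes/quality/", format="parquet")  # illustrative path

table = quality.to_table(
    columns=["id", "quality_score"],          # column projection
    filter=ds.field("quality_score") > 0.8,   # pushdown filter
)
```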

```diff
@@ -96,14 +97,14 @@ This is enforced by convention: each processing stage reads source partitions 1:
 
 ## Attributes Datasets {#attributes-datasets}
 
-Processing stages (embed, classify, dedup) produce **attributes datasets** \- lightweight Vortex files containing:
+Processing stages (embed, classify, dedup) produce **attributes datasets** \- lightweight Parquet files containing:
 
 * `id` — matching the source document ID
 * Stage-specific output columns (e.g., `quality_score`, `is_duplicate`, `topic_label`)
 
 Attributes datasets:
 
-* Use Vortex format
+* Use Parquet format
 * Are co-partitioned with the source (same shard count and key ranges)
 * Are sorted by `id` within each partition
 * Can be joined back to source documents via `sorted_merge_join`
```
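As a toy illustration of why the co-partitioning and sort invariants above matter (this is not Zephyr's actual `sorted_merge_join`; the function name and row shapes here are assumptions), a sorted merge join reduces to one linear two-pointer pass per partition pair:

```python
# Toy sketch: joining a source partition and its attributes partition.
# Because both are co-partitioned and sorted by `id`, a single linear pass
# suffices; no shuffle or hash table is needed.
from typing import Iterable, Iterator


def toy_sorted_merge_join(docs: Iterable[dict], attrs: Iterable[dict]) -> Iterator[dict]:
    docs_it, attrs_it = iter(docs), iter(attrs)
    doc, attr = next(docs_it, None), next(attrs_it, None)
    while doc is not None and attr is not None:
        if doc["id"] == attr["id"]:
            yield {**doc, **attr}                # merge matching rows
            doc, attr = next(docs_it, None), next(attrs_it, None)
        elif doc["id"] < attr["id"]:
            doc = next(docs_it, None)            # document with no attributes row
        else:
            attr = next(attrs_it, None)          # attributes row with no document
```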
```diff
@@ -133,7 +134,7 @@ download = StepSpec(
 normalize = StepSpec(
     name="fineweb/normalize",
     deps=[download],
-    fn=lambda output_path: normalize_to_vortex(
+    fn=lambda output_path: normalize_to_parquet(
         input_path=download.output_path, output_path=output_path, text_field="text",
     ),
     hash_attrs={"text_field": "text"},
```
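For orientation, one plausible shape of `normalize_to_parquet` under the signature used in the hunk above. The JSONL input format and SHA-256 ID scheme are assumptions for the sketch, not part of the committed design.

```python
# Sketch: normalize a raw JSONL file into the datakit standard Parquet format,
# enforcing the mandatory columns and the sort-by-`id` invariant.
import hashlib
import json

import pyarrow as pa
import pyarrow.parquet as pq


def normalize_to_parquet(input_path: str, output_path: str, text_field: str) -> None:
    ids, texts = [], []
    with open(input_path, encoding="utf-8") as f:  # assume raw JSONL input
        for line in f:
            record = json.loads(line)
            text = record[text_field]
            ids.append(hashlib.sha256(text.encode("utf-8")).hexdigest())
            texts.append(text)

    table = pa.table({"id": ids, "text": texts})
    table = table.sort_by("id")  # sort invariant: each partition sorted by `id`
    pq.write_table(table, output_path)
```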
````diff
@@ -188,7 +189,7 @@ Core primitives — the reusable building blocks:
 
 ```
 lib/marin/datakit/
-  normalize # Raw format -> standard Vortex (id, text, ...)
+  normalize # Raw format -> standard Parquet (id, text, ...)
   embed # Document embedding
   classify # Quality/topic classification
   dedup # Deduplication (exact + fuzzy)
````
```diff
@@ -201,7 +202,7 @@ Dataset-specific wiring \- which transforms to apply for a given dataset, expres
 
 # Execution Plan
 
-* Implement `datakit/normalize.py` \- standard schema definitions, ID generation, raw format to Vortex conversion with mandatory columns
+* Implement `datakit/normalize.py` \- standard schema definitions, ID generation, raw format to Parquet conversion with mandatory columns
 * Integration tests for the normalize step
 * Integration tests covering download, normalize, dedup and tokenize at reasonable scale
 * Update Grug/ferry experiment definitions to consume datakit pipeline outputs directly
```
