Update datakit design doc: use Parquet instead of Vortex
Switches the standard format from Vortex to Parquet throughout the
design doc. Notes vortex#6905 as the blocking issue that motivated
the change. Parquet provides the same columnar benefits with a
mature ecosystem.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* `id` \- unique document identifier (see [ID Column](#id-column) below)
* `text` \- primary text content \- we enforce UTF-8
@@ -35,7 +35,7 @@ Convert raw data into the **datakit standard format**:
* **Sort invariant**: each partition is sorted by `id`
* **Typed output:** in the code the data has typed representation via `Artifact`

-This is the "intake" step \- all downstream stages operate on normalized Vortex datasets.
+This is the "intake" step \- all downstream stages operate on normalized Parquet datasets.

## 3\. Embed

@@ -56,7 +56,7 @@ Join attributes datasets back to the source documents and apply filters:
* Filter by classifier thresholds (e.g., quality score \> 0.8)
* Remove duplicate spans/documents

-Output is a clean, filtered Vortex dataset \- still sorted by `id`, still co-partitioned.
+Output is a clean, filtered Parquet dataset \- still sorted by `id`, still co-partitioned.
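
A minimal sketch of why the consolidated output stays sorted by `id`: filtering a sorted stream only drops rows and never reorders them. The function name `consolidate` and the columns `quality` and `is_duplicate` are hypothetical, chosen for illustration; the design doc does not specify them, and a real stage would read and write Parquet partitions rather than Python lists.

```python
from typing import Iterable, Iterator


def consolidate(rows: Iterable[dict], min_quality: float = 0.8) -> Iterator[dict]:
    """Drop low-quality and duplicate rows from a stream sorted by `id`.

    Rows are only skipped, never reordered, so the output keeps the
    input's sort order.
    """
    for row in rows:
        # Keep only rows whose quality score is strictly above the threshold.
        if row["quality"] <= min_quality:
            continue
        # Hypothetical flag, assumed to come from a dedup attribute dataset.
        if row["is_duplicate"]:
            continue
        yield row


# Joined rows for one partition, already sorted by `id`.
joined_rows = [
    {"id": "a", "text": "alpha", "quality": 0.95, "is_duplicate": False},
    {"id": "b", "text": "beta", "quality": 0.40, "is_duplicate": False},
    {"id": "c", "text": "alpha", "quality": 0.90, "is_duplicate": True},
    {"id": "d", "text": "delta", "quality": 0.85, "is_duplicate": False},
]

clean = list(consolidate(joined_rows))
clean_ids = [row["id"] for row in clean]
```

Because each partition is processed independently and rows never move between partitions, the same argument suggests co-partitioning is preserved as well.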
## 8\. Tokenize
62
62
@@ -66,15 +66,16 @@ Convert clean text into tokenized Levanter cache format.

# Core Design Decisions

-## Vortex as the Standard Format
+## Parquet as the Standard Format

-All intermediate datasets (from normalization through consolidation) use the Vortex columnar format. Benefits:
+All intermediate datasets (from normalization through consolidation) use the Parquet columnar format. Benefits:

* Column projection (only read the columns you need)
* Filter pushdown
* Efficient sorted merge joins via Zephyr
+* Mature ecosystem with broad tooling support
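
To illustrate the sorted merge join benefit, here is a rough sketch of the general technique in plain Python. It is not Zephyr's API; the function `sorted_merge_join` and the sample columns are assumptions made for illustration. It relies on both inputs being sorted by `id` with unique ids, so the join runs in one linear pass without building a hash table.

```python
from typing import Iterable, Iterator


def sorted_merge_join(
    docs: Iterable[dict], attrs: Iterable[dict], key: str = "id"
) -> Iterator[dict]:
    """Inner-join two row streams that are both sorted ascending by `key`.

    Assumes keys are unique within each stream.
    """
    attrs_iter = iter(attrs)
    attr = next(attrs_iter, None)
    for doc in docs:
        # Advance the attribute stream until it catches up with this document.
        while attr is not None and attr[key] < doc[key]:
            attr = next(attrs_iter, None)
        if attr is None:
            return  # No attributes left, so no further matches are possible.
        if attr[key] == doc[key]:
            yield {**doc, **attr}


docs = [
    {"id": "a", "text": "alpha"},
    {"id": "b", "text": "beta"},
    {"id": "d", "text": "delta"},
]
attrs = [
    {"id": "b", "quality": 0.9},
    {"id": "c", "quality": 0.1},
    {"id": "d", "quality": 0.7},
]

joined = list(sorted_merge_join(docs, attrs))
```

Each stream is consumed once and only the current row from each side is held in memory, which is why the sort invariant matters for every intermediate dataset.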

-NOTE: Vortex is much less mature than Parquet. This is a major concern. We will start with Vortex and if we hit roadblocks, revert to Parquet.
+NOTE: We initially considered Vortex for its pushdown and lookup capabilities, but encountered blocking issues with Zephyr pipeline integration (see [vortex\#6905](https://github.com/vortex-data/vortex/issues/6905)). Parquet provides the same columnar benefits with a proven ecosystem. If Vortex matures, we can revisit.

## ID Column {#id-column}

@@ -96,14 +97,14 @@ This is enforced by convention: each processing stage reads source partitions 1: