Marin has most of the pieces for end-to-end data processing - download, dedup, filtering, classification, decontamination, tokenization - but the code is scattered across `experiments/` and `lib/marin/` with inconsistent formats, ad-hoc ID handling, and unclear provenance.

We propose consolidating this into **datakit**: a set of composable pipeline stages with standardized formats and conventions, living in `lib/marin/datakit/`. Dataset-specific wiring (e.g., "for Arxiv, apply these transforms") lives in `experiments/` or reference configurations.

Links:

* [marin#2355](https://github.com/marin-community/marin/issues/2355)
* [gdoc](https://docs.google.com/document/d/1kDSzONg32zv2VnCO4FJiMP0fcjRSjgP0uTDpI4_C4O0)

# Golden Path

The canonical pipeline for getting a dataset from source to training:

`Download → Normalize → Embed → Classify/Filter → Dedup → Consolidate → Tokenize`

Notably, datakit in the proposed form does not include **data mixing** or **training**.

## 1. Download

Download the raw dataset from Hugging Face (or other sources). Raw downloads are preserved as-is in their original format and directory structure.

## 2. Normalize to Standard Format

Convert raw data into the **datakit standard format**:

* **File format**: Parquet - columnar, widely supported, supports pushdown filters and column projection.
* **Mandatory columns**:
  * `id` - unique document identifier (see [ID Column](#id-column) below)
  * `text` - primary text content - we enforce UTF-8
* **Arbitrary additional columns**: any fields present in the raw data are preserved
* **Directory structure**: preserve the original directory structure
* **Partition structure**: partition layout from the source does NOT need to be preserved at this point - and in most cases it will not be
  * We may want to introduce a more efficient partitioning at this stage and preserve the new partitioning until tokenization
  * Partitions must follow the `part-x-of-y` suffix naming convention
* **Sort invariant**: each partition is sorted by `id`
* **Typed output**: in code, the data has a typed representation via `Artifact`

This is the "intake" step - all downstream stages operate on normalized Parquet datasets.
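
For concreteness, a minimal sketch of intake for one partition, assuming pyarrow and a JSONL source; the helper names (`normalize_partition`, `make_doc_id`) and paths are illustrative, not the final datakit API:

```py
import hashlib
import json

import pyarrow as pa
import pyarrow.parquet as pq


def make_doc_id(text: str) -> str:
    """Deterministic content-hash ID (see "ID Column" below)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def normalize_partition(jsonl_path: str, out_path: str) -> None:
    with open(jsonl_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    for row in rows:
        if "id" in row:  # preserve existing IDs under source_id
            row["source_id"] = row.pop("id")
        row["id"] = make_doc_id(row["text"])
    rows.sort(key=lambda r: r["id"])  # sort invariant: each partition sorted by id
    pq.write_table(pa.Table.from_pylist(rows), out_path)


# Partitions follow the part-x-of-y suffix naming convention:
normalize_partition("raw/shard_00000.jsonl", "normalized/part-0-of-100.parquet")
```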

## 3. Embed

Produce vector embeddings for each document. Output is an **attributes dataset** (see [Attributes Datasets](#attributes-datasets)) with embedding vectors keyed by `id`.
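
A minimal sketch of the stage, assuming sentence-transformers (the model name is illustrative): it reads only the columns it needs and writes `id` plus an embedding column, preserving the source partition structure:

```py
import pyarrow as pa
import pyarrow.parquet as pq
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")


def embed_partition(src_path: str, out_path: str) -> None:
    src = pq.read_table(src_path, columns=["id", "text"])  # column projection
    vectors = model.encode(src.column("text").to_pylist())
    pq.write_table(
        pa.table({"id": src.column("id"), "embedding": [v.tolist() for v in vectors]}),
        out_path,
    )
```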

## 4. Quality Classification, Topic Assignment

Each classifier produces an **attributes dataset** containing scores/labels keyed by `id`.
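
As a sketch, a fastText-based quality stage might look like the following; the model file and the `__label__hq` label are assumptions for illustration:

```py
import fasttext
import pyarrow as pa
import pyarrow.parquet as pq

model = fasttext.load_model("fasttext-quality-v1.bin")


def classify_partition(src_path: str, out_path: str) -> None:
    src = pq.read_table(src_path, columns=["id", "text"])
    scores = []
    for text in src.column("text").to_pylist():
        # fastText predicts on single lines, so strip newlines first
        labels, probs = model.predict(text.replace("\n", " "))
        p = float(probs[0])
        scores.append(p if labels[0] == "__label__hq" else 1.0 - p)
    pq.write_table(pa.table({"id": src.column("id"), "quality_score": scores}), out_path)
```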

## 5. Deduplication

Produces an **attributes dataset** marking duplicate spans or documents.
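
A sketch of the exact document-level case (fuzzy dedup, e.g., MinHash, would follow the same shape); note it reads source partitions 1:1 and writes co-partitioned attributes:

```py
import hashlib

import pyarrow as pa
import pyarrow.parquet as pq


def mark_exact_duplicates(src_paths: list[str], out_paths: list[str]) -> None:
    seen: set[str] = set()  # global hash set; sequential for simplicity
    for src_path, out_path in zip(src_paths, out_paths):  # 1:1 partition mapping
        src = pq.read_table(src_path, columns=["id", "text"])
        flags = []
        for text in src.column("text").to_pylist():
            h = hashlib.sha256(text.encode("utf-8")).hexdigest()
            flags.append(h in seen)  # True for every copy after the first
            seen.add(h)
        pq.write_table(pa.table({"id": src.column("id"), "is_duplicate": flags}), out_path)
```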

## 6. Consolidation

Join attributes datasets back to the source documents and apply filters:

* Filter by classifier thresholds (e.g., quality score > 0.8)
* Remove duplicate spans/documents

Output is a clean, filtered Parquet dataset - still sorted by `id`, still co-partitioned.
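
For illustration, one partition's consolidation expressed in DuckDB SQL; the real implementation would use Zephyr's `sorted_merge_join` over co-partitioned shards rather than a general join, and the paths are hypothetical:

```py
import duckdb

# Join quality and dedup attributes back to the source partition, apply
# compound filters, and keep the output sorted by id.
duckdb.sql("""
    COPY (
        SELECT src.*
        FROM 'normalized/part-0-of-100.parquet' AS src
        JOIN 'quality/part-0-of-100.parquet'    AS q USING (id)
        JOIN 'dedup/part-0-of-100.parquet'      AS d USING (id)
        WHERE q.quality_score > 0.8 AND NOT d.is_duplicate
        ORDER BY id
    ) TO 'consolidated/part-0-of-100.parquet' (FORMAT parquet)
""")
```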

## 7. Tokenize

Convert clean text into the tokenized Levanter cache format.

**Tokenization is the boundary where per-document structure ends.** The tokenizer concatenates documents into fixed-size token sequences for efficient training. Partition structure from earlier stages does not carry through - the output is sharded Levanter TreeStore caches with a `.stats.json` summary.
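
To make the boundary concrete, a toy packing sketch (illustrative only - the actual stage writes Levanter TreeStore caches, not Python lists):

```py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
SEQ_LEN = 4096


def pack(texts: list[str]) -> list[list[int]]:
    """Concatenate documents into one token stream, then cut fixed-size sequences."""
    stream: list[int] = []
    for text in texts:
        stream.extend(tokenizer.encode(text) + [tokenizer.eos_token_id])
    # Per-document boundaries are gone from here on; the trailing remainder
    # is dropped for simplicity.
    return [stream[i : i + SEQ_LEN] for i in range(0, len(stream) - SEQ_LEN + 1, SEQ_LEN)]
```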

# Core Design Decisions

## Parquet as the Standard Format

All intermediate datasets (from normalization through consolidation) use the Parquet columnar format. Benefits:

* Column projection (only read the columns you need)
* Filter pushdown
* Efficient sorted merge joins via Zephyr
* Mature ecosystem with broad tooling support

NOTE: We initially considered Vortex for its pushdown and lookup capabilities, but encountered blocking issues with Zephyr pipeline integration (see [vortex#6905](https://github.com/vortex-data/vortex/issues/6905)). Parquet provides the same columnar benefits with a proven ecosystem. If Vortex matures, we can revisit.

## ID Column {#id-column}

* **Preserve existing IDs** when present in the raw data (e.g., WARC-Record-ID in DCLM, HF row indices). These carry provenance meaning and aid debugging.
  * But rename the column to `source_id`
* **Generate deterministic IDs** via content hash, in a column named `id`. Deterministic hashing ensures reproducibility - re-running the pipeline produces the same IDs, which preserves caching and diffing.

## Co-Partitioning Invariant

The key invariant that enables efficient joins: **attributes datasets must have the same number of shards and the same key-range partitioning as their source dataset.**

This means:

* The normalization step determines the partition structure
* All downstream stages (embed, classify, dedup) preserve this structure - same shard count, same ID ranges per shard
* Consolidation can use Zephyr's `sorted_merge_join` without a costly `group_by` shuffle

This is enforced by convention: each processing stage reads source partitions 1:1 and writes output partitions with matching structure.
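
A cheap per-partition check of this convention might look like the following sketch, assuming source and attributes shards share `part-x-of-y` names; comparing min/max IDs is a proxy for matching key ranges:

```py
import pyarrow.compute as pc
import pyarrow.parquet as pq


def check_co_partitioned(source_part: str, attrs_part: str) -> None:
    src_ids = pq.read_table(source_part, columns=["id"]).column("id")
    attr_ids = pq.read_table(attrs_part, columns=["id"]).column("id")
    assert len(src_ids) == len(attr_ids), "row counts differ"
    assert pc.min(src_ids) == pc.min(attr_ids), "ID ranges differ (min)"
    assert pc.max(src_ids) == pc.max(attr_ids), "ID ranges differ (max)"
```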

## Attributes Datasets {#attributes-datasets}

Processing stages (embed, classify, dedup) produce **attributes datasets** - lightweight Parquet files containing:

* `id` - matching the source document ID
* Stage-specific output columns (e.g., `quality_score`, `is_duplicate`, `topic_label`)

Attributes datasets:

* Use Parquet format
* Are co-partitioned with the source (same shard count and key ranges)
* Are sorted by `id` within each partition
* Can be joined back to source documents via `sorted_merge_join`

Multiple attribute datasets from different stages can be joined together during consolidation to apply compound filters.
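
For concreteness, the schema of a quality-classifier attributes dataset might look like this (pyarrow notation; the column types are assumptions):

```py
import pyarrow as pa

quality_attrs_schema = pa.schema([
    ("id", pa.string()),              # matches the source document ID
    ("quality_score", pa.float32()),  # stage-specific output column
])
```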

## Step Orchestration via StepSpec

Datakit builds on `StepSpec` - the pure-data step descriptor that captures identity and dependencies. Each datakit stage (normalize, classify, dedup, etc.) is a `StepSpec` with:

* **`name`**: human-readable stage name (e.g., `"fineweb/normalize"`)
* **`deps`**: upstream `StepSpec`s whose `output_path` this stage reads from
* **`hash_attrs`**: configuration values that affect output (model name, thresholds, etc.) - changes invalidate the cache
* **`fn`**: the callable that performs the work, receiving `output_path` as its argument

`StepSpec` gives us automatic cache invalidation (via `hash_id` derived from name + attrs + dep paths), dependency tracking, and deterministic output paths. The step runner handles locking, heartbeats, and status - datakit stages just describe what to run.

Example wiring:

```py
download = StepSpec(
    name="fineweb/download",
    fn=lambda output_path: download_hf(output_path=output_path, dataset_id="HuggingFaceFW/fineweb"),
    hash_attrs={"dataset_id": "HuggingFaceFW/fineweb", "revision": "abc1234"},
)

normalize = StepSpec(
    name="fineweb/normalize",
    deps=[download],
    fn=lambda output_path: normalize_to_parquet(
        input_path=download.output_path, output_path=output_path, text_field="text",
    ),
    hash_attrs={"text_field": "text"},
)

quality = StepSpec(
    name="fineweb/quality",
    deps=[normalize],
    fn=lambda output_path: classify(
        input_path=normalize.output_path, output_path=output_path, model="fasttext-quality-v1",
    ),
    hash_attrs={"model": "fasttext-quality-v1"},
)

dedup = StepSpec(
    name="fineweb/dedup",
    deps=[normalize],
    fn=lambda output_path: deduplicate(
        input_path=normalize.output_path, output_path=output_path, mode="fuzzy_document",
    ),
    hash_attrs={"mode": "fuzzy_document"},
)

consolidated = StepSpec(
    name="fineweb/consolidated",
    deps=[normalize, quality, dedup],
    fn=lambda output_path: consolidate(
        source_path=normalize.output_path,
        attribute_paths=[quality.output_path, dedup.output_path],
        output_path=output_path,
        quality_threshold=0.8,
    ),
    hash_attrs={"quality_threshold": 0.8},
)

tokenized = StepSpec(
    name="fineweb/tokenized",
    deps=[consolidated],
    fn=lambda output_path: tokenize(
        input_path=consolidated.output_path, output_path=output_path,
        tokenizer="meta-llama/Llama-3.1-8B",
    ),
    hash_attrs={"tokenizer": "meta-llama/Llama-3.1-8B"},
)
```

# API Surface

## `lib/marin/datakit/`

Core primitives - the reusable building blocks:

```
lib/marin/datakit/
  normalize    # Raw format -> standard Parquet (id, text, ...)
  embed        # Document embedding
  classify     # Quality/topic classification
  dedup        # Deduplication (exact + fuzzy)
  consolidate  # Join attributes + apply filters
```

## `experiments/` (or reference configurations)

Dataset-specific wiring - which transforms to apply for a given dataset, expressed as `StepSpec` DAGs.

# Execution Plan

* Implement `datakit/normalize.py` - standard schema definitions, ID generation, raw format to Parquet conversion with mandatory columns
* Integration tests for the normalize step
* Integration tests covering download, normalize, dedup, and tokenize at reasonable scale
* Update Grug/ferry experiment definitions to consume datakit pipeline outputs directly

# Non-Goals

* **Replacing the mixing or training APIs** - datakit standardizes everything upstream of tokenization.
* **Supporting non-text modalities** - the initial scope is text datasets with a mandatory `text` field. Multimodal support can be added later by relaxing this constraint.

# Open Questions

1. **ID uniqueness enforcement**: Per-partition validation is cheap and will be the default. Should we also support global uniqueness checks? What's the failure mode - warn or error?
2. **Non-text datasets**: Code datasets, structured data - do we need a configurable primary field, or is `text` always sufficient?
3. **Versioning**: How do we version datakit outputs so that downstream consumers (Grug) can pin to a specific processing run? `StepSpec.hash_id` provides content-based versioning, but do we need human-readable version tags as well?