Commit 9ff9161
Dedupe pipeline on 10BT fineweb-edu sample (#2129)
## Description
Re: #2096
In this PR - "we use more data and more resources":
* we add `fineweb_edu_small_10bt` dataset
* request more resources - 1024 parallelism by default
* guard `dupekit` to "protect" the Marin driver
* clean up the dedupe code slightly - still not clean enough
* we will break the dedupe utils out soon
* compute dedupe stats/counts
* reuse more code between paragraph and doc exact modes

1 parent 5308ed9 commit 9ff9161
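
For readers without the diff at hand, here is a minimal sketch of how paragraph and doc exact modes can share one code path while also producing dedupe counts. It is an illustration only, not the code from this commit: `split_units`, `unit_hash`, and `exact_dedupe` are hypothetical names, the newline-based paragraph split and the SHA-256 hash are assumptions, and it does not use `dupekit` or the Marin/Zephyr APIs.

```python
import hashlib
from collections import Counter
from typing import Iterable


def split_units(text: str, mode: str) -> list[str]:
    """Split a document into the units compared for exact duplicates (hypothetical helper)."""
    if mode == "doc":
        return [text]
    if mode == "paragraph":
        # Assumption: paragraphs are newline-separated; blank lines are ignored.
        return [p for p in text.split("\n") if p.strip()]
    raise ValueError(f"unknown mode: {mode}")


def unit_hash(unit: str) -> str:
    """Hash a unit so that only digests need to be kept in memory."""
    return hashlib.sha256(unit.encode("utf-8")).hexdigest()


def exact_dedupe(docs: Iterable[str], mode: str) -> tuple[list[str], Counter]:
    """Keep the first occurrence of every unit; return kept docs and counts."""
    seen: set[str] = set()
    stats: Counter = Counter()
    kept_docs: list[str] = []
    for text in docs:
        kept_units = []
        for unit in split_units(text, mode):
            stats["total"] += 1
            digest = unit_hash(unit)
            if digest in seen:
                stats["duplicate"] += 1
            else:
                seen.add(digest)
                stats["unique"] += 1
                kept_units.append(unit)
        # A document whose units were all duplicates is dropped entirely.
        if kept_units:
            kept_docs.append("\n".join(kept_units))
    return kept_docs, stats


if __name__ == "__main__":
    sample = ["a\nb", "a\nc", "a\nb"]
    for mode in ("doc", "paragraph"):
        kept, stats = exact_dedupe(sample, mode)
        print(mode, kept, dict(stats))
```

The only mode-specific piece is `split_units`; hashing, the seen-set, and the counters are shared, which is one way to get the kind of reuse the last bullet describes. A single in-memory set would not scale to a 10BT sample, so a real pipeline would shard the digests across workers.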
File tree
3 files changed, +168 -78 lines changed
- experiments/dedup
- lib
- marin/src/marin/processing/classification
- zephyr/src/zephyr