Commit 5b9b77c

Make dupekit a pure Rust workspace (#2170)
1 parent a4991d2 commit 5b9b77c

8 files changed: +27 -7 lines changed

docker/marin/Dockerfile.cluster

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ RUN /tmp/google-cloud-sdk/install.sh --quiet --path-update false --command-compl
 RUN /tmp/google-cloud-sdk/bin/gcloud components install alpha --quiet
 
 # Rust toolchain
-FROM rust:1.82-slim AS rust-builder
+FROM rust:1.91-slim AS rust-builder
 
 
 # Main build

docker/marin/Dockerfile.vllm

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ RUN /tmp/google-cloud-sdk/install.sh --quiet --path-update false --command-compl
 RUN /tmp/google-cloud-sdk/bin/gcloud components install alpha --quiet
 
 # Rust toolchain
-FROM rust:1.82-slim AS rust-builder
+FROM rust:1.91-slim AS rust-builder
 
 
 # Main build

lib/dupekit/README.md

Lines changed: 11 additions & 1 deletion
@@ -1,6 +1,16 @@
 dupekit
 ---
-🚧 WIP 🚧
+Raison d'être: Home for the Rust code used for text deduplication.
+
+## Install
+
+- **Locally:** This code is auto-magically built by `uv` via Cargo and [Maturin](https://github.com/PyO3/maturin). You
+might need to install them (e.g., `brew install maturin rust` on macOS).
+- **Cluster:** This code is compiled as part of the Docker build (`uv pip install -e ...` step): Maturin builds the Rust
+code and places it in the system `site-packages` (e.g., `/home/ray/anaconda3/lib/python3.11/site-packages/dupekit/dupekit.abi3.so`).
+
+**Note:** What about making `dupekit` a hybrid Python/Rust Maturin workspace? We tried and experienced issues getting
+the Docker build to work while keeping it simple—a simple Rust workspace helps keep the setup clean.
 
 ## Benchmarking
 
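
The README change above says Maturin places the compiled library under `site-packages`. A minimal sketch of how one might confirm that after an install, using only the standard library, is shown below. The helper names `locate_module` and `is_compiled_extension` are hypothetical, and the dotted name `dupekit.dupekit` in the usage comment is an assumption inferred from the example path in the README, not something the commit states.

```python
import importlib.util
from importlib.machinery import EXTENSION_SUFFIXES
from pathlib import Path
from typing import Optional


def locate_module(name: str) -> Optional[Path]:
    """Return the file a module would be loaded from, or None if it is not importable.

    Hypothetical helper for illustration; not part of dupekit or marin.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package of a dotted name is missing or malformed.
        return None
    if spec is None or spec.origin is None:
        return None
    if spec.origin in ("built-in", "frozen"):
        # These modules live inside the interpreter and have no file on disk.
        return None
    return Path(spec.origin)


def is_compiled_extension(path: Path) -> bool:
    """True when the file name ends in one of this interpreter's extension suffixes."""
    return any(path.name.endswith(suffix) for suffix in EXTENSION_SUFFIXES)


# Usage (assumed module name, see the note above):
#   path = locate_module("dupekit.dupekit")
#   if path is None:
#       print("dupekit is not installed in this environment")
#   else:
#       print(path, is_compiled_extension(path))
```

On Linux and macOS, `EXTENSION_SUFFIXES` normally includes `.abi3.so`, which matches the file name given in the README example.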

lib/dupekit/dupekit/__init__.py renamed to lib/marin/src/marin/processing/classification/__init__.py

Lines changed: 0 additions & 3 deletions
@@ -11,6 +11,3 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-
-# TODO (rav): remove F403 when we stabilize dupekit
-from .dupekit import * # noqa: F403
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Copyright 2025 The Marin Authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.

lib/dupekit/dupekit/text_cleaning.py renamed to lib/marin/src/marin/processing/classification/deduplication/text_cleaning.py

File renamed without changes.

lib/dupekit/tests/test_text_cleaning.py renamed to tests/processing/classification/deduplication/test_text_cleaning.py

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-from dupekit.text_cleaning import clean_text
+from marin.processing.classification.deduplication.text_cleaning import clean_text
 
 
 def test_clean_text():
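
The import change above moves `text_cleaning` from `dupekit` into `marin`. For code that has to run both before and after this commit, one possible approach is to try the new path first and fall back to the old one. The sketch below is illustrative only: `import_first_available` is a hypothetical helper, not something this commit adds.

```python
import importlib
from types import ModuleType
from typing import Iterable, List


def import_first_available(candidates: Iterable[str]) -> ModuleType:
    """Import and return the first module in `candidates` that imports cleanly.

    Hypothetical helper for illustration; not part of dupekit or marin.
    """
    tried: List[str] = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Covers ModuleNotFoundError; note it also hides import errors
            # raised inside a candidate module itself.
            tried.append(name)
    raise ImportError(f"none of these modules could be imported: {tried}")


# Usage with the two paths that appear in the diff above:
#   text_cleaning = import_first_available([
#       "marin.processing.classification.deduplication.text_cleaning",  # after this commit
#       "dupekit.text_cleaning",                                        # before this commit
#   ])
#   clean_text = text_cleaning.clean_text
```

Inside the repository itself the direct import used in the updated test is simpler; a fallback like this only makes sense for external callers pinned to mixed versions.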
