8 files changed: +27 -7 lines changed
marin/src/marin/processing/classification
tests/processing/classification/deduplication
@@ -11,7 +11,7 @@ RUN /tmp/google-cloud-sdk/install.sh --quiet --path-update false --command-compl
 RUN /tmp/google-cloud-sdk/bin/gcloud components install alpha --quiet

 # Rust toolchain
-FROM rust:1.82-slim AS rust-builder
+FROM rust:1.91-slim AS rust-builder


 # Main build
@@ -11,7 +11,7 @@ RUN /tmp/google-cloud-sdk/install.sh --quiet --path-update false --command-compl
 RUN /tmp/google-cloud-sdk/bin/gcloud components install alpha --quiet

 # Rust toolchain
-FROM rust:1.82-slim AS rust-builder
+FROM rust:1.91-slim AS rust-builder


 # Main build
 dupekit
 ---
-🚧 WIP 🚧
+Raison d'être: Home for the Rust code used for text deduplication.
+
+## Install
+
+- **Locally:** This code is auto-magically built by `uv` via Cargo and [Maturin](https://github.com/PyO3/maturin). You
+  might need to install them (e.g., `brew install maturin rust` on macOS).
+- **Cluster:** This code is compiled as part of the Docker build (`uv pip install -e ...` step): Maturin builds the Rust
+  code and places it in the system `site-packages` (e.g., `/home/ray/anaconda3/lib/python3.11/site-packages/dupekit/dupekit.abi3.so`).
+
+**Note:** What about making `dupekit` a hybrid Python/Rust Maturin workspace? We tried and experienced issues getting
+the Docker build to work while keeping it simple; a simple Rust workspace helps keep the setup clean.

 ## Benchmarking

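A quick way to check the install notes above, as a minimal sketch that is not part of this change: ask Python where a module would be loaded from. The helper name `locate_module` is hypothetical, and `dupekit` as the importable module name is an assumption based on the example `.so` path in the README.

```python
import importlib.util
from typing import Optional


def locate_module(name: str) -> Optional[str]:
    """Return the path a top-level module would be loaded from, or None if it is not installed."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # find_spec can raise for malformed names or modules with a broken __spec__.
        return None
    if spec is None:
        return None
    return spec.origin


if __name__ == "__main__":
    # Assumption: the compiled extension is importable as `dupekit`; adjust if the name differs.
    print(locate_module("dupekit") or "dupekit is not importable in this environment")
```

On a cluster image, the printed path would be expected to sit under the system `site-packages`, as the README describes; locally it depends on how `uv` set up the environment.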
File renamed without changes.
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-
-# TODO (rav): remove F403 when we stabilize dupekit
-from .dupekit import *  # noqa: F403
+# Copyright 2025 The Marin Authors
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     https://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
File renamed without changes.
 # See the License for the specific language governing permissions and
 # limitations under the License.

-from dupekit.text_cleaning import clean_text
+from marin.processing.classification.deduplication.text_cleaning import clean_text


 def test_clean_text():