Feat: turbopuffer datasink #58910
base: master
Conversation
Code Review
This pull request introduces a TurbopufferDatasink to enable writing Ray datasets to the Turbopuffer vector database. The implementation is comprehensive, covering both single-namespace and multi-namespace writes, along with robust configuration validation and a thorough test suite.
My review focuses on performance and configurability. I've identified a significant performance issue in the multi-namespace write logic and suggest a more efficient implementation using pyarrow.Table.group_by(). I also recommend making the distance_metric configurable to provide more flexibility to users. Additionally, there are a couple of minor improvements for robustness and code style.
Overall, this is a great contribution that adds valuable functionality to Ray Data.
```python
# Group by namespace column
# Note: PyArrow doesn't have a built-in group_by for tables,
# so we'll use a simpler approach: get unique values and filter
namespace_col = table.column(self.namespace_column)

# Get unique namespace values
unique_namespaces = pc.unique(namespace_col)

logger.debug(f"Writing to {len(unique_namespaces)} namespaces")

# Process each namespace group
for i in range(len(unique_namespaces)):
    namespace_value = unique_namespaces[i].as_py()

    # Filter table for this namespace
    mask = pc.equal(namespace_col, namespace_value)
    group_table = table.filter(mask)

    # Format namespace name
    # Convert bytes to UUID string if needed
    if isinstance(namespace_value, bytes) and len(namespace_value) == 16:
        # This is a UUID in binary format
        namespace_str = str(uuid.UUID(bytes=namespace_value))
    else:
        namespace_str = str(namespace_value)

    namespace_name = self.namespace_format.format(namespace=namespace_str)

    # Write this group
    self._write_single_namespace(client, group_table, namespace_name)
```
The current implementation for grouping by namespace_column iterates through the unique namespace values and filters the entire table for each one. This can be very inefficient when there are many unique namespaces, since it costs roughly O(num_unique_namespaces * num_rows). A more performant approach is to use pyarrow.Table.group_by() with a "list" aggregation over row indices, which collects every group's rows in a single pass and then materializes each group once with take(). The comment on line 291 is also incorrect: PyArrow does support table grouping via Table.group_by().
Suggested change:
```diff
-# Group by namespace column
-# Note: PyArrow doesn't have a built-in group_by for tables,
-# so we'll use a simpler approach: get unique values and filter
-namespace_col = table.column(self.namespace_column)
-# Get unique namespace values
-unique_namespaces = pc.unique(namespace_col)
-logger.debug(f"Writing to {len(unique_namespaces)} namespaces")
-# Process each namespace group
-for i in range(len(unique_namespaces)):
-    namespace_value = unique_namespaces[i].as_py()
-    # Filter table for this namespace
-    mask = pc.equal(namespace_col, namespace_value)
-    group_table = table.filter(mask)
-    # Format namespace name
-    # Convert bytes to UUID string if needed
-    if isinstance(namespace_value, bytes) and len(namespace_value) == 16:
-        # This is a UUID in binary format
-        namespace_str = str(uuid.UUID(bytes=namespace_value))
-    else:
-        namespace_str = str(namespace_value)
-    namespace_name = self.namespace_format.format(namespace=namespace_str)
-    # Write this group
-    self._write_single_namespace(client, group_table, namespace_name)
+# Group by namespace column in a single pass: collect each group's row
+# indices with a "list" aggregation, then take() each group once.
+# (Assumes `import pyarrow as pa` at the top of the file.)
+grouped = (
+    pa.table(
+        {
+            "namespace": table.column(self.namespace_column),
+            "row_idx": pa.array(range(table.num_rows), type=pa.int64()),
+        }
+    )
+    .group_by("namespace")
+    .aggregate([("row_idx", "list")])
+)
+logger.debug(f"Writing to {grouped.num_rows} namespaces")
+# Process each namespace group
+for namespace_value, group_indices in zip(
+    grouped["namespace"].to_pylist(), grouped["row_idx_list"].to_pylist()
+):
+    group_table = table.take(group_indices)
+    # Format namespace name
+    # Convert bytes to UUID string if needed
+    if isinstance(namespace_value, bytes) and len(namespace_value) == 16:
+        # This is a UUID in binary format
+        namespace_str = str(uuid.UUID(bytes=namespace_value))
+    else:
+        namespace_str = str(namespace_value)
+    namespace_name = self.namespace_format.format(namespace=namespace_str)
+    # Write this group
+    self._write_single_namespace(client, group_table, namespace_name)
```
(Note: pyarrow.TableGroupBy is not directly iterable, so the suggestion collects per-group row indices via aggregate() and materializes each group with take().)
```python
rows = table.to_pylist()

# Validate all rows have ID
if rows and "id" not in rows[0]:
    raise ValueError("Table must have 'id' column")
```
The check for the presence of an "id" column is done on the first row of the Python list representation of the table (if rows and "id" not in rows[0]). This is inefficient because it requires converting the table to a list of dictionaries first, and it's not robust as it only checks the first row. A better approach is to check the table's schema directly before converting it to a pylist.
Suggested change:
```diff
-rows = table.to_pylist()
-# Validate all rows have ID
-if rows and "id" not in rows[0]:
-    raise ValueError("Table must have 'id' column")
+if "id" not in table.column_names:
+    raise ValueError("Table must have 'id' column")
+# Convert to list of row dictionaries
+rows = table.to_pylist()
```
```python
    raise ValueError("Table must have 'id' column")

# Convert bytes to proper formats (e.g., UUIDs)
import uuid as uuid_lib
```
The uuid module is imported locally here, but it's already imported at the top of the file. For consistency and to follow best practices (PEP 8), it's better to use the top-level import. The alias uuid_lib is not necessary as there is no local variable named uuid that it would shadow. Please remove this line and change uuid_lib.UUID() to uuid.UUID() in this method.
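A minimal sketch of that cleanup, assuming `import uuid` already appears at the top of the file:
```diff
-# Convert bytes to proper formats (e.g., UUIDs)
-import uuid as uuid_lib
+# Convert bytes to proper formats (e.g., UUIDs), using the module-level
+# `uuid` import instead of a local aliased re-import.
```
with the call sites in this method changed from `uuid_lib.UUID(...)` to `uuid.UUID(...)`.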
```python
namespace.write(
    upsert_rows=batch_data,
    schema=self.schema,
    distance_metric="cosine_distance",
```
The distance_metric is hardcoded to "cosine_distance". While this is the default for Turbopuffer, users might want to use other supported metrics like "euclidean_squared" or "dot_product". This should be a configurable parameter of the TurbopufferDatasink.
I recommend adding a distance_metric parameter to the __init__ method (defaulting to "cosine_distance") and using it here. This change should also be propagated to Dataset.write_turbopuffer.
Suggested change:
```diff
-    distance_metric="cosine_distance",
+    distance_metric=self.distance_metric,
```
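A hedged sketch of the suggested constructor change. The parameter name comes from this review; the metric names are the ones listed above, not an exhaustive check against Turbopuffer's documentation, and the other `__init__` parameters are elided:
```python
# Sketch only: the Datasink base class and remaining parameters are omitted.
_SUPPORTED_DISTANCE_METRICS = {"cosine_distance", "euclidean_squared", "dot_product"}

class TurbopufferDatasink:
    def __init__(self, *args, distance_metric: str = "cosine_distance", **kwargs):
        if distance_metric not in _SUPPORTED_DISTANCE_METRICS:
            raise ValueError(
                f"distance_metric must be one of "
                f"{sorted(_SUPPORTED_DISTANCE_METRICS)}, got {distance_metric!r}"
            )
        # Used later as namespace.write(..., distance_metric=self.distance_metric).
        self.distance_metric = distance_metric
```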
```python
idx = table.column_names.index(self.vector_column)
new_names = list(table.column_names)
new_names[idx] = "vector"
table = table.rename_columns(new_names)
```
Bug: Missing validation for custom vector column
When a custom vector_column is specified but doesn't exist in the table, the code silently skips renaming without raising an error. The condition if self.vector_column != "vector" and self.vector_column in table.column_names: only executes the renaming block when the column exists. Unlike the id_column validation (which explicitly checks and raises ValueError if missing), this allows tables without the specified vector column to proceed, potentially causing silent failures or errors downstream when Turbopuffer expects a "vector" column.
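A possible fix, mirroring the existing id_column check (sketch only; the error message is illustrative):
```python
if self.vector_column != "vector":
    if self.vector_column not in table.column_names:
        raise ValueError(
            f"Vector column {self.vector_column!r} not found in table; "
            f"available columns: {table.column_names}"
        )
    # Rename the custom vector column to the "vector" name Turbopuffer expects.
    idx = table.column_names.index(self.vector_column)
    new_names = list(table.column_names)
    new_names[idx] = "vector"
    table = table.rename_columns(new_names)
```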
```python
    self._write_multi_namespace(client, table)
else:
    # Single namespace mode
    self._write_single_namespace(client, table, self.namespace)
```
Bug: Namespace column lost after column renaming
When namespace_column is the same as id_column or vector_column, the column gets renamed in _prepare_arrow_table before _write_multi_namespace executes. Then _write_multi_namespace attempts to find the original namespace_column name in the table, which no longer exists after renaming, causing a ValueError. For example, if both namespace_column="doc_id" and id_column="doc_id", the column is renamed to "id" but multi-namespace mode still looks for "doc_id".
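One way to close this hole is to reject the ambiguous configuration up front, e.g. in `__init__` (a sketch; an alternative fix would be to track the renamed column name through `_prepare_arrow_table`):
```python
# Sketch: namespace_column must survive the id/vector renames untouched.
if self.namespace_column is not None and self.namespace_column in (
    self.id_column,
    self.vector_column,
):
    raise ValueError(
        f"namespace_column {self.namespace_column!r} cannot equal id_column or "
        "vector_column: those columns are renamed to 'id'/'vector' before "
        "multi-namespace grouping, so the namespace column would be lost."
    )
```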
Additional Locations (1)
```python
idx = table.column_names.index(self.vector_column)
new_names = list(table.column_names)
new_names[idx] = "vector"
table = table.rename_columns(new_names)
```
Bug: Same column for ID and vector causes failure
When id_column and vector_column are set to the same column name, the ID renaming happens first and consumes that column by renaming it to "id". Then the vector column renaming logic checks if self.vector_column in table.column_names, which is now false since the column was already renamed. This causes the vector renaming to be silently skipped, leaving the table without a "vector" column, which will likely cause errors when writing to Turbopuffer.
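Rejecting this configuration in `__init__` (as in the previous sketch) is probably the simplest fix. If sharing one source column is meant to be supported, the column could instead be copied before the ID rename consumes it; a hypothetical sketch:
```python
# Sketch: duplicate the shared column as "vector" before id_column is
# renamed to "id", so both roles keep their data.
if (
    self.id_column == self.vector_column
    and self.id_column in table.column_names
    and "vector" not in table.column_names
):
    table = table.append_column("vector", table[self.id_column])
```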
```python
# Filter table for this namespace
mask = pc.equal(namespace_col, namespace_value)
group_table = table.filter(mask)
```
Bug: Null namespace values silently drop rows
When the namespace column contains null values in multi-namespace mode, pc.equal(namespace_col, namespace_value) returns all false values when namespace_value is None, because PyArrow's equality doesn't match nulls. This causes rows with null namespace values to be filtered into an empty group and silently dropped without any error or warning, leading to unexpected data loss.
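A minimal guard (sketch) would be to fail fast on null namespace values before grouping, rather than dropping those rows silently:
```python
# Sketch: rows with null namespace values cannot be routed to any namespace.
if namespace_col.null_count > 0:
    raise ValueError(
        f"Namespace column {self.namespace_column!r} contains "
        f"{namespace_col.null_count} null values; rows with null namespaces "
        "would otherwise be silently dropped."
    )
```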
Description
This PR adds a `TurbopufferDatasink` for Ray Data, enabling Ray datasets to be written directly into the Turbopuffer vector database. The datasink supports both single-namespace writes and multi-namespace writes, where rows are grouped by a `namespace_column` and written into separate Turbopuffer namespaces derived from that column.

The implementation includes:

- A `TurbopufferDatasink(Datasink)` that:
  - validates configuration (`namespace` vs `namespace_column`, required API key, namespace format);
  - renames `id_column` → `"id"` and `vector_column` → `"vector"`, and filters out rows with null IDs;
  - groups rows by `namespace_column`, formats namespace names via `namespace_format` (e.g., `block_spans__{namespace}`), and writes each group to its own Turbopuffer namespace;
  - converts `bytes` values (including inside lists) → hex strings.
- Validation in `_prepare_arrow_table`:
  - raises `ValueError` if renaming a custom `id_column` to `"id"` would conflict with an existing `"id"` column;
  - raises `ValueError` if renaming a custom `vector_column` to `"vector"` would conflict with an existing `"vector"` column;
  - operates on the renamed columns via `table.column("id")` / `"vector"`.

A comprehensive test suite is added/updated in `python/ray/data/tests/test_turbopuffer_datasink.py` to cover:

- tables missing `"id"` or `"vector"` columns;
- multi-namespace grouping by `namespace_column`;
- rows with null values in the `"id"` column.

This PR is aligned with the design and performance considerations described in `turbopuffer_datasink_spec.md`, including support for multi-tenant (multi-namespace) ingestion patterns and Turbopuffer's performance guidance (schema types, batch sizing, concurrency).
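For context, a hypothetical usage sketch of the API this PR describes. `write_turbopuffer` and its parameter names are taken from the PR text and review comments, so the exact signature may differ from the merged implementation:
```python
import ray

ds = ray.data.from_items(
    [
        {"id": "doc-1", "vector": [0.1, 0.2, 0.3], "tenant": "acme"},
        {"id": "doc-2", "vector": [0.4, 0.5, 0.6], "tenant": "globex"},
    ]
)

# Single-namespace write: every row lands in one Turbopuffer namespace.
ds.write_turbopuffer(
    namespace="my_namespace",
    api_key="tpuf_...",
)

# Multi-namespace write: rows are grouped by the `tenant` column and each
# group is written to a namespace derived from namespace_format.
ds.write_turbopuffer(
    namespace_column="tenant",
    namespace_format="block_spans__{namespace}",
    api_key="tpuf_...",
)
```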