feat(io): add GooseFS support via OpenDAL services-goosefs #7109

Open
XuQianJin-Stars wants to merge 2 commits into Eventual-Inc:main from XuQianJin-Stars:feat/goosefs-support

Conversation

@XuQianJin-Stars

Contributor

Changes Made

This PR adds first-class support for GooseFS (Tencent Cloud's distributed cache/acceleration filesystem) as a Daft I/O backend, mirroring the existing oss / cos / obs integrations. It is implemented entirely on top of OpenDAL's new services-goosefs backend (available since OpenDAL 0.57.0), so the surface area in Daft is intentionally small.

Implementation details

  1. Dependency bump

    • Cargo.toml: bump opendal to 0.57.0 and enable the services-goosefs feature.
    • Cargo.lock: regenerated.
  2. New config: GooseFSConfig

    • src/common/io-config/src/goosefs.rs (new file) — Rust config struct with:
      • endpoint: Option<String> — GooseFS master endpoint (e.g. http://master:9200).
      • root: Option<String> — optional root path inside the GooseFS namespace.
      • anonymous: bool — anonymous access toggle (defaults to false).
    • Implements Default, Display, multiline_display, and unit tests for round-tripping / formatting, consistent with cos.rs / obs.rs.
    • Exported from src/common/io-config/src/lib.rs and added to IOConfig next to the other cloud configs.
  3. Python bindings

    • src/common/io-config/src/python.rs: add GooseFSConfig PyO3 class with __init__, replace, __repr__, __eq__, pickling (__reduce__), and field accessors — same shape as COSConfig.
    • Wired through IOConfig.goosefs so it round-trips between Python and Rust.
  4. OpenDAL source registration

    • src/daft-io/src/opendal_source.rs:
      • Add goosefs to OpenDALSource::available_schemes().
      • New operator builder that constructs an OpenDAL Goosefs operator from GooseFSConfig (endpoint / root / anonymous).
    • src/daft-io/src/lib.rs: route goosefs://... URIs to the new source in the scheme dispatcher.
  5. Tests

    • Unit tests for GooseFSConfig (defaults, replace, display, multiline display).
    • Python config round-trip / pickle test alongside the existing COS tests.
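
The config shape from step 2 and the `replace` / equality behaviour from step 3 can be illustrated with a small pure-Python stand-in. This is only a sketch: `GooseFSConfigSketch` is a hypothetical name, not the PyO3 class added by this PR, and it models just the three fields listed above.

```python
import dataclasses
from typing import Optional


@dataclasses.dataclass(frozen=True)
class GooseFSConfigSketch:
    """Hypothetical stand-in for the three fields listed in the PR description."""

    endpoint: Optional[str] = None  # GooseFS master endpoint
    root: Optional[str] = None      # optional root path inside the namespace
    anonymous: bool = False         # anonymous access toggle

    def replace(self, **changes) -> "GooseFSConfigSketch":
        # Return a modified copy and leave the original untouched.
        return dataclasses.replace(self, **changes)


base = GooseFSConfigSketch(endpoint="http://goosefs-master:9200")
scoped = base.replace(root="/datasets")
```

Because the dataclass is frozen, `base` is unchanged after the call, which is the property a `replace` round-trip test would typically check.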

User-facing behavior

After this PR, GooseFS-backed paths can be used directly with Daft's standard readers/writers:

import daft
from daft.io import IOConfig, GooseFSConfig

io_config = IOConfig(
    goosefs=GooseFSConfig(
        endpoint="http://goosefs-master:9200",
        root="/datasets",
    )
)

df = daft.read_parquet("goosefs://my-namespace/path/to/data/", io_config=io_config)
df.show()

All existing readers (read_parquet, read_csv, read_json, read_iceberg, ...) and writers work transparently because everything goes through the shared OpenDALSource plumbing — no reader-specific changes were needed.

Why OpenDAL (vs. a custom client)

GooseFS already has an official OpenDAL backend as of opendal 0.57.0, so plugging it in keeps Daft's I/O code uniform with the other OpenDAL-backed schemes (oss, cos, obs, huggingface) and lets us inherit upstream improvements for free. No new transport, retry, or auth code was introduced.

Validation

  • cargo fmt --all
  • cargo check --workspace
  • cargo test -p common-io-config -p daft-io ✅ (new GooseFS unit tests pass)
  • Manual smoke test: daft.read_parquet("goosefs://...") against a local GooseFS cluster.

Related Issues

Closes #7108

@XuQianJin-Stars XuQianJin-Stars requested a review from a team as a code owner June 11, 2026 02:17
@github-actions github-actions Bot added the feat label Jun 11, 2026
@greptile-apps

greptile-apps Bot commented Jun 11, 2026

Contributor

Greptile Summary

This PR adds first-class GooseFS (Tencent Cloud's distributed cache filesystem) support to Daft's I/O layer by wiring up OpenDAL's services-goosefs backend, following the same pattern as the existing cos/obs/oss integrations. A new GoosefsConfig Rust struct and PyO3 binding are introduced, with goosefs:// URLs routed through the shared OpenDALSource dispatcher.

  • Adds GoosefsConfig with fields for master address, root path, block/chunk sizes, write type, auth, and connection tuning — exposed in both Rust and Python with pickling, __repr__, replace, and from_env support.
  • Extends IOClient::get_source_and_path to extract the URL authority as a default master_addr when a goosefs:// path is resolved, mirroring how COS extracts the bucket.
  • Bumps opendal with the services-goosefs feature, pulling in goosefs-sdk 0.1.5 and two transitive dependencies (hostname, tonic-prost).
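
The authority extraction described in the second bullet can be sketched in Python as follows. The real logic lives in Rust inside `IOClient::get_source_and_path`; the helper names here are hypothetical, and the precedence rule (an explicitly configured address wins over the URL authority) is an assumption drawn from the word "default" above.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit


def split_goosefs_url(url: str) -> Tuple[Optional[str], str]:
    """Return (authority, path) for a goosefs:// URL (hypothetical helper)."""
    parts = urlsplit(url)
    if parts.scheme != "goosefs":
        raise ValueError(f"not a goosefs URL: {url!r}")
    authority = parts.netloc or None  # e.g. "host:9200"; None when absent
    return authority, parts.path


def resolve_master_addr(configured: Optional[str], url: str) -> Optional[str]:
    # Assumption: the URL authority is only a fallback for an unset master_addr.
    authority, _ = split_goosefs_url(url)
    return configured if configured is not None else authority
```

With this shape, `goosefs://host:9200/path` yields `host:9200` only when the config leaves `master_addr` unset.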

Confidence Score: 3/5

Safe to merge for basic connectivity, but user-configured timeout, retry, and concurrency settings are accepted and stored yet never forwarded to the GooseFS backend.

The routing and credential/auth path work correctly, and the overall integration structure mirrors the battle-tested COS path. However, to_opendal_config silently drops six user-facing tuning fields — they appear in the API, are documented, and have defaults, but are never inserted into the OpenDAL config map. A user setting connect_timeout_ms=2000 for a flaky cluster will see no effect, and there is no warning.

src/common/io-config/src/goosefs.rs — the to_opendal_config method needs to forward the timeout/retry/concurrency fields to the returned map.

Important Files Changed

| Filename | Overview |
|---|---|
| src/common/io-config/src/goosefs.rs | New GoosefsConfig struct with many fields; to_opendal_config silently drops all timeout/retry/concurrency fields, making them no-ops despite being user-configurable. |
| src/common/io-config/src/python.rs | Adds GoosefsConfig PyO3 class following CosConfig pattern; max_connections Python param maps to max_connections_per_io_thread Rust field correctly. |
| src/common/io-config/src/config.rs | GoosefsConfig added to IOConfig struct and display methods; always shown in multiline_display even at defaults. |
| src/daft-io/src/lib.rs | Routes goosefs:// URIs to OpenDAL, extracts host:port as default master_addr from URL authority; logic mirrors COS bucket extraction. |
| src/daft-io/src/opendal_source.rs | Adds goosefs to available_schemes list; no structural changes to the OpenDAL operator dispatch. |
| src/daft-io/Cargo.toml | Adds services-goosefs feature flag to opendal dependency. |
| src/common/io-config/src/lib.rs | Exports GoosefsConfig from the goosefs module; straightforward addition. |
| Cargo.lock | Adds goosefs-sdk 0.1.5, opendal-service-goosefs 0.57.0, and hostname 0.4.2 as new transitive deps. |

Sequence Diagram

sequenceDiagram
    participant User as Python User
    participant IOClient as IOClient
    participant GooseFSConfig as GoosefsConfig
    participant OpenDAL as OpenDALSource
    participant GFS as GooseFS Cluster

    User->>IOClient: daft.read_parquet(goosefs://host:9200/path)
    IOClient->>IOClient: "parse_url => SourceType::OpenDAL{scheme:goosefs}"
    IOClient->>IOClient: extract authority host:9200 from URL
    IOClient->>GooseFSConfig: to_opendal_config(host:9200)
    GooseFSConfig-->>IOClient: "BTreeMap{master_addr, root, auth}"
    IOClient->>OpenDAL: get_client(goosefs, config_map)
    OpenDAL->>OpenDAL: Operator::via_iter(goosefs, config_map)
    OpenDAL-->>IOClient: Arc OpenDALSource
    IOClient->>OpenDAL: get(path, range, io_stats)
    OpenDAL->>GFS: gRPC read request
    GFS-->>OpenDAL: bytes stream
    OpenDAL-->>User: GetResult::Stream

Comments Outside Diff (1)

  1. src/common/io-config/src/python.rs, line 761-784 (link)

    P2 from_env not documented in the class docstring

    GoosefsConfig.from_env() is a static method that reads GOOSEFS_MASTER_ADDR, GOOSEFS_AUTH_USERNAME, GOOSEFS_AUTH_PASSWORD, GOOSEFS_AUTH_TYPE, GOOSEFS_WRITE_TYPE, and GOOSEFS_ROOT — but none of these environment variables are listed in the class docstring above the #[pyclass]. Users have no discovery path for this method or its expected env var names.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
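
    For reference, the lookup that `from_env` performs can be sketched in Python. The environment variable names are the ones listed in this comment; the field names on the right-hand side and the helper name are assumptions based on the fields that appear elsewhere in this review, not the PR's actual code.

```python
import os
from typing import Dict, Mapping, Optional

# Env var names come from the review comment; the field mapping is assumed.
ENV_TO_FIELD = {
    "GOOSEFS_MASTER_ADDR": "master_addr",
    "GOOSEFS_AUTH_USERNAME": "auth_username",
    "GOOSEFS_AUTH_PASSWORD": "auth_password",
    "GOOSEFS_AUTH_TYPE": "auth_type",
    "GOOSEFS_WRITE_TYPE": "write_type",
    "GOOSEFS_ROOT": "root",
}


def goosefs_fields_from_env(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect GooseFS settings from the environment, skipping unset variables."""
    env = os.environ if env is None else env
    return {field: env[name] for name, field in ENV_TO_FIELD.items() if name in env}
```

    A table like `ENV_TO_FIELD` is also roughly what the class docstring would need to list for users to discover the variables.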

Reviews (1): Last reviewed commit: "feat(io): add GooseFS support via OpenDA..."

Comment on lines +290 to +343
#[test]
fn test_goosefs_config_display_masks_password() {
    let config = GoosefsConfig {
        master_addr: Some("m:9200".to_string()),
        auth_username: Some("alice".to_string()),
        auth_password: Some("super-secret".to_string().into()),
        ..Default::default()
    };
    let s = format!("{}", config);
    assert!(s.contains("GoosefsConfig"));
    assert!(s.contains("alice"));
    assert!(!s.contains("super-secret"));
    assert!(s.contains("***"));
}

#[test]
fn test_goosefs_config_multiline_display() {
    let config = GoosefsConfig {
        root: Some("/data".to_string()),
        master_addr: Some("m:9200".to_string()),
        block_size: Some(1024),
        chunk_size: Some(256),
        write_type: Some("cache_through".to_string()),
        auth_type: Some("simple".to_string()),
        auth_username: Some("alice".to_string()),
        auth_password: Some("secret".to_string().into()),
        ..Default::default()
    };
    let lines = config.multiline_display();
    assert!(lines.iter().any(|l| l.contains("Root = /data")));
    assert!(lines.iter().any(|l| l.contains("Master addr = m:9200")));
    assert!(lines.iter().any(|l| l.contains("Block size = 1024")));
    assert!(lines.iter().any(|l| l.contains("Chunk size = 256")));
    assert!(
        lines
            .iter()
            .any(|l| l.contains("Write type = cache_through"))
    );
    assert!(lines.iter().any(|l| l.contains("Auth type = simple")));
    assert!(lines.iter().any(|l| l.contains("Auth username = alice")));
    assert!(lines.iter().any(|l| l.contains("Auth password = ***")));
}

#[test]
fn test_goosefs_config_equality_and_hash() {
    use std::{
        collections::hash_map::DefaultHasher,
        hash::{Hash, Hasher},
    };

    let c1 = GoosefsConfig {
        master_addr: Some("m:9200".to_string()),
        ..Default::default()
    };

Contributor


P1 Timeout/retry/concurrency fields silently dropped from OpenDAL config

to_opendal_config populates master_addr, root, block_size, chunk_size, write_type, and credential fields — but never inserts max_retries, retry_timeout_ms, connect_timeout_ms, read_timeout_ms, max_concurrent_requests, or max_connections_per_io_thread into the returned map. The OpenDAL services-goosefs backend exposes these as GooseFSConfig fields (e.g. retry_timeout, connection_timeout, parallel). Users who set GooseFSConfig(connect_timeout_ms=5000) will see their value stored and displayed, but it is never forwarded to the underlying backend.
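
One possible shape of the fix, sketched in Python rather than the PR's Rust: iterate over every field and forward each one that differs from its default. Everything here is illustrative. `GoosefsConfigSketch` is a hypothetical class, the default values are placeholders, and the map keys simply reuse the Daft field names, whereas the real OpenDAL option names may differ.

```python
from dataclasses import dataclass, fields
from typing import Dict, Optional


@dataclass
class GoosefsConfigSketch:
    master_addr: Optional[str] = None
    root: Optional[str] = None
    # Placeholder defaults, not the PR's real values.
    max_retries: int = 3
    retry_timeout_ms: int = 30_000
    connect_timeout_ms: int = 10_000
    read_timeout_ms: int = 30_000
    max_concurrent_requests: int = 8
    max_connections_per_io_thread: int = 8

    def to_opendal_config(self) -> Dict[str, str]:
        """Build the string map handed to the OpenDAL operator."""
        defaults = GoosefsConfigSketch()
        out: Dict[str, str] = {}
        for f in fields(self):
            value = getattr(self, f.name)
            # Forward a field only when it differs from its default,
            # so an untouched config produces an empty map.
            if value is not None and value != getattr(defaults, f.name):
                out[f.name] = str(value)
        return out
```

Driving the map from the field list, instead of inserting keys one by one, makes it harder for a newly added tuning field to be dropped silently.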

Comment thread src/common/io-config/src/config.rs Outdated
Comment on lines +78 to +81
res.push(format!(
    "GooseFS config = {{ {} }}",
    self.goosefs.multiline_display().join(", ")
));

Contributor


P2 GooseFS config always shown in multiline_display even at defaults

multiline_display unconditionally adds a GooseFS config = {{ ... }} line. When GoosefsConfig is at its defaults, multiline_display() still returns many non-empty lines (Anonymous, Max retries, Retry timeout, Connect timeout, Read timeout, Max concurrent requests, Max connections) because those fields always emit — unlike truly sparse configs. If the intent is only to display non-default values, numeric fields need their own if value != default guards (analogous to how auth_username is guarded with if let Some(...)).

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
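
If a sparse display is the intent, the guards could look roughly like the Python sketch below. It is a simplified illustration with a hypothetical `DisplaySketch` class, a placeholder default for `max_retries`, and only four of the fields; the display labels follow the ones used in the tests and in this comment.

```python
from dataclasses import dataclass
from typing import List, Optional

DEFAULT_MAX_RETRIES = 3  # placeholder default, not taken from the PR


@dataclass
class DisplaySketch:
    master_addr: Optional[str] = None
    auth_password: Optional[str] = None
    anonymous: bool = False
    max_retries: int = DEFAULT_MAX_RETRIES

    def multiline_display(self) -> List[str]:
        lines: List[str] = []
        if self.master_addr is not None:
            lines.append(f"Master addr = {self.master_addr}")
        if self.auth_password is not None:
            lines.append("Auth password = ***")  # never print the secret itself
        if self.anonymous:  # default is False, so emit only when enabled
            lines.append(f"Anonymous = {self.anonymous}")
        if self.max_retries != DEFAULT_MAX_RETRIES:
            lines.append(f"Max retries = {self.max_retries}")
        return lines


def io_config_lines(goosefs: DisplaySketch) -> List[str]:
    # Omit the GooseFS line entirely when nothing differs from the defaults.
    inner = goosefs.multiline_display()
    return [f"GooseFS config = {{ {', '.join(inner)} }}"] if inner else []
```

With these guards a default-constructed config contributes no lines at all, so the enclosing display can skip the GooseFS entry.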

@XuQianJin-Stars XuQianJin-Stars force-pushed the feat/goosefs-support branch 2 times, most recently from 6ab2cd1 to 574fc4d June 17, 2026 04:25
… multiline display

Address review feedback on GoosefsConfig:

- P1: to_opendal_config now forwards max_retries, retry_timeout_ms, connect_timeout_ms, read_timeout_ms, max_concurrent_requests and max_connections_per_io_thread into the returned config map (only when non-default), so user-provided values are no longer silently dropped at the Daft layer.

- P2: multiline_display now gates every numeric/boolean field with an if value != default guard, mirroring how auth_username is handled. A default-constructed GoosefsConfig produces an empty multiline view, and IOConfig::multiline_display omits the GooseFS config = { ... } line entirely in that case.

Adds regression tests covering both behaviours.


Development

Successfully merging this pull request may close these issues.

feat: Add GooseFS support as an OpenDAL-backed I/O source

1 participant