feat(io): add GooseFS support via OpenDAL services-goosefs#7109
feat(io): add GooseFS support via OpenDAL services-goosefs#7109XuQianJin-Stars wants to merge 2 commits into
Conversation
Greptile SummaryThis PR adds first-class GooseFS (Tencent Cloud's distributed cache filesystem) support to Daft's I/O layer by wiring up OpenDAL's
Confidence Score: 3/5Safe to merge for basic connectivity, but user-configured timeout, retry, and concurrency settings are accepted and stored yet never forwarded to the GooseFS backend. The routing and credential/auth path work correctly, and the overall integration structure mirrors the battle-tested COS path. However, src/common/io-config/src/goosefs.rs — the Important Files Changed
Sequence DiagramsequenceDiagram
participant User as Python User
participant IOClient as IOClient
participant GooseFSConfig as GoosefsConfig
participant OpenDAL as OpenDALSource
participant GFS as GooseFS Cluster
User->>IOClient: daft.read_parquet(goosefs://host:9200/path)
IOClient->>IOClient: "parse_url => SourceType::OpenDAL{scheme:goosefs}"
IOClient->>IOClient: extract authority host:9200 from URL
IOClient->>GooseFSConfig: to_opendal_config(host:9200)
GooseFSConfig-->>IOClient: "BTreeMap{master_addr, root, auth}"
IOClient->>OpenDAL: get_client(goosefs, config_map)
OpenDAL->>OpenDAL: Operator::via_iter(goosefs, config_map)
OpenDAL-->>IOClient: Arc OpenDALSource
IOClient->>OpenDAL: get(path, range, io_stats)
OpenDAL->>GFS: gRPC read request
GFS-->>OpenDAL: bytes stream
OpenDAL-->>User: GetResult::Stream
|
| #[test] | ||
| fn test_goosefs_config_display_masks_password() { | ||
| let config = GoosefsConfig { | ||
| master_addr: Some("m:9200".to_string()), | ||
| auth_username: Some("alice".to_string()), | ||
| auth_password: Some("super-secret".to_string().into()), | ||
| ..Default::default() | ||
| }; | ||
| let s = format!("{}", config); | ||
| assert!(s.contains("GoosefsConfig")); | ||
| assert!(s.contains("alice")); | ||
| assert!(!s.contains("super-secret")); | ||
| assert!(s.contains("***")); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_goosefs_config_multiline_display() { | ||
| let config = GoosefsConfig { | ||
| root: Some("/data".to_string()), | ||
| master_addr: Some("m:9200".to_string()), | ||
| block_size: Some(1024), | ||
| chunk_size: Some(256), | ||
| write_type: Some("cache_through".to_string()), | ||
| auth_type: Some("simple".to_string()), | ||
| auth_username: Some("alice".to_string()), | ||
| auth_password: Some("secret".to_string().into()), | ||
| ..Default::default() | ||
| }; | ||
| let lines = config.multiline_display(); | ||
| assert!(lines.iter().any(|l| l.contains("Root = /data"))); | ||
| assert!(lines.iter().any(|l| l.contains("Master addr = m:9200"))); | ||
| assert!(lines.iter().any(|l| l.contains("Block size = 1024"))); | ||
| assert!(lines.iter().any(|l| l.contains("Chunk size = 256"))); | ||
| assert!( | ||
| lines | ||
| .iter() | ||
| .any(|l| l.contains("Write type = cache_through")) | ||
| ); | ||
| assert!(lines.iter().any(|l| l.contains("Auth type = simple"))); | ||
| assert!(lines.iter().any(|l| l.contains("Auth username = alice"))); | ||
| assert!(lines.iter().any(|l| l.contains("Auth password = ***"))); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_goosefs_config_equality_and_hash() { | ||
| use std::{ | ||
| collections::hash_map::DefaultHasher, | ||
| hash::{Hash, Hasher}, | ||
| }; | ||
|
|
||
| let c1 = GoosefsConfig { | ||
| master_addr: Some("m:9200".to_string()), | ||
| ..Default::default() | ||
| }; |
There was a problem hiding this comment.
Timeout/retry/concurrency fields silently dropped from OpenDAL config
to_opendal_config populates master_addr, root, block_size, chunk_size, write_type, and credential fields — but never inserts max_retries, retry_timeout_ms, connect_timeout_ms, read_timeout_ms, max_concurrent_requests, or max_connections_per_io_thread into the returned map. The OpenDAL services-goosefs backend exposes these as GooseFSConfig fields (e.g. retry_timeout, connection_timeout, parallel). Users who set GooseFSConfig(connect_timeout_ms=5000) will see their value stored and displayed, but it is never forwarded to the underlying backend.
| res.push(format!( | ||
| "GooseFS config = {{ {} }}", | ||
| self.goosefs.multiline_display().join(", ") | ||
| )); |
There was a problem hiding this comment.
GooseFS config always shown in multiline_display even at defaults
multiline_display unconditionally adds a GooseFS config = {{ ... }} line. When GoosefsConfig is at its defaults, multiline_display() still returns many non-empty lines (Anonymous, Max retries, Retry timeout, Connect timeout, Read timeout, Max concurrent requests, Max connections) because those fields always emit — unlike truly sparse configs. If the intent is only to display non-default values, numeric fields need their own if value != default guards (analogous to how auth_username is guarded with if let Some(...)).
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
6ab2cd1 to
574fc4d
Compare
… multiline display
Address review feedback on GoosefsConfig:
- P1: to_opendal_config now forwards max_retries, retry_timeout_ms, connect_timeout_ms, read_timeout_ms, max_concurrent_requests and max_connections_per_io_thread into the returned config map (only when non-default), so user-provided values are no longer silently dropped at the Daft layer.
- P2: multiline_display now gates every numeric/boolean field with an if value != default guard, mirroring how auth_username is handled. A default-constructed GoosefsConfig produces an empty multiline view, and IOConfig::multiline_display omits the GooseFS config = { ... } line entirely in that case.
Adds regression tests covering both behaviours.
574fc4d to
434897f
Compare
Changes Made
This PR adds first-class support for GooseFS (Tencent Cloud's distributed cache/acceleration filesystem) as a Daft I/O backend, mirroring the existing
oss/cos/obsintegrations. It is implemented entirely on top of OpenDAL's newservices-goosefsbackend (available since OpenDAL 0.57.0), so the surface area in Daft is intentionally small.Implementation details
Dependency bump
Cargo.toml: bumpopendalto0.57.0and enable theservices-goosefsfeature.Cargo.lock: regenerated.New config:
GooseFSConfigsrc/common/io-config/src/goosefs.rs(new file) — Rust config struct with:endpoint: Option<String>— GooseFS master endpoint (e.g.http://master:9200).root: Option<String>— optional root path inside the GooseFS namespace.anonymous: bool— anonymous access toggle (defaults tofalse).Default,Display,multiline_display, and unit tests for round-tripping / formatting, consistent withcos.rs/obs.rs.src/common/io-config/src/lib.rsand added toIOConfignext to the other cloud configs.Python bindings
src/common/io-config/src/python.rs: addGooseFSConfigPyO3 class with__init__,replace,__repr__,__eq__, pickling (__reduce__), and field accessors — same shape asCOSConfig.IOConfig.goosefsso it round-trips between Python and Rust.OpenDAL source registration
src/daft-io/src/opendal_source.rs:goosefstoOpenDALSource::available_schemes().Goosefsoperator fromGooseFSConfig(endpoint / root / anonymous).src/daft-io/src/lib.rs: routegoosefs://...URIs to the new source in the scheme dispatcher.Tests
GooseFSConfig(defaults,replace, display, multiline display).User-facing behavior
After this PR, GooseFS-backed paths can be used directly with Daft's standard readers/writers:
All existing readers (
read_parquet,read_csv,read_json,read_iceberg, ...) and writers work transparently because everything goes through the sharedOpenDALSourceplumbing — no reader-specific changes were needed.Why OpenDAL (vs. a custom client)
GooseFS already has an official OpenDAL backend as of
opendal0.57.0, so plugging it in keeps Daft's I/O code uniform with the other OpenDAL-backed schemes (oss,cos,obs,huggingface) and lets us inherit upstream improvements for free. No new transport, retry, or auth code was introduced.Validation
cargo fmt --all✅cargo check --workspace✅cargo test -p common-io-config -p daft-io✅ (new GooseFS unit tests pass)daft.read_parquet("goosefs://...")against a local GooseFS cluster.Related Issues
Closes #7108