Description
Context
I think FileDataset is mangling cloud filepaths. When I set a filepath that starts with abfss://, FileDataset (via pathlib) drops the second forward slash, turning it into abfss:/, and then reports that the path does not exist.
As a result, I can read the directory directly with ibis, but the same path fails in Kedro. This works:
import ibis
ispark = ibis.pyspark.connect(session = spark)
my_path = 'abfss://container@account.dfs.core.windows.net/folder/parquet_directory'
df = ispark.read_parquet_dir(my_path)
This fails with [[PATH_NOT_FOUND](https://learn.microsoft.com/azure/databricks/error-messages/error-classes#path_not_found)] Path does not exist: dbfs:/Workspace/Users/me@company.com/notebooks/abfss:/container@account.dfs.core.windows.net/folder/parquet_directory. SQLSTATE: 42K03:
my_data:
type: ibis.FileDataset
filepath: abfss://container@account.dfs.core.windows.net/folder/parquet_directory
file_format: parquet_dir
table_name: my_table
connection: ${globals:my_backend}
Steps to Reproduce
A fully reproducible example is hard to put together. I tried to use an open source dataset hosted on Google, but my (work) Databricks environment blocked it, so that was a dead end. For now I'm providing the code with abstracted paths, and I'll update this issue if I can build a better reproducible example or debug further on my side.
I set up a catalog entry:
my_data:
type: ibis.FileDataset
filepath: abfss://container@account.dfs.core.windows.net/folder/parquet_directory
file_format: parquet_dir
table_name: my_table
connection: ${globals:my_backend}
I install the project and its dependencies on my Databricks cluster, load the Kedro IPython extension, and call catalog.load():
%pip install -e ../.
dbutils.library.restartPython()
import os
os.environ["KEDRO_ENV"] = "dev"
%load_ext kedro.ipython
catalog.load("my_data")
[PATH_NOT_FOUND] Path does not exist: dbfs:/Workspace/Users/user@company.com/notebooks/abfss://container@account.dfs.core.windows.net/folder/parquet_directory. SQLSTATE: 42K03
I confirmed reading the same path in ibis did work:
import ibis
filepath = 'abfss://container@account.dfs.core.windows.net/folder/parquet_directory'
ispark.read_parquet_dir(filepath)
Looking at file_dataset.py, I see where the forward slash is dropped by pathlib:
from pathlib import Path, PurePosixPath
my_file = 'abfss://container@account.dfs.core.windows.net/folder/parquet_directory'
pure_path = PurePosixPath(my_file)
print(pure_path)  # print the converted path, not the original string
> abfss:/container@account.dfs.core.windows.net/folder/parquet_directory # Missing one of the forward slashes after the colon
Path(pure_path).exists()
> False
Expected Result
The FileDataset should be able to read data the same way ibis can.
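To make the expected behaviour concrete, here is a minimal, standard-library-only sketch of one way a path could be normalized without losing the scheme. The helper names `split_scheme` and `normalize_keeping_scheme` are hypothetical and are not part of kedro-datasets or ibis; this is an illustration of the idea under that assumption, not a proposed patch.

```python
from pathlib import PurePosixPath


def split_scheme(filepath: str) -> tuple[str, str]:
    """Split 'scheme://rest' into ('scheme', 'rest').

    Hypothetical helper: returns an empty scheme for plain local paths.
    """
    scheme, sep, rest = filepath.partition("://")
    if not sep:
        return "", filepath
    return scheme, rest


def normalize_keeping_scheme(filepath: str) -> str:
    """Normalize only the part after '://' so pathlib never sees the scheme."""
    scheme, rest = split_scheme(filepath)
    normalized = str(PurePosixPath(rest))
    return f"{scheme}://{normalized}" if scheme else normalized


uri = "abfss://container@account.dfs.core.windows.net/folder/parquet_directory"

# Passing the whole URI through pathlib collapses '//' into '/'.
naive = str(PurePosixPath(uri))

# Splitting the scheme off first leaves the URI intact.
safe = normalize_keeping_scheme(uri)

print(naive)
print(safe)
```

With this approach the string handed to the backend keeps its abfss:// prefix, while local relative paths still go through the usual pathlib normalization.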
Actual Result
The filepath was modified by pathlib (abfss:// became abfss:/), and the load failed with PATH_NOT_FOUND:
2026-02-04 18:55:45,428 8088 ERROR _handle_rpc_error GRPC Error received
Traceback (most recent call last):
File "/databricks/spark/python/pyspark/sql/connect/client/core.py", line 1726, in _execute_and_fetch_as_iterator
for b in generator:
File "<frozen _collections_abc>", line 330, in __next__
File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 139, in send
if not self._has_next():
^^^^^^^^^^^^^^^^
File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 200, in _has_next
raise e
File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 172, in _has_next
self._current = self._call_iter(
^^^^^^^^^^^^^^^^
File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 297, in _call_iter
raise e
File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 277, in _call_iter
return iter_fun()
^^^^^^^^^^
File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 173, in <lambda>
lambda: next(self._iterator) # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/grpc/_channel.py", line 540, in __next__
return self._next()
^^^^^^^^^^^^
File "/databricks/python/lib/python3.11/site-packages/grpc/_channel.py", line 966, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.INTERNAL
details = "[PATH_NOT_FOUND] Path does not exist: dbfs:/Workspace/Users/m109993@8451.com/camper/notebooks/abfss:/measure@sa8451camprd.dfs.core.windows.net/kmplus/reports/uplift/version=v2/source=azure/campaign_type=SSE/campaign_id=162659. SQLSTATE: 42K03"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"[PATH_NOT_FOUND] Path does not exist: dbfs:/Workspace/Users/m109993@8451.com/camper/notebooks/abfss:/measure@sa8451camprd.dfs.core.windows.net/kmplus/reports/uplift/version=v2/source=azure/campaign_type=SSE/campaign_id=162659. SQLSTATE: 42K03", grpc_status:13, created_time:"2026-02-04T18:55:45.427632826+00:00"}"
>
Your Environment
Relevant details about the environment in which I experienced the bug:
- kedro 1.2.0
- kedro-datasets 9.1.1
- ibis 11.0.0
- Databricks Environment 15.4 LTS (Spark 3.5.0)