bug(datasets): ibis.FileDataset does not work with abfss (or other cloud protocols?) #1298

@mark-druffel

Description

Context

I think FileDataset is mangling cloud filepaths. When I set the filepath to an abfss:// URL, FileDataset (via pathlib) drops the second forward slash, turning it into abfss:/, and the load then fails because the path does not exist.

The effect is that I can read the files directly with ibis, but the same path fails in kedro. This works:

import ibis
# `spark` is the SparkSession provided by the Databricks notebook
ispark = ibis.pyspark.connect(session=spark)
my_path = 'abfss://container@account.dfs.core.windows.net/folder/parquet_directory'
df = ispark.read_parquet_dir(my_path)

This fails with [PATH_NOT_FOUND](https://learn.microsoft.com/azure/databricks/error-messages/error-classes#path_not_found) Path does not exist: dbfs:/Workspace/Users/me@company.com/notebooks/abfss:/container@account.dfs.core.windows.net/folder/parquet_directory. SQLSTATE: 42K03:

my_data:
  type: ibis.FileDataset
  filepath: abfss://container@account.dfs.core.windows.net/folder/parquet_directory
  file_format: parquet_dir
  table_name: my_table
  connection: ${globals:my_backend}

Steps to Reproduce

This is challenging to reproduce outside my environment. I tried to use an open-source dataset hosted on Google, but my (work) Databricks environment blocked it, so that was a dead end. Given that, I'm just providing the code with abstracted paths for now, and I'll update the issue if I can get a better reproducible example or debug further on my side.

I set up a catalog entry:

my_data:
  type: ibis.FileDataset
  filepath: abfss://container@account.dfs.core.windows.net/folder/parquet_directory
  file_format: parquet_dir
  table_name: my_table
  connection: ${globals:my_backend}

I install the project and its dependencies on my Databricks cluster, load the kedro IPython extension, and call catalog.load():

%pip install -e ../.
dbutils.library.restartPython()
import os
os.environ["KEDRO_ENV"] = "dev"
%load_ext kedro.ipython
catalog.load("my_data")

[PATH_NOT_FOUND] Path does not exist: dbfs:/Workspace/Users/user@company.com/notebooks/abfss:/container@account.dfs.core.windows.net/folder/parquet_directory. SQLSTATE: 42K03

I confirmed that reading the same path directly with ibis works:

import ibis
ispark = ibis.pyspark.connect(session=spark)  # same connection as in the Context section
filepath = 'abfss://container@account.dfs.core.windows.net/folder/parquet_directory'
ispark.read_parquet_dir(filepath)

Looking at file_dataset.py, I see where the forward slash is dropped by pathlib.

from pathlib import Path, PurePosixPath
my_file = 'abfss://container@account.dfs.core.windows.net/folder/parquet_directory'
pure_path = PurePosixPath(my_file)
print(pure_path)  # print the parsed path, not the original string
> abfss:/container@account.dfs.core.windows.net/folder/parquet_directory  # Missing one of the forward slashes after the colon
Path(pure_path).exists()
> False
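
One possible direction for a fix, as a rough sketch only: split off the protocol before handing the rest of the path to pathlib, then put the protocol back afterwards. The helper names `split_protocol` and `normalize_filepath` below are hypothetical and are not part of kedro-datasets; I have not checked how FileDataset would need to be restructured to use something like this.

```python
from pathlib import PurePosixPath


def split_protocol(filepath: str) -> tuple[str, str]:
    """Split 'proto://rest' into ('proto', 'rest'); local paths get an empty protocol."""
    protocol, sep, rest = filepath.partition("://")
    if not sep:
        # No '://' found, so treat the whole string as a local path.
        return "", filepath
    return protocol, rest


def normalize_filepath(filepath: str) -> str:
    """Normalize only the path portion so the '://' separator survives."""
    protocol, path = split_protocol(filepath)
    normalized = str(PurePosixPath(path))
    return f"{protocol}://{normalized}" if protocol else normalized


cloud = "abfss://container@account.dfs.core.windows.net/folder/parquet_directory"

# Passing the full URL through PurePosixPath collapses '//' into '/'.
mangled = str(PurePosixPath(cloud))

# Splitting first keeps the URL intact.
preserved = normalize_filepath(cloud)
```

With this approach the cloud URL round-trips unchanged, while local paths are still normalized by pathlib as before.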

Expected Result

The FileDataset should be able to read data the same way ibis can.

Actual Result

The filepath was modified by pathlib.
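
My guess at why the error path gets a dbfs:/Workspace/... prefix: once the second slash is gone, the string no longer looks like a URL, and it does not start with a slash either, so it is treated as a relative path and resolved against the notebook's working directory. I have not traced this inside Spark; the snippet below only mimics that resolution with pathlib, and the working directory value is a placeholder taken from the abstracted error message.

```python
from pathlib import PurePosixPath

# Placeholder working directory, copied from the abstracted error message.
cwd = PurePosixPath("/Workspace/Users/user@company.com/notebooks")

# What the filepath looks like after it has gone through PurePosixPath.
mangled = PurePosixPath(
    "abfss://container@account.dfs.core.windows.net/folder/parquet_directory"
)

# The mangled path has no leading slash, so it is relative.
is_absolute = mangled.is_absolute()

# Joining it onto the working directory reproduces the shape of the failing path.
resolved = str(cwd / mangled)
```

The resulting string has the same shape as the path in the PATH_NOT_FOUND error, minus the dbfs: prefix.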

2026-02-04 18:55:45,428 8088 ERROR _handle_rpc_error GRPC Error received
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/sql/connect/client/core.py", line 1726, in _execute_and_fetch_as_iterator
    for b in generator:
  File "<frozen _collections_abc>", line 330, in __next__
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 139, in send
    if not self._has_next():
           ^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 200, in _has_next
    raise e
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 172, in _has_next
    self._current = self._call_iter(
                    ^^^^^^^^^^^^^^^^
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 297, in _call_iter
    raise e
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 277, in _call_iter
    return iter_fun()
           ^^^^^^^^^^
  File "/databricks/spark/python/pyspark/sql/connect/client/reattach.py", line 173, in <lambda>
    lambda: next(self._iterator)  # type: ignore[arg-type]
            ^^^^^^^^^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/grpc/_channel.py", line 540, in __next__
    return self._next()
           ^^^^^^^^^^^^
  File "/databricks/python/lib/python3.11/site-packages/grpc/_channel.py", line 966, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.INTERNAL
	details = "[PATH_NOT_FOUND] Path does not exist: dbfs:/Workspace/Users/m109993@8451.com/camper/notebooks/abfss:/measure@sa8451camprd.dfs.core.windows.net/kmplus/reports/uplift/version=v2/source=azure/campaign_type=SSE/campaign_id=162659. SQLSTATE: 42K03"
	debug_error_string = "UNKNOWN:Error received from peer  {grpc_message:"[PATH_NOT_FOUND] Path does not exist: dbfs:/Workspace/Users/m109993@8451.com/camper/notebooks/abfss:/measure@sa8451camprd.dfs.core.windows.net/kmplus/reports/uplift/version=v2/source=azure/campaign_type=SSE/campaign_id=162659. SQLSTATE: 42K03", grpc_status:13, created_time:"2026-02-04T18:55:45.427632826+00:00"}"
>

Your Environment

Include as many relevant details about the environment in which you experienced the bug:

  • kedro 1.2.0
  • kedro-datasets 9.1.1
  • ibis 11.0.0
  • Databricks Environment 15.4 LTS (Spark 3.5.0)
