Describe the bug
I would like to read all Parquet files from subdirectories in S3, running from Databricks. My data is partitioned by yyyy, mm, dd, and hh, but I want to validate a whole day at once. `recursive_file_lookup` does not seem to work as expected; instead I get:

```
TestConnectionError: No file in bucket "my_bucket" with prefix "" and recursive file discovery set to "False" found using delimiter "/" for DataAsset "inventory_parts_asset_".
```
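As a sanity check (not part of the original repro), listing the objects directly with boto3 can confirm whether the files are actually there under the day-level prefix. This is a minimal sketch; the bucket, prefix, and credential values are the placeholders from the reproduction below, not real ones:

```python
import boto3

# Placeholder values copied from the reproduction; substitute real ones.
boto3_options = {
    "region_name": "region",
    "endpoint_url": "endpoint_url",
    "aws_access_key_id": "key_id",
    "aws_secret_access_key": "access_key",
}

s3 = boto3.client("s3", **boto3_options)

# List everything under the day-level prefix. Omitting Delimiter makes the
# listing recursive, so this should return the same files the asset is
# expected to discover with recursive_file_lookup=True.
paginator = s3.get_paginator("list_objects_v2")
keys = [
    obj["Key"]
    for page in paginator.paginate(
        Bucket="my_bucket", Prefix="my_prefix/yyyy=2025/mm=03/dd=09/"
    )
    for obj in page.get("Contents", [])
    if obj["Key"].endswith(".parquet")
]
print(f"{len(keys)} parquet files found under the day-level prefix")
```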
To Reproduce
```python
import great_expectations as gx

# Get the Ephemeral Data Context
context = gx.get_context(mode="ephemeral")
assert type(context).__name__ == "EphemeralDataContext"

# Define the Data Source's parameters:
data_source_name = "source_name"
bucket_name = "my_bucket"
boto3_options = {
    "region_name": "region",
    "endpoint_url": "endpoint_url",
    "aws_access_key_id": "key_id",
    "aws_secret_access_key": "access_key",
}

# Create the Data Source:
data_source = context.data_sources.add_or_update_spark_s3(
    name=data_source_name, bucket=bucket_name, boto3_options=boto3_options
)

asset_name = "inventory_parts_asset_"
```
This does not work (day-level prefix):

```python
s3_prefix = "my_prefix/yyyy=2025/mm=03/dd=09/"
data_asset = data_source.add_parquet_asset(name=asset_name, s3_prefix=s3_prefix, recursive_file_lookup=True)
```
This works (hour-level prefix):

```python
s3_prefix = "my_prefix/yyyy=2025/mm=03/dd=09/hh=00/"
data_asset = data_source.add_parquet_asset(name=asset_name, s3_prefix=s3_prefix, recursive_file_lookup=True)
```
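Until recursive lookup works at the day level, one possible workaround that relies only on the `add_parquet_asset` call already shown above is to register one asset per hour partition and loop over the hours. This is just a sketch under the assumption that every hh=00 through hh=23 prefix exists; hours with no data will raise the same TestConnectionError, so skip or catch those:

```python
# Hypothetical workaround: one asset per hour partition, using only the
# add_parquet_asset call from the reproduction above.
for hour in range(24):
    hourly_prefix = f"my_prefix/yyyy=2025/mm=03/dd=09/hh={hour:02d}/"
    data_source.add_parquet_asset(
        name=f"{asset_name}hh_{hour:02d}",
        s3_prefix=hourly_prefix,
        recursive_file_lookup=True,
    )
```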
Expected behavior
With recursive_file_lookup=True and a day-level prefix, all Parquet files in the hh=* subdirectories should be discovered and read.
Environment (please complete the following information):
- Operating System: macOS
- Great Expectations Version: 1.3.9
- Cloud environment: AWS