Support metadata columns (`location`, `size`, `last_modified`) in `ListingTableProvider`

### Is your feature request related to a problem or challenge?

The `ListingTableProvider` in DataFusion provides an implementation of a `TableProvider` that organizes a collection of (potentially hive partitioned) files in an object store into a single table.

Hive partition columns are injected into the listing table schema even though they don't exist in the physical Parquet files. In the same way, I'd like to be able to ask the `ListingTable` to inject metadata columns whose values come from the `ObjectMeta` provided by the `object_store` crate. I could then query and filter on the `location`, `size`, and `last_modified` columns.

I'd also like queries that filter on the metadata columns to prune files, similar to partition pruning. For example, given `SELECT * FROM my_listing_table WHERE last_modified > '2025-03-10'`, only files modified after `'2025-03-10'` should be passed to the `FileScanConfig` to be read.
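To make the pruning idea concrete, here is a minimal sketch of file-level filtering on `last_modified`. It is not the proposed implementation: `FileMeta` is a hypothetical stand-in for `object_store::ObjectMeta` (reduced to the three fields above), and `prune_by_last_modified` is an illustrative name, not a DataFusion API.

```rust
use std::time::SystemTime;

/// Hypothetical stand-in for `object_store::ObjectMeta`, reduced to the
/// three fields this issue proposes to expose as columns.
#[derive(Debug, Clone, PartialEq)]
pub struct FileMeta {
    pub location: String,
    pub last_modified: SystemTime,
    pub size: u64,
}

/// Keep only the files modified strictly after `cutoff`, mirroring a
/// `WHERE last_modified > <cutoff>` predicate evaluated per file.
/// (Illustrative helper; the name is an assumption.)
pub fn prune_by_last_modified(files: Vec<FileMeta>, cutoff: SystemTime) -> Vec<FileMeta> {
    files
        .into_iter()
        .filter(|file| file.last_modified > cutoff)
        .collect()
}
```

Only the files returned by such a filter would need to be opened; the rest are skipped without reading any data, which is the same benefit partition pruning gives today.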

My scenario: I'd like to efficiently ingest files from an object store bucket that I haven't seen before, and filtering on `last_modified` seems like a good solution.
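The ingestion scenario could be sketched as a watermark loop: remember the newest `last_modified` seen so far and, on the next run, take only files newer than it. This is a rough illustration under assumptions; `select_new_files` and the `(location, last_modified)` tuple representation are hypothetical, not part of DataFusion or this proposal.

```rust
use std::time::SystemTime;

/// Given a listing of `(location, last_modified)` pairs and the watermark
/// from the previous run, return the locations not yet ingested together
/// with the watermark to store for the next run.
/// (Hypothetical helper for illustration only.)
pub fn select_new_files(
    listing: &[(String, SystemTime)],
    watermark: Option<SystemTime>,
) -> (Vec<String>, Option<SystemTime>) {
    let mut new_files = Vec::new();
    let mut next_watermark = watermark;
    for (location, modified) in listing {
        // With no watermark yet, every file counts as new.
        let is_new = watermark.map_or(true, |w| *modified > w);
        if is_new {
            new_files.push(location.clone());
            // `None < Some(_)`, so this keeps the latest timestamp seen.
            next_watermark = next_watermark.max(Some(*modified));
        }
    }
    (new_files, next_watermark)
}
```

Note that the strict `>` comparison skips a file whose timestamp equals the stored watermark, so a real ingester would need to decide how to handle files written within the same timestamp granularity.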

This could potentially fold into the ongoing work in #13975 / #14057 / #14362 to mark these columns as proper system/metadata columns, but it fundamentally isn't blocked on that work. Since this would be opt-in for the consumer, automatically excluding these columns from a `SELECT *` doesn't seem required.

### Describe the solution you'd like

A new API on the `ListingOptions` struct that is passed to a `ListingTableConfig` which is passed to `ListingTable::try_new`.

```rust
    /// Sets metadata columns on [`ListingOptions`] and returns self.
    ///
    /// "metadata columns" are columns that are computed from the `ObjectMeta` of the files from object store.
    ///
    /// Available metadata columns:
    /// - `location`: The full path to the object
    /// - `last_modified`: The last modified time
    /// - `size`: The size in bytes of the object
    ///
    /// For example, given the following files in object store:
    ///
    /// ```text
    /// /mnt/nyctaxi/tripdata01.parquet
    /// /mnt/nyctaxi/tripdata02.parquet
    /// /mnt/nyctaxi/tripdata03.parquet
    /// ```
    ///
    /// If the `last_modified` field in the `ObjectMeta` for `tripdata01.parquet` is `2024-01-01 12:00:00`,
    /// then the table schema will include a column named `last_modified` with the value `2024-01-01 12:00:00`
    /// for all rows read from `tripdata01.parquet`.
    ///
    /// | <other columns> | last_modified         |
    /// |-----------------|-----------------------|
    /// | ...             | 2024-01-01 12:00:00   |
    /// | ...             | 2024-01-02 15:30:00   |
    /// | ...             | 2024-01-03 09:15:00   |
    ///
    /// # Example
    /// ```
    /// # use std::sync::Arc;
    /// # use datafusion::datasource::{listing::ListingOptions, file_format::parquet::ParquetFormat};
    ///
    /// let listing_options = ListingOptions::new(Arc::new(
    ///     ParquetFormat::default()
    ///   ))
    ///   .with_metadata_cols(vec![MetadataColumn::LastModified]);
    ///
    /// assert_eq!(listing_options.metadata_cols, vec![MetadataColumn::LastModified]);
    /// ```
    pub fn with_metadata_cols(mut self, metadata_cols: Vec<MetadataColumn>) -> Self {
        self.metadata_cols = metadata_cols;
        self
    }
```

The definition for `MetadataColumn` is a simple enum:

```rust
/// A metadata column that can be used to filter files
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MetadataColumn {
    /// The location of the file in object store
    Location,
    /// The last modified timestamp of the file
    LastModified,
    /// The size of the file in bytes
    Size,
}
```

The order of the `MetadataColumn` values passed to `with_metadata_cols` determines the order in which they appear in the table schema. Metadata columns are added after partition columns.
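As a rough sketch of the resulting column order, assuming file columns come first, then partition columns, then metadata columns in the order given: the `name` method and `table_column_names` function below are hypothetical helpers for illustration, and the enum is repeated only so the snippet stands alone.

```rust
/// Repeated from the definition above so this sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum MetadataColumn {
    Location,
    LastModified,
    Size,
}

impl MetadataColumn {
    /// Column name as listed in this issue (hypothetical helper method).
    pub fn name(&self) -> &'static str {
        match self {
            MetadataColumn::Location => "location",
            MetadataColumn::LastModified => "last_modified",
            MetadataColumn::Size => "size",
        }
    }
}

/// Build the table's column names: file columns, then partition columns,
/// then metadata columns in the caller-supplied order.
pub fn table_column_names(
    file_cols: &[&str],
    partition_cols: &[&str],
    metadata_cols: &[MetadataColumn],
) -> Vec<String> {
    file_cols
        .iter()
        .chain(partition_cols.iter())
        .map(|name| name.to_string())
        .chain(metadata_cols.iter().map(|col| col.name().to_string()))
        .collect()
}
```

For instance, passing `[LastModified, Size]` with one partition column `year` would place `last_modified` and `size` after `year`, in that order.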

### Describe alternatives you've considered

I considered what it might look like to make `ListingTableProvider` more extensible to be able to implement these changes without a core DataFusion change. I wasn't able to come up with anything simpler than the above though.

Another option might be to make many of the internals of `ListingTableProvider` public, so that it is easier for people to maintain their own customized versions of `ListingTableProvider`.

### Additional context

I've already implemented this in my project, and I will be upstreaming my change and linking it to this issue. To see what the implementation looks like, see: https://github.com/spiceai/datafusion/pull/74

And to see the changes needed to integrate with it from a consuming project, see: https://github.com/spiceai/spiceai/pull/4970 (It is quite contained, which I'm happy with)

This change will have no visible effect on existing consumers - they need to explicitly opt in to see the metadata columns.

Support metadata columns (`location`, `size`, `last_modified`) in `ListingTableProvider` #15173