Change how read/write args are read from metadata

The current implementation is unnecessarily susceptible to breaking changes. Say a new version of Polars (or `pyarrow.dataset`) removes/renames a given `write_parquet` argument. Because the arguments are currently hard-coded in the `PolarsParquetIOManager`, this would break user scripts if they upgrade to the newer version of Polars.

I think a good solution would be switching to a dictionary (called something along the lines of `polars_io_args` -- I'm sure there's a better name) as follows:

```python
@asset(
    metadata={
        "polars_io_args": {
            "compression": "snappy"
        }
    }
)
def some_asset():
    ...
```

And the `dump_df_to_path` and `scan_df_from_path` method then forward these args to Polars, without hard-coding their names:

```python
class PolarsParquetIOManager(BasePolarsUPathIOManager):
    ...

    def dump_df_to_path(self, context: OutputContext, df: pl.DataFrame, path: UPath):
        assert context.metadata is not None
        io_args = context.metadata.get("polars_io_args", {})

        with path.open("wb") as file:
            df.write_parquet(
                file,
                **io_args 
            )

    def scan_df_from_path(self, path: UPath, context: InputContext) -> pl.LazyFrame:
        assert context.metadata is not None
        io_args = context.metadata.get("polars_io_args", {})
        io_args.setdefault("format", "parquet")  # sets format to Parquet if not defined already


        fs: Union[fsspec.AbstractFileSystem, None] = None

        try:
            fs = path._accessor._fs
        except AttributeError:
            pass

        return pl.scan_pyarrow_dataset(
            ds.dataset(
                str(path),
                filesystem=fs,
                **io_args
            ),
            allow_pyarrow_filter=context.metadata.get("allow_pyarrow_filter", True),
        )
```

This 
- helps reduces the amount of "coupling" we have in terms of expected arguments 
- is overall "cleaner", as we're separating IO-related metadata in a different 'namespace' compared to other metadata. 
- helps the metadata section in Dagit be less 'cluttered' with IO minutia by hiding it under a single JSON entry.

Finally, this might also enable (in the future) 'hybrid' IO managers which save the same asset in two ways, because you can separate each IO manager's args in different keys, even if the names of the children-args clash/overlap.

The only problem with the above solution is that I can't think of a "clean" solution that also allows you to pass `allow_pyarrow_filer` values because `io_args` is unpacked at a different place.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Change how read/write args are read from metadata #15

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Change how read/write args are read from metadata #15

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions