-
Couldn't load subscription status.
- Fork 4
Change how read/write args are read from metadata #15
Description
The current implementation is unnecessarily susceptible to breaking changes. Say a new version of Polars (or pyarrow.dataset) removes/renames a given write_parquet argument. Because the arguments are currently hard-coded in the PolarsParquetIOManager, this would break user scripts if they upgrade to the newer version of Polars.
I think a good solution would be switching to a dictionary (called something along the lines of polars_io_args -- I'm sure there's a better name) as follows:
@asset(
metadata={
"polars_io_args": {
"compression": "snappy"
}
}
)
def some_asset():
...And the dump_df_to_path and scan_df_from_path method then forward these args to Polars, without hard-coding their names:
class PolarsParquetIOManager(BasePolarsUPathIOManager):
...
def dump_df_to_path(self, context: OutputContext, df: pl.DataFrame, path: UPath):
assert context.metadata is not None
io_args = context.metadata.get("polars_io_args", {})
with path.open("wb") as file:
df.write_parquet(
file,
**io_args
)
def scan_df_from_path(self, path: UPath, context: InputContext) -> pl.LazyFrame:
assert context.metadata is not None
io_args = context.metadata.get("polars_io_args", {})
io_args.setdefault("format", "parquet") # sets format to Parquet if not defined already
fs: Union[fsspec.AbstractFileSystem, None] = None
try:
fs = path._accessor._fs
except AttributeError:
pass
return pl.scan_pyarrow_dataset(
ds.dataset(
str(path),
filesystem=fs,
**io_args
),
allow_pyarrow_filter=context.metadata.get("allow_pyarrow_filter", True),
)This
- helps reduces the amount of "coupling" we have in terms of expected arguments
- is overall "cleaner", as we're separating IO-related metadata in a different 'namespace' compared to other metadata.
- helps the metadata section in Dagit be less 'cluttered' with IO minutia by hiding it under a single JSON entry.
Finally, this might also enable (in the future) 'hybrid' IO managers which save the same asset in two ways, because you can separate each IO manager's args in different keys, even if the names of the children-args clash/overlap.
The only problem with the above solution is that I can't think of a "clean" solution that also allows you to pass allow_pyarrow_filer values because io_args is unpacked at a different place.