The future of I/O managers: an opt-in layer #17595
Replies: 4 comments 4 replies
I would love to see AssetSpecs for dbt-based asset factories to, for example, automatically define test cases.
Is there any plan or way to bring over the There would be a lot of overhead to carry out (for partitioned tables especially) that is already written and implemented here. At the moment, we would love to connect to the database like this instead of via an I/O manager, but without the ability to use table slices, it is far too complicated. Something like:
from dagster import AssetExecutionContext, MaterializeResult, asset
from dagster_duckdb import DuckDBResource

@asset(deps=[iris_dataset])
def iris_setosa(context: AssetExecutionContext, duckdb: DuckDBResource) -> MaterializeResult:
    with duckdb.get_client() as client:  # or similar?
        asset_table_slice = client.get_table_slice(context, context)
        dep_table_slice = client.get_table_slice(context, context.upstream_output)  # doesn't currently exist in AssetExecutionContext
        dep_table_query = client.get_select_statement(dep_table_slice)
        client.ensure_schema_exists(context, asset_table_slice)
        client.delete_table_slice(context, asset_table_slice)
    with duckdb.get_connection() as conn:
        conn.execute(f"""
            INSERT INTO {asset_table_slice.schema}.{asset_table_slice.table}
            SELECT * FROM ({dep_table_query})
            WHERE species = 'Iris-setosa'
        """)
        # fetch the scalar count rather than passing the cursor object as metadata
        num_rows = conn.execute("SELECT COUNT(*) FROM iris.iris_setosa;").fetchone()[0]
    return MaterializeResult(metadata={"num_rows": num_rows})
@jamiedemaria is this already practically possible? "Allowing users to access simple metadata about upstream assets. For example, the name of a table where data was stored in the upstream asset."
I'm curious here too. Issues like #21830 suggest there's demand for that kind of simple access.
Most, if not all, Dagster users have had to contend with the I/O manager system at some point. The I/O manager system has been tightly coupled with the orchestration layer, and it was expected that every Dagster user would have to understand the abstraction. For some users, I/O managers are a natural fit for how their data pipelines are designed and how their data is stored. These data pipelines typically transform data in memory, for example with a library like pandas, and store the resulting data assets in the same location, for example in Snowflake tables.
However, not all data pipelines fit this pattern. Sometimes, data assets are too large to hold in memory, or the data pipeline calls out to a third-party tool that already handles storage, for example dbt. In these cases, the I/O manager system doesn’t add value, and sometimes gets in the way of the developer experience.
While opting out of the I/O manager system has been possible since 0.13.11, in version 1.5.0 we introduced new APIs to simplify working with Dagster without I/O managers. Our goal is for these new APIs to be the default way of working with Dagster. Users will no longer have to map input names and output names to asset keys: there will only be asset keys. I/O managers become an opt-in system for those who want their opinionated structure. Users who do not opt in should not have to think about I/O managers.
These new APIs are:
deps parameter on @asset
The deps parameter allows users to set the upstream assets an asset depends on, but the I/O manager will not be used to load these assets into memory.
The deps parameter replaces non_argument_deps, which has been marked deprecated. non_argument_deps relied on string matching asset names, which often resulted in typos that were only caught at runtime. deps accepts assets, which allows in-editor type checkers to detect errors.
AssetDep
AssetDep is a class used for defining a dependency on another asset when additional information, like a PartitionMapping, is needed.
With the addition of AssetDep, users can now define complex dependency relationships that were previously only available in the I/O manager system.
MaterializeResult
MaterializeResult is a class that can optionally be returned from an @asset to report metadata, code version, and other information about the asset.
The MaterializeResult type does not require that an output value be returned, and does not require the user to understand output names or that the implicit output name for assets is "result".
AssetSpec
AssetSpec is a class for defining the specifications of a data asset, like the asset key, dependencies, group, and freshness policies, separate from the computation that creates the asset. Currently, AssetSpecs can be used when writing @multi_assets, and when used, Dagster will expect that storing the assets will be handled in the body of the @multi_asset function. @multi_assets that use AssetSpec can return MaterializeResults or None.
Authors of asset factory functions may find AssetSpecs to be particularly useful. AssetSpec bundles several parameters that are on the @multi_asset decorator together, which allows you to accept AssetSpecs in your factory function rather than duplicating each parameter.
What’s next?
Before we can officially declare I/O managers an opt-in system, we still have some work to do. This includes:
context so that it is not geared around inputs and outputs
We are continuing to work on improving these APIs and the experience of using Dagster without the I/O manager system. Your feedback and suggestions are always welcome!