Description
Currently the data pipeline DAG is fixed at compile time and supports only a small set of dynamics, e.g. the ParallelReadFile task can read a set of files whose number is unknown at compile time.
I would like to have similar dynamics in other areas as well:
Dynamic nodes
The following dynamic nodes could be implemented:
Dynamic tasks
An option to give the Task a python function that is executed at pipeline runtime and returns a list of commands to execute in order.
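A minimal sketch of what such an API could look like. The `DynamicTask` class, its constructor, and `run()` are hypothetical names invented for illustration; only the idea of deferring the command list to a runtime-evaluated python function comes from this issue.

```python
from typing import Callable, List


class DynamicTask:
    """Hypothetical task whose command list is resolved at pipeline runtime."""

    def __init__(self, name: str, command_fn: Callable[[], List[str]]):
        self.name = name
        self._command_fn = command_fn  # stored, not called, at definition time

    def run(self) -> List[str]:
        # The function is evaluated only here, at pipeline runtime.
        commands = self._command_fn()
        for command in commands:
            print(f"executing: {command}")  # placeholder for real execution
        return commands


def list_export_commands() -> List[str]:
    # In a real pipeline this could query the database for table names.
    return [f"export_table {t}" for t in ("users", "orders")]


task = DynamicTask("export-tables", list_export_commands)
executed = task.run()
```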
Dynamic parallel tasks
An option to give the ParallelTask a python function that is executed at pipeline runtime and returns a list of commands / command chains to be executed in parallel.
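This could be sketched the same way, fanning the runtime-resolved list out to parallel workers. `DynamicParallelTask` and its methods are hypothetical; the thread pool here is just a stand-in for whatever parallel executor the pipeline already uses.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


class DynamicParallelTask:
    """Hypothetical parallel task: the command list is produced at runtime."""

    def __init__(self, name: str, commands_fn: Callable[[], List[str]]):
        self.name = name
        self._commands_fn = commands_fn

    def run(self) -> List[str]:
        commands = self._commands_fn()  # resolved at pipeline runtime
        # Each command (or command chain) is executed in parallel.
        with ThreadPoolExecutor(max_workers=4) as pool:
            results = list(pool.map(self._execute, commands))
        return results

    @staticmethod
    def _execute(command: str) -> str:
        # Placeholder for the real command execution.
        return f"done: {command}"


task = DynamicParallelTask("per-table", lambda: ["cmd a", "cmd b", "cmd c"])
results = task.run()
```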
Dynamic pipeline
An option to define a DynamicPipeline whose nodes are defined within a python function that is executed at pipeline runtime.
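One possible shape for this, again with invented names (`DynamicPipeline`, `Node`, `build_fn`): the builder function materializes the node list only when the pipeline actually runs.

```python
from typing import Callable, List


class Node:
    """Minimal stand-in for a pipeline node."""

    def __init__(self, name: str):
        self.name = name


class DynamicPipeline:
    """Hypothetical pipeline whose node list is built at runtime."""

    def __init__(self, name: str, build_fn: Callable[[], List[Node]]):
        self.name = name
        self._build_fn = build_fn

    def run(self) -> List[str]:
        nodes = self._build_fn()  # the DAG is materialized only now
        executed = []
        for node in nodes:
            executed.append(node.name)  # placeholder for real node execution
        return executed


def build_nodes() -> List[Node]:
    # Could e.g. inspect a database schema here at runtime.
    return [Node(f"export-{t}") for t in ("users", "orders")]


pipeline = DynamicPipeline("nightly", build_nodes)
order = pipeline.run()
```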
Implement UI awareness
The dynamic node objects (Task/ParallelTask/Pipeline) must be defined so that the python function that produces the actual commands/tasks/nodes is not run when interacting with the UI.
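In other words, the node must expose static metadata to the UI without ever invoking the user's function. A small sketch of that separation (all names hypothetical; the `calls` list just verifies laziness):

```python
from typing import Callable, List


class LazyDynamicTask:
    """The command function is stored and never called until execution."""

    def __init__(self, name: str, command_fn: Callable[[], List[str]]):
        self.name = name
        self._command_fn = command_fn

    def describe(self) -> dict:
        # What the UI sees: static metadata only, no function call.
        return {"name": self.name, "dynamic": True}

    def run(self) -> List[str]:
        # Only pipeline execution evaluates the function.
        return self._command_fn()


calls = []


def expensive_fn() -> List[str]:
    calls.append(1)  # tracks whether the function was evaluated
    return ["cmd"]


task = LazyDynamicTask("t", expensive_fn)
info = task.describe()   # UI interaction: function must not be called
assert calls == []
result = task.run()      # runtime: function is called exactly once
```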
Implement node cost handling
These dynamic nodes should define sub-nodes of the dynamic node object. The pipeline execution can then retrieve the recorded node cost from the database for sub-nodes that have been executed in the past. E.g. a dynamic node could represent an export of a database table. With sub-nodes defined, the pipeline execution can intelligently run the nodes with the highest node cost first to reduce overall execution time.
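The scheduling part of this amounts to longest-processing-time-first ordering over recorded costs. A sketch, assuming the cost store is a simple name-to-seconds mapping (`cost_db` and the default of 0.0 for never-run nodes are assumptions, not the project's actual schema):

```python
# Recorded runtimes (seconds) from previous executions; in the real
# pipeline this would come from the cost database.
cost_db = {"export-orders": 120.0, "export-users": 30.0}

sub_nodes = ["export-users", "export-orders", "export-events"]


def recorded_cost(node: str) -> float:
    # Nodes without history get a default cost of 0.0.
    return cost_db.get(node, 0.0)


# Highest-cost nodes first: classic longest-processing-time scheduling,
# which tends to shorten total wall-clock time on parallel workers.
schedule = sorted(sub_nodes, key=recorded_cost, reverse=True)
```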
Example use cases
- performing actions against tables in a database (e.g. exporting a table to the data lake). We don't know at compile time which tables exist in the database
- performing actions per table against a data lake / lakehouse on disk (e.g. connecting the table to our database engine). We don't know at compile time which tables exist in the data lake / lakehouse.
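For the first use case, the runtime discovery step could look like the following. This uses an in-memory SQLite database purely as a stand-in for the real engine, and the `export … TO datalake/…` command format is made up for illustration:

```python
import sqlite3

# In-memory stand-in for the real database; the actual table set is
# only known at pipeline runtime, not at compile time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER)")
conn.execute("CREATE TABLE orders (id INTEGER)")


def discover_tables(connection: sqlite3.Connection) -> list:
    # Query the catalog for user tables at runtime.
    rows = connection.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


# One export command per discovered table.
tables = discover_tables(conn)
commands = [f"export {t} TO datalake/{t}.parquet" for t in tables]
```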