Discussion issue for the topic of "data-driven workflows".
Purpose: Outline a rough roadmap for potential future development in this area for later prioritisation (what work gets undertaken on what timeframe will be decided in that prioritisation). We won't produce a detailed implementation plan or hash out interfaces and technical details at this stage.
Comments: Please leave comments below rather than editing the OP (please reference the point number rather than just quoting). We'll discuss this at a future Cylc project meeting and write up (say as a GH project) when we're done (then close/lock this issue). Please keep it focused (see "Context" and "Action" below)!
Extensions: Suggesting new points is encouraged (if not already covered), but please keep it on topic; more discussions will follow this one, one by one, to allow us to focus.
Changelog:
[edit 2026-03-31] Added point 2.iv.c
[edit 2026-03-31] Added point 1.viii
Data-driven workflows
Topics pertaining to how inputs/outputs are defined and exchanged between tasks/workflows.
Context: As workflow and suite complexity grows (a "suite" here being a collection of inter-dependent workflows), the matter of interfaces increases in priority. There is a growing emphasis on data provenance, traceability and observability, all of which are hard to achieve at present due to poor visibility of data flow within and between Cylc workflows. Data-driven paradigms also benefit workflow developers and allow for more decoupled workflows.
Action: There are several things we can do to formalise inputs/outputs in Cylc and facilitate the exchange of metadata (whilst continuing to offer existing paradigms of course).
Support output data in Cylc messages, e.g. cylc message -- 'dataset 1 written filepath: ... host: ... archive location: ...' captured via an output pattern such as [outputs]dataset-1 = dataset 1 written filepath: (?P<filepath>[^ ]+) host: (?P<host>[^ ]+) ..., or with the metadata passed as a separate argument: cylc message -- 'dataset 1 written' -- {'filepath': '...', 'host': '...', 'archive location': '...'}.
Passing the metadata as a separate argument is slightly harder as it requires cylc message changes.
Quick win - the basic functionality is very easy to implement (done in a PoC); we just need to make sure we do this in a way which leaves the door open to future work.
How should we provide this metadata to the downstream task?
In a PoC, I broadcast the metadata as an environment variable called INPUT_<output_name>.
But we can't dump the metadata for all prerequisites into the environment (there are size limits)!
A prerequisites file might be an alternative, e.g. a JSON file (sketched below).
Note the format has to be easy to read in multiple programming languages (e.g. the jq command is not POSIX, so JSON isn't trivially parsed from plain shell)!
Should we also cylc message the job's exit code back to the scheduler to allow it to be captured as an output? E.g. [outputs]connection-error = $255.
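For example, a downstream job might find something like the following in a per-job prerequisites file (a purely hypothetical layout; the file name, location and schema are all open questions), with the keys populated from the metadata captured upstream:

```json
{
  "dataset-1": {
    "filepath": "/data/20260331T00/dataset-1.nc",
    "host": "hpc-a",
    "archive location": "..."
  }
}
```

JSON is shown only because it is readable from most languages; any equally portable format would do.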
How should we define the interaction of collections of workflows, e.g, an operational suite?
Currently these collections are assembled by hand, but a more configuration based approach is desirable (reproducibility, ease of collaboration, safer deployment, etc).
Should we be able to define a suite of workflows, say in a configuration file (a hypothetical sketch follows after this list)?
Should we define top-level workflow inputs and outputs:
Binding to workflow//cycle/task is fragile and requires knowledge of the workflow/task structure.
It would be better to bind to more formally defined "hard points", e.g. specific data sets.
E.g. global:<cycle>:dataset rather than global-v32//<cycle>/run-model
CC: BoM desire to decouple workflows via Kafka.
Should we be able to map the outputs of one workflow onto the inputs of another in the "suite" configuration?
Should we be able to define how the suite should be started / monitored / updated / stopped?
E.g, workflow/run names.
Start arguments, etc.
Strongly linked to CD workflow deployment (see "Supporting Continuous Deployment Paradigm For Cylc Workflows" topic).
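To make the idea concrete, a purely hypothetical sketch of a suite definition (every key and name below is invented for discussion; no such configuration exists today), covering the member workflows, the output-to-input mapping via formally defined hard points, and how the suite is started:

```ini
# suite definition (hypothetical)
[workflows]
    [[global-v32]]
        run-name = operational
        start-args = --pause
        [[[outputs]]]
            # expose a formally defined "hard point" rather than a task
            dataset = global:<cycle>:dataset
    [[regional-uk]]
        run-name = operational
        [[[inputs]]]
            # map the upstream output onto this workflow's input
            boundary-conditions = global:<cycle>:dataset
```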
Define hierarchical workflow structure:
Suites of suites of workflows...
I.e, modularity at the macro-level.
E.g, this group of workflows covers a particular regional model (obs, model, post-processing, etc).
They can be developed and trialled standalone, or as part of a larger collection.
Just remap the inputs these workflows require (e.g, global boundary conditions) onto the desired data source.
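Equally hypothetical (all names invented): the same regional group could be trialled standalone by remapping its required inputs onto a different data source, without modifying the workflows themselves:

```ini
# hypothetical: run the regional group standalone for a trial, pulling its
# boundary conditions from an archived global dataset rather than the live
# operational global suite
[suites]
    [[regional-trial]]
        workflows = regional-obs, regional-model, regional-postproc
        [[[input-map]]]
            boundary-conditions = archive:<cycle>:global-dataset
```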
External triggers:
Xtriggers:
Are a "data-driver" (in that they produce outputs).
Are not an "event-driver" (in that they are driven by runahead limiting rather than by external events).
Ext-triggers:
Are an "event-driver".
It might make more sense to use ext-triggers as the "event-driver" rather than trying to "fake it" with runahead-driven xtriggers; this resolves spawning discrepancies.
They could provide an inter-workflow interface for "support output data in Cylc messages" (so that inputs/outputs work the same way within workflows as they do between workflows), e.g. cylc ext-trigger 'dataset 1 ready' '{"filepath": "...", "host": "...", ...}'.
Consider ways to make it easier to configure bi-modal workflows which can operate in either regime?
E.g. easily transition between push events (e.g. ext-triggers) and pull events (i.e. polling) via a workflow-level configuration which can be changed without having to re-write the workflow.
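One hypothetical way this could surface (the mode setting and the dataset_ready function are invented; only the graph line uses existing xtrigger syntax) - the dependency is written once and a single switch decides whether it is satisfied by polling or by an incoming push event:

```ini
# hypothetical: flip between pull (polling) and push (event-driven)
# without rewriting the graph
[scheduling]
    upstream-mode = push    # invented setting; the alternative would be "pull"
    [[xtriggers]]
        # invented xtrigger function; under "push" it would be satisfied by an
        # incoming event rather than being polled against the runahead limit
        upstream_dataset = dataset_ready('%(point)s')
    [[graph]]
        P1D = "@upstream_dataset => get_data => process"
```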
Does it actually make sense to run polling tasks in a Cylc event loop (async or otherwise)?
Or should we just run them as standalone background processes (i.e, like regular tasks)?
Would we be better off just wrapping the Python function call in a basic for; sleep loop and leaving it there?
Removes the strict requirement for xtriggers to be Python based.
Removes the need for a bespoke xtrigger execution model.
Subproc-pool throttling can be replaced by task-queue throttling.
Note, most of the polling we need to do at the MO involves remote polling, which is not possible via xtriggers or Python interfaces in general (without developing a Python SSH implementation or relying on an external service) but is very easy via polling tasks.
I.e, We could keep the xtrigger "look and feel" but remove the bespoke execution altogether? (basically just a subtly different job script).
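A minimal sketch of the "wrap it in a loop" idea, assuming an existing xtrigger-style check function is simply reused inside an ordinary (possibly remote) task rather than in the scheduler's bespoke xtrigger machinery (the function and paths below are made up):

```python
#!/usr/bin/env python3
"""Run an xtrigger-style check as an ordinary polling task (sketch)."""
import sys
import time
from pathlib import Path


def dataset_ready(cycle_point: str) -> bool:
    """Stand-in for an existing xtrigger function: True when satisfied."""
    # could equally check a database row, a web API, a remote host, ...
    return Path(f"/data/{cycle_point}/dataset-1.nc").exists()


def main(cycle_point: str, interval: float = 60.0, max_polls: int = 1440) -> int:
    # the basic "for; sleep" loop: succeed when the condition is met,
    # otherwise fail and let normal task retries/queueing take over
    for _ in range(max_polls):
        if dataset_ready(cycle_point):
            return 0
        time.sleep(interval)
    return 1


if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```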
Note, local background tasks are currently a problem for auto-restart functionality:
If going down either the async xtrigger (in main-loop) or the local background task (outside main-loop) approach, we need the ability to abruptly kill a poller to support auto-restart.
Integrating event brokers
Event brokers offer a push based alternative to inter-workflow triggering.
Rather than polling for a file, or a task output, subscribe to the event and be notified when it happens.
This requires an external service, BoM has investigated Kafka for this purpose.
The PoC currently adds ext-triggers to the scheduler's internal queue; this might be OK, but we might want to decouple it from the implementation a bit further.
It also uses process_messages to intercept incoming task outputs; this could do with something better.
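A rough sketch of such a bridge, assuming the kafka-python client and that broker events are forwarded to the scheduler via the existing cylc ext-trigger command (the topic, workflow ID and payload fields are invented):

```python
#!/usr/bin/env python3
"""Forward broker events to a Cylc workflow as external triggers (sketch)."""
import json
import subprocess

from kafka import KafkaConsumer  # assumes the kafka-python package

WORKFLOW = "regional-uk/run1"  # hypothetical target workflow

consumer = KafkaConsumer(
    "dataset-events",  # hypothetical topic
    bootstrap_servers="broker.example:9092",
    value_deserializer=lambda raw: json.loads(raw.decode()),
)

for event in consumer:
    payload = event.value  # e.g. {"dataset": "dataset-1", "id": "20260331T00"}
    # push rather than poll: tell the scheduler the event has happened
    subprocess.run(
        ["cylc", "ext-trigger", WORKFLOW, f"{payload['dataset']} ready", payload["id"]],
        check=True,
    )
```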
Automated / configurable synchronisation of task/workflow artefacts.
Artefacts, i.e. the actual datasets that Cylc outputs represent.
Currently, users have to arrange all of the data moving themselves.
E.g, if a task on one platform requires an output produced on another, then they must add in an rsync manually.
Since Cylc is "install target" (aka filesystem) aware, it can run this rsync command on behalf of the user.
E.g. to pull a dataset from the install target where it was written to the install target we are currently on, we could call cylc sync workflow//123/abc:dataset.
Would make writing portable workflows easier as it removes "install target" logic and the need for platform-dependent sync tasks.
If tasks define their inputs, then we could even call this from the job script automatically (sketched below).
The command and interface for this is covered in the "managing workflow files" topic.
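Purely illustrative (no cylc sync command exists; the syntax is taken from the example above): an auto-generated fragment near the top of a job script might end up looking like this:

```bash
# hypothetical, generated from the task's declared inputs: pull each required
# dataset from the install target where it was produced onto the install
# target this job runs on (a no-op where the filesystems are shared)
cylc sync "workflow//123/abc:dataset"
```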