Allow different Snapshot behaviours in distributed RDataFrame #17136

Open
@vepadulano

Feature description

When called locally, the Snapshot operation produces a single output file at the path provided by the user. The distributed version of Snapshot currently produces one output file per task, each containing the result of running the computation graph on the event range processed by that task. This difference in behaviour was introduced for performance reasons, but it can sometimes confuse end users. We should introduce new behaviours for Snapshot in distributed RDataFrame:

  1. Allow the user to request one final merged output file, with the understanding that this will incur a performance cost.
  2. Allow the user to get back the list of partial output files, so they can harvest them according to further workflow requirements. We should also check that the files can be written to a remote storage location where the user has write access (e.g. via an xrootd path).
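
To make the two requested behaviours concrete, here is a minimal pure-Python sketch of how they could be exposed. Everything in it is an assumption for illustration, not existing ROOT API: the `SnapshotMode` enum, the helper names, the `<stem>_<task index>` naming of the partial files, and the `eos.example.org` host in the usage note. The only real tool referenced is the `hadd` command-line merger with its `-f` (overwrite target) flag, and the sketch only builds the command without running it.

```python
import posixpath
from enum import Enum
from typing import List, Optional, Tuple


class SnapshotMode(Enum):
    """Hypothetical switch between the two behaviours requested above."""
    MERGED = "merged"    # behaviour 1: one final merged output file
    PARTIAL = "partial"  # behaviour 2: list of per-task output files


def partial_paths(output_path: str, ntasks: int) -> List[str]:
    """Derive one file name per task from the user-provided path.

    The "<stem>_<task index><ext>" scheme is an assumption of this sketch.
    posixpath.splitext only looks at the last path component, so dots in
    the host part of an xrootd URL are left alone.
    """
    stem, ext = posixpath.splitext(output_path)
    return [f"{stem}_{i}{ext or '.root'}" for i in range(ntasks)]


def hadd_command(output_path: str, partials: List[str]) -> List[str]:
    """Build (but do not run) a hadd invocation merging the partial files."""
    return ["hadd", "-f", output_path, *partials]


def snapshot_outputs(
    output_path: str, ntasks: int, mode: SnapshotMode
) -> Tuple[List[str], Optional[List[str]]]:
    """Return the files the user gets back, plus the merge command if any."""
    partials = partial_paths(output_path, ntasks)
    if mode is SnapshotMode.MERGED:
        # The extra merge step is where the performance cost comes from.
        return [output_path], hadd_command(output_path, partials)
    return partials, None
```

With `SnapshotMode.PARTIAL` and a path such as `root://eos.example.org//eos/user/out.root`, the sketch returns one xrootd path per task and no merge command; with `SnapshotMode.MERGED` it returns the single user path together with the `hadd` command that would produce it. Whether the partial files can actually be written to such a remote location is exactly the check item 2 asks for.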

Alternatives considered

No response

Additional context

No response
