Conversation
test/workflows/automatic_job_grouping/inputs_files-strings.yaml
```python
class TransformationSubmissionModel(BaseModel):
    """Transformation definition sent to the router."""

    # Allow arbitrary types to be passed to the model
    model_config = ConfigDict(arbitrary_types_allowed=True)

    task: CommandLineTool | Workflow | ExpressionTool
    input_data: Optional[list[str | File] | None] = None
```
As we are going to integrate input sandbox within transformations (#92), it would be interesting to see if we could reuse the JobInputModel (renamed as InputModel?)
Regarding @arrabito comments:
- I agree that we don't need input sandboxes for now, so inputs can't be local files.
- I don't remember how we will add support for sandboxes in the transformation system. For simplicity, I would keep just LFN paths for now.
- As said before, in my opinion there is no need to support/create sandboxes for now.
Do I still make this change in this PR? Or wouldn't it be better to do it in a (future) sandbox PR? Maybe I misunderstood what you meant here.
Let's make this change in a future sandbox PR I would say
Thinking a little bit further, we may also want to allow local file paths, but only to be used for Local execution (without adding them to SB).
So if the submission is local we allow only local paths, while if the submission is to DIRAC we allow only LFN paths.
In this way, we could also execute transformations locally.
Eventually later on, we will also allow local file paths for DIRAC submission (adding them to ISB).
@aldbr what do you think?
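The local-vs-DIRAC path rule suggested above could be sketched as a small validation helper. This is a sketch only: the function name and the assumption that LFNs share a `/lfn/` prefix are hypothetical, not the actual dirac-cwl API.

```python
def validate_input_paths(paths: list[str], local_submission: bool) -> None:
    """Sketch: local runs accept only local paths, DIRAC runs only LFN paths.

    Assumption (hypothetical): LFN paths are recognised by a "/lfn/" prefix.
    """
    for path in paths:
        is_lfn = path.startswith("/lfn/")
        if local_submission and is_lfn:
            raise ValueError(f"Local execution accepts only local paths: {path}")
        if not local_submission and not is_lfn:
            raise ValueError(f"DIRAC submission accepts only LFN paths: {path}")
```

Such a check would run once at submission time, before any sandbox or job creation.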
@aldbr Regarding this part of the code: `dirac-cwl/src/dirac_cwl_proto/transformation/__init__.py`, lines 130 to 163 in 72956d5: are we planning on keeping it? Just so I un-comment it and make the changes related to the […]
Waiting on approval in #66 (comment) and #95 (comment) about what we're doing; then the PR should be ready to be fully reviewed (and potentially merged 🙏).
Yes we want to keep it. A transformation should either get inputs from the CLI, or from a […]
I'm also not sure whether the […]. Also, the […]. If you have any ideas.
As far as I can see, I'm not sure that any `input_name` is needed anymore. In the current `QueryBasedPlugin`, `input_name` is just used to build the LFN path. Probably we could just change `get_input_query` to not take any argument and build the LFN path from the plugin configuration alone, instead of deriving it from `input_name`.
Then, I guess the `group_size` in the YAML file should be specified accordingly. @aldbr do you agree? (Maybe some other changes are needed that I haven't thought of.)
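A sketch of what an argument-less `get_input_query` might look like, reusing the `query_root`/`campaign`/`data_type` config fields from the example hint later in this thread. The class layout here is hypothetical, not the actual plugin code.

```python
from dataclasses import dataclass


@dataclass
class QueryBasedPlugin:
    """Hypothetical sketch: the plugin builds the LFN path from its own
    configuration only, so get_input_query needs no input_name argument."""

    query_root: str
    campaign: str
    data_type: str

    def get_input_query(self) -> str:
        # Path layout is illustrative; the real scheme lives in the plugin.
        return f"{self.query_root}/{self.campaign}/{self.data_type}"
```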
Yes I agree. In any case, this is going to be revised at some point with the hints proposed in #69 |
Current PR status:
I don't know how to fix the current lint error. I had to rebase and reword all my "wrong" commits, and it's still failing because of old commits that were already pushed.
Also, PyPI is failing on something I didn't touch directly (fairly sure about that), so I don't really know what to do about it either. If you have any ideas:

```python
models.update(
    {
        "JobInputModel": JobInputModel,  # <--- error happens here
        "JobSubmissionModel": JobSubmissionModel,
        "TransformationSubmissionModel": TransformationSubmissionModel,
        "ProductionSubmissionModel": ProductionSubmissionModel,
    }
)
```

```python
class JobInputModel(BaseModel):  # <-- it's a BaseModel?
    """Input data and sandbox files for a job execution."""

    # Allow arbitrary types to be passed to the model
    model_config = ConfigDict(arbitrary_types_allowed=True)

    sandbox: list[str] | None
    cwl: dict[str, Any]

    @field_serializer("cwl")
    def serialize_cwl(self, value):
        """Serialize CWL object to dictionary.

        :param value: CWL object to serialize.
        :return: Serialized CWL dictionary.
        """
        return save(value)
```
```
# Conflicts:
#	src/dirac_cwl/job/job_wrapper.py
#	test/test_integration.py
#	test/test_job_wrapper.py
```
aldbr left a comment:
Since this PR was opened, there have been some changes in the design of the hints, so we should be pragmatic and focus mostly on the user interfaces.
The `input_data` dictionary maps CWL input parameter names to lists of files. This avoids hardcoding any key name (like `input-data`) and makes the mapping between hint data and workflow inputs explicit.
CLI convenience: --inputs-file
Users can provide a standard CWL inputs YAML file via the CLI:
```shell
dirac-cwl transformation submit workflow.cwl --inputs-file data.yaml
```

Where `data.yaml` is:

```yaml
simulation-files:
  - /lfn/path/file1.root
  - /lfn/path/file2.root
  - /lfn/path/file3.root
```

The CLI reads this file and populates `dirac:ExecutionHooks.input_data` in the hint before submission. This means:
- The router always reads input data from one place (the hint)
- The `--inputs-file` flag is syntactic sugar for populating the hint:
```yaml
hints:
  - class: dirac:ExecutionHooks
    group_size: 5
    input_data:
      simulation-files:
        - /lfn/path/file1.root
        - /lfn/path/file2.root
        - /lfn/path/file3.root
```

- If the hint already contains `input_data` AND `--inputs-file` is provided, emit a warning (`--inputs-file` should override `input_data`).
- The file format is a standard CWL inputs YAML, nothing new to learn. For simplicity, it should contain only one parameter I think (unless CTAO has a use case where they would need multiple static lists of inputs?).
Dynamic queries (what we already have with configuration)
For transformations that discover inputs at runtime (e.g., from upstream transformation outputs), the hint uses a plugin-based configuration instead:

```yaml
hints:
  - class: dirac:ExecutionHooks
    group_size: 5
    input_query:
      plugin: QueryBasedPlugin
      config:
        query_root: "."
        campaign: "pi"
        data_type: "100"
```

A transformation receives dynamic inputs from one upstream transformation only.
Mutual exclusivity
`input_data` (static) and `input_query` (dynamic) cannot coexist on the same transformation. The system should raise a clear error:

```
Cannot specify both static input data and dynamic input query.
Use input_data (or --inputs-file) for standalone transformations with known files.
Use input_query for transformations that discover inputs from upstream outputs.
```
Which input gets split?
A transformation receives input data through one file-array input only, whether static (`input_data` with a single key) or dynamic (`input_query` from one upstream transformation). `group_size` applies to that input.
If `input_data` contains multiple keys, raise an error. Multiple dynamic input sources per transformation are not supported; each transformation queries one upstream transformation only.
Non-file inputs (scalars, non-array types) defined in the workflow's regular CWL inputs are passed unchanged to every job.
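The splitting rule can be illustrated with a small grouping helper. This is a sketch, not the actual `submit_transformation_router` code.

```python
def group_inputs(files: list[str], group_size: int = 1) -> list[list[str]]:
    """Sketch: split the single file-array input into per-job chunks.

    With the default group_size of 1, one job is created per input file.
    """
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    return [files[i : i + group_size] for i in range(0, len(files), group_size)]
```

With 5 files and `group_size: 2`, this yields 3 jobs: two with 2 files each and a final one with the remainder.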
Productions
Static inputs and the first transformation convention
When a production is submitted with --inputs-file, the static input data is passed to the first transformation by convention. This follows the natural pipeline structure:
t1 (receives static inputs) -> t2 (queries t1 outputs) -> t3 (queries t2 outputs)
Downstream transformations typically get their inputs via dynamic queries against the DataCatalog/Bookkeeping service.
In that way we start simple. If it's enough to cover CTAO and LHCb use cases, then it's fine. If we need something more complex (multiple input queries/data), then we can reevaluate based on the use cases.
Looks like that shouldn't have been changed like that (we increasingly need to update the lock file from GitHub 😅).
Maybe we can add #22 to the current sprint? I don't think it'll take a lot of time (might be a weight 1)
```python
if os.getenv("DIRAC_PROTO_LOCAL") == "1":
    from dirac_cwl.mocks.sandbox import create_sandbox, download_sandbox  # type: ignore[no-redef]
    from dirac_cwl.mocks.status import JobReportMock
else:
    from diracx.api.jobs import create_sandbox, download_sandbox  # type: ignore[no-redef]
```
Why did you move these lines?
I had issues during tests: #95 (comment)
The problem was that the imports weren't working correctly. IIRC, the value of `os.getenv("DIRAC_PROTO_LOCAL")` wasn't set yet when the code was called in the tests.
So the `JobWrapper` would import `diracx.api.jobs` instead of `dirac_cwl.mocks...`, because `os.getenv("DIRAC_PROTO_LOCAL")` was != "1" (None or "0", I don't remember).
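One common fix for this kind of import-time environment lookup is to resolve the backend lazily, at call time rather than at module import time, so tests can set the variable first. A sketch, not the actual dirac-cwl code:

```python
import os


def get_sandbox_backend() -> str:
    """Sketch: resolve the backend when called, not when the module is imported.

    A test fixture can then set DIRAC_PROTO_LOCAL before the first call,
    instead of having to set it before the module is first imported.
    """
    if os.getenv("DIRAC_PROTO_LOCAL") == "1":
        return "mock"  # would dispatch to dirac_cwl.mocks.sandbox
    return "diracx"  # would dispatch to diracx.api.jobs
```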
```python
@property
def job_path(self):
    """Return the job path."""
    return self._job_path
```
Where and why do you need this getter?
This getter is used in this fixture (`conftest.py`):

```python
@pytest.fixture
def job_wrapper():
    """Create a JobWrapper instance and cleanup test files."""
    job_wrapper = JobWrapper(job_id=0)
    yield job_wrapper
    task_file = job_wrapper.job_path / "task.cwl"
    task_file.unlink(missing_ok=True)
```

It's used to clean up any files created during tests. As mentioned in one of our discussions, running JobWrapper-related tests generates a `task.cwl` file in the project directory, which isn't automatically removed. This fixture handles that cleanup and can also provide a `job_wrapper` instance for use in tests when needed.
I had to create it since `job_path` is now a private attribute.
```python
raise ValueError(f"Invalid DIRAC hints:\n{exc}") from exc

# Inputs from Transformation inputs_file
if transformation.input_data:
```
I guess input data should come either from `input_data` or from the query parameters, but not both (I would use if/else instead of if/if).
What do you think?
So, more like an if/elif?
Can we keep this condition just in case: `transformation_execution_hooks.configuration and transformation_execution_hooks.group_size`, or is it not needed anymore?
If I understand correctly, `input_data` is only defined in the hint now (at CLI level) and shouldn't be part of the SubmissionModels?
@aldbr I'm not sure I understand how […]. Currently, we have this:

```python
class ExecutionHooksHint(BaseModel, Hint):
    configuration: Dict[str, Any] = Field(
        default_factory=dict, description="Additional parameters for metadata plugins"
    )
```

You said: […]

So, […]. But then, you said this:

```yaml
hints:
  - class: dirac:ExecutionHooks
    group_size: 5
    input_query:
      plugin: QueryBasedPlugin
      config:
        query_root: "."
        campaign: "pi"
        data_type: "100"
```

What is the "relation" between […]? I had to add this to be able to check if […]:

```python
class TransformationExecutionHooksHint(ExecutionHooksHint):
    """Extended data manager for transformations."""

    group_size: Optional[int] = Field(default=None, description="Input grouping configuration for transformation jobs")
    input_data: Optional[Dict[str, List[str]]] = Field(default=None, description="Input data for transformation jobs")
    input_query: Optional[Dict] = Field(default=None, description="Input query for transformation jobs")
```

Because I'm calling:

```python
task = load_document(pack(task_path))
execution_hooks = TransformationExecutionHooksHint.from_cwl(task)

# input_query and input_data are mutually exclusive
if execution_hooks.input_query and (execution_hooks.input_data or inputs_file):
    console.print("Error.....")
    return typer.Exit(code=1)
```

But if […]
cc @aldbr @arrabito @natthan-pigoux
Closes: #66
Related to: #61
Changes:
- Added `input_data: list[str | File]` to `TransformationSubmissionModel` and `ProductionSubmissionModel`
- Added an `inputs-file` parameter to the Transformation and Production CLIs: `dirac-cwl transformation/production submit file.cwl --inputs-file file.yaml`
- Renamed `parameter-path` to `input_files: list[str]` in the Job CLI: `dirac-cwl job submit file.cwl --input-files file1.yaml file2.yaml ...`
- Added a `group_size` `executionHooksHint` to Transformation Workflows: `group_size` determines the number of jobs to be created and how many input files each will contain in `submit_transformation_router`. By default it equals 1, which means a job is created for each input in the inputs file. Once the list of jobs is created, it is sent to the `job_router` and processed.
- Fixed `JobWrapper`-related tests: `task.cwl` was created during `post_process` but never cleaned up after running tests.

TODO after this PR: