Skip to content

Conversation

@henry3260
Copy link
Contributor

@henry3260 henry3260 commented Jan 29, 2026

What

This PR fixes a bug in EmrCreateJobFlowOperator where the wait_policy parameter was ignored, causing the operator to always default to the JobFlowWaiting waiter (waiting for the cluster to start) regardless of the user's input.

The changes include:

Ensuring wait_policy is correctly persisted in the operator instance.

Updating the execute method to select the correct boto3 waiter based on the policy.

Passing the specific waiter name to EmrCreateJobFlowTrigger to support this logic in deferrable mode.

Adding a validation check to prevent users from providing both wait_for_completion and wait_policy simultaneously.

Why

Currently, if a user wants the operator to wait until the EMR cluster finishes all steps and terminates (using WaitPolicy.WAIT_FOR_STEPS_COMPLETION), the operator fails to do so.

It converts the policy to a boolean wait_for_completion = True in __init__ and discards the specific policy type. Consequently, the execute method hardcodes the waiter to WAITER_POLICY_NAME_MAPPING[WaitPolicy.WAIT_FOR_COMPLETION].

This behavior causes the task to be marked as "Success" as soon as the cluster enters the WAITING state. If the cluster subsequently fails during a step execution, Airflow does not catch the failure, leading to false positives in DAG runs.

closes: #61180

  • Yes (please specify the tool below)

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Jan 29, 2026
The operator was incorrectly ignoring the `wait_policy` argument and always defaulting to waiting for cluster completion.

This change ensures the `wait_policy` is correctly persisted and used to select the appropriate waiter (e.g., for step completion), fixing the hardcoded behavior.
@henry3260 henry3260 marked this pull request as ready for review January 29, 2026 10:14
@henry3260 henry3260 requested a review from o-nikolas as a code owner January 29, 2026 10:14
@karenbraganz
Copy link
Collaborator

karenbraganz commented Jan 29, 2026

I think the logic might be less confusing for users if you allow them to set both wait_for_completion and wait_policy without raising a ValueError. If wait_for_completion is set to True, the code would check wait_policy. If wait_policy is not set, it defaults to WAIT_FOR_COMPLETE. This shouldn't interfere with your logic too much because you are anyway setting wait_for_completion to True on line 711.

If you want to use the current logic, I think you should clarify a few things in the param descriptions:

  1. For wait_for_completion set to True, make it clear that the task will succeed once the cluster starts running and does not wait until termination. Use wait_policy to change this behavior.
  2. For wait_policy state that this should not be used unless wait_for_termination is set to False.

@jroachgolf84
Copy link
Collaborator

Are we "un-deprecating" wait_policy? Is there a way to do this without "un-deprecating" it?

@karenbraganz
Copy link
Collaborator

@jroachgolf84 do you know why it was deprecated? Imo we should allow users to configure wait_policy because not all of them may want to wait only till the cluster starts running. It gives them more flexibility to set the success criterion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:amazon AWS/Amazon - related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

wait_for_completion in EmrCreateJobFlorOperator should wait for the cluster to complete to return success

3 participants