Endpoint to get latest step instance views across runs for a given workflow instance #204

Merged
anjujha merged 1 commit into Netflix:main from anjujha:anju/get-workflow-instance-steps
Apr 9, 2026
Collaborator

@anjujha anjujha commented Apr 8, 2026

Pull Request type

  • Bugfix
  • Feature
  • Refactoring (no functional changes, no api changes)
  • Build related changes (Please run ./gradlew build --write-locks to refresh dependencies)
  • Other (please describe):

NOTE: Please remember to run ./gradlew spotlessApply to fix any format violations.

Changes in this PR

Add endpoint to get latest step instance views across all runs for a workflow instance

Adds GET /{workflowId}/instances/{workflowInstanceId}/steps which returns the most recent step attempt per step across all runs, useful for workflows restarted from failure where steps ran in different runs.

Why is this needed

This endpoint makes it easy to get a snapshot of the current state of all steps in a workflow instance, regardless of how many times it has been restarted.
Currently, to understand the final state of each step you need to know the latest run ID and call the run-specific /runs/{runId}/steps endpoint. But for restarted workflows, different steps may have completed in different runs — some steps succeed early and are skipped in subsequent runs.
The GET /{workflowId}/instances/{workflowInstanceId}/steps endpoint added in this PR solves this by querying across all runs and returning the most recent attempt per step, giving a complete and accurate view of the workflow instance's step states in a single call.
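
As a rough illustration of calling the new endpoint, the sketch below builds (but does not send) a GET request with the JDK's `java.net.http` API. The `/api/v3/workflows` prefix comes from the controller mapping quoted in the review thread; the class name, the base URL passed in, and the assumption that the workflow ID needs no URL escaping are all hypothetical.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of this PR.
final class StepsEndpointRequest {
  // Builds a GET request for the latest step views of one workflow instance.
  // Assumes workflowId contains only URL-safe characters.
  static HttpRequest latestSteps(String baseUrl, String workflowId, long workflowInstanceId) {
    URI uri =
        URI.create(
            baseUrl
                + "/api/v3/workflows/"
                + workflowId
                + "/instances/"
                + workflowInstanceId
                + "/steps");
    return HttpRequest.newBuilder(uri).header("Accept", "application/json").GET().build();
  }
}
```

The request can then be sent with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())` against whichever host runs the server.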

Example

Say you have a workflow with 3 steps: step-a, step-b, step-c.

Run 1 — all 3 steps ran, but step-c failed:

| step | run | status |
| --- | --- | --- |
| step-a | 1 | SUCCEEDED |
| step-b | 1 | SUCCEEDED |
| step-c | 1 | FAILED |

Run 2 — restarted from failure, only step-c ran again:

| step | run | status |
| --- | --- | --- |
| step-c | 2 | SUCCEEDED |

GET /instances/1/runs/2/steps (existing) — incomplete, only shows steps from run 2:

| step | run | status |
| --- | --- | --- |
| step-c | 2 | SUCCEEDED |

GET /instances/1/steps (new) — complete picture, latest attempt per step across all runs:

| step | run | status |
| --- | --- | --- |
| step-a | 1 | SUCCEEDED |
| step-b | 1 | SUCCEEDED |
| step-c | 2 | SUCCEEDED |
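
The selection rule in this example can be sketched in memory. The code below is only an illustration, not the PR's implementation (which does this in SQL): it mirrors the ordering used by the query under review, `PARTITION BY step_id ORDER BY workflow_run_id DESC, step_attempt_id DESC`, keeping the top-ranked row per step. `StepView` and `LatestStepViews` are hypothetical names, and the attempt IDs in the usage note are assumed.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-in for a step instance view; not Maestro's actual class.
final class StepView {
  final String stepId;
  final long runId;
  final long attemptId;
  final String status;

  StepView(String stepId, long runId, long attemptId, String status) {
    this.stepId = stepId;
    this.runId = runId;
    this.attemptId = attemptId;
    this.status = status;
  }
}

final class LatestStepViews {
  // Recency order: a higher run ID wins; ties are broken by a higher attempt ID.
  private static final Comparator<StepView> RECENCY =
      Comparator.comparingLong((StepView v) -> v.runId).thenComparingLong(v -> v.attemptId);

  // Keeps exactly one view per stepId: the most recent one under RECENCY.
  // Output order follows the first appearance of each stepId in the input.
  static List<StepView> latestPerStep(List<StepView> allRuns) {
    Map<String, StepView> latest = new LinkedHashMap<>();
    for (StepView view : allRuns) {
      latest.merge(view.stepId, view, (a, b) -> RECENCY.compare(a, b) >= 0 ? a : b);
    }
    return new ArrayList<>(latest.values());
  }
}
```

Fed the four rows from runs 1 and 2 above (with attempt ID 1 assumed for each), `latestPerStep` keeps step-a and step-b from run 1 and step-c from run 2.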

Testing


DAO (MaestroStepInstanceDaoTest):

  • testGetAllStepInstanceViews: simulates a restart-from-failure scenario in which run 1 has two steps (job1, job2) and run 2 re-ran only job1. Verifies that job1 is returned from run 2 and job2 from run 1, i.e. the most recent attempt per step is correctly selected across runs.

Controller (StepInstanceControllerTest):

  • testGetAllStepInstanceViews: verifies that the DAO is called with the correct arguments and that the result is sorted by stepInstanceId.

Locally:

  • Also tested locally by spinning up the server, creating a workflow, triggering multiple runs, and verifying that the endpoint returns the expected results.

@anjujha anjujha marked this pull request as ready for review April 8, 2026 23:06
INNER_RANK_QUERY_ALL_FIELD_WITH
+ ", ROW_NUMBER() OVER (PARTITION BY step_id ORDER BY workflow_run_id DESC, step_attempt_id DESC) AS rank"
+ GET_STEP_FIELD_QUERY_FROM
+ ") SELECT * FROM inner_ranked WHERE rank=1";
Collaborator

@akashdw akashdw Apr 9, 2026

If you have any benchmarks or EXPLAIN / query plan results, could you share those as well?

Collaborator Author

@anjujha anjujha Apr 9, 2026

I have pasted the query plan below

QUERY PLAN

 Subquery Scan on inner_ranked  (cost=56.59..57.05 rows=1 width=1707) (actual time=1.327..1.486 rows=211 loops=1)                                                                                                                                                   
   Filter: (inner_ranked.rank = 1)                                                                                                                                                                                                                                  
   ->  WindowAgg  (cost=56.59..56.88 rows=13 width=1707) (actual time=1.326..1.471 rows=211 loops=1)                                                                                                                                                                
         Run Condition: (row_number() OVER (?) <= 1)                                                                                                                                                                                                                
         ->  Sort  (cost=56.59..56.62 rows=13 width=1699) (actual time=1.320..1.329 rows=211 loops=1)                                                                                                                                                               
               Sort Key: maestro_step_instance.step_id COLLATE "C", maestro_step_instance.workflow_run_id DESC, maestro_step_instance.step_attempt_id DESC                                                                                                          
               Sort Method: quicksort  Memory: 410kB                                                                                                                                                                                                                
               ->  Index Scan using maestro_step_instance_pkey on maestro_step_instance  (cost=0.42..56.35 rows=13 width=1699) (actual time=0.028..0.254 rows=211 loops=1)                                                                                          
                     Index Cond: ((workflow_id = '<redacted>_large_demo'::text) AND (workflow_instance_id = 1))                                                                                                                                                        
 Planning Time: 0.346 ms                                                                                                                                                                                                                                            
 Execution Time: 1.567 ms                                                                                                                                                                                                                                           
(11 rows)               

The query always uses the primary key index, with an index condition on (workflow_id, workflow_instance_id).

value = "/api/v3/workflows",
produces = MediaType.APPLICATION_JSON_VALUE,
consumes = MediaType.APPLICATION_JSON_VALUE)
@SuppressWarnings("PMD.AvoidDuplicateLiterals")
Collaborator

Can you clarify why this is needed?

Collaborator Author

I added this because otherwise we would have to introduce constants for 'workflowId' and 'workflowInstanceId' on lines 98 and 99 below:
@Valid @NotNull @PathVariable("workflowId") String workflowId

This file already had a few such strings, but my new endpoint pushed it over the PMD threshold.

A similar pattern is used in other controllers.
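
For readers unfamiliar with the rule, here is a minimal sketch of the situation, assuming PMD's AvoidDuplicateLiterals rule is configured to count string literals inside annotations. The controller and its methods are hypothetical, and `PathVariable` is declared locally as a stand-in for Spring's annotation so the snippet compiles without Spring.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Local stand-in for Spring's @PathVariable; an assumption made to keep the sketch self-contained.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.PARAMETER)
@interface PathVariable {
  String value();
}

// Hypothetical controller. Each endpoint repeats the same path-variable names,
// so the literal count grows with every method added; the class-level
// suppression avoids extracting a constant for each name.
@SuppressWarnings("PMD.AvoidDuplicateLiterals")
final class StepControllerSketch {
  String getRunSteps(
      @PathVariable("workflowId") String workflowId,
      @PathVariable("workflowInstanceId") long workflowInstanceId,
      @PathVariable("workflowRunId") long workflowRunId) {
    return workflowId + "/" + workflowInstanceId + "/" + workflowRunId;
  }

  String getAllSteps(
      @PathVariable("workflowId") String workflowId,
      @PathVariable("workflowInstanceId") long workflowInstanceId) {
    return workflowId + "/" + workflowInstanceId;
  }
}
```

The alternative is to declare constants for the repeated names and reference them in each annotation, which is the change the author wanted to avoid.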

Collaborator

@rdeepak2002 rdeepak2002 left a comment

lgtm!

@anjujha anjujha merged commit 0150f2a into Netflix:main Apr 9, 2026
1 check passed
Collaborator

@praneethy91 praneethy91 left a comment

LGTM
