Describe the Bug
At scale, some AppWrappers (AWs) never reach a completed state because the informer cache and etcd disagree about their status.
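The following self-contained Go simulation illustrates the failure mode. It is not MCAD code and does not claim to show how MCAD's informer actually loses updates. A controller that trusts only a lagging cache sees some AWs as still running even though the authoritative store records them as completed. All names (`Phase`, `store`, `cache`, `syncCache`, `stuck`) are hypothetical, and the "budget" of applied updates is an assumed stand-in for an informer that has not caught up under load.

```go
package main

import "fmt"

// Phase is a simplified, hypothetical AW status (not the real MCAD type).
type Phase string

const (
	Running   Phase = "Running"
	Completed Phase = "Completed"
)

// store models the authoritative state (etcd, as seen through the API server).
type store map[string]Phase

// cache models an informer cache that may lag behind the store.
type cache map[string]Phase

// syncCache copies at most budget pending updates from the store into the
// cache, simulating an informer that has not yet caught up (assumption).
func syncCache(s store, c cache, order []string, budget int) {
	for _, name := range order {
		if budget == 0 {
			return
		}
		if c[name] != s[name] {
			c[name] = s[name]
			budget--
		}
	}
}

// stuck returns the AWs that the store says are Completed but the cache
// still reports as not Completed.
func stuck(s store, c cache, order []string) []string {
	var out []string
	for _, name := range order {
		if s[name] == Completed && c[name] != Completed {
			out = append(out, name)
		}
	}
	return out
}

func main() {
	const n = 1000
	s, c := store{}, cache{}
	order := make([]string, 0, n)
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("aw-%04d", i)
		order = append(order, name)
		s[name] = Running
		c[name] = Running
	}
	// Every short job finishes in the authoritative store...
	for _, name := range order {
		s[name] = Completed
	}
	// ...but the cache absorbs only 990 of the 1,000 updates before it is read.
	syncCache(s, c, order, 990)
	fmt.Println("stuck according to cache:", len(stuck(s, c, order)))

	// A full resync lets the cache catch up, so nothing looks stuck any more.
	syncCache(s, c, order, n)
	fmt.Println("stuck after full resync:", len(stuck(s, c, order)))
}
```

Running it prints `stuck according to cache: 10` and then `stuck after full resync: 0`.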
Codeflare Stack Component Versions
Please specify the component versions in which you have encountered this bug.
Codeflare SDK:
MCAD:
Steps to Reproduce the Bug
Submit 1,000 AWs, each running a very short job (about 10 seconds), and wait for all 1,000 AWs to complete.
What Have You Already Tried to Debug the Issue?
I have run scale tests that reproduce the issue.
Expected Behavior
All AWs should be completed.
Screenshots, Console Output, Logs, etc.
NA
Affected Releases
The current 1.35.0 release and the main branch.
Additional Context
NA