Description
Describe the Bug
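- The MCAD logs show multiple delete attempts on the same job ID.
- deleteJob log events vastly outnumber every other event type: 58423 deleteJob events versus 929 for all other listed types combined in the one-hour sample below.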
Steps to Reproduce the Bug
The MCAD log statistics below come from the log file year=2023/month=11/day=13/97741eb604.2023-11-13.2300.json.gz
in dipc-prod-logs. The file covers the one-hour period from 2023-11-13 23:00:00 to 2023-11-13 23:59:59.
Summary of log events by type:
MCAD Log Event Type | # Log Events
deleteJob | 58423
processCleanupJob | 318
Unknown | 293
UpdatePod | 251
AddPod | 67
Top 5 job IDs by number of repeated log events:
Job ID | # Log Events
66d95bbd-e9ca-40ed-966e-863a5f60a8d1 | 2807
1a839594-a273-46ff-b83c-824e11645ba0 | 2740
a03b1fbb-0116-42d6-a822-1f09bd2b0238 | 2160
e73488e7-8a41-49f3-94a3-5a4f51d03f93 | 2160
4e6dc4ba-41b6-49b9-bd50-a6fc3e818349 | 2160
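For anyone who wants to reproduce these counts, here is a minimal sketch of how such a summary could be computed. It assumes the log file is gzipped JSON Lines (one JSON object per line) and that each record has an event-type field and a job-ID field. The field names `event` and `job_id` are hypothetical and will likely need adjusting to the real MCAD log schema. Counting unparseable lines as "Unknown" is also an assumption, not necessarily how the "Unknown" row above was produced.

```python
import gzip
import json
from collections import Counter


def summarize_log(path, event_key="event", job_key="job_id", top_n=5):
    """Count log events by type and by job ID in a gzipped JSON Lines file.

    event_key and job_key are hypothetical field names; adjust them to
    match the actual log schema.
    """
    by_event = Counter()
    by_job = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                # Assumption: lines that are not valid JSON count as "Unknown".
                by_event["Unknown"] += 1
                continue
            if not isinstance(record, dict):
                by_event["Unknown"] += 1
                continue
            by_event[record.get(event_key, "Unknown")] += 1
            job_id = record.get(job_key)
            if job_id is not None:
                by_job[job_id] += 1
    # Event types sorted by count, plus the top_n most frequent job IDs.
    return by_event.most_common(), by_job.most_common(top_n)
```

Calling `summarize_log("97741eb604.2023-11-13.2300.json.gz")` on a local copy of the file would return two lists of `(key, count)` pairs, one per table above, provided the assumed field names match.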
What Have You Already Tried to Debug the Issue?
My understanding is that MCAD reports repeated attempts to delete a job, even though it has already been deleted.
Expected Behavior
MCAD controller logs accurately reflect job handling on the Vela cluster, with no repeated delete attempts logged for a job that has already been deleted.
Additional Context
Add as applicable and when known:
- Cloud: IBM COS dipc-prod-logs. See here for access.