MCAD controller logs #690

Open
@ordavidov

Describe the Bug

  • Seeing multiple delete attempts on the same job ID.
  • Seeing many deleteJob log events and very few others.

Steps to Reproduce the Bug

The MCAD log stats come from the log file year=2023/month=11/day=13/97741eb604.2023-11-13.2300.json.gz in dipc-prod-logs, which covers the one-hour period from 2023-11-13 23:00:00 to 2023-11-13 23:59:59. The stats are summarized below.

Here is the summary of log events by event type:
MCAD Log Event Type | # Log Events
deleteJob | 58423
processCleanupJob | 318
Unknown | 293
UpdatePod | 251
AddPod | 67

Here are the top 5 job IDs by number of repeated log events:
Job ID | # Log Events
66d95bbd-e9ca-40ed-966e-863a5f60a8d1 | 2807
1a839594-a273-46ff-b83c-824e11645ba0 | 2740
a03b1fbb-0116-42d6-a822-1f09bd2b0238 | 2160
e73488e7-8a41-49f3-94a3-5a4f51d03f93 | 2160
4e6dc4ba-41b6-49b9-bd50-a6fc3e818349 | 2160
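
For anyone who wants to reproduce the two tables above, here is a minimal Python sketch of how the counts could be tallied. It is not the script used for this report. It assumes the file is gzipped JSON with one log record per line, and the field names `event_type` and `job_id` are hypothetical placeholders; adjust them to whatever the MCAD log records actually contain.

```python
import gzip
import json
from collections import Counter
from typing import Iterable, Iterator


def read_events(path: str) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzipped file.

    Assumption: the file holds one JSON object per line.
    """
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(events: Iterable[dict], top_n: int = 5):
    """Count log events per event type and per job ID.

    "event_type" and "job_id" are hypothetical field names.
    Records without an event type are counted as "Unknown".
    """
    by_type = Counter()
    by_job = Counter()
    for ev in events:
        by_type[ev.get("event_type", "Unknown")] += 1
        job_id = ev.get("job_id")
        if job_id:
            by_job[job_id] += 1
    # most_common() sorts by descending count.
    return by_type.most_common(), by_job.most_common(top_n)


if __name__ == "__main__":
    import sys

    # Usage: python summarize_logs.py <path to .json.gz file>
    types, jobs = summarize(read_events(sys.argv[1]))
    for name, count in types:
        print(f"{name} | {count}")
    for job_id, count in jobs:
        print(f"{job_id} | {count}")
```

Restricting the per-job count to records whose event type is deleteJob would show directly how many delete attempts each job ID received.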

What Have You Already Tried to Debug the Issue?

My understanding is that MCAD keeps logging attempts to delete a job even after that job has already been deleted.

Expected Behavior

MCAD controller logs accurately reflect job handling on the Vela cluster.

Additional Context

Add as applicable and when known:

  • Cloud: IBM COS dipc-prod-logs. See here for access.
