MCAD controller logs #690

Open
@ordavidov

Describe the Bug

  • Seeing multiple delete attempts on the same job ID.
  • Seeing many deleteJob log events and very few others.

Steps to Reproduce the Bug

The MCAD log stats below come from the log file year=2023/month=11/day=13/97741eb604.2023-11-13.2300.json.gz in dipc-prod-logs, which covers the one-hour period from 2023-11-13 23:00:00 to 2023-11-13 23:59:59.

Here is the summary by log event type:
MCAD Log Event Type | # Log Events
deleteJob | 58423
processCleanupJob | 318
Unknown | 293
UpdatePod | 251
AddPod | 67
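
To put the imbalance in numbers, the counts in the table above can be converted to shares of the total. This is only arithmetic on the reported figures; the variable names are illustrative.

```python
# Event counts copied from the summary table in this issue.
counts = {
    "deleteJob": 58423,
    "processCleanupJob": 318,
    "Unknown": 293,
    "UpdatePod": 251,
    "AddPod": 67,
}

total = sum(counts.values())
shares = {event: n / total for event, n in counts.items()}

# Print each event type with its share of all log events in the hour.
for event, share in shares.items():
    print(f"{event:<20}{counts[event]:>7}{share:>9.2%}")
```

With these figures, deleteJob accounts for roughly 98% of all log events in the hour.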

Here are the top 5 job IDs by number of repeated log events:
Job ID | # Log Events
66d95bbd-e9ca-40ed-966e-863a5f60a8d1 | 2807
1a839594-a273-46ff-b83c-824e11645ba0 | 2740
a03b1fbb-0116-42d6-a822-1f09bd2b0238 | 2160
e73488e7-8a41-49f3-94a3-5a4f51d03f93 | 2160
4e6dc4ba-41b6-49b9-bd50-a6fc3e818349 | 2160
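
Tallies like the two tables above could be reproduced with a short script along these lines. This is a minimal sketch, not the script that produced the numbers in this issue: it assumes the file holds one JSON object per line, and the field names `event` and `job_id` are hypothetical placeholders that would need to match the actual MCAD log schema. Treating unparseable lines and records without an event field as "Unknown" is also an assumption.

```python
import gzip
import json
from collections import Counter


def tally(lines, event_field="event", job_field="job_id"):
    """Count log records per event type and per job ID.

    `event_field` and `job_field` are hypothetical names; adjust them
    to the real log schema.
    """
    by_event = Counter()
    by_job = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if not isinstance(record, dict):
            # Assumption: anything that is not a JSON object is "Unknown".
            by_event["Unknown"] += 1
            continue
        by_event[record.get(event_field, "Unknown")] += 1
        job_id = record.get(job_field)
        if job_id is not None:
            by_job[job_id] += 1
    return by_event, by_job


def tally_file(path, **kwargs):
    """Run tally() over a gzip-compressed JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return tally(fh, **kwargs)


# Tiny in-memory example instead of the real log file.
sample = [
    '{"event": "deleteJob", "job_id": "a"}',
    '{"event": "deleteJob", "job_id": "a"}',
    '{"event": "deleteJob", "job_id": "b"}',
    '{"event": "AddPod", "job_id": "b"}',
    'not json',
]
by_event, by_job = tally(sample)
print(by_event.most_common())
print(by_job.most_common(5))  # top 5 job IDs by number of log events
```

For the real data, `tally_file` would be called with a local copy of the `.json.gz` file, and `by_job.most_common(5)` would give the top 5 job IDs.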

What Have You Already Tried to Debug the Issue?

My understanding is that MCAD keeps logging attempts to delete a job even after that job has already been deleted.

Expected Behavior

MCAD controller logs accurately reflect job handling on the Vela cluster.

Additional Context

Add as applicable and when known:

  • Cloud: IBM COS dipc-prod-logs. See here for access.
