UCS log rotation caused coredump #11214

@MyCproject

Describe the bug

The process crashes with SIGABRT in ucs_log_file_rotate() when the log file reaches the configured size limit in a high-concurrency environment. The access() check at line 172 of src/ucs/debug/log.c returns an error, which triggers ucs_fatal() and aborts the process.

Error message: Fatal: unable to write to <log_file>.log.2

Root cause: a race condition in ucs_log_file_rotate(). The function has no thread synchronization, and the code has a known TOCTOU (time-of-check to time-of-use) issue between the access() check and the rename() call (acknowledged by the /* coverity[toctou] */ comment in the code).
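One possible direction for a fix, shown as a minimal standalone sketch rather than a patch against UCX: serialize rotation with a mutex and drop the access() pre-check, calling rename() directly and treating ENOENT as "nothing to rotate". The names rotate_logs_locked, rotate_lock, base and max_backups are hypothetical and do not come from the UCX source; the real function also tracks state that is omitted here.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical lock for this sketch; not taken from the UCX source. */
static pthread_mutex_t rotate_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Shift base.(N-1) -> base.N, ..., base.1 -> base.2, base -> base.1.
 * Only one thread rotates at a time. A missing source file (ENOENT) is
 * skipped instead of being treated as fatal.
 * Returns 0 on success, -1 on any other rename() error (errno is set).
 */
static int rotate_logs_locked(const char *base, int max_backups)
{
    char old_name[512], new_name[512];
    int idx, ret = 0, saved_errno = 0;

    assert(base != NULL);

    pthread_mutex_lock(&rotate_lock);
    for (idx = max_backups - 1; idx >= 0; --idx) {
        if (idx == 0) {
            snprintf(old_name, sizeof(old_name), "%s", base);
        } else {
            snprintf(old_name, sizeof(old_name), "%s.%d", base, idx);
        }
        snprintf(new_name, sizeof(new_name), "%s.%d", base, idx + 1);

        /* No access() pre-check: attempt the rename and inspect errno. */
        if ((rename(old_name, new_name) != 0) && (errno != ENOENT)) {
            saved_errno = errno;
            ret         = -1;
            break;
        }
    }
    pthread_mutex_unlock(&rotate_lock);

    if (ret != 0) {
        errno = saved_errno; /* restore the error from the failed rename() */
    }
    return ret;
}
```

A mutex only protects threads within one process. If several processes share the same UCX_LOG_FILE path, the ENOENT tolerance (or a file lock) would still be needed.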

Steps to Reproduce

This is a non-deterministic race condition that occurs in production under specific conditions:

Required conditions:

  • High-concurrency application (100+ threads)
  • UCX debug logging enabled with file output
  • Log file configured with size-based rotation
  • Multiple threads writing UCX logs simultaneously when file approaches size limit

UCX configuration:

UCX_LOG_FILE=/path/to/ucx.log
UCX_LOG_LEVEL=DEBUG
UCX_LOG_FILE_SIZE=524288000  # 500MB
UCX_LOG_FILE_ROTATE=100

Note: We cannot provide exact reproduction steps because the crash requires a specific high-concurrency workload. The issue was observed in a production environment running RDMA operations on Mellanox devices.

Setup and versions

UCX version: 1.18.0

OS version + CPU architecture:

Rocky Linux release 8.6 (Green Obsidian)
Linux 5.14.0-162.nos.4.el8.x86_64 x86_64

RDMA driver:

rdma-core-58mlnx43-1.58415.x86_64
libibverbs-58mlnx43-1.58415.x86_64

Hardware: Mellanox ConnectX-5 (MT4125), Firmware 22.43.3608, RoCE 100Gbps

Additional information

Core dump stack trace:

#0  __pthread_kill_implementation () from /lib64/libc.so.6
#1  raise () from /lib64/libc.so.6
#2  abort () from /lib64/libc.so.6
#3  ucs_fatal_error_message (
      file="debug/log.c",
      line=172,
      function="ucs_log_file_rotate",
      message_buf="Fatal: unable to write to <log_file>.log.2"
    ) at debug/assert.c:38
#4  ucs_fatal_error_format (...) at debug/assert.c:53
#5  ucs_log_file_rotate () at debug/log.c:172
      old_log_file_name = "<log_file>.log.2"
      new_log_file_name = "<log_file>.log.3"
#6  ucs_log_handle_file_max_size () at debug/log.c:205
#7  ucs_log_print () at debug/log.c:258
(called from uct_ib_mlx5_txwq_init)
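Frame #5 shows the failing check on <log_file>.log.2 while it was about to be renamed to <log_file>.log.3. One interleaving that would be consistent with this (an assumption, not confirmed from the core dump) is that another thread had already moved .log.2 to .log.3. The sketch below replays that interleaving sequentially in a single thread; replay_lost_race and base are hypothetical names, and the code does not call into UCX.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replays one interleaving of two unsynchronized rotators, step by step.
 * Returns the errno seen by the second rotator's access() check, 0 if the
 * check unexpectedly passed, or -1 if the setup failed.
 */
static int replay_lost_race(const char *base)
{
    char log2[512], log3[512];
    FILE *f;
    int err;

    assert(base != NULL);

    snprintf(log2, sizeof(log2), "%s.2", base);
    snprintf(log3, sizeof(log3), "%s.3", base);

    /* Initial state: <base>.2 exists. */
    f = fopen(log2, "w");
    if (f == NULL) {
        return -1;
    }
    fclose(f);

    /* Rotator A wins the race and moves .2 to .3. */
    if (rename(log2, log3) != 0) {
        unlink(log2);
        return -1;
    }

    /* Rotator B still expects .2 to exist and checks it before renaming. */
    errno = 0;
    err   = (access(log2, W_OK) != 0) ? errno : 0;

    unlink(log3); /* clean up the file left by rotator A */
    return err;
}
```

In this replay the second check fails with ENOENT, which is the kind of access() failure that would reach the ucs_fatal() call described above.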
