Description
Describe the bug
The process crashes with SIGABRT in ucs_log_file_rotate() when the log file reaches the configured size limit in a high-concurrency environment. The crash occurs when access() returns an error at line 172 of src/ucs/debug/log.c, which triggers ucs_fatal() and aborts the process.
Error message: Fatal: unable to write to <log_file>.log.2
Root cause: Race condition in ucs_log_file_rotate(). The function lacks thread synchronization, and the code has a known TOCTOU (Time-of-Check-Time-of-Use) issue between the access() check and the rename() operation (acknowledged by the /* coverity[toctou] */ comment in the code).
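For illustration, here is a minimal sketch of one way the rotation could avoid this failure mode. It is not the UCX implementation or a tested patch. It assumes rotation shifts `<file>.N` to `<file>.N+1`, and the names `sketch_log_rotate` and `sketch_rotate_lock` are hypothetical. The idea is to serialize rotation with a mutex, drop the separate access() check, and treat ENOENT from rename() as "another thread already moved this file" rather than as a fatal error.

```c
#include <errno.h>
#include <pthread.h>
#include <stdio.h>

/* Hypothetical helper names; none of this is UCX API. */
enum { SKETCH_NAME_MAX = 4096 };

static pthread_mutex_t sketch_rotate_lock = PTHREAD_MUTEX_INITIALIZER;

/* Shift base.(N-1) -> base.N, ..., base.1 -> base.2, then base -> base.1.
 * Returns 0 on success, -1 on an unexpected error (errno is set). */
int sketch_log_rotate(const char *base, unsigned max_backups)
{
    char old_name[SKETCH_NAME_MAX];
    char new_name[SKETCH_NAME_MAX];
    unsigned idx;
    int n, ret = 0, saved_errno = 0;

    if (max_backups == 0) {
        return 0;
    }

    /* Only one thread in this process rotates at a time. */
    pthread_mutex_lock(&sketch_rotate_lock);

    for (idx = max_backups; idx >= 1; --idx) {
        if (idx == 1) {
            n = snprintf(old_name, sizeof(old_name), "%s", base);
        } else {
            n = snprintf(old_name, sizeof(old_name), "%s.%u", base, idx - 1);
        }
        if (n < 0 || (size_t)n >= sizeof(old_name)) {
            saved_errno = ENAMETOOLONG;
            ret = -1;
            break;
        }

        n = snprintf(new_name, sizeof(new_name), "%s.%u", base, idx);
        if (n < 0 || (size_t)n >= sizeof(new_name)) {
            saved_errno = ENAMETOOLONG;
            ret = -1;
            break;
        }

        /* No access() pre-check: rename() itself reports whether the source
         * exists, so there is no window between check and use. A missing
         * source (ENOENT) is skipped instead of aborting. */
        if (rename(old_name, new_name) != 0 && errno != ENOENT) {
            saved_errno = errno;
            ret = -1;
            break;
        }
    }

    pthread_mutex_unlock(&sketch_rotate_lock);

    if (ret != 0) {
        errno = saved_errno;
    }
    return ret;
}
```

A process-wide mutex only serializes threads within one process. If several processes share the same log path, tolerating ENOENT is the part that still helps, and returning an error instead of calling a fatal handler would let the caller decide whether a logging failure should abort the application.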
Steps to Reproduce
This is a non-deterministic race condition that occurs in production under specific conditions:
Required conditions:
- High-concurrency application (100+ threads)
- UCX debug logging enabled with file output
- Log file configured with size-based rotation
- Multiple threads writing UCX logs simultaneously when file approaches size limit
UCX configuration:
UCX_LOG_FILE=/path/to/ucx.log
UCX_LOG_LEVEL=DEBUG
UCX_LOG_FILE_SIZE=524288000 # 500MB
UCX_LOG_FILE_ROTATE=100
Note: We cannot provide exact reproduction steps, as the issue requires a specific high-concurrency workload. It was observed in a production environment with RDMA operations on Mellanox devices.
Setup and versions
UCX version: 1.18.0
OS version + CPU architecture:
Rocky Linux release 8.6 (Green Obsidian)
Linux 5.14.0-162.nos.4.el8.x86_64 x86_64
RDMA driver:
rdma-core-58mlnx43-1.58415.x86_64
libibverbs-58mlnx43-1.58415.x86_64
Hardware: Mellanox ConnectX-5 (MT4125), Firmware 22.43.3608, RoCE 100Gbps
Additional information
Core dump stack trace:
#0 __pthread_kill_implementation () from /lib64/libc.so.6
#1 raise () from /lib64/libc.so.6
#2 abort () from /lib64/libc.so.6
#3 ucs_fatal_error_message (
file="debug/log.c",
line=172,
function="ucs_log_file_rotate",
message_buf="Fatal: unable to write to <log_file>.log.2"
) at debug/assert.c:38
#4 ucs_fatal_error_format (...) at debug/assert.c:53
#5 ucs_log_file_rotate () at debug/log.c:172
old_log_file_name = "<log_file>.log.2"
new_log_file_name = "<log_file>.log.3"
#6 ucs_log_handle_file_max_size () at debug/log.c:205
#7 ucs_log_print () at debug/log.c:258
(called from uct_ib_mlx5_txwq_init)