Skip to content

Conversation

michal-shalev
Copy link
Contributor

@michal-shalev michal-shalev commented Sep 29, 2025

What?

Improve error logging to UCP and UCT device APIs to improve debugging.

Why?

The device API functions lacked proper error reporting, making it difficult to debug failures in GPU kernels.
Additionally, some UCP host functions were returning error codes without logging what went wrong,

How?

  • Changed ucs_device_printf macro to automatically include file, line, and function information
  • Added ucs_device_status_string() function for device-compatible status code to string conversion
  • Added error logs in ucp_device_impl.h, uct_device_impl.h, ucp_device.c and gdaki.cuh

@michal-shalev michal-shalev requested a review from brminich October 6, 2025 09:13
@michal-shalev michal-shalev changed the title DEVICE: Add error logs in GPU and host device APIs DEVICE: Improve error logs in GPU and host device APIs Oct 6, 2025
tvegas1
tvegas1 previously approved these changes Oct 6, 2025
@michal-shalev michal-shalev requested a review from tvegas1 October 8, 2025 18:41
UCS_F_DEVICE const char* ucs_device_status_string(ucs_status_t status)
{
switch (status) {
UCS_STATUS_STRING_CASES
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO X macro is an ideal candidate here, because it's much more flexible.
Also we keep all related things in one place, so it's less error prone

#define UCS_FOREACH_STATUS(_macro) \
    _macro(UCS_OK, 0, "Success") \
    _macro(UCS_INPROGRESS, 1, "Operation in progress") \
    ...

#define UCS_STATUS_ENUMIFY(ID, VALUE, _) ID = VALUE,

typedef enum {
    UCS_FOREACH_STATUS(UCS_STATUS_ENUMIFY)
} ucs_status_t;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, it's a great idea, I pushed a commit that changes the implementation to use X macro.

#define ucs_device_printf(_title, _fmt, ...) \
printf("(%d:%d) %6s " _fmt "\n", threadIdx.x, blockIdx.x, _title, \
##__VA_ARGS__)
printf("(%d:%d) %6s %s:%d %s: " _fmt "\n", threadIdx.x, blockIdx.x, _title, \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some ideas for improvements:

  1. Now messages are not aligned, because thread id vary from 0 to 1023. Maybe we can use a fixed width for thread:block part?
  2. I think we don't need a function name - it would make the output too long. Also not sure if it really helps in case of inline functions - did you check?
  3. Maybe we could also fix max width for file:line, so than the output looks like:
  (0:0) DEBUG filename_long.cu:1234 Message 1
(255:0) ERROR         short.cu:9    Message 2

IMO it's important that message starts at the same length, otherwise it's hard to read

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

%5d and %5s

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Fixed with Thomas' suggestion.
  2. It works in the case of the inline functions, I tested it. IMO function name is very helpful for debug.
  3. Yes, I changed filename max width to 40 character, line number to max 4 digits and function name max to 30 characters.

This is how it looks like now:

(    0:    0) ERROR /hpc/newhome/mshalev/workspace/gdaki/ucx:255  uct_rc_mlx5_gda_ep_single     : This is an example error message from lane 0
(    0:    0) DEBUG /hpc/newhome/mshalev/workspace/gdaki/ucx:253  uct_rc_mlx5_gda_ep_single     : This is an example debug message addr=0x7fdeacc00200 remote=0x7fdeacc44400 len=8

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, much better now

  1. But still I'm against the function name, because:
  • it makes output too long
  • it's redundant since file:line is present
  • it's absent in UCX logs (I think it would be nice to have a similar layout)
UCX logs
[1760011677.835218] [rock14:3182762:0]           debug.c:1157 UCX  DEBUG using signal stack 0x7f1a76826000 size 141824
[1760011677.848311] [rock14:3182762:0]             cpu.c:338  UCX  DEBUG measured tsc frequency 2595.125 MHz after 0.08 ms
[1760011677.848337] [rock14:3182762:0]            init.c:120  UCX  DEBUG /labhome/iyastrebov/ws/ucx2/bld-devel/lib/libucs.so.0 loaded at 0x7f1a75d50000

It looks weird if we have different log formats within the same library

  1. Can we print just the basename of the file? Long log lines are hard to read
  2. Maybe align (0:1) on the right or left?

@michal-shalev michal-shalev disabled auto-merge October 9, 2025 09:07
@michal-shalev michal-shalev requested a review from iyastreb October 9, 2025 10:45
_macro(UCS_ERR_CONNECTION_RESET, -25, "Connection reset by remote peer") \
_macro(UCS_ERR_FIRST_LINK_FAILURE, -40, "First link failure") \
_macro(UCS_ERR_LAST_LINK_FAILURE, -59, "Last link failure") \
_macro(UCS_ERR_FIRST_ENDPOINT_FAILURE,-60, "First endpoint failure") \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls align

uct_allocated_memory_t mem;

if (!(ep->flags & UCP_EP_FLAG_REMOTE_CONNECTED)) {
ucs_error("ep=%p didn't complete wireup", ep);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need to remove that one eventually?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants