-
Notifications
You must be signed in to change notification settings - Fork 489
DEVICE: Improve error logs in GPU and host device APIs #10921
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
0deab63
to
341b9c3
Compare
341b9c3
to
325142a
Compare
59e1cb7
to
e00e62a
Compare
UCS_F_DEVICE const char* ucs_device_status_string(ucs_status_t status) | ||
{ | ||
switch (status) { | ||
UCS_STATUS_STRING_CASES |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO X macro is an ideal candidate here, because it's much more flexible.
Also we keep all related things in one place, so it's less error prone
#define UCS_FOREACH_STATUS(_macro) \
_macro(UCS_OK, 0, "Success") \
_macro(UCS_INPROGRESS, 1, "Operation in progress") \
...
#define UCS_STATUS_ENUMIFY(ID, VALUE, _) ID = VALUE,
typedef enum {
UCS_FOREACH_STATUS(UCS_STATUS_ENUMIFY)
} ucs_status_t;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, it's a great idea, I pushed a commit that changes the implementation to use X macro.
src/ucs/sys/device_code.h
Outdated
#define ucs_device_printf(_title, _fmt, ...) \ | ||
printf("(%d:%d) %6s " _fmt "\n", threadIdx.x, blockIdx.x, _title, \ | ||
##__VA_ARGS__) | ||
printf("(%d:%d) %6s %s:%d %s: " _fmt "\n", threadIdx.x, blockIdx.x, _title, \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some ideas for improvements:
- Now messages are not aligned, because thread id vary from 0 to 1023. Maybe we can use a fixed width for thread:block part?
- I think we don't need a function name - it would make the output too long. Also not sure if it really helps in case of inline functions - did you check?
- Maybe we could also fix max width for file:line, so than the output looks like:
(0:0) DEBUG filename_long.cu:1234 Message 1
(255:0) ERROR short.cu:9 Message 2
IMO it's important that message starts at the same length, otherwise it's hard to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
%5d and %5s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Fixed with Thomas' suggestion.
- It works in the case of the inline functions, I tested it. IMO function name is very helpful for debug.
- Yes, I changed filename max width to 40 character, line number to max 4 digits and function name max to 30 characters.
This is how it looks like now:
( 0: 0) ERROR /hpc/newhome/mshalev/workspace/gdaki/ucx:255 uct_rc_mlx5_gda_ep_single : This is an example error message from lane 0
( 0: 0) DEBUG /hpc/newhome/mshalev/workspace/gdaki/ucx:253 uct_rc_mlx5_gda_ep_single : This is an example debug message addr=0x7fdeacc00200 remote=0x7fdeacc44400 len=8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, much better now
- But still I'm against the function name, because:
- it makes output too long
- it's redundant since file:line is present
- it's absent in UCX logs (I think it would be nice to have a similar layout)
UCX logs
[1760011677.835218] [rock14:3182762:0] debug.c:1157 UCX DEBUG using signal stack 0x7f1a76826000 size 141824
[1760011677.848311] [rock14:3182762:0] cpu.c:338 UCX DEBUG measured tsc frequency 2595.125 MHz after 0.08 ms
[1760011677.848337] [rock14:3182762:0] init.c:120 UCX DEBUG /labhome/iyastrebov/ws/ucx2/bld-devel/lib/libucs.so.0 loaded at 0x7f1a75d50000
It looks weird if we have different log formats within the same library
- Can we print just the basename of the file? Long log lines are hard to read
- Maybe align
(0:1)
on the right or left?
_macro(UCS_ERR_CONNECTION_RESET, -25, "Connection reset by remote peer") \ | ||
_macro(UCS_ERR_FIRST_LINK_FAILURE, -40, "First link failure") \ | ||
_macro(UCS_ERR_LAST_LINK_FAILURE, -59, "Last link failure") \ | ||
_macro(UCS_ERR_FIRST_ENDPOINT_FAILURE,-60, "First endpoint failure") \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls align
uct_allocated_memory_t mem; | ||
|
||
if (!(ep->flags & UCP_EP_FLAG_REMOTE_CONNECTED)) { | ||
ucs_error("ep=%p didn't complete wireup", ep); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to remove that one eventually?
What?
Improve error logging to UCP and UCT device APIs to improve debugging.
Why?
The device API functions lacked proper error reporting, making it difficult to debug failures in GPU kernels.
Additionally, some UCP host functions were returning error codes without logging what went wrong,
How?
ucs_device_printf
macro to automatically include file, line, and function informationucs_device_status_string()
function for device-compatible status code to string conversionucp_device_impl.h
,uct_device_impl.h
,ucp_device.c
andgdaki.cuh