michal-shalev
Contributor

What?

Add device-side logging infrastructure to NIXL with a nixl_device_error macro, and use it to log failures of UCX backend operations.

Why?

Without NIXL-layer logging, we lose calling context when UCX errors occur in device code. Each layer should log its own context for proper debugging.

How?

Added nixl_device_printf and nixl_device_error macros that print thread/block info, file, line, and function. Used in nixlGpuConvertUcsStatus and nixlGpuGetXferStatus to log "UCX backend error" when errors occur.
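A minimal sketch of how the two macros might fit together, for readers following the description above. The nixl_device_printf body is the one quoted in the review comments on this PR; the nixl_device_error definition and the sketch_convert_status helper are assumptions for illustration only, since the description names the macro but does not show it. The host-side threadIdx/blockIdx stand-ins exist only so the sketch also builds without nvcc.

```cpp
#include <cstdio>

#ifdef __CUDACC__
#define SKETCH_DEVICE __device__
#else
// Host-only stand-ins (assumption) so the sketch compiles without nvcc.
#define SKETCH_DEVICE
struct sketch_dim3 { unsigned x, y, z; };
static sketch_dim3 threadIdx = {0, 0, 0};
static sketch_dim3 blockIdx = {0, 0, 0};
#endif

/* Macro body as quoted in the review thread: prints thread and block
 * indices, a title, file, line, and function before the message. */
#define nixl_device_printf(_title, _fmt, ...) \
    printf("(%5d:%5d) %5s %s:%d %s: " _fmt "\n", threadIdx.x, blockIdx.x, _title, \
           __FILE__, __LINE__, __func__, ##__VA_ARGS__)

/* Assumed definition: the PR description names nixl_device_error but does
 * not show how it is defined. */
#define nixl_device_error(_fmt, ...) \
    nixl_device_printf("ERROR", _fmt, ##__VA_ARGS__)

/* Hypothetical helper (not the PR's nixlGpuConvertUcsStatus): treats a
 * negative backend status as a failure, logs it, and returns -1. */
SKETCH_DEVICE inline int sketch_convert_status(int backend_status) {
    if (backend_status < 0) {
        nixl_device_error("UCX backend error: status %d", backend_status);
        return -1;
    }
    return 0;
}
```

Logging at the point of conversion means the NIXL file and line appear next to whatever the UCX layer already printed, which is the calling context the description says is otherwise lost.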

github-actions bot commented Oct 9, 2025

👋 Hi michal-shalev! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀


/* Helper macro to print a message from NIXL device function including the
 * thread and block indices, file, line, and function */
#define nixl_device_printf(_title, _fmt, ...) \
Contributor

Rename _title to _level.

 * thread and block indices, file, line, and function */
#define nixl_device_printf(_title, _fmt, ...) \
    printf("(%5d:%5d) %5s %s:%d %s: " _fmt "\n", threadIdx.x, blockIdx.x, _title, \
           __FILE__, __LINE__, __func__, ##__VA_ARGS__)
Contributor

__func__ is probably not needed. Maybe left-pad the file name and pad the line number with %-5d.
