mscclpp_torchcomms: build _comms_nccl.so with bad_weak_ptr fix

michael-beebe · Copilot · michael-beebe · commit 59e95c73059c · 2026-05-27T06:25:42.000Z
Adds a separate _comms_nccl pybind11 module alongside the existing
_comms_mscclpp module. The .so wraps the upstream torchcomms NCCL
backend sources from build-torchcomm/_deps/torchcomms-src/comms/
torchcomms/nccl/, paired with a new csrc/NcclDynamicLoader.cpp that
publishes the create_dynamic_loader_nccl entry point torchcomms's
TorchCommFactory dlopen path requires.

Why this exists:

  TorchCommFactory::create_generic_backend (TorchCommFactory.cpp)
  wraps the raw pointer returned by loader.new_comm() in a
  std::shared_ptr<TorchCommBackend>(rawBackendPtr, deleter).
  std::enable_shared_from_this<Y>'s internal weak_ptr is only
  initialized when the shared_ptr is constructed from a pointer
  whose *static type* derives from enable_shared_from_this<Y>,
  here the derived class TorchCommNCCL. Constructing
  shared_ptr<TorchCommBackend> from a pointer typed as
  TorchCommBackend* skips that machinery, so when
  TorchCommNCCL::createWork() later calls shared_from_this() it
  throws std::bad_weak_ptr: the very first all_reduce crashes,
  before any user code runs.

  Verified with a 6-line minimal repro (torchcomms.new_comm("nccl",
  ...).all_reduce(...)) that crashes with the upstream-only build
  and now succeeds with this fix.
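
  The failure mode can also be shown in isolation, without torchcomms
  or NCCL. The following is a minimal C++ sketch, not the Python repro
  mentioned above. Backend and NcclLike are hypothetical stand-ins for
  TorchCommBackend and TorchCommNCCL, and it assumes that only the
  derived class inherits from std::enable_shared_from_this.

```cpp
#include <memory>

// Hypothetical stand-ins for TorchCommBackend and TorchCommNCCL.
// Assumption: only the derived class inherits enable_shared_from_this.
struct Backend {
  virtual ~Backend() = default;
};

struct NcclLike : Backend, std::enable_shared_from_this<NcclLike> {
  // Plays the role of createWork(): it needs a shared_ptr to itself.
  std::shared_ptr<NcclLike> self() { return shared_from_this(); }
};

// Ownership is taken through a Backend*. The shared_ptr constructor only
// sees the base type, so the internal weak_ptr is never populated.
bool throwsWhenOwnedViaBasePointer() {
  Backend* raw = new NcclLike();
  std::shared_ptr<Backend> owner(raw);
  try {
    static_cast<NcclLike*>(raw)->self();
  } catch (const std::bad_weak_ptr&) {
    return true;
  }
  return false;
}

// Ownership is created by make_shared<NcclLike>, which knows the derived
// type and therefore populates the internal weak_ptr.
bool throwsWhenOwnedViaMakeShared() {
  std::shared_ptr<Backend> owner = std::make_shared<NcclLike>();
  try {
    static_cast<NcclLike*>(owner.get())->self();
  } catch (const std::bad_weak_ptr&) {
    return true;
  }
  return false;
}
```

  Under C++17 or later, the first function returns true and the second
  returns false, which matches the crash described above.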

How the fix works:

  NcclDynamicLoader.cpp::new_comm_impl creates TorchCommNCCL via
  std::make_shared so the weak_ptr is properly populated, then
  stashes the shared_ptr in a static keep-alive map keyed by the
  TorchCommBackend* it returns. destroy_comm_impl drops the
  keep-alive entry. While the entry lives, shared_from_this() inside
  the NCCL backend returns a new shared_ptr that shares ownership
  with our keep-alive one, so no upstream changes are required.
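
  The keep-alive pattern described above can be sketched on its own as
  follows. This is an illustration under stated assumptions, not the
  loader from this commit: Backend, NcclLike, newComm, destroyComm,
  wrapLikeFactory and liveComms are hypothetical names, and
  wrapLikeFactory only imitates how the factory is described as
  wrapping the raw pointer. Unlike destroy_comm_impl, the sketch omits
  the delete fallback and releases the object after the lock is
  dropped, so a slow destructor would not hold the mutex.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>

// Hypothetical stand-ins for TorchCommBackend and TorchCommNCCL.
struct Backend {
  virtual ~Backend() = default;
};

struct NcclLike : Backend, std::enable_shared_from_this<NcclLike> {
  std::shared_ptr<NcclLike> self() { return shared_from_this(); }
};

std::mutex& registryMutex() {
  static std::mutex m;
  return m;
}

std::unordered_map<Backend*, std::shared_ptr<NcclLike>>& registry() {
  static std::unordered_map<Backend*, std::shared_ptr<NcclLike>> m;
  return m;
}

// Create the object with make_shared so its weak_ptr is populated, then
// park the owning shared_ptr in the registry and hand out the raw pointer.
Backend* newComm() {
  auto sp = std::make_shared<NcclLike>();
  Backend* base = sp.get();
  std::lock_guard<std::mutex> guard(registryMutex());
  registry().emplace(base, std::move(sp));
  return base;
}

// Drop the registry entry. The object is destroyed once no shared_ptr
// obtained through shared_from_this() refers to it any more.
void destroyComm(Backend* comm) {
  std::shared_ptr<NcclLike> doomed;
  {
    std::lock_guard<std::mutex> guard(registryMutex());
    auto it = registry().find(comm);
    if (it != registry().end()) {
      doomed = std::move(it->second);
      registry().erase(it);
    }
  }
  // `doomed` goes out of scope here, after the lock has been released.
}

// Imitates the factory: a second, independent shared_ptr whose deleter
// calls back into the loader instead of deleting the object directly.
std::shared_ptr<Backend> wrapLikeFactory(Backend* raw) {
  return std::shared_ptr<Backend>(raw, [](Backend* p) { destroyComm(p); });
}

std::size_t liveComms() {
  std::lock_guard<std::mutex> guard(registryMutex());
  return registry().size();
}
```

  While a wrapper returned by wrapLikeFactory(newComm()) is alive,
  calling self() on the object succeeds and the result shares its
  control block with the registry entry. Releasing the wrapper runs
  destroyComm and empties the registry.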

CMakeLists.txt:

- Exclude NcclDynamicLoader.cpp from _comms_mscclpp source glob
  (it's the entry point for the separate _comms_nccl target).
- Add _comms_nccl pybind11_add_module target that compiles the
  upstream torchcomms NCCL backend sources + the framework set
  shared with _comms_mscclpp + our loader. Links against PyTorch's
  bundled libnccl.so.2, torch libs, GPU libs, and glog.
- Compile with FMT_HEADER_ONLY to sidestep the fmt v11/v12 ABI
  mismatch between conda's libfmt.so.11 and PyTorch's bundled fmt
  v12 headers (otherwise the .so dlopen fails with
  'undefined symbol: fmt::v12::vformat...').
- Define USE_NVSHMEM to match the historical build's compile flags
  (NcclApi.cpp checks the macro).
- torchcomm_nccl_copy custom target installs the .so into the
  package source tree alongside _comms_mscclpp.

Build with:
  ./build_torchcomm.sh config
  cmake --build build-torchcomm --target _comms_nccl -j

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
diff --git a/python/mscclpp_torchcomms/CMakeLists.txt b/python/mscclpp_torchcomms/CMakeLists.txt
@@ -29,8 +29,10 @@ if(PYBIND11_FIND_RESULT EQUAL 0 AND PYBIND11_CMAKE_DIR)
 endif()
 find_package(pybind11 REQUIRED)
 
-# Gather our C++ sources
+# Gather our C++ sources. NcclDynamicLoader.cpp is the entry point for the
+# separate _comms_nccl target — exclude it from the _comms_mscclpp module.
 file(GLOB_RECURSE TORCHCOMM_SOURCES CONFIGURE_DEPENDS ${CMAKE_CURRENT_SOURCE_DIR}/csrc/*.cpp)
+list(REMOVE_ITEM TORCHCOMM_SOURCES ${CMAKE_CURRENT_SOURCE_DIR}/csrc/NcclDynamicLoader.cpp)
 
 # Torchcomms framework sources we need to compile in directly.
 # Our module inherits from TorchWork, TorchCommBackend, and registers with
@@ -109,3 +111,93 @@ add_custom_target(torchcomm_lib_copy ALL
         ${CMAKE_CURRENT_SOURCE_DIR}
     DEPENDS _comms_mscclpp
 )
+
+# -----------------------------------------------------------------------------
+# Second target: _comms_nccl
+#
+# This builds the upstream torchcomms NCCL backend sources from
+# build-torchcomm/_deps/torchcomms-src/comms/torchcomms/nccl/ into a separate
+# .so, paired with our own NcclDynamicLoader.cpp that publishes the
+# create_dynamic_loader_nccl entry point torchcomms's TorchCommFactory dlopen
+# path requires.
+#
+# Why we don't reuse the upstream NCCL CMakeLists: it depends on the
+# torchcomms top-level CMake project that defines the `torchcomms` static
+# library and the ROOT, CONDA_INCLUDE, etc. variables. Mirroring the small
+# bit we need here keeps the dependency tree shallow and ensures the .so we
+# produce loads correctly through TorchCommFactory.
+# -----------------------------------------------------------------------------
+
+set(TORCHCOMMS_NCCL_BACKEND_SOURCES
+    ${torchcomms_SOURCE_DIR}/comms/torchcomms/nccl/NcclApi.cpp
+    ${torchcomms_SOURCE_DIR}/comms/torchcomms/nccl/TorchCommNCCL.cpp
+    ${torchcomms_SOURCE_DIR}/comms/torchcomms/nccl/TorchCommNCCLBootstrap.cpp
+    ${torchcomms_SOURCE_DIR}/comms/torchcomms/nccl/TorchCommNCCLCCA.cpp
+    ${torchcomms_SOURCE_DIR}/comms/torchcomms/nccl/TorchCommNCCLPy.cpp
+    ${torchcomms_SOURCE_DIR}/comms/torchcomms/nccl/TorchCommNCCLUtils.cpp
+    ${torchcomms_SOURCE_DIR}/comms/torchcomms/nccl/TorchWorkNCCL.cpp
+    ${torchcomms_SOURCE_DIR}/comms/torchcomms/nccl/TorchWorkNCCLQueue.cpp
+    ${torchcomms_SOURCE_DIR}/comms/torchcomms/device/cuda/CudaApi.cpp
+)
+
+# Framework sources are shared between _comms_mscclpp and _comms_nccl. Each
+# .so gets its own copy of TorchCommFactory's singleton (RTLD_LOCAL load) so
+# the registrations don't leak across backends.
+pybind11_add_module(_comms_nccl
+    ${TORCHCOMMS_NCCL_BACKEND_SOURCES}
+    ${TORCHCOMMS_FRAMEWORK_SOURCES}
+    ${CMAKE_CURRENT_SOURCE_DIR}/csrc/NcclDynamicLoader.cpp
+    # TracingGuard.cpp lives outside the framework set used by mscclpp because
+    # only the NCCL backend references it (via TracingGuard.hpp).
+    ${torchcomms_SOURCE_DIR}/comms/torchcomms/utils/TracingGuard.cpp
+)
+
+# Locate libnccl.so.2 shipped with PyTorch (the same one PyTorch's c10d uses).
+get_filename_component(_TORCH_LIB_DIR "${TORCH_INSTALL_PREFIX}/lib" ABSOLUTE)
+set(_NCCL_PYTORCH_INCLUDE "${TORCH_INSTALL_PREFIX}/../nvidia/nccl/include")
+set(_NCCL_PYTORCH_LIB     "${TORCH_INSTALL_PREFIX}/../nvidia/nccl/lib/libnccl.so.2")
+if(NOT EXISTS "${_NCCL_PYTORCH_LIB}")
+    message(FATAL_ERROR
+        "Could not find PyTorch's bundled libnccl.so.2 at ${_NCCL_PYTORCH_LIB}. "
+        "Install PyTorch with the nvidia-nccl-cu12 wheel.")
+endif()
+
+target_include_directories(_comms_nccl SYSTEM PRIVATE
+    ${torchcomms_SOURCE_DIR}
+    ${_NCCL_PYTORCH_INCLUDE}
+    ${GPU_INCLUDE_DIRS}
+)
+
+target_link_libraries(_comms_nccl PRIVATE
+    ${TORCH_LIBRARIES}
+    ${GPU_LIBRARIES}
+    ${_NCCL_PYTORCH_LIB}
+    glog::glog
+)
+
+if(EXISTS "${TORCH_PYTHON_LIB}")
+    target_link_libraries(_comms_nccl PRIVATE "${TORCH_PYTHON_LIB}")
+endif()
+
+# Match the NVSHMEM macro used in the historical build (NcclApi.cpp checks it).
+target_compile_definitions(_comms_nccl PRIVATE
+    USE_NVSHMEM
+    # fmt v12 ships in PyTorch's bundled headers (torch/include/fmt/), but only
+    # fmt v11 is available as a linkable shared object (libfmt.so.11 in conda).
+    # Compiling against v12 headers and linking against v11 yields an
+    # ``undefined symbol: fmt::v12::vformat...`` at dlopen time. Force the
+    # header-only build of whichever fmt headers are picked so all fmt code
+    # is inlined into _comms_nccl.so itself, eliminating the runtime dep.
+    FMT_HEADER_ONLY
+)
+
+target_compile_features(_comms_nccl PRIVATE cxx_std_20)
+
+install(TARGETS _comms_nccl LIBRARY DESTINATION mscclpp_torchcomms COMPONENT torchcomm)
+
+add_custom_target(torchcomm_nccl_copy ALL
+    COMMAND ${CMAKE_COMMAND} -E copy_if_different
+        ${CMAKE_LIBRARY_OUTPUT_DIRECTORY}/_comms_nccl*.so
+        ${CMAKE_CURRENT_SOURCE_DIR}
+    DEPENDS _comms_nccl
+)
diff --git a/python/mscclpp_torchcomms/csrc/NcclDynamicLoader.cpp b/python/mscclpp_torchcomms/csrc/NcclDynamicLoader.cpp
@@ -0,0 +1,82 @@
+// Copyright (c) Microsoft Corporation.
+// Licensed under the MIT License.
+
+// Dynamic-loader entry point for the upstream torchcomms NCCL backend.
+//
+// Why this file exists separately from the upstream torchcomms tree:
+//
+// The torchcomms TorchCommFactory dlopen-based loader path
+// (TorchCommFactory::create_generic_backend in TorchCommFactory.cpp) wraps the
+// raw pointer returned by `loader.new_comm()` in a
+// `std::shared_ptr<TorchCommBackend>(rawBackendPtr, deleter)`. The
+// `enable_shared_from_this<Y>` mechanism only initializes its internal
+// weak_ptr when the shared_ptr is constructed from a pointer to the *derived*
+// type `Y`. Constructing `shared_ptr<TorchCommBackend>` from a pointer
+// statically typed as `TorchCommBackend*` skips that machinery, so when
+// `TorchCommNCCL::createWork()` later calls `shared_from_this()` it throws
+// `std::bad_weak_ptr` (the very first all_reduce crashes).
+//
+// To work around this without patching torchcomms, this loader stores a
+// keep-alive `shared_ptr<TorchCommNCCL>` (created via `std::make_shared` so
+// the weak_ptr is set up correctly) in a static map keyed by the raw
+// `TorchCommBackend*` we hand back. The factory still wraps our pointer in
+// its own `shared_ptr<TorchCommBackend>` for ownership semantics, and
+// destroy_comm_impl drops the keep-alive entry. As long as the entry
+// lives, `shared_from_this()` inside the NCCL backend successfully
+// returns a new shared_ptr that shares ownership with our keep-alive one.
+
+#include <comms/torchcomms/TorchCommBackend.hpp>
+#include <comms/torchcomms/nccl/TorchCommNCCL.hpp>
+
+#include <memory>
+#include <mutex>
+#include <unordered_map>
+
+namespace {
+
+std::mutex& keepaliveMutex() {
+  static std::mutex m;
+  return m;
+}
+
+std::unordered_map<torch::comms::TorchCommBackend*, std::shared_ptr<torch::comms::TorchCommNCCL>>&
+keepaliveMap() {
+  static std::unordered_map<torch::comms::TorchCommBackend*, std::shared_ptr<torch::comms::TorchCommNCCL>>
+      m;
+  return m;
+}
+
+torch::comms::TorchCommBackend* new_comm_impl() {
+  auto sp = std::make_shared<torch::comms::TorchCommNCCL>();
+  auto* base = static_cast<torch::comms::TorchCommBackend*>(sp.get());
+  {
+    std::lock_guard<std::mutex> guard(keepaliveMutex());
+    keepaliveMap().emplace(base, std::move(sp));
+  }
+  return base;
+}
+
+void destroy_comm_impl(torch::comms::TorchCommBackend* comm) {
+  std::lock_guard<std::mutex> guard(keepaliveMutex());
+  auto it = keepaliveMap().find(comm);
+  if (it != keepaliveMap().end()) {
+    keepaliveMap().erase(it);
+  } else {
+    delete comm;
+  }
+}
+
+const char* get_supported_version_impl() {
+  return torch::comms::TORCHCOMM_BACKEND_ABI_VERSION;
+}
+
+}  // namespace
+
+extern "C" __attribute__((visibility("default"))) torch::comms::DynamicLoaderInterface
+create_dynamic_loader_nccl() {
+  return torch::comms::DynamicLoaderInterface{
+      .new_comm = new_comm_impl,
+      .destroy_comm = destroy_comm_impl,
+      .get_supported_version = get_supported_version_impl,
+  };
+}