[PG][EP] Fix engine.so runtime dependency#2255

Open
mmangkad wants to merge 1 commit into
kvcache-ai:mainfrom
mmangkad:fix-ep-engine-as-needed

Conversation

@mmangkad

Description

This fixes the torch 2.12 Mooncake EP import failure where the built ep_2_12_0 extension does not record engine.so as a direct runtime dependency.

The failure mode is an import-time unresolved symbol from glog:

undefined symbol: _ZN6google10LogMessageC1EPKcii

I checked the installed Mooncake 0.3.11.post1 wheel and found that the torch 2.12 EP artifact is the only EP artifact missing the direct engine.so dependency:

ep_2_9_1.cpython-312-x86_64-linux-gnu.so
  engine.so: True
  torch libs: ['libc10.so', 'libtorch.so', 'libtorch_cpu.so', 'libtorch_python.so', 'libc10_cuda.so']

ep_2_10_0.cpython-312-x86_64-linux-gnu.so
  engine.so: True
  torch libs: ['libc10.so', 'libtorch.so', 'libtorch_cpu.so', 'libtorch_python.so', 'libc10_cuda.so']

ep_2_11_0.cpython-312-x86_64-linux-gnu.so
  engine.so: True
  torch libs: ['libc10.so', 'libtorch.so', 'libtorch_cpu.so', 'libtorch_python.so', 'libc10_cuda.so']

ep_2_12_0.cpython-312-x86_64-linux-gnu.so
  engine.so: False
  torch libs: ['libc10.so', 'libtorch_cpu.so', 'libtorch_python.so', 'libc10_cuda.so']

ep_2_12_0 still has unresolved glog references:

U _ZN6google10LogMessage6streamEv
U _ZN6google10LogMessageC1EPKci
U _ZN6google10LogMessageC1EPKcii
U _ZN6google10LogMessageD1Ev

Directly importing the exact torch 2.12 EP module fails:

ImportError: .../mooncake/ep_2_12_0.cpython-312-x86_64-linux-gnu.so: undefined symbol: _ZN6google10LogMessageC1EPKcii

Preloading mooncake.engine with RTLD_GLOBAL makes the same import pass, which confirms the missing dependency is the issue:

preload then import mooncake.ep_2_12_0 OK

Mooncake already links EP/PG with -l:engine.so, but in torch 2.12 builds the linker can drop that dependency when --as-needed is in effect. This PR wraps only -l:engine.so with:

-Wl,--push-state,--no-as-needed
-l:engine.so
-Wl,--pop-state

This preserves the existing linker state for all other libraries while forcing engine.so to remain a direct DT_NEEDED dependency, matching the working EP wheel contract from torch 2.9/2.10/2.11.
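
As an illustration only, the wrapping could be expressed as a small helper in a setup.py-style link-argument list. The helper name `force_needed` and the `-lother_dep` entry are hypothetical; the actual structure of Mooncake's setup.py files is not shown here.

```python
def force_needed(lib_flag: str) -> list[str]:
    """Wrap one library flag so it is recorded as DT_NEEDED even under --as-needed."""
    return [
        "-Wl,--push-state,--no-as-needed",  # save linker state, then disable as-needed
        lib_flag,
        "-Wl,--pop-state",                  # restore whatever state was active before
    ]


# Hypothetical link line: only engine.so is wrapped, other libraries are untouched.
extra_link_args = [*force_needed("-l:engine.so"), "-lother_dep"]
```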

Module

  • Transfer Engine (mooncake-transfer-engine)
  • Mooncake Store (mooncake-store)
  • Mooncake EP (mooncake-ep)
  • Integration (mooncake-integration)
  • P2P Store (mooncake-p2p-store)
  • Python Wheel (mooncake-wheel)
  • PyTorch Backend (mooncake-pg)
  • Mooncake RL (mooncake-rl)
  • CI/CD
  • Docs
  • Other

Type of Change

  • Bug fix
  • New feature
  • Refactor
  • Breaking change
  • Documentation update
  • Other

How Has This Been Tested?

Verified the current installed wheel behavior before the fix:

readelf -d ep_2_9_1*.so
readelf -d ep_2_10_0*.so
readelf -d ep_2_11_0*.so
readelf -d ep_2_12_0*.so

Observed that ep_2_12_0 is missing NEEDED engine.so, while ep_2_9_1, ep_2_10_0, and ep_2_11_0 include it.

Verified the import failure and workaround:

import importlib
import torch

print("torch", torch.__version__)
importlib.import_module("mooncake.ep_2_12_0")

fails with:

undefined symbol: _ZN6google10LogMessageC1EPKcii

and:

import importlib
import os
import sys
import torch

print("torch", torch.__version__)
old = sys.getdlopenflags()
sys.setdlopenflags(old | os.RTLD_GLOBAL)
try:
    importlib.import_module("mooncake.engine")
finally:
    sys.setdlopenflags(old)

importlib.import_module("mooncake.ep_2_12_0")
print("preload then import mooncake.ep_2_12_0 OK")

passes.

Expected post-build verification:

readelf -d mooncake/ep_2_12_0*.so | grep engine.so

should include:

Shared library: [engine.so]
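
The readelf checks above could be scripted roughly as follows. This is a sketch, assuming the usual `readelf -d` line format `(NEEDED) Shared library: [name]`; the function names are hypothetical.

```python
import re
import subprocess

# Matches dynamic-section lines such as:
#  0x0000000000000001 (NEEDED)   Shared library: [engine.so]
NEEDED_RE = re.compile(r"\(NEEDED\)\s+Shared library: \[(.+?)\]")


def needed_libs(readelf_output: str) -> list[str]:
    """Extract DT_NEEDED library names from `readelf -d` output."""
    return NEEDED_RE.findall(readelf_output)


def read_dynamic(path: str) -> str:
    """Run `readelf -d` on a shared object (requires binutils on PATH)."""
    result = subprocess.run(
        ["readelf", "-d", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout


def has_engine_dependency(path: str) -> bool:
    """True if the shared object lists engine.so as a direct dependency."""
    return "engine.so" in needed_libs(read_dynamic(path))
```

Calling `has_engine_dependency` on each built `ep_*.so` would reproduce the per-artifact `engine.so: True/False` comparison from the description.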

Checklist

  • I have performed a self-review of my own code.
  • I have formatted my own code using ./scripts/code_format.sh before submitting.
  • I have updated the documentation.
  • I have added tests to prove my changes are effective.

Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the linker arguments in mooncake-ep/setup.py and mooncake-pg/setup.py to use --push-state,--no-as-needed and --pop-state when linking engine.so. The reviewer pointed out that --push-state and --pop-state are not supported by the gold linker, which could lead to build failures. They recommended a more portable approach of explicitly toggling --no-as-needed and --as-needed instead.

Comment thread mooncake-ep/setup.py
Comment on lines +53 to +55
"-Wl,--push-state,--no-as-needed",
"-l:engine.so",
"-Wl,--pop-state",
Contributor

medium

Using --push-state and --pop-state can cause build failures if the build environment is configured to use the gold linker (e.g., via -fuse-ld=gold), as gold does not support these state-saving options.

A more portable and widely compatible approach to temporarily disable --as-needed for a specific library is to explicitly toggle it off and on using "--no-as-needed" and "--as-needed". Since --as-needed is the default behavior for modern toolchains, restoring it explicitly is safe and highly compatible across GNU ld, gold, and lld.

Suggested change
- "-Wl,--push-state,--no-as-needed",
- "-l:engine.so",
- "-Wl,--pop-state",
+ "-Wl,--no-as-needed",
+ "-l:engine.so",
+ "-Wl,--as-needed",

Author

I prefer keeping this because push-state/pop-state restores the prior linker state exactly, and Mooncake’s CI/release path does not appear to use old gold, but I can switch to --no-as-needed/--as-needed if maintainers prefer that compatibility tradeoff.
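
The difference between the two variants can be illustrated with a toy model of the linker's as-needed flag. This is a simplified sketch, not how any real linker is implemented: it tracks only --as-needed (the real --push-state also saves other input-handling flags), and `-lother_dep` is a hypothetical library that follows engine.so on the link line.

```python
def effective_as_needed(link_args: list[str], initial: bool = False) -> dict[str, bool]:
    """Map each -l flag to the as-needed state in effect when the linker sees it."""
    as_needed = initial
    stack: list[bool] = []
    seen: dict[str, bool] = {}
    for arg in link_args:
        if arg.startswith("-Wl,"):
            # -Wl,a,b passes options a and b to the linker in order.
            for opt in arg[4:].split(","):
                if opt == "--as-needed":
                    as_needed = True
                elif opt == "--no-as-needed":
                    as_needed = False
                elif opt == "--push-state":
                    stack.append(as_needed)
                elif opt == "--pop-state":
                    as_needed = stack.pop()
        elif arg.startswith("-l"):
            seen[arg] = as_needed
    return seen


push_pop = ["-Wl,--push-state,--no-as-needed", "-l:engine.so", "-Wl,--pop-state", "-lother_dep"]
toggle = ["-Wl,--no-as-needed", "-l:engine.so", "-Wl,--as-needed", "-lother_dep"]
```

In this model both variants link engine.so with as-needed disabled. They differ only when the build did not have --as-needed active beforehand: push/pop leaves `-lother_dep` in the prior state, while the explicit toggle switches it to as-needed.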

Comment thread mooncake-pg/setup.py
Comment on lines +67 to +69
"-Wl,--push-state,--no-as-needed",
"-l:engine.so",
"-Wl,--pop-state",
Contributor

medium

Using --push-state and --pop-state can cause build failures if the build environment is configured to use the gold linker (e.g., via -fuse-ld=gold), as gold does not support these state-saving options.

A more portable and widely compatible approach to temporarily disable --as-needed for a specific library is to explicitly toggle it off and on using "--no-as-needed" and "--as-needed". Since --as-needed is the default behavior for modern toolchains, restoring it explicitly is safe and highly compatible across GNU ld, gold, and lld.

Suggested change
- "-Wl,--push-state,--no-as-needed",
- "-l:engine.so",
- "-Wl,--pop-state",
+ "-Wl,--no-as-needed",
+ "-l:engine.so",
+ "-Wl,--as-needed",

Author

Same as above.
