
[Bug]: Potential QP leak on transfer failure for PD disaggregation scenario #1845

@thincal

Environment

Components: Mooncake v0.3.7-post2, SGLang v0.5.7
Use Case: SGLang Prefill-Decode (PD) disaggregation for LLM inference

Problem Description

After running for some time in a PD disaggregation scenario, the Prefill side starts failing with the error: Failed to create QP: Cannot allocate memory

Analysis

  1. When Prefill transfers KV cache to Decode and the transfer engine returns an error:
  • SGLang only marks the session as failed in the Python layer
  • By that point the transfer engine has presumably already allocated the underlying resources (endpoints, QPs) and established a connection with the Decode side
  2. When a transfer fails, we want to confirm whether Mooncake:
  • Automatically cleans up QP resources
  • Automatically cleans up endpoint resources
  • Or leaves these resources allocated (potential leak)
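To make the suspected failure mode concrete, here is a toy model (not Mooncake or libibverbs code). It assumes, purely for illustration, that every failed transfer creates one QP and that the device has a fixed QP limit. ToyQpPool and runFailingTransfers are hypothetical names. If cleanup is skipped, creation starts failing once the limit is reached, which would match the "Cannot allocate memory" symptom.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>

// Toy stand-in for a device-wide QP limit. Hypothetical; not a real API.
class ToyQpPool {
public:
    explicit ToyQpPool(std::size_t capacity) : capacity_(capacity) {}

    // Returns a QP id, or std::nullopt once the pool is exhausted
    // (the analogue of "Failed to create QP: Cannot allocate memory").
    std::optional<std::size_t> create() {
        if (in_use_ >= capacity_) return std::nullopt;
        ++in_use_;
        return next_id_++;
    }

    // Releases one QP slot if any is in use.
    void destroy() {
        if (in_use_ > 0) --in_use_;
    }

    std::size_t inUse() const { return in_use_; }

private:
    std::size_t capacity_;
    std::size_t in_use_ = 0;
    std::size_t next_id_ = 0;
};

// Simulates `attempts` failing transfers, each creating one QP (assumption).
// When `cleanup_on_failure` is false the QP is never destroyed.
// Returns how many attempts were still able to obtain a QP.
inline std::size_t runFailingTransfers(ToyQpPool& pool, std::size_t attempts,
                                       bool cleanup_on_failure) {
    std::size_t created = 0;
    for (std::size_t i = 0; i < attempts; ++i) {
        if (!pool.create()) break;  // pool exhausted
        ++created;
        if (cleanup_on_failure) pool.destroy();
    }
    return created;
}
```

With a pool of 4 and 10 failing transfers, the leaky variant stops after 4 creations, while the variant that cleans up on failure completes all 10 and ends with no QPs in use. Whether Mooncake behaves like the leaky variant is exactly what this issue asks.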

Current Understanding

From code analysis of mooncake-integration/transfer_engine/transfer_engine_py.cpp:

int TransferEnginePy::batchTransferSync(...) {
    // Get or create segment handle
    Transport::SegmentHandle handle;
    {
        std::lock_guard<std::mutex> guard(mutex_);
        if (handle_map_.count(target_hostname)) {
            handle = handle_map_[target_hostname];
        } else {
            handle = engine_->openSegment(target_hostname);
            if (handle == (Transport::SegmentHandle)-1) return -1;
            handle_map_[target_hostname] = handle;  // ← Cached permanently
        }
    }

    // ... submit transfer, then check batch status (other branches elided) ...

    // On transfer failure
    else if (status.s == TransferStatusEnum::FAILED) {
        engine_->freeBatchID(batch_id);  // Only frees batch ID
        already_freed = true;
        completed = true;
        // Question: Are QP/endpoint resources cleaned up here?
    }

    return -1;
}

We notice that:

  • freeBatchID(batch_id) is called on failure
  • the handle_map_[target_hostname] entry is never removed, even after a failure
  • closeSegment() appears to be a no-op (return 0)
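The caching pattern above, and one possible way to evict an entry after a failed transfer, can be sketched in isolation. This is a minimal sketch under stated assumptions: StubEngine, HandleCache, and evict are hypothetical names, SegmentHandle is a stand-in typedef, and the stub only counts calls. It does not show what Mooncake's closeSegment() actually releases, which remains the open question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for Transport::SegmentHandle.
using SegmentHandle = std::uint64_t;

// Hypothetical engine stub that only counts open/close calls.
struct StubEngine {
    int open_calls = 0;
    int close_calls = 0;

    SegmentHandle openSegment(const std::string& /*host*/) {
        return static_cast<SegmentHandle>(++open_calls);
    }
    int closeSegment(SegmentHandle /*handle*/) {
        ++close_calls;
        return 0;
    }
};

// Per-hostname handle cache mirroring the handle_map_ pattern in the excerpt.
class HandleCache {
public:
    explicit HandleCache(StubEngine& engine) : engine_(engine) {}

    // Returns the cached handle, opening the segment on first use.
    SegmentHandle getOrOpen(const std::string& host) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = handles_.find(host);
        if (it != handles_.end()) return it->second;
        SegmentHandle handle = engine_.openSegment(host);
        handles_.emplace(host, handle);
        return handle;
    }

    // Hypothetical addition: close and drop the entry after a failed transfer
    // so the next transfer re-opens the segment instead of reusing stale state.
    void evict(const std::string& host) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = handles_.find(host);
        if (it == handles_.end()) return;
        engine_.closeSegment(it->second);
        handles_.erase(it);
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(mutex_);
        return handles_.size();
    }

private:
    StubEngine& engine_;
    mutable std::mutex mutex_;
    std::unordered_map<std::string, SegmentHandle> handles_;
};
```

In this sketch, calling evict("decode-0") after a failure removes the cached entry and triggers one closeSegment() call, and the next getOrOpen("decode-0") opens a fresh handle. Eviction alone would only help if closing a segment also releases the associated endpoint and QP resources.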
