Support For VLLM AFD Connector Beta. #37

Merged
zhouyu-sunny merged 2 commits into stepfun-ai:main from niehao100:public
Sep 22, 2025
@niehao100
Collaborator

Pull Request Description

Overview

This PR adds support for the VLLM StepMeshConnector (for AFD) and a series of bug fixes.

Key Changes

🔧 Core Bug Fixes

  1. Fix SArray Memory Management Issue (5ac55cf)

    • Changed shared_ptr to a raw pointer in SArray to resolve hangs that occurred with various tensor sizes
  2. Fix RDMA Memory Registration Bug (a59646c)

    • Fixed a GIL deadlock in mr_reg reuse for dynamic tensors

⏱️ Timeout Mechanism Optimization

  1. Add Timeout for Wait API (c029bf7)
    • Added default 10-second timeout mechanism for wait API
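
A bounded wait of this kind can be sketched in Python as follows. This is only an illustration of the idea, not the code from c029bf7: the `PendingOp` class, its `mark_done` method, and the `DEFAULT_WAIT_TIMEOUT_S` constant are hypothetical names, and only the 10-second default comes from the description above.

```python
import threading

# Assumption: mirrors the 10-second default described in this PR.
DEFAULT_WAIT_TIMEOUT_S = 10.0


class PendingOp:
    """Hypothetical handle for an in-flight push/pull (not the real StepMesh API)."""

    def __init__(self):
        self._done = threading.Event()

    def mark_done(self):
        # Called by the transport thread once the operation has completed.
        self._done.set()

    def wait(self, timeout=DEFAULT_WAIT_TIMEOUT_S):
        # Event.wait returns False if the timeout elapsed before completion,
        # so a stuck operation raises instead of blocking forever.
        if not self._done.wait(timeout):
            raise TimeoutError(f"wait() did not complete within {timeout} s")
```

With a default timeout, a caller that never receives a completion gets a `TimeoutError` it can log or retry on, rather than hanging silently.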

🛠️ Build and Configuration Optimization

  1. Dynamic Python3 Path Configuration (f096caa)
    • Changed Python3 path to dynamic configuration for improved deployment flexibility
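
As a rough illustration of resolving the interpreter dynamically instead of hard-coding it, the sketch below looks the path up at run time. It is an assumption about the general approach, not the change in f096caa: the `resolve_python3` helper and the `PYTHON3_EXECUTABLE` environment variable are hypothetical names.

```python
import os
import shutil
import sys


def resolve_python3(env_var="PYTHON3_EXECUTABLE"):
    """Return a Python 3 interpreter path without hard-coding it.

    Hypothetical helper: an explicit environment override wins, then the
    first python3 on PATH, then the interpreter running this script.
    """
    override = os.environ.get(env_var)
    if override:
        return override
    return shutil.which("python3") or sys.executable
```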

Test Coverage

  • Added VLLM-related test cases
  • Updated multi-GPU and single-GPU test scripts

1. Add support for push/pull of dynamic tensors.
2. Fix hangs in memory management caused by the smart pointer and the GIL.
3. Add a timeout for the wait API.
4. Add more tests and benchmarks:
* Test case for VLLM StepMeshConnector.
* Test case for push/pull of dynamic tensors.
@zhouyu-sunny merged commit 5ed45d4 into stepfun-ai:main on Sep 22, 2025
1 check passed
@niehao100 deleted the public branch on November 4, 2025 02:47