NIXL_LIBFABRIC_NUM_RAILS environment variable not enforced during discovery #1161

@dmvevents

Description

The NIXL_LIBFABRIC_NUM_RAILS environment variable is not enforced during EFA device discovery, causing all available devices to be used regardless of the setting.

Environment

  • NIXL 0.8.0
  • AWS P5.48xlarge (32 EFA devices)
  • libfabric backend

File

src/utils/libfabric/libfabric_rail_manager.cpp

Symptoms

export NIXL_LIBFABRIC_NUM_RAILS=8
# Still initializes all 32 rails

Proposed Fix

Enforce the rail limit inside the discovery loop:

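// Read the rail limit once, before discovery; SIZE_MAX means "no limit".
const char* num_rails_env = std::getenv("NIXL_LIBFABRIC_NUM_RAILS");
size_t max_rails = SIZE_MAX;
if (num_rails_env) {
    // std::stoul throws on malformed or out-of-range input, so guard the call.
    try {
        max_rails = std::stoul(num_rails_env);
    } catch (const std::exception&) {
        max_rails = SIZE_MAX;
    }
}

// In discovery loop: stop once the limit is reached.
if (rail_count >= max_rails) break;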

Also support NIXL_LIBFABRIC_MAX_RAILS as an alternative name, for consistency.
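
For illustration, here is a minimal standalone sketch of how both variable names could be read and the cap applied. It is not the actual code in libfabric_rail_manager.cpp: the helper names (parseRailLimit, railLimitFromEnv, selectRails) and the use of plain strings for device names are hypothetical. It also assumes that NIXL_LIBFABRIC_NUM_RAILS takes precedence when both variables are set, and that an empty, zero, negative, or non-numeric value means "no limit"; the maintainers may prefer different rules.

```cpp
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <initializer_list>
#include <limits>
#include <string>
#include <vector>

// Sentinel meaning "use every discovered rail".
constexpr std::size_t kNoLimit = std::numeric_limits<std::size_t>::max();

// Hypothetical helper: turn an environment value into a rail limit.
// Assumption: anything that is not a positive decimal integer means "no limit".
std::size_t parseRailLimit(const char* value) {
    if (value == nullptr || !std::isdigit(static_cast<unsigned char>(*value))) {
        return kNoLimit;  // unset, empty, signed, or non-numeric
    }
    errno = 0;
    char* end = nullptr;
    const unsigned long long parsed = std::strtoull(value, &end, 10);
    if (*end != '\0' || errno == ERANGE || parsed == 0 || parsed > kNoLimit) {
        return kNoLimit;  // trailing junk, overflow, or zero
    }
    return static_cast<std::size_t>(parsed);
}

// Hypothetical helper: check the primary name first, then the proposed alias.
std::size_t railLimitFromEnv() {
    for (const char* name : {"NIXL_LIBFABRIC_NUM_RAILS", "NIXL_LIBFABRIC_MAX_RAILS"}) {
        if (const char* value = std::getenv(name)) {
            return parseRailLimit(value);
        }
    }
    return kNoLimit;
}

// Stand-in for the discovery loop: keep at most max_rails devices, in discovery order.
std::vector<std::string> selectRails(const std::vector<std::string>& discovered,
                                     std::size_t max_rails) {
    std::vector<std::string> selected;
    for (const auto& device : discovered) {
        if (selected.size() >= max_rails) break;  // enforce the limit during discovery
        selected.push_back(device);
    }
    return selected;
}
```

With 32 discovered devices and NIXL_LIBFABRIC_NUM_RAILS=8, selectRails(devices, railLimitFromEnv()) would keep only the first 8. Unlike std::stoul, the strtoull-based parse does not throw on malformed input.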

Impact

Cannot limit rails for testing or resource management without modifying code.
