
fix: simplify determine_available_memory #317

Merged

rebel-jaehwang merged 1 commit into dev-0.12 from avail-mem on Jan 30, 2026

Conversation

@rebel-jaehwang (Contributor) commented on Jan 30, 2026

  • We don't need to compute the number of blocks, since the vLLM
    allocator already does it, properly accounting for different layer
    types.
  • Remove VLLM_RBLN_NPU_NUM_BLOCKS. Users should use the standard
    gpu_memory_utilization config instead.
  • Don't cap the result with the peak memory used by the active
    request, since the prefix cache can utilize the extra memory (see
    the sketch below).
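
To make the intent concrete, here is a minimal sketch of the simplified logic, not the plugin's actual implementation: the helpers `_total_npu_memory_bytes()` and `_model_memory_bytes()` are hypothetical placeholders for device-memory queries, while `cache_config.gpu_memory_utilization` is the standard vLLM option.

```python
# Minimal sketch under the assumptions above (not the real plugin code).

def determine_available_memory(self) -> int:
    """Return the NPU memory, in bytes, that the KV cache may use.

    The number of KV cache blocks is intentionally NOT computed here:
    vLLM's KV cache allocator derives it from this byte budget, properly
    accounting for different layer types (e.g. full vs. sliding-window
    attention).
    """
    total_bytes = self._total_npu_memory_bytes()   # hypothetical helper
    model_bytes = self._model_memory_bytes()       # hypothetical helper

    # Use the standard gpu_memory_utilization knob instead of the removed
    # VLLM_RBLN_NPU_NUM_BLOCKS environment variable.
    budget = int(total_bytes * self.cache_config.gpu_memory_utilization)

    # No extra min() against the peak memory of active requests: any
    # surplus memory is put to use by the prefix cache.
    return max(0, budget - model_bytes)
```

With the environment variable gone, users size the KV cache through the standard option, for example (illustrative value):

```python
from vllm import LLM

llm = LLM(model="<your-model>", gpu_memory_utilization=0.85)
```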

@rebel-wonsubkim (Contributor) left a comment

LGTM, thanks a lot

@rebel-jaehwang merged commit 4a52b54 into dev-0.12 on Jan 30, 2026
1 check passed
@rebel-jaehwang deleted the avail-mem branch on January 30, 2026, 17:06
rebel-jaehwang added a commit that referenced this pull request Jan 30, 2026
rebel-jaehwang added a commit that referenced this pull request Jan 30, 2026
rebel-jaehwang added a commit that referenced this pull request Jan 30, 2026
rebel-jiwoopark pushed a commit that referenced this pull request Feb 4, 2026
