We intend to adopt the following solution to enable NPU adaptation for this repository. Would appreciate any suggestions from the community.
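1. Background and Motivation
1.1 Current State
VLMEvalKit is one of the most active open-source evaluation frameworks for large multimodal models, supporting 220+ LMMs and 80+ benchmarks. However, the codebase is tightly coupled to the NVIDIA CUDA ecosystem and cannot run on non-CUDA accelerators such as Huawei Ascend NPUs.
1.2 Scope of the Problem
An audit of the codebase found the following hard-coded CUDA dependencies: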
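| Category | Count | Impact |
|---|---|---|
| Unconditional `torch.cuda.empty_cache()` calls | ~40 | High: ineffective or failing in non-CUDA environments |
| Hard-coded `.cuda()` / `.to('cuda')` | ~60+ | High: models and tensors cannot be moved to the NPU |
| Hard-coded `device_map='cuda'` | ~20 | High: model loading fails |
| Hard-coded `self.device = 'cuda'` | ~10 | High: wrong device at inference time |
| Hard-coded `flash_attention_2` | ~17 | High: `flash_attention_2` is not supported on NPU |
| `torch.cuda.device_count()` | ~15 | Medium: wrong device count |
| `torch.cuda.set_device()` | ~6 | Medium: device selection fails |
| `torch.cuda.synchronize()` | ~2 | Medium: synchronization fails |
| `torch.cuda.amp.autocast()` | ~5 | Low: can be replaced with `torch.amp` |
| `nvidia-smi` command invocations | ~2 | Medium: command does not exist |
| Hard-coded `nccl` backend | ~1 | Medium: NPU uses `hccl` |
| `torch.cuda.manual_seed_all()` | ~6 | Low: seeding fails |
| Hard-coded vLLM usage | ~10 | Low: depends on vLLM's own NPU support |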
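
The counts above come from an audit of the repository. As a rough illustration of how such an audit can be reproduced, the sketch below counts a few of these patterns in a source string with regular expressions. The pattern set and the `count_cuda_patterns` helper are hypothetical and are not necessarily what was used to produce the table.

```python
import re
from collections import Counter

# Illustrative patterns only; the real audit may have used different rules.
CUDA_PATTERNS = {
    'empty_cache': re.compile(r'torch\.cuda\.empty_cache\(\)'),
    'cuda_call': re.compile(r"\.cuda\(\)|\.to\(\s*['\"]cuda['\"]"),
    'device_map': re.compile(r"device_map\s*=\s*['\"]cuda['\"]"),
    'flash_attention_2': re.compile(r"['\"]flash_attention_2['\"]"),
}


def count_cuda_patterns(source):
    """Count occurrences of each hard-coded CUDA pattern in a source string."""
    counts = Counter()
    for name, pattern in CUDA_PATTERNS.items():
        counts[name] = len(pattern.findall(source))
    return counts


sample = """
model = Model.from_pretrained(path, device_map='cuda', attn_implementation='flash_attention_2')
model = model.eval().cuda()
inputs = inputs.to('cuda')
torch.cuda.empty_cache()
"""
print(count_cuda_patterns(sample))
```

Running the same function over every `*.py` file under `vlmeval/` would give per-file totals comparable to the per-file list in section 2.3.9.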
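1.3 Goals
Enable VLMEvalKit to run directly on Huawei Ascend NPUs (Ascend 910B2 and similar) when evaluating local multimodal models, including:
- Support for the torch_npu backend
- Support for the HCCL distributed backend
- NPU device detection and automatic allocation
- Full compatibility with the existing CUDA code paths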
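2. Technical Design
2.1 Core Design Principles
- Non-intrusive compatibility: every change must leave existing CUDA functionality unaffected
- Unified device abstraction: introduce a device management utility module and route all device-related operations through its interface
- Incremental adaptation: apply changes by priority, core framework first and model adapters afterwards
- Automatic environment detection: select CUDA, NPU, or CPU based on the runtime environment
2.2 New Device Management Module vlmeval/utils/device.py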
This module is the core of the adaptation and provides a unified device management interface:

```python
"""Unified device management for VLMEvalKit.

Supports CUDA, Ascend NPU, MPS, and CPU backends.
"""
import os
import subprocess

import torch

# Cached backend name; filled in on the first call to _detect_backend().
_BACKEND = None


def _detect_backend():
    """Detect the best available compute backend."""
    global _BACKEND
    if _BACKEND is not None:
        return _BACKEND
    if torch.cuda.is_available():
        _BACKEND = 'cuda'
        return 'cuda'
    try:
        # Importing torch_npu registers the torch.npu namespace.
        import torch_npu  # noqa: F401
        if torch.npu.is_available():
            _BACKEND = 'npu'
            return 'npu'
    except ImportError:
        pass
    if hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        _BACKEND = 'mps'
        return 'mps'
    _BACKEND = 'cpu'
    return 'cpu'


def get_backend():
    """Get the current compute backend name."""
    return _detect_backend()


def get_device():
    """Get the best available torch device."""
    backend = get_backend()
    return torch.device(backend)


def get_device_name():
    """Get device name string (e.g., 'cuda', 'npu', 'cpu')."""
    return get_backend()


def get_device_count():
    """Get the number of available accelerator devices (0 for MPS and CPU)."""
    backend = get_backend()
    if backend == 'cuda':
        return torch.cuda.device_count()
    elif backend == 'npu':
        return torch.npu.device_count()
    return 0


def get_device_memory(device_index=0):
    """Get free memory of the specified device in MB (0 if it cannot be queried)."""
    backend = get_backend()
    if backend == 'cuda':
        try:
            command = f"nvidia-smi --query-gpu=memory.free --format=csv -i {device_index}"
            output = subprocess.check_output(command.split()).decode('ascii')
            # Line 0 is the CSV header; line 1 looks like "81037 MiB".
            return int(output.strip().split('\n')[1].split()[0])
        except Exception:
            return 0
    elif backend == 'npu':
        try:
            command = f"npu-smi info -t usages -i {device_index}"
            output = subprocess.check_output(command.split()).decode('utf-8')
            # Heuristic parse; adjust to the actual npu-smi output on the target driver.
            for line in output.split('\n'):
                if 'HBM' in line or 'Memory' in line:
                    parts = line.split()
                    for i, p in enumerate(parts):
                        if 'MB' in p.upper() or '/' in p:
                            free_str = p.split('/')[-1].strip() if '/' in p else parts[i + 1]
                            return int(''.join(c for c in free_str if c.isdigit()))
            return 0
        except Exception:
            return 0
    return 0


def get_all_device_memory():
    """Get free memory of all available devices in MB."""
    count = get_device_count()
    return [get_device_memory(i) for i in range(count)]


def set_device(device_index):
    """Set the current device (no-op on MPS and CPU)."""
    backend = get_backend()
    if backend == 'cuda':
        torch.cuda.set_device(device_index)
    elif backend == 'npu':
        torch.npu.set_device(device_index)


def empty_cache():
    """Safely empty the cache of the current device."""
    backend = get_backend()
    if backend == 'cuda':
        torch.cuda.empty_cache()
    elif backend == 'npu':
        torch.npu.empty_cache()


def synchronize():
    """Safely synchronize the current device."""
    backend = get_backend()
    if backend == 'cuda':
        torch.cuda.synchronize()
    elif backend == 'npu':
        torch.npu.synchronize()


def manual_seed_all(seed):
    """Set random seed for all devices."""
    backend = get_backend()
    if backend == 'cuda':
        torch.cuda.manual_seed_all(seed)
    elif backend == 'npu':
        torch.npu.manual_seed_all(seed)


def get_distributed_backend():
    """Get the appropriate distributed backend."""
    backend = get_backend()
    if backend == 'cuda':
        return 'nccl'
    elif backend == 'npu':
        return 'hccl'
    return 'gloo'


def get_attn_implementation():
    """Get the best attention implementation for current device."""
    backend = get_backend()
    if backend == 'cuda':
        return 'flash_attention_2'
    return 'eager'


def get_visible_devices_env_var():
    """Get the environment variable name for visible devices."""
    backend = get_backend()
    if backend == 'cuda':
        return 'CUDA_VISIBLE_DEVICES'
    elif backend == 'npu':
        return 'ASCEND_RT_VISIBLE_DEVICES'
    return None


def get_visible_devices():
    """Get the list of visible device indices."""
    env_var = get_visible_devices_env_var()
    if env_var is None:
        return []
    devices_str = os.environ.get(env_var, '')
    if not devices_str:
        return list(range(get_device_count()))
    # Skip empty entries such as a trailing comma.
    return [int(x) for x in devices_str.split(',') if x.strip()]


def get_accelerator_list():
    """Get the list of accelerator devices (replaces get_gpu_list in run.py)."""
    env_var = get_visible_devices_env_var()
    if env_var is None:
        return []
    devices_str = os.environ.get(env_var, '')
    if devices_str:
        return [int(x) for x in devices_str.split(',') if x.strip()]
    # Fallback: try device-specific detection
    try:
        return list(range(get_device_count()))
    except Exception:
        return []
```
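
One detail the helpers above do not cover: `CUDA_VISIBLE_DEVICES` may also hold GPU UUIDs (for example `GPU-...`) rather than integer indices, in which case `int(x)` raises. The following is a minimal sketch of a more tolerant parser. The name `parse_visible_devices` and the choice to return `None` for non-integer entries are assumptions made for illustration, not part of the proposal.

```python
def parse_visible_devices(value):
    """Parse a comma-separated visible-devices string into integer indices.

    Blank entries and surrounding whitespace are ignored. Returns None when any
    entry is not a plain integer (for example a GPU UUID), so the caller can fall
    back to enumerating devices instead of crashing.
    """
    indices = []
    for item in value.split(','):
        item = item.strip()
        if not item:
            continue
        if not item.isdigit():
            return None
        indices.append(int(item))
    return indices


print(parse_visible_devices('0, 1,2'))
print(parse_visible_devices('GPU-8f3c'))
```

A caller could treat `None` the same way as an unset variable and fall back to `range(get_device_count())`.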
2.3 Changes, Layered by Priority
Layer 1: Core framework changes (required, affects all models)
2.3.1 run.py - entry point
Issue 1: get_gpu_list() depends on nvidia-smi
```python
# Original code (lines 19-30)
def get_gpu_list():
    CUDA_VISIBLE_DEVICES = os.environ.get('CUDA_VISIBLE_DEVICES', '')
    if CUDA_VISIBLE_DEVICES != '':
        gpu_list = [int(x) for x in CUDA_VISIBLE_DEVICES.split(',')]
        return gpu_list
    try:
        ps = subprocess.Popen(('nvidia-smi', '--list-gpus'), stdout=subprocess.PIPE)
        output = subprocess.check_output(('wc', '-l'), stdin=ps.stdout)
        return list(range(int(output)))
    except Exception:
        return []


# Modified
from vlmeval.utils.device import get_visible_devices_env_var


def get_gpu_list():
    # Honor the backend-specific visible-devices variable first.
    env_var = get_visible_devices_env_var()
    devices_str = os.environ.get(env_var, '') if env_var else ''
    if devices_str:
        return [int(x) for x in devices_str.split(',')]
    try:
        ps = subprocess.Popen(('nvidia-smi', '--list-gpus'), stdout=subprocess.PIPE)
        output = subprocess.check_output(('wc', '-l'), stdin=ps.stdout)
        return list(range(int(output)))
    except Exception:
        # nvidia-smi is unavailable: fall back to counting NPUs.
        try:
            import torch_npu  # noqa: F401
            return list(range(torch.npu.device_count()))
        except Exception:
            return []
```
Issue 2: CUDA_VISIBLE_DEVICES is hard-coded (lines 38-47)

```python
# Original code
GPU_LIST = get_gpu_list()
if LOCAL_WORLD_SIZE > 1 and len(GPU_LIST):
    NGPU = len(GPU_LIST)
    ...
    CUDA_VISIBLE_DEVICES = [str(i) for i in GPU_LIST[DEVICE_START_IDX: DEVICE_START_IDX + GPU_PER_PROC]]
    CUDA_VISIBLE_DEVICES = ','.join(CUDA_VISIBLE_DEVICES)
    os.environ['CUDA_VISIBLE_DEVICES'] = CUDA_VISIBLE_DEVICES

# Modified
from vlmeval.utils.device import get_visible_devices_env_var

GPU_LIST = get_gpu_list()
if LOCAL_WORLD_SIZE > 1 and len(GPU_LIST):
    NGPU = len(GPU_LIST)
    assert NGPU >= LOCAL_WORLD_SIZE
    GPU_PER_PROC = NGPU // LOCAL_WORLD_SIZE
    DEVICE_START_IDX = GPU_PER_PROC * LOCAL_RANK
    VISIBLE_DEVICES = [str(i) for i in GPU_LIST[DEVICE_START_IDX: DEVICE_START_IDX + GPU_PER_PROC]]
    VISIBLE_DEVICES_STR = ','.join(VISIBLE_DEVICES)
    # Write to CUDA_VISIBLE_DEVICES or ASCEND_RT_VISIBLE_DEVICES, depending on the backend.
    env_var = get_visible_devices_env_var()
    if env_var:
        os.environ[env_var] = VISIBLE_DEVICES_STR
    print(
        f'RANK: {RANK}, LOCAL_RANK: {LOCAL_RANK}, WORLD_SIZE: {WORLD_SIZE}, '
        f'LOCAL_WORLD_SIZE: {LOCAL_WORLD_SIZE}, {env_var}: {VISIBLE_DEVICES_STR}'
    )
```
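
The slicing logic above can be isolated into a pure function, which makes it testable without any accelerator. This is a minimal sketch under the same partitioning rule as the snippet; `devices_for_rank` is a hypothetical name and does not exist in run.py.

```python
def devices_for_rank(device_list, local_world_size, local_rank):
    """Return the contiguous slice of devices assigned to one local process."""
    n_devices = len(device_list)
    if n_devices < local_world_size:
        raise ValueError(
            f'{n_devices} devices cannot serve {local_world_size} local processes')
    per_proc = n_devices // local_world_size
    start = per_proc * local_rank
    return device_list[start:start + per_proc]


# 8 devices shared by 4 local processes: rank 1 gets devices 2 and 3.
print(devices_for_rank(list(range(8)), 4, 1))
# 3 devices shared by 2 local processes: one device per rank, device 2 stays unused.
print(devices_for_rank([0, 1, 2], 2, 1))
```

Because of the integer division, devices left over when the count is not a multiple of the local world size are not assigned to any rank.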
Issue 3: the nccl backend is hard-coded (line 527)

```python
# Original code
dist.init_process_group(
    backend='nccl',
    timeout=datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
)

# Modified
from vlmeval.utils.device import get_distributed_backend

dist.init_process_group(
    backend=get_distributed_backend(),
    timeout=datetime.timedelta(seconds=int(os.environ.get('DIST_TIMEOUT', 3600)))
)
```
2.3.2 vlmeval/inference.py - inference core
Issue: torch.cuda.synchronize() and torch.cuda.empty_cache()

```python
# Original code (lines 185, 190)
torch.cuda.synchronize()
...
torch.cuda.empty_cache()

# Modified
from vlmeval.utils.device import synchronize, empty_cache

synchronize()
...
empty_cache()
```

2.3.3 vlmeval/inference_mt.py - multi-turn inference

```python
# Original code (line 167)
torch.cuda.empty_cache()

# Modified
from vlmeval.utils.device import empty_cache

empty_cache()
```

2.3.4 vlmeval/inference_video.py - video inference

```python
# Original code (lines 210, 215)
torch.cuda.synchronize()
torch.cuda.empty_cache()

# Modified
from vlmeval.utils.device import synchronize, empty_cache

synchronize()
empty_cache()
```
2.3.5 vlmeval/smp/misc.py - GPU memory detection

```python
# Original code (lines 277-287)
def get_gpu_memory():
    import subprocess
    try:
        command = "nvidia-smi --query-gpu=memory.free --format=csv"
        memory_free_info = subprocess.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
        memory_free_values = [int(x.split()[0]) for i, x in enumerate(memory_free_info)]
        return memory_free_values
    except Exception as e:
        print(f'{type(e)}: {str(e)}')
        return []


# Modified
def get_gpu_memory():
    from vlmeval.utils.device import get_backend, get_all_device_memory
    backend = get_backend()
    if backend == 'cuda':
        try:
            import subprocess
            command = "nvidia-smi --query-gpu=memory.free --format=csv"
            # Drop the trailing empty line and the CSV header.
            memory_free_info = subprocess.check_output(command.split()).decode('ascii').split('\n')[:-1][1:]
            memory_free_values = [int(x.split()[0]) for x in memory_free_info]
            return memory_free_values
        except Exception as e:
            print(f'{type(e)}: {str(e)}')
            return []
    elif backend == 'npu':
        return get_all_device_memory()
    return []
```
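
The CUDA branch above mixes the subprocess call with the text parsing. Splitting out the parser lets it be checked against captured output on a machine with no GPU. The sketch below assumes the usual `--format=csv` layout (a header line followed by one `<value> MiB` line per device); `parse_free_memory_csv` is a hypothetical helper name.

```python
def parse_free_memory_csv(output):
    """Parse `nvidia-smi --query-gpu=memory.free --format=csv` output into MiB values."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    values = []
    for line in lines[1:]:  # lines[0] is the CSV header
        values.append(int(line.split()[0]))
    return values


captured = 'memory.free [MiB]\n81037 MiB\n40536 MiB\n'
print(parse_free_memory_csv(captured))
```

The same split would apply to the `npu-smi` branch once its output format on the target driver version has been confirmed.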
2.3.6 vlmeval/tools.py - GPU count

```python
# Original code (line 387)
NGPU = torch.cuda.device_count()

# Modified
from vlmeval.utils.device import get_device_count

NGPU = get_device_count()
```
Layer 2: VLM model adaptation (key models first)
2.3.7 General replacement patterns
Hard-coded CUDA usage in every VLM model implementation is replaced according to the following patterns:

| Original code | Replacement |
|---|---|
| `self.device = 'cuda'` | `from vlmeval.utils.device import get_device_name; self.device = get_device_name()` |
| `.cuda()` | `.to(self.device)` |
| `.to('cuda')` | `.to(self.device)` or `.to(self.model.device)` |
| `torch.cuda.device_count()` | `from vlmeval.utils.device import get_device_count; get_device_count()` |
| `torch.cuda.set_device(0)` | `from vlmeval.utils.device import set_device; set_device(0)` |
| `torch.cuda.empty_cache()` | `from vlmeval.utils.device import empty_cache; empty_cache()` |
| `torch.cuda.manual_seed_all(0)` | `from vlmeval.utils.device import manual_seed_all; manual_seed_all(0)` |
| `device_map='cuda'` | `device_map=get_device_name()` or `device_map='auto'` |
| `attn_implementation='flash_attention_2'` | `from vlmeval.utils.device import get_attn_implementation; attn_implementation=get_attn_implementation()` |
| `torch.cuda.amp.autocast()` | `torch.amp.autocast(device_type=get_device_name())` |
| `torch.cuda.is_available()` | Extend the detection logic; see `_detect_backend()` in device.py |
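
For the `attn_implementation` row, one possible refinement is to request `flash_attention_2` only when the `flash_attn` package is importable, and otherwise fall back to `sdpa` on CUDA. The sketch below illustrates that idea; `pick_attn_implementation` and its `has_flash_attn` parameter are hypothetical, and whether a given model accepts `sdpa` is an assumption to verify per model.

```python
import importlib.util


def pick_attn_implementation(backend, has_flash_attn=None):
    """Choose an attention implementation name for the given backend.

    has_flash_attn can be passed explicitly (useful in tests); when omitted it is
    detected by checking whether the flash_attn package is installed.
    """
    if has_flash_attn is None:
        has_flash_attn = importlib.util.find_spec('flash_attn') is not None
    if backend == 'cuda':
        return 'flash_attention_2' if has_flash_attn else 'sdpa'
    # Non-CUDA backends keep the conservative default used in device.py.
    return 'eager'


print(pick_attn_implementation('npu'))
print(pick_attn_implementation('cuda', has_flash_attn=False))
```

`importlib.util.find_spec` only checks that the package can be located; it does not import it, so the check stays cheap.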
2.3.8 Example changes for key models
Qwen2-VL (vlmeval/vlm/qwen2_vl/model.py)

```python
# Change 1: add imports
from vlmeval.utils.device import (
    get_device_name, get_device_count, set_device, empty_cache,
    get_attn_implementation, get_all_device_memory
)

# Change 2: GPU memory check (lines 265-267)
# Original code:
# gpu_mems = get_gpu_memory()
# max_gpu_mem = max(gpu_mems) if gpu_mems != [] else -1
# assert max_gpu_mem > 0
# Modified:
gpu_mems = get_all_device_memory()
max_gpu_mem = max(gpu_mems) if gpu_mems else -1
if max_gpu_mem <= 0:
    import logging
    logging.warning(f'No accelerator memory detected, falling back to CPU. mems={gpu_mems}')

# Change 3: vLLM path (line 275)
# Original code: gpu_count = torch.cuda.device_count()
# Modified: gpu_count = get_device_count()

# Change 4: LMDeploy path (lines 303-309)
# Original code:
# num_gpus = torch.cuda.device_count()
# torch.cuda.set_device(0)
# self.device = 'cuda'
# Modified:
num_gpus = get_device_count()
set_device(0)
self.device = get_device_name()

# Change 5: Transformers path (line 312)
# Original code: device_map="auto", attn_implementation='flash_attention_2'
# Modified: device_map="auto", attn_implementation=get_attn_implementation()

# Change 6: inference (line 487)
# Original code: inputs = inputs.to('cuda')
# Modified: inputs = inputs.to(self.model.device)
```

InternVL (vlmeval/vlm/internvl/internvl_chat.py)

```python
# Change 1: add imports
from vlmeval.utils.device import get_device_name, get_device_count, set_device, empty_cache

# Change 2: LMDeploy path (lines 168-179)
# Original code:
# num_gpus = torch.cuda.device_count()
# torch.cuda.set_device(0)
# self.device = 'cuda'
# Modified:
num_gpus = get_device_count()
set_device(0)
self.device = get_device_name()

# Change 3: Transformers path (line 187)
# Original code: self.device = 'cuda'
# Modified: self.device = get_device_name()
```

MiniCPM-V (vlmeval/vlm/minicpm_v.py)

```python
# Change 1: add imports
from vlmeval.utils.device import (
    get_device_name, set_device, empty_cache, manual_seed_all, get_device_count
)

# Change 2: in every model class, .eval().cuda() -> .eval().to(get_device_name())
# For example (line 30):
# self.model = self.model.to(dtype=torch.bfloat16)
# self.model.eval().cuda()
# torch.cuda.empty_cache()
# Modified:
self.model = self.model.to(dtype=torch.bfloat16)
self.device = get_device_name()
self.model.eval().to(self.device)
empty_cache()

# Change 3: manual_seed_all (lines 272, 551, 807, 1286)
# Original code: torch.cuda.manual_seed_all(0)
# Modified: manual_seed_all(0)
```

LLaVA (vlmeval/vlm/llava/llava.py)

```python
# Change 1: replace every .cuda() call with .to(self.device)
# Change 2: replace every .to("cuda", ...) with .to(self.device, ...)
# Change 3: initialize self.device with get_device_name()
```
2.3.9 Full list of model files to modify
The following files need to be modified according to the general replacement patterns above:

| File | Hard-coded occurrences | Priority |
|---|---|---|
| `vlmeval/vlm/qwen2_vl/model.py` | 9 | P0 |
| `vlmeval/vlm/qwen3_vl/model.py` | 5 | P0 |
| `vlmeval/vlm/internvl/internvl_chat.py` | 6 | P0 |
| `vlmeval/vlm/llava/llava.py` | 15+ | P0 |
| `vlmeval/vlm/minicpm_v.py` | 18 | P0 |
| `vlmeval/vlm/llama4.py` | 5 | P1 |
| `vlmeval/vlm/deepseek_vl2.py` | 2 | P1 |
| `vlmeval/vlm/deepseek_vl.py` | 2 | P1 |
| `vlmeval/vlm/gemma.py` | 8 | P1 |
| `vlmeval/vlm/cogvlm.py` | 8 | P1 |
| `vlmeval/vlm/phi3_vision.py` | 5 | P1 |
| `vlmeval/vlm/nvlm.py` | 4 | P1 |
| `vlmeval/vlm/yi_vl.py` | 3 | P1 |
| `vlmeval/vlm/vila.py` | 2 | P1 |
| `vlmeval/vlm/wemm.py` | 2 | P2 |
| `vlmeval/vlm/wethink_vl.py` | 4 | P2 |
| `vlmeval/vlm/vlm_r1.py` | 4 | P2 |
| `vlmeval/vlm/vlaa_thinker.py` | 4 | P2 |
| `vlmeval/vlm/treevgr.py` | 4 | P2 |
| `vlmeval/vlm/thyme/model.py` | 4 | P2 |
| `vlmeval/vlm/smolvlm.py` | 4 | P2 |
| `vlmeval/vlm/ristretto.py` | 3 | P2 |
| `vlmeval/vlm/points.py` | 2 | P2 |
| `vlmeval/vlm/phi4_multimodal.py` | 3 | P2 |
| `vlmeval/vlm/oryx.py` | 5 | P2 |
| `vlmeval/vlm/omchat.py` | 3 | P2 |
| `vlmeval/vlm/monkey.py` | 6 | P2 |
| `vlmeval/vlm/moondream.py` | 6 | P2 |
| `vlmeval/vlm/mmalaya.py` | 5 | P2 |
| `vlmeval/vlm/minimonkey.py` | 4 | P2 |
| `vlmeval/vlm/mantis.py` | 3 | P2 |
| `vlmeval/vlm/mixsense.py` | 1 | P2 |
| `vlmeval/vlm/molmo.py` | 1 | P2 |
| `vlmeval/vlm/long_vita.py` | 4 | P2 |
| `vlmeval/vlm/kimi_vl.py` | 1 | P2 |
| `vlmeval/vlm/keye_vlm/model.py` | 3 | P2 |
| `vlmeval/vlm/h2ovl_mississippi.py` | 1 | P2 |
| `vlmeval/vlm/vita.py` | 10+ | P3 |
| `vlmeval/vlm/vlm3r.py` | 6 | P3 |
| `vlmeval/vlm/valley/valley2.py` | 2 | P3 |
| `vlmeval/vlm/valley/valley3.py` | 2 | P3 |
| `vlmeval/vlm/video_llm/*.py` | 15+ | P3 |
| `vlmeval/vlm/xcomposer/*.py` | 10+ | P3 |
| `vlmeval/vlm/pandagpt.py` | 2 | P3 |
| `vlmeval/vlm/parrot.py` | 3 | P3 |
| `vlmeval/vlm/pixtral.py` | 1 | P3 |
| `vlmeval/vlm/ross.py` | 3 | P3 |
| `vlmeval/vlm/slime.py` | 3 | P3 |
| `vlmeval/vlm/visualglm.py` | 1 | P3 |
| `vlmeval/vlm/transcore_m.py` | 3 | P3 |
| `vlmeval/vlm/open_flamingo.py` | 3 | P3 |
| `vlmeval/vlm/omnilmm.py` | 2 | P3 |
| `vlmeval/vlm/mplug_owl2.py` | 1 | P3 |
| `vlmeval/vlm/mplug_owl3.py` | 2 | P3 |
| `vlmeval/vlm/idefics.py` | 2 | P3 |
| `vlmeval/vlm/eagle_x.py` | 1 | P3 |
| `vlmeval/vlm/emu.py` | 4 | P3 |
| `vlmeval/vlm/bagel_umm.py` | 3 | P3 |
| `vlmeval/vlm/aria.py` | 1 | P3 |
| `vlmeval/vlm/vxverse.py` | 1 | P3 |
| `vlmeval/vlm/vintern_chat.py` | 1 | P3 |
| `vlmeval/vlm/minigpt4.py` | 1 | P3 |
| `vlmeval/vlm/kosmos.py` | 1 | P3 |
| `vlmeval/vlm/janus.py` | 1 | P3 |
| `vlmeval/vlm/instructblip.py` | 1 | P3 |
| `vlmeval/vlm/deepseek_ocr.py` | 3 | P3 |
| `vlmeval/vlm/covt.py` | 5 | P3 |
| `vlmeval/vlm/cambrian_s.py` | 1 | P3 |
| `vlmeval/vlm/falcon_vlm.py` | 1 | P3 |
| `vlmeval/vlm/xgen_mm.py` | 2 | P3 |
| `vlmeval/vlm/x_vl.py` | 2 | P3 |
| `vlmeval/vlm/varco_vision.py` | 3 | P3 |
| `vlmeval/vlm/qianfan_vl.py` | 2 | P3 |
| `vlmeval/vlm/nanovlm.py` | 3 | P3 |
| `vlmeval/vlm/llama_vision.py` | 2 | P3 |
| `vlmeval/vlm/logics.py` | 1 | P3 |
| `vlmeval/vlm/spatial_mllm.py` | 3 | P3 |
| `vlmeval/vlm/rbdash.py` | 2 | P3 |
| `vlmeval/vlm/sail_vl.py` | 2 | P3 |
| `vlmeval/vlm/ovis/ovis.py` | 5 | P3 |
| `vlmeval/vlm/cosmos.py` | 2 | P3 |
| `vlmeval/vlm/hawk_vl/model.py` | 2 | P3 |
| `vlmeval/vlm/qtunevl/*.py` | 8+ | P3 |
| `vlmeval/vlm/ola/*.py` | 6+ | P3 |
| `vlmeval/vlm/ursa/*.py` | 3 | P3 |
| `vlmeval/api/hf_chat_model.py` | 3 | P2 |
Layer 3: Dataset evaluation tool adaptation
2.3.10 The vlmeval/dataset/utils/ directory

| File | Change |
|---|---|
| `verifier.py` (line 213) | `torch.cuda.device_count()` -> `get_device_count()` |
| `uni_svg.py` (line 37) | `"cuda" if torch.cuda.is_available() else "cpu"` -> `get_device_name()` |
| `design2code/visual_score.py` (line 28) | Same as above |
| `chartcap.py` (line 140) | Same as above |
| `SArena/video/viclip/viclip_vision.py` (lines 452-453) | `torch.cuda.manual_seed()` -> `manual_seed_all()` |
| `SArena/video/CLIP_video.py` (line 98) | Default argument `device=torch.device("cuda")` -> `device=None`, detected dynamically inside the function |
| `SArena/LPIPS.py` (line 41) | Same pattern as `uni_svg.py` |
| `SArena/DINO_Score.py` (line 15) | Same as above |
| `SArena/FID.py` (line 25) | Same as above |
| `SArena/runtime.py` (lines 39-53) | Extend the device detection logic |
| `SArena/token_length.py` (line 13) | Same as above |
| `SArena/CLIP_Score.py` (line 19) | Same as above |
3. NPU-Specific Considerations
3.1 torch_npu compatibility

| Feature | CUDA API | NPU equivalent | Notes |
|---|---|---|---|
| Device availability | `torch.cuda.is_available()` | `torch.npu.is_available()` | Requires `import torch_npu` first |
| Device count | `torch.cuda.device_count()` | `torch.npu.device_count()` | Same as above |
| Device selection | `torch.cuda.set_device(idx)` | `torch.npu.set_device(idx)` | Same as above |
| Cache cleanup | `torch.cuda.empty_cache()` | `torch.npu.empty_cache()` | Same as above |
| Device synchronization | `torch.cuda.synchronize()` | `torch.npu.synchronize()` | Same as above |
| Random seed | `torch.cuda.manual_seed_all()` | `torch.npu.manual_seed_all()` | Same as above |
| Distributed backend | `nccl` | `hccl` | Supported by `torch.distributed` |
| Visible devices | `CUDA_VISIBLE_DEVICES` | `ASCEND_RT_VISIBLE_DEVICES` | Environment variable |
| Device management command | `nvidia-smi` | `npu-smi` | CLI tool |
| Attention implementation | `flash_attention_2` | `eager` | FA2 is not yet supported on NPU |
| AMP | `torch.cuda.amp.autocast()` | `torch.npu.amp.autocast()` | Or `torch.amp.autocast(device_type='npu')` |
| Data type | fp16/bf16 | bf16 preferred | bf16 performs better on NPU |
3.2 transformers compatibility
Behavior of device_map="auto" on NPU:
- With PyTorch 2.1+ and transformers 4.36+, device_map="auto" can already recognize NPU devices through torch.npu
- Recommendation: prefer device_map="auto" over device_map="cuda" or device_map="npu"
- For models that do not support device_map="auto", use device_map=get_device_name()
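3.3 vLLM compatibility
Current state of vLLM support for NPU:
- The vLLM community is working on an Ascend NPU backend
- Until that NPU adaptation is complete, the vLLM path should be disabled automatically and fall back to the Transformers path
- How to change it: add a device check where vLLM is imported

```python
# In models that use vLLM
def _should_use_vllm(self):
    from vlmeval.utils.device import get_backend
    # vLLM is only enabled on CUDA for now.
    if get_backend() != 'cuda':
        return False
    try:
        from vllm import LLM  # noqa: F401
        return True
    except ImportError:
        return False
```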
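
The same fallback rule applies to any optional inference engine, so it can be expressed once as a small selection function. The sketch below is a hedged illustration: the `ENGINE_BACKENDS` table and `select_engine` are hypothetical names, and the backend sets encode this proposal's working assumption (vLLM and LMDeploy enabled on CUDA only) rather than a statement about what those projects support.

```python
import importlib.util

# Assumption for this sketch: which engines are enabled on which backend.
ENGINE_BACKENDS = {
    'vllm': {'cuda'},
    'lmdeploy': {'cuda'},
    'transformers': {'cuda', 'npu', 'mps', 'cpu'},
}


def _is_installed(name):
    """Check that a package can be located without importing it."""
    return importlib.util.find_spec(name) is not None


def select_engine(requested, backend, is_installed=_is_installed):
    """Return the requested engine if usable, otherwise fall back to 'transformers'."""
    enabled = backend in ENGINE_BACKENDS.get(requested, set())
    if requested != 'transformers' and enabled and is_installed(requested):
        return requested
    return 'transformers'


# On NPU the vLLM request falls back, even if the package is installed.
print(select_engine('vllm', 'npu', is_installed=lambda name: True))
```

Passing `is_installed` as a parameter keeps the function testable on machines where neither engine is installed.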
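3.4 LMDeploy compatibility
This proposal likewise treats LMDeploy as CUDA-only: in an NPU environment, the LMDeploy path should fall back to the Transformers path.
3.5 Precision recommendations

| Hardware | Recommended precision | Notes |
|---|---|---|
| NVIDIA GPU | fp16/bf16 | `flash_attention_2` available |
| Ascend NPU | bf16 | bf16 compute performance is better than fp16 |
| CPU | fp32 | For testing only |

We suggest adding a precision recommendation function to device.py:

```python
def get_recommended_dtype():
    """Get the recommended dtype for the current device."""
    backend = get_backend()
    if backend in ('cuda', 'npu'):
        return torch.bfloat16
    return torch.float32
```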
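
One caveat with returning bf16 for every CUDA device: native bf16 support on NVIDIA hardware starts at compute capability 8.0 (Ampere), so older GPUs are usually better served by fp16. The sketch below shows that refinement using dtype names as strings so it runs without torch. `recommended_dtype_name` and its `cuda_capability_major` parameter are hypothetical; in real code the major version could come from `torch.cuda.get_device_capability()`.

```python
def recommended_dtype_name(backend, cuda_capability_major=None):
    """Return 'bfloat16', 'float16', or 'float32' for the given backend."""
    if backend == 'npu':
        return 'bfloat16'
    if backend == 'cuda':
        # Pre-Ampere GPUs (capability < 8) lack native bf16; an unknown capability
        # keeps the original helper's behaviour and returns bf16.
        if cuda_capability_major is not None and cuda_capability_major < 8:
            return 'float16'
        return 'bfloat16'
    return 'float32'


print(recommended_dtype_name('cuda', cuda_capability_major=7))
print(recommended_dtype_name('npu'))
```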
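4. Implementation Plan
Phase 1: Core framework adaptation (1 week)
- Add vlmeval/utils/device.py
- Modify run.py - device detection and distributed backend
- Modify vlmeval/inference.py - inference core
- Modify vlmeval/inference_mt.py - multi-turn inference
- Modify vlmeval/inference_video.py - video inference
- Modify vlmeval/smp/misc.py - GPU memory detection
- Modify vlmeval/tools.py - GPU count
Acceptance criteria:
- python run.py --model Qwen2-VL-7B-Instruct --data MMBench_DEV_EN_V11 starts successfully on NPU
- All existing tests pass in the CUDA environment
Phase 2: P0 model adaptation (1 week)
- vlmeval/vlm/qwen2_vl/model.py
- vlmeval/vlm/qwen3_vl/model.py
- vlmeval/vlm/internvl/internvl_chat.py
- vlmeval/vlm/llava/llava.py
- vlmeval/vlm/minicpm_v.py
Acceptance criteria:
- The 5 models above complete the MMBench_DEV_EN_V11 evaluation on NPU
- Regression tests pass in the CUDA environment
Phase 3: P1 model adaptation (1 week)
- llama4, deepseek_vl2, deepseek_vl, gemma, cogvlm, phi3_vision, nvlm, yi_vl, vila
Acceptance criteria:
- All P1 models load and run inference on NPU
- Regression tests pass in the CUDA environment
Phase 4: P2-P3 model adaptation + dataset adaptation (2 weeks)
- All remaining VLM models
- Hard-coded CUDA usage in the dataset evaluation tools
- API model adaptation (hf_chat_model.py)
Acceptance criteria:
- All models load on NPU
- The full benchmark evaluation passes on NPU
- Regression tests pass in the CUDA environment
Phase 5: Documentation and CI (1 week)
- Add NPU usage documentation
- Add an NPU CI pipeline
- Add requirements_npu.txt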
5. Test Plan
5.1 Unit tests
Add tests/test_device.py:

```python
import torch


def test_device_detection():
    from vlmeval.utils.device import get_backend, get_device, get_device_count
    backend = get_backend()
    assert backend in ('cuda', 'npu', 'mps', 'cpu')
    device = get_device()
    assert isinstance(device, torch.device)
    assert get_device_count() >= 0


def test_empty_cache_no_error():
    from vlmeval.utils.device import empty_cache
    empty_cache()  # must not raise


def test_synchronize_no_error():
    from vlmeval.utils.device import synchronize
    synchronize()  # must not raise


def test_distributed_backend():
    from vlmeval.utils.device import get_distributed_backend
    backend = get_distributed_backend()
    assert backend in ('nccl', 'hccl', 'gloo')


def test_attn_implementation():
    from vlmeval.utils.device import get_attn_implementation
    impl = get_attn_implementation()
    assert impl in ('flash_attention_2', 'eager', 'sdpa')
```
5.2 Integration tests

| Scenario | Model | Benchmark | Hardware |
|---|---|---|---|
| NPU single-device inference | Qwen2-VL-7B-Instruct | MMBench_DEV_EN_V11 | Ascend 910B2 |
| NPU multi-device inference | InternVL2-26B | MMBench_DEV_EN_V11 | Ascend 910B2 x2 |
| NPU video inference | Qwen2-VL-7B-Instruct | MMBench_Video | Ascend 910B2 |
| CUDA regression | Qwen2-VL-7B-Instruct | MMBench_DEV_EN_V11 | NVIDIA A100 |
| CUDA multi-GPU regression | InternVL2-26B | MMBench_DEV_EN_V11 | NVIDIA A100 x2 |
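5.3 Accuracy verification
Evaluation accuracy on NPU should be statistically equivalent to the results on CUDA. A difference of up to ±0.5% is tolerated and attributed to floating-point differences between the two platforms.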
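
To make the tolerance check concrete, the sketch below compares two benchmark scores and also estimates the sampling noise of a single score. It assumes the ±0.5% is meant in absolute percentage points, which the text does not state explicitly; `within_tolerance` and `accuracy_std_error` are hypothetical helper names.

```python
import math


def within_tolerance(acc_cuda, acc_npu, tol_points=0.5):
    """Check that two accuracies (in percent) differ by at most tol_points."""
    return abs(acc_cuda - acc_npu) <= tol_points


def accuracy_std_error(acc_percent, n_samples):
    """Binomial standard error of an accuracy, in percentage points."""
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_samples)


print(within_tolerance(81.2, 80.9))
print(within_tolerance(81.2, 80.5))
print(accuracy_std_error(80.0, 1600))
```

For a benchmark with 1600 questions and 80% accuracy, the standard error is about 1 percentage point, which is larger than the 0.5-point tolerance. On benchmarks of that size, comparing per-sample predictions may therefore be more informative than comparing aggregate scores alone.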
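6. Risks and Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| The torch_npu API is not fully aligned with `torch.cuda` | Some functionality may be missing | Add try/except fallbacks in device.py |
| transformers `device_map="auto"` misbehaves on NPU | Model loading fails | Provide an option to specify `device_map` manually |
| Some models depend on CUDA-only operators | Inference fails | Record incompatible models and provide alternative paths |
| vLLM/LMDeploy do not support NPU | High-throughput inference is unavailable | Fall back to the Transformers path automatically |
| bf16/fp16 precision differences on NPU | Evaluation results deviate | Record the precision differences and provide a calibration procedure |
| Large-scale changes introduce regressions | CUDA functionality is impaired | Strengthen regression tests and validate on both hardware types in CI |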
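7. Future Extensions
- Support for more accelerators: AMD ROCm (exposed through torch.cuda in ROCm builds of PyTorch), Intel Gaudi (torch.hpu), and others
- NPU performance tuning: adopt NPU-specific operator optimizations and memory management strategies
- vLLM NPU backend: work with the vLLM community to advance NPU support
- Automatic mixed precision: choose the best precision strategy automatically based on device type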
8. References