
[VLM] Introduce FlashInfer CUDNN Prefill as ViT Backend #19003

Open
yuan-luo wants to merge 1 commit into sgl-project:main from antgroup:support_vit_fi_backend

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Feb 19, 2026

Motivation

FlashInfer CUDNN Prefill demonstrates strong performance. This PR introduces it to SGLang as one of the VLM ViT attention backends; a new "fi" mm attention backend is added.

Per manual testing, performance improved by about 10%. A more comprehensive performance test will be conducted soon.
Image understanding output is as expected.

➜  sglang_dev2 git:(support_vit_fi_backend) ✗ CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m sglang.launch_server --model Qwen/Qwen3-VL-30B-A3B-Instruct --tp 4 --mm-attention-backend fi
[2026-02-19 07:44:20] INFO server_args.py:1830: Attention backend not specified. Use fa3 backend by default.
[2026-02-19 07:44:21] server_args=ServerArgs(model_path='Qwen/Qwen3-VL-30B-A3B-Instruct', tokenizer_path='Qwen/Qwen3-VL-30B-A3B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.8322890624999999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=4, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=58862372, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='Qwen/Qwen3-VL-30B-A3B-Instruct', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, 
max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend='fi', fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='flashinfer_cutlass', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, 
enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-02-19 07:44:21] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-19 07:44:24] Using default HuggingFace chat template with detected content format: openai
[2026-02-19 07:44:32 TP2] Init torch distributed begin.
[2026-02-19 07:44:32 TP1] Init torch distributed begin.
[2026-02-19 07:44:33 TP0] Init torch distributed begin.
[2026-02-19 07:44:33 TP3] Init torch distributed begin.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2026-02-19 07:44:35 TP0] sglang is using nccl==2.27.5
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-02-19 07:44:38 TP0] Init torch distributed ends. elapsed=5.07 s, mem usage=1.24 GB
[2026-02-19 07:44:38 TP2] Init torch distributed ends. elapsed=5.15 s, mem usage=1.29 GB
[2026-02-19 07:44:38 TP1] Init torch distributed ends. elapsed=5.12 s, mem usage=1.29 GB
[2026-02-19 07:44:38 TP3] Init torch distributed ends. elapsed=5.03 s, mem usage=1.06 GB
[2026-02-19 07:44:38 TP3] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP3] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP2] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP3] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-19 07:44:38 TP2] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP2] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-19 07:44:38 TP0] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP0] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-19 07:44:38 TP1] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP1] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-19 07:44:39 TP3] Load weight begin. avail mem=138.23 GB
[2026-02-19 07:44:39 TP1] Load weight begin. avail mem=137.99 GB
[2026-02-19 07:44:39 TP0] Load weight begin. avail mem=138.04 GB
[2026-02-19 07:44:39 TP1] Using fi as multimodal attention backend.
[2026-02-19 07:44:39 TP3] Using fi as multimodal attention backend.
[2026-02-19 07:44:39 TP0] Using fi as multimodal attention backend.
[2026-02-19 07:44:39 TP2] Load weight begin. avail mem=137.99 GB
[2026-02-19 07:44:39 TP2] Using fi as multimodal attention backend.
[2026-02-19 07:44:39 TP0] Found local HF snapshot for Qwen/Qwen3-VL-30B-A3B-Instruct at /root/.cache/huggingface/hub/models--Qwen--Qwen3-VL-30B-A3B-Instruct/snapshots/9c4b90e1e4ba969fd3b5378b57d966d725f1b86c; skipping download.
Loading safetensors checkpoint shards:   0% Completed | 0/13 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   8% Completed | 1/13 [00:01<00:16,  1.35s/it]
Loading safetensors checkpoint shards:  15% Completed | 2/13 [00:03<00:18,  1.69s/it]
Loading safetensors checkpoint shards:  23% Completed | 3/13 [00:05<00:17,  1.80s/it]
Loading safetensors checkpoint shards:  31% Completed | 4/13 [00:07<00:16,  1.85s/it]
Loading safetensors checkpoint shards:  38% Completed | 5/13 [00:09<00:14,  1.87s/it]
Loading safetensors checkpoint shards:  46% Completed | 6/13 [00:10<00:13,  1.87s/it]
Loading safetensors checkpoint shards:  54% Completed | 7/13 [00:12<00:11,  1.87s/it]
Loading safetensors checkpoint shards:  62% Completed | 8/13 [00:14<00:09,  1.86s/it]
Loading safetensors checkpoint shards:  69% Completed | 9/13 [00:16<00:07,  1.87s/it]
Loading safetensors checkpoint shards:  77% Completed | 10/13 [00:18<00:05,  1.88s/it]
Loading safetensors checkpoint shards:  85% Completed | 11/13 [00:20<00:03,  1.89s/it]
Loading safetensors checkpoint shards:  92% Completed | 12/13 [00:22<00:01,  1.89s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:22<00:00,  1.54s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:22<00:00,  1.77s/it]

[2026-02-19 07:45:02 TP0] Load weight end. elapsed=23.55 s, type=Qwen3VLMoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=123.26 GB, mem usage=14.78 GB.
[2026-02-19 07:45:02 TP1] Load weight end. elapsed=23.55 s, type=Qwen3VLMoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=123.21 GB, mem usage=14.78 GB.
[2026-02-19 07:45:02 TP2] Load weight end. elapsed=23.46 s, type=Qwen3VLMoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=123.21 GB, mem usage=14.78 GB.
[2026-02-19 07:45:02 TP3] Load weight end. elapsed=23.55 s, type=Qwen3VLMoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=123.45 GB, mem usage=14.78 GB.
[2026-02-19 07:45:02 TP0] Using KV cache dtype: torch.bfloat16
[2026-02-19 07:45:02 TP1] KV Cache is allocated. #tokens: 4370338, K size: 50.01 GB, V size: 50.01 GB
[2026-02-19 07:45:02 TP1] Memory pool end. avail mem=19.01 GB
[2026-02-19 07:45:02 TP0] KV Cache is allocated. #tokens: 4370338, K size: 50.01 GB, V size: 50.01 GB
[2026-02-19 07:45:02 TP0] Memory pool end. avail mem=19.06 GB
[2026-02-19 07:45:02 TP2] KV Cache is allocated. #tokens: 4370338, K size: 50.01 GB, V size: 50.01 GB
[2026-02-19 07:45:02 TP3] KV Cache is allocated. #tokens: 4370338, K size: 50.01 GB, V size: 50.01 GB
[2026-02-19 07:45:02 TP2] Memory pool end. avail mem=19.01 GB
[2026-02-19 07:45:02 TP3] Memory pool end. avail mem=19.25 GB
[2026-02-19 07:45:02 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=18.92 GB
[2026-02-19 07:45:02 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=18.97 GB
[2026-02-19 07:45:02 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
[2026-02-19 07:45:02 TP2] Capture cuda graph begin. This can take up to several minutes. avail mem=18.92 GB
[2026-02-19 07:45:02 TP3] Capture cuda graph begin. This can take up to several minutes. avail mem=19.15 GB
Capturing batches (bs=512 avail_mem=18.17 GB):   0%|                                                                                                                                                                        | 0/52 [00:00<?, ?it/s][2026-02-19 07:45:04 TP2] Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200.json. Fallback to triton version 3.2.0 and use MoE kernel config from /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=192,device_name=NVIDIA_H200.json. Performance might be sub-optimal!
[2026-02-19 07:45:04 TP2] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-19 07:45:04 TP0] Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200.json. Fallback to triton version 3.2.0 and use MoE kernel config from /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=192,device_name=NVIDIA_H200.json. Performance might be sub-optimal!
[2026-02-19 07:45:04 TP0] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-19 07:45:04 TP1] Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200.json. Fallback to triton version 3.2.0 and use MoE kernel config from /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=192,device_name=NVIDIA_H200.json. Performance might be sub-optimal!
[2026-02-19 07:45:04 TP1] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-19 07:45:04 TP3] Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200.json. Fallback to triton version 3.2.0 and use MoE kernel config from /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=192,device_name=NVIDIA_H200.json. Performance might be sub-optimal!
[2026-02-19 07:45:04 TP3] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
Capturing batches (bs=1 avail_mem=17.16 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:15<00:00,  3.41it/s]
[2026-02-19 07:45:18 TP0] Registering 5044 cuda graph addresses
[2026-02-19 07:45:18 TP0] Capture cuda graph end. Time elapsed: 16.11 s. mem usage=1.81 GB. avail mem=17.15 GB.
[2026-02-19 07:45:18 TP3] Capture cuda graph end. Time elapsed: 16.10 s. mem usage=1.81 GB. avail mem=17.34 GB.
[2026-02-19 07:45:18 TP2] Capture cuda graph end. Time elapsed: 16.11 s. mem usage=1.81 GB. avail mem=17.11 GB.
[2026-02-19 07:45:18 TP1] Capture cuda graph end. Time elapsed: 16.13 s. mem usage=1.81 GB. avail mem=17.11 GB.
[2026-02-19 07:45:20 TP0] max_total_num_tokens=4370338, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=262144, available_gpu_mem=17.15 GB
[2026-02-19 07:45:21] INFO:     Started server process [188544]
[2026-02-19 07:45:21] INFO:     Waiting for application startup.
[2026-02-19 07:45:21] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2026-02-19 07:45:21] INFO:     Application startup complete.
[2026-02-19 07:45:21] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2026-02-19 07:45:22] INFO:     127.0.0.1:46914 - "GET /model_info HTTP/1.1" 200 OK
====Qwen3VLMoe sequence_lengths=tensor([256,   0,   0,   0,   0,   0,   0,   0], device='cuda:0',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([256,   0,   0,   0,   0,   0,   0,   0], device='cuda:2',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([256,   0,   0,   0,   0,   0,   0,   0], device='cuda:3',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([256,   0,   0,   0,   0,   0,   0,   0], device='cuda:1',
       dtype=torch.int32)
[2026-02-19 07:45:26 TP0] Prefill batch, #new-seq: 1, #new-token: 78, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, cuda graph: False
[2026-02-19 07:45:26] INFO:     127.0.0.1:46930 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-02-19 07:45:26] The server is fired up and ready to roll!
====Qwen3VLMoe sequence_lengths=tensor([2992,    0,    0,    0,    0,    0,    0,    0], device='cuda:1',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([2992,    0,    0,    0,    0,    0,    0,    0], device='cuda:0',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([2992,    0,    0,    0,    0,    0,    0,    0], device='cuda:2',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([2992,    0,    0,    0,    0,    0,    0,    0], device='cuda:3',
       dtype=torch.int32)
[2026-02-19 07:45:53 TP0] Prefill batch, #new-seq: 1, #new-token: 760, #cached-token: 4, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 2.90, cuda graph: False
[2026-02-19 07:45:53 TP0] Decode batch, #running-req: 1, #token: 797, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.49, #queue-req: 0
[2026-02-19 07:45:53 TP0] Decode batch, #running-req: 1, #token: 837, token usage: 0.00, cuda graph: True, gen throughput (token/s): 261.53, #queue-req: 0
[2026-02-19 07:45:53 TP0] Decode batch, #running-req: 1, #token: 877, token usage: 0.00, cuda graph: True, gen throughput (token/s): 261.47, #queue-req: 0
[2026-02-19 07:45:53 TP0] Decode batch, #running-req: 1, #token: 917, token usage: 0.00, cuda graph: True, gen throughput (token/s): 261.38, #queue-req: 0
[2026-02-19 07:45:53 TP0] Decode batch, #running-req: 1, #token: 957, token usage: 0.00, cuda graph: True, gen throughput (token/s): 261.76, #queue-req: 0
[2026-02-19 07:45:54] INFO:     127.0.0.1:38730 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Client:

➜  bench_script bash test_image.sh
{"id":"643ee6766d1046279ae6609a80b37352","object":"chat.completion","created":1771487198,"model":"nvidia/Eagle2.5-8B","choices":[{"index":0,"message":{"role":"assistant","content":"好的,这张图片描绘了一个宁静而专注的场景。\n\n图中主要人物是一位从后侧方拍摄的摄影师,我们无法看到其完整的面容。这位摄影师戴着一顶深灰色的针织帽,帽子的纹理清晰可见。他们有着一头深色的卷发,身穿一件浅蓝色的牛仔夹克。\n\n这位摄影师正双手紧握着一台银色的佳能(Canon)单反相机,镜头对准前方,似乎正在专注地拍摄。相机的细节,如镜头上的刻度和机身上的“Canon”标志,都清晰可见。摄影师的左手无名指上戴着一枚简约的戒指。\n\n背景是一个开阔的户外环境,可能是田野或草地,远处的景物被虚化,呈现出温暖的黄昏或黎明色调。整个画面的光线柔和,营造出一种温暖、怀旧且富有艺术感的氛围。摄影师的姿势和专注的神情,传达出一种对摄影艺术的热爱和投入。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":764,"total_tokens":965,"completion_tokens":201,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m0.934s
user	0m0.000s
sys	0m0.006s
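
test_image.sh itself is not included in this PR; a request of roughly this shape against the OpenAI-compatible endpoint would exercise the image path (the image URL below is only a placeholder):

import requests

resp = requests.post(
    "http://127.0.0.1:30000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/demo.jpg"}},
            ],
        }],
        "max_tokens": 256,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])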

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the Multi-modal (multi-modal language model) label on Feb 19, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates FlashInfer CUDNN Prefill into the Vision Transformer (ViT) component of Vision-Language Models (VLMs). The primary goal is to improve performance by utilizing FlashInfer's optimized prefill operations. This involves adding a dedicated attention class, implementing specialized sequence length management, and updating the system's configuration to enable the new backend.

Highlights

  • FlashInfer CUDNN Prefill Integration: Introduced FlashInfer CUDNN Prefill as a new backend option for the Vision Transformer (ViT) attention mechanism, aiming to leverage its strong performance for Vision-Language Models (VLMs).
  • New Attention Class: Added a VisionFlashInferAttention class to encapsulate the logic for using FlashInfer's cudnn_batch_prefill_with_kv_cache function.
  • Sequence Length Handling and Padding: Implemented specific methods for FlashInfer to handle sequence lengths, including padding to batch buckets and bucketing maximum sequence lengths to optimize cuDNN graph caching (a short sketch follows this list).
  • Configuration and CLI Extension: Extended the command-line interface to allow users to select 'fi' (FlashInfer) as the multimodal attention backend, and integrated a workspace buffer for FlashInfer operations.


Changelog
  • python/sglang/srt/layers/attention/vision.py
    • Imported cudnn_batch_prefill_with_kv_cache from flashinfer.prefill.
    • Added a new class VisionFlashInferAttention to handle FlashInfer-specific attention logic.
    • Registered "fi" as a new attention backend in the VISION_ATTENTION_BACKEND map.
    • Modified VisionAttention and VisionBlock constructors to accept an optional workspace_buffer parameter.
    • Updated the forward method of VisionAttention to pass max_seqlen and sequence_lengths to the attention backend.
  • python/sglang/srt/models/qwen3_vl.py
    • Imported the round_up utility function.
    • Defined constants for FLASHINFER_WORKSPACE_SIZE_BYTES, BATCH_BUCKETS, and FLASHINFER_MAX_SEQLEN_BUCKETS for FlashInfer configuration.
    • Modified the Qwen3_VisionBlock constructor to accept an optional workspace_buffer.
    • Modified the Qwen3_VisionBlock forward method to accept max_seqlen and sequence_lengths.
    • Initialized a workspace_buffer for the FlashInfer backend based on global server arguments.
    • Renamed fast_pos_embed_interpolate to fast_pos_embed_interpolate_from_list and updated its call site.
    • Added add_padding_to_fi_seqlens to pad sequence lengths for FlashInfer batching.
    • Added compute_flashinfer_cu_seqlens to adjust cumulative sequence lengths for FlashInfer.
    • Added bucket_flashinfer_max_seqlen to bucket sequence lengths for cuDNN graph caching.
    • Updated the main forward method to compute and pass FlashInfer-specific max_seqlen and sequence_lengths when the "fi" backend is active.
    • Changed grid_thw processing from torch.tensor to np.array and numpy() for compatibility.
  • python/sglang/srt/server_args.py
    • Added "fi" to the list of available choices for the --mm-attention-backend command-line argument.
Activity
  • The pull request introduces FlashInfer CUDNN Prefill as a new backend for ViT.
  • The author notes an ongoing issue with the cuDNN library that is currently under investigation, indicating active debugging and development.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces FlashInfer CUDNN prefill as a new backend for Vision Transformer attention, which is a great step towards improving performance. My review focuses on two critical issues that appear to be causing the runtime error mentioned in the PR description and could lead to incorrect computations.

  1. An incorrect type for the scale parameter in VisionFlashInferAttention, which likely causes the pybind11 casting error.
  2. Incorrect logic for calculating cumulative sequence lengths in compute_flashinfer_cu_seqlens in qwen3_vl.py, which could lead to incorrect attention results.

Addressing these issues should help in getting the new backend to work correctly. The rest of the changes for plumbing the new backend and its parameters seem correct.
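
For context on these two points: cumulative sequence lengths for a packed (ragged) batch are conventionally a prefix sum with a leading zero, and the softmax scale is a plain Python float. A minimal sketch of that convention (not the PR's compute_flashinfer_cu_seqlens implementation):

import torch

def build_cu_seqlens(seqlens):
    # [s0, s1, s2] -> [0, s0, s0+s1, s0+s1+s2]; entry i is the start offset
    # of sequence i in the packed token buffer.
    return torch.nn.functional.pad(
        torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0)
    )

seqlens = torch.tensor([256, 2992, 780], dtype=torch.int32)
cu_seqlens = build_cu_seqlens(seqlens)  # tensor([   0,  256, 3248, 4028], dtype=torch.int32)

head_dim = 128
softmax_scale = head_dim ** -0.5        # a float, not a tensor, avoids pybind11 casting errors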

@yuan-luo yuan-luo force-pushed the support_vit_fi_backend branch from cf0eb75 to 4bc9cb9 on February 19, 2026 07:47
@yuan-luo yuan-luo changed the title from [WIP][VLM] Introduce FlashInfer CUDNN Prefill as ViT Backend to [VLM] Introduce FlashInfer CUDNN Prefill as ViT Backend on Feb 19, 2026
@yuan-luo
Collaborator Author

/tag-and-rerun-ci

Collaborator

@JustinTong0323 JustinTong0323 left a comment


Could you also resolve the 3 bugs from the Devin review? https://app.devin.ai/review/sgl-project/sglang/pull/19003 They all look reasonable to me.

"sdpa": VisionSdpaAttention,
"fa3": VisionFlash3Attention,
"fa4": VisionFlash4Attention,
"fi": VisionFlashInferAttention,
Collaborator


Use "flashinfer" so the naming is unified with --attention-backend.

Collaborator Author


Sure, will change it.


return torch.cat(result_parts, dim=0)

def fast_pos_embed_interpolate_from_list(self, grid_thw):
Collaborator


Does this function only apply to qwen3_vl?

Collaborator Author


This function only applies to qwen3_vl. For Qwen2.5-VL and some other VLMs we will need to adapt their own functions, since FlashInfer cuDNN requires the padding to be done in advance.
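
One possible way to generalize this to other VLMs (purely a hypothetical adapter pattern, not code in this PR) is for each vision model to expose a hook that returns the pre-padded sequence lengths, so the FlashInfer-specific padding always happens before the vision blocks run:

from typing import Protocol, Tuple
import torch

class SupportsFlashInferViT(Protocol):
    def compute_fi_seqlens(self, grid_thw) -> Tuple[int, torch.Tensor]:
        # Return (bucketed max_seqlen, padded per-image sequence lengths).
        ...

def run_vision_blocks(model, hidden_states, grid_thw, blocks, backend):
    max_seqlen, seqlens = None, None
    if backend == "fi":
        # Padding/bucketing is done up front so every block sees identical
        # shapes and the cuDNN graph cache can be reused across layers.
        max_seqlen, seqlens = model.compute_fi_seqlens(grid_thw)
    for blk in blocks:
        hidden_states = blk(
            hidden_states, max_seqlen=max_seqlen, sequence_lengths=seqlens
        )
    return hidden_states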

@yuan-luo
Collaborator Author

Could you also resolve the 3 bugs from the Devin review? https://app.devin.ai/review/sgl-project/sglang/pull/19003 They all look reasonable to me.

Sure, will address them.

@yuan-luo
Collaborator Author

/rerun-failed-ci
