
[VLM] Introduce FlashInfer CUDNN Prefill as ViT Backend #19003

Open
yuan-luo wants to merge 1 commit into sgl-project:main from antgroup:support_vit_fi_backend

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Feb 19, 2026

Motivation

FlashInfer CUDNN Prefill demonstrates strong performance. This PR introduces it to SGLang as one of the VLM ViT attention backends; a new "fi" mm attention backend is added.

Per manual testing, performance improved by about 10%. A more comprehensive performance test will be conducted soon.
Image understanding output is as expected.

➜  sglang_dev2 git:(support_vit_fi_backend) ✗ CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m sglang.launch_server --model Qwen/Qwen3-VL-30B-A3B-Instruct --tp 4 --mm-attention-backend fi
[2026-02-19 07:44:20] INFO server_args.py:1830: Attention backend not specified. Use fa3 backend by default.
[2026-02-19 07:44:21] server_args=ServerArgs(model_path='Qwen/Qwen3-VL-30B-A3B-Instruct', tokenizer_path='Qwen/Qwen3-VL-30B-A3B-Instruct', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=None, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='127.0.0.1', port=30000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.8322890624999999, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=4, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_output=False, random_seed=58862372, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='Qwen/Qwen3-VL-30B-A3B-Instruct', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, 
max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend='fi', fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='flashinfer_cutlass', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_match_window_size=1, speculative_ngram_max_match_window_size=12, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_branch_length=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', disable_hicache_numa_detect=False, hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, hierarchical_sparse_attention_extra_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=512, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, 
enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, enable_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_decode_tp=None, disaggregation_decode_dp=None, disaggregation_prefill_pp=1, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, mm_max_concurrent_calls=32, mm_per_request_timeout=10.0, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-02-19 07:44:21] Ignore import error when loading sglang.srt.multimodal.processors.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-19 07:44:24] Using default HuggingFace chat template with detected content format: openai
[2026-02-19 07:44:32 TP2] Init torch distributed begin.
[2026-02-19 07:44:32 TP1] Init torch distributed begin.
[2026-02-19 07:44:33 TP0] Init torch distributed begin.
[2026-02-19 07:44:33 TP3] Init torch distributed begin.
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 0 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 1 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 2 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[Gloo] Rank 3 is connected to 3 peer ranks. Expected number of connected peer ranks is : 3
[2026-02-19 07:44:35 TP0] sglang is using nccl==2.27.5
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-02-19 07:44:38 TP0] Init torch distributed ends. elapsed=5.07 s, mem usage=1.24 GB
[2026-02-19 07:44:38 TP2] Init torch distributed ends. elapsed=5.15 s, mem usage=1.29 GB
[2026-02-19 07:44:38 TP1] Init torch distributed ends. elapsed=5.12 s, mem usage=1.29 GB
[2026-02-19 07:44:38 TP3] Init torch distributed ends. elapsed=5.03 s, mem usage=1.06 GB
[2026-02-19 07:44:38 TP3] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP3] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP2] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP3] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-19 07:44:38 TP2] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP2] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-19 07:44:38 TP0] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP0] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP0] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-19 07:44:38 TP1] Ignore import error when loading sglang.srt.models.glm_ocr: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP1] Ignore import error when loading sglang.srt.models.glm_ocr_nextn: No module named 'transformers.models.glm_ocr'
[2026-02-19 07:44:38 TP1] Ignore import error when loading sglang.srt.models.glmasr: cannot import name 'GlmAsrConfig' from 'transformers' (/usr/local/lib/python3.12/dist-packages/transformers/__init__.py)
[2026-02-19 07:44:39 TP3] Load weight begin. avail mem=138.23 GB
[2026-02-19 07:44:39 TP1] Load weight begin. avail mem=137.99 GB
[2026-02-19 07:44:39 TP0] Load weight begin. avail mem=138.04 GB
[2026-02-19 07:44:39 TP1] Using fi as multimodal attention backend.
[2026-02-19 07:44:39 TP3] Using fi as multimodal attention backend.
[2026-02-19 07:44:39 TP0] Using fi as multimodal attention backend.
[2026-02-19 07:44:39 TP2] Load weight begin. avail mem=137.99 GB
[2026-02-19 07:44:39 TP2] Using fi as multimodal attention backend.
[2026-02-19 07:44:39 TP0] Found local HF snapshot for Qwen/Qwen3-VL-30B-A3B-Instruct at /root/.cache/huggingface/hub/models--Qwen--Qwen3-VL-30B-A3B-Instruct/snapshots/9c4b90e1e4ba969fd3b5378b57d966d725f1b86c; skipping download.
Loading safetensors checkpoint shards:   0% Completed | 0/13 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   8% Completed | 1/13 [00:01<00:16,  1.35s/it]
Loading safetensors checkpoint shards:  15% Completed | 2/13 [00:03<00:18,  1.69s/it]
Loading safetensors checkpoint shards:  23% Completed | 3/13 [00:05<00:17,  1.80s/it]
Loading safetensors checkpoint shards:  31% Completed | 4/13 [00:07<00:16,  1.85s/it]
Loading safetensors checkpoint shards:  38% Completed | 5/13 [00:09<00:14,  1.87s/it]
Loading safetensors checkpoint shards:  46% Completed | 6/13 [00:10<00:13,  1.87s/it]
Loading safetensors checkpoint shards:  54% Completed | 7/13 [00:12<00:11,  1.87s/it]
Loading safetensors checkpoint shards:  62% Completed | 8/13 [00:14<00:09,  1.86s/it]
Loading safetensors checkpoint shards:  69% Completed | 9/13 [00:16<00:07,  1.87s/it]
Loading safetensors checkpoint shards:  77% Completed | 10/13 [00:18<00:05,  1.88s/it]
Loading safetensors checkpoint shards:  85% Completed | 11/13 [00:20<00:03,  1.89s/it]
Loading safetensors checkpoint shards:  92% Completed | 12/13 [00:22<00:01,  1.89s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:22<00:00,  1.54s/it]
Loading safetensors checkpoint shards: 100% Completed | 13/13 [00:22<00:00,  1.77s/it]

[2026-02-19 07:45:02 TP0] Load weight end. elapsed=23.55 s, type=Qwen3VLMoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=123.26 GB, mem usage=14.78 GB.
[2026-02-19 07:45:02 TP1] Load weight end. elapsed=23.55 s, type=Qwen3VLMoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=123.21 GB, mem usage=14.78 GB.
[2026-02-19 07:45:02 TP2] Load weight end. elapsed=23.46 s, type=Qwen3VLMoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=123.21 GB, mem usage=14.78 GB.
[2026-02-19 07:45:02 TP3] Load weight end. elapsed=23.55 s, type=Qwen3VLMoeForConditionalGeneration, dtype=torch.bfloat16, avail mem=123.45 GB, mem usage=14.78 GB.
[2026-02-19 07:45:02 TP0] Using KV cache dtype: torch.bfloat16
[2026-02-19 07:45:02 TP1] KV Cache is allocated. #tokens: 4370338, K size: 50.01 GB, V size: 50.01 GB
[2026-02-19 07:45:02 TP1] Memory pool end. avail mem=19.01 GB
[2026-02-19 07:45:02 TP0] KV Cache is allocated. #tokens: 4370338, K size: 50.01 GB, V size: 50.01 GB
[2026-02-19 07:45:02 TP0] Memory pool end. avail mem=19.06 GB
[2026-02-19 07:45:02 TP2] KV Cache is allocated. #tokens: 4370338, K size: 50.01 GB, V size: 50.01 GB
[2026-02-19 07:45:02 TP3] KV Cache is allocated. #tokens: 4370338, K size: 50.01 GB, V size: 50.01 GB
[2026-02-19 07:45:02 TP2] Memory pool end. avail mem=19.01 GB
[2026-02-19 07:45:02 TP3] Memory pool end. avail mem=19.25 GB
[2026-02-19 07:45:02 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=18.92 GB
[2026-02-19 07:45:02 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=18.97 GB
[2026-02-19 07:45:02 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512]
[2026-02-19 07:45:02 TP2] Capture cuda graph begin. This can take up to several minutes. avail mem=18.92 GB
[2026-02-19 07:45:02 TP3] Capture cuda graph begin. This can take up to several minutes. avail mem=19.15 GB
Capturing batches (bs=512 avail_mem=18.17 GB):   0%|                                                                                                                                                                        | 0/52 [00:00<?, ?it/s][2026-02-19 07:45:04 TP2] Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200.json. Fallback to triton version 3.2.0 and use MoE kernel config from /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=192,device_name=NVIDIA_H200.json. Performance might be sub-optimal!
[2026-02-19 07:45:04 TP2] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-19 07:45:04 TP0] Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200.json. Fallback to triton version 3.2.0 and use MoE kernel config from /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=192,device_name=NVIDIA_H200.json. Performance might be sub-optimal!
[2026-02-19 07:45:04 TP0] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-19 07:45:04 TP1] Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200.json. Fallback to triton version 3.2.0 and use MoE kernel config from /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=192,device_name=NVIDIA_H200.json. Performance might be sub-optimal!
[2026-02-19 07:45:04 TP1] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
[2026-02-19 07:45:04 TP3] Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200.json. Fallback to triton version 3.2.0 and use MoE kernel config from /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_2_0/E=128,N=192,device_name=NVIDIA_H200.json. Performance might be sub-optimal!
[2026-02-19 07:45:04 TP3] Using MoE kernel config with down_moe=False. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/sglang/srt/layers/moe/fused_moe_triton/configs/triton_3_5_1/E=128,N=192,device_name=NVIDIA_H200_down.json, you can create them with https://github.com/sgl-project/sglang/tree/main/benchmark/kernels/fused_moe_triton
Capturing batches (bs=1 avail_mem=17.16 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 52/52 [00:15<00:00,  3.41it/s]
[2026-02-19 07:45:18 TP0] Registering 5044 cuda graph addresses
[2026-02-19 07:45:18 TP0] Capture cuda graph end. Time elapsed: 16.11 s. mem usage=1.81 GB. avail mem=17.15 GB.
[2026-02-19 07:45:18 TP3] Capture cuda graph end. Time elapsed: 16.10 s. mem usage=1.81 GB. avail mem=17.34 GB.
[2026-02-19 07:45:18 TP2] Capture cuda graph end. Time elapsed: 16.11 s. mem usage=1.81 GB. avail mem=17.11 GB.
[2026-02-19 07:45:18 TP1] Capture cuda graph end. Time elapsed: 16.13 s. mem usage=1.81 GB. avail mem=17.11 GB.
[2026-02-19 07:45:20 TP0] max_total_num_tokens=4370338, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=4096, context_len=262144, available_gpu_mem=17.15 GB
[2026-02-19 07:45:21] INFO:     Started server process [188544]
[2026-02-19 07:45:21] INFO:     Waiting for application startup.
[2026-02-19 07:45:21] Using default chat sampling params from model generation config: {'repetition_penalty': 1.0, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
[2026-02-19 07:45:21] INFO:     Application startup complete.
[2026-02-19 07:45:21] INFO:     Uvicorn running on http://127.0.0.1:30000 (Press CTRL+C to quit)
[2026-02-19 07:45:22] INFO:     127.0.0.1:46914 - "GET /model_info HTTP/1.1" 200 OK
====Qwen3VLMoe sequence_lengths=tensor([256,   0,   0,   0,   0,   0,   0,   0], device='cuda:0',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([256,   0,   0,   0,   0,   0,   0,   0], device='cuda:2',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([256,   0,   0,   0,   0,   0,   0,   0], device='cuda:3',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([256,   0,   0,   0,   0,   0,   0,   0], device='cuda:1',
       dtype=torch.int32)
[2026-02-19 07:45:26 TP0] Prefill batch, #new-seq: 1, #new-token: 78, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 0.00, cuda graph: False
[2026-02-19 07:45:26] INFO:     127.0.0.1:46930 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-02-19 07:45:26] The server is fired up and ready to roll!
====Qwen3VLMoe sequence_lengths=tensor([2992,    0,    0,    0,    0,    0,    0,    0], device='cuda:1',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([2992,    0,    0,    0,    0,    0,    0,    0], device='cuda:0',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([2992,    0,    0,    0,    0,    0,    0,    0], device='cuda:2',
       dtype=torch.int32)
====Qwen3VLMoe sequence_lengths=tensor([2992,    0,    0,    0,    0,    0,    0,    0], device='cuda:3',
       dtype=torch.int32)
[2026-02-19 07:45:53 TP0] Prefill batch, #new-seq: 1, #new-token: 760, #cached-token: 4, token usage: 0.00, #running-req: 0, #queue-req: 0, input throughput (token/s): 2.90, cuda graph: False
[2026-02-19 07:45:53 TP0] Decode batch, #running-req: 1, #token: 797, token usage: 0.00, cuda graph: True, gen throughput (token/s): 0.49, #queue-req: 0
[2026-02-19 07:45:53 TP0] Decode batch, #running-req: 1, #token: 837, token usage: 0.00, cuda graph: True, gen throughput (token/s): 261.53, #queue-req: 0
[2026-02-19 07:45:53 TP0] Decode batch, #running-req: 1, #token: 877, token usage: 0.00, cuda graph: True, gen throughput (token/s): 261.47, #queue-req: 0
[2026-02-19 07:45:53 TP0] Decode batch, #running-req: 1, #token: 917, token usage: 0.00, cuda graph: True, gen throughput (token/s): 261.38, #queue-req: 0
[2026-02-19 07:45:53 TP0] Decode batch, #running-req: 1, #token: 957, token usage: 0.00, cuda graph: True, gen throughput (token/s): 261.76, #queue-req: 0
[2026-02-19 07:45:54] INFO:     127.0.0.1:38730 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Client:

➜  bench_script bash test_image.sh
{"id":"643ee6766d1046279ae6609a80b37352","object":"chat.completion","created":1771487198,"model":"nvidia/Eagle2.5-8B","choices":[{"index":0,"message":{"role":"assistant","content":"好的,这张图片描绘了一个宁静而专注的场景。\n\n图中主要人物是一位从后侧方拍摄的摄影师,我们无法看到其完整的面容。这位摄影师戴着一顶深灰色的针织帽,帽子的纹理清晰可见。他们有着一头深色的卷发,身穿一件浅蓝色的牛仔夹克。\n\n这位摄影师正双手紧握着一台银色的佳能(Canon)单反相机,镜头对准前方,似乎正在专注地拍摄。相机的细节,如镜头上的刻度和机身上的“Canon”标志,都清晰可见。摄影师的左手无名指上戴着一枚简约的戒指。\n\n背景是一个开阔的户外环境,可能是田野或草地,远处的景物被虚化,呈现出温暖的黄昏或黎明色调。整个画面的光线柔和,营造出一种温暖、怀旧且富有艺术感的氛围。摄影师的姿势和专注的神情,传达出一种对摄影艺术的热爱和投入。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151645}],"usage":{"prompt_tokens":764,"total_tokens":965,"completion_tokens":201,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m0.934s
user	0m0.000s
sys	0m0.006s
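
test_image.sh itself is not included in this PR; a request of roughly this shape against the OpenAI-compatible endpoint would exercise the image path (the image URL below is only a placeholder):

import requests

resp = requests.post(
    "http://127.0.0.1:30000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen3-VL-30B-A3B-Instruct",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/demo.jpg"}},
            ],
        }],
        "max_tokens": 256,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])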

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@github-actions github-actions bot added the Multi-modal (multi-modal language model) label on Feb 19, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates FlashInfer CUDNN Prefill into the Vision Transformer (ViT) component of Vision-Language Models (VLMs). The primary goal is to improve performance by utilizing FlashInfer's optimized prefill operations. This involves adding a dedicated attention class, implementing specialized sequence length management, and updating the system's configuration to enable the new backend.

Highlights

  • FlashInfer CUDNN Prefill Integration: Introduced FlashInfer CUDNN Prefill as a new backend option for the Vision Transformer (ViT) attention mechanism, aiming to leverage its strong performance for Vision-Language Models (VLMs).
  • New Attention Class: Added a VisionFlashInferAttention class to encapsulate the logic for using FlashInfer's cudnn_batch_prefill_with_kv_cache function.
  • Sequence Length Handling and Padding: Implemented specific methods for FlashInfer to handle sequence lengths, including padding to batch buckets and bucketing maximum sequence lengths to optimize cuDNN graph caching (a short sketch follows this list).
  • Configuration and CLI Extension: Extended the command-line interface to allow users to select 'fi' (FlashInfer) as the multimodal attention backend, and integrated a workspace buffer for FlashInfer operations.


Changelog
  • python/sglang/srt/layers/attention/vision.py
    • Imported cudnn_batch_prefill_with_kv_cache from flashinfer.prefill.
    • Added a new class VisionFlashInferAttention to handle FlashInfer-specific attention logic.
    • Registered "fi" as a new attention backend in the VISION_ATTENTION_BACKEND map.
    • Modified VisionAttention and VisionBlock constructors to accept an optional workspace_buffer parameter.
    • Updated the forward method of VisionAttention to pass max_seqlen and sequence_lengths to the attention backend.
  • python/sglang/srt/models/qwen3_vl.py
    • Imported the round_up utility function.
    • Defined constants for FLASHINFER_WORKSPACE_SIZE_BYTES, BATCH_BUCKETS, and FLASHINFER_MAX_SEQLEN_BUCKETS for FlashInfer configuration.
    • Modified the Qwen3_VisionBlock constructor to accept an optional workspace_buffer.
    • Modified the Qwen3_VisionBlock forward method to accept max_seqlen and sequence_lengths.
    • Initialized a workspace_buffer for the FlashInfer backend based on global server arguments.
    • Renamed fast_pos_embed_interpolate to fast_pos_embed_interpolate_from_list and updated its call site.
    • Added add_padding_to_fi_seqlens to pad sequence lengths for FlashInfer batching.
    • Added compute_flashinfer_cu_seqlens to adjust cumulative sequence lengths for FlashInfer.
    • Added bucket_flashinfer_max_seqlen to bucket sequence lengths for cuDNN graph caching.
    • Updated the main forward method to compute and pass FlashInfer-specific max_seqlen and sequence_lengths when the "fi" backend is active.
    • Changed grid_thw processing from torch.tensor to np.array and numpy() for compatibility.
  • python/sglang/srt/server_args.py
    • Added "fi" to the list of available choices for the --mm-attention-backend command-line argument.
Activity
  • The pull request introduces FlashInfer CUDNN Prefill as a new backend for ViT.
  • The author notes an ongoing issue with the cuDNN library that is currently under investigation, indicating active debugging and development.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces FlashInfer CUDNN prefill as a new backend for Vision Transformer attention, which is a great step towards improving performance. My review focuses on two critical issues that appear to be causing the runtime error mentioned in the PR description and could lead to incorrect computations.

  1. An incorrect type for the scale parameter in VisionFlashInferAttention, which likely causes the pybind11 casting error.
  2. Incorrect logic for calculating cumulative sequence lengths in compute_flashinfer_cu_seqlens in qwen3_vl.py, which could lead to incorrect attention results.

Addressing these issues should help in getting the new backend to work correctly. The rest of the changes for plumbing the new backend and its parameters seem correct.
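
For context on these two points: cumulative sequence lengths for a packed (ragged) batch are conventionally a prefix sum with a leading zero, and the softmax scale is a plain Python float. A minimal sketch of that convention (not the PR's compute_flashinfer_cu_seqlens implementation):

import torch

def build_cu_seqlens(seqlens):
    # [s0, s1, s2] -> [0, s0, s0+s1, s0+s1+s2]; entry i is the start offset
    # of sequence i in the packed token buffer.
    return torch.nn.functional.pad(
        torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0)
    )

seqlens = torch.tensor([256, 2992, 780], dtype=torch.int32)
cu_seqlens = build_cu_seqlens(seqlens)  # tensor([   0,  256, 3248, 4028], dtype=torch.int32)

head_dim = 128
softmax_scale = head_dim ** -0.5        # a float, not a tensor, avoids pybind11 casting errors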

@yuan-luo yuan-luo force-pushed the support_vit_fi_backend branch from cf0eb75 to 4bc9cb9 on February 19, 2026 07:47
@yuan-luo yuan-luo changed the title from [WIP][VLM] Introduce FlashInfer CUDNN Prefill as ViT Backend to [VLM] Introduce FlashInfer CUDNN Prefill as ViT Backend on Feb 19, 2026
@yuan-luo
Collaborator Author

/tag-and-rerun-ci

Collaborator

@JustinTong0323 JustinTong0323 left a comment


Could you also resolve the 3 bugs from the Devin review? https://app.devin.ai/review/sgl-project/sglang/pull/19003 They all look reasonable to me.

"sdpa": VisionSdpaAttention,
"fa3": VisionFlash3Attention,
"fa4": VisionFlash4Attention,
"fi": VisionFlashInferAttention,
Collaborator


Use "flashinfer" so the naming is unified with --attention-backend.

Collaborator Author


Sure, will change it.


return torch.cat(result_parts, dim=0)

def fast_pos_embed_interpolate_from_list(self, grid_thw):
Collaborator


Does this function only apply to qwen3_vl?

Collaborator Author


This function only applies to qwen3_vl. For Qwen2.5-VL and some other VLMs we will need to adapt their own functions, since FlashInfer cuDNN requires the padding to be done in advance.
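
One possible way to generalize this to other VLMs (purely a hypothetical adapter pattern, not code in this PR) is for each vision model to expose a hook that returns the pre-padded sequence lengths, so the FlashInfer-specific padding always happens before the vision blocks run:

from typing import Protocol, Tuple
import torch

class SupportsFlashInferViT(Protocol):
    def compute_fi_seqlens(self, grid_thw) -> Tuple[int, torch.Tensor]:
        # Return (bucketed max_seqlen, padded per-image sequence lengths).
        ...

def run_vision_blocks(model, hidden_states, grid_thw, blocks, backend):
    max_seqlen, seqlens = None, None
    if backend == "fi":
        # Padding/bucketing is done up front so every block sees identical
        # shapes and the cuDNN graph cache can be reused across layers.
        max_seqlen, seqlens = model.compute_fi_seqlens(grid_thw)
    for blk in blocks:
        hidden_states = blk(
            hidden_states, max_seqlen=max_seqlen, sequence_lengths=seqlens
        )
    return hidden_states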

@yuan-luo
Collaborator Author

Could you also resolve the 3 bugs from the Devin review? https://app.devin.ai/review/sgl-project/sglang/pull/19003 They all look reasonable to me.

Sure, will address them.

@yuan-luo
Collaborator Author

/rerun-failed-ci
