Commit 0c76830

Support for flash attention (#7)

* Support for flash attention

Signed-off-by: Jinseok Lee <jindol21@rebellions.ai>

1 parent 669da6a

5 files changed: 265 additions & 29 deletions

(new file; name not shown in this view)

Lines changed: 85 additions & 0 deletions

@@ -0,0 +1,85 @@
# Copyright 2025 Rebellions Inc. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at:

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ruff: noqa: E501

from vllm import LLM, SamplingParams

prompts = [
    """
Rebellions, SK Telecom, and DOCOMO Innovations Partner to Accelerate Next-Gen AI Infrastructure
News
Apr 24, 2025
[Seoul, April 24, 2025] Rebellions today announced the signing of a strategic MOU with SK Telecom and DOCOMO Innovations (DII), a subsidiary of Japan’s leading telecom provider NTT DOCOMO. The agreement lays the foundation for joint development and validation of cutting-edge AI acceleration technologies.

Under this collaboration, the three companies will focus on evaluating Rebellions’ ATOM-based NPU servers within SK Telecom’s NPU farm. Building on the success of this initiative, the partners plan to expand validation efforts to include a broader range of Rebellions’ product portfolio.

Rebellions will bring its high-performance AI chips, proven system reliability, and software optimization expertise to the table. SK Telecom will provide its NPU farm and AI infrastructure capabilities, while DII will conduct technical evaluations and help facilitate discussions on potential paths to commercialization. This partnership is expected to combine the core strengths of each company to help accelerate the growth of the global AI hardware ecosystem.

Rebellions recently established a Japanese subsidiary, further accelerating its global expansion. This agreement marks a significant step in the company’s ambition to become a key player in the global AI infrastructure market.

“Rebellions is committed to proving the real-world stability and performance of our technology in live infrastructure environments,” said Jinwook Oh, CTO of Rebellions. “Collaboration between a hardware provider, an infrastructure leader, and actual users is crucial – and together with SK Telecom and DII, we’re building meaningful progress toward the future of AI.”

Yoshikazu Akinaga, CEO of DOCOMO Innovations, added: “At DOCOMO Innovations, we are dedicated to driving forward innovation in practical AI solutions by working closely with global technology leaders. This partnership—bringing together Rebellions’ advanced semiconductor technology and SK Telecom’s infrastructure capabilities—will allow us to explore and assess the potential of scalable, sustainable AI systems, while maintaining a technology-agnostic approach to ensure optimal solutions for future applications.”

Sangmin Lee, Vice President of Growth Business Development Office at SK Telecom, commented: “SK Telecom delivers world-class, AI-optimized cloud services. This collaboration provides an opportunity to demonstrate our NPU cloud technology, which integrates a wide range of AI data center solutions. We are committed to contributing to the success of next-generation AI infrastructure services—powered not just by GPUs, but also by NPUs.”
""", """
Rebellions Partners on Strategic Collaboration Initiative to Advance Global AI Data Center Ecosystem
News
Mar 04, 2025
[Seoul, March 4, 2025] Rebellions, a pioneering AI chip company, today announced a strategic partnership with Penguin Solutions and SK Telecom at the Mobile World Congress (MWC) 2025 in Barcelona, Spain, taking a significant step toward building a global AI data center ecosystem.

The collaboration aims to establish technical foundations and business capabilities for large-scale data center operations. By combining the unique strengths and experiences of the Rebellions, Penguin Solutions and SK Telecom, the companies will pursue strategic development around AI inference and software stack delivery in the AI data center sector.

Within the planned collaboration, Rebellions brings its portfolio of energy-efficient AI accelerators optimized for Generative AI workloads. Penguin Solutions, a premiere AI infrastructure expert with more than 85,000 deployed GPUs under management, brings deep AI infrastructure expertise to the partnership. SK Telecom, actively accelerating its AI infrastructure business with key software elements and major investments in AI infra companies around the world, completes the strategic alliance.

The three companies will collaborate on multiple fronts, including:

Developing AI infrastructure solutions and creating testing environments for enterprise clients
Joint development of AI data center management solutions by integrating Rebellions’ AI accelerators while supporting both GPU and NPU environments
Leveraging each party’s technical expertise for software development specialized in AI data center infrastructure
“Since ‘DeepSeek’, efficient operations have emerged as a key concept of an AI business, making energy efficiency and cost of ownership critical evaluation criteria for customers,” said Sunghyun Park, CEO of Rebellions. “This partnership represents a crucial first step in establishing an efficient AI data center ecosystem by bringing together companies with diverse technological expertise.”

Mark Seamans, Penguin Solutions’ Vice President of Global Marketing, stated “With this partnership, our deep expertise in HPC and AI cluster management will now extend beyond GPU infrastructure to NPU environments. We’re committed to providing state-of-the-art AI infrastructure that meets the diverse needs of customers, solves for complexity, and accelerates business outcomes in the rapidly-growing global AI market.”

Through this strategic partnership, Rebellions, Penguin Solutions, and SK Telecom aim to identify and expand new business opportunities in the global AI data center market, positioning themselves at the forefront of AI infrastructure innovation.
"""
]

qwen3_0_6_model_id = "Qwen/Qwen3-0.6B"
qwen3_1_7_model_id = "Qwen/Qwen3-1.7B"
qwen3_4_model_id = "Qwen/Qwen3-4B"
qwen3_8_model_id = "Qwen/Qwen3-8B"
qwen3_30_moe_model_id = "Qwen/Qwen3-30B-A3B"
qwen1_5_moe_model_id = "Qwen/Qwen1.5-MoE-A2.7B"

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)
llm = LLM(
    model=qwen3_4_model_id,
    max_model_len=16 * 128,
    block_size=128,
    enable_chunked_prefill=True,
    max_num_batched_tokens=128,
    max_num_seqs=5,
)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
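The engine arguments in the script above are interrelated: `max_model_len` is written as 16 partitions of `block_size` tokens, and chunked prefill consumes at most `max_num_batched_tokens` per step. A quick arithmetic sanity check of those relationships (plain Python, no vLLM required; the derived names are illustrative, not vLLM API):

```python
# Engine arguments as used in the example script above.
block_size = 128
max_model_len = 16 * block_size   # 2048 tokens, i.e. 16 KV-cache blocks
max_num_batched_tokens = 128      # chunked-prefill step size

# Derived quantities (illustrative names only).
num_partitions = max_model_len // block_size
prefill_steps_for_full_context = max_model_len // max_num_batched_tokens

print(num_partitions, prefill_steps_for_full_context)  # -> 16 16
```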

vllm_rbln/attention/backends/flash_attention.py

Lines changed: 14 additions & 8 deletions

@@ -204,7 +204,10 @@ def get_kv_cache_shape(
         kv_cache_shape = [B, H, 1, S, D]
         query_shape = [1, H, G, L, D]
         """
-        return (2, num_blocks, num_kv_heads, 1, block_size, head_size)
+        # For partition skip, we need a dummy block slot.
+        no_dummy_slots = 1
+        return (2, num_blocks + no_dummy_slots, num_kv_heads, 1, block_size,
+                head_size)

     @staticmethod
     def swap_blocks(
@@ -235,10 +238,8 @@ def __init__(self, input_builder: ModelInputForRebelBuilder) -> None:
         self.chunked_prefill = input_builder.chunked_prefill
         self.chunked_prefill_size = input_builder.chunked_prefill_size
         self.input_builder = input_builder
-        # model max sequence length (cache_config.num_cpu_blocks)
-        self.max_seq_len = 128 * 1024
-        # flash attention partition size (cache_config.block_size)
-        self.partition_len = 1024
+
+        self.partition_len = input_builder.block_size

     def prepare(self):
         self.input_data = self.input_builder.input_data
@@ -262,8 +263,13 @@ def build(
         steps = [[input_positions[0]]
                  for input_positions in input_data.input_positions]
         seq_idx = torch.tensor(steps, dtype=torch.int32)
-        max_seq_len = self.max_seq_len
         partition_len = self.partition_len
+        # The number of blocks (a HW constraint) and max_model_len (a model
+        # constraint) each bound the sequence length; the smaller bound is
+        # selected as max_seq_len.
+        block_length = self.input_builder.runner.cache_config.num_gpu_blocks * \
+            partition_len
+        max_seq_len = min(self.input_builder.max_model_len, block_length)
         num_partition = max_seq_len // partition_len

         batch_size = 1 if input_data.num_prefills else len(steps)
@@ -298,7 +304,7 @@ def build(
             1,
             1,
             prefill_chunk_size,
-            self.max_seq_len,
+            max_seq_len,
             dtype=torch.float32)
         causal_mask = 1 - torch.triu(torch.ones(1, 1, prefill_chunk_size,
                                                 prefill_chunk_size),
@@ -313,7 +319,7 @@ def build(
             1,
             1,
             1,
-            self.max_seq_len,
+            max_seq_len,
             dtype=torch.float32)
         for batch_index, batch_step in enumerate(steps):
             decode_attention_mask[batch_index, :, :, :, :batch_step[0] +
vllm_rbln/worker/model_runner.py

Lines changed: 35 additions & 17 deletions

@@ -165,6 +165,8 @@ def __init__(self,
         self.sliding_window = self.runner.sliding_window
         self.block_size = self.runner.cache_config.block_size
         self.device = self.runner.device
+        self.max_model_len = self.runner.scheduler_config.max_model_len
+
         if self.runner.attn_backend is not None:
             # spec decode (e.g. Medusa) does not have atten backend
             attn_backend = self.runner.attn_backend
@@ -190,7 +192,7 @@ def _prepare_prompt(
         seq_group_metadata_list: List[SequenceGroupMetadata],
     ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
         assert len(seq_group_metadata_list) > 0
-        input_block_ids: List[int] = []
+        list_input_block_ids: List[List[int]] = []

         block_size = self.runner.block_size
         assert (
@@ -212,8 +214,9 @@ def _prepare_prompt(

             assert seq_group_metadata.block_tables is not None
             block_table = seq_group_metadata.block_tables[seq_id]
-            assert len(block_table) == 1
-            input_block_ids.append(block_table[0])
+            assert len(block_table) == math.ceil(seq_data.get_len() /
+                                                 block_size)
+            list_input_block_ids.append(block_table)
             data.input_tokens.append(tokens)
             data.input_positions.append(list(range(computed_len, seq_len)))
             data.num_prefills += 1
@@ -229,6 +232,21 @@ def _prepare_prompt(
         max_seq_len = max(data.seq_lens)
         assert max_seq_len > 0

+        num_partition = self.max_model_len // block_size
+        dummy = self.runner.cache_config.num_gpu_blocks
+        # make_tensor_with_pad takes List[List[int]] as input, so the
+        # per-sequence block tables are padded to num_partition entries
+        # with the dummy block id.
+        input_block_ids = make_tensor_with_pad(list_input_block_ids,
+                                               max_len=num_partition,
+                                               pad=dummy,
+                                               dtype=torch.long,
+                                               device=self.device)
+        # Flatten back into the 1-D tensor the model expects.
+        input_block_ids = input_block_ids.flatten().tolist()
+        input_block_ids = torch.tensor(input_block_ids,
+                                       dtype=torch.long,
+                                       device=self.device)
+
         prefill_size = (self.chunked_prefill_size if self.chunked_prefill else
                         1 << (math.ceil(math.log2(max_seq_len))))
         input_tokens = make_tensor_with_pad(data.input_tokens,
@@ -241,9 +259,6 @@ def _prepare_prompt(
                                             pad=0,
                                             dtype=torch.long,
                                             device=self.device)
-        input_block_ids = torch.tensor(input_block_ids,
-                                       dtype=torch.long,
-                                       device=self.device)

         logger.info("[RBLN] model input builder, prepare_prompt")
         logger.info("\tpadded input_tokens = %s", input_tokens)
@@ -260,7 +275,7 @@ def _prepare_decode(
     ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
         assert len(seq_group_metadata_list) > 0

-        arr_input_block_ids: List[int] = []
+        list_input_block_ids: List[List[int]] = []
         block_size = self.block_size
         for seq_group_metadata in seq_group_metadata_list:
             assert not seq_group_metadata.is_prompt
@@ -275,10 +290,8 @@ def _prepare_decode(
             assert seq_group_metadata.block_tables is not None
             block_table = seq_group_metadata.block_tables[seq_id]
             assert len(block_table) >= 1
-            for i in range(len(block_table)):
-                assert block_table[i] != self.max_num_seqs
-                arr_input_block_ids.append(block_table[i])

+            list_input_block_ids.append(block_table)
             data.max_decode_seq_len = max(data.max_decode_seq_len, seq_len)
             data.input_tokens.append([generation_token])
             data.input_positions.append([token_position])
@@ -291,10 +304,18 @@ def _prepare_decode(
             data.slot_mapping.append(block_offset)

         # batch padding
-        batch_padding_szie = self.max_num_seqs - len(data.input_tokens)
-        data.input_tokens.extend([[0]] * batch_padding_szie)
-        data.input_positions.extend([[0]] * batch_padding_szie)
-        arr_input_block_ids.extend([self.max_num_seqs] * batch_padding_szie)
+        dummy = self.runner.cache_config.num_gpu_blocks
+        batch_padding_size = self.max_num_seqs - len(data.input_tokens)
+        data.input_tokens.extend([[0]] * batch_padding_size)
+        data.input_positions.extend([[0]] * batch_padding_size)
+        list_input_block_ids.extend([[dummy]] * batch_padding_size)
+
+        num_partition = self.max_model_len // block_size
+        input_block_ids = make_tensor_with_pad(list_input_block_ids,
+                                               max_len=num_partition,
+                                               pad=dummy,
+                                               dtype=torch.long,
+                                               device=self.device)

         input_tokens = make_tensor_with_pad(data.input_tokens,
                                             max_len=1,
@@ -306,9 +327,6 @@ def _prepare_decode(
                                             pad=0,
                                             dtype=torch.long,
                                             device=self.device)
-        input_block_ids = torch.tensor(arr_input_block_ids,
-                                       dtype=torch.long,
-                                       device=self.device)

         logger.info("[RBLN] model input builder, prepare_decode")
         logger.info("\tpadded input_tokens = %s", data.input_tokens)
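The padding above relies on vLLM's `make_tensor_with_pad` to turn ragged per-sequence block tables into a fixed `[batch, num_partition]` shape, with the dummy block id filling unused slots. A pure-Python sketch of that padding step (no torch; the helper name is illustrative):

```python
from typing import List

def pad_block_tables(block_tables: List[List[int]], num_partition: int,
                     dummy: int) -> List[List[int]]:
    """Pad each ragged block table to num_partition entries with the
    dummy block id, mirroring the make_tensor_with_pad call above."""
    return [table + [dummy] * (num_partition - len(table))
            for table in block_tables]

# Two sequences with ragged block tables, padded to 4 partitions;
# the dummy id (num_gpu_blocks) maps to the extra KV-cache slot.
print(pad_block_tables([[0, 1], [2]], num_partition=4, dummy=16))
# -> [[0, 1, 16, 16], [2, 16, 16, 16]]
```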

vllm_rbln/worker/utils.py

Lines changed: 112 additions & 0 deletions

@@ -0,0 +1,112 @@
# Copyright 2025 Rebellions Inc. All rights reserved.

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at:

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""RBLN utility functions."""

import math
from typing import Optional


def get_maximum_num_blocks(
    config,
    tensor_parallel_size: int,
    kvcache_block_size: int,
    nbits_per_param: Optional[int] = None,
    n_model_params: Optional[int] = None,
    kernel_size: Optional[int] = None,
    buffer: Optional[int] = None,
    num_runtimes: int = 2,
) -> int:
    # We are finding the max_n_blocks (x) that satisfies:
    #
    #   available_dram - kernel_size - buffer
    #     - num_layers * 2 * tensor_parallel_size
    #       * align_2MB(
    #           x
    #           * block_size
    #           * align_64(head_dim)
    #           * math.ceil(num_key_value_heads / tensor_parallel_size)
    #           * 2
    #       ) > 0
    #
    # This inequality can be rewritten as:
    #
    #   a - c * align_2MB(b * x) > 0
    # where
    #   a = available_dram - kernel_size - buffer
    #   b = block_size
    #       * align_64(head_dim)
    #       * math.ceil(num_key_value_heads / tensor_parallel_size) * 2
    #   c = num_layers * 2 * tensor_parallel_size
    #
    # Dividing both sides by c gives k > align_2MB(b * x), where k = a / c,
    # and solving for the largest integer x yields:
    #
    #   x = floor(2**21 / b * floor((k - 1) / 2**21))

    def align(x: int, nbytes: int) -> int:
        return int(math.ceil(x / nbytes) * nbytes)

    def align_2MB(x: int) -> int:
        return align(x, 2**21)

    num_layers = config.hf_config.num_hidden_layers
    head_dim = config.hf_config.head_dim
    vocab_size = config.hf_config.vocab_size
    hidden_size = config.hf_config.hidden_size
    num_key_value_heads = config.hf_config.num_key_value_heads

    # TODO(jongho): Update if target npu is REBEL.
    ATOM_DRAM_NBYTES = 16 * 2**30
    ATOM_SYS_DRAM_NBYTES = 288 * 2**20
    available_dram = tensor_parallel_size * (ATOM_DRAM_NBYTES -
                                             ATOM_SYS_DRAM_NBYTES)

    if kernel_size is None:
        if n_model_params is None:
            raise ValueError("`n_model_params` should be specified "
                             "to estimate the kernel memory.")
        # Get estimated kernel size (approximated)
        lm_heads_params = align(vocab_size, 64) * hidden_size
        lm_heads_nbytes = (align_2MB(
            lm_heads_params * nbits_per_param // 8 / tensor_parallel_size) *
                           tensor_parallel_size)
        params = n_model_params - lm_heads_params
        layer_nbytes = (align_2MB(params * nbits_per_param // 8 / num_layers /
                                  tensor_parallel_size) * num_layers *
                        tensor_parallel_size)
        kernel_size = layer_nbytes + lm_heads_nbytes
    elif n_model_params is not None:
        raise ValueError(
            "Both `n_model_params` and `kernel_size` cannot be specified.")

    available_dram -= kernel_size

    if buffer is None:
        # TODO: Accurate buffer estimation
        buffer_per_runtime_per_core = 2**28  # 256MB per runtime
        # 1 for prefill, 1 for decode
        buffer_per_core = buffer_per_runtime_per_core * num_runtimes
        buffer = buffer_per_core * tensor_parallel_size
    available_dram -= buffer

    b = kvcache_block_size * align(head_dim, 64) * math.ceil(
        num_key_value_heads / tensor_parallel_size) * 2
    c = num_layers * 2 * tensor_parallel_size
    k = available_dram / c
    max_n_blocks = math.floor(2**21 / b * math.floor((k - 1) / 2**21))

    return max_n_blocks
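The closed form at the end of `get_maximum_num_blocks` can be checked with a standalone sketch that keeps only the final arithmetic. The numbers below are made up for illustration and do not describe a real ATOM configuration; the interpretation of the trailing factor of 2 as the fp16 byte width is an assumption:

```python
import math

def align(x: int, nbytes: int) -> int:
    # Round x up to the next multiple of nbytes.
    return int(math.ceil(x / nbytes) * nbytes)

def max_kv_blocks(available_dram: int, num_layers: int, tp: int,
                  block_size: int, head_dim: int, num_kv_heads: int) -> int:
    """Solve a - c * align_2MB(b * x) > 0 for the largest integer x."""
    # Bytes per block per layer shard (final *2: presumed fp16 byte width).
    b = block_size * align(head_dim, 64) * math.ceil(num_kv_heads / tp) * 2
    # One K and one V tensor per layer, replicated across TP ranks.
    c = num_layers * 2 * tp
    k = available_dram / c
    return math.floor(2**21 / b * math.floor((k - 1) / 2**21))

# Hypothetical 8 GiB budget, 32 layers, TP=1, 1024-token blocks,
# head_dim=128, 8 KV heads:
print(max_kv_blocks(2**33, 32, 1, 1024, 128, 8))  # -> 63
```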
