vllm-project
diff --git a/docs/source/community/governance.md b/docs/source/community/governance.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md b/docs/source/developer_guide/Design_Documents/ModelRunner_prepare_inputs.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/developer_guide/Design_Documents/disaggregated_prefill.md b/docs/source/developer_guide/Design_Documents/disaggregated_prefill.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/developer_guide/Design_Documents/eplb_swift_balancer.md b/docs/source/developer_guide/Design_Documents/eplb_swift_balancer.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/source/developer_guide/contribution/testing.md b/docs/source/developer_guide/contribution/testing.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/developer_guide/evaluation/using_lm_eval.md b/docs/source/developer_guide/evaluation/using_lm_eval.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/developer_guide/performance_and_debug/performance_benchmark.md b/docs/source/developer_guide/performance_and_debug/performance_benchmark.md
Lines changed: 2 additions & 2 deletions
diff --git a/docs/source/installation.md b/docs/source/installation.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/quick_start.md b/docs/source/quick_start.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md b/docs/source/tutorials/features/long_sequence_context_parallel_single_node.md
Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@ vLLM Ascend is an open-source project under the vLLM community, where the author
 
 - Contributor:
 
-    **Responsibility:** Help new contributors onboarding, handle and respond to community questions, review RFCs and code.
+    **Responsibility:** Help new contributors with onboarding, handle and respond to community questions, review RFCs and code.
 
     **Requirements:** Complete at least 1 contribution. A contributor is someone who consistently and actively participates in a project, including but not limited to issue/review/commits/community involvement.
 
 
@@ -33,7 +33,7 @@ The workflow of obtaining inputs:
 
 3. Get `Token IDs`: using token indices to retrieve the Token IDs from **token id table**.
 
-At last, these `Token IDs` are required to be fed into a model, and `positions` should also be sent into the model to create `Rope` (Rotary positional embedding). Both of them are the inputs of the model.
+At last, these `Token IDs` are required to be fed into a model, and `positions` should also be sent into the model to create `RoPE` (Rotary positional embedding). Both of them are the inputs of the model.
 
 **Note**: The `Token IDs` are the inputs of a model, so we also call them `Input IDs`.
 
 
@@ -100,6 +100,6 @@ Under non-symmetric PD scenarios, validate the P-to-D tp ratio against expected
 
 ## Limitations
 
-- Heterogeneous P and D nodes are not supported—for example, running P nodes on A2 and D nodes on A3.
+- Heterogeneous P and D nodes are not supported, for example, running P nodes on A2 and D nodes on A3.
 
 - In non-symmetric TP configurations, only cases where the P nodes have a higher TP degree than the D nodes and the P TP count is an integer multiple of the D TP count are supported (i.e., P_tp > D_tp and P_tp % D_tp = 0).
@@ -184,10 +184,10 @@ All integer input parameters must explicitly specify their maximum and minimum v
             raise TypeError(f"The {iterations} is not int.")
         if iterations <= 0:
             raise ValueError(
-                f"The {iterations} can not less than or equal to 0.")
+                f"The {iterations} can not be less than or equal to 0.")
         if iterations > sys.maxsize:
             raise ValueError(
-                f"The {iterations} can not large than {sys.maxsize}")
+                f"The {iterations} can not be larger than {sys.maxsize}")
 ```
 
 #### File Path
@@ -207,7 +207,7 @@ The file path for EPLB must be checked for legality, such as whether the file pa
         if ext.lower() != ".json":
             raise TypeError("The expert_map is not json.")
         if not os.path.exists(expert_map):
-            raise ValueError("The expert_map is not exist.")
+            raise ValueError("The expert_map does not exist.")
         try:
             with open(expert_map, "w", encoding='utf-8') as f:
                 f.read()
 
@@ -210,7 +210,7 @@ pytest -sv tests/ut/test_ascend_config.py
 ### E2E test
 
 Although vllm-ascend CI provides E2E tests on Ascend CI (for example,
-[schedule_nightly_test_a2.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a2.yaml), [schedule_nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml), [pr_test_full.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/pr_test_full.yaml)), you can run them locally.
+[schedule_nightly_test_a2.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a2.yaml), [schedule_nightly_test_a3.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/schedule_nightly_test_a3.yaml), [pr_test.yaml](https://github.com/vllm-project/vllm-ascend/blob/main/.github/workflows/pr_test.yaml)), you can run them locally.
 
 #### PR-triggered E2E test
 
 
@@ -139,7 +139,7 @@ lm_eval \
 After 30 minutes, the output is as shown below:
 
 ```shell
-The markdown format results is as below:
+The results in Markdown format are as follows:
 
 |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
 |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
 
@@ -165,7 +165,7 @@ Total num prompt tokens:  1280
 Total num output tokens:  1280
 ```
 
-#### 3.2.4 Multi-Modal Benchmark
+#### 3.2.3 Multi-Modal Benchmark
 
 ```shell
 export VLLM_USE_MODELSCOPE=True
@@ -214,7 +214,7 @@ P99 ITL (ms):                            182.28
 ==================================================
 ```
 
-#### 3.2.5 Embedding Benchmark
+#### 3.2.4 Embedding Benchmark
 
 ```shell
 vllm serve Qwen/Qwen3-Embedding-8B --trust-remote-code
 
@@ -367,7 +367,7 @@ Prompt: 'The capital of France is', Generated text: ' a city. What is the capita
 Prompt: 'The future of AI is', Generated text: ' a topic that is being discussed in various contexts. In the business world, AI'
 ```
 
-This section shows process exits after offline inference, and is does not affect actual inference:
+This section shows process exits after offline inference, and does not affect actual inference:
 
 ```bash
 (EngineCore pid=970) INFO 05-12 11:36:00 [core.py:1201] Shutdown initiated (timeout=0)
 
@@ -180,7 +180,7 @@ Prompt: 'The capital of France is', Generated text: ' a city. What is the capita
 Prompt: 'The future of AI is', Generated text: ' a topic that is being discussed in various contexts. In the business world, AI'
 ```
 
-This section shows process exits after offline inference, and is does not affect actual inference:
+This section shows process exits after offline inference, and does not affect actual inference:
 
 ```bash
 (EngineCore pid=970) INFO 05-12 11:36:00 [core.py:1201] Shutdown initiated (timeout=0)
 
@@ -119,7 +119,7 @@ The parameters are explained as follows:
     - (2) Decode requests are prioritized for scheduling, and prefill requests are scheduled only if there is available capacity.
     - Generally, if `--max-num-batched-tokens` is set to a larger value, the overall latency will be lower, but the pressure on GPU memory (activation value usage) will be greater.
 - `--gpu-memory-utilization` represents the proportion of HBM that vLLM will use for actual inference. Its essential function is to calculate the available kv_cache size. During the warm-up phase (referred to as profile run in vLLM), vLLM records the peak GPU memory usage during an inference process with an input size of `--max-num-batched-tokens`. The available kv_cache size is then calculated as: `--gpu-memory-utilization` * HBM size - peak GPU memory usage. Therefore, the larger the value of `--gpu-memory-utilization`, the more kv_cache can be used. However, since the GPU memory usage during the warm-up phase may differ from that during actual inference (e.g., due to uneven EP load), setting `--gpu-memory-utilization` too high may lead to OOM (Out of Memory) issues during actual inference. The default value is `0.9`.
-- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of ETP and EP; that is, MoE can either use pure EP or pure TP.
+- `--enable-expert-parallel` indicates that EP is enabled. Note that vLLM does not support a mixed approach of EP and TP; that is, MoE can either use pure EP or pure TP.
 - `--no-enable-prefix-caching` indicates that prefix caching is disabled. To enable it, remove this option.
 - `--quantization` "ascend" indicates that quantization is used. To disable quantization, remove this option.
 - `--compilation-config` contains configurations related to the aclgraph graph mode. The most significant configurations are "cudagraph_mode" and "cudagraph_capture_sizes", which have the following meanings: