Commit 1bef648
docs: refine speculative decoding proposal design
Signed-off-by: Juntao Zhang <juntaozhang22@m.fudan.edu.cn>
1 parent c9bcec9

5 files changed, 27 additions and 80 deletions

File changed:

docs/proposals/algorithms/joint-inference/cloud-edge-benchmark-for-llm-speculative-decoding.md
- [Overall Architecture](#overall-architecture)
- [Changes in Sedna](#changes-in-sedna)
- [Changes in Ianvs](#changes-in-ianvs)
- [Speculative-Decoding Modules](#speculative-decoding-modules)
- [Benchmark Construction](#benchmark-construction)
- [Algorithm Exploration](#algorithm-exploration)
- [Roadmap](#roadmap)

# Cloud-Edge Speculative Decoding Benchmark for LLM based on KubeEdge-Ianvs

We propose that KubeEdge-Ianvs adopt a cloud-edge speculative decoding benchmark implemented on top of Sedna `JointInference`, with the goal of evaluating and improving LLM inference efficiency in cloud-edge environments.

This proposal adopts Ianvs `jointinference` as the primary integration paradigm and extends it with speculative-decoding support. Under this design, the example provides dedicated `draft` and `verify` modules, while the paradigm layer implements the speculative-decoding workflow and organizes how these modules collaborate for one sample.

At the current stage, the proposal focuses on building and validating the speculative-decoding benchmark within the existing in-process execution model. On top of this stable foundation, the subsequent step is to explore more effective cloud-edge speculative-decoding strategies, such as improved draft-window design, acceptance-aware collaboration, and network-sensitive policy optimization.

This proposal differs from existing cloud-edge collaborative inference examples in Ianvs in the following respects:

- **Paradigm-level workflow extension**: the proposal implements the speculative-decoding workflow in the `JointInference` paradigm layer, so that the framework can organize repeated collaboration rounds for one sample.
- **Dedicated speculative-decoding modules**: the proposal introduces dedicated `draft` and `verify` modules, instead of overloading the existing `cloud` and `edge` modules with speculative-decoding-specific semantics.
- **Example-level algorithm freedom**: the framework controls the workflow, while the example retains ownership of how the `draft` and `verify` modules actually invoke models and exchange metadata.

### Overall Architecture

This proposal adopts Ianvs `jointinference` as the integration entry because speculative decoding in cloud-edge scenarios remains a joint-inference problem: one sample enters the controller, the paradigm layer organizes collaboration, and the example layer implements the concrete module behavior.

The design has two clear boundaries:

- **Change in Sedna**: implement `JointInference` as a workflow controller that can support repeated collaboration rounds for one sample.
- **Change in Ianvs**: keep Ianvs `joint_inference` as the controller-side wrapper that loads data, passes modules and configuration to Sedna, and collects benchmark outputs.

Under this design, the paradigm layer owns the workflow, while the example layer owns the actual behavior of the `draft` and `verify` modules.

The current architecture is illustrated below:

<img src="./images/SD.png" alt="New Architecture" width="80%">
### Changes in Sedna
The main Sedna change is to implement the **speculative-decoding workflow** in the `JointInference` paradigm layer.

This design gives the paradigm genuine multi-round inference capability.

At the same time, the proposal keeps the framework boundary clear. The paradigm layer is responsible for coordinating repeated rounds through the speculative-decoding workflow, while the concrete model invocation remains inside the example-defined `draft` and `verify` modules. In this design, Sedna does not need to reinterpret the original `cloud` and `edge` modules for speculative decoding. Instead, the new modules make the speculative-decoding semantics explicit at the example layer.
This design has two advantages:

- it keeps speculative-decoding-specific behavior separate from the existing cloud-edge module semantics;
- it makes the implementation easier for users to understand, because drafting and verification are represented by dedicated modules, while workflow control is clearly placed in the paradigm layer.
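The round-based collaboration that the paradigm layer coordinates can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the Sedna API: the function names, the `draft_window` parameter, and the "return `None` at end of sequence" convention are all inventions of this sketch.

```python
# Hypothetical sketch of the paradigm-level speculative-decoding loop.
# draft_fn and verify_fn stand in for the example-defined draft/verify
# modules; their names and signatures are illustrative only.

def speculative_decoding_workflow(sample, draft_fn, verify_fn,
                                  max_new_tokens=64, draft_window=4):
    """Run repeated draft-and-verify rounds for one sample."""
    output = []
    while len(output) < max_new_tokens:
        # The draft module proposes a short window of candidate tokens.
        draft = draft_fn(sample, output, draft_window)
        # The verify module accepts a (possibly empty) prefix of the draft
        # and appends exactly one token of its own, so every round advances.
        accepted, next_token = verify_fn(sample, output, draft)
        output.extend(accepted)
        output.append(next_token)
        if next_token is None:  # hypothetical end-of-sequence signal
            output.pop()
            break
    return output[:max_new_tokens]
```

Because the verifier always contributes one token per round, the loop is guaranteed to make progress even when every drafted token is rejected.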

### Changes in Ianvs

On the Ianvs side, `joint_inference` continues to play the role of the controller-side wrapper. Ianvs loads the dataset, applies the optional dataset processor, builds the Sedna `JointInference` job, iterates over samples, and collects benchmark results. In this proposal, the Ianvs core flow can be reused as-is and does not require dedicated core-code modification for speculative decoding.

Accordingly, the Ianvs-side work is mainly concentrated in the example layer:

- implement the `draft` module that performs speculative draft generation;
- implement the `verify` module that performs cloud-side verification and correction;
- provide dataset processing that maps benchmark samples into the request format expected by the example;
- provide benchmark metrics and result parsing for latency, throughput, acceptance rate, and related outputs.
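As a concrete illustration of the metric side, `acceptance_rate.py` could expose a plain metric function along these lines. The result-field names (`accepted_tokens`, `draft_tokens`) are assumptions of this sketch, not a fixed framework schema:

```python
# Hypothetical sketch of acceptance_rate.py. Field names in the per-sample
# result dictionaries are illustrative; the example defines its own schema.

def acceptance_rate(y_true, y_pred):
    """Fraction of drafted tokens that the verifier accepted, over all samples."""
    accepted = 0
    drafted = 0
    for result in y_pred:
        accepted += result.get("accepted_tokens", 0)
        drafted += result.get("draft_tokens", 0)
    return accepted / drafted if drafted else 0.0
```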

The example is organized as follows (entries elided in the original listing are marked with `...`):

```
examples/cloud-edge-speculative-decoding-benchmark
├── README.md
├── testalgorithms
│   └── speculative-decoding
│       ├── data_processor.py
│       ├── draft.py
│       ├── verify.py
│       └── test_speculative_decoding.yaml
└── testenv
    ├── acceptance_rate.py
    ├── ...
    └── time_to_first_token.py
```
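Within this layout, `test_speculative_decoding.yaml` could look roughly as follows. This is an illustrative sketch only: the `draft` and `verify` module types are the ones proposed here, while the remaining keys and values merely imitate the shape of existing Ianvs `jointinference` example configurations and are assumptions.

```yaml
algorithm:
  # Illustrative sketch; keys below are assumptions, not a fixed schema.
  paradigm_type: "jointinference"
  modules:
    - type: "draft"
      name: "DraftModel"
      url: "./testalgorithms/speculative-decoding/draft.py"
      hyperparameters:
        - draft_window:
            values: [4]    # tokens proposed per round (illustrative)
    - type: "verify"
      name: "VerifyModel"
      url: "./testalgorithms/speculative-decoding/verify.py"
```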

### Speculative-Decoding Modules

Based on the current design, the proposal introduces two dedicated example-side modules for speculative decoding, while the workflow itself is implemented in the paradigm layer.

#### draft

The `draft` module is responsible for speculative draft generation. It encapsulates how the drafting model is loaded, how inputs are interpreted, and how draft tokens and related metadata are produced for the workflow. This keeps drafting logic separate from the semantics of the original `edge` module.

#### verify

The `verify` module is responsible for verification and correction. It encapsulates how the verification model is invoked, how draft tokens are checked, and how accepted or corrected outputs are returned to the workflow. This keeps verification logic separate from the semantics of the original `cloud` module.

From a module-boundary perspective, the current design can be summarized as follows:

- **Original modules**: Ianvs `TestEnvManager`, `TestCaseController`, and `StoryManager`; the existing Ianvs controller-side benchmark flow; the existing Sedna `JointInference` skeleton.
- **New or extended modules**: paradigm-level speculative-decoding workflow support in Sedna `JointInference`; example-defined `draft` and `verify` modules.
- **Impact on other examples**: the original `cloud` and `edge` modules do not need to be reinterpreted for speculative decoding. Existing examples can continue to use their original modules and semantics. The speculative-decoding benchmark is isolated through the new modules, which reduces confusion for users and limits compatibility impact on unrelated examples.
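The division of responsibility between the two modules can be sketched as follows. The class names, method signatures, and return shapes are illustrative assumptions, not the example's actual contract; only the draft/verify split mirrors the proposal. For simplicity the verifier below re-runs the target model token by token, whereas a real verifier would typically score the whole draft window in one forward pass.

```python
# Hypothetical shapes for draft.py and verify.py under this proposal.

class DraftModel:
    """Edge-side drafter: proposes a short window of candidate tokens."""

    def __init__(self, model, draft_window=4):
        self.model = model          # any callable next-token generator (assumed)
        self.draft_window = draft_window

    def draft(self, prompt, prefix):
        # Greedily roll the drafting model forward for draft_window steps.
        tokens = []
        context = list(prefix)
        for _ in range(self.draft_window):
            token = self.model(prompt, context)
            if token is None:       # drafter signals end of sequence
                break
            tokens.append(token)
            context.append(token)
        return tokens


class VerifyModel:
    """Cloud-side verifier: accepts a prefix of the draft, then corrects."""

    def __init__(self, model):
        self.model = model          # the larger target model (assumed callable)

    def verify(self, prompt, prefix, draft_tokens):
        accepted = []
        context = list(prefix)
        for token in draft_tokens:
            expected = self.model(prompt, context)
            if expected != token:
                # Reject the rest of the draft and return the correction.
                return accepted, expected
            accepted.append(token)
            context.append(token)
        # Whole draft accepted; emit one extra verified token.
        return accepted, self.model(prompt, context)
```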
### Benchmark Construction
