# Cloud-Edge Speculative Decoding Benchmark for LLM based on KubeEdge-Ianvs
We propose that KubeEdge-Ianvs adopt a cloud-edge speculative decoding benchmark implemented on top of Sedna `JointInference`, with the goal of evaluating and improving LLM inference efficiency in cloud-edge environments.
This proposal adopts Ianvs `jointinference` as the primary integration paradigm and extends it with speculative-decoding support. Under this design, the example provides dedicated `draft` and `verify` modules, while the paradigm layer implements the speculative-decoding workflow and organizes how these modules collaborate for one sample.
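The division of labor described above can be sketched as a plain loop owned by the paradigm layer. All names here (`speculative_rounds`, `draft_module`, `verify_module`) and the tuple returned by the verifier are illustrative assumptions, not the actual Ianvs or Sedna API:

```python
# Hypothetical sketch of the paradigm-level workflow: the paradigm owns the
# loop, while drafting and verification stay inside example-defined modules.

def speculative_rounds(sample, draft_module, verify_module, max_new_tokens=32):
    """Run draft/verify rounds for one sample until enough tokens are emitted."""
    output = []
    while len(output) < max_new_tokens:
        # Edge side: propose a window of candidate tokens for the current context.
        draft_tokens = draft_module(sample, output)
        # Cloud side: accept a prefix of the draft and return one corrected
        # token (or None to signal end of sequence).
        accepted, correction = verify_module(sample, output, draft_tokens)
        output.extend(accepted)
        if correction is None:
            break
        output.append(correction)
    return output[:max_new_tokens]
```

The loop ends either when the verifier signals end of sequence or when the token budget is reached; how drafts and corrections are actually produced stays entirely inside the example modules.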
At the current stage, the proposal focuses on building and validating the speculative-decoding benchmark within the existing in-process execution model. On top of this stable foundation, the subsequent step is to explore more effective cloud-edge speculative-decoding strategies, such as improved draft-window design, acceptance-aware collaboration, and network-sensitive policy optimization.
This proposal differs from existing cloud-edge collaborative inference examples in Ianvs in the following respects:
- **Dedicated speculative-decoding modules**: the proposal introduces dedicated `draft` and `verify` modules, instead of overloading the existing `cloud` and `edge` modules with speculative-decoding-specific semantics.
- **Example-level algorithm freedom**: the framework controls the workflow, while the example retains ownership of how the `draft` and `verify` modules actually invoke models and exchange metadata.
### Overall Architecture
This proposal adopts Ianvs `jointinference` as the integration entry because speculative decoding in cloud-edge scenarios remains a joint-inference problem: one sample enters the controller, the paradigm layer organizes collaboration, and the example layer implements the concrete module behavior.
The design has two clear boundaries:
- **Change of Sedna**: implement `JointInference` as a workflow controller that can support repeated collaboration rounds for one sample.
- **Change of Ianvs**: keep Ianvs `joint_inference` as the controller-side wrapper that loads data, passes modules and configuration to Sedna, and collects benchmark outputs.
Under this design, the paradigm layer owns the workflow, while the example layer owns the actual behavior of the `draft` and `verify` modules.
### Changes in Sedna

The main Sedna change is to implement the **speculative-decoding workflow** in the `JointInference` paradigm layer, so that the paradigm can organize repeated collaboration rounds for one sample.
This design gives the paradigm genuine multi-round inference capability.
At the same time, the proposal keeps the framework boundary clear. The paradigm layer is responsible for coordinating repeated rounds through the speculative-decoding workflow, while the concrete model invocation remains inside the example-defined `draft` and `verify` modules. In this design, Sedna does not need to reinterpret the original `cloud` and `edge` modules for speculative decoding. Instead, the new modules make the speculative-decoding semantics explicit at the example layer.
This design has two advantages:
- it keeps speculative-decoding-specific behavior separate from the existing cloud-edge module semantics;
- it makes the implementation easier for users to understand, because drafting and verification are represented by dedicated modules, while workflow control is clearly placed in the paradigm layer.
### Changes in Ianvs
On the Ianvs side, `joint_inference` continues to play the role of the controller-side wrapper. Ianvs loads the dataset, applies the optional dataset processor, builds the Sedna `JointInference` job, iterates over samples, and collects benchmark results. In this proposal, the Ianvs core flow can be reused as it is and does not require dedicated core-code modification for speculative decoding.
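The controller-side flow described above can be summarized in a few lines. This is only an illustrative sketch; `run_benchmark`, `build_job`, and the `infer` method are hypothetical stand-ins for the actual Ianvs controller code and Sedna job object:

```python
# Hypothetical controller-side flow on the Ianvs side: build the Sedna
# JointInference job once, iterate over samples, then aggregate metrics.

def run_benchmark(dataset, build_job, metrics):
    job = build_job()  # wraps Sedna JointInference construction
    results = [job.infer(sample) for sample in dataset]
    # Each metric is a callable over the collected per-sample results.
    return {name: fn(results) for name, fn in metrics.items()}
```

The point of the sketch is that nothing in this flow is speculative-decoding-specific, which is why the Ianvs core can be reused without modification.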
Accordingly, the Ianvs-side work is mainly concentrated in the example layer:
- implement the `draft` module that performs speculative draft generation;
- implement the `verify` module that performs cloud-side verification and correction;
- provide dataset processing that maps benchmark samples into the request format expected by the example;
- provide benchmark metrics and result parsing for latency, throughput, acceptance rate, and related outputs.
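The acceptance-rate and throughput metrics in the last item are simple aggregations. The sketch below assumes a hypothetical per-round and per-sample record format (`drafted`, `accepted`, `tokens`, `latency_s`); the real example would define its own result schema:

```python
# Illustrative metric helpers for the speculative-decoding benchmark.
# The record field names are assumptions, not an existing Ianvs schema.

def acceptance_rate(rounds):
    """Fraction of drafted tokens that the verifier accepted, over all rounds."""
    drafted = sum(r["drafted"] for r in rounds)
    accepted = sum(r["accepted"] for r in rounds)
    return accepted / drafted if drafted else 0.0

def throughput(samples):
    """Generated tokens per second over all benchmark samples."""
    tokens = sum(s["tokens"] for s in samples)
    seconds = sum(s["latency_s"] for s in samples)
    return tokens / seconds if seconds else 0.0
```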
### Speculative-Decoding Modules
Based on the current design, the proposal introduces two dedicated example-side modules for speculative decoding, while the workflow itself is implemented in the paradigm layer.
#### draft
The `draft` module is responsible for speculative draft generation. It encapsulates how the drafting model is loaded, how inputs are interpreted, and how draft tokens and related metadata are produced for the workflow. This keeps drafting logic separate from the semantics of the original `edge` module.
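A minimal drafting sketch, assuming the draft model is exposed as a next-token callable; the class name, constructor arguments, and `propose` method are hypothetical, not the actual module interface:

```python
# Hypothetical draft-module sketch: greedy generation of a fixed-size
# window of candidate tokens. A real example would wrap an actual edge model.

class DraftModule:
    def __init__(self, draft_model, window_size=4):
        self.draft_model = draft_model  # callable: token context -> next token
        self.window_size = window_size

    def propose(self, context):
        """Extend the context by up to window_size greedily drafted tokens."""
        tokens, ctx = [], list(context)
        for _ in range(self.window_size):
            token = self.draft_model(ctx)
            tokens.append(token)
            ctx.append(token)
        return tokens
```

The draft window size shown here is one of the tunable knobs mentioned earlier under improved draft-window design.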
#### verify

The `verify` module is responsible for verification and correction. It encapsulates how the verification model is invoked, how draft tokens are checked, and how accepted or corrected outputs are returned to the workflow. This keeps verification logic separate from the semantics of the original `cloud` module.
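A minimal verification sketch, assuming the cloud model is exposed as a next-token callable; the function name `verify_draft` is hypothetical. The longest-matching-prefix rule below is the standard greedy form of speculative-decoding acceptance:

```python
# Hypothetical verify-module sketch: accept the longest prefix of the draft
# that the cloud (target) model agrees with, then emit one verified token.

def verify_draft(target_model, context, draft_tokens):
    """Return (accepted_tokens, correction_token) for one draft window."""
    accepted = []
    ctx = list(context)
    for token in draft_tokens:
        expected = target_model(ctx)  # cloud model's own next-token choice
        if token != expected:
            # First mismatch: discard the rest of the draft and correct it.
            return accepted, expected
        accepted.append(token)
        ctx.append(token)
    # Every draft token was accepted; the target still emits one more token.
    return accepted, target_model(ctx)
```

Because each round yields at least one verified token (the correction), the workflow always makes progress even when the draft is fully rejected.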
From a module-boundary perspective, the current design can be summarized as follows:
- **Original modules**: Ianvs `TestEnvManager`, `TestCaseController`, and `StoryManager`; the existing Ianvs controller-side benchmark flow; the existing Sedna `JointInference` skeleton.
- **New or extended modules**: paradigm-level speculative-decoding workflow support in Sedna `JointInference`; example-defined `draft` and `verify` modules.
- **Impact on other examples**: the original `cloud` and `edge` modules do not need to be reinterpreted for speculative decoding. Existing examples can continue to use their original modules and semantics. The speculative-decoding benchmark is isolated through the new modules, which reduces confusion for users and limits compatibility impact on unrelated examples.