Commit 5d57b72

Joint Inference: Query-Routing Algorithm Added to algorithms section of ianvs documentation
Signed-off-by: Aryan <nandaaryan823@gmail.com>
1 parent cad7fc7 commit 5d57b72

4 files changed: 172 additions & 1 deletion
docs/index.rst

Lines changed: 2 additions & 1 deletion
```diff
@@ -59,7 +59,8 @@ Start your journey on Ianvs with the following links:
    Single Task Learning: FPN <proposals/algorithms/single-task-learning/fpn>
    Incremental Learning: BasicIL-FPN <proposals/algorithms/incremental-learning/basicIL-fpn>
    Lifelong Learning: RFNet <proposals/algorithms/lifelong-learning/Cloud-Robotic AI Benchmarking for Edge-cloud Collaborative Lifelong Learning>
+   Joint Inference : Query-Routing <proposals/algorithms/joint-inference/query-routing>

 .. toctree::
    :maxdepth: 1
    :titlesonly:
```
Lines changed: 170 additions & 0 deletions
# Joint Inference: Query Routing

## Motivation

Joint inference is a computational paradigm in machine learning where multiple systems, models, or devices collaborate to perform inference tasks. Unlike traditional inference, where a single model or system processes an input to produce an output, joint inference distributes the workload across different entities—such as edge devices, cloud servers, or heterogeneous models—to optimize key performance metrics like latency, accuracy, privacy, energy efficiency, and cost. This approach is particularly relevant for large language models (LLMs) with billions or trillions of parameters, which are resource-intensive and challenging to deploy on a single system.

### Key Characteristics

- **Distributed Workload**: Inference tasks are partitioned, coordinated, or shared among multiple systems based on their capabilities, network conditions, or task requirements.
- **Collaboration**: Systems work together, often in real-time, to produce a unified output. For example, one system might preprocess data, while another performs heavy computation.
- **Optimization Goals**: Joint inference aims to balance trade-offs, such as reducing latency while maintaining accuracy, ensuring privacy, or minimizing computational costs.
- **Flexibility**: It supports various architectures, including cloud-edge setups, multi-model ensembles, or federated systems.

### Example Scenarios

- **Edge-Cloud Collaboration**: An edge device processes simple queries locally, while complex queries are offloaded to the cloud.
- **Multi-Model Ensemble**: A small model generates initial predictions, which a larger model refines.
- **Federated Inference**: Multiple edge devices process local data and share intermediate results with a central server for final inference.

**Note**: Here, we will specifically focus on the `Cloud-Edge Collaborative Inference Scenario for LLMs`.

## Cloud-Edge Collaborative Inference for LLMs

Cloud-edge collaborative inference is a specific implementation of joint inference where inference tasks for LLMs are distributed between edge devices (e.g., mobile phones, PCs, IoT devices, or base stations) and cloud servers. This approach leverages the low-latency and privacy-preserving nature of edge computing alongside the high computational power of cloud infrastructure to optimize LLM deployment.

### Why do LLMs need cloud-edge collaborative inference?

Currently, LLMs have billions or even trillions of parameters, requiring massive computing power for training and deployment. They are therefore often deployed in cloud computing centers and served via APIs. However, this service paradigm has several drawbacks:

- Time to First Token (TTFT) is quite long, due to transmission delays caused by the distance to the data center.
- Uploading user data to the cloud may introduce additional privacy and retraining risks.
- Calling the APIs of the most advanced models (e.g., GPT-4o) is often very expensive.
- Not all tasks require high-performance models to complete.

These issues can be addressed by introducing edge computing, an architecture characterized by low latency, privacy preservation, and energy efficiency.

By deploying small-scale LLMs on edge devices such as mobile phones, PCs, and communication base stations, users can enjoy low-latency, privacy-preserving services. Empirically, models with fewer than 3B parameters can be deployed on such edge devices. However, due to the Scaling Law, smaller models perform worse than larger models, so they can only maintain good performance on certain tasks.

Thus, smaller models on the edge should collaborate with larger models on the cloud to achieve good performance on the remaining tasks.

### Possible Collaborative Inference Strategies

There are several cloud-edge collaborative inference strategies, including:

- **Query Routing**: route each query to the smaller-scale model on the edge or the larger-scale model on the cloud based on its difficulty.
- **Speculative Decoding**: during decoding, a smaller-scale model quickly drafts multiple future tokens, which are then validated in parallel by a larger-scale model; if validation fails, the larger-scale model regenerates them.

**Note**: Here, we will specifically focus on the `Query-Routing Inference Strategy`.

### Query-Routing

The principle of the *query routing* strategy is shown in the following figure.

![](./images/query-routing.png)

The core of this strategy is designing a routing model *R* that can predict the performance of a user request *q* on both the large cloud model *L* and the small edge model *S*. The router takes request *q* as input and outputs one of two labels, 0 or 1, representing a preference for either the edge or the cloud.

This is essentially a text classification task, which can be accomplished by training an NLP model. Training such a model involves considerations such as selecting models, datasets, loss functions, and evaluation metrics.

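The binary decision described above can be sketched with a toy stand-in for *R*. This is a hypothetical heuristic for illustration only: the function name `route_query` and the keyword list are not part of Ianvs, and a real router would be a trained classifier (e.g., BERT) rather than hand-written rules.

```python
# Toy stand-in for the routing model R: map a request q to a label,
# 0 (edge) or 1 (cloud). A real R would be a trained text classifier.

EDGE, CLOUD = 0, 1  # the two labels R can output

def route_query(q: str, hard_keywords=("prove", "derive", "multi-step")) -> int:
    """Flag a request as 'hard' (route to cloud) if it is long or mentions
    reasoning-heavy keywords; otherwise keep it on the edge."""
    if len(q.split()) > 64:
        return CLOUD
    if any(k in q.lower() for k in hard_keywords):
        return CLOUD
    return EDGE

print(route_query("Translate 'hello' to French"))       # easy -> 0 (edge)
print(route_query("Prove that sqrt(2) is irrational"))  # hard -> 1 (cloud)
```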
Query Routing is thus a very useful cloud-edge collaboration strategy, based on two facts:

- Calling top-tier large language models is expensive: for GPT-4o, the pricing is \$5.00 / 1M input tokens and \$15.00 / 1M output tokens.

- Not all tasks require calling top-tier models: for tasks like translation, organization, summarization, data formatting, and casual conversation, small models with 3B parameters or fewer can achieve satisfactory results.

These two facts suggest that if we can call different models based on the difficulty of the task, it will help avoid unnecessary API calls and thus reduce costs. Additionally, if edge device performance is sufficient, locally deployed small models can also deliver excellent latency and throughput, further enhancing the user experience.

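A back-of-envelope calculation makes the cost argument concrete. The token prices are the GPT-4o figures quoted above; the per-query token counts and the 70% edge-hit rate are illustrative assumptions, not measured numbers.

```python
# Rough API-cost estimate for an all-cloud deployment vs. query routing,
# using the GPT-4o pricing quoted above. Token counts and the 70% edge-hit
# rate are assumptions for illustration.

PRICE_IN = 5.00 / 1_000_000    # $ per input token
PRICE_OUT = 15.00 / 1_000_000  # $ per output token

def cloud_cost(n_queries, in_tok=500, out_tok=200):
    """API cost if the given number of queries all go to the cloud model."""
    return n_queries * (in_tok * PRICE_IN + out_tok * PRICE_OUT)

all_cloud = cloud_cost(100_000)                 # every query hits the API
with_router = cloud_cost(100_000 * (1 - 0.7))   # 70% served by edge at ~$0
print(f"all-cloud: ${all_cloud:.2f}, with router: ${with_router:.2f}")
```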
### Details of Design

The overall design is shown in the figure below.

![image-20240926143857223](./images/design-details.png)

When Ianvs starts the benchmarking job, the Test Env Manager will first pass the data of the user-specified dataset to the Test Case Controller for Joint Inference, one sample at a time.

Joint Inference supports multiple modes, including `mining-then-inference`, `inference-then-mining`, and `self-design`. Among them, `mining-then-inference` is suitable for LLM scenarios, `inference-then-mining` is suitable for CV scenarios, and `self-design` allows us to implement more complex collaborative inference strategies on our own.

In this example, we rely on Ianvs' Joint Inference Paradigm, using the `inference-then-mining` mode to implement a Query Routing strategy. First, we call our custom Hard Example Mining module to determine whether a query is a hard example. If it is, we call the inference interface of the Cloud Model to complete the inference; if not, we call the inference interface of the Edge Model to complete it.

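The control flow just described can be sketched as follows. The `edge_infer`, `cloud_infer`, and `is_hard_example` functions are mocks invented for this sketch; the real interfaces are those of the Edge Model, Cloud Model, and Hard Example Mining modules in the Ianvs example.

```python
# Mock sketch of query routing inside the Joint Inference paradigm:
# mine for hard examples, then dispatch to cloud (hard) or edge (easy).
# All names here are illustrative, not the actual Ianvs interfaces.

def edge_infer(q: str) -> str:
    return f"edge-answer({q})"

def cloud_infer(q: str) -> str:
    return f"cloud-answer({q})"

def is_hard_example(q: str) -> bool:
    """Stand-in for the Hard Example Mining module; the real module
    would run a router model such as a BERT classifier."""
    return len(q.split()) > 8

def joint_inference(q: str) -> str:
    # Hard queries go to the cloud model, easy ones stay on the edge.
    return cloud_infer(q) if is_hard_example(q) else edge_infer(q)

print(joint_inference("What is 2 + 2?"))
```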
After all tests are completed, the Test Env Manager will calculate the relevant metrics based on the selected Metrics and hand the results over to the Story Manager, which prints the test reports and generates the Leaderboard.

### Implementation

##### EdgeModel Configuration

The `EdgeModel` is the model that will be deployed on the local machine, supporting `huggingface` and `vllm` as serving backends.

For `EdgeModel`, the open parameters are:

| Parameter Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | str | model name | Qwen/Qwen2-1.5B-Instruct |
| backend | str | model serving framework | huggingface |
| temperature | float | sampling temperature to use, between 0 and 2 | 0.8 |
| top_p | float | nucleus sampling parameter | 0.8 |
| max_tokens | int | maximum number of tokens that can be generated in the chat completion | 512 |
| repetition_penalty | float | parameter for repetition penalty | 1.05 |
| tensor_parallel_size | int | size of tensor parallelism (used for vLLM) | 1 |
| gpu_memory_utilization | float | percentage of GPU memory utilization (used for vLLM) | 0.9 |

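As a rough sketch, an `EdgeModel` configuration with the parameters above might look like the following YAML fragment. The key names and nesting here are illustrative assumptions; consult the example's actual configuration files for the exact schema.

```yaml
# Illustrative EdgeModel config fragment (key layout is an assumption;
# see the example's yaml files for the real schema).
edgemodel:
  model: Qwen/Qwen2-1.5B-Instruct
  backend: vllm                 # or: huggingface
  temperature: 0.8
  top_p: 0.8
  max_tokens: 512
  repetition_penalty: 1.05
  tensor_parallel_size: 1       # vLLM only
  gpu_memory_utilization: 0.9   # vLLM only
```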
##### CloudModel Configuration

The `CloudModel` represents the model on the cloud; it calls an LLM API in the OpenAI API format.

For `CloudModel`, the open parameters are:

| Parameter Name | Type | Description | Default |
| --- | --- | --- | --- |
| model | str | model name | gpt-4o-mini |
| temperature | float | sampling temperature to use, between 0 and 2 | 0.8 |
| top_p | float | nucleus sampling parameter | 0.8 |
| max_tokens | int | maximum number of tokens that can be generated in the chat completion | 512 |
| repetition_penalty | float | parameter for repetition penalty | 1.05 |

##### Router Configuration

The Router is the component that routes each query to the edge or the cloud model. It is configured via `hard_example_mining` in `examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/test_queryrouting.yaml`.

Currently, the supported routers include:

| Router Type | Description | Parameters |
| --- | --- | --- |
| EdgeOnly | Route all queries to the edge model. | - |
| CloudOnly | Route all queries to the cloud model. | - |
| OracleRouter | Optimal router, used as an upper bound. | - |
| BERTRouter | Use a BERT classifier to route the query to the edge or cloud model. | model, threshold |
| RandomRouter | Route the query to the edge or cloud model randomly. | threshold |

We can modify the `router` parameter in `test_queryrouting.yaml` to select the router we want to use.

For the BERT router, we can use [routellm/bert](https://huggingface.co/routellm/bert), [routellm/bert_mmlu_augmented](https://huggingface.co/routellm/bert_mmlu_augmented), or our own BERT model.

### Customize algorithm

Sedna provides a module called `class_factory.py` in its `common` package, with which only a few lines of changes are required to turn your code into a module of Sedna.

Two classes are defined in `class_factory.py`, namely `ClassType` and `ClassFactory`.

`ClassFactory` can register the modules you want to reuse through decorators. For example, in the following code example, we have customized a **Joint Inference: Query Routing** algorithm; we only need to add a `ClassFactory.register(ClassType.HEM, alias=...)` decorator to complete the registration.

The following code just shows the overall structure of the routing strategies, not the complete version. The complete code can be found [here](https://github.com/kubeedge/ianvs/blob/main/examples/cloud-edge-collaborative-inference-for-llm/testalgorithms/query-routing/hard_sample_mining.py).

```python
# Skeleton only; BaseFilter and the full routing logic are defined in
# hard_sample_mining.py, linked above.
import abc

from sedna.common.class_factory import ClassFactory, ClassType

__all__ = ('BERTFilter', 'EdgeOnlyFilter', 'CloudOnlyFilter',
           'RandomRouterFilter', 'OracleRouterFilter')

@ClassFactory.register(ClassType.HEM, alias="BERTRouter")
class BERTFilter(BaseFilter, abc.ABC):
    """BERTRouter logic here"""

@ClassFactory.register(ClassType.HEM, alias="EdgeOnly")
class EdgeOnlyFilter(BaseFilter, abc.ABC):
    """EdgeOnly router logic here"""

@ClassFactory.register(ClassType.HEM, alias="CloudOnly")
class CloudOnlyFilter(BaseFilter, abc.ABC):
    """CloudOnly router logic here"""

@ClassFactory.register(ClassType.HEM, alias="RandomRouter")
class RandomRouterFilter(BaseFilter, abc.ABC):
    """RandomRouter logic here"""

@ClassFactory.register(ClassType.HEM, alias="OracleRouter")
class OracleRouterFilter(BaseFilter, abc.ABC):
    """OracleRouter logic here"""
```
After registration, you only need to change the router name and parameters in the yaml file, and the corresponding class will be automatically called according to that name.
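For instance, selecting a registered router by its alias might look like the fragment below. The key layout (`name`, `parameters`) is an illustrative assumption; the authoritative schema is whatever `test_queryrouting.yaml` in the example repository actually uses.

```yaml
# Illustrative fragment: pick a router by its registered alias.
# Valid aliases: EdgeOnly, CloudOnly, OracleRouter, BERTRouter, RandomRouter.
hard_example_mining:
  name: BERTRouter
  parameters:
    model: routellm/bert
    threshold: 0.5
```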
