Commit 1d3da5b

Merge pull request #196 from IcyFeather233/main
Implementation of [LFX] Domain-specific large model benchmarks: the edge perspective
2 parents 645c836 + b460ead commit 1d3da5b

File tree

13 files changed

+935
-0
lines changed


core/testcasecontroller/algorithm/paradigm/singletask_learning/singletask_learning.py

Lines changed: 7 additions & 0 deletions
```diff
@@ -69,6 +69,8 @@ def run(self):

         job = self.build_paradigm_job(ParadigmType.SINGLE_TASK_LEARNING.value)

+        self._preprocess(job)
+
         trained_model = self._train(job, self.initial_model)

         if trained_model is None:
@@ -108,6 +110,11 @@ def _compress(self, trained_model):

         return compressed_model

+    def _preprocess(self, job):
+        if job.preprocess() is None:
+            return None
+        return job.preprocess()
+
     def _train(self, job, initial_model):
         train_output_dir = os.path.join(self.workspace, "output/train/")
         os.environ["BASE_MODEL_URL"] = initial_model
```
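The diff above wires an optional preprocessing step into the paradigm's `run()` flow: `_preprocess(job)` is invoked before `_train`. A minimal sketch of that control flow, using hypothetical stand-in classes rather than the actual Ianvs job API:

```python
class DemoJob:
    """Hypothetical stand-in for a paradigm job that supports preprocessing."""

    def __init__(self, raw_data):
        self.raw_data = raw_data
        self.data = None

    def preprocess(self):
        # Normalize raw inputs before training; return None to signal "nothing to do".
        if not self.raw_data:
            return None
        self.data = [s.strip().lower() for s in self.raw_data]
        return self.data


def run_paradigm(job):
    # Mirrors the flow added by the commit: preprocess first, then train.
    # (Calling preprocess once and keeping the result avoids a redundant second call.)
    preprocessed = job.preprocess()
    if preprocessed is None:
        return "trained on raw inputs"
    return f"trained on {len(preprocessed)} preprocessed samples"


print(run_paradigm(DemoJob(["  Hello ", " WORLD"])))  # trained on 2 preprocessed samples
```

If preprocessing has nothing to contribute, training simply proceeds on the unmodified inputs, which matches the hook's return-`None` contract.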

examples/government_rag/README.md

Lines changed: 114 additions & 0 deletions
# Government RAG Example

This example demonstrates how to use Retrieval-Augmented Generation (RAG) technology in the government domain to enhance the performance of large language models.

## Features

1. Supports multiple document formats (txt, docx)
2. Uses LangChain for document processing and retrieval
3. Implements four test modes:
   - Type 1 - Model: no-RAG mode
   - Type 2 - Global: uses only data related to the tested edge node as the RAG knowledge base
   - Type 3 - Local: uses data from all edge nodes as the RAG knowledge base
   - Type 4 - Other: uses data unrelated to the tested edge node as the RAG knowledge base

## Test Methodology

To evaluate the effectiveness of LLM + RAG solutions on government data, the experimental design compares a no-RAG baseline against three RAG configurations:

1. MODEL: LLM only
2. GLOBAL: LLM + RAG (only knowledge data relevant to the edge question)
3. LOCAL: LLM + RAG (all knowledge data)
4. OTHER: LLM + RAG (all knowledge data not relevant to the edge question)

Taking "Beijing" as an example of a tested provincial edge node, there are four experimental settings:

| Category | Tested Node | Knowledge Base | Representative Capability |
|----------|-------------|----------------|---------------------------|
| Type 1 (MODEL) | Beijing | No knowledge base, no RAG | Basic capability of large models without a knowledge base |
| Type 2 (GLOBAL) | Beijing | Beijing's relevant knowledge base | Model's ability to learn from a knowledge base |
| Type 3 (LOCAL) | Beijing | Knowledge base from all provinces and autonomous regions | Model's ability to search for relevant knowledge |
| Type 4 (OTHER) | Beijing | Knowledge base from all provinces except Beijing | Model's ability to generalize knowledge |
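The four settings differ only in which documents feed the RAG knowledge base. A hedged sketch of the selection logic (function and data names are illustrative, not from the example's code):

```python
def select_knowledge(mode, tested_node, docs_by_node):
    """Pick the RAG knowledge base for one experimental setting.

    docs_by_node maps a province name to its list of policy documents.
    """
    if mode == "MODEL":
        return []  # no knowledge base, no RAG
    if mode == "GLOBAL":
        return docs_by_node.get(tested_node, [])  # only the tested node's documents
    if mode == "LOCAL":
        return [d for docs in docs_by_node.values() for d in docs]  # everything
    if mode == "OTHER":
        return [d for node, docs in docs_by_node.items()
                if node != tested_node for d in docs]  # everything except the tested node
    raise ValueError(f"unknown mode: {mode}")


docs = {"Beijing": ["bj-policy"], "Shanghai": ["sh-policy"], "Yunnan": ["yn-policy"]}
print(select_knowledge("OTHER", "Beijing", docs))  # ['sh-policy', 'yn-policy']
```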

For evaluation metrics, we believe that because variables such as knowledge-base quality are difficult to control across provinces, metrics from different provinces (edge nodes) are not directly comparable.

For each region's evaluation, we have four experimental designs (Type 1, 2, 3, 4). We use accuracy as the metric to calculate scores, and then take the average of the four experimental scores as the final government capability score for that region.

- Type 1 evaluates the LLM's basic capability without any knowledge base assistance.
- Type 2, using only the knowledge base relevant to the edge node, evaluates the LLM's ability to learn from a knowledge base.
- Type 3, using the complete knowledge base, evaluates the LLM's ability to retrieve relevant knowledge.
- Type 4, using a knowledge base unrelated to the edge node, evaluates the LLM's ability to generalize knowledge.

**Note: This benchmark design has limitations and can only partially reflect different large models' capabilities in handling government affairs across provinces. The results are for reference only and cannot fully reflect a model's government affairs capability.**
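The scoring rule described above — per-setting accuracy averaged into one regional score — can be sketched as follows (a hypothetical helper, not part of the benchmark code):

```python
def region_score(accuracies):
    """Average the four per-setting accuracies (MODEL, GLOBAL, LOCAL, OTHER)
    into a single government-capability score for one region."""
    if len(accuracies) != 4:
        raise ValueError("expected one accuracy per experimental setting")
    return sum(accuracies) / 4.0


# Example using Deepseek's Beijing accuracies from the results table below.
print(round(region_score([0.75, 0.9, 0.9, 0.9]), 4))  # 0.8625
```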

## Usage

1. Prepare data:

The datasets are now available on [Kaggle](https://www.kaggle.com/datasets/kubeedgeianvs/the-government-affairs-dataset-govaff?select=RAG-multi-edges-QA). The knowledge base (policy documents for each province), its vector store, and the benchmark (question-and-answer pairs) are all available at that link.

Place the data in the `dataset/gov_rag` directory with the following structure:
```
.
├── data.jsonl
├── dataset
│   ├── Shanghai
│   │   ├── Shanghai Data Trading Venue Management Implementation Measures.docx
│   │   ├── Action Plan for Promoting Innovation and Development of the Data Elements Industry Based on the New Digital Economy Track.docx
...
│   └── Heilongjiang
│       ├── Heilongjiang Province Big Data Development and Application Promotion Regulations.docx
│       ├── Heilongjiang Province Big Data Development and Application Promotion Regulations.txt
│       └── Heilongjiang Province.docx
└── metadata.json
```

You can also place the knowledge-base embedding directory at `./chroma_db` to avoid regenerating the embeddings.

Data example:

```json
{"query": "In Shanghai's notice on data asset management, which approach is advocated to promote the compliant and efficient circulation of data assets?{\"A\": \"Rely entirely on government regulation\", \"B\": \"Combination of market leadership and government guidance\", \"C\": \"Enterprise-independent development\", \"D\": \"Unconditional data sharing\"}\nPlease answer directly with A/B/C/D, no explanation.", "response": "B", "level_1_dim": "single-modal", "level_2_dim": "text", "level_3_dim": "government", "level_4_dim": "Shanghai"}
```
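Each line of `data.jsonl` is one self-contained JSON record, with the tested province carried in the `level_4_dim` field. A minimal loader sketch (the helper name is ours, not from the example's code):

```python
import json


def load_benchmark(path):
    """Read the JSONL benchmark and group question records by province
    (the `level_4_dim` field shown in the data example above)."""
    by_province = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            item = json.loads(line)
            by_province.setdefault(item["level_4_dim"], []).append(item)
    return by_province
```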

2. Run the test:

```bash
ianvs -f examples/government_rag/singletask_learning_bench/benchmarkingjob.yaml
```

## Test Results

Four models were selected for testing: Deepseek, LLaMA3-8B, ernie_speed, and ernie-tiny-8k. These models are representative, covering strong language models, lightweight models, and models trained with varying amounts of Chinese corpus.

Test results are as follows:

| Model | Experiment | Overall Accuracy | Shanghai | Yunnan | National | Beijing | Nanjing | Jilin | Sichuan | Tianjin | Anhui | Shandong | Shanxi | Guangdong | Guangxi | Jiangsu | Jiangxi | Hebei | Henan | Zhejiang | Hainan | Hubei | Hunan | Gansu | Fujian | Guizhou | Liaoning | Chongqing | Shaanxi | Heilongjiang |
|---------------|------------|-----------------|----------|--------|----------|---------|---------|-------|---------|---------|-------|----------|--------|-----------|---------|---------|---------|-------|-------|----------|--------|-------|-------|-------|--------|---------|----------|-----------|---------|--------------|
| Deepseek | MODEL | 0.855357143 | 0.9 | 0.9 | 1 | 0.75 | 0.85 | 0.9 | 0.75 | 0.7 | 0.75 | 0.9 | 0.8 | 0.95 | 0.95 | 0.85 | 0.9 | 0.75 | 0.95 | 0.8 | 0.85 | 0.9 | 0.8 | 0.7 | 0.9 | 0.8 | 1 | 0.95 | 0.75 | 0.95 |
| Deepseek | GLOBAL | 0.891071429 | 0.9 | 0.9 | 0.95 | 0.9 | 0.95 | 0.85 | 0.85 | 0.75 | 0.85 | 0.95 | 0.8 | 0.95 | 0.95 | 0.85 | 0.85 | 0.85 | 1 | 0.95 | 0.95 | 0.95 | 0.8 | 0.8 | 0.9 | 0.8 | 1 | 0.95 | 0.8 | 0.95 |
| Deepseek | LOCAL | 0.891071429 | 0.9 | 0.95 | 0.95 | 0.9 | 0.95 | 0.85 | 0.85 | 0.75 | 0.9 | 0.95 | 0.75 | 0.95 | 0.95 | 0.85 | 0.85 | 0.85 | 1 | 0.95 | 0.9 | 0.95 | 0.8 | 0.75 | 0.9 | 0.8 | 1 | 0.95 | 0.85 | 0.95 |
| Deepseek | OTHER | 0.889285714 | 0.9 | 0.9 | 0.95 | 0.9 | 0.9 | 0.85 | 0.85 | 0.75 | 0.9 | 0.95 | 0.75 | 0.95 | 0.95 | 0.85 | 0.85 | 0.8 | 1 | 0.95 | 0.95 | 0.95 | 0.8 | 0.8 | 0.9 | 0.8 | 1 | 0.95 | 0.85 | 0.95 |
| llama3-8b | MODEL | 0.664285714 | 0.8 | 0.6 | 0.7 | 0.7 | 0.8 | 0.7 | 0.4 | 0.5 | 0.6 | 0.9 | 0.7 | 0.8 | 0.9 | 0.7 | 0.5 | 0.4 | 0.7 | 0.7 | 0.4 | 0.6 | 0.7 | 0.4 | 0.6 | 0.7 | 0.8 | 0.7 | 0.7 | 0.9 |
| llama3-8b | GLOBAL | 0.782142857 | 0.9 | 0.7 | 0.65 | 0.8 | 0.85 | 0.75 | 0.55 | 0.55 | 0.85 | 0.9 | 0.8 | 0.9 | 0.8 | 0.8 | 0.5 | 0.8 | 0.8 | 0.9 | 0.7 | 0.8 | 0.7 | 0.7 | 1 | 0.8 | 1 | 0.9 | 0.7 | 0.8 |
| llama3-8b | LOCAL | 0.801785714 | 0.9 | 0.7 | 0.65 | 0.85 | 0.85 | 0.75 | 0.45 | 0.75 | 0.8 | 0.95 | 0.85 | 0.85 | 0.7 | 0.85 | 0.6 | 0.85 | 0.7 | 0.85 | 0.7 | 0.95 | 0.7 | 0.75 | 1 | 0.85 | 1 | 0.9 | 0.75 | 0.95 |
| llama3-8b | OTHER | 0.775 | 0.9 | 0.7 | 0.6 | 0.8 | 0.8 | 0.8 | 0.5 | 0.7 | 0.8 | 0.9 | 0.7 | 1 | 0.7 | 0.8 | 0.6 | 0.8 | 0.7 | 0.9 | 0.7 | 0.9 | 0.7 | 0.6 | 0.9 | 0.7 | 1 | 0.9 | 0.7 | 0.9 |
| ernie-tiny-8k | MODEL | 0.319642857 | 0.6 | 0.1 | 0.3 | 0.1 | 0.1 | 0.2 | 0.2 | 0.2 | 0.2 | 0.7 | 0.2 | 0.4 | 0.3 | 0.5 | 0.1 | 0.3 | 0.5 | 0.8 | 0.2 | 0.2 | 0.2 | 0.35 | 0.3 | 0.3 | 0.1 | 0.2 | 0.8 | 0.5 |
| ernie-tiny-8k | GLOBAL | 0.285714286 | 0.3 | 0 | 0.3 | 0.2 | 0.1 | 0.25 | 0.1 | 0.2 | 0.1 | 0.4 | 0.2 | 0.4 | 0.2 | 0.7 | 0.15 | 0.2 | 0.5 | 0.8 | 0.2 | 0.2 | 0.1 | 0.3 | 0.2 | 0.3 | 0.1 | 0.3 | 0.7 | 0.5 |
| ernie-tiny-8k | LOCAL | 0.301785714 | 0.2 | 0.1 | 0.15 | 0.1 | 0.1 | 0.25 | 0.15 | 0.25 | 0.15 | 0.4 | 0.25 | 0.5 | 0.25 | 0.75 | 0.15 | 0.2 | 0.55 | 0.8 | 0.15 | 0.25 | 0.25 | 0.3 | 0.25 | 0.35 | 0.1 | 0.3 | 0.7 | 0.5 |
| ernie-tiny-8k | OTHER | 0.2875 | 0.3 | 0.1 | 0.2 | 0.2 | 0.1 | 0.2 | 0.1 | 0.2 | 0.1 | 0.4 | 0.2 | 0.3 | 0.2 | 0.7 | 0.1 | 0.2 | 0.5 | 0.8 | 0.2 | 0.2 | 0.2 | 0.35 | 0.2 | 0.3 | 0.1 | 0.4 | 0.7 | 0.5 |
| ernie_speed | MODEL | 0.705357143 | 0.9 | 0.5 | 0.8 | 0.3 | 0.8 | 0.7 | 0.6 | 0.6 | 0.6 | 0.9 | 0.7 | 1 | 0.9 | 0.8 | 0.9 | 0.4 | 0.75 | 0.7 | 0.3 | 0.7 | 0.7 | 0.5 | 0.4 | 0.8 | 0.8 | 0.9 | 0.8 | 1 |
| ernie_speed | GLOBAL | 0.776785714 | 1 | 0.6 | 0.8 | 0.75 | 0.8 | 0.9 | 0.7 | 0.65 | 0.8 | 0.8 | 0.8 | 0.8 | 0.7 | 0.9 | 0.7 | 0.6 | 0.7 | 0.8 | 0.7 | 0.95 | 0.6 | 0.7 | 0.8 | 0.8 | 1 | 0.9 | 0.5 | 1 |
| ernie_speed | LOCAL | 0.785714286 | 0.9 | 0.6 | 0.8 | 0.65 | 0.8 | 0.95 | 0.8 | 0.65 | 0.8 | 0.8 | 0.9 | 1 | 0.75 | 0.9 | 0.8 | 0.6 | 0.75 | 0.8 | 0.75 | 0.95 | 0.6 | 0.6 | 0.6 | 0.85 | 1 | 0.9 | 0.5 | 1 |
| ernie_speed | OTHER | 0.769642857 | 0.9 | 0.5 | 0.7 | 0.7 | 0.9 | 0.9 | 0.8 | 0.6 | 0.8 | 0.8 | 0.9 | 0.85 | 0.7 | 0.9 | 0.6 | 0.6 | 0.8 | 0.7 | 0.6 | 0.9 | 0.7 | 0.8 | 0.6 | 0.8 | 1 | 0.9 | 0.6 | 1 |
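The per-province numbers are accuracies over multiple-choice questions answered with a single letter. A plausible exact-match scorer (our sketch, not the benchmark's actual metric code):

```python
def accuracy(predictions, references):
    """Fraction of A/B/C/D answers that exactly match, ignoring case and whitespace."""
    if not references:
        raise ValueError("no reference answers")
    correct = sum(p.strip().upper() == r.strip().upper()
                  for p, r in zip(predictions, references))
    return correct / len(references)


print(accuracy(["B", "a", "C", "D"], ["B", "A", "D", "D"]))  # 0.75
```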

The visualization of the average scores for each edge node is as follows:

<table>
<tr>
<td><img src="./assets/model_accuracy_radar.png" alt="Model Accuracy Radar" width="200"/></td>
<td><img src="./assets/Deepseek_radar_plot.png" alt="Deepseek Radar Plot" width="200"/></td>
<td><img src="./assets/llama3-8b_radar_plot.png" alt="Llama3-8B Radar Plot" width="200"/></td>
<td><img src="./assets/ernie_speed_radar_plot.png" alt="Ernie Speed Radar Plot" width="200"/></td>
<td><img src="./assets/ernie-tiny-8k_radar_plot.png" alt="Ernie-Tiny-8K Radar Plot" width="200"/></td>
</tr>
</table>

Overall, Deepseek performs best, followed by LLaMA3-8B and ernie_speed with similar performance, while ernie-tiny-8k performs worst. When the model itself is already strong (Deepseek), the effect of RAG is relatively weak; when the model's performance is moderate, RAG has a more significant impact; and for poorly performing models, the effect of RAG is unstable. The general trend is Local RAG > Global RAG > Other RAG > Model. Other RAG usually outperforms the plain Model, indicating that policy documents from different edge nodes may be correlated, which yields some performance gains. However, when the model performs poorly, these gains are not consistent and can sometimes be negative due to differences in policies across edge nodes.
examples/government_rag/assets/ — five radar-plot images (referenced in the README above) were added as binary files.
examples/government_rag/singletask_learning_bench/benchmarkingjob.yaml

Lines changed: 72 additions & 0 deletions
```yaml
benchmarkingjob:
  # job name of benchmarking; string type;
  name: "benchmarkingjob"
  # the url address of job workspace that will reserve the output of tests; string type;
  workspace: "/home/icyfeather/project/ianvs/workspace"

  # the url address of test environment configuration file; string type;
  # the file format supports yaml/yml;
  testenv: "./examples/government_rag/singletask_learning_bench/testenv/testenv.yaml"

  # the configuration of test object
  test_object:
    # test type; string type;
    # currently the option of value is "algorithms"; the others will be added in succession.
    type: "algorithms"
    # test algorithm configuration files; list type;
    algorithms:
      # algorithm name; string type;
      - name: "singletask_learning"
        # the url address of test algorithm configuration file; string type;
        # the file format supports yaml/yml;
        url: "./examples/government_rag/singletask_learning_bench/testalgorithms/gen_algorithm.yaml"

  # the configuration of ranking leaderboard
  rank:
    # rank leaderboard with metric of test case's evaluation and order; list type;
    # the sorting priority is based on the sequence of metrics in the list from front to back;
    sort_by: [ { "acc_model": "descend" }, { "acc_global": "descend" }, { "acc_local": "descend" }, { "acc_other": "descend" } ]

    # visualization configuration
    visualization:
      # mode of visualization in the leaderboard; string type;
      # There are quite a few possible dataitems in the leaderboard. Not all of them can be shown simultaneously on the screen.
      # In the leaderboard, we provide the "selected_only" mode for the user to configure what is shown or is not shown.
      mode: "selected_only"
      # method of visualization for selected dataitems; string type;
      # currently the options of value are as follows:
      #   1> "print_table": print selected dataitems;
      method: "print_table"

    # selected dataitem configuration
    # The user can add his/her interested dataitems in terms of "paradigms", "modules", "hyperparameters" and "metrics",
    # so that the selected columns will be shown.
    selected_dataitem:
      # currently the options of value are as follows:
      #   1> "all": select all paradigms in the leaderboard;
      #   2> paradigms in the leaderboard, e.g., "singletasklearning"
      paradigms: [ "all" ]
      # currently the options of value are as follows:
      #   1> "all": select all modules in the leaderboard;
      #   2> modules in the leaderboard, e.g., "basemodel"
      modules: [ "all" ]
      # currently the options of value are as follows:
      #   1> "all": select all hyperparameters in the leaderboard;
      #   2> hyperparameters in the leaderboard, e.g., "momentum"
      hyperparameters: [ "all" ]
      # currently the options of value are as follows:
      #   1> "all": select all metrics in the leaderboard;
      #   2> metrics in the leaderboard, e.g., "f1_score"
      metrics: [ "acc_model", "acc_global", "acc_local", "acc_other" ]

    # mode of saving selected and all dataitems in workspace; string type;
    # currently the options of value are as follows:
    #   1> "selected_and_all": save selected and all dataitems;
    #   2> "selected_only": save selected dataitems;
    save_mode: "selected_and_all"
```
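The `sort_by` list ranks test cases by the first metric, breaking ties with later entries. A sketch of how such a multi-key ranking could be applied to leaderboard rows (illustrative only; not the ianvs implementation):

```python
def rank_leaderboard(rows, sort_by):
    """Sort leaderboard rows by the metrics in sort_by, with earlier entries
    taking priority. Each sort_by entry maps one metric name to "descend"
    or "ascend", mirroring the YAML config above."""
    # Apply lowest-priority keys first; Python's stable sort preserves their
    # order among ties when higher-priority keys are applied afterwards.
    for spec in reversed(sort_by):
        (metric, order), = spec.items()
        rows.sort(key=lambda r: r[metric], reverse=(order == "descend"))
    return rows


rows = [
    {"name": "a", "acc_model": 0.70, "acc_global": 0.80},
    {"name": "b", "acc_model": 0.70, "acc_global": 0.90},
    {"name": "c", "acc_model": 0.85, "acc_global": 0.60},
]
ranked = rank_leaderboard(rows, [{"acc_model": "descend"}, {"acc_global": "descend"}])
print([r["name"] for r in ranked])  # ['c', 'b', 'a']
```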
