Commit 4d3f051

Merge branch 'kubeedge:main' into lfx_proposal#185_point3
2 parents dacb4ae + 42732e1

25 files changed: 816 additions & 51 deletions

.github/workflows/codeql-analysis.yml
Lines changed: 1 addition & 1 deletion

```diff
@@ -23,7 +23,7 @@ on:
 jobs:
   analyze:
     name: Analyze
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-22.04

     strategy:
       fail-fast: false
```

.github/workflows/fossa.yaml
Lines changed: 1 addition & 1 deletion

```diff
@@ -10,7 +10,7 @@ on:

 jobs:
   build:
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-22.04
     strategy:
       matrix:
         python-version: [ "3.7", "3.8", "3.9" ]
```

.github/workflows/main-doc.yaml
Lines changed: 1 addition & 1 deletion

```diff
@@ -11,7 +11,7 @@ on:

 jobs:
   pylint:
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-22.04
     name: pylint
     strategy:
       matrix:
```

.github/workflows/main.yaml
Lines changed: 1 addition & 1 deletion

```diff
@@ -11,7 +11,7 @@ on:

 jobs:
   pylint:
-    runs-on: ubuntu-20.04
+    runs-on: ubuntu-22.04
     name: pylint
     strategy:
       matrix:
```
Lines changed: 262 additions & 0 deletions

# Project Proposal: Domain-Specific Large Model Benchmarking for Edge-Based E-Government Services

## 1. Abstract

With the rapid development of large language models (LLMs), the demand for personalized, compliant, and real-time services has given rise to edge computing-based LLMs. Among the many application domains, government services are a critical scenario where edge models can play a pivotal role: government operations require high levels of data privacy and real-time responsiveness, making edge deployment an ideal solution.

However, most existing benchmarks focus on general capabilities or specific academic tasks and lack comprehensive evaluation datasets for vertical domains such as Chinese government services. To address this gap, we previously proposed the "Chinese Government Affairs Understanding Evaluation Benchmark" (CGAUE). This benchmark provides an open, community-driven evaluation framework that tests both the objective and subjective capabilities of LLMs, and it has been run against common LLMs in Ianvs.

Yet our preliminary work still has room for improvement: we invoked LLMs directly, without fine-tuning or Retrieval-Augmented Generation (RAG); most existing LLMs have not been sufficiently trained on government data; and because government data is updated rapidly while LLMs cannot acquire new knowledge after training, their performance on government tasks remains suboptimal. Addressing these issues is the goal of the current work.
## 2. Research Motivation

### 2.1 Preliminary Research and Experiments

Previous research shows that existing LLMs without edge optimization face significant challenges on Chinese government-domain tasks. For example, models such as GPT-4 that have not been trained on specialized government data perform poorly on tasks such as policy interpretation and public service consultation, often generating incorrect responses because they lack domain knowledge and cannot access localized data in real time.

### 2.2 Edge Deployment Requirements for Government Scenarios

Key drivers for deploying edge LLMs in Chinese government scenarios include:

- **Market Size**: China's AI government-solutions market is projected to reach $15 billion by 2025, driven by accelerated adoption of AI and edge computing in public administration
- **Typical Cases**: Shenzhen has deployed edge LLMs to improve the efficiency of public consultation and policy communication; Guangzhou has integrated edge LLMs into its smart-city infrastructure to provide localized public-service responses
- **Data Sensitivity**: Government data such as social security records contains sensitive information that must be processed locally to prevent leaks and ensure compliance
- **Low Latency Requirements**: Services such as policy consultation demand real-time responses, where edge models outperform cloud solutions
- **Regional Knowledge**: Significant policy variations across regions make edge deployment ideal for adapting LLMs to local policies
### 2.3 Technical Rationale for RAG Adoption

Given the need for localized, real-time, and compliant processing of government data, Retrieval-Augmented Generation (RAG) is crucial. RAG enhances LLM capabilities by integrating external knowledge sources at inference time, ensuring that models can access the latest regional policies, which is particularly beneficial for edge deployments that must operate on local data.
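To make the mechanism concrete, the retrieve-then-generate loop at the heart of RAG can be sketched without any library. The keyword-overlap scorer and prompt template below are deliberately simplified, illustrative stand-ins for a real embedding-based retriever:

```python
def score(query: str, doc: str) -> int:
    # Toy relevance score: number of query words that appear in the document.
    # A real RAG system would use embedding similarity instead.
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Return the k highest-scoring documents.
    return sorted(docs, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    # Prepend the retrieved context so the LLM answers from up-to-date policy text.
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "Beijing social security registration requires a local residence permit.",
    "Guangzhou smart city services run on edge infrastructure.",
    "Beijing policy consultation is handled by district service centers.",
]
prompt = build_prompt("How does Beijing handle policy consultation?", docs)
```

The generated prompt is then sent to the (edge-deployed) LLM; only the retrieval side changes between the test modes described later.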
## 3. Project Objectives

1. Build a **cross-provincial government knowledge base** for RAG-enhanced LLM benchmarking
2. Design **four testing modes**:
   - *Type 1: No-RAG mode*
   - *Type 2: Using only data relevant to the tested edge node as the RAG knowledge base*
   - *Type 3: Using all edge nodes' data as the RAG knowledge base*
   - *Type 4: Using only data irrelevant to the tested edge node as the RAG knowledge base*
3. Implement and compare mainstream RAG architectures in Ianvs
## 4. Methodology

### 4.1 Data Collection and Processing

We first collect government data from the internet (not all of which is necessarily related to our benchmark), then clean it and extract the relevant portion using the following method:

![data process](./assets/data_process.png)

For locations where the collected data remains insufficient after cleaning, we can use LLMs and search APIs to search for and generate data specifically:

![data generate](./assets/data_generate.png)

This process yields cleaned data relevant to each province's benchmark queries. For province Pᵢ, we obtain the corresponding knowledge base Kᵢ through cleaning or generation.
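A minimal sketch of this final partitioning step, assuming each cleaned record carries a province tag (the `province` and `text` field names are hypothetical, not a fixed schema):

```python
from collections import defaultdict

def build_knowledge_bases(records: list[dict]) -> dict[str, list[str]]:
    # Group cleaned records into one knowledge base K_i per province P_i.
    kb = defaultdict(list)
    for rec in records:
        kb[rec["province"]].append(rec["text"])
    return dict(kb)

records = [
    {"province": "Beijing", "text": "Beijing housing fund policy ..."},
    {"province": "Guangdong", "text": "Guangzhou service hall hours ..."},
    {"province": "Beijing", "text": "Beijing hukou registration ..."},
]
kb = build_knowledge_bases(records)
```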
### 4.2 RAG Integration Solution

A RAG pipeline can be divided into several modules:
1. Python libraries for RAG
2. Knowledge base processing module (embedding and storage in a vector database)
3. Retrieval module (query-based retrieval from the vector database)

For **Part 1 (RAG Python libraries)**, we recommend LangChain, the most widely used LLM RAG library, which offers extensive functionality. Since we will not modify LangChain itself, no whl packaging is needed. LangChain will not be called in Ianvs Core, only in `examples/path/testalgorithms/path/xxx.py`, so it does not affect the global `requirements.txt` and can be pip-installed separately for specific scenarios.

For **Part 2 (knowledge base processing)**, we believe the knowledge base should not be placed in yaml configuration files processed via `core/testenvmanager/dataset/dataset.py`, because:
1. LLM scenarios constitute only a small portion of Ianvs projects, and not all LLM scenarios require RAG
2. Knowledge bases differ fundamentally from Ianvs' original "data" concept (train/test data vs. knowledge base)
3. RAG knowledge bases can take various formats that are difficult to process uniformly
4. Knowledge base processing should remain flexible for developer customization

For **Part 3**, similar reasoning applies: vectorization, storage, and retrieval should be implemented in scenario-specific algorithm files rather than in Core.

The modified Ianvs structure is shown below:

![](./assets/ianvs_structure.png)

While the Core framework remains unchanged, we will add the concepts of a RAG knowledge base and its processed vector database, both kept in local storage.

However, algorithm processing requires changes. Previous LLM scenarios used the Single Task Learning paradigm with only train and inference steps. We propose adding a preprocess step for knowledge base initialization:

![](./assets/singletasklearning_change.png)
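A rough sketch of what the extended paradigm could look like; the class and method names below are hypothetical illustrations, not Ianvs Core API:

```python
class SingleTaskLearningWithRAG:
    """Hypothetical paradigm: preprocess -> train -> inference."""

    def __init__(self):
        self.vector_db = None

    def preprocess(self, knowledge_dir: str) -> None:
        # New step: load the knowledge base and build the vector index once,
        # before any training or inference happens.
        self.vector_db = f"index built from {knowledge_dir}"

    def train(self, train_data=None):
        # LLM benchmarking scenarios typically skip training entirely.
        return None

    def inference(self, query: str) -> str:
        # Retrieval is only available if preprocess() ran first.
        if self.vector_db is None:
            return f"answer({query}) without retrieval"
        return f"answer({query}) using {self.vector_db}"

algo = SingleTaskLearningWithRAG()
algo.preprocess("./knowledge_base/beijing")
result = algo.inference("What documents does hukou transfer require?")
```

Keeping preprocess as a separate step lets the same vector index be reused across all benchmark queries for a node, instead of being rebuilt per query.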
### 4.3 Knowledge Base Vectorization and Retrieval Solution

As mentioned, knowledge bases can have complex layouts:

```
├── /product_docs/
│   ├── user_manual_v2.3.pdf
│   ├── API_reference.md
│   └── version_history/
│       ├── v1.0-release-notes.txt
│       └── v2.0-beta-notes.docx
├── /customer_support/
│   ├── FAQ.json
│   └── ticket_records.db
└── config.yaml
```

LangChain handles this as follows:
#### 4.3.1 File Loading and Preprocessing
```python
from pathlib import Path

from langchain.document_loaders import (
    TextLoader,
    PyPDFLoader,
    JSONLoader,
    UnstructuredFileLoader
)

# Map file extensions to loader classes
loaders = {
    '.pdf': PyPDFLoader,
    '.md': TextLoader,
    '.txt': TextLoader,
    '.json': JSONLoader,   # Note: JSONLoader also expects a jq_schema argument in practice
    '.yaml': TextLoader,
    '.docx': UnstructuredFileLoader,  # Requires the `unstructured` package
    '.db': None  # Requires custom handling (see process_db in 4.3.2)
}

# Recursively load every supported file under root_path
def load_documents(root_path):
    documents = []
    for item in Path(root_path).rglob('*'):
        if item.is_file():
            ext = item.suffix.lower()
            if ext in loaders and loaders[ext]:
                loader = loaders[ext](str(item))
                documents.extend(loader.load())
    return documents
```
#### 4.3.2 Intelligent Document Splitting

##### Text Files (MD/TXT/YAML)
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    # Include Chinese sentence-ending punctuation so chunks break at sentence boundaries
    separators=["\n\n", "\n", "。", "！", "？", "……", " "]
)
```
##### Structured Data (JSON/DB)
```python
# JSON-specific splitting: break on JSON object boundaries
json_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500,
    separators=["}\n{", ",\n", "\n"]  # Split by JSON structure
)

# Flatten a SQLite database into text: table names plus sample rows
def process_db(file_path):
    import sqlite3
    conn = sqlite3.connect(file_path)
    tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()
    text_content = []
    for table in tables:
        text_content.append(f"## Table Structure: {table[0]}")
        data = conn.execute(f"SELECT * FROM {table[0]} LIMIT 10").fetchall()
        text_content.append(str(data))
    conn.close()
    return "\n".join(text_content)
```
##### PDF/DOCX Documents
```python
from unstructured.partition.auto import partition

# Parse complex documents into text elements and join them
def process_complex_file(file_path):
    elements = partition(filename=file_path)
    return "\n\n".join([str(el) for el in elements])
```
#### 4.3.3 Hybrid Splitting Process
```python
from langchain.schema import Document

def split_documents(docs):
    final_splits = []
    for doc in docs:
        content = doc.page_content
        metadata = doc.metadata

        # Choose a splitter based on the source file type
        if metadata['source'].endswith('.json'):
            splits = json_splitter.split_text(content)
        elif metadata['source'].endswith('.db'):
            content = process_db(metadata['source'])
            splits = text_splitter.split_text(content)
        else:
            splits = text_splitter.split_text(content)

        for split in splits:
            new_doc = Document(
                page_content=split,
                metadata=metadata
            )
            final_splits.append(new_doc)
    return final_splits
```
#### 4.3.4 Vectorization and Storage
```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Embedding model (Chinese text2vec, normalized for cosine similarity)
embedding = HuggingFaceEmbeddings(
    model_name="GanymedeNil/text2vec-large-chinese",
    encode_kwargs={'normalize_embeddings': True}
)

# Load, split, embed, and store
loaded_docs = load_documents("./knowledge_base")
splitted_docs = split_documents(loaded_docs)
vector_db = FAISS.from_documents(splitted_docs, embedding)

# Save the index for reuse
vector_db.save_local("./vector_store")
```
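The retrieval module (Part 3) then reduces to a nearest-neighbor search over the stored vectors; with LangChain this is a single `vector_db.similarity_search(query, k=...)` call, but the underlying mechanics can be sketched without any library (the 3-dimensional toy vectors below stand in for real embeddings):

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, store, k=2):
    # store: list of (chunk_text, embedding) pairs, as built in 4.3.4.
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

store = [
    ("Beijing subsidy policy chunk", [0.9, 0.1, 0.0]),
    ("Shanghai traffic rules chunk", [0.1, 0.9, 0.0]),
    ("Beijing hukou process chunk", [0.8, 0.2, 0.1]),
]
results = top_k([1.0, 0.0, 0.0], store, k=2)
```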
### 4.4 Testing Solution

For the experimental design, we need four configurations to compare LLM+RAG performance on government data:
1. LLM only
2. LLM + RAG (only relevant knowledge data)
3. LLM + RAG (all knowledge data)
4. LLM + RAG (only irrelevant knowledge data)

Using "Beijing" as an example test edge node:

| Type | Test Node | Knowledge Base | Capability Measured |
| --- | --- | --- | --- |
| Type 1 | Beijing | No knowledge base | LLM baseline capability |
| Type 2 | Beijing | Beijing-relevant knowledge | LLM learning capability |
| Type 3 | Beijing | All provinces' knowledge | LLM search/retrieval capability |
| Type 4 | Beijing | All non-Beijing knowledge | LLM generalization capability |

Type 1 has already been tested in our previous work; Types 2-4 are the new RAG-enhanced comparisons.

We will use accuracy scores averaged across all four types as the final government capability score for each region.
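The knowledge-base switching behind the four types, and the final averaged score, can be sketched as follows (the province names and accuracy values are purely illustrative):

```python
def build_kb(mode: str, node: str, all_kb: dict[str, list[str]]) -> list[str]:
    # Select the knowledge made available to RAG for one test run.
    if mode == "type1":      # no knowledge base
        return []
    if mode == "type2":      # only the tested node's own knowledge
        return all_kb[node]
    if mode == "type3":      # every node's knowledge
        return [doc for docs in all_kb.values() for doc in docs]
    if mode == "type4":      # everything except the tested node
        return [doc for n, docs in all_kb.items() if n != node for doc in docs]
    raise ValueError(mode)

all_kb = {
    "Beijing": ["bj-policy-1", "bj-policy-2"],
    "Shanghai": ["sh-policy-1"],
    "Guangdong": ["gd-policy-1"],
}

kb4 = build_kb("type4", "Beijing", all_kb)

# Final regional score: mean accuracy over the four test types.
accuracies = {"type1": 0.61, "type2": 0.78, "type3": 0.74, "type4": 0.65}
final_score = sum(accuracies.values()) / len(accuracies)
```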
**Note: This benchmark has limitations and only partially reflects LLM capabilities on regional government questions.**

The code implementation requires careful handling of knowledge base switching during cross-testing. These changes will be confined to the `examples` directory without modifying Core code.
## 5. Limitations and Future Work

The current proposal focuses on introducing RAG but leaves several directions unexplored:
- **Incremental training**: the ability to adapt to new data remains untested
- **Fine-tuning**: domain adaptation through fine-tuning remains unevaluated
- **Few-shot learning**: requires further validation in specialized domains

Future research should develop and test complete methodologies for domain-specific scenarios.
## 6. Project Timeline

| Phase | Timeline |
| --- | --- |
| Data Collection | 3.3-3.21 |
| RAG Integration | 3.24-4.11 |
| Benchmark Testing | 4.14-5.2 |
| Performance Optimization | 5.5-5.23 |
| Project Finalization | 5.26-5.30 |
