|
| 1 | +# Project Proposal: Domain-Specific Large Model Benchmarking for Edge-Based E-Government Services |
| 2 | + |
| 3 | +## 1. Abstract |
| 4 | + |
| 5 | +With the rapid development of large language models (LLMs), the demand for personalized, compliant, and real-time services has given rise to edge computing-based LLMs. Among various domains, government services represent a critical scenario where edge models can play a pivotal role. Government operations require high levels of data privacy and real-time responsiveness, making edge deployment an ideal solution. |
| 6 | + |
| 7 | +However, most existing benchmarks focus on general capabilities or specific academic tasks, lacking comprehensive evaluation datasets for vertical domains like Chinese government services. To address this gap, we previously proposed the "Chinese Government Affairs Understanding Evaluation Benchmark" (CGAUE). This benchmark provides an open, community-driven evaluation framework that tests both objective and subjective capabilities of LLMs, and has been tested with common LLMs in Ianvs. |
| 8 | + |
| 9 | +Yet our preliminary work still has room for improvement: we directly invoked LLMs without fine-tuning or implementing Retrieval-Augmented Generation (RAG); most existing LLMs haven't been sufficiently trained on government data; moreover, government data updates rapidly while LLMs cannot acquire new knowledge after training, resulting in suboptimal performance on government data. Addressing this issue is precisely the goal of our current work. |
| 10 | + |
| 11 | +## 2. Research Motivation |
| 12 | + |
| 13 | +### 2.1 Preliminary Research and Experiments |
| 14 | + |
| 15 | +Previous research shows that existing LLMs without edge optimization face significant challenges when handling Chinese government domain tasks. For example, models like GPT-4 that haven't been trained on specialized government data perform poorly in tasks such as policy interpretation and public service consultation, often generating incorrect responses due to lack of domain knowledge and inability to access localized data in real-time. |
| 16 | + |
| 17 | +### 2.2 Edge Deployment Requirements for Government Scenarios |
| 18 | + |
| 19 | +Key drivers for deploying edge LLMs in Chinese government scenarios include: |
| 20 | + |
| 21 | +- **Market Size**: China's AI government solutions market is projected to reach $15 billion by 2025, driven by accelerated AI and edge computing adoption in public administration |
| 22 | +- **Typical Cases**: Shenzhen has deployed edge LLMs to optimize efficiency in public consultation and policy communication; Guangzhou has integrated edge LLMs into smart city infrastructure to provide localized public service responses |
| 23 | +- **Data Sensitivity**: Government data like social security involves sensitive information requiring localized processing to prevent leaks and ensure compliance |
| 24 | +- **Low Latency Requirements**: Government services like policy consultation demand real-time responses, where edge models outperform cloud solutions |
| 25 | +- **Regional Knowledge**: Significant policy variations across regions make edge deployment ideal for adapting LLMs to local policies |
| 26 | + |
| 27 | +### 2.3 Technical Rationale for RAG Adoption |
| 28 | + |
| 29 | +Given the need for localized, real-time, and compliant processing of government data, Retrieval-Augmented Generation (RAG) technology is crucial. RAG enhances LLM capabilities by integrating external knowledge sources, ensuring models can access the latest regional policies—particularly beneficial for edge deployments requiring operation on local data. |
| 30 | + |
| 31 | +## 3. Project Objectives |
| 32 | + |
| 33 | +1. Build a **cross-provincial government knowledge base** for RAG-enhanced LLM benchmarking |
| 34 | +2. Design **four testing modes**: |
| 35 | +   - *Type 1: No-RAG mode* |
| 36 | +   - *Type 2: Using only data relevant to the tested edge node as RAG knowledge base* |
| 37 | +   - *Type 3: Using all edge node data as RAG knowledge base* |
| 38 | +   - *Type 4: Using all data irrelevant to the tested edge node as RAG knowledge base* |
| 39 | +3. Implement and compare mainstream RAG architectures in Ianvs |
| 40 | + |
| 41 | +## 4. Methodology |
| 42 | + |
| 43 | +### 4.1 Data Collection and Processing |
| 44 | + |
| 45 | +We need to collect government data from the internet (this data may not necessarily relate to our Benchmark), then clean and extract relevant data using the following method: |
| 46 | + |
| 47 | + |
| 48 | + |
| 49 | +For locations where collected data remains insufficient after cleaning, we can use LLMs and search APIs to specifically search and generate data: |
| 50 | + |
| 51 | + |
| 52 | + |
| 53 | +This process yields cleaned data relevant to each province's Benchmark queries. For province Pᵢ, we obtain corresponding knowledge base data Kᵢ through cleaning or generation. |
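The per-province grouping described above can be sketched as follows. This is a minimal illustration, assuming each cleaned record is a dict with hypothetical `province` and `text` fields; the field names and the `build_knowledge_bases` helper are assumptions for illustration and do not come from the proposal.

```python
from collections import defaultdict


def build_knowledge_bases(records):
    """Group cleaned records into one knowledge base K_i per province P_i."""
    knowledge_bases = defaultdict(list)
    for record in records:
        text = record.get("text", "").strip()
        if not text:
            continue  # Skip records that are empty after cleaning
        knowledge_bases[record["province"]].append(text)
    return dict(knowledge_bases)


# Placeholder records (hypothetical data, for illustration only)
records = [
    {"province": "Beijing", "text": "Policy A"},
    {"province": "Guangdong", "text": "Policy B"},
    {"province": "Beijing", "text": "  "},
]
knowledge_bases = build_knowledge_bases(records)
```

Keeping one list per province makes it straightforward to assemble the different knowledge base combinations needed later for cross-testing.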
| 54 | + |
| 55 | +### 4.2 RAG Integration Solution |
| 56 | + |
| 57 | +RAG can be divided into several modules: |
| 58 | +1. Python libraries for RAG |
| 59 | +2. Knowledge base processing module (embedding and storage in vector database) |
| 60 | +3. Retrieval module (query-based retrieval from vector database) |
| 61 | + |
| 62 | +For **Part 1 (RAG Python libraries)**, we recommend LangChain—the most commonly used LLM RAG library with extensive functionality. Since we won't modify LangChain itself, no whl packaging is needed. LangChain won't be called in Ianvs Core, only in `examples/path/testalgorithms/path/xxx.py`, so it won't affect the global `requirements.txt` and can be pip-installed separately for specific scenarios. |
| 63 | + |
| 64 | +For **Part 2 (Knowledge base processing)**, we believe the knowledge base shouldn't be placed in YAML configuration files processed via `core/testenvmanager/dataset/dataset.py` because: |
| 65 | +1. LLM scenarios constitute only a small portion of Ianvs projects, and not all LLM scenarios require RAG |
| 66 | +2. Knowledge bases differ fundamentally from Ianvs' original "data" concept (train/test data vs. knowledge base) |
| 67 | +3. RAG knowledge bases can take various formats that are difficult to process uniformly |
| 68 | +4. Knowledge base processing should remain flexible for developer customization |
| 69 | + |
| 70 | +For **Part 3**, similar reasoning applies—vectorization, storage, and retrieval should be implemented in scenario-specific algorithm files rather than Core. |
| 71 | + |
| 72 | +The modified Ianvs structure is shown below: |
| 73 | + |
| 74 | + |
| 75 | + |
| 76 | +While the Core framework remains unchanged, we will add two concepts, the RAG knowledge base and the processed vector database, both stored in Local Storage. |
| 77 | + |
| 78 | +However, the algorithm processing flow requires changes. Previous LLM scenarios used the Single Task Learning paradigm, which has only train and inference steps. We propose adding a preprocess step for knowledge base initialization: |
| 79 | + |
| 80 | + |
| 81 | + |
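The proposed preprocess step can be sketched as a small class. This is only an illustration under stated assumptions: the class name `RagAlgorithmSketch`, its method names, and the whitespace keyword-overlap retrieval are hypothetical stand-ins, not the Ianvs paradigm API, and the overlap ranking merely takes the place of embedding plus vector search. The LLM call itself is omitted.

```python
class RagAlgorithmSketch:
    """Hypothetical algorithm with a preprocess step before train/inference."""

    def __init__(self):
        self.index = []  # Stands in for the processed vector database

    def preprocess(self, knowledge_base):
        # Knowledge base initialization: store each passage with its word set
        self.index = [(p, set(p.lower().split())) for p in knowledge_base]

    def train(self, train_data=None):
        # No training happens in this sketch
        return None

    def retrieve(self, query, k=1):
        # Rank passages by the number of words shared with the query
        terms = set(query.lower().split())
        ranked = sorted(self.index, key=lambda item: len(terms & item[1]), reverse=True)
        return [passage for passage, _ in ranked[:k]]

    def inference(self, queries):
        # Build the prompt an LLM would receive; the LLM call is omitted
        return [f"Context: {' '.join(self.retrieve(q))}\nQuestion: {q}" for q in queries]


algo = RagAlgorithmSketch()
algo.preprocess(["Beijing social security policy", "Shanghai housing fund rules"])
prompts = algo.inference(["What is the Beijing social security policy?"])
```

Whitespace tokenization does not suit Chinese text; a real implementation would rely on the embedding and vector store pipeline instead.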
| 82 | +### 4.3 Knowledge Base Vectorization and Retrieval Solution |
| 83 | + |
| 84 | +As mentioned, knowledge bases can be complex: |
| 85 | + |
| 86 | +``` |
| 87 | +├── /product_docs/ |
| 88 | +│   ├── user_manual_v2.3.pdf |
| 89 | +│   ├── API_reference.md |
| 90 | +│   └── version_history/ |
| 91 | +│       ├── v1.0-release-notes.txt |
| 92 | +│       └── v2.0-beta-notes.docx |
| 93 | +├── /customer_support/ |
| 94 | +│   ├── FAQ.json |
| 95 | +│   └── ticket_records.db |
| 96 | +└── config.yaml |
| 97 | +``` |
| 98 | + |
| 99 | +LangChain handles this through: |
| 100 | + |
| 101 | +#### 4.3.1 File Loading and Preprocessing |
| 102 | +```python |
| 103 | +from pathlib import Path |
| 104 | + |
| 105 | +# Newer LangChain releases expose these loaders via langchain_community.document_loaders |
| 106 | +from langchain.document_loaders import ( |
| 107 | +    JSONLoader, PyPDFLoader, TextLoader, UnstructuredFileLoader, |
| 108 | +) |
| 109 | + |
| 110 | +# Map each file extension to a callable that builds a loader from a path |
| 111 | +loaders = { |
| 112 | +    '.pdf': PyPDFLoader, |
| 113 | +    '.md': TextLoader, |
| 114 | +    '.txt': TextLoader, |
| 115 | +    # JSONLoader needs a jq_schema (and the jq package); '.' selects the whole document |
| 116 | +    '.json': lambda path: JSONLoader(path, jq_schema='.', text_content=False), |
| 117 | +    '.yaml': TextLoader, |
| 118 | +    '.docx': UnstructuredFileLoader,  # Requires unstructured |
| 119 | +    '.db': None  # Skipped here; requires custom handling |
| 120 | +} |
| 121 | + |
| 122 | +# Recursive directory loading |
| 123 | +def load_documents(root_path): |
| 124 | +    documents = [] |
| 125 | +    for item in Path(root_path).rglob('*'): |
| 126 | +        if item.is_file(): |
| 127 | +            ext = item.suffix.lower() |
| 128 | +            if ext in loaders and loaders[ext]: |
| 129 | +                loader = loaders[ext](str(item)) |
| 130 | +                documents.extend(loader.load()) |
| 131 | +    return documents |
| 132 | +``` |
| 133 | + |
| 134 | +#### 4.3.2 Intelligent Document Splitting |
| 135 | +##### Text Files (MD/TXT/YAML) |
| 136 | +```python |
| 137 | +from langchain.text_splitter import RecursiveCharacterTextSplitter |
| 138 | + |
| 139 | +text_splitter = RecursiveCharacterTextSplitter( |
| 140 | +    chunk_size=1000, |
| 141 | +    chunk_overlap=200, |
| 142 | +    separators=["\n\n", "\n", "。", "!", "?", "……", " "] |
| 143 | +) |
| 144 | +``` |
| 145 | + |
| 146 | +##### Structured Data (JSON/DB) |
| 147 | +```python |
| 148 | +# JSON-specific processing (from_tiktoken_encoder requires the tiktoken package) |
| 149 | +json_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder( |
| 150 | +    chunk_size=500, |
| 151 | +    separators=["}\n{", ",\n", "\n"]  # Split by JSON structure |
| 152 | +) |
| 153 | + |
| 154 | +# Database file processing: dump each table's name and first 10 rows as text |
| 155 | +def process_db(file_path): |
| 156 | +    import sqlite3 |
| 157 | +    conn = sqlite3.connect(file_path) |
| 158 | +    tables = conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall() |
| 159 | +    text_content = [] |
| 160 | +    for (name,) in tables: |
| 161 | +        rows = conn.execute(f'SELECT * FROM "{name}" LIMIT 10').fetchall() |
| 162 | +        text_content.append(f"## Table Structure: {name}\n{rows}") |
| 163 | +    conn.close()  # Release the file handle once all tables are read |
| 164 | +    return "\n".join(text_content) |
| 165 | +``` |
| 166 | + |
| 167 | +##### PDF/DOCX Documents |
| 168 | +```python |
| 169 | +from unstructured.partition.auto import partition |
| 170 | + |
| 171 | +def process_complex_file(file_path): |
| 172 | +    elements = partition(filename=file_path) |
| 173 | +    return "\n\n".join([str(el) for el in elements]) |
| 174 | +``` |
| 175 | + |
| 176 | +#### 4.3.3 Hybrid Splitting Process |
| 177 | +```python |
| 178 | +from langchain.schema import Document |
| 179 | + |
| 180 | +def split_documents(docs): |
| 181 | +    final_splits = [] |
| 182 | +    for doc in docs: |
| 183 | +        content, metadata = doc.page_content, doc.metadata |
| 184 | +        source = metadata.get('source', '') |
| 185 | + |
| 186 | +        if source.endswith('.json'): |
| 187 | +            splits = json_splitter.split_text(content) |
| 188 | +        elif source.endswith('.db'): |
| 189 | +            splits = text_splitter.split_text(process_db(source)) |
| 190 | +        else: |
| 191 | +            splits = text_splitter.split_text(content) |
| 192 | + |
| 193 | +        # Wrap each chunk in a Document that keeps a copy of the source metadata |
| 194 | +        for split in splits: |
| 195 | +            final_splits.append( |
| 196 | +                Document(page_content=split, metadata=dict(metadata)) |
| 197 | +            ) |
| 198 | +    return final_splits |
| 199 | +``` |
| 200 | + |
| 201 | +#### 4.3.4 Vectorization and Storage |
| 202 | +```python |
| 203 | +from langchain.embeddings import HuggingFaceEmbeddings |
| 204 | +from langchain.vectorstores import FAISS |
| 205 | + |
| 206 | +# Embedding model (requires sentence-transformers) |
| 207 | +embedding = HuggingFaceEmbeddings( |
| 208 | +    model_name="GanymedeNil/text2vec-large-chinese", |
| 209 | +    encode_kwargs={'normalize_embeddings': True} |
| 210 | +) |
| 211 | + |
| 212 | +# Processing and storage (FAISS requires faiss-cpu or faiss-gpu) |
| 213 | +loaded_docs = load_documents("./knowledge_base") |
| 214 | +split_docs = split_documents(loaded_docs) |
| 215 | +vector_db = FAISS.from_documents(split_docs, embedding) |
| 216 | + |
| 217 | +# Save index |
| 218 | +vector_db.save_local("./vector_store") |
| 219 | +``` |
| 220 | + |
| 221 | +### 4.4 Testing Solution |
| 222 | + |
| 223 | +For the experimental design, we need four configurations to compare LLM and LLM+RAG performance on government data: |
| 224 | +1. LLM only |
| 225 | +2. LLM + RAG (only relevant knowledge data) |
| 226 | +3. LLM + RAG (all knowledge data) |
| 227 | +4. LLM + RAG (only irrelevant knowledge data) |
| 228 | + |
| 229 | +Using "Beijing" as the example test edge node: |
| 230 | + |
| 231 | +| Type | Test Node | Knowledge Base | Capability Measured | |
| 232 | +| --- | --- | --- | --- | |
| 233 | +| Type 1 | Beijing | No knowledge base | LLM baseline capability | |
| 234 | +| Type 2 | Beijing | Beijing-relevant knowledge | LLM learning capability | |
| 235 | +| Type 3 | Beijing | All provinces' knowledge | LLM search/retrieval capability | |
| 236 | +| Type 4 | Beijing | All non-Beijing knowledge | LLM generalization capability | |
| 237 | + |
| 238 | +Type 1 has been tested previously. Types 2-4 are new RAG-enhanced comparisons. |
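The knowledge base switching behind the four types can be sketched as a selection function. This is a minimal illustration, assuming the per-province knowledge bases are held in a plain dict keyed by province name; the `select_knowledge` helper and the placeholder passages are hypothetical.

```python
def select_knowledge(mode, test_node, knowledge_bases):
    """Return the passages visible to the retriever for a test type (1-4)."""
    if mode == 1:  # Type 1: no knowledge base
        return []
    if mode == 2:  # Type 2: only the tested node's knowledge
        return list(knowledge_bases.get(test_node, []))
    if mode == 3:  # Type 3: knowledge from all nodes
        return [p for kb in knowledge_bases.values() for p in kb]
    if mode == 4:  # Type 4: knowledge from every node except the tested one
        return [p for node, kb in knowledge_bases.items() if node != test_node for p in kb]
    raise ValueError(f"unknown test type: {mode}")


# Placeholder passages (hypothetical data, for illustration only)
knowledge_bases = {"Beijing": ["bj-1", "bj-2"], "Shanghai": ["sh-1"], "Guangdong": ["gd-1"]}
selected = {mode: select_knowledge(mode, "Beijing", knowledge_bases) for mode in (1, 2, 3, 4)}
```

Selecting passages before vectorization keeps the retrieval code identical across types; only the input to the indexing step changes.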
| 239 | + |
| 240 | +We will average the Accuracy scores across all four types to obtain the final government capability score for each region. |
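The averaging step can be sketched as follows, assuming an unweighted mean over the four types. The `regional_score` helper is hypothetical and the accuracy values are made-up placeholders, not measured results.

```python
from statistics import mean

TEST_TYPES = (1, 2, 3, 4)


def regional_score(accuracy_by_type):
    """Average the Accuracy of Types 1-4 into one score for a region."""
    missing = set(TEST_TYPES) - set(accuracy_by_type)
    if missing:
        raise ValueError(f"missing accuracy for types: {sorted(missing)}")
    return mean(accuracy_by_type[t] for t in TEST_TYPES)


# Placeholder accuracies for one region (illustrative values only)
beijing_score = regional_score({1: 0.5, 2: 0.75, 3: 0.75, 4: 0.5})
```

Requiring all four types before averaging avoids silently comparing regions that were scored on different subsets of tests.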
| 241 | + |
| 242 | +**Note: This benchmark has limitations and only partially reflects LLM capabilities for regional government questions.** |
| 243 | + |
| 244 | +The code implementation requires careful handling of knowledge base switching during cross-testing. These changes will be confined to the `examples` directory without modifying Core code. |
| 245 | + |
| 246 | +## 5. Limitations and Future Work |
| 247 | + |
| 248 | +The current proposal focuses on introducing RAG and leaves several directions unexplored: |
| 249 | +- **Incremental training**: The capability to adapt to new data remains untested |
| 250 | +- **Fine-tuning**: Domain adaptation through fine-tuning remains unevaluated |
| 251 | +- **Few-shot learning**: Requires further validation in specialized domains |
| 252 | + |
| 253 | +Future research should develop and test complete methodologies for domain-specific scenarios. |
| 254 | + |
| 255 | +## 6. Project Timeline |
| 256 | +| Phase | Timeline | |
| 257 | +|-----|-----| |
| 258 | +| Data Collection | 3.3-3.21 | |
| 259 | +| RAG Integration | 3.24-4.11 | |
| 260 | +| Benchmark Testing | 4.14-5.2 | |
| 261 | +| Performance Optimization | 5.5-5.23 | |
| 262 | +| Project Finalization | 5.26-5.30 | |