fix: eliminate redundant RAG model reloading that defeated paralleliz… #382
ARYANPATEL-BIT wants to merge 1 commit into kubeedge:main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: ARYANPATEL-BIT. The full list of commands accepted by this bot can be found here. Needs approval from an approver in each of these files. Approvers can indicate their approval by writing …
/assign @jaypume
Code Review
This pull request optimizes the RAG benchmarking process by initializing the model once during preprocessing to improve parallel performance. However, the reviewer noted that this change removes the logic for handling different retrieval scopes (global, local, and other), which breaks the benchmarking comparisons. Additional feedback pointed out a hardcoded absolute path that hinders portability and suggested consolidating redundant initialization logic.
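On the portability point, one common way to remove the hardcoded `/home/icyfeather/models/bge-m3` path is an environment-variable override. A minimal sketch, assuming a variable name `RAG_MODEL_PATH` and a fallback value that are both hypothetical (neither appears in the PR):

```python
import os

# Assumed fallback for illustration; the real default would be project-specific.
DEFAULT_MODEL_PATH = "bge-m3"

def resolve_model_path() -> str:
    """Return the embedding-model path, preferring an env-var override."""
    return os.environ.get("RAG_MODEL_PATH", DEFAULT_MODEL_PATH)
```

The `GovernmentRAG(model_name=...)` call sites could then use `resolve_model_path()` instead of the absolute home-directory path.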
```python
else:
    # Run the embedding-based retrieval under the GPU lock.
    # This is a quick vector search, not a full model reload.
    with self.gpu_lock:
        if rag_type == "[global]":
            if self.rag is None:
                self.rag = GovernmentRAG(model_name="/home/icyfeather/models/bge-m3", device="cuda", persist_directory="./chroma_db")
        elif rag_type == "[local]":
            self.rag = GovernmentRAG(model_name="/home/icyfeather/models/bge-m3", device="cuda", persist_directory="./chroma_db", provinces=[location])
        else:  # [other]
            all_locations = set(self.all_locations)
            self.rag = GovernmentRAG(model_name="/home/icyfeather/models/bge-m3", device="cuda", persist_directory="./chroma_db", provinces=list(all_locations - set([location])))

        relevant_docs = self.rag.query(query, k=1)
```
The logic for handling different rag_type values ([local], [other]) has been removed. Previously, these types triggered the creation of a GovernmentRAG instance with specific provinces filters. Now, all RAG queries use the same global instance initialized in preprocess(), which ignores the location and rag_type parameters for retrieval. This breaks the benchmarking logic intended to compare different retrieval scopes (global vs local vs other).
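One way to reconcile both concerns (keep distinct province filters per scope, yet load each heavy model only once) is to cache one retriever instance per scope key. A hedged sketch, not the PR's code, where `make_rag` is a hypothetical stand-in for the `GovernmentRAG` constructor:

```python
from threading import Lock

class ScopedRagCache:
    """Cache one retriever per retrieval scope ([global]/[local]/[other])."""

    def __init__(self, make_rag):
        self._make_rag = make_rag  # callable(provinces) -> retriever (assumed)
        self._cache = {}
        self._lock = Lock()

    def get(self, rag_type, location, all_locations):
        # Map the scope to a cache key and a province filter.
        if rag_type == "[global]":
            key, provinces = ("global",), None
        elif rag_type == "[local]":
            key, provinces = ("local", location), [location]
        else:  # [other]
            others = sorted(set(all_locations) - {location})
            key, provinces = ("other", location), others
        with self._lock:
            # Build each scoped instance at most once; later calls reuse it.
            if key not in self._cache:
                self._cache[key] = self._make_rag(provinces)
            return self._cache[key]
```

With this shape, the one-time load still happens outside the per-query hot path, but `[local]` and `[other]` comparisons keep their filtered corpora.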
```python
# ChromaDB each time. That made the ThreadPoolExecutor useless
# since all threads were serialized by the gpu_lock while waiting
# for the slow model loading to finish.
self.rag = GovernmentRAG(model_name="/home/icyfeather/models/bge-m3", device="cuda", persist_directory="./chroma_db")
```
```python
if self.rag is None:
    LOGGER.info("RAG not initialized yet, loading now...")
    self.rag = GovernmentRAG(
        model_name="/home/icyfeather/models/bge-m3",
        device="cuda",
        persist_directory="./chroma_db"
    )
```
…ation
Signed-off-by: Aryan Patel <aryan.patel7291@gmail.com>
f04e3a9 to 462a739
Overview
This PR resolves a critical architectural issue in the Government RAG example where redundant model initialization inside a mutex completely negated parallel execution.
Problem
The `process_query()` function instantiated a new `GovernmentRAG` object for nearly every query, and each instantiation reloaded the `bge-m3` embedding model. Because this logic ran inside `self.gpu_lock`, every worker thread in the `ThreadPoolExecutor(max_workers=4)` was forced to wait for the lock while the slow model load finished. As a result, the system behaved like a single-threaded pipeline despite being designed for parallel execution.
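The serialization effect is easy to reproduce in isolation. A minimal timing demo with assumed numbers (a 50 ms sleep standing in for the slow model load, not the PR's code): when every worker pays the load cost while holding the lock, four workers take roughly four times the load time.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

LOAD_TIME = 0.05  # stand-in for the slow bge-m3 / ChromaDB load
lock = Lock()

def query_with_reload(_):
    # Every thread "reloads" under the mutex, so the pool degenerates
    # into a serial queue: total time ~ workers * LOAD_TIME.
    with lock:
        time.sleep(LOAD_TIME)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(query_with_reload, range(4)))
serialized = time.perf_counter() - start
```

Moving the expensive initialization out of the per-query critical section is what restores the expected overlap.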
Solution
This PR aligns the implementation with a standard single-instance retrieval architecture:

- **Single initialization:** `GovernmentRAG` is initialized once during `preprocess()`
- **Optimized lock scope:** `gpu_lock` now only protects the vector similarity search (`self.rag.query()`)
- **True parallelism for LLM calls:** `get_model_response()` is moved outside the lock
- **Fail-safe initialization:** the RAG is initialized lazily in `predict()` if `preprocess()` is skipped

Impact
Summary
This change removes a fundamental bottleneck that was silently disabling parallelism, ensuring the benchmarking pipeline performs as intended under concurrent workloads.
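The overall flow described above can be sketched as a small skeleton. This is a hedged illustration, not the PR's actual class: `make_rag` and `call_llm` are hypothetical stand-ins for the `GovernmentRAG` constructor and `get_model_response()`.

```python
from threading import Lock

class RagPipeline:
    def __init__(self, make_rag, call_llm):
        self._make_rag = make_rag   # stand-in for GovernmentRAG(...)
        self._call_llm = call_llm   # stand-in for get_model_response()
        self.rag = None
        self.gpu_lock = Lock()

    def preprocess(self):
        # Single initialization: the heavy model load happens once, up front.
        self.rag = self._make_rag()

    def predict(self, query):
        # Fail-safe: initialize lazily if preprocess() was skipped.
        if self.rag is None:
            with self.gpu_lock:
                if self.rag is None:
                    self.rag = self._make_rag()
        # Lock scope covers only the quick vector search.
        with self.gpu_lock:
            docs = self.rag.query(query, k=1)
        # The LLM call runs outside the lock, so worker threads overlap here.
        return self._call_llm(query, docs)
```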
## Fixes: #380, #381