Skip to content

fix: eliminate redundant RAG model reloading that defeated paralleliz…#382

Open
ARYANPATEL-BIT wants to merge 1 commit intokubeedge:mainfrom
ARYANPATEL-BIT:perf/gov-rag-model-loading
Open

fix: eliminate redundant RAG model reloading that defeated paralleliz…#382
ARYANPATEL-BIT wants to merge 1 commit intokubeedge:mainfrom
ARYANPATEL-BIT:perf/gov-rag-model-loading

Conversation

@ARYANPATEL-BIT
Copy link
Copy Markdown

@ARYANPATEL-BIT ARYANPATEL-BIT commented Apr 10, 2026

Overview

This PR resolves a critical architectural issue in the Government RAG example where redundant model initialization inside a mutex completely negated parallel execution.


Problem

The process_query() function instantiated a new GovernmentRAG object for nearly every query. Each instantiation:

  • Reloaded the bge-m3 embedding model
  • Reinitialized ChromaDB connections

This logic was executed inside a self.gpu_lock, causing:

  • Thread serialization: All threads in ThreadPoolExecutor(max_workers=4) were forced to wait for the lock
  • Severe performance degradation: Expensive model loading occurred repeatedly per query
  • Resource inefficiency: Excessive VRAM usage, memory fragmentation, and unnecessary I/O overhead

As a result, the system behaved like a single-threaded pipeline despite being designed for parallel execution.


Solution

This PR aligns the implementation with a standard single-instance retrieval architecture:

  • Single initialization

    • GovernmentRAG is initialized once during preprocess()
    • The same instance is reused across all queries
  • Optimized lock scope

    • gpu_lock now only protects the vector similarity search (self.rag.query())
    • Removes unnecessary blocking around model setup
  • True parallelism for LLM calls

    • get_model_response() is moved outside the lock
    • Enables concurrent API calls and response generation
  • Fail-safe initialization

    • Adds a fallback initialization in predict() if preprocess() is skipped

Impact

  • Restores actual multi-threaded performance
  • Eliminates redundant model reloads
  • Reduces GPU/VRAM pressure and prevents memory fragmentation
  • Significantly improves benchmark execution speed and stability

Summary

This change removes a fundamental bottleneck that was silently disabling parallelism, ensuring the benchmarking pipeline performs as intended under concurrent workloads.

##Fixes: #380 381

@kubeedge-bot
Copy link
Copy Markdown
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ARYANPATEL-BIT
To complete the pull request process, please assign jaypume after the PR has been reviewed.
You can assign the PR to them by writing /assign @jaypume in a comment when ready.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubeedge-bot kubeedge-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 10, 2026
@ARYANPATEL-BIT
Copy link
Copy Markdown
Author

/assign @jaypume

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes the RAG benchmarking process by initializing the model once during preprocessing to improve parallel performance. However, the reviewer noted that this change removes the logic for handling different retrieval scopes (global, local, and other), which breaks the benchmarking comparisons. Additional feedback pointed out a hardcoded absolute path that hinders portability and suggested consolidating redundant initialization logic.

Comment on lines 186 to 190
else:
# Run the embedding-based retrieval under the GPU lock.
# This is a quick vector search, not a full model reload.
with self.gpu_lock:
if rag_type == "[global]":
if self.rag is None:
self.rag = GovernmentRAG(model_name="/home/icyfeather/models/bge-m3", device="cuda", persist_directory="./chroma_db")
elif rag_type == "[local]":
self.rag = GovernmentRAG(model_name="/home/icyfeather/models/bge-m3", device="cuda", persist_directory="./chroma_db", provinces=[location])
else: # [other]
all_locations = set(self.all_locations)
self.rag = GovernmentRAG(model_name="/home/icyfeather/models/bge-m3", device="cuda", persist_directory="./chroma_db", provinces=list(all_locations - set([location])))

relevant_docs = self.rag.query(query, k=1)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The logic for handling different rag_type values ([local], [other]) has been removed. Previously, these types triggered the creation of a GovernmentRAG instance with specific provinces filters. Now, all RAG queries use the same global instance initialized in preprocess(), which ignores the location and rag_type parameters for retrieval. This breaks the benchmarking logic intended to compare different retrieval scopes (global vs local vs other).

# ChromaDB each time. That made the ThreadPoolExecutor useless
# since all threads were serialized by the gpu_lock while waiting
# for the slow model loading to finish.
self.rag = GovernmentRAG(model_name="/home/icyfeather/models/bge-m3", device="cuda", persist_directory="./chroma_db")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The model_name parameter is set to a hardcoded absolute path (/home/icyfeather/models/bge-m3). This makes the code non-portable and will cause it to fail on any system where this specific path does not exist. Consider using a configuration parameter or a relative path.

Comment on lines +211 to +217
if self.rag is None:
LOGGER.info("RAG not initialized yet, loading now...")
self.rag = GovernmentRAG(
model_name="/home/icyfeather/models/bge-m3",
device="cuda",
persist_directory="./chroma_db"
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The initialization logic for GovernmentRAG is duplicated here. To ensure consistency and avoid repeating hardcoded parameters, you should call self.preprocess() instead.

        if self.rag is None:
            LOGGER.info("RAG not initialized yet, loading now...")
            self.preprocess()

…ation

Signed-off-by: Aryan Patel <aryan.patel7291@gmail.com>
@ARYANPATEL-BIT ARYANPATEL-BIT force-pushed the perf/gov-rag-model-loading branch from f04e3a9 to 462a739 Compare April 10, 2026 20:06
@MooreZheng MooreZheng requested review from IcyFeather233 and hsj576 and removed request for Poorunga April 13, 2026 11:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants