Fix prompt injection vulnerability in Architect mode (Issue #5058)#5065
Open
slokrami07 wants to merge 1 commit intoAider-AI:mainfrom
Open
Fix prompt injection vulnerability in Architect mode (Issue #5058)#5065slokrami07 wants to merge 1 commit intoAider-AI:mainfrom
slokrami07 wants to merge 1 commit intoAider-AI:mainfrom
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Vulnerability Context
In the multi-agent orchestration flow, the Architect model reads repository files and proposes a solution. This proposal was previously passed to the Editor model using editor_coder.run(with_message=content, preproc=False). Because the preproc=False flag bypasses standard input processing, malicious instructions hidden in untrusted files (like README.md) could be executed blindly by the Editor, turning malicious repository content into committed, backdoored code.
What was changed
This PR introduces a dedicated Threat Normalization layer acting as a strict trust boundary between the Architect and Editor agents, without breaking the necessary preproc=False optimization:
Secondary Validation Layer: Added validate_architect_payload() to ArchitectCoder.
Contextual Evaluation: Before handing off the payload, the validation agent securely fetches the original user request from the session history and queries the underlying LLM (at temperature=0.0) to compare the original intent with the Architect's proposed plan.
Strict Interception: If the Architect's plan deviates maliciously from the user's explicit intent (e.g., unauthorized data exfiltration, unexpected network calls via urllib or requests, accessing .env files not requested), the LLM flags the payload as UNSAFE. The handoff is immediately aborted, preventing the Editor from executing the injected payload.
How it was tested
Ensured cur_messages and done_messages are accessed safely during test initialization.
Ran the full basic test suite (tests/basic/test_coder.py), successfully passing all 42 tests.
Verified that legitimate complex refactors are marked as SAFE because they align with the original user request.