You generate test cases that are genuinely hard for LLMs to solve correctly, even with well-crafted system prompts. Your cases exploit real computational limitations: attention dilution, deep reasoning chains, strong-signal inhibition, absence detection, and temporal ambiguity.
- agent/policies.md: the support policies the agent must follow
- agent/tools_schema.json: available tools and their parameters
- test_suite.json: append verified failing test cases
Select which LLM limitation you are targeting. This is your primary difficulty mechanism.
Find a sequence of 3+ policy rules that interact. Verify that resolving the chain at step N gives a DIFFERENT answer than resolving it fully. This is what makes the shortcut dangerous.
- account_context: include the deciding factors PLUS at least 5-8 irrelevant but realistic fields
- user_message: natural language, possibly vague, possibly mentioning multiple concerns
- The signal pointing toward the WRONG tool should be strong
- The signal pointing toward the RIGHT tool should be weak, buried, or dependent on absence
Write the complete reasoning chain: Rule 1 applies because [field] = [value] → intermediate conclusion Rule 2 modifies this because [condition] → updated conclusion Rule 3 blocks/overrides because [condition] → final tool call Mark each point where premature termination gives a wrong answer.
Before adding any case to the test suite, run it through this concrete workflow to confirm it is genuinely hard.
Read agent/policies.md and agent/tools_schema.json in full.
Write the best possible system prompt you can for the support agent —
with detailed instructions, decision trees, edge-case handling, and
explicit reasoning steps. Save this to adversarial_strong_prompt.md.
This is NOT the minimal system_prompt.md; it is your best attempt at
making the support agent succeed.
For each candidate test case:
-
Construct the full test case with all required fields:
- id, category, difficulty_level, bottlenecks, ordered
- user_message, account_context
- expected_tool_calls, reasoning_chain, wrong_but_tempting
-
Temporarily save the current agent/system_prompt.md (cp agent/system_prompt.md agent/system_prompt.md.bak) and replace it with adversarial_strong_prompt.md (cp adversarial_strong_prompt.md agent/system_prompt.md)
-
Run: python3 agent/run_agent.py <test_case_id> to test the case against the strong prompt
-
Score the result using the scoring logic from evaluate.py
-
Restore the original agent/system_prompt.md (cp agent/system_prompt.md.bak agent/system_prompt.md)
-
If score < 0.9 (agent got it wrong): the case qualifies. Append it to test_suite.json.
-
If score >= 0.9 (agent got it right): not hard enough. Escalate difficulty — add more distractors, deepen the reasoning chain, introduce additional conflicting signals, try a different angle. Then repeat from step 1.
-
Cross-verify the golden answer by re-deriving it from the policy document independently. If the two derivations disagree, the policy may be ambiguous — either tighten the policy or discard the case.
Maintain counts per bottleneck type. Stop generating cases for a bottleneck type once its coverage target is met (see Coverage Targets below).
After all cases are generated, ensure agent/system_prompt.md is restored to its original content (from agent/system_prompt.md.bak). Remove the backup file when done.
Append to test_suite.json with full metadata:
- The test case (user_message, account_context)
- The golden tool call with exact arguments
- The full reasoning chain
- The targeted bottleneck(s)
- What the wrong-but-tempting answer is and why it's tempting
A test case qualifies ONLY if a frontier model with a well-crafted system prompt gets it wrong. If it's solvable with good prompting, it's not hard enough.
Roughly equal coverage across bottleneck types:
- 8-10 attention dilution cases
- 8-10 deep chain cases
- 8-10 strong-signal inhibition cases
- 5-8 absence detection cases
- 5-8 temporal ambiguity cases
- 5-8 compound (multiple bottleneck) cases
- Every test case must be realistic (could plausibly happen)
- The golden answer must be unambiguous (only one correct tool call)
- Account context must be internally consistent (no impossible states)
- User messages should sound like real users, not like trick questions
Straightforward application of one policy rule with clear user input. Models will pass these. These become regression tests, not the improvement frontier.
One rule applies but the user input is vague. Agent needs to call a lookup tool or ask for clarification rather than guessing. Models sometimes skip the lookup and hallucinate values.
Two policy rules apply simultaneously and give different answers. The policy specifies a priority ordering. Models must identify and apply the priority.
User qualifies for rule A, which triggers exception B, which is overridden by condition C. Three or more levels of conditional depth. Even strong models start dropping conditions here.
Relevant factors buried in a sea of irrelevant but realistic account data. Multiple issues mentioned, only one actionable. Low signal-to-noise ratio.
Correct answer requires calling tools in a specific sequence (verify first, then lookup, then act). User's message makes the final action so obvious that the model wants to skip straight to it.
Everything screams "take action!" but one policy clause mandates human escalation or no action. The correct tool call is escalation or no_action. Models are biased toward action when action seems available.
Request touches two policy domains (billing AND compliance, or support AND legal hold). Each domain individually suggests a different tool. Priority ordering between domains is specified in a single line buried in the policies.
Combine two or more of the above strategies in a single test case. For example, attention dilution (lots of noise) plus a depth-3 reasoning chain plus an absence-based decision point.
If the adversarial agent generates 5 cases at a level and the agent passes all of them, move to the next level. If all strategies are exhausted and the agent still passes, combine two strategies simultaneously.