In the know_your_business domain, all 34 tasks labeled "awaiting information" in the CSV have stronger escalation signals than the majority of tasks labeled "escalate." An agent that faithfully follows the SOP's explicit escalation rules will predict "escalate" for these tasks and be scored as incorrect.
Evidence
The SOP defines clear escalation triggers:
- Step 3: Escalate if TIN format is invalid or license expired >42 days
- Step 4: Pay attention to
offshore_jurisdiction_flag
- Step 6: Evaluate
shell_company_suspected
- Step 7: Escalate on sanctions matches and PEP status
- Step 12: "Triggering any of the above checks" → escalate
The CSV ground truth for the three categories:
| Signal |
"awaiting info" (34 tasks) |
"escalate" (24 tasks) |
"approve" (32 tasks) |
| shell_company_suspected=True |
34/34 (100%) |
16/24 (67%) |
0/32 (0%) |
| offshore_jurisdiction_flag=True |
34/34 (100%) |
16/24 (67%) |
0/32 (0%) |
| At least one sanctions "Matched" |
~34/34 |
~16/24 |
0/32 |
| Mean risk_score |
0.93 |
0.66 |
0.18 |
| registration_status=Inactive |
17/34 (50%) |
6/24 (25%) |
0/32 (0%) |
Every "awaiting information" task has shell company suspected, offshore jurisdiction, and sanctions matches. Only two-thirds of "escalate" tasks have these signals. Eight "escalate" tasks have zero of these flags (clear sanctions, no shell, no offshore, Active registration) — their risk scores are as low as 0.15.
The SOP's Step 12 says: "Determine escalation_status based on cumulative risk assessment, triggering 'escalate' status if your review triggers any of the above checks." Under this rule, all 34 "awaiting information" tasks should be "escalate."
The implicit rule
The SOP's Step 1 says: "If you believe that there are some irregularities here, please set the status as 'awaiting information' so that associates can reach out to these businesses for additional data." This refers to profile-level issues (mismatched names, suspicious websites, incomplete addresses).
The CSV labels appear to follow an implicit priority rule: if the business profile has irregularities in Step 1, the final status is "awaiting information" regardless of what downstream tools find (sanctions, shell companies, etc.). This priority rule is not stated in the SOP. In fact, it contradicts Step 12's instruction to escalate "if your review triggers any of the above checks."
The paper lists "implicit knowledge that humans learn but rarely document" as a benchmark challenge. However, the agent evaluates each task in isolation with no access to ground truth labels, so there is no mechanism for it to learn an unstated priority rule from the data.
Impact on benchmark scores
We evaluated this domain with Gemini 3.1 Pro (90 tasks):
- Raw TSR: 42.2% (38/90)
- On SOP-consistent tasks (excluding the 31 disagreements): 64.4% (38/59)
The best published baseline is 58% (Claude 4.5 Opus ReAct). It is unclear how much of the gap between top baselines and 100% is attributable to this labeling issue vs genuine reasoning difficulty.
This is consistent with a pattern across several SOP-Bench domains where CSV ground truth follows labeling rules not present in the SOP:
referral_abuse_detection_v1: 9 tasks follow a closure-priority rule not in the SOP (#4)
traffic_spoofing_detection: 39 tasks labeled "Warning Issued" for Medium risk when SOP says "Temporary Suspension" (#5)
Suggested Fix
Either:
- Update the SOP to explicitly state the priority rule: "If Step 1 identifies profile irregularities requiring additional data, set status to 'awaiting information' even if downstream checks trigger escalation criteria."
- Relabel the 34 "awaiting information" tasks based on the SOP's current rules (most would become "escalate").
Environment
- SOP-Bench commit:
156e9ecd60f42c43e4f3a12824e466afff21e9d8 (2026-02-22, initial release)
- Domain:
know_your_business
- Files examined:
test_set_with_outputs.csv, sop.txt
- Paper: SOP-Bench v2 (2026-02-23), https://ar5iv.labs.arxiv.org/html/2506.08119
In the
know_your_businessdomain, all 34 tasks labeled "awaiting information" in the CSV have stronger escalation signals than the majority of tasks labeled "escalate." An agent that faithfully follows the SOP's explicit escalation rules will predict "escalate" for these tasks and be scored as incorrect.Evidence
The SOP defines clear escalation triggers:
offshore_jurisdiction_flagshell_company_suspectedThe CSV ground truth for the three categories:
Every "awaiting information" task has shell company suspected, offshore jurisdiction, and sanctions matches. Only two-thirds of "escalate" tasks have these signals. Eight "escalate" tasks have zero of these flags (clear sanctions, no shell, no offshore, Active registration) — their risk scores are as low as 0.15.
The SOP's Step 12 says: "Determine
escalation_statusbased on cumulative risk assessment, triggering 'escalate' status if your review triggers any of the above checks." Under this rule, all 34 "awaiting information" tasks should be "escalate."The implicit rule
The SOP's Step 1 says: "If you believe that there are some irregularities here, please set the status as 'awaiting information' so that associates can reach out to these businesses for additional data." This refers to profile-level issues (mismatched names, suspicious websites, incomplete addresses).
The CSV labels appear to follow an implicit priority rule: if the business profile has irregularities in Step 1, the final status is "awaiting information" regardless of what downstream tools find (sanctions, shell companies, etc.). This priority rule is not stated in the SOP. In fact, it contradicts Step 12's instruction to escalate "if your review triggers any of the above checks."
The paper lists "implicit knowledge that humans learn but rarely document" as a benchmark challenge. However, the agent evaluates each task in isolation with no access to ground truth labels, so there is no mechanism for it to learn an unstated priority rule from the data.
Impact on benchmark scores
We evaluated this domain with Gemini 3.1 Pro (90 tasks):
The best published baseline is 58% (Claude 4.5 Opus ReAct). It is unclear how much of the gap between top baselines and 100% is attributable to this labeling issue vs genuine reasoning difficulty.
This is consistent with a pattern across several SOP-Bench domains where CSV ground truth follows labeling rules not present in the SOP:
referral_abuse_detection_v1: 9 tasks follow a closure-priority rule not in the SOP (#4)traffic_spoofing_detection: 39 tasks labeled "Warning Issued" for Medium risk when SOP says "Temporary Suspension" (#5)Suggested Fix
Either:
Environment
156e9ecd60f42c43e4f3a12824e466afff21e9d8(2026-02-22, initial release)know_your_businesstest_set_with_outputs.csv,sop.txt