Skip to content

feat(ai): add Contradiction Agent#6304

Draft
ConnorYoh wants to merge 1 commit intomainfrom
feat/contradiction-agent
Draft

feat(ai): add Contradiction Agent#6304
ConnorYoh wants to merge 1 commit intomainfrom
feat/contradiction-agent

Conversation

@ConnorYoh
Copy link
Copy Markdown
Member

Summary

A new AI specialist agent that finds textual contradictions across a PDF — arguments, claimed facts, points of view, recommendations — and is invoked as a tool by the existing Review and Question agents using the same two-turn handshake the Math Auditor uses.

The Math Auditor catches numeric inconsistencies; nothing today catches textual ones (e.g. p.2 "the deadline is March 5" vs p.7 "submissions close on April 1"). This closes that gap.

How it works

  • Two-round flow mirroring the math agent: examine() triages which pages need text/OCR, deliberate() does the work.
  • Per-page parallel claim extraction under a semaphore (cap 10), then ONE fast-model LLM call canonicalises subjects across the document.
  • Bucketed detection: claims grouped by canonical subject, with one batched detector LLM call per bucket (cap 5). Buckets larger than 12 claims are chunked with a 2-claim overlap so no claim is silently dropped.
  • Pre-filter heuristics before the detector: drop identical-quote pairs and same-page same-polarity paraphrases.
  • Review surface: each Contradiction yields TWO sticky-note CommentSpecs cross-referencing each other across pages.
  • Question surface: synthesises a prose answer quoting both conflicting passages verbatim.

Architect findings addressed

# Status
C1 source_tool == endpoint path ✅ locked down by AiWorkflowServiceContradictionTest
C2 discriminated union round-trip test_artifact_union.py
C3/C5 two semaphores, batched per-bucket ✅ extract=10, detect=5
C4 chunked detection, no silent drops test_concurrency.py::test_worst_case_50_claim_bucket_finds_cross_chunk_pair
C6 subject canonicalisation default-on ✅ with lexical fallback on LLM failure
C7 combined math+contradiction intent ⚠️ v1 limitation — math is dropped; pinned by test_combined_intent.py
C8 shared _throttled ✅ math agent migrated to agents/_concurrency.py
C11 separate _PairedLocalisedContradiction _LocalisedComment untouched

Hardening

  • N4 prompt-injection — every synth/localiser prompt now wraps verdict JSON and user message in <verdict> / <user_message> tags with an explicit untrusted-data preamble. Applied to math and contradiction paths.
  • N5 Java ClaimPolarity enum mirrors the Python Literal["assert","deny","recommend","reject","neutral"]. Unknown values fail early instead of drifting silently.
  • N6 pages_examined semantics now reports only pages whose claims were actually checked; blank folios are excluded.

Limitations (documented)

  • Combined math + contradiction intent on a single prompt drops math silently. Documented in module docstrings of pdf_review.py / pdf_questions.py and pinned by test_combined_intent.py. Revisit when there's real-corpus data on combined-prompt frequency.
  • Cross-bucket pairs farther apart than the chunk overlap (>10 indices) are not detected. Documented in test_concurrency.py.

Test plan

  • pytest engine/tests/205/205 pass
  • ./gradlew :proprietary:testgreen, coverage targets met
  • Math-auditor regression suite passes unchanged
  • Discriminated-union round-trip covers math, contradiction, mixed, and source_tool-omitted payloads
  • Worst-case 50-claim bucket — cross-chunk pair detected via overlap
  • Concurrency assertions: extract saturates at exactly 10, detect at exactly 5
  • Java orchestrator never calls extractTablesAsCsv (verified)

A new specialist agent that detects textual contradictions across a PDF —
arguments, claimed facts, points of view, recommendations — and is
invoked as a tool by the existing Review and Question agents using the
same two-turn handshake the Math Auditor uses.

Why
- Math Auditor catches numeric inconsistencies; nothing today catches
  textual ones (e.g. p.2 "the deadline is March 5" vs p.7 "submissions
  close on April 1"). This closes that gap.

How it works
- Two-round flow: examine() triages which pages need text/OCR, then
  deliberate() extracts atomic claims per page in parallel under a
  semaphore (cap 10), canonicalises subjects via one fast-model LLM
  call, buckets claims by subject, and runs one batched detector LLM
  call per bucket (cap 5) to enumerate contradicting pairs. Buckets
  larger than 12 claims are chunked with overlap so no claim is silently
  dropped.
- Review surface: each Contradiction yields TWO sticky-note CommentSpecs
  cross-referencing each other across pages.
- Question surface: synthesises a prose answer that quotes both
  conflicting passages verbatim.

Architect findings addressed (see plan Section 6)
- C1 source_tool == endpoint path (locked down by test).
- C2 ToolReportArtifact lifted to discriminated union on source_tool.
- C3/C5 two semaphores (extract=10, detect=5); per-bucket batched calls.
- C4 chunked detection with overlap; no silent claim drops.
- C6 subject canonicalisation default-on with lexical fallback.
- C7 v1 limitation: combined math+contradiction intent drops math
  silently (precedence test pins this in test_combined_intent.py).
- C8 _throttled extracted into agents/_concurrency.py and the math
  agent migrated off its private copy.
- C11 separate _PairedLocalisedContradiction model; _LocalisedComment
  unchanged.

Hardening
- N4 prompt-injection: untrusted-data preamble + <user_message>/<verdict>
  delimiters on every synth/localiser prompt (math + contradiction).
- N5 Java ClaimPolarity enum mirrors the Python Literal.
- N6 pages_examined now reports only pages whose claims were actually
  checked; blank folios are excluded.

Tests
- Python 205/205 pass (claim ledger, agent flow, routes, artifact
  union, review/question resume, concurrency saturation, combined
  intent precedence).
- Java proprietary suite green; coverage targets met.
@dosubot dosubot Bot added size:XXL This PR changes 1000+ lines ignoring generated files. enhancement New feature or request labels May 1, 2026
@stirlingbot stirlingbot Bot added Java Pull requests that update Java code Test Testing-related issues or pull requests engine labels May 1, 2026
@ConnorYoh ConnorYoh marked this pull request as draft May 1, 2026 17:19
"[contradiction-agent] session=%s step 2: extracting claims from %d pages (parallel, max=%d)",
evidence.session_id,
len(folios_with_text),
self._extract_semaphore._value, # advisory — initial value
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading asyncio.Semaphore._value (self._extract_semaphore._value) accesses a semaphore's private mutable internals concurrently; avoid reading private attributes or use a thread-safe API (e.g., track capacity separately or omit the volatile value).

Details

✨ AI Reasoning
​A singleton ContradictionAgent is created at startup and used concurrently by incoming requests. The code reads self._extract_semaphore._value (a private attribute of asyncio.Semaphore) for logging while other coroutines can be acquiring/releasing the same semaphore, causing a race / inconsistent diagnostic and relying on a private, internal field.

🔧 How do I fix it?
Use locks, concurrent collections, or atomic operations when accessing shared mutable state. Avoid modifying collections during iteration. Use proper synchronization primitives like mutex, lock, or thread-safe data structures.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

// -----------------------------------------------------------------------

@SafeVarargs
private static List<Integer> union(List<Integer>... lists) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

union() does result.contains(page) inside nested loops causing O(n^2) work; use a Set to dedupe (or collect to a Set then sort) to make it linear-time.

Details

✨ AI Reasoning
​The union method builds a deduplicated list by iterating each input list and calling result.contains(page) for each element. This causes quadratic work as result grows. For page lists (document pages) the size can be large enough that O(n^2) allocations and repeated scans are avoidable; using a hashed Set for membership or leveraging a single pass collection-to-Set would make it linear-time and reduce allocations. The defect is localized to the deduplication loop and its membership check, which is executed during requisition fulfilment and could be invoked on many pages per audit.

🔧 How do I fix it?
Move constant work outside loops. Use StringBuilder instead of string concatenation in loops. Cache compiled regex patterns. Use hash-based lookups instead of nested loops. Batch database operations instead of N+1 queries.

Reply @AikidoSec feedback: [FEEDBACK] to get better review comments in the future.
Reply @AikidoSec ignore: [REASON] to ignore this issue.
More info

@stirlingbot
Copy link
Copy Markdown
Contributor

stirlingbot Bot commented May 1, 2026

🚀 V2 Auto-Deployment Complete!

Your V2 PR with embedded architecture has been deployed!

🔗 Direct Test URL (non-SSL) http://54.175.155.236:6304

🔐 Secure HTTPS URL: https://6304.ssl.stirlingpdf.cloud

This deployment will be automatically cleaned up when the PR is closed.

🔄 Auto-deployed for approved V2 contributors.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

engine enhancement New feature or request Java Pull requests that update Java code size:XXL This PR changes 1000+ lines ignoring generated files. Test Testing-related issues or pull requests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant