feat: compliance checker - study type detection, single LLM call, citation grounding #16

Open
GovindhKishore wants to merge 12 commits into ga4gh:main from GovindhKishore:feat/compliance-checker

@GovindhKishore

Closes #15

What This PR Does

Implements the compliance checking layer for GA4GH RegBot on top of the ingestion pipeline (#10) and hybrid retrieval engine (#13). Takes an uploaded consent form, detects the study type, retrieves relevant GA4GH policy chunks, and generates structured compliance verdicts with citations via a single LLM call.

Files Added

  • src/compliance/checker.py : full compliance check pipeline
  • src/compliance/prompts.py : LLM prompt template
  • src/compliance/__init__.py : module init
  • tests/test_checker.py : 9 unit tests

How It Works

Step 1 - Text Extraction
The uploaded consent form (PDF or TXT) is converted to raw text. Unsupported file types raise a ValueError that names the problem.
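
A minimal sketch of this step, not the code in checker.py. The function names are illustrative, and the use of pypdf for PDF parsing is an assumption, since this description does not say which PDF library the PR uses.

```python
from pathlib import Path


def _extract_pdf(path: str) -> str:
    """Assumed PDF backend (pypdf); the PR may use a different library."""
    from pypdf import PdfReader  # imported lazily so TXT-only use needs no extra dependency

    reader = PdfReader(path)
    # extract_text() can return an empty value for image-only pages
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_text(path: str) -> str:
    """Return raw text for a .txt or .pdf consent form; reject anything else."""
    suffix = Path(path).suffix.lower()
    if suffix == ".txt":
        return Path(path).read_text(encoding="utf-8")
    if suffix == ".pdf":
        return _extract_pdf(path)
    raise ValueError(
        f"Unsupported file type '{suffix or '(none)'}': expected .pdf or .txt"
    )
```

Dispatching on the file suffix keeps the failure mode explicit: anything other than the two supported types fails before any parsing is attempted.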

Step 2 - Study Type Detection
Consent form text is scanned against STUDY_TYPE_KEYWORDS from config.py. Returns a list of detected study types:

  • pediatric, familial, rare_disease, large_scale, clinical_genomic
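
As a rough illustration of the keyword scan, assuming case-insensitive substring matching (the actual matching rule is not stated here). The keyword lists below are hypothetical placeholders; the real STUDY_TYPE_KEYWORDS lives in config.py.

```python
# Hypothetical keyword lists, for illustration only.
STUDY_TYPE_KEYWORDS = {
    "pediatric": ["child", "minor", "parent or guardian"],
    "familial": ["family member", "relatives"],
    "rare_disease": ["rare disease"],
    "large_scale": ["biobank", "population cohort"],
    "clinical_genomic": ["clinical sequencing"],
}


def detect_study_types(text: str) -> list[str]:
    """Return every study type with at least one keyword present in the text."""
    lowered = text.lower()
    return [
        study_type
        for study_type, keywords in STUDY_TYPE_KEYWORDS.items()
        if any(keyword in lowered for keyword in keywords)
    ]
```

A form can match several study types at once, and a form that matches none yields an empty list, in which case only the universal checks would run.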

Step 3 - Build Check Queue
Universal checks (8) always run on every consent form. Conditional checks are appended per detected study type.
Example for a pediatric form: 8 universal + 2 pediatric = 10 checks.
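
The queue construction can be sketched as below. Only data_sharing_purpose, withdrawal_rights, and data_categories_disclosed are check names taken from this PR; every other name is a placeholder, and pairing each check with a subcategory is an assumption made so the retrieval step can filter on it.

```python
from typing import Optional

# Three names come from the PR description; the rest are placeholders.
UNIVERSAL_CHECKS = [
    "data_sharing_purpose",
    "withdrawal_rights",
    "data_categories_disclosed",
    "placeholder_universal_4",
    "placeholder_universal_5",
    "placeholder_universal_6",
    "placeholder_universal_7",
    "placeholder_universal_8",
]

# Placeholder conditional checks keyed by study type.
CONDITIONAL_CHECKS = {
    "pediatric": ["placeholder_pediatric_1", "placeholder_pediatric_2"],
}


def build_check_queue(study_types: list[str]) -> list[tuple[str, Optional[str]]]:
    """Universal checks first (no subcategory), then checks for each detected type."""
    queue: list[tuple[str, Optional[str]]] = [(c, None) for c in UNIVERSAL_CHECKS]
    for study_type in study_types:
        for check in CONDITIONAL_CHECKS.get(study_type, []):
            queue.append((check, study_type))
    return queue
```

With these placeholder tables, `build_check_queue(["pediatric"])` has 10 entries, matching the 8 + 2 example above.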

Step 4 - Retrieve GA4GH Chunks Upfront
For every check in the queue, retrieve() is called with the corresponding CHECK_QUERY string from config.py and, for conditional checks, an optional subcategory filter. All chunks are retrieved before the LLM is called.
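
A sketch of the upfront retrieval loop. The signature of retrieve() is assumed here (a query string plus a `subcategory` keyword argument), the query strings are invented, and `fake_retrieve` is a stand-in for the hybrid retrieval engine from #13.

```python
from typing import Callable, Optional

# Hypothetical query strings; the real CHECK_QUERY values live in config.py.
CHECK_QUERIES = {
    "withdrawal_rights": "participant right to withdraw consent",
    "placeholder_pediatric_1": "consent for research involving children",
}

Queue = list[tuple[str, Optional[str]]]


def retrieve_all(queue: Queue, retrieve: Callable[..., list[str]]) -> dict[str, list[str]]:
    """Call retrieve() once per check and collect all chunks before any LLM call."""
    chunks_by_check: dict[str, list[str]] = {}
    for check, subcategory in queue:
        chunks_by_check[check] = retrieve(CHECK_QUERIES[check], subcategory=subcategory)
    return chunks_by_check


def fake_retrieve(query: str, subcategory: Optional[str] = None) -> list[str]:
    """Stand-in for the real retrieval engine; returns one labelled dummy chunk."""
    return [f"chunk for '{query}' (filter={subcategory})"]


queue = [("withdrawal_rights", None), ("placeholder_pediatric_1", "pediatric")]
chunks = retrieve_all(queue, fake_retrieve)
```

Passing the retriever in as an argument keeps the loop testable without a vector store or index on disk.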

Step 5 - Single LLM Call
All checks, together with their retrieved GA4GH chunks, are injected into a single prompt alongside the full consent form text. The LLM evaluates every check in one call, so the consent form is sent once rather than once per check, which lowers input token usage.
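
One possible shape for the combined prompt, shown only to make the idea concrete. The section labels and layout are assumptions; the actual template is in src/compliance/prompts.py.

```python
def build_prompt(consent_text: str, chunks_by_check: dict[str, list[str]]) -> str:
    """Place the consent form once, followed by one context section per check."""
    sections = []
    for check, chunks in chunks_by_check.items():
        context = "\n".join(f"- {chunk}" for chunk in chunks)
        sections.append(f"CHECK: {check}\nGA4GH CONTEXT:\n{context}")
    return "CONSENT FORM:\n" + consent_text + "\n\n" + "\n\n".join(sections)


prompt = build_prompt(
    "Participants may withdraw at any time.",
    {
        "withdrawal_rights": ["dummy policy chunk A"],
        "data_sharing_purpose": ["dummy policy chunk B", "dummy policy chunk C"],
    },
)
```

The consent text appears exactly once in the result no matter how many checks are queued; only the per-check context sections grow with the queue.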

Step 6 - Parse Verdicts
The response is parsed line by line, with CHECK: marking the start of each verdict block.
Each verdict block produces a dict:
{
    "check": "withdrawal_rights",
    "verdict": "NON-COMPLIANT",
    "reason": "No opt-out clause found in consent form",
    "citation": "[GA4GH Consent Policy POL 002, Section (If applicable), page 3]"
}
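
A line-by-line parser along these lines could look like the sketch below. Only the CHECK: separator is confirmed by this description; the VERDICT:, REASON:, and CITATION: labels are assumed field markers, and the real response format may differ.

```python
def parse_response(response: str) -> list[dict[str, str]]:
    """Split an LLM response into verdict dicts, one per CHECK: block."""
    verdicts: list[dict[str, str]] = []
    current = None
    for raw_line in response.splitlines():
        line = raw_line.strip()
        if line.startswith("CHECK:"):
            # A new CHECK: line closes the previous block, if any.
            if current is not None:
                verdicts.append(current)
            current = {
                "check": line[len("CHECK:"):].strip(),
                "verdict": "",
                "reason": "",
                "citation": "",
            }
        elif current is not None:
            # Assumed field labels; lines that match none are ignored.
            for key in ("verdict", "reason", "citation"):
                prefix = key.upper() + ":"
                if line.startswith(prefix):
                    current[key] = line[len(prefix):].strip()
    if current is not None:
        verdicts.append(current)
    return verdicts
```

Text before the first CHECK: line is skipped, so a short preamble from the model does not produce a spurious verdict.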

Key Design Decision - Single LLM Call

Per-check LLM calls would send the full consent form text N times, where N is the number of checks (up to 21 for complex study types). A single LLM call sends the consent form once, regardless of check count.
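
To make the trade-off concrete, here is a back-of-the-envelope estimate of input tokens under each strategy. The token counts are made-up illustrative figures, not measurements from this PR, and the model ignores prompt instructions and output tokens.

```python
def input_tokens_per_check_calls(form_tokens: int, chunk_tokens: int, n_checks: int) -> int:
    """N separate calls: each call carries the form plus one check's chunks."""
    return n_checks * (form_tokens + chunk_tokens)


def input_tokens_single_call(form_tokens: int, chunk_tokens: int, n_checks: int) -> int:
    """One call: the form is sent once, followed by every check's chunks."""
    return form_tokens + n_checks * chunk_tokens


# Illustrative numbers only: a 3000-token form, 400 tokens of chunks per check, 21 checks.
per_check = input_tokens_per_check_calls(3000, 400, 21)
single = input_tokens_single_call(3000, 400, 21)
saved = per_check - single  # equals (n_checks - 1) * form_tokens
```

The saving is always (N - 1) times the form length, so it grows with both the number of checks and the size of the consent form.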

Tests - All 9 Passing

pytest tests/test_checker.py -v

tests/test_checker.py::test_extract_text_txt PASSED
tests/test_checker.py::test_detects_pediatric PASSED
tests/test_checker.py::test_detects_familial PASSED
tests/test_checker.py::test_detects_no_type PASSED
tests/test_checker.py::test_detects_multiple_types PASSED
tests/test_checker.py::test_universal_checks_always_included PASSED
tests/test_checker.py::test_conditional_checks_added_for_pediatric PASSED
tests/test_checker.py::test_parse_response_single_verdict PASSED
tests/test_checker.py::test_parse_response_multiple_verdicts PASSED

9 passed in 5.93s

End-to-End Validation

The full pipeline was validated locally by running a dummy consent form through it end to end. All 8 universal checks were evaluated:

data_sharing_purpose - COMPLIANT
Form clearly states genomic research purpose
[Framework for Responsible Sharing, N/A, page 1]

withdrawal_rights - COMPLIANT
Form explicitly states participants may withdraw at any time
[Consent Toolkit Rare Disease, N/A, page 3]

data_categories_disclosed - NON-COMPLIANT
Form mentions DNA sample and health information vaguely without defining specific data types
[MRCG, N/A, page 6]

etc.

Verdicts are accurate and citations point to real GA4GH documents.
Note: some citations show N/A for the section field. This is expected: during ingestion, section headings are detected per chunk using regex patterns, and if no heading is found within a chunk, its section field is stored as N/A in the metadata. The source document and page number remain intact, so citations stay traceable.

How to Run Tests

pytest tests/test_checker.py -v

@GovindhKishore

@dedyli Improved docstrings across ingestion, retrieval, and compliance pipelines for better clarity and maintainability.
