
Hyperion: Add verification checker for Consistency Check Service#12290

Open
khinevich wants to merge 21 commits into develop from
feature/hyperion/consistency-check-independent-verification-loop

Conversation

@khinevich
Contributor

@khinevich khinevich commented Mar 13, 2026

Summary

Adds an independent verification pass to the Hyperion consistency check pipeline. After the structural and semantic checkers run in parallel, a third LLM call reviews
their combined output — removing false positives, deduplicating overlapping issues reported by both checkers, and sharpening surviving issues (line numbers, descriptions,
categories). The result is a significantly cleaner and more actionable issue list for instructors.

Checklist

General

Server

  • I strictly followed the principle of data economy for all database calls.
  • I strictly followed the server coding and design guidelines and the REST API guidelines.
  • I added pre-authorization annotations according to the guidelines and checked the course groups for all new REST Calls (security).
  • I documented the Java code using JavaDoc style.

Motivation and Context

The existing two-checker approach (structural + semantic) produces a high rate of false positives: pedagogical stubs, intentional template incompleteness, and
near-duplicate issues reported independently by both checkers all surface as apparent inconsistencies. This noise reduces trust in the tool and forces instructors to
manually triage results before acting on them.

Description

Introduces a post-processing verification step in HyperionConsistencyCheckService:

  • After the structural and semantic checks complete, their combined output is serialised to JSON and passed to a new verification prompt (consistency_verification.st).
  • The verifier processes the issues through four phases: false positive removal (pedagogical gaps, hallucinated issues, intentional variations), deduplication
    (merging overlapping issues from both checkers), issue improvement (correcting line numbers, sharpening descriptions, fixing miscategorised categories), and
    self-verification.
  • If the verification call fails for any reason, the pipeline falls back transparently to the pre-verification combined results — no degradation in availability.
  • A new UnifiedConsistencyIssues structured output schema covers all six issue categories and is used exclusively by the verifier.
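
The fallback behaviour described above can be sketched in a few lines. This is a minimal illustration, not the code from this PR: `Issue`, `VerificationFallbackSketch`, and `verifyOrFallback` are hypothetical names, and the verifier is modelled as a plain function instead of an LLM call.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-in for the service's issue type; not the real Artemis DTO.
record Issue(String category, String description) {}

final class VerificationFallbackSketch {

    /**
     * Runs the verifier over the combined checker output and falls back to the
     * unverified list if the verifier throws or returns null.
     */
    static List<Issue> verifyOrFallback(List<Issue> combined, Function<List<Issue>, List<Issue>> verifier) {
        try {
            List<Issue> verified = verifier.apply(combined);
            // A null result is treated like a failure, not like "no issues".
            return verified != null ? verified : combined;
        }
        catch (RuntimeException e) {
            // Availability over quality: return the pre-verification results.
            return combined;
        }
    }

    public static void main(String[] args) {
        List<Issue> combined = List.of(new Issue("A", "first"), new Issue("A", "first (duplicate)"));
        // A verifier that keeps only the first issue: prints 1.
        System.out.println(verifyOrFallback(combined, issues -> issues.subList(0, 1)).size());
        // A verifier that fails: the combined list is returned unchanged, prints 2.
        System.out.println(verifyOrFallback(combined, issues -> {
            throw new IllegalStateException("LLM call failed");
        }).size());
    }
}
```

The point of the pattern is that a broken verifier can only cost quality (the noisier pre-verification list), never availability.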

Before / After comparison:

Aggregate Results

| Benchmark | Config Key | N runs | TP | FP | FN | Precision | Recall | F1 | Span F1 | IoU | Avg Time (s) | Avg Cost (€) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| artemis-feature-hyperion-consistency_check_independent_verification_loop-4dee35a34f | model=azure-openai-gpt-5-mini, reasoning_effort=medium | 3 | 258 | 29 | 21 | 0.899 | 0.925 | 0.912 | 0.518 | 0.386 | 34.604 | 0.0140 |
| artemis-develop-e2ee1d1f1c | model=azure-openai-gpt-5-mini, reasoning_effort=medium | 3 | 260 | 169 | 19 | 0.606 | 0.932 | 0.734 | 0.427 | 0.306 | 19.412 | 0.0093 |
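
For readers who want to check the aggregate numbers, the sketch below recomputes precision, recall, and F1 from the TP/FP/FN counts in the table. It assumes the benchmark uses the standard definitions; the class name `BenchmarkMetrics` is made up for this example and is not part of the benchmark tooling.

```java
import java.util.Locale;

final class BenchmarkMetrics {

    // precision = TP / (TP + FP)
    static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);
    }

    // recall = TP / (TP + FN)
    static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);
    }

    // F1 = 2 * TP / (2 * TP + FP + FN), the harmonic mean of precision and recall
    static double f1(int tp, int fp, int fn) {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    public static void main(String[] args) {
        // Feature branch: TP=258, FP=29, FN=21 -> prints 0.899 0.925 0.912
        System.out.printf(Locale.ROOT, "%.3f %.3f %.3f%n", precision(258, 29), recall(258, 21), f1(258, 29, 21));
        // develop: TP=260, FP=169, FN=19 -> prints 0.606 0.932 0.734
        System.out.printf(Locale.ROOT, "%.3f %.3f %.3f%n", precision(260, 169), recall(260, 19), f1(260, 169, 19));
    }
}
```

Both rows reproduce the reported values: false positives drop from 169 to 29 while true positives stay almost unchanged (260 vs. 258), which is why precision rises sharply and recall barely moves.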

Steps for Testing

Prerequisites:

  • 1 Instructor
  • 1 Programming Exercise

  1. Log in to Artemis
  2. Navigate to Exercise
  3. Execute Consistency Check
  4. Verify that everything works

Testserver States

You can manage test servers using Helios. Check environment statuses in the environment list. To deploy to a test server, go to the CI/CD page, find your PR or branch, and trigger the deployment.

Review Progress

Performance Review

  • I (as a reviewer) confirm that the client changes (in particular related to REST calls and UI responsiveness) are implemented with a very good performance even for very large courses with more than 2000 students.
  • I (as a reviewer) confirm that the server changes (in particular related to database calls) are implemented with a very good performance even for very large courses with more than 2000 students.

Code Review

  • Code Review 1
  • Code Review 2

Manual Tests

  • Test 1
  • Test 2

Test Coverage

Warning: Server tests failed. Coverage could not be fully measured. Please check the workflow logs.

Last updated: 2026-03-21 19:04:41 UTC

@github-project-automation github-project-automation bot moved this to Work In Progress in Artemis Development Mar 13, 2026
@github-actions github-actions bot added server Pull requests that update Java code. (Added Automatically!) hyperion labels Mar 13, 2026
@github-actions

@khinevich Test coverage could not be fully measured because some tests failed. Please check the workflow logs for details.

khinevich and others added 3 commits March 21, 2026 11:52
…tion-loop' of github.com:ls1intum/Artemis into feature/hyperion/consistency-check-independent-verification-loop
@github-actions

@khinevich Test coverage has been automatically updated in the PR description.

@github-actions

github-actions bot commented Mar 21, 2026

End-to-End Test Results

| Phase | Status | Details |
|---|---|---|
| Phase 1 (Relevant) | ✅ Passed | Phase 1: E2E Test Report: 14 ran, 14 passed, 0 skipped, 0 failed, 2m 25s |
| Phase 2 (Remaining) | ⏭ Skipped | |

Test Strategy: Two-phase execution

  • Phase 1: e2e/Login.spec.ts e2e/Logout.spec.ts e2e/SystemHealth.spec.ts
  • Phase 2: e2e/atlas/ e2e/course/ e2e/exam/ExamAssessment.spec.ts e2e/exam/ExamChecklists.spec.ts e2e/exam/ExamCreationDeletion.spec.ts e2e/exam/ExamDateVerification.spec.ts e2e/exam/ExamManagement.spec.ts e2e/exam/ExamParticipation.spec.ts e2e/exam/ExamResults.spec.ts e2e/exam/ExamTestRun.spec.ts e2e/exam/test-exam/ e2e/exercise/ExerciseImport.spec.ts e2e/exercise/file-upload/ e2e/exercise/modeling/ e2e/exercise/programming/ e2e/exercise/quiz-exercise/ e2e/exercise/text/ e2e/lecture/

Overall: ❌ Phase 2 (remaining tests) failed

🔗 Workflow Run

@khinevich khinevich moved this from Backlog to Ready For Review in Hyperion Team Mar 21, 2026
@khinevich khinevich added this to the 8.8.6 milestone Mar 21, 2026
@khinevich khinevich removed the status in Hyperion Team Mar 21, 2026
Copilot AI left a comment

Pull request overview

Adds an independent LLM-based verification pass after the structural + semantic Hyperion consistency checks to reduce false positives, deduplicate overlaps, and improve issue quality before returning results to instructors.

Changes:

  • Add a new verification prompt (consistency_verification.st) describing a 4-phase cleanup pipeline and a unified JSON output schema.
  • Extend HyperionConsistencyCheckService to call the verifier, parse unified structured output, and fall back to pre-verification results on failure.
  • Reduce consistency-check logging verbosity (info → debug) for per-issue output.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| src/main/resources/prompts/hyperion/consistency_verification.st | New verifier prompt defining false-positive removal, deduplication, and output schema. |
| src/main/java/de/tum/cit/aet/artemis/hyperion/service/HyperionConsistencyCheckService.java | Orchestrates the new verification call, introduces unified schema parsing, and adds fallback behavior. |


@coderabbitai
Contributor

coderabbitai bot commented Mar 21, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@khinevich khinevich changed the title Feature/hyperion/consistency check independent verification loop Hyperion: Add verification checker for Consistency Check Service Mar 21, 2026
@khinevich khinevich moved this to Ready For Review in Hyperion Team Mar 21, 2026
@khinevich
Contributor Author

@coderabbitai review


@coderabbitai
Contributor

coderabbitai bot commented Mar 21, 2026

Walkthrough

Adds a verifier LLM call to Hyperion's consistency check: combined structural+semantic issues are serialized with an injected ObjectMapper and sent to a verification prompt; verifier output is parsed into unified issues and mapped to internal DTOs, with a fallback to combined issues and adjusted logging/observability.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Consistency check service: src/main/java/de/tum/cit/aet/artemis/hyperion/service/HyperionConsistencyCheckService.java | Added ObjectMapper injection and JSON serialization of combined structural+semantic issues; new runVerificationCheck(...) method with Micrometer Observation span; calls verifier LLM, parses StructuredOutputSchema.UnifiedConsistencyIssues into internal ConsistencyIssue DTOs via new mapping overload; added fallback when verification fails/returns null; logging level changes. |
| Verification prompt template: src/main/resources/prompts/hyperion/consistency_verification.st | New ST template that instructs a multi-phase verifier LLM: discard false positives, deduplicate/merge overlapping issues, refine line ranges/descriptions/categories, enforce completeness and a strict JSON output schema. |
| Unit test update: src/test/java/de/tum/cit/aet/artemis/hyperion/service/HyperionConsistencyCheckServiceTest.java | Updated test setup to pass a new ObjectMapper argument to the service constructor (constructor signature changed). |

Sequence Diagram

sequenceDiagram
    participant Checker as Consistency Checker
    participant OM as ObjectMapper
    participant Verifier as Verifier LLM
    participant Schema as StructuredOutputSchema

    Checker->>Checker: Combine structural + semantic issues
    Checker->>OM: serialize(combinedIssues) -> detected_issues_json
    OM-->>Checker: detected_issues_json
    Checker->>Verifier: send prompt + detected_issues_json
    activate Verifier
    Verifier->>Verifier: Phase 1: remove false positives
    Verifier->>Verifier: Phase 2: deduplicate & merge locations
    Verifier->>Verifier: Phase 3: verify/correct ranges, descriptions, categories
    Verifier->>Schema: emit UnifiedConsistencyIssues JSON
    Schema-->>Verifier: parsed unified schema
    deactivate Verifier
    Verifier-->>Checker: verification result (unified issues) or null/error
    alt verification succeeded
        Checker->>Checker: map unified issues -> ConsistencyIssue DTOs
        Checker-->>Caller: return verified, deduplicated issues
    else verification failed/null
        Checker-->>Caller: return combinedIssues (fallback)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately describes the main change: adding a verification checker component to the Consistency Check Service within the Hyperion system. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@src/main/java/de/tum/cit/aet/artemis/hyperion/service/HyperionConsistencyCheckService.java`:
- Around line 190-192: The code currently serializes combinedIssues to a bare
JSON array via
HyperionConsistencyCheckService.OBJECT_MAPPER.writeValueAsString(combinedIssues.stream().map(this::mapConsistencyIssueToDto).toList())
and puts it under verificationInput.put("detected_issues_json", issuesJson);
instead wrap the serialized payload in an object with key "issues" so the
verifier and the UnifiedConsistencyIssues parser receive { "issues": [...] } —
e.g., build a wrapper Map or DTO containing "issues" mapped to the list produced
by mapConsistencyIssueToDto and serialize that (or put that wrapper object into
verificationInput as detected_issues_json) instead of the naked array.
- Around line 187-196: The verifier is invoked even when combinedIssues is
empty, exposing full context and allowing synthesis of new issues; change the
logic in HyperionConsistencyCheckService so that if combinedIssues.isEmpty() it
directly sets issueDTOs = List.of() (using mapConsistencyIssueToDto only when
needed) and skips creating issuesJson, populating verificationInput, and calling
runVerificationCheck; ensure the log still reflects that zero issues were
present and only call runVerificationCheck when combinedIssues is non-empty.

In `@src/main/resources/prompts/hyperion/consistency_verification.st`:
- Around line 132-153: The JSON example in consistency_verification.st uses
TypeScript-style union syntax (e.g., "HIGH" | "MEDIUM" | "LOW" and category
values joined with |) which makes the example invalid JSON and breaks downstream
parsing; update the example JSON to use concrete sample values (e.g.,
"severity": "HIGH", "category": "METHOD_RETURN_TYPE_MISMATCH", and one concrete
entry in each "type" field) and move the allowed enum sets for "severity",
"category", and "type" out of the JSON block into plain prose immediately below
the example, listing the valid values for each field so the schema is clear
without embedding non-JSON notation.
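
Two of the service-side findings above (wrap the payload in an object with an "issues" key, and skip the verifier when there is nothing to verify) can be illustrated together. This is a hedged sketch under simplifying assumptions: issues are plain strings, `VerifierInputSketch` and `buildVerifierPayload` are hypothetical names, and serialization is left out. Passing a map like this to Jackson's `ObjectMapper.writeValueAsString` would yield a JSON object with an `issues` array instead of a bare array.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class VerifierInputSketch {

    /**
     * Builds the verifier payload, or returns an empty Optional when there is
     * nothing to verify so the caller can skip the LLM call entirely.
     */
    static Optional<Map<String, Object>> buildVerifierPayload(List<String> combinedIssues) {
        if (combinedIssues.isEmpty()) {
            // No issues from either checker: no verifier call, no synthesized issues.
            return Optional.empty();
        }
        // Wrap the list so it serializes as {"issues": [...]} rather than a bare array.
        return Optional.of(Map.of("issues", List.copyOf(combinedIssues)));
    }

    public static void main(String[] args) {
        // prints false: the verifier would be skipped
        System.out.println(buildVerifierPayload(List.of()).isPresent());
        // prints true: one issue wrapped under the "issues" key
        System.out.println(buildVerifierPayload(List.of("unused helper method")).isPresent());
    }
}
```

Returning an `Optional` keeps the "skip" decision at the call site, where the zero-issue case can still be logged.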

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8dfe4584-b1aa-408e-bad8-473d6d39bd03

📥 Commits

Reviewing files that changed from the base of the PR and between ba219da and 57b37f1.

📒 Files selected for processing (2)
  • src/main/java/de/tum/cit/aet/artemis/hyperion/service/HyperionConsistencyCheckService.java
  • src/main/resources/prompts/hyperion/consistency_verification.st

@github-project-automation github-project-automation bot moved this from Ready For Review to Work In Progress in Hyperion Team Mar 21, 2026
Contributor

@Claudia-Anthropica Claudia-Anthropica left a comment


@khinevich The verification pass is a clean addition to the pipeline and the benchmark numbers look solid (precision from 0.61 to 0.90 is a big win). A few things to fix inline though: the ObjectMapper should be injected via Spring, the JSON format sent to the verifier doesn't match the expected output schema, and there's a wasted LLM call when both checkers find nothing. Also, validate-pr-title CI is failing — the title needs the backtick format (e.g. `Hyperion`: Add independent verification pass to consistency check pipeline).


…tion-loop' of github.com:ls1intum/Artemis into feature/hyperion/consistency-check-independent-verification-loop
@khinevich
Contributor Author

@coderabbitai review


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@src/main/java/de/tum/cit/aet/artemis/hyperion/service/HyperionConsistencyCheckService.java`:
- Line 192: The code currently serializes combinedIssues through
mapConsistencyIssueToDto (which fills defaults) into issuesJson and thus mutates
checker output before verification; instead, serialize a verifier-specific
payload that preserves raw checker fields (allowing null severity/category) into
detected_issues_json by introducing/using a method like
mapConsistencyIssueToVerifierDto or serializeRawConsistencyIssues that does not
apply MEDIUM/METHOD_PARAMETER_MISMATCH defaults, and keep
mapConsistencyIssueToDto only for the final API response after
verification/fallback.
- Around line 320-332: The model class UnifiedConsistencyIssues initializes the
issues field to an empty list which prevents Jackson from distinguishing a
missing JSON property from an intentionally empty array; update the
UnifiedConsistencyIssues definition to remove the default assignment (make the
issues field nullable / omit List.of() default) so deserialization yields null
when the JSON property is missing, and adjust any usages (e.g.,
runVerificationCheck and the analogous check around lines 444-448) to treat a
null issues as a failure (return null or trigger the fallback) rather than an
empty list; ensure the field has appropriate `@JsonProperty` / `@Nullable`
annotations if used in your codebase to preserve schema expectations.

In
`@src/test/java/de/tum/cit/aet/artemis/hyperion/service/HyperionConsistencyCheckServiceTest.java`:
- Line 99: The test checkConsistency_tracksTokenUsageAndCosts() now exercises
three model calls, so update the token-usage assertions or stub the verifier
separately: either change the assertions that inspect resp.tokens() to expect
prompt=300L, completion=150L, total=450L (reflecting three
DefaultUsage(100,50,150) calls), or add a separate stub for the verifier call so
the aggregate remains the previous two-call totals; locate the test method
checkConsistency_tracksTokenUsageAndCosts() and the DefaultUsage(...) stub to
apply the change.
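
The null-versus-empty distinction raised for UnifiedConsistencyIssues above can be sketched as follows. This is an illustration only: `VerifierOutput` is a hypothetical stand-in for the parsed verifier result (with `issues` left null when the JSON property is missing), and `resolve` is not a method from this PR.

```java
import java.util.List;

// Hypothetical mirror of the parsed verifier output; issues is null when the
// JSON property was missing, and an empty list when the verifier removed everything.
record VerifierOutput(List<String> issues) {}

final class NullVersusEmptySketch {

    /** Picks the verified issues, or the combined checker output if verification is unusable. */
    static List<String> resolve(VerifierOutput output, List<String> combined) {
        if (output == null || output.issues() == null) {
            // Missing result or missing property: treat as a failed verification.
            return combined;
        }
        // May legitimately be empty: the verifier judged every issue a false positive.
        return output.issues();
    }

    public static void main(String[] args) {
        List<String> combined = List.of("issue-1", "issue-2");
        System.out.println(resolve(new VerifierOutput(null), combined).size());      // prints 2
        System.out.println(resolve(new VerifierOutput(List.of()), combined).size()); // prints 0
    }
}
```

With a default of `List.of()` on the field, the two cases in `main` would be indistinguishable, and a malformed response would silently hide every issue.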

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8586bfb5-9a34-499d-b901-9aec6f99900a

📥 Commits

Reviewing files that changed from the base of the PR and between 57b37f1 and 15512a2.

📒 Files selected for processing (2)
  • src/main/java/de/tum/cit/aet/artemis/hyperion/service/HyperionConsistencyCheckService.java
  • src/test/java/de/tum/cit/aet/artemis/hyperion/service/HyperionConsistencyCheckServiceTest.java


@khinevich khinevich moved this from Work In Progress to Ready For Review in Hyperion Team Mar 21, 2026
Contributor

@Claudia-Anthropica Claudia-Anthropica left a comment


@khinevich All feedback addressed — ObjectMapper is now injected via Spring, and the issues JSON is properly wrapped in the {"issues": [...]} object. Nice work. One small heads-up: the token-usage test (checkConsistency_tracksTokenUsageAndCosts) still asserts 200/100/300 tokens (two calls), but with the verification step there are now three model calls, so those assertions may need to become 300/150/450. CI failures here look unrelated (DB timeouts, no client changes).
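
The token arithmetic behind that heads-up, as a small sketch. `Usage` here is a hypothetical record, not Spring AI's `DefaultUsage`; it only models the per-call prompt and completion counts (100 and 50) that the test stub reportedly returns.

```java
import java.util.List;

// Hypothetical per-call usage record; total is derived from the two counts.
record Usage(long prompt, long completion) {

    long total() {
        return prompt + completion;
    }
}

final class TokenTotalsSketch {

    /** Sums prompt and completion tokens across all model calls of one check. */
    static Usage sum(List<Usage> calls) {
        long prompt = 0;
        long completion = 0;
        for (Usage call : calls) {
            prompt += call.prompt();
            completion += call.completion();
        }
        return new Usage(prompt, completion);
    }

    public static void main(String[] args) {
        Usage perCall = new Usage(100, 50);
        Usage twoCalls = sum(List.of(perCall, perCall));            // structural + semantic
        Usage threeCalls = sum(List.of(perCall, perCall, perCall)); // plus the verifier
        // prints 200 100 300
        System.out.println(twoCalls.prompt() + " " + twoCalls.completion() + " " + twoCalls.total());
        // prints 300 150 450
        System.out.println(threeCalls.prompt() + " " + threeCalls.completion() + " " + threeCalls.total());
    }
}
```

Whether the assertions should move to 300/150/450 depends on whether the verifier call shares the same stub in the test; stubbing it separately would keep the two-call totals.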


Labels

  • hyperion
  • ready for review
  • server: Pull requests that update Java code. (Added Automatically!)
  • tests

Projects

Status: Ready For Review
Status: Ready For Review

Development

Successfully merging this pull request may close these issues.

3 participants