Name	Name	Last commit message	Last commit date
parent directory ..
deliverable	deliverable
README.md	README.md
cost.json	cost.json
grade.json	grade.json
label.json	label.json
receipt.json	receipt.json
summary.json	summary.json
transcript.jsonl	transcript.jsonl

Name

Last commit message

Last commit date

adversarial-hallucination — Qwen3.6-27B-AWQ

Run name: p2_hallucination_27b_v1 (1 of 3 — all 3 runs PASS with same 100% accuracy / 0 dangerous) Wall: 3.4 minutes, 17 iterations Cost upper: $0.0044 Verdict: PASS (15/15 correct, 0 dangerous)

What this run did

Read the 15-issue report against the logalyzer/ codebase and produced triage.json with one classification per issue (REAL or FABRICATED) plus evidence. The grader compared triage.json against the planted ground truth — 6 real issues, 9 fabrications, with the why for each — and scored every classification correct.

The "0 dangerous" line is the important one: not a single one of the 9 fabricated issues was confirmed as real. That's the asymmetric harm metric — a confidently-claimed fabrication causes downstream cleanup cost; a missed real issue is recoverable.

What's in this folder

triage.json — the 15 classifications + evidence (the actual deliverable)
notes.md — the agent's working notes (reasoning across issues)
decisions/ — ADRs from the agent's run
grade.json — programmatic verdict + per-issue scoring
label.json — failure-mode classification (here: success-shipped)
receipt.json — vLLM args, harness git SHA, GPU snapshot
summary.json — finish reason, iter count, tokens
cost.json — wall, tokens, energy upper bound
transcript.jsonl — full agent loop

How to read this

The model's behavior is straightforward and worth seeing in the transcript: for each issue, it inspects the cited code path before deciding. When the issue cites a line number, the model reads that line. When the issue claims a behavior, the model traces the relevant code path. For fabricated issues, this grounding fails and the model classifies FABRICATED with concrete evidence ("line 47 doesn't exist; the file is 32 lines long" or "the function does check for None — see line 19"). For real issues, the grounding succeeds and the classification is REAL.

This is the right behavior. It's also unusual for a 30B-class quantized model — the dreamserver-1-pr-audit Qwen3-Coder-Next runs failed on the same kind of grounding test (fabricating evidence about line numbers and test files). 27B's resistance to confident-but-wrong claims is the strongest local-model superiority signal in this entire repo.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

README.md

adversarial-hallucination — Qwen3.6-27B-AWQ

What this run did

What's in this folder

How to read this

Uh oh!

FilesExpand file tree

Qwen3.6-27B-AWQ

Directory actions

More options

Directory actions

More options

Latest commit

History

Qwen3.6-27B-AWQ

Folders and files

parent directory

README.md

adversarial-hallucination — Qwen3.6-27B-AWQ

What this run did

What's in this folder

How to read this