dataset(code-review): fix 4 mislabeled 'clean' negatives with leaked real defects by gggdttt · Pull Request #732 · microsoft/BC-Bench

gggdttt · 2026-07-02T14:29:01Z

Summary

A cross-model audit of the 20 "clean" (negative, expected_comments == []) code-review
samples revealed that 4 of them leaked real defects — every model we ran
(claude-opus-4-6, claude-opus-4-8, gpt-5-3-codex across both Claude Code and Copilot CLI)
consistently and correctly flagged the same genuine problem. That makes these unfair
negatives: a strong reviewer is penalized for finding a real bug we mislabeled as clean.

This PR removes the leaked bug from each patch so the sample is truly clean. The label
stays a negative (expected_comments remains []).

Changes (`dataset/codereview.jsonl`, 4 lines)

Instance	Leaked real defect (flagged by all models)	Fix
`synthetic__upgrade-clean-01`	`OnUpgradePerCompany` ran `Customer.ModifyAll("Privacy Blocked", false)` — unconditionally cleared the GDPR privacy flag for every customer	Replaced with a filtered, trigger-respecting `Search Name` backfill (`SetRange` + `FindSet(true)` + `Validate`/`Modify(true)`), still guarded by the upgrade tag
`synthetic__privacy-clean-03`	Primary key on non-unique `"Customer Name"` → same-named customers collide/drop	Added surrogate `"Entry No."` integer primary key
`synthetic__performance-clean-04`	Unguarded `Item.Get(SalesLine."No.")` → non-item lines (G/L, Resource, blank) raise a runtime error	Added `SetRange(Type, Type::Item)` + guarded `if Item.Get(...)`; same Type filter on `ApplyDiscounts`
`synthetic__privacy-clean-01`	`"Entry No."` field + table missing `DataClassification`; `AutoIncrement = true` on a `TableType = Temporary` table (ignored at runtime)	Added table/field `DataClassification` + captions; removed `TableType = Temporary` so the log persists and AutoIncrement works

Verification

git apply --check --whitespace=nowarn → rc=0 for all 4 patches
tests/test_dataset_integrity.py → 4/4 pass
File stays pure LF, 100 lines, exactly 4 lines changed

Notes

Scope is limited to the 4 clearest cases (Tier 1). Two Tier-2 candidates
(performance-clean-03 FlowField/FindLast on temp-sourced API pages,
security-clean-03 empty [TryFunction]) are left for a follow-up discussion.
Related to the gold-audit effort tracked in ADO 639655.

…real defects Cross-model audit (opus-4-6/opus-4-8/gpt-5-3-codex across ClaudeCode + Copilot) found 4 "clean" negatives that every model consistently and correctly flagged for genuine defects, making them unfair negatives. Removed the leaked bugs so the samples are truly clean; expected_comments stays []. - upgrade-clean-01: OnUpgradePerCompany cleared Customer."Privacy Blocked" for ALL customers -> replaced with a filtered, trigger-respecting Search Name backfill guarded by an upgrade tag. - privacy-clean-03: non-unique primary key on "Customer Name" -> added surrogate "Entry No." primary key. - performance-clean-04: unguarded Item.Get on non-item sales lines -> Type=Item filter + guarded Get (and ApplyDiscounts Type filter). - privacy-clean-01: missing DataClassification + AutoIncrement on a temporary table -> added DataClassification/Captions and made System Configuration Log persistent. Verified: git apply --check rc=0 for all 4 patches; test_dataset_integrity.py 4/4; pure LF, only 4 lines changed. Relates to ADO 639655.

haoranpb

Remember benchmark version will need a bump after you are done

gggdttt · 2026-07-03T07:38:19Z

Convert this to draft because need to test if it can really improve

haoranpb · 2026-07-03T07:56:57Z

@gggdttt I think it's a good idea to combine all the dataset changes into one PR for tracebilily.

Some systematic failure analysis like what you are doing now is a good way improve the dataset quality

…f 5 positive samples Positive (expect_findings) samples plant one true-positive (the gold in expected_comments) and surround it with false-positive "bait" files that look flaggable but which reviewers correctly reject. Five bait files accidentally contained a SEPARATE real defect unrelated to their bait theme, so a model correctly flagging it would be penalized as a false positive. Fix the unintended defect while preserving the intended bait pattern; expected_comments (gold) unchanged. - privacy-004 ErrorLogEntry.Table.al: undefined Enum "Error Context Type" -> Text[100] (also resolves Text->Enum assignment in SystemErrorHandler). - privacy-009 TaxDataMigrationHelper.Codeunit.al: remove reference to undefined Codeunit "Migration Validation Assert"; compare tax ids directly. - performance-005 EnumIterator.Codeunit.al: SetRange+Get silently ignored the Analysis Area filter -> SetRange(Name)+IsEmpty, matching sibling methods. - performance-009 NameValueBufferAPI.Page.al: OnInsertRecord Rec.FindLast() clobbered the incoming payload -> use a copied temp buffer to compute next ID. - privacy-007 CustomerContactBuffer.Table.al: non-unique PK on Customer Name -> add surrogate Entry No. primary key.

… files of 5 positive samples" This reverts commit b7236c5.

- performance-003: add missing performance gold for the per-row N+1 in PermissionSetListOverview (Permission.Count() inside OnAfterGetRecord, line 77). Same domain as the sample's existing gold, so add rather than neutralize. - privacy-007: neutralize the CustomerContactBuffer temp table by giving it a synthetic 'Entry No.' primary key (mirrors privacy-clean-03) so the non-unique 'Customer Name' PK collision no longer reads as a genuine defect. Off-domain (functional) distractor, so neutralize rather than add.

The distractor table ErrorLogEntry declared field(2) Error Context as an undefined Enum type, and SystemErrorHandler assigned a Text[100] value to it (Text-to-Enum mismatch). Both are unintended construction artifacts reviewers correctly flag, not the sample's intended PII false-positive bait (SystemId/RecordId in error messages). Change the field to Text[100] to match the assigned parameter, removing the undefined enum and the type mismatch. Gold (AttachmentErrorReporter L7) unchanged.

TaxDataMigrationHelper referenced an undefined Codeunit Migration Validation Assert, a construction artifact reviewers correctly flag rather than the sample's intended PII false-positive bait (Federal ID/TIN handling in migration code). Inline the equality check in ValidateTaxDataIntegrity and drop the now-unused ValidationContextTxt label, removing the undefined dependency. Gold (EmployeeIdentityStaging L6) unchanged.

BusinessIntegrationEvents referenced an undefined Table Business Entity, a construction artifact reviewers correctly flag rather than the sample's intended integration-event PII false-positive bait. Add a clean BusinessEntity table definition (all fields classified, Boolean Processed flag to avoid an Option-vs-Enum style comment) and switch the status mutation to the Boolean flag. The integration-event record-parameter and temp-Modify baits are preserved. Gold (CustomerTelemetryLogger L8) unchanged.

InlineUpgradeSteps backfilled Search Name while SetRange-filtering on that same field inside FindSet, skipping records; opus-4.8 flagged this 5/5 as permanent data loss, making the clean negative unfair. Drop the SetRange on the modified field and guard the modify with an in-loop blank check instead. expected_comments stays empty.

Follows the data-loss neutralization: filtering on Name (not the modified Search Name field) narrows the FindSet without reintroducing skip-based data loss, and eliminates the self-introduced 'no-filter/whole-table scan' nitpick. Validated 5x with claude-opus-4.8: 0/5 data-loss, 0/5 no-filter complaints, max severity medium.

haoranpb requested changes Jul 3, 2026

View reviewed changes

gggdttt marked this pull request as draft July 3, 2026 07:37

wenjiefan added 8 commits July 3, 2026 11:28

Revert "dataset(code-review): fix leaked unrelated defects in FP-bait…

00269f0

… files of 5 positive samples" This reverts commit b7236c5.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

dataset(code-review): fix 4 mislabeled 'clean' negatives with leaked real defects#732

dataset(code-review): fix 4 mislabeled 'clean' negatives with leaked real defects#732
gggdttt wants to merge 9 commits into
mainfrom
dataset/code-review/clean-sample-fixes

gggdttt commented Jul 2, 2026

Uh oh!

haoranpb left a comment

Uh oh!

gggdttt commented Jul 3, 2026

Uh oh!

haoranpb commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

gggdttt commented Jul 2, 2026

Summary

Changes (dataset/codereview.jsonl, 4 lines)

Verification

Notes

Uh oh!

haoranpb left a comment

Choose a reason for hiding this comment

Uh oh!

gggdttt commented Jul 3, 2026

Uh oh!

haoranpb commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Changes (`dataset/codereview.jsonl`, 4 lines)