dataset(code-review): fix 4 mislabeled 'clean' negatives with leaked real defects#732
Draft
gggdttt wants to merge 9 commits into
…real defects

Cross-model audit (opus-4-6/opus-4-8/gpt-5-3-codex across ClaudeCode + Copilot) found 4 "clean" negatives that every model consistently and correctly flagged for genuine defects, making them unfair negatives. Removed the leaked bugs so the samples are truly clean; expected_comments stays [].

- upgrade-clean-01: OnUpgradePerCompany cleared Customer."Privacy Blocked" for ALL customers -> replaced with a filtered, trigger-respecting Search Name backfill guarded by an upgrade tag.
- privacy-clean-03: non-unique primary key on "Customer Name" -> added surrogate "Entry No." primary key.
- performance-clean-04: unguarded Item.Get on non-item sales lines -> Type=Item filter + guarded Get (and ApplyDiscounts Type filter).
- privacy-clean-01: missing DataClassification + AutoIncrement on a temporary table -> added DataClassification/Captions and made System Configuration Log persistent.

Verified: git apply --check rc=0 for all 4 patches; test_dataset_integrity.py 4/4; pure LF, only 4 lines changed.

Relates to ADO 639655.
haoranpb
requested changes
Jul 3, 2026
haoranpb
left a comment
Collaborator
Remember benchmark version will need a bump after you are done
Collaborator
Author
Converted this to draft because I need to test whether it really is an improvement.
Collaborator
@gggdttt I think it's a good idea to combine all the dataset changes into one PR for traceability. Systematic failure analysis like what you are doing now is a good way to improve the dataset quality.
added 8 commits
July 3, 2026 11:28
…f 5 positive samples

Positive (expect_findings) samples plant one true positive (the gold in expected_comments) and surround it with false-positive "bait" files that look flaggable but which reviewers correctly reject. Five bait files accidentally contained a SEPARATE real defect unrelated to their bait theme, so a model correctly flagging it would be penalized as a false positive. Fix the unintended defect while preserving the intended bait pattern; expected_comments (gold) unchanged.

- privacy-004 ErrorLogEntry.Table.al: undefined Enum "Error Context Type" -> Text[100] (also resolves Text->Enum assignment in SystemErrorHandler).
- privacy-009 TaxDataMigrationHelper.Codeunit.al: remove reference to undefined Codeunit "Migration Validation Assert"; compare tax ids directly.
- performance-005 EnumIterator.Codeunit.al: SetRange+Get silently ignored the Analysis Area filter -> SetRange(Name)+IsEmpty, matching sibling methods.
- performance-009 NameValueBufferAPI.Page.al: OnInsertRecord Rec.FindLast() clobbered the incoming payload -> use a copied temp buffer to compute next ID.
- privacy-007 CustomerContactBuffer.Table.al: non-unique PK on Customer Name -> add surrogate Entry No. primary key.
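To make the penalty described above concrete, here is a minimal sketch of gold-matching. It is not the benchmark's actual scorer: the `Finding` type, the `score` helper, and the file names and line numbers are all hypothetical, and real matching is likely fuzzier than exact (file, line) equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A review comment location. Both fields are illustrative."""
    file: str
    line: int


def score(findings, gold):
    """Split findings into true positives (in gold) and false positives (everything else)."""
    gold_set = set(gold)
    true_pos = [f for f in findings if f in gold_set]
    false_pos = [f for f in findings if f not in gold_set]
    return true_pos, false_pos


# Hypothetical sample: one planted defect (gold) plus a bait file.
gold = [Finding("Gold.Codeunit.al", 7)]
findings = [
    Finding("Gold.Codeunit.al", 7),  # the planted defect: matches gold
    Finding("Bait.Table.al", 2),     # a real but unlabeled defect in a bait file
]

true_pos, false_pos = score(findings, gold)
print(len(true_pos), len(false_pos))  # prints "1 1"
```

Under this assumed scoring, the second finding counts against the model even though it is correct, which is why the unintended defect has to be removed from the bait file or added to the gold.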
… files of 5 positive samples" This reverts commit b7236c5.
- performance-003: add missing performance gold for the per-row N+1 in PermissionSetListOverview (Permission.Count() inside OnAfterGetRecord, line 77). Same domain as the sample's existing gold, so add rather than neutralize.
- privacy-007: neutralize the CustomerContactBuffer temp table by giving it a synthetic 'Entry No.' primary key (mirrors privacy-clean-03) so the non-unique 'Customer Name' PK collision no longer reads as a genuine defect. Off-domain (functional) distractor, so neutralize rather than add.
The distractor table ErrorLogEntry declared field(2) Error Context as an undefined Enum type, and SystemErrorHandler assigned a Text[100] value to it (Text-to-Enum mismatch). Both are unintended construction artifacts reviewers correctly flag, not the sample's intended PII false-positive bait (SystemId/RecordId in error messages). Change the field to Text[100] to match the assigned parameter, removing the undefined enum and the type mismatch. Gold (AttachmentErrorReporter L7) unchanged.
TaxDataMigrationHelper referenced an undefined Codeunit Migration Validation Assert, a construction artifact reviewers correctly flag rather than the sample's intended PII false-positive bait (Federal ID/TIN handling in migration code). Inline the equality check in ValidateTaxDataIntegrity and drop the now-unused ValidationContextTxt label, removing the undefined dependency. Gold (EmployeeIdentityStaging L6) unchanged.
BusinessIntegrationEvents referenced an undefined Table Business Entity, a construction artifact reviewers correctly flag rather than the sample's intended integration-event PII false-positive bait. Add a clean BusinessEntity table definition (all fields classified, Boolean Processed flag to avoid an Option-vs-Enum style comment) and switch the status mutation to the Boolean flag. The integration-event record-parameter and temp-Modify baits are preserved. Gold (CustomerTelemetryLogger L8) unchanged.
InlineUpgradeSteps backfilled Search Name while SetRange-filtering on that same field inside FindSet, skipping records; opus-4.8 flagged this 5/5 as permanent data loss, making the clean negative unfair. Drop the SetRange on the modified field and guard the modify with an in-loop blank check instead. expected_comments stays empty.
Follows the data-loss neutralization: filtering on Name (not the modified Search Name field) narrows the FindSet without reintroducing skip-based data loss, and eliminates the self-introduced 'no-filter/whole-table scan' nitpick. Validated 5x with claude-opus-4.8: 0/5 data-loss, 0/5 no-filter complaints, max severity medium.
Summary
A cross-model audit of the 20 "clean" (negative, `expected_comments == []`) code-review
samples revealed that 4 of them leaked real defects: every model we ran
(claude-opus-4-6, claude-opus-4-8, gpt-5-3-codex across both Claude Code and Copilot CLI)
consistently and correctly flagged the same genuine problem. That makes these unfair
negatives, because a strong reviewer is penalized for finding a real bug we mislabeled as clean.
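The audit criterion can be sketched as a unanimity check. This is an assumed shape only: the `unfair_negatives` helper, the input layout, and the sample ids `sample-a` and `sample-b` are hypothetical, not the audit tooling actually used.

```python
def unfair_negatives(flags_by_sample, models):
    """Return ids of clean samples that every model flagged in every run.

    flags_by_sample maps sample id -> {model name -> list of per-run booleans},
    where True means the model reported a finding on that run.
    """
    unfair = []
    for sample_id, runs_by_model in flags_by_sample.items():
        # A missing model or an empty run list counts as "not flagged".
        unanimous = all(
            runs_by_model.get(model) and all(runs_by_model[model])
            for model in models
        )
        if unanimous:
            unfair.append(sample_id)
    return sorted(unfair)


models = ["claude-opus-4-6", "claude-opus-4-8", "gpt-5-3-codex"]

# Hypothetical audit results for two clean samples.
flags = {
    "sample-a": {m: [True, True, True] for m in models},
    "sample-b": {
        "claude-opus-4-6": [True, False, True],
        "claude-opus-4-8": [True, True, True],
        "gpt-5-3-codex": [False, False, False],
    },
}

print(unfair_negatives(flags, models))  # prints "['sample-a']"
```

A unanimous flag is only a signal that a sample deserves a manual look; whether the flagged problem is genuine still has to be confirmed by a person.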
This PR removes the leaked bug from each patch so the sample is truly clean. The label
stays a negative (`expected_comments` remains `[]`).

Changes (`dataset/codereview.jsonl`, 4 lines)

| Sample | Leaked defect | Fix |
| --- | --- | --- |
| `synthetic__upgrade-clean-01` | `OnUpgradePerCompany` ran `Customer.ModifyAll("Privacy Blocked", false)`, which unconditionally cleared the GDPR privacy flag for every customer | Filtered, trigger-respecting `Search Name` backfill (`SetRange` + `FindSet(true)` + `Validate`/`Modify(true)`), still guarded by the upgrade tag |
| `synthetic__privacy-clean-03` | Non-unique primary key on `"Customer Name"` → same-named customers collide/drop | Added surrogate `"Entry No."` integer primary key |
| `synthetic__performance-clean-04` | Unguarded `Item.Get(SalesLine."No.")` → non-item lines (G/L, Resource, blank) raise a runtime error | `SetRange(Type, Type::Item)` + guarded `if Item.Get(...)`; same Type filter on `ApplyDiscounts` |
| `synthetic__privacy-clean-01` | `"Entry No."` field + table missing `DataClassification`; `AutoIncrement = true` on a `TableType = Temporary` table (ignored at runtime) | Added `DataClassification` + captions; removed `TableType = Temporary` so the log persists and AutoIncrement works |

Verification

- `git apply --check --whitespace=nowarn` → rc=0 for all 4 patches
- `tests/test_dataset_integrity.py` → 4/4 pass

Notes

(`performance-clean-03` FlowField/`FindLast` on temp-sourced API pages, `security-clean-03` empty `[TryFunction]`) are left for a follow-up discussion.