Skip to content

dataset(code-review): fix 4 mislabeled 'clean' negatives with leaked real defects#732

Draft
gggdttt wants to merge 9 commits into
mainfrom
dataset/code-review/clean-sample-fixes
Draft

dataset(code-review): fix 4 mislabeled 'clean' negatives with leaked real defects#732
gggdttt wants to merge 9 commits into
mainfrom
dataset/code-review/clean-sample-fixes

Conversation

@gggdttt

@gggdttt gggdttt commented Jul 2, 2026

Copy link
Copy Markdown
Collaborator

Summary

A cross-model audit of the 20 "clean" (negative, expected_comments == []) code-review
samples revealed that 4 of them leaked real defects — every model we ran
(claude-opus-4-6, claude-opus-4-8, gpt-5-3-codex across both Claude Code and Copilot CLI)
consistently and correctly flagged the same genuine problem. That makes these unfair
negatives: a strong reviewer is penalized for finding a real bug we mislabeled as clean.

This PR removes the leaked bug from each patch so the sample is truly clean. The label
stays a negative (expected_comments remains []).

Changes (dataset/codereview.jsonl, 4 lines)

Instance Leaked real defect (flagged by all models) Fix
synthetic__upgrade-clean-01 OnUpgradePerCompany ran Customer.ModifyAll("Privacy Blocked", false) — unconditionally cleared the GDPR privacy flag for every customer Replaced with a filtered, trigger-respecting Search Name backfill (SetRange + FindSet(true) + Validate/Modify(true)), still guarded by the upgrade tag
synthetic__privacy-clean-03 Primary key on non-unique "Customer Name" → same-named customers collide/drop Added surrogate "Entry No." integer primary key
synthetic__performance-clean-04 Unguarded Item.Get(SalesLine."No.") → non-item lines (G/L, Resource, blank) raise a runtime error Added SetRange(Type, Type::Item) + guarded if Item.Get(...); same Type filter on ApplyDiscounts
synthetic__privacy-clean-01 "Entry No." field + table missing DataClassification; AutoIncrement = true on a TableType = Temporary table (ignored at runtime) Added table/field DataClassification + captions; removed TableType = Temporary so the log persists and AutoIncrement works

Verification

  • git apply --check --whitespace=nowarn → rc=0 for all 4 patches
  • tests/test_dataset_integrity.py → 4/4 pass
  • File stays pure LF, 100 lines, exactly 4 lines changed

Notes

  • Scope is limited to the 4 clearest cases (Tier 1). Two Tier-2 candidates
    (performance-clean-03 FlowField/FindLast on temp-sourced API pages,
    security-clean-03 empty [TryFunction]) are left for a follow-up discussion.
  • Related to the gold-audit effort tracked in ADO 639655.

…real defects

Cross-model audit (opus-4-6/opus-4-8/gpt-5-3-codex across ClaudeCode + Copilot)
found 4 "clean" negatives that every model consistently and correctly flagged
for genuine defects, making them unfair negatives. Removed the leaked bugs so the
samples are truly clean; expected_comments stays [].

- upgrade-clean-01: OnUpgradePerCompany cleared Customer."Privacy Blocked" for ALL
  customers -> replaced with a filtered, trigger-respecting Search Name backfill
  guarded by an upgrade tag.
- privacy-clean-03: non-unique primary key on "Customer Name" -> added surrogate
  "Entry No." primary key.
- performance-clean-04: unguarded Item.Get on non-item sales lines -> Type=Item
  filter + guarded Get (and ApplyDiscounts Type filter).
- privacy-clean-01: missing DataClassification + AutoIncrement on a temporary table
  -> added DataClassification/Captions and made System Configuration Log persistent.

Verified: git apply --check rc=0 for all 4 patches; test_dataset_integrity.py 4/4;
pure LF, only 4 lines changed. Relates to ADO 639655.

@haoranpb haoranpb left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remember benchmark version will need a bump after you are done

@gggdttt gggdttt marked this pull request as draft July 3, 2026 07:37
@gggdttt

gggdttt commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

Convert this to draft because need to test if it can really improve

@haoranpb

haoranpb commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

@gggdttt I think it's a good idea to combine all the dataset changes into one PR for tracebilily.

Some systematic failure analysis like what you are doing now is a good way improve the dataset quality

wenjiefan added 8 commits July 3, 2026 11:28
…f 5 positive samples

Positive (expect_findings) samples plant one true-positive (the gold in
expected_comments) and surround it with false-positive "bait" files that look
flaggable but which reviewers correctly reject. Five bait files accidentally
contained a SEPARATE real defect unrelated to their bait theme, so a model
correctly flagging it would be penalized as a false positive. Fix the
unintended defect while preserving the intended bait pattern; expected_comments
(gold) unchanged.

- privacy-004 ErrorLogEntry.Table.al: undefined Enum "Error Context Type" -> Text[100] (also resolves Text->Enum assignment in SystemErrorHandler).
- privacy-009 TaxDataMigrationHelper.Codeunit.al: remove reference to undefined Codeunit "Migration Validation Assert"; compare tax ids directly.
- performance-005 EnumIterator.Codeunit.al: SetRange+Get silently ignored the Analysis Area filter -> SetRange(Name)+IsEmpty, matching sibling methods.
- performance-009 NameValueBufferAPI.Page.al: OnInsertRecord Rec.FindLast() clobbered the incoming payload -> use a copied temp buffer to compute next ID.
- privacy-007 CustomerContactBuffer.Table.al: non-unique PK on Customer Name -> add surrogate Entry No. primary key.
… files of 5 positive samples"

This reverts commit b7236c5.
- performance-003: add missing performance gold for the per-row N+1 in
  PermissionSetListOverview (Permission.Count() inside OnAfterGetRecord,
  line 77). Same domain as the sample's existing gold, so add rather than
  neutralize.
- privacy-007: neutralize the CustomerContactBuffer temp table by giving it
  a synthetic 'Entry No.' primary key (mirrors privacy-clean-03) so the
  non-unique 'Customer Name' PK collision no longer reads as a genuine
  defect. Off-domain (functional) distractor, so neutralize rather than add.
The distractor table ErrorLogEntry declared field(2) Error Context as an undefined Enum type, and SystemErrorHandler assigned a Text[100] value to it (Text-to-Enum mismatch). Both are unintended construction artifacts reviewers correctly flag, not the sample's intended PII false-positive bait (SystemId/RecordId in error messages). Change the field to Text[100] to match the assigned parameter, removing the undefined enum and the type mismatch. Gold (AttachmentErrorReporter L7) unchanged.
TaxDataMigrationHelper referenced an undefined Codeunit Migration Validation Assert, a construction artifact reviewers correctly flag rather than the sample's intended PII false-positive bait (Federal ID/TIN handling in migration code). Inline the equality check in ValidateTaxDataIntegrity and drop the now-unused ValidationContextTxt label, removing the undefined dependency. Gold (EmployeeIdentityStaging L6) unchanged.
BusinessIntegrationEvents referenced an undefined Table Business Entity, a construction artifact reviewers correctly flag rather than the sample's intended integration-event PII false-positive bait. Add a clean BusinessEntity table definition (all fields classified, Boolean Processed flag to avoid an Option-vs-Enum style comment) and switch the status mutation to the Boolean flag. The integration-event record-parameter and temp-Modify baits are preserved. Gold (CustomerTelemetryLogger L8) unchanged.
InlineUpgradeSteps backfilled Search Name while SetRange-filtering on that same field inside FindSet, skipping records; opus-4.8 flagged this 5/5 as permanent data loss, making the clean negative unfair. Drop the SetRange on the modified field and guard the modify with an in-loop blank check instead. expected_comments stays empty.
Follows the data-loss neutralization: filtering on Name (not the modified Search Name field) narrows the FindSet without reintroducing skip-based data loss, and eliminates the self-introduced 'no-filter/whole-table scan' nitpick. Validated 5x with claude-opus-4.8: 0/5 data-loss, 0/5 no-filter complaints, max severity medium.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants