bug: Data Designer failures silently drop records — add visibility and retry #110

@lipikaramaswamy

Description

Summary

The NDD adapter in src/anonymizer/engine/ndd/adapter.py wraps Data Designer calls but has two paths where records are dropped silently. Additionally, there is no retry mechanism for failed records despite the adapter already tracking them via WorkflowRunResult.failed_records.

Silent failure paths

1. Preview mode returns empty DataFrame on None dataset
In NddAdapter.run_workflow(), if preview_results.dataset is None, the adapter returns an empty DataFrame (workflow_input_df.iloc[0:0].copy()) with no log message explaining why or how many records were lost.
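A minimal reproduction of the shape of this path (a sketch only; the real adapter code may differ): `iloc[0:0]` preserves the input schema but carries zero rows, so every input record vanishes with no trace in the logs.

```python
import pandas as pd

# Hypothetical input; column names are placeholders for illustration.
workflow_input_df = pd.DataFrame({"record_id": ["a", "b"], "text": ["x", "y"]})

# What the adapter returns when preview_results.dataset is None:
empty = workflow_input_df.iloc[0:0].copy()

assert list(empty.columns) == ["record_id", "text"]  # schema survives
assert len(empty) == 0  # all input records dropped, nothing logged
```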

2. Missing RECORD_ID_COLUMN causes detection bypass
_detect_missing_records() only flags omissions if RECORD_ID_COLUMN is present in the output. If Data Designer strips this column, the method short-circuits silently:

if RECORD_ID_COLUMN not in output_df.columns:
    return []  # all missing records go unreported

Missing retry logic

The adapter already has the infrastructure needed for retries:

  • WorkflowRunResult.failed_records tracks every record that didn't appear in output, with record_id, step, and reason
  • _compute_record_id() generates deterministic UUID5s so failed rows can be precisely identified and re-submitted
  • rewrite_workflow.py already accumulates all_failed across steps

But nothing uses this to retry. Failed records are logged and discarded.
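For context, the UUID5 scheme could look roughly like this. This is a hypothetical sketch: the real `_compute_record_id()` may hash different fields and use a different namespace, and the function and constant names below are assumptions for illustration.

```python
import hashlib
import uuid

# Placeholder namespace — the actual adapter's namespace is not shown in the issue.
NDD_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "ndd-adapter")

def compute_record_id(row_values: tuple) -> str:
    """The same input row always yields the same UUID5, so a failed
    record's ID can be matched back to its row in the original input
    DataFrame and re-submitted."""
    digest = hashlib.sha256(repr(row_values).encode("utf-8")).hexdigest()
    return str(uuid.uuid5(NDD_NAMESPACE, digest))
```

Determinism is the key property here: a retry path can filter the original input on these IDs without storing any extra state between runs.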

Proposed fix

Visibility:

  • Log a warning with record count and reasons when preview_results.dataset is None
  • When RECORD_ID_COLUMN is missing from the output, warn that missing-record detection is disabled instead of silently returning an empty list
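The second warning could be sketched as follows (simplified signature — the real `_detect_missing_records()` takes DataFrames, and the constant's value is an assumption):

```python
import logging

logger = logging.getLogger(__name__)

RECORD_ID_COLUMN = "record_id"  # assumed value; the real constant lives in the adapter module

def detect_missing_records(input_ids, output_columns, output_ids):
    """Proposed behavior: warn loudly when detection is impossible
    instead of returning [] silently."""
    if RECORD_ID_COLUMN not in output_columns:
        logger.warning(
            "Output is missing %r: missing-record detection is disabled "
            "for this step and dropped records will not be reported.",
            RECORD_ID_COLUMN,
        )
        return []
    return sorted(set(input_ids) - set(output_ids))
```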

Retry loop in NddAdapter.run_workflow() (or caller in rewrite_workflow.py):

  • After each run_workflow() call, check result.failed_records
  • If failed records exist and retry_attempt < max_retries, extract the failed rows from the original input DataFrame using their record IDs and re-submit them
  • Merge retry successes back into the main output DataFrame
  • Accumulate records that fail all retries into all_failed with reason "max_retries_exceeded"
  • Expose max_retries as a config parameter (suggested default: 2)
  • Log each retry attempt: "Retrying %d failed records (attempt %d/%d)"
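The loop above could look roughly like this. Note that `run_workflow`'s signature, the `output_df` attribute, and the exact shape of `failed_records` entries are assumptions inferred from the issue text, not the actual adapter API:

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)

def run_with_retries(run_workflow, input_df, record_id_column, max_retries=2):
    """Sketch of the proposed retry loop. run_workflow is assumed to
    return an object with .output_df and .failed_records (a list of
    dicts with record_id/step/reason keys), per the issue description."""
    result = run_workflow(input_df)
    output, failed = result.output_df, list(result.failed_records)
    for attempt in range(1, max_retries + 1):
        if not failed:
            break
        logger.warning("Retrying %d failed records (attempt %d/%d)",
                       len(failed), attempt, max_retries)
        # Extract the failed rows from the original input by record ID.
        failed_ids = {rec["record_id"] for rec in failed}
        retry_input = input_df[input_df[record_id_column].isin(failed_ids)]
        retry_result = run_workflow(retry_input)
        # Merge retry successes back into the main output.
        output = pd.concat([output, retry_result.output_df], ignore_index=True)
        failed = list(retry_result.failed_records)
    for rec in failed:  # exhausted all retries
        rec["reason"] = "max_retries_exceeded"
    return output, failed
```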

Notes

The 33-record RAT-Bench run lost 4 records to nvidia/nemotron-3-nano-30b-a3b internal server errors (with 3–6 hour hangs); a retry with backoff would have recovered them automatically.
