bug: Duplicate column names in output dataframe #121

@memadi-nv

Description

Priority Level

Medium

Task Summary

The output DataFrame contains duplicate column names. This issue was created from item #14 of the bugbash.

When the input CSV already contains a column named text_replaced (which is also the Anonymizer's output column name), the output DataFrame has duplicate column names:

Output columns: ['text', 'text_replaced', 'text_replaced', 'text_with_spans',
                 'final_entities', 'text_replaced', 'text_replaced']

Impact: df['text_replaced'] becomes ambiguous. With duplicate labels, pandas returns a DataFrame containing every matching column rather than a single Series. Users whose input already has columns named like Anonymizer's output columns will get corrupted DataFrames. This could also cause silent data loss when saving to CSV.
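A minimal sketch of the ambiguity, assuming the output is a pandas DataFrame. It does not call Anonymizer; the frame below is hand-built with made-up values and only reproduces the duplicated header.

```python
import pandas as pd

# Hypothetical frame with a duplicated label; the values are invented
# and only the column names mirror the report above.
df = pd.DataFrame(
    [["original", "from input", "from anonymizer"]],
    columns=["text", "text_replaced", "text_replaced"],
)

# Selecting a duplicated label yields every matching column.
selected = df["text_replaced"]
is_frame = isinstance(selected, pd.DataFrame)
n_selected = selected.shape[1]

# A quick way to detect the problem on any frame.
has_duplicates = bool(df.columns.duplicated().any())

print(type(selected).__name__, n_selected, has_duplicates)
```

Code that expects `df['text_replaced']` to be a Series (for example, calling `.str` methods on it) would fail or behave unexpectedly on such a frame.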

Recommendation: Either prefix/suffix Anonymizer output columns to avoid collisions, or raise an error/warning if input columns collide with output column names.
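One possible shape for the recommendation, as a rough sketch rather than the Anonymizer implementation. The helper `resolve_output_names`, its parameters, and the `_anonymized` suffix are hypothetical, and the `OUTPUT_COLUMNS` tuple is only assumed from the column listing above.

```python
import warnings

# Assumed output column names, taken from the listing in this issue;
# the real set would have to come from Anonymizer itself.
OUTPUT_COLUMNS = ("text_replaced", "text_with_spans", "final_entities")


def resolve_output_names(input_columns, output_columns=OUTPUT_COLUMNS,
                         on_collision="raise", suffix="_anonymized"):
    """Map each output column to a name that does not clash with the input.

    Hypothetical helper, not part of the Anonymizer API.
    on_collision="raise" fails fast; "suffix" warns and renames the
    colliding output columns.
    """
    existing = set(input_columns)
    collisions = [c for c in output_columns if c in existing]
    if not collisions:
        return {c: c for c in output_columns}
    if on_collision == "raise":
        raise ValueError(
            f"Input columns collide with output columns: {collisions}"
        )
    if on_collision == "suffix":
        warnings.warn(
            f"Output columns {collisions} renamed with suffix {suffix!r}"
        )
        return {c: (c + suffix if c in existing else c) for c in output_columns}
    raise ValueError(f"Unknown on_collision mode: {on_collision!r}")
```

Raising by default keeps the failure visible; the suffix mode is the more permissive option. A fuller version would also need to check that the suffixed name is not itself already present in the input.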

Technical Details & Implementation Plan

No response

Dependencies

No response

Metadata

Assignees

No one assigned

Labels

task (Development task)
