After the conversion step in the targets pipeline, we could add a step that reads in the Parquet registers and deduplicates their rows, ignoring the source_file column when comparing rows.
This should also be recorded in the conversion log: the number of rows before and after deduplication, and how many rows were removed.
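
A minimal sketch of what such a step could look like, assuming the pipeline uses the R targets package and that each register fits in memory as a data frame. The function name `deduplicate_register()`, the log column names, and the toy data are hypothetical and only illustrate the idea; they are not part of the existing code.

```r
# Hypothetical helper: drop rows that are identical in every column
# except `ignore`, and return row counts for the conversion log.
deduplicate_register <- function(data, ignore = "source_file") {
  # Compare rows on all columns except the ignored one(s).
  compare_cols <- setdiff(names(data), ignore)
  # duplicated() flags every repeat after the first occurrence.
  is_duplicate <- duplicated(data[, compare_cols, drop = FALSE])
  deduplicated <- data[!is_duplicate, , drop = FALSE]
  list(
    data = deduplicated,
    log = data.frame(
      n_before = nrow(data),
      n_after = nrow(deduplicated),
      n_removed = sum(is_duplicate)
    )
  )
}

# Toy register (made-up values): rows 1 and 2 differ only in source_file.
register <- data.frame(
  id = c("a", "a", "b"),
  value = c(1, 1, 2),
  source_file = c("file_1", "file_2", "file_2")
)
result <- deduplicate_register(register)
print(result$log) # n_before = 3, n_after = 2, n_removed = 1
```

In the pipeline, the input could come from `arrow::read_parquet()` and the function could be called from its own `tar_target()` after the conversion target, with `result$log` appended to the conversion log. Note that this sketch keeps the first occurrence of each duplicated row, so the surviving source_file value depends on row order.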