What if the covariate data (X) or the data files (y) contain duplicates?

X can contain duplicated file names even when the other covariates, or the corresponding y files, are not duplicated. Conversely, y can be duplicated while the X info is not. I'll filter each of X and y down to unique files, keeping the first occurrence, and then take the intersection of X and y. Or, to be safe, I'll remove all instances of the duplicated rows in both X and y and take the intersection from there.
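The two options can be sketched in pandas roughly as follows. This is only an illustration under assumptions: X is taken to be a DataFrame with a hypothetical `file_name` column, y is taken to be a plain list of file names, and `align_unique` is a made-up helper name. Deduplicating on the file name alone means a repeated file name counts as a duplicate even when the other covariates on those rows differ.

```python
import pandas as pd


def align_unique(X, y_files, key="file_name", drop_all_duplicates=False):
    """Deduplicate X and y on the file name, then keep only the shared files.

    Assumptions (not from the original note): X has a column named `key`,
    and y_files is a list of file names.
    """
    # keep="first" keeps the first occurrence of each file name;
    # keep=False drops every row whose file name appears more than once.
    keep = False if drop_all_duplicates else "first"

    X_unique = X.drop_duplicates(subset=key, keep=keep)
    y_unique = pd.Series(y_files, name=key).drop_duplicates(keep=keep)

    # Intersection of the file names that survive in both X and y.
    common = set(X_unique[key]) & set(y_unique)

    X_aligned = (
        X_unique[X_unique[key].isin(common)]
        .sort_values(key)
        .reset_index(drop=True)
    )
    y_aligned = sorted(common)
    return X_aligned, y_aligned


# Toy data: "a.csv" is repeated in X with different covariates,
# "c.csv" is repeated in y, and "d.csv" has no matching row in X.
X_demo = pd.DataFrame(
    {
        "file_name": ["a.csv", "a.csv", "b.csv", "c.csv"],
        "age": [30, 31, 40, 50],
    }
)
y_demo = ["a.csv", "b.csv", "c.csv", "c.csv", "d.csv"]

X_first, y_first = align_unique(X_demo, y_demo)
X_safe, y_safe = align_unique(X_demo, y_demo, drop_all_duplicates=True)
```

The first call keeps one row per file name, while the "safe" call discards any file that was ever duplicated in either X or y, so it returns fewer rows but never has to guess which copy is the right one.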