fix(datasets) Fix partition inconsistencies across dataset splits #5340
jafermarq merged 11 commits into flwrlabs:main
…t regardless of the ordering of examples in the dataset.
Hey everyone! Just a heads-up on the changes I made to the tests. I adjusted the … If anyone has suggestions or a cleaner solution, I'm all ears!
8b4b407 to 63d6175
Thanks for highlighting this! I agree that introducing this randomness isn't super ideal, so I was wondering if we can keep the construction of the test datasets in …
Thanks for your feedback @jafermarq! Definitely, that seems like a better option. I'll update the tests and roll back the changes to …
…pper/flower into bugfix/inconsistent-partition-ids
… changes to test setup functions.
@jafermarq I've modified the new consistency tests so that they're deterministic and reverted the changes I made to the setup methods. Thanks for your patience! Let me know if there's anything else that needs to be improved before merging 😄
jafermarq left a comment
@adamtupper, many thanks for making those changes!
datasets/flwr_datasets/partitioner/dirichlet_partitioner_test.py (outdated, resolved)
jafermarq left a comment
🚀 Thanks for the fixes! @adamtupper
Issue
This pull request addresses Issue #5243.
Description
These changes address a similar issue in the NaturalIdPartitioner and DirichletPartitioner that results in mismatches between dataset splits (e.g., for training, validation, and testing).
- In the NaturalIdPartitioner, natural IDs were mapped to partition IDs based on the ordering of examples in the dataset. This resulted in different natural ID to partition ID mappings if the dataset was ordered differently, or on different subsets of the same dataset.
- In the DirichletPartitioner, a similar issue meant that changing the order of the dataset (or generating partitions for different subsets of the same dataset) led to different label distributions for each partition.
Related issues/PRs
Fixes #5243
Proposal
Sort the unique_natural_ids and unique_classes for the NaturalIdPartitioner and DirichletPartitioner, respectively.
Explanation
This fix ensures that partition generation no longer depends on the ordering of examples in the dataset. Provided that the set of natural IDs/unique classes is the same between dataset splits, the partitions are labeled consistently across the partitioners for the different splits. This matters, for example, in personalized FL, where each client has its own unique training and test data.
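To illustrate the idea, here is a minimal, standalone sketch of why sorting the unique natural IDs makes the mapping independent of example order. It is not the partitioner code from this PR: the helper build_id_mapping, its sort_ids flag, and the sample IDs are hypothetical and exist only for this example.

```python
def build_id_mapping(natural_ids, sort_ids):
    """Map each distinct natural ID to a partition ID.

    Hypothetical helper for illustration only; not part of flwr_datasets.
    """
    # dict.fromkeys keeps first-seen order, which mimics an
    # order-dependent scan over the examples of a split.
    unique_ids = list(dict.fromkeys(natural_ids))
    if sort_ids:
        # Sorting removes the dependence on example order.
        unique_ids = sorted(unique_ids)
    return {natural_id: pid for pid, natural_id in enumerate(unique_ids)}


# Two splits with the same set of natural IDs but different example order.
train_ids = ["carol", "alice", "bob", "alice"]
test_ids = ["bob", "carol", "alice"]

unsorted_train = build_id_mapping(train_ids, sort_ids=False)
unsorted_test = build_id_mapping(test_ids, sort_ids=False)
sorted_train = build_id_mapping(train_ids, sort_ids=True)
sorted_test = build_id_mapping(test_ids, sort_ids=True)

print(unsorted_train == unsorted_test)  # False: mappings depend on order
print(sorted_train == sorted_test)      # True: mappings agree across splits
```

The same reasoning would apply to the unique classes in the Dirichlet case: with a fixed ordering of classes, the assignment no longer shifts when the examples are reordered.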
Checklist
Any other comments?