This repository was archived by the owner on Sep 18, 2024. It is now read-only.
Add balance in flow_from_directory to handle data imbalance (using random oversampling) #310
Open
DOLARIK wants to merge 11 commits into keras-team:master
_balance_config: dict that stores the key-value pairs needed by later steps to handle data imbalance. _balance_config is not generated for the validation subset, as it does not need resampling (oversampling/handling data imbalance). Yet to implement _make_balance_config.
Scans the directory and generates a dict object that stores the configuration used for handling data imbalance. Yet to implement '_generate_class_count'.
Scans the directory and generates a dict object that maintains a class count (number of image samples in each class/category).
Yet to implement '_settle_debt'
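The class-count step described in these commit notes can be illustrated with a minimal sketch. It assumes one sub-directory per class and counts files by extension. The function name count_images_per_class and the extension list are hypothetical and are not taken from the PR, whose _generate_class_count implementation is not shown here.

```python
import os

# Assumed set of image extensions; the PR's actual list is not shown.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".bmp")


def count_images_per_class(directory):
    """Return {class_name: number of image files} for each sub-directory.

    Hypothetical helper: each immediate sub-directory of `directory`
    is treated as one class/category.
    """
    counts = {}
    for entry in sorted(os.listdir(directory)):
        class_dir = os.path.join(directory, entry)
        if not os.path.isdir(class_dir):
            continue  # ignore stray files at the top level
        counts[entry] = sum(
            1
            for name in os.listdir(class_dir)
            if name.lower().endswith(IMAGE_EXTENSIONS)
        )
    return counts
```

A dict of this shape is enough to find the majority count with max(counts.values()).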
Summary
This PR balances imbalanced classes using random oversampling.
Introduces a new argument, balance, in DirectoryIterator. This argument is boolean (accepts True/False). Example:
Only the training subset is oversampled. The validation subset is excluded, because random oversampling is used solely to increase the number of samples in the training dataset for more robust learning.
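To make the training-only rule concrete, here is a small sketch in plain Python. It is not the PR's code: the function split_then_balance, the validation_split parameter as used here, and the rule of taking the first fraction of each class for validation are all assumptions made for illustration.

```python
import random


def split_then_balance(filenames_by_class, validation_split=0.25, seed=None):
    """Split each class into train/validation, then oversample train only.

    All names and the split rule here are illustrative assumptions.
    """
    rng = random.Random(seed)
    train, validation = {}, {}
    for label, names in filenames_by_class.items():
        n_val = int(len(names) * validation_split)
        validation[label] = list(names[:n_val])  # left exactly as found
        train[label] = list(names[n_val:])

    # The target is the largest *training* class, not the largest class overall.
    target = max((len(names) for names in train.values()), default=0)
    for label, originals in train.items():
        if not originals:
            continue  # nothing to resample from
        extras = [rng.choice(originals) for _ in range(target - len(originals))]
        train[label] = originals + extras
    return train, validation
```

Because resampling happens after the split and draws only from training filenames, no validation file is duplicated and none appears in the training lists.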
I have created a Colab Notebook to play around with this new feature:
Underlying Concept:
Consider an imbalanced dataset having categories A, B and C, with the following files in the respective sub-directories:
The majority count here is 4 (in A), so every other category is also brought up to 4 files by randomly sampling filenames from the original set of filenames in its sub-directory (category directory):
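The balancing rule just described can be sketched in a few lines of plain Python. This is a minimal illustration rather than the PR's implementation: the function name oversample_to_majority is hypothetical, drawing with replacement is an assumption, and the four filenames listed for A are made up for the demo (only the B and C filenames appear in this description).

```python
import random


def oversample_to_majority(filenames_by_class, seed=None):
    """Pad every class to the majority count by resampling its own filenames.

    Hypothetical helper; draws are made with replacement (an assumption).
    """
    rng = random.Random(seed)
    majority = max((len(n) for n in filenames_by_class.values()), default=0)
    balanced = {}
    for label, names in filenames_by_class.items():
        if not names:
            balanced[label] = []  # an empty class has nothing to resample
            continue
        deficit = majority - len(names)
        extras = [rng.choice(names) for _ in range(deficit)]
        # Originals stay first; resampled filenames are appended after them.
        balanced[label] = list(names) + extras
    return balanced


if __name__ == "__main__":
    demo = {
        "A": ["A_0.jpg", "A_1.jpg", "A_2.jpg", "A_3.jpg"],  # assumed names
        "B": ["B_0.jpg", "B_1.jpg"],
        "C": ["C_0.jpg", "C_1.jpg", "C_2.jpg"],
    }
    for label, names in oversample_to_majority(demo).items():
        print(label, names)
```

Which filenames get repeated depends on the random draws, so the output varies between runs unless a seed is passed.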
Here, from B, after randomly oversampling from [B_0.jpg, B_1.jpg] and appending the draws to the list to bring the total number of filenames to 4 (the majority count), we got [B_0.jpg, B_1.jpg, B_1.jpg, B_0.jpg] (the last two filenames are the resampled ones).
Similarly, for C, after random oversampling, we got [C_0.jpg, C_1.jpg, C_2.jpg, C_0.jpg] (the last filename is the resampled one).
After Data Augmentation:
Notice that B_1.jpg{43} and B_1.jpg{05} are technically two different images. This is how random oversampling, combined with data augmentation, increases the number of distinct samples in our dataset.
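A toy sketch can show why a repeated filename need not yield a repeated image. It assumes that suffixes such as {43} and {05} stand for different random augmentations of the same file. The image here is a tiny grayscale grid stored as nested lists, and the functions augment and random_augment are hypothetical stand-ins for the random transforms an image data generator would apply.

```python
import random


def augment(image, flip, brightness):
    """Apply a horizontal flip and a brightness offset to a 2-D grayscale image.

    `image` is a list of rows of pixel values in [0, 255]; a new image is returned.
    """
    rows = [row[::-1] for row in image] if flip else [list(row) for row in image]
    # Clip so that pixel values stay inside the valid range.
    return [[min(255, max(0, px + brightness)) for px in row] for row in rows]


def random_augment(image, rng):
    """Draw augmentation parameters at random (ranges are arbitrary choices)."""
    return augment(image, flip=rng.random() < 0.5, brightness=rng.randint(-40, 40))


if __name__ == "__main__":
    image = [[10, 50], [90, 130]]  # stands in for the pixels of one file
    rng = random.Random()
    # The same source image, drawn twice, usually yields two different arrays.
    print(random_augment(image, rng))
    print(random_augment(image, rng))
```

Two draws can occasionally pick identical parameters, so the variants are distinct with high probability rather than by guarantee.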
This feature has helped me handle data imbalance without using external libraries and keep the whole training pipeline clean, smooth and simple. I hope it'll help others too.
I am still figuring out the best practices for creating unit tests and updating the docs, so I have not yet added new tests for this feature. So that you can test it now, I have created this Colab Notebook, which includes a demo data directory and visualizations. I will add appropriate tests soon.
This new feature has passed all the pre-existing tests.
PR Overview