-
Notifications
You must be signed in to change notification settings - Fork 565
Save labels.txt in artifact for imagefolderdataset #3145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Save labels.txt in artifact for imagefolderdataset #3145
Conversation
…ifact directory for ImageFolderDataset
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #3145 +/- ##
==========================================
- Coverage 81.32% 81.31% -0.02%
==========================================
Files 817 817
Lines 117804 117818 +14
==========================================
- Hits 95802 95800 -2
- Misses 22002 22018 +16 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking on this request!
I have a couple of comments
serde = { workspace = true, features = ["std", "derive"] } | ||
serde_json = { workspace = true, features = ["std"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should keep pointing to workspace dependencies
/// Format for saving labels | ||
#[derive(Debug, Clone, Copy)] | ||
pub enum LabelFormat { | ||
/// Text format with one label per line | ||
Txt, | ||
/// JSON format with an array of labels | ||
Json, | ||
/// YAML format with an array of labels | ||
Yaml, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the flexibility is nice, I would stick to a single format. We can remove the enum and just save a text file.
/// | ||
/// This struct provides a way to access data and labels from a Polars DataFrame | ||
/// as if it were a Dataset of type I with labels of type L. | ||
pub struct LabeledDataframeDataset<I, L> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure this specialization is required 🤔 a dataframe dataset can be used by the user to get the labels from the appropriate field for each item.
Maybe I am missing something. Can you provide more details if you think this should be added?
|
||
/// Dataset where all items are stored in ram. | ||
pub struct InMemDataset<I> { | ||
items: Vec<I>, | ||
} | ||
|
||
/// Dataset where all items and their labels are stored in ram. | ||
pub struct LabeledInMemDataset<I, L> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar argument here. The InMemDataset
is generic over the items. So if items contain a label/class, it is entirely up to the user 🤔
|
||
/// The labeled dataset trait defines a dataset that contains labeled data. | ||
/// It extends the basic Dataset trait with functionality to handle labels. | ||
pub trait LabeledDataset<I, L>: Dataset<I> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trait might be overkill
// Sort classes by index to ensure consistent ordering | ||
let mut sorted_classes: Vec<_> = classes_map.iter().collect(); | ||
sorted_classes.sort_by_key(|(_, idx)| *idx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would also sort classes provided by the user given that public constructors also go through this path.
When parsing folder names automatically, we already sort the names. But I don't think we should sort classes provided by the user. This could lead to unexpected behavior.
// Save labels.txt in the artifact directory | ||
let artifact_dir = "/tmp/burn-dataset"; | ||
let labels_path = std::path::Path::new(&artifact_dir).join("labels.txt"); | ||
|
||
// Create parent directories if they don't exist | ||
if let Some(parent) = labels_path.parent() { | ||
std::fs::create_dir_all(parent) | ||
.map_err(|e| ImageLoaderError::IOError(e.to_string()))?; | ||
} | ||
|
||
// Write labels to file | ||
let mut labels_file = std::fs::File::create(&labels_path) | ||
.map_err(|e| ImageLoaderError::IOError(e.to_string()))?; | ||
|
||
// Sort classes by index to ensure consistent ordering | ||
let mut sorted_classes: Vec<_> = classes_map.iter().collect(); | ||
sorted_classes.sort_by_key(|(_, idx)| *idx); | ||
|
||
for (class_name, _) in sorted_classes { | ||
writeln!(labels_file, "{}", class_name) | ||
.map_err(|e| ImageLoaderError::IOError(e.to_string()))?; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would move this to a separate method instead, e.g., dataset.save_classes(path)
where one could pass the current artifact/experiment dir.
It is not always required, but having an opt-in method could make more sense.
Pull Request Template
Checklist
cargo run-checks
command has been executed.Tip
Want more detailed macro error diagnostics? This is especially useful for debugging tensor-related tests:
RUSTC_BOOTSTRAP=1 RUSTFLAGS="-Zmacro-backtrace" cargo run-checks
Related Issues/PRs
#2761
Changes
The issue requested adding functionality to save a labels.txt file in the artifact directory when using the ImageFolderDataset. This feature helps preserve the mapping between class indices and their corresponding label names, which is particularly valuable for:
The solution implemented:
Testing