Skip to content

Save labels.txt in artifact for imagefolderdataset #3145

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

AshAnand34
Copy link
Contributor

Pull Request Template

Checklist

  • Confirmed that cargo run-checks command has been executed.
  • Made sure the book is up to date with changes in this PR.

Tip

Want more detailed macro error diagnostics? This is especially useful for debugging tensor-related tests:

RUSTC_BOOTSTRAP=1 RUSTFLAGS="-Zmacro-backtrace" cargo run-checks

Related Issues/PRs

#2761

Changes

The issue requested adding functionality to save a labels.txt file in the artifact directory when using the ImageFolderDataset. This feature helps preserve the mapping between class indices and their corresponding label names, which is particularly valuable for:

  1. Maintaining label order and names without requiring the original dataset structure
  2. Reducing errors in model prediction interpretation
  3. Simplifying the mapping of numerical predictions back to their corresponding labels

The solution implemented:

  • Modified the with_items method in ImageFolderDataset to automatically create a labels.txt file
  • Added functionality to write class names to the file in index order
  • Set the default artifact directory to /tmp/burn-dataset
  • Added proper error handling for file operations
  • Updated the documentation to reflect this new feature

Testing

  • Existing test suite for ImageFolderDataset
  • Error handling with directory creation failures and invalid paths.

Copy link

codecov bot commented May 3, 2025

Codecov Report

Attention: Patch coverage is 92.85714% with 1 line in your changes missing coverage. Please review.

Project coverage is 81.31%. Comparing base (5a437b0) to head (6fc0d47).

Files with missing lines Patch % Lines
crates/burn-dataset/src/vision/image_folder.rs 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3145      +/-   ##
==========================================
- Coverage   81.32%   81.31%   -0.02%     
==========================================
  Files         817      817              
  Lines      117804   117818      +14     
==========================================
- Hits        95802    95800       -2     
- Misses      22002    22018      +16     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@AshAnand34 AshAnand34 marked this pull request as ready for review May 4, 2025 00:46
Copy link
Member

@laggui laggui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for taking on this request!

I have a couple of comments

Comment on lines -52 to -53
serde = { workspace = true, features = ["std", "derive"] }
serde_json = { workspace = true, features = ["std"] }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep pointing to workspace dependencies

Comment on lines +100 to +109
/// Format for saving labels
#[derive(Debug, Clone, Copy)]
pub enum LabelFormat {
/// Text format with one label per line
Txt,
/// JSON format with an array of labels
Json,
/// YAML format with an array of labels
Yaml,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While the flexibility is nice, I would stick to a single format. We can remove the enum and just save a text file.

///
/// This struct provides a way to access data and labels from a Polars DataFrame
/// as if it were a Dataset of type I with labels of type L.
pub struct LabeledDataframeDataset<I, L> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this specialization is required 🤔 a dataframe dataset can be used by the user to get the labels from the appropriate field for each item.

Maybe I am missing something. Can you provide more details if you think this should be added?


/// Dataset where all items are stored in ram.
pub struct InMemDataset<I> {
items: Vec<I>,
}

/// Dataset where all items and their labels are stored in ram.
pub struct LabeledInMemDataset<I, L> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar argument here. The InMemDataset is generic over the items. So if items contain a label/class, it is entirely up to the user 🤔


/// The labeled dataset trait defines a dataset that contains labeled data.
/// It extends the basic Dataset trait with functionality to handle labels.
pub trait LabeledDataset<I, L>: Dataset<I>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trait might be overkill

Comment on lines +709 to +711
// Sort classes by index to ensure consistent ordering
let mut sorted_classes: Vec<_> = classes_map.iter().collect();
sorted_classes.sort_by_key(|(_, idx)| *idx);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would also sort classes provided by the user given that public constructors also go through this path.

When parsing folder names automatically, we already sort the names. But I don't think we should sort classes provided by the user. This could lead to unexpected behavior.

Comment on lines +695 to +717
// Save labels.txt in the artifact directory
let artifact_dir = "/tmp/burn-dataset";
let labels_path = std::path::Path::new(&artifact_dir).join("labels.txt");

// Create parent directories if they don't exist
if let Some(parent) = labels_path.parent() {
std::fs::create_dir_all(parent)
.map_err(|e| ImageLoaderError::IOError(e.to_string()))?;
}

// Write labels to file
let mut labels_file = std::fs::File::create(&labels_path)
.map_err(|e| ImageLoaderError::IOError(e.to_string()))?;

// Sort classes by index to ensure consistent ordering
let mut sorted_classes: Vec<_> = classes_map.iter().collect();
sorted_classes.sort_by_key(|(_, idx)| *idx);

for (class_name, _) in sorted_classes {
writeln!(labels_file, "{}", class_name)
.map_err(|e| ImageLoaderError::IOError(e.to_string()))?;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would move this to a separate method instead, e.g., dataset.save_classes(path) where one could pass the current artifact/experiment dir.

It is not always required, but having an opt-in method could make more sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants