Performance Issue: save_to_disk() 200-1200% slower due to unconditional flatten_indices() #7861

@KCKawalkar

🐛 Bug Description

The save_to_disk() method unconditionally calls flatten_indices() when _indices is not None, causing severe performance degradation for datasets processed with filtering, shuffling, or multiprocessed mapping operations.

Root cause: This line rebuilds the entire dataset unnecessarily:

dataset = self.flatten_indices() if self._indices is not None else self
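To make the cost concrete, here is a toy, pure-Python model of an indices mapping. It is not the `datasets` implementation: the class `ToyDataset` and its methods are hypothetical and only mimic the idea that operations such as filtering or shuffling record row positions in `_indices`, while `flatten_indices()` copies every selected row into a new table.

```python
class ToyDataset:
    """Toy stand-in for a dataset: a list of rows plus an optional indices mapping."""

    def __init__(self, rows, indices=None):
        self._rows = rows          # underlying table (shared, never copied here)
        self._indices = indices    # None, or a list of positions into _rows

    def select(self, indices):
        # Cheap: only records which rows are visible, no row data is copied.
        base = self._indices if self._indices is not None else range(len(self._rows))
        return ToyDataset(self._rows, [base[i] for i in indices])

    def flatten_indices(self):
        # Expensive: gathers every selected row into a brand-new table.
        if self._indices is None:
            return self
        return ToyDataset([self._rows[i] for i in self._indices])

    def to_list(self):
        # Visible rows, resolved through the mapping when one exists.
        if self._indices is None:
            return list(self._rows)
        return [self._rows[i] for i in self._indices]


rows = [f"sample {i}" for i in range(5)]
shuffled = ToyDataset(rows).select([3, 0, 4])   # mapping only, rows untouched
flat = shuffled.flatten_indices()               # full copy of the selected rows
```

In this model the selection step touches only the index list, whereas flattening does work proportional to the number of selected rows, which is the step the issue argues should be optional at save time.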

📊 Performance Impact

| Dataset Size | Operation | Save Time | Slowdown |
|---|---|---|---|
| 100K | Baseline (no indices) | 0.027s | - |
| 100K | Filtered (with indices) | 0.146s | +431% |
| 100K | Shuffled (with indices) | 0.332s | +1107% |
| 250K | Shuffled (with indices) | 0.849s | +1202% |

🔄 Reproduction

from datasets import Dataset
import time

# Create dataset
dataset = Dataset.from_dict({'text': [f'sample {i}' for i in range(100000)]})

# Baseline save (no indices)
start = time.time()
dataset.save_to_disk('baseline')
baseline_time = time.time() - start

# Filtered save (creates indices)
filtered = dataset.filter(lambda x: True)
start = time.time()
filtered.save_to_disk('filtered')
filtered_time = time.time() - start

print(f"Baseline: {baseline_time:.3f}s")
print(f"Filtered: {filtered_time:.3f}s")
print(f"Slowdown: {(filtered_time/baseline_time-1)*100:.1f}%")

Expected output: the filtered save is roughly 400-1000% slower than the baseline.
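Single-shot `time.time()` measurements can be noisy, so anyone re-running the comparison may want to repeat each save a few times. The helper below is a minimal sketch using only the standard library; the names `median_time` and `slowdown_percent` are hypothetical and not part of `datasets`.

```python
import statistics
import time


def median_time(fn, repeats=5):
    """Run fn() `repeats` times and return the median wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def slowdown_percent(baseline_seconds, measured_seconds):
    """Relative slowdown versus the baseline, same formula as the script above."""
    return (measured_seconds / baseline_seconds - 1) * 100
```

With `datasets` installed, one could pass something like `lambda: filtered.save_to_disk(tempfile.mkdtemp())` to `median_time` so that each run writes to a fresh directory.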

💡 Proposed Solution

Add optional parameter to control flattening:

def save_to_disk(self, dataset_path, flatten_indices=True):
    dataset = self.flatten_indices() if (self._indices is not None and flatten_indices) else self
    # ... rest of save logic

Benefits:

  • ✅ Immediate performance improvement for users who don't need flattening
  • ✅ Backwards compatible (default behavior unchanged)
  • ✅ Simple implementation
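One consequence of skipping the flatten step is worth spelling out: the row selection then lives only in the indices mapping, so a saver that skips flattening has to persist that mapping as well. The sketch below illustrates this with a toy payload format; `toy_save` and `toy_load` are hypothetical helpers, and nothing here reflects how `datasets` actually writes or reads files.

```python
def toy_save(rows, indices, flatten_indices=True):
    """Return the payload a toy saver would write (hypothetical format)."""
    if indices is None:
        return {"rows": list(rows), "indices": None}
    if flatten_indices:
        # Flattening path: copy the selected rows and drop the mapping.
        return {"rows": [rows[i] for i in indices], "indices": None}
    # Fast path: keep the table as-is, but the mapping must be written too.
    return {"rows": list(rows), "indices": list(indices)}


def toy_load(payload):
    """Rebuild the visible rows from a payload produced by toy_save."""
    rows, indices = payload["rows"], payload["indices"]
    return rows if indices is None else [rows[i] for i in indices]


rows = ["a", "b", "c", "d"]
selected = [2, 0]
flat_payload = toy_save(rows, selected, flatten_indices=True)
fast_payload = toy_save(rows, selected, flatten_indices=False)
```

Both payloads load back to the same visible rows, but the unflattened one still carries every row of the original table, so under these assumptions the faster save is traded against a larger artifact when the filter discards many rows.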

🌍 Environment

  • datasets version: 2.x
  • Python: 3.10+
  • OS: Linux/macOS/Windows

📈 Impact

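This affects workflows that filter or shuffle a dataset before saving, which is common in ML preprocessing. In the measurements above, the shuffled save time grows roughly in proportion to dataset size (0.332s at 100K rows, 0.849s at 250K rows) while the relative slowdown stays above 1100%, so the overhead can become a significant bottleneck for large datasets.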

🔗 Additional Resources

We have comprehensive test scripts demonstrating this across multiple scenarios if needed for further investigation.
