
feat: lazy dataset preprocessing #2007

Open
edjson wants to merge 8 commits into NVIDIA-NeMo:main from edjson:feat/lazy-dataset-preprocessing

Conversation


@edjson edjson commented Apr 23, 2026

What does this PR do ?

Replace .map(fn) dataset preprocessing with LazyMappedDataset, which processes items on the fly and caches the results.
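For context, here is a minimal sketch of the lazy-mapping idea described above (the constructor signature is assumed from the diff; the real class in the PR subclasses torch.utils.data.Dataset, which a plain class stands in for here so the sketch has no heavy dependencies):

```python
from functools import lru_cache
from typing import Any, Callable, Sequence


class LazyMappedDataset:
    """Apply map_fn to items on access instead of eagerly via .map(fn)."""

    def __init__(self, dataset: Sequence, map_fn: Callable[[Any], Any], cache_size: int = 128):
        self._dataset = dataset
        self._map_fn = map_fn
        # LRU cache: repeated accesses to the same index reuse the result
        # instead of re-running map_fn.
        self._get_item = lru_cache(maxsize=cache_size)(self._transform)

    def _transform(self, idx: int) -> Any:
        return self._map_fn(self._dataset[idx])

    def __len__(self) -> int:
        return len(self._dataset)

    def __getitem__(self, idx: int) -> Any:
        return self._get_item(idx)
```

Compared to an eager .map(fn), nothing is transformed until an index is actually requested, and only accessed items occupy cache memory.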

Changelog

  • Added nemo_automodel/components/datasets/lazy_mapped_dataset.py with LazyMappedDataset class.

  • Updated nemo_automodel/components/datasets/llm/xlam.py and nemo_automodel/components/datasets/llm/squad.py to use LazyMappedDataset.

  • Updated tests/unit_tests/datasets/llm/test_xlam.py to match new LazyMappedDataset behavior.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information


copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@edjson edjson changed the title Feat/lazy dataset preprocessing Feat: lazy dataset preprocessing Apr 23, 2026
edjson added 2 commits April 22, 2026 17:21
Signed-off-by: Edison <edisonggacc@gmail.com>
Signed-off-by: Edison <edisonggacc@gmail.com>
@edjson edjson force-pushed the feat/lazy-dataset-preprocessing branch from ac57989 to 022c5ab on April 23, 2026 00:21
@edjson edjson changed the title Feat: lazy dataset preprocessing feat: lazy dataset preprocessing Apr 23, 2026
Contributor

akoumpa commented Apr 23, 2026

/ok to test 289f728

Signed-off-by: Edison <edisonggacc@gmail.com>
@edjson edjson force-pushed the feat/lazy-dataset-preprocessing branch from 289f728 to 052074c on April 23, 2026 04:49
Contributor

akoumpa commented Apr 23, 2026

/ok to test 052074c

Signed-off-by: Edison <edisonggacc@gmail.com>
Contributor

akoumpa commented Apr 23, 2026

/ok to test ca6d515

Signed-off-by: Edison <edisonggacc@gmail.com>
Contributor

akoumpa commented Apr 23, 2026

/claude review

Comment on lines +55 to +64
        if cache_size > 0:

            @lru_cache(maxsize=cache_size)
            def _cached_transform(idx: int) -> Any:
                return self._map_fn(self._dataset[idx])

            self._get_item = _cached_transform
            logger.debug("LazyMappedDataset: LRU cache enabled (maxsize=%d)", cache_size)
        else:
            self._get_item = lambda idx: self._map_fn(self._dataset[idx])
Contributor

Bug: Both the lru_cache-wrapped local function (line 58) and the lambda (line 64) stored in self._get_item are not picklable. This will cause DataLoader(num_workers>0) to fail with spawn or forkserver start methods (which are the recommended/default methods for CUDA workloads).

The old .map() approach returned a HuggingFace Dataset that is fully picklable, so this is a regression.

You could fix this by implementing __getstate__/__setstate__ to drop and rebuild the cache on unpickle, or by using a picklable caching strategy instead of lru_cache on a local function.

logger = logging.getLogger(__name__)


class LazyMappedDataset(Dataset):
Contributor

Missing tests: LazyMappedDataset is a new public class but has no dedicated unit tests. The existing squad/xlam tests exercise it indirectly, but there should be standalone tests covering at minimum:

  • Basic __getitem__ / __len__ behavior
  • Cache hits (verify map_fn is called once per index, not on repeat access)
  • cache_size=0 (no caching) path
  • Pickling round-trip (important for DataLoader multi-worker compatibility — see other comment)
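A sketch of what such standalone tests could look like (pytest style; a minimal stand-in class is defined inline so the snippet is self-contained, standing in for the real implementation in nemo_automodel/components/datasets/lazy_mapped_dataset.py):

```python
from functools import lru_cache


class LazyMappedDataset:
    # Minimal stand-in for the class under review, so the tests below run.
    def __init__(self, dataset, map_fn, cache_size=128):
        self._dataset, self._map_fn = dataset, map_fn
        if cache_size > 0:
            self._get_item = lru_cache(maxsize=cache_size)(self._transform)
        else:
            self._get_item = self._transform

    def _transform(self, idx):
        return self._map_fn(self._dataset[idx])

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, idx):
        return self._get_item(idx)


def test_getitem_and_len():
    ds = LazyMappedDataset([1, 2, 3], lambda x: x + 1)
    assert len(ds) == 3
    assert ds[0] == 2


def test_cache_hit_calls_map_fn_once():
    calls = []

    def fn(x):
        calls.append(x)
        return x

    ds = LazyMappedDataset([10, 20], fn, cache_size=8)
    ds[1]
    ds[1]
    assert calls == [20]  # second access is served from the cache


def test_cache_disabled():
    calls = []

    def fn(x):
        calls.append(x)
        return x

    ds = LazyMappedDataset([10], fn, cache_size=0)
    ds[0]
    ds[0]
    assert calls == [10, 10]  # map_fn runs on every access when cache_size=0
```

A pickling round-trip test would follow the same pattern once __getstate__/__setstate__ support lands.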

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Apr 23, 2026
        if cache_size is None:
            cache_size = len(dataset)

        self._cache_size = cache_size
Contributor

Hi @edjson, do you know if datasets provides any facility for caching on disk that we could leverage here? I realized today that this will inevitably cache in memory, which may be trouble for large datasets. Please let me know what you think. I know the LRU is already in place and should help mitigate the problem, but I'm wondering whether there are additional options we could explore. Thanks.

Author

Hello,
Thank you for raising this. HuggingFace does offer some disk-caching facilities we could leverage, including:

  • Enable streaming. If a dataset is too large, we can use streaming to download and process data on the fly, which is well suited to large datasets.

  • Memory mapping. load_dataset downloads and caches data to disk, and we can set the location of the Arrow cache file in .map() so it is written to an SSD instead of an HDD. For a simpler system-wide approach, setting the HF_DATASETS_CACHE environment variable directs all caching to a specific path. Also, with the num_proc parameter in .map(), each worker loads its own shard into memory simultaneously, multiplying RAM usage, so specifying the cache location is worth exploring.

  • Disable in-memory caching. Making sure keep_in_memory is set to False means datasets will not load data into memory. There is also a global toggle, datasets.disable_caching()/datasets.enable_caching(), if we want full control over when caching happens.

  • Save intermediate results to disk. We can use .save_to_disk after processing to free RAM on frequent runs. This preserves the Arrow format, so re-loading is fast and does not require reprocessing.

Given the concern with large datasets, streaming and Arrow cache redirection may be the best starting points. I would be happy to open a separate PR exploring an optional cache_dir parameter for the dataset functions; users could then pass a cache_file_name to .map() to specify a cache location via the YAML config.
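For illustration, the streaming and cache-redirection options above could look like this (a sketch only: the dataset name and paths are placeholders, and it assumes the Hugging Face datasets API; it requires a network connection and is not meant to run as-is):

```python
from datasets import load_dataset

# Option 1: streaming — examples are downloaded and transformed on the fly,
# never fully materialized on disk or in RAM.
stream = load_dataset("squad", split="train", streaming=True)
stream = stream.map(lambda ex: {"n_chars": len(ex["context"])})

# Option 2: eager .map(), but with the Arrow cache file redirected to fast
# storage and the mapped copy kept out of RAM (path is a placeholder).
ds = load_dataset("squad", split="train")
ds = ds.map(
    lambda ex: {"n_chars": len(ex["context"])},
    cache_file_name="/mnt/fast_ssd/squad_train_mapped.arrow",
    keep_in_memory=False,
)

# Option 3: persist the processed dataset so later runs skip reprocessing.
ds.save_to_disk("/mnt/fast_ssd/squad_train_processed")
```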

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Apr 24, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Apr 25, 2026