
feat: lazy dataset preprocessing #2007

Open
edjson wants to merge 8 commits into NVIDIA-NeMo:main from edjson:feat/lazy-dataset-preprocessing

Conversation


@edjson edjson commented Apr 23, 2026

What does this PR do ?

Replace .map(fn) dataset preprocessing with LazyMappedDataset, which processes items on the fly and caches the results.
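For context, here is a minimal sketch of the lazy-mapping idea described above (the constructor signature is assumed from the diff; the real class in the PR subclasses torch.utils.data.Dataset, which a plain class stands in for here so the sketch has no heavy dependencies):

```python
from functools import lru_cache
from typing import Any, Callable, Sequence


class LazyMappedDataset:
    """Apply map_fn to items on access instead of eagerly via .map(fn)."""

    def __init__(self, dataset: Sequence, map_fn: Callable[[Any], Any], cache_size: int = 128):
        self._dataset = dataset
        self._map_fn = map_fn
        # LRU cache: repeated accesses to the same index reuse the result
        # instead of re-running map_fn.
        self._get_item = lru_cache(maxsize=cache_size)(self._transform)

    def _transform(self, idx: int) -> Any:
        return self._map_fn(self._dataset[idx])

    def __len__(self) -> int:
        return len(self._dataset)

    def __getitem__(self, idx: int) -> Any:
        return self._get_item(idx)
```

Compared to an eager .map(fn), nothing is transformed until an index is actually requested, and only accessed items occupy cache memory.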

Changelog

  • Added nemo_automodel/components/datasets/lazy_mapped_dataset.py with LazyMappedDataset class.

  • Updated nemo_automodel/components/datasets/llm/xlam.py and nemo_automodel/components/datasets/llm/squad.py to use LazyMappedDataset.

  • Updated tests/unit_tests/datasets/llm/test_xlam.py to match new LazyMappedDataset behavior.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information


copy-pr-bot Bot commented Apr 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@edjson edjson changed the title Feat/lazy dataset preprocessing Feat: lazy dataset preprocessing Apr 23, 2026
edjson added 2 commits April 22, 2026 17:21
Signed-off-by: Edison <edisonggacc@gmail.com>
Signed-off-by: Edison <edisonggacc@gmail.com>
@edjson edjson force-pushed the feat/lazy-dataset-preprocessing branch from ac57989 to 022c5ab on April 23, 2026 00:21
@edjson edjson changed the title Feat: lazy dataset preprocessing feat: lazy dataset preprocessing Apr 23, 2026
Contributor

akoumpa commented Apr 23, 2026

/ok to test 289f728

Signed-off-by: Edison <edisonggacc@gmail.com>
@edjson edjson force-pushed the feat/lazy-dataset-preprocessing branch from 289f728 to 052074c on April 23, 2026 04:49
Contributor

akoumpa commented Apr 23, 2026

/ok to test 052074c

Signed-off-by: Edison <edisonggacc@gmail.com>
Contributor

akoumpa commented Apr 23, 2026

/ok to test ca6d515

Signed-off-by: Edison <edisonggacc@gmail.com>
Contributor

akoumpa commented Apr 23, 2026

/claude review

Comment on lines +55 to +64
        if cache_size > 0:

            @lru_cache(maxsize=cache_size)
            def _cached_transform(idx: int) -> Any:
                return self._map_fn(self._dataset[idx])

            self._get_item = _cached_transform
            logger.debug("LazyMappedDataset: LRU cache enabled (maxsize=%d)", cache_size)
        else:
            self._get_item = lambda idx: self._map_fn(self._dataset[idx])
Contributor

Bug: Both the lru_cache-wrapped local function (line 58) and the lambda (line 64) stored in self._get_item are not picklable. This will cause DataLoader(num_workers>0) to fail with spawn or forkserver start methods (which are the recommended/default methods for CUDA workloads).

The old .map() approach returned a HuggingFace Dataset that is fully picklable, so this is a regression.

You could fix this by implementing __getstate__/__setstate__ to drop and rebuild the cache on unpickle, or by using a picklable caching strategy instead of lru_cache on a local function.

logger = logging.getLogger(__name__)


class LazyMappedDataset(Dataset):
Contributor

Missing tests: LazyMappedDataset is a new public class but has no dedicated unit tests. The existing squad/xlam tests exercise it indirectly, but there should be standalone tests covering at minimum:

  • Basic __getitem__ / __len__ behavior
  • Cache hits (verify map_fn is called once per index, not on repeat access)
  • cache_size=0 (no caching) path
  • Pickling round-trip (important for DataLoader multi-worker compatibility — see other comment)
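A sketch of what such standalone tests could look like (pytest style; a minimal stand-in class is defined inline so the snippet is self-contained, standing in for the real implementation in nemo_automodel/components/datasets/lazy_mapped_dataset.py):

```python
from functools import lru_cache


class LazyMappedDataset:
    # Minimal stand-in for the class under review, so the tests below run.
    def __init__(self, dataset, map_fn, cache_size=128):
        self._dataset, self._map_fn = dataset, map_fn
        if cache_size > 0:
            self._get_item = lru_cache(maxsize=cache_size)(self._transform)
        else:
            self._get_item = self._transform

    def _transform(self, idx):
        return self._map_fn(self._dataset[idx])

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, idx):
        return self._get_item(idx)


def test_getitem_and_len():
    ds = LazyMappedDataset([1, 2, 3], lambda x: x + 1)
    assert len(ds) == 3
    assert ds[0] == 2


def test_cache_hit_calls_map_fn_once():
    calls = []

    def fn(x):
        calls.append(x)
        return x

    ds = LazyMappedDataset([10, 20], fn, cache_size=8)
    ds[1]
    ds[1]
    assert calls == [20]  # second access is served from the cache


def test_cache_disabled():
    calls = []

    def fn(x):
        calls.append(x)
        return x

    ds = LazyMappedDataset([10], fn, cache_size=0)
    ds[0]
    ds[0]
    assert calls == [10, 10]  # map_fn runs on every access when cache_size=0
```

A pickling round-trip test would follow the same pattern once __getstate__/__setstate__ support lands.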

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Apr 23, 2026
        if cache_size is None:
            cache_size = len(dataset)

        self._cache_size = cache_size
Contributor

Hi @edjson, do you know if datasets provides any facility for caching on disk that we could leverage here? I realized today that this will inevitably cache in memory, which may be trouble for large datasets. Please let me know what you think. I know the LRU is already in place and should help mitigate the problem, but I'm wondering whether there are additional options we could explore. Thanks.

Author

Hello,
Thank you for raising this. HuggingFace does offer some disk-caching facilities we could leverage, including:

  • Enable streaming. If a dataset is too large, we can use streaming to download and process data on the fly, which is well suited to large datasets.

  • Memory mapping. load_dataset downloads and caches data to disk, and we can set the location of the Arrow cache file in .map() so it is written to an SSD instead of an HDD. For a simpler system-wide approach, setting the HF_DATASETS_CACHE environment variable directs all caching to a specific path. Also, with the num_proc parameter in .map(), each worker loads its own shard into memory simultaneously, multiplying RAM usage, so specifying the cache location is worth exploring.

  • Disable in-memory caching. Making sure keep_in_memory is set to False means datasets will not load data into memory. There is also a global toggle, datasets.disable_caching()/datasets.enable_caching(), if we want full control over when caching happens.

  • Save intermediate results to disk. We can use .save_to_disk after processing to free RAM on frequent runs. This preserves the Arrow format, so re-loading is fast and does not require reprocessing.

Given the concern with large datasets, streaming and Arrow cache redirection may be the best starting points. I would be happy to open a separate PR exploring an optional cache_dir parameter for the dataset functions; users could then pass a cache_file_name to .map() to specify a cache location via the YAML config.
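For illustration, the streaming and cache-redirection options above could look like this (a sketch only: the dataset name and paths are placeholders, and it assumes the Hugging Face datasets API; it requires a network connection and is not meant to run as-is):

```python
from datasets import load_dataset

# Option 1: streaming — examples are downloaded and transformed on the fly,
# never fully materialized on disk or in RAM.
stream = load_dataset("squad", split="train", streaming=True)
stream = stream.map(lambda ex: {"n_chars": len(ex["context"])})

# Option 2: eager .map(), but with the Arrow cache file redirected to fast
# storage and the mapped copy kept out of RAM (path is a placeholder).
ds = load_dataset("squad", split="train")
ds = ds.map(
    lambda ex: {"n_chars": len(ex["context"])},
    cache_file_name="/mnt/fast_ssd/squad_train_mapped.arrow",
    keep_in_memory=False,
)

# Option 3: persist the processed dataset so later runs skip reprocessing.
ds.save_to_disk("/mnt/fast_ssd/squad_train_processed")
```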

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label Apr 24, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label Apr 25, 2026