Conversation
Signed-off-by: Edison <edisonggacc@gmail.com>
/ok to test 289f728
Signed-off-by: Edison <edisonggacc@gmail.com>
/ok to test 052074c
Signed-off-by: Edison <edisonggacc@gmail.com>
/ok to test ca6d515
Signed-off-by: Edison <edisonggacc@gmail.com>
/claude review
```python
if cache_size > 0:

    @lru_cache(maxsize=cache_size)
    def _cached_transform(idx: int) -> Any:
        return self._map_fn(self._dataset[idx])

    self._get_item = _cached_transform
    logger.debug("LazyMappedDataset: LRU cache enabled (maxsize=%d)", cache_size)
else:
    self._get_item = lambda idx: self._map_fn(self._dataset[idx])
```
Bug: both the `lru_cache`-wrapped local function (line 58) and the lambda (line 64) stored in `self._get_item` are not picklable. This will cause `DataLoader(num_workers>0)` to fail with the `spawn` or `forkserver` start methods (which are the recommended/default methods for CUDA workloads).
The old `.map()` approach returned a HuggingFace `Dataset` that is fully picklable, so this is a regression.
You could fix this by implementing `__getstate__`/`__setstate__` to drop and rebuild the cache on unpickle, or by using a picklable caching strategy instead of `lru_cache` on a local function.
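One way the `__getstate__`/`__setstate__` suggestion could look (a minimal sketch, not the PR's actual code; the helper names `_build_get_item` and `_transform` are my own): keep only picklable state, drop the cache before pickling, and rebuild an empty cache after unpickling.

```python
from functools import lru_cache


class LazyMappedDataset:
    """Sketch of a pickle-safe lazily mapped dataset (illustrative only)."""

    def __init__(self, dataset, map_fn, cache_size=None):
        self._dataset = dataset
        self._map_fn = map_fn
        self._cache_size = len(dataset) if cache_size is None else cache_size
        self._build_get_item()

    def _build_get_item(self):
        # The cached accessor is derived purely from picklable state,
        # so it can be recreated in a fresh worker process.
        if self._cache_size > 0:
            self._get_item = lru_cache(maxsize=self._cache_size)(self._transform)
        else:
            self._get_item = self._transform

    def _transform(self, idx):
        return self._map_fn(self._dataset[idx])

    def __getitem__(self, idx):
        return self._get_item(idx)

    def __len__(self):
        return len(self._dataset)

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("_get_item", None)  # drop the unpicklable lru_cache wrapper
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self._build_get_item()  # rebuild an empty cache in the new process
```

Note the cache starts empty in each worker after unpickling, and `map_fn` itself must still be picklable (a module-level function rather than a lambda).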
```python
logger = logging.getLogger(__name__)


class LazyMappedDataset(Dataset):
```
Missing tests: `LazyMappedDataset` is a new public class but has no dedicated unit tests. The existing squad/xlam tests exercise it indirectly, but there should be standalone tests covering at minimum:
- Basic `__getitem__`/`__len__` behavior
- Cache hits (verify `map_fn` is called once per index, not on repeat access)
- `cache_size=0` (no caching) path
- Pickling round-trip (important for `DataLoader` multi-worker compatibility; see other comment)
Signed-off-by: Edison <edisonggacc@gmail.com>
```python
if cache_size is None:
    cache_size = len(dataset)

self._cache_size = cache_size
```
Hi @edjson, do you know if `datasets` provides any facility for caching on disk that we could leverage here? I realized today that, inevitably, this will cache in memory, which may be trouble for large datasets. Please let me know what you think. I know the LRU is already in place and should help mitigate the problem, but I'm wondering whether there are additional options we could explore. Thanks.
Hello,
Thank you for raising this. HuggingFace does have some disk caching methods we can leverage, including:
- Enable streaming. If datasets are too large, we can use streaming to download and process data on the fly, which is good for large datasets.
- Memory mapping. `load_dataset` downloads and caches data to disk; however, we can set the location for the Arrow cache file in `map()` so that it is written to an SSD instead of an HDD. We can set `HF_DATASETS_CACHE` as an environment variable to direct all caching to a specific path for a simpler system-wide approach. Also, with the `num_proc` parameter in `.map()` when using multiprocessing, each worker loads its own shard into memory simultaneously, multiplying RAM usage, so specifying the cache location is worth exploring.
- Disable in-memory caching. Make sure that `keep_in_memory` is set to `False` so that datasets will not load data into memory. There is a global toggle in `datasets.disable_caching()`/`datasets.enable_caching()` if we want full control over when caching happens.
- Intermediately save to disk. We can use `.save_to_disk` after processing to free RAM on frequent runs. This preserves the Arrow format, so re-loading is fast and does not require reprocessing.

Given the concern with large datasets, streaming and Arrow cache redirection may be the best starting points. I would be happy to open a separate PR to explore adding an optional `cache_dir` parameter to the dataset functions. That way users would pass a `cache_file_name` to `.map()` to specify a cache location in the YAML config.
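Separately from the HF-native options above, the general pattern of spilling the transform cache to disk instead of RAM can be sketched with only the stdlib `shelve` module (the class and parameter names here are hypothetical, not part of this PR):

```python
import shelve


class DiskCachedTransform:
    """Memoize map_fn(dataset[idx]) results on disk rather than in RAM.

    Illustrative sketch only: the cache lives in a shelve database at
    cache_path, so memory usage stays flat regardless of dataset size.
    """

    def __init__(self, dataset, map_fn, cache_path):
        self._dataset = dataset
        self._map_fn = map_fn
        self._cache_path = cache_path

    def __getitem__(self, idx):
        key = str(idx)  # shelve keys must be strings
        with shelve.open(self._cache_path) as db:
            if key not in db:
                db[key] = self._map_fn(self._dataset[idx])
            return db[key]

    def __len__(self):
        return len(self._dataset)
```

Re-opening the shelf on every access keeps the sketch simple but is slow; a real implementation would hold the shelf open for the object's lifetime and close it explicitly.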
What does this PR do?
Replace `.map(fn)` dataset processing with `LazyMappedDataset`, which processes items on the fly and caches results.

Changelog
- Added `nemo_automodel/components/datasets/lazy_mapped_dataset.py` with the `LazyMappedDataset` class.
- Updated `nemo_automodel/components/datasets/llm/xlam.py` and `nemo_automodel/components/datasets/llm/squad.py` to use `LazyMappedDataset`.
- Updated `tests/unit_tests/datasets/llm/test_xlam.py` to match the new `LazyMappedDataset` behavior.

Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information