
what's the biggest dataset you've tried? #1253

@exnx

Description


Hello, I have a dataset of 7T tokens, which the gpt-neox codebase preprocesses into about 5000 .npy files. I can train a 7B model on this with 32 GPUs, but when I try 64 GPUs I get an error saying too many files are open, i.e. the max-open-files limit is reached. I believe a file descriptor is opened for each of the 5000 .npy files per GPU and worker, so the more GPUs, the more open files. Has anyone else run into a similar limit? The current limit reported by `ulimit -n` is 1048576.
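For rough accounting (the per-GPU worker count and GPUs-per-node value below are placeholders, not measured), the descriptor count should scale roughly as shards × workers × local GPUs; here's a small sketch of how I'm checking what limit the training process actually sees and raising the soft limit:

```python
import resource

# Rough, assumed accounting: if every dataloader worker memory-maps all of the
# ~5000 .npy shards, open descriptors per node grow roughly as
# shards * workers_per_gpu * gpus_per_node, before counting sockets and pipes.
n_shards = 5000
workers_per_gpu = 2   # placeholder value
gpus_per_node = 8     # placeholder value
print("worst-case mmap fds per node:", n_shards * workers_per_gpu * gpus_per_node)

# Check the limits this process actually sees (they may differ from the shell's
# `ulimit -n`) and raise the soft limit up to the hard limit, which does not
# require root.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("RLIMIT_NOFILE soft:", soft, "hard:", hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```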

Here's the error I got:

```
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/socket.py", line 546, in fromfd
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/threading.py", line 975, in run
GPUCA6E:     nfd = dup(fd)
GPUCA6E:             self._target(*self._args, **self._kwargs) 
GPUCA6E:  ^^^^^^^
GPUCA6E: OSError: [Errno 24] Too many open files
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 54, in _pin_memory_loop
GPUCA6E:     do_one_step()
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/utils/data/_utils/pin_memory.py", line 31, in do_one_step
GPUCA6E:     r = in_queue.get(timeout=MP_STATUS_CHECK_INTERVAL)
GPUCA6E:         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/queues.py", line 122, in get
GPUCA6E:     return _ForkingPickler.loads(res)
GPUCA6E:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/site-packages/torch/multiprocessing/reductions.py", line 355, in rebuild_storage_fd
GPUCA6E:     fd = df.detach()
GPUCA6E:          ^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/resource_sharer.py", line 58, in detach
GPUCA6E:     return reduction.recv_handle(conn)
GPUCA6E:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 189, in recv_handle
GPUCA6E:     return recvfds(s, 1)[0]
GPUCA6E:            ^^^^^^^^^^^^^
GPUCA6E:   File "/home/cirrascale/miniconda3/envs/flash-devo-copy/lib/python3.11/multiprocessing/reduction.py", line 159, in recvfds
GPUCA6E:     raise EOFError
GPUCA6E: EOFError
```
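The failure is inside torch's file-descriptor-based storage sharing (`rebuild_storage_fd` / `resource_sharer`), so one workaround I'm looking at, untested on this setup, is switching PyTorch's tensor-sharing strategy from the default `file_descriptor` to `file_system` at startup:

```python
import torch.multiprocessing as mp

# With the default "file_descriptor" strategy, every shared CPU tensor passed
# from a dataloader worker to the pin_memory thread travels as an fd, which
# adds to the descriptors already used by the memory-mapped .npy shards.
# "file_system" shares storages through named files in shared memory instead.
if "file_system" in mp.get_all_sharing_strategies():
    mp.set_sharing_strategy("file_system")
```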
