
LoRA weight merging giving torch distributed error on single-node single-gpu #255

@ayushbits

Description

I am running this notebook. However, when I try to merge the LoRA adapter weights with the base model weights before exporting to TensorRT-LLM (using python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py), I receive the following error:
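
For reference, the full invocation with the Hydra overrides (they are echoed verbatim in the error output below) is:

    python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py \
        trainer.accelerator=gpu \
        tensor_model_parallel_size=1 \
        pipeline_model_parallel_size=1 \
        gpt_model_file=gemma_2b_pt.nemo \
        lora_model_path=nemo_experiments/gemma_lora_pubmedqa/checkpoints/gemma_lora_pubmedqa.nemo \
        merged_model_path=gemma_lora_pubmedqa_merged.nemo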

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:53747 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Error executing job with overrides: ['trainer.accelerator=gpu', 'tensor_model_parallel_size=1', 'pipeline_model_parallel_size=1', 'gpt_model_file=gemma_2b_pt.nemo', 'lora_model_path=nemo_experiments/gemma_lora_pubmedqa/checkpoints/gemma_lora_pubmedqa.nemo', 'merged_model_path=gemma_lora_pubmedqa_merged.nemo']
Traceback (most recent call last):
 File "/opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py", line 171, in main
   model = MegatronGPTModel.restore_from(
 File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/nlp_model.py", line 478, in restore_from
   return super().restore_from(
 File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/modelPT.py", line 468, in restore_from
   instance = cls._save_restore_connector.restore_from(
 File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 1306, in restore_from
   trainer.strategy.setup_environment()
 File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 154, in setup_environment
   self.setup_distributed()
 File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 244, in setup_distributed
   super().setup_distributed()
 File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 203, in setup_distributed
   _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
 File "/usr/local/lib/python3.10/dist-packages/lightning_fabric/utilities/distributed.py", line 297, in _init_dist_connection
   torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
   func_return = func(*args, **kwargs)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1172, in init_process_group
   store, rank, world_size = next(rendezvous_iterator)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 244, in _env_rendezvous_handler
   store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
 File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
   return TCPStore(
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:53747 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).

Setup Information:

torch: 2.2.0a0+81ea7a4
nemo: 2.0
Container: nvcr.io/nvidia/nemo:24.01.gemma
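
A possible workaround, as a minimal sketch: this assumes the errno 98 means another process on the node (for example a stale notebook kernel or an earlier training run) is still holding the port, and that Lightning's cluster environment honors the standard MASTER_ADDR / MASTER_PORT environment variables. The port number 29501 below is arbitrary, not something from the original run.

    # Check whether the port from the error is still held by another process
    # (e.g. a stale notebook kernel or an earlier training job on this node)
    ss -tlnp | grep 53747

    # Pin the rendezvous to a known-free port, then rerun the merge.py
    # command above with the same overrides
    export MASTER_ADDR=localhost
    export MASTER_PORT=29501   # arbitrary unused port

If the port check shows an old process, shutting it down (or restarting the container) should also clear the conflict.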
