Description
I am running this notebook. However, when I try to merge the LoRA and base model weights before exporting to TensorRT-LLM (python /opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py), I receive the following error:
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W socket.cpp:464] [c10d] The server socket has failed to bind to [::]:53747 (errno: 98 - Address already in use).
[W socket.cpp:464] [c10d] The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
[E socket.cpp:500] [c10d] The server socket has failed to listen on any local network address.
Error executing job with overrides: ['trainer.accelerator=gpu', 'tensor_model_parallel_size=1', 'pipeline_model_parallel_size=1', 'gpt_model_file=gemma_2b_pt.nemo', 'lora_model_path=nemo_experiments/gemma_lora_pubmedqa/checkpoints/gemma_lora_pubmedqa.nemo', 'merged_model_path=gemma_lora_pubmedqa_merged.nemo']
Traceback (most recent call last):
File "/opt/NeMo/scripts/nlp_language_modeling/merge_lora_weights/merge.py", line 171, in main
model = MegatronGPTModel.restore_from(
File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/models/nlp_model.py", line 478, in restore_from
return super().restore_from(
File "/usr/local/lib/python3.10/dist-packages/nemo/core/classes/modelPT.py", line 468, in restore_from
instance = cls._save_restore_connector.restore_from(
File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 1306, in restore_from
trainer.strategy.setup_environment()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 154, in setup_environment
self.setup_distributed()
File "/usr/local/lib/python3.10/dist-packages/nemo/collections/nlp/parts/nlp_overrides.py", line 244, in setup_distributed
super().setup_distributed()
File "/usr/local/lib/python3.10/dist-packages/pytorch_lightning/strategies/ddp.py", line 203, in setup_distributed
_init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
File "/usr/local/lib/python3.10/dist-packages/lightning_fabric/utilities/distributed.py", line 297, in _init_dist_connection
torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
func_return = func(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1172, in init_process_group
store, rank, world_size = next(rendezvous_iterator)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 244, in _env_rendezvous_handler
store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/rendezvous.py", line 172, in _create_c10d_store
return TCPStore(
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:53747 (errno: 98 - Address already in use). The server socket has failed to bind to ?UNKNOWN? (errno: 98 - Address already in use).
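For context, errno 98 (Address already in use) means another process, often a stale notebook kernel or an earlier distributed run, is still holding the rendezvous port that torch.distributed tries to bind. As a workaround (a minimal sketch, not from the original report), one can ask the OS for a free port and export it as MASTER_PORT before launching merge.py, so the TCPStore rendezvous binds cleanly:

```python
import os
import socket

def find_free_port() -> int:
    # Binding to port 0 lets the OS assign an unused ephemeral port;
    # we read it back and release the socket immediately.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]

port = find_free_port()
# torch.distributed's env:// rendezvous reads MASTER_ADDR/MASTER_PORT.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(port)
print(f"Using MASTER_PORT={port}")
```

There is a small race between releasing the socket and the launched process re-binding it, but for a single-node run this is usually sufficient; restarting the notebook kernel that holds the old port also clears the conflict.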
Setup Information:
torch: 2.2.0a0+81ea7a4
nemo: 2.0
Container: nvcr.io/nvidia/nemo:24.01.gemma