
LLM Fine-Tuning Errors in LLM worker pod with pytorch container #2587

@skb888

Description

Hi Team, I followed the instructions to run the LLM fine-tuning, but I am hitting the errors below.
They come from:
kubectl logs pod llama-ppwtq5t2-worker-0 -n -c pytorch

!pip install -U kubeflow-katib
Successfully installed kubeflow-katib-0.18.0

!pip install -U "kubeflow-training[huggingface]"
Successfully installed peft-0.15.1 tokenizers-0.21.4 transformers-4.50.2
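For context, the fine-tuning path I followed serializes transformers.TrainingArguments into the --training_parameters JSON string that hf_llm_training.py parses inside the pytorch container. Roughly, the launch looks like this (a minimal sketch assuming the kubeflow-training SDK's TrainingClient.train() API; the model, dataset, and resource values are illustrative placeholders, not my exact configuration):

import transformers
from peft import LoraConfig
from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

# Placeholder values; the SDK turns training_parameters into the
# --training_parameters JSON consumed by hf_llm_training.py.
TrainingClient().train(
    name="llama",
    num_workers=2,
    num_procs_per_worker=1,
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "16Gi"},
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://meta-llama/Llama-3.2-1B",  # placeholder model
        transformer_type=transformers.AutoModelForCausalLM,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="imdb",  # placeholder dataset
        split="train[:1000]",
    ),
    trainer_parameters=HuggingFaceTrainerParams(
        lora_config=LoraConfig(r=8, lora_alpha=8, lora_dropout=0.1, bias="none"),
        training_parameters=transformers.TrainingArguments(
            output_dir="output",
            save_strategy="no",
            learning_rate=1e-4,
            num_train_epochs=1,
        ),
    ),
)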

Here are the detailed errors:
2025-10-31T03:31:21Z INFO Starting HuggingFace LLM Trainer
Traceback (most recent call last):
File "/app/hf_llm_training.py", line 188, in
**train_args = TrainingArguments(json.loads(args.training_parameters))
File "/usr/lib/python3.10/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E1031 03:31:22.800000 140614705968960 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 52) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.4.0a0+f70bd71a48.nv24.6', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
hf_llm_training.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-10-31_03:31:22
host : llama-ppwtq5t2-worker-0
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 52)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
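The root cause appears to be the JSONDecodeError above: json.loads() in hf_llm_training.py receives a string that is not valid JSON. An empty string produces exactly this "Expecting value: line 1 column 1 (char 0)" message, so it looks like the --training_parameters argument reaches the trainer empty or mangled. A minimal repro of the failing parse (a sketch of the step shown in the traceback, not the actual trainer code):

import json

# hf_llm_training.py effectively does: TrainingArguments(**json.loads(args.training_parameters))
# If --training_parameters arrives as an empty (or otherwise non-JSON) string,
# json.loads raises the same error seen in the worker log:
json.loads("")
# json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)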
