
LLM Fine-Tuning Errors in LLM worker pod with pytorch container #2587

@skb888

Description

Hi Team, I followed the instructions to run the LLM fine-tuning, but I am hitting the errors below.
They come from:
kubectl logs pod llama-ppwtq5t2-worker-0 -n -c pytorch

!pip install -U kubeflow-katib
Successfully installed kubeflow-katib-0.18.0

!pip install -U "kubeflow-training[huggingface]"
Successfully installed peft-0.15.1 tokenizers-0.21.4 transformers-4.50.2
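For context, the fine-tuning path I followed serializes transformers.TrainingArguments into the --training_parameters JSON string that hf_llm_training.py parses inside the pytorch container. Roughly, the launch looks like this (a minimal sketch assuming the kubeflow-training SDK's TrainingClient.train() API; the model, dataset, and resource values are illustrative placeholders, not my exact configuration):

import transformers
from peft import LoraConfig
from kubeflow.training import TrainingClient
from kubeflow.storage_initializer.hugging_face import (
    HuggingFaceModelParams,
    HuggingFaceDatasetParams,
    HuggingFaceTrainerParams,
)

# Placeholder values; the SDK turns training_parameters into the
# --training_parameters JSON consumed by hf_llm_training.py.
TrainingClient().train(
    name="llama",
    num_workers=2,
    num_procs_per_worker=1,
    resources_per_worker={"gpu": 1, "cpu": 4, "memory": "16Gi"},
    model_provider_parameters=HuggingFaceModelParams(
        model_uri="hf://meta-llama/Llama-3.2-1B",  # placeholder model
        transformer_type=transformers.AutoModelForCausalLM,
    ),
    dataset_provider_parameters=HuggingFaceDatasetParams(
        repo_id="imdb",  # placeholder dataset
        split="train[:1000]",
    ),
    trainer_parameters=HuggingFaceTrainerParams(
        lora_config=LoraConfig(r=8, lora_alpha=8, lora_dropout=0.1, bias="none"),
        training_parameters=transformers.TrainingArguments(
            output_dir="output",
            save_strategy="no",
            learning_rate=1e-4,
            num_train_epochs=1,
        ),
    ),
)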

Here are the detailed errors:
2025-10-31T03:31:21Z INFO Starting HuggingFace LLM Trainer
Traceback (most recent call last):
File "/app/hf_llm_training.py", line 188, in
**train_args = TrainingArguments(json.loads(args.training_parameters))
File "/usr/lib/python3.10/json/init.py", line 346, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python3.10/json/decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
E1031 03:31:22.800000 140614705968960 torch/distributed/elastic/multiprocessing/api.py:826] failed (exitcode: 1) local_rank: 0 (pid: 52) of binary: /usr/bin/python
Traceback (most recent call last):
File "/usr/local/bin/torchrun", line 33, in
sys.exit(load_entry_point('torch==2.4.0a0+f70bd71a48.nv24.6', 'console_scripts', 'torchrun')())
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 347, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 900, in main
run(args)
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
hf_llm_training.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-10-31_03:31:22
host : llama-ppwtq5t2-worker-0
rank : 1 (local_rank: 0)
exitcode : 1 (pid: 52)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
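The root cause appears to be the JSONDecodeError above: json.loads() in hf_llm_training.py receives a string that is not valid JSON. An empty string produces exactly this "Expecting value: line 1 column 1 (char 0)" message, so it looks like the --training_parameters argument reaches the trainer empty or mangled. A minimal repro of the failing parse (a sketch of the step shown in the traceback, not the actual trainer code):

import json

# hf_llm_training.py effectively does: TrainingArguments(**json.loads(args.training_parameters))
# If --training_parameters arrives as an empty (or otherwise non-JSON) string,
# json.loads raises the same error seen in the worker log:
json.loads("")
# json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)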
