Trainjob file-metricscollector does not exit after collecting metrics, but the same job is normal #2616

@wangyakun

What happened?

I created an experiment with the Python SDK that uses a TrainJob as the trial spec, and hit the same symptom as #2614: the pod stays at 2/3 NotReady, and the metrics-collector container keeps running without reporting any error.
However, when I duplicated the experiment YAML and changed only the trial kind from "TrainJob" to "Job", it ran fine: the metrics-collector container exited normally and the experiment reached the Succeeded state.
I think this may be a new bug that is related to TrainJob and is different from #2614.

Problem Katib experiment

python code:

import kubeflow.katib as katib
from kubeflow.katib import KatibClient
from kubeflow.katib.models import (
    V1beta1Experiment,
    V1beta1ExperimentSpec,
    V1beta1AlgorithmSpec,
    V1beta1ObjectiveSpec,
    V1beta1ParameterSpec,
    V1beta1TrialTemplate,
    V1beta1TrialParameterSpec,
    V1ObjectMeta
)

parameters = [
    V1beta1ParameterSpec(
        name="learning_rate",
        parameter_type="double",
        feasible_space={"min": "1e-05", "max": "5e-05"}
    ),
    V1beta1ParameterSpec(
        name="r",
        parameter_type="int",
        feasible_space={"min": "1", "max": "8"}
    )
]

TRAIN_IMAGE = "36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev"       
EXP_NAME = "katib-llamafactory-qwen-sft3-lyt-old"
NAMESPACE = "aict"

trial_spec={
    "apiVersion": "trainer.kubeflow.org/v1alpha1",
    "kind": "TrainJob",
    "spec": {
        "podTemplateOverrides": [
            {
                "spec": {
                    "containers": [
                        {
                            "name": "node",
                            "volumeMounts": [
                                {
                                    "mountPath": "/datas",
                                    "name": "trainer-datas",
                                },
                            ],
                        },
                    ],
                    "volumes": [
                        {
                            "name": "trainer-datas",
                            "persistentVolumeClaim": {
                                "claimName": "katib-llamafactory-qwen-sft"
                            },
                        },
                    ],
                },
                "targetJobs":[
                    {
                        "name": "node",
                    },
                ],
            },
        ],
        "runtimeRef": {
            "apiGroup": "trainer.kubeflow.org",
            "kind": "ClusterTrainingRuntime",
            "name": "custom-test",
        },
        "trainer": {
            "numNodes": 1,
            "image": TRAIN_IMAGE,
            "command": [
                "sh",
                "-c",
                "set -x;"
                "accelerate launch "
                "  --multi_gpu"
                "  src/train.py "
                "  --model_name_or_path=/datas/models "
                "  --output_dir=/datas/output "
                "  --dataset_dir /datas/datasets "
                "  --do_train "
                "  --report_to=tensorboard "
                "  --finetuning_type=lora "
                "  --flash_attn=auto "
                "  --packing=False "
                "  --plot_loss=True "
                "  --ddp_timeout=180000000 "
                "  --fp16=True "
                "  --cutoff_len=4096 "
                "  --dataset=default "
                "  --gradient_accumulation_steps=8 "
                "  --learning_rate=${trialParameters.learning_rate} "
                "  --logging_steps=5 "
                "  --lr_scheduler_type=cosine "
                "  --max_samples=100000 "
                "  --num_train_epochs=1 "
                "  --optim=adamw_torch "
                "  --per_device_train_batch_size=2 "
                "  --save_steps=256 "
                "  --stage=sft "
                "  --template=qwen "
                "  --lora_alpha=16 "
                "  --lora_dropout=0 "
                "  --lora_rank=${trialParameters.r} "
                "  --loraplus_lr_ratio=0 "
                "  --use_dora=false "
                "  --use_rslora=false "
                "  --overwrite_output_dir;"
                "if [ -f /datas/output/train_results.json ]; then"
                "  echo 'Converting all_results.json to single-line format...' >&2;"
                "  python3 -c \"import json; data=json.load(open('/datas/output/train_results.json')); print(json.dumps({k: str(v) for k, v in data.items()}, separators=(',', ':')))\" > /tmp/all_results_single.json;"
                "  mv /tmp/all_results_single.json /datas/output/train_results.json;"
                "  echo 'JSON conversion complete' >&2;"
                "fi;"
                "cat /datas/output/train_results.json;"
                "sync;"
                "sleep 10;"
                "echo completed > /datas/output/$$$$.pid;"
                "sleep 30;"
                "exit 0"
            ],
            "resourcesPerNode": {
                "limits": {
                    "cpu": "1",
                    "memory": "8Gi",
                    "nvidia.com/gpu": "1",      
                },
                "requests": {
                    "cpu": "1",
                    "memory": "8Gi",
                    "nvidia.com/gpu": "1",
                },
            },
        },
    }
}

trial_template = V1beta1TrialTemplate(
    primary_container_name="node",
    trial_parameters=[
        V1beta1TrialParameterSpec(
            name="learning_rate", 
            description="Learning rate",
            reference="learning_rate"
        ),
        V1beta1TrialParameterSpec(
            name="r", 
            description="LoRA rank",
            reference="r"
        )
    ],
    trial_spec=trial_spec,
    success_condition='status.conditions.#(type=="Complete")#|#(status=="True")#',
    failure_condition='status.conditions.#(type=="Failed")#|#(status=="True")#',
    retain=True,
)

experiment_spec = V1beta1ExperimentSpec(
    algorithm=V1beta1AlgorithmSpec(algorithm_name="random"),
    objective=V1beta1ObjectiveSpec(
        type="minimize",
        goal=2.0,
        objective_metric_name="train_loss",
        metric_strategies=[{"name": "train_loss", "value": "min"}]
    ),
    parameters=parameters,
    trial_template=trial_template,
    max_trial_count=2,
    parallel_trial_count=1,
    metrics_collector_spec={
        "collector": {"kind": "File"},
        "source": {
            "fileSystemPath": {
                "kind": "File",
                "path": "/datas/output/train_results.json",
                "format": "JSON",
            }
        }
    },
    max_failed_trial_count=1
)

experiment = V1beta1Experiment(
    api_version="kubeflow.org/v1beta1",
    kind="Experiment",
    metadata=V1ObjectMeta(name=EXP_NAME, namespace=NAMESPACE),
    spec=experiment_spec
)

cl = KatibClient(namespace=NAMESPACE)
cl.create_experiment(experiment)

cl.wait_for_experiment_condition(name=EXP_NAME)
print(cl.get_optimal_hyperparameters(EXP_NAME))
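
For reference, the inline `python3 -c` step in the trial command rewrites `train_results.json` as a single line with every value converted to a string. A minimal standalone sketch of that conversion follows; the function name `flatten_results` and the two-key sample are illustrative assumptions, not part of the original script.

```python
import json


def flatten_results(text):
    """Re-serialize a JSON object as one line with every value as a string."""
    data = json.loads(text)
    # Same expression as the inline command: stringify values, drop whitespace.
    return json.dumps({k: str(v) for k, v in data.items()}, separators=(",", ":"))


# Sample input with the pretty-printed shape written by the training script.
sample = """{
    "epoch": 1.0,
    "train_loss": 3.3930898904800415
}"""

print(flatten_results(sample))
# {"epoch":"1.0","train_loss":"3.3930898904800415"}
```

Because the command writes the result to a temporary file and then moves it over the original with `mv`, a process that is tailing the original path sees the file being replaced rather than appended to.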

experiment yaml:

metadata:
  name: katib-llamafactory-qwen-sft3-lyt-old
  namespace: aict
  uid: b41ef14b-7287-42f9-9dbf-db43b0f8f1de
  resourceVersion: '146317884'
  generation: 1
  creationTimestamp: '2026-02-06T07:20:00Z'
  finalizers:
    - update-prometheus-metrics
  managedFields:
    - manager: OpenAPI-Generator
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T07:20:00Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:algorithm:
            .: {}
            f:algorithmName: {}
          f:maxFailedTrialCount: {}
          f:maxTrialCount: {}
          f:metricsCollectorSpec:
            .: {}
            f:collector:
              .: {}
              f:kind: {}
            f:source:
              .: {}
              f:fileSystemPath:
                .: {}
                f:format: {}
                f:kind: {}
                f:path: {}
          f:objective:
            .: {}
            f:goal: {}
            f:metricStrategies: {}
            f:objectiveMetricName: {}
            f:type: {}
          f:parallelTrialCount: {}
          f:parameters: {}
          f:trialTemplate:
            .: {}
            f:failureCondition: {}
            f:primaryContainerName: {}
            f:retain: {}
            f:successCondition: {}
            f:trialParameters: {}
            f:trialSpec:
              .: {}
              f:apiVersion: {}
              f:kind: {}
              f:spec:
                .: {}
                f:podTemplateOverrides: {}
                f:runtimeRef:
                  .: {}
                  f:apiGroup: {}
                  f:kind: {}
                  f:name: {}
                f:trainer:
                  .: {}
                  f:command: {}
                  f:image: {}
                  f:numNodes: {}
                  f:resourcesPerNode:
                    .: {}
                    f:limits:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                      f:nvidia.com/gpu: {}
                    f:requests:
                      .: {}
                      f:cpu: {}
                      f:memory: {}
                      f:nvidia.com/gpu: {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T07:20:00Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"update-prometheus-metrics": {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T07:20:14Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:conditions: {}
          f:currentOptimalTrial:
            .: {}
            f:observation: {}
          f:runningTrialList: {}
          f:startTime: {}
          f:trials: {}
          f:trialsRunning: {}
      subresource: status
spec:
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        max: '5e-05'
        min: '1e-05'
        distribution: uniform
    - name: r
      parameterType: int
      feasibleSpace:
        max: '8'
        min: '1'
        distribution: uniform
  objective:
    type: minimize
    goal: 2
    objectiveMetricName: train_loss
    metricStrategies:
      - name: train_loss
        value: min
  algorithm:
    algorithmName: random
  trialTemplate:
    retain: true
    trialSpec:
      apiVersion: trainer.kubeflow.org/v1alpha1
      kind: TrainJob
      spec:
        podTemplateOverrides:
          - spec:
              containers:
                - name: node
                  volumeMounts:
                    - mountPath: /datas
                      name: trainer-datas
              volumes:
                - name: trainer-datas
                  persistentVolumeClaim:
                    claimName: katib-llamafactory-qwen-sft
            targetJobs:
              - name: node
        runtimeRef:
          apiGroup: trainer.kubeflow.org
          kind: ClusterTrainingRuntime
          name: custom-test
        trainer:
          command:
            - sh
            - '-c'
            - >-
              set -x;accelerate launch   --multi_gpu  src/train.py  
              --model_name_or_path=/datas/models   --output_dir=/datas/output  
              --dataset_dir /datas/datasets   --do_train  
              --report_to=tensorboard   --finetuning_type=lora  
              --flash_attn=auto   --packing=False   --plot_loss=True  
              --ddp_timeout=180000000   --fp16=True   --cutoff_len=4096  
              --dataset=default   --gradient_accumulation_steps=8  
              --learning_rate=${trialParameters.learning_rate}  
              --logging_steps=5   --lr_scheduler_type=cosine  
              --max_samples=100000   --num_train_epochs=1  
              --optim=adamw_torch   --per_device_train_batch_size=2  
              --save_steps=256   --stage=sft   --template=qwen  
              --lora_alpha=16   --lora_dropout=0  
              --lora_rank=${trialParameters.r}   --loraplus_lr_ratio=0  
              --use_dora=false   --use_rslora=false   --overwrite_output_dir;if
              [ -f /datas/output/train_results.json ]; then  echo 'Converting
              all_results.json to single-line format...' >&2;  python3 -c
              "import json;
              data=json.load(open('/datas/output/train_results.json'));
              print(json.dumps({k: str(v) for k, v in data.items()},
              separators=(',', ':')))" > /tmp/all_results_single.json;  mv
              /tmp/all_results_single.json /datas/output/train_results.json; 
              echo 'JSON conversion complete' >&2;fi;cat
              /datas/output/train_results.json;sync;sleep 10;echo completed >
              /datas/output/$$$$.pid;sleep 30;exit 0
          image: >-
            36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
          numNodes: 1
          resourcesPerNode:
            limits:
              cpu: '1'
              memory: 8Gi
              nvidia.com/gpu: '1'
            requests:
              cpu: '1'
              memory: 8Gi
              nvidia.com/gpu: '1'
    trialParameters:
      - name: learning_rate
        description: Learning rate
        reference: learning_rate
      - name: r
        description: LoRA rank
        reference: r
    primaryPodLabels:
      batch.kubernetes.io/job-completion-index: '0'
      jobset.sigs.k8s.io/replicatedjob-name: node
    primaryContainerName: node
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  parallelTrialCount: 1
  maxTrialCount: 2
  maxFailedTrialCount: 1
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /datas/output/train_results.json
        kind: File
        format: JSON
    collector:
      kind: File
  resumePolicy: Never
status:
  startTime: '2026-02-06T07:20:00Z'
  conditions:
    - type: Created
      status: 'True'
      reason: ExperimentCreated
      message: Experiment is created
      lastUpdateTime: '2026-02-06T07:20:00Z'
      lastTransitionTime: '2026-02-06T07:20:00Z'
    - type: Running
      status: 'True'
      reason: ExperimentRunning
      message: Experiment is running
      lastUpdateTime: '2026-02-06T07:20:14Z'
      lastTransitionTime: '2026-02-06T07:20:14Z'
  currentOptimalTrial:
    observation: {}
  runningTrialList:
    - katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
  trials: 1
  trialsRunning: 1

state of pod:

# kubectl get pod | grep old
katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4   2/3     NotReady           0               58m
katib-llamafactory-qwen-sft3-lyt-old-random-7bbc5c78d4-nnpx4   1/1     Running            0               58m

metrics collector log:

# kubectl logs -f -c metrics-logger-and-collector katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4
I0206 07:20:19.726497     105 main.go:400] Trial Name: katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
I0206 07:37:20.817003     105 main.go:143] {
I0206 07:37:20.817104     105 main.go:143]     "epoch": 1.0,
I0206 07:37:20.817123     105 main.go:143]     "total_flos": 8476485388075008.0,
I0206 07:37:20.817127     105 main.go:143]     "train_loss": 3.3930898904800415,
I0206 07:37:20.817139     105 main.go:143]     "train_runtime": 986.1327,
I0206 07:37:20.817144     105 main.go:143]     "train_samples_per_second": 5.71,
I0206 07:37:20.817165     105 main.go:143]     "train_steps_per_second": 0.357
I0206 07:37:20.817169     105 main.go:143] }
2026/02/06 07:37:25 Re-opening truncated file /datas/output/train_results.json ...
2026/02/06 07:37:25 Successfully reopened truncated /datas/output/train_results.json
I0206 07:37:25.229170     105 main.go:143] {"epoch":"1.0","total_flos":"8476485388075008.0","train_loss":"3.3930898904800415","train_runtime":"986.1327","train_samples_per_second":"5.71","train_steps_per_second":"0.357"}
2026/02/06 07:37:25 Re-opening moved/deleted file /datas/output/train_results.json ...
2026/02/06 07:37:25 Successfully reopened /datas/output/train_results.json
I0206 07:37:25.229414     105 main.go:143] {"epoch":"1.0","total_flos":"8476485388075008.0","train_loss":"3.3930898904800415","train_runtime":"986.1327","train_samples_per_second":"5.71","train_steps_per_second":"0.357"}
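
The log shows that the collector did read the single-line record containing `train_loss`. A small sketch for pulling that value out of such log lines is given below; the prefix pattern is inferred only from the lines shown here, and `extract_metric` is a hypothetical helper, not part of Katib.

```python
import json
import re

# Log prefix as it appears above, e.g.
# "I0206 07:37:25.229170     105 main.go:143] {...}"
KLOG_PREFIX = re.compile(r"^[IWEF]\d{4} \d{2}:\d{2}:\d{2}\.\d+\s+\d+ \S+:\d+\] ")


def extract_metric(log_lines, metric_name):
    """Return the values of metric_name found in single-line JSON records."""
    values = []
    for line in log_lines:
        match = KLOG_PREFIX.match(line)
        if not match:
            continue  # e.g. the "Re-opening truncated file" messages
        payload = line[match.end():].strip()
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue  # fragments of the pretty-printed JSON are skipped
        if isinstance(record, dict) and metric_name in record:
            values.append(float(record[metric_name]))
    return values


# Lines copied from the collector log above.
LOG_LINES = [
    'I0206 07:37:20.817003     105 main.go:143] {',
    'I0206 07:37:20.817127     105 main.go:143]     "train_loss": 3.3930898904800415,',
    '2026/02/06 07:37:25 Re-opening truncated file /datas/output/train_results.json ...',
    'I0206 07:37:25.229170     105 main.go:143] {"epoch":"1.0","total_flos":"8476485388075008.0","train_loss":"3.3930898904800415","train_runtime":"986.1327","train_samples_per_second":"5.71","train_steps_per_second":"0.357"}',
]

print(extract_metric(LOG_LINES, "train_loss"))
# [3.3930898904800415]
```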

kubectl describe output for the pod:

# kubectl describe pod katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4
Name:             katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4
Namespace:        aict
Priority:         0
Service Account:  default
Node:             llm1/192.168.1.4
Start Time:       Fri, 06 Feb 2026 15:20:14 +0800
Labels:           batch.kubernetes.io/controller-uid=e614da32-fa81-4f5a-ac21-8232bc38224b
                  batch.kubernetes.io/job-completion-index=0
                  batch.kubernetes.io/job-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
                  controller-uid=e614da32-fa81-4f5a-ac21-8232bc38224b
                  job-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
                  jobset.sigs.k8s.io/global-replicas=1
                  jobset.sigs.k8s.io/group-name=default
                  jobset.sigs.k8s.io/group-replicas=1
                  jobset.sigs.k8s.io/job-global-index=0
                  jobset.sigs.k8s.io/job-group-index=0
                  jobset.sigs.k8s.io/job-index=0
                  jobset.sigs.k8s.io/job-key=77f78d77fd4a9c5f3dda557b211da8643fc5a4eb
                  jobset.sigs.k8s.io/jobset-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
                  jobset.sigs.k8s.io/jobset-uid=9ca13e49-5443-4656-ad74-327147852c5f
                  jobset.sigs.k8s.io/replicatedjob-name=node
                  jobset.sigs.k8s.io/replicatedjob-replicas=1
                  jobset.sigs.k8s.io/restart-attempt=0
                  katib.kubeflow.org/experiment=katib-llamafactory-qwen-sft3-lyt-old
                  katib.kubeflow.org/trial=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
                  security.istio.io/tlsMode=istio
                  service.istio.io/canonical-name=katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
                  service.istio.io/canonical-revision=latest
Annotations:      batch.kubernetes.io/job-completion-index: 0
                  cni.projectcalico.org/containerID: 2358563f7dc46f4fc07175fdcd7d351a3901a777995d3f95a0b1117030b6f8af
                  cni.projectcalico.org/podIP: 10.42.0.119/32
                  cni.projectcalico.org/podIPs: 10.42.0.119/32
                  istio.io/rev: default
                  jobset.sigs.k8s.io/global-replicas: 1
                  jobset.sigs.k8s.io/group-name: default
                  jobset.sigs.k8s.io/group-replicas: 1
                  jobset.sigs.k8s.io/job-global-index: 0
                  jobset.sigs.k8s.io/job-group-index: 0
                  jobset.sigs.k8s.io/job-index: 0
                  jobset.sigs.k8s.io/job-key: 77f78d77fd4a9c5f3dda557b211da8643fc5a4eb
                  jobset.sigs.k8s.io/jobset-name: katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
                  jobset.sigs.k8s.io/jobset-uid: 9ca13e49-5443-4656-ad74-327147852c5f
                  jobset.sigs.k8s.io/replicatedjob-name: node
                  jobset.sigs.k8s.io/replicatedjob-replicas: 1
                  jobset.sigs.k8s.io/restart-attempt: 0
                  kubectl.kubernetes.io/default-container: node
                  kubectl.kubernetes.io/default-logs-container: node
                  prometheus.io/path: /stats/prometheus
                  prometheus.io/port: 15020
                  prometheus.io/scrape: true
                  sidecar.istio.io/interceptionMode: REDIRECT
                  sidecar.istio.io/status:
                    {"initContainers":["istio-validation","istio-proxy"],"containers":null,"volumes":["workload-socket","credential-socket","workload-certs","...
                  traffic.sidecar.istio.io/excludeInboundPorts: 15020
                  traffic.sidecar.istio.io/includeInboundPorts: *
                  traffic.sidecar.istio.io/includeOutboundIPRanges: *
Status:           Running
IP:               10.42.0.119
IPs:
  IP:           10.42.0.119
Controlled By:  Job/katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
Init Containers:
  istio-validation:
    Container ID:  containerd://279a472bd3f21151ca632eb9ab5a65d5b0f659cd122b4ef9cccb0b53db7e10f2
    Image:         36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1
    Image ID:      36.134.128.101.nip.io:31104/kubeflow/proxyv2@sha256:79ae318dc23920468ea6cfaa0743883b3764b472635c3a698166c33dd4edb329
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x
      
      -b
      *
      -d
      15090,15021,15020
      --log_output_level=default:info
      --run-validation
      --skip-rule-apply
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 06 Feb 2026 15:20:15 +0800
      Finished:     Fri, 06 Feb 2026 15:20:15 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     128Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
  istio-proxy:
    Container ID:  containerd://e389ec32b15abe0aac51909fbbc366210061e2a940d7f9e7ce9068637eecf0c2
    Image:         36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1
    Image ID:      36.134.128.101.nip.io:31104/kubeflow/proxyv2@sha256:79ae318dc23920468ea6cfaa0743883b3764b472635c3a698166c33dd4edb329
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
    State:          Running
      Started:      Fri, 06 Feb 2026 15:20:17 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   128Mi
    Readiness:  http-get http://:15021/healthz/ready delay=0s timeout=3s period=15s #success=1 #failure=4
    Startup:    http-get http://:15021/healthz/ready delay=0s timeout=3s period=1s #success=1 #failure=600
    Environment:
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0-0-q2df4 (v1:metadata.name)
      POD_NAMESPACE:                 aict (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      ISTIO_CPU_LIMIT:               2 (limits.cpu)
      PROXY_CONFIG:                  {"tracing":{}}
                                     
      ISTIO_META_POD_PORTS:          [
                                     ]
      ISTIO_META_APP_CONTAINERS:     node
      GOMEMLIMIT:                    1073741824 (limits.memory)
      GOMAXPROCS:                    2 (limits.cpu)
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_NODE_NAME:           (v1:spec.nodeName)
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
      ISTIO_META_OWNER:              kubernetes://apis/batch/v1/namespaces/aict/jobs/katib-llamafactory-qwen-sft3-lyt-old-k652h5cp-node-0
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/credential-uds from credential-socket (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
      /var/run/secrets/tokens from istio-token (rw)
      /var/run/secrets/workload-spiffe-credentials from workload-certs (rw)
      /var/run/secrets/workload-spiffe-uds from workload-socket (rw)
Containers:
  node:
    Container ID:  containerd://c0bd832963b1034adde9b8843ca3d58f3eb9e5b898bd532b631d6575976eaf6a
    Image:         36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
    Image ID:      36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia@sha256:2899589618c16624103fb0170b865119fce8af891bb38dbf1be36b8c4f2cdc2f
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      set -x;accelerate launch   --multi_gpu  src/train.py   --model_name_or_path=/datas/models   --output_dir=/datas/output   --dataset_dir /datas/datasets   --do_train   --report_to=tensorboard   --finetuning_type=lora   --flash_attn=auto   --packing=False   --plot_loss=True   --ddp_timeout=180000000   --fp16=True   --cutoff_len=4096   --dataset=default   --gradient_accumulation_steps=8   --learning_rate=2.5746808259899108e-05   --logging_steps=5   --lr_scheduler_type=cosine   --max_samples=100000   --num_train_epochs=1   --optim=adamw_torch   --per_device_train_batch_size=2   --save_steps=256   --stage=sft   --template=qwen   --lora_alpha=16   --lora_dropout=0   --lora_rank=8   --loraplus_lr_ratio=0   --use_dora=false   --use_rslora=false   --overwrite_output_dir;if [ -f /datas/output/train_results.json ]; then  echo 'Converting all_results.json to single-line format...' >&2;  python3 -c "import json; data=json.load(open('/datas/output/train_results.json')); print(json.dumps({k: str(v) for k, v in data.items()}, separators=(',', ':')))" > /tmp/all_results_single.json;  mv /tmp/all_results_single.json /datas/output/train_results.json;  echo 'JSON conversion complete' >&2;fi;cat /datas/output/train_results.json;sync;sleep 10;echo completed > /datas/output/$$$$.pid;sleep 30;exit 0 && echo completed > /datas/output/$$$$.pid
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 06 Feb 2026 15:20:19 +0800
      Finished:     Fri, 06 Feb 2026 15:38:05 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Environment:
      JOB_COMPLETION_INDEX:   (v1:metadata.labels['batch.kubernetes.io/job-completion-index'])
      KATIB_TRIAL_NAME:       (v1:metadata.labels['katib.kubeflow.org/trial'])
    Mounts:
      /datas from trainer-datas (rw)
      /datas/output from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
  metrics-logger-and-collector:
    Container ID:  containerd://244906db4f40c2947327ecd487e3f3d356acc192391b79b6ec494bdd9655b1fe
    Image:         ghcr.io/kubeflow/katib/file-metrics-collector:v0.19.0
    Image ID:      ghcr.io/kubeflow/katib/file-metrics-collector@sha256:0616af2111b2c6029105ac4670e1e94a0ceb7ba02ddb06a8cee3a687fde1514c
    Port:          <none>
    Host Port:     <none>
    Args:
      -t
      katib-llamafactory-qwen-sft3-lyt-old-k652h5cp
      -m
      train_loss
      -o-type
      minimize
      -s-db
      katib-db-manager.kubeflow:6789
      -path
      /datas/output/train_results.json
      -format
      JSON
    State:          Running
      Started:      Fri, 06 Feb 2026 15:20:19 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:          <none>
    Mounts:
      /datas/output from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-df5ql (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  workload-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  credential-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  workload-certs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  trainer-datas:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  katib-llamafactory-qwen-sft
    ReadOnly:   false
  kube-api-access-df5ql:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
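
The container states above are consistent with the `2/3 NotReady` shown by `kubectl get pod`. The sketch below recomputes that figure from states transcribed by hand from the describe output. It assumes that `istio-proxy` counts toward the READY column because it appears to run as a restartable init container (it is listed under Init Containers with `State: Running`); that reading is an assumption, and `summarize` is a hypothetical helper.

```python
# Container states transcribed by hand from the describe output above.
# istio-validation is left out: it is an init container that already completed.
CONTAINERS = [
    {"name": "istio-proxy", "state": "Running", "ready": True},
    {"name": "node", "state": "Terminated", "ready": False},
    {"name": "metrics-logger-and-collector", "state": "Running", "ready": True},
]


def summarize(containers):
    """Return the READY column string and the names of containers still running."""
    ready = sum(1 for c in containers if c["ready"])
    still_running = [c["name"] for c in containers if c["state"] == "Running"]
    return f"{ready}/{len(containers)}", still_running


ready_column, still_running = summarize(CONTAINERS)
print(ready_column)
# 2/3
print(still_running)
# ['istio-proxy', 'metrics-logger-and-collector']
```

Under that assumption, `metrics-logger-and-collector` is the only regular container still running after `node` exited with code 0, which matches the behavior described in this report.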

How this problem Katib experiment is displayed in the Kubeflow UI:
Image

Normal Katib experiment

experiment yaml, with only the trial kind changed from "TrainJob" to "Job":

metadata:
  name: katib-llamafactory-qwen-sft3-lyt
  namespace: aict
  uid: 704b3553-3c93-496a-adb2-f69e5b40a09b
  resourceVersion: '144115325'
  generation: 1
  creationTimestamp: '2026-02-06T02:27:46Z'
  finalizers:
    - update-prometheus-metrics
  managedFields:
    - manager: OpenAPI-Generator
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T02:27:46Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:spec:
          .: {}
          f:algorithm:
            .: {}
            f:algorithmName: {}
          f:maxFailedTrialCount: {}
          f:maxTrialCount: {}
          f:metricsCollectorSpec:
            .: {}
            f:collector:
              .: {}
              f:kind: {}
            f:source:
              .: {}
              f:fileSystemPath:
                .: {}
                f:format: {}
                f:kind: {}
                f:path: {}
          f:objective:
            .: {}
            f:goal: {}
            f:metricStrategies: {}
            f:objectiveMetricName: {}
            f:type: {}
          f:parallelTrialCount: {}
          f:parameters: {}
          f:trialTemplate:
            .: {}
            f:failureCondition: {}
            f:primaryContainerName: {}
            f:retain: {}
            f:successCondition: {}
            f:trialParameters: {}
            f:trialSpec:
              .: {}
              f:apiVersion: {}
              f:kind: {}
              f:spec:
                .: {}
                f:template:
                  .: {}
                  f:metadata:
                    .: {}
                    f:annotations:
                      .: {}
                      f:cni.istio.io/exclude: {}
                      f:istio.io/rev: {}
                      f:sidecar.istio.io/inject: {}
                  f:spec:
                    .: {}
                    f:containers: {}
                    f:restartPolicy: {}
                    f:schedulerName: {}
                    f:volumes: {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T02:27:46Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:metadata:
          f:finalizers:
            .: {}
            v:"update-prometheus-metrics": {}
    - manager: katib-controller
      operation: Update
      apiVersion: kubeflow.org/v1beta1
      time: '2026-02-06T02:45:39Z'
      fieldsType: FieldsV1
      fieldsV1:
        f:status:
          .: {}
          f:completionTime: {}
          f:conditions: {}
          f:currentOptimalTrial:
            .: {}
            f:bestTrialName: {}
            f:observation:
              .: {}
              f:metrics: {}
            f:parameterAssignments: {}
          f:startTime: {}
          f:succeededTrialList: {}
          f:trials: {}
          f:trialsSucceeded: {}
      subresource: status
spec:
  parameters:
    - name: learning_rate
      parameterType: double
      feasibleSpace:
        max: '5e-05'
        min: '1e-05'
        distribution: uniform
    - name: r
      parameterType: int
      feasibleSpace:
        max: '8'
        min: '1'
        distribution: uniform
  objective:
    type: minimize
    goal: 2
    objectiveMetricName: train_loss
    metricStrategies:
      - name: train_loss
        value: min
  algorithm:
    algorithmName: random
  trialTemplate:
    retain: true
    trialSpec:
      apiVersion: batch/v1
      kind: Job
      spec:
        template:
          metadata:
            annotations:
              cni.istio.io/exclude: 'true'
              istio.io/rev: ''
              sidecar.istio.io/inject: 'false'
          spec:
            containers:
              - args:
                  - >-
                    set -x;accelerate launch   --multi_gpu  src/train.py  
                    --model_name_or_path=/datas/models  
                    --output_dir=/datas/output   --dataset_dir /datas/datasets  
                    --do_train   --report_to=tensorboard  
                    --finetuning_type=lora   --flash_attn=auto  
                    --packing=False   --plot_loss=True  
                    --ddp_timeout=180000000   --fp16=True   --cutoff_len=4096  
                    --dataset=default   --gradient_accumulation_steps=8  
                    --learning_rate=${trialParameters.learning_rate}  
                    --logging_steps=5   --lr_scheduler_type=cosine  
                    --max_samples=100000   --num_train_epochs=1  
                    --optim=adamw_torch   --per_device_train_batch_size=2  
                    --save_steps=256   --stage=sft   --template=qwen  
                    --lora_alpha=16   --lora_dropout=0  
                    --lora_rank=${trialParameters.r}   --loraplus_lr_ratio=0  
                    --use_dora=false   --use_rslora=false  
                    --overwrite_output_dir;if [ -f
                    /datas/output/train_results.json ]; then  echo 'Converting
                    all_results.json to single-line format...' >&2;  python3 -c
                    "import json;
                    data=json.load(open('/datas/output/train_results.json'));
                    print(json.dumps({k: str(v) for k, v in data.items()},
                    separators=(',', ':')))" > /tmp/all_results_single.json;  mv
                    /tmp/all_results_single.json
                    /datas/output/train_results.json;  echo 'JSON conversion
                    complete' >&2;fi;cat
                    /datas/output/train_results.json;sync;sleep 10;echo
                    completed > /datas/output/$$$$.pid;sleep 10;exit 0
                command:
                  - sh
                  - '-c'
                image: >-
                  36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
                name: node
                resources:
                  limits:
                    cpu: '1'
                    memory: 8Gi
                    nvidia.com/gpu: '1'
                  requests:
                    cpu: '1'
                    memory: 8Gi
                    nvidia.com/gpu: '1'
                volumeMounts:
                  - mountPath: /datas
                    name: trainer-datas
            restartPolicy: Never
            schedulerName: volcano
            volumes:
              - name: trainer-datas
                persistentVolumeClaim:
                  claimName: katib-llamafactory-qwen-sft
    trialParameters:
      - name: learning_rate
        description: Learning rate
        reference: learning_rate
      - name: r
        description: LoRA rank
        reference: r
    primaryContainerName: node
    successCondition: status.conditions.#(type=="Complete")#|#(status=="True")#
    failureCondition: status.conditions.#(type=="Failed")#|#(status=="True")#
  parallelTrialCount: 1
  maxTrialCount: 1
  maxFailedTrialCount: 0
  metricsCollectorSpec:
    source:
      fileSystemPath:
        path: /datas/output/train_results.json
        kind: File
        format: JSON
    collector:
      kind: File
  resumePolicy: Never
status:
  startTime: '2026-02-06T02:27:46Z'
  completionTime: '2026-02-06T02:45:39Z'
  conditions:
    - type: Created
      status: 'True'
      reason: ExperimentCreated
      message: Experiment is created
      lastUpdateTime: '2026-02-06T02:27:46Z'
      lastTransitionTime: '2026-02-06T02:27:46Z'
    - type: Running
      status: 'False'
      reason: ExperimentRunning
      message: Experiment is running
      lastUpdateTime: '2026-02-06T02:45:39Z'
      lastTransitionTime: '2026-02-06T02:45:39Z'
    - type: Succeeded
      status: 'True'
      reason: ExperimentMaxTrialsReached
      message: Experiment has succeeded because max trial count has reached
      lastUpdateTime: '2026-02-06T02:45:39Z'
      lastTransitionTime: '2026-02-06T02:45:39Z'
  currentOptimalTrial:
    bestTrialName: katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
    parameterAssignments:
      - name: learning_rate
        value: '3.5283942788578944e-05'
      - name: r
        value: '2'
    observation:
      metrics:
        - name: train_loss
          min: '3.3930898904800415'
          max: '3.3930898904800415'
          latest: '3.3930898904800415'
  succeededTrialList:
    - katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
  trials: 1
  trialsSucceeded: 1

state of pod:

# kubectl get pod | grep lyt | grep -v old
katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x                0/2     Completed          0             5h54m

log of the metrics collector container:

# kubectl logs -f -c metrics-logger-and-collector katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x
I0206 02:28:24.004010      68 main.go:400] Trial Name: katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
I0206 02:45:13.921332      68 main.go:143] {
I0206 02:45:13.921369      68 main.go:143]     "epoch": 1.0,
I0206 02:45:13.921381      68 main.go:143]     "total_flos": 8399292649701376.0,
I0206 02:45:13.921394      68 main.go:143]     "train_loss": 3.3930898904800415,
I0206 02:45:13.921406      68 main.go:143]     "train_runtime": 969.046,
I0206 02:45:13.921410      68 main.go:143]     "train_samples_per_second": 5.811,
I0206 02:45:13.921423      68 main.go:143]     "train_steps_per_second": 0.363
I0206 02:45:13.921689      68 main.go:143] }
W0206 02:45:36.212115      68 file-metricscollector.go:143] Metrics will not have timestamp since {"epoch":"1.0","total_flos":"8399292649701376.0","train_loss":"3.3930898904800415","train_runtime":"969.046","train_samples_per_second":"5.811","train_steps_per_second":"0.363"} doesn't have the key timestamp
I0206 02:45:36.236663      68 main.go:459] Metrics reported. :
metric_logs:{time_stamp:"0001-01-01T00:00:00Z"  metric:{name:"train_loss"  value:"3.3930898904800415"}}
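
The warning in the log above says the JSON line has no `timestamp` key, which is why the reported metric carries the zero time `0001-01-01T00:00:00Z`. It did not stop the metric from being reported here, and it is unrelated to the collector not exiting in the TrainJob case. If a real timestamp is wanted, the inline `python3 -c` conversion in the trial args could be extended roughly as below. This is a minimal sketch, not something taken from Katib itself: the helper name `to_single_line_metrics` is hypothetical, and whether the collector accepts this exact timestamp string format is an assumption to check against the Katib documentation.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def to_single_line_metrics(results: dict, now: Optional[datetime] = None) -> str:
    """Return one compact JSON line with string values plus a timestamp key."""
    now = now or datetime.now(timezone.utc)
    # Same conversion as the inline script in the trial args: every value as a string.
    line = {key: str(value) for key, value in results.items()}
    # Assumption: the collector reads a key literally named "timestamp"
    # (the name is taken from the warning message) and accepts an RFC 3339 string.
    line["timestamp"] = now.strftime("%Y-%m-%dT%H:%M:%SZ")
    return json.dumps(line, separators=(",", ":"))
```

In the trial command this would replace the `json.dumps({k: str(v) ...})` expression, reading `/datas/output/train_results.json` and writing the returned line back to the same path.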

kubectl describe output of the pod:

# kubectl describe pod katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x 
Name:             katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x
Namespace:        aict
Priority:         0
Service Account:  default
Node:             llm1/192.168.1.4
Start Time:       Fri, 06 Feb 2026 10:28:22 +0800
Labels:           batch.kubernetes.io/controller-uid=91c17c88-fb99-4d33-8bf3-83b2fde6eb03
                  batch.kubernetes.io/job-name=katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
                  controller-uid=91c17c88-fb99-4d33-8bf3-83b2fde6eb03
                  job-name=katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
                  katib.kubeflow.org/experiment=katib-llamafactory-qwen-sft3-lyt
                  katib.kubeflow.org/trial=katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
Annotations:      cni.istio.io/exclude: true
                  cni.projectcalico.org/containerID: 84d074e5dab08b026f5ac653a2d78a6ff3e9cc4744d8a6b3ceacfe71a7f8df33
                  cni.projectcalico.org/podIP: 
                  cni.projectcalico.org/podIPs: 
                  istio.io/rev: 
                  scheduling.k8s.io/group-name: podgroup-91c17c88-fb99-4d33-8bf3-83b2fde6eb03
                  sidecar.istio.io/inject: false
Status:           Succeeded
IP:               10.42.0.187
IPs:
  IP:           10.42.0.187
Controlled By:  Job/katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
Containers:
  node:
    Container ID:  containerd://34cca64a540c52daede81cc96587da6432fc3a4b1d9cac0368c032c59c96b0da
    Image:         36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia:1.0.1-llamafactory-0.9.2-dev
    Image ID:      36.134.128.101.nip.io:31104/aict-gpu/llama-factory-amd64-nvidia@sha256:2899589618c16624103fb0170b865119fce8af891bb38dbf1be36b8c4f2cdc2f
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      set -x;accelerate launch   --multi_gpu  src/train.py   --model_name_or_path=/datas/models   --output_dir=/datas/output   --dataset_dir /datas/datasets   --do_train   --report_to=tensorboard   --finetuning_type=lora   --flash_attn=auto   --packing=False   --plot_loss=True   --ddp_timeout=180000000   --fp16=True   --cutoff_len=4096   --dataset=default   --gradient_accumulation_steps=8   --learning_rate=3.5283942788578944e-05   --logging_steps=5   --lr_scheduler_type=cosine   --max_samples=100000   --num_train_epochs=1   --optim=adamw_torch   --per_device_train_batch_size=2   --save_steps=256   --stage=sft   --template=qwen   --lora_alpha=16   --lora_dropout=0   --lora_rank=2   --loraplus_lr_ratio=0   --use_dora=false   --use_rslora=false   --overwrite_output_dir;if [ -f /datas/output/train_results.json ]; then  echo 'Converting all_results.json to single-line format...' >&2;  python3 -c "import json; data=json.load(open('/datas/output/train_results.json')); print(json.dumps({k: str(v) for k, v in data.items()}, separators=(',', ':')))" > /tmp/all_results_single.json;  mv /tmp/all_results_single.json /datas/output/train_results.json;  echo 'JSON conversion complete' >&2;fi;cat /datas/output/train_results.json;sync;sleep 10;echo completed > /datas/output/$$$$.pid;sleep 10;exit 0 && echo completed > /datas/output/$$$$.pid
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 06 Feb 2026 10:28:23 +0800
      Finished:     Fri, 06 Feb 2026 10:45:34 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Requests:
      cpu:             1
      memory:          8Gi
      nvidia.com/gpu:  1
    Environment:
      KATIB_TRIAL_NAME:   (v1:metadata.labels['katib.kubeflow.org/trial'])
    Mounts:
      /datas from trainer-datas (rw)
      /datas/output from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l7s79 (ro)
  metrics-logger-and-collector:
    Container ID:  containerd://e41a258f1595fa7b6df7e59cc6bf1e48eff53f29c840f0187556245660ecb8ce
    Image:         ghcr.io/kubeflow/katib/file-metrics-collector:v0.19.0
    Image ID:      ghcr.io/kubeflow/katib/file-metrics-collector@sha256:0616af2111b2c6029105ac4670e1e94a0ceb7ba02ddb06a8cee3a687fde1514c
    Port:          <none>
    Host Port:     <none>
    Args:
      -t
      katib-llamafactory-qwen-sft3-lyt-qkxpnrzh
      -m
      train_loss
      -o-type
      minimize
      -s-db
      katib-db-manager.kubeflow:6789
      -path
      /datas/output/train_results.json
      -format
      JSON
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Fri, 06 Feb 2026 10:28:24 +0800
      Finished:     Fri, 06 Feb 2026 10:45:36 +0800
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:          <none>
    Mounts:
      /datas/output from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-l7s79 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   False 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  trainer-datas:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  katib-llamafactory-qwen-sft
    ReadOnly:   false
  kube-api-access-l7s79:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>

How this normal Katib experiment is displayed in the Kubeflow UI:

Image

What did you expect to happen?

I expect the TrainJob experiment to work normally, like the Job experiment: the file-metricscollector should exit normally, and the trial status should be Succeeded in the Kubeflow UI.
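
To make the difference between the two pods easier to compare, the per-container state can be pulled out of `kubectl get pod -o json`. The following is only a rough sketch for triage: the function names `load_pod`, `summarize_containers` and `still_running` are hypothetical helpers, and it assumes `kubectl` is on the PATH with access to the namespace.

```python
import json
import subprocess


def load_pod(name: str, namespace: str) -> dict:
    """Fetch a pod as a dict via kubectl (assumes kubectl is configured)."""
    result = subprocess.run(
        ["kubectl", "get", "pod", name, "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def summarize_containers(pod: dict) -> dict:
    """Map each container name to a short description of its current state."""
    summary = {}
    for status in pod.get("status", {}).get("containerStatuses", []):
        state = status.get("state", {})
        if "terminated" in state:
            exit_code = state["terminated"].get("exitCode")
            summary[status["name"]] = f"terminated(exit={exit_code})"
        elif "running" in state:
            summary[status["name"]] = "running"
        else:
            summary[status["name"]] = "waiting"
    return summary


def still_running(pod: dict) -> list:
    """Names of containers that have not exited yet."""
    return [n for n, s in summarize_containers(pod).items() if s == "running"]
```

For the normal Job pod above, `still_running(load_pod("katib-llamafactory-qwen-sft3-lyt-qkxpnrzh-xcn6x", "aict"))` should be empty, since both containers show `Terminated`. For the TrainJob pod, the expectation from the report is that the metrics collector container is still listed.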

Environment

Kubernetes version:

$ kubectl version
Client Version: v1.32.10+rke2r1
Kustomize Version: v5.5.0
Server Version: v1.32.10+rke2r1

Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
ghcr.io/kubeflow/katib/katib-controller:v0.19.0

Katib Python SDK version:

$ pip show kubeflow-katib
Name: kubeflow-katib
Version: 0.19.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: premnath.vel@gmail.com
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, grpcio, kubeflow-training, kubernetes, protobuf, setuptools, six, urllib3
Required-by:

Impacted by this bug?

Katib experiments that use TrainJob do not work properly.
