
File-metricscollector does not exit after collecting metrics, trial still running #2614

@wangyakun

Description


What happened?

I created an experiment in the Kubeflow UI from https://github.com/kubeflow/katib/blob/master/examples/v1beta1/metrics-collector/file-metrics-collector-with-json-format.yaml.
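For anyone reproducing outside the UI, the same manifest can be applied from the CLI (a sketch, assuming the usual raw GitHub URL for that file):

kubectl apply -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/metrics-collector/file-metrics-collector-with-json-format.yaml

Then I saw 3 trial pods: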

[screenshot: the three trial pods]

We can see that one of the pods is Running.
I followed the logs of the running pod's training-container until the container completed:

# kubectl logs file-metrics-collector-with-json-format-h6h6sxqg-kdj8p -f -c training-container
100.0%
100.0%
100.0%
100.0%
2026-02-02T10:29:09Z INFO     Train Epoch: 1 [0/60000 (0%)]	loss=2.2980
2026-02-02T10:29:09Z INFO     Train Epoch: 1 [640/60000 (1%)]	loss=2.2977
2026-02-02T10:29:10Z INFO     Train Epoch: 1 [1280/60000 (2%)]	loss=2.1945
2026-02-02T10:29:10Z INFO     Train Epoch: 1 [1920/60000 (3%)]	loss=2.0590
2026-02-02T10:29:11Z INFO     Train Epoch: 1 [2560/60000 (4%)]	loss=1.7646
2026-02-02T10:29:11Z INFO     Train Epoch: 1 [3200/60000 (5%)]	loss=1.6865
2026-02-02T10:29:11Z INFO     Train Epoch: 1 [3840/60000 (6%)]	loss=1.1514
2026-02-02T10:29:12Z INFO     Train Epoch: 1 [4480/60000 (7%)]	loss=0.9604
2026-02-02T10:29:12Z INFO     Train Epoch: 1 [5120/60000 (9%)]	loss=1.0196
2026-02-02T10:29:13Z INFO     Train Epoch: 1 [5760/60000 (10%)]	loss=1.0410
2026-02-02T10:29:13Z INFO     Train Epoch: 1 [6400/60000 (11%)]	loss=0.9910
2026-02-02T10:29:14Z INFO     Train Epoch: 1 [7040/60000 (12%)]	loss=0.8280
2026-02-02T10:29:14Z INFO     Train Epoch: 1 [7680/60000 (13%)]	loss=0.9400
2026-02-02T10:29:15Z INFO     Train Epoch: 1 [8320/60000 (14%)]	loss=0.7275
2026-02-02T10:29:15Z INFO     Train Epoch: 1 [8960/60000 (15%)]	loss=0.6647
2026-02-02T10:29:16Z INFO     Train Epoch: 1 [9600/60000 (16%)]	loss=0.9186
2026-02-02T10:29:16Z INFO     Train Epoch: 1 [10240/60000 (17%)]	loss=0.8398
2026-02-02T10:29:17Z INFO     Train Epoch: 1 [10880/60000 (18%)]	loss=0.5977
2026-02-02T10:29:18Z INFO     Train Epoch: 1 [11520/60000 (19%)]	loss=0.8014
2026-02-02T10:29:18Z INFO     Train Epoch: 1 [12160/60000 (20%)]	loss=0.9261
2026-02-02T10:29:18Z INFO     Train Epoch: 1 [12800/60000 (21%)]	loss=0.4948
2026-02-02T10:29:19Z INFO     Train Epoch: 1 [13440/60000 (22%)]	loss=0.9426
2026-02-02T10:29:20Z INFO     Train Epoch: 1 [14080/60000 (23%)]	loss=0.6710
2026-02-02T10:29:22Z INFO     Train Epoch: 1 [14720/60000 (25%)]	loss=0.4970
2026-02-02T10:29:24Z INFO     Train Epoch: 1 [15360/60000 (26%)]	loss=0.6170
2026-02-02T10:29:26Z INFO     Train Epoch: 1 [16000/60000 (27%)]	loss=0.6423
2026-02-02T10:29:27Z INFO     Train Epoch: 1 [16640/60000 (28%)]	loss=0.7401
2026-02-02T10:29:27Z INFO     Train Epoch: 1 [17280/60000 (29%)]	loss=0.5749
2026-02-02T10:29:28Z INFO     Train Epoch: 1 [17920/60000 (30%)]	loss=0.7347
2026-02-02T10:29:29Z INFO     Train Epoch: 1 [18560/60000 (31%)]	loss=0.7903
2026-02-02T10:29:29Z INFO     Train Epoch: 1 [19200/60000 (32%)]	loss=0.6705
2026-02-02T10:29:30Z INFO     Train Epoch: 1 [19840/60000 (33%)]	loss=0.4709
2026-02-02T10:29:31Z INFO     Train Epoch: 1 [20480/60000 (34%)]	loss=0.6821
2026-02-02T10:29:31Z INFO     Train Epoch: 1 [21120/60000 (35%)]	loss=0.7016
2026-02-02T10:29:32Z INFO     Train Epoch: 1 [21760/60000 (36%)]	loss=0.6120
2026-02-02T10:29:32Z INFO     Train Epoch: 1 [22400/60000 (37%)]	loss=0.5444
2026-02-02T10:29:33Z INFO     Train Epoch: 1 [23040/60000 (38%)]	loss=0.5801
2026-02-02T10:29:33Z INFO     Train Epoch: 1 [23680/60000 (39%)]	loss=0.8274
2026-02-02T10:29:33Z INFO     Train Epoch: 1 [24320/60000 (41%)]	loss=0.6841
2026-02-02T10:29:34Z INFO     Train Epoch: 1 [24960/60000 (42%)]	loss=0.5976
2026-02-02T10:29:34Z INFO     Train Epoch: 1 [25600/60000 (43%)]	loss=0.5630
2026-02-02T10:29:35Z INFO     Train Epoch: 1 [26240/60000 (44%)]	loss=0.5582
2026-02-02T10:29:35Z INFO     Train Epoch: 1 [26880/60000 (45%)]	loss=0.6062
2026-02-02T10:29:36Z INFO     Train Epoch: 1 [27520/60000 (46%)]	loss=0.5625
2026-02-02T10:29:37Z INFO     Train Epoch: 1 [28160/60000 (47%)]	loss=0.6783
2026-02-02T10:29:37Z INFO     Train Epoch: 1 [28800/60000 (48%)]	loss=0.6038
2026-02-02T10:29:38Z INFO     Train Epoch: 1 [29440/60000 (49%)]	loss=0.5356
2026-02-02T10:29:38Z INFO     Train Epoch: 1 [30080/60000 (50%)]	loss=0.6211
2026-02-02T10:29:39Z INFO     Train Epoch: 1 [30720/60000 (51%)]	loss=0.3793
2026-02-02T10:29:39Z INFO     Train Epoch: 1 [31360/60000 (52%)]	loss=0.5791
2026-02-02T10:29:40Z INFO     Train Epoch: 1 [32000/60000 (53%)]	loss=0.6636
2026-02-02T10:29:40Z INFO     Train Epoch: 1 [32640/60000 (54%)]	loss=0.6469
2026-02-02T10:29:40Z INFO     Train Epoch: 1 [33280/60000 (55%)]	loss=0.5759
2026-02-02T10:29:41Z INFO     Train Epoch: 1 [33920/60000 (57%)]	loss=0.6169
2026-02-02T10:29:42Z INFO     Train Epoch: 1 [34560/60000 (58%)]	loss=0.6562
2026-02-02T10:29:42Z INFO     Train Epoch: 1 [35200/60000 (59%)]	loss=0.3669
2026-02-02T10:29:43Z INFO     Train Epoch: 1 [35840/60000 (60%)]	loss=0.4151
2026-02-02T10:29:43Z INFO     Train Epoch: 1 [36480/60000 (61%)]	loss=0.3987
2026-02-02T10:29:44Z INFO     Train Epoch: 1 [37120/60000 (62%)]	loss=0.5282
2026-02-02T10:29:44Z INFO     Train Epoch: 1 [37760/60000 (63%)]	loss=0.3574
2026-02-02T10:29:45Z INFO     Train Epoch: 1 [38400/60000 (64%)]	loss=0.6176
2026-02-02T10:29:45Z INFO     Train Epoch: 1 [39040/60000 (65%)]	loss=0.4018
2026-02-02T10:29:46Z INFO     Train Epoch: 1 [39680/60000 (66%)]	loss=0.3824
2026-02-02T10:29:47Z INFO     Train Epoch: 1 [40320/60000 (67%)]	loss=0.5100
2026-02-02T10:29:47Z INFO     Train Epoch: 1 [40960/60000 (68%)]	loss=0.5637
2026-02-02T10:29:48Z INFO     Train Epoch: 1 [41600/60000 (69%)]	loss=0.5293
2026-02-02T10:29:48Z INFO     Train Epoch: 1 [42240/60000 (70%)]	loss=0.5368
2026-02-02T10:29:49Z INFO     Train Epoch: 1 [42880/60000 (71%)]	loss=0.5887
2026-02-02T10:29:49Z INFO     Train Epoch: 1 [43520/60000 (72%)]	loss=0.5728
2026-02-02T10:29:50Z INFO     Train Epoch: 1 [44160/60000 (74%)]	loss=0.5560
2026-02-02T10:29:50Z INFO     Train Epoch: 1 [44800/60000 (75%)]	loss=0.5798
2026-02-02T10:29:51Z INFO     Train Epoch: 1 [45440/60000 (76%)]	loss=0.4024
2026-02-02T10:29:51Z INFO     Train Epoch: 1 [46080/60000 (77%)]	loss=0.4744
2026-02-02T10:29:51Z INFO     Train Epoch: 1 [46720/60000 (78%)]	loss=0.5275
2026-02-02T10:29:52Z INFO     Train Epoch: 1 [47360/60000 (79%)]	loss=0.3368
2026-02-02T10:29:52Z INFO     Train Epoch: 1 [48000/60000 (80%)]	loss=0.4847
2026-02-02T10:29:53Z INFO     Train Epoch: 1 [48640/60000 (81%)]	loss=0.4953
2026-02-02T10:29:53Z INFO     Train Epoch: 1 [49280/60000 (82%)]	loss=0.5277
2026-02-02T10:29:55Z INFO     Train Epoch: 1 [49920/60000 (83%)]	loss=0.6247
2026-02-02T10:29:56Z INFO     Train Epoch: 1 [50560/60000 (84%)]	loss=0.4469
2026-02-02T10:29:56Z INFO     Train Epoch: 1 [51200/60000 (85%)]	loss=0.4864
2026-02-02T10:29:57Z INFO     Train Epoch: 1 [51840/60000 (86%)]	loss=0.4604
2026-02-02T10:29:57Z INFO     Train Epoch: 1 [52480/60000 (87%)]	loss=0.3293
2026-02-02T10:29:58Z INFO     Train Epoch: 1 [53120/60000 (88%)]	loss=0.4813
2026-02-02T10:29:58Z INFO     Train Epoch: 1 [53760/60000 (90%)]	loss=0.5119
2026-02-02T10:29:59Z INFO     Train Epoch: 1 [54400/60000 (91%)]	loss=0.4248
2026-02-02T10:29:59Z INFO     Train Epoch: 1 [55040/60000 (92%)]	loss=0.5736
2026-02-02T10:30:02Z INFO     Train Epoch: 1 [55680/60000 (93%)]	loss=0.4366
2026-02-02T10:30:03Z INFO     Train Epoch: 1 [56320/60000 (94%)]	loss=0.5449
2026-02-02T10:30:04Z INFO     Train Epoch: 1 [56960/60000 (95%)]	loss=0.5107
2026-02-02T10:30:04Z INFO     Train Epoch: 1 [57600/60000 (96%)]	loss=0.5116
2026-02-02T10:30:05Z INFO     Train Epoch: 1 [58240/60000 (97%)]	loss=0.6616
2026-02-02T10:30:06Z INFO     Train Epoch: 1 [58880/60000 (98%)]	loss=0.5308
2026-02-02T10:30:07Z INFO     Train Epoch: 1 [59520/60000 (99%)]	loss=0.3882
2026-02-02T10:30:10Z INFO     {metricName: accuracy, metricValue: 0.8168};{metricName: loss, metricValue: 0.4944}

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw

Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw

Then I followed the logs of the metrics-logger-and-collector container; it keeps running and never ends:

# kubectl logs file-metrics-collector-with-json-format-h6h6sxqg-kdj8p -f -c metrics-logger-and-collector
I0202 10:25:28.978585      52 main.go:400] Trial Name: file-metrics-collector-with-json-format-h6h6sxqg
I0202 10:30:10.293412      52 main.go:143] {"checkpoint_path": "", "global_step": "1", "loss": "0.49443635559082033", "timestamp": 1770028210.2863286, "trial": "0"}
I0202 10:30:10.293475      52 main.go:143] {"accuracy": "0.8168", "checkpoint_path": "", "global_step": "1", "timestamp": 1770028210.2866523, "trial": "0"}

After a long time, I checked the pod state; it is still NotReady, even though the collector had already parsed both metrics:

# kubectl get pod | grep json
file-metrics-collector-with-json-format-djvvcd7v-8dwzp            0/3     Pending            0             39m
file-metrics-collector-with-json-format-h6h6sxqg-kdj8p            2/3     NotReady           0             39m
file-metrics-collector-with-json-format-random-768db846b4-8dsxl   1/1     Running            0             39m
file-metrics-collector-with-json-format-tb54bn54-tvndw            0/3     Pending            0             39m
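
The Trial objects themselves also stay running. For completeness, they can be listed with (a sketch, using the aict namespace shown in the pod description below):

kubectl get trials -n aict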

In another session, I exec'd into the metrics-logger-and-collector container and checked the files and processes; everything looks normal:

# kubectl exec -it file-metrics-collector-with-json-format-h6h6sxqg-kdj8p -c metrics-logger-and-collector -- sh
/app # 
/app # ls -al /katib
total 12
drwxrwxrwx    2 root     root          4096 Feb  2 10:30 .
drwxr-xr-x    1 root     root            53 Feb  2 10:25 ..
-rw-r--r--    1 root     root            10 Feb  2 10:30 45.pid
-rw-r--r--    1 root     root           235 Feb  2 10:30 mnist.json
/app # cat /katib/45.pid 
completed
/app # cat /katib/mnist.json 
{"checkpoint_path": "", "global_step": "1", "loss": "0.49443635559082033", "timestamp": 1770028210.2863286, "trial": "0"}
{"accuracy": "0.8168", "checkpoint_path": "", "global_step": "1", "timestamp": 1770028210.2866523, "trial": "0"}
/app # ps -ef
PID   USER     TIME  COMMAND
    1 65535     0:00 /pause
   19 1337      0:00 /usr/local/bin/pilot-agent proxy sidecar --domain aict.svc.cluster.local --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --
   31 1337      0:03 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev.json --drain-time-s 45 --drain-strategy immediate --local-address-ip-version v4 --fil
   52 root      2:20 ./file-metricscollector -t file-metrics-collector-with-json-format-h6h6sxqg -m accuracy;loss -o-type maximize -s-db katib-db-manager.ku
  110 root      0:00 sh
  119 root      0:00 ps -ef
/app # 
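
The same checks can also be run non-interactively; a sketch, using the pod name above:

POD=file-metrics-collector-with-json-format-h6h6sxqg-kdj8p
kubectl exec $POD -c metrics-logger-and-collector -- cat /katib/45.pid   # the marker already reads "completed"
kubectl exec $POD -c metrics-logger-and-collector -- ps -ef              # yet the collector (PID 52) is still running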

Here is the output of kubectl describe pod:

# kubectl describe pod file-metrics-collector-with-json-format-h6h6sxqg-kdj8p
Name:             file-metrics-collector-with-json-format-h6h6sxqg-kdj8p
Namespace:        aict
Priority:         0
Service Account:  default
Node:             llm1/192.168.1.4
Start Time:       Mon, 02 Feb 2026 18:25:22 +0800
Labels:           batch.kubernetes.io/controller-uid=6cb8a82f-542f-4494-99aa-0e3ff3f1221c
                  batch.kubernetes.io/job-name=file-metrics-collector-with-json-format-h6h6sxqg
                  controller-uid=6cb8a82f-542f-4494-99aa-0e3ff3f1221c
                  job-name=file-metrics-collector-with-json-format-h6h6sxqg
                  katib.kubeflow.org/experiment=file-metrics-collector-with-json-format
                  katib.kubeflow.org/trial=file-metrics-collector-with-json-format-h6h6sxqg
                  security.istio.io/tlsMode=istio
                  service.istio.io/canonical-name=file-metrics-collector-with-json-format-h6h6sxqg
                  service.istio.io/canonical-revision=latest
Annotations:      cni.projectcalico.org/containerID: 8af4f4492f102dc70cc8898e6f1511b280f7d717d0ffed495e73ccb942cef8ce
                  cni.projectcalico.org/podIP: 10.42.0.192/32
                  cni.projectcalico.org/podIPs: 10.42.0.192/32
                  istio.io/rev: default
                  kubectl.kubernetes.io/default-container: training-container
                  kubectl.kubernetes.io/default-logs-container: training-container
                  prometheus.io/path: /stats/prometheus
                  prometheus.io/port: 15020
                  prometheus.io/scrape: true
                  sidecar.istio.io/interceptionMode: REDIRECT
                  sidecar.istio.io/status:
                    {"initContainers":["istio-validation","istio-proxy"],"containers":null,"volumes":["workload-socket","credential-socket","workload-certs","...
                  traffic.sidecar.istio.io/excludeInboundPorts: 15020
                  traffic.sidecar.istio.io/includeInboundPorts: *
                  traffic.sidecar.istio.io/includeOutboundIPRanges: *
Status:           Running
IP:               10.42.0.192
IPs:
  IP:           10.42.0.192
Controlled By:  Job/file-metrics-collector-with-json-format-h6h6sxqg
Init Containers:
  istio-validation:
    Container ID:  containerd://468934f880c08e990f1f23de834c0a35221f72cee8ff325a8f0613679231d9f0
    Image:         36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1
    Image ID:      36.134.128.101.nip.io:31104/kubeflow/proxyv2@sha256:79ae318dc23920468ea6cfaa0743883b3764b472635c3a698166c33dd4edb329
    Port:          <none>
    Host Port:     <none>
    Args:
      istio-iptables
      -p
      15001
      -z
      15006
      -u
      1337
      -m
      REDIRECT
      -i
      *
      -x
      
      -b
      *
      -d
      15090,15021,15020
      --log_output_level=default:info
      --run-validation
      --skip-rule-apply
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 02 Feb 2026 18:25:23 +0800
      Finished:     Mon, 02 Feb 2026 18:25:23 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:        100m
      memory:     128Mi
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kk2s6 (ro)
  istio-proxy:
    Container ID:  containerd://71a38c6d438fc750220674f5593ca45047ae7aeb601c03c3b3374a99b92559bf
    Image:         36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1
    Image ID:      36.134.128.101.nip.io:31104/kubeflow/proxyv2@sha256:79ae318dc23920468ea6cfaa0743883b3764b472635c3a698166c33dd4edb329
    Port:          15090/TCP
    Host Port:     0/TCP
    Args:
      proxy
      sidecar
      --domain
      $(POD_NAMESPACE).svc.cluster.local
      --proxyLogLevel=warning
      --proxyComponentLogLevel=misc:error
      --log_output_level=default:info
    State:          Running
      Started:      Mon, 02 Feb 2026 18:25:25 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     2
      memory:  1Gi
    Requests:
      cpu:      100m
      memory:   128Mi
    Readiness:  http-get http://:15021/healthz/ready delay=0s timeout=3s period=15s #success=1 #failure=4
    Startup:    http-get http://:15021/healthz/ready delay=0s timeout=3s period=1s #success=1 #failure=600
    Environment:
      PILOT_CERT_PROVIDER:           istiod
      CA_ADDR:                       istiod.istio-system.svc:15012
      POD_NAME:                      file-metrics-collector-with-json-format-h6h6sxqg-kdj8p (v1:metadata.name)
      POD_NAMESPACE:                 aict (v1:metadata.namespace)
      INSTANCE_IP:                    (v1:status.podIP)
      SERVICE_ACCOUNT:                (v1:spec.serviceAccountName)
      HOST_IP:                        (v1:status.hostIP)
      ISTIO_CPU_LIMIT:               2 (limits.cpu)
      PROXY_CONFIG:                  {"tracing":{}}
                                     
      ISTIO_META_POD_PORTS:          [
                                     ]
      ISTIO_META_APP_CONTAINERS:     training-container
      GOMEMLIMIT:                    1073741824 (limits.memory)
      GOMAXPROCS:                    2 (limits.cpu)
      ISTIO_META_CLUSTER_ID:         Kubernetes
      ISTIO_META_NODE_NAME:           (v1:spec.nodeName)
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
      ISTIO_META_WORKLOAD_NAME:      file-metrics-collector-with-json-format-h6h6sxqg
      ISTIO_META_OWNER:              kubernetes://apis/batch/v1/namespaces/aict/jobs/file-metrics-collector-with-json-format-h6h6sxqg
      ISTIO_META_MESH_ID:            cluster.local
      TRUST_DOMAIN:                  cluster.local
    Mounts:
      /etc/istio/pod from istio-podinfo (rw)
      /etc/istio/proxy from istio-envoy (rw)
      /var/lib/istio/data from istio-data (rw)
      /var/run/secrets/credential-uds from credential-socket (rw)
      /var/run/secrets/istio from istiod-ca-cert (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kk2s6 (ro)
      /var/run/secrets/tokens from istio-token (rw)
      /var/run/secrets/workload-spiffe-credentials from workload-certs (rw)
      /var/run/secrets/workload-spiffe-uds from workload-socket (rw)
Containers:
  training-container:
    Container ID:  containerd://d96df4fbc26052d72ff3227879b24bf49106cb9d988adf91e32ad31b4e12e841
    Image:         ghcr.io/kubeflow/katib/pytorch-mnist-cpu:latest
    Image ID:      docker.m.daocloud.io/kubeflowkatib/pytorch-mnist-cpu@sha256:3564468f2313733108c773b3831db9dd3c8c0da6b326a7c0d3f051281b33bbad
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
    Args:
      python3 /opt/pytorch-mnist/mnist.py --epochs=1 --log-path=/katib/mnist.json --lr=0.02696864809019029 --momentum=0.645118906199316 --logger=hypertune && echo completed > /katib/$$$$.pid
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Mon, 02 Feb 2026 18:25:28 +0800
      Finished:     Mon, 02 Feb 2026 18:30:10 +0800
    Ready:          False
    Restart Count:  0
    Environment:
      KATIB_TRIAL_NAME:   (v1:metadata.labels['katib.kubeflow.org/trial'])
    Mounts:
      /katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kk2s6 (ro)
  metrics-logger-and-collector:
    Container ID:  containerd://11b708e419359b73e84e54c133fdaf324899e56cff34c39035004b95f12b6a39
    Image:         ghcr.io/kubeflow/katib/file-metrics-collector:v0.19.0
    Image ID:      ghcr.io/kubeflow/katib/file-metrics-collector@sha256:0616af2111b2c6029105ac4670e1e94a0ceb7ba02ddb06a8cee3a687fde1514c
    Port:          <none>
    Host Port:     <none>
    Args:
      -t
      file-metrics-collector-with-json-format-h6h6sxqg
      -m
      accuracy;loss
      -o-type
      maximize
      -s-db
      katib-db-manager.kubeflow:6789
      -path
      /katib/mnist.json
      -format
      JSON
    State:          Running
      Started:      Mon, 02 Feb 2026 18:25:28 +0800
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:                500m
      ephemeral-storage:  5Gi
      memory:             100Mi
    Requests:
      cpu:                50m
      ephemeral-storage:  500Mi
      memory:             10Mi
    Environment:          <none>
    Mounts:
      /katib from metrics-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kk2s6 (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  workload-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  credential-socket:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  workload-certs:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-envoy:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  istio-data:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  istio-podinfo:
    Type:  DownwardAPI (a volume populated by information about the pod)
    Items:
      metadata.labels -> labels
      metadata.annotations -> annotations
  istio-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  43200
  istiod-ca-cert:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      istio-ca-root-cert
    Optional:  false
  kube-api-access-kk2s6:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
  metrics-volume:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
    SizeLimit:   <unset>
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  37m   default-scheduler  Successfully assigned aict/file-metrics-collector-with-json-format-h6h6sxqg-kdj8p to llm1
  Normal  Pulled     37m   kubelet            Container image "36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1" already present on machine
  Normal  Created    37m   kubelet            Created container: istio-validation
  Normal  Started    37m   kubelet            Started container istio-validation
  Normal  Pulled     37m   kubelet            Container image "36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1" already present on machine
  Normal  Created    37m   kubelet            Created container: istio-proxy
  Normal  Started    37m   kubelet            Started container istio-proxy
  Normal  Pulling    37m   kubelet            Pulling image "ghcr.io/kubeflow/katib/pytorch-mnist-cpu:latest"
  Normal  Pulled     37m   kubelet            Successfully pulled image "ghcr.io/kubeflow/katib/pytorch-mnist-cpu:latest" in 1.187s (1.187s including waiting). Image size: 2975520867 bytes.
  Normal  Created    37m   kubelet            Created container: training-container
  Normal  Started    37m   kubelet            Started container training-container
  Normal  Pulled     37m   kubelet            Container image "ghcr.io/kubeflow/katib/file-metrics-collector:v0.19.0" already present on machine
  Normal  Created    37m   kubelet            Created container: metrics-logger-and-collector
  Normal  Started    37m   kubelet            Started container metrics-logger-and-collector
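
Note how the pieces line up in the output above: the training command is wrapped with && echo completed > /katib/$$$$.pid (the $$$$ in the manifest presumably reaches the shell as $$, i.e. the wrapper shell's PID, which matches the /katib/45.pid seen earlier), the training-container is Terminated with Reason: Completed, and the collector is pointed at /katib/mnist.json with -format JSON. So the completion marker and the metrics file are both in place, yet the collector never exits and the pod never terminates. The pod can be watched for a state change with:

kubectl get pod file-metrics-collector-with-json-format-h6h6sxqg-kdj8p -w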

The Kubeflow UI displays this experiment as:

[screenshot: experiment status in the Kubeflow UI]

What did you expect to happen?

Metrics should be collected from the JSON file, file-metricscollector should exit normally, and the trial status should be Succeeded in the Kubeflow UI.

Environment

Kubernetes version:

$ kubectl version

result:
Client Version: v1.32.10+rke2r1
Kustomize Version: v5.5.0
Server Version: v1.32.10+rke2r1


Katib controller version:

$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"

result:
ghcr.io/kubeflow/katib/katib-controller:v0.19.0


Katib Python SDK version:

$ pip show kubeflow-katib

result:
Name: kubeflow-katib
Version: 0.19.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: premnath.vel@gmail.com
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, grpcio, kubeflow-training, kubernetes, protobuf, setuptools, six, urllib3
Required-by:


Note that I am not running this experiment via the Katib Python SDK.

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.
