What happened?
I created an experiment in the Kubeflow UI from https://github.com/kubeflow/katib/blob/master/examples/v1beta1/metrics-collector/file-metrics-collector-with-json-format.yaml. Then I saw 3 trial pods:
We can see that one of the pods is running.
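(For anyone reproducing this without the UI: applying the same manifest with kubectl should create an equivalent experiment; the raw URL below is my assumption of where the example file lives.)
# kubectl apply -n aict -f https://raw.githubusercontent.com/kubeflow/katib/master/examples/v1beta1/metrics-collector/file-metrics-collector-with-json-format.yaml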
I followed the log of the running pod's training-container until the container completed:
# kubectl logs file-metrics-collector-with-json-format-h6h6sxqg-kdj8p -f -c training-container
100.0%
100.0%
100.0%
100.0%
2026-02-02T10:29:09Z INFO Train Epoch: 1 [0/60000 (0%)] loss=2.2980
2026-02-02T10:29:09Z INFO Train Epoch: 1 [640/60000 (1%)] loss=2.2977
2026-02-02T10:29:10Z INFO Train Epoch: 1 [1280/60000 (2%)] loss=2.1945
2026-02-02T10:29:10Z INFO Train Epoch: 1 [1920/60000 (3%)] loss=2.0590
2026-02-02T10:29:11Z INFO Train Epoch: 1 [2560/60000 (4%)] loss=1.7646
2026-02-02T10:29:11Z INFO Train Epoch: 1 [3200/60000 (5%)] loss=1.6865
2026-02-02T10:29:11Z INFO Train Epoch: 1 [3840/60000 (6%)] loss=1.1514
2026-02-02T10:29:12Z INFO Train Epoch: 1 [4480/60000 (7%)] loss=0.9604
2026-02-02T10:29:12Z INFO Train Epoch: 1 [5120/60000 (9%)] loss=1.0196
2026-02-02T10:29:13Z INFO Train Epoch: 1 [5760/60000 (10%)] loss=1.0410
2026-02-02T10:29:13Z INFO Train Epoch: 1 [6400/60000 (11%)] loss=0.9910
2026-02-02T10:29:14Z INFO Train Epoch: 1 [7040/60000 (12%)] loss=0.8280
2026-02-02T10:29:14Z INFO Train Epoch: 1 [7680/60000 (13%)] loss=0.9400
2026-02-02T10:29:15Z INFO Train Epoch: 1 [8320/60000 (14%)] loss=0.7275
2026-02-02T10:29:15Z INFO Train Epoch: 1 [8960/60000 (15%)] loss=0.6647
2026-02-02T10:29:16Z INFO Train Epoch: 1 [9600/60000 (16%)] loss=0.9186
2026-02-02T10:29:16Z INFO Train Epoch: 1 [10240/60000 (17%)] loss=0.8398
2026-02-02T10:29:17Z INFO Train Epoch: 1 [10880/60000 (18%)] loss=0.5977
2026-02-02T10:29:18Z INFO Train Epoch: 1 [11520/60000 (19%)] loss=0.8014
2026-02-02T10:29:18Z INFO Train Epoch: 1 [12160/60000 (20%)] loss=0.9261
2026-02-02T10:29:18Z INFO Train Epoch: 1 [12800/60000 (21%)] loss=0.4948
2026-02-02T10:29:19Z INFO Train Epoch: 1 [13440/60000 (22%)] loss=0.9426
2026-02-02T10:29:20Z INFO Train Epoch: 1 [14080/60000 (23%)] loss=0.6710
2026-02-02T10:29:22Z INFO Train Epoch: 1 [14720/60000 (25%)] loss=0.4970
2026-02-02T10:29:24Z INFO Train Epoch: 1 [15360/60000 (26%)] loss=0.6170
2026-02-02T10:29:26Z INFO Train Epoch: 1 [16000/60000 (27%)] loss=0.6423
2026-02-02T10:29:27Z INFO Train Epoch: 1 [16640/60000 (28%)] loss=0.7401
2026-02-02T10:29:27Z INFO Train Epoch: 1 [17280/60000 (29%)] loss=0.5749
2026-02-02T10:29:28Z INFO Train Epoch: 1 [17920/60000 (30%)] loss=0.7347
2026-02-02T10:29:29Z INFO Train Epoch: 1 [18560/60000 (31%)] loss=0.7903
2026-02-02T10:29:29Z INFO Train Epoch: 1 [19200/60000 (32%)] loss=0.6705
2026-02-02T10:29:30Z INFO Train Epoch: 1 [19840/60000 (33%)] loss=0.4709
2026-02-02T10:29:31Z INFO Train Epoch: 1 [20480/60000 (34%)] loss=0.6821
2026-02-02T10:29:31Z INFO Train Epoch: 1 [21120/60000 (35%)] loss=0.7016
2026-02-02T10:29:32Z INFO Train Epoch: 1 [21760/60000 (36%)] loss=0.6120
2026-02-02T10:29:32Z INFO Train Epoch: 1 [22400/60000 (37%)] loss=0.5444
2026-02-02T10:29:33Z INFO Train Epoch: 1 [23040/60000 (38%)] loss=0.5801
2026-02-02T10:29:33Z INFO Train Epoch: 1 [23680/60000 (39%)] loss=0.8274
2026-02-02T10:29:33Z INFO Train Epoch: 1 [24320/60000 (41%)] loss=0.6841
2026-02-02T10:29:34Z INFO Train Epoch: 1 [24960/60000 (42%)] loss=0.5976
2026-02-02T10:29:34Z INFO Train Epoch: 1 [25600/60000 (43%)] loss=0.5630
2026-02-02T10:29:35Z INFO Train Epoch: 1 [26240/60000 (44%)] loss=0.5582
2026-02-02T10:29:35Z INFO Train Epoch: 1 [26880/60000 (45%)] loss=0.6062
2026-02-02T10:29:36Z INFO Train Epoch: 1 [27520/60000 (46%)] loss=0.5625
2026-02-02T10:29:37Z INFO Train Epoch: 1 [28160/60000 (47%)] loss=0.6783
2026-02-02T10:29:37Z INFO Train Epoch: 1 [28800/60000 (48%)] loss=0.6038
2026-02-02T10:29:38Z INFO Train Epoch: 1 [29440/60000 (49%)] loss=0.5356
2026-02-02T10:29:38Z INFO Train Epoch: 1 [30080/60000 (50%)] loss=0.6211
2026-02-02T10:29:39Z INFO Train Epoch: 1 [30720/60000 (51%)] loss=0.3793
2026-02-02T10:29:39Z INFO Train Epoch: 1 [31360/60000 (52%)] loss=0.5791
2026-02-02T10:29:40Z INFO Train Epoch: 1 [32000/60000 (53%)] loss=0.6636
2026-02-02T10:29:40Z INFO Train Epoch: 1 [32640/60000 (54%)] loss=0.6469
2026-02-02T10:29:40Z INFO Train Epoch: 1 [33280/60000 (55%)] loss=0.5759
2026-02-02T10:29:41Z INFO Train Epoch: 1 [33920/60000 (57%)] loss=0.6169
2026-02-02T10:29:42Z INFO Train Epoch: 1 [34560/60000 (58%)] loss=0.6562
2026-02-02T10:29:42Z INFO Train Epoch: 1 [35200/60000 (59%)] loss=0.3669
2026-02-02T10:29:43Z INFO Train Epoch: 1 [35840/60000 (60%)] loss=0.4151
2026-02-02T10:29:43Z INFO Train Epoch: 1 [36480/60000 (61%)] loss=0.3987
2026-02-02T10:29:44Z INFO Train Epoch: 1 [37120/60000 (62%)] loss=0.5282
2026-02-02T10:29:44Z INFO Train Epoch: 1 [37760/60000 (63%)] loss=0.3574
2026-02-02T10:29:45Z INFO Train Epoch: 1 [38400/60000 (64%)] loss=0.6176
2026-02-02T10:29:45Z INFO Train Epoch: 1 [39040/60000 (65%)] loss=0.4018
2026-02-02T10:29:46Z INFO Train Epoch: 1 [39680/60000 (66%)] loss=0.3824
2026-02-02T10:29:47Z INFO Train Epoch: 1 [40320/60000 (67%)] loss=0.5100
2026-02-02T10:29:47Z INFO Train Epoch: 1 [40960/60000 (68%)] loss=0.5637
2026-02-02T10:29:48Z INFO Train Epoch: 1 [41600/60000 (69%)] loss=0.5293
2026-02-02T10:29:48Z INFO Train Epoch: 1 [42240/60000 (70%)] loss=0.5368
2026-02-02T10:29:49Z INFO Train Epoch: 1 [42880/60000 (71%)] loss=0.5887
2026-02-02T10:29:49Z INFO Train Epoch: 1 [43520/60000 (72%)] loss=0.5728
2026-02-02T10:29:50Z INFO Train Epoch: 1 [44160/60000 (74%)] loss=0.5560
2026-02-02T10:29:50Z INFO Train Epoch: 1 [44800/60000 (75%)] loss=0.5798
2026-02-02T10:29:51Z INFO Train Epoch: 1 [45440/60000 (76%)] loss=0.4024
2026-02-02T10:29:51Z INFO Train Epoch: 1 [46080/60000 (77%)] loss=0.4744
2026-02-02T10:29:51Z INFO Train Epoch: 1 [46720/60000 (78%)] loss=0.5275
2026-02-02T10:29:52Z INFO Train Epoch: 1 [47360/60000 (79%)] loss=0.3368
2026-02-02T10:29:52Z INFO Train Epoch: 1 [48000/60000 (80%)] loss=0.4847
2026-02-02T10:29:53Z INFO Train Epoch: 1 [48640/60000 (81%)] loss=0.4953
2026-02-02T10:29:53Z INFO Train Epoch: 1 [49280/60000 (82%)] loss=0.5277
2026-02-02T10:29:55Z INFO Train Epoch: 1 [49920/60000 (83%)] loss=0.6247
2026-02-02T10:29:56Z INFO Train Epoch: 1 [50560/60000 (84%)] loss=0.4469
2026-02-02T10:29:56Z INFO Train Epoch: 1 [51200/60000 (85%)] loss=0.4864
2026-02-02T10:29:57Z INFO Train Epoch: 1 [51840/60000 (86%)] loss=0.4604
2026-02-02T10:29:57Z INFO Train Epoch: 1 [52480/60000 (87%)] loss=0.3293
2026-02-02T10:29:58Z INFO Train Epoch: 1 [53120/60000 (88%)] loss=0.4813
2026-02-02T10:29:58Z INFO Train Epoch: 1 [53760/60000 (90%)] loss=0.5119
2026-02-02T10:29:59Z INFO Train Epoch: 1 [54400/60000 (91%)] loss=0.4248
2026-02-02T10:29:59Z INFO Train Epoch: 1 [55040/60000 (92%)] loss=0.5736
2026-02-02T10:30:02Z INFO Train Epoch: 1 [55680/60000 (93%)] loss=0.4366
2026-02-02T10:30:03Z INFO Train Epoch: 1 [56320/60000 (94%)] loss=0.5449
2026-02-02T10:30:04Z INFO Train Epoch: 1 [56960/60000 (95%)] loss=0.5107
2026-02-02T10:30:04Z INFO Train Epoch: 1 [57600/60000 (96%)] loss=0.5116
2026-02-02T10:30:05Z INFO Train Epoch: 1 [58240/60000 (97%)] loss=0.6616
2026-02-02T10:30:06Z INFO Train Epoch: 1 [58880/60000 (98%)] loss=0.5308
2026-02-02T10:30:07Z INFO Train Epoch: 1 [59520/60000 (99%)] loss=0.3882
2026-02-02T10:30:10Z INFO {metricName: accuracy, metricValue: 0.8168};{metricName: loss, metricValue: 0.4944}
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz
Extracting ./data/FashionMNIST/raw/train-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz
Extracting ./data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ./data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to ./data/FashionMNIST/raw
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz
Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz
Extracting ./data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to ./data/FashionMNIST/raw
Then I followed the log of the metrics-logger-and-collector container; it is still running and never ends:
# kubectl logs file-metrics-collector-with-json-format-h6h6sxqg-kdj8p -f -c metrics-logger-and-collector
I0202 10:25:28.978585 52 main.go:400] Trial Name: file-metrics-collector-with-json-format-h6h6sxqg
I0202 10:30:10.293412 52 main.go:143] {"checkpoint_path": "", "global_step": "1", "loss": "0.49443635559082033", "timestamp": 1770028210.2863286, "trial": "0"}
I0202 10:30:10.293475 52 main.go:143] {"accuracy": "0.8168", "checkpoint_path": "", "global_step": "1", "timestamp": 1770028210.2866523, "trial": "0"}
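So both metrics were parsed from /katib/mnist.json, yet the collector never exits. To check whether the values actually reached katib-db-manager, the Trial object can be inspected (a sketch; the jsonpath assumes the usual Trial status layout):
# kubectl get trial file-metrics-collector-with-json-format-h6h6sxqg -n aict -o jsonpath='{.status.observation}'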
After a long time I checked the pod state; it is still NotReady:
# kubectl get pod | grep json
file-metrics-collector-with-json-format-djvvcd7v-8dwzp 0/3 Pending 0 39m
file-metrics-collector-with-json-format-h6h6sxqg-kdj8p 2/3 NotReady 0 39m
file-metrics-collector-with-json-format-random-768db846b4-8dsxl 1/1 Running 0 39m
file-metrics-collector-with-json-format-tb54bn54-tvndw 0/3 Pending 0 39m
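To confirm which containers keep the pod at 2/3 NotReady (training-container has terminated, so only istio-proxy and the collector should still be running), the per-container readiness can be listed directly (a jsonpath sketch over standard pod status fields):
# kubectl get pod file-metrics-collector-with-json-format-h6h6sxqg-kdj8p -n aict -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.ready}{"\n"}{end}'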
In another session, I exec'd into the metrics-logger-and-collector container and checked the file and processes; everything looks normal:
# kubectl exec -it file-metrics-collector-with-json-format-h6h6sxqg-kdj8p -c metrics-logger-and-collector -- sh
/app #
/app # ls -al /katib
total 12
drwxrwxrwx 2 root root 4096 Feb 2 10:30 .
drwxr-xr-x 1 root root 53 Feb 2 10:25 ..
-rw-r--r-- 1 root root 10 Feb 2 10:30 45.pid
-rw-r--r-- 1 root root 235 Feb 2 10:30 mnist.json
/app # cat /katib/45.pid
completed
/app # cat /katib/mnist.json
{"checkpoint_path": "", "global_step": "1", "loss": "0.49443635559082033", "timestamp": 1770028210.2863286, "trial": "0"}
{"accuracy": "0.8168", "checkpoint_path": "", "global_step": "1", "timestamp": 1770028210.2866523, "trial": "0"}
/app # ps -ef
PID USER TIME COMMAND
1 65535 0:00 /pause
19 1337 0:00 /usr/local/bin/pilot-agent proxy sidecar --domain aict.svc.cluster.local --proxyLogLevel=warning --proxyComponentLogLevel=misc:error --
31 1337 0:03 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev.json --drain-time-s 45 --drain-strategy immediate --local-address-ip-version v4 --fil
52 root 2:20 ./file-metricscollector -t file-metrics-collector-with-json-format-h6h6sxqg -m accuracy;loss -o-type maximize -s-db katib-db-manager.ku
110 root 0:00 sh
119 root 0:00 ps -ef
/app #
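The training command ends with && echo completed > /katib/$$$$.pid (see the describe output below), so /katib/45.pid containing "completed" means the shell that ran the training (PID 45) finished and wrote its marker. As far as I understand, the file metrics collector also waits for that PID to disappear from the shared process namespace before exiting, which can be double-checked from the same shell (a sketch based on my assumption about the wait logic):
/app # test -d /proc/45 && echo "pid 45 still running" || echo "pid 45 gone"
Given the ps -ef output above, PID 45 is already gone, so the collector should have been able to detect completion.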
Here is the output of kubectl describe pod:
# kubectl describe pod file-metrics-collector-with-json-format-h6h6sxqg-kdj8p
Name: file-metrics-collector-with-json-format-h6h6sxqg-kdj8p
Namespace: aict
Priority: 0
Service Account: default
Node: llm1/192.168.1.4
Start Time: Mon, 02 Feb 2026 18:25:22 +0800
Labels: batch.kubernetes.io/controller-uid=6cb8a82f-542f-4494-99aa-0e3ff3f1221c
batch.kubernetes.io/job-name=file-metrics-collector-with-json-format-h6h6sxqg
controller-uid=6cb8a82f-542f-4494-99aa-0e3ff3f1221c
job-name=file-metrics-collector-with-json-format-h6h6sxqg
katib.kubeflow.org/experiment=file-metrics-collector-with-json-format
katib.kubeflow.org/trial=file-metrics-collector-with-json-format-h6h6sxqg
security.istio.io/tlsMode=istio
service.istio.io/canonical-name=file-metrics-collector-with-json-format-h6h6sxqg
service.istio.io/canonical-revision=latest
Annotations: cni.projectcalico.org/containerID: 8af4f4492f102dc70cc8898e6f1511b280f7d717d0ffed495e73ccb942cef8ce
cni.projectcalico.org/podIP: 10.42.0.192/32
cni.projectcalico.org/podIPs: 10.42.0.192/32
istio.io/rev: default
kubectl.kubernetes.io/default-container: training-container
kubectl.kubernetes.io/default-logs-container: training-container
prometheus.io/path: /stats/prometheus
prometheus.io/port: 15020
prometheus.io/scrape: true
sidecar.istio.io/interceptionMode: REDIRECT
sidecar.istio.io/status:
{"initContainers":["istio-validation","istio-proxy"],"containers":null,"volumes":["workload-socket","credential-socket","workload-certs","...
traffic.sidecar.istio.io/excludeInboundPorts: 15020
traffic.sidecar.istio.io/includeInboundPorts: *
traffic.sidecar.istio.io/includeOutboundIPRanges: *
Status: Running
IP: 10.42.0.192
IPs:
IP: 10.42.0.192
Controlled By: Job/file-metrics-collector-with-json-format-h6h6sxqg
Init Containers:
istio-validation:
Container ID: containerd://468934f880c08e990f1f23de834c0a35221f72cee8ff325a8f0613679231d9f0
Image: 36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1
Image ID: 36.134.128.101.nip.io:31104/kubeflow/proxyv2@sha256:79ae318dc23920468ea6cfaa0743883b3764b472635c3a698166c33dd4edb329
Port: <none>
Host Port: <none>
Args:
istio-iptables
-p
15001
-z
15006
-u
1337
-m
REDIRECT
-i
*
-x
-b
*
-d
15090,15021,15020
--log_output_level=default:info
--run-validation
--skip-rule-apply
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 02 Feb 2026 18:25:23 +0800
Finished: Mon, 02 Feb 2026 18:25:23 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 100m
memory: 128Mi
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kk2s6 (ro)
istio-proxy:
Container ID: containerd://71a38c6d438fc750220674f5593ca45047ae7aeb601c03c3b3374a99b92559bf
Image: 36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1
Image ID: 36.134.128.101.nip.io:31104/kubeflow/proxyv2@sha256:79ae318dc23920468ea6cfaa0743883b3764b472635c3a698166c33dd4edb329
Port: 15090/TCP
Host Port: 0/TCP
Args:
proxy
sidecar
--domain
$(POD_NAMESPACE).svc.cluster.local
--proxyLogLevel=warning
--proxyComponentLogLevel=misc:error
--log_output_level=default:info
State: Running
Started: Mon, 02 Feb 2026 18:25:25 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 2
memory: 1Gi
Requests:
cpu: 100m
memory: 128Mi
Readiness: http-get http://:15021/healthz/ready delay=0s timeout=3s period=15s #success=1 #failure=4
Startup: http-get http://:15021/healthz/ready delay=0s timeout=3s period=1s #success=1 #failure=600
Environment:
PILOT_CERT_PROVIDER: istiod
CA_ADDR: istiod.istio-system.svc:15012
POD_NAME: file-metrics-collector-with-json-format-h6h6sxqg-kdj8p (v1:metadata.name)
POD_NAMESPACE: aict (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
SERVICE_ACCOUNT: (v1:spec.serviceAccountName)
HOST_IP: (v1:status.hostIP)
ISTIO_CPU_LIMIT: 2 (limits.cpu)
PROXY_CONFIG: {"tracing":{}}
ISTIO_META_POD_PORTS: [
]
ISTIO_META_APP_CONTAINERS: training-container
GOMEMLIMIT: 1073741824 (limits.memory)
GOMAXPROCS: 2 (limits.cpu)
ISTIO_META_CLUSTER_ID: Kubernetes
ISTIO_META_NODE_NAME: (v1:spec.nodeName)
ISTIO_META_INTERCEPTION_MODE: REDIRECT
ISTIO_META_WORKLOAD_NAME: file-metrics-collector-with-json-format-h6h6sxqg
ISTIO_META_OWNER: kubernetes://apis/batch/v1/namespaces/aict/jobs/file-metrics-collector-with-json-format-h6h6sxqg
ISTIO_META_MESH_ID: cluster.local
TRUST_DOMAIN: cluster.local
Mounts:
/etc/istio/pod from istio-podinfo (rw)
/etc/istio/proxy from istio-envoy (rw)
/var/lib/istio/data from istio-data (rw)
/var/run/secrets/credential-uds from credential-socket (rw)
/var/run/secrets/istio from istiod-ca-cert (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kk2s6 (ro)
/var/run/secrets/tokens from istio-token (rw)
/var/run/secrets/workload-spiffe-credentials from workload-certs (rw)
/var/run/secrets/workload-spiffe-uds from workload-socket (rw)
Containers:
training-container:
Container ID: containerd://d96df4fbc26052d72ff3227879b24bf49106cb9d988adf91e32ad31b4e12e841
Image: ghcr.io/kubeflow/katib/pytorch-mnist-cpu:latest
Image ID: docker.m.daocloud.io/kubeflowkatib/pytorch-mnist-cpu@sha256:3564468f2313733108c773b3831db9dd3c8c0da6b326a7c0d3f051281b33bbad
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
python3 /opt/pytorch-mnist/mnist.py --epochs=1 --log-path=/katib/mnist.json --lr=0.02696864809019029 --momentum=0.645118906199316 --logger=hypertune && echo completed > /katib/$$$$.pid
State: Terminated
Reason: Completed
Exit Code: 0
Started: Mon, 02 Feb 2026 18:25:28 +0800
Finished: Mon, 02 Feb 2026 18:30:10 +0800
Ready: False
Restart Count: 0
Environment:
KATIB_TRIAL_NAME: (v1:metadata.labels['katib.kubeflow.org/trial'])
Mounts:
/katib from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kk2s6 (ro)
metrics-logger-and-collector:
Container ID: containerd://11b708e419359b73e84e54c133fdaf324899e56cff34c39035004b95f12b6a39
Image: ghcr.io/kubeflow/katib/file-metrics-collector:v0.19.0
Image ID: ghcr.io/kubeflow/katib/file-metrics-collector@sha256:0616af2111b2c6029105ac4670e1e94a0ceb7ba02ddb06a8cee3a687fde1514c
Port: <none>
Host Port: <none>
Args:
-t
file-metrics-collector-with-json-format-h6h6sxqg
-m
accuracy;loss
-o-type
maximize
-s-db
katib-db-manager.kubeflow:6789
-path
/katib/mnist.json
-format
JSON
State: Running
Started: Mon, 02 Feb 2026 18:25:28 +0800
Ready: True
Restart Count: 0
Limits:
cpu: 500m
ephemeral-storage: 5Gi
memory: 100Mi
Requests:
cpu: 50m
ephemeral-storage: 500Mi
memory: 10Mi
Environment: <none>
Mounts:
/katib from metrics-volume (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kk2s6 (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
workload-socket:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
credential-socket:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
workload-certs:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
istio-data:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
istio-podinfo:
Type: DownwardAPI (a volume populated by information about the pod)
Items:
metadata.labels -> labels
metadata.annotations -> annotations
istio-token:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 43200
istiod-ca-cert:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: istio-ca-root-cert
Optional: false
kube-api-access-kk2s6:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
metrics-volume:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 37m default-scheduler Successfully assigned aict/file-metrics-collector-with-json-format-h6h6sxqg-kdj8p to llm1
Normal Pulled 37m kubelet Container image "36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1" already present on machine
Normal Created 37m kubelet Created container: istio-validation
Normal Started 37m kubelet Started container istio-validation
Normal Pulled 37m kubelet Container image "36.134.128.101.nip.io:31104/kubeflow/proxyv2:1.26.1" already present on machine
Normal Created 37m kubelet Created container: istio-proxy
Normal Started 37m kubelet Started container istio-proxy
Normal Pulling 37m kubelet Pulling image "ghcr.io/kubeflow/katib/pytorch-mnist-cpu:latest"
Normal Pulled 37m kubelet Successfully pulled image "ghcr.io/kubeflow/katib/pytorch-mnist-cpu:latest" in 1.187s (1.187s including waiting). Image size: 2975520867 bytes.
Normal Created 37m kubelet Created container: training-container
Normal Started 37m kubelet Started container training-container
Normal Pulled 37m kubelet Container image "ghcr.io/kubeflow/katib/file-metrics-collector:v0.19.0" already present on machine
Normal Created 37m kubelet Created container: metrics-logger-and-collector
Normal Started 37m kubelet Started container metrics-logger-and-collector
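The events show nothing abnormal, so the next thing I would check is the katib-controller log (using the same label selector as in the Environment section below):
# kubectl logs -n kubeflow -l katib.kubeflow.org/component=controller --tail=100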
This is how the experiment is displayed in the Kubeflow UI:
What did you expect to happen?
Metrics should be collected normally from the JSON file, file-metricscollector should exit normally, and the trial status should be Succeeded in the Kubeflow UI.
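In other words, once the collector reports the metrics and exits, the Job should complete and the trial should be counted as succeeded, which should be visible in the experiment status (a sketch, assuming the standard Experiment status fields):
# kubectl get experiment file-metrics-collector-with-json-format -n aict -o jsonpath='{.status.trialsSucceeded}'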
Environment
Kubernetes version:
$ kubectl version
result:
Client Version: v1.32.10+rke2r1
Kustomize Version: v5.5.0
Server Version: v1.32.10+rke2r1
Katib controller version:
$ kubectl get pods -n kubeflow -l katib.kubeflow.org/component=controller -o jsonpath="{.items[*].spec.containers[*].image}"
result:
ghcr.io/kubeflow/katib/katib-controller:v0.19.0
Katib Python SDK version:
$ pip show kubeflow-katib
result:
Name: kubeflow-katib
Version: 0.19.0
Summary: Katib Python SDK for APIVersion v1beta1
Home-page: https://github.com/kubeflow/katib/tree/master/sdk/python/v1beta1
Author: Kubeflow Authors
Author-email: premnath.vel@gmail.com
License: Apache License Version 2.0
Location: /opt/conda/lib/python3.11/site-packages
Requires: certifi, grpcio, kubeflow-training, kubernetes, protobuf, setuptools, six, urllib3
Required-by:
Note that I am not running this experiment via the Katib Python SDK.
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.