Add more useful error logs when a job fails #51

@jeffcarp

It's difficult to tell why this job failed. It would be great to get more logs and info to help debug:

% python examples/example_gke.py
2026-02-25 05:38:23.460649: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
============================================================
Keras Remote - GKE Examples
============================================================

--- Example 1: Simple Computation (CPU) ---
Running simple_computation(10, 20) on GKE...
Packaging function and context...
Payload serialized to /tmp/tmpg8kw64fo/payload.pkl
Context packaged to /tmp/tmpg8kw64fo/context.zip
No requirements.txt found
Building container image...
Using cached container: us-docker.pkg.dev/keras-team-gcp/keras-remote/base:cpu-22d410ef9d5d
View image: https://console.cloud.google.com/artifacts/docker/keras-team-gcp/us/keras-remote/base?project=keras-team-gcp
Uploading artifacts to Cloud Storage (job: job-f6bd5a77)...
Uploaded payload to gs://keras-team-gcp-keras-remote-jobs/job-f6bd5a77/payload.pkl
Uploaded context to gs://keras-team-gcp-keras-remote-jobs/job-f6bd5a77/context.zip
View artifacts: https://console.cloud.google.com/storage/browser/keras-team-gcp-keras-remote-jobs/job-f6bd5a77?project=keras-team-gcp
Submitting job to GKEBackend...
Submitted K8s job: keras-remote-job-f6bd5a77
View job with: kubectl get job keras-remote-job-f6bd5a77 -n default
View logs with: kubectl logs -l job-name=keras-remote-job-f6bd5a77 -n default
Job keras-remote-job-f6bd5a77 running...
Pod keras-remote-job-f6bd5a77-4rbwp logs:

Deleted K8s job: keras-remote-job-f6bd5a77
Downloading result...
Traceback (most recent call last):
  File "~/jeffcarp/gh/keras-team/remote/examples/example_gke.py", line 165, in <module>
    main()
    ~~~~^^
  File "~/jeffcarp/gh/keras-team/remote/examples/example_gke.py", line 113, in main
    result = simple_computation()
  File "~/jeffcarp/gh/keras-team/remote/keras_remote/core/core.py", line 70, in wrapper
    return _execute_on_gke(
      func,
    ...<8 lines>...
      env_vars,
    )
  File "~/jeffcarp/gh/keras-team/remote/keras_remote/core/core.py", line 127, in _execute_on_gke
    return execute_remote(ctx, GKEBackend(cluster=cluster, namespace=namespace))
  File "~/jeffcarp/gh/keras-team/remote/keras_remote/backend/execution.py", line 315, in execute_remote
    raise job_error from None
  File "~/jeffcarp/gh/keras-team/remote/keras_remote/backend/execution.py", line 300, in execute_remote
    backend.wait_for_job(job, ctx)
    ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^
  File "~/jeffcarp/gh/keras-team/remote/keras_remote/backend/execution.py", line 134, in wait_for_job
    gke_client.wait_for_job(job, namespace=self.namespace)
    ~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "~/jeffcarp/gh/keras-team/remote/keras_remote/backend/gke_client.py", line 133, in wait_for_job
    raise RuntimeError(f"GKE job {job_name} failed")
RuntimeError: GKE job keras-remote-job-f6bd5a77 failed
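
One possible direction, sketched under assumptions: in the output above the pod log section is empty and the job is deleted before the error is raised, so the pod's status is gone by the time the traceback appears. The helper below gathers pod phase, container termination details, and the last log lines so they could be appended to the `RuntimeError` message before cleanup. `collect_failure_diagnostics` is a hypothetical name, not an existing keras-remote function, and it assumes the client passed in behaves like `kubernetes.client.CoreV1Api` from the official Kubernetes Python client (whether `gke_client.py` uses that client is also an assumption). A stand-in client with invented values is included so the sketch runs without a cluster.

```python
from types import SimpleNamespace


def collect_failure_diagnostics(core_v1, job_name, namespace="default", tail_lines=50):
    """Summarize pod phase, container exit details, and recent logs for a job.

    `core_v1` is assumed to behave like kubernetes.client.CoreV1Api.
    """
    report = []
    pods = core_v1.list_namespaced_pod(
        namespace, label_selector=f"job-name={job_name}"
    )
    for pod in pods.items:
        name = pod.metadata.name
        report.append(f"Pod {name}: phase={pod.status.phase}")
        # container_statuses is None when no container was ever created.
        for cs in pod.status.container_statuses or []:
            terminated = getattr(cs.state, "terminated", None)
            waiting = getattr(cs.state, "waiting", None)
            if terminated is not None:
                report.append(
                    f"  container {cs.name}: exit_code={terminated.exit_code} "
                    f"reason={terminated.reason} message={terminated.message}"
                )
            elif waiting is not None:
                report.append(
                    f"  container {cs.name}: waiting reason={waiting.reason} "
                    f"message={waiting.message}"
                )
        try:
            logs = core_v1.read_namespaced_pod_log(
                name, namespace, tail_lines=tail_lines
            )
        except Exception as exc:  # e.g. the container never started
            logs = f"<could not read logs: {exc}>"
        report.append(f"  last {tail_lines} log lines:\n{logs or '<empty>'}")
    if not report:
        report.append(f"No pods found for job {job_name} in namespace {namespace}")
    return "\n".join(report)


# Stand-in client so the sketch runs without a cluster (all values invented).
class FakeCoreV1:
    def list_namespaced_pod(self, namespace, label_selector=None):
        terminated = SimpleNamespace(exit_code=137, reason="OOMKilled", message=None)
        state = SimpleNamespace(terminated=terminated, waiting=None)
        status = SimpleNamespace(
            phase="Failed",
            container_statuses=[SimpleNamespace(name="main", state=state)],
        )
        pod = SimpleNamespace(
            metadata=SimpleNamespace(name="example-pod"), status=status
        )
        return SimpleNamespace(items=[pod])

    def read_namespaced_pod_log(self, name, namespace, tail_lines=None):
        return ""


report = collect_failure_diagnostics(FakeCoreV1(), "example-job")
print(report)
```

With something like this, the failure path could raise `RuntimeError(f"GKE job {job_name} failed\n{report}")`, provided the diagnostics are collected before the job and its pods are deleted.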
