
Commit d7fbf7e

Merge branch 'keras-team:main' into simplify-hw-names
2 parents: 4744905 + b57bf26

26 files changed: +1351 −681 lines

.gemini/styleguide.md

Lines changed: 15 additions & 0 deletions
```diff
@@ -1,3 +1,18 @@
+When performing code reviews on pull requests, you must strictly adhere to the following principles in addition to the API design guidelines above:
+
+1. **Question the Necessity of Changes**: Do not assume that the pull request changes are strictly necessary. Critically review the proposed changes to ensure they add real value. Point out any code that solves a non-existent problem or adds unnecessary complexity.
+
+2. **Call out "AI Slop"**: Actively look for and identify "AI slop"—generic, overly verbose, or hallucinated code that lacks context or violates best practices. If you suspect the code is AI slop, explicitly call it out.
+
+3. **Poke Holes in the Implementation**: Your goal is to critically test the logic. Actively search for and point out failing edge cases, race conditions, or unhandled exceptions in the implementation.
+
+4. **Demand Robustness**: Do not accept fragile code. If the proposed code is not robust enough or lacks proper error handling, explicitly tell the author why the current approach is brittle and what must be done to reinforce it.
+
+5. **Respect Existing Repo Patterns**: Before suggesting review comments (like asking users to add boilerplate or specific patterns), actively check for existing design patterns across the repository. Do not suggest adding useless code or structures that contradict or fall outside the established Keras repo coding style.
+
+
+
+
 # Keras Remote API design guidelines
 
 These guidelines are meant to help focus design discussions and help us create delightful developer experiences for remote execution.
```

README.md

Lines changed: 96 additions & 8 deletions
```diff
@@ -72,9 +72,10 @@ This adds the `keras-remote up`, `keras-remote down`, `keras-remote status`, and
 - Python 3.11+
 - Google Cloud SDK (`gcloud`)
 - Run `gcloud auth login` and `gcloud auth application-default login`
-- [Pulumi CLI](https://www.pulumi.com/docs/install/) (required for `[cli]` install only)
 - A Google Cloud project with billing enabled
 
+Note: The Pulumi CLI is bundled and managed automatically. It will be installed to `~/.keras-remote/pulumi` on first use if not already present.
+
 ## Quick Start
 
 ### 1. Configure Google Cloud
```
```diff
@@ -203,15 +204,102 @@ def train():
 
 See [examples/Dockerfile.prebuilt](examples/Dockerfile.prebuilt) for a template.
 
+## Handling Data
+
+Keras Remote provides a declarative and performant Data API to seamlessly make your local and cloud data available to your remote functions.
+
+The Data API is designed to be read-only. It reliably delivers data to your pods at the start of a job. For saving model outputs or checkpointing, you should write directly to GCS from within your function.
+
+Under the hood, the Data API optimizes your workflows with two key features:
+
+- **Smart Caching:** Local data is content-hashed and uploaded to a cache bucket only once. Subsequent job runs that use byte-identical data will hit the cache and skip the upload entirely, drastically speeding up execution.
+- **Automatic Zip Exclusion:** When you reference a data path inside your current working directory, Keras Remote automatically excludes that directory from the project's zipped payload to avoid uploading the same data twice.
+
```
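The content-hash caching described above can be sketched as follows. This is an illustrative helper, not the actual keras_remote implementation — `content_cache_key` and its exact hashing scheme are assumptions; the real code may hash differently:

```python
import hashlib
from pathlib import Path


def content_cache_key(path):
    """Hash a file or a directory tree so that byte-identical data
    always produces the same cache key (illustrative sketch only)."""
    h = hashlib.sha256()
    root = Path(path)
    if root.is_file():
        files = [root]
    else:
        # Sort for determinism across filesystems.
        files = sorted(p for p in root.rglob("*") if p.is_file())
    for f in files:
        # Include the relative path so renames change the key.
        rel = f.name if root.is_file() else str(f.relative_to(root))
        h.update(rel.encode())
        h.update(f.read_bytes())
    return h.hexdigest()


# An upload step would then look up this key in the cache bucket
# and skip the upload entirely on a hit.
```

Because the key depends only on file paths and bytes, a re-run with unchanged data recomputes the same key and hits the cache.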
```diff
+There are three main ways to handle data depending on your workflow:
+
+### 1. Dynamic Data (The `Data` Class)
+
+The simplest and most Pythonic approach is to pass `Data` objects as regular function arguments. The `Data` class wraps a local file/directory path or a Google Cloud Storage (GCS) URI.
+
+On the remote pod, these objects are automatically resolved into plain string paths pointing to the downloaded files, meaning your function code never needs to know about GCS or cloud storage APIs.
+
+```python
+import pandas as pd
+import keras_remote
+from keras_remote import Data
+
+@keras_remote.run(accelerator="v6e-8")
+def train(data_dir):
+    # data_dir is resolved to a dynamic local path on the remote machine
+    df = pd.read_csv(f"{data_dir}/train.csv")
+    # ...
+
+# Uploads the local directory to the remote pod automatically
+train(Data("./my_dataset/"))
+
+# Cache hit: subsequent runs with the same data skip the upload!
+train(Data("./my_dataset/"))
+```
+
```
```diff
+**Note on GCS Directories:** When referencing a GCS directory with the `Data` class, you must include a trailing slash (e.g., `Data("gs://my-bucket/dataset/")`). If you omit the trailing slash, the system will treat it as a single file object.
+
+You can also pass multiple `Data` arguments, or nest them inside lists and dictionaries (e.g., `train(datasets=[Data("./d1"), Data("./d2")])`).
+
```
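Resolving `Data` objects nested inside lists and dictionaries amounts to a recursive structure walk. The sketch below illustrates the idea under stated assumptions — the `Data` stand-in, `resolve`, and the `download` callback are all hypothetical names, not keras_remote internals:

```python
from dataclasses import dataclass


@dataclass
class Data:
    """Stand-in for keras_remote.Data: wraps a path or GCS URI."""
    uri: str


def resolve(obj, download):
    """Recursively replace Data objects nested in lists/tuples/dicts
    with the local path returned by `download` (illustrative sketch)."""
    if isinstance(obj, Data):
        return download(obj.uri)
    if isinstance(obj, list):
        return [resolve(x, download) for x in obj]
    if isinstance(obj, tuple):
        return tuple(resolve(x, download) for x in obj)
    if isinstance(obj, dict):
        return {k: resolve(v, download) for k, v in obj.items()}
    # Plain values (e.g. GCS URI strings) pass through untouched.
    return obj


# Example: pretend every Data object was downloaded to /tmp/<name>
args = {"datasets": [Data("./d1"), Data("./d2")], "epochs": 10}
resolved = resolve(args, lambda uri: f"/tmp/{uri.strip('./')}")
# resolved["datasets"] is now a list of plain string paths
```

On the remote side, the function body then sees only strings, which is why it never needs to import a cloud storage client.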
```diff
+### 2. Static Data (The `volumes` Parameter)
+
+For established training scripts where data requirements are static, you can use the `volumes` parameter in the `@keras_remote.run` decorator. This mounts data at fixed, hardcoded absolute filesystem paths, allowing you to drop `keras_remote` into existing codebases without altering the function signature.
+
+```python
+import pandas as pd
+import keras_remote
+from keras_remote import Data
+
+@keras_remote.run(
+    accelerator="v6e-8",
+    volumes={
+        "/data": Data("./my_dataset/"),
+        "/weights": Data("gs://my-bucket/pretrained-weights/")
+    }
+)
+def train():
+    # Data is guaranteed to be available at these absolute paths
+    df = pd.read_csv("/data/train.csv")
+    model.load_weights("/weights/model.h5")
+    # ...
+
+# No data arguments needed!
+train()
+```
+
```
```diff
+### 3. Direct GCS Streaming (For Large Datasets)
+
+If your dataset is very large (e.g., > 10GB), it is inefficient to download the entire dataset to the remote pod's local disk. Instead, skip the `Data` wrapper entirely and pass a GCS URI string directly. You can then use frameworks with native GCS streaming support (like `tf.data` or `grain`) to read the data on the fly.
+
+```python
+import grain.python as grain
+import keras_remote
+
+@keras_remote.run(accelerator="v6e-8")
+def train(data_uri):
+    # Native GCS reading, no download overhead
+    data_source = grain.ArrayRecordDataSource(data_uri)
+    # ...
+
+# Pass as a plain string, no Data() wrapper needed
+train("gs://my-bucket/arrayrecords/")
+```
+
```
```diff
 ## Configuration
 
 ### Environment Variables
 
-| Variable               | Required | Default         | Description                        |
-| ---------------------- | -------- | --------------- | ---------------------------------- |
-| `KERAS_REMOTE_PROJECT` | Yes      |                 | Google Cloud project ID            |
-| `KERAS_REMOTE_ZONE`    | No       | `us-central1-a` | Default compute zone               |
-| `KERAS_REMOTE_CLUSTER` | No       |                 | GKE cluster name                   |
+| Variable               | Required | Default         | Description             |
+| ---------------------- | -------- | --------------- | ----------------------- |
+| `KERAS_REMOTE_PROJECT` | Yes      |                 | Google Cloud project ID |
+| `KERAS_REMOTE_ZONE`    | No       | `us-central1-a` | Default compute zone    |
+| `KERAS_REMOTE_CLUSTER` | No       |                 | GKE cluster name        |
 
 ### Decorator Parameters
 
```
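The defaults in the table above can be sketched as plain environment-variable lookups. This is a hedged illustration: the real helpers live in `keras_remote.constants` (`get_default_zone`, `get_default_cluster_name`) and may differ; the `GOOGLE_CLOUD_PROJECT` fallback is inferred from the error message in `execution.py` below:

```python
import os


def get_default_zone():
    # Falls back to the documented default zone.
    return os.environ.get("KERAS_REMOTE_ZONE", "us-central1-a")


def get_default_project():
    # KERAS_REMOTE_PROJECT is required; GOOGLE_CLOUD_PROJECT appears
    # to be accepted as a fallback per the library's error message.
    project = os.environ.get("KERAS_REMOTE_PROJECT") or os.environ.get(
        "GOOGLE_CLOUD_PROJECT"
    )
    if not project:
        raise ValueError(
            "project must be specified or set KERAS_REMOTE_PROJECT"
            " (or GOOGLE_CLOUD_PROJECT) environment variable"
        )
    return project
```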
```diff
@@ -345,10 +433,10 @@ keras-remote down
 
 This removes:
 
-- GKE cluster and accelerator node pools (via Pulumi)
+- GKE cluster and accelerator node pools
 - Artifact Registry repository and container images
 - Cloud Storage buckets (jobs and builds)
-Use `--yes` to skip the confirmation prompt.
+Use `--yes` to skip the confirmation prompt.
 
 ## Contributing
```

keras_remote/backend/execution.py

Lines changed: 13 additions & 3 deletions
```diff
@@ -16,7 +16,11 @@
 from google.api_core import exceptions as google_exceptions
 
 from keras_remote.backend import gke_client, pathways_client
-from keras_remote.constants import get_default_zone, zone_to_region
+from keras_remote.constants import (
+    get_default_cluster_name,
+    get_default_zone,
+    zone_to_region,
+)
 from keras_remote.credentials import ensure_credentials
 from keras_remote.data import _make_data_ref
 from keras_remote.infra import container_builder
@@ -39,6 +43,7 @@ class JobContext:
     container_image: Optional[str]
     zone: str
     project: str
+    cluster_name: str
 
     # Generated identifiers
     job_id: str = field(default_factory=lambda: f"job-{uuid.uuid4().hex[:8]}")
@@ -58,7 +63,7 @@ class JobContext:
     image_uri: Optional[str] = None
 
     def __post_init__(self):
-        self.bucket_name = f"{self.project}-keras-remote-jobs"
+        self.bucket_name = f"{self.project}-kr-{self.cluster_name}-jobs"
         self.region = zone_to_region(self.zone)
         self.display_name = f"keras-remote-{self.func.__name__}-{self.job_id}"
 
@@ -73,9 +78,10 @@ def from_params(
         zone: Optional[str],
         project: Optional[str],
         env_vars: dict,
+        cluster_name: Optional[str] = None,
         volumes: Optional[dict] = None,
     ) -> "JobContext":
-        """Factory method with default resolution for zone/project."""
+        """Factory method with default resolution for zone/project/cluster."""
         if not zone:
             zone = get_default_zone()
         if not project:
@@ -85,6 +91,8 @@ def from_params(
                 "project must be specified or set KERAS_REMOTE_PROJECT"
                 " (or GOOGLE_CLOUD_PROJECT) environment variable"
             )
+        if not cluster_name:
+            cluster_name = get_default_cluster_name()
 
         return cls(
             func=func,
@@ -95,6 +103,7 @@ def from_params(
             container_image=container_image,
             zone=zone,
             project=project,
+            cluster_name=cluster_name,
             volumes=volumes,
         )
 
@@ -303,6 +312,7 @@ def _build_container(ctx: JobContext) -> None:
         accelerator_type=ctx.accelerator,
         project=ctx.project,
         zone=ctx.zone,
+        cluster_name=ctx.cluster_name,
     )
 
```
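The derived-field change above can be condensed into a standalone sketch. `JobContextSketch` is a made-up minimal stand-in (the real `JobContext` has many more fields); a plausible motivation for shortening `keras-remote` to `kr` in the bucket name, which the commit does not state, is keeping `project-kr-cluster-jobs` under GCS's 63-character bucket-name limit now that the cluster name is included:

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class JobContextSketch:
    """Minimal stand-in for keras_remote's JobContext (illustrative)."""
    func_name: str
    project: str
    zone: str
    cluster_name: str
    job_id: str = field(default_factory=lambda: f"job-{uuid.uuid4().hex[:8]}")

    def __post_init__(self):
        # New scheme: the jobs bucket is scoped per cluster, so two
        # clusters in one project no longer share a bucket.
        self.bucket_name = f"{self.project}-kr-{self.cluster_name}-jobs"
        self.display_name = f"keras-remote-{self.func_name}-{self.job_id}"


ctx = JobContextSketch("my_train", "my-proj", "europe-west4-b", "my-cluster")
# ctx.bucket_name == "my-proj-kr-my-cluster-jobs"
```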
keras_remote/backend/execution_test.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -40,8 +40,9 @@ def test_post_init_derived_fields(self):
             container_image=None,
             zone="europe-west4-b",
             project="my-proj",
+            cluster_name="my-cluster",
         )
-        self.assertEqual(ctx.bucket_name, "my-proj-keras-remote-jobs")
+        self.assertEqual(ctx.bucket_name, "my-proj-kr-my-cluster-jobs")
         self.assertEqual(ctx.region, "europe-west4")
         self.assertTrue(ctx.display_name.startswith("keras-remote-my_train-"))
         self.assertRegex(ctx.job_id, r"^job-[0-9a-f]{8}$")
@@ -171,6 +172,7 @@ def _make_ctx(self, container_image=None):
             container_image=container_image,
             zone="us-central1-a",
             project="proj",
+            cluster_name="keras-remote-cluster",
         )
 
     def test_success_flow(self):
```

keras_remote/backend/gke_client.py

Lines changed: 19 additions & 3 deletions
```diff
@@ -437,22 +437,38 @@ def _check_node_pool_exists_cached(selector_items) -> bool:
        pool_labels = config_dict.get("labels", {}).copy()

        # Map GKE injected node labels for accelerators mapping
-        accelerators = config_dict.get("accelerators", [])
-        if accelerators:
-            accel_type = accelerators[0].get("acceleratorType", "")
+        accel_config_list = config_dict.get("accelerators", [])
+        if accel_config_list:
+            accel_type = accel_config_list[0].get("acceleratorType", "")
             if accel_type.startswith("tpu-"):
                 pool_labels["cloud.google.com/gke-tpu-accelerator"] = accel_type
             else:
                 pool_labels["cloud.google.com/gke-accelerator"] = accel_type

         # TPU mapping fallback
         machine_type = config_dict.get("machineType", "")
+
+        # Check resource labels for TPU type (common in v5e/v5litepod)
+        resource_labels = config_dict.get("resourceLabels", {})
+        if "goog-gke-accelerator-type" in resource_labels:
+            pool_labels["cloud.google.com/gke-tpu-accelerator"] = resource_labels[
+                "goog-gke-accelerator-type"
+            ]
+
         if machine_type.startswith("ct"):
             # We roughly map TPU topology presence for preflight
             pool_labels["cloud.google.com/gke-tpu-topology"] = selector.get(
                 "cloud.google.com/gke-tpu-topology", ""
             )

+        # Infer accelerator count from machine type using registry
+        # This is robust because it uses the same source of truth as the Pod spec generation
+        for tpu_spec in accelerators.TPUS.values():
+            for chips, topo_spec in tpu_spec.topologies.items():
+                if topo_spec.machine_type == machine_type:
+                    pool_labels["cloud.google.com/gke-accelerator-count"] = str(chips)
+                    break
+
         if all(pool_labels.get(k) == str(v) for k, v in selector.items()):
             return True
     return False
```
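The registry lookup added above is a reverse mapping from machine type to chip count. A self-contained sketch of the idea follows — the `TPUSpec`/`TopologySpec` names and the registry contents are assumptions for illustration, not the actual `keras_remote` `accelerators` module schema:

```python
from dataclasses import dataclass


@dataclass
class TopologySpec:
    machine_type: str


@dataclass
class TPUSpec:
    # chips-per-host -> topology spec
    topologies: dict


# Hypothetical registry entry; the real accelerators.TPUS may differ.
TPUS = {
    "v5e": TPUSpec({
        4: TopologySpec("ct5lp-hightpu-4t"),
        8: TopologySpec("ct5lp-hightpu-8t"),
    }),
}


def infer_chip_count(machine_type):
    """Walk the registry and return the chip count whose topology uses
    this machine type, or None if the machine type is unknown."""
    for tpu_spec in TPUS.values():
        for chips, topo in tpu_spec.topologies.items():
            if topo.machine_type == machine_type:
                return chips
    return None
```

Using one registry for both preflight label inference and Pod spec generation keeps the two from drifting apart, which is the robustness argument the inline comment makes.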

keras_remote/backend/log_streaming.py

Lines changed: 19 additions & 49 deletions
```diff
@@ -5,16 +5,14 @@
 job execution.
 """
 
-import sys
 import threading
-from collections import deque
 
 import urllib3
 from absl import logging
 from kubernetes.client.rest import ApiException
 from rich.console import Console
-from rich.live import Live
-from rich.panel import Panel
+
+from keras_remote.cli.output import LiveOutputPanel
 
 _MAX_DISPLAY_LINES = 25
 
@@ -27,14 +25,13 @@ def _stream_pod_logs(core_v1, pod_name, namespace):
 
     In interactive terminals, logs are displayed in a Rich Live panel.
     In non-interactive contexts (piped output, CI), logs are streamed
-    as raw lines with Rich Rule delimiters.
+    as plain lines with Rule delimiters.
 
     Args:
         core_v1: Kubernetes CoreV1Api client.
         pod_name: Name of the pod to stream logs from.
         namespace: Kubernetes namespace.
     """
-    console = Console()
     resp = None
     try:
         resp = core_v1.read_namespaced_pod_log(
@@ -43,10 +40,22 @@ def _stream_pod_logs(core_v1, pod_name, namespace):
             follow=True,
             _preload_content=False,
         )
-        if console.is_terminal:
-            _render_live_panel(resp, pod_name, console)
-        else:
-            _render_plain(resp, pod_name, console)
+        title = f"Remote logs • {pod_name}"
+        with LiveOutputPanel(
+            title,
+            max_lines=_MAX_DISPLAY_LINES,
+            target_console=Console(),
+            show_subtitle=False,
+        ) as panel:
+            buffer = ""
+            for chunk in resp.stream(decode_content=True):
+                buffer += chunk.decode("utf-8", errors="replace")
+                while "\n" in buffer:
+                    line, buffer = buffer.split("\n", 1)
+                    panel.on_output(line)
+            # Flush remaining partial line
+            if buffer.strip():
+                panel.on_output(buffer)
     except ApiException:
         pass  # Pod deleted or not found
     except urllib3.exceptions.ProtocolError:
@@ -60,45 +69,6 @@ def _stream_pod_logs(core_v1, pod_name, namespace):
             resp.release_conn()
 
 
-def _render_live_panel(resp, pod_name, console):
-    """Render streaming logs inside a Rich Live panel."""
-    lines = deque(maxlen=_MAX_DISPLAY_LINES)
-    title = f"Remote logs • {pod_name}"
-    buffer = ""
-
-    with Live(
-        _make_log_panel(lines, title),
-        console=console,
-        refresh_per_second=4,
-    ) as live:
-        for chunk in resp.stream(decode_content=True):
-            buffer += chunk.decode("utf-8", errors="replace")
-            while "\n" in buffer:
-                line, buffer = buffer.split("\n", 1)
-                lines.append(line)
-                live.update(_make_log_panel(lines, title))
-
-        # Flush remaining partial line
-        if buffer.strip():
-            lines.append(buffer)
-            live.update(_make_log_panel(lines, title))
-
-
-def _render_plain(resp, pod_name, console):
-    """Render streaming logs as raw lines with Rule delimiters."""
-    console.rule(f"Remote logs ({pod_name})", style="blue")
-    for chunk in resp.stream(decode_content=True):
-        sys.stdout.write(chunk.decode("utf-8", errors="replace"))
-        sys.stdout.flush()
-    console.rule("End remote logs", style="blue")
-
-
-def _make_log_panel(lines, title):
-    """Build a Panel renderable from accumulated log lines."""
-    content = "\n".join(lines) if lines else "Waiting for output..."
-    return Panel(content, title=title, border_style="blue")
-
-
 class LogStreamer:
     """Context manager that owns the log-streaming thread lifecycle.
```
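The buffering loop that both the old and new code share — accumulate raw chunks, emit complete lines, carry the partial line across chunk boundaries — can be isolated into a small generator (a standalone sketch, not code from the commit):

```python
def iter_lines(chunks):
    """Split a stream of byte chunks into text lines, carrying partial
    lines across chunk boundaries, mirroring the buffering loop in
    _stream_pod_logs above."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk.decode("utf-8", errors="replace")
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            yield line
    # Flush the remaining partial line, if any.
    if buffer.strip():
        yield buffer


chunks = [b"hel", b"lo\nwor", b"ld\npart"]
# list(iter_lines(chunks)) -> ["hello", "world", "part"]
```

Carrying the remainder in `buffer` matters because a Kubernetes log stream delivers chunks at arbitrary byte boundaries, so a single log line can easily be split across two chunks.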