Kubernetes Infrastructure for Netclode

Overview

This directory contains Kubernetes manifests for deploying the Netclode agent sandbox infrastructure.

Kubeconfig Setup

IMPORTANT: Always use explicit contexts to avoid deploying to the wrong cluster.

1. Copy k3s config from the netclode host

# On the netclode host (e.g., via SSH); k3s makes this file readable by root only by default
sudo cat /etc/rancher/k3s/k3s.yaml

# Copy the output and save it locally as ~/.kube/netclode.yaml, replacing the server address
# Change: server: https://127.0.0.1:6443
# To:     server: https://<netclode-host>:6443
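The copy-and-edit step can also be scripted. The following is a minimal sketch, assuming the kubeconfig still contains the stock k3s server line; the function name rewrite_k3s_server is hypothetical and not part of this repo.

```shell
# Hypothetical helper (not part of this repo): read a k3s kubeconfig on stdin
# and point its server entry at the given host instead of 127.0.0.1.
rewrite_k3s_server() {
  host="$1"
  # '#' is used as the sed delimiter so the slashes in the URL need no escaping.
  sed "s#server: https://127\.0\.0\.1:6443#server: https://${host}:6443#"
}

# Example (needs SSH access and sudo on the host):
#   ssh <netclode-host> sudo cat /etc/rancher/k3s/k3s.yaml \
#     | rewrite_k3s_server <netclode-host> > ~/.kube/netclode.yaml
```

Lines that do not match the stock server entry pass through unchanged, so running it on an already-edited file is harmless.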

2. Configure kubeconfig with named contexts

# Backup existing config
cp ~/.kube/config ~/.kube/config.backup

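# Create a merged config with explicit contexts
# Option A: Use KUBECONFIG to merge (set for this one command only)
# Note: k3s names its cluster, user, and context "default". If ~/.kube/config
# already uses those names, rename them in netclode.yaml first; on a name
# clash the entry from the first file listed wins.
KUBECONFIG=~/.kube/config:~/.kube/netclode.yaml kubectl config view --flatten > ~/.kube/config.merged
mv ~/.kube/config.merged ~/.kube/config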

# Option B: Manually add the netclode context
kubectl config set-cluster netclode --server=https://<netclode-host>:6443 --certificate-authority=...
kubectl config set-credentials netclode-admin --client-certificate=... --client-key=...
kubectl config set-context netclode --cluster=netclode --user=netclode-admin

3. Rename your current context to 'silo' (or another appropriate name)

# Check current context name
kubectl config current-context

# Rename it (e.g., if it's "default")
kubectl config rename-context default silo

4. Disable default context (require explicit --context)

# Unset the current context - kubectl will error without --context flag
kubectl config unset current-context

5. Usage

# Always specify context explicitly
kubectl --context=netclode get pods -n netclode
kubectl --context=silo get pods

# Or define an alias for the current shell session only
# (kubectl does not read a context from an environment variable)
alias kn='kubectl --context=netclode'
kn get pods -n netclode  # runs against the netclode context
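To enforce the explicit-context rule in scripts, a small wrapper can refuse any call that lacks the flag. This is a sketch only; has_explicit_context and kubectl_guarded are hypothetical names, not something this repo ships.

```shell
# Returns 0 if the argument list contains an explicit --context flag,
# in either the "--context=name" or the "--context name" form.
has_explicit_context() {
  for arg in "$@"; do
    case "$arg" in
      --context|--context=*) return 0 ;;
    esac
  done
  return 1
}

# Hypothetical wrapper: run kubectl only when a context was named explicitly.
kubectl_guarded() {
  if ! has_explicit_context "$@"; then
    echo "refusing to run kubectl without --context" >&2
    return 1
  fi
  command kubectl "$@"
}
```

For example, kubectl_guarded get pods fails immediately, while kubectl_guarded --context=netclode get pods -n netclode is passed through to kubectl.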

Components

Agent Sandbox Controller

The agent-sandbox-controller manages Sandbox, SandboxClaim, SandboxTemplate, and SandboxWarmPool CRDs.

Files:

extensions.controller.yaml - StatefulSet for the controller
extensions.yaml - ClusterRoleBindings
extensions-rbac.generated.yaml - ClusterRole for extensions controller
rbac.generated.yaml - ClusterRole for core controller

Custom Image: We use a custom-built controller image (ghcr.io/angristan/agent-sandbox-controller:volumeclaim-v7) that includes:

volumeClaimTemplates support for SandboxTemplate
Fix for PVC explosion bug in warm pools (see below)
PVC adoption: when SandboxClaim adopts a warm pool pod, it also adopts its PVCs

Warm Pool

SandboxWarmPool keeps pre-warmed pods with JuiceFS PVCs ready for instant allocation.

Files:

sandbox-warmpool.yaml - SandboxWarmPool resource
sandbox-template.yaml - SandboxTemplate with volumeClaimTemplates

Control-Plane Configuration:

To enable warm pool allocation in the control-plane, set the environment variable:

env:
  - name: WARM_POOL_ENABLED
    value: "true"

When enabled, the control-plane creates SandboxClaim resources instead of creating Sandbox resources directly. The controller then assigns a pre-warmed pod from the pool, or creates a new one if the pool is empty.

Session Assignment:

Since warm pool pods are already running, they cannot receive per-session environment variables dynamically. Instead, agents connect via gRPC and authenticate using their Kubernetes ServiceAccount token:

1. Agent reads its SA token from /var/run/secrets/kubernetes.io/serviceaccount/token
2. Agent connects to control-plane via gRPC, sending the token in its registration message
3. Control-plane validates the token via the Kubernetes TokenReview API and extracts the verified pod name
4. When a SandboxClaim binds to this pod, control-plane pushes a SessionAssigned message

This prevents rogue agents from impersonating legitimate pods: identity is cryptographically verified by the API server, not self-reported by the agent.
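The validation step can be reproduced by hand when debugging registration. The sketch below only builds the TokenReview request body; the function name token_review_manifest is hypothetical, and the control-plane itself presumably calls the API from its own code rather than through kubectl.

```shell
# Hypothetical helper: print a TokenReview manifest for the given token.
# TokenReview lives in the authentication.k8s.io/v1 API group.
token_review_manifest() {
  token="$1"
  cat <<EOF
apiVersion: authentication.k8s.io/v1
kind: TokenReview
spec:
  token: "${token}"
EOF
}

# Example (the caller needs permission to create tokenreviews):
#   token_review_manifest "<sa-token>" \
#     | kubectl --context=netclode create -f - -o jsonpath='{.status}'
```

For a valid token the returned status reports authenticated: true along with the ServiceAccount username. For tokens bound to a pod, the user's extra fields typically include authentication.kubernetes.io/pod-name, which is the value the control-plane needs.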

Storage

Files:

storage.yaml - JuiceFS StorageClass

Requirements:

JuiceFS CSI driver must be installed: helm install juicefs-csi juicefs/juicefs-csi-driver -n kube-system
juicefs-secret must exist in the netclode namespace with a valid Redis metadata URL

Runtime

Files:

runtime-class.yaml - RuntimeClass for Kata Containers (kata-clh)

Deployment Order

Always use --context=netclode to ensure you're deploying to the correct cluster.

# Set context for all commands (or add --context=netclode to each)
CTX="--context=netclode"

# 1. Create namespaces
kubectl $CTX apply -f namespace.yaml

# 2. Install CRDs
kubectl $CTX apply -f agents.x-k8s.io_sandboxes.yaml
kubectl $CTX apply -f extensions.agents.x-k8s.io_sandboxclaims.yaml
kubectl $CTX apply -f extensions.agents.x-k8s.io_sandboxtemplates.yaml
kubectl $CTX apply -f extensions.agents.x-k8s.io_sandboxwarmpools.yaml

# 3. Install RBAC
kubectl $CTX apply -f rbac.generated.yaml
kubectl $CTX apply -f extensions-rbac.generated.yaml
kubectl $CTX apply -f extensions.yaml

# 4. Install controller
kubectl $CTX apply -f extensions.controller.yaml

# 5. Install runtime and storage prerequisites
kubectl $CTX apply -f runtime-class.yaml
kubectl $CTX apply -f storage.yaml
kubectl $CTX apply -f juicefs-config.yaml
kubectl $CTX rollout restart statefulset juicefs-csi-controller -n kube-system

# 6. Deploy sandbox template and warm pool
kubectl $CTX apply -f sandbox-template.yaml
kubectl $CTX apply -f sandbox-warmpool.yaml
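Because later steps depend on earlier ones, a scripted version should stop at the first failure. A minimal sketch follows; apply_in_order and the KUBECTL override variable are hypothetical and exist only to make the loop easy to dry-run.

```shell
# KUBECTL can be overridden (e.g. with a stub) for dry runs; defaults to kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: apply manifests one by one against an explicit context,
# aborting as soon as one apply fails so later steps never run half-configured.
apply_in_order() {
  ctx="$1"
  shift
  for manifest in "$@"; do
    "$KUBECTL" --context="$ctx" apply -f "$manifest" || return 1
  done
}

# Example, covering steps 1 and 2 above:
#   apply_in_order netclode namespace.yaml agents.x-k8s.io_sandboxes.yaml
# Optionally wait for a CRD before creating resources of that kind:
#   kubectl --context=netclode wait --for=condition=Established \
#     crd/sandboxes.agents.x-k8s.io --timeout=60s
```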

Cleanup

To remove all netclode k8s resources:

CTX="--context=netclode"

# Delete namespace (removes controller, serviceaccount, etc.)
kubectl $CTX delete ns agent-sandbox-system

# Delete CRDs
kubectl $CTX delete crd sandboxclaims.extensions.agents.x-k8s.io
kubectl $CTX delete crd sandboxes.agents.x-k8s.io
kubectl $CTX delete crd sandboxtemplates.extensions.agents.x-k8s.io
kubectl $CTX delete crd sandboxwarmpools.extensions.agents.x-k8s.io

# Delete ClusterRoles and ClusterRoleBindings
kubectl $CTX delete clusterrolebinding agent-sandbox-controller agent-sandbox-controller-extensions
kubectl $CTX delete clusterrole agent-sandbox-controller agent-sandbox-controller-extensions

# Delete RuntimeClass and StorageClass
kubectl $CTX delete runtimeclass kata-clh
kubectl $CTX delete sc juicefs-sc

# Delete any orphaned PVs (if PVC explosion occurred)
# xargs -r skips the delete entirely when no PV is in the Released state
kubectl $CTX get pv --no-headers | grep Released | awk '{print $1}' | xargs -r kubectl $CTX delete pv
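A plain grep for Released matches the word anywhere on the line, including inside a PV or claim name. A stricter filter compares only the STATUS column. The sketch below assumes the default kubectl get pv column layout, where STATUS is the fifth field; released_pv_names is a hypothetical name.

```shell
# Hypothetical helper: read `kubectl get pv --no-headers` output on stdin and
# print the names of PVs whose STATUS column (assumed to be field 5) is Released.
released_pv_names() {
  awk '$5 == "Released" { print $1 }'
}

# Example:
#   kubectl --context=netclode get pv --no-headers | released_pv_names
```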

Known Issues and Learnings

PVC Preservation During Pause (Session Anchors)

Problem: When a session is paused, the Sandbox CR is deleted. PVCs created via volumeClaimTemplates have an ownerReference to the Sandbox, so Kubernetes garbage-collects them, causing data loss.

Solution: The control-plane creates a "session anchor" ConfigMap that acts as a second owner of the PVC. With two owners (Sandbox + ConfigMap), the PVC survives when the Sandbox is deleted during pause. The PVC only gets GC'd when both owners are deleted.

How it works:

When a session is created, control-plane creates ConfigMap/session-anchor-<sessionID>
The ConfigMap is added as a non-controller ownerReference on the PVC
When paused: Sandbox deleted → PVC survives (anchor still owns it)
When resumed: New Sandbox created with same PVC
When deleted: Anchor deleted → PVC explicitly deleted
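For debugging, the second owner can also be attached by hand. The sketch below builds a JSON Patch that appends a non-controller ownerReference; anchor_owner_patch is a hypothetical name, and the control-plane presumably performs the equivalent update through the API in its own code.

```shell
# Hypothetical helper: print a JSON Patch that appends the session anchor
# ConfigMap as an additional owner. No "controller" field is set, so the
# reference is a non-controller owner. The "/-" path appends to the existing
# ownerReferences array, which already holds the Sandbox reference.
anchor_owner_patch() {
  session_id="$1"
  configmap_uid="$2"
  printf '[{"op":"add","path":"/metadata/ownerReferences/-","value":{"apiVersion":"v1","kind":"ConfigMap","name":"session-anchor-%s","uid":"%s"}}]\n' \
    "$session_id" "$configmap_uid"
}

# Example (the uid must be the ConfigMap's real metadata.uid):
#   uid=$(kubectl --context=netclode -n netclode get configmap \
#     session-anchor-<sessionID> -o jsonpath='{.metadata.uid}')
#   kubectl --context=netclode -n netclode patch pvc <pvc-name> --type=json \
#     -p "$(anchor_owner_patch <sessionID> "$uid")"
```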

RBAC Requirements: The control-plane service account needs permissions for ConfigMaps and PVC updates:

# In namespace.yaml Role sandbox-manager
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "watch", "create", "update", "delete"]
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]

Verifying anchors:

# List session anchors
kubectl --context netclode -n netclode get configmap -l netclode.dev/component=session-anchor

# Check PVC ownership (should show both Sandbox and ConfigMap)
kubectl --context netclode -n netclode get pvc <pvc-name> -o jsonpath='{.metadata.ownerReferences}' | jq

PVC Explosion Bug (Fixed in volumeclaim-v6)

Problem: The warm pool controller was creating thousands of PVCs in a loop.

Root Cause: The controller watches PVCs it owns (Owns(&corev1.PersistentVolumeClaim{})). When a PVC is created:

1. PVC creation triggers a reconcile
2. The reconcile runs before the pod is created
3. The reconcile sees 0 pods and concludes it needs to create one
4. It creates a NEW PVC (with a new random suffix) and a pod
5. The new PVC triggers another reconcile, and the loop repeats indefinitely

Fix: Before creating new pods, count owned PVCs and compare to pod count. If ownedPVCs > currentPods, a creation is in progress - skip creating more.

// If there are more PVCs than pods, a creation is in progress
creationInProgress := ownedPVCs > currentReplicas
if currentReplicas < desiredReplicas && !creationInProgress {
    // Safe to create new pods
}
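The guard is easiest to follow with concrete numbers. Below is a shell transcription of the same condition, useful for checking a cluster's counts by hand; should_create_pod is a hypothetical name and the arguments are owned PVCs, current pods, and desired pods, in that order.

```shell
# Hypothetical transcription of the controller's guard.
# Succeeds (exit 0) only when it is safe to create another pod.
should_create_pod() {
  owned_pvcs="$1"
  current="$2"
  desired="$3"
  # More PVCs than pods means a creation is still in progress: skip.
  if [ "$owned_pvcs" -gt "$current" ]; then
    return 1
  fi
  [ "$current" -lt "$desired" ]
}

# should_create_pod 0 0 2  -> succeeds: nothing in flight, pool below target
# should_create_pod 1 0 2  -> fails: one PVC exists whose pod is not up yet
# should_create_pod 2 2 2  -> fails: pool already at the desired size
```

The second case is the one that breaks the loop. It is also why a PVC left behind without a pod keeps the controller from creating new pods until the PVC is removed.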

JuiceFS Configuration

The JuiceFS secret must have a valid metaurl pointing to an accessible Redis server. redis://localhost:6379 will NOT work from inside pods.

Example working secret:

apiVersion: v1
kind: Secret
metadata:
  name: juicefs-secret
  namespace: netclode
stringData:
  name: netclode-vol
  metaurl: redis://<redis-host>:6379/0
  storage: s3
  bucket: <bucket-url>
  access-key: <access-key>
  secret-key: <secret-key>

JuiceFS cache size

JuiceFS mounts use the upstream defaults unless we override them. The client read cache size defaults to --cache-size=102400 (in MiB, i.e. 100 GiB). The client also respects --free-space-ratio=0.1, so the cache shrinks when disk space is tight. To change the size for all PVCs, add mountOptions in the CSI driver ConfigMap (infra/k8s/juicefs-config.yaml) and restart the CSI controller so that new mount pods pick it up:

data:
  config.yaml: |
    mountOptions:
      - cache-size=204800
      - free-space-ratio=0.1
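Since cache-size is given in MiB, it helps to convert before comparing it with the node's disk. A trivial sketch; mib_to_gib is a hypothetical helper name.

```shell
# Hypothetical helper: convert a JuiceFS cache-size value (MiB) to whole GiB.
mib_to_gib() {
  echo $(( $1 / 1024 ))
}

# mib_to_gib 102400  -> 100 (the upstream default)
# mib_to_gib 204800  -> 200 (the override shown above)
```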

Kata Containers

The sandbox pods use runtimeClassName: kata-clh for VM-level isolation. Ensure Kata Containers is installed on the cluster nodes.

Warm Pool Pods Stuck in Pending

If warm pool pods stay in Pending state, check:

1. Insufficient CPU - Check kubectl describe node for allocated resources
   - JuiceFS mount pods request 1 CPU by default (see storage.yaml for fix)
   - On small nodes, scale down coredns: kubectl scale deployment coredns -n kube-system --replicas=2
2. Unbound PVCs - The scheduler may fail before the PVC is bound
   - Check PVC status: kubectl get pvc -n netclode
   - If the PVC is Bound but the pod is Pending, delete the pod to trigger a reschedule
3. Orphaned PVCs blocking controller - If ownedPVCs > currentPods, the controller thinks a creation is in progress
   - Check controller logs: kubectl logs -n agent-sandbox-system agent-sandbox-controller-0
   - Delete orphaned PVCs: kubectl delete pvc -n netclode -l agents.x-k8s.io/pool (the selector matches every pool PVC, not only orphaned ones)
4. JuiceFS delvol jobs consuming CPU - When PVCs are deleted, JuiceFS creates cleanup jobs
   - These jobs request 1 CPU each and can exhaust node resources
   - Clean up stuck jobs: kubectl get jobs -n kube-system -o name | grep delvol | xargs kubectl delete -n kube-system

Small Node Configuration (2 CPU)

On a 2-CPU node, resource management is critical:

CTX="--context=netclode"

# Scale down coredns (default 5 replicas is too many)
kubectl $CTX scale deployment coredns -n kube-system --replicas=2

# Configure JuiceFS mount pods to use less CPU (see storage.yaml for details)
# Default is 1 CPU per mount pod - with multiple PVCs this exhausts the node

Warm Pool Troubleshooting

Warm pool not being used:

Verify WARM_POOL_ENABLED=true is set in control-plane deployment
Check control-plane logs for "warmPool=true" at startup
Verify SandboxWarmPool has ready replicas: kubectl get sandboxwarmpool -n netclode

Claims not binding:

Check SandboxClaim status: kubectl get sandboxclaim -n netclode
Check controller logs: kubectl logs -n agent-sandbox-system agent-sandbox-controller-0
Verify warm pool has available pods: kubectl get pods -n netclode -l agents.x-k8s.io/pool

Agent not connecting:

Verify agent can reach control-plane: curl http://control-plane.netclode.svc.cluster.local:3000/health

Control Plane Exposure

The control plane is exposed via Tailscale Ingress with HTTPS and automatic Let's Encrypt certificates.

Architecture

The iOS app uses URLSession, which only supports HTTP/2 over HTTPS. HTTP/2 is required for bidirectional streaming (Connect protocol). The setup uses:

Tailscale Ingress (control-plane-ingress.yaml) - Exposes the control-plane on the tailnet
Custom proxy image (ghcr.io/angristan/tailscale:connect-fix) - Adds h2c support for Connect protocol content types

The custom proxy image patches Tailscale's reverse proxy to enable h2c (HTTP/2 cleartext) for Connect RPC content types (application/connect+proto, application/connect+json), not just gRPC.

Configuration

The Tailscale operator is deployed via Ansible and configured in infra/ansible/roles/tailscale-operator/. It uses OAuth credentials from /var/secrets/ts-oauth-client-id and /var/secrets/ts-oauth-client-secret.

Accessing the Control Plane

After deployment, the control plane will be available at: https://netclode-control-plane-ingress.YOUR-TAILNET.ts.net

Finding Your Tailnet Name

# From any machine with Tailscale installed
tailscale status

# Or check the Tailscale admin console
# https://login.tailscale.com/admin/machines
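Once the tailnet's DNS suffix is known, the control plane URL follows from the hostname pattern used in this README. A small sketch; control_plane_url is a hypothetical helper, and reading the suffix from the MagicDNSSuffix field of the JSON status output is an assumption about the Tailscale CLI that should be checked against your installed version.

```shell
# Hypothetical helper: build the control plane URL from a tailnet DNS suffix
# such as "YOUR-TAILNET.ts.net".
control_plane_url() {
  printf 'https://netclode-control-plane-ingress.%s\n' "$1"
}

# Example (assumes jq is installed and the status JSON exposes MagicDNSSuffix):
#   control_plane_url "$(tailscale status --json | jq -r '.MagicDNSSuffix')"
```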

Verifying the Connection

# Test HTTPS endpoint
curl -v https://netclode-control-plane-ingress.YOUR-TAILNET.ts.net/health

# Check the ingress status
kubectl --context netclode -n netclode describe ingress control-plane
