A Kubernetes integration that monitors jobs and reports job lifecycle events and network resource allocation to Arista CloudVision to enable job-aware network troubleshooting.
```mermaid
graph TD
    %% CloudVision at top
    cv[CloudVision]

    %% Main components
    informer[cv-job-informer<br/>deployment]
    discovery[cv-interface-discovery<br/>daemonset<br/>optional]

    %% Kubernetes resources
    jobs[Job CRDs<br/>dynamically discovered]
    pods[Pods<br/>with network annotations]

    %% Node state CRs - two modes
    node_states[NodeInterfaceState CRs<br/>mode: discovery]
    sriov_states[SriovNetworkNodeState CRs<br/>mode: sriovoperator]

    %% Optional operator
    sriov_op[SR-IOV Network Operator<br/>optional alternative]

    %% CNI plugins
    multus[Multus CNI]

    %% Flows - Job monitoring
    jobs -->|watch| informer
    pods -->|watch| informer
    multus -.->|annotates| pods

    %% Flows - NodeConfig mode: discovery
    discovery -.->|creates| node_states
    node_states -->|watch| informer

    %% Flows - NodeConfig mode: sriovoperator
    sriov_op -.->|creates| sriov_states
    sriov_states -->|watch| informer

    %% Flows - to CloudVision
    informer -->|JobConfig API<br/>job events| cv
    informer -->|NodeConfig API<br/>node inventory| cv

    %% Styling
    style cv fill:#fff4e6,stroke:#ff9800,stroke-width:3px
    style informer fill:#e1f5ff,stroke:#0066cc,stroke-width:2px
    style discovery fill:#e1f5ff,stroke:#0066cc,stroke-width:2px,stroke-dasharray: 5 5
    style jobs fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style pods fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style node_states fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style sriov_states fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
    style multus fill:#f3e5f5,stroke:#9c27b0
    style sriov_op fill:#fff3e0,stroke:#ff6f00,stroke-dasharray: 5 5
```
The integration consists of two major components:

**cv-job-informer** (Deployment):
- Watches for Jobs: Uses the Kubernetes informer pattern with dynamic resource discovery based on pods' ownerReferences to watch any job type (TrainJob, PyTorchJob, MPIJob, etc.) in real time
- Tracks Lifecycle: Detects when jobs start and finish (or fail)
- Extracts Network Info: Reads secondary network interface details (MAC addresses) from Multus CNI annotations on job pods. Only secondary interfaces are reported because the primary interface (eth0) does not carry RDMA traffic in HPC jobs.
- Reports Job Events to CloudVision: Sends job lifecycle changes to the JobConfig API with job metadata and network information
- Reports Node Interface Inventory to CloudVision: Watches NodeInterfaceState CRs (from the cv-interface-discovery DaemonSet) or SriovNetworkNodeState CRs (from the SR-IOV Network Operator) and sends node-level interface inventory to the NodeConfig API

**cv-interface-discovery** (DaemonSet, optional):
- Discovers Node Interfaces: Each DaemonSet pod discovers all physical network interfaces (SR-IOV and non-SR-IOV) on its node and creates/updates NodeInterfaceState custom resources
- Job ID and name (extracted from the owner reference of pods)
- Location (configurable via `LOCATION` to differentiate between multiple clusters)
- Job start/finish timestamps
- Job state (RUNNING, COMPLETED, FAILED, CANCELLED)
- Node names OR interface MAC addresses for jobs (configurable via `JOBCONFIG_MODE`)
- Node-level interface inventory (interface name, IP and MAC addresses)
## JobConfig Mode - Resource Allocation Reporting
The service extracts resource allocation info from pods and sends it to CloudVision API. Choose the mode based on your cluster setup:
| Mode | When to Use | What Gets Sent | Requirements |
|---|---|---|---|
| `interface` (default) | Nodes are shared between multiple jobs (each job uses specific interfaces) | List of interface MAC addresses from the Multus CNI `network-status` annotation | Multus CNI (or equivalent) to attach secondary network interfaces |
| `node` | Each node is exclusively used by a single job (all interfaces on the node belong to the job) | List of node names from pod `spec.nodeName` | None - works with any Kubernetes cluster |
How CloudVision Uses This Data:
- Interface mode: Learns exact switch interfaces used by the job via MAC address correlation
- Node mode: Assumes all switch interfaces connected to the node are used by the job (learned via LLDP from nodes)
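The two modes can be sketched as a small selection function over the Pod object. This is a hypothetical sketch (the function name is ours); the field paths are the standard Pod `spec.nodeName` and the Multus `k8s.v1.cni.cncf.io/network-status` annotation:

```python
import json

NETWORK_STATUS_ANNOTATION = "k8s.v1.cni.cncf.io/network-status"

def extract_allocation(pod: dict, mode: str = "interface") -> list:
    """Return the allocation to report for one pod, per JOBCONFIG_MODE.

    mode="node":      the node name from pod spec.nodeName
    mode="interface": MAC addresses of secondary (non-default) interfaces
                      from the Multus network-status annotation
    """
    if mode == "node":
        node = pod.get("spec", {}).get("nodeName")
        return [node] if node else []

    annotation = pod.get("metadata", {}).get("annotations", {}).get(
        NETWORK_STATUS_ANNOTATION, "[]")
    macs = []
    for status in json.loads(annotation):
        # Skip the primary (default/eth0) interface: it is not used for RDMA.
        if status.get("default"):
            continue
        if "mac" in status:
            macs.append(status["mac"])
    return macs
```

In `interface` mode the entry marked `"default": true` (the primary eth0 interface) is skipped, matching the behavior described above.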
## NodeConfig Modes - Network Interface Discovery
The CV Job Informer reports network interface inventory from each node to CloudVision for better network correlation. Choose the mode based on your cluster setup:
| Mode | Description | What Gets Deployed |
|---|---|---|
| `discovery` (default) | Built-in interface discovery; discovers all physical interfaces (SR-IOV and non-SR-IOV) | NodeInterfaceState CRD + cv-interface-discovery DaemonSet |
| `sriovoperator` | Use an existing SR-IOV Network Operator (SR-IOV interfaces only) | Nothing (watches existing SriovNetworkNodeState CRs) |
| `disabled` | No automatic discovery; you must call the NodeConfig API separately with node interface inventory | Nothing |
## Requirements

- Kubernetes cluster >= 1.20
- Job Operator - Any job operator that creates pods with ownerReferences
- Multus CNI - CNI meta-plugin that attaches secondary network interfaces to pods for high-speed RDMA communications
  - Works with any RDMA NIC resource allocation mechanism: SR-IOV Device Plugin, DRA (Dynamic Resource Allocation), RDMA Shared Device Plugin with MACVLAN, etc.
  - Required only when `JOBCONFIG_MODE=interface` for reporting interface-level resource allocation
  - Not required when `JOBCONFIG_MODE=node` (reports node-level allocation instead)
Note: To obtain the `API_SERVER` and `API_TOKEN` for the CloudVision API, refer to the CloudVision API Guide.
All components are deployed to the `cloudvision` namespace by default. The deployment script automatically creates the namespace if it doesn't exist.
Deployment requires a Docker registry to distribute the image to cluster nodes. You can either use a pre-built public image or build and push your own.
Use the pre-built image from GitHub Container Registry (no build required):
```sh
make deploy \
  API_SERVER=www.arista.io \
  API_TOKEN=your-token-here \
  LOCATION=testlab \
  REGISTRY=ghcr.io/aristanetworks \
  IMAGE_TAG=latest \
  SKIP_BUILD=true
```

For production use or customization, build and push to your own registry.
Step 1: Authenticate with Docker Registry

```sh
# For Docker Hub
docker login

# For private registry (e.g., Harbor, ECR, GCR, ACR)
docker login your-registry.io
```

Step 2: Build, Push, and Deploy
```sh
make deploy \
  API_SERVER=www.arista.io \
  API_TOKEN=your-token-here \
  JOBCONFIG_MODE=interface \
  NODECONFIG_MODE=discovery \
  LOCATION=testlab \
  REGISTRY=docker.io/your-username \
  PUSH=true \
  LOG_LEVEL=info
```

This will:
- Build the Docker image locally
- Push it to your registry (requires authentication from Step 1)
- Deploy to Kubernetes with the registry image
What gets created in your cluster:
The deployment creates the following Kubernetes resources:
- Namespace: `cloudvision` (created automatically if it doesn't exist)
- ServiceAccount: `cv-job-informer` (in the `cloudvision` namespace)
- ClusterRole: `cv-job-informer` (cluster-wide permissions to watch jobs, pods, nodes, and node interface states)
- ClusterRoleBinding: `cv-job-informer` (binds the ClusterRole to the ServiceAccount)
- Secret: `cv-job-informer-api-credentials` (stores the API server URL and authentication token)
- Deployment: `cv-job-informer` (runs 1 replica on the control plane node)
- NodeInterfaceState CRD + cv-interface-discovery DaemonSet (when `NODECONFIG_MODE=discovery`)

All resources are labeled with `app: cv-job-informer` for easy management and cleanup.
```text
> make help
CV Job Informer - Kubernetes Job Monitoring Service

Targets:
  delete     Delete cv-job-informer from Kubernetes
  deploy     Build and deploy to Kubernetes (requires API_SERVER, API_TOKEN, LOCATION)
  describe   Describe pod (COMPONENT=job for cv-job-informer, COMPONENT=node for cv-interface-discovery)
  help       Show this help message
  logs       View logs (COMPONENT=job for cv-job-informer, COMPONENT=node for cv-interface-discovery)
  restart    Restart component (COMPONENT=job for cv-job-informer, COMPONENT=node for cv-interface-discovery)
  status     Check status (COMPONENT=job for cv-job-informer, COMPONENT=node for cv-interface-discovery)

Variables:
  NAMESPACE            Namespace(s) to monitor (default: all namespaces)
                       - Empty = all namespaces cluster-wide
                       - Single namespace = watch only that namespace
                       - Comma-separated = watch all, filter to specified
                       Note: cv-job-informer pod is always deployed to 'cloudvision' namespace
  API_SERVER           API server address (REQUIRED)
                       See https://aristanetworks.github.io/cloudvision-apis/connecting for details
  API_TOKEN            API authentication token (REQUIRED)
                       See https://aristanetworks.github.io/cloudvision-apis/connecting for details
  LOCATION             Location identifier, e.g. cluster name (REQUIRED)
  JOBCONFIG_MODE       JobConfig mode: node or interface (default: interface)
                       See "JobConfig Mode - Resource Allocation Reporting" section for when to use each mode
  NODECONFIG_MODE      NodeConfig mode: discovery, sriovoperator, or disabled (default: discovery)
                       See "NodeConfig Modes - Network Interface Discovery" section for details on each mode
  NODE_INTERFACE_TYPE  Node interfaces for NodeConfig: all, pf, or vf (default: all)
  LOG_LEVEL            Log level: debug, info, warning, error (default: info)
  REGISTRY             Container registry (e.g., docker.io/username)
  IMAGE_TAG            Docker image tag (default: latest)
  PUSH                 Push image to registry: true or false (default: false)
  SKIP_BUILD           Skip building image, use existing: true or false (default: false)
  COMPONENT            Component to operate on: job or node (default: job)
                       - job = cv-job-informer deployment
                       - node = cv-interface-discovery daemonset
                       Used by: logs, status, restart, describe commands
```

## Event Flow Diagram
```mermaid
sequenceDiagram
    participant K8s as Kubernetes API
    participant PodInformer as Pod Informer
    participant JobInformer as Job Informer(s)<br/>(Dynamic)
    participant NodeInformer as Node Informer
    participant PodHandler as Pod Handler
    participant JobHandler as Job Handler
    participant CV as CloudVision API

    Note over K8s,CV: 1. Dynamic Resource Discovery
    K8s->>PodInformer: Pod ADD event
    PodInformer->>PodHandler: on_pod_add(pod)
    PodHandler->>PodHandler: Extract parent resource<br/>from ownerReferences
    PodHandler->>JobInformer: Create informer for<br/>parent resource type<br/>(if not exists)
    Note over JobInformer: Dynamically creates<br/>informers for TrainJob,<br/>RunaiJob, Workflow, etc.

    Note over K8s,CV: 2. Job Lifecycle Tracking
    K8s->>JobInformer: Job ADD event
    JobInformer->>JobHandler: on_job_add(job)
    JobHandler->>JobHandler: Track job in PENDING state
    K8s->>PodInformer: Pod UPDATE (Running)
    PodInformer->>PodHandler: on_pod_update(pod)
    PodHandler->>PodHandler: Extract network interfaces<br/>from Multus annotation
    PodHandler->>PodHandler: Check pod states<br/>(Pending/Failed/Running)
    alt Any pod Pending or Failed
        PodHandler->>PodHandler: Skip STARTED event<br/>Wait for all pods to start<br/>or job to complete
    else All pods Running
        PodHandler->>PodHandler: Schedule STARTED event<br/>(stability delay)
        Note over PodHandler: Wait for pod state<br/>to stabilize<br/>(5s delay)
        PodHandler->>PodHandler: Re-check: All pods running?<br/>No pending/failed pods?<br/>Interfaces stable?
        PodHandler->>CV: POST JobConfig<br/>state=STARTED<br/>interfaces=[MACs]
        PodHandler->>JobHandler: Update job status<br/>to RUNNING
    end

    Note over K8s,CV: 3. Interface Change Detection
    K8s->>PodInformer: Pod UPDATE (new interface)
    PodInformer->>PodHandler: on_pod_update(pod)
    PodHandler->>PodHandler: Detect interface change
    PodHandler->>PodHandler: Schedule UPDATE event<br/>(stability delay)
    PodHandler->>CV: POST JobConfig<br/>state=UPDATE<br/>interfaces=[new MACs]

    Note over K8s,CV: 4. Job Completion
    K8s->>JobInformer: Job UPDATE (Completed)
    JobInformer->>JobHandler: on_job_update(job)
    JobHandler->>JobHandler: Cancel pending event timer
    alt Job status = PENDING
        JobHandler->>JobHandler: Job never fully started<br/>Skip FINISHED event<br/>Clean up cache
    else Job status = RUNNING
        JobHandler->>JobHandler: Extract start/end times
        JobHandler->>CV: POST JobConfig<br/>state=FINISHED<br/>termination=SUCCEEDED<br/>end_time=...
        JobHandler->>JobHandler: Mark job as finished<br/>Clean up cache
    end

    Note over K8s,CV: 5. Job Cancellation
    K8s->>JobInformer: Job DELETE event
    JobInformer->>JobHandler: on_job_delete(job)
    JobHandler->>JobHandler: Cancel pending event timer
    alt Job status = PENDING
        JobHandler->>JobHandler: Job never fully started<br/>Skip API call<br/>Clean up tracking
    else Job status = RUNNING
        JobHandler->>CV: POST JobConfig<br/>state=FINISHED<br/>termination=CANCELLED
        JobHandler->>JobHandler: Clean up tracking
    end

    Note over K8s,CV: 6. Node Interface Inventory (Optional)
    K8s->>NodeInformer: NodeInterfaceState /<br/>SriovNetworkNodeState<br/>UPDATE event
    NodeInformer->>NodeInformer: Extract interfaces<br/>(PFs/VFs with MACs)
    NodeInformer->>NodeInformer: Detect interface changes
    NodeInformer->>CV: POST NodeConfig<br/>node=node-1<br/>interfaces=[PF/VF MACs]
```
## How Job Informer Works

### What cv-job-informer monitors
- Job CRDs (any type: TrainJob, PyTorchJob, MPIJob, etc.) - for job lifecycle events via dynamic resource discovery based on pods' ownerReferences
- Pods - for job resource allocation:
  - Node names (which nodes are running the job)
  - Network interface MAC addresses (from Multus CNI annotations)
- NodeInterfaceState CRs (when `NODECONFIG_MODE=discovery`) - for node-level interface inventory created by the cv-interface-discovery DaemonSet
- SriovNetworkNodeState CRs (when `NODECONFIG_MODE=sriovoperator`) - for node-level SR-IOV interface inventory created by the SR-IOV Network Operator
Supported Job Resource Types:
Only the following resource types are monitored (whitelist approach). This ensures the informer only watches resources it has RBAC permissions for:
| API Group | Kind | Description |
|---|---|---|
| `batch` | `Job` | Kubernetes batch Jobs (also used by JobSet) |
| `kubeflow.org` | `PyTorchJob`, `TFJob`, `MPIJob`, `XGBoostJob`, `PaddleJob` | Kubeflow Training Operator |
| `trainer.kubeflow.org` | `TrainJob` | Kubeflow Trainer v2 |
| `argoproj.io` | `Workflow`, `WorkflowTemplate`, `CronWorkflow` | Argo Workflows |
| `run.ai` | `RunaiJob` | Run:ai v1 API |
| `run.ai` | `TrainingWorkload`, `InferenceWorkload`, `InteractiveWorkload` | Run:ai v2 API |
| `batch.volcano.sh` | `Job` | Volcano batch scheduler |
| `ray.io` | `RayJob`, `RayCluster` | KubeRay |
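Matching a pod's ownerReference against this whitelist amounts to splitting the owner's `apiVersion` into group and version, then looking up the `(group, kind)` pair. A sketch (the set here is an abridged, hypothetical stand-in for the one in `constants.py`):

```python
# Abridged whitelist of (API group, Kind) pairs, mirroring the table above.
SUPPORTED_JOB_RESOURCES = {
    ("batch", "Job"),
    ("kubeflow.org", "PyTorchJob"),
    ("trainer.kubeflow.org", "TrainJob"),
    ("run.ai", "RunaiJob"),
}

def is_supported_owner(owner_ref: dict) -> bool:
    """True if a pod ownerReference points at a whitelisted job resource.

    apiVersion is "group/version" for CRDs, or just "version" (e.g. "v1")
    for core resources, where the group is the empty string.
    """
    api_version = owner_ref.get("apiVersion", "")
    group = api_version.split("/")[0] if "/" in api_version else ""
    return (group, owner_ref.get("kind")) in SUPPORTED_JOB_RESOURCES
```

Only owners that pass this check get a dynamically created informer, which keeps watches inside the granted RBAC permissions.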
Adding Support for New Resource Types:
To monitor additional job resource types:

1. Add RBAC permissions in `job_informer.yaml`:

   ```yaml
   - apiGroups: ["your-api-group.io"]
     resources: ["yourjobs"]
     verbs: ["get", "list", "watch"]
   ```

2. Add to the whitelist in `constants.py`:

   ```python
   SUPPORTED_JOB_RESOURCES = {
       # ... existing entries ...
       ("your-api-group.io", "YourJob"),
   }
   ```

3. Redeploy the cv-job-informer
How Resource Allocation is Extracted:
The service extracts resource allocation information from pods to send to CloudVision API:
1. Node names (always available)
   - Extracted from the pod `spec.nodeName` field
   - Sent when `JOBCONFIG_MODE=node`
   - Works in all Kubernetes clusters

2. Interface addresses (requires Multus CNI)
   - Extracted from the `k8s.v1.cni.cncf.io/network-status` annotation added by Multus CNI
   - MAC addresses of secondary interfaces (net1, net2, etc.) are sent when `JOBCONFIG_MODE=interface`
   - Works with any secondary network attachment (SR-IOV, DRA, Macvlan, etc.)

3. RDMA device info (for logging only, requires SR-IOV Network Device Plugin)
   - Added to the network-status annotation by the SR-IOV Network Device Plugin
   - Provides device name and PCI address for debugging logs
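For illustration, the `device-info` entries that the SR-IOV Network Device Plugin adds to the same annotation could be turned into debug log lines like this (a sketch; the entry shape follows the Network Plumbing WG Device Info convention and is an assumption here, as is the function name):

```python
import json

def rdma_debug_info(network_status_json: str) -> list:
    """Build debug log lines (interface + PCI address) from the
    network-status annotation entries that carry device-info."""
    lines = []
    for status in json.loads(network_status_json):
        dev = status.get("device-info", {})
        pci = dev.get("pci", {}).get("pci-address")
        if pci:
            lines.append(f"{status.get('interface', '?')}: pci={pci}")
    return lines
```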
What it needs:
- Read-only access to pods, nodes, and job CRDs (via RBAC)
- Network access to CloudVision API
- CloudVision API credentials (stored in Kubernetes secret)
## How Interface Discovery Works
The cv-interface-discovery daemonset runs one pod on each node to discover network interfaces. Here's how it retrieves interface information:
1. Enumerate Network Interfaces: Scans `/sys/class/net/` to find all network devices on the node
2. Filter Physical Interfaces: Identifies physical interfaces by checking for a `device` symlink pointing to the PCI device (excludes virtual interfaces like bridges, bonds, and veth pairs, which don't have this symlink)
3. Detect SR-IOV Hierarchy:
   - Reads `/sys/class/net/<interface>/device/sriov_numvfs` to identify SR-IOV Physical Functions (PFs)
   - Reads `/sys/class/net/<interface>/device/virtfn*` symlinks to enumerate all configured VFs (stable, always present)
   - Reads `/sys/class/net/<interface>/device/physfn` to identify Virtual Functions (VFs) and their parent PF
   - Maps VF-to-PF relationships
4. Extract MAC Addresses: Reads `/sys/class/net/<interface>/address` for each interface's MAC address
5. Extract IP Addresses: Uses a socket ioctl (SIOCGIFADDR) to get IPv4 addresses for interfaces that are in the "up" state
6. Collect Metadata: Gathers interface names, types (PF/VF/regular), PCI device information, and RDMA device names (if available)
7. VF Caching for Stability: Maintains an in-memory cache of VF details (name, MAC, IP, RDMA) keyed by PCI address
   - When VFs are visible in the host namespace: reads current details and updates the cache
   - When VFs are moved to pod namespaces (during job execution): uses cached details
   - This ensures stable VF reporting and avoids unnecessary NodeConfig updates when jobs start/stop
8. Create NodeInterfaceState CR: Stores all discovered interface data in a custom resource named after the node
What Gets Stored in NodeInterfaceState CR:
- List of all physical network interfaces with their MAC addresses and IP addresses
- SR-IOV PF/VF hierarchy (which VFs belong to which PF)
- Interface types and names
- RDMA device names (for RDMA-capable interfaces)
- VF details remain stable even when VFs are allocated to pods (using cached information)
- Owner reference to the Node object (ensures automatic CR deletion when node is removed)
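The VF caching mentioned above can be sketched as a small keyed cache: while a VF is visible in the host namespace its details refresh the cache, and once it disappears into a pod's network namespace the last known details are reported instead (a hypothetical sketch; the class name is ours):

```python
from typing import Optional

class VfCache:
    """Cache VF details (name, MAC, IP, RDMA device) keyed by PCI address.

    Keeps NodeInterfaceState stable while VFs are moved into pod
    network namespaces during job execution.
    """

    def __init__(self):
        self._by_pci = {}

    def observe(self, pci: str, details: Optional[dict]) -> Optional[dict]:
        """`details` is None when the VF is not visible on the host."""
        if details is not None:
            self._by_pci[pci] = details   # VF visible: refresh the cache
            return details
        return self._by_pci.get(pci)      # VF in a pod: fall back to cache
```

Because the reported details don't change when a VF is handed to a pod, job start/stop doesn't trigger spurious NodeConfig updates.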
How cv-job-informer Uses It:
- Watches all NodeInterfaceState CRs cluster-wide
- When a CR is created/updated, extracts the interface inventory and sends to CloudVision NodeConfig API
- When a CR is deleted (e.g., node removed from cluster), deletes the NodeConfig from CloudVision
Alternative (NODECONFIG_MODE=sriovoperator): When SR-IOV Network Operator is already deployed, it creates SriovNetworkNodeState CRs with similar information. cv-job-informer watches those instead, and cv-interface-discovery is not deployed.
## Example API Payloads Sent to CloudVision
JobConfig payload (`JOBCONFIG_MODE=interface`, job running):

```json
{
  "key": {
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  },
  "location": "cluster-west",
  "job_name": "gpt-fine-tuning",
  "state": "JOB_STATE_RUNNING",
  "start_time": "2025-12-05T10:30:00Z",
  "interfaces": {
    "values": [
      "aa:bb:cc:dd:ee:01",
      "aa:bb:cc:dd:ee:02",
      "aa:bb:cc:dd:ee:03",
      "aa:bb:cc:dd:ee:04"
    ]
  }
}
```

JobConfig payload (`JOBCONFIG_MODE=node`, job finished):

```json
{
  "key": {
    "id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
  },
  "location": "cluster-west",
  "job_name": "gpt-fine-tuning",
  "state": "JOB_STATE_FINISHED",
  "start_time": "2025-12-05T10:30:00Z",
  "end_time": "2025-12-05T12:45:30Z",
  "nodes": {
    "values": [
      "gpu-node-1",
      "gpu-node-2",
      "gpu-node-3",
      "gpu-node-4"
    ]
  }
}
```

NodeConfig payload, sent when `NODECONFIG_MODE=discovery` or `NODECONFIG_MODE=sriovoperator`:
```json
{
  "key": {
    "node_name": "gpu-node-1"
  },
  "location": "cluster-west",
  "interfaces": [
    {
      "name": "ens1f0v0",
      "mac": "aa:bb:cc:dd:ee:01",
      "ip": "192.168.1.11"
    },
    {
      "name": "ens1f0v1",
      "mac": "aa:bb:cc:dd:ee:02",
      "ip": "192.168.1.12"
    }
  ]
}
```

Privacy Notes:
- ✅ No user data, code, or training data is sent
- ✅ No pod logs or container output is sent
- ✅ No environment variables or secrets are sent
- ✅ Only job metadata and basic node interface info are sent
- ✅ Runs in your cluster (no external dependencies except CloudVision API)
Note: This section describes an alternative use case for GPU-as-a-Service cloud providers integrating tenant schedulers with CloudVision. This is separate from the regular Kubernetes job monitoring described above.
The send_jobconfig() API utility function in api_utils.py supports a tenant mode (isTenantJob=True) for reporting tenant allocations to CloudVision. Tenant allocations appear on the CloudVision Tenant Dashboard (separate from the regular Job Dashboard).
Use Case:
- GPU-as-a-Service providers with multi-tenant schedulers
- Track which network resources are allocated to each tenant
- Correlate network issues to specific tenant workloads
How It Works:
Tenant schedulers must call send_jobconfig() directly at these lifecycle points:
- Tenant Allocation: Call with `job_state='JOB_STATE_RUNNING'` and `isTenantJob=True`
- Resource Change: Call with updated `nodes` or `interfaces` when tenant resources scale
- Tenant Deallocation: Call with `job_state='JOB_STATE_COMPLETED'` and `isTenantJob=True`
Example Integration:
```python
from api_utils import send_jobconfig

# When tenant is allocated resources
send_jobconfig(
    api_server="www.arista.io",
    api_token="your-api-token",
    job_id="tenant-unique-id",
    job_name="tenant-abc",
    location="us-west-cluster",
    job_state="JOB_STATE_RUNNING",
    nodes=["gpu-node-1", "gpu-node-2"],
    start_time="2025-12-05T10:30:00Z",
    jobconfig_mode="node",
    isTenantJob=True
)

# When tenant allocation ends
send_jobconfig(
    api_server="www.arista.io",
    api_token="your-api-token",
    job_id="tenant-unique-id",
    job_name="tenant-abc",
    location="us-west-cluster",
    job_state="JOB_STATE_COMPLETED",
    nodes=["gpu-node-1", "gpu-node-2"],
    start_time="2025-12-05T10:30:00Z",
    end_time="2025-12-06T18:00:00Z",
    jobconfig_mode="node",
    isTenantJob=True
)
```

This repository provides reference implementations for integrating HPC job workloads with CloudVision. It is intended as a starting point for users to adapt and customize for their specific environments.
This is not a fully supported Arista product. Users are responsible for reviewing, testing, and modifying this code to meet their security and operational requirements. By using this code, you acknowledge it is provided as-is for reference purposes.