Profiling Control System with Heartbeat Protocol

This document describes the implementation of a centralized profiling control system where a Performance Studio backend can dynamically issue profiling commands (start/stop) to gProfiler agents via a heartbeat protocol.

System Overview

┌─────────────────────┐    Heartbeat     ┌──────────────────────┐
│                     │ ◄──────────────► │                      │
│  Performance Studio │                  │   gProfiler Agent    │
│     Backend         │   Commands       │                      │
│                     │ ────────────────► │                      │
└─────────────────────┘                  └──────────────────────┘

Key Components

Performance Studio Backend - Central control server that:
- Receives profiling requests via REST API
- Manages profiling commands for hosts/services
- Responds to agent heartbeats with pending commands
- Tracks command execution status
gProfiler Agent - Profiling agent that:
- Sends periodic heartbeats to the backend
- Receives and executes profiling commands
- Ensures idempotent command execution
- Reports command completion status

Features

✅ Backend Features

REST API for submitting profiling requests
Heartbeat endpoint for agent communication
Command merging for multiple requests targeting same host
Process-level and host-level stop commands
Idempotent command execution using unique command IDs
Command completion tracking
PerfSpect integration for hardware metrics collection

✅ Agent Features

Heartbeat communication with configurable intervals
Dynamic profiling based on server commands
Command-driven execution (start/stop profiling)
Idempotency to prevent duplicate command execution
Persistent command tracking across agent restarts
Graceful error handling and retry logic
PerfSpect auto-installation for hardware metrics collection
Hardware metrics integration with CPU profiling data

API Endpoints

1. Submit Profiling Request

POST /api/metrics/profile_request

Request Body:

{
  "service_name": "my-service",
  "command_type": "start",  // "start" or "stop"
  "duration": 60,
  "frequency": 11,
  "profiling_mode": "cpu",
  "target_hostnames": ["host1", "host2"],
  "pids": [1234, 5678],  // Optional: specific PIDs
  "stop_level": "process",  // "process" or "host" (for stop commands)
  "additional_args": {
    "enable_perfspect": true  // Optional: enable hardware metrics collection
  }
}

Response:

{
  "success": true,
  "message": "Start profiling request submitted successfully",
  "request_id": "req-uuid",
  "command_id": "cmd-uuid",
  "estimated_completion_time": "2025-01-08T12:00:00Z"
}

2. Agent Heartbeat

POST /api/metrics/heartbeat

Request Body:

{
  "ip_address": "192.168.1.100",
  "hostname": "worker-01",
  "service_name": "my-service",
  "last_command_id": "cmd-uuid",
  "available_pids" : [java:{}, python:{}],
  "namespaces" : [{namespace: kube_system, pods : [{pod_name: gprofiler, containers : {{pid:123, name: metrics-exporter},{pid:123, name: metrics-exporter}},{pod_name: webapp, containers : {{pid:123, name: metrics-exporter},{pid:123, name: metrics-exporter}}]}],
  "status": "active",
  "timestamp": "2025-01-08T11:00:00Z"
}
"containers" -> "host" Table -> {container_name, array_of_hosts}
"pod" -> "host" Table -> {pod_name, array_of_hosts}
"namespace" -> "host" Table -> {namespace, array_of_hosts}

1. add k8s namespace hierarchy info as part of heartbeat 
2. save k8s information in hostheartbeats table and create de-normalized table for containersToHosts, podsToHost and namespaceToHosts, 
3. perform profiling : support profiling request by namespaces, pods and containers ( 5 )
4. test e2e ( 3 )

Response:

{
  "success": true,
  "message": "Heartbeat received. New profiling command available.",
  "profiling_command": {
    "command_type": "start",
    "combined_config": {
      "duration": 60,
      "frequency": 11,
      "profiling_mode": "cpu",
      "pids": "" 
    }
  },
  "command_id": "cmd-uuid"
}

3. Report Command Completion

POST /api/metrics/command_completion

Request Body:

{
  "command_id": "cmd-uuid",
  "hostname": "worker-01", 
  "status": "completed",  // "completed" or "failed"
  "execution_time": 65,
  "error_message": null,
  "results_path": "s3://bucket/path/to/results"
}

PerfSpect Hardware Metrics Integration

The heartbeat system supports Intel PerfSpect integration for collecting hardware performance metrics alongside CPU profiling data. This feature enables comprehensive performance analysis by combining software-level profiling with hardware-level metrics.

Overview

When enable_perfspect: true is included in the additional_args of a profiling request, the gProfiler agent will:

Auto-install PerfSpect: Downloads and extracts the latest PerfSpect binary from GitHub releases
Configure hardware collection: Enables --enable-hw-metrics-collection flag
Set PerfSpect path: Configures --perfspect-path to the auto-installed binary
Collect metrics: Runs PerfSpect alongside CPU profiling to gather hardware metrics

Agent Behavior

Command Processing

When the agent receives a heartbeat response with enable_perfspect: true in the combined_config:

# Agent processes the configuration
if combined_config.get("enable_perfspect", False):
    new_args.collect_hw_metrics = True
    
    # Auto-install PerfSpect
    from gprofiler.perfspect_installer import get_or_install_perfspect
    perfspect_path = get_or_install_perfspect()
    if perfspect_path:
        new_args.tool_perfspect_path = str(perfspect_path)
        logger.info(f"PerfSpect auto-installed at: {perfspect_path}")

Installation Process

Download: Fetches perfspect.tgz from https://github.com/intel/PerfSpect/releases/latest/download/perfspect.tgz
Extract: Unpacks to /tmp/gprofiler_perfspect/perfspect/
Verify: Checks binary exists and is executable
Configure: Sets path for gProfiler to use

Data Collection

PerfSpect runs with the following command:

/tmp/gprofiler_perfspect/perfspect/perfspect metrics \
  --duration 60 \
  --output /tmp/perfspect_data

Output Files

When PerfSpect is enabled, additional files are generated:

Hardware Metrics CSV: /tmp/perfspect_data/{hostname}_metrics.csv
Hardware Summary CSV: /tmp/perfspect_data/{hostname}_metrics_summary.csv
Hardware HTML Report: /tmp/perfspect_data/{hostname}_metrics_summary.html
Latest Metrics: /tmp/perfspect_data/{hostname}_metrics_summary_latest.csv
Latest HTML: /tmp/perfspect_data/{hostname}_metrics_summary_latest.html

Example Request with PerfSpect

curl -X POST http://localhost:8000/api/metrics/profile_request \
  -H "Content-Type: application/json" \
  -d '{
    "service_name": "web-service",
    "command_type": "start",
    "duration": 60,
    "frequency": 11,
    "profiling_mode": "cpu",
    "target_hostnames": ["worker-01", "worker-02"],
    "additional_args": {
      "enable_perfspect": true
    }
  }'

Combined Config Example

The agent receives the following combined_config in heartbeat responses:

{
  "duration": 60,
  "frequency": 11,
  "continuous": true,
  "command_type": "start",
  "profiling_mode": "cpu",
  "enable_perfspect": true
}

Requirements

Platform: Linux x86_64 (PerfSpect requirement)
Permissions: Root access for hardware performance counter access
Network: Internet access to download PerfSpect binary
Storage: ~50MB for PerfSpect installation and data files

Troubleshooting

Common Issues

Permission Denied: Ensure agent runs with sufficient privileges
```
sudo ./gprofiler --enable-heartbeat-server ...
```

Download Failures: Check network connectivity and GitHub access

curl -I https://github.com/intel/PerfSpect/releases/latest/download/perfspect.tgz

Binary Not Found: Verify installation directory permissions
```
ls -la /tmp/gprofiler_perfspect/perfspect/
```

Debug Logging

Enable verbose logging to see PerfSpect installation and execution details:

./gprofiler --enable-heartbeat-server --verbose

Look for log messages:

PerfSpect auto-installed at: /path/to/binary
Using perfspect path: /path/to/binary
Failed to auto-install PerfSpect, hardware metrics disabled

Usage Examples

Backend - Submit Start Command

curl -X POST http://localhost:8000/api/metrics/profile_request \
  -H "Content-Type: application/json" \
  -d '{
    "service_name": "web-service",
    "command_type": "start",
    "duration": 120,
    "frequency": 11,
    "profiling_mode": "cpu",
    "target_hostnames": ["web-01", "web-02"]
    "containers" : [],
    "pods" : [],
    "namespaces" : [],
  }'

Backend - Submit Stop Command

curl -X POST http://localhost:8000/api/metrics/profile_request \
  -H "Content-Type: application/json" \
  -d '{
    "service_name": "web-service", 
    "command_type": "stop",
    "stop_level": "host",
    "target_hostnames": ["web-01"]
  }'

Agent - Run in Heartbeat Mode

Basic heartbeat mode:

python gprofiler/main.py \
  --enable-heartbeat-server \
  --upload-results \
  --token "your-token" \
  --service-name "web-service" \
  --api-server "http://performance-studio:8000" \
  --heartbeat-interval 30 \
  --output-dir /tmp/profiles \
  --verbose

Production deployment with all optimizations:

# Set environment variables first
export GPROFILER_TOKEN="my_token"
export GPROFILER_SERVICE="your-service-name"  
export GPROFILER_SERVER="http://localhost:8080"

# Production command (can also source /opt/gprofiler/envs.sh for variables)
/opt/gprofiler/gprofiler \
  -u \
  --token=$GPROFILER_TOKEN \
  --service-name=$GPROFILER_SERVICE \
  --server-host $GPROFILER_SERVER \
  --dont-send-logs \
  --server-upload-timeout 10 \
  -c \
  --disable-metrics-collection \
  --java-safemode= \
  -d 60 \
  --java-no-version-check

Implementation Details

Backend Logic

Command Generation: Each profiling request generates a unique command_id
Command Merging: Multiple requests for the same host are merged into single commands
Stop Handling:
- Process-level stops remove specific PIDs from commands
- Host-level stops terminate all profiling for the host
Heartbeat Response: Returns pending commands with command_type and configuration

Agent Logic

Heartbeat Loop: Sends heartbeats at configured intervals
Command Processing:
- start: Stop current profiler (if any) and start new one with given config
- stop: Stop current profiler without starting a new one
Idempotency: Track executed command IDs to prevent duplicates
Persistence: Save executed command IDs to disk for restart resilience

Command Flow

1. User submits profiling request to backend
   ↓
2. Backend creates command with unique ID
   ↓  
3. Agent sends heartbeat to backend
   ↓
4. Backend responds with pending command
   ↓
5. Agent executes command (start/stop profiling)
   ↓
6. Agent reports completion to backend
   ↓
7. Backend updates command status

Configuration

Backend Configuration

Database connection for command storage
API endpoints for profiling control
Command merging and deduplication logic

Agent Configuration

--enable-heartbeat-server     # Enable heartbeat mode
--heartbeat-interval 30       # Heartbeat frequency (seconds)
--api-server URL             # Backend server URL
--upload-results             # Required for heartbeat mode
--token TOKEN                # Authentication token
--service-name NAME          # Service identifier

Testing

Test Scripts

test_heartbeat_system.py - Test backend API and heartbeat flow
run_heartbeat_agent.py - Run agent in heartbeat mode for testing

Test Workflow

Start Performance Studio backend
Run test agent: python run_heartbeat_agent.py
Submit test commands: python test_heartbeat_system.py
Verify agent receives and executes commands
Check idempotency and error handling

Error Handling

Backend

Validates profiling request parameters
Handles database connection errors
Returns appropriate HTTP status codes
Logs all operations for debugging

Agent

Retries failed heartbeats with backoff
Continues heartbeat loop on command execution errors
Persists executed command IDs across restarts
Graceful shutdown on termination signals

Security Considerations

Authentication: Token-based authentication for agent-backend communication
Authorization: Service-based access control for profiling commands
Command Validation: Validate all command parameters before execution
Rate Limiting: Prevent abuse of profiling requests
Audit Logging: Track all profiling activities for compliance

Future Enhancements

Real-time Status: WebSocket connection for real-time agent status
Command Scheduling: Schedule profiling commands for future execution
Resource Monitoring: Check system resources before starting profiling
Multi-tenant Support: Isolation between different services/teams
Command Prioritization: Priority queues for urgent profiling requests
Distributed Coordination: Coordinate profiling across multiple agents

Troubleshooting

Common Issues

Agent not receiving commands
- Check network connectivity to backend
- Verify authentication token
- Check service name matching
Commands not executing
- Check agent logs for errors
- Verify command parameters are valid
- Check system permissions for profiling
Duplicate commands
- Verify idempotency implementation
- Check command ID persistence
- Review heartbeat timing
PerfSpect hardware metrics not working
- Ensure Linux x86_64 platform (PerfSpect requirement)
- Verify root/sudo permissions for hardware counters
- Check internet connectivity for auto-installation
- Look for "PerfSpect auto-installed" or "Failed to auto-install" log messages
- Verify /tmp/gprofiler_perfspect/perfspect/perfspect binary exists and is executable

Debugging

Enable verbose logging: --verbose
Check heartbeat logs: /tmp/gprofiler-heartbeat.log
Monitor backend API logs
Use test scripts to isolate issues
For PerfSpect issues:
- Check PerfSpect installation: ls -la /tmp/gprofiler_perfspect/perfspect/
- Test PerfSpect manually: /tmp/gprofiler_perfspect/perfspect/perfspect --help
- Check PerfSpect data directory: ls -la /tmp/perfspect_data/
- Monitor hardware metrics collection in agent logs

Building and Running gProfiler Locally

Prerequisites

Linux system (x86_64 or Aarch64)
Python 3.10+ for source builds
Docker for containerized builds
16GB+ RAM for full builds
Root access for profiling operations

Build Options

1. Build Executable (Recommended)

cd gprofiler

# Full build (takes 20-30 minutes, builds all profilers from source)
./scripts/build_x86_64_executable.sh

# Fast build (for development, skips some optimizations)
./scripts/build_x86_64_executable.sh --fast

The executable will be created at build/x86_64/gprofiler.

2. Build Docker Image

./scripts/build_x86_64_container.sh -t gprofiler

3. Run from Source (Development)

# Install dependencies
pip3 install -r requirements.txt

# Copy required resources
./scripts/copy_resources_from_image.sh

# Run directly from source (requires root)
sudo python3 -m gprofiler [options]

Running Locally

Basic Local Profiling

# Make executable and run basic profiling
chmod +x build/x86_64/gprofiler
sudo ./build/x86_64/gprofiler -o /tmp/gprofiler-output -d 30

Production-Style Local Run

# Set environment variables
export GPROFILER_TOKEN="my_token"
export GPROFILER_SERVICE="your-service-name"
export GPROFILER_SERVER="http://localhost:8080"

# Run with production flags
sudo ./build/x86_64/gprofiler \
  -u \
  --token=$GPROFILER_TOKEN \
  --service-name=$GPROFILER_SERVICE \
  --server-host $GPROFILER_SERVER \
  --dont-send-logs \
  --server-upload-timeout 10 \
  -c \
  --disable-metrics-collection \
  --java-safemode= \
  -d 60 \
  --java-no-version-check

Local Heartbeat Mode Testing

# Run agent in heartbeat mode for testing
sudo ./build/x86_64/gprofiler \
  --enable-heartbeat-server \
  --upload-results \
  --token=$GPROFILER_TOKEN \
  --service-name=$GPROFILER_SERVICE \
  --api-server $GPROFILER_SERVER \
  --heartbeat-interval 30 \
  --output-dir /tmp/profiles \
  --dont-send-logs \
  --server-upload-timeout 10 \
  --disable-metrics-collection \
  --java-safemode= \
  --java-no-version-check \
  --verbose

Local PerfSpect Testing (Manual)

# Test PerfSpect integration manually (Linux x86_64 only)
sudo ./build/x86_64/gprofiler \
  --enable-hw-metrics-collection \
  --perfspect-path /path/to/perfspect \
  --perfspect-duration 60 \
  --output-dir /tmp/profiles \
  --duration 60 \
  --verbose

Command Line Options Explained

-u, --upload-results              # Upload results to Performance Studio
--token=$GPROFILER_TOKEN          # Authentication token
--service-name=$GPROFILER_SERVICE # Service identifier  
--server-host $GPROFILER_SERVER   # Performance Studio backend URL
--dont-send-logs                  # Disable log transmission
--server-upload-timeout 10        # Upload timeout (seconds)
-c, --continuous                  # Continuous profiling mode
--disable-metrics-collection      # Disable system metrics collection
--java-safemode=                  # Disable Java safe mode (empty value)
-d 60                            # Profiling duration (seconds)
--java-no-version-check          # Skip Java version check
--enable-heartbeat-server         # Enable heartbeat communication
--heartbeat-interval 30           # Heartbeat frequency (seconds)
--api-server URL                  # Heartbeat API server URL
-o, --output-dir PATH            # Local output directory
--verbose                        # Enable verbose logging

# PerfSpect Hardware Metrics Options (Linux x86_64 only)
--enable-hw-metrics-collection    # Enable hardware metrics via PerfSpect
--perfspect-path PATH            # Path to PerfSpect binary (auto-installed in heartbeat mode)
--perfspect-duration SECONDS     # PerfSpect collection duration (default: 60)

Development Workflow

Build: ./scripts/build_x86_64_executable.sh --fast
Test locally: sudo ./build/x86_64/gprofiler -o /tmp/results -d 30
View results: Open /tmp/results/last_flamegraph.html in browser
Test heartbeat: Run with --enable-heartbeat-server flag

Troubleshooting Local Builds

Build fails: Ensure 16GB+ RAM available
Permission errors: Run profiling commands with sudo
Docker issues: Ensure Docker daemon is running
Missing dependencies: Install build requirements with package manager

Uh oh!

FilesExpand file tree

HEARTBEAT_SYSTEM_README.md

Latest commit

History

HEARTBEAT_SYSTEM_README.md

File metadata and controls

Profiling Control System with Heartbeat Protocol

System Overview

Key Components

Features

✅ Backend Features

✅ Agent Features

API Endpoints

1. Submit Profiling Request

2. Agent Heartbeat

3. Report Command Completion

PerfSpect Hardware Metrics Integration

Overview

Agent Behavior

Command Processing

Installation Process

Data Collection

Output Files

Example Request with PerfSpect

Combined Config Example

Requirements

Troubleshooting

Common Issues

Debug Logging

Usage Examples

Backend - Submit Start Command

Backend - Submit Stop Command

Agent - Run in Heartbeat Mode

Implementation Details

Backend Logic

Agent Logic

Command Flow

Configuration

Backend Configuration

Agent Configuration

Testing

Test Scripts

Test Workflow

Error Handling

Backend

Agent

Security Considerations

Future Enhancements

Troubleshooting

Common Issues

Debugging

Building and Running gProfiler Locally

Prerequisites

Build Options

1. Build Executable (Recommended)

2. Build Docker Image

3. Run from Source (Development)

Running Locally

Basic Local Profiling

Production-Style Local Run

Local Heartbeat Mode Testing

Local PerfSpect Testing (Manual)

Command Line Options Explained

Development Workflow

Troubleshooting Local Builds