This document describes the implementation of a centralized profiling control system where a Performance Studio backend can dynamically issue profiling commands (start/stop) to gProfiler agents via a heartbeat protocol.
┌─────────────────────┐     Heartbeat      ┌──────────────────────┐
│                     │ ◄────────────────► │                      │
│ Performance Studio  │                    │   gProfiler Agent    │
│      Backend        │     Commands       │                      │
│                     │ ─────────────────► │                      │
└─────────────────────┘                    └──────────────────────┘
- Performance Studio Backend - Central control server that:
  - Receives profiling requests via REST API
  - Manages profiling commands for hosts/services
  - Responds to agent heartbeats with pending commands
  - Tracks command execution status
- gProfiler Agent - Profiling agent that:
  - Sends periodic heartbeats to the backend
  - Receives and executes profiling commands
  - Ensures idempotent command execution
  - Reports command completion status
Backend features:
- REST API for submitting profiling requests
- Heartbeat endpoint for agent communication
- Command merging for multiple requests targeting same host
- Process-level and host-level stop commands
- Idempotent command execution using unique command IDs
- Command completion tracking
- PerfSpect integration for hardware metrics collection
Agent features:
- Heartbeat communication with configurable intervals
- Dynamic profiling based on server commands
- Command-driven execution (start/stop profiling)
- Idempotency to prevent duplicate command execution
- Persistent command tracking across agent restarts
- Graceful error handling and retry logic
- PerfSpect auto-installation for hardware metrics collection
- Hardware metrics integration with CPU profiling data
POST /api/metrics/profile_request
Request Body:
{
  "service_name": "my-service",
  "command_type": "start",                // "start" or "stop"
  "duration": 60,
  "frequency": 11,
  "profiling_mode": "cpu",
  "target_hostnames": ["host1", "host2"],
  "pids": [1234, 5678],                   // Optional: specific PIDs
  "stop_level": "process",                // "process" or "host" (for stop commands)
  "additional_args": {
    "enable_perfspect": true              // Optional: enable hardware metrics collection
  }
}
Response:
{
  "success": true,
  "message": "Start profiling request submitted successfully",
  "request_id": "req-uuid",
  "command_id": "cmd-uuid",
  "estimated_completion_time": "2025-01-08T12:00:00Z"
}
POST /api/metrics/heartbeat
Request Body:
{
  "ip_address": "192.168.1.100",
  "hostname": "worker-01",
  "service_name": "my-service",
  "last_command_id": "cmd-uuid",
  "available_pids": {"java": {}, "python": {}},
  "namespaces": [
    {
      "namespace": "kube_system",
      "pods": [
        {
          "pod_name": "gprofiler",
          "containers": [
            {"pid": 123, "name": "metrics-exporter"},
            {"pid": 123, "name": "metrics-exporter"}
          ]
        },
        {
          "pod_name": "webapp",
          "containers": [
            {"pid": 123, "name": "metrics-exporter"},
            {"pid": 123, "name": "metrics-exporter"}
          ]
        }
      ]
    }
  ],
  "status": "active",
  "timestamp": "2025-01-08T11:00:00Z"
}
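The host lookup tables for containers, pods and namespaces can be derived directly from the namespaces field of a heartbeat body like the one above. The following is a minimal sketch, assuming in-memory dictionaries in place of database tables; the function name index_heartbeat and the table keys are hypothetical and not part of the backend.

```python
from collections import defaultdict


def index_heartbeat(hostname, namespaces, tables=None):
    """Fold one heartbeat's namespace hierarchy into three name -> hosts maps."""
    if tables is None:
        tables = {
            "namespace": defaultdict(set),
            "pod": defaultdict(set),
            "container": defaultdict(set),
        }
    for ns in namespaces:
        tables["namespace"][ns["namespace"]].add(hostname)
        for pod in ns.get("pods", []):
            tables["pod"][pod["pod_name"]].add(hostname)
            for container in pod.get("containers", []):
                tables["container"][container["name"]].add(hostname)
    return tables


# Sample hierarchy shaped like the "namespaces" field of the heartbeat body.
heartbeat_namespaces = [
    {
        "namespace": "kube_system",
        "pods": [
            {"pod_name": "gprofiler", "containers": [{"pid": 123, "name": "metrics-exporter"}]},
            {"pod_name": "webapp", "containers": [{"pid": 123, "name": "metrics-exporter"}]},
        ],
    }
]

tables = index_heartbeat("worker-01", heartbeat_namespaces)
tables = index_heartbeat("worker-02", heartbeat_namespaces, tables)
print(sorted(tables["pod"]["webapp"]))  # ['worker-01', 'worker-02']
```

Using sets keeps repeated heartbeats from the same host from producing duplicate entries, which matters because every heartbeat resends the full hierarchy.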
"containers" -> "host" Table -> {container_name, array_of_hosts}
"pod" -> "host" Table -> {pod_name, array_of_hosts}
"namespace" -> "host" Table -> {namespace, array_of_hosts}
1. Add k8s namespace hierarchy info to the heartbeat.
2. Save k8s information in the hostheartbeats table and create de-normalized tables for containersToHosts, podsToHost and namespaceToHosts.
3. Perform profiling: support profiling requests by namespaces, pods and containers ( 5 )
4. Test e2e ( 3 )
Response:
{
  "success": true,
  "message": "Heartbeat received. New profiling command available.",
  "profiling_command": {
    "command_type": "start",
    "combined_config": {
      "duration": 60,
      "frequency": 11,
      "profiling_mode": "cpu",
      "pids": ""
    }
  },
  "command_id": "cmd-uuid"
}
POST /api/metrics/command_completion
Request Body:
{
  "command_id": "cmd-uuid",
  "hostname": "worker-01",
  "status": "completed",                  // "completed" or "failed"
  "execution_time": 65,
  "error_message": null,
  "results_path": "s3://bucket/path/to/results"
}
The heartbeat system supports Intel PerfSpect integration for collecting hardware performance metrics alongside CPU profiling data. This feature enables comprehensive performance analysis by combining software-level profiling with hardware-level metrics.
When enable_perfspect: true is included in the additional_args of a profiling request, the gProfiler agent will:
- Auto-install PerfSpect: Downloads and extracts the latest PerfSpect binary from GitHub releases
- Configure hardware collection: Enables the --enable-hw-metrics-collection flag
- Set PerfSpect path: Configures --perfspect-path to the auto-installed binary
- Collect metrics: Runs PerfSpect alongside CPU profiling to gather hardware metrics
When the agent receives a heartbeat response with enable_perfspect: true in the combined_config:
# Agent processes the configuration
if combined_config.get("enable_perfspect", False):
    new_args.collect_hw_metrics = True
    # Auto-install PerfSpect
    from gprofiler.perfspect_installer import get_or_install_perfspect
    perfspect_path = get_or_install_perfspect()
    if perfspect_path:
        new_args.tool_perfspect_path = str(perfspect_path)
        logger.info(f"PerfSpect auto-installed at: {perfspect_path}")

- Download: Fetches perfspect.tgz from https://github.com/intel/PerfSpect/releases/latest/download/perfspect.tgz
- Extract: Unpacks to /tmp/gprofiler_perfspect/perfspect/
- Verify: Checks binary exists and is executable
- Configure: Sets path for gProfiler to use
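The download, extract and verify steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code in gprofiler.perfspect_installer: the function names install_from_tarball and get_or_install are hypothetical, and the real installer may handle caching, timeouts and errors differently. Only the URL and the install directory come from this document.

```python
import os
import tarfile
import urllib.request
from pathlib import Path

PERFSPECT_URL = "https://github.com/intel/PerfSpect/releases/latest/download/perfspect.tgz"
INSTALL_DIR = Path("/tmp/gprofiler_perfspect")


def install_from_tarball(tarball, install_dir):
    """Extract the archive and return the binary path, or None if it is unusable."""
    install_dir = Path(install_dir)
    install_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "r:gz") as archive:
        archive.extractall(install_dir)
    binary = install_dir / "perfspect" / "perfspect"
    # Verify: the binary must exist and be executable.
    if binary.is_file() and os.access(binary, os.X_OK):
        return binary
    return None


def get_or_install(url=PERFSPECT_URL, install_dir=INSTALL_DIR):
    """Reuse an existing binary, otherwise download and extract a new one."""
    install_dir = Path(install_dir)
    binary = install_dir / "perfspect" / "perfspect"
    if binary.is_file() and os.access(binary, os.X_OK):
        return binary
    install_dir.mkdir(parents=True, exist_ok=True)
    tarball = install_dir / "perfspect.tgz"
    try:
        urllib.request.urlretrieve(url, tarball)
        return install_from_tarball(tarball, install_dir)
    except (OSError, tarfile.TarError):
        # Network or archive failure: the caller disables hardware metrics.
        return None
```

Returning None instead of raising mirrors the agent snippet above, which only sets the PerfSpect path when the installer returns a usable value.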
PerfSpect runs with the following command:
/tmp/gprofiler_perfspect/perfspect/perfspect metrics \
  --duration 60 \
  --output /tmp/perfspect_data

When PerfSpect is enabled, additional files are generated:
- Hardware Metrics CSV: /tmp/perfspect_data/{hostname}_metrics.csv
- Hardware Summary CSV: /tmp/perfspect_data/{hostname}_metrics_summary.csv
- Hardware HTML Report: /tmp/perfspect_data/{hostname}_metrics_summary.html
- Latest Metrics: /tmp/perfspect_data/{hostname}_metrics_summary_latest.csv
- Latest HTML: /tmp/perfspect_data/{hostname}_metrics_summary_latest.html
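When checking a run, it can help to compute the expected file names for a host and see which ones are missing. A small sketch, assuming the naming pattern listed above holds; the helper names expected_outputs and missing_outputs are hypothetical.

```python
from pathlib import Path

# File name suffixes taken from the list of generated files.
SUFFIXES = [
    "_metrics.csv",
    "_metrics_summary.csv",
    "_metrics_summary.html",
    "_metrics_summary_latest.csv",
    "_metrics_summary_latest.html",
]


def expected_outputs(hostname, data_dir="/tmp/perfspect_data"):
    """Return the paths PerfSpect is expected to write for this host."""
    return [Path(data_dir) / f"{hostname}{suffix}" for suffix in SUFFIXES]


def missing_outputs(hostname, data_dir="/tmp/perfspect_data"):
    """Return the expected paths that do not exist yet."""
    return [path for path in expected_outputs(hostname, data_dir) if not path.exists()]
```

For example, missing_outputs("worker-01") returns an empty list once all five files for worker-01 are present.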
curl -X POST http://localhost:8000/api/metrics/profile_request \
-H "Content-Type: application/json" \
-d '{
"service_name": "web-service",
"command_type": "start",
"duration": 60,
"frequency": 11,
"profiling_mode": "cpu",
"target_hostnames": ["worker-01", "worker-02"],
"additional_args": {
"enable_perfspect": true
}
}'
The agent receives the following combined_config in heartbeat responses:
{
"duration": 60,
"frequency": 11,
"continuous": true,
"command_type": "start",
"profiling_mode": "cpu",
"enable_perfspect": true
}
- Platform: Linux x86_64 (PerfSpect requirement)
- Permissions: Root access for hardware performance counter access
- Network: Internet access to download PerfSpect binary
- Storage: ~50MB for PerfSpect installation and data files
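These requirements can be checked before enabling hardware metrics. The sketch below is a hypothetical preflight helper, not something the agent is known to ship; it covers platform, privileges and free space, and leaves the network check to the download step. The parameters exist so that the checks can be exercised without changing the host.

```python
import os
import platform
import shutil

MIN_FREE_BYTES = 50 * 1024 * 1024  # ~50MB, as listed in the requirements


def perfspect_preflight(system=None, machine=None, euid=None, free_bytes=None, data_dir="/tmp"):
    """Return a list of unmet requirements; an empty list means all checks passed."""
    system = system or platform.system()
    machine = machine or platform.machine()
    euid = os.geteuid() if euid is None else euid
    if free_bytes is None:
        free_bytes = shutil.disk_usage(data_dir).free

    problems = []
    if system != "Linux" or machine != "x86_64":
        problems.append(f"unsupported platform: {system} {machine} (need Linux x86_64)")
    if euid != 0:
        problems.append("root access is required for hardware performance counters")
    if free_bytes < MIN_FREE_BYTES:
        problems.append("less than ~50MB free for PerfSpect installation and data files")
    return problems
```

Calling perfspect_preflight() with no arguments inspects the current host; passing explicit values makes it easy to test each failure path.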
- Permission Denied: Ensure agent runs with sufficient privileges
  sudo ./gprofiler --enable-heartbeat-server ...
- Download Failures: Check network connectivity and GitHub access
  curl -I https://github.com/intel/PerfSpect/releases/latest/download/perfspect.tgz
- Binary Not Found: Verify installation directory permissions
  ls -la /tmp/gprofiler_perfspect/perfspect/
Enable verbose logging to see PerfSpect installation and execution details:
./gprofiler --enable-heartbeat-server --verbose
Look for log messages:
- PerfSpect auto-installed at: /path/to/binary
- Using perfspect path: /path/to/binary
- Failed to auto-install PerfSpect, hardware metrics disabled
curl -X POST http://localhost:8000/api/metrics/profile_request \
-H "Content-Type: application/json" \
-d '{
  "service_name": "web-service",
  "command_type": "start",
  "duration": 120,
  "frequency": 11,
  "profiling_mode": "cpu",
  "target_hostnames": ["web-01", "web-02"],
  "containers": [],
  "pods": [],
  "namespaces": []
}'
curl -X POST http://localhost:8000/api/metrics/profile_request \
-H "Content-Type: application/json" \
-d '{
"service_name": "web-service",
"command_type": "stop",
"stop_level": "host",
"target_hostnames": ["web-01"]
}'
Basic heartbeat mode:
python gprofiler/main.py \
--enable-heartbeat-server \
--upload-results \
--token "your-token" \
--service-name "web-service" \
--api-server "http://performance-studio:8000" \
--heartbeat-interval 30 \
--output-dir /tmp/profiles \
--verbose
Production deployment with all optimizations:
# Set environment variables first
export GPROFILER_TOKEN="my_token"
export GPROFILER_SERVICE="your-service-name"
export GPROFILER_SERVER="http://localhost:8080"
# Production command (can also source /opt/gprofiler/envs.sh for variables)
/opt/gprofiler/gprofiler \
-u \
--token=$GPROFILER_TOKEN \
--service-name=$GPROFILER_SERVICE \
--server-host $GPROFILER_SERVER \
--dont-send-logs \
--server-upload-timeout 10 \
-c \
--disable-metrics-collection \
--java-safemode= \
-d 60 \
--java-no-version-check
- Command Generation: Each profiling request generates a unique command_id
- Command Merging: Multiple requests for the same host are merged into a single command
- Stop Handling:
  - Process-level stops remove specific PIDs from commands
  - Host-level stops terminate all profiling for the host
- Heartbeat Response: Returns pending commands with command_type and configuration
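The merging and stop rules can be illustrated with plain dictionaries. This is a sketch under assumptions the document does not confirm: a merged command gets a fresh command_id, PIDs are unioned, the longest duration and highest frequency win, and a process-level stop that removes the last PID becomes a full stop. The helper names are hypothetical.

```python
import uuid


def new_command(command_type, **config):
    """Create a command with a unique ID."""
    return {"command_id": str(uuid.uuid4()), "command_type": command_type, **config}


def merge_start(pending, request):
    """Merge a start request into the host's pending start command, if any."""
    pids = set(request.get("pids", []))
    duration = request["duration"]
    frequency = request["frequency"]
    if pending is not None and pending["command_type"] == "start":
        pids |= set(pending["pids"])
        duration = max(duration, pending["duration"])    # assumption: longest wins
        frequency = max(frequency, pending["frequency"])  # assumption: highest wins
    return new_command(
        "start",
        duration=duration,
        frequency=frequency,
        profiling_mode=request["profiling_mode"],
        pids=sorted(pids),
    )


def apply_stop(pending, request):
    """Apply a stop request: trim PIDs for process level, stop everything otherwise."""
    is_process_stop = request.get("stop_level") == "process"
    if is_process_stop and pending is not None and pending["command_type"] == "start":
        stopped = set(request.get("pids", []))
        remaining = [pid for pid in pending["pids"] if pid not in stopped]
        if remaining:
            return new_command(
                "start",
                duration=pending["duration"],
                frequency=pending["frequency"],
                profiling_mode=pending["profiling_mode"],
                pids=remaining,
            )
    return new_command("stop")
```

Issuing a new ID on every merge means an agent that already ran the earlier command still sees the merged one as new work, which is what lets ID-based idempotency on the agent side coexist with merging.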
- Heartbeat Loop: Sends heartbeats at configured intervals
- Command Processing:
  - start: Stop current profiler (if any) and start a new one with the given config
  - stop: Stop current profiler without starting a new one
- Idempotency: Track executed command IDs to prevent duplicates
- Persistence: Save executed command IDs to disk for restart resilience
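Idempotency with restart resilience comes down to a set of executed command IDs that is written to disk after every command. A minimal sketch, assuming a JSON file as the store; the class name ExecutedCommands and the file name are hypothetical, and the agent's actual storage format is not described in this document.

```python
import json
import tempfile
from pathlib import Path


class ExecutedCommands:
    """Remember executed command IDs across agent restarts."""

    def __init__(self, path):
        self.path = Path(path)
        try:
            self.ids = set(json.loads(self.path.read_text()))
        except (FileNotFoundError, json.JSONDecodeError):
            # First run or corrupted file: start with an empty history.
            self.ids = set()

    def should_run(self, command_id):
        return command_id not in self.ids

    def mark_done(self, command_id):
        self.ids.add(command_id)
        # Write to a temporary file, then rename, so a crash cannot leave a partial file.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(sorted(self.ids)))
        tmp.replace(self.path)


state_file = Path(tempfile.mkdtemp()) / "executed_commands.json"
tracker = ExecutedCommands(state_file)
if tracker.should_run("cmd-uuid"):
    tracker.mark_done("cmd-uuid")

restarted = ExecutedCommands(state_file)  # simulates an agent restart
print(restarted.should_run("cmd-uuid"))  # False
```

A long-running agent would also need to prune old IDs so the file does not grow without bound; that is left out here.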
1. User submits profiling request to backend
↓
2. Backend creates command with unique ID
↓
3. Agent sends heartbeat to backend
↓
4. Backend responds with pending command
↓
5. Agent executes command (start/stop profiling)
↓
6. Agent reports completion to backend
↓
7. Backend updates command status
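Steps 3 to 6 of this flow, seen from the agent, can be sketched as a single function with the HTTP transport injected. The endpoint paths and field names come from the API examples in this document; the function heartbeat_cycle, its parameters, and the omission of ip_address and timestamp from the heartbeat body are simplifications for illustration.

```python
import time


def heartbeat_cycle(post, hostname, service_name, last_command_id, execute):
    """Run one heartbeat round trip and return the latest handled command ID.

    post(path, body) sends JSON and returns the decoded response;
    execute(command_type, config) starts or stops the profiler.
    """
    response = post("/api/metrics/heartbeat", {
        "hostname": hostname,
        "service_name": service_name,
        "last_command_id": last_command_id,
        "status": "active",
    })
    command = response.get("profiling_command")
    command_id = response.get("command_id")
    if not command or command_id == last_command_id:
        return last_command_id  # nothing new to do

    started = time.monotonic()
    try:
        execute(command["command_type"], command.get("combined_config", {}))
        status, error = "completed", None
    except Exception as exc:
        # Report the failure instead of breaking the heartbeat loop.
        status, error = "failed", str(exc)

    post("/api/metrics/command_completion", {
        "command_id": command_id,
        "hostname": hostname,
        "status": status,
        "execution_time": int(time.monotonic() - started),
        "error_message": error,
    })
    return command_id
```

Injecting post and execute keeps the protocol logic testable without a running backend or profiler; a real agent would call this in a loop and sleep for the heartbeat interval between rounds.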
- Database connection for command storage
- API endpoints for profiling control
- Command merging and deduplication logic
--enable-heartbeat-server # Enable heartbeat mode
--heartbeat-interval 30 # Heartbeat frequency (seconds)
--api-server URL # Backend server URL
--upload-results # Required for heartbeat mode
--token TOKEN # Authentication token
--service-name NAME # Service identifier
- test_heartbeat_system.py - Test backend API and heartbeat flow
- run_heartbeat_agent.py - Run agent in heartbeat mode for testing
- Start Performance Studio backend
- Run test agent: python run_heartbeat_agent.py
- Submit test commands: python test_heartbeat_system.py
- Verify agent receives and executes commands
- Check idempotency and error handling
- Validates profiling request parameters
- Handles database connection errors
- Returns appropriate HTTP status codes
- Logs all operations for debugging
- Retries failed heartbeats with backoff
- Continues heartbeat loop on command execution errors
- Persists executed command IDs across restarts
- Graceful shutdown on termination signals
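Retrying failed heartbeats with backoff can be sketched as follows. The base delay, growth factor, cap and attempt limit are illustrative assumptions, since the document does not state the agent's actual values, and the function names are hypothetical.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield exponentially growing delays, limited to cap seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def send_with_retry(send, max_attempts=5, sleep=time.sleep, base=1.0):
    """Call send() until it succeeds, sleeping between failed attempts."""
    delays = backoff_delays(base=base)
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except OSError:
            # Connection errors and timeouts are OSError subclasses.
            if attempt == max_attempts:
                raise
            sleep(next(delays))
```

With the defaults, a heartbeat that keeps failing is retried after 1, 2, 4 and 8 seconds before the error is raised to the caller, which can then log it and continue with the next scheduled heartbeat.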
- Authentication: Token-based authentication for agent-backend communication
- Authorization: Service-based access control for profiling commands
- Command Validation: Validate all command parameters before execution
- Rate Limiting: Prevent abuse of profiling requests
- Audit Logging: Track all profiling activities for compliance
- Real-time Status: WebSocket connection for real-time agent status
- Command Scheduling: Schedule profiling commands for future execution
- Resource Monitoring: Check system resources before starting profiling
- Multi-tenant Support: Isolation between different services/teams
- Command Prioritization: Priority queues for urgent profiling requests
- Distributed Coordination: Coordinate profiling across multiple agents
- Agent not receiving commands
  - Check network connectivity to backend
  - Verify authentication token
  - Check service name matching
- Commands not executing
  - Check agent logs for errors
  - Verify command parameters are valid
  - Check system permissions for profiling
- Duplicate commands
  - Verify idempotency implementation
  - Check command ID persistence
  - Review heartbeat timing
- PerfSpect hardware metrics not working
  - Ensure Linux x86_64 platform (PerfSpect requirement)
  - Verify root/sudo permissions for hardware counters
  - Check internet connectivity for auto-installation
  - Look for "PerfSpect auto-installed" or "Failed to auto-install" log messages
  - Verify the /tmp/gprofiler_perfspect/perfspect/perfspect binary exists and is executable
- Enable verbose logging: --verbose
- Check heartbeat logs: /tmp/gprofiler-heartbeat.log
- Monitor backend API logs
- Use test scripts to isolate issues
- For PerfSpect issues:
  - Check PerfSpect installation: ls -la /tmp/gprofiler_perfspect/perfspect/
  - Test PerfSpect manually: /tmp/gprofiler_perfspect/perfspect/perfspect --help
  - Check PerfSpect data directory: ls -la /tmp/perfspect_data/
  - Monitor hardware metrics collection in agent logs
- Linux system (x86_64 or Aarch64)
- Python 3.10+ for source builds
- Docker for containerized builds
- 16GB+ RAM for full builds
- Root access for profiling operations
cd gprofiler
# Full build (takes 20-30 minutes, builds all profilers from source)
./scripts/build_x86_64_executable.sh
# Fast build (for development, skips some optimizations)
./scripts/build_x86_64_executable.sh --fast
The executable will be created at build/x86_64/gprofiler.
./scripts/build_x86_64_container.sh -t gprofiler
# Install dependencies
pip3 install -r requirements.txt
# Copy required resources
./scripts/copy_resources_from_image.sh
# Run directly from source (requires root)
sudo python3 -m gprofiler [options]
# Make executable and run basic profiling
chmod +x build/x86_64/gprofiler
sudo ./build/x86_64/gprofiler -o /tmp/gprofiler-output -d 30
# Set environment variables
export GPROFILER_TOKEN="my_token"
export GPROFILER_SERVICE="your-service-name"
export GPROFILER_SERVER="http://localhost:8080"
# Run with production flags
sudo ./build/x86_64/gprofiler \
-u \
--token=$GPROFILER_TOKEN \
--service-name=$GPROFILER_SERVICE \
--server-host $GPROFILER_SERVER \
--dont-send-logs \
--server-upload-timeout 10 \
-c \
--disable-metrics-collection \
--java-safemode= \
-d 60 \
--java-no-version-check
# Run agent in heartbeat mode for testing
sudo ./build/x86_64/gprofiler \
--enable-heartbeat-server \
--upload-results \
--token=$GPROFILER_TOKEN \
--service-name=$GPROFILER_SERVICE \
--api-server $GPROFILER_SERVER \
--heartbeat-interval 30 \
--output-dir /tmp/profiles \
--dont-send-logs \
--server-upload-timeout 10 \
--disable-metrics-collection \
--java-safemode= \
--java-no-version-check \
--verbose
# Test PerfSpect integration manually (Linux x86_64 only)
sudo ./build/x86_64/gprofiler \
--enable-hw-metrics-collection \
--perfspect-path /path/to/perfspect \
--perfspect-duration 60 \
--output-dir /tmp/profiles \
--duration 60 \
--verbose
-u, --upload-results # Upload results to Performance Studio
--token=$GPROFILER_TOKEN # Authentication token
--service-name=$GPROFILER_SERVICE # Service identifier
--server-host $GPROFILER_SERVER # Performance Studio backend URL
--dont-send-logs # Disable log transmission
--server-upload-timeout 10 # Upload timeout (seconds)
-c, --continuous # Continuous profiling mode
--disable-metrics-collection # Disable system metrics collection
--java-safemode= # Disable Java safe mode (empty value)
-d 60 # Profiling duration (seconds)
--java-no-version-check # Skip Java version check
--enable-heartbeat-server # Enable heartbeat communication
--heartbeat-interval 30 # Heartbeat frequency (seconds)
--api-server URL # Heartbeat API server URL
-o, --output-dir PATH # Local output directory
--verbose # Enable verbose logging
# PerfSpect Hardware Metrics Options (Linux x86_64 only)
--enable-hw-metrics-collection # Enable hardware metrics via PerfSpect
--perfspect-path PATH # Path to PerfSpect binary (auto-installed in heartbeat mode)
--perfspect-duration SECONDS # PerfSpect collection duration (default: 60)
- Build: ./scripts/build_x86_64_executable.sh --fast
- Test locally: sudo ./build/x86_64/gprofiler -o /tmp/results -d 30
- View results: Open /tmp/results/last_flamegraph.html in browser
- Test heartbeat: Run with the --enable-heartbeat-server flag
- Build fails: Ensure 16GB+ RAM available
- Permission errors: Run profiling commands with sudo
- Docker issues: Ensure Docker daemon is running
- Missing dependencies: Install build requirements with package manager