Task 10: Monitoring and Observability - Quick Reference

✅ COMPLETED

Metrics Service: Performance tracking for all stages
Health Service: Multi-component health checks
Monitoring API: 7 REST endpoints
Tests: 159/159 passing (including 6 new metrics tests)
Build: Clean, no errors


API Endpoints (Quick Reference)

| Endpoint | Purpose |
| --- | --- |
| GET /api/v1/monitoring/health | Full system health status |
| GET /api/v1/monitoring/health/simple | Simple health check (for load balancers) |
| GET /api/v1/monitoring/metrics | System-wide metrics |
| GET /api/v1/monitoring/metrics/stages | All stage metrics |
| GET /api/v1/monitoring/metrics/stages/:stage | Specific stage metrics |
| GET /api/v1/monitoring/performance | Performance summary |
| POST /api/v1/monitoring/metrics/reset | Reset metrics (admin) |

Common Tasks

Check System Health

curl http://localhost:3000/api/v1/monitoring/health

Get Performance Metrics

curl http://localhost:3000/api/v1/monitoring/metrics

Get Stage Metrics

curl http://localhost:3000/api/v1/monitoring/metrics/stages/embedding

Simple Health Check

curl http://localhost:3000/api/v1/monitoring/health/simple
# Returns 200 if healthy, 503 if unhealthy

Health Status Values

| Status | Meaning |
| --- | --- |
| healthy | All services up |
| degraded | Some services slow, none down |
| unhealthy | One or more services down |
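The mapping from per-service status to overall status can be sketched as follows. This is a minimal illustration of the rules in the table, not the code in health.service.ts; the type and function names are hypothetical.

```typescript
// Hypothetical types mirroring the documented status values.
type ServiceState = 'up' | 'degraded' | 'down';
type OverallHealth = 'healthy' | 'degraded' | 'unhealthy';

// Derives the overall status from individual service statuses.
function deriveOverallHealth(states: ServiceState[]): OverallHealth {
  // One or more services down: unhealthy.
  if (states.some((s) => s === 'down')) return 'unhealthy';
  // None down, but at least one slow: degraded.
  if (states.some((s) => s === 'degraded')) return 'degraded';
  // All services up (an empty list also lands here in this sketch).
  return 'healthy';
}
```

"down" takes priority over "degraded", so a system with one slow and one unreachable service reports unhealthy.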

Service Status Values

| Status | Meaning |
| --- | --- |
| up | Service responding normally |
| degraded | Service responding slowly (>1000 ms) |
| down | Service not responding |
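A single probe result can be classified along these lines. The 1000 ms threshold comes from the table; the function name, the constant name, and the use of null for a failed probe are assumptions made for this sketch.

```typescript
type ProbeState = 'up' | 'degraded' | 'down';

// Threshold from the status table; the constant name is hypothetical.
const SLOW_RESPONSE_MS = 1000;

// responseTimeMs is null when the probe failed or timed out (assumed convention).
function classifyProbe(responseTimeMs: number | null): ProbeState {
  if (responseTimeMs === null) return 'down';
  // Strictly greater than the threshold counts as slow.
  return responseTimeMs > SLOW_RESPONSE_MS ? 'degraded' : 'up';
}
```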

Metrics Overview

System Metrics

  • totalProcessed: Total documents processed (24h)
  • successfulProcessed: Successfully processed
  • failedProcessed: Failed processing
  • averageProcessingTime: Average duration (ms)
  • errorRate: Failure rate (0-1)
  • processedLastHour: Documents in last hour
  • commonErrors: Error type frequencies
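As a rough illustration of how the numeric fields above relate to each other, the sketch below derives them from a list of processing records. It assumes one record per processed document and that the list is already limited to the 24-hour window; the interface and function names are hypothetical and may not match metrics.service.ts.

```typescript
// Hypothetical record shape, modelled on the fields passed to recordProcessing.
interface ProcessingRecord {
  success: boolean;
  duration: number; // ms
  timestamp: Date;
}

interface SystemSnapshot {
  totalProcessed: number;
  successfulProcessed: number;
  failedProcessed: number;
  averageProcessingTime: number; // ms
  errorRate: number; // 0-1
  processedLastHour: number;
}

function summarizeSystem(records: ProcessingRecord[], nowMs: number): SystemSnapshot {
  const HOUR_MS = 60 * 60 * 1000;
  let successful = 0;
  let totalDuration = 0;
  let lastHour = 0;
  for (const record of records) {
    if (record.success) successful++;
    totalDuration += record.duration;
    if (record.timestamp.getTime() >= nowMs - HOUR_MS) lastHour++;
  }
  const total = records.length;
  const failed = total - successful;
  return {
    totalProcessed: total,
    successfulProcessed: successful,
    failedProcessed: failed,
    // Guard against division by zero when nothing has been processed yet.
    averageProcessingTime: total === 0 ? 0 : totalDuration / total,
    errorRate: total === 0 ? 0 : failed / total,
    processedLastHour: lastHour,
  };
}
```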

Stage Metrics

  • totalExecutions: Times stage executed
  • successfulExecutions: Successful runs
  • failedExecutions: Failed runs
  • averageDuration: Average time (ms)
  • minDuration: Fastest time (ms)
  • maxDuration: Slowest time (ms)
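The per-stage fields can be computed in a single pass over the recorded runs. The following is a sketch under assumed names (StageRun, StageSummary, summarizeStage); returning zeros for a stage with no runs is a choice made here, not documented behavior.

```typescript
// Hypothetical shapes for illustration.
interface StageRun {
  stage: string;
  duration: number; // ms
  success: boolean;
}

interface StageSummary {
  totalExecutions: number;
  successfulExecutions: number;
  failedExecutions: number;
  averageDuration: number;
  minDuration: number;
  maxDuration: number;
}

function summarizeStage(runs: StageRun[], stage: string): StageSummary {
  let total = 0;
  let successful = 0;
  let sum = 0;
  let min = Infinity;
  let max = -Infinity;
  for (const run of runs) {
    if (run.stage !== stage) continue; // ignore other stages
    total++;
    if (run.success) successful++;
    sum += run.duration;
    if (run.duration < min) min = run.duration;
    if (run.duration > max) max = run.duration;
  }
  return {
    totalExecutions: total,
    successfulExecutions: successful,
    failedExecutions: total - successful,
    // A stage with no runs reports zeros rather than NaN or Infinity.
    averageDuration: total === 0 ? 0 : sum / total,
    minDuration: total === 0 ? 0 : min,
    maxDuration: total === 0 ? 0 : max,
  };
}
```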

Files Created

| File | Lines | Purpose |
| --- | --- | --- |
| src/services/metrics.service.ts | 285 | Metrics collection |
| src/services/health.service.ts | 225 | Health checks |
| src/routes/monitoring.routes.ts | 166 | API endpoints |
| src/services/metrics.service.test.ts | 142 | Tests |

Total: 818 lines


MetricsService Methods

| Method | Purpose |
| --- | --- |
| recordProcessing(metric) | Record processing event |
| getSystemMetrics() | Get system-wide metrics |
| getStageMetrics(stage) | Get stage-specific metrics |
| getAllStageMetrics() | Get all stages |
| getPerformanceSummary() | Get performance overview |
| getUptime() | Get uptime in seconds |
| reset() | Clear all metrics |

HealthService Methods

| Method | Purpose |
| --- | --- |
| checkDatabase() | Check PostgreSQL health |
| checkRedis() | Check Redis health |
| checkVectorDb() | Check Qdrant health |
| checkQueue() | Check queue health |
| getSystemHealth() | Get overall health status |
| getHealthCheckResult() | Get simple health result |

Worker Integration

Processing workers automatically record the following for each stage:

  • Stage name
  • Processing duration
  • Success/failure status
  • Error type (if failed)
  • Document ID
  • Timestamp
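One way a worker could capture these fields is to wrap each stage in a timing helper. The sketch below is synchronous for brevity and is not the actual worker code: runTrackedStage, the errorType field name, and the fixed PROCESSING_ERROR label are assumptions, and the record callback stands in for metricsService.recordProcessing.

```typescript
// Hypothetical metric shape based on the tracked fields listed above.
interface StageMetric {
  stage: string;
  duration: number; // ms
  success: boolean;
  timestamp: Date;
  documentId: string;
  errorType?: string; // assumed field name
}

// Runs `work`, records one metric for the attempt, and passes the result (or error) through.
function runTrackedStage<T>(
  stage: string,
  documentId: string,
  work: () => T,
  record: (metric: StageMetric) => void
): T {
  const startedAt = Date.now();
  let result: T;
  try {
    result = work();
  } catch (err) {
    // A real worker would map the error to a specific error type;
    // this sketch always uses PROCESSING_ERROR.
    record({
      stage,
      documentId,
      success: false,
      duration: Date.now() - startedAt,
      timestamp: new Date(),
      errorType: 'PROCESSING_ERROR',
    });
    throw err; // leave retry handling to the caller
  }
  record({
    stage,
    documentId,
    success: true,
    duration: Date.now() - startedAt,
    timestamp: new Date(),
  });
  return result;
}
```

An asynchronous worker would follow the same shape with an awaited `work()` call.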

Monitoring Dashboard Integration

Health Widget

fetch('/api/v1/monitoring/health')
  .then(r => r.json())
  .then(health => {
    // Display overall status
    document.getElementById('status').textContent = health.status;
    
    // Display service statuses
    Object.entries(health.services).forEach(([name, status]) => {
      // Show service status indicator
    });
  });

Metrics Widget

fetch('/api/v1/monitoring/metrics')
  .then(r => r.json())
  .then(metrics => {
    // Compute success rate; guard against division by zero
    // when no documents have been processed yet
    const successRate = metrics.totalProcessed > 0
      ? metrics.successfulProcessed / metrics.totalProcessed
      : 0;
    
    // Display error rate
    document.getElementById('error-rate').textContent = 
      (metrics.errorRate * 100).toFixed(2) + '%';
      
    // Display stage performance
    Object.values(metrics.stageMetrics).forEach(stage => {
      // Show stage duration chart
    });
  });

Structured Logging

Already in place via Winston:

logger.info('Processing stage', { 
  documentId, 
  stage, 
  jobId 
});

logger.error('Stage processing failed', { 
  error, 
  documentId, 
  stage 
});

Log Files:

  • logs/app.log: All logs
  • logs/error.log: Errors only
  • Rotation: up to 5 files of 10 MB each

Metrics Retention

  • Period: 24 hours
  • Cleanup: Automatic on each recording
  • Reset: Manual via API endpoint
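"Automatic on each recording" suggests that old entries are pruned whenever a new one is added. A minimal sketch of that pattern, assuming entries carry a Date timestamp; the RetainedLog class and its method names are hypothetical and not taken from the service.

```typescript
// 24-hour retention period, as stated above.
const RETENTION_MS = 24 * 60 * 60 * 1000;

interface TimedEntry {
  timestamp: Date;
}

// Hypothetical in-memory store that prunes expired entries on every add.
class RetainedLog<T extends TimedEntry> {
  private entries: T[] = [];

  add(entry: T, nowMs: number = Date.now()): void {
    this.entries.push(entry);
    // Drop everything older than the retention window.
    const cutoff = nowMs - RETENTION_MS;
    this.entries = this.entries.filter((e) => e.timestamp.getTime() >= cutoff);
  }

  // Manual reset, comparable to calling the reset endpoint.
  reset(): void {
    this.entries = [];
  }

  size(): number {
    return this.entries.length;
  }
}
```

Pruning on write keeps the store bounded without a background timer, at the cost of a linear scan per recording.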

Performance Summary Response

{
  "uptime": 3600,
  "totalProcessed": 1000,
  "successRate": 0.98,
  "averageProcessingTime": 2500,
  "slowestStage": {
    "stage": "embedding",
    "duration": 3500
  },
  "fastestStage": {
    "stage": "chunking",
    "duration": 1200
  }
}
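The slowestStage and fastestStage fields could be derived from per-stage averages roughly as follows. This sketch assumes the reported duration is each stage's average duration in milliseconds, which the response above does not state explicitly; the names are hypothetical.

```typescript
interface StageAverage {
  stage: string;
  duration: number; // assumed: average duration in ms
}

interface StageExtremes {
  slowestStage: StageAverage | null;
  fastestStage: StageAverage | null;
}

// Picks the stages with the highest and lowest average duration.
function pickExtremes(averages: Record<string, number>): StageExtremes {
  let slowest: StageAverage | null = null;
  let fastest: StageAverage | null = null;
  for (const stage of Object.keys(averages)) {
    const duration = averages[stage];
    if (slowest === null || duration > slowest.duration) slowest = { stage, duration };
    if (fastest === null || duration < fastest.duration) fastest = { stage, duration };
  }
  // Both fields are null when no stage has run yet.
  return { slowestStage: slowest, fastestStage: fastest };
}
```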

Error Tracking

Common Error Types

  • TIMEOUT_ERROR: Processing timeout
  • PROCESSING_ERROR: Processing failure
  • VALIDATION_ERROR: Input validation
  • NETWORK_ERROR: Network issues
  • DATABASE_ERROR: Database problems

Error Metrics

{
  "commonErrors": {
    "TIMEOUT_ERROR": 10,
    "PROCESSING_ERROR": 8,
    "VALIDATION_ERROR": 2
  }
}
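A commonErrors map like the one above can be built by counting the error type of each failed record. In this sketch the errorType field name and the UNKNOWN_ERROR fallback label are assumptions, not part of the documented API.

```typescript
// Hypothetical record shape; only failed records contribute to the counts.
interface FailureRecord {
  success: boolean;
  errorType?: string; // assumed field name
}

function countErrorTypes(records: FailureRecord[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const record of records) {
    if (record.success) continue;
    // Failures without a type are grouped under a placeholder label.
    const key = record.errorType ?? 'UNKNOWN_ERROR';
    counts[key] = (counts[key] ?? 0) + 1;
  }
  return counts;
}
```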

Testing

# Build
npm run build

# Test (159 tests, including metrics)
npm test

# Test metrics service only
npm test -- --testPathPattern=metrics.service.test.ts

# Start server with monitoring
npm start

# Check health
curl http://localhost:3000/api/v1/monitoring/health

Integration Example

import { metricsService } from './services/metrics.service';
import { healthService } from './services/health.service';

// Record processing event
metricsService.recordProcessing({
  stage: 'chunking',
  duration: 1500,
  success: true,
  timestamp: new Date(),
  documentId: 'doc-123',
});

// Get metrics
const metrics = metricsService.getSystemMetrics();
console.log(`Error rate: ${(metrics.errorRate * 100).toFixed(2)}%`);

// Check health
const health = await healthService.getSystemHealth();
console.log(`System status: ${health.status}`);

Requirement 10 Compliance

AC1: Automatic retry on transient errors (Task 3)
AC2: Log start/success/failure events (Winston)
AC3: Record detailed failure info (ErrorHandler + Metrics)
AC4: Maintain processing state (Task 3)
AC5: Monitor health and performance (Task 10)

Status: All 5 criteria met ✅


Next Steps

✅ Task 10 Complete
➡️ Task 11: Testing and Validation

  • Unit tests for core services
  • Integration tests
  • Performance and load testing

Status: All Task 10 requirements met ✅
Quality: Production-ready with comprehensive monitoring
Tests: 159/159 passing
Next: Ready for Task 11 (Testing and Validation)