- Metrics Service: Performance tracking for all stages
- Health Service: Multi-component health checks
- Monitoring API: 7 REST endpoints
- Tests: 159/159 passing (including 6 new metrics tests)
- Build: Clean, no errors

| Endpoint | Purpose |
|---|---|
| GET /api/v1/monitoring/health | Full system health status |
| GET /api/v1/monitoring/health/simple | Simple health check (for load balancers) |
| GET /api/v1/monitoring/metrics | System-wide metrics |
| GET /api/v1/monitoring/metrics/stages | All stage metrics |
| GET /api/v1/monitoring/metrics/stages/:stage | Specific stage metrics |
| GET /api/v1/monitoring/performance | Performance summary |
| POST /api/v1/monitoring/metrics/reset | Reset metrics (admin) |

```bash
curl http://localhost:3000/api/v1/monitoring/health
curl http://localhost:3000/api/v1/monitoring/metrics
curl http://localhost:3000/api/v1/monitoring/metrics/stages/embedding
curl http://localhost:3000/api/v1/monitoring/health/simple
# Returns 200 if healthy, 503 if unhealthy
```

| Overall status | Meaning |
|---|---|
| healthy | All services up |
| degraded | Some services slow, none down |
| unhealthy | One or more services down |

| Service status | Meaning |
|---|---|
| up | Service responding normally |
| degraded | Service responding slowly (>1000ms) |
| down | Service not responding |
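
The two status tables imply a simple roll-up from per-service status to overall status. The following is a minimal sketch of that rule, not the actual `health.service.ts` code; the function names are hypothetical, and returning 200 for `degraded` is an assumption, since the source only states 200 for healthy and 503 for unhealthy.

```typescript
type ServiceStatus = 'up' | 'degraded' | 'down';
type SystemStatus = 'healthy' | 'degraded' | 'unhealthy';

// Roll per-service statuses up into one overall status:
// any service down -> unhealthy; otherwise any slow service -> degraded.
function overallStatus(statuses: ServiceStatus[]): SystemStatus {
  if (statuses.some(s => s === 'down')) return 'unhealthy';
  if (statuses.some(s => s === 'degraded')) return 'degraded';
  return 'healthy';
}

// HTTP code for the simple health check.
// Assumption: "degraded" still answers 200 so a load balancer keeps routing traffic.
function httpStatusFor(status: SystemStatus): number {
  return status === 'unhealthy' ? 503 : 200;
}
```

With this rule, a single `down` service outweighs any number of `up` or `degraded` ones, which matches the "One or more services down" wording.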

System metrics:

- totalProcessed: Total documents processed (24h)
- successfulProcessed: Successfully processed
- failedProcessed: Failed processing
- averageProcessingTime: Average duration (ms)
- errorRate: Failure rate (0-1)
- processedLastHour: Documents in last hour
- commonErrors: Error type frequencies

Stage metrics:

- totalExecutions: Times stage executed
- successfulExecutions: Successful runs
- failedExecutions: Failed runs
- averageDuration: Average time (ms)
- minDuration: Fastest time (ms)
- maxDuration: Slowest time (ms)
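
To make the stage fields concrete, here is a rough sketch of how they could be derived from raw processing records. The `ProcessingRecord` shape and both function names are assumptions for illustration (the `errorType` field in particular is hypothetical), and the real service may compute these values differently.

```typescript
// Assumed shape of one recorded event; field names other than errorType
// follow the recordProcessing() example in this document.
interface ProcessingRecord {
  stage: string;
  duration: number; // milliseconds
  success: boolean;
  timestamp: Date;
  documentId: string;
  errorType?: string; // hypothetical field name
}

interface StageMetrics {
  totalExecutions: number;
  successfulExecutions: number;
  failedExecutions: number;
  averageDuration: number;
  minDuration: number;
  maxDuration: number;
}

// Aggregate all records for one stage; durations are 0 when the stage never ran.
function summarizeStage(records: ProcessingRecord[], stage: string): StageMetrics {
  const runs = records.filter(r => r.stage === stage);
  const durations = runs.map(r => r.duration);
  const total = runs.length;
  const successful = runs.filter(r => r.success).length;
  return {
    totalExecutions: total,
    successfulExecutions: successful,
    failedExecutions: total - successful,
    averageDuration: total ? durations.reduce((a, b) => a + b, 0) / total : 0,
    minDuration: total ? Math.min(...durations) : 0,
    maxDuration: total ? Math.max(...durations) : 0,
  };
}

// Failure rate in the 0-1 range; 0 when nothing has been recorded.
function errorRate(records: ProcessingRecord[]): number {
  if (records.length === 0) return 0;
  return records.filter(r => !r.success).length / records.length;
}
```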

| File | Lines | Purpose |
|---|---|---|
| src/services/metrics.service.ts | 285 | Metrics collection |
| src/services/health.service.ts | 225 | Health checks |
| src/routes/monitoring.routes.ts | 166 | API endpoints |
| src/services/metrics.service.test.ts | 142 | Tests |

Total: 818 lines

| Metrics service method | Purpose |
|---|---|
| recordProcessing(metric) | Record processing event |
| getSystemMetrics() | Get system-wide metrics |
| getStageMetrics(stage) | Get stage-specific metrics |
| getAllStageMetrics() | Get all stages |
| getPerformanceSummary() | Get performance overview |
| getUptime() | Get uptime in seconds |
| reset() | Clear all metrics |
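
A performance overview of the kind `getPerformanceSummary()` returns includes a slowest and a fastest stage. One plausible way to pick them is sketched below; ranking stages by their average duration is an assumption, and `pickExtremes` and `StageAverage` are hypothetical names, not part of the service.

```typescript
// One entry per stage; "duration" is assumed to be the stage's average in ms.
interface StageAverage {
  stage: string;
  duration: number;
}

interface Extremes {
  slowestStage: StageAverage;
  fastestStage: StageAverage;
}

// Scan the stages once, tracking the largest and smallest duration.
// Returns null when no stage has been recorded yet.
function pickExtremes(stages: StageAverage[]): Extremes | null {
  if (stages.length === 0) return null;
  let slowest = stages[0];
  let fastest = stages[0];
  for (const s of stages) {
    if (s.duration > slowest.duration) slowest = s;
    if (s.duration < fastest.duration) fastest = s;
  }
  return { slowestStage: slowest, fastestStage: fastest };
}
```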

| Health service method | Purpose |
|---|---|
| checkDatabase() | Check PostgreSQL health |
| checkRedis() | Check Redis health |
| checkVectorDb() | Check Qdrant health |
| checkQueue() | Check queue health |
| getSystemHealth() | Get overall health status |
| getHealthCheckResult() | Get simple health result |
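
Each of the component checks above has to turn a probe into an up, degraded, or down result. The sketch below shows one way to do that with a timing wrapper. It is illustrative only: `timedCheck`, `statusForResponseTime`, and the `ServiceHealth` shape are hypothetical, and the probe stands in for whatever ping the real check issues against PostgreSQL, Redis, Qdrant, or the queue.

```typescript
type ServiceStatus = 'up' | 'degraded' | 'down';

interface ServiceHealth {
  status: ServiceStatus;
  responseTimeMs: number;
  error?: string;
}

// Threshold taken from the status table: slower than 1000ms counts as degraded.
const SLOW_THRESHOLD_MS = 1000;

// Map a measured response time to a status for a service that did respond.
function statusForResponseTime(ms: number): ServiceStatus {
  return ms > SLOW_THRESHOLD_MS ? 'degraded' : 'up';
}

// Run a probe, time it, and report "down" if it throws or rejects.
async function timedCheck(probe: () => Promise<void>): Promise<ServiceHealth> {
  const start = Date.now();
  try {
    await probe();
    const responseTimeMs = Date.now() - start;
    return { status: statusForResponseTime(responseTimeMs), responseTimeMs };
  } catch (err) {
    return {
      status: 'down',
      responseTimeMs: Date.now() - start,
      error: err instanceof Error ? err.message : String(err),
    };
  }
}
```

Keeping the threshold logic in a pure function makes it easy to unit test without a live dependency.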

Processing workers automatically record metrics. The following are tracked for each stage:

- Stage name
- Processing duration
- Success/failure status
- Error type (if failed)
- Document ID
- Timestamp

```javascript
fetch('/api/v1/monitoring/health')
  .then(r => r.json())
  .then(health => {
    // Display overall status
    document.getElementById('status').textContent = health.status;
    // Display service statuses
    Object.entries(health.services).forEach(([name, status]) => {
      // Show service status indicator
    });
  });
```

```javascript
fetch('/api/v1/monitoring/metrics')
  .then(r => r.json())
  .then(metrics => {
    // Compute success rate (avoid NaN before any document has been processed)
    const successRate = metrics.totalProcessed > 0
      ? metrics.successfulProcessed / metrics.totalProcessed
      : 0;
    // Display error rate
    document.getElementById('error-rate').textContent =
      (metrics.errorRate * 100).toFixed(2) + '%';
    // Display stage performance
    Object.values(metrics.stageMetrics).forEach(stage => {
      // Show stage duration chart
    });
  });
```

Already in place via Winston:

```typescript
logger.info('Processing stage', {
  documentId,
  stage,
  jobId
});
logger.error('Stage processing failed', {
  error,
  documentId,
  stage
});
```

Log files:

- logs/app.log: All logs
- logs/error.log: Errors only
- Rotation: 5 files × 10MB

Metrics retention:

- Period: 24 hours
- Cleanup: Automatic on each recording
- Reset: Manual via API endpoint
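
The retention rules above (a 24-hour window, cleanup on each recording, manual reset) can be sketched as a small in-memory store. This is an assumption about the approach rather than the actual implementation; `RollingWindow` and its methods are hypothetical names.

```typescript
interface TimedRecord {
  timestamp: Date;
}

const RETENTION_MS = 24 * 60 * 60 * 1000; // 24-hour retention period

class RollingWindow<T extends TimedRecord> {
  private records: T[] = [];

  // Store the entry, then drop anything older than the retention period.
  // "now" is injectable so the behavior can be tested deterministically.
  record(entry: T, now: Date = new Date()): void {
    this.records.push(entry);
    this.prune(now);
  }

  // Keep only records whose timestamp falls inside the last 24 hours.
  private prune(now: Date): void {
    const cutoff = now.getTime() - RETENTION_MS;
    this.records = this.records.filter(r => r.timestamp.getTime() >= cutoff);
  }

  // Manual reset, as exposed through the reset endpoint.
  reset(): void {
    this.records = [];
  }

  size(): number {
    return this.records.length;
  }
}
```

Pruning on every write keeps memory bounded without a background timer, at the cost of a linear scan per recording.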

```json
{
  "uptime": 3600,
  "totalProcessed": 1000,
  "successRate": 0.98,
  "averageProcessingTime": 2500,
  "slowestStage": {
    "stage": "embedding",
    "duration": 3500
  },
  "fastestStage": {
    "stage": "chunking",
    "duration": 1200
  }
}
```

- TIMEOUT_ERROR: Processing timeout
- PROCESSING_ERROR: Processing failure
- VALIDATION_ERROR: Input validation
- NETWORK_ERROR: Network issues
- DATABASE_ERROR: Database problems
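
Error type frequencies can be produced by counting occurrences of the error types listed above. The sketch below shows that tally under the assumption that each failed record carries exactly one of these type strings; `tallyErrors` is a hypothetical helper, not a method of the metrics service.

```typescript
type ErrorType =
  | 'TIMEOUT_ERROR'
  | 'PROCESSING_ERROR'
  | 'VALIDATION_ERROR'
  | 'NETWORK_ERROR'
  | 'DATABASE_ERROR';

// Count how often each error type occurred.
// Types that never occurred are simply absent from the result.
function tallyErrors(errors: ErrorType[]): Partial<Record<ErrorType, number>> {
  const counts: Partial<Record<ErrorType, number>> = {};
  for (const type of errors) {
    counts[type] = (counts[type] ?? 0) + 1;
  }
  return counts;
}
```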

```json
{
  "commonErrors": {
    "TIMEOUT_ERROR": 10,
    "PROCESSING_ERROR": 8,
    "VALIDATION_ERROR": 2
  }
}
```

```bash
# Build
npm run build
# Test (159 tests, including metrics)
npm test
# Test metrics service only
npm test -- --testPathPattern=metrics.service.test.ts
# Start server with monitoring
npm start
# Check health
curl http://localhost:3000/api/v1/monitoring/health
```

```typescript
import { metricsService } from './services/metrics.service';
import { healthService } from './services/health.service';

// Record processing event
metricsService.recordProcessing({
  stage: 'chunking',
  duration: 1500,
  success: true,
  timestamp: new Date(),
  documentId: 'doc-123',
});

// Get metrics (errorRate is a 0-1 fraction, so convert to a percentage)
const metrics = metricsService.getSystemMetrics();
console.log(`Error rate: ${(metrics.errorRate * 100).toFixed(2)}%`);

// Check health (await requires an async function or an ES module)
const health = await healthService.getSystemHealth();
console.log(`System status: ${health.status}`);
```

- ✅ AC1: Automatic retry on transient errors (Task 3)
- ✅ AC2: Log start/success/failure events (Winston)
- ✅ AC3: Record detailed failure info (ErrorHandler + Metrics)
- ✅ AC4: Maintain processing state (Task 3)
- ✅ AC5: Monitor health and performance (Task 10)

Status: All 5 criteria met ✅

✅ Task 10 Complete
➡️ Task 11: Testing and Validation
- Unit tests for core services
- Integration tests
- Performance and load testing
Status: All Task 10 requirements met ✅
Quality: Production-ready with comprehensive monitoring
Tests: 159/159 passing
Next: Ready for Task 11 (Testing and Validation)