This document contains a comprehensive list of Frequently Asked Questions (FAQs) and a "Defense Guide" to help answer questions from judges or clients confidently.
Answer: Our system learns the normal behavior of each service automatically using statistics. It detects slow failures early, understands dependencies, and prevents cascading breakdowns — all in real time, with no complex setup.
Answer: Those tools often rely on manually set static thresholds (e.g., "Alert if CPU > 80%"), which generate noise. Our system uses dynamic baselines, trend prediction, and dependency awareness. We warn before the failure happens, not just after.
Answer: Most failures start small and escalate quickly. Our 1-second evaluation cycle helps teams take action immediately, avoiding costly downtime.
Answer: Reliability equals Revenue.
- Reduces Downtime
- Minimizes User Complaints
- Lowers Engineer Stress
- Prevents Revenue Loss
Answer: Imagine an Auth Service where latency slowly rises from 50ms to 500ms.
- Without us: No alert until it hits a hard limit or crashes. Users are frustrated.
- With us: The system detects the "Rising Trend" (>15%) early, sends a warning, and the issue is fixed before any outage occurs.
Answer: There is NO heavy AI model (like an LLM). Instead, we use "Statistical AI" (Math-based Intelligence):
- Mean & Standard Deviation: To understand "normal".
- Z-Score Thresholds: To detect anomalies.
- Linear Regression: For trend analysis.
This approach is faster, lighter, and more reliable than black-box AI models.
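To make the "Linear Regression for trend analysis" point concrete, here is a minimal sketch of least-squares trend detection over a sliding window of latency samples. The function name and the 15% default are illustrative, not taken from the actual codebase:

```js
// Fit a least-squares line through recent latency samples and flag a
// "rising trend" when it predicts >15% growth across the window.
function isRisingTrend(samples, threshold = 0.15) {
  const n = samples.length;
  if (n < 2) return false;

  // x = sample index (time), y = latency in ms
  const meanX = (n - 1) / 2;
  const meanY = samples.reduce((a, b) => a + b, 0) / n;

  let num = 0, den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (samples[i] - meanY);
    den += (i - meanX) ** 2;
  }
  const slope = num / den; // ms per sample

  // Predicted total rise across the window, relative to the mean level
  return (slope * (n - 1)) / meanY > threshold;
}

// isRisingTrend([50, 55, 61, 68, 75]) → true (steady climb)
```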
Answer: Microservice health metrics follow predictable patterns. Basic math is sufficient to detect abnormal changes without the overhead of heavy Machine Learning models. It’s smarter because it’s faster.
- Dynamic Baselines:
  - Mean + 2.5 × StdDev → Warning 🟡
  - Mean + 3.5 × StdDev → Critical 🔴
- Trend Analysis: If latency rises >15% in a short window → Early Warning.
- Heartbeat Check: No heartbeat for >60 seconds → Critical.
- Debounce: Alerts are only sent if the issue persists (prevents noise).
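A minimal sketch of how one evaluation pass could combine these rules, assuming a `service` object that carries its own baseline stats. The names, the 3-cycle debounce count, and the simplifications (no confirmation delay, no alert dispatch) are all hypothetical:

```js
const WARN_SIGMA = 2.5;
const CRIT_SIGMA = 3.5;
const HEARTBEAT_TIMEOUT_MS = 60_000;
const DEBOUNCE_CYCLES = 3; // issue must persist across 1s cycles

function evaluate(service, now = Date.now()) {
  // Heartbeat check: >60s of silence is critical regardless of stats
  if (now - service.lastHeartbeat > HEARTBEAT_TIMEOUT_MS) return 'CRITICAL';

  // Dynamic baseline: thresholds derived from the service's own history
  const { mean, stdDev, currentLatency } = service;
  let status = 'OK';
  if (currentLatency > mean + CRIT_SIGMA * stdDev) status = 'CRITICAL';
  else if (currentLatency > mean + WARN_SIGMA * stdDev) status = 'WARNING';

  // Debounce: only surface a bad status once it has persisted
  service.badCycles = status === 'OK' ? 0 : (service.badCycles ?? 0) + 1;
  return service.badCycles >= DEBOUNCE_CYCLES ? status : 'OK';
}
```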
Think of this like a runner warming up.
- The Problem: When you first wake up, you are groggy and slow. If someone measured your sprint speed right when you woke up, they would think you are a terrible runner.
- The Solution: We give the service 90 seconds (1.5 min) to "wake up," connect to the database, and load its files.
- Why it makes sense: If we started alerting immediately, we would get False Alarms because the service is naturally slow during startup. We wait for it to "settle down" to find its true normal speed.
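In code, the settling window can be a simple guard in the evaluation loop, for instance (a sketch; field names are assumptions):

```js
// Skip anomaly checks during the 90-second settling window after a
// service (re)starts, so cold-start latency never fires false alarms.
const SETTLING_MS = 90_000;

function isSettling(service, now = Date.now()) {
  return now - service.startedAt < SETTLING_MS;
}

// In the evaluation loop:
// if (isSettling(service)) continue; // collect metrics, but don't alert
```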
This is called a "Cold Start."
- Empty Caches: The service hasn't saved any shortcuts yet, so it has to do the hard work for every request.
- Connection Handshakes: It has to introduce itself to the database ("Hi, I'm Auth Service, here is my password") which takes time.
- Code Optimization: Modern languages (like Node.js or Java) actually optimize the code while it runs. The first few times a function runs, it's slow. After 100 times, the computer finds a faster way to run it.
- Simple: Speed (the time you wait).
- Example: You order a coffee.
- If you get it in 2 minutes, Latency = 2 min.
- If you get it in 10 minutes, Latency = 10 min.
- Goal: We want this number to be LOW.
- Simple: "The Normal."
- Example: You usually sleep 8 hours. That is your baseline. If you sleep 12 hours one day, that is abnormal.
- Simple: "The Moving Normal."
- Scenario: On weekdays, your baseline sleep is 8 hours. On weekends, your baseline is 10 hours.
- Why we use it: A "Static" baseline says "Anything over 8 hours is bad!" (which is wrong on weekends). A Dynamic baseline learns: "Oh, it's Saturday? 10 hours is fine. No alert needed."
- In our app: It learns that "50ms is normal for Auth Service" but "200ms is normal for Storage Service."
- Simple: The Average.
- Formula: (Sum of all numbers) / (Count of numbers)
- Example: Latencies are 10, 20, 30. Mean = 20.
- Simple: The "Wiggle Room" or "Consistency."
- Low StdDev: You run a mile in 6 mins, 6:01, 5:59. (You are consistent).
- High StdDev: You run a mile in 6 mins, then 10 mins, then 4 mins. (You are chaotic/unstable).
- Why it matters: If a service usually has Low StdDev (very stable) and suddenly spikes, we know something is wrong immediately.
- Concept: "This is getting weirdly slow."
- Formula: Mean + (2.5 × StdDev)
- English: Take the Average, add 2.5 times the "Wiggle Room". If the current response is slower than that, send a Warning.
- Concept: "Okay, this is definitely broken."
- Formula: Mean + (3.5 × StdDev)
- English: Take the Average, add 3.5 times the "Wiggle Room". This is extremely far away from normal.
- Concept: How we calculate the color.
- Formula: (Current Latency - Mean) / StdDev
- English: "How many 'wiggles' away from normal is this request?"
- Score < 2.5 = Green (Normal)
- Score 2.5–3.5 = Yellow (Warning)
- Score > 3.5 = Red (Critical)
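The glossary formulas above, written out as runnable code (an illustrative sketch, not the project's exact implementation):

```js
const mean = xs => xs.reduce((a, b) => a + b, 0) / xs.length;

const stdDev = xs => {
  const m = mean(xs);
  return Math.sqrt(xs.reduce((s, x) => s + (x - m) ** 2, 0) / xs.length);
};

// Z-Score: how many "wiggles" away from normal the current value is
const zScore = (x, m, sd) => (x - m) / sd;

function classify(currentLatency, history) {
  const z = zScore(currentLatency, mean(history), stdDev(history));
  if (z > 3.5) return '🔴 Critical';
  if (z > 2.5) return '🟡 Warning';
  return '🟢 Normal';
}

// classify(95, [48, 50, 52, 49, 51]) → '🔴 Critical'
```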
Answer: There are 4 actual backend services monitored by the Guardian:
- Backend (Port 3000)
- Auth Service (Port 3001)
- Booking Service (Port 3002)
- Storage Service (Port 3003)
The other two are the Frontend (Development Server) and the Traffic Generator (Script).
Answer: Yes. The system is lightweight. The Agent stores metrics in memory and sends data in batches (every 2 seconds) using async non-blocking I/O. It adds less than 1ms overhead to the application.
Answer: Monitoring stops temporarily, but the Client Services (Auth, Booking, etc.) keep running. The Health Agent uses try/catch blocks, so if the Guardian is down, the agent quietly skips sending metrics. It is NOT a single point of failure.
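A minimal sketch of such a fail-safe batch sender, assuming Node 18+ (global `fetch`) and a hypothetical Guardian endpoint:

```js
const buffer = []; // the middleware pushes metric objects in here
const GUARDIAN_URL = 'http://localhost:4000/metrics'; // assumed endpoint

setInterval(async () => {
  if (buffer.length === 0) return;
  const batch = buffer.splice(0, buffer.length); // drain in one batch
  try {
    await fetch(GUARDIAN_URL, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(batch),
    });
  } catch {
    // Guardian unreachable: drop the batch silently. Monitoring pauses;
    // the monitored service itself is never affected.
  }
}, 2000); // flush every 2 seconds
```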
Answer: For the MVP, MongoDB is flexible and allows us to store Users, Baselines, Logs, and Metrics in one place. For production, the architecture allows swapping the storage layer with InfluxDB or Prometheus easily.
Answer: Currently, we store everything. In production, we would implement a TTL (Time-To-Live) Index to automatically delete raw data older than 7 days, keeping only daily summaries for long-term trends.
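For reference, a 7-day TTL index with the official MongoDB Node.js driver could look like this (collection and field names are assumptions):

```js
import { MongoClient } from 'mongodb';

const db = (await MongoClient.connect(process.env.MONGO_URL)).db();

// MongoDB deletes each document 7 days after its `createdAt` timestamp.
await db.collection('metrics').createIndex(
  { createdAt: 1 },
  { expireAfterSeconds: 7 * 24 * 60 * 60 }
);
```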
Answer: "Wrapping" means adding our Health Agent SDK. It requires just 3 lines of code:
```js
import { createHealthAgent } from './health-agent.js';
const agent = createHealthAgent('My-App-Name');
app.use(agent.middleware());
```

This automatically starts timers, checks latency, captures errors, and sends metrics. No rewrite needed.
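For intuition, here is a hypothetical sketch of what such Express middleware does under the hood: time each request, record the status, and queue a metric for the next batch (`enqueueMetric` is an assumed helper, not a documented API):

```js
function middleware() {
  return (req, res, next) => {
    const start = process.hrtime.bigint(); // high-resolution timer

    // 'finish' fires once the response has been handed to the OS
    res.on('finish', () => {
      const latencyMs = Number(process.hrtime.bigint() - start) / 1e6;
      enqueueMetric({
        route: req.path,
        status: res.statusCode,
        latencyMs,
        isError: res.statusCode >= 500,
        timestamp: Date.now(),
      });
    });

    next(); // never blocks the request itself
  };
}
```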
Answer: Yes, but it requires a language-specific wrapper. The Backend API accepts JSON metrics from ANY language. Currently, we have a Node.js Agent. Future plans include Python, Java, and Go Agents.
Answer: JWT acts as a digital ID card.
- Admin logs in → Backend verifies credentials → Issues JWT.
- Frontend stores JWT and attaches `Authorization: Bearer <token>` to every request.
- This proves the user is allowed to view the dashboard.
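An illustrative client-side call (browser code; `redirectToLogin` is a hypothetical helper):

```js
const token = localStorage.getItem('jwt'); // stored at login

const res = await fetch('/api/services', {
  headers: { Authorization: `Bearer ${token}` },
});
if (res.status === 401) redirectToLogin(); // missing or expired token
```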
Answer: Currently, it uses internal network traffic. In production, we would implement API Keys and mTLS (Mutual TLS) to ensure only authorized services can send metrics.
- CRITICAL: Service crashes (No heartbeat > 60s) → Waits 30s to confirm → Immediate Alert.
- STARTUP: Service starts → Immediate Alert.
- WARNING: Service enters warning state (slowness) → Sends one message every 15 minutes (Rate Limited/Debounced).
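The WARNING rate limit can be a per-service cooldown, sketched here with assumed names:

```js
const WARN_COOLDOWN_MS = 15 * 60 * 1000; // one message per 15 minutes
const lastWarningAt = new Map(); // serviceName → last send timestamp

function shouldSendWarning(serviceName, now = Date.now()) {
  const last = lastWarningAt.get(serviceName) ?? 0;
  if (now - last < WARN_COOLDOWN_MS) return false; // still cooling down
  lastWarningAt.set(serviceName, now);
  return true;
}
```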
Answer: It is a "Self-Service Configuration":
- Client creates their own bot using BotFather.
- Client adds the bot to their team group.
- Client pastes the Bot Token and Chat ID into our Dashboard Settings.
- The Guardian uses their bot to send alerts only to them.
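Sending through the client's own bot is one HTTPS call to Telegram's standard Bot API `sendMessage` method, for example:

```js
// Token and chat ID come from the client's Dashboard Settings.
async function sendTelegramAlert(botToken, chatId, text) {
  await fetch(`https://api.telegram.org/bot${botToken}/sendMessage`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ chat_id: chatId, text }),
  });
}
```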
Answer:
- Auth Service: 🔴 Critical
- Booking Service: ⚠️ At Risk (Not an Error)

The system understands Dependencies. It highlights the Root Cause (Auth) so admins don't waste time debugging the Booking Service.
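A minimal sketch of dependency-aware status propagation, under an assumed dependency graph (all names illustrative):

```js
// Booking depends on Auth, so Auth going critical puts Booking "at risk"
const dependsOn = { 'booking-service': ['auth-service'] };

function effectiveStatus(name, statuses) {
  if (statuses[name] === 'CRITICAL') return 'CRITICAL';
  const deps = dependsOn[name] ?? [];
  if (deps.some(d => statuses[d] === 'CRITICAL')) return 'AT_RISK';
  return statuses[name];
}

// effectiveStatus('booking-service',
//   { 'auth-service': 'CRITICAL', 'booking-service': 'OK' }) → 'AT_RISK'
```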
Answer: No. This is an Infrastructure Monitor. It detects slowness, crashes, and HTTP errors. It does not validate business logic or output correctness; that requires functional testing.
Answer: Single Admin User.
- Default: `admin@healthguardian.com` / `admin123`
- Technique: Email/Password + bcrypt hashing + JWT Session.
Answer: Yes, but ONLY if the admin user doesn't already exist in the database. If the user exists, the script does nothing. Best practice is to set ADMIN_PASSWORD in the .env file.
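A sketch of such an idempotent seed script, assuming a Mongoose-style `User` model and `bcrypt` (model, path, and env-var names are assumptions):

```js
import bcrypt from 'bcrypt';
import { User } from './models/user.js'; // assumed model module

export async function seedAdmin() {
  const email = process.env.ADMIN_EMAIL ?? 'admin@healthguardian.com';

  // Never overwrite an existing admin
  if (await User.findOne({ email })) return;

  const passwordHash = await bcrypt.hash(
    process.env.ADMIN_PASSWORD ?? 'admin123',
    10 // salt rounds
  );
  await User.create({ email, passwordHash, role: 'admin' });
}
```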
- Latency: Time taken to complete ONE request (Speed).
- Throughput: Number of requests per second (Volume).
Answer:
- Docker/Kubernetes Integration: For automated container restarts.
- RBAC: Role-Based Access Control for multiple users.
- More Channels: Slack, Email, and PagerDuty integration.