Copilot AI commented Dec 9, 2025

Service outages between daily e2e test runs go undetected. This adds a lightweight health endpoint for monitoring system integration and faster alerting.

Changes

  • src/handlers/handler_health.py: New HandlerHealth class

    • Tracks service uptime from initialization
    • Checks writer STATE dicts for initialization (Kafka: logger+producer, EventBridge: logger+client+event_bus_arn, Postgres: logger accessible)
    • Returns 200 {"status": "ok", "uptime_seconds": N} when healthy
    • Returns 503 {"status": "degraded", "details": {...}} when dependencies not initialized
  • src/event_gate_lambda.py: Registered /health route (no auth required)

  • conf/api.yaml: Added OpenAPI spec for /health endpoint

  • tests/handlers/test_handler_health.py: 6 tests covering healthy/degraded states and uptime calculation

Example

GET /health → 200 {"status": "ok", "uptime_seconds": 12345}

# When Kafka is not initialized:
GET /health → 503 {"status": "degraded", "details": {"kafka": "not_initialized"}}

Release notes

Added /health endpoint for service health monitoring and faster outage detection

Closes #90

Original prompt

Background

Currently the only check that the service is running is an e2e test suite that runs once a day, so outages between runs can go undetected. The service should include a lightweight healthcheck endpoint to enable integration with monitoring systems, surface liveness/readiness status, and provide faster alerting.

Feature

Expose an HTTP endpoint that returns the service status (dependencies reachable, configuration loaded, migrations applied) with clear success/failure semantics for monitoring probes.

Requirements

1. Create src/handlers/handler_health.py

  • Follow the Apache 2.0 license header pattern from other files (lines 1-15)
  • Implement a HandlerHealth class similar to HandlerTopic and HandlerToken patterns
  • Include an __init__ method that accepts logger and config parameters
  • Track service start time (using datetime.now(timezone.utc)) to calculate uptime
  • Implement a get_health() method that returns API Gateway response format (dict with statusCode, headers, body)
  • Perform lightweight dependency checks:
    • Kafka: check if writer_kafka.STATE dict has required keys ('logger', 'producer', 'bootstrap_server')
    • EventBridge: check if writer_eventbridge.STATE dict has required keys ('logger', 'client', 'event_bus_arn')
    • PostgreSQL: check if writer_postgres.STATE dict has 'logger' key
  • Return 200 with {"status": "ok", "uptime_seconds": <value>} when all dependencies are healthy
  • Return 503 with {"status": "degraded", "details": {"kafka": "not_initialized"}} (or similar) when any dependency fails
  • Use the same logging patterns as other handlers (logger.debug statements)
  • Include proper type hints on all methods
  • Add docstrings for the class and public methods
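
The requirements in step 1 could be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: the three module-level dicts are stand-ins for writer_kafka.STATE, writer_eventbridge.STATE and writer_postgres.STATE, the Content-Type header is an assumption, and treating a key whose value is None as "not initialized" is also an assumption.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Any, Dict, Tuple

# Stand-ins (assumption) for writer_kafka.STATE, writer_eventbridge.STATE
# and writer_postgres.STATE; the real service would import the writer modules.
KAFKA_STATE: Dict[str, Any] = {}
EVENTBRIDGE_STATE: Dict[str, Any] = {}
POSTGRES_STATE: Dict[str, Any] = {}

# Keys each writer STATE dict must hold, as listed in the requirements.
REQUIRED_KEYS: Dict[str, Tuple[str, ...]] = {
    "kafka": ("logger", "producer", "bootstrap_server"),
    "eventbridge": ("logger", "client", "event_bus_arn"),
    "postgres": ("logger",),
}


class HandlerHealth:
    """Reports service uptime and writer initialization status."""

    def __init__(self, logger: logging.Logger, config: Dict[str, Any]) -> None:
        self.logger = logger
        self.config = config
        # Uptime is measured from handler construction.
        self.start_time = datetime.now(timezone.utc)

    def get_health(self) -> Dict[str, Any]:
        """Return an API Gateway style response describing service health."""
        states = {
            "kafka": KAFKA_STATE,
            "eventbridge": EVENTBRIDGE_STATE,
            "postgres": POSTGRES_STATE,
        }
        # No network calls: only inspect the STATE dicts.
        # A missing key or a None value counts as not initialized (assumption).
        details = {
            name: "not_initialized"
            for name, state in states.items()
            if any(state.get(key) is None for key in REQUIRED_KEYS[name])
        }
        if details:
            self.logger.debug("Health check degraded: %s", details)
            status_code = 503
            body: Dict[str, Any] = {"status": "degraded", "details": details}
        else:
            elapsed = datetime.now(timezone.utc) - self.start_time
            uptime = int(elapsed.total_seconds())
            self.logger.debug("Health check ok, uptime %s seconds", uptime)
            status_code = 200
            body = {"status": "ok", "uptime_seconds": uptime}
        return {
            "statusCode": status_code,
            "headers": {"Content-Type": "application/json"},  # assumed header
            "body": json.dumps(body),
        }
```

Keeping the required keys in one table means a new writer only needs a new entry rather than another branch in get_health().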

2. Update src/event_gate_lambda.py

  • Import the new HandlerHealth class: from src.handlers.handler_health import HandlerHealth
  • Initialize handler_health after line 86 (after writers are initialized): handler_health = HandlerHealth(logger, config)
  • Add route handling for /health resource in the lambda_handler function (around line 111-112, after /topics check):
if resource == "/health":
    return handler_health.get_health()
  • Follow the same pattern as /token and /topics routes
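
To illustrate the routing described in step 2, here is a self-contained sketch. StubHealth is a hypothetical stand-in for HandlerHealth so the example runs on its own, and the 404 fallback is an assumption; the real lambda_handler in src/event_gate_lambda.py also handles /token, /topics and other routes that are omitted here.

```python
import json
from typing import Any, Dict


class StubHealth:
    """Hypothetical stand-in for HandlerHealth so the sketch is runnable."""

    def get_health(self) -> Dict[str, Any]:
        return {
            "statusCode": 200,
            "headers": {"Content-Type": "application/json"},
            "body": json.dumps({"status": "ok", "uptime_seconds": 0}),
        }


# In the real module this would be HandlerHealth(logger, config),
# created after the writers are initialized.
handler_health = StubHealth()


def lambda_handler(event: Dict[str, Any], context: Any = None) -> Dict[str, Any]:
    """Dispatch on the API Gateway resource path."""
    resource = event.get("resource", "")
    # No JWT check on this branch: /health must work without authentication.
    if resource == "/health":
        return handler_health.get_health()
    # Other routes are omitted; the 404 fallback is an assumption.
    return {
        "statusCode": 404,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"error": "route not found"}),
    }
```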

3. Create tests/handlers/test_handler_health.py

  • Use the Apache 2.0 license header (lines 1-15)
  • Import necessary modules including unittest.mock
  • Create test functions (not a class) following the pattern from other test files
  • Test healthy state: mock all writer STATE dicts as initialized, assert 200 response
  • Test degraded state: mock writer STATE dicts as empty/missing keys, assert 503 response
  • Test uptime calculation: verify uptime_seconds is a positive number
  • Use pytest patterns and the event_gate_module fixture from conftest.py if needed
  • Follow existing test file patterns for imports and structure
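
The healthy and degraded test cases in step 3 might look like the sketch below. To stay runnable on its own it defines KAFKA_STATE and a Kafka-only get_health() as hypothetical stand-ins; the actual tests would patch the STATE dicts of the real writer modules and call HandlerHealth.get_health().

```python
import json
from typing import Any, Dict
from unittest.mock import MagicMock, patch

# Hypothetical stand-in for writer_kafka.STATE.
KAFKA_STATE: Dict[str, Any] = {}


def get_health() -> Dict[str, Any]:
    """Minimal Kafka-only stand-in for HandlerHealth.get_health()."""
    required = ("logger", "producer", "bootstrap_server")
    if all(key in KAFKA_STATE for key in required):
        body: Dict[str, Any] = {"status": "ok", "uptime_seconds": 0}
        return {"statusCode": 200, "body": json.dumps(body)}
    body = {"status": "degraded", "details": {"kafka": "not_initialized"}}
    return {"statusCode": 503, "body": json.dumps(body)}


def test_health_ok_when_kafka_initialized() -> None:
    initialized = {
        "logger": MagicMock(),
        "producer": MagicMock(),
        "bootstrap_server": "localhost:9092",
    }
    # patch.dict restores the original dict contents when the block exits.
    with patch.dict(KAFKA_STATE, initialized, clear=True):
        response = get_health()
    assert response["statusCode"] == 200
    assert json.loads(response["body"])["status"] == "ok"


def test_health_degraded_when_kafka_missing_keys() -> None:
    with patch.dict(KAFKA_STATE, {"logger": MagicMock()}, clear=True):
        response = get_health()
    assert response["statusCode"] == 503
    assert json.loads(response["body"])["details"] == {"kafka": "not_initialized"}
```

For the uptime test, note that a handler queried immediately after construction can report an uptime of 0 seconds, so asserting a non-negative value (or moving the start time into the past) avoids a flaky check.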

4. Update conf/api.yaml

  • Add a /health path definition to the OpenAPI 3 specification
  • Document GET method (no authentication required)
  • Add response schemas:
    • 200: {"status": "ok", "uptime_seconds": 12345}
    • 503: {"status": "degraded", "details": {"db": "unreachable"}}
  • Follow the existing YAML formatting style and structure from the file
  • Add appropriate description: "Service health and dependency status check"

Coding Style Requirements

  • Follow the exact patterns from existing handlers
  • Type hints on all function signatures
  • Docstrings with proper formatting (see existing handlers)
  • Debug logging statements: logger.debug("message")
  • Error handling consistent with existing code
  • Same response structure format: {"statusCode": X, "headers": {...}, "body": json.dumps(...)}
  • Use logger = logging.getLogger(__name__) pattern
  • Set log level: log_level = os.environ.get("LOG_LEVEL", "INFO"); logger.setLevel(log_level)
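
As a rough illustration of the logger setup and response format listed above, assuming a hypothetical helper named build_response and an assumed Content-Type header (neither is confirmed by the existing handlers):

```python
import json
import logging
import os
from typing import Any, Dict

# Module-level logger configured from the LOG_LEVEL environment variable.
logger = logging.getLogger(__name__)
log_level = os.environ.get("LOG_LEVEL", "INFO")
logger.setLevel(log_level)


def build_response(status_code: int, body: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a payload in the API Gateway response structure.

    build_response is a hypothetical helper name used for illustration.
    """
    logger.debug("Building response with status %s", status_code)
    return {
        "statusCode": status_code,
        "headers": {"Content-Type": "application/json"},  # assumed header
        "body": json.dumps(body),  # API Gateway expects the body as a string
    }
```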

Implementation Notes

  • Keep all existing functionality intact
  • The health endpoint should NOT require JWT authentication
  • Make dependency checks lightweight (just verify STATE dicts are populated)
  • Don't make actual network calls to dependencies (just check initialization)
  • Follow existing code organization and module structure

Expected Behavior

GET /health returns:

  • {"status": "ok", "uptime_seconds": 12345} with 200 when all checks pass
  • {"status": "degraded", "details": {"kafka": "not_initialized"}} with 503 when a dependency is not initialized
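
On the consuming side, a monitoring probe could interpret these two responses along the following lines. interpret_health is a hypothetical helper written for illustration; it is not part of this PR or of any monitoring tool.

```python
import json
from typing import List, Tuple


def interpret_health(status_code: int, body: str) -> Tuple[bool, List[str]]:
    """Return (healthy, failing_dependencies) for a /health response."""
    payload = json.loads(body)
    if status_code == 200 and payload.get("status") == "ok":
        return True, []
    # On a degraded response the keys of "details" name the failing dependencies.
    return False, sorted(payload.get("details", {}))
```

Checking both the status code and the "status" field keeps the probe from treating an unexpected 200 body as healthy.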

PR Description

Add the following to the PR description:


Release notes

Added /health endpoint for service health monitoring and faster outage detection

Closes #90





coderabbitai bot commented Dec 9, 2025

Important

Review skipped

Bot user detected.

To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copilot AI and others added 2 commits December 9, 2025 11:07
- Created HandlerHealth class with dependency checks for Kafka, EventBridge, and PostgreSQL
- Added /health route to lambda_handler
- Implemented uptime tracking and status reporting
- Added comprehensive tests for healthy and degraded states
- Updated OpenAPI spec with /health endpoint documentation

Co-authored-by: tmikula-dev <[email protected]>
- Changed all logger.debug() calls to self.logger.debug()
- Removed unused global logger initialization
- All tests pass, pylint 10.00/10, mypy clean

Co-authored-by: tmikula-dev <[email protected]>
Copilot AI changed the title from "[WIP] Expose healthcheck endpoint for service status monitoring" to "Add /health endpoint for service monitoring" Dec 9, 2025
Copilot AI requested a review from tmikula-dev December 9, 2025 11:13
@tmikula-dev added the "work in progress" label (Work on this item is not yet finished; mainly intended for PRs) Dec 9, 2025

Development

Successfully merging this pull request may close these issues.

Add healthcheck endpoint

2 participants