
AGENTS.md

This document provides essential information for AI coding agents working on the InfraAlert project.

Project Overview

InfraAlert is an AI-powered multi-agent system for infrastructure issue reporting and response coordination in Nigeria. The system transforms citizen reports into coordinated responses using advanced AI and machine learning.

Architecture

The system consists of multiple specialized agents:

  • Issue Detection Agent (agents/issue_detection/): Analyzes and classifies infrastructure issues using Vision API, NLP, Speech-to-Text, and Gemini AI
  • Priority Analysis Agent (agents/priority_analysis/): Evaluates severity and urgency of detected issues
  • Resource Coordination Agent (agents/resource_coordination/): Matches and dispatches resources for issue resolution
  • Platform Integration Agent (agents/platform_integration/): Handles platform integrations and external services
  • Orchestrator Agent (agents/orchestrator/): Coordinates communication between agents
  • MCP Server (mcp_server/): FastMCP-based tool server (7 tools) with stdio/SSE transports

ADK multi-agent execution model in this repo

InfraAlert implements a multi-agent system on top of ADK with two orchestration modes:

  • Per-service root agents: Each agent directory exposes a root_agent in its agent.py, which is the ADK entrypoint.
  • Dual orchestration: The orchestrator (agents/orchestrator/agent.py) selects between:
    • In-process SequentialAgent (default): Composes Detection, Priority, Coordination, and Platform Integration as ADK sub-agents sharing session state directly. Best for local dev, adk web, and Agent Engine.
    • HTTP microservices (USE_HTTP_ORCHESTRATOR=true): Coordinates agents via HTTP calls to Cloud Run services. Uses A2A protocol (/tasks/send) with legacy /task fallback.
  • MCP Protocol: The MCP server uses FastMCP SDK with stdio/SSE transports. The Platform Integration agent connects via MCPToolset for native tool access.
  • A2A Protocol: All agents expose /.well-known/agent.json for discovery and POST /tasks/send for standardized task submission.
  • Session-aware services: All agent services accept a session_id in the payload and X-ADK-Session-Id header and log it for tracing.
  • Agent instructions: Each agent has a detailed instruction prompt guiding Gemini reasoning for its domain.
  • Workflow patterns: docs/adk-workflow-agents.md documents ADK workflow agent patterns (SequentialAgent, ParallelAgent, LoopAgent).
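The "/tasks/send with legacy /task fallback" dispatch described above can be sketched as a small routing function. This is an illustrative sketch, not the repository's actual implementation: the function name and the injected `post` transport (a thin wrapper around an HTTP client) are assumptions made so the fallback logic stays testable without network access.

```python
from typing import Any, Callable, Dict

# Hypothetical sketch of the A2A-with-fallback dispatch described above.
# `post` is an injected transport (e.g. a wrapper around requests.post)
# that returns the HTTP status code.
def send_task(
    base_url: str,
    payload: Dict[str, Any],
    post: Callable[[str, Dict[str, Any]], int],
) -> str:
    """Try the standardized A2A endpoint first, then the legacy one."""
    status = post(f"{base_url}/tasks/send", payload)
    if status == 404:  # target service predates the A2A endpoint
        post(f"{base_url}/task", payload)
        return "legacy"
    return "a2a"
```

Injecting the transport rather than hard-coding an HTTP library is what lets the orchestrator's routing decision be unit-tested with a fake poster.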

Tech Stack

  • Language: Python 3.12+ (3.11+ also supported)
  • Package Manager: uv (fast, reliable Python package manager)
  • Platform: Google Cloud Platform (GCP)
  • AI/ML: Vertex AI, Google Generative AI (Gemini), Vision API, Speech-to-Text, Natural Language API
  • Framework: Google ADK (Agent Development Kit)
  • Databases: Firestore, BigQuery
  • Storage: Cloud Storage
  • Compute: Cloud Run
  • Messaging: Pub/Sub
  • Monitoring: Cloud Monitoring, Cloud Logging
  • Web Framework: Flask, FastAPI
  • Testing: pytest
  • Linting/Formatting: Ruff, MyPy

Prerequisites

Before working on this project, ensure you have:

  1. Python 3.12+ (or 3.11+)
  2. uv package manager - Install via: ./scripts/setup-uv.sh or make setup-uv
  3. gcloud CLI - Install via: ./scripts/setup-gcloud.sh or make setup-gcloud
  4. Docker Desktop (for local development)
  5. Google Cloud Account (for deployment)

Verify installations:

make verify-uv      # Verify uv installation
make verify-gcloud  # Verify gcloud CLI installation
make setup-tools    # Install both tools at once

Build and Test Commands

Quick Commands

make help          # Show all available commands
make install       # Install production dependencies only
make install-dev   # Install all dependencies (production + development)
make lint          # Check code quality (Ruff + MyPy)
make lint-fix      # Auto-fix linting issues
make format        # Format code with Ruff
make format-check  # Check formatting without applying
make test          # Run test suite
make check         # Run lint + tests (use before committing)
make clean         # Remove caches and temporary files
make pre-commit    # Run all pre-commit hooks

Installation

# Install all dependencies
make install-dev

# Or manually
uv pip install --system --python python3 -r requirements.txt
uv pip install --system --python python3 -r requirements-test.txt

Environment Setup

CRITICAL: .env files are gitignored. Create them from example files:

# Automated setup (recommended)
./scripts/setup-env.sh

# Or manually
cp .env.example .env
cp agents/issue_detection/.env.example agents/issue_detection/.env
# ... repeat for other agents as needed

Required Environment Variables:

  • GEMINI_API_KEY - Google Gemini API key (required for all agents)
  • PROJECT_ID or GOOGLE_CLOUD_PROJECT - GCP project ID
  • Other GCP credentials as needed (service accounts, buckets, etc.)

Optional - Africa's Talking (SMS Notifications):

  • AFRICASTALKING_USERNAME - Africa's Talking username (stored in Secret Manager for production)
  • AFRICASTALKING_API_KEY - Africa's Talking API key (stored in Secret Manager for production)
  • AFRICASTALKING_SENDER_ID - Sender ID for SMS (optional, default: "InfraAlert")

Note: Africa's Talking is optional. Without it, the system runs in "mock mode" (no actual SMS sent). See the Africa's Talking section in .env.example for setup instructions. Migration: Switched from Twilio to Africa's Talking for 40% cost savings and better Nigerian coverage.
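The mock-mode behaviour can be pictured with a short sketch. This is illustrative only (the function name and return shape are assumptions, not the project's actual notification code): when the Africa's Talking credentials are absent, nothing is sent and a mock result is returned instead.

```python
import os

# Illustrative sketch of mock mode, not the project's actual implementation.
def send_sms(phone: str, message: str) -> dict:
    username = os.getenv("AFRICASTALKING_USERNAME")
    api_key = os.getenv("AFRICASTALKING_API_KEY")
    if not (username and api_key):
        # Mock mode: log-and-return instead of calling the SMS API.
        return {"status": "mock", "to": phone, "message": message}
    sender = os.getenv("AFRICASTALKING_SENDER_ID", "InfraAlert")
    # A real send would go through the Africa's Talking SDK here.
    return {"status": "queued", "to": phone, "sender": sender}
```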

Never commit .env files to git!

Local Development

# Start all services via Docker Compose
docker-compose up

# Or run individual agents
cd agents/issue_detection
python app.py

Testing

# Run all tests
make test

# Run tests for specific agent
cd agents/issue_detection
pytest tests/ -v

# Run with coverage
pytest --cov=. --cov-report=html

CI/CD Testing

Test Cloud Build configuration locally:

make test-cloudbuild

Code Style Guidelines

Python Style

  • Line Length: 100 characters (configured in pyproject.toml)
  • Python Version: Target Python 3.12+ (configured in pyproject.toml)
  • Quote Style: Double quotes (")
  • Indent Style: Spaces

Linting and Formatting

This project uses:

  • Ruff: Fast Python linter and formatter (replaces Black, isort, flake8, etc.)
  • MyPy: Static type checker

Configuration: See pyproject.toml for detailed settings.

Ruff Rules:

  • Selects: E (pycodestyle errors), F (pyflakes), W (pycodestyle warnings), B (flake8-bugbear), I (isort)
  • Ignores: E203 (whitespace before ':'), E266 (too many leading '#' for block comment), E501 (line too long)

MyPy Configuration:

  • Python version: 3.12
  • Warns on returning Any, unused configs, and redundant casts
  • Some modules (google.*, flask.*, pydantic.*) ignore missing imports
  • Some agent modules ignore errors (due to dynamic ADK types)
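In pyproject.toml terms, the settings above would look roughly like this (an illustrative sketch; the repository's actual pyproject.toml is authoritative):

```toml
[tool.ruff]
line-length = 100
target-version = "py312"

[tool.ruff.lint]
select = ["E", "F", "W", "B", "I"]
ignore = ["E203", "E266", "E501"]

[tool.mypy]
python_version = "3.12"
warn_return_any = true
warn_unused_configs = true
warn_redundant_casts = true

[[tool.mypy.overrides]]
module = ["google.*", "flask.*", "pydantic.*"]
ignore_missing_imports = true
```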

Code Quality Checks

# Check code quality
make lint

# Auto-fix issues
make lint-fix

# Format code
make format

# Check formatting
make format-check

Pre-commit Hooks

# Install pre-commit hooks
pre-commit install

# Run manually
make pre-commit

Type Hints

  • Use type hints for function parameters and return values
  • MyPy is configured but not strictly enforced (some modules ignore errors)
  • Prefer explicit types over Any when possible

Import Organization

Ruff (via isort) handles import sorting automatically. Follow these guidelines:

  • Standard library imports first
  • Third-party imports second
  • Local application imports last
  • Use absolute imports when possible

Testing Instructions

Test Structure

  • Tests are located in tests/ directories within each agent
  • Test files follow naming pattern: test_*.py or *_test.py
  • Test classes: Test*
  • Test functions: test_*

Running Tests

# Run all tests
make test

# Run tests for specific agent
cd agents/issue_detection
pytest tests/ -v --tb=short

# Run with coverage
pytest --cov=. --cov-report=html --cov-report=term-missing

# Run specific test file
pytest agents/issue_detection/tests/test_issue_detection.py -v

Test Configuration

See pyproject.toml for pytest configuration:

  • Test paths: ["tests"]
  • Coverage source: . (project root)
  • Coverage excludes: */tests/*, */test_*, */__pycache__/*, */venv/*

Writing Tests

  • Use pytest fixtures for common setup
  • Mock external services (GCP APIs, etc.)
  • Test both success and error cases
  • Use descriptive test names
  • Keep tests isolated and independent
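The guidelines above can be illustrated with a minimal test module. The `classify_issue` helper and its Vision-client API are hypothetical names invented for the example, not the project's actual code; the point is the pattern: mock the external service, and cover both the success and the empty-result case.

```python
from unittest.mock import MagicMock

# Hypothetical function under test, sketched for illustration.
def classify_issue(image_bytes: bytes, vision_client) -> str:
    labels = vision_client.label_image(image_bytes)
    if not labels:
        return "unknown"
    return labels[0]

def test_classify_returns_top_label():
    # External service (Vision API) is mocked, never called for real.
    client = MagicMock()
    client.label_image.return_value = ["pothole", "road"]
    assert classify_issue(b"img", client) == "pothole"

def test_classify_handles_no_labels():
    # Error/edge case: the API returns no labels.
    client = MagicMock()
    client.label_image.return_value = []
    assert classify_issue(b"img", client) == "unknown"
```

Each test builds its own mock, so the tests stay isolated and can run in any order.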

Security Considerations

Environment Variables and Secrets

CRITICAL: Never commit secrets or credentials to the repository!

  • All .env files are gitignored
  • Use .env.example files as templates
  • Store production secrets in Google Secret Manager
  • Cloud Run services use Secret Manager for sensitive values (see cloudbuild.yaml)

API Keys

  • GEMINI_API_KEY must be stored securely
  • In production, use Google Secret Manager (configured in Cloud Build)
  • Never hardcode API keys in source code
  • Rotate keys regularly

Google Cloud Credentials

  • Use service accounts with least privilege
  • Never commit service account keys
  • Use Application Default Credentials (ADC) when possible
  • For local development, use gcloud auth application-default login

Input Validation

  • Validate all user inputs
  • Sanitize data before processing
  • Use Pydantic models for data validation
  • Implement rate limiting where appropriate
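The project's choice for this is Pydantic; the dependency-free sketch below illustrates the same validate-at-the-boundary idea with a stdlib dataclass. The model and field names are hypothetical.

```python
from dataclasses import dataclass

# Illustrative stand-in for a Pydantic model: reject bad input at the
# boundary, before any downstream processing sees it.
@dataclass
class IssueReport:
    description: str
    latitude: float
    longitude: float

    def __post_init__(self) -> None:
        if not self.description.strip():
            raise ValueError("description must be non-empty")
        if not (-90.0 <= self.latitude <= 90.0):
            raise ValueError("latitude out of range")
        if not (-180.0 <= self.longitude <= 180.0):
            raise ValueError("longitude out of range")
```

With Pydantic, the same constraints would live in a `BaseModel` with field validators, and invalid payloads would raise a `ValidationError` at parse time.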

Dependencies

  • Regularly update dependencies to patch security vulnerabilities
  • Review requirements.txt and requirements-test.txt
  • Use uv pip list --outdated to check for updates

Docker Security

  • Use minimal base images
  • Don't run containers as root
  • Scan images for vulnerabilities
  • Keep base images updated

Deployment

Cloud Build (Automated)

Deployment happens automatically via Cloud Build when code is pushed:

# Test Cloud Build locally
make test-cloudbuild

# Manual Cloud Build trigger
gcloud builds submit --config=cloudbuild.yaml --substitutions=_TAG=$(git rev-parse --short HEAD)

Manual Deployment

Each agent has its own deploy.sh script:

cd agents/issue_detection
./deploy.sh

Deployment Configuration

See cloudbuild.yaml for:

  • Build steps (quality checks, Docker builds, deployments)
  • Cloud Run service configurations
  • Resource limits (memory, CPU, instances)
  • Environment variables and secrets
  • Regional settings

Deployment Order

  1. MCP Server (must deploy first)
  2. Platform Integration Agent (depends on MCP Server)
  3. Issue Detection Agent
  4. Priority Analysis Agent
  5. Resource Coordination Agent
  6. Orchestrator Agent (depends on all other agents)

Commit Messages and Pull Requests

Commit Message Guidelines

Use clear, descriptive commit messages:

feat: Add new feature description
fix: Fix bug description
docs: Update documentation
refactor: Code refactoring
test: Add or update tests
chore: Maintenance tasks

Examples:

  • feat(issue_detection): Add multi-language support
  • fix(priority_analysis): Resolve memory leak in analysis
  • docs: Update AGENTS.md with deployment instructions

Pull Request Guidelines

  1. Before Creating PR:

    • Run make check (lint + tests)
    • Ensure all tests pass
    • Update documentation if needed
    • Check for security issues
  2. PR Description Should Include:

    • What changes were made
    • Why the changes were needed
    • How to test the changes
    • Any breaking changes
  3. Code Review Checklist:

    • Code follows style guidelines
    • Tests are included and passing
    • Documentation is updated
    • No secrets or credentials exposed
    • Security considerations addressed

Project Structure

InfraAlert/
├── agents/                    # Agent implementations
│   ├── issue_detection/      # Issue detection agent
│   ├── priority_analysis/    # Priority analysis agent
│   ├── resource_coordination/# Resource coordination agent
│   ├── platform_integration/ # Platform integration services
│   └── orchestrator/         # Orchestrator agent
├── bigquery/                 # BigQuery schemas
├── config/                   # Configuration files
├── docs/                     # Documentation
├── functions/                # Cloud Functions
├── mcp_server/               # MCP server implementation
├── scripts/                  # Utility scripts
├── webapp/                   # Web application
├── cloudbuild.yaml           # Cloud Build configuration
├── docker-compose.yml        # Local development setup
├── Makefile                  # Development commands
├── pyproject.toml            # Python project configuration
└── requirements.txt          # Production dependencies

Common Tasks

Adding a New Dependency

Python (agents, backend):

# Add to requirements.txt
uv pip install --system --python python3 <package>

# Or edit requirements.txt manually, then:
make install-dev

Frontend (webapp/frontend): Use bun (not npm):

cd webapp/frontend
bun add <package>          # production dependency
bun add -d <package>       # dev dependency

Adding a New Agent

  1. Create directory structure: agents/new_agent/
  2. Add requirements.txt, Dockerfile, deploy.sh
  3. Update cloudbuild.yaml with build/deploy steps
  4. Add environment variables to .env.example
  5. Update documentation

Debugging

# Run agent locally with debug logging
cd agents/issue_detection
python app.py

# Check logs in Cloud Run
gcloud run services logs read issue-detection-agent --region=us-central1

# Use Firestore emulator locally
docker-compose up firestore-emulator

Database Schemas

  • BigQuery schemas: bigquery/
  • Firestore: Configured per agent in config.py files

Additional Resources

  • Main README: See README.md for detailed setup instructions
  • Agent-Specific Docs: Each agent has its own README.md
  • Google ADK Docs: https://cloud.google.com/vertex-ai/docs/adk
  • ADK Workflow Agents: See docs/adk-workflow-agents.md for SequentialAgent, ParallelAgent, LoopAgent patterns
  • Agent Engine Deployment: See docs/agent-engine-deployment.md for Google Agent Engine deployment

Recent Improvements (feat/adk-memory-session-improvements)

1. Standardized ADK Agent Structure

All agents now follow ADK's required structure conventions:

  • ✅ Each agent has __init__.py that imports from .agent
  • ✅ Each agent has agent.py that exports root_agent or app
  • ✅ Platform Integration agent now has proper ADK entry point

Files Modified:

  • agents/issue_detection/__init__.py
  • agents/priority_analysis/__init__.py
  • agents/resource_coordination/__init__.py
  • agents/platform_integration/__init__.py (created)
  • agents/platform_integration/agent.py (created)

2. Session ID Propagation

Implemented cross-agent session tracking:

  • ✅ Orchestrator generates stable session_id from ADK session or workflow ID
  • ✅ Session ID passed via HTTP headers (X-ADK-Session-Id) and payload
  • ✅ All agents extract and log session_id for distributed tracing
  • ✅ Session ID stored in Firestore for correlation

The shared session/state contract is centralized in agents/session_config.py:

  • SessionKeys.ISSUE_DATA, SessionKeys.ISSUE_ID, SessionKeys.DETECTION_RESULT are written by the Issue Detection agent and consumed by the Orchestrator and Priority Analysis agent.
  • SessionKeys.PRIORITY_DATA, SessionKeys.PRIORITY_SCORE, SessionKeys.ANALYSIS_RESULT are produced by the Priority Analysis agent and consumed by the Orchestrator and Resource Coordination agent.
  • SessionKeys.WORKFLOW_STATE, SessionKeys.CURRENT_STEP and SessionKeys.SESSION_ID are managed primarily by the Orchestrator to track end-to-end workflow state.
  • Long-term patterns such as SessionKeys.PREVIOUS_ISSUES and SessionKeys.PATTERNS_LEARNED can be used by agents that need cross-session context.
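The producer/consumer contract above can be pictured with a minimal sketch. The key strings here are hypothetical placeholders; agents/session_config.py is the authoritative definition.

```python
# Hypothetical minimal sketch of the shared contract in
# agents/session_config.py (key strings are illustrative).
class SessionKeys:
    ISSUE_DATA = "issue_data"
    ISSUE_ID = "issue_id"
    DETECTION_RESULT = "detection_result"
    PRIORITY_DATA = "priority_data"
    PRIORITY_SCORE = "priority_score"
    ANALYSIS_RESULT = "analysis_result"
    WORKFLOW_STATE = "workflow_state"
    CURRENT_STEP = "current_step"
    SESSION_ID = "session_id"

# A producer (Issue Detection) writes into the shared session state...
state: dict = {}
state[SessionKeys.ISSUE_DATA] = {"type": "pothole", "location": "Lagos"}
state[SessionKeys.ISSUE_ID] = "issue-001"

# ...and a consumer (Priority Analysis) reads the same keys back,
# never a hard-coded string, so the contract lives in one place.
issue = state[SessionKeys.ISSUE_DATA]
```

Centralizing the keys in one class means a renamed key fails loudly at import time instead of silently desynchronizing two agents.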

Files Modified:

  • agents/session_config.py - Added SessionKeys.SESSION_ID
  • agents/orchestrator/orchestrator_agent.py - Generate and propagate session_id
  • agents/issue_detection/app.py - Extract and log session_id
  • agents/priority_analysis/app.py - Extract and log session_id
  • agents/resource_coordination/app.py - Extract and log session_id
  • agents/platform_integration/app.py - Extract and log session_id

Usage:

# In orchestrator
session_id = session_state.get(SessionKeys.SESSION_ID)
if not session_id:
    session_id = getattr(ctx.session, "id", f"wf-{int(time.time())}")
    session_state[SessionKeys.SESSION_ID] = session_id

# In agent Flask apps
session_id = request.headers.get("X-ADK-Session-Id")
if not session_id:
    session_id = data.get("session_id")
logger.info(f"Processing task with session_id: {session_id}")

3. Durable Session & Memory Backends

Extended session/memory configuration to support production backends:

Session Backends:

  • ✅ In-Memory (development - default)
  • ✅ Database Sessions (Cloud Run + PostgreSQL/AlloyDB)
  • ✅ Vertex AI Sessions (Agent Engine)

Memory Backends:

  • ✅ In-Memory Memory (development - default)
  • ✅ Memory Bank (production - facts/summaries)
  • ✅ RAG Memory (production - document Q&A)

Files Modified:

  • agents/session_config.py - Added Vertex AI session/memory support
  • .env.example - Documented all configuration options

Configuration:

# Vertex AI Sessions (Agent Engine)
USE_VERTEX_SESSIONS=true
VERTEX_AI_PROJECT_ID=your-project-id
VERTEX_AI_LOCATION=us-central1

# Memory Bank
USE_MEMORY_BANK=true
MEMORY_BANK_NAME=projects/PROJECT_ID/locations/LOCATION/memoryBanks/infraalert-kb

# RAG Memory
USE_RAG_MEMORY=true
RAG_CORPUS_NAME=projects/PROJECT_ID/locations/LOCATION/ragCorpora/infraalert-docs

4. Observability & Tracing

Added OpenTelemetry instrumentation for distributed tracing:

  • ✅ Reusable observability module (agents/observability.py)
  • ✅ Cloud Trace integration
  • ✅ Auto-instrumentation for Flask and requests
  • ✅ Trace context in logs for correlation

Files Created:

  • agents/observability.py - OpenTelemetry initialization and helpers

Files Modified:

  • requirements.txt - Added OpenTelemetry dependencies

Usage:

from agents.observability import init_observability, add_trace_context_to_logger

# Initialize before creating Flask app
tracer = init_observability(
    service_name="issue-detection-agent",
    enable_tracing=True,
    enable_instrumentation=True
)

# Add trace context to logs
add_trace_context_to_logger()

# Use tracer for custom spans
if tracer:
    with tracer.start_as_current_span("process_issue"):
        # Your code here
        pass

5. PR CI Pipeline

Added separate CI pipelines for fast PR feedback:

  • ✅ Lint pipeline (.cloudbuild/ci/lint.yaml) - Ruff linter & formatter
  • ✅ Test pipeline (.cloudbuild/ci/test.yaml) - Pytest with coverage
  • ✅ Separated from deployment pipeline for faster feedback

Files Created:

  • .cloudbuild/ci/lint.yaml - Code quality checks
  • .cloudbuild/ci/test.yaml - Test suite execution

Benefits:

  • ⚡ Fast feedback on PRs (10-20 min vs full build/deploy)
  • ✅ Catch issues before merging
  • 💰 Lower Cloud Build costs (no image building on failed tests)

6. Unified Service Account

Documented and configured unified service account pattern:

  • ✅ Single infraalert-app-sa for all services
  • ✅ Simplified IAM management
  • ✅ Least-privilege roles documented
  • ✅ Setup script provided

Files Created:

  • scripts/setup-app-sa.sh - Automated service account setup script

Files Modified:

  • .env.example - Added AGENT_SERVICE_ACCOUNT configuration
  • cloudbuild.yaml - Added _APP_SERVICE_ACCOUNT substitution

Setup:

# Create and configure service account
./scripts/setup-app-sa.sh your-project-id

# Update .env
AGENT_SERVICE_ACCOUNT=infraalert-app-sa@your-project-id.iam.gserviceaccount.com

7. Comprehensive Documentation

Added detailed documentation for memory, sessions, and service accounts:

Files Created:

  • docs/adk-workflow-agents.md - ADK workflow agent patterns and usage
    • Session backends comparison
    • Memory backends comparison
    • Configuration examples (dev/prod/Cloud Run/Agent Engine)
    • Migration guides
    • Best practices
    • Troubleshooting

Key Topics Covered:

  • Short-term session vs long-term memory
  • When to use each backend
  • How to set up Memory Bank
  • How to enable RAG memory
  • Session ID propagation flow
  • Memory write policies per agent
  • Monitoring and observability

Deployment Checklist (Post-Improvements)

  1. Environment Configuration

    # Required
    export GEMINI_API_KEY="your-api-key"
    export PROJECT_ID="your-project-id"
    
    # Optional - Session Backend (choose one)
    export USE_VERTEX_SESSIONS=true  # For Agent Engine
    # OR
    export DATABASE_URL="postgresql://..."  # For Cloud Run
    
    # Optional - Memory Backend (choose one)
    export USE_MEMORY_BANK=true
    export MEMORY_BANK_NAME="projects/.../memoryBanks/infraalert-kb"
    # OR
    export USE_RAG_MEMORY=true
    export RAG_CORPUS_NAME="projects/.../ragCorpora/infraalert-docs"
    
    # Optional - Observability
    export ENABLE_TRACING=true
  2. Service Account Setup

    # Create unified service account
    ./scripts/setup-app-sa.sh your-project-id
    
    # Update .env
    echo "AGENT_SERVICE_ACCOUNT=infraalert-app-sa@your-project-id.iam.gserviceaccount.com" >> .env
  3. Deploy

    # Run quality checks first
    make check
    
    # Deploy all services
    make deploy  # or use Cloud Build
  4. Verify

    # Check session_id propagation in logs
    gcloud logging read \
      'resource.type="cloud_run_revision" AND jsonPayload.session_id!=""' \
      --limit=10
    
    # Check service accounts
    for service in issue-detection-agent priority-analysis-agent orchestrator-agent; do
      gcloud run services describe $service --region=us-central1 \
        --format="value(spec.template.spec.serviceAccountName)"
    done

Troubleshooting

Common Issues

  1. uv not found: Run make setup-uv or ./scripts/setup-uv.sh
  2. gcloud not found: Run make setup-gcloud or ./scripts/setup-gcloud.sh
  3. Import errors: Ensure dependencies are installed (make install-dev)
  4. Environment variables missing: Run ./scripts/setup-env.sh
  5. Tests failing: Check that all dependencies are installed and .env files are configured

Getting Help

  • Check agent-specific README files
  • Review Cloud Build logs for deployment issues
  • Check Cloud Run logs for runtime errors
  • Review docs/ directory for architecture details

Important Notes for Agents

  1. Always run make check before committing - This ensures code quality and tests pass
  2. Never commit .env files - They contain sensitive credentials
  3. Use uv for package management - Don't use pip directly
  4. Follow the existing code style - Use Ruff for formatting
  5. Add tests for new features - Maintain test coverage
  6. Update documentation - Keep README and AGENTS.md current
  7. Check security implications - Review changes for security issues
  8. Test locally before deploying - Use Docker Compose or run agents directly
  9. Respect agent boundaries - Each agent has a specific responsibility
  10. Use type hints - Help MyPy catch type errors