This document provides essential information for AI coding agents working on the InfraAlert project.
InfraAlert is an AI-powered multi-agent system for infrastructure issue reporting and response coordination in Nigeria. The system transforms citizen reports into coordinated responses using advanced AI and machine learning.
The system consists of multiple specialized agents:
- Issue Detection Agent (`agents/issue_detection/`): Analyzes and classifies infrastructure issues using Vision API, NLP, Speech-to-Text, and Gemini AI
- Priority Analysis Agent (`agents/priority_analysis/`): Evaluates the severity and urgency of detected issues
- Resource Coordination Agent (`agents/resource_coordination/`): Matches and dispatches resources for issue resolution
- Platform Integration Agent (`agents/platform_integration/`): Handles platform integrations and external services
- Orchestrator Agent (`agents/orchestrator/`): Coordinates communication between agents
- MCP Server (`mcp_server/`): FastMCP-based tool server (7 tools) with stdio/SSE transports
InfraAlert implements a multi-agent system on top of ADK with two orchestration modes:
- Per-service root agents: Each agent directory exposes a `root_agent` in its `agent.py`, which is the ADK entrypoint.
- Dual orchestration: The orchestrator (`agents/orchestrator/agent.py`) selects between:
  - In-process `SequentialAgent` (default): Composes Detection, Priority, Coordination, and Platform Integration as ADK sub-agents sharing session state directly. Best for local dev, `adk web`, and Agent Engine.
  - HTTP microservices (`USE_HTTP_ORCHESTRATOR=true`): Coordinates agents via HTTP calls to Cloud Run services. Uses the A2A protocol (`/tasks/send`) with a legacy `/task` fallback.
- MCP Protocol: The MCP server uses the `FastMCP` SDK with stdio/SSE transports. The Platform Integration agent connects via `MCPToolset` for native tool access.
- A2A Protocol: All agents expose `/.well-known/agent.json` for discovery and `POST /tasks/send` for standardized task submission.
- Session-aware services: All agent services accept a `session_id` in the payload and an `X-ADK-Session-Id` header and log it for tracing.
- Agent instructions: Each agent has a detailed `instruction` prompt guiding Gemini reasoning for its domain.
- Workflow patterns: `docs/adk-workflow-agents.md` documents ADK workflow agent patterns (`SequentialAgent`, `ParallelAgent`, `LoopAgent`).
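To make the A2A contract concrete, here is a minimal sketch of a task submission to `POST /tasks/send`. Only the `session_id` field and the `X-ADK-Session-Id` header come from this document; the rest of the payload shape (the `build_task_payload` helper and its field names) is illustrative, not the full A2A schema.

```python
import json

def build_task_payload(session_id: str, description: str) -> dict:
    """Hypothetical A2A task payload; the real schema is defined by the A2A protocol."""
    return {
        "session_id": session_id,  # also sent as the X-ADK-Session-Id header
        "task": {
            "type": "infrastructure_report",
            "description": description,
        },
    }

payload = build_task_payload("wf-1712345678", "Burst water pipe on Ikorodu Road")
headers = {"X-ADK-Session-Id": payload["session_id"]}  # duplicated for tracing
print(json.dumps(payload, indent=2))
```

Sending the session ID in both the header and the payload lets HTTP middleware and application code each pick it up independently.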
- Language: Python 3.12+ (3.11+ also supported)
- Package Manager: uv (fast, reliable Python package manager)
- Platform: Google Cloud Platform (GCP)
- AI/ML: Vertex AI, Google Generative AI (Gemini), Vision API, Speech-to-Text, Natural Language API
- Framework: Google ADK (Agent Development Kit)
- Databases: Firestore, BigQuery
- Storage: Cloud Storage
- Compute: Cloud Run
- Messaging: Pub/Sub
- Monitoring: Cloud Monitoring, Cloud Logging
- Web Framework: Flask, FastAPI
- Testing: pytest
- Linting/Formatting: Ruff, MyPy
Before working on this project, ensure you have:
- Python 3.12+ (or 3.11+)
- uv package manager - Install via `./scripts/setup-uv.sh` or `make setup-uv`
- gcloud CLI - Install via `./scripts/setup-gcloud.sh` or `make setup-gcloud`
- Docker Desktop (for local development)
- Google Cloud Account (for deployment)
Verify installations:
```bash
make verify-uv      # Verify uv installation
make verify-gcloud  # Verify gcloud CLI installation
make setup-tools    # Install both tools at once
```

```bash
make help          # Show all available commands
make install       # Install production dependencies only
make install-dev   # Install all dependencies (production + development)
make lint          # Check code quality (Ruff + MyPy)
make lint-fix      # Auto-fix linting issues
make format        # Format code with Ruff
make format-check  # Check formatting without applying
make test          # Run test suite
make check         # Run lint + tests (use before committing)
make clean         # Remove caches and temporary files
make pre-commit    # Run all pre-commit hooks
```

```bash
# Install all dependencies
make install-dev

# Or manually
uv pip install --system --python python3 -r requirements.txt
uv pip install --system --python python3 -r requirements-test.txt
```

CRITICAL: `.env` files are gitignored. Create them from example files:
```bash
# Automated setup (recommended)
./scripts/setup-env.sh

# Or manually
cp .env.example .env
cp agents/issue_detection/.env.example agents/issue_detection/.env
# ... repeat for other agents as needed
```

Required Environment Variables:
- `GEMINI_API_KEY` - Google Gemini API key (required for all agents)
- `PROJECT_ID` or `GOOGLE_CLOUD_PROJECT` - GCP project ID
- Other GCP credentials as needed (service accounts, buckets, etc.)
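A fail-fast check at startup makes missing configuration obvious before an agent begins serving traffic. A minimal sketch (the `require_env` helper is illustrative, not part of the codebase):

```python
import os

def require_env(name: str, *fallbacks: str) -> str:
    """Return the first set variable among name/fallbacks, or raise early."""
    for candidate in (name, *fallbacks):
        value = os.environ.get(candidate)
        if value:
            return value
    raise RuntimeError(f"Missing required environment variable: {name}")

# GEMINI_API_KEY is required; PROJECT_ID may fall back to GOOGLE_CLOUD_PROJECT
# api_key = require_env("GEMINI_API_KEY")
# project = require_env("PROJECT_ID", "GOOGLE_CLOUD_PROJECT")
```

Failing at import/startup time is preferable to a cryptic API error deep inside a workflow.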
Optional - Africa's Talking (SMS Notifications):
- `AFRICASTALKING_USERNAME` - Africa's Talking username (stored in Secret Manager for production)
- `AFRICASTALKING_API_KEY` - Africa's Talking API key (stored in Secret Manager for production)
- `AFRICASTALKING_SENDER_ID` - Sender ID for SMS (optional, default: "InfraAlert")
Note: Africa's Talking is optional. Without it, the system runs in "mock mode" (no actual SMS sent).
See the Africa's Talking section in .env.example for setup instructions.
Migration: Switched from Twilio to Africa's Talking for 40% cost savings and better Nigerian coverage.
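The mock-mode behaviour described above can be sketched as a credential check at send time. The `send_sms` function and its return shape are illustrative, not the actual implementation:

```python
import os

def send_sms(to: str, message: str) -> dict:
    """Send via Africa's Talking when configured, otherwise fall back to mock mode."""
    username = os.environ.get("AFRICASTALKING_USERNAME")
    api_key = os.environ.get("AFRICASTALKING_API_KEY")
    if not (username and api_key):
        # Mock mode: no SMS is actually sent, the call is just recorded
        return {"status": "mocked", "to": to, "message": message}
    sender = os.environ.get("AFRICASTALKING_SENDER_ID", "InfraAlert")
    # A real implementation would call the Africa's Talking SDK here
    return {"status": "queued", "to": to, "sender": sender}
```

This keeps local development and tests working without real credentials.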
Never commit `.env` files to git!
```bash
# Start all services via Docker Compose
docker-compose up

# Or run individual agents
cd agents/issue_detection
python app.py
```

```bash
# Run all tests
make test

# Run tests for specific agent
cd agents/issue_detection
pytest tests/ -v

# Run with coverage
pytest --cov=. --cov-report=html
```

Test Cloud Build configuration locally:

```bash
make test-cloudbuild
```

- Line Length: 100 characters (configured in `pyproject.toml`)
- Python Version: Target Python 3.12+ (configured in `pyproject.toml`)
- Quote Style: Double quotes (`"`)
- Indent Style: Spaces
This project uses:
- Ruff: Fast Python linter and formatter (replaces Black, isort, flake8, etc.)
- MyPy: Static type checker
Configuration: See `pyproject.toml` for detailed settings.
Ruff Rules:
- Selects: E (pycodestyle errors), F (pyflakes), W (pycodestyle warnings), B (flake8-bugbear), I (isort)
- Ignores: E203 (whitespace before ':'), E266 (too many leading '#' for block comment), E501 (line too long)
MyPy Configuration:
- Python version: 3.12
- Warns on returning `Any`, unused configs, and redundant casts
- Some modules (`google.*`, `flask.*`, `pydantic.*`) ignore missing imports
- Some agent modules ignore errors (due to dynamic ADK types)
```bash
# Check code quality
make lint

# Auto-fix issues
make lint-fix

# Format code
make format

# Check formatting
make format-check
```

```bash
# Install pre-commit hooks
pre-commit install

# Run manually
make pre-commit
```

- Use type hints for function parameters and return values
- MyPy is configured but not strictly enforced (some modules ignore errors)
- Prefer explicit types over `Any` when possible
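For example, prefer a precise signature over `Any` (the function names here are illustrative, not from the codebase):

```python
from typing import Any

# Vague: MyPy cannot check callers or the return value
def score_issue_untyped(issue: Any) -> Any: ...

# Precise: parameters and the return type are both checkable
def score_issue(severity: int, affected_population: int) -> float:
    """Toy priority score; real scoring lives in the Priority Analysis agent."""
    return severity * 10.0 + affected_population / 1000.0
```

The typed version lets MyPy catch a caller passing a string severity or forgetting the return value is a `float`.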
Ruff (via isort) handles import sorting automatically. Follow these guidelines:
- Standard library imports first
- Third-party imports second
- Local application imports last
- Use absolute imports when possible
- Tests are located in `tests/` directories within each agent
- Test files follow the naming pattern `test_*.py` or `*_test.py`
- Test classes: `Test*`
- Test functions: `test_*`
```bash
# Run all tests
make test

# Run tests for specific agent
cd agents/issue_detection
pytest tests/ -v --tb=short

# Run with coverage
pytest --cov=. --cov-report=html --cov-report=term-missing

# Run specific test file
pytest agents/issue_detection/tests/test_issue_detection.py -v
```

See `pyproject.toml` for pytest configuration:

- Test paths: `["tests"]`
- Coverage source: `.` (project root)
- Coverage excludes: `*/tests/*`, `*/test_*`, `*/__pycache__/*`, `*/venv/*`
- Use pytest fixtures for common setup
- Mock external services (GCP APIs, etc.)
- Test both success and error cases
- Use descriptive test names
- Keep tests isolated and independent
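A minimal example combining these practices — an injected dependency, a mocked external service, and both a success and an error case. All names here are illustrative; real tests live under each agent's `tests/` directory:

```python
from unittest.mock import MagicMock

def classify_report(text: str, nlp_client) -> str:
    """Classify a citizen report using an injected NLP client."""
    if not text.strip():
        raise ValueError("empty report")
    return nlp_client.classify(text)

def test_classify_report_success():
    # Mock the external service instead of calling a real GCP API
    fake_nlp = MagicMock()
    fake_nlp.classify.return_value = "pothole"
    assert classify_report("Large pothole on Third Mainland Bridge", fake_nlp) == "pothole"

def test_classify_report_rejects_empty_input():
    try:
        classify_report("   ", MagicMock())
    except ValueError as exc:
        assert "empty" in str(exc)
    else:
        raise AssertionError("expected ValueError")
```

Injecting the client keeps the test isolated: no network, no credentials, and deterministic results.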
CRITICAL: Never commit secrets or credentials to the repository!
- All `.env` files are gitignored
- Use `.env.example` files as templates
- Store production secrets in Google Secret Manager
- Cloud Run services use Secret Manager for sensitive values (see `cloudbuild.yaml`)
- `GEMINI_API_KEY` must be stored securely
- In production, use Google Secret Manager (configured in Cloud Build)
- Never hardcode API keys in source code
- Rotate keys regularly
- Use service accounts with least privilege
- Never commit service account keys
- Use Application Default Credentials (ADC) when possible
- For local development, use `gcloud auth application-default login`
- Validate all user inputs
- Sanitize data before processing
- Use Pydantic models for data validation
- Implement rate limiting where appropriate
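The project uses Pydantic models for this; a dependency-free sketch of the same validate-before-processing idea with a dataclass (the `IssueReport` fields are illustrative):

```python
from dataclasses import dataclass

@dataclass
class IssueReport:
    description: str
    latitude: float
    longitude: float

    def __post_init__(self) -> None:
        # Reject malformed input before any downstream processing
        if not self.description.strip():
            raise ValueError("description must not be empty")
        if not (-90.0 <= self.latitude <= 90.0):
            raise ValueError("latitude out of range")
        if not (-180.0 <= self.longitude <= 180.0):
            raise ValueError("longitude out of range")
```

With Pydantic the same constraints would live in a `BaseModel` with field validators, and invalid payloads fail at the API boundary.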
- Regularly update dependencies to patch security vulnerabilities
- Review `requirements.txt` and `requirements-test.txt`
- Use `uv pip list --outdated` to check for updates
- Use minimal base images
- Don't run containers as root
- Scan images for vulnerabilities
- Keep base images updated
Deployment happens automatically via Cloud Build when code is pushed:
```bash
# Test Cloud Build locally
make test-cloudbuild

# Manual Cloud Build trigger
gcloud builds submit --config=cloudbuild.yaml --substitutions=_TAG=$(git rev-parse --short HEAD)
```

Each agent has its own `deploy.sh` script:

```bash
cd agents/issue_detection
./deploy.sh
```

See `cloudbuild.yaml` for:
- Build steps (quality checks, Docker builds, deployments)
- Cloud Run service configurations
- Resource limits (memory, CPU, instances)
- Environment variables and secrets
- Regional settings
1. MCP Server (must deploy first)
2. Platform Integration Agent (depends on MCP Server)
3. Issue Detection Agent
4. Priority Analysis Agent
5. Resource Coordination Agent
6. Orchestrator Agent (depends on all other agents)
Use clear, descriptive commit messages:
```
feat: Add new feature description
fix: Fix bug description
docs: Update documentation
refactor: Code refactoring
test: Add or update tests
chore: Maintenance tasks
```

Examples:

```
feat(issue_detection): Add multi-language support
fix(priority_analysis): Resolve memory leak in analysis
docs: Update AGENTS.md with deployment instructions
```
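The convention above can be checked mechanically. A small illustrative validator (not part of the repo's tooling) that accepts the listed types with an optional scope:

```python
import re

# Matches the types listed above, with an optional scope like (issue_detection)
COMMIT_RE = re.compile(r"^(feat|fix|docs|refactor|test|chore)(\([a-z_]+\))?: .+")

def is_valid_commit_message(message: str) -> bool:
    return bool(COMMIT_RE.match(message))
```

A check like this could run as a local commit-msg hook if stricter enforcement is ever wanted.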
1. Before Creating PR:
   - Run `make check` (lint + tests)
   - Ensure all tests pass
   - Update documentation if needed
   - Check for security issues
2. PR Description Should Include:
   - What changes were made
   - Why the changes were needed
   - How to test the changes
   - Any breaking changes
3. Code Review Checklist:
   - Code follows style guidelines
   - Tests are included and passing
   - Documentation is updated
   - No secrets or credentials exposed
   - Security considerations addressed
```
InfraAlert/
├── agents/                   # Agent implementations
│   ├── issue_detection/      # Issue detection agent
│   ├── priority_analysis/    # Priority analysis agent
│   ├── resource_coordination/# Resource coordination agent
│   ├── platform_integration/ # Platform integration services
│   └── orchestrator/         # Orchestrator agent
├── bigquery/                 # BigQuery schemas
├── config/                   # Configuration files
├── docs/                     # Documentation
├── functions/                # Cloud Functions
├── mcp_server/               # MCP server implementation
├── scripts/                  # Utility scripts
├── webapp/                   # Web application
├── cloudbuild.yaml           # Cloud Build configuration
├── docker-compose.yml        # Local development setup
├── Makefile                  # Development commands
├── pyproject.toml            # Python project configuration
└── requirements.txt          # Production dependencies
```
Python (agents, backend):
```bash
# Add to requirements.txt
uv pip install --system --python python3 <package>

# Or edit requirements.txt manually, then:
make install-dev
```

Frontend (webapp/frontend): Use bun (not npm):

```bash
cd webapp/frontend
bun add <package>     # production dependency
bun add -d <package>  # dev dependency
```

- Create directory structure: `agents/new_agent/`
- Add `requirements.txt`, `Dockerfile`, and `deploy.sh`
- Update `cloudbuild.yaml` with build/deploy steps
- Add environment variables to `.env.example`
- Update documentation
```bash
# Run agent locally with debug logging
cd agents/issue_detection
python app.py

# Check logs in Cloud Run
gcloud run services logs read issue-detection-agent --region=us-central1

# Use Firestore emulator locally
docker-compose up firestore-emulator
```

- BigQuery schemas: `bigquery/`
- Firestore: Configured per agent in `config.py` files
- Main README: See `README.md` for detailed setup instructions
- Agent-Specific Docs: Each agent has its own `README.md`
- Google ADK Docs: https://cloud.google.com/vertex-ai/docs/adk
- ADK Workflow Agents: See `docs/adk-workflow-agents.md` for `SequentialAgent`, `ParallelAgent`, and `LoopAgent` patterns
- Agent Engine Deployment: See `docs/agent-engine-deployment.md` for Google Agent Engine deployment
All agents now follow ADK's required structure conventions:
- ✅ Each agent has an `__init__.py` that imports from `.agent`
- ✅ Each agent has an `agent.py` that exports `root_agent` or `app`
- ✅ Platform Integration agent now has a proper ADK entry point

Files Modified:

- `agents/issue_detection/__init__.py`
- `agents/priority_analysis/__init__.py`
- `agents/resource_coordination/__init__.py`
- `agents/platform_integration/__init__.py` (created)
- `agents/platform_integration/agent.py` (created)
Implemented cross-agent session tracking:
- ✅ Orchestrator generates a stable `session_id` from the ADK session or workflow ID
- ✅ Session ID passed via HTTP headers (`X-ADK-Session-Id`) and payload
- ✅ All agents extract and log `session_id` for distributed tracing
- ✅ Session ID stored in Firestore for correlation
The shared session/state contract is centralized in `agents/session_config.py`:

- `SessionKeys.ISSUE_DATA`, `SessionKeys.ISSUE_ID`, and `SessionKeys.DETECTION_RESULT` are written by the Issue Detection agent and consumed by the Orchestrator and Priority Analysis agent.
- `SessionKeys.PRIORITY_DATA`, `SessionKeys.PRIORITY_SCORE`, and `SessionKeys.ANALYSIS_RESULT` are produced by the Priority Analysis agent and consumed by the Orchestrator and Resource Coordination agent.
- `SessionKeys.WORKFLOW_STATE`, `SessionKeys.CURRENT_STEP`, and `SessionKeys.SESSION_ID` are managed primarily by the Orchestrator to track end-to-end workflow state.
- Long-term patterns such as `SessionKeys.PREVIOUS_ISSUES` and `SessionKeys.PATTERNS_LEARNED` can be used by agents that need cross-session context.
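The producer/consumer flow can be sketched with plain dict session state. The `SessionKeys` values below are stand-ins; the real constants live in `agents/session_config.py` and their string values may differ:

```python
# Stand-in for agents/session_config.py (illustrative key strings)
class SessionKeys:
    ISSUE_DATA = "issue_data"
    PRIORITY_SCORE = "priority_score"
    CURRENT_STEP = "current_step"

session_state: dict = {}

# Issue Detection agent writes its output...
session_state[SessionKeys.ISSUE_DATA] = {"type": "pothole", "location": "Lagos"}

# ...the Priority Analysis agent consumes it and writes its own keys...
issue = session_state[SessionKeys.ISSUE_DATA]
session_state[SessionKeys.PRIORITY_SCORE] = 0.9 if issue["type"] == "pothole" else 0.5

# ...and the Orchestrator tracks workflow progress
session_state[SessionKeys.CURRENT_STEP] = "resource_coordination"
```

Each agent only reads the keys it is documented to consume and only writes the keys it owns, which keeps agent boundaries clean.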
Files Modified:
- `agents/session_config.py` - Added `SessionKeys.SESSION_ID`
- `agents/orchestrator/orchestrator_agent.py` - Generate and propagate `session_id`
- `agents/issue_detection/app.py` - Extract and log `session_id`
- `agents/priority_analysis/app.py` - Extract and log `session_id`
- `agents/resource_coordination/app.py` - Extract and log `session_id`
- `agents/platform_integration/app.py` - Extract and log `session_id`
Usage:
```python
# In orchestrator
session_id = session_state.get(SessionKeys.SESSION_ID)
if not session_id:
    session_id = getattr(ctx.session, "id", f"wf-{int(time.time())}")
    session_state[SessionKeys.SESSION_ID] = session_id

# In agent Flask apps
session_id = request.headers.get("X-ADK-Session-Id")
if not session_id:
    session_id = data.get("session_id")
logger.info(f"Processing task with session_id: {session_id}")
```

Extended session/memory configuration to support production backends:
Session Backends:
- ✅ In-Memory (development - default)
- ✅ Database Sessions (Cloud Run + PostgreSQL/AlloyDB)
- ✅ Vertex AI Sessions (Agent Engine)
Memory Backends:
- ✅ In-Memory Memory (development - default)
- ✅ Memory Bank (production - facts/summaries)
- ✅ RAG Memory (production - document Q&A)
Files Modified:
- `agents/session_config.py` - Added Vertex AI session/memory support
- `.env.example` - Documented all configuration options
Configuration:
```bash
# Vertex AI Sessions (Agent Engine)
USE_VERTEX_SESSIONS=true
VERTEX_AI_PROJECT_ID=your-project-id
VERTEX_AI_LOCATION=us-central1

# Memory Bank
USE_MEMORY_BANK=true
MEMORY_BANK_NAME=projects/PROJECT_ID/locations/LOCATION/memoryBanks/infraalert-kb

# RAG Memory
USE_RAG_MEMORY=true
RAG_CORPUS_NAME=projects/PROJECT_ID/locations/LOCATION/ragCorpora/infraalert-docs
```

Added OpenTelemetry instrumentation for distributed tracing:
- ✅ Reusable observability module (`agents/observability.py`)
- ✅ Cloud Trace integration
- ✅ Auto-instrumentation for Flask and requests
- ✅ Trace context in logs for correlation
Files Created:
- `agents/observability.py` - OpenTelemetry initialization and helpers
Files Modified:
- `requirements.txt` - Added OpenTelemetry dependencies
Usage:
```python
from agents.observability import init_observability, add_trace_context_to_logger

# Initialize before creating Flask app
tracer = init_observability(
    service_name="issue-detection-agent",
    enable_tracing=True,
    enable_instrumentation=True,
)

# Add trace context to logs
add_trace_context_to_logger()

# Use tracer for custom spans
if tracer:
    with tracer.start_as_current_span("process_issue"):
        # Your code here
        pass
```

Added separate CI pipelines for fast PR feedback:
- ✅ Lint pipeline (`.cloudbuild/ci/lint.yaml`) - Ruff linter & formatter
- ✅ Test pipeline (`.cloudbuild/ci/test.yaml`) - Pytest with coverage
- ✅ Separated from the deployment pipeline for faster feedback
Files Created:
- `.cloudbuild/ci/lint.yaml` - Code quality checks
- `.cloudbuild/ci/test.yaml` - Test suite execution
Benefits:
- ⚡ Fast feedback on PRs (10-20 min vs full build/deploy)
- ✅ Catch issues before merging
- 💰 Lower Cloud Build costs (no image building on failed tests)
Documented and configured unified service account pattern:
- ✅ Single `infraalert-app-sa` for all services
- ✅ Simplified IAM management
- ✅ Least-privilege roles documented
- ✅ Setup script provided
Files Created:
- `scripts/setup-app-sa.sh` - Automated service account setup script
Files Modified:
- `.env.example` - Added `AGENT_SERVICE_ACCOUNT` configuration
- `cloudbuild.yaml` - Added `_APP_SERVICE_ACCOUNT` substitution
Setup:
```bash
# Create and configure service account
./scripts/setup-app-sa.sh your-project-id

# Update .env
AGENT_SERVICE_ACCOUNT=infraalert-app-sa@your-project-id.iam.gserviceaccount.com
```

Added detailed documentation for memory, sessions, and service accounts:
Files Created:
- `docs/adk-workflow-agents.md` - ADK workflow agent patterns and usage
- Session backends comparison
- Memory backends comparison
- Configuration examples (dev/prod/Cloud Run/Agent Engine)
- Migration guides
- Best practices
- Troubleshooting
Key Topics Covered:
- Short-term session vs long-term memory
- When to use each backend
- How to set up Memory Bank
- How to enable RAG memory
- Session ID propagation flow
- Memory write policies per agent
- Monitoring and observability
1. Environment Configuration

   ```bash
   # Required
   export GEMINI_API_KEY="your-api-key"
   export PROJECT_ID="your-project-id"

   # Optional - Session Backend (choose one)
   export USE_VERTEX_SESSIONS=true          # For Agent Engine
   # OR
   export DATABASE_URL="postgresql://..."   # For Cloud Run

   # Optional - Memory Backend (choose one)
   export USE_MEMORY_BANK=true
   export MEMORY_BANK_NAME="projects/.../memoryBanks/infraalert-kb"
   # OR
   export USE_RAG_MEMORY=true
   export RAG_CORPUS_NAME="projects/.../ragCorpora/infraalert-docs"

   # Optional - Observability
   export ENABLE_TRACING=true
   ```

2. Service Account Setup

   ```bash
   # Create unified service account
   ./scripts/setup-app-sa.sh your-project-id

   # Update .env
   echo "AGENT_SERVICE_ACCOUNT=infraalert-app-sa@your-project-id.iam.gserviceaccount.com" >> .env
   ```

3. Deploy

   ```bash
   # Run quality checks first
   make check

   # Deploy all services
   make deploy  # or use Cloud Build
   ```

4. Verify

   ```bash
   # Check session_id propagation in logs
   gcloud logging read \
     'resource.type="cloud_run_revision" AND jsonPayload.session_id!=""' \
     --limit=10

   # Check service accounts
   for service in issue-detection-agent priority-analysis-agent orchestrator-agent; do
     gcloud run services describe $service --region=us-central1 \
       --format="value(spec.template.spec.serviceAccountName)"
   done
   ```
- uv not found: Run `make setup-uv` or `./scripts/setup-uv.sh`
- gcloud not found: Run `make setup-gcloud` or `./scripts/setup-gcloud.sh`
- Import errors: Ensure dependencies are installed (`make install-dev`)
- Environment variables missing: Run `./scripts/setup-env.sh`
- Tests failing: Check that all dependencies are installed and `.env` files are configured
- Check agent-specific README files
- Review Cloud Build logs for deployment issues
- Check Cloud Run logs for runtime errors
- Review the `docs/` directory for architecture details
- Always run `make check` before committing - This ensures code quality and tests pass
- Never commit `.env` files - They contain sensitive credentials
- Use `uv` for package management - Don't use `pip` directly
- Follow the existing code style - Use Ruff for formatting
- Add tests for new features - Maintain test coverage
- Update documentation - Keep README and AGENTS.md current
- Check security implications - Review changes for security issues
- Test locally before deploying - Use Docker Compose or run agents directly
- Respect agent boundaries - Each agent has a specific responsibility
- Use type hints - Help MyPy catch type errors