Skip to content

Latest commit

 

History

History
641 lines (503 loc) · 30.3 KB

File metadata and controls

641 lines (503 loc) · 30.3 KB

CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Repository Overview

HolmesGPT is an AI-powered troubleshooting agent that connects to observability platforms (Kubernetes, Prometheus, Grafana, etc.) to automatically diagnose and analyze infrastructure and application issues. It uses an agentic loop to investigate problems by calling tools to gather data from multiple sources.

Development Commands

Environment Setup

# Install dependencies with Poetry
poetry install

Testing

# Install test dependencies with Poetry
poetry install --with dev
# Run all non-LLM tests (unit and integration tests)
make test-without-llm
poetry run pytest tests -m "not llm"

# Run LLM evaluation tests (requires API keys)
make test-llm-ask-holmes          # Test single-question interactions
make test-llm-investigate         # Test AlertManager investigations
poetry run pytest tests/llm/ -n 6 -vv  # Run all LLM tests in parallel

# Run pre-commit checks (includes ruff, mypy, poetry validation)
# NOTE: Only run these when the user explicitly asks. They run in CI automatically.
make check
poetry run pre-commit run -a

Code Quality (only run when explicitly asked)

# Format code with ruff
poetry run ruff format

# Check code with ruff (auto-fix issues)
poetry run ruff check --fix

# Type checking with mypy
poetry run mypy

Architecture Overview

Core Components

CLI Entry Point (holmes/main.py):

  • Typer-based CLI with subcommands for ask, investigate, toolset
  • Handles configuration loading, logging setup, and command routing

** Interactive mode for CLI** (holmes/interactive.py):

  • Handles interactive mode for ask subcommand
  • Implements slash commands

Configuration System (holmes/config.py):

  • Loads settings from ~/.holmes/config.yaml or via CLI options
  • Manages API keys, model selection, and toolset configurations
  • Factory methods for creating sources (AlertManager, Jira, PagerDuty, etc.)

Core Investigation Engine (holmes/core/):

  • tool_calling_llm.py: Main LLM interaction with tool calling capabilities
  • investigation.py: Orchestrates multi-step investigations with runbooks
  • toolset_manager.py: Manages available tools and their configurations
  • tools.py: Tool definitions and execution logic

Plugin System (holmes/plugins/):

  • Sources: AlertManager, Jira, PagerDuty, OpsGenie integrations
  • Toolsets: Kubernetes, Prometheus, Grafana, AWS, Docker, etc.
  • Prompts: Jinja2 templates for different investigation scenarios
  • Destinations: Slack integration for sending results

Key Patterns

Toolset Architecture:

  • Each toolset is a YAML file defining available tools and their parameters
  • Tools can be Python functions or bash commands with safety validation
  • Toolsets are loaded dynamically and can be customized via config files
  • Important: All toolsets MUST return detailed error messages from underlying APIs to enable LLM self-correction
    • Include the exact query/command that was executed
    • Include time ranges, parameters, and filters used
    • Include the full API error response (status code and message)
    • For "no data" responses, specify what was searched and where

Thin API Wrapper Pattern for Python Toolsets:

  • Reference implementation: servicenow_tables/servicenow_tables.py
  • Use requests library for HTTP calls (not specialized client libraries like opensearchpy)
  • Simple config class with Pydantic validation
  • Health check in prerequisites_callable() method
  • Each tool is a thin wrapper around a single API endpoint

Server-Side Filtering is Critical:

  • Never return unbounded data from APIs - this causes token overflow
  • Always include filter parameters on tools that query collections (e.g., index parameter for Elasticsearch _cat APIs)
  • Example problem: opensearch_list_shards returned ALL shards → 25K+ tokens on large clusters
  • Example fix: elasticsearch_cat tool requires index parameter for shards/segments endpoints
  • When server-side filtering is not possible, use JsonFilterMixin (see json_filter_mixin.py) to add max_depth and jq parameters for client-side filtering

Toolset Config Backwards Compatibility: When renaming config fields in a toolset, maintain backwards compatibility using Pydantic's extra="allow":

# ✅ DO: Use extra="allow" to accept deprecated fields without polluting schema
class MyToolsetConfig(BaseModel):
    model_config = ConfigDict(extra="allow")

    # Only define current field names in schema
    new_field_name: int = 10

    @model_validator(mode="after")
    def handle_deprecated_fields(self):
        extra = self.model_extra or {}
        deprecated = []

        # Map old names to new names
        if "old_field_name" in extra:
            self.new_field_name = extra["old_field_name"]
            deprecated.append("old_field_name -> new_field_name")

        if deprecated:
            logging.warning(f"Deprecated config names: {', '.join(deprecated)}")
        return self

# ❌ DON'T: Define deprecated fields in schema with Optional[None]
class BadConfig(BaseModel):
    new_field_name: int = 10
    old_field_name: Optional[int] = None  # Pollutes schema, shows in model_dump()

Benefits of extra="allow" approach:

  • Schema only shows current field names
  • model_dump() returns clean output without deprecated fields
  • Old configs still work (backwards compatible)
  • Deprecation warnings guide users to update

See prometheus/prometheus.py PrometheusConfig for a complete example.

Class Hierarchy Placement:

  • When adding new config fields, methods, or behavior, always check the class hierarchy and place the change at the most general level that applies
  • Don't scope a fix to a specific subclass just because the issue/request mentions it by name — check if sibling classes share the same need
  • Example: timeout_seconds and max_retries belong on GrafanaConfig, not GrafanaTempoConfig, because all Grafana toolsets (Tempo, Loki, Dashboards) make HTTP requests

LLM Integration:

  • Uses LiteLLM for multi-provider support (OpenAI, Anthropic, Azure, etc.)
  • Structured tool calling with automatic retry and error handling
  • Context-aware prompting with system instructions and examples

Investigation Flow:

  1. Load user question/alert
  2. Select relevant toolsets based on context
  3. Execute LLM with available tools
  4. LLM calls tools to gather data
  5. LLM analyzes results and provides conclusions
  6. Optionally write results back to source system

Testing Framework

Three-tier testing approach:

  1. Unit Tests (tests/): Standard pytest tests for individual components
  2. Integration Tests: Test toolset integrations
  3. LLM Evaluation Tests (tests/llm/): End-to-end tests using fixtures

Running regular (non-LLM) tests:

poetry run pytest tests -m "not llm"
make test-without-llm

Running LLM eval tests:

# Run specific eval - IMPORTANT: Use -k flag, NOT full test path with brackets
poetry run pytest -k "09_crashpod" --no-cov

# Run all evals in parallel
poetry run pytest tests/llm/ -n 6 --no-cov

# Regression evals
poetry run pytest -m 'llm and easy' --no-cov

For the complete eval CLI reference (flags, env vars, model comparison, debugging), see the /create-eval skill which contains full documentation in its reference files.

Configuration

Config File Location: ~/.holmes/config.yaml

Key Configuration Sections:

  • model: LLM model to use (default: gpt-5.4)
  • api_key: LLM API key (or use environment variables)
  • custom_toolsets: Override or add toolsets
  • custom_runbooks: Add investigation runbooks
  • Platform-specific settings (alertmanager_url, jira_url, etc.)

Environment Variables:

  • OPENAI_API_KEY, ANTHROPIC_API_KEY: LLM API keys
  • OPENROUTER_API_KEY: Alternative LLM provider via OpenRouter (domain: api.openrouter.ai). When using OpenRouter, you must also set CLASSIFIER_MODEL to an OpenRouter model (e.g., CLASSIFIER_MODEL="openrouter/openai/gpt-4.1") because the default classifier model is not available via OpenRouter.
  • MODEL: Override default model(s) - supports comma-separated list
  • CLASSIFIER_MODEL: Override the classifier model used internally. Required when using OpenRouter (e.g., openrouter/openai/gpt-4.1)
  • RUN_LIVE: Enable live execution of tools in tests (default: true)
  • BRAINTRUST_API_KEY: For test result tracking and CI/CD report generation
  • BRAINTRUST_ORG: Braintrust organization name (default: "robustadev")
  • ELASTICSEARCH_URL, ELASTICSEARCH_API_KEY: For Elasticsearch/OpenSearch cloud testing

Development Guidelines

Code Quality:

  • Use Ruff for formatting and linting (configured in pyproject.toml)
  • Type hints required (mypy configuration in pyproject.toml)
  • Pre-commit hooks enforce quality checks in CI
  • ALWAYS place Python imports at the top of the file, not inside functions or methods
  • NEVER run pre-commit, ruff, or mypy unless the user explicitly asks you to. These tools are triggered by commit hooks which are not installed on all machines, and running them causes widespread formatting/type changes to files unrelated to your task. Only lint/format files you are actively editing, and only if asked.

Documentation Examples:

  • Primary examples should use the latest Anthropic Claude models:
    • Recommended: anthropic/claude-sonnet-4-5-20250929 or anthropic/claude-opus-4-5-20251101
    • Use the latest Claude 4.5 family models (Sonnet or Opus) as the default/primary examples
  • You may include other providers (OpenAI, Gemini, etc.) where it would be useful for users, such as in model listing sections or provider-specific documentation
  • Avoid using deprecated or older model versions like claude-3.5-sonnet, gpt-4-vision-preview

Testing Requirements:

  • All new features require unit tests
  • New toolsets require integration tests
  • Complex investigations should have LLM evaluation tests
  • Maintain 40% minimum test coverage
  • Live execution is now enabled by default to ensure tests match real-world behavior
  • Use responses library for HTTP mocking, not @patch("requests.get"). The responses library intercepts at the transport/adapter level, giving more realistic test behavior. Use responses.RequestsMock() with rsps.add() for mock responses.

Pull Request Process:

  • PRs require maintainer approval
  • Pre-commit hooks are checked in CI (do NOT run them locally unless asked)
  • LLM evaluation tests run automatically in CI
  • Keep PRs focused and include tests
  • ALWAYS use git commit -s to sign off commits (required for DCO)
  • When committing, use git commit -s --no-verify to skip local pre-commit hooks (they are not installed consistently and will cause unrelated changes)

Git Workflow Guidelines:

  • ALWAYS create commits, NEVER amend
  • ALWAYS merge, NEVER rebase
  • ALWAYS push, NEVER force push
  • Maintain a history of your work to allow the user to revert back to a previous iteration

File Structure Conventions:

  • Toolsets: holmes/plugins/toolsets/{name}.yaml or {name}/
  • Prompts: holmes/plugins/prompts/{name}.jinja2
  • Tests: Match source structure under tests/

Adding a New Integration (Toolset): When adding a new toolset or integration, update all of the following pages to keep them in sync:

  1. README.md — Data Sources table (add a row with logo, link, status, and description)
  2. docs/walkthrough/why-holmesgpt.md — Categorized integration list under "Every Major Observability Platform"
  3. docs/data-sources/builtin-toolsets/index.md — Grid cards listing on the toolsets index page
  4. docs/data-sources/builtin-toolsets/{name}.md — Dedicated documentation page for the new toolset
  5. Add a logo image to images/integration_logos/ if one is available

Debugging CLI / Rich Live Display Issues

When troubleshooting terminal rendering bugs (ghost frames, flickering, misaligned output):

Capturing terminal output through a PTY:

# Use `script` to force a PTY and capture raw ANSI escape sequences
script -qec "poetry run python your_script.py" /dev/null > /tmp/raw_output.txt 2>&1

Without a PTY, Rich detects non-interactive mode and skips Live rendering entirely.

Analyzing ANSI escape sequences:

# Key escape codes for Rich Live:
# \x1b[1A  = cursor up 1 line
# \x1b[2K  = erase entire line
# Rich erases previous frame with: (erase + cursor-up) × height, then prints new frame

# Count cursor-ups per frame transition to detect drift:
import re
erase_pattern = r"\x1b\[2K(?:\x1b\[1A\x1b\[2K)*"
for match in re.finditer(erase_pattern, raw_output):
    ups = match.group(0).count("\x1b[1A")

Writing unit tests for Live display (no LLM required):

# Render to StringIO with force_terminal=True to get ANSI sequences
from io import StringIO
buf = StringIO()
console = Console(file=buf, force_terminal=True, width=120)
# ... render frames ...
raw = buf.getvalue()  # Contains full ANSI escape sequences
# Parse cursor-up counts vs frame heights to detect ghost frames

Key patterns:

  • Ghost frames = cumulative drift where each frame leaves 1+ orphaned lines
  • Verify by counting: cursor-ups per transition should equal rendered lines per frame
  • Known Rich 13.9.4 bug: Live.refresh() calls console.print(Control()) with default end="\\n", adding a trailing newline not counted in LiveRender._shape. When the terminal has room below the display, each frame leaks 1 ghost line. When the display is at the bottom (common case), the \\n causes scrolling and height-1 cursor-ups is correct.
  • Workaround: subclass Live and override refresh() to pass end="". Do NOT patch position_cursor — that over-erases when the display is at the terminal bottom (the common case).

Investigating Eval Regressions / Holmes Behavior Changes

Understand the behavior from trace data before designing a fix. Braintrust (or the local evals_report.md) holds the rendered prompts, per-LLM-call metrics, and tool-call sequences for each iter. Pull traces for one baseline run and one current run for the same (test, model) before reading any source diffs — file-level diffs (prompts, code, config) routinely mislead about what actually changed at runtime (e.g. a jinja2 template can grow in source lines and shrink in rendered output). Look at individual runs first; aggregate statistics over n iters can hide deterministic per-iter differences under variance.

When reverting or fixing a suspect PR, read its full diff (git show --stat <commit>) before deciding what to change — a PR often has multiple effects, and reverting only the file you noticed leaves the others still acting on the model.

Security Notes

  • All tools have read-only access by design
  • Bash toolset validates commands for safety
  • No secrets should be committed to repository
  • Use environment variables or config files for API keys
  • RBAC permissions are respected for Kubernetes access

Eval Tests (LLM Evaluations)

For creating, running, and debugging LLM eval tests, use the /create-eval skill. It contains the complete workflow, test_case.yaml field reference, anti-hallucination patterns, infrastructure setup guides, and CLI reference.

Test Structure:

  • Use sequential test numbers: check existing tests for next available number
  • Required files: test_case.yaml, infrastructure manifests, toolsets.yaml (if needed)
  • Use dedicated namespace per test: app-<testid> (e.g., app-177)
  • All resource names must be unique across tests to prevent conflicts

Tags:

  • CRITICAL: Only use valid tags from pyproject.toml - invalid tags cause test collection failures
  • Check existing tags before adding new ones, ask user permission for new tags

Cloud Service Evals (No Kubernetes Required):

  • Evals can test against cloud services (Elasticsearch, external APIs) directly via environment variables
  • Faster setup (<30 seconds vs minutes for K8s infrastructure)
  • before_test creates test data in the cloud service; after_test cleans up only if safe (see reentrancy below)
  • Use toolsets.yaml to configure the toolset with env var references: api_url: "{{ env.ELASTICSEARCH_URL }}"
  • CI/CD secrets: When adding evals for a new integration, you must add the required environment variables to .github/workflows/eval-regression.yaml in the "Run tests" step. Tell the user which secrets they need to add to their GitHub repository settings (e.g., ELASTICSEARCH_URL, ELASTICSEARCH_API_KEY).
  • HTTP request passthrough: The root conftest.py has a responses fixture with autouse=True that mocks ALL HTTP requests by default. When adding a new cloud integration, you MUST add the service's URL pattern to the passthrough list in conftest.py (search for rsps.add_passthru). Use re.compile() for pattern matching (e.g., rsps.add_passthru(re.compile(r"https://.*\.cloud\.es\.io"))).
  • Cloud Service Eval Reentrancy: The same eval can run on multiple PRs in parallel in CI. Cloud service evals that create resources with static names (e.g., Confluence spaces, Elasticsearch indices) must be reentrant:
    • before_test must be idempotent: create-or-reuse resources, never fail if they already exist
    • after_test must NOT delete shared resources that another parallel run may be using. Either omit after_test entirely, or limit cleanup to resources with a unique run-scoped identifier
    • Use test-ID-based resource names (e.g., HLMS233 for space keys) to avoid collisions with other evals, but accept that the same eval may overlap with itself across parallel PR runs
    • Kubernetes evals don't have this problem because each PR gets its own KIND cluster, so namespaces are already isolated. Cloud service evals share a single account/instance across all PR runs

User Prompts & Expected Outputs:

  • Be specific: Test exact values like "The dashboard title is 'Home'" not generic "Holmes retrieves dashboard"
  • Match prompt to test: User prompt must explicitly request what you're testing
    • BAD: "Get the dashboard"
    • GOOD: "Get the dashboard and tell me the title, panels, and time range"
  • Anti-cheat prompts: Don't use technical terms that give away solutions
    • BAD: "Find node_exporter metrics"
    • GOOD: "Find CPU pressure monitoring queries"
  • Test discovery, not recognition: Holmes should search/analyze, not guess from context
  • Ruling out hallucinations is paramount: When choosing between test approaches, prefer the one that rules out hallucinations:
    • Best: Check specific values that can only be discovered by querying (e.g., unique IDs, injected error codes, exact counts)
    • Acceptable: Use include_tool_calls: true to verify the tool was called when output values are too generic to rule out hallucinations
    • Bad: Check generic output patterns that an LLM could plausibly guess (e.g., "cluster status is green/yellow/red", "has N nodes")
  • expected_output is invisible to LLM: The expected_output field is only used by the evaluator - the LLM never sees it. This means:
    • You can safely put secrets/verification codes in expected_output that the LLM must discover
    • before_test can inject a unique verification code into test data, and expected_output can check for it
    • This is a powerful pattern for cloud service tests: create data with a unique code in before_test, ask LLM to find it, verify with expected_output
    # Example: before_test creates a page with verification code "HOLMES-EVAL-7x9k2m4p"
    # The LLM must discover this code by querying the service
    expected_output:
      - "Must report the verification code: HOLMES-EVAL-7x9k2m4p"
  • include_tool_calls: true: Use when expected output is too generic to be hallucination-proof. Prefer specific answer checking when possible, but verifying tool calls is better than a test that can't rule out hallucinations.
    # Use when values are generic (cluster health could be guessed)
    include_tool_calls: true
    expected_output:
      - "Must call elasticsearch_cluster_health tool"
      - "Must report cluster status"

Infrastructure Setup:

  • Don't just test pod readiness - verify actual service functionality
  • Poll real API endpoints and check for expected content (e.g., "title":"Home", "type":"welcome")
  • CRITICAL: Use exit 1 when setup verification fails to fail the test early
  • Never use :latest container tags - use specific versions like grafana/grafana:12.3.1

Running and Testing Evals

🚨 CRITICAL: Always Test Your Changes

NEVER submit test changes without verification:

Required Testing Workflow:

  1. Setup Phase: poetry run pytest -k "test_name" --only-setup --no-cov
  2. Full Test: poetry run pytest -k "test_name" --no-cov
  3. Verify Results: Ensure 100% pass rate and expected behavior

When to Test:

  • ✅ After creating new tests
  • ✅ After modifying existing tests
  • ✅ After refactoring shared infrastructure
  • ✅ After performance optimizations
  • ✅ After adding/changing tags

Red Flags - Never Skip Testing:

  • ❌ "The changes look good" without running
  • ❌ "It's just a small change"
  • ❌ "I'll test it later"

Testing is Part of Development: Testing is not optional - it's an integral part of the development process. Untested code is broken code.

Testing Methodology:

  • Phase 1: Test setup with --only-setup flag first
  • Phase 2: Run full test after confirming setup works
  • Use background execution for long tests: nohup ... > logfile.log 2>&1 &
  • Handle port conflicts: clean up previous test port forwards before running

Common Flags:

  • --skip-cleanup: Keep resources after test (useful for debugging setup)
  • --skip-setup: Skip before_test commands (useful for iterative testing)

Shared Infrastructure Pattern

When to use shared infrastructure:

  • Multiple tests use the same service (Grafana, Loki, Prometheus)
  • Service configuration is standardized across tests

Implementation:

# Create shared manifest in tests/llm/fixtures/shared/servicename.yaml
# Use in tests:
kubectl apply -f ../../shared/servicename.yaml -n app-<testid>

Benefits:

  • Single place for version updates
  • Consistent configuration across tests
  • Reduced maintenance overhead
  • Follows established pattern (Loki, Prometheus, Grafana)

Setup Verification Best Practices

Prefer kubectl exec over port forwarding for setup verification:

# GOOD - kubectl exec pattern (no port conflicts)
kubectl exec -n namespace deployment/service -- wget -q -O- http://localhost:port/health

# AVOID - port forward for setup verification (causes conflicts)
kubectl port-forward svc/service port:port &
curl localhost:port/health
kill $PORTFWD_PID

Performance optimization guidelines:

  • Use sleep 1 instead of sleep 5 for most retry loops
  • Remove sleeps after straightforward operations (port forward start)
  • Reduce timeout values: 60s for pod readiness, 30s for API verification
  • Question every sleep - many are unnecessary

Race Condition Handling: Never use bare kubectl wait immediately after resource creation. Use retry loops:

# WRONG - fails if pod not scheduled yet
kubectl apply -f deployment.yaml
kubectl wait --for=condition=ready pod -l app=myapp --timeout=300s

# CORRECT - retry loop handles race condition
kubectl apply -f deployment.yaml
POD_READY=false
for i in {1..60}; do
  if kubectl wait --for=condition=ready pod -l app=myapp --timeout=5s 2>/dev/null; then
    echo "✅ Pod is ready!"
    POD_READY=true
    break
  fi
  sleep 1
done
if [ "$POD_READY" = false ]; then
  echo "❌ Pod failed to become ready after 60 seconds"
  kubectl logs -l app=myapp --tail=20  # Diagnostic info
  exit 1  # CRITICAL: Fail the test early
fi

Eval Best Practices

Realism:

  • No fake/obvious logs like "Memory usage stabilized at 800MB"
  • No hints in filenames like "disk_consumer.py" - use realistic names like "training_pipeline.py"
  • No error messages that give away it's simulated like "Simulated processing error"
  • Use real-world scenarios: ML pipelines with checkpoint issues, database connection pools
  • Resource naming should be neutral, not hint at the problem (avoid "broken-pod", "crashloop-app")

Architecture:

  • Implement full architecture even if complex (e.g., use Loki for log aggregation, not simplified alternatives)
  • Proper separation of concerns (app → file → Promtail → Loki → Holmes)
  • ALWAYS use Secrets for scripts, not inline manifests or ConfigMaps
  • Use minimal resource footprints (reduce memory/CPU for test services)

Anti-Cheat Testing Guidelines:

  • Prevent Domain Knowledge Cheats: Use neutral, application-specific names instead of obvious technical terms
    • Example: "E-Commerce Platform Monitoring" not "Node Exporter Full"
    • Example: "Payment Service Dashboard" not "MySQL Error Dashboard"
    • Add source comments: # Uses Node Exporter dashboard but renamed to prevent cheats
  • Resource Naming Rules: Avoid hint-giving names
    • Use realistic business context: "checkout-api", "user-service", "inventory-db"
    • Avoid obvious problem indicators: "broken-pod" → "payment-service-1"
    • Test discovery ability, not pattern recognition
  • Prompt Design: Don't give away solutions in prompts
    • BAD: "Find the node_pressure_cpu_waiting_seconds_total query"
    • GOOD: "Find the Prometheus query that monitors CPU pressure waiting time"
    • Test Holmes's search/analysis skills, not domain knowledge shortcuts

Configuration:

  • Custom runbooks: Add runbooks field in test_case.yaml (runbooks: {} for empty catalog)
  • Custom toolsets: Create separate toolsets.yaml file (never put in test_case.yaml)
  • Toolset config must go under config field:
toolsets:
  grafana/dashboards:
    enabled: true
    config:  # All toolset-specific config under 'config'
      api_url: http://localhost:10177

Always run evals before submitting when possible:

  1. poetry run pytest -k "test_name" --only-setup --no-cov — verify setup
  2. poetry run pytest -k "test_name" --no-cov — run full test
  3. Verify cleanup: kubectl get namespace app-NNN should return NotFound

Reading CodeRabbit Review Comments

In the sandbox environment, gh CLI is not available and the GitHub REST API will quickly rate-limit unauthenticated requests. Use the following approach:

  1. Find the PR number via the GitHub API (unauthenticated, one call):
    curl -s "https://api.github.com/repos/HolmesGPT/holmesgpt/pulls?head=HolmesGPT:BRANCH_NAME&state=open" \
      | python3 -c "import sys,json; [print(f'PR #{p[\"number\"]}') for p in json.load(sys.stdin)]"
  2. Fetch comments with WebFetch (not rate-limited): Use the WebFetch tool on https://github.com/HolmesGPT/holmesgpt/pull/<NUMBER> and ask it to extract all CodeRabbit comments, including file/line references, full text, and code suggestions.

What does NOT work:

  • gh CLI — not installed in the sandbox
  • Multiple curl calls to api.github.com — hits unauthenticated rate limits (60/hour) fast
  • The local git proxy (127.0.0.1) — only supports git protocol, not the GitHub REST API

Documentation Lookup

When asked about content from the HolmesGPT documentation website (https://holmesgpt.dev/), look in the local docs/ directory:

  • Python SDK examples: docs/installation/python-installation.md
  • CLI installation: docs/installation/cli-installation.md
  • Kubernetes deployment: docs/installation/kubernetes-installation.md
  • Toolset documentation: docs/data-sources/builtin-toolsets/
  • API reference: docs/reference/

MkDocs Navigation

The docs site uses the awesome-nav plugin. Navigation is controlled by .nav.yml files in each docs/ subdirectory, not by the nav: section in mkdocs.yml. When adding a new docs page, you must add it to the .nav.yml file in the corresponding directory (e.g., docs/reference/.nav.yml for reference pages).

MkDocs Formatting Notes

When writing documentation in the docs/ directory:

  • Lists after headers: Always add a blank line between a header/bold text and a list, otherwise MkDocs won't render the list properly

    **Good:**
    
    - item 1
    - item 2
    
    **Bad:**
    - item 1
    - item 2
  • Headers inside tabs: Use bold text for section headings inside tabs, not markdown headers (##, ###, etc.)

    Why: MkDocs Material font sizes make H2 (~25px) and H3 (~20px) visually larger than tab titles (~14px). When a header inside a tab is bigger than the tab title itself, it looks like it belongs outside/above the tabs, breaking the visual hierarchy.

    <!-- GOOD: Bold text for sections inside tabs -->
    === "Tab Name"
    
        **Create the policy:**
    
        Instructions here...
    
        **Create the role:**
    
        More instructions...
    
    <!-- BAD: Headers inside tabs look like they're outside -->
    === "Tab Name"
    
        ### Create the policy
    
        Instructions here...
  • Avoid excessive headers: Don't create a header for every small section. Headers should be used sparingly for major sections. For minor sections like test steps or examples, use bold text or combine content into a single code block with comments instead of separate headers.

    <!-- BAD: Header for every test step -->
    ## Testing
    ### Test 1: Check Status
    ### Test 2: Check Logs
    ### Test 3: Health Check
    
    <!-- GOOD: Single section with combined content -->
    ## Testing the Connection
    
    ```bash
    # Check pod status
    kubectl get pods -n YOUR_NAMESPACE
    
    # Check logs
    kubectl logs -n YOUR_NAMESPACE
    
    # Health check
    curl http://localhost:8000/health
    
    
  • Don't describe Holmes's behavior: In "Common Use Cases" sections, show only the example prompts. Don't explain what Holmes will do or list steps like "Holmes will: 1. Query X, 2. Analyze Y, 3. Return Z". Users will see this when they run it.

  • Skip Capabilities sections: Don't list what a toolset/integration can do. Users discover capabilities by using Holmes. Feature lists become stale quickly.

  • Skip Security Best Practices sections: Assume users understand basics like rotating credentials, using least privilege, and deleting local secrets. These sections add little value.

  • Consolidate troubleshooting commands: Instead of separate headers for each troubleshooting scenario, use a single code block with comments:

    # Authentication errors - check if secret is mounted
    kubectl exec ...
    
    # Permission denied - verify roles
    gcloud projects get-iam-policy ...
  • Common Use Cases format: Just example prompts, one per code block, no sub-headers, no explanations.