Commit 75ffffa

Merge upstream/main
# Conflicts:
#	README.md
#	docs/daily-updates/2026-04-30.mdx
#	docs/daily-updates/2026-05-01.mdx
#	docs/daily-updates/2026-05-02.mdx
#	docs/daily-updates/2026-05-03.mdx
#	docs/daily-updates/2026-05-04.mdx
#	docs/daily-updates/2026-05-05.mdx
#	docs/daily-updates/overview.mdx
2 parents 65cdfe2 + 76e1373 commit 75ffffa

1,476 files changed

Lines changed: 157271 additions & 28867 deletions

.claude/settings.json

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(uv run *)",
+      "Bash(python *)",
+      "Bash(python3 *)",
+      "Bash(python3*)",
+      "PowerShell(uv run *)",
+      "PowerShell(python *)",
+      "PowerShell(python3 *)"
+    ]
+  }
+}

.cursor/rules/graph-nodes.mdc

Lines changed: 18 additions & 59 deletions
@@ -1,70 +1,29 @@
 ---
-description: LangGraph pipeline architecture and node development
+description: Investigation pipeline architecture and stage development
 globs:
-- app/graph_pipeline.py
-- app/routing.py
-- app/state.py
-- app/nodes/**
+- app/pipeline/**
+- app/agent/**
+- app/state/**
+- app/delivery/**
 ---
 
-# Graph & Node Development
+# Investigation pipeline
 
-## Pipeline Architecture
+## Coordinator
 
-The agent is a LangGraph `StateGraph` over `AgentState` (a `TypedDict` in `app/state.py`).
+`app/pipeline/pipeline.py` runs **resolve integrations → extract alert → investigation agent → deliver**.
 
-### Investigation Flow
-```
-inject_auth → extract_alert → resolve_integrations → plan_actions → investigate → diagnose
-↑ │
-└── (loop if recommendations) ──┘
-
-publish → END
-```
+`app/pipeline/runners.py` exposes `run_investigation`, `run_chat`, and async streaming helpers.
 
-### Chat Flow
-```
-inject_auth → router → chat_agent ⇄ tool_executor → END
-→ general → END
-```
+## Key packages
 
-## Key Files
+- `app/agent/context.py`, `extract.py`, `investigation.py`, `chat.py`
+- `app/delivery/` for publishing and integrations
+- `app/state/agent_state.py` for `AgentState` / `InvestigationState`
 
-- `app/graph_pipeline.py` — `build_graph()` wires nodes and edges
-- `app/routing.py` — conditional edge functions (`route_by_mode`, `route_after_extract`, etc.)
-- `app/state.py` — `AgentState` TypedDict + `AgentStateModel` Pydantic validator
-- `app/nodes/__init__.py` — barrel exports for all node functions
+## Conventions
 
-## Writing a Node
-
-1. Create a subpackage under `app/nodes/` (e.g., `app/nodes/my_step/`)
-2. Implement the node function in `node.py`
-3. Re-export from the subpackage `__init__.py`
-4. Register in `app/nodes/__init__.py` and wire into `graph_pipeline.py`
-
-### Node function pattern:
-
-```python
-from langsmith import traceable
-from app.output import get_tracker
-from app.state import InvestigationState
-
-@traceable(name="node_my_step")
-def node_my_step(state: InvestigationState) -> dict:
-    tracker = get_tracker()
-    tracker.start("my_step", "Doing something")
-    # ... read from state, do work ...
-    tracker.complete("my_step", fields_updated=["evidence"], message="Done")
-    return {"field": value} # partial state update dict
-```
-
-## Rules
-
-- Nodes receive full state, return a **partial dict** of fields to update
-- Always edit `routing.py` alongside `graph_pipeline.py` when adding conditional edges
-- New state keys go in `AgentState` (TypedDict) AND `AgentStateModel` (Pydantic) in `state.py`
-- Use `@traceable(name="node_xxx")` for LangSmith tracing on all node functions
-- Use `get_tracker()` for CLI progress output
-- Use `InvestigateInput.from_state(state)` in investigation nodes to extract typed inputs
-- The investigation loop is capped at 5 total iterations (see `should_continue_investigation`)
-- `InvestigationState` is an alias for `AgentState`
+- Prefer `@traceable` from `app.utils.tracing` on externally-visible orchestration helpers.
+- Stages read full state and return **partial dict** updates.
+- Use `get_tracker()` for CLI progress when appropriate.
+- Keep new persisted keys in the state `TypedDict` and any matching validators.
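
Note on the new conventions in `.cursor/rules/graph-nodes.mdc`: the rule that stages read the full state and return partial dict updates, combined with the coordinator order (resolve integrations, extract alert, investigation agent, deliver), can be illustrated with a minimal sketch. Everything below is hypothetical: the state keys, the stage bodies, and the `run_stages` helper are invented for illustration and are not taken from `app/pipeline/pipeline.py`.

```python
from typing import Any, Callable, TypedDict


class InvestigationState(TypedDict, total=False):
    # Hypothetical keys; the real state is defined in app/state/agent_state.py.
    raw_alert: str
    integrations: list[str]
    alert: dict[str, Any]
    findings: list[str]
    delivered: bool


# A stage reads the full state and returns only the keys it wants to update.
Stage = Callable[[InvestigationState], dict[str, Any]]


def resolve_integrations(state: InvestigationState) -> dict[str, Any]:
    # Placeholder: pretend exactly one integration is configured.
    return {"integrations": ["grafana"]}


def extract_alert(state: InvestigationState) -> dict[str, Any]:
    return {"alert": {"title": state["raw_alert"].strip()}}


def investigate(state: InvestigationState) -> dict[str, Any]:
    title = state["alert"]["title"]
    return {"findings": [f"checked {name} for {title}" for name in state["integrations"]]}


def deliver(state: InvestigationState) -> dict[str, Any]:
    return {"delivered": bool(state["findings"])}


def run_stages(state: InvestigationState, stages: list[Stage]) -> dict[str, Any]:
    """Hypothetical coordinator: merge each stage's partial update into a copy of the state."""
    current: dict[str, Any] = dict(state)
    for stage in stages:
        current.update(stage(current))
    return current


result = run_stages(
    {"raw_alert": "  disk full  "},
    [resolve_integrations, extract_alert, investigate, deliver],
)
```

Because each stage returns only the keys it changes, keys it does not mention (here `raw_alert`) pass through the merge untouched, which is the property the convention relies on.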

.devcontainer/devcontainer.json

Lines changed: 3 additions & 3 deletions
@@ -20,11 +20,11 @@
     "PYTHONUTF8": "1"
   },
   "forwardPorts": [
-    2024
+    8000
   ],
   "portsAttributes": {
-    "2024": {
-      "label": "LangGraph dev server",
+    "8000": {
+      "label": "OpenSRE health app",
       "onAutoForward": "notify"
     }
   },

.dockerignore

Lines changed: 41 additions & 18 deletions
@@ -1,19 +1,42 @@
-.git
-.cursor
+# Ignore node_modules and other dependency directories
+node_modules
+bower_components
+vendor
+
+# Ignore logs and temporary files
+*.log
+*.tmp
+*.swp
+
+# Ignore .env files and other environment files
 .env
-.DS_Store
-**/.DS_Store
-**/__pycache__
-**/.pytest_cache
-**/.mypy_cache
-**/.ruff_cache
-**/.venv
-**/.venv-devcontainer
-**/node_modules
-**/cdk.out
-tests/test_case_upstream_lambda/pipeline_code/api_ingester/certifi
-tests/test_case_upstream_lambda/pipeline_code/api_ingester/charset_normalizer
-tests/test_case_upstream_lambda/pipeline_code/api_ingester/idna
-tests/test_case_upstream_lambda/pipeline_code/api_ingester/requests
-tests/test_case_upstream_lambda/pipeline_code/api_ingester/urllib3
-tests/test_case_upstream_lambda/pipeline_code/api_ingester/*.dist-info
+.env.*
+*.local
+
+# Ignore git-related files
+.git
+.gitignore
+
+# Ignore Docker-related files and configs
+.dockerignore
+docker-compose.yml
+
+# Ignore build and cache directories
+dist
+build
+.cache
+__pycache__
+
+# Ignore IDE and editor configurations
+.vscode
+.idea
+*.sublime-project
+*.sublime-workspace
+.DS_Store # macOS-specific
+
+# Ignore test and coverage files
+coverage
+*.coverage
+*.test.js
+*.spec.js
+tests

.env.example

Lines changed: 92 additions & 7 deletions
@@ -6,20 +6,62 @@
 # 3. Configure one integration with `opensre integrations setup <service>`.
 # 4. Verify with `opensre health` and `opensre integrations verify <service>`.
 #
-# `~/.tracer/integrations.json` is the preferred local integration store.
+# `~/.config/opensre/integrations.json` is the preferred local integration store.
 # The env vars below are still supported as fallback/direct configuration.
 
 # --- Most important ---------------------------------------------------------
 
 # Provider used for LLM calls. Common values: anthropic, openai, openrouter,
-# gemini, nvidia, codex.
+# gemini, nvidia, minimax, bedrock, ollama, codex, claude-code, opencode, kimi,
+# copilot.
 LLM_PROVIDER=anthropic
 
 # Codex CLI works for `opensre investigate` after `codex login`.
 # Leave CODEX_MODEL empty to use the CLI's currently configured model.
 CODEX_MODEL=
 CODEX_BIN=
 
+# Claude Code CLI works for `opensre investigate` after `claude login` or setting ANTHROPIC_API_KEY.
+# Install: npm i -g @anthropic-ai/claude-code
+# Leave CLAUDE_CODE_MODEL empty to use the CLI's currently configured model.
+CLAUDE_CODE_MODEL=
+CLAUDE_CODE_BIN=
+
+# Gemini CLI works for `opensre investigate` after `gemini` auth setup.
+# Install: npm i -g @google/gemini-cli
+# Leave GEMINI_CLI_MODEL empty to use the CLI's configured default model.
+GEMINI_CLI_MODEL=
+GEMINI_CLI_BIN=
+# OpenCode CLI works for `opensre investigate` after `opencode auth login`.
+# Leave OPENCODE_MODEL empty to use the CLI's currently configured model
+OPENCODE_MODEL=
+OPENCODE_BIN=
+
+# Cursor Agent CLI
+# Leave CURSOR_MODEL empty to use the CLI's currently configured model.
+CURSOR_MODEL=
+CURSOR_BIN=
+
+# Kimi Code CLI works for `opensre investigate` after `kimi login`.
+# Leave KIMI_MODEL empty to use the CLI's currently configured model.
+KIMI_MODEL=
+KIMI_BIN=
+KIMI_API_KEY=
+# KIMI_SHARE_DIR=~/.kimi
+
+# GitHub Copilot CLI works for `opensre investigate` after running `copilot`
+# and authenticating with the interactive `/login` slash command.
+# Install: npm i -g @github/copilot
+# Leave COPILOT_MODEL empty to use the CLI's currently configured model.
+COPILOT_MODEL=
+COPILOT_BIN=
+# Optional auth fallbacks (only used when no stored Copilot CLI login exists):
+# COPILOT_GITHUB_TOKEN=
+# GH_TOKEN=
+# GITHUB_TOKEN=
+# Optional config dir override (default: ~/.copilot):
+# COPILOT_HOME=
+
 # Set the key for the provider you choose above.
 ANTHROPIC_API_KEY=
 ANTHROPIC_REASONING_MODEL=
@@ -34,6 +76,11 @@ OPENROUTER_API_KEY=
 OPENROUTER_REASONING_MODEL=
 OPENROUTER_TOOLCALL_MODEL=
 
+# Requesty is an OpenAI-compatible LLM gateway with fallback routing and caching.
+REQUESTY_API_KEY=
+REQUESTY_REASONING_MODEL=
+REQUESTY_TOOLCALL_MODEL=
+
 # Gemini uses the OpenAI-compatible endpoint in this project.
 GEMINI_API_KEY=
 GEMINI_REASONING_MODEL=
@@ -44,6 +91,11 @@ NVIDIA_API_KEY=
 NVIDIA_REASONING_MODEL=
 NVIDIA_TOOLCALL_MODEL=
 
+# Amazon Bedrock — set `LLM_PROVIDER=bedrock` above. Uses the same AWS credential
+# chain as the AWS integration block below (region, keys, or IAM role). No LLM API key.
+BEDROCK_REASONING_MODEL=
+BEDROCK_TOOLCALL_MODEL=
+
 # --- First integrations to set up ------------------------------------------
 
 # For a first real RCA run, one of Grafana / Datadog / Honeycomb / Coralogix
@@ -80,6 +132,16 @@ ARGOCD_VERIFY_SSL=true
 # ARGOCD_INSTANCES='[{"name":"prod","base_url":"https://argocd.example.com","bearer_token":"***","project":"default"}]'
 ARGOCD_INSTANCES=
 
+# Helm 3 (read-only CLI — list/status/history/get values/get manifest)
+# Requires OSRE_HELM_INTEGRATION=1 (or true/yes) to activate from env.
+OSRE_HELM_INTEGRATION=
+HELM_PATH=helm
+HELM_KUBE_CONTEXT=
+HELM_KUBECONFIG=
+HELM_NAMESPACE=
+# Optional: cap manifest size from helm get manifest (integer, min 1024; default 600000).
+# HELM_MANIFEST_MAX_CHARS=
+
 # Datadog
 DD_API_KEY=
 DD_APP_KEY=
@@ -109,6 +171,12 @@ CORALOGIX_SUBSYSTEM_NAME=
 # CORALOGIX_INSTANCES='[{"name":"prod","api_key":"...","base_url":"https://api.coralogix.com"}]'
 CORALOGIX_INSTANCES=
 
+# SigNoz (Query API — logs, metrics, traces)
+# Local Docker stack: http://localhost:8080 (see infra/scripts/signoz/)
+# API key: Settings → Service Accounts → Keys
+SIGNOZ_URL=
+SIGNOZ_API_KEY=
+
 # AWS
 AWS_REGION=us-east-1
 AWS_ROLE_ARN=
@@ -134,6 +202,17 @@ GITHUB_MCP_TOOLSETS=repos,issues,pull_requests,actions,search
 # OPENSRE_GITHUB_MCP_REPO_PROBE_LIMIT=
 
 # Sentry
+# Runtime error monitoring for OpenSRE itself uses the project Sentry DSN constant.
+# Optional: override for operator-side DSN rotation without rebuilding.
+# OPENSRE_SENTRY_DSN=
+SENTRY_ERROR_SAMPLE_RATE=1.0
+SENTRY_TRACES_SAMPLE_RATE=1.0
+OPENSRE_SENTRY_DISABLED=0
+# Tag value attached to Sentry events to identify how this process is deployed.
+# Common values: railway, ec2, vercel, local. Defaults to "local" when unset.
+# OPENSRE_DEPLOYMENT_METHOD=local
+
+# Sentry investigation integration
 SENTRY_URL=https://sentry.io
 SENTRY_ORG_SLUG=
 SENTRY_PROJECT_SLUG=
@@ -240,6 +319,10 @@ JIRA_PROJECT_KEY=
 OPSGENIE_API_KEY=
 OPSGENIE_REGION=us
 
+# incident.io
+INCIDENT_IO_API_KEY=
+INCIDENT_IO_BASE_URL=
+
 # Vercel
 VERCEL_API_TOKEN=
 VERCEL_TEAM_ID=
@@ -273,6 +356,12 @@ DISCORD_DEFAULT_CHANNEL_ID=
 TELEGRAM_BOT_TOKEN=
 TELEGRAM_DEFAULT_CHAT_ID=
 
+# WhatsApp (Twilio)
+TWILIO_ACCOUNT_SID=
+TWILIO_AUTH_TOKEN=
+TWILIO_WHATSAPP_FROM=
+WHATSAPP_DEFAULT_TO=
+
 # --- Web app / hosted runtime only -----------------------------------------
 
 # Required only when using the Tracer web app / hosted integration path.
@@ -284,14 +373,10 @@ OPENSRE_API_KEY=
 
 # --- Deployment / runtime ---------------------------------------------------
 
-# Required for deployed OpenSRE / LangGraph services.
+# Required for hosted OpenSRE runtimes that need persistent storage.
 DATABASE_URI=
 REDIS_URI=
 
-# Optional LangSmith integration
-LANGSMITH_API_KEY=
-LANGSMITH_DEPLOYMENT_NAME=open-sre-agent
-
 ENV=development
 
 # Reversible masking before external LLM calls. Off by default.
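
Note on the Helm block added to `.env.example`: its comments say the integration activates when `OSRE_HELM_INTEGRATION` is `1`, `true`, or `yes`, and that `HELM_MANIFEST_MAX_CHARS` is an integer with a minimum of 1024 and a default of 600000. A minimal sketch of reading those two settings follows. The function names are hypothetical, and three behaviours are assumptions rather than anything the diff confirms: case-insensitive matching of the flag, clamping values below the minimum up to 1024, and falling back to the default on non-integer input.

```python
from collections.abc import Mapping

# Values taken from the comments in .env.example.
TRUTHY_VALUES = {"1", "true", "yes"}
DEFAULT_MANIFEST_MAX_CHARS = 600_000
MIN_MANIFEST_MAX_CHARS = 1024


def helm_enabled(env: Mapping[str, str]) -> bool:
    """Return True when OSRE_HELM_INTEGRATION holds one of the documented truthy values."""
    # Assumption: surrounding whitespace and letter case are ignored.
    return env.get("OSRE_HELM_INTEGRATION", "").strip().lower() in TRUTHY_VALUES


def manifest_max_chars(env: Mapping[str, str]) -> int:
    """Return the manifest size cap, using the documented default and minimum."""
    raw = env.get("HELM_MANIFEST_MAX_CHARS", "").strip()
    if not raw:
        return DEFAULT_MANIFEST_MAX_CHARS
    try:
        value = int(raw)
    except ValueError:
        # Assumption: invalid input falls back to the default instead of raising.
        return DEFAULT_MANIFEST_MAX_CHARS
    # Assumption: values below the documented minimum are raised to it.
    return max(value, MIN_MANIFEST_MAX_CHARS)
```

In practice both helpers would be called with `os.environ`; taking a mapping as a parameter keeps them easy to test without touching the process environment.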

.github/ISSUE_TEMPLATE/feature_request.yml

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@ description: Suggest a new feature or capability
 title: "[FEATURE] "
 labels:
   - enhancement
+  - pending triage
 body:
   - type: textarea
     id: problem

.github/dependabot.yml

Lines changed: 2 additions & 2 deletions
@@ -2,10 +2,10 @@
 
 version: 2
 updates:
-  - package-ecosystem: "pip"
+  - package-ecosystem: "uv"
     directory: "/"
     schedule:
-      interval: "daily"
+      interval: "weekly"
     open-pull-requests-limit: 10
 
   - package-ecosystem: "npm"
