Commit 2dde2cc

Update docs
1 parent ba88255 commit 2dde2cc

5 files changed

Lines changed: 293 additions & 36 deletions

.mcp.json

Lines changed: 0 additions & 10 deletions
@@ -7,16 +7,6 @@
     "bot-memory": {
       "type": "sse",
       "url": "http://localhost:8080/sse"
-    },
-    "mcp-atlassian": {
-      "command": "uvx",
-      "args": ["mcp-atlassian"],
-      "env": {
-        "JIRA_URL": "${JIRA_URL}",
-        "JIRA_USERNAME": "${JIRA_USERNAME}",
-        "JIRA_API_TOKEN": "${JIRA_API_TOKEN}"
-      },
-      "type": "stdio"
     }
   }
 }

ARCHITECTURE.md

Lines changed: 20 additions & 22 deletions
@@ -225,7 +225,7 @@ Agent cycle completes
 |---------|-------------|--------|
 | Claude (Vertex AI) | GCP service account key | `sa-key.json` + env vars in `.env` |
 | Jira | API token | `JIRA_URL`, `JIRA_USERNAME`, `JIRA_API_TOKEN` in `.env` |
-| GitHub | SSH key | `gh auth login` (one-time setup, SSH protocol) |
+| GitHub | SSH key + PAT (`GH_TOKEN`) | SSH for git ops, PAT for gh CLI API calls |
 | GitLab | SSH key | `glab auth login` (one-time setup, SSH protocol) |
 | Memory server | None (localhost) | Hardcoded `http://localhost:8080` |
 | Chrome DevTools | None (localhost) | Hardcoded `http://127.0.0.1:9222` |
@@ -243,21 +243,19 @@ The system deploys as **separate pods**:
 
 This keeps the deployment simple — one memory server serves all bot instances, and each bot is independently scalable by adding new pods with different labels.
 
-### What needs to change
+### Container images
 
-1. **Bot container** — containerize with Docker. Proven as a POC (Dockerfile existed, SSH and auth were the main challenges). Needs:
-   - GCP service account key injected via secret mount
-   - SSH key for git access (shared by gh/glab CLIs)
-   - gh/glab CLI auth config files (`~/.config/gh/`, `~/.config/glab-cli/`) — one-time setup by a human via `gh auth login` / `glab auth login` on the bot account, then baked into the image or mounted
-   - Non-root user (Claude Code rejects root)
+Both images use Red Hat UBI9 base images:
 
-2. **Memory server pod** — already containerized (Docker Compose). Needs:
-   - PostgreSQL connection string pointing to RDS instance
-   - Cluster-internal service for bot pods to reach it (e.g. `memory-server:8080`)
+- **Bot container** (`Dockerfile`) — `ubi9/ubi` with Python 3.12, Node.js 22 (NodeSource), Chromium headless (EPEL), gh CLI, glab CLI, uv. Runs as non-root `botuser` (Claude Code rejects root). Entrypoint decodes secrets from env vars (SSH key, GPG key, SA key), starts Chromium in background, then launches the bot runner.
 
-3. **Chrome/Chromium** — needs a headless browser sidecar or a shared remote debugging endpoint for visual verification of UI changes
+- **Memory server** (`memory-server/Dockerfile`) — multi-stage build. Stage 1: `ubi9/nodejs-22` builds the React dashboard. Stage 2: `ubi9/python-312-minimal` runs the FastMCP app with dashboard assets baked in.
 
-4. **Multiple labels** — each label runs as a separate bot container. All bot containers share a single memory server pod (low traffic, no need to replicate).
+### What needs to change for cluster
+
+1. **Memory server** — PostgreSQL connection string pointing to RDS instance instead of local container. Cluster-internal service for bot pods to reach it (e.g. `memory-server:8080`).
+
+2. **Multiple labels** — each label runs as a separate bot container. All bot containers share a single memory server pod (low traffic, no need to replicate).
 
 ### What stays the same
 
@@ -294,9 +292,8 @@ This keeps the deployment simple — one memory server serves all bot instances,
 │ │ (pgvector) │ │
 │ └────────────┘ │
 │ │
-│ ┌──────────────────────────┐ │
-│ │ Headless Chrome (shared) │ │
-│ └──────────────────────────┘ │
+│ (Chromium headless runs inside │
+│ each bot pod on port 9222) │
 │ │
 └──────────────────────┬──────────────┘
 
@@ -308,13 +305,14 @@ This keeps the deployment simple — one memory server serves all bot instances,
 
 ### Secrets required
 
-| Secret | Used by |
-|--------|---------|
-| GCP service account key (sa-key.json) | Bot — Claude API via Vertex AI |
-| Jira API token | Bot — mcp-atlassian MCP server |
-| SSH key | Bot — git clone/push, used by both gh and glab CLIs |
-| gh/glab CLI auth config | Bot — API access for PRs/MRs (one-time manual setup per bot account) |
-| RDS PostgreSQL credentials | Memory server |
+| Secret | Env var | Used by |
+|--------|---------|---------|
+| SSH private key (base64) | `SSH_PRIVATE_KEY_B64` | Bot — git clone/push over SSH |
+| GPG private key (base64) | `GPG_PRIVATE_KEY_B64` | Bot — commit signing |
+| GitHub PAT | `GH_TOKEN` | Bot — gh CLI (PR creation, reviews, comments) |
+| GCP service account key (base64) | `GOOGLE_SA_KEY_B64` | Bot — Claude API via Vertex AI |
+| Jira credentials | `JIRA_URL`, `JIRA_USERNAME`, `JIRA_API_TOKEN` | Bot — mcp-atlassian MCP server |
+| RDS PostgreSQL credentials | `DATABASE_URL` | Memory server |
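
Editor's note: the base64 secrets in the table above are decoded by the container entrypoint, which this commit does not show. As a rough illustration of that step, here is a minimal Python sketch. The helper name `write_secret_from_env` and the destination paths are hypothetical; only the env var names come from the table.

```python
import base64
import os
from pathlib import Path
from typing import Mapping


def write_secret_from_env(
    var_name: str, dest: Path, env: Mapping[str, str] = os.environ
) -> bool:
    """Decode a base64 env var into a file only the owner can read.

    Returns False when the variable is unset, so optional secrets can be skipped.
    (Hypothetical helper; the real entrypoint is not part of this diff.)
    """
    encoded = env.get(var_name)
    if not encoded:
        return False
    # Tolerate line-wrapped base64 (the default output of `base64` on Linux).
    compact = "".join(encoded.split())
    data = base64.b64decode(compact, validate=True)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Create with mode 0600 up front; ssh rejects private keys readable by others.
    fd = os.open(dest, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.chmod(dest, 0o600)  # also tighten a pre-existing file
    return True
```

A caller might use it as `write_secret_from_env("SSH_PRIVATE_KEY_B64", Path.home() / ".ssh" / "id_key")`, where the file name is again an assumption.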
 
 ### Scaling
 
OPERATIONS.md

Lines changed: 245 additions & 0 deletions
@@ -0,0 +1,245 @@
+# Operations Guide
+
+How to manage and monitor the bot in day-to-day use.
+
+## Monitoring
+
+### Jira Filter
+
+All bot-eligible tickets across teams are tracked via a shared Jira filter:
+
+**Filter ID**: [107017](https://redhat.atlassian.net/issues/?filter=107017)
+
+**JQL**: `project = RHCLOUD AND labels in (hcc-ai-framework, hcc-ai-platform-accessmanagement)`
+
+This shows all tickets tagged for the bot, regardless of status. Use it to see what the bot is working on, what's queued, and what's done.
+
+To add a new team's tickets, add their label to the filter (e.g. `hcc-ai-<team-name>`).
+
+### Dashboard
+
+The memory server dashboard at `http://localhost:8080` shows:
+- **Active tasks** — what the bot is currently working on, with status and PR links
+- **Memories** — learnings the bot has stored from completed work
+- **Costs** — per-cycle cost breakdown by work type
+
+### Logs
+
+```bash
+make logs                              # Tail bot.log
+docker compose logs -f bot             # Docker container logs
+docker compose logs -f memory-server   # Memory server logs
+```
+
+## Ticket Lifecycle
+
+A ticket goes through these stages as the bot processes it:
+
+```
+  Backlog                   In Progress               Code Review              Done
+┌──────────┐              ┌──────────────┐          ┌─────────────┐          ┌──────┐
+│ Groomed  │  bot claims  │ Bot working  │ PR open  │ Waiting for │  merged  │ Done │
+│ + labeled│ ───────────> │ on branch    │ ───────> │ human review│ ───────> │      │
+│          │              │ bot/<KEY>    │          │             │          │      │
+└──────────┘              └──────────────┘          └─────────────┘          └──────┘
+     │                           │                         │                    │
+     │                           │                         │                    │
+Human grooms              Bot updates                Bot responds          Bot closes
+and labels                task metadata              to review             Jira ticket
+the ticket                in memory server           feedback              + stores
+                                                                           learnings
+```
+
+### Example: RHCLOUD-37254
+
+A real ticket processed by the bot:
+
+1. **Groomed** — Human added labels `hcc-ai-platform-accessmanagement` and `repo:insights-rbac`
+2. **Picked up** — Bot found it via JQL query, assigned itself, transitioned to "In Progress", added to the active sprint
+3. **Implemented** — Bot cloned `insights-rbac`, created branch `bot/RHCLOUD-37254`, loaded the `rbac` persona, read repo `CLAUDE.md`, implemented the fix
+4. **PR opened** — Bot pushed the branch, opened a PR via `gh pr create`, transitioned ticket to "Code Review", commented on Jira with the PR link
+5. **Review cycle** — Human reviewed the PR. Bot checked for new feedback each cycle and addressed comments
+6. **Merged** — Once the PR was merged, bot transitioned the ticket to "Done" and stored learnings in RAG memory
+
+### Example: RHCLOUD-46011
+
+A frontend ticket with visual verification:
+
+1. **Groomed** — Labels `hcc-ai-framework` and `repo:astro-virtual-assistant-frontend`
+2. **Implemented** — Bot loaded the `frontend` persona, used PatternFly MCP for component docs
+3. **Visual verification** — Bot started the dev server, used chrome-devtools MCP to navigate to the affected page, took before/after screenshots
+4. **PR opened** — Screenshots embedded as base64 in the PR description (never committed to the repo)
+5. **Result**: [astro-virtual-assistant-frontend#368](https://github.com/RedHatInsights/astro-virtual-assistant-frontend/pull/368)
+
+## What the Bot Has Done
+
+The bot has been running since late March 2026 on the `hcc-ai-framework` label. Here's a summary of completed work, showing how different ticket types are handled.
+
+### Cross-repo feature: RHCLOUD-46384
+
+**Add icon field to FrontendEnvironment CRD** — a feature spanning 3 repos.
+
+The bot received a groomed ticket with labels `repo:frontend-operator` and `repo:insights-chrome`. It investigated the `insights-chrome` code first, discovered `ServiceIcon.tsx` already supported icon rendering — the gap was in the operator CRD. It then:
+
+- Added an `icon` field to the FrontendEnvironment CRD in Go, updated the reconciler, regenerated manifests, and updated e2e tests
+- Opened [frontend-operator#569](https://github.com/RedHatInsights/frontend-operator/pull/569)
+- Reported on Jira that insights-chrome needed no changes and that app-interface (readonly repo) needed manual config updates
+
+When asked via a Jira comment to also handle the app-interface changes, the bot initially reported it lacked push access. After being told about a fork repo, it cloned the fork, made the changes, and opened [app-interface MR !180888](https://gitlab.cee.redhat.com/service/app-interface/-/merge_requests/180888) — a cross-host (GitHub + GitLab) ticket resolved in a single cycle.
+
+### UI bug fix: RHCLOUD-44667
+
+**Timespan option labels empty in Notifications Event Log** — a PatternFly migration bug.
+
+A user reported that the dropdown for selecting time ranges showed blank labels. The bot identified the root cause: `SelectOption` components were self-closing (`<SelectOption ... />`) with no children. PatternFly v6 requires label text as children.
+
+- Fixed both `EventLogDateFilter.tsx` and `NotificationsLogDateFilter.tsx`
+- Opened [notifications-frontend#884](https://github.com/RedHatInsights/notifications-frontend/pull/884)
+- PR merged, bot closed the Jira ticket and stored the PF6 pattern as a RAG memory for future reference
+
+### CVE triage: RHCLOUD-44642
+
+**CVE-2026-24842 node-tar in pdf-generator** — security vulnerability ticket.
+
+The bot checked the current state of `node-tar` in pdf-generator and found it was already at version 7.5.11 (fix version was 7.5.7). Verified with `npm ls tar` and `npm audit`. No code changes needed — bot commented the analysis on Jira and closed the ticket. Total time: one cycle.
+
+### CI migration with human feedback loop: RHCLOUD-46420
+
+**Migrate pdf-generator Jenkins from GHPRB to GitHub Branch Source** — a deadline-driven infrastructure task.
+
+The bot initially created a Jenkinsfile and opened [pdf-generator#313](https://github.com/RedHatInsights/pdf-generator/pull/313). A human commented on Jira: "The Jenkins jobs haven't been passing for a long time — we should just remove them entirely." The bot:
+
+1. Closed the original PR
+2. Opened a new PR [#314](https://github.com/RedHatInsights/pdf-generator/pull/314) removing the unused CI scripts
+3. Opened [app-interface MR !180890](https://gitlab.cee.redhat.com/service/app-interface/-/merge_requests/180890) removing the Jenkins job config
+4. All within the same conversation thread on Jira
+
+This demonstrates the feedback loop: the bot adapts to human direction mid-flight.
+
+### Summary table
+
+| Ticket | Summary | Repos | Type | Result |
+|--------|---------|-------|------|--------|
+| RHCLOUD-46384 | Add icon field to FrontendEnvironment CRD | frontend-operator, insights-chrome, app-interface | cross-repo feature | PR + MR merged |
+| RHCLOUD-38822 | Service Dropdown displaying incorrect icon set | frontend-operator, insights-chrome, app-interface | investigation + fix | closed |
+| RHCLOUD-46426 | ChangeDefaultTemplate does not reset previous default | widget-layout-backend | bug fix | PR merged |
+| RHCLOUD-44667 | Timespan option labels empty in Notifications Event Log | notifications-frontend | UI bug fix | PR merged |
+| RHCLOUD-46420 | Migrate pdf-generator Jenkins PR check | pdf-generator, app-interface | CI cleanup | PR + MR merged |
+| RHCLOUD-45699 | Update virtual-assistant grype scanning | astro-virtual-assistant-v2 | tech debt | PR merged |
+| RHCLOUD-44642 | CVE-2026-24842 pdf-generator: node-tar | pdf-generator | CVE triage | already fixed, closed |
+| RHCLOUD-44644 | CVE-2026-24842 payload-tracker-frontend: node-tar | payload-tracker-frontend | CVE fix | PR merged |
+| RHCLOUD-43838 | CVE-2025-12816 payload-tracker-frontend: node-forge | payload-tracker-frontend | CVE fix | PR merged |
+
+In progress:
+
+| Ticket | Summary | Repos | Status |
+|--------|---------|-------|--------|
+| RHCLOUD-37254 | RBAC allowing roles with same name as System Roles | insights-rbac | Code Review |
+
+## Adding Work for the Bot
+
+### Step 1: Groom the ticket
+
+Use the interactive grooming prompt:
+
+```bash
+claude --prompt-file prompts/groom.md
+```
+
+Or manually ensure the ticket has:
+- Clear problem statement (current vs expected behavior)
+- Specific files/components if known
+- Acceptance criteria as a checklist
+- Scoped to a single PR
+
+### Step 2: Label the ticket
+
+Required labels:
+- **Primary label** — matches the bot instance: `hcc-ai-framework` or `hcc-ai-platform-accessmanagement`
+- **`repo:<name>`** — must match a key in `project-repos.json` (e.g. `repo:insights-rbac`, `repo:notifications-frontend`)
+
+Optional:
+- `needs-investigation` — bot investigates and reports findings instead of implementing
+- `platform-experience-ui` — routes to the UI sprint board
+
+### Step 3: Leave it unassigned
+
+The bot only picks up unassigned tickets. If a ticket is assigned to someone, the bot skips it.
+
+### JQL the bot uses
+
+New work:
+```
+project = RHCLOUD AND labels = <PRIMARY_LABEL>
+AND assignee is EMPTY AND status != Done
+ORDER BY priority DESC, created ASC
+```
+
+Assigned tickets check:
+```
+project = RHCLOUD AND labels = <PRIMARY_LABEL>
+AND assignee = currentUser() AND status != Done
+ORDER BY updated DESC
+```
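
Editor's note: the two queries above differ only in the assignee clause and the sort order. A minimal Python sketch of how they could be built for a given primary label follows. The function names and the label validation are assumptions for illustration, not the bot's actual code.

```python
import re

# Assumed label shape: letters, digits and a few separators, so the label
# can be embedded in JQL without quoting.
_LABEL_RE = re.compile(r"[A-Za-z0-9_.:-]+")


def _check_label(primary_label: str) -> str:
    if not _LABEL_RE.fullmatch(primary_label):
        raise ValueError(f"unexpected characters in label: {primary_label!r}")
    return primary_label


def new_work_jql(primary_label: str, project: str = "RHCLOUD") -> str:
    """Unassigned, not-done tickets for one bot label, highest priority first."""
    label = _check_label(primary_label)
    return (
        f"project = {project} AND labels = {label} "
        "AND assignee is EMPTY AND status != Done "
        "ORDER BY priority DESC, created ASC"
    )


def assigned_jql(primary_label: str, project: str = "RHCLOUD") -> str:
    """Tickets the bot account already owns, most recently updated first."""
    label = _check_label(primary_label)
    return (
        f"project = {project} AND labels = {label} "
        "AND assignee = currentUser() AND status != Done "
        "ORDER BY updated DESC"
    )
```

For example, `new_work_jql("hcc-ai-framework")` yields the "New work" query with `<PRIMARY_LABEL>` filled in.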
+
+## Running Multiple Bots
+
+Each bot instance handles one team label. To run multiple bots:
+
+### On host (development)
+
+```bash
+# Terminal 1
+make run LABEL=hcc-ai-framework
+
+# Terminal 2
+make run-rbac
+```
+
+### In Docker
+
+```bash
+# Start with default label
+make docker-up
+
+# Start with a different label
+BOT_LABEL=hcc-ai-platform-accessmanagement make docker-up
+```
+
+For multiple labels in Docker simultaneously, you'd run separate compose projects or add multiple bot services to `docker-compose.yml`.
+
+## Cost Management
+
+Each cycle records its cost. Monitor spending:
+
+```bash
+make costs-today   # Today's spend
+make costs-week    # Last 7 days
+make costs         # All time
+```
+
+The bot sleeps for 5 minutes between active cycles and 1 hour when idle (no work found). These intervals are configured in `config.json`.
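
Editor's note: the sleep policy described above amounts to choosing between two intervals after each cycle. A small Python sketch follows. The key names `active_sleep_seconds` and `idle_sleep_seconds` are hypothetical, since the actual schema of `config.json` is not shown in this commit.

```python
from typing import Optional

ACTIVE_SLEEP_SECONDS = 5 * 60   # 5 minutes between active cycles
IDLE_SLEEP_SECONDS = 60 * 60    # 1 hour when no work was found


def next_sleep_seconds(found_work: bool, config: Optional[dict] = None) -> int:
    """Pick the pause before the next cycle.

    `config` stands in for the parsed config.json; the key names below are
    assumptions, with the documented intervals as defaults.
    """
    config = config or {}
    if found_work:
        return int(config.get("active_sleep_seconds", ACTIVE_SLEEP_SECONDS))
    return int(config.get("idle_sleep_seconds", IDLE_SLEEP_SECONDS))
```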
+
+## Troubleshooting
+
+### Bot finds no work
+
+Check:
+1. Are there tickets with the right primary label? Use the Jira filter above
+2. Are the tickets unassigned?
+3. Do they have a `repo:` label matching `project-repos.json`?
+4. Is the bot at the 5-task capacity limit? Check the dashboard
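
Editor's note: the four checks above can be expressed as one function that lists every reason a ticket would be skipped. This is a sketch under stated assumptions. The function name and arguments are hypothetical, `known_repos` stands for the keys of `project-repos.json`, and requiring every `repo:` label to be known is a guess at how cross-repo tickets are handled.

```python
from typing import Collection, Iterable, List, Optional

MAX_ACTIVE_TASKS = 5  # capacity limit named in the checklist


def why_not_picked_up(
    labels: Iterable[str],
    assignee: Optional[str],
    primary_label: str,
    known_repos: Collection[str],
    active_tasks: int,
) -> List[str]:
    """Return the reasons a ticket would be skipped; empty means eligible."""
    labels = list(labels)
    reasons = []
    if primary_label not in labels:
        reasons.append(f"missing primary label {primary_label}")
    if assignee:
        reasons.append(f"already assigned to {assignee}")
    repos = [name[len("repo:"):] for name in labels if name.startswith("repo:")]
    unknown = [r for r in repos if r not in known_repos]
    if not repos:
        reasons.append("no repo:<name> label")
    elif unknown:
        reasons.append("repo label not in project-repos.json: " + ", ".join(unknown))
    if active_tasks >= MAX_ACTIVE_TASKS:
        reasons.append(f"bot is at the {MAX_ACTIVE_TASKS}-task capacity limit")
    return reasons
```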
+
+### MCP server fails to connect
+
+Check the bot logs for which server failed:
+- `bot-memory` — is the memory server running? (`make memory-server`)
+- `mcp-atlassian` — are Jira credentials set in `.env`?
+- `chrome-devtools` — is Chromium running? (only in Docker; on host, run `./start-chromium.sh`)
+
+### Bot is stuck on a ticket
+
+Check the task record in the dashboard. Look at `metadata.last_step` and `metadata.notes`. If truly stuck:
+1. Comment on the Jira ticket explaining the blocker
+2. Manually set the task status to `paused` via the dashboard
+3. The bot will skip it and move to other work

bot/config.py

Lines changed: 14 additions & 4 deletions
@@ -28,13 +28,23 @@ def load_config(script_dir: Path) -> Config:
 
 
 def load_mcp_servers(script_dir: Path) -> dict:
-    """Load and merge MCP servers from persona configs.
+    """Load and merge MCP servers from bot and persona configs.
 
-    The root .mcp.json (bot-memory, chrome-devtools, mcp-atlassian) is loaded
-    automatically by the SDK via setting_sources=["project"]. This function
-    only loads additional per-persona MCP servers.
+    The root .mcp.json (bot-memory, chrome-devtools) is loaded automatically
+    by the SDK via setting_sources=["project"]. This function loads additional
+    servers: bot-specific (bot/mcp.json for mcp-atlassian) and per-persona.
     """
     servers: dict = {}
+
+    # Bot-specific MCP servers (e.g. mcp-atlassian — kept separate from
+    # .mcp.json so it doesn't interfere with local dev sessions)
+    bot_mcp = script_dir / "bot" / "mcp.json"
+    if bot_mcp.exists():
+        with open(bot_mcp) as f:
+            data = json.load(f)
+        for name, cfg in data.get("mcpServers", {}).items():
+            servers[name] = cfg
+
     for mcp_file in sorted(script_dir.glob("personas/*/mcp.json")):
         with open(mcp_file) as f:
             data = json.load(f)

bot/mcp.json

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+{
+  "mcpServers": {
+    "mcp-atlassian": {
+      "command": "uvx",
+      "args": ["mcp-atlassian"],
+      "env": {
+        "JIRA_URL": "${JIRA_URL}",
+        "JIRA_USERNAME": "${JIRA_USERNAME}",
+        "JIRA_API_TOKEN": "${JIRA_API_TOKEN}"
+      },
+      "type": "stdio"
+    }
+  }
+}
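
Editor's note: taken together, the `bot/config.py` and `bot/mcp.json` changes mean `mcp-atlassian` is now merged in by `load_mcp_servers` rather than loaded from the root `.mcp.json`. The self-contained sketch below shows that merge order. The function name `merge_mcp_servers` is hypothetical, and because the diff cuts off inside the persona loop, the assumption that persona entries are merged the same way as bot entries (later files overriding earlier ones on a name clash) is a guess.

```python
import json
from pathlib import Path


def merge_mcp_servers(script_dir: Path) -> dict:
    """Merge mcpServers from bot/mcp.json, then from each personas/*/mcp.json.

    Assumption: a later file overrides an earlier one when names collide.
    """
    servers: dict = {}
    candidates = [script_dir / "bot" / "mcp.json"]
    candidates += sorted(script_dir.glob("personas/*/mcp.json"))
    for mcp_file in candidates:
        if not mcp_file.exists():
            continue  # bot/mcp.json is optional, as in the diff above
        with open(mcp_file) as f:
            data = json.load(f)
        servers.update(data.get("mcpServers", {}))
    return servers
```

Keeping `mcp-atlassian` out of the root `.mcp.json` means a local development session that loads project settings no longer starts the Jira server, while the bot still gets it through this merge.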
