Enhance RCA preflight checks with interactive setup and graceful degradation #3

Shreyanand wants to merge 3 commits into redhat-et:main
Conversation
- Add env_vars and configurable metadata to all setup checks so Claude can walk users through interactive configuration of missing settings
- Make Splunk optional: warn and skip steps 2-3 instead of hard failure
- Add SSH config validation (_ssh_host_exists) and ssh_setup_needed flag for interactive SSH config creation
- Add MLFlow SSH tunnel auto-start: kill stale tunnels, start fresh, verify reachability with MLFlow-specific response validation
- Add JUMPBOX_URI to MLFlow check for tunnel configuration
- Fix settings.example.json key: "environment" → "env" to match Claude Code's expected format
- Add settings.example.json template
- Categorize checks: JOB_LOGS_DIR required, MLFlow recommended, SSH/Splunk/GitHub optional

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

Adds a preflight setup system and example config for the root-cause-analysis skill.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Setup as "setup.py"
    participant Jumpbox
    participant MLflow
    User->>CLI: run `scripts/cli.py setup` or `scripts/cli.py analyze`
    CLI->>Setup: run_checks(base_dir, repo_root)
    Setup->>Setup: perform checks (venv, JOB_LOGS_DIR, SSH, Splunk, GitHub, MLflow, hooks)
    alt MLflow reachable
        Setup->>MLflow: probe /api/2.0/mlflow/experiments/search
        MLflow-->>Setup: 200 OK
    else MLflow unreachable
        Setup->>Jumpbox: optionally start SSH tunnel (ssh -f -N -L ...)
        Jumpbox-->>Setup: tunnel established
        Setup->>MLflow: probe via local forwarded port
        MLflow-->>Setup: responds or error
    end
    Setup-->>CLI: aggregated checks results (JSON or human)
    CLI-->>User: print/return status (and analyze may set skip_splunk)
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 5
🧹 Nitpick comments (1)
skills/root-cause-analysis/scripts/setup.py (1)
313-322: Consider handling `pgrep` unavailability gracefully.

`pgrep` may not be available on all systems (e.g., minimal containers, some macOS configurations). If the command fails with `FileNotFoundError`, the current exception handler returns `False`, which is reasonable, but a brief log/warning would help when debugging tunnel issues.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@skills/root-cause-analysis/scripts/setup.py` around lines 313 - 322, In _tunnel_already_running, add explicit handling for FileNotFoundError when invoking subprocess.run so missing pgrep is logged (instead of silently returning False); catch FileNotFoundError separately, emit a brief warning that includes the port and that pgrep is unavailable (using the module logger or processLogger used elsewhere), and then return False as before; keep the existing broad exception fallback for other errors.
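As a sketch of the suggested handling (the function body and logger names here are illustrative, not the actual `setup.py` internals):

```python
import logging
import subprocess

logger = logging.getLogger(__name__)

def tunnel_already_running(port: int) -> bool:
    """Check for an existing ssh tunnel via pgrep, warning (rather than
    failing silently) when pgrep itself is not installed."""
    pattern = f"ssh.*-L {port}:127.0.0.1:{port}"
    try:
        result = subprocess.run(
            ["pgrep", "-f", pattern],
            capture_output=True, text=True, timeout=5,
        )
        return result.returncode == 0
    except FileNotFoundError:
        # pgrep is absent (minimal containers, some macOS setups):
        # log it so tunnel debugging is not silent.
        logger.warning("pgrep unavailable; cannot detect tunnel on port %d", port)
        return False
    except Exception:
        # Preserve the existing broad fallback for other errors.
        return False
```

The separate `except FileNotFoundError` clause keeps the original "return False on any error" behavior while making the missing-tool case visible in logs.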
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.claude/settings.example.json:
- Line 9: Update the example settings template so the SPLUNK_VERIFY_SSL default
is secure: change the "SPLUNK_VERIFY_SSL" value from "false" to "true" in
.claude/settings.example.json and add a short inline comment or adjacent
documentation recommending setting it to "false" only for local
development/testing, not production, to avoid weakening security.
In `@skills/root-cause-analysis/scripts/cli.py`:
- Around line 163-166: When skip_splunk is true, the saved splunk_logs lacks
keys downstream functions expect (e.g., build_correlation_timeline reads job_id,
guid, namespace, query_time_window, fetched_at); update the skip branch in the
CLI (the skip_splunk block that calls save_step) to populate the same schema
returned by fetch_correlated_logs() by adding the missing keys (job_id, guid,
namespace, query_time_window, fetched_at) with sensible defaults (None/empty
values or an ISO timestamp for fetched_at) so downstream functions like
build_correlation_timeline and fetch_correlated_logs consumers won’t raise
KeyError.
- Around line 171-173: The mypy errors come from splunk_logs.get(...) returning
object; update typing so len() gets a Sized. Either change the return annotation
of fetch_correlated_logs to a typed mapping (e.g., def
fetch_correlated_logs(...) -> Dict[str, Sequence[Any]] or Mapping[str,
Sequence[Any]]), or cast each value at the call site (e.g., cast(Sequence[Any],
splunk_logs.get('ocp_logs', []))) before calling len in the print lines that
reference splunk_logs.get('ocp_logs'), splunk_logs.get('error_logs'), and
splunk_logs.get('pods_found').
In `@skills/root-cause-analysis/scripts/setup.py`:
- Line 9: Remove the unused import Config from module .config in setup.py to
resolve the ruff F401 error; edit the top-level import statement that currently
reads "from .config import Config" and delete that line (or, if future use
intended, replace it with a commented note), ensuring no references to the
symbol Config remain in the file (check for any use of Config in functions or
variables like any setup or initialization logic).
- Around line 337-341: The code builds a shell string in cmd and calls
subprocess.run(..., shell=True) using untrusted jumpbox_uri and port; replace
this with the list-form invocation to avoid shell interpretation (e.g., call
subprocess.run with args list ["ssh","-f","-N","-L", f"{port}:127.0.0.1:{port}",
jumpbox_uri] and remove shell=True) and optionally validate/normalize
jumpbox_uri before use; update the subprocess.run call in the same block (the
cmd variable and the subprocess.run invocation) to use the argument list,
preserving capture_output, text, and timeout options.
---
Nitpick comments:
In `@skills/root-cause-analysis/scripts/setup.py`:
- Around line 313-322: In _tunnel_already_running, add explicit handling for
FileNotFoundError when invoking subprocess.run so missing pgrep is logged
(instead of silently returning False); catch FileNotFoundError separately, emit
a brief warning that includes the port and that pgrep is unavailable (using the
module logger or processLogger used elsewhere), and then return False as before;
keep the existing broad exception fallback for other errors.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c3acdedc-e479-4532-8ade-905950b6fab4
📒 Files selected for processing (4)
- .claude/settings.example.json
- skills/root-cause-analysis/SKILL.md
- skills/root-cause-analysis/scripts/cli.py
- skills/root-cause-analysis/scripts/setup.py
```json
  "SPLUNK_INDEX": "<your-splunk-index>",
  "SPLUNK_OCP_APP_INDEX": "<splunk-ocp-app-index>",
  "SPLUNK_OCP_INFRA_INDEX": "<splunk-ocp-infra-index>",
  "SPLUNK_VERIFY_SSL": "false",
```
Default SPLUNK_VERIFY_SSL: "false" weakens security posture.
Disabling SSL certificate verification exposes connections to man-in-the-middle attacks. Consider defaulting to "true" in the template and only recommending "false" for development/testing environments in comments or documentation.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.claude/settings.example.json at line 9, Update the example settings
template so the SPLUNK_VERIFY_SSL default is secure: change the
"SPLUNK_VERIFY_SSL" value from "false" to "true" in
.claude/settings.example.json and add a short inline comment or adjacent
documentation recommending setting it to "false" only for local
development/testing, not production, to avoid weakening security.
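A minimal sketch of how the secure default could be enforced where the setting is read (the helper name and the set of accepted falsy spellings are assumptions, not code from this PR):

```python
import os
from typing import Mapping, Optional

def splunk_verify_ssl(env: Optional[Mapping[str, str]] = None) -> bool:
    """Parse SPLUNK_VERIFY_SSL with a secure default of True, so
    certificate verification is only disabled when set explicitly."""
    if env is None:
        env = os.environ
    raw = env.get("SPLUNK_VERIFY_SSL", "true")
    # Anything other than an explicit opt-out keeps verification on.
    return raw.strip().lower() not in {"false", "0", "no"}
```

Defaulting in code as well as in the template means a user who deletes the key from settings.json still gets verification.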
```python
if skip_splunk:
    print("  Skipped: Splunk not configured")
    splunk_logs = {"ocp_logs": [], "error_logs": [], "pods_found": [], "skipped": True, "reason": "Splunk not configured"}
    save_step(analysis_dir, 2, splunk_logs)
```
Skipped Step 2 output may lack fields expected by Step 3.
When Splunk is skipped, the output contains only ocp_logs, error_logs, pods_found, skipped, and reason. However, fetch_correlated_logs() normally returns additional fields like job_id, guid, namespace, query_time_window, and fetched_at that build_correlation_timeline() may access.
Consider aligning the skipped schema with the normal return schema to avoid potential KeyError in downstream processing.
Proposed fix
```diff
 if skip_splunk:
     print("  Skipped: Splunk not configured")
-    splunk_logs = {"ocp_logs": [], "error_logs": [], "pods_found": [], "skipped": True, "reason": "Splunk not configured"}
+    splunk_logs = {
+        "job_id": job_context.get("job_id"),
+        "guid": job_context.get("guid", ""),
+        "namespace": job_context.get("namespace", ""),
+        "query_time_window": {},
+        "fetched_at": None,
+        "ocp_logs": [],
+        "error_logs": [],
+        "pods_found": [],
+        "errors": [],
+        "skipped": True,
+        "reason": "Splunk not configured",
+    }
     save_step(analysis_dir, 2, splunk_logs)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/root-cause-analysis/scripts/cli.py` around lines 163 - 166, When
skip_splunk is true, the saved splunk_logs lacks keys downstream functions
expect (e.g., build_correlation_timeline reads job_id, guid, namespace,
query_time_window, fetched_at); update the skip branch in the CLI (the
skip_splunk block that calls save_step) to populate the same schema returned by
fetch_correlated_logs() by adding the missing keys (job_id, guid, namespace,
query_time_window, fetched_at) with sensible defaults (None/empty values or an
ISO timestamp for fetched_at) so downstream functions like
build_correlation_timeline and fetch_correlated_logs consumers won’t raise
KeyError.
```python
print(f"  OCP logs: {len(splunk_logs.get('ocp_logs', []))}")
print(f"  Error logs: {len(splunk_logs.get('error_logs', []))}")
print(f"  Pods found: {len(splunk_logs.get('pods_found', []))}")
```
Type narrowing needed for splunk_logs.get() calls.
The pipeline reports mypy errors: "Argument 1 to 'len' has incompatible type 'object'; expected 'Sized'" at lines 171-173. The .get() method returns object without type information.
Proposed fix
```diff
-print(f"  OCP logs: {len(splunk_logs.get('ocp_logs', []))}")
-print(f"  Error logs: {len(splunk_logs.get('error_logs', []))}")
-print(f"  Pods found: {len(splunk_logs.get('pods_found', []))}")
+ocp_logs = splunk_logs.get('ocp_logs', [])
+error_logs = splunk_logs.get('error_logs', [])
+pods_found = splunk_logs.get('pods_found', [])
+print(f"  OCP logs: {len(ocp_logs) if isinstance(ocp_logs, list) else 0}")
+print(f"  Error logs: {len(error_logs) if isinstance(error_logs, list) else 0}")
+print(f"  Pods found: {len(pods_found) if isinstance(pods_found, list) else 0}")
```

Alternatively, add type annotations to the fetch_correlated_logs return type or cast the results.
🧰 Tools
🪛 GitHub Actions: CI
[error] 171-171: mypy: Argument 1 to "len" has incompatible type "object"; expected "Sized" [arg-type]
[error] 172-172: mypy: Argument 1 to "len" has incompatible type "object"; expected "Sized" [arg-type]
[error] 173-173: mypy: Argument 1 to "len" has incompatible type "object"; expected "Sized" [arg-type]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/root-cause-analysis/scripts/cli.py` around lines 171 - 173, The mypy
errors come from splunk_logs.get(...) returning object; update typing so len()
gets a Sized. Either change the return annotation of fetch_correlated_logs to a
typed mapping (e.g., def fetch_correlated_logs(...) -> Dict[str, Sequence[Any]]
or Mapping[str, Sequence[Any]]), or cast each value at the call site (e.g.,
cast(Sequence[Any], splunk_logs.get('ocp_logs', []))) before calling len in the
print lines that reference splunk_logs.get('ocp_logs'),
splunk_logs.get('error_logs'), and splunk_logs.get('pods_found').
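For illustration, the cast-at-call-site option could look like this (the `splunk_logs` literal is stand-in data, not the real `fetch_correlated_logs()` output):

```python
from typing import Any, Sequence, cast

# Stand-in for a result whose values mypy currently types as plain object.
splunk_logs: dict = {"ocp_logs": ["line1", "line2"], "error_logs": [], "pods_found": []}

# cast() is a runtime no-op; it only narrows the type so mypy accepts len().
ocp_logs = cast(Sequence[Any], splunk_logs.get("ocp_logs", []))
print(f"  OCP logs: {len(ocp_logs)}")
```

Annotating the return type of fetch_correlated_logs would fix all call sites at once, while casting keeps the change local to the three print lines.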
```python
try:
    cmd = f"ssh -f -N -L {port}:127.0.0.1:{port} {jumpbox_uri}"
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=15,
    )
```
Potential command injection via shell=True with user input.
The jumpbox_uri value comes from the JUMPBOX_URI environment variable and is interpolated directly into a shell command string. If this value contains shell metacharacters (;, |, &, $(), etc.), arbitrary commands could be executed.
Proposed fix: use subprocess list form instead of shell
```diff
 try:
-    cmd = f"ssh -f -N -L {port}:127.0.0.1:{port} {jumpbox_uri}"
-    result = subprocess.run(
-        cmd, shell=True, capture_output=True, text=True, timeout=15,
-    )
+    # Split jumpbox_uri safely - it may contain "-p port" arguments
+    import shlex
+    jumpbox_parts = shlex.split(jumpbox_uri)
+    cmd = ["ssh", "-f", "-N", "-L", f"{port}:127.0.0.1:{port}"] + jumpbox_parts
+    result = subprocess.run(
+        cmd, capture_output=True, text=True, timeout=15,
+    )
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
try:
    # Split jumpbox_uri safely - it may contain "-p port" arguments
    import shlex
    jumpbox_parts = shlex.split(jumpbox_uri)
    cmd = ["ssh", "-f", "-N", "-L", f"{port}:127.0.0.1:{port}"] + jumpbox_parts
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=15,
    )
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/root-cause-analysis/scripts/setup.py` around lines 337 - 341, The code
builds a shell string in cmd and calls subprocess.run(..., shell=True) using
untrusted jumpbox_uri and port; replace this with the list-form invocation to
avoid shell interpretation (e.g., call subprocess.run with args list
["ssh","-f","-N","-L", f"{port}:127.0.0.1:{port}", jumpbox_uri] and remove
shell=True) and optionally validate/normalize jumpbox_uri before use; update the
subprocess.run call in the same block (the cmd variable and the subprocess.run
invocation) to use the argument list, preserving capture_output, text, and
timeout options.
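To illustrate the list-form invocation, here is how `shlex.split` keeps the flags of a hypothetical JUMPBOX_URI value as separate argv entries, so no shell ever interprets the string:

```python
import shlex

# Hypothetical JUMPBOX_URI value containing an extra ssh flag.
jumpbox_uri = "user@jumpbox.example.com -p 2222"
port = 5000

# Each token becomes its own argv element passed directly to ssh;
# metacharacters like ; or $() in jumpbox_uri would stay literal text.
cmd = ["ssh", "-f", "-N", "-L", f"{port}:127.0.0.1:{port}"] + shlex.split(jumpbox_uri)
```

With `shell=True`, the same string would be handed to `/bin/sh`, which is where the injection risk comes from.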
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@skills/root-cause-analysis/scripts/setup.py`:
- Around line 258-260: The error message for the experiment placeholder is using
the wrong variable name; update the message appended to issues in the
is_placeholder(experiment) block to reference the actual environment variable
MLFLOW_EXPERIMENT_NAME (e.g., "MLFLOW_EXPERIMENT_NAME not configured") so it
matches the env_vars entry and avoids confusion; locate the
is_placeholder/existing experiment check where issues.append("EXPERIMENT_NAME
not configured") and change the string to the correct env var name.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9270cf6f-29e1-44b9-a113-c1f311d3c084
📒 Files selected for processing (2)
- skills/root-cause-analysis/scripts/cli.py
- skills/root-cause-analysis/scripts/setup.py
```python
if is_placeholder(experiment):
    issues.append("EXPERIMENT_NAME not configured")
    env_vars.append({"name": "MLFLOW_EXPERIMENT_NAME", "prompt": "MLFlow experiment name for tracing"})
```
Inconsistent variable name in error message.
The error message says "EXPERIMENT_NAME not configured" but the actual environment variable is MLFLOW_EXPERIMENT_NAME. This could confuse users trying to diagnose configuration issues.
Proposed fix
```diff
 if is_placeholder(experiment):
-    issues.append("EXPERIMENT_NAME not configured")
+    issues.append("MLFLOW_EXPERIMENT_NAME not configured")
     env_vars.append({"name": "MLFLOW_EXPERIMENT_NAME", "prompt": "MLFlow experiment name for tracing"})
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
if is_placeholder(experiment):
    issues.append("MLFLOW_EXPERIMENT_NAME not configured")
    env_vars.append({"name": "MLFLOW_EXPERIMENT_NAME", "prompt": "MLFlow experiment name for tracing"})
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@skills/root-cause-analysis/scripts/setup.py` around lines 258 - 260, The
error message for the experiment placeholder is using the wrong variable name;
update the message appended to issues in the is_placeholder(experiment) block to
reference the actual environment variable MLFLOW_EXPERIMENT_NAME (e.g.,
"MLFLOW_EXPERIMENT_NAME not configured") so it matches the env_vars entry and
avoids confusion; locate the is_placeholder/existing experiment check where
issues.append("EXPERIMENT_NAME not configured") and change the string to the
correct env var name.
```
.venv/bin/python scripts/cli.py setup --json
```

Review the JSON output. Some settings are required, others are optional:
Wouldn't it be better to have this in a separate file so that it can be used in the other skills, or maybe even on its own?
do you mean save this json as an intermediate file?
Summary by CodeRabbit

- New Features
- Bug Fixes
- Documentation