
Enhance RCA preflight checks with interactive setup and graceful degradation #3

Open
Shreyanand wants to merge 3 commits into redhat-et:main from Shreyanand:main

Conversation


@Shreyanand Shreyanand commented Mar 19, 2026

  • Add env_vars and configurable metadata to all setup checks so Claude can walk users through interactive configuration of missing settings
  • Make Splunk optional: warn and skip steps 2-3 instead of hard failure
  • Add SSH config validation (_ssh_host_exists) and ssh_setup_needed flag for interactive SSH config creation
  • Add MLFlow SSH tunnel auto-start: kill stale tunnels, start fresh, verify reachability with MLFlow-specific response validation
  • Add JUMPBOX_URI to MLFlow check for tunnel configuration
  • Fix settings.example.json key: "environment" → "env" to match Claude Code's expected format
  • Add settings.example.json template
  • Categorize checks: JOB_LOGS_DIR required, MLFlow recommended, SSH/Splunk/GitHub optional
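The `env_vars` and `configurable` metadata in the first bullet can be pictured with a small sketch. The field names and the check body below are illustrative, not the exact schema in `scripts/setup.py`:

```python
def check_splunk_sketch(env: dict) -> dict:
    """Illustrative check: report issues plus env_vars metadata for interactive setup."""
    issues, env_vars = [], []
    host = env.get("SPLUNK_HOST", "")
    # Angle-bracket values like "<your-splunk-host>" are treated as unconfigured
    if not host or (host.startswith("<") and host.endswith(">")):
        issues.append("SPLUNK_HOST not configured")
        env_vars.append({"name": "SPLUNK_HOST", "prompt": "Splunk hostname"})
    return {
        "name": "splunk",
        "category": "optional",      # required / recommended / optional
        "ok": not issues,
        "issues": issues,
        "env_vars": env_vars,        # lets Claude prompt the user for each missing value
        "configurable": bool(env_vars),
    }
```

With metadata like this, the interactive flow can walk the user through each entry in `env_vars` rather than failing with an opaque error.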

Summary by CodeRabbit

  • New Features

    • Added a sample configuration with environment placeholders and session start/stop hooks.
    • Added an interactive preflight check workflow and new CLI setup command (JSON output option) to validate and populate missing prerequisites, including MLflow and SSH tunnel support.
  • Bug Fixes

    • Improved error handling to skip optional integrations (e.g., Splunk) when misconfigured instead of failing.
  • Documentation

    • Updated setup guidance, restart instructions, and preflight behavior (required vs optional checks).


coderabbitai bot commented Mar 19, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 47f91840-aa37-4133-8a7d-166e204fb152

📥 Commits

Reviewing files that changed from the base of the PR and between 98425b7 and 76f2aa9.

📒 Files selected for processing (2)
  • skills/root-cause-analysis/scripts/cli.py
  • skills/root-cause-analysis/scripts/setup.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • skills/root-cause-analysis/scripts/cli.py

📝 Walkthrough

Adds a preflight setup system and example config for the root-cause-analysis skill: new .claude/settings.example.json, a setup CLI subcommand, a new scripts/setup.py checks module (venv, JOB_LOGS_DIR, SSH, Splunk, GitHub, MLflow with SSH-tunnel fallback), and changes to analyze to optionally skip Splunk.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Configuration & Hooks | `.claude/settings.example.json` | New example settings file with environment placeholders for Splunk, GitHub, SSH/jumpbox, MLflow, job logs, and Claude SessionStart/Stop hooks. |
| Skill Documentation | `skills/root-cause-analysis/SKILL.md` | Replaced setup instructions with a "Preflight Check" workflow: run `scripts/cli.py setup --json`, categorized required/recommended/optional config, interactive configuration flow, SSH `~/.ssh/config` handling, write/merge `.claude/settings.json`, and MLflow startup/behavior notes. |
| CLI | `skills/root-cause-analysis/scripts/cli.py` | Added `setup` subcommand (`cmd_setup`) that runs checks and emits JSON or human-readable output; `cmd_analyze` now warns and sets `skip_splunk` when Splunk is misconfigured and writes Step 2 payloads showing skipped status instead of hard-failing. |
| Preflight Checks Module | `skills/root-cause-analysis/scripts/setup.py` | New module implementing checks and helpers: `is_placeholder`, `check_python_venv`, `check_job_logs_dir`, `check_ssh`, `check_splunk`, `check_github`, `check_github_mcp`, `check_mlflow`, `check_mlflow_server`, `check_session_hook`, plus `run_checks` and `print_checks`. Includes MLflow reachability probe and optional SSH-tunnel startup via `JUMPBOX_URI`. |
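One plausible implementation of the `is_placeholder` helper listed above, assuming the example settings use angle-bracket template values like `<your-splunk-index>` (the module's actual heuristic may differ):

```python
from typing import Optional

def is_placeholder(value: Optional[str]) -> bool:
    """Treat missing, empty, and angle-bracket template values as unconfigured."""
    if not value:
        return True
    v = value.strip()
    # "<your-splunk-index>" style values come straight from settings.example.json
    return v.startswith("<") and v.endswith(">")
```

Each check can then funnel every env var through this one predicate instead of repeating ad-hoc emptiness tests.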

Sequence Diagram(s)

```mermaid
sequenceDiagram
  participant User
  participant CLI
  participant Setup as "setup.py"
  participant Jumpbox
  participant MLflow

  User->>CLI: run `scripts/cli.py setup` or `scripts/cli.py analyze`
  CLI->>Setup: run_checks(base_dir, repo_root)
  Setup->>Setup: perform checks (venv, JOB_LOGS_DIR, SSH, Splunk, GitHub, MLflow, hooks)
  alt MLflow reachable
    Setup->>MLflow: probe /api/2.0/mlflow/experiments/search
    MLflow-->>Setup: 200 OK
  else MLflow unreachable
    Setup->>Jumpbox: optionally start SSH tunnel (ssh -f -N -L ...)
    Jumpbox-->>Setup: tunnel established
    Setup->>MLflow: probe via local forwarded port
    MLflow-->>Setup: responds or error
  end
  Setup-->>CLI: aggregated checks results (JSON or human)
  CLI-->>User: print/return status (and analyze may set skip_splunk)
```
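The MLflow probe-and-tunnel flow in the diagram can be sketched in Python as follows. The function name, return values, and default port are illustrative, and the list-form `ssh` invocation is one safe way to start the tunnel, not necessarily the code as merged:

```python
import subprocess
import urllib.error
import urllib.request
from typing import Optional

def ensure_mlflow(uri: str, jumpbox_uri: Optional[str], port: int = 5000) -> str:
    """Probe MLflow; on failure, optionally start an SSH tunnel and re-probe."""
    probe = f"{uri}/api/2.0/mlflow/experiments/search"

    def reachable() -> bool:
        try:
            with urllib.request.urlopen(probe, timeout=5) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    if reachable():
        return "ok"
    if not jumpbox_uri:
        return "unreachable"
    # List-form invocation avoids shell interpretation of jumpbox_uri
    subprocess.run(
        ["ssh", "-f", "-N", "-L", f"{port}:127.0.0.1:{port}", jumpbox_uri],
        capture_output=True, text=True, timeout=15,
    )
    return "ok" if reachable() else "tunnel-started-but-unreachable"
```

The second probe after the tunnel start is what the diagram labels "probe via local forwarded port": the same URI now resolves through the forwarded port instead of failing outright.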

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title accurately summarizes the main changes: enhancing RCA preflight checks with interactive setup and graceful degradation when optional components are unavailable. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 95.65% which is sufficient. The required threshold is 80.00%. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
skills/root-cause-analysis/scripts/setup.py (1)

313-322: Consider handling pgrep unavailability gracefully.

pgrep may not be available on all systems (e.g., minimal containers, some macOS configurations). If the command fails with FileNotFoundError, the current exception handler returns False, which is reasonable, but a brief log/warning might help debugging tunnel issues.
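The suggested handling could look roughly like this; the function name mirrors the one referenced in the review, but the message wording and logging target are illustrative:

```python
import subprocess
import sys

def tunnel_already_running(port: int) -> bool:
    """Check for an existing ssh tunnel on the given port via pgrep."""
    try:
        result = subprocess.run(
            ["pgrep", "-f", f"ssh.*-L {port}:"],
            capture_output=True, text=True, timeout=5,
        )
        return result.returncode == 0
    except FileNotFoundError:
        # pgrep not installed (minimal containers, some macOS setups):
        # warn so tunnel debugging is not silent, then fall through to False
        print(
            f"warning: pgrep unavailable; cannot check for tunnel on port {port}",
            file=sys.stderr,
        )
        return False
    except Exception:
        # keep the existing broad fallback for other errors
        return False
```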

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@skills/root-cause-analysis/scripts/setup.py` around lines 313 - 322, In
_tunnel_already_running, add explicit handling for FileNotFoundError when
invoking subprocess.run so missing pgrep is logged (instead of silently
returning False); catch FileNotFoundError separately, emit a brief warning that
includes the port and that pgrep is unavailable (using the module logger or
processLogger used elsewhere), and then return False as before; keep the
existing broad exception fallback for other errors.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.claude/settings.example.json:
- Line 9: Update the example settings template so the SPLUNK_VERIFY_SSL default
is secure: change the "SPLUNK_VERIFY_SSL" value from "false" to "true" in
.claude/settings.example.json and add a short inline comment or adjacent
documentation recommending setting it to "false" only for local
development/testing, not production, to avoid weakening security.

In `@skills/root-cause-analysis/scripts/cli.py`:
- Around line 163-166: When skip_splunk is true, the saved splunk_logs lacks
keys downstream functions expect (e.g., build_correlation_timeline reads job_id,
guid, namespace, query_time_window, fetched_at); update the skip branch in the
CLI (the skip_splunk block that calls save_step) to populate the same schema
returned by fetch_correlated_logs() by adding the missing keys (job_id, guid,
namespace, query_time_window, fetched_at) with sensible defaults (None/empty
values or an ISO timestamp for fetched_at) so downstream functions like
build_correlation_timeline and fetch_correlated_logs consumers won’t raise
KeyError.
- Around line 171-173: The mypy errors come from splunk_logs.get(...) returning
object; update typing so len() gets a Sized. Either change the return annotation
of fetch_correlated_logs to a typed mapping (e.g., def
fetch_correlated_logs(...) -> Dict[str, Sequence[Any]] or Mapping[str,
Sequence[Any]]), or cast each value at the call site (e.g., cast(Sequence[Any],
splunk_logs.get('ocp_logs', []))) before calling len in the print lines that
reference splunk_logs.get('ocp_logs'), splunk_logs.get('error_logs'), and
splunk_logs.get('pods_found').

In `@skills/root-cause-analysis/scripts/setup.py`:
- Line 9: Remove the unused import Config from module .config in setup.py to
resolve the ruff F401 error; edit the top-level import statement that currently
reads "from .config import Config" and delete that line (or, if future use
intended, replace it with a commented note), ensuring no references to the
symbol Config remain in the file (check for any use of Config in functions or
variables like any setup or initialization logic).
- Around line 337-341: The code builds a shell string in cmd and calls
subprocess.run(..., shell=True) using untrusted jumpbox_uri and port; replace
this with the list-form invocation to avoid shell interpretation (e.g., call
subprocess.run with args list ["ssh","-f","-N","-L", f"{port}:127.0.0.1:{port}",
jumpbox_uri] and remove shell=True) and optionally validate/normalize
jumpbox_uri before use; update the subprocess.run call in the same block (the
cmd variable and the subprocess.run invocation) to use the argument list,
preserving capture_output, text, and timeout options.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c3acdedc-e479-4532-8ade-905950b6fab4

📥 Commits

Reviewing files that changed from the base of the PR and between 3a30a04 and ba50b50.

📒 Files selected for processing (4)
  • .claude/settings.example.json
  • skills/root-cause-analysis/SKILL.md
  • skills/root-cause-analysis/scripts/cli.py
  • skills/root-cause-analysis/scripts/setup.py

"SPLUNK_INDEX": "<your-splunk-index>",
"SPLUNK_OCP_APP_INDEX": "<splunk-ocp-app-index>",
"SPLUNK_OCP_INFRA_INDEX": "<splunk-ocp-infra-index>",
"SPLUNK_VERIFY_SSL": "false",

⚠️ Potential issue | 🟡 Minor

Default SPLUNK_VERIFY_SSL: "false" weakens security posture.

Disabling SSL certificate verification exposes connections to man-in-the-middle attacks. Consider defaulting to "true" in the template and only recommending "false" for development/testing environments in comments or documentation.
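A sketch of reading the flag with the secure default the comment recommends; the helper name is hypothetical:

```python
import os

def splunk_verify_ssl(env=None) -> bool:
    """Verify SSL unless the user has explicitly opted out."""
    source = os.environ if env is None else env
    # Only an explicit "false"/"0"/"no" disables verification;
    # anything else (including an unset variable) keeps it on.
    raw = str(source.get("SPLUNK_VERIFY_SSL", "true")).strip().lower()
    return raw not in ("false", "0", "no")
```

Defaulting to verification in code as well as in the template means a copy-pasted example config cannot silently weaken the connection.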


Comment on lines +163 to 166
```python
if skip_splunk:
    print("  Skipped: Splunk not configured")
    splunk_logs = {"ocp_logs": [], "error_logs": [], "pods_found": [], "skipped": True, "reason": "Splunk not configured"}
    save_step(analysis_dir, 2, splunk_logs)
```

⚠️ Potential issue | 🟡 Minor

Skipped Step 2 output may lack fields expected by Step 3.

When Splunk is skipped, the output contains only ocp_logs, error_logs, pods_found, skipped, and reason. However, fetch_correlated_logs() normally returns additional fields like job_id, guid, namespace, query_time_window, and fetched_at that build_correlation_timeline() may access.

Consider aligning the skipped schema with the normal return schema to avoid potential KeyError in downstream processing.

Proposed fix

```diff
     if skip_splunk:
         print("  Skipped: Splunk not configured")
-        splunk_logs = {"ocp_logs": [], "error_logs": [], "pods_found": [], "skipped": True, "reason": "Splunk not configured"}
+        splunk_logs = {
+            "job_id": job_context.get("job_id"),
+            "guid": job_context.get("guid", ""),
+            "namespace": job_context.get("namespace", ""),
+            "query_time_window": {},
+            "fetched_at": None,
+            "ocp_logs": [],
+            "error_logs": [],
+            "pods_found": [],
+            "errors": [],
+            "skipped": True,
+            "reason": "Splunk not configured",
+        }
         save_step(analysis_dir, 2, splunk_logs)
```

Comment on lines +171 to +173
```python
print(f"  OCP logs: {len(splunk_logs.get('ocp_logs', []))}")
print(f"  Error logs: {len(splunk_logs.get('error_logs', []))}")
print(f"  Pods found: {len(splunk_logs.get('pods_found', []))}")
```

⚠️ Potential issue | 🟡 Minor

Type narrowing needed for splunk_logs.get() calls.

The pipeline reports mypy errors: "Argument 1 to 'len' has incompatible type 'object'; expected 'Sized'" at lines 171-173. The .get() method returns object without type information.

Proposed fix

```diff
-            print(f"  OCP logs: {len(splunk_logs.get('ocp_logs', []))}")
-            print(f"  Error logs: {len(splunk_logs.get('error_logs', []))}")
-            print(f"  Pods found: {len(splunk_logs.get('pods_found', []))}")
+            ocp_logs = splunk_logs.get('ocp_logs', [])
+            error_logs = splunk_logs.get('error_logs', [])
+            pods_found = splunk_logs.get('pods_found', [])
+            print(f"  OCP logs: {len(ocp_logs) if isinstance(ocp_logs, list) else 0}")
+            print(f"  Error logs: {len(error_logs) if isinstance(error_logs, list) else 0}")
+            print(f"  Pods found: {len(pods_found) if isinstance(pods_found, list) else 0}")
```

Alternatively, add type annotations to fetch_correlated_logs return type or cast the results.
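That alternative, typing the return value, might look like the following TypedDict sketch; the field set is illustrative, drawn from the keys the CLI prints:

```python
from typing import Any, List, TypedDict

class SplunkLogs(TypedDict, total=False):
    ocp_logs: List[Any]
    error_logs: List[Any]
    pods_found: List[Any]
    skipped: bool
    reason: str

def summarize(splunk_logs: SplunkLogs) -> str:
    # .get() on a TypedDict with a literal key returns the declared value
    # type, so mypy sees List[Any] (a Sized) instead of plain object
    return (
        f"OCP logs: {len(splunk_logs.get('ocp_logs', []))}, "
        f"error logs: {len(splunk_logs.get('error_logs', []))}, "
        f"pods: {len(splunk_logs.get('pods_found', []))}"
    )
```

Annotating `fetch_correlated_logs()` to return such a type fixes all three `len()` call sites at once, with no per-call `isinstance` guards.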

🧰 Tools
🪛 GitHub Actions: CI

[error] 171-171: mypy: Argument 1 to "len" has incompatible type "object"; expected "Sized" [arg-type]


[error] 172-172: mypy: Argument 1 to "len" has incompatible type "object"; expected "Sized" [arg-type]


[error] 173-173: mypy: Argument 1 to "len" has incompatible type "object"; expected "Sized" [arg-type]


Comment on lines +337 to +341
```python
try:
    cmd = f"ssh -f -N -L {port}:127.0.0.1:{port} {jumpbox_uri}"
    result = subprocess.run(
        cmd, shell=True, capture_output=True, text=True, timeout=15,
    )
```

⚠️ Potential issue | 🟠 Major

Potential command injection via shell=True with user input.

The jumpbox_uri value comes from the JUMPBOX_URI environment variable and is interpolated directly into a shell command string. If this value contains shell metacharacters (;, |, &, $(), etc.), arbitrary commands could be executed.

Proposed fix: use subprocess list form instead of shell

```diff
     try:
-        cmd = f"ssh -f -N -L {port}:127.0.0.1:{port} {jumpbox_uri}"
-        result = subprocess.run(
-            cmd, shell=True, capture_output=True, text=True, timeout=15,
-        )
+        # Split jumpbox_uri safely - it may contain "-p port" arguments
+        import shlex
+        jumpbox_parts = shlex.split(jumpbox_uri)
+        cmd = ["ssh", "-f", "-N", "-L", f"{port}:127.0.0.1:{port}"] + jumpbox_parts
+        result = subprocess.run(
+            cmd, capture_output=True, text=True, timeout=15,
+        )
```


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@skills/root-cause-analysis/scripts/setup.py`:
- Around line 258-260: The error message for the experiment placeholder is using
the wrong variable name; update the message appended to issues in the
is_placeholder(experiment) block to reference the actual environment variable
MLFLOW_EXPERIMENT_NAME (e.g., "MLFLOW_EXPERIMENT_NAME not configured") so it
matches the env_vars entry and avoids confusion; locate the
is_placeholder/existing experiment check where issues.append("EXPERIMENT_NAME
not configured") and change the string to the correct env var name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9270cf6f-29e1-44b9-a113-c1f311d3c084

📥 Commits

Reviewing files that changed from the base of the PR and between ba50b50 and 98425b7.

📒 Files selected for processing (2)
  • skills/root-cause-analysis/scripts/cli.py
  • skills/root-cause-analysis/scripts/setup.py

Comment on lines +258 to +260
```python
if is_placeholder(experiment):
    issues.append("EXPERIMENT_NAME not configured")
    env_vars.append({"name": "MLFLOW_EXPERIMENT_NAME", "prompt": "MLFlow experiment name for tracing"})
```

⚠️ Potential issue | 🟡 Minor

Inconsistent variable name in error message.

The error message says "EXPERIMENT_NAME not configured" but the actual environment variable is MLFLOW_EXPERIMENT_NAME. This could confuse users trying to diagnose configuration issues.

Proposed fix

```diff
     if is_placeholder(experiment):
-        issues.append("EXPERIMENT_NAME not configured")
+        issues.append("MLFLOW_EXPERIMENT_NAME not configured")
         env_vars.append({"name": "MLFLOW_EXPERIMENT_NAME", "prompt": "MLFlow experiment name for tracing"})
```

```bash
.venv/bin/python scripts/cli.py setup --json
```

Review the JSON output. Some settings are required, others are optional:
Collaborator


Wouldn't it be better to have this in a separate file so that it can be used in the other skills, or maybe even on its own?

Member Author


Do you mean save this JSON as an intermediate file?
