Skip to content

docs: Add deployment guide based on revised kustomization logic#228

Merged
shayan1995 merged 6 commits intomainfrom
chore/installation
Mar 16, 2026
Merged

docs: Add deployment guide based on revised kustomization logic#228
shayan1995 merged 6 commits intomainfrom
chore/installation

Conversation

@shayan1995
Copy link
Contributor

@shayan1995 shayan1995 commented Mar 10, 2026

Description

Added the installation guide based on the latest kustomization logic

Type of Change

  • Docs - Changes in documentation or OpenAPI schema (docs:)

Services Affected

  • API - API models or endpoints updated
  • Workflow - Workflow service updated
  • DB - DB DAOs or migrations updated
  • Site Manager - Site Manager updated
  • Cert Manager - Cert Manager updated
  • Site Agent - Site Agent updated
  • RLA - RLA service updated
  • Powershelf Manager - Powershelf Manager updated

Breaking Changes

  • This PR contains breaking changes

Testing

  • Unit tests added/updated
  • Integration tests added/updated
  • Manual testing performed
  • No testing required (docs, internal refactor, etc.)

Additional Notes

None

@shayan1995 shayan1995 self-assigned this Mar 10, 2026
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions
Copy link

github-actions bot commented Mar 10, 2026

Test Results

8 174 tests  ±0   8 174 ✅ ±0   8m 19s ⏱️ +4s
  139 suites ±0       0 💤 ±0 
   14 files   ±0       0 ❌ ±0 

Results for commit c11a5fb. ± Comparison against base commit 376cb61.

♻️ This comment has been updated with latest results.

@thossain-nv thossain-nv changed the title added the installation guide based on the new kustomization logic docs: Add deployment guide based on latest kustomization logic Mar 11, 2026
@thossain-nv
Copy link
Contributor

Thanks @shayan1995 Revised doc location and added some edits. Will continue tomorrow.

@github-actions
Copy link

🔐 TruffleHog Secret Scan

No secrets or credentials found!

Your code has been scanned for 700+ types of secrets and credentials. All clear! 🎉

🔗 View scan details

🕐 Last updated: 2026-03-11 02:00:14 UTC | Commit: 09020a9

@github-actions
Copy link

🛡️ Vulnerability Scan

🚨 Found 64 vulnerability(ies)
📊 vs main: 64 (no change)

Severity Breakdown:

  • 🔴 Critical/High: 64
  • 🟡 Medium: 0
  • 🔵 Low/Info: 0

🔗 View full details in Security tab

🕐 Last updated: 2026-03-11 02:00:20 UTC | Commit: 09020a9

@coderabbitai
Copy link

coderabbitai bot commented Mar 16, 2026

📝 Walkthrough

Walkthrough

Adds comprehensive Kubernetes deployment documentation (quick-start and detailed installation) and a new bash script to generate and optionally deploy a self-signed root CA for cluster PKI; no code or API surface changes.

Changes

Cohort / File(s) Summary
Top-level README
README.md
Inserted two "Production Cluster Deployment" sections linking to deployment quickstart and detailed installation guides.
Deployment Guides
deploy/README.md, deploy/INSTALLATION.md
Added a deploy quick-start and an extensive INSTALLATION guide covering topology, kustomize/overlay structure, prerequisites, namespace setup, PKI/cert-manager, Temporal, DB/Keycloak steps, manifests, secrets reference, and step-by-step apply workflow.
CA Generation Script
scripts/gen-site-ca.sh
New bash script to create a self-signed root CA (OpenSSL), write outputs to directory or create Kubernetes TLS secrets, supports namespace, output-dir, CN/org/days, dry-run, and help flags.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Poem

🐰 I nibble docs beneath the moon,
I brew a CA with openssl tune,
Kustomize stacks and secrets neat,
Cluster-ready hops on nimble feet—🌙🚀

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description check ✅ Passed The description is directly related to the changeset, confirming this is a documentation addition reflecting the latest kustomization logic, which aligns with all four files added/modified in the PR.
Title check ✅ Passed The title accurately describes the main change: adding deployment documentation based on revised kustomization logic, which aligns with the PR content (INSTALLATION.md, deploy/README.md, and scripts).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch chore/installation
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@thossain-nv thossain-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks @shayan1995

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (4)
scripts/gen-site-ca.sh (2)

145-171: Consider avoiding eval by executing commands directly.

Using eval to execute shell commands stored in variables can be fragile and is generally discouraged. Since the commands are well-defined, they can be executed directly.

♻️ Suggested refactor
-# ---- Build kubectl commands ----------------------------------------
-APPLY_NS_CMD="kubectl create secret tls ca-signing-secret \
-  --cert=$TMP_DIR/ca.crt \
-  --key=$TMP_DIR/ca.key \
-  -n $NAMESPACE \
-  --dry-run=client -o yaml | kubectl apply -f -"
-
-APPLY_CM_CMD="kubectl create secret tls ca-signing-secret \
-  --cert=$TMP_DIR/ca.crt \
-  --key=$TMP_DIR/ca.key \
-  -n cert-manager \
-  --dry-run=client -o yaml | kubectl apply -f -"
+# ---- Helper function to apply secret ------------------------------
+apply_secret() {
+  local ns="$1"
+  kubectl create secret tls ca-signing-secret \
+    --cert="$TMP_DIR/ca.crt" \
+    --key="$TMP_DIR/ca.key" \
+    -n "$ns" \
+    --dry-run=client -o yaml | kubectl apply -f -
+}
+
+# For dry-run display
+show_cmd() {
+  local ns="$1"
+  echo "  kubectl create secret tls ca-signing-secret \\"
+  echo "    --cert=$TMP_DIR/ca.crt \\"
+  echo "    --key=$TMP_DIR/ca.key \\"
+  echo "    -n $ns \\"
+  echo "    --dry-run=client -o yaml | kubectl apply -f -"
+}
 
 if [[ "$DRY_RUN" == "true" ]]; then
   warn "Dry-run mode — commands that would be run:"
-  echo "  $APPLY_NS_CMD"
-  echo "  $APPLY_CM_CMD"
+  show_cmd "$NAMESPACE"
+  show_cmd "cert-manager"
   exit 0
 fi
 
 # ---- Apply CA secrets to cluster ----------------------------------
 info "Creating ca-signing-secret in namespace '${NAMESPACE}'…"
-eval "$APPLY_NS_CMD"
+apply_secret "$NAMESPACE"
 ok "ca-signing-secret created in '${NAMESPACE}'"
 
 info "Creating ca-signing-secret in namespace 'cert-manager'…"
 if kubectl get namespace cert-manager >/dev/null 2>&1; then
-  eval "$APPLY_CM_CMD"
+  apply_secret "cert-manager"
   ok "ca-signing-secret created in 'cert-manager'"
 else
   warn "Namespace 'cert-manager' not found — skipping. Create it after cert-manager is installed:"
-  echo "  $APPLY_CM_CMD"
+  show_cmd "cert-manager"
 fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/gen-site-ca.sh` around lines 145 - 171, The script currently builds
kubectl command strings in APPLY_NS_CMD and APPLY_CM_CMD and executes them with
eval, which is fragile; change the implementation to execute the kubectl
commands directly (or store the command as a bash array and call it) instead of
using eval: replace the eval "$APPLY_NS_CMD" and eval "$APPLY_CM_CMD"
invocations with direct command execution (or array expansion) and ensure the
DRY_RUN handling still prints the command strings when DRY_RUN=="true"; update
both the namespace-create and cert-manager-create code paths (references:
APPLY_NS_CMD, APPLY_CM_CMD, DRY_RUN, and the two eval usages).

114-120: Consider preserving stderr for debugging.

Redirecting stderr to /dev/null hides potentially useful error messages from OpenSSL if the command fails.

📝 Suggested improvement
 openssl req -x509 -sha256 -nodes -newkey rsa:4096 \
   -keyout "$TMP_DIR/ca.key" \
   -out    "$TMP_DIR/ca.crt" \
   -days   "$CA_DAYS" \
   -config "$TMP_DIR/ca.cnf" \
-  -extensions v3_ca \
-  2>/dev/null
+  -extensions v3_ca

If suppressing output is intentional (e.g., to hide progress indicators), consider redirecting only specific output or using -quiet if available.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/gen-site-ca.sh` around lines 114 - 120, The openssl req invocation in
scripts/gen-site-ca.sh currently hides all stderr with "2>/dev/null", which
suppresses useful error messages; update the openssl command that writes
TMP_DIR/ca.key and TMP_DIR/ca.crt (using CA_DAYS and TMP_DIR/ca.cnf with
-extensions v3_ca) to either remove the "2>/dev/null" redirection so errors are
visible, or redirect stderr to a log file or variable (e.g.,
"$TMP_DIR/openssl.err") or use the "-quiet" flag if you only want to silence
progress, ensuring any real errors remain captured for debugging.
deploy/INSTALLATION.md (2)

292-292: Minor grammar: use hyphens for compound adjectives.

The phrase "static/well known in cluster services" should use hyphens for clarity.

📝 Suggested fix
-The ClusterIssuer is used for generating mTLS certs for static/well known in cluster services.
+The ClusterIssuer is used for generating mTLS certs for static/well-known in-cluster services.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/INSTALLATION.md` at line 292, The sentence describing the
ClusterIssuer uses unhyphenated compound adjectives; update the phrase
"static/well known in cluster services" to use hyphens so it reads
"static/well-known in-cluster services" (keep the existing references to
carbide-rest-ca-issuer and ca-signing-secret unchanged).

20-43: Consider adding language identifiers to fenced code blocks.

Static analysis flagged several code blocks without language specifiers. While these are directory structure diagrams (not executable code), adding a language identifier improves rendering and accessibility.

📝 Suggested fix for the directory structure block
-```
+```text
 deploy/kustomize/
 ├── base/                 # Reusable base manifests (not applied directly)

Similarly, consider using text or plaintext for other non-code blocks at lines 67, 111, 332, and 727.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/INSTALLATION.md` around lines 20 - 43, The fenced directory-tree code
block beginning with the block containing "deploy/kustomize/" and other non-code
fenced blocks that show directory diagrams should include a language identifier
(e.g., change ``` to ```text) so they render/accessibly as plain text; locate
the fenced block that starts with the deploy/kustomize directory tree and any
other similar directory-listing fenced blocks in the INSTALLATION.md and replace
the opening backticks with ```text (or ```plaintext) for each.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deploy/INSTALLATION.md`:
- Line 932: Update the Secrets Reference table entry for `db-creds` (currently
showing only `carbide-rest-db-migration`) to include all actual consumers: add
`carbide-rest-api` and `carbide-rest-workflow` alongside
`carbide-rest-db-migration`, and ensure the source column still points to
`base/common/db-creds.yaml` and the secret owner/service column still notes
`carbide-rest`; update any surrounding doc text if it references the old
consumer list so the sections for `carbide-rest-api` and `carbide-rest-workflow`
accurately reflect that they consume the `db-creds` secret.

In `@deploy/README.md`:
- Around line 61-63: Update the README command to remove the redundant namespace
flag: change the Keycloak apply command to use only "kubectl apply -k
deploy/kustomize/base/keycloak" because the namespace is already declared in the
kustomization.yaml (namespace: carbide-rest) — ensure you reference the
kustomization.yaml (the file declaring namespace) and update the README line
that currently includes "-n carbide-rest".

---

Nitpick comments:
In `@deploy/INSTALLATION.md`:
- Line 292: The sentence describing the ClusterIssuer uses unhyphenated compound
adjectives; update the phrase "static/well known in cluster services" to use
hyphens so it reads "static/well-known in-cluster services" (keep the existing
references to carbide-rest-ca-issuer and ca-signing-secret unchanged).
- Around line 20-43: The fenced directory-tree code block beginning with the
block containing "deploy/kustomize/" and other non-code fenced blocks that show
directory diagrams should include a language identifier (e.g., change ``` to
```text) so they render/accessibly as plain text; locate the fenced block that
starts with the deploy/kustomize directory tree and any other similar
directory-listing fenced blocks in the INSTALLATION.md and replace the opening
backticks with ```text (or ```plaintext) for each.

In `@scripts/gen-site-ca.sh`:
- Around line 145-171: The script currently builds kubectl command strings in
APPLY_NS_CMD and APPLY_CM_CMD and executes them with eval, which is fragile;
change the implementation to execute the kubectl commands directly (or store the
command as a bash array and call it) instead of using eval: replace the eval
"$APPLY_NS_CMD" and eval "$APPLY_CM_CMD" invocations with direct command
execution (or array expansion) and ensure the DRY_RUN handling still prints the
command strings when DRY_RUN=="true"; update both the namespace-create and
cert-manager-create code paths (references: APPLY_NS_CMD, APPLY_CM_CMD, DRY_RUN,
and the two eval usages).
- Around line 114-120: The openssl req invocation in scripts/gen-site-ca.sh
currently hides all stderr with "2>/dev/null", which suppresses useful error
messages; update the openssl command that writes TMP_DIR/ca.key and
TMP_DIR/ca.crt (using CA_DAYS and TMP_DIR/ca.cnf with -extensions v3_ca) to
either remove the "2>/dev/null" redirection so errors are visible, or redirect
stderr to a log file or variable (e.g., "$TMP_DIR/openssl.err") or use the
"-quiet" flag if you only want to silence progress, ensuring any real errors
remain captured for debugging.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 704c2369-b56e-4f03-a24a-260a71b90ecd

📥 Commits

Reviewing files that changed from the base of the PR and between a972346 and e962148.

📒 Files selected for processing (4)
  • README.md
  • deploy/INSTALLATION.md
  • deploy/README.md
  • scripts/gen-site-ca.sh

@shayan1995 shayan1995 enabled auto-merge (squash) March 16, 2026 20:41
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
deploy/INSTALLATION.md (4)

292-292: Clarify the phrase "static/well known in cluster services."

The phrase "static/well known in cluster services" could be clearer with hyphens: "static/well-known, in-cluster services" to better convey that these are well-known services running within the cluster.

📝 Suggested rewording
-The ClusterIssuer is used for generating mTLS certs for static/well known in cluster services.
+The ClusterIssuer is used for generating mTLS certs for static, well-known in-cluster services.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/INSTALLATION.md` at line 292, Update the sentence that mentions
"static/well known in cluster services" to a clearer phrasing: replace it with
"static/well-known, in-cluster services" (or similar) so it reads that the
ClusterIssuer `carbide-rest-ca-issuer` (using `ca-signing-secret`) is used for
generating mTLS certs for static/well-known, in-cluster services; ensure the
ClusterIssuer name `carbide-rest-ca-issuer` and secret `ca-signing-secret`
remain unchanged.

20-43: Consider adding language identifier to code block.

The directory structure code block (line 20) is missing a language identifier. Adding text or plaintext would improve syntax highlighting and markdown compliance.

📝 Suggested fix
-```
+```text
 deploy/kustomize/
 ├── base/                 # Reusable base manifests (not applied directly)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/INSTALLATION.md` around lines 20 - 43, The fenced code block showing
the deploy/kustomize directory tree lacks a language identifier; update the
opening fence (the triple backticks before "deploy/kustomize/") to include a
language such as text or plaintext (e.g., change ``` to ```text) so the block
becomes a properly annotated code block; ensure the closing triple backticks
remain unchanged and commit the updated INSTALLATION.md.

417-431: Add verification for Temporal namespace creation.

After creating the cloud and site namespaces, consider adding a verification step to confirm they were created successfully before proceeding.

📋 Suggested verification commands
# Verify namespaces were created
kubectl exec -it -n temporal deployment/temporal-admintools -- \
  temporal operator namespace list \
  --address temporal-frontend.temporal:7233 | grep -E 'cloud|site'
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/INSTALLATION.md` around lines 417 - 431, Add a verification step after
the two temporal operator namespace create commands: run the temporal operator
namespace list command from the temporal-admintools pod (using the same
--address temporal-frontend.temporal:7233) and confirm the output contains both
"cloud" and "site"; document the expected success output and a suggested
retry/exit action if either namespace is missing so the installer can stop and
re-run creation before proceeding.

186-189: Consider centralizing database credential management.

Requiring manual synchronization of passwords across four separate files (admin-creds.yaml, init-configmap.yaml, base/common/db-creds.yaml, base/temporal-helm/db-creds.yaml) is error-prone and creates a maintenance burden. Password mismatches can lead to authentication failures at runtime.

Consider one of these approaches:

  • Use Kustomize secretGenerator with a single source file that all components reference
  • Document a script or tool to update all locations atomically
  • Use an external secret management solution (e.g., External Secrets Operator, Vault) to inject credentials at runtime
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/INSTALLATION.md` around lines 186 - 189, The docs currently require
manually syncing DB passwords across admin-creds.yaml, init-configmap.yaml,
base/common/db-creds.yaml, and base/temporal-helm/db-creds.yaml which is
error-prone; update INSTALLATION.md to prescribe a single source-of-truth
credential approach: replace the manual step with instructions to centralize
credentials using one of the suggested methods (e.g., add a Kustomize
secretGenerator workflow referencing one creds file and show how components
reference that secret, or include a short documented script to propagate an
updated password to all four files atomically, or outline how to use an external
secret manager like External Secrets Operator/Vault to inject credentials);
reference the four filenames (admin-creds.yaml, init-configmap.yaml,
base/common/db-creds.yaml, base/temporal-helm/db-creds.yaml) and provide concise
commands/examples for the chosen approach so operators can follow a single
canonical procedure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@deploy/INSTALLATION.md`:
- Around line 736-755: The TEMPORAL_SUBSCRIBE_QUEUE value in the kubectl patch
for the configmap carbide-rest-site-agent-config is incorrectly set to the
literal "site" instead of the site UUID; update the patch that replaces
/data/TEMPORAL_SUBSCRIBE_QUEUE to use the $SITE_UUID variable (same as
/data/CLUSTER_ID) so TEMPORAL_SUBSCRIBE_QUEUE is the per-site queue name,
ensuring the site agent polls the correct Temporal queue.

---

Nitpick comments:
In `@deploy/INSTALLATION.md`:
- Line 292: Update the sentence that mentions "static/well known in cluster
services" to a clearer phrasing: replace it with "static/well-known, in-cluster
services" (or similar) so it reads that the ClusterIssuer
`carbide-rest-ca-issuer` (using `ca-signing-secret`) is used for generating mTLS
certs for static/well-known, in-cluster services; ensure the ClusterIssuer name
`carbide-rest-ca-issuer` and secret `ca-signing-secret` remain unchanged.
- Around line 20-43: The fenced code block showing the deploy/kustomize
directory tree lacks a language identifier; update the opening fence (the triple
backticks before "deploy/kustomize/") to include a language such as text or
plaintext (e.g., change ``` to ```text) so the block becomes a properly
annotated code block; ensure the closing triple backticks remain unchanged and
commit the updated INSTALLATION.md.
- Around line 417-431: Add a verification step after the two temporal operator
namespace create commands: run the temporal operator namespace list command from
the temporal-admintools pod (using the same --address
temporal-frontend.temporal:7233) and confirm the output contains both "cloud"
and "site"; document the expected success output and a suggested retry/exit
action if either namespace is missing so the installer can stop and re-run
creation before proceeding.
- Around line 186-189: The docs currently require manually syncing DB passwords
across admin-creds.yaml, init-configmap.yaml, base/common/db-creds.yaml, and
base/temporal-helm/db-creds.yaml which is error-prone; update INSTALLATION.md to
prescribe a single source-of-truth credential approach: replace the manual step
with instructions to centralize credentials using one of the suggested methods
(e.g., add a Kustomize secretGenerator workflow referencing one creds file and
show how components reference that secret, or include a short documented script
to propagate an updated password to all four files atomically, or outline how to
use an external secret manager like External Secrets Operator/Vault to inject
credentials); reference the four filenames (admin-creds.yaml,
init-configmap.yaml, base/common/db-creds.yaml,
base/temporal-helm/db-creds.yaml) and provide concise commands/examples for the
chosen approach so operators can follow a single canonical procedure.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9d184877-c7f0-488e-b1bd-229eeb908c38

📥 Commits

Reviewing files that changed from the base of the PR and between e962148 and c11a5fb.

📒 Files selected for processing (3)
  • README.md
  • deploy/INSTALLATION.md
  • deploy/README.md
🚧 Files skipped from review as they are similar to previous changes (2)
  • README.md
  • deploy/README.md

Comment on lines +736 to +755
```bash
SITE_UUID=<your-site-uuid>
OTP=<otp-from-site-cr-status>
CA_B64=$(kubectl get secret ca-signing-secret -n carbide-rest -o jsonpath='{.data.tls\.crt}')

kubectl patch secret site-registration -n carbide-rest --type='json' -p="[
{\"op\": \"replace\", \"path\": \"/data/site-uuid\", \"value\": \"$(echo -n $SITE_UUID | base64)\"},
{\"op\": \"replace\", \"path\": \"/data/otp\", \"value\": \"$(echo -n $OTP | base64)\"},
{\"op\": \"replace\", \"path\": \"/data/cacert\", \"value\": \"$CA_B64\"}
]"

# Also update CLUSTER_ID in the configmap to match SITE_UUID
kubectl patch configmap carbide-rest-site-agent-config -n carbide-rest --type='json' -p="[
{\"op\": \"replace\", \"path\": \"/data/CLUSTER_ID\", \"value\": \"$SITE_UUID\"},
{\"op\": \"replace\", \"path\": \"/data/TEMPORAL_SUBSCRIBE_NAMESPACE\", \"value\": \"$SITE_UUID\"},
{\"op\": \"replace\", \"path\": \"/data/TEMPORAL_SUBSCRIBE_QUEUE\", \"value\": \"site\"}
]"

kubectl rollout restart statefulset/carbide-rest-site-agent -n carbide-rest
```
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical inconsistency in TEMPORAL_SUBSCRIBE_QUEUE configuration.

Line 751 sets TEMPORAL_SUBSCRIBE_QUEUE to the literal string "site", but line 712 documents that this variable should be set to the site UUID to match the per-site Temporal queue. This mismatch would cause the site agent to poll the wrong queue and fail to receive site-specific workflow tasks.

🐛 Proposed fix
 kubectl patch configmap carbide-rest-site-agent-config -n carbide-rest --type='json' -p="[
   {\"op\": \"replace\", \"path\": \"/data/CLUSTER_ID\", \"value\": \"$SITE_UUID\"},
   {\"op\": \"replace\", \"path\": \"/data/TEMPORAL_SUBSCRIBE_NAMESPACE\", \"value\": \"$SITE_UUID\"},
-  {\"op\": \"replace\", \"path\": \"/data/TEMPORAL_SUBSCRIBE_QUEUE\", \"value\": \"site\"}
+  {\"op\": \"replace\", \"path\": \"/data/TEMPORAL_SUBSCRIBE_QUEUE\", \"value\": \"$SITE_UUID\"}
 ]"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```bash
SITE_UUID=<your-site-uuid>
OTP=<otp-from-site-cr-status>
CA_B64=$(kubectl get secret ca-signing-secret -n carbide-rest -o jsonpath='{.data.tls\.crt}')
kubectl patch secret site-registration -n carbide-rest --type='json' -p="[
{\"op\": \"replace\", \"path\": \"/data/site-uuid\", \"value\": \"$(echo -n $SITE_UUID | base64)\"},
{\"op\": \"replace\", \"path\": \"/data/otp\", \"value\": \"$(echo -n $OTP | base64)\"},
{\"op\": \"replace\", \"path\": \"/data/cacert\", \"value\": \"$CA_B64\"}
]"
# Also update CLUSTER_ID in the configmap to match SITE_UUID
kubectl patch configmap carbide-rest-site-agent-config -n carbide-rest --type='json' -p="[
{\"op\": \"replace\", \"path\": \"/data/CLUSTER_ID\", \"value\": \"$SITE_UUID\"},
{\"op\": \"replace\", \"path\": \"/data/TEMPORAL_SUBSCRIBE_NAMESPACE\", \"value\": \"$SITE_UUID\"},
{\"op\": \"replace\", \"path\": \"/data/TEMPORAL_SUBSCRIBE_QUEUE\", \"value\": \"site\"}
]"
kubectl rollout restart statefulset/carbide-rest-site-agent -n carbide-rest
```
SITE_UUID=<your-site-uuid>
OTP=<otp-from-site-cr-status>
CA_B64=$(kubectl get secret ca-signing-secret -n carbide-rest -o jsonpath='{.data.tls\.crt}')
kubectl patch secret site-registration -n carbide-rest --type='json' -p="[
{\"op\": \"replace\", \"path\": \"/data/site-uuid\", \"value\": \"$(echo -n $SITE_UUID | base64)\"},
{\"op\": \"replace\", \"path\": \"/data/otp\", \"value\": \"$(echo -n $OTP | base64)\"},
{\"op\": \"replace\", \"path\": \"/data/cacert\", \"value\": \"$CA_B64\"}
]"
# Also update CLUSTER_ID in the configmap to match SITE_UUID
kubectl patch configmap carbide-rest-site-agent-config -n carbide-rest --type='json' -p="[
{\"op\": \"replace\", \"path\": \"/data/CLUSTER_ID\", \"value\": \"$SITE_UUID\"},
{\"op\": \"replace\", \"path\": \"/data/TEMPORAL_SUBSCRIBE_NAMESPACE\", \"value\": \"$SITE_UUID\"},
{\"op\": \"replace\", \"path\": \"/data/TEMPORAL_SUBSCRIBE_QUEUE\", \"value\": \"$SITE_UUID\"}
]"
kubectl rollout restart statefulset/carbide-rest-site-agent -n carbide-rest
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@deploy/INSTALLATION.md` around lines 736 - 755, The TEMPORAL_SUBSCRIBE_QUEUE
value in the kubectl patch for the configmap carbide-rest-site-agent-config is
incorrectly set to the literal "site" instead of the site UUID; update the patch
that replaces /data/TEMPORAL_SUBSCRIBE_QUEUE to use the $SITE_UUID variable
(same as /data/CLUSTER_ID) so TEMPORAL_SUBSCRIBE_QUEUE is the per-site queue
name, ensuring the site agent polls the correct Temporal queue.

@thossain-nv thossain-nv changed the title docs: Add deployment guide based on latest kustomization logic docs: Add deployment guide based on revised kustomization logic Mar 16, 2026
@shayan1995 shayan1995 merged commit adfc73b into main Mar 16, 2026
53 checks passed
@shayan1995 shayan1995 deleted the chore/installation branch March 16, 2026 20:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants