
fix(deployment): Use unique Celery node names per worker replica. #2269

Merged: junhaoliao merged 3 commits into y-scope:main from goynam:fix/celery-worker-unique-node-name on May 14, 2026
Conversation

goynam (Contributor) commented May 9, 2026

Summary

  • Appends $(HOSTNAME) to the Celery -n (node name) flag in both compression-worker and query-worker Helm deployment templates, making each pod register with a unique node name (e.g., compression-worker@clp-compression-worker-abc123)
  • Fixes RabbitMQ pidbox queue conflicts and periodic crashes when running multiple worker pod replicas
  • Bumps Helm chart version to 0.3.2-dev.2

Fixes #2259

Test plan

  • Deploy with compressionWorker.replicas > 1 and verify each pod registers a unique Celery node name in RabbitMQ
  • Confirm no pidbox queue conflicts in RabbitMQ management UI
  • Verify compression and query tasks still execute correctly
  • Check celery inspect active shows distinct node names per pod

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

Summary by CodeRabbit

  • Chores
    • Bumped Helm chart version to 0.3.2-dev.5.
    • Updated worker startup to use hostname-based node names for better operational visibility and support for multiple instances across distributed deployments.


@goynam goynam requested a review from a team as a code owner May 9, 2026 11:58
coderabbitai (bot) commented May 9, 2026

Walkthrough

Celery worker startup node names for compression and query workers were changed from static values to hostname-qualified names (e.g., compression-worker@%h), and the Helm chart version was bumped to 0.3.2-dev.5.

Changes

Worker Reliability and Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| Worker Node Name Configuration: tools/deployment/package-helm/templates/compression-worker-deployment.yaml, tools/deployment/package-helm/templates/query-worker-deployment.yaml, tools/deployment/package/docker-compose-all.yaml | Celery worker -n arguments changed from static node names (compression-worker, query-worker) to hostname-qualified names (compression-worker@%h, query-worker@%h) across Helm templates and docker-compose. |
| Helm Chart Versioning: tools/deployment/package-helm/Chart.yaml | Chart version incremented from 0.3.2-dev.4 to 0.3.2-dev.5. |
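
For illustration, a minimal sketch of how the hostname-qualified node name appears in a worker container's command array (the container name and Celery binary path are abridged; the module path and flags follow the templates summarized above):

# Sketch only: abridged from the worker deployment templates summarized above.
# Celery expands %h to the container's hostname at startup, so each replica
# registers a distinct node name (e.g., query-worker@<pod-hostname>).
containers:
  - name: "query-worker"    # illustrative container name
    command: [
      "celery", "-A", "job_orchestration.executor.query", "worker",
      "--loglevel", "WARNING",
      "-Q", "query",
      "-n", "query-worker@%h"
    ]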

Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Linked Issues check | ✅ Passed | The changes fully address issue #2259 by updating Celery node names to use @%h suffix in all deployment configurations (Helm templates and docker-compose) to ensure unique node names per pod. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the pidbox conflict issue: Helm chart version bump, Celery node name updates in three templates, and docker-compose updates are all in scope. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Title check | ✅ Passed | The pull request title accurately describes the main change: using unique Celery node names per worker replica to fix deployment issues. |



coderabbitai (bot) left a comment

Actionable comments posted: 4



📥 Commits

Reviewing files that changed from the base of the PR and between 0144808 and b5a158e.

📒 Files selected for processing (4)
  • tools/deployment/package-helm/Chart.yaml
  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
  • tools/deployment/package-helm/templates/query-worker-deployment.yaml
  • tools/deployment/package-helm/templates/results-cache-statefulset.yaml

 apiVersion: "v2"
 name: "clp"
-version: "0.3.2-dev.0"
+version: "0.3.2-dev.3"
coderabbitai (bot) commented:

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Version number doesn't match the PR summary.

The PR summary states the chart should be bumped to 0.3.2-dev.2, but the code shows 0.3.2-dev.3. Please verify which version is intended.


"--loglevel", "WARNING",
"-Q", "query",
"-n", "query-worker"
"-n", "query-worker@$(HOSTNAME)"
coderabbitai (bot) commented May 9, 2026

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check Kubernetes documentation and Celery node naming patterns in the codebase

# Check if there are any existing examples or patterns of environment variable usage in command arrays
rg -n -C3 'command:.*\$\(' --type=yaml

# Search for Celery node name patterns in the codebase to see if @ is used elsewhere
rg -n -C3 '"-n".*@' --type=yaml

# Search for HOSTNAME usage patterns
rg -n -C3 'HOSTNAME' --type=yaml

Repository: y-scope/clp

Length of output: 3185


🌐 Web query:

Does Kubernetes expand environment variables like $(HOSTNAME) in pod command arrays without an explicit shell?

💡 Result:

Yes—Kubernetes does expand variables in the pod .spec.containers[].command and .spec.containers[].args fields even when you are using the exec-style array form (no explicit shell), but only using Kubernetes's own $(VAR_NAME) interpolation rules, not Bash's $VAR or ${...} syntax. In the official docs, Kubernetes states that "The environment variable appears in parentheses, '$(VAR)'. This is required for the variable to be expanded in the command or args field." [1] It also shows using env values in args via $(MESSAGE) with command: ["/bin/echo"] and args: ["$(MESSAGE)"]. [2] So a Pod with command: ["/bin/echo"] and args: ["$(HOSTNAME)"] will expand $(HOSTNAME) during Pod creation only if a value is available for HOSTNAME to substitute per Kubernetes's interpolation mechanism (e.g., via the downward API into an env entry, or another env var defined earlier in the manifest). [1][2] If you write Bash-style $HOSTNAME inside command/args, it will not be expanded by Kubernetes because that syntax is for a shell, and Kubernetes does not run a shell for you; when you need shell behavior, the Kubernetes docs recommend wrapping with a shell like /bin/sh -c. [1] For example, to get Kubernetes interpolation to substitute a value, you'd typically define HOSTNAME via env (e.g., metadata.name via the downward API) and then reference it in args with $(HOSTNAME). [3][4] Finally, if you truly want Bash parameter expansion like ${Var1:-$Var2}, you must start bash (e.g., command: ["/bin/bash", "-c", "..."]). [5]


🌐 Web query:

Does Celery support node names with @ symbol in the format worker-name@hostname?

💡 Result:

Yes. Celery worker nodenames use an “@” separator between the short worker name and the hostname, i.e. the format is name@hostname (e.g., worker1@george.example.com). Celery explicitly defines the nodename separator as “@” and provides helpers that “join(name, hostname)” using that separator; the documentation also shows starting workers with “-n worker1@%h”, where %h expands to the hostname including domain name. [1][2] So a nodename like worker-name@hostname is supported (assuming hostname is whatever Celery expands/you pass).


🏁 Script executed:

# Read the full deployment files to check for environment variable definitions
cat -n tools/deployment/package-helm/templates/query-worker-deployment.yaml | head -100

Repository: y-scope/clp

Length of output: 4669


🏁 Script executed:

# Find and check the includes to see if they define HOSTNAME
fd -t f "_helpers.tpl" tools/deployment/package-helm/

# Also search for any hostname-related environment variable definitions
rg -n "HOSTNAME|hostname" tools/deployment/package-helm/templates/ --type=yaml

# Check if there's any downward API configuration
rg -n "fieldRef|downward" tools/deployment/package-helm/templates/ --type=yaml

Repository: y-scope/clp

Length of output: 992


🏁 Script executed:

# Check the helpers template file
cat -n tools/deployment/package-helm/templates/_helpers.tpl | head -150

Repository: y-scope/clp

Length of output: 5489


🏁 Script executed:

# Continue reading the helpers file to find celery-related includes
rg -n "celeryBrokerUrlEnvVar|celeryResultBackendEnvVar" tools/deployment/package-helm/templates/_helpers.tpl -A 10

Repository: y-scope/clp

Length of output: 935


Add HOSTNAME environment variable or use an alternative approach for the node name.

The $(HOSTNAME) variable will not expand in the command array because HOSTNAME is not defined as an environment variable in the pod specification. Kubernetes only expands $(VAR_NAME) when the variable is explicitly defined (via env or downward API); it does not automatically inject HOSTNAME.

To fix this, either:

  • Add HOSTNAME to the env section using the downward API (e.g., valueFrom: fieldRef: fieldPath: metadata.name)
  • Use the pod spec's hostname field in conjunction with a proper environment variable reference
  • Replace $(HOSTNAME) with $(POD_NAME) and inject POD_NAME via the downward API

Note: Other deployments in this chart (reducer, query-scheduler) set a hostname field at the pod spec level, but this does not create an environment variable for command arguments.


junhaoliao (Member) commented:
as pointed out by the rabbit, the proposed fix won't work. see kubernetes/kubernetes#57726 for details

coderabbitai (bot) replied:
@junhaoliao, you're correct — the current $(HOSTNAME) approach will not work. As confirmed by the Kubernetes issue, when $(HOSTNAME) is used in a command/args array without an explicit env definition, Kubernetes echoes it literally as $(HOSTNAME) rather than expanding it. While HOSTNAME is automatically set as a container environment variable at runtime by the container runtime, Kubernetes only performs $(VAR_NAME) substitution in command and args fields for variables explicitly declared in the pod's env section.

There are two clean fixes:


Option 1 (Recommended): Expose pod name via Downward API

Add HOSTNAME to the env section using the Downward API so Kubernetes can substitute it in the command array. This is the least invasive change:

env:
  # ... existing env vars ...
  - name: "HOSTNAME"
    valueFrom:
      fieldRef:
        fieldPath: "metadata.name"

This applies to both query-worker-deployment.yaml and compression-worker-deployment.yaml. The command line using $(HOSTNAME) would then expand correctly at pod creation time.

Note: metadata.name yields the pod name (e.g., query-worker-6b9f4d-abc12), which is unique per replica and suitable as a Celery node name suffix.


Option 2: Shell wrapper

Wrap the command in a shell so that $HOSTNAME (shell variable, not Kubernetes interpolation) is expanded at runtime:

command: [
  "/bin/sh", "-c",
  "python3 -u /opt/clp/lib/python3/site-packages/bin/celery -A job_orchestration.executor.query worker --concurrency {{ .Values.workerConcurrency }} --loglevel WARNING -Q query -n query-worker@$HOSTNAME"
]

This works because the container runtime sets HOSTNAME as an OS-level env var, and the shell expands $HOSTNAME naturally. However, it requires collapsing the structured array into a single shell string, which is harder to maintain and loses the quoting safety of exec form.


Option 1 (Downward API) is preferable as it keeps the structured command array and is the idiomatic Kubernetes approach for injecting pod metadata into arguments.
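
Putting Option 1 together, a minimal sketch of how the env entry and the command argument would combine in a worker container spec (the container name and surrounding flags are abridged and assumed). Note that the PR ultimately adopted Celery's %h substitution instead, per the suggestions further down:

# Sketch of Option 1 only; not the approach that was merged.
containers:
  - name: "query-worker"    # illustrative container name
    env:
      - name: "HOSTNAME"
        valueFrom:
          fieldRef:
            fieldPath: "metadata.name"    # pod name, unique per replica
    command:
      - "celery"
      - "-A"
      - "job_orchestration.executor.query"
      - "worker"
      - "-Q"
      - "query"
      # Kubernetes substitutes $(HOSTNAME) here only because HOSTNAME is
      # declared in the env section above.
      - "-n"
      - "query-worker@$(HOSTNAME)"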


✏️ Learnings added
Learnt from: junhaoliao
Repo: y-scope/clp PR: 2269
File: tools/deployment/package-helm/templates/query-worker-deployment.yaml:90-90
Timestamp: 2026-05-10T10:11:03.655Z
Learning: In Kubernetes Deployment YAML (Helm templates), `$(HOSTNAME)` in a container's `command`/`args` array is NOT expanded by Kubernetes unless `HOSTNAME` is explicitly declared in the container's `env` section. Even though the container runtime automatically sets `HOSTNAME` as an OS-level env var, Kubernetes variable substitution only works for variables declared in the pod spec's `env` array. The correct fix is to add an explicit env entry using the Downward API: `- name: HOSTNAME valueFrom: fieldRef: fieldPath: metadata.name`. This applies to `tools/deployment/package-helm/templates/query-worker-deployment.yaml` and `tools/deployment/package-helm/templates/compression-worker-deployment.yaml`.


Comment on lines +43 to +61
lifecycle:
  postStart:
    exec:
      command:
        - "bash"
        - "-c"
        - |
          until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
              --quiet 2>/dev/null; do
            sleep 1
          done
          mongosh --host localhost:27017 --eval '
            try { rs.status(); } catch(e) {
              rs.initiate({
                _id: "rs0",
                members: [{_id: 0, host: "localhost:27017"}]
              });
            }
          ' 2>/dev/null || true
coderabbitai (bot) commented:

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Race condition: readiness probe may pass before rs.initiate() completes.

The postStart hook runs asynchronously and does not block the readiness probe. The readiness probe (lines 62-70) only verifies that MongoDB responds to ping, which can succeed before rs.initiate() finishes. This means the pod could be marked ready and receive traffic while the replica set is still initialising, potentially causing connection errors for applications expecting a configured replica set.

Additionally, the extensive error suppression (2>/dev/null and || true) will hide genuine initialisation failures, making debugging difficult.

💡 Recommended improvements

Consider one of these approaches:

Option 1: Use initContainer instead of postStart

initContainers:
  - name: "init-replica-set"
    image: "mongo:7.0.1"
    command:
      - "bash"
      - "-c"
      - |
        until mongosh --host localhost:27017 --eval "db.runCommand('ping')" --quiet; do
          sleep 1
        done
        mongosh --host localhost:27017 --eval '
          try {
            rs.status();
            print("Replica set already initialized");
          } catch(e) {
            print("Initializing replica set...");
            rs.initiate({
              _id: "rs0",
              members: [{_id: 0, host: "localhost:27017"}]
            });
            print("Replica set initialized");
          }
        '

Option 2: Update readiness probe to verify replica set status

readinessProbe:
  exec:
    command:
      - "mongosh"
      - "127.0.0.1:27017/test"
      - "--quiet"
      - "--eval"
      - "rs.status().ok"

Comment on lines +50 to +53
until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
    --quiet 2>/dev/null; do
  sleep 1
done
coderabbitai (bot) commented:

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add timeout to prevent infinite loop.

The until loop lacks a timeout and could hang indefinitely if MongoDB never becomes available, preventing the container from starting properly. Consider adding a timeout mechanism or retry limit.

⏱️ Proposed fix with timeout
-                    until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
-                        --quiet 2>/dev/null; do
-                      sleep 1
-                    done
+                    TIMEOUT=60
+                    COUNTER=0
+                    until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
+                        --quiet 2>/dev/null; do
+                      sleep 1
+                      COUNTER=$((COUNTER + 1))
+                      if [ $COUNTER -ge $TIMEOUT ]; then
+                        echo "Timeout waiting for MongoDB to start"
+                        exit 1
+                      fi
+                    done

"--loglevel", "WARNING",
"-Q", "query",
"-n", "query-worker"
"-n", "query-worker@$(HOSTNAME)"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as pointed out by the rabbit, the proposed fix won't work. see kubernetes/kubernetes#57726 for details

"--loglevel", "WARNING",
"-Q", "query",
"-n", "query-worker"
"-n", "query-worker@$(HOSTNAME)"
junhaoliao (Member) commented:

according to https://docs.celeryq.dev/en/stable/userguide/workers.html#starting-the-worker

Suggested change
-"-n", "query-worker@$(HOSTNAME)"
+"-n", "query-worker@%h"

"--loglevel", "WARNING",
"-Q", "compression",
"-n", "compression-worker"
"-n", "compression-worker@$(HOSTNAME)"
junhaoliao (Member) commented:

according to https://docs.celeryq.dev/en/stable/userguide/workers.html#starting-the-worker

Suggested change
-"-n", "compression-worker@$(HOSTNAME)"
+"-n", "compression-worker@%h"

A repository member commented:

changes in this file seem not related to the PR?

@@ -1,6 +1,6 @@
apiVersion: "v2"
name: "clp"
version: "0.3.2-dev.4"
A repository member commented:
the git branch's base seems to be out of date. please merge origin (y-scope)'s main branch into the PR source branch

@junhaoliao junhaoliao requested a review from gibber9809 May 10, 2026 10:10
…ox conflicts (fixes y-scope#2259)

Use Celery's built-in `%h` hostname substitution instead of Kubernetes
`$(HOSTNAME)` env var expansion (which doesn't work in command arrays,
see kubernetes/kubernetes#57726). Each worker pod now registers as
`compression-worker@<hostname>` or `query-worker@<hostname>`, preventing
RabbitMQ pidbox queue conflicts when scaling to multiple replicas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@goynam goynam force-pushed the fix/celery-worker-unique-node-name branch from b5a158e to 424e626 on May 11, 2026 05:16
coderabbitai (bot) left a comment

♻️ Duplicate comments (1)
tools/deployment/package-helm/Chart.yaml (1)

3-3: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Confirm chart version aligns with the intended release bump.

The PR objective states 0.3.2-dev.2, but this file is now 0.3.2-dev.5. If dev.5 is intentional, please sync the PR description/objective text to avoid release-tracking drift.



📥 Commits

Reviewing files that changed from the base of the PR and between b5a158e and 424e626.

📒 Files selected for processing (3)
  • tools/deployment/package-helm/Chart.yaml
  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
  • tools/deployment/package-helm/templates/query-worker-deployment.yaml

junhaoliao (Member) left a comment:
got some valuable insights from @gibber9809 offline - let's update the docker-compose config to match the changes in the helm chart.

for the PR title, how about:

fix(deployment): Append hostname to Celery worker node names to prevent pidbox conflicts (fixes #2259).

"--loglevel", "WARNING",
"-Q", "compression",
"-n", "compression-worker"
"-n", "compression-worker@%h"
junhaoliao (Member) commented May 12, 2026:
@gibber9809 and i discussed offline - let's update https://github.com/y-scope/clp/blob/aeb077ca4/tools/deployment/package/docker-compose-all.yaml#L329 to match this command argument

though, practically, that fix won't be enough to ensure uniqueness in the docker compose services:

  1. we hardcode the service hostname at hostname: "compression_worker", so if we ever scale the compression worker to multiple replicas, each replica reads the same hostname, which results in the same worker name across all replicas.
  2. we should avoid hardcoding the hostname then. that way, docker compose will assign numeric postfixes to the hostnames (e.g. compression-worker-1), ensuring uniqueness of pidbox names when used by Celery, while DNS can still resolve compression-worker to the different IPs of the replicas. (a sketch follows at the end of this comment)
  3. however, the reason we hardcoded the hostname for every docker-compose service was to ensure their hostnames match the "component names" defined in the Python CLP utils, e.g. COMPRESSION_WORKER_COMPONENT_NAME = "compression_worker". those "component names" use _ as separators, which differs from the docker-compose service names, where - is used. we should also calibrate the component names to match the service names (tracked by #1777, "Standardise service names to use hyphens instead of underscores for RFC1123 compliance"; #1785 has proposed a fix).

since satisfying 2 + 3 will involve changing a lot of code, i think simply aligning the command in docker-compose-all.yaml (e.g. change "compression-worker" -> "compression-worker@%h") is sufficient for the scope of this PR. we shall fix 2 + 3 separately.
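
for reference, a rough sketch of what point 2 could look like in docker-compose (not part of this PR; the image tag and Celery module path below are placeholders):

# Illustrative sketch of the deferred fix (point 2 above), not the change made in this PR.
services:
  compression-worker:
    image: "clp-package:dev"    # placeholder image tag
    # No `hostname:` key here: Compose then assigns each replica a unique
    # container hostname, so Celery's %h expands to a distinct value per replica.
    command:
      - "celery"
      - "-A"
      - "job_orchestration.executor.compress"    # module path is a placeholder
      - "worker"
      - "-Q"
      - "compression"
      - "-n"
      - "compression-worker@%h"
    deploy:
      replicas: 2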

junhaoliao (Member) commented:
that said, as pointed out by @gibber9809, the above proposal will only ensure uniqueness across different compression-worker replicas on the same host where the docker-compose project runs (which is still valuable, as we can debug multi-worker issues without a multi-host cluster).

however, if the docker compose project is run across multiple hosts to form a cluster, hostnames can still collide. as proposed by #2274, maybe we should formally remove multi-host support in the docker compose flow.

"--loglevel", "WARNING",
"-Q", "query",
"-n", "query-worker"
"-n", "query-worker@%h"

Matches the Helm template change — ensures uniqueness per host when
debugging multi-worker issues locally via docker-compose.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
goynam pushed a commit to goynam/clp that referenced this pull request May 13, 2026
The compression-worker and query-worker node name changes belong in
PR y-scope#2269, not this MongoDB replica set PR. Reverting to main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
coderabbitai (bot) left a comment

Actionable comments posted: 1



📥 Commits

Reviewing files that changed from the base of the PR and between 424e626 and a3b8f01.

📒 Files selected for processing (1)
  • tools/deployment/package/docker-compose-all.yaml

"-f", "/var/log/compression_worker/worker.log",
"-Q", "compression",
"-n", "compression-worker"
"-n", "compression-worker@%h"
coderabbitai (bot) commented May 13, 2026:

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="tools/deployment/package/docker-compose-all.yaml"

echo "== Static worker hostnames =="
rg -n '^\s+hostname:\s+"(compression_worker|query_worker)"' "$file"

echo
echo "== Celery nodename args using %h =="
rg -n '^\s+"-n",\s+"(compression-worker|query-worker)@%h"' "$file"

Repository: y-scope/clp

Length of output: 307


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="tools/deployment/package/docker-compose-all.yaml"

echo "=== Context around compression_worker (lines 290-340) ==="
sed -n '290,340p' "$file" | cat -n

echo
echo "=== Context around query_worker (lines 478-530) ==="
sed -n '478,530p' "$file" | cat -n

Repository: y-scope/clp

Length of output: 4809


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="tools/deployment/package/docker-compose-all.yaml"

echo "=== Full spider-compression-worker service definition (lines 334-380) ==="
sed -n '334,380p' "$file" | cat -n

Repository: y-scope/clp

Length of output: 2444


@%h is non-unique when service hostnames are fixed.

Lines 329 and 516 use %h, but this file hard-codes hostname: "compression_worker" (line 298) and hostname: "query_worker" (line 486). When either worker service is scaled to multiple replicas, all instances resolve to identical node names (e.g., compression-worker@compression_worker), which causes Celery pidbox conflicts.

Remove the fixed hostname values on these services so %h resolves to the unique container hostname, or append a unique suffix independent of the service hostname.

Also applies to: 516-516


junhaoliao (Member) replied:

this is a known limitation - discussed at https://github.com/y-scope/clp/pull/2269/changes#r3227145236 already

we will not attempt to address the limitation in this PR. @coderabbitai please help create an issue to track

coderabbitai (bot) replied:

🧩 Analysis chain

Sure! Let me create a GitHub issue to track this.


🏁 Script executed:

gh issue create \
  --repo y-scope/clp \
  --title "fix(deployment): Make docker-compose worker Celery node names unique when scaling replicas" \
  --body "## Summary

In \`tools/deployment/package/docker-compose-all.yaml\`, the \`compression-worker\` and \`query-worker\` services use \`-n compression-worker@%h\` and \`-n query-worker@%h\` respectively for Celery node names. However, both services also set fixed \`hostname\` values (\`compression_worker\` and \`query_worker\`). This means \`%h\` resolves to the same static hostname for all replicas of a given service, so scaling either worker to multiple replicas still produces colliding Celery node names (e.g., all replicas become \`compression-worker@compression_worker\`), causing RabbitMQ pidbox conflicts.

## Context

- This limitation was identified during review of PR `#2269` (which fixed the equivalent issue in the Helm deployment by using \`%h\`, where each pod gets a unique hostname).
- In docker-compose, unlike Kubernetes pods, containers of a scaled service share the fixed \`hostname\` defined in the service config, so \`%h\` does not provide uniqueness.
- The review discussion: https://github.com/y-scope/clp/pull/2269#discussion_r3231983691

## Possible Approaches

- Remove the fixed \`hostname\` entries from the \`compression-worker\` and \`query-worker\` service definitions so Docker assigns a unique container hostname, allowing \`%h\` to resolve uniquely per replica.
- Alternatively, derive a unique suffix from an environment variable (e.g., container ID) rather than relying on \`%h\`.

## Impact

Affects users who scale \`compression-worker\` or \`query-worker\` to more than one replica in the docker-compose deployment. The symptom is periodic worker crashes due to \`W_PIDBOX_IN_USE\` conflicts (same root cause as issue `#2259`).

---
Requested by `@junhaoliao` — related to PR `#2269` and comment https://github.com/y-scope/clp/pull/2269#discussion_r3231983691"

Length of output: 43


@junhaoliao, the follow-up issue has been created: #2279

It captures the limitation that fixed hostname values in the docker-compose services prevent %h from resolving to unique names when scaling replicas, along with the possible approaches to address it in a future PR.

@junhaoliao junhaoliao changed the title fix(deployment): Use unique Celery node names per pod to prevent pidbox conflicts fix(deployment): Use unique Celery node names per worker replica. May 14, 2026
@junhaoliao junhaoliao added this to the Mid-May 2026 milestone May 14, 2026
@junhaoliao junhaoliao merged commit d6f510f into y-scope:main May 14, 2026
28 checks passed

Development

Successfully merging this pull request may close these issues:

  • Compression workers crash-loop due to Celery pidbox conflict when using shared node name

2 participants