
fix(deployment): Use unique Celery node names per worker replica. #2269

Merged: junhaoliao merged 3 commits into y-scope:main from goynam:fix/celery-worker-unique-node-name on May 14, 2026
Conversation

goynam (Contributor) commented May 9, 2026

Summary

  • Appends $(HOSTNAME) to the Celery -n (node name) flag in both compression-worker and query-worker Helm deployment templates, making each pod register with a unique node name (e.g., compression-worker@clp-compression-worker-abc123)
  • Fixes RabbitMQ pidbox queue conflicts and periodic crashes when running multiple worker pod replicas
  • Bumps Helm chart version to 0.3.2-dev.2

Fixes #2259

Test plan

  • Deploy with compressionWorker.replicas > 1 and verify each pod registers a unique Celery node name in RabbitMQ
  • Confirm no pidbox queue conflicts in RabbitMQ management UI
  • Verify compression and query tasks still execute correctly
  • Check celery inspect active shows distinct node names per pod

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

Summary by CodeRabbit

  • Chores
    • Bumped Helm chart version to 0.3.2-dev.5.
    • Updated worker startup to use hostname-based node names for better operational visibility and support for multiple instances across distributed deployments.


@goynam goynam requested a review from a team as a code owner May 9, 2026 11:58
coderabbitai (bot) commented May 9, 2026

Walkthrough

Celery worker startup node names for compression and query workers were changed from static values to hostname-qualified names (e.g., compression-worker@%h), and the Helm chart version was bumped to 0.3.2-dev.5.

Changes

Worker Reliability and Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| Worker Node Name Configuration: tools/deployment/package-helm/templates/compression-worker-deployment.yaml, tools/deployment/package-helm/templates/query-worker-deployment.yaml, tools/deployment/package/docker-compose-all.yaml | Celery worker -n arguments changed from static node names (compression-worker, query-worker) to hostname-qualified names (compression-worker@%h, query-worker@%h) across Helm templates and docker-compose. |
| Helm Chart Versioning: tools/deployment/package-helm/Chart.yaml | Chart version incremented from 0.3.2-dev.4 to 0.3.2-dev.5. |
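
For illustration, a minimal sketch of how the hostname-qualified node name appears in a worker container's command array (the container name and Celery binary path are abridged; the module path and flags follow the templates summarized above):

# Sketch only: abridged from the worker deployment templates summarized above.
# Celery expands %h to the container's hostname at startup, so each replica
# registers a distinct node name (e.g., query-worker@<pod-hostname>).
containers:
  - name: "query-worker"    # illustrative container name
    command: [
      "celery", "-A", "job_orchestration.executor.query", "worker",
      "--loglevel", "WARNING",
      "-Q", "query",
      "-n", "query-worker@%h"
    ]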

Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~5 minutes

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Linked Issues check | ✅ Passed | The changes fully address issue #2259 by updating Celery node names to use @%h suffix in all deployment configurations (Helm templates and docker-compose) to ensure unique node names per pod. |
| Out of Scope Changes check | ✅ Passed | All changes are directly related to fixing the pidbox conflict issue: Helm chart version bump, Celery node name updates in three templates, and docker-compose updates are all in scope. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Title check | ✅ Passed | The pull request title accurately describes the main change: using unique Celery node names per worker replica to fix deployment issues. |



coderabbitai (bot) left a comment

Actionable comments posted: 4



📥 Commits

Reviewing files that changed from the base of the PR and between 0144808 and b5a158e.

📒 Files selected for processing (4)
  • tools/deployment/package-helm/Chart.yaml
  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
  • tools/deployment/package-helm/templates/query-worker-deployment.yaml
  • tools/deployment/package-helm/templates/results-cache-statefulset.yaml

 apiVersion: "v2"
 name: "clp"
-version: "0.3.2-dev.0"
+version: "0.3.2-dev.3"
coderabbitai (bot) commented:

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Version number doesn't match the PR summary.

The PR summary states the chart should be bumped to 0.3.2-dev.2, but the code shows 0.3.2-dev.3. Please verify which version is intended.


"--loglevel", "WARNING",
"-Q", "query",
"-n", "query-worker"
"-n", "query-worker@$(HOSTNAME)"
coderabbitai (bot) commented May 9, 2026

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check Kubernetes documentation and Celery node naming patterns in the codebase

# Check if there are any existing examples or patterns of environment variable usage in command arrays
rg -n -C3 'command:.*\$\(' --type=yaml

# Search for Celery node name patterns in the codebase to see if @ is used elsewhere
rg -n -C3 '"-n".*@' --type=yaml

# Search for HOSTNAME usage patterns
rg -n -C3 'HOSTNAME' --type=yaml

Repository: y-scope/clp

Length of output: 3185


🌐 Web query:

Does Kubernetes expand environment variables like $(HOSTNAME) in pod command arrays without an explicit shell?

💡 Result:

Yes—Kubernetes does expand variables in the pod .spec.containers[].command and .spec.containers[].args fields even when you are using the exec-style array form (no explicit shell), but only using Kubernetes's own $(VAR_NAME) interpolation rules, not Bash's $VAR or ${...} syntax. In the official docs, Kubernetes states that "The environment variable appears in parentheses, '$(VAR)'. This is required for the variable to be expanded in the command or args field." [1] It also shows using env values in args via $(MESSAGE) with command: ["/bin/echo"] and args: ["$(MESSAGE)"]. [2] So a Pod with command: ["/bin/echo"] and args: ["$(HOSTNAME)"] will expand $(HOSTNAME) during Pod creation only if a value is available for HOSTNAME to substitute per Kubernetes's interpolation mechanism (e.g., via the downward API into an env entry, or another env var defined earlier in the manifest). [1][2] If you write Bash-style $HOSTNAME inside command/args, it will not be expanded by Kubernetes because that syntax is for a shell, and Kubernetes does not run a shell for you; when you need shell behavior, the Kubernetes docs recommend wrapping with a shell like /bin/sh -c. [1] For example, to get Kubernetes interpolation to substitute a value, you'd typically define HOSTNAME via env (e.g., metadata.name via the downward API) and then reference it in args with $(HOSTNAME). [3][4] Finally, if you truly want Bash parameter expansion like ${Var1:-$Var2}, you must start bash (e.g., command: ["/bin/bash", "-c", "..."]). [5]


🌐 Web query:

Does Celery support node names with @ symbol in the format worker-name@hostname?

💡 Result:

Yes. Celery worker nodenames use an “@” separator between the short worker name and the hostname, i.e. the format is name@hostname (e.g., worker1@george.example.com). Celery explicitly defines the nodename separator as “@” and provides helpers that “join(name, hostname)” using that separator; the documentation also shows starting workers with “-n worker1@%h”, where %h expands to the hostname including domain name. [1][2] So a nodename like worker-name@hostname is supported (assuming hostname is whatever Celery expands/you pass).


🏁 Script executed:

# Read the full deployment files to check for environment variable definitions
cat -n tools/deployment/package-helm/templates/query-worker-deployment.yaml | head -100

Repository: y-scope/clp

Length of output: 4669


🏁 Script executed:

# Find and check the includes to see if they define HOSTNAME
fd -t f "_helpers.tpl" tools/deployment/package-helm/

# Also search for any hostname-related environment variable definitions
rg -n "HOSTNAME|hostname" tools/deployment/package-helm/templates/ --type=yaml

# Check if there's any downward API configuration
rg -n "fieldRef|downward" tools/deployment/package-helm/templates/ --type=yaml

Repository: y-scope/clp

Length of output: 992


🏁 Script executed:

# Check the helpers template file
cat -n tools/deployment/package-helm/templates/_helpers.tpl | head -150

Repository: y-scope/clp

Length of output: 5489


🏁 Script executed:

# Continue reading the helpers file to find celery-related includes
rg -n "celeryBrokerUrlEnvVar|celeryResultBackendEnvVar" tools/deployment/package-helm/templates/_helpers.tpl -A 10

Repository: y-scope/clp

Length of output: 935


Add HOSTNAME environment variable or use an alternative approach for the node name.

The $(HOSTNAME) variable will not expand in the command array because HOSTNAME is not defined as an environment variable in the pod specification. Kubernetes only expands $(VAR_NAME) when the variable is explicitly defined (via env or downward API); it does not automatically inject HOSTNAME.

To fix this, either:

  • Add HOSTNAME to the env section using the downward API (e.g., valueFrom: fieldRef: fieldPath: metadata.name)
  • Use the pod spec's hostname field in conjunction with a proper environment variable reference
  • Replace $(HOSTNAME) with $(POD_NAME) and inject POD_NAME via the downward API

Note: Other deployments in this chart (reducer, query-scheduler) set a hostname field at the pod spec level, but this does not create an environment variable for command arguments.


junhaoliao (Member) commented:
as pointed out by the rabbit, the proposed fix won't work. see kubernetes/kubernetes#57726 for details

coderabbitai (bot) replied:
@junhaoliao, you're correct — the current $(HOSTNAME) approach will not work. As confirmed by the Kubernetes issue, when $(HOSTNAME) is used in a command/args array without an explicit env definition, Kubernetes echoes it literally as $(HOSTNAME) rather than expanding it. While HOSTNAME is automatically set as a container environment variable at runtime by the container runtime, Kubernetes only performs $(VAR_NAME) substitution in command and args fields for variables explicitly declared in the pod's env section.

There are two clean fixes:


Option 1 (Recommended): Expose pod name via Downward API

Add HOSTNAME to the env section using the Downward API so Kubernetes can substitute it in the command array. This is the least invasive change:

env:
  # ... existing env vars ...
  - name: "HOSTNAME"
    valueFrom:
      fieldRef:
        fieldPath: "metadata.name"

This applies to both query-worker-deployment.yaml and compression-worker-deployment.yaml. The command line using $(HOSTNAME) would then expand correctly at pod creation time.

Note: metadata.name yields the pod name (e.g., query-worker-6b9f4d-abc12), which is unique per replica and suitable as a Celery node name suffix.


Option 2: Shell wrapper

Wrap the command in a shell so that $HOSTNAME (shell variable, not Kubernetes interpolation) is expanded at runtime:

command: [
  "/bin/sh", "-c",
  "python3 -u /opt/clp/lib/python3/site-packages/bin/celery -A job_orchestration.executor.query worker --concurrency {{ .Values.workerConcurrency }} --loglevel WARNING -Q query -n query-worker@$HOSTNAME"
]

This works because the container runtime sets HOSTNAME as an OS-level env var, and the shell expands $HOSTNAME naturally. However, it requires collapsing the structured array into a single shell string, which is harder to maintain and loses the quoting safety of exec form.


Option 1 (Downward API) is preferable as it keeps the structured command array and is the idiomatic Kubernetes approach for injecting pod metadata into arguments.
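
Putting Option 1 together, a minimal sketch of how the env entry and the command argument would combine in a worker container spec (the container name and surrounding flags are abridged and assumed). Note that the PR ultimately adopted Celery's %h substitution instead, per the suggestions further down:

# Sketch of Option 1 only; not the approach that was merged.
containers:
  - name: "query-worker"    # illustrative container name
    env:
      - name: "HOSTNAME"
        valueFrom:
          fieldRef:
            fieldPath: "metadata.name"    # pod name, unique per replica
    command:
      - "celery"
      - "-A"
      - "job_orchestration.executor.query"
      - "worker"
      - "-Q"
      - "query"
      # Kubernetes substitutes $(HOSTNAME) here only because HOSTNAME is
      # declared in the env section above.
      - "-n"
      - "query-worker@$(HOSTNAME)"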


✏️ Learnings added
Learnt from: junhaoliao
Repo: y-scope/clp PR: 2269
File: tools/deployment/package-helm/templates/query-worker-deployment.yaml:90-90
Timestamp: 2026-05-10T10:11:03.655Z
Learning: In Kubernetes Deployment YAML (Helm templates), `$(HOSTNAME)` in a container's `command`/`args` array is NOT expanded by Kubernetes unless `HOSTNAME` is explicitly declared in the container's `env` section. Even though the container runtime automatically sets `HOSTNAME` as an OS-level env var, Kubernetes variable substitution only works for variables declared in the pod spec's `env` array. The correct fix is to add an explicit env entry using the Downward API: `- name: HOSTNAME valueFrom: fieldRef: fieldPath: metadata.name`. This applies to `tools/deployment/package-helm/templates/query-worker-deployment.yaml` and `tools/deployment/package-helm/templates/compression-worker-deployment.yaml`.


Comment on lines +43 to +61
lifecycle:
  postStart:
    exec:
      command:
        - "bash"
        - "-c"
        - |
          until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
              --quiet 2>/dev/null; do
            sleep 1
          done
          mongosh --host localhost:27017 --eval '
            try { rs.status(); } catch(e) {
              rs.initiate({
                _id: "rs0",
                members: [{_id: 0, host: "localhost:27017"}]
              });
            }
          ' 2>/dev/null || true
coderabbitai (bot) commented:

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Race condition: readiness probe may pass before rs.initiate() completes.

The postStart hook runs asynchronously and does not block the readiness probe. The readiness probe (lines 62-70) only verifies that MongoDB responds to ping, which can succeed before rs.initiate() finishes. This means the pod could be marked ready and receive traffic while the replica set is still initialising, potentially causing connection errors for applications expecting a configured replica set.

Additionally, the extensive error suppression (2>/dev/null and || true) will hide genuine initialisation failures, making debugging difficult.

💡 Recommended improvements

Consider one of these approaches:

Option 1: Use initContainer instead of postStart

initContainers:
  - name: "init-replica-set"
    image: "mongo:7.0.1"
    command:
      - "bash"
      - "-c"
      - |
        until mongosh --host localhost:27017 --eval "db.runCommand('ping')" --quiet; do
          sleep 1
        done
        mongosh --host localhost:27017 --eval '
          try {
            rs.status();
            print("Replica set already initialized");
          } catch(e) {
            print("Initializing replica set...");
            rs.initiate({
              _id: "rs0",
              members: [{_id: 0, host: "localhost:27017"}]
            });
            print("Replica set initialized");
          }
        '

Option 2: Update readiness probe to verify replica set status

readinessProbe:
  exec:
    command:
      - "mongosh"
      - "127.0.0.1:27017/test"
      - "--quiet"
      - "--eval"
      - "rs.status().ok"

Comment on lines +50 to +53
until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
    --quiet 2>/dev/null; do
  sleep 1
done
coderabbitai (bot) commented:

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add timeout to prevent infinite loop.

The until loop lacks a timeout and could hang indefinitely if MongoDB never becomes available, preventing the container from starting properly. Consider adding a timeout mechanism or retry limit.

⏱️ Proposed fix with timeout
-                    until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
-                        --quiet 2>/dev/null; do
-                      sleep 1
-                    done
+                    TIMEOUT=60
+                    COUNTER=0
+                    until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
+                        --quiet 2>/dev/null; do
+                      sleep 1
+                      COUNTER=$((COUNTER + 1))
+                      if [ $COUNTER -ge $TIMEOUT ]; then
+                        echo "Timeout waiting for MongoDB to start"
+                        exit 1
+                      fi
+                    done

"--loglevel", "WARNING",
"-Q", "query",
"-n", "query-worker"
"-n", "query-worker@$(HOSTNAME)"
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as pointed out by the rabbit, the proposed fix won't work. see kubernetes/kubernetes#57726 for details

"--loglevel", "WARNING",
"-Q", "query",
"-n", "query-worker"
"-n", "query-worker@$(HOSTNAME)"
junhaoliao (Member) commented:

according to https://docs.celeryq.dev/en/stable/userguide/workers.html#starting-the-worker

Suggested change
-"-n", "query-worker@$(HOSTNAME)"
+"-n", "query-worker@%h"

"--loglevel", "WARNING",
"-Q", "compression",
"-n", "compression-worker"
"-n", "compression-worker@$(HOSTNAME)"
junhaoliao (Member) commented:

according to https://docs.celeryq.dev/en/stable/userguide/workers.html#starting-the-worker

Suggested change
-"-n", "compression-worker@$(HOSTNAME)"
+"-n", "compression-worker@%h"

A repository member commented:

changes in this file seem not related to the PR?

@@ -1,6 +1,6 @@
apiVersion: "v2"
name: "clp"
version: "0.3.2-dev.4"
A repository member commented:
the git branch's base seems to be out of date. please merge origin (y-scope)'s main branch into the PR source branch

@junhaoliao junhaoliao requested a review from gibber9809 May 10, 2026 10:10
…ox conflicts (fixes y-scope#2259)

Use Celery's built-in `%h` hostname substitution instead of Kubernetes
`$(HOSTNAME)` env var expansion (which doesn't work in command arrays,
see kubernetes/kubernetes#57726). Each worker pod now registers as
`compression-worker@<hostname>` or `query-worker@<hostname>`, preventing
RabbitMQ pidbox queue conflicts when scaling to multiple replicas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@goynam goynam force-pushed the fix/celery-worker-unique-node-name branch from b5a158e to 424e626 on May 11, 2026 05:16
coderabbitai (bot) left a comment

♻️ Duplicate comments (1)
tools/deployment/package-helm/Chart.yaml (1)

3-3: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Confirm chart version aligns with the intended release bump.

The PR objective states 0.3.2-dev.2, but this file is now 0.3.2-dev.5. If dev.5 is intentional, please sync the PR description/objective text to avoid release-tracking drift.



📥 Commits

Reviewing files that changed from the base of the PR and between b5a158e and 424e626.

📒 Files selected for processing (3)
  • tools/deployment/package-helm/Chart.yaml
  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
  • tools/deployment/package-helm/templates/query-worker-deployment.yaml

junhaoliao (Member) left a comment:
got some valuable insights from @gibber9809 offline - let's update the docker-compose config to match the changes in the helm chart.

for the PR title, how about:

fix(deployment): Append hostname to Celery worker node names to prevent pidbox conflicts (fixes #2259).

"--loglevel", "WARNING",
"-Q", "compression",
"-n", "compression-worker"
"-n", "compression-worker@%h"
junhaoliao (Member) commented May 12, 2026:
@gibber9809 and i discussed offline - let's update https://github.com/y-scope/clp/blob/aeb077ca4/tools/deployment/package/docker-compose-all.yaml#L329 to match this command argument

though, practically, that fix won't be enough to ensure uniqueness in the docker compose services:

  1. we hardcode the service hostname at hostname: "compression_worker", so if we ever scale the compression worker to multiple replicas, each replica reads the same hostname, which results in the same worker name across all replicas.
  2. we should avoid hardcoding the hostname then. that way, docker compose will assign numeric postfixes to the hostnames (e.g. compression-worker-1), ensuring uniqueness of pidbox names when used by Celery, while DNS can still resolve compression-worker to the different IPs of the replicas. (a sketch follows at the end of this comment)
  3. however, the reason we hardcoded the hostname for every docker-compose service was to ensure their hostnames match the "component names" defined in the Python CLP utils, e.g. COMPRESSION_WORKER_COMPONENT_NAME = "compression_worker". those "component names" use _ as separators, which differs from the docker-compose service names, where - is used. we should also calibrate the component names to match the service names (tracked by #1777, "Standardise service names to use hyphens instead of underscores for RFC1123 compliance"; #1785 has proposed a fix).

since satisfying 2 + 3 will involve changing a lot of code, i think simply aligning the command in docker-compose-all.yaml (e.g. change "compression-worker" -> "compression-worker@%h") is sufficient for the scope of this PR. we shall fix 2 + 3 separately.
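
for reference, a rough sketch of what point 2 could look like in docker-compose (not part of this PR; the image tag and Celery module path below are placeholders):

# Illustrative sketch of the deferred fix (point 2 above), not the change made in this PR.
services:
  compression-worker:
    image: "clp-package:dev"    # placeholder image tag
    # No `hostname:` key here: Compose then assigns each replica a unique
    # container hostname, so Celery's %h expands to a distinct value per replica.
    command:
      - "celery"
      - "-A"
      - "job_orchestration.executor.compress"    # module path is a placeholder
      - "worker"
      - "-Q"
      - "compression"
      - "-n"
      - "compression-worker@%h"
    deploy:
      replicas: 2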

junhaoliao (Member) commented:
that said, as pointed out by @gibber9809, the above proposal will only ensure uniqueness across different compression-worker replicas on the same host where the docker-compose project runs (which is still valuable, as we can debug multi-worker issues without a multi-host cluster).

however, if the docker compose project is run across multiple hosts to form a cluster, hostnames can still collide. as proposed by #2274, maybe we should formally remove multi-host support in the docker compose flow.

"--loglevel", "WARNING",
"-Q", "query",
"-n", "query-worker"
"-n", "query-worker@%h"

Matches the Helm template change — ensures uniqueness per host when
debugging multi-worker issues locally via docker-compose.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
goynam pushed a commit to goynam/clp that referenced this pull request May 13, 2026
The compression-worker and query-worker node name changes belong in
PR y-scope#2269, not this MongoDB replica set PR. Reverting to main.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
coderabbitai (bot) left a comment

Actionable comments posted: 1



📥 Commits

Reviewing files that changed from the base of the PR and between 424e626 and a3b8f01.

📒 Files selected for processing (1)
  • tools/deployment/package/docker-compose-all.yaml

"-f", "/var/log/compression_worker/worker.log",
"-Q", "compression",
"-n", "compression-worker"
"-n", "compression-worker@%h"
coderabbitai (bot) commented May 13, 2026:

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="tools/deployment/package/docker-compose-all.yaml"

echo "== Static worker hostnames =="
rg -n '^\s+hostname:\s+"(compression_worker|query_worker)"' "$file"

echo
echo "== Celery nodename args using %h =="
rg -n '^\s+"-n",\s+"(compression-worker|query-worker)@%h"' "$file"

Repository: y-scope/clp

Length of output: 307


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="tools/deployment/package/docker-compose-all.yaml"

echo "=== Context around compression_worker (lines 290-340) ==="
sed -n '290,340p' "$file" | cat -n

echo
echo "=== Context around query_worker (lines 478-530) ==="
sed -n '478,530p' "$file" | cat -n

Repository: y-scope/clp

Length of output: 4809


🏁 Script executed:

#!/bin/bash
set -euo pipefail

file="tools/deployment/package/docker-compose-all.yaml"

echo "=== Full spider-compression-worker service definition (lines 334-380) ==="
sed -n '334,380p' "$file" | cat -n

Repository: y-scope/clp

Length of output: 2444


@%h is non-unique when service hostnames are fixed.

Lines 329 and 516 use %h, but this file hard-codes hostname: "compression_worker" (line 298) and hostname: "query_worker" (line 486). When either worker service is scaled to multiple replicas, all instances resolve to identical node names (e.g., compression-worker@compression_worker), which causes Celery pidbox conflicts.

Remove the fixed hostname values on these services so %h resolves to the unique container hostname, or append a unique suffix independent of the service hostname.

Also applies to: 516-516


junhaoliao (Member) replied:

this is a known limitation - discussed at https://github.com/y-scope/clp/pull/2269/changes#r3227145236 already

we will not attempt to address the limitation in this PR. @coderabbitai please help create an issue to track

coderabbitai (bot) replied:

🧩 Analysis chain

Sure! Let me create a GitHub issue to track this.


🏁 Script executed:

gh issue create \
  --repo y-scope/clp \
  --title "fix(deployment): Make docker-compose worker Celery node names unique when scaling replicas" \
  --body "## Summary

In \`tools/deployment/package/docker-compose-all.yaml\`, the \`compression-worker\` and \`query-worker\` services use \`-n compression-worker@%h\` and \`-n query-worker@%h\` respectively for Celery node names. However, both services also set fixed \`hostname\` values (\`compression_worker\` and \`query_worker\`). This means \`%h\` resolves to the same static hostname for all replicas of a given service, so scaling either worker to multiple replicas still produces colliding Celery node names (e.g., all replicas become \`compression-worker@compression_worker\`), causing RabbitMQ pidbox conflicts.

## Context

- This limitation was identified during review of PR `#2269` (which fixed the equivalent issue in the Helm deployment by using \`%h\`, where each pod gets a unique hostname).
- In docker-compose, unlike Kubernetes pods, containers of a scaled service share the fixed \`hostname\` defined in the service config, so \`%h\` does not provide uniqueness.
- The review discussion: https://github.com/y-scope/clp/pull/2269#discussion_r3231983691

## Possible Approaches

- Remove the fixed \`hostname\` entries from the \`compression-worker\` and \`query-worker\` service definitions so Docker assigns a unique container hostname, allowing \`%h\` to resolve uniquely per replica.
- Alternatively, derive a unique suffix from an environment variable (e.g., container ID) rather than relying on \`%h\`.

## Impact

Affects users who scale \`compression-worker\` or \`query-worker\` to more than one replica in the docker-compose deployment. The symptom is periodic worker crashes due to \`W_PIDBOX_IN_USE\` conflicts (same root cause as issue `#2259`).

---
Requested by `@junhaoliao` — related to PR `#2269` and comment https://github.com/y-scope/clp/pull/2269#discussion_r3231983691"

Length of output: 43


@junhaoliao, the follow-up issue has been created: #2279

It captures the limitation that fixed hostname values in the docker-compose services prevent %h from resolving to unique names when scaling replicas, along with the possible approaches to address it in a future PR.

@junhaoliao junhaoliao changed the title fix(deployment): Use unique Celery node names per pod to prevent pidbox conflicts fix(deployment): Use unique Celery node names per worker replica. May 14, 2026
@junhaoliao junhaoliao added this to the Mid-May 2026 milestone May 14, 2026
@junhaoliao junhaoliao merged commit d6f510f into y-scope:main May 14, 2026
28 checks passed

Development

Successfully merging this pull request may close these issues:

  • Compression workers crash-loop due to Celery pidbox conflict when using shared node name

2 participants