fix(deployment): Use unique Celery node names per worker replica. #2269
Conversation
Walkthrough
Celery worker startup node names for compression and query workers were changed from static values to hostname-qualified names (e.g., …).
Changes
Worker Reliability and Infrastructure
🎯 1 (Trivial) | ⏱️ ~5 minutes
🚥 Pre-merge checks | ✅ Passed checks (5 passed)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/deployment/package-helm/Chart.yaml`:
- Line 3: The Chart.yaml version field is inconsistent with the PR summary:
update the version value in Chart.yaml (the version: "0.3.2-dev.3" entry) to the
intended release string (e.g., "0.3.2-dev.2") so the chart version matches the
PR description; confirm the correct target version with the PR and change the
version field accordingly.
In `@tools/deployment/package-helm/templates/query-worker-deployment.yaml`:
- Line 90: The command arg "query-worker@$(HOSTNAME)" will not expand because
HOSTNAME is not defined as a pod env var; add an env entry that exposes the pod
name via the downward API (e.g., add env: - name: HOSTNAME valueFrom: fieldRef:
fieldPath: metadata.name) or alternatively replace "$(HOSTNAME)" with
"$(POD_NAME)" and add env: - name: POD_NAME valueFrom: fieldRef: fieldPath:
metadata.name; note that setting the pod spec hostname alone does not create an
env var for command args, so ensure you inject the chosen variable via
env/valueFrom.
In `@tools/deployment/package-helm/templates/results-cache-statefulset.yaml`:
- Around line 43-61: The postStart hook (postStart -> exec -> mongosh /
rs.initiate) can allow the readinessProbe to pass before rs.initiate() completes
and also hides real errors via redirection and "|| true"; fix by moving the
replica set initialization into an initContainer (e.g., add an initContainers
entry that waits for mongosh ping then runs rs.status()/rs.initiate without
swallowing errors) or, if keeping postStart, change the readinessProbe to check
replica set health (e.g., use mongosh --eval "rs.status().ok" or equivalent) and
remove the broad error suppression so failures surface; update references to
postStart, rs.initiate, mongosh and readinessProbe accordingly.
- Around line 50-53: The until loop that runs mongosh --host localhost:27017
--eval "db.runCommand('ping')" --quiet can hang indefinitely; modify the
container start script to add a retry limit or timeout: introduce a max_retries
counter (or use timeout command) and increment on each failed attempt, exit with
non-zero status and a clear error log if retries exceed the limit; update the
loop that currently uses "until mongosh ...; do sleep 1; done" to check the
counter and break/exit on timeout so the pod won't hang forever.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 04b8aee3-0404-4a56-a477-29bccd714349
📒 Files selected for processing (4)
tools/deployment/package-helm/Chart.yaml
tools/deployment/package-helm/templates/compression-worker-deployment.yaml
tools/deployment/package-helm/templates/query-worker-deployment.yaml
tools/deployment/package-helm/templates/results-cache-statefulset.yaml
  apiVersion: "v2"
  name: "clp"
- version: "0.3.2-dev.0"
+ version: "0.3.2-dev.3"
Version number doesn't match the PR summary.
The PR summary states the chart should be bumped to 0.3.2-dev.2, but the code shows 0.3.2-dev.3. Please verify which version is intended.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/deployment/package-helm/Chart.yaml` at line 3, The Chart.yaml version
field is inconsistent with the PR summary: update the version value in
Chart.yaml (the version: "0.3.2-dev.3" entry) to the intended release string
(e.g., "0.3.2-dev.2") so the chart version matches the PR description; confirm
the correct target version with the PR and change the version field accordingly.
| "--loglevel", "WARNING", | ||
| "-Q", "query", | ||
| "-n", "query-worker" | ||
| "-n", "query-worker@$(HOSTNAME)" |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Description: Check Kubernetes documentation and Celery node naming patterns in the codebase
# Check if there are any existing examples or patterns of environment variable usage in command arrays
rg -n -C3 'command:.*\$\(' --type=yaml
# Search for Celery node name patterns in the codebase to see if @ is used elsewhere
rg -n -C3 '"-n".*@' --type=yaml
# Search for HOSTNAME usage patterns
rg -n -C3 'HOSTNAME' --type=yaml
Repository: y-scope/clp
Length of output: 3185
🌐 Web query:
Does Kubernetes expand environment variables like $(HOSTNAME) in pod command arrays without an explicit shell?
💡 Result:
Yes: Kubernetes does expand variables in the pod's .spec.containers[].command and .spec.containers[].args fields even when you use the exec-style array form (no explicit shell), but only under Kubernetes's own $(VAR_NAME) interpolation rules, not Bash's.
Citations:
- 1: https://v1-32.docs.kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container
- 2: https://kubernetes.io/docs/tasks/inject-data-application/define-environment-variable-container/
- 3: https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/
- 4: $(HOSTNAME) not replaced in args kubernetes/kubernetes#57726
- 5: https://stackoverflow.com/questions/67874552/kubernetes-using-bash-variable-expansion-in-container-entrypoint
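To make the substitution rule above concrete, here is a minimal, hypothetical pod manifest (not part of this chart; all names are illustrative): the declared POD_NAME is substituted in the exec-form command, while the undeclared $(HOSTNAME) is left literal.
apiVersion: v1
kind: Pod
metadata:
  name: expansion-demo              # hypothetical pod, not from the CLP chart
spec:
  restartPolicy: Never
  containers:
    - name: demo
      image: busybox
      env:
        - name: POD_NAME            # declared here, so $(POD_NAME) below expands
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
      # exec-form command: no shell involved, only Kubernetes $(VAR_NAME) interpolation
      command: ["echo", "declared: $(POD_NAME)", "undeclared: $(HOSTNAME)"]
      # prints: declared: expansion-demo undeclared: $(HOSTNAME)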
🌐 Web query:
Does Celery support node names with @ symbol in the format worker-name@hostname?
💡 Result:
Yes. Celery worker nodenames use an “@” separator between the short worker name and the hostname, i.e. the format is name@hostname (e.g., worker1@george.example.com). Celery explicitly defines the nodename separator as “@” and provides helpers that “join(name, hostname)” using that separator; the documentation also shows starting workers with “-n worker1@%h”, where %h expands to the hostname including domain name. [1][2] So a nodename like worker-name@hostname is supported (assuming hostname is whatever Celery expands/you pass).
Citations:
- 1: https://docs.celeryq.dev/en/main/%5Fmodules/celery/utils/nodenames.html
- 2: https://celery.readthedocs.io/en/3.0/internals/reference/celery.utils.nodenames.html
🏁 Script executed:
# Read the full deployment files to check for environment variable definitions
cat -n tools/deployment/package-helm/templates/query-worker-deployment.yaml | head -100
Repository: y-scope/clp
Length of output: 4669
🏁 Script executed:
# Find and check the includes to see if they define HOSTNAME
fd -t f "_helpers.tpl" tools/deployment/package-helm/
# Also search for any hostname-related environment variable definitions
rg -n "HOSTNAME|hostname" tools/deployment/package-helm/templates/ --type=yaml
# Check if there's any downward API configuration
rg -n "fieldRef|downward" tools/deployment/package-helm/templates/ --type=yamlRepository: y-scope/clp
Length of output: 992
🏁 Script executed:
# Check the helpers template file
cat -n tools/deployment/package-helm/templates/_helpers.tpl | head -150
Repository: y-scope/clp
Length of output: 5489
🏁 Script executed:
# Continue reading the helpers file to find celery-related includes
rg -n "celeryBrokerUrlEnvVar|celeryResultBackendEnvVar" tools/deployment/package-helm/templates/_helpers.tpl -A 10Repository: y-scope/clp
Length of output: 935
Add HOSTNAME environment variable or use an alternative approach for the node name.
The $(HOSTNAME) variable will not expand in the command array because HOSTNAME is not defined as an environment variable in the pod specification. Kubernetes only expands $(VAR_NAME) when the variable is explicitly defined (via env or downward API); it does not automatically inject HOSTNAME.
To fix this, either:
- Add HOSTNAME to the env section using the downward API (e.g., valueFrom: fieldRef: fieldPath: metadata.name)
- Use the pod spec's hostname field in conjunction with a proper environment variable reference
- Replace $(HOSTNAME) with $(POD_NAME) and inject POD_NAME via the downward API
Note: Other deployments in this chart (reducer, query-scheduler) set a hostname field at the pod spec level, but this does not create an environment variable for command arguments.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/deployment/package-helm/templates/query-worker-deployment.yaml` at line
90, The command arg "query-worker@$(HOSTNAME)" will not expand because HOSTNAME
is not defined as a pod env var; add an env entry that exposes the pod name via
the downward API (e.g., add env: - name: HOSTNAME valueFrom: fieldRef:
fieldPath: metadata.name) or alternatively replace "$(HOSTNAME)" with
"$(POD_NAME)" and add env: - name: POD_NAME valueFrom: fieldRef: fieldPath:
metadata.name; note that setting the pod spec hostname alone does not create an
env var for command args, so ensure you inject the chosen variable via
env/valueFrom.
as pointed out by the rabbit, the proposed fix won't work. see kubernetes/kubernetes#57726 for details
@junhaoliao, you're correct — the current $(HOSTNAME) approach will not work. As confirmed by the Kubernetes issue, when $(HOSTNAME) is used in a command/args array without an explicit env definition, Kubernetes echoes it literally as $(HOSTNAME) rather than expanding it. While HOSTNAME is automatically set as a container environment variable at runtime by the container runtime, Kubernetes only performs $(VAR_NAME) substitution in command and args fields for variables explicitly declared in the pod's env section.
There are two clean fixes:
Option 1 (Recommended): Expose pod name via Downward API
Add HOSTNAME to the env section using the Downward API so Kubernetes can substitute it in the command array. This is the least invasive change:
env:
  # ... existing env vars ...
  - name: "HOSTNAME"
    valueFrom:
      fieldRef:
        fieldPath: "metadata.name"

This applies to both query-worker-deployment.yaml and compression-worker-deployment.yaml. The command line using $(HOSTNAME) would then expand correctly at pod creation time.
Note: metadata.name yields the pod name (e.g., query-worker-6b9f4d-abc12), which is unique per replica and suitable as a Celery node name suffix.
Option 2: Shell wrapper
Wrap the command in a shell so that $HOSTNAME (shell variable, not Kubernetes interpolation) is expanded at runtime:
command: [
  "/bin/sh", "-c",
  "python3 -u /opt/clp/lib/python3/site-packages/bin/celery -A job_orchestration.executor.query worker --concurrency {{ .Values.workerConcurrency }} --loglevel WARNING -Q query -n query-worker@$HOSTNAME"
]

This works because the container runtime sets HOSTNAME as an OS-level env var, and the shell expands $HOSTNAME naturally. However, it requires collapsing the structured array into a single shell string, which is harder to maintain and loses the quoting safety of exec form.
Option 1 (Downward API) is preferable as it keeps the structured command array and is the idiomatic Kubernetes approach for injecting pod metadata into arguments.
✏️ Learnings added
Learnt from: junhaoliao
Repo: y-scope/clp PR: 2269
File: tools/deployment/package-helm/templates/query-worker-deployment.yaml:90-90
Timestamp: 2026-05-10T10:11:03.655Z
Learning: In Kubernetes Deployment YAML (Helm templates), `$(HOSTNAME)` in a container's `command`/`args` array is NOT expanded by Kubernetes unless `HOSTNAME` is explicitly declared in the container's `env` section. Even though the container runtime automatically sets `HOSTNAME` as an OS-level env var, Kubernetes variable substitution only works for variables declared in the pod spec's `env` array. The correct fix is to add an explicit env entry using the Downward API: `- name: HOSTNAME valueFrom: fieldRef: fieldPath: metadata.name`. This applies to `tools/deployment/package-helm/templates/query-worker-deployment.yaml` and `tools/deployment/package-helm/templates/compression-worker-deployment.yaml`.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
lifecycle:
  postStart:
    exec:
      command:
        - "bash"
        - "-c"
        - |
          until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
            --quiet 2>/dev/null; do
            sleep 1
          done
          mongosh --host localhost:27017 --eval '
          try { rs.status(); } catch(e) {
            rs.initiate({
              _id: "rs0",
              members: [{_id: 0, host: "localhost:27017"}]
            });
          }
          ' 2>/dev/null || true
Race condition: readiness probe may pass before rs.initiate() completes.
The postStart hook runs asynchronously and does not block the readiness probe. The readiness probe (lines 62-70) only verifies that MongoDB responds to ping, which can succeed before rs.initiate() finishes. This means the pod could be marked ready and receive traffic while the replica set is still initialising, potentially causing connection errors for applications expecting a configured replica set.
Additionally, the extensive error suppression (2>/dev/null and || true) will hide genuine initialisation failures, making debugging difficult.
💡 Recommended improvements
Consider one of these approaches:
Option 1: Use initContainer instead of postStart
initContainers:
  - name: "init-replica-set"
    image: "mongo:7.0.1"
    command:
      - "bash"
      - "-c"
      - |
        until mongosh --host localhost:27017 --eval "db.runCommand('ping')" --quiet; do
          sleep 1
        done
        mongosh --host localhost:27017 --eval '
        try {
          rs.status();
          print("Replica set already initialized");
        } catch(e) {
          print("Initializing replica set...");
          rs.initiate({
            _id: "rs0",
            members: [{_id: 0, host: "localhost:27017"}]
          });
          print("Replica set initialized");
        }
        '

Option 2: Update readiness probe to verify replica set status
readinessProbe:
  exec:
    command:
      - "mongosh"
      - "127.0.0.1:27017/test"
      - "--quiet"
      - "--eval"
      - "rs.status().ok"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/deployment/package-helm/templates/results-cache-statefulset.yaml`
around lines 43 - 61, The postStart hook (postStart -> exec -> mongosh /
rs.initiate) can allow the readinessProbe to pass before rs.initiate() completes
and also hides real errors via redirection and "|| true"; fix by moving the
replica set initialization into an initContainer (e.g., add an initContainers
entry that waits for mongosh ping then runs rs.status()/rs.initiate without
swallowing errors) or, if keeping postStart, change the readinessProbe to check
replica set health (e.g., use mongosh --eval "rs.status().ok" or equivalent) and
remove the broad error suppression so failures surface; update references to
postStart, rs.initiate, mongosh and readinessProbe accordingly.
until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
  --quiet 2>/dev/null; do
  sleep 1
done
Add timeout to prevent infinite loop.
The until loop lacks a timeout and could hang indefinitely if MongoDB never becomes available, preventing the container from starting properly. Consider adding a timeout mechanism or retry limit.
⏱️ Proposed fix with timeout
- until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
- --quiet 2>/dev/null; do
- sleep 1
- done
+ TIMEOUT=60
+ COUNTER=0
+ until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \
+ --quiet 2>/dev/null; do
+ sleep 1
+ COUNTER=$((COUNTER + 1))
+ if [ $COUNTER -ge $TIMEOUT ]; then
+ echo "Timeout waiting for MongoDB to start"
+ exit 1
+ fi
+ done📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \ | |
| --quiet 2>/dev/null; do | |
| sleep 1 | |
| done | |
| TIMEOUT=60 | |
| COUNTER=0 | |
| until mongosh --host localhost:27017 --eval "db.runCommand('ping')" \ | |
| --quiet 2>/dev/null; do | |
| sleep 1 | |
| COUNTER=$((COUNTER + 1)) | |
| if [ $COUNTER -ge $TIMEOUT ]; then | |
| echo "Timeout waiting for MongoDB to start" | |
| exit 1 | |
| fi | |
| done |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/deployment/package-helm/templates/results-cache-statefulset.yaml`
around lines 50 - 53, The until loop that runs mongosh --host localhost:27017
--eval "db.runCommand('ping')" --quiet can hang indefinitely; modify the
container start script to add a retry limit or timeout: introduce a max_retries
counter (or use timeout command) and increment on each failed attempt, exit with
non-zero status and a clear error log if retries exceed the limit; update the
loop that currently uses "until mongosh ...; do sleep 1; done" to check the
counter and break/exit on timeout so the pod won't hang forever.
| "--loglevel", "WARNING", | ||
| "-Q", "query", | ||
| "-n", "query-worker" | ||
| "-n", "query-worker@$(HOSTNAME)" |
as pointed out by the rabbit, the proposed fix won't work. see kubernetes/kubernetes#57726 for details
| "--loglevel", "WARNING", | ||
| "-Q", "query", | ||
| "-n", "query-worker" | ||
| "-n", "query-worker@$(HOSTNAME)" |
according to https://docs.celeryq.dev/en/stable/userguide/workers.html#starting-the-worker
| "-n", "query-worker@$(HOSTNAME)" | |
| "-n", "query-worker@%h" |
| "--loglevel", "WARNING", | ||
| "-Q", "compression", | ||
| "-n", "compression-worker" | ||
| "-n", "compression-worker@$(HOSTNAME)" |
according to https://docs.celeryq.dev/en/stable/userguide/workers.html#starting-the-worker
| "-n", "compression-worker@$(HOSTNAME)" | |
| "-n", "compression-worker@%h" |
changes in this file seem not related to the PR?
@@ -1,6 +1,6 @@
  apiVersion: "v2"
  name: "clp"
  version: "0.3.2-dev.4"
the git branch's base seems to be out of date. please merge origin (y-scope)'s main branch into the PR source branch
…ox conflicts (fixes y-scope#2259) Use Celery's built-in `%h` hostname substitution instead of Kubernetes `$(HOSTNAME)` env var expansion (which doesn't work in command arrays, see kubernetes/kubernetes#57726). Each worker pod now registers as `compression-worker@<hostname>` or `query-worker@<hostname>`, preventing RabbitMQ pidbox queue conflicts when scaling to multiple replicas. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
b5a158e to 424e626
♻️ Duplicate comments (1)
tools/deployment/package-helm/Chart.yaml (1)
3-3: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Confirm chart version aligns with the intended release bump.
The PR objective states 0.3.2-dev.2, but this file is now 0.3.2-dev.5. If dev.5 is intentional, please sync the PR description/objective text to avoid release-tracking drift.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/deployment/package-helm/Chart.yaml` at line 3, The Chart.yaml version value ("version: \"0.3.2-dev.5\"") does not match the PR objective which states "0.3.2-dev.2"; either update the PR description/objective to state "0.3.2-dev.5" or change the Chart.yaml version back to "0.3.2-dev.2" so release-tracking stays in sync—locate the version line in Chart.yaml and make the version string consistent with the intended release bump.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@tools/deployment/package-helm/Chart.yaml`:
- Line 3: The Chart.yaml version value ("version: \"0.3.2-dev.5\"") does not
match the PR objective which states "0.3.2-dev.2"; either update the PR
description/objective to state "0.3.2-dev.5" or change the Chart.yaml version
back to "0.3.2-dev.2" so release-tracking stays in sync—locate the version line
in Chart.yaml and make the version string consistent with the intended release
bump.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 11a39afd-42c9-4ca6-9377-20999376b542
📒 Files selected for processing (3)
tools/deployment/package-helm/Chart.yaml
tools/deployment/package-helm/templates/compression-worker-deployment.yaml
tools/deployment/package-helm/templates/query-worker-deployment.yaml
junhaoliao
left a comment
got some valuable insights from @gibber9809 offline - let's update the docker-compose config to match the changes in the helm chart.
for the PR title, how about:
fix(deployment): Append hostname to Celery worker node names to prevent pidbox conflicts (fixes #2259).
| "--loglevel", "WARNING", | ||
| "-Q", "compression", | ||
| "-n", "compression-worker" | ||
| "-n", "compression-worker@%h" |
@gibber9809 and i discussed offline - let's update https://github.com/y-scope/clp/blob/aeb077ca4/tools/deployment/package/docker-compose-all.yaml#L329 to match this command argument
though, practically, that fix won't be enough to ensure uniqueness in the docker compose services:
1. we hardcode some service name at … ; then, if we ever scale the compression worker to multiple replicas, each replica reads the same hostname, which results in the same worker names across all replicas
2. we should avoid hardcoding the hostname then. that way, docker compose will assign numeric suffixes to the hostnames (e.g. compression-worker-1), ensuring uniqueness of pidbox names when used by Celery, while DNS can still resolve compression-worker to the different IPs of the replicas.
3. however, the reason we hardcoded the hostname for every docker-compose service was to ensure their hostnames match the "component names" defined in the Python CLP utils, e.g. … . those "component names" use _ as the separator, which differs from the docker-compose service names, where - is used. We should also calibrate the component names (as tracked by "Standardise service names to use hyphens instead of underscores for RFC1123 compliance" #1777; "refactor(deployment): Standardizes service names to use hyphens instead of underscores for RFC1123 compliance (resolves #1777)." #1785 has proposed a fix) to match the service names.

since satisfying 2 + 3 will involve changing a lot of code, i think simply aligning the command in docker-compose-all.yaml (e.g. change "compression-worker" -> "compression-worker@%h") is sufficient for the scope of this PR. we shall fix 2 + 3 separately.
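A rough sketch of what that alignment in docker-compose-all.yaml could look like (the service definition is abridged; the "# ..." comments stand in for existing fields and arguments, and only the final -n value changes):
services:
  compression-worker:
    # ... image, environment, volumes, etc. unchanged ...
    command: [
      # ... the existing celery arguments stay as they are ...
      "-Q", "compression",
      # align with the Helm template: let Celery substitute the container hostname
      "-n", "compression-worker@%h"
    ]
The same change would apply to the query-worker service's -n argument (query-worker@%h).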
that said, as pointed out by @gibber9809, the above proposal will only ensure uniqueness across different compression-worker replicas on the same host where the docker-compose project runs. (which is still valuable, as we can debug multi-worker issues without a multi-host cluster)
however, if the docker compose project is run across multiple hosts to form a cluster, hostnames can still collide. as proposed by #2274, maybe we should formally remove multi-host support in the docker compose flow
| "--loglevel", "WARNING", | ||
| "-Q", "query", | ||
| "-n", "query-worker" | ||
| "-n", "query-worker@%h" |
ditto - let's update https://github.com/y-scope/clp/blob/aeb077ca4/tools/deployment/package/docker-compose-all.yaml#L516 to match this command argument
Matches the Helm template change — ensures uniqueness per host when debugging multi-worker issues locally via docker-compose. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The compression-worker and query-worker node name changes belong in PR y-scope#2269, not this MongoDB replica set PR. Reverting to main. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/deployment/package/docker-compose-all.yaml`:
- Line 329: The Celery node name flags (-n "compression-worker@%h" and -n
"query-worker@%h") collide because the services set fixed hostnames (hostname:
"compression_worker" and hostname: "query_worker"); remove the fixed hostname
entries from the compression worker and query worker service definitions so %h
resolves to each container's unique hostname, or alternatively modify the
node-name argument to append a unique suffix (e.g., include the container ID or
an env-provided unique value) instead of relying on the static hostnames; update
the service configs that reference "-n", "compression-worker@%h" and "-n",
"query-worker@%h" accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 599721c4-90e9-4c4f-947c-52f5acb1002a
📒 Files selected for processing (1)
tools/deployment/package/docker-compose-all.yaml
| "-f", "/var/log/compression_worker/worker.log", | ||
| "-Q", "compression", | ||
| "-n", "compression-worker" | ||
| "-n", "compression-worker@%h" |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="tools/deployment/package/docker-compose-all.yaml"
echo "== Static worker hostnames =="
rg -n '^\s+hostname:\s+"(compression_worker|query_worker)"' "$file"
echo
echo "== Celery nodename args using %h =="
rg -n '^\s+"-n",\s+"(compression-worker|query-worker)@%h"' "$file"Repository: y-scope/clp
Length of output: 307
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="tools/deployment/package/docker-compose-all.yaml"
echo "=== Context around compression_worker (lines 290-340) ==="
sed -n '290,340p' "$file" | cat -n
echo
echo "=== Context around query_worker (lines 478-530) ==="
sed -n '478,530p' "$file" | cat -n
Repository: y-scope/clp
Length of output: 4809
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="tools/deployment/package/docker-compose-all.yaml"
echo "=== Full spider-compression-worker service definition (lines 334-380) ==="
sed -n '334,380p' "$file" | cat -n
Repository: y-scope/clp
Length of output: 2444
@%h is non-unique when service hostnames are fixed.
Lines 329 and 516 use %h, but this file hard-codes hostname: "compression_worker" (line 298) and hostname: "query_worker" (line 486). When either worker service is scaled to multiple replicas, all instances resolve to identical node names (e.g., compression-worker@compression_worker), which causes Celery pidbox conflicts.
Remove the fixed hostname values on these services so %h resolves to the unique container hostname, or append a unique suffix independent of the service hostname.
Also applies to: 516-516
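If the first option were taken, the change could look roughly like this (a sketch only; "..." placeholders stand for the existing, unchanged arguments and fields):
services:
  compression-worker:
    # hostname: "compression_worker"   # removed: a fixed hostname makes %h
    #                                  # identical across all scaled replicas
    command: [
      "...",                           # existing celery arguments, unchanged
      "-n", "compression-worker@%h"    # unique per replica once Docker assigns
    ]                                  # each container its own hostname
  query-worker:
    # hostname: "query_worker"         # removed for the same reason
    command: [
      "...",
      "-n", "query-worker@%h"
    ]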
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tools/deployment/package/docker-compose-all.yaml` at line 329, The Celery
node name flags (-n "compression-worker@%h" and -n "query-worker@%h") collide
because the services set fixed hostnames (hostname: "compression_worker" and
hostname: "query_worker"); remove the fixed hostname entries from the
compression worker and query worker service definitions so %h resolves to each
container's unique hostname, or alternatively modify the node-name argument to
append a unique suffix (e.g., include the container ID or an env-provided unique
value) instead of relying on the static hostnames; update the service configs
that reference "-n", "compression-worker@%h" and "-n", "query-worker@%h"
accordingly.
this is a known limitation - discussed at https://github.com/y-scope/clp/pull/2269/changes#r3227145236 already
we will not attempt to address the limitation in this PR. @coderabbitai please help create an issue to track
🧩 Analysis chain
Sure! Let me create a GitHub issue to track this.
🏁 Script executed:
gh issue create \
--repo y-scope/clp \
--title "fix(deployment): Make docker-compose worker Celery node names unique when scaling replicas" \
--body "## Summary
In \`tools/deployment/package/docker-compose-all.yaml\`, the \`compression-worker\` and \`query-worker\` services use \`-n compression-worker@%h\` and \`-n query-worker@%h\` respectively for Celery node names. However, both services also set fixed \`hostname\` values (\`compression_worker\` and \`query_worker\`). This means \`%h\` resolves to the same static hostname for all replicas of a given service, so scaling either worker to multiple replicas still produces colliding Celery node names (e.g., all replicas become \`compression-worker@compression_worker\`), causing RabbitMQ pidbox conflicts.
## Context
- This limitation was identified during review of PR `#2269` (which fixed the equivalent issue in the Helm deployment by using \`%h\`, where each pod gets a unique hostname).
- In docker-compose, unlike Kubernetes pods, containers of a scaled service share the fixed \`hostname\` defined in the service config, so \`%h\` does not provide uniqueness.
- The review discussion: https://github.com/y-scope/clp/pull/2269#discussion_r3231983691
## Possible Approaches
- Remove the fixed \`hostname\` entries from the \`compression-worker\` and \`query-worker\` service definitions so Docker assigns a unique container hostname, allowing \`%h\` to resolve uniquely per replica.
- Alternatively, derive a unique suffix from an environment variable (e.g., container ID) rather than relying on \`%h\`.
## Impact
Affects users who scale \`compression-worker\` or \`query-worker\` to more than one replica in the docker-compose deployment. The symptom is periodic worker crashes due to \`W_PIDBOX_IN_USE\` conflicts (same root cause as issue `#2259`).
---
Requested by `@junhaoliao` — related to PR `#2269` and comment https://github.com/y-scope/clp/pull/2269#discussion_r3231983691"
Length of output: 43
@junhaoliao, the follow-up issue has been created: #2279
It captures the limitation that fixed hostname values in the docker-compose services prevent %h from resolving to unique names when scaling replicas, along with the possible approaches to address it in a future PR.
Summary
- Adds $(HOSTNAME) to the Celery -n (node name) flag in both compression-worker and query-worker Helm deployment templates, making each pod register with a unique node name (e.g., compression-worker@clp-compression-worker-abc123)
Fixes #2259
Test plan
- Set compressionWorker.replicas > 1 and verify each pod registers a unique Celery node name in RabbitMQ
- celery inspect active shows distinct node names per pod
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com
Summary by CodeRabbit
Fixes #2259