
temp #2277

Closed
Nathan903 wants to merge 2 commits into y-scope:main from Nathan903:dev

Conversation

@Nathan903
Contributor

@Nathan903 Nathan903 commented May 14, 2026

Summary by CodeRabbit

  • Documentation

    • Added a comprehensive telemetry implementation plan covering design, rollout phases, testing, and docs.
  • Chores / Deployment

    • Added a local-dev telemetry collector and production collector configuration.
    • Wired telemetry environment variables into service deployment manifests and added a telemetry profile for compose-based runs.
  • Infrastructure

    • Introduced an OTLP-compatible collector service for local and packaged deployments.


@Nathan903 Nathan903 requested a review from a team as a code owner May 14, 2026 02:16
@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6fabed98-fa3a-4627-a086-8e1808080f37

📥 Commits

Reviewing files that changed from the base of the PR and between 9b1cea7 and c3a7abf.

📒 Files selected for processing (8)
  • components/clp-package-utils/clp_package_utils/controller.py
  • tools/deployment/package/docker-compose-all.yaml
  • tools/deployment/package/docker-compose-base.yaml
  • tools/deployment/package/docker-compose-spider-base.yaml
  • tools/deployment/package/docker-compose-spider.yaml
  • tools/deployment/package/docker-compose.dev.yaml
  • tools/deployment/package/docker-compose.yaml
  • tools/deployment/package/otel-collector-config.yaml

Walkthrough

Adds a TELEMETRY_IMPLEMENTATION_PLAN document; wires OpenTelemetry env vars and a local OTEL Collector into Docker Compose (with a dev override); updates the controller to emit OTEL resource attributes and telemetry endpoint handling; and adds an otel-collector config for OTLP/HTTP metrics export.

Changes

Telemetry Implementation Plan & Deployment

| Layer / File(s) | Summary |
|---|---|
| Implementation plan document<br>`TELEMETRY_IMPLEMENTATION_PLAN.md` | Adds a seven-phase telemetry implementation plan covering current state, architectural decisions, dependency/init helpers, client-side collector config, multi-service instrumentation, deployment wiring, server-side infra, docs, and verification. |
| Controller telemetry env wiring<br>`components/clp-package-utils/clp_package_utils/controller.py` | Records `instance_id`, detects OS/arch and VERSION, sets `OTEL_RESOURCE_ATTRIBUTES`, `OTEL_EXPORTER_OTLP_ENDPOINT`, and `TELEMETRY_ENDPOINT` or `CLP_DISABLE_TELEMETRY`, and adds `--profile telemetry` to the compose start command when telemetry is enabled. |
| Compose: add OTEL envs + collector service<br>`tools/deployment/package/docker-compose-all.yaml` | Adds `OTEL_EXPORTER_OTLP_ENDPOINT`, `OTEL_RESOURCE_ATTRIBUTES`, `OTEL_SERVICE_NAME`, and `CLP_DISABLE_TELEMETRY` to many services and introduces an `otel-collector` service under the `telemetry` profile with a config mount and `TELEMETRY_ENDPOINT`. |
| Compose: extends & dev override<br>`tools/deployment/package/docker-compose-base.yaml`, `docker-compose-spider-base.yaml`, `docker-compose-spider.yaml`, `docker-compose.dev.yaml`, `docker-compose.yaml` | Adds `otel-collector` extends entries across the base and spider compose files; adds a development override exposing port 4318 for local debugging. |
| OTEL Collector config<br>`tools/deployment/package/otel-collector-config.yaml` | Adds an OTel Collector configuration with an OTLP/HTTP receiver on `0.0.0.0:4318`, a batch processor, an `otlphttp` exporter using `${TELEMETRY_ENDPOINT}`, and a metrics pipeline wiring receiver → processor → exporter. |
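As a rough sketch of the controller-side wiring summarized in the table above (the helper name, the `clp.instance.id` key, and the in-network collector address are assumptions, not the PR's actual code):

```python
# Hypothetical sketch of the env wiring described above, not the PR's controller.py.
import platform
import uuid


def build_telemetry_env(version: str, disable: bool, endpoint: str) -> dict[str, str]:
    """Assemble the OTel-related variables handed to docker compose."""
    if disable:
        return {"CLP_DISABLE_TELEMETRY": "1"}
    attributes = {
        "service.version": version,
        "clp.instance.id": str(uuid.uuid4()),  # assumed attribute key
        "os.type": platform.system().lower(),
        "host.arch": platform.machine(),
    }
    return {
        "OTEL_RESOURCE_ATTRIBUTES": ",".join(f"{k}={v}" for k, v in attributes.items()),
        "OTEL_EXPORTER_OTLP_ENDPOINT": "http://otel-collector:4318",  # assumed in-network address
        "TELEMETRY_ENDPOINT": endpoint,
    }
```

A caller would merge the returned mapping into the compose environment and append `--profile telemetry` to the compose invocation when telemetry is enabled.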

Sequence Diagram(s)

(none)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • y-scope/clp#2251: Overlaps with prior telemetry wiring changes to components/clp-package-utils/clp_package_utils/controller.py (instance ID and OpenTelemetry env vars).

Suggested reviewers

  • junhaoliao
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title 'temp' is vague and generic, providing no meaningful information about the changeset's purpose or content. | Replace 'temp' with a descriptive title summarizing the main change, such as 'Add telemetry implementation plan' or 'Document CLP telemetry design and rollout phases'. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 Ruff (0.15.12)
components/clp-package-utils/clp_package_utils/controller.py

Unexpected Ruff issue shape at index 15

🔧 Trivy (0.69.3)

Trivy execution failed: 2026-05-14T02:26:53Z FATAL Fatal error run error: fs scan error: scan error: scan failed: failed analysis: post analysis error: post analysis error: cloudformation scan error: fs filter error: fs filter error: walk error range error: stat .coderabbit-opengrep-fallback.yml: no such file or directory: range error: stat .coderabbit-opengrep-fallback.yml: no such file or directory


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

♻️ Duplicate comments (1)
TELEMETRY_IMPLEMENTATION_PLAN.md (1)

685-691: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

PII verification contradicts service.instance.id auto-population.

Step 7.6 requires verification that "No hostnames appear", but this contradicts AD-4 (lines 33-35), which allows the SDK to auto-populate service.instance.id with "hostname + PID". While Docker container hostnames are random, Kubernetes pod names follow predictable patterns (e.g., log-ingestor-7d8f9c6b-xkz2p) that might encode deployment metadata.

Recommendation: Add a specific PII check for service.instance.id format across all SDKs in use (Rust and Python). Verify that:

  • Docker: container hostnames are truly random (not derived from image names or config)
  • Kubernetes: pod names don't leak sensitive information (tenant IDs, user data, etc.)
  • Or override service.instance.id with a per-process random UUID to eliminate this class of risk entirely
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TELEMETRY_IMPLEMENTATION_PLAN.md` around lines 685 - 691, Update the
telemetry PII verification to explicitly cover the auto-populated OpenTelemetry
resource attribute service.instance.id referenced in AD-4 and Step 7.6: add a
named check that scans OTLP payloads for service.instance.id values from both
Rust and Python SDKs, validate that Docker container hostnames are random and
Kubernetes pod names do not contain tenant/user identifiers, and provide an
explicit remediation path to override service.instance.id with a per-process
random UUID (or other stable non-identifying token) when the value appears
predictable or contains sensitive substrings; ensure the plan text references
the checks by name ("service.instance.id PII check") and lists the SDKs (Rust,
Python) so reviewers can locate and run the verification.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@TELEMETRY_IMPLEMENTATION_PLAN.md`:
- Around line 305-306: The document is ambiguous about which topology-gauge
emission method to use; pick the Python-script approach and document its
location and invocation: add a new script
tools/deployment/package/emit_topology_gauges.py that reads
CLP_*_WORKER_CONCURRENCY and replica info from the controller environment and
pushes OTLP metrics, and update the controller startup flow (invoke
emit_topology_gauges.py from start_clp.py after all services are started) so the
controller emits the topology gauges once at startup; mention the script name,
its purpose, and the exact call site in start_clp.py in the
TELEMETRY_IMPLEMENTATION_PLAN.md text.
- Line 536: Clarify TTL precedence between the ClickHouse exporter config's ttl
parameter and the pre-created table TTL: update the
TELEMETRY_IMPLEMENTATION_PLAN.md text around the ClickHouse exporter config
entry (`ttl: 90d`) and the optional pre-created schema TTL clause (`TTL ... +
INTERVAL 90 DAY`) to state that table-level TTL takes precedence, and instruct
users to remove or comment out the `ttl` field from the exporter config when
using the pre-created schema (Step 5.3), or alternatively note that the exporter
`ttl` only applies when tables are auto-created.
- Line 95: Clarify the plan by removing the "optionally add `service.version` to
the resource" guidance and state that the init function should not set
`service.version` because the controller will always set it via
`OTEL_RESOURCE_ATTRIBUTES` (reference `service.version`,
`OTEL_RESOURCE_ATTRIBUTES`, "init function", and "controller" / "Phase 4.1" to
locate the text to change); update the sentence at line 95 to explicitly say the
controller is responsible for `service.version` and eliminate any suggestion the
init function may add it to avoid duplicate assignment.
- Line 93: Clarify that telemetry disable precedence uses OR logic: if either
the CLP_DISABLE_TELEMETRY environment variable or telemetry_config.disable is
true then telemetry is disabled and a no-op MeterProvider is returned; update
the init function's signature comment (the function that checks
CLP_DISABLE_TELEMETRY and telemetry_config.disable) to explicitly state this
priority chain and that either flag will disable telemetry, mirroring the
consent priority style on line 7.
- Around line 33-35: There is a contradiction between AD-4 (which says SDKs may
set service.instance.id using hostname+PID) and Phase 7.6 PII verification
(which forbids hostnames); resolve it by choosing one of two fixes: (A)
explicitly document in AD-4 that the service.instance.id will be overridden at
process start with a random UUID (per-process) to eliminate hostname leakage and
add a note to Phase 7.6 that service.instance.id is safe because it’s
UUID-generated, or (B) update Phase 7.6 to explicitly allow SDK-populated
container hostnames (e.g., ephemeral Docker hostnames) while adding a mandatory
check that Kubernetes pod names are treated as potential PII and must be
redacted/verified; update the AD-4 wording to state that SDK-populated values
are acceptable only for ephemeral container hostnames and not for pod names.
Ensure you edit the AD-4 section header and the Phase 7.6 PII verification text
so both refer to service.instance.id consistently and include the chosen policy
(UUID override OR hostname exception + pod-name verification).
- Around line 605-609: Update Step 5.6 to require a lightweight authentication
mechanism instead of "no authentication + rate limiting": specify generating a
deployment-specific API key during installation, storing it in the collector
config, and requiring clients to send it in an Authorization header (or
`X-API-Key`) which the collector validates (replace the current suggestion to
rely only on the `ratelimit` processor and optional `basicauth`/`headers`
checks); keep IP rate limiting (the `ratelimit` processor) as a secondary
defense and note that mTLS remains unnecessary for Docker Compose clients.
- Line 447: The template sets os.type using the wrong Helm variable
(.Capabilities.KubeVersion); change the assignment so os.type is a literal
"linux" (remove the .Capabilities.KubeVersion usage and the | default "linux"
fallback) or wire it to a new values.yaml key (e.g., values.osType) if you need
to support Windows node pools; update the template reference to use that value
(values.osType) and document the new values.yaml entry.
- Line 159: Update the environment variable expansion in the OpenTelemetry
Collector config: locate the line containing "endpoint: ${TELEMETRY_ENDPOINT}"
and change the substitution to include the OTel env prefix so it becomes
"endpoint: ${env:TELEMETRY_ENDPOINT}" to ensure Collector v0.96.0 expands the
TELEMETRY_ENDPOINT variable at load time.
- Line 52: The plan incorrectly assumes reqwest is already in-tree; update
TELEMETRY_IMPLEMENTATION_PLAN.md to state that aws-sdk-s3 and aws-sdk-sqs use
hyper, not reqwest, so using http-proto + reqwest-client would require adding
reqwest as an explicit dependency; then revise the decision rationale for
choosing http-proto + reqwest-client vs gRPC (tonic + prost + protoc) to reflect
this added dependency cost and present the alternative options (add reqwest
explicitly, use a hyper-based HTTP client to avoid new deps, or opt for gRPC
stack) and cite the package names http-proto, reqwest-client, reqwest, tonic,
prost, protoc, aws-sdk-s3 and aws-sdk-sqs to make the change discoverable.

---

Duplicate comments:
In `@TELEMETRY_IMPLEMENTATION_PLAN.md`:
- Around line 685-691: Update the telemetry PII verification to explicitly cover
the auto-populated OpenTelemetry resource attribute service.instance.id
referenced in AD-4 and Step 7.6: add a named check that scans OTLP payloads for
service.instance.id values from both Rust and Python SDKs, validate that Docker
container hostnames are random and Kubernetes pod names do not contain
tenant/user identifiers, and provide an explicit remediation path to override
service.instance.id with a per-process random UUID (or other stable
non-identifying token) when the value appears predictable or contains sensitive
substrings; ensure the plan text references the checks by name
("service.instance.id PII check") and lists the SDKs (Rust, Python) so reviewers
can locate and run the verification.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: e53a7697-5308-4884-8b3d-91861ea80b12

📥 Commits

Reviewing files that changed from the base of the PR and between 8977316 and 9b1cea7.

📒 Files selected for processing (1)
  • TELEMETRY_IMPLEMENTATION_PLAN.md

Comment on lines +33 to +35
### AD-4: Let the OTel SDK auto-populate `service.instance.id`

**Why:** Some SDKs auto-assign this (hostname + PID). It lets the backend count unique running instances, distinguishing "2 pods each with concurrency 4" from "1 pod with concurrency 8." Don't override it manually.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Clarify hostname inclusion in service.instance.id and PII implications.

AD-4 states that some SDKs auto-assign service.instance.id using "hostname + PID", but line 690 in Phase 7.6 requires verification that "No hostnames appear" in resource attributes, and line 733 shows service.instance.id is SDK-populated. This is contradictory.

Recommendation: Either (1) explicitly override service.instance.id with a random UUID per process to guarantee no hostname leakage, or (2) update the PII verification section (7.6) to clarify that container hostnames (which are random in Docker) are acceptable, and document that Kubernetes pod names must be verified as non-identifying.
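If option (1) is chosen, the override is only a few lines in the Python SDK; the provider setup below is a minimal sketch, not the plan's actual init code:

```python
# Sketch of option (1): pin service.instance.id to a per-process random UUID so no
# hostname or pod name can leak into resource attributes.
import uuid

from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.resources import Resource

# Resource.create() still merges OTEL_RESOURCE_ATTRIBUTES / OTEL_SERVICE_NAME from the
# environment; the explicit key below takes precedence over env-provided values.
resource = Resource.create({"service.instance.id": str(uuid.uuid4())})
provider = MeterProvider(resource=resource)  # readers/exporters omitted for brevity
```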

🧰 Tools
🪛 LanguageTool

[misspelling] ~33-~33: This word is normally spelled as one.
Context: ...equivalent. ### AD-4: Let the OTel SDK auto-populate service.instance.id Why: Some SD...

(EN_COMPOUNDS_AUTO_POPULATE)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TELEMETRY_IMPLEMENTATION_PLAN.md` around lines 33 - 35, There is a
contradiction between AD-4 (which says SDKs may set service.instance.id using
hostname+PID) and Phase 7.6 PII verification (which forbids hostnames); resolve
it by choosing one of two fixes: (A) explicitly document in AD-4 that the
service.instance.id will be overridden at process start with a random UUID
(per-process) to eliminate hostname leakage and add a note to Phase 7.6 that
service.instance.id is safe because it’s UUID-generated, or (B) update Phase 7.6
to explicitly allow SDK-populated container hostnames (e.g., ephemeral Docker
hostnames) while adding a mandatory check that Kubernetes pod names are treated
as potential PII and must be redacted/verified; update the AD-4 wording to state
that SDK-populated values are acceptable only for ephemeral container hostnames
and not for pod names. Ensure you edit the AD-4 section header and the Phase 7.6
PII verification text so both refer to service.instance.id consistently and
include the chosen policy (UUID override OR hostname exception + pod-name
verification).

opentelemetry-otlp = { version = "0.27", features = ["metrics", "http-proto", "reqwest-client"] }
```

We use `http-proto` + `reqwest-client`. `reqwest` is already in the dependency tree via AWS SDKs. This avoids adding `tonic` + `prost` + `protoc`, which the gRPC transport would require.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check if reqwest is already in the Rust dependency tree

# Find all Cargo.toml files and check for reqwest or AWS SDK dependencies
fd -e toml Cargo.toml --exec cat {} \; | rg -i 'reqwest|aws-sdk'

# Also check Cargo.lock if it exists
if [ -f Cargo.lock ]; then
  rg -i '^name = "reqwest"' Cargo.lock -A 1
fi

Repository: y-scope/clp

Length of output: 170


🏁 Script executed:

# Search for reqwest in the entire dependency tree (Cargo.lock)
if [ -f Cargo.lock ]; then
  echo "=== Checking Cargo.lock for reqwest ==="
  rg 'name = "reqwest"' Cargo.lock -A 3
  echo ""
  echo "=== Total dependency count in Cargo.lock ==="
  rg '^name = "' Cargo.lock | wc -l
fi

Repository: y-scope/clp

Length of output: 145


🏁 Script executed:

# Also check all Cargo.toml files to get a complete picture
echo "=== All Cargo.toml files in the repo ==="
fd -e toml -path "*/Cargo.toml" | head -20

Repository: y-scope/clp

Length of output: 283


🌐 Web query:

aws-sdk-s3 reqwest dependency

💡 Result:

The aws-sdk-s3 Rust crate does not directly depend on reqwest. Its dependencies include aws-smithy-client, aws-runtime, hyper (via smithy-http-client), rustls (default), and tokio, but reqwest is absent from Cargo.toml listings and dependency graphs across versions like 1.98.0 [1], 1.132.0 [2], 1.123.0 [3], and source Cargo.toml [4][5]. Official AWS docs confirm default HTTP client uses hyper + rustls + aws-lc-rs [6]. Reqwest appears in other S3 crates like rust-s3 [7][8] or as optional integration via aws-smithy-http-client-reqwest [9][10], but not as a required dependency for aws-sdk-s3. To use reqwest with aws-sdk-s3, provide a custom HttpClient via aws_config [9].


🏁 Script executed:

# Search for explicit reqwest dependency in all Cargo.toml files
find . -name "Cargo.toml" -type f -exec rg 'reqwest' {} + 2>/dev/null | head -20

Repository: y-scope/clp

Length of output: 37


The assumption that reqwest is already a dependency via AWS SDKs is incorrect.

reqwest is not present in the project's dependency tree. The AWS SDKs (aws-sdk-s3 v1.121.0 and aws-sdk-sqs v1.92.0) use hyper as their default HTTP client, not reqwest. If the implementation uses http-proto + reqwest-client, reqwest would need to be added as an explicit new dependency. This changes the cost-benefit analysis for choosing this approach over gRPC alternatives—the justification for "avoiding adding tonic + prost + protoc" no longer holds if reqwest must be introduced regardless.

Reconsider whether http-proto + reqwest-client remains the optimal choice, or whether the added dependency cost is acceptable for the benefits it provides.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TELEMETRY_IMPLEMENTATION_PLAN.md` at line 52, The plan incorrectly assumes
reqwest is already in-tree; update TELEMETRY_IMPLEMENTATION_PLAN.md to state
that aws-sdk-s3 and aws-sdk-sqs use hyper, not reqwest, so using http-proto +
reqwest-client would require adding reqwest as an explicit dependency; then
revise the decision rationale for choosing http-proto + reqwest-client vs gRPC
(tonic + prost + protoc) to reflect this added dependency cost and present the
alternative options (add reqwest explicitly, use a hyper-based HTTP client to
avoid new deps, or opt for gRPC stack) and cite the package names http-proto,
reqwest-client, reqwest, tonic, prost, protoc, aws-sdk-s3 and aws-sdk-sqs to
make the change discoverable.

```

The init function should:
1. Check `CLP_DISABLE_TELEMETRY` env var and `telemetry_config.disable` — if either is true, return a no-op `MeterProvider` (all metric operations become no-ops with zero overhead)

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Clarify precedence when both disable mechanisms are set.

The init function checks both CLP_DISABLE_TELEMETRY env var and telemetry_config.disable. If they conflict (e.g., env var says disable but config says enable), which takes precedence? Document the priority chain explicitly, similar to the consent priority chain mentioned in line 7.

Recommendation: Use OR logic (disable if either is true) and document this in a comment within the function signature block.
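A sketch of that OR logic, assuming a hypothetical `init_telemetry` helper and a `telemetry_config` object with a `disable` field:

```python
# Hypothetical init sketch: either flag disables telemetry (OR logic).
import os

from opentelemetry.metrics import MeterProvider, NoOpMeterProvider
from opentelemetry.sdk.metrics import MeterProvider as SdkMeterProvider


def init_telemetry(telemetry_config) -> MeterProvider:
    """Disable telemetry if CLP_DISABLE_TELEMETRY is truthy OR telemetry_config.disable is set."""
    env_disabled = os.environ.get("CLP_DISABLE_TELEMETRY", "").lower() in ("1", "true", "yes")
    if env_disabled or telemetry_config.disable:
        return NoOpMeterProvider()  # every metric operation becomes a no-op
    return SdkMeterProvider()  # readers/exporters configured elsewhere
```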

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TELEMETRY_IMPLEMENTATION_PLAN.md` at line 93, Clarify that telemetry disable
precedence uses OR logic: if either the CLP_DISABLE_TELEMETRY environment
variable or telemetry_config.disable is true then telemetry is disabled and a
no-op MeterProvider is returned; update the init function's signature comment
(the function that checks CLP_DISABLE_TELEMETRY and telemetry_config.disable) to
explicitly state this priority chain and that either flag will disable
telemetry, mirroring the consent priority style on line 7.

The init function should:
1. Check `CLP_DISABLE_TELEMETRY` env var and `telemetry_config.disable` — if either is true, return a no-op `MeterProvider` (all metric operations become no-ops with zero overhead)
2. Build a `Resource` — the SDK automatically reads `OTEL_RESOURCE_ATTRIBUTES` and `OTEL_SERVICE_NAME` env vars, so no manual attribute construction is needed
3. Optionally add `service.version` to the resource (from the CLP version constant if available, or let the controller set it via `OTEL_RESOURCE_ATTRIBUTES`)

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Resolve inconsistency: who sets service.version?

Line 95 says "Optionally add service.version to the resource", but line 368 in Phase 4.1 shows the controller always sets service.version via OTEL_RESOURCE_ATTRIBUTES. This is inconsistent.

Recommendation: Clarify that the init function does NOT need to add service.version because the controller will always set it via OTEL_RESOURCE_ATTRIBUTES. Remove the "optionally add" guidance to prevent duplicate attribute assignment.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TELEMETRY_IMPLEMENTATION_PLAN.md` at line 95, Clarify the plan by removing
the "optionally add `service.version` to the resource" guidance and state that
the init function should not set `service.version` because the controller will
always set it via `OTEL_RESOURCE_ATTRIBUTES` (reference `service.version`,
`OTEL_RESOURCE_ATTRIBUTES`, "init function", and "controller" / "Phase 4.1" to
locate the text to change); update the sentence at line 95 to explicitly say the
controller is responsible for `service.version` and eliminate any suggestion the
init function may add it to avoid duplicate assignment.


exporters:
  otlphttp:
    endpoint: ${TELEMETRY_ENDPOINT}

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

OpenTelemetry Collector 0.96.0 environment variable expansion syntax

💡 Result:

OpenTelemetry Collector (including v0.96.0) supports the following environment-variable expansion syntaxes in the Collector YAML configuration: 1) Braces (env provider) syntax - ${ENV} - Example: key: ${env:DB_KEY} [1] 2) Env-provider syntax - ${env:ENV} - Example: key: ${env:DB_KEY} [1] Both ${ENV} and ${env:ENV} use the same underlying env-provider behavior in the Collector configuration. [2] 3) Defaults (bash-like) syntax - ${env:DB_KEY:-some-default-var} sets a default when DB_KEY is unset. [1] 4) Literal '$' escaping - Use $$ to output a literal $ and prevent ${...} from being treated as an environment-variable reference. Example: namespace: $$DataVisualization [1] 5) Naked syntax (documented in Collector RFC) - $ENV is listed as a “naked syntax” option. [2] Notes - Environment variable expansion is performed by the Collector during configuration processing (effectively at startup/config load time), so changing environment variables afterward requires restarting/reloading configuration. [3] If you share a specific config snippet (especially one involving concatenation like ${A}:${B} or nested expressions), I can validate the exact expected parsing for that case.


Update environment variable substitution syntax to use the correct OTel Collector format.

Line 159 uses ${TELEMETRY_ENDPOINT}, but OpenTelemetry Collector v0.96.0 requires the env: prefix for environment variable expansion. Change to ${env:TELEMETRY_ENDPOINT} to ensure the collector properly expands the environment variable during configuration load.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TELEMETRY_IMPLEMENTATION_PLAN.md` at line 159, Update the environment
variable expansion in the OpenTelemetry Collector config: locate the line
containing "endpoint: ${TELEMETRY_ENDPOINT}" and change the substitution to
include the OTel env prefix so it becomes "endpoint: ${env:TELEMETRY_ENDPOINT}"
to ensure Collector v0.96.0 expands the TELEMETRY_ENDPOINT variable at load
time.

Comment on lines +305 to +306
**Docker Compose:** The controller emits topology gauges once at startup. It already knows the concurrency values (it sets `CLP_*_WORKER_CONCURRENCY` in `.env`) and replica counts (effectively 1 per container in Docker Compose — scaling means running multiple containers). The simplest approach is to have the controller write a small Python script or use the `curl`-based OTLP metric push to emit these gauges once after starting all services.


🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Specify the implementation approach for topology gauge emission.

The text says "The simplest approach is to have the controller write a small Python script or use the curl-based OTLP metric push" but doesn't specify which approach to use. This creates implementation ambiguity.

Recommendation: Choose one approach and document it clearly. The Python script approach is more maintainable and testable. Specify where the script should live (e.g., tools/deployment/package/emit_topology_gauges.py) and when it should be called (e.g., after all services start in start_clp.py).
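If the Python-script route is taken, a sketch of what such a script could contain follows; the file name, metric name, and environment-variable scan are assumptions drawn from this comment, not existing code:

```python
# Hypothetical tools/deployment/package/emit_topology_gauges.py: push worker-concurrency
# gauges once at startup via OTLP/HTTP, then exit.
import os

from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter
from opentelemetry.metrics import Observation
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader


def _observe_concurrency(_options):
    # One data point per CLP_*_WORKER_CONCURRENCY variable set by the controller.
    for key, value in os.environ.items():
        if key.startswith("CLP_") and key.endswith("_WORKER_CONCURRENCY"):
            yield Observation(int(value), {"worker": key})


def main() -> None:
    exporter = OTLPMetricExporter()  # honours OTEL_EXPORTER_OTLP_ENDPOINT
    provider = MeterProvider(metric_readers=[PeriodicExportingMetricReader(exporter)])
    meter = provider.get_meter("clp.topology")
    meter.create_observable_gauge("clp.worker.concurrency", callbacks=[_observe_concurrency])
    provider.force_flush()  # collect and export once
    provider.shutdown()


if __name__ == "__main__":
    main()
```

The controller's startup entry point (start_clp.py, per this comment) would invoke the script once after all services are up.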

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TELEMETRY_IMPLEMENTATION_PLAN.md` around lines 305 - 306, The document is
ambiguous about which topology-gauge emission method to use; pick the
Python-script approach and document its location and invocation: add a new
script tools/deployment/package/emit_topology_gauges.py that reads
CLP_*_WORKER_CONCURRENCY and replica info from the controller environment and
pushes OTLP metrics, and update the controller startup flow (invoke
emit_topology_gauges.py from start_clp.py after all services are started) so the
controller emits the topology gauges once at startup; mention the script name,
its purpose, and the exact call site in start_clp.py in the
TELEMETRY_IMPLEMENTATION_PLAN.md text.

service.version={{ .Chart.AppVersion }},
clp.deployment.method=helm,
clp.storage.engine={{ .Values.clpConfig.package.storageEngine }},
os.type={{ .Capabilities.KubeVersion | default "linux" }},

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Critical: os.type uses incorrect Helm template variable.

Line 447 uses .Capabilities.KubeVersion for os.type, but KubeVersion returns the Kubernetes version (e.g., "1.28"), not the operating system type.

Correct approach: For Kubernetes deployments, the OS type should be hardcoded to "linux" (all production Kubernetes nodes run Linux), or use a new value in values.yaml if Windows node pools are supported. Remove the | default "linux" fallback since it won't trigger.

🔧 Proposed fix
-os.type={{ .Capabilities.KubeVersion | default "linux" }},
+os.type=linux,

Or if Windows nodes are possible:

-os.type={{ .Capabilities.KubeVersion | default "linux" }},
+os.type={{ .Values.hostOS | default "linux" }},
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
os.type={{ .Capabilities.KubeVersion | default "linux" }},
os.type=linux,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TELEMETRY_IMPLEMENTATION_PLAN.md` at line 447, The template sets os.type
using the wrong Helm variable (.Capabilities.KubeVersion); change the assignment
so os.type is a literal "linux" (remove the .Capabilities.KubeVersion usage and
the | default "linux" fallback) or wire it to a new values.yaml key (e.g.,
values.osType) if you need to support Windows node pools; update the template
reference to use that value (values.osType) and document the new values.yaml
entry.

clickhouse:
  endpoint: tcp://clickhouse:9000
  database: clp_telemetry
  ttl: 90d

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Clarify TTL configuration when using pre-created schema.

The ClickHouse exporter config sets ttl: 90d (line 536) and the optional pre-created schema also sets TTL ... + INTERVAL 90 DAY (line 576). If the table is pre-created with a TTL, the exporter's TTL parameter may be ignored.

Recommendation: Document that if using the pre-created schema (Step 5.3), the ttl parameter in the exporter config should be removed or commented out, as the table-level TTL takes precedence. Alternatively, specify that the exporter TTL only applies when auto-creating tables.

Also applies to: 576-576

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TELEMETRY_IMPLEMENTATION_PLAN.md` at line 536, Clarify TTL precedence between
the ClickHouse exporter config's ttl parameter and the pre-created table TTL:
update the TELEMETRY_IMPLEMENTATION_PLAN.md text around the ClickHouse exporter
config entry (`ttl: 90d`) and the optional pre-created schema TTL clause (`TTL
... + INTERVAL 90 DAY`) to state that table-level TTL takes precedence, and
instruct users to remove or comment out the `ttl` field from the exporter config
when using the pre-created schema (Step 5.3), or alternatively note that the
exporter `ttl` only applies when tables are auto-created.

Comment on lines +605 to +609

Start with no authentication + rate limiting. The metrics are anonymous and non-sensitive — rate limiting per source IP (using the `ratelimit` processor) prevents abuse. If spam becomes a problem later, add an API key header (`X-API-Key`) that the collector validates via `basicauth` or a `headers` check.

mTLS is overkill for anonymous metrics and would complicate the Docker Compose client experience.


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Re-evaluate security posture for publicly exposed telemetry endpoint.

Step 5.6 proposes starting with "no authentication + rate limiting" for a publicly exposed HTTPS endpoint. While the metrics are anonymous, this creates several risks:

  • Spam/abuse: Malicious actors can flood the endpoint with fake metrics, inflating storage costs and polluting dashboards.
  • Data integrity: Without authentication, there's no guarantee metrics come from legitimate CLP deployments.
  • Resource exhaustion: Rate limiting by source IP can be bypassed with distributed attacks or cloud NAT.

Recommendation: Implement a lightweight authentication mechanism from the start (e.g., a deployment-specific API key generated during installation and stored in the config, sent via Authorization header). This is minimally more complex than no auth but significantly improves security posture. Rate limiting should be a secondary defense.
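As an illustration of the client half of such a scheme, the key can ride along as a header on every OTLP/HTTP export. In this design the forwarding would actually happen from the local collector's `otlphttp` exporter (which accepts an equivalent `headers` setting); the header name and `CLP_TELEMETRY_API_KEY` variable below are assumptions rather than existing CLP configuration:

```python
# Sketch: attach a per-deployment key to every OTLP/HTTP export (hypothetical names).
import os

from opentelemetry.exporter.otlp.proto.http.metric_exporter import OTLPMetricExporter

# The endpoint still comes from OTEL_EXPORTER_OTLP_ENDPOINT; only the auth header is added.
exporter = OTLPMetricExporter(
    headers={"X-API-Key": os.environ.get("CLP_TELEMETRY_API_KEY", "")},
)
```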

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@TELEMETRY_IMPLEMENTATION_PLAN.md` around lines 605 - 609, Update Step 5.6 to
require a lightweight authentication mechanism instead of "no authentication +
rate limiting": specify generating a deployment-specific API key during
installation, storing it in the collector config, and requiring clients to send
it in an Authorization header (or `X-API-Key`) which the collector validates
(replace the current suggestion to rely only on the `ratelimit` processor and
optional `basicauth`/`headers` checks); keep IP rate limiting (the `ratelimit`
processor) as a secondary defense and note that mTLS remains unnecessary for
Docker Compose clients.

@Nathan903 Nathan903 closed this May 14, 2026
