[Backport 7.80.x] [n8n] Fix metric mappings and add full v2 metric coverage #23733

Merged: AAraKKe merged 1 commit into 7.80.x from backport-23635-to-7.80.x on May 19, 2026

dd-octo-sts Bot (Contributor) commented May 19, 2026

Backport 57659b4 from #23635.


What does this PR do?

Overhauls the n8n integration's metric mapping, test environment, fixtures, and public documentation so the check matches what n8n actually emits across both supported major versions.

  • Metric map and metadata. Fixes the incorrect workflow_executions_duration_seconds mapping, removes mappings that n8n does not emit, restores valid families that the check was silently dropping (queue.job.dequeued, nodejs.active.requests), and adds the missing families verified live against n8n 1.118.1 and n8n 2.19.5.
  • n8n 2.x metric coverage. Adds the 2.x-only families: workflow.execution.duration.seconds.*, the audit.workflow.* lifecycle counters, embed.login.*, token.exchange.*, process.pss.bytes, the opt-in workflow_statistics_* gauges (workflows.total, users.total, production.executions, ...), and the new VM-isolated expression engine family (expression.evaluation.duration.seconds.*, expression.code.cache.*, expression.pool.*). metadata.csv calls out the version and env flag requirements for each.
  • Dynamic event-bus counters. Maps the broader n8n event-bus surface: roughly 70 dynamic counters covering audit (user, credentials, package, variable, execution data), AI node, runner, worker lifecycle, and workflow cancellation events. The test compose cannot realistically exercise these families (auth flows, AI invocations, package installs, etc.), so the new counters are added to the integration's rare-event set and excluded from the live symmetric-inclusion metadata assertion. The unit fixture carries synthetic samples so the metric map stays validated bidirectionally there. Counter names that n8n declares but its own prom-client exporter silently rejects (anything containing a hyphen, e.g. external-secrets, token-exchange, role-mapping, cluster.*) are intentionally not mapped and are noted as such in metrics.py and the README.
  • Worker scrape coverage. Documents and exercises scraping the n8n worker process as a separate Datadog instance. The conftest brings up both main and worker, tagged n8n_process:main / n8n_process:worker, so the worker-only families (node.started, node.finished, queue.job.dequeued, runner.task.requested) are covered end-to-end.
  • Test matrix and environment. Runs the integration in queue mode against both n8n 1.118.1 and 2.19.5 via the hatch matrix. The compose file pulls n8nio/n8n:${N8N_VERSION} directly, allocates dynamic host ports to avoid CI/local collisions, and uses port 5680 for the worker since n8n 2.x reserves 5679 for the task-runner broker.
  • Setup flow. Rebuilds dd_environment around WaitFor and CheckEndpoints conditions: wait for n8n's /healthz, import workflows via the n8n CLI, activate them, restart so webhooks register, then confirm both /metrics endpoints are reachable. Workflow IDs are read from the bind-mounted JSON, traffic generation runs in the fixture body, and the workflow-started wait raises with the last seen samples so timeouts are self-debugging.
  • Lab harness for dashboards and monitors. Adds tests/lab/ with its own docker-compose.yaml, traffic generator, sample workflows, and config so developers can run a long-lived n8n stack against a real Datadog Agent (autodiscovery for stdout logs, bind-mounted volume for the event-bus log files) and iterate on dashboards / monitors with real data. Gated on N8N_IS_LAB=true; integration tests are unaffected.
  • Rare-event metrics. Keeps real but opportunistic metrics (auth failure counters, audit lifecycle counters that only fire on state transitions, the libuv nodejs.active.requests gauge gated on in-flight handles, the expression-engine family gated on N8N_EXPRESSION_ENGINE=vm + N8N_EXPRESSION_ENGINE_OBSERVABILITY_ENABLED=true) in the metric map and metadata, with synthetic samples in the unit fixtures, and excludes them from the live symmetric assertion via the public exclude= parameter of assert_metrics_using_metadata.
  • Readiness behavior. n8n.readiness.check is still submitted, but the OpenMetrics scrape is no longer gated on /healthz/readiness. Metrics keep flowing when readiness reports degraded so queue depth, process state, and other SRE-relevant signals are not lost during incidents. The OpenMetrics base check's own scrape service check still surfaces scrape failures.
  • Public documentation. README updates cover the version-specific metric availability, the env flags that gate each family (N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS, N8N_METRICS_INCLUDE_WORKFLOW_EXECUTION_DURATION, N8N_METRICS_INCLUDE_QUEUE_METRICS, N8N_EXPRESSION_ENGINE, N8N_EXPRESSION_ENGINE_OBSERVABILITY_ENABLED), the requirement to keep raw_metric_prefix in sync with a customized N8N_METRICS_PREFIX, the 5679-reserved-by-task-runner-broker caveat for queue mode, the relocated event-bus log file path under the n8n user folder, and the dynamic event-bus surface with an extra_metrics example for future n8n events.
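
The worker-scrape and prefix notes above could translate into instance definitions along these lines. This is a minimal sketch, not the PR's conftest: the `make_instance` helper and the `localhost` host are hypothetical, and the ports assume the fixed lab values 5678 (main) and 5680 (worker) mentioned elsewhere in this PR. `openmetrics_endpoint`, `raw_metric_prefix`, and `tags` are standard OpenMetrics instance options.

```python
def make_instance(host: str, port: int, process: str, prefix: str = "n8n_") -> dict:
    """Build one OpenMetrics-style instance for an n8n process (hypothetical helper)."""
    return {
        "openmetrics_endpoint": f"http://{host}:{port}/metrics",
        # Keep this in sync with N8N_METRICS_PREFIX if it was customized in n8n.
        "raw_metric_prefix": prefix,
        # Distinguishes main from worker so worker-only families can be filtered.
        "tags": [f"n8n_process:{process}"],
    }


# One instance per scraped process: main on 5678, worker on 5680
# (5679 is avoided because n8n 2.x reserves it for the task-runner broker).
instances = [
    make_instance("localhost", 5678, "main"),
    make_instance("localhost", 5680, "worker"),
]
```

Scraping the worker as a separate instance, rather than merging both endpoints, keeps the `n8n_process` tag meaningful on every sample.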

Motivation

Issue #23633 reported that the integration exposed the wrong Datadog metric name for n8n workflow execution duration. Validating the check against live n8n containers showed a broader gap: some mapped metrics were invented or stale, several real metrics were missing, and the test environment did not exercise queue mode, worker metrics, or version-specific differences.

This PR makes the integration empirically grounded against both the older supported n8n line and the current 2.x line, and gives developers a lab harness to keep iterating on n8n monitoring artifacts with real telemetry.

Validation

  • ddev test -fs n8n
  • ddev validate config -s n8n
  • ddev validate models -s n8n
  • ddev validate metadata n8n
  • ddev validate readmes n8n
  • ddev --no-interactive test n8n
  • ddev env test --dev n8n py3.13-1
  • ddev env test --dev n8n py3.13-2

Review checklist

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

* Fix n8n metric mappings and add full v2 metric coverage

- Drop fabricated metric names that n8n never emitted; map only what is empirically present.
- Add the n8n 2.x metric families: workflow.execution.duration histogram, audit.workflow.*, embed.login.*, token.exchange.*, process.pss.bytes, runner.task.requested, and the workflow_statistics gauges.
- Add worker-only families (node.started, node.finished, queue.job.dequeued, runner.task.requested) by introducing a worker-scrape instance.
- Stop gating the OpenMetrics scrape on /healthz/readiness; emit n8n.readiness.check unconditionally so metrics still flow when the readiness endpoint is unhealthy.
- Replace the custom Dockerfile with a direct n8nio/n8n image reference and parameterise the version via hatch.toml so the test matrix can run against both 1.118.1 and 2.19.5.
- Allocate free host ports via datadog_checks.dev.utils.find_free_ports and forward them through docker_run env_vars to avoid port collisions on re-runs.
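
The port-allocation idea in the last bullet can be illustrated with the standard library. This is a stand-in sketch only: it is not the implementation of `datadog_checks.dev.utils.find_free_ports`, and the `find_free_ports_sketch` name and signature are assumptions made for illustration.

```python
import socket
from contextlib import ExitStack


def find_free_ports_sketch(count: int, host: str = "127.0.0.1") -> list[int]:
    """Ask the OS for `count` distinct free TCP ports (illustrative stand-in)."""
    with ExitStack() as stack:
        ports = []
        for _ in range(count):
            sock = stack.enter_context(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
            sock.bind((host, 0))  # port 0 lets the kernel choose a free port
            ports.append(sock.getsockname()[1])
        # All sockets stay open until this point, so the kernel cannot
        # hand out the same port twice within one call.
        return ports


main_port, worker_port = find_free_ports_sketch(2)
```

A port found this way can still be taken by another process before the container binds it, so the approach reduces collisions on re-runs rather than eliminating them.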

* Add changelog for PR #23635

* Refine n8n metric coverage and e2e setup

* Document raw_metric_prefix requirement when customizing N8N_METRICS_PREFIX

* Reformat changelog so towncrier renders sub-bullets correctly

* Add tests/lab traffic generator for n8n

A long-running n8n simulation that layers on top of the integration test
environment so a real Datadog Agent can ship metrics to a Datadog org for
dashboard / monitor iteration.

- tests/lab/workflows/: five lab-only workflow JSONs covering distinct shapes
  (fast, slow Wait node, always-fail Code, flaky 30%, four-step chain).
- tests/lab/traffic_generator.py: click CLI (start/generate/stop) that runs
  ddev env start --base, copies + imports + activates the lab workflows,
  restarts n8n, and drives a configurable async traffic mix against the
  webhooks and REST API.
- tests/lab/config.yaml: webhook + REST probabilities and tick / reload
  intervals; hot-reloaded while the generator runs.
- tests/lab/.ddev.toml: pins the lab to an `n8nlab` ddev org.
- tests/lab/run_lab.sh: bash entrypoint with an EXIT trap so Ctrl+C always
  runs lab:stop.
- hatch.toml: new [envs.lab] env with click/httpx/pyyaml/rich and
  start/generate/stop scripts.

* Add missing n8n event metric mappings

* Add VM-isolated expression engine metrics (n8n 2.x)

n8n 2.x ships a new VM-isolated expression engine in @n8n/expression-runtime
that registers its own Prometheus metrics under the n8n_expression_* prefix.
The metrics are gated on N8N_EXPRESSION_ENGINE=vm and
N8N_EXPRESSION_ENGINE_OBSERVABILITY_ENABLED=true, neither of which defaults
to on, so live containers do not emit samples unless explicitly opted in.

Map the family in datadog_checks/n8n/metrics.py, add metadata.csv rows with
the version + flag requirements documented, add synthetic samples to the
unit fixtures so check_symmetric_inclusion stays green, list the metrics in
the V2_ONLY / RARE_EVENT sets used by the integration assertions, and call
out the env vars in the README's version-specific block.

* Split lab into its own compose, mount workflows by bind

The lab previously shared tests/docker/docker-compose.yaml with the test
env and drove workflow import through docker exec in traffic_generator.py.
That coupled two consumers with different port expectations (the test env
uses find_free_ports for parallel safety; the lab needs a fixed URL for
docs, agent config, and the traffic generator) and put workflow lifecycle
in two places.

Add tests/lab/docker-compose.yaml that hardcodes 5678/5680 and bind-mounts
both the test fixtures and the lab workflows under /workflows/. Gate the
compose + port selection in tests/common.py on N8N_IS_LAB so the same
conftest serves both modes. Move workflow import/activate into conftest
(scanning the bind-mount, reading stable ids from JSON), and add a
lab-only logs block + docker_volumes yield so the Datadog Agent picks up
n8n stdout via autodiscovery and the event-bus log files via the data
volume. Drop the docker-exec workflow import from traffic_generator.py
now that conftest owns it. Update the README log-collection section to
reflect that event-bus logs live under the n8n user folder rather than
N8N_LOG_FILE_LOCATION.

* Address PR review feedback

- Tighten _generate_workflow_traffic success check to == 200 so a
  webhook that responds 4xx (e.g. not yet registered after restart) does
  not falsely count as a healthy workflow run; capture last_status /
  last_exc and surface them in the RuntimeError so CI failures point at
  the real cause.
- Replace the bespoke time.monotonic() wait loops with WaitFor + raise
  predicates (the dominant pattern across integrations-core). Restructure
  dd_environment conditions so the docker_run condition chain runs:
  wait for /healthz, activate workflows, then assert /metrics reachable
  on main + worker. Workflow-started wait stays inline since
  _generate_workflow_traffic is not idempotent.
- Drop drop_rare_event_metrics; pass the public exclude= parameter to
  assert_metrics_using_metadata so we don't reach into AggregatorStub
  internals.
- Replace bare try/except RequestException: pass blocks with
  contextlib.suppress.
- Parametrize the two unit-fixture metadata tests; add the missing
  pytestmark = pytest.mark.unit and a comment explaining why the
  unit assertion is version-pinned to major=2.
- Re-word the lab traffic generator reload-failure messages so it's
  clear the lab keeps running with the previous config.
- Add N8N_METRICS_INCLUDE_WORKFLOW_EXECUTION_DURATION to the
  README's version-specific block and to the changelog flag list; indent
  the changelog sub-bullets so towncrier nests them under the wrapping
  bullet.
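
The first item above (strict `== 200` check with `last_status` / `last_exc` surfaced in the error) might look roughly like the sketch below. It is an assumption-laden illustration, not the PR's `_generate_workflow_traffic`: the `send` callable, the `attempts` parameter, and the broad `except Exception` are hypothetical simplifications so the example runs without an HTTP client.

```python
from typing import Callable, Optional


def generate_workflow_traffic_sketch(send: Callable[[], int], attempts: int = 5) -> None:
    """Call `send` until it returns HTTP 200; raise with diagnostics otherwise."""
    last_status: Optional[int] = None
    last_exc: Optional[Exception] = None
    for _ in range(attempts):
        try:
            last_status = send()
        except Exception as exc:  # real code would catch the HTTP client's error type
            last_exc = exc
            continue
        # Only an exact 200 counts: a 404 from a not-yet-registered webhook
        # must not be mistaken for a healthy workflow run.
        if last_status == 200:
            return
    raise RuntimeError(
        f"webhook never returned 200 (last_status={last_status}, last_exc={last_exc!r})"
    )
```

Including the last observed status and exception in the message is what makes a CI timeout point at the real cause instead of a bare failure.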

* Fix e2e test referencing removed drop_rare_event_metrics helper

Use the public exclude= parameter on assert_metrics_using_metadata,
matching test_integration.py. test_e2e.py was missed in the earlier
review-feedback commit.
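
The intent of excluding rare-event metrics from a symmetric metadata assertion can be sketched with plain sets. This only illustrates the idea; it does not reproduce the internals of `assert_metrics_using_metadata`, and the `symmetric_mismatch` helper is hypothetical. The metric names in the usage lines are taken from this PR's description, with the `n8n.` namespace prefix assumed.

```python
def symmetric_mismatch(
    submitted: set[str], metadata: set[str], exclude: set[str]
) -> tuple[set[str], set[str]]:
    """Return (submitted but undocumented, documented but not submitted),
    ignoring every name listed in `exclude`."""
    undocumented = submitted - metadata - exclude
    not_submitted = metadata - submitted - exclude
    return undocumented, not_submitted


submitted = {"n8n.queue.job.dequeued"}
metadata = {"n8n.queue.job.dequeued", "n8n.nodejs.active.requests"}
rare_events = {"n8n.nodejs.active.requests"}  # only emitted with in-flight handles

undocumented, not_submitted = symmetric_mismatch(submitted, metadata, rare_events)
```

Without the exclusion, the rare-event gauge would show up as documented but never submitted and fail a live run that happened not to trigger it.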

* Proofread n8n README against the Datadog style guide

- Remove stray scratch notes accidentally committed at the end of the
  file (numbered questions and a changelog-process note that didn't
  belong in the public README).
- Sentence-case the 'Data collected' and 'Service checks' headings.
- Replace hyphen-as-em-dash usage (' - ') by splitting into separate
  sentences.
- Replace slash-as-and/or in lists and tag descriptions:
  'enqueued/dequeued/completed/failed/stalled counters' -> spelled-out
  list; 'result:success/failure' -> 'result:success or result:failure';
  'stdout/stderr' -> 'stdout and stderr'.

* Move workflow setup back into docker_run conditions to fix e2e

In the previous refactor _generate_workflow_traffic and the
_workflow_started_non_zero wait were moved into the body of the
dd_environment context manager. That made them vulnerable to fixture
re-invocation paths (e.g. session teardown or flaky-plugin retry) that
fired the body code against torn-down containers, producing a setup
error after the e2e test had already passed.

Put both back into conditions=[...]. That keeps them inside docker_run's
set_up() retry envelope (attempts=2 in CI), and they are no longer
exposed to the post-yield teardown path. The post-restart /healthz wait
moves back inside _activate_imported_workflows so the function stays
self-contained as a condition. Restore the (instances, E2E_METADATA)
tuple yield for non-lab mode so the e2e Agent container still gets the
docker_volumes mount it expects.

* Address second-round PR review feedback

- conftest.py: parse n8n_workflow_started_total samples as floats instead
  of string-matching ' 0', so '0.0' / '0e+0' counter values are not
  treated as non-zero and OpenMetrics '# HELP'/'# TYPE' comment lines
  that share the prefix are skipped.
- common.py: collapse the get_all_metadata_metrics passthrough into
  get_metadata_metrics_for_version (update integration + e2e call sites)
  and document the intentional V2_ONLY / RARE_EVENT overlap so future
  contributors do not assume the duplication is accidental.
- check.py: cache the readiness endpoint with functools.cached_property
  (it is derived from immutable config) and parameterise the dict return
  / argument types as dict[str, Any].
- traffic_generator.py: scope the asyncio.Event and current config to
  _run_traffic instead of holding them at module level, threading both
  through _config_reloader. Switches the SIGINT/SIGTERM hook to
  loop.add_signal_handler so a second 'generate' invocation in the same
  process starts from a clean state.
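
The float-parsing fix from the conftest.py item can be sketched as follows. This is a minimal illustration, not the PR's code: the `workflow_started_non_zero` name is hypothetical, and it assumes exposition lines carry no label values containing whitespace after the closing brace.

```python
def workflow_started_non_zero(
    metrics_text: str, name: str = "n8n_workflow_started_total"
) -> bool:
    """Return True if any sample of `name` in the exposition text is > 0."""
    for raw in metrics_text.splitlines():
        line = raw.strip()
        # '# HELP' / '# TYPE' lines mention the metric name but are not samples.
        if not line or line.startswith("#") or not line.startswith(name):
            continue
        rest = line[len(name):]
        # Skip other metrics that merely share the prefix.
        if not rest or rest[0] not in "{ ":
            continue
        try:
            # Take the first token after the label set; tolerates a trailing timestamp.
            value = float(rest.rsplit("}", 1)[-1].split()[0])
        except (ValueError, IndexError):
            continue
        if value > 0:
            return True
    return False
```

Comparing parsed floats means `0`, `0.0`, and `0e+0` are all treated as zero, which string matching on `' 0'` could not guarantee.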

* Address third-round PR review feedback

- conftest.py: move the worker CheckEndpoints to after _activate_imported_workflows
  so any cascade from the n8n main restart is caught before downstream conditions
  scrape the worker.
- test_unit.py: import the requests module and reference
  requests.ConnectionError at the call site so the builtin ConnectionError
  name is not shadowed for the rest of the module.
- traffic_generator.py: extract _make_output_table() so the table schema lives
  in one place and _print_row() only owns row data.

* Wait for webhook registration after n8n restart on v2

On n8n 2.x, /healthz comes back after `docker compose restart n8n` before
n8n has finished re-registering the active workflows' webhook routes. The
existing WaitFor(_n8n_healthy) inside _activate_imported_workflows was
satisfied while /webhook/test still returned 404, so _generate_workflow_traffic
raced the registration and failed with last_status=404.

Add a second WaitFor poll on the integration-test webhook itself so the
registration is observed before downstream conditions run. v1 happens to
register fast enough that the gap is not observable there, but the extra
check costs at most one poll on the happy path.

* Map n8n event-bus dynamic counters

Map the broader n8n event-bus surface (~45 dynamic counters) covering
audit (user, credentials, package, variable, execution data), AI node,
runner, and workflow cancellation events, plus execution throttling.
Counter names rejected by n8n's own prom-client validation (hyphenated
families such as external-secrets, token-exchange, role-mapping, and
cluster) are intentionally not mapped and called out in metrics.py.
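
Why hyphenated families never reach the `/metrics` output follows from the classic Prometheus metric-name grammar, which allows only letters, digits, underscores, and colons. The sketch below checks a name against that grammar; the two counter names are hypothetical examples, not names confirmed to be declared by n8n.

```python
import re

# Classic Prometheus metric-name grammar: [a-zA-Z_:][a-zA-Z0-9_:]*
METRIC_NAME_RE = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def is_valid_prometheus_name(name: str) -> bool:
    """True if `name` satisfies the classic Prometheus metric-name grammar."""
    return METRIC_NAME_RE.fullmatch(name) is not None


# Hypothetical names for illustration only.
underscore_ok = is_valid_prometheus_name("n8n_token_exchange_total")
hyphen_rejected = not is_valid_prometheus_name("n8n_token-exchange_total")
```

A counter whose name fails this check is never exported, so mapping it in the check would only add a metadata row that can never receive data.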

The integration test environment cannot realistically exercise these
families end to end, so each new metric is documented as best-effort
in metadata.csv and added to RARE_EVENT_METRIC_NAMES. The unit fixture
carries synthetic samples so the metric map stays validated. README
covers the dynamic-counter scope and shows an extra_metrics example
for users to add events from future n8n releases.

* Drop technical hyphen-rejection paragraph from n8n README

* Tighten n8n changelog to one-line themes

* Tone down n8n changelog lead-in

* Reframe n8n changelog from user perspective

* Treat any 2xx response as ready, bump n8n to a major release

The readiness gauge now reports 1 for any HTTP 2xx response on
/healthz/readiness, not only 200. Rename the changelog entry from
.added to .changed so the next release is a major bump, reflecting
the breadth of the integration overhaul.
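
The "any 2xx is ready" rule reduces to a range check on the status code. A minimal sketch, assuming a hypothetical `readiness_value` helper that yields the value the readiness gauge would report:

```python
def readiness_value(status_code: int) -> int:
    """Map an HTTP status from /healthz/readiness to a 0/1 gauge value."""
    # Any 2xx counts as ready, not only 200.
    return 1 if 200 <= status_code < 300 else 0
```

With this rule a `204 No Content` response reports ready, while redirects and 5xx responses report not ready.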

(cherry picked from commit 57659b4)

dd-octo-sts Bot (Contributor, Author) commented May 19, 2026

Validation Report

All 20 validations passed.

| Validation | Description | Status |
|---|---|---|
| agent-reqs | Verify check versions match the Agent requirements file | Passed |
| ci | Validate CI configuration and Codecov settings | Passed |
| codeowners | Validate every integration has a CODEOWNERS entry | Passed |
| config | Validate default configuration files against spec.yaml | Passed |
| dep | Verify dependency pins are consistent and Agent-compatible | Passed |
| http | Validate integrations use the HTTP wrapper correctly | Passed |
| imports | Validate check imports do not use deprecated modules | Passed |
| integration-style | Validate check code style conventions | Passed |
| jmx-metrics | Validate JMX metrics definition files and config | Passed |
| labeler | Validate PR labeler config matches integration directories | Passed |
| legacy-signature | Validate no integration uses the legacy Agent check signature | Passed |
| license-headers | Validate Python files have proper license headers | Passed |
| licenses | Validate third-party license attribution list | Passed |
| metadata | Validate metadata.csv metric definitions | Passed |
| models | Validate configuration data models match spec.yaml | Passed |
| openmetrics | Validate OpenMetrics integrations disable the metric limit | Passed |
| package | Validate Python package metadata and naming | Passed |
| readmes | Validate README files have required sections | Passed |
| saved-views | Validate saved view JSON file structure and fields | Passed |
| version | Validate version consistency between package and changelog | Passed |

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 172d23b0af

- N8N_METRICS_INCLUDE_API_ENDPOINTS=true
- N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=true
# Logging configuration
- EXECUTIONS_MODE=queue

P2 Badge Back queue-mode tests with Postgres

This compose file now starts n8n in queue mode with a separate worker, but it leaves n8n on its default SQLite database and shares the same /home/node/.n8n volume between processes. n8n's queue-mode setup is a distributed main/worker deployment and SQLite is unsupported there, so the new integration/e2e environment can hit database locks or unsupported execution behavior when the webhook traffic is queued to the worker. Add a Postgres service and DB_* env vars for both n8n containers, or avoid queue mode in this compose.

Collaborator reply:

Good catch, will consider it for a test fix later, haven't had the issue yet.

Comment thread n8n/tests/lab/run_lab.sh
cd "$SCRIPT_DIR"
hatch run lab:stop -e "$ENV" || true
cd "$ORIGINAL_DIR"
exit 0

P2 Badge Preserve the lab command exit status

When lab:start or lab:generate exits non-zero, the EXIT trap runs cleanup, and this unconditional exit 0 makes run_lab.sh report success anyway. That masks failed lab runs or generator crashes in any automation that relies on the wrapper's exit code; capture $? at the start of cleanup and exit with that status after stopping the lab.

Collaborator reply:

The lab is used for testing and we need to run the cleanup no matter what, so we do not really care about the exit status.

codecov Bot commented May 19, 2026

Codecov Report

❌ Patch coverage is 88.44221% with 23 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (7.80.x@ff792eb). Learn more about missing BASE report.

@datadog-prod-us1-4

Contributor

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 88.44%
Overall Coverage: 90.17%

Commit SHA: 172d23b

@AAraKKe AAraKKe merged commit 83acc5a into 7.80.x May 19, 2026
57 checks passed
@AAraKKe AAraKKe deleted the backport-23635-to-7.80.x branch May 19, 2026 07:56
@dd-octo-sts dd-octo-sts Bot added this to the 7.79.0 milestone May 19, 2026