[n8n] Fix metric mappings and add full v2 metric coverage (#23635)
* Fix n8n metric mappings and add full v2 metric coverage
- Drop fabricated metric names that n8n never emitted; map only what is empirically present.
- Add the n8n 2.x metric families: workflow.execution.duration histogram, audit.workflow.*, embed.login.*, token.exchange.*, process.pss.bytes, runner.task.requested, and the workflow_statistics gauges.
- Add worker-only families (node.started, node.finished, queue.job.dequeued, runner.task.requested) by introducing a worker-scrape instance.
- Stop gating the OpenMetrics scrape on /healthz/readiness; emit n8n.readiness.check unconditionally so metrics still flow when the readiness endpoint is unhealthy.
- Replace the custom Dockerfile with a direct n8nio/n8n image reference and parameterise the version via hatch.toml so the test matrix can run against both 1.118.1 and 2.19.5.
- Allocate free host ports via datadog_checks.dev.utils.find_free_ports and forward them through docker_run env_vars to avoid port collisions on re-runs.
* Add changelog for PR #23635
* Refine n8n metric coverage and e2e setup
* Document raw_metric_prefix requirement when customizing N8N_METRICS_PREFIX
* Reformat changelog so towncrier renders sub-bullets correctly
* Add tests/lab traffic generator for n8n
A long-running n8n simulation that layers on top of the integration test
environment so a real Datadog Agent can ship metrics to a Datadog org for
dashboard / monitor iteration.
- tests/lab/workflows/: five lab-only workflow JSONs covering distinct shapes
(fast, slow Wait node, always-fail Code, flaky 30%, four-step chain).
- tests/lab/traffic_generator.py: click CLI (start/generate/stop) that runs
ddev env start --base, copies + imports + activates the lab workflows,
restarts n8n, and drives a configurable async traffic mix against the
webhooks and REST API.
- tests/lab/config.yaml: webhook + REST probabilities and tick / reload
intervals; hot-reloaded while the generator runs.
- tests/lab/.ddev.toml: pins the lab to an `n8nlab` ddev org.
- tests/lab/run_lab.sh: bash entrypoint with an EXIT trap so Ctrl+C always
runs lab:stop.
- hatch.toml: new [envs.lab] env with click/httpx/pyyaml/rich and
start/generate/stop scripts.
* Add missing n8n event metric mappings
* Add VM-isolated expression engine metrics (n8n 2.x)
n8n 2.x ships a new VM-isolated expression engine in @n8n/expression-runtime
that registers its own Prometheus metrics under the n8n_expression_* prefix.
The metrics are gated on N8N_EXPRESSION_ENGINE=vm and
N8N_EXPRESSION_ENGINE_OBSERVABILITY_ENABLED=true, neither of which defaults
to on, so live containers do not emit samples unless explicitly opted in.
Map the family in datadog_checks/n8n/metrics.py, add metadata.csv rows with
the version + flag requirements documented, add synthetic samples to the
unit fixtures so check_symmetric_inclusion stays green, list the metrics in
the V2_ONLY / RARE_EVENT sets used by the integration assertions, and call
out the env vars in the README's version-specific block.
* Split lab into its own compose, mount workflows by bind
The lab previously shared tests/docker/docker-compose.yaml with the test
env and drove workflow import through docker exec in traffic_generator.py.
That coupled two consumers with different port expectations (the test env
uses find_free_ports for parallel safety; the lab needs a fixed URL for
docs, agent config, and the traffic generator) and put workflow lifecycle
in two places.
Add tests/lab/docker-compose.yaml that hardcodes 5678/5680 and bind-mounts
both the test fixtures and the lab workflows under /workflows/. Gate the
compose + port selection in tests/common.py on N8N_IS_LAB so the same
conftest serves both modes. Move workflow import/activate into conftest
(scanning the bind-mount, reading stable ids from JSON), and add a
lab-only logs block + docker_volumes yield so the Datadog Agent picks up
n8n stdout via autodiscovery and the event-bus log files via the data
volume. Drop the docker-exec workflow import from traffic_generator.py
now that conftest owns it. Update the README log-collection section to
reflect that event-bus logs live under the n8n user folder rather than
N8N_LOG_FILE_LOCATION.
* Address PR review feedback
- Tighten _generate_workflow_traffic success check to == 200 so a
webhook that responds 4xx (e.g. not yet registered after restart) does
not falsely count as a healthy workflow run; capture last_status /
last_exc and surface them in the RuntimeError so CI failures point at
the real cause.
- Replace the bespoke time.monotonic() wait loops with WaitFor + raise
predicates (the dominant pattern across integrations-core). Restructure
dd_environment conditions so the docker_run condition chain runs:
wait for /healthz, activate workflows, then assert /metrics reachable
on main + worker. Workflow-started wait stays inline since
_generate_workflow_traffic is not idempotent.
- Drop drop_rare_event_metrics; pass the public exclude= parameter to
assert_metrics_using_metadata so we don't reach into AggregatorStub
internals.
- Replace bare try/except RequestException: pass blocks with
contextlib.suppress.
- Parametrize the two unit-fixture metadata tests; add the missing
pytestmark = pytest.mark.unit and a comment explaining why the
unit assertion is version-pinned to major=2.
- Re-word the lab traffic generator reload-failure messages so it's
clear the lab keeps running with the previous config.
- Add N8N_METRICS_INCLUDE_WORKFLOW_EXECUTION_DURATION to the
README's version-specific block and to the changelog flag list; indent
the changelog sub-bullets so towncrier nests them under the wrapping
bullet.
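The tightened success check in the first bullet above could look roughly like the sketch below. It is an illustration under assumptions, not the code from the PR: `generate_traffic` and the injected `send` callable are hypothetical stand-ins for `_generate_workflow_traffic` and its HTTP call.

```python
from typing import Callable, Optional


def generate_traffic(send: Callable[[], int], attempts: int = 5) -> None:
    """Call `send` until it returns HTTP 200, or raise with the last observed failure.

    `send` is a hypothetical callable that performs one webhook request and
    returns the HTTP status code (or raises on a transport error).
    """
    last_status: Optional[int] = None
    last_exc: Optional[Exception] = None
    for _ in range(attempts):
        try:
            last_status = send()
        except Exception as exc:  # keep the most recent transport error for the report
            last_exc = exc
            continue
        # Only an exact 200 counts; a 4xx from an unregistered webhook is a failure.
        if last_status == 200:
            return
    raise RuntimeError(
        f"webhook never returned 200 (last_status={last_status}, last_exc={last_exc!r})"
    )
```

Surfacing both values in the exception message means a CI log shows whether the run failed on a 404 or on a connection error.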
* Fix e2e test referencing removed drop_rare_event_metrics helper
Use the public exclude= parameter on assert_metrics_using_metadata,
matching test_integration.py. test_e2e.py was missed in the earlier
review-feedback commit.
* Proofread n8n README against the Datadog style guide
- Remove stray scratch notes accidentally committed at the end of the
file (numbered questions and a changelog-process note that didn't
belong in the public README).
- Sentence-case the 'Data collected' and 'Service checks' headings.
- Replace hyphen-as-em-dash usage (' - ') by splitting into separate
sentences.
- Replace slash-as-and/or in lists and tag descriptions:
'enqueued/dequeued/completed/failed/stalled counters' -> spelled-out
list; 'result:success/failure' -> 'result:success or result:failure';
'stdout/stderr' -> 'stdout and stderr'.
* Move workflow setup back into docker_run conditions to fix e2e
In the previous refactor _generate_workflow_traffic and the
_workflow_started_non_zero wait were moved into the body of the
dd_environment context manager. That made them vulnerable to fixture
re-invocation paths (e.g. session teardown or flaky-plugin retry) that
fired the body code against torn-down containers, producing a setup
error after the e2e test had already passed.
Put both back into conditions=[...]. That keeps them inside docker_run's
set_up() retry envelope (attempts=2 in CI), and they are no longer
exposed to the post-yield teardown path. The post-restart /healthz wait
moves back inside _activate_imported_workflows so the function stays
self-contained as a condition. Restore the (instances, E2E_METADATA)
tuple yield for non-lab mode so the e2e Agent container still gets the
docker_volumes mount it expects.
* Address second-round PR review feedback
- conftest.py: parse n8n_workflow_started_total samples as floats instead
of string-matching ' 0', so '0.0' / '0e+0' counter values are not
treated as non-zero and OpenMetrics '# HELP'/'# TYPE' comment lines
that share the prefix are skipped.
- common.py: collapse the get_all_metadata_metrics passthrough into
get_metadata_metrics_for_version (update integration + e2e call sites)
and document the intentional V2_ONLY / RARE_EVENT overlap so future
contributors do not assume the duplication is accidental.
- check.py: cache the readiness endpoint with functools.cached_property
(it is derived from immutable config) and parameterise the dict return
/ argument types as dict[str, Any].
- traffic_generator.py: scope the asyncio.Event and current config to
_run_traffic instead of holding them at module level, threading both
through _config_reloader. Switch the SIGINT/SIGTERM hook to
loop.add_signal_handler so a second 'generate' invocation in the same
process starts from a clean state.
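The conftest.py bullet above (parse samples as floats instead of string-matching) can be sketched as follows. This is a minimal illustration, not the PR's implementation: the function name `workflow_started_non_zero` is hypothetical, and the parser assumes plain Prometheus text exposition without exemplars.

```python
def workflow_started_non_zero(metrics_text: str, name: str = "n8n_workflow_started_total") -> bool:
    """Return True if any sample of `name` in the payload has a value greater than zero."""
    for raw_line in metrics_text.splitlines():
        line = raw_line.strip()
        # Skip blanks and '# HELP' / '# TYPE' comments, which also mention the metric name.
        if not line or line.startswith("#"):
            continue
        if line.startswith(name + "{"):
            # Labelled sample: the value follows the closing brace of the label set.
            end = line.rfind("}")
            if end == -1:
                continue
            rest = line[end + 1:]
        elif line.startswith(name + " "):
            # Unlabelled sample: the value follows the metric name.
            rest = line[len(name):]
        else:
            # A different metric, including longer names that merely share the prefix.
            continue
        tokens = rest.split()
        if not tokens:
            continue
        try:
            value = float(tokens[0])  # handles '0', '0.0', '0e+0', '3', ...
        except ValueError:
            continue
        if value > 0:
            return True
    return False
```

Comparing the parsed float rather than the raw text is what keeps `0.0` and `0e+0` from being mistaken for a non-zero counter.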
* Address third-round PR review feedback
- conftest.py: move the worker CheckEndpoints to after _activate_imported_workflows
so any cascade from the n8n main restart is caught before downstream conditions
scrape the worker.
- test_unit.py: import the requests module and reference
requests.ConnectionError at the call site so the builtin ConnectionError
name is not shadowed for the rest of the module.
- traffic_generator.py: extract _make_output_table() so the table schema lives
in one place and _print_row() only owns row data.
* Wait for webhook registration after n8n restart on v2
On n8n 2.x, /healthz comes back after `docker compose restart n8n` before
n8n has finished re-registering the active workflows' webhook routes. The
existing WaitFor(_n8n_healthy) inside _activate_imported_workflows was
satisfied while /webhook/test still returned 404, so _generate_workflow_traffic
raced the registration and failed with last_status=404.
Add a second WaitFor poll on the integration-test webhook itself so the
registration is observed before downstream conditions run. v1 happens to
register fast enough that the gap is not observable there, but the extra
check costs at most one poll on the happy path.
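The second poll described above can be pictured with the sketch below. It does not reproduce the `WaitFor` helper from `datadog_checks.dev`; `wait_until` and `make_webhook_registered` are hypothetical names, and treating any non-404 status as "registered" is an assumption about the predicate.

```python
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    attempts: int = 30,
    wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll `predicate` until it is true and return the number of polls used."""
    for attempt in range(1, attempts + 1):
        if predicate():
            return attempt
        sleep(wait)
    raise RuntimeError(f"condition not met after {attempts} attempts")


def make_webhook_registered(get_status: Callable[[], int]) -> Callable[[], bool]:
    """Build a predicate that is true once the webhook route stops answering 404.

    `get_status` is a hypothetical callable returning the HTTP status of one
    request to the integration-test webhook.
    """
    return lambda: get_status() != 404
```

On the happy path the predicate is true on the first call, so the extra condition costs a single poll, which matches the note above.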
* Map n8n event-bus dynamic counters
Map the broader n8n event-bus surface (~45 dynamic counters) covering
audit (user, credentials, package, variable, execution data), AI node,
runner, and workflow cancellation events, plus execution throttling.
Counter names rejected by n8n's own prom-client validation (hyphenated
families such as external-secrets, token-exchange, role-mapping, and
cluster) are intentionally not mapped and called out in metrics.py.
The integration test environment cannot realistically exercise these
families end to end, so each new metric is documented as best-effort
in metadata.csv and added to RARE_EVENT_METRIC_NAMES. The unit fixture
carries synthetic samples so the metric map stays validated. README
covers the dynamic-counter scope and shows an extra_metrics example
for users to add events from future n8n releases.
* Drop technical hyphen-rejection paragraph from n8n README
* Tighten n8n changelog to one-line themes
* Tone down n8n changelog lead-in
* Reframe n8n changelog from user perspective
* Treat any 2xx response as ready, bump n8n to a major release
The readiness gauge now reports 1 for any HTTP 2xx response on
/healthz/readiness, not only 200. Rename the changelog entry from
.added to .changed so the next release is a major bump, reflecting
the breadth of the integration overhaul.
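As a rough sketch of the new readiness rule, the mapping from HTTP status to gauge value could be written as below. The function name `readiness_value` is hypothetical, and reporting 0 for non-2xx responses or a failed request is an assumption; the commit message only states that any 2xx reports 1.

```python
from typing import Optional


def readiness_value(status_code: Optional[int]) -> int:
    """Return 1 for any HTTP 2xx status, otherwise 0.

    `None` stands for a request that failed before a status was received
    (assumed here to count as not ready).
    """
    if status_code is None:
        return 0
    return 1 if 200 <= status_code < 300 else 0
```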
 # Optional: Customize the metric prefix (default is 'n8n_')
 N8N_METRICS_PREFIX=n8n_
 ```

 For more details, see the n8n documentation on [enabling Prometheus metrics][10].

+If you change `N8N_METRICS_PREFIX` from its default of `n8n_`, you **must** also set `raw_metric_prefix` in the integration's `conf.yaml` to the same value. Otherwise the check will not recognize the exposed metric names and will silently submit nothing:
+Most n8n counters are registered dynamically the first time their underlying event fires. The integration ships mappings for around 70 of these event-bus counters, including:
+- Audit (workflow, user, credentials, package, variable, execution data): `n8n.audit.workflow.executed.count`, `n8n.audit.user.login.success.count`, `n8n.audit.user.credentials.created.count`, and similar
+- AI nodes: `n8n.ai.tool.called.count`, `n8n.ai.llm.generated.count`, `n8n.ai.vector.store.searched.count`, and similar
+- Runner, queue, and node lifecycle: `n8n.runner.task.requested.count`, `n8n.queue.job.completed.count`, `n8n.node.started.count`, `n8n.node.finished.count`
+
+These counters do not appear on the `/metrics` endpoint until the corresponding event has occurred. A healthy idle deployment will not produce data points for them until that activity fires. The complete list is in [`metadata.csv`][7].
+
+If a future n8n release exposes a new event-driven counter that is not yet covered by this integration, add it to the `extra_metrics` option in your instance configuration:
+
+```yaml
+instances:
+  - openmetrics_endpoint: http://n8n:5678/metrics
+    extra_metrics:
+      - some_new_n8n_event_total: some.new.n8n.event
+```
+
+The left-hand side is the Prometheus counter name as n8n exposes it (keep the `_total` suffix). The right-hand side is the dotted Datadog metric name to submit it as.
+
+#### Queue mode and workers
+
+In queue mode, n8n runs separate worker processes that execute jobs picked up from a Redis-backed queue. Each worker exposes its own `/metrics` endpoint and emits a different subset of metrics than the main process. Worker-observed metrics include `n8n.queue.job.dequeued.count`, `n8n.queue.job.stalled.count`, `n8n.node.started.count`, `n8n.node.finished.count`, and `n8n.runner.task.requested.count`. Main-only metrics include `n8n.instance.role.leader` and the `n8n.scaling.mode.queue.jobs.*` family.
+
+To expose worker metrics, set `QUEUE_HEALTH_CHECK_ACTIVE=true` and `QUEUE_HEALTH_CHECK_PORT=<port>` on each worker. **In n8n 2.x, port `5679` is reserved for the task runner broker, so pick a different port (for example `5680`).**
+
+For full coverage in queue deployments, configure one Datadog instance per n8n process exposing `/metrics`, including main and worker processes:
+Several metric families were introduced in n8n 2.x and are not emitted on n8n 1.x:
+
+- `n8n.workflow.execution.duration.seconds.*` (histogram). Gated by `N8N_METRICS_INCLUDE_WORKFLOW_EXECUTION_DURATION`, which defaults to `true` in n8n 2.x.
+- `n8n.audit.workflow.activated.count`, `n8n.audit.workflow.deactivated.count`, `n8n.audit.workflow.executed.count`, `n8n.audit.workflow.resumed.count`, `n8n.audit.workflow.version.updated.count`, and `n8n.audit.workflow.waiting.count`
+- `n8n.embed.login.requests.count` (tagged with `result:success` or `result:failure`), `n8n.embed.login.failures.count` (tagged with `reason`)
+- `n8n.token.exchange.requests.count` (tagged with `result:success` or `result:failure`), `n8n.token.exchange.failures.count` (tagged with `reason`), `n8n.token.exchange.identity.linked.count`, `n8n.token.exchange.jit.provisioning.count`
+- `n8n.process.pss.bytes` (Linux only)
+- The `n8n.{production,manual,production.root}.executions`, `n8n.users.total`, `n8n.enabled.users`, `n8n.workflows.total`, and `n8n.credentials.total` family. Only emitted when `N8N_METRICS_INCLUDE_WORKFLOW_STATISTICS=true` is set.
+- The `n8n.expression.*` family (`evaluation.duration.seconds`, `code.cache.{hit,miss,eviction,size}`, `pool.{acquired,replenish.failed,scaled.up,scaled.to.zero}`). Only emitted when n8n is running the new VM-isolated expression engine *and* observability for it is on. Set `N8N_EXPRESSION_ENGINE=vm` and `N8N_EXPRESSION_ENGINE_OBSERVABILITY_ENABLED=true` on the n8n process; both default to off (the engine defaults to `legacy`). These metrics surface the per-expression evaluation latency, the compiled-expression LRU cache hit and miss rates, and the V8-isolate pool's idle scaling behavior. They are most useful for troubleshooting workflow latency that traces back to slow `{{ ... }}` evaluation.
+
+Some metrics only emit samples after the corresponding runtime event occurs. For example, failures-only counters (`*.failures.count`) need an authentication failure, audit workflow counters need the matching workflow state transition, and the libuv `n8n.nodejs.active.requests` gauge needs an in-flight libuv request. A healthy idle deployment may not produce data points for these metrics until that activity occurs.
+
+#### Tag cardinality
+
+When `N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=true`, HTTP and workflow execution histograms are tagged with `workflow_id` (and similar labels for nodes). On deployments with many distinct workflows or nodes, this can produce high-cardinality metrics. Drop the label via `exclude_labels` or omit `N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL` to keep tag cardinality bounded.
+
 #### Configure the Datadog Agent

 1. Edit the `n8n.d/conf.yaml` file, in the `conf.d/` folder at the root of your Agent's configuration directory to start collecting your n8n performance data. See the [sample n8n.d/conf.yaml][4] for all available configuration options.
@@ -59,27 +125,32 @@ _Available for Agent versions >6.0_

 #### Enable n8n logging

-Configure n8n to output logs by setting the following environment variables:
+Configure n8n application logs by setting the following environment variables:

 ```bash
 # Set the log level (error, warn, info, debug)
 N8N_LOG_LEVEL=info

-# Output logs to console (for containerized environments) or file
+# Output application logs to console or file
 N8N_LOG_OUTPUT=console

-# If using file output, specify the log file location
+# Use JSON formatting so Datadog can parse n8n application log attributes
+N8N_LOG_FORMAT=json
+
+# If using file output, specify the application log file location
 N8N_LOG_FILE_LOCATION=/var/log/n8n/n8n.log
 ```

 #### Structured event logs

-n8n can output structured JSON logs to `n8nEventLog.log` containing detailed workflow execution events. Enable this by setting the log output to file:
+n8n also writes structured event bus logs to `n8nEventLog*.log`. These logs contain workflow, node, queue, runner, and audit events and are separate from the application logs controlled by `N8N_LOG_OUTPUT` and `N8N_LOG_FILE_LOCATION`.

-```bash
-N8N_LOG_OUTPUT=file
-N8N_LOG_FILE_LOCATION=/var/log/n8n/
-```
+By default, event bus log files are written under the n8n user folder, for example:
+
+- Host installations: `~/.n8n/n8nEventLog*.log`
+- Official Docker image: `/home/node/.n8n/n8nEventLog*.log`
+
+If you use a custom n8n user folder, collect the event bus logs from that folder instead. If you customize the event bus log file base name with `N8N_EVENTBUS_LOGWRITER_LOGBASENAME`, update the Datadog log path to match.

 The event log includes the following event types:
@@ -102,32 +173,46 @@ Each event contains rich metadata including `executionId`, `workflowId`, `workfl
 logs_enabled: true
 ```

-2. Add this configuration block to your `n8n.d/conf.yaml` file to start collecting your n8n logs:
+2. Add log collection entries to your `n8n.d/conf.yaml` file.
+
+For a host-based n8n installation where the Agent can read local files, collect the application log file and the event bus log files:

 ```yaml
 logs:
   - type: file
     path: /var/log/n8n/*.log
     source: n8n
-    service: n8n
+    service: <SERVICE>
+  - type: file
+    path: /home/n8n/.n8n/n8nEventLog*.log
+    source: n8n
+    service: <SERVICE>
 ```

-For containerized environments using Docker, use the following configuration instead:
+Adjust `/home/n8n/.n8n/n8nEventLog*.log` to the n8n user folder on your host.
+
+For a containerized n8n deployment, collect stdout and stderr from the n8n container for application logs, and make the n8n user folder available to the Agent for event bus file logs. For example, if the n8n data directory is mounted on the host at `/var/lib/n8n`, configure:

 ```yaml
 logs:
   - type: docker
     source: n8n
-    service: n8n
+    service: <SERVICE>
+  - type: file
+    path: /var/lib/n8n/n8nEventLog*.log
+    source: n8n
+    service: <SERVICE>
 ```

+If the Agent runs in a container, mount the n8n data volume or host directory into the Agent container and use the path as seen from inside the Agent container.
+
 3. [Restart the Agent][5].

 ### Validation

 [Run the Agent's status subcommand][6] and look for `n8n` under the Checks section.

-## Data Collected
+## Data collected

 ### Metrics
@@ -137,7 +222,7 @@ See [metadata.csv][7] for a list of metrics provided by this integration.

 The n8n integration does not include any events.

-### Service Checks
+### Service checks

 See [service_checks.json][8] for a list of service checks provided by this integration.
+- Add metrics introduced in n8n 2.x (workflow execution duration, audit events, authentication, workflow and user statistics, expression engine, and process memory).
+- Track n8n's dynamic events (workflow cancellations, audit activity, AI nodes, user and credential changes, package and variable changes).
+- Add support for monitoring n8n worker processes alongside the main process.