
feature:add service-to-middleware relations#117

Open
wsx864321 wants to merge 10 commits into VictoriaMetrics:master from wsx864321:feature/middleware_graph_0305

Conversation

@wsx864321

@wsx864321 wsx864321 commented Mar 6, 2026

Describe Your Changes

Add relationship mapping between services and middleware.
Note that the middleware name obtained here is db.system. According to the official OTel semantic conventions ( https://github.com/open-telemetry/semantic-conventions/blob/v1.32.0/docs/database/database-spans.md ), the attribute should be db.system.name, but the constant provided by the official SDK is db.system, so db.system is used here; this can be discussed.


Summary by cubic

Adds service-to-database edges to the service graph so you can see which services call which DBs. DB nodes are named <service>:<db> and can be limited or disabled via -servicegraph.databaseTaskLimit.

  • New Features
    • Adds vtselect.GetServiceDBGraphTimeRange to build svc→DB edges from client spans with span_attr:db.system.name; parent=resource_attr:service.name, child=<service>:<db.system.name>; aggregates callCount.
    • Introduces otelpb.SpanAttrDbSystemName; adds an integration test for svc→DB edges (e.g., serviceA:MongoDB) and documents the new flag.

Written for commit f6efba7. Summary will update on new commits.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="app/vtselect/traces/query/query.go">

<violation number="1" location="app/vtselect/traces/query/query.go:732">
P2: Middleware edge generation drops valid DB client relations when the parent span kind is not 2/5 due to restrictive inner-join filter.</violation>
</file>

<file name="lib/protoparser/opentelemetry/pb/trace_fields.go">

<violation number="1" location="lib/protoparser/opentelemetry/pb/trace_fields.go:73">
P2: Middleware detection uses `db.system` constant, which may miss spans following newer OTel semconv `db.system.name`.</violation>
</file>

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Add one-off context when rerunning by tagging @cubic-dev-ai with guidance or docs links (including llms.txt)
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@jiekun jiekun self-assigned this Mar 6, 2026
@wsx864321 wsx864321 force-pushed the feature/middleware_graph_0305 branch from d750959 to a689cc3 on March 6, 2026 07:45
@jiekun
Member

jiekun commented Mar 6, 2026

Hello. To align on the scenario this pull request wants to optimize:

  1. The current implementation only allows a client or producer span to be the "parent", and a server or consumer span to be the "child".
  2. The middleware calls, however, usually only contain a client span, hence the relation is missing in the service graph.

So this pull request wants to generate the relation only from the (middleware) client span, is that correct?

@wsx864321
Author

Hello. To align on the scenario this pull request wants to optimize:

  1. The current implementation only allows a client or producer span to be the "parent", and a server or consumer span to be the "child".
  2. The middleware calls, however, usually only contain a client span, hence the relation is missing in the service graph.

So this pull request wants to generate the relation only from the (middleware) client span, is that correct?

No; currently the 'client' span is the "child", and the 'server' or 'consumer' span is the "parent".

(NOT "span_attr:db.system": "") AND (kind:3) | fields parent_span_id,span_attr:db.system,span_attr:db.namespace | rename parent_span_id as span_id, span_attr:db.system as child, span_attr:db.namespace as namespace | join by (span_id) ((NOT span_id:"") AND (kind:~"2|4") | fields span_id, resource_attr:service.name | rename resource_attr:service.name as parent) inner | stats by (parent, child,namespace) count() callCount

This is the LogsQL query.

@wsx864321
Author

Hello. To align on the scenario this pull request wants to optimize:

  1. The current implementation only allows a client or producer span to be the "parent", and a server or consumer span to be the "child".
  2. The middleware calls, however, usually only contain a client span, hence the relation is missing in the service graph.

So this pull request wants to generate the relation only from the (middleware) client span, is that correct?

[
    [
        {
            "Name": "parent",
            "Value": "docmini-teststage-main"
        },
        {
            "Name": "child",
            "Value": "mysql:docmini_test"
        },
        {
            "Name": "callCount",
            "Value": "20"
        }
    ]
]

This is the query result.

@jiekun
Member

jiekun commented Mar 6, 2026

By "current implementation", I refer to what we have now in VictoriaMetrics. I'd like to confirm what's missing first, before checking the new implementation.

currently it is ‘client’ span to be the “child”, ‘server’ or ‘consumer’ span to be the "parent"

It seems this is not the case in current version (v0.8.0), could you confirm?

@wsx864321
Author

By "current implementation", I refer to what we have now in VictoriaMetrics. I'd like to confirm what's missing first, before checking the new implementation.

currently it is ‘client’ span to be the “child”, ‘server’ or ‘consumer’ span to be the "parent"

It seems this is not the case in current version (v0.8.0), could you confirm?

I misunderstood your meaning; you are right. I just want to add the relationship between services and middleware.

@jiekun
Member

jiekun commented Mar 6, 2026

(NOT "span_attr:db.system": "") AND (kind:3) | fields parent_span_id,span_attr:db.system,span_attr:db.namespace | rename parent_span_id as span_id, span_attr:db.system as child, span_attr:db.namespace as namespace | join by (span_id) ((NOT span_id:"") AND (kind:~"2|4") | fields span_id, resource_attr:service.name | rename resource_attr:service.name as parent) inner | stats by (parent, child,namespace) count() callCount

I see. The idea looks fine. I was thinking whether we really need that parent span involved in the LogsQL, because all the information could be found in the (middleware) client span:

  1. The client span will carry database info, which could be used as the child node of the relation.
  2. The client span will also carry resource attributes, which could be used as the parent node of the relation.

Please feel free to correct me if it's not the case. I'm not familiar with the semantic conventions and haven't tested it.

@jiekun jiekun self-requested a review March 6, 2026 08:15
@wsx864321
Author

(NOT "span_attr:db.system": "") AND (kind:3) | fields parent_span_id,span_attr:db.system,span_attr:db.namespace | rename parent_span_id as span_id, span_attr:db.system as child, span_attr:db.namespace as namespace | join by (span_id) ((NOT span_id:"") AND (kind:~"2|4") | fields span_id, resource_attr:service.name | rename resource_attr:service.name as parent) inner | stats by (parent, child,namespace) count() callCount


I see. The idea looks fine. I was thinking whether we really need that parent span involved in the LogsQL, because all the information could be found in the (middleware) client span:

  1. The client span will carry database info, which could be used as the child node of the relation.
  2. The client span will also carry resource attributes, which could be used as the parent node of the relation.

Please feel free to correct me if it's not the case. I'm not familiar with the semantic conventions and haven't tested it.


Exactly, there's really no need for 'parent', because there is 'resource_attr:service.name'. Can I revise it and resubmit?

@wsx864321
Author

(NOT "span_attr:db.system": "") AND (kind:3) | fields parent_span_id,span_attr:db.system,span_attr:db.namespace | rename parent_span_id as span_id, span_attr:db.system as child, span_attr:db.namespace as namespace | join by (span_id) ((NOT span_id:"") AND (kind:~"2|4") | fields span_id, resource_attr:service.name | rename resource_attr:service.name as parent) inner | stats by (parent, child,namespace) count() callCount

I see. The idea looks fine. I was thinking whether we really need that parent span involved in the LogsQL, because all the information could be found in the (middleware) client span:

  1. The client span will carry database info, which could be used as the child node of the relation.
  2. The client span will also carry resource attributes, which could be used as the parent node of the relation.

Please feel free to correct me if it's not the case. I'm not familiar with the semantic conventions and haven't tested it.

(NOT "span_attr:db.system": "") AND (kind:3) | fields parent_span_id,span_attr:db.system,span_attr:db.namespace,resource_attr:service.name | rename parent_span_id as span_id, span_attr:db.system as child, span_attr:db.namespace as namespace, resource_attr:service.name as parent | stats by (parent, child,namespace) count() callCount

This is the LogsQL.
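The simplified query filters client spans that carry db.system and aggregates call counts by (service, db system). A minimal Go sketch of the same aggregation follows; the span and edge types are hypothetical illustrations, not the PR's actual code:

```go
package main

import "fmt"

// span is a hypothetical, minimal view of the fields the LogsQL reads.
type span struct {
	Kind        int    // OTel span kind; 3 == client
	DBSystem    string // span_attr:db.system
	ServiceName string // resource_attr:service.name
}

// edge is one service-to-database relation in the service graph.
type edge struct {
	Parent string // calling service
	Child  string // database system
}

// buildDBEdges mirrors the query: keep client spans carrying db.system,
// then count calls grouped by (parent, child).
func buildDBEdges(spans []span) map[edge]int {
	callCount := make(map[edge]int)
	for _, s := range spans {
		if s.Kind != 3 || s.DBSystem == "" {
			continue
		}
		callCount[edge{Parent: s.ServiceName, Child: s.DBSystem}]++
	}
	return callCount
}

func main() {
	spans := []span{
		{Kind: 3, DBSystem: "mysql", ServiceName: "docmini-teststage-main"},
		{Kind: 3, DBSystem: "mysql", ServiceName: "docmini-teststage-main"},
		{Kind: 2, ServiceName: "docmini-teststage-main"}, // server span: skipped
	}
	fmt.Println(buildDBEdges(spans)[edge{"docmini-teststage-main", "mysql"}]) // prints 2
}
```

Because everything comes from the single client span, no join with a parent span is needed, which matches the conclusion above.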

@jiekun
Member

jiekun commented Mar 6, 2026

Feel free to update the branch or perform a force push.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="app/vtselect/traces/query/query.go">

<violation number="1" location="app/vtselect/traces/query/query.go:617">
P1: `GetServiceGraphTimeRange` was repurposed to db/middleware edge extraction but is still used as the core service-to-service graph path, causing functional regression in background graph generation.</violation>

<violation number="2" location="app/vtselect/traces/query/query.go:624">
P2: Service graph namespace dimension is introduced on generation but dropped on read aggregation, causing cross-namespace edge collapse.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@wsx864321 wsx864321 force-pushed the feature/middleware_graph_0305 branch 2 times, most recently from c0c56d0 to b3321bf on March 6, 2026 09:04
@wsx864321
Author

Feel free to update the branch or perform a force push.

Updated the code. Please review, thanks.

@wsx864321 wsx864321 force-pushed the feature/middleware_graph_0305 branch 2 times, most recently from 3f6a94e to e5de53c on March 6, 2026 09:32
@wsx864321 wsx864321 force-pushed the feature/middleware_graph_0305 branch from e6ba3fe to 89800f8 on March 6, 2026 09:34
@jiekun
Member

jiekun commented Mar 9, 2026

Hi. While I'm testing this, may I ask if you know of other projects that are also doing similar things (generating relationships from a single span)? This would help us align with them, especially as we consider what the service graph should look like in the future iteration.

@wsx864321
Author

Hi. While I'm testing this, may I ask if you know of other projects that are also doing similar things (generating relationships from a single span)? This would help us align with them, especially as we consider what the service graph should look like in the future iteration.

Sorry, I haven't researched any other projects yet.

@utrack

utrack commented Mar 10, 2026

@jiekun Datadog is doing the same and it's a great one :) I've reimplemented the same thing in our dashboards using Grafana Infinity and a bunch of JSON mutation logic, but it would be great to have it OOTB.

Here the Postgres does not report any metrics.


@wsx864321
Author

@jiekun Datadog are doing the same and it's a great one :) I've reimplemented the same thing in our dashboards using grafana infinity and a bunch of JSON mutation logic, but it would be great to have it OOTB.

Here the Postgres does not report any metrics.


This looks great; it seems we can also achieve the same function, but the data may not be accurate in the case of tail sampling. I will also study Datadog. Thanks!

@jiekun
Member

jiekun commented Mar 11, 2026

It looks good overall. But I think conditions like namespace, the extra enable and limit flags, and some others could be removed. Would you mind me editing the branch to save you some time?

@jiekun
Member

jiekun commented Mar 11, 2026

Also, regarding the use of db.system instead of db.system.name, I checked the change history, and this change was introduced 13 months ago: open-telemetry/semantic-conventions@62d5a7c

The new attribute name db.system.name is now marked as stable, while the previous one is deprecated. I understand that OTel changes things frequently, but is it possible for you to solve it by:

  • upgrade the instrumentation SDK.
  • use the attributes processor in the OTel Collector to change the attribute name to fit the semantic-conventions schema?
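The collector-side rename could be sketched with the contrib attributes processor. This is a hedged example: the processor instance name (attributes/rename-db-system) and the exact wiring are illustrative, not part of this PR.

```yaml
processors:
  attributes/rename-db-system:
    actions:
      # Copy the deprecated attribute into the stable name when it is absent.
      - key: db.system.name
        from_attribute: db.system
        action: insert
      # Optionally drop the deprecated attribute afterwards.
      - key: db.system
        action: delete
```

The processor then has to be listed in the collector's traces pipeline so spans are rewritten before export.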

@wsx864321
Author

It looks good overall. But I think conditions like namespace, the extra enable and limit flags, and some others could be removed. Would you mind me editing the branch to save you some time?

Of course, you can modify it as you wish

@wsx864321
Author

Also, regarding the use of db.system instead of db.system.name, I checked the change history, and this change was introduced 13 months ago: open-telemetry/semantic-conventions@62d5a7c

The new attribute name db.system.name, now, is marked as stable while the previous one is deprecated. I understand that OTel changes things frequently but is it possible for you to solve it by:

  • upgrade the instrumentation SDK.
  • use attributeprocessor in the OTel collector to change the attribute name to fit the semantic-conventions schema?

OK, then let's use the latest standard db.system.name, and I'll upgrade our SDK to solve it.

1. Simplified the child node by removing `namespace`. The `namespace` info may be brought back, since one service can connect to multiple databases of the same type (e.g. MySQL), and the `namespace` is an important identifier. However, the existing service graphs don't carry this fine-grained information, and it's important to align them. This may be improved together when the VictoriaTraces/Jaeger API supports visualizing more info.
2. Removed the extra flags for enabling the svc-db service graph. It can share the same limit as svc-svc, since the svc-svc relation (seemingly) has higher priority.

The change could be reverted if these are not the case in the real world. But let's start from the minimal implementation and iterate when necessary.

Signed-off-by: Jiekun <jiekun@victoriametrics.com>
@jiekun
Member

jiekun commented Mar 11, 2026

Please see the follow-up commit a7bc1d1.

However, during my retesting, it appears that the "database" node without an identifier in the global view may be referenced by many services—even though these services are not actually pointing to the "same database" instance or cluster.


Each "edge" of the relation still shows the correct call count (e.g., the accounting service -> the postgresql database: call count = 4). However, the postgresql node in these cases is not the same one across different services.

I have a concern about this:

  • The identifier is not clear; I'm not sure if we should use namespace or maybe there's something else. Theoretically, each service has its own database, so simply using service name + db.system.name would solve the issue. But it may generate 2x the relations.

Edit: Alright. I think it's better to adopt the svc name:db name approach as mentioned. And I brought back the limit for svc-db relations so users can set it to 0 to disable this functionality.


Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


5 issues found across 5 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="app/vtselect/traces/query/query.go">

<violation number="1" location="app/vtselect/traces/query/query.go:712">
P2: DB namespace dimension was dropped from service-to-DB aggregation, collapsing distinct DB targets into one edge per DB system.</violation>
</file>

<file name="app/victoria-traces/servicegraph/servicegraph.go">

<violation number="1" location="app/victoria-traces/servicegraph/servicegraph.go:115">
P2: Removing middleware flags and guard introduces backward-incompatible startup risk and removes previous middleware opt-out behavior.</violation>

<violation number="2" location="app/victoria-traces/servicegraph/servicegraph.go:121">
P1: DB graph fetch error causes early continue, dropping already-fetched service graph rows for that tenant/time window.</violation>
</file>

<file name="lib/protoparser/opentelemetry/pb/trace_fields.go">

<violation number="1" location="lib/protoparser/opentelemetry/pb/trace_fields.go:42">
P2: DB service-graph field key changed to `span_attr:db.system.name`, which can miss existing telemetry that uses `db.system` and underreport service→middleware relations.</violation>
</file>

<file name="deployment/docker/compose-vt-single.yml">

<violation number="1" location="deployment/docker/compose-vt-single.yml:36">
P2: Default compose now references a fork/dirty VictoriaTraces image tag, creating pull reliability and provenance risks for users.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@wsx864321
Author

Please see the follow-up commit a7bc1d1.

However, during my retesting, it appears that the "database" node without an identifier in the global view may be referenced by many services—even though these services are not actually pointing to the "same database" instance or cluster.

Each "edge" of the relation still shows the correct call count (e.g., the `accounting` service -> the `postgresql` database: call count = 4). However, the `postgresql` node in these cases is not the same one across different services.

I have a concern about this:

  • The identifier is not clear; I'm not sure if we should use namespace or maybe there's something else. Theoretically, each service has its own database, so simply using service name + db.system.name would solve the issue. But it may generate 2x the relations.

Edit: Alright. I think it's better to adopt the svc name:db name approach as mentioned. And I brought back the limit for svc-db relations so users can set it to 0 to disable this functionality.


Got it. I used namespace to add the database name; you can refer to ( https://github.com/open-telemetry/semantic-conventions/blob/main/docs/db/database-spans.md ). Of course, I can also accept using svc:dbname here.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apptest/tests/service_graph_test.go">

<violation number="1" location="apptest/tests/service_graph_test.go:77">
P2: The test assumes deterministic dependency ordering, but API/query code does not guarantee order, so this assertion can be flaky with multiple edges.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

…VictoriaTraces into fork/wsx864321/feature/middleware_graph_0305
@codecov

codecov bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 0% with 82 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@7918002). Learn more about missing BASE report.

Files with missing lines Patch % Lines
app/vtselect/traces/query/query.go 0.00% 63 Missing ⚠️
app/victoria-traces/servicegraph/servicegraph.go 0.00% 19 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             master    #117   +/-   ##
========================================
  Coverage          ?   7.71%           
========================================
  Files             ?      61           
  Lines             ?    8829           
  Branches          ?       0           
========================================
  Hits              ?     681           
  Misses            ?    8053           
  Partials          ?      95           

☔ View full report in Codecov by Sentry.

@jiekun
Member

jiekun commented Mar 11, 2026

@wsx864321 please take a look if you can, and maybe build and try it with your data. We'd love to see the feedback. If you need help building a binary or docker image, please let me know.

If all good, it will be merged to the upstream very soon.

@wsx864321
Author

@wsx864321 please take a look if you can, and maybe build and try it with your data. We'd love to see the feedback. If you need help building a binary or docker image, please let me know.

If all good, it will be merged to the upstream very soon.

I have no further questions; it meets my expectations. You can merge at any time.

@utrack

utrack commented Mar 11, 2026

Apologies for piping up here :) Are you sure you want to display db.system.name (postgres) instead of server.address or db.namespace+"@"+server.address? semconv

When building a graph you probably want to make the nodes clickable - and there's nothing to click if there's no semi-unique ID.


EDIT: it's probably more about the overall approach regarding mapping the 'inferred entities', which can be both external services and databases - feel free to ignore me and we can chat in a separate issue

@jiekun
Member

jiekun commented Mar 11, 2026

I personally don't have much experience with semconv, so I will rely heavily on your feedback. Any comments (both positive and negative) are welcome.

During my testing, I noticed that db.namespace is sometimes empty (likely because it is not marked as required). And for server.address, a single application (client) may connect to multiple instances of the same database (due to sharding, partitioning, etc.), so I also think this is not ideal -- service-to-service relations do not yet use any IP to distinguish the nodes.

My expectations are as follows:

  1. db.system.name still seems meaningful to me (but feel free to disagree, and I’m open to hearing your reasoning :) ).
  2. We should use an alternative to identify this node without relying on the instance attribute (a unique identifier like IP:Port is too fine-grained; any DB or middleware cluster should be displayed as a single node in the service graph).

Anyway, since there are different ideas, I think we can hold this pull request for now and see if we find a better way before merging the initial implementation.

@utrack

utrack commented Mar 11, 2026

There's essentially a couple of different tradeoffs from what I've seen; it's nearly impossible to make it work for everyone 😅

The problem 2) would only exist if you're using network.peer.address instead of server.address I think.

As an example, the network.peer.address is a resolved IP (which can change+is different when there's a multi-node setup), while server.address contains the hostname, which is stable.

Here's a brain dump on my experience of making it work on our side:


Approach 0: db.system.name

Pros:

  • There is a node on the graph

Cons:

  • On the graph, all the services using a single DB type will 'cloud' around a single node, even if they use different DB instances
  • All other cons below will apply

Approach 1: (caller)+db.system.name

Pros:

  • Stable, unique graph nodes
  • There will never be a 'cloud of nodes' situation

Cons:

  • If a service uses two or more DBs of the same type - there will only be a single node in the graph
  • You won't see it on the graph if two services are using the same DB

Approach 2: db.namespace

Pros:

  • Stable graph nodes

Cons:

  • db.namespace is optional, as you've mentioned
  • If there are many DB instances, then we may see the 'cloud of nodes' around a default namespace - even though they're located in the different instances

This is what Datadog uses by default; before that, we had a cloud around the main postgres DB.

The Datadog agent turns db.name into peer.db.name by default.
To fix the cloud of nodes, we just put the DB's hostname (not IP) into the db.name for DD.

Real-life sample of a cloud of services connecting to the 'main' namespace from another demo follows. Those have different server.address/db.instance:


Approach 3: (network.peer.address)

This is the IP of the DB node you're connected to.

Pros:

  • You can see a node for every replica you have

Cons:

  • You can see a node for every replica you have 😅
  • You get a new graph node every time the DNS+routing changes
  • False negatives when looking for services that use the same DB - i.e. two services may connect to different IPs and get different nodes, but they will still use the same instance

Approach 4: (server.address)

Pros:

  • Stable graph nodes
  • Services that use the same DB instance will be connected
  • If you have RO database replicas, then they have their own DNS name - so, you get 1 node for the DB master and 1 for the replicas. Useful when debugging the RO/RW selection logic

Cons:

  • Cloud of nodes if your setup is a single beefy DB cluster that's split by namespaces for different services.
  • Falls apart if you have many DNS names which lead to the same instance

Approach 5: (db.namespace)+(server.address)

We effectively fall back to A.4 if db.namespace is not reported.

Pros:

  • Same as A.4, plus:
  • Only those services that use the same DB namespace on the same instance will be connected

Cons:

  • If you have a 'single beefy DB cluster' setup, then the node's names might be verbose
  • Cloud of nodes if you have a 'single beefy DB cluster' setup, AND you don't report db.namespace

Personally, we went for the A.4 which works for us, but I'd say that A.5 is more suitable for an open solution.
If a single DB instance has many DNS names (see A.4 con 2), then I'm inclined to classify this as a 'you problem' :D. We would need to get a DB identifier somehow, but it's not a part of the standard OTel spec.
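Approach 5 amounts to a small naming rule: prefer db.namespace joined with server.address, and fall back to server.address alone when db.namespace is not reported (it is optional in semconv). A hypothetical Go helper, not part of this PR, could look like:

```go
package main

import "fmt"

// dbNodeName sketches Approach 5: identify a database node by
// db.namespace + "@" + server.address, falling back to server.address
// alone (Approach 4) when db.namespace is missing.
// The attribute names come from OTel semconv; the helper is illustrative.
func dbNodeName(dbNamespace, serverAddress string) string {
	if dbNamespace == "" {
		return serverAddress
	}
	return dbNamespace + "@" + serverAddress
}

func main() {
	fmt.Println(dbNodeName("docmini_test", "mysql-prod.internal")) // prints docmini_test@mysql-prod.internal
	fmt.Println(dbNodeName("", "mysql-prod.internal"))             // prints mysql-prod.internal
}
```

The fallback keeps the node stable when only the hostname is available, at the cost of the "single beefy cluster" caveats listed above.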

@wsx864321
Author

There's essentially a couple of different tradeoffs from what I've seen; it's nearly impossible to make it work for everyone 😅

The problem 2) would only exist if you're using network.peer.address instead of server.address I think.

As an example, the network.peer.address is a resolved IP (which can change+is different when there's a multi-node setup), while server.address contains the hostname, which is stable.

Here's a brain dump on my experience of making it work on our side:

Approach 0: db.system.name

Pros:

  • There is a node on the graph

Cons:

  • On the graph, all the services using a single DB type will 'cloud' around a single node, even if they use different DB instances
  • All other cons below will apply

Approach 1: (caller)+db.system.name

Pros:

  • Stable, unique graph nodes
  • There will never be a 'cloud of nodes' situation

Cons:

  • If a service uses two or more DBs of the same type - there will only be a single node in the graph
  • You won't see it on the graph if two services are using the same DB
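For reference, this is the approach the PR currently implements (per the summary, the child node is named &lt;service&gt;:&lt;db.system.name&gt;). A minimal Go sketch of that naming — the function name is illustrative, not the PR's actual code:

```go
package main

import "fmt"

// childNodeName builds the DB child node per Approach 1:
// (caller)+db.system.name, e.g. "serviceA:MongoDB".
// Note the con above: two different MongoDB instances used by
// serviceA collapse into this single node.
func childNodeName(service, dbSystem string) string {
	return fmt.Sprintf("%s:%s", service, dbSystem)
}

func main() {
	fmt.Println(childNodeName("serviceA", "MongoDB")) // serviceA:MongoDB
}
```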

Approach 2: db.namespace

Pros:

  • Stable graph nodes

Cons:

  • db.namespace is optional, as you've mentioned
  • If there are many DB instances, then we may see the 'cloud of nodes' around a default namespace - even though they're located in the different instances

This is what Datadog uses by default, and previously we had a cloud of nodes around our main Postgres DB.

The Datadog agent turns db.name into peer.db.name by default. To fix the cloud of nodes, we just put the DB's hostname (not IP) into the db.name for DD.

A real-life sample of a cloud of services connecting to the 'main' namespace, from another demo, follows. Those have different server.address/db.instance:

image

Approach 3: (network.peer.address)

This is the IP of the DB node you're connected to.

Pros:

  • You can see a node for every replica you have

Cons:

  • You can see a node for every replica you have 😅
  • You get a new graph node every time the DNS+routing changes
  • False negatives when looking for services that use the same DB - i.e. two services may connect to different IPs and get different nodes, but they will still use the same instance

(Approaches 4 and 5 and the conclusion are quoted verbatim above.)

In fact, I would prefer (db.system.name)+(db.namespace) for several reasons:

  • In the OTel specification, both network.peer.address and server.address are only recommended attributes, not required, while db.namespace is a required attribute.
  • With either network.peer.address or server.address, the graph becomes very complex in a master-slave or cluster architecture
  • Usually, different db.namespaces also represent different instances
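The preferred combination above can be sketched the same way; this is an illustrative fragment (names and separator are assumptions, not this PR's code). When db.namespace is missing it degrades to Approach 0, a single node per DB type:

```go
package main

import "fmt"

// nodeKey sketches the (db.system.name)+(db.namespace) naming:
// stable names when the namespace is reported, and a single
// per-DB-type node (Approach 0) when it is not.
func nodeKey(dbSystem, dbNamespace string) string {
	if dbNamespace == "" {
		return dbSystem
	}
	return dbSystem + ":" + dbNamespace
}

func main() {
	fmt.Println(nodeKey("postgresql", "orders")) // postgresql:orders
	fmt.Println(nodeKey("postgresql", ""))       // postgresql
}
```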

@jiekun cc

@utrack

utrack commented Mar 12, 2026

In a cluster architecture you usually have either a single DNS name, or two names (one is a fast read tier and the second is a master for the writes).

The problem with the namespaces is that there's a default namespace which is used most of the time - like the 'main' example on my screenshot.
There can be different namespaces, and they do usually represent different instances - however, same namespace does not mean it's the same instance.

@jiekun
Member

jiekun commented Mar 12, 2026

In a cluster architecture you usually have either a single DNS name, or two names

I agree with this actually, but it's not the case I've experienced. At my previous job, for cluster connections the administrator provided different addresses for different instances, and each address is a DNS record that can be switched easily when an incident happens.

So the case here is that different users may have different deployments, and it's hard to reach a consensus for now. I've got feedback from both of you, but it's hard to decide.

@utrack

utrack commented Mar 12, 2026

Agree, I just remembered MongoDB connection strings, and they can have as many 'seed' DBs as they want...

Maybe just make it configurable? Grafana/otelcol's servicegraph has something called 'dimensions' in its config, which controls how the service graph is built.
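The configurable-dimensions idea could look roughly like the sketch below: the node name is built from whichever configured span attributes are present, in order. The attribute names, separator, and function are illustrative assumptions, not an existing config format:

```go
package main

import (
	"fmt"
	"strings"
)

// nodeKeyFromDimensions builds a graph-node name from a
// configured, ordered list of span-attribute names, skipping
// any attribute that is absent or empty on a given span.
func nodeKeyFromDimensions(attrs map[string]string, dimensions []string) string {
	var parts []string
	for _, d := range dimensions {
		if v, ok := attrs[d]; ok && v != "" {
			parts = append(parts, v)
		}
	}
	return strings.Join(parts, ":")
}

func main() {
	attrs := map[string]string{
		"db.system.name": "postgresql",
		"server.address": "pg-main.prod.internal",
		"db.namespace":   "orders",
	}
	// An A.5-style dimension set:
	fmt.Println(nodeKeyFromDimensions(attrs, []string{"server.address", "db.namespace"})) // pg-main.prod.internal:orders
}
```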

@wsx864321
Author

Perhaps we can proceed with the current, most certain solution, which can then be continuously evolved without compatibility issues.

