
feature:add service-to-middleware relations#117

Open
wsx864321 wants to merge 10 commits into VictoriaMetrics:master from wsx864321:feature/middleware_graph_0305

Conversation

@wsx864321

@wsx864321 wsx864321 commented Mar 6, 2026

Describe Your Changes

Add relationship mapping between services and middleware.
Note that the middleware name obtained here is db.system. According to the official OTel semantic conventions ( https://github.com/open-telemetry/semantic-conventions/blob/v1.32.0/docs/database/database-spans.md ), the attribute should be db.system.name, but the constant provided by the official SDK is db.system, so db.system is used here; this can be discussed.


Summary by cubic

Adds service-to-database edges to the service graph so you can see which services call which DBs. DB nodes are named <service>:<db> and can be limited or disabled via -servicegraph.databaseTaskLimit.

  • New Features
    • Adds vtselect.GetServiceDBGraphTimeRange to build svc→DB edges from client spans with span_attr:db.system.name; parent=resource_attr:service.name, child=<service>:<db.system.name>; aggregates callCount.
    • Introduces otelpb.SpanAttrDbSystemName; adds an integration test for svc→DB edges (e.g., serviceA:MongoDB) and documents the new flag.

Written for commit f6efba7. Summary will update on new commits.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 4 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="app/vtselect/traces/query/query.go">

<violation number="1" location="app/vtselect/traces/query/query.go:732">
P2: Middleware edge generation drops valid DB client relations when the parent span kind is not 2/5 due to restrictive inner-join filter.</violation>
</file>

<file name="lib/protoparser/opentelemetry/pb/trace_fields.go">

<violation number="1" location="lib/protoparser/opentelemetry/pb/trace_fields.go:73">
P2: Middleware detection uses `db.system` constant, which may miss spans following newer OTel semconv `db.system.name`.</violation>
</file>

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Add one-off context when rerunning by tagging @cubic-dev-ai with guidance or docs links (including llms.txt)
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@jiekun jiekun self-assigned this Mar 6, 2026
@wsx864321 wsx864321 force-pushed the feature/middleware_graph_0305 branch from d750959 to a689cc3 on March 6, 2026 07:45
@jiekun
Member

jiekun commented Mar 6, 2026

Hello. To align on the scenario this pull request wants to optimize:

  1. The current implementation only allows a client or producer span to be the "parent", and a server or consumer span to be the "child".
  2. The middleware calls, however, usually only contain a client span, hence the relation is missing in the service graph.

So this pull request wants to generate the relation only from the (middleware) client span, is that correct?

@wsx864321
Author

Hello. To align on the scenario this pull request wants to optimize:

  1. The current implementation only allows a client or producer span to be the "parent", and a server or consumer span to be the "child".
  2. The middleware calls, however, usually only contain a client span, hence the relation is missing in the service graph.

So this pull request wants to generate the relation only from the (middleware) client span, is that correct?

No; currently the 'client' span is the "child", and the 'server' or 'consumer' span is the "parent".

(NOT "span_attr:db.system": "") AND (kind:3) | fields parent_span_id,span_attr:db.system,span_attr:db.namespace | rename parent_span_id as span_id, span_attr:db.system as child, span_attr:db.namespace as namespace | join by (span_id) ((NOT span_id:"") AND (kind:~"2|4") | fields span_id, resource_attr:service.name | rename resource_attr:service.name as parent) inner | stats by (parent, child,namespace) count() callCount

This is the LogsQL query.

@wsx864321
Author

Hello. To align on the scenario this pull request wants to optimize:

  1. The current implementation only allows a client or producer span to be the "parent", and a server or consumer span to be the "child".
  2. The middleware calls, however, usually only contain a client span, hence the relation is missing in the service graph.

So this pull request wants to generate the relation only from the (middleware) client span, is that correct?

[
    [
        {
            "Name": "parent",
            "Value": "docmini-teststage-main"
        },
        {
            "Name": "child",
            "Value": "mysql:docmini_test"
        },
        {
            "Name": "callCount",
            "Value": "20"
        }
    ]
]

This is the query result.

@jiekun
Member

jiekun commented Mar 6, 2026

By "current implementation", I refer to what we have now in VictoriaMetrics. I'd like to confirm what's missing first, before checking the new implementation.

currently it is ‘client’ span to be the “child”, ‘server’ or ‘consumer’ span to be the "parent"

It seems this is not the case in current version (v0.8.0), could you confirm?

@wsx864321
Author

By "current implementation", I refer to what we have now in VictoriaMetrics. I'd like to confirm what's missing first, before checking the new implementation.

currently it is ‘client’ span to be the “child”, ‘server’ or ‘consumer’ span to be the "parent"

It seems this is not the case in current version (v0.8.0), could you confirm?

I misunderstood your meaning; you are right. I just want to add the relationship between services and middleware.

@jiekun
Member

jiekun commented Mar 6, 2026

(NOT "span_attr:db.system": "") AND (kind:3) | fields parent_span_id,span_attr:db.system,span_attr:db.namespace | rename parent_span_id as span_id, span_attr:db.system as child, span_attr:db.namespace as namespace | join by (span_id) ((NOT span_id:"") AND (kind:~"2|4") | fields span_id, resource_attr:service.name | rename resource_attr:service.name as parent) inner | stats by (parent, child,namespace) count() callCount

I see. The idea looks fine. I was thinking whether we really need that parent span involved in the LogsQL, because all the information could be found in the (middleware) client span:

  1. The client span will carry database info, which could be used as the child node of the relation.
  2. The client span will also carry resource attributes, which could be used as the parent node of the relation.

Please feel free to correct me if it's not the case. I'm not familiar with the semantic conventions and haven't tested it.

@jiekun jiekun self-requested a review March 6, 2026 08:15
@wsx864321
Author

(NOT "span_attr:db.system": "") AND (kind:3) | fields parent_span_id,span_attr:db.system,span_attr:db.namespace | rename parent_span_id as span_id, span_attr:db.system as child, span_attr:db.namespace as namespace | join by (span_id) ((NOT span_id:"") AND (kind:~"2|4") | fields span_id, resource_attr:service.name | rename resource_attr:service.name as parent) inner | stats by (parent, child,namespace) count() callCount


I see. The idea looks fine. I was thinking whether we really need that parent span involved in the LogsQL, because all the information could be found in the (middleware) client span:

  1. The client span will carry database info, which could be used as the child node of the relation.
  2. The client span will also carry resource attributes, which could be used as the parent node of the relation.

Please feel free to correct me if it's not the case. I'm not familiar with the semantic conventions and haven't tested it.


Exactly, there's really no need for 'parent', because there is 'resource_attr:service.name'. Can I revise it and resubmit?

@wsx864321
Author

(NOT "span_attr:db.system": "") AND (kind:3) | fields parent_span_id,span_attr:db.system,span_attr:db.namespace | rename parent_span_id as span_id, span_attr:db.system as child, span_attr:db.namespace as namespace | join by (span_id) ((NOT span_id:"") AND (kind:~"2|4") | fields span_id, resource_attr:service.name | rename resource_attr:service.name as parent) inner | stats by (parent, child,namespace) count() callCount

I see. The idea looks fine. I was thinking whether we really need that parent span involved in the LogsQL, because all the information could be found in the (middleware) client span:

  1. The client span will carry database info, which could be used as the child node of the relation.
  2. The client span will also carry resource attributes, which could be used as the parent node of the relation.

Please feel free to correct me if it's not the case. I'm not familiar with the semantic conventions and haven't tested it.

(NOT "span_attr:db.system": "") AND (kind:3) | fields parent_span_id,span_attr:db.system,span_attr:db.namespace,resource_attr:service.name | rename parent_span_id as span_id, span_attr:db.system as child, span_attr:db.namespace as namespace, resource_attr:service.name as parent | stats by (parent, child,namespace) count() callCount

This is the LogsQL.
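The simplified query filters client spans that carry db.system and aggregates call counts by (service, db system). A minimal Go sketch of the same aggregation follows; the span and edge types are hypothetical illustrations, not the PR's actual code:

```go
package main

import "fmt"

// span is a hypothetical, minimal view of the fields the LogsQL reads.
type span struct {
	Kind        int    // OTel span kind; 3 == client
	DBSystem    string // span_attr:db.system
	ServiceName string // resource_attr:service.name
}

// edge is one service-to-database relation in the service graph.
type edge struct {
	Parent string // calling service
	Child  string // database system
}

// buildDBEdges mirrors the query: keep client spans carrying db.system,
// then count calls grouped by (parent, child).
func buildDBEdges(spans []span) map[edge]int {
	callCount := make(map[edge]int)
	for _, s := range spans {
		if s.Kind != 3 || s.DBSystem == "" {
			continue
		}
		callCount[edge{Parent: s.ServiceName, Child: s.DBSystem}]++
	}
	return callCount
}

func main() {
	spans := []span{
		{Kind: 3, DBSystem: "mysql", ServiceName: "docmini-teststage-main"},
		{Kind: 3, DBSystem: "mysql", ServiceName: "docmini-teststage-main"},
		{Kind: 2, ServiceName: "docmini-teststage-main"}, // server span: skipped
	}
	fmt.Println(buildDBEdges(spans)[edge{"docmini-teststage-main", "mysql"}]) // prints 2
}
```

Because everything comes from the single client span, no join with a parent span is needed, which matches the conclusion above.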

@jiekun
Member

jiekun commented Mar 6, 2026

Feel free to update the branch or perform a force push.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="app/vtselect/traces/query/query.go">

<violation number="1" location="app/vtselect/traces/query/query.go:617">
P1: `GetServiceGraphTimeRange` was repurposed to db/middleware edge extraction but is still used as the core service-to-service graph path, causing functional regression in background graph generation.</violation>

<violation number="2" location="app/vtselect/traces/query/query.go:624">
P2: Service graph namespace dimension is introduced on generation but dropped on read aggregation, causing cross-namespace edge collapse.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@wsx864321 wsx864321 force-pushed the feature/middleware_graph_0305 branch 2 times, most recently from c0c56d0 to b3321bf on March 6, 2026 09:04
@wsx864321
Author

Feel free to update the branch or perform a force push.

Updated the code. Please review, thanks.

@wsx864321 wsx864321 force-pushed the feature/middleware_graph_0305 branch 2 times, most recently from 3f6a94e to e5de53c on March 6, 2026 09:32
@wsx864321 wsx864321 force-pushed the feature/middleware_graph_0305 branch from e6ba3fe to 89800f8 on March 6, 2026 09:34
@jiekun
Member

jiekun commented Mar 9, 2026

Hi. While I'm testing this, may I ask if you know of other projects that are also doing similar things (generating relationships from a single span)? This would help us align with them, especially as we consider what the service graph should look like in the future iteration.

@wsx864321
Author

Hi. While I'm testing this, may I ask if you know of other projects that are also doing similar things (generating relationships from a single span)? This would help us align with them, especially as we consider what the service graph should look like in the future iteration.

Sorry, I haven't researched any other projects yet.

@utrack

utrack commented Mar 10, 2026

@jiekun Datadog is doing the same and it's a great one :) I've reimplemented the same thing in our dashboards using Grafana Infinity and a bunch of JSON mutation logic, but it would be great to have it OOTB.

Here the Postgres does not report any metrics.


@wsx864321
Author

@jiekun Datadog are doing the same and it's a great one :) I've reimplemented the same thing in our dashboards using grafana infinity and a bunch of JSON mutation logic, but it would be great to have it OOTB.

Here the Postgres does not report any metrics.


This looks great; it seems we can also achieve the same function, but the data may not be accurate in the case of tail sampling. I will also study Datadog. Thanks!

@jiekun
Member

jiekun commented Mar 11, 2026

It looks good overall. But I think conditions like namespace, the extra enable and limit flags, and some others could be removed. Would you mind me editing the branch to save you some time?

@jiekun
Member

jiekun commented Mar 11, 2026

Also, regarding the use of db.system instead of db.system.name, I checked the change history, and this change was introduced 13 months ago: open-telemetry/semantic-conventions@62d5a7c

The new attribute name db.system.name is now marked as stable, while the previous one is deprecated. I understand that OTel changes things frequently, but is it possible for you to solve it by:

  • upgrade the instrumentation SDK.
  • use the attributes processor in the OTel Collector to change the attribute name to fit the semantic-conventions schema?
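The collector-side rename could be sketched with the contrib attributes processor. This is a hedged example: the processor instance name (attributes/rename-db-system) and the exact wiring are illustrative, not part of this PR.

```yaml
processors:
  attributes/rename-db-system:
    actions:
      # Copy the deprecated attribute into the stable name when it is absent.
      - key: db.system.name
        from_attribute: db.system
        action: insert
      # Optionally drop the deprecated attribute afterwards.
      - key: db.system
        action: delete
```

The processor then has to be listed in the collector's traces pipeline so spans are rewritten before export.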

@wsx864321
Author

It looks good overall. But I think conditions like namespace, the extra enable and limit flags, and some others could be removed. Would you mind me editing the branch to save you some time?

Of course, you can modify it as you wish

@wsx864321
Author

Also, regarding the use of db.system instead of db.system.name, I checked the change history, and this change was introduced 13 months ago: open-telemetry/semantic-conventions@62d5a7c

The new attribute name db.system.name, now, is marked as stable while the previous one is deprecated. I understand that OTel changes things frequently but is it possible for you to solve it by:

  • upgrade the instrumentation SDK.
  • use attributeprocessor in the OTel collector to change the attribute name to fit the semantic-conventions schema?

OK, then let's use the latest standard db.system.name, and I'll upgrade our SDK to solve it.

1. Simplified the child node by removing `namespace`. The `namespace` info may be brought back, since one service can connect to multiple databases of the same type (e.g. MySQL), and the `namespace` is an important identifier. However, the existing service graphs don't carry this fine-grained information, and it's important to align them. This may be improved together when the VictoriaTraces/Jaeger API supports visualizing more info.
2. Removed the extra flags for enabling the svc-db service graph. It can share the same limit as svc-svc, since the svc-svc relation (seemingly) has higher priority.

The change could be reverted if these are not the case in the real world. But let's start from the minimal implementation and iterate when necessary.

Signed-off-by: Jiekun <jiekun@victoriametrics.com>
@jiekun
Member

jiekun commented Mar 11, 2026

Please see the follow-up commit a7bc1d1.

However, during my retesting, it appears that the "database" node without an identifier in the global view may be referenced by many services—even though these services are not actually pointing to the "same database" instance or cluster.


Each "edge" of the relation still shows the correct call count (e.g., the accounting service -> the postgresql database: call count = 4). However, the postgresql node in these cases is not the same one across different services.

I have a concern about this:

  • The identifier is not clear; I'm not sure if we should use namespace or maybe there's something else. Theoretically, each service has its own database, so simply using service name + db.system.name would solve the issue. But it may generate 2x the relations.

Edit: Alright. I think it's better to adopt the svc name:db name approach as mentioned. And I brought back the limit for svc-db relations so users can set it to 0 to disable this functionality.


Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


5 issues found across 5 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="app/vtselect/traces/query/query.go">

<violation number="1" location="app/vtselect/traces/query/query.go:712">
P2: DB namespace dimension was dropped from service-to-DB aggregation, collapsing distinct DB targets into one edge per DB system.</violation>
</file>

<file name="app/victoria-traces/servicegraph/servicegraph.go">

<violation number="1" location="app/victoria-traces/servicegraph/servicegraph.go:115">
P2: Removing middleware flags and guard introduces backward-incompatible startup risk and removes previous middleware opt-out behavior.</violation>

<violation number="2" location="app/victoria-traces/servicegraph/servicegraph.go:121">
P1: DB graph fetch error causes early continue, dropping already-fetched service graph rows for that tenant/time window.</violation>
</file>

<file name="lib/protoparser/opentelemetry/pb/trace_fields.go">

<violation number="1" location="lib/protoparser/opentelemetry/pb/trace_fields.go:42">
P2: DB service-graph field key changed to `span_attr:db.system.name`, which can miss existing telemetry that uses `db.system` and underreport service→middleware relations.</violation>
</file>

<file name="deployment/docker/compose-vt-single.yml">

<violation number="1" location="deployment/docker/compose-vt-single.yml:36">
P2: Default compose now references a fork/dirty VictoriaTraces image tag, creating pull reliability and provenance risks for users.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@wsx864321
Author

Please see the follow-up commit a7bc1d1.

However, during my retesting, it appears that the "database" node without an identifier in the global view may be referenced by many services—even though these services are not actually pointing to the "same database" instance or cluster.

Each "edge" of the relation still shows the correct call count (e.g., the `accounting` service -> the `postgresql` database: call count = 4). However, the `postgresql` node in these cases is not the same one across different services.

I have a concern about this:

  • The identifier is not clear; I'm not sure if we should use namespace or maybe there's something else. Theoretically, each service has its own database, so simply using service name + db.system.name would solve the issue. But it may generate 2x the relations.

Edit: Alright. I think it's better to adopt the svc name:db name approach as mentioned. And I brought back the limit for svc-db relations so users can set it to 0 to disable this functionality.


Got it. I used namespace to add the database name; you can refer to ( https://github.com/open-telemetry/semantic-conventions/blob/main/docs/db/database-spans.md ). Of course, I can also accept using svc:dbname here.

Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment


1 issue found across 3 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="apptest/tests/service_graph_test.go">

<violation number="1" location="apptest/tests/service_graph_test.go:77">
P2: The test assumes deterministic dependency ordering, but API/query code does not guarantee order, so this assertion can be flaky with multiple edges.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

…VictoriaTraces into fork/wsx864321/feature/middleware_graph_0305
@codecov

codecov bot commented Mar 11, 2026

Codecov Report

❌ Patch coverage is 0% with 82 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (master@7918002). Learn more about missing BASE report.

Files with missing lines Patch % Lines
app/vtselect/traces/query/query.go 0.00% 63 Missing ⚠️
app/victoria-traces/servicegraph/servicegraph.go 0.00% 19 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             master    #117   +/-   ##
========================================
  Coverage          ?   7.71%           
========================================
  Files             ?      61           
  Lines             ?    8829           
  Branches          ?       0           
========================================
  Hits              ?     681           
  Misses            ?    8053           
  Partials          ?      95           

☔ View full report in Codecov by Sentry.

@jiekun
Member

jiekun commented Mar 11, 2026

@wsx864321 please take a look if you can, and maybe build and try it with your data. We'd love to see the feedback. If you need help building a binary or docker image, please let me know.

If all good, it will be merged to the upstream very soon.

@wsx864321
Author

@wsx864321 please take a look if you can, and maybe build and try it with your data. We'd love to see the feedback. If you need help building a binary or docker image, please let me know.

If all good, it will be merged to the upstream very soon.

I have no further questions; it meets my expectations. You can merge at any time.

@utrack

utrack commented Mar 11, 2026

Apologies for piping up here :) Are you sure you want to display db.system.name (postgres) instead of server.address or db.namespace+"@"+server.address? semconv

When building a graph you probably want to make the nodes clickable - and there's nothing to click if there's no semi-unique ID.


EDIT: it's probably more about the overall approach regarding mapping the 'inferred entities', which can be both external services and databases - feel free to ignore me and we can chat in a separate issue

@jiekun
Member

jiekun commented Mar 11, 2026

I personally don't have much experience with semconv, so I will rely heavily on your feedback. Any comments (both positive and negative) are welcome.

During my testing, I noticed that db.namespace is sometimes empty (likely because it is not marked as required). And for server.address, a single application (client) may connect to multiple instances of the same database (due to sharding, partitioning, etc.), so I also think this is not ideal -- service-to-service relations do not yet use any IP to distinguish the nodes.

My expectations are as follows:

  1. db.system.name still seems meaningful to me (but feel free to disagree, and I’m open to hearing your reasoning :) ).
  2. We should use an alternative to identify this node without relying on the instance attribute (a unique identifier like IP:Port is too fine-grained; any DB or middleware cluster should be displayed as a single node in the service graph).

Anyway, since there are different ideas, I think we can hold this pull request for now and see if we find a better way before merging the initial implementation.

@utrack

utrack commented Mar 11, 2026

There's essentially a couple of different tradeoffs from what I've seen; it's nearly impossible to make it work for everyone 😅

The problem 2) would only exist if you're using network.peer.address instead of server.address I think.

As an example, the network.peer.address is a resolved IP (which can change+is different when there's a multi-node setup), while server.address contains the hostname, which is stable.

Here's a brain dump on my experience of making it work on our side:


Approach 0: db.system.name

Pros:

  • There is a node on the graph

Cons:

  • On the graph, all the services using a single DB type will 'cloud' around a single node, even if they use different DB instances
  • All other cons below will apply

Approach 1: (caller)+db.system.name

Pros:

  • Stable, unique graph nodes
  • There will never be a 'cloud of nodes' situation

Cons:

  • If a service uses two or more DBs of the same type - there will only be a single node in the graph
  • You won't see it on the graph if two services are using the same DB

Approach 2: db.namespace

Pros:

  • Stable graph nodes

Cons:

  • db.namespace is optional, as you've mentioned
  • If there are many DB instances, then we may see the 'cloud of nodes' around a default namespace - even though they're located in the different instances

This is what Datadog uses by default; before that, we had a cloud around the main postgres DB.

The Datadog agent turns db.name into peer.db.name by default.
To fix the cloud of nodes, we just put the DB's hostname (not IP) into the db.name for DD.

Real-life sample of a cloud of services connecting to the 'main' namespace from another demo follows. Those have different server.address/db.instance:


Approach 3: (network.peer.address)

This is the IP of the DB node you're connected to.

Pros:

  • You can see a node for every replica you have

Cons:

  • You can see a node for every replica you have 😅
  • You get a new graph node every time the DNS+routing changes
  • False negatives when looking for services that use the same DB - i.e. two services may connect to different IPs and get different nodes, but they will still use the same instance

Approach 4: (server.address)

Pros:

  • Stable graph nodes
  • Services that use the same DB instance will be connected
  • If you have RO database replicas, then they have their own DNS name - so, you get 1 node for the DB master and 1 for the replicas. Useful when debugging the RO/RW selection logic

Cons:

  • Cloud of nodes if your setup is a single beefy DB cluster that's split by namespaces for different services.
  • Falls apart if you have many DNS names which lead to the same instance

Approach 5: (db.namespace)+(server.address)

We effectively fall back to A.4 if db.namespace is not reported.

Pros:

  • Same as A.4, plus:
  • Only those services that use the same DB namespace on the same instance will be connected

Cons:

  • If you have a 'single beefy DB cluster' setup, then the node's names might be verbose
  • Cloud of nodes if you have a 'single beefy DB cluster' setup, AND you don't report db.namespace

Personally, we went for the A.4 which works for us, but I'd say that A.5 is more suitable for an open solution.
If a single DB instance has many DNS names (see A.4 con 2), then I'm inclined to classify this as a 'you problem' :D. We would need to get a DB identifier somehow, but it's not a part of the standard OTel spec.
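Approach 5 amounts to a small naming rule: prefer db.namespace joined with server.address, and fall back to server.address alone when db.namespace is not reported (it is optional in semconv). A hypothetical Go helper, not part of this PR, could look like:

```go
package main

import "fmt"

// dbNodeName sketches Approach 5: identify a database node by
// db.namespace + "@" + server.address, falling back to server.address
// alone (Approach 4) when db.namespace is missing.
// The attribute names come from OTel semconv; the helper is illustrative.
func dbNodeName(dbNamespace, serverAddress string) string {
	if dbNamespace == "" {
		return serverAddress
	}
	return dbNamespace + "@" + serverAddress
}

func main() {
	fmt.Println(dbNodeName("docmini_test", "mysql-prod.internal")) // prints docmini_test@mysql-prod.internal
	fmt.Println(dbNodeName("", "mysql-prod.internal"))             // prints mysql-prod.internal
}
```

The fallback keeps the node stable when only the hostname is available, at the cost of the "single beefy cluster" caveats listed above.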

@wsx864321
Author

There's essentially a couple of different tradeoffs from what I've seen; it's nearly impossible to make it work for everyone 😅

The problem 2) would only exist if you're using network.peer.address instead of server.address I think.

As an example, the network.peer.address is a resolved IP (which can change+is different when there's a multi-node setup), while server.address contains the hostname, which is stable.

Here's a brain dump on my experience of making it work on our side:

Approach 0: db.system.name

Pros:

  • There is a node on the graph

Cons:

  • On the graph, all the services using a single DB type will 'cloud' around a single node, even if they use different DB instances
  • All other cons below will apply

Approach 1: (caller)+db.system.name

Pros:

  • Stable, unique graph nodes
  • There will never be a 'cloud of nodes' situation

Cons:

  • If a service uses two or more DBs of the same type - there will only be a single node in the graph
  • You won't see it on the graph if two services are using the same DB
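For reference, this is the approach the PR currently implements (per the summary, the child node is named &lt;service&gt;:&lt;db.system.name&gt;). A minimal Go sketch of that naming — the function name is illustrative, not the PR's actual code:

```go
package main

import "fmt"

// childNodeName builds the DB child node per Approach 1:
// (caller)+db.system.name, e.g. "serviceA:MongoDB".
// Note the con above: two different MongoDB instances used by
// serviceA collapse into this single node.
func childNodeName(service, dbSystem string) string {
	return fmt.Sprintf("%s:%s", service, dbSystem)
}

func main() {
	fmt.Println(childNodeName("serviceA", "MongoDB")) // serviceA:MongoDB
}
```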

Approach 2: db.namespace

Pros:

  • Stable graph nodes

Cons:

  • db.namespace is optional, as you've mentioned
  • If there are many DB instances, then we may see the 'cloud of nodes' around a default namespace - even though they're located in the different instances

This is what Datadog uses by default, and previously we had a cloud of nodes around our main Postgres DB.

The Datadog agent turns db.name into peer.db.name by default. To fix the cloud of nodes, we just put the DB's hostname (not IP) into the db.name for DD.

A real-life sample of a cloud of services connecting to the 'main' namespace, from another demo, follows. Those have different server.address/db.instance:

image

Approach 3: (network.peer.address)

This is the IP of the DB node you're connected to.

Pros:

  • You can see a node for every replica you have

Cons:

  • You can see a node for every replica you have 😅
  • You get a new graph node every time the DNS+routing changes
  • False negatives when looking for services that use the same DB - i.e. two services may connect to different IPs and get different nodes, but they will still use the same instance

(Approaches 4 and 5 and the conclusion are quoted verbatim above.)

In fact, I would prefer (db.system.name)+(db.namespace) for several reasons:

  • In the OTel specification, both network.peer.address and server.address are only recommended attributes, not required, while db.namespace is a required attribute.
  • With either network.peer.address or server.address, the graph becomes very complex in a master-slave or cluster architecture
  • Usually, different db.namespaces also represent different instances
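The preferred combination above can be sketched the same way; this is an illustrative fragment (names and separator are assumptions, not this PR's code). When db.namespace is missing it degrades to Approach 0, a single node per DB type:

```go
package main

import "fmt"

// nodeKey sketches the (db.system.name)+(db.namespace) naming:
// stable names when the namespace is reported, and a single
// per-DB-type node (Approach 0) when it is not.
func nodeKey(dbSystem, dbNamespace string) string {
	if dbNamespace == "" {
		return dbSystem
	}
	return dbSystem + ":" + dbNamespace
}

func main() {
	fmt.Println(nodeKey("postgresql", "orders")) // postgresql:orders
	fmt.Println(nodeKey("postgresql", ""))       // postgresql
}
```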

@jiekun cc

@utrack

utrack commented Mar 12, 2026

In a cluster architecture you usually have either a single DNS name, or two names (one is a fast read tier and the second is a master for the writes).

The problem with the namespaces is that there's a default namespace which is used most of the time - like the 'main' example on my screenshot.
There can be different namespaces, and they do usually represent different instances - however, same namespace does not mean it's the same instance.

@jiekun
Member

jiekun commented Mar 12, 2026

In a cluster architecture you usually have either a single DNS name, or two names

I agree with this actually, but it's not the case I've experienced. At my previous job, for cluster connections the administrator provided different addresses for different instances, and each address is a DNS record that can be switched easily when an incident happens.

So the case here is that different users may have different deployments, and it's hard to reach a consensus for now. I've got feedback from both of you, but it's hard to decide.

@utrack

utrack commented Mar 12, 2026

Agree, I just remembered MongoDB connection strings, and they can have as many 'seed' DBs as they want...

Maybe just make it configurable? Grafana/otelcol's servicegraph has something called 'dimensions' in its config, which controls how the service graph is built.
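The configurable-dimensions idea could look roughly like the sketch below: the node name is built from whichever configured span attributes are present, in order. The attribute names, separator, and function are illustrative assumptions, not an existing config format:

```go
package main

import (
	"fmt"
	"strings"
)

// nodeKeyFromDimensions builds a graph-node name from a
// configured, ordered list of span-attribute names, skipping
// any attribute that is absent or empty on a given span.
func nodeKeyFromDimensions(attrs map[string]string, dimensions []string) string {
	var parts []string
	for _, d := range dimensions {
		if v, ok := attrs[d]; ok && v != "" {
			parts = append(parts, v)
		}
	}
	return strings.Join(parts, ":")
}

func main() {
	attrs := map[string]string{
		"db.system.name": "postgresql",
		"server.address": "pg-main.prod.internal",
		"db.namespace":   "orders",
	}
	// An A.5-style dimension set:
	fmt.Println(nodeKeyFromDimensions(attrs, []string{"server.address", "db.namespace"})) // pg-main.prod.internal:orders
}
```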

@wsx864321
Author

Perhaps we can proceed with the current, most certain solution, which can then be continuously evolved without compatibility issues.

