[DBM] Add container tags hash to queries (if enabled) #8061

Open
vandonr wants to merge 39 commits into master from vandonr/process2

Contributor

@vandonr vandonr commented Jan 14, 2026

Summary of changes

Add the ability to write the container tags hash to DBM queries and to the related span.
The goal is for DBM to query the spans bearing that hash, then use the container tags on those spans to enrich the queries.
This is controlled by a setting that is disabled by default and would be enabled if the propagation mode is "service" or greater.

see RFC: https://docs.google.com/document/d/15GtNOKGBCt6Dc-HsDNnMmCdZwhewFQx8yUlI9in5n3M
related PR in python: DataDog/dd-trace-py#15293

Reason for change

DBM and DSM propagate service context in outbound communications (SQL comments, message headers), but neither product has awareness of the container environment (e.g., kube_cluster, namespace, pod_name). Propagating full container tags is not feasible due to cardinality constraints (query cache invalidation in OracleDB/SQLServer, exponential pathway growth in DSM) and size limitations (64–128 bytes for DBM non-comment methods).

This is needed for the service renaming initiative (defining services based on container names) and APM primary tags (container-based dimensions like Kubernetes cluster).

The solution: the agent computes a hash of low-cardinality container tags and back-propagates it to the tracer, which includes it in outbound DBM/DSM communications. DBM then resolves the hash by correlating with APM spans that carry the same hash as a span tag.

Implementation details

  • Add BaseHash static class that computes an FNV-64 hash of ProcessTags.SerializedTags combined with the container tags hash from the agent, encoded as base64
  • Read the container tags hash from the Datadog Agent via DiscoveryService, stored in ContainerMetadata.ContainerTagsHash
  • ContainerMetadata converted from static to instance class (singleton via ContainerMetadata.Instance) to improve testability
  • DatabaseMonitoringPropagator injects the base hash into SQL comments (as ddch) when DD_DBM_INJECT_SQL_BASEHASH is true
  • Add _dd.dbm_container_tags_hash span tag on SqlTags so DBM can correlate the hash back to the span's container tags
  • New config key DD_DBM_INJECT_SQL_BASEHASH (disabled by default), intended to be enabled when DBM propagation mode is service or higher
  • Add container ID header to MinimalAgentHeaderHelper for agent communication
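
The hashing step in the first bullet can be sketched roughly as follows. This is a minimal, self-contained illustration and not the PR's code: `BaseHashSketch`, `Fnv1`, and the UTF-8 encoding of the inputs are assumptions, and the real implementation goes through the tracer's internal `FnvHash64` helper. The padding trim and URL-safe character replacement mirror what the review discussion describes.

```csharp
using System;
using System.Text;

// Hypothetical stand-in for the PR's BaseHash class; names are illustrative only.
public static class BaseHashSketch
{
    private const ulong FnvOffsetBasis = 14695981039346656037UL;
    private const ulong FnvPrime = 1099511628211UL;

    // FNV-1 (64-bit): multiply by the prime first, then XOR in each byte.
    // Passing a previous hash as the seed chains two inputs together.
    // Assumption: inputs are hashed as UTF-8 bytes.
    public static ulong Fnv1(string text, ulong seed = FnvOffsetBasis)
    {
        var hash = seed;
        foreach (var b in Encoding.UTF8.GetBytes(text))
        {
            unchecked { hash *= FnvPrime; }
            hash ^= b;
        }

        return hash;
    }

    // Hash the process tags, fold in the agent-provided container tags hash
    // (when there is one), then encode the 8 bytes as unpadded URL-safe base64.
    public static string Compute(string processTags, string? containerTagsHash)
    {
        var hash = Fnv1(processTags);
        if (!string.IsNullOrEmpty(containerTagsHash))
        {
            hash = Fnv1(containerTagsHash!, hash);
        }

        return Convert.ToBase64String(BitConverter.GetBytes(hash))
                      .TrimEnd('=')
                      .Replace('+', '-')
                      .Replace('/', '_');
    }
}
```

Eight bytes always encode to 12 base64 characters including one `=` of padding, so the trimmed value is 11 characters long, which fits within the size limits mentioned above.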

Test coverage

Adding a test in DbScopeFactoryTests.cs forced me to inject the value from pretty high up, which I find a bit "dirty", but at least we don't have to rely on a global static instance in tests.

Other details

vandonr and others added 10 commits December 2, 2025 18:03
## Summary of changes
Replaced custom mutex guard with `std::lock_guard`, using
`std::recursive_mutex` instead of `CRITICAL_SECTION` on Windows and
`std::mutex` with railings on Linux

## Reason for change
Some locks have been spotted in smoke tests, which could be caused by the
lack of thread-recursive locking in `std::mutex`

## Implementation details

## Test coverage

## Other details
@vandonr vandonr requested review from a team as code owners January 14, 2026 15:03
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4fd01fab6f

Comment on lines 104 to +108
else
{
    // PropagateDataViaComment (service) - this injects various trace information as a comment in the query
    if (tracer.Settings.InjectSqlBasehash && !string.IsNullOrEmpty(baseHash))
    {
        tags.BaseHash = baseHash;

P2: Set BaseHash even when DBM comment already present

This new BaseHash tagging only happens in the else branch when the command text is not already DBM-injected. In the cached‑command scenario (or when users pre‑inject DBM comments), alreadyInjected is true, so _dd.propagated_hash is never set on subsequent spans even though the query still carries ddsh in the SQL comment. If DBM looks up container tags by scanning recent spans for that hash, later queries can’t be enriched once the first span ages out. Consider setting tags.BaseHash whenever the feature is enabled (and baseHash is non‑empty), regardless of the alreadyInjected branch.

Contributor Author

Hmm, yes, that's an interesting point, but I'm not sure we care: we only need one span with the hash to get the values, so we don't really need to tag every span. I think in practice it works well as is.

Member

Consider adding a comment explaining this.

pr-commenter bot commented Jan 14, 2026

Benchmarks

Benchmark execution time: 2026-03-24 11:01:35

Comparing candidate commit 900ac90 in PR branch vandonr/process2 with baseline commit 4e38cdd in branch master.

Found 9 performance improvements and 7 performance regressions! Performance is the same for 258 metrics, 14 unstable metrics.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'
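
The significance rule illustrated by the two diagrams can be written as a small check. This is a sketch under assumptions: `SignificanceSketch` and `IsSignificant` are hypothetical names, and reading the threshold as a symmetric band around 0% is an interpretation of the description above, not the benchmarking platform's actual code.

```csharp
// Hypothetical helper; not part of the benchmarking platform.
public static class SignificanceSketch
{
    // ciLower and ciUpper are the bounds of the confidence interval over the
    // relative difference of means, in percent. thresholdPercent is the
    // (positive) significant impact threshold, also in percent.
    // The change counts as significant only when the whole interval lies
    // outside the band [-thresholdPercent, +thresholdPercent].
    public static bool IsSignificant(double ciLower, double ciUpper, double thresholdPercent)
        => ciLower > thresholdPercent || ciUpper < -thresholdPercent;
}
```

With a 1% threshold, the second diagram's interval [1.3%, 3.1%] is significant, while the first diagram's interval [-0.6%, 1.2%] is not because it overlaps the band.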

scenario:Benchmarks.Trace.AgentWriterBenchmark.WriteAndFlushEnrichedTraces netcoreapp3.1

  • 🟩 execution_time [-87.790ms; -87.704ms] or [-44.085%; -44.042%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.ObjectExtractorMoreComplexBody net6.0

  • 🟥 execution_time [+15.198ms; +19.106ms] or [+7.780%; +9.780%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.ObjectExtractorSimpleBody net6.0

  • 🟩 execution_time [-16.899ms; -13.287ms] or [-7.904%; -6.215%]

scenario:Benchmarks.Trace.Asm.AppSecBodyBenchmark.ObjectExtractorSimpleBody netcoreapp3.1

  • 🟥 execution_time [+16.264ms; +22.445ms] or [+8.284%; +11.432%]

scenario:Benchmarks.Trace.AspNetCoreBenchmark.SendRequest net6.0

  • 🟩 execution_time [-8.803ms; -7.370ms] or [-8.834%; -7.396%]

scenario:Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark.WriteAndFlushEnrichedTraces net472

  • 🟩 execution_time [-16.356ms; -13.298ms] or [-7.013%; -5.702%]
  • 🟩 throughput [+62.858op/s; +76.693op/s] or [+6.108%; +7.452%]

scenario:Benchmarks.Trace.CIVisibilityProtocolWriterBenchmark.WriteAndFlushEnrichedTraces netcoreapp3.1

  • 🟥 execution_time [+56.258ms; +63.712ms] or [+27.570%; +31.222%]
  • 🟥 throughput [-394.236op/s; -347.884op/s] or [-23.931%; -21.118%]

scenario:Benchmarks.Trace.CharSliceBenchmark.OptimizedCharSlice netcoreapp3.1

  • 🟥 execution_time [+147.055µs; +157.625µs] or [+5.337%; +5.720%]
  • 🟥 throughput [-19.705op/s; -18.331op/s] or [-5.429%; -5.051%]

scenario:Benchmarks.Trace.ElasticsearchBenchmark.CallElasticsearchAsync net472

  • 🟥 throughput [-18999.263op/s; -16744.614op/s] or [-6.104%; -5.380%]

scenario:Benchmarks.Trace.Iast.StringAspectsBenchmark.StringConcatAspectBenchmark netcoreapp3.1

  • 🟩 allocated_mem [-16.498KB; -16.469KB] or [-6.045%; -6.034%]

scenario:Benchmarks.Trace.Log4netBenchmark.EnrichedLog netcoreapp3.1

  • 🟩 execution_time [-38.452ms; -34.542ms] or [-18.879%; -16.959%]

scenario:Benchmarks.Trace.SingleSpanAspNetCoreBenchmark.SingleSpanAspNetCore netcoreapp3.1

  • 🟩 throughput [+14923514.547op/s; +16273400.150op/s] or [+6.617%; +7.215%]

scenario:Benchmarks.Trace.TraceAnnotationsBenchmark.RunOnMethodBegin net472

  • 🟩 throughput [+42708.064op/s; +44809.510op/s] or [+6.289%; +6.599%]

Contributor Author

vandonr commented Jan 15, 2026

I just realized I need to put process tags in there too

@vandonr vandonr marked this pull request as draft January 15, 2026 14:53
Collaborator

@bouwkast bouwkast left a comment

My main question: this appears to follow the RFC correctly in how we propagate the hash, but the merged Python implementation recomputes the hash.

The RFC isn't precise enough in describing the hash and its expected behavior and requirements for me to know which is correct.

Base automatically changed from vandonr/process3 to master February 3, 2026 18:49
@vandonr vandonr changed the title from "Add container tags hash to DBM queries (if enabled)" to "[DBM] Add container tags hash to queries (if enabled)" Mar 17, 2026
@vandonr vandonr requested a review from a team as a code owner March 24, 2026 10:09
getDiscoveryServiceFunc: static s => DiscoveryService.CreateUnmanaged(
    s.TracerSettings.Manager.InitialExporterSettings,
    ContainerMetadata.Instance,
    new ServiceRemappingHash(null),
Contributor Author

this one I'm not 100% sure, but since it's only used for DBM for now, I don't think it'd play any role in that code path, so it should be safe to hardcode a disabled instance

private const string SqlCommentOuthost = "ddh";
private const string SqlCommentVersion = "ddpv";
private const string SqlCommentEnv = "dde";
private const string SqlCommentBaseHash = "ddsh";
Member

@lucaspimentel lucaspimentel Mar 24, 2026

The PR description says

injects the base hash into SQL comments (as ddch)

I couldn't find either one in the RFC, but the dd-trace-py PR uses ddsh, like this one. Is that a typo in the PR description?

Comment on lines +38 to +51
public string? ContainerTagsHash
{
    get;
    private set;
}

/// <summary>
/// Gets the base64 representation of the hash
/// </summary>
public string? B64Value
{
    get;
    private set;
}
Member

@lucaspimentel lucaspimentel Mar 24, 2026

These properties used to have Volatile.Read()/Volatile.Write() and we should probably keep that since they are written from a background thread in DiscoveryService and read in the hot path when creating spans.

Furthermore, UpdateContainerTagsHash updates both values non-atomically, so a reader could see a stale B64Value with a new ContainerTagsHash. If consistency between the two is important, consider using a lock to read/write both values, or using immutable copies.
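
One way to get both visibility and consistency, sketched here under assumptions: keep the two values in a single immutable object and swap the reference with `Volatile.Write`. `HashSnapshot` and `HashHolder` are hypothetical names used for illustration and are not types from the PR.

```csharp
using System.Threading;

// Hypothetical immutable pair: both values are set together at construction,
// so a reader can never observe a new hash with a stale base64 value.
public sealed class HashSnapshot
{
    public HashSnapshot(string? containerTagsHash, string? base64Value)
    {
        ContainerTagsHash = containerTagsHash;
        Base64Value = base64Value;
    }

    public string? ContainerTagsHash { get; }

    public string? Base64Value { get; }
}

// Hypothetical holder: a single reference swap publishes both values at once.
public sealed class HashHolder
{
    private HashSnapshot _current = new HashSnapshot(null, null);

    // Read on the hot path (span creation): one volatile read, no lock.
    public HashSnapshot Current => Volatile.Read(ref _current);

    // Called from the background thread when the agent reports a new hash.
    public void Update(string? containerTagsHash, string? base64Value)
        => Volatile.Write(ref _current, new HashSnapshot(containerTagsHash, base64Value));
}
```

A caller would read `Current` once and use both properties from that same snapshot. The cost is one small allocation per update, which should be rare compared with reads.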

hash = FnvHash64.GenerateHash(containerTagsHash, FnvHash64.Version.V1, hash);
}

var b64 = Convert.ToBase64String(BitConverter.GetBytes(hash));
Member

@lucaspimentel lucaspimentel Mar 24, 2026

This code is allocating:

  • a byte[] in BitConverter.GetBytes()
  • a char[] for the parameter in TrimEnd(params char[]) on .NET Framework (newer runtimes have a TrimEnd(char) overload)
  • a string in TrimEnd() if it modifies the string
  • one more string instance for each Replace() that modifies the string

Good news! We have "vendored" versions of BinaryPrimitives and Base64, so we can avoid BitConverter.GetBytes() and Convert.ToBase64String(), and then trimming and replacing 1:1 chars can be done in place, so this code should work in all TFMs:

#if NETCOREAPP3_1_OR_GREATER
        Span<byte> buf = stackalloc byte[12];
#else
        // can't stackalloc into the vendored Span<T>
        var buf = new byte[12];
#endif

        BinaryPrimitives.WriteUInt64LittleEndian(buf, hash); // write 8 bytes into a 12-byte buffer
        Base64.EncodeToUtf8InPlace(buf, 8, out int bytesWritten);

        while (bytesWritten > 0 && buf[bytesWritten - 1] == (byte)'=')
        {
            bytesWritten--;
        }

        for (int i = 0; i < bytesWritten; i++)
        {
            if (buf[i] == (byte)'+')
            {
                buf[i] = (byte)'-';
            }
            else if (buf[i] == (byte)'/')
            {
                buf[i] = (byte)'_';
            }
        }

#if NETCOREAPP3_1_OR_GREATER
        return Encoding.ASCII.GetString(buf[..bytesWritten]);
#else
        // can't use Range
        return Encoding.ASCII.GetString(buf, 0, bytesWritten);
#endif

This has zero heap allocations on NETCOREAPP3_1_OR_GREATER, and only the byte[12] otherwise (aside from the final string we need to return in both cases which is unavoidable).

Comment on lines +9 to +10
using System.Threading;
using Datadog.Trace.PlatformHelpers;
Member

Not used.

Suggested change
using System.Threading;
using Datadog.Trace.PlatformHelpers;

public static Scope? CreateDbCommandScope(Tracer tracer, IDbCommand command)
{
    var commandType = command.GetType();
    var baseHash = tracer.TracerManager.ServiceRemappingHash?.B64Value;
Member

Should we guard this behind the setting?

var baseHash = tracer.Settings.DbmInjectSqlBasehash ?
                   tracer.TracerManager.ServiceRemappingHash?.B64Value :
                   null;

}
}

private static string Compute(string processTags, string? containerTagsHash)
Member

While working on the "less-allocatey" code below, I noticed there are no unit tests for this method.

if (!_warnedOnSet)
{
    _warnedOnSet = true;
    Log.Error("The code is trying to set the value '{Value}' to {Prop}, but this has no effect in .NET Framework.", value, nameof(ContainerTagsHash));
Member

This log is now gone in the new version. Intentional?

/// <summary>
/// Gets the base64 representation of the hash
/// </summary>
public string? B64Value
Member

[Naming nit] The .NET naming conventions would use Base64Value here, or simply Base64. No need to abbreviate "Base" to "B".

Member

related: #8363

bouwkast added a commit that referenced this pull request Mar 24, 2026
Instead of guarding the caller with #if !NETFRAMEWORK, make the
setter a silent no-op. This avoids conflict with #8061 which
replaces the caller entirely.
6 participants