Add native Prometheus /metrics HTTP endpoint for SimpleCounter metrics by Ronitsabhaya75 · Pull Request #12680 · apple/foundationdb

Ronitsabhaya75 · 2026-02-06T18:43:48Z

Summary

This PR adds a native /metrics HTTP endpoint that exposes FoundationDB's internal SimpleCounter metrics in Prometheus text exposition format.
(Waiting on more discussion before going further.)

Changes

flow/include/flow/PrometheusMetrics.h: Header with formatPrometheusMetrics() function
fdbrpc/PrometheusMetricsHandler.actor.cpp: HTTP request handler for /metrics endpoint
flow/SimpleCounter.cpp: Made hierarchicalToPrometheus() and isValidPrometheusMetricName() public
fdbserver/workloads/PrometheusMetricsTest.actor.cpp: Simulation test workload
tests/fast/PrometheusMetrics.toml: Test configuration
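
As a rough illustration of the name handling listed above, the sketch below converts a hierarchical counter path into a Prometheus-style name and checks it against the Prometheus metric-name grammar `[a-zA-Z_:][a-zA-Z0-9_:]*`. The function names and the exact conversion rules (drop the leading slash, turn separators and other invalid characters into underscores) are assumptions for illustration; the PR's hierarchicalToPrometheus() and isValidPrometheusMetricName() may behave differently.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// True for characters allowed at the start of a Prometheus metric name.
bool isNameStart(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == ':';
}

// True for characters allowed after the first position.
bool isNameChar(char c) {
    return isNameStart(c) || (c >= '0' && c <= '9');
}

// Hypothetical stand-in for isValidPrometheusMetricName().
bool isValidNameSketch(std::string_view name) {
    if (name.empty() || !isNameStart(name.front()))
        return false;
    for (char c : name.substr(1)) {
        if (!isNameChar(c))
            return false;
    }
    return true;
}

// Hypothetical stand-in for hierarchicalToPrometheus(): leading slashes are
// dropped, inner slashes and any other invalid character become '_'.
std::string hierarchicalToPrometheusSketch(std::string_view path) {
    std::string out;
    out.reserve(path.size());
    for (char c : path) {
        if (c == '/') {
            if (!out.empty())
                out.push_back('_');
        } else {
            out.push_back(isNameChar(c) ? c : '_');
        }
    }
    return out;
}
```

Under these assumed rules, the path `/test/prometheus/requests` becomes `test_prometheus_requests`, which passes the validity check.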

Scope

Initial implementation targets simulation HTTP infrastructure only.

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

The PR has a description, explaining both the problem and the solution.
The description mentions which forms of testing were done and the testing seems reasonable.
Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

Co-authored-by: Rodrigo Muñoz <rodrigo.munoz.cs@gmail.com>

gxglass · 2026-02-11T00:53:51Z

@Ronitsabhaya75 I need to get a little more familiar with some background bits but will review in the next few days.

Just out of curiosity, are you doing this as an exercise, or are you planning to use this in production?

gxglass · 2026-02-11T00:58:54Z

Close + reopen to run CIs

foundationdb-ci · 2026-02-11T01:03:27Z

Result of foundationdb-pr on Linux RHEL 9

Commit ID: c0081a1
Duration 0:04:19
Result: ❌ FAILED
Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-02-11T01:03:30Z

Result of foundationdb-pr-clang-arm on Linux CentOS 7

Commit ID: c0081a1
Duration 0:04:22
Result: ❌ FAILED
Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-02-11T01:03:33Z

Result of foundationdb-pr-clang on Linux RHEL 9

Commit ID: c0081a1
Duration 0:04:26
Result: ❌ FAILED
Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-02-11T01:03:34Z

Result of foundationdb-pr-clang-ide on Linux RHEL 9

Commit ID: c0081a1
Duration 0:04:26
Result: ❌ FAILED
Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-02-11T01:03:38Z

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

Commit ID: c0081a1
Duration 0:04:29
Result: ❌ FAILED
Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)
Cluster Test Logs zip file of the test logs (available for 30 days)

foundationdb-ci · 2026-02-11T01:05:33Z

Result of foundationdb-pr-macos on macOS Ventura 13.x

Commit ID: c0081a1
Duration 0:06:25
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

foundationdb-ci · 2026-02-11T01:05:44Z

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

Commit ID: c0081a1
Duration 0:06:34
Result: ❌ FAILED
Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
Build Log terminal output (available for 30 days)
Build Workspace zip file of the working directory (available for 30 days)

Ronitsabhaya75 · 2026-02-11T02:06:38Z

@Ronitsabhaya75 I need to get a little more familiar with some background bits but will review in the next few days.

Just out of curiosity, are you doing this as an exercise, or are you planning to use this in production?

@gxglass I was planning this for prod, but for now I'm targeting development use. Later, once we are happy with it, we can think about how to implement it for prod.

You can check the issue here: #12679. I have created a discussion too.


- Drop CentOS 7 (EOL, Docker image had pre-built binaries)
- Add rm -rf build to ensure clean compilation from source
- Add binary verification step (ls -la, file, du -h)
- Add explicit output validation (grep for TYPE annotations)
- Separate build and test into distinct CI steps for clarity

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com>
Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

Ronitsabhaya75 · 2026-03-11T04:10:40Z

Current architecture for Prometheus and /metrics

Ronitsabhaya75 · 2026-03-11T04:11:10Z

Logs showing working output for /metrics:

# TYPE flow_arena_arenasCreated counter
flow_arena_arenasCreated 1143085
# TYPE flow_arena_arenaBlocksCreated counter
flow_arena_arenaBlocksCreated 1290021
# TYPE flow_fastalloc_allocateCallsSize32 counter
flow_fastalloc_allocateCallsSize32 1112869
# TYPE flow_fastalloc_allocateBytesSize32 counter
flow_fastalloc_allocateBytesSize32 35611808
# TYPE flow_platform_mmapBytes counter
flow_platform_mmapBytes 11534016
# TYPE flow_arena_arenaBytesReserved counter
flow_arena_arenaBytesReserved 20044625
# TYPE flow_arena_arenaBlockAllocations counter
flow_arena_arenaBlockAllocations 1589351
# TYPE flow_arena_arenaBlockBytesAllocated counter
flow_arena_arenaBlockBytesAllocated 354487384
# TYPE flow_fastalloc_allocateCallsSize64 counter
flow_fastalloc_allocateCallsSize64 4699617
# TYPE flow_fastalloc_allocateBytesSize64 counter
flow_fastalloc_allocateBytesSize64 300775488
...
# TYPE Transport_TLS_OutgoingConnectionCreated counter
Transport_TLS_OutgoingConnectionCreated 133
# TYPE Transport_TLS_OutgoingConnectionHandshakeComplete counter
Transport_TLS_OutgoingConnectionHandshakeComplete 133
# TYPE Transport_TLS_IncomingConnectionCreated counter
Transport_TLS_IncomingConnectionCreated 136
# TYPE Transport_TLS_IncomingConnectionHandshakeAccepted counter
Transport_TLS_IncomingConnectionHandshakeAccepted 136
...
# TYPE flow_counters_reports counter
flow_counters_reports 11
# TYPE flow_arena_totalSizeBlocksExamined counter
flow_arena_totalSizeBlocksExamined 20
# TYPE test_prometheus_requests counter
test_prometheus_requests 294
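
The output above follows the Prometheus text exposition format: one `# TYPE <name> counter` line followed by one `<name> <value>` sample line per counter. A minimal sketch of producing that layout is shown below. The helper name and its input type are hypothetical and are not the PR's formatPrometheusMetrics().

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: renders (name, value) pairs as Prometheus counters.
// Each metric gets a "# TYPE" line followed by a single sample line.
std::string formatCountersSketch(
    const std::vector<std::pair<std::string, std::int64_t>>& counters) {
    std::ostringstream out;
    for (const auto& [name, value] : counters) {
        out << "# TYPE " << name << " counter\n";
        out << name << ' ' << value << '\n';
    }
    return out.str();
}
```

This sketch assumes the names are already valid Prometheus metric names; it performs no sanitization and emits no HELP lines or labels.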

Ronitsabhaya75 · 2026-03-11T04:15:08Z

Thank you for helping me figure this out. I have added you as co-authors:

@shilpan97 — Prometheus exporter logic & formatting
@Renish-patel — CI workflow & E2E Validation testing

Ronitsabhaya75 · 2026-03-11T04:17:06Z

@gxglass it's ready for review. Can you review the changes and let me know your suggestions? I added a CI run for testing and pasted the log output showing it working on all 3 OSes (Linux RHEL 9, CentOS 7, macOS).

Thank you too for guiding me :)

otel-e2e.yml has been fully rewritten to natively test both OTEL internal metrics collection and Prometheus /metrics extraction with guaranteed clean builds from source across all 3 platforms. This file is no longer needed. Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

gxglass

Here are some review comments. I'm doing this on a best effort basis - subsequent reviews may take some delays this week due to normal business demands

gxglass · 2026-02-13T17:23:12Z

.github/workflows/prometheus-e2e.yml

@@ -0,0 +1,123 @@
+name: Prometheus E2E Test


What is the scope and objective of this test? In general put a comment to explain this. However I believe this test is not needed.

Reading below this appears to just run a single simulation test case. It is not necessary or desirable to have an explicit test to run just one specific simulation. We already run 10,000 randomized simulations in CI so the newly added test case (tests/fast/PrometheusMetrics.toml) would get picked up and run by that automagically.

A simulation test is not "end to end". It is contained within a single fdbserver process running a simulated cluster and simulated machines. These tests are of course valuable for many purposes, but integration with the external world is not a thing we can say they do for us. Here an end to end test would be to integrate with an external system to consume the metrics output and thus to confirm that the data is well formed and the counters are correctly reported. Like, integrate with actual Prometheus. I am OK testing that on a manual basis rather than automating it because I don't expect this functionality to be easily broken and because it is community supported to begin with. Also I absolutely do not want to take additional latency or flakiness in the CI process (we have too much of both already) so going out of the way to test best-effort functionality, if it adds anything that might break or add extra time, is not needed IMHO.

Thus my request is a) delete this file and don't worry about triggering your one .toml file, b) do manually test with Prometheus to ensure it works end to end at least as of the time you tested it.

gxglass · 2026-03-11T05:41:38Z

fdbserver/workloads/PrometheusMetricsTest.actor.cpp

+#include "flow/actorcompiler.h" // This must be the last #include.
+
+Reference<HTTP::IRequestHandler> makePrometheusMetricsHandler();
+


Put a comment describing what this workload does (and in more detail than just "tests prometheus metrics handling"; existing workload files may not be great examples of comments that actually communicate non-obvious information).

gxglass · 2026-03-11T05:43:17Z

.github/workflows/otel-e2e.yml

@@ -0,0 +1,269 @@
+# OTEL & Prometheus Metrics E2E Test


I don't think this file is necessary. I realize you probably did a lot of misc hacking on this to get this into this shape but TBH I am far from convinced this is needed.

A clue that this is not needed is that nothing else does stuff like this, as far as I know. Another clue is, why should a test case for one misc feature like this (otel metrics) have to be responsible for all of this work to build fdbserver? Doesn't that strike you as having to do too much work to accomplish one incremental new thing (i.e. run one new .toml file)? (Finding yourself in the position of having to do a hell of a lot of work just to accomplish a seemingly routine thing is a clue that maybe you are coming at it from the wrong angle.)

For local small scale development it is fine to just run fdbserver -r simulation -f path/to/your/file.toml by hand. At small scale you can play with different -s random_seed -b buggify_value options if that is of interest.

There is infrastructure elsewhere that runs 1000s of simulation runs. You can see evidence of this in the build log files that the CI generates on pull requests. It is possible that there are rules which control when the CI's get triggered and it may be necessary for somebody to close and reopen your PR to trigger them. But basically what I am trying to say is that there is already a lot of automation to run simulations and you don't have to take steps to go out of your way to do more work there. Just add the workload file and toml file like you have done -- that's all you have to do.

gxglass · 2026-03-11T05:45:53Z

fdbserver/workloads/PrometheusMetricsTest.actor.cpp

+	ACTOR Future<Void> _start(PrometheusMetricsTestWorkload* self) {
+		state double startTime = now();
+
+		// Create a SimpleCounter metric


delete this comment. It conveys no information that is not conveyed by the following line of code

gxglass · 2026-03-11T05:46:13Z

fdbserver/workloads/PrometheusMetricsTest.actor.cpp

+		static SimpleCounter<int64_t>* testCounter = SimpleCounter<int64_t>::makeCounter("/test/prometheus/requests");
+		testCounter->increment(42);
+
+		// Increment OTEL counter (creates an OTELSum in MetricCollection)


delete this comment too

gxglass · 2026-03-11T05:52:17Z

flow/include/flow/PrometheusMetrics.h

+		// OTEL Sums -> Prometheus counters
+		for (const auto& [uid, sum] : metrics->sumMap) {
+			std::string name = sanitizePrometheusName(sum.name);
+			if (name.empty())


Why is it valid for name to be empty? I think this should be an assert that the name is valid.
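
One way to read this suggestion, sketched under assumptions: instead of silently skipping a metric whose sanitized name is empty, treat it as a programming error. The sanitizer below is a hypothetical stand-in (the PR's sanitizePrometheusName() is not shown in this thread), and the standard `assert` is used where FDB code would use its `ASSERT` macro.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical sanitizer: replaces every character outside [a-zA-Z0-9_:]
// with '_' and prefixes '_' when the result would start with a digit.
std::string sanitizeNameSketch(std::string_view raw) {
    std::string out;
    out.reserve(raw.size() + 1);
    for (char c : raw) {
        const bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                        (c >= '0' && c <= '9') || c == '_' || c == ':';
        out.push_back(ok ? c : '_');
    }
    if (!out.empty() && out.front() >= '0' && out.front() <= '9')
        out.insert(out.begin(), '_');
    return out;
}

// Fails loudly on an unnamed metric rather than dropping it from the output.
std::string requireMetricName(std::string_view raw) {
    std::string name = sanitizeNameSketch(raw);
    assert(!name.empty() && "metric registered without a name");
    return name;
}
```

With a sanitizer of this shape, the result is empty only when the input is empty, so the assert flags a metric that was registered without a name.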

gxglass · 2026-03-11T05:54:09Z

flow/include/flow/PrometheusMetrics.h

+std::string hierarchicalToPrometheus(const std::string& input);
+bool isValidPrometheusMetricName(std::string_view name);
+
+// Sanitize a metric name for Prometheus compatibility.


Why are you defining all these methods in a header file? I don't think they need to be inline. Just put plain old function prototypes here and put the definitions in a .cpp file elsewhere.

gxglass · 2026-03-11T05:56:44Z

flow/include/flow/TDMetric.actor.h

-		if (g_network == nullptr || knobToMetricModel(FLOW_KNOBS->METRICS_DATA_MODEL) == MetricsDataModel::NONE)
+		if (g_network == nullptr)
+			return nullptr;
+		// Allow access to MetricCollection when either:


There are too many levels of negative logic in here. It would be simpler to write this as

if (g_network && (knobToMetricModel(FLOW_KNOBS->METRICS_DATA_MODEL) != MetricsDataModel::NONE || FLOW_KNOBS->PROMETHEUS_METRICS_ENABLED)) {
	return static_cast<MetricCollection*>((void*)g_network->global(INetwork::enMetrics));
}
return nullptr;
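
To make the positive-logic form easy to check in isolation, here is a self-contained sketch with stand-in types. The enum values, the struct, and the function signature are assumptions for illustration only; the real definitions live in flow and read from g_network and FLOW_KNOBS rather than from parameters.

```cpp
#include <cassert>

// Stand-ins for the FoundationDB types named in the review comment.
// SOME_MODEL is a placeholder for any data model other than NONE.
enum class MetricsDataModel { NONE, SOME_MODEL };
struct MetricCollection {};

// Returns the collection only when a network exists and at least one
// consumer (a metrics data model or the Prometheus endpoint) is enabled.
MetricCollection* getMetricCollectionSketch(bool haveNetwork,
                                            MetricsDataModel model,
                                            bool prometheusEnabled,
                                            MetricCollection* global) {
    if (haveNetwork && (model != MetricsDataModel::NONE || prometheusEnabled)) {
        return global;
    }
    return nullptr;
}
```

The single condition reads as "network present, and somebody wants metrics", which avoids the early return followed by a second negated check.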

gxglass · 2026-03-11T05:59:31Z

fdbserver/workloads/PrometheusMetricsTest.actor.cpp

+
+		// Verify SimpleCounter metric is present
+		ASSERT(body.find("test_prometheus_requests") != std::string::npos);
+


Can you put some lower bound on the number of total metrics returned? I doubt it will do anything other than go up over time. You could have the test use a hard-coded constant that is maybe 10-20 less than the actual number of metrics.

What is that number, actually?
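
A possible shape for such a check, as a sketch: count the `# TYPE` lines in the response body and compare the count against a hard-coded floor. The helper name and the idea of counting `# TYPE` lines are assumptions; the workload could equally count sample lines.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Counts metric families by counting lines that begin with "# TYPE ".
std::size_t countMetricFamilies(std::string_view body) {
    constexpr std::string_view marker = "# TYPE ";
    std::size_t count = 0;
    std::size_t pos = 0;
    while (pos < body.size()) {
        if (body.substr(pos, marker.size()) == marker)
            ++count;
        const std::size_t eol = body.find('\n', pos);
        if (eol == std::string_view::npos)
            break;
        pos = eol + 1; // advance to the start of the next line
    }
    return count;
}
```

In the workload this could back an assertion along the lines of `ASSERT(countMetricFamilies(body) >= kMinExpectedFamilies);`, where `kMinExpectedFamilies` is a hypothetical constant set somewhat below the observed count.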

gxglass · 2026-03-11T06:05:01Z

fdbserver/workloads/PrometheusMetricsTest.actor.cpp

+		// Verify SimpleCounter metric is present
+		ASSERT(body.find("test_prometheus_requests") != std::string::npos);
+
+		// Write body to file for CI reporting


Can you put this in

if (0 /* enable this for my own local testing */) {
	// write random file in local workspace
}

I would rather not put random files into the local directory. Change 0 to 1 for your own testing.

gxglass · 2026-03-11T06:12:47Z

For my info, how many total metrics are currently reported? I am not sure I know this number. Thanks!

Ronitsabhaya75 · 2026-03-11T15:57:36Z

~67 unique metric families per simulation run:

flow_arena_* - ~10
flow_fastalloc_* - ~30
Transport_TLS_* - ~6
flow_counters_* - 1
test_prometheus_requests - 1

gxglass · 2026-03-11T23:32:55Z

67 only? That is kind of surprising. I'd think that a complicated system like FDB (about 500,000 lines of code excluding dependencies like sqlite and rocksdb) would have hundreds or more metrics. I'd like to understand the 67 better. I can think of several possible explanations:

Possibly we are using the internal metrics API incorrectly and somehow not getting all of the metrics. This relates to the call to MetricCollection::getMetricCollection and how we iterate over stuff. Could you double check that and make sure we are not missing something?
Relatedly, the test may not be running enough interesting workloads to exercise the code paths that create or increment metrics. Can you add Cycle and Attrition to your toml file for let's say 30 seconds or whatever is conventional? (Look around in other *.toml files.) That should be enough to produce some activity in the system.
Another possibility is that FDB just doesn't define a lot of metrics currently. For my background could you post the entire list? It's only 67 so cut/paste here is fine. A way to spot check this is to check some of these metric names, see how they are defined in code, then look for other uses of the same idiom and make sure your metrics report captures them all.

Thanks!

Ronitsabhaya75 force-pushed the feature/prometheus-metrics-endpoint branch 2 times, most recently from 8232ecd to 6a676cc Compare February 7, 2026 02:51

Add native Prometheus /metrics HTTP endpoint for SimpleCounter metrics

c0081a1

Co-authored-by: Rodrigo Muñoz <rodrigo.munoz.cs@gmail.com>

Ronitsabhaya75 force-pushed the feature/prometheus-metrics-endpoint branch 2 times, most recently from 8232ecd to c0081a1 Compare February 7, 2026 04:31

gxglass self-requested a review February 10, 2026 17:56

gxglass closed this Feb 11, 2026

gxglass reopened this Feb 11, 2026

Ronitsabhaya75 added 12 commits February 12, 2026 21:11

Add GitHub Actions workflow for Prometheus E2E testing

a5b7692

Enable manual trigger for Prometheus E2E workflow

2ba582b

Trigger Actions on feature branch push

9d4664b

Fix CI: Enable CRB and EPEL for ninja-build

6b2cdf0

Expand CI matrix: Add CentOS 7 and macOS 13

42b8e25

Fix macOS CI: Split into separate jobs

83ca034

Fix macOS runner and refine job names

55b132a

Build: Register PrometheusMetrics.toml test

94c2f63

Fix CI: Sequential workflow (RHEL9->CentOS7->Mac) and fix CentOS7 run…

afc72e0

…ners

Fix CI: Limit ninja parallelism to -j2 to prevent OOM

7a03911

Fix CI: Apply -j2 limit to CentOS and macOS builds

133fb84

Fix CI: Free disk space and use manual docker run for RHEL9

5b45fc7

Ronitsabhaya75 and others added 16 commits February 15, 2026 18:19

Add OTEL metrics E2E testing workflow

5a4eafb

Fix OTEL CI: use make instead of ninja, fix CentOS7 SCL repos

62aadf0

Fix CI: Use macos-latest runner

a7a6557

Fix CI Round 2: Remove ninja from RHEL, fix all SCL repos, disable Sw…

4032533

…ift on macOS

Fix CI Round 3: Fix CentOS repos regex, restore Ninja for macOS

5d2307d

Fix CI Round 4: Use checkout@v3 for CentOS7, disable Swift globally, …

75d45aa

…unify on Make

feat: expand Prometheus /metrics to all OTEL metric types with integr…

6718df4

…ation tests Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

fix: use Ninja generator and FDB Docker images in CI for all 3 OS pla…

bd32240

…tforms Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

fix: reduce macOS CI parallelism to -j2 to avoid OOM during build

06f5162

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

fix: replace wcx.dbId with wcx.clientId — WorkloadContext has no dbId…

db841fb

… member Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

fix: macOS boost dependency and re-trigger CI

7aa250f

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

fix the OETL for mac CI

baa3769

fixing CI for linux failure

d0fbc1f

E2E CI for Prometheus /metrics on RHEL9, CentOS7, macOS with clean bu…

b970b1c

…ilds Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

fix: add rh-python38 + Jinja2 to CentOS 7 for CMake compatibility

2970de7

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

Ronitsabhaya75 marked this pull request as ready for review March 11, 2026 04:05

gxglass reviewed Mar 11, 2026

refactor: address review — move impl to .cpp, simplify logic, harden …

9159b10

…test assertions Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

Ronitsabhaya75 and others added 3 commits March 12, 2026 10:50

test: add Cycle+Attrition workloads and temp CI for full E2E validation

dd30b42

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

fix: add all 3 platforms to CI (RHEL9, CentOS7, macOS) and fix pip in…

2d4ebad

…stall

clean: remove temp CI, revert if(0), keep Cycle+Attrition workloads

5bd0734
