Add native Prometheus /metrics HTTP endpoint for SimpleCounter metrics#12680

Open
Ronitsabhaya75 wants to merge 45 commits into apple:main from Ronitsabhaya75:feature/prometheus-metrics-endpoint

@Ronitsabhaya75
Contributor

Summary

This PR adds a native /metrics HTTP endpoint that exposes FoundationDB's internal SimpleCounter metrics in Prometheus text exposition format.
(Waiting for more discussion before taking this further.)

Changes

  • flow/include/flow/PrometheusMetrics.h: Header with formatPrometheusMetrics() function
  • fdbrpc/PrometheusMetricsHandler.actor.cpp: HTTP request handler for /metrics endpoint
  • flow/SimpleCounter.cpp: Made hierarchicalToPrometheus() and isValidPrometheusMetricName() public
  • fdbserver/workloads/PrometheusMetricsTest.actor.cpp: Simulation test workload
  • tests/fast/PrometheusMetrics.toml: Test configuration
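
For readers unfamiliar with the output format, here is a minimal sketch of rendering counters as Prometheus text exposition lines. It is not the PR's formatPrometheusMetrics(); the `CounterSample` alias and the function name `formatCountersAsPrometheusText` are hypothetical, and the sketch assumes the names have already been made Prometheus-safe.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a registered counter: a Prometheus-safe name
// paired with its current value. Not a type from the FoundationDB codebase.
using CounterSample = std::pair<std::string, int64_t>;

// Render each counter as a "# TYPE <name> counter" line followed by a
// "<name> <value>" sample line, each terminated by a newline.
std::string formatCountersAsPrometheusText(const std::vector<CounterSample>& counters) {
    std::ostringstream out;
    for (const auto& [name, value] : counters) {
        out << "# TYPE " << name << " counter\n";
        out << name << ' ' << value << '\n';
    }
    return out.str();
}
```

Each metric family gets one TYPE annotation and one sample line, which is the same shape as the /metrics output pasted later in this thread.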

Scope

Initial implementation targets simulation HTTP infrastructure only.

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@Ronitsabhaya75 Ronitsabhaya75 force-pushed the feature/prometheus-metrics-endpoint branch 2 times, most recently from 8232ecd to 6a676cc Compare February 7, 2026 02:51
Co-authored-by: Rodrigo Muñoz <rodrigo.munoz.cs@gmail.com>
@Ronitsabhaya75 Ronitsabhaya75 force-pushed the feature/prometheus-metrics-endpoint branch 2 times, most recently from 8232ecd to c0081a1 Compare February 7, 2026 04:31
@gxglass gxglass self-requested a review February 10, 2026 17:56
@gxglass
Contributor

gxglass commented Feb 11, 2026

@Ronitsabhaya75 I need to get a little more familiar with some background bits but will review in the next few days.

Just out of curiosity, are you doing this as an exercise, or are you planning to use this in production?

@gxglass
Contributor

gxglass commented Feb 11, 2026

Closing and reopening to run the CIs.

@gxglass gxglass closed this Feb 11, 2026
@gxglass gxglass reopened this Feb 11, 2026
@foundationdb-ci
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: c0081a1
  • Duration 0:04:19
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: c0081a1
  • Duration 0:04:22
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: c0081a1
  • Duration 0:04:26
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-clang-ide on Linux RHEL 9

  • Commit ID: c0081a1
  • Duration 0:04:26
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: c0081a1
  • Duration 0:04:29
  • Result: ❌ FAILED
  • Error: Error while executing command: if [[ $(git diff --shortstat 2> /dev/null | tail -n1) == "" ]]; then echo "CODE FORMAT CLEAN"; else echo "CODE FORMAT NOT CLEAN"; echo; echo "THE FOLLOWING FILES NEED TO BE FORMATTED"; echo; git ls-files -m; echo; if [[ $FDB_VERSION =~ 7\.\3. ]]; then echo skip; else exit 1; fi; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: c0081a1
  • Duration 0:06:25
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /usr/local/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: c0081a1
  • Duration 0:06:34
  • Result: ❌ FAILED
  • Error: Error while executing command: ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -i ${HOME}/.ssh_key ec2-user@${MAC_EC2_HOST} /opt/homebrew/bin/bash --login -c ./build_pr_macos.sh. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@Ronitsabhaya75
Contributor Author

@Ronitsabhaya75 I need to get a little more familiar with some background bits but will review in the next few days.

Just out of curiosity, are you doing this as an exercise, or are you planning to use this in production?

@gxglass I was originally planning this for production. For now I'm targeting development use; once we are happy with it, we can work out how to implement it for production.

You can check the issue at #12679; I have opened a discussion there too.

Ronitsabhaya75 and others added 16 commits February 15, 2026 18:19
…ation tests

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com>
Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
…tforms

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com>
Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
Co-authored-by: Shilpan Shah <Shilpan97@gmail.com>
Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
… member

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com>
Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
Co-authored-by: Shilpan Shah <Shilpan97@gmail.com>
Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
- Drop CentOS 7 (EOL, Docker image had pre-built binaries)
- Add rm -rf build to ensure clean compilation from source
- Add binary verification step (ls -la, file, du -h)
- Add explicit output validation (grep for TYPE annotations)
- Separate build and test into distinct CI steps for clarity

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com>
Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
…ilds

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com>
Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
Co-authored-by: Shilpan Shah <Shilpan97@gmail.com>
Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
@Ronitsabhaya75 Ronitsabhaya75 marked this pull request as ready for review March 11, 2026 04:05
@Ronitsabhaya75
Contributor Author

Current architecture for Prometheus and /metrics:

[Screenshot attached, 2026-03-10: architecture diagram]

@Ronitsabhaya75
Contributor Author

Logs showing working output from /metrics:

# TYPE flow_arena_arenasCreated counter
flow_arena_arenasCreated 1143085
# TYPE flow_arena_arenaBlocksCreated counter
flow_arena_arenaBlocksCreated 1290021
# TYPE flow_fastalloc_allocateCallsSize32 counter
flow_fastalloc_allocateCallsSize32 1112869
# TYPE flow_fastalloc_allocateBytesSize32 counter
flow_fastalloc_allocateBytesSize32 35611808
# TYPE flow_platform_mmapBytes counter
flow_platform_mmapBytes 11534016
# TYPE flow_arena_arenaBytesReserved counter
flow_arena_arenaBytesReserved 20044625
# TYPE flow_arena_arenaBlockAllocations counter
flow_arena_arenaBlockAllocations 1589351
# TYPE flow_arena_arenaBlockBytesAllocated counter
flow_arena_arenaBlockBytesAllocated 354487384
# TYPE flow_fastalloc_allocateCallsSize64 counter
flow_fastalloc_allocateCallsSize64 4699617
# TYPE flow_fastalloc_allocateBytesSize64 counter
flow_fastalloc_allocateBytesSize64 300775488
...
# TYPE Transport_TLS_OutgoingConnectionCreated counter
Transport_TLS_OutgoingConnectionCreated 133
# TYPE Transport_TLS_OutgoingConnectionHandshakeComplete counter
Transport_TLS_OutgoingConnectionHandshakeComplete 133
# TYPE Transport_TLS_IncomingConnectionCreated counter
Transport_TLS_IncomingConnectionCreated 136
# TYPE Transport_TLS_IncomingConnectionHandshakeAccepted counter
Transport_TLS_IncomingConnectionHandshakeAccepted 136
...
# TYPE flow_counters_reports counter
flow_counters_reports 11
# TYPE flow_arena_totalSizeBlocksExamined counter
flow_arena_totalSizeBlocksExamined 20
# TYPE test_prometheus_requests counter
test_prometheus_requests 294

@Ronitsabhaya75
Contributor Author

Thank you for helping me figure this out. I have added you both as co-authors:

@shilpan97 — Prometheus exporter logic & formatting
@Renish-patel — CI workflow & E2E Validation testing

@Ronitsabhaya75
Contributor Author

@gxglass this is ready for review. Could you review the changes and let me know your suggestions? I added a CI run for testing and pasted the log output showing it works on all three OSes (Linux RHEL 9, CentOS 7, macOS).

Thank you as well for guiding me :)

otel-e2e.yml has been fully rewritten to natively test both
OTEL internal metrics collection and Prometheus /metrics extraction
with guaranteed clean builds from source across all 3 platforms.
This file is no longer needed.

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com>
Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

@gxglass gxglass left a comment

Here are some review comments. I'm doing this on a best-effort basis; subsequent reviews may be delayed this week due to normal business demands.

@@ -0,0 +1,123 @@
name: Prometheus E2E Test

What is the scope and objective of this test? In general put a comment to explain this. However I believe this test is not needed.

Reading below this appears to just run a single simulation test case. It is not necessary or desirable to have an explicit test to run just one specific simulation. We already run 10,000 randomized simulations in CI so the newly added test case (tests/fast/PrometheusMetrics.toml) would get picked up and run by that automagically.

A simulation test is not "end to end". It is contained within a single fdbserver process running a simulated cluster and simulated machines. These tests are of course valuable for many purposes, but integration with the external world is not a thing we can say they do for us. Here an end to end test would be to integrate with an external system to consume the metrics output and thus to confirm that the data is well formed and the counters are correctly reported. Like, integrate with actual Prometheus. I am OK testing that on a manual basis rather than automating it because I don't expect this functionality to be easily broken and because it is community supported to begin with. Also I absolutely do not want to take additional latency or flakiness in the CI process (we have too much of both already) so going out of the way to test best-effort functionality, if it adds anything that might break or add extra time, is not needed IMHO.

Thus my request is a) delete this file and don't worry about triggering your one .toml file, b) do manually test with Prometheus to ensure it works end to end at least as of the time you tested it.

#include "flow/actorcompiler.h" // This must be the last #include.

Reference<HTTP::IRequestHandler> makePrometheusMetricsHandler();

Put a comment describing what this workload does, in more detail than just "tests prometheus metrics handling". Existing workload files may not be great examples of comments that actually communicate non-obvious information.

@@ -0,0 +1,269 @@
# OTEL & Prometheus Metrics E2E Test

I don't think this file is necessary. I realize you probably did a lot of misc hacking on this to get this into this shape but TBH I am far from convinced this is needed.

A clue that this is not needed is that nothing else does stuff like this, as far as I know. Another clue is, why should a test case for one misc feature like this (otel metrics) have to be responsible for all of this work to build fdbserver? Doesn't that strike you as having to do too much work to accomplish one incremental new thing (i.e. run one new .toml file)? (Finding yourself in the position of having to do a hell of a lot of work just to accomplish a seemingly routine thing is a clue that maybe you are coming at it from the wrong angle.)

For local small scale development it is fine to just run fdbserver -r simulation -f path/to/your/file.toml by hand. At small scale you can play with different -s random_seed -b buggify_value options if that is of interest.

There is infrastructure elsewhere that runs 1000s of simulation runs. You can see evidence of this in the build log files that the CI generates on pull requests. It is possible that there are rules which control when the CI's get triggered and it may be necessary for somebody to close and reopen your PR to trigger them. But basically what I am trying to say is that there is already a lot of automation to run simulations and you don't have to take steps to go out of your way to do more work there. Just add the workload file and toml file like you have done -- that's all you have to do.

ACTOR Future<Void> _start(PrometheusMetricsTestWorkload* self) {
state double startTime = now();

// Create a SimpleCounter metric

delete this comment. It conveys no information that is not conveyed by the following line of code

static SimpleCounter<int64_t>* testCounter = SimpleCounter<int64_t>::makeCounter("/test/prometheus/requests");
testCounter->increment(42);

// Increment OTEL counter (creates an OTELSum in MetricCollection)

delete this comment too

// OTEL Sums -> Prometheus counters
for (const auto& [uid, sum] : metrics->sumMap) {
std::string name = sanitizePrometheusName(sum.name);
if (name.empty())

why is it valid for name to be empty? I think this should be an assert that the name is valid.
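
To make the point concrete, here is a minimal sketch of path-to-name conversion plus a validity check that a caller could assert on. The names `pathToMetricName`, `isValidMetricName`, `isNameStart`, and `isNameChar` are hypothetical and are not the PR's sanitizePrometheusName() or hierarchicalToPrometheus(), whose exact behavior is not shown in this thread. The rule checked is the Prometheus metric-name pattern `[a-zA-Z_:][a-zA-Z0-9_:]*`.

```cpp
#include <string>
#include <string_view>

// True for characters allowed as the first character of a metric name.
bool isNameStart(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_' || c == ':';
}

// True for characters allowed after the first character.
bool isNameChar(char c) {
    return isNameStart(c) || (c >= '0' && c <= '9');
}

// Check a name against [a-zA-Z_:][a-zA-Z0-9_:]*. Empty names are invalid.
bool isValidMetricName(std::string_view name) {
    if (name.empty() || !isNameStart(name.front()))
        return false;
    for (char c : name) {
        if (!isNameChar(c))
            return false;
    }
    return true;
}

// Hypothetical conversion: drop leading '/' separators, replace every other
// disallowed character with '_', and prefix '_' if the result would start
// with a digit. An input with no usable characters yields an empty string.
std::string pathToMetricName(std::string_view path) {
    std::string out;
    out.reserve(path.size() + 1);
    for (char c : path) {
        if (c == '/' && out.empty())
            continue; // skip leading separators
        out.push_back(isNameChar(c) ? c : '_');
    }
    if (!out.empty() && !isNameStart(out.front()))
        out.insert(out.begin(), '_');
    return out;
}
```

Under this sketch only an input made entirely of separators produces an empty name, so the export loop could assert `isValidMetricName(name)` rather than silently skipping.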

std::string hierarchicalToPrometheus(const std::string& input);
bool isValidPrometheusMetricName(std::string_view name);

// Sanitize a metric name for Prometheus compatibility.

why are you defining all these methods in a header file? I don't think they need to be inline. Just put plain old function prototypes here and put the methods in the .cpp file elsewhere.

if (g_network == nullptr || knobToMetricModel(FLOW_KNOBS->METRICS_DATA_MODEL) == MetricsDataModel::NONE)
if (g_network == nullptr)
return nullptr;
// Allow access to MetricCollection when either:

There are too many levels of negative logic in here. It would be simpler to write this as

if (g_network && (knobToMetricModel(FLOW_KNOBS->METRICS_DATA_MODEL) != MetricsDataModel::NONE || FLOW_KNOBS->PROMETHEUS_METRICS_ENABLED)) {
    return static_cast<MetricCollection*>((void*)g_network->global(INetwork::enMetrics));
}
return nullptr;


// Verify SimpleCounter metric is present
ASSERT(body.find("test_prometheus_requests") != std::string::npos);

Can you put some lower bound on the total number of metrics returned? I doubt it will do anything other than go up over time. You could have the test use a hard-coded constant that is maybe 10-20 less than the actual number of metrics.

What is that number, actually?
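
One way to express such a lower bound is to count the `# TYPE` lines in the response body and compare against a floor. The sketch below is standalone; `countMetricFamilies`, `hasEnoughMetricFamilies`, and the constant `kMinExpectedMetricFamilies` are hypothetical names, and the value 50 is only an assumption (roughly 10-20 below the ~67 families reported later in this thread).

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Count metric families in a Prometheus text body by counting the lines
// that begin with "# TYPE ".
std::size_t countMetricFamilies(const std::string& body) {
    std::istringstream in(body);
    std::string line;
    std::size_t count = 0;
    while (std::getline(in, line)) {
        if (line.rfind("# TYPE ", 0) == 0) // prefix match at position 0
            ++count;
    }
    return count;
}

// Assumed floor for illustration only; pick the real value from an actual run.
constexpr std::size_t kMinExpectedMetricFamilies = 50;

// True when the body reports at least minExpected metric families.
bool hasEnoughMetricFamilies(const std::string& body,
                             std::size_t minExpected = kMinExpectedMetricFamilies) {
    return countMetricFamilies(body) >= minExpected;
}
```

In the workload this would presumably become something like `ASSERT(hasEnoughMetricFamilies(body))` next to the existing check for test_prometheus_requests.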

// Verify SimpleCounter metric is present
ASSERT(body.find("test_prometheus_requests") != std::string::npos);

// Write body to file for CI reporting

Can you put this in

if (0 /* enable this for my own local testing */) {
// write random file in local workspace
}

I would rather not put random files into the local directory. Change 0 to 1 for your own testing.

@gxglass
Contributor

gxglass commented Mar 11, 2026

For my info, how many total metrics are currently reported? I am not sure I know this number. Thanks!

…test assertions

Co-authored-by: Shilpan Shah <Shilpan97@gmail.com>
Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
@Ronitsabhaya75
Contributor Author

~67 unique metric families per simulation run:

flow_arena_* - ~10
flow_fastalloc_* - ~30
Transport_TLS_* - ~6
flow_counters_* - 1
test_prometheus_requests - 1
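
A breakdown like the one above can be produced mechanically from the /metrics body. The following sketch groups metric families by the first two underscore-separated components of the name (for example `flow_arena` or `Transport_TLS`); the function name `tallyFamiliesByPrefix` and the grouping rule are assumptions for illustration, not part of the PR.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>

// Tally "# TYPE <name> <type>" lines, keyed by the name's first two
// '_'-separated components. Names with fewer than two underscores are
// used whole as their own key.
std::map<std::string, std::size_t> tallyFamiliesByPrefix(const std::string& body) {
    const std::string marker = "# TYPE ";
    std::map<std::string, std::size_t> tally;
    std::istringstream in(body);
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, marker.size(), marker) != 0)
            continue; // not a TYPE annotation
        // The metric name runs from the end of the marker to the next space.
        const std::size_t nameEnd = line.find(' ', marker.size());
        const std::string name =
            line.substr(marker.size(),
                        nameEnd == std::string::npos ? std::string::npos : nameEnd - marker.size());
        // Keep everything before the second underscore as the group key.
        const std::size_t first = name.find('_');
        const std::size_t second =
            first == std::string::npos ? std::string::npos : name.find('_', first + 1);
        ++tally[name.substr(0, second)];
    }
    return tally;
}
```

Summing the values of the returned map gives the total family count, which makes it easy to check that the per-prefix numbers add up to the reported total.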

@gxglass
Contributor

gxglass commented Mar 11, 2026

67 only? That is kind of surprising. I'd think that a complicated system like FDB (about 500,000 lines of code excluding dependencies like sqlite and rocksdb) would have hundreds or more metrics. I'd like to understand the 67 better. I can think of several possible explanations:

  1. Possibly we are using the internal metrics API incorrectly and somehow not getting all of the metrics. This relates to the call to MetricCollection::getMetricCollection and how we iterate over stuff. Could you double check that and make sure we are not missing something?
  2. Relatedly, the test is not running enough interesting workloads to run through code paths that create or increment metrics. Can you add Cycle and Attrition to your toml file for let's say 30 seconds or whatever is conventional? (Look around in other *.toml files.) That should be enough to produce some activity in the system.
  3. Another possibility is that FDB just doesn't define a lot of metrics currently. For my background could you post the entire list? It's only ~67 so cut/paste here is fine. A way to spot check this is to check some of these metric names, see how they are defined in code, then look for other uses of the same idiom and make sure your metrics report captures them all.

Thanks!
