Add native Prometheus /metrics HTTP endpoint for SimpleCounter metrics#12680
Add native Prometheus /metrics HTTP endpoint for SimpleCounter metrics#12680Ronitsabhaya75 wants to merge 45 commits intoapple:mainfrom
Conversation
8232ecd to
6a676cc
Compare
Co-authored-by: Rodrigo Muñoz <rodrigo.munoz.cs@gmail.com>
8232ecd to
c0081a1
Compare
|
@Ronitsabhaya75 I need to get a little more familiar with some background bits but will review in the next few days. Just out of curiosity, are you doing this as an exercise, or are you planning to use this in production? |
|
Close + reopenin to run CI's |
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-clang-ide on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
@gxglass I was planning for prod currently im planning for development purpose later once we are good we can think how we can implement for prod #12679 you can check the issue here I have created discussion too |
…ation tests Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
…tforms Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
… member Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
- Drop CentOS 7 (EOL, Docker image had pre-built binaries) - Add rm -rf build to ensure clean compilation from source - Add binary verification step (ls -la, file, du -h) - Add explicit output validation (grep for TYPE annotations) - Separate build and test into distinct CI steps for clarity Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
…ilds Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
|
logs for workinf output for /metrics |
|
Thank you for helping me for figuring out i add you as co-author @shilpan97 — Prometheus exporter logic & formatting |
|
@gxglass its ready to review can you review the changes and lemme know the suggestions what you think, I added CI run for testing and pasted the log output for working fine with all 3 OS(linux rhel 9, Centos7, macOS). thank you to you too for guiding me :) |
otel-e2e.yml has been fully rewritten to natively test both OTEL internal metrics collection and Prometheus /metrics extraction with guaranteed clean builds from source across all 3 platforms. This file is no longer needed. Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
gxglass
left a comment
There was a problem hiding this comment.
Here are some review comments. I'm doing this on a best effort basis - subsequent reviews may take some delays this week due to normal business demands
.github/workflows/prometheus-e2e.yml
Outdated
| @@ -0,0 +1,123 @@ | |||
| name: Prometheus E2E Test | |||
There was a problem hiding this comment.
What is the scope and objective of this test? In general put a comment to explain this. However I believe this test is not needed.
Reading below this appears to just run a single simulation test case. It is not necessary or desirable to have an explicit test to run just one specific simulation. We already run 10,000 randomized simulations in CI so the newly added test case (tests/fast/PrometheusMetrics.toml) would get picked up and run by that automagically.
A simulation test is not "end to end". It is contained within a single fdbserver process running a simulated cluster and simulated machines. These tests are of course valuable for many purposes, but integration with the external world is not a thing we can say they do for us. Here an end to end test would be to integrate with an external system to consume the metrics output and thus to confirm that the data is well formed and the counters are correctly reported. Like, integrate with actual Prometheus. I am OK testing that on a manual basis rather than automating it because I don't expect this functionality to be easily broken and because it is community supported to begin with. Also I absolutely do not want to take additional latency or flakiness in the CI process (we have too much of both already) so going out of the way to test best-effort functionality, if it adds anything that might break or add extra time, is not needed IMHO.
Thus my request is a) delete this file and don't worry about triggering your one .toml file, b) do manually test with Prometheus to ensure it works end to end at least as of the time you tested it.
| #include "flow/actorcompiler.h" // This must be the last #include. | ||
|
|
||
| Reference<HTTP::IRequestHandler> makePrometheusMetricsHandler(); | ||
|
|
There was a problem hiding this comment.
Put a comment describing what this workload does (and in more detail than just "tests prometheus metrics handling"; existing workload files may not be great examples of comments they actually communicate non-obvious information).
.github/workflows/otel-e2e.yml
Outdated
| @@ -0,0 +1,269 @@ | |||
| # OTEL & Prometheus Metrics E2E Test | |||
There was a problem hiding this comment.
I don't think this file is necessary. I realize you probably did a lot of misc hacking on this to get this into this shape but TBH I am far from convinced this is needed.
A clue that this is not needed is that nothing else does stuff like this, as far as I know. Another clue is, why should a test case for one misc feature like this (otel metrics) have to be responsible for all of this work to build fdbserver? Doesn't that strike you as having to do too much work to accomplish one incremental new thing (i.e. run one new .toml file)? (Finding yourself in the position of having to do a hell of a lot of work just to accomplish a seemingly routine thing is a clue that maybe you are coming at it from the wrong angle.)
For local small scale development it is fine to just run fdbserver -r simulation -f path/to/your/file.toml by hand. At small scale you can play with different -s random_seed -b buggify_value options if that is of interest.
There is infrastructure elsewhere that runs 1000s of simulation runs. You can see evidence of this in the build log files that the CI generates on pull requests. It is possible that there are rules which control when the CI's get triggered and it may be necessary for somebody to close and reopen your PR to trigger them. But basically what I am trying to say is that there is already a lot of automation to run simulations and you don't have to take steps to go out of your way to do more work there. Just add the workload file and toml file like you have done -- that's all you have to do.
| ACTOR Future<Void> _start(PrometheusMetricsTestWorkload* self) { | ||
| state double startTime = now(); | ||
|
|
||
| // Create a SimpleCounter metric |
There was a problem hiding this comment.
delete this comment. It conveys no information that is not conveyed by the following line of code
| static SimpleCounter<int64_t>* testCounter = SimpleCounter<int64_t>::makeCounter("/test/prometheus/requests"); | ||
| testCounter->increment(42); | ||
|
|
||
| // Increment OTEL counter (creates an OTELSum in MetricCollection) |
| // OTEL Sums -> Prometheus counters | ||
| for (const auto& [uid, sum] : metrics->sumMap) { | ||
| std::string name = sanitizePrometheusName(sum.name); | ||
| if (name.empty()) |
There was a problem hiding this comment.
why is it valid for name to be empty? I think this should be an assert that the name is valid.
| std::string hierarchicalToPrometheus(const std::string& input); | ||
| bool isValidPrometheusMetricName(std::string_view name); | ||
|
|
||
| // Sanitize a metric name for Prometheus compatibility. |
There was a problem hiding this comment.
why are you defining all these methods in a header file? I don't think they need to be inline. Just put plain old function prototypes here and put the methods in the .cpp file elsewhere.
flow/include/flow/TDMetric.actor.h
Outdated
| if (g_network == nullptr || knobToMetricModel(FLOW_KNOBS->METRICS_DATA_MODEL) == MetricsDataModel::NONE) | ||
| if (g_network == nullptr) | ||
| return nullptr; | ||
| // Allow access to MetricCollection when either: |
There was a problem hiding this comment.
There are too many levels of negative logic in here. It would be simpler to write this as
if (g_network && (knobToMetricModel(FLOW_KNOBS->METRICS_DATA_MODEL) != MetricsDataModel::NONE || FLOW_KNOBS->PROMETHEUS_METRICS_ENABLED) {
return static_cast<MetricCollection*>((void*)g_network->global(INetwork::enMetrics));
}
return nullptr;
|
|
||
| // Verify SimpleCounter metric is present | ||
| ASSERT(body.find("test_prometheus_requests") != std::string::npos); | ||
|
|
There was a problem hiding this comment.
Can you put some lower bound on the number of total metrics returned? I doubt it will do anything other than go up over time. You have have the test use a hard-coded constant that is maybe 10-20 less than the actual number of metrics.
What is that number, actually?
| // Verify SimpleCounter metric is present | ||
| ASSERT(body.find("test_prometheus_requests") != std::string::npos); | ||
|
|
||
| // Write body to file for CI reporting |
There was a problem hiding this comment.
Can you put this in
if (0 /* enable this for my own local testing */) {
// write random file in local workspce
}
I would rather not put random files into the local directory. Change 0 to 1 for your own testing.
For my info, how many total metrics are currently reported? I am not sure I know this number. Thanks! |
…test assertions Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>
|
67 only? That is kind of surprising. I'd think that a complicated system like FDB (about 500,000 lines of code excluding dependencies like sqlite and rocksdb) would have hundreds or more metrics. I'd like to understand the 67 better. I can think of several possible explanations:
Thanks! |
Co-authored-by: Shilpan Shah <Shilpan97@gmail.com> Co-authored-by: Renish Avaiya <renishpatel2482001@gmail.com>

Summary
This PR adds a native
/metricsHTTP endpoint that exposes FoundationDB's internal SimpleCounter metrics in Prometheus text exposition format.(waiting for more discussion for further)
Changes
/metricsendpointScope
Initial implementation targets simulation HTTP infrastructure only.
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branchormainif this is the youngest branch)