Consolidate process count metrics #19706

Open
wants to merge 1 commit into
base: master
from

Conversation

jon-signal
Contributor

@jon-signal jon-signal commented Feb 24, 2025

What does this PR do?

This introduces a new, single foundationdb.processes_per_role gauge that is tagged by fdb_role and fdb_process_class. It makes no changes to existing metrics.

Motivation

There are two main motivations for this new gauge:

  1. The set of role types in a FoundationDB cluster can change from version to version. For example, FoundationDB 6.x had a notion of a proxy role, but in FoundationDB 7.x, that role is gone and has been replaced by grv_proxy and commit_proxy. Using tags (instead of individual gauges) to count processes by role gives this integration more flexibility to work with different versions of FoundationDB.
  2. Having processes tagged by role gives operators a mechanism by which they can detect if a single process is performing multiple roles. For example, in a heavily-loaded cluster, having a single process act in both the storage and log roles would be bad news, and operators might want to add a monitor for problematic duplicate role assignments.
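To make the two motivations concrete, here is a minimal sketch of counting processes per (role, process class) pair and flagging processes that hold more than one role. The shape of the `processes` mapping (a `class_type` string plus a `roles` list of objects with a `role` key) is an assumption modeled on FoundationDB's `status json` output, the sample data is invented, and `count_processes_per_role` and `processes_with_multiple_roles` are hypothetical helper names rather than code from this PR.

```python
from collections import Counter


def count_processes_per_role(processes):
    """Count processes for each (role, process class) pair."""
    counts = Counter()
    for process in processes.values():
        process_class = process.get("class_type", "unset")
        for role in process.get("roles", []):
            if "role" in role:
                counts[(role["role"], process_class)] += 1
    return counts


def processes_with_multiple_roles(processes):
    """Return {process_id: sorted role names} for processes holding 2+ roles."""
    result = {}
    for process_id, process in processes.items():
        names = sorted({r["role"] for r in process.get("roles", []) if "role" in r})
        if len(names) > 1:
            result[process_id] = names
    return result


# Hypothetical sample data; a real check would read this from the cluster status.
sample_processes = {
    "p1": {"class_type": "storage", "roles": [{"role": "storage"}]},
    "p2": {"class_type": "log", "roles": [{"role": "log"}, {"role": "coordinator"}]},
    "p3": {"class_type": "stateless", "roles": [{"role": "grv_proxy"}]},
    "p4": {"class_type": "stateless", "roles": []},
}

for (role, process_class), value in sorted(count_processes_per_role(sample_processes).items()):
    tags = ["fdb_role:" + role, "fdb_process_class:" + process_class]
    # Inside an Agent check this would be submitted with something like
    # self.gauge("foundationdb.processes_per_role", value, tags=tags).
    print("foundationdb.processes_per_role", value, tags)

print(processes_with_multiple_roles(sample_processes))
```

Because role names are only ever used as tag values, nothing in this sketch needs to know the list of roles in advance, which is the flexibility the first motivation describes.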

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • Add the qa/skip-qa label if the PR doesn't need to be tested during QA.
  • If you need to backport this PR to another branch, you can add the backport/<branch-name> label to the PR and it will automatically open a backport PR once this one is merged

@@ -217,7 +212,7 @@ def check_metrics(self, status):
role_counts[rolename] = 1

for role in role_counts:
    self.gauge("foundationdb.processes_per_role." + role, role_counts[role])
    self.gauge("foundationdb.instances", role_counts[role], ["fdb_role:" + role])
Contributor Author

Note that foundationdb.instances was previously a count, not a gauge, but I think that was a bug.

Contributor Author

Following up, I still think this is a bug, but out of scope for this pull request.
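One way to see why the metric type matters for a "how many instances exist right now" value is a toy model of how samples are combined within a time bucket. This is an illustrative sketch, not Datadog's aggregation code: it assumes that count samples landing in the same bucket are summed while a gauge keeps only the latest sample, and `aggregate_bucket` is a hypothetical helper.

```python
def aggregate_bucket(samples, metric_type):
    """Combine the samples that fall into one time bucket (toy model)."""
    if metric_type == "count":
        # Assumption: counts represent events, so samples are added together.
        return sum(samples)
    if metric_type == "gauge":
        # Assumption: gauges represent a current level, so the latest sample wins.
        return samples[-1]
    raise ValueError("unknown metric type: " + metric_type)


# Three consecutive check runs each observe 9 storage processes.
samples = [9, 9, 9]

print("as count:", aggregate_bucket(samples, "count"))  # 27
print("as gauge:", aggregate_bucket(samples, "gauge"))  # 9
```

Under these assumptions, a steady population of 9 processes reports as 27 when submitted as a count, which would be consistent with the suspicion that the count type was a bug.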


codecov bot commented Feb 24, 2025

Codecov Report

Attention: Patch coverage is 86.95652% with 3 lines in your changes missing coverage. Please review.

Project coverage is 85.36%. Comparing base (3cabf9e) to head (2959875).
Report is 3 commits behind head on master.

Additional details and impacted files
Flag Coverage Δ
activemq ?
cassandra ?
foundationdb 82.77% <86.95%> (+0.24%) ⬆️
hive ?
hivemq ?
hudi ?
ignite ?
jboss_wildfly ?
kafka ?
presto ?
solr ?

@jon-signal jon-signal marked this pull request as draft February 25, 2025 00:43
@jon-signal
Contributor Author

jon-signal commented Feb 25, 2025

Thinking about this a little more, this is definitely going to need dashboard updates, which I simply haven't done yet. I guess let's see if we agree on the basic idea first; if the general approach seems reasonable, then I'll dive into the dashboards/alerts/everything else!

Also, I think we can actually collapse this down into a single foundationdb.processes gauge tagged by fdb_process_class and fdb_role. This depends on my understanding of repeated tags (for processes with multiple roles), though, which is admittedly a little shaky. The idea here is that we might wind up with something that looks like this:

foundationdb.processes = 21 (['fdb_process_class:stateless'])
foundationdb.processes = 4  (['fdb_process_class:log', 'fdb_role:coordinator', 'fdb_role:log'])
foundationdb.processes = 1  (['fdb_process_class:stateless', 'fdb_role:data_distributor'])
foundationdb.processes = 1  (['fdb_process_class:stateless', 'fdb_role:cluster_controller'])
foundationdb.processes = 1  (['fdb_process_class:stateless', 'fdb_role:consistency_scan'])
foundationdb.processes = 9  (['fdb_process_class:storage', 'fdb_role:storage'])
foundationdb.processes = 5  (['fdb_process_class:log', 'fdb_role:coordinator'])
foundationdb.processes = 1  (['fdb_process_class:stateless', 'fdb_role:grv_proxy'])
foundationdb.processes = 1  (['fdb_process_class:stateless', 'fdb_role:ratekeeper'])
foundationdb.processes = 1  (['fdb_process_class:stateless', 'fdb_role:master'])
foundationdb.processes = 2  (['fdb_process_class:stateless', 'fdb_role:commit_proxy'])
foundationdb.processes = 1  (['fdb_process_class:stateless', 'fdb_role:resolver'])

I think what will happen is that just querying foundationdb.processes with no tags will yield the number of fdbserver processes with no double-counting. At the same time, if we ask for the sum of all foundationdb.processes values tagged with fdb_role:coordinator, we should wind up with 9 (the 5 stand-alone coordinators plus the 4 in the dual role of log and coordinator). I think that means that we can use that single metric to do the job that was previously done by two metrics and then an additional family of metrics.
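The arithmetic can be checked directly against the listing above. The sketch below assumes, as the comment does, that a query sums the values of every series whose tag set contains the requested tags; `sum_matching` is a hypothetical helper, not a Datadog API.

```python
# (value, tags) pairs copied from the example listing in this comment.
series = [
    (21, ["fdb_process_class:stateless"]),
    (4, ["fdb_process_class:log", "fdb_role:coordinator", "fdb_role:log"]),
    (1, ["fdb_process_class:stateless", "fdb_role:data_distributor"]),
    (1, ["fdb_process_class:stateless", "fdb_role:cluster_controller"]),
    (1, ["fdb_process_class:stateless", "fdb_role:consistency_scan"]),
    (9, ["fdb_process_class:storage", "fdb_role:storage"]),
    (5, ["fdb_process_class:log", "fdb_role:coordinator"]),
    (1, ["fdb_process_class:stateless", "fdb_role:grv_proxy"]),
    (1, ["fdb_process_class:stateless", "fdb_role:ratekeeper"]),
    (1, ["fdb_process_class:stateless", "fdb_role:master"]),
    (2, ["fdb_process_class:stateless", "fdb_role:commit_proxy"]),
    (1, ["fdb_process_class:stateless", "fdb_role:resolver"]),
]


def sum_matching(series, *required_tags):
    """Sum the values of all series that carry every required tag."""
    return sum(
        value
        for value, tags in series
        if all(tag in tags for tag in required_tags)
    )


print(sum_matching(series))                          # 48 processes in total
print(sum_matching(series, "fdb_role:coordinator"))  # 9 = 4 dual-role + 5 stand-alone
print(sum_matching(series, "fdb_role:log"))          # 4
```

Each process appears in exactly one series, so the untagged sum counts it once, while a process with two roles still contributes to the total for each of its role tags.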

The other nice thing about the "tag by role" approach (and I should have mentioned this earlier) is that it means that if new roles appear in future versions of FoundationDB, those new roles will just appear as new tags and operators won't have to add new metrics to their "how many processes do I have?" queries.
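As a rough sketch of how the single-gauge idea could be produced, the code below groups processes by their process class and full set of roles, emitting one series per distinct tag set. The `class_type` and `roles` fields are assumptions modeled on FoundationDB's `status json` output, `group_by_class_and_roles` is a hypothetical helper, and `some_future_role` is an invented role name used only to show that an unknown role passes through without code changes.

```python
from collections import Counter


def group_by_class_and_roles(processes):
    """Count processes per distinct tag set (process class plus all roles)."""
    counts = Counter()
    for process in processes.values():
        tags = ["fdb_process_class:" + process.get("class_type", "unset")]
        role_names = sorted({r["role"] for r in process.get("roles", []) if "role" in r})
        tags.extend("fdb_role:" + name for name in role_names)
        # Tuples are hashable, so the whole tag set can serve as the grouping key.
        counts[tuple(tags)] += 1
    return counts


# Hypothetical sample data, including a role this code has never heard of.
sample_processes = {
    "a": {"class_type": "log", "roles": [{"role": "log"}, {"role": "coordinator"}]},
    "b": {"class_type": "log", "roles": [{"role": "coordinator"}, {"role": "log"}]},
    "c": {"class_type": "stateless", "roles": [{"role": "some_future_role"}]},
    "d": {"class_type": "stateless", "roles": []},
}

for tags, value in sorted(group_by_class_and_roles(sample_processes).items()):
    # Inside an Agent check this would be submitted with something like
    # self.gauge("foundationdb.processes", value, tags=list(tags)).
    print("foundationdb.processes =", value, list(tags))
```

Sorting the role names before building the key means that processes "a" and "b" land in the same series even though their roles are listed in a different order.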

Seem reasonable?

@jon-signal jon-signal force-pushed the consolidate_fdb_process_count_metrics branch 2 times, most recently from d806243 to b53229a Compare February 25, 2025 02:14
@jon-signal
Contributor Author

From the conversation in #19681 (comment), it sounds like the best way forward here would be to introduce a new metric rather than changing anything that already exists. I'm not in a position to do that this afternoon, but acknowledge it needs to get done and will address it as soon as I can!

@jon-signal jon-signal force-pushed the consolidate_fdb_process_count_metrics branch 2 times, most recently from 7c5e213 to ef36263 Compare March 18, 2025 16:15
@jon-signal
Contributor Author

I've revised this significantly in light of discussions elsewhere, and now I think it's much less controversial. I'll update the description to match.

@jon-signal jon-signal marked this pull request as ready for review March 18, 2025 16:21
@jon-signal jon-signal force-pushed the consolidate_fdb_process_count_metrics branch from ef36263 to c7c0690 Compare March 18, 2025 16:25
@steveny91
Contributor

@jon-signal Hello again! I'll take a look at this soon ™️

@jon-signal jon-signal force-pushed the consolidate_fdb_process_count_metrics branch from c7c0690 to 6471128 Compare April 1, 2025 15:33
@jon-signal jon-signal force-pushed the consolidate_fdb_process_count_metrics branch from 6471128 to 2959875 Compare April 21, 2025 20:49