@bhattmanish98
Contributor

This PR introduces a centralized Aggregated Metrics Manager and defines the conditions under which aggregated metrics are emitted from individual file systems.

Key Changes

  1. Criteria for Emitting Aggregated Metrics
    Aggregated metrics are emitted based on the following conditions:
  • Time-based interval - Each file system periodically emits its collected metrics at a fixed interval. After emission, metric collection is reset.

  • Threshold-based emission - A scheduler runs at regular intervals to check whether the total number of operations has exceeded a configured threshold. This prevents the aggregated metrics string from growing too large to be safely sent as an HTTP request header. If the threshold is reached, the collected metrics are emitted immediately, and metric collection is reset.

  • Idle-period emission - If a file system remains idle for a configured duration, any accumulated metrics are emitted, and metric collection is reset.

  • File system close - When a file system is closed, all remaining collected metrics are emitted to ensure no data is lost.

  2. Centralized Metrics Management
    All file systems now push their aggregated metrics to a shared Aggregated Metrics Manager. This manager evaluates the configured emission criteria and determines whether metrics should be emitted immediately or deferred until a later time.
    The manager also rate-limits the number of metrics calls per second.
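
The four emission criteria listed under item 1 can be read as a single decision function. The following is only a minimal sketch of that reading, not the code in this PR: the class name, the `Trigger` enum, the parameter names, and the precedence order (close, then threshold, then interval, then idle) are all assumptions made for illustration.

```java
import java.time.Duration;

/** Hypothetical sketch of the four emission triggers; none of these names come from the PR. */
final class EmitDecisionSketch {

    enum Trigger { NONE, INTERVAL, THRESHOLD, IDLE, CLOSE }

    /**
     * Returns the trigger that applies, or NONE if metrics should keep accumulating.
     * The precedence used here is an assumption of this sketch.
     */
    static Trigger decide(boolean closing,
                          long operationCount,
                          long operationThreshold,
                          Duration sinceLastEmit,
                          Duration emitInterval,
                          Duration sinceLastOperation,
                          Duration idleTimeout) {
        if (closing) {
            // File system close: flush whatever is left.
            return Trigger.CLOSE;
        }
        if (operationCount >= operationThreshold) {
            // Keeps the metrics string small enough for an HTTP header.
            return Trigger.THRESHOLD;
        }
        if (sinceLastEmit.compareTo(emitInterval) >= 0) {
            // Fixed time-based interval has elapsed.
            return Trigger.INTERVAL;
        }
        if (operationCount > 0 && sinceLastOperation.compareTo(idleTimeout) >= 0) {
            // Idle file system with metrics still pending.
            return Trigger.IDLE;
        }
        return Trigger.NONE;
    }
}
```

A caller would reset its collected metrics whenever `decide(...)` returns anything other than `Trigger.NONE`, matching the "emit, then reset" behavior described above.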

@hadoop-yetus

This comment was marked as outdated.

@bhattmanish98
Contributor Author

============================================================
HNS-OAuth-DFS

[WARNING] Tests run: 235, Failures: 0, Errors: 0, Skipped: 3
[WARNING] Tests run: 904, Failures: 0, Errors: 0, Skipped: 220
[WARNING] Tests run: 158, Failures: 0, Errors: 0, Skipped: 8
[WARNING] Tests run: 271, Failures: 0, Errors: 0, Skipped: 23

============================================================
HNS-SharedKey-DFS

[WARNING] Tests run: 235, Failures: 0, Errors: 0, Skipped: 4
[WARNING] Tests run: 907, Failures: 0, Errors: 0, Skipped: 166
[WARNING] Tests run: 158, Failures: 0, Errors: 0, Skipped: 8
[WARNING] Tests run: 271, Failures: 0, Errors: 0, Skipped: 10

============================================================
NonHNS-SharedKey-DFS

[WARNING] Tests run: 235, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 744, Failures: 0, Errors: 0, Skipped: 287
[WARNING] Tests run: 158, Failures: 0, Errors: 0, Skipped: 9
[WARNING] Tests run: 271, Failures: 0, Errors: 0, Skipped: 11

============================================================
AppendBlob-HNS-OAuth-DFS

[WARNING] Tests run: 235, Failures: 0, Errors: 0, Skipped: 3
[WARNING] Tests run: 904, Failures: 0, Errors: 0, Skipped: 231
[WARNING] Tests run: 135, Failures: 0, Errors: 0, Skipped: 9
[WARNING] Tests run: 271, Failures: 0, Errors: 0, Skipped: 23

============================================================
NonHNS-SharedKey-Blob

[WARNING] Tests run: 235, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 750, Failures: 0, Errors: 0, Skipped: 144
[WARNING] Tests run: 158, Failures: 0, Errors: 0, Skipped: 3
[WARNING] Tests run: 271, Failures: 0, Errors: 0, Skipped: 11

============================================================
NonHNS-OAuth-DFS

[WARNING] Tests run: 235, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 741, Failures: 0, Errors: 0, Skipped: 289
[WARNING] Tests run: 158, Failures: 0, Errors: 0, Skipped: 9
[WARNING] Tests run: 271, Failures: 0, Errors: 0, Skipped: 24

============================================================
NonHNS-OAuth-Blob

[WARNING] Tests run: 235, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 747, Failures: 0, Errors: 0, Skipped: 156
[WARNING] Tests run: 158, Failures: 0, Errors: 0, Skipped: 3
[WARNING] Tests run: 271, Failures: 0, Errors: 0, Skipped: 24

============================================================
AppendBlob-NonHNS-OAuth-Blob

[WARNING] Tests run: 235, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 742, Failures: 0, Errors: 0, Skipped: 202
[WARNING] Tests run: 135, Failures: 0, Errors: 0, Skipped: 4
[WARNING] Tests run: 271, Failures: 0, Errors: 0, Skipped: 24

============================================================
HNS-Oauth-DFS-IngressBlob

[WARNING] Tests run: 235, Failures: 0, Errors: 0, Skipped: 3
[WARNING] Tests run: 778, Failures: 0, Errors: 0, Skipped: 229
[WARNING] Tests run: 158, Failures: 0, Errors: 0, Skipped: 8
[WARNING] Tests run: 271, Failures: 0, Errors: 0, Skipped: 23

============================================================
NonHNS-OAuth-DFS-IngressBlob

[WARNING] Tests run: 235, Failures: 0, Errors: 0, Skipped: 10
[WARNING] Tests run: 739, Failures: 0, Errors: 0, Skipped: 286
[WARNING] Tests run: 158, Failures: 0, Errors: 0, Skipped: 9
[WARNING] Tests run: 271, Failures: 0, Errors: 0, Skipped: 24

@bhattmanish98 marked this pull request as ready for review December 19, 2025 14:12

abfsConfiguration.isBackoffRetryMetricsEnabled());
break;
case INTERNAL_FOOTER_METRIC_FORMAT:
initializeReadFooterMetrics();

Contributor

A break is missing here.
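
For context on why the missing `break` matters: in a Java `switch` statement, a case without `break` falls through into the next case. The sketch below illustrates that with made-up names (`MetricFormat`, the case labels, and the recorded strings are hypothetical and are not the PR's enum or initializers).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of switch fall-through; not the PR's code. */
final class SwitchFallThroughSketch {

    enum MetricFormat { BACKOFF, FOOTER, FOOTER_WITH_BACKOFF }

    /** FOOTER has no break, so it also runs the FOOTER_WITH_BACKOFF branch. */
    static List<String> initMissingBreak(MetricFormat format) {
        List<String> initialized = new ArrayList<>();
        switch (format) {
            case BACKOFF:
                initialized.add("backoff");
                break;
            case FOOTER:
                initialized.add("footer");
                // No break: control falls through into the next case.
            case FOOTER_WITH_BACKOFF:
                initialized.add("footer");
                initialized.add("backoff");
                break;
            default:
                break;
        }
        return initialized;
    }

    /** Same switch with the break restored; FOOTER initializes only itself. */
    static List<String> initWithBreak(MetricFormat format) {
        List<String> initialized = new ArrayList<>();
        switch (format) {
            case BACKOFF:
                initialized.add("backoff");
                break;
            case FOOTER:
                initialized.add("footer");
                break;
            case FOOTER_WITH_BACKOFF:
                initialized.add("footer");
                initialized.add("backoff");
                break;
            default:
                break;
        }
        return initialized;
    }
}
```

With `MetricFormat.FOOTER`, the first method records three initializations and the second records one.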


@Override
public String toString() {
String metric = "";

Contributor

Use the empty-string constant.

public static final String FS_AZURE_METRIC_ACCOUNT_NAME = "fs.azure.metric.account.name";
public static final String FS_AZURE_METRIC_ACCOUNT_KEY = "fs.azure.metric.account.key";
public static final String FS_AZURE_METRIC_URI = "fs.azure.metric.uri";
public static final String FS_AZURE_METRIC_FORMAT = "fs.azure.metric.format";

Contributor

Add Javadoc for all of these with the {@value} tag.
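
A sketch of what that could look like for the four keys quoted above. The constant names and values are copied from the snippet; the class name and the one-line descriptions are assumptions, since the PR excerpt does not say what each key means.

```java
/** Hypothetical holder class; only the constants mirror the snippet above. */
final class MetricConfigKeysSketch {

    /** Account name used when sending metrics (description assumed): {@value}. */
    public static final String FS_AZURE_METRIC_ACCOUNT_NAME = "fs.azure.metric.account.name";

    /** Account key used when sending metrics (description assumed): {@value}. */
    public static final String FS_AZURE_METRIC_ACCOUNT_KEY = "fs.azure.metric.account.key";

    /** URI that metrics are sent to (description assumed): {@value}. */
    public static final String FS_AZURE_METRIC_URI = "fs.azure.metric.uri";

    /** Format in which metrics are emitted (description assumed): {@value}. */
    public static final String FS_AZURE_METRIC_FORMAT = "fs.azure.metric.format";

    private MetricConfigKeysSketch() {
        // Constants only; not meant to be instantiated.
    }
}
```

In the generated documentation, `{@value}` is replaced by the constant's literal value, so the key string does not have to be repeated by hand in the comment.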

public static final String FS_AZURE_METRIC_ACCOUNT_NAME = "fs.azure.metric.account.name";
public static final String FS_AZURE_METRIC_ACCOUNT_KEY = "fs.azure.metric.account.key";
public static final String FS_AZURE_METRIC_URI = "fs.azure.metric.uri";
public static final String FS_AZURE_METRIC_FORMAT = "fs.azure.metric.format";

Contributor

Keep the names of all metric-related configs consistent, with the same prefix:
fs.azure.metrics....

Contributor Author

The metric account name, key, and format configs use "metric", while the other configurations use "metrics". I kept this intentionally. Do you want me to change it?

Contributor

Yes. I think we should have a common prefix for all the metric-related configs.

metricAccountKey)) {
int dotIndex = metricAccountName.indexOf(AbfsHttpConstants.DOT);
if (dotIndex <= 0) {
throw new InvalidUriException(

Contributor

Add a test for this exception if one does not already exist.
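
A rough, self-contained sketch of the kind of check such a test would exercise. It is not the PR's code: `IllegalArgumentException` stands in for `InvalidUriException`, a literal `'.'` stands in for `AbfsHttpConstants.DOT`, and returning the part before the dot is an assumption about what the real method does next.

```java
/** Hypothetical stand-in for the account-name validation shown above. */
final class MetricAccountNameCheckSketch {

    /** Returns the text before the first dot, rejecting names with no dot or a leading dot. */
    static String accountPrefix(String metricAccountName) {
        int dotIndex = metricAccountName.indexOf('.');
        if (dotIndex <= 0) {
            // Stand-in for InvalidUriException in the real code.
            throw new IllegalArgumentException(
                "Metric account name is not fully qualified: " + metricAccountName);
        }
        return metricAccountName.substring(0, dotIndex);
    }

    /** Test helper: true if the name is rejected by accountPrefix. */
    static boolean isRejected(String metricAccountName) {
        try {
            accountPrefix(metricAccountName);
            return false;
        } catch (IllegalArgumentException expected) {
            return true;
        }
    }
}
```

A test would cover both boundary inputs of `dotIndex <= 0`: a name with no dot at all (`indexOf` returns -1) and a name that starts with a dot (`indexOf` returns 0).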

final AbfsConfiguration abfsConfiguration,
final EncryptionContextProvider encryptionContextProvider,
final AbfsClientContext abfsClientContext,
final String fileSystemId,

Contributor

I think it's better to pass it as part of the client context, similar to the other client-related fields.

Contributor Author

Yes, that makes sense. I will make this change.

this.metricsEmitScheduler
= Executors.newSingleThreadScheduledExecutor();
// run every 1 minute to check the metrics count
this.metricsEmitScheduler.scheduleAtFixedRate(

Contributor

It seems two separate schedulers are being added here.
Each client has its own scheduler, and the singleton metrics manager class also has one?

Is this by design?

Contributor Author

Yes, this is by design. Each file system emits its metrics to the manager class at a regular interval until it is closed. The singleton manager class makes the actual API call to send the collected metrics.

if (isMetricCollectionEnabled && runningTimerTask != null) {
runningTimerTask.cancel();
timer.cancel();
if (isMetricCollectionEnabled()) {

Contributor

If not already done, verify that after the FS is closed all threads are shut down properly and none are leaked.

Contributor Author

A few tests already cover this. I will check whether more test cases are needed to cover additional scenarios.

private static final Logger LOG = LoggerFactory.getLogger(
AbfsReadFooterMetrics.class);

private static final String FOOTER_LENGTH = "20";

Contributor

Add a comment explaining why 20.

operationType, failureReason, retryCount, retryPolicy.getAbbreviation(), retryInterval);
if (abfsBackoffMetrics != null) {
updateBackoffTimeMetrics(retryCount, sleepDuration);
updateBackoffTimeMetrics(retryCount, retryInterval);

Contributor

Why this change?

Contributor Author

Earlier, sleepDuration was wrongly passed to updateBackoffTimeMetrics; retryInterval is the value that gives the delay between two retries.

}
StringBuilder metricBuilder = new StringBuilder();
getRetryMetrics(metricBuilder);
if (isRetryMetricEnabled) {

Contributor

I think we should always have retry metrics added as a part of these aggregate metrics

Contributor Author

I made this change following the discussion we had with the team. We can discuss this offline, and if everyone agrees I will revert the change.

private static volatile AggregateMetricsManager instance;

// Rate limiter to control the rate of dispatching metrics.
private static volatile SimpleRateLimiter rateLimiter;

Contributor

@anmolanmol1234 Dec 26, 2025

rateLimiter is declared as static but is initialized in the constructor using permitsPerSecond.

This means the rate limiter ends up being global to the JVM, even though its value comes from instance-level configuration. In practice, whichever code initializes the manager first decides the rate, and any later calls to get() with different values are silently ignored.
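
A stripped-down sketch of the pattern being described, to make the "first caller wins" effect concrete. The class and method names are hypothetical, and a plain `long` stands in for `SimpleRateLimiter`.

```java
/** Hypothetical reduction of a singleton whose static state is set from a constructor argument. */
final class FirstCallerWinsSketch {

    private static volatile FirstCallerWinsSketch instance;

    // Static, yet assigned from the constructor argument (stand-in for the rate limiter).
    private static volatile long permitsPerSecond;

    private FirstCallerWinsSketch(long permits) {
        permitsPerSecond = permits;
    }

    /** Double-checked locking: the constructor runs once, for the first caller only. */
    static FirstCallerWinsSketch get(long permits) {
        if (instance == null) {
            synchronized (FirstCallerWinsSketch.class) {
                if (instance == null) {
                    instance = new FirstCallerWinsSketch(permits);
                }
            }
        }
        // Later callers get the existing instance; their 'permits' value is ignored.
        return instance;
    }

    static long effectivePermitsPerSecond() {
        return permitsPerSecond;
    }
}
```

Calling `get(3)` and then `get(10)` leaves the effective rate at 3, with no error or log to signal that the second value was dropped.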


boolean isRemoved = bucket.deregisterClient(abfsClient);

if (bucket.isEmpty()) {

Contributor

There is a small race window between isEmpty() and remove(). Another thread may concurrently register a new client for the same account and reuse the bucket, but it can still be removed based on the earlier emptiness check. This makes the behavior timing-dependent and hard to reason about under concurrency.

You can use buckets.computeIfPresent() to perform the emptiness check and removal atomically, which avoids this race and keeps the map state consistent.
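
A minimal sketch of that suggestion, assuming the buckets live in a `ConcurrentHashMap`. The class name, the `String` keys, and the use of a set of client IDs as the bucket are simplifications for illustration, not the PR's types. Registration goes through `compute()` so that it is serialized with removal for the same key.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry: account name -> set of registered client IDs. */
final class BucketRegistrySketch {

    private final ConcurrentHashMap<String, Set<String>> buckets = new ConcurrentHashMap<>();

    /** Creates the bucket if needed and adds the client, atomically for this key. */
    void register(String account, String clientId) {
        buckets.compute(account, (key, clients) -> {
            Set<String> bucket = (clients != null) ? clients : ConcurrentHashMap.<String>newKeySet();
            bucket.add(clientId);
            return bucket;
        });
    }

    /** Removes the client and drops the bucket if it became empty, in one atomic step. */
    boolean deregister(String account, String clientId) {
        boolean[] removed = {false};
        buckets.computeIfPresent(account, (key, clients) -> {
            removed[0] = clients.remove(clientId);
            // Returning null removes the mapping for this key.
            return clients.isEmpty() ? null : clients;
        });
        return removed[0];
    }

    boolean hasBucket(String account) {
        return buckets.containsKey(account);
    }
}
```

Because both paths run inside the map's per-key atomic update, a concurrent `register` for the same account either sees the bucket before the emptiness check or creates a fresh one after the removal; it cannot add to a bucket that is about to be discarded.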

// Add shutdown hook to dispatch remaining metrics on JVM shutdown.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
dispatchMetrics();
scheduler.shutdown();

Contributor

Shouldn't we wait for dispatchMetrics() to finish before the scheduler is shut down?
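
Whether `dispatchMetrics()` blocks until the send completes is not visible in the snippet. One possible ordering that avoids the question is sketched below under assumed names: stop the scheduler first, wait for any in-flight periodic run, then do the final flush on the calling thread. The helper name, the `Runnable` parameter, and the timeout are hypothetical.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical shutdown helper; not the PR's shutdown hook. */
final class GracefulShutdownSketch {

    /**
     * Stops scheduling new runs, waits for a running one to finish, then flushes.
     * Returns true if the scheduler terminated within the timeout.
     */
    static boolean shutdownAndFlush(ScheduledExecutorService scheduler,
                                    Runnable finalDispatch,
                                    long timeoutSeconds) {
        // No new periodic runs start after this call.
        scheduler.shutdown();
        boolean terminated;
        try {
            terminated = scheduler.awaitTermination(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
            terminated = false;
        }
        if (!terminated) {
            // Give up waiting and interrupt whatever is still running.
            scheduler.shutdownNow();
        }
        // Final flush runs synchronously, so it cannot race with a periodic run.
        finalDispatch.run();
        return terminated;
    }
}
```

Inside a shutdown hook this would be called as `shutdownAndFlush(scheduler, this::dispatchMetrics, timeout)`, with the timeout kept short because shutdown hooks should finish quickly.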

});

// Schedule periodic dispatching of metrics.
this.scheduler.scheduleAtFixedRate(

Contributor

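Better to use scheduleWithFixedDelay. With scheduleAtFixedRate, a run that takes longer than the period makes the following runs start late and back to back; executions of the same task do not run concurrently, but the gap between them disappears.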
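
For reference, a small sketch of `scheduleWithFixedDelay`, where the delay is measured from the end of one run to the start of the next, so a slow dispatch always leaves a gap before the following one. The class name, method name, and timings are illustrative only.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical demo of fixed-delay scheduling; not the PR's scheduler setup. */
final class FixedDelaySketch {

    /** Schedules a task with a fixed delay and reports whether it ran at least twice. */
    static boolean runsAtLeastTwice(long delayMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch twoRuns = new CountDownLatch(2);
        try {
            // Next run starts delayMillis after the previous run has finished.
            scheduler.scheduleWithFixedDelay(
                twoRuns::countDown, 0, delayMillis, TimeUnit.MILLISECONDS);
            return twoRuns.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            scheduler.shutdownNow();
        }
    }
}
```

The method signature is the same as `scheduleAtFixedRate` (task, initial delay, interval, unit), so switching between the two is a one-word change at the call site.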

* @param permitsPerSecond Rate limit for dispatching metrics.
* @return Singleton instance of AggregateMetricsManager.
*/
public static AggregateMetricsManager get(final long dispatchIntervalInMins,

Contributor

The method name get is very generic; use a more descriptive name.
