Hadoop-19676: ABFS: Aggregated Metrics Manager with Multiple Emission Criteria #8137
base: trunk
Conversation
        abfsConfiguration.isBackoffRetryMetricsEnabled());
    break;
  case INTERNAL_FOOTER_METRIC_FORMAT:
    initializeReadFooterMetrics();
`break` is missing here.
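A minimal, self-contained sketch of why the missing `break` matters; the class, method names, and case values here are illustrative stand-ins, not the PR's actual code:

```java
public class FallThroughDemo {
    // Without a break, execution falls through into the next label.
    static String handle(int format) {
        StringBuilder sb = new StringBuilder();
        switch (format) {
            case 1:
                sb.append("backoff");
                break;
            case 2:
                sb.append("footer");
                // missing break: control falls through to default
            default:
                sb.append("+default");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(handle(2)); // prints footer+default
    }
}
```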
@Override
public String toString() {
  String metric = "";
Use empty string constant
public static final String FS_AZURE_METRIC_ACCOUNT_NAME = "fs.azure.metric.account.name";
public static final String FS_AZURE_METRIC_ACCOUNT_KEY = "fs.azure.metric.account.key";
public static final String FS_AZURE_METRIC_URI = "fs.azure.metric.uri";
public static final String FS_AZURE_METRIC_FORMAT = "fs.azure.metric.format";
Add javadoc for all of these with the {@value} tag.
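A hedged sketch of what the suggested javadoc could look like; only the constant names and values come from the diff above, the descriptions are assumptions:

```java
public class ConfigurationKeysSketch {
    /**
     * Storage account to which aggregated metrics are emitted
     * (description assumed for illustration). Value: {@value}.
     */
    public static final String FS_AZURE_METRIC_ACCOUNT_NAME =
        "fs.azure.metric.account.name";

    /**
     * Access key for the metrics storage account
     * (description assumed for illustration). Value: {@value}.
     */
    public static final String FS_AZURE_METRIC_ACCOUNT_KEY =
        "fs.azure.metric.account.key";

    public static void main(String[] args) {
        // {@value} inlines the constant's value into the generated javadoc.
        System.out.println(FS_AZURE_METRIC_ACCOUNT_NAME);
    }
}
```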
...ls/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/constants/ConfigurationKeys.java
public static final String FS_AZURE_METRIC_ACCOUNT_NAME = "fs.azure.metric.account.name";
public static final String FS_AZURE_METRIC_ACCOUNT_KEY = "fs.azure.metric.account.key";
public static final String FS_AZURE_METRIC_URI = "fs.azure.metric.uri";
public static final String FS_AZURE_METRIC_FORMAT = "fs.azure.metric.format";
Keep all the metric-related config names consistent, with the same prefix:
fs.azure.metrics....
The metric account name, key, and format configs use "metric", while the other configurations use "metrics". I kept this intentionally. Do you want to change it?
Yes. I think we should have a common prefix for all the metric-related configs.
...ls/hadoop-azure/src/main/java/org/apache/hadoop/fs/azurebfs/services/AbfsBackoffMetrics.java
    metricAccountKey)) {
  int dotIndex = metricAccountName.indexOf(AbfsHttpConstants.DOT);
  if (dotIndex <= 0) {
    throw new InvalidUriException(
Add a test around this exception if there isn't one already.
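A self-contained sketch of such a test: the validation mirrors the `dotIndex` check in the diff above, but `IllegalArgumentException` stands in for `InvalidUriException` so the example runs on its own, and the method name is hypothetical:

```java
public class AccountNameValidationSketch {
    // Mirrors the dotIndex check above: the account name must contain
    // a '.' separating the account from its endpoint suffix.
    static void validate(String metricAccountName) {
        int dotIndex = metricAccountName.indexOf('.');
        if (dotIndex <= 0) {
            // The real code throws InvalidUriException; IllegalArgumentException
            // keeps this sketch free of Hadoop dependencies.
            throw new IllegalArgumentException(
                "Invalid metric account name: " + metricAccountName);
        }
    }

    public static void main(String[] args) {
        validate("myaccount.dfs.core.windows.net"); // passes silently
        try {
            validate("noseparator");
            System.out.println("no exception");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected"); // prints rejected
        }
    }
}
```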
final AbfsConfiguration abfsConfiguration,
final EncryptionContextProvider encryptionContextProvider,
final AbfsClientContext abfsClientContext,
final String fileSystemId,
I think it's better to pass it as part of the client context, similar to other client-related fields.
Yes, makes sense. Will make this change.
this.metricsEmitScheduler
    = Executors.newSingleThreadScheduledExecutor();
// run every 1 minute to check the metrics count
this.metricsEmitScheduler.scheduleAtFixedRate(
It looks like two separate schedulers are being added here: each client has its own scheduler, and the singleton metrics manager class also has one. Is this as per the design?
Yes, it is as per the design: each file system emits its metrics to the manager class at a regular interval as long as it is not closed. The singleton manager class makes the actual API call to send the collected metrics.
if (isMetricCollectionEnabled && runningTimerTask != null) {
  runningTimerTask.cancel();
  timer.cancel();
if (isMetricCollectionEnabled()) {
If not already done, verify that after FS close all the threads are properly shut down and there is no leak.
There are a few tests that already cover this. Will check whether more test cases are needed to cover additional scenarios.
private static final Logger LOG = LoggerFactory.getLogger(
    AbfsReadFooterMetrics.class);

private static final String FOOTER_LENGTH = "20";
Add a comment explaining why 20.
    operationType, failureReason, retryCount, retryPolicy.getAbbreviation(), retryInterval);
if (abfsBackoffMetrics != null) {
  updateBackoffTimeMetrics(retryCount, sleepDuration);
  updateBackoffTimeMetrics(retryCount, retryInterval);
Why this change?
Earlier, sleepDuration was wrongly passed to updateBackoffTimeMetrics; it is actually retryInterval that tells us the delay between two retries.
Line 341 in 4a82720
updateBackoffTimeMetrics(retryCount, sleepDuration);
}
StringBuilder metricBuilder = new StringBuilder();
getRetryMetrics(metricBuilder);
if (isRetryMetricEnabled) {
I think we should always include retry metrics as part of these aggregate metrics.
I made this change as per the discussion we had with the team. We can discuss this offline, and if everyone agrees I will revert the change.
private static volatile AggregateMetricsManager instance;

// Rate limiter to control the rate of dispatching metrics.
private static volatile SimpleRateLimiter rateLimiter;
rateLimiter is declared as static but is initialized in the constructor using permitsPerSecond.
This means the rate limiter ends up being global to the JVM, even though its value comes from instance-level configuration. In practice, whichever code initializes the manager first decides the rate, and any later calls to get() with different values are silently ignored.
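A minimal stand-in demonstrating the first-caller-wins behavior described above; a plain `Integer` replaces `SimpleRateLimiter` so the sketch is runnable, and all names are illustrative:

```java
public class StaticLimiterPitfall {
    // Static field initialized from instance-level configuration:
    // whichever constructor runs first fixes the value for the whole JVM.
    private static volatile Integer sharedRate;

    StaticLimiterPitfall(int permitsPerSecond) {
        if (sharedRate == null) {
            sharedRate = permitsPerSecond;
        }
        // A later caller passing a different permitsPerSecond is
        // silently ignored, exactly as described in the review comment.
    }

    public static void main(String[] args) {
        new StaticLimiterPitfall(100);
        new StaticLimiterPitfall(5);    // different config, no effect
        System.out.println(sharedRate); // prints 100
    }
}
```

Making the limiter a non-static final field set in the constructor would tie its rate to each instance's own configuration.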
boolean isRemoved = bucket.deregisterClient(abfsClient);

if (bucket.isEmpty()) {
There is a small race window between isEmpty() and remove(). Another thread may concurrently register a new client for the same account and reuse the bucket, but it can still be removed based on the earlier emptiness check. This makes the behavior timing-dependent and hard to reason about under concurrency.
You can use buckets.computeIfPresent() to perform the emptiness check and removal atomically, which avoids this race and keeps the map state consistent.
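A self-contained sketch of the suggested `computeIfPresent()` approach; the map layout and method names are illustrative, not the PR's actual code:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class BucketRemovalSketch {
    // Hypothetical bucket map: account name -> set of client ids.
    static final ConcurrentHashMap<String, Set<String>> buckets =
        new ConcurrentHashMap<>();

    // Deregister a client and drop the bucket atomically once it is empty.
    static void deregister(String account, String clientId) {
        buckets.computeIfPresent(account, (key, clients) -> {
            clients.remove(clientId);
            // Returning null removes the entry. The emptiness check and the
            // removal both run under the map's internal lock for this key,
            // so no other thread can register into the bucket in between.
            return clients.isEmpty() ? null : clients;
        });
    }

    public static void main(String[] args) {
        buckets.put("acct", ConcurrentHashMap.newKeySet());
        buckets.get("acct").add("client-1");
        deregister("acct", "client-1");
        System.out.println(buckets.containsKey("acct")); // prints false
    }
}
```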
// Add shutdown hook to dispatch remaining metrics on JVM shutdown.
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
  dispatchMetrics();
  scheduler.shutdown();
Shouldn't we wait for dispatchMetrics to finish before the scheduler is shut down?
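A minimal sketch of one possible ordering: flush first, then stop the scheduler and give in-flight tasks a bounded window via `awaitTermination`. The names here are illustrative stand-ins:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class OrderedShutdownSketch {
    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        // Stand-in for dispatchMetrics(): runs before shutdown is requested.
        Runnable dispatchMetrics = () -> System.out.println("dispatched");
        dispatchMetrics.run();
        scheduler.shutdown(); // stop accepting new tasks
        // Block until in-flight work completes (or the timeout elapses)
        // before letting the JVM exit.
        boolean clean = scheduler.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println(clean); // prints true
    }
}
```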
});

// Schedule periodic dispatching of metrics.
this.scheduler.scheduleAtFixedRate(
Better to use scheduleWithFixedDelay, as scheduleAtFixedRate can overlap executions.
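A small runnable illustration of `scheduleWithFixedDelay`, which measures the delay from the end of one execution to the start of the next, so a slow dispatch can never pile up queued runs the way `scheduleAtFixedRate` can:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class FixedDelaySketch {
    public static void main(String[] args) throws Exception {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
        CountDownLatch ran = new CountDownLatch(2);
        // Each run starts 10 ms after the PREVIOUS run finished,
        // so executions are strictly sequential and never overlap.
        scheduler.scheduleWithFixedDelay(
            ran::countDown, 0, 10, TimeUnit.MILLISECONDS);
        boolean done = ran.await(5, TimeUnit.SECONDS);
        scheduler.shutdownNow();
        System.out.println(done); // prints true
    }
}
```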
 * @param permitsPerSecond Rate limit for dispatching metrics.
 * @return Singleton instance of AggregateMetricsManager.
 */
public static AggregateMetricsManager get(final long dispatchIntervalInMins,
The get method name is very generic; use better naming.
This PR introduces a centralized Aggregated Metrics Manager and defines the conditions under which aggregated metrics are emitted from individual file systems.
Key Changes
Aggregated metrics are emitted based on the following conditions:
Time-based interval - Each file system periodically emits its collected metrics at a fixed interval. After emission, metric collection is reset.
Threshold-based emission - A scheduler runs at regular intervals to check whether the total number of operations has exceeded a configured threshold. This prevents the aggregated metrics string from growing too large to be safely sent as an HTTP request header. If the threshold is reached, the collected metrics are emitted immediately, and metric collection is reset.
Idle-period emission - If a file system remains idle for a configured duration, any accumulated metrics are emitted, and metric collection is reset.
File system close - When a file system is closed, all remaining collected metrics are emitted to ensure no data is lost.
All file systems now push their aggregated metrics to a shared Aggregated Metrics Manager. This manager evaluates the configured emission criteria and determines whether metrics should be emitted immediately or deferred until a later time.
The manager also rate-limits the number of metrics calls per second.
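The threshold-based and idle-period criteria above could be sketched roughly as follows; the threshold values, types, and method names are hypothetical, not the PR's actual configuration:

```java
import java.time.Duration;
import java.time.Instant;

public class EmissionCriteriaSketch {
    // Hypothetical limits; the real values come from ABFS configuration.
    static final int OPERATION_THRESHOLD = 1000;
    static final Duration IDLE_LIMIT = Duration.ofMinutes(5);

    // Decide whether accumulated metrics should be emitted now: either the
    // operation count has crossed the threshold (keeping the metrics string
    // small enough for an HTTP header) or the file system has been idle.
    static boolean shouldEmit(int operationCount,
                              Instant lastActivity, Instant now) {
        boolean overThreshold = operationCount >= OPERATION_THRESHOLD;
        boolean idleTooLong =
            Duration.between(lastActivity, now).compareTo(IDLE_LIMIT) >= 0;
        return overThreshold || idleTooLong;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(shouldEmit(1500, now, now));               // true
        System.out.println(
            shouldEmit(10, now.minus(Duration.ofMinutes(6)), now));   // true
        System.out.println(shouldEmit(10, now, now));                 // false
    }
}
```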