Improved the monitors scan server page #6351

Open
keith-turner wants to merge 6 commits into apache:main from keith-turner:scan-server-view

@keith-turner
Contributor

Made the scan server page have the same columns as the tablet server
page for scans and added a few scan server specific columns.

Made supporting changes to get a timer metric to display on the monitor,
then used those changes to display the average reservation per scan on a
scan server.
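As a rough illustration of how an average can be derived from a timer metric, the sketch below divides the total recorded time by the number of recorded events. It assumes "average reservation per scan" means average reservation time; the class name `TimerAverageSketch` and method `meanPerEvent` are hypothetical and are not taken from this PR or from Accumulo.

```java
import java.time.Duration;

// Hypothetical helper, not code from this PR: a timer metric is modeled as
// an event count plus the total time spent across those events.
class TimerAverageSketch {

  // Average duration per event; returns Duration.ZERO when nothing was timed,
  // which avoids dividing by zero on an idle server.
  static Duration meanPerEvent(long count, Duration totalTime) {
    if (count <= 0) {
      return Duration.ZERO;
    }
    return totalTime.dividedBy(count);
  }

  public static void main(String[] args) {
    // Example values are made up: 4 reservations taking 100 ms in total.
    Duration mean = meanPerEvent(4, Duration.ofMillis(100));
    System.out.println(mean.toMillis() + " ms per reservation");
  }
}
```

Guarding the zero-count case keeps a server that has not yet run any scans from producing an error or a misleading value.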

Added two new metrics to the scan server and displayed those on the
monitor. One metric is the number of tablets a scan server currently has
cached. The other is the number of files a scan server has reserved.

@keith-turner
Contributor Author

keith-turner commented Apr 30, 2026

Opened this as a draft because it builds on #6344. Will take it out of draft after that is merged.

This is a screenshot of what the new monitor page looks like.

[Image: Screenshot at 2026-04-29 19-06-40]

When testing this page I was splitting and merging tablets while running scans against the scan servers in order to see the metrics adjust. While doing this I ran into a problem where a scan server lost track of a reserved file after a tablet merge, which caused scans to fail. Need to see if I can reproduce this.

Also noticed while running tests that the datanode had a high CPU load even though supposedly everything was fitting in cache in the scan servers, so need to investigate this also.

@dlmarion dlmarion changed the title Improved the monitors scan sever page Improved the monitors scan server page Apr 30, 2026
@dlmarion
Contributor

Looking at the new scan server page vs the page before, I think the following columns might be useful to add back (to both tserver and sserver pages):

Scan Files Open
Scan Yield Count
Scan Zombie Thread Count
Scans Returned Early for Low Mem
Scans Paused for Low Mem

SCAN_BUSY_TIMEOUT_COUNT("accumulo.scan.busy.timeout.count", MetricType.FUNCTION_COUNTER,
"Count of the scans where a busy timeout happened.", MetricDocSection.SCAN, "Scan Busy Count",
null, NUMBER),
SCAN_TABLETS_CACHED("accumulo.scan.tablet.cached", MetricType.GAUGE,
Contributor

I think this is already being captured with the SCAN_TABLET_METADATA_CACHE metric emitted by the Caffeine cache, but we are likely not letting it through to the monitor. In the LoggingMeterRegistry output it shows up as:

cache.size{cache=accumulo.scan.tablet.metadata.cache,host=localhost,instance.name=accumulo4,port=9700,process.name=scan.server,resource.group=default} value=0

Contributor Author

We also have the property general.micrometer.cache.metrics.enabled, which is why I decided not to use the cache metrics: depending on that property complicates setting up the monitor correctly even more. Maybe the best thing to do is to remove that property, always emit cache metrics, and pass the cache metrics through to the monitor. Not sure why that property was added, but we do not have an analogous property for executors. Any objection to removing general.micrometer.cache.metrics.enabled and always enabling cache metrics?

Contributor Author

@keith-turner keith-turner May 1, 2026

In 08a2dc1, switched to using the existing cache metrics, removed the new metric, and removed the property to enable/disable cache metrics.

@dlmarion
Contributor

Any objection to removing general.micrometer.cache.metrics.enabled and always enabling cache metrics?

I have no issue with this.

@keith-turner
Contributor Author

Looking at the new scan server page vs the page before, I think the following columns might be useful to add back (to both tserver and sserver pages):

Scan Files Open
Scan Yield Count
Scan Zombie Thread Count
Scans Returned Early for Low Mem
Scans Paused for Low Mem

I can add files open and yields, but I would also add them to the tserver page, which is already really crowded. Trying to have scan servers and tablet servers display the same columns for scans when they have the same data.

For the other three I am thinking of collapsing them into a single column and replacing the existing failed scans column, which shows a count of scans with exceptions, with a scan problems column that is a sum of all 4 metrics. I wonder if we could make a column value where you get more data when you mouse over it; that could show the count breakdown. Without that, I am thinking we will eventually have a per-server page where more details could be seen.

@dlmarion
Contributor

The zombie thread count column could be left out and instead add a message in the messages table (#6346)

Made the scan server page have the same columns as the tablet server
page for scans and added a few scan server specific columns.

Made supporting changes to get a timer metric to display on the monitor,
then used those changes to display the average reservation per scan on a
scan server.

Added two new metrics to the scan server and displayed those on the
monitor. One metric is the number of tablets a scan server currently has
cached. The other is the number of files a scan server has reserved.
 * added scan problems column that sums scan errors, scan mem pauses, and scan mem returns
 * added scan open files column and scan yield column
 * fixed bug with computing cache hit ratio
 * narrowed what stats are used to display specific metric types
 * cleaned up some errors in the Metric enum
@keith-turner
Contributor Author

The zombie thread count column could be left out and instead add a message in the messages table (#6346)

That seems fine. I did not include it in the sum I created in 931285d because it is a GAUGE and everything else being summed is a FUNCTION_COUNTER, so they have different units for display. In 931285d the monitor now sums three metrics for a new Scan Problems column: scan errors, scans returned for mem, and scans paused for mem. Also added the scan files open and scan yield count columns in 931285d. @dlmarion taking this out of draft.
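The summed column described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `ScanProblemsSketch`, the method `scanProblems`, and the use of `long` values are hypothetical and do not reflect the actual monitor code in 931285d.

```java
// Hypothetical sketch, not the monitor's implementation: combine the three
// counter-style scan metrics into one "Scan Problems" value.
class ScanProblemsSketch {

  // The zombie thread count is deliberately left out because, per the
  // discussion, it is a gauge and the other three values are counters.
  // Math.addExact throws ArithmeticException on overflow rather than
  // silently wrapping around.
  static long scanProblems(long scanErrors, long returnedForLowMem, long pausedForLowMem) {
    return Math.addExact(Math.addExact(scanErrors, returnedForLowMem), pausedForLowMem);
  }

  public static void main(String[] args) {
    // Example values are made up for illustration.
    System.out.println("Scan Problems: " + scanProblems(2, 1, 3));
  }
}
```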

@keith-turner keith-turner marked this pull request as ready for review May 1, 2026 23:26
@keith-turner
Contributor Author

Also noticed while running tests that the datanode had a high CPU load even though supposedly everything was fitting in cache in the scan servers, so need to investigate this also.

Tracked down this problem. The computation for cache hit rate on the monitor was wrong: it was dividing hit_count/hit_count instead of hit_count/request_count, so it always reported 100%. Fixed that in 931285d and now seeing lower hit rates.
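A minimal sketch of the corrected ratio is shown below. The class `CacheHitRatioSketch` and method `hitRatio` are hypothetical names chosen for illustration, and returning 0.0 when there are no requests is an assumption rather than the behavior confirmed in 931285d.

```java
// Hypothetical sketch of the corrected computation: hits divided by total
// requests, rather than hits divided by hits (which is always 1.0).
class CacheHitRatioSketch {

  // Returns a value in [0.0, 1.0] for valid inputs. Reporting 0.0 when no
  // requests were made is an assumed convention that avoids dividing by zero.
  static double hitRatio(long hitCount, long requestCount) {
    if (requestCount <= 0) {
      return 0.0;
    }
    return (double) hitCount / requestCount;
  }

  public static void main(String[] args) {
    // Example values are made up: 50 hits out of 200 requests.
    System.out.println(hitRatio(50, 200));
  }
}
```

With the earlier hit_count/hit_count form, any nonzero hit count yields 1.0, which matches the always-100% symptom described above.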
