Improved the monitors scan server page by keith-turner · Pull Request #6351 · apache/accumulo

keith-turner · 2026-04-30T01:44:45Z

Made the scan server page have the same columns as the tablet server
page for scans and added a few scan server specific columns.

Made supporting changes to get a timer metric to display on the monitor,
then used those changes to display the average reservation per scan on a
scan server.
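
As an illustration of the average shown on the monitor, here is a minimal sketch that assumes the timer exposes a count and a total time (as Micrometer timers do) and that the average is total time divided by count. The class and method names are hypothetical and are not taken from this PR.

```java
import java.time.Duration;

// Hypothetical helper, not code from this PR: derives an average duration
// from the two values a timer metric typically reports.
final class TimerAverageSketch {

  /** Returns totalTime / count, or zero when nothing has been timed yet. */
  static Duration average(long count, Duration totalTime) {
    if (count <= 0) {
      return Duration.ZERO; // no scans recorded, so there is no average to show
    }
    return totalTime.dividedBy(count);
  }

  public static void main(String[] args) {
    // Assumed example values: 4 scans spent 10 ms in total reserving files.
    Duration avg = average(4, Duration.ofMillis(10));
    System.out.println(avg.toNanos()); // 2500000
  }
}
```

Guarding the zero-count case keeps a freshly started scan server from displaying a division error before its first scan.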

Added two new metrics to the scan server and displayed those on the
monitor. One metric is the number of tablets a scan server currently has
cached. The other is the number of files a scan server has reserved.

keith-turner · 2026-04-30T02:10:50Z

Opened this as a draft because it builds on #6344. Will take it out of draft after that is merged.

This is a screenshot of what the new monitor page looks like.

When testing this page I was splitting and merging tablets while running scans against the scan servers in order to see the metrics adjust. While doing this I ran into a problem where a scan server lost track of a reserved file after a tablet merge, which caused scans to fail. Need to see if I can reproduce this.

Also noticed while running tests that the datanode had a high CPU load even though supposedly everything was fitting in cache in the scan servers, so need to investigate this also.

dlmarion · 2026-04-30T11:37:04Z

Looking at the new scan server page vs the page before, I think the following columns might be useful to add back (to both tserver and sserver pages):

Scan Files Open
Scan Yield Count
Scan Zombie Thread Count
Scans Returned Early for Low Mem
Scans Paused for Low Mem

dlmarion · 2026-04-30T11:45:44Z

  SCAN_BUSY_TIMEOUT_COUNT("accumulo.scan.busy.timeout.count", MetricType.FUNCTION_COUNTER,
      "Count of the scans where a busy timeout happened.", MetricDocSection.SCAN, "Scan Busy Count",
      null, NUMBER),
+  SCAN_TABLETS_CACHED("accumulo.scan.tablet.cached", MetricType.GAUGE,


I think this is already being captured with the SCAN_TABLET_METADATA_CACHE metric emitted by the Caffeine cache, but we are likely not letting it through to the monitor. In the LoggingMeterRegistry output it shows up as:

cache.size{cache=accumulo.scan.tablet.metadata.cache,host=localhost,instance.name=accumulo4,port=9700,process.name=scan.server,resource.group=default} value=0

We also have the property general.micrometer.cache.metrics.enabled, which is why I decided not to use the cache property, because it complicates setting up the monitor correctly even more. Maybe the best thing to do is to remove that property, always emit cache metrics, and pass the cache metrics through to the monitor. Not sure why that property was added, but we do not have an analogous property for executors. Any objection to removing general.micrometer.cache.metrics.enabled and always enabling cache metrics?

In 08a2dc1 I switched to using the existing cache metrics, removed the new metric, and removed the property to enable/disable cache metrics.

dlmarion · 2026-04-30T20:09:44Z

Any objection to removing general.micrometer.cache.metrics.enabled and always enabling cache metrics?

I have no issue with this.

keith-turner · 2026-04-30T20:19:13Z

Looking at the new scan server page vs the page before, I think the following columns might be useful to add back (to both tserver and sserver pages):

Scan Files Open
Scan Yield Count
Scan Zombie Thread Count
Scans Returned Early for Low Mem
Scans Paused for Low Mem

I can add files open and yields, though I would also add them to the tserver page, which is already really crowded. I am trying to have scan servers and tablet servers display the same columns for scans when they have the same data.

For the other three I am thinking of collapsing them into a single column and replacing the existing failed scans column, which shows a count of scans w/ exceptions, w/ a scan problems column that is a sum of all 4 metrics. I wonder if we could make a column value where you get more data when you mouse over it; that could show the count breakdown. W/o that, I am thinking we will eventually have a per server page where more details could be seen.

dlmarion · 2026-04-30T20:25:00Z

The zombie thread count column could be left out and instead add a message in the messages table (#6346)

* added scan problems column that sums scan errors, scan mem pauses, and scan mem returns
* added scan open files column and scan yield column
* fixed bug with computing cache hit ratio
* narrowed what stats are used to display specific metric types
* cleaned up some errors in the Metric enum

keith-turner · 2026-05-01T23:26:36Z

The zombie thread count column could be left out and instead add a message in the messages table (#6346)

That seems fine. I did not include it in the sum I created in 931285d because it's a GAUGE and everything else being summed is a function counter, so they have different units for display. In 931285d I am now summing three metrics for a new Scan Problems column on the monitor: scan errors, scans returned for mem, and scans paused for mem. Also added the scan files open and scan yield count columns in 931285d. @dlmarion taking this out of draft.
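
A minimal sketch of the summing described above, assuming the three counter values are already available as plain longs. The class, method, and parameter names are hypothetical, and the breakdown string is only an illustration of the mouse-over idea raised earlier in the thread, not something this PR implements.

```java
// Hypothetical helper, not code from this PR: collapses three counter values
// into a single "Scan Problems" figure for display.
final class ScanProblemsSketch {

  /** Sums scan errors, scans returned early for low memory, and scans paused for low memory. */
  static long scanProblems(long scanErrors, long returnedForLowMem, long pausedForLowMem) {
    // addExact throws ArithmeticException on overflow instead of silently wrapping.
    return Math.addExact(Math.addExact(scanErrors, returnedForLowMem), pausedForLowMem);
  }

  /** Builds a per-metric breakdown that a tooltip or detail page could show. */
  static String breakdown(long scanErrors, long returnedForLowMem, long pausedForLowMem) {
    return String.format("errors=%d, returned for low mem=%d, paused for low mem=%d",
        scanErrors, returnedForLowMem, pausedForLowMem);
  }

  public static void main(String[] args) {
    // Assumed example values for a single scan server.
    System.out.println(scanProblems(2, 3, 4)); // 9
    System.out.println(breakdown(2, 3, 4));
  }
}
```

The zombie thread count is left out of the sum on purpose: it is a gauge, so adding it to counters would mix values with different meanings.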

keith-turner · 2026-05-01T23:28:53Z

Also noticed while running tests that the datanode had a high CPU load even though supposedly everything was fitting in cache in the scan servers, so need to investigate this also.

Tracked down this problem. The computation for cache hit rate on the monitor was wrong: it was dividing hit_count/hit_count instead of hit_count/request_count. Fixed that in 931285d and now seeing lower hit rates instead of always seeing 100%.
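
To make the difference concrete, here is a minimal sketch of the two computations. The class and method names are hypothetical and do not come from the monitor code; only the hit_count/request_count formula is taken from the comment above.

```java
// Hypothetical helper, not code from this PR: contrasts the buggy and the
// corrected cache hit ratio computations.
final class CacheHitRatioSketch {

  /** Corrected form: hits divided by total requests, or 0.0 before any request. */
  static double hitRatio(long hitCount, long requestCount) {
    if (requestCount <= 0) {
      return 0.0; // avoid dividing by zero when the cache has not been queried
    }
    return (double) hitCount / requestCount;
  }

  /** Buggy form: any nonzero value divided by itself is 1.0, so the page always showed 100%. */
  static double buggyHitRatio(long hitCount) {
    return (double) hitCount / hitCount;
  }

  public static void main(String[] args) {
    // Assumed example values: 75 hits out of 100 requests.
    System.out.println(hitRatio(75, 100)); // 0.75
    System.out.println(buggyHitRatio(75)); // 1.0
  }
}
```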

keith-turner mentioned this pull request Apr 30, 2026

Fixes fate ranged lock encoding #6350

Merged

dlmarion changed the title ~~Improved the monitors scan sever page~~ Improved the monitors scan server page Apr 30, 2026

dlmarion reviewed Apr 30, 2026

keith-turner added 2 commits April 30, 2026 20:33

revert changes made in other PR

0c3503a

keith-turner force-pushed the scan-server-view branch from 7290dee to 0c3503a Compare April 30, 2026 20:38

keith-turner added 2 commits May 1, 2026 00:53

use existing cache metrics for cache size

08a2dc1

keith-turner marked this pull request as ready for review May 1, 2026 23:26

keith-turner added 2 commits May 1, 2026 23:33

fix build

b9ecb0b

fix comment

ac561fc

dlmarion approved these changes May 4, 2026

ctubbsii added this to the 4.0.0 milestone Jun 18, 2026

Improved the monitors scan server page #6351
keith-turner wants to merge 6 commits into apache:main from keith-turner:scan-server-view
