Improved the monitors scan server page #6351
keith-turner wants to merge 6 commits into apache:main
Opened this as a draft because it builds on #6344. Will take it out of draft after that is merged. This is a screenshot of what the new monitor page looks like.
While testing this page I was splitting and merging tablets while running scans against the scan servers in order to see the metrics adjust. While doing this I ran into a problem where a scan server lost track of a reserved file after a tablet merge, which caused scans to fail. I need to see if I can reproduce this. I also noticed while running the test that the datanode had a high CPU load even though everything was supposedly fitting in cache in the scan servers, so I need to investigate this as well.
Looking at the new scan server page vs the page before, I think the following columns might be useful to add back (to both tserver and sserver pages): Scan Files Open
SCAN_BUSY_TIMEOUT_COUNT("accumulo.scan.busy.timeout.count", MetricType.FUNCTION_COUNTER,
    "Count of the scans where a busy timeout happened.", MetricDocSection.SCAN, "Scan Busy Count",
    null, NUMBER),
SCAN_TABLETS_CACHED("accumulo.scan.tablet.cached", MetricType.GAUGE,
I think this is already being captured with the SCAN_TABLET_METADATA_CACHE metric emitted by the Caffeine cache, but we are likely not letting it through to the monitor. In the LoggingMeterRegistry output it shows up as:
cache.size{cache=accumulo.scan.tablet.metadata.cache,host=localhost,instance.name=accumulo4,port=9700,process.name=scan.server,resource.group=default} value=0
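The logged line above identifies the meter by its name (`cache.size`) plus a `cache` tag. As a rough illustration of how those parts separate, here is a minimal sketch that splits such a line into a meter name and a tag map. The class and method names are hypothetical and this is not how the monitor ingests metrics. It only assumes the `name{k=v,...} value=N` layout shown above, with no commas inside tag values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Accumulo or Micrometer.
final class MeterLineSketch {

  /** Returns the text before the first '{', i.e. the meter name. */
  static String meterName(String line) {
    int open = line.indexOf('{');
    return open < 0 ? line : line.substring(0, open);
  }

  /** Extracts the comma separated key=value tags found between '{' and '}'. */
  static Map<String,String> parseTags(String line) {
    Map<String,String> tags = new LinkedHashMap<>();
    int open = line.indexOf('{');
    int close = line.indexOf('}', open + 1);
    if (open < 0 || close < 0) {
      return tags; // no tag section present
    }
    for (String pair : line.substring(open + 1, close).split(",")) {
      int eq = pair.indexOf('=');
      if (eq > 0) {
        tags.put(pair.substring(0, eq), pair.substring(eq + 1));
      }
    }
    return tags;
  }

  public static void main(String[] args) {
    String line = "cache.size{cache=accumulo.scan.tablet.metadata.cache,host=localhost,"
        + "instance.name=accumulo4,port=9700,process.name=scan.server,"
        + "resource.group=default} value=0";
    System.out.println(meterName(line) + " -> " + parseTags(line).get("cache"));
  }
}
```

Selecting on the `cache` tag rather than the meter name matters because every Caffeine cache reports under the same `cache.size` name.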
We also have the property general.micrometer.cache.metrics.enabled, which is why I decided not to use the cache property: it complicates setting up the monitor correctly even more. Maybe the best thing to do is to remove that property, always emit cache metrics, and pass the cache metrics through to the monitor. Not sure why that property was added, but we do not have an analogous property for executors. Any objection to removing general.micrometer.cache.metrics.enabled and always enabling cache metrics?
In 08a2dc1 switched to using the existing cache metrics, removed the new metric, and removed the property to enable/disable cache metrics.
I have no issue with this.
I can add files open and yields, but that would also add them to the tserver page, which is already really crowded. I am trying to have scan servers and tablet servers display the same columns for scans when they have the same data. For the other three I am thinking of collapsing them into a single column: replace the existing failed scans column, which shows a count of scans w/ exceptions, w/ a scan problems column that is a sum of all 4 metrics. Wonder if we could make a column value where you get more data when you mouse over it; that could show the count breakdown. W/o that, I am thinking we will eventually have a per server page where more details could be seen.
The zombie thread count column could be left out, and a message could be added to the messages table instead (#6346).
Made the scan server page have the same columns as the tablet server page for scans and added a few scan server specific columns. Made supporting changes to get a timer metric to display on the monitor, then used those changes to display the average reservation per scan on a scan server. Added two new metrics to the scan server and displayed those on the monitor. One metric is the number of tablets a scan server currently has cached. The other is the number of files a scan server has reserved.
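A timer metric carries a total elapsed time and an event count, so an average per event is the total divided by the count. The following is a minimal sketch of that arithmetic only. The class and method names are hypothetical, and it does not reproduce how the monitor actually reads the timer.

```java
import java.time.Duration;

// Hypothetical helper for illustration only; not the monitor's implementation.
final class TimerAverageSketch {

  /** Returns totalTime / count, or zero when no events were recorded. */
  static Duration average(Duration totalTime, long count) {
    if (count <= 0) {
      return Duration.ZERO; // avoid dividing by zero before any scan has run
    }
    return totalTime.dividedBy(count);
  }

  public static void main(String[] args) {
    // 900 ms of total reservation time spread over 3 scans.
    System.out.println(average(Duration.ofMillis(900), 3).toMillis() + " ms");
  }
}
```

Guarding the zero-count case keeps a freshly started server from displaying an error or NaN in the column.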
7290dee to 0c3503a
* added scan problems column that sums scan errors, scan mem pauses, and scan mem returns
* added scan open files column and scan yield column
* fixed bug with computing cache hit ratio
* narrowed what stats are used to display specific metric types
* cleaned up some errors in the Metric enum
That seems fine. I did not include it in the sum I created in 931285d because it is a GAUGE and everything else being summed is a function counter, so they have different units for display. 931285d now sums three metrics for a new Scan Problems column on the monitor: scan errors, scans returned for mem, and scans paused for mem. Also added the scan files open and scan file yield count columns in 931285d. @dlmarion taking this out of draft
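The Scan Problems column described here is a sum of three counters. A minimal sketch of that aggregation follows, assuming the counter values have already been collected into a map. The class name and the map keys are hypothetical placeholders, not the real Accumulo metric names.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; key names are placeholders.
final class ScanProblemsSketch {

  // The three counters that feed the single Scan Problems column.
  static final List<String> PROBLEM_KEYS =
      List.of("scanErrors", "scanPausedForMem", "scanReturnedForMem");

  /** Sums only the problem counters; missing counters count as zero. */
  static long scanProblems(Map<String,Long> counters) {
    long sum = 0;
    for (String key : PROBLEM_KEYS) {
      sum += counters.getOrDefault(key, 0L);
    }
    return sum;
  }

  public static void main(String[] args) {
    Map<String,Long> counters = Map.of("scanErrors", 2L, "scanPausedForMem", 3L,
        "scanReturnedForMem", 1L, "zombieThreads", 7L);
    // The gauge-style zombieThreads value is deliberately left out of the sum.
    System.out.println(scanProblems(counters));
  }
}
```

Keeping the gauge out of the sum avoids mixing a point-in-time value with counters that are displayed in different units.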
Tracked down this problem. The computation for cache hit rate on the monitor was wrong: it was dividing hit_count/hit_count instead of hit_count/request_count. Fixed that in 931285d and am now seeing lower hit rates instead of always seeing 100%.
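The corrected ratio can be sketched as below. This is a minimal illustration of the arithmetic, with hypothetical class and method names, and is not the monitor's actual code.

```java
// Hypothetical helper for illustration only; not the monitor's implementation.
final class CacheHitRatioSketch {

  /** Returns hitCount / requestCount, or 0.0 when there were no requests. */
  static double hitRatio(long hitCount, long requestCount) {
    if (requestCount <= 0) {
      return 0.0; // nothing requested yet, so there is no meaningful ratio
    }
    // The bug divided hitCount by hitCount, which is 1.0 whenever hits > 0.
    return (double) hitCount / requestCount;
  }

  public static void main(String[] args) {
    System.out.println(hitRatio(75, 100)); // prints 0.75
  }
}
```

With the old hit_count/hit_count form, any nonzero hit count yields 1.0, which matches the constant 100% that was being displayed.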
