Skip to content

Fix SNMP cacheNumObjCount -- number of disk cached objects #2053

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

cvuosalo
Copy link

@cvuosalo cvuosalo commented Apr 10, 2025

SNMP counter cacheNumObjCount used StoreEntry::inUseCount() stats. For
rock cache_dirs, the number of StoreEntry objects in use is usually very
different from the number of cached objects because these caches do not
create StoreEntry objects when indexing cached content.

We now use the same stats that are reported on "on-disk objects" line in
"Internal Data Structures" section of mgr:info cache manager report. Due
to floating-point arithmetic/conversions, these stats are approximate,
but it is best to keep SNMP and cache manager reports consistent.

This change does not fix SNMP Gauge32 overflow bug: Caches with 2^32 or
more objects continue to report wrong/smaller cacheNumObjCount values.

@cvuosalo cvuosalo marked this pull request as draft April 10, 2025 16:43
@cvuosalo cvuosalo changed the title Correct number of cache objects reported to monitoring Correct SNMP counter for number of cache objects Apr 10, 2025
@yadij
Copy link
Contributor

yadij commented Apr 11, 2025

This looks reasonable. Have you tested with multiple workers and rock caches to ensure the number is correct when there are multiple cache processes.

@rousskov
Copy link
Contributor

This looks reasonable. Have you tested with multiple workers and rock caches to ensure the number is correct when there are multiple cache processes.

It cannot be correct AFAICT because PR code does not communicate with disker processes that currently manage rock cache_dir stats. See Rock::SwapDir::doReportStat() and commit 39c1e1d for more details. If we keep the current official code architecture, then this PR would have to asynchronously aggregate information across diskers like mgr:info currently does. Doing that well (e.g., without code duplication) may be difficult, but I have not checked any details.

Regardless of the design specifics, we should strive to keep cache manager stats and SNMP stats in sync: Both code areas should use the same mechanisms for obtaining statistics and just format/report it differently. Unfortunately, achieving that ideal requires a lot of work.

Answer = snmp_var_new_integer(Var->name, Var->name_length,
(snint) StoreEntry::inUseCount(),
(snint) storestats.swap.count,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you try using a c++ static_cast instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This style of cast is used extensively throughout the file. I don't want to go against the style convention within this file.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • the type used for this cast is itself wrong1
  • lots of other nearby snmp_var_new_integer() callers have similar casting bugs/problems
  • no simple cast can supply a reasonable value -- more code is needed to handle SNMP Gauge32 limitations
  • official code this PR is replacing had similar casting bugs/problems

Given the above facts, I think this PR can leave this bad cast "as is" despite our "no C casts in new/changed code" preference/policy. Said that, I am not going to object to fixing this cast in this PR if @kinkie insists on that fix. However, I would then insist on PR code supplying the maximum Gauge32 value instead of overflowing that counter, with a level-2 (or even level-1) error logged to cache.log. @kinkie, do you insist?

Footnotes

  1. snmp_var_new_integer() does not take snint (i.e. int64_t) value; it takes an int value instead. The supplied storestats.swap.count value type is ... double. The old inUseCount() return value type is size_t.

@cvuosalo
Copy link
Author

cvuosalo commented Apr 17, 2025

As @rousskov said, the change in this draft PR doesn't provide the correct count of cache objects. I am trying to understand what code changes would be required to achieve the correction, but that will take some time.

LATER EDIT:
Further extensive testing actually shows this PR is correct and does provide the correct count of cache objects. I will elaborate in another comment.

@rousskov
Copy link
Contributor

the change in this draft PR doesn't provide the correct count of cache objects. I am trying to understand what code changes would be required to achieve the correction, but that will take some time.

Fortunately, Squid already has SNMP response aggregation framework that, according to its description, can do what you want in principle. For starting points, see Snmp::Inquirer and Snmp::Pdu::aggregate(). I am not an SNMP expert and do not remember much about that Squid SNMP code. We need to figure out why SNMP stats aggregation code is not triggered for the "number of cached entries" object and adjust the code accordingly.

N.B. I assume that the relevant SNMP stats aggregation code is not triggered today for the "number of cached entries" object.

@cvuosalo
Copy link
Author

cvuosalo commented May 5, 2025

I have tested this PR extensively and used snmpwalk to monitor the value of cacheNumObjCount on both test systems and production systems processing heavy traffic. I have configured debug_options and tracked the internal incrementing of the object count. My tests and tracing have shown this PR is correct. In Squid 6, the SNMP response aggregation framework is in place and working correctly to aggregate the Rock cache statistics. All the features @rousskov mentions above as being necessary have been implemented in the Squid 6 code and are working correctly, as far as I have seen in my review of the code and from my tests. I would like to remove this PR from draft status and propose it for inclusion in Squid 6.

@rousskov rousskov marked this pull request as ready for review May 6, 2025 15:14
@squid-anubis squid-anubis added M-failed-description https://github.com/measurement-factory/anubis#pull-request-labels and removed M-failed-description https://github.com/measurement-factory/anubis#pull-request-labels labels May 6, 2025
@rousskov rousskov changed the title Correct SNMP counter for number of cache objects Fix SNMP cacheNumObjCount -- number of disk cached objects May 6, 2025
Copy link
Contributor

@rousskov rousskov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My tests and tracing have shown this PR is correct.

Thank you for running those tests! I now also believe that this PR handles SMP stats aggregation correctly. This PR should be merged as long as cacheNumObjCount should be limited to on-disk objects (i.e. should exclude memory-cached objects). I left a change request dedicated to addressing that question/concern. That request is the only reason reason I am not approving this PR now.

I polished PR title/description (i.e. future commit message) and polished the code a bit. @cvuosalo, please check and adjust further as needed, keeping Squid Project commit message requirements in mind.

I would like to remove this PR from draft status

Done. FWIW, I hope you can change that status yourself, as you see fit.

and propose it for inclusion in Squid 6.

This PR targets master/v8, as it should. Squid v6 and v7 inclusion may happen after this PR is merged into master. Those decisions are up to v6 and v7 maintainers.

Answer = snmp_var_new_integer(Var->name, Var->name_length,
(snint) StoreEntry::inUseCount(),
(snint) stats.swap.count,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

StoreInfoStats::swap.count is the number of objects currently cached on disk. It does not include objects cached in memory.

This PR may not need further code changes, but I suspect that we do want to included memory-cached objects. At the very least, the old StoreEntry-based code included them (for non-SMP memory caches) since 1998 commit 571c3b8 (at least).

Should cacheNumObjCount cover both memory and disk caches?

Copy link
Contributor

@rousskov rousskov May 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rousskov I don't have an opinion on including memory-cached objects in the count nor about the casts. I would defer to the Squid experts on these questions. Just let me know what additional changes to make, if any.

In my opinion, the correct answer is "yes, this PR should include the number of memory-cached objects".

I have not checked all the details and tested nothing, but I believe that information should be readily available to existing PR code via stats.mem.count field. For SMP cases, this information is already reported on the "Hot Object Cache Items" line inside "Internal Data Structures" section of mgr:info cache manager report (and also on the "Current entries" line inside "Shared Memory Cache" section of kid-specific mgr:storedir cache manager report).

N.B. AFAICT, code using stats.mem.count for aggregated stats reporting suffers from the same bug we fixed for rock disk caches in 2011 commit 39c1e1d. If that is indeed the case, then tests with 2 workers should show 2*M memory cache entries on the "Hot Object Cache Items" line mentioned above, where M is the real/actual number of memory cache entries. Fixing that existing bug is outside this PR scope IMO. This PR should proceed with using StoreInfoStats::mem regardless of that bug existence but, ideally, disclose the problem by adding an source code comment with an XXX marker. I can help with that.

Going forward, I recommend the following steps:

  1. Wait at least three days for others to chime in.
  2. If nobody stops you, add stats.mem.count, retest, adjust PR title/description accordingly, make sure all CI tests are green (except "PR approval" one), and request my re-review using GitHub "Reviewers" dialog in the top right corner of the PR "Conversation" tab.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did a test and verified that with rock cache, stats.mem.count equals the value of "Hot Object Cache Items" times the number of workers, as suspected.

Answer = snmp_var_new_integer(Var->name, Var->name_length,
(snint) StoreEntry::inUseCount(),
(snint) storestats.swap.count,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • the type used for this cast is itself wrong1
  • lots of other nearby snmp_var_new_integer() callers have similar casting bugs/problems
  • no simple cast can supply a reasonable value -- more code is needed to handle SNMP Gauge32 limitations
  • official code this PR is replacing had similar casting bugs/problems

Given the above facts, I think this PR can leave this bad cast "as is" despite our "no C casts in new/changed code" preference/policy. Said that, I am not going to object to fixing this cast in this PR if @kinkie insists on that fix. However, I would then insist on PR code supplying the maximum Gauge32 value instead of overflowing that counter, with a level-2 (or even level-1) error logged to cache.log. @kinkie, do you insist?

Footnotes

  1. snmp_var_new_integer() does not take snint (i.e. int64_t) value; it takes an int value instead. The supplied storestats.swap.count value type is ... double. The old inUseCount() return value type is size_t.

@rousskov rousskov added the S-waiting-for-author author action is expected (and usually required) label May 6, 2025
@cvuosalo
Copy link
Author

cvuosalo commented May 6, 2025

@rousskov I don't have an opinion on including memory-cached objects in the count nor about the casts. I would defer to the Squid experts on these questions. Just let me know what additional changes to make, if any. Is there anything else in the PR that needs revision?

@rousskov
Copy link
Contributor

rousskov commented May 6, 2025

Just let me know what additional changes to make, if any. Is there anything else in the PR that needs revision?

IMO, no changes are needed except "add memory-cached objects counter" changes tracked in #2053 (comment). Please see that discussion for specific recommendations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-for-author author action is expected (and usually required)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants