-
Notifications
You must be signed in to change notification settings - Fork 552
Fix SNMP cacheNumObjCount -- number of disk cached objects #2053
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
This looks reasonable. Have you tested with multiple workers and rock caches to ensure the number is correct when there are multiple cache processes. |
It cannot be correct AFAICT because PR code does not communicate with disker processes that currently manage rock cache_dir stats. See Rock::SwapDir::doReportStat() and commit 39c1e1d for more details. If we keep the current official code architecture, then this PR would have to asynchronously aggregate information across diskers like mgr:info currently does. Doing that well (e.g., without code duplication) may be difficult, but I have not checked any details. Regardless of the design specifics, we should strive to keep cache manager stats and SNMP stats in sync: Both code areas should use the same mechanisms for obtaining statistics and just format/report it differently. Unfortunately, achieving that ideal requires a lot of work. |
src/snmp_agent.cc
Outdated
Answer = snmp_var_new_integer(Var->name, Var->name_length, | ||
(snint) StoreEntry::inUseCount(), | ||
(snint) storestats.swap.count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you try using a c++ static_cast instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This style of cast is used extensively throughout the file. I don't want to go against the style convention within this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- the type used for this cast is itself wrong1
- lots of other nearby snmp_var_new_integer() callers have similar casting bugs/problems
- no simple cast can supply a reasonable value -- more code is needed to handle SNMP Gauge32 limitations
- official code this PR is replacing had similar casting bugs/problems
Given the above facts, I think this PR can leave this bad cast "as is" despite our "no C casts in new/changed code" preference/policy. Said that, I am not going to object to fixing this cast in this PR if @kinkie insists on that fix. However, I would then insist on PR code supplying the maximum Gauge32 value instead of overflowing that counter, with a level-2 (or even level-1) error logged to cache.log. @kinkie, do you insist?
Footnotes
-
snmp_var_new_integer() does not take
snint
(i.e.int64_t
) value; it takes anint
value instead. The supplied storestats.swap.count value type is ...double
. The old inUseCount() return value type issize_t
. ↩
As @rousskov said, the change in this draft PR doesn't provide the correct count of cache objects. I am trying to understand what code changes would be required to achieve the correction, but that will take some time. LATER EDIT: |
Fortunately, Squid already has SNMP response aggregation framework that, according to its description, can do what you want in principle. For starting points, see Snmp::Inquirer and Snmp::Pdu::aggregate(). I am not an SNMP expert and do not remember much about that Squid SNMP code. We need to figure out why SNMP stats aggregation code is not triggered for the "number of cached entries" object and adjust the code accordingly. N.B. I assume that the relevant SNMP stats aggregation code is not triggered today for the "number of cached entries" object. |
I have tested this PR extensively and used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My tests and tracing have shown this PR is correct.
Thank you for running those tests! I now also believe that this PR handles SMP stats aggregation correctly. This PR should be merged as long as cacheNumObjCount should be limited to on-disk objects (i.e. should exclude memory-cached objects). I left a change request dedicated to addressing that question/concern. That request is the only reason reason I am not approving this PR now.
I polished PR title/description (i.e. future commit message) and polished the code a bit. @cvuosalo, please check and adjust further as needed, keeping Squid Project commit message requirements in mind.
I would like to remove this PR from draft status
Done. FWIW, I hope you can change that status yourself, as you see fit.
and propose it for inclusion in Squid 6.
This PR targets master/v8, as it should. Squid v6 and v7 inclusion may happen after this PR is merged into master. Those decisions are up to v6 and v7 maintainers.
Answer = snmp_var_new_integer(Var->name, Var->name_length, | ||
(snint) StoreEntry::inUseCount(), | ||
(snint) stats.swap.count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
StoreInfoStats::swap.count is the number of objects currently cached on disk. It does not include objects cached in memory.
This PR may not need further code changes, but I suspect that we do want to included memory-cached objects. At the very least, the old StoreEntry-based code included them (for non-SMP memory caches) since 1998 commit 571c3b8 (at least).
Should cacheNumObjCount cover both memory and disk caches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rousskov I don't have an opinion on including memory-cached objects in the count nor about the casts. I would defer to the Squid experts on these questions. Just let me know what additional changes to make, if any.
In my opinion, the correct answer is "yes, this PR should include the number of memory-cached objects".
I have not checked all the details and tested nothing, but I believe that information should be readily available to existing PR code via stats.mem.count
field. For SMP cases, this information is already reported on the "Hot Object Cache Items" line inside "Internal Data Structures" section of mgr:info
cache manager report (and also on the "Current entries" line inside "Shared Memory Cache" section of kid-specific mgr:storedir
cache manager report).
N.B. AFAICT, code using stats.mem.count for aggregated stats reporting suffers from the same bug we fixed for rock disk caches in 2011 commit 39c1e1d. If that is indeed the case, then tests with 2 workers should show 2*M memory cache entries on the "Hot Object Cache Items" line mentioned above, where M is the real/actual number of memory cache entries. Fixing that existing bug is outside this PR scope IMO. This PR should proceed with using StoreInfoStats::mem
regardless of that bug existence but, ideally, disclose the problem by adding an source code comment with an XXX marker. I can help with that.
Going forward, I recommend the following steps:
- Wait at least three days for others to chime in.
- If nobody stops you, add
stats.mem.count
, retest, adjust PR title/description accordingly, make sure all CI tests are green (except "PR approval" one), and request my re-review using GitHub "Reviewers" dialog in the top right corner of the PR "Conversation" tab.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a test and verified that with rock cache, stats.mem.count
equals the value of "Hot Object Cache Items" times the number of workers, as suspected.
src/snmp_agent.cc
Outdated
Answer = snmp_var_new_integer(Var->name, Var->name_length, | ||
(snint) StoreEntry::inUseCount(), | ||
(snint) storestats.swap.count, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- the type used for this cast is itself wrong1
- lots of other nearby snmp_var_new_integer() callers have similar casting bugs/problems
- no simple cast can supply a reasonable value -- more code is needed to handle SNMP Gauge32 limitations
- official code this PR is replacing had similar casting bugs/problems
Given the above facts, I think this PR can leave this bad cast "as is" despite our "no C casts in new/changed code" preference/policy. Said that, I am not going to object to fixing this cast in this PR if @kinkie insists on that fix. However, I would then insist on PR code supplying the maximum Gauge32 value instead of overflowing that counter, with a level-2 (or even level-1) error logged to cache.log. @kinkie, do you insist?
Footnotes
-
snmp_var_new_integer() does not take
snint
(i.e.int64_t
) value; it takes anint
value instead. The supplied storestats.swap.count value type is ...double
. The old inUseCount() return value type issize_t
. ↩
@rousskov I don't have an opinion on including memory-cached objects in the count nor about the casts. I would defer to the Squid experts on these questions. Just let me know what additional changes to make, if any. Is there anything else in the PR that needs revision? |
IMO, no changes are needed except "add memory-cached objects counter" changes tracked in #2053 (comment). Please see that discussion for specific recommendations. |
SNMP counter cacheNumObjCount used StoreEntry::inUseCount() stats. For
rock cache_dirs, the number of StoreEntry objects in use is usually very
different from the number of cached objects because these caches do not
create StoreEntry objects when indexing cached content.
We now use the same stats that are reported on "on-disk objects" line in
"Internal Data Structures" section of mgr:info cache manager report. Due
to floating-point arithmetic/conversions, these stats are approximate,
but it is best to keep SNMP and cache manager reports consistent.
This change does not fix SNMP Gauge32 overflow bug: Caches with 2^32 or
more objects continue to report wrong/smaller cacheNumObjCount values.