Fix SNMP cacheNumObjCount -- number of cached objects #2053

cvuosalo · 2025-04-10T16:42:35Z

SNMP counter cacheNumObjCount used StoreEntry::inUseCount() stats. For
Squid instances using rock cache_dirs or a shared memory cache, the
number of StoreEntry objects in use is usually very different from the
number of cached objects because these caches do not use StoreEntry
objects as a part of their index. For all instances, inUseCount() also
includes ongoing transactions and internal tasks that are not related to
cached objects at all.

We now use the sum of the counters already reported on the "on-disk
objects" and "Hot Object Cache Items" lines in the "Internal Data
Structures" section of the mgr:info cache manager report. Due to
floating-point arithmetic, these stats are approximate, but it is best
to keep SNMP and cache manager reports consistent.
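
As a rough illustration of the new computation (not the PR's actual code), the reported value can be pictured as a plain sum of two counters. In this sketch, `StoreStatsSketch` and `cachedObjectCount()` are hypothetical names; only the `swap.count` and `mem.count` member names come from this thread, and their mapping to the two mgr:info lines is assumed from the description above.

```cpp
// Hypothetical, simplified stand-in for the store statistics discussed here.
struct StoreStatsSketch {
    struct Part {
        double count = 0; // counts are doubles, hence "approximate"
    };
    Part swap; // assumed to match the "on-disk objects" line of mgr:info
    Part mem;  // assumed to match the "Hot Object Cache Items" line of mgr:info
};

// The value this sketch would report: disk-cached plus memory-cached objects.
inline double cachedObjectCount(const StoreStatsSketch &stats)
{
    return stats.swap.count + stats.mem.count;
}
```

Keeping the sum in one helper is only a convenience of the sketch; the point is that SNMP and the cache manager read the same two counters.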

This change does not fix the SNMP Gauge32 overflow bug: Caches with 2^32 or
more objects continue to report wrong/smaller cacheNumObjCount values.

On MemStore::getStats() and StoreInfoStats changes

To include the number of memory-cached objects while supporting SMP
configurations with shared memory caches, we had to change how cache
manager code aggregates StoreInfoStats::mem data collected from SMP
worker processes. Before these changes, StoreInfoStats::operator +=()
used a mem.shared data member to trigger special aggregation code hack,
but

* SNMP-specific code cannot benefit from that StoreInfoStats aggregation
  because SNMP code exchanges simple counters rather than StoreInfoStats
  objects. StoreInfoStats::operator +=() is never called by SNMP code.
  Instead, SNMP uses Snmp::Pdu::aggregate() and friends.
* We could not accommodate SNMP by simply adding special aggregation
  hacks directly to MemStore::getStats() because that would break
  critical "all workers report about the same stats" expectations of the
  special hack in StoreInfoStats::operator +=().

To make both SNMP and cache manager use cases work, we removed the hack
from StoreInfoStats::operator +=() and hacked MemStore::getStats()
instead, making the first worker responsible for shared memory cache
stats reporting (unlike SMP rock diskers, there is no single kid process
dedicated to managing a shared memory cache). StoreInfoStats::operator +=()
now uses natural aggregation logic without hacks.
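
The effect of that arrangement can be sketched with a toy model. Everything named here (`MemReportSketch`, `reportSharedMemCache()`, `aggregate()`, and the use of index 0 for "the first worker") is a hypothetical illustration under those assumptions, not Squid code: each worker sees the same shared memory cache, but only one of them reports its size, so plain summation yields the right total.

```cpp
#include <vector>

// Hypothetical per-worker report; the real code uses StoreInfoStats.
struct MemReportSketch {
    double count = 0; // memory-cached objects reported by one worker
};

// What one worker reports for a shared memory cache holding sharedCount
// objects. Only the first worker (index 0 in this sketch) reports the shared
// total; the others report zero so that a plain sum is not inflated.
inline MemReportSketch reportSharedMemCache(const int workerIndex, const double sharedCount)
{
    MemReportSketch report;
    if (workerIndex == 0)
        report.count = sharedCount;
    return report;
}

// "Natural" aggregation: add up whatever every worker reported.
inline double aggregate(const std::vector<MemReportSketch> &reports)
{
    double total = 0;
    for (const auto &report : reports)
        total += report.count;
    return total;
}
```

With four workers and 500 shared entries, this model sums to 500, whereas letting every worker report the shared total would sum to 2000.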

TODO: After these changes, StoreInfoStats::mem.shared becomes
essentially unused because it was only used to enable special
aggregation hack in StoreInfoStats that no longer exists. Remove?

yadij · 2025-04-11T11:04:42Z

This looks reasonable. Have you tested with multiple workers and rock caches to ensure the number is correct when there are multiple cache processes?

rousskov · 2025-04-11T14:34:42Z

> This looks reasonable. Have you tested with multiple workers and rock caches to ensure the number is correct when there are multiple cache processes?

It cannot be correct AFAICT because PR code does not communicate with disker processes that currently manage rock cache_dir stats. See Rock::SwapDir::doReportStat() and commit 39c1e1d for more details. If we keep the current official code architecture, then this PR would have to asynchronously aggregate information across diskers like mgr:info currently does. Doing that well (e.g., without code duplication) may be difficult, but I have not checked any details.

Regardless of the design specifics, we should strive to keep cache manager stats and SNMP stats in sync: Both code areas should use the same mechanisms for obtaining statistics and just format/report it differently. Unfortunately, achieving that ideal requires a lot of work.

kinkie · 2025-04-14T19:55:20Z

src/snmp_agent.cc

        Answer = snmp_var_new_integer(Var->name, Var->name_length,
-                                      (snint) StoreEntry::inUseCount(),
+                                      (snint) storestats.swap.count,


could you try using a c++ static_cast instead?

This style of cast is used extensively throughout the file. I don't want to go against the style convention within this file.


cvuosalo · 2025-04-17T22:27:46Z

As @rousskov said, the change in this draft PR doesn't provide the correct count of cache objects. I am trying to understand what code changes would be required to achieve the correction, but that will take some time.

LATER EDIT:
Further extensive testing actually shows this PR is correct and does provide the correct count of cache objects. I will elaborate in another comment.

rousskov · 2025-04-21T19:41:34Z

> the change in this draft PR doesn't provide the correct count of cache objects. I am trying to understand what code changes would be required to achieve the correction, but that will take some time.

Fortunately, Squid already has SNMP response aggregation framework that, according to its description, can do what you want in principle. For starting points, see Snmp::Inquirer and Snmp::Pdu::aggregate(). I am not an SNMP expert and do not remember much about that Squid SNMP code. We need to figure out why SNMP stats aggregation code is not triggered for the "number of cached entries" object and adjust the code accordingly.

N.B. I assume that the relevant SNMP stats aggregation code is not triggered today for the "number of cached entries" object.

cvuosalo · 2025-05-05T22:31:16Z

I have tested this PR extensively and used snmpwalk to monitor the value of cacheNumObjCount on both test systems and production systems processing heavy traffic. I have configured debug_options and tracked the internal incrementing of the object count. My tests and tracing have shown this PR is correct. In Squid 6, the SNMP response aggregation framework is in place and working correctly to aggregate the Rock cache statistics. All the features @rousskov mentions above as being necessary have been implemented in the Squid 6 code and are working correctly, as far as I have seen in my review of the code and from my tests. I would like to remove this PR from draft status and propose it for inclusion in Squid 6.

rousskov

> My tests and tracing have shown this PR is correct.

Thank you for running those tests! I now also believe that this PR handles SMP stats aggregation correctly. This PR should be merged as long as cacheNumObjCount should be limited to on-disk objects (i.e. should exclude memory-cached objects). I left a change request dedicated to addressing that question/concern. That request is the only reason I am not approving this PR now.

I polished PR title/description (i.e. future commit message) and polished the code a bit. @cvuosalo, please check and adjust further as needed, keeping Squid Project commit message requirements in mind.

> I would like to remove this PR from draft status

Done. FWIW, I hope you can change that status yourself, as you see fit.

> and propose it for inclusion in Squid 6.

This PR targets master/v8, as it should. Squid v6 and v7 inclusion may happen after this PR is merged into master. Those decisions are up to v6 and v7 maintainers.

src/snmp_agent.cc

CONTRIBUTORS

rousskov · 2025-05-06T16:13:54Z

src/snmp_agent.cc

        Answer = snmp_var_new_integer(Var->name, Var->name_length,
-                                      (snint) StoreEntry::inUseCount(),
+                                      (snint) storestats.swap.count,


* the type used for this cast is itself wrong¹
* lots of other nearby snmp_var_new_integer() callers have similar casting bugs/problems
* no simple cast can supply a reasonable value -- more code is needed to handle SNMP Gauge32 limitations
* official code this PR is replacing had similar casting bugs/problems

Given the above facts, I think this PR can leave this bad cast "as is" despite our "no C casts in new/changed code" preference/policy. Said that, I am not going to object to fixing this cast in this PR if @kinkie insists on that fix. However, I would then insist on PR code supplying the maximum Gauge32 value instead of overflowing that counter, with a level-2 (or even level-1) error logged to cache.log. @kinkie, do you insist?

Footnotes

1. snmp_var_new_integer() does not take snint (i.e. int64_t) value; it takes an int value instead. The supplied storestats.swap.count value type is ... double. The old inUseCount() return value type is size_t.
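
For illustration only, the "supply the maximum Gauge32 value instead of overflowing" idea could look roughly like the sketch below. `MaxGauge32` and `toGauge32()` are hypothetical names that do not exist in Squid, the logging step is left to the caller, and how the result would be handed to snmp_var_new_integer() (which takes an int) is outside the scope of this sketch.

```cpp
#include <cstdint>
#include <limits>

// Largest value an SNMP Gauge32 can carry (2^32 - 1).
constexpr std::uint32_t MaxGauge32 = std::numeric_limits<std::uint32_t>::max();

// Converts an approximate (double) object count to a Gauge32-compatible value,
// saturating at MaxGauge32 instead of overflowing. Sets `clamped` when the
// count exceeded the Gauge32 range, so that a caller could log an error.
inline std::uint32_t toGauge32(const double count, bool &clamped)
{
    clamped = false;
    if (!(count > 0)) // also handles NaN
        return 0;
    if (count >= static_cast<double>(MaxGauge32)) {
        clamped = count > static_cast<double>(MaxGauge32);
        return MaxGauge32;
    }
    return static_cast<std::uint32_t>(count); // safe: value is within range
}
```

The range checks come before the cast because converting an out-of-range double to an integer type is undefined behavior in C++.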

cvuosalo · 2025-05-06T20:21:30Z

@rousskov I don't have an opinion on including memory-cached objects in the count nor about the casts. I would defer to the Squid experts on these questions. Just let me know what additional changes to make, if any. Is there anything else in the PR that needs revision?

rousskov · 2025-05-06T21:18:15Z

> Just let me know what additional changes to make, if any. Is there anything else in the PR that needs revision?

IMO, no changes are needed except "add memory-cached objects counter" changes tracked in #2053 (comment). Please see that discussion for specific recommendations.

rousskov

Thank you for advancing this PR. We are making progress, but more work is needed.

src/snmp_agent.cc

Merge branch 'fix-cache-object-count' of github.com:cvuosalo/squid into fix-cache-object-count

cvuosalo · 2025-05-28T19:24:18Z

@rousskov After more testing and running with debugging statements, I understand better how this latest version of the PR is working. I think it will work fine for our purposes. I approve.
Thank you for your extensive help in completing this PR.


yadij · 2025-05-29T00:34:24Z

@rousskov, please fix the conflicts so this can be merged.

rousskov · 2025-05-29T01:40:56Z

> @rousskov, please fix the conflicts so this can be merged.

What conflicts?


squidadm · 2025-05-29T12:19:48Z

queued for backport to v7


squidadm · 2025-05-29T18:24:59Z

queued for backport to v6


Correct cache object count

f5802d2

cvuosalo marked this pull request as draft April 10, 2025 16:43

cvuosalo changed the title ~~Correct number of cache objects reported to monitoring~~ Correct SNMP counter for number of cache objects Apr 10, 2025

kinkie reviewed Apr 14, 2025


cvuosalo and others added 3 commits May 5, 2025 17:31

Merge branch 'master' into fix-cache-object-count

337b998

fixup: Simplify and avoid camelCase naming violation

ef89e72

Added primary author to CONTRIBUTORS

d469598

rousskov marked this pull request as ready for review May 6, 2025 15:14

squid-anubis added M-failed-description https://github.com/measurement-factory/anubis#pull-request-labels and removed M-failed-description https://github.com/measurement-factory/anubis#pull-request-labels labels May 6, 2025

rousskov changed the title ~~Correct SNMP counter for number of cache objects~~ Fix SNMP cacheNumObjCount -- number of disk cached objects May 6, 2025

rousskov requested changes May 6, 2025


rousskov added the S-waiting-for-author author action is expected (and usually required) label May 6, 2025

cvuosalo added 3 commits May 13, 2025 23:13

Preliminary fix to include mem.count

f9642bc

Fix merge

b7628f1

Merge branch 'master' into fix-cache-object-count

be664ff

squid-anubis added the M-failed-description https://github.com/measurement-factory/anubis#pull-request-labels label May 13, 2025

cvuosalo requested a review from rousskov May 13, 2025 21:33

rousskov requested changes May 14, 2025


cvuosalo added 2 commits May 16, 2025 22:22

Correct use of mem.count

6b9b221

Updating branch to latest version.

325c3d8

Merge branch 'fix-cache-object-count' of github.com:cvuosalo/squid into fix-cache-object-count

rousskov self-requested a review May 19, 2025 13:49

rousskov removed the S-waiting-for-author author action is expected (and usually required) label May 19, 2025

squid-anubis added M-abandoned-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels and removed M-passed-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels labels May 28, 2025

squid-anubis added M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels and removed M-abandoned-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels labels May 29, 2025

yadij approved these changes May 29, 2025


yadij added S-waiting-for-author author action is expected (and usually required) backport-to-v6 backport-to-v7 maintainer has approved these changes for v7 backporting labels May 29, 2025

squid-anubis added the M-waiting-staging-checks https://github.com/measurement-factory/anubis#pull-request-labels label May 29, 2025

rousskov removed the S-waiting-for-author author action is expected (and usually required) label May 29, 2025

squid-anubis closed this May 29, 2025

squidadm removed the backport-to-v7 maintainer has approved these changes for v7 backporting label May 29, 2025

squidadm removed the backport-to-v6 label May 29, 2025
