Fix SNMP cacheNumObjCount -- number of cached objects #2053
This looks reasonable. Have you tested with multiple workers and rock caches to ensure the number is correct when there are multiple cache processes?
It cannot be correct AFAICT because PR code does not communicate with disker processes that currently manage rock cache_dir stats. See Rock::SwapDir::doReportStat() and commit 39c1e1d for more details. If we keep the current official code architecture, then this PR would have to asynchronously aggregate information across diskers like mgr:info currently does. Doing that well (e.g., without code duplication) may be difficult, but I have not checked any details. Regardless of the design specifics, we should strive to keep cache manager stats and SNMP stats in sync: Both code areas should use the same mechanisms for obtaining statistics and just format/report them differently. Unfortunately, achieving that ideal requires a lot of work.
src/snmp_agent.cc
    Answer = snmp_var_new_integer(Var->name, Var->name_length,
-                                  (snint) StoreEntry::inUseCount(),
+                                  (snint) storestats.swap.count,
Could you try using a C++ static_cast instead?
This style of cast is used extensively throughout the file. I don't want to go against the style convention within this file.
- the type used for this cast is itself wrong[^1]
- lots of other nearby snmp_var_new_integer() callers have similar casting bugs/problems
- no simple cast can supply a reasonable value -- more code is needed to handle SNMP Gauge32 limitations
- official code this PR is replacing had similar casting bugs/problems

Given the above facts, I think this PR can leave this bad cast "as is" despite our "no C casts in new/changed code" preference/policy. That said, I am not going to object to fixing this cast in this PR if @kinkie insists on that fix. However, I would then insist on PR code supplying the maximum Gauge32 value instead of overflowing that counter, with a level-2 (or even level-1) error logged to cache.log. @kinkie, do you insist?

[^1]: snmp_var_new_integer() does not take an `snint` (i.e. `int64_t`) value; it takes an `int` value instead. The supplied storestats.swap.count value type is ... `double`. The old inUseCount() return value type is `size_t`.
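To illustrate the "supply the maximum Gauge32 value instead of overflowing" request above, here is a minimal sketch. `toGauge32()` and `Gauge32Result` are hypothetical names, not existing Squid code. The sketch assumes the statistic arrives as a `double` and that Gauge32 tops out at 2^32-1; how the result would then be handed to snmp_var_new_integer(), which takes an `int`, is left open.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (not part of Squid): result of converting a
// floating-point statistic to a value that fits an SNMP Gauge32.
struct Gauge32Result {
    uint32_t value; // value to report
    bool clamped;   // true if the input did not fit and was adjusted
};

// Converts a double statistic to a Gauge32-compatible value, pinning
// out-of-range inputs to the nearest limit instead of letting them wrap.
inline Gauge32Result
toGauge32(const double stat)
{
    constexpr auto maxGauge = std::numeric_limits<uint32_t>::max();
    if (std::isnan(stat) || stat < 0)
        return {0, true};
    if (stat >= static_cast<double>(maxGauge))
        return {maxGauge, stat > static_cast<double>(maxGauge)};
    return {static_cast<uint32_t>(stat), false}; // drops the fractional part
}
```

A caller could check `clamped` and log the level-1 or level-2 cache.log error mentioned above before reporting `value`.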
As @rousskov said, the change in this draft PR doesn't provide the correct count of cache objects. I am trying to understand what code changes would be required to achieve the correction, but that will take some time. LATER EDIT:
Fortunately, Squid already has an SNMP response aggregation framework that, according to its description, can do what you want in principle. For starting points, see Snmp::Inquirer and Snmp::Pdu::aggregate(). I am not an SNMP expert and do not remember much about that Squid SNMP code. We need to figure out why SNMP stats aggregation code is not triggered for the "number of cached entries" object and adjust the code accordingly. N.B. I assume that the relevant SNMP stats aggregation code is not triggered today for the "number of cached entries" object.
I have tested this PR extensively and used
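The aggregation idea discussed above can be sketched roughly as follows. This is not the Snmp::Inquirer or Snmp::Pdu::aggregate() implementation; `KidAnswer` and `aggregateObjectCount()` are hypothetical names used only to show that each kid process answers for its own objects and a coordinator-side step sums the simple counters.

```cpp
#include <numeric>
#include <vector>

// Hypothetical, simplified stand-in for one kid's answer to an SNMP query.
struct KidAnswer {
    int kidId;          // which kid process produced this answer
    double objectCount; // objects indexed by this kid only
};

// Sums per-kid answers the way a coordinator-side aggregator might.
inline double
aggregateObjectCount(const std::vector<KidAnswer> &answers)
{
    return std::accumulate(answers.begin(), answers.end(), 0.0,
        [](const double sum, const KidAnswer &a) { return sum + a.objectCount; });
}
```

With this shape, a worker that indexes no rock objects simply answers zero, and a disker's answer supplies the on-disk count.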
My tests and tracing have shown this PR is correct.
Thank you for running those tests! I now also believe that this PR handles SMP stats aggregation correctly. This PR should be merged as long as cacheNumObjCount should be limited to on-disk objects (i.e. should exclude memory-cached objects). I left a change request dedicated to addressing that question/concern. That request is the only reason I am not approving this PR now.
I polished PR title/description (i.e. future commit message) and polished the code a bit. @cvuosalo, please check and adjust further as needed, keeping Squid Project commit message requirements in mind.
> I would like to remove this PR from draft status

Done. FWIW, I hope you can change that status yourself, as you see fit.

> and propose it for inclusion in Squid 6.

This PR targets master/v8, as it should. Squid v6 and v7 inclusion may happen after this PR is merged into master. Those decisions are up to v6 and v7 maintainers.
@rousskov I don't have an opinion on including memory-cached objects in the count or on the casts. I would defer to the Squid experts on these questions. Just let me know what additional changes to make, if any. Is there anything else in the PR that needs revision?
IMO, no changes are needed except "add memory-cached objects counter" changes tracked in #2053 (comment). Please see that discussion for specific recommendations.
Thank you for advancing this PR. We are making progress, but more work is needed.
Merge branch 'fix-cache-object-count' of github.com:cvuosalo/squid into fix-cache-object-count
@rousskov After more testing and running with debugging statements, I understand better how this latest version of the PR is working. I think it will work fine for our purposes. I approve.
SNMP counter cacheNumObjCount used StoreEntry::inUseCount() stats. For Squid instances using rock cache_dirs or a shared memory cache, the number of StoreEntry objects in use is usually very different from the number of cached objects because these caches do not use StoreEntry objects as a part of their index. For all instances, inUseCount() also includes ongoing transactions and internal tasks that are not related to cached objects at all.

We now use the sum of the counters already reported on "on-disk objects" and "Hot Object Cache Items" lines in "Internal Data Structures" section of `mgr:info` cache manager report. Due to floating-point arithmetic, these stats are approximate, but it is best to keep SNMP and cache manager reports consistent.

This change does not fix SNMP Gauge32 overflow bug: Caches with 2^32 or more objects continue to report wrong/smaller cacheNumObjCount values.

### On MemStore::getStats() and StoreInfoStats changes

To include the number of memory-cached objects while supporting SMP configurations with shared memory caches, we had to change how cache manager code aggregates StoreInfoStats::mem data collected from SMP worker processes. Before these changes, `StoreInfoStats::operator +=()` used a mem.shared data member to trigger special aggregation code hack, but

* SNMP-specific code cannot benefit from that StoreInfoStats aggregation because SNMP code exchanges simple counters rather than StoreInfoStats objects. `StoreInfoStats::operator +=()` is never called by SNMP code. Instead, SNMP uses Snmp::Pdu::aggregate() and friends.
* We could not accommodate SNMP by simply adding special aggregation hacks directly to MemStore::getStats() because that would break critical "all workers report about the same stats" expectations of the special hack in `StoreInfoStats::operator +=()`.

To make both SNMP and cache manager use cases work, we removed the hack from StoreInfoStats::operator +=() and hacked MemStore::getStats() instead, making the first worker responsible for shared memory cache stats reporting (unlike SMP rock diskers, there is no single kid process dedicated to managing a shared memory cache). StoreInfoStats operator now uses natural aggregation logic without hacks.

TODO: After these changes, StoreInfoStats::mem.shared becomes essentially unused because it was only used to enable special aggregation hack in StoreInfoStats that no longer exists. Remove?
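The "first worker reports shared memory cache stats" approach described above can be modeled with a small sketch. All names here (`KidStats`, `reportStats()`, `cachedObjectCount()`) are hypothetical and do not match Squid's classes; the sketch also assumes that workers are numbered from 1. It only shows why plain summation stops double-counting once a single worker reports the shared memory cache.

```cpp
#include <vector>

// Hypothetical, simplified model of per-kid stats; not Squid's StoreInfoStats.
struct KidStats {
    double diskObjects = 0; // models the "on-disk objects" counter
    double memObjects = 0;  // models the "Hot Object Cache Items" counter

    // Natural aggregation: plain member-wise sums, no special cases.
    KidStats &operator+=(const KidStats &other) {
        diskObjects += other.diskObjects;
        memObjects += other.memObjects;
        return *this;
    }
};

// Builds one worker's report. Only the first worker includes the shared
// memory cache size, so summing all reports counts that cache exactly once.
inline KidStats
reportStats(const int workerIndex, const double sharedMemObjects, const double localDiskObjects)
{
    KidStats stats;
    stats.diskObjects = localDiskObjects;
    if (workerIndex == 1) // assumption: workers are numbered from 1
        stats.memObjects = sharedMemObjects;
    return stats;
}

// Sums all reports and returns the combined number of cached objects.
inline double
cachedObjectCount(const std::vector<KidStats> &kids)
{
    KidStats total;
    for (const auto &kid : kids)
        total += kid;
    return total.diskObjects + total.memObjects;
}
```

For example, three workers sharing a 100-object memory cache plus one disker holding 500 objects would sum to 600 rather than 800.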
@rousskov, please fix the conflicts so this can be merged.
What conflicts?
queued for backport to v7
queued for backport to v6