Conversation
|
FYI while doing some heavy stress testing I'm seeing some DB stall issues |
|
Investigating and the issue may be related to recent changes on the master branch |
|
Yup having reverted my test env to be based upon current master branch I still see the same issue... Digging further. |
rjschwei
left a comment
There was a problem hiding this comment.
Comments should start with uppercase to stay consistent throughout the code base
|
Determined that the issue was due to a recent upgrade of containerd.io on my system to 2.1.5, which has a very low default soft file limit, leading to problems when there were lots of active connections... |
Never mind I see the 2.25 label - so I added that... |
|
@paragjain0910 @felixsch @mssola I've updated the PR with the implementation of the optimized serialization of systems in the send_bulk_system_update() request payload. I identified relatively minimal changes to the existing problematic test case implementation to get it passing again. I also spotted and fixed a minor error in the update request handler that could wipe existing profile associations if profiles were provided to the update, but all were either unknown incomplete or invalid profiles. An additional test case has been added to cover this scenario. |
|
Found an issue when stress testing under heavy load, have identified a promising fix that I will try out tomorrow. |
|
@rtamalin It would be great if you could also include the testing steps for the endpoints. That would really help me in testing the scenarios. |
I use the github.com/rtamalin/rmt-client-tester tooling that I created in August for exploring the RMT performance when creating profiles with my initial investigative hacking. It is a simple Go based app framework that can be used to precreate a (large) set of system hardware info JSON blobs, currently with the system profiles included (though I plan to move them out to simplify/clean up the code this week, time permitting) and then I can repeatably run a tool that submits register, update and deregister requests based upon those generated hardware info blobs... I think the current README.md is reasonably up to date but will check today... |
|
FYI @paragjain0910 @rjschwei over the weekend I was able to validate with some heavy stress testing that my proposed solution to the DB deadlock issues I was seeing in the Profile ensure_profile_exists() routine (https://github.com/SUSE/rmt/pull/1388/files#diff-0414a45947b697e51d740c5ba32eb6c6445f4c795807250ce68106de5ff7e414R73 in the current state of things) works. This involves using the "LOCK IN SHARE MODE" directive to the lock associated with the where() query to ensure that shared rather than exclusive locking is being used when trying to lookup the Profile entry created by a racing request handler. However when investigating this I realized that the current PubCloud interception of the announce_system request handling is by passing the system profile processing that is implemented in the standard RMT workflow... I will look in addressing this today. |
Add an addition unique index spanning the system_id and profile_id fields in the system_profiles table to ensure that a given profile can only be associated with a given system once.
We should consider profiles that have an empty identifier value as invalid, so update the filter_profiles() method to check for and treat them as invalid.
Update the System.complete_profiles=() method to avoid deletion and recreation of linking records for profile associations that haven't changed.
Enhance the SystemSerializer to take an optional serialized_profiles set as an initialize() argument, defaulting a new empty set if not specified, and setup a serializer instance variable holding it. This serialized_profiles instance variable set tracks profile.id's and is used to determine if the serializer has previously serialized a specific profile or not, with the first serialization including the data field, and subsequent serializations dropping it. Update the send_bulk_system_update() request generation to setup a new serialized_profiles set for each batch of systems being processed ensuring that only the first occurrence of a given profile includes the data field. Update tests to exercise the new SystemSerializer initialization and optional system profiles data field inclusion, and verify that the expected profiles are serialized by send_bulk_system_update().
Improve the debug messages logged by the announce_system and update request handlers to report the profile types for problematic profiles identified. Additionally enhance the Profile.filter_profiles() method to return hashes with symbolized keys to simplify determining which incomplete profiles are unknown. Only update the profiles associated with a system if valid complete profiles were either provided or identified from incomplete profiles, and add a test to ensure that existing profile associations are not replaced if not valid complete profiles were provided in the update.
Under heavy stress test loading a subtle race was identified where the find_by() in the rescue block was failing to find the Profile record created by a racing request handler. By locking the table before calling find_by() we ensure that the query is run against the current state of the table, rather than a potentially stale read-only version that was restored as part of the rollback, which wouldn't include the yet to be created new record.
The recent ensure_profile_exists() changes used nested transactions to minimize the rollback cost of potential collisions, but doing so incurs a general overhead that may not be beneficial unless racing collisions occur very frequently. For now the nested transaction wrapping has been converted to a comment about it's potential use if the need arrises at a future point.
In ensure_profile_exists() use shared locking in the rescue block when querying for the profile created by a racing request handler to avoid triggering a deadlock.
The scc_proxy engine supports PubCloud specific handling of the
announce_system and other requests both in the case of BYOS and
PAYG instances, which needs to be updated appropriately handle
profiles:
* For BYOS clients, when proxying announce_system requests to
the SCC, any provided system_profiles should be included in
the request sent to the SCC, and if the response from the SCC
includes the X-System-Profiles-Action header then set the
equivalent header in the response generated by the RMT.
* For PAYG clients, check for the existence of system_profiles
in the request and handle them appropriately, similarly to
the standard RMT announce_system request handling.
To support this the SccProxy.announce_system_scc() has been changed
to return both the JSON parsed response body, and response headers,
allowing them to be checked by callers.
Additionally the announce_system handler in the scc_proxy engine has
been updated to avoid replacing the standard Rails response object
with the JSON parse response body returned by announce_system_scc().
Also refactored the standard RMT announce_system handling to move
system profile processing to a helper method which can be called
by the scc_proxy implementation as needed.
Enhance announce_base_product_hybrid() to include a system's profiles in the announce_system request that it sends to the SCC. Add a new as_payload() helper method to the Profile model that returns the payload representation of a Profile record. Update the SystemSerializer to leverage the as_payload() helper when serializing systems. Simplify a number of unit tests by leveraging the as_payload() helper and a new test to verify correct operation of the helper. Update scc_proxy unit tests to ensure that the inclusion of the system profiles by announce_base_product_hybrid() is exercised.
Co-authored-by: Thomas Schmidt <tschmidt@suse.de>
Add an explanatory comment as to why locking is used in the orphaned profile destroy handler. Add a log message in the rescue block for the case where a profile entry was already destroyed by a racing orphaned profile destroy handler.
Move the changelog entry for this work to the newly added 2.26 section.
Previously Parag identified that the Rails automatic system.profiles update management in the complete_profiles=() custom attribute handler was deleting and recreating entries in the linking system_profiles table even when the corresponding profiles weren't changing. I had implemented explicit management of the profiles updates to avoid this, and a corresponding test case to catch the problem. Tom recently questioned why we needed the explcit update management code, and retesting with the automatic Rails update management handling restored no longer sees the system_profiles entries changing. It appears that some other change that was made to the code base has had a side effect that fixed whatever was causing the Rails automatic update management to recreate the entries, so we no longer need to implement explicit management of the system.profiles update.
Have also updated associated design doc to call out why these choices differ from how things are implemented in the equivalent SCC profile support.
Remove TODO that is no longer relevant - destroy_async is not currently available in the RMT ecosystem, and not sure there is a benefit anyway.
Add comment summarising reason for using an after_commit.
Combine multiple class method definitions inside a block.
The recent rebase to pull in the most recent changes from the master branch resulted in rubocop failures in the scc_proxy engine changes: * class too long for the class Engine definition * method too long for the announce_system method in that class I have addressed the latter by defining an additional helper method for handling the X-System-Profiles-Action header processing. For the former I've chosen to just disable this class too long check for this class definition.
b40ae6c to
9ddefdc
Compare
|
Rebased again to pick up additional recently landed changes. |
Define two new tables:
Add support for handling any provided profiles to the handlers for the announce_system and update requests in the connect V3 API.
This support involves checking that the provided profiles are valid and, when provided for the first time, complete, with new profiles being added to the profiles table, and appropriate references being added or updated via the system_profiles table to associate systems with their corresponding profiles.
Similarly, when incomplete profiles are included in update requests, the corresponding complete profiles will be retrieved and used, with any missing profiles being dropped and considered problematic.
Additionally, when problematic profiles are detected, whether invalid or incomplete, they will be ignored and the X-System-Profiles-Action header will be included in the response, with a value of clear-cache.
In the System model a new custom attribute, complete_profiles, with an accompanying assignment method, complete_profiles=, is used to handle the assignment of complete profiles to be associated with a system record as part of record creation or update.
If racing requests attempt to create the same new profile, one will succeed and the others will rescue the ActiveRecord:RecordNotUnique exception and instead lookup the newly created profile record.
Add support for optimizing the content of send_bulk_system_update() request to only include the full profile, including both identifier and data fields, on the first occurrence of a profile within the list of serialized systems, with subsequent profile occurrences dropping the data field.
Update the test cases to validate correct operation of new models, and associated request handling changes.
Related Jira: TEL-265